
Compact Textbooks in Mathematics

Gisbert Stoyan
Agnes Baran

Elementary Numerical Mathematics for Programmers and Engineers

Compact Textbooks in Mathematics
This textbook series presents concise introductions to current topics in mathematics and mainly addresses advanced undergraduates and master's students. The concept is to offer small books covering subject matter equivalent to 2- or 3-hour lectures or seminars which are also suitable for self-study. The books provide students and teachers with new perspectives and novel approaches. They feature examples and exercises to illustrate key concepts and applications of the theoretical contents. The series also includes textbooks specifically speaking to the needs of students from other disciplines such as physics, computer science, engineering, life sciences, and finance.

• compact: small books presenting the relevant knowledge
• learning made easy: examples and exercises illustrate the application of the contents
• useful for lecturers: each title can serve as basis and guideline for a 2-3 hour course/lecture/seminar

More information about this series at http://www.springer.com/series/11225
Gisbert Stoyan · Agnes Baran

Elementary Numerical Mathematics for Programmers and Engineers
Gisbert Stoyan
Faculty of Informatics
ELTE University
Budapest, Hungary
Agnes Baran
Faculty of Informatics
University of Debrecen
Debrecen, Hungary
Revised translation from the Hungarian language edition: Numerikus matematika mérnököknek és programozóknak by Gisbert Stoyan, © Typotex kft, 2007. All Rights Reserved.

MATLAB® is a registered trademark of The MathWorks, Inc.
ISSN 2296-4568 ISSN 2296-455X (electronic)
Compact Textbooks in Mathematics
ISBN 978-3-319-44659-2 ISBN 978-3-319-44660-8 (eBook)
DOI 10.1007/978-3-319-44660-8
Library of Congress Control Number: 2016958653
Mathematics Subject Classification (2010): 65D30, 65F15, 65F05
© Springer International Publishing AG 2007, 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.
Printed on acid-free paper
This book is published under the trade name Birkhäuser, www.birkhauser-science.com
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword to the English Edition
This elementary introduction to numerical mathematics arose from lectures and
exercises which were held by the two authors jointly at Debrecen University for
students of informatics. It contains material for one semester and was first issued
quite successfully in Hungary. The English translation is the joint work of the two
authors as are the MATLAB exercises and Chap. 10 which have been added for this
edition in English.
The first author has taught similar topics to students of mathematics, informatics,
physics and meteorology at ELTE University; also he has written (in Hungarian)
a three-volume textbook on numerical methods, embracing more material, from
rounding errors to methods for partial differential equations; the last two volumes
are now mostly used in PhD studies.
The present text in its Hungarian form was much more successful; due to its shortness and lower requirements on mathematical knowledge, more than 900 copies were sold in a few years.
We thank the two anonymous referees for their helpful remarks.
Budapest, Hungary Gisbert Stoyan
Debrecen, Hungary Agnes Baran
February 2015
Contents

1 Floating Point Arithmetic
  1.1 Integers
  1.2 Floating Point Numbers
  1.3 Floating Point Arithmetic, Rounding
  1.4 Accumulation of Errors
  1.5 Conclusions
  1.6 Exercises
2 Norms, Condition Numbers
  2.1 Norms and Their Elementary Properties
  2.2 The Induced Matrix Norm
    2.2.1 Definition and Properties
    2.2.2 Computation of the Induced Matrix Norm for the Vector p-Norms, p = 1, ∞
    2.2.3 Computation of the Induced Matrix Norm (p = 2)
  2.3 Error Estimations
    2.3.1 The Right-Hand Side of the Linear System is Perturbed
    2.3.2 Condition Number
    2.3.3 The Matrix of the Linear System is Perturbed
  2.4 Exercises
3 Solution of Systems of Linear Equations
  3.1 Gaussian Elimination
  3.2 When Can Gaussian Elimination be Performed?
  3.3 The LU Factorization
  3.4 Algorithms, Cost of Solving
  3.5 Influence of Rounding Errors
  3.6 LU Factorization for General Matrices
    3.6.1 Algorithm of the LDU Factorization, Test Examples
  3.7 Cholesky Factorization
    3.7.1 Algorithm of the LDL^T Factorization, Test Examples
  3.8 Band Matrices
    3.8.1 Tridiagonal Systems of Equations
    3.8.2 The Tridiagonal Algorithm, Test Examples
  3.9 Exercises
4 The Least Squares Problem
  4.1 Linear Regression
    4.1.1 Algebraic Description
    4.1.2 The Method of Least Squares
  4.2 Normal Equations
  4.3 Solution Algorithm, Test Examples
  4.4 Exercises
5 Eigenvalue Problems
  5.1 Fundamental Properties
    5.1.1 Normal Matrices
    5.1.2 The Characteristic Polynomial
    5.1.3 Localization of the Eigenvalues
  5.2 Power Iteration
    5.2.1 Conditions of Convergence
    5.2.2 The Rayleigh Quotient
    5.2.3 Algorithm of the Power Iteration, Test Examples
    5.2.4 The Shift
  5.3 The Inverse Iteration
    5.3.1 Conditions of Convergence
    5.3.2 Algorithm of the Inverse Iteration, Test Examples
  5.4 The QR Algorithm
  5.5 Exercises
6 Interpolation
  6.1 Interpolation Problems
  6.2 Lagrangian Interpolation
    6.2.1 Lagrange Interpolation Problem
    6.2.2 Newton's Divided Differences Method
    6.2.3 The Difference Scheme
    6.2.4 Algorithm of Lagrangian Interpolation, Test Examples
    6.2.5 Error Estimations
  6.3 Hermite Interpolation
  6.4 Piecewise Polynomial Interpolation
  6.5 Exercises
7 Nonlinear Equations and Systems
  7.1 Bisection Method, Fixed Point Iterations
  7.2 Newton's Method
    7.2.1 Damped Newton Method
    7.2.2 Secant Method
  7.3 Solution of Systems of Equations
    7.3.1 Newton's Method
    7.3.2 Algorithm of Damped Newton Method, Test Examples
    7.3.3 Approximation of the Jacobian Matrix
    7.3.4 Broyden's Method
  7.4 Gauss–Newton Method
    7.4.1 Description
    7.4.2 Algorithm of the Gauss–Newton Method, Test Examples
  7.5 Exercises
8 Numerical Integration
  8.1 Elementary Quadrature Formulae
  8.2 Interpolational Quadrature Formulae
  8.3 Composite Quadrature Rules
    8.3.1 Construction of Composite Formulae
    8.3.2 Convergence of Composite Formulae
  8.4 Practical Points of View
  8.5 Calculation of Multiple Integrals
    8.5.1 Reduction to Integration of Functions of One Variable
    8.5.2 Approximation of the Domain
    8.5.3 Algorithm of the Two-Dimensional Simpson Rule, Tests
  8.6 Exercises
9 Numerical Solution of Ordinary Differential Equations
  9.1 Motivation
  9.2 Initial Value Problems
  9.3 Euler's Method
    9.3.1 Algorithm of Euler's Method, Test Examples
  9.4 Error Analysis of Euler's Method
  9.5 The Modified Euler Method, Runge–Kutta Methods
  9.6 The Implicit Euler Method
    9.6.1 The Implicit Euler Method for Linear Systems
    9.6.2 Nonlinear Systems
    9.6.3 Algorithm of the Implicit Euler Method, Test Examples
  9.7 Exercises
10 Practical Error Estimation
Bibliography
Index
1 Floating Point Arithmetic

You know from secondary school:
If the base of a power is $a > 0$, then the following relations are valid:
$$a^b a^c = a^{b+c}, \qquad \frac{a^b}{a^c} = a^{b-c}.$$
If $q \ne 1$, then
$$1 + q + q^2 + \cdots + q^n = \sum_{i=0}^{n} q^i = \frac{1 - q^{n+1}}{1 - q}$$
holds; moreover, if $q = 1$, then $\sum_{i=0}^{n} q^i = n + 1$.
1.1 Integers
Integers are represented on a computer in the form of signed binary numbers. Often
2-, 4- and 8-byte integers are available where a byte possesses eight binary digits.
In many computers 4 bytes are the smallest available—addressable—unit of the
memory. It may turn out that we can work with one- and 16-byte integers, too.
The arithmetical operations performed with integers may be, and often are, faster than the operations performed with floating point numbers (which will be discussed in the following section), but this depends on the word length, the processor and the compiler. Such operations can be considered error-free; hence, using integers, a given algorithm can be faster on the computer. However, all steps of a calculation with integers have to be carefully planned, because actually in this case we work in "residue classes". As an example consider the addition of 1-byte integers:

      11010101          213
    + 10101010        + 170
    = 1 01111111      = 383 ≡ 127 (mod 256)

The processing unit cuts off—without any warning!—the leading 1 which cannot be placed in the frame (it would produce an overflow), and in this way the sum is replaced by its residue with respect to $2^8$.

Thus, we move to the residue class of $2^8$ and, e.g., in the case of two-byte numbers to the residue class of $2^{16}$, etc.
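A quick way to make this residue-class behaviour tangible is to imitate it in MATLAB. This is a sketch of ours, not from the book; note that MATLAB's own uint8 arithmetic saturates at 255 instead of wrapping around, so we apply mod explicitly:

    x = 213;                 % 11010101 in binary
    y = 170;                 % 10101010 in binary
    s = mod(x + y, 2^8)      % 383 is replaced by its residue: 127
    % For comparison: uint8(213) + uint8(170) returns 255 in MATLAB,
    % because MATLAB integer types saturate rather than wrap.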
1.2 Floating Point Numbers
This type of number is fundamental in the use of computers for numerical computations. The form of nonzero floating point numbers is
$$\pm a^k \left( \frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t} \right), \tag{1.1}$$
where $a > 1$ is the base of the representation of numbers, "+" or "−" is the sign, $t > 1$ is the number of digits, and $k$ is the exponent. The digit $m_1$ is normalized, that is, it satisfies the inequality
$$1 \le m_1 \le a - 1.$$
This ensures the uniqueness of the representation and the full exploitation of the available digits. For the remaining digits
$$0 \le m_i \le a - 1, \quad 2 \le i \le t$$
holds. The number zero is not normalized: in this case $k = 0$, $m_1 = m_2 = \cdots = m_t = 0$, and its sign is usually "+".

It often happens that the base $a$ of the representation of floating point numbers is not equal to 2. Instead of 2 it can be, e.g., 10 or 16. Some computers have a decimal CPU. However, usually the corresponding arithmetic is implemented by the programming language, and when, e.g., the main purpose is data processing, it can be more economical to use the decimal number system. We may view the computations as being done in the decimal number system and can count on about $t = 8$ using single precision and about $t = 16$ using double precision.

We can imagine the storage of floating point numbers in the following form (reality may differ from this, but here we do not describe those details of the technical implementation which do not affect our numerical calculations):
$$[\pm, k, m_1, \ldots, m_t],$$
where the vector $(m_1, \ldots, m_t) =: m$ is the mantissa (or significand). The exponent $k$ is also called the characteristic of the number. Depending on the machine and on the precision (single, double, quadruple precision) four, eight or 16 bytes are available to store $m$. Simultaneously, the range of $k$ increases but, depending on the given precision,
$$k^{(-)} \le k \le k^{(+)},$$
where $k^{(-)} < 0$, $k^{(+)} > 0$ and $|k^{(-)}| \approx |k^{(+)}|$ (see Exercises 1 and 2). Then the largest representable number is
$$M_1 := a^{k^{(+)}} \sum_{i=1}^{t} \frac{a-1}{a^i} = a^{k^{(+)}} \left( \frac{a-1}{a} + \frac{a-1}{a^2} + \cdots + \frac{a-1}{a^t} \right) = a^{k^{(+)}} (1 - a^{-t}),$$
while the smallest number is $-M_1$. The floating point numbers form a discrete subset of the rational numbers from $[-M_1, M_1]$, and this subset is symmetric about zero.

We denote the positive floating point number nearest to zero by $\varepsilon_0$,
$$\varepsilon_0 := a^{k^{(-)}} \left( \frac{1}{a} + 0 + \cdots + 0 \right) = a^{k^{(-)} - 1}.$$
Thus, besides zero there is no other (normalized) floating point number in the interval $(-\varepsilon_0, \varepsilon_0)$. (Here we disregard the fact that on a computer denormalized nonzero numbers can also be used, but to ensure the uniqueness of the representation only in the case $k = k^{(-)}$. Practically, this means that then the formula for $\varepsilon_0$ changes to $a^{k^{(-)} - t}$.) The interval $(-\varepsilon_0, \varepsilon_0)$ appears as a huge gap if we represent the floating point numbers on the real axis (see Fig. 1.1 and Exercise 3), since the positive floating point number nearest to $\varepsilon_0$ is
$$a^{k^{(-)}} \left( \frac{1}{a} + 0 + \cdots + 0 + \frac{1}{a^t} \right) = \varepsilon_0 + a^{k^{(-)} - t} = \varepsilon_0 (1 + a^{1-t}),$$
and
$$\varepsilon_0 \, a^{1-t} \ll \varepsilon_0.$$
The number 1 always belongs to the floating point numbers:
$$1 = [+, 1, 1, 0, \ldots, 0].$$

Fig. 1.1 Floating point numbers on the real axis

After the number 1 the next floating point number
$$[+, 1, 1, 0, \ldots, 0, 1] = 1 + \varepsilon_1$$
follows, where
$$\varepsilon_1 := a^{1-t}.$$
This number is often called the relative precision of the machine, or the machine epsilon (see Exercise 4). The importance of the numbers $\varepsilon_0$ and $\varepsilon_1$ is that in the case of the input and of the four basic operations they give an absolute and a relative error bound, respectively (see the next section).

If we take any floating point number satisfying
$$0 < x = [+, k, m] = a^k \left( \frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t} \right) < M_1,$$
then the nearest floating point number to $x$ larger than $x$ is $\bar{x} := x + a^{k-t}$:
$$\bar{x} = x + a^k \left( 0 + 0 + \cdots + 0 + \frac{1}{a^t} \right) = x + a^{k-t}.$$
Thus, the spacing between the two numbers is $\delta x := \bar{x} - x = a^{k-t}$. The smallest possible value of $x$ for the present characteristic $k$ is equal to $a^k \frac{1}{a} = a^{k-1}$, hence
$$\bar{x} - x = \delta x = a^{k-t} = a^{k-1+1-t} = a^{k-1} a^{1-t} = a^{k-1} \varepsilon_1 \le x \varepsilon_1. \tag{1.2}$$
If $x < 0$ and $\bar{x}$ is the left-hand side floating point neighbour of $x$ then, similarly,
$$|\bar{x} - x| \le |x| \varepsilon_1. \tag{1.3}$$
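In IEEE double precision (base $a = 2$, $t = 53$) the spacing formula (1.2) can be checked directly in MATLAB. A small sketch of ours, using the built-in function eps(x), which returns the distance from x to the next larger floating point number (the constant eps equals $2^{-52}$, the book's $\varepsilon_1 = a^{1-t}$ for these parameters):

    x  = 3.7;
    dx = eps(x)           % spacing between x and its right neighbour
    bd = x*eps            % the bound x*eps_1 from (1.2)
    % dx <= bd holds, in accordance with (1.2)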
1.3 Floating Point Arithmetic, Rounding
First consider the input.

Imagine a real number $x$ to be read in, $|x| \le M_1$. Then, most often, the floating point number $\tilde{x}$ assigned to $x$ by the machine is
$$\tilde{x} := \begin{cases} 0, & \text{if } |x| < \varepsilon_0, \\ \text{the floating point number nearest to } x, & \text{if } \varepsilon_0 \le |x| \le M_1. \end{cases}$$
Thus, in the first case the number is rounded towards zero (the gap around 0 appears as a "black hole"), while in the second case the number is rounded properly. Then (see relations (1.2) and (1.3))
$$|\tilde{x} - x| \le \begin{cases} \varepsilon_0, & \text{if } |x| < \varepsilon_0, \\ \frac{1}{2} \varepsilon_1 |x|, & \text{if } |x| \ge \varepsilon_0. \end{cases} \tag{1.4}$$
If the machine always rounds the numbers towards zero (since truncation of the non-significant digits is simpler than proper rounding), then in (1.4) $\varepsilon_1$ replaces $\frac{1}{2}\varepsilon_1$, and so $|\tilde{x} - x| < \varepsilon_1 |x|$ is always true. Disregarding the gap around zero we obtain a simplified version of (1.4), i.e.
$$|\tilde{x} - x| \le \frac{1}{2} \varepsilon_1 |x| \tag{1.5}$$
in the case of rounding, or
$$|\tilde{x} - x| \le \varepsilon_1 |x| \tag{1.6}$$
in the case of truncation.

The numbers $\frac{1}{2}\varepsilon_1$ and $\varepsilon_1$ arising here can be described as follows: they are the smallest positive floating point numbers $\varepsilon$ for which the result of the addition $1 + \varepsilon$ is greater than 1—after rounding and truncation, respectively (see Exercise 4).

Next, consider the floating point operations. As an example, first examine the addition. Take, e.g., $a = 10$, $t = 3$, $k^{(-)} = -3$, $k^{(+)} = 3$ (then $\varepsilon_1 = 10^{-2}$) and $x = 3.45$, $y = 0.567$ as the two numbers to be added. As the first step of the addition the characteristics of the two numbers ($k_1$ and $k_2$) have to be made equal, and the mantissa of the smaller number has to be shifted right by $k_1 - k_2$ digits:

      [+, 1, 345]      10 · 0.345   (= x)
    + [+, 0, 567]    +  1 · 0.567   (= y)

      [+, 1, 345]      10 · 0.345
    + [+, 1, 0567]   + 10 · 0.0567 = 4.017 = x + y =: z

Here immediately the question arises whether the CPU uses $t$ digits to place the mantissa.

In the case of addition at least one more digit (a so-called guard digit) is necessary to have an acceptable error. (In the case of multiplication it is best if there are $2t$ digits available to place the product of the mantissas, because then the product can be placed without loss.)

If we have this guard digit, then the addition from the above example can be continued in the following way:

      [+, 1, 345]
    + [+, 1, 0567]
    = [+, 1, 4017]
Up to this point the result is error-free, but usually one has to put it back into the storage, where in our example there are only three digits to place the mantissa. Then the last digit has to be truncated or rounded. If the latter was chosen when the CPU (or the programming language which implements the arithmetic) was planned, then the result of the computation, i.e. $\tilde{z} = \widetilde{x + y}$, is
$$[+, 1, 402].$$
Hence, the result is loaded with error, and for the error the following is valid—in accordance with (1.6), writing $z, \tilde{z}$ instead of $x, \tilde{x}$:
$$|z - \tilde{z}| = 0.003 = \varepsilon_1 \cdot 0.3 = \frac{1}{2} \varepsilon_1 \cdot 0.6 < \frac{1}{2} \varepsilon_1 \cdot |z| \; (= 0.020085). \tag{1.7}$$

Let us examine one more example of addition (we take $a = 10$, $t = 3$, $k^{(-)} = -3$, $k^{(+)} = 3$ as before and further assume that in the CPU one guard digit is available): $x = 0.987$ and $y = 0.567$.

Now, there is no need to shift the numbers before the addition:

      [+, 0, 987]        0.987   (= x)
    + [+, 0, 567]      + 0.567   (= y)
    = [+, 0, 1554]     = 1.554   (= z)
    ≈ [+, 1, 155]      ≈ 10 · 0.155 =: z̃

Here, in the last step the result had to be normalized and rounded, and after these steps the result (once more in accordance with (1.6)) can be put into the storage.
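The two decimal examples above can be imitated in MATLAB with the built-in rounding to a given number of significant digits; a small sketch of ours, assuming nothing beyond standard MATLAB:

    x = 3.45;  y = 0.567;
    z  = x + y;                        % exact sum 4.017 (the guard digit case)
    zt = round(z, 3, 'significant')    % stored with a 3-digit mantissa: 4.02
    err = abs(z - zt)                  % 0.003, as in (1.7)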
Now, examine the rounding errors occurring during the basic operations in general. According to the example above, we assume that first the machine performs exactly any of the four basic operations ($+, -, \times, /$, denoting them uniformly by $\circ$), and after that it assigns to the result a floating point number using truncation or rounding. Then in the case of rounding, according to (1.4),
$$|\widetilde{x \circ y} - x \circ y| \le \begin{cases} \varepsilon_0, & \text{if } |x \circ y| < \varepsilon_0, \\ \frac{1}{2} \varepsilon_1 |x \circ y|, & \text{if } |x \circ y| \ge \varepsilon_0, \end{cases} \tag{1.8}$$
or, in a simplified version, omitting the case corresponding to $\varepsilon_0$,
$$|\widetilde{x \circ y} - x \circ y| \le \varepsilon_1 |x \circ y| \cdot \begin{cases} 1, & \text{in the case of truncation,} \\ \frac{1}{2}, & \text{in the case of rounding.} \end{cases} \tag{1.9}$$
This inequality can be described as
$$\widetilde{x \circ y} = (x \circ y)(1 + \varepsilon_\circ), \quad |\varepsilon_\circ| \le \varepsilon_1 \cdot \begin{cases} 1, & \text{truncation,} \\ \frac{1}{2}, & \text{rounding.} \end{cases} \tag{1.10}$$
Indeed, relation (1.10) can be obtained from (1.9) in the following way (considering the case of truncation): inequality (1.9) means that
$$-\varepsilon_1 |x \circ y| \le \widetilde{x \circ y} - x \circ y \le \varepsilon_1 |x \circ y|.$$
Therefore, the expression $\widetilde{x \circ y} - x \circ y$ can take any value between $-\varepsilon_1 |x \circ y|$ and $\varepsilon_1 |x \circ y|$. This can be written as
$$\widetilde{x \circ y} - x \circ y =: \varepsilon_1 |x \circ y| s, \quad \text{where } -1 \le s \le 1.$$
Thus,
$$\widetilde{x \circ y} = x \circ y + \varepsilon_1 |x \circ y| s = (x \circ y)(1 + \mathrm{sign}(x \circ y)\, \varepsilon_1 s),$$
and here $|\mathrm{sign}(x \circ y)\, \varepsilon_1 s| \le \varepsilon_1$. Now, we have only to substitute $\varepsilon_\circ$ for $\mathrm{sign}(x \circ y)\, \varepsilon_1 s$.

Our example concerning the addition illustrates the estimation (1.10) for $\circ = +$:
$$\varepsilon_\circ = \varepsilon_+ = \frac{\widetilde{x + y} - (x + y)}{x + y} = \frac{0.003}{4.017} \approx 0.000747 < \frac{1}{2} \varepsilon_1 = 0.005.$$

The estimation (1.10) is not valid in the gap around zero; moreover, it is also not valid in the case when the result of the operation is larger in absolute value than $M_1$. This would mean overflow, which is a serious error in floating point calculation, and—depending on the programming language and on the operating system—after one or more errors of this kind the machine terminates its work (more precisely, it continues with the next program).

If the result of an operation is not equal to zero but lies in the interval $(-\varepsilon_0, \varepsilon_0)$, then according to the above reasoning underflow occurs; the machine considers the result to be zero, usually without an error message. In this case important information may be annihilated—the last sign of life of some phenomenon.
Using floating point subtraction, valuable digits often disappear. This means the following: as earlier, consider the case $a = 10$, $t = 3$ and, e.g., $x = 1.234$, $y = 1.233$. Then there is no need to shift the digits, and in the second step of the subtraction we get $x - y = 0.001$, without error. After this the number has to be normalized; the result is $10^{-2} \left( \frac{1}{10} + 0 + \cdots + 0 \right)$, without error. However, there is a problem: the numbers $x, y$ had four digits, which resulted, e.g., from an earlier calculation, and after the subtraction only one digit remains. From the point of view of the floating point calculation, we cannot take any responsibility for the accuracy of the zeros after the leading digit 1.

The other dangerous phenomenon of floating point arithmetic is the possible cancellation by subtraction.

This means the following. The machine does not correspond to the simple model above in which the CPU first produces the exact result. In this case there is no guard digit, and before the subtraction of two floating point numbers $x, y$ with different exponents ($k_1 \ne k_2$) the digits are not only shifted by $k_1 - k_2$ digits (compare with the above example for addition), but the digits moving out of the frame on the right-hand side are immediately truncated. For example,
$$x = [+, k, 1, 0, \ldots, 0] = a^{k-1},$$
$$y = [+, k-1, a-1, \ldots, a-1] = a^k \left( \frac{1}{a} - \frac{1}{a^{t+1}} \right).$$
After shifting and truncating the last digit, which has the value $a^k \frac{a-1}{a^{t+1}}$, the machine continues working with the number $\tilde{y} = a^k \left( \frac{1}{a} - \frac{1}{a^t} \right)$ instead of $y$, and $x - \tilde{y} = a^{k-t} =: \tilde{z}$ appears as the result of the machine subtraction. However, the exact result is $z := a^{k-t-1}$. (See Exercise 6.)

This example calls for the distinction of two errors. If $\tilde{z}$ is the result of a floating point calculation and $z$ the exact result (in our example $z = a^{k-t-1}$ and $\tilde{z} = a^{k-t}$), then $|\tilde{z} - z|$ is called the absolute error of the calculation. If $z \ne 0$, then the value $|(\tilde{z} - z)/z|$ is the relative error. Applying this to our example, the absolute error of the subtraction performed under the above circumstances is $|\tilde{z} - z| = a^{k-t} - a^{k-t-1} = a^{k-t-1}(a-1)$, while the relative error is
$$a^{k-t-1}(a-1) / a^{k-t-1} = a - 1 = \frac{a-1}{a^{1-t}} \varepsilon_1 \approx a^t \varepsilon_1 \gg \varepsilon_1,$$
which is considerably larger than $\varepsilon_1 = a^{1-t}$, nearly $a^t$ times larger. This relative error is smallest when $a = 2$, but even then it is not acceptable, since if the relative error can be equal to 1, then the result can be confused with its error. This is the possible fatal loss of significant digits on a (badly planned) computer.

To avoid this phenomenon of machine subtraction we need at least one guard digit in the CPU, on the $(t+1)$th place, which, however, is missing in several machines. By subtraction, significant digits can also disappear when a guard digit is available and the exponents of the two numbers are equal; in this case not in a fatal way, but according to (1.8).
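Even with guard digits (as in IEEE arithmetic), cancellation destroys the relative accuracy of quantities that already carry rounding errors. The following MATLAB sketch is our illustration, not the book's: it computes $(1 - \cos x)/x^2$, whose true value is close to $1/2$ for small $x$; the subtraction $1 - \cos x$ cancels almost all significant digits, while the algebraically equivalent form $2(\sin(x/2)/x)^2$ avoids the subtraction entirely:

    x  = 1e-8;
    f1 = (1 - cos(x))/x^2       % ruined by cancellation (returns 0 here)
    f2 = 2*(sin(x/2)/x)^2       % same function, no cancellation: 0.5000...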
In practice, in connection with the evaluation of the accuracy of computational results, one works not only with absolute and relative errors. It is often practical to apply a ("mixed" absolute-relative) test like
$$|z - \tilde{z}| \le \varepsilon \left( 1 + \frac{1}{2} (|z| + |\tilde{z}|) \right), \tag{1.11}$$
where $\varepsilon$ is the given accuracy. If the inequality is not true, the result $\tilde{z}$ is rejected as too inaccurate; if it is true, then we accept the result. Using the above test, if $|z| + |\tilde{z}|$ is small, absolute accuracy is measured by $\varepsilon$, while if $|z| + |\tilde{z}|$ is large, relative accuracy is measured. The analogy with (1.4) and with (1.8) is striking. We advise you to apply such a test during programming.
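A minimal MATLAB sketch of the test (1.11); the function name accept_result is our choice, not the book's (saved, e.g., as accept_result.m):

    function ok = accept_result(z, ztilde, epsilon)
        % true if ztilde passes the mixed absolute-relative test (1.11)
        ok = abs(z - ztilde) <= epsilon*(1 + 0.5*(abs(z) + abs(ztilde)));
    end

For instance, accept_result(4.017, 4.02, 1e-2) accepts the rounded sum from the example leading to (1.7).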
The innocent-looking operation of rounding can lead to large errors even when a guard digit is present, as the examples in the following section show. However, rounding is unavoidable after almost all floating point arithmetical operations (but see Exercise 6), hence a possible solution is to replace the simple rounding described by (1.4) and (1.10) by one-sided roundings. After each arithmetical operation, when the result is not a floating point number, one produces two results by rounding up and rounding down. As a consequence the complete calculation changes: instead of numbers one always has to work with intervals. Hence, either the machines have to be reconstructed to perform the roundings always together with the actual arithmetical operation (i.e. instead of each basic operation one has two operations), or one has to build the interval arithmetic into the programming language. The latter was done, e.g., in the case of Pascal-SC.
1.4 Accumulation of Errors
How do errors accumulate when one performs many operations? Suppose we have to add $n + 1$ floating point numbers,
$$S_n := x_0 + x_1 + \cdots + x_n = \sum_{i=0}^{n} x_i,$$
that is, we have to perform not one, but $n$ additions. For each addition, the CPU first adds to the previous partial sum $S_{k-1}$ the following number $x_k$ without error: $S_k := S_{k-1} + x_k$, $k = 1, 2, \ldots$; then it applies, e.g., truncation. Here we disregard the fact that some partial or the final result can be in the interval $(-\varepsilon_0, \varepsilon_0)$, or that it can overflow.

Then, based on (1.10), the first partial sum obtained on the computer is
$$\widetilde{S}_1 := \widetilde{x_0 + x_1} = (x_0 + x_1)(1 + \varepsilon_{0,1}) = S_1 + \varepsilon_{0,1} S_1, \quad \text{where } |\varepsilon_{0,1}| \le \varepsilon_1.$$
Similarly:
$$\widetilde{S}_2 := \widetilde{\widetilde{S}_1 + x_2} = (\widetilde{S}_1 + x_2)(1 + \varepsilon_{1,2}) = (S_1 + \varepsilon_{0,1} S_1 + x_2)(1 + \varepsilon_{1,2}) = (S_2 + \varepsilon_{0,1} S_1)(1 + \varepsilon_{1,2}) = S_2 + \varepsilon_{0,1} S_1 + \varepsilon_{1,2} S_2 + \varepsilon_{1,2} \varepsilon_{0,1} S_1,$$
where $\varepsilon_{1,2}$ is the corresponding value of the rounding error, $|\varepsilon_{1,2}| \le \varepsilon_1$. In order to have a better view of the essence of the problem, and using that $\varepsilon_{1,2} \varepsilon_{0,1} S_1$ is of second order in $\varepsilon_1$:
$$|\varepsilon_{1,2} \varepsilon_{0,1} S_1| \le \varepsilon_1^2 |S_1|,$$
we neglect this term (and use the notation $\doteq$ for equalities up to second order terms):
$$\widetilde{S}_2 \doteq S_2 + \varepsilon_{0,1} S_1 + \varepsilon_{1,2} S_2.$$
Continuing the above reasoning you see that
$$\widetilde{S}_n \doteq S_n + \sum_{k=1}^{n} \varepsilon_{k-1,k} S_k. \tag{1.12}$$
Let us express the partial sums $S_k$ with the help of the $x_i$:
$$\sum_{k=1}^{n} \varepsilon_{k-1,k} S_k = x_0 \sum_{k=1}^{n} \varepsilon_{k-1,k} + \sum_{i=1}^{n} x_i \sum_{k=i}^{n} \varepsilon_{k-1,k}.$$
Observe that the estimation of the sum multiplying $x_0$ and $x_1$ is
$$\left| \sum_{k=1}^{n} \varepsilon_{k-1,k} \right| \le n \varepsilon_1,$$
since $|\varepsilon_{k-1,k}| \le \varepsilon_1$ holds for all $k$ and we have $n$ terms. The estimation of the sum multiplying $x_2$ is
$$\left| \sum_{k=2}^{n} \varepsilon_{k-1,k} \right| \le (n-1) \varepsilon_1,$$
and, generally, the estimation of the sum multiplying $x_i$ ($i \ge 1$) is
$$\left| \sum_{k=i}^{n} \varepsilon_{k-1,k} \right| \le (n-i+1) \varepsilon_1.$$
Returning to (1.12) we can write that (neglecting all second order terms)
$$|\widetilde{S}_n - S_n| \le \varepsilon_1 \left( n |x_0| + \sum_{i=1}^{n} (n-i+1) |x_i| \right)$$
is valid. From this you see that it is worth beginning the summation with the numbers $x_i$ having the smallest absolute value: in the error term the multiplier of $x_0$ and $x_1$ is $n \varepsilon_1$, but $x_n$ has only an $\varepsilon_1$ multiplier.

The following estimation is a little bit rougher, but it shows better that the error increases linearly as the number of operations $n$ increases:
$$|\widetilde{S}_n - S_n| \le n \varepsilon_1 \sum_{i=0}^{n} |x_i|. \tag{1.13}$$
In the special case when all $x_i$ are positive (and neglecting the second order terms) we obtain an estimation of the relative error:
$$\left| \frac{\widetilde{S}_n - S_n}{S_n} \right| \le n \varepsilon_1.$$
This relation shows that when $n \varepsilon_1 \ge 1$ an accurate result cannot be expected.
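The effect of the summation order is easy to observe in single precision, where $\varepsilon_1$ is large enough for the bound to bite at moderate $n$. A sketch of ours, summing the terms of $\sum 1/i^2$ in both directions:

    n = 1e7;  s_down = single(0);  s_up = single(0);
    for i = 1:n,    s_down = s_down + single(1/i^2); end   % largest terms first
    for i = n:-1:1, s_up   = s_up   + single(1/i^2); end   % smallest terms first
    [s_down, s_up, pi^2/6]    % s_up is markedly closer to the limit pi^2/6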
We come to a similar conclusion when we multiply $n + 1$ floating point numbers which are not only positive but lie around 1, i.e. the $n$ multiplications do not involve the danger of overflow and underflow. If
$$P_1 := x_0 \cdot x_1, \quad P_n = P_{n-1} \cdot x_n, \quad n = 2, 3, \ldots,$$
that is,
$$P_n = \prod_{k=0}^{n} x_k,$$
then on the computer (with corresponding error quantities $\varepsilon_{k-1,k}$, possibly different from those in (1.12))
$$\widetilde{P}_1 = P_1 \cdot (1 + \varepsilon_{0,1}),$$
$$\widetilde{P}_2 = \widetilde{\widetilde{P}_1 \cdot x_2} = \big( P_1 \cdot (1 + \varepsilon_{0,1}) \cdot x_2 \big)(1 + \varepsilon_{1,2}) = P_2 \cdot (1 + \varepsilon_{0,1})(1 + \varepsilon_{1,2}) = P_2 \cdot (1 + \varepsilon_{0,1} + \varepsilon_{1,2} + \varepsilon_{0,1} \varepsilon_{1,2}) \doteq P_2 \cdot (1 + \varepsilon_{0,1} + \varepsilon_{1,2}),$$
and generally
$$\widetilde{P}_n \doteq P_n \left( 1 + \sum_{k=1}^{n} \varepsilon_{k-1,k} \right).$$
This means that
$$\left| \frac{\widetilde{P}_n - P_n}{P_n} \right| \doteq \left| \sum_{k=1}^{n} \varepsilon_{k-1,k} \right| \le n \varepsilon_1, \tag{1.14}$$
since an upper bound of the sum is $n \varepsilon_1$. Thus, when the latter quantity does not exceed $1/a$, then we can expect at least one accurate digit.

To what extent can this reasoning be accepted, given that we have neglected second order terms? The worst case is when the error is maximal in each multiplication: $\varepsilon_1$. Then $P_n \le \widetilde{P}_n \le P_n (1 + \varepsilon_1)^n$ and
$$\left| \frac{\widetilde{P}_n - P_n}{P_n} \right| \le (1 + \varepsilon_1)^n - 1.$$
Recall from mathematical analysis that $1 + x \le e^x$ (for all $x$), and for this reason on the one hand $(1 + \varepsilon_1)^n - 1 \le e^{n \varepsilon_1} - 1$; on the other hand one can verify that $e^x - 1 \le \frac{x}{1 - \frac{x}{2}}$ if $0 \le x < 2$. Thus, in the case of $x = n \varepsilon_1 < 2$:
$$\left| \frac{\widetilde{P}_n - P_n}{P_n} \right| \le \frac{n \varepsilon_1}{1 - \frac{n \varepsilon_1}{2}}.$$
If we would like the relative error of $\widetilde{P}_n$ to be less than $\frac{1}{a}$ (i.e. at least one accurate digit in $\widetilde{P}_n$), then solving the relation $\frac{x}{1 - \frac{x}{2}} \le \frac{1}{a}$ for $x$ we obtain
$$a x \le 1 - \frac{x}{2} \implies \left( a + \frac{1}{2} \right) x \le 1 \implies x \le \frac{1}{a + \frac{1}{2}}.$$
This means the following: from relation (1.14), obtained by neglecting the second order quantities, you can see that the relative error of $\widetilde{P}_n$ is less than $\frac{1}{a}$ if $n \varepsilon_1 \le \frac{1}{a}$. Without neglecting the second order terms this is not valid—but it is sufficient to require $n \varepsilon_1 \le \frac{1}{a + \frac{1}{2}}$.
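A rough single-precision check of the bound (1.14); this is our sketch, and the factors and their number are arbitrary choices:

    rng(1);                              % reproducible random factors
    n = 1e6;
    f = 1 + (rand(1,n) - 0.5)*1e-3;      % n factors lying around 1
    Ps  = prod(single(f));               % single precision: eps_1 = 2^(-23)
    Pd  = prod(f);                       % double precision reference
    rel = abs(double(Ps) - Pd)/abs(Pd)   % stays below the bound n*eps('single')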
1.5 Conclusions
On computers we can work with two fundamental types of numbers: integers and floating point numbers. While the operations with integers are error-free (in fact, here we work in residue classes), for floating point numbers rounding is unavoidable. As a consequence of this (and also from other sources, like measurement), errors will occur, and the accumulation of errors can strongly bias the final result. For this reason it is important to estimate the final errors. While after one operation the error will be of the order of $\varepsilon_1$, the relative machine accuracy, after $n$ operations it will be of the order of $n \varepsilon_1$.

Most probably, the speed of computers will increase in the future. (This will happen not by increasing the "clock rate" but rather by increasing the number of cores available and by newer technical developments.) If, in parallel with this, the length of the mantissa of floating point numbers used in programs and the domain $[k^{(-)}, k^{(+)}]$ of the characteristic do not increase, then the danger of unreliable arithmetic results also increases: if the relative error obtained after $n$ operations is not less than one, then the result is unacceptable.
1.6 Exercises
1. Assuming that $a = 10$, determine the values of $t$, $k^{(+)}$ and $k^{(-)}$ on your computer (laptop, mobile etc.). How does the machine react to overflow and to underflow? Does a guard digit exist on your computer?
2. Why is it often true that $|k^{(+)}| \ne |k^{(-)}|$ for normalized numbers? Start from the question how many characteristics can be represented with $n$ binary digits in general, how many when the first digit is used for the sign, and how many when the exponent $k$ is stored in the form $k + 2^{n-1}$, without sign.
3. Represent on the real axis the non-negative floating point numbers corresponding to $a = 2$, $t = 4$, $k^{(-)} = -3$, $k^{(+)} = 2$.
   (In the following exercises we always assume that the values $a, t, k^{(+)}, k^{(-)}$ characterizing the representation of floating point numbers are given; moreover, $k^{(-)} \le k^{(+)}$ and $k^{(+)} > t \ge 2$ is valid.)
4. Determine in an experimental way the numbers $\varepsilon_0$ and $\varepsilon_1$.
5. Which is the smallest natural number not contained among the floating point numbers?
6. Which arithmetical operations do you know that can be performed error-free for floating point numbers?
7. Name two floating point numbers $x, y$, different from zero and one, such that $x + y$ or $x \cdot y$ are (a) floating point numbers or (b) not floating point numbers.
8. We have seen that the right-hand side floating point number adjacent to 1 is $1 + \varepsilon_1$. However, what is the left-hand side floating point neighbour of 1?
9. How many positive floating point numbers exist for a fixed characteristic $k$?
10. Give a reason for the fact that in the case of floating point calculation the sequence $\{s_n\}_{n=1}^{\infty}$, $s_1 = 1$, $s_n = s_{n-1} + 1/n$, $n = 2, 3, \ldots$, known from mathematical analysis, becomes a constant sequence when $n$ is large enough—instead of diverging.
    Estimate approximately how large $\tilde{s}_n$ will be when $n$ is large enough. For this, use that for the exact values of $s_n$
    $$\log(n+1) < s_n < 1 + \log n$$
    holds. Check your result with the help of your computer (using the shortest "real" type).
11. Imagine you have a program to calculate the determinant of a second order matrix based on the well-known mathematical formula. In the CPU there is a guard digit; after the arithmetical operations the number with $t + 1$ digits will be normalized and rounded, if necessary. If the matrix is
    $$A = A(\varepsilon) = \begin{pmatrix} 1 & 1 - \varepsilon \\ 1 & 1 \end{pmatrix}, \quad \varepsilon > 0,$$
    then starting from $\varepsilon = 1$ and decreasing the value of $\varepsilon$, what is the smallest positive $\varepsilon$ for which the computed determinant is positive?
12. Examine the value of the logical expression
    (1/45)*15==1/3
    in MATLAB. Try to explain the phenomenon.
13. Choosing $x = 1/3$ as the initial value, the expression $4x - 1$ should return $1/3$. Run the command x = 4*x - 1 several (e.g., 27, 30, 40) times in a loop and check the final value of x.
14. Consider the MATLAB commands
    4-5+1
    0.4-0.5+0.1
    and comment on the outcomes.
15. The equation
    $$\left( \frac{1}{5x} + 1 \right) x - x = 0.2$$
    is true for all $x \ne 0$. How many times does MATLAB find the above equation valid for $x = 1, \ldots, 100$?
16. Read the text on "Representation of Floating-Point Numbers" in the help part of MATLAB and look on the internet for "IEEE standard 754".
17. Find a simple algebraic relation between the following MATLAB numbers: realmin, eps(0), eps—by reading the explanations in the MATLAB help, or by direct experimentation.
2 Norms, Condition Numbers
You remember from Algebra:
If $A \in \mathbb{R}^{n \times n}$ is a matrix with (real) elements $a_{ij}$, and $x = (x_1, x_2, \ldots, x_n)^T$ is a column vector, then $Ax$ is a vector whose $i$th component is
$$(Ax)_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, \ldots, n.$$
Moreover, if for some vector $v$ and number $\lambda$
$$Av = \lambda v, \quad v \ne 0,$$
holds, then $\lambda$ is an eigenvalue of the matrix $A$, and $v$ is the corresponding eigenvector. Both $\lambda$ and $v$ may turn out to be complex.

Let us show that as a generalization of $|x + y| \le |x| + |y|$ there holds
$$\left| \sum_{i=1}^{n} x_i \right| \le \sum_{i=1}^{n} |x_i|.$$
This follows from
$$\left| \sum_{i=1}^{n} x_i \right| \le |x_1| + \left| \sum_{i=2}^{n} x_i \right| \le |x_1| + |x_2| + \left| \sum_{i=3}^{n} x_i \right|$$
etc.

You surely know and understand the following operations with sums (here $m, n \ge 1$ are integers):
$$\sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j = \sum_{i=1}^{n} a_i \sum_{j=1}^{m} b_j = \sum_{j=1}^{m} \sum_{i=1}^{n} a_i b_j = \sum_{j=1}^{m} b_j \sum_{i=1}^{n} a_i.$$
These equalities hold because in each case all possible combinations of $a_i$ and $b_j$ are multiplied and summed:
$$\sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j = a_1 b_1 + a_1 b_2 + \cdots + a_1 b_m + a_2 b_1 + a_2 b_2 + \cdots + a_2 b_m + \cdots + a_n b_1 + \cdots + a_n b_m,$$
$$\sum_{i=1}^{n} a_i \sum_{j=1}^{m} b_j = a_1 (b_1 + b_2 + \cdots + b_m) + a_2 (b_1 + b_2 + \cdots + b_m) + \cdots + a_n (b_1 + b_2 + \cdots + b_m),$$
$$\sum_{j=1}^{m} \sum_{i=1}^{n} a_i b_j = a_1 b_1 + a_2 b_1 + \cdots + a_n b_1 + a_1 b_2 + a_2 b_2 + \cdots + a_n b_2 + \cdots + a_1 b_m + \cdots + a_n b_m,$$
$$\sum_{j=1}^{m} b_j \sum_{i=1}^{n} a_i = b_1 (a_1 + a_2 + \cdots + a_n) + b_2 (a_1 + a_2 + \cdots + a_n) + \cdots + b_m (a_1 + a_2 + \cdots + a_n).$$
2.1 Norms and Their Elementary Properties
In the following chapter we are going to discuss the solution of a system of linear equations
$$Ax = b \tag{2.1}$$
where $A$ is an $n \times n$ real matrix: $A \in \mathbb{R}^{n \times n}$; $b \in \mathbb{R}^n$ is a given vector and $x \in \mathbb{R}^n$ is the unknown vector. In practice both the vector $b$ and the matrix $A$ are often given with uncertainties: they are perturbed by errors, e.g. by rounding errors. Hence, in this chapter we study the magnitude of the error in the vector $x$ caused by errors in $A$ and $b$. As a result of this examination we will understand what it depends on whether, on a given computer, a linear system with a given error can be solved with an acceptable error or not.

To answer this question we shall introduce norms and condition numbers.
If, for example, $b + \delta b$ is given instead of $b$, where $\delta b$ is the vector of errors in $b$, then we would like to know how large the error of the solution $y$ of the "perturbed" system $Ay = b + \delta b$ is, i.e. how large $y - x$ is, where $x$ is the unknown error-free solution. To answer this we need to review (or introduce) some definitions, as we have to measure the distance of two vectors and of two matrices.

In what follows, even if we write complex vectors, $x \in \mathbb{C}^n$, it is usually possible to think only of real ones. Complex vectors occur as solutions of the eigenvalue problem, since a real matrix can also have complex eigenvectors and eigenvalues.

A function $d : \mathbb{C}^n \to \mathbb{R}$ is called a norm if the following four conditions hold:
1. $d(x) \ge 0$ for all $x \in \mathbb{C}^n$;
2. $d(x) = 0 \iff x = 0$, so the null vector can be uniquely recognized;
3. $d(a \cdot x) = |a| \, d(x)$, where $a$ is an arbitrary number which can also be complex (i.e., if $a = b + ci$, $i = \sqrt{-1}$, then $|a| = \sqrt{b^2 + c^2}$);
4. $d(x + y) \le d(x) + d(y)$ (this is the so-called triangle inequality).

The value $d(x)$ is the norm of the vector $x$, and since it is a generalization of the absolute value, instead of $d(x)$ the usual notation is $\|x\|$.

Notice that in the second property 0 has two different meanings: as a number and, in the case of $x = 0$, as a vector. In what follows we always write 0, both in the case of numbers and vectors, and even for matrices and functions.

In the space $\mathbb{C}^n$ the following $p$-norms are useful for us:
$$d(x) = \|x\|_p := \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p},$$
$p = 1$ (Manhattan norm), $p = 2$ (Euclidean norm), $p = \infty$ (maximum norm).

The maximum norm could be interpreted as the limit for $p \to \infty$, but it can also be defined in the following way:
$$\|x\|_\infty := \max_{1 \le i \le n} |x_i|.$$
This follows from the inequality
$$\max_{1 \le i \le n} |x_i| \le \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \le n^{1/p} \max_{1 \le i \le n} |x_i|,$$
which is always true because $\max_{1 \le i \le n} |x_i|$ is present among the values $|x_i|$ (this gives the lower estimation on the left-hand side). Moreover, $|x_i|$ can take the maximal value at most $n$ times, which explains the upper estimation on the right-hand side. Further, recall from mathematical analysis that $n^{1/n}$ tends to 1 as $n \to \infty$. So much the more $n^{1/p} \to 1$ if $n$ is fixed and $p \to \infty$, as here. For example, if $n = 256$ and $p = 2$ then $n^{1/p} = 16$; if $p = 4$ then $n^{1/p} = 4$; but in the case of $p = 16$ we have $n^{1/p} = \sqrt{2} \approx 1.414$.

We often omit the subscript $p$ when any of the above-mentioned three norms can be used, but in general these norms behave rather differently (Fig. 2.1).
Fig. 2.1 The unit ball in three p-norms: the points $x$ satisfying $\|x\|_p = 1$ ($p = 1, 2, \infty$)
The Euclidean norm corresponding to $p = 2$ is probably known from your previous studies. If the vector $x$ has real components, then we can simply write
$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}.$$
Consider an example: if $x$ is the row vector $(1, -2, -3, 4)$, then
$$\|x\|_1 = 1 + |-2| + |-3| + 4 = 10,$$
$$\|x\|_2 = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.477,$$
$$\|x\|_\infty = \max(1, |-2|, |-3|, 4) = 4.$$
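These three values can be checked at once with MATLAB's built-in norm function (our quick check, not part of the book's text):

    x = [1 -2 -3 4];
    [norm(x,1), norm(x,2), norm(x,Inf)]    % 10  5.4772  4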
In the case of the maximum and the Manhattan norms, the triangle inequality follows from the fact that for the absolute value the inequality $|a + b| \le |a| + |b|$ holds. In the case of the Euclidean norm it is a consequence of the Cauchy inequality (see the beginning of Chap. 4).

We remark that the parameter $p$ of the $p$-norms can be any number from the interval $[1, \infty]$, but $0 < p < 1$ is not possible because in this case the fourth property does not hold, that is, $\|x\|_p$ does not define a norm. As a counterexample consider $n \ge 2$; moreover, let $x = (1, 0, \ldots, 0)^T$ and $y = (0, 1, 0, \ldots, 0)^T$. Then, according to the definition, $\|x\|_p = 1$, $\|y\|_p = 1$, but from $0 < p < 1$ we have
$$\|x + y\|_p = (1 + 1)^{1/p} = 2^{1/p} > 2 = \|x\|_p + \|y\|_p.$$

2.2 The Induced Matrix Norm
2.2.1 Definition and Properties

In what follows we shall denote column vectors by $x, y$. As an example,
$$x = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \in \mathbb{R}^n$$
is the first column unit vector, denoted also as $e_1$ and often written as a transposed row vector $(1, 0, \ldots, 0)^T$.

Now, if $A \in \mathbb{R}^{n \times n}$ is a real matrix, with the help of a vector norm we can define a matrix norm:
$$\|A\| := \max_{x \ne 0} \frac{\|Ax\|}{\|x\|}. \tag{2.2}$$
It is the matrix norm induced by the given vector norm (or subordinate to the given vector norm). We need to check that (2.2) indeed defines a norm on the linear space of the matrices:
1. The non-negativity of the expression on the right-hand side follows from the non-negativity of the vector norms.
2. If $A = 0$, then $\|Ax\| = 0$ for all vectors $x$, so $\|A\| = 0$. In the opposite direction: if $A \ne 0$, then $A$ has at least one nonzero element, for example in the $j$th column. If we multiply the matrix $A$ by the $j$th unit vector $x = e_j$, then $Ax$ is the $j$th column vector of $A$. The norm of this vector is positive, so the maximum is also positive.
3. In the case of the matrix $c \cdot A$, for any real or complex number $c$, it follows again from the properties of the vector norm that $\|c \cdot Ax\| = |c| \, \|Ax\|$. Since the maximum does not concern $c$, there holds $\|c \cdot A\| = |c| \, \|A\|$.
4. Finally, if $B$ is also a matrix in $\mathbb{R}^{n \times n}$, then $\|(A + B)x\| = \|Ax + Bx\| \le \|Ax\| + \|Bx\|$. Hence, with the help of definition (2.2) we obtain
$$\|A + B\| := \max_{x \ne 0} \frac{\|(A + B)x\|}{\|x\|} \le \max_{x \ne 0} \frac{\|Ax\| + \|Bx\|}{\|x\|} \le \max_{x \ne 0} \frac{\|Ax\|}{\|x\|} + \max_{x \ne 0} \frac{\|Bx\|}{\|x\|} = \|A\| + \|B\|,$$
where the second inequality is valid because on its right-hand side we take the maxima of $\frac{\|Ax\|}{\|x\|}$ and $\frac{\|Bx\|}{\|x\|}$ separately. Therefore, the triangle inequality is true, too.

Since in (2.2) the matrix norm is defined as the maximum over all nonzero vectors $x$, for a fixed $x \ne 0$ the quotient $\|Ax\| / \|x\|$ can be less than the norm, but it can never be greater:
$$\frac{\|Ax\|}{\|x\|} \le \|A\|, \quad \text{that is,} \quad \|Ax\| \le \|A\| \, \|x\|. \tag{2.3}$$
Later this inequality will be used in several estimations. Conversely, if $M$ is a (non-negative) number such that for all vectors $x$
$$\|Ax\| \le M \cdot \|x\|, \tag{2.4}$$
then $\|A\| \le M$, and the norm of the matrix $A$ is the smallest number $M$ for which (2.4) is always valid.

Using (2.3) we can also obtain an upper bound of the norm of a vector multiplied by several matrices, e.g.
$$\|ABx\| = \|A(Bx)\| \le \|A\| \, \|Bx\| \le \|A\| \, \|B\| \, \|x\|.$$
By (2.2) this means that for all matrices $A, B \in \mathbb{R}^{n \times n}$
$$\|A \cdot B\| \le \|A\| \cdot \|B\| \tag{2.5}$$
holds.
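Definition (2.2) and inequality (2.3) can be made tangible by sampling: for any particular x the ratio $\|Ax\| / \|x\|$ stays below the induced norm, and its maximum over many random vectors approaches the norm from below. A sketch of ours, using the matrix of the later example (2.7):

    A = [3 -4 1; -2 0 -1; 2 -5 3];
    r = zeros(1,1000);
    for k = 1:1000
        x = randn(3,1);
        r(k) = norm(A*x,1)/norm(x,1);   % never exceeds norm(A,1), by (2.3)
    end
    [max(r), norm(A,1)]                 % sampled maximum vs. induced norm 9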
2.2.2 Computation of the Induced Matrix Norm for the Vector p-Norms, p = 1, ∞

We compute the matrix norm (2.2) directly from the elements $a_{ij}$ of the matrix $A$. To this end,
1. we have to verify an estimation of the form (2.4),
2. we need to find a vector $x \ne 0$ for which one has equality in (2.4).

Consider first the case $p = 1$.

First step:
The inequality
$$\|Ax\|_1 = \sum_{i=1}^{n} |(Ax)_i| = \sum_{i=1}^{n} \left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}| |x_j| = \sum_{j=1}^{n} |x_j| \sum_{i=1}^{n} |a_{ij}|$$
holds for all $x$.
In fact the last double sum is an expression of the form $\sum_{j=1}^{n} |x_j| c_j$, where $c_j := \sum_{i=1}^{n} |a_{ij}|$. We take from these $n$ non-negative numbers the largest one, $\max_j \left( \sum_{i=1}^{n} |a_{ij}| \right)$, and then
$$\|Ax\|_1 \le \max_j \left( \sum_{i=1}^{n} |a_{ij}| \right) \cdot \sum_{j=1}^{n} |x_j|. \tag{2.6}$$
Since $\sum_{j=1}^{n} |x_j| = \|x\|_1$, this means that $\|A\|_1 \le \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right)$.

Second step:
If $j_0$ is an index where the maximum occurs, i.e.
$$\sum_{i=1}^{n} |a_{i j_0}| = \max_j \left( \sum_{i=1}^{n} |a_{ij}| \right),$$
and $x$ is the vector whose $j_0$th coordinate is equal to 1 while the other coordinates are equal to 0, so that $x = e_{j_0}$ is the $j_0$th unit vector, then one has equality in (2.6). Hence,
$$\|A\|_1 = \max_j \left( \sum_{i=1}^{n} |a_{ij}| \right).$$
According to this easily evaluated expression, the matrix norm induced by the Manhattan vector norm is also called the column sum norm.

Example
$$A = \begin{pmatrix} 3 & -4 & 1 \\ -2 & 0 & -1 \\ 2 & -5 & 3 \end{pmatrix}. \tag{2.7}$$
Here $\sum_{i=1}^{n} |a_{ij}|$ takes the values 7, 9, 5 for $j = 1, 2, 3$, respectively. Hence, $\|A\|_1 = 9$. We also find a vector for which the equality holds in (2.4): according to the previous reasoning the corresponding vector is $x = e_2$, and then
$$Ax = \begin{pmatrix} -4 \\ 0 \\ -5 \end{pmatrix}, \quad \text{so} \quad \|Ax\|_1 = |-4| + 0 + |-5| = 4 + 5 = 9 = 9 \|x\|_1.$$
Now, let us consider the case $p = \infty$.

First step: The equality
$$\|Ax\|_\infty = \max_{1 \le i \le n} |(Ax)_i| = \max_{1 \le i \le n} \left| \sum_{j=1}^{n} a_{ij} x_j \right|$$
holds for all vectors $x$. Moreover, for each fixed $i$,
$$\left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \sum_{j=1}^{n} |a_{ij}| |x_j| \le \left( \sum_{j=1}^{n} |a_{ij}| \right) \max_{1 \le j \le n} |x_j| = \left( \sum_{j=1}^{n} |a_{ij}| \right) \|x\|_\infty$$
is valid. Summarizing both statements above we get
$$\|Ax\|_\infty = \max_{1 \le i \le n} \left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \left( \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \right) \|x\|_\infty. \tag{2.8}$$
This upper bound means that
$$\|A\|_\infty \le \max_i \left( \sum_{j=1}^{n} |a_{ij}| \right).$$

Second step: If $i_0$ is an index where the maximum occurs, that is,
$$\sum_{j=1}^{n} |a_{i_0 j}| = \max_i \left( \sum_{j=1}^{n} |a_{ij}| \right),$$
and $x$ is a vector with its coordinates $x_j$ defined as
$$x_j := \begin{cases} +1, & \text{if } a_{i_0 j} \ge 0, \\ -1, & \text{if } a_{i_0 j} < 0, \end{cases} \tag{2.9}$$
then $\|x\|_\infty = 1$ and $(Ax)_{i_0} = \sum_{j=1}^{n} |a_{i_0 j}|$. In addition, for $i \ne i_0$,
$$|(Ax)_i| \le \sum_{j=1}^{n} |a_{ij} x_j| = \sum_{j=1}^{n} |a_{ij}| \le \sum_{j=1}^{n} |a_{i_0 j}|$$
holds, hence $\max_{1 \le i \le n} |(Ax)_i| = \sum_{j=1}^{n} |a_{i_0 j}| = \left( \sum_{j=1}^{n} |a_{i_0 j}| \right) \|x\|_\infty$, which means
$$\|A\|_\infty = \max_i \left( \sum_{j=1}^{n} |a_{ij}| \right).$$
This is why the maximum norm is also called the row sum norm.

Example
Consider again the matrix $A$ in (2.7). Then the sum $\sum_{j=1}^{n} |a_{ij}|$ takes the values 8, 3, 10 for $i = 1, 2, 3$, respectively. Hence, $\|A\|_\infty = 10$.
Let us clarify which vector implies equality in (2.8): according to the arguments above (see (2.9)) the corresponding vector is $x = (1, -1, 1)^T$, and then
$$Ax = \begin{pmatrix} 8 \\ -3 \\ 10 \end{pmatrix}, \quad \text{so} \quad \|Ax\|_\infty = \max(8, |-3|, 10) = 10 = 10 \|x\|_\infty.$$
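Both hand computations for the matrix (2.7), as well as the maximizing vector from (2.9), can be verified directly in MATLAB (our check):

    A = [3 -4 1; -2 0 -1; 2 -5 3];
    norm(A,1)                   % column sum norm: 9
    norm(A,Inf)                 % row sum norm: 10
    norm(A*[1;-1;1], Inf)       % the vector from (2.9) attains the maximum: 10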
If you are in doubt as to which $p$ the column sum and the row sum induced matrix norms correspond, then the following simple rule can help you: the 1 stands—as a column; the ∞ lies—as a row.
2.2.3 Computation of the Induced Matrix Norm ($p = 2$)

Finally, consider the case $p = 2$. The resulting matrix norm is useful for theoretical considerations but, as it turns out, difficult to compute in practice.
The norm $\|x\|_2$ is connected with the Euclidean inner product
$$(x, y) := \sum_{i=1}^{n} x_i \overline{y_i}, \qquad x, y \in \mathbb{C}^n, \qquad (2.10)$$
(where $\overline{y_i}$ is the complex conjugate of $y_i$) in the following way:
$$\|x\|_2^2 = (x, x).$$
With the inner product (2.10) the connection
$$(Ax, y) = (x, A^T y) \qquad (2.11)$$
is valid for all real matrices $A$. We get this identity by changing the order of summation, that is,
$$(Ax, y) = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} a_{ij} x_j \Big) \cdot \overline{y_i} = \sum_{j=1}^{n} \Big( \sum_{i=1}^{n} a_{ij} \overline{y_i} \Big) \cdot x_j = \sum_{j=1}^{n} x_j \cdot \Big( \sum_{i=1}^{n} a_{ij} \overline{y_i} \Big) = (x, A^T y),$$
where $A^T$ is the transpose of $A$. In the expression $\sum_{i=1}^{n} a_{ij} \overline{y_i}$ the components of the vector $y$ are multiplied by the elements of the $j$th column of the matrix $A$, i.e., by the $j$th row of the transpose of $A$, but not by the elements of the $j$th row of $A$, since that would yield the product $(Ay)_j = \sum_{i=1}^{n} a_{ji} y_i$.
Applying relation (2.11) to $y = Ax$ we get
$$\|Ax\|_2^2 = (x, A^T A x). \qquad (2.12)$$
The matrix $A^T A$ is symmetric, that is, $(A^T A)^T = A^T A$. Recall now an important concept: a symmetric matrix $B \in \mathbb{R}^{n \times n}$ is positive semidefinite if the inequality
$$(Bx, x) \ge 0 \qquad (2.13)$$
holds for all $x \in \mathbb{C}^n$. Accordingly, equality (2.12) shows that the matrix $A^T A$ is positive semidefinite, and from this it follows that all eigenvalues of the matrix $A^T A$ are non-negative. Denote by $\lambda_{\max}$ the largest eigenvalue of the matrix $A^T A$: $\lambda_{\max} = \lambda_{\max}(A^T A)$. Below we derive the inequality
$$\|Ax\|_2^2 = (x, A^T A x) \le \lambda_{\max} \|x\|_2^2 \qquad (2.14)$$
for all $x \in \mathbb{C}^n$, which implies $0 \le \lambda_{\max}$ and $\|A\|_2 \le \sqrt{\lambda_{\max}(A^T A)}$.
Inequality (2.14) follows from a fact of linear algebra: since $A^T A$ is a symmetric matrix, it can be diagonalized by an orthogonal matrix. This means that there exists a matrix $Q \in \mathbb{R}^{n \times n}$ satisfying $Q^T Q = I = Q Q^T$, such that
$$Q A^T A Q^T = D = \operatorname{diag}_{1 \le i \le n}(\lambda_i(A^T A))$$
holds, where $D$ is a diagonal matrix whose main diagonal contains the eigenvalues $\lambda_i$ of $A^T A$. For an orthogonal matrix $Q$, the equation
$$\|x\|_2^2 = (x, x) = (Q^T Q x, x) = (Qx, Qx) = \|Qx\|_2^2 \qquad (2.15)$$
is valid, so the Euclidean length of a vector does not change when we multiply it by $Q$. In our case
$$(x, A^T A x) = (Q^T Q x, A^T A x) = (Qx, Q A^T A x) = (y, Q A^T A Q^T y),$$
where $y := Qx$, so that
$$(x, A^T A x) = (y, Dy) = \sum_{i=1}^{n} d_i |y_i|^2 \le d_{\max} \sum_{i=1}^{n} |y_i|^2 = \lambda_{\max} \|y\|_2^2 = \lambda_{\max} \|Qx\|_2^2 = \lambda_{\max} \|x\|_2^2,$$
which gives (2.14). In the previous estimate, equality holds if $x = v_{\max}$ is the eigenvector of $A^T A$ corresponding to the eigenvalue $\lambda_{\max}$ of largest absolute value. Hence, for the Euclidean norm of a real matrix $A$ we have
$$\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}. \qquad (2.16)$$
The spectral radius is connected to the Euclidean matrix norm. Let $A \in \mathbb{R}^{n \times n}$ be a matrix, and denote by $\lambda_i(A)$ the eigenvalues of $A$, $i = 1, \ldots, n$. The absolute value of the eigenvalue having the largest absolute value is called the spectral radius,
$$\rho(A) := \max_{1 \le i \le n} |\lambda_i(A)|.$$
If the matrix $A$ is symmetric, then its eigenvalues are real numbers; moreover, the eigenvalues of the matrix $A^T A = A^2$ are easy to obtain from the eigenvalues $\lambda = \lambda(A)$ of the original matrix: $A^T A x = A^2 x = A(Ax) = A(\lambda x) = \lambda A x = \lambda^2 x$. Thus in this case
$$\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sqrt{\lambda_{\max}^2(A)} = |\lambda_{\max}(A)| = \rho(A),$$
meaning that in this case the spectral radius is itself a norm.
As an example consider the matrix
$$A = \begin{pmatrix} 3 & 4 \\ 0 & 0 \end{pmatrix}.$$
Here you can immediately obtain $\|A\|_1 = 4$, $\|A\|_\infty = 7$. Moreover, because of the special form of $A$ (it is a so-called upper triangular matrix), it is known that the eigenvalues are the elements of the main diagonal: one of the eigenvalues equals 3, the other 0, and therefore $\rho(A) = 3$.

Calculating the matrix
$$A^T A = \begin{pmatrix} 9 & 12 \\ 12 & 16 \end{pmatrix},$$
we can find its eigenvalues, which are 25 and 0, so that $\|A\|_2 = \sqrt{25} = 5$. The eigenvector of $A^T A$ corresponding to the eigenvalue 25 is $x = (3, 4)^T$, and indeed
$$Ax = \begin{pmatrix} 25 \\ 0 \end{pmatrix}, \qquad \|Ax\|_2 = 25 = 5 \cdot 5 = 5 \|x\|_2 = \|A\|_2 \|x\|_2.$$
As a second example take again the matrix $A$ in (2.7). Then $\|A\|_1 = 9$ and $\|A\|_\infty = 10$.

In this case the determination of the eigenvalues of $A$ requires a longer calculation. The eigenvalues are approximately 6.425, 1.608, and $-2.033$, so $\rho(A) \approx 6.425$. Moreover,
$$A^T A = \begin{pmatrix} 17 & -22 & 11 \\ -22 & 41 & -19 \\ 11 & -19 & 11 \end{pmatrix},$$
and the eigenvalues of this matrix are approximately 63.279, 3.962, 1.759; hence, finally, $\|A\|_2 \approx \sqrt{63.279} \approx 7.955$.

This example shows that the maximum and Manhattan norms are often more practical for direct calculations than the Euclidean norm.
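Although hand computation of the 2-norm is laborious, the verification is immediate in MATLAB; this sketch (variable names ours) follows formula (2.16) using the built-in eig and norm:

    A = [3 -4 1; -2 0 -1; 2 -5 3];   % the matrix from (2.7)
    lam = eig(A'*A);                 % eigenvalues of A^T A: approx 1.759, 3.962, 63.279
    sqrt(max(lam))                   % approx 7.955, in agreement with (2.16)
    norm(A)                          % norm(A) defaults to the 2-norm: same value
    max(abs(eig(A)))                 % the spectral radius: approx 6.425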
Also, it can be seen in both examples that the spectral radius is less than all three norms. As the reasoning below shows, this is not accidental. First, let $\| \cdot \|$ be an arbitrary vector norm; moreover, let $\lambda = \lambda(A)$ be some eigenvalue of $A$, and $v$ the corresponding eigenvector. Then
$$|\lambda| \cdot \|v\| = \|Av\| \le \|A\| \cdot \|v\| \implies |\lambda| \le \|A\|, \qquad (2.17)$$
since $\|v\| > 0$. This also holds for the eigenvalue that is maximal in absolute value, so that we get
$$\rho(A) \le \|A\|,$$
where the norm may be any induced norm of $A$.
2.3 Error Estimations

We continue our way toward the numerical solution of a linear system $Ax = b$ by first tackling the question whether it makes sense to try at all.

Having norms at our disposal, consider the linear system (2.1) in the case of a regular $n \times n$ matrix $A$ and a vector $b \ne 0$. Taking norms on both sides we get
$$0 < \|b\| = \|Ax\| \le \|A\| \, \|x\|. \qquad (2.18)$$
2.3.1 The Right-Hand Side of the Linear System is Perturbed

Let us estimate the error of the solution if instead of $b$ a perturbed vector $b + \delta b$ is given, $\delta b$ representing the error of the right-hand side. Let $y = x + (y - x) =: x + \delta x$ be the solution of the perturbed equation. Hence $\delta x = y - x$ and
$$A(x + \delta x) = b + \delta b.$$
Then $b + \delta b = A(x + \delta x) = b + A \delta x$, and in this way $\delta b = A \delta x$. Because of the regularity of $A$, that is, since it has an inverse, we can also write (at least formally: the calculation of the inverse matrix is usually much more expensive than solving the system) $\delta x = A^{-1} \delta b$. Taking norms again we obtain
$$\|\delta x\| = \|A^{-1} \delta b\| \le \|A^{-1}\| \, \|\delta b\|. \qquad (2.19)$$
This inequality shows that $\|A^{-1}\| \, \|\delta b\|$ is an upper bound for the absolute error of the solution $x$.
To estimate the relative error (which is usually of more interest) we use first inequality (2.19) and then (2.18) (where $\|A\| > 0$, since the regular matrix $A$ is nonzero):
$$\frac{\|\delta x\|}{\|x\|} \le \frac{\|A^{-1}\| \, \|\delta b\|}{\|x\|} \le \frac{\|A^{-1}\| \, \|\delta b\|}{\|b\| / \|A\|} = \|A\| \cdot \|A^{-1}\| \cdot \frac{\|\delta b\|}{\|b\|}. \qquad (2.20)$$
On the right-hand side stands the relative error of the vector $b$ (it can be, e.g., the rounding error), while on the left-hand side the relative error of the solution appears.

The estimates (2.18) and (2.19) are sharp, because there exist both a nonzero vector $x$ (in other words, a vector $b$) and a nonzero vector $\delta b$ such that
$$\|Ax\| = \|A\| \, \|x\| \quad \text{and} \quad \|A^{-1} \delta b\| = \|A^{-1}\| \, \|\delta b\|$$
hold, where the second relation means sharpness of (2.19). Further, for such vectors $\frac{\|\delta x\|}{\|x\|} = \frac{\|A^{-1}\| \, \|\delta b\|}{\|x\|} = \|A\| \cdot \|A^{-1}\| \cdot \frac{\|\delta b\|}{\|b\|}$, which shows that estimate (2.20) is also sharp.
2.3.2 Condition Number

The expression $\|A\| \cdot \|A^{-1}\|$ appearing in Eq. (2.20) is called the condition number of the matrix $A$ and is denoted by
$$\operatorname{cond}(A) := \|A\| \cdot \|A^{-1}\|.$$
According to the calculation in the preceding subsection, the condition number tells us exactly how much larger the relative error of the solution of the linear system can be than the relative error of the right-hand side vector. Hence, knowledge of the condition number will be critical from the point of view of the numerical solvability of the system. However, it is hard to compute, because first we would have to compute the inverse matrix.

Properties:
1. The condition number depends on the matrix norm and, as the examples below show, it also depends on the vector norm inducing it. Thus, where necessary, we use a corresponding subscript: $\operatorname{cond}_p(A) = \|A^{-1}\|_{(p)} \|A\|_{(p)}$, $p = 1, 2, \infty$.
2. The condition number cannot be less than 1. In the case of the induced norms $\|I\| = 1$, because from (2.2) it follows that
$$\|I\| = \max_{x \ne 0} \frac{\|x\|}{\|x\|} = 1.$$
Hence, taking (2.5) also into account, $1 = \|I\| = \|A A^{-1}\| \le \|A\| \, \|A^{-1}\| = \operatorname{cond}(A)$. In other words, we usually must expect that the relative error of the solution increases compared to the relative error of the right-hand side.
3. In the special case where $A = Q$ is an orthogonal matrix, $\operatorname{cond}_2(Q) = 1$, see (2.15). Then the relative error corresponding to the Euclidean norm cannot increase during the solution of the linear system.
4. For a regular matrix $A$, 0 is not an eigenvalue. Then, for any eigenvector $v$,
$$Av = \lambda v \implies \frac{1}{\lambda} v = A^{-1} v, \quad \text{i.e.} \quad \lambda^{-1}(A) = \lambda(A^{-1}).$$
Now denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the eigenvalues of $A$ which have the smallest and largest absolute values, respectively. We can apply the reasoning of (2.17) to the inverse matrix too:
$$\|A\| \ge |\lambda_{\max}(A)|, \qquad \|A^{-1}\| \ge |\lambda_{\max}(A^{-1})| = \frac{1}{|\lambda_{\min}(A)|}.$$
From this it follows that
$$\operatorname{cond}(A) \ge \left| \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \right|.$$
On the basis of the above inequality the condition number can be estimated from below using the methods described in the chapter dealing with eigenvalue problems.
5. Consider the case where the relative error of the right-hand side vector equals the rounding error, $\|\delta b\| / \|b\| = \varepsilon_1 = a^{1-t}$, see (1.6) in Sect. 1.3, and
$$\operatorname{cond}(A) \ge \frac{1}{\varepsilon_1}.$$
Then (2.20) shows that the relative error of the solution can be unacceptably large: since $\operatorname{cond}(A) \frac{\|\delta b\|}{\|b\|} \ge 1$, it may occur that $\|\delta x\| \ge \|x\|$. In this case we call the linear system (and the matrix) ill-conditioned.

To guarantee at least one correct digit in the solution, the inequality
$$\operatorname{cond}(A) \le \frac{1}{a \varepsilon_1}$$
has to be satisfied, because then from (2.20) and from $\|\delta b\| / \|b\| \le \varepsilon_1$ it follows that
$$\frac{\|\delta x\|}{\|x\|} \le \frac{1}{a}.$$
Examples

If the matrix is a regular $2 \times 2$ matrix, then the condition number can be calculated without any problem. Let the determinant of the matrix
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
be different from zero: $\det(A) = ad - cb \ne 0$. Then the inverse of $A$ is
$$A^{-1} = \frac{1}{ad - cb} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix},$$
and hence, for example,
$$\operatorname{cond}_\infty(A) = \frac{\max(|a| + |b|, \, |c| + |d|) \cdot \max(|d| + |b|, \, |c| + |a|)}{|ad - cb|} = \frac{\|A\|_\infty \, \|A\|_1}{|\det(A)|}.$$
But for the example $A = \begin{pmatrix} 3 & 4 \\ 0 & 0 \end{pmatrix}$, the condition number is not defined (independently of the vector norm used) because of the singularity of the matrix.
If the matrix is
$$A = \begin{pmatrix} 3 & 4 \\ -1 & 2 \end{pmatrix},$$
then $\|A\|_\infty = 7$, $\det(A) = 10$, and
$$A^{-1} = \frac{1}{10} \begin{pmatrix} 2 & -4 \\ 1 & 3 \end{pmatrix}.$$
So $\|A^{-1}\|_\infty = \frac{6}{10} = \frac{3}{5}$ and $\operatorname{cond}_\infty(A) = \frac{21}{5}$.

For which vector $x$ does the equality $\|Ax\|_\infty = \|A\|_\infty \|x\|_\infty$ hold? According to our result in Sect. 2.2.2, the corresponding vector is $x = (1, 1)^T$, and then $b = Ax = (7, 1)^T$. Similarly, $\|A^{-1} \delta b\|_\infty = \|A^{-1}\|_\infty \|\delta b\|_\infty$ if $\delta b = (1, -1)^T$, and then $\delta x = A^{-1} \delta b = \frac{1}{10} \begin{pmatrix} 6 \\ -2 \end{pmatrix} = \frac{1}{5} \begin{pmatrix} 3 \\ -1 \end{pmatrix}$.
If instead of $\delta b$ we have $t \cdot \delta b$, and instead of $\delta x$ the vector $t \cdot \delta x$ with some parameter $t$, the equalities remain valid: $A(t \cdot \delta x) = t \cdot \delta b$ and $\|A^{-1} t \cdot \delta b\|_\infty = \|A^{-1}\|_\infty \|t \cdot \delta b\|_\infty$. Hence, if the original and the perturbed systems are
$$Ax = \begin{pmatrix} 3 & 4 \\ -1 & 2 \end{pmatrix} x = \begin{pmatrix} 7 \\ 1 \end{pmatrix}$$
and
$$A(x + t \cdot \delta x) = \begin{pmatrix} 3 & 4 \\ -1 & 2 \end{pmatrix} (x + t \cdot \delta x) = \begin{pmatrix} 7 + t \\ 1 - t \end{pmatrix}, \qquad (2.21)$$
respectively, we have the equality
$$\frac{\|t \cdot \delta x\|}{\|x\|} = \|A\| \cdot \|A^{-1}\| \cdot \frac{\|t \cdot \delta b\|}{\|b\|}.$$
Substituting the calculated values into the equation we get
$$\frac{\|t \cdot \delta x\|}{\|x\|} = \frac{|t| \cdot 3/5}{1} = \frac{3|t|}{5} = \frac{21}{5} \cdot \frac{|t|}{7} = \frac{21}{5} \cdot \frac{\|t \cdot \delta b\|}{\|b\|}.$$
This means that, as long as $t \ne 0$, during the solution of the system (2.21) the relative error of the solution increases by the factor $21/5$, independently of $t$, i.e., independently of whether the relative error on the right-hand side is small or large.
In the case of the matrix (2.7) you can readily check that
$$A^{-1} = \frac{1}{21} \begin{pmatrix} 5 & -7 & -4 \\ -4 & -7 & -1 \\ -10 & -7 & 8 \end{pmatrix}, \quad \text{so} \quad \|A^{-1}\|_1 = 1, \quad \|A^{-1}\|_\infty = \frac{25}{21}.$$
Hence $\operatorname{cond}_1(A) = 9 \cdot 1 = 9$, $\operatorname{cond}_\infty(A) = 10 \cdot \frac{25}{21} = \frac{250}{21} \approx 11.905$, and after a longer calculation we get $\operatorname{cond}_2(A) \approx 7.955 \times 0.754 \approx 5.998$.
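These values can be cross-checked with the built-in cond function; a minimal sketch (the matrix name is ours):

    A = [3 -4 1; -2 0 -1; 2 -5 3];   % the matrix from (2.7)
    cond(A, 1)                       % 9
    cond(A, inf)                     % 250/21, approx 11.905
    cond(A, 2)                       % approx 5.998
    norm(inv(A), inf)                % 25/21; inv(A) is used here only for checking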
Even small matrices can have large condition numbers, as the following example shows. Let $A(t)$ be the matrix
$$A(t) := \begin{pmatrix} 1 & t \\ t & 1 + t^2 \end{pmatrix}, \qquad (2.22)$$
which depends on the parameter $t$. Then $\det(A) = 1$ holds, independently of $t$, and
$$A(t)^{-1} = \begin{pmatrix} 1 + t^2 & -t \\ -t & 1 \end{pmatrix}.$$
Hence $\|A(t)\|_\infty = \max(|t| + 1 + t^2, \, 1 + |t|) = 1 + |t| + t^2 = \|A(t)^{-1}\|_\infty$, so $\operatorname{cond}_\infty(A(t)) = (1 + |t| + t^2)^2$. For a suitable $t$ this can be very large; for example, if $t = 100$, then $\operatorname{cond}_\infty(A(t)) = 102{,}030{,}201 > 10^8$.
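A quick numerical confirmation is possible; the function-handle notation below is our choice, not the text's:

    A = @(t) [1 t; t 1+t^2];   % the parametrized matrix (2.22)
    det(A(100))                % 1, independently of t
    cond(A(100), inf)          % 102030201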
A famous example of this phenomenon is the so-called Hilbert matrix
$$H_n = \left( \frac{1}{i + j - 1} \right)_{i,j=1}^{n}.$$
As a specific case take $n = 5$, so
$$H_5 = \begin{pmatrix} 1 & 1/2 & 1/3 & 1/4 & 1/5 \\ 1/2 & 1/3 & 1/4 & 1/5 & 1/6 \\ 1/3 & 1/4 & 1/5 & 1/6 & 1/7 \\ 1/4 & 1/5 & 1/6 & 1/7 & 1/8 \\ 1/5 & 1/6 & 1/7 & 1/8 & 1/9 \end{pmatrix}.$$
Then $\operatorname{cond}_1(H_5) \approx 9.437 \times 10^5$, $\operatorname{cond}_2(H_5) \approx 4.766 \times 10^5$ (here $\lambda_{\min} \approx 3.288 \times 10^{-6}$, $\lambda_{\max} \approx 1.567$), and $\operatorname{cond}_1(H_{12}) \approx 3.798 \times 10^{16}$.

In the case of the matrix $H_{12}$ the condition number is so large that the rounding error of the right-hand side vector alone (regardless of whether single or double precision is used) can cause a huge error in the exact solution of Eq. (2.1). Obviously, in this case no solution algorithm can give an acceptable result; the matrix is indeed ill-conditioned.
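MATLAB supplies the Hilbert matrix through the built-in hilb, so these numbers are easy to reproduce (all printed values are approximate):

    cond(hilb(5), 1)    % approx 9.437e5
    cond(hilb(5), 2)    % approx 4.766e5
    eig(hilb(5))        % smallest approx 3.288e-6, largest approx 1.567
    cond(hilb(12), 1)   % approx 3.8e16, at the limit of double precision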
We have arrived at the most important observation of the present chapter:

For a given computer, linear system, and error level, it is not the determinant but the condition number that determines the numerical solvability of the system.

Surely, if the determinant is equal to zero, the condition number is not defined. On the other hand, if $\det(A)$ differs from zero, then the magnitude of the condition number of $A$ does not depend on the value of the determinant: for example, multiplying the system (2.1) by $10^k$ yields a $10^{kn}$ times larger determinant, while the condition number does not change. We can see the same in example (2.22): the determinant is equal to 1, independently of $t$, while the condition number can take arbitrarily large values.

The matrices arising in practice are typically ill-conditioned, while the matrices constructed from random numbers are most often well-conditioned.

A good program package, like MATLAB, will issue a warning if it encounters a linear system with an ill-conditioned matrix.
2.3.3 The Matrix of the Linear System is Perturbed

Now we examine the case where the elements of the matrix are perturbed:
$$B := A + \delta A, \qquad By = b, \qquad y = x + \delta x. \qquad (2.23)$$
Here we need a condition on the matrix $A$ which ensures that the perturbation $\delta A$ is small enough that both $A$ and $B$ are non-singular.

Consider first a simpler question: what kind of sufficient condition can be given for the matrix $R$ to ensure the regularity of $I + R$?

The answer to this question is the perturbation lemma: if $S := I + R$ and $\|R\| =: q < 1$ in some induced norm, then $S$ is regular, and
$$\|S^{-1}\| \le 1/(1 - q).$$
To see that this is true, we proceed as follows:

(a) First, from the triangle inequality $\|x + y\| \le \|x\| + \|y\|$ you have
$$\|x\| \ge \|x + y\| - \|y\|.$$
Replacing $x + y$ by $z$ (that is, $x = z - y$) you get the so-called inverse triangle inequality
$$\|z - y\| \ge \|z\| - \|y\|,$$
where the vectors $x, y$, and hence the vectors $y$ and $z$, are arbitrary.

(b) In the inverse triangle inequality, take $z := x$, $y := -Rx$:
$$\|Sx\| = \|x + Rx\| \ge \|x\| - \|Rx\| \ge \|x\| - q\|x\| = (1 - q)\|x\|.$$
(Here the second inequality is a consequence of $-\|Rx\| \ge -\|R\| \, \|x\|$.) Hence there does not exist a vector $x \ne 0$ for which $Sx = 0$ holds (otherwise $\|Sx\| = 0$ and $(1 - q)\|x\| \le 0$ would follow). But this means that $S$ is regular. Substitute $x = S^{-1} z$ (that is, $Sx = z$) into the inequality to get
$$\|z\| \ge (1 - q)\|S^{-1} z\|,$$
and you arrive at the estimate of $\|S^{-1}\|$ using (2.4).

When $-1$ is an eigenvalue of the matrix $R$ and the corresponding eigenvector is $v \ne 0$, then
$$q = \|R\| \ge \rho(R) = |\lambda_{\max}(R)| \ge 1,$$
so the condition of the perturbation lemma is not fulfilled. And really, in this case 0 is an eigenvalue of $I + R$: $(I + R)v = v + (-v) = 0$, and $I + R$ is singular.
Next, with the help of the perturbation lemma we obtain a sufficient condition which ensures that, in case of regularity of the error-free matrix $A$, the matrix $B = A + \delta A$ of the perturbed system (2.23) is also regular.

Since $B = A(I + A^{-1} \delta A)$ holds, and so
$$\det(B) = \det(A) \det(I + A^{-1} \delta A),$$
the matrix $B$ is regular if and only if the matrix $I + A^{-1} \delta A$ is. Moreover, based on the perturbation lemma, the condition
$$\|\delta A\| < 1 / \|A^{-1}\|$$
(holding for some induced matrix norm) is sufficient to ensure the regularity of $I + A^{-1} \delta A$, because in this case from (2.5) it follows that $\|A^{-1} \delta A\| \le \|A^{-1}\| \, \|\delta A\| < 1$.
We can also estimate the norm of the matrix $B^{-1}$, starting from $B^{-1} = (I + A^{-1} \delta A)^{-1} A^{-1}$. Then
$$\|B^{-1}\| \le \|(I + A^{-1} \delta A)^{-1}\| \, \|A^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \, \|\delta A\|}. \qquad (2.24)$$
After this preparation we can estimate the difference of the solutions of systems (2.1) and (2.23). Since
$$Ax = b = By = Bx + B\delta x = b + \delta A \, x + B \delta x,$$
there holds $\delta x = -B^{-1} \delta A \, x$, and according to (2.24)
$$\|\delta x\| \le \|B^{-1}\| \, \|\delta A\| \, \|x\| \le \frac{\|A^{-1}\| \, \|\delta A\|}{1 - \|A^{-1}\| \, \|\delta A\|} \, \|x\|,$$
that is,
$$\frac{\|\delta x\|}{\|x\|} \le \frac{\tau}{1 - \tau}, \qquad \tau := \operatorname{cond}(A) \, \frac{\|\delta A\|}{\|A\|}. \qquad (2.25)$$
In this way the relative error of the solution can be expressed by the condition number and by the relative error of the data (in our case the latter is $\|\delta A\| / \|A\|$).
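Estimate (2.25) can be illustrated numerically; in the following sketch the perturbation dA and the right-hand side b are arbitrary choices of ours, not taken from the text:

    A  = [3 -4 1; -2 0 -1; 2 -5 3];        % the matrix from (2.7)
    dA = 1e-4 * [1 -2 0; 0 1 1; -1 0 2];   % a small perturbation, chosen arbitrarily
    b  = [1; 2; 3];                        % an arbitrary right-hand side
    x = A \ b;   y = (A + dA) \ b;         % unperturbed and perturbed solutions
    tau = cond(A, inf) * norm(dA, inf) / norm(A, inf);
    norm(y - x, inf) / norm(x, inf)        % the actual relative error ...
    tau / (1 - tau)                        % ... stays below the bound (2.25)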
Later we will see that the condition number plays an important role whenever we work with matrices: for example in eigenvalue problems and in the solution of systems of linear ordinary differential equations.
2.4 Exercises

1. Compute the maximum and Manhattan norms of the following matrix:
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & -5 & 6 \\ 7 & 8 & -9 \end{pmatrix}.$$
Find also the vectors $x$ satisfying $\|Ax\|_p = \|A\|_p \|x\|_p$, $p = 1, \infty$.

2. Compute the maximum norm and the condition number of the matrix
$$A = \begin{pmatrix} 2 & -7 \\ 6 & 13 \end{pmatrix}.$$
For which vectors $x, y$ do the equalities $\|Ax\|_\infty = \|A\|_\infty \|x\|_\infty$ and $\|A^{-1} y\|_\infty = \|A^{-1}\|_\infty \|y\|_\infty$ hold?
3. You know the condition for
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
to be regular. Give an expression for $\operatorname{cond}_1(A)$.

4. Compute the condition numbers of the following two matrices. (The norm can be chosen arbitrarily.)
$$A = \begin{pmatrix} 2 & -5 \\ 0 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} \varepsilon & 0 \\ 0 & \varepsilon^{-1} \end{pmatrix}, \quad 0 < \varepsilon < 1.$$

5. Let the matrix $A$ be
$$A = \begin{pmatrix} 5 & 3 & 6 \\ 6 & 2 & 3 \\ 4 & 2 & 5 \end{pmatrix}. \quad \text{Then} \quad A^{-1} = \frac{1}{10} \begin{pmatrix} -4 & 3 & 3 \\ 18 & -1 & -21 \\ -4 & -2 & 8 \end{pmatrix}.$$
Compute the maximum and Manhattan norms of $A$, and the corresponding condition numbers.

6. For a real $n \times n$ matrix $A = (a_{ij}) \in \mathbb{R}^{n \times n}$, derive the following inequalities:
$$\|A\|_p \le n \cdot \max_{i,j} |a_{ij}|, \qquad p = 1, \infty.$$

7. If $A$ is the regular matrix given by (2.7) and $\delta A$ is a perturbation of it, give an upper bound for $\|\delta A\|_1$ ensuring the regularity of $A + \delta A$!
8. The following matrix depends on a real parameter $t$:
$$A = A(t) := I_4 + t \cdot \begin{pmatrix} -3 & 2 & -7 & 2 \\ 0 & -5 & 3 & 1 \\ 4 & 0 & -9 & 6 \\ -2 & 8 & -4 & 5 \end{pmatrix},$$
where $I_4$ is the $4 \times 4$ unit matrix. For which values of $t$ can you be sure of the regularity of $A(t)$?

9. Show that if
$$0 < m \le \frac{\|Ax\|}{\|x\|} \le M, \qquad \forall x \ne 0, \; x \in \mathbb{R}^n,$$
holds for some vector norm, then the condition number of $A$ is well defined, and $\operatorname{cond}(A) \le M/m$.

10. Find out which norm corresponds to the MATLAB function norm when called as norm(x) for a vector x, and as norm(A,inf) for a matrix A.

11. Compare the MATLAB values obtained from norm(A)*norm(x) and norm(A*x) when x=[1 -1]' and A=[1 2; 3 4].
12. Take the $6 \times 6$ "built-in" Hilbert matrix as A and explain and compare the MATLAB numbers cond(A,1), rcond(A), cond(A,1)*rcond(A).

13. Consider a $1000 \times 1000$ random matrix and, using the commands tic and toc, examine how long it takes to compute its 2-, 1-, $\infty$-, and Frobenius norms, respectively.

14. Construct the following matrix $A \in \mathbb{R}^{100 \times 100}$ and vector $b \in \mathbb{R}^{100}$, and using the backslash operator in MATLAB solve the system $Ax = b$ (i.e., apply the command x = A\b). After that, perturb the vector $b$: e.g., let $b(100) = 1.00001$ instead of 1, and solve the system again. Compute the condition number of $A$ (also give an exact formula for $\operatorname{cond}_\infty(A)$).
$$a_{ij} = \begin{cases} 1, & \text{if } i = j, \\ -1, & \text{if } i < j, \\ 0, & \text{otherwise,} \end{cases} \qquad b = (-98, -97, \ldots, 0, 1)^T.$$
3 Solution of Systems of Linear Equations

You know from Algebra:

If $A \in \mathbb{R}^{n \times n}$ is a regular matrix, then the linear system $Ax = b$ can be uniquely solved (e.g., with the help of Cramer's rule).

The following conditions are equivalent to the non-regularity (that is, to the singularity) of $A$:
1. $\det(A) = 0$,
2. the system $Ax = b$ cannot be solved for all vectors $b$,
3. there exists a vector $b$ for which the linear system has infinitely many solutions,
4. there is a vector $x \ne 0$ such that $Ax$ is the zero vector: $Ax = 0$,
5. 0 is an eigenvalue of the matrix $A$: $Ax = 0 \cdot x$, where $x \ne 0$.
3.1 Gaussian Elimination

In this chapter we deal with solving the linear system $Ax = b$. Assume that $A \in \mathbb{R}^{n \times n}$ is a given regular matrix, $b \in \mathbb{R}^n$ is a given (column) vector, and $x \in \mathbb{R}^n$ is the unknown (column) vector to be found.

In practice Cramer's rule is not used even in the case $n = 2$, because it costs more operations than Gaussian elimination, which we discuss below. First, we write the system $Ax = b$ in detailed form:
$$a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1$$
$$a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \qquad (3.1)$$
$$\vdots$$
$$a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n$$
To start, we aim to eliminate $x_1$ and assume that $a_{11} \ne 0$. Then we subtract the first equation, multiplied by $a_{i1}/a_{11}$, from the $i$th equation, $i > 1$. As a result, the coefficient of $x_1$ in the $i$th row becomes zero, that is, the unknown $x_1$ is indeed eliminated there, while the first row remains available to be solved later for $x_1$ in terms of the other unknowns.
Denote by $a_{ij} =: a_{ij}^{(1)}$ the elements of $A$ and by $a_{ij}^{(2)}$, $i > 1$, the new elements obtained after the subtraction. Then the previous operations can be described by the following expressions:
$$a_{ij}^{(2)} := a_{ij}^{(1)} - \ell_{i1} a_{1j}^{(1)}, \qquad i, j = 2, \ldots, n, \qquad (3.2)$$
$$\text{where} \quad \ell_{i1} := a_{i1}^{(1)} / a_{11}^{(1)}, \qquad i = 2, \ldots, n. \qquad (3.3)$$
Here $i = 1$ is excluded because our purpose was precisely to obtain $a_{i1}^{(2)} = 0$ for $i > 1$. Further, it makes no sense to calculate these zeros in a loop, so that $j > 1$.

Then, if
$$b_1^{(1)} := b_1 \quad \text{and} \quad b_i^{(2)} := b_i^{(1)} - \ell_{i1} b_1^{(1)}, \qquad i = 2, \ldots, n, \qquad (3.4)$$
after the first step of this "Gaussian" elimination we have the following system:
$$a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + \cdots + a_{1n}^{(1)} x_n = b_1^{(1)}$$
$$a_{22}^{(2)} x_2 + \cdots + a_{2n}^{(2)} x_n = b_2^{(2)} \qquad (3.5)$$
$$\vdots$$
$$a_{n2}^{(2)} x_2 + \cdots + a_{nn}^{(2)} x_n = b_n^{(2)}$$
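The first elimination step translates almost directly into MATLAB; the sketch below (our notation and test data) implements (3.2)-(3.4) for a single step and assumes $a_{11} \ne 0$, i.e., it does no pivoting:

    A = [3 -4 1; -2 0 -1; 2 -5 3];  b = [1; 2; 3];   % a small test system (our choice)
    n = size(A, 1);
    for i = 2:n
        l = A(i,1) / A(1,1);                  % the multiplier l_{i1} of (3.3)
        A(i,2:n) = A(i,2:n) - l * A(1,2:n);   % update (3.2)
        A(i,1) = 0;                           % the eliminated coefficient, set explicitly
        b(i) = b(i) - l * b(1);               % right-hand side update (3.4)
    end
    disp(A), disp(b)                          % the system now has the form (3.5)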
Now, if $a_{22}^{(2)} \ne 0$ holds, then you can perform similar steps as before for the $(n-1) \times (n-1)$ system standing under the first row: with the help of the second row, you can eliminate the unknown $x_2$ from the

Outros materiais