Compact Textbooks in Mathematics

Gisbert Stoyan · Agnes Baran
Elementary Numerical Mathematics for Programmers and Engineers

Compact Textbooks in Mathematics
This textbook series presents concise introductions to current topics in mathematics and mainly addresses advanced undergraduates and master students. The concept is to offer small books covering subject matter equivalent to 2- or 3-hour lectures or seminars which are also suitable for self-study. The books provide students and teachers with new perspectives and novel approaches. They feature examples and exercises to illustrate key concepts and applications of the theoretical contents. The series also includes textbooks specifically speaking to the needs of students from other disciplines such as physics, computer science, engineering, life sciences, and finance.

• compact: small books presenting the relevant knowledge
• learning made easy: examples and exercises illustrate the application of the contents
• useful for lecturers: each title can serve as basis and guideline for a 2-3 hours course/lecture/seminar

More information about this series at http://www.springer.com/series/11225

Gisbert Stoyan, Faculty of Informatics, ELTE University, Budapest, Hungary
Agnes Baran, Faculty of Informatics, University of Debrecen, Debrecen, Hungary

Revised translation from the Hungarian language edition: Numerikus matematika mérnököknek és programozóknak by Gisbert Stoyan, © Typotex kft, 2007. All Rights Reserved.

MATLAB® is a registered trademark of The MathWorks, Inc.
ISSN 2296-4568  ISSN 2296-455X (electronic)
Compact Textbooks in Mathematics
ISBN 978-3-319-44659-2  ISBN 978-3-319-44660-8 (eBook)
DOI 10.1007/978-3-319-44660-8
Library of Congress Control Number: 2016958653
Mathematics Subject Classification (2010): 65D30, 65F15, 65F05

© Springer International Publishing AG 2007, 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper.

This book is published under the trade name Birkhäuser, www.birkhauser-science.com. The registered company is Springer International Publishing AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Foreword to the English Edition

This elementary introduction to numerical mathematics arose from lectures and exercises which were held by the two authors jointly at Debrecen University for students of informatics. It contains material for one semester and was first issued quite successfully in Hungary.
The English translation is the joint work of the two authors, as are the MATLAB exercises and Chap. 10, which have been added for this English edition. The first author has taught similar topics to students of mathematics, informatics, physics and meteorology at ELTE University; he has also written (in Hungarian) a three-volume textbook on numerical methods, embracing more material, from rounding errors to methods for partial differential equations; the last two volumes are now mostly used in PhD studies. The present text, in its Hungarian form, was much more successful: due to its shortness and lower requirements on mathematical knowledge, more than 900 copies were sold in a few years.

We thank the two unknown referees for their helpful remarks.

Budapest, Hungary   Gisbert Stoyan
Debrecen, Hungary   Agnes Baran
February 2015

Contents

1 Floating Point Arithmetic
  1.1 Integers
  1.2 Floating Point Numbers
  1.3 Floating Point Arithmetic, Rounding
  1.4 Accumulation of Errors
  1.5 Conclusions
  1.6 Exercises
2 Norms, Condition Numbers
  2.1 Norms and Their Elementary Properties
  2.2 The Induced Matrix Norm
    2.2.1 Definition and Properties
    2.2.2 Computation of the Induced Matrix Norm for the Vector p-Norms, p = 1, ∞
    2.2.3 Computation of the Induced Matrix Norm (p = 2)
  2.3 Error Estimations
    2.3.1 The Right-Hand Side of the Linear System is Perturbed
    2.3.2 Condition Number
    2.3.3 The Matrix of the Linear System is Perturbed
  2.4 Exercises
3 Solution of Systems of Linear Equations
  3.1 Gaussian Elimination
  3.2 When Can Gaussian Elimination be Performed?
  3.3 The LU Factorization
  3.4 Algorithms, Cost of Solving
  3.5 Influence of Rounding Errors
  3.6 LU Factorization for General Matrices
    3.6.1 Algorithm of the LDU Factorization, Test Examples
  3.7 Cholesky Factorization
    3.7.1 Algorithm of the LDL^T Factorization, Test Examples
  3.8 Band Matrices
    3.8.1 Tridiagonal Systems of Equations
    3.8.2 The Tridiagonal Algorithm, Test Examples
  3.9 Exercises
4 The Least Squares Problem
  4.1 Linear Regression
    4.1.1 Algebraic Description
    4.1.2 The Method of Least Squares
  4.2 Normal Equations
  4.3 Solution Algorithm, Test Examples
  4.4 Exercises
5 Eigenvalue Problems
  5.1 Fundamental Properties
    5.1.1 Normal Matrices
    5.1.2 The Characteristic Polynomial
    5.1.3 Localization of the Eigenvalues
  5.2 Power Iteration
    5.2.1 Conditions of Convergence
    5.2.2 The Rayleigh Quotient
    5.2.3 Algorithm of the Power Iteration, Test Examples
    5.2.4 The Shift
  5.3 The Inverse Iteration
    5.3.1 Conditions of Convergence
    5.3.2 Algorithm of the Inverse Iteration, Test Examples
  5.4 The QR Algorithm
  5.5 Exercises
6 Interpolation
  6.1 Interpolation Problems
  6.2 Lagrangian Interpolation
    6.2.1 Lagrange Interpolation Problem
    6.2.2 Newton's Divided Differences Method
    6.2.3 The Difference Scheme
    6.2.4 Algorithm of Lagrangian Interpolation, Test Examples
    6.2.5 Error Estimations
  6.3 Hermite Interpolation
  6.4 Piecewise Polynomial Interpolation
  6.5 Exercises
7 Nonlinear Equations and Systems
  7.1 Bisection Method, Fixed Point Iterations
  7.2 Newton's Method
    7.2.1 Damped Newton Method
    7.2.2 Secant Method
  7.3 Solution of Systems of Equations
    7.3.1 Newton's Method
    7.3.2 Algorithm of Damped Newton Method, Test Examples
    7.3.3 Approximation of the Jacobian Matrix
    7.3.4 Broyden's Method
  7.4 Gauss–Newton Method
    7.4.1 Description
    7.4.2 Algorithm of the Gauss–Newton Method, Test Examples
  7.5 Exercises
8 Numerical Integration
  8.1 Elementary Quadrature Formulae
  8.2 Interpolational Quadrature Formulae
  8.3 Composite Quadrature Rules
    8.3.1 Construction of Composite Formulae
    8.3.2 Convergence of Composite Formulae
  8.4 Practical Points of View
  8.5 Calculation of Multiple Integrals
    8.5.1 Reduction to Integration of Functions of One Variable
    8.5.2 Approximation of the Domain
    8.5.3 Algorithm of the Two-Dimensional Simpson Rule, Tests
  8.6 Exercises
9 Numerical Solution of Ordinary Differential Equations
  9.1 Motivation
  9.2 Initial Value Problems
  9.3 Euler's Method
    9.3.1 Algorithm of Euler's Method, Test Examples
  9.4 Error Analysis of Euler's Method
  9.5 The Modified Euler Method, Runge–Kutta Methods
  9.6 The Implicit Euler Method
    9.6.1 The Implicit Euler Method for Linear Systems
    9.6.2 Nonlinear Systems
    9.6.3 Algorithm of the Implicit Euler Method, Test Examples
  9.7 Exercises
10 Practical Error Estimation
Bibliography
Index

1 Floating Point Arithmetic

You know from secondary school: if the base of a power is $a > 0$, then the following relations are valid:
$$a^b a^c = a^{b+c}, \qquad \frac{a^b}{a^c} = a^{b-c}.$$
If $q \neq 1$, then
$$1 + q + q^2 + \cdots + q^n = \sum_{i=0}^{n} q^i = \frac{1 - q^{n+1}}{1 - q}$$
holds; moreover, if $q = 1$, then $\sum_{i=0}^{n} q^i = n + 1$.

1.1 Integers

Integers are represented on a computer in the form of signed binary numbers. Often 2-, 4- and 8-byte integers are available, where a byte consists of eight binary digits. In many computers 4 bytes are the smallest available (addressable) unit of the memory. It may turn out that we can work with one- and 16-byte integers, too.
The arithmetical operations performed with integers may be, and often are, faster than the operations performed with floating point numbers (which will be discussed in the following section), but this depends on their word length, the processor and the translator. Such operations can be considered as error-free; hence, using integers, a given algorithm can be faster on the computer. However, all steps of a calculation with integers have to be carefully planned, because in this case we actually work in "residue classes". As an example consider the addition of 1-byte integers:

  11010101 + 10101010 = 1 01111111
  213 + 170 = 383 ≡ 127 (mod 256)

The processing unit cuts off, without any warning, the leading 1 which cannot be placed in the frame (it would produce an overflow), and in this way the sum is replaced by its residue with respect to $2^8$. Thus, we move to the residue class of $2^8$ and, e.g., in the case of two-byte numbers to the residue class of $2^{16}$, etc.

1.2 Floating Point Numbers

This type of numbers is fundamental in the use of computers for numerical computations. The form of nonzero floating point numbers is
$$\pm\, a^k \left( \frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t} \right), \tag{1.1}$$
where $a > 1$ is the base of the representation of numbers, "+" or "$-$" is the sign, $t > 1$ is the number of digits, and $k$ is the exponent. The digit $m_1$ is normalized, that is, it satisfies the inequality $1 \le m_1 \le a - 1$. This ensures the uniqueness of the representation and the full exploitation of the available digits. For the remaining digits, $0 \le m_i \le a - 1$, $2 \le i \le t$, holds. The number zero is not normalized: in this case $k = 0$, $m_1 = m_2 = \cdots = m_t = 0$, and its sign is usually "+". It often happens that the base $a$ of the representation of floating point numbers is not equal to 2.
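The residue-class behaviour of fixed-width integer addition is easy to reproduce. A minimal sketch in Python (which has unbounded integers, so the 1-byte frame is imitated by reducing modulo $2^8$; the helper name `add_uint8` is ours, not from the text):

```python
# Simulate addition of 1-byte (8-bit) unsigned integers:
# the hardware keeps only the low 8 bits of the true sum.
def add_uint8(x: int, y: int) -> int:
    """Add two 8-bit integers the way the CPU does: modulo 2**8."""
    return (x + y) % 256

true_sum = 213 + 170            # 383: does not fit into one byte
machine_sum = add_uint8(213, 170)

print(true_sum, machine_sum)    # 383 127
```

The same reduction happens for any word length: replace 256 by $2^{16}$, $2^{32}$, etc.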
Instead of 2 it can be, e.g., 10 or 16. Some computers have a decimal CPU; usually, however, the corresponding arithmetic is implemented by the programming language, and when, e.g., the main purpose is data processing, it can be more economical to use the decimal number system. We may view the computations as being done in the decimal number system, and can count on about $t = 8$ digits using single precision and about $t = 16$ digits using double precision.

We can imagine the storage of floating point numbers in the following form (reality may differ from this, but here we do not describe those details of the technical implementation which do not affect our numerical calculations):
$$[\pm,\, k,\, m_1, \ldots, m_t],$$
where the vector $(m_1, \ldots, m_t) =: m$ is the mantissa (or significand). The exponent $k$ is also called the characteristic of the number. Depending on the machine and on the precision (single, double or quadruple precision), four, eight or 16 bytes are available to store $m$. Simultaneously, the range of $k$ increases but, depending on the given precision,
$$k^{(-)} \le k \le k^{(+)},$$
where $k^{(-)} < 0$, $k^{(+)} > 0$ and $|k^{(-)}| \approx |k^{(+)}|$ (see Exercises 1 and 2). Then the largest representable number is
$$M_\infty := a^{k^{(+)}} \sum_{i=1}^{t} \frac{a-1}{a^i} = a^{k^{(+)}} \left( \frac{a-1}{a} + \frac{a-1}{a^2} + \cdots + \frac{a-1}{a^t} \right) = a^{k^{(+)}} \left( 1 - a^{-t} \right),$$
while the smallest number is $-M_\infty$. The floating point numbers form a discrete subset of the rational numbers from $[-M_\infty, M_\infty]$, and this subset is symmetric about zero. We denote the positive floating point number nearest to zero by $\varepsilon_0$:
$$\varepsilon_0 := a^{k^{(-)}} \left( \frac{1}{a} + 0 + \cdots + 0 \right) = a^{k^{(-)} - 1}.$$
Thus, besides zero there is no other (normalized) floating point number in the interval $(-\varepsilon_0, \varepsilon_0)$. (Here we disregard the fact that on a computer denormalized nonzero numbers can also be used, but, to ensure the uniqueness of the representation, only in the case $k = k^{(-)}$. Practically, this means that then the formula for $\varepsilon_0$ changes to $a^{k^{(-)} - t}$.)
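For a concrete feel, these two quantities can be evaluated for the small decimal system used in the addition examples later in this chapter ($a = 10$, $t = 3$, $k^{(-)} = -3$, $k^{(+)} = 3$); the snippet below simply evaluates the closed-form expressions just derived:

```python
# Largest representable number M_inf and smallest positive normalized
# number eps0 for a toy floating point system (parameters from the text).
a, t, k_minus, k_plus = 10, 3, -3, 3

M_inf = a**k_plus * (1 - a**(-t))   # a^{k+} * (1 - a^{-t})
eps0  = a**(k_minus - 1)            # a^{k- - 1}

print(M_inf)    # 999.0
print(eps0)     # 0.0001
```

So in this toy system every representable number lies in $[-999, 999]$, and $(-10^{-4}, 10^{-4})$ contains no normalized number besides zero.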
The interval $(-\varepsilon_0, \varepsilon_0)$ appears as a huge gap if we represent the floating point numbers on the real axis (see Fig. 1.1 and Exercise 3), since the positive floating point number nearest to $\varepsilon_0$ is
$$a^{k^{(-)}} \left( \frac{1}{a} + 0 + \cdots + 0 + \frac{1}{a^t} \right) = \varepsilon_0 + a^{k^{(-)} - t} = \varepsilon_0 (1 + a^{1-t}),$$
and $\varepsilon_0 \cdot a^{1-t} \ll \varepsilon_0$.

[Fig. 1.1: Floating point numbers on the real axis; tick marks at 0, ε₀, 0.5, 1, 2]

The number 1 always belongs to the floating point numbers:
$$1 = [+, 1, 1, 0, \ldots, 0].$$
After the number 1 the next floating point number,
$$[+, 1, 1, 0, \ldots, 0, 1] = 1 + \varepsilon_1,$$
follows, where $\varepsilon_1 := a^{1-t}$. This number is often called the relative precision of the machine, or the machine epsilon (see Exercise 4). The importance of the numbers $\varepsilon_0$ and $\varepsilon_1$ is that in the case of the input and of the four basic operations they give an absolute and a relative error bound, respectively (see the next section).

If we take any floating point number satisfying
$$0 < x = [+, k, m] = a^k \left( \frac{m_1}{a} + \frac{m_2}{a^2} + \cdots + \frac{m_t}{a^t} \right) < M_\infty,$$
then the nearest floating point number to $x$ larger than $x$ is
$$\bar{x} := x + a^k \left( 0 + 0 + \cdots + 0 + \frac{1}{a^t} \right) = x + a^{k-t}.$$
Thus, the spacing between the two numbers is $\delta x := \bar{x} - x = a^{k-t}$. The smallest possible value of $x$ for the present characteristic $k$ is $a^k \cdot \frac{1}{a} = a^{k-1}$; hence,
$$\bar{x} - x = \delta x = a^{k-t} = a^{k-1+1-t} = a^{k-1} a^{1-t} = a^{k-1} \varepsilon_1 \le x \varepsilon_1. \tag{1.2}$$
If $x < 0$ and $\bar{x}$ is the left-hand floating point neighbour of $x$, then, similarly,
$$|\bar{x} - x| \le |x| \varepsilon_1. \tag{1.3}$$

1.3 Floating Point Arithmetic, Rounding

First consider the input. Imagine a real number $x$ to be read in, $|x| \le M_\infty$. Then, most often, the floating point number $\tilde{x}$ assigned to $x$ by the machine is
$$\tilde{x} := \begin{cases} 0, & \text{if } |x| < \varepsilon_0, \\ \text{the floating point number nearest to } x, & \text{if } \varepsilon_0 \le |x| \le M_\infty. \end{cases}$$
Thus, in the first case the number is rounded towards zero (the gap around 0 appears as a "black hole"), while in the second case the number is rounded properly.
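As an aside, $\varepsilon_1$ and the spacing $\delta x$ can be inspected directly for IEEE double precision (base $a = 2$, $t = 53$ mantissa bits) with Python's standard library; `math.ulp(x)` returns exactly the gap from $x$ to its floating point neighbour:

```python
import math
import sys

# Machine epsilon eps1 = a^(1-t) for a = 2, t = 53 (IEEE double precision)
eps1 = 2.0 ** (1 - 53)
print(eps1 == sys.float_info.epsilon)   # True

# Spacing delta x = a^(k-t): math.ulp(x) is the gap to the next float,
# and it doubles every time the exponent k grows by one.
print(math.ulp(1.0) == eps1)            # True
print(math.ulp(2.0) == 2 * eps1)        # True

# 1 + eps1 is the next floating point number after 1
print(1.0 + eps1 > 1.0)                 # True
```

This confirms relation (1.2) empirically: the absolute spacing grows with $|x|$, while the relative spacing stays bounded by $\varepsilon_1$.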
Then (see relations (1.2) and (1.3))
$$|\tilde{x} - x| \le \begin{cases} \varepsilon_0, & \text{if } |x| < \varepsilon_0, \\ \frac{1}{2}\varepsilon_1 |x|, & \text{if } |x| \ge \varepsilon_0. \end{cases} \tag{1.4}$$
If the machine always rounds the numbers towards zero (since truncation of the non-significant digits is simpler than proper rounding), then in (1.4) $\varepsilon_1$ replaces $\frac{1}{2}\varepsilon_1$, and so $|\tilde{x} - x| < \varepsilon_1 |x|$ is always true. Disregarding the gap around zero, we obtain a simplified version of (1.4), i.e.
$$|\tilde{x} - x| \le \tfrac{1}{2}\varepsilon_1 |x| \tag{1.5}$$
in the case of rounding, or
$$|\tilde{x} - x| \le \varepsilon_1 |x| \tag{1.6}$$
in the case of truncation. The numbers $\frac{1}{2}\varepsilon_1$ and $\varepsilon_1$ arising here can be described as follows: they are the smallest positive floating point numbers $\varepsilon$ for which the result of the addition $1 + \varepsilon$ is greater than 1, after rounding and truncation, respectively (see Exercise 4).

Next, consider the floating point operations. As an example, first examine the addition. Take, e.g., $a = 10$, $t = 3$, $k^{(-)} = -3$, $k^{(+)} = 3$ (then $\varepsilon_1 = 10^{-2}$) and $x = 3.45$, $y = 0.567$ as the two numbers to be added. As the first step of the addition, the characteristics of the two numbers ($k_1$ and $k_2$) have to be made equal, and the mantissa of the smaller number has to be shifted right by $k_1 - k_2$ digits:

  [+, 1, 345] + [+, 0, 567] = [+, 1, 345] + [+, 1, 0567]
  10 · 0.345 (= x) + 1 · 0.567 (= y) = 10 · 0.345 + 10 · 0.0567 = 4.017 = x + y =: z

Here immediately the question arises whether the CPU uses only $t$ digits to place the mantissa. In the case of addition, at least one more digit (a so-called guard digit) is necessary to have an acceptable error. (In the case of multiplication it is best if there are $2t$ digits available to place the product of the mantissas, because then the product can be placed without loss.)
If we have this guard digit, then the addition from the above example can be continued in the following way:

  [+, 1, 345] + [+, 0, 567] = [+, 1, 345] + [+, 1, 0567] = [+, 1, 4017]

Up to this point the result is error-free, but usually one has to put it back into the storage, where in our example there are only three digits to place the mantissa. Then the last digit has to be truncated or rounded. If the latter was chosen when the CPU (or the programming language which implements the arithmetic) was planned, then the result of the computation, i.e. $\tilde{z} = \mathrm{fl}(x + y)$, is

  [+, 1, 402]

Hence, the result is loaded with error, and for the error the following is valid, in accordance with (1.6), writing $z, \tilde{z}$ instead of $x, \tilde{x}$:
$$|z - \tilde{z}| = 0.003 = \varepsilon_1 \cdot 0.3 = \tfrac{1}{2}\varepsilon_1 \cdot 0.6 < \tfrac{1}{2}\varepsilon_1 |z| \;(= 0.020085). \tag{1.7}$$
Let us examine one more example of addition (we take $a = 10$, $t = 3$, $k^{(-)} = -3$, $k^{(+)} = 3$ as before, and further assume that in the CPU one guard digit is available): $x = 0.987$ and $y = 0.567$. Now there is no need to shift the numbers before the addition:

  [+, 0, 987] + [+, 0, 567] = [+, 0, 1554] ≈ [+, 1, 155]
  0.987 (= x) + 0.567 (= y) = 1.554 (= z) ≈ 10 · 0.155 =: z̃

Here, in the last step, the result had to be normalized and rounded, and after these steps the result (once more in accordance with (1.6)) can be put into the storage.

Now examine the rounding errors occurring during the basic operations in general. According to the examples above, we assume that the machine first performs exactly any of the four basic operations $+, -, \cdot, /$ (denoting any of them by $\circ$), and after that it assigns to the result a floating point number using truncation or rounding. Then, in the case of rounding, according to (1.4),
$$|\mathrm{fl}(x \circ y) - x \circ y| \le \begin{cases} \varepsilon_0, & \text{if } |x \circ y| < \varepsilon_0, \\ \frac{1}{2}\varepsilon_1 |x \circ y|, & \text{if } |x \circ y| \ge \varepsilon_0, \end{cases} \tag{1.8}$$
or, in a simplified version, omitting the case corresponding to $\varepsilon_0$,
$$|\mathrm{fl}(x \circ y) - x \circ y| \le \varepsilon_1 |x \circ y| \cdot \begin{cases} 1, & \text{in the case of truncation,} \\ \frac{1}{2}, & \text{in the case of rounding.} \end{cases} \tag{1.9}$$
This inequality can be described as
$$\mathrm{fl}(x \circ y) = (x \circ y)(1 + \varepsilon_\circ), \qquad |\varepsilon_\circ| \le \varepsilon_1 \cdot \begin{cases} 1, & \text{truncation,} \\ \frac{1}{2}, & \text{rounding.} \end{cases} \tag{1.10}$$
Indeed, relation (1.10) can be obtained from (1.9) in the following way (considering the case of truncation): inequality (1.9) means that
$$-\varepsilon_1 |x \circ y| \le \mathrm{fl}(x \circ y) - x \circ y \le \varepsilon_1 |x \circ y|.$$
Therefore, the expression $\mathrm{fl}(x \circ y) - x \circ y$ can take any value between $-\varepsilon_1 |x \circ y|$ and $\varepsilon_1 |x \circ y|$. This can be written as
$$\mathrm{fl}(x \circ y) - x \circ y =: \varepsilon_1 |x \circ y|\, s, \quad \text{where } -1 \le s \le 1.$$
Thus,
$$\mathrm{fl}(x \circ y) = x \circ y + \varepsilon_1 |x \circ y|\, s = (x \circ y)\bigl(1 + \mathrm{sign}(x \circ y)\,\varepsilon_1 s\bigr),$$
and here $|\mathrm{sign}(x \circ y)\,\varepsilon_1 s| \le \varepsilon_1$. Now we have only to substitute $\varepsilon_\circ$ for $\mathrm{sign}(x \circ y)\,\varepsilon_1 s$.

Our first example concerning the addition illustrates the estimation (1.10) for $\circ = +$:
$$\varepsilon_+ = \frac{\mathrm{fl}(x + y) - (x + y)}{x + y} = \frac{0.003}{4.017} \approx 0.000747 < \tfrac{1}{2}\varepsilon_1 = 0.005.$$
The estimation (1.10) is not valid in the gap around zero; moreover, it is also not valid when the result of the operation is larger in absolute value than $M_\infty$. This would mean overflow, which is a serious error for floating point calculation, and, depending on the programming language and on the operating system, after one or more errors of this kind the machine terminates its work (more precisely, it goes on with the next program). If the result of an operation is not equal to zero but lies in the interval $(-\varepsilon_0, \varepsilon_0)$, then according to the above reasoning underflow occurs: the machine considers the result to be zero, usually without an error message. In this case important information may be annihilated, the last sign of life of some phenomenon.

Using floating point subtraction, valuable digits often disappear. This means the following: as earlier, consider the case $a = 10$, $t = 3$ and, e.g., $x = 1.234$, $y = 1.233$. Then there is no need to shift the digits, and in the second step of the subtraction we get $x - y = 0.001$ without error. After this the number has to be normalized; the result is $10^{-2} \cdot (\frac{1}{10} + 0 + \cdots + 0)$, without error.
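The 3-digit decimal examples above can be replayed with Python's `decimal` module, which implements exactly this kind of base-10 arithmetic with a chosen precision (a sketch; here `prec = 3` plays the role of $t = 3$, and round-half-even agrees with the rounding used in the example):

```python
from decimal import Decimal, getcontext, ROUND_HALF_EVEN

ctx = getcontext()
ctx.prec = 3                      # t = 3 decimal digits, as in the text
ctx.rounding = ROUND_HALF_EVEN

x, y = Decimal("3.45"), Decimal("0.567")
z_tilde = x + y                   # fl(x + y): exact sum 4.017 rounded to 3 digits
print(z_tilde)                    # 4.02

# Relative rounding error, bounded by eps1/2 = 0.005 as in (1.10)
err = abs((Decimal("4.017") - z_tilde) / Decimal("4.017"))
print(err < Decimal("0.005"))     # True
```

Experimenting with other inputs in such a low-precision context is a cheap way to observe rounding, normalization and cancellation without reasoning about binary representations.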
However, there is a problem: the numbers $x, y$ had four digits, which resulted, e.g., from an earlier calculation, and after the subtraction only one digit remains. From the point of view of the floating point calculation, we cannot take any responsibility for the accuracy of the zeros after the leading digit 1.

The other dangerous phenomenon of floating point arithmetic is the possible cancellation by subtraction. This means the following. Assume the machine does not correspond to the simple model above, in which the CPU first produces the exact result. In this case there is no guard digit, and before the subtraction of two floating point numbers $x, y$ with different exponents ($k_1 \neq k_2$) the digits are not only shifted by $k_1 - k_2$ places (compare with the above example for addition), but the digits moving out of the frame on the right-hand side are immediately truncated. For example,
$$x = [+, k, 1, 0, \ldots, 0] = a^{k-1}, \qquad y = [+, k-1, a-1, \ldots, a-1] = a^k \left( \frac{1}{a} - \frac{1}{a^{t+1}} \right).$$
After shifting and truncating the last digit, which has the value $a^k \frac{a-1}{a^{t+1}}$, the machine continues working with the number
$$\tilde{y} = a^k \left( \frac{1}{a} - \frac{1}{a^t} \right)$$
instead of $y$, and $x - \tilde{y} = a^{k-t} =: \tilde{z}$ appears as the result of the machine subtraction. However, the exact result is $z := a^{k-t-1}$ (see Exercise 6).

This example calls for the distinction of two errors. If $\tilde{z}$ is the result of a floating point calculation and $z$ the exact result (in our example $z = a^{k-t-1}$ and $\tilde{z} = a^{k-t}$), then $|\tilde{z} - z|$ is called the absolute error of the calculation. If $z \neq 0$, then the value $|(\tilde{z} - z)/z|$ is the relative error. Applying this to our example, the absolute error of the subtraction performed under the above circumstances is
$$|\tilde{z} - z| = a^{k-t} - a^{k-t-1} = a^{k-t-1}(a - 1),$$
while the relative error is
$$\frac{a^{k-t-1}(a-1)}{a^{k-t-1}} = a - 1 = \frac{a-1}{a^{1-t}}\,\varepsilon_1 \approx a^t \varepsilon_1 \gg \varepsilon_1,$$
which is considerably larger than $\varepsilon_1 = a^{1-t}$, nearly $a^t$ times larger.
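Cancellation is just as visible in IEEE double precision. The following sketch uses a classic illustration that is not taken from this text: for tiny $x$ the direct evaluation of $1 - \cos x$ loses every significant digit, while the algebraically equivalent form $2\sin^2(x/2)$, which avoids subtracting nearly equal numbers, keeps full precision:

```python
import math

x = 1e-8

# Direct formula: cos(x) rounds to exactly 1.0, so all digits cancel.
naive = 1.0 - math.cos(x)

# Equivalent formula without subtraction of nearly equal numbers:
# 1 - cos(x) = 2 sin^2(x/2)
stable = 2.0 * math.sin(x / 2.0) ** 2

print(naive)    # 0.0: total loss of significant digits
print(stable)   # about 5e-17, correct: 1 - cos(x) ~ x^2/2
```

The cure here is the same as in the worked example above: rearrange the computation so that the dangerous subtraction never takes place.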
This relative error is smallest when $a = 2$, but even then it is not acceptable, since if the relative error can be equal to 1, then the result can be confused with its error. This is the possible fatal loss of significant digits on a (badly planned) computer. To avoid this phenomenon of machine subtraction we need at least one guard digit in the CPU, on the $(t+1)$th place, which, however, is missing in several machines. By subtraction, significant digits can also disappear when a guard digit is available and the exponents of the two numbers are equal; in this case not in a fatal way, but according to (1.8).

In practice, in connection with the evaluation of the accuracy of computational results, one works not only with absolute and relative errors. It is often practical to apply a ("mixed" absolute-relative) test like
$$|z - \tilde{z}| \le \varepsilon \cdot \left( 1 + \frac{1}{2}\bigl(|z| + |\tilde{z}|\bigr) \right), \tag{1.11}$$
where $\varepsilon$ is the given accuracy. If the inequality is not true, the result $\tilde{z}$ is rejected as too inaccurate; if it is true, then we accept the result. Using the above test, if $|z| + |\tilde{z}|$ is small, absolute accuracy is measured by $\varepsilon$, while if $|z| + |\tilde{z}|$ is large, relative accuracy is measured. The analogy with (1.4) and with (1.8) is striking. We advise you to apply such a test during programming.

The innocent-looking operation of rounding can lead to large errors even when a guard digit is present, as the examples in the following section show. However, rounding is unavoidable after almost all floating point arithmetical operations (but see Exercise 6); hence a possible solution is to replace the simple rounding described by (1.4) and (1.10) by one-sided roundings. After each arithmetical operation, when the result is not a floating point number, one produces two results, by rounding up and rounding down. As a consequence the complete calculation changes: instead of numbers one always has to work with intervals.
Hence, either the machines have to be reconstructed so that the roundings are always performed together with the actual arithmetical operation (i.e. instead of each basic operation one has two operations), or one has to build interval arithmetic into the programming language. The latter was done, e.g., in the case of Pascal-SC.

1.4 Accumulation of Errors

How do errors accumulate when one performs many operations? Suppose we have to add $n+1$ floating point numbers,

$$S_n := x_0 + x_1 + \cdots + x_n = \sum_{i=0}^{n} x_i,$$

that is, we have to perform not one, but $n$ additions. For each addition, the CPU first adds the next number $x_k$ to the previous partial sum $S_{k-1}$ without error, $S_k := S_{k-1} + x_k$, $k = 1, 2, \ldots$, then it applies, e.g., truncation. Here we disregard the fact that a partial or the final result may lie in the interval $(-\varepsilon_0, \varepsilon_0)$, or may overflow. Then, based on (1.10), the first partial sum obtained on the computer is

$$\tilde S_1 := x_0 \oplus x_1 = (x_0 + x_1)(1 + \varepsilon_{0,1}) = S_1 + \varepsilon_{0,1} S_1, \qquad \text{where } |\varepsilon_{0,1}| \le \varepsilon_1,$$

and $\oplus$ denotes the machine addition. Similarly,

$$\tilde S_2 := \tilde S_1 \oplus x_2 = (\tilde S_1 + x_2)(1 + \varepsilon_{1,2}) = (S_1 + \varepsilon_{0,1} S_1 + x_2)(1 + \varepsilon_{1,2}) = (S_2 + \varepsilon_{0,1} S_1)(1 + \varepsilon_{1,2}) = S_2 + \varepsilon_{0,1} S_1 + \varepsilon_{1,2} S_2 + \varepsilon_{1,2}\varepsilon_{0,1} S_1,$$

where $\varepsilon_{1,2}$ is the corresponding rounding error, $|\varepsilon_{1,2}| \le \varepsilon_1$. In order to see the essence of the problem better, and using that $\varepsilon_{1,2}\varepsilon_{0,1} S_1$ is of second order in $\varepsilon_1$,

$$|\varepsilon_{1,2}\varepsilon_{0,1} S_1| \le \varepsilon_1^2 |S_1|,$$

we neglect this term (and use the notation $\doteq$ for equalities valid up to second order terms):

$$\tilde S_2 \doteq S_2 + \varepsilon_{0,1} S_1 + \varepsilon_{1,2} S_2.$$

Continuing the above reasoning you see that

$$\tilde S_n \doteq S_n + \sum_{k=1}^{n} \varepsilon_{k-1,k} S_k. \qquad (1.12)$$

Let us express the partial sums $S_k$ with the help of the $x_i$:

$$\sum_{k=1}^{n} \varepsilon_{k-1,k} S_k = x_0 \sum_{k=1}^{n} \varepsilon_{k-1,k} + \sum_{i=1}^{n} x_i \sum_{k=i}^{n} \varepsilon_{k-1,k}.$$

Observe that the sum multiplying $x_0$ and $x_1$ can be estimated as

$$\left| \sum_{k=1}^{n} \varepsilon_{k-1,k} \right| \le n \varepsilon_1,$$

since $|\varepsilon_{k-1,k}| \le \varepsilon_1$ holds for all $k$ and we have $n$ terms.
The sum multiplying $x_2$ can be estimated as

$$\left| \sum_{k=2}^{n} \varepsilon_{k-1,k} \right| \le (n-1)\varepsilon_1,$$

and, generally, the sum multiplying $x_i$ ($i \ge 1$) satisfies

$$\left| \sum_{k=i}^{n} \varepsilon_{k-1,k} \right| \le (n-i+1)\varepsilon_1.$$

Returning to (1.12) we can write (neglecting all second order terms) that

$$|\tilde S_n - S_n| \le \varepsilon_1 \left( n|x_0| + \sum_{i=1}^{n} (n-i+1)|x_i| \right)$$

is valid. From this you see that it is worth beginning the summation with the numbers $x_i$ having the smallest absolute values: in the error term the multiplier of $x_0$ and $x_1$ is $n\varepsilon_1$, but $x_n$ has only an $\varepsilon_1$ multiplier. The following estimate is a little rougher, but it shows better that the error grows linearly with the number $n$ of operations:

$$|\tilde S_n - S_n| \le n \varepsilon_1 \sum_{i=0}^{n} |x_i|. \qquad (1.13)$$

In the special case when all $x_i$ are positive (and neglecting the second order terms) we obtain an estimate of the relative error:

$$\left| \frac{\tilde S_n - S_n}{S_n} \right| \le n\varepsilon_1.$$

This relation shows that when $n\varepsilon_1$ is of order 1, an accurate result cannot be expected.

We come to a similar conclusion when we multiply $n+1$ floating point numbers which are not only positive but lie around 1, i.e. the $n$ multiplications do not involve the danger of overflow and underflow. If

$$P_1 := x_0 x_1, \qquad P_n = P_{n-1} x_n, \quad n = 2, 3, \ldots, \qquad \text{that is} \qquad P_n = \prod_{k=0}^{n} x_k,$$

then on the computer (with corresponding error quantities $\varepsilon_{k-1,k}$, possibly different from those in (1.12), and $\otimes$ denoting the machine multiplication)

$$\tilde P_1 = P_1 (1 + \varepsilon_{0,1}), \qquad \tilde P_2 = \tilde P_1 \otimes x_2 = \left( P_1 (1 + \varepsilon_{0,1})\, x_2 \right) (1 + \varepsilon_{1,2}) = P_2 (1 + \varepsilon_{0,1})(1 + \varepsilon_{1,2}) = P_2 (1 + \varepsilon_{0,1} + \varepsilon_{1,2} + \varepsilon_{0,1}\varepsilon_{1,2}) \doteq P_2 (1 + \varepsilon_{0,1} + \varepsilon_{1,2}),$$

and generally

$$\tilde P_n \doteq P_n \left( 1 + \sum_{k=1}^{n} \varepsilon_{k-1,k} \right).$$

This means that

$$\left| \frac{\tilde P_n - P_n}{P_n} \right| \doteq \left| \sum_{k=1}^{n} \varepsilon_{k-1,k} \right| \le n\varepsilon_1, \qquad (1.14)$$

since an upper bound of the sum is $n\varepsilon_1$. Thus, when the latter quantity does not exceed $1/a$, we can expect at least one accurate digit.

To what extent can this reasoning be accepted, given that we have neglected second order terms? The worst case is when the error is maximal in each multiplication: $\varepsilon_1$.
Then $P_n \le \tilde P_n \le P_n (1 + \varepsilon_1)^n$ and

$$\left| \frac{\tilde P_n - P_n}{P_n} \right| \le (1 + \varepsilon_1)^n - 1.$$

Recall from mathematical analysis that $1 + x \le e^x$ (for all $x$); for this reason, on the one hand, $(1+\varepsilon_1)^n - 1 \le e^{n\varepsilon_1} - 1$, and on the other hand one can verify that $e^x - 1 \le \frac{x}{1 - x/2}$ if $0 \le x < 2$. Thus, in the case of $x = n\varepsilon_1 < 2$:

$$\left| \frac{\tilde P_n - P_n}{P_n} \right| \le \frac{n\varepsilon_1}{1 - \frac{n\varepsilon_1}{2}}.$$

If we want the relative error of $\tilde P_n$ to be less than $\frac{1}{a}$ (i.e. at least one accurate digit in $\tilde P_n$), then solving the relation $\frac{x}{1 - x/2} \le \frac{1}{a}$ for $x$ we obtain

$$ax \le 1 - \frac{x}{2} \;\Longrightarrow\; \left( a + \frac{1}{2} \right) x \le 1 \;\Longrightarrow\; x \le \frac{1}{a + \frac{1}{2}}.$$

This means the following: from relation (1.14), obtained by neglecting the second order quantities, you can see that the relative error of $\tilde P_n$ is less than $\frac{1}{a}$ if $n\varepsilon_1 \le \frac{1}{a}$. Without neglecting the second order terms this is not valid, but it is sufficient to require $n\varepsilon_1 \le \frac{1}{a + \frac{1}{2}}$.

1.5 Conclusions

On computers we can work with two fundamental types of numbers: integers and floating point numbers. While the operations with integers are error-free (in fact, here we work in residue classes), for floating point numbers rounding is unavoidable. As a consequence of this (and also from other sources, like measuring), errors will occur, and the accumulation of errors can strongly bias the final result. For this reason it is important to estimate the final errors. While after one operation the error will be of the order of $\varepsilon_1$, the relative machine accuracy, after $n$ operations it will be of the order of $n\varepsilon_1$.

Most probably, the speed of computers will increase in the future. (This will happen not by increasing the "clock rate" but rather by increasing the number of cores available and by newer technical developments.)
If, parallel to this, the length of the mantissa of the floating point numbers used in programs and the domain $[k(-), k(+)]$ of the characteristic do not increase, then the danger of unreliable arithmetical results also increases: if the relative error obtained after $n$ operations is not less than one, then the result is unacceptable.

1.6 Exercises

1. Assuming that $a = 10$, determine the values of $t$, $k(+)$ and $k(-)$ on your computer (laptop, mobile etc.). How does the machine react to overflow and to underflow? Does a guard digit exist on your computer?
2. Why is it often true that $|k(+)| \neq |k(-)|$ for normalized numbers? Start from the question how many characteristics can be represented with $n$ binary digits in general, how many when the first digit is used for the sign, and how many when the exponent $k$ is stored in the form $k + 2^{n-1}$, without sign.
3. Represent on the real axis the non-negative floating point numbers corresponding to $a = 2$, $t = 4$, $k(-) = -3$, $k(+) = 2$. (In the following exercises we always assume that the values $a, t, k(+), k(-)$ characterizing the representation of the floating point numbers are given; moreover, $k(+) - k(-) > t \ge 2$ is valid.)
4. Determine in an experimental way the numbers $\varepsilon_0$ and $\varepsilon_1$.
5. Which is the smallest natural number not contained among the floating point numbers?
6. Which arithmetical operations do you know that can be performed error-free for floating point numbers?
7. Name two floating point numbers $x, y$, different from zero and one, such that $x + y$ or $x \cdot y$ are (a) floating point or (b) not floating point numbers.
8. We have seen that the right-hand floating point neighbour of 1 is $1 + \varepsilon_1$. However, what is the left-hand floating point neighbour of 1?
9. How many positive floating point numbers exist for a fixed characteristic $k$?
10.
Give a reason for the fact that in floating point calculation the sequence $\{s_n\}_{n=1}^{\infty}$, $s_1 = 1$, $s_n = s_{n-1} + 1/n$, $n = 2, 3, \ldots$, known from mathematical analysis, becomes a constant sequence when $n$ is large enough, instead of diverging. Estimate approximately how large $\tilde s_n$ will be when $n$ is large enough. For this, use that for the exact values of $s_n$

$$\log(n+1) < s_n < 1 + \log n$$

holds. Check your result with the help of your computer (using the shortest "real" type).
11. Imagine you have a program to calculate the determinant of a second order matrix based on the well-known mathematical formula. In the CPU there is a guard digit; after the arithmetical operations the number with $t+1$ digits will be normalized and rounded, if necessary. If the matrix is

$$A = A(\varepsilon) = \begin{pmatrix} 1 & 1-\varepsilon \\ 1 & 1 \end{pmatrix}, \qquad \varepsilon > 0,$$

then, starting from $\varepsilon = 1$ and decreasing the value of $\varepsilon$, what is the smallest positive $\varepsilon$ for which the computed determinant is positive?
12. Examine the value of the logical expression (1/45)*15==1/3 in MATLAB. Try to explain the phenomenon.
13. Choosing $x = 1/3$ as the initial value, the expression $4x - 1$ should return $1/3$. Run the command x = 4*x - 1 several (e.g., 27, 30, 40) times in a loop and check the final value of x.
14. Consider the MATLAB commands 4-5+1 and 0.4-0.5+0.1 and give a comment on the outcomes.
15. The equation

$$\left( \frac{1}{5x} + 1 \right) \cdot x - x = 0.2$$

is true for all $x \neq 0$. How many times does MATLAB find the above equation valid for $x = 1, \ldots, 100$?
16. Read the text on "Representation of Floating-Point Numbers" in the help part of MATLAB and look on the internet for "IEEE standard 754".
17. Find a simple algebraic relation between the following MATLAB numbers: realmin, eps(0), eps, by reading the explanations in the MATLAB help, or by direct experimentation.
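Before leaving floating point arithmetic, the error bound (1.14) of Sect. 1.4 can be checked in an experiment. The following sketch (our own Python/NumPy illustration; the book itself works in MATLAB) multiplies a thousand single precision numbers lying around 1 and compares the observed relative error with the first order bound:

```python
import numpy as np

# n+1 factors near 1, stored exactly in single precision, so that the
# only errors come from the machine multiplications.
rng = np.random.default_rng(0)
n = 1000
xs32 = (1.0 + (rng.random(n + 1) - 0.5) * 0.01).astype(np.float32)

exact = np.prod(xs32.astype(np.float64))   # double precision reference
p = np.float32(1.0)
for v in xs32:                             # machine multiplications in float32
    p = p * v

rel_err = abs(float(p) - exact) / abs(exact)
eps1 = float(np.finfo(np.float32).eps)     # relative machine accuracy
print(rel_err, (n + 1) * eps1)             # observed error stays below the bound
```

In practice the errors partly cancel, so the observed relative error is usually far below $(n+1)\varepsilon_1$, in line with the remark that the worst case is rarely attained.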
2 Norms, Condition Numbers

You remember from algebra: if $A \in \mathbb{R}^{n \times n}$ is a matrix with (real) elements $a_{ij}$, and $x = (x_1, x_2, \ldots, x_n)^T$ is a column vector, then $Ax$ is a vector whose $i$th component is

$$(Ax)_i = \sum_{j=1}^{n} a_{ij} x_j, \qquad i = 1, \ldots, n.$$

Moreover, if for some vector $v$ and number $\lambda$

$$Av = \lambda v, \qquad v \neq 0,$$

holds, then $\lambda$ is an eigenvalue of the matrix $A$, and $v$ is the corresponding eigenvector. Both $\lambda$ and $v$ may turn out to be complex.

Let us show that, as a generalization of $|x + y| \le |x| + |y|$, there holds

$$\left| \sum_{i=1}^{n} x_i \right| \le \sum_{i=1}^{n} |x_i|.$$

This follows from

$$\left| \sum_{i=1}^{n} x_i \right| \le |x_1| + \left| \sum_{i=2}^{n} x_i \right| \le |x_1| + |x_2| + \left| \sum_{i=3}^{n} x_i \right| \quad \text{etc.}$$

You surely know and understand the following operations with sums (here $m, n \ge 1$ are integers):

$$\sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j = \sum_{i=1}^{n} a_i \sum_{j=1}^{m} b_j = \sum_{j=1}^{m} \sum_{i=1}^{n} a_i b_j = \sum_{j=1}^{m} b_j \sum_{i=1}^{n} a_i.$$

These equalities hold because in each case all possible combinations of $a_i$ and $b_j$ are multiplied and summed:

$$\sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j = a_1 b_1 + a_1 b_2 + \cdots + a_1 b_m + a_2 b_1 + a_2 b_2 + \cdots + a_2 b_m + \cdots + a_n b_1 + \cdots + a_n b_m,$$

$$\sum_{i=1}^{n} a_i \sum_{j=1}^{m} b_j = a_1 (b_1 + b_2 + \cdots + b_m) + a_2 (b_1 + b_2 + \cdots + b_m) + \cdots + a_n (b_1 + b_2 + \cdots + b_m),$$

$$\sum_{j=1}^{m} \sum_{i=1}^{n} a_i b_j = a_1 b_1 + a_2 b_1 + \cdots + a_n b_1 + a_1 b_2 + a_2 b_2 + \cdots + a_n b_2 + \cdots + a_1 b_m + \cdots + a_n b_m,$$

$$\sum_{j=1}^{m} b_j \sum_{i=1}^{n} a_i = b_1 (a_1 + a_2 + \cdots + a_n) + b_2 (a_1 + a_2 + \cdots + a_n) + \cdots + b_m (a_1 + a_2 + \cdots + a_n).$$

2.1 Norms and Their Elementary Properties

In the following chapter we are going to discuss the solution of a system of linear equations

$$Ax = b \qquad (2.1)$$

where $A$ is an $n \times n$ real matrix, $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$ is a given vector and $x \in \mathbb{R}^n$ is the unknown vector. In practice both the vector $b$ and the matrix $A$ are often given with uncertainties: they are perturbed by errors, e.g.
by rounding errors. Hence, in this chapter we study the magnitude of the error in the vector $x$ caused by errors in $A$ and $b$. As a result of this examination we will understand the following: on what does it depend whether, on a given computer, a linear system with a given error can be solved with an acceptable error or not? To answer this question we shall introduce norms and condition numbers.

If, for example, $b + \delta b$ is given instead of $b$, where $\delta b$ is the vector of errors in $b$, then we would like to know how large the error of the solution $y$ of the "perturbed" system $Ay = b + \delta b$ is, i.e. how large $y - x$ is, where $x$ is the unknown error-free solution. To answer this we need to review (or introduce) some definitions, as we have to measure the distance of two vectors and of two matrices.

In what follows, even if we write complex vectors, $x \in \mathbb{C}^n$, it is usually possible to think only of real ones. Complex vectors occur as solutions of the eigenvalue problem, since a real matrix can also have complex eigenvectors and eigenvalues.

A function $d: \mathbb{C}^n \to \mathbb{R}$ is called a norm if the following four conditions hold:
1. $d(x) \ge 0$ for all $x \in \mathbb{C}^n$;
2. $d(x) = 0 \iff x = 0$, so the null vector can be uniquely recognized;
3. $d(a \cdot x) = |a| \, d(x)$, where $a$ is an arbitrary number which can also be complex (i.e., if $a = b + ci$, $i = \sqrt{-1}$, then $|a| = \sqrt{b^2 + c^2}$);
4. $d(x + y) \le d(x) + d(y)$ (this is the so-called triangle inequality).

The value $d(x)$ is the norm of the vector $x$, and since it is a generalization of the absolute value, instead of $d(x)$ the usual notation is $\|x\|$. Notice that in the second property 0 has two different meanings: as a number, and in the case of $x = 0$ as a vector. In what follows we always write 0, both in the case of numbers and of vectors, and even for matrices and functions.

In the space $\mathbb{C}^n$ the following $p$-norms are useful for us:

$$d(x) = \|x\|_p := \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p},$$

with $p = 1$ (Manhattan norm), $p = 2$ (Euclidean norm), and $p = \infty$ (maximum norm).
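These norms can be evaluated directly; here is a quick check in Python/NumPy (our own illustration; in MATLAB, norm(x,p) plays the same role):

```python
import numpy as np

x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 1))       # Manhattan norm: |3| + |-4| = 7
print(np.linalg.norm(x, 2))       # Euclidean norm: sqrt(9 + 16) = 5
print(np.linalg.norm(x, np.inf))  # maximum norm: max(3, 4) = 4
```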
The maximum norm could be interpreted as the limit for $p \to \infty$, but it can also be defined in the following way:

$$\|x\|_\infty := \max_{1 \le i \le n} |x_i|.$$

This follows from the inequality

$$\max_{1 \le i \le n} |x_i| \le \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \le n^{1/p} \max_{1 \le i \le n} |x_i|,$$

which is always true, because $\max_{1 \le i \le n} |x_i|$ is present among the values $|x_i|$ (this gives the lower estimate on the left-hand side). Moreover, $|x_i|$ can take the maximal value at most $n$ times, which explains the upper estimate on the right-hand side. Further, recall from mathematical analysis that $n^{1/n}$ tends to 1 as $n \to \infty$. So much the more $n^{1/p} \to 1$ if $n$ is fixed and $p \to \infty$, as here. For example, if $n = 256$ and $p = 2$ then $n^{1/p} = 16$, if $p = 4$ then $n^{1/p} = 4$, but in the case of $p = 16$ we have $n^{1/p} = \sqrt{2} \approx 1.414$.

We often omit the subscript $p$ when any of the above-mentioned three norms can be used, but in general these norms behave rather differently (Fig. 2.1).

[Fig. 2.1: The unit ball in three $p$-norms, $p = 1, 2, \infty$: the points $x$ satisfying $\|x\|_p = 1$.]

The Euclidean norm corresponding to $p = 2$ is probably known from your previous studies. If the vector $x$ has real components, then we can simply write

$$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}.$$

Consider an example: if $x$ is the row vector $(1, -2, -3, 4)$, then

$$\|x\|_1 = 1 + |-2| + |-3| + 4 = 10, \qquad \|x\|_2 = \sqrt{1 + 4 + 9 + 16} = \sqrt{30} \approx 5.477, \qquad \|x\|_\infty = \max(1, |-2|, |-3|, 4) = 4.$$

In the case of the maximum and the Manhattan norms, the triangle inequality follows from the fact that for the absolute value the inequality $|a + b| \le |a| + |b|$ holds. In the case of the Euclidean norm it is a consequence of the Cauchy inequality (see the beginning of Chap. 4).

We remark that the parameter $p$ of the $p$-norms can be any number from the interval $[1, \infty]$, but $0 < p < 1$ is not possible, because in this case the fourth property does not hold, that is, $\|x\|_p$ does not define a norm.
As a counterexample consider $n \ge 2$; moreover, let $x = (1, 0, \ldots, 0)^T$ and $y = (0, 1, 0, \ldots, 0)^T$. Thus, according to the definition, $\|x\|_p = 1$, $\|y\|_p = 1$, but from $0 < p < 1$ we have

$$\|x + y\|_p = (1 + 1)^{1/p} = 2^{1/p} > 2 = \|x\|_p + \|y\|_p.$$

2.2 The Induced Matrix Norm

2.2.1 Definition and Properties

In what follows we shall denote column vectors by $x, y$. As an example, $x = (1, 0, \ldots, 0)^T \in \mathbb{R}^n$ is the first column unit vector, denoted also as $e_1$ and often written as a transposed row vector.

Now, if $A \in \mathbb{R}^{n \times n}$ is a real matrix, with the help of a vector norm we can define a matrix norm:

$$\|A\| := \max_{x \neq 0} \frac{\|Ax\|}{\|x\|}. \qquad (2.2)$$

It is the matrix norm induced by the given vector norm (or subordinate to the given vector norm). We need to check that (2.2) indeed defines a norm on the linear space of the matrices:
1. The non-negativity of the expression on the right-hand side follows from the non-negativity of the vector norms.
2. If $A = 0$, then $\|Ax\| = 0$ for all vectors $x$, so $\|A\| = 0$. In the opposite direction: if $A \neq 0$, then $A$ has at least one nonzero element, for example in the $j$th column. If we multiply the matrix $A$ by the $j$th unit vector $x = e_j$, then $Ax$ is the $j$th column vector of $A$. The norm of this vector is positive, so the maximum is also positive.
3. In the case of the matrix $c \cdot A$, for any real or complex number $c$, it follows again from the properties of the vector norm that $\|c \cdot Ax\| = |c| \, \|Ax\|$. Since the maximum does not concern $c$, there holds $\|c \cdot A\| = |c| \, \|A\|$.
4. Finally, if $B$ is also a matrix in $\mathbb{R}^{n \times n}$, then $\|(A + B)x\| = \|Ax + Bx\| \le \|Ax\| + \|Bx\|$. Hence, with the help of definition (2.2) we obtain

$$\|A + B\| := \max_{x \neq 0} \frac{\|(A+B)x\|}{\|x\|} \le \max_{x \neq 0} \frac{\|Ax\| + \|Bx\|}{\|x\|} \le \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} + \max_{x \neq 0} \frac{\|Bx\|}{\|x\|} = \|A\| + \|B\|,$$

where the second inequality is valid because on its right-hand side we take the maxima of $\frac{\|Ax\|}{\|x\|}$ and $\frac{\|Bx\|}{\|x\|}$ separately. Therefore, the triangle inequality is true, too.
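Definition (2.2) is a maximum, so any particular $x \neq 0$ only yields a lower bound $\|Ax\|/\|x\| \le \|A\|$. A small Python/NumPy experiment (our own illustration) shows random trial vectors approaching, but never exceeding, the induced 1-norm:

```python
import numpy as np

A = np.array([[3.0, -4.0, 1.0],
              [-2.0, 0.0, -1.0],
              [2.0, -5.0, 3.0]])
true_norm = np.linalg.norm(A, 1)   # the induced 1-norm of this matrix is 9

rng = np.random.default_rng(1)
best = 0.0
for _ in range(5000):
    x = rng.standard_normal(3)
    ratio = np.linalg.norm(A @ x, 1) / np.linalg.norm(x, 1)
    best = max(best, ratio)        # every ratio is a lower bound for ||A||_1
print(best, true_norm)             # best never exceeds the true value 9
```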
Since in (2.2) the matrix norm is defined as the maximum over all nonzero vectors $x$, for a fixed $x \neq 0$ the quotient $\|Ax\|/\|x\|$ can be less than the norm, but it can never be greater:

$$\frac{\|Ax\|}{\|x\|} \le \|A\|, \qquad \text{that is,} \qquad \|Ax\| \le \|A\| \, \|x\|. \qquad (2.3)$$

Later this inequality will be used in several estimations. Conversely, if $M$ is a (non-negative) number such that for all vectors $x$

$$\|Ax\| \le M \, \|x\|, \qquad (2.4)$$

then $\|A\| \le M$, and the norm of the matrix $A$ is the smallest number $M$ for which (2.4) is always valid.

Using (2.3) we can also obtain an upper bound for the norm of a vector multiplied by several matrices, e.g.

$$\|ABx\| = \|A(Bx)\| \le \|A\| \, \|Bx\| \le \|A\| \, \|B\| \, \|x\|.$$

By (2.2) this means that for all matrices $A, B \in \mathbb{R}^{n \times n}$

$$\|A \cdot B\| \le \|A\| \cdot \|B\| \qquad (2.5)$$

holds.

2.2.2 Computation of the Induced Matrix Norm for the Vector p-Norms, p = 1, ∞

We compute the matrix norm (2.2) directly from the elements $a_{ij}$ of the matrix $A$. For this aim,
1. we have to verify an estimate of the form (2.4),
2. we need to find a vector $x \neq 0$ for which one has equality in (2.4).

Consider first the case $p = 1$.

First step: The inequality

$$\|Ax\|_1 = \sum_{i=1}^{n} |(Ax)_i| = \sum_{i=1}^{n} \left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}| \, |x_j| = \sum_{j=1}^{n} |x_j| \sum_{i=1}^{n} |a_{ij}|$$

holds for all $x$. In fact, the last double sum is an expression of the form $\sum_{j=1}^{n} |x_j| c_j$, where $c_j := \sum_{i=1}^{n} |a_{ij}|$. We take the largest of these $n$ non-negative numbers, $\max_j \left( \sum_{i=1}^{n} |a_{ij}| \right)$, and then

$$\|Ax\|_1 \le \max_j \left( \sum_{i=1}^{n} |a_{ij}| \right) \cdot \sum_{j=1}^{n} |x_j|. \qquad (2.6)$$

Since $\sum_{j=1}^{n} |x_j| = \|x\|_1$, this means that $\|A\|_1 \le \max_{1 \le j \le n} \left( \sum_{i=1}^{n} |a_{ij}| \right)$.

Second step: If $j_0$ is an index where the maximum occurs, i.e.

$$\sum_{i=1}^{n} |a_{ij_0}| = \max_j \left( \sum_{i=1}^{n} |a_{ij}| \right),$$

and $x$ is the vector whose $j_0$th coordinate is equal to 1 while the other coordinates are equal to 0, so that $x = e_{j_0}$ is the $j_0$th unit vector, then one has equality in (2.6). Hence,

$$\|A\|_1 = \max_j \left( \sum_{i=1}^{n} |a_{ij}| \right).$$
According to this easily evaluated expression, the matrix norm induced by the Manhattan vector norm is also called the column sum norm.

Example:

$$A = \begin{pmatrix} 3 & -4 & 1 \\ -2 & 0 & -1 \\ 2 & -5 & 3 \end{pmatrix}. \qquad (2.7)$$

Here $\sum_{i=1}^{n} |a_{ij}|$ takes the values 7, 9, 5 for $j = 1, 2, 3$, respectively. Hence, $\|A\|_1 = 9$. We also find a vector for which equality holds in (2.4): according to the previous reasoning, the corresponding vector is $x = e_2$, and then

$$Ax = \begin{pmatrix} -4 \\ 0 \\ -5 \end{pmatrix}, \qquad \text{so} \qquad \|Ax\|_1 = |-4| + 0 + |-5| = 4 + 5 = 9 = 9\|x\|_1.$$

Now, let us consider the case $p = \infty$.

First step: The equality

$$\|Ax\|_\infty = \max_{1 \le i \le n} |(Ax)_i| = \max_{1 \le i \le n} \left| \sum_{j=1}^{n} a_{ij} x_j \right|$$

holds for all vectors $x$. Moreover, for every fixed $i$,

$$\left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \sum_{j=1}^{n} |a_{ij}| \, |x_j| \le \left( \sum_{j=1}^{n} |a_{ij}| \right) \max_{1 \le j \le n} |x_j| = \left( \sum_{j=1}^{n} |a_{ij}| \right) \|x\|_\infty$$

is valid. Summarizing both statements above we get

$$\|Ax\|_\infty = \max_{1 \le i \le n} \left| \sum_{j=1}^{n} a_{ij} x_j \right| \le \left( \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}| \right) \|x\|_\infty. \qquad (2.8)$$

This upper bound means that

$$\|A\|_\infty \le \max_i \left( \sum_{j=1}^{n} |a_{ij}| \right).$$

Second step: If $i_0$ is an index where the maximum occurs, that is,

$$\sum_{j=1}^{n} |a_{i_0 j}| = \max_i \left( \sum_{j=1}^{n} |a_{ij}| \right),$$

and $x$ is a vector with its coordinates $x_j$ defined as

$$x_j := \begin{cases} +1, & \text{if } a_{i_0 j} \ge 0, \\ -1, & \text{if } a_{i_0 j} < 0, \end{cases} \qquad (2.9)$$

then $\|x\|_\infty = 1$ and $(Ax)_{i_0} = \sum_{j=1}^{n} |a_{i_0 j}|$. In addition, for $i \neq i_0$,

$$|(Ax)_i| \le \sum_{j=1}^{n} |a_{ij} x_j| = \sum_{j=1}^{n} |a_{ij}| \le \sum_{j=1}^{n} |a_{i_0 j}|$$

holds, hence $\max_{1 \le i \le n} |(Ax)_i| = \sum_{j=1}^{n} |a_{i_0 j}| = \sum_{j=1}^{n} |a_{i_0 j}| \, \|x\|_\infty$, which means

$$\|A\|_\infty = \max_i \left( \sum_{j=1}^{n} |a_{ij}| \right).$$

This is why the maximum norm is also called the row sum norm.

Example: Consider again the matrix $A$ in (2.7). Then the sum $\sum_{j=1}^{n} |a_{ij}|$ takes the values 8, 3, 10 for $i = 1, 2, 3$, respectively. Hence, $\|A\|_\infty = 10$.
Let us clarify which vector implies equality in (2.8): according to the arguments above (see (2.9)), the corresponding vector is $x = (1, -1, 1)^T$, and then

$$Ax = \begin{pmatrix} 8 \\ -3 \\ 10 \end{pmatrix}, \qquad \text{so} \qquad \|Ax\|_\infty = \max(8, |-3|, 10) = 10 = 10\|x\|_\infty.$$

If you are in doubt as to which $p$ the column sum or the row sum induced matrix norm corresponds, then the following simple rule can help you: the 1 stands, like a column; the $\infty$ lies, like a row.

2.2.3 Computation of the Induced Matrix Norm (p = 2)

Finally, consider the case $p = 2$. The derived matrix norm is useful for theoretical considerations, but, as it turns out, in practice it is difficult to compute. The norm $\|x\|_2$ is connected with the Euclidean inner product

$$(x, y) := \sum_{i=1}^{n} x_i \bar{y}_i, \qquad x, y \in \mathbb{C}^n, \qquad (2.10)$$

(where $\bar{y}_i$ is the complex conjugate of $y_i$) in the following way: $\|x\|_2^2 = (x, x)$.

With the inner product (2.10) the connection

$$(Ax, y) = (x, A^T y) \qquad (2.11)$$

is valid for all real matrices $A$. We get this identity if we change the order of summation, that is,

$$(Ax, y) = \sum_{i=1}^{n} \left( \sum_{j=1}^{n} a_{ij} x_j \right) \bar{y}_i = \sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij} \bar{y}_i \, x_j = \sum_{j=1}^{n} x_j \left( \sum_{i=1}^{n} a_{ij} \bar{y}_i \right) = (x, A^T y),$$

where $A^T$ is the transpose of $A$. In the expression $\sum_{i=1}^{n} a_{ij} y_i$ the components of the vector $y$ are multiplied by the elements of the $j$th column of the matrix $A$, i.e., by the $j$th row of the transpose of $A$, but not by the elements of the $j$th row of $A$, since this would yield the product $(Ay)_j = \sum_{i=1}^{n} a_{ji} y_i$.

Applying the relation (2.11) to $y = Ax$ we get

$$\|Ax\|_2^2 = (x, A^T A x). \qquad (2.12)$$

The matrix $A^T A$ is symmetric, that is, $(A^T A)^T = A^T A$. Recall now an important concept: a symmetric matrix $B \in \mathbb{R}^{n \times n}$ is positive semidefinite if the inequality

$$(Bx, x) \ge 0 \qquad (2.13)$$

holds for all $x \in \mathbb{C}^n$. Accordingly, equality (2.12) shows that the matrix $A^T A$ is positive semidefinite, and from this it follows that all eigenvalues of the matrix $A^T A$ are non-negative. Denote by $\lambda_{\max}$ the largest eigenvalue of the matrix $A^T A$: $\lambda_{\max} = \lambda_{\max}(A^T A)$.
Below we derive the inequality

$$\|Ax\|_2^2 = (x, A^T A x) \le \lambda_{\max} \|x\|_2^2 \qquad (2.14)$$

for all $x \in \mathbb{C}^n$, which implies $0 \le \lambda_{\max}$ and $\|A\|_2 \le \sqrt{\lambda_{\max}(A^T A)}$.

The inequality (2.14) follows from a fact of linear algebra: since $A^T A$ is a symmetric matrix, it can be diagonalized with an orthogonal matrix. This means that there exists a matrix $Q \in \mathbb{R}^{n \times n}$ satisfying $Q^T Q = I = Q Q^T$ such that

$$Q A^T A Q^T = D = \operatorname{diag}_{1 \le i \le n}\left( \lambda_i(A^T A) \right)$$

holds, where $D$ is a diagonal matrix whose main diagonal contains the eigenvalues $\lambda_i$ of $A^T A$. In the case of an orthogonal matrix $Q$, the equation

$$\|x\|_2^2 = (x, x) = (Q^T Q x, x) = (Qx, Qx) = \|Qx\|_2^2 \qquad (2.15)$$

is valid, so the Euclidean length of a vector does not change when we multiply it by $Q$. In our case

$$(x, A^T A x) = (Q^T Q x, A^T A x) = (Qx, Q A^T A x) = (y, Q A^T A Q^T y), \qquad \text{where } y := Qx,$$

so that

$$(x, A^T A x) = (y, Dy) = \sum_{i=1}^{n} d_i |y_i|^2 \le d_{\max} \sum_{i=1}^{n} |y_i|^2 = \lambda_{\max} \|y\|_2^2 = \lambda_{\max} \|Qx\|_2^2 = \lambda_{\max} \|x\|_2^2,$$

which gives (2.14). In the preceding estimate, equality holds if $x = v_{\max}$ is the eigenvector of $A^T A$ corresponding to the eigenvalue $\lambda_{\max}$ having the largest absolute value. Hence, for the Euclidean norm of a real matrix $A$ we have

$$\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}. \qquad (2.16)$$

The spectral radius is connected to the Euclidean matrix norm. Let $A \in \mathbb{R}^{n \times n}$ be a matrix, and denote by $\lambda_i(A)$ the eigenvalues of $A$, $i = 1, \ldots, n$. The absolute value of the eigenvalue having the largest absolute value is called the spectral radius,

$$\rho(A) := \max_{1 \le i \le n} |\lambda_i(A)|.$$

If the matrix $A$ is symmetric, then its eigenvalues are real numbers; moreover, the eigenvalues of the matrix $A^T A = A^2$ are easy to get from the eigenvalues $\lambda = \lambda(A)$ of the original matrix: $A^T A x = A^2 x = A(Ax) = A(\lambda x) = \lambda Ax = \lambda^2 x$. Thus in this case

$$\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sqrt{\lambda_{\max}^2(A)} = |\lambda_{\max}(A)| = \rho(A),$$

meaning that now the spectral radius is also a norm.

As an example consider the matrix

$$A = \begin{pmatrix} 3 & 4 \\ 0 & 0 \end{pmatrix}.$$

Here you can immediately obtain that $\|A\|_1 = 4$, $\|A\|_\infty = 7$.
Moreover, because of the special form of $A$ (which is a so-called upper triangular matrix), it is known that the eigenvalues are the elements of the main diagonal: one of the eigenvalues is equal to 3, the other is 0, and therefore $\rho(A) = 3$. Calculating the matrix

$$A^T A = \begin{pmatrix} 9 & 12 \\ 12 & 16 \end{pmatrix},$$

we can find its eigenvalues, which are 25 and 0, so that $\|A\|_2 = \sqrt{25} = 5$. The eigenvector of $A^T A$ corresponding to the eigenvalue 25 is $x = (3, 4)^T$, and indeed

$$Ax = \begin{pmatrix} 25 \\ 0 \end{pmatrix}, \qquad \|Ax\|_2 = 25 = 5 \cdot 5 = 5\|x\|_2 = \|A\|_2 \|x\|_2.$$

As a second example take again the matrix $A$ in (2.7). Then $\|A\|_1 = 9$ and $\|A\|_\infty = 10$. In this case the determination of the eigenvalues of $A$ requires a longer calculation. The eigenvalues are approximately 6.425, 1.608, $-2.033$, and so $\rho(A) \approx 6.425$. Moreover,

$$A^T A = \begin{pmatrix} 17 & -22 & 11 \\ -22 & 41 & -19 \\ 11 & -19 & 11 \end{pmatrix},$$

and the eigenvalues of this matrix are approximately 63.279, 3.962, 1.759; hence, finally, $\|A\|_2 \approx \sqrt{63.279} \approx 7.955$. This example shows that the maximum and Manhattan norms are often more practical for direct calculations than the Euclidean norm. Also, it can be seen in both examples that the spectral radius is less than the three norms. As the reasoning below shows, this is not accidental.

First, let $\| \cdot \|$ be an arbitrary vector norm; moreover, let $\lambda = \lambda(A)$ be some eigenvalue of $A$, and $v$ the corresponding eigenvector. Then

$$|\lambda| \cdot \|v\| = \|Av\| \le \|A\| \cdot \|v\| \;\Longrightarrow\; |\lambda| \le \|A\|, \qquad (2.17)$$

since $\|v\| > 0$. This is also true for the eigenvalue which is maximal by absolute value, so that we get

$$\rho(A) \le \|A\|,$$

where the norm may be any induced norm of $A$.

2.3 Error Estimations

We continue our way to the numerical solution of a linear system $Ax = b$ by first tackling the question whether it makes sense to try it at all. Having norms already at our disposal, consider the linear system (2.1) in the case of a regular $n \times n$ matrix $A$ and of a vector $b \neq 0$.
Taking the norms of both sides we get

$$0 < \|b\| = \|Ax\| \le \|A\| \, \|x\|. \qquad (2.18)$$

2.3.1 The Right-Hand Side of the Linear System is Perturbed

Let us estimate the error of the solution if instead of $b$ a perturbed vector $b + \delta b$ is given, $\delta b$ representing the error of the right-hand side. Let $y = x + (y - x) =: x + \delta x$ be the solution of the perturbed equation. Hence, $\delta x = y - x$ and

$$A(x + \delta x) = b + \delta b.$$

Then $b + \delta b = A(x + \delta x) = b + A \delta x$, and in this way $\delta b = A \delta x$. Because of the regularity of $A$, that is, since it has an inverse, we can also write (at least formally: the calculation of the inverse matrix is usually much more expensive than solving the system) $\delta x = A^{-1} \delta b$. Taking again the norm we obtain

$$\|\delta x\| = \|A^{-1} \delta b\| \le \|A^{-1}\| \, \|\delta b\|. \qquad (2.19)$$

This inequality shows that $\|A^{-1}\| \, \|\delta b\|$ is an upper bound for the absolute error of the solution $x$. To estimate the relative error (which is usually more interesting) we use first inequality (2.19) and then (2.18) (where $\|A\| > 0$, since $A \neq 0$, being regular):

$$\frac{\|\delta x\|}{\|x\|} \le \frac{\|A^{-1}\| \, \|\delta b\|}{\|x\|} \le \frac{\|A^{-1}\| \, \|\delta b\|}{\|b\| / \|A\|} = \|A\| \cdot \|A^{-1}\| \cdot \frac{\|\delta b\|}{\|b\|}. \qquad (2.20)$$

On the right-hand side stands the relative error of the vector $b$ (it can be, e.g., the rounding error), while on the left-hand side the relative error of the solution appears.

The estimates (2.18), (2.19) are sharp, because there exist both a nonzero vector $x$ (in other words: a vector $b$) and a nonzero vector $\delta b$ such that

$$\|Ax\| = \|A\| \, \|x\| \qquad \text{and} \qquad \|A^{-1} \delta b\| = \|A^{-1}\| \, \|\delta b\|$$

hold, where the second relation means sharpness of (2.19). Further,

$$\frac{\|\delta x\|}{\|x\|} = \frac{\|A^{-1}\| \, \|\delta b\|}{\|x\|} = \|A\| \cdot \|A^{-1}\| \cdot \frac{\|\delta b\|}{\|b\|},$$

which shows that the estimate (2.20) is also sharp.

2.3.2 Condition Number

The expression $\|A\| \cdot \|A^{-1}\|$ appearing in Eq. (2.20) is called the condition number of the matrix $A$ and is denoted by

$$\operatorname{cond}(A) := \|A\| \cdot \|A^{-1}\|.$$

According to the calculation in the preceding subsection, the condition number tells us exactly how much larger the relative error of the solution of the linear system can be than the relative error of the right-hand side vector.
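This amplification is easy to observe numerically. In the following sketch (our own Python/NumPy illustration with a matrix of our own choosing) we solve a system, perturb the right-hand side, and compare the growth of the relative error with the bound (2.20):

```python
import numpy as np

A = np.array([[1.0, 100.0],
              [100.0, 10001.0]])        # a mildly ill-conditioned matrix
b = np.array([1.0, 2.0])
x = np.linalg.solve(A, b)

rng = np.random.default_rng(7)
db = 1e-6 * rng.standard_normal(2)      # perturbation of the right-hand side
y = np.linalg.solve(A, b + db)

lhs = np.linalg.norm(y - x) / np.linalg.norm(x)           # relative error of x
rhs = np.linalg.cond(A) * np.linalg.norm(db) / np.linalg.norm(b)  # bound (2.20)
print(lhs, rhs)   # lhs never exceeds rhs
```

For a generic perturbation the bound is usually far from sharp; equality requires specially chosen vectors $b$ and $\delta b$, as described above.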
Hence, the knowledge of the condition number will be critical from the point of view of the numerical solvability of the system. However, it is hard to compute, because first we would have to compute the inverse matrix.

Properties:
1. The condition number depends on the matrix norm and, as the examples below show, it also depends on the vector norm inducing it. Where necessary we therefore use a corresponding subscript: $\operatorname{cond}_p(A) = \|A^{-1}\|_{(p)} \|A\|_{(p)}$, $p = 1, 2, \infty$.
2. The condition number cannot be less than 1. In the case of the induced norms $\|I\| = 1$, because from (2.2) it follows that

$$\|I\| = \max_{x \neq 0} \frac{\|x\|}{\|x\|} = 1.$$

Hence, taking into account also (2.5), $1 = \|I\| = \|A A^{-1}\| \le \|A\| \, \|A^{-1}\| = \operatorname{cond}(A)$. In other words, we usually must expect that the relative error of the solution increases compared to the relative error of the right-hand side.
3. In the special case where $A = Q$ is an orthogonal matrix, $\operatorname{cond}_2(Q) = 1$, see (2.15). Then the relative error corresponding to the Euclidean norm cannot increase during the solution of the linear system.
4. For a regular matrix $A$, 0 is not an eigenvalue. Then, for any eigenvector $v$,

$$Av = \lambda v \;\Longrightarrow\; \frac{1}{\lambda} v = A^{-1} v, \qquad \text{i.e.} \qquad \lambda^{-1}(A) = \lambda(A^{-1}).$$

Now, denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the eigenvalues of $A$ which have the smallest and largest absolute values, respectively. We can apply the reasoning of (2.17) to the inverse matrix too:

$$\|A\| \ge |\lambda_{\max}(A)|, \qquad \|A^{-1}\| \ge |\lambda_{\max}(A^{-1})| = \frac{1}{|\lambda_{\min}(A)|}.$$

From this it follows that

$$\operatorname{cond}(A) \ge \left| \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \right|.$$

On the basis of the above inequality the condition number can be estimated from below using the methods described in the chapter dealing with eigenvalue problems.
5. Take the case that the relative error of the right-hand side vector equals the rounding error, $\|\delta b\| / \|b\| = \varepsilon_1 = a^{1-t}$ (see (1.6) in Sect. 1.3), and

$$\operatorname{cond}(A) \ge \frac{1}{\varepsilon_1}.$$

Then (2.20) shows that the relative error of the solution can be unacceptably large: since $\operatorname{cond}(A) \frac{\|\delta b\|}{\|b\|} \ge 1$, it may occur that $\|\delta x\| \ge \|x\|$.
In this case we call the linear system (and the matrix) ill-conditioned. To guarantee at least one correct digit in the solution, the inequality

$$\operatorname{cond}(A) \le \frac{1}{a \varepsilon_1}$$

has to be satisfied, because then from (2.20) and from $\|\delta b\| / \|b\| \le \varepsilon_1$ it follows that

$$\frac{\|\delta x\|}{\|x\|} \le \frac{1}{a}.$$

Examples: If the matrix is a regular $2 \times 2$ matrix, then the condition number can be calculated without any problem. Let the determinant of the matrix

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

be different from zero: $\det(A) = ad - cb \neq 0$. Then the inverse of $A$ is

$$A^{-1} = \frac{1}{ad - cb} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix},$$

and hence, for example,

$$\operatorname{cond}_\infty(A) = \frac{\max(|a| + |b|, |c| + |d|) \cdot \max(|d| + |b|, |c| + |a|)}{|ad - cb|} = \frac{\|A\|_\infty \|A\|_1}{|\det(A)|}.$$

But for the example $A = \begin{pmatrix} 3 & 4 \\ 0 & 0 \end{pmatrix}$, the condition number is not defined (independently of the vector norm used), because of the singularity of the matrix. If the matrix is

$$A = \begin{pmatrix} 3 & 4 \\ -1 & 2 \end{pmatrix},$$

then $\|A\|_\infty = 7$, $\det(A) = 10$ and

$$A^{-1} = \frac{1}{10} \begin{pmatrix} 2 & -4 \\ 1 & 3 \end{pmatrix}.$$

So $\|A^{-1}\|_\infty = \frac{6}{10} = \frac{3}{5}$ and $\operatorname{cond}_\infty(A) = \frac{21}{5}$.

For which vector $x$ does the equality $\|Ax\|_\infty = \|A\|_\infty \|x\|_\infty$ hold? According to our result in Sect. 2.2.2, the corresponding vector is $x = (1, 1)^T$, and then $b = Ax = (7, 1)^T$. Similarly, $\|A^{-1} \delta b\|_\infty = \|A^{-1}\|_\infty \|\delta b\|_\infty$ if $\delta b = (1, -1)^T$, and then

$$\delta x = A^{-1} \delta b = \frac{1}{10} \begin{pmatrix} 6 \\ -2 \end{pmatrix} = \frac{1}{5} \begin{pmatrix} 3 \\ -1 \end{pmatrix}.$$

If instead of $\delta b$ we have $t \cdot \delta b$ and instead of $\delta x$ the vector $t \cdot \delta x$, with some parameter $t$, the equalities remain valid: $A(t \cdot \delta x) = t \cdot \delta b$ and $\|A^{-1} t \, \delta b\|_\infty = \|A^{-1}\|_\infty \|t \, \delta b\|_\infty$. Hence, if the original and the perturbed systems are

$$Ax = \begin{pmatrix} 3 & 4 \\ -1 & 2 \end{pmatrix} x = \begin{pmatrix} 7 \\ 1 \end{pmatrix} \qquad \text{and} \qquad A(x + t \cdot \delta x) = \begin{pmatrix} 3 & 4 \\ -1 & 2 \end{pmatrix} (x + t \cdot \delta x) = \begin{pmatrix} 7 + t \\ 1 - t \end{pmatrix}, \qquad (2.21)$$

respectively, we have the equality

$$\frac{\|t \cdot \delta x\|}{\|x\|} = \|A\| \cdot \|A^{-1}\| \cdot \frac{\|t \cdot \delta b\|}{\|b\|}.$$

Substituting the calculated values into the equation we get

$$\frac{\|t \cdot \delta x\|}{\|x\|} = \frac{|t| \cdot 3/5}{1} = |t| \frac{3}{5} = \frac{21}{5} \cdot \frac{|t|}{7} = \frac{21}{5} \cdot \frac{\|t \cdot \delta b\|}{\|b\|}.$$

This means that, while $t \neq 0$, during the solution of the system (2.21) the relative error of the solution may increase by the factor $21/5$, independently of $t$, i.e.
independently of whether the relative error on the right-hand side is small or large.

In the case of the matrix (2.7) you can readily check that
$$A^{-1} = \frac{1}{21}\begin{pmatrix} 5 & -7 & -4 \\ -4 & -7 & -1 \\ -10 & -7 & 8 \end{pmatrix},$$
so $\|A^{-1}\|_1 = 1$, $\|A^{-1}\|_\infty = \frac{25}{21}$. Hence $\mathrm{cond}_1(A) = 9\cdot 1 = 9$ and $\mathrm{cond}_\infty(A) = 10\cdot\frac{25}{21} = \frac{250}{21} \approx 11.905$, and after a longer calculation we get that $\mathrm{cond}_2(A) \approx 7.955\cdot 0.754 \approx 5.998$.

Even small matrices can have large condition numbers, as the following example shows. Let $A(t)$ be the matrix
$$A(t) := \begin{pmatrix} 1 & t \\ t & 1 + t^2 \end{pmatrix}, \tag{2.22}$$
which depends on the parameter $t$. Then $\det(A) = 1$ holds independently of $t$, and
$$A(t)^{-1} = \begin{pmatrix} 1 + t^2 & -t \\ -t & 1 \end{pmatrix}.$$
Hence $\|A(t)\|_\infty = \max(|t| + 1 + t^2,\ 1 + |t|) = 1 + |t| + t^2 = \|A(t)^{-1}\|_\infty$, so $\mathrm{cond}_\infty(A(t)) = (1 + |t| + t^2)^2$. For a suitable $t$ this can be very large; for example, if $t = 100$, then $\mathrm{cond}_\infty(A(t)) = 102{,}030{,}201 > 10^8$.

A famous example of this phenomenon is the so-called Hilbert matrix
$$H_n = \left(\frac{1}{i + j - 1}\right)_{i,j=1}^{n}.$$
As a specific case take $n = 5$, so
$$H_5 = \begin{pmatrix}
1 & 1/2 & 1/3 & 1/4 & 1/5 \\
1/2 & 1/3 & 1/4 & 1/5 & 1/6 \\
1/3 & 1/4 & 1/5 & 1/6 & 1/7 \\
1/4 & 1/5 & 1/6 & 1/7 & 1/8 \\
1/5 & 1/6 & 1/7 & 1/8 & 1/9
\end{pmatrix}.$$
Then $\mathrm{cond}_1(H_5) \approx 9.437\cdot 10^5$ and $\mathrm{cond}_2(H_5) \approx 4.766\cdot 10^5$ (here $\lambda_{\min} \approx 3.288\cdot 10^{-6}$, $\lambda_{\max} \approx 1.567$), and
$$\mathrm{cond}_1(H_{12}) \approx 3.798\cdot 10^{16}.$$
In the case of the matrix $H_{12}$ the condition number is so large that the rounding error of the right-hand side vector alone (regardless of whether single or double precision is used) can cause a huge error in the exact solution of Eq. (2.1). Obviously, in this case no solution algorithm can give an acceptable result; the matrix is indeed ill-conditioned.

We have arrived at the most important observation of the present chapter: for a given computer, linear system and error level, it is not the determinant but the condition number that determines the numerical solvability of the system. Surely, if the determinant is equal to zero, the condition number is not defined.
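That the determinant says nothing about conditioning, as example (2.22) exhibits, can be checked in a few lines. This is our own illustrative sketch in Python (the book's exercises use MATLAB); the function name `cond_inf_param` is ours, and the inverse is the exact one given above, valid because the determinant is 1.

```python
def cond_inf_param(t):
    # A(t) from (2.22): det A(t) = 1 for every t, yet
    # cond_inf(A(t)) = (1 + |t| + t^2)^2 grows without bound.
    A    = [[1, t], [t, 1 + t * t]]
    Ainv = [[1 + t * t, -t], [-t, 1]]   # exact inverse, since det = 1
    inf_norm = lambda M: max(sum(abs(x) for x in row) for row in M)
    return inf_norm(A) * inf_norm(Ainv)

print(cond_inf_param(100))   # 102030201 > 1e8, although det A(100) = 1
```

With integer $t$ the computation is exact, so the value $102{,}030{,}201$ from the text is reproduced without rounding.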
On the other hand, if $\det(A)$ differs from zero, then the magnitude of the condition number of $A$ does not depend on the value of the determinant: for example, multiplying the system (2.1) by the power $10^k$ yields a $10^{kn}$ times larger determinant, while the condition number does not change. We can see the same in example (2.22): the determinant is equal to 1, independently of $t$, while the condition number can take arbitrarily large values.

The matrices arising in practice are typically ill-conditioned, while matrices constructed from random numbers are most often well-conditioned. A good program package, like MATLAB, will issue a warning when it encounters a linear system with an ill-conditioned matrix.

2.3.3 The Matrix of the Linear System is Perturbed

Now we examine the case where the elements of the matrix are perturbed:
$$B := A + \delta A, \qquad By = b, \qquad y = x + \delta x. \tag{2.23}$$
Here we need a condition on the matrix $A$ which ensures that the perturbation $\delta A$ is small enough such that both $A$ and $B$ are non-singular.

Consider first a simpler question: what kind of sufficient condition can be given for the matrix $R$ to ensure the regularity of $I + R$? The answer to this question is the perturbation lemma: if $S := I + R$ and $\|R\| =: q < 1$ in some induced norm, then $S$ is regular, and
$$\|S^{-1}\| \le \frac{1}{1 - q}.$$
To understand that this is true, we proceed as follows:

(a) First, from the triangle inequality $\|x + y\| \le \|x\| + \|y\|$ you have
$$\|x\| \ge \|x + y\| - \|y\|.$$
Replacing $x + y$ by $z$ (that is, $x = z - y$) you get the so-called inverse triangle inequality
$$\|z - y\| \ge \|z\| - \|y\|,$$
where the vectors $x, y$, and hence the vectors $y$ and $z$, are arbitrary.

(b) In the inverse triangle inequality take $z := x$, $y := -Rx$:
$$\|Sx\| = \|x + Rx\| \ge \|x\| - \|Rx\| \ge \|x\| - q\|x\| = (1 - q)\|x\|.$$
(Here the second inequality is a consequence of $-\|Rx\| \ge -\|R\|\,\|x\|$.) Hence there does not exist a vector $x \neq 0$ for which $Sx = 0$ holds (otherwise $\|Sx\| = 0$ and $(1 - q)\|x\| \le 0$ would follow). But this means that $S$ is regular.
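The bound of the perturbation lemma can be tried out numerically. Below is our own illustrative Python sketch in exact rational arithmetic (the helper names `inf_norm` and `inv2` are ours, not from the book): here $\|R\|_\infty = 1/2$, so the lemma guarantees $\|S^{-1}\|_\infty \le 1/(1 - q) = 2$, while the exact value turns out to be $12/7$.

```python
from fractions import Fraction as F

def inf_norm(M):
    # induced maximum norm: largest absolute row sum
    return max(sum(abs(x) for x in row) for row in M)

def inv2(M):
    # inverse of a regular 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - c * b
    return [[d / det, -b / det], [-c / det, a / det]]

R = [[F(0), F(1, 2)], [F(1, 4), F(0)]]
q = inf_norm(R)                                        # 1/2 < 1
S = [[1 + R[0][0], R[0][1]], [R[1][0], 1 + R[1][1]]]   # S = I + R
print(inf_norm(inv2(S)), "<=", 1 / (1 - q))            # 12/7 <= 2
```

The lemma only gives an upper bound; as the example shows, the actual norm of $S^{-1}$ can be noticeably smaller.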
Substitute $x = S^{-1}z$ (that is, $Sx = z$) into the inequality $\|Sx\| \ge (1 - q)\|x\|$ to get
$$\|z\| \ge (1 - q)\|S^{-1}z\|,$$
and you arrive at the estimate of $\|S^{-1}\|$ using (2.4).

When $-1$ is an eigenvalue of the matrix $R$ and the corresponding eigenvector is $v \neq 0$, then
$$q = \|R\| \ge \varrho(R) = |\lambda_{\max}(R)| \ge 1,$$
so the condition of the perturbation lemma is not fulfilled. And really, in this case 0 is an eigenvalue of $I + R$: $(I + R)v = v + (-v) = 0$, and $I + R$ is singular.

Next, with the help of the perturbation lemma we obtain a sufficient condition which ensures that, in case of regularity of the error-free matrix $A$, the matrix $B = A + \delta A$ of the perturbed system (2.23) is also regular. Since $B = A(I + A^{-1}\delta A)$ holds and so
$$\det(B) = \det(A)\det(I + A^{-1}\delta A),$$
the matrix $B$ is regular if and only if so is the matrix $I + A^{-1}\delta A$. Moreover, based on the perturbation lemma, the condition $\|\delta A\| < 1/\|A^{-1}\|$ (holding for some induced matrix norm) is sufficient to ensure the regularity of $I + A^{-1}\delta A$, because in this case from (2.5) it follows that $\|A^{-1}\delta A\| \le \|A^{-1}\|\,\|\delta A\| < 1$.

We can also estimate the norm of the matrix $B^{-1}$, starting from $B^{-1} = (I + A^{-1}\delta A)^{-1}A^{-1}$. Then
$$\|B^{-1}\| \le \|(I + A^{-1}\delta A)^{-1}\|\,\|A^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|\delta A\|}. \tag{2.24}$$
After this preparation we can estimate the difference of the solutions of systems (2.1) and (2.23). Since
$$Ax = b = By = Bx + B\delta x = b + \delta A\,x + B\delta x,$$
there holds $\delta x = -B^{-1}\delta A\,x$, and according to (2.24)
$$\|\delta x\| \le \|B^{-1}\|\,\|\delta A\|\,\|x\| \le \frac{\|A^{-1}\|\,\|\delta A\|}{1 - \|A^{-1}\|\,\|\delta A\|}\,\|x\|,$$
that is,
$$\frac{\|\delta x\|}{\|x\|} \le \frac{\tau}{1 - \tau}, \qquad \tau := \mathrm{cond}(A)\,\frac{\|\delta A\|}{\|A\|}. \tag{2.25}$$
In this way the relative error of the solution can be expressed by the condition number and by the relative error of the data (in our case the latter is $\|\delta A\|/\|A\|$). Later we will see that the condition number plays an important role whenever we work with matrices: for example, in eigenvalue problems and in the solution of systems of ordinary linear differential equations.

2.4 Exercises

1.
Compute the maximum and Manhattan norms of the following matrix:
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & -5 & 6 \\ 7 & 8 & -9 \end{pmatrix}.$$
Find also the vectors $x$ satisfying $\|Ax\|_p = \|A\|_p\|x\|_p$, $p = 1, \infty$.

2. Compute the maximum norm and the condition number of the matrix
$$A = \begin{pmatrix} 2 & -7 \\ 6 & 13 \end{pmatrix}.$$
For which vectors $x, y$ do the equalities $\|Ax\|_\infty = \|A\|_\infty\|x\|_\infty$ and $\|A^{-1}y\|_\infty = \|A^{-1}\|_\infty\|y\|_\infty$ hold?

3. You know the condition for $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ to be regular. Give an expression for $\mathrm{cond}_1(A)$.

4. Compute the condition numbers of the following two matrices (the norm can be chosen arbitrarily):
$$A = \begin{pmatrix} 2 & -5 \\ 0 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} \varepsilon & 0 \\ 0 & \varepsilon^{-1} \end{pmatrix}, \quad 0 < \varepsilon < 1.$$

5. Let the matrix $A$ be
$$A = \begin{pmatrix} 5 & 3 & 6 \\ 6 & 2 & 3 \\ 4 & 2 & 5 \end{pmatrix}. \qquad\text{Then}\qquad A^{-1} = \frac{1}{10}\begin{pmatrix} -4 & 3 & 3 \\ 18 & -1 & -21 \\ -4 & -2 & 8 \end{pmatrix}.$$
Compute the maximum and Manhattan norms of $A$, and the corresponding condition numbers.

6. For a real $n\times n$ matrix $A = (a_{ij}) \in \mathbb{R}^{n\times n}$, derive the following inequalities:
$$\|A\|_p \le n\cdot\max_{i,j}|a_{ij}|, \qquad p = 1, \infty.$$

7. If $A$ is the regular matrix given by (2.7) and $\delta A$ is a perturbation of it, give an upper bound for $\|\delta A\|_\infty$ ensuring the regularity of $A + \delta A$.

8. The following matrix depends on a real parameter $t$:
$$A = A(t) := I_4 + t\begin{pmatrix} -3 & 2 & -7 & 2 \\ 0 & -5 & 3 & 1 \\ 4 & 0 & -9 & 6 \\ -2 & 8 & -4 & 5 \end{pmatrix},$$
where $I_4$ is the $4\times 4$ unit matrix. For which values of $t$ can you be sure of the regularity of $A(t)$?

9. Show that if
$$0 < m \le \frac{\|Ax\|}{\|x\|} \le M \qquad \text{for all } x \neq 0,\ x \in \mathbb{R}^n,$$
holds for some vector norm, then the condition number of $A$ is well defined, and $\mathrm{cond}(A) \le M/m$.

10. Find out which norm corresponds to the MATLAB function norm when it is called as norm(x) for a vector x, and as norm(A,inf) for a matrix A.

11. Compare the MATLAB values obtained from norm(A)*norm(x) and norm(A*x) when x=[1 -1]' and A=[1 2; 3 4].

12. Take the $6\times 6$ "built-in" Hilbert matrix as A and explain and compare the MATLAB numbers cond(A,1), rcond(A) and cond(A,1)*rcond(A).

13. Consider a $1000\times 1000$ random matrix and, using the commands tic and toc, examine how long it takes to compute its 2-, 1-, $\infty$- and Frobenius norms, respectively.
14. Construct the following matrix $A \in \mathbb{R}^{100\times 100}$ and vector $b \in \mathbb{R}^{100}$, and using the backslash operator in MATLAB solve the system $Ax = b$ (i.e., apply the command x = A\b). After that, perturb the vector b, e.g. let b(100) = 1.00001 instead of 1, and solve the system again. Compute the condition number of $A$ (also give an exact formula for $\mathrm{cond}_1(A)$).
$$a_{ij} = \begin{cases} 1 & \text{if } i = j, \\ -1 & \text{if } i < j, \\ 0 & \text{otherwise}, \end{cases} \qquad b = (-98, -97, \ldots, 0, 1)^T.$$

3 Solution of Systems of Linear Equations

© Springer International Publishing AG 2016. G. Stoyan, A. Baran, Elementary Numerical Mathematics for Programmers and Engineers, Compact Textbooks in Mathematics, DOI 10.1007/978-3-319-44660-8_3

You know from algebra: if $A \in \mathbb{R}^{n\times n}$ is a regular matrix, then the linear system $Ax = b$ can be uniquely solved (e.g., with the help of Cramer's rule). The following conditions are equivalent to the non-regularity (that is, to the singularity) of $A$:
1. $\det(A) = 0$,
2. the system $Ax = b$ cannot be solved for all vectors $b$,
3. there exists a vector $b$ for which the linear system has infinitely many solutions,
4. there is a vector $x \neq 0$ such that $Ax$ is the zero vector: $Ax = 0$,
5. 0 is an eigenvalue of the matrix $A$: $Ax = 0\cdot x$, where $x \neq 0$.

3.1 Gaussian Elimination

In this chapter we deal with solving the linear system $Ax = b$. Assume that $A \in \mathbb{R}^{n\times n}$ is a given regular matrix, $b \in \mathbb{R}^n$ is a given (column) vector, and $x \in \mathbb{R}^n$ is the unknown (column) vector to be found. In practice Cramer's rule is not used even in the case of $n = 2$, because it costs more operations than the Gaussian elimination which we discuss below.

First, we write the system $Ax = b$ in detailed form:
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\ \,\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n
\end{aligned} \tag{3.1}$$
To start, we aim to eliminate $x_1$ and assume that $a_{11} \neq 0$. Then we subtract the first equation multiplied by $a_{i1}/a_{11}$ from the $i$th equation, $i > 1$. As a result, the coefficient of $x_1$ in the $i$th row will be zero, that is, the unknown $x_1$ is indeed eliminated there, while the first row remains available to solve later for $x_1$ in terms of the other unknowns.

Denote by $a_{ij} =: a^{(1)}_{ij}$ the elements of $A$ and by $a^{(2)}_{ij}$, $i > 1$, the new elements obtained after the subtraction. Then the previous operations can be described by the following expressions:
$$a^{(2)}_{ij} := a^{(1)}_{ij} - \ell_{i1}a^{(1)}_{1j}, \qquad i, j = 2, \ldots, n, \tag{3.2}$$
where
$$\ell_{i1} := a^{(1)}_{i1}/a^{(1)}_{11}, \qquad i = 2, \ldots, n. \tag{3.3}$$
Here $i = 1$ is excluded because our purpose was just to obtain $a^{(2)}_{i1} = 0$ for $i > 1$. Further, it makes no sense to calculate these zeros in a loop, so $j > 1$. Then, if $b^{(1)}_1 := b_1$ and
$$b^{(2)}_i := b^{(1)}_i - \ell_{i1}b^{(1)}_1, \qquad i = 2, \ldots, n, \tag{3.4}$$
after the first step of this "Gaussian" elimination we have the following system:
$$\begin{aligned}
a^{(1)}_{11}x_1 + a^{(1)}_{12}x_2 + \cdots + a^{(1)}_{1n}x_n &= b^{(1)}_1 \\
a^{(2)}_{22}x_2 + \cdots + a^{(2)}_{2n}x_n &= b^{(2)}_2 \\
&\ \,\vdots \\
a^{(2)}_{n2}x_2 + \cdots + a^{(2)}_{nn}x_n &= b^{(2)}_n
\end{aligned} \tag{3.5}$$
Now, if $a^{(2)}_{22} \neq 0$ holds, then you can perform similar steps as before on the $(n-1)\times(n-1)$ system standing under the first row: with the help of the second row, you can eliminate the unknown $x_2$ from the