Baixe o app para aproveitar ainda mais
Prévia do material em texto
Nonlinear Regression Analysis and Its Applications Nonlinear Regression Analysis and Its Applications Second edition Douglas M. Bates and Donald G. Watts A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York / Chichester / Weinheim / Brisbane / Singapore / Toronto Preface text... This is a preface section text... v Acknowledgments To Mary Ellen and Valery vii Contents Preface v Acknowledgments vii 1 Review of Linear Regression 1 1.1 The Linear Regression Model 1 1.1.1 The Least Squares Estimates 4 1.1.2 Sampling Theory Inference Results 5 1.1.3 Likelihood Inference Results 7 1.1.4 Bayesian Inference Results 7 1.1.5 Comments 7 1.2 The Geometry of Linear Least Squares 9 1.2.1 The Expectation Surface 10 1.2.2 Determining the Least Squares Estimates 12 1.2.3 Parameter Inference Regions 15 1.2.4 Marginal Confidence Intervals 19 1.2.5 The Geometry of Likelihood Results 23 1.3 Assumptions and Model Assessment 24 1.3.1 Assumptions and Their Implications 25 1.3.2 Model Assessment 27 1.3.3 Plotting Residuals 28 ix x CONTENTS 1.3.4 Stabilizing Variance 29 1.3.5 Lack of Fit 30 Problems 31 2 Nonlinear Regression 33 2.1 The Nonlinear Regression Model 33 2.1.1 Transformably Linear Models 35 2.1.2 Conditionally Linear Parameters 37 2.1.3 The Geometry of the Expectation Surface 37 2.2 Determining the Least Squares Estimates 40 2.2.1 The Gauss–Newton Method 2 2 1 41 2.2.2 The Geometry of Nonlinear Least Squares 43 2.2.3 Convergence 50 2.3 Nonlinear Regression Inference Using the Linear Approximation 53 2.3.1 Approximate Inference Regions for Parameters 2 3 1 54 2.3.2 Approximate Inference Bands for the Expected Response 58 2.4 Nonlinear Least Squares via Sums of Squares 62 2.4.1 The Linear Approximation 62 2.4.2 Overshoot 67 Problems 67 Appendix A Data Sets Used in Examples 69 A.1 PCB 69 A.2 Rumford 70 A.3 Puromycin 70 A.4 BOD 71 A.5 Isomerization 73 A.6 α-Pinene 74 A.7 Sulfisoxazole 75 A.8 Lubricant 76 A.9 Chloride 76 A.10 Ethyl Acrylate 76 A.11 Saccharin 79 A.12 Nitrite Utilization 80 A.13 s-PMMA 82 A.14 Tetracycline 82 CONTENTS xi A.15 Oil Shale 82 A.16 Lipoproteins 85 References 87 1 Review of Linear Regression Non sunt multiplicanda entia praeter necessitatem. (Entities are not to be multiplied beyond necessity.) —William of Ockham We begin with a brief review of linear regression, because a thorough grounding in linear regression is fundamental to understanding nonlinear re- gression. For a more complete presentation of linear regression see, for ex- ample, Draper and Smith (1981), Montgomery and Peck (1982), or Seber (1977). Detailed discussion of regression diagnostics is given in Belsley, Kuh and Welsch (1980) and Cook and Weisberg (1982), and the Bayesian approach is discussed in Box and Tiao (1973). Two topics which we emphasize are modern numerical methods and the geometry of linear least squares. As will be seen, attention to efficient com- puting methods increases understanding of linear regression, while the geo- metric approach provides insight into the methods of linear least squares and the analysis of variance, and subsequently into nonlinear regression. 1.1 THE LINEAR REGRESSION MODEL Linear regression provides estimates and other inferential results for the pa- rameters β = (β1, β2, . . . , βP ) T in the model Yn = β1xn1 + β2xn2 + · · ·+ βP xnP + Zn (1.1) = (xn1, . . . , xnP )β + Zn 1 2 REVIEW OF LINEAR REGRESSION In this model, the random variable Yn, which represents the response for case n, n = 1, 2, . . . , N , has a deterministic part and a stochastic part. The deterministic part, (xn1, . . . , xnP )β, depends upon the parameters β and upon the predictor or regressor variables xnp, p = 1, 2, . . . , P . The stochastic part, represented by the random variable Zn, is a disturbance which perturbs the response for that case. The superscript T denotes the transpose of a matrix. The model for N cases can be written Y = Xβ + Z (1.2) where Y is the vector of random variables representing the data we may get, X is the N × P matrix of regressor variables, X = x11 x21 . . . xN1 x12 x22 . . . xN2 x13 x23 . . . xN3 . . . . . . . . . x1P x2P . . . xNP and Z is the vector of random variables representing the disturbances. (We will use bold face italic letters for vectors of random variables.) The deterministic part, Xβ, a function of the parameters and the regres- sor variables, gives the mathematical model or the model function for the responses. Since a nonzero mean for Zn can be incorporated into the model function, we assume that E[Z] = 0 (1.3) or, equivalently, E[Y ] = Xβ We therefore call Xβ the expectation function for the regression model. The matrix X is called the derivative matrix , since the (n, p)th term is the deriva- tive of the nth row of the expectation function with respect to the pth pa- rameter. Note that for linear models, derivatives with respect to any of the parame- ters are independent of all the parameters. If we further assume that Z is normally distributed with Var[Z] = E[ZZT] = σ2I (1.4) where I is an N×N identity matrix, then the joint probability density function for Y , given β and the variance σ2, is p(y|β, σ2) = (2piσ2)−N/2 exp ( − (y −Xβ)T (y −Xβ) 2σ2 ) (1.5) = ( 2piσ2 ) −N/2 exp (−‖y −Xβ‖2 2σ2 ) THE LINEAR REGRESSION MODEL 3 • • • • • • • •• • • • • • • • • • • • • • • • • • • • Age (years) PC B co nc en tra tio n (pp m) 2 4 6 8 10 12 0 5 10 15 20 25 30 Fig. 1.1 Plot of PCB concentration versus age for lake trout. where the double vertical bars denote the length of a vector. When provided with a derivative matrix X and a vector of observed data y, we wish to make inferences about σ2 and the P parameters β. Example: As a simple example of a linear regression model, we consider the con- centration of polychlorinated biphenyls (PCBs) in Lake Cayuga trout as a function of age (Bache, Serum, Youngs and Lisk, 1972). The data set is described in Appendix 1, Section A1.1. A plot of the PCB concentra- tion versus age, Figure 1.1, reveals a curved relationship between PCB concentration and age. Furthermore, there is increasing variance in the PCB concentration as the concentration increases. Since the assump- tion (1.4) requires that the variance of the disturbances be constant, we seek a transformation of the PCB concentration which will stabilize the variance (see Section 1.3.2). Plotting the PCB concentration on a logarithmic scale, as in Figure 1.2a, nicely stabilizes the variance and produces a more nearly linear relationship. Thus, a linear expectation function of the form ln(PCB) = β1 + β2 age could be considered appropriate, where ln denotes the natural logarithm (logarithm to the base e). Transforming the regressor variable (Box and Tidwell, 1962) can produce an even straighter plot, as shown in Figure 1.2b, where we use the cube root of age. Thus a simple expectation 4 REVIEW OF LINEAR REGRESSION • • • • • • • • • • • • • • • • • • • • • • • • • • • • Age (years) PC B co nc en tra tio n (pp m) 2 4 6 8 10 12 0. 5 1 5 10 (a) • • • • • • • • • • • • • • • • • • • • • • • • • • • • Cube root of age PC B co nc en tra tio n (pp m) 1.0 1.2 1.4 1.6 1.8 2.0 2.2 0. 5 1 5 10 (b) Fig. 1.2 Plot of PCB concentration versusage for lake trout. The concentration, on a logarithmic scale, is plotted versus age in part a and versus 3 √ age in part b. function to be fitted is ln(PCB) = β1 + β2 3 √ age (Note that the methods of Chapter 2 can be used to fit models of the form f(x, β, α) = β0 + β1x α1 1 + β2x α2 2 + · · · + βP xαPP by simultaneously estimating the conditionally linear parameters β and the transformation parameters α. The powers α1, . . . , αP are used to transform the factors so that a simple linear model in xα11 , . . . , x αP P is appropriate. In this book we use the power α = 0.33 for the age variable even though, for the PCB data, the optimal value is 0.20.) • 1.1.1 The Least Squares Estimates The likelihood function, or more simply, the likelihood, l(β, σ|y), for β and σ is identical in form to the joint probability density (1.5) except that l(β, σ|y) is regarded as a function of the parameters conditional on the observed data, rather than as a function of the responses conditional on the values of the parameters. Suppressing the constant (2pi)−N/2 we write l(β, σ|y) ∝ σ−N exp (−‖y − Xβ‖2 2σ2 ) (1.6) The likelihood is maximized with respect to β when the residual sum of squares S(β) = ‖y − Xβ‖2 (1.7) THE LINEAR REGRESSION MODEL 5 = N∑ n=1 [ yn − ( P∑ p=1 xnpβp )]2 is a minimum. Thus the maximum likelihood estimate βˆ is the value of β which minimizes S(β). This βˆ is called the least squares estimate and can be written βˆ = (XTX)−1XTy (1.8) Least squares estimates can also be derived by using sampling theory, since the least squares estimator is the minimum variance unbiased estimator for β, or by using a Bayesian approach with a noninformative prior density on β and σ. In the Bayesian approach, βˆ is the mode of the marginal posterior density function for β. All three of these methods of inference, the likelihood approach, the sam- pling theory approach, and the Bayesian approach, produce the same point estimates for β. As we will see shortly, they also produce similar regions of “reasonable” parameter values. First, however, it is important to realize that the least squares estimates are only appropriate when the model (1.2) and the assumptions on the disturbance term, (1.3) and (1.4), are valid. Expressed in another way, in using the least squares estimates we assume: 1. The expectation function is correct. 2. The response is expectation function plus disturbance. 3. The disturbance is independent of the expectation function. 4. Each disturbance has a normal distribution. 5. Each disturbance has zero mean. 6. The disturbances have equal variances. 7. The disturbances are independently distributed. When these assumptions appear reasonable and have been checked using diagnostic plots such as those described in Section 1.3.2, we can go on to make further inferences about the regression model. Looking in detail at each of the three methods of statistical inference, we can characterize some of the properties of the least squares estimates. 1.1.2 Sampling Theory Inference Results The least squares estimator has a number of desirable properties as shown, for example, in Seber (1977): 1. The least squares estimator βˆ is normally distributed. This follows because the estimator is a linear function of Y , which in turn is a linear 6 REVIEW OF LINEAR REGRESSION function of Z. Since Z is assumed to be normally distributed, βˆ is normally distributed. 2. E[βˆ] = β: the least squares estimator is unbiased. 3. Var[βˆ] = σ2(XTX)−1: the covariance matrix of the least squares esti- mator depends on the variance of the disturbances and on the derivative matrix X. 4. A 1− α joint confidence region for β is the ellipsoid (β − βˆ)TXTX(β − βˆ) ≤ Ps2F (P, N − P ; α) (1.9) where s2 = S(βˆ) N − P is the residual mean square or variance estimate based on N−P degrees of freedom, and F (P, N − P ; α) is the upper α quantile for Fisher’s F distribution with P and N − P degrees of freedom. 5. A 1− α marginal confidence interval for the parameter βp is βˆp ± se(βˆp)t(N − P ; α/2) (1.10) where t(N − P ; α/2) is the upper α/2 quantile for Student’s T distri- bution with N − P degrees of freedom and the standard error of the parameter estimator is se(βˆp) = s √{ (XTX)−1 } pp (1.11) with { (XTX)−1 } pp equal to the pth diagonal term of the matrix (XTX)−1. 6. A 1− α confidence interval for the expected response at x0 is xT0 βˆ ± s √ xT0 (X TX)−1x0 t(N − P ; α/2) (1.12) 7. A 1− α confidence interval for the expected response at x0 is xT0 βˆ ± s √ xT0 (X TX)−1x0 t(N − P ; α/2) (1.13) 8. A 1− α confidence band for the response function at any x is given by xTβˆ ± s √ xT(XTX)−1x √ PF (P, N − P ; α) (1.14) The expressions (1.13) and (1.14) differ because (1.13) concerns an interval at a single specific point, whereas (1.14) concerns the band produced by the intervals at all the values of x considered simultaneously. THE LINEAR REGRESSION MODEL 7 1.1.3 Likelihood Inference Results The likelihood l(β, σ | y), equation (1.6), depends on β only through ‖y − Xβ‖, so likelihood contours are of the form ‖y −Xβ‖2 = c (1.15) where c is a constant. A likelihood region bounded by the contour for which c = S(βˆ) [ 1 + P N − P F (P, N − P ; α) ] is identical to a 1 − α joint confidence region from the sampling theory ap- proach. The interpretation of a likelihood region is quite different from that of a confidence region, however. 1.1.4 Bayesian Inference Results As shown in Box and Tiao (1973), the Bayesian marginal posterior density for β, assuming a noninformative prior density for β and σ of the form p(β, σ) ∝ σ−1 (1.16) is p(β|y) ∝ { 1 + (β − βˆ)TXTX(β − βˆ) νs2 } −(ν+P )/2 (1.17) which is in the form of a P -variate Student’s T density with location pa- rameter βˆ, scaling matrix s2(XTX)−1, and ν = N − P degrees of freedom. Furthermore, the marginal posterior density for a single parameter βp, say, is a univariate Student’s T density with location parameter βˆp, scale parameter s2 { (XTX)−1 } pp , and degrees of freedom N − P . The marginal posterior density for the mean of y at x0 is a univariate Student’s T density with location parameter xT0 βˆ, scale parameter s 2xT0 (X TX)−1x0, and degrees of freedom N − P . A highest posterior density (HPD) region of content 1− α is defined (Box and Tiao, 1973) as a region R in the parameter space such that Pr {β ∈ R} = 1 − α and, for β1 ∈ R and β2 6∈ R, p(β1|y) ≥ p(β2|y). For linear models with a noninformative prior, an HPD region is therefore given by the ellipsoid defined in (1.9). Similarly, the marginal HPD regions for βp and x T 0 β are numerically identical to the sampling theory regions (1.11, 1.12, and 1.13). 1.1.5 Comments Although the three approaches to statistical inference differ considerably, they lead to essentially identical inferences. In particular, since the joint confidence, 8 REVIEW OF LINEAR REGRESSION likelihood, and Bayesian HPD regions are identical, we refer to them all as inference regions. In addition, when referring to standard errors or correlations, we will use the Bayesian term “the standard error of βp” when, for the sampling theory or likelihood methods, we should more properly say “the standard error of the estimate of βp”. For linear least squares, any of the approaches can be used. For nonlinear least squares, however, the likelihood approach has the simplest and most direct geometrical interpretation, and so we emphasize it. Example: The PCB data can be used to determine parameter estimates and joint and marginal inference regions. In this linear situation, the regions can be summarized using βˆ, s2, XTX, and ν = N − P . For the ln(PCB) data with 3√ age as the regressor, we have βˆ = (−2.391, 2.300)T , s2 = 0.246 on ν = 26 degrees of freedom, and X T X = [ 28.000 46.941 46.941 83.367 ] ( X T X ) −1 = [ 0.6374 −0.3589 −0.3589 0.2141 ] The joint 95% inference region is then 28.00(β1 + 2.391) 2 + 93.88(β1 + 2.391)(β2 − 2.300) + 83.37(β2 − 2.300)2 = 2(0.246)3.37 = 1.66 the marginal 95% inference interval for the parameter β1 is −2.391 ± (0.496) √ 0.6374(2.056) or −3.21 ≤ β1 ≤ −1.58 and the marginal 95% inference interval for the parameter β2 is 2.300 ± (0.496) √ 0.2141(2.056) or 1.83 ≤ β2 ≤ 2.77 The 95% inference band for the ln(PCB) value at any 3 √ age = x, is −2.391 + 2.300x ± (0.496) √ 0.637 − 0.718x + 0.214x2 √ 2(3.37) These regions are plotted in Figure 1.3. • While it is possible to give formal expressions for the least squares estima- tors and the regression summary quantities in terms of the matrices XTX and (XTX)−1, the use of these matrices for computing the estimates is not recommended. Superior computing methods are presented in Section 1.2.2. THE GEOMETRY OF LINEAR LEAST SQUARES 9 –3.5 –3.0 –2.5 –2.0 –1.5 1. 8 2. 0 2. 2 2. 4 2. 6 2. 8 + β1 β 2 (a) 1.0 1.2 1.4 1.6 1.8 2.0 2.2 – 1 0 1 2 3 Cube root of age ln (P CB co nc en tra tio n) • • • • • • • • • • • • • • • • • • • • • • • • • • • • (b) Fig. 1.3 Inference regions for the model ln(PCB) = β1 + β2 3 √ age. Part a shows the least squares estimates (+), the parameter joint 95% inference region (solid line), and the marginal 95% inference intervals (dotted lines). Part b shows the fitted response (solid line) and the 95% inference band (dotted lines). Finally, the assumptions which lead to the use of the least squares estimates should always be examined when using a regression model. Further discussion on assumptions and their implications is given in Section 1.3. 1.2 THE GEOMETRY OF LINEAR LEAST SQUARES The model (1.2) and assumptions (1.3) and (1.4) lead to the use of the least squares estimate (1.8) which minimizes the residual sum of squares (1.7). As implied by (1.7), S(β) can be regarded as the square of the distance from the data vector y to the expected response vector Xβ. This links the subject of linear regression to Euclidean geometry and linear algebra. The assumption of a normally distributed disturbance term satisfying (1.3) and (1.4) indicates that the appropriate scale for measuring the distance between y and Xβ is the usual Euclidean distance between vectors. In this way the Euclidean geometry of the N -dimensional response space becomes statistically meaningful. This connection between geometry and statistics is exemplified by the use of the term spherical normal for the normal distribution with the assumptions (1.3) and (1.4), because then contours of constant probability are spheres. Note that when we speak of the linear form of the expectation function Xβ, we are regarding it as a function of the parameters β, and that when determining parameter estimates we are only concerned with how the expected response depends on the parameters, not with how it depends on the variables. In the PCB example we fit the response to 3 √ age using linear least squares because the parameters β enter the model linearly. 10 REVIEW OF LINEAR REGRESSION 1.2.1 The Expectation Surface The process of calculating S(β) involves two steps: 1. Using the P -dimensional parameter vector β and the N × P derivative matrix X to obtain the N -dimensional expected response vector η(β) = Xβ, and 2. Calculating the squared distance from η(β) to the observed response y, ‖y − η(β)‖2. The possible expected response vectors η(β) form a P -dimensional expec- tation surface in the N -dimensional response space. This surface is a linear subspace of the response space, so we call it the expectation plane when dealing with a linear model. Example: To illustrate the geometry of the expectation surface, consider just three cases from the ln(PCB) versus 3 √ age data, 3 √ age ln(PCB) 1.26 0.92 1.82 2.15 2.22 2.52 The matrix X is then X = [ 1 1 1 1.26 1.82 2.22 ] which consists of two column vectors x1 = (1, 1, 1) T and x2 = (1.26, 1.82, 2.22) T. These two vectors in the 3-dimensional response space are shown in Figure 1.4b, and correspond to the points β = (1, 0)T and β = (0, 1)T in the parameter plane, shown in Figure 1.4a. The expectation func- tion η(β) = Xβ defines a 2-dimensional expectation plane in the 3-dimensional response space. This is shown in Figure 1.4c, where the parameter lines corresponding to the lines β1 = −3, . . . , 5 and β2 = −2, . . . , 2, shown in Figure 1.4a, are given. A parameter line is as- sociated with the parameter which is varying so the lines corresponding to β1 = −3, . . . , 5 (dotdashed lines) are called β2 lines. Note that the parameter lines in the parameter plane are straight, par- allel, and equispaced, and that their images on the expectation plane are also straight, parallel, and equispaced. Because the vector x1 is shorter than x2 (‖x1‖ = √ 3 while ‖x2‖ = √ 9.83), the spacing between the lines of constant β1 on the expectation plane is less than that be- tween the lines of constant β2. Also, the vectors x1 and x2 are not orthogonal. The angle ω between them can be calculated from cos ω = xT1 x2 ‖x1‖‖x2‖ THE GEOMETRY OF LINEAR LEAST SQUARES 11 β1 β 2 –2 0 2 4 – 2 – 1 0 1 2 • • (0,1)T (1,0)T (a) • • (1,1,1)T (1.26,1.82,2.22)T 1 2 η1 1 2 η3 1 2η2 (b) 1 2 η1 1 2 η2 β1 = 0 β1 = 2 β2 = 0 β2 = –1 β2 = –2 (c) Fig. 1.4 Expectation surface for the 3-case PCB example. Part a shows the parameter plane with β1 parameter lines (dashed) and β2 parameter lines (dot–dashed). Part b shows the vectors x1 (dashed line) and x2 (dot–dashed line) in the response space. The end points of the vectors correspond to β = (1, 0)T and β = (0, 1)T respectively. Part c shows a portion of the expectation plane (shaded) in the response space, with β1 parameter lines (dashed) and β2 parameter lines (dot–dashed). 12 REVIEW OF LINEAR REGRESSION = 5.30√ (3)(9.83) = 0.98 to be about 11◦, so the parameter lines on the expectation plane are not at right angles as they are on the parameter plane. As a consequence of the unequal length and nonorthogonality of the vectors, unit squares on the parameter plane map to parallelograms on the expectation plane. The area of the parallelogram is ‖x1‖‖x2‖ sin ω = ‖x1‖‖x2‖ √ 1 − cos2 ω (1.18) = √ (xT1 x1)(x T 2 x2)− (xT1 x2)2 = √∣∣XTX∣∣ That is, the Jacobian determinant of the transformation from the pa- rameter plane to the expectation plane is a constant equal to ∣∣XTX∣∣1/2. Conversely, the ratio of areas in the parameter plane to those on the expectation plane is ∣∣XTX∣∣−1/2. • The simple linear mapping seen in the above example is true for all linear regression models. That is, for linear models, straight parallel equispaced lines in the parameter space map to straight parallel equispaced lines on the expectation plane in the response space. Consequently, rectangles in one plane map to parallelepipeds in the other plane, and circles or spheres in one plane map to ellipses or ellipsoids in the other plane. Furthermore, the Jacobian determinant, ∣∣∣XTX∣∣∣1/2, is a constant for linear models, and so regions of fixed size in one plane map to regions of fixed size in the other, no matter where they are on the plane. These properties, which make linear least squares especially simple, are discussed further in Section 1.2.3. 1.2.2 Determiningthe Least Squares Estimates The geometric representation of linear least squares allows us to formulate a very simple scheme for determining the parameters estimates βˆ. Since the expectation surface is linear, all we must do to determine the point on the surface which is closest to the point y, is to project y onto the expectation plane. This gives us ηˆ, and βˆ is then simply the value of β corresponding to ηˆ. One approach to defining this projection is to observe that, after the projec- tion, the residual vector y−ηˆ will be orthogonal, or normal, to the expectation plane. Equivalently, the residual vector must be orthogonal to all the columns of the X matrix, so XT(y −Xβˆ) = 0 THE GEOMETRY OF LINEAR LEAST SQUARES 13 which is to say that the least squares estimate βˆ satisfies the normal equations XTXβˆ = XTy (1.19) Because of (1.19) the least squares estimates are often written βˆ = (XTX)−1XTy as in (1.8). However, another way of expressing the estimate, and a more sta- ble way of computing it, involves decomposing X into the product of an orthogonal matrix and an easily inverted matrix. Two such decompositions are the QR decomposition and the singular value decomposition (Dongarra, Bunch, Moler and Stewart, 1979, Chapters 9 and 11). We use the QR decom- position, where X = QR with the N ×N matrix Q and the N × P matrix R constructed so that Q is orthogonal (that is, QTQ = QQT = I) and R is zero below the main diagonal. Writing R = [ R1 0 ] where R1 is P × P and upper triangular, and Q = [Q1|Q2] with Q1 the first P columns and Q2 the last N − P columns of Q, we have X = QR = Q1R1 (1.20) Performing a QR decomposition is straightforward, as is shown in Appendix 2. Geometrically, the columns of Q define an orthonormal, or orthogonal, basis for the response space with the property that the first P columns span the expectation plane. Projection onto the expectation plane is then very easy if we work in the coordinate system given by Q. For example we transform the response vector to w = QTy (1.21) with components w1 = Q T 1 y (1.22) and w2 = Q T 2 y (1.23) The projection of w onto the expectation plane is then simply[ w1 0 ] in the Q coordinates and ηˆ = Q [ w1 0 ] = Q1w1 (1.24) 14 REVIEW OF LINEAR REGRESSION • y 1 2 η1 2 η2 • • • q1 q3 q2 (a) • w 1 2 q1 1 2 2 q2 w2 (b) Fig. 1.5 Expectation surface for the 3-case PCB example. Part a shows a portion of the expectation plane (shaded) in the response space with β1 parameter lines (dashed) and β2 parameter lines (dot–dashed) together with the response vector y. Also shown are the orthogonal unit vectors q1 and q2 in the expectation plane, and q3 orthogonal to the plane. Part b shows the response vector w, and a portion of the expectation plane (shaded) in the rotated coordinates given by Q. in the original coordinates. Example: As shown in Appendix 2, the QR decomposition (1.20) of the matrix X = [ 1 1 1 1.26 1.82 2.22 ] for the 3-case PCB example is[ 0.5774 0.5774 0.5774 −0.7409 0.0732 0.6677 0.3432 −0.8132 0.4700 ][ 1.7321 0 0 3.0600 0.6820 0 ] which gives [equation (1.21)] w = [ 3.23 1.16 −0.24 ] In Figure 1.5a we show the expectation plane and observation vector in the original coordinate system. We also show the vectors q1, q2, q3, THE GEOMETRY OF LINEAR LEAST SQUARES 15 which are the columns of Q. It can be seen that q1 and q2 lie in the expectation plane and q3 is orthogonal to it. In Figure 1.5b we show, in the transformed coordinates, the observation vector and the expectation plane, which is now horizontal. Note that projecting w onto the expectation plane is especially simple, since it merely requires replacing the last element in w by zero. • To determine the least squares estimate we must find the value βˆ corre- sponding to ηˆ. Since ηˆ = Xβˆ using (1.24) and (1.20) R1βˆ = w1 (1.25) and we solve for βˆ by back-substitution (Stewart, 1973). Example: For the complete ln(PCB), 3 √ age data set, R1 = [ 5.29150 0 8.87105 2.16134 ] and w1 = (7.7570, 4.9721) T , so βˆ = (−2.391, 2.300)T . • 1.2.3 Parameter Inference Regions Just as the least squares estimates have informative geometric interpretations, so do the parameter inference regions (1.9), (1.10), (1.15) and those derived from (1.17). Such interpretations are helpful for understanding linear regres- sion, and are essential for understanding nonlinear regression. (The geometric interpretation is less helpful in the Bayesian approach, so we discuss only the sampling theory and likelihood approaches.) The main difference between the likelihood and sampling theory geometric interpretations is that the likelihood approach centers on the point y and the length of the residual vector at η(β) compared to the shortest residual vector, while the sampling theory approach focuses on possible values of η(β) and the angle that the resulting residual vectors could make with the expectation plane. 1.2.3.1 The Geometry of Sampling Theory Results To develop the geomet- ric basis of linear regression results from the sampling theory approach, we transform to the Q coordinate system. The model for the random variable W = QTY is W = Rβ + QTZ or U = W −Rβ (1.26) where U = QTZ. 16 REVIEW OF LINEAR REGRESSION The spherical normal distribution of Z is not affected by the orthogonal transformation, so U also has a spherical normal distribution. This can be es- tablished on the basis of the geometry, since the spherical probability contours will not be changed by a rigid rotation or reflection, which is what an orthogo- nal transformation must be. Alternatively, this can be established analytically because QTQ = I , so the determinant of Q is ±1 and ‖Qx‖ = ‖x‖ for any N - vector x. Now the joint density for the random variables Z = (Z1, . . . , Zn) T is pZ(z) = (2piσ 2)−N/2 exp (−zTz 2σ2 ) and, after transformation, the joint density for U = QTZ is pU (u) = (2piσ 2)−N/2|Q| exp ( −uTQTQu 2σ2 ) = (2piσ2)−N/2 exp (−uTu 2σ2 ) From (1.26), the form of R leads us to partition U into two components: U = [ U 1 U 2 ] where U 1 consists of the first P elements of U , and U 2 the remaining N −P elements. Each of these components has a spherical normal distribution of the appropriate dimension. Furthermore, independence of elements in the original disturbance vector Z leads to independence of the elements of U , so the components U 1 and U 2 are independent. The dimensions νi of the components U i, called the degrees of freedom, are ν1 = P and ν2 = N − P . The sum of squares of the coordinates of a ν-dimensional spherical normal vector has a σ2χ2 distribution on ν degrees of freedom, so ‖U1‖2 ∼ σ2χ2P ‖U2‖2 ∼ σ2χ2N−P where the symbol ∼ is read “is distributed as.” Using the independence of U 1 and U 2, we have ‖U1‖2/P ‖U2‖2/(N − P ) ∼ F (P, N − P ) (1.27) since the scaled ratio of two independent χ2 random variables is distributed as Fisher’s F distribution. The distribution (1.27) gives a reference distribution for the ratio of the squared component lengths or, equivalently, for the angle that the disturbance THE GEOMETRY OF LINEAR LEAST SQUARES 17 vector makes with the horizontal plane. We may therefore use (1.26) and (1.27) to test the hypothesis that β equals some specific value, say β0, by calculating the residual vector u0 = QTy −Rβ0 and comparing the lengths of the components u01 and u 0 2 as in (1.27). The reasoning here is that a large ‖u01‖ compared to ‖u02‖ suggests that the vector y is not very likely to have been generated by the model (1.2) with β = β0, since u0 has a suspiciously large component in the Q1 plane. Notethat ‖u02‖2 N − P = S(βˆ) N − P = s 2 and ‖u01‖2 = ‖R1β0 −w1‖2 (1.28) and so the ratio (1.27) becomes ‖R1β0 −w1‖2 Ps2 (1.29) Example: We illustrate the decomposition of the residual u for testing the null hypothesis H0 : β = (−2.0, 2.0)T versus the alternative HA : β 6= (−2.0, 2.0)T for the full PCB data set in Figure 1.6. Even though the rotated data vector w and the expectation surface for this example are in a 28-dimensional space, the relevant distances can be pictured in the 3- dimensional space spanned by the expectation surface (vectors q1 and q2) and the residual vector. The scaled lengths of the components u1 and u2 are compared to determine if the point β 0 = (−2.0, 2.0)T is reasonable. The numerator in (1.29) is∥∥∥[ 5.29150 0 8.87105 2.16134 ] [ −2.0 2.0 ] − [ 7.7570 4.9721 ]∥∥∥2 = 0.882 The ratio is then 0.882/(2 × 0.246) = 1.79, which corresponds to a tail probability (or p value) of 0.19 for an F distribution with 2 and 26 degrees of freedom. Since the probability of obtaining a ratio at least as large as 1.79 is 19%, we do not reject the null hypothesis. • A 1 − α joint confidence region for the parameters β consists of all those values for which the above hypothesis test is not rejected at level α. Thus, a value β0 is within a 1− α confidence region if ‖u01‖2/P ‖u02‖2/(N − P ) ≤ F (P, N − P ; α) 18 REVIEW OF LINEAR REGRESSION • • w u0 u2 0 u1 0 R1 β0 7 8 9 10 q1 4 5 6 7 2 3 4 w2 q2 β2 = 2 β2 = 3 β1 = –2 β1 = –1 Fig. 1.6 A geometric interpretation of the test H0 : β = (−2.0, 2.0)T for the full PCB data set. We show the projections of the response vector w and a portion of the expectation plane projected into the 3-dimensional space given by the tangent vectors q1 and q2, and the orthogonal component of the response vector, w2. For the test point β0, the residual vector u0 is decomposed into a tangential component u01 and an orthogonal component u02. THE GEOMETRY OF LINEAR LEAST SQUARES 19 Since s2 does not depend on β0, the points inside the confidence region form a disk on the expectation plane defined by ‖u1‖2 ≤ Ps2F (P, N − P ; α) Furthermore, from (1.25) and (1.28) we have ‖u1‖2 = ‖R1(β − βˆ)‖2 so a point on the boundary of the confidence region in the parameter space satisfies R1(β − βˆ) = √ Ps2F (P, N − P ; α) d where ‖d‖ = 1. That is, the confidence region is given by{ β = βˆ + √ Ps2F (P, N − P ; α)R−11 d | ‖d‖ = 1 } (1.30) Thus the region of “reasonable” parameter values is a disk centered at R1βˆ on the expectation plane and is an ellipse centered at βˆ in the parameter space. Example: For the ln(PCB) versus 3 √ age data, βˆ = (−2.391, 2.300)T and s2 = 0.246 based on 26 degrees of freedom, so the 95% confidence disk on the transformed expectation surface is R1β = [ 7.7570 4.9721 ] + 1.288 [ cos ω sin ω ] where 0 ≤ ω ≤ 2pi. The disk is shown in the expectation plane in Figure 1.7a, and the corresponding ellipse β = [ −2.391 2.300 ] + 1.288 [ 0.18898 0 −0.77566 0.46268 ] [ cos ω sin ω ] is shown in the parameter plane in Figure 1.7b. • 1.2.4 Marginal Confidence Intervals We can create a marginal confidence interval for a single parameter, say β1, by “inverting” a hypothesis test of the form H0 : β1 = β 0 1 versus HA : β1 6= β01 Any β01 for which H0 is not rejected at level α is included in the 1−α confidence interval. To perform the hypothesis test, we choose any parameter vector with β1 = β 0 1 , say ( β01 ,0 T )T , calculate the transformed residual vector u0, and divide it into three components: the first component u01 of dimension P − 1 20 REVIEW OF LINEAR REGRESSION • w• 7 8 9 10 q14 5 6 7 2 3 4 q2 –3.5 –3.0 –2.5 –2.0 –1.5 1. 8 2. 0 2. 2 2. 4 2. 6 2. 8 + β1 β 2 Fig. 1.7 The 95% confidence disk and parameter confidence region for the PCB data. Part a shows the response vector w and a portion of the expectation plane projected into the 3-dimensional space given by the tangent vectors q1 and q2, and the orthogonal component of the response vector, w2. The 95% confidence disk (shaded) in the expectation plane (part a) maps to the elliptical confidence region (shaded) in the parameter plane (part b). and parallel to the hyperplane defined by β1 = β 0 1 ; the second component u02 of dimension 1 and in the expectation plane but orthogonal to the β 0 1 hyperplane; and the third component u03 of length (N −P )s2 and orthogonal to the expectation plane. The component u02 is the same for any parameter β with β1 = β 0 1 , and, assuming that the true β1 is β 0 1 , the scaled ratio of the corresponding random variables U2 and U 3 has the distribution U22 /1 ‖U3‖2/(N − P ) ∼ F (1, N − P ) Thus we reject H0 at level α if( u02 )2 s2F (1, N − P ; α) Example: To test the null hypothesis H0 : β1 = −2.0 versus the alternative HA : β1 6= −2.0 for the complete PCB data set, we decompose the transformed residual vector at β0 = (−2.0, 2.2)T into three components as shown in Figure 1.8 and calculate the ratio( u02 )2 s2 = 0.240 0.246 THE GEOMETRY OF LINEAR LEAST SQUARES 21 • • u w u3 u1u2 7 8 9 10 q1 4 5 6 7 2 3 4 q2 w2 β1 = –2 Fig. 1.8 A geometric interpretation of the test H0:β1 = −2.0 for the full PCB data set. We show the response vector w, and a portion of the expectation plane projected into the 3-dimensional space given by the tangent vectors q1 and q2, and the orthogonal component of the response vector, w2. For a representative point on the line β1 = −2 the residual vector u is decomposed into a tangential component u01 along the line, a tangential component u02 perpendicular to the line, and an orthogonal component u 0 3. 22 REVIEW OF LINEAR REGRESSION = 0.97 This corresponds to a p value of 0.33, and so we do not reject the null hypothesis. • We can create a 1− α marginal confidence interval for β1 as all values for which ( u02 )2 ≤ s2F (1, N − P ; α) or, equivalently, |u02| ≤ s · t(N − P ; α/2) (1.31) Since |u02| is the distance from the point R1βˆ to the line corresponding to β1 = β 0 1 on the transformed parameter plane, the confidence interval will include all values β01 for which the corresponding parameter line intersects the disk { R1βˆ + st(N − P ; α/2)d | ‖d‖ = 1 } (1.32) Instead of determining the value of |u02| for each β01 , we take the disk (1.32) and determine the minimum and maximum values of β1 for points on the disk. Writing r1 for the first row of R−11 , the values of β1 corresponding to points on the expectation plane disk are r1(R1βˆ + s · t(N − P ; α/2)d) = βˆ1 + s · t(N − P ; α/2)r1d and the minimum and maximum occur for the unit vectors in the direction of r1; that is, d = ± (r1)T /‖r1‖. This gives the confidence interval βˆ1 ± s‖r1‖t(N − P ; α/2) In general, a marginal confidence interval for parameter βp is βˆp ± s‖rp‖t(N − P ; α/2) (1.33) where rp is the pth row of R−11 . The quantity se(βˆp) = s‖rp‖ (1.34) is called the standard error for the pth parameter. Since (XTX)−1 = (RT1 R1) −1 = R−11 R −T 1 ‖rp‖2 = { (XTX)−1 } pp , so the standard error can be written as in equation (1.11). A convenient summary of the variability of the parameter estimates can be obtained by factoring R−11 as R−11 = diag(‖r1‖, ‖r2‖, . . . , ‖rP ‖)L (1.35) THE GEOMETRY OF LINEAR LEAST SQUARES 23 where L has unit length rows. The diagonal matrix provides the parameter standard errors, while the correlation matrix C = LLT (1.36) gives the correlations between the parameter estimates. Example: Forthe ln(PCB) data, βˆ = (−2.391, 2.300)T, s2 = 0.246 with 26 degrees of freedom, and R −1 1 = [ 5.29150 0 8.87105 2.16134 ] −1 = [ 0.18898 0 −0.77566 0.46268 ] = [ 0.798 0 0 0.463 ] [ 0.237 0 −0.972 1 ] which gives standard errors of 0.798 √ 0.246 = 0.396 for β1 and 0.463 √ 0.246 = 0.230 for β2. Also C = [ 1 −0.972 −0.972 1 ] so the correlation between β1 and β2 is −0.97. The 95% confidence intervals for the parameters are given by −2.391 ± 2.056(0.396) and 2.300 ± 2.056(0.230), which are plotted in Figure 1.3a. • Marginal confidence intervals for the expected response at a design point x0 can be created by determining which hyperplanes formed by constant xT0 β intersect the disk (1.32). Using the same argument as was used to derive (1.33), we obtain a standard error for the expected response at x0 as s‖xT0 R−11 ‖, so the confidence interval is xT0 βˆ ± s‖xT0 R−11 ‖t(N − P ; α/2) (1.37) Similarly, a confidence band for the response function is xTβˆ ± s‖xTR−11 ‖ √ PF (P, N − P ; α) (1.38) Example: A plot of the fitted expectation function and the 95% confidence bands for the PCB example was given in Figure 1.3b. • Ansley (1985) gives derivations of other sampling theory results in linear regression using the QR decomposition, which, as we have seen, is closely related to the geometric approach to regression. 1.2.5 The Geometry of Likelihood Results The likelihood function indicates the plausibility of values of η relative to y, and consequently has a simple geometrical interpretation. If we allow η to 24 REVIEW OF LINEAR REGRESSION take on any value in the N -dimensional response space, the likelihood contours are spheres centered on y. Values of η of the form η = Xβ generate a P - dimensional expectation plane, and so the intersection of the plane with the likelihood spheres produces disks. Analytically, the likelihood function (1.6) depends on η through ‖η − y‖2 = ‖QT(η − y)‖2 (1.39) = ‖QT1 (η − y)‖2 + ‖QT2 (η − y)‖2 = ‖w(β)−w1‖2 + ‖w2‖2 where w(β) = QT1 η and Q T 2 η = 0. A constant value of the total sum of squares specifies a disk of the form ‖w(β)−w1‖2 = c on the expectation plane. Choosing c = Ps2F (P, N − P ; α) produces the disk corresponding to a 1−α confidence region. In terms of the total sum of squares, the contour is S(β) = S(βˆ) { 1 + P N − P F (P, N − P ; α) } (1.40) As shown previously, and illustrated in Figure 1.7, this disk transforms to an ellipsoid in the parameter space. 1.3 ASSUMPTIONS AND MODEL ASSESSMENT The statistical assumptions which lead to the use of the least squares esti- mates encompass several different aspects of the regression model. As with any statistical analysis, if the assumptions on the model and data are not appropriate, the results of the analysis will not be valid. Since we cannot guarantee a priori that the different assumptions are all valid, we must proceed in an iterative fashion as described, for example, in Box, Hunter and Hunter (1978). We entertain a plausible statistical model for the data, analyze the data using that model, then go back and use diagnostics such as plots of the residuals to assess the assumptions. If the diagnostics indicate failure of assumptions in either the deterministic or stochastic com- ponents of the model, we must modify the model or the analysis and repeat the cycle. It is important to recognize that the design of the experiment and the method of data collection can affect the chances of assumptions being valid in a particular experiment. In particular randomization can be of great help in ensuring the appropriateness of all the assumptions, and replication allows greater ability to check the appropriateness of specific assumptions. ASSUMPTIONS AND MODEL ASSESSMENT 25 1.3.1 Assumptions and Their Implications The assumptions, as listed in Section 1.1.1, are: 1. The expectation function is correct. Ensuring the validity of this as- sumption is, to some extent, the goal of all science. We wish to build a model with which we can predict natural phenomena. It is in building the mathematical model for the expectation function that we frequently find ourselves in an iterative loop. We proceed as though the expecta- tion function were correct, but we should be prepared to modify it as the data and the analyses dictate. In almost all linear, and in many nonlinear, regression situations we do not know the “true” model, but we choose a plausible one by examining the situation, looking at data plots and cross-correlations, and so on. As the analysis proceeds we can modify the expectation function and the assumptions about the distur- bance term to obtain a more sensible and useful answer. Models should be treated as just models, and it must be recognized that some will be more appropriate or adequate than others. Nevertheless, assumption (1) is a strong one, since it implies that the expectation function includes all the important predictor variables in precisely the correct form, and that it does not include any unimportant predictor variables. A useful technique to enable checking the adequacy of a model function is to include replications in the experiment. It is also important to actually manipulate the predictor variables and randomize the order in which the experiments are done, to ensure that causation, not correlation, is being determined (Box, 1960). 2. The response is expectation function plus disturbance. This assumption is important theoretically, since it allows the probability density func- tion for the random variable Y describing the responses to be simply calculated from the probability density function for the random variable Z describing the disturbances. Thus, pY (y|β, σ2) = pZ(y −Xβ|σ2) In practice, this assumption is closely tied to the assumption of constant variance of the disturbances. It may be the case that the disturbances can be considered as having constant variance, but as entering the model multiplicatively, since in many phenomena, as the level of the “signal” increases, the level of the “noise” increases. This lack of additivity of the disturbance will manifest itself as a nonconstant variance in the diagnostic plots. In both cases, the corrective action is the same—either use weighted least squares or take a transformation of the response as was done in Example PCB 1. 3. The disturbance is independent of the expectation function. This as- sumption is closely related to assumption (2), since they both relate to 26 REVIEW OF LINEAR REGRESSION appropriateness of the additive model. One of the implications of this assumption is that the control or predictor variables are measured per- fectly. Also, as a converse to the implication in assumption (1) that all important variables are included in the model, this assumption implies that any important variables which are not included are not systemat- ically related to the response. An important technique to improve the chances that this is true is to randomize the order in which the ex- periments are done, as suggested by Fisher (1935). In this way, if an important variable has been omitted, its effect may be manifested as a disturbance (and hence simply inflate the variability of the observa- tions) rather than being confounded with one of the predictor effects (and hence bias the parameter estimates). And, of course, it is impor- tant to actually manipulate the predictor variables not merely record their values. 4. Each disturbance has a normal distribution. The assumption of normal- ity of the disturbances is important, since this dictates the form of the sampling distribution of the random variables describing the responses, and through this, the likelihood function for the parameters. This leads to the criterion of least squares, which is enormouslypowerful because of its mathematical tractability. For example, given a linear model, it is possible to write down the analytic solution for the parameter estima- tors and to show [Gauss’s theorem (Seber, 1977)] that the least squares estimates are the best both individually and in any linear combination, in the sense that they have the smallest mean square error of any lin- ear estimators. The normality assumption can be justified by appealing to the central limit theorem, which states that the resultant of many disturbances, no one of which is dominant, will tend to be normally distributed. Since most experiments involve many operations to set up and measure the results, it is reasonable to assume, at least tentatively, that the disturbances will be normally distributed. Again, the assump- tion of normality will be more likely to be appropriate if the order of the experiments is randomized. The assumption of normality may be checked by examining the residuals. 5. Each disturbance has zero mean. This assumption is primarily a sim- plifying one, which reduces the number of unknown parameters to a manageable level. Any nonzero mean common to all observations can be accommodated by introducing a constant term in the expectation function, so this assumption is unimportant in linear regression. It can be important in nonlinear regression, however, where many expectation functions occur which do not include a constant. The main implication of this assumption is that there is no systematic bias in the disturbances such as could be caused by an unsuspected influential variable. Hence, we see again the value of randomization. ASSUMPTIONS AND MODEL ASSESSMENT 27 6. The disturbances have equal variances. This assumption is more impor- tant practically than theoretically, since a solution exists for the least squares estimation problem for the case of unequal variances [see, e.g., Draper and Smith (1981) concerning weighted least squares]. Practi- cally, however, one must describe how the variances vary, which can only be done by making further assumptions, or by using information from replications and incorporating this into the analysis through gener- alized least squares, or by transforming the data. When the variance is constant, the likelihood function is especially simple, since the parame- ters can be estimated independently of the nuisance parameter σ2. The main implication of this assumption is that all data values are equally unreliable, and so the simple least squares criterion can be used. The appropriateness of this assumption can sometimes be checked after a model has been fitted by plotting the residuals versus the fitted values, but it is much better to have replications. With replications, we can check the assumption before even fitting a model, and can in fact use the replication averages and variances to determine a suitable variance- stabilizing transformation; see Section 1.3.2. Transforming to constant variance often has the additional effect of making the disturbances be- have more normally. This is because a constant variance is necessarily independent of the mean (and anything else, for that matter), and this independence property is fundamental to the normal density. 7. The disturbances are distributed independently. The final assumption is that the disturbances in different experiments are independent of one another. This is an enormously simplifying assumption, because then the joint probability density function for the vector Y is just the product of the probability densities for the individual random variables Yn, n = 1, 2, . . . , N . The implication of this assumption is that the dis- turbances on separate runs are not systematically related, an assump- tion which can usually be made to be more appropriate by randomiza- tion. Nonindependent disturbances can be treated by generalized least squares, but, as in the case where there is nonconstant variance, modi- fications to the model must be made either through information gained from the data, or by additional assumptions as to the nature of the interdependence. 1.3.2 Model Assessment In this subsection we present some simple methods for verifying the appropri- ateness of assumptions, especially through plots of residuals. Further discus- sion on regression diagnostics for linear models is given in Hocking (1983), and in the books by Belsley et al. (1980), Cook and Weisberg (1982), and Draper and Smith (1981). In Chapter 3 we discuss model assessment for nonlinear models. 28 REVIEW OF LINEAR REGRESSION 1.3.3 Plotting Residuals A simple, effective method for checking the adequacy of a model is to plot the studentized residuals, zˆn/s √ 1− hnn, versus the predictor variables and any other possibly important “lurking” variables (Box, 1960; Joiner, 1981). The term hnn is the nth diagonal term of the “hat” matrix H = X(X TX)−1XT = Q1Q T 1 , and zˆn is the residual for the nth case, zˆn = yn − yˆn A relationship between the residuals and any variable then suggests that there is an effect due to that variable which has not been accounted for. Features to look for include systematic linear or curvilinear behavior of the residuals with respect to a variable. Important common “lurking” variables include time or the order number of the experiment; if a plot of residuals versus time shows suspicious behavior, such as runs of residuals of the same sign, then the assumption of independence of the disturbances may be inappropriate. Plotting residuals versus the fitted values yˆn is also useful, since such plots can reveal outliers or general inadequacy in the form of the expectation func- tion. It is also a very effective plot for revealing whether the assumption of constant variance is appropriate. The most common form of nonconstant variance is an increase in the variability in the responses when the level of the response changes. This behavior was noticed in the original PCB data. If a regression model is fitted to such data, the plot of the studentized residuals versus the fitted values tends to have a wedge-shaped pattern. When residual plots or the data themselves give an indication of non- constant variance, the estimation procedure should be modified. Possible changes include transforming the data as was done with the PCB data or using weighted least squares. A quantile–quantile plot (Chambers, Cleveland, Kleiner and Tukey, 1983) of the studentized residuals versus a normal distribution gives a direct check on the assumption of normality. If the expectation function is correct and the assumption of normality is appropriate, such a normal probability plot of the residuals should be a fairly straight line. Departures from a straight line therefore suggest inappropriateness of the normality assumption, although, as demonstrated in Daniel and Wood (1980), considerable variability can be expected in normal plots. Normal probability plots are also good for revealing outliers. Example: Plots of residuals are given in Figure 1.9 for the fit of ln(PCB) to 3 √ age. Since the fitted values are a linear function of the regressor variable 3 √ age, the form of the plot of the studentized residuals versus yˆ will be the same as that versus 3 √ age, so we only display the former. The plot versus yˆ and the quantile–quantile plot are well behaved. Neither plot reveals outliers. • ASSUMPTIONS AND MODEL ASSESSMENT 29 • • • • • • • • • • • • • • • • • • • • • • • • • • • • Fitted values St ud en tiz ed re sid ua ls 0.0 0.5 1.0 1.5 2.0 2.5 3.0 – 2 – 1 0 1 2 (a) • • • • • • • •• •• • • • • •• • ••• • • • • • • • Normal quantile St ud en tiz ed re sid ua ls –2 –1 0 1 2 – 2 – 10 1 2 (b) Fig. 1.9 Studentized residuals for the PCB data plotted versus fitted values in part a and versus normal quantiles in part b. 1.3.4 Stabilizing Variance An experiment which includes replications allows further tests to be made on the appropriateness of assumptions. For example, even before an expectation function has been proposed, it is possible to check the assumption of constant variance by using an analysis of variance to get averages and variances for each set of replications and plotting the variances and standard deviations versus the averages. If the plots show systematic relationships, then one can use a variance-stabilizing procedure to transform to constant variance. One procedure is to try a range of power transformations in the form (Box and Cox, 1964) y(λ) = { yλ−1 λ λ 6= 0 ln y λ = 0 We calculate and plot variances versus averages for y(λ), λ = 0,±0.5,±1, . . . and select that value of λ for which the variance appears to be most stable. Alternatively, for a random variable Y , if there is a power relationship between the standard deviation σ and the mean µ such that σ ∝ µα, it can be shown (Draper and Smith, 1981; Montgomery and Peck, 1982; Box et al., 1978) that the variance of the transformed random variable Y 1−α will be approximately constant. Variance-stabilizing transformations usually have the additional benefit of making the distribution of the disturbances appear more nearly normal, as discussed in Section 1.3.1. Alternatively, one can use the replication infor- mation to assist in choosing a form of weighting for weighted least squares. Example: 30 REVIEW OF LINEAR REGRESSION • •• • •• • • Average (ppm) St an da rd d ev ia tio n (pp m) 0 5 10 15 0 2 4 6 8 10 (a) • • • • • • • • Average (log scale) St an da rd d ev ia tio n (lo g s ca le) 0.0 0.5 1.0 1.5 2.0 2.5 0. 2 0. 3 0. 4 0. 5 0. 6 0. 7 (b) Fig. 1.10 Replication standard deviations plotted versus replication averages for the PCB data in part a and for the ln(PCB) data in part b. A plot of the standard deviations versus the averages for the original PCB data is given in Figure 1.10a. It can be seen that there is a good straight line relationship between s and y¯, and so the variance- stabilizing technique leads to the logarithmic transformation. In Fig- ure 1.10b we plot the standard deviations versus the averages for the ln(PCB) data. This plot shows no systematic relationship, and hence substantiates the effectiveness of the logarithmic transformation in sta- bilizing the variance. • 1.3.5 Lack of Fit When the data set includes replications, it is also possible to perform tests for lack of fit of the expectation function. Such analyses are based on an analysis of variance in which the residual sum of squares S(βˆ) with N − P degrees of freedom is decomposed into the replication sum of squares Sr (equal to the total sum of squares of deviations of the replication values about their averages) with, say, νr degrees of freedom, and the lack of fit sum of squares Sl = S(βˆ)−Sr, with νl = fN −P − νr degrees of freedom. We then compare the ratio of the lack of fit mean square over the replication mean square with the appropriate value in the F table. That is, we compare Sl/νl Sr/νr with F (νl, νr; α) to determine whether there is significant lack of fit at level α. The geometric justification for this analysis is that the replication subspace is always orthog- onal to the subspace containing the averages and the expectation function. PROBLEMS 31 Table 1.1 Lack of fit analysis of the model fitted to the PCB data Source Degrees of Sum of Mean F Ratio p Value Freedom Squares Square Lack of fit 9 1.923 0.214 0.812 0.61 Replication 17 4.475 0.263 Residuals 26 6.398 0.246 If no lack of fit is found, then the lack of fit analysis of variance has served its purpose, and the estimate of σ2 should be based on the residual mean square. That is, the replication and lack of fit sums of squares and degrees of freedom should be recombined to give an estimate with the largest number of degrees of freedom, so as to provide the most reliable parameter and expected value confidence regions. If lack of fit is found, the analyst should attempt to discover why, and modify the expectation function accordingly. Further discussion on assessing the fit of a model and on modifying and comparing models is given in Sections 3.7 and 3.10. Example: For the ln(PCB) versus 3 √ age data, the lack of fit analysis is presented in Table 1.3.5. Because the p value suggests no lack of fit, we combine the lack of fit and replication sums of squares and degrees of freedom and take as our estimate of σ2, the residual mean square of 0.246 based on 26 degrees of freedom. If there had been lack of fit, we would have had to modify the model: in either situation, we do not simply use the replication mean square as an estimate of the variance. • Problems 1.1 Write a computer routine in a language of your choice to perform a QR decomposition of a matrix using Householder transformations. 1.2 Draw a picture to show the Householder transformation of a vector y = (y1, y2) T to the x axis. Use both forms of the vector u corresponding to equations (A2.1) and (A2.2). Hint: Draw a circle of radius ‖y‖. 1.3 Perform a QR decomposition of the matrix X from Example PCB 3, X = [ 1 1 1 1.26 1.82 2.22 ] using u as in equation (A2.2). Compare the result with that in Appendix 2. 1.4 32 REVIEW OF LINEAR REGRESSION 1.4.1. Perform a QR decomposition of the matrix D = 0 0 0 1 0 1 1 1 1 1 and obtain the matrix Q1. This matrix is used in Example α-pinene 6, Section 4.3.4. 1.4.2. Calculate QT2 y, where y = (50.4, 32.9, 6.0, 1.5, 9.3) T, without ex- plicitly solving for Q2. 1.5 1.5.1. Fit the model ln(PCB) = β1 + β2age to the PCB data and perform a lack of fit analysis of the model. What do you conclude about the adequacy of this model? 1.5.2. Plot the residuals versus age, and assess the adequacy of the model. Now what do you conclude about the adequacy of the model? 1.5.3. Fit the model ln(PCB) = β1 +β2age+β3age 2 to the PCB data and perform a lack of fit analysis of the model. What do you conclude about the adequacy of this model? 1.5.4. Perform an extra sum of squares analysis to determine whether the quadratic term is a useful addition. 1.5.5. Explain the difference between your answers in (a), (b), and (d). 2 Nonlinear Regression: Iterative Estimation and Linear Approximations Although this may seem a paradox, all exact science is dominated by the idea of approximation. —Bertrand Russell Linear regression is a powerful method for analyzing data described by models which are linear in the parameters. Often, however, a researcher has a mathematical expression which relates the response to the predictor variables, and these models are usually nonlinear in the parameters. In such cases, linear regression techniques must be extended, which introduces considerable complexity. 2.1 THE NONLINEAR REGRESSION MODEL A nonlinear regression model can be written Yn = f(xn, θ) + Zn (2.1) where f is the expectation function and xn is a vector of associated regressor variables or independent variables for the nth case. This model is of exactly the same form as (1.1) except that the expected responses are nonlinear func- tions of the parameters. That is, for nonlinear models, at least one of the derivatives of the expectation function with respect to the parameters depends on at least one of the parameters. 33 34 NONLINEAR REGRESSION To emphasize the distinction between linear and nonlinear models, we use θ for the parametersin a nonlinear model. As before, we use P for the number of parameters. When analyzing a particular set of data we consider the vectors xn, n = 1, 2, . . . , N , as fixed and concentrate on the dependence of the expected re- sponses on θ. We create the N -vector η(θ) with nth element ηn(θ) = f(xn, θ)n = 1, . . . , N and write the nonlinear regression model as Y = η(θ) + Z (2.2) with Z assumed to have a spherical normal distribution. That is, E[Z] = 0 Var(Z) = E[ZZT] = σ2I as in the linear model. Example: Count Rumford of Bavaria was one of the early experimenters on the physics of heat. In 1798 he performed an experiment in which a cannon barrel was heated by grinding it with a blunt bore. When the cannon had reached a steady temperature of 130◦F, it was allowed to cool and temperature readings were taken at various times. The ambient temperature during the experiment was 60◦F, so [under Newton’s law of cooling, which states that df/dt = −θ(f−T0), where T0 is the ambient temperature] the temperature at time t should be f(t, θ) = 60 + 70e−θt Since ∂f/∂θ = −70te−θt depends on the parameter θ, this model is nonlinear. Rumford’s data are presented in Appendix A, Section A.2. • Example: The Michaelis–Menten model for enzyme kinetics relates the initial “velocity” of an enzymatic reaction to the substrate concentration x through the equation f(x, θ) = θ1x θ2+x (2.3) In Appendix A, Section A.3 we present data from Treloar (1974) on the initial rate of a reaction for which the Michaelis–Menten model is believed to be appropriate. The data, for an enzyme treated with Puromycin, are plotted in Figure 2.1. Differentiating f with respect to θ1 and θ2 gives ∂f ∂θ1 = x θ2 + x (2.4) ∂f ∂θ2 = −θ1x (θ2 + x)2 THE NONLINEAR REGRESSION MODEL 35 • • • • • • • • • • • • Concentration (ppm) Ve lo ci ty (( co un ts/ mi n)/ mi n) 0.0 0.2 0.4 0.6 0.8 1.0 0 50 10 0 15 0 20 0 Fig. 2.1 Plot of reaction velocity versus substrate concentration for the Puromycin data. and since both these derivatives involve at least one of the parameters, the model is recognized as nonlinear. • 2.1.1 Transformably Linear Models The Michaelis–Menten model (2.3) can be transformed to a linear model by expressing the reciprocal of the velocity as a function of the reciprocal sub- strate concentration, 1 f = 1 θ1 + θ2 θ1 1 x (2.5) = β1 + β2u We call such models transformably linear. Some authors use the term “in- trinsically linear”, but we reserve the term “intrinsic” for a special geometric property of nonlinear models, as discussed in Chapter 7. As will be seen in Chapter 3, transformably linear models have some advantages in nonlinear regression because it is easy to get starting values for some of the parameters. It is important to understand, however, that a transformation of the data involves a transformation of the disturbance term too, which affects the as- sumptions on it. Thus, if we assume the model function (2.2) with an additive, spherical normal disturbance term is an appropriate representation of the ex- perimental situation, then these same assumptions will not be appropriate for 36 NONLINEAR REGRESSION • • • • • • •• •••• Concentration–1 Ve lo ci ty – 1 0 10 20 30 40 50 0. 00 5 0. 01 0 0. 01 5 0. 02 0 (a) Concentration Ve lo ci ty 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0 50 10 0 15 0 20 0 • • • • • • • • • • • • (b) Fig. 2.2 Plot of inverse velocity versus inverse substrate concentration for the Puromycin experiment with the linear regression line (dashed) in part a, and the corresponding fitted curve (dashed) in the original scale in part b. the transformed data. Hence we should use nonlinear regression on the origi- nal data, or else weighted least squares on the transformed data. Sometimes, of course, transforming a data set to induce constant variance also produces a linear expectation function in which case linear regression can be used on the transformed data. Example: Because there are replications in the Puromycin data set, it is easy to see from Figure 2.1 that the variance of the original data is constant, and hence that nonlinear regression should be used to estimate the pa- rameters. However, the reciprocal data, plotted in Figure 2.2a, while showing a simple straight line relationship, also show decidely noncon- stant variance. If we use linear regression to fit the model (2.5) to these data, we obtain the estimates βˆ = (0.005107, 0.0002472)T corresponding to θˆ = (195.8, 0.04841)T The fitted curve is overlaid with the data in the original scale in Figure 2.2b, where we see that the predicted asymptote is too small. Because the variance of the replicates has been distorted by the transformation, the cases with low concentration (high reciprocal concentration) domi- nate the determination of the parameters and the curve does not fit the data well at high concentrations. • This example demonstrates two important features. First, it emphasizes the value of replications, because without replications it may not be possible to detect either the constant variance in the original data or the nonconstant THE NONLINEAR REGRESSION MODEL 37 variance in the transformed data; and second, it shows that while transforming can produce simple linear behavior, it also affects the disturbances. 2.1.2 Conditionally Linear Parameters The Michaelis–Menten model is also an example of a model in which there is a conditionally linear parameter, θ1. It is conditionally linear because the derivative of the expectation function with respect to θ1 does not involve θ1. We can therefore estimate θ1, conditional on θ2, by a linear regression of velocity x/(θ2 + x). Models with conditionally linear parameters enjoy some advantageous properties, which can be exploited in nonlinear regression. 2.1.3 The Geometry of the Expectation Surface The assumption of a spherical normal distribution for the disturbance term Z leads us to consider the Euclidean geometry of the N -dimensional response space, because again we will be interested in the least squares estimates θˆ of the parameters. The N -vectors η(θ) define a P -dimensional surface called the expectation surface in the response space, and the least squares estimates correspond to the point on the expectation surface, ηˆ = η(θˆ) which is closest to y. That is, θˆ minimizes the residual sum of squares S(θ) = ‖y − η(θ)‖2 Example: To illustrate the geometry of nonlinear models, consider the two cases t = 4 and t = 41 for the Rumford data. Under the assumption that Newton’s law of cooling holds for these data, the expected responses are η(θ) = [ 60 + 70e−4θ 60 + 70e−41θ ] θ ≥ 0 Substituting values for θ in these equations and plotting the points in a 2-dimensional response space gives the 1-dimensional expectation surface (curve) shown in Figure 2.3. Note that the expectation surface is curved and of finite extent , which is in contrast to the linear model in which the expectation surface is a plane of infinite extent. Note, too, that points with equal spacing on the parameter line (θ) map to points with unequal spacing on the expectation surface. • Example: As another example, consider the three cases from Example Puromycin 2.1: x = 1.10, x = 0.56, and x = 0.22. Under the assumption that the 38 NONLINEAR REGRESSION η1 η 2 0 20 40 60 80 100 120 140 0 20 40 60 80 10 0 12 0 14 0 ∞ 0 0.05 0.10.51 Fig. 2.3 Plot of the expectation surface (solid line) in the response space for the 2-case Rumford data. The points corresponding to θ = 0, 0.01, 0.02, . . . , 0.1, 0.2, . . ., 1,∞ are marked. expectation function (2.3) is the correct one, the expected responses for these substrate values are η(θ) = θ1(1.10) θ2+1.10 θ1(0.56) θ2+0.56 θ1(0.22) θ2+0.22 θ1, θ2 ≥ 0 and so we can plot the expectation surface by substituting values for θ in these equations. A portion of the 2-dimensional expectation surface for these x values is shown in Figure 2.4. Again, in contrast to the linear model, this expectation surface is not an infinite plane, and in general, straight lines in the parameter plane do not map to straight lines on the expectation surface. It is also seen that unit squares in the parameter plane map to irregularly shaped areas on the expectation surface and that the sizes of these areas vary. Thus, the Jacobian determinant is not constant, which can be seen analytically, of course, because the derivatives (2.4) depend on θ. For this model, there are straight lines on the expectation surface in Figure 2.4 corresponding to the θ1 parameter lines (lines with θ2 held constant), reflecting the fact that θ1 is conditionally linear. However, the θ1 parameter lines are neither parallel nor equispaced. The θ2 lines are not straight, parallel, or equispaced. • THE NONLINEAR REGRESSION MODEL 39 100 200 η1 100 200 η3 100 200 η2 θ2 = 0 θ2 = 0.5 θ2 = 1.0 θ1 = 300 θ1 = 200 Fig. 2.4 Expectation surface for the 3-case Puromycin example. We show a portion of the expectation surface (shaded) in the expectation space with θ1 parameter lines (dashed) and θ2 parameter lines (dot–dashed). 40 NONLINEAR REGRESSION As can be seen from these examples, for nonlinear models with P parame- ters, it is generally true that: 1. the expectation surface, η(θ), is a P -dimensional curved surface in the N -dimensional response space; 2. parameter lines in the parameter space map to curves on the curved expectation surface; 3. the Jacobian determinant , which measures how large unit areas in θ become in η(θ), is not constant . We explore these interesting and important aspects of the expectation sur- face later, but first we discuss how to obtain the least squares estimates θˆ for the parameters θ. Nonlinear least squares estimation from the point of view of sum of squares contours is given in Section 2.4. 2.2 DETERMINING THE LEAST SQUARES ESTIMATES The problem of finding the least squares estimates can be stated very simply geometrically—given a data vector y, an expectation function f(xn, θ), and a set of design vectors xn, n = 1, . . . , N (1)find the point ηˆ on the expectation surface which is closest to y, and then (2)determine the parameter vector θˆ which corresponds to the point ηˆ. For a linear model, step (1) is straightforward because the expectation sur- face is a plane of infinite extent, and we may write down an explicit expression for the point on that plane which is closest to y, ηˆ = Q1Q T 1 y For a linear model, step (2) is also straightforward because the P -dimensional parameter plane maps linearly and invertibly to the expectation plane, so once we know where we are on one plane we can easily find the corresponding point on the other. Thus βˆ = R−11 Q T 1 ηˆ In the nonlinear case, however, the two steps are very difficult: the first because the expectation surface is curved and often of finite extent (or, at least, has edges) so that it is difficult even to find ηˆ, and the second because we can map points easily only in one direction—from the parameter plane to the expectation surface. That is, even if we know ηˆ, it is extremely difficult to determine the parameter plane coordinates θˆ corresponding to that point. To overcome these difficulties, we use iterative methods to determine the least squares estimates. DETERMINING THE LEAST SQUARES ESTIMATES 41 2.2.1 The Gauss–Newton Method 2 2 1 An approach suggested by Gauss is to use a linear approximation to the expectation function to iteratively improve an initial guess θ0 for θ and keep improving the estimates until there is no change. That is, we expand the expectation function f(xn, θ) in a first order Taylor series about θ 0 as f(xn, θ) ≈ f(xn, θ0) + vn1(θ1 − θ01) + vn2(θ2 − θ02) + . . . + vnP (θP − θ0P ) where vnp = ∂f(xn, θ) ∂θp ∣∣∣∣ θ0 p = 1, 2, . . . , P Incorporating all N cases, we write η(θ) ≈ η(θ0) + V 0(θ − θ0) (2.6) where V 0 is the N×P derivative matrix with elements vnp. This is equivalent to approximating the residuals, z(θ) = y − η(θ), by z(θ) ≈ y − [η(θ0) + V 0δ] = z0 − V 0δ (2.7) where z0 = y − η(θ0) and δ = θ − θ0. We then calculate the Gauss increment δ0 to minimize the approximate residual sum of squares ‖z0 − V 0δ‖2, using V 0 = QR = Q1R1[cf.(1.19)] w1 = Q T 1 z 0[cf.(1.21)] ηˆ1 = Q1w1[cf.(1.23)] and so R1δ 0 = w1[cf.(1.24)] The point ηˆ1 = η(θ1) = η(θ0 + δ0) should now be closer to y than η(θ0), and so we move to this better parameter value θ1 = θ0 +δ0 and perform another iteration by calculating new residuals z1 = y − η(θ1), a new derivative matrix V 1, and a new increment. This process is repeated until convergence is obtained, that is, until the increment is so small that there is no useful change in the elements of the parameter vector. Example: To illustrate these calculations, consider the data from Example Puromycin 2.1, with the starting estimates θ0 = (205, 0.08)T. The data, along with the fitted values, residuals, and derivatives evaluated at θ0, are shown in Table 2.1. Collecting these derivatives into the derivative matrix V 0, we then per- form a QR decomposition, from which we generate w1 = Q T 1 z 0 and 42 NONLINEAR REGRESSION Table 2.1 Residuals and derivatives for Puromycin data at θ = (205, 0.08)T. n xn yn η 0 n z 0 n v 0 n1 v 0 n2 1 0.02 76 41.00 35.00 0.2000 –410.00 2 0.02 47 41.00 6.00 0.2000 –410.00 3 0.06 97 87.86 9.14 0.4286 –627.55 4 0.06 107 87.86 19.14 0.4286 –627.55 5 0.11 123 118.68 4.32 0.5789 –624.65 6 0.11 139 118.68 20.32 0.5789 –624.65 7 0.22 159 150.33 8.67 0.7333 –501.11 8 0.22 152 150.33 1.67 0.7333 –501.11 9 0.56 191 179.38 11.62 0.8750 –280.27 10 0.56 201 179.38 21.62 0.8750 –280.27 11 1.10 207 191.10 15.90 0.9322 –161.95 12 1.10 200 191.10 8.90 0.9322 –161.95 then solve for δ0 using R1δ 0 = w1. In this case, δ 0 = (8.03,−0.017)T and the sum of squares at θ1 = θ0 + δ0 is S(θ1) = 1206, which is much smaller than S(θ0) = 3155. We therefore move to θ1 = (213.03, 0.063)T and perform another iteration. • Example: As a second example, we consider data on biochemical oxygen demand (BOD) from Marske (1967), reproduced in Appendix A.4. The data are plotted in Figure 2.5. For these data, the model f(x, θ) = θ1(1 − eθ2x) (2.8) is considered appropriate. Using the starting estimates θ0 = (20, 0.24)T, for which S(θ0) = 128.2, produces an increment to θ1 = (13.61, 0.52)T with S(θ1) = 145.2. In this case, the sum of squares has increased and so we must modify the increment as discussed below. • Step Factor As seen in the last example, the Gauss–Newton increment can produce an increase in the sum of squares when the requested increment extends beyond the region where the linear approximation is valid. Even in these circum- stances, however, the linear approximation will be a close approximation to the actual surface for a sufficiently small region around η(θ0). Thus a small step in the direction δ0 should produce a decrease in the sum of squares. We therefore introduce a step factor λ, and calculate θ1 = θ0 + λδ0 DETERMINING THE LEAST SQUARES ESTIMATES 43 • • • • • • Time (days) BO D (m g/l ) 0 2 4 6 8 0 5 10 15 20 Fig. 2.5 Plot of BOD versus time where λ is chosen to ensure that S(θ1) < S(θ0) (2.9) A common method of selecting λ is to start with λ = 1 and halve
Compartilhar