Linear Algebra for Regression Analysis

Guillaume Pouliot

December 25, 2013

Introduction and motivation

One of the simplest models for regression analysis is the factor model with a single factor, or one-way ANOVA model,

$$y_{ij} = \mu + \alpha_i + e_{ij}, \qquad i = 1, \dots, t, \; j = 1, \dots, N_i,$$

where $E(e_{ij}) = 0$, $V(e_{ij}) = \sigma^2$, and $\mathrm{Cov}(e_{ij}, e_{i'j'}) = 0$ when $(i,j) \neq (i',j')$. It is very straightforward to see what the coefficients, the "pure effects" $\alpha_i$, stand for, and their being discrete, along with the model being additive, makes it obvious how one would go about estimating them and obtaining fitted values for the $y_{ij}$'s. Perhaps the most intuitive solution is obtained by fitting the overparametrized model (so without fixing one $\alpha_i$ or $\sum_i \alpha_i$ to zero). In that case, you could easily guess that one solution would be

$$\hat{\mu} = 0 \qquad \text{and} \qquad \hat{\alpha}_i = \frac{1}{N_i}\sum_j y_{ij}.$$

This approach, although immediate and intuitive, does not have a nice geometric interpretation, and does not readily yield intuition for the case with continuous covariates, which is arguably the central case. We want a general framework that has powerful geometric interpretations, and we want it to nest the simple cases (such as the one-way ANOVA model), as we want to be able to rely on them for examples and simple intuition. One such framework is the "projection approach", which in most instantiations is an almost direct application of vector space, orthogonality, and more generally linear algebraic ideas to regression analysis.

Let's rephrase the one-way ANOVA model problem in matrix form, in coordinates.[1] Suppose for concreteness that $t = 3$, $N_1 = 5$, $N_2 = 3$, $N_3 = 3$; then

$$\begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{15} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{31} \\ y_{32} \\ y_{33} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} + \varepsilon.$$

[1] Another way to go about presenting linear model theory is with the so-called coordinate-free approach. It gets at the abstract linear algebraic concepts more directly, but at the cost of losing sight (quite literally) of the statistical objects they represent. For a great book on the coordinate-free approach, see Michael Wichura's.

Note that the $11 \times 4$ design matrix is not full rank (prove it in four seconds?), therefore there will not be a unique solution for the coefficients. There will, however, be a unique solution for the fitted values $\hat{y}_{ij}$. That solution (the solution to the problem of minimizing $\sum_{i,j}(y_{ij} - \hat{y}_{ij})^2$, where $\hat{y}_{ij} = \hat{\mu} + \hat{\alpha}_i$) is obtained by projecting the vector $[y_{ij}]$ onto the plane spanned by the columns of the design matrix; it is in a sense the crux of our approach here that we do not just want to think of solving the inverse problem for the coefficients and then plugging them back in.

As we will see, for a full rank design matrix $X$, the orthogonal projection onto the space spanned by its columns has the closed form $X(X'X)^{-1}X'$. More generally, and necessarily if $X$ is not full rank, the closed form is $X(X'X)^{-}X'$, where $A^{-}$ denotes a generalized inverse of $A$.

Let's compute the projection matrix $M := X(X'X)^{-}X'$. You can easily check that

$$X'X = \begin{bmatrix} 11 & 5 & 3 & 3 \\ 5 & 5 & 0 & 0 \\ 3 & 0 & 3 & 0 \\ 3 & 0 & 0 & 3 \end{bmatrix},$$

and by checking the sufficient condition $(X'X)(X'X)^{-}(X'X) = X'X$, you get that

$$(X'X)^{-} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/5 & 0 & 0 \\ 0 & 0 & 1/3 & 0 \\ 0 & 0 & 0 & 1/3 \end{bmatrix}.$$
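It is easy to verify this numerically. Below is a minimal sketch in Python/numpy (not part of the original notes; variable names are mine) that checks the sufficient condition for the generalized inverse and builds $M$ from it:

```python
import numpy as np

# Design matrix for t = 3 groups with N1 = 5, N2 = 3, N3 = 3 observations,
# hard-coded from the display above.
X = np.array([[1, 1, 0, 0]] * 5 + [[1, 0, 1, 0]] * 3 + [[1, 0, 0, 1]] * 3)

XtX = X.T @ X

# Candidate generalized inverse of X'X from the text.
G = np.diag([0.0, 1 / 5, 1 / 3, 1 / 3])

# Sufficient condition for a generalized inverse: (X'X) G (X'X) = X'X.
assert np.allclose(XtX @ G @ XtX, XtX)

# The projection matrix M = X (X'X)^- X'.
M = X @ G @ X.T
print(np.round(M, 3))
```

The printed matrix is exactly the block-diagonal averaging matrix computed by hand below.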
Then

$$M = X(X'X)^{-}X' = X\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3
\end{bmatrix}$$

$$= \begin{bmatrix}
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3
\end{bmatrix}.$$

Thus what the projection matrix is doing here is perfectly intuitive (and perfectly in line with the first approach): applying $M$ to the vector of responses $[y_{ij}]$ orthogonally projects $[y_{ij}]$ onto (i.e., finds the vector that best approximates it on) the three-dimensional plane of vectors of the form

$$(a, a, a, a, a, b, b, b, c, c, c)', \qquad a, b, c \in \mathbb{R},$$

and of course, the closest (in an $\ell_2$ sense) such vector is the one obtained by taking the average of the entries associated with the same factor level.

You can see from the second equality (or obviously since $\hat{\mu} = 0$) that we have also computed

$$(X'X)^{-}X'Y = \begin{bmatrix} \hat{\mu} \\ \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \hat{\alpha}_3 \end{bmatrix} \left( \equiv \begin{bmatrix} 0 \\ \hat{y}_{11} \\ \hat{y}_{21} \\ \hat{y}_{31} \end{bmatrix} \right).$$

The matrix $M$ and the projection interpretation are intuitively very rich, and provide very useful insights for regression analysis in applied and theoretical work. All its technical underpinnings are essentially linear algebraic. We now move on to the meat.

Matrices as Operators, Spectral Representation, and Essential Inequalities

The matrix as an operator

A matrix is many things, and hopefully we will grow to deem it apostasy to merely consider it "a box with numbers inside". A correct abstraction is to think of an $m \times n$ matrix $A$ as a linear operator $A : \mathbb{R}^n \to \mathbb{R}^m$. Manipulations at this level of abstraction will allow for tractable and beautiful proofs,[2] as it allows one to abstract away from the choice of basis (in that sense, it is coordinate-free). For many applications, however, we will care about the in-coordinates representation of the matrix, and some intuition is to be gotten from such a representation too. Writing $A$ in coordinates as

$$\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & & \vdots \\
\vdots & & \ddots & \vdots \\
a_{m1} & \cdots & \cdots & a_{mn}
\end{bmatrix},$$

we have that each entry has a very specific meaning. For each $i$ and $j$, $a_{ij}$ is what you multiply the $j$th entry of the input vector by to get its contribution to the $i$th entry of the output vector. If this is the matrix representation of a linear operator with respect to the standard basis (in domain and range), then you can replace "entry" with "basis vector" in the previous sentence. More explicitly,

$$A e_j = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & & \vdots \\
\vdots & & \ddots & \vdots \\
a_{m1} & \cdots & \cdots & a_{mn}
\end{bmatrix}
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
= \begin{bmatrix} a_{1j} \\ \vdots \\ a_{ij} \\ \vdots \\ a_{mj} \end{bmatrix}
= \sum_{i=1}^m a_{ij} e_i,$$

where the input vector has its only nonzero entry, a 1, in the $j$th position, and $e_i$ is the vector with a 1 in the $i$th entry and zeroes in all other entries.

[2] See Sheldon Axler's Linear Algebra Done Right.

Eigenvectors and Eigenvalues

Let $\Omega$ be a symmetric $n \times n$ matrix. The eigenvalues of $\Omega$ are traditionally defined as the solutions of the characteristic polynomial

$$\det(\Omega - \lambda I) = 0.$$

This is probably the most handicapping way to think about them. For our purposes, it will be more convenient (and is entirely rigorous with symmetric matrices) to define an eigenvalue of $\Omega$ as a constant $\lambda_i$ such that $\Omega$ maps some vector $v_i$, called the corresponding eigenvector, to $\lambda_i v_i$. That is, $\Omega$ maps $v_i$ to a vector in the exact same direction as $v_i$, i.e. a multiple of $v_i$, and that multiple is the eigenvalue $\lambda_i$ corresponding to $v_i$.
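To see this definition in action, here is a small numpy sketch (the matrix is an arbitrary example of mine) checking that a symmetric matrix maps each of its eigenvectors to a multiple of itself:

```python
import numpy as np

# An arbitrary symmetric matrix (the particular numbers are just an example).
Omega = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh is numpy's eigensolver for symmetric (Hermitian) matrices.
eigvals, eigvecs = np.linalg.eigh(Omega)

for lam, v in zip(eigvals, eigvecs.T):
    # Omega maps each eigenvector to a multiple of itself: Omega v = lambda v.
    assert np.allclose(Omega @ v, lam * v)
```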
A trivial example of a symmetric $n \times n$ matrix with eigenvalues $\lambda_1, \dots, \lambda_n$ is the diagonal matrix

$$\Lambda = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \lambda_n
\end{bmatrix},$$

where, for each $i$, the eigenvalue $\lambda_i$ corresponds to the eigenvector $e_i$. Note that $\{e_1, \dots, e_n\}$ is an orthogonal basis for $\mathbb{R}^n$.

Can we obtain a matrix with the same eigenvalues but a different, given orthonormal basis as its set of eigenvectors (a change of coordinates)? The answer is yes, and the demonstration is trivial. Let us be given an orthonormal basis $\{q_1, \dots, q_n\}$. Each $q_i$ is an $n$-long vector. Let $Q = [q_1, \dots, q_n]$ be the $n \times n$ matrix obtained by stacking them horizontally. Then you can check that

$$B = Q \Lambda Q^T$$

has eigenvalues $\lambda_1, \dots, \lambda_n$ and corresponding eigenvectors $q_1, \dots, q_n$.

The much deeper and much more powerful result is that the converse also holds: suppose $\Omega$ is a real symmetric matrix; then there exists an orthonormal basis $\{q_1, \dots, q_n\}$ that consists of eigenvectors of $\Omega$. Furthermore, $\Omega$ can be decomposed as in the display above. This is called the spectral decomposition of $\Omega$.

The rank of a matrix

Given an $n \times p$ matrix $X$, we will often be interested in the rank of the matrix. The rank of $X$, denoted $\mathrm{rank}(X)$, is the dimension of the linear space spanned by its columns. It can also be defined as the number of linearly independent columns of $X$, or equivalently the number of linearly independent rows (see exercise 1).

Exercise 1 Let $X$ be an $n \times p$ real matrix. Suppose $X$ has $q$ linearly independent columns. Show that it has $q$ linearly independent rows.

Two inequalities

Let $A$ and $B$ be two compatible matrices (this will mean something different for each of the two inequalities; make sure the size requirements are clear to you), possibly rectangular. Then

$$\mathrm{rank}(AB) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\},$$

and

$$\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B).$$

These are inequalities you should never have to remember by heart. Consider the first one. The output vector of $B$ lies, by definition, in a $\mathrm{rank}(B)$-dimensional space; the image under $A$ of a $\mathrm{rank}(B)$-dimensional space is at most $\mathrm{rank}(B)$-dimensional, and the image of $A$ itself is at most $\mathrm{rank}(A)$-dimensional, so we see immediately that the rank of $AB$ is bounded above by both the rank of $A$ and the rank of $B$.

For square symmetric matrices, the spectral decomposition provides us with an immediate connection between the number of nonzero eigenvalues of a matrix and its rank: it is obvious (think about it) that the rank of a diagonal matrix is equal to the number of nonzero (diagonal) entries. Hence, for a square symmetric $\Omega$ with spectral decomposition $\Omega = Q\Lambda Q^T$, we have

$$\mathrm{rank}(\Omega) = \mathrm{rank}(Q\Lambda Q^T) = \min\{\mathrm{rank}(Q), \mathrm{rank}(\Lambda)\} = \mathrm{rank}(\Lambda) = \#\{\text{nonzero eigenvalues of } \Omega\}$$

(see exercise 2 for why the second equality indeed holds with equality).

In fact, proving both inequalities above in the special case of square symmetric matrices gives a good example of the relationship between nonzero eigenvalues and rank (see exercise 2). This gives another intuitive way to think about the rank and the size of the null space of a square symmetric matrix: let $\{q_1, \dots, q_n\}$ be the eigenbasis of $\Omega$ from the spectral decomposition of $\Omega$; then $\mathrm{rank}(\Omega)$ is the number of basis eigenvectors $q_i$ which are mapped to a nonzero vector by $\Omega$.

Exercise 2 Use the spectral decomposition to prove the special case of the following inequalities for square symmetric matrices $A$ and $B$:

$$\mathrm{rank}(AB) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\},$$

and

$$\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B).$$

For both inequalities, explain the requirement on the eigenvalues for the inequality to be strict, and show that the bound is tight (i.e. both can hold with equality).
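The rank/eigenvalue connection is easy to check numerically. A sketch (with a rank-2 symmetric matrix built for the purpose; the construction is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a rank-2 symmetric 4x4 matrix as a sum of two outer products.
u, v = rng.standard_normal(4), rng.standard_normal(4)
Omega = np.outer(u, u) + np.outer(v, v)

# rank(Omega) equals the number of nonzero eigenvalues.
eigvals = np.linalg.eigvalsh(Omega)
num_nonzero = int(np.sum(~np.isclose(eigvals, 0.0)))
assert np.linalg.matrix_rank(Omega) == num_nonzero == 2
```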
Orthogonality and Orthogonal Projection

The basic vector space concepts we will rely on are subspaces, spanning sets, linear independence, bases and dimensions of linear spaces, orthogonal vectors, orthogonal bases, and norms.[3]

[3] We don't go over these in this note because there are many available references for their definitions, but feel free to ask about these concepts in office hours. We will cover some basic definitions in section.

Let $X$ be a matrix, and let $C(X) \subset \mathbb{R}^n$ be the vector space spanned by its columns. We say that $M$ is a perpendicular projection operator (matrix) onto $C(X)$ if and only if

• projection: $v \in C(X)$ implies $Mv = v$;

• orthogonality: $w \perp C(X)$ implies $Mw = 0$.

Note that $w \perp C(X)$ means $w'v = 0$ for all $v \in C(X)$. Recall also that any linear subspace $C(X) \subset \mathbb{R}^n$ has an orthogonal complement $C(X)^{\perp}$ such that $C(X) \perp C(X)^{\perp}$, and every $v \in \mathbb{R}^n$ can be decomposed uniquely as

$$v = \underbrace{a}_{\in C(X)} + \underbrace{b}_{\in C(X)^{\perp}}.$$

Another, equivalent characterization of the orthogonal projection $M$ is that it is idempotent ($MM = M$) and symmetric ($M' = M$). Of course, the space spanned by the columns of $M$ will be the same as the one spanned by the columns of $X$ (see exercise 3).

Exercise 3 If $M$ is a perpendicular projection operator onto $C(X)$, then $C(X) = C(M)$.

Exercise 4 Prove the following. Let $o_1, \dots, o_r$ be an orthonormal basis for $C(X)$, and let $O = [o_1, \dots, o_r]$. Then $OO' = \sum_{i=1}^r o_i o_i^T$ is the perpendicular projection operator onto $C(X)$.

Exercise 5 For an $n \times p$ matrix $X$ of full rank, show that $X(X'X)^{-1}X'$ is the perpendicular projection operator onto $C(X)$.

Exercise 6 Show that if $M$ and $M_0$ are perpendicular projection matrices with $C(M_0) \subset C(M)$, then $M - M_0$ is a perpendicular projection matrix. (It projects onto the orthogonal complement of $C(M_0)$ within $C(M)$.)

Exercise 7 Show that for any matrix $X$, $C(X'X) = C(X')$.

Exercise 8 Show that for any matrix $X$, $\mathrm{rank}(XX') = \mathrm{rank}(X)$.

Exercise 9 Show that if $X$ is $n \times p$ and $\mathrm{rank}(X) = p$, then the $p \times p$ matrix $X'X$ is nonsingular.

The Geometry of OLS

The OLS problem is the following: given a model

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad E(\varepsilon_i) = 0, \; V(\varepsilon_i) = \sigma^2, \qquad i = 1, \dots, n,$$

we want to find $(\hat{\beta}_0, \dots, \hat{\beta}_p) \in \mathbb{R}^{p+1}$ minimizing

$$\sum_{i=1}^n (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}$. If we write in matrix form $Y = (y_1, \dots, y_n)^T$ and $\hat{Y} = (\hat{y}_1, \dots, \hat{y}_n)^T$, the above display can be rewritten (by definition) as $\|Y - \hat{Y}\|_2^2$, making explicit that we are in the business of minimizing the distance between two vectors.

To find the solution, you can brute-force it: take the derivative of $\|Y - \hat{Y}\|_2^2 = (Y - X\hat{\beta})'(Y - X\hat{\beta})$ with respect to $\hat{\beta}$, set it to zero, and solve for the minimizer $\hat{\beta}$ and the fitted response vector $\hat{Y} = X\hat{\beta}$. Alternatively, and relying on a more geometric intuition, we could recognize that this is an application of the Pythagorean theorem: if $Y \in \mathbb{R}^n$ and $C(X) \subset \mathbb{R}^n$ is a linear space, then the unique point $\hat{Y} \in C(X)$ that is closest to $Y$ (in the sense that it minimizes $\|Y - \hat{Y}\|_2^2$ over $C(X)$) is fully characterized by the property

$$(Y - \hat{Y}) \perp C(X).$$

Exercise 10 Show that the first order conditions obtained when minimizing $\|Y - \hat{Y}\|_2^2$ in $\hat{\beta}$ by differentiation yield the geometric condition $(Y - \hat{Y}) \perp C(X)$.
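Exercises 5 and 10 can both be checked numerically. Here is a sketch with simulated data (the simulated model and all names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Simulated design (with intercept column) and response.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta + rng.standard_normal(n)

# Perpendicular projection operator onto C(X); X is full rank here.
M = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = M @ Y

# The residual is orthogonal to C(X): X'(Y - Y_hat) = 0 (exercise 10).
assert np.allclose(X.T @ (Y - Y_hat), 0.0)

# M is idempotent and symmetric (the equivalent characterization).
assert np.allclose(M @ M, M) and np.allclose(M, M.T)
```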
That the minimizing point is characterized by an orthogonality condition is a lot of what makes the orthogonal projection apparatus and results so powerful in regression analysis. Furthermore, now that the geometry of the problem is clear, we can properly visualize it. The usual picture is displayed right below.

[Figure 1: Multivariate regression as an orthogonal projection.]

Let's put the projection together with the geometry of the problem as presented in figure 1. From the Pythagorean theorem, the fitted vector $\hat{Y} = X\hat{\beta}$ is precisely the orthogonal projection of $Y$ onto $C(X)$. In exercise 5 you found that this is given by $M = X(X^TX)^{-1}X^T$; that is, $\hat{Y} = MY$. The residual vector is the difference between $Y$ and $\hat{Y}$, so it is that part of $Y$ that is orthogonal to $C(X)$. That is, for $Q = (I - M)$ the orthogonal projection onto $C(X)^{\perp}$, we get $\hat{\varepsilon} = QY$. (You should be able to see immediately that $QMY = MQY = 0$ and $(M + Q)Y = Y$.)

Exercise 11 Is it correct that the usual drawing (figure 1) of multivariate regression as an orthogonal projection arises from a regression with design matrix

$$X = [X_1, X_2] = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}?$$

Exercise 12 Draw the equivalent of figure 1 for the omitted variable problem.

Exercise 13 Draw the equivalent of figure 1 for the instrumental variable problem.

The geometry of the F-test

Suppose we are testing the null hypothesis that the short model $Y = X_0\gamma + \varepsilon$ is correct against the alternative hypothesis that the long model $Y = X\beta + \varepsilon$, with $C(X_0) \subset C(X)$, is correct. In both cases we assume $\varepsilon \sim N(0, \sigma^2 I)$.

If the short model is correct, both $M_0Y$ (with $M_0 = X_0(X_0'X_0)^{-1}X_0'$) and $MY$ (with $M = X(X'X)^{-1}X'$) are estimating $E[Y]$, and their expected difference is zero, i.e. $E[MY - M_0Y] = 0$. However, if the short model is not correct, then $MY$ and $M_0Y$ are estimating different things, namely $M_0E[Y] \neq E[Y]$, and we should expect $MY$ and $M_0Y$ to be distant from each other, i.e. we would expect $\|MY - M_0Y\|_2^2$ to be big.

Relying on this opportunity for the power of a test, how can we get a pivotal test statistic? What is the distribution of $\|MY - M_0Y\|_2^2$ under the null? With our new tools, there's an easy way to see it. We have

$$\|MY - M_0Y\|_2^2 = Y'(M - M_0)Y,$$

where you can check that $M - M_0$ is an orthogonal projection matrix (and is thus idempotent; we used that to get the equality in the display above). In particular, there is an orthogonal (change of basis) matrix $Q$ such that the vectors of $C(Q(M - M_0))$ are all of the form

$$(\underbrace{* \cdots *}_{k} \; \underbrace{0 \cdots 0}_{n-k})',$$

where $k = \mathrm{rank}(M - M_0)$. And since under the null we can take $Y = \sigma Z$ for $Z \sim N(0, I)$ (the mean of $Y$ is annihilated by $M - M_0$), it follows immediately that

$$Y'(M - M_0)Y = \sigma^2 Z'(M - M_0)'(M - M_0)Z = \sigma^2 \big((Q(M - M_0)Q')(QZ)\big)'\big((Q(M - M_0)Q')(QZ)\big) = \sigma^2 \chi^2_{\mathrm{rank}(M - M_0)},$$

where the last equality holds because, by rotational invariance of standard Gaussian random vectors, $QZ \sim Z$, and $Q(M - M_0)Q'$ is the orthogonal projection which preserves the first $k$ coordinates of the input vector and sends the others to zero.

But we want a pivot. We can repeat the same argument and find that

$$Y'(I - M)Y = \sigma^2 \chi^2_{\mathrm{rank}(I - M)}.$$

You should be able to convince yourself[4] that $Y'(M - M_0)Y$ and $Y'(I - M)Y$ are independent. Therefore

$$\frac{Y'(M - M_0)Y / \mathrm{rank}(M - M_0)}{Y'(I - M)Y / \mathrm{rank}(I - M)} = \frac{\chi^2_{\mathrm{rank}(M - M_0)} / \mathrm{rank}(M - M_0)}{\chi^2_{\mathrm{rank}(I - M)} / \mathrm{rank}(I - M)} \sim F_{\mathrm{rank}(M - M_0), \, \mathrm{rank}(I - M)},$$

and we have a beautiful test statistic. It's also a nice example of using linear algebra directly to manipulate Gaussian random variables.

[4] Think about picking $Q$: in the canonical form we found, you are using different entries of the random vector for the numerator and the denominator.
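A Monte Carlo sketch of this pivot (simulated under an intercept-only null of my choosing; it also uses the fact that the rank of an idempotent matrix equals its trace):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40

# Short model: intercept only. Long model: intercept plus two regressors.
X0 = np.ones((n, 1))
X = np.column_stack([X0, rng.standard_normal((n, 2))])

def proj(A):
    # Perpendicular projection operator onto C(A) (A assumed full rank).
    return A @ np.linalg.inv(A.T @ A) @ A.T

M0, M = proj(X0), proj(X)

# For an idempotent matrix, rank equals trace.
d1 = int(round(np.trace(M - M0)))
d2 = int(round(np.trace(np.eye(n) - M)))

# Simulate under the null (short model true) and form the F statistic.
stats = []
for _ in range(10000):
    Y = 1.0 + rng.standard_normal(n)  # mu = 1, sigma = 1
    num = (Y @ (M - M0) @ Y) / d1
    den = (Y @ (np.eye(n) - M) @ Y) / d2
    stats.append(num / den)

# The mean of an F(d1, d2) random variable is d2 / (d2 - 2);
# the empirical mean should be close to it.
print(np.mean(stats), d2 / (d2 - 2))
```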
Positive-Definiteness and Matrix Inequalities

The positive definiteness of matrices has a very direct statistical interpretation for us, especially in terms of covariance matrices. We say of an $n \times n$ symmetric matrix $A$ that it is positive definite if

$$v'Av > 0 \qquad \forall\, v \in \mathbb{R}^n, \; v \neq 0.$$

If the inequality is not strict, we say it is positive semi-definite. To see the connection with covariance matrices, recall how to compute the variance of a linear combination of random variables; take

$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad V(X) = \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}.$$

We remember the formula by heart for $\alpha_1 X_1 + \alpha_2 X_2$. That is,

$$V(\alpha_1 X_1 + \alpha_2 X_2) = V(\alpha_1 X_1) + V(\alpha_2 X_2) + 2\,\mathrm{Cov}(\alpha_1 X_1, \alpha_2 X_2) = \alpha_1^2\sigma_{11} + \alpha_2^2\sigma_{22} + 2\alpha_1\alpha_2\sigma_{12}.$$

But we can see that all the information for the variance of linear combinations of the entries of $X$ is conveniently packed in the covariance matrix $\Sigma$. Indeed,

$$\alpha_1 X_1 + \alpha_2 X_2 = \begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix},$$

and

$$V\left(\begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\right) = \begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix}\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} \alpha_1\sigma_{11} + \alpha_2\sigma_{12} & \alpha_1\sigma_{12} + \alpha_2\sigma_{22} \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \alpha_1^2\sigma_{11} + 2\alpha_1\alpha_2\sigma_{12} + \alpha_2^2\sigma_{22},$$

so we get the formula back! Oftentimes it will be convenient to leave everything in matrix form for computations, but it's good to see we're really doing the same thing as before. Of course, all this readily generalizes to $n$-long vectors.

For two $n \times n$ matrices $A$ and $B$, we say that $A < B$ if $B - A$ is positive definite. If $A < B$, this means that for random vectors $X_A$ with $V(X_A) = A$ and $X_B$ with $V(X_B) = B$, and for any linear combination $\alpha$ of their entries, we have

$$V(\alpha'X_A) = \alpha'A\alpha < \alpha'B\alpha = V(\alpha'X_B),$$

where the inequality holds because $\alpha'(B - A)\alpha = \alpha'B\alpha - \alpha'A\alpha > 0$. So positive definiteness has a direct interpretation in terms of covariances!

Let's look at an immediate application of what we have done until now. This was a (very linear) question in a nonlinear econometrics final: show that the asymptotic variance of the homoskedastic 2SLS coefficient estimate decreases with the number of non-redundant instruments. Let $Y$ be the outcome variable, $X$ the design matrix (of regressors, some endogenous), and $Z$ the matrix of instruments. Then

$$V(\hat{\beta}_{2SLS}) = \sigma^2\big(X'M_{C(Z)}X\big)^{-1},$$

where $M_{C(Z)}$ is the orthogonal projection onto $C(Z)$. Now suppose we are given additional instruments $Z_+$ and form the extended matrix of instruments $\tilde{Z} = [Z, Z_+]$. Then $C(\tilde{Z}) \supset C(Z)$, and we can decompose $M_{C(\tilde{Z})}$ as

$$M_{C(\tilde{Z})} = M_{C(Z)} + M_{C(Z)^{\perp}_{C(\tilde{Z})}},$$

where $C(Z)^{\perp}_{C(\tilde{Z})} = \{v \in C(\tilde{Z}) : v \perp C(Z)\}$. And we observe that, for any compatible nonzero vector $v$,

$$v'X'M_{C(\tilde{Z})}Xv = (Xv)'\big(M_{C(Z)} + M_{C(Z)^{\perp}_{C(\tilde{Z})}}\big)(Xv) = (Xv)'M_{C(Z)}(Xv) + \underbrace{(Xv)'M_{C(Z)^{\perp}_{C(\tilde{Z})}}(Xv)}_{>0} > (Xv)'M_{C(Z)}(Xv) = v'X'M_{C(Z)}Xv.$$

By definition, this means $X'M_{C(\tilde{Z})}X > X'M_{C(Z)}X$, and thus

$$\sigma^2\big(X'M_{C(\tilde{Z})}X\big)^{-1} < \sigma^2\big(X'M_{C(Z)}X\big)^{-1}.$$

Exercise 14 Show that if $B > A$, with both $A$ and $B$ full rank symmetric matrices, then $B^{-1} < A^{-1}$.

Gram-Schmidt Theorem and the QR Decomposition

The Gram-Schmidt theorem states that, given some basis for a space, we can find an orthonormal basis for that space.

The Gram-Schmidt Theorem Let $N$ be a space with basis $\{x_1, \dots, x_r\}$. There exists an orthonormal basis for $N$, say $\{y_1, \dots, y_r\}$, with $y_s$ in the space spanned by $x_1, \dots, x_s$, $s = 1, \dots, r$.

The algorithm for building the basis is basically the proof itself (you can, and should, check that it satisfies the claim). To get $\{y_1, \dots, y_r\}$, simply define the $y_i$'s inductively:

$$y_1 = x_1/\sqrt{x_1'x_1},$$

$$w_s = x_s - \sum_{i=1}^{s-1}(x_s'y_i)\,y_i, \qquad y_s = w_s/\sqrt{w_s'w_s}, \qquad s = 2, \dots, r,$$

and you're done!
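A direct transcription of this algorithm (a numpy sketch, assuming the input columns are linearly independent; function and variable names are mine):

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X by the inductive recipe above.

    Assumes the columns of X are linearly independent.
    """
    n, r = X.shape
    Y = np.zeros((n, r))
    for s in range(r):
        # w_s = x_s minus its components along the already-built y_1,...,y_{s-1}.
        w = X[:, s] - Y[:, :s] @ (Y[:, :s].T @ X[:, s])
        Y[:, s] = w / np.sqrt(w @ w)
    return Y

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
Y = gram_schmidt(X)

# The resulting columns are orthonormal: Y'Y = I.
assert np.allclose(Y.T @ Y, np.eye(4))

# y_s lies in span{x_1, ..., x_s}: the change-of-basis matrix from the
# columns of X to those of Y is upper triangular.
C = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(np.tril(C, -1), 0.0)
```

The upper-triangular change of basis verified in the last assertion is exactly the triangular flavor the next paragraph picks up on.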
You can already get the triangular flavor of the procedure. Indeed, the Gram-Schmidt process naturally gives rise to the QR decomposition. Let $(e_1, \dots, e_n)$ be the orthonormal basis obtained through the Gram-Schmidt process. Then observe that we can write the original vectors as

$$x_1 = \langle e_1, x_1\rangle e_1,$$

$$x_2 = \langle e_1, x_2\rangle e_1 + \langle e_2, x_2\rangle e_2,$$

and so on. Note that, for instance, $\langle e_1, x_2\rangle$ picks up the coefficient of $x_2$ on $e_1$ in the $e$ basis. In general,

$$x_k = \sum_{j=1}^n \mathrm{proj}_{e_j} x_k = \sum_{j=1}^n \frac{\langle e_j, x_k\rangle}{\langle e_j, e_j\rangle}\, e_j = \sum_{j=1}^n \langle e_j, x_k\rangle e_j = \sum_{j=1}^k \langle e_j, x_k\rangle e_j.$$

Writing this in matrix form yields the QR decomposition right away:

$$(x_1, \dots, x_n) = X = (e_1, \dots, e_n)\begin{bmatrix}
\langle e_1, x_1\rangle & \langle e_1, x_2\rangle & \langle e_1, x_3\rangle & \cdots \\
0 & \langle e_2, x_2\rangle & \langle e_2, x_3\rangle & \cdots \\
0 & 0 & \langle e_3, x_3\rangle & \cdots \\
\vdots & & & \ddots
\end{bmatrix}.$$

Exercise 15 Use the QR decomposition to prove Hadamard's inequality: if $X = (x_1, \dots, x_n)$, then

$$|\det X| \le \prod_{j=1}^n \|x_j\|.$$

Equality holds here if and only if either the $x_j$ are mutually orthogonal or some $x_j$ is zero.

Variational Representation of Eigenvalues

Thinking of a symmetric matrix $A$ as "big" if, relatively speaking, it gives a bigger number $v'Av$ for any compatible vector $v$ invites an alternate and very intuitive definition of eigenvalues.

The Courant-Fischer-Weyl Minimax Principle Let $A$ be a symmetric matrix on $C(X)$, with $n = \dim C(X)$ and eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$. Then

$$\lambda_k(A) = \max_{\substack{M \subset C(X) \\ \dim M = k}} \; \min_{\substack{x \in M \\ \|x\| = 1}} x'Ax = \min_{\substack{M \subset C(X) \\ \dim M = n-k+1}} \; \max_{\substack{x \in M \\ \|x\| = 1}} x'Ax.$$

Proof: Just observe that, for the first equality, a maximizing subspace is $M = \mathrm{span}\{u_1, \dots, u_k\}$, the span of the top $k$ eigenvectors, and the minimizing vector within it is $u_k$, the eigenvector corresponding to the $k$th (in decreasing order) eigenvalue. To prove the second equality, apply the same thinking. ∎
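A partial numerical illustration of the first equality (a sketch only: it checks that the span of the top $k$ eigenvectors attains $\lambda_k$, not that no other subspace does better; the example matrix is mine):

```python
import numpy as np

rng = np.random.default_rng(4)

# A random symmetric matrix.
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2

# numpy returns eigenvalues in increasing order; flip to decreasing.
vals, vecs = np.linalg.eigh(A)
vals, vecs = vals[::-1], vecs[:, ::-1]

k = 3
U = vecs[:, :k]  # the top-k eigenvectors span the maximizing subspace M

# Over unit vectors in M, the Rayleigh quotient never falls below lambda_k...
for _ in range(1000):
    c = rng.standard_normal(k)
    x = U @ (c / np.linalg.norm(c))  # a unit vector in M
    assert x @ A @ x >= vals[k - 1] - 1e-10

# ...and the eigenvector u_k attains it.
u_k = vecs[:, k - 1]
assert np.isclose(u_k @ A @ u_k, vals[k - 1])
```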