Linear Algebra for Regression Analysis

Guillaume Pouliot

December 25, 2013

Introduction and motivation

One of the simplest models for regression analysis is the factor model with a single factor, or one-way ANOVA model,

$$y_{ij} = \mu + \alpha_i + e_{ij}, \qquad i = 1, \dots, t, \; j = 1, \dots, N_i,$$

where $E(e_{ij}) = 0$, $V(e_{ij}) = \sigma^2$, and $\mathrm{Cov}(e_{ij}, e_{i'j'}) = 0$ when $(i,j) \neq (i',j')$. It is very straightforward to see what the coefficients, the "pure effects" $\alpha_i$, stand for, and their being discrete, along with the model being additive, makes it obvious how one would go about estimating them and obtaining fitted values for the $y_{ij}$'s. Perhaps the most intuitive solution is obtained by fitting the overparametrized model (so without fixing one $\alpha_i$ or $\sum_i \alpha_i$ to zero). In that case, you could easily guess that one solution would be

$$\hat{\mu} = 0 \qquad \text{and} \qquad \hat{\alpha}_i = \frac{1}{N_i}\sum_j y_{ij}.$$

This approach, although immediate and intuitive, does not have a nice geometric interpretation, and does not readily yield intuition for the case with continuous covariates, which is arguably the central case. We want a general framework that has powerful geometric interpretations, and we want it to nest the simple cases (such as the one-way ANOVA model), as we want to be able to rely on them for examples and simple intuition. One such framework is the "projection approach", which in most instantiations is an almost direct application of vector space, orthogonality, and more generally linear algebraic ideas to regression analysis.

Let's rephrase the one-way ANOVA model problem in matrix form, in coordinates.[1] Suppose for concreteness that $t = 3$, $N_1 = 5$, $N_2 = 3$, $N_3 = 3$; then

$$\begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{15} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{31} \\ y_{32} \\ y_{33} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} + \varepsilon.$$

[1] Another way to go about presenting linear model theory is with the so-called coordinate-free approach. It gets at the abstract linear algebraic concepts more directly, but at the cost of losing sight (quite literally) of the statistical objects they represent. For a great book on the coordinate-free approach, see Michael Wichura's.

Note that the $11 \times 4$ design matrix is not full rank (prove it in four seconds?), therefore there will not be a unique solution for the coefficients. There will, however, be a unique solution for the fitted values $\hat{y}_{ij}$. That solution (the solution to the problem of minimizing $\sum_{i,j}(y_{ij} - \hat{y}_{ij})^2$, where $\hat{y}_{ij} = \hat{\mu} + \hat{\alpha}_i$) is obtained by projecting the vector $[y_{ij}]$ onto the plane spanned by the columns of the design matrix; it is in a sense the crux of our approach here that we do not just want to think of solving the inverse problem for the coefficients and then plugging them back in.

As we will see, for a full rank design matrix $X$, the orthogonal projection onto the space spanned by its columns has the closed form $X(X'X)^{-1}X'$. More generally, and necessarily if $X$ is not full rank, the closed form is $X(X'X)^{-}X'$, where $A^{-}$ denotes a generalized inverse of $A$.

Let's compute the projection matrix $M := X(X'X)^{-}X'$. You can easily check that

$$X'X = \begin{bmatrix} 11 & 5 & 3 & 3 \\ 5 & 5 & 0 & 0 \\ 3 & 0 & 3 & 0 \\ 3 & 0 & 0 & 3 \end{bmatrix},$$

and by checking the sufficient condition $(X'X)(X'X)^{-}(X'X) = X'X$, you get that

$$(X'X)^{-} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/5 & 0 & 0 \\ 0 & 0 & 1/3 & 0 \\ 0 & 0 & 0 & 1/3 \end{bmatrix}.$$
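It is easy to verify this numerically. Below is a minimal sketch in Python/numpy (not part of the original notes; variable names are mine) that checks the sufficient condition for the generalized inverse and builds $M$ from it:

```python
import numpy as np

# Design matrix for t = 3 groups with N1 = 5, N2 = 3, N3 = 3 observations,
# hard-coded from the display above.
X = np.array([[1, 1, 0, 0]] * 5 + [[1, 0, 1, 0]] * 3 + [[1, 0, 0, 1]] * 3)

XtX = X.T @ X

# Candidate generalized inverse of X'X from the text.
G = np.diag([0.0, 1 / 5, 1 / 3, 1 / 3])

# Sufficient condition for a generalized inverse: (X'X) G (X'X) = X'X.
assert np.allclose(XtX @ G @ XtX, XtX)

# The projection matrix M = X (X'X)^- X'.
M = X @ G @ X.T
print(np.round(M, 3))
```

The printed matrix is exactly the block-diagonal averaging matrix computed by hand below.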
Then

$$M = X(X'X)^{-}X' = X\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3
\end{bmatrix}$$

$$= \begin{bmatrix}
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 1/3 & 1/3
\end{bmatrix}.$$

Thus what the projection matrix is doing here is perfectly intuitive (and perfectly in line with the first approach): applying $M$ to the vector of responses $[y_{ij}]$ orthogonally projects $[y_{ij}]$ onto (i.e., finds the vector that best approximates it on) the three-dimensional plane of vectors of the form

$$(a, a, a, a, a, b, b, b, c, c, c)', \qquad a, b, c \in \mathbb{R},$$

and of course, the closest (in an $\ell_2$ sense) such vector is the one obtained by taking the average of the entries associated with the same factor level.

You can see from the second equality (or obviously since $\hat{\mu} = 0$) that we have also computed

$$(X'X)^{-}X'Y = \begin{bmatrix} \hat{\mu} \\ \hat{\alpha}_1 \\ \hat{\alpha}_2 \\ \hat{\alpha}_3 \end{bmatrix} \left( \equiv \begin{bmatrix} 0 \\ \hat{y}_{11} \\ \hat{y}_{21} \\ \hat{y}_{31} \end{bmatrix} \right).$$

The matrix $M$ and the projection interpretation are intuitively very rich, and provide very useful insights for regression analysis in applied and theoretical work. All its technical underpinnings are essentially linear algebraic. We now move on to the meat.

Matrices as Operators, Spectral Representation, and Essential Inequalities

The matrix as an operator

A matrix is many things, and hopefully we will grow to deem it apostasy to merely consider it "a box with numbers inside". A correct abstraction is to think of an $m \times n$ matrix $A$ as a linear operator $A : \mathbb{R}^n \to \mathbb{R}^m$. Manipulations at this level of abstraction will allow for tractable and beautiful proofs,[2] as it allows one to abstract away from the choice of basis (in that sense, it is coordinate-free). For many applications, however, we will care about the in-coordinates representation of the matrix, and some intuition is to be gotten from such a representation too. Writing $A$ in coordinates as

$$\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & & \vdots \\
\vdots & & \ddots & \vdots \\
a_{m1} & \cdots & \cdots & a_{mn}
\end{bmatrix},$$

we have that each entry has a very specific meaning. For each $i$ and $j$, $a_{ij}$ is what you multiply the $j$th entry of the input vector by to get its contribution to the $i$th entry of the output vector. If this is the matrix representation of a linear operator with respect to the standard basis (in domain and range), then you can replace "entry" with "basis vector" in the previous sentence. More explicitly,

$$A e_j = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & & \vdots \\
\vdots & & \ddots & \vdots \\
a_{m1} & \cdots & \cdots & a_{mn}
\end{bmatrix}
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}
= \begin{bmatrix} a_{1j} \\ \vdots \\ a_{ij} \\ \vdots \\ a_{mj} \end{bmatrix}
= \sum_{i=1}^m a_{ij} e_i,$$

where the input vector has its only nonzero entry, a 1, in the $j$th position, and $e_i$ is the vector with a 1 in the $i$th entry and zeroes in all other entries.

[2] See Sheldon Axler's Linear Algebra Done Right.

Eigenvectors and Eigenvalues

Let $\Omega$ be a symmetric $n \times n$ matrix. The eigenvalues of $\Omega$ are traditionally defined as the solutions of the characteristic polynomial

$$\det(\Omega - \lambda I) = 0.$$

This is probably the most handicapping way to think about them. For our purposes, it will be more convenient (and is entirely rigorous with symmetric matrices) to define an eigenvalue of $\Omega$ as a constant $\lambda_i$ such that $\Omega$ maps some vector $v_i$, called the corresponding eigenvector, to $\lambda_i v_i$. That is, $\Omega$ maps $v_i$ to a vector in the exact same direction as $v_i$, i.e. a multiple of $v_i$, and that multiple is the eigenvalue $\lambda_i$ corresponding to $v_i$.
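To see this definition in action, here is a small numpy sketch (the matrix is an arbitrary example of mine) checking that a symmetric matrix maps each of its eigenvectors to a multiple of itself:

```python
import numpy as np

# An arbitrary symmetric matrix (the particular numbers are just an example).
Omega = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh is numpy's eigensolver for symmetric (Hermitian) matrices.
eigvals, eigvecs = np.linalg.eigh(Omega)

for lam, v in zip(eigvals, eigvecs.T):
    # Omega maps each eigenvector to a multiple of itself: Omega v = lambda v.
    assert np.allclose(Omega @ v, lam * v)
```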
A trivial example of a symmetric $n \times n$ matrix with eigenvalues $\lambda_1, \dots, \lambda_n$ is the diagonal matrix

$$\Lambda = \begin{bmatrix}
\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \lambda_n
\end{bmatrix},$$

where, for each $i$, the eigenvalue $\lambda_i$ corresponds to the eigenvector $e_i$. Note that $\{e_1, \dots, e_n\}$ is an orthogonal basis for $\mathbb{R}^n$.

Can we obtain a matrix with the same eigenvalues but a different, given orthonormal basis as its set of eigenvectors (a change of coordinates)? The answer is yes, and the demonstration is trivial. Let us be given an orthonormal basis $\{q_1, \dots, q_n\}$. Each $q_i$ is an $n$-long vector. Let $Q = [q_1, \dots, q_n]$ be the $n \times n$ matrix obtained by stacking them horizontally. Then you can check that

$$B = Q \Lambda Q^T$$

has eigenvalues $\lambda_1, \dots, \lambda_n$ and corresponding eigenvectors $q_1, \dots, q_n$.

The much deeper and much more powerful result is that the converse also holds: suppose $\Omega$ is a real symmetric matrix; then there exists an orthonormal basis $\{q_1, \dots, q_n\}$ that consists of eigenvectors of $\Omega$. Furthermore, $\Omega$ can be decomposed as in the display above. This is called the spectral decomposition of $\Omega$.

The rank of a matrix

Given an $n \times p$ matrix $X$, we will often be interested in the rank of the matrix. The rank of $X$, denoted $\mathrm{rank}(X)$, is the dimension of the linear space spanned by its columns. It can also be defined as the number of linearly independent columns of $X$, or equivalently the number of linearly independent rows (see exercise 1).

Exercise 1 Let $X$ be an $n \times p$ real matrix. Suppose $X$ has $q$ linearly independent columns. Show that it has $q$ linearly independent rows.

Two inequalities

Let $A$ and $B$ be two compatible matrices (this will mean something different for each of the two inequalities; make sure the size requirements are clear to you), possibly rectangular. Then

$$\mathrm{rank}(AB) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\},$$

and

$$\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B).$$

These are inequalities you should never have to remember by heart. Consider the first one. The output vector of $B$ lies, by definition, in a $\mathrm{rank}(B)$-dimensional space; the image under $A$ of a $\mathrm{rank}(B)$-dimensional space is at most $\mathrm{rank}(B)$-dimensional, and the image of $A$ itself is at most $\mathrm{rank}(A)$-dimensional, so we see immediately that the rank of $AB$ is bounded above by both the rank of $A$ and the rank of $B$.

For square symmetric matrices, the spectral decomposition provides us with an immediate connection between the number of nonzero eigenvalues of a matrix and its rank: it is obvious (think about it) that the rank of a diagonal matrix is equal to the number of nonzero (diagonal) entries. Hence, for a square symmetric $\Omega$ with spectral decomposition $\Omega = Q\Lambda Q^T$, we have

$$\mathrm{rank}(\Omega) = \mathrm{rank}(Q\Lambda Q^T) = \min\{\mathrm{rank}(Q), \mathrm{rank}(\Lambda)\} = \mathrm{rank}(\Lambda) = \#\{\text{nonzero eigenvalues of } \Omega\}$$

(see exercise 2 for why the second equality indeed holds with equality).

In fact, proving both inequalities above in the special case of square symmetric matrices gives a good example of the relationship between nonzero eigenvalues and rank (see exercise 2). This gives another intuitive way to think about the rank and the size of the null space of a square symmetric matrix: let $\{q_1, \dots, q_n\}$ be the eigenbasis of $\Omega$ from the spectral decomposition of $\Omega$; then $\mathrm{rank}(\Omega)$ is the number of basis eigenvectors $q_i$ which are mapped to a nonzero vector by $\Omega$.

Exercise 2 Use the spectral decomposition to prove the special case of the following inequalities for square symmetric matrices $A$ and $B$:

$$\mathrm{rank}(AB) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\},$$

and

$$\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B).$$

For both inequalities, explain the requirement on the eigenvalues for the inequality to be strict, and show that the bound is tight (i.e. both can hold with equality).
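The rank/eigenvalue connection is easy to check numerically. A sketch (with a rank-2 symmetric matrix built for the purpose; the construction is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a rank-2 symmetric 4x4 matrix as a sum of two outer products.
u, v = rng.standard_normal(4), rng.standard_normal(4)
Omega = np.outer(u, u) + np.outer(v, v)

# rank(Omega) equals the number of nonzero eigenvalues.
eigvals = np.linalg.eigvalsh(Omega)
num_nonzero = int(np.sum(~np.isclose(eigvals, 0.0)))
assert np.linalg.matrix_rank(Omega) == num_nonzero == 2
```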
Orthogonality and Orthogonal Projection

The basic vector space concepts we will rely on are subspaces, spanning sets, linear independence, bases and dimensions of linear spaces, orthogonal vectors, orthogonal bases, and norms.[3]

[3] We don't go over these in this note because there are many available references for their definitions, but feel free to ask about these concepts in office hours. We will cover some basic definitions in section.

Let $X$ be a matrix, and let $C(X) \subset \mathbb{R}^n$ be the vector space spanned by its columns. We say that $M$ is a perpendicular projection operator (matrix) onto $C(X)$ if and only if

• projection: $v \in C(X)$ implies $Mv = v$;

• orthogonality: $w \perp C(X)$ implies $Mw = 0$.

Note that $w \perp C(X)$ means $w'v = 0$ for all $v \in C(X)$. Recall also that any linear subspace $C(X) \subset \mathbb{R}^n$ has an orthogonal complement $C(X)^{\perp}$ such that $C(X) \perp C(X)^{\perp}$, and every $v \in \mathbb{R}^n$ can be decomposed uniquely as

$$v = \underbrace{a}_{\in C(X)} + \underbrace{b}_{\in C(X)^{\perp}}.$$

Another, equivalent characterization of the orthogonal projection $M$ is that it is idempotent ($MM = M$) and symmetric ($M' = M$). Of course, the space spanned by the columns of $M$ will be the same as the one spanned by the columns of $X$ (see exercise 3).

Exercise 3 If $M$ is a perpendicular projection operator onto $C(X)$, then $C(X) = C(M)$.

Exercise 4 Prove the following. Let $o_1, \dots, o_r$ be an orthonormal basis for $C(X)$, and let $O = [o_1, \dots, o_r]$. Then $OO' = \sum_{i=1}^r o_i o_i^T$ is the perpendicular projection operator onto $C(X)$.

Exercise 5 For an $n \times p$ matrix $X$ of full rank, show that $X(X'X)^{-1}X'$ is the perpendicular projection operator onto $C(X)$.

Exercise 6 Show that if $M$ and $M_0$ are perpendicular projection matrices with $C(M_0) \subset C(M)$, then $M - M_0$ is a perpendicular projection matrix. (It projects onto the orthogonal complement of $C(M_0)$ within $C(M)$.)

Exercise 7 Show that for any matrix $X$, $C(X'X) = C(X')$.

Exercise 8 Show that for any matrix $X$, $\mathrm{rank}(XX') = \mathrm{rank}(X)$.

Exercise 9 Show that if $X$ is $n \times p$ and $\mathrm{rank}(X) = p$, then the $p \times p$ matrix $X'X$ is nonsingular.

The Geometry of OLS

The OLS problem is the following: given a model

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad E(\varepsilon_i) = 0, \; V(\varepsilon_i) = \sigma^2, \qquad i = 1, \dots, n,$$

we want to find $(\hat{\beta}_0, \dots, \hat{\beta}_p) \in \mathbb{R}^{p+1}$ minimizing

$$\sum_{i=1}^n (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}$. If we write in matrix form $Y = (y_1, \dots, y_n)^T$ and $\hat{Y} = (\hat{y}_1, \dots, \hat{y}_n)^T$, the above display can be rewritten (by definition) as $\|Y - \hat{Y}\|_2^2$, making explicit that we are in the business of minimizing the distance between two vectors.

To find the solution, you can brute-force it: take the derivative of $\|Y - \hat{Y}\|_2^2 = (Y - X\hat{\beta})'(Y - X\hat{\beta})$ with respect to $\hat{\beta}$, set it to zero, and solve for the minimizer $\hat{\beta}$ and the fitted response vector $\hat{Y} = X\hat{\beta}$. Alternatively, and relying on a more geometric intuition, we could recognize that this is an application of the Pythagorean theorem: if $Y \in \mathbb{R}^n$ and $C(X) \subset \mathbb{R}^n$ is a linear space, then the unique point $\hat{Y} \in C(X)$ that is closest to $Y$ (in the sense that it minimizes $\|Y - \hat{Y}\|_2^2$ over $C(X)$) is fully characterized by the property

$$(Y - \hat{Y}) \perp C(X).$$

Exercise 10 Show that the first order conditions obtained when minimizing $\|Y - \hat{Y}\|_2^2$ in $\hat{\beta}$ by differentiation yield the geometric condition $(Y - \hat{Y}) \perp C(X)$.
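Exercises 5 and 10 can both be checked numerically. Here is a sketch with simulated data (the simulated model and all names are mine, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Simulated design (with intercept column) and response.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta + rng.standard_normal(n)

# Perpendicular projection operator onto C(X); X is full rank here.
M = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = M @ Y

# The residual is orthogonal to C(X): X'(Y - Y_hat) = 0 (exercise 10).
assert np.allclose(X.T @ (Y - Y_hat), 0.0)

# M is idempotent and symmetric (the equivalent characterization).
assert np.allclose(M @ M, M) and np.allclose(M, M.T)
```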
That the minimizing point is characterized by an orthogonality condition is a lot of what makes the orthogonal projection apparatus and results so powerful in regression analysis. Furthermore, now that the geometry of the problem is clear, we can properly visualize it. The usual picture is displayed right below.

[Figure 1: Multivariate regression as an orthogonal projection.]

Let's put the projection together with the geometry of the problem as presented in figure 1. From the Pythagorean theorem, the fitted vector $\hat{Y} = X\hat{\beta}$ is precisely the orthogonal projection of $Y$ onto $C(X)$. In exercise 5 you found that this is given by $M = X(X^TX)^{-1}X^T$; that is, $\hat{Y} = MY$. The residual vector is the difference between $Y$ and $\hat{Y}$, so it is that part of $Y$ that is orthogonal to $C(X)$. That is, for $Q = (I - M)$ the orthogonal projection onto $C(X)^{\perp}$, we get $\hat{\varepsilon} = QY$. (You should be able to see immediately that $QMY = MQY = 0$ and $(M + Q)Y = Y$.)

Exercise 11 Is it correct that the usual drawing (figure 1) of multivariate regression as an orthogonal projection arises from a regression with design matrix

$$X = [X_1, X_2] = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}?$$

Exercise 12 Draw the equivalent of figure 1 for the omitted variable problem.

Exercise 13 Draw the equivalent of figure 1 for the instrumental variable problem.

The geometry of the F-test

Suppose we are testing the null hypothesis that the short model $Y = X_0\gamma + \varepsilon$ is correct against the alternative hypothesis that the long model $Y = X\beta + \varepsilon$, with $C(X_0) \subset C(X)$, is correct. In both cases we assume $\varepsilon \sim N(0, \sigma^2 I)$.

If the short model is correct, both $M_0Y$ (with $M_0 = X_0(X_0'X_0)^{-1}X_0'$) and $MY$ (with $M = X(X'X)^{-1}X'$) are estimating $E[Y]$, and their expected difference is zero, i.e. $E[MY - M_0Y] = 0$. However, if the short model is not correct, then $MY$ and $M_0Y$ are estimating different things, namely $M_0E[Y] \neq E[Y]$, and we should expect $MY$ and $M_0Y$ to be distant from each other, i.e. we would expect $\|MY - M_0Y\|_2^2$ to be big.

Relying on this opportunity for the power of a test, how can we get a pivotal test statistic? What is the distribution of $\|MY - M_0Y\|_2^2$ under the null? With our new tools, there's an easy way to see it. We have

$$\|MY - M_0Y\|_2^2 = Y'(M - M_0)Y,$$

where you can check that $M - M_0$ is an orthogonal projection matrix (and is thus idempotent; we used that to get the equality in the display above). In particular, there is an orthogonal (change of basis) matrix $Q$ such that the vectors of $C(Q(M - M_0))$ are all of the form

$$(\underbrace{* \cdots *}_{k} \; \underbrace{0 \cdots 0}_{n-k})',$$

where $k = \mathrm{rank}(M - M_0)$. And since under the null we can take $Y = \sigma Z$ for $Z \sim N(0, I)$ (the mean of $Y$ is annihilated by $M - M_0$), it follows immediately that

$$Y'(M - M_0)Y = \sigma^2 Z'(M - M_0)'(M - M_0)Z = \sigma^2 \big((Q(M - M_0)Q')(QZ)\big)'\big((Q(M - M_0)Q')(QZ)\big) = \sigma^2 \chi^2_{\mathrm{rank}(M - M_0)},$$

where the last equality holds because, by rotational invariance of standard Gaussian random vectors, $QZ \sim Z$, and $Q(M - M_0)Q'$ is the orthogonal projection which preserves the first $k$ coordinates of the input vector and sends the others to zero.

But we want a pivot. We can repeat the same argument and find that

$$Y'(I - M)Y = \sigma^2 \chi^2_{\mathrm{rank}(I - M)}.$$

You should be able to convince yourself[4] that $Y'(M - M_0)Y$ and $Y'(I - M)Y$ are independent. Therefore

$$\frac{Y'(M - M_0)Y / \mathrm{rank}(M - M_0)}{Y'(I - M)Y / \mathrm{rank}(I - M)} = \frac{\chi^2_{\mathrm{rank}(M - M_0)} / \mathrm{rank}(M - M_0)}{\chi^2_{\mathrm{rank}(I - M)} / \mathrm{rank}(I - M)} \sim F_{\mathrm{rank}(M - M_0), \, \mathrm{rank}(I - M)},$$

and we have a beautiful test statistic. It's also a nice example of using linear algebra directly to manipulate Gaussian random variables.

[4] Think about picking $Q$: in the canonical form we found, you are using different entries of the random vector for the numerator and the denominator.
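A Monte Carlo sketch of this pivot (simulated under an intercept-only null of my choosing; it also uses the fact that the rank of an idempotent matrix equals its trace):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40

# Short model: intercept only. Long model: intercept plus two regressors.
X0 = np.ones((n, 1))
X = np.column_stack([X0, rng.standard_normal((n, 2))])

def proj(A):
    # Perpendicular projection operator onto C(A) (A assumed full rank).
    return A @ np.linalg.inv(A.T @ A) @ A.T

M0, M = proj(X0), proj(X)

# For an idempotent matrix, rank equals trace.
d1 = int(round(np.trace(M - M0)))
d2 = int(round(np.trace(np.eye(n) - M)))

# Simulate under the null (short model true) and form the F statistic.
stats = []
for _ in range(10000):
    Y = 1.0 + rng.standard_normal(n)  # mu = 1, sigma = 1
    num = (Y @ (M - M0) @ Y) / d1
    den = (Y @ (np.eye(n) - M) @ Y) / d2
    stats.append(num / den)

# The mean of an F(d1, d2) random variable is d2 / (d2 - 2);
# the empirical mean should be close to it.
print(np.mean(stats), d2 / (d2 - 2))
```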
Positive-Definiteness and Matrix Inequalities

The positive definiteness of matrices has a very direct statistical interpretation for us, especially in terms of covariance matrices. We say of an $n \times n$ symmetric matrix $A$ that it is positive definite if

$$v'Av > 0 \qquad \forall\, v \in \mathbb{R}^n, \; v \neq 0.$$

If the inequality is not strict, we say it is positive semi-definite. To see the connection with covariance matrices, recall how to compute the variance of a linear combination of random variables; take

$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad V(X) = \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}.$$

We remember the formula by heart for $\alpha_1 X_1 + \alpha_2 X_2$. That is,

$$V(\alpha_1 X_1 + \alpha_2 X_2) = V(\alpha_1 X_1) + V(\alpha_2 X_2) + 2\,\mathrm{Cov}(\alpha_1 X_1, \alpha_2 X_2) = \alpha_1^2\sigma_{11} + \alpha_2^2\sigma_{22} + 2\alpha_1\alpha_2\sigma_{12}.$$

But we can see that all the information for the variance of linear combinations of the entries of $X$ is conveniently packed in the covariance matrix $\Sigma$. Indeed,

$$\alpha_1 X_1 + \alpha_2 X_2 = \begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix},$$

and

$$V\left(\begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\right) = \begin{bmatrix} \alpha_1 & \alpha_2 \end{bmatrix}\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} \alpha_1\sigma_{11} + \alpha_2\sigma_{12} & \alpha_1\sigma_{12} + \alpha_2\sigma_{22} \end{bmatrix}\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \alpha_1^2\sigma_{11} + 2\alpha_1\alpha_2\sigma_{12} + \alpha_2^2\sigma_{22},$$

so we get the formula back! Oftentimes it will be convenient to leave everything in matrix form for computations, but it's good to see we're really doing the same thing as before. Of course, all this readily generalizes to $n$-long vectors.

For two $n \times n$ matrices $A$ and $B$, we say that $A < B$ if $B - A$ is positive definite. If $A < B$, this means that for random vectors $X_A$ with $V(X_A) = A$ and $X_B$ with $V(X_B) = B$, and for any linear combination $\alpha$ of their entries, we have

$$V(\alpha'X_A) = \alpha'A\alpha < \alpha'B\alpha = V(\alpha'X_B),$$

where the inequality holds because $\alpha'(B - A)\alpha = \alpha'B\alpha - \alpha'A\alpha > 0$. So positive definiteness has a direct interpretation in terms of covariances!

Let's look at an immediate application of what we have done until now. This was a (very linear) question in a nonlinear econometrics final: show that the asymptotic variance of the homoskedastic 2SLS coefficient estimate decreases with the number of non-redundant instruments. Let $Y$ be the outcome variable, $X$ the design matrix (of regressors, some endogenous), and $Z$ the matrix of instruments. Then

$$V(\hat{\beta}_{2SLS}) = \sigma^2\big(X'M_{C(Z)}X\big)^{-1},$$

where $M_{C(Z)}$ is the orthogonal projection onto $C(Z)$. Now suppose we are given additional instruments $Z_+$ and form the extended matrix of instruments $\tilde{Z} = [Z, Z_+]$. Then $C(\tilde{Z}) \supset C(Z)$, and we can decompose $M_{C(\tilde{Z})}$ as

$$M_{C(\tilde{Z})} = M_{C(Z)} + M_{C(Z)^{\perp}_{C(\tilde{Z})}},$$

where $C(Z)^{\perp}_{C(\tilde{Z})} = \{v \in C(\tilde{Z}) : v \perp C(Z)\}$. And we observe that, for any compatible nonzero vector $v$,

$$v'X'M_{C(\tilde{Z})}Xv = (Xv)'\big(M_{C(Z)} + M_{C(Z)^{\perp}_{C(\tilde{Z})}}\big)(Xv) = (Xv)'M_{C(Z)}(Xv) + \underbrace{(Xv)'M_{C(Z)^{\perp}_{C(\tilde{Z})}}(Xv)}_{>0} > (Xv)'M_{C(Z)}(Xv) = v'X'M_{C(Z)}Xv.$$

By definition, this means $X'M_{C(\tilde{Z})}X > X'M_{C(Z)}X$, and thus

$$\sigma^2\big(X'M_{C(\tilde{Z})}X\big)^{-1} < \sigma^2\big(X'M_{C(Z)}X\big)^{-1}.$$

Exercise 14 Show that if $B > A$, with both $A$ and $B$ full rank symmetric matrices, then $B^{-1} < A^{-1}$.

Gram-Schmidt Theorem and the QR Decomposition

The Gram-Schmidt theorem states that, given some basis for a space, we can find an orthonormal basis for that space.

The Gram-Schmidt Theorem Let $N$ be a space with basis $\{x_1, \dots, x_r\}$. There exists an orthonormal basis for $N$, say $\{y_1, \dots, y_r\}$, with $y_s$ in the space spanned by $x_1, \dots, x_s$, $s = 1, \dots, r$.

The algorithm for building the basis is basically the proof itself (you can, and should, check that it satisfies the claim). To get $\{y_1, \dots, y_r\}$, simply define the $y_i$'s inductively:

$$y_1 = x_1/\sqrt{x_1'x_1},$$

$$w_s = x_s - \sum_{i=1}^{s-1}(x_s'y_i)\,y_i, \qquad y_s = w_s/\sqrt{w_s'w_s}, \qquad s = 2, \dots, r,$$

and you're done!
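A direct transcription of this algorithm (a numpy sketch, assuming the input columns are linearly independent; function and variable names are mine):

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the columns of X by the inductive recipe above.

    Assumes the columns of X are linearly independent.
    """
    n, r = X.shape
    Y = np.zeros((n, r))
    for s in range(r):
        # w_s = x_s minus its components along the already-built y_1,...,y_{s-1}.
        w = X[:, s] - Y[:, :s] @ (Y[:, :s].T @ X[:, s])
        Y[:, s] = w / np.sqrt(w @ w)
    return Y

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
Y = gram_schmidt(X)

# The resulting columns are orthonormal: Y'Y = I.
assert np.allclose(Y.T @ Y, np.eye(4))

# y_s lies in span{x_1, ..., x_s}: the change-of-basis matrix from the
# columns of X to those of Y is upper triangular.
C = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(np.tril(C, -1), 0.0)
```

The upper-triangular change of basis verified in the last assertion is exactly the triangular flavor the next paragraph picks up on.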
You can already get the triangular flavor of the procedure. Indeed, the Gram-Schmidt process naturally gives rise to the QR decomposition. Let $(e_1, \dots, e_n)$ be the orthonormal basis obtained through the Gram-Schmidt process. Then observe that we can write the original vectors as

$$x_1 = \langle e_1, x_1\rangle e_1,$$

$$x_2 = \langle e_1, x_2\rangle e_1 + \langle e_2, x_2\rangle e_2,$$

and so on. Note that, for instance, $\langle e_1, x_2\rangle$ picks up the coefficient of $x_2$ on $e_1$ in the $e$ basis. In general,

$$x_k = \sum_{j=1}^n \mathrm{proj}_{e_j} x_k = \sum_{j=1}^n \frac{\langle e_j, x_k\rangle}{\langle e_j, e_j\rangle}\, e_j = \sum_{j=1}^n \langle e_j, x_k\rangle e_j = \sum_{j=1}^k \langle e_j, x_k\rangle e_j.$$

Writing this in matrix form yields the QR decomposition right away:

$$(x_1, \dots, x_n) = X = (e_1, \dots, e_n)\begin{bmatrix}
\langle e_1, x_1\rangle & \langle e_1, x_2\rangle & \langle e_1, x_3\rangle & \cdots \\
0 & \langle e_2, x_2\rangle & \langle e_2, x_3\rangle & \cdots \\
0 & 0 & \langle e_3, x_3\rangle & \cdots \\
\vdots & & & \ddots
\end{bmatrix}.$$

Exercise 15 Use the QR decomposition to prove Hadamard's inequality: if $X = (x_1, \dots, x_n)$, then

$$|\det X| \le \prod_{j=1}^n \|x_j\|.$$

Equality holds here if and only if either the $x_j$ are mutually orthogonal or some $x_j$ is zero.

Variational Representation of Eigenvalues

Thinking of a symmetric matrix $A$ as "big" if, relatively speaking, it gives a bigger number $v'Av$ for any compatible vector $v$ invites an alternate and very intuitive definition of eigenvalues.

The Courant-Fischer-Weyl Minimax Principle Let $A$ be a symmetric matrix on $C(X)$, with $n = \dim C(X)$ and eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$. Then

$$\lambda_k(A) = \max_{\substack{M \subset C(X) \\ \dim M = k}} \; \min_{\substack{x \in M \\ \|x\| = 1}} x'Ax = \min_{\substack{M \subset C(X) \\ \dim M = n-k+1}} \; \max_{\substack{x \in M \\ \|x\| = 1}} x'Ax.$$

Proof: Just observe that, for the first equality, a maximizing subspace is $M = \mathrm{span}\{u_1, \dots, u_k\}$, the span of the top $k$ eigenvectors, and the minimizing vector within it is $u_k$, the eigenvector corresponding to the $k$th (in decreasing order) eigenvalue. To prove the second equality, apply the same thinking. ∎
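A partial numerical illustration of the first equality (a sketch only: it checks that the span of the top $k$ eigenvectors attains $\lambda_k$, not that no other subspace does better; the example matrix is mine):

```python
import numpy as np

rng = np.random.default_rng(4)

# A random symmetric matrix.
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2

# numpy returns eigenvalues in increasing order; flip to decreasing.
vals, vecs = np.linalg.eigh(A)
vals, vecs = vals[::-1], vecs[:, ::-1]

k = 3
U = vecs[:, :k]  # the top-k eigenvectors span the maximizing subspace M

# Over unit vectors in M, the Rayleigh quotient never falls below lambda_k...
for _ in range(1000):
    c = rng.standard_normal(k)
    x = U @ (c / np.linalg.norm(c))  # a unit vector in M
    assert x @ A @ x >= vals[k - 1] - 1e-10

# ...and the eigenvector u_k attains it.
u_k = vecs[:, k - 1]
assert np.isclose(u_k @ A @ u_k, vals[k - 1])
```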