Note 4 Econometrics Harvard

Econometrics • USP-SP

Eduardo Nerd

15/03/2018

Harvard Economics
Ec 1126
Tamer - September 25, 2015
Note 4 - Inference in CRM
Note 4: Least Squares and the Normal Linear Model
Background: This Note requires some use of linear and matrix algebra. So, some of you
may need to keep a reference for matrix algebra handy.
1 Statistical properties of Least Squares
In frequentist inference, we use a confidence set to express our uncertainty about the parameter of
interest β which is taken to be a fixed constant (that is unknown and hence is to be estimated).
Whether we are interested in the best linear predictor or in the conditional
expectation function, both are population quantities: they depend on the distribution
F that generated the data, and so are typically not available. What is available is a random sample
from this population (from F). So, the problem of statistical inference is how to learn about these
population quantities from data.
In a random sample of size N , we have N independent draws (with replacement) from the
same population. The ith draw results in the random variable or random vector (Yi, Xi). The joint
distribution of this list of random variables is the population distribution F . We can summarize
random sampling by saying
\[ (Y_i, X_i) \overset{\text{i.i.d.}}{\sim} F \qquad (i = 1, \ldots, N) \]
Here i.i.d. stands for independent and identically distributed.
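For notation, let
\[
\underbrace{Y}_{N \times 1} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_N \end{pmatrix}, \qquad
\underbrace{X}_{N \times K} = \begin{pmatrix} X_1' \\ \vdots \\ X_N' \end{pmatrix}.
\]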
We have derived the least squares estimator as $\hat\beta = (X'X)^{-1}X'Y$, where
\[ E^*[Y_i \mid X_i] = X_i'\beta = \beta_0 + X_{1i}\beta_1 + \ldots + X_{Ki}\beta_K = \sum_{k=0}^{K} X_{ki}\beta_k \]
(and assume here that the first column of $X$, $X_0 = 1$, is a column of ones).
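As a small numerical illustration of the formula $\hat\beta = (X'X)^{-1}X'Y$, the sketch below computes the least squares estimator on simulated data. The sample size, the coefficient values, and the variable names are assumptions made for the example; they do not come from the note.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200
# Design matrix: a column of ones (the constant) plus two simulated predictors.
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -0.5])      # hypothetical "true" coefficients
Y = X @ beta + rng.normal(size=N)      # simulated outcomes

# beta_hat = (X'X)^{-1} X'Y, obtained by solving the normal equations
# (X'X) b = X'Y instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
print(beta_lstsq)
```

Solving the normal equations is used here only because it mirrors the formula; for ill-conditioned X a routine such as `np.linalg.lstsq` is usually the safer choice.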
The question that we want to answer: how “good” is $\hat\beta$ as an estimator for $\beta$, which was
derived earlier (see footnote 1) as $\beta = [E(X_i X_i')]^{-1} E(X_i Y_i)$? Here, $\hat\beta$ is a random variable that depends on the
sample size $N$.
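For example, bias is one measure that we can use to judge how “good” $\hat\beta$ is. We say that $\hat\beta$
is an unbiased estimator of $\beta$ if
\[ E[\hat\beta] = \beta. \]
Without any further assumptions, it seems difficult to proceed since $\hat\beta$ is a nonlinear function
of $X$ ($E[(X'X)^{-1}] \neq [E(X'X)]^{-1}$), i.e., in general $E[\hat\beta] \neq \beta$. We can, using iterated expectation, get
\[ E[\hat\beta] = E\big[(X'X)^{-1}X'E[Y \mid X]\big] \]
or more neatly,
\[ E[\hat\beta \mid X] = (X'X)^{-1}X'E[Y \mid X] = (X'X)^{-1}X' \begin{pmatrix} r(X_1) \\ \vdots \\ r(X_N) \end{pmatrix}, \]
where $r(X_i) = E[Y_i \mid X_i]$ denotes the conditional expectation function.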
So far we have not made any assumptions other than random sampling. Now suppose that
the linear predictor, which is intended to approximate the conditional expectation, actually equals
the conditional expectation:
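\[ r(X_i) = X_i'\beta. \]
Then, we have
\[ E[\hat\beta \mid X] = (X'X)^{-1}X'E[Y \mid X] = (X'X)^{-1}X' \begin{pmatrix} X_1' \\ \vdots \\ X_N' \end{pmatrix} \beta = (X'X)^{-1}X'X\beta = \beta. \]
Applying iterated expectation yields
\[ E[\hat\beta] = E\big[E[\hat\beta \mid X]\big] = E[\beta] = \beta. \]
This means that if $r(X_i) = X_i'\beta$, then $\hat\beta$ is an unbiased estimator for $\beta$. Otherwise, the conditional
expectation of $\hat\beta$ is $(X'X)^{-1}X'$ applied to the vector $(r(X_1), \ldots, r(X_N))'$, i.e., a linear combination
of the values of the conditional expectation function, which need not equal $\beta$.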
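The unbiasedness result can be checked by simulation. The following is a minimal sketch, assuming a linear conditional expectation, standard normal errors, and a design matrix held fixed across replications; all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps = 50, 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # held fixed across draws
beta = np.array([1.0, 2.0])                            # hypothetical true value
A = np.linalg.solve(X.T @ X, X.T)                      # (X'X)^{-1} X', a K x N matrix

# Each column of Y_draws is one sample of Y given X, with E[Y|X] = X beta.
Y_draws = (X @ beta)[:, None] + rng.normal(size=(N, reps))
beta_hats = A @ Y_draws                                # K x reps, one estimate per column

# Averaging over replications approximates E[beta_hat | X].
mean_beta_hat = beta_hats.mean(axis=1)
print(mean_beta_hat)
```

The printed average should be close to the assumed `beta`, up to Monte Carlo error.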
This leads us to an important model that makes even stronger assumptions on the relationship
between $Y_i$ and $X_i$.
Footnote 1: Reminder: $\beta$ is a population object, and so is $E(X_i X_i')$, where $X_i$ is a $K \times 1$ vector that includes all the predictors,
including the constant 1.
2 Classical Regression Model - Or the Normal Linear Model
We shall first see how to obtain confidence sets using the normal linear model, i.e., the classical least
squares model. The normal linear model is a parametric model. A parametric
model is a set of assumptions requiring that the density $f$ of $(Y_i, X_i)$ belongs to a set of densities
$\mathcal{F}_\theta = \{f_\theta : \theta \in \mathbb{R}^k\}$, where $f_\theta$ is known up to $\theta$, a finite-dimensional parameter. For example,
in a location model, $(Y_i, X_i) \sim f_\mu$ with $\mu \in \mathbb{R}^2$ and $f_{\mu=(\mu_1,\mu_2)}(z) = \phi(z_1 - \mu_1, z_2 - \mu_2)$, where $\phi$
is the standard bivariate normal density. So, the problem of learning an unknown density in this
parametric model is one of learning $\mu = (\mu_1, \mu_2) \in \mathbb{R}^2$.
The normal linear model allows us to characterize the statistical properties of the least
squares estimator. Later, we will relax the assumptions of the normal linear model and instead use
approximations, via limit theorems such as the law of large numbers and central limit theorem,
to develop confidence sets for nonparametric models. Nonparametric models are ones where the
distribution of the data belongs to a set of distributions that is richer than a parametric
class. These confidence sets will be derived using limit arguments, as the sample size $N$ tends to
infinity. So there will always be the issue of how well the limit theorems approximate finite-sample
properties. In short, parametric models yield exact inference at the cost of parametric assumptions
(what happens if the true density of the data does not belong to the parametric class?), while
nonparametric models require one to appeal to large-sample approximations.
First, we start with the classical model with N fixed.
Definition 2.1. Classical Linear Model Assumptions
Let the following hold:
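1. (random sampling)
\[ (Y_i, X_i) \overset{\text{i.i.d.}}{\sim} f \qquad (i = 1, \ldots, N) \]
2. (normality)
\[ Y_i \mid X_i \sim N(X_i'\beta, \sigma^2) \quad \text{for } i = 1, \ldots, N. \]
This means that $E[Y_i \mid X_i] = X_i'\beta$ and $Var(Y_i \mid X_i) = \sigma^2$.
The normal linear model can also be written as
\[ Y_i = X_i'\beta + \varepsilon_i \]
where $\varepsilon_i \sim N(0, \sigma^2)$ and is independent of $X_i$.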
2.1 Expectation and Variance of the Least Squares Estimator
Given the above, the claim below follows.
Claim 2.2. Under the normal model assumptions, the least squares estimator is an unbiased
estimator of $\beta$.
Now, to characterize “noise,” it is natural to compute the variance of the estimator. Generally,
the smaller the variance, the less the statistical noise (i.e., the higher the precision). Here,
noise characterizes sampling uncertainty: if we had access to the whole population (say,
if we observed everyone), this kind of uncertainty would go to zero.
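Claim 2.3. Suppose that the random vector $Y$ is $N \times 1$, and the nonrandom matrices $d_1$ and $d_2$
are $M \times N$ and $M \times 1$. Then
\[ Cov(d_1 Y + d_2) = d_1 Cov(Y) d_1'. \]
Proof (sketch): First, note that with $\tilde Y = Y - EY$,
\[ Cov(Y) = E[\tilde Y \tilde Y']. \]
The centered version of $d_1 Y + d_2$ is $d_1 Y + d_2 - E(d_1 Y + d_2) = d_1(Y - EY) = d_1 \tilde Y$, and so
\[ Cov(d_1 Y + d_2) = E\{ d_1(Y - EY)[d_1(Y - EY)]' \} = d_1 E(\tilde Y \tilde Y') d_1' = d_1 Cov(Y) d_1'. \qquad \blacksquare \]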
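Claim 2.3 also holds exactly for sample covariance matrices, which gives a quick numerical check. The sketch below uses arbitrary simulated values for $Y$, $d_1$, and $d_2$; the dimensions are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, draws = 3, 2, 1000
L = rng.normal(size=(N, N))
Y = L @ rng.normal(size=(N, draws))   # each column is one draw of the N x 1 vector Y
d1 = rng.normal(size=(M, N))          # nonrandom M x N matrix
d2 = rng.normal(size=(M, 1))          # nonrandom M x 1 vector

cov_Y = np.cov(Y)                     # N x N sample covariance (rows are variables)
cov_Z = np.cov(d1 @ Y + d2)           # M x M sample covariance of d1 Y + d2

# The shift d2 drops out and d1 multiplies Cov(Y) on both sides.
print(np.allclose(cov_Z, d1 @ cov_Y @ d1.T))
```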
So, now we are able to obtain the variance of the least squares estimator of β.
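Applying Claim 2.3 conditional on $X$, with $d_1 = (X'X)^{-1}X'$ and $d_2 = 0$, we have
\[ Cov(\hat\beta \mid X) = (X'X)^{-1}X' \, Cov(Y \mid X) \, X(X'X)^{-1}. \]
The covariance of $Y$ conditional on $X$ is an $N \times N$ matrix as follows:
\[ Cov(Y \mid X) = \begin{pmatrix} Var(Y_1 \mid X) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Var(Y_N \mid X) \end{pmatrix}. \]
The off-diagonal elements are zero, $Cov(Y_i, Y_j \mid X) = 0$, because observations $i$ and $j$ are
independent for $i \neq j$ due to random sampling. Also, given the normal model assumptions, we
have $Var(Y_i \mid X_i) = \sigma^2$ and so
\[ Cov(Y \mid X) = \sigma^2 I_N \]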
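where $I_N$ is the $N \times N$ identity matrix. This leads to
\[ Cov(\hat\beta \mid X) = (X'X)^{-1}X'(\sigma^2 I_N)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}. \]
So, $Cov(\hat\beta \mid X)$ is $K \times K$:
\[ Cov(\hat\beta \mid X) = \begin{pmatrix} Cov(\hat\beta_1, \hat\beta_1 \mid X) & \cdots & Cov(\hat\beta_1, \hat\beta_K \mid X) \\ \vdots & \ddots & \vdots \\ Cov(\hat\beta_K, \hat\beta_1 \mid X) & \cdots & Cov(\hat\beta_K, \hat\beta_K \mid X) \end{pmatrix} \]
where $Cov(\hat\beta_1, \hat\beta_1 \mid X) = Var(\hat\beta_1 \mid X)$. Let $[(X'X)^{-1}]_{jk}$ denote the $(j, k)$ element of $(X'X)^{-1}$.
Then we have
\[ Cov(\hat\beta_j, \hat\beta_k \mid X) = \sigma^2 [(X'X)^{-1}]_{jk}. \]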
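The formula $Cov(\hat\beta \mid X) = \sigma^2 (X'X)^{-1}$ can be compared with a Monte Carlo covariance. This is a sketch under assumed values for $N$, $\sigma$, and $\beta$, with $X$ held fixed so that the comparison is conditional on $X$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps, sigma = 40, 20000, 1.5                        # hypothetical settings
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # fixed design
beta = np.array([0.5, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the normal errors many times and re-estimate beta each time.
eps = sigma * rng.normal(size=(N, reps))
beta_hats = XtX_inv @ X.T @ ((X @ beta)[:, None] + eps)  # K x reps

cov_mc = np.cov(beta_hats)            # Monte Carlo covariance of beta_hat
cov_theory = sigma**2 * XtX_inv       # sigma^2 (X'X)^{-1}
print(cov_mc)
print(cov_theory)
```

The two printed matrices should agree up to simulation noise.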
2.2 Gauss Markov Theorem
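Suppose we have any other unbiased linear estimator $\tilde\beta$. Here, maintain the assumption that the
$X$'s are fixed constants (for easier notation; see footnote 2). Then it must be of the form $\tilde\beta = C'Y$ with $E(\tilde\beta) = \beta$,
where we write $C' = (X'X)^{-1}X' + D'$, i.e., $D'$ is the difference between $C'$ and the least squares weights $(X'X)^{-1}X'$.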
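Note that
\[ \tilde\beta = C'(X\beta + \varepsilon) = C'X\beta + C'\varepsilon \;\Rightarrow\; E(\tilde\beta) = C'X\beta. \]
So $\tilde\beta$ is unbiased if $C'X = I$, which implies $D'X = 0$. Then
\begin{align*}
var(\tilde\beta) &= \sigma^2 C'C \\
&= \sigma^2 [(X'X)^{-1}X' + D'][(X'X)^{-1}X' + D']' \\
&= \sigma^2 [(X'X)^{-1}X'X(X'X)^{-1} + (X'X)^{-1}\underbrace{X'D}_{=0} + \underbrace{D'X}_{=0}(X'X)^{-1} + D'D] \\
&= \sigma^2 (X'X)^{-1} + \sigma^2 D'D \\
&= var(\hat\beta) + \sigma^2 D'D.
\end{align*}
Since $D'D$ is positive semidefinite, the proof is complete.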
This essentially is a proof of the following.
Footnote 2: Effectively, the calculation would need to be made conditional on $X$, which is more cumbersome.
Theorem 2.4. (Gauss-Markov theorem)
In the classical regression model, the least squares estimator is best (has the smallest variance) among all linear
unbiased estimators.
Proof. This was done above the statement of the theorem.
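The argument behind the theorem can be illustrated numerically. The sketch below builds one alternative linear unbiased estimator by choosing $D' = GM$, where $M = I - X(X'X)^{-1}X'$ and $G$ is an arbitrary matrix; this construction of $D$ is an assumption made for the example (it guarantees $D'X = 0$), as are the values of $N$, $K$, and $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, sigma2 = 30, 3, 2.0                 # hypothetical settings
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(N) - X @ XtX_inv @ X.T         # annihilator matrix: M X = 0

# Any D' = G M satisfies D'X = 0; G is an arbitrary K x N matrix.
Dt = rng.normal(size=(K, N)) @ M
Ct = XtX_inv @ X.T + Dt                   # weights of the alternative estimator C'Y

var_ols = sigma2 * XtX_inv                # variance of least squares
var_alt = sigma2 * Ct @ Ct.T              # variance of the alternative estimator
gap = var_alt - var_ols                   # should equal sigma^2 D'D

print(np.allclose(Ct @ X, np.eye(K)))           # unbiasedness condition C'X = I
print(np.linalg.eigvalsh(gap).min() >= -1e-10)  # gap is positive semidefinite
```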
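2.3 Sample version (or estimator) of $\sigma^2$
To be able to use a feasible version of the variance matrix of $\hat\beta$, we require a sample estimator of
$\sigma^2$.
Note that
\[ e = Y - \hat Y = X\beta + \varepsilon - X\hat\beta = X\beta + \varepsilon - X(X'X)^{-1}X'(X\beta + \varepsilon) = [I - X(X'X)^{-1}X']\varepsilon = M\varepsilon, \]
where $M = I - X(X'X)^{-1}X'$. Reminder: $M$ is symmetric and idempotent (see footnote 3), thus
\[ e'e = \varepsilon'M'M\varepsilon = \varepsilon'M\varepsilon. \]
Now since $\varepsilon'M\varepsilon$ is a scalar, we have $\varepsilon'M\varepsilon = Tr(\varepsilon'M\varepsilon)$ (see footnote 4), and so (suppressing the conditioning on $X$)
\[ E[e'e] = E[\varepsilon'M\varepsilon] = E[Tr(\varepsilon'M\varepsilon)] = E[Tr(M\varepsilon\varepsilon')] = Tr(E[M\varepsilon\varepsilon']) = Tr(M\sigma^2 I) = \sigma^2 Tr(M). \]
And $Tr(M) = Tr[I_N - X(X'X)^{-1}X'] = Tr(I_N) - Tr(X(X'X)^{-1}X') = N - Tr((X'X)^{-1}X'X) = N - Tr(I_K) = N - K$. Thus
\[ E[e'e \mid X] = \sigma^2 (N - K) \]
and so unconditionally also, $E[e'e] = \sigma^2(N - K)$, by iterated expectations, since the conditional expectation does not depend on $X$.
Therefore we can use the following estimator for $\sigma^2$:
\[ \hat\sigma^2 = \frac{e'e}{N - K} \equiv s^2. \]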
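Hence, overall,
\[ \widehat{var}[\hat\beta] = (X'X)^{-1}s^2 = \frac{1}{N}\Big(\frac{1}{N}\sum_i x_i x_i'\Big)^{-1} s^2 = \frac{1}{N}\Big(\frac{1}{N}\sum_i x_i x_i'\Big)^{-1} \frac{1}{N-K}\sum_i (y_i - x_i'\hat\beta)^2. \]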
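The quantities in this subsection can be computed directly. The following sketch, on simulated data with assumed values of $N$, $K$, $\sigma$, and the coefficients, checks that $Tr(M) = N - K$ and then forms $s^2$, the estimated variance matrix, and the standard errors.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, sigma = 100, 3, 2.0                 # hypothetical settings
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
Y = X @ np.array([1.0, 0.5, -0.5]) + sigma * rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat                      # residuals, e = M eps

M = np.eye(N) - X @ XtX_inv @ X.T
print(np.trace(M))                        # equals N - K up to rounding

s2 = e @ e / (N - K)                      # s^2 = e'e / (N - K)
var_hat = s2 * XtX_inv                    # estimated variance matrix of beta_hat
se = np.sqrt(np.diag(var_hat))            # standard errors se(beta_hat_k)
print(s2, se)
```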
Footnote 3: The only nonsingular idempotent $n \times n$ matrix is $I_n$: suppose $A = AA$ and $A$ is nonsingular. Then $A = A^{-1}AA = A^{-1}A = I$.
Footnote 4: Properties of trace: 1) $Tr(A + B) = Tr A + Tr B$, 2) $Tr(AB) = Tr(B'A')$, 3) $Tr(ABC) = Tr(BCA) = Tr(CAB)$, which in general differs from $Tr(BAC)$.
3 Reminder of multivariate normal distribution
Let $y = [y_1, \ldots, y_N]'$ be a random vector with multivariate normal distribution with parameters
$\mu = [\mu_1, \ldots, \mu_N]'$ (the $N \times 1$ mean vector) and $\Sigma$ (an $N \times N$ positive definite real variance-covariance
matrix), written $y \sim N(\mu, \Sigma)$. Then $y$ has probability density function
\[ f(y) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\Big(-\frac{w}{2}\Big), \quad \text{where } w = (y - \mu)'\Sigma^{-1}(y - \mu) \text{ and } |\Sigma| = \det(\Sigma). \]
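Conditional distribution: Let
\[ y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \]
then the unconditional distribution of $y_1$ is $y_1 \sim N(\mu_1, \Sigma_{11})$. The distribution of $y_1$ conditional on $y_2$ is
multivariate normal, $y_1 \mid y_2 \sim N(\mu_1^*, \Sigma_1^*)$, where
\[ \mu_1^* = E(y_1 \mid y_2) = \mu_1 + [\Sigma_{22}^{-1}\Sigma_{21}]'(y_2 - \mu_2) = \underbrace{\mu_1 - [\Sigma_{22}^{-1}\Sigma_{21}]'\mu_2}_{\alpha = \mu_1 - \beta'\mu_2} + \underbrace{[\Sigma_{22}^{-1}\Sigma_{21}]'}_{\beta'} y_2 = \alpha + \beta'y_2, \]
with $\Sigma_{21} = \Sigma_{12}'$ the covariance between $y_2$ and $y_1$.
We can derive the variance by the law of iterated variance (see footnote 5),
\[ \Sigma_{11} = V[E(y_1 \mid y_2)] + E[V(y_1 \mid y_2)] = V[\alpha + \beta'y_2] + E\Sigma_1^* = \beta'\Sigma_{22}\beta + \Sigma_1^* \]
(note here that $V(y_1 \mid y_2)$ does not depend on $y_2$; can you show this?). Hence
\[ \Sigma_1^* = \Sigma_{11} - \beta'\Sigma_{22}\beta. \]
Then $y_1 \mid y_2 \sim N(\alpha + \beta'y_2, \Sigma_1^*)$. Note here that if we treat this as running a regression of $y$ (here
$y_1$) on $x$ (here $y_2$), we get that (from the previous Note) $\beta$ is the covariance between $y$ and $x$ divided
by the variance of $x$. This should remind you of our $\beta$ here, which has this form: $\beta = \Sigma_{22}^{-1}\Sigma_{21}$.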
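The conditional mean and variance formulas can be evaluated for a concrete joint normal. In the sketch below, the mean vector, the covariance matrix, and the conditioning value are hypothetical numbers, $y_1$ is scalar and $y_2$ is two-dimensional, and $\Sigma_{21}$ denotes the block of covariances between $y_2$ and $y_1$ (the transpose of $\Sigma_{12}$).

```python
import numpy as np

# Hypothetical joint normal for (y1, y2): scalar y1, two-dimensional y2.
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Partition the mean and covariance to match y = (y1, y2).
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
mu1, mu2 = mu[:1], mu[1:]

beta = np.linalg.solve(S22, S21)          # beta = Sigma22^{-1} Sigma21
alpha = mu1 - beta.T @ mu2                # alpha = mu1 - beta' mu2
Sigma1_star = S11 - beta.T @ S22 @ beta   # conditional variance of y1 given y2

y2 = np.array([0.5, 1.0])                 # a value of y2 to condition on
cond_mean = alpha + beta.T @ y2           # E(y1 | y2) = alpha + beta' y2
print(cond_mean, Sigma1_star)
```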
Functions of standard normal variables:
1. Let $z \sim N(0, I_n)$, then $z'z = \sum_{i=1}^{n} z_i^2 \sim \chi^2(n)$,
2. $E[\chi^2(n)] = n$ and $var[\chi^2(n)] = 2n$.
3. Let $v = \frac{w_1/m}{w_2/n}$, where $w_1 \sim \chi^2(m)$, $w_2 \sim \chi^2(n)$ and $w_1, w_2$ are independent, then $v \sim F(m, n)$,
4. Let $u = \frac{z}{\sqrt{w/n}}$, where $z \sim N(0, 1)$ and $w \sim \chi^2(n)$ and $z, w$ are independent, then $u \sim t(n)$
and $u^2 \sim F(1, n)$.
5. Let $y \sim N(0, \Sigma)$ with $y$ an $n \times 1$ vector, then $y'\Sigma^{-1}y \sim \chi^2(n)$.
Proof (of 5.): $\Sigma$ can be factorized into $\Sigma = PP'$, which means that $\Sigma^{-1} = (P^{-1})'P^{-1}$. Then
define $x = P^{-1}y$; the elements of $x$ are independent $N(0, 1)$ random variables since $x \sim N(0, I_n)$, and $y'\Sigma^{-1}y = x'x$.
Then we can use 1. above to get the result.
Footnote 5: Law of iterated variance: $V(y_1) = V[E(y_1 \mid y_2)] + E[V(y_1 \mid y_2)]$
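The proof of item 5 translates directly into a simulation: factor $\Sigma = PP'$, transform $y$ into $x = P^{-1}y$, and compare the moments of $x'x$ with those of a $\chi^2(n)$ variable. The covariance matrix below is an arbitrary positive definite matrix built for the example, and the Cholesky factor is one assumed choice of $P$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, draws = 4, 100000
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)       # an arbitrary positive definite covariance
P = np.linalg.cholesky(Sigma)         # lower-triangular P with Sigma = P P'

y = P @ rng.normal(size=(n, draws))   # columns are draws of y ~ N(0, Sigma)
x = np.linalg.solve(P, y)             # x = P^{-1} y, distributed N(0, I_n)
q = np.sum(x**2, axis=0)              # y' Sigma^{-1} y = x'x for each draw

print(q.mean(), q.var())              # compare with n and 2n
```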
4 Confidence Intervals
Normality Assumption Again: The ($N \times 1$) vector $y = X\beta + \varepsilon$ has a multivariate normal
distribution, $y \sim N(X\beta, \sigma^2 I_N)$.
Therefore
\[ \hat\beta = (X'X)^{-1}X'y \sim N(\beta, \sigma^2 (X'X)^{-1}). \]
Suppose we want to summarize the statistical uncertainty in the sample regarding the value
of a parameter (taken to be an unknown constant). One common way to do this is to construct
a confidence interval. We will do this first for a univariate parameter. Then, we will generalize.
First, we derive distributions for statistics that are used to construct a confidence interval.
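Proposition 4.1. The following statistic has a t-distribution with $N - K$ degrees of
freedom:
\[ t_k = \frac{\hat\beta_k - \beta_k}{se(\hat\beta_k)} = \frac{\hat\beta_k - \beta_k}{\sqrt{s^2 [(X'X)^{-1}]_{kk}}} \sim t(N - K). \]
Proof. Denote $z_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\sigma^2 [(X'X)^{-1}]_{kk}}}$ and $q = \frac{e'e}{\sigma^2}$, and note that $z_k \sim N(0, 1)$. Hence
\[ t_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\sigma^2 [(X'X)^{-1}]_{kk}}} \sqrt{\frac{\sigma^2}{s^2}} = \frac{z_k}{\sqrt{s^2/\sigma^2}} = \frac{z_k}{\sqrt{\frac{e'e/(N-K)}{\sigma^2}}} = \frac{z_k}{\sqrt{q/(N-K)}}. \]
It remains to be shown that 1) $q \sim \chi^2(N - K)$, and 2) $q$ and $z_k$ are independent.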
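1. Remember $e = M\varepsilon$, where $M$ has rank $N - K$. Then
\[ q = \frac{e'e}{\sigma^2} = \frac{\varepsilon'}{\sigma} M \frac{\varepsilon}{\sigma} \sim \chi^2(N - K), \]
since $\varepsilon/\sigma \sim N(0, I_N)$.
This uses a result (see footnote 6) on the distribution of the quadratic form $x'Ax$, where $x \sim N(0, I)$ and $A$
is symmetric and idempotent of rank $r \leq N$: $x'Ax \sim \chi^2(r)$.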
2. $q$ and $z_k$ are independent, since $e = M\varepsilon$ and $\hat\beta - \beta = (X'X)^{-1}X'\varepsilon$ are independent:
\[ E[(\hat\beta - \beta)e'] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M] = \sigma^2 E[(X'X)^{-1}X'M] = 0, \]
because $X'M = 0$.
Here, we have shown that $Cov(\hat\beta - \beta, e) = 0$. But, since $\hat\beta$ and $e$ are jointly normally distributed,
showing that they are uncorrelated is equivalent to showing that they are independent.
The t-distribution is available in tables and in computer programs. Suppose that $N - K = 30$. We
have $Prob(t(30) > 2.04) = .025$ and, since the t-distribution is symmetric about zero,
\[ Prob(|t(30)| \leq 2.04) = .95. \]
Then, by the Proposition above,
\[ Prob\Big(-2.04 \leq \frac{\hat\beta_k - \beta_k}{se(\hat\beta_k)} \leq 2.04\Big) = .95 \]
and so
\[ Prob\big(\hat\beta_k - 2.04\, se(\hat\beta_k) \leq \beta_k \leq \hat\beta_k + 2.04\, se(\hat\beta_k)\big) = .95. \]
This is an unconditional probability statement that can also be written as
\[ Prob\big(\beta_k \in [\hat\beta_k \pm 2.04\, se(\hat\beta_k)]\big) = .95. \]
The interval $[\hat\beta_k \pm 2.04\, se(\hat\beta_k)]$ is usually called the 95% confidence interval. It provides a summary
of the statistical uncertainty in our estimate of $\beta_k$.
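The coverage statement can be checked by simulation under the normal linear model. The sketch below uses $N - K = 30$ and the critical value 2.04 quoted in the text; the design matrix, the coefficients, and the number of replications are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K, reps = 32, 2, 20000                 # N - K = 30, matching the text
t_crit = 2.04                             # 97.5th percentile of t(30), from the text
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])               # hypothetical true coefficients
XtX_inv = np.linalg.inv(X.T @ X)

Y = (X @ beta)[:, None] + rng.normal(size=(N, reps))  # one sample per column
beta_hat = XtX_inv @ X.T @ Y                          # K x reps
resid = Y - X @ beta_hat
s2 = np.sum(resid**2, axis=0) / (N - K)               # s^2 for each sample
se_slope = np.sqrt(s2 * XtX_inv[1, 1])                # se of the slope estimate

lower = beta_hat[1] - t_crit * se_slope
upper = beta_hat[1] + t_crit * se_slope
coverage = np.mean((lower <= beta[1]) & (beta[1] <= upper))
print(coverage)                                       # should be near 0.95
```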
Footnote 6: Optional proof to show this: a symmetric idempotent matrix $A$ of rank $r$ can be written as $Q'AQ = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}$, where $Q$ is an
orthogonal matrix (inverse equals transpose) of eigenvectors. Let $y = Q'x$, so that $x = Qy$. Then $Ey = 0$ and
$var(y) = E[yy'] = E(Q'xx'Q) = Q'IQ = I$, so the $y$'s are independent standard normals and so
\[ x'Ax = y'Q'AQy = y_1^2 + \ldots + y_r^2, \]
which gives the result.
As N −K increases from 30 to infinity, the 97.5 percentile of the t-distribution decreases
from 2.04 to a limiting value of 1.96 (which is the 97.5 percentile of a standard normal distribution).
4.1 Confidence Ellipse
Again, we have
\[ \hat\beta = (X'X)^{-1}X'y \sim N(\beta, \sigma^2 (X'X)^{-1}). \]
We shall obtain a confidence region for two or more linear combinations of the coefficients; more
generally, this leads to confidence regions for linear functions of $\beta$. Let $R$ be $h \times K$, so that
$R\beta$ is $h \times 1$ and
\[ R\hat\beta \sim N(R\beta, \sigma^2 R(X'X)^{-1}R'). \]
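Claim 4.2. We have
\[ F = (R\hat\beta - R\beta)'[\widehat{Var}(R\hat\beta)]^{-1}(R\hat\beta - R\beta)/h \sim F(h, N - K). \]
To prove the above, we start with the definition of the F distribution. With $\widehat{Var}(R\hat\beta) = s^2 R(X'X)^{-1}R'$, the above can be written as
\[ F = \frac{(R\hat\beta - R\beta)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - R\beta)/h}{s^2}. \]
We know that $s^2 = \frac{e'e}{N-K}$ and the distribution of $R\hat\beta$ is $R\hat\beta \sim N(R\beta, \sigma^2 R(X'X)^{-1}R')$.
Hence, dividing numerator and denominator by $\sigma^2$,
\[ F = \frac{w/h}{q/(N - K)}. \]
To find the distribution of the test statistic, it remains to be shown that:
1. $w = (R\hat\beta - R\beta)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat\beta - R\beta) \sim \chi^2(h)$,
2. $q \sim \chi^2(N - K)$ (done earlier),
3. $w, q$ are independent (here you can show that ...)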
The above leads to a confidence region for Rβ.
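For example, suppose $h = 2$ with
\[ R = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \end{pmatrix} \quad \text{and} \quad R\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}. \]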
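Suppose that $N - K = 30$. The F-distribution is available in tables and computer programs.
We have
\[ Prob(F(2, 30) > 3.32) = 0.05. \]
This means that
\[ Prob\left( \left[ \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} - \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \right]' \left[ \widehat{Var}\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} \right]^{-1} \left[ \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} - \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \right] / 2 \leq 3.32 \right) = .95. \]
The confidence region consists of the values for $(\beta_1, \beta_2)$ that satisfy the inequality above.
This gives the interior of an ellipse which is centered at the least-squares values $(\hat\beta_1, \hat\beta_2)$. Goldberger
(1991) is a good reference for the normal linear model.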
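Membership in the confidence region can be checked point by point. The sketch below uses $N - K = 30$, $h = 2$, and the critical value 3.32 quoted in the text; the simulated data, the coefficient values, and the helper name `in_region` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
N, K, h = 33, 3, 2                        # N - K = 30, matching the text
f_crit = 3.32                             # 95th percentile of F(2, 30), from the text
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 0.5, -0.5])         # hypothetical true coefficients
Y = X @ beta + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
s2 = np.sum((Y - X @ beta_hat) ** 2) / (N - K)

R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])           # selects the first two coefficients
var_R = s2 * R @ XtX_inv @ R.T            # estimated Var(R beta_hat)

def in_region(candidate):
    """True if the 2-vector `candidate` lies inside the 95% confidence ellipse."""
    d = R @ beta_hat - candidate
    F = d @ np.linalg.solve(var_R, d) / h
    return F <= f_crit

print(in_region(R @ beta_hat))            # the center of the ellipse is always inside
print(in_region(R @ beta))                # inside in about 95% of samples
```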
5 Generalized Least Squares
In this section we assume that $\Omega$ is known.
Aitken's notation (see footnote 7): $V(Y) = \sigma^2\Omega$. Define
• $H$ such that $\Omega^{-1} = H'H$ (i.e., $H = \Omega^{-1/2}$, or $\Omega = (H'H)^{-1} = H^{-1}(H')^{-1}$),
• $Y^* = HY$,
• $X^* = HX$.
Then
\[ E(Y^*) = E(HY) = HX\beta = X^*\beta, \]
\[ var(Y^*) = H\, var(Y)\, H' = H\sigma^2\Omega H' = \sigma^2 HH^{-1}(H')^{-1}H' = \sigma^2 I_N. \]
Therefore the classical assumptions are satisfied and we can apply the Gauss-Markov theorem to
\[ Y^* = X^*\beta + u^*. \]
We get the BLUE estimator
\[ \hat\beta_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'H'HX)^{-1}X'H'HY = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y. \]
Footnote 7: Usually, and later, $V(Y) = \Omega$; factoring out $\sigma^2$ is without loss of generality and just makes the expression look more similar to the homoscedastic variance.
If we assume additionally that (this is the generalized normal regression model)
\[ Y \sim N(X\beta, \sigma^2\Omega), \]
we get that
\[ \hat\beta_{GLS} \sim N\big(\beta, \sigma^2 (X'\Omega^{-1}X)^{-1}\big), \]
and
\[ \frac{N - K}{\sigma^2} s^2_{GLS} = \frac{1}{\sigma^2}(Y - X\hat\beta_{GLS})'\Omega^{-1}(Y - X\hat\beta_{GLS}) \sim \chi^2(N - K). \]
Remarks
1. Alternatively we can say that
\[ \hat\beta_{GLS} = \arg\min_b (Y - Xb)'\Omega^{-1}(Y - Xb), \]
i.e., it minimizes a weighted sum of squares instead of the plain sum of squares.
In the language of norms: LS minimizes the squared norm $\|v\|^2 = v'v$, and GLS minimizes $\|v\|_\Omega^2 = v'\Omega^{-1}v$, with $v = Y - Xb$.
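The two routes to the GLS estimator, the direct formula and least squares on the transformed data, can be compared numerically. In the sketch below $\Omega$ is a known diagonal matrix invented for the example, and $H$ is taken from a Cholesky factorization of $\Omega^{-1}$, which is one assumed choice satisfying $H'H = \Omega^{-1}$ (the symmetric square root $\Omega^{-1/2}$ would work as well).

```python
import numpy as np

rng = np.random.default_rng(9)
N = 60
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])               # hypothetical true coefficients

# Hypothetical known Omega: diagonal with unequal variances (heteroscedasticity).
omega_diag = rng.uniform(0.5, 4.0, size=N)
Omega = np.diag(omega_diag)
Y = X @ beta + np.sqrt(omega_diag) * rng.normal(size=N)

# Direct formula: (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.
Oinv_X = np.linalg.solve(Omega, X)        # Omega^{-1} X
beta_gls = np.linalg.solve(X.T @ Oinv_X, Oinv_X.T @ Y)

# Transformation route: choose H with H'H = Omega^{-1}, then run OLS on (HX, HY).
Lc = np.linalg.cholesky(np.linalg.inv(Omega))   # Omega^{-1} = Lc Lc'
H = Lc.T                                        # so H'H = Lc Lc' = Omega^{-1}
beta_transformed, *_ = np.linalg.lstsq(H @ X, H @ Y, rcond=None)

print(beta_gls)
print(beta_transformed)
```

Both printed vectors should coincide up to rounding, since the transformed regression has the same normal equations as the direct formula.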
In practice, $\Omega$ is an unknown $N \times N$ matrix with $N(N+1)/2$ distinct elements, which is clearly impossible to estimate
without restrictions from data with $N$ observations. To deal with heteroscedasticity in practice, we need some
way to model the covariance matrix $\Omega$. This will be case specific.