Harvard Economics Ec 1126, Tamer - September 8, 2015

Note 3: Least Squares and the 2-Variable Model

1 Two Variable Model

Now, we are back to the example where we are trying to predict the earnings of an individual Y using that individual's education X as a predictor. Or, in other words, does education have any predictive power for earnings? So, using a linear predictor under squared loss, our predictor for Y given X is

Ŷ = β0 + β1X

and again, those parameters (which determine or operationalize this prediction problem) are determined by minimizing the mean square error:

min_{β0,β1} E(Y − Ŷ)² = min_{β0,β1} E(Y − β0 − β1X)²   (1.1)

This is a typical minimization problem that we will see in this class. So, we can try to handle it in a geometric way (for now) that applies in many setups and is also insightful. We will use a more brute-force approach (using standard calculus) later. The key is to use an orthogonal projection in a vector space with an inner product. Here the inner product is

< Y, X > = E(Y X)

The associated norm is ‖Y‖ = < Y, Y >^{1/2}. Then, our problem above becomes:

min_{β0,β1} ‖Y − Ŷ‖² = min_{β0,β1} ‖Y − β0 − β1X‖²

So, here you are trying to minimize the Euclidean distance between Y and the (linear) space spanned by (1, X). The solution is the orthogonal projection of Y on 1 and X (as any such predictor is a linear combination of 1 and X). See the Figure to the right: we have a vector y and we want to find, in the space spanned by the x's, the vector that is "closest" to y. This is exactly the orthogonal projection.
So, if you think of the constant random variable 1 as X0, then this orthogonal projection requires that the prediction error (Y − Ŷ) (also called a residual) be orthogonal to both X0 and X:

< Y − Ŷ, X0 > = 0
< Y − Ŷ, X > = 0

Geometrically again, what you are doing is taking the orthogonal projection of the vector Y on the space spanned by (linear combinations of) (X0, X). This means that

Y − Ŷ ⊥ X0,   Y − Ŷ ⊥ X

In particular, we have

< Y − Ŷ, X0 > = < Y − β0 − β1X, X0 > = < Y, X0 > − β0< X0, X0 > − β1< X, X0 > = 0
< Y − Ŷ, X > = < Y − β0 − β1X, X > = < Y, X > − β0< X0, X > − β1< X, X > = 0

The projection here is special in that it is done using the distance measure we defined above. Replacing < ·, · > with E(··) in the above, we get

E(Y) − β0 − β1E(X) = 0
E(Y X) − β0E(X) − β1E(X²) = 0

This is a two-equation linear system with two unknowns (you can also verify that these two equations are the first order conditions of the minimization problem in (1.1) above), which can be solved to give us:

β1 = [E(Y X) − E(Y)E(X)] / [E(X²) − E(X)E(X)],   β0 = E(Y) − β1E(X)   (1.2)

Notice, the numerator in the expression for β1 is the covariance between Y and X,

Cov(Y, X) = E(Y X) − E(Y)E(X)

and the denominator is the variance of X:

Var(X) = E(X²) − E(X)E(X)

I.e.,

β1 = Cov(Y, X) / Var(X)   (1.3)

Again, the notation for this linear prediction is

E*[Y|X] = β0 + β1X   (1.4)

The above is a population version of the best linear predictor. This means that β0 and β1 are not directly observed in the data (they are functions of the joint distribution of Y and X).
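As a quick numerical sanity check, the population formulas (1.2)-(1.3) can be approximated on a large simulated sample; the data-generating process below is purely hypothetical and chosen only so the true coefficients are known:

```python
import numpy as np

# Hypothetical simulated "population": earnings linear in education plus noise,
# with true intercept 1.0 and true slope 0.5.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(12.0, 2.0, n)              # years of education
Y = 1.0 + 0.5 * X + rng.normal(0, 1, n)

# Equations (1.2)-(1.3): beta1 = Cov(Y, X) / Var(X), beta0 = E(Y) - beta1 E(X),
# with expectations replaced by averages over the large simulated sample.
beta1 = (np.mean(Y * X) - Y.mean() * X.mean()) / (np.mean(X**2) - X.mean()**2)
beta0 = Y.mean() - beta1 * X.mean()
print(beta0, beta1)   # close to the true values 1.0 and 0.5
```

With a million draws the sample moments are very close to the population moments, so the computed (beta0, beta1) sit essentially on top of the true values.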
A Non-Geometry Approach (using calculus)¹: The problem we are asked to solve in (1.1) is to minimize

min_{β0,β1} E(Y − Ŷ)² = min_{β0,β1} E(Y − β0 − β1X)²

To do this, we can take first order conditions (i.e., derivatives with respect to β0 and β1), ∂_{β0,β1} E(Y − β0 − β1X)², which give us:

2E[(Y − β0 − β1X)(−1)] = 0
2E[(Y − β0 − β1X)(−X)] = 0   (1.5)

(here we implicitly interchanged ∂ with E, which is allowed...). Solving for β0 and β1 from above (2 equations and 2 unknowns), we get the same solutions we got in (1.2) and (1.3) (try this).

¹This may be easier to follow for some than the geometric approach, and though it provides less intuition it does work.

2 Least squares

The sample analog of the above population problem uses a sample of size n to construct an estimator of β0 and β1. This can be done via least squares, which is the sample counterpart of (1.1) above.

Example 2.1. To get more intuition about sample vs. population, consider this example. The average GPA for all undergraduate students here is μ_GPA, a population quantity. This is a population quantity because it relates to the population of interest, namely ALL undergraduate students (the registrar surely has access to this quantity). Now, if we were to pick 10 students at random and average their GPAs, that would give us μ̂_GPA = (GPA1 + ... + GPA10)/10. So, μ̂_GPA is a sample quantity. It is usually called an estimator of μ_GPA, and the role of statistics is to tell us how close μ̂_GPA is to μ_GPA. The reason why we do not go after population quantities directly (and spare ourselves all of statistics) is that it is too costly (often impossible) to get answers from every member of the population. Think about exit polls, where pollsters ask a few voters during a short period of voting and then try to use those data (the sample) to predict whether a given candidate will win.
There it is basically too costly to ask every single voter how they voted (many will not even answer...). Now, we want an estimate of the regression line using sample data; we mimic the same arguments as above, but with a sample, to get the following. First, let the sample of size n be in the form of the matrices

y = (y1, ..., yn)′,   x0 = (1, ..., 1)′,   x = (x1, ..., xn)′

The fitted value for the i-th observation is ŷi = b0 + b1xi, and the objective again is choosing (b0, b1) to minimize the average of squared residuals:

min_{b0,b1} (1/n) Σᵢ (yi − ŷi)²   (2.1)

Notice the similarity between this objective function and the one in (1.1) above: the former uses the sample, and so the minimization problem is feasible, i.e., can be done with a computer (or calculator), while the latter involves minimizing an object that involves the operator E and hence requires knowledge of the joint distribution of (Y, X), which is generally unobserved. Though our objects of interest are the solutions (β0, β1) to (1.1) above, the feasible solutions to (2.1) will be our least squares estimators of these objects. We shall study the statistical properties of such estimators later. First, to define the estimators, we mimic the approach above. The inner product between vectors y and x is

< y, x > = (1/n) Σᵢ yi xi

We now have the minimum norm problem

min_{b0,b1} ‖y − b0x0 − b1x‖²

and the solution again takes the orthogonal projection of y on x0 and x.
This will require that the prediction error (y − ŷ) is orthogonal to x and x0:

< y − ŷ, x0 > = 0
< y − ŷ, x > = 0

This means that y − ŷ ⊥ x0 and y − ŷ ⊥ x. In particular,

< y − ŷ, x0 > = < y − b0 − b1x, x0 > = < y, x0 > − b0< x0, x0 > − b1< x, x0 > = 0
< y − ŷ, x > = < y − b0 − b1x, x > = < y, x > − b0< x0, x > − b1< x, x > = 0

Taking the sample analogue,

ȳ − b0 − b1x̄ = 0
(1/n) Σᵢ yi xi − b0x̄ − b1 (1/n) Σᵢ xi² = 0

where ȳ = (1/n) Σᵢ yi and x̄ = (1/n) Σᵢ xi. The two linear equations for the two unknowns b0 and b1 can be solved to give

b1 = [(1/n) Σᵢ yi xi − ȳ x̄] / [(1/n) Σᵢ xi² − x̄ x̄],   b0 = ȳ − b1x̄

2.1 Goodness of fit

Note that

0 ≤ ‖Y − E*(Y|1, X)‖² / ‖Y − E*(Y|1)‖² ≤ 1

This ratio is at most 1 because using X to predict Y cannot increase the mean square error, since β1 is allowed to be zero (the linear predictor using only a constant is E*(Y|1) = E(Y)). We can then define the measure of goodness of fit in the population as

R²_pop = 1 − ‖Y − E*(Y|1, X)‖² / ‖Y − E*(Y|1)‖²

This measure is

• scale free: it is not affected by the way we measure Y (multiplying Y by 10 does not change it);
• easy to interpret, since 0 ≤ R²_pop ≤ 1 with higher values implying better prediction accuracy.

The sample counterpart of this population object is

R² = 1 − ‖y − (ŷ|1, x)‖² / ‖y − (ŷ|1)‖² = 1 − [(1/n) Σᵢ (yi − b0 − b1xi)²] / [(1/n) Σᵢ (yi − ȳ)²]

(here the least squares fit with only a constant is (ŷ|1) = ȳ). So, as you can see from the formula, R² intuitively gives you an indication of how much of the variation in Y is explained by variation in X.²

2.1.1 Example

Mincer [2] uses data from the 1960 census on annual earnings in 1959. With y = log(earnings) and s = years of schooling completed, he reports the least squares fit:

ŷ = 7.58 + .07s,   R² = .067

²We can similarly define an R² for K regressors, which would be 1 − ‖y − (ŷ|1, x1, ..., xK)‖² / ‖y − (ŷ|1)‖².
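The sample formulas for (b0, b1) and R² above can be sketched in a few lines. This is a minimal sketch with made-up numbers, not Mincer's data, cross-checked against NumPy's own least-squares fit:

```python
import numpy as np

# Small illustrative sample (hypothetical numbers: education x, log-earnings y).
x = np.array([10., 12., 12., 14., 16., 16., 18.])
y = np.array([2.1, 2.4, 2.3, 2.9, 3.1, 3.4, 3.6])

# b1 and b0 from the two normal equations solved in the text.
b1 = (np.mean(y * x) - y.mean() * x.mean()) / (np.mean(x**2) - x.mean()**2)
b0 = y.mean() - b1 * x.mean()

# Sample R^2: one minus the ratio of residual to total mean squares.
resid = y - b0 - b1 * x
r2 = 1 - np.mean(resid**2) / np.mean((y - y.mean())**2)

# Cross-check against NumPy's degree-1 polynomial fit ([slope, intercept]).
check = np.polyfit(x, y, 1)
print(b0, b1, r2)
```

The two fits agree to floating-point precision, since both solve the same normal equations.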
Notice here that the predictive power of education in terms of explaining the variation in earnings is very low. This is not atypical of cross-sectional data, since observables do not usually explain much of the variation in wages (there are many other things that could help explain earnings other than education).

3 Omitted Variables

Consider an individual chosen at random from a population. Let Y denote her earnings, and let X1 and X2 denote her education and her score on a test administered when she was in the third grade. The random variables (Y, X1, X2) have a joint distribution. There is a (population) linear predictor for Y given X1 and X2 (and a constant):

E*(Y|1, X1, X2) = β0 + β1X1 + β2X2

This is usually called (by Goldberger, for example) the long regression. In addition, there is a (population) linear predictor for Y given just X1 (and a constant):

E*(Y|1, X1) = α0 + α1X1

This is called the short regression. It is useful to relate the long predictor coefficients to the short ones. In particular, this is relevant if in fact we do not have data on, say, X2 and are interested in the coefficient β1. Doing so requires the auxiliary linear predictor of X2 given X1 (and a constant):

E*(X2|1, X1) = γ0 + γ1X1

This is the auxiliary regression. Let U denote the prediction error using the long predictor: U ≡ Y − E*(Y|1, X1, X2), so that

Y = β0 + β1X1 + β2X2 + U   (3.1)

It is useful to pause here and get acquainted with U and how it is implicitly defined. Its properties (and its role) help in understanding (and using) the linear model.

Example 3.1. The E* operator is linear, i.e., E*[Y1 + Y2|1, X] = E*[Y1|1, X] + E*[Y2|1, X]. How do we show this? One way is to start with the formal definition of E* given in (1.4) above (this is the algebraic way).
Using that, we have E*[Y1 + Y2|1, X] = L1 + L2X, where L1 and L2 solve

E[Y1 + Y2 − L1 − L2X] = 0
E[X(Y1 + Y2 − L1 − L2X)] = 0   (3.2)

whereas E*[Y1|1, X] = L1¹ + L2¹X and E*[Y2|1, X] = L1² + L2²X, and these solve the analogous equations:

E[Y1 − L1¹ − L2¹X] = 0
E[X(Y1 − L1¹ − L2¹X)] = 0
E[Y2 − L1² − L2²X] = 0
E[X(Y2 − L1² − L2²X)] = 0   (3.3)

Summing the first and third equations in (3.3), we get

E[Y1 + Y2 − L1¹ − L1² − (L2¹ + L2²)X] = 0

while summing the second and fourth equations in (3.3),

E[X(Y1 + Y2 − L1¹ − L1² − (L2¹ + L2²)X)] = 0

Comparing these two equations to the ones in (3.2), it must be that L1¹ + L1² = L1 and L2¹ + L2² = L2, which implies that

E*[Y1 + Y2|1, X] = E*[Y1|1, X] + E*[Y2|1, X]

(You may be able to show this using geometric arguments: the projection of a sum of vectors Y1 + Y2 is the sum of their projections. Think of this geometrically!)

Because U is a prediction error, it is orthogonal to the variables used in the predictor:

U ⊥ 1,   U ⊥ X1,   U ⊥ X2

This implies that (you can also formally show this)

E*(U|1, X1) = 0

Using equation (3.1) and the linearity of E* (shown in the example above), we get

E*(Y|1, X1) = E*(β0 + β1X1 + β2X2 + U|1, X1)
            = β0 + β1X1 + β2E*(X2|1, X1) + E*(U|1, X1)
            = β0 + β1X1 + β2(γ0 + γ1X1) + 0
            = (β0 + β2γ0) + (β1 + β2γ1)X1

(Why is E*(X1|1, X1) = X1?) So this means that

α0 = β0 + β2γ0,   α1 = β1 + β2γ1

As you can see, the short regression leads to a coefficient α1 which is generally different from β1. In particular, α1 contains β2 multiplied by γ1, the coefficient from the auxiliary regression. The "bias term" β2γ1 is what is called the omitted variable bias, and the above is the omitted variable formula. This bias is zero if either 1) γ1 = 0 (i.e., Cov(X1, X2) = 0) or 2) β2 = 0. Note: both α1 and β1 are well defined, but they answer different questions.
α1 is part of the linear predictor of Y when you only use X1, while β1 is the coefficient on X1 when you also use X2 to form your prediction. Generally, this omitted variable formula is important and useful.

3.1 Least squares version

Of course, the above (population) result also holds for the sample least squares prediction problem, where one substitutes the population inner product with its sample counterpart, i.e., replaces E(XY) with (1/n) Σᵢ yi xi. Doing this, we get

a0 = b0 + b2c0,   a1 = b1 + b2c1

where

ŷi|1, x1i, x2i = b0 + b1x1i + b2x2i
ŷi|1, x1i = a0 + a1x1i
x̂2i|1, x1i = c0 + c1x1i

3.2 Example ctd'

Mincer (1974) has a discussion of omitted variable bias (pages 139-140). He views earnings as related to the individual's total human capital stock, including original or initial components. He views ability as such an initial component, and argues that including a measure of early ability would lead to a drop in the coefficient on schooling, because the coefficient on ability would be positive and there is a positive association between ability and schooling. For numerical magnitudes, he cites Griliches and Mason [1], who use a sample of post-World War II veterans of the U.S. military, contacted by the Bureau of the Census in a 1964 Current Population Survey (CPS). The military records contain individual scores on the Armed Forces Qualification Test (AFQT), which Griliches and Mason use in lieu of standard civilian mental ability (IQ) tests. A problem is that the AFQT is not a measure of initial ability, in that it is administered just prior to entering the military. This problem is addressed by splitting total years of schooling (ST) into schooling before the military (SB) and schooling after the military (SI). Then AFQT can be regarded as an early test relative to the schooling increment SI, which came after the test was taken.
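The sample identities a0 = b0 + b2c0 and a1 = b1 + b2c1 from Section 3.1 hold exactly in any data set, which can be checked numerically. The simulated data below are entirely hypothetical; only the identity matters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)                       # e.g. education
x2 = 0.8 * x1 + rng.normal(size=n)            # e.g. test score, correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Long regression of y on (1, x1, x2).
X_long = np.column_stack([np.ones(n), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short regression of y on (1, x1) and auxiliary regression of x2 on (1, x1).
X_short = np.column_stack([np.ones(n), x1])
a0, a1 = np.linalg.lstsq(X_short, y, rcond=None)[0]
c0, c1 = np.linalg.lstsq(X_short, x2, rcond=None)[0]

print(a1, b1 + b2 * c1)   # identical up to floating point
```

Note the identity does not rely on the data-generating process at all: it is an algebraic property of the three least-squares fits.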
Here are some results from Table 3:

ŷ = .0508 ST + ...
ŷ = .0433 ST + .00150 AFQT + ...
ŷ = .0502 SB + .0528 SI + ...
ŷ = .0418 SB + .0475 SI + .00154 AFQT + ...

This is for a subsample of veterans, age 21-34 in 1964, with y = log of usual weekly earnings. The AFQT score is measured as a percentile, from 0 to 100. These least-squares fits also include a constant, age, and length of time served in the military. Griliches and Mason focus on the coefficient on SI and note that it is reduced by including AFQT, but not by very much.

4 Functional Form

To recap, the linear predictor problem, or best linear predictor, is always a well defined problem. It is very flexible because we can include in the predictor any transformations of the original variables. For example, with Y = earnings and EXP a measure of years of experience, we can set X1 = EXP and X2 = EXP². Then the linear predictor is

E*(Y|1, X1, X2) = β0 + β1X1 + β2X2

and at EXP = c it gives

β0 + β1c + β2c²

This allows, for example, for the predictive effect of experience to be nonlinear. The prediction model also allows for effects of interactions of variables. For example, in addition to EXP suppose we have EDUC, a measure of years of education. We can set X1 = EDUC, X2 = EXP, X3 = EDUC·EXP, and then evaluate

E*(Y|1, X1, X2, X3) = β0 + β1X1 + β2X2 + β3X3

4.1 Conditional Expectation

Suppose that we start with a single original variable Z and develop linear predictors of Y based on Z that are increasingly flexible. To be specific, consider using a polynomial of order M:

E*(Y|1, Z, Z², ..., Z^M)

The expectation of the squared prediction error cannot increase as M increases, because the coefficients on the additional terms are allowed to be 0. So

E[(Y − E*(Y|1, Z, Z², ..., Z^M))²]

is nonincreasing in M and must approach a limit since it is nonnegative.
We shall assume that the linear predictor itself approaches a limit, and we shall identify this limit with the conditional expectation E(Y|Z):

E(Y|Z) = lim_{M→∞} E*(Y|1, Z, Z², ..., Z^M)

This limit is in a mean square sense:

lim_{M→∞} E[E(Y|Z) − E*(Y|1, Z, Z², ..., Z^M)]² = 0

What this means is that there is a precise way in which a linear predictor can approximate a conditional expectation. This is useful since a linear predictor has a sample counterpart and can be constructed. This turns out to be exact in the case of discrete regressors (see below). Let V be notation for the prediction error:

V ≡ Y − E(Y|Z)

Then V is orthogonal to any power of Z:

< V, Z^j > = E(V Z^j) = 0   (j = 0, 1, 2, ...)

Because general functions of Z can be approximated (in mean square) by polynomials in Z, we have

< V, g(Z) > = E[V g(Z)] = 0

for arbitrary functions g(·). In the population, we shall generally prefer to work with the conditional expectation (the conditional expectation is, after all, the solution to the best prediction problem). The linear predictor remains useful, however, because it has a direct sample counterpart: the sample linear predictor or least-squares fit. We shall use a (population) linear predictor to approximate the conditional expectation, and then use a least squares fit to estimate the linear predictor. The conditional expectation at a particular value of Z is denoted by

r(z) = E(Y|Z = z)

This function is called the (mean) regression function. The regression function evaluated at the random variable Z is the conditional expectation: r(Z) = E(Y|Z). Because the regression function may be complicated, we may want to approximate it by a simpler function that would be easier to estimate. For example, E*[r(Z)|1, Z] is a minimum mean-square error approximation that uses a linear function of Z.
This turns out to be the same as the linear predictor of Y given Z:

Claim 4.1. E*[r(Z)|1, Z] = E*(Y|1, Z) = β0 + β1Z.

Proof: Let V denote the prediction error:

V ≡ Y − E(Y|Z) = Y − r(Z)   (4.1)

Then V is orthogonal to any function of Z: E[V g(Z)] = 0, and so is orthogonal to 1 and to Z:

E(V) = E(V Z) = 0

This implies that the linear predictor of V given 1, Z is 0, and applying that to (4.1) gives

0 = E*(V|1, Z) = E*(Y|1, Z) − E*[r(Z)|1, Z] □

This result is more general in that it holds for more variables. For example, the conditional expectation of Y given two (or more) variables Z1 and Z2 can also be viewed as a limit of increasingly flexible linear predictors:

E(Y|Z1, Z2) = lim_{M→∞} E*(Y|1, Z1, Z2, Z1², Z1Z2, Z2², ..., Z1Z2^{M−1}, Z2^M)

The regression function is defined as

r(z1, z2) ≡ E(Y|Z1 = z1, Z2 = z2)

As above, we can use the linear predictor to approximate the regression function. For example, the proof of Claim 4.1 can be used to show that

E*[r(Z1, Z2)|1, Z1, Z2, Z1², Z1Z2, Z2²] = E*(Y|1, Z1, Z2, Z1², Z1Z2, Z2²)

We shall conclude this section by deriving the iterated expectations formula and then using it to obtain an omitted variables formula.

Claim 4.2 (Iterated Expectations). E[E(Y|Z1, Z2)|Z1] = E(Y|Z1), or equivalently, E[r(Z1, Z2)|Z1] = r(Z1).

Proof. Let V denote the prediction error:

V ≡ Y − E(Y|Z1, Z2) = Y − r(Z1, Z2)   (4.2)

Then V is orthogonal to any function of (Z1, Z2): E[V g(Z1, Z2)] = 0, and so, in particular, it is orthogonal to any function of Z1: E[V g(Z1)] = 0. This implies that E[V|Z1] = 0, and substituting that in (4.2) above gives

0 = E[V|Z1] = E[Y|Z1] − E[r(Z1, Z2)|Z1] □

The law of iterated expectations can also be shown in other ways, using the definition of a conditional expectation. It is a useful formula.
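As a sanity check, the iterated expectations formula holds exactly as a sample identity when we replace conditional expectations with subsample means. A minimal sketch on simulated discrete data (hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
z1 = rng.integers(0, 2, n)
z2 = rng.integers(0, 2, n)
y = 1.0 + z1 + 2.0 * z2 + z1 * z2 + rng.normal(size=n)

# E(Y | Z1 = 1) computed directly as a subsample mean ...
direct = y[z1 == 1].mean()

# ... and via iterated expectations: average the inner means E(Y | Z1 = 1, Z2 = k)
# over the empirical distribution of Z2 among observations with Z1 = 1.
inner = np.array([y[(z1 == 1) & (z2 == k)].mean() for k in (0, 1)])
weights = np.array([np.mean(z2[z1 == 1] == k) for k in (0, 1)])
iterated = inner @ weights

print(direct, iterated)   # equal: this is exact as a sample identity
```

The equality is pure arithmetic (cell sums divided and recombined), so it holds for any data set, not just this simulated one.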
For example, can you show directly that E[Y|X1] = E[E[Y|X1, X2]|X1]? Now, we can show an omitted variable bias result for regression functions.

Claim 4.3 (Omitted Variable Bias). If E[Y|Z1, Z2] = β0 + β1Z1 + β2Z2 and E[Z2|Z1] = γ0 + γ1Z1, then

E[Y|Z1] = (β0 + β2γ0) + (β1 + β2γ1)Z1

Proof:

E(Y|Z1) = E[E(Y|Z1, Z2)|Z1]
        = E(β0 + β1Z1 + β2Z2|Z1)
        = β0 + β1Z1 + β2E(Z2|Z1)
        = β0 + β1Z1 + β2(γ0 + γ1Z1)
        = (β0 + β2γ0) + (β1 + β2γ1)Z1

Note that here we assume that the regression function for Y on Z1 and Z2 is linear in Z1 and Z2, and that the regression function for Z2 on Z1 is linear in Z1. It then follows that the regression function for Y on Z1 is linear in Z1, and the coefficients are related to the coefficients in the long regression function in the same way as in the linear prediction problem. To recap, if we are ultimately interested in the conditional expectation (for many reasons...), then we can do well by estimating a best linear predictor, and we showed that this best linear predictor can theoretically approximate the unknown conditional expectation (in a precise sense).

5 Discrete Regressors (make things easier)

Suppose that Z1 and Z2 take on only a finite set of values:

Z1 ∈ {λ1, ..., λJ},   Z2 ∈ {δ1, ..., δK}

The objective of this section is to show that in this case it is possible to empirically estimate the regression function. Construct the following dummy variables:

Xjk = 1 if Z1 = λj and Z2 = δk, and 0 otherwise.

These Xjk are indicator functions that tell us which cell (Z1, Z2) belongs to:

Xjk = 1(Z1 = λj; Z2 = δk)   (j = 1, ..., J; k = 1, ..., K)

The notation for the indicator function 1[B] is such that this function equals 1 if B is true and 0 otherwise. The key result is provided next.

Claim 5.1. E(Y|Z1, Z2) = E*(Y|X11, ..., XJK)

Proof.
Any function g(Z1, Z2) can be written (when the Z's take finitely many values) as

g(Z1, Z2) = Σ_{j=1}^{J} Σ_{k=1}^{K} γjk Xjk

with γjk = g(λj, δk). So searching over functions g to find the best predictor is equivalent to searching over the coefficients γjk to find the best linear predictor. □

Note this requires that we use a complete set of dummy variables, with one for each value of (Z1, Z2). In this discrete regressor case, there is a concrete form for the notion that the conditional expectation is a limit of increasingly flexible linear predictors. Here the limit is achieved by using a complete set of dummy variables in the linear predictor. There is a sample analog to this result, using least-squares fits. The basic data consist of (yi, zi1, zi2) for each of the i = 1, ..., n members of the sample. Construct the dummy variables (this is done on the computer using the data matrix)

xijk = 1(zi1 = λj; zi2 = δk)

which indicate that the i-th observation in the data corresponds to cell (j, k). Construct the matrices

y = (y1, ..., yn)′,   xjk = (x1jk, ..., xnjk)′   (j = 1, ..., J; k = 1, ..., K)

The coefficients in the least-squares fit are obtained from

min ‖y − Σ_j Σ_k bjk xjk‖²

where the minimization is over {bjk} and the inner product is < y, xjk > = (1/n) Σᵢ yi xijk (recall: xjk is an n-vector of 1's and 0's, where a 1 marks the observations belonging to cell (j, k); so this is a linear regression with JK regressors).

Claim 5.2. blm = Σᵢ yi xilm / Σᵢ xilm   (l = 1, ..., J; m = 1, ..., K)

Proof. (Try doing this on your own first.) By definition of an orthogonal projection, the residuals have to be orthogonal to each of the dummy variables:

< y − Σ_{jk} bjk xjk, xlm > = 0

Also, by construction (a data point cannot belong to two cells!),

< xjk, xlm > = 0 unless the indices are the same.
So, we have

0 = < y − Σ_{jk} bjk xjk, xlm >
  = < y, xlm > − Σ_{jk} bjk < xjk, xlm >
  = < y, xlm > − blm < xlm, xlm >

So,

blm = < y, xlm > / < xlm, xlm > = Σᵢ yi xilm / Σᵢ xilm

This is because xlm² = xlm elementwise (these are vectors of 1's and 0's). □

Note that the numerator above, Σᵢ yi xilm, is the sum of the y's among the observations whose regressors belong to cell (l, m), and the denominator is the number of observations that fall in cell (l, m). So each coefficient blm is a subsample mean, which is a nice and helpful interpretation.

A major use of regression analysis is to measure the effect of one variable holding other variables constant. Consider, for example, the effect on Y of a change from Z1 = c to Z1 = d, holding Z2 constant at Z2 = e. Let θ denote this effect:

θ = E(Y|Z1 = d, Z2 = e) − E(Y|Z1 = c, Z2 = e) = r(d, e) − r(c, e)

This is a predictive effect. It measures how the prediction of Y changes as we change the value of one of the predictor variables, holding constant the value of the other predictor variable. This holding constant of the other regressor is often discussed as a way to control for observed differences. In the case of discrete regressors with a complete set of dummy variables, this predictive effect has a sample analog: the subsample mean of the y's in cell (d, e) minus the subsample mean in cell (c, e) (or, equivalently, the difference between two regression coefficients). The individuals in the first subsample have zi1 = d, and the individuals in the second subsample have zi1 = c. In both subsamples, all individuals have the same value for z2: zi2 = e. So the sense in which z2 is being held constant is clear: all individuals in the comparison of means have the same value for z2. In general there is a different effect θ for each value of Z2, and we may want a way to summarize these effects. This is discussed in the next section.
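Claim 5.2 (each dummy-variable least squares coefficient is a cell mean) can be checked numerically. The data below are simulated and hypothetical; the coincidence of coefficients and cell means is exact:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
z1 = rng.integers(0, 2, n)      # J = 2 values
z2 = rng.integers(0, 3, n)      # K = 3 values
y = 1.0 + z1 + 0.5 * z2 + rng.normal(size=n)

# Complete set of J*K dummies, one per cell (and no separate constant).
cells = [(j, k) for j in range(2) for k in range(3)]
X = np.column_stack([(z1 == j) & (z2 == k) for j, k in cells]).astype(float)
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Subsample mean of y within each cell.
cell_means = np.array([y[(z1 == j) & (z2 == k)].mean() for j, k in cells])
print(b, cell_means)    # coincide: each coefficient is a subsample mean
```

Because the dummy columns are mutually orthogonal, the normal equations decouple cell by cell, exactly as in the proof above.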
5.1 Average Partial Effect

Recall our definition of a regression function:

r(s, t) = E(Y|Z1 = s, Z2 = t)

Consider the predictive effect based on comparing Z1 = c with Z1 = d, with Z2 = t:

r(d, t) − r(c, t)

Instead of reporting a different effect for each value of Z2, we can evaluate the effect at the random variable Z2: r(d, Z2) − r(c, Z2). This gives a random variable, and we can take its expectation:

θ = E[r(d, Z2) − r(c, Z2)]

We shall refer to this as an average partial effect. It is "partial" in the sense of holding Z2 fixed, and the average is taken over all values of Z2 (so, we evaluate this predictive effect for people with the same Z2 and then average over all values of Z2). Once we have an estimate r̂ of the regression function, we can form an estimate of θ by averaging over the sample:

θ̂ = (1/n) Σᵢ [r̂(d, zi2) − r̂(c, zi2)]

It is generally not easy to get an estimate of r from the data. But it is, as we have seen, possible to approximate the conditional expectation by a linear predictor, using a polynomial in Z1, Z2:

E[Y|Z1, Z2] ≈ E*[Y|{Z1^j Z2^k}_{j+k≤M}] = Σ_{j,k: j+k≤M} βjk Z1^j Z2^k

We can then use least squares to obtain estimates bjk of βjk. Then we can use

r̂(c, zi2) = Σ_{j,k: j+k≤M} bjk c^j zi2^k   and   r̂(d, zi2) = Σ_{j,k: j+k≤M} bjk d^j zi2^k

in the formula for θ̂ above. In the case when the regressors take discrete values, so that the approximation to the conditional expectation becomes exact, we can get a direct sample analog to the (mean) regression function r using the above framework for discrete regressors.
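A sketch of the estimator θ̂ based on a quadratic polynomial fit. The data are simulated and hypothetical, with the design chosen so the effect of Z1 genuinely varies with Z2 (so averaging over Z2 matters):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = 1 + z1 + 0.5 * z1 * z2 + rng.normal(size=n)   # effect of z1 varies with z2

# Quadratic (M = 2) polynomial basis in (z1, z2), fit by least squares.
def basis(a, b):
    return np.column_stack([np.ones_like(a), a, b, a**2, a * b, b**2])

bcoef = np.linalg.lstsq(basis(z1, z2), y, rcond=None)[0]

# theta_hat: move z1 from c = 0 to d = 1 and average over the sampled z2 values.
c, d = 0.0, 1.0
theta_hat = np.mean(basis(np.full(n, d), z2) @ bcoef
                    - basis(np.full(n, c), z2) @ bcoef)
print(theta_hat)
```

Under this design the true average partial effect is E[1 + 0.5·Z2] = 1, and θ̂ lands close to it for a sample this size.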
5.2 Example: Polynomial Regressors

Consider the following quadratic polynomial approximation to a regression function:

E[Y|Z1 = s, Z2 = t] ≈ β0 + β1s + β2s² + β3st + β4t + β5t²

Table 5.1 in Mincer (1974) [2] provides the least-squares fit:

ŷ = 4.87 + .255s − .0029s² − .0043ts + .148t − .0018t²

with y = log(earnings), s = years of schooling, and t = years of work experience. The data are from the 1-in-1000 sample of the 1960 census with 1959 annual earnings, and the sample size is n = 31093. The partial predictive effect of four years of college, holding work experience constant at t, is

E(Y|Z1 = 16, Z2 = t) − E(Y|Z1 = 12, Z2 = t) ≈ 4β1 + β2(16² − 12²) + 4β3t

The term "returns to college" is related to the use of log(earnings) and is discussed below.

5.3 LOGS

We stressed above that the linear predictor is flexible because we are free to construct transformations of the original variables. A transformation that is often used is the logarithm:

E*(Y|1, log Z) = β0 + β1 log Z

In order to compare Z = c and Z = d, we simply substitute:

β1 log d − β1 log c = β1 log(d/c)

A useful approximation here is (β1/100)[100 log(d/c)] ≈ (β1/100)[100(d/c − 1)]. With this approximation, we can interpret (β1/100) as the (predictive) effect of a one percent change in Z. Now consider a log transformation of Y:

E*(log Y|1, Z) = β0 + β1Z

We can certainly say that the predicted change in log Y is β1(d − c), and it is often useful to think of 100β1(d − c) as a predicted percentage change in Y. We should note, however, that even if the conditional expectation of log Y is linear, so that E(log Y|Z) = β0 + β1Z, we cannot relate this to the conditional expectation of Y without additional assumptions. To see this, define U so that

U ≡ log Y − E(log Y|Z),   E(U|Z) = 0

Since log Y = β0 + β1Z + U, we have

Y = exp(β0 + β1Z + U) = exp(β0 + β1Z) exp(U)

In general, E(U|Z) = 0 does not imply that E[exp(U)|Z] is a constant.
If we make the additional assumption that U and Z are independent, then E[exp(U)|Z] = E[exp(U)]. In that case,

E(Y|Z = d) / E(Y|Z = c) = exp[β1(d − c)] ≈ 1 + β1(d − c)

and

100 [E(Y|Z = d) / E(Y|Z = c) − 1] ≈ 100 β1(d − c)

6 Least Squares Matrix Notation, etc

Here, we provide useful formulas that exploit the (nice) geometric intuition in the linear model when we have many regressors. We will also derive a K-variate analog of the omitted variable formula. The derivations here are common (and useful) in regression analysis. Consider the linear predictor with a general list of K predictor variables (plus a constant):

E*(Y|1, X1, ..., XK) = β0 + β1X1 + ... + βK XK   (6.1)

As a reminder, the notation E* denotes the best linear predictor of Y, or the best linear approximation of the conditional mean. We are going to develop a formula for a single coefficient, which, for convenience, will be βK. Our result will use the linear predictor of XK given the other predictor variables:

E*(XK|1, X1, ..., X_{K−1}) = γ0 + γ1X1 + ··· + γ_{K−1}X_{K−1}

Define X̃K as the residual (prediction error) from this linear predictor:

X̃K = XK − E*(XK|1, X1, ..., X_{K−1})

This residual "takes out" of XK the linear component that is determined by the rest of the regressors (so, we can think of this residual as the "part" of XK that cannot be linearly predicted by the rest of the regressors). The result is that βK in (6.1) is the coefficient on X̃K in the linear predictor of Y given just X̃K:

Claim 6.1. E*(Y|X̃K) = βK X̃K with βK = E(Y X̃K)/E(X̃K²)

Proof: Substitute XK = γ0 + γ1X1 + ··· + γ_{K−1}X_{K−1} + X̃K into (6.1) to obtain

E*(Y|1, X1, ..., XK) = β0 + β1X1 + ... + β_{K−1}X_{K−1} + βK(γ0 + γ1X1 + ··· + γ_{K−1}X_{K−1} + X̃K)
                     = β̃0 + β̃1X1 + ... + β̃_{K−1}X_{K−1} + βK X̃K   (6.2)

with β̃j = βj + βK γj (j = 0, 1, ..., K − 1). The residual from the problem in (6.1) is orthogonal to 1, X1, ..., XK.
Since X̃K is a linear combination of 1, X1, ..., XK, it must be that X̃K is also orthogonal to the residual Y − E*[Y|1, X1, ..., XK]:

< Y − E*[Y|1, X1, ..., XK], X̃K > = < Y − β̃0 − β̃1X1 − ... − β̃_{K−1}X_{K−1} − βK X̃K, X̃K > = 0

Now, again, since X̃K is itself the residual from a prediction based on 1, X1, ..., X_{K−1}, it is orthogonal to those variables, and the above reduces to

< Y − βK X̃K, X̃K > = < Y, X̃K > − βK < X̃K, X̃K > = 0

So, βK X̃K is the orthogonal projection of Y on X̃K (it is like running a regression with only one regressor, as opposed to K), and we get that

βK = < Y, X̃K > / < X̃K, X̃K > = E(Y X̃K)/E(X̃K²) □

This population result has a sample counterpart. The only difference is that we replace the inner product above with the least squares inner product < y, xj > = (1/n) Σ_{i=1}^{n} yi xij, where

y = (y1, ..., yn)′,   xj = (x1j, ..., xnj)′

6.1 Omitted Variables

This section derives the general version, with K predictor variables, of the omitted variable formula we derived earlier. We shall use the notation (and part of the argument) from the residual regression result above. The short linear predictor is

E*[Y|1, X1, ..., X_{K−1}] = α0 + α1X1 + ... + α_{K−1}X_{K−1}

Claim 6.2. αj = βj + βK γj

Proof. Let U denote the following prediction error: U ≡ Y − E*(Y|1, X1, ..., XK). Now, use equation (6.2) to write

Y = β̃0 + β̃1X1 + ... + β̃_{K−1}X_{K−1} + βK X̃K + U

where β̃j = βj + βK γj. Note that for j = 0, 1, ..., K − 1,

< Y − β̃0 − β̃1X1 − ... − β̃_{K−1}X_{K−1}, Xj > = < βK X̃K + U, Xj > = 0

These orthogonality conditions characterize the short linear predictor, and so αj = β̃j. □

The sample counterpart of this result uses the short least-squares fit:

ŷi|1, xi1, ..., x_{i,K−1} = a0 + a1xi1 + ... + a_{K−1}x_{i,K−1}

Claim 6.3. aj = bj + bK cj   (j = 0, 1, ..., K − 1)

This is a numerical/computational identity that can be checked in the data.

7 Matrix Version of the Least Squares Model

The notation for the least squares model can be put in matrix form, which is helpful. Set up the following (K + 1) × 1 matrices:

X = (X0, X1, ..., XK)′,   β = (β0, β1, ..., βK)′

The linear predictor coefficients βj are determined by the following orthogonality conditions:

< Y − β0 − β1X1 − ... − βK XK, Xj > = 0   (j = 0, 1, ..., K)

So,

E[(Y − X′β)Xj] = E[Xj(Y − X′β)] = 0   (j = 0, 1, ..., K)

We can write these conditions together as

E[X(Y − X′β)] = 0

This gives the system E(XY) − E(XX′)β = 0, which has the solution

β = [E(XX′)]⁻¹ E(XY)   (7.1)

(provided that the (K + 1) × (K + 1) matrix E(XX′) is nonsingular). (You can also derive (7.1) as the first order condition of the minimization problem min_β ‖Y − X′β‖² = min_β E(Y − X′β)².) For the least squares fit, set up the n × 1 matrices

xj = (x1j, ..., xnj)′   (j = 0, 1, ..., K),   y = (y1, ..., yn)′

the (K + 1) × 1 matrix b = (b0, b1, ..., bK)′, and the n × (K + 1) matrix

x = (x0 x1 ... xK)

Though the notation is cumbersome, the operations mirror those for the population regression. In particular, the least-squares coefficients bj are determined by the following orthogonality conditions:

< y − b0x0 − b1x1 − ... − bK xK, xj > = 0   (j = 0, 1, ..., K)

So,

(y − xb)′xj = xj′(y − xb) = 0   (j = 0, 1, ..., K)

We can write all the orthogonality conditions (sometimes called the normal equations) as

x′(y − xb) = 0

This gives a system of (K + 1) equations in (K + 1) unknowns, x′y = x′xb, which has the solution

b = (x′x)⁻¹x′y

(provided that the (K + 1) × (K + 1) matrix x′x is nonsingular³).

References

[1] Zvi Griliches and William M. Mason. Education, income, and ability. The Journal of Political Economy, pages S74-S103, 1972.

[2] Jacob Mincer. Schooling, experience, and earnings.
human behavior & social institutions no. 2. 1974. 3Can this matrix be nonsingular if indeed we had K > n? This framework with K > n is the so-called big data setup. 22 Two Variable Model Least squares Goodness of fit Example Omitted Variables Least squares version Example ctd' Functional Form Conditional Expectation Discrete Regressors (make things easier) Average Partial Effect Example: Polynomial Regressors LOGS Least Squares Matrix Notation, etc Omitted Variables Matrix Version of the Least Squares Model
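The residual regression result has a sample version that is easy to check numerically. The sketch below (simulated data; the sample size, coefficients, and variable names are all made up, and NumPy is assumed) verifies that the coefficient on the last regressor from the long least-squares fit equals the one-regressor projection of y on the residualized regressor x̃K:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)              # correlated regressors
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_long = np.column_stack([ones, x1, x2])

# long fit: b solves the normal equations (x'x) b = x'y
b = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)

# residual regression: regress x2 on (1, x1) and keep the residual x2_tilde
X_short = np.column_stack([ones, x1])
c = np.linalg.solve(X_short.T @ X_short, X_short.T @ x2)
x2_tilde = x2 - X_short @ c

# one-regressor projection of y on x2_tilde: <y, x2_tilde> / <x2_tilde, x2_tilde>
b2_resid = (y @ x2_tilde) / (x2_tilde @ x2_tilde)

assert np.isclose(b[2], b2_resid)
```

The 1/n factors in the least squares inner product cancel in the ratio, so they are omitted.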
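Claim 6.3 (aj = bj + bKcj) is likewise a finite-sample identity, so it holds to machine precision in any data set. A minimal check with simulated data (names and numbers are mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = -0.7 * x1 + rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_long = np.column_stack([ones, x1, x2])
X_short = np.column_stack([ones, x1])

def ols(X, y):
    # least-squares coefficients, solving the normal equations (X'X) b = X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

b = ols(X_long, y)        # long fit: y on (1, x1, x2)
a = ols(X_short, y)       # short fit: y on (1, x1)
c = ols(X_short, x2)      # auxiliary fit: x2 on (1, x1), sample version of the gammas

# omitted variable identity: a_j = b_j + b_2 * c_j for j = 0, 1
assert np.allclose(a, b[:2] + b[2] * c)
```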
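The matrix formula b = (x′x)⁻¹x′y and the orthogonality conditions x′(y − xb) = 0 are two ways of saying the same thing. A small sketch (random design matrix; the dimensions are arbitrary choices of mine) solves the normal equations and confirms that the residual is orthogonal to every column of x:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 3
# n x (K+1) design matrix; the first column is the constant x0 = 1
x = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = rng.normal(size=n)

# solve the normal equations x'x b = x'y (preferred over forming the inverse)
b = np.linalg.solve(x.T @ x, x.T @ y)

# the residual y - xb is orthogonal to each column of x
residual = y - x @ b
assert np.allclose(x.T @ residual, 0.0)
```

Solving the linear system directly is the standard numerical practice; explicitly inverting x′x gives the same b but is less stable.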
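The question in the footnote has a definite answer: x′x cannot be nonsingular when K > n, because rank(x′x) = rank(x) ≤ min(n, K + 1) = n < K + 1. A quick numerical illustration (arbitrary dimensions of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 10                      # more regressors than observations
x = rng.normal(size=(n, K + 1))   # n x (K+1) design matrix

xtx = x.T @ x                     # (K+1) x (K+1), but rank at most n
assert np.linalg.matrix_rank(xtx) <= n
assert np.linalg.matrix_rank(xtx) < K + 1   # singular: b is not unique
```

This is why the K > n ("big data") setup requires something beyond plain least squares, such as regularization.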