FEM11090 WEEK 36 Lecture Notes

We talked last week about econometric models, where

1. we specify a relationship of interest in the population: E[Y|X]
2. we model our ignorance about the relationship: e = Y − E[Y|X]

This week we delve into the mechanics of regression, where

1. we typically specify a linear model: a functional form assumption for E[Y|X]
2. we use the Identification-Estimation-Inference paradigm to make conclusions about E[Y|X]
3. we predict Y via LASSO

1 Linear model

Regression is a simple way of modeling or parameterizing a wide range of conditional expectations. The workhorse model for doing this is the "linear in parameters" model:

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βK XiK + ei

where:

• i denotes unit i in the population. A unit can be a person, firm, state, etc.
• β0, β1, ..., βK are unknown parameters we wish to learn.
• Yi, Xi1, ..., XiK are observable random variables.
• ei is an unobservable random variable (the "disturbance").
  – E[ei | Xi1, ..., XiK] = 0 by our definition of the error term.
  – E[ei] = 0 because, by the law of iterated expectations, E[ei] = E[E[ei | Xi1, ..., XiK]] = E[0] = 0.

Familiar conditional expectation models include:

E[Yi | Xi1, Xi2] = β0 + β1 Xi1 + β2 Xi2    (1)
E[Yi | Xi1, Xi2] = β0 + β1 Xi1 + β2 Xi2 + β3 Xi2^2    (2)
E[Yi | Xi1, Xi2] = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2    (3)

These are all linear-in-parameters models. What do we mean by this? Linear in parameters means that the partial derivative ∂E[Yi | Xi1, Xi2]/∂βk does not depend on any of β0, β1, ..., βk, ..., βK. For example,

• ∂E[Yi | Xi1, Xi2]/∂β0 = 1 and ∂E[Yi | Xi1, Xi2]/∂β2 = Xi2 in all of these models.
• ∂E[Yi | Xi1, Xi2]/∂β1 = Xi1 in all of these models, and ∂E[Yi | Xi1, Xi2]/∂β3 = Xi2^2 in model (2).
• ∂E[Yi | Xi1, Xi2]/∂β3 = Xi1 Xi2 in model (3).

Can you come up with a model that is non-linear in parameters?

Suppose you hypothesize the following relationship between wages and experience. How could you capture the curvature using the linear model?
We could suppose that the population model is given by

Yi = β0 + β1 Xi1 + β2 Xi1^2 + ei

where Yi is the wage, Xi1 is experience, and ei is the error term.

• Note that the population model is still linear in the parameters β0, β1, β2,
• but it is non-linear in experience Xi1.
• That is, the effect of experience depends on experience: ΔYi/ΔXi1 = β1 + 2 β2 Xi1.

In Stata, reg wage exper expersq gives coefficient estimates which are suggestive of a concave relationship between wages and experience. Wage is increasing in experience at lower levels of experience. The marginal effect ΔYi/ΔXi1 = 0.298 − 2(0.0061) exper is lower at higher levels of experience. You can find these estimates in the raw Stata output at the top of the next page. With this information you should be able to determine the turning point, where wages start to decline with experience. See Chapter 6.2 of Wooldridge for details of this model.

In the quadratic case the marginal effect of experience depends on the level of experience. We can also allow marginal effects to depend on other factors via 'interactions' or 'interaction effects'. To see why, consider the following population model for house prices

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2 + ei

where Yi is the price of house i in thousands of dollars, Xi1 is the square footage (square meters in NL), and Xi2 is the number of bedrooms. The parameter β3 tells us how square footage and the number of bedrooms work together to generate house prices. Suppose you run reg price sqrft bdrms sqrft*bdrms in Stata. The estimated marginal effect of the number of bedrooms is

Δ̂Yi/ΔXi2 = β̂2 + β̂3 Xi1 = −35.96 + 0.02 Xi1.

Here −35.96 can be interpreted as the estimated effect of the number of bedrooms when the square footage of the house is 0. Is this sensible? NO. It is a nonsensical out-of-sample interpretation (a 0 square footage house?). But how do we fix this?
One can instead estimate

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 (Xi1 − E[Xi1])(Xi2 − E[Xi2]) + ei

The marginal effect of the number of bedrooms is now

ΔYi/ΔXi2 = β2 + β3 (Xi1 − E[Xi1]).

β2 is now the effect of the number of bedrooms at the mean square footage. Making the adjustment gives the estimates in the Stata output (not reproduced here).

We are going to express the linear model more compactly using matrix notation:

Yi = Xi β + ei

where i denotes person i in the population, and Xi = (1, Xi1, ..., XiK) is a row vector with 1 row and K+1 columns that is specific to individual i. The coefficient vector

β = (β0, β1, ..., βK)^T

is a column vector with K+1 rows and 1 column that is the same for all individuals in the population. Note that their product Xi β is just a compact way of writing β0 + β1 Xi1 + β2 Xi2 + ... + βK XiK.

Example 1. The following row is the Xi for the first observation in the tribal data (not reproduced here). The row is therefore X1.

Here are some matrix rules you should know:

• Matrix addition
  – If A and B each have 3 rows and 2 columns, they can be added: A + B has 3 rows and 2 columns. You cannot add matrices with different dimensions.
• Matrix multiplication
  – If A is 3 by 2, then the matrix product AB exists if B has 2 rows. If B also has 1 column, then AB is 3 by 1.
• Matrix transpose
  – A = ( 1 0 0
          0 1 0 )
    is 2 by 3, so its transpose is the 3 by 2 matrix
    A^T = ( 1 0
            0 1
            0 0 )
• Matrix inversion
  – Only square matrices of full rank are invertible.

More rules and details can be found in Appendix D of Wooldridge. I recommend that you only read what you need.

2 Identification-Estimation-Inference paradigm

Our model for the conditional expectation shows how the vector of variables Xi determines the variable Yi in the population. By the same token, the vector β can be interpreted as population parameters or, equivalently, as parameters that index the joint distribution F(x, y) in the population. We want to be able to:

1. Express this vector in terms of observable objects in our model (Identification).
2. Plug data into these expressions so that we can obtain estimates of β (Estimation).
3.
Use these estimates to conduct hypothesis tests regarding the true values of the parameters being estimated (Inference).

This is the identification-estimation-inference paradigm. It guides econometric analysis in a lot of applied papers, either implicitly or explicitly.

2.1. Identification.

Earlier I said that E[Y], E[X], E[Y^2] (or Var(Y)), E[X^2] (or Var(X)), E[XY] (or Cov(X, Y)), and E[Y|X] are parameters that describe the population of interest. They are all objects we will never truly know. We guess or estimate them using data and rely on mathematical statistics to interpret the estimates. This is all true, but in econometrics we are interested in relationships, and we need to make some assumptions about these parameters to learn about relationships. More precisely, we want to treat E[Y|X] as the unknown parameter and E[Y], E[X], E[Y^2] (or Var(Y)), E[X^2] (or Var(X)), E[XY] as known parameters. With identification we want to show that we can use the known or constructable parameters E[Y], E[X], E[Y^2], E[X^2], E[XY] to recover the unknown parameter E[Y|X].

Our econometric specification of the error term has two direct implications:

E[ei | Xi] = 0    (4)

and, together with the law of iterated expectations,

E[ei] = 0    (5)

These equations are called moment conditions. The left hand sides are called population moments. The right hand side is a condition we impose on these moments. Note that the first moment condition E[ei | Xi] = 0 implies that for each element k of Xi

Cov(ei, Xik) = E[ei Xik] = E[E[ei Xik | Xi]]    (6)
             = E[Xik E[ei | Xi]] = E[Xik · 0] = 0.    (7)

So there are in fact K+1 moment conditions, one for each control and one for the constant term:

E[ei] = 0    (8)
E[ei Xi1] = 0    (9)
E[ei Xi2] = 0    (10)
...    (11)
E[ei XiK] = 0.    (12)

We can use our definition of the error and these moment conditions to express the unknown population parameters β in terms of observable parameters like E[Y], E[X], E[Y^2], E[X^2], E[XY].
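Jumping ahead slightly to the sample analogue, the K+1 moment conditions can be illustrated numerically. This is a minimal sketch, assuming Python with numpy and a made-up data-generating process: solving the K+1 equations at their sample averages forces the residuals to be uncorrelated with the constant and with each regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Hypothetical population: beta0 = 1, beta1 = 2, beta2 = -0.5.
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # K + 1 = 3 columns
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

# Sample analogues of E[ei] = 0 and E[ei Xik] = 0, one per column of X.
moments = X.T @ e_hat / n
```

By construction the vector `moments` is numerically zero: the solution is exactly the β̂ that makes the sample versions of conditions (8)-(12) hold.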
We defined the error as ei = Yi − Xi β. Plugging this into the moment conditions gives:

E[Yi − Xi β] = 0
E[(Yi − Xi β) Xi1] = 0
E[(Yi − Xi β) Xi2] = 0
...
E[(Yi − Xi β) XiK] = 0

This gives us K+1 equations in K+1 unknowns. Under certain conditions this system has only one solution. We then say that the model is just identified: the number of equations just equals the number of unknown parameters. The system can be expressed more compactly as

E[Xi^T (Yi − Xi β)] = 0

where T denotes the transpose of a vector or matrix.

Example 2. Assume K = 1 (the simple regression model):

Yi = β0 + β1 Xi1 + ei

The moment conditions are:

E[(Yi − β0 − β1 Xi1) Xi1] = 0
E[Yi − β0 − β1 Xi1] = 0

which we can rewrite as

E[Yi Xi1] = E[Xi1] β0 + E[Xi1^2] β1
E[Yi] = β0 + E[Xi1] β1

This is a system of 2 equations in the 2 unknown parameters of interest, β0 and β1. We can therefore solve for β0 and β1 in terms of the parameters that we are assumed to know: E[Yi], E[Xi1], E[Xi1 Yi], and E[Xi1^2]. In fact one can show (can you do it?):

• β1 = Cov(Yi, Xi1) / V(Xi1)
• β0 = E[Yi] − β1 E[Xi1]

The unknown population parameters that define the conditional expectation have a clear interpretation:

• β1 tells us how much Y and X1 move together as a fraction of the total movement of X1.
• β0 is the average of Y after removing the effect of X1 at the average of X1.

Example 3. Suppose we are interested in:

Yi = β0 + β1 Xi1 + β2 Xi2 + ei

where Xi2 = 2 Xi1. This model is under-identified. The easiest way to see this is to substitute the expression for Xi2 into our regression. This gives Yi = β0 + (β1 + 2β2) Xi1 + ei. These Xi1 and Xi2 do not let us distinguish between β1 and β2. At best we can identify the sum β1 + 2β2.
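The under-identification in Example 3 shows up concretely as a rank deficiency: with Xi2 = 2 Xi1, the matrix E[Xi^T Xi] is singular, so it cannot be inverted to solve the moment conditions. A sketch, assuming Python with numpy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 2.0 * x1                      # exact linear dependence, as in Example 3
X = np.column_stack([np.ones(n), x1, x2])

# The third column of X'X/n is exactly twice the second column,
# so the matrix has rank 2 rather than full rank 3 and is not invertible.
XtX = X.T @ X / n
rank = np.linalg.matrix_rank(XtX)
```

With rank 2 and 3 unknowns, the system of moment conditions has infinitely many solutions, which is exactly the redundancy made formal in the next paragraph.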
A more formal way to see this uses the moment conditions:

E[(Yi − β0 − β1 Xi1 − β2 Xi2) Xi2] = 0
E[(Yi − β0 − β1 Xi1 − β2 Xi2) Xi1] = 0
E[Yi − β0 − β1 Xi1 − β2 Xi2] = 0

which we can rewrite as

E[(Yi − β0 − β1 Xi1 − 2 β2 Xi1) 2 Xi1] = 0
E[(Yi − β0 − β1 Xi1 − 2 β2 Xi1) Xi1] = 0
E[Yi − β0 − β1 Xi1 − 2 β2 Xi1] = 0

and then as

E[(Yi − β0 − (β1 + 2 β2) Xi1) Xi1] = 0
E[Yi − β0 − (β1 + 2 β2) Xi1] = 0

because the first and second equations are redundant. The redundancy means that we have 2 equations and 3 unknowns (the βs). We need at least 3 equations to identify 3 unknowns.

In the general case,

E[Xi^T (Yi − Xi β)] = 0  ⇔  E[Xi^T Yi] = E[Xi^T Xi] β

Pre-multiplying each side by (E[Xi^T Xi])^{-1} gives the solution

β = E[Xi^T Xi]^{-1} E[Xi^T Yi].

This last expression looks more complicated than it is. Each element of β is simply

βk = Cov(Yi, X̃ik) / V(X̃ik)

where X̃ik is the residual from the regression of Xk on all the other X's.

• X̃ik is the part of Xk not explained by the rest of the X's.
• The interpretation of βk is then: how much Y and the part of Xk not explained by the other X's move together, as a fraction of the total variation/movement in X̃ik.
• This ratio provides the essence of the ceteris paribus interpretation in multiple regression.

Note that we can still get identification if we relax the assumption that E[ei | Xi1, ..., XiK] = 0. We can instead assume

Cov(ei, Xik) = 0 for all k in 0, 1, 2, ..., K.

• The difference between the assumptions is subtle but can be important.
• The second assumption lets us identify β even if some functions of Xik are in the error term. That is, we can identify β even though E[ei | Xi1, ..., XiK] ≠ 0.
• The first assumption requires that we assume these functions of Xik are not in the error term (we should put them in our model).

Question. If you really want to understand the material, you can think about why this is. If you are mainly interested in applications, then you can ignore this subtle difference. I will not test you on it.

We have so far viewed identification through the lens of moment conditions. There are other views.

1.
"Line of best fit": Look for the β whose "line" Xi β "best" fits Yi in the population.

2. "Maximum likelihood": Make specific assumptions about the distribution F(x, y) for the population. Given this distributional assumption, look for the β that maximizes the likelihood of observing your population/sample.

Let us consider each briefly in turn.

What do we mean by "line of best fit"? Suppose we define "best" by the β that minimizes the error variance Var(ei), which equals

E[ei^2]

and, by our definition of the error, is then

E[(Yi − Xi β)^2]

Taking derivatives of this function with respect to the elements of β yields a set of equations that are identical to our moment conditions above (can you show this when K = 2?). We again end up with

βk = Cov(Yi, X̃ik) / V(X̃ik)

for k in 0, 1, 2, ..., K.

Suppose instead that the conditional distribution of Yi given Xi is normal with mean E[Yi | Xi] = Xi β and variance σ^2. Normality means the probability density f for Yi given Xi is

f(Yi | Xi; β) = (1/(σ√(2π))) exp(−(Yi − Xi β)^2 / (2σ^2))

where exp is the exponential function. f(Yi | Xi; β) can be interpreted as the probability of observing a particular value of Yi given the value of Xi. If Yi is the earnings of person i and Xi is their sex, f(50000 | sexi = female; β) is the probability that i earns 50,000 given that they are female. "Maximum likelihood" chooses the β that maximizes this probability in the population. To see what we get, convert everything to natural logarithms

ℓ(Yi | Xi; β) = −ln(σ√(2π)) − (Yi − Xi β)^2 / (2σ^2)

and take the expectation to get

E[ℓ(Yi | Xi; β)] = −ln(σ√(2π)) − E[(Yi − Xi β)^2] / (2σ^2)

Maximization gives

βk = Cov(Yi, X̃ik) / V(X̃ik)

for k in 0, 1, 2, ..., K.

All three approaches (moment conditions, line of best fit, maximum likelihood) yield the same expression for the unknown parameter β in terms of population moments. Several factors suggest this is unsurprising.

• Maximum likelihood makes assumptions about the entire conditional distribution of Yi given Xi.
Moment conditions make assumptions about only part of the distribution. It just so happens that in this case these assumptions coincide.
• It is also easy to see that maximizing the likelihood is equivalent to minimizing the variance of the error (the deviation from the line of best fit). This is because of the underlying properties of the normal distribution: the first two moments are the only moments that enter the likelihood.

2.2. Estimation.

Estimation usually follows naturally from the identification argument. With identification we think about how we would use "perfect" data to recover the unknown parameters. With estimation we think about how we would use actual data to estimate the equations that identify the unknown parameters. We typically assume that we have a random sample, but we can allow for dependence across observations or for draws from heterogeneous distributions (heteroskedasticity, for example).

To estimate these econometric models we can replace the population moments with sample moments in the equations that define β. Let's see an example. Assume K = 1, and assume we have a random sample of size N: (Y1, X11), (Y2, X21), ..., (YN, XN1). Substituting sample averages for population averages in the moment conditions (E[ei] = 0 and E[ei Xi1] = 0) gives

(1/N) Σi Yi Xi1 = (1/N) Σi Xi1 β̂0 + (1/N) Σi Xi1^2 β̂1
(1/N) Σi Yi = β̂0 + (1/N) Σi Xi1 β̂1

We can solve these equations to get our estimators β̂0 and β̂1. By doing so, we are effectively forcing our estimators to satisfy the sample analogues of the moment conditions.

Does this imply that the moment conditions themselves (E[ei] = 0 and E[ei Xi1] = 0) are true in the population? NO. The error term may contain variables which are related to both Yi and Xi1. By forcing our estimators to satisfy the above equations, we are forcing the influence of these other variables into our estimate of β1.
In this way, it is not necessarily the case that β̂1 will have a causal interpretation. It will reflect the true effect of Xi1 as well as the effects that operate through omitted variables. With this in mind, let us solve these equations anyway. Solving gives the estimators

β̂1 = Ĉov(Yi, Xi1) / V̂(Xi1)
β̂0 = Ȳ − β̂1 X̄1

where

Ĉov(Yi, Xi1) = (1/N) Σi (Yi − Ȳ)(Xi1 − X̄1)
V̂(Xi1) = (1/N) Σi (Xi1 − X̄1)^2

and the hat over a parameter denotes an estimator of that parameter.

Question. What is the difference between an estimator and an estimate?

Note that the above estimators β̂0 and β̂1 are "good" in that they are:

1. Consistent
   • β̂0 approaches β0, and β̂1 approaches β1, as the sample size increases.
2. Efficient
   • β̂0 and β̂1 have "low" variance across repeated samples.

In the population we can recover the unknown parameter β using the formula β = E[Xi^T Xi]^{-1} E[Xi^T Yi]. With data (Y1, X1), (Y2, X2), ..., (YN, XN), we can estimate the right hand side of this equation using

( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T Yi ).

We call this object β̂, but really it is an estimator of the right hand side of the first equation. It is in fact a consistent estimator. To see this,

plim( ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T Yi ) )
  = plim( ( (1/N) Σi Xi^T Xi )^{-1} ) plim( (1/N) Σi Xi^T Yi )
  = ( plim (1/N) Σi Xi^T Xi )^{-1} plim( (1/N) Σi Xi^T Yi )
  = E[Xi^T Xi]^{-1} E[Xi^T Yi]

This last term equals

E[Xi^T Xi]^{-1} E[Xi^T (Xi β + ei)] = β + E[Xi^T Xi]^{-1} E[Xi^T ei]

All this implies that our estimator gives us β plus some stuff. It gives us the true β if it is true that E[Xi^T ei] = 0. I recommend that you think a bit about what is being said here. It highlights the important distinction between identification and estimation, a distinction that confuses many students.

I will end this subsection with a discussion of goodness of fit measures and log specifications. Both are used heavily in applied work. We define

R̂^2 = V̂(Xi β̂) / V̂(Yi)

as a measure of goodness of fit.
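The matrix estimator and the goodness-of-fit measure above can be illustrated in a short simulation, assuming Python with numpy (the coefficients 1 and 0.5 are made up for the example). The matrix formula reproduces the covariance/variance formulas exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(2.0, 1.5, size=n)
e = rng.normal(size=n)
y = 1.0 + 0.5 * x + e              # hypothetical population: beta0 = 1, beta1 = 0.5

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y

# Same slope via the covariance/variance formula (N in the denominators).
b1 = np.cov(y, x, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Goodness of fit: share of Var(Y) explained by the fitted values X beta_hat.
r2 = np.var(X @ beta_hat) / np.var(y)
```

The two routes to β̂1 agree to machine precision, which is the sample counterpart of the identification argument.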
The R̂^2 tells us how much of the variation in the response variable is explained by the estimated conditional expectation Xi β̂. We increase the R̂^2 by adding controls to our regression. The problem is that the R̂^2 increases even if the controls are meaningless for explaining Yi. This is undesirable in a goodness of fit measure. Accordingly, we use the adjusted R̂^2, which penalizes the use of meaningless controls.

Some general remarks on R-squared:

• A high R-squared does NOT imply a causal interpretation.
• A low R-squared does NOT rule out a causal interpretation.
• A low R-squared does not preclude precise estimation of marginal effects.

Some remarks on the use of logs (do's and don'ts):

• Variables measured in units such as years are usually not logged. These variables already yield a convenient interpretation.
• Variables measured in percentage points are usually not logged, for the same reason.
• You can't log 0s or negative numbers. If you have a variable x that is often 0 but otherwise positive, you can use log(x+1) instead. If you have a variable x that is often negative, think of some other transformation or none at all.

2.3. Inference.

The assumption that E[ei | Xi] = 0 is powerful. It lets us construct reasonable estimators of E[Yi | Xi]. However, we need more assumptions to test hypotheses about E[Yi | Xi]. In particular, we need assumptions on the variance-covariance matrix of the errors given the Xi's, call it Ω. With N observations this is an N by N matrix. I encourage you to construct simple examples to build your understanding. One especially simple example has two independent observations and error variances which are independent of Xi:

Ω = ( Var(e1)        Cov(e1, e2)
      Cov(e1, e2)    Var(e2) )
  = ( E[e1^2]    E[e1 e2]
      E[e1 e2]   E[e2^2] )

where we are implicitly using the moment conditions E[e1] = 0 and E[e2] = 0. Independence of the squared error (the variance) from Xi is called homoskedasticity.
Random sampling implies

Ω = ( E[e1^2]    0
      0          E[e2^2] )

This Ω is the "original" variance-covariance matrix traditionally assumed in econometric analysis. It is rare these days to see data sets with independent observations or with error variances which are independent of Xi. We typically allow for some dependence between observations; we call this dependence clustering. We also tend to let the error variances depend on Xi; we call this dependence heteroskedasticity. Let's obtain an estimator for the variance-covariance matrix that allows for these possibilities. We will revisit these issues from an applied perspective in later lectures.

Our estimator of β is

β̂ = ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T Yi ).

Plug in the model for Yi to get

β̂ = β + ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T ei ).

One can show that the variance-covariance matrix of this object can be estimated by

V̂ar(β̂) = ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N^2) Σi êi^2 Xi^T Xi ) ( (1/N) Σi Xi^T Xi )^{-1},

which is a K+1 by K+1 matrix.

Assume K = 1 with homoskedastic and independent errors. The standard error for β̂1 is then the square root of the row 2, column 2 entry of V̂ar(β̂):

ŜE(β̂1) = sqrt( V̂(êi) / (N V̂(Xi1)) ).

It is large if

1. the sample size N is small,
2. the error variance across units i is large,
3. the variance of Xi across units i is small.

Question. Are the last two properties intuitive? Why or why not? Does Figure 2.2 below help you understand this?

[Figure 2.2, "Variance in X is good" (a scatter of Y against X, reproduced from Mastering 'Metrics, Chapter 2): it illustrates how adding variability in Xi, specifically adding the observations plotted in gray, helps pin down the slope linking Yi and Xi.]

The regression anatomy formula for multiple regression carries over to standard errors.
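Before turning to the multiple-regression case, the sandwich estimator V̂ar(β̂) above can be sketched in a few lines, assuming Python with numpy and a made-up heteroskedastic data-generating process:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
e = rng.normal(size=n) * (1.0 + 0.5 * np.abs(x))   # error variance depends on x
y = 1.0 + 2.0 * x + e                              # hypothetical population model

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Sandwich: (X'X/N)^{-1} (sum_i e_i^2 x_i' x_i / N^2) (X'X/N)^{-1}.
bread = np.linalg.inv(X.T @ X / n)
meat = (X * resid[:, None] ** 2).T @ X / n**2
V = bread @ meat @ bread
se_robust = np.sqrt(np.diag(V))                    # robust standard errors
```

This is the same calculation Stata performs (up to a finite-sample degrees-of-freedom adjustment) when you add the robust option.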
In a multivariate model like this,

Yi = α + Σ(k=1..K) βk Xki + ei,

the standard error for the kth sample slope, β̂k, is

SE(β̂k) = (σe / √n) × (1 / σX̃k),    (2.15)

where σX̃k is the standard deviation of X̃ki, the residual from a regression of Xki on all other regressors. The addition of controls has two opposing effects on SE(β̂k): the residual variance (σe in the numerator of the standard error formula) falls when covariates that predict Yi are added to the regression, while σX̃k in the denominator also falls. (From Mastering 'Metrics: The Path from Cause to Effect. © 2015 Princeton University Press. Used by permission. All rights reserved.)

Question. Compare the standard errors ŜE(β̂1) and ŜE(Ȳ) (see Week 1). Are the differences sensible?

Now assume K > 1, with errors still homoskedastic and independent. Then we have

ŜE(β̂1) = sqrt( V̂(êi) / (N V̂(X̃i1)) )

where X̃i1 is the residual from a regression of Xi1 on all other regressors. This says that adding control variables has two opposing effects on ŜE(β̂1):

1. it reduces the variance of the error term (why?);
2. it takes out some of the variance of Xi1 (why?).

The first effect decreases the standard error. The second effect increases it. The net effect is ambiguous.

We assumed independent observations and homoskedasticity for illustrative purposes. This is unreasonable in most situations.

• To relax homoskedasticity, use robust standard errors in Stata: reg y x, robust
• To relax independence across observations, use clustered standard errors in Stata: reg y x, cluster(unit) or reg y x, vce(cluster unit)

Note that in Stata's panel data xt environment, the robust option delivers standard errors that are clustered on the cross-sectional unit. We will come back to this later.

Example 4. The Linear Probability Model (LPM) is a model where the dependent variable is binary (0 or 1).
This implies

E[Yi | Xi] = 1 · P(Yi = 1 | Xi) + 0 · P(Yi = 0 | Xi)

and consequently that

E[Yi | Xi] = P(Yi = 1 | Xi)

Estimating this econometric model is therefore tantamount to estimating the conditional probability that Yi = 1 given our controls. This model is interesting for us because it has heteroskedastic errors by definition.

The LPM has been used to model the relationship between the labour force participation of married women and years of schooling. Here

P(Yi = 1 | Xi1) = β0 + β1 Xi1

where Yi indicates labour force participation and Xi1 is years of schooling. A population model where β0 = −0.146 and β1 = 0.038 is described below.

To see why the LPM has heteroskedastic errors by definition, note that

Var(Yi | Xi) = P(Yi = 1 | Xi)(1 − P(Yi = 1 | Xi)).

Since Var(Yi | Xi) = Var(Xi β + ei | Xi) = Var(ei | Xi), it follows that

Var(ei | Xi) = P(Yi = 1 | Xi)(1 − P(Yi = 1 | Xi)),

so the error variance depends on Xi. What can we do about this? Economists used to model the heteroskedasticity and then incorporate this model into the construction of standard errors. Now you can just use the robust option in Stata.

Let us continue our discussion of inference. We can use the estimator V̂ar(β̂) to construct t-statistics. For example, if we wish to test

H0: βk = 0
H1: βk ≠ 0

then we can take the kth diagonal element of V̂ar(β̂), which equals V̂ar(β̂k), and plug it into the denominator of the t-statistic:

t = (β̂k − 0) / sqrt( V̂ar(β̂k) )

These are called heteroskedasticity robust t-statistics. Confidence intervals are also obtained in the usual way.

We can also use V̂ar(β̂) to test multiple hypotheses at the same time. Suppose we wish to test the hypothesis that

H0: β1 = 0 and β2 = 0
H1: β1 ≠ 0 or β2 ≠ 0

We can write this more compactly as R β = r, where R is a matrix with 2 rows (one for each hypothesized restriction on β) and K+1 columns,

R = ( 0 1 0 . . . 0 0
      0 0 1 . . . 0 0 )

and r is a 2 by 1 vector of 0s.
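Constructing R and r is mechanical. A sketch, assuming Python with numpy, K = 5, and a made-up coefficient vector that happens to satisfy the null:

```python
import numpy as np

# Hypothetical coefficient vector (beta0, beta1, ..., beta5), so K + 1 = 6.
beta = np.array([1.0, 0.0, 0.0, 3.0, -2.0, 0.5])

# R selects beta1 and beta2; r stacks the hypothesized values (zeros).
R = np.zeros((2, 6))
R[0, 1] = 1.0          # first restriction: beta1 = 0
R[1, 2] = 1.0          # second restriction: beta2 = 0
r = np.zeros((2, 1))

# Under H0 the restriction R beta = r holds exactly for this beta.
restriction = R @ beta.reshape(-1, 1) - r
```

The same R and r then feed directly into the Wald statistic constructed next; each additional restriction just adds a row to R and an entry to r.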
We can then construct a heteroskedasticity robust Wald statistic:

W = (R β̂ − r)^T ( R V̂ar(β̂) R^T )^{-1} (R β̂ − r)

Under the null hypothesis H0, this statistic follows a χ²(2) distribution. The subscript 2 denotes the number of restrictions Q. Sometimes we adjust this formula to obtain an F-statistic with Q = 2 numerator degrees of freedom and N − K − 1 denominator degrees of freedom. Whether you use the χ² or the F formulation makes little difference. It is traditional to use F.

I recommend that you make up a simple example so that you understand what is happening above. I do not expect you to memorize the formula for W. I do expect you to understand hypothesis testing and, in particular,

• how to set up and test multiple hypotheses;
• that you can test multiple hypotheses using χ² or F statistics.

You will get practice with multiple hypothesis testing in the tutorial assignment.

This completes our discussion of the basics of identification, estimation, and inference for the linear model. This discussion typically provides the basis for causal analysis in econometrics, and indeed causal analysis has received a great deal of attention in the last 25 years. But regression is useful for other purposes, especially for prediction (and machine learning by implication). I am going to briefly talk about prediction.

3 Prediction

Regression models provide a basis for prediction. The idea is that if the observed Yi is generated in accordance with our model, and we know β, then we can learn Yi before observing it or without having to observe it. That is, we can make in-sample and out-of-sample predictions about the value of Yi.

We are going to discuss the Lasso method. This is an acronym for "Least Absolute Shrinkage and Selection Operator." It is a method for selecting and fitting covariates for the purposes of prediction and model selection.

3.1. Bias-variance trade off.

Recall that our expression β = E[Xi^T Xi]^{-1} E[Xi^T Yi] can be thought of as the minimizer of E[(Yi − Xi β)^2].
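That minimizer claim can be checked numerically: gradient descent on the sample squared error converges to the same coefficients as the closed-form moment-condition solution. A sketch, assuming Python with numpy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 2.0 - 1.0 * x + rng.normal(size=n)   # hypothetical population model
X = np.column_stack([np.ones(n), x])

# Closed form from the moment conditions: (X'X)^{-1} X'Y.
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)

# Minimize the sample squared error (1/n) sum (y_i - x_i beta)^2 directly.
beta = np.zeros(2)
for _ in range(5000):
    grad = -2.0 * X.T @ (y - X @ beta) / n   # gradient of the mean squared error
    beta -= 0.1 * grad
```

Both routes land on the same β̂: minimizing the squared error and solving the moment conditions are two descriptions of the same object.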
We can then use a random sample to approximate the right hand side of our expression using

β̂ = ( Σi Xi^T Xi )^{-1} ( Σi Xi^T Yi ).

This estimator has good properties. It is unbiased.¹ It has low variance. These properties are important for causal inference, where we want the "right" β. They may be less important for prediction, where we want a model that predicts Yi well. Here, we may improve predictions by trading off some bias for even lower variance. Lasso provides us with a disciplined mechanism for making this trade off.

With lasso we choose β to minimize

E[(Yi − Xi β)^2]

subject to the constraint

Σ(k=0..K) |βk| ≤ t

where t is a "budget" for the regression coefficients. This problem is equivalent to choosing β to minimize

E[(Yi − Xi β)^2] + λ Σ(k=0..K) |βk|

where λ is effectively the price of violating the constraint above. Rewriting this problem gives

min over β of  E[(Yi − Xi β)^2] + λ Σ(k=0..K) |βk|

¹ I use "consistent" and "unbiased" interchangeably. This is imprecise, as the concepts are different. I use them interchangeably because it simplifies the discussion.

Note that the derivative of λ Σk |βk| with respect to βk is not defined at 0. Because of this, small βk's are dropped from the econometric model. This is useful for model reduction as well as prediction, specifically in cases where we have more control variables than data points.

If λ = 0, then the solution to the above problem is β = E[Xi^T Xi]^{-1} E[Xi^T Yi], and our estimator of β is β̂ = (Σi Xi^T Xi)^{-1} (Σi Xi^T Yi). If λ = ∞, then the solution is to set all the βk's to 0. In fact, there is some number, call it λmax, at which all the βk's are set to 0. For λ's between 0 and λmax we trade off increased bias for less variance in our new estimator of β.

Note that λ is unknown to us. We need to specify it before estimating the regression coefficients. Which λ should we choose? This is a complex problem that is too involved to cover here. In the problem set we let Stata choose λ via a method called cross validation (cv).
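The effect of the kink at 0 can be made concrete in the simplest possible case. With a single regressor standardized so that E[X^2] = 1, minimizing E[(Y − Xβ)^2] + λ|β| gives the "soft-thresholded" solution sign(b) · max(|b| − λ/2, 0), where b is the λ = 0 solution. This one-regressor setup and the numbers below are illustrative, not part of the course's notation:

```python
import numpy as np

def soft_threshold(b_ols, lam):
    # Lasso solution for one standardized regressor: shrink b_ols toward 0
    # by lam/2, and set it exactly to 0 if |b_ols| <= lam/2.
    return np.sign(b_ols) * max(abs(b_ols) - lam / 2.0, 0.0)

lam = 0.2
b_small = soft_threshold(0.05, lam)   # |0.05| <= 0.1, so it is dropped to 0
b_large = soft_threshold(1.50, lam)   # shrunk to 1.4 but kept in the model
```

This is the sense in which lasso performs selection: coefficients inside the threshold are set exactly to 0 rather than merely shrunk, which least squares never does.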
cv looks for the λ that minimizes mse. All you need to know about choosing λ for the exam is:

1. it needs to be specified beforehand;
2. you can use cross validation to choose λ;
3. cross validation minimizes mse;
4. how to implement it in Stata.

How do we measure the net result of the bias-variance tradeoff? We typically use mean squared error (mse):

MSE = E[(Yi − Xi β̂)^2] = bias(Xi β̂)^2 + variance(Xi β̂)

With our least squares estimator bias(Xi β̂) = 0, such that

MSE_LS = variance(Xi β̂)

where LS denotes least squares. With lasso we get less variance but more bias, bias(Xi β̂) ≠ 0. The variance reduction may be large relative to the increased bias. In that case mse decreases, and we in turn get a "better" prediction. For this reason lasso may be preferable to least squares from a prediction perspective.

How do we leverage lasso for out-of-sample prediction?

1. Split your sample into 2:
   (a) a training (T) sample;
   (b) a validation (V) sample.
2. Estimate the regression coefficients by applying lasso to the training sample. Call these estimates β̂^T.
3. Use β̂^T to calculate the mse on the validation sample:

MSE_V = E[(Yi^V − Xi^V β̂^T)^2]

This measures the lasso prediction error.

You should know this prediction algorithm for the final exam. This week you will get practice implementing this algorithm in the tutorials. To this end, note that an alternative measure of the prediction error is

R̂²_V = V̂(Xi^V β̂^T) / V̂(Yi^V)

Caveats:

• Can you use LASSO for causal inference? Yes, but we need to adjust the standard errors to allow for the variation generated by the model selection procedure. We will not discuss this issue in this course.

This ends our discussion of lasso and prediction.

4 Conclusion

This week we delved into the mechanics of regression, where

• we typically specify a linear model: a functional form assumption for E[Y|X];
• we use the Identification-Estimation-Inference paradigm to make conclusions about E[Y|X];
• we predict Y via LASSO.

Why do we talk about prediction?
To show you that regression is about more than causal inference.

Next week: Common Biases and What to Do about Them.