FEM11090 WEEK 36 Lecture Notes

We talked last week about econometric models, where

1. we specify a relationship of interest in the population: E[Y|X]
2. we model our ignorance about the relationship: e = Y − E[Y|X]

This week we delve into the mechanics of regression, where

1. we typically specify a linear model: a functional form assumption for E[Y|X]
2. we use the Identification-Estimation-Inference paradigm to make conclusions about E[Y|X]
3. we predict Y via LASSO

1 Linear model

Regression is a simple way of modeling or parameterizing a wide range of conditional expectations. The workhorse model for doing this is the "linear in parameters" model:

Yi = β0 + β1 Xi1 + β2 Xi2 + ... + βK XiK + ei

where:

• i denotes unit i in the population. A unit can be a person, firm, state, etc.
• β0, β1, ..., βK are unknown parameters we wish to learn.
• Yi, Xi1, ..., XiK are observable random variables.
• ei is an unobservable random variable (the "disturbance").
  – E[ei | Xi1, ..., XiK] = 0 by our definition of the error term.
  – E[ei] = 0 because, by the law of iterated expectations, E[ei] = E[E[ei | Xi1, ..., XiK]] = E[0] = 0.

Familiar conditional expectation models include:

E[Yi | Xi1, Xi2] = β0 + β1 Xi1 + β2 Xi2    (1)
E[Yi | Xi1, Xi2] = β0 + β1 Xi1 + β2 Xi2 + β3 Xi2^2    (2)
E[Yi | Xi1, Xi2] = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2    (3)

These are all linear-in-parameters models. What do we mean by this? Linear in parameters means that the partial derivative ∂E[Yi | Xi1, Xi2]/∂βk does not depend on any of β0, β1, ..., βk, ..., βK. For example,

• ∂E[Yi | Xi1, Xi2]/∂β0 = 1 and ∂E[Yi | Xi1, Xi2]/∂β2 = Xi2 in all of these models.
• ∂E[Yi | Xi1, Xi2]/∂β1 = Xi1 in all of these models, and ∂E[Yi | Xi1, Xi2]/∂β3 = Xi2^2 in model (2).
• ∂E[Yi | Xi1, Xi2]/∂β3 = Xi1 Xi2 in model (3).

Can you come up with a model that is non-linear in parameters?

Suppose you hypothesize the following relationship between wages and experience. How could you capture the curvature using the linear model?
We could suppose that the population model is given by

Yi = β0 + β1 Xi1 + β2 Xi1^2 + ei

where Yi is the wage, Xi1 is experience, and ei is the error term.

• Note that the population model is still linear in the parameters β0, β1, β2,
• but it is non-linear in experience Xi1.
• That is, the effect of experience depends on experience: ΔYi/ΔXi1 = β1 + 2 β2 Xi1.

In Stata, reg wage exper expersq gives coefficient estimates which are suggestive of a concave relationship between wages and experience. Wage is increasing in experience at lower levels of experience. The marginal effect ΔYi/ΔXi1 = 0.298 − 2(0.0061) exper is lower at higher levels of experience. You can find these estimates in the raw Stata output at the top of the next page. With this information you should be able to determine the turning point, where wages start to decline with experience. See Chapter 6.2 of Wooldridge for details of this model.

In the quadratic case the marginal effect of experience depends on the level of experience. We can also allow marginal effects to depend on other factors via 'interactions' or 'interaction effects'. To see why, consider the following population model for house prices

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2 + ei

where Yi is the price of house i in thousands of dollars, Xi1 is the square footage (square meters in NL), and Xi2 is the number of bedrooms. The parameter β3 tells us how square footage and the number of bedrooms work together to generate house prices. Suppose you run reg price sqrft bdrms sqrft*bdrms in Stata. The estimated marginal effect of the number of bedrooms is

Δ̂Yi/ΔXi2 = β̂2 + β̂3 Xi1 = −35.96 + 0.02 Xi1.

Here −35.96 can be interpreted as the estimated effect of the number of bedrooms when the square footage of the house is 0. Is this sensible? NO. It is a nonsensical out-of-sample interpretation (a 0 square footage house?). But how do we fix this?
One can instead estimate

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 (Xi1 − E[Xi1])(Xi2 − E[Xi2]) + ei

The marginal effect of the number of bedrooms is now

ΔYi/ΔXi2 = β2 + β3 (Xi1 − E[Xi1]).

β2 is now the effect of the number of bedrooms at the mean square footage. Making the adjustment gives the estimates in the Stata output (not reproduced here).

We are going to express the linear model more compactly using matrix notation:

Yi = Xi β + ei

where i denotes person i in the population, and Xi = (1, Xi1, ..., XiK) is a row vector with 1 row and K+1 columns that is specific to individual i. The coefficient vector

β = (β0, β1, ..., βK)^T

is a column vector with K+1 rows and 1 column that is the same for all individuals in the population. Note that their product Xi β is just a compact way of writing β0 + β1 Xi1 + β2 Xi2 + ... + βK XiK.

Example 1. The following row is the Xi for the first observation in the tribal data (not reproduced here). The row is therefore X1.

Here are some matrix rules you should know:

• Matrix addition
  – If A and B each have 3 rows and 2 columns, they can be added: A + B has 3 rows and 2 columns. You cannot add matrices with different dimensions.
• Matrix multiplication
  – If A is 3 by 2, then the matrix product AB exists if B has 2 rows. If B also has 1 column, then AB is 3 by 1.
• Matrix transpose
  – A = ( 1 0 0
          0 1 0 )
    is 2 by 3, so its transpose is the 3 by 2 matrix
    A^T = ( 1 0
            0 1
            0 0 )
• Matrix inversion
  – Only square matrices of full rank are invertible.

More rules and details can be found in Appendix D of Wooldridge. I recommend that you only read what you need.

2 Identification-Estimation-Inference paradigm

Our model for the conditional expectation shows how the vector of variables Xi determines the variable Yi in the population. By the same token, the vector β can be interpreted as population parameters or, equivalently, as parameters that index the joint distribution F(x, y) in the population. We want to be able to:

1. Express this vector in terms of observable objects in our model (Identification).
2. Plug data into these expressions so that we can obtain estimates of β (Estimation).
3.
Use these estimates to conduct hypothesis tests regarding the true values of the parameters being estimated (Inference).

This is the identification-estimation-inference paradigm. It guides econometric analysis in a lot of applied papers, either implicitly or explicitly.

2.1. Identification.

Earlier I said that E[Y], E[X], E[Y^2] (or Var(Y)), E[X^2] (or Var(X)), E[XY] (or Cov(X, Y)), and E[Y|X] are parameters that describe the population of interest. They are all objects we will never truly know. We guess or estimate them using data and rely on mathematical statistics to interpret the estimates. This is all true, but in econometrics we are interested in relationships, and we need to make some assumptions about these parameters to learn about relationships. More precisely, we want to treat E[Y|X] as the unknown parameter and E[Y], E[X], E[Y^2] (or Var(Y)), E[X^2] (or Var(X)), E[XY] as known parameters. With identification we want to show that we can use the known or constructable parameters E[Y], E[X], E[Y^2], E[X^2], E[XY] to recover the unknown parameter E[Y|X].

Our econometric specification of the error term has two direct implications:

E[ei | Xi] = 0    (4)

and, together with the law of iterated expectations,

E[ei] = 0    (5)

These equations are called moment conditions. The left hand sides are called population moments. The right hand side is a condition we impose on these moments. Note that the first moment condition E[ei | Xi] = 0 implies that for each element k of Xi

Cov(ei, Xik) = E[ei Xik] = E[E[ei Xik | Xi]]    (6)
             = E[Xik E[ei | Xi]] = E[Xik · 0] = 0.    (7)

So there are in fact K+1 moment conditions, one for each control and one for the constant term:

E[ei] = 0    (8)
E[ei Xi1] = 0    (9)
E[ei Xi2] = 0    (10)
...    (11)
E[ei XiK] = 0.    (12)

We can use our definition of the error and these moment conditions to express the unknown population parameters β in terms of observable parameters like E[Y], E[X], E[Y^2], E[X^2], E[XY].
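Jumping ahead slightly to the sample analogue, the K+1 moment conditions can be illustrated numerically. This is a minimal sketch, assuming Python with numpy and a made-up data-generating process: solving the K+1 equations at their sample averages forces the residuals to be uncorrelated with the constant and with each regressor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Hypothetical population: beta0 = 1, beta1 = 2, beta2 = -0.5.
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # K + 1 = 3 columns
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

# Sample analogues of E[ei] = 0 and E[ei Xik] = 0, one per column of X.
moments = X.T @ e_hat / n
```

By construction the vector `moments` is numerically zero: the solution is exactly the β̂ that makes the sample versions of conditions (8)-(12) hold.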
We defined the error as ei = Yi − Xi β. Plugging this into the moment conditions gives:

E[Yi − Xi β] = 0
E[(Yi − Xi β) Xi1] = 0
E[(Yi − Xi β) Xi2] = 0
...
E[(Yi − Xi β) XiK] = 0

This gives us K+1 equations in K+1 unknowns. Under certain conditions this system has only one solution. We then say that the model is just identified: the number of equations just equals the number of unknown parameters. The system can be expressed more compactly as

E[Xi^T (Yi − Xi β)] = 0

where T denotes the transpose of a vector or matrix.

Example 2. Assume K = 1 (the simple regression model):

Yi = β0 + β1 Xi1 + ei

The moment conditions are:

E[(Yi − β0 − β1 Xi1) Xi1] = 0
E[Yi − β0 − β1 Xi1] = 0

which we can rewrite as

E[Yi Xi1] = E[Xi1] β0 + E[Xi1^2] β1
E[Yi] = β0 + E[Xi1] β1

This is a system of 2 equations in the 2 unknown parameters of interest, β0 and β1. We can therefore solve for β0 and β1 in terms of the parameters that we are assumed to know: E[Yi], E[Xi1], E[Xi1 Yi], and E[Xi1^2]. In fact one can show (can you do it?):

• β1 = Cov(Yi, Xi1) / V(Xi1)
• β0 = E[Yi] − β1 E[Xi1]

The unknown population parameters that define the conditional expectation have a clear interpretation:

• β1 tells us how much Y and X1 move together as a fraction of the total movement of X1.
• β0 is the average of Y after removing the effect of X1 at the average of X1.

Example 3. Suppose we are interested in:

Yi = β0 + β1 Xi1 + β2 Xi2 + ei

where Xi2 = 2 Xi1. This model is under-identified. The easiest way to see this is to substitute the expression for Xi2 into our regression. This gives Yi = β0 + (β1 + 2β2) Xi1 + ei. These Xi1 and Xi2 do not let us distinguish between β1 and β2. At best we can identify the sum β1 + 2β2.
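The under-identification in Example 3 shows up concretely as a rank deficiency: with Xi2 = 2 Xi1, the matrix E[Xi^T Xi] is singular, so it cannot be inverted to solve the moment conditions. A sketch, assuming Python with numpy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 2.0 * x1                      # exact linear dependence, as in Example 3
X = np.column_stack([np.ones(n), x1, x2])

# The third column of X'X/n is exactly twice the second column,
# so the matrix has rank 2 rather than full rank 3 and is not invertible.
XtX = X.T @ X / n
rank = np.linalg.matrix_rank(XtX)
```

With rank 2 and 3 unknowns, the system of moment conditions has infinitely many solutions, which is exactly the redundancy made formal in the next paragraph.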
A more formal way to see this uses the moment conditions:

E[(Yi − β0 − β1 Xi1 − β2 Xi2) Xi2] = 0
E[(Yi − β0 − β1 Xi1 − β2 Xi2) Xi1] = 0
E[Yi − β0 − β1 Xi1 − β2 Xi2] = 0

which we can rewrite as

E[(Yi − β0 − β1 Xi1 − 2 β2 Xi1) 2 Xi1] = 0
E[(Yi − β0 − β1 Xi1 − 2 β2 Xi1) Xi1] = 0
E[Yi − β0 − β1 Xi1 − 2 β2 Xi1] = 0

and then as

E[(Yi − β0 − (β1 + 2 β2) Xi1) Xi1] = 0
E[Yi − β0 − (β1 + 2 β2) Xi1] = 0

because the first and second equations are redundant. The redundancy means that we have 2 equations and 3 unknowns (the βs). We need at least 3 equations to identify 3 unknowns.

In the general case,

E[Xi^T (Yi − Xi β)] = 0  ⇔  E[Xi^T Yi] = E[Xi^T Xi] β

Pre-multiplying each side by (E[Xi^T Xi])^{-1} gives the solution

β = E[Xi^T Xi]^{-1} E[Xi^T Yi].

This last expression looks more complicated than it is. Each element of β is simply

βk = Cov(Yi, X̃ik) / V(X̃ik)

where X̃ik is the residual from the regression of Xk on all the other X's.

• X̃ik is the part of Xk not explained by the rest of the X's.
• The interpretation of βk is then: how much Y and the part of Xk not explained by the other X's move together, as a fraction of the total variation/movement in X̃ik.
• This ratio provides the essence of the ceteris paribus interpretation in multiple regression.

Note that we can still get identification if we relax the assumption that E[ei | Xi1, ..., XiK] = 0. We can instead assume

Cov(ei, Xik) = 0 for all k in 0, 1, 2, ..., K.

• The difference between the assumptions is subtle but can be important.
• The second assumption lets us identify β even if some functions of Xik are in the error term. That is, we can identify β even though E[ei | Xi1, ..., XiK] ≠ 0.
• The first assumption requires that we assume these functions of Xik are not in the error term (we should put them in our model).

Question. If you really want to understand the material, you can think about why this is. If you are mainly interested in applications, then you can ignore this subtle difference. I will not test you on it.

We have so far viewed identification through the lens of moment conditions. There are other views.

1.
"Line of best fit": Look for the β whose "line" Xi β "best" fits Yi in the population.

2. "Maximum likelihood": Make specific assumptions about the distribution F(x, y) for the population. Given this distributional assumption, look for the β that maximizes the likelihood of observing your population/sample.

Let us consider each briefly in turn.

What do we mean by "line of best fit"? Suppose we define "best" by the β that minimizes the error variance Var(ei), which equals

E[ei^2]

and, by our definition of the error, is then

E[(Yi − Xi β)^2]

Taking derivatives of this function with respect to the elements of β yields a set of equations that are identical to our moment conditions above (can you show this when K = 2?). We again end up with

βk = Cov(Yi, X̃ik) / V(X̃ik)

for k in 0, 1, 2, ..., K.

Suppose instead that the conditional distribution of Yi given Xi is normal with mean E[Yi | Xi] = Xi β and variance σ^2. Normality means the probability density f for Yi given Xi is

f(Yi | Xi; β) = (1/(σ√(2π))) exp(−(Yi − Xi β)^2 / (2σ^2))

where exp is the exponential function. f(Yi | Xi; β) can be interpreted as the probability of observing a particular value of Yi given the value of Xi. If Yi is the earnings of person i and Xi is their sex, f(50000 | sexi = female; β) is the probability that i earns 50,000 given that they are female. "Maximum likelihood" chooses the β that maximizes this probability in the population. To see what we get, convert everything to natural logarithms

ℓ(Yi | Xi; β) = −ln(σ√(2π)) − (Yi − Xi β)^2 / (2σ^2)

and take the expectation to get

E[ℓ(Yi | Xi; β)] = −ln(σ√(2π)) − E[(Yi − Xi β)^2] / (2σ^2)

Maximization gives

βk = Cov(Yi, X̃ik) / V(X̃ik)

for k in 0, 1, 2, ..., K.

All three approaches (moment conditions, line of best fit, maximum likelihood) yield the same expression for the unknown parameter β in terms of population moments. Several factors suggest this is unsurprising.

• Maximum likelihood makes assumptions about the entire conditional distribution of Yi given Xi.
Moment conditions make assumptions about only part of the distribution. It just so happens that in this case these assumptions coincide.
• It is also easy to see that maximizing the likelihood is equivalent to minimizing the variance of the error (the deviation from the line of best fit). This is because of the underlying properties of the normal distribution: the first two moments are the only moments that enter the likelihood.

2.2. Estimation.

Estimation usually follows naturally from the identification argument. With identification we think about how we would use "perfect" data to recover the unknown parameters. With estimation we think about how we would use actual data to estimate the equations that identify the unknown parameters. We typically assume that we have a random sample, but we can allow for dependence across observations or for draws from heterogeneous distributions (heteroskedasticity, for example).

To estimate these econometric models we can replace the population moments with sample moments in the equations that define β. Let's see an example. Assume K = 1, and assume we have a random sample of size N: (Y1, X11), (Y2, X21), ..., (YN, XN1). Substituting sample averages for population averages in the moment conditions (E[ei] = 0 and E[ei Xi1] = 0) gives

(1/N) Σi Yi Xi1 = (1/N) Σi Xi1 β̂0 + (1/N) Σi Xi1^2 β̂1
(1/N) Σi Yi = β̂0 + (1/N) Σi Xi1 β̂1

We can solve these equations to get our estimators β̂0 and β̂1. By doing so, we are effectively forcing our estimators to satisfy the sample analogues of the moment conditions.

Does this imply that the moment conditions themselves (E[ei] = 0 and E[ei Xi1] = 0) are true in the population? NO. The error term may contain variables which are related to both Yi and Xi1. By forcing our estimators to satisfy the above equations, we are forcing the influence of these other variables into our estimate of β1.
In this way, it is not necessarily the case that β̂1 will have a causal interpretation. It will reflect the true effect of Xi1 as well as the effects that operate through omitted variables. With this in mind, let us solve these equations anyway. Solving gives the estimators

β̂1 = Ĉov(Yi, Xi1) / V̂(Xi1)
β̂0 = Ȳ − β̂1 X̄1

where

Ĉov(Yi, Xi1) = (1/N) Σi (Yi − Ȳ)(Xi1 − X̄1)
V̂(Xi1) = (1/N) Σi (Xi1 − X̄1)^2

and the hat over a parameter denotes an estimator of that parameter.

Question. What is the difference between an estimator and an estimate?

Note that the above estimators β̂0 and β̂1 are "good" in that they are:

1. Consistent
   • β̂0 approaches β0, and β̂1 approaches β1, as the sample size increases.
2. Efficient
   • β̂0 and β̂1 have "low" variance across repeated samples.

In the population we can recover the unknown parameter β using the formula β = E[Xi^T Xi]^{-1} E[Xi^T Yi]. With data (Y1, X1), (Y2, X2), ..., (YN, XN), we can estimate the right hand side of this equation using

( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T Yi ).

We call this object β̂, but really it is an estimator of the right hand side of the first equation. It is in fact a consistent estimator. To see this,

plim( ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T Yi ) )
  = plim( ( (1/N) Σi Xi^T Xi )^{-1} ) plim( (1/N) Σi Xi^T Yi )
  = ( plim (1/N) Σi Xi^T Xi )^{-1} plim( (1/N) Σi Xi^T Yi )
  = E[Xi^T Xi]^{-1} E[Xi^T Yi]

This last term equals

E[Xi^T Xi]^{-1} E[Xi^T (Xi β + ei)] = β + E[Xi^T Xi]^{-1} E[Xi^T ei]

All this implies that our estimator gives us β plus some stuff. It gives us the true β if it is true that E[Xi^T ei] = 0. I recommend that you think a bit about what is being said here. It highlights the important distinction between identification and estimation, a distinction that confuses many students.

I will end this subsection with a discussion of goodness of fit measures and log specifications. Both are used heavily in applied work. We define

R̂^2 = V̂(Xi β̂) / V̂(Yi)

as a measure of goodness of fit.
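The matrix estimator and the goodness-of-fit measure above can be illustrated in a short simulation, assuming Python with numpy (the coefficients 1 and 0.5 are made up for the example). The matrix formula reproduces the covariance/variance formulas exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(2.0, 1.5, size=n)
e = rng.normal(size=n)
y = 1.0 + 0.5 * x + e              # hypothetical population: beta0 = 1, beta1 = 0.5

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y

# Same slope via the covariance/variance formula (N in the denominators).
b1 = np.cov(y, x, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Goodness of fit: share of Var(Y) explained by the fitted values X beta_hat.
r2 = np.var(X @ beta_hat) / np.var(y)
```

The two routes to β̂1 agree to machine precision, which is the sample counterpart of the identification argument.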
The R̂^2 tells us how much of the variation in the response variable is explained by the estimated conditional expectation Xi β̂. We increase the R̂^2 by adding controls to our regression. The problem is that the R̂^2 increases even if the controls are meaningless for explaining Yi. This is undesirable in a goodness of fit measure. Accordingly, we use the adjusted R̂^2, which penalizes the use of meaningless controls.

Some general remarks on R-squared:

• A high R-squared does NOT imply a causal interpretation.
• A low R-squared does NOT rule out a causal interpretation.
• A low R-squared does not preclude precise estimation of marginal effects.

Some remarks on the use of logs (do's and don'ts):

• Variables measured in units such as years are usually not logged. These variables already yield a convenient interpretation.
• Variables measured in percentage points are usually not logged, for the same reason.
• You can't log 0s or negative numbers. If you have a variable x that is often 0 but otherwise positive, you can use log(x+1) instead. If you have a variable x that is often negative, think of some other transformation or none at all.

2.3. Inference.

The assumption that E[ei | Xi] = 0 is powerful. It lets us construct reasonable estimators of E[Yi | Xi]. However, we need more assumptions to test hypotheses about E[Yi | Xi]. In particular, we need assumptions on the variance-covariance matrix of the errors given the Xi's, call it Ω. With N observations this is an N by N matrix. I encourage you to construct simple examples to build your understanding. One especially simple example has two independent observations and error variances which are independent of Xi:

Ω = ( Var(e1)        Cov(e1, e2)
      Cov(e1, e2)    Var(e2) )
  = ( E[e1^2]    E[e1 e2]
      E[e1 e2]   E[e2^2] )

where we are implicitly using the moment conditions E[e1] = 0 and E[e2] = 0. Independence of the squared error (the variance) from Xi is called homoskedasticity.
Random sampling implies

Ω = ( E[e1^2]    0
      0          E[e2^2] )

This Ω is the "original" variance-covariance matrix traditionally assumed in econometric analysis. It is rare these days to see data sets with independent observations or with error variances which are independent of Xi. We typically allow for some dependence between observations; we call this dependence clustering. We also tend to let the error variances depend on Xi; we call this dependence heteroskedasticity. Let's obtain an estimator for the variance-covariance matrix that allows for these possibilities. We will revisit these issues from an applied perspective in later lectures.

Our estimator of β is

β̂ = ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T Yi ).

Plug in the model for Yi to get

β̂ = β + ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N) Σi Xi^T ei ).

One can show that the variance-covariance matrix of this object can be estimated by

V̂ar(β̂) = ( (1/N) Σi Xi^T Xi )^{-1} ( (1/N^2) Σi êi^2 Xi^T Xi ) ( (1/N) Σi Xi^T Xi )^{-1},

which is a K+1 by K+1 matrix.

Assume K = 1 with homoskedastic and independent errors. The standard error for β̂1 is then the square root of the row 2, column 2 entry of V̂ar(β̂):

ŜE(β̂1) = sqrt( V̂(êi) / (N V̂(Xi1)) ).

It is large if

1. the sample size N is small,
2. the error variance across units i is large,
3. the variance of Xi across units i is small.

Question. Are the last two properties intuitive? Why or why not? Does Figure 2.2 below help you understand this?

[Figure 2.2, "Variance in X is good" (a scatter of Y against X, reproduced from Mastering 'Metrics, Chapter 2): it illustrates how adding variability in Xi, specifically adding the observations plotted in gray, helps pin down the slope linking Yi and Xi.]

The regression anatomy formula for multiple regression carries over to standard errors.
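Before turning to the multiple-regression case, the sandwich estimator V̂ar(β̂) above can be sketched in a few lines, assuming Python with numpy and a made-up heteroskedastic data-generating process:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
e = rng.normal(size=n) * (1.0 + 0.5 * np.abs(x))   # error variance depends on x
y = 1.0 + 2.0 * x + e                              # hypothetical population model

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Sandwich: (X'X/N)^{-1} (sum_i e_i^2 x_i' x_i / N^2) (X'X/N)^{-1}.
bread = np.linalg.inv(X.T @ X / n)
meat = (X * resid[:, None] ** 2).T @ X / n**2
V = bread @ meat @ bread
se_robust = np.sqrt(np.diag(V))                    # robust standard errors
```

This is the same calculation Stata performs (up to a finite-sample degrees-of-freedom adjustment) when you add the robust option.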
In a multivariate model like this,

Yi = α + Σ(k=1..K) βk Xki + ei,

the standard error for the kth sample slope, β̂k, is

SE(β̂k) = (σe / √n) × (1 / σX̃k),    (2.15)

where σX̃k is the standard deviation of X̃ki, the residual from a regression of Xki on all other regressors. The addition of controls has two opposing effects on SE(β̂k): the residual variance (σe in the numerator of the standard error formula) falls when covariates that predict Yi are added to the regression, while σX̃k in the denominator also falls. (From Mastering 'Metrics: The Path from Cause to Effect. © 2015 Princeton University Press. Used by permission. All rights reserved.)

Question. Compare the standard errors ŜE(β̂1) and ŜE(Ȳ) (see Week 1). Are the differences sensible?

Now assume K > 1, with errors still homoskedastic and independent. Then we have

ŜE(β̂1) = sqrt( V̂(êi) / (N V̂(X̃i1)) )

where X̃i1 is the residual from a regression of Xi1 on all other regressors. This says that adding control variables has two opposing effects on ŜE(β̂1):

1. it reduces the variance of the error term (why?);
2. it takes out some of the variance of Xi1 (why?).

The first effect decreases the standard error. The second effect increases it. The net effect is ambiguous.

We assumed independent observations and homoskedasticity for illustrative purposes. This is unreasonable in most situations.

• To relax homoskedasticity, use robust standard errors in Stata: reg y x, robust
• To relax independence across observations, use clustered standard errors in Stata: reg y x, cluster(unit) or reg y x, vce(cluster unit)

Note that in Stata's panel data xt environment, the robust option delivers standard errors that are clustered on the cross-sectional unit. We will come back to this later.

Example 4. The Linear Probability Model (LPM) is a model where the dependent variable is binary (0 or 1).
This implies

E[Yi | Xi] = 1 · P(Yi = 1 | Xi) + 0 · P(Yi = 0 | Xi)

and consequently that

E[Yi | Xi] = P(Yi = 1 | Xi)

Estimating this econometric model is therefore tantamount to estimating the conditional probability that Yi = 1 given our controls. This model is interesting for us because it has heteroskedastic errors by definition.

The LPM has been used to model the relationship between the labour force participation of married women and years of schooling. Here

P(Yi = 1 | Xi1) = β0 + β1 Xi1

where Yi indicates labour force participation and Xi1 is years of schooling. A population model where β0 = −0.146 and β1 = 0.038 is described below.

To see why the LPM has heteroskedastic errors by definition, note that

Var(Yi | Xi) = P(Yi = 1 | Xi)(1 − P(Yi = 1 | Xi)).

Since Var(Yi | Xi) = Var(Xi β + ei | Xi) = Var(ei | Xi), it follows that

Var(ei | Xi) = P(Yi = 1 | Xi)(1 − P(Yi = 1 | Xi)),

so the error variance depends on Xi. What can we do about this? Economists used to model the heteroskedasticity and then incorporate this model into the construction of standard errors. Now you can just use the robust option in Stata.

Let us continue our discussion of inference. We can use the estimator V̂ar(β̂) to construct t-statistics. For example, if we wish to test

H0: βk = 0
H1: βk ≠ 0

then we can take the kth diagonal element of V̂ar(β̂), which equals V̂ar(β̂k), and plug it into the denominator of the t-statistic:

t = (β̂k − 0) / sqrt( V̂ar(β̂k) )

These are called heteroskedasticity robust t-statistics. Confidence intervals are also obtained in the usual way.

We can also use V̂ar(β̂) to test multiple hypotheses at the same time. Suppose we wish to test the hypothesis that

H0: β1 = 0 and β2 = 0
H1: β1 ≠ 0 or β2 ≠ 0

We can write this more compactly as R β = r, where R is a matrix with 2 rows (one for each hypothesized restriction on β) and K+1 columns,

R = ( 0 1 0 . . . 0 0
      0 0 1 . . . 0 0 )

and r is a 2 by 1 vector of 0s.
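Constructing R and r is mechanical. A sketch, assuming Python with numpy, K = 5, and a made-up coefficient vector that happens to satisfy the null:

```python
import numpy as np

# Hypothetical coefficient vector (beta0, beta1, ..., beta5), so K + 1 = 6.
beta = np.array([1.0, 0.0, 0.0, 3.0, -2.0, 0.5])

# R selects beta1 and beta2; r stacks the hypothesized values (zeros).
R = np.zeros((2, 6))
R[0, 1] = 1.0          # first restriction: beta1 = 0
R[1, 2] = 1.0          # second restriction: beta2 = 0
r = np.zeros((2, 1))

# Under H0 the restriction R beta = r holds exactly for this beta.
restriction = R @ beta.reshape(-1, 1) - r
```

The same R and r then feed directly into the Wald statistic constructed next; each additional restriction just adds a row to R and an entry to r.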
We can then construct a heteroskedasticity robust Wald statistic:

W = (R β̂ − r)^T ( R V̂ar(β̂) R^T )^{-1} (R β̂ − r)

Under the null hypothesis H0, this statistic follows a χ²(2) distribution. The subscript 2 denotes the number of restrictions Q. Sometimes we adjust this formula to obtain an F-statistic with Q = 2 numerator degrees of freedom and N − K − 1 denominator degrees of freedom. Whether you use the χ² or the F formulation makes little difference. It is traditional to use F.

I recommend that you make up a simple example so that you understand what is happening above. I do not expect you to memorize the formula for W. I do expect you to understand hypothesis testing and, in particular,

• how to set up and test multiple hypotheses;
• that you can test multiple hypotheses using χ² or F statistics.

You will get practice with multiple hypothesis testing in the tutorial assignment.

This completes our discussion of the basics of identification, estimation, and inference for the linear model. This discussion typically provides the basis for causal analysis in econometrics, and indeed causal analysis has received a great deal of attention in the last 25 years. But regression is useful for other purposes, especially for prediction (and machine learning by implication). I am going to briefly talk about prediction.

3 Prediction

Regression models provide a basis for prediction. The idea is that if the observed Yi is generated in accordance with our model, and we know β, then we can learn Yi before observing it or without having to observe it. That is, we can make in-sample and out-of-sample predictions about the value of Yi.

We are going to discuss the Lasso method. This is an acronym for "Least Absolute Shrinkage and Selection Operator." It is a method for selecting and fitting covariates for the purposes of prediction and model selection.

3.1. Bias-variance trade off.

Recall that our expression β = E[Xi^T Xi]^{-1} E[Xi^T Yi] can be thought of as the minimizer of E[(Yi − Xi β)^2].
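That minimizer claim can be checked numerically: gradient descent on the sample squared error converges to the same coefficients as the closed-form moment-condition solution. A sketch, assuming Python with numpy and simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
y = 2.0 - 1.0 * x + rng.normal(size=n)   # hypothetical population model
X = np.column_stack([np.ones(n), x])

# Closed form from the moment conditions: (X'X)^{-1} X'Y.
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)

# Minimize the sample squared error (1/n) sum (y_i - x_i beta)^2 directly.
beta = np.zeros(2)
for _ in range(5000):
    grad = -2.0 * X.T @ (y - X @ beta) / n   # gradient of the mean squared error
    beta -= 0.1 * grad
```

Both routes land on the same β̂: minimizing the squared error and solving the moment conditions are two descriptions of the same object.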
We can then use a random sample to approximate the right hand side of our expression using

β̂ = ( Σi Xi^T Xi )^{-1} ( Σi Xi^T Yi ).

This estimator has good properties. It is unbiased.¹ It has low variance. These properties are important for causal inference, where we want the "right" β. They may be less important for prediction, where we want a model that predicts Yi well. Here, we may improve predictions by trading off some bias for even lower variance. Lasso provides us with a disciplined mechanism for making this trade off.

With lasso we choose β to minimize

E[(Yi − Xi β)^2]

subject to the constraint

Σ(k=0..K) |βk| ≤ t

where t is a "budget" for the regression coefficients. This problem is equivalent to choosing β to minimize

E[(Yi − Xi β)^2] + λ Σ(k=0..K) |βk|

where λ is effectively the price of violating the constraint above. Rewriting this problem gives

min over β of  E[(Yi − Xi β)^2] + λ Σ(k=0..K) |βk|

¹ I use "consistent" and "unbiased" interchangeably. This is imprecise, as the concepts are different. I use them interchangeably because it simplifies the discussion.

Note that the derivative of λ Σk |βk| with respect to βk is not defined at 0. Because of this, small βk's are dropped from the econometric model. This is useful for model reduction as well as prediction, specifically in cases where we have more control variables than data points.

If λ = 0, then the solution to the above problem is β = E[Xi^T Xi]^{-1} E[Xi^T Yi], and our estimator of β is β̂ = (Σi Xi^T Xi)^{-1} (Σi Xi^T Yi). If λ = ∞, then the solution is to set all the βk's to 0. In fact, there is some number, call it λmax, at which all the βk's are set to 0. For λ's between 0 and λmax we trade off increased bias for less variance in our new estimator of β.

Note that λ is unknown to us. We need to specify it before estimating the regression coefficients. Which λ should we choose? This is a complex problem that is too involved to cover here. In the problem set we let Stata choose λ via a method called cross validation (cv).
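The effect of the kink at 0 can be made concrete in the simplest possible case. With a single regressor standardized so that E[X^2] = 1, minimizing E[(Y − Xβ)^2] + λ|β| gives the "soft-thresholded" solution sign(b) · max(|b| − λ/2, 0), where b is the λ = 0 solution. This one-regressor setup and the numbers below are illustrative, not part of the course's notation:

```python
import numpy as np

def soft_threshold(b_ols, lam):
    # Lasso solution for one standardized regressor: shrink b_ols toward 0
    # by lam/2, and set it exactly to 0 if |b_ols| <= lam/2.
    return np.sign(b_ols) * max(abs(b_ols) - lam / 2.0, 0.0)

lam = 0.2
b_small = soft_threshold(0.05, lam)   # |0.05| <= 0.1, so it is dropped to 0
b_large = soft_threshold(1.50, lam)   # shrunk to 1.4 but kept in the model
```

This is the sense in which lasso performs selection: coefficients inside the threshold are set exactly to 0 rather than merely shrunk, which least squares never does.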
cv looks for the λ that minimizes mse. All you need to know about choosing λ for the exam is:

1. it needs to be specified beforehand;
2. you can use cross validation to choose λ;
3. cross validation minimizes mse;
4. how to implement it in Stata.

How do we measure the net result of the bias-variance tradeoff? We typically use mean squared error (mse):

MSE = E[(Yi − Xi β̂)^2] = bias(Xi β̂)^2 + variance(Xi β̂)

With our least squares estimator bias(Xi β̂) = 0, such that

MSE_LS = variance(Xi β̂)

where LS denotes least squares. With lasso we get less variance but more bias, bias(Xi β̂) ≠ 0. The variance reduction may be large relative to the increased bias. In that case mse decreases, and we in turn get a "better" prediction. For this reason lasso may be preferable to least squares from a prediction perspective.

How do we leverage lasso for out-of-sample prediction?

1. Split your sample into 2:
   (a) a training (T) sample;
   (b) a validation (V) sample.
2. Estimate the regression coefficients by applying lasso to the training sample. Call these estimates β̂^T.
3. Use β̂^T to calculate the mse on the validation sample:

MSE_V = E[(Yi^V − Xi^V β̂^T)^2]

This measures the lasso prediction error.

You should know this prediction algorithm for the final exam. This week you will get practice implementing this algorithm in the tutorials. To this end, note that an alternative measure of the prediction error is

R̂²_V = V̂(Xi^V β̂^T) / V̂(Yi^V)

Caveats:

• Can you use LASSO for causal inference? Yes, but we need to adjust the standard errors to allow for the variation generated by the model selection procedure. We will not discuss this issue in this course.

This ends our discussion of lasso and prediction.

4 Conclusion

This week we delved into the mechanics of regression, where

• we typically specify a linear model: a functional form assumption for E[Y|X];
• we use the Identification-Estimation-Inference paradigm to make conclusions about E[Y|X];
• we predict Y via LASSO.

Why do we talk about prediction?
To show you that regression is about more than causal inference.

Next week: Common Biases and What to Do about Them.