Harvard Economics Ec 1126, Tamer - September 8, 2015

Note 3: Least Squares and the 2-Variable Model

1 Two Variable Model

Now, we are back to the example where we are trying to predict the earnings of an individual Y using that individual's education X as a predictor. Or, in other words, does education have any predictive power for earnings? So, using a linear predictor under squared loss, our predictor for Y given X is

Ŷ = β0 + β1X

and again, those parameters (which determine or operationalize this prediction problem) are determined by minimizing the mean square error:

min_{β0,β1} E(Y − Ŷ)² = min_{β0,β1} E(Y − β0 − β1X)²   (1.1)

This is a typical minimization problem that we will see in this class. So, we can try to handle it in a geometric way (for now) that applies in many setups and is also insightful. We will use a more brute-force approach (using standard calculus) later. The key is to use an orthogonal projection in a vector space with an inner product. Here the inner product is

< Y, X > = E(Y X)

The associated norm is ‖Y‖ = < Y, Y >^{1/2}. Then, our problem above becomes:

min_{β0,β1} ‖Y − Ŷ‖² = min_{β0,β1} ‖Y − β0 − β1X‖²

So, here you are trying to minimize the Euclidean distance between Y and the (linear) space spanned by (1, X). The solution is the orthogonal projection of Y on 1 and X (as any such predictor is a linear combination of 1 and X). See the Figure to the right: we have a vector y and we want to find, in the space spanned by the x's, the vector that is "closest" to y. This is exactly the orthogonal projection.
So, if you think of the constant random variable 1 as X0, then this orthogonal projection requires that the prediction error (Y − Ŷ) (also called a residual) be orthogonal to both X0 and X:

< Y − Ŷ, X0 > = 0
< Y − Ŷ, X > = 0

Geometrically again, what you are doing is taking the orthogonal projection of the vector Y on the space spanned by (linear combinations of) (X0, X). This means that

Y − Ŷ ⊥ X0,   Y − Ŷ ⊥ X

In particular, we have

< Y − Ŷ, X0 > = < Y − β0 − β1X, X0 > = < Y, X0 > − β0< X0, X0 > − β1< X, X0 > = 0
< Y − Ŷ, X > = < Y − β0 − β1X, X > = < Y, X > − β0< X0, X > − β1< X, X > = 0

The projection here is special in that it is done using the distance measure we defined above. Replacing < ·, · > with E(··) in the above, we get

E(Y) − β0 − β1E(X) = 0
E(Y X) − β0E(X) − β1E(X²) = 0

This is a two-equation linear system with two unknowns (you can also verify that these two equations are the first order conditions of the minimization problem in (1.1) above), which can be solved to give us:

β1 = [E(Y X) − E(Y)E(X)] / [E(X²) − E(X)E(X)],   β0 = E(Y) − β1E(X)   (1.2)

Notice, the numerator in the expression for β1 is the covariance between Y and X,

Cov(Y, X) = E(Y X) − E(Y)E(X)

and the denominator is the variance of X:

Var(X) = E(X²) − E(X)E(X)

I.e.,

β1 = Cov(Y, X) / Var(X)   (1.3)

Again, the notation for this linear prediction is

E*[Y|X] = β0 + β1X   (1.4)

The above is a population version of the best linear predictor. This means that β0 and β1 are not directly observed in the data (they are functions of the joint distribution of Y and X).
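As a quick numerical sanity check, the population formulas (1.2)-(1.3) can be approximated on a large simulated sample; the data-generating process below is purely hypothetical and chosen only so the true coefficients are known:

```python
import numpy as np

# Hypothetical simulated "population": earnings linear in education plus noise,
# with true intercept 1.0 and true slope 0.5.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(12.0, 2.0, n)              # years of education
Y = 1.0 + 0.5 * X + rng.normal(0, 1, n)

# Equations (1.2)-(1.3): beta1 = Cov(Y, X) / Var(X), beta0 = E(Y) - beta1 E(X),
# with expectations replaced by averages over the large simulated sample.
beta1 = (np.mean(Y * X) - Y.mean() * X.mean()) / (np.mean(X**2) - X.mean()**2)
beta0 = Y.mean() - beta1 * X.mean()
print(beta0, beta1)   # close to the true values 1.0 and 0.5
```

With a million draws the sample moments are very close to the population moments, so the computed (beta0, beta1) sit essentially on top of the true values.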
A Non-Geometry Approach (using calculus)¹: The problem we are asked to solve in (1.1) is to minimize

min_{β0,β1} E(Y − Ŷ)² = min_{β0,β1} E(Y − β0 − β1X)²

To do this, we can take first order conditions (i.e., derivatives with respect to β0 and β1), ∂_{β0,β1} E(Y − β0 − β1X)², which give us:

2E[(Y − β0 − β1X)(−1)] = 0
2E[(Y − β0 − β1X)(−X)] = 0   (1.5)

(here we implicitly interchanged ∂ with E, which is allowed...). Solving for β0 and β1 from above (2 equations and 2 unknowns), we get the same solutions we got in (1.2) and (1.3) (try this).

¹This may be easier to follow for some than the geometric approach, and though it provides less intuition it does work.

2 Least squares

The sample analog of the above population problem uses a sample of size n to construct an estimator of β0 and β1. This can be done via least squares, which is the sample counterpart of (1.1) above.

Example 2.1. To get more intuition about sample vs. population, consider this example. The average GPA for all undergraduate students here is μ_GPA, a population quantity. This is a population quantity because it relates to the population of interest, namely ALL undergraduate students (the registrar surely has access to this quantity). Now, if we were to pick 10 students at random and average their GPAs, that would give us μ̂_GPA = (GPA1 + ... + GPA10)/10. So, μ̂_GPA is a sample quantity. It is usually called an estimator of μ_GPA, and the role of statistics is to tell us how close μ̂_GPA is to μ_GPA. The reason why we do not go after population quantities directly (and spare ourselves all of statistics) is that it is too costly (often impossible) to get answers from every member of the population. Think about exit polls, where pollsters ask a few voters during a short period of voting and then try to use those data (the sample) to predict whether a given candidate will win.
There it is basically too costly to ask every single voter how they voted (many will not even answer...). Now, we want an estimate of the regression line using sample data; we mimic the same arguments as above, but with a sample, to get the following. First, let the sample of size n be in the form of the matrices

y = (y1, ..., yn)′,   x0 = (1, ..., 1)′,   x = (x1, ..., xn)′

The fitted value for the i-th observation is ŷi = b0 + b1xi, and the objective again is choosing (b0, b1) to minimize the average of squared residuals:

min_{b0,b1} (1/n) Σᵢ (yi − ŷi)²   (2.1)

Notice the similarity between this objective function and the one in (1.1) above: the former uses the sample, and so the minimization problem is feasible, i.e., can be done with a computer (or calculator), while the latter involves minimizing an object that involves the operator E and hence requires knowledge of the joint distribution of (Y, X), which is generally unobserved. Though our objects of interest are the solutions (β0, β1) to (1.1) above, the feasible solutions to (2.1) will be our least squares estimators of these objects. We shall study the statistical properties of such estimators later. First, to define the estimators, we mimic the approach above. The inner product between vectors y and x is

< y, x > = (1/n) Σᵢ yi xi

We now have the minimum norm problem

min_{b0,b1} ‖y − b0x0 − b1x‖²

and the solution again takes the orthogonal projection of y on x0 and x.
This will require that the prediction error (y − ŷ) is orthogonal to x and x0:

< y − ŷ, x0 > = 0
< y − ŷ, x > = 0

This means that y − ŷ ⊥ x0 and y − ŷ ⊥ x. In particular,

< y − ŷ, x0 > = < y − b0 − b1x, x0 > = < y, x0 > − b0< x0, x0 > − b1< x, x0 > = 0
< y − ŷ, x > = < y − b0 − b1x, x > = < y, x > − b0< x0, x > − b1< x, x > = 0

Taking the sample analogue,

ȳ − b0 − b1x̄ = 0
(1/n) Σᵢ yi xi − b0x̄ − b1 (1/n) Σᵢ xi² = 0

where ȳ = (1/n) Σᵢ yi and x̄ = (1/n) Σᵢ xi. The two linear equations for the two unknowns b0 and b1 can be solved to give

b1 = [(1/n) Σᵢ yi xi − ȳ x̄] / [(1/n) Σᵢ xi² − x̄ x̄],   b0 = ȳ − b1x̄

2.1 Goodness of fit

Note that

0 ≤ ‖Y − E*(Y|1, X)‖² / ‖Y − E*(Y|1)‖² ≤ 1

This ratio is at most 1 because using X to predict Y cannot increase the mean square error, since β1 is allowed to be zero (the linear predictor using only a constant is E*(Y|1) = E(Y)). We can then define the measure of goodness of fit in the population as

R²_pop = 1 − ‖Y − E*(Y|1, X)‖² / ‖Y − E*(Y|1)‖²

This measure is

• scale free: it is not affected by the way we measure Y (multiplying Y by 10 does not change it);
• easy to interpret, since 0 ≤ R²_pop ≤ 1 with higher values implying better prediction accuracy.

The sample counterpart of this population object is

R² = 1 − ‖y − (ŷ|1, x)‖² / ‖y − (ŷ|1)‖² = 1 − [(1/n) Σᵢ (yi − b0 − b1xi)²] / [(1/n) Σᵢ (yi − ȳ)²]

(here the least squares fit with only a constant is (ŷ|1) = ȳ). So, as you can see from the formula, R² intuitively gives you an indication of how much of the variation in Y is explained by variation in X.²

2.1.1 Example

Mincer [2] uses data from the 1960 census on annual earnings in 1959. With y = log(earnings) and s = years of schooling completed, he reports the least squares fit:

ŷ = 7.58 + .07s,   R² = .067

²We can similarly define an R² for K regressors, which would be 1 − ‖y − (ŷ|1, x1, ..., xK)‖² / ‖y − (ŷ|1)‖².
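The sample formulas for (b0, b1) and R² above can be sketched in a few lines. This is a minimal sketch with made-up numbers, not Mincer's data, cross-checked against NumPy's own least-squares fit:

```python
import numpy as np

# Small illustrative sample (hypothetical numbers: education x, log-earnings y).
x = np.array([10., 12., 12., 14., 16., 16., 18.])
y = np.array([2.1, 2.4, 2.3, 2.9, 3.1, 3.4, 3.6])

# b1 and b0 from the two normal equations solved in the text.
b1 = (np.mean(y * x) - y.mean() * x.mean()) / (np.mean(x**2) - x.mean()**2)
b0 = y.mean() - b1 * x.mean()

# Sample R^2: one minus the ratio of residual to total mean squares.
resid = y - b0 - b1 * x
r2 = 1 - np.mean(resid**2) / np.mean((y - y.mean())**2)

# Cross-check against NumPy's degree-1 polynomial fit ([slope, intercept]).
check = np.polyfit(x, y, 1)
print(b0, b1, r2)
```

The two fits agree to floating-point precision, since both solve the same normal equations.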
Notice here that the predictive power of education in terms of explaining the variation in earnings is very low. This is not atypical of cross-sectional data, since observables do not usually explain much of the variation in wages (there are many other things that could help explain earnings other than education).

3 Omitted Variables

Consider an individual chosen at random from a population. Let Y denote her earnings, and let X1 and X2 denote her education and her score on a test administered when she was in the third grade. The random variables (Y, X1, X2) have a joint distribution. There is a (population) linear predictor for Y given X1 and X2 (and a constant):

E*(Y|1, X1, X2) = β0 + β1X1 + β2X2

This is usually called (by Goldberger, for example) the long regression. In addition, there is a (population) linear predictor for Y given just X1 (and a constant):

E*(Y|1, X1) = α0 + α1X1

This is called the short regression. It is useful to relate the long predictor coefficients to the short ones. In particular, this is relevant if in fact we do not have data on, say, X2 and are interested in the coefficient β1. Doing so requires the auxiliary linear predictor of X2 given X1 (and a constant):

E*(X2|1, X1) = γ0 + γ1X1

This is the auxiliary regression. Let U denote the prediction error using the long predictor: U ≡ Y − E*(Y|1, X1, X2), so that

Y = β0 + β1X1 + β2X2 + U   (3.1)

It is useful to pause here and get acquainted with U and how it is implicitly defined. Its properties (and its role) help in understanding (and using) the linear model.

Example 3.1. The E* operator is linear, i.e., E*[Y1 + Y2|1, X] = E*[Y1|1, X] + E*[Y2|1, X]. How do we show this? One way is to start with the formal definition of E* given in (1.4) above (this is the algebraic way).
Using that, we have E*[Y1 + Y2|1, X] = L1 + L2X, where L1 and L2 solve

E[Y1 + Y2 − L1 − L2X] = 0
E[X(Y1 + Y2 − L1 − L2X)] = 0   (3.2)

whereas E*[Y1|1, X] = L1¹ + L2¹X and E*[Y2|1, X] = L1² + L2²X, and these solve the analogous equations:

E[Y1 − L1¹ − L2¹X] = 0
E[X(Y1 − L1¹ − L2¹X)] = 0
E[Y2 − L1² − L2²X] = 0
E[X(Y2 − L1² − L2²X)] = 0   (3.3)

Summing the first and third equations in (3.3), we get

E[Y1 + Y2 − L1¹ − L1² − (L2¹ + L2²)X] = 0

while summing the second and fourth equations in (3.3),

E[X(Y1 + Y2 − L1¹ − L1² − (L2¹ + L2²)X)] = 0

Comparing these two equations to the ones in (3.2), it must be that L1¹ + L1² = L1 and L2¹ + L2² = L2, which implies that

E*[Y1 + Y2|1, X] = E*[Y1|1, X] + E*[Y2|1, X]

(You may be able to show this using geometric arguments: the projection of a sum of vectors Y1 + Y2 is the sum of their projections. Think of this geometrically!)

Because U is a prediction error, it is orthogonal to the variables used in the predictor:

U ⊥ 1,   U ⊥ X1,   U ⊥ X2

This implies that (you can also formally show this)

E*(U|1, X1) = 0

Using equation (3.1) and the linearity of E* (shown in the example above), we get

E*(Y|1, X1) = E*(β0 + β1X1 + β2X2 + U|1, X1)
            = β0 + β1X1 + β2E*(X2|1, X1) + E*(U|1, X1)
            = β0 + β1X1 + β2(γ0 + γ1X1) + 0
            = (β0 + β2γ0) + (β1 + β2γ1)X1

(Why is E*(X1|1, X1) = X1?) So this means that

α0 = β0 + β2γ0,   α1 = β1 + β2γ1

As you can see, the short regression leads to a coefficient α1 which is generally different from β1. In particular, α1 contains β2 multiplied by γ1, the coefficient from the auxiliary regression. The "bias term" β2γ1 is what is called the omitted variable bias, and the above is the omitted variable formula. This bias is zero if either 1) γ1 = 0 (i.e., Cov(X1, X2) = 0) or 2) β2 = 0. Note: both α1 and β1 are well defined, but they answer different questions.
α1 is part of the linear predictor of Y when you only use X1, while β1 is the coefficient on X1 when you also use X2 to form your prediction. Generally, this omitted variable formula is important and useful.

3.1 Least squares version

Of course, the above (population) result also holds for the sample least squares prediction problem, where one substitutes the population inner product with its sample counterpart, i.e., replaces E(XY) with (1/n) Σᵢ yi xi. Doing this, we get

a0 = b0 + b2c0,   a1 = b1 + b2c1

where

ŷi|1, x1i, x2i = b0 + b1x1i + b2x2i
ŷi|1, x1i = a0 + a1x1i
x̂2i|1, x1i = c0 + c1x1i

3.2 Example ctd'

Mincer (1974) has a discussion of omitted variable bias (pages 139-140). He views earnings as related to the individual's total human capital stock, including original or initial components. He views ability as such an initial component, and argues that including a measure of early ability would lead to a drop in the coefficient on schooling, because the coefficient on ability would be positive and there is a positive association between ability and schooling. For numerical magnitudes, he cites Griliches and Mason [1], who use a sample of post-World War II veterans of the U.S. military, contacted by the Bureau of the Census in a 1964 Current Population Survey (CPS). The military records contain individual scores on the Armed Forces Qualification Test (AFQT), which Griliches and Mason use in lieu of standard civilian mental ability (IQ) tests. A problem is that the AFQT is not a measure of initial ability, in that it is administered just prior to entering the military. This problem is addressed by splitting total years of schooling (ST) into schooling before the military (SB) and schooling after the military (SI). Then AFQT can be regarded as an early test relative to the schooling increment SI, which came after the test was taken.
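The sample identities a0 = b0 + b2c0 and a1 = b1 + b2c1 from Section 3.1 hold exactly in any data set, which can be checked numerically. The simulated data below are entirely hypothetical; only the identity matters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)                       # e.g. education
x2 = 0.8 * x1 + rng.normal(size=n)            # e.g. test score, correlated with x1
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

# Long regression of y on (1, x1, x2).
X_long = np.column_stack([np.ones(n), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short regression of y on (1, x1) and auxiliary regression of x2 on (1, x1).
X_short = np.column_stack([np.ones(n), x1])
a0, a1 = np.linalg.lstsq(X_short, y, rcond=None)[0]
c0, c1 = np.linalg.lstsq(X_short, x2, rcond=None)[0]

print(a1, b1 + b2 * c1)   # identical up to floating point
```

Note the identity does not rely on the data-generating process at all: it is an algebraic property of the three least-squares fits.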
Here are some results from Table 3:

ŷ = .0508 ST + ...
ŷ = .0433 ST + .00150 AFQT + ...
ŷ = .0502 SB + .0528 SI + ...
ŷ = .0418 SB + .0475 SI + .00154 AFQT + ...

This is for a subsample of veterans, age 21-34 in 1964, with y = log of usual weekly earnings. The AFQT score is measured as a percentile, from 0 to 100. These least-squares fits also include a constant, age, and length of time served in the military. Griliches and Mason focus on the coefficient on SI and note that it is reduced by including AFQT, but not by very much.

4 Functional Form

To recap, the linear predictor problem, or best linear predictor, is always a well defined problem. It is very flexible because we can include in the predictor any transformations of the original variables. For example, with Y = earnings and EXP a measure of years of experience, we can set X1 = EXP and X2 = EXP². Then the linear predictor is

E*(Y|1, X1, X2) = β0 + β1X1 + β2X2

and at EXP = c it gives

β0 + β1c + β2c²

This allows, for example, for the predictive effect of experience to be nonlinear. The prediction model also allows for effects of interactions of variables. For example, in addition to EXP suppose we have EDUC, a measure of years of education. We can set X1 = EDUC, X2 = EXP, X3 = EDUC·EXP, and then evaluate

E*(Y|1, X1, X2, X3) = β0 + β1X1 + β2X2 + β3X3

4.1 Conditional Expectation

Suppose that we start with a single original variable Z and develop linear predictors of Y based on Z that are increasingly flexible. To be specific, consider using a polynomial of order M:

E*(Y|1, Z, Z², ..., Z^M)

The expectation of the squared prediction error cannot increase as M increases, because the coefficients on the additional terms are allowed to be 0. So

E[(Y − E*(Y|1, Z, Z², ..., Z^M))²]

is nonincreasing in M and must approach a limit since it is nonnegative.
We shall assume that the linear predictor itself approaches a limit, and we shall identify this limit with the conditional expectation E(Y|Z):

E(Y|Z) = lim_{M→∞} E*(Y|1, Z, Z², ..., Z^M)

This limit is in a mean square sense:

lim_{M→∞} E[E(Y|Z) − E*(Y|1, Z, Z², ..., Z^M)]² = 0

What this means is that there is a precise way in which a linear predictor can approximate a conditional expectation. This is useful since a linear predictor has a sample counterpart and can be constructed. This turns out to be exact in the case of discrete regressors (see below). Let V be notation for the prediction error:

V ≡ Y − E(Y|Z)

Then V is orthogonal to any power of Z:

< V, Z^j > = E(V Z^j) = 0   (j = 0, 1, 2, ...)

Because general functions of Z can be approximated (in mean square) by polynomials in Z, we have

< V, g(Z) > = E[V g(Z)] = 0

for arbitrary functions g(·). In the population, we shall generally prefer to work with the conditional expectation (the conditional expectation is, after all, the solution to the best prediction problem). The linear predictor remains useful, however, because it has a direct sample counterpart: the sample linear predictor or least-squares fit. We shall use a (population) linear predictor to approximate the conditional expectation, and then use a least squares fit to estimate the linear predictor. The conditional expectation at a particular value of Z is denoted by

r(z) = E(Y|Z = z)

This function is called the (mean) regression function. The regression function evaluated at the random variable Z is the conditional expectation: r(Z) = E(Y|Z). Because the regression function may be complicated, we may want to approximate it by a simpler function that would be easier to estimate. For example, E*[r(Z)|1, Z] is a minimum mean-square error approximation that uses a linear function of Z.
This turns out to be the same as the linear predictor of Y given Z:

Claim 4.1. E*[r(Z)|1, Z] = E*(Y|1, Z) = β0 + β1Z.

Proof: Let V denote the prediction error:

V ≡ Y − E(Y|Z) = Y − r(Z)   (4.1)

Then V is orthogonal to any function of Z: E[V g(Z)] = 0, and so is orthogonal to 1 and to Z:

E(V) = E(V Z) = 0

This implies that the linear predictor of V given 1, Z is 0, and applying that to (4.1) gives

0 = E*(V|1, Z) = E*(Y|1, Z) − E*[r(Z)|1, Z] □

This result is more general in that it holds for more variables. For example, the conditional expectation of Y given two (or more) variables Z1 and Z2 can also be viewed as a limit of increasingly flexible linear predictors:

E(Y|Z1, Z2) = lim_{M→∞} E*(Y|1, Z1, Z2, Z1², Z1Z2, Z2², ..., Z1Z2^{M−1}, Z2^M)

The regression function is defined as

r(z1, z2) ≡ E(Y|Z1 = z1, Z2 = z2)

As above, we can use the linear predictor to approximate the regression function. For example, the proof of Claim 4.1 can be used to show that

E*[r(Z1, Z2)|1, Z1, Z2, Z1², Z1Z2, Z2²] = E*(Y|1, Z1, Z2, Z1², Z1Z2, Z2²)

We shall conclude this section by deriving the iterated expectations formula and then using it to obtain an omitted variables formula.

Claim 4.2 (Iterated Expectations). E[E(Y|Z1, Z2)|Z1] = E(Y|Z1), or equivalently, E[r(Z1, Z2)|Z1] = r(Z1).

Proof. Let V denote the prediction error:

V ≡ Y − E(Y|Z1, Z2) = Y − r(Z1, Z2)   (4.2)

Then V is orthogonal to any function of (Z1, Z2): E[V g(Z1, Z2)] = 0, and so, in particular, it is orthogonal to any function of Z1: E[V g(Z1)] = 0. This implies that E[V|Z1] = 0, and substituting that in (4.2) above gives

0 = E[V|Z1] = E[Y|Z1] − E[r(Z1, Z2)|Z1] □

The law of iterated expectations can also be shown in other ways, using the definition of a conditional expectation. It is a useful formula.
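As a sanity check, the iterated expectations formula holds exactly as a sample identity when we replace conditional expectations with subsample means. A minimal sketch on simulated discrete data (hypothetical values):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
z1 = rng.integers(0, 2, n)
z2 = rng.integers(0, 2, n)
y = 1.0 + z1 + 2.0 * z2 + z1 * z2 + rng.normal(size=n)

# E(Y | Z1 = 1) computed directly as a subsample mean ...
direct = y[z1 == 1].mean()

# ... and via iterated expectations: average the inner means E(Y | Z1 = 1, Z2 = k)
# over the empirical distribution of Z2 among observations with Z1 = 1.
inner = np.array([y[(z1 == 1) & (z2 == k)].mean() for k in (0, 1)])
weights = np.array([np.mean(z2[z1 == 1] == k) for k in (0, 1)])
iterated = inner @ weights

print(direct, iterated)   # equal: this is exact as a sample identity
```

The equality is pure arithmetic (cell sums divided and recombined), so it holds for any data set, not just this simulated one.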
For example, can you show directly that E[Y|X1] = E[E[Y|X1, X2]|X1]? Now, we can show an omitted variable bias result for regression functions.

Claim 4.3 (Omitted Variable Bias). If E[Y|Z1, Z2] = β0 + β1Z1 + β2Z2 and E[Z2|Z1] = γ0 + γ1Z1, then

E[Y|Z1] = (β0 + β2γ0) + (β1 + β2γ1)Z1

Proof:

E(Y|Z1) = E[E(Y|Z1, Z2)|Z1]
        = E(β0 + β1Z1 + β2Z2|Z1)
        = β0 + β1Z1 + β2E(Z2|Z1)
        = β0 + β1Z1 + β2(γ0 + γ1Z1)
        = (β0 + β2γ0) + (β1 + β2γ1)Z1

Note that here we assume that the regression function for Y on Z1 and Z2 is linear in Z1 and Z2, and that the regression function for Z2 on Z1 is linear in Z1. It then follows that the regression function for Y on Z1 is linear in Z1, and the coefficients are related to the coefficients in the long regression function in the same way as in the linear prediction problem. To recap, if we are ultimately interested in the conditional expectation (for many reasons...), then we can do well by estimating a best linear predictor, and we showed that this best linear predictor can theoretically approximate the unknown conditional expectation (in a precise sense).

5 Discrete Regressors (make things easier)

Suppose that Z1 and Z2 take on only a finite set of values:

Z1 ∈ {λ1, ..., λJ},   Z2 ∈ {δ1, ..., δK}

The objective of this section is to show that in this case it is possible to empirically estimate the regression function. Construct the following dummy variables:

Xjk = 1 if Z1 = λj and Z2 = δk, and 0 otherwise.

These Xjk are indicator functions that tell us which cell (Z1, Z2) belongs to:

Xjk = 1(Z1 = λj; Z2 = δk)   (j = 1, ..., J; k = 1, ..., K)

The notation for the indicator function 1[B] is such that this function equals 1 if B is true and 0 otherwise. The key result is provided next.

Claim 5.1. E(Y|Z1, Z2) = E*(Y|X11, ..., XJK)

Proof.
Any function g(Z1, Z2) can be written (when the Z's take finitely many values) as

g(Z1, Z2) = Σ_{j=1}^{J} Σ_{k=1}^{K} γjk Xjk

with γjk = g(λj, δk). So searching over functions g to find the best predictor is equivalent to searching over the coefficients γjk to find the best linear predictor. □

Note this requires that we use a complete set of dummy variables, with one for each value of (Z1, Z2). In this discrete regressor case, there is a concrete form for the notion that the conditional expectation is a limit of increasingly flexible linear predictors. Here the limit is achieved by using a complete set of dummy variables in the linear predictor. There is a sample analog to this result, using least-squares fits. The basic data consist of (yi, zi1, zi2) for each of the i = 1, ..., n members of the sample. Construct the dummy variables (this is done on the computer using the data matrix)

xijk = 1(zi1 = λj; zi2 = δk)

which indicate that the i-th observation in the data corresponds to cell (j, k). Construct the matrices

y = (y1, ..., yn)′,   xjk = (x1jk, ..., xnjk)′   (j = 1, ..., J; k = 1, ..., K)

The coefficients in the least-squares fit are obtained from

min ‖y − Σ_j Σ_k bjk xjk‖²

where the minimization is over {bjk} and the inner product is < y, xjk > = (1/n) Σᵢ yi xijk (recall: xjk is an n-vector of 1's and 0's, where a 1 marks the observations belonging to cell (j, k); so this is a linear regression with JK regressors).

Claim 5.2. blm = Σᵢ yi xilm / Σᵢ xilm   (l = 1, ..., J; m = 1, ..., K)

Proof. (Try doing this on your own first.) By definition of an orthogonal projection, the residuals have to be orthogonal to each of the dummy variables:

< y − Σ_{jk} bjk xjk, xlm > = 0

Also, by construction (a data point cannot belong to two cells!),

< xjk, xlm > = 0 unless the indices are the same.
So, we have

0 = < y − Σ_{jk} bjk xjk, xlm >
  = < y, xlm > − Σ_{jk} bjk < xjk, xlm >
  = < y, xlm > − blm < xlm, xlm >

So,

blm = < y, xlm > / < xlm, xlm > = Σᵢ yi xilm / Σᵢ xilm

This is because xlm² = xlm elementwise (these are vectors of 1's and 0's). □

Note that the numerator above, Σᵢ yi xilm, is the sum of the y's among the observations whose regressors belong to cell (l, m), and the denominator is the number of observations that fall in cell (l, m). So each coefficient blm is a subsample mean, which is a nice and helpful interpretation.

A major use of regression analysis is to measure the effect of one variable holding other variables constant. Consider, for example, the effect on Y of a change from Z1 = c to Z1 = d, holding Z2 constant at Z2 = e. Let θ denote this effect:

θ = E(Y|Z1 = d, Z2 = e) − E(Y|Z1 = c, Z2 = e) = r(d, e) − r(c, e)

This is a predictive effect. It measures how the prediction of Y changes as we change the value of one of the predictor variables, holding constant the value of the other predictor variable. This holding constant of the other regressor is often discussed as a way to control for observed differences. In the case of discrete regressors with a complete set of dummy variables, this predictive effect has a sample analog: the subsample mean of the y's in cell (d, e) minus the subsample mean in cell (c, e) (or, equivalently, the difference between two regression coefficients). The individuals in the first subsample have zi1 = d, and the individuals in the second subsample have zi1 = c. In both subsamples, all individuals have the same value for z2: zi2 = e. So the sense in which z2 is being held constant is clear: all individuals in the comparison of means have the same value for z2. In general there is a different effect θ for each value of Z2, and we may want a way to summarize these effects. This is discussed in the next section.
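Claim 5.2 (each dummy-variable least squares coefficient is a cell mean) can be checked numerically. The data below are simulated and hypothetical; the coincidence of coefficients and cell means is exact:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
z1 = rng.integers(0, 2, n)      # J = 2 values
z2 = rng.integers(0, 3, n)      # K = 3 values
y = 1.0 + z1 + 0.5 * z2 + rng.normal(size=n)

# Complete set of J*K dummies, one per cell (and no separate constant).
cells = [(j, k) for j in range(2) for k in range(3)]
X = np.column_stack([(z1 == j) & (z2 == k) for j, k in cells]).astype(float)
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Subsample mean of y within each cell.
cell_means = np.array([y[(z1 == j) & (z2 == k)].mean() for j, k in cells])
print(b, cell_means)    # coincide: each coefficient is a subsample mean
```

Because the dummy columns are mutually orthogonal, the normal equations decouple cell by cell, exactly as in the proof above.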
5.1 Average Partial Effect

Recall our definition of a regression function:

r(s, t) = E(Y|Z1 = s, Z2 = t)

Consider the predictive effect based on comparing Z1 = c with Z1 = d, with Z2 = t:

r(d, t) − r(c, t)

Instead of reporting a different effect for each value of Z2, we can evaluate the effect at the random variable Z2: r(d, Z2) − r(c, Z2). This gives a random variable, and we can take its expectation:

θ = E[r(d, Z2) − r(c, Z2)]

We shall refer to this as an average partial effect. It is "partial" in the sense of holding Z2 fixed, and the average is taken over all values of Z2 (so, we evaluate this predictive effect for people with the same Z2 and then average over all values of Z2). Once we have an estimate r̂ of the regression function, we can form an estimate of θ by averaging over the sample:

θ̂ = (1/n) Σᵢ [r̂(d, zi2) − r̂(c, zi2)]

It is generally not easy to get an estimate of r from the data. But it is, as we have seen, possible to approximate the conditional expectation by a linear predictor, using a polynomial in Z1, Z2:

E[Y|Z1, Z2] ≈ E*[Y|{Z1^j Z2^k}_{j+k≤M}] = Σ_{j,k: j+k≤M} βjk Z1^j Z2^k

We can then use least squares to obtain estimates bjk of βjk. Then we can use

r̂(c, zi2) = Σ_{j,k: j+k≤M} bjk c^j zi2^k   and   r̂(d, zi2) = Σ_{j,k: j+k≤M} bjk d^j zi2^k

in the formula for θ̂ above. In the case when the regressors take discrete values, so that the approximation to the conditional expectation becomes exact, we can get a direct sample analog to the (mean) regression function r using the above framework for discrete regressors.
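A sketch of the estimator θ̂ based on a quadratic polynomial fit. The data are simulated and hypothetical, with the design chosen so the effect of Z1 genuinely varies with Z2 (so averaging over Z2 matters):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y = 1 + z1 + 0.5 * z1 * z2 + rng.normal(size=n)   # effect of z1 varies with z2

# Quadratic (M = 2) polynomial basis in (z1, z2), fit by least squares.
def basis(a, b):
    return np.column_stack([np.ones_like(a), a, b, a**2, a * b, b**2])

bcoef = np.linalg.lstsq(basis(z1, z2), y, rcond=None)[0]

# theta_hat: move z1 from c = 0 to d = 1 and average over the sampled z2 values.
c, d = 0.0, 1.0
theta_hat = np.mean(basis(np.full(n, d), z2) @ bcoef
                    - basis(np.full(n, c), z2) @ bcoef)
print(theta_hat)
```

Under this design the true average partial effect is E[1 + 0.5·Z2] = 1, and θ̂ lands close to it for a sample this size.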
5.2 Example: Polynomial Regressors

Consider the following quadratic polynomial approximation to a regression function:

E[Y|Z1 = s, Z2 = t] ≈ β0 + β1s + β2s² + β3st + β4t + β5t²

Table 5.1 in Mincer (1974) [2] provides the least-squares fit:

ŷ = 4.87 + .255s − .0029s² − .0043ts + .148t − .0018t²

with y = log(earnings), s = years of schooling, and t = years of work experience. The data are from the 1-in-1000 sample of the 1960 census with 1959 annual earnings, and the sample size is n = 31093. The partial predictive effect of four years of college, holding work experience constant at t, is

E(Y|Z1 = 16, Z2 = t) − E(Y|Z1 = 12, Z2 = t) ≈ 4β1 + β2(16² − 12²) + 4β3t

The term "returns to college" is related to the use of log(earnings) and is discussed below.

5.3 LOGS

We stressed above that the linear predictor is flexible because we are free to construct transformations of the original variables. A transformation that is often used is the logarithm:

E*(Y|1, log Z) = β0 + β1 log Z

In order to compare Z = c and Z = d, we simply substitute:

β1 log d − β1 log c = β1 log(d/c)

A useful approximation here is (β1/100)[100 log(d/c)] ≈ (β1/100)[100(d/c − 1)]. With this approximation, we can interpret (β1/100) as the (predictive) effect of a one percent change in Z. Now consider a log transformation of Y:

E*(log Y|1, Z) = β0 + β1Z

We can certainly say that the predicted change in log Y is β1(d − c), and it is often useful to think of 100β1(d − c) as a predicted percentage change in Y. We should note, however, that even if the conditional expectation of log Y is linear, so that E(log Y|Z) = β0 + β1Z, we cannot relate this to the conditional expectation of Y without additional assumptions. To see this, define U so that

U ≡ log Y − E(log Y|Z),   E(U|Z) = 0

Since log Y = β0 + β1Z + U, we have

Y = exp(β0 + β1Z + U) = exp(β0 + β1Z) exp(U)

In general, E(U|Z) = 0 does not imply that E[exp(U)|Z] is a constant.
If we make the additional assumption that U and Z are independent, then E[exp(U)|Z] = E[exp(U)]. In that case,

E(Y|Z = d) / E(Y|Z = c) = exp[β1(d − c)] ≈ 1 + β1(d − c)

and

100 [E(Y|Z = d) / E(Y|Z = c) − 1] ≈ 100 β1(d − c)

6 Least Squares Matrix Notation, etc

Here, we provide useful formulas that exploit the (nice) geometric intuition in the linear model when we have many regressors. We will also derive a K-variate analog of the omitted variable formula. The derivations here are common (and useful) in regression analysis. Consider the linear predictor with a general list of K predictor variables (plus a constant):

E*(Y|1, X1, ..., XK) = β0 + β1X1 + ... + βK XK   (6.1)

As a reminder, the notation E* denotes the best linear predictor of Y, or the best linear approximation of the conditional mean. We are going to develop a formula for a single coefficient, which, for convenience, will be βK. Our result will use the linear predictor of XK given the other predictor variables:

E*(XK|1, X1, ..., X_{K−1}) = γ0 + γ1X1 + ··· + γ_{K−1}X_{K−1}

Define X̃K as the residual (prediction error) from this linear predictor:

X̃K = XK − E*(XK|1, X1, ..., X_{K−1})

This residual "takes out" of XK the linear component that is determined by the rest of the regressors (so, we can think of this residual as the "part" of XK that cannot be linearly predicted by the rest of the regressors). The result is that βK in (6.1) is the coefficient on X̃K in the linear predictor of Y given just X̃K:

Claim 6.1. E*(Y|X̃K) = βK X̃K with βK = E(Y X̃K)/E(X̃K²)

Proof: Substitute XK = γ0 + γ1X1 + ··· + γ_{K−1}X_{K−1} + X̃K into (6.1) to obtain

E*(Y|1, X1, ..., XK) = β0 + β1X1 + ... + β_{K−1}X_{K−1} + βK(γ0 + γ1X1 + ··· + γ_{K−1}X_{K−1} + X̃K)
                     = β̃0 + β̃1X1 + ... + β̃_{K−1}X_{K−1} + βK X̃K   (6.2)

with β̃j = βj + βK γj (j = 0, 1, ..., K − 1). The residual from the problem in (6.1) is orthogonal to 1, X1, ..., XK.
Since X̃K is a linear combination of 1, X1, ..., XK, it must be that X̃K is also orthogonal to the residual Y − E*[Y|1, X1, ..., XK]:

< Y − E*[Y|1, X1, ..., XK], X̃K > = < Y − β̃0 − β̃1X1 − ... − β̃_{K−1}X_{K−1} − βK X̃K, X̃K > = 0

Now, again, since X̃K is itself the residual from a prediction based on 1, X1, ..., X_{K−1}, it is orthogonal to those variables, and the above reduces to

< Y − βK X̃K, X̃K > = < Y, X̃K > − βK < X̃K, X̃K > = 0

So, βK X̃K is the orthogonal projection of Y on X̃K (it is like running a regression with only one regressor, as opposed to K), and we get that

βK = < Y, X̃K > / < X̃K, X̃K > = E(Y X̃K)/E(X̃K²) □

This population result has a sample counterpart. The only difference is that we replace the inner product above with the least squares inner product < y, xj > = (1/n) Σ_{i=1}^{n} yi xij, where

y = (y1, ..., yn)′,   xj = (x1j, ..., xnj)′

6.1 Omitted Variables

This section derives the general version, with K predictor variables, of the omitted variable formula we derived earlier. We shall use the notation (and part of the argument) from the residual regression result above. The short linear predictor is

E*[Y|1, X1, ..., X_{K−1}] = α0 + α1X1 + ... + α_{K−1}X_{K−1}

Claim 6.2. αj = βj + βK γj

Proof. Let U denote the following prediction error: U ≡ Y − E*(Y|1, X1, ..., XK). Now, use equation (6.2) to write

Y = β̃0 + β̃1X1 + ... + β̃_{K−1}X_{K−1} + βK X̃K + U

where β̃j = βj + βK γj. Note that for j = 0, 1, ..., K − 1,

< Y − β̃0 − β̃1X1 − ... − β̃_{K−1}X_{K−1}, Xj > = < βK X̃K + U, Xj > = 0

These orthogonality conditions characterize the short linear predictor, and so αj = β̃j. □

The sample counterpart of this result uses the short least-squares fit:

ŷi|1, xi1, ..., x_{i,K−1} = a0 + a1xi1 + ... + a_{K−1}x_{i,K−1}

Claim 6.3. aj = bj + bK cj   (j = 0, 1, ..., K − 1)

This is a numerical/computational identity that can be checked in the data.

7 Matrix Version of the Least Squares Model

The notation for the least squares model can be put in matrix form, which is helpful. Set up the following (K + 1) × 1 matrices:

X = (X0, X1, ..., XK)′,   β = (β0, β1, ..., βK)′

The linear predictor coefficients βj are determined by the following orthogonality conditions:

< Y − β0 − β1X1 − ... − βK XK, Xj > = 0   (j = 0, 1, ..., K)

So,

E[(Y − X′β)Xj] = E[Xj(Y − X′β)] = 0   (j = 0, 1, ..., K)

We can write these conditions together as

E[X(Y − X′β)] = 0

This gives the system E(XY) − E(XX′)β = 0, which has the solution

β = [E(XX′)]⁻¹ E(XY)   (7.1)

(provided that the (K + 1) × (K + 1) matrix E(XX′) is nonsingular). (You can also derive (7.1) as the first order condition of the minimization problem min_β ‖Y − X′β‖² = min_β E(Y − X′β)².) For the least squares fit, set up the n × 1 matrices

xj = (x1j, ..., xnj)′   (j = 0, 1, ..., K),   y = (y1, ..., yn)′

the (K + 1) × 1 matrix b = (b0, b1, ..., bK)′, and the n × (K + 1) matrix

x = (x0 x1 ... xK)

Though the notation is cumbersome, the operations mirror those for the population regression. In particular, the least-squares coefficients bj are determined by the following orthogonality conditions:

< y − b0x0 − b1x1 − ... − bK xK, xj > = 0   (j = 0, 1, ..., K)

So,

(y − xb)′xj = xj′(y − xb) = 0   (j = 0, 1, ..., K)

We can write all the orthogonality conditions (sometimes called the normal equations) as

x′(y − xb) = 0

This gives a system of (K + 1) equations in (K + 1) unknowns, x′y = x′xb, which has the solution

b = (x′x)⁻¹x′y

(provided that the (K + 1) × (K + 1) matrix x′x is nonsingular³).

References

[1] Zvi Griliches and William M. Mason. Education, income, and ability. The Journal of Political Economy, pages S74-S103, 1972.

[2] Jacob Mincer. Schooling, experience, and earnings.
human behavior & social institutions no. 2. 1974. 3Can this matrix be nonsingular if indeed we had K > n? This framework with K > n is the so-called big data setup. 22 Two Variable Model Least squares Goodness of fit Example Omitted Variables Least squares version Example ctd' Functional Form Conditional Expectation Discrete Regressors (make things easier) Average Partial Effect Example: Polynomial Regressors LOGS Least Squares Matrix Notation, etc Omitted Variables Matrix Version of the Least Squares Model
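The residual regression result has a sample version that is easy to check numerically. The sketch below (simulated data; the sample size, coefficients, and variable names are all made up, and NumPy is assumed) verifies that the coefficient on the last regressor from the long least-squares fit equals the one-regressor projection of y on the residualized regressor x̃K:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)              # correlated regressors
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_long = np.column_stack([ones, x1, x2])

# long fit: b solves the normal equations (x'x) b = x'y
b = np.linalg.solve(X_long.T @ X_long, X_long.T @ y)

# residual regression: regress x2 on (1, x1) and keep the residual x2_tilde
X_short = np.column_stack([ones, x1])
c = np.linalg.solve(X_short.T @ X_short, X_short.T @ x2)
x2_tilde = x2 - X_short @ c

# one-regressor projection of y on x2_tilde: <y, x2_tilde> / <x2_tilde, x2_tilde>
b2_resid = (y @ x2_tilde) / (x2_tilde @ x2_tilde)

assert np.isclose(b[2], b2_resid)
```

The 1/n factors in the least squares inner product cancel in the ratio, so they are omitted.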
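Claim 6.3 (aj = bj + bKcj) is likewise a finite-sample identity, so it holds to machine precision in any data set. A minimal check with simulated data (names and numbers are mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = -0.7 * x1 + rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_long = np.column_stack([ones, x1, x2])
X_short = np.column_stack([ones, x1])

def ols(X, y):
    # least-squares coefficients, solving the normal equations (X'X) b = X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

b = ols(X_long, y)        # long fit: y on (1, x1, x2)
a = ols(X_short, y)       # short fit: y on (1, x1)
c = ols(X_short, x2)      # auxiliary fit: x2 on (1, x1), sample version of the gammas

# omitted variable identity: a_j = b_j + b_2 * c_j for j = 0, 1
assert np.allclose(a, b[:2] + b[2] * c)
```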
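The matrix formula b = (x′x)⁻¹x′y and the orthogonality conditions x′(y − xb) = 0 are two ways of saying the same thing. A small sketch (random design matrix; the dimensions are arbitrary choices of mine) solves the normal equations and confirms that the residual is orthogonal to every column of x:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 3
# n x (K+1) design matrix; the first column is the constant x0 = 1
x = np.column_stack([np.ones(n), rng.normal(size=(n, K))])
y = rng.normal(size=n)

# solve the normal equations x'x b = x'y (preferred over forming the inverse)
b = np.linalg.solve(x.T @ x, x.T @ y)

# the residual y - xb is orthogonal to each column of x
residual = y - x @ b
assert np.allclose(x.T @ residual, 0.0)
```

Solving the linear system directly is the standard numerical practice; explicitly inverting x′x gives the same b but is less stable.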
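The question in the footnote has a definite answer: x′x cannot be nonsingular when K > n, because rank(x′x) = rank(x) ≤ min(n, K + 1) = n < K + 1. A quick numerical illustration (arbitrary dimensions of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 5, 10                      # more regressors than observations
x = rng.normal(size=(n, K + 1))   # n x (K+1) design matrix

xtx = x.T @ x                     # (K+1) x (K+1), but rank at most n
assert np.linalg.matrix_rank(xtx) <= n
assert np.linalg.matrix_rank(xtx) < K + 1   # singular: b is not unique
```

This is why the K > n ("big data") setup requires something beyond plain least squares, such as regularization.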