Harvard Economics Ec 1126 Tamer - September 25, 2015
Note 4 - Inference in CRM

Note 4: Least Squares and the Normal Linear Model

Background: This Note requires some linear and matrix algebra, so some of you may want to keep a matrix algebra reference handy.

1 Statistical properties of Least Squares

In frequentist inference, we use a confidence set to express our uncertainty about the parameter of interest $\beta$, which is taken to be a fixed constant (unknown, and hence to be estimated). The issue is that, whether we are interested in the best linear predictor or in the conditional expectation function, both are population quantities: they depend on the distribution $F$ that generated the data, and so are typically not available. What is available is a random sample from this population (from $F$). So the problem of statistical inference is how to learn about these population quantities from data.

In a random sample of size $N$, we have $N$ independent draws (with replacement) from the same population. The $i$th draw results in the random variable or random vector $(Y_i, X_i)$, whose joint distribution is the population distribution $F$. We can summarize random sampling by saying

$$(Y_i, X_i) \overset{\text{i.i.d.}}{\sim} F \qquad (i = 1, \dots, N).$$

Here i.i.d. stands for independent and identically distributed. For notation, let

$$\underbrace{Y}_{N \times 1} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_N \end{pmatrix}, \qquad \underbrace{X}_{N \times K} = \begin{pmatrix} X_1' \\ \vdots \\ X_N' \end{pmatrix}.$$

We have derived the least squares estimator as

$$\hat\beta = (X'X)^{-1}X'Y,$$

where

$$E^*[Y_i \mid X_i] = X_i'\beta = \beta_0 + X_{1i}\beta_1 + \dots + X_{Ki}\beta_K = \sum_{k=0}^{K} X_{ki}\beta_k$$

(and assume here that the first column of $X$, $X_0 = 1$, is a column of ones).

The question that we want to answer: how "good" is $\hat\beta$ as an estimator for $\beta$, which was derived earlier (see footnote 1) as $\beta = [E(X_iX_i')]^{-1}E(X_iY_i)$? Here, $\hat\beta$ is a random variable that depends on the sample size $N$.
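To make the estimator concrete, the following is a minimal Python sketch (not part of the original note) that evaluates $\hat\beta = (X'X)^{-1}X'Y$ in the special case of a constant and a single regressor, where $X'X$ is $2 \times 2$ and has a closed-form inverse. The function names, the simulated design, and the coefficient values 1 and 2 are illustrative assumptions. Running it on two independent samples shows $\hat\beta$ changing from sample to sample, which is the sense in which it is a random variable.

```python
import random


def ols_two_regressors(x, y):
    """Return (b0, b1) = (X'X)^{-1} X'Y for X = [1, x] (hypothetical helper)."""
    n = len(y)
    # Entries of X'X = [[n, sum x], [sum x, sum x^2]] and of X'Y.
    sx = sum(x)
    sxx = sum(v * v for v in x)
    sy = sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx  # nonzero as long as x is not constant
    # Closed-form 2 x 2 inverse applied to X'Y.
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1


def simulate_sample(n, rng):
    """Draw n i.i.d. pairs with E[Y | X] = 1 + 2 X (assumed values)."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y


rng = random.Random(0)
for draw in range(2):
    x, y = simulate_sample(50, rng)
    print("sample", draw, "beta_hat =", ols_two_regressors(x, y))
```

With noise-free data the function recovers the coefficients exactly, which is a quick sanity check on the algebra.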
For example, bias is one measure that we can use to judge how "good" $\hat\beta$ is. We say that $\hat\beta$ is an unbiased estimator of $\beta$ if

$$E[\hat\beta] = \beta.$$

Without any further assumptions, it seems difficult to proceed, since $\hat\beta$ is a nonlinear function of the data ($E[(X'X)^{-1}] \neq [E(X'X)]^{-1}$), i.e., in general $E[\hat\beta] \neq \beta$. We can, using iterated expectation, get

$$E[\hat\beta] = E\left[(X'X)^{-1}X'E[Y \mid X]\right]$$

or, more neatly,

$$E[\hat\beta \mid X] = (X'X)^{-1}X'E[Y \mid X] = (X'X)^{-1}X'\begin{pmatrix} r(X_1) \\ \vdots \\ r(X_N) \end{pmatrix},$$

where $r(X_i) = E[Y_i \mid X_i]$ is the conditional expectation function.

So far we have not made any assumptions other than random sampling. Now suppose that the linear predictor, which is intended to approximate the conditional expectation, actually equals the conditional expectation:

$$r(X_i) = X_i'\beta.$$

Then we have

$$E[\hat\beta \mid X] = (X'X)^{-1}X'E[Y \mid X] = (X'X)^{-1}X'\begin{pmatrix} X_1' \\ \vdots \\ X_N' \end{pmatrix}\beta = (X'X)^{-1}X'X\beta = \beta.$$

Applying iterated expectation yields

$$E[\hat\beta] = E\left[E[\hat\beta \mid X]\right] = E[\beta] = \beta.$$

This means that if $r(X_i) = X_i'\beta$, then $\hat\beta$ is an unbiased estimator for $\beta$. Otherwise, the conditional expectation of $\hat\beta$ is a linear combination of the values $r(X_1), \dots, r(X_N)$ of the conditional expectation function.

This leads us to an important model that makes even stronger assumptions on the relationship between $Y_i$ and $X_i$.

Footnote 1: Reminder: $\beta$ is a population object, and so is $E(X_iX_i')$, where $X_i$ is a $K \times 1$ vector that includes all the predictors, including the constant 1.

2 Classical Regression Model - Or the Normal Linear Model

We shall first see how to obtain confidence sets using the normal linear model, i.e., the classical least squares model. The normal linear model is most definitely a parametric model. A parametric model is a set of assumptions requiring that the density $f$ of $(Y_i, X_i)$ belongs to a set of densities $\mathcal{F}_\theta = \{f_\theta : \theta \in \mathbb{R}^k\}$, where $f_\theta$ is known up to $\theta$, a finite dimensional parameter. For example, in a location model, $(Y_i, X_i) \sim f_\mu$ with $\mu \in \mathbb{R}^2$ and

$$f_{\mu = (\mu_1, \mu_2)}(z) = \phi(z_1 - \mu_1;\ z_2 - \mu_2),$$

where $\phi$ is the standard bivariate normal density.
So, the problem of learning an unknown density in a parametric model is one of learning $\mu = (\mu_1, \mu_2) \in \mathbb{R}^2$.

The normal linear model allows us to characterize the statistical properties of the least squares estimator. Later, we will relax the assumptions of the normal linear model and instead use approximations, via limit theorems such as the law of large numbers and the central limit theorem, to develop confidence sets for nonparametric models. Nonparametric models are ones where the distribution of the data belongs to a set of distributions that is rich, richer than the parametric class. These confidence sets will be derived using limit arguments, as the sample size $N$ tends to infinity, so there will always be the issue of how well the limit theorems approximate finite sample properties. In short, parametric models yield exact inference at the cost of parametric assumptions (what happens if the true density of the data does not belong to the parametric class?), while nonparametric models require one to appeal to large sample approximations.

First, we start with the classical model with $N$ fixed.

Definition 2.1. Classical Linear Model Assumptions
Let the following hold:
1. (random sampling) $(Y_i, X_i) \overset{\text{i.i.d.}}{\sim} f$ $(i = 1, \dots, N)$.
2. (normality) $Y_i \mid X_i \sim N(X_i'\beta, \sigma^2)$ for $i = 1, \dots, N$.

This means that $E[Y_i \mid X_i] = X_i'\beta$ and $Var(Y_i \mid X_i) = \sigma^2$. The normal linear model can also be written as

$$Y_i = X_i'\beta + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma^2)$ and is independent of $X_i$.

2.1 Expectation and Variance of the Least Squares Estimator

Given the above, the claim below follows.

Claim 2.2. Under the normal model assumptions, the least squares estimator is an unbiased estimator of $\beta$.

Now, to characterize "noise," it is natural to compute the variance of the estimator. Generally, the smaller the variance, the less the statistical noise (i.e. the higher the precision).
Here, noise characterizes sampling uncertainty, and hence if we had access to the population (presumably if we observed everyone), this kind of uncertainty would go to zero.

Claim 2.3. Suppose that the random vector $Y$ is $N \times 1$, and the nonrandom matrices $d_1$ and $d_2$ are $M \times N$ and $M \times 1$. Then

$$Cov(d_1Y + d_2) = d_1\,Cov(Y)\,d_1'.$$

Proof (sketch): First, note that with $\tilde Y = Y - EY$, $Cov(Y) = E\tilde Y\tilde Y'$. The centered version of $d_1Y + d_2$ is $d_1(Y - EY) = d_1\tilde Y$, and so

$$Cov(d_1Y + d_2) = E\left\{d_1(Y - EY)[d_1(Y - EY)]'\right\} = d_1E(\tilde Y\tilde Y')d_1' = d_1\,Cov(Y)\,d_1'. \qquad \square$$

So, now we are able to obtain the variance of the least squares estimator of $\beta$. We have

$$Cov(\hat\beta \mid X) = (X'X)^{-1}X'\,Cov(Y \mid X)\,X(X'X)^{-1}.$$

The covariance of $Y$ conditional on $X$ is an $N \times N$ matrix as follows:

$$Cov(Y \mid X) = \begin{pmatrix} Var(Y_1 \mid X) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & Var(Y_N \mid X) \end{pmatrix}.$$

The off-diagonal elements are zero, $Cov(Y_i, Y_j \mid X) = 0$, because observations $i$ and $j$ are independent for $i \neq j$ due to random sampling. Also, given the normal model assumptions, we have $Var(Y_i \mid X_i) = \sigma^2$ and so

$$Cov(Y \mid X) = \sigma^2I_N,$$

where $I_N$ is the $N \times N$ identity matrix. This leads to

$$Cov(\hat\beta \mid X) = \sigma^2(X'X)^{-1}.$$

So, $Cov(\hat\beta \mid X)$ is $K \times K$:

$$Cov(\hat\beta \mid X) = \begin{pmatrix} Cov(\hat\beta_1, \hat\beta_1 \mid X) & \cdots & Cov(\hat\beta_1, \hat\beta_K \mid X) \\ \vdots & \ddots & \vdots \\ Cov(\hat\beta_K, \hat\beta_1 \mid X) & \cdots & Cov(\hat\beta_K, \hat\beta_K \mid X) \end{pmatrix},$$

where $Cov(\hat\beta_1, \hat\beta_1 \mid X) = Var(\hat\beta_1 \mid X)$. Let $[(X'X)^{-1}]_{jk}$ denote the $(j, k)$ element of $(X'X)^{-1}$. Then we have

$$Cov(\hat\beta_j, \hat\beta_k \mid X) = \sigma^2[(X'X)^{-1}]_{jk}.$$

2.2 Gauss Markov Theorem

Suppose we have any other unbiased linear estimator $\tilde\beta$. Here, maintain the assumption that the $X$'s are fixed constants (for easier notation; see footnote 2). Then it must be of the form $\tilde\beta = C'Y$ with $E(\tilde\beta) = \beta$, where we write $C' = (X'X)^{-1}X' + D'$. This decomposition is always available: simply define $D' = C' - (X'X)^{-1}X'$, the difference between the weights of $\tilde\beta$ and those of least squares. Note that

$$\tilde\beta = C'(X\beta + \varepsilon) = C'X\beta + C'\varepsilon \ \Rightarrow\ E(\tilde\beta) = C'X\beta.$$

So $\tilde\beta$ is unbiased if $C'X = I$, which implies $D'X = 0$. Then

$$\begin{aligned}
var(\tilde\beta) = \sigma^2C'C &= \sigma^2[(X'X)^{-1}X' + D'][(X'X)^{-1}X' + D']' \\
&= \sigma^2[(X'X)^{-1}X'X(X'X)^{-1} + (X'X)^{-1}\underbrace{X'D}_{=0} + \underbrace{D'X}_{=0}(X'X)^{-1} + D'D] \\
&= \sigma^2(X'X)^{-1} + \sigma^2D'D = var(\hat\beta) + \sigma^2D'D.
\end{aligned}$$
Since $D'D$ is positive semidefinite ($D'D \geq 0$), the proof is complete. This essentially is a proof of the following.

Footnote 2: Effectively, the calculation would need to be made conditional on $X$ and so is cumbersome.

Theorem 2.4. (Gauss-Markov theorem)
In the classical regression model, the least squares estimator is best among all linear unbiased estimators.

Proof. This was done above the statement of the theorem.

2.3 Sample version (or estimator) of $\sigma^2$

To be able to use a feasible version of the variance matrix of $\hat\beta$, we require a sample estimator of $\sigma^2$. Note that

$$e = Y - \hat Y = X\beta + \varepsilon - X\hat\beta = X\beta + \varepsilon - X(X'X)^{-1}X'(X\beta + \varepsilon) = [I - X(X'X)^{-1}X']\varepsilon = M\varepsilon,$$

where $M = I - X(X'X)^{-1}X'$. Reminder: $M$ is symmetric and idempotent (footnote 3), thus

$$e'e = \varepsilon'M'M\varepsilon = \varepsilon'M\varepsilon.$$

Now since $\varepsilon'M\varepsilon$ is a scalar, we have $\varepsilon'M\varepsilon = Tr(\varepsilon'M\varepsilon)$ (footnote 4). Suppressing the conditioning on $X$,

$$E[e'e] = E[\varepsilon'M\varepsilon] = E[Tr(\varepsilon'M\varepsilon)] = E[Tr(M\varepsilon\varepsilon')] = Tr(E[M\varepsilon\varepsilon']) = Tr(M\sigma^2I) = \sigma^2\,Tr\,M.$$

And

$$Tr\,M = Tr[I_N - X(X'X)^{-1}X'] = Tr\,I_N - Tr(X(X'X)^{-1}X') = N - Tr((X'X)^{-1}X'X) = N - Tr\,I_K = N - K.$$

Thus $E[e'e \mid X] = \sigma^2(N - K)$. Because this conditional expectation does not depend on $X$, iterated expectations give the same value unconditionally: $E[e'e] = \sigma^2(N - K)$. Therefore we can use the following estimator for $\sigma^2$:

$$\hat\sigma^2 = \frac{e'e}{N - K} \equiv s^2.$$

Hence, overall,

$$\widehat{var}[\hat\beta] = (X'X)^{-1}s^2 = \frac{1}{N}\left(\frac{1}{N}\sum x_ix_i'\right)^{-1}s^2 = \frac{1}{N}\left(\frac{1}{N}\sum x_ix_i'\right)^{-1}\frac{1}{N - K}\sum(y_i - x_i'\hat\beta)^2.$$

Footnote 3: The only nonsingular idempotent $n \times n$ matrix is $I_n$: suppose $A = AA$ and $A$ is nonsingular. Then $A^{-1}AA = A^{-1}A = I$, so $A = I$.

Footnote 4: Properties of trace: 1) $Tr(A + B) = Tr\,A + Tr\,B$, 2) $Tr(AB) = Tr(B'A')$, 3) $Tr(ABC) = Tr(BCA) = Tr(CAB)$, which in general $\neq Tr(BAC)$.

3 Reminder of multivariate normal distribution

Let $y = [y_1, \dots, y_N]'$ be a random vector with a multivariate normal distribution with mean $\mu = [\mu_1, \dots, \mu_N]'$ (an $N \times 1$ vector) and $N \times N$ (positive definite, real) variance-covariance matrix $\Sigma$:

$$y \sim N(\mu, \Sigma).$$

Then $y$ has probability density function

$$f(y) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\left(-\frac{w}{2}\right),$$

where $w = (y - \mu)'\Sigma^{-1}(y - \mu)$ and $|\Sigma| = \det(\Sigma)$.

Conditional distribution: Let

$$y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}, \qquad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$

where $y_1$ and $y_2$ are now subvectors of $y$ and $\Sigma_{21} = Cov(y_2, y_1) = \Sigma_{12}'$. Then the unconditional distribution of $y_1$ is $y_1 \sim N(\mu_1, \Sigma_{11})$. The distribution of $y_1$ conditional on $y_2$ is multivariate normal, $y_1 \mid y_2 \sim N(\mu_1^*, \Sigma_1^*)$, where

$$\mu_1^* = E(y_1 \mid y_2) = \mu_1 + \left[\Sigma_{22}^{-1}\Sigma_{21}\right]'(y_2 - \mu_2) = \underbrace{\mu_1 - \left[\Sigma_{22}^{-1}\Sigma_{21}\right]'\mu_2}_{\alpha\, =\, \mu_1 - \beta'\mu_2} + \underbrace{\left[\Sigma_{22}^{-1}\Sigma_{21}\right]'}_{\beta'}\,y_2 = \alpha + \beta'y_2.$$

We can derive the variance by the law of iterated variance (footnote 5):

$$\Sigma_{11} = V[E(y_1 \mid y_2)] + E[V(y_1 \mid y_2)] = V[\alpha + \beta'y_2] + E\,\Sigma_1^* = \beta'\Sigma_{22}\beta + \Sigma_1^*$$

(note here that $V(y_1 \mid y_2)$ does not depend on $y_2$ - can you show this?). Hence

$$\Sigma_1^* = \Sigma_{11} - \beta'\Sigma_{22}\beta.$$

Then $y_1 \mid y_2 \sim N(\alpha + \beta'y_2, \Sigma_1^*)$. Note here that if we treat this as running a regression of $y$ (here $y_1$) on $x$ (here $y_2$), we get (from the previous Note) that $\beta$ is the covariance between $y$ and $x$ divided by the variance of $x$. This should remind you of our $\beta$ here, which has this form: $\beta = \Sigma_{22}^{-1}\Sigma_{21}$.

Footnote 5: Law of iterated variance: $V(y_1) = V[E(y_1 \mid y_2)] + E[V(y_1 \mid y_2)]$.

Functions of standard normal variables:

1. Let $z \sim N(0, I_n)$, then $z'z = \sum_{i=1}^{n} z_i^2 \sim \chi^2(n)$.
2. $E[\chi^2(n)] = n$ and $var[\chi^2(n)] = 2n$.
3. Let $v = \dfrac{w_1/m}{w_2/n}$, where $w_1 \sim \chi^2(m)$, $w_2 \sim \chi^2(n)$ and $w_1, w_2$ are independent, then $v \sim F(m, n)$.
4. Let $u = \dfrac{z}{\sqrt{w/n}}$, where $z \sim N(0, 1)$ and $w \sim \chi^2(n)$ and $z, w$ are independent, then $u \sim t(n)$ and $u^2 \sim F(1, n)$.
5. Let $y \sim N(0, \Sigma)$ with $y$ an $n \times 1$ vector, then $y'\Sigma^{-1}y \sim \chi^2(n)$.

Proof (of 5.): $\Sigma$ can be factorized into $\Sigma = PP'$, which means that $\Sigma^{-1} = (P^{-1})'P^{-1}$. Then define $x = P^{-1}y$. The elements of $x$ are independent $N(0, 1)$ random variables, since $x \sim N(0, I_n)$, and $y'\Sigma^{-1}y = x'x$. Then we can use 1. above to get the result.
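The factorization argument in the proof of 5. can be checked numerically. The sketch below is an illustration rather than part of the original note: it uses a $2 \times 2$ covariance matrix with assumed entries (2.0, 0.6, 1.0), takes $P$ to be the lower-triangular Cholesky factor (one valid choice of $P$ with $\Sigma = PP'$), verifies that $y'\Sigma^{-1}y = x'x$ for $x = P^{-1}y$, and then simulates the mean of the quadratic form, which by fact 2. should be close to $n = 2$. All function names are hypothetical.

```python
import math
import random


def cholesky_2x2(s11, s12, s22):
    """Lower-triangular P = [[p11, 0], [p21, p22]] with Sigma = P P'."""
    p11 = math.sqrt(s11)
    p21 = s12 / p11
    p22 = math.sqrt(s22 - p21 * p21)
    return p11, p21, p22


def whiten(y, chol):
    """Solve P x = y by forward substitution, so x = P^{-1} y."""
    p11, p21, p22 = chol
    x1 = y[0] / p11
    x2 = (y[1] - p21 * x1) / p22
    return x1, x2


def quadratic_form(y, s11, s12, s22):
    """y' Sigma^{-1} y via the closed-form inverse of a 2 x 2 matrix."""
    det = s11 * s22 - s12 * s12
    return (s22 * y[0] ** 2 - 2.0 * s12 * y[0] * y[1] + s11 * y[1] ** 2) / det


def mean_quadratic_form(n_draws, seed, s11=2.0, s12=0.6, s22=1.0):
    """Average of y' Sigma^{-1} y over simulated draws y = P z, z ~ N(0, I_2)."""
    rng = random.Random(seed)
    p11, p21, p22 = cholesky_2x2(s11, s12, s22)
    total = 0.0
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y = (p11 * z1, p21 * z1 + p22 * z2)  # y = P z has covariance Sigma
        total += quadratic_form(y, s11, s12, s22)
    return total / n_draws


y = (1.0, -0.5)
x = whiten(y, cholesky_2x2(2.0, 0.6, 1.0))
print("y' Sigma^{-1} y =", quadratic_form(y, 2.0, 0.6, 1.0))
print("x'x             =", x[0] ** 2 + x[1] ** 2)
print("simulated mean  =", mean_quadratic_form(20000, seed=1))
```

The first two printed numbers agree up to rounding, which is the identity used in the proof; the third is a Monte Carlo estimate and will differ slightly from 2.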
4 Confidence Intervals

Normality Assumption Again: The $N \times 1$ vector $y = X\beta + \varepsilon$ has a multivariate normal distribution, $y \sim N(X\beta, \sigma^2I_N)$. Therefore, conditional on $X$,

$$\hat\beta = (X'X)^{-1}X'y \sim N(\beta, \sigma^2(X'X)^{-1}).$$

Suppose we want to summarize the statistical uncertainty in the sample regarding the value of a parameter (taken to be an unknown constant). One common way to do this is to construct a confidence interval. We will do this first for a univariate parameter. Then, we will generalize.

First, we derive distributions for statistics that are used to construct a confidence interval.

Proposition 4.1. The following statistic follows a t-distribution with $N - K$ degrees of freedom:

$$t_k = \frac{\hat\beta_k - \beta_k}{se(\hat\beta_k)} = \frac{\hat\beta_k - \beta_k}{\sqrt{s^2[(X'X)^{-1}]_{kk}}} \sim t(N - K).$$

Proof. Denote $z_k = \dfrac{\hat\beta_k - \beta_k}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}$ and $q = \dfrac{e'e}{\sigma^2}$, and note that $z_k \sim N(0, 1)$. Hence

$$t_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\sigma^2[(X'X)^{-1}]_{kk}}}\sqrt{\frac{\sigma^2}{s^2}} = \frac{z_k}{\sqrt{s^2/\sigma^2}} = \frac{z_k}{\sqrt{\dfrac{e'e/(N - K)}{\sigma^2}}} = \frac{z_k}{\sqrt{q/(N - K)}}.$$

It remains to be shown that 1) $q \sim \chi^2(N - K)$, and 2) $q$ and $z_k$ are independent.

1. Remember $e = M\varepsilon$, where $M$ has rank $N - K$. Then

$$q = \frac{e'e}{\sigma^2} = \frac{\varepsilon'}{\sigma}M\frac{\varepsilon}{\sigma} \sim N(0, I_N)'\,M\,N(0, I_N) = \chi^2(N - K).$$

This uses a result (footnote 6) on the distribution of the quadratic form $x'Ax$, where $x \sim N(0, I)$ and $A$ is idempotent of rank $r \leq N$.

2. $q$ and $z_k$ are independent, since $e = M\varepsilon$ and $\hat\beta_k - \beta_k$ are independent:

$$E[(\hat\beta - \beta)e'] = E[(X'X)^{-1}X'\varepsilon\varepsilon'M] = \sigma^2E[(X'X)^{-1}X'M] = 0,$$

because $X'M = X' - X'X(X'X)^{-1}X' = 0$. Here, we have shown that $Cov(\hat\beta - \beta, e) = 0$. But since $\hat\beta$ and $e$ are jointly normally distributed, showing that they are uncorrelated is equivalent to showing that they are independent.

The t-distribution is available in tables and in computer programs. Suppose that $N - K = 30$.
We have

$$Prob(t(30) > 2.04) = .025$$

and since the t-distribution is symmetric about zero,

$$Prob(|t(30)| \leq 2.04) = .95.$$

Then, by the Proposition above,

$$Prob\left(-2.04 \leq \frac{\hat\beta_k - \beta_k}{se(\hat\beta_k)} \leq 2.04\right) = .95$$

and so

$$Prob\left(\hat\beta_k - 2.04\,se(\hat\beta_k) \leq \beta_k \leq \hat\beta_k + 2.04\,se(\hat\beta_k)\right) = .95.$$

This is an unconditional probability statement that can also be written as

$$Prob\left(\beta_k \in [\hat\beta_k \pm 2.04 \cdot se(\hat\beta_k)]\right) = .95.$$

The interval $[\hat\beta_k \pm 2.04 \cdot se(\hat\beta_k)]$ is usually called the 95% confidence interval. It provides a summary of the statistical uncertainty in our estimate of $\beta_k$.

Footnote 6: Optional proof to show this: a symmetric idempotent matrix $A$ can be written as $Q'AQ = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}$, where $Q$ is an orthogonal matrix (inverse equals transpose) of eigenvectors. Let $y = Q'x$, so $x = Qy$. Then $Ey = 0$ and $var(y) = Eyy' = E(Q'xx'Q) = Q'IQ = I$, so the $y$'s are independent standard normals and $x'Ax = y'Q'AQy = y_1^2 + \dots + y_r^2$, which gives the result.

As $N - K$ increases from 30 to infinity, the 97.5 percentile of the t-distribution decreases from 2.04 to a limiting value of 1.96 (which is the 97.5 percentile of a standard normal distribution).

4.1 Confidence Ellipse

Again, we have

$$\hat\beta = (X'X)^{-1}X'y \sim N(\beta, \sigma^2(X'X)^{-1}).$$

We shall obtain a confidence region for two or more linear combinations of the coefficients. More generally, this leads to confidence regions for linear functions of $\beta$. Let $R$ be $h \times K$, so that $R\beta$ is $h \times 1$ and

$$R\hat\beta \sim N(R\beta, \sigma^2R(X'X)^{-1}R').$$

Claim 4.2. We have

$$F = (R\hat\beta - R\beta)'[\widehat{Var}(R\hat\beta)]^{-1}(R\hat\beta - R\beta)/h \sim F(h, N - K).$$

To prove the above, we start with the F distribution. The statistic can be written as

$$F = \frac{(R\hat\beta - R\beta)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - R\beta)/h}{s^2}.$$

We know that $s^2 = \dfrac{e'e}{N - K}$ and that the distribution of $R\hat\beta$ is $R\hat\beta \sim N(R\beta, \sigma^2R(X'X)^{-1}R')$. Hence

$$F = \frac{w/h}{q/(N - K)}.$$

To find the distribution of the test statistic, it remains to be shown that:

1. $w = (R\hat\beta - R\beta)'[\sigma^2R(X'X)^{-1}R']^{-1}(R\hat\beta - R\beta) \sim \chi^2(h)$,
2. $q \sim \chi^2(N - K)$ (done earlier),
3. $w, q$ are independent (here you can show that ...)
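As a numerical sketch of the statistic in Claim 4.2 (an illustration, not part of the original note), take the simplest case of a constant and one regressor, so $K = 2$, and $R = I_2$, so $h = 2$. Then $[R(X'X)^{-1}R']^{-1} = X'X$ and $F = (\hat\beta - \beta)'X'X(\hat\beta - \beta)/(2s^2)$. The helper names, the simulated data with coefficients 1 and 2, and the choice $N = 32$ (so that $N - K = 30$) are assumptions; 3.32 is the tabulated value with $Prob(F(2, 30) > 3.32) = 0.05$.

```python
import random


def ols_fit(x, y):
    """Least squares for X = [1, x]: returns beta_hat, s^2 and the X'X entries."""
    n = len(y)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)  # e'e / (N - K) with K = 2
    return (b0, b1), s2, (n, sx, sxx)


def f_statistic(x, y, beta):
    """F = (beta_hat - beta)' X'X (beta_hat - beta) / (h s^2) with R = I_2, h = 2."""
    (b0, b1), s2, (n, sx, sxx) = ols_fit(x, y)
    d0, d1 = b0 - beta[0], b1 - beta[1]
    # Quadratic form d' X'X d with X'X = [[n, sx], [sx, sxx]].
    quad = n * d0 * d0 + 2.0 * sx * d0 * d1 + sxx * d1 * d1
    return quad / (2.0 * s2)


rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(32)]      # N = 32, so N - K = 30
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
f_true = f_statistic(x, y, (1.0, 2.0))
print("F at the assumed true beta:", f_true)
print("inside the 95% region:", f_true <= 3.32)
```

A candidate value of $\beta$ belongs to the 95% region exactly when its $F$ value is at most 3.32; at $\beta = \hat\beta$ the statistic is zero, which is why the region is centered at the least squares estimate.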
The above leads to a confidence region for $R\beta$. For example, suppose $h = 2$ with

$$R = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \end{pmatrix} \quad \text{and} \quad R\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}.$$

Suppose that $N - K = 30$. The F-distribution is available in tables and computer programs. We have

$$Prob(F(2, 30) > 3.32) = 0.05.$$

This means that

$$Prob\left(\left[\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} - \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}\right]'\left[\widehat{Var}\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix}\right]^{-1}\left[\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} - \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}\right]\Big/2 \leq 3.32\right) = .95.$$

The confidence region consists of the values for $(\beta_1, \beta_2)$ that satisfy the inequality above. This gives the interior of an ellipse which is centered at the least-squares values $(\hat\beta_1, \hat\beta_2)$.

Goldberger (1991) is a good reference for the normal linear model.

5 Generalized Least Squares

In this section we assume that $\Omega$ is known. In Aitken's notation (footnote 7), $V(Y) = \sigma^2\Omega$. Define

• $H$ such that $\Omega^{-1} = H'H$ (i.e. $H = \Omega^{-1/2}$, or $\Omega = (H'H)^{-1} = H^{-1}(H')^{-1}$),
• $Y^* = HY$,
• $X^* = HX$.

Then

$$E(Y^*) = E(HY) = HX\beta = X^*\beta,$$
$$var(Y^*) = H\,var(Y)\,H' = H\sigma^2\Omega H' = \sigma^2HH^{-1}(H')^{-1}H' = \sigma^2I_N.$$

Therefore the classical assumptions are satisfied and we can apply the Gauss-Markov theorem to $Y^* = X^*\beta + u^*$, where $u^* = H(Y - X\beta)$. We get the BLUE estimator

$$\hat\beta_{GLS} = (X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'H'HX)^{-1}X'H'HY = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y.$$

Footnote 7: Usually, and later on, we write $V(Y) = \Omega$. Factoring out $\sigma^2$ is without loss of generality and only makes the expression look more like the homoscedastic variance $\sigma^2I_N$.

If we assume additionally that (this is the generalized normal regression model)

$$Y \sim N(X\beta, \sigma^2\Omega),$$

we get that

$$\hat\beta_{GLS} \sim N\left(\beta, \sigma^2(X'\Omega^{-1}X)^{-1}\right),$$

and

$$\frac{N - K}{\sigma^2}s^2_{GLS} = \frac{1}{\sigma^2}(Y - X\hat\beta_{GLS})'\Omega^{-1}(Y - X\hat\beta_{GLS}) \sim \chi^2(N - K).$$

Remarks

1. Alternatively we can say that

$$\hat\beta_{GLS} = \arg\min_b\,(Y - Xb)'\Omega^{-1}(Y - Xb),$$

i.e. it minimizes a weighted sum of squares instead of the plain sum of squares. In the language of norms: LS minimizes the squared norm $\|v\|^2 = v'v$, and GLS minimizes the weighted squared norm $\|v\|_\Omega^2 = v'\Omega^{-1}v$.
In practice, $\Omega$ is an unknown $N \times N$ matrix with $N(N+1)/2$ distinct elements, which is clearly impossible to estimate without further restrictions from data with only $N$ observations. To deal with heteroscedasticity in practice, we need some way to model the covariance matrix $\Omega$. This will be case specific.
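To illustrate the GLS formulas of this section, here is a minimal Python sketch (not part of the original note) for the case where $\Omega$ is known and diagonal, i.e. pure heteroscedasticity, with a constant and one regressor. It computes $\hat\beta_{GLS}$ two ways: directly from $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y$, and as least squares on the transformed data $Y^* = HY$, $X^* = HX$ with $H = \Omega^{-1/2}$. The function names, the data, and the diagonal of $\Omega$ are made-up assumptions for illustration.

```python
import math


def least_squares_two_columns(c0, c1, y):
    """Return (X'X)^{-1} X'y for X = [c0, c1] using the 2 x 2 inverse."""
    a00 = sum(u * u for u in c0)
    a01 = sum(u * v for u, v in zip(c0, c1))
    a11 = sum(v * v for v in c1)
    g0 = sum(u * t for u, t in zip(c0, y))
    g1 = sum(v * t for v, t in zip(c1, y))
    det = a00 * a11 - a01 * a01
    return (a11 * g0 - a01 * g1) / det, (a00 * g1 - a01 * g0) / det


def gls_direct(x, y, omega_diag):
    """(X' Omega^{-1} X)^{-1} X' Omega^{-1} Y for X = [1, x], diagonal Omega."""
    w = [1.0 / o for o in omega_diag]  # diagonal of Omega^{-1}
    a00 = sum(w)
    a01 = sum(wi * xi for wi, xi in zip(w, x))
    a11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    g0 = sum(wi * yi for wi, yi in zip(w, y))
    g1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = a00 * a11 - a01 * a01
    return (a11 * g0 - a01 * g1) / det, (a00 * g1 - a01 * g0) / det


def gls_via_transform(x, y, omega_diag):
    """Least squares of Y* = H Y on X* = H X with H = Omega^{-1/2}."""
    h = [1.0 / math.sqrt(o) for o in omega_diag]
    ones_star = h  # H times the column of ones
    x_star = [hi * xi for hi, xi in zip(h, x)]
    y_star = [hi * yi for hi, yi in zip(h, y)]
    return least_squares_two_columns(ones_star, x_star, y_star)


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
omega_diag = [1.0, 2.0, 1.0, 4.0, 0.5]  # assumed known variances up to sigma^2
print("direct formula :", gls_direct(x, y, omega_diag))
print("transformed LS :", gls_via_transform(x, y, omega_diag))
```

The two results coincide up to rounding, which is the content of the identity $(X^{*\prime}X^*)^{-1}X^{*\prime}Y^* = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y$. Note that the transformed constant column is no longer constant, so the transformed regression has no separate intercept.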