Statistics II
Lesson 4. Simple linear regression
Academic Year 2010/11

Contents
- The subject of regression analysis
- The specification of a simple linear regression model
- Least squares estimators: construction and properties
- Inferences about the regression model:
  - Inference about the slope
  - Inference about the variance
- Estimation of a mean response
- Prediction of a new response

Learning objectives
- Know how to construct a simple linear regression model that describes how a variable X influences another variable Y
- Know how to obtain point estimates of the parameters of this model
- Know how to construct confidence intervals and perform tests about the parameters of the model
- Know how to estimate the mean value of Y for a specified value of X
- Know how to predict future values of the dependent (response) variable Y

Recommended bibliography
- Meyer, P. "Probabilidad y aplicaciones estadísticas" (1992)
- Newbold, P. "Estadística para los negocios y la economía" (1997), Chapter 10
- Peña, D. "Regresión y análisis de experimentos" (2005), Chapter 5

Introduction
A regression model describes how a variable X influences the value of another variable Y.
- X: the independent, explanatory, or exogenous variable
- Y: the dependent, response, or endogenous variable
The aim is to obtain reasonable estimates of Y for different values of X from a sample of n pairs of values (x1, y1), ..., (xn, yn).
Introduction: examples
- Study how the parents' height may influence their children's height
- Estimate the price of a house as a function of its surface area
- Predict the unemployment level for different ages
- Approximate the grades obtained in a subject as a function of the number of study hours per week
- Forecast the execution time of a program depending on the speed of the processor

Types of relation
- Deterministic: given the value of X, the value of Y is perfectly determined, Y = f(X).
  Example: the relationship between the temperature measured in degrees Celsius (X) and the equivalent measure in degrees Fahrenheit (Y) is Y = 1.8X + 32.
  [Figure: degrees Fahrenheit vs degrees Celsius, an exact straight line]
- Non-deterministic: given the value of X, the value of Y is not completely determined, y = f(x) + u, where u is an unknown perturbation (a random variable).
  Example: a sample of the volume of production (X) and the total cost (Y) associated with a product in a corporate group.
  [Figure: scatter plot of cost vs volume] A relation exists, but it is not exact.
- Linear: the function f(x) is linear, f(x) = β0 + β1x.
  - If β1 > 0 there is a positive linear relationship
  - If β1 < 0 there is a negative linear relationship
  [Figures: positive and negative linear relations] The data show a linear pattern.
- Nonlinear: the function f(x) is nonlinear, for example f(x) = log(x), f(x) = x² + 3, ...
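The contrast between a deterministic and a non-deterministic relation can be seen in a short simulation. This is an illustrative sketch, not from the original slides: the noise level σ = 2 and the grid of x values are arbitrary choices. The Celsius-to-Fahrenheit conversion always returns the same Y for a given X, while the noisy version adds a perturbation u ∼ N(0, σ).

```python
import numpy as np

def fahrenheit(celsius):
    # Deterministic relation: Y is an exact function of X.
    return 1.8 * celsius + 32

# Non-deterministic relation: the same linear signal plus a random
# perturbation u ~ N(0, sigma); sigma = 2 is an illustrative choice.
rng = np.random.default_rng(0)
x = np.linspace(0, 40, 20)
u = rng.normal(0.0, 2.0, size=x.size)
y = fahrenheit(x) + u   # y = f(x) + u

# fahrenheit(x) is reproducible; y changes with every new sample of u.
```

Regenerating u gives a different sample of y around the same line, which is exactly the situation the regression model is meant to describe.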
[Figure: nonlinear relation] The data do not show a linear pattern.
- Absence of relation: whenever f(x) = c, that is, whenever f(x) does not depend on x.
  [Figure: absence of relation, a horizontal cloud of points]

Measures of linear dependence
Covariance
A measure of linear dependence is the covariance:
cov(x, y) = Σᵢ (xi − x̄)(yi − ȳ) / (n − 1)
- If there is a positive linear relation, the covariance will be positive and large
- If there is a negative linear relation, the covariance will be negative and large in absolute value
- If there is no relation between the variables, or the relation is markedly nonlinear, the covariance will be close to zero
However, the covariance depends on the units of measurement of the variables.

The correlation coefficient
A measure of linear dependence that does not depend on the units of measurement is the correlation coefficient:
r(x, y) = cor(x, y) = cov(x, y) / (sx sy)
where s²x = Σᵢ (xi − x̄)² / (n − 1) and s²y = Σᵢ (yi − ȳ)² / (n − 1).
Properties:
- −1 ≤ cor(x, y) ≤ 1
- cor(x, y) = cor(y, x)
- cor(ax + b, cy + d) = sign(a) sign(c) cor(x, y) for any values a, b, c, d (with a, c ≠ 0)

The simple linear regression model
The simple linear regression model assumes that
yi = β0 + β1 xi + ui
where:
- yi is the value of the dependent variable for the i-th observation
- xi is the value of the independent variable for the i-th observation
- ui is the error for the i-th observation, which we will assume to be normal, ui ∼ N(0, σ)
- β0 and β1 are the regression coefficients: β0 is the intercept and β1 is the slope
The parameters to estimate are β0, β1, and σ.

Our goal is to obtain estimates β̂0 and β̂1 of β0 and β1 to define the regression line
ŷ = β̂0 + β̂1 x
that provides the best fit to the data.
Example: assume that the regression line of the previous example is
Cost = −15.65 + 1.29 × Volume
[Figure: plot of the fitted model, cost vs volume]
From this model it is estimated that a company producing 25 thousand units will have a cost of
Cost = −15.65 + 1.29 × 25 = 16.6 thousand euros

Residuals
The difference between each value yi of the response variable and its estimate ŷi is called a residual:
ei = yi − ŷi
[Figure: observed value, estimated regression line, and residual]
Example (cont.): a company that has produced exactly 25 thousand units will certainly not have a cost of exactly 16.6 thousand euros. The difference between the estimated cost and the real one is the error. If, for example, the real cost of the company is 18 thousand euros, the residual is:
ei = 18 − 16.6 = 1.4 thousand euros

Hypotheses of the simple linear regression model
- Linearity: the relation between X and Y is linear, f(x) = β0 + β1x
- Homogeneity: the mean value of the error is zero, E[ui] = 0
- Homoscedasticity: the variance of the errors is constant, Var(ui) = σ²
- Independence: the observations are independent, E[ui uj] = 0 for i ≠ j
- Normality: the errors follow a normal distribution, ui ∼ N(0, σ)

Linearity
The data have to look reasonably straight; otherwise, the regression line does not represent the structure of the data.
[Figures: a straight scatter with fitted line vs a curved pattern the line misses]

Homoscedasticity
The dispersion of the data must be constant for the data to be homoscedastic; if this condition does not hold, the data are said to be heteroscedastic.
[Figure: cost vs volume scatter with constant dispersion]

Independence
- The data must be independent
- An observation must not give information
about the rest of the observations
- Usually, the type of the data indicates whether they are adequate for this analysis
- In general, time series do not satisfy the independence hypothesis

Normality
- We will assume a priori that the data are normal

Least squares estimators
Gauss proposed in 1809 the method of least squares for obtaining the values β̂0 and β̂1 that best fit the data:
ŷi = β̂0 + β̂1 xi
The method consists of minimizing the sum of the squared vertical distances between the data and the estimates, that is, minimizing the sum of squared residuals:
Σᵢ e²i = Σᵢ (yi − ŷi)² = Σᵢ (yi − (β̂0 + β̂1 xi))²
The results are given by
β̂1 = cov(x, y) / s²x = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²
β̂0 = ȳ − β̂1 x̄

Exercise 4.1
The data regarding the production of wheat in tons (X) and the price of the kilo of flour in pesetas (Y) in the decade of the 80's in Spain were:

Wheat production: 30 28 32 25 25 25 22 24 35 40
Flour price:      25 30 27 40 42 40 50 45 30 25

Fit the regression line using the method of least squares.

Results
β̂1 = (Σᵢ xi yi − n x̄ ȳ) / (Σᵢ x²i − n x̄²) = (9734 − 10 × 28.6 × 35.4) / (8468 − 10 × 28.6²) = −1.3537
β̂0 = ȳ − β̂1 x̄ = 35.4 + 1.3537 × 28.6 = 74.116
The regression line is: ŷ = 74.116 − 1.3537 x

The corresponding software output (StatGraphics):

Regression Analysis - Linear model: Y = a + b*X
Dependent variable: Precio en ptas.
Independent variable: Produccion en kg.

Parameter    Estimate    Standard Error    T Statistic    P-Value
Intercept    74.1151     8.73577           8.4841         0.0000
Slope        -1.35368    0.3002            -4.50924       0.0020

Analysis of Variance
Source           Sum of Squares    Df    Mean Square    F-Ratio    P-Value
Model            528.475           1     528.475        20.33      0.0020
Residual         207.925           8     25.9906
Total (Corr.)    736.4             9

Correlation Coefficient = -0.84714
R-squared = 71.7647 percent
Standard Error of Est. = 5.0981

Estimation of the variance
To estimate the variance of the errors, σ², we can use
σ̂² = Σᵢ e²i / n
which is the maximum likelihood estimator of σ², but it is a biased estimator. An unbiased estimator of σ² is the residual variance:
s²R = Σᵢ e²i / (n − 2)

Exercise 4.2
Compute the residual variance in exercise 4.1.

Results
We first compute the residuals, ei, from the regression line ŷi = 74.116 − 1.3537 xi:

xi:  30     28     32     25     25     25     22     24     35     40
yi:  25     30     27     40     42     40     50     45     30     25
ŷi:  33.50  36.21  30.79  40.27  40.27  40.27  44.33  41.62  26.73  19.96
ei:  -8.50  -6.21  -3.79  -0.27  1.72   -0.27  5.66   3.37   3.26   5.03

The residual variance is:
s²R = Σᵢ e²i / (n − 2) = 207.92 / 8 = 25.99
In the software output, s²R appears as the Residual Mean Square (25.9906), and the Standard Error of Est., 5.0981, is its square root sR.

Inference on the regression model
- Up to this point we have obtained only point estimates of the regression coefficients
- Using confidence intervals we can obtain a measure of the precision of those estimates
- Using hypothesis testing we can verify whether a certain value can be the true value of a parameter

Inference about the slope
The estimator β̂1 follows a normal distribution because it is a linear combination of normals,
β̂1 = Σᵢ [(xi − x̄) / ((n − 1) s²X)] yi = Σᵢ wi yi
where yi = β0 + β1 xi + ui, satisfying yi ∼ N(β0 + β1 xi, σ²). Additionally, β̂1 is an unbiased estimator of β1,
E[β̂1] = Σᵢ [(xi − x̄) / ((n − 1) s²X)] E[yi] = β1
and its variance is given by
Var[β̂1] = Σᵢ [(xi − x̄) / ((n − 1) s²X)]² Var[yi] = σ² / ((n − 1) s²X)
Thus,
β̂1 ∼ N(β1, σ² / ((n − 1) s²X))

Confidence intervals for the slope
We want to obtain a confidence interval for β1 at a 1 − α level. Since σ² is unknown, it will be estimated using s²R. The basic result when the variance is unknown is:
(β̂1 − β1) / √(s²R / ((n − 1) s²X)) ∼ tn−2
which allows us to obtain a confidence interval for β1:
β̂1 ± tn−2,α/2 √(s²R / ((n − 1) s²X))
The length of this interval will decrease if:
- The sample size increases
- The variance of the independent observations xi increases
- The residual variance decreases

Hypothesis testing on the slope
Using the previous result we can perform hypothesis testing for β1. In particular, if the true value of β1 is zero, then Y does not depend linearly on X. Therefore, the following test is of special interest:
H0: β1 = 0  vs.  H1: β1 ≠ 0
The rejection region for the null hypothesis is:
| β̂1 / √(s²R / ((n − 1) s²X)) | > tn−2,α/2
Equivalently, if the value zero is outside the confidence interval for β1 at a 1 − α level, we reject the null hypothesis at this level. The p-value of the test is:
p-value = 2 Pr( tn−2 > | β̂1 / √(s²R / ((n − 1) s²X)) | )

Inference for the slope
Exercise 4.3
1.
Compute a 95% confidence interval for the slope of the regression line obtained in exercise 4.1.
2. Test the hypothesis that the price of flour depends linearly on the production of wheat, using a 0.05 significance level.

Results
1. tn−2,α/2 = t8,0.025 = 2.306
−2.306 ≤ (−1.3537 − β1) / √(25.99 / (9 × 32.04)) ≤ 2.306  ⇔  −2.046 ≤ β1 ≤ −0.661
2. As the interval does not contain the value zero, we reject β1 = 0 at the 0.05 level. In fact:
| β̂1 / √(s²R / ((n − 1) s²X)) | = | −1.3537 / √(25.99 / (9 × 32.04)) | = 4.509 > 2.306
p-value = 2 Pr(t8 > 4.509) = 0.002
These values match the Slope row of the software output: estimate −1.35368, standard error 0.3002 = √(s²R / ((n − 1) s²X)), t statistic −4.50924, and p-value 0.0020.
Inference for the intercept
The estimator β̂0 follows a normal distribution, as it can be written as a linear combination of normal random variables,
β̂0 = Σᵢ (1/n − x̄ wi) yi
where wi = (xi − x̄) / ((n − 1) s²X) and yi = β0 + β1 xi + ui, satisfying yi ∼ N(β0 + β1 xi, σ²).
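Before continuing with the intercept, the least squares and slope-inference computations of exercises 4.1-4.3 can be reproduced numerically. This is a minimal sketch (variable names are illustrative; the critical value t8,0.025 = 2.306 is the tabulated one used in the text):

```python
import math
import numpy as np

# Wheat production (x, tons) and flour price (y, pesetas) from exercise 4.1.
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = x.size

# Least squares estimates: b1 = cov(x, y) / s_x^2, b0 = ybar - b1 * xbar.
sxx = np.sum((x - x.mean()) ** 2)              # equals (n - 1) * s_X^2
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()

# Residual variance s_R^2, the unbiased estimator with n - 2 degrees of freedom.
resid = y - (b0 + b1 * x)
s2R = np.sum(resid ** 2) / (n - 2)

# Standard error, t statistic and 95% CI for the slope,
# using the tabulated critical value t_{8, 0.025} = 2.306 from the text.
se_b1 = math.sqrt(s2R / sxx)
t_stat = b1 / se_b1
t_crit = 2.306
ci_low, ci_high = b1 - t_crit * se_b1, b1 + t_crit * se_b1
```

The results reproduce the slides: b1 ≈ −1.3537, b0 ≈ 74.116, s2R ≈ 25.99, t ≈ −4.509, and the interval (−2.046, −0.661); since the interval excludes zero, β1 = 0 is rejected at the 0.05 level.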
Additionally, β̂0 is an unbiased estimator of β0,
E[β̂0] = Σᵢ (1/n − x̄ wi) E[yi] = β0
and its variance is
Var[β̂0] = Σᵢ (1/n − x̄ wi)² Var[yi] = σ² (1/n + x̄² / ((n − 1) s²X))
implying
β̂0 ∼ N(β0, σ² (1/n + x̄² / ((n − 1) s²X)))

Confidence interval for the intercept
We wish to obtain a confidence interval for β0 at a 1 − α level. Since σ² is unknown, this value will be estimated using s²R. The basic result when the variance is unknown is:
(β̂0 − β0) / √(s²R (1/n + x̄² / ((n − 1) s²X))) ∼ tn−2
From it we can obtain a confidence interval for β0:
β̂0 ± tn−2,α/2 √(s²R (1/n + x̄² / ((n − 1) s²X)))
The length of the confidence interval decreases if:
- The sample size increases
- The variance of the independent observations xi increases
- The residual variance decreases
- The mean of the independent observations decreases in absolute value

Hypothesis testing on the intercept
Using the previous result we can perform hypothesis testing for β0. In particular, if the true value of β0 is zero, then the regression line passes through the origin. Therefore, the following test is of special interest:
H0: β0 = 0  vs.  H1: β0 ≠ 0
The critical region for this test is:
| β̂0 / √(s²R (1/n + x̄² / ((n − 1) s²X))) | > tn−2,α/2
Equivalently, if zero lies outside the confidence interval for β0 at a level 1 − α, we reject the null hypothesis at that level. The p-value is given by:
p-value = 2 Pr( tn−2 > | β̂0 / √(s²R (1/n + x̄² / ((n − 1) s²X))) | )

Inference for the intercept
Exercise 4.4
1. Compute a 95% confidence interval for the intercept of the regression line obtained in exercise 4.1.
2. Test the hypothesis that the regression line passes through the origin, using a 0.05 significance level.

Results
1. tn−2,α/2 = t8,0.025 = 2.306
−2.306 ≤ (74.1151 − β0) / √(25.99 (1/10 + 28.6² / (9 × 32.04))) ≤ 2.306  ⇔  53.969 ≤ β0 ≤ 94.261
2. As the computed interval does not contain the value zero, we reject β0 = 0 at the 0.05 level.
In fact,
| β̂0 / √(s²R (1/n + x̄² / ((n − 1) s²X))) | = | 74.1151 / √(25.99 (1/10 + 28.6² / (9 × 32.04))) | = 8.484 > 2.306
p-value = 2 Pr(t8 > 8.484) = 0.000
These values match the Intercept row of the software output: estimate 74.1151, standard error 8.73577 = √(s²R (1/n + x̄² / ((n − 1) s²X))), t statistic 8.4841, and p-value 0.0000.

Inference for the variance
The basic result is:
(n − 2) s²R / σ² ∼ χ²n−2
Using this result we can:
- Compute a confidence interval for the variance:
  (n − 2) s²R / χ²n−2,α/2 ≤ σ² ≤ (n − 2) s²R / χ²n−2,1−α/2
- Perform hypothesis tests of the form:
  H0: σ² = σ²0  vs.  H1: σ² ≠ σ²0

Estimation of the mean response and prediction of a new response
We consider two types of problems:
1. Estimate the mean value of the variable Y corresponding to a given value X = x0
2. Predict a future value of the variable Y for a given value X = x0
For example, in exercise 4.1 we might be interested in the following questions:
1. What will be the mean price of a kg of flour in those years in which wheat production equals 30 tons?
2. If in a given year wheat production is 30 tons, what will be the price of a kg of flour?
In both cases the estimate is:
ŷ0 = β̂0 + β̂1 x0 = ȳ + β̂1 (x0 − x̄)
but the precision of the two estimates is different.

Estimation of a mean response
Taking into account that
Var(ŷ0) = Var(ȳ) + (x0 − x̄)² Var(β̂1) = σ² (1/n + (x0 − x̄)² / ((n − 1) s²X))
the confidence interval for the mean response is:
ŷ0 ± tn−2,α/2 √(s²R (1/n + (x0 − x̄)² / ((n − 1) s²X)))

Prediction of a new response
The variance of the prediction of a new response is the mean squared error of the prediction:
E[(y0 − ŷ0)²] = Var(y0) + Var(ŷ0) = σ² (1 + 1/n + (x0 − x̄)² / ((n − 1) s²X))
The confidence interval for the prediction of a new response is:
ŷ0 ± tn−2,α/2 √(s²R (1 + 1/n + (x0 − x̄)² / ((n − 1) s²X)))
The length of this interval is larger than that of the preceding case (we have less precision), as we are not estimating a mean value but predicting a specific one.
Estimation of the mean response and prediction of a new response
The confidence intervals for the estimated means are shown in red in the figure below, while those for the predictions are drawn in pink; the latter are clearly wider.
[Figure: fitted model with confidence and prediction bands, Produccion en kg. vs Precio en ptas.]
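As a numerical illustration, the two intervals at x0 = 30 can be computed directly from the exercise 4.1 data. This is a minimal sketch (variable names are illustrative; the critical value t8,0.025 = 2.306 is the tabulated one used in the text):

```python
import math
import numpy as np

# Exercise 4.1 data: wheat production (tons) and flour price (pesetas).
x = np.array([30, 28, 32, 25, 25, 25, 22, 24, 35, 40], dtype=float)
y = np.array([25, 30, 27, 40, 42, 40, 50, 45, 30, 25], dtype=float)
n = x.size

# Fitted regression line and residual variance, as in the earlier exercises.
sxx = np.sum((x - x.mean()) ** 2)                    # (n - 1) * s_X^2
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
s2R = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)

x0 = 30.0                       # a year with a production of 30 tons
y0_hat = b0 + b1 * x0           # point estimate, common to both problems

t_crit = 2.306                  # tabulated t_{8, 0.025} used in the text
h = 1.0 / n + (x0 - x.mean()) ** 2 / sxx

# Half-width of the interval for the mean response E[Y | X = x0] ...
mean_half = t_crit * math.sqrt(s2R * h)
# ... and of the wider interval for predicting a single new response.
pred_half = t_crit * math.sqrt(s2R * (1.0 + h))
```

With these data, ŷ0 ≈ 33.50 pesetas; the mean-response interval is roughly 33.50 ± 3.84, while the prediction interval is roughly 33.50 ± 12.37, which is why the pink prediction bands in the figure are much wider than the red ones.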