Statistics 522: Sampling and Survey Techniques
Topic 3

Topic Overview

This topic will cover
• Ratio and Regression Estimation
• Estimation in Domains

Laplace
• Wanted to determine the population of France in 1802
• Number of births is easy to obtain from public records.
• Size of population is difficult to determine.
• Use births to predict population
• Sampled 30 "communes"
  - Total population: 2,037,615
  - Births in last 3 years: 215,599, or 71,866.33 per year
• Persons per birth: $2037615/71866.33 = 28.35$
• Multiply births by 28.35

Ratio estimation basics
• $y_i$ is the characteristic of interest (response variable).
• $x_i$ is the auxiliary variable or subsidiary variable.
• $t_y = \sum_{\text{pop}} y_i$
• $t_x = \sum_{\text{pop}} x_i$
• $B = t_y/t_x = \bar{y}_U/\bar{x}_U$

Procedure
• We assume that $t_x$ is known.
• Therefore, $\bar{x}_U = t_x/N$ is known.
• Use an SRS and measure $y_i$ and $x_i$ in the sample.
• Calculate $\bar{y}$ and $\bar{x}$ for the sample.
• $\hat{B} = \bar{y}/\bar{x} = \sum_{\text{sample}} y_i / \sum_{\text{sample}} x_i$
• $\hat{t}_{y,\text{ratio}} = \hat{B}\,t_x$ or
• $\hat{\bar{y}}_{\text{ratio}} = \hat{B}\,\bar{x}_U$

Why
Sometimes we are interested in the ratio
• acres per farm differ
• yield per acre
• per capita income

Notes
• Both $\bar{y}$ and $\bar{x}$ are random variables.
  - They will differ from sample to sample.
  - We exclude the possibility that $\bar{x}$ is zero.
• Denominator often looks like a sample size.
  - The usual estimator for an SRS can be viewed this way: $x_i = 1$ for all items.
    $N = t_x = \sum_{\text{pop}} x_i$
    $n = \sum_{\text{sample}} x_i$
    $\hat{B} = \sum y_i / \sum x_i = \bar{y}$
    $\hat{t} = \hat{B}\,t_x = \hat{B}N = N\bar{y}$

Why (2)
• Sometimes we want to estimate a population total but N is not known, so we can't use $N\bar{y}$.
  - Estimate the number of fish in a catch
  - Weigh and count a sample; weigh the total
  - Multiply the average fish per pound times the total weight of the catch
• Increase the precision
  - Laplace could have estimated the population of France by computing the average number of persons per commune and multiplying by the number of communes.
  - Ratio estimate has smaller MSE (because of positive correlation between births and population size).
• We can adjust estimates to reflect demographic totals.
  - Example in text on page 62 concerning gender
  - This is called poststratification.
  - We will discuss this in Section 4.7 and Chapters 7 and 8.
• Can be used to adjust for nonresponse
  - Example in text on page 63
  - Discussed in Chapter 8

Example 3.2
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• Suppose we have the population totals for 1987 but have only the sample information for 1992.
• We want to estimate
  - average acres per farm
  - total acres

Get the sample data (SLL063.sas)
    * import the file agsrs.dat, check it, then create a permanent SAS data set;
    proc print data=asrs;
    run;
    libname xxx 'C:\Purdue\Stat522\SASdata';
    data xxx.agsrs;
      set asrs;
    run;

Plot the data
    symbol1 v=circle i=sm70;
    proc gplot data=asrs;
      plot acres92*acres87/frame;
    run;
    proc reg data=asrs;
      model acres92=acres87/noint;
    run;
    proc univariate data=asrs;
      var acres92 acres87;
    run;

[Figure: Regression through the origin]

[Figure: Smoothed plot]

Ratio estimate
    proc univariate data=asrs;
      var acres92 acres87;
    run;

Sums
    Variable: ACRES92    Sum Observations    89369114
    Variable: ACRES87    Sum Observations    90586117

A calculation
    data acalc;
      tot92=89369114;
      tot87=90586117;
      ratio=tot92/tot87;
      output;
    run;

Output
    Obs    tot92       tot87       ratio
    1      89369114    90586117    0.98657

Ratio is $\hat{B}$: in the sample of n = 300, 1992 acres are 98.7% of 1987 acres.

Total acres for 1987
    proc univariate data=apop;
      var acres87;
    run;

Output
    Variable: ACRES87
    Mean                313016.378
    Sum Observations    963464412

NOTE: These values differ slightly from the values in the text.

The estimates
• Total acres for 1987: 963464412
• $\hat{B} = 0.98657$ = (1992 acres)/(1987 acres)
• Estimate of total 1992 acres is $\hat{B}$ × (total 1987 acres) = 0.98657(963464412) ≈ 950000000
• Estimate of mean acres per county for 1992 is $\hat{B}$ × (mean 1987 acres per county) = 0.98657(313016.378) ≈ 309000

Comments
• Ratio estimators are biased.
• The estimator $\hat{\bar{y}}_{\text{ratio}} = \hat{B}\,\bar{x}_{\text{pop}} = \bar{y}\,(\bar{x}_{\text{pop}}/\bar{x}_{\text{sam}})$ multiplies $\bar{y}$ by the random variable $\bar{x}_{\text{pop}}/\bar{x}_{\text{sam}}$.
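• The relative bias: $|\text{Bias}(\hat{B})| / \text{SE}(\hat{B}) \le |CV(\bar{x})|$

Proof
• Use $E(\hat{t}_x) = t_x$, $E(\hat{t}_y) = t_y$, and $\hat{t}_{y,\text{ratio}} = (\hat{t}_y/\hat{t}_x)\,t_x = \hat{B}\,t_x$.

$$
\begin{aligned}
E(\hat{t}_{y,\text{ratio}} - t_y) &= E(\hat{t}_{y,\text{ratio}} - \hat{t}_y + \hat{t}_y - t_y) \\
&= E\left(\frac{\hat{t}_y}{\hat{t}_x}\,t_x - \hat{t}_y\right) \\
&= E\left(\frac{\hat{t}_y}{\hat{t}_x}\,(t_x - \hat{t}_x)\right) \\
&= -E[\hat{B}(\hat{t}_x - t_x)] \\
&= -\left[E(\hat{B}\hat{t}_x) - E(\hat{B})E(\hat{t}_x)\right] \\
&= -\text{Cov}(\hat{B}, \hat{t}_x)
\end{aligned}
$$

• Therefore (writing $\bar{x}_{\text{pop}}$ for $\bar{x}_U$ and $\bar{x}_{\text{sam}}$ for the sample mean $\bar{x}$),

$$
\begin{aligned}
|E(\hat{B} - B)| &= \left|\frac{E(\hat{t}_{y,\text{ratio}} - t_y)}{t_x}\right|
= \left|\frac{\text{Cov}(\hat{B}, \hat{t}_x)}{t_x}\right|
= \left|\frac{\text{Cov}(\hat{B}, \bar{x}_{\text{sam}})}{\bar{x}_{\text{pop}}}\right| \\
&= \left|\frac{\text{Corr}(\hat{B}, \bar{x})}{\bar{x}_{\text{pop}}}\right| \sqrt{\text{Var}(\hat{B})\,\text{Var}(\bar{x})}
\le \frac{\text{SE}(\hat{B})\,\text{SE}(\bar{x})}{|\bar{x}_{\text{pop}}|}
= \text{SE}(\hat{B})\,|CV(\bar{x})|
\end{aligned}
$$

More Comments
• Similar argument shows that
  $\text{Bias}(\hat{B}) \approx \text{fpc}\,\frac{1}{n\bar{x}_U^2}\,\left(S_x^2 B - \text{Cov}(x, y)\right)$
• The bias will be small when
  - n is large.
  - the sampling fraction n/N is large.
  - $\bar{x}_U$ is large.
  - the standard deviation of x, $S_x$, is small.
  - the correlation R between y and x is close to one.

MSE of $\hat{B}$
$\widehat{\text{var}}(\hat{B}) \approx MSE(\hat{B}) \approx \text{Var}(\bar{d})/\bar{x}_U^2$, where $d_i = y_i - Bx_i$.

Idea behind proof
$\hat{B} - B = \frac{\bar{y}}{\bar{x}} - B = \frac{\bar{y} - B\bar{x}}{\bar{x}}$
• Note that we have the random variable $\bar{x}$ in the denominator of this expression.
• Approximate it by $\bar{x}_U$:
  $\hat{B} - B \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U}$

Standard error for $\hat{B}$
• To estimate the standard deviation of $\hat{B}$, substitute sample estimates for unknown quantities:
  $\widehat{\text{var}}(\hat{B}) = \text{fpc}\,\frac{s_e^2}{n\bar{x}_U^2}$,
  where $\text{fpc} = 1 - n/N$, $s_e^2 = \sum e_i^2/(n-1)$ and $e_i = y_i - \hat{B}x_i$.
• SE is the square root of the variance.

Other standard errors
• $\hat{t}_{\text{ratio}} = t_x\hat{B}$
  - Estimated variance of $\hat{t}_{\text{ratio}}$ is $t_x^2\,\widehat{\text{var}}(\hat{B}) = \text{fpc}\,\frac{N^2}{n}\,s_e^2$
• $\hat{\bar{y}}_{\text{ratio}} = \bar{x}_U\hat{B}$
  - Estimated variance of $\hat{\bar{y}}_{\text{ratio}}$ is $\bar{x}_U^2\,\widehat{\text{var}}(\hat{B}) = \frac{\text{fpc}}{n}\,s_e^2$
• Take square roots to obtain the SEs.

Confidence intervals
For 95%, general form is estimate ± 1.96 SE(estimate)

Example 3.3
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• Estimate total acres for 1992 using a ratio estimate.
• B is the ratio of 1992 acres to 1987 acres for the population; $\hat{B}$ is the same ratio for the sample of n = 300.
• We use the known value of total acres in 1987 for the population of N = 3078 counties.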
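The standard error and confidence interval formulas above can be sketched in a few lines of Python (chosen here only for a quick arithmetic check; the course programs themselves are in SAS). The function name `ratio_estimate` and the four-point toy data set are hypothetical and are not taken from the agriculture data.

```python
import math

def ratio_estimate(y, x, t_x, N, z=1.96):
    """Ratio estimate of a population total from an SRS, with SE and CI.

    Follows the formulas in these notes:
      B_hat = sum(y) / sum(x)
      s_e^2 = sum(e_i^2) / (n - 1), with e_i = y_i - B_hat * x_i
      var(t_hat) = fpc * N^2 * s_e^2 / n, with fpc = 1 - n/N
    """
    n = len(y)
    b_hat = sum(y) / sum(x)
    residuals = [yi - b_hat * xi for yi, xi in zip(y, x)]
    s2_e = sum(e * e for e in residuals) / (n - 1)
    fpc = 1 - n / N
    t_hat = b_hat * t_x
    se_t = math.sqrt(fpc * N ** 2 * s2_e / n)
    return {
        "b_hat": b_hat,
        "t_hat": t_hat,
        "se": se_t,
        "ci": (t_hat - z * se_t, t_hat + z * se_t),
    }

# Hypothetical toy sample (NOT the agsrs data): n = 4 units from N = 20,
# with an assumed known population total t_x = 500.
y = [12.0, 20.0, 31.0, 39.0]
x = [10.0, 20.0, 30.0, 40.0]
result = ratio_estimate(y, x, t_x=500.0, N=20)
print(result)
```

With real data, `y` and `x` would be the sampled values of acres92 and acres87, `t_x` the known 1987 population total, and `N` the number of counties.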
The estimates
• Total acres for 1987: 963464412
• $\hat{B} = 0.98657$ (1992 acres / 1987 acres)
• Estimate of total 1992 acres is $\hat{B}$ × (total 1987 acres) = 0.98657(963464412) ≈ 950000000
• We need to calculate the standard error and a 95% confidence interval.

Estimate B (SLL068.sas)
    libname xxx 'xxx\SASdata';
    data asrs;
      set xxx.agsrs;
    run;
    proc means data=asrs;
      var acres87 acres92;
      output out=a2 sum=sum87 sum92;
    run;
    data a2;
      set a2;
      Bhat=sum92/sum87;
    run;
    proc print data=a2;
    run;

Output
    Obs    sum87       sum92       Bhat
    1      90586117    89369114    0.98657

Define e and compute SE
    * attach Bhat to every sample record, then form the residuals e;
    data asrs2;
      set asrs;
      if _n_ eq 1 then set a2;
      e=acres92-Bhat*acres87;
    run;
    * stderr of e is s_e/sqrt(n);
    proc means data=asrs2;
      var e;
      output out=a4 stderr=se_e n=nsrs;
    run;

Find population total for 1987
    proc means data=xxx.agpop;
      var acres87;
      output out=a5 sum=sum87pop n=Npop;
    run;

Put it together
    data a6;
      merge a2 a4 a5;
      fpc=(1-nsrs/Npop);
      var_tot=(Npop*Npop)*fpc*se_e*se_e;
      se_tot=sqrt(var_tot);
      moe=1.96*se_tot;
      tot_est=bhat*sum87pop;
      lcl95=tot_est-moe;
      ucl95=tot_est+moe;
    run;

Print
    proc print data=a6;
      var tot_est se_tot moe lcl95 ucl95;
    run;

Output
    Obs    tot_est      se_tot     moe         lcl95        ucl95
    1      950520496    5344567    10475351    940045144    960995848

Evaluation
• The 95% CI is 940 to 961 million acres.
• The SE for the ratio estimator is 5.3 million acres.
• If we use the SRS estimate (N × mean acres for the sample), the standard error is 58.2 million acres.
• The ratio estimate works very well for this problem.

MSE approximation
Text (page 71) suggests that the approximation may severely underestimate the true MSE (i.e., miss the bias) unless
• the sample size n is at least 30
• $CV(\bar{x}) \le 0.1$
• $CV(\bar{y}) \le 0.1$

When is the ratio estimate better?
• When the deviations of $y_i$ from $\bar{y}$ are larger than the deviations of $y_i$ from $\hat{B}x_i$.
• We want to compare the MSEs of the usual and the ratio estimators.
• The MSE of the ratio estimator is smaller ($MSE(\hat{\bar{y}}_{\text{ratio}}) \le MSE(\bar{y})$) whenever
  $R \ge \frac{CV(x)}{2\,CV(y)}$

Model
If the relationship between y and x is a straight line through the origin with variance proportional to x, $\hat{B}$ is the weighted least squares estimate of the slope with weights proportional to 1/x.
$\hat{B} = \arg\min_B \sum_{\text{sam}} \frac{1}{x_i}\,(y_i - Bx_i)^2$

Ratio estimators for proportions
• $\hat{B} = \bar{y}/\bar{x}$ is the quantity of interest.
• Use the same approach.
• See Section 3.1.3, including Example 3.5 on pages 72-73.

Regression estimation
• Statistics 512
• Regression fit is $\hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1\bar{x}_{\text{pop}} = \bar{y} + \hat{B}_1(\bar{x}_{\text{pop}} - \bar{x}_{\text{sam}})$
• Ratio estimator: $\hat{B}_0 = 0$.
• The regression estimate is the predicted value when we substitute $\bar{x}_U$ for x.
• We need to do more work to calculate the standard error, MOE and CI.
  $MSE(\hat{\bar{y}}_{\text{reg}}) \approx \frac{\text{fpc}}{n(N-1)}\,RSS_{\text{pop}}$

Example 3.6
• Estimate the number of dead trees in an area.
• Divide area into 100 square plots.
• Photo counts are easy (x), available for all N = 100 plots.
• For a sample of n = 25 plots, measure actual numbers of dead trees (y).

Estimation
• Use a regression to describe the relationship between the actual count of dead trees (y) and the photo number of dead trees (x) in the n = 25 sample.
• Find the average number of dead trees in the photos ($\bar{x}_U = 11.3$) and use this to get the predicted average number of dead trees in the population.
• Multiply by N to get the total.

Enter the data (SLL075.sas)
    * the last record (photo = 11.3, field missing) gives the prediction at the population mean;
    data a1;
      input photo field @@;
      datalines;
    10 15 12 14 7 9 13 14 13 8
    6 5 17 18 16 15 15 13 10 15
    14 11 12 15 10 12 5 8 12 13
    10 9 10 11 9 12 6 9 11 12
    7 13 9 11 11 10 10 9 10 8
    11.3 .
    ;

Run the regression
    proc reg data=a1;
      model field=photo/clm;
    run;

Output
    Var      DF    Par Est    St Error    t       Pr>|t|
    Int      1     5.059      1.763       2.87    0.0087
    photo    1     0.613      0.160       3.83    0.0009

Predicted average
    Obs    Dep Var field    Predicted Value
    26     .                11.9893

Standard error
• Different approximations are available
• We will use (3.14) on page 75:
  $\widehat{\text{var}}(\hat{\bar{y}}_{\text{reg}}) = \text{fpc}\,\frac{s_e^2}{n}$
• See Example 3.6 on pages 75-76

Difference estimation
• Special case where slope is assumed to be one
• Many cases where the differences between y and x are zero
• Differences are equally likely to be positive or negative
• Sometimes useful in auditing
• See text Section 3.2.2 on page 77

Estimation in domains
• Suppose we want to estimate a mean and/or a total for a subset of the population.
• We call the subpopulation a domain or a subdomain.
• In Example 3.7 on page 79, we use the SRS of counties to estimate mean and total 1992 acres for the western states.

One view
• After we take the sample of size n, select the observations that are within the domain of interest and proceed as if this were an SRS of size $n_d$ from the domain.
• The fpc would be $\text{fpc}_d = 1 - n_d/N_d$.
• Note, this requires that we know $N_d$.
• We also need $N_d$ to convert $\bar{y}_d$ to $\hat{t}_d$.

A technical difficulty
• The sample size in the domain $n_d$ is a random variable; it varies from sample to sample.
• With this view we are ignoring the variability introduced into our estimate from this fact.
• We are conditioning on $n_d$.
• This is what we do in regression when we treat the values of the explanatory variables as fixed and not random.

Section 3.3 view
• For the $\bar{y}_d$ use the mean of the observations in the sample that are in the domain.
• Treat the denominator $n_d$ as a random variable.
• In this view, $\bar{y}_d$ is a ratio estimator (a $\hat{B}$) and the methods of this chapter apply.

Some details
• Let $u_i = y_i$ if i is in the domain, and 0 otherwise.
• Similarly, $x_i = 1$ if i is in the domain, and 0 otherwise.
• $\bar{y}_d = \hat{B} = \frac{\sum_{\text{sam}} u_i}{\sum_{\text{sam}} x_i} = \frac{\sum_{\text{dom}} y_i}{n_d}$
• $\hat{t}_d = N_d\,\bar{y}_d$ if $N_d$ is known
• Use the formula on the next to the last line of page 78 for $\text{SE}(\bar{y}_d)$.
• This is a rewrite of the formula for a ratio estimate
  $\widehat{\text{var}}(\hat{B}) = \frac{\text{fpc}}{n\bar{x}_{\text{pop}}^2}\,\frac{\sum_{\text{sam}}(y_i - \hat{B}x_i)^2}{n-1}$
  using the notation of this section.
• Use $\text{SE}(\hat{t}_d) = N_d\,\text{SE}(\bar{y}_d)$ for the total.

$N_d$ unknown
• Use the same estimator for the domain mean: the mean of this subset of the sample, $\bar{y}_d$.
• Use the last line on page 78 for the SE of $\bar{y}_d$, an approximation that assumes
  - $n_d/n \approx N_d/N$.
  - $(n_d - 1)/(n - 1) \approx n_d/n$.

Standard error for $\bar{y}_d$
$\text{SE}(\bar{y}_d) = \sqrt{\text{fpc}}\,\frac{s_{y,d}}{\sqrt{n_d}}$

• Proof

$$
\begin{aligned}
\widehat{\text{var}}(\bar{y}_d) &= \frac{\text{fpc}}{n\bar{x}_{\text{pop}}^2}\,\frac{\sum_{\text{sam}}(u_i - \hat{B}x_i)^2}{n-1} \\
&= \frac{\text{fpc}}{n\bar{x}_{\text{pop}}^2}\,\frac{\sum_{\text{sam}}(y_i - \hat{B}x_i)^2\,I(i \in \text{dom})}{n-1} \\
&= \frac{\text{fpc}}{n}\left(\frac{N}{N_d}\right)^2 \frac{\sum_{\text{dom}}(y_i - \hat{B}x_i)^2}{n-1} \\
&\approx \frac{\text{fpc}}{n}\left(\frac{n}{n_d}\right)^2 \frac{(n_d - 1)\,s_{y,d}^2}{n-1} \\
&\approx \frac{\text{fpc}}{n}\left(\frac{n}{n_d}\right)^2 \frac{n_d\,s_{y,d}^2}{n} \\
&= \text{fpc}\,\frac{s_{y,d}^2}{n_d}
\end{aligned}
$$

• The term $s_{y,d}/\sqrt{n_d}$ is the usual (sampling from an infinite population) standard error for a $\bar{Y}$.
• So, for the estimation of the domain mean, we treat the $n_d$ observations as an SRS and approximate $\text{fpc}_d$ with the fpc for the entire SRS.

Estimation of the total
• (If $N_d$ is known, we multiply $\bar{y}_d$ and the SE by $N_d$; MOE and CI follow.)
• When $N_d$ is unknown, we have a bit of a mess.
• The domain proportion in the population ($N_d/N$) and sample ($n_d/n$) should be approximately the same.
• So, $N_d$ is approximately $N(n_d/n)$.
• A consequence:
  $\hat{t}_{y,d} = N\bar{u}$
  $\text{SE}(\hat{t}_{y,d}) = N\sqrt{\text{fpc}} \times \text{SE}(\bar{u})$,
  where $\text{SE}(\bar{u})$ is the infinite population standard error estimate.
• This is a strange SE/variance.
• The variance is used to summarize variability in a situation where many/most observations are 0.

Example 3.7 (SLL078.sas)
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• B is the ratio of 1992 acres in western states to the number of counties in western states, the mean acres per county.
• The total of interest is the total number of 1992 acres in western states.

Get data and define West
    libname xxx 'C:\...\SASdata';
    data asrs;
      set xxx.agsrs;
      if state eq 'AK' . . . or state eq 'WY' then region = 'West';

Define u and x
      if region = 'West' then do; u=acres92; x=1; end;
      else do; u=0; x=0; end;
    run;

Estimates and SE for mean
    proc means data=asrs;
      var acres92;
      where region eq 'West';
      output out=a2 mean=ybard stderr=sey;
    run;

Estimates and SE for total
    proc means data=asrs;
      var u;
      output out=a3 mean=ubar stderr=seu;
    run;

Merge and calculate
    data a4;
      merge a2 a3;
      npop=3078; nsrs=300;
      fpc=1-nsrs/npop;
      seybard=sqrt(fpc)*sey;
      thatd=npop*ubar;
      seubar=sqrt(fpc)*seu;
      sethatd=npop*(seubar);
    run;

Print
    proc print data=a4;
      var ybard seybard thatd sethatd;
    run;

Output
    Obs    ybard           seybard     thatd           sethatd
    1      598680.58974    78520.29    239556051.18    46090456.81

Models
• Models:
  $Y_i = \beta x_i + \epsilon_i$, $\text{Var}(\epsilon_i) = \sigma^2 x_i$
  $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, $\text{Var}(\epsilon_i) = \sigma^2$
  - Get parameter estimates and make predictions and take summaries of predictions.
  - SEs are SEs of expected values.
• Estimates of parameters are generally the same for randomization theory and model-based inference.
• Estimators are "model unbiased".
  - Variability is in the $\epsilon$'s, not in the sampling design.
• Standard errors may be slightly different, calculated from
  - design or
  - model
• Requires stronger assumptions
  - If model is true, evaluate only large x's or x's far from mean.
  - Diagnostics are key.

Nonconstant sampling
• In SRS, each data point is sampled with equal probability ($\pi_i = n/N$).
• Sometimes, we want to sample with unequal probabilities.
  - to reduce bias (randomization)
  - to minimize variance (model)
    $\text{Var}(y_i) = \sigma^2 x_i \Rightarrow \pi_i \propto \sqrt{x_i}$

Comparison
• SRS, ratio, regression
• See page 88 for summary
• Ratio and regression estimators are useful when there is an informative x
• Ratio estimators are useful when we have cluster sampling
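As a rough illustration of the SRS / ratio / regression comparison, the following Python sketch applies all three estimators to the Example 3.6 dead-tree data listed earlier in these notes (n = 25, N = 100, population photo mean 11.3). Python is used only as a convenient calculator here. One assumption to flag: every standard error uses the common form sqrt(fpc · s_e² / n) with s_e² = Σe²/(n − 1), which matches the ratio formula in these notes; applying the n − 1 divisor to the regression residuals as well is an assumption of this sketch.

```python
import math

# Example 3.6 data from these notes: (photo count x, field count y), n = 25 plots.
pairs = [
    (10, 15), (12, 14), (7, 9), (13, 14), (13, 8),
    (6, 5), (17, 18), (16, 15), (15, 13), (10, 15),
    (14, 11), (12, 15), (10, 12), (5, 8), (12, 13),
    (10, 9), (10, 11), (9, 12), (6, 9), (11, 12),
    (7, 13), (9, 11), (11, 10), (10, 9), (10, 8),
]
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]

n, N, xbar_U = len(pairs), 100, 11.3   # N and xbar_U as given in Example 3.6
fpc = 1 - n / N
xbar = sum(x) / n
ybar = sum(y) / n

def se_mean(residuals):
    """sqrt(fpc * s_e^2 / n), with s_e^2 = sum(e^2) / (n - 1) (assumed divisor)."""
    s2 = sum(e * e for e in residuals) / (n - 1)
    return math.sqrt(fpc * s2 / n)

# SRS estimator: the sample mean of y.
est_srs = ybar
se_srs = se_mean([yi - ybar for yi in y])

# Ratio estimator: B_hat * xbar_U, residuals e_i = y_i - B_hat * x_i.
b_hat = ybar / xbar
est_ratio = b_hat * xbar_U
se_ratio = se_mean([yi - b_hat * xi for xi, yi in zip(x, y)])

# Regression estimator: ordinary least squares fit evaluated at xbar_U.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = ybar - b1 * xbar
est_reg = b0 + b1 * xbar_U
se_reg = se_mean([yi - (b0 + b1 * xi) for xi, yi in zip(x, y)])

for name, est, se in [
    ("SRS", est_srs, se_srs),
    ("ratio", est_ratio, se_ratio),
    ("regression", est_reg, se_reg),
]:
    print(f"{name:10s} mean = {est:7.3f}  SE = {se:5.3f}  total = {N * est:7.1f}")
```

The regression line reproduces the predicted value 11.9893 from the SAS output for observation 26, and on these data the estimated standard errors are ordered regression < ratio < SRS.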