Statistics 522: Sampling and Survey Techniques
Topic 3
Topic Overview
This topic will cover
• Ratio and Regression Estimation
• Estimation in Domains
Laplace
• Wanted to determine the population of France in 1802.
• The number of births is easy to obtain from public records.
• The size of the population is difficult to determine.
• Use births to predict population.
• Sampled 30 "communes"
  - Total population: 2,037,615
  - Births in last 3 years: 215,599, or 71,866.33 per year
• Persons per birth: 2,037,615 / 71,866.33 = 28.35
• Multiply births by 28.35.
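The arithmetic above can be sketched as a short SAS data step. The sample figures are the ones quoted in these notes; the national birth count `births_france` is a hypothetical placeholder (the notes do not give it), so the resulting population figure is only illustrative.

```sas
/* Laplace's ratio estimate from the 30 sampled communes */
data laplace;
  pop_sample  = 2037615;                 /* total population of the sampled communes */
  births_3yr  = 215599;                  /* births in those communes over 3 years */
  births_year = births_3yr / 3;          /* about 71,866.33 births per year */
  ratio       = pop_sample / births_year; /* persons per birth, about 28.35 */

  /* HYPOTHETICAL value: annual births for all of France (not given in the notes) */
  births_france = 1000000;
  pop_estimate  = ratio * births_france; /* ratio estimate of the population */
run;

proc print data=laplace;
  format births_year 12.2 ratio 8.2 pop_estimate comma15.;
run;
```

With the placeholder of one million births per year, the estimate comes out at roughly 28.35 million persons.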
Ratio estimation basics
• $y_i$ is the characteristic of interest (response variable).
• $x_i$ is the auxiliary variable or subsidiary variable.
• $t_y = \sum_{pop} y_i$
• $t_x = \sum_{pop} x_i$
• $B = t_y / t_x = \bar{y}_U / \bar{x}_U$
Procedure
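• We assume that $t_x$ is known.
• Therefore, $\bar{x}_U = t_x / N$ is known.
• Use an SRS and measure $y_i$ and $x_i$ in the sample.
• Calculate $\bar{y}$ and $\bar{x}$ for the sample.
• $\hat{B} = \bar{y} / \bar{x} = \sum_{sample} y_i / \sum_{sample} x_i$
• $\hat{t}_{y,ratio} = \hat{B}\, t_x$, or
• $\hat{\bar{y}}_{ratio} = \hat{B}\, \bar{x}_U$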
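As a minimal sketch of this procedure, the SAS steps below compute the ratio estimates from a tiny made-up sample. Everything here is hypothetical: the five (x, y) pairs, the known total `tx_pop = 600`, and the population size `N_pop = 50` are assumptions chosen only to make the arithmetic easy to follow.

```sas
/* HYPOTHETICAL data: an SRS of n = 5 units with auxiliary x and response y */
data srs;
  input x y;
  datalines;
10 12
12 15
8 9
15 18
11 13
;
run;

/* Sample sums of x and y */
proc means data=srs noprint;
  var x y;
  output out=sums sum=sum_x sum_y;
run;

/* Ratio estimates; tx_pop and N_pop are assumed known (hypothetical values) */
data ratio_est;
  set sums;
  tx_pop     = 600;                       /* known population total of x */
  N_pop      = 50;                        /* population size */
  Bhat       = sum_y / sum_x;             /* estimated ratio */
  t_ratio    = Bhat * tx_pop;             /* ratio estimate of the total of y */
  ybar_ratio = Bhat * (tx_pop / N_pop);   /* ratio estimate of the mean of y */
run;

proc print data=ratio_est;
  var Bhat t_ratio ybar_ratio;
run;
```

Here the sample sums are 56 for x and 67 for y, so Bhat = 67/56 ≈ 1.196, the estimated total is about 717.9, and the estimated mean is about 14.36.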
Why
Sometimes we are interested in the ratio itself:
• acres per farm
• yield per acre
• per capita income
Notes
• Both $\bar{y}$ and $\bar{x}$ are random variables.
  - They will differ from sample to sample.
  - We exclude the possibility that $\bar{x}$ is zero.
• The denominator often looks like a sample size.
  - The usual estimator for an SRS can be viewed this way: $x_i = 1$ for all items.
$$N = t_x = \sum_{pop} x_i, \qquad n = \sum_{sample} x_i$$
$$\hat{B} = \frac{\sum y_i}{\sum x_i} = \bar{y}$$
$$\hat{t} = \hat{B}\, t_x = \hat{B} N = N \bar{y}$$
Why (2)
• Sometimes we want to estimate a population total but N is not known, so we can't use $N\bar{y}$.
  - Estimate the number of fish in a catch.
  - Weigh and count a sample; weigh the total.
  - Multiply the average number of fish per pound by the total weight of the catch.
• Increase the precision
  - Laplace could have estimated the population of France by computing the average number of persons per commune and multiplying by the number of communes.
  - The ratio estimate has smaller MSE (because of the positive correlation between births and population size).
• We can adjust estimates to reflect demographic totals.
  - Example in text on page 62 concerning gender
  - This is called poststratification.
  - We will discuss this in Section 4.7 and Chapters 7 and 8.
• Can be used to adjust for nonresponse
  - Example in text on page 63
  - Discussed in Chapter 8
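The fish example can be sketched in a few lines of SAS. All three input numbers are hypothetical (the notes give none); the point is only that the total weight plays the role of the known $t_x$, so N is never needed.

```sas
/* HYPOTHETICAL numbers for the fish-in-a-catch example */
data fish;
  sample_count  = 120;    /* fish counted in the sample (sum of y) */
  sample_weight = 60;     /* pounds, weight of the sample (sum of x) */
  total_weight  = 2500;   /* pounds, weight of the whole catch (known t_x) */

  Bhat       = sample_count / sample_weight;  /* fish per pound */
  total_fish = Bhat * total_weight;           /* ratio estimate of the number of fish */
run;

proc print data=fish;
run;
```

With these made-up values, Bhat is 2 fish per pound and the estimated catch is 5000 fish.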
Example 3.2
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• Suppose we have the population totals for 1987 but have only the sample information for 1992.
• We want to estimate
  - average acres per farm
  - total acres
Get the sample data ( SLL063.sas )
*import the file agsrs.dat, check it,
then create a permanent SAS data set;
proc print data=asrs;
run;
libname xxx 'C:\Purdue\Stat522\SASdata';
data xxx.agsrs; set asrs;
run;
Plot the data
symbol1 v=circle i=sm70;
proc gplot data=asrs;
plot acres92*acres87/frame;
run;
proc reg data=asrs;
model acres92=acres87/noint;
run;
proc univariate data=asrs;
var acres92 acres87;
run;
[Figure: Regression through the origin]
[Figure: Smoothed plot]
Ratio estimate
proc univariate data=asrs;
var acres92 acres87;
Sums
Variable: ACRES92
Sum Observations 89369114
Variable: ACRES87
Sum Observations 90586117
A calculation
data acalc;
tot92=89369114;
tot87=90586117;
ratio=tot92/tot87;
output;
run;
proc print data=acalc;
run;
Output
Obs tot92 tot87 ratio
1 89369114 90586117 0.98657
The ratio is $\hat{B}$: in the sample of n = 300, 1992 acres are 98.7% of 1987 acres.
Total acres for 1987
proc univariate data=apop;
var acres87;
Output
Variable: ACRES87
Mean 313016.378
Sum Observations 963464412
NOTE: These values differ slightly from the values in the text.
The estimates
• Total acres for 1987: 963,464,412
• $\hat{B} = 0.98657$ (1992 acres / 1987 acres)
• Estimate of total 1992 acres is
  $\hat{B} \times$ (total 1987 acres) $= 0.98657 \times 963{,}464{,}412 \approx 950{,}500{,}000$, or about 950 million acres
• Estimate of mean acres per county for 1992 is
  $\hat{B} \times$ (mean 1987 acres per county) $= 0.98657 \times 313{,}016.378 \approx 309{,}000$
Comments
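• Ratio estimators are biased.
• The randomness enters through the sample means: $\hat{\bar{y}}_{ratio} = \hat{B}\,\bar{x}_{pop} = \bar{y}\,(\bar{x}_{pop}/\bar{x}_{sam})$, so $\bar{y}$ is multiplied by the random factor $\bar{x}_{pop}/\bar{x}_{sam}$.
• The relative bias: $|\mathrm{Bias}(\hat{B})| / \mathrm{SE}(\hat{B}) \le |\mathrm{CV}(\bar{x})|$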
Proof
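• Use $E(\hat{t}_x) = t_x$, $E(\hat{t}_y) = t_y$, and $\hat{t}_{y,ratio} = \frac{\hat{t}_y}{\hat{t}_x}\, t_x = \hat{B}\, t_x$.
$$\begin{aligned}
E(\hat{t}_{y,ratio} - t_y) &= E(\hat{t}_{y,ratio} - \hat{t}_y + \hat{t}_y - t_y) \\
&= E\left[\frac{\hat{t}_y}{\hat{t}_x}\, t_x - \hat{t}_y\right] \\
&= E\left[\frac{\hat{t}_y}{\hat{t}_x}\,(t_x - \hat{t}_x)\right] \\
&= -E[\hat{B}(\hat{t}_x - t_x)] \\
&= -\left[E(\hat{B}\hat{t}_x) - E(\hat{B})E(\hat{t}_x)\right] \\
&= -\mathrm{Cov}(\hat{B}, \hat{t}_x)
\end{aligned}$$
• Therefore,
$$\begin{aligned}
|E(\hat{B} - B)| &= \left|\frac{E(\hat{t}_{y,ratio} - t_y)}{t_x}\right|
= \left|\frac{\mathrm{Cov}(\hat{B}, \hat{t}_x)}{t_x}\right|
= \left|\frac{\mathrm{Cov}(\hat{B}, \bar{x}_{sam})}{\bar{x}_{pop}}\right| \\
&= \left|\frac{\mathrm{Corr}(\hat{B}, \bar{x})}{\bar{x}_{pop}}\right| \sqrt{\mathrm{Var}(\hat{B})\,\mathrm{Var}(\bar{x})} \\
&\le \mathrm{SE}(\hat{B})\,\mathrm{SE}(\bar{x}) / |\bar{x}_{pop}| \\
&= \mathrm{SE}(\hat{B})\,|\mathrm{CV}(\bar{x})|
\end{aligned}$$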
More Comments
• A similar argument shows that
$$\mathrm{Bias}(\hat{B}) \approx \mathrm{fpc}\,\frac{1}{n\bar{x}_U^2}\left(B S_x^2 - \mathrm{Cov}(x, y)\right)$$
• The bias will be small when
  - n is large.
  - the sampling fraction n/N is large.
  - $\bar{x}_U$ is large.
  - the standard deviation of x, $S_x$, is small.
  - the correlation R between y and x is close to one.
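A short worked step may help show why the correlation matters. It uses only the standard identity $\mathrm{Cov}(x,y) = R\,S_x S_y$, where $S_y$ (the population standard deviation of y) is a symbol assumed here and not defined in the notes above.

```latex
\begin{align*}
B S_x^2 - \mathrm{Cov}(x,y) &= B S_x^2 - R\,S_x S_y = S_x\,(B S_x - R\,S_y).
\end{align*}
```

If $y_i = B x_i$ exactly for every unit with $B > 0$, then $S_y = B S_x$ and $R = 1$, so the bracket vanishes and the approximate bias is zero.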
MSE of B^
$$\widehat{\mathrm{var}}(\hat{B}) \approx \mathrm{MSE}(\hat{B}) \approx \frac{\mathrm{Var}(\bar{d})}{\bar{x}_U^2},$$
where $d_i = y_i - B x_i$.
Idea behind proof
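$$\hat{B} - B = \frac{\bar{y}}{\bar{x}} - B = \frac{\bar{y} - B\bar{x}}{\bar{x}}$$
• Note that we have the random variable $\bar{x}$ in the denominator of this expression.
• Approximate it by $\bar{x}_U$.
$$\hat{B} - B \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U}$$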
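The remaining step from this approximation to the MSE formula can be sketched as follows, writing $\bar{d}$ for the sample mean of the $d_i = y_i - Bx_i$ and using the fact that the sample mean is unbiased under an SRS:

```latex
\begin{align*}
\bar{y} - B\bar{x} &= \frac{1}{n}\sum_{sample}(y_i - Bx_i) = \bar{d}, \\
E(\bar{d}) &= \bar{y}_U - B\bar{x}_U = 0, \\
\mathrm{MSE}(\hat{B}) = E\big[(\hat{B} - B)^2\big]
  &\approx \frac{E(\bar{d}^{\,2})}{\bar{x}_U^2}
   = \frac{\mathrm{Var}(\bar{d})}{\bar{x}_U^2}.
\end{align*}
```

Because $E(\bar{d}) = 0$, the second moment of $\bar{d}$ equals its variance, which is why the MSE is approximated by a variance.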
Standard error for B^
• To estimate the standard deviation of $\hat{B}$, substitute sample estimates for unknown quantities:
$$\widehat{\mathrm{var}}(\hat{B}) = \mathrm{fpc}\,\frac{s_e^2}{n\bar{x}_U^2},$$
where $s_e^2 = \sum e_i^2 / (n-1)$ and $e_i = y_i - \hat{B}x_i$.
• The SE is the square root of the variance.
Other standard errors
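• $\hat{t}_{ratio} = t_x \hat{B}$
  - The estimated variance of $\hat{t}_{ratio}$ is
$$t_x^2\,\widehat{\mathrm{var}}(\hat{B}) = \mathrm{fpc}\,\frac{N^2}{n}\,s_e^2$$
• $\hat{\bar{y}}_{ratio} = \bar{x}_U \hat{B}$
  - The estimated variance of $\hat{\bar{y}}_{ratio}$ is
$$\bar{x}_U^2\,\widehat{\mathrm{var}}(\hat{B}) = \frac{\mathrm{fpc}}{n}\,s_e^2$$
• Take square roots to obtain the SEs.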
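The three standard errors can be sketched in SAS as below. This is an illustration under stated assumptions, not the course program: the five (x, y) pairs, `N_pop = 50`, and `tx_pop = 600` are all hypothetical values.

```sas
/* HYPOTHETICAL data: SRS of n = 5 from N = 50 units, with t_x = 600 assumed known */
data srs;
  input x y;
  datalines;
10 12
12 15
8 9
15 18
11 13
;
run;

proc means data=srs noprint;
  var x y;
  output out=sums sum=sum_x sum_y;
run;

/* Residuals e_i = y_i - Bhat * x_i */
data resid;
  if _n_ = 1 then set sums;          /* attach the sums to every record */
  set srs;
  Bhat = sum_y / sum_x;
  e = y - Bhat * x;
run;

/* s2_e equals sum(e^2)/(n-1) because the residuals sum to zero */
proc means data=resid noprint;
  var e;
  output out=evar var=s2_e n=n;
run;

data se;
  merge sums evar;
  N_pop   = 50;                      /* hypothetical population size */
  tx_pop  = 600;                     /* hypothetical known total of x */
  xbarU   = tx_pop / N_pop;
  fpc     = 1 - n / N_pop;
  Bhat    = sum_y / sum_x;
  var_B   = fpc * s2_e / (n * xbarU**2);
  se_B    = sqrt(var_B);             /* SE of Bhat */
  se_tot  = tx_pop * se_B;           /* SE of the ratio estimate of the total */
  se_mean = xbarU * se_B;            /* SE of the ratio estimate of the mean */
run;

proc print data=se;
  var Bhat se_B se_tot se_mean;
run;
```

The `var=` statistic in PROC MEANS divides by n - 1, which matches the definition of $s_e^2$ here only because $\sum e_i = \sum y_i - \hat{B}\sum x_i = 0$.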
Confidence intervals
For 95%, the general form is
estimate ± 1.96 SE(estimate)
Example 3.3
• The US Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• Estimate total acres for 1992 using a ratio estimate.
• B is the ratio of 1992 acres to 1987 acres for the population; $\hat{B}$ is the same ratio for the sample of n = 300.
• We use the known value of total acres in 1987 for the population of N = 3078 counties.
The estimates
• Total acres for 1987: 963,464,412
• $\hat{B} = 0.98657$ (1992 acres / 1987 acres)
• Estimate of total 1992 acres is
  $\hat{B} \times$ (total 1987 acres) $= 0.98657 \times 963{,}464{,}412 \approx 950{,}500{,}000$, or about 950 million acres
• We need to calculate the standard error and a 95% confidence interval.
Estimate B (SLL068.sas)
libname xxx 'xxx\SASdata';
data asrs; set xxx.agsrs;
proc means data=asrs;
var acres87 acres92;
output out=a2 sum=sum87 sum92;
run;
data a2; set a2; Bhat=sum92/sum87;
proc print data=a2;
run;
Output
Obs sum87 sum92 Bhat
1 90586117 89369114 0.98657
Define e and compute SE
data asrs2; set asrs;
if _n_ eq 1 then set a2;
e=acres92-Bhat*acres87;
proc means data=asrs2;
var e;
output out=a4 stderr=se_e n=nsrs;
Find population total for 1987
proc means data=xxx.agpop;
var acres87;
output out=a5 sum=sum87pop n=Npop;
Put it together
data a6; merge a2 a4 a5;
fpc=(1-nsrs/Npop);
var_tot=(Npop*Npop)*fpc*se_e*se_e;
se_tot=sqrt(var_tot);
moe=1.96*se_tot;
tot_est=bhat*sum87pop;
lcl95=tot_est-moe;
ucl95=tot_est+moe;
Print
proc print data=a6;
var tot_est se_tot moe lcl95 ucl95;
Output
Obs tot_est se_tot moe lcl95 ucl95
1 950520496 5344567 10475351 940045144 960995848
Evaluation
• The 95% CI is 940 to 961 million acres.
• The SE for the ratio estimator is 5.3 million acres.
• If we use the SRS estimate (N × mean acres for the sample), the standard error is 58.2 million acres.
• The ratio estimate works very well for this problem.
MSE approximation
The text (page 71) suggests that the approximation may severely underestimate the true MSE (i.e., miss the bias) unless
• the sample size n is at least 30
• $\mathrm{CV}(\bar{x}) \le 0.1$
• $\mathrm{CV}(\bar{y}) \le 0.1$
When is the ratio estimate better?
• When the deviations of $y_i$ from $\bar{y}$ are larger than the deviations of $y_i$ from $\hat{B}x_i$.
• We want to compare the MSEs of the usual and the ratio estimators.
• The MSE of the ratio estimator is smaller ($\mathrm{MSE}(\hat{\bar{y}}_{ratio}) \le \mathrm{MSE}(\bar{y})$) whenever
$$R \ge \frac{\mathrm{CV}(x)}{2\,\mathrm{CV}(y)}$$
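The condition can be derived in a few lines from the MSE approximation for the ratio estimator. This sketch assumes $B > 0$ and $S_x > 0$, and takes $\mathrm{CV}(x) = S_x/\bar{x}_U$ and $\mathrm{CV}(y) = S_y/\bar{y}_U$, where $S_x$ and $S_y$ are the population standard deviations.

```latex
\begin{align*}
\mathrm{MSE}(\hat{\bar{y}}_{ratio}) &\approx \frac{\mathrm{fpc}}{n}\,\big(S_y^2 - 2BRS_xS_y + B^2S_x^2\big),
\qquad
\mathrm{Var}(\bar{y}) = \frac{\mathrm{fpc}}{n}\,S_y^2, \\
\mathrm{MSE}(\hat{\bar{y}}_{ratio}) \le \mathrm{Var}(\bar{y})
&\iff B^2S_x^2 \le 2BRS_xS_y
\iff R \ge \frac{B\,S_x}{2\,S_y} = \frac{\mathrm{CV}(x)}{2\,\mathrm{CV}(y)}.
\end{align*}
```

The first line is the variance of $d_i = y_i - Bx_i$ expanded term by term, and the last equality uses $B = \bar{y}_U/\bar{x}_U$.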
Model
If the relationship between y and x is a straight line through the origin with variance proportional to x, $\hat{B}$ is the weighted least squares estimate of the slope with weights proportional to 1/x.
$$\hat{B} = \arg\min_{B} \sum_{sam} \frac{1}{x_i}\,(y_i - Bx_i)^2$$
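To see that this weighted least squares problem reproduces the ratio estimator, set the derivative with respect to B to zero (assuming every $x_i > 0$):

```latex
\begin{align*}
\frac{d}{dB}\sum_{sam}\frac{(y_i - Bx_i)^2}{x_i}
  &= -2\sum_{sam}\frac{x_i\,(y_i - Bx_i)}{x_i}
   = -2\sum_{sam}(y_i - Bx_i) = 0 \\
\Longrightarrow\quad
\hat{B} &= \frac{\sum_{sam} y_i}{\sum_{sam} x_i} = \frac{\bar{y}}{\bar{x}}.
\end{align*}
```

The weight $1/x_i$ cancels the factor $x_i$ produced by differentiation, which is what turns the usual least squares slope into a simple ratio of sums.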
Ratio estimators for proportions
• $\hat{B} = \bar{y}/\bar{x}$ is the quantity of interest.
• Use the same approach.
• See Section 3.1.3, including Example 3.5 on pages 72-73.
Regression estimation
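• Statistics 512
• The regression fit is $\hat{\bar{y}}_{reg} = \hat{B}_0 + \hat{B}_1\bar{x}_{pop} = \bar{y} + \hat{B}_1(\bar{x}_{pop} - \bar{x}_{sam})$
• Ratio estimator: $\hat{B}_0 = 0$.
• The regression estimate is the predicted value when we substitute $\bar{x}_U$ for x.
• We need to do more work to calculate the standard error, MOE and CI.
$$\mathrm{MSE}(\hat{\bar{y}}_{reg}) \approx \frac{\mathrm{fpc}}{n(N-1)}\,\mathrm{RSS}_{pop}$$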
Example 3.6
• Estimate the number of dead trees in an area.
• Divide the area into 100 square plots.
• Photo counts are easy (x), available for all N = 100 plots.
• For a sample of n = 25 plots, measure actual numbers of dead trees (y).
Estimation
• Use a regression to describe the relationship between the actual count of dead trees (y) and the photo number of dead trees (x) in the n = 25 sample.
• Find the average number of dead trees in the photos ($\bar{x}_U = 11.3$) and use this to get the predicted average number of dead trees in the population.
• Multiply by N to get the total.
Enter the data ( SLL075.sas )
data a1;
input photo field @@;
datalines;
10 15 12 14 7 9 13 14 13 8
6 5 17 18 16 15 15 13 10 15
14 11 12 15 10 12 5 8 12 13
10 9 10 11 9 12 6 9 11 12
7 13 9 11 11 10 10 9 10 8
11.3 .
;
Run the regression
proc reg data=a1;
model field=photo/clm;
run;
Output
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1  5.059               1.763           2.87     0.0087
photo       1  0.613               0.160           3.83     0.0009
Predicted average
Obs  Dep Var (field)  Predicted Value
26   .                11.9893
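The predicted average can be checked by hand from the printed coefficients. The small gap between the hand calculation and the SAS value comes from using coefficients rounded to three decimals.

```latex
\begin{align*}
\hat{\bar{y}}_{reg} &= \hat{B}_0 + \hat{B}_1\,\bar{x}_U
  \approx 5.059 + 0.613 \times 11.3 = 11.9859 \approx 11.99, \\
\hat{t}_{y,reg} &= N\,\hat{\bar{y}}_{reg} \approx 100 \times 11.9893 \approx 1199 \text{ dead trees}.
\end{align*}
```

The second line uses the unrounded SAS prediction for observation 26, the extra record with photo = 11.3 and a missing field value.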
Standard error
• Different approximations are available.
• We will use (3.14) on page 75:
$$\widehat{\mathrm{var}}(\hat{\bar{y}}_{reg}) = \mathrm{fpc}\,\frac{s_e^2}{n}$$
• See Example 3.6 on pages 75-76.
Difference estimation
• Special case where the slope is assumed to be one
• Many cases where the difference between y and x is zero
• Differences are equally likely to be positive or negative
• Sometimes useful in auditing
• See text Section 3.2.2 on page 77
Estimation in domains
• Suppose we want to estimate a mean and/or a total for a subset of the population.
• We call the subpopulation a domain or a subdomain.
• In Example 3.7 on page 79, we use the SRS of counties to estimate mean and total 1992 acres for the western states.
One view
• After we take the sample of size n, select the observations that are within the domain of interest and proceed as if this were an SRS of size $n_d$ from the domain.
• The fpc would be $\mathrm{fpc}_d = 1 - n_d/N_d$.
• Note, this requires that we know $N_d$.
• We also need $N_d$ to convert $\bar{y}_d$ to $\hat{t}_d$.
A technical difficulty
• The sample size in the domain, $n_d$, is a random variable; it varies from sample to sample.
• With this view we are ignoring the variability introduced into our estimate from this fact.
• We are conditioning on $n_d$.
• This is what we do in regression when we treat the values of the explanatory variables as fixed and not random.
Section 3.3 view
• For $\bar{y}_d$, use the mean of the observations in the sample that are in the domain.
• Treat the denominator $n_d$ as a random variable.
• In this view, $\bar{y}_d$ is a ratio estimator (a $\hat{B}$) and the methods of this chapter apply.
Some details
• Let $u_i = y_i$ if i is in the domain, and 0 otherwise.
• Similarly, $x_i = 1$ if i is in the domain, and 0 otherwise.
• $\bar{y}_d = \hat{B} = \dfrac{\sum_{sam} u_i}{\sum_{sam} x_i} = \dfrac{\sum_{dom} y_i}{n_d}$
• $\hat{t}_d = N_d\,\bar{y}_d$ if $N_d$ is known
• Use the formula on the next to the last line of page 78 for $\mathrm{SE}(\bar{y}_d)$.
• This is a rewrite of the formula for a ratio estimate
$$\widehat{\mathrm{var}}(\hat{B}) = \frac{\mathrm{fpc}}{n\bar{x}_{pop}^2}\,\frac{\sum_{sam}(y_i - \hat{B}x_i)^2}{n-1}$$
using the notation of this section.
• Use $\mathrm{SE}(\hat{t}_d) = N_d\,\mathrm{SE}(\bar{y}_d)$ for the total.
$N_d$ unknown
• Use the same estimator for the domain mean: the mean of this subset of the sample, $\bar{y}_d$.
• Use the last line on page 78 for the SE of $\bar{y}_d$, an approximation that assumes
  - $\dfrac{n_d}{n} \approx \dfrac{N_d}{N}$.
  - $\dfrac{n_d - 1}{n - 1} \approx \dfrac{n_d}{n}$.
Standard error for $\bar{y}_d$
$$\mathrm{SE}(\bar{y}_d) = \sqrt{\mathrm{fpc}}\;\frac{s_{y,d}}{\sqrt{n_d}}$$
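• Proof
$$\begin{aligned}
\widehat{\mathrm{var}}(\bar{y}_d) &= \frac{\mathrm{fpc}}{n\bar{x}_{pop}^2}\,\frac{\sum_{sam}(u_i - \hat{B}x_i)^2}{n-1} \\
&= \frac{\mathrm{fpc}}{n\bar{x}_{pop}^2}\,\frac{\sum_{sam}(y_i - \hat{B}x_i)^2\, I(i \in dom)}{n-1} \\
&= \frac{\mathrm{fpc}}{n}\left(\frac{N}{N_d}\right)^2 \frac{\sum_{dom}(y_i - \hat{B}x_i)^2}{n-1} \\
&\approx \frac{\mathrm{fpc}}{n}\left(\frac{n}{n_d}\right)^2 \frac{(n_d - 1)\,s_{y,d}^2}{n-1} \\
&\approx \frac{\mathrm{fpc}}{n}\left(\frac{n}{n_d}\right)^2 \frac{n_d\,s_{y,d}^2}{n} \\
&= \mathrm{fpc}\,\frac{s_{y,d}^2}{n_d}
\end{aligned}$$
• The term $s_{y,d}/\sqrt{n_d}$ is the usual (sampling from an infinite population) standard error for a sample mean $\bar{Y}$.
• So, for the estimation of the domain mean, we treat the $n_d$ observations as an SRS and approximate $\mathrm{fpc}_d$ with the fpc for the entire SRS.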
Estimation of the total
• (If $N_d$ is known, we multiply $\bar{y}_d$ and the SE by $N_d$; MOE and CI follow.)
• When $N_d$ is unknown, we have a bit of a mess.
• The domain proportion in the population ($N_d/N$) and sample ($n_d/n$) should be approximately the same.
• So, $N_d$ is approximately $N\left(\dfrac{n_d}{n}\right)$.
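Substituting this approximation for $N_d$ into the domain total gives an estimator that no longer depends on $N_d$. The step below uses $u_i = y_i$ inside the domain and $u_i = 0$ outside it, as defined earlier in this topic.

```latex
\begin{align*}
\bar{u} &= \frac{1}{n}\sum_{sam} u_i = \frac{1}{n}\sum_{dom} y_i = \frac{n_d}{n}\,\bar{y}_d, \\
\hat{t}_{y,d} &= N_d\,\bar{y}_d \approx N\,\frac{n_d}{n}\,\bar{y}_d = N\,\bar{u}.
\end{align*}
```

So the domain total is estimated by treating $u$ as an ordinary variable measured on the whole SRS.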
A consequence
$$\hat{t}_{y,d} = N\bar{u}$$
$$\mathrm{SE}(\hat{t}_{y,d}) = N\sqrt{\mathrm{fpc}} \times \mathrm{SE}(\bar{u}),$$
• where $\mathrm{SE}(\bar{u})$ is the infinite population standard error estimate.
• This is a strange SE/variance.
• The variance is used to summarize variability in a situation where many/most observations are 0.
Example 3.7 ( SLL078.sas )
• The US Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• B is the ratio of 1992 acres in western states to the number of counties in western states, the mean acres per county.
• The total of interest is the total number of 1992 acres in western states.
Get data and define West
libname xxx 'C:\...\SASdata';
data asrs; set xxx.agsrs;
if state eq 'AK'
. . .
or state eq 'WY'
then region = 'West';
Define u and x
if region = 'West'
then do;
u=acres92; x=1;
end;
else do;
u=0; x=0; end;
run;
Estimates and SE for mean
proc means data=asrs;
var acres92;
where region eq 'West';
output out=a2
mean=ybard
stderr=sey;
run;
Estimates and SE for total
proc means data=asrs;
var u;
output out=a3
mean=ubar
stderr=seu;
run;
Merge and calculate
data a4; merge a2 a3;
npop=3078; nsrs=300;
fpc=1-nsrs/npop;
seybard=sqrt(fpc)*sey;
thatd=npop*ubar;
seubar=sqrt(fpc)*seu;
sethatd=npop*(seubar);
Print
proc print data=a4;
var ybard seybard thatd sethatd;
Output
Obs ybard seybard thatd sethatd
1 598680.58974 78520.29 239556051.18 46090456.81
Models
• Models:
$$Y_i = \beta x_i + \varepsilon_i, \qquad \mathrm{Var}(\varepsilon_i) = \sigma^2 x_i$$
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \mathrm{Var}(\varepsilon_i) = \sigma^2$$
  - Get parameter estimates, make predictions, and take summaries of predictions.
  - SEs are SEs of expected values.
• Estimates of parameters are generally the same for randomization theory and model-based inference.
• Estimators are "model unbiased".
  - Variability is in the $\varepsilon$'s, not in the sampling design.
• Standard errors may be slightly different, calculated from
  - design or
  - model
• Requires stronger assumptions
  - If the model is true, evaluate only large x's or x's far from the mean.
  - Diagnostics are key.
Nonconstant sampling
• In SRS, each data point is sampled with equal probability ($\pi_i = n/N$).
• Sometimes, we want to sample with unequal probabilities.
  - to reduce bias (randomization)
  - to minimize variance (model)
$$\mathrm{Var}(y_i) = \sigma^2 x_i \;\Rightarrow\; \pi_i \propto \sqrt{x_i}$$
Comparison
• SRS, ratio, regression
• See page 88 for summary
• Ratio and regression estimators are useful when there is an informative x
• Ratio estimators are useful when we have cluster sampling