Lecture 1 
Review of Fundamental Statistical Concepts 
 
Measures of Central Tendency and Dispersion 
 
A word about notation for this class: 
 
Individuals in a population are designated Yi, where the index “i” ranges from 1 to N, and N is 
the total number of individuals in the population. Individuals in a random sample taken from a 
population are also denoted Yi, but in this case the index “i” ranges from 1 to n, where n is the 
total number of individuals in the sample. 
 
Greek letters will be used for population parameters (e.g. µ = population mean; σ² = population
variance), while Roman letters will be used for estimates of population parameters, based on
random sampling (e.g. Ȳ = sample mean ≈ population mean = µ; s² = sample variance ≈
population variance = σ²).
 
Basic formulas: 
 
Mean or average (a measure of central tendency) 
 
 Population mean:  \mu = \dfrac{\sum_{i=1}^{N} Y_i}{N}

 Sample mean:  \bar{Y} = \dfrac{\sum_{i=1}^{n} Y_i}{n}
 
 
Variance (a measure of dispersion of individuals about the mean) 
 
 Population variance:  \sigma^2 = \dfrac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}

 Sample variance:  s^2 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}
 
 The quantities (Y_i - \bar{Y}) are called the deviations.
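These formulas can be sketched in Python with the standard library's statistics module (the data values below are invented for illustration):

```python
import statistics

# Hypothetical data: 8 observations (invented for illustration).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)            # (1/n) * sum of the Yi
deviations = [y - mean for y in sample]   # the quantities (Yi - mean)

# The deviations about the mean always sum to zero.
assert sum(deviations) == 0

# Treating the 8 values as a complete population divides by N;
# treating them as a sample divides by n - 1.
pop_var = statistics.pvariance(sample)    # 32/8 = 4.0
samp_var = statistics.variance(sample)    # 32/7, about 4.571
print(pop_var, samp_var)
```

Note that the sample variance is the larger of the two; dividing by n - 1 corrects the tendency of a sample to underestimate the dispersion of its parent population.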
 
 
 
 
 
 
Standard deviation (a measure of dispersion in the original units of observation) 
 
 Population standard deviation:  \sigma = \sqrt{\sigma^2}

 Sample standard deviation:  s = \sqrt{s^2}
 
Coefficient of variation 
 
In some situations, it is useful to express the standard deviation in units of the population mean. For this 
purpose, we have a quantity called the coefficient of variation: 
 
 Population coefficient of variation:  CV = \dfrac{\sigma}{\mu}

 Sample coefficient of variation:  CV = \dfrac{s}{\bar{Y}}
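A quick sketch of the sample versions in Python (statistics module; the weights are invented for illustration):

```python
import statistics

# Hypothetical sample of finch weights in grams (invented values).
weights = [15.1, 18.3, 16.7, 19.0, 17.2, 14.8, 18.9, 16.4]

ybar = statistics.mean(weights)     # sample mean
s = statistics.stdev(weights)       # sample standard deviation, sqrt(s^2)

# The CV expresses the dispersion relative to the mean,
# so it is unit-free (often reported as a percentage).
cv = s / ybar
print(f"s = {s:.2f} g, CV = {100 * cv:.1f}%")
```

Because the CV is unit-free, it allows dispersion to be compared across traits measured on different scales.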
 
 
Measures of dispersion of sample means 
 
Another important population parameter we will work with in this class is the variance of the
mean (\sigma_{\bar{Y}}^2). If you repeatedly sample a population by taking samples of size n, the variance of those
sample means is what we call the variance of the mean. It relates very simply to the population
variance, in this way:
 
 Variance of the mean:  \sigma_{\bar{Y}}^2 = \dfrac{\sigma^2}{n}
 
We can estimate \sigma_{\bar{Y}}^2 for a population by taking r independent, random samples of size n from that
population, calculating the sample means \bar{Y}_i, and then calculating the variance of those sample means. In
other words, if \bar{Y}_i is the mean of the ith sample and \bar{\bar{Y}} is the overall mean for all r samples, then what we
find is:

 s_{\bar{Y}}^2 = \dfrac{\sum_{i=1}^{r} (\bar{Y}_i - \bar{\bar{Y}})^2}{r - 1} \approx \sigma_{\bar{Y}}^2
 
The square root of \sigma_{\bar{Y}}^2 is called the standard deviation of a mean, or more often the standard error.
 
 Sample Standard Error:  s_{\bar{Y}} = \sqrt{s_{\bar{Y}}^2} = \dfrac{s}{\sqrt{n}}
As with the standard deviation, this is a quantity in the original units of observation. As you will see, the 
standard error is extremely useful due to the role it plays in determining confidence intervals and the 
powers of tests. 
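The relation \sigma_{\bar{Y}}^2 = \sigma^2/n is easy to verify by simulation. A minimal sketch in pure Python, assuming an arbitrary normal population with µ = 50 and σ = 10:

```python
import random
import statistics

random.seed(42)
mu, sigma, n, r = 50.0, 10.0, 25, 2000   # arbitrary population and sampling plan

# Draw r independent random samples of size n; record each sample mean.
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(r)
]

# The variance of the sample means estimates sigma^2 / n = 100 / 25 = 4.
var_of_means = statistics.variance(sample_means)
print(var_of_means)   # should come out close to 4
```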
 
The Normal Distribution 
 
If you measure a quantitative trait on a population of meaningfully related individuals, what you often 
find is that most of the measurements will cluster near the population mean (µ). And as you consider 
values further and further from µ, individuals exhibiting those values become rarer. Graphically, such a 
situation can be visualized in terms of a frequency distribution, as shown below:

[Figure: a bell-shaped frequency distribution, with frequency of observation plotted against
observed value and the peak at µ.]
Some basic characteristics of this kind of distribution are: 
 
1) The maximum value occurs at µ (i.e. the most probable value of an individual pulled 
randomly from the population is µ; another way of saying this is that the expected 
value of an individual pulled randomly from this population is µ); 
2) The dispersion is symmetric about µ (i.e. the mean, median, and mode of the 
population are equal); and 
3) The “tails” asymptotically approach zero. 
 
A distribution which meets these basic criteria is known as a normal distribution. 
 
The following conditions tend to result in a normal distribution of a quantitative trait: 
 
1) There are many factors which contribute to the observed value of the trait; 
2) These many factors act independently of one another; and 
3) The individual effects of these factors are additive and of comparable magnitude. 
 
As it turns out, a great many variables of interest are approximately normally distributed. Indeed, the 
normal distribution is observed for characters in complex systems of all kinds: Biological, 
socioeconomic, industrial, etc. 
 
The bell-shaped normal distribution is also known as a Gaussian curve, named after Carl Friedrich Gauss,
who worked out the formal mathematics underlying functions of this type. Specifically, a normal probability
density function of mean µ and variance σ² is described by the expression:
 
 Z(Y) = \dfrac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{Y - \mu}{\sigma} \right)^2}
where Z(Y) is the height of the curve at a given observed value Y. Notice that the location and shape of a 
normal probability density function are uniquely determined by only two parameters, µ and σ². By
varying the value of µ, one can center Z(Y) anywhere on the x-axis. By varying σ², one can freely adjust
the width of the central hump. All of the statistical techniques we will discuss in this class are based on 
the idea that many systems we study in the real world can be modeled by this theoretical function Z(Y). 
Such techniques fall into the broad category of parametric statistics, because the ultimate objectives of 
these techniques are to estimate and compare the theoretical parameters (in this case, µ and σ2) which best 
explain our observations. 
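The density function above can be coded directly and checked against the normal distribution built into Python's standard library (the parameter values used here are arbitrary):

```python
import math
from statistics import NormalDist

def z_of_y(y: float, mu: float, sigma: float) -> float:
    """Height of the normal curve with mean mu and variance sigma^2 at y."""
    return (1.0 / (sigma * math.sqrt(2.0 * math.pi))) * math.exp(
        -0.5 * ((y - mu) / sigma) ** 2
    )

# Agreement with the library implementation at a few arbitrary points:
for y in (10.0, 17.2, 25.0):
    assert math.isclose(z_of_y(y, 17.2, 6.0), NormalDist(17.2, 6.0).pdf(y))
```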
 
If we set µ = 0 and σ² = 1, we obtain an especially useful normal probability density function known as
the standard normal curve [N(0,1)]: 
 
 Z_{[0,1]}(Y) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{Y^2}{2}} = N(0,1)
 
A word about notation: Rather than Y, it is traditional to use the letter Z to represent a random variable 
drawn from the standard normal distribution: 
 
 N(0,1) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{Z^2}{2}}

[Figure: the standard normal curve N(0,1), centered at 0 with σ² = 1.]
As with all probability density functions, the total area under the curve equals 1. On page 612 of your 
book (Appendix A4), you will find a table of the properties of this curve. For any given positive value of 
Z, the table reports the area under the curve to the right of Z. This is useful because the area to the right 
of Z is the theoretical probability of randomly picking an individual from N(0,1) whose value is greater 
than Z. 
 
How does this help us in the real world? It helps us because ANY normal distribution can be 
standardized (i.e. any normal distribution can be converted into N(0,1)). The way this is done is quite 
simple:

 Z_i = \dfrac{Y_i - \mu}{\sigma}
Subtracting µ from each observation Yi shifts the mean of the distribution to 0. Dividing by σ changes
the scale of the x-axis from the original units of observation to units of standard deviation and thus makes
the standard deviation (and the variance) of the distribution equal to 1. What this means is that for any
unique individual Yi from a normal distribution of mean µ and variance σ², there is a corresponding
unique value Zi (i.e. a normal score) in the standard normal curve. And since we know the theoretical
probability of picking an individual of a certain value at random from N(0,1), we now have a way of 
determining the probability of picking an individual of a certain value at random from any normally 
distributed population. Some examples: 
 
 
Question 1: From a normally distributed population of finches with mean weight (µ) = 17.2 g and
variance (σ²) = 36 g², what is the probability of randomly selecting an individual finch weighing more
than 22 g?
 
Solution: To answer this, first convert the value 22 g to its corresponding normal score: 
 
 Z_i = \dfrac{Y_i - \mu}{\sigma} = \dfrac{22~\text{g} - 17.2~\text{g}}{6~\text{g}} = 0.8
 
From Table A4, we see that 21.19% of the area under N(0,1) lies to the right of Z = 0.8. Therefore,
there is a 21.19% chance of randomly selecting an individual finch weighing more than 22 g from this 
population. In other words, 22 g is not an unusual weight for a finch in this population. 
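In Python, the table lookup can be reproduced with the standard library's NormalDist, which supplies the same areas as the printed table:

```python
from statistics import NormalDist

finches = NormalDist(mu=17.2, sigma=6.0)     # sigma = sqrt(36 g^2) = 6 g

z = (22.0 - 17.2) / 6.0                      # standardize: Z = (Y - mu)/sigma
p = 1.0 - NormalDist().cdf(z)                # area to the right of Z = 0.8

# Equivalently, work directly in the original units without standardizing:
p_direct = 1.0 - finches.cdf(22.0)

print(round(p, 4), round(p_direct, 4))       # both 0.2119
```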
 
It helps to visualize these problems graphically, especially as they get more complicated:

[Figure: the finch-weight distribution with the area to the right of Y = 22 shaded; on the standard
normal scale, this is the area to the right of Z = 0.8, so P(Y ≥ 22) = P(Z ≥ 0.8) = 0.2119.]
Incidentally, we also see, simply by symmetry, that we have a 21.19% chance of randomly selecting an 
individual finch weighing less than 12.4 g. Do you see why? 
 
 
 
Question 2: From a normally distributed population of finches with mean weight (µ) = 17.2 g and
variance (σ²) = 36 g², what is the probability of randomly selecting a sample of 20 finches with an average
weight of more than 22 g?
 
Solution: The difference between this question and the previous one is that the previous question was 
asking the probability of selecting an individual of a certain value at random while this question is asking 
for the probability of selecting a sample of a certain average value at random. For individuals, the 
appropriate distribution to consider is the normal distribution of the population of individuals (µ = 17.2 g 
and σ² = 36 g²). But for samples of size n = 20, the appropriate distribution to consider is the normal
distribution of sample means for sample size n = 20 (µ = 17.2 g and
\sigma_{\bar{Y}(20)}^2 = \frac{\sigma^2}{n} = \frac{36~\text{g}^2}{20} = 1.8~\text{g}^2).
 
With this in mind, we proceed as before: 
 
 
 Z_i = \dfrac{\bar{Y}_i - \mu}{\sigma_{\bar{Y}(20)}} = \dfrac{22~\text{g} - 17.2~\text{g}}{1.34~\text{g}} = 3.6
 
From Table A4, we see that only 0.02% of the area under N(0,1) lies to the right of Z = 3.6. Therefore,
there is a mere 0.02% chance of randomly selecting a sample of 20 finches with an average weight of 
more than 22 g from this population. In other words, 22 g is an extremely unusual mean weight for a 
sample of twenty finches in this population. 
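The same check in Python; the only change from Question 1 is swapping σ for the standard error σ/√n:

```python
import math
from statistics import NormalDist

mu, sigma, n = 17.2, 6.0, 20
se = sigma / math.sqrt(n)               # sigma of the means = 6/sqrt(20), about 1.34 g

# Distribution of sample means for n = 20: normal with mean mu, variance sigma^2/n.
means_dist = NormalDist(mu, se)

z = (22.0 - mu) / se                    # roughly 3.58 standard errors above mu
p = 1.0 - means_dist.cdf(22.0)          # roughly 0.0002, matching the text
print(f"Z = {z:.2f}, P = {p:.5f}")
```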
 
 
So, with a simple transformation of location and scale, any normal distribution, whether of individuals or 
of sample means, can be transformed into N(0,1), thereby allowing us to determine how unusual a given 
individual or sample is. Recall that the x-axis of the standard normal curve is in units of standard 
deviations. Our minds are not used to thinking in terms of units of dispersion, but it is an incredibly 
powerful way to think. To give you an intuitive feeling for such units, consider the following: 
 
In a normal frequency distribution,

 µ ± 1σ contains 68.27% of the items
 µ ± 2σ contains 95.45% of the items
 µ ± 3σ contains 99.73% of the items

Thought of in another way,

 50% of the items fall between µ ± 0.674σ
 95% of the items fall between µ ± 1.960σ
 99% of the items fall between µ ± 2.576σ
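These benchmarks can be recovered from the standard normal cdf and its inverse (Python stdlib):

```python
from statistics import NormalDist

Z = NormalDist()   # the standard normal distribution N(0,1)

# Area contained within k standard deviations of the mean:
for k in (1, 2, 3):
    area = Z.cdf(k) - Z.cdf(-k)
    print(f"mu +/- {k} sigma contains {100 * area:.2f}% of the items")

# Z values bracketing the central 50%, 95%, and 99% of the items:
for central in (0.50, 0.95, 0.99):
    zc = Z.inv_cdf(0.5 + central / 2)
    print(f"{central:.0%} of the items fall between mu +/- {zc:.3f} sigma")
```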
 
 
With these basic benchmarks in place, the results from the two examples above make a lot of sense. A 22 
g finch is not unusual because it is less than one standard deviation from the mean. But a sample of 20 
finches with a mean weight of 22 g is highly unusual because this sample mean is more than three 
standard errors from the mean. 
 
One final word about the importance and wide applicability of the normal distribution: 
The central limit theorem states that, as sample size increases, the distribution of sample 
means drawn from a population of any distribution will approach a normal distribution 
with mean µ and variance σ²/n.
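A short simulation illustrates the theorem. The exponential population below is strongly skewed, yet its sample means behave as the theorem predicts (the choice of population is arbitrary):

```python
import random
import statistics

random.seed(1)

# A decidedly non-normal population: exponential with mu = 1 and sigma^2 = 1.
n, r = 30, 3000

# Means of r random samples of size n.
means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(r)
]

# Per the central limit theorem, the means cluster near mu = 1 with
# variance near sigma^2/n = 1/30, and their histogram is roughly bell-shaped.
print(statistics.mean(means), statistics.variance(means))
```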
 
 
Testing for Normality [ST&D pp. 566-567] 
 
One is justified in using Table A4 (and, as you will see, t-tests and ANOVAs) if and only if the
population or sample under consideration is “normal.” Such statistical tables and techniques are said to 
“assume” normality of the data. Do not be misled by this use of the word. As a user of these tables and 
techniques, you do not simply “assume” that your data are normal; you test for it. Normality is spoken of 
as an “assumption,” but in fact it is a criterion which must be met for the analysis to be valid. In this 
class, we will be using the Shapiro-Wilk test for assessing normality. See pages 566-567 in your text for 
a good description of this technique. Below is Figure 24.2 from your book (page 566), with some 
supplemental annotation and discussion to help in understanding the test: 
 
 
 
Figure 24.2 A normal probability plot (a.k.a. quantile-quantile or Q-Q plot). This is a graphic 
tool for visualizing deviation from normality. The Shapiro-Wilk test assigns a probability to such 
deviation, providing an "objective" determination of normality. 
 
Since the dataset consists of 14 values (n = 14) [77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4, 76.0, 76.0,
73.9, 77.4, 76.6, and 77.3], the area under N(0,1) is divided into 14 equal portions. This means that each
portion, like the one indicated in orange above, has an area of 1/14 = 0.0714 square units. The normal
score (Z) which splits a portion in half (by area) is considered the "expected" normal score for that
portion. In the figure above, this means that the yellow area and the blue area are equal (1/28 = 0.0357
square units each). The "expected" normal score of the highest quantile is therefore Z14 ≈ 1.80 (Table A4,
P = 0.0357), and the "expected" value of the second Z score is ≈ -1.24 (Table A4, the Z value corresponding
to a probability of 0.1071 = 0.0714 + 0.0357). The 14 "expected" normal scores are then transformed into
the original units of observation (Yi = σZi + µ), thereby generating a perfectly straight line with slope σ and
intercept µ. For the highest quantile, for example, the "expected" observed value is Y14 = σZ14 + µ =
1.227(1.80) + 75.943 = 78.152, while the actual observed value is Y14 = 77.7; the difference between the
two is the deviation from "normal."

So, each data point in the sample has a corresponding "expected" normal score (e.g. while the actual
normal score for 77.7 is Zi = (77.7 - 75.943)/1.227 = 1.432, its expected normal score is Z14 ≈ 1.80), and a
normal probability plot is essentially just a scatter-plot of these paired values. You can think of the 
Shapiro-Wilk test as essentially a test for correlation to the normal line. If the sample is perfectly normal, 
the scatter-plot of observed values vs. expected normal scores will fall exactly on the normal line and the 
Shapiro-Wilk test statistic W (similar to a correlation coefficient) will equal 1. Complete lack of 
correlation (i.e. a completely non-normal distribution) will yield a test statistic W equal to 0. 
 
How much deviation is too much? For this class, to reject the null hypothesis of the test (H0: The sample 
is from a normally distributed population), the probability of obtaining, by chance, a value of W from a 
truly normal distribution that is less than the observed value of W must be less than 5%. 
 
In this case, W = 0.9484 and Pr<W = 0.5092 > 0.05; so we fail to reject H0. There is no evidence, at this 
chosen level of significance, that the sample is from a non-normal population.
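The mechanics of the normal probability plot can be reproduced in pure Python. The correlation computed at the end is a simplified stand-in for W (in the spirit of the related Shapiro-Francia statistic); the true Shapiro-Wilk W uses specially derived weights, so the two agree only roughly:

```python
import math
import statistics
from statistics import NormalDist

data = sorted([77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4,
               76.0, 76.0, 73.9, 77.4, 76.6, 77.3])
n = len(data)                                            # 14

# "Expected" normal score of the k-th quantile: the Z that bisects
# (by area) the k-th of n equal portions under N(0,1).
expected_z = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]

ybar = statistics.mean(data)    # about 75.943
s = statistics.stdev(data)      # about 1.227

# Highest quantile: expected score near 1.80; expected value ybar + s*z
# (the text reports 78.152 using z = 1.80) versus the observed 77.7.
print(expected_z[-1], ybar + s * expected_z[-1], data[-1])

# Pearson correlation of observed values with expected scores (the expected
# scores average to zero by symmetry, so no centering of z is needed).
# A perfectly normal sample gives r_sq = 1; compare the reported W = 0.9484.
num = sum((y - ybar) * z for y, z in zip(data, expected_z))
den = math.sqrt(sum((y - ybar) ** 2 for y in data) * sum(z * z for z in expected_z))
r_sq = (num / den) ** 2
print(r_sq)
```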
