Lecture 1: Review of Fundamental Statistical Concepts

Measures of Central Tendency and Dispersion

A word about notation for this class: Individuals in a population are designated $Y_i$, where the index $i$ ranges from 1 to $N$, and $N$ is the total number of individuals in the population. Individuals in a random sample taken from a population are also denoted $Y_i$, but in this case the index $i$ ranges from 1 to $n$, where $n$ is the total number of individuals in the sample. Greek letters will be used for population parameters (e.g. $\mu$ = population mean; $\sigma^2$ = population variance), while Roman letters will be used for estimates of population parameters based on random sampling (e.g. $\bar{Y}$ = sample mean $\approx$ population mean $\mu$; $s^2$ = sample variance $\approx$ population variance $\sigma^2$).

Basic formulas:

Mean or average (a measure of central tendency)

Population mean: $\mu = \dfrac{\sum_{i=1}^{N} Y_i}{N}$

Sample mean: $\bar{Y} = \dfrac{\sum_{i=1}^{n} Y_i}{n}$

Variance (a measure of dispersion of individuals about the mean)

Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$

Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$

The quantities $(Y_i - \bar{Y})$ are called the deviations.

Standard deviation (a measure of dispersion in the original units of observation)

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$

Coefficient of variation

In some situations, it is useful to express the standard deviation in units of the population mean. For this purpose, we have a quantity called the coefficient of variation:

Population coefficient of variation: $CV = \dfrac{\sigma}{\mu}$

Sample coefficient of variation: $CV = \dfrac{s}{\bar{Y}}$

Measures of dispersion of sample means

Another important population parameter we will work with in this class is the variance of the mean ($\sigma^2_{\bar{Y}}$). If you repeatedly sample a population by taking samples of size $n$, the variance of those sample means is what we call the variance of the mean.
It relates very simply to the population variance, in this way:

Variance of the mean: $\sigma^2_{\bar{Y}} = \dfrac{\sigma^2}{n}$

We can estimate $\sigma^2_{\bar{Y}}$ for a population by taking $r$ independent, random samples of size $n$ from that population, calculating the sample means $\bar{Y}_i$, and then calculating the variance of those sample means. In other words, if $\bar{Y}_i$ is the mean of the $i$th sample and $\bar{\bar{Y}}$ is the overall mean for all $r$ samples, then what we find is:

$s^2_{\bar{Y}} = \dfrac{\sum_{i=1}^{r} (\bar{Y}_i - \bar{\bar{Y}})^2}{r - 1} \approx \sigma^2_{\bar{Y}}$

The square root of $s^2_{\bar{Y}}$ is called the standard deviation of a mean, or more often the standard error.

Sample standard error: $s_{\bar{Y}} = \sqrt{s^2_{\bar{Y}}} = \sqrt{\dfrac{s^2}{n}}$

As with the standard deviation, this is a quantity in the original units of observation. As you will see, the standard error is extremely useful due to the role it plays in determining confidence intervals and the powers of tests.

The Normal Distribution

If you measure a quantitative trait on a population of meaningfully related individuals, what you often find is that most of the measurements will cluster near the population mean ($\mu$). And as you consider values further and further from $\mu$, individuals exhibiting those values become rarer. Graphically, such a situation can be visualized in terms of a frequency distribution, as shown below:

[Figure: a bell-shaped frequency distribution, with observed value on the x-axis, frequency of observation on the y-axis, and its peak at $\mu$]

Some basic characteristics of this kind of distribution are:

1) The maximum value occurs at $\mu$ (i.e. the most probable value of an individual pulled randomly from the population is $\mu$; another way of saying this is that the expected value of an individual pulled randomly from this population is $\mu$);

2) The dispersion is symmetric about $\mu$ (i.e. the mean, median, and mode of the population are equal); and

3) The "tails" asymptotically approach zero.

A distribution which meets these basic criteria is known as a normal distribution.
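The descriptive formulas above can be checked numerically. The following sketch uses only Python's standard library; the small sample is made up for illustration (its values do not come from these notes), and the second part verifies by simulation that the variance of sample means comes out close to $\sigma^2/n$:

```python
import math
import random
import statistics

# --- Descriptive statistics (hypothetical sample; values are illustrative only) ---
sample = [12.1, 14.3, 13.8, 12.9, 15.0, 13.2, 14.1, 13.6]
n = len(sample)

y_bar = sum(sample) / n                                   # sample mean
s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)      # sample variance (n - 1 divisor)
s = math.sqrt(s2)                                         # sample standard deviation
cv = s / y_bar                                            # coefficient of variation
se = math.sqrt(s2 / n)                                    # standard error of the mean

# --- Variance of the mean: repeated sampling from a normal population ---
random.seed(1)
mu, sigma, m, r = 17.2, 6.0, 25, 2000                     # take r samples of size m
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(m)) for _ in range(r)]
var_of_means = statistics.variance(means)                 # should be close to sigma^2 / m = 1.44

print(f"mean={y_bar:.3f}  s2={s2:.3f}  SE={se:.4f}  var_of_means={var_of_means:.3f}")
```

Note the $n-1$ divisor in the sample variance: it is what makes $s^2$ an unbiased estimate of $\sigma^2$, and it is why the code does not simply divide by $n$.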
The following conditions tend to result in a normal distribution of a quantitative trait:

1) There are many factors which contribute to the observed value of the trait;

2) These many factors act independently of one another; and

3) The individual effects of these factors are additive and of comparable magnitude.

As it turns out, a great many variables of interest are approximately normally distributed. Indeed, the normal distribution is observed for characters in complex systems of all kinds: biological, socioeconomic, industrial, etc. The bell-shaped normal distribution is also known as a Gaussian curve, named after Carl Friedrich Gauss, who worked out the formal mathematics underlying functions of this type. Specifically, a normal probability density function of mean $\mu$ and variance $\sigma^2$ is described by the expression:

$Z(Y) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{Y - \mu}{\sigma}\right)^2}$

where $Z(Y)$ is the height of the curve at a given observed value $Y$. Notice that the location and shape of a normal probability density function are uniquely determined by only two parameters, $\mu$ and $\sigma^2$. By varying the value of $\mu$, one can center $Z(Y)$ anywhere on the x-axis. By varying $\sigma^2$, one can freely adjust the width of the central hump. All of the statistical techniques we will discuss in this class are based on the idea that many systems we study in the real world can be modeled by this theoretical function $Z(Y)$. Such techniques fall into the broad category of parametric statistics, because the ultimate objectives of these techniques are to estimate and compare the theoretical parameters (in this case, $\mu$ and $\sigma^2$) which best explain our observations.
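As a quick check on this expression, here is a minimal sketch of $Z(Y)$ using only Python's `math` module. Two properties follow directly from the formula and are easy to confirm: the maximum height $1/(\sigma\sqrt{2\pi})$ occurs at $Y = \mu$, and the curve is symmetric about $\mu$ (the parameter values are illustrative, not prescribed by the text):

```python
import math

def normal_pdf(y, mu, sigma):
    """Height Z(Y) of the normal curve with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 17.2, 6.0
print(normal_pdf(mu, mu, sigma))            # maximum height: 1 / (sigma * sqrt(2*pi))
print(normal_pdf(mu - 3, mu, sigma),        # symmetry: equal heights at mu - 3 ...
      normal_pdf(mu + 3, mu, sigma))        # ... and mu + 3
```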
If we set $\mu = 0$ and $\sigma^2 = 1$, we obtain an especially useful normal probability density function known as the standard normal curve, $N(0,1)$:

$Z_{[0,1]}(Y) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{Y^2}{2}}$

A word about notation: Rather than $Y$, it is traditional to use the letter $Z$ to represent a random variable drawn from the standard normal distribution:

$N(0,1) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{Z^2}{2}}$

As with all probability density functions, the total area under the curve equals 1.

[Figure: the standard normal curve $N(0,1)$, centered at 0 with $\sigma^2 = 1$ and total area under the curve equal to 1]

On page 612 of your book (Appendix Table A.4), you will find a table of the properties of this curve. For any given positive value of $Z$, the table reports the area under the curve to the right of $Z$. This is useful because the area to the right of $Z$ is the theoretical probability of randomly picking an individual from $N(0,1)$ whose value is greater than $Z$.

How does this help us in the real world? It helps us because ANY normal distribution can be standardized (i.e. any normal distribution can be converted into $N(0,1)$). The way this is done is quite simple:

$Z_i = \dfrac{Y_i - \mu}{\sigma}$

Subtracting $\mu$ from each observation $Y_i$ shifts the mean of the distribution to 0. Dividing by $\sigma$ changes the scale of the x-axis from the original units of observation to units of standard deviation and thus makes the standard deviation (and the variance) of the distribution equal to 1. What this means is that for any unique individual $Y_i$ from a normal distribution of mean $\mu$ and variance $\sigma^2$, there is a corresponding unique value $Z_i$ (i.e. a normal score) in the standard normal curve. And since we know the theoretical probability of picking an individual of a certain value at random from $N(0,1)$, we now have a way of determining the probability of picking an individual of a certain value at random from any normally distributed population.
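The standardization step and the table lookup can be sketched in a few lines of Python. Here `statistics.NormalDist` (standard library, Python 3.8+) stands in for the table: the area to the right of $Z$ is $1 - \Phi(Z)$, where $\Phi$ is the cumulative distribution function. The population parameters are hypothetical, chosen only to illustrate the mechanics:

```python
from statistics import NormalDist

def z_score(y, mu, sigma):
    """Convert an observation from N(mu, sigma^2) to its standard normal score."""
    return (y - mu) / sigma

def prob_greater(y, mu, sigma):
    """P(Y > y): the area under N(0,1) to the right of the normal score of y."""
    return 1 - NormalDist().cdf(z_score(y, mu, sigma))

# Hypothetical population: mu = 50, sigma = 10 (illustrative values only).
print(z_score(65, 50, 10))        # normal score: 1.5
print(prob_greater(65, 50, 10))   # area to the right of Z = 1.5: about 0.0668
```

By symmetry, `prob_greater` evaluated at the mean itself returns exactly 0.5, since half the area lies on either side of $\mu$.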
Some examples:

Question 1: From a normally distributed population of finches with mean weight $\mu = 17.2$ g and variance $\sigma^2 = 36$ g², what is the probability of randomly selecting an individual finch weighing more than 22 g?

Solution: To answer this, first convert the value 22 g to its corresponding normal score:

$Z_i = \dfrac{Y_i - \mu}{\sigma} = \dfrac{22\ \text{g} - 17.2\ \text{g}}{6\ \text{g}} = 0.8$

From Table A.4, we see that 21.19% of the area under $N(0,1)$ lies to the right of $Z = 0.8$. Therefore, there is a 21.19% chance of randomly selecting an individual finch weighing more than 22 g from this population. In other words, 22 g is not an unusual weight for a finch in this population. It helps to visualize these problems graphically, especially as they get more complicated:

[Figure: the population curve centered at 17.2 g, with the area to the right of $Y = 22.0$ shaded and labeled "What is this area?"]

Incidentally, we also see, simply by symmetry, that we have a 21.19% chance of randomly selecting an individual finch weighing less than 12.4 g. Do you see why?

Question 2: From a normally distributed population of finches with mean weight $\mu = 17.2$ g and variance $\sigma^2 = 36$ g², what is the probability of randomly selecting a sample of 20 finches with an average weight of more than 22 g?

Solution: The difference between this question and the previous one is that the previous question was asking for the probability of selecting an individual of a certain value at random, while this question is asking for the probability of selecting a sample of a certain average value at random. For individuals, the appropriate distribution to consider is the normal distribution of the population of individuals ($\mu = 17.2$ g and $\sigma^2 = 36$ g²). But for samples of size $n = 20$, the appropriate distribution to consider is the normal distribution of sample means for sample size $n = 20$ ($\mu = 17.2$ g and $\sigma^2_{\bar{Y}(20)} = \dfrac{\sigma^2}{n} = \dfrac{36\ \text{g}^2}{20} = 1.8\ \text{g}^2$). With this in mind, we proceed as before:
[Figure: the same area on the standard normal scale: $P(Y \geq 22) = P(Z \geq 0.8) = 0.2119$]

$Z_i = \dfrac{\bar{Y}_i - \mu}{\sigma_{\bar{Y}(20)}} = \dfrac{22\ \text{g} - 17.2\ \text{g}}{1.34\ \text{g}} = 3.6$

From Table A.4, we see that only 0.02% of the area under $N(0,1)$ lies to the right of $Z = 3.6$. Therefore, there is a mere 0.02% chance of randomly selecting a sample of 20 finches with an average weight of more than 22 g from this population. In other words, 22 g is an extremely unusual mean weight for a sample of twenty finches in this population.

So, with a simple transformation of location and scale, any normal distribution, whether of individuals or of sample means, can be transformed into $N(0,1)$, thereby allowing us to determine how unusual a given individual or sample is.

Recall that the x-axis of the standard normal curve is in units of standard deviations. Our minds are not used to thinking in terms of units of dispersion, but it is an incredibly powerful way to think. To give you an intuitive feeling for such units, consider the following: In a normal frequency distribution,

$\mu \pm 1\sigma$ contains 68.27% of the items
$\mu \pm 2\sigma$ contains 95.45% of the items
$\mu \pm 3\sigma$ contains 99.73% of the items

Thought of in another way,

50% of the items fall between $\mu \pm 0.674\sigma$
95% of the items fall between $\mu \pm 1.960\sigma$
99% of the items fall between $\mu \pm 2.576\sigma$

[Figure: the standard normal curve marked at $\pm 1\sigma$, $\pm 2\sigma$, and $\pm 3\sigma$, enclosing 68.27%, 95.45%, and 99.73% of the area]

With these basic benchmarks in place, the results from the two examples above make a lot of sense. A 22 g finch is not unusual because it is less than one standard deviation from the mean. But a sample of 20 finches with a mean weight of 22 g is highly unusual because this sample mean is more than three standard errors from the mean.

One final word about the importance and wide applicability of the normal distribution: The central limit theorem states that, as sample size increases, the distribution of sample means drawn from a population of any distribution will approach a normal distribution with mean $\mu$ and variance $\sigma^2/n$.

Testing for Normality [ST&D pp. 566-567]

One is justified in using Table A.4 (and, as you will see, t-tests and ANOVAs) if and only if the population or sample under consideration is "normal." Such statistical tables and techniques are said to "assume" normality of the data. Do not be misled by this use of the word. As a user of these tables and techniques, you do not simply "assume" that your data are normal; you test for it. Normality is spoken of as an "assumption," but in fact it is a criterion which must be met for the analysis to be valid.

In this class, we will be using the Shapiro-Wilk test for assessing normality. See pages 566-567 in your text for a good description of this technique. Below is Figure 24.2 from your book (page 566), with some supplemental annotation and discussion to help in understanding the test:

Figure 24.2 A normal probability plot (a.k.a. quantile-quantile or Q-Q plot). This is a graphic tool for visualizing deviation from normality. The Shapiro-Wilk test assigns a probability to such deviation, providing an "objective" determination of normality.

Since the dataset consists of 14 values (n = 14) [77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4, 76.0, 76.0, 73.9, 77.4, 76.6, and 77.3], the area under $N(0,1)$ is divided into 14 equal portions. This means that each portion, like the one indicated in orange above, has an area of 1/14 = 0.0714 square units. The normal score ($Z$) which splits a portion in half (by area) is considered the "expected" value for that portion.
In the figure above, this means that the yellow area and the blue area are equal (1/28 = 0.0357 square units each).

[Figure annotations: each of the 14 portions has area 7.14% (= 1/14), and the top portion is split into two areas of 3.57% each. "Expected" normal score of the highest quantile: $Z_{14} \approx 1.80$ (Table A.4, P = 0.0357). "Expected" observed value of the highest quantile: $Y_{14} = \sigma Z_{14} + \mu = 1.227(1.80) + 75.943 = 78.152$. Actual observed value of the highest quantile: $Y_{14} = 77.7$ (the deviation from "normal"). Observed normal score of the highest quantile: $Z_{14} = (77.7 - 75.943)/1.227 = 1.432$. Test result: $W = 0.9484$, Pr<W = 0.5092 > 0.05, so we fail to reject $H_0$; the sample is "normal".]

The "expected" value of the second $Z$ score is $\approx -1.24$ (Table A.4, the $Z$ value corresponding to a probability of 0.1071 = 0.0714 + 0.0357). The 14 "expected" normal scores are then transformed into the original units of observation ($Y_i = \sigma Z_i + \mu$), thereby generating a perfectly straight line with slope $\sigma$ and intercept $\mu$. So, each data point in the sample has a corresponding "expected" normal score (e.g. while the actual normal score for 77.7 is $Z_i = \frac{77.7 - 75.943}{1.227} = 1.432$, its expected normal score is $Z_{14} \approx 1.80$), and a normal probability plot is essentially just a scatter-plot of these paired values.

You can think of the Shapiro-Wilk test as essentially a test for correlation to the normal line. If the sample is perfectly normal, the scatter-plot of observed values vs. expected normal scores will fall exactly on the normal line, and the Shapiro-Wilk test statistic $W$ (similar to a correlation coefficient) will equal 1. Complete lack of correlation (i.e. a completely non-normal distribution) will yield a test statistic $W$ equal to 0. How much deviation is too much? For this class, to reject the null hypothesis of the test ($H_0$: The sample is from a normally distributed population), the probability of obtaining, by chance, a value of $W$ from a truly normal distribution that is less than the observed value of $W$ must be less than 5%. In this case, $W = 0.9484$ and Pr<W = 0.5092 > 0.05, so we fail to reject $H_0$.
There is no evidence, at this chosen level of significance, that the sample is from a non-normal population.
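The construction of the expected and observed normal scores can be sketched with Python's standard library, using the 14 values from Figure 24.2. Here `NormalDist().inv_cdf` stands in for reading Table A.4 in reverse; note that the published Shapiro-Wilk statistic uses specially tabulated coefficients rather than a plain correlation, so this sketch only reproduces the Q-Q scores behind the plot, not $W$ itself:

```python
from statistics import NormalDist, mean, stdev

# The 14 observations from Figure 24.2, sorted into quantile order.
data = sorted([77.7, 76.0, 76.9, 74.6, 74.7, 76.5, 74.2, 75.4,
               76.0, 76.0, 73.9, 77.4, 76.6, 77.3])
n = len(data)

# "Expected" normal score of the i-th quantile: the Z that splits the i-th
# 1/n slice of N(0,1) in half by area, i.e. the (i - 0.5)/n quantile of N(0,1).
expected_z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Observed normal scores: standardize with the sample mean and standard deviation.
y_bar, s = mean(data), stdev(data)
observed_z = [(y - y_bar) / s for y in data]

print(round(y_bar, 3), round(s, 3))   # 75.943 and 1.227, as in the figure
print(round(expected_z[-1], 2))       # highest "expected" score: about 1.80
print(round(observed_z[-1], 3))       # observed score of 77.7: about 1.432
```

As a further check, `expected_z[1]` comes out near $-1.24$, matching the "expected" value of the second $Z$ score computed in the discussion above.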