
An Introduction to the Science of Statistics:
From Theory to Implementation
Preliminary Edition
© Joseph C. Watkins
Contents
I Organizing and Producing Data 1
1 Displaying Data 3
1.1 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Pie Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Bar Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Two-way Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Histograms and the Empirical Cumulative Distribution Function . . . . . . . . . . . . . . . . . . . . 10
1.5 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Time Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Describing Distributions with Numbers 21
2.1 Measuring Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 Medians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Measuring Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Five Number Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Sample Variance and Standard Deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Quantiles and Standardized Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Quantile-Quantile Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Correlation and Regression 33
3.1 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Transformed Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Nonlinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Producing Data 65
4.1 Preliminary Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Professional Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Formal Statistical Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.1 Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2 Randomized Controlled Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.3 Natural experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.1 Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
II Probability 79
5 The Basics of Probability 81
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Equally Likely Outcomes and the Axioms of Probability . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Consequences of the Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4 Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.1 Fundamental Principle of Counting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.2 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.3 Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6 Set Theory - Probability Theory Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6 Conditional Probability and Independence 97
6.1 Restricting the Sample Space - Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 The Multiplication Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 The Law of Total Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Bayes formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.6 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7 Random Variables and Distribution Functions 111
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Properties of the Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.3.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.4 Mass Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.5 Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.6 Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.7 Joint and Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.7.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.7.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.7.3 Independent Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.8 Simulating Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.8.1 Discrete Random Variables and the sample Command . . . . . . . . . . . . . . . . . . . . 124
7.8.2 Continuous Random Variables and the Probability Transform . . . . . . . . . . . . . . . . . 125
7.9 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
8 The Expected Value 137
8.1 Definition and Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
8.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.3 Bernoulli Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.4 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.6 Names for Eg(X). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.7 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.8 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.8.1 Equivalent Conditions for Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.9 Quantile Plots and Probability Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.10 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
9 Examples of Mass Functions and Densities 157
9.1 Examples of Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
9.2 Examples of Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.3 More on Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.4 R Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.5 Summary of Properties of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.5.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.5.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.6 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
10 The Law of Large Numbers 179
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
10.2 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.4 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
11 The Central Limit Theorem 193
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
11.2 The Classical Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
11.2.1 Bernoulli Trials and the Continuity Correction . . . . . . . . . . . . . . . . . . . . . . . . . 197
11.3 Propagation of Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
11.4 Delta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11.5 Summary of Normal Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.5.1 Sample Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
11.5.2 Sample Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.5.3 Sample Proportion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.5.4 Delta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.6 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
III Estimation 213
12 Overview of Estimation 215
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
12.2 Classical Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
12.3 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
12.4 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
13 Method of Moments 231
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
13.2 The Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
13.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
13.4 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
14 Unbiased Estimation 241
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
14.2 Computing Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
14.3 Compensating for Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
14.4 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
14.5 Cramér-Rao Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
14.6 A Note on Exponential Families and Efficient Estimators . . . . . . . . . . . . . . . . . . . . . . . . 254
14.7 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
15 Maximum Likelihood Estimation 261
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
15.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
15.3 Summary of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
15.4 Asymptotic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
15.5 Comparison of Estimation Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
15.6 Multidimensional Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
15.7 The Case of Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
15.8 Choice of Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
15.9 Technical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
15.10 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
16 Interval Estimation 281
16.1 Classical Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
16.1.1 Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
16.1.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
16.1.3 Sample Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
16.1.4 Summary of Standard Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
16.1.5 Interpretation of the Confidence Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
16.1.6 Extensions on the Use of Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . 292
16.2 The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
16.3 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
16.4 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
IV Hypothesis Testing 301
17 Simple Hypotheses 303
17.1 Overview and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
17.2 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
17.2.1 The Receiver Operating Characteristic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
17.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
17.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
17.5 Proof of the Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
17.6 A Brief Introduction to the Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
17.7 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
18 Composite Hypotheses 323
18.1 Partitioning the Parameter Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
18.2 The Power Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
18.3 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
18.4 Distribution of p-values and the Receiving Operating Characteristic . . . . . . . . . . . . . . . . . . 334
18.5 Multiple Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
18.5.1 Familywise Error Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
18.5.2 False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
18.6 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
19 Extensions on the Likelihood Ratio 341
19.1 One-Sided Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
19.2 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
19.3 Chi-square Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
19.4 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
20 t Procedures 361
20.1 Guidelines for Using the t Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
20.2 One Sample t Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
20.3 Correspondence between Two-Sided Tests and Confidence Intervals . . . . . . . . . . . . . . . . . . 365
20.4 Matched Pairs Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
20.5 Two Sample Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
20.6 Summary of Tests of Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
20.6.1 General Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
20.6.2 Test for Population Proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
20.6.3 Test for Population Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
20.7 A Note on the Delta Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
20.8 The t Test as a Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
20.9 Non-parametric alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
20.9.1 Permutation Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
20.9.2 Mann-Whitney or Wilcoxon Rank Sum Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
20.9.3 Wilcoxon Signed-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
20.10 Answers to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
21 Goodness of Fit 385
21.1 Fit of a Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
21.2 Contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
21.3 Applicability and Alternatives to Chi-squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
21.4 Answer to Selected Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
22 Analysis of Variance 403
22.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
22.2 One Way Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
22.3 Contrasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
22.4 Two Sample Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
22.5 Kruskal-Wallis Rank-Sum Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
22.6 Answer to Selected Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
Appendix A: A Sample R Session 417
Index 423
Preface
Statistical thinking will one day be as necessary a qualification for efficient citizenship as the ability to
read and write. – Samuel S. Wilks, 1951, paraphrasing H. G. Wells from Mankind in the Making
The value of statistical thinking is now accepted by researchers and practitioners from a broad range of endeavors.
This viewpoint has become common wisdom in a world of big data. The challenge for statistics educators is to adapt
their pedagogy to accommodate the circumstances associated with the information age. This choice of pedagogy should
be attuned to the quantitative capabilities and scientific background of the students as well as the intended use of their
newly acquired knowledge of statistics.
Many university students, presumed to be proficient in college algebra, are taught a variety of procedures and
standard tests under a well-developed pedagogy. This approach is sufficiently refined so that students have a good
intuitive understanding of the underlying principles presented in the course. However, if the statistical needs presented
by a given scientific question fall outside the battery of methods presented in the standard curriculum, then students
are typically at a loss to adjust the procedures to accommodate the additional demand.
On the other hand, undergraduate students majoring in mathematics frequently have a course on the theory of
statistics as a part of their program of study. In this case, the standard curriculum repeatedly finds itself close to the
very practically minded subject that statistics is. However, the demands of the syllabus provide very little time to
explore these applications with any sustained attention.
Our goal is to find a middle ground.
Despite the fact that calculus is a routine tool in the development of statistics, the benefits to students who have
learned calculus are infrequently employed in the statistics curriculum. The objective of this book is to meet this need
with a one semester course in statistics that moves forward in recognition of the coherent body of knowledge provided
by statistical theory while keeping an eye consistently on the application of the subject. Such a course may not be able to
achieve the same degree of completeness now presented by the two more standard courses described above. However,
it ought to be able to achieve some important goals:
• leaving students capable of understanding what statistical thinking is and how to integrate this with scientific
procedures and quantitative modeling and
• learning how to ask statistics experts productive questions, and how to implement their ideas using statistical
software and other computational tools.
Inevitably, many important topics are not included in this book. In addition, I have chosen to incorporate abbre-
viated introductions of some more advanced topics. Such topics can be skipped in a first pass through the material.
However, one value of a textbook is that it can serve as a reference in future years. The context for some parts of
the exposition will become more clear as students continue their own education in statistics. In these cases, the more
advanced pieces can serve as a bridge from this book to more well developed accounts. My goal is not to compose a
stand alone treatise, but rather to build a foundation that allows those who have worked through this book to introduce
themselves to many exciting topics both in statistics and in its areas of application.
Who Should Use this Book
The major prerequisites are comfort with calculus and a strong interest in questions that can benefit from statistical
analysis. Willingness to engage in explorations utilizing statistical software is an important additional requirement.
The original audience for the course associated with this book was undergraduate students minoring in mathematics.
These students have typically completed a course in multivariate calculus. Many have been exposed to either linear
algebra or differential equations. They enroll in this course because they want to obtain a better understanding of
their own core subject. Even though we regularly rely on the mechanics of calculus and occasionally need to work
with matrices, this is not a textbook for a mathematics course, but rather a textbook that is dedicated to a higher level
of understanding of the concepts and practical applications of statistics. In this regard, it relies on a solid grasp of
concepts and structures in calculus and algebra.
With the advance and adoption of the Common Core State Standards in mathematics, we can anticipate that
primary and secondary school students will experience a broader exposure to statistics through their school years. As
a consequence, we will need to develop a curriculum for teachers and future teachers so that they can take content in
statistics and turn that into curriculum for their students. This book can serve as a source of that content.
In addition, those engaged both in industry and in scholarly research are experiencing a surge in the need to
design more complex experiments and analyze more diverse data types. Universities and industry are responding with
advanced educational opportunities to extend statistics education beyond the theory of probability and statistics, linear
models and design of experiments to more modern approaches that include stochastic processes, machine learning and
data mining, Bayesian statistics, and statistical computing. This book can serve as an entry point for these critical
topics in statistics.
An Annotated Syllabus
The four parts of the course - organizing and collecting data, an introduction to probability, estimation procedures and
hypothesis testing - are the building blocks of many statistics courses. We highlight some of the particular features in
this book.
Organizing and Collecting Data
Much of this is standard and essential - organizing categorical and quantitative data, appropriately displayed as contin-
gency tables, bar charts, histograms, boxplots, time plots, and scatterplots, and summarized using medians, quartiles,
means, weighted means, trimmed means, standard deviations, correlations and regression lines. We use this as an
opportunity to introduce the statistical software package R and to add additional summaries like the empirical cu-
mulative distribution function and the empirical survival function. One example incorporating the use of this is the
comparison of the lifetimes of wildtype and transgenic mosquitoes and a discussion of the best strategy to display and
summarize data if the goal is to examine the differences in these two genotypes of mosquitoes in their ability to carry
and spread malaria. A bit later, we will do an integration by parts exercise to show that the mean of a non-negative
continuous random variable is the area under its survival function.
Collecting data under a good design is introduced early in the text and discussion of the underlying principles of
experimental design is an abiding issue throughout the text. With each new mathematical or statistical concept comes
an enhanced understanding of what an experiment might uncover through a more sophisticated design than what was
previously thought possible. The students are given readings on the design of experiments and examples using R to create
a sample under a variety of protocols.
Introduction to Probability
Probability theory is the analysis of random phenomena. It is built on the axioms of probability and is explored, for
example, through the introduction of random variables. The goal of probability theory is to uncover properties arising
from the phenomena under study. Statistics is devoted to the analysis of data. One goal of statistical science is to
viii
Introduction to the Science of Statistics
articulate as well as possible what model of random phenomena underlies the production of the data. The focus of this
section of the course is to develop those probabilistic ideas that relate most directly to the needs of statistics.
Thus, we must study the axioms and basic properties of probability to the extent that the students understand
conditional probability and independence. Conditional probability is necessary to develop Bayes formula which we
will later use to give a taste of the Bayesian approach to statistics. Independence will be needed to describe the
likelihood function in the case of an experimental design that is based on independent observations. Densities for
continuous random variables and mass function for discrete random variables are necessary to write these likelihood
functions explicitly. Expectation will be used to standardize a sample sum or sample mean and to perform method of
moments estimates.
Random variables are developed for a variety of reasons. Some, like the binomial, negative binomial, Poisson or
the gamma random variable, arise from considerations based on Bernoulli trials or exponential waiting. The hyperge-
ometric random variable helps us understand the difference between sampling with and without replacement. The F ,
t and chi-square random variables will later become test statistics. Uniform random variables are the ones simulated
by random number generators. Because of the central limit theorem, the normal family is the most important among
the list of parametric families of random variables.
The flavor of the text returns to becoming more authentically statistical with the law of large numbers and the
central limit theorem. These are largely developed using simulation explorations and first applied to simple Monte
Carlo techniques and importance sampling to estimate the value of definite integrals. One cautionary tale is an
example of the failure of these simulation techniques when applied without careful analysis. If one uses, for example,
Cauchy random variables in the evaluation of some quantity, then the simulated sample means can appear to be
converging only to experience an abrupt and unpredictable jump. The lack of convergence of an improper integral
reveals the difficulty. The central object of study is, of course, the central limit theorem. It is developed both in terms
of sample sums and sample means and proportions and used in relatively standard ways to estimate probabilities.
However, in this book, we can introduce the delta method which adds ideas associated to the central limit theorem to
the context of propagation of error.
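To make that cautionary tale concrete, here is a minimal sketch in R (my illustration, not taken from the text): the running mean of standard Cauchy observations has no limit and exhibits the abrupt, unpredictable jumps described above.
> x <- rcauchy(10000)                  # Cauchy samples have no mean
> plot(cumsum(x)/(1:10000), type="l",  # running means fail to settle down
+      xlab="n", ylab="running mean")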
Estimation
In the simplest possible terms, the goal of estimation theory is to answer the question: What is that number? An
estimate is a statistic, i. e., a function of the data. We look to two types of estimation techniques - method of moments
and maximum likelihood and several criteria for an estimatorusing, for example, variance and bias. Several examples
including mark and recapture and the distribution of fitness effects from genetic data are developed for both types of
estimators. The variance of an estimator is approximated using the delta method for method of moments estimators
and using Fisher information for maximum likelihood estimators. An analysis of bias is based on quadratic Taylor
series approximations and the properties of expectations. Both classes of estimators are often consistent. This implies
that the bias decreases towards zero with an increasing number of observations. R is routinely used in simulations to
gain insight into the quality of estimators.
The point estimation techniques are followed by interval estimation and, notably, by confidence intervals. This
brings us to the familiar one and two sample t-intervals for population means and one and two sample z-intervals for
population proportions. In addition, we can return to the delta method and the observed Fisher information to construct
confidence intervals associated respectively with method of moments estimators and maximum likelihood estimators.
We also add a brief introduction on bootstrap confidence intervals and Bayesian credible intervals in order to provide
a broader introduction to strategies for parameter estimation.
Hypothesis Testing
For hypothesis testing, we first establish the central issues - null and alternative hypotheses, type I and type II errors,
test statistics and critical regions, significance and power. We then present the ideas behind the use of likelihood ratio
tests as best tests for a simple hypothesis. This is motivated by a game designed to explain the importance of the
Neyman-Pearson lemma. This approach leads us to well-known diagnostics of an experimental design, notably, the
receiver operating characteristic and power curves.
Extensions of the Neyman-Pearson lemma form the basis for the t test for means, the chi-square test for goodness
of fit, and the F test for analysis of variance. These results follow from the application of optimization techniques
from calculus, including the use of Lagrange multipliers to develop goodness of fit tests. The Bayesian approach
to hypothesis testing is explored for the case of a simple hypothesis, using morphometric measurements, in this case a
butterfly wingspan, to test whether a habitat has been invaded by a mimic species.
The desire for a powerful test is articulated in a variety of ways. In engineering terms, power is called sensitivity.
We illustrate this with a radon detector. An insensitive instrument is a risky purchase. This can be either because
the instrument is substandard in the detection of fluctuations or poor in the statistical basis for the algorithm used to
determine a change in radon level. An insensitive detector has the undesirable property of not sounding its alarm when
the radon level has indeed risen.
The course ends by looking at the logic of hypothesis testing and the results of different likelihood ratio analyses
applied to a variety of experimental designs. The delta method allows us to extend the resulting test statistics to
multivariate nonlinear transformations of the data. The textbook concludes with a practical view of the consequences
of this analysis through case studies in a variety of disciplines including, for example, genetics, health, ecology,
and bee biology. This will serve to introduce us to the well known t procedure for inference of the mean, both the
likelihood-based G2 test and the traditional chi-square test for discrete distributions and contingency tables, and the
F test for one-way analysis of variance. We add short descriptions for the corresponding non-parametric procedures,
namely, permutation, rank-sum, and signed-rank tests for quantitative data, and exact tests for categorical data.
Exercises and Problems
One obligatory statement in the preface of a book such as this is to note the necessity of working problems. The mate-
rial can only be mastered by grappling with the issues through the application to engaging and substantive questions.
In this book, we address this imperative through exercises and through problems. The exercises, integrated into the
textbook narrative, are of two basic types. The first is largely mathematical or computational exercises that are meant
to provide or extend the derivation of a useful identity or data analysis technique. These experiences will prepare the
student to perform the calculations that routinely occur in investigations that use statistical thinking. The second type
forms a collection of questions that are meant to affirm the understanding of a particular concept.
Problems are collected at the end of each of the four parts of the book. While the ordering of the problems generally
follows the flow of the text, they are designed to be more extensive and integrative. These problems often incorporate
several concepts and will call on a variety of problem solving strategies combining handwritten work with the use of
statistical software. Without question, the best problems are those that the students choose from their own interests.
Acknowledgements
The concept that led to this book grew out of a conversation with the late Michael Wells, Professor of Biochemistry
at the University of Arizona. He felt that if we are asking future researchers in the life sciences to take the time to learn
calculus and differential equations, we should also provide a statistics course that adds value to their abilities to design
experiments and analyze data while reinforcing both the practical and conceptual sides of calculus. As a consequence,
course development received initial funding from a Howard Hughes Medical Institute grant (52005889). Christopher
Bergevin, an HHMI postdoctoral fellow, provided a valuable initial collaboration.
Since that time, I have had the great fortune to be the teacher of many bright and dedicated students whose future
contribution to our general well-being is beyond dispute. Their cheerfulness and inquisitiveness have been a source
of inspiration for me. More practically, their questions and their persistence led to a much clearer exposition and
the addition of many dozens of figures to the text. Through their end of semester projects, I have been introduced
to many interesting questions that are intriguing in their own right, but also have added to the range of applications
presented throughout the text. Four of these students - Beryl Jones, Clayton Mosher, Laurel Watkins de Jong, and
Taylor Corcoran - have gone on to become assistants in the course. I am particularly thankful to these four for their
contributions to the dynamic atmosphere that characterizes the class experience.
Part I
Organizing and Producing Data
Topic 1
Displaying Data
There are two goals when presenting data: convey your story and establish credibility. - Edward Tufte
Statistics is a mathematical science that is concerned with the collection, analysis, interpretation or explanation,
and presentation of data. Properly used statistical principles are essential in guiding any inquiry informed by data and,
especially in the phase of data exploration, are routinely a fundamental source of discovery and innovation. Insights
from data may come from a well conceived visualization of the data, from modern methods of statistical learning and
model selection as well as from time-honored formal statistical procedures.
One's first encounters with data are through graphical displays and numerical summaries. The goal is to find
an elegant method for this presentation that is at the same time both objective and informative - making clear with a
few lines or a few numbers the salient features of the data. In this sense, data presentation is at the same time an art, a
science, and an obligation to impartiality.
In this section, we will describe some of the standard presentations of data and, at the same time, take the opportunity
to introduce some of the commands that the software package R provides to draw figures and compute summaries
of the data.
1.1 Types of Data
A data set provides information about a group of individuals. These individuals are, typically, representatives chosen
from a population under study. Data on the individuals are meant, either informally or formally, to allow us to make
inferences about the population. We shall later discuss how to define a population, how to choose individuals in the
population and how to collect data on these individuals.
• Individuals are the objects described by the data.
• Variables are characteristics of an individual. In order to present data, we must first recognize the types of data
under consideration.
– Categorical variables partition the individuals into classes. Other names for categorical variables are
levels or factors. One special type of categorical variables are ordered categorical variables that suggest
a ranking, say small, medium, large or mild, moderate, severe.
– Quantitative variables are those for which arithmetic operations like addition and differences make sense.
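To see the distinction in R (an illustrative sketch with made-up values, not from the text), a categorical variable is naturally stored as a factor and a quantitative variable as a numeric vector:
> severity <- factor(c("mild","severe","moderate","mild"),
+     levels=c("mild","moderate","severe"), ordered=TRUE)  # ordered categorical
> height <- c(162.4, 175.0, 158.3, 181.2)  # quantitative, in centimeters
> mean(height)  # arithmetic makes sense only for quantitative variables
[1] 169.225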
Example 1.1 (individuals and variables). We consider two populations - the first is the nations of the world and the
second is the people who live in those countries. Below is a collection of variables that might be used to study these
populations.
nations              people
population size      age
time zones           height
average rainfall     gender
life expectancy      ethnicities
mean income          annual income
literacy rate        literacy
capital city         mother’s maiden name
largest river        marital status
Exercise 1.2. Classify the variables as quantitative or categorical in the example above.
The naming of variables and their classification as categorical or quantitative may seem like a simple, even trite,
exercise. However, the first steps in designing an experiment and deciding on which individuals to include and which
information to collect are vital to the success of the experiment. For example, if your goal is to measure the time for
an animal (insect, bird, mammal) to complete some task under different (genetic, environmental, learning) conditions,
then you may decide to have a single quantitative variable - the time to complete the task. However, an animal in
your study may not attempt the task, may not complete the task, or may perform the task. As a consequence, your
data analysis will run into difficulties if you do not add a categorical variable to include these possible outcomes of an
experiment.
Exercise 1.3. Give examples of variables for the population of vertebrates, of proteins.
1.2 Categorical Data
1.2.1 Pie Chart
A pie chart is a circular chart divided into sectors, illustrating relative magnitudes in frequencies or percents. In a pie
chart, the area is proportional to the quantity it represents.
Example 1.4. As the nation debates strategies for delivering health insurance, let’s look at the sources of funds and
the types of expenditures.
Figure 1.1: 2008 United States health care (a) expenditures, (b) income sources. Source: Centers for Medicare and Medicaid Services, Office of
the Actuary, National Health Statistics Group.
Exercise 1.5. How do you anticipate that this pie chart will evolve over the next decade? Which pie slices are likely
to become larger? smaller? On what do you base your predictions?
Example 1.6. From UNICEF, we read “The proportion of children who reach their fifth birthday is one of the most
fundamental indicators of a country’s concern for its people. Child survival statistics are a poignant indicator of the
priority given to the services that help a child to flourish: adequate supplies of nutritious food, the availability of high-
quality health care and easy access to safe water and sanitation facilities, as well as the family’s overall economic
condition and the health and status of women in the community.”
Example 1.7. The Gene Ontology (GO) project is a bioinformatics initiative whose goal is to provide unified terminology
of genes and their products. The project began in 1998 as a collaboration between three model organism databases,
Drosophila, yeast, and mouse. The GO Consortium presently includes many databases, spanning repositories for
plant, animal and microbial genomes. This project is supported by the National Human Genome Research Institute. See
http://www.geneontology.org/
Figure 1.2: The 25 most frequent Biological Process Gene Ontology (GO) terms.
To make a simple pie chart in R for the proportion of AIDS cases among US males by transmission category:
> males<- c(58,18,16,7,1)
> pie(males)
This may be sufficient for your own personal use. However, if we want to use a pie chart in a presentation, we
will have to provide some essential details. For a more descriptive pie chart, one has to become accustomed to interacting
with the software to settle on a graph that is satisfactory for the situation.
• Define some colors ideal for black and white print.
> colors <- c("white","grey70","grey90","grey50","black")
• Calculate the percentage for each category.
> male_labels <- round(males/sum(males)*100, 1)
The second argument, 1, indicates rounding to one decimal place.
> male_labels <- paste(male_labels, "%", sep=" ")
This adds a space and a percent sign.
• Create a pie chart with defined heading and custom colors and labels and create a legend.
> pie(males, main="Proportion of AIDS Cases among Males by Transmission Category
+ Diagnosed - USA, 2005", col=colors, labels=male_labels, cex=0.8)
> legend("topright", c("Male-male contact","Injection drug use (IDU)",
+ "High-risk heterosexual contact","Male-male contact and IDU","Other"),
+ cex=0.8,fill=colors)
The entry cex=0.8 indicates that the legend is typeset at 80% of the font size of the main title.
[Pie chart: Proportion of AIDS Cases among Males by Transmission Category Diagnosed - USA, 2005. Slices: Male-male contact, 58%; Injection drug use (IDU), 18%; High-risk heterosexual contact, 16%; Male-male contact and IDU, 7%; Other, 1%.]
1.2.2 Bar Charts
Because the human eye is good at judging linear measures and poor at judging relative areas, a bar chart or bar graph
is often preferable to a pie chart as a way to display categorical data.
To make a simple bar graph in R,
> barplot(males)
For a more descriptive bar chart with information on females:
• Enter the data for females and create a 5 × 2 array.
> females <- c(0,71,27,0,2)
> hiv<-array(c(males,females), dim=c(5,2))
• Generate side-by-side bar graphs and create a legend,
> barplot(hiv, main="Proportion of AIDS Cases by Sex and Transmission Category
+ Diagnosed - USA, 2005", ylab= "percent", beside=TRUE,
+ names.arg = c("Males", "Females"),col=colors)
> legend("topright", c("Male-male contact","Injection drug use (IDU)",
+ "High-risk heterosexual contact","Male-male contact and IDU","Other"),
+ cex=0.8,fill=colors)
[Bar chart: Proportion of AIDS Cases by Sex and Transmission Category Diagnosed - USA, 2005. Side-by-side bars for Males and Females, percent on the vertical axis from 0 to 70, with the same five-category legend as above.]
Example 1.8. Next we examine a segmented bar plot. This shows the ancestral sources of genes for 75 populations
throughout Asia. The data are based on information gathered from 50,000 genetic markers. The designations for the
groups were decided by the software package STRUCTURE.
1.3 Two-way Tables
Relationships between two categorical variables can be shown through a two-way table (also known as a contingency
table, cross tabulation table, or cross classifying table).
[Embedded figure reproduced from Science 326, 11 December 2009, p. 1542 (Fig. 1): a maximum-likelihood tree of 75 populations. A hypothetical most recent common ancestor (MRCA), composed of ancestral alleles as inferred from the genotypes of one gorilla and 21 chimpanzees, was used to root the tree, and branches with bootstrap values less than 50% were condensed. Population IDs, sample collection locations with latitudes and longitudes, ethnicities, languages spoken, and sample sizes are shown in the table adjacent to each branch; linguistic groups are indicated with colors, and an averaged graph of results from STRUCTURE for K = 14 is shown to the right of the table.]
Figure 1.3: Displaying human genetic diversity for 75 populations in Asia. The software program STRUCTURE here infers 14 source populations,
10 of them major. The length of each segment in the bar is the estimate by STRUCTURE of the fraction of the genome in the sample that has
ancestors among the given source population.
[Segmented bar graph: counts of students who smoke and who do not smoke, grouped by whether 2, 1, or 0 parents smoke; vertical axis from 0 to 2000.]
Example 1.9. In 1964, Surgeon General Dr. Luther Leonidas Terry published a landmark report saying that smoking
may be hazardous to health. This led to many influential reports on the topic, including the study of the smoking habits
of 5375 high school children in Tucson in 1967. Here is a two-way table summarizing some of the results.
                  student smokes   student does not smoke   total
2 parents smoke        400                1380               1780
1 parent smokes        416                1823               2239
0 parents smoke        188                1168               1356
total                 1004                4371               5375
• The row variable is the parents' smoking habits.
• The column variable is the student's smoking habits.
• The cells display the counts for each of the categories of row and column variables.
A two-way table with r rows and c columns is often called an r by c table (written r × c).
The totals along each of the rows and columns give the marginal distributions. We can create a segmented bar
graph as follows:
> smoking<-matrix(c(400,1380,416,1823,188,1168),ncol=3)
> colnames(smoking)<-c("2 parents","1 parent", "0 parents")
> rownames(smoking)<-c("smokes","does not smoke")
> smoking
               2 parents 1 parent 0 parents
smokes               400      416       188
does not smoke      1380     1823      1168
> barplot(smoking,legend=rownames(smoking))
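As a quick check (a sketch, not from the text), the marginal distributions can be read off the matrix with rowSums and colSums; the results match the row and column totals in the table above.
> rowSums(smoking)
        smokes does not smoke
          1004           4371
> colSums(smoking)
2 parents  1 parent 0 parents
     1780      2239      1356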
Example 1.10. Hemoglobin E is a variant of hemoglobin with a mutation in the β globin gene causing the substitution
of lysine for glutamic acid at position 26 of the β globin chain. HbE (E is the one-letter abbreviation for glutamic acid)
is the second most common abnormal hemoglobin after sickle cell hemoglobin (HbS). HbE is common from India to
Southeast Asia. The β chain of HbE is synthesized at a reduced rate compared to normal hemoglobin (HbA) because HbE
produces an alternate splicing site within an exon.
It has been suggested that Hemoglobin E provides some protection against malaria virulence when heterozygous,
but it causes anemia when homozygous. The circumstance in which the heterozygotes for the alleles under considera-
tion have a higher adaptive value than the homozygote is called balancing selection.
The table below gives the counts of differing hemoglobin genotypes on two Indonesian islands.
genotype AA AE EE
Flores 128 6 0
Sumba 119 78 4
Because the heterozygotes are rare on Flores, it appears malaria is less prevalent there since the heterozygote does
not provide an adaptive advantage.
Exercise 1.11. Make a segmented barchart of the data on hemoglobin genotypes. Have each bar display the distribu-
tion of genotypes on the two Indonesian islands.
1.4 Histograms and the Empirical Cumulative Distribution Function
Histograms are a common visual representation of a quantitative variable. Histograms summarize the data using
rectangles to display either frequencies or proportions as normalized frequencies. In making a histogram, we
• Divide the range of data into bins of equal width (usually, but not always).
• Count the number of observations in each class.
• Draw the histogram rectangles representing frequencies or percents by area.
Interpret the histogram by giving
• the overall pattern
– the center
– the spread
– the shape (symmetry, skewness, peaks)
• and deviations from the pattern
– outliers
– gaps
The direction of the skewness is the direction of the longer of the two tails (left or right) of the distribution.
No one choice for the number of bins is considered best. One possible choice for larger data sets is Sturges' formula, which chooses ⌊1 + log₂ n⌋ bins. (⌊·⌋, the floor function, is obtained by rounding down to the next integer.)
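As a quick check, note that R's built-in nclass.Sturges (the default bin-count rule for hist) uses a ceiling rather than a floor, so the two counts can differ by one:
> n <- 200                  # a sample of 200 observations
> floor(1 + log2(n))        # Sturges' formula as stated above
[1] 8
> nclass.Sturges(rnorm(n))  # R's version: ceiling(log2(n) + 1)
[1] 9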
Exercise 1.12. The histograms in Figure 1.4 show the distributions of lengths of a normal strain and a mutant strain of Bacillus subtilis. Describe the distributions.
Example 1.13. Taking the ages of the presidents of the United States at the time of their inauguration, we create the histogram, the empirical cumulative distribution function, and the boxplot in R as follows.
Figure 1.4: Histograms of lengths of Bacillus subtilis. Solid lines indicate the wild type and dashed lines the mutant strain.
> age<- c(57,61,57,57,58,57,61,54,68,51,49,64,50,48,65,52,56,46,54,49,51,47,55,55,
54,42,51,56,55,51,54,51,60,61,43,55,56,61,52,69,64,46,54,47,70)
> par(mfrow=c(1,2))
> hist(age)
> plot(ecdf(age),xlab="age",main="Age of Presidents at the Time of Inauguration",
sub="Empirical Cumulative Distribution Function")
[Figure: left, histogram of age (frequency vs. age, 40 to 70); right, the empirical cumulative distribution function Fn(x), titled "Age of Presidents at Inauguration".]
So the ages of presidents at the time of inauguration range from the early forties to the late sixties, with the ages at the start of their tenure peaking in the early fifties. The histogram is roughly symmetric about 55 years with spread from around 40 to 70 years.
The empirical cumulative distribution function Fn(x) gives, for each value x, the fraction of the data less than or equal to x. If the number of observations is n, then

$$F_n(x) = \frac{1}{n}\,\#(\text{observations less than or equal to } x).$$
Thus, Fn(x) = 0 for any value of x less than all of the observed values, and Fn(x) = 1 for any x greater than all of the observed values. In between, we see jumps that are multiples of 1/n. For example, in the empirical cumulative distribution function for the ages of the presidents, we see a jump of size 4/45 at x = 57, indicating that 4 of the 45 presidents were 57 at the time of their inauguration.
For an alternative method to create a graph of the empirical cumulative distribution function, first place the observations in order from smallest to largest. For the presidents' age data, we can accomplish this in R by writing sort(age). Next, match these up with the integer multiples of 1 over the number of observations; in R, we enter 1:length(age)/length(age). Finally, the option type="s" produces the steps described above.
> plot(sort(age),1:length(age)/length(age),type="s",ylim=c(0,1),
main = c("Age of Presidents at the Time of Inauguration"),
sub=("Empirical Cumulative Distribution Function"),
xlab=c("age"),ylab=c("cumulative fraction"))
Exercise 1.14. Give the fraction of presidents whose age at inauguration was under 60. What is the range for the age
at inauguration of the youngest fifth of the presidents?
Exercise 1.15. Histograms for data on the lengths of three bacterial strains are shown below. Lengths are given in microns. Below the histograms (but not necessarily directly below) are empirical cumulative distribution functions corresponding to these three histograms.
[Figure: top row, histograms of wild1f, wild2f, and wild3f (frequency vs. length, 0 to 8 microns); bottom row, empirical cumulative distribution functions labeled wildaf, wildbf, and wildcf (cumulative fraction vs. length, 0 to 8 microns).]
Match the histograms to their respective empirical cumulative distribution functions.
In looking at life span data, the natural question is "What fraction of the individuals have survived a given length of time?" The survival function Sn(x) gives, for each value x, the fraction of the data strictly greater than x. If the number of observations is n, then

$$S_n(x) = \frac{1}{n}\#(\text{observations greater than } x) = \frac{1}{n}\left(n - \#(\text{observations less than or equal to } x)\right)$$
$$= 1 - \frac{1}{n}\#(\text{observations less than or equal to } x) = 1 - F_n(x).$$
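Since Sn(x) = 1 − Fn(x), a graph of the survival function can be sketched with a small variation on the earlier empirical cumulative distribution function commands; here is one sketch for the presidents' age data (the title and labels are our own choices):
> plot(sort(age),1-1:length(age)/length(age),type="s",ylim=c(0,1),
+ main=c("Survival Function for Age at Inauguration"),
+ xlab=c("age"),ylab=c("fraction above age x"))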
[Figure: scatterplot of the number of de novo mutations vs. the average age of the parent, for the Icelandic trio data of Example 1.16.]
1.5 Scatterplots
We now consider two-dimensional data. The values of the first variable, x1, x2, . . . , xn, are assumed known and, in an experiment, are often set by the experimenter. This variable is called the explanatory, predictor, descriptor, or input variable, and in a two-dimensional scatterplot of the data its values are displayed on the horizontal axis. The values y1, y2, . . . , yn, taken from observations with inputs x1, x2, . . . , xn, are called the response or target variable, and these values are displayed on the vertical axis. In describing a scatterplot, take into consideration
• the form, for example,
– linear
– curved relationships
– clusters
• the direction,
– a positive or negative association
• and the strength of the aspects of the scatterplot.
Example 1.16. Genetic evolution is based on mutation. Consequently, one fundamental question in evolutionary biology is the rate of de novo mutations. To investigate this question in humans, Kong et al. sequenced the entire genomes of 78 Icelandic trios and recorded the ages of the parents and the number of de novo mutations in the offspring.
The plot shows a moderate positive linear association: children of older parents have, on average, more mutations. The number of mutations ranges from ∼ 40 for children of younger parents to ∼ 100 for children of older parents. We will later learn that the father is the major source of this increase with age.
Example 1.17 (Fossils of the Archaeopteryx). The name Archaeopteryx derives from the ancient Greek meaning "ancient feather" or "ancient wing". Archaeopteryx is generally accepted by palaeontologists as being the oldest known bird. Archaeopteryx lived in the Late Jurassic Period around 150 million years ago, in what is now southern Germany, during a time when Europe was an archipelago of islands in a shallow warm tropical sea. The first complete specimen of Archaeopteryx was announced in 1861, only two years after Charles Darwin published On the Origin of Species, and thus became a key piece of evidence in the debate over evolution. Below are the lengths in centimeters of the femur and humerus for the five specimens of Archaeopteryx that have preserved both bones.
femur 38 56 59 64 74
humerus 41 63 70 72 84
> femur<-c(38,56,59,64,74)
> humerus<-c(41,63,70,72,84)
> plot(femur, humerus,main=c("Bone Lengths for Archeopteryx"))
Unless we have a specific scientific question, we have no real reason for a choice of the explanatory variable.
[Figure: scatterplot titled "Bone Lengths for Archeopteryx", humerus vs. femur length in centimeters for the five specimens.]
Describe the scatterplot.
Example 1.18. These historical data show the assets and incomes of the 20 largest banks in 1974. Values are given in billions of dollars.
Bank 1 2 3 4 5 6 7 8 9 10
Assets 49.0 42.3 36.6 16.4 14.9 14.2 13.5 13.4 13.2 11.8
Income 218.8 265.6 170.9 85.9 88.1 63.6 96.9 60.9 144.2 53.6
Bank 11 12 13 14 15 16 17 18 19 20
Assets 11.6 9.5 9.4 7.5 7.2 6.7 6.0 4.6 3.8 3.4
Income 42.9 32.4 68.3 48.6 32.2 42.7 28.9 40.7 13.8 22.2
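The scatterplot shown below can be reproduced by entering the data from the table; a sketch:
> assets<-c(49.0,42.3,36.6,16.4,14.9,14.2,13.5,13.4,13.2,11.8,
+ 11.6,9.5,9.4,7.5,7.2,6.7,6.0,4.6,3.8,3.4)
> income<-c(218.8,265.6,170.9,85.9,88.1,63.6,96.9,60.9,144.2,53.6,
+ 42.9,32.4,68.3,48.6,32.2,42.7,28.9,40.7,13.8,22.2)
> plot(assets,income,main=c("Income vs. Assets (in billions of dollars)"))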
[Figure: scatterplot titled "Income vs. Assets (in billions of dollars)", income vs. assets for the 20 banks.]
Describe the scatterplot.
In 1972, Michele Sindona, a banker with close ties to the Mafia, a purportedly bogus Freemasonic lodge, and the Nixon administration, purchased a controlling interest in Bank 19, Long Island's Franklin National Bank. After acquiring his controlling stake in Franklin, Sindona ran a money laundering operation connected to his alleged ties to the Vatican Bank and the Sicilian drug cartel. Sindona used the bank's ability to transfer funds, produce letters of credit, and trade in foreign currencies to begin building a banking empire in the United States. In mid-1974, management revealed huge losses, and depositors started making large withdrawals, forcing the bank to borrow over $1 billion from the Federal Reserve Bank. On 8 October 1974, the bank was declared insolvent due to mismanagement and fraud, involving losses in foreign currency speculation and poor loan policies.
What would you expect to be a feature on this scatterplot of a failing bank? Does the Franklin Bank have this
feature?
1.6 Time Plots
Some data sets come with an order of events, say ordered by time.
Example 1.19. The modern history of petroleum began in the 19th century with the refining of kerosene from crude oil. The world's first commercial oil wells were drilled in the 1850s in Poland and in Romania. The first oil well in North America was in Oil Springs, Ontario, Canada in 1858. The US petroleum industry began with Edwin Drake's drilling of a 69-foot deep oil well in 1859 on Oil Creek near Titusville, Pennsylvania for the Seneca Oil Company. The industry grew through the 1800s, driven by the demand for kerosene and oil lamps. The introduction of the internal combustion engine in the early part of the 20th century provided a demand that has largely sustained the industry to this day. Today, about 90% of vehicular fuel needs are met by oil. Petroleum also makes up 40% of total energy consumption in the United States, but is responsible for only 2% of electricity generation. Oil use increased exponentially until the world oil crises of the 1970s.
Worldwide Oil Production
Million Million Million
Year Barrels Year Barrels Year Barrels
1880 30 1940 2150 1972 18584
1890 77 1945 2595 1974 20389
1900 149 1950 3803 1976 20188
1905 215 1955 5626 1978 21922
1910 328 1960 7674 1980 21722
1915 432 1962 8882 1982 19411
1920 689 1964 10310 1984 19837
1925 1069 1966 12016 1986 20246
1930 1412 1968 14014 1988 21338
1935 1655 1970 16690
With the data given in two columns, oil and year, the time plot plot(year,oil,type="b") is given on the left side of Figure 1.5. The option type="b" puts both lines and points on the plot.
[Figure: left, world oil production in billions of barrels vs. year; right, log(billions of barrels) vs. year, 1880 to 1988.]
Figure 1.5: Oil production (left) and the logarithm of oil production (right) from 1880 to 1988.
Sometimes a transformation of the data can reveal the structure of the time series. For example, if we wish to examine the exponential increase displayed in the oil production plot, then we can take the base 10 logarithm of the production and give its time series plot. This is shown in the plot on the right of Figure 1.5. (In R, we write log(x) for the natural logarithm and log(x,10) for the base 10 logarithm.)
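Thus, assuming the vector oil records production in billions of barrels, as plotted in the left panel, the right panel can be drawn with, for example:
> plot(year,log(oil,10),type="b",main=c("World Oil Production"),
+ xlab=c("year"),ylab=c("log(billions of barrels)"))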
Exercise 1.20. What happened in the mid 1970s that resulted in the long term departure from exponential growth in
the use of oil?
Example 1.21. The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body tasked
with evaluating the risk of climate change caused by human activity. The panel was established in 1988 by the
World Meteorological Organization and the United Nations Environment Programme, two organizations of the United
Nations. The IPCC does not perform original research but rather uses three working groups who synthesize research
and prepare a report. In addition, the IPCC prepares a summary report. The Fourth Assessment Report (AR4) was
completed in early 2007. The fifth was released in 2014.
Below is the first graph from the 2007 Climate Change Synthesis Report: Summary for Policymakers.
The technique used to draw the curves on the graphs is called local regression. At the risk of discussing concepts that have not yet been introduced, let's describe the idea behind local regression. At each point in the data set, the goal is to fit a linear or quadratic function. The function is determined using weighted least squares, giving the most weight to nearby points and less weight to points farther away. The graphs above show the approximating curves. The blue regions show areas within two standard deviations of the estimate (called a confidence interval). The goal of local regression is to provide a smooth approximation to the data and a sense of the uncertainty in the data. In practice, local regression requires a large data set to work well.
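The IPCC data are not reproduced here, but the idea can be illustrated with R's built-in loess function, applied as a sketch to the oil production data of Example 1.19. This is not necessarily the software used for the IPCC figures; the span and degree arguments, shown at their default values, control the neighborhood size and the local polynomial.
> fit<-loess(oil~year,span=0.75,degree=2)  # local quadratic fits by weighted least squares
> p<-predict(fit,se=TRUE)                  # estimates and their standard errors
> plot(year,oil)
> lines(year,p$fit)                        # the smooth approximating curve
> lines(year,p$fit+2*p$se.fit,lty=2)       # band of two standard errors
> lines(year,p$fit-2*p$se.fit,lty=2)       #   about the estimate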
Example 1.22. The next figure gives a time series plot of a single molecule experiment showing the movement of kinesin along a microtubule. In this case the kinesin has a glass bead attached at its foot, and its heads are attached to a microtubule. The position of the glass bead is determined by using a laser beam and the optical properties of the bead to locate the bead and provide a force on the kinesin molecule. In this time plot, the load on the microtubule has a force of 3.5 pN and the concentration of ATP is 100 µM. What is the source of fluctuations in this time series plot of bead position? How would you expect this time plot to change with changes in ATP concentration and with changes in force?
1.7 Answers to Selected Exercises
1.11. Here are the R commands:
> genotypes<-matrix(c(128,6,0,119,78,4),ncol=2)
> colnames(genotypes)<-c("Flores","Sumba")
> rownames(genotypes)<-c("AA","AE","EE")
> genotypes
Flores Sumba
AA 128 119
AE 6 78
EE 0 4
> barplot(genotypes,legend=rownames(genotypes),args.legend=list(x="topleft"))
The legend was moved to the left side to avoid crowding with the taller bar for the data on Sumba.
[Figure: segmented bar chart of hemoglobin genotypes (AA, AE, EE) for Flores and Sumba; counts run from 0 to 200.]
1.12. The lengths of the normal strain have their center at 2.5 microns and range from 1.5 to 5 microns. The distribution is somewhat skewed right with no outliers. The mutant strain has its center at 5 or 6 microns. Its range is from 2 to 14 microns and it is slightly skewed right. It has no outliers.
1.14. Look at the graph at the point above the value 60 years. Looking left from this point, we see that it corresponds to a value of 0.80, so about 0.80 of the presidents were under 60 at inauguration.
Next, look at the graph at the point to the right of the value 0.20 on the vertical axis. Looking down, we see that it corresponds to 49 years, so the youngest fifth of the presidents were at most 49 years old at inauguration.
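These graphical readings can also be checked numerically in R; the exact computed values may differ slightly from values read off the graph by eye:
> mean(age < 60)     # fraction of presidents inaugurated before age 60
> quantile(age,0.2)  # age marking off the youngest fifth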
1.15. Match histogram wild1f to wildaf. Note that both show that the range is from 2 to 5 microns and that about half of the data lies between 2 and 3 microns. Match histogram wild2f with wildcf. The data are relatively uniform from 3.5 to 6.5 microns. Finally, match histogram wild3f with wildbf. The range is from 2 to 8 microns with most of the data between 3 and 6 microns.
1.22. The fluctuations are due to the many bombardments by other molecules in the cell, most frequently water molecules.
As the opposing force increases, we expect the velocity to decrease - to a point. If the force is too large, then the kinesin is ripped away from the microtubule. As ATP concentration increases, we expect the velocity to increase - again, to a point. If the ATP concentration is sufficiently large, then the biochemical processes are saturated.
Topic 2
Describing Distributions with Numbers
There are three kinds of lies: lies, damned lies, and statistics. - Benjamin Disraeli
It is easy to lie with statistics. It is hard to tell the truth without it. - Andrejs Dunkels
We next look at quantitative data. Recall that in this case, these data can be subject to the operations of arithmetic. In particular, we can add or subtract observation values, and we can sort and rank them from lowest to highest.
We will look at two fundamental properties of these observations. The first is a measure of the center value for
the data, i.e., the median or the mean. Associated to this measure, we add a second value that describes how these
observations are spread or dispersed about this given measure of center.
The median is the central observation of the data after it is sorted from the lowest to highest observations. In addition, to give a sense of the spread in the data, we often give the smallest and largest observations as well as the observed values that are 1/4 and 3/4 of the way up this list, known as the first and third quartiles. Their difference, known as the interquartile range, is a measure of the spread or dispersion of the data. For the mean, we commonly use the standard deviation to describe the spread of the data.
These concepts are described in more detail in this section.
2.1 Measuring Center
2.1.1 Medians
The median takes the middle value of x1, x2, . . . , xn after the data have been sorted from smallest to largest,

x(1), x(2), . . . , x(n).

(x(k) is called the k-th order statistic. Sorting can be accomplished in R by using the sort command.)
If n is odd, then the median is just the value of the middle observation, x((n+1)/2). If n is even, then the two values closest to the center are averaged,

$$\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right).$$

If we store the data in R in a vector x, we can write median(x) to compute the median.
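For example:
> x<-c(3,1,4,1,5)         # n = 5 is odd
> median(x)               # the middle order statistic x((n+1)/2) = x(3)
[1] 3
> median(c(3,1,4,1,5,9))  # n = 6 is even: average of x(3) = 3 and x(4) = 4
[1] 3.5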
2.1.2 Means
For a collection of numeric data x1, x2, . . . , xn, the sample mean is the numerical average

$$\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Alternatively, if the value x occurs n(x) times in the data, then use the distributive property to see that

$$\bar{x} = \frac{1}{n}\sum_{x} x\,n(x) = \sum_{x} x\,p(x), \quad \text{where } p(x) = \frac{n(x)}{n}.$$
So the mean x̄ depends only on the proportion of observations p(x) for each value of x.
Example 2.1. For the data set {1, 2, 2, 2, 3, 3, 4, 4, 4, 5}, we have n = 10 and the sum
$$1 + 2 + 2 + 2 + 3 + 3 + 4 + 4 + 4 + 5 = 1\,n(1) + 2\,n(2) + 3\,n(3) + 4\,n(4) + 5\,n(5) = 1(1) + 2(3) + 3(2) + 4(3) + 5(1) = 30.$$
Thus, x̄ = 30/10 = 3.
Example 2.2. For the data on the lengths in microns of the wild type Bacillus subtilis, we have
length x frequency n(x) proportion p(x) product xp(x)
1.5 18 0.090 0.135
2.0 71 0.355 0.710
2.5 48 0.240 0.600
3.0 37 0.185 0.555
3.5 16 0.080 0.280
4.0 6 0.030 0.120
4.5 4 0.020 0.090
sum 200 1 2.490
So the sample mean x̄ = 2.49.
If we store the data in R in a vector x, we can write mean(x) which is equal to sum(x)/length(x) to
compute the mean.
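The computation in the table above can also be carried out directly in R:
> x<-seq(1.5,4.5,by=0.5)     # the observed lengths
> nx<-c(18,71,48,37,16,6,4)  # frequencies n(x)
> px<-nx/sum(nx)             # proportions p(x)
> sum(x*px)                  # the sample mean
[1] 2.49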
To extend this idea a bit, we can take a real-valued function h and instead consider the observations h(x1), h(x2), . . . , h(xn); then

$$\overline{h(x)} = \frac{1}{n}(h(x_1) + h(x_2) + \cdots + h(x_n)) = \frac{1}{n}\sum_{i=1}^{n} h(x_i) = \frac{1}{n}\sum_{x} h(x)\,n(x) = \sum_{x} h(x)\,p(x).$$
Exercise 2.3. Let x̄n be the sample mean for the quantitative data x1, x2, . . . , xn. For an additional observation xn+1, use x̄n to give a formula for x̄n+1, the mean of n + 1 observations. Generalize this formula for the case of k additional observations xn+1, . . . , xn+k.
Many times, we do not want to give the same weight to each observation. For example, in computing a student's grade point average, we begin by setting values xi corresponding to grades (A ↦ 4, B ↦ 3, and so on) and giving weights w1, w2, . . . , wn equal to the number of units in each course. We then compute the grade point average as a weighted mean. To do this:
• Multiply the value of each course by its weight, xiwi. This is called the number of quality points for the course.
• Add up the quality points:

$$x_1w_1 + x_2w_2 + \cdots + x_nw_n = \sum_{i=1}^{n} x_iw_i$$

• Add up the weights, i.e., the number of units attempted:

$$w_1 + w_2 + \cdots + w_n = \sum_{i=1}^{n} w_i$$
Figure 2.1: Empirical Survival Function for the Bacterial Data. This figure displays how the area under the survival function to the right of the y-axis and above the x-axis is the mean value x̄ for non-negative data. For x = 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 4.5, this area is the sum of the areas of the rectangles displayed. The width of each rectangle is x and its height is equal to p(x); thus, its area is the product xp(x). The sum of these areas is computed in Example 2.2 to give the sample mean.
• Divide the total quality points by the number of units attempted:

$$\frac{x_1w_1 + x_2w_2 + \cdots + x_nw_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} x_iw_i}{\sum_{i=1}^{n} w_i}. \tag{2.1}$$
If we let

$$p_j = w_j \Big/ \sum_{i=1}^{n} w_i$$

be the proportion or fraction of the weight given to the j-th observation, then we can rewrite (2.1) as

$$\sum_{i=1}^{n} x_ip_i.$$
If we store the weights in a vector w, then we can compute the weighted mean using weighted.mean(x,w).
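For example, for a hypothetical semester of four courses:
> grade<-c(4,3,2,4)           # A, B, C, A
> units<-c(3,3,4,3)           # units for each course
> weighted.mean(grade,units)  # the grade point average, (12+9+8+12)/13
[1] 3.153846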
If an extremely high observation is changed to be even higher, then the mean follows this change while the median
does not. For this reason, the mean is said to be sensitive to outliers while the median is not. To reduce the impact of
extreme outliers on the mean as a measure of center, we can also consider a truncated mean or trimmed mean. The
p trimmed mean is obtained by discarding both the lower and the upper p×100% of the data and taking the arithmetic
mean of the remaining data.
In R, we write mean(x, trim = p) where p, a number between 0 and 0.5, is the fraction of observations to
be trimmed from each end before the mean is computed.
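For example, with the presidents' age data (the exact trimmed value depends on how R rounds the number of observations to drop; here 10% of 45 rounds down to 4 observations from each end):
> mean(age)
[1] 54.97778
> mean(age,trim=0.1)  # drop the lowest and highest 10% before averaging
[1] 54.72973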
Note that the median can be regarded as the 50% trimmed mean. The median does not change with changes in the extreme observations; a statistic with this property is called a resistant measure. On the other hand, the mean is not a resistant measure.
Exercise 2.4. Give the relationship between the median and the mean for a (a) left skewed, (b) symmetric, or (c) right
skewed distribution.
2.2 Measuring Spread
2.2.1 Five Number Summary
The first and third quartiles, Q1 and Q3, are, respectively, the medians of the lower half and the upper half of the data. The five number summary of the data consists of the minimum, Q1, the median, Q3, and the maximum. These values, along with the mean, are given in R using summary(x). Returning to the data set on the ages of presidents:
> summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
42.00 51.00 55.00 54.98 58.00 70.00
We can display the five number summary using a boxplot.
> boxplot(age, main = c("Age of Presidents at the Time of Inauguration"))
[Figure: boxplot titled "Age of Presidents at the Time of Inauguration", ages from roughly 42 to 70, with two outliers plotted as circles at the top.]
The value Q3 − Q1 is called the interquartile range and is denoted by IQR. It is found in R with the command IQR. Outliers are somewhat arbitrarily chosen to be those above Q3 + (3/2)IQR and below Q1 − (3/2)IQR. With this criterion, the ages of Ronald Reagan and Donald Trump, considered outliers, are displayed by the two circles at the top of the boxplot. The boxplot command has the default value range = 1.5 in the choice of displaying outliers. This can be altered to loosen or tighten this criterion.
Exercise 2.5. Use the range option in the boxplot command to create a boxplot of the ages of the presidents at the time of their inauguration, using as the criterion for outliers any value above Q3 + IQR or below Q1 − IQR. How many outliers does this boxplot have?
Example 2.6. Consider a two column data set. Column 1 - MPG - gives car gas mileage. Column 2 - Origin - gives the country of origin for the car. We can create side-by-side boxplots, one for each country of origin, with the formula interface
> boxplot(MPG ~ Origin)
2.2.2 Sample Variance and Standard Deviation
The sample variance averages the squared differences from the mean,

$$\mathrm{var}(x) = s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
The sample standard deviation, sx, is the square root of the sample variance. We shall soon learn the rationale for
the decision to divide by n− 1. However, we shall also encounter circumstances in which division by n is preferable.
We will routinely drop the subscript x and write s to denote standard deviation if there is no ambiguity.
Example 2.7. For the Bacillus subtilis length data, we have x̄ = 498/200 = 2.49.
length, x frequency, n(x) x− x̄ (x− x̄)2 (x− x̄)2n(x)
1.5 18 -0.99 0.9801 17.6418
2.0 71 -0.49 0.2401 17.0471
2.5 48 0.01 0.0001 0.0048
3.0 37 0.51 0.2601 9.6237
3.5 16 1.01 1.0201 16.3216
4.0 6 1.51 2.2801 13.6806
4.5 4 2.01 4.0401 16.1604
sum 200 90.4800
So the sample variance is s²x = 90.48/199 = 0.4546734 and the standard deviation is sx = 0.6742947. To accomplish this in R:
> bacteria<-c(rep(1.5,18),rep(2.0,71),rep(2.5,48),rep(3,37),rep(3.5,16),rep(4,6),
+ rep(4.5,4))
> length(bacteria)
[1] 200
> mean(bacteria)
[1] 2.49
> var(bacteria)
[1] 0.4546734
> sd(bacteria)
[1] 0.6742947
For quantitative variables that take on positive values, we can take the ratio of the standard deviation to the mean,

$$cv_x = \frac{s_x}{\bar{x}},$$

called the coefficient of variation, as a measure of the relative variability of the observations. Note that cvx is a pure number and has no units.
For the bacteria length data, the coefficient of variation is

$$cv_x = \frac{0.6742947}{2.49} = 0.2708.$$
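In R, this ratio is computed directly from the bacteria vector defined above:
> sd(bacteria)/mean(bacteria)
[1] 0.2708011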