An Introduction to the Science of Statistics: From Theory to Implementation
Preliminary Edition
© Joseph C. Watkins

Contents

I Organizing and Producing Data

1 Displaying Data
  1.1 Types of Data
  1.2 Categorical Data
    1.2.1 Pie Chart
    1.2.2 Bar Charts
  1.3 Two-way Tables
  1.4 Histograms and the Empirical Cumulative Distribution Function
  1.5 Scatterplots
  1.6 Time Plots
  1.7 Answers to Selected Exercises

2 Describing Distributions with Numbers
  2.1 Measuring Center
    2.1.1 Medians
    2.1.2 Means
  2.2 Measuring Spread
    2.2.1 Five Number Summary
    2.2.2 Sample Variance and Standard Deviation
  2.3 Quantiles and Standardized Variables
  2.4 Quantile-Quantile Plots
  2.5 Answers to Selected Exercises

3 Correlation and Regression
  3.1 Covariance and Correlation
  3.2 Linear Regression
    3.2.1 Transformed Variables
  3.3 Extensions
    3.3.1 Nonlinear Regression
    3.3.2 Multiple Linear Regression
  3.4 Answers to Selected Exercises

4 Producing Data
  4.1 Preliminary Steps
  4.2 Professional Ethics
  4.3 Formal Statistical Procedures
    4.3.1 Observational Studies
    4.3.2 Randomized Controlled Experiments
    4.3.3 Natural Experiments
  4.4 Case Studies
    4.4.1 Observational Studies
    4.4.2 Experiments

II Probability

5 The Basics of Probability
  5.1 Introduction
  5.2 Equally Likely Outcomes and the Axioms of Probability
  5.3 Consequences of the Axioms
  5.4 Counting
    5.4.1 Fundamental Principle of Counting
    5.4.2 Permutations
    5.4.3 Combinations
  5.5 Answers to Selected Exercises
  5.6 Set Theory - Probability Theory Dictionary

6 Conditional Probability and Independence
  6.1 Restricting the Sample Space - Conditional Probability
  6.2 The Multiplication Principle
  6.3 The Law of Total Probability
  6.4 Bayes Formula
  6.5 Independence
  6.6 Answers to Selected Exercises

7 Random Variables and Distribution Functions
  7.1 Introduction
  7.2 Distribution Functions
  7.3 Properties of the Distribution Function
    7.3.1 Discrete Random Variables
    7.3.2 Continuous Random Variables
  7.4 Mass Functions
  7.5 Density Functions
  7.6 Mixtures
  7.7 Joint and Conditional Distributions
    7.7.1 Discrete Random Variables
    7.7.2 Continuous Random Variables
    7.7.3 Independent Random Variables
  7.8 Simulating Random Variables
    7.8.1 Discrete Random Variables and the sample Command
    7.8.2 Continuous Random Variables and the Probability Transform
  7.9 Answers to Selected Exercises

8 The Expected Value
  8.1 Definition and Properties
  8.2 Discrete Random Variables
  8.3 Bernoulli Trials
  8.4 Continuous Random Variables
  8.5 Summary
  8.6 Names for Eg(X)
  8.7 Independence
  8.8 Covariance and Correlation
    8.8.1 Equivalent Conditions for Independence
  8.9 Quantile Plots and Probability Plots
  8.10 Answers to Selected Exercises

9 Examples of Mass Functions and Densities
  9.1 Examples of Discrete Random Variables
  9.2 Examples of Continuous Random Variables
  9.3 More on Mixtures
  9.4 R Commands
  9.5 Summary of Properties of Random Variables
    9.5.1 Discrete Random Variables
    9.5.2 Continuous Random Variables
  9.6 Answers to Selected Exercises

10 The Law of Large Numbers
  10.1 Introduction
  10.2 Monte Carlo Integration
  10.3 Importance Sampling
  10.4 Answers to Selected Exercises

11 The Central Limit Theorem
  11.1 Introduction
  11.2 The Classical Central Limit Theorem
    11.2.1 Bernoulli Trials and the Continuity Correction
  11.3 Propagation of Error
  11.4 Delta Method
  11.5 Summary of Normal Approximations
    11.5.1 Sample Sum
    11.5.2 Sample Mean
    11.5.3 Sample Proportion
    11.5.4 Delta Method
  11.6 Answers to Selected Exercises

III Estimation

12 Overview of Estimation
  12.1 Introduction
  12.2 Classical Statistics
  12.3 Bayesian Statistics
  12.4 Answers to Selected Exercises

13 Method of Moments
  13.1 Introduction
  13.2 The Procedure
  13.3 Examples
  13.4 Answers to Selected Exercises

14 Unbiased Estimation
  14.1 Introduction
  14.2 Computing Bias
  14.3 Compensating for Bias
  14.4 Consistency
  14.5 Cramér-Rao Bound
  14.6 A Note on Exponential Families and Efficient Estimators
  14.7 Answers to Selected Exercises

15 Maximum Likelihood Estimation
  15.1 Introduction
  15.2 Examples
  15.3 Summary of Estimators
  15.4 Asymptotic Properties
  15.5 Comparison of Estimation Procedures
  15.6 Multidimensional Estimation
  15.7 The Case of Exponential Families
  15.8 Choice of Estimators
  15.9 Technical Aspects
  15.10 Answers to Selected Exercises

16 Interval Estimation
  16.1 Classical Statistics
    16.1.1 Means
    16.1.2 Linear Regression
    16.1.3 Sample Proportions
    16.1.4 Summary of Standard Confidence Intervals
    16.1.5 Interpretation of the Confidence Interval
    16.1.6 Extensions on the Use of Confidence Intervals
  16.2 The Bootstrap
  16.3 Bayesian Statistics
  16.4 Answers to Selected Exercises

IV Hypothesis Testing

17 Simple Hypotheses
  17.1 Overview and Terminology
  17.2 The Neyman-Pearson Lemma
    17.2.1 The Receiver Operating Characteristic
  17.3 Examples
  17.4 Summary
  17.5 Proof of the Neyman-Pearson Lemma
  17.6 A Brief Introduction to the Bayesian Approach
  17.7 Answers to Selected Exercises

18 Composite Hypotheses
  18.1 Partitioning the Parameter Space
  18.2 The Power Function
  18.3 The p-value
  18.4 Distribution of p-values and the Receiver Operating Characteristic
  18.5 Multiple Hypothesis Testing
    18.5.1 Familywise Error Rate
    18.5.2 False Discovery Rate
  18.6 Answers to Selected Exercises

19 Extensions on the Likelihood Ratio
  19.1 One-Sided Tests
  19.2 Likelihood Ratio Tests
  19.3 Chi-square Tests
  19.4 Answers to Selected Exercises

20 t Procedures
  20.1 Guidelines for Using the t Procedures
  20.2 One Sample t Tests
  20.3 Correspondence between Two-Sided Tests and Confidence Intervals
  20.4 Matched Pairs Procedures
  20.5 Two Sample Procedures
  20.6 Summary of Tests of Significance
    20.6.1 General Guidelines
    20.6.2 Test for Population Proportions
    20.6.3 Test for Population Means
  20.7 A Note on the Delta Method
  20.8 The t Test as a Likelihood Ratio Test
  20.9 Non-parametric Alternatives
    20.9.1 Permutation Test
    20.9.2 Mann-Whitney or Wilcoxon Rank Sum Test
    20.9.3 Wilcoxon Signed-Rank Test
  20.10 Answers to Selected Exercises

21 Goodness of Fit
  21.1 Fit of a Distribution
  21.2 Contingency Tables
  21.3 Applicability and Alternatives to Chi-squared Tests
  21.4 Answers to Selected Exercises

22 Analysis of Variance
  22.1 Overview
  22.2 One Way Analysis of Variance
  22.3 Contrasts
  22.4 Two Sample Procedures
  22.5 Kruskal-Wallis Rank-Sum Test
  22.6 Answers to Selected Exercises

Appendix A: A Sample R Session

Index

Preface

Statistical thinking will one day be as necessary a qualification for efficient citizenship as the ability to read and write.
– Samuel Wilks, 1951, paraphrasing H. G. Wells from Mankind in the Making

The value of statistical thinking is now accepted by researchers and practitioners from a broad range of endeavors. This viewpoint has become common wisdom in a world of big data. The challenge for statistics educators is to adapt their pedagogy to accommodate the circumstances associated with the information age.
This choice of pedagogy should be attuned to the quantitative capabilities and scientific background of the students as well as the intended use of their newly acquired knowledge of statistics. Many university students, presumed to be proficient in college algebra, are taught a variety of procedures and standard tests under a well-developed pedagogy. This approach is sufficiently refined that students gain a good intuitive understanding of the underlying principles presented in the course. However, if the statistical needs presented by a given scientific question fall outside the battery of methods presented in the standard curriculum, then students are typically at a loss to adjust the procedures to accommodate the additional demand.

On the other hand, undergraduate students majoring in mathematics frequently have a course on the theory of statistics as a part of their program of study. In this case, the standard curriculum repeatedly comes close to the very practically minded subject that statistics is. However, the demands of the syllabus leave very little time to explore these applications with any sustained attention.

Our goal is to find a middle ground. Although calculus is a routine tool in the development of statistics, the statistics curriculum seldom takes advantage of what students who have learned calculus can do. The objective of this book is to meet this need with a one-semester course in statistics that moves forward in recognition of the coherent body of knowledge provided by statistical theory while keeping a consistent eye on the applications of the subject. Such a course may not be able to achieve the same degree of completeness as the two more standard courses described above.
However, it ought to be able to achieve some important goals:

• leaving students capable of understanding what statistical thinking is and how to integrate this with scientific procedures and quantitative modeling, and
• learning how to ask statistics experts productive questions, and how to implement their ideas using statistical software and other computational tools.

Inevitably, many important topics are not included in this book. In addition, I have chosen to incorporate abbreviated introductions of some more advanced topics. Such topics can be skipped in a first pass through the material. However, one value of a textbook is that it can serve as a reference in future years. The context for some parts of the exposition will become clearer as students continue their own education in statistics. In these cases, the more advanced pieces can serve as a bridge from this book to better developed accounts. My goal is not to compose a stand-alone treatise, but rather to build a foundation that allows those who have worked through this book to introduce themselves to many exciting topics both in statistics and in its areas of application.

Who Should Use this Book

The major prerequisites are comfort with calculus and a strong interest in questions that can benefit from statistical analysis. Willingness to engage in explorations utilizing statistical software is an important additional requirement. The original audience for the course associated with this book was undergraduate students minoring in mathematics. These students have typically completed a course in multivariate calculus. Many have been exposed to either linear algebra or differential equations. They enroll in this course because they want to obtain a better understanding of their own core subject.
Even though we regularly rely on the mechanics of calculus and occasionally need to work with matrices, this is not a textbook for a mathematics course, but rather a textbook dedicated to a higher level of understanding of the concepts and practical applications of statistics. In this regard, it relies on a solid grasp of concepts and structures in calculus and algebra.

With the advance and adoption of the Common Core State Standards in mathematics, we can anticipate that primary and secondary school students will experience a broader exposure to statistics through their school years. As a consequence, we will need to develop a curriculum for teachers and future teachers so that they can take content in statistics and turn it into curriculum for their students. This book can serve as a source of that content.

In addition, those engaged both in industry and in scholarly research are experiencing a surge in the need to design more complex experiments and analyze more diverse data types. Universities and industry are responding with advanced educational opportunities to extend statistics education beyond the theory of probability and statistics, linear models, and design of experiments to more modern approaches that include stochastic processes, machine learning and data mining, Bayesian statistics, and statistical computing. This book can serve as an entry point for these critical topics in statistics.

An Annotated Syllabus

The four parts of the course - organizing and collecting data, an introduction to probability, estimation procedures, and hypothesis testing - are the building blocks of many statistics courses. We highlight some of the particular features in this book.
Organizing and Collecting Data

Much of this is standard and essential - organizing categorical and quantitative data, appropriately displayed as contingency tables, bar charts, histograms, boxplots, time plots, and scatterplots, and summarized using medians, quartiles, means, weighted means, trimmed means, standard deviations, correlations, and regression lines. We use this as an opportunity to introduce the statistical software package R and to add additional summaries like the empirical cumulative distribution function and the empirical survival function. One example incorporating the use of these is the comparison of the lifetimes of wildtype and transgenic mosquitoes and a discussion of the best strategy to display and summarize the data if the goal is to examine the differences between these two genotypes in their ability to carry and spread malaria. A bit later, we will do an integration by parts exercise to show that the mean of a non-negative continuous random variable is the area under its survival function.

Collecting data under a good design is introduced early in the text, and discussion of the underlying principles of experimental design is an abiding issue throughout. With each new mathematical or statistical concept comes an enhanced understanding of what an experiment might uncover through a more sophisticated design than was previously thought possible. The students are given readings on the design of experiments and examples using R to create a sample under a variety of protocols.

Introduction to Probability

Probability theory is the analysis of random phenomena. It is built on the axioms of probability and is explored, for example, through the introduction of random variables. The goal of probability theory is to uncover properties arising from the phenomena under study. Statistics is devoted to the analysis of data.
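As a small preview of this kind of work in R, the identity relating the mean to the survival function can be checked numerically. The lifetimes below are simulated from an exponential distribution (an assumption made purely for illustration; the mosquito data themselves are not reproduced here):

```r
set.seed(1)
x <- rexp(1000, rate = 1/10)     # simulated lifetimes, population mean 10

Fhat <- ecdf(x)                  # empirical cumulative distribution function
Shat <- function(t) 1 - Fhat(t)  # empirical survival function

# The empirical survival function is a step function, constant at
# (n - k)/n between consecutive order statistics, so its integral
# over [0, infinity) can be computed exactly.
xs   <- sort(x)
n    <- length(xs)
area <- sum(diff(c(0, xs)) * (n - 0:(n - 1)) / n)

area - mean(x)   # zero up to rounding: the area equals the sample mean
```

The agreement is exact rather than approximate: the area under the empirical survival function is the sample mean, the finite-sample counterpart of the integration by parts identity mentioned above.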
One goal of statistical science is to articulate as well as possible what model of random phenomena underlies the production of the data. The focus of this section of the course is to develop those probabilistic ideas that relate most directly to the needs of statistics. Thus, we must study the axioms and basic properties of probability to the extent that the students understand conditional probability and independence. Conditional probability is necessary to develop Bayes formula, which we will later use to give a taste of the Bayesian approach to statistics. Independence will be needed to describe the likelihood function in the case of an experimental design based on independent observations. Densities for continuous random variables and mass functions for discrete random variables are necessary to write these likelihood functions explicitly. Expectation will be used to standardize a sample sum or sample mean and to perform method of moments estimates.

Random variables are developed for a variety of reasons. Some, like the binomial, negative binomial, Poisson, or gamma random variable, arise from considerations based on Bernoulli trials or exponential waiting. The hypergeometric random variable helps us understand the difference between sampling with and without replacement. The F, t, and chi-square random variables will later become test statistics. Uniform random variables are the ones simulated by random number generators. Because of the central limit theorem, the normal family is the most important among the list of parametric families of random variables.

The flavor of the text becomes more authentically statistical with the law of large numbers and the central limit theorem. These are largely developed using simulation explorations and first applied to simple Monte Carlo techniques and importance sampling to estimate the values of definite integrals.
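The simplest of these techniques can be sketched in a few lines of R. By the law of large numbers, the average of g evaluated at uniform draws converges to the integral of g over [0, 1]; the integrand below is an arbitrary choice for illustration:

```r
set.seed(1)
g <- function(x) exp(-x^2)        # an arbitrary integrand on [0, 1]

u  <- runif(1e5)                  # uniform draws on [0, 1]
mc <- mean(g(u))                  # Monte Carlo estimate of the integral
se <- sd(g(u)) / sqrt(length(u))  # Monte Carlo standard error

c(monte.carlo = mc, quadrature = integrate(g, 0, 1)$value)
```

The standard error computed alongside the estimate is the point of the exercise: it shrinks like 1/sqrt(n), so quadrupling the number of draws only halves the uncertainty.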
One cautionary tale is an example of the failure of these simulation techniques when applied without careful analysis. If one uses, for example, Cauchy random variables in the evaluation of some quantity, then the simulated sample means can appear to be converging, only to experience an abrupt and unpredictable jump. The lack of convergence of an improper integral reveals the difficulty.

The central object of study is, of course, the central limit theorem. It is developed both in terms of sample sums and sample means and proportions, and it is used in relatively standard ways to estimate probabilities. However, in this book, we can introduce the delta method, which adds ideas associated with the central limit theorem to the context of propagation of error.

Estimation

In the simplest possible terms, the goal of estimation theory is to answer the question: What is that number? An estimate is a statistic, i.e., a function of the data. We look to two types of estimation techniques - method of moments and maximum likelihood - and several criteria for an estimator using, for example, variance and bias. Several examples, including mark and recapture and the distribution of fitness effects from genetic data, are developed for both types of estimators. The variance of an estimator is approximated using the delta method for method of moments estimators and using Fisher information for maximum likelihood estimators. An analysis of bias is based on quadratic Taylor series approximations and the properties of expectations. Both classes of estimators are often consistent. This implies that the bias decreases towards zero with an increasing number of observations. R is routinely used in simulations to gain insight into the quality of estimators.

The point estimation techniques are followed by interval estimation and, notably, by confidence intervals. This brings us to the familiar one and two sample t-intervals for population means and one and two sample z-intervals for population proportions.
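A one-sample t-interval of this familiar kind takes a single line in R. The sketch below (on simulated data, an assumption for illustration) also computes the interval from the formula x̄ ± t* s/√n to confirm the two agree:

```r
set.seed(1)
x <- rnorm(25, mean = 10, sd = 2)   # a simulated sample of 25 measurements

# 95% t-interval from the formula: mean plus or minus t* times the
# standard error, with the critical value from the t distribution
n       <- length(x)
se      <- sd(x) / sqrt(n)
tstar   <- qt(0.975, df = n - 1)
by.hand <- mean(x) + c(-1, 1) * tstar * se

# The same interval from R's built-in one-sample t test
built.in <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

rbind(by.hand, built.in)   # the two intervals agree
```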
In addition, we can return to the delta method and the observed Fisher information to construct confidence intervals associated, respectively, with method of moments estimators and maximum likelihood estimators. We also add a brief introduction to bootstrap confidence intervals and Bayesian credible intervals in order to provide a broader introduction to strategies for parameter estimation.

Hypothesis Testing

For hypothesis testing, we first establish the central issues - null and alternative hypotheses, type I and type II errors, test statistics and critical regions, significance and power. We then present the ideas behind the use of likelihood ratio tests as best tests for a simple hypothesis. This is motivated by a game designed to explain the importance of the Neyman-Pearson lemma. This approach leads us to well-known diagnostics of an experimental design, notably, the receiver operating characteristic and power curves. Extensions of the Neyman-Pearson lemma form the basis for the t test for means, the chi-square test for goodness of fit, and the F test for analysis of variance. These results follow from the application of optimization techniques from calculus, including the use of Lagrange multipliers to develop goodness of fit tests. The Bayesian approach to hypothesis testing is explored for the case of a simple hypothesis using morphometric measurements, in this case a butterfly wingspan, to test whether a habitat has been invaded by a mimic species. The desire for a powerful test is articulated in a variety of ways. In engineering terms, power is called sensitivity. We illustrate this with a radon detector. An insensitive instrument is a risky purchase. This can be either because the instrument is substandard in detecting fluctuations or because the statistical basis for the algorithm used to determine a change in radon level is poor.
An insensitive detector has the undesirable property of not sounding its alarm when the radon level has indeed risen. The course ends by looking at the logic of hypothesis testing and the results of different likelihood ratio analyses applied to a variety of experimental designs. The delta method allows us to extend the resulting test statistics to multivariate nonlinear transformations of the data. The textbook concludes with a practical view of the consequences of this analysis through case studies in a variety of disciplines including, for example, genetics, health, ecology, and bee biology. This will serve to introduce us to the well-known t procedure for inference of the mean, both the likelihood-based G2 test and the traditional chi-square test for discrete distributions and contingency tables, and the F test for one-way analysis of variance. We add short descriptions of the corresponding non-parametric procedures, namely, permutation, rank-sum and signed-rank tests for quantitative data, and exact tests for categorical data.

Exercises and Problems

One obligatory statement in the preface of a book such as this is to note the necessity of working problems. The material can only be mastered by grappling with the issues through the application to engaging and substantive questions. In this book, we address this imperative through exercises and through problems. The exercises, integrated into the textbook narrative, are of two basic types. The first consists largely of mathematical or computational exercises that are meant to provide or extend the derivation of a useful identity or data analysis technique. These experiences will prepare the student to perform the calculations that routinely occur in investigations that use statistical thinking. The second type forms a collection of questions that are meant to affirm the understanding of a particular concept. Problems are collected at the end of each of the four parts of the book.
While the ordering of the problems generally follows the flow of the text, they are designed to be more extensive and integrative. These problems often incorporate several concepts and will call on a variety of problem solving strategies combining handwritten work with the use of statistical software. Without question, the best problems are those that the students choose from their own interests.

Acknowledgements

The concept that led to this book grew out of a conversation with the late Michael Wells, Professor of Biochemistry at the University of Arizona. He felt that if we are asking future life science researchers to take the time to learn calculus and differential equations, we should also provide a statistics course that adds value to their abilities to design experiments and analyze data while reinforcing both the practical and conceptual sides of calculus. As a consequence, course development received initial funding from a Howard Hughes Medical Institute grant (52005889). Christopher Bergevin, an HHMI postdoctoral fellow, provided a valuable initial collaboration. Since that time, I have had the great fortune to be the teacher of many bright and dedicated students whose future contribution to our general well-being is beyond dispute. Their cheerfulness and inquisitiveness have been a source of inspiration for me. More practically, their questions and their persistence led to a much clearer exposition and the addition of many dozens of figures to the text. Through their end of semester projects, I have been introduced to many interesting questions that are intriguing in their own right, but that have also added to the range of applications presented throughout the text. Four of these students - Beryl Jones, Clayton Mosher, Laurel Watkins de Jong, and Taylor Corcoran - have gone on to become assistants in the course. I am particularly thankful to these four for their contributions to the dynamic atmosphere that characterizes the class experience.
Part I

Organizing and Producing Data

Topic 1

Displaying Data

There are two goals when presenting data: convey your story and establish credibility. - Edward Tufte

Statistics is a mathematical science that is concerned with the collection, analysis, interpretation or explanation, and presentation of data. Properly used, statistical principles are essential in guiding any inquiry informed by data and, especially in the phase of data exploration, are routinely a fundamental source for discovery and innovation. Insights from data may come from a well conceived visualization of the data, from modern methods of statistical learning and model selection, as well as from time-honored formal statistical procedures. One's first encounters with data are through graphical displays and numerical summaries. The goal is to find an elegant method for this presentation that is at the same time both objective and informative - making clear with a few lines or a few numbers the salient features of the data. In this sense, data presentation is at the same time an art, a science, and an obligation to impartiality. In this section, we will describe some of the standard presentations of data and, at the same time, take the opportunity to introduce some of the commands that the software package R provides to draw figures and compute summaries of the data.

1.1 Types of Data

A data set provides information about a group of individuals. These individuals are, typically, representatives chosen from a population under study. Data on the individuals are meant, either informally or formally, to allow us to make inferences about the population. We shall later discuss how to define a population, how to choose individuals in the population and how to collect data on these individuals.
• Individuals are the objects described by the data.
• Variables are characteristics of an individual.
In order to present data, we must first recognize the types of data under consideration.
– Categorical variables partition the individuals into classes. Other names for categorical variables are levels or factors. One special type of categorical variable is the ordered categorical variable that suggests a ranking, say small, medium, large or mild, moderate, severe.
– Quantitative variables are those for which arithmetic operations like addition and differences make sense.

Example 1.1 (individuals and variables). We consider two populations - the first is the nations of the world and the second is the people who live in those countries. Below is a collection of variables that might be used to study these populations.

nations              people
population size      age
time zones           height
average rainfall     gender
life expectancy      ethnicities
mean income          annual income
literacy rate        literacy
capital city         mother's maiden name
largest river        marital status

Exercise 1.2. Classify the variables as quantitative or categorical in the example above.

The naming of variables and their classification as categorical or quantitative may seem like a simple, even trite, exercise. However, the first steps in designing an experiment and deciding on which individuals to include and which information to collect are vital to the success of the experiment. For example, if your goal is to measure the time for an animal (insect, bird, mammal) to complete some task under different (genetic, environmental, learning) conditions, then you may decide to have a single quantitative variable - the time to complete the task. However, an animal in your study may not attempt the task, may not complete the task, or may perform the task. As a consequence, your data analysis will run into difficulties if you do not add a categorical variable to include these possible outcomes of an experiment.

Exercise 1.3. Give examples of variables for the population of vertebrates, of proteins.
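As a small sketch of this distinction in R (the data frame and all of its values here are invented for illustration, not taken from the example above), categorical variables are stored as factors and quantitative variables as numeric vectors:

```r
# Hypothetical data set: each row is an individual (a person), each column a variable.
people <- data.frame(
  gender = factor(c("female", "male", "female")),       # categorical
  severity = factor(c("mild", "severe", "moderate"),
                    levels = c("mild", "moderate", "severe"),
                    ordered = TRUE),                    # ordered categorical
  age = c(34, 58, 21),                                  # quantitative
  annual.income = c(52000, 48000, 61000)                # quantitative
)
str(people)            # displays the type of each variable
mean(people$age)       # arithmetic makes sense for a quantitative variable
table(people$gender)   # counts make sense for a categorical variable
```

Asking for mean(people$gender) produces a warning rather than a number, which is one practical reason to record the type of each variable correctly from the start.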
1.2 Categorical Data

1.2.1 Pie Chart

A pie chart is a circular chart divided into sectors, illustrating relative magnitudes in frequencies or percents. In a pie chart, the area is proportional to the quantity it represents.

Example 1.4. As the nation debates strategies for delivering health insurance, let's look at the sources of funds and the types of expenditures.

Figure 1.1: 2008 United States health care (a) expenditures (b) income sources. Source: Centers for Medicare and Medicaid Services, Office of the Actuary, National Health Statistics Group

Exercise 1.5. How do you anticipate that this pie chart will evolve over the next decade? Which pie slices are likely to become larger? smaller? On what do you base your predictions?

Example 1.6. From UNICEF, we read "The proportion of children who reach their fifth birthday is one of the most fundamental indicators of a country's concern for its people. Child survival statistics are a poignant indicator of the priority given to the services that help a child to flourish: adequate supplies of nutritious food, the availability of high-quality health care and easy access to safe water and sanitation facilities, as well as the family's overall economic condition and the health and status of women in the community."

Example 1.7. The Gene Ontology (GO) project is a bioinformatics initiative whose goal is to provide a unified terminology for genes and their products. The project began in 1998 as a collaboration between three model organism databases, Drosophila, yeast, and mouse. The GO Consortium presently includes many databases, spanning repositories for plant, animal and microbial genomes. This project is supported by the National Human Genome Research Institute. See http://www.geneontology.org/

Figure 1.2: The 25 most frequent Biological Process Gene Ontology (GO) terms.
To make a simple pie chart in R for the proportion of AIDS cases among US males by transmission category:

> males<- c(58,18,16,7,1)
> pie(males)

This may be sufficient for your own personal use. However, if we want to use a pie chart in a presentation, we will have to provide some essential details. For a more descriptive pie chart, one has to become accustomed to interacting with the software to settle on a graph that suits the situation.
• Define some colors ideal for black and white print.
> colors <- c("white","grey70","grey90","grey50","black")
• Calculate the percentage for each category.
> male_labels <- round(males/sum(males)*100, 1)
The number 1 indicates rounding to one decimal place.
> male_labels <- paste(male_labels, "%", sep=" ")
This adds a space and a percent sign.
• Create a pie chart with a defined heading, custom colors and labels, and create a legend.
> pie(males, main="Proportion of AIDS Cases among Males by Transmission Category
+ Diagnosed - USA, 2005", col=colors, labels=male_labels, cex=0.8)
> legend("topright", c("Male-male contact","Injection drug use (IDU)",
+ "High-risk heterosexual contact","Male-male contact and IDU","Other"),
+ cex=0.8,fill=colors)
The entry cex=0.8 indicates that the legend is typeset at 80% of the font size of the main title.

[Figure: pie chart of the proportion of AIDS cases among males by transmission category, Diagnosed - USA, 2005, with percentage labels and a legend]

1.2.2 Bar Charts

Because the human eye is good at judging linear measures and poor at judging relative areas, a bar chart or bar graph is often preferable to a pie chart as a way to display categorical data.
To make a simple bar graph in R,
> barplot(males)
For a more descriptive bar chart with information on females:
• Enter the data for females and create a 5 × 2 array.
> females <- c(0,71,27,0,2)
> hiv<-array(c(males,females), dim=c(5,2))
• Generate side-by-side bar graphs and create a legend,
> barplot(hiv, main="Proportion of AIDS Cases by Sex and Transmission Category
+ Diagnosed - USA, 2005", ylab= "percent", beside=TRUE,
+ names.arg = c("Males", "Females"),col=colors)
> legend("topright", c("Male-male contact","Injection drug use (IDU)",
+ "High-risk heterosexual contact","Male-male contact and IDU","Other"),
+ cex=0.8,fill=colors)

[Figure: side-by-side bar graphs of the proportion of AIDS cases by sex and transmission category, Diagnosed - USA, 2005]

Example 1.8. Next we examine a segmented bar plot. This shows the ancestral sources of genes for 75 populations throughout Asia. The data are based on information gathered from 50,000 genetic markers. The designations for the groups were decided by the software package STRUCTURE.

1.3 Two-way Tables

Relationships between two categorical variables can be shown through a two-way table (also known as a contingency table, cross tabulation table or cross classifying table).
Figure 1.3: Displaying human genetic diversity for 75 populations in Asia. The software program STRUCTURE here infers 14 source populations, 10 of them major. (Source: Science, 11 December 2009, Vol. 326.)
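Two-way tables like the ones that follow can also be tabulated directly from raw categorical data in R with the table function. A minimal sketch with invented observations (the variable names and all values here are hypothetical, chosen only to mirror the smoking example below):

```r
# Each entry records one individual: student smoking status and parents' habits.
student <- c("smokes", "does not smoke", "smokes", "does not smoke", "does not smoke")
parents <- c("2 parents", "0 parents", "1 parent", "1 parent", "2 parents")
counts <- table(student, parents)   # cross-tabulate the two categorical variables
counts                              # the two-way table of counts
addmargins(counts)                  # append the row and column totals
```

The cell counts sum to the number of individuals, and addmargins attaches the totals along each row and column.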
The length of each segment in the bar is the estimate by STRUCTURE of the fraction of the genome in the sample that has ancestors among the given source population.

Example 1.9. In 1964, Surgeon General Dr. Luther Leonidas Terry published a landmark report saying that smoking may be hazardous to health. This led to many influential reports on the topic, including the study of the smoking habits of 5375 high school children in Tucson in 1967. Here is a two-way table summarizing some of the results.

                   student   student
                   smokes    does not smoke   total
2 parents smoke    400       1380             1780
1 parent smokes    416       1823             2239
0 parents smoke    188       1168             1356
total              1004      4371             5375

• The row variable is the parents' smoking habits.
• The column variable is the student's smoking habits.
• The cells display the counts for each of the categories of row and column variables.

A two-way table with r rows and c columns is often called an r by c table (written r × c). The totals along each of the rows and columns give the marginal distributions. We can create a segmented bar graph as follows:

> smoking<-matrix(c(400,1380,416,1823,188,1168),ncol=3)
> colnames(smoking)<-c("2 parents","1 parent", "0 parents")
> rownames(smoking)<-c("smokes","does not smoke")
> smoking
               2 parents 1 parent 0 parents
smokes               400      416       188
does not smoke      1380     1823      1168
> barplot(smoking,legend=rownames(smoking))

[Figure: segmented bar plot of student smoking counts by parents' smoking habits, with counts from 0 to 2000]

Example 1.10. Hemoglobin E is a variant of hemoglobin with a mutation in the β globin gene causing substitution of lysine for glutamic acid at position 26 of the β globin chain. HbE (E is the one letter abbreviation for glutamic acid.) is the second most common abnormal hemoglobin after sickle cell hemoglobin (HbS). HbE is common from India to Southeast Asia.
The β chain of HbE is synthesized at a reduced rate compared to normal hemoglobin (HbA) because the mutation produces an alternative splicing site within an exon. It has been suggested that Hemoglobin E provides some protection against malaria virulence when heterozygous, but it causes anemia when homozygous. The circumstance in which the heterozygotes for the alleles under consideration have a higher adaptive value than the homozygotes is called balancing selection. The table below gives the counts of differing hemoglobin genotypes on two Indonesian islands.

genotype   AA    AE   EE
Flores     128    6    0
Sumba      119   78    4

Because the heterozygotes are rare on Flores, it appears malaria is less prevalent there, since the heterozygote does not provide an adaptive advantage.

Exercise 1.11. Make a segmented bar chart of the data on hemoglobin genotypes. Have each bar display the distribution of genotypes on the two Indonesian islands.

1.4 Histograms and the Empirical Cumulative Distribution Function

Histograms are a common visual representation of a quantitative variable. Histograms summarize the data using rectangles to display either frequencies or proportions as normalized frequencies. In making a histogram, we
• Divide the range of data into bins of equal width (usually, but not always).
• Count the number of observations in each class.
• Draw the histogram rectangles representing frequencies or percents by area.
Interpret the histogram by giving
• the overall pattern
– the center
– the spread
– the shape (symmetry, skewness, peaks)
• and deviations from the pattern
– outliers
– gaps
The direction of the skewness is the direction of the longer of the two tails (left or right) of the distribution. No one choice for the number of bins is considered best. One possible choice for larger data sets is Sturges' formula to choose ⌊1 + log2 n⌋ bins. (⌊·⌋, the floor function, is obtained by rounding down to the next integer.)

Exercise 1.12.
The histograms in Figure 1.4 show the distribution of lengths of a normal strain and a mutant strain of Bacillus subtilis. Describe the distributions.

Example 1.13. Taking the age of the presidents of the United States at the time of their inauguration and creating its histogram, empirical cumulative distribution function and boxplot in R is accomplished as follows.

Figure 1.4: Histogram of lengths of Bacillus subtilis. Solid lines indicate the wild type and dashed lines the mutant strain.

> age<- c(57,61,57,57,58,57,61,54,68,51,49,64,50,48,65,52,56,46,54,49,51,47,55,55,
54,42,51,56,55,51,54,51,60,61,43,55,56,61,52,69,64,46,54,47,70)
> par(mfrow=c(1,2))
> hist(age)
> plot(ecdf(age),xlab="age",main="Age of Presidents at the Time of Inauguration",
sub="Empirical Cumulative Distribution Function")

[Figure: histogram of age alongside the empirical cumulative distribution function of the ages]

So the ages of presidents at the time of inauguration range from the early forties to the late sixties, with the frequency at the start of their tenure peaking in the early fifties. The histogram is generally symmetric about 55 years with spread from around 40 to 70 years.

The empirical cumulative distribution function Fn(x) gives, for each value x, the fraction of the data less than or equal to x. If the number of observations is n, then

Fn(x) = (1/n) #(observations less than or equal to x).

Thus, Fn(x) = 0 for any value of x less than all of the observed values and Fn(x) = 1 for any x greater than all of the observed values. In between, we will see jumps that are multiples of 1/n.
For example, in the empirical cumulative distribution function for the age of the presidents, we will see a jump of size 4/45 at x = 57 to indicate the fact that 4 of the 45 presidents were 57 at the time of their inauguration.

For an alternative method to create a graph of the empirical cumulative distribution function, first place the observations in order from smallest to largest. For the age of presidents data, we can accomplish this in R by writing sort(age). Next match these up with the integer multiples of 1 over the number of observations. In R, we enter 1:length(age)/length(age). Finally, the argument type="s" gives us the steps described above.

> plot(sort(age),1:length(age)/length(age),type="s",ylim=c(0,1),
main = c("Age of Presidents at the Time of Inauguration"),
sub=("Empirical Cumulative Distribution Function"),
xlab=c("age"),ylab=c("cumulative fraction"))

Exercise 1.14. Give the fraction of presidents whose age at inauguration was under 60. What is the range for the age at inauguration of the youngest fifth of the presidents?

Exercise 1.15. The histogram for data on the length of three bacterial strains is shown below. Lengths are given in microns. Below the histograms (but not necessarily directly below) are empirical cumulative distribution functions corresponding to these three histograms.

[Figure: histograms of wild1f, wild2f, and wild3f, with empirical cumulative distribution functions labeled wildaf, wildbf, and wildcf]

Match the histograms to their respective empirical cumulative distribution functions.
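The jump sizes of the empirical cumulative distribution function can also be checked numerically. A brief sketch using the presidential age data from Example 1.13; ecdf returns a function that can be evaluated at any value x:

```r
age <- c(57,61,57,57,58,57,61,54,68,51,49,64,50,48,65,52,56,46,54,49,51,47,55,55,
         54,42,51,56,55,51,54,51,60,61,43,55,56,61,52,69,64,46,54,47,70)
Fn <- ecdf(age)       # the empirical cumulative distribution function
n <- length(age)      # 45 observations
Fn(41.9)              # 0: no president was younger than 42 at inauguration
Fn(70)                # 1: every observed age is less than or equal to 70
Fn(57) - Fn(56)       # the jump at x = 57, a multiple of 1/45
```

The difference Fn(57) - Fn(56) recovers the jump of size 4/45, matching the four presidents who were 57 at inauguration.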
In looking at life span data, the natural question is "What fraction of the individuals have survived a given length of time?" The survival function Sn(x) gives, for each value x, the fraction of the data greater than x. If the number of observations is n, then

Sn(x) = (1/n) #(observations greater than x)
      = (1/n) (n − #(observations less than or equal to x))
      = 1 − (1/n) #(observations less than or equal to x)
      = 1 − Fn(x)

1.5 Scatterplots

We now consider two dimensional data. The values of the first variable x1, x2, . . . , xn are assumed known in an experiment and are often set by the experimenter. This variable is called the explanatory, predictor, descriptor, or input variable, and in a two dimensional scatterplot of the data, its values are displayed on the horizontal axis. The values y1, y2, . . . , yn, taken from observations with input x1, x2, . . . , xn, are called the response or target variable, and its values are displayed on the vertical axis. In describing a scatterplot, take into consideration
• the form, for example,
– linear
– curved relationships
– clusters
• the direction,
– a positive or negative association
• and the strength of the aspects of the scatterplot.

Example 1.16. Genetic evolution is based on mutation. Consequently, one fundamental question in evolutionary biology is the rate of de novo mutations. To investigate this question in humans, Kong et al. sequenced the entire genomes of 78 Icelandic trios and recorded the age of the parents and the number of de novo mutations in the offspring.

[Figure: scatterplot of the number of de novo mutations versus the average age of the parent]

The plot shows a moderate positive linear association; children of older parents have, on average, more mutations. The number of mutations ranges from ∼ 40 for children of younger parents to ∼ 100 for children of older parents.
We will later learn that the father is the major source of this difference with age.

Example 1.17 (Fossils of the Archeopteryx). The name Archeopteryx derives from the ancient Greek meaning "ancient feather" or "ancient wing". Archeopteryx is generally accepted by palaeontologists as being the oldest known bird. Archaeopteryx lived in the Late Jurassic Period around 150 million years ago, in what is now southern Germany, during a time when Europe was an archipelago of islands in a shallow warm tropical sea. The first complete specimen of Archaeopteryx was announced in 1861, only two years after Charles Darwin published On the Origin of Species, and thus became a key piece of evidence in the debate over evolution. Below are the lengths in centimeters of the femur and humerus for the five specimens of Archeopteryx that have preserved both bones.

femur     38  56  59  64  74
humerus   41  63  70  72  84

> femur<-c(38,56,59,64,74)
> humerus<-c(41,63,70,72,84)
> plot(femur, humerus,main=c("Bone Lengths for Archeopteryx"))

Unless we have a specific scientific question, we have no real reason for a choice of the explanatory variable.

[Figure: scatterplot of humerus length against femur length for the five Archeopteryx specimens]

Describe the scatterplot.

Example 1.18. These historical data show the 20 largest banks in 1974. Values are given in billions of dollars.

Bank     1     2     3     4     5     6     7     8     9     10
Assets   49.0  42.3  36.6  16.4  14.9  14.2  13.5  13.4  13.2  11.8
Income   218.8 265.6 170.9 85.9  88.1  63.6  96.9  60.9  144.2 53.6

Bank     11    12    13    14    15    16    17    18    19    20
Assets   11.6  9.5   9.4   7.5   7.2   6.7   6.0   4.6   3.8   3.4
Income   42.9  32.4  68.3  48.6  32.2  42.7  28.9  40.7  13.8  22.2

[Figure: scatterplot of income against assets for the 20 banks]

Describe the scatterplot.
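A strong linear association like the one in the Archeopteryx scatterplot can be summarized with a fitted line. The sketch below uses R's lm, a tool that has not yet been introduced in the text, so take it as a preview rather than part of the example:

```r
# The five Archeopteryx specimens, lengths in centimeters.
femur <- c(38, 56, 59, 64, 74)
humerus <- c(41, 63, 70, 72, 84)
fit <- lm(humerus ~ femur)   # least squares line for humerus length against femur length
coef(fit)                    # intercept and slope of the fitted line
cor(femur, humerus)          # correlation near 1 reflects the strong linear association
plot(femur, humerus, main = "Bone Lengths for Archeopteryx")
abline(fit)                  # add the fitted line to the scatterplot
```

The slope, about 1.2 centimeters of humerus per centimeter of femur, quantifies the positive linear association visible in the plot.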
In 1972, Michele Sindona, a banker with close ties to the Mafia, along with a purportedly bogus Freemasonic lodge and the Nixon administration, purchased a controlling interest in Bank 19, Long Island's Franklin National Bank. As a result of his acquisition of a controlling stake in Franklin, Sindona had a money laundering operation to aid his alleged ties to the Vatican Bank and the Sicilian drug cartel. Sindona used the bank's ability to transfer funds, produce letters of credit, and trade in foreign currencies to begin building a banking empire in the United States. In mid-1974, management revealed huge losses and depositors started taking out large withdrawals, causing the bank to have to borrow over $1 billion from the Federal Reserve Bank. On 8 October 1974, the bank was declared insolvent due to mismanagement and fraud, involving losses in foreign currency speculation and poor loan policies. What would you expect to be a feature on this scatterplot of a failing bank? Does the Franklin Bank have this feature?

1.6 Time Plots

Some data sets come with an order of events, say ordered by time.

Example 1.19. The modern history of petroleum began in the 19th century with the refining of kerosene from crude oil. The world's first commercial oil wells were drilled in the 1850s in Poland and in Romania. The first oil well in North America was in Oil Springs, Ontario, Canada in 1858. The US petroleum industry began with Edwin Drake's drilling of a 69-foot deep oil well in 1859 on Oil Creek near Titusville, Pennsylvania for the Seneca Oil Company. The industry grew through the 1800s, driven by the demand for kerosene and oil lamps. The introduction of the internal combustion engine in the early part of the 20th century provided a demand that has largely sustained the industry to this day. Today, about 90% of vehicular fuel needs are met by oil. Petroleum also makes up 40% of total energy consumption in the United States, but is responsible for only 2% of electricity generation.
Oil use increased exponentially until the world oil crises of the 1970s.

Worldwide Oil Production (million barrels)

Year  Barrels    Year  Barrels    Year  Barrels
1880       30    1940     2150    1972    18584
1890       77    1945     2595    1974    20389
1900      149    1950     3803    1976    20188
1905      215    1955     5626    1978    21922
1910      328    1960     7674    1980    21722
1915      432    1962     8882    1982    19411
1920      689    1964    10310    1984    19837
1925     1069    1966    12016    1986    20246
1930     1412    1968    14014    1988    21338
1935     1655    1970    16690

With the data given in two columns oil and year, the time plot plot(year,oil,type="b") is given on the left side of the figure below. This uses type="b", which puts both lines and circles on the plot.

Figure 1.5: Oil production (left) and the logarithm of oil production (right) from 1880 to 1988.

Sometimes a transformation of the data can reveal the structure of the time series. For example, if we wish to examine an exponential increase displayed in the oil production plot, then we can take the base 10 logarithm of the production and give its time series plot. This is shown in the plot on the right above. (In R, we write log(x) for the natural logarithm and log(x,10) for the base 10 logarithm.)

Exercise 1.20. What happened in the mid 1970s that resulted in the long term departure from exponential growth in the use of oil?

Example 1.21. The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body tasked with evaluating the risk of climate change caused by human activity.
The panel was established in 1988 by the World Meteorological Organization and the United Nations Environment Programme, two organizations of the United Nations. The IPCC does not perform original research but rather uses three working groups who synthesize research and prepare a report. In addition, the IPCC prepares a summary report. The Fourth Assessment Report (AR4) was completed in early 2007. The fifth was released in 2014. Below is the first graph from the 2007 Climate Change Synthesis Report: Summary for Policymakers.

The technique used to draw the curves on the graphs is called local regression. At the risk of discussing concepts that have not yet been introduced, let’s describe the technique behind local regression. Typically, at each point in the data set, the goal is to draw a linear or quadratic function. The function is determined using weighted least squares, giving most weight to nearby points and less weight to points further away. The graphs above show the approximating curves. The blue regions show areas within two standard deviations of the estimate (called a confidence interval). The goal of local regression is to provide a smooth approximation to the data and a sense of the uncertainty in the data. In practice, local regression requires a large data set to work well.

Example 1.22. The next figure gives a time series plot of a single molecule experiment showing the movement of kinesin along a microtubule. In this case the kinesin has at its foot a glass bead, and its heads are attached to a microtubule. The position of the glass bead is determined by using a laser beam and the optical properties of the bead to locate the bead and provide a force on the kinesin molecule. In this time plot, the load on the microtubule has a force of 3.5 pN and the concentration of ATP is 100 µM. What is the source of fluctuations in this time series plot of bead position?
How would you expect this time plot to change with changes in ATP concentration and with changes in force?

1.7 Answers to Selected Exercises

1.11. Here are the R commands:

> genotypes<-matrix(c(128,6,0,119,78,4),ncol=2)
> colnames(genotypes)<-c("Flores","Sumba")
> rownames(genotypes)<-c("AA","AE","EE")
> genotypes
   Flores Sumba
AA    128   119
AE      6    78
EE      0     4
> barplot(genotypes,legend=rownames(genotypes),args.legend=list(x="topleft"))

The legend was moved to the left side to avoid crowding with the taller bar for the data on Sumba.

1.12. The lengths of the normal strain have their center at 2.5 microns and range from 1.5 to 5 microns. The distribution is somewhat skewed right with no outliers. The mutant strain has its center at 5 or 6 microns. Its range is from 2 to 14 microns and it is slightly skewed right. It has no outliers.

1.14. Look at the graph to the point above the value 60 years. Look left from this point to note that it corresponds to a value of 0.80. Look at the graph to the point right from the value 0.20. Look down to note that it corresponds to 49 years.

1.15. Match histogram wild1f to wilddaf. Note that both show the range is from 2 to 5 microns and that about half of the data lies between 2 and 3 microns. Match histogram wild2f with wildcf. The data is relatively uniform from 3.5 to 6.5 microns. Finally, match histogram wild3f with wildbf. The range is from 2 to 8 microns with most of the data between 3 and 6 microns.

1.22. The fluctuations are due to the many bombardments with other molecules in the cell, most frequently water molecules. As force increases, we expect the velocity to increase - to a point. If the force is too large, then the kinesin is ripped away from the microtubule. As ATP concentration increases, we expect the velocity to increase - again, to a point. If ATP concentration is sufficiently large, then the biochemical processes are saturated.
Topic 2

Describing Distributions with Numbers

There are three kinds of lies: lies, damned lies, and statistics. - Benjamin Disraeli

It is easy to lie with statistics. It is hard to tell the truth without it. - Andrejs Dunkels

We next look at quantitative data. Recall that these data can be subject to the operations of arithmetic. In particular, we can add or subtract observation values, and we can sort and rank them from lowest to highest. We will look at two fundamental properties of these observations. The first is a measure of the center value for the data, i.e., the median or the mean. Associated with this measure, we add a second value that describes how these observations are spread or dispersed about this given measure of center.

The median is the central observation of the data after it is sorted from the lowest to highest observations. In addition, to give a sense of the spread in the data, we often give the smallest and largest observations as well as the observed values that are 1/4 and 3/4 of the way up this list, known as the first and third quartiles. The difference between the third and first quartiles, known as the interquartile range, is a measure of the spread or the dispersion of the data. For the mean, we commonly use the standard deviation to describe the spread of the data. These concepts are described in more detail in this section.

2.1 Measuring Center

2.1.1 Medians

The median takes the middle value of x1, x2, . . . , xn after the data have been sorted from smallest to largest, x(1), x(2), . . . , x(n). (x(k) is called the k-th order statistic. Sorting can be accomplished in R by using the sort command.) If n is odd, then this is just the value of the middle observation x((n+1)/2). If n is even, then the two values closest to the center are averaged:

(1/2)(x(n/2) + x(n/2+1)).

If we store the data in R in a vector x, we can write median(x) to compute the median.
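A brief R illustration of the two cases, using small made-up data sets:

```r
# n odd: the median is the middle order statistic x((n+1)/2)
x <- c(5, 1, 3)               # sorted: 1 3 5, so the median is x(2) = 3
median(x)
sort(x)[(length(x) + 1) / 2]  # the same value, computed from the definition

# n even: average the two central order statistics
y <- c(1, 2, 4, 8)            # n = 4, so average x(2) and x(3)
median(y)                     # (2 + 4)/2 = 3
n <- length(y)
(sort(y)[n / 2] + sort(y)[n / 2 + 1]) / 2
```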
2.1.2 Means

For a collection of numeric data, x1, x2, . . . , xn, the sample mean is the numerical average

x̄ = (1/n)(x1 + x2 + · · · + xn) = (1/n) ∑_{i=1}^n xi.

Alternatively, if the value x occurs n(x) times in the data, then use the distributive property to see that

x̄ = (1/n) ∑_x x n(x) = ∑_x x p(x),   where p(x) = n(x)/n.

So the mean x̄ depends only on the proportion of observations p(x) for each value of x.

Example 2.1. For the data set {1, 2, 2, 2, 3, 3, 4, 4, 4, 5}, we have n = 10 and the sum

1 + 2 + 2 + 2 + 3 + 3 + 4 + 4 + 4 + 5 = 1n(1) + 2n(2) + 3n(3) + 4n(4) + 5n(5)
                                       = 1(1) + 2(3) + 3(2) + 4(3) + 5(1) = 30.

Thus, x̄ = 30/10 = 3.

Example 2.2. For the data on the length in microns of wild type Bacillus subtilis, we have

length x   frequency n(x)   proportion p(x)   product x p(x)
  1.5            18              0.090             0.135
  2.0            71              0.355             0.710
  2.5            48              0.240             0.600
  3.0            37              0.185             0.555
  3.5            16              0.080             0.280
  4.0             6              0.030             0.120
  4.5             4              0.020             0.090
  sum           200              1                 2.490

So the sample mean x̄ = 2.49.

If we store the data in R in a vector x, we can write mean(x), which is equal to sum(x)/length(x), to compute the mean.

To extend this idea a bit, we can take a real-valued function h and instead consider the observations h(x1), h(x2), . . . , h(xn). Then

h̄(x) = (1/n)(h(x1) + h(x2) + · · · + h(xn)) = (1/n) ∑_{i=1}^n h(xi) = (1/n) ∑_x h(x) n(x) = ∑_x h(x) p(x).

Exercise 2.3. Let x̄n be the sample mean for the quantitative data x1, x2, . . . , xn. For an additional observation xn+1, use x̄n to give a formula for x̄n+1, the mean of n + 1 observations. Generalize this formula for the case of k additional observations xn+1, . . . , xn+k.

Many times, we do not want to give the same weight to each observation. For example, in computing a student’s grade point average, we begin by setting values xi corresponding to grades (A ↦ 4, B ↦ 3, and so on) and giving weights w1, w2, . . . , wn equal to the number of units in a course.
We then compute the grade point average as a weighted mean. To do this:

• Multiply the value of each course by its weight, xiwi. This is called the number of quality points for the course.

• Add up the quality points:

x1w1 + x2w2 + · · · + xnwn = ∑_{i=1}^n xiwi.

• Add up the weights, i.e., the number of units attempted:

w1 + w2 + · · · + wn = ∑_{i=1}^n wi.

• Divide the total quality points by the number of units attempted:

(x1w1 + x2w2 + · · · + xnwn)/(w1 + w2 + · · · + wn) = (∑_{i=1}^n xiwi)/(∑_{i=1}^n wi).   (2.1)

Figure 2.1: Empirical Survival Function for the Bacterial Data. This figure displays how the area under the survival function to the right of the y-axis and above the x-axis is the mean value x̄ for non-negative data. For x = 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 4.5, this area is the sum of the areas of the rectangles displayed. The width of each of the rectangles is x and the height is equal to p(x). Thus, the area is the product xp(x). The sum of these areas is presented in Example 2.2 to compute the sample mean.

If we let pi = wi/∑_{j=1}^n wj be the proportion or fraction of the weight given to the i-th observation, then we can rewrite (2.1) as

∑_{i=1}^n xipi.

If we store the weights in a vector w, then we can compute the weighted mean using weighted.mean(x,w).

If an extremely high observation is changed to be even higher, then the mean follows this change while the median does not. For this reason, the mean is said to be sensitive to outliers while the median is not. To reduce the impact of extreme outliers on the mean as a measure of center, we can also consider a truncated mean or trimmed mean. The p trimmed mean is obtained by discarding both the lower and the upper p×100% of the data and taking the arithmetic mean of the remaining data.
In R, we write mean(x, trim = p), where p, a number between 0 and 0.5, is the fraction of observations to be trimmed from each end before the mean is computed. Note that the median can be regarded as the 50% trimmed mean.

The median does not change with changes in the extreme observations. A measure with this property is called resistant. On the other hand, the mean is not a resistant measure.

Exercise 2.4. Give the relationship between the median and the mean for a (a) left skewed, (b) symmetric, or (c) right skewed distribution.

2.2 Measuring Spread

2.2.1 Five Number Summary

The first and third quartiles, Q1 and Q3, are, respectively, the medians of the lower half and the upper half of the data. The five number summary of the data consists of the values of the minimum, Q1, the median, Q3, and the maximum. These values, along with the mean, are given in R using summary(x). Returning to the data set on the age of presidents:

> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  42.00   51.00   55.00   54.98   58.00   70.00

We can display the five number summary using a boxplot.

> boxplot(age, main = c("Age of Presidents at the Time of Inauguration"))

The value Q3 − Q1 is called the interquartile range and is denoted by IQR. It is found in R with the command IQR. Outliers are somewhat arbitrarily chosen to be those above Q3 + (3/2)IQR and below Q1 − (3/2)IQR. With this criterion, the ages of Ronald Reagan and Donald Trump, considered outliers, are displayed by the two circles at the top of the boxplot. The boxplot command has the default value range = 1.5 in the choice of displaying outliers. This can be altered to loosen or tighten this criterion.

Exercise 2.5.
Use the range argument of the boxplot command to create a boxplot for the age of the presidents at the time of their inauguration, using any value above Q3 + IQR or below Q1 − IQR as the criterion for outliers. How many outliers does this boxplot have?

Example 2.6. Consider a two column data set. Column 1, MPG, gives car gas mileage. Column 2, Origin, gives the country of origin for the car. We can create side-by-side boxplots, one for each country of origin, with the command

> boxplot(MPG ~ Origin)

2.2.2 Sample Variance and Standard Deviation

The sample variance averages the square of the differences from the mean:

var(x) = sx² = (1/(n − 1)) ∑_{i=1}^n (xi − x̄)².

The sample standard deviation, sx, is the square root of the sample variance. We shall soon learn the rationale for the decision to divide by n − 1. However, we shall also encounter circumstances in which division by n is preferable. We will routinely drop the subscript x and write s to denote the standard deviation if there is no ambiguity.

Example 2.7. For the data set on Bacillus subtilis, we have x̄ = 498/200 = 2.49.

length x   frequency n(x)   x − x̄    (x − x̄)²   (x − x̄)² n(x)
  1.5           18           −0.99     0.9801      17.6418
  2.0           71           −0.49     0.2401      17.0471
  2.5           48            0.01     0.0001       0.0048
  3.0           37            0.51     0.2601       9.6237
  3.5           16            1.01     1.0201      16.3216
  4.0            6            1.51     2.2801      13.6806
  4.5            4            2.01     4.0401      16.1604
  sum          200                                 90.4800

So the sample variance sx² = 90.48/199 = 0.4546734 and the standard deviation sx = 0.6742947.
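The arithmetic in the table can be reproduced directly in R; a sketch working from the lengths and their frequencies:

```r
# lengths and frequencies from the table above
x <- c(1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5)
nx <- c(18, 71, 48, 37, 16, 6, 4)

n <- sum(nx)                  # 200 observations in all
xbar <- sum(x * nx) / n       # the sample mean, 2.49
ss <- sum((x - xbar)^2 * nx)  # the column sum (x - xbar)^2 n(x), 90.48
s2 <- ss / (n - 1)            # the sample variance
s <- sqrt(s2)                 # the sample standard deviation
c(variance = s2, sd = s)
```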
To accomplish this in R:

> bacteria<-c(rep(1.5,18),rep(2.0,71),rep(2.5,48),rep(3,37),rep(3.5,16),rep(4,6),
+ rep(4.5,4))
> length(bacteria)
[1] 200
> mean(bacteria)
[1] 2.49
> var(bacteria)
[1] 0.4546734
> sd(bacteria)
[1] 0.6742947

For quantitative variables that take on positive values, we can take the ratio of the standard deviation to the mean,

cvx = sx/x̄,

called the coefficient of variation, as a measure of the relative variability of the observations. Note that cvx is a pure number and has no units. For the data of bacteria lengths, the coefficient of variation is cvx = 0.6742947/2.49 = 0.2708.
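A sketch of this computation in R (re-entering the bacteria data from the session above):

```r
# the bacteria length data, as above
bacteria <- c(rep(1.5,18), rep(2.0,71), rep(2.5,48), rep(3,37),
              rep(3.5,16), rep(4,6), rep(4.5,4))

# coefficient of variation: standard deviation relative to the mean,
# a unitless measure of relative variability
cv <- sd(bacteria) / mean(bacteria)
cv
```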