Prévia do material em texto
John's z-score of –0.21 is higher than Ali's z-score of –0.3. For GPA, higher values are better, so we conclude that John has the better GPA when compared to his school. TRY IT 2.32 Two swimmers, Angie and Beth, from different teams, wanted to find out who had the fastest time for the 50 meter freestyle when compared to her team. Which swimmer had the fastest time when compared to her team? Swimmer Time (seconds) Team mean time Team standard deviation Angie 26.2 27.2 0.8 Beth 27.3 30.1 1.4 Table 2.36 The following lists give a few facts that provide a little more insight into what the standard deviation tells us about the distribution of the data. For ANY data set, no matter what the distribution of the data is: • At least 75% of the data is within two standard deviations of the mean. • At least 89% of the data is within three standard deviations of the mean. • At least 95% of the data is within 4.5 standard deviations of the mean. • This is known as Chebyshev's Rule. For data having a Normal Distribution, which we will examine in great detail later: • Approximately 68% of the data is within one standard deviation of the mean. • Approximately 95% of the data is within two standard deviations of the mean. • More than 99% of the data is within three standard deviations of the mean. • This is known as the Empirical Rule. • It is important to note that this rule only applies when the shape of the distribution of the data is bell-shaped and symmetric. We will learn more about this when studying the "Normal" or "Gaussian" probability distribution in later chapters. Coefficient of Variation Another useful way to compare distributions besides simple comparisons of means or standard deviations is to adjust for differences in the scale of the data being measured. Quite simply, a large variation in data with a large mean is different than the same variation in data with a small mean. To adjust for the scale of the underlying data the Coefficient of Variation (CV) has been developed. Mathematically: We can see that this measures the variability of the underlying data as a percentage of the mean value; the center weight of the data set. This measure is useful in comparing risk where an adjustment is warranted because of differences in scale of two data sets. In effect, the scale is changed to common scale, percentage differences, and allows direct comparison of the two or more magnitudes of variation of different data sets. 2.7 • Measures of the Spread of the Data 89 Key Terms Frequency the number of times a value of the data occurs Frequency Table a data representation in which grouped data is displayed along with the corresponding frequencies Histogram a graphical representation in x-y form of the distribution of data in a data set; x represents the data and y represents the frequency, or relative frequency. The graph consists of contiguous rectangles. Interquartile Range or IQR, is the range of the middle 50 percent of the data values; the IQR is found by subtracting the first quartile from the third quartile. Mean (arithmetic) a number that measures the central tendency of the data; a common name for mean is 'average.' The term 'mean' is a shortened form of 'arithmetic mean.' By definition, the mean for a sample (denoted by ) is , and the mean for a population (denoted by μ) is . Mean (geometric) a measure of central tendency that provides a measure of average geometric growth over multiple time periods. Median a number that separates ordered data into halves; half the values are the same number or smaller than the median and half the values are the same number or larger than the median. The median may or may not be part of the data. Midpoint the mean of an interval in a frequency table Mode the value that appears most frequently in a set of data Outlier an observation that does not fit the rest of the data Percentile a number that divides ordered data into hundredths; percentiles may or may not be part of the data. The median of the data is the second quartile and the 50th percentile. The first and third quartiles are the 25th and the 75th percentiles, respectively. Quartiles the numbers that separate the data into quarters; quartiles may or may not be part of the data. The second quartile is the median of the data. Relative Frequency the ratio of the number of times a value of the data occurs in the set of all outcomes to the number of all outcomes Standard Deviation a number that is equal to the square root of the variance and measures how far data values are from their mean; notation: s for sample standard deviation and σ for population standard deviation. Variance mean of the squared deviations from the mean, or the square of the standard deviation; for a set of data, a deviation can be represented as x – where x is a value of the data and is the sample mean. The sample variance is equal to the sum of the squares of the deviations divided by the difference of the sample size and one. Chapter Review 2.1 Display Data A stem-and-leaf plot is a way to plot data and look at the distribution. In a stem-and-leaf plot, all data values within a class are visible. The advantage in a stem-and-leaf plot is that all values are listed, unlike a histogram, which gives classes of data values. A line graph is often used to represent a set of data values in which a quantity varies with time. These graphs are useful for finding trends. That is, finding a general pattern in data sets including temperature, sales, employment, company profit or cost over a period of time. A bar graph is a chart that uses either horizontal or vertical bars to show comparisons among categories. One axis of the chart shows the specific categories being compared, and the other axis represents a discrete value. Some bar graphs present bars clustered in groups of more than one (grouped bar graphs), and others show the bars divided into subparts to show cumulative effect (stacked bar graphs). Bar graphs are especially useful when categorical data is being used. A histogram is a graphic version of a frequency distribution. The graph consists of bars of equal width drawn adjacent to each other. The horizontal scale represents classes of quantitative data values and the vertical scale represents frequencies. The heights of the bars correspond to frequency values. Histograms are typically used for large, continuous, quantitative data sets. A frequency polygon can also be used when graphing large data sets with data points that repeat. The data usually goes on y-axis with the frequency being graphed on the x-axis. Time series graphs can be helpful when looking at large amounts of data for one variable over a period of time. 2.2 Measures of the Location of the Data The values that divide a rank-ordered set of data into 100 equal parts are called percentiles. Percentiles are used to compare and interpret data. For example, an observation at the 50th percentile would be greater than 50 percent of the other observations in the set. Quartiles divide data into quarters. The first quartile (Q1) is the 25th percentile,the second quartile (Q2 or median) is 50th percentile, and the third quartile (Q3) is the 75th percentile. The interquartile range, or IQR, is the range of the middle 50 percent of the data values. The IQR is found by subtracting Q1 from Q3, and can help determine outliers by using the following two expressions. 90 2 • Key Terms Access for free at openstax.org • Q3 + IQR(1.5) • Q1 – IQR(1.5) 2.3 Measures of the Center of the Data The mean and the median can be calculated to help you find the "center" of a data set. The mean is the best estimate for the actual data set, but the median is the best measurement when a data set contains several outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in your data set. The mean, median, and mode are extremely helpful when you need to analyze your data, but if your data set consists of ranges which lack specific values, themean may seem impossible to calculate. However, the mean can be approximated if you add the lower boundary with the upper boundary and divide by two to find the midpoint of each interval. Multiply each midpoint by the number of values found in the corresponding range. Divide the sum of these values by the total number of data values in the set. 2.6 Skewness and the Mean, Median, and Mode Looking at the distribution of data can reveal a lot about the relationship between the mean, the median, and the mode. There are three types of distributions. A left (or negative) skewed distribution has a shape like Figure 2.12. A right (or positive) skewed distribution has a shape like Figure 2.13. A symmetrical distribution looks like Figure 2.11. 2.7 Measures of the Spread of the Data The standard deviation can help you calculate the spread of data. There are different equations to use if are calculating the standard deviation of a sample or of a population. • The Standard Deviation allows us to compare individual data or classes to the data set mean numerically. • s = or s = is the formula for calculating the standard deviation of a sample. To calculate the standard deviation of a population, we would use the population mean, μ, and the formula σ = or σ = . Formula Review 2.2 Measures of the Location of the Data where i = the ranking or position of a data value, k = the kth percentile, n = total number of data. Expression for finding the percentile of a data value: (100) where x = the number of values counting from the bottom of the data list up to but not including the data value for which you want to find the percentile, y = the number of data values equal to the data value for which you want to find the percentile, n = total number of data 2.3 Measures of the Center of the Data Where f = interval frequencies and m = interval midpoints. The arithmetic mean for a sample (denoted by ) is The arithmetic mean for a population (denoted by μ) is 2.5 Geometric Mean The Geometric Mean: 2.6 Skewness and the Mean, Median, and Mode Formula for skewness: Formula for Coefficient of Variation: 2.7 Measures of the Spread of the Data where Formulas for Sample Standard Deviation or or For the sample standard deviation, the denominator is n - 1, that is the sample size - 1. 2 • Formula Review 91 Formulas for Population Standard Deviation or or For the population standard deviation, the denominator is N, the number of items in the population. Practice 2.1 Display Data For the next three exercises, use the data to construct a line graph. 1. In a survey, 40 people were asked how many times they visited a store before making a major purchase. The results are shown in Table 2.37. Number of times in store Frequency 1 4 2 10 3 16 4 6 5 4 Table 2.37 2. In a survey, several people were asked how many years it has been since they purchased a mattress. The results are shown in Table 2.38. Years since last purchase Frequency 0 2 1 8 2 13 3 22 4 16 5 9 Table 2.38 3. Several children were asked how many TV shows they watch each day. The results of the survey are shown in Table 2.39. 92 2 • Practice Access for free at openstax.org Number of TV shows Frequency 0 12 1 18 2 36 3 7 4 2 Table 2.39 4. The students in Ms. Ramirez’s math class have birthdays in each of the four seasons. Table 2.40 shows the four seasons, the number of students who have birthdays in each season, and the percentage (%) of students in each group. Construct a bar graph showing the number of students. Seasons Number of students Proportion of population Spring 8 24% Summer 9 26% Autumn 11 32% Winter 6 18% Table 2.40 5. Using the data from Mrs. Ramirez’s math class supplied in Exercise 2.4, construct a bar graph showing the percentages. 6. David County has six high schools. Each school sent students to participate in a county-wide science competition. Table 2.41 shows the percentage breakdown of competitors from each school, and the percentage of the entire student population of the county that goes to each school. Construct a bar graph that shows the population percentage of competitors from each school. High school Science competition population Overall student population Alabaster 28.9% 8.6% Concordia 7.6% 23.2% Genoa 12.1% 15.0% Mocksville 18.5% 14.3% Tynneson 24.2% 10.1% West End 8.7% 28.8% Table 2.41 2 • Practice 93 Chapter 2 Descriptive Statistics Key Terms Chapter Review Formula Review Practice