
the mean squared error. To solve this problem, we denote $E(Y)$ by $\mu$ and observe (see Theorem A of Section 4.2.1) that

$$E[(Y - c)^2] = \operatorname{Var}(Y - c) + [E(Y - c)]^2 = \operatorname{Var}(Y) + (\mu - c)^2$$

The first term of the last expression does not depend on $c$, and the second term is minimized by $c = \mu$, which is therefore the optimal choice of $c$.

Now let us consider predicting $Y$ by some function $h(X)$ in order to minimize $\mathrm{MSE} = E\{[Y - h(X)]^2\}$. From Theorem A of Section 4.4.1, the right-hand side can be expressed as

$$E\{[Y - h(X)]^2\} = E\bigl(E\{[Y - h(X)]^2 \mid X\}\bigr)$$

The outer expectation is with respect to $X$. For every $x$, the inner expectation is minimized by setting $h(x)$ equal to the constant $E(Y \mid X = x)$, by the result of the preceding paragraph. The minimizing function $h(X)$ is thus

$$h(X) = E(Y \mid X)$$

EXAMPLE A
For the bivariate normal distribution, we found that

$$E(Y \mid X) = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}(X - \mu_X)$$

This linear function of $X$ is thus the minimum mean squared error predictor of $Y$ from $X$. ■

A practical limitation of the optimal prediction scheme is that its implementation depends on knowing the joint distribution of $X$ and $Y$ in order to find $E(Y \mid X)$, and often this information is not available, even approximately. For this reason, we may settle for the more modest goal of finding the optimal linear predictor of $Y$. (In Example A, the best predictor turned out to be linear, but this is not generally the case.) That is, rather than searching for the best function $h$ among all functions, we search for the best function of the form $h(x) = \alpha + \beta x$. This requires optimizing over only the two parameters $\alpha$ and $\beta$. Now

$$E[(Y - \alpha - \beta X)^2] = \operatorname{Var}(Y - \alpha - \beta X) + [E(Y - \alpha - \beta X)]^2 = \operatorname{Var}(Y - \beta X) + [E(Y - \alpha - \beta X)]^2$$

The first term of the last expression does not depend on $\alpha$, so $\alpha$ can be chosen so as to minimize the second term.
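Before the minimization over $\alpha$ and $\beta$ is carried out, the first result can be checked by simulation: for a standardized bivariate normal pair, $E(Y \mid X) = \rho X$, and its mean squared error, $1 - \rho^2$, should be smaller than that of any competing predictor. A minimal sketch in Python; the choice $\rho = 0.8$, the sample size, and the competing predictors are illustrative, not from the text:

```python
import math
import random

random.seed(0)
rho = 0.8
n = 100_000

# Generate standard bivariate normal pairs via Y = rho*X + sqrt(1 - rho^2)*Z,
# so that E(Y | X) = rho*X and Var(Y | X) = 1 - rho^2.
pairs = []
for _ in range(n):
    x = random.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho**2) * random.gauss(0, 1)
    pairs.append((x, y))

def mse(predict):
    """Estimate E[(Y - predict(X))^2] from the simulated pairs."""
    return sum((y - predict(x)) ** 2 for x, y in pairs) / n

mse_cond = mse(lambda x: rho * x)  # the optimal predictor E(Y | X)
mse_mean = mse(lambda x: 0.0)      # the constant predictor c = mu_Y = 0
mse_naive = mse(lambda x: x)       # a suboptimal linear predictor

# mse_cond should be near 1 - rho^2 = 0.36 and smaller than the others.
print(mse_cond, mse_mean, mse_naive)
```

The constant predictor's error estimates $\operatorname{Var}(Y) = 1$, and the naive predictor $h(x) = x$ does worse than $\rho x$ because it overshoots by the factor $1/\rho$.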
To do this, note that $E(Y - \alpha - \beta X) = \mu_Y - \alpha - \beta\mu_X$, and that the right-hand side is zero, and hence its square is minimized, if

$$\alpha = \mu_Y - \beta\mu_X$$

As for the first term,

$$\operatorname{Var}(Y - \beta X) = \sigma_Y^2 + \beta^2\sigma_X^2 - 2\beta\sigma_{XY}$$

where $\sigma_{XY} = \operatorname{Cov}(X, Y)$. This is a quadratic function of $\beta$, and the minimum is found by setting the derivative with respect to $\beta$ equal to zero, which yields

$$\beta = \frac{\sigma_{XY}}{\sigma_X^2} = \rho\,\frac{\sigma_Y}{\sigma_X}$$

where $\rho$ is the correlation coefficient. Substituting in these values of $\alpha$ and $\beta$, we find that the minimum mean squared error predictor, which we denote by $\hat{Y}$, is

$$\hat{Y} = \alpha + \beta X = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)$$

The mean squared prediction error is then

$$\operatorname{Var}(Y - \beta X) = \sigma_Y^2 + \frac{\sigma_{XY}^2}{\sigma_X^4}\,\sigma_X^2 - 2\,\frac{\sigma_{XY}}{\sigma_X^2}\,\sigma_{XY} = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} = \sigma_Y^2 - \rho^2\sigma_Y^2 = \sigma_Y^2(1 - \rho^2)$$

Note first that the optimal linear predictor depends on the joint distribution of $X$ and $Y$ only through their means, variances, and covariance. Thus, in practice, it is generally easier to construct the optimal linear predictor, or an approximation to it, than to construct the general optimal predictor $E(Y \mid X)$. Second, note that the form of the optimal linear predictor is the same as that of $E(Y \mid X)$ for the bivariate normal distribution. Third, note that the mean squared prediction error depends only on $\sigma_Y$ and $\rho$, and that it is small if $\rho$ is close to $+1$ or $-1$. Here we see again, from a different point of view, that the correlation coefficient is a measure of the strength of the linear relationship between $X$ and $Y$.

EXAMPLE B
Suppose that two examinations are given in a course. As a probability model, we regard the scores of a student on the first and second examinations as jointly distributed random variables $X$ and $Y$. Suppose for simplicity that the exams are scaled to have the same means $\mu = \mu_X = \mu_Y$ and standard deviations $\sigma = \sigma_X = \sigma_Y$.
Then the correlation between $X$ and $Y$ is $\rho = \sigma_{XY}/\sigma^2$, and the best linear predictor is $\hat{Y} = \mu + \rho(X - \mu)$, so

$$\hat{Y} - \mu = \rho(X - \mu)$$

Notice that by this equation we predict the student's score on the second examination to differ from the overall mean $\mu$ by less than the score on the first examination did. If the correlation $\rho$ is positive, this is encouraging for a student who scores below the mean on the first exam, since our best prediction is that his score on the next exam will be closer to the mean. On the other hand, it is bad news for the student who scored above the mean on the first exam, since our best prediction is that she will score closer to the mean on the next exam. This phenomenon is often referred to as regression to the mean. ■

4.5 The Moment-Generating Function

This section develops and applies some of the properties of the moment-generating function. It turns out, despite its unlikely appearance, to be a very useful tool that can dramatically simplify certain calculations.

The moment-generating function (mgf) of a random variable $X$ is

$$M(t) = E(e^{tX})$$

if the expectation is defined. In the discrete case,

$$M(t) = \sum_x e^{tx} p(x)$$

and in the continuous case,

$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

The expectation, and hence the moment-generating function, may or may not exist for any particular value of $t$. In the continuous case, the existence of the expectation depends on how rapidly the tails of the density decrease; for example, because the tails of the Cauchy density die down at the rate $x^{-2}$, the expectation does not exist for any $t$ and the moment-generating function is undefined. The tails of the normal density die down at the rate $e^{-x^2}$, so the integral converges for all $t$.

PROPERTY A
If the moment-generating function exists for $t$ in an open interval containing zero, it uniquely determines the probability distribution. ■

We cannot prove this important property here; its proof depends on properties of the Laplace transform.
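The discrete-case formula $M(t) = \sum_x e^{tx} p(x)$ can be evaluated directly for any small distribution. A minimal sketch in Python for a Bernoulli($p$) variable, whose mgf works out to $1 - p + pe^t$; the particular values of $p$ and $t$ are arbitrary choices for illustration:

```python
import math

def mgf(pmf, t):
    """Evaluate M(t) = sum over x of e^(t*x) * p(x) for a finite discrete pmf."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

# Bernoulli(p): P(X = 1) = p, P(X = 0) = 1 - p.
p = 0.3
bernoulli = {0: 1 - p, 1: p}

# Compare the summed value with the closed form 1 - p + p*e^t.
for t in (-1.0, 0.0, 0.5, 2.0):
    closed_form = 1 - p + p * math.exp(t)
    print(t, mgf(bernoulli, t), closed_form)
```

Note that $M(0) = 1$ for every distribution, since the probabilities sum to one; this is a useful sanity check on any computed mgf.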
Note that Property A says that if two random variables have the same mgf in an open interval containing zero, they have the same distribution. For some problems, we can find the mgf and then deduce the unique probability distribution corresponding to it.

The $r$th moment of a random variable is $E(X^r)$ if the expectation exists. We have already encountered the first and second moments earlier in this chapter, that is, $E(X)$ and $E(X^2)$. Central moments, rather than ordinary moments, are often used: the $r$th central moment is $E\{[X - E(X)]^r\}$. The variance is the second central moment and is a measure of dispersion about the mean. The third central moment, called the skewness, is used as a measure of the asymmetry of a density or a frequency function about its mean; if a density is symmetric about its mean, the skewness is zero (see Problem 78 at the end of this chapter).

As its name implies, the moment-generating function has something to do with moments. To see this, consider the continuous case:

$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

The derivative of $M(t)$ is

$$M'(t) = \frac{d}{dt}\int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$

It can be shown that differentiation and integration can be interchanged, so that

$$M'(t) = \int_{-\infty}^{\infty} x e^{tx} f(x)\,dx$$

and

$$M'(0) = \int_{-\infty}^{\infty} x f(x)\,dx = E(X)$$

Differentiating $r$ times, we find

$$M^{(r)}(0) = E(X^r)$$

It can further be argued that if the moment-generating function exists in an interval containing zero, then so do all the moments. We thus have the following property.

PROPERTY B
If the moment-generating function exists in an open interval containing zero, then $M^{(r)}(0) = E(X^r)$. ■

To find the moments of a random variable from the definition of expectation, we must sum a series or carry out an integration. The utility of Property B is that, if the mgf can be found, the process of integration or summation, which may be difficult, can be replaced by the process of differentiation, which is mechanical. We now illustrate these concepts using some familiar distributions.
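Property B can also be checked numerically: approximate $M'(0)$ and $M''(0)$ by finite differences and compare them with the known moments. A minimal sketch using the exponential distribution with rate $\lambda$, whose mgf is $M(t) = \lambda/(\lambda - t)$ for $t < \lambda$, so that $E(X) = 1/\lambda$ and $E(X^2) = 2/\lambda^2$; the rate $\lambda = 2$ and the step size are arbitrary choices:

```python
lam = 2.0

def M(t):
    """Mgf of an exponential(lam) random variable, valid for t < lam."""
    assert t < lam
    return lam / (lam - t)

h = 1e-4  # step size for the finite-difference approximations

# Central differences approximate the first and second derivatives at t = 0.
first_moment = (M(h) - M(-h)) / (2 * h)           # ~ E(X)   = 1/lam   = 0.5
second_moment = (M(h) - 2 * M(0) + M(-h)) / h**2  # ~ E(X^2) = 2/lam^2 = 0.5

print(first_moment, second_moment)
```

The same two-line check works for any mgf given in closed form, which makes it a convenient way to catch algebra mistakes when differentiating by hand.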
EXAMPLE A  Poisson Distribution
By definition,

$$M(t) = \sum_{k=0}^{\infty} e^{tk}\,\frac{\lambda^k}{k!}\,e^{-\lambda} = \sum