Rice J A Mathematical statistics and data analysis

the mean squared error. To solve this problem, we denote E(Y ) by μ and
4.4 Conditional Expectation and Prediction 153
observe that (see Theorem A of Section 4.2.1)
$$
\begin{aligned}
E[(Y - c)^2] &= \mathrm{Var}(Y - c) + [E(Y - c)]^2 \\
&= \mathrm{Var}(Y) + (\mu - c)^2
\end{aligned}
$$
The first term of the last expression does not depend on c, and the second term is
minimized for c = μ, which is the optimal choice of c.
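The decomposition can be checked numerically. The following is a minimal sketch, assuming a simulated sample stands in for the distribution of Y; the sample size, the normal model, and the helper name `mse_about` are illustrative choices, not part of the text.

```python
import random

def mse_about(values, c):
    """Average squared error when every value is predicted by the constant c."""
    return sum((v - c) ** 2 for v in values) / len(values)

random.seed(0)  # fixed seed so the sketch is reproducible
y = [random.gauss(10.0, 2.0) for _ in range(1000)]  # illustrative sample

mu = sum(y) / len(y)      # sample mean, standing in for E(Y)
var = mse_about(y, mu)    # variance of the empirical distribution (divide by n)

# Check E[(Y - c)^2] = Var(Y) + (mu - c)^2 at several candidate constants c.
for c in (mu - 3.0, mu - 1.0, mu, mu + 0.5, mu + 2.0):
    lhs = mse_about(y, c)
    rhs = var + (mu - c) ** 2
    print(f"c={c:7.3f}  MSE={lhs:8.4f}  Var+(mu-c)^2={rhs:8.4f}")
```

The two printed columns agree up to rounding, and the smallest value appears in the row where c equals the mean.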
Now let us consider predicting Y by some function h(X) in order to minimize
MSE = $E\{[Y - h(X)]^2\}$. From Theorem A of Section 4.4.1, the right-hand side can
be expressed as
$$E\{[Y - h(X)]^2\} = E\left(E\{[Y - h(X)]^2 \mid X\}\right)$$
The outer expectation is with respect to X. For every x, the inner expectation is
minimized by setting h(x) equal to the constant $E(Y \mid X = x)$, from the result of the
preceding paragraph. We thus have that the minimizing function h(X) is
$$h(X) = E(Y \mid X)$$
EXAMPLE A For the bivariate normal distribution, we found that
$$E(Y \mid X) = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(X - \mu_X)$$
This linear function of X is thus the minimum mean squared error predictor of Y
from X. ■
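The claim that h(X) = E(Y | X) minimizes the mean squared error can be illustrated on a small discrete distribution, where every expectation is a finite sum. This is a sketch under stated assumptions: the joint probabilities below are made up, it uses a discrete distribution rather than the bivariate normal of Example A, and the helper names are hypothetical.

```python
# Hypothetical joint pmf p(x, y) on a small grid; the numbers are made up.
joint = {
    (0, 0): 0.20, (0, 1): 0.10, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.20, (1, 2): 0.10,
    (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.20,
}

def conditional_mean(joint):
    """Return {x: E(Y | X = x)} for a joint pmf given as {(x, y): prob}."""
    p_x, y_sum = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p          # marginal probability of X = x
        y_sum[x] = y_sum.get(x, 0.0) + y * p  # sum of y * p(x, y) over y
    return {x: y_sum[x] / p_x[x] for x in p_x}

def mse(joint, h):
    """E{[Y - h(X)]^2} for a predictor h given as a dict {x: prediction}."""
    return sum(p * (y - h[x]) ** 2 for (x, y), p in joint.items())

h_opt = conditional_mean(joint)
best = mse(joint, h_opt)

# Shift the prediction at a single x: the mean squared error can only go up.
h_alt = dict(h_opt)
h_alt[1] += 0.3
print(h_opt)
print(best, mse(joint, h_alt))
```

Shifting the prediction at x = 1 by 0.3 raises the mean squared error by P(X = 1) times 0.3 squared, which is the conditional version of the decomposition used for the constant predictor.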
A practical limitation of the optimal prediction scheme is that its implementation
depends on knowing the joint distribution of Y and X in order to find E(Y |X), and
often this information is not available, not even approximately. For this reason, we
can try to attain the more modest goal of finding the optimal linear predictor of Y . (In
Example A, it turned out that the best predictor was linear, but this is not generally
the case.) That is, rather than finding the best function h among all functions, we try
to find the best function of the form $h(x) = \alpha + \beta x$. This merely requires optimizing
over the two parameters α and β. Now
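$$
\begin{aligned}
E[(Y - \alpha - \beta X)^2] &= \mathrm{Var}(Y - \alpha - \beta X) + [E(Y - \alpha - \beta X)]^2 \\
&= \mathrm{Var}(Y - \beta X) + [E(Y - \alpha - \beta X)]^2
\end{aligned}
$$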
The first term of the last expression does not depend on α, so α can be chosen so as
to minimize the second term. To do this, note that
$$E(Y - \alpha - \beta X) = \mu_Y - \alpha - \beta\mu_X$$
and that the right-hand side is zero, and hence its square is minimized, if
$$\alpha = \mu_Y - \beta\mu_X$$
As for the first term,
$$\mathrm{Var}(Y - \beta X) = \sigma_Y^2 + \beta^2\sigma_X^2 - 2\beta\sigma_{XY}$$
154 Chapter 4 Expected Values
where $\sigma_{XY} = \mathrm{Cov}(X, Y)$. This is a quadratic function of β, and the minimum is
found by setting the derivative with respect to β equal to zero, which yields
$$\beta = \frac{\sigma_{XY}}{\sigma_X^2} = \rho\frac{\sigma_Y}{\sigma_X}$$
where ρ is the correlation coefficient. Substituting in these values of α and β, we find that
the minimum mean squared error predictor, which we denote by $\hat{Y}$, is
$$
\begin{aligned}
\hat{Y} &= \alpha + \beta X \\
&= \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - \mu_X)
\end{aligned}
$$
The mean squared prediction error is then
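$$
\begin{aligned}
\mathrm{Var}(Y - \beta X) &= \sigma_Y^2 + \frac{\sigma_{XY}^2}{\sigma_X^4}\sigma_X^2 - 2\frac{\sigma_{XY}}{\sigma_X^2}\sigma_{XY} \\
&= \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} \\
&= \sigma_Y^2 - \rho^2\sigma_Y^2 \\
&= \sigma_Y^2(1 - \rho^2)
\end{aligned}
$$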
First, note that the optimal linear predictor depends on the joint distribution of X
and Y only through their means, variances, and covariance. Thus, in practice, it is
generally easier to construct the optimal linear predictor or an approximation to it
than to construct the general optimal predictor E(Y |X). Second, note that the form
of the optimal linear predictor is the same as that of E(Y |X) for the bivariate normal
distribution. Third, note that the mean squared prediction error depends only on σY
and ρ and that it is small if ρ is close to +1 or −1. Here we see again, from a different
point of view, that the correlation coefficient is a measure of the strength of the linear
relationship between X and Y .
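The formulas for α, β, and the minimum mean squared error use only five moments, which makes them easy to compute. The sketch below assumes a made-up discrete joint distribution so that the moments are exact finite sums; the probabilities and helper names are illustrative, not from the text.

```python
import math

# Hypothetical joint pmf p(x, y); the probabilities are illustrative only.
joint = {
    (0, 0): 0.20, (0, 1): 0.10, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.20, (1, 2): 0.10,
    (2, 0): 0.05, (2, 1): 0.05, (2, 2): 0.20,
}

def expect(joint, g):
    """E[g(X, Y)] under a joint pmf given as {(x, y): prob}."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

mu_x = expect(joint, lambda x, y: x)
mu_y = expect(joint, lambda x, y: y)
var_x = expect(joint, lambda x, y: (x - mu_x) ** 2)
var_y = expect(joint, lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(joint, lambda x, y: (x - mu_x) * (y - mu_y))
rho = cov_xy / math.sqrt(var_x * var_y)

beta = cov_xy / var_x        # slope of the optimal linear predictor
alpha = mu_y - beta * mu_x   # intercept that zeroes the squared-bias term

def linear_mse(a, b):
    """E[(Y - a - bX)^2] for the candidate line a + b*x."""
    return expect(joint, lambda x, y: (y - a - b * x) ** 2)

mse_opt = linear_mse(alpha, beta)
print(alpha, beta)
print(mse_opt, var_y * (1 - rho ** 2))  # the two values should agree
```

Perturbing either `alpha` or `beta` and calling `linear_mse` again gives a larger value, consistent with the derivation.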
EXAMPLE B Suppose that two examinations are given in a course. As a probability model, we
regard the scores of a student on the first and second examinations as jointly distributed
random variables X and Y. Suppose for simplicity that the exams are scaled to have
the same means $\mu = \mu_X = \mu_Y$ and standard deviations $\sigma = \sigma_X = \sigma_Y$. Then,
the correlation between X and Y is $\rho = \sigma_{XY}/\sigma^2$ and the best linear predictor is
$\hat{Y} = \mu + \rho(X - \mu)$, so
$$\hat{Y} - \mu = \rho(X - \mu)$$
Notice that, whenever |ρ| < 1, this equation predicts the student's score on the second
examination to differ from the overall mean μ by less than the score on the first examination did.
If the correlation ρ is positive, this is encouraging for a student who scores below the
mean on the first exam, since our best prediction is that his score on the next exam
will be closer to the mean. On the other hand, it’s bad news for the student who scored
above the mean on the first exam, since our best prediction is that she will score closer
to the mean on the next exam. This phenomenon is often referred to as regression to
the mean. ■
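Regression to the mean is easy to see in a simulation. The sketch below assumes jointly normal exam scores with a hypothetical mean of 70, standard deviation of 10, and correlation of 0.6; none of these values come from the text, and the function name `predict_second` is illustrative.

```python
import math
import random

MU, SIGMA, RHO = 70.0, 10.0, 0.6  # hypothetical exam mean, sd, and correlation

def predict_second(x, mu=MU, rho=RHO):
    """Best linear prediction of the second score: mu + rho * (x - mu)."""
    return mu + rho * (x - mu)

random.seed(1)  # fixed seed so the sketch is reproducible
pairs = []
for _ in range(20000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = MU + SIGMA * z1
    # Mixing z1 and z2 this way gives Y the same mean and sd, with corr(X, Y) = RHO.
    y = MU + SIGMA * (RHO * z1 + math.sqrt(1 - RHO ** 2) * z2)
    pairs.append((x, y))

# Students who scored more than one sd above the mean on the first exam.
high = [(x, y) for x, y in pairs if x > MU + SIGMA]
avg_x = sum(x for x, _ in high) / len(high)
avg_y = sum(y for _, y in high) / len(high)
print(f"first exam avg {avg_x:.1f}, second exam avg {avg_y:.1f}, "
      f"predicted {predict_second(avg_x):.1f}")
```

The group's second-exam average lies between the overall mean and its first-exam average, close to the value given by the linear predictor.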
4.5 The Moment-Generating Function
This section develops and applies some of the properties of the moment-generating
function. It turns out, despite its unlikely appearance, to be a very useful tool that can
dramatically simplify certain calculations.
The moment-generating function (mgf) of a random variable X is $M(t) = E(e^{tX})$
if the expectation is defined. In the discrete case,
$$M(t) = \sum_x e^{tx} p(x)$$
and in the continuous case,
$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$
The expectation, and hence the moment-generating function, may or may not exist
for any particular value of t. In the continuous case, the existence of the expectation
depends on how rapidly the tails of the density decrease; for example, because the
tails of the Cauchy density die down at the rate $x^{-2}$, the expectation does not exist
for any t ≠ 0 and the moment-generating function is undefined. The tails of the normal
density die down at the rate $e^{-x^2}$, so the integral converges for all t.
PROPERTY A If the moment-generating function exists for t in an open interval containing
zero, it uniquely determines the probability distribution. ■
We cannot prove this important property here—its proof depends on properties
of the Laplace transform. Note that Property A says that if two random variables have
the same mgf in an open interval containing zero, they have the same distribution.
For some problems, we can find the mgf and then deduce the unique probability
distribution corresponding to it.
The rth moment of a random variable is $E(X^r)$ if the expectation exists. We
have already encountered the first and second moments earlier in this chapter, that is,
$E(X)$ and $E(X^2)$. Central moments rather than ordinary moments are often used: The
rth central moment is $E\{[X - E(X)]^r\}$. The variance is the second central moment
and is a measure of dispersion about the mean. The third central moment, called the
skewness, is used as a measure of the asymmetry of a density or a frequency function
about its mean; if a density is symmetric about its mean, the skewness is zero (see
Problem 78 at the end of this chapter). As its name implies, the moment-generating
function has something to do with moments. To see this, consider the continuous case:
$$M(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$
The derivative of M(t) is
$$M'(t) = \frac{d}{dt}\int_{-\infty}^{\infty} e^{tx} f(x)\,dx$$
It can be shown that differentiation and integration can be interchanged, so that
$$M'(t) = \int_{-\infty}^{\infty} x e^{tx} f(x)\,dx$$
and
$$M'(0) = \int_{-\infty}^{\infty} x f(x)\,dx = E(X)$$
Differentiating r times, we find
$$M^{(r)}(0) = E(X^r)$$
It can further be argued that if the moment-generating function exists in an interval
containing zero, then so do all the moments. We thus have the following property.
PROPERTY B If the moment-generating function exists in an open interval containing zero,
then $M^{(r)}(0) = E(X^r)$. ■
To find the moments of a random variable from the definition of expectation, we
must sum a series or carry out an integration. The utility of Property B is that, if the
mgf can be found, the process of integration or summation, which may be difficult, can
be replaced by the process of differentiation, which is mechanical. We now illustrate
these concepts using some familiar distributions.
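Before turning to those distributions, Property B can be checked numerically on a case where the moments are also easy to compute directly. The sketch below assumes a fair six-sided die, an illustrative choice that is not one of the text's examples, and it uses central finite differences as a numerical stand-in for exact differentiation; the step size `h` is an assumption.

```python
import math

# Fair six-sided die: p(x) = 1/6 for x = 1, ..., 6 (an illustrative choice).
pmf = {x: 1 / 6 for x in range(1, 7)}

def mgf(t):
    """M(t) = sum over x of e^{tx} p(x), the discrete-case definition."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def moment(r):
    """E(X^r) computed directly from the definition of expectation."""
    return sum(x ** r * p for x, p in pmf.items())

h = 1e-4  # step size for the finite differences (a numerical assumption)
d1 = (mgf(h) - mgf(-h)) / (2 * h)               # approximates M'(0)
d2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2   # approximates M''(0)

print(d1, moment(1))  # both close to 3.5
print(d2, moment(2))  # both close to 91/6
```

The finite differences only approximate the derivatives, so the agreement holds up to a small numerical error rather than exactly.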
EXAMPLE A Poisson Distribution
By definition,
M(t) =