Introduction to Mathematical Statistics, Gordan Zitkovic UTEXAS 2018

Lecture 1: Discrete random variables 1 of 15
Course: Mathematical Statistics
Term: Fall 2018
Instructor: Gordan Žitković
Lecture 1
Discrete random variables
1.1 Random Variables
A large chunk of probability is about random variables. Instead of giving a
precise definition, let us just mention that a random variable can be thought
of as an uncertain (usually numerical, i.e., with values in R, but not always)
quantity.
While it is true that we do not know with certainty what value a random
variable Y will take, we usually know how to assign a number - the probability - that its value will be in some subset1 of R. For example, we might
be interested in P[Y ≥ 7], P[Y ∈ [2, 3.1]] or P[Y ∈ {1, 2, 3}].
Random variables are usually divided into discrete and continuous, even
though there exist random variables which are neither discrete nor continu-
ous. Those can be safely neglected for the purposes of this course, but they
play an important role in many areas of probability and statistics.
1.2 Discrete random variables
Before we define discrete random variables, we need some vocabulary.
Definition 1.2.1. Given a set B, we say that the random variable Y is
B-valued if P[Y ∈ B] = 1.
In words, Y is B-valued if we know for a fact that Y will never take a value
outside of B.
Definition 1.2.2. A random variable is said to be discrete if there exists
a set S such that S is either finite or countablea and Y is S-valued.
a Countable means that its elements can be enumerated by the natural numbers. The
only (infinite) countable sets we will need are N = {1, 2, . . . } or N0 = {0, 1, 2, . . . }.
1We will not worry about measurability and similar subtleties in this class.
Last Updated: September 25, 2019
Definition 1.2.3. The support SY of the discrete random variable Y is
the smallest set S such that Y is S-valued.
Example 1.2.4. A die is thrown and the number obtained is recorded and
denoted by Y. The possible values of Y are S = {1, 2, 3, 4, 5, 6} and each
happens with probability 1/6, so Y is certainly S-valued. Since S is
finite, Y is discrete.
One still needs to argue that S is the support SY of Y. The alternative
would be that SY is a proper subset of S, i.e., that there are redundant
elements in S . This is not the case since all elements in S are “impor-
tant”, i.e., happen with positive probability. If we remove anything
from S , we are omitting a possible value for Y.
On the other hand, it is certainly true that Y always takes its values in
the finite set S′ = {1, 2, 3, 4, 5, 6, 7}, i.e., that Y is S′-valued. One has
to be careful with the terminology here: it is correct to say that Y is
an S ′-valued (or even N-valued) random variable, even though it only
takes the values 1, 2, . . . , 6 with positive probabilities.
Discrete random variables are very nice due to the following fact: in or-
der to be able to compute any conceivable probability involving a discrete
random variable Y, it is enough to know how to compute the probabilities
P[Y = y], for all y ∈ S . Indeed, if you are interested in figuring out what
P[Y ∈ B] is, for some set B ⊆ R (e.g., B = {5, 6, 7}, B = [3, 6], or B = [−2, ∞)),
we simply pick all y ∈ SY which are also in B and sum their probabilities. In
mathematical notation, we have
P[Y ∈ B] = ∑_{y ∈ SY ∩ B} P[Y = y].    (1.2.1)
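Formula (1.2.1) is easy to mirror in code. The following Python sketch stores a pmf as a dictionary (the fair-die table is just a stand-in example) and sums it over the part of the support that lies in B:

```python
# pmf of a fair die, stored as {value: probability}
pmf = {y: 1/6 for y in range(1, 7)}

def prob(B, pmf):
    """P[Y in B]: sum the pmf over the support intersected with B."""
    return sum(p for y, p in pmf.items() if y in B)

print(prob({5, 6, 7}, pmf))   # P[Y in {5, 6, 7}] = P[Y = 5] + P[Y = 6] = 1/3
```

Note that the value 7 contributes nothing, exactly because it lies outside the support SY.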
Definition 1.2.5. The probability mass function (pmf) of a discrete
random variable Y is a function pY defined on the support SY of Y by
pY(y) = P[Y = y], y ∈ SY.
In practice, we usually present the pmf pY in the form of a table (called
the distribution table) as
Y ∼   y       y1   y2   y3   . . .
      pY(y)   p1   p2   p3   . . .

or, simply,

Y ∼   y1   y2   y3   . . .
      p1   p2   p3   . . . ,
where the top row lists all the elements y of the support SY of Y, and the
bottom row lists their probabilities pY(y) = P[Y = y]. It is easy to see that
the function pY has the following properties:
1. pY(y) ∈ [0, 1] for all y, and
2. ∑y∈SY pY(y) = 1.
Here is a first round of examples of discrete random variables and their
supports.
Example 1.2.6.
1. A fair (unbiased) coin is tossed and the value observed is denoted by Y.
Since the only possible values Y can take are H or T, and the set
S = {H, T} is clearly finite, Y is a discrete random variable. Its
distribution is given by the following table:
y H T
pY(y) 1/2 1/2
Both H and T are possible (happen with probability 1/2), so no
smaller set S will have the property that P[Y ∈ S ] = 1. Conse-
quently, the support SY of Y is S = {H, T}.
2. A die is thrown and the number obtained is recorded and denoted by Y.
The possible values of Y are {1, 2, 3, 4, 5, 6} and each happens with
probability 1/6, so Y is discrete with support SY = {1, 2, 3, 4, 5, 6}. Its distribution is
given by the table
y 1 2 3 4 5 6
pY(y) 1/6 1/6 1/6 1/6 1/6 1/6
3. A fair coin is thrown repeatedly until the first H is observed; the number
of Ts observed before that is denoted by Y. In this case we know that
Y can take any of the values N0 = {0, 1, 2, . . . } and that there is
no finite upper bound for it. Nevertheless, we know that Y cannot
take values that are not non-negative integers. Therefore, Y is N0-
valued and, in fact, SY = N0 is its support. Indeed, we have P[Y = y] = 2^{−(y+1)} for y ∈ N0, i.e.,

Y ∼   y       0     1     2    . . .
      pY(y)   1/2   1/4   1/8  . . .
4. A card is drawn randomly from a standard deck, and the result is de-
noted by Y. This example is similar to 2. above, since Y takes
one of finitely many values, and all values are equally likely. The
difference is that the result is not a number anymore. The set
S of all possible values can be represented as the set of all pairs
like (♠, 7), where the first entry denotes the picked card’s suit (in
{♥,♠,♣,♦}), and the second is a number between 1 and 13. It
is, of course, possible to use different conventions and use the set
{2, 3, . . . , 9, 10, J, Q, K, A} for the second component. The point is
that the values Y takes are not numbers.
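Part 3. of the example can also be checked by simulation. A short Python sketch (the seed and sample size are arbitrary choices) tosses a fair coin until the first H and compares the empirical frequencies with the claimed pmf 2^{−(y+1)}:

```python
import random

random.seed(0)  # reproducibility

def tails_before_first_head():
    """Toss a fair coin until H; return the number of Ts observed."""
    count = 0
    while random.random() < 0.5:   # this branch plays the role of T
        count += 1
    return count

n = 100_000
samples = [tails_before_first_head() for _ in range(n)]
for y in range(4):
    print(y, samples.count(y) / n, 2 ** -(y + 1))   # empirical vs. theoretical
```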
1.3 Events and Bernoulli random variables
Random variables Y which can only take one of two values 0, 1, i.e., for
which SY ⊆ {0, 1}, are called indicators or Bernoulli random variables and
are very useful in probability and statistics (and elsewhere). The name comes
from the fact that you should think of such variables as signal lights; if Y = 1
an event of interest has happened, and if Y = 0 it has not happened. In other
words, Y indicates the occurrence of an event.
One reason the Bernoulli random variables are so useful is that they let
us manipulate events without ever leaving the language of random variables.
Here is an example:
Example 1.3.1. Suppose that two dice are thrown so that Y1 and Y2 are
the numbers obtained (both Y1 and Y2 are discrete random variables
with SY1 = SY2 = {1, 2, 3, 4, 5, 6}). If we are interested in the probabil-
ity that their sum is at least 9, we proceed as follows. We define the
random variable W - the sum of Y1 and Y2 - by W = Y1 + Y2. Another
random variable, let us call it Y, is a Bernoulli random variable defined
by
Y =  1,  if W ≥ 9,
     0,  if W < 9.
With such a set-up, Y signals whether the event of interest has hap-
pened, and we can state our original problem in terms of Y, namely
“Compute P[Y = 1] !”.
This example is, admittedly, a little contrived. The point, however, is that
anything can be phrased in terms of random variables; thus, if you know how
to work with random variables, i.e., know how to compute their distributions,
you can solve any problem in probability that comes your way.
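Because two dice produce only 36 equally likely outcomes, the probability P[Y = 1] from Example 1.3.1 can be found by brute-force enumeration; a minimal Python sketch:

```python
from itertools import product

# all 36 equally likely outcomes (y1, y2) of the two dice
outcomes = list(product(range(1, 7), repeat=2))

# Bernoulli indicator: Y = 1 exactly when W = Y1 + Y2 >= 9
Y = [1 if y1 + y2 >= 9 else 0 for y1, y2 in outcomes]

p = sum(Y) / len(outcomes)   # P[Y = 1]
print(p)   # 10/36: there are 4 + 3 + 2 + 1 ways to roll a sum of 9, 10, 11, 12
```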
Another reason Bernoulli random variables are useful is the fact that we
can do arithmetic with them.
Example 1.3.2. 70 coins are tossed and their outcomes are denoted by
W1, W2, . . . , W70. All Wi are random variables with values in {H, T}
(and therefore not Bernoulli random variables), but they can be easily
recoded into Bernoulli random variables as follows:
Yi =  1,  if Wi = H,
      0,  if Wi = T.
Once you have the “dictionary” {1↔ H, 0↔ T}, random variables Yi
and Wi carry exactly the same information. The advantage of using Yi
is that the random variable
N = ∑_{i=1}^{70} Yi,
which takes values in SN = {0, 1, 2, . . . , 70} counts the number of
heads among W1, . . . , W70. Similarly, the random variable
M = Y1 ×Y2 × · · · ×Y70
is a Bernoulli random variable itself. What event does it indicate?
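The recoding and the arithmetic of Example 1.3.2 take only a few lines of Python (the seed and the simulated tosses are just for illustration; as the comment suggests, M is the indicator of the event that all 70 coins come up H):

```python
import random

random.seed(1)

# 70 coin tosses with values in {H, T} ...
W = [random.choice("HT") for _ in range(70)]

# ... recoded into Bernoulli 0/1 random variables
Y = [1 if w == "H" else 0 for w in W]

N = sum(Y)   # counts the heads among W1, ..., W70

M = 1
for y in Y:
    M *= y   # the product is 1 only if every Yi = 1: "all heads"

print(N, M)
```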
1.4 Some widely used discrete random variables
The distribution of a random variable is sometimes defined as “the collection
of all possible probabilities associated to it”. This sounds a bit abstract, and,
at least in the discrete case, obscures the practical significance of this impor-
tant concept. We have learned that for discrete variables the knowledge of the
pmf or the distribution table (such as the one in part 1., 2. or 3. of Example
1.2.6) amounts to the knowledge of the whole distribution. It turns out that
many random variables in widely different contexts come with the same (or
similar) distribution tables, and that some of those appear so often that they
deserve to be named (so that we don’t have to write the distribution table
every time). The following example lists some of those named distributions.
There are many others, but we will not need them in these notes.
Example 1.4.1.
1. Bernoulli distribution. We have already encountered this distri-
bution in our discussion of indicator random variables above. It is
characterized by the distribution table of the form
   0       1
   1 − p   p        (1.4.1)
where p can be any number in (0, 1). Strictly speaking, each value
of p defines a different distribution, so it would be more correct to
speak of a parametric family of distributions, with p ∈ (0, 1) being
the parameter.
In order not to write down the table (1.4.1) every time, we also use
the notation Y ∼ B(p). For example, the Bernoulli random variable
which takes the value 1 when a fair coin falls H and 0 when it falls
T has a B(1/2)-distribution.
An experiment (random occurrence) which can end in two possible
ways (usually called success and failure, even though those names
should not always be taken literally) is often called a Bernoulli
trial. If we “encode” success as 1 and failure by 0, each Bernoulli
trial gives rise to a Bernoulli random variable.
2. Binomial distribution. A random variable whose distribution ta-
ble looks like this
   0      1                        . . .   n−1                        n
   q^n    (n choose 1) p q^(n−1)   . . .   (n choose n−1) p^(n−1) q   p^n

for some n ∈ N, p ∈ (0, 1) and q = 1 − p, is called the binomial
distribution, usually denoted by b(n, p); in closed form, P[Y = k] =
(n choose k) p^k q^(n−k) for k = 0, 1, . . . , n. Remember that the bino-
mial coefficient (n choose k) is given by

   (n choose k) = n! / (k! (n − k)!),   where n! = n(n − 1)(n − 2) · . . . · 2 · 1.
Binomial distributions form a parametric family with two parameters, n ∈ N and p ∈ (0, 1), and each pair (n, p) corresponds to a
different binomial distribution.
Figure 1. The probability mass function (pmf) of a typical binomial distribution.
Recall that the binomial distribution arises as the “number of suc-
cesses in n independent Bernoulli trials”, i.e., it counts the number
of H in n independent tosses of a biased coin whose probability of
H is p.
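A sketch of the b(n, p) pmf in Python, using math.comb for the binomial coefficient (the values n = 10, p = 0.3 are arbitrary stand-ins):

```python
from math import comb

def binom_pmf(k, n, p):
    """P[Y = k] for Y ~ b(n, p): C(n, k) * p^k * q^(n-k)."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

n, p = 10, 0.3
table = [binom_pmf(k, n, p) for k in range(n + 1)]
print(sum(table))   # a pmf must sum to 1
```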
3. Geometric distribution. The geometric distribution is similar to
the binomial in that it counts the number of “successes” in inde-
pendent, repeated Bernoulli trials. The difference is that the num-
ber of trials is no longer fixed (i.e., = n), but we keep tossing until
we get our first success. Since the trials are independent, if the
probability of success in each trial is p ∈ (0, 1), the probability that
it will take exactly k failures before the first success is qk p, where
q = 1 − p. Therefore, the geometric distribution - denoted by
g(p) - comes with the following table:

   0    1    2     3     . . .
   p    qp   q²p   q³p   . . .
Figure 2. The probability mass function (pmf) of a typical geometric distribution.
Caveat: When defining the geometric distribution, some
books count the number of trials to the first success, i.e.,
add the final success into the count. This shifts everything
by 1 and leads to a distribution with support N (and not
N0). While this is no big deal, this ambiguity tends to be
confusing at times and leads to bugs in software. For us, the
geometric distribution will always start from 0. The distri-
bution which counts the final success will be referred to as
the shifted geometric distribution, but we’ll try to avoid it
altogether.
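The failure-counting convention used in these notes is easy to check by simulation. A Python sketch (p = 0.4, the seed, and the sample size are arbitrary choices):

```python
import random

random.seed(2)

def geometric_sample(p):
    """g(p): the number of failures before the first success."""
    count = 0
    while random.random() >= p:   # failure, with probability q = 1 - p
        count += 1
    return count

p, n = 0.4, 100_000
q = 1 - p
samples = [geometric_sample(p) for _ in range(n)]
for k in range(4):
    print(k, samples.count(k) / n, q**k * p)   # empirical vs. q^k * p
```

Note that the support starts at 0 here; a library whose geometric distribution counts trials rather than failures would give the shifted version.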
4. Poisson distribution. This is also a family of distributions, param-
eterized by a single parameter λ > 0, and denoted by P(λ). Its
support is N0 and the distribution table is given by
   0      1        2           3            4            . . .
   e^−λ   e^−λ λ   e^−λ λ²/2   e^−λ λ³/3!   e^−λ λ⁴/4!   . . .

The closed form for the pmf is

   pY(y) = e^−λ λ^y / y!,   y ∈ N0.
The Poisson distribution arises as a limit when n → ∞ and p → 0
while np ∼ λ in the Binomial distribution.
Figure 3. The probability mass function (pmf) of a typical Poisson distribution with λ > 1.
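The limit statement above can be observed numerically: for large n and p = λ/n, the b(n, p) probabilities are already very close to the P(λ) ones. A quick Python sketch (λ = 2 and n = 10000 are arbitrary):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

lam, n = 2.0, 10_000
for k in range(5):
    # b(n, lam/n) vs. P(lam): the two columns nearly coincide
    print(k, binom_pmf(k, n, lam / n), poisson_pmf(k, lam))
```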
1.5 Expectations and standard deviations
Expectations and standard deviations provide summaries of numerical ran-
dom variables - they give us some information about them without over-
whelming us with the entire distribution table. The expectation can be thought
of as a center of the distribution, while the standard deviation gives you an
idea about its spread.2
Definition 1.5.1. For a discrete random variable Y with support SY ⊆
R, we define the expectation E[Y] of Y by
E[Y] = ∑_{y ∈ SY} y pY(y),    (1.5.1)

if the (possibly infinite) sum ∑_{y ∈ SY} y pY(y) converges absolutely, i.e., as
long as

∑_{y ∈ SY} |y| pY(y) < ∞.    (1.5.2)
When the sum in (1.5.2) above diverges (i.e., takes the value +∞), we
say that the expectation of Y is not defined.
Perhaps the most important property of the expectation is its linearity:
Theorem 1.5.2. If E[Y1] and E[Y2] are both defined then so is E[αY1 + βY2],
for any two constants α, β. Moreover,
E[αY1 + βY2] = αE[Y1] + βE[Y2].
In order to define the standard deviation, we first need to define the vari-
ance. Like the expectation, the variance may or may not be defined (depend-
ing on whether the sums used to compute it converge absolutely or not).
Since we will be working only with distributions for which the existence of
expectation(s) is never a problem, we do not mention this issue in the sequel.
Definition 1.5.3. The variance of the random variable Y is
Var[Y] = E[(Y − µY)²] = ∑_{y ∈ SY} (y − µY)² pY(y),   where µY = E[Y].

The standard deviation of Y is

sd[Y] = √Var[Y].
2this should be taken with a grain of salt. After all, what exactly do we mean by a center or
a spread of a distribution?
The fundamental properties of the variance/standard deviation are given
in the following theorem:
Theorem 1.5.4. Suppose that Y1 and Y2 are random variables and that α is
a constant. Then
1. Var[αY1] = α2 Var[Y1], and
2. if, additionally, Y1 and Y2 are independent, then
Var[Y1 + Y2] = Var[Y1] + Var[Y2].
Caveat: These properties are not the same as the properties of the ex-
pectation. First of all the constant comes out of the variance with a
square, and second, the variance of the sum is the sum of the indi-
vidual variances only if additional assumptions, such as the indepen-
dence between the two variables, are imposed.
Finally, here is a very useful alternative formula for the variance of a
random variable:
Proposition 1.5.5. Var[Y] = E[Y2]− (E[Y])2.
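Proposition 1.5.5 is easy to verify numerically for any pmf table; here is a Python sketch with an arbitrary stand-in distribution:

```python
# an arbitrary small pmf table {value: probability}
pmf = {0: 0.2, 1: 0.5, 3: 0.3}

E  = sum(y * p for y, p in pmf.items())       # E[Y]
E2 = sum(y**2 * p for y, p in pmf.items())    # E[Y^2]

var_definition = sum((y - E)**2 * p for y, p in pmf.items())
var_shortcut = E2 - E**2                      # Proposition 1.5.5

print(var_definition, var_shortcut)   # the two formulas agree
```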
Let us compute expectations and variances/standard deviations for our
most important examples.
Example 1.5.6.
1. Bernoulli distribution. Let Y ∼ B(p) be a Bernoulli random vari-
able with parameter p. Then (remember q is a shortcut for 1− p)
E[Y] = 0× q + 1× p = p.
Using Proposition 1.5.5, we get

Var[Y] = E[Y²] − (E[Y])² = 0² × q + 1² × p − p² = p − p² = p(1 − p) = pq,

and, so, sd[Y] = √(pq).
2. Binomial distribution. Moving on to the binomial, Y ∼ b(n, p), we
could either use the formula (1.5.1) and try to evaluate the sum
E[Y] = ∑_{k=0}^{n} k (n choose k) p^k q^(n−k),
or use some of the properties of the expectation of Theorem 1.5.2.
To do the latter, we remember that the distribution of a binomial is
the same as the distribution of a sum of n (independent) Bernoul-
lies. So if we write Y = Y1 + · · · + Yn, and each Y1 . . . Yn has the
B(p) distribution, Theorem 1.5.2 yields
E[Y] = E[Y1] + E[Y2] + · · ·+ E[Yn] = np. (1.5.3)
A similar simplification can be achieved in the computation of the
variance, too. While the independence of Y1, . . . , Yn was unimportant
for (1.5.3), it is crucial for Theorem 1.5.4:
Var[Y] = Var[Y1] + · · · + Var[Yn] = npq,

and, so, sd[Y] = √(npq).
3. Geometric distribution. The trick from 2. above cannot be applied
to the geometric random variables. If nothing else, this is because
Theorem 1.5.2 can only be applied to a given (fixed, nonrandom)
number n of random variables. We can still use the definition (1.5.1)
and evaluate an infinite sum:
E[Y] = ∑_{k=0}^{∞} k p q^k.
Instead of doing that, let us proceed somewhat informally and note
that we can think of a geometric random variable as follows:
With probability p our first throw is a success and Y =
0. With probability q our first throw is a failure and we
restart the experiment on the second throw, making sure
to add the first failure to the count.
Therefore,
E[Y] = p× 0 + q× (1 + E[Y]),
and, so, E[Y] = q/p.
Similar reasoning can be applied to obtain
E[Y²] = p × 0 + q E[(1 + Y)²] = q + 2q E[Y] + q E[Y²] = q + 2q²/p + q E[Y²],

which yields Var[Y] = E[Y²] − (E[Y])² = q/p² and sd[Y] = √q / p.
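Both formulas can be sanity-checked by truncating the infinite sums, since the geometric tail decays fast; a Python sketch with an arbitrary p:

```python
p = 0.3
q = 1 - p

# truncated versions of the defining sums (the tail beyond k = 1000 is negligible)
E  = sum(k * q**k * p for k in range(1001))
E2 = sum(k**2 * q**k * p for k in range(1001))

print(E, q / p)               # both ~ q/p
print(E2 - E**2, q / p**2)    # both ~ q/p^2
```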
4. Poisson distribution. We know that the Poisson distribution arises
as a limit of binomial distributions when n → ∞, p → 0 and
np ∼ λ. We can expect, therefore, that its expectation and vari-
ance should behave accordingly, i.e., for Y ∼ P(λ), we have
E[Y] = λ and Var[Y] = λ. (1.5.4)
The reasoning behind Var[Y] = λ uses the formula Var[Y] = npq
when Y ∼ b(n, p) and plugs in q ≡ 1, since q = 1− p and p→ 0. A
more rigorous way of showing that (1.5.4) is correct is to evaluate
the sums
E[Y] = ∑_{k=0}^{∞} k pY(k) = ∑_{k=0}^{∞} k e^−λ λ^k / k!   and

E[Y²] = ∑_{k=0}^{∞} k² pY(k) = ∑_{k=0}^{∞} k² e^−λ λ^k / k!,
and use Proposition 1.5.5. The sums can be evaluated explicitly, but
since the focus of these notes is not on the evaluation of infinite sums,
we skip the details.
1.6 Problems
Problem 1.6.1. A die is rolled 5 times; let the obtained numbers be given by
Y1, . . . , Y5. Use counting to compute the probability that
1. all Y1, . . . , Y5 are even?
2. at most 4 of Y1, . . . , Y5 are odd?
3. the values of Y1, . . . , Y5 are all different from each other?
Problem 1.6.2. Identify the supports of the following random variables:
1. Y + 1, where Y ∼ B(p) (Bernoulli),
2. Y2, where Y ∼ b(n, p) (binomial),
3. Y− 5, where Y ∼ g(p) (geometric),
4. 2Y, where Y ∼ P(λ) (Poisson).
Problem 1.6.3. Let Y denote the number of tosses of a fair die until the first
6 is obtained (if we get a 6 on the first try, Y = 0). The support SY of Y is
(a) {0, 1, 2, 3, 4, . . . }
(b) {1, 2, 3, 4, 5, 6}
(c) {1/6, 1/6, 1/6, 1/6, 1/6}

(d) {1/6, (5/6) × 1/6, (5/6)² × 1/6, (5/6)³ × 1/6, . . . }
(e) none of the above
Problem 1.6.4. The probability that Janet makes a free throw is 0.6. What is
the probability that she will make at least 16 out of 23 (independent) throws?
Write down the answer as a sum - no need to evaluate it.
Problem 1.6.5. Let Y1 and Y2 be random variables with distributions
Y1 ∼   1     2     3     4
       1/4   1/4   1/4   1/4

and

Y2 ∼   1     2
       1/2   1/2 .
Then
(a) Y1 + Y2 ∼   2     3     4     5     6
                1/8   1/4   1/4   1/4   1/8
(b) SY1+Y2 = SY1 ∪ SY2
(c) Y1 is binomially distributed
(d) the events {Y1 = 1} and {Y2 = 2} are mutually exclusive
(e) none of the above
Problem 1.6.6. (*) Bob and Alice alternate taking customer calls at a call
center, with Alice always taking the first call. The number of calls during a
day has a Poisson distribution with parameter λ > 0.
1. What is the probability that Bob will take the last call of the day (this
includes the case when there are 0 calls)? (Hint: What is the Taylor series
for the function cosh(x) = (e^x + e^−x)/2 around x = 0?)
2. Who is more likely to take the last call? Alice or Bob? As above, if there
are no calls, we give the “last call” to Bob.
Problem 1.6.7. Three unbiased and independent coins are tossed. Let Y1 be
the total number of heads on the first two coins, and let Y be the random
variable which is equal to Y1 if the third coin comes up heads and −Y1 if it
comes up tails. Compute Var[Y].
Problem 1.6.8. A die is thrown and a coin is tossed independently of it. Let
Y be the random variable which is equal to the number on the die in case the
coin comes up heads and twice the number on the die if it comes up tails.
1. What is the support SY of Y? What is its distribution (pmf)?
2. Compute E[Y] and Var[Y].
Problem 1.6.9. n people vote in a general election, with only two candidates
running. The vote of person i is denoted by Yi and it can take values 0 and 1,
depending which candidate they voted for (we encode one of them as 0 and
the other as 1). We assume that votes are independent of each other and that
each person votes for candidate 1 with probability p. If the total number of
votes for candidate 1 is denoted by Y, then
(a) Y is a geometric random variable
(b) Y2 is a binomial random variable
(c) Y is uniform on {0, 1, . . . , n}
(d) Var[Y] ≤ E[Y]
(e) none of the above
Problem 1.6.10. A discrete random variable Y is said to have a discrete uni-
form distribution on {0, 1, 2, . . . , n}, denoted by Y ∼ u(n) if its distribution
table looks like this:
   0         1         2         . . .   n
   1/(n+1)   1/(n+1)   1/(n+1)   . . .   1/(n+1)
Compute the expectation and the variance of u(n). You may use the fol-
lowing identities: 1 + 2 + · · · + n = n(n + 1)/2 and 1² + 2² + · · · + n² =
n(n + 1)(2n + 1)/6.
Problem 1.6.11. (*) Let Y be a discrete random variable such that SY ⊆ N0.
By counting the same thing in two different ways, explain why
E[Y] = ∑_{n ∈ N} P[Y ≥ n].
This is called the tail formula for the expectation.
Problem 1.6.12. Let X be a geometric random variable with parameter p ∈
(0, 1), i.e., X ∼ g(p), and let Y = 2^−X. Write down (the first few entries in)
the distribution table of Y. Compute E[Y] = E[2^−X].
Problem 1.6.13. Let Y1 and Y2 be uncorrelated discrete random variables such
that Var[2Y1 − Y2] = 17 and Var[Y1 + 2Y2] = 5. Compute Var[Y1 − Y2]. (Note:
Y1 and Y2 are uncorrelated if E[(Y1 −E[Y1])(Y2 −E[Y2])] = 0.)
(Hint: What is Var[αY1 + βY2] in terms of Var[Y1] and Var[Y2] when Y1
and Y2 are uncorrelated?)
Problem 1.6.14. Let Y1 and Y2 be uncorrelated random variables such that
sd[Y1 + Y2] = 5. Then sd[Y1 −Y2] =
(a) 1   (b) √2   (c) √3   (d) 5   (e) not enough information is given
Problem 1.6.15 (*). A mail lady has l ∈ N letters in her bag when she starts
her shift and is scheduled to visit n ∈ N different households during her
round. If each letter is equally likely to be addressed to any one of the n
households, and the letters are delivered independently of each other, what
is the expected number of households that will receive at least one letter?
(Note: It is quite possible that some households will receive more than 1
letter.)
Lecture 2: Continuous random variables 1 of 11
Course: Mathematical Statistics
Term: Fall 2018
Instructor: Gordan Žitković
Lecture 2
Probability review - continuous random variables
2.1 Probability Density functions (pdfs)
Some random variables naturally take one of a continuum of values, and
cannot be associated with a countable set. The simplest example is the uni-
form random variable Y on [0, 1] (also known as a random number), which
can take any value in the interval [0, 1], with the probability of it landing
between a and b, where 0 < a < b < 1, given by
P[Y ∈ [a, b]] = b − a.    (2.1.1)
One of the most counterintuitive things about Y is that P[Y = y] = 0 for
any y ∈ [0, 1], even though we know that Y will take some value in [0, 1].
Therefore, unlike in the discrete case, where the probabilities given by the
pmf pY(y) = P[Y = y] contain all the information, in the case of the uniform
these are completely uninformative. The right question to ask is the one in
(2.1.1), i.e., one needs to focus on probabilities of values in intervals. The class
of random variables where such questions come with an easy-to-represent
answer is called continuous. More precisely:
Definition 2.1.1. A random variable Y is said to have a continuous
distribution if there exists a function fY : R→ [0, ∞) such that
P[Y ∈ [a, b]] = ∫_a^b fY(y) dy   for all a < b.
The function fY is called the probability density function (pdf) of Y.
Not every function can serve as a pdf. The pdf of any random variable will
always have the following properties:
1. fY(y) ≥ 0 for all y, and
2. ∫_{−∞}^{∞} fY(y) dy = 1, since P[Y ∈ (−∞, ∞)] = 1.
It can be shown that any such function is a pdf of some continuous random
variable, but we will focus on a small number of important examples in
these notes.
Caveat:
1. There are random variables which are neither continuous nor dis-
crete, but we will not encounter them in these notes (even though
some important random variables in applications - e.g., insurance -
fall into this category.)
2. One should think of the pdf fY as an analogue of the pmf in the
discrete case, but this analogy should not be stretched too far. For
example, we can easily have fY(y) > 1 at some y, or even on an
entire interval. This is the consequence of the fact that fY(y) is not
the probability of anything. It is a probability density, i.e., for small
(in the sense of a limit) ∆y > 0 we have

P[Y ∈ [y, y + ∆y]] ≈ fY(y) ∆y,

i.e., fY(y) is, approximately, the quotient between the probability of
an interval and the length of the same interval.
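The density approximation in the caveat can be seen numerically. A Python sketch using the (arbitrary, but valid) pdf f(y) = 2y on [0, 1], whose exact interval probabilities are (y + ∆y)² − y²:

```python
def f(y):
    """pdf f(y) = 2y on [0, 1]; it integrates to 1, and f(y) > 1 for y > 1/2."""
    return 2 * y if 0 <= y <= 1 else 0.0

y, dy = 0.4, 1e-4

exact = (y + dy)**2 - y**2   # integral of 2t over [y, y + dy]
approx = f(y) * dy           # density times interval length

print(exact, approx)         # agree up to a term of order dy^2
```

The same pdf also illustrates the remark that a density can exceed 1: here f(0.75) = 1.5.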
2.2 The “indicator” notation
Before we list some of the most important examples of continuous random
variables, we need to introduce a very useful notation tool.
Definition 2.2.1. For a set A ⊆ R, the function 1A : R→ R, given by
1A(y) =  1,  if y ∈ A,
         0,  otherwise,
is called the indicator of A.
As its name already suggests, the indicator indicates whether its
argument y belongs to the set A or not. The graph of a typical indicator -
when A is an interval [a, b] - is given in Figure 1.
Figure 1. The indicator function 1_[1,2] of the interval [1, 2].
Indicators are useful when dealing with functions that are defined with
different formulas on different parts of their domain.
Example 2.2.2. The uniform distribution U(l, r) is a slight generaliza-
tion of the uniform U(0, 1) distribution mentioned above. It models a
number randomly chosen in the interval [l, r] such that the probability
of getting a point in the subinterval [a, b] ⊆ [l, r] is proportional to its
length b− a. Since the probability of choosing some point in [l, r] is 1,
by definition, we have to have
P[Y ∈ [a, b]] = (b − a)/(r − l)   for all a < b in [l, r].
To show that this is a continuous distribution, we need to show that it
admits a pdf, i.e., a function fY such that

∫_a^b fY(y) dy = (b − a)/(r − l)   for all a < b.

For a, b < l or a, b > r we must have P[Y ∈ [a, b]] = 0, so

∫_a^b fY(y) dy = 0   for a, b ∈ R \ [l, r].

These two requirements force

fY(y) = 1/(r − l) for y ∈ [l, r]  and  fY(y) = 0 for y ∉ [l, r],    (2.2.1)

and we can easily check that fY(y) is, indeed, the pdf of Y.
The indicator notation can be used to write (2.2.1) in a more compact
way:
fY(y) = (1/(r − l)) 1_[l,r](y).
Not only does this give a single formula valid for all y, it also reveals
that [l, r] is the “effective” part of the domain of fY. We can think of fY
as “the constant 1r−l , but only on the interval [l, r]; it is zero everywhere
else”.
The interval-indicator notation will come into its own a bit later when we
discuss densities of several random variables (random vectors), but for now
let us comment on how it allows us to write any integral as an integral over
(−∞, ∞). The idea behind is that any function f multiplied by the indicator
1[a,b] stays the same on [a, b], but takes value 0 everywhere else. Therefore,∫ b
a
f (y) dy =
∫ ∞
−∞
f (y)1[a,b](y) dy,
because the integral of the function 0 is 0, even when taken over infinite
intervals.
Finally, let us introduce another notation for the indicator functions. It
turns out to be more intuitive, at least for intervals, and will do wonders
for the evaluation of iterated integrals. Since the condition y ∈ [a, b] can be
written as a ≤ y ≤ b, we sometimes write
1{a≤y≤b} instead of 1[a,b](y).
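A Python sketch of the indicator and of the rewrite of an integral over [a, b] as an integral over (−∞, ∞) (the Riemann sum below is a crude stand-in for the integral):

```python
def indicator(a, b):
    """Return the indicator function 1_[a,b] as a Python function."""
    return lambda y: 1.0 if a <= y <= b else 0.0

one = indicator(1, 2)
print(one(1.5), one(3))   # 1.0 0.0

# integral of f over [1, 2] written as an integral of f * 1_[1,2] over a wide range
f = lambda y: y**2
dy = 1e-4
grid = [-10 + k * dy for k in range(int(20 / dy))]
riemann = sum(f(y) * one(y) * dy for y in grid)
print(riemann)   # ~ 7/3, the integral of y^2 from 1 to 2
```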
2.3 First examples of continuous random variables
Example 2.3.1.
1. Uniform distribution. We have already encountered the uniform
distribution U(l, r) on the interval [l, r] and we have shown that it
is a continuous distribution with the pdf

fY(y) = (1/(r − l)) 1_[l,r](y).

As always, this is really a whole family of distributions, parameter-
ized by the two real parameters l and r.
Figure 2. The density function (pdf) of the uniform U(l, r) distribution.
2. Normal distribution. The family of normal distributions - denoted
by Y ∼ N(µ, σ) - is also parameterized by two parameters µ ∈ R
and σ > 0, and its pdf is given by the (at first sight complicated)
formula

fY(y) = (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)},   y ∈ R.

The normal distribution is symmetric around µ and its standard
deviation (as we shall see shortly) is σ; its graph is shown in Figure 3.
Figure 3. The density function (pdf) of the normal distribution N(µ, σ).
The function fY is defined by the above formula for each y ∈ R and
it is a nontrivial task to show that it is, indeed, a pdf of anything.
The difficulty lies in evaluating the integral

∫_{−∞}^{∞} fY(y) dy

and showing that it equals 1. This is, indeed, true, but needs
a bit more mathematics than we care to get into right now. The
probabilities P[Y ∈ [a, b]] are not any easier to compute for concrete
a, b and, in general, do not admit a closed form (formula). That is
why we used to use tables of precomputed approximate values (we
use software today).
Nevertheless, the normal distribution is, arguably, the most impor-
tant distribution in probability and statistics. The main reason for
that is that it appears in the central limit theorem (which we will
talk more about later), and, therefore, shows up whenever a large
number of independent random influences act at the same time.
3. Exponential distribution. The exponential distribution is a contin-
uous analogue of the geometric distribution and is used in mod-
eling lifetimes of light bulbs or waiting times in the supermarket
checkout lines. It comes in a parametric family E(τ), parameterized
by the positive parameter τ > 0. Its pdf is given by

fY(y) = (1/τ) e^{−y/τ} 1_[0,∞)(y),

and its graph is shown in Figure 4.

Figure 4. The density function (pdf) of the exponential E(τ) distribution.
The use of an interval indicator in the expression above signals
that fY is positive only for y > 0, and that, in turn, means that an
exponential random variable cannot take negative values.
Caveat: Many books use a different parametrization of the
exponential family, namely Y ∼ E(λ) if

fY(y) = λ e^{−λy} 1_{[0,∞)}(y),

so that, effectively, λ = 1/τ. Both parameters have meaningful
interpretations, and, depending on the context, one can
be more natural than the other. Keep this in mind to avoid
unnecessary confusion.
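The caveat matters in software, too. Python's standard library, for instance, parameterizes the exponential by the rate λ, so one must pass 1/τ. A small sketch illustrating this (the seed, sample size, and variable names are arbitrary choices of ours):

```python
import random

random.seed(0)
tau = 2.0              # mean parameter: Y ~ E(tau)
lam = 1.0 / tau        # rate parameter, the one Python's stdlib expects

# random.expovariate takes the *rate* lambda = 1/tau, so the sample mean
# should come out near tau, not near lambda.
sample = [random.expovariate(lam) for _ in range(100_000)]
mean = sum(sample) / len(sample)
print(mean)            # close to tau = 2.0
```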
2.4 Expectations and standard deviations
The definition of the expectation will look similar to the one in the discrete
case, but sums will be replaced by integrals. Once the expectation is defined,
everything else can be repeated verbatim from the previous lecture.
Definition 2.4.1. For a continuous random variable Y with pdf fY we
define the expectation E[Y] of Y by

E[Y] = ∫_{−∞}^{∞} y fY(y) dy, (2.4.1)

as long as ∫_{−∞}^{∞} |y fY(y)| dy < ∞. When this value is +∞, we say that
the expectation of Y is not defined.
The definitions of the variance and the standard deviation are analogous
to their discrete versions:

Var[Y] = ∫_{−∞}^{∞} (y − µY)² fY(y) dy, where µY = E[Y],

and sd[Y] = √Var[Y]. Theorem ?? and Proposition ?? are valid exactly as
written in the continuous case, too.
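The definitions translate directly into code: truncate the integral to a window outside of which the pdf vanishes (or is negligible) and approximate it by a midpoint Riemann sum. A rough sketch, assuming such a window [lo, hi] is supplied by the caller (helper names and step count are our own choices):

```python
def expectation(pdf, lo, hi, n=100_000):
    """Midpoint Riemann sum for E[Y] = integral of y * pdf(y) over [lo, hi]."""
    dy = (hi - lo) / n
    return sum(y * pdf(y) for y in (lo + (i + 0.5) * dy for i in range(n))) * dy

def variance(pdf, lo, hi, n=100_000):
    """Var[Y] = integral of (y - mu)^2 * pdf(y), with mu computed as above."""
    mu = expectation(pdf, lo, hi, n)
    dy = (hi - lo) / n
    return sum((y - mu) ** 2 * pdf(y)
               for y in (lo + (i + 0.5) * dy for i in range(n))) * dy

# Uniform U(2, 6): the closed forms give mean (l+r)/2 = 4 and variance (r-l)^2/12 = 4/3.
uniform_pdf = lambda y: 0.25 if 2 <= y <= 6 else 0.0
print(expectation(uniform_pdf, 2, 6), variance(uniform_pdf, 2, 6))
```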
Let us compute expectations and variances/standard deviations of the
distributions from Example 2.3.1.
Example 2.4.2.
1. Uniform distribution. The computations needed for the expectation
and the variance of the uniform U(l, r) distribution are quite
simple:

E[Y] = ∫_{−∞}^{∞} y fY(y) dy = (1/(r−l)) ∫_{−∞}^{∞} y 1_{[l,r]}(y) dy
     = (1/(r−l)) ∫_l^r y dy = (1/(r−l)) · (r² − l²)/2 = (l + r)/2.
Similarly,

Var[Y] = ∫_{−∞}^{∞} (y − (l+r)/2)² fY(y) dy = (1/(r−l)) ∫_l^r (y − (l+r)/2)² dy
       = (1/(r−l)) · [ (1/3)(y − (l+r)/2)³ ]_l^r = (1/12)(r − l)².
2. Normal distribution. To compute the expectation of the normal
distribution N(µ, σ), we need to evaluate the following integral

E[Y] = ∫_{−∞}^{∞} y (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)} dy.

We change the variable z = (y − µ)/σ to obtain

E[Y] = ∫_{−∞}^{∞} (σz + µ) (1/√(2π)) e^{−z²/2} dz
     = (σ/√(2π)) ∫_{−∞}^{∞} z e^{−z²/2} dz + µ ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} dz = µ.

The integral next to µ evaluates to 1, because it is simply the integral
of the density function of the standard normal N(0, 1). The
integral next to σ/√(2π) is 0 because it is an integral of an odd function
over the entire R.
To compute the variance, we need to evaluate the integral

Var[Y] = ∫_{−∞}^{∞} (y − µ)² (1/√(2πσ²)) e^{−(y−µ)²/(2σ²)} dy,

because we now know that µY = µ. The same change of variables
as above yields

Var[Y] = σ² (1/√(2π)) ∫_{−∞}^{∞} z² e^{−z²/2} dz = σ²,

where we used the fact (which can be obtained using integration
by parts, but we skip the details here) that

∫_{−∞}^{∞} z² e^{−z²/2} dz = √(2π).
3. Exponential distribution. The integrals involved in the evaluation
of the expectation and the variance of the exponential distribution
are simpler and only involve a bit of integration by parts, so we skip
the details. It should be noted that the interval-indicator notation
we used to define the pdf of the exponential tells us immediately
what bounds to use for integration. For Y ∼ E(τ), we have

E[Y] = ∫_{−∞}^{∞} y fY(y) dy = ∫_{−∞}^{∞} (y/τ) e^{−y/τ} 1_{[0,∞)}(y) dy
     = ∫_0^∞ (y/τ) e^{−y/τ} dy = τ.

Therefore µY = τ and, so,

Var[Y] = E[Y²] − (E[Y])² = ∫_{−∞}^{∞} y² fY(y) dy − τ²
       = ∫_0^∞ (y²/τ) e^{−y/τ} dy − τ².

To evaluate the first integral on the right, we change variables z = y/τ, so that

Var[Y] = τ² ∫_0^∞ z² e^{−z} dz − τ² = 2τ² − τ² = τ²,

where we used the fact (which can be derived by integration by
parts) that ∫_0^∞ z² e^{−z} dz = 2.
2.5 Moments
The expectation is the integral of the first power y = y¹ multiplied by the pdf,
and the variance involves a similar integral with y replaced by (y − µY)². Integrals
of higher powers of y are important in statistics; not as important as the
expectation and variance, but still important enough to have names:
Definition 2.5.1. For a random variable Y with pdf fY and k = 1, 2, . . . ,
we define the k-th (raw) moment µ_k by

µ_k = E[Y^k] = ∫_{−∞}^{∞} y^k fY(y) dy,

as well as the k-th central moment µ^c_k by

µ^c_k = E[(Y − E[Y])^k] = ∫_{−∞}^{∞} (y − µ_1)^k fY(y) dy.
We see immediately from the definition that the expectation (mean) is the
first (raw) moment and that the variance is the second central moment, i.e.,

µ_1 = E[Y] and µ^c_2 = Var[Y].

The third and fourth moments of the standardized random variable, namely,

E[((Y − E[Y])/sd[Y])³] and E[((Y − E[Y])/sd[Y])⁴],

are called skewness and kurtosis, respectively. It is easy to see that, in
terms of moments, we can express skewness as µ^c_3/(µ^c_2)^{3/2} and kurtosis as
µ^c_4/(µ^c_2)².
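These ratios are easy to estimate from data with plain (biased) moment estimators; a sketch, using normal draws, for which skewness is approximately 0 and kurtosis approximately 3 (seed and sample size are arbitrary choices of ours):

```python
import random

def central_moment(xs, k):
    """Plain (biased) sample estimate of the k-th central moment."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(200_000)]
print(skewness(xs), kurtosis(xs))   # for a normal sample: near 0 and near 3
```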
Example 2.5.2.
1. Uniform distribution. We leave this to the reader as an exercise in
the Problems section below.
2. Normal distribution. Let us compute the central moments, too.
Since Y − µ ∼ N(0, σ) whenever Y ∼ N(µ, σ), the central moments of
N(µ, σ) are nothing but the raw moments of N(0, σ). For that, we need
to compute the integrals

∫_{−∞}^{∞} y^k fY(y) dy = (1/(√(2π)σ)) ∫_{−∞}^{∞} y^k e^{−y²/(2σ²)} dy. (2.5.1)

For odd k, these are integrals of odd functions over the entire R,
and, therefore, their value is 0, i.e.,

µ_k = 0 for k odd.

For even k, there is no such shortcut, and the integral in (2.5.1)
can be computed by parts:

∫_{−∞}^{∞} y^k e^{−y²/(2σ²)} dy = (1/(k+1)) y^{k+1} e^{−y²/(2σ²)} |_{−∞}^{∞}
− ∫_{−∞}^{∞} (1/(k+1)) y^{k+1} (−y/σ²) e^{−y²/(2σ²)} dy.

Since lim_{y→±∞} y^{k+1} e^{−y²/(2σ²)} = 0, we obtain

(1/(√(2π)σ)) ∫_{−∞}^{∞} y^k e^{−y²/(2σ²)} dy
= (1/(√(2π)σ)) · (1/(σ²(k+1))) ∫_{−∞}^{∞} y^{k+2} e^{−y²/(2σ²)} dy,

i.e., µ_k = µ_{k+2}/(σ²(k+1)). Written more compactly,

µ_{k+2} = σ²(k + 1) µ_k.
Starting from µ_2 = (1/(√(2π)σ)) ∫_{−∞}^{∞} y² e^{−y²/(2σ²)} dy = σ², we get

µ_k = σ^k (k − 1) × (k − 3) × · · · × 5 × 3 × 1, for k even.
3. Exponential distribution. A similar integration-by-parts procedure
as above allows us to compute the (raw) moments of the
exponential (we skip the details):

µ_k = ∫_0^∞ y^k (1/τ) e^{−y/τ} dy = τ^k · k × (k − 1) × · · · × 2 × 1 = τ^k k!,

for k = 1, 2, 3, . . . . The central moments are not so important, and
do not admit such a nice closed formula.
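The formula µ_k = τ^k k! is easy to sanity-check by numerical integration; a sketch (the truncation point 50τ and the step count are arbitrary choices of ours):

```python
from math import exp, factorial

def exp_raw_moment(k, tau, n=200_000):
    """Midpoint Riemann sum for the k-th raw moment of E(tau)."""
    hi = 50 * tau          # truncation point: the tail beyond this is negligible
    dy = hi / n
    return sum(((i + 0.5) * dy) ** k * exp(-(i + 0.5) * dy / tau) / tau
               for i in range(n)) * dy

tau = 1.5
for k in range(1, 5):
    # numerical moment vs. the closed form tau^k * k!
    print(k, exp_raw_moment(k, tau), tau ** k * factorial(k))
```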
2.6 Problems
Problem 2.6.1. Let Y be a continuous random variable whose pdf fY is given
by

fY(y) =
  cy²,  y ∈ [−1, 1],
  0,    otherwise,

for some constant c.

1. Write down an expression for fY using the interval-indicator notation.
2. What is the value of c?
3. Compute E[Y] and sd[Y].
Problem 2.6.2. Let Y be a continuous random variable with the pdf

fY(y) = (15/4) y² (1 − y²) 1_{−1≤y≤1}.

Compute P[Y² ≤ 1/4].
Problem 2.6.3. The random variable Y has the pdf fY(y) = (3/2) y² 1_{−1≤y≤1}.
Compute the probability P[2Y² ≥ 1].
Problem 2.6.4. Let Y have the pdf f(y) = 1/(π(1 + y²)) for y ∈ (−∞, ∞). Compute
the probability that Y^{−2} lies in the interval [1/4, 4].
Problem 2.6.5 (The exponential distribution). Suppose that the random vari-
able Y follows an exponential distribution with parameter τ > 0, i.e., Y ∼
E(τ), i.e., Y is a continuous random variable with the density function fY
given by
fY(y) = (1/τ) e^{−y/τ} 1_{y≥0}.
Compute the following quantities
1. P[Y = 0], 2. P[Y ≤ 0], 3. P[Y ≤ y] for y ∈ (−∞, ∞),
4. P[Y > 1], 5. P[|Y− 2| > 1], 6. E[Y], 7. E[Y2], 8. Var[Y],
9. The mode of Y (look up mode if you don’t know what it is)
10. The median of Y (look up median if you don’t know what it is)
11. (Optional) P[⌊Y⌋ is odd], where ⌊a⌋ denotes the largest integer ≤ a.
Which one is bigger, P[⌊Y⌋ is odd] or P[⌊Y⌋ is even]? Explain without
using any calculations.
Problem 2.6.6 (The triangular distribution). We say that the random variable
Y follows the triangular distribution with parameters l < r if it is continuous
with pdf fY given by

fY(y) = c(y − l) 1_{[l, (l+r)/2]}(y) + c(r − y) 1_{[(l+r)/2, r]}(y).

1. Determine the value of the constant c.
2. Compute the expectation and the standard deviation of Y.
3. Assuming that l = −1 and r = 1, compute P[|Y − E[Y]| ≥ sd[Y]].
Problem 2.6.7 (Moments of the uniform distribution). Let Y follow the uniform
distribution U(l, r) on the interval [l, r], where l < r, i.e., its density is
given by

fY(y) = (1/(r−l)) 1_{l≤y≤r} =
  1/(r − l),  if y ∈ [l, r],
  0,          otherwise.

Compute the moments µ_k and central moments µ^c_k, k = 1, 2, . . . , of Y.
Course: Mathematical Statistics
Term: Fall 2018
Instructor: Gordan Žitković
Lecture 3
Cumulative distribution functions and derived
quantities
When we talk about the distribution of a discrete random variable, we
write down its pmf (or a distribution table), and when the variable is contin-
uous, we give its pdf. There are other ways of expressing the same informa-
tion; depending on the context, these other ways can be much more useful
or effective.
3.1 Cumulative distribution functions (cdf)
Definition 3.1.1. For a random variable Y, discrete or continuous, we
define its cumulative distribution function (cdf) FY : R→ [0, 1] by
FY(y) = P[Y ≤ y], y ∈ R.
The first, obvious, advantage of the cdf is that it can be used for both dis-
crete and continuous random variables. Since it is defined as a probability of
an event, FY(y) can be computed (at least in principle) from the distribution
table in the discrete case
FY(y) = ∑_{u ∈ SY, u ≤ y} pY(u),

or from the pdf (in the continuous case):

FY(y) = ∫_{−∞}^{y} fY(u) du. (3.1.1)
As we shall see in the examples, going the other way in the discrete case is
possible, but the formula is a bit clumsy. The continuous case is nicer because
one could use the fundamental theorem of calculus to conclude that
fY(y) = (d/dy) FY(y) for y ∈ R,

at least for those y where fY is a continuous function.
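The relationship fY = (d/dy) FY is easy to check numerically with a centered difference quotient; a sketch using the exponential E(τ), whose cdf and pdf both have closed forms (τ and the sample points are arbitrary choices of ours):

```python
from math import exp

tau = 2.0
cdf = lambda y: 1 - exp(-y / tau) if y >= 0 else 0.0    # F_Y of E(tau)
pdf = lambda y: exp(-y / tau) / tau if y >= 0 else 0.0  # f_Y of E(tau)

def cdf_derivative(F, y, h=1e-6):
    """Centered difference approximation of F'(y)."""
    return (F(y + h) - F(y - h)) / (2 * h)

for y in (0.5, 1.0, 3.0):
    print(y, cdf_derivative(cdf, y), pdf(y))  # the two columns agree
```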
We know that the pdf fY of any continuous random variable Y must be
nonnegative and integrate to 1. In a similar way, any cdf will have the
following properties:
1. 0 ≤ FY(u) ≤ 1,
2. FY is nondecreasing, and
3. limu→∞ FY(u) = 1 and limu→−∞ FY(u) = 0.
Example 3.1.2.
1. Bernoulli. Let Y be a Bernoulli random variable B(p). To find an
expression for FY, we first note that

FY(y) = 0 for y < 0.

This follows directly from the definition: Y takes values 0 or 1, so
P[Y ≤ y] = 0 as soon as y < 0. Similarly,

FY(y) = 1 for y ≥ 1.

What happens in the middle? For any y ∈ [0, 1), the only way for
Y ≤ y to be true is if Y = 0. Therefore,

FY(y) = P[Y ≤ y] = P[Y = 0] = q for y ∈ [0, 1).

A picture makes it even easier to grasp:
Figure 1. The cumulative distribution function (CDF) for the Bernoulli
B(p) distribution.
2. Discrete with finite support. Let Y be a discrete random variable
with a finite support SY = {y1, . . . , yn} and let its distribution table
be given by
y1 y2 . . . yn
p1 p2 . . . pn
Following the same reasoning as in the Bernoulli case, we get the
following expression for the cdf
FY(y) =
  0,                           y < y1,
  p1,                          y1 ≤ y < y2,
  p1 + p2,                     y2 ≤ y < y3,
  . . .
  p1 + p2 + · · · + p_{n−1},   y_{n−1} ≤ y < yn,
  1,                           y ≥ yn.
Again, a picture is easier to parse:
Figure 2. The cumulative distribution of a discrete distribution with
support {y1, . . . , yn} and the associated probabilities {p1, . . . , pn}.
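The staircase shape translates directly into code: the cdf of a finite discrete distribution simply accumulates the probabilities of the support points ≤ y. A minimal sketch, using a fair die as the example:

```python
def discrete_cdf(support, probs):
    """Return the cdf F_Y of a finite discrete distribution given as parallel lists."""
    def F(y):
        return sum(p for s, p in zip(support, probs) if s <= y)
    return F

# A fair die: support 1..6, each point carrying probability 1/6.
F = discrete_cdf([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
print(F(0.5), F(3), F(3.7), F(6))
```

Note how the value is constant between support points and jumps exactly at them, matching the picture above.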
3. Uniform. The cdf of the uniform distribution U(l, r) will no longer
have “jumps”. In fact, that is the reason behind calling continuous
distributions continuous. Here, we use the expression (3.1.1) and
integrate the pdf fY of the uniform distribution from −∞ to y. As
above, FY(y) = 0 for y < l because fY(y) = 0 for y < l and
integration of 0 yields 0. To see what is going on between l and
r, we pick y ∈ [l, r] and note that

∫_{−∞}^{y} fY(u) du = ∫_l^y fY(u) du = ∫_l^y (1/(r−l)) 1_{[l,r]}(u) du
= (1/(r−l)) ∫_l^y du = (y − l)/(r − l).
Finally, for y > r, we have FY(y) = 1. Alternatively, we could have
used the definition of FY to conclude directly that

FY(y) = P[Y ≤ y] =
  0,                y < l,
  (y − l)/(r − l),  y ∈ [l, r],
  1,                y > r.
Figure 3. The cumulative distribution of a uniform U(l, r) distribution.
4. Normal distribution. The cdf of the normal distribution N(µ, σ),

FY(y) = ∫_{−∞}^{y} (1/√(2πσ²)) e^{−(u−µ)²/(2σ²)} du,

does not have an explicit expression in terms of elementary functions
(not even for µ = 0 and σ = 1). That is why you had to use
tables (or software) to compute various probabilities associated with
the normal in your probability class. Using mathematical software,
one can evaluate this integral numerically, and the resulting picture
is given below:
Figure 4. The cumulative distribution of a normal N(µ, σ) distribution.
The marked values are FY(µ − σ) ≈ 0.16, FY(µ) = 0.5, FY(µ + σ) ≈ 0.84
and FY(µ + 2σ) ≈ 0.98.
5. Exponential distribution. The integration in the computation of
the cdf FY of an exponentially-distributed random variable Y ∼
E(τ) can be performed quite easily and completely explicitly. First
of all, for y < 0, we clearly have FY(y) = 0. For y > 0, we compute

FY(y) = ∫_{−∞}^{y} (1/τ) e^{−u/τ} 1_{[0,∞)}(u) du = ∫_0^y (1/τ) e^{−u/τ} du
= 1 − e^{−y/τ}, y > 0.
Figure 5. CDF of the exponential distribution E(τ).
3.2 Quantiles
The notion of a quantile is familiar to almost everyone, even if you have not
learned it formally in a class. You know what “top 1%” means, right?
The formal definition is easy once we have the notion of a cdf at our disposal:
Definition 3.2.1. For α ∈ (0, 1), we define the α-quantile of the distribution
of the random variable Y as the number qY(α) ∈ R with the property
that

FY(qY(α)) = α, i.e., P[Y ≤ qY(α)] = α.
Caveat: The way we defined it above, the quantile qY(α) need
not exist for all α. This can be remedied by adopting a more careful
definition, but, since we will not have to deal with this problem in
these notes - and whenever we need quantiles, they will happily exist
- we simply ignore it. If you want to think about this a bit more, try to
figure out which quantiles of the Bernoulli distribution actually exist,
i.e., for which α we can find a number q such that P[Y ≤ q] = α, when
Y is Bernoulli. Is such a q uniquely determined?
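When the quantile does exist, one can find it numerically by inverting the cdf; the sketch below does this by bisection for a continuous increasing cdf and checks it on the exponential, where solving 1 − e^{−q/τ} = α gives qY(α) = −τ log(1 − α) in closed form (the bracket [0, 100] and tolerance are arbitrary choices of ours):

```python
from math import exp, log

def quantile(F, alpha, lo, hi, tol=1e-10):
    """Solve F(q) = alpha by bisection; assumes F is continuous and increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

tau = 3.0
F = lambda y: 1 - exp(-y / tau)        # cdf of E(tau)
q = quantile(F, 0.5, 0.0, 100.0)
print(q, -tau * log(1 - 0.5))          # bisection vs. closed form: the median
```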
Example 3.2.2. Normal quantiles. In practice, one finds quantiles by
inverting the cdf; graphically, this amounts to finding α on the vertical
axis, and then finding a value q on the horizontal axis such that
FY(q) = α. For example, Figure 4 in Example 3.1.2, part 4., above,
reveals that, for Y ∼ N(µ, σ), we have (approximately)

qY(0.16) = µ − σ, qY(0.5) = µ, qY(0.84) = µ + σ and qY(0.98) = µ + 2σ.

This is very much related to the well-known 68-95-99.7 rule.
3.3 Survival and hazard functions
Survival and hazard functions are especially important for an area of statis-
tics called the survival analysis, but are also a part of the vocabulary of gen-
eral statistics.
Definition 3.3.1. Let Y be a random variable with cdf FY.
1. The survival function SY(y) of Y is defined by

SY(y) = 1 − FY(y) for y ∈ R.

2. If Y is continuous, the hazard function hY(y) is given by

hY(y) = fY(y)/SY(y) for y with FY(y) < 1.
These quantities have natural interpretations when Y is thought of as a
lifetime (of a particle, bulb, bacterium, individual, etc.). Fixing, for conve-
nience, the interpretation that Y is the age at death of an individual, we have
1. SY(y) is the probability that the individual will survive at least y years.
2. hY(y)∆y is the (conditional) probability that the individual will die some
time in the (small) interval [y, y + ∆y], given that it has survived until y.
Example 3.3.2. Let Y be an exponential random variable with parameter
τ. Then

SY(y) = e^{−y/τ} and hY(y) = 1/τ for y ≥ 0.

In words, exponentially-distributed lifetimes have constant hazard
functions - “the probability of dying in the next ∆y is constant and
does not depend on the age y.” For comparison, Figure 6 below features
some real data about humans where the hazard rate is far from
constant.
Figure 6. The survival (left) and the hazard (right) functions of the empirical
distribution of the ages of death of all female individuals born in the US in
1917.
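A quick numeric check of the constant-hazard claim, using the closed-form pdf and survival function of E(τ) at a few arbitrarily chosen ages:

```python
from math import exp

tau = 4.0
pdf = lambda y: exp(-y / tau) / tau        # f_Y of E(tau)
survival = lambda y: exp(-y / tau)         # S_Y = 1 - F_Y
hazard = lambda y: pdf(y) / survival(y)    # h_Y = f_Y / S_Y

# The hazard is 1/tau = 0.25 regardless of the age y: the memoryless property.
print([round(hazard(y), 6) for y in (0.0, 1.0, 10.0, 50.0)])
```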
3.4 Problems
Problem 3.4.1. Two (unbiased, independent) coins are tossed, and the total
number of heads is denoted by Y. Write an expression for the CDF of Y and
sketch its graph.
Problem 3.4.2. Which of the following pairs of functions could be the pdf and
the cdf (respectively) of some probability distribution?

(a) f(x) = x², F(x) = (1/3) x³.
(b) f(x) = cos(x), F(x) = sin(x).
(c) f(x) = 2e^{−2x} 1_{x>0}, F(x) = (1 − e^{−2x}) 1_{x>0}.
(d) f(x) = (1/√(2π)) e^{−x²/2}, F(x) = 1 − e^{−x²}.
(e) f(x) = 1_{x>0}, F(x) = x 1_{x>0}.
Problem 3.4.3. Let Y be a random variable with CDF FY, and let qY : (0, 1)→
R be its quantile function (we assume it exists for each α ∈ (0, 1)). What is
the relationship between the graphs of FY and qY, i.e., how do you get one
from the other?
Problem 3.4.4. Let Y be a continuous random variable with the density fY
given by

fY(y) = c y²(1 − y) 1_{[0,1]}(y),

for an appropriate constant c.

1. Sketch the graph of fY and find the value of the constant c.
2. Compute the cumulative distribution function (cdf) FY and the survival
function SY of Y.
3. What is the domain of the hazard function? Compute the hazard function
hY itself.
4. Find the mode of Y.
5. Compute the 5/16-th quantile of Y. (Note: Guess and verify.)
Problem 3.4.5. Let Y be a random variable with the pdf

fY(y) = 2y 1_{0≤y≤1}.

Compute the hazard function hY of Y.
Problem 3.4.6. Let Y be a uniform random variable on the interval [0, 100].
The hazard function hY of the distribution of Y is given by

(a) (1/y) 1_{y>0} for y ∈ (−∞, 100)
(b) (1/(100 − y)) 1_{y>0} for y ∈ (−∞, 100)
(c) 1_{y<0} + ((100 − y)/100) 1_{0≤y≤100} for y ∈ (−∞, 100]
(d) (100 − y) 1_{y∈[0,100)} for y ∈ [0, ∞)
(e) none of the above
Problem 3.4.7. The expected lifetime of a bulb is h (in hours). Assuming that
the bulb lifetimes are exponentially distributed, compute
1. the probability that the bulb is still functional at time h
2. the half-life of the bulb, i.e., a number t∗ such that the probability that the
bulb is still functional after t∗ hours is exactly 1/2.
Problem 3.4.8. Compute the α-quantile qY(α) for α = 0.75 where Y is the
uniform distribution U(4, 8) on [4, 8].
Course: Mathematical Statistics
Term: Fall 2017
Instructor: Gordan Žitković
Lecture 4
Functions of random variables
Let Y be a random variable, discrete or continuous, and let g be a function
from R to R, which we think of as a transformation. For example, Y
could be the height of a randomly chosen person in a given population in
inches, and g could be the function which transforms inches to centimeters, i.e.,
g(y) = 2.54 × y. Then W = g(Y) is also a random variable, but its distribution
(pdf), mean, variance, etc. will differ from those of Y. Transformations of
random variables play a central role in statistics, and we will learn how to
work with them in this section.
4.1 Computing expectations
Expectations of functions of random variables are easy to compute, thanks to
the following result, sometimes known as the fundamental formula.
Theorem 4.1.1. Suppose that Y is a random variable, g is a transformation,
i.e., a real function, and W = g(Y). Then
1. if Y is discrete, with pmf pY, we have

E[W] = ∑_{y ∈ SY} g(y) pY(y),

2. if Y is continuous, with pdf fY, we have

E[W] = ∫_{−∞}^{∞} g(y) fY(y) dy.
We have already used this formula, without knowing it, when we wrote
down a formula for the variance

Var[Y] = E[(Y − µY)²] = ∫_{−∞}^{∞} (y − µY)² fY(y) dy.

Indeed, we applied the transformation g(y) = (y − µY)² to Y and then computed
the expectation of the new random variable W = g(Y) = (Y − µY)².
Example 4.1.2. The stopping distance (the distance traveled by the car
from the moment the brake is applied to the moment it stops) in feet
is a quadratic function of the car’s speed (in mph), i.e.,

g(y) = cy²,

where c is a constant which depends on the physical characteristics
of the car, its brakes, the road surface, etc. For the purposes of this
example, let’s take a realistic value c = 0.07 (in appropriate units).
In a certain traffic study, the distribution of cars’ speeds at the onset
of braking is empirically determined to be uniform on
the interval [60, 90], measured in miles per hour. What is the expected
value of the stopping distance?

The stopping distance W is given by W = g(Y), where g(y) = 0.07 y²,
and so, according to our formula, we have

E[W] = ∫_{−∞}^{∞} g(y) fY(y) dy = ∫_{−∞}^{∞} 0.07 y² (1/(90−60)) 1_{[60,90]}(y) dy
     = (0.07/30) ∫_{60}^{90} y² dy = 399.

If we compute the expected speed, we get

E[Y] = (1/30) ∫_{60}^{90} y dy = 75,

and if we compute the stopping distance of a car traveling at 75 mph,
we get

g(75) = 393.75.
It follows that the average (expected) stopping distance is not the same
as the stopping distance corresponding to the average speed. Why is
that?
Caveat: What we observed at the end of Example 4.1.2 is so important
that it should be repeated:

In general, E[g(Y)] ≠ g(E[Y])!

In fact, the only time we can guarantee equality for every Y is when g is
an affine function, i.e., when g(y) = αy + β for some constants α and
β.
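A Monte Carlo version of Example 4.1.2 makes the inequality concrete (seed and sample size are arbitrary choices of ours):

```python
import random

random.seed(0)
g = lambda y: 0.07 * y ** 2                      # stopping distance as a function of speed

speeds = [random.uniform(60, 90) for _ in range(200_000)]
e_g = sum(g(y) for y in speeds) / len(speeds)    # estimates E[g(Y)], about 399
g_e = g(sum(speeds) / len(speeds))               # g(E[Y]), about 393.75
print(e_g, g_e)   # the first is larger; g is convex here, so this is Jensen's inequality
```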
4.2 The cdf-method
The fundamental formula of Theorem 4.1.1 is useful for computing expec-
tations, but it has nothing to say about the distribution of W = g(Y). For
example, we may wonder whether the distribution of stopping distances is
uniform on some interval, just like the distribution of velocities at the
onset of braking in Example 4.1.2. There are several methods for answering
this question, and we start with the one which almost always works - the
cdf-method.
Suppose that we know the cdf FY of Y and that we are interested in
the distribution of W = g(Y). Using the definition of the cdf FW of W, we
can write

FW(w) = P[W ≤ w] = P[g(Y) ≤ w].

The probability on the right is not quite the cdf of Y, but if it can be rewritten
in terms of probabilities involving Y, or, better, the cdf of Y, we are in business:
1. If g is strictly increasing, then it admits an inverse function g^{−1} and we
can write

FW(w) = P[g(Y) ≤ w] = P[Y ≤ g^{−1}(w)] = FY(g^{−1}(w)),

and we have an expression of FW in terms of FY. Once FW is known, it can
be used further to compute the pdf (in the continuous case) or the pmf (in
the discrete case), or . . .

2. A very similar computation can be made if g is strictly decreasing. The
only difference is that now P[g(Y) ≤ w] = P[Y ≥ g^{−1}(w)]. In the continuous
case we have P[Y ≥ y] = 1 − FY(y) (why only in the continuous case?),
so

FW(w) = P[g(Y) ≤ w] = P[Y ≥ g^{−1}(w)] = 1 − FY(g^{−1}(w)).
3. The function g is neither increasing nor decreasing, but the inequality
g(y) ≤ w can be “solved” in simple terms. To understand what is meant
by this, have a look at examples below.
Example 4.2.1.
1. Linear transformations. Let Y be any random variable, and let
W = g(Y) where g(y) = a + by is a linear transformation with
b > 0. Since b > 0, the function g is strictly increasing. Therefore,

FW(w) = P[g(Y) ≤ w] = P[a + bY ≤ w] = P[Y ≤ (w − a)/b] = FY((w − a)/b).
This expression is especially nice if Y is a continuous random variable
because then so is W, and we have

fW(w) = (d/dw) FW(w) = (d/dw) FY((w − a)/b) = fY((w − a)/b) · (1/b),

where the last equality follows from the chain rule. Here are some
important special cases:
important special cases:
a) Linear transformations of a normal. If Y ∼ N(0, 1) is a unit
normal, then fY(y) = 1√2π e
− 12 y
2
, and so,
fW(w) = 1√2πb2 e
− (w−a)
2
2b2 .
We recognize this as the pdf of the normal distribution, but this
time with paramters a and b. This inverse of this computation
lies behind the familiar z-score transformation: if Y ∼ N(µ, σ),
then Z = Y−µσ ∼ N(0, 1).
b) Linear transformations of a uniform. If Y ∼ U(0, 1) is a random
number, g(y) = a + by and W = g(Y), then FW(w) =
FY((w − a)/b) and, so,

fW(w) = (1/b) fY((w − a)/b) = (1/b) 1_{[0,1]}((w − a)/b).

When we talked about indicators, we mentioned that a different
notation for the same function can simplify computations
in some cases. Here is the case in point. If we replace
1_{[0,1]}((w − a)/b) by 1_{0≤(w−a)/b≤1}, we can rearrange the expression
inside {} and get

fW(w) = (1/b) 1_{a≤w≤a+b} = (1/b) 1_{[a,a+b]}(w),

and we readily recognize fW as the pdf of another uniform distribution,
but this time with parameters a and b. If we wanted
to transform U(0, 1) into U(l, r), we would simply need to pick
a = l and b = r − l.
It is not a coincidence that linear transformations of normal and
uniform random variables result in random variables in the same
parametric families (albeit with different parameters). Parametric
families are often (but not always) chosen to have this exact prop-
erty.
2. Inverse exponential distribution. Let Y ∼ E(τ) be an exponentially
distributed random variable, and let g(y) = 1/y. The function
g is strictly decreasing on (0, ∞) and so, for w > 0, we have

FW(w) = P[1/Y ≤ w] = P[Y ≥ 1/w] = 1 − FY(1/w) = e^{−1/(τw)}.

This computation will not work for w ≤ 0, but we know that W
always takes positive values, as it is the reciprocal of Y, which is
always positive. Therefore, FW(w) = 0 for w ≤ 0. We can differentiate
the expression for FW to obtain the pdf fW:

fW(w) = e^{−1/(τw)} (1/(τw²)) 1_{(0,∞)}(w).

This pdf cannot be recognized as the pdf of any of our named
distributions, but it is sometimes called the inverse exponential
distribution, and it is used in wireless communications.
Figure 1. The pdf of the inverse exponential distribution.
3. χ²-distribution. Let Y ∼ N(0, 1) be the unit normal random variable,
and let g(y) = y². This is an example of a transformation
which is neither increasing nor decreasing. We can still try to make
sense of the expression g(y) ≤ w, i.e., y² ≤ w, and use the cdf-method:

FW(w) = P[Y² ≤ w].

For w < 0 it is impossible that Y² ≤ w, so we immediately conclude
that FW(w) = 0 for w < 0. When w ≥ 0, we have

P[Y² ≤ w] = P[Y ∈ [−√w, √w]] = P[Y ≤ √w] − P[Y < −√w].

Since P[Y = −√w] = 0 (as Y is continuous), we get

FW(w) =
  0,                  w < 0,
  FY(√w) − FY(−√w),   w ≥ 0.

We differentiate both sides in w and use the fact that FY is a normal
cdf, so that (d/dy) FY(y) = (1/√(2π)) e^{−y²/2}, to obtain

fW(w) = (1/√(2πw)) e^{−w/2} 1_{(0,∞)}(w). (4.2.1)

A random variable with the pdf fW(w) of (4.2.1) above is said to
have a χ²-distribution (pronounced [kai-skwer]). It is very important
in statistics, and we will spend a lot more space on it later.
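A quick simulation check of (4.2.1): squaring standard normal draws should produce a sample whose mean is E[Y²] = 1 and whose mass below 1 is P[|Y| ≤ 1] ≈ 0.6827. A sketch (seed and sample size are arbitrary choices of ours):

```python
import random

random.seed(2)
# Square 200,000 standard normal draws; W = Y^2 then follows the pdf (4.2.1).
w = [random.gauss(0, 1) ** 2 for _ in range(200_000)]
print(sum(w) / len(w))                        # sample mean, close to E[Y^2] = 1
print(sum(1 for x in w if x <= 1) / len(w))   # close to P[|Y| <= 1], about 0.6827
```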
Figure 2. The pdf of the χ² distribution.
4.3 The h-method
The application of the cdf-method can sometimes be streamlined, leading to
the so-called h-method or the method of transformations. It works when
Y is a continuous random variable and when the transformation function
g admits an inverse function h. Supposing that is the case, remember that,
when g is increasing, we have
FW(w) = FY(g−1(w)) = FY(h(w)).
If we assume that everything is differentiable and that Y and W admit pdfs
fY and fW , we can take a derivative in w to obtain
fW(w) = ddw FW(w) =
d
dw
(
FY(h(w))
)
= ddw FY(h(w))
d
dw h(w) = fY(h(w))h
′(w).
Another way of deriving the same formula is to interpret the pdf fY(y) as the
quantity such that

P[Y ∈ [y, y + ∆y]] ≈ fY(y) ∆y,

when ∆y > 0 is “small”. Applying the same to W = g(Y) in two ways yields

P[W ∈ [w, w + ∆w]] ≈ fW(w) ∆w,

but also, assuming that g is increasing,

P[W ∈ [w, w + ∆w]] = P[g(Y) ∈ [w, w + ∆w]] = P[Y ∈ [h(w), h(w + ∆w)]]
≈ P[Y ∈ [h(w), h(w) + ∆w h′(w)]] ≈ fY(h(w)) h′(w) ∆w.

The approximate equality ≈ between the third and the fourth probability
above uses the fact that h(w + ∆w) ≈ h(w) + ∆w h′(w), which is nothing but
a consequence of the definition of the derivative

h′(w) = lim_{∆w→0} (h(w + ∆w) − h(w))/∆w.

It could also be seen as the first-order Taylor formula for h around w.
The derivation above can be made fully rigorous, leading to the following
theorem (why does the absolute value |h′(w)| appear there?). The word
interval in it means (a, b), where either a or b could be infinite (so that, for
example, R itself is also an interval).

Theorem 4.3.1 (The h-method). Suppose that the function g is

1. defined on an interval I ⊆ R,
2. its image is an interval J ⊆ R, and
3. g has a continuously-differentiable inverse function h : J → I.

Suppose that Y is a continuous random variable with pdf fY such that
fY(y) = 0 for y ∉ I. Then W = g(Y) is also a continuous random variable
and its pdf is given by the following formula:

fW(w) = fY(h(w)) |h′(w)| 1_{w∈J}.

Note: in almost all applications I = {y ∈ R : fY(y) > 0}, for a properly
defined version of fY, and J = g(I).
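Theorem 4.3.1 is mechanical enough to implement generically: given fY and the inverse h, approximate h′ by a difference quotient and assemble fW. A sketch (the helper name and the step dw are our own choices), checked on the Cauchy/arctan pair, for which fW should be the constant 1/π on (−π/2, π/2):

```python
from math import tan, pi

def h_method_pdf(f_Y, h, J, dw=1e-6):
    """Assemble f_W(w) = f_Y(h(w)) |h'(w)| 1_{w in J}, with h' by centered difference."""
    lo, hi = J
    def f_W(w):
        if not (lo < w < hi):
            return 0.0
        h_prime = (h(w + dw) - h(w - dw)) / (2 * dw)
        return f_Y(h(w)) * abs(h_prime)
    return f_W

# W = arctan(Y) for a Cauchy Y: the output pdf should be the constant 1/pi.
f_Y = lambda y: 1.0 / (pi * (1.0 + y * y))
f_W = h_method_pdf(f_Y, tan, (-pi / 2, pi / 2))
print(f_W(0.3), 1 / pi)
```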
Example 4.3.2.

1. Let Y be a continuous random variable with pdf

fY(y) = 1/(π(1 + y²)).

The distribution of Y is called the Cauchy distribution. We define
W = g(Y) where g(y) = arctan(y). The function g is defined on
the interval I = R and its image is the interval J = (−π/2, π/2).
Moreover, its inverse is the function h : J → I given by

h(w) = tan(w).

This function admits a derivative h′(w) = 1/cos²(w).
Theorem 4.3.1 can be applied and it states that

fW(w) = (1/(π(1 + tan²(w)))) · (1/cos²(w)) 1_{w∈(−π/2,π/2)} = (1/π) 1_{w∈(−π/2,π/2)}.
This allows us to identify the distribution of W as uniform on the
interval (−π/2, π/2), i.e., W ∼ U(−π/2, π/2).
2. Let Y ∼ E(τ), and let g(y) = √y. The function g is defined on
I = (0, ∞) and maps it into J = (0, ∞), and its inverse is h(w) =
w² : J → I. The pdf fY of Y is given by

fY(y) = (1/τ) exp(−y/τ) 1_{y>0}

and, so, by Theorem 4.3.1, W = √Y = g(Y) is a continuous random
variable with density

fW(w) = (2/τ) w exp(−w²/τ) 1_{w>0},

where we removed the absolute value around h′(w) = 2w because
of the indicator 1_{w>0}. This is known as the Weibull distribution.
3. Let Y and W be as in Example 4.1.2, i.e., Y ∼ U(60, 90) and
g(y) = cy², where c = 0.07. Then g : R → R is neither increasing
nor decreasing, and does not admit an inverse. However, it is
increasing on the set (60, 90) where the random variable Y takes
its values, i.e., where fY(y) > 0 (we can exclude the end-points 60
and 90 because they happen with probability 0). If we restrict g
to the interval I = (60, 90), it admits an inverse h : J → I, where
J = (g(60), g(90)) = (252, 567), and

h(w) = √(w/c), w ∈ J.

The pdf fW(w) of W is then given by

fW(w) = fY(h(w)) h′(w) = (1/(90−60)) · (1/(2√c)) w^{−1/2} 1_{w∈J}.
Here are the graphs of the pdfs of Y and W = g(Y):
Figure 3. The pdfs of Y and W = g(Y) (with both axes scaled differently).
4.4 Problems
Problem 4.4.1. Let Y be an exponential random variable with parameter τ >
0. Compute the cdf FW and the pdf fW of the random variable W = Y3.
Problem 4.4.2. Let Y be a uniformly distributed randomvariable on the in-
terval [0, 1], and let W = exp(Y). Compute E[W], the CDF FW of W and the
pdf fW of W.
Problem 4.4.3. A scientist measures the side of cubical box and the result is
a random variable (due to the measurement error) Y, which we assume is
normally distributed with mean µ = 1 and σ = 0.1 (both in feet). In other
words, the true measurement of the side of the box is 1 foot, but the scientist
does not know that; she only knows the value of Y.
1. What is the distribution of the volume W of the box?
2. What is the probability that the scientist’s measurement overestimates the
volume of the box by more than 10%? (Hint: Review z-scores and com-
putations of probabilities related to the normal distribution. You will not
need to integrate anything here, but may need to use software (if you
know how) or look into a normal table. We will talk more about how to
do that later. For now, use any method you want.)
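In the spirit of the hint, probabilities for a normal random variable can be computed in software. One way, sketched below using only the Python standard library (the helper name `normal_cdf` and the specific numbers are illustrative choices, not the problem's answer), is to express the normal cdf through the error function:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Phi of the z-score, expressed via the error function from the standard library
    z = (x - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

# illustrative only: for Y ~ N(1, 0.1), the z-score of 1.2 is (1.2 - 1)/0.1 = 2,
# so P[Y > 1.2] = 1 - Phi(2)
p = 1.0 - normal_cdf(1.2, mu=1.0, sigma=0.1)
```

The same two lines replace a normal table lookup: standardize to a z-score, then evaluate the cdf.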
Problem 4.4.4. Let Y be a continuous random variable with the density func-
tion fY given by
fY(y) = { (2 + 3y)/16,  y ∈ (1, 3)
        { 0,            otherwise.
The pdf fW(w) of W = Y² is
(a) fW(w) = (2 + 3√w)/(16√w) 1{w∈(1,3)}
(b) fW(w) = (2 + 3√w)/(32√w) 1{w∈(1,9)}
(c) fW(w) = (2 + 3w)/(16√w) 1{w∈(1,9)}
(d) fW(w) = (2 + 3w)/(32√w) 1{w∈(1,3)}
(e) none of the above
Problem 4.4.5. The Maxwell-Boltzmann distribution describes the distribu-
tion of speeds of particles in an (idealized) gas, and its pdf is given by
fY(y) = (4β³/√π) y² e^(−β²y²) 1(0,∞)(y),
where β > 0 is a constant that depends on the properties of the particular
gas studied.
The kinetic energy of a gas particle with speed y is (1/2)my². What is the dis-
tribution (pdf) of the kinetic energy if the particle speed follows the Maxwell-
Boltzmann distribution?
Problem 4.4.6. Let Y be uniformly distributed on (0, 1). The distribution of
the random variable −(1/2) log(Y) is
(a) exponential E(τ) with τ = 2
(b) exponential E(τ) with τ = 1/2
(c) uniform U(0, 1/2) on (0, 1/2)
(d) uniform U(0, 2) on (0, 2)
(e) none of the above
Problem 4.4.7. Let Y be an exponential random variable with parameter τ >
0. Then E[e^(−Y)] =
(a) τ   (b) 1/(1 + τ)   (c) 1/(1 − τ)   (d) τ/(1 − τ)   (e) none of the above
Problem 4.4.8. Let Y be a random variable with the pdf fY(y) = 1/(π(1 + y²)). The
pdf of W = 1/Y² is
(a) 2/(π√w(1 + w)) 1{w>0}
(b) 1/(π√w(1 + w)) 1{w>0}
(c) w²/(π(1 + w^4))
(d) 2w²/(π(1 + w^4))
(e) none of the above
(Hint: You do not need to actually compute the cdf of Y.)
Problem 4.4.9. The pdf of W = 1/Y², where Y ∼ E(τ), is
(a) (2/τ) y^(−3/2) e^(−y/τ) 1{y>0}
(b) (1/(2τ)) (−y^(−3/2)) e^(−√y/τ) 1{y>0}
(c) (1/τ) e^(−1/(y²τ)) 1{y>0}
(d) (1/(2τy^(3/2))) e^(−1/(τ√y)) 1{y>0}
(e) none of the above
Problem 4.4.10. Let Y be a uniform random variable on [−1, 1], and let W =
Y². The pdf of W is
(a) 1/(4√|w|) 1{−1<w<1}
(b) 1/√w 1{0<w<1}
(c) 1/(2√w) 1{0<w<1}
(d) 2w 1{0<w<1}
(e) none of the above
Problem 4.4.11. Let Y be a uniform random variable on [0, 1], and let W = Y².
The pdf of W is
(a) 1/(2√|w|) 1{−1<w<1}
(b) 1/√w 1{0<w<1}
(c) 1/(2√w) 1{0<w<1}
(d) 2w 1{0<w<1}
(e) none of the above
Problem 4.4.12. Fuel efficiency of a sample of cars has been measured by a
group of American engineers, and it turns out that the distribution of gas-
mileage is uniformly distributed on the interval [10 mpg, 30 mpg], where
mpg stands for miles per gallon. In particular, the average gas-mileage is 20
mpg.
A group of European engineers decided to redo the statistics, but, being
European, they used European units. Instead of miles per gallon, they used
liters per 100 km. (Note that the ratio is reversed in Europe - higher numbers
in liters per km correspond to worse gas mileage). If one mile is 1.609 km and
one gallon is 3.785 liters, the average gas-mileage of 20 mpg, obtained by the
Americans, translates into 11.762 liters per 100 km. However, the average gas
mileage obtained by the Europeans was different from 11.762, even though
they used the same sample, and made no computational errors. How can
that be? What did they get? Is the distribution of gas-mileage still uniform,
when expressed in European units? If not, what is its pdf?
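The heart of the matter is that averaging does not commute with the nonlinear map y ↦ 1/y. A quick numerical experiment (a sketch; the sample size and seed are arbitrary, and this only illustrates the phenomenon rather than solving the problem) shows the European average landing strictly above 11.762:

```python
import random

MPG_TO_L_PER_100KM = 100 * 3.785 / 1.609  # liters per gallon over km per mile

def to_l_per_100km(mpg):
    return MPG_TO_L_PER_100KM / mpg

rng = random.Random(0)
n = 200_000
# average of the converted sample vs. conversion of the American average
euro_avg = sum(to_l_per_100km(rng.uniform(10.0, 30.0)) for _ in range(n)) / n
converted_avg = to_l_per_100km(20.0)
```

The simulated `euro_avg` exceeds `converted_avg`, confirming that the Europeans' sample mean cannot equal the converted American mean.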
Problem 4.4.13. Let Y be a uniformly distributed random variable on the
interval (0, 1). Find a function g such that the random variable W = g(Y)
has the exponential distribution with parameter τ = 1. (Hint: Use the h-
method.)
Lecture 5: Joint Distributions 1 of 21
Course: Mathematical Statistics
Term: Fall 2017
Instructor: Gordan Žitković
Lecture 5
Probability review - joint distributions
5.1 Random vectors
So far we talked about the distribution of a single random variable Y. In
the discrete case we used the notion of a pmf (or a probability table) and in
the continuous case the notion of the pdf, to describe that distribution and
to compute various related quantities (probabilities, expectations, variances,
moments).
Now we turn to distributions of several random variables put together.
Just as several real numbers listed in a particular order make a vector, n ran-
dom variables defined in the same setting (on the same probability space) make a random
vector. We typically denote the components of random vectors by subscripts,
so that (Y1, Y2, Y3) make a typical 3-dimensional random vector. We can also
think of a random vector as a random point in an n-dimensional space. This
way, a random pair (Y1, Y2) can be thought of as a random point in the plane,
with Y1 and Y2 interpreted as its x- and y-coordinates.
There is a significant (and somewhat unexpected) difference between the
distribution of a random vector and the pair of distributions of its compo-
nents, taken separately. This is not the case with non-random quantities. A
point in the plane is uniquely determined by its (two) coordinates, but the
distribution of a random point in the plane is not determined by the dis-
tributions of its projections onto the coordinate axes. The situation can be
illustrated by the following example:
Example 5.1.1.
1. Let us toss two unbiased coins, and let us call the outcomes Y1 and
Y2. Assuming that the tosses are unrelated, the probabilities of the
following four outcomes
{Y1 = H, Y2 = H}, {Y1 = H, Y2 = T},
{Y1 = T, Y2 = H}, {Y1 = T, Y2 = T}
are the same, namely 1/4. In particular, the probabilities that the
first coin lands on H or T are the same, namely 1/2. The distribution
tables for both Y1 and Y2 are the same and look like this
H T
1/2 1/2 .
Let us now repeat the same experiment, but with two coins at-
tached to each other (say, welded together) as in the picture:
Figure 1. Two quarters welded together, so that when one falls on heads
the other must fall on tails, and vice versa.
We can still toss them and call the outcome of the first one Y1 and
the outcome of the second one Y2. Since they are welded, it can
never happen that Y1 = H and Y2 = H at the same time, or that
Y1 = T and Y2 = T at the same time, either. Therefore of the four
outcomes above only two “survive”
{Y1 = H, Y2 = T}, {Y1 = T, Y2 = H},
and each happens with the probability 1/2. The distribution of Y1
considered separately from Y2 is the same as in the non-welded
case, namely
H T
1/2 1/2 ,
and the same goes for Y2. This is one of the simplest examples, but
it already strikes the heart of the matter: randomness in one part
of the system may depend on the randomness in the other part.
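The welded-coin example can be checked mechanically: represent each joint pmf as a dictionary and compute the marginal of Y1 by summing over the values of Y2. A sketch (the dictionary representation is an implementation choice, not notation from the text):

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# joint pmfs, keyed by the pair (outcome of Y1, outcome of Y2)
independent = {(a, b): quarter for a in "HT" for b in "HT"}
welded = {("H", "T"): half, ("T", "H"): half}  # HH and TT are impossible

def marginal_y1(joint):
    # sum the joint pmf over all values of Y2
    out = {}
    for (a, _), p in joint.items():
        out[a] = out.get(a, Fraction(0)) + p
    return out
```

The two joint pmfs are different, yet `marginal_y1` returns the same 1/2-1/2 table for both, which is exactly the point of the example.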
2. Here is an artistic (geometric) view of an analogous phenomenon.
The projections of a 3D object on two orthogonal planes do not
determine the object entirely. Sculptor Markus Raetz used that fact
to create the sculpture entitled “Yes/No”:
Figure 2. Yes/No - A “typographical” sculpture by Markus Raetz
Not to be outdone, I decided to create a different typographical sculpture with the same projections (I could not find the exact same
font Markus is using so you need to pretend that my projections
match his completely). It is not hard to see that my sculpture differs
significantly from Markus's, but they both have (almost) the same
projections, namely the words “Yes” and “No”.
Figure 3. My own attempt at a “typographical” sculpture, using SketchUp.
You should pretend that Markus's font and mine are the same.
5.2 Joint distributions - the discrete case
So, in order to describe the distribution of the random vector (Y1, . . . , Yn), we
need more than just the individual distributions of its components Y1, . . . , Yn. In
the discrete case, the events whose probabilities finally made their way into
the distribution table were of the form {Y = i}, for all i in the support SY of
Y. For several random variables, we need to know their joint distribution,
i.e., the probabilities of all combinations
{Y1 = i1, Y2 = i2, . . . , Yn = in},
over the set of all combinations (i1, i2, . . . , in) of possible values our random
variables can take. These numbers cannot comfortably fit into a table, except
in the case n = 2, where we talk about the joint distribution table which
looks like this
       j1                    j2                    . . .
i1     P[Y1 = i1, Y2 = j1]   P[Y1 = i1, Y2 = j2]   . . .
i2     P[Y1 = i2, Y2 = j1]   P[Y1 = i2, Y2 = j2]   . . .
...    ...                   ...                   . . .
Example 5.2.1. Two dice are thrown (independently of each other) and
their outcomes are denoted by Y1 and Y2. Since P[Y1 = i, Y2 = j] = 1/36
for any i, j ∈ {1, 2, . . . , 6}, the joint distribution table of (Y1, Y2) looks
like this
1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
The situation is more interesting if Y1 still denotes the outcome of the
first die, but Z now stands for the sum of the numbers on the two dice. It
is not hard to see that the joint distribution table of (Y1, Z) now looks
like this:
2 3 4 5 6 7 8 9 10 11 12
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36
Going from the joint distribution of the random vector (Y1, Y2, . . . , Yn) to
individual (called marginal) distributions of Y1, Y2, . . . , Yn is easy. To com-
pute P[Y1 = i] we need to sum P[Y1 = i, Y2 = i2, . . . , Yn = in], over all combi-
nations (i2, . . . , in) where i2, . . . , in range through all possible values Y2, . . . , Yn
can take.
Example 5.2.2. Continuing the previous example, let us compute the
marginal distribution of Y1, Y2 and Z. For Y1 we sum the probabilities
in each row of the joint distribution table of (Y1, Y2) to obtain
1 2 3 4 5 6
1/6 1/6 1/6 1/6 1/6 1/6
The same table is obtained for the marginal distribution of Y2 (even
though we sum over columns this time). For the marginal distribu-
tion of Z, we use the joint distribution table for (Y1, Z) and sum over
columns:
2 3 4 5 6 7 8 9 10 11 12
1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
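This row/column summation is easy to mechanize. A sketch (the dictionary representation is an implementation choice) that builds the joint pmf of (Y1, Z) and recovers the marginal of Z:

```python
from collections import defaultdict
from fractions import Fraction

# joint pmf of (Y1, Z), where Y1 is the first die and Z the sum of both dice
joint = defaultdict(Fraction)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        joint[(d1, d1 + d2)] += Fraction(1, 36)

# marginal of Z: sum the joint pmf over all values of Y1 (i.e., down each column)
marginal_z = defaultdict(Fraction)
for (_, z), p in joint.items():
    marginal_z[z] += p
```

The resulting `marginal_z` reproduces the triangular table above: 1/36 at z = 2 and z = 12, rising to 6/36 at z = 7.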
Once the distribution table of a random vector (Y1, . . . , Yn) is given, we
can compute (in theory) the probability of any event concerning the random
variables Y1, . . . , Yn, by simply summing over the set of appropriate entries
in the joint distribution.
Example 5.2.3. We continue with random variables Y1, Y2 and Z de-
fined above and ask the following question: what is the probability
that two dice have the same outcome? In other words, we are inter-
ested in P[Y1 = Y2]. The entries in the table corresponding to this
event are boxed:
1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
,
so that
P[Y1 = Y2] = 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 1/6.
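The same bookkeeping handles this event: sum the joint pmf over the diagonal entries. A short sketch:

```python
from fractions import Fraction

# joint pmf of two independent fair dice
joint = {(i, j): Fraction(1, 36) for i in range(1, 7) for j in range(1, 7)}

# sum the joint pmf over the "boxed" diagonal entries {Y1 = Y2}
p_equal = sum(p for (i, j), p in joint.items() if i == j)
```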
5.3 Joint distributions - the continuous case
Just as in the univariate case (the case of a single random variable), the con-
tinuous analogue of the distribution table (or the pmf) is the pdf. Recall that
pdf fY(y) of a single random variable Y is the function with the property that
P[Y ∈ [a, b]] = ∫_a^b fY(y) dy.
In the multivariate case (the case of a random vector, i.e., several random
variables), the pdf of the random vector (Y1, . . . , Yn) becomes a function of
several variables fY1,...,Yn(y1, . . . , yn) and it is characterized by the property that
P[Y1 ∈ [a1, b1], Y2 ∈ [a2, b2], . . . , Yn ∈ [an, bn]] =
= ∫_{a1}^{b1} ∫_{a2}^{b2} · · · ∫_{an}^{bn} fY1,...,Yn(y1, . . . , yn) dyn dyn−1 . . . dy2 dy1.
This formula is better understood if interpreted geometrically. The left-hand
side is the probability that the random vector (Y1, . . . , Yn) (think of it as a
random point in Rn) lies in the region [a1, b1] × · · · × [an, bn], while the right-
hand side is the integral of fY1,...,Yn over the same region.
Example 5.3.1. A point is randomly and uniformly chosen inside a
square with side 1. That means that any two regions of equal area
inside the square have the same probability of containing the point.
We denote the two coordinates of this point by Y1 and Y2 (even though
X and Y would be more natural), and their joint pdf by fY1,Y2 . Since the
probabilities are computed by integrating fY1,Y2 over various regions in
the square, there is no reason for f to take different values on different
points inside the square; this makes fY1,Y2(y1, y2) = c for some constant
c > 0, for all points (y1, y2) in the square. Our random point never
falls outside the square, so the value of f outside the square should be
0. Pdfs (either in one or in several dimensions) integrate to 1, so we
conclude that f should be given by
fY1,Y2(y1, y2) = { 1, (y1, y2) ∈ [0, 1]²
                { 0, otherwise.
Once a pdf of a random vector (Y1, . . . , Yn) is given, we can compute all
kinds of probabilities with it. For any region A ⊂ Rn (not only for rectangles
of the form [a1, b1] × · · · × [an, bn]), we have
P[(Y1, . . . , Yn) ∈ A] = ∫∫ · · · ∫_A fY1,...,Yn(y1, . . . , yn) dyn . . . dy1.
As it is almost always the case in the multivariate setting, this is much better
understood through an example:
Example 5.3.2. Let (Y1, Y2) be the random uniform point in the square
[0, 1]² from the previous example. To compute the probability that the
distance from (Y1, Y2) to the origin (0, 0) is at most 1, we define
A = { (y1, y2) ∈ [0, 1]² : √(y1² + y2²) ≤ 1 },
Figure 4. The region A.
Therefore, since fY1,Y2(y1, y2) = 1, for all (y1, y2) ∈ A, we have
P[(Y1, Y2) is at most 1 unit away from (0,0)] = P[(Y1, Y2) ∈ A]
= ∫∫_A fY1,Y2(y1, y2) dy2 dy1 = ∫∫_A 1 dy1 dy2 = area(A) = π/4.
The calculations in the previous example sometimes fall under the head-
ing of geometric probability because the probability π/4 we obtained is simply
the ratio of the area of A and the area of [0, 1]2 (just like one computes a
uniform probability in a finite setting by dividing the number of “favorable”
cases by the total number). This works only if the underlying pdf is uniform.
In practice, pdfs are rarely uniform.
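The geometric-probability answer π/4 can be approximated by simulating uniform points in the square and counting how often they land in A; a sketch (sample size and seed are arbitrary choices):

```python
import math
import random

def estimate_prob_in_A(n=200_000, seed=123):
    # fraction of uniform points in [0,1]^2 landing within distance 1 of the origin
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return hits / n

estimate = estimate_prob_in_A()
```

The estimate is close to π/4 ≈ 0.7854; this is the classic Monte Carlo method for approximating π.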
Example 5.3.3. Let (Y1, Y2) be a random vector with the pdf
fY1,Y2(y1, y2) = { 6y1, 0 ≤ y1 ≤ y2 ≤ 1
                { 0,   otherwise,
or, in the indicator notation,
fY1,Y2(y1, y2) = 6y1 1{0≤y1≤y2≤1}.
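Even though this pdf is not uniform, we can still verify numerically that it integrates to 1 over the triangle {0 ≤ y1 ≤ y2 ≤ 1}; a sketch using a midpoint Riemann sum (the grid resolution is an arbitrary choice):

```python
def f(y1, y2):
    # the density 6*y1 on the triangle {0 <= y1 <= y2 <= 1}, and 0 elsewhere
    return 6.0 * y1 if 0.0 <= y1 <= y2 <= 1.0 else 0.0

# midpoint Riemann sum over the unit square; cells whose midpoints sit on the
# diagonal are counted fully, so the result overshoots 1 slightly (about 3/(2n))
n = 1000
h = 1.0 / n
total = sum(
    f((i + 0.5) * h, (j + 0.5) * h)
    for i in range(n)
    for j in range(n)
) * h * h
```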
Here is a sketch of what f looks like
Figure 5. The pdf of (Y1, Y2)
This still corresponds to a distribution of a random point in the unit
square, but this distribution is no longer