Classical Inference
Lecture notes
Rafael Bassi Stern
Last revision: September 14, 2017
Please send comments, typos and mistakes to rbstern@gmail.com
Acknowledgements: I am thankful for the comments from Gilson Shimizu . . .
Contents
1 Review
  1.1 (Basic) Set theory
    1.1.1 Set operations
  1.2 Functions
  1.3 Combinatorial analysis
    1.3.1 Counting
    1.3.2 Permutations
    1.3.3 Combinations
  1.4 Probability theory
  1.5 Random variables
  1.6 Random vectors
2 Statistical decision theory
  2.1 Statistical model
  2.2 Elements of statistical decision theory
  2.3 Admissibility
  2.4 Sufficiency
  2.5 Optimality criteria
  2.6 How to find a Bayes decision rule?
  2.7 How to find a minimax decision rule?
3 Review of statistical decision theory
1 Review
1.1 (Basic) Set theory
A set is a collection of objects. If a set has a finite number of objects, o1, o2, . . . , on, we denote it by {o1, o2, . . . , on}.
We denote the set of natural numbers by N, the set of integers by Z and the set of reals by R. In probability, sets
are fundamental for describing outcomes of an experiment.
Example 1.1 (Sets).
• The set of possible outcomes of a six-sided die: {1, 2, 3, 4, 5, 6}.
• The set of outcomes of a coin flip: {T,H}.
• The set of outcomes of two coin flips: {(T, T ), (T,H), (H,T ), (H,H)}.
• The set of all odd numbers: {2n+ 1 : n ∈ N} or {1, 3, 5, 7, . . .}.
• The set of non-negative real numbers: {x ∈ R : x ≥ 0}.
• A circle of radius 1: {(x, y) ∈ R2 : x2 + y2 ≤ 1}.
Definition 1.2 (∈ and /∈). We write o ∈ S if object o is an element of set S and o /∈ S, otherwise.
Example 1.3 (∈ and /∈).
• T ∈ {T,H}.
• 7 /∈ {1, 2, 3, 4, 5, 6}.
• 7 ∈ {2n+ 1 : n ∈ N}.
Definition 1.4 (empty set - ∅). ∅ is the only set with no elements. That is, for every object o, o /∈ ∅.
Definition 1.5 (disjoint sets).
• Two sets A and B are disjoint if, for every o ∈ A we have that o /∈ B and for every o ∈ B, o /∈ A.
• A sequence of sets (An)n∈N is disjoint if, for every i ≠ j, Ai is disjoint from Aj .
Example 1.6 (Disjoint sets).
• {1, 2} and {3, 4} are disjoint.
• {1, 2} and {2, 3} are not disjoint since 2 ∈ {1, 2} and 2 ∈ {2, 3}.
Definition 1.7 (⊂ and =). Let A and B be two sets. We say that:
• A ⊂ B if, for every o ∈ A, o ∈ B.
• A = B if A ⊂ B and B ⊂ A.
Example 1.8 (⊂ and =).
• {1, 2} ⊂ {1, 2, 3, 4}.
• {n ∈ Z : n ≥ 1} ⊂ N.
• {n ∈ Z : n ≥ 0} = N.
In order to describe uncertainty, probability requires one to define the set of all possible outcomes. We reserve
the symbol Ω for this set. Ω is often called the sample space in probability theory. For example, consider that
a six-sided die is flipped. In this case, one might choose Ω = {1, 2, 3, 4, 5, 6}, where each number denotes one of
the possible outcomes of the die.
In the context of probability theory, subsets of Ω are often called events. For example, consider again the
six-sided die. One might be interested in the event that the die lands on an even number. This event can be
described as A = {2, 4, 6} ⊂ Ω.
1.1.1 Set operations
Definition 1.9 (complement - c). Let A be a set. o is an element of Ac if and only if o /∈ A. That is, the
complement of A is formally defined as Ac = {o ∈ Ω : o /∈ A}.
Example 1.10 (c).
• Let Ω = {T,H}, {T}c = {H}.
• Let Ω = {1, 2, 3, 4, 5, 6}, {1, 2}c = {3, 4, 5, 6}.
• Let Ω = N, {n ∈ N : n > 0}c = {0}.
The event Ac is commonly interpreted as the negation of A. For example, if A = {T} means that the outcome
of a coin flip was tails, then Ac means that the outcome wasn’t tails. Moreover, if Ω = {H,T}, then {T}c = {H},
that is, if the outcome of the coin flip isn’t tails, then it is heads.
Definition 1.11 (union - ∪).
• Let A and B be two sets. o ∈ Ω is an element of the union between A and B, A ∪ B, if and only if either
o is an element of A or o is an element of B. That is, A ∪B = {o ∈ Ω : o ∈ A or o ∈ B}.
• Let (An)n∈N be a sequence of sets. o ∈ Ω is an element of the union of (An)n∈N, ∪n∈NAn, if and only if there
exists n ∈ N such that o ∈ An. That is, ∪n∈NAn = {o ∈ Ω : exists n ∈ N such that o ∈ An}
Example 1.12 (∪).
• {T} ∪ {H} = {T,H}.
• {1, 2} ∪ {2, 3} = {1, 2, 3}.
• {1} ∪ {3} ∪ {5} = {1, 3, 5}.
• {n ∈ Z : n > 0} ∪ {n ∈ Z : n < 0} = {n ∈ Z : n ≠ 0}.
• ∪n∈N{n} = N.
• ∪n∈N{x ∈ R : x ≥ n} = {x ∈ R : x ≥ 0}.
• ∪n∈N{x ∈ R : x ≥ 1/(n+ 1)} = {x ∈ R : x > 0}.
Note that A ∪ B ⊂ Ω, that is, A ∪ B is an event. This event can be pronounced as “A or B”. For example,
let Ω = {1, 2, 3, 4, 5, 6} stand for the outcome of a six-sided die. If A = {1, 2, 3} and B = {2, 4, 6} are respectively
the event that the outcome is smaller than 4 and that the outcome is even, then the event that the outcome is
smaller than 4 or is even is given by A ∪ B = {1, 2, 3, 4, 6}. Note that 2 ∈ A, 2 ∈ B and 2 ∈ A ∪ B, that is, the
“or” in “A or B” should not be interpreted as an exclusive “or”.
Definition 1.13 (intersection - ∩).
• Let A and B be two sets. o is an element of the intersection between A and B, A ∩B, if and only if o ∈ Ω
is an element of A and o is an element of B. That is, A ∩B = {o ∈ Ω : o ∈ A and o ∈ B}.
• Let (An)n∈N be a sequence of sets. o ∈ Ω is an element of the intersection of (An)n∈N, ∩n∈NAn, if and only
if for every n ∈ N, o ∈ An. That is, ∩n∈NAn = {o ∈ Ω : for every n ∈ N, o ∈ An}.
Example 1.14 (∩).
• {T} ∩ {H} = ∅.
• {1, 2} ∩ {2, 3} = {2}.
• ({1, 2} ∩ {2, 3}) ∪ {5} = {2, 5}.
• {n ∈ Z : n ≥ 0} ∩ {n ∈ Z : n ≤ 0} = {0}.
• ∩n∈N{i ∈ N : i ≥ n} = ∅.
• ∩n∈N{x ∈ R : x ≤ n} = {x ∈ R : x ≤ 0}.
Note that A ∩ B ⊂ Ω, that is, A ∩ B is also an event. This event can be pronounced as “A and B”. For
example, let Ω = {1, 2, 3, 4, 5, 6}, stand for the outcome of a six-sided die. If A = {1, 2, 3} and B = {2, 4, 6}, then
the event that the die outcome is smaller than 4 and even is denoted by A ∩B = {2}.
Theorem 1.15 (DeMorgan’s laws). Let (An)n∈N be a sequence of subsets of Ω. Then, for every n ∈ N,
• $(\cup_{i=1}^n A_i)^c = \cap_{i=1}^n A_i^c$
• $(\cap_{i=1}^n A_i)^c = \cup_{i=1}^n A_i^c$
Moreover,
• $(\cup_{i\in\mathbb{N}} A_i)^c = \cap_{i\in\mathbb{N}} A_i^c$
• $(\cap_{i\in\mathbb{N}} A_i)^c = \cup_{i\in\mathbb{N}} A_i^c$
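DeMorgan's laws are easy to check on small finite sets. The sketch below, using Python's built-in set type, verifies the finite case of both laws; the universe Ω and the sets A1, A2, A3 are arbitrary choices for illustration.

```python
# Verify De Morgan's laws on a small universe Omega.
omega = {1, 2, 3, 4, 5, 6}
sets = [{1, 2, 3}, {2, 4, 6}, {5}]

def complement(a, universe=omega):
    return universe - a

union = set().union(*sets)
intersection = omega.copy()
for a in sets:
    intersection &= a

# (union of the A_i)^c equals the intersection of the complements
lhs1 = complement(union)
rhs1 = omega.copy()
for a in sets:
    rhs1 &= complement(a)
assert lhs1 == rhs1

# (intersection of the A_i)^c equals the union of the complements
lhs2 = complement(intersection)
rhs2 = set().union(*(complement(a) for a in sets))
assert lhs2 == rhs2
print("De Morgan's laws hold for this example.")
```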
Definition 1.16 (Partition). Let (An)n∈N be a sequence of sets. We say that (An)n∈N partitions Ω if:
• for every i, j ∈ N such that i ≠ j, Ai and Aj are disjoint.
• ∪n∈NAn = Ω.
Exercises
Exercise 1.17. Let A = {1, 3, 5}, B = {1, 2} and Ω = {1, 2, 3, 4, 5, 6}. Find:
a. A ∪B
b. A ∩Bc
c. B ∪Bc
d. (A ∪B)c
e. (A ∩B)c
Solution:
a. A ∪B = {1, 2, 3, 5}.
b. A ∩Bc = {1, 3, 5} ∩ {3, 4, 5, 6} = {3, 5}.
c. B ∪Bc = {1, 2} ∪ {3, 4, 5, 6} = {1, 2, 3, 4, 5, 6}. (Note that, for every set D, D ∪Dc = Ω)
d. (A ∪B)c = {1, 2, 3, 5}c = {4, 6}.
e. (A ∩B)c = {1}c = {2, 3, 4, 5, 6}.
Exercise 1.18. Let A,B and C be events of Ω. Using set operations, describe the following sets:
• At least one of A,B or C happens.
• Either A and B happen or C happens.
• A and B happen, but C does not.
• Exactly one of A,B and C happens.
Exercise 1.19. Consider that a given day can be either rainy - R - or not rainy - NR. We are interested in the
weather of the next two days.
a. How would you formally write Ω?
b. How would you formally write “The outcomes such that both days are rainy”? Call this set A.
c. How would you formally write “The outcomes such that at least one day is rainy”? Call this set B.
d. Is it true that A ⊂ B?
e. Find Bc. How would you describe this set in English?
f. Is it true that A and Bc are disjoint?
Solution:
a. Ω = {(NR,NR), (NR,R), (R,NR), (R,R)}.
b. A = {(R,R)}.
c. B = {(NR,R), (R,NR), (R,R)}.
d. Since (R,R) is the only element of A and (R,R) ∈ B, A ⊂ B.
e. The only element of Ω that is not in B is (NR,NR). Hence Bc = {(NR,NR)}. This is the set of outcomes
such that there is no rainy day.
f. A ∩ Bc = {(R,R)} ∩ {(NR,NR)} = ∅. Hence, A and Bc are disjoint.
Exercise 1.20. Are the following statements true or false?
a. ∅ ∈ ∅.
b. ∅ ∈ {∅}.
c. {∅} ∈ ∅.
d. ∅ ⊂ ∅.
e. ∅ ⊂ {∅}.
f. {∅} ⊂ ∅.
Solution:
a. By definition, there is no element in ∅. That is, ∅ /∈ ∅.
b. {∅} is the set with a single element that is ∅. Hence, ∅ ∈ {∅}.
c. By definition, there is no element in ∅. Hence, {∅} /∈ ∅.
d. Since ∅ has no elements, all elements of ∅ are in any other set. Hence ∅ is a subset of every set and ∅ ⊂ ∅.
e. Using the same reasoning as in the previous item, ∅ ⊂ {∅}.
f. ∅ is an element of {∅} and ∅ /∈ ∅. Hence, there exists an element of {∅} that is not an element of ∅. Conclude
that {∅} ⊄ ∅.
Exercise 1.21. Prove the following:
a. S and T are disjoint if and only if S ∩ T = ∅.
b. S ∪ T = T ∪ S
c. S ∩ T = T ∩ S
d. S = (Sc)c
e. S ∪ Sc = Ω
f. S ∩ Ω = S
g. S ∪ Ω = Ω
h. (S ∩ T )c = Sc ∪ T c
i. (S ∪ T )c = Sc ∩ T c
j. ({n})n∈N partitions N.
k. ({1, 2}, {3, 4}, ∅, ∅, ∅, . . .) partitions {1, 2, 3, 4}.
Solution:
a. S and T are disjoint if and only if, for every w ∈ S, w /∈ T and, for every w ∈ T , w /∈ S. This happens if and only
if {w ∈ Ω : w ∈ S and w ∈ T} = ∅, that is, S ∩ T = ∅.
b. S ∪ T = {w ∈ Ω : w ∈ S or w ∈ T} = {w ∈ Ω : w ∈ T or w ∈ S} = T ∪ S.
c. S ∩ T = {w ∈ Ω : w ∈ S and w ∈ T} = {w ∈ Ω : w ∈ T and w ∈ S} = T ∩ S.
d. If w ∈ S, then w /∈ Sc and w ∈ (Sc)c. Hence S ⊂ (Sc)c. Similarly, if w ∈ (Sc)c, then w /∈ Sc and w ∈ S. Hence
(Sc)c ⊂ S. Conclude that S = (Sc)c.
e. S ∪ Sc = {w ∈ Ω : w ∈ S or w ∈ Sc} = {w ∈ Ω : w ∈ S or w /∈ S} = Ω.
f. S ∩ Ω = {w ∈ Ω : w ∈ S and w ∈ Ω} = {w ∈ Ω : w ∈ S} = S.
g. S ∪ Ω = {w ∈ Ω : w ∈ S or w ∈ Ω} = {w ∈ Ω : w ∈ Ω} = Ω.
h. If w ∈ Sc ∪ T c, then w /∈ S or w /∈ T . Hence w /∈ S ∩ T and w ∈ (S ∩ T )c. If w ∈ (S ∩ T )c, then w /∈ S ∩ T ,
that is, w /∈ S or w /∈ T . That is, w ∈ Sc or w ∈ T c and, therefore, w ∈ Sc ∪ T c.
i. If w ∈ Sc∩T c, then w /∈ S and w /∈ T . Hence w /∈ (S∪T ) and w ∈ (S∪T )c. If w ∈ (S∪T )c, then w /∈ S∪T ,
that is, w /∈ S and w /∈ T . That is, w ∈ Sc and w ∈ T c and, therefore, w ∈ Sc ∩ T c.
j. For every i ≠ j, {i} ∩ {j} = ∅ and ∪n∈N{n} = N. Hence, ({n})n∈N partitions N.
k. {1, 2}∩{3, 4} = ∅ and A∩∅ = ∅ for every A. Hence, since {1, 2}∪{3, 4}∪∅ = {1, 2, 3, 4}, then the sequence
of events ({1, 2}, {3, 4}, ∅, ∅, ∅, . . .) partitions {1, 2, 3, 4}.
Exercise 1.22. In a class of 25 students, consider the following statements:
• 14 will major in Computer Science.
• 12 will major in Engineering.
• 5 will major both in Computer Science and Engineering.
How many won’t major in either Computer Science or Engineering?
Solution: Let A denote the set of students who will major in Computer Science and B denote the set of
students who will major in Engineering. Let |A| stand for the number of elements in A. The problem states that
|Ω| = 25, |A| = 14, |B| = 12 and |A ∩ B| = 5. The problem asks for |Ac ∩ Bc|. We can partition Ω in 4 sets:
A ∩B, A ∩Bc, Ac ∩B and Ac ∩Bc. Observe that:
a. The problem states that |A ∩B| = 5.
b. (A ∩ B) ∪ (A ∩ Bc) = A. Hence, since A ∩ B and A ∩ Bc are disjoint, |A ∩ B| + |A ∩ Bc| = |A|. Thus,
5 + |A ∩Bc| = 14 and |A ∩Bc| = 9.
c. Similarly, (A ∩B) ∪ (Ac ∩B) = B and (A ∩B) and (Ac ∩B) are disjoint. Hence, |A ∩B|+ |Ac ∩B| = |B|.
Conclude that 5 + |Ac ∩B| = 12 and |Ac ∩B| = 7.
d. Since Ω can be partitioned in the sets A ∩B, A ∩Bc, Ac ∩B and Ac ∩Bc, conclude that:
|A ∩B|+ |A ∩Bc|+ |Ac ∩B|+ |Ac ∩Bc| = |Ω|
5 + 9 + 7 + |Ac ∩Bc| = 25
|Ac ∩Bc| = 4
Hence, 4 students won’t major in either Computer Science or Engineering.
Exercise 1.23. Prove that A and B are disjoint if and only if A ∩B = ∅.
Exercise 1.24. Prove DeMorgan’s law.
1.2 Functions
A function can be understood as a table that maps each element in a set, A, to an element in a set, B. While the
set A is called the domain of the function, B is called the counter-domain.
x ∈ A:   -2   -1    0    1    2
y ∈ B:    4    1    0    1    4
Table 1: Example of a function from A = {−2,−1, 0, 1, 2} to B = R.
Table 1 illustrates a function, f , such that f(−2) = 4, f(−1) = 1, f(0) = 0, f(1) = 1 and f(2) = 4. One can
state that f has domain A and counter-domain B by writing f : A→ B. In this case, f admits a simple analytical
expression. Indeed, we can write f : A→ B, f(x) = x2. This compact way of determining f is often more useful
than presenting tables such as table 1.
Definition 1.25 (Image). Let f : A→ B and A∗ ⊆ A. Define
f [A∗] = {f(a) : a ∈ A∗}
That is, f [A∗] is a new set that has as its elements f(a), for each a ∈ A∗. For example, for the f in table 1,
f [{−2,−1, 2}] = {1, 4}.
Definition 1.26 (Pre-image). Let B∗ ⊆ B. Define
f−1[B∗] = {a ∈ A : f(a) ∈ B∗}
That is, f−1[B∗] is the set of elements of A that get mapped onto B∗. For example, consider f such as in
table 1. By looking at the second column of this table, one can observe that the elements of A that map to 1 are
−1 and 1. Therefore, f−1[{1}] = {−1, 1}. Similarly, f−1[{0, 1}] = {−1, 0, 1}.
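A minimal sketch of these two definitions applied to the function of table 1; the helper names image and preimage below are ours, not standard library functions.

```python
# f from Table 1: domain A = {-2, -1, 0, 1, 2}, f(x) = x^2.
A = [-2, -1, 0, 1, 2]
f = {x: x**2 for x in A}

def image(f, A_star):
    """f[A*] = {f(a) : a in A*}."""
    return {f[a] for a in A_star}

def preimage(f, B_star):
    """f^{-1}[B*] = {a in A : f(a) in B*}."""
    return {a for a, y in f.items() if y in B_star}

print(image(f, {-2, -1, 2}))   # {1, 4}
print(preimage(f, {1}))        # {-1, 1}
print(preimage(f, {0, 1}))     # {-1, 0, 1}
```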
Exercises
Exercise 1.27. Let f : {1, 2, 3, 4} → R, f(x) = (x− 2)4.
(a) Write the table description of f .
(b) Find f−1[{1}].
Exercise 1.28. Let f : R→ R, f(x) = x2. Find
(a) f−1[{0}]
(b) f−1[(0, 1]].
(c) f−1[R+].
(d) f−1[{−1}].
Exercise 1.29. Let f : A→ B be an arbitrary function.
(a) Prove that, for every B∗ ⊆ f [A], f [f−1[B∗]] = B∗.
(b) Show an example of A, B and f such that there exists A∗ ⊆ A such that f−1[f [A∗]] ≠ A∗.
(c) Prove that, for every A∗ ⊆ A, A∗ ⊆ f−1[f [A∗]].
1.3 Combinatorial analysis
Combinatorial analysis is a collection of methods for counting objects. Also, in order to compute probabilities, it
is often important to count the number of elements in an event. Hence the importance of combinatorial analysis in
probability theory. This section provides some of the elementary techniques developed in combinatorial analysis.
1.3.1 Counting
Lemma 1.30 (Counting principle). Assume r experiments are performed. If the i-th experiment has ni possible
outcomes, i = 1, . . . , r, then the total number of outcomes of the joint experiment is given by
$\prod_{i=1}^r n_i := n_1 \, n_2 \cdots n_r$
Example 1.31. If we flip a coin and then roll a die, there is a total of 2 × 6 = 12 results. Indeed, by arranging the
possible results in a rectangle, one can visualize why the total number of results is 12:

(H, 1) (H, 2) (H, 3) (H, 4) (H, 5) (H, 6)
(T, 1) (T, 2) (T, 3) (T, 4) (T, 5) (T, 6)

Using the counting principle, one can find the total number of elements in the rectangle by noting that it has
2 rows and 6 columns. This task is easier than counting, one by one, each element in the rectangle. This is why
the counting principle is useful.
Example 1.32. If we flip a coin r times, the total number of outcomes in each experiment is two: either heads or
tails. It follows from the counting principle that there are $\prod_{i=1}^r 2 = 2^r$ possible results, that is, there is a total of $2^r$
sequences of heads and tails.
1.3.2 Permutations
Lemma 1.33. Consider a set with n objects. Sorting these objects corresponds to assigning, without repetition, a
number in {1, . . . , n} to each object. The total number of ways the objects can be sorted is given by
n! := n× (n− 1)× . . .× 2× 1
Example 1.34. There are 4!=24 ways of displaying four shirts in a closet.
1.3.3 Combinations
Lemma 1.35. Consider a set with n objects. The total number of different groups of size r ≤ n that can be
formed with these objects is given by
$\binom{n}{r} := \frac{n!}{r!\,(n-r)!}$
Example 1.36. There are $\binom{10}{2} = 45$ ways of choosing two questions from a section with 10 questions.
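The counting formulas above can be checked against brute-force enumeration. A small sketch using the standard math and itertools modules, with the shirts and questions of the two examples as arbitrary placeholders:

```python
import math
from itertools import permutations, combinations

# Permutations: 4 shirts can be displayed in 4! = 24 ways.
shirts = ["s1", "s2", "s3", "s4"]
assert len(list(permutations(shirts))) == math.factorial(4) == 24

# Combinations: 2 questions out of 10 can be chosen in C(10, 2) = 45 ways.
questions = range(10)
assert len(list(combinations(questions, 2))) == math.comb(10, 2) == 45
```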
Lemma 1.37 (Vandermonde’s Identity).
$\binom{n+m}{z} = \sum_{x=\max(0,\,z-m)}^{\min(n,\,z)} \binom{n}{x}\binom{m}{z-x}$
Proof. Consider that you have n + m objects. The total number of ways you can select z of these objects is
$\binom{n+m}{z}$. Next, consider that you divide these objects into two groups, the type-1 group with n objects and the
type-2 group with m objects. For every x, the total number of ways you can select x objects from the type-1 group and z − x
objects from the type-2 group is $\binom{n}{x}\binom{m}{z-x}$. If one considers the set of such selections ranging over all values of x, then
one obtains exactly the ways to select z objects from the group of size n + m. That is,
$\binom{n+m}{z} = \sum_{x=\max(0,\,z-m)}^{\min(n,\,z)} \binom{n}{x}\binom{m}{z-x}$
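As a sanity check of the identity, the sketch below compares both sides for a few arbitrarily chosen values of n, m and z:

```python
import math

def vandermonde_rhs(n, m, z):
    # sum over x from max(0, z - m) to min(n, z) of C(n, x) * C(m, z - x)
    return sum(math.comb(n, x) * math.comb(m, z - x)
               for x in range(max(0, z - m), min(n, z) + 1))

for n, m, z in [(3, 4, 2), (5, 5, 7), (2, 6, 6)]:
    assert math.comb(n + m, z) == vandermonde_rhs(n, m, z)
print("Vandermonde's identity verified for the test cases.")
```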
Exercises
Exercise 1.38. A restaurant offers three different starters, five main dishes and six desserts. How many different
meals does this restaurant offer?
Exercise 1.39. A student has 2 math books, 3 geography books and 4 chemistry books. In how many ways can
these books be displayed on a shelf:
• when the student accepts all possible arrangements of the books?
• when the student wants the books about the same subject to be next to each other?
Exercise 1.40. How many different words can be formed with two letters A, three letters B and one letter C?
Exercise 1.41. A student needs to study for three exams this week. The math teacher gave him 6 exercises to
help him study for the test, the geography teacher gave him 7 exercises, and the chemistry teacher gave him 5
exercises. Considering that the student does not have much time, how many different sets of exercises can he pick
to do if he wants to choose only 2 math exercises, 4 geography exercises, and 1 chemistry exercise?
Exercise 1.42. Eight people, named A, B, . . . , H, are going to make a line.
• In how many ways can these people be placed?
• In how many ways can these people be placed if A and B have to be next to each other?
• In how many ways can these people be placed if A and B have to be next to each other, and C and D also
have to be next to each other?
Exercise 1.43. Show that
$\binom{n+1}{k+1} = \sum_{x=h}^{n-k+h} \binom{x}{h}\binom{n-x}{k-h}$
1.4 Probability theory
Definition 1.44 (Axioms of Probability). P : F → R is a probability if:
1. (Non-negativity) For all A ∈ F , P(A) ≥ 0.
2. (Countable additivity) If (An)n∈N is a sequence of disjoint sets in F , then $P(\cup_{n\in\mathbb{N}} A_n) = \sum_{n\in\mathbb{N}} P(A_n)$.
3. (Normalization) P(Ω) = 1.
Lemma 1.45. P(∅) = 0.
Lemma 1.46. For every event A, P(Ac) = 1− P(A)
Lemma 1.47. For all events A and B,
P(A ∪B) = P(A) + P(B)− P(A ∩B)
Lemma 1.48. If A ⊆ B, then P(B) ≥ P(A).
Lemma 1.49. For all events A1, . . . , An,
$P(\cup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)$
Lemma 1.50 (Continuity of probability). Let A1, A2, . . . be a sequence of events such that, for every i ≥ 1,
Ai ⊆ Ai+1. In this case, define $\lim_{n\to\infty} A_n := \cup_{i\geq 1} A_i$. It follows that
$P(\lim_{n\to\infty} A_n) = \lim_{n\to\infty} P(A_n)$
Lemma 1.51. For every sequence of events, A1, A2, . . . ∈ F ,
$P(\cup_{i\geq 1} A_i) \leq \sum_{i\geq 1} P(A_i)$
Lemma 1.52. Let A1, A2, . . . and B1, B2, . . . be sequences of events.
1. If P(Ai) = 0, for every i ≥ 1, then P(∪i≥1Ai) = 0
2. If P(Bi) = 1, for every i ≥ 1, then P(∩i≥1Bi) = 1.
Definition 1.53 (Axiom of Conditional Probabilities).
P(A ∩B) = P(A)P(B|A)
Definition 1.54. Two events A and B are independent if P(A ∩B) = P(A)P(B).
Lemma 1.55. A and B are independent if and only if P(A|B) = P(A). In other words, A and B are independent
if and only if one’s uncertainty about A does not change assuming that B is true.
Theorem 1.56 (Multiplication rule). Let A1, A2, . . . , An be events. Then
$P(\cap_{i=1}^n A_i) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1\cap A_2)\cdots P(A_n|\cap_{i=1}^{n-1} A_i)$
Theorem 1.57 (Law of Total Probability). Let (An)n∈N be a partition of Ω and B be an event. Then
$P(B) = \sum_{n\in\mathbb{N}} P(A_n)\,P(B|A_n)$
Theorem 1.58 (Bayes Theorem). Let (Ai)i∈N be a partition of Ω and B be an event. For every n ∈ N,
$P(A_n|B) = \frac{P(A_n)\,P(B|A_n)}{\sum_{i\in\mathbb{N}} P(A_i)\,P(B|A_i)}$
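A small numeric illustration of the last two theorems with a two-event partition; the probabilities below are made up for the example.

```python
# Partition: A1 = "it rains", A2 = "it does not rain"; B = "forecast predicts rain".
prior = {"A1": 0.3, "A2": 0.7}          # P(A_n)
likelihood = {"A1": 0.8, "A2": 0.1}     # P(B | A_n)

# Law of total probability: P(B) = sum_n P(A_n) P(B | A_n)
p_b = sum(prior[a] * likelihood[a] for a in prior)

# Bayes' theorem: P(A_n | B) = P(A_n) P(B | A_n) / P(B)
posterior = {a: prior[a] * likelihood[a] / p_b for a in prior}
print(p_b)        # 0.31
print(posterior)  # {'A1': ~0.774, 'A2': ~0.226}
```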
1.5 Random variables
Definition 1.59 (Cumulative distribution function (cdf)). Let X be a random variable. The cumulative distri-
bution function of X, FX : R→ [0, 1] is such that FX(x) = P(X ≤ x).
Lemma 1.60. For every random variable, X, and a < b,
FX(b)− FX(a) = P(a < X ≤ b)
Lemma 1.61. For every random variable, X,
1. FX(x) ∈ [0, 1].
2. FX(x) is a non-decreasing function.
3. FX(x) is right continuous.
4. limx→−∞ FX(x) = 0 and limx→∞ FX(x) = 1.
Observe that for continuous random variables, F is a continuous function. However, for discrete random variables,
F has left-discontinuities at the image values of X.
Definition 1.62 (Probability density function). Let X be a random variable and A be an event. The probability
density function of X given A, fX(x|A) : R → R+, is defined as
$f_X(x|A) = \begin{cases} P(X = x|A), & \text{if } X \text{ is discrete,} \\ \frac{d\,P(X \leq x|A)}{dx}, & \text{if } X \text{ is continuous.} \end{cases}$
The probability density function of X, fX(x) : R → R+, is defined as fX(x) := fX(x|Ω). That is,
$f_X(x) = \begin{cases} P(X = x|\Omega) = P(X = x), & \text{if } X \text{ is discrete,} \\ \frac{d\,P(X \leq x|\Omega)}{dx} = \frac{d F_X(x)}{dx}, & \text{if } X \text{ is continuous.} \end{cases}$
Lemma 1.63.
$P(X \in A) = \int_A f_X(x)\,dx$
Corollary 1.64. For every random variable, X,
1. $\int_a^b f_X(x)\,dx = P(a \leq X \leq b)$.
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.
Definition 1.65 (Independence of random variables). Let X1, . . . , Xn be discrete random variables. We say that
they are independent if, for every A1, . . . , An ⊂ R,
$P(X_1 \in A_1, \ldots, X_n \in A_n) := P(\cap_{i=1}^n \{X_i \in A_i\}) = \prod_{i=1}^n P(X_i \in A_i)$
Therefore, for every A1, . . . , An, the events {ω ∈ Ω : Xi(ω) ∈ Ai} are jointly independent.
Definition 1.66. Let X be a random variable and A an event. The expected value of X given A is denoted by
E[X|A] and defined as
$E[X|A] = \int_{\mathbb{R}} x\,f_X(x|A)\,dx$
The expected value of X, E[X], is defined as E[X|Ω], that is,
$E[X] = \int_{\mathbb{R}} x\,f_X(x)\,dx$
Lemma 1.67 (Law of the unconscious statistician). For every random variable, X,
$E[g(X)] = \int_{\mathbb{R}} g(x)\,f_X(x)\,dx$
Lemma 1.68. If X ∈ Z, then
$E[X] = \sum_{j=0}^{\infty} (1 - F_X(j)) - \sum_{j=-\infty}^{-1} F_X(j)$
Lemma 1.69. For every continuous random variable, X,
$E[X] = \int_0^{\infty} (1 - F_X(y))\,dy - \int_{-\infty}^{0} F_X(y)\,dy$
Lemma 1.70 (Linearity of the expected value).
$E\left[\sum_{i=1}^n c_i X_i \,\middle|\, A\right] = \sum_{i=1}^n c_i\,E[X_i|A]$
Lemma 1.71 (Law of total expectation). Let A1, . . . , An be a partition of Ω and X be a random variable. Then
$E[X] = \sum_{i=1}^n E[X|A_i]\cdot P(A_i)$
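A Monte Carlo sketch of the law of total expectation: the partition is given by the value of a die, and the weighted conditional means recover E[X]. The particular distributions below are arbitrary choices for illustration.

```python
import random

random.seed(0)
n = 200_000

# X = die roll D plus an independent fair-coin bonus; partition events A_i = {D = i}.
draws = []
for _ in range(n):
    d = random.randint(1, 6)
    draws.append(d + (1 if random.random() < 0.5 else 0))

e_x_mc = sum(draws) / n                       # Monte Carlo estimate of E[X]

# Exact right-hand side: sum_i E[X | A_i] * P(A_i), with E[X | D = i] = i + 0.5.
e_x_total = sum((i + 0.5) * (1 / 6) for i in range(1, 7))

print(round(e_x_mc, 3), e_x_total)            # both close to 4.0
```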
Definition 1.72 (Variance). The variance of X is defined as E[(X − E[X])2] and denoted by V[X].
Lemma 1.73. V[aX + b] = a2V[X]
Lemma 1.74.
V[X] = E[X2]− E[X]2
Lemma 1.75. If X and Y are independent, then E[XY ] = E[X]E[Y ]. However, E[XY ] = E[X]E[Y ] does not
imply that X and Y are independent.
Lemma 1.76. If X and Y are independent, then V[X + Y ] = V[X] + V[Y ].
Lemma 1.77. If X is a random variable, V[X] = 0 if, and only if, X is constant (i.e., there exists c ∈ R such
that P(X = c) = 1).
Random Variable (Y)        pdf: fY(y)                                        Support                                E[Y]       V[Y]
Binomial(n, p)             $\binom{n}{y} p^y (1-p)^{n-y}$                    y ∈ {0, 1, 2, . . . , n}               np         np(1 − p)
Hypergeometric(N, n, k)    $\binom{k}{y}\binom{N-k}{n-y}/\binom{N}{n}$       max(0, n − N + k) ≤ y ≤ min(n, k)      n·k/N      n·(k/N)·(1 − k/N)·(N − n)/(N − 1)
Geometric(p)               $p(1-p)^{y-1}$                                    y ∈ {1, 2, 3, . . .}                   1/p        (1 − p)/p²
Negative Binomial(r, p)    $\binom{y-1}{r-1} p^r (1-p)^{y-r}$                y ∈ {r, r + 1, r + 2, . . .}           r/p        r(1 − p)/p²
Poisson(λ)                 $e^{-\lambda}\lambda^y/y!$                        y ∈ N                                  λ          λ
Table 2: Important discrete distributions.
1.6 Random vectors
Definition 1.78. The cumulative distribution function (cdf) of X, FX(x) is such that, for every x ∈ Rd,
FX(x) := P(X1 ≤ x1, . . . , Xd ≤ xd)
Random Variable (Y)    pdf: fY(y)                                                                       Support       E[Y]         V[Y]
Uniform(a, b)          $\frac{1}{b-a}$                                                                  y ∈ (a, b)    (a + b)/2    (b − a)²/12
Exponential(λ)         $\frac{1}{\lambda} e^{-y/\lambda}$                                               y ∈ R+        λ            λ²
Gamma(k, λ)            $\frac{1}{\Gamma(k)\lambda^k}\, y^{k-1} e^{-y/\lambda}$                          y ∈ R+        kλ           kλ²
Beta(α, β)             $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, y^{\alpha-1}(1-y)^{\beta-1}$   y ∈ (0, 1)    α/(α + β)    αβ/((α + β)²(α + β + 1))
Normal(µ, σ²)          $\frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2/(2\sigma^2)}$                       y ∈ R         µ            σ²
Table 3: Important continuous distributions.
Lemma 1.79. Some useful properties of a cdf are:
1. For every x ∈ Rd, 0 ≤ FX(x) ≤ 1.
2. For every 1 ≤ i ≤ d, $\lim_{x_i\to -\infty} F_X(x) = 0$.
3. For every 1 ≤ i ≤ d, $\lim_{x_i\to \infty} F_X(x) = F_{X_{-i}}(x_{-i})$.
Definition 1.80. The probability density function (pdf) of X, fX(x), is such that
$f_X(x) = \begin{cases} P(X_1 = x_1, \ldots, X_d = x_d), & \text{if } X \text{ is discrete,} \\ \frac{d F_X(x)}{dx}, & \text{if } X \text{ is continuous.} \end{cases}$
Lemma 1.81.
$P(X \in A) = \int_A f_X(x)\,dx$
Lemma 1.82 (Law of total probability for random variables).
$\int_{-\infty}^{\infty} f_X(x)\,dx_i = f_{X_{-i}}(x_{-i})$
Definition 1.83. Let X and Y be two vectors of random variables. The pdf of X given Y is defined as
$f_{X|Y}(x|y) = \frac{f_{(X,Y)}(x,y)}{f_Y(y)}$
Definition 1.84.
$P(X \in A|Y = y) = \int_A f_{X|Y}(x|y)\,dx$
Theorem 1.85 (Law of total probability for vectors of random variables).
$\int_{-\infty}^{\infty} f_{X|Y}(x|y)\,dx_i = f_{X_{-i}|Y}(x_{-i}|y)$
Theorem 1.86 (Bayes theorem for vectors of random variables).
$f_{X|Y}(x|y) = \frac{f_X(x)\,f_{Y|X}(y|x)}{\int_{-\infty}^{\infty} f_X(x)\,f_{Y|X}(y|x)\,dx}$
Definition 1.87. We say that X1, . . . ,Xd are conditionally independent given Y if, for every x1, . . . ,xd and y,
$f_{(X_1,\ldots,X_d)|Y}(x_1,\ldots,x_d|y) = \prod_{i=1}^d f_{X_i|Y}(x_i|y)$
In particular, we say that X1, . . . ,Xd are independent if Y is empty, that is, for every x1, . . . ,xd,
$f_{(X_1,\ldots,X_d)}(x_1,\ldots,x_d) = \prod_{i=1}^d f_{X_i}(x_i)$
Lemma 1.88. The following statements are equivalent:
1. (X1, . . . ,Xd) are conditionally independent given Y.
2. There exist functions h1, . . . , hd such that $f_{(X_1,\ldots,X_d)|Y}(x_1,\ldots,x_d|y) = \prod_{j=1}^d h_j(x_j,y)$.
3. For every i, $f_{X_i|X_{-i},Y}(x_i|x_{-i},y) = f_{X_i|Y}(x_i|y)$.
4. For every i, $f_{X_i|X_1^{i-1},Y}(x_i|x_1^{i-1},y) = f_{X_i|Y}(x_i|y)$.
Definition 1.89. Let (X,Y) be vectors of random variables with probability density function f(X,Y)(x,y). Let
X have dimension d. The conditional expectation of X given that Y = y, E[X|Y = y], is such that
$E[X|Y = y] = \int_{\mathbb{R}^d} x\,f_{X|Y}(x|y)\,dx$
That is, E[X|Y = y] is a d-dimensional vector such that
$E[X|Y = y]_i = \int_{\mathbb{R}} x_i\,f_{X_i|Y}(x_i|y)\,dx_i$
In particular,
$E[X] = \int_{\mathbb{R}^d} x\,f_X(x)\,dx$
Lemma 1.90. If X is independent of Y, then, for every y,
$E[X|Y = y] = E[X]$
Lemma 1.91 (Law of the unconscious statistician).
$E[g(X)h(Y)|Y = y] = h(y)\int_{\mathbb{R}^d} g(x)\,f_{X|Y}(x|y)\,dx$
Lemma 1.92. E is a linear operator, that is,
1. For every X1, . . . ,Xn such that $\sum_{i=1}^n X_i$ is defined,
$E\left[\sum_{i=1}^n X_i \,\middle|\, Y = y\right] = \sum_{i=1}^n E[X_i|Y = y]$
2. If h is a linear function, then
$E[h(X)|Y = y] = h(E[X|Y = y])$
In particular, for every a ∈ R and α ∈ R^d,
$E[aX|Y = y] = a\,E[X|Y = y]$
$E[\alpha^t X|Y = y] = \alpha^t E[X|Y = y]$
$E[X\alpha^t|Y = y] = E[X|Y = y]\,\alpha^t$
Lemma 1.93. If X1 and X2 are independent given Y, then
$E[X_1 X_2^t|Y = y] = E[X_1|Y = y]\,E[X_2|Y = y]^t$
Definition 1.94. E[X|Y] is defined as
$E[X|Y] = \int_{\mathbb{R}^d} x\,f_{X|Y}(x|Y)\,dx$
Note that E[X|Y] is a function of Y and, therefore, is a random variable. Also, for every w ∈ Ω, E[X|Y](w) =
E[X|Y = Y(w)]. That is, E[X|Y] is a function of Y, g(Y), such that g(Y(w)) = E[X|Y = Y(w)].
Theorem 1.95 (Tower law). Let X and Y be vectors of random variables such that E[‖X‖₁] < ∞. Then
$E[E[X|Y]] = E[X]$
Lemma 1.96.
1. For every g,
$E[g(X)|Y] = \int_{\mathbb{R}^d} g(x)\,f_{X|Y}(x|Y)\,dx$
2. For every h(Y),
$E[g(X)h(Y)|Y] = h(Y)\,E[g(X)|Y]$
3.
$E\left[\sum_{i=1}^n X_i \,\middle|\, Y\right] = \sum_{i=1}^n E[X_i|Y]$
Definition 1.97. The conditional variance of X given Y = y, V[X|Y = y], is defined as
$V[X|Y = y] = E[(X - E[X|Y = y])(X - E[X|Y = y])^t]$
In particular,
$V[X] = E[(X - E[X])(X - E[X])^t]$
Note that V[X|Y = y] is a d × d matrix and that
$V[X|Y = y]_{i,j} = E[(X_i - E[X_i|Y = y])(X_j - E[X_j|Y = y])]$
From the above, Cov[Xi, Xj |Y = y] is defined as V[X|Y = y]i,j.
Lemma 1.98. V[X|Y = y] = E[XXt|Y = y]− E[X|Y = y]E[X|Y = y]t.
Lemma 1.99. Let k ∈ N∗ be arbitrary and A be a k × d matrix. Then
$V[AX|Y = y] = A\,V[X|Y = y]\,A^t$
In particular, if α ∈ R^d,
$V[\alpha^t X|Y = y] = \alpha^t V[X|Y = y]\,\alpha$
Lemma 1.100. For every X, V[X] is a positive semi-definite matrix.
Lemma 1.101. Let b ∈ Rd,
V[X + b|Y = y] = V[X|Y = y]
Lemma 1.102. If X1 and X2 are independent given Y, then
V[X1 + X2|Y = y] = V[X1|Y = y] + V[X2|Y = y]
Definition 1.103.
$V[X|Y] = E[(X - E[X|Y])(X - E[X|Y])^t]$
Note that V[X|Y] is a function of Y and, therefore, is a random variable. Also, for every w ∈ Ω, V[X|Y](w) =
V[X|Y = Y(w)]. That is, V[X|Y] is a function of Y, g(Y), such that g(Y(w)) = V[X|Y = Y(w)].
Theorem 1.104 (Law of Total Variance).
V[X] = V[E[X|Y]] + E[V[X|Y]]
2 Statistical decision theory
2.1 Statistical model
Definition 2.1 (Statistical model). A statistical model is a pair (X ,P), where X ⊂ Rn represents the possible
values that the data can assume and P is a collection of distributions over X .
Definition 2.2 (Model’s parameter). In a statistical model, (X ,P), it might be useful to parameterize P. We
define θ as a bijective function over P, θ : P → Θ. θ is called the parameter of the model. For each θ0 ∈ Θ, define
$f(x|\theta_0) = \theta^{-1}(\theta_0)$. Note that P = {f(x|θ0) : θ0 ∈ Θ}.
Example 2.3. A fraction of the toys manufactured by a factory is hazardous. In order to learn about this fraction,
we sample one toy and observe whether it is hazardous. The possible samples are given by X = {0, 1}, where “1”
represents that the toy is hazardous and “0” represents that the toy is not hazardous. Let f(x|t) = tx(1− t)1−x.
Note that f(x|t) is the density of a Bernoulli(t). Define P = {f(x|t) : 0 < t < 1}, that is, the collection of all
Bernoulli distributions. Note that we don’t know which f(x|t) ∈ P generated the data. Finally, let θ(f(x|t)) = t.
It is also possible to describe the statistical model implicitly. For example, we can summarize the previous
paragraph by saying that the data, X, is such that X ∼ Bernoulli(θ), where 0 < θ < 1.
Example 2.4 (Rain model). Before leaving my house, I check the weather broadcast for rain. My data, X,
is the indicator that rain is expected today. Therefore, X = {0, 1}. Let θ ∈ {R, R¯}, where R represents that
it rains today and R¯ that it does not rain. I believe that the broadcast is fairly accurate and determine that
P(X = 1|θ = R) = 0.8 and P(X = 1|θ = R¯) = 0.1. Implicitly, I defined P = {Bernoulli(0.8),Bernoulli(0.1)}.
Definition 2.5 (Independence). X1, . . . , Xn are independent if, for every f(x1, . . . , xn|θ) ∈ P,
$f(x_1, \ldots, x_n|\theta) = \prod_{i=1}^n f(x_i|\theta)$
Definition 2.6 (Identically distributed). X1, . . . , Xn are identically distributed if, for every f(x1, . . . , xn|θ) ∈ P
and x ∈ R,
$f_{X_1}(x|\theta) = f_{X_2}(x|\theta) = \ldots = f_{X_n}(x|\theta)$
Notation 2.7 (i.i.d.). X1, . . . , Xn are i.i.d. if they are independent and identically distributed.
Definition 2.8 (Exponential family). Let X be a random variable of a statistical model with parameter θ. X is in
an exponential family if there exist h(x), T (x), η(θ) and A(θ) such that, for every θ0 ∈ Θ,
f(x|θ0) = h(x) exp(T (x) · η(θ0) + A(θ0))
Example 2.9. Let X = (X1, . . . , Xn) be i.i.d. random variables such that Xi ∼ Bernoulli(θ). Obtain that
$f(x_1, \ldots, x_n|\theta_0) = \prod_{i=1}^n f(x_i|\theta_0)$   (Definition 2.5)
$= \prod_{i=1}^n \theta_0^{x_i}(1-\theta_0)^{1-x_i}$
$= \left(\prod_{i=1}^n \theta_0^{x_i}\right)\left(\prod_{i=1}^n (1-\theta_0)^{1-x_i}\right)$
$= \theta_0^{\sum_{i=1}^n x_i}(1-\theta_0)^{\sum_{i=1}^n (1-x_i)}$
$= \theta_0^{n\bar{x}}(1-\theta_0)^{n(1-\bar{x})}$
$= \exp\left(n\bar{x}\log\left(\frac{\theta_0}{1-\theta_0}\right) + n\log(1-\theta_0)\right)$
Therefore, X is in the exponential family with h(x) = 1, $T(x) = n\bar{x}$, $\eta(\theta_0) = \log\left(\frac{\theta_0}{1-\theta_0}\right)$ and $A(\theta_0) = n\log(1-\theta_0)$.
Lemma 2.10. Let (X ,P) be a statistical model with parameter θ and X1 be in an exponential family with functions
h(x), T (x), η(θ0) and A(θ0). If X1, . . . , Xn are i.i.d., then X = (X1, . . . , Xn) is in an exponential family with
functions $h^*(x) = \prod_{i=1}^n h(x_i)$, $T^*(x) = \sum_{i=1}^n T(x_i)$, $\eta^*(\theta_0) = \eta(\theta_0)$ and $A^*(\theta_0) = nA(\theta_0)$.
Proof.
$f(x_1, \ldots, x_n|\theta_0) = \prod_{i=1}^n f(x_i|\theta_0)$   (Definition 2.5)
$= \prod_{i=1}^n h(x_i)\exp(T(x_i)\cdot\eta(\theta_0) + A(\theta_0))$   (Definition 2.8)
$= \left(\prod_{i=1}^n h(x_i)\right)\exp\left(\sum_{i=1}^n T(x_i)\cdot\eta(\theta_0) + nA(\theta_0)\right)$
Therefore X is in an exponential family with $h^*(x) = \prod_{i=1}^n h(x_i)$, $T^*(x) = \sum_{i=1}^n T(x_i)$, $\eta^*(\theta_0) = \eta(\theta_0)$ and
$A^*(\theta_0) = nA(\theta_0)$.
Example 2.11. Let X = (X1, . . . , Xn) be i.i.d. random variables such that Xi ∼ N(µ, σ²), where θ = (µ, σ²).
Obtain that
$f(x_1|\mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(\frac{-(x_1-\mu_0)^2}{2\sigma_0^2}\right)$   (Table 3)
$= \frac{1}{\sqrt{2\pi\sigma_0^2}}\exp\left(\frac{-x_1^2 + 2\mu_0 x_1 - \mu_0^2}{2\sigma_0^2}\right)$
$= \exp\left((-x_1^2, x_1)\cdot\left(\frac{1}{2\sigma_0^2}, \frac{\mu_0}{\sigma_0^2}\right) - \frac{\mu_0^2}{2\sigma_0^2} - \frac{\log(2\pi\sigma_0^2)}{2}\right)$
Conclude that X1 is in an exponential family with h(x1) = 1, T(x1) = (−x1², x1), $\eta(\mu_0, \sigma_0^2) = \left(\frac{1}{2\sigma_0^2}, \frac{\mu_0}{\sigma_0^2}\right)$ and
$A(\mu_0, \sigma_0^2) = -\frac{\mu_0^2}{2\sigma_0^2} - \frac{\log(2\pi\sigma_0^2)}{2}$. Conclude from Lemma 2.10 that X is in an exponential family.
Exercises
Exercise 2.12. In the following situations, determine a statistical model and a parameter:
(a) A crate has N light bulbs. An unknown number of these bulbs are defective. In order to determine this
number, a random sample (without replacement) of n of these bulbs is tested.
(b) A light bulb lasts on average an unknown number of seconds. In order to determine this amount of time,
one takes a sample of n light bulbs and checks how long each one of them lasts.
Exercise 2.13 (Casella and Berger (2002; p.132)). Show that each of the following families is an exponential
family
(a) Gamma(α, β), where α ∈ R+∗ and β ∈ R+∗ .
(b) Beta(α, β), where α ∈ R+∗ and β ∈ R+∗ .
(c) Poisson(θ), where θ ∈ R+∗ .
(d) Negative Binomial(r,θ), where 0 < θ < 1.
Solution:
(a)
$f(x|\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}\exp(-\beta x)$
$= \exp\left((\alpha-1)\log(x) - \beta x + \alpha\log(\beta) - \log(\Gamma(\alpha))\right)$
$= \exp\left((\log(x), -x)\cdot(\alpha-1, \beta) + \alpha\log(\beta) - \log(\Gamma(\alpha))\right)$
Therefore, f(x|α, β) is in the exponential family by taking h(x) = 1, T(x) = (log(x), −x), η(α, β) = (α − 1, β)
and A(α, β) = α log(β) − log(Γ(α)).
(b)
$f(x|\alpha,\beta) = B^{-1}(\alpha,\beta)\,x^{\alpha-1}(1-x)^{\beta-1}$
$= \exp\left((\alpha-1)\log(x) + (\beta-1)\log(1-x) + \log(B^{-1}(\alpha,\beta))\right)$
$= \exp\left((\log(x), \log(1-x))\cdot(\alpha-1, \beta-1) + \log(B^{-1}(\alpha,\beta))\right)$
Therefore, f(x|α, β) is in the exponential family by taking h(x) = 1, T(x) = (log(x), log(1 − x)), η(α, β) =
(α − 1, β − 1) and A(α, β) = log(B^{-1}(α, β)).
(c)
$f(x|\theta) = \frac{\exp(-\theta)\theta^x}{x!} = (x!)^{-1}\exp\left(x\log(\theta) - \theta\right)$
Therefore, f(x|θ) is in the exponential family by taking h(x) = (x!)^{-1}, T(x) = x, η(θ) = log(θ) and
A(θ) = −θ.
(d)
$f(x|\theta) = \binom{x-1}{r-1}\theta^r(1-\theta)^{x-r} = \binom{x-1}{r-1}\exp\left((x-r)\log(1-\theta) + r\log(\theta)\right)$
Therefore, f(x|θ) is in the exponential family by taking $h(x) = \binom{x-1}{r-1}$, T(x) = x − r, η(θ) = log(1 − θ) and
A(θ) = r log(θ).
2.2 Elements of statistical decision theory
Definition 2.14 (Statistical decision problem). A statistical decision problem is a collection, (M, θ,A, L), where
M is a statistical model, θ is a parameter forM, A is the set of alternatives that are available and L is a function,
L : A×Θ→ R, that represents the loss of choosing alternative a ∈ A under the parameter value θ0 ∈ Θ.
Example 2.15 (Umbrella problem). Consider Example 2.4. I might be interested in deciding whether I take my
umbrella to work today. Let A = {U, U¯}, where U represents that I take my umbrella and U¯ that I don’t take it.
A possible loss function is given by table 4.
θ \ a ∈ A      U      U¯
R              0.7    1
R¯             0.3    0
Table 4: L(a, θ) in Example 2.15.
In the following examples of statistical decision problems consider that X1, . . . , Xn is the data obtained for
statistical model M and that θ is a parameter for this model.
Example 2.16 (Estimation). We say that we estimate θ when we wish to choose a value that is “similar” to θ.
In this case, A = Θ and L(a, θ0) = d(a, θ0), a dissimilarity measure between a and θ0.
Consider Example 2.11. In order to estimate µ, we take A = R. A common choice for L(a, θ0) is (a− θ0)2, the
quadratic loss function.
Example 2.17 (Confidence region). When constructing a confidence region, A = {R : R ⊂ Θ}. We wish to
determine a small subset of Θ, R, such that it is likely that θ ∈ R. In order to obtain these goals, one can choose
L(R, θ0) = I(θ0 /∈ R) + µ(R), where µ(R) measures the size of R.
Consider Example 2.11. We might wish to construct a confidence interval for µ. In this case, A = {[l, u] : l ≤ u}.
A common choice for L([l, u], θ0) is L(R, θ0) = I(θ0 /∈ R) + k(u− l).
Example 2.18 (Hypothesis test). In a hypothesis test, we wish to reject H0 ⊂ Θ when θ /∈ H0 and not to
reject H0 when θ ∈ H0. In this case, A = {0, 1}, where 1 designates that H0 was rejected. It is common to choose
L(1, θ0) = k1I(θ0 ∈ H0) and L(0, θ0) = k2I(θ0 /∈ H0). Note that k1 is the loss when a type-1 error is committed,
that is, when H0 is rejected and H0 is true. Similarly, k2 is the loss when a type-2 error is committed, that is,
when H0 is not rejected and H0 is false.
Example 2.19 (Prediction). Let Xn+1 denote a data point that wasn’t observed yet. We might be interested
in predicting the value of Xn+1. In this case, A = X and we wish to choose a value that is close to Xn+1. It is
common to choose $L(a,\theta) = \int d(a, x_{n+1})\,f(x_{n+1}|\theta)\,dx_{n+1}$, where d is a distance. In particular, one might choose
$L(a,\theta) = \int (a - x_{n+1})^2\,f(x_{n+1}|\theta)\,dx_{n+1}$.
Definition 2.20 (Decision function). A decision function, δ, for a statistical decision problem chooses an alter-
native in A for each possible data value in X . That is, δ : X → A.
Example 2.21 (Umbrella problem; pt.2). Consider Example 2.15. There are 4 possible decision functions, which
are presented in table 5.
δ δ(0) δ(1)
δ1 U¯ U¯
δ2 U¯ U
δ3 U U¯
δ4 U U
Table 5: Possible decision functions in Example 2.15.
Definition 2.22 (Risk). Let δ be a decision function. The risk of δ, Rδ, is a function, Rδ : Θ → R, such that
$R_\delta(\theta_0) = E[L(\delta(X), \theta_0)|\theta_0] = \int_{\mathcal{X}} L(\delta(x), \theta_0)\,f(x|\theta_0)\,dx$
Example 2.23 (Umbrella problem; pt.3). Consider Example 2.21. Next, we compute the risk for each decision
function. First, note that δ1(X) = U¯ , that is, δ1 does not depend on the data. Therefore, we obtain that
Rδ1(R¯) = E[L(δ1(X), R¯)|R¯]
= E[L(U¯ , R¯)|R¯]
= L(U¯ , R¯) = 0
Rδ1(R) = L(U¯ , R) = 1
Similarly, δ4(X) = U . Therefore,
Rδ4(R¯) = L(U, R¯) = 0.3
Rδ4(R) = L(U,R) = 0.7
On the other hand, δ2 and δ3 depend on the data. In these cases, we obtain:
Rδ2(R¯) = E[L(δ2(X), R¯)|R¯]
= L(δ2(0), R¯)P(X = 0|R¯) + L(δ2(1), R¯)P(X = 1|R¯)
= L(U¯ , R¯)P(X = 0|R¯) + L(U, R¯)P(X = 1|R¯)
= 0 · 0.9 + 0.3 · 0.1 = 0.03
Rδ2(R) = E[L(δ2(X), R)|R]
= L(U¯ , R)P(X = 0|R) + L(U,R)P(X = 1|R)
= 1 · 0.2 + 0.7 · 0.8 = 0.76
Rδ3(R¯) = E[L(δ3(X), R¯)|R¯]
= L(U, R¯)P(X = 0|R¯) + L(U¯ , R¯)P(X = 1|R¯)
= 0.3 · 0.9 + 0 · 0.1 = 0.27
Rδ3(R) = E[L(δ3(X), R)|R]
= L(U,R)P(X = 0|R) + L(U¯ , R)P(X = 1|R)
= 0.7 · 0.2 + 1 · 0.8 = 0.94
Table 6 summarizes the risk function for each of the decision functions above.
δ Rδ(R¯) Rδ(R)
δ1 0 1
δ2 0.03 0.76
δ3 0.27 0.94
δ4 0.3 0.7
Table 6: Risks for each decision function in Example 2.21.
Notice that, if it doesn’t rain, then δ1 has the smallest risk (Rδ1(R¯) = 0). Similarly, if it rains, then δ4 has the
smallest risk (Rδ4(R) = 0.7). Since we don’t know whether it will rain or not, no decision function is better than
all others in all possible situations.
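A small sketch that reproduces table 6: it enumerates the four decision functions of table 5 and computes Rδ(θ0) = Σx L(δ(x), θ0) P(X = x|θ0), using the loss of table 4 and the forecast accuracies of Example 2.4.

```python
# Umbrella problem: X in {0, 1} (forecast), theta in {"Rbar", "R"} (no rain / rain).
loss = {("U", "R"): 0.7, ("Ubar", "R"): 1.0,
        ("U", "Rbar"): 0.3, ("Ubar", "Rbar"): 0.0}         # Table 4
p_x1 = {"R": 0.8, "Rbar": 0.1}                              # P(X = 1 | theta), Example 2.4

# The four decision functions of Table 5, as maps from x to an alternative.
rules = {"d1": {0: "Ubar", 1: "Ubar"}, "d2": {0: "Ubar", 1: "U"},
         "d3": {0: "U", 1: "Ubar"}, "d4": {0: "U", 1: "U"}}

def risk(rule, theta):
    px = {1: p_x1[theta], 0: 1 - p_x1[theta]}
    return sum(loss[(rule[x], theta)] * px[x] for x in (0, 1))

for name, rule in rules.items():
    print(name, risk(rule, "Rbar"), risk(rule, "R"))
# Reproduces Table 6: (0, 1), (0.03, 0.76), (0.27, 0.94), (0.3, 0.7)
```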
How does one choose the best decision function? The next section starts to discuss this question.
Exercises
Exercise 2.24. Change the values of the loss function in table 4 according to your own preferences. Calculate
the risk function for each decision function in table 5 according to your loss function.
Exercise 2.25 (Bickel and Doksum (2015; p.75)). Let X = {0, 1}, P = {Bernoulli(p),Bernoulli(q)}, θ(Bernoulli(t)) =
t, A = {a1, a2, a3} and D be the collection of all decision functions from X to A. Compute the risk of each of the
decision functions in D when
(a) p = q = 0.1.
(b) p = 1− q = 0.1.
Exercise 2.26. Let X1, . . . , Xn be i.i.d. and Xi ∼ Bernoulli(θ), where θ ∈ (0, 1). You wish to estimate θ and
consider A = (0, 1) and L(a, θ0) = (a−θ0)2. In order to perform this estimation, you consider the decision function
δ(X) = X¯. Compute the risk of δ.
Exercise 2.27. Let X1, . . . , Xn be i.i.d. and Xi ∼ N(θ, 1), where θ ∈ R. You wish to estimate θ and consider
A = R and L(a, θ0) = (a−θ0)2. In order to perform this estimation, you consider the decision function δ1(X) = X¯
and δ2(X) = 0.9X¯. Compute the risks of δ1 and δ2.
Solution: Note that
$R_{\delta_1}(\theta_0) = E[(\bar{X} - \theta_0)^2|\theta_0] = V[\bar{X}|\theta_0] = n^{-1}$
$R_{\delta_2}(\theta_0) = E[(0.9\bar{X} - \theta_0)^2|\theta_0] = (E[0.9\bar{X}|\theta_0] - \theta_0)^2 + V[0.9\bar{X}|\theta_0] = 0.01\theta_0^2 + 0.81\,n^{-1}$
Exercise 2.28. Let X1, . . . , Xn be i.i.d. and Xi ∼ Uniform(0, θ), where θ ∈ R+. You wish to estimate θ and
consider A = R+ and L(a, θ0) = (a−θ0)2. In order to perform this estimation, you consider two decision functions:
δ1(X) = 2X¯ and δ2(X) = max(X1, . . . , Xn). Compute the risks of δ1 and δ2.
Exercise 2.29. Let X1, . . . , X100 be i.i.d. with Xi ∼ N(θ, 1) and L(a, θ0) = |a − θ0|. Simulate the risk of δ1 = X1,
δ2 = X̄ and δ3 = Median(X1, . . . , X100).
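A possible simulation sketch for this exercise; the Monte Carlo size and the value of θ0 are arbitrary choices, and under absolute-error loss the risk here does not depend on θ0.

```python
import random
import statistics

random.seed(1)
n, n_rep, theta0 = 100, 20_000, 0.0

def simulate_risk(estimator):
    """Monte Carlo estimate of E[|delta(X) - theta0|] for X_i ~ N(theta0, 1)."""
    total = 0.0
    for _ in range(n_rep):
        x = [random.gauss(theta0, 1) for _ in range(n)]
        total += abs(estimator(x) - theta0)
    return total / n_rep

print("delta1 = X1:    ", simulate_risk(lambda x: x[0]))
print("delta2 = mean:  ", simulate_risk(statistics.mean))
print("delta3 = median:", simulate_risk(statistics.median))
# Expected orders of magnitude: ~0.8, ~0.08, ~0.1
```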
2.3 Admissibility
Definition 2.30 (Decision space). The decision space, D, is a collection of decision functions, that is, D ⊂ AX .
Definition 2.31. Let δ1 ∈ D and δ2 ∈ D. δ1 dominates δ2 if:
1. Rδ1(θ0) ≤ Rδ2(θ0), for every θ0 ∈ Θ.
2. Rδ1(θ0) < Rδ2(θ0), for some θ0 ∈ Θ.
Instead of writing δ1 dominates δ2, one can also write δ1 ≺ δ2.
Definition 2.32 (Admissibility). A decision function δ ∈ D is inadmissible in D if there exists δ∗ in D that
dominates δ. Also δ is admissible if it is not inadmissible.
Intuitively, inadmissible decision functions are undesirable, since, no matter what is the value of θ, another
decision function performs better. Therefore, it is reasonable to restrict one’s attention solely to admissible
decision functions.
Example 2.33. Consider Example 2.23. Note that the Rδ2 is strictly smaller than Rδ3 . That is, δ2 ≺ δ3 and δ3
is inadmissible. Also note that δ1, δ2 and δ4 are admissible.
Exercises
Exercise 2.34. Consider Exercise 2.27 and let D = {δ1, δ2}. Is δ1 admissible in D? Is δ2 admissible in D?
Exercise 2.35 (Bickel and Doksum (2015; p.79)). Assume that L ≥ 0 and that, if Pθ0(B) = 0, then Pθ1(B) = 0,
for every θ1 ∈ Θ. Show that, if L(a0, θ0) = 0, then δ ≡ a0 is admissible.
Exercise 2.36. Let X be a random variable such that E[X|θ] = θ and V[X|θ] = σ2. Let A = R, L(a, θ) = (a−θ)2
and D = {cX + d : (c, d) ∈ R2}.
(a) Show that, if c > 1, then cX + d is inadmissible in D.
(b) Show that, if c < 0, then cX + d is inadmissible in D.
(c) Show that, if c = 1 and d ≠ 0, then cX + d is inadmissible in D.
2.4 Sufficiency
Definition 2.37 (Statistic). A statistic is a function (summary) of the data. That is, if T is a statistic, then
T : X → T , where T is an arbitrary set.
Lemma 2.38. For every statistic, T , and θ0 ∈ Θ
fX(x|θ0) = fX,T (x, T (x)|θ0)
Proof. The result holds since {X = x} ⊂ {T = T (x)}.
Lemma 2.39. For every statistic, T , and θ0 ∈ Θ,
$f_T(t|\theta_0) = \int_{A(t)} f_X(x'|\theta_0)\,dx'$
where A(t) = {x′ ∈ X : T (x′) = t}.
Proof. Note that, for every x′ such that T (x′) ≠ t, fX,T (x′, t|θ0) = 0. Therefore,
$f_T(t|\theta_0) = \int_{A(t)} f_{X,T}(x', t|\theta_0)\,dx'$   (Lemma 1.82)
$= \int_{A(t)} f_X(x'|\theta_0)\,dx'$   (Lemma 2.38)
Definition 2.40 (Sufficient statistic). A statistic, T , is a sufficient statistic if fX|T (x|T (x), θ0) does not depend on
θ0. In other words, given T , the distribution of the data is known. This distribution is denoted by fX|T (x|T (x)).
Theorem 2.41 (Neyman-Fisher factorization). T is a sufficient statistic if and only if there exist h(x) and g(t, θ0)
such that
f(x|θ0) = h(x)g(T (x), θ0)
Proof. Assume that T is a sufficient statistic and let t = T (x). Note that
fX(x|θ0) = fX,T (x, t|θ0)   (Lemma 2.38)
= fT (t|θ0) fX|T (x|t, θ0)
= fT (t|θ0) fX|T (x|t)   (Definition 2.40)
= h(x) g(t, θ0),   taking h(x) = fX|T (x|t) and g(t, θ0) = fT (t|θ0).
Next, assume that there exist h(x) and g(t, θ0) such that f(x|θ0) = h(x)g(T (x), θ0) and let t = T (x). Then
$f_{X|T}(x|t,\theta_0) = \frac{f_{X,T}(x,t|\theta_0)}{f_T(t|\theta_0)} = \frac{f_X(x|\theta_0)}{f_T(t|\theta_0)}$   (Lemma 2.38)
$= \frac{f_X(x|\theta_0)}{\int_{A(t)} f_X(x'|\theta_0)\,dx'}$   (Lemma 2.39)
$= \frac{h(x)\,g(t,\theta_0)}{\int_{A(t)} h(x')\,g(T(x'),\theta_0)\,dx'}$
$= \frac{h(x)\,g(t,\theta_0)}{\int_{A(t)} h(x')\,g(t,\theta_0)\,dx'}$   (x′ ∈ A(t) ⇒ T (x′) = t)
$= \frac{h(x)\,g(t,\theta_0)}{g(t,\theta_0)\int_{A(t)} h(x')\,dx'} = \frac{h(x)}{\int_{A(t)} h(x')\,dx'}$
Therefore fX|T (x|t, θ0) doesn’t depend on θ0 and T is sufficient.
Lemma 2.42. If X1, . . . , Xn are i.i.d. and X1 is in the exponential family with functions h(x), T (x), η(θ0) and
A(θ0), then $T^*(X) = \sum_{i=1}^n T(X_i)$ is a sufficient statistic for θ.
Proof. Since X1 is in the exponential family with functions h(x), T (x), η(θ0) and A(θ0), it follows from
Lemma 2.10 that X = (X1, . . . , Xn) is in an exponential family with functions $h^*(x) = \prod_{i=1}^n h(x_i)$, $T^*(x) = \sum_{i=1}^n T(x_i)$,
$\eta^*(\theta_0) = \eta(\theta_0)$ and $A^*(\theta_0) = nA(\theta_0)$. That is,
$f(x|\theta_0) = h^*(x)\exp\left(T^*(x)\cdot\eta^*(\theta_0) + A^*(\theta_0)\right)$
Therefore, f(x|θ0) can be written in the form of Theorem 2.41 using the statistic T*(x). Conclude that $T^*(X) = \sum_{i=1}^n T(X_i)$ is a sufficient statistic for θ.
Definition 2.43 (Convexity). Let A be a set. A is convex if, for every probability density function, f , over A,
$\int_A x f(x)\,dx \in A$.
Let g : A → B be a function. We say that g is convex if, for every density f over A,
$g\left(\int_A x f(x)\,dx\right) \leq \int_A g(x) f(x)\,dx$
Theorem 2.44 (Rao-Blackwell). Let A and L be convex and T be a sufficient statistic. For every decision
function, δ, and θ0 ∈ Θ,
$R_{E[\delta|T]}(\theta_0) \leq R_\delta(\theta_0)$
Proof.
$R_{E[\delta|T]}(\theta_0) = E[L(E[\delta|T], \theta_0)]$   (Definition 2.22)
$= \int L\left(\int \delta(x)\,f_{X|T}(x|t)\,dx,\; \theta_0\right) f_T(t|\theta_0)\,dt$
$\leq \int \left(\int L(\delta(x), \theta_0)\,f_{X|T}(x|t)\,dx\right) f_T(t|\theta_0)\,dt$   (Definition 2.43)
$= \int\int L(\delta(x), \theta_0)\,f_{X|T}(x|t)\,f_T(t|\theta_0)\,dx\,dt$
$= \int\int L(\delta(x), \theta_0)\,f_{X,T}(x, t|\theta_0)\,dx\,dt$
$= \int\int L(\delta(x), \theta_0)\,f_{X,T}(x, t|\theta_0)\,dt\,dx$
$= \int L(\delta(x), \theta_0) \int f_{X,T}(x, t|\theta_0)\,dt\,dx$
$= \int L(\delta(x), \theta_0)\,f_X(x|\theta_0)\,dx = R_\delta(\theta_0)$   (Theorem 1.57)
Exercises
Exercise 2.45. Let X1, . . . , Xn be i.i.d. and such that X1 ∼ Bernoulli(θ). Show that T1 = X¯ is a sufficient
statistic. Is T2 = 10X¯ + 5 a sufficient statistic?
Exercise 2.46. Let X1, . . . , Xn be i.i.d. and such that X1 ∼ Uniform(0, θ). Show that T = max(X1, . . . , Xn) is
a sufficient statistic.
Exercise 2.47 (Shao (2003; p.145)). Find a univariate sufficient statistic for θ in the following cases:
(a) X ∼ Poisson(θ), θ > 0.
(b) X ∼ Negative Binomial(r, θ), 0 < θ < 1.
(c) X ∼ Exponential(θ), θ > 0.
(d) X ∼ Gamma(k, θ), θ > 0.
Exercise 2.48 (Shao (2003; p.145)). Let S be a statistic and T = f(S) be a sufficient statistic. Prove that S is
a sufficient statistic.
Exercise 2.49. Let n ≥ 2, and X1, . . . , Xn be i.i.d. and such that Xi ∼ N(θ, 1), where θ ∈ R.
(a) Show that X¯ is a sufficient statistic.
(b) Find Eθ[X1|X¯].
(c) Show that, if X¯ ∈ D, then X1 is inadmissible in D.
Solution:
(a)
$f(x_1, \ldots, x_n|\theta_0) = \prod_{i=1}^n f(x_i|\theta_0)$
$= \prod_{i=1}^n (2\pi)^{-1/2}\exp\left(-\frac{(x_i-\theta_0)^2}{2}\right)$
$= (2\pi)^{-n/2}\exp\left(-\sum_{i=1}^n \frac{(x_i-\theta_0)^2}{2}\right)$
$= (2\pi)^{-n/2}\exp\left(-\sum_{i=1}^n \frac{x_i^2 - 2x_i\theta_0 + \theta_0^2}{2}\right)$
$= (2\pi)^{-n/2}\exp\left(-\frac{\sum_{i=1}^n x_i^2}{2}\right)\cdot\exp\left(\frac{2n\bar{x}\theta_0 - n\theta_0^2}{2}\right)$
Conclude that f(x1, . . . , xn|θ0) can be written in the form of Theorem 2.41 with $h(x) = (2\pi)^{-n/2}\exp\left(-\frac{\sum_{i=1}^n x_i^2}{2}\right)$,
$T(x) = \bar{x}$ and $g(T(x), \theta_0) = \exp\left(\frac{2nT(x)\theta_0 - n\theta_0^2}{2}\right)$. That is, T(X) = X̄ is a sufficient statistic.
(b) Since X1, . . . , Xn are i.i.d. and X̄ is invariant under permutations, E[X1|X̄] = E[Xi|X̄], for 1 ≤ i ≤ n.
Conclude that
$n\,E[X_1|\bar{X}] = \sum_{i=1}^n E[X_i|\bar{X}] = E\left[\sum_{i=1}^n X_i \,\middle|\, \bar{X}\right] = E[n\bar{X}|\bar{X}] = n\bar{X}$
Hence E[X1|X̄] = X̄. Note that, to prove this result, we used only that X1, . . . , Xn are i.i.d. Specifically, we did not use that
X1 ∼ N(θ, 1).
(c) Note that
$R_{X_1}(\theta_0) = E[(X_1 - \theta_0)^2|\theta_0] = V[X_1|\theta_0] = 1$,   since E[X1|θ0] = θ0.
Similarly,
$R_{\bar{X}}(\theta_0) = E[(\bar{X} - \theta_0)^2|\theta_0] = V[\bar{X}|\theta_0] = n^{-1}$,   since E[X̄|θ0] = θ0.
Therefore, if n ≥ 2, then RX̄(θ0) < RX1(θ0), for every θ0 ∈ R. Conclude that X1 is inadmissible in
D since X̄ dominates X1.
Exercise 2.50. Let X1, X2, X3 be i.i.d. and such that Xi ∼ U(α, β), where θ = (α, β). Let $L(a, (\alpha,\beta)) = \left(a - \frac{\alpha+\beta}{2}\right)^2$.
(a) Compute RX¯ .
(b) Determine a sufficient statistic for θ.
(c) Is X¯ admissible in D = {δ : X → R}?
2.5 Optimality criteria
Definition 2.51 (Bayes risk). Let S be a statistical decision problem with parameter θ, f(θ0) be a probability
density function over Θ and δ be a decision function. The Bayes risk of δ according to f(θ0), rδ,f(θ0), is
$r_{\delta,f(\theta_0)} = \int_\Theta R_\delta(\theta_0)\,f(\theta_0)\,d\theta_0$
Example 2.52. Consider table 6 and that f(R¯) = 0.7 and f(R) = 0.3. Obtain
rδ1,f = Rδ1(R¯)f(R¯) + Rδ1(R)f(R) = 0 · 0.7 + 1 · 0.3 = 0.3
rδ2,f = 0.03 · 0.7 + 0.76 · 0.3 = 0.249
rδ3,f = 0.27 · 0.7 + 0.94 · 0.3 = 0.471
rδ4,f = 0.3 · 0.7 + 0.7 · 0.3 = 0.42
Definition 2.53 (Bayes decision). Let S be a statistical decision problem, D be a collection of decision functions
and δ∗ ∈ D. δ∗ is a Bayes decision in D according to f(θ0) if, for every δ ∈ D,
rδ∗,f(θ0) ≤ rδ,f(θ0).
Example 2.54. Let D = {δ1, δ2, δ3, δ4} in Example 2.52. Note that
rδ2,f < rδ1,f < rδ4,f < rδ3,f
Therefore, δ2 is a Bayes decision in D according to f .
Definition 2.55 (Minimax risk). Let S be a statistical decision problem with parameter θ and δ be a decision
function. The minimax risk of δ, mδ, is
$m_\delta = \sup_{\theta\in\Theta} R_\delta(\theta)$
Example 2.56. Consider table 6. Obtain:
mδ1 = max(Rδ1(R¯), Rδ1(R)) = max(0, 1) = 1
mδ2 = max(0.03, 0.76) = 0.76
mδ3 = max(0.27, 0.94) = 0.94
mδ4 = max(0.3, 0.7) = 0.7
Definition 2.57 (Minimax decision). Let S be a statistical decision problem with parameter θ and D be a
collection of decision functions. A decision function δ∗ ∈ D is minimax in D if, for every δ ∈ D,
mδ∗ ≤ mδ.
Example 2.58. Consider Example 2.56. Since
mδ4 < mδ2 < mδ3 < mδ1 ,
δ4 is a minimax decision in D = {δ1, δ2, δ3, δ4}.
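Both criteria can be computed directly from table 6; a short sketch, using the prior f(R̄) = 0.7, f(R) = 0.3 of Example 2.52:

```python
# Risks from Table 6: (R_delta(Rbar), R_delta(R)).
risks = {"d1": (0.0, 1.0), "d2": (0.03, 0.76), "d3": (0.27, 0.94), "d4": (0.3, 0.7)}
prior = (0.7, 0.3)  # f(Rbar), f(R)

bayes_risk = {d: r[0] * prior[0] + r[1] * prior[1] for d, r in risks.items()}
minimax_risk = {d: max(r) for d, r in risks.items()}

print(min(bayes_risk, key=bayes_risk.get))      # d2: the Bayes decision
print(min(minimax_risk, key=minimax_risk.get))  # d4: the minimax decision
```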
Exercises
Exercise 2.59. Consider the statistical decision problem in Exercise 2.25 when p = 1 − q = 0.1. Let D be the
collection of all decision functions, δ1, . . . , δ9.
(a) Find a Bayes decision in D according to f(θ = 0.1) = f(θ = 0.9) = 0.5.
(b) Find a minimax decision in D.
2.6 How to find a Bayes decision rule?
Definition 2.60 (Posterior distribution). Let M be a statistical model with parameter θ and f(θ0) be a probability
density function over Θ. For every x ∈ X , the posterior distribution of θ given x according to f(θ0), f(θ0|x), is
$f(\theta_0|x) = \frac{f(\theta_0)\,f(x|\theta_0)}{f(x)}$, where   (Theorem 1.86)
$f(x) = \int_\Theta f(\theta_0)\,f(x|\theta_0)\,d\theta_0$   (Theorem 1.85)
Alternatively, one can write θ ∼ f(θ0) and θ|x ∼ f(θ0|x).
Example 2.61. Let X1, . . . , Xn be i.i.d. and Xi ∼ Bernoulli(θ). If θ ∼ Beta(a, b), then
$f(\theta_0|x) = \frac{f(\theta_0)\,f(x|\theta_0)}{\int_0^1 f(x|\theta_0)\,f(\theta_0)\,d\theta_0}$
$= \frac{f(\theta_0)\,\theta_0^{n\bar{x}}(1-\theta_0)^{n(1-\bar{x})}}{\int_0^1 f(\theta_0)\,\theta_0^{n\bar{x}}(1-\theta_0)^{n(1-\bar{x})}\,d\theta_0}$   (Example 2.9)
$= \frac{\beta^{-1}(a,b)\,\theta_0^{a-1}(1-\theta_0)^{b-1}\,\theta_0^{n\bar{x}}(1-\theta_0)^{n(1-\bar{x})}}{\int_0^1 \beta^{-1}(a,b)\,\theta_0^{a-1}(1-\theta_0)^{b-1}\,\theta_0^{n\bar{x}}(1-\theta_0)^{n(1-\bar{x})}\,d\theta_0}$   (table 3)
$= \frac{\theta_0^{a+n\bar{x}-1}(1-\theta_0)^{b+n(1-\bar{x})-1}}{\int_0^1 \theta_0^{a+n\bar{x}-1}(1-\theta_0)^{b+n(1-\bar{x})-1}\,d\theta_0}$
$= \beta^{-1}(a+n\bar{x},\, b+n(1-\bar{x}))\,\theta_0^{a+n\bar{x}-1}(1-\theta_0)^{b+n(1-\bar{x})-1}$
Conclude that θ|x1, . . . , xn ∼ Beta(a + nx̄, b + n(1 − x̄)).
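A minimal sketch of this conjugate update; the data and the prior hyperparameters a, b below are arbitrary illustrative choices.

```python
# Beta(a, b) prior for a Bernoulli(theta) sample: the posterior is
# Beta(a + n*xbar, b + n*(1 - xbar)), i.e. a + #successes, b + #failures.
a, b = 2.0, 2.0
data = [1, 0, 1, 1, 0, 1, 1, 1]               # observed Bernoulli outcomes

n = len(data)
successes = sum(data)
a_post = a + successes
b_post = b + (n - successes)

posterior_mean = a_post / (a_post + b_post)   # E[theta | x], the Bayes estimate under quadratic loss
print(a_post, b_post, posterior_mean)         # 8.0 4.0 0.666...
```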
Example 2.62. Let X1, . . . , Xn be i.i.d. and Xi ∼ N(θ, σ0²), where σ0² ∈ R+∗ is known. If θ ∼ N(a, b²), then
$\theta|x_1,\ldots,x_n \sim N\left(\frac{b^{-2}a + n\sigma_0^{-2}\bar{x}}{b^{-2} + n\sigma_0^{-2}},\; (b^{-2} + n\sigma_0^{-2})^{-1}\right)$
Definition 2.63 (Posterior loss). Let S be a statistical decision problem with parameter θ and f(θ0) be a density
over Θ. For every decision function, δ, the posterior loss of δ according to f(θ0) given x, lδ,f(θ0|x), is
$l_{\delta,f(\theta_0|x)} = E[L(\delta(X), \theta)|X = x] = \int_\Theta L(\delta(x), \theta_0)\,f(\theta_0|x)\,d\theta_0$
Example 2.64. In Example 2.61, let A = [0, 1] and L(a, θ0) = (a − θ0)². Let $\delta(X) = \frac{a + n\bar{X}}{a+b+n}$. Note that
δ(X) = E[θ|X]. Therefore,
$l_{\delta,f(\theta_0|x)} = E[L(\delta(X), \theta)|X = x] = E[(\theta - E[\theta|X])^2|X = x] = V[\theta|X = x] = \frac{(a+n\bar{x})(b+n(1-\bar{x}))}{(a+b+n)^2(a+b+n+1)}$
Lemma 2.65 (Extensive form).
$r_{\delta,f(\theta_0)} = \int_{\mathcal{X}} l_{\delta,f(\theta_0|x)}\,f(x)\,dx$
Proof.
$r_{\delta,f(\theta_0)} = \int_\Theta R_\delta(\theta_0)\,f(\theta_0)\,d\theta_0$   (Definition 2.51)
$= \int_\Theta \int_{\mathcal{X}} L(\delta(x), \theta_0)\,f(x|\theta_0)\,f(\theta_0)\,dx\,d\theta_0$
$= \int_\Theta \int_{\mathcal{X}} L(\delta(x), \theta_0)\,f(x, \theta_0)\,dx\,d\theta_0$   (Definition 1.83)
$= \int_{\mathcal{X}} \int_\Theta L(\delta(x), \theta_0)\,f(x, \theta_0)\,d\theta_0\,dx$
$= \int_{\mathcal{X}} \int_\Theta L(\delta(x), \theta_0)\,f(\theta_0|x)\,f(x)\,d\theta_0\,dx$   (Definition 1.83)
$= \int_{\mathcal{X}} l_{\delta,f(\theta_0|x)}\,f(x)\,dx$   (Definition 2.63)
Lemma 2.66. Let S be a statistical decision problem with parameter θ, f(θ0) be a probability density function
over Θ, D be a collection of decision functions and δ∗ ∈ D. If, for every x ∈ X and δ ∈ D,
$l_{\delta^*,f(\theta_0|x)} \leq l_{\delta,f(\theta_0|x)}$,
then δ∗ is a Bayes decision in D according to f(θ0).
Proof. Assume that, for every x ∈ X and δ ∈ D,
$l_{\delta^*,f(\theta_0|x)} \leq l_{\delta,f(\theta_0|x)}$.
Therefore,
$\int_{\mathcal{X}} l_{\delta^*,f(\theta_0|x)}\,f(x)\,dx \leq \int_{\mathcal{X}} l_{\delta,f(\theta_0|x)}\,f(x)\,dx$
$r_{\delta^*,f(\theta_0)} \leq r_{\delta,f(\theta_0)}$   (Lemma 2.65)
Conclude that δ∗ is a Bayes decision in D according to f(θ0).
Theorem 2.67. Let S be a statistical decision problem with parameter θ, f(θ0) be a probability density function
over Θ and D be a collection of decision functions. If δ∗ ∈ D is such that, for every x ∈ X ,
$\delta^*(x) = \arg\min_{a\in\mathcal{A}} \int_\Theta L(a, \theta_0)\,f(\theta_0|x)\,d\theta_0$,
then δ∗ is a Bayes decision in D according to f(θ0).
Proof. For every x ∈ X and δ ∈ D,
$l_{\delta^*,f(\theta_0|x)} = \int_\Theta L(\delta^*(x), \theta_0)\,f(\theta_0|x)\,d\theta_0$   (Definition 2.63)
$= \min_{a\in\mathcal{A}} \int_\Theta L(a, \theta_0)\,f(\theta_0|x)\,d\theta_0$
$\leq \int_\Theta L(\delta(x), \theta_0)\,f(\theta_0|x)\,d\theta_0 = l_{\delta,f(\theta_0|x)}$
Conclude from Lemma 2.66 that δ∗ is a Bayes decision rule in D according to f(θ0).
Example 2.68. Consider a statistical model such that X|θ ∼ Bernoulli(θ), θ ∈ {0.2, 0.9} and P(θ = 0.2) = 0.5.
Also, let A = {a1, a2, a3, a4} and L(a, θ0) be as in table 7.
θ0 \ a ∈ A     a1     a2     a3     a4
0.2            1      0      0.8    0.6
0.9            0      2      0.1    0.3
Table 7: Loss function, L(a, θ0), for Example 2.68.
The posterior distribution is such that
$P(\theta = 0.2|X = 0) = \frac{P(\theta = 0.2)P(X = 0|\theta = 0.2)}{P(\theta = 0.2)P(X = 0|\theta = 0.2) + P(\theta = 0.9)P(X = 0|\theta = 0.9)} = \frac{0.5\cdot 0.8}{0.5\cdot 0.8 + 0.5\cdot 0.1} = \frac{8}{9}$
$P(\theta = 0.2|X = 1) = \frac{P(\theta = 0.2)P(X = 1|\theta = 0.2)}{P(\theta = 0.2)P(X = 1|\theta = 0.2) + P(\theta = 0.9)P(X = 1|\theta = 0.9)} = \frac{0.5\cdot 0.2}{0.5\cdot 0.2 + 0.5\cdot 0.9} = \frac{2}{11}$
Next, for each x ∈ X , we can compute $l_{a,f(\theta_0|x)} = \int_\Theta L(a, \theta_0)\,f(\theta_0|x)\,d\theta_0$ and determine $\min_{a\in\mathcal{A}} \int_\Theta L(a, \theta_0)\,f(\theta_0|x)\,d\theta_0$. For instance,
$l_{a_1,f(\theta_0|0)} = L(a_1, 0.2)P(\theta = 0.2|X = 0) + L(a_1, 0.9)P(\theta = 0.9|X = 0) = 1\cdot\frac{8}{9} + 0\cdot\frac{1}{9} = \frac{8}{9}$
$l_{a_2,f(\theta_0|0)} = 0\cdot\frac{8}{9} + 2\cdot\frac{1}{9} = \frac{2}{9}$
$l_{a_3,f(\theta_0|0)} = 0.8\cdot\frac{8}{9} + 0.1\cdot\frac{1}{9} = \frac{6.5}{9}$
$l_{a_4,f(\theta_0|0)} = 0.6\cdot\frac{8}{9} + 0.3\cdot\frac{1}{9} = \frac{5.1}{9}$
Hence, when X = 0, the posterior loss is minimized by a2.
Theorem 2.69. If L(a, θ0) = (a − θ0)² and D = R^X , then E[θ|X] is a Bayes decision in D.
Proof. For every δ ∈ D,
$l_{\delta,f(\theta_0|x)} = E[L(\delta(X), \theta)|X = x]$   (Definition 2.63)
$= E[(\delta(x) - \theta)^2|X = x]$
$= V[\theta|X = x] + (E[\theta|X = x] - \delta(x))^2$   (Lemma 1.98)
$= E[(\theta - E[\theta|X = x])^2|X = x] + (E[\theta|X = x] - \delta(x))^2$
$\geq E[(\theta - E[\theta|X = x])^2|X = x] = l_{E[\theta|X],f(\theta_0|x)}$
Conclude from Theorem 2.67 that E[θ|X] is a Bayes decision in D.
Exercises
Exercise 2.70. In Example 2.52, for every x ∈ X compute the posterior loss for δ1, δ2, δ3 and δ4. Use Theorem 2.67
to find a Bayes decision in D = {δ1, δ2, δ3, δ4}.
Exercise 2.71. Let X1, . . . , Xn be i.i.d. with Xi ∼ N(0, τ−2). If L(a, θ0) = (a− θ0)2 and τ−2 ∼ Gamma(a, b),
(a) find f(θ0|x1, . . . , xn).
(b) find a Bayes decision.
2.7 How to find a minimax decision rule?
Lemma 2.72. Let S be a statistical decision problem and D be a collection of decision functions. If there exists
a prior, f(θ0), such that δ∗ ∈ D is Bayes in D according to f(θ0), then, for every δ ∈ D,
$r_{\delta^*,f(\theta_0)} \leq m_\delta$
Proof. For every δ ∈ D,
$r_{\delta^*,f(\theta_0)} \leq r_{\delta,f(\theta_0)}$   (Definition 2.53)
$= \int_\Theta R_\delta(\theta_0)\,f(\theta_0)\,d\theta_0$   (Definition 2.51)
$\leq \sup_{\theta_0\in\Theta} R_\delta(\theta_0) = m_\delta$   (Definition 2.55)
Theorem 2.73. If δ∗ is a Bayes decision in D according to f(θ0) and Rδ∗(θ0) is constant over θ0, then δ∗ is
minimax in D.
Proof. For every δ ∈ D,
$m_{\delta^*} = \sup_{\theta_0} R_{\delta^*}(\theta_0)$
$= R_{\delta^*}(\theta_0)$   (Rδ∗(θ0) is constant)
$= \int_\Theta R_{\delta^*}(\theta_0)\,f(\theta_0)\,d\theta_0$   (Rδ∗(θ0) is constant)
$= r_{\delta^*,f(\theta_0)}$   (Definition 2.51)
$\leq m_\delta$   (Lemma 2.72)
Conclude that δ∗ is a minimax decision (Definition 2.57).
Example 2.74. Consider that M is the statistical model in Example 2.61. Let θ ∼ Beta(a, b). It follows from
Example 2.61 that θ|X ∼ Beta(a + nX̄, b + n(1 − X̄)). Also, let A = [0, 1], D = A^X (the collection of all
functions from X to A) and L(a, θ0) = (a − θ0)². It follows from Theorem 2.69 that E[θ|X] is a Bayes decision in D.
Since θ|X ∼ Beta(a + nX̄, b + n(1 − X̄)), conclude that $\delta = E[\theta|X] = \frac{a + n\bar{X}}{a+b+n}$ is a Bayes decision in D according to
θ ∼ Beta(a, b). The risk of δ is given by
$R_\delta(\theta_0) = E[L(\delta(X), \theta_0)|\theta_0]$   (Definition 2.22)
$= E\left[\left(\frac{a + n\bar{X}}{a+b+n} - \theta_0\right)^2 \middle|\, \theta_0\right]$
$= V\left[\frac{a + n\bar{X}}{a+b+n} \middle|\, \theta_0\right] + \left(E\left[\frac{a + n\bar{X}}{a+b+n} \middle|\, \theta_0\right] - \theta_0\right)^2$
$= \frac{n\theta_0(1-\theta_0)}{(a+b+n)^2} + \left(\frac{a + n\theta_0}{a+b+n} - \theta_0\right)^2$   (nX̄ ∼ Binomial(n, θ0))
$= \frac{-n\theta_0^2 + n\theta_0}{(a+b+n)^2} + \frac{(a - (a+b)\theta_0)^2}{(a+b+n)^2}$
$= \frac{-n\theta_0^2 + n\theta_0 + a^2 - 2a(a+b)\theta_0 + (a+b)^2\theta_0^2}{(a+b+n)^2}$
$= \frac{((a+b)^2 - n)\theta_0^2 + (n - 2a(a+b))\theta_0 + a^2}{(a+b+n)^2}$
Note that, if (a + b)² − n = 0 and n − 2a(a + b) = 0, then $R_\delta(\theta_0) = \frac{a^2}{(a+b+n)^2}$, which is constant over θ0. The first
equation is satisfied if $b = \sqrt{n} - a$. Plugging this solution into the second equation, obtain
$n - 2a(a + \sqrt{n} - a) = 0 \;\Rightarrow\; 2a\sqrt{n} = n \;\Rightarrow\; a = 2^{-1}\sqrt{n}$
Since $b = \sqrt{n} - a$, obtain $b = 2^{-1}\sqrt{n}$.
Conclude that, if θ ∼ Beta(2⁻¹√n, 2⁻¹√n), then $\delta^* = E[\theta|X] = \frac{2^{-1}\sqrt{n} + n\bar{X}}{\sqrt{n} + n}$ is a Bayes decision in D. Further-
more, $R_{\delta^*}(\theta_0) = \frac{n}{4(\sqrt{n}+n)^2}$ is constant over θ0. Since δ∗ is a Bayes decision in D with constant risk, conclude from
Theorem 2.73 that δ∗ is minimax in D.
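A quick numeric check of the constant-risk claim, evaluating Rδ∗(θ0) at a few values of θ0; the sample size n and the grid of θ0 values are arbitrary choices for the illustration.

```python
import math

n = 25
a = b = 0.5 * math.sqrt(n)   # prior Beta(sqrt(n)/2, sqrt(n)/2)

def risk(theta0):
    # R_delta*(theta0) = [((a+b)^2 - n) theta0^2 + (n - 2a(a+b)) theta0 + a^2] / (a+b+n)^2
    num = ((a + b) ** 2 - n) * theta0 ** 2 + (n - 2 * a * (a + b)) * theta0 + a ** 2
    return num / (a + b + n) ** 2

for t in [0.1, 0.3, 0.5, 0.9]:
    print(round(risk(t), 6))
# All values coincide with n / (4 * (sqrt(n) + n)^2) = 1 / (4 * (1 + sqrt(n))^2)
print(1 / (4 * (1 + math.sqrt(n)) ** 2))
```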
Theorem 2.75. Let S be a statistical decision problem, D be a collection of decision functions and f1(θ0), . . . , fn(θ0), . . .
be a sequence of probability density functions over Θ. If δ1, . . . , δn, . . . is a collection of decision functions in D
such that δn is a Bayes decision in D according to fn(θ0), δ∗ ∈ D is such that Rδ∗(θ0) is constant over Θ and
$\lim_{n\to\infty} r_{\delta_n,f_n(\theta_0)} = R_{\delta^*}(\theta_0)$, then δ∗ is a minimax decision.
Proof. For every δ ∈ D,
$r_{\delta_n,f_n(\theta_0)} \leq m_\delta$   (Lemma 2.72)
$\lim_{n\to\infty} r_{\delta_n,f_n(\theta_0)} \leq \lim_{n\to\infty} m_\delta = m_\delta$
$R_{\delta^*}(\theta_0) \leq m_\delta$
$\sup_{\theta_0\in\Theta} R_{\delta^*}(\theta_0) \leq m_\delta$   (Rδ∗(θ0) is constant)
$m_{\delta^*} \leq m_\delta$   (Definition 2.55)
Conclude that δ∗ is a minimax decision in D (Definition 2.57).
Example 2.76. Let X1, . . . , Xn be i.i.d., X1 ∼ N(θ, σ0²), where σ0² ∈ R+∗ is known and θ ∈ R, A = R, L(a, θ0) =
(a − θ0)², and D = {δ : X → A}. We will prove that X̄ is minimax in D.
Consider the sequence of priors θ ∼ N(0, m²), indexed by m. Conclude from Example 2.62 that
$\theta|x_1,\ldots,x_n \sim N\left(\frac{n\sigma_0^{-2}\bar{x}}{m^{-2} + n\sigma_0^{-2}},\; (m^{-2} + n\sigma_0^{-2})^{-1}\right)$.
Therefore, it follows from Theorem 2.69 that $\delta_m = E[\theta|X] = \frac{n\sigma_0^{-2}\bar{X}}{m^{-2} + n\sigma_0^{-2}}$ is a Bayes decision function in D and
$r_{\delta_m,f_m(\theta_0)} = E[(E[\theta|X] - \theta)^2] = E[E[(E[\theta|X] - \theta)^2|X]] = E[V[\theta|X]] = E[(m^{-2} + n\sigma_0^{-2})^{-1}] = (m^{-2} + n\sigma_0^{-2})^{-1}$
Finally, note that
$R_{\bar{X}}(\theta_0) = E[(\bar{X} - \theta_0)^2|\theta_0] = V[\bar{X}|\theta_0] = n^{-1}\sigma_0^2$
Since RX̄(θ0) is constant over Θ, $\lim_{m\to\infty} r_{\delta_m,f_m(\theta_0)} = R_{\bar{X}}(\theta_0)$, and each δm is a Bayes decision in D, conclude from
Theorem 2.75 that X̄ is minimax in D.
Exercises
Exercise 2.77. Let X ∼ Geom(θ), A = R and $L(a, \theta_0) = \frac{(\theta_0 - a)^2}{\theta_0(1-\theta_0)}$. Prove that δ∗ = I(X = 0) is minimax in
D = {δ : X → A}.
3 Review of statistical decision theory
Exercise 3.1. Let X1, . . . , Xn be i.i.d. and X1 ∼ Poisson(θ). Also, L(a, θ0) = (a− θ0)2 and D = {δ : R→ R}.
(a) Find a sufficient statistic for θ.
(b) If θ ∼ Gamma(a, b), determine f(θ|x).
(c) Find a Bayes decision in D when θ ∼ Gamma(a, b).
(d) Find a minimax decision in D.
Solution:
(a)
$f(x_1, \ldots, x_n|\theta_0) = \prod_{i=1}^n f(x_i|\theta_0) = \prod_{i=1}^n \frac{\exp(-\theta_0)\theta_0^{x_i}}{x_i!} = \frac{\exp(-n\theta_0)\,\theta_0^{n\bar{x}}}{\prod_{i=1}^n x_i!} = \left(\prod_{i=1}^n x_i!\right)^{-1}\theta_0^{n\bar{x}}\exp(-n\theta_0)$
Conclude from Theorem 2.41 that X̄ is a sufficient statistic.
(b)
$f(\theta_0|x_1, \ldots, x_n) = \frac{f(\theta_0)\,f(x_1,\ldots,x_n|\theta_0)}{\int f(\theta)\,f(x_1,\ldots,x_n|\theta)\,d\theta}$   (Theorem 1.86)
$= \frac{\frac{b^a}{\Gamma(a)}\theta_0^{a-1}\exp(-b\theta_0)\left(\prod_{i=1}^n x_i!\right)^{-1}\theta_0^{n\bar{x}}\exp(-n\theta_0)}{\int \frac{b^a}{\Gamma(a)}\theta^{a-1}\exp(-b\theta)\left(\prod_{i=1}^n x_i!\right)^{-1}\theta^{n\bar{x}}\exp(-n\theta)\,d\theta}$
$= \frac{\theta_0^{a+n\bar{x}-1}\exp(-(b+n)\theta_0)}{\int \theta^{a+n\bar{x}-1}\exp(-(b+n)\theta)\,d\theta}$
$= \frac{(b+n)^{a+n\bar{x}}}{\Gamma(a+n\bar{x})}\,\theta_0^{a+n\bar{x}-1}\exp(-(b+n)\theta_0)$
Conclude that θ|x1, . . . , xn ∼ Gamma(a + nx̄, b + n).
(c) It follows from Theorem 2.69 that $\delta^*_{a,b} = E[\theta|X] = \frac{a + n\bar{X}}{b + n}$ is a Bayes decision function in D.
(d) Note that
$r_{\delta^*_{a,b},f(\theta_0)} = E[E[L(\delta^*_{a,b}, \theta)|\theta]]$
$= E\left[E\left[\left(\frac{a + n\bar{X}}{b+n} - \theta\right)^2 \middle|\, \theta\right]\right]$
$= E\left[\left(E\left[\frac{a + n\bar{X}}{b+n} \middle|\, \theta\right] - \theta\right)^2 + V\left[\frac{a + n\bar{X}}{b+n} \middle|\, \theta\right]\right]$
$= E\left[\left(\frac{a + n\theta}{b+n} - \theta\right)^2 + \frac{n\theta}{(b+n)^2}\right]$
$= E\left[\frac{(a - b\theta)^2 + n\theta}{(b+n)^2}\right]$
$= \frac{E[a^2 - 2ab\theta + b^2\theta^2 + n\theta]}{(b+n)^2}$
$= \frac{a^2 - 2a^2 + b^2\left(\frac{a}{b^2} + \frac{a^2}{b^2}\right) + \frac{na}{b}}{(b+n)^2}$
$= \frac{a + \frac{na}{b}}{(b+n)^2} = \frac{a}{b(b+n)}$
References
Bickel, P. J. and Doksum, K. A. (2015). Mathematical statistics: basic ideas and selected topics, volume 2. CRC
Press.
Casella, G. and Berger, R. L. (2002). Statistical inference, volume 2. Duxbury Pacific Grove.
Shao, J. (2003). Mathematical statistics, volume 1. Springer.