Lecture 1: Discrete random variables
Course: Mathematical Statistics
Term: Fall 2018
Instructor: Gordan Žitković
Last Updated: September 25, 2019

1.1 Random Variables

A large chunk of probability is about random variables. Instead of giving a precise definition, let us just mention that a random variable can be thought of as an uncertain (usually numerical, i.e., with values in R, but not always) quantity. While it is true that we do not know with certainty what value a random variable Y will take, we usually know how to assign a number - the probability - that its value will be in some subset of R. (We will not worry about measurability and similar subtleties in this class.) For example, we might be interested in P[Y ≥ 7], P[Y ∈ [2, 3.1]] or P[Y ∈ {1, 2, 3}].

Random variables are usually divided into discrete and continuous, even though there exist random variables which are neither discrete nor continuous. Those can be safely neglected for the purposes of this course, but they play an important role in many areas of probability and statistics.

1.2 Discrete random variables

Before we define discrete random variables, we need some vocabulary.

Definition 1.2.1. Given a set B, we say that the random variable Y is B-valued if P[Y ∈ B] = 1.

In words, Y is B-valued if we know for a fact that Y will never take a value outside of B.

Definition 1.2.2. A random variable is said to be discrete if there exists a set S such that S is either finite or countable and Y is S-valued. (Countable means that its elements can be enumerated by the natural numbers. The only infinite countable sets we will need are N = {1, 2, . . . } and N0 = {0, 1, 2, . . . }.)

Definition 1.2.3. The support S_Y of the discrete random variable Y is the smallest set S such that Y is S-valued.

Example 1.2.4. A die is thrown and the number obtained is recorded and denoted by Y.
The possible values of Y are S = {1, 2, 3, 4, 5, 6} and each happens with probability 1/6, so Y is certainly S-valued. Since S is finite, Y is discrete. One still needs to argue that S is the support S_Y of Y. The alternative would be that S_Y is a proper subset of S, i.e., that there are redundant elements in S. This is not the case, since all elements of S are "important", i.e., happen with positive probability: if we remove anything from S, we are omitting a possible value of Y. On the other hand, it is certainly true that Y always takes its values in the finite set S′ = {1, 2, 3, 4, 5, 6, 7}, i.e., that Y is S′-valued. One has to be careful with the terminology here: it is correct to say that Y is an S′-valued (or even N-valued) random variable, even though it only takes the values 1, 2, . . . , 6 with positive probability.

Discrete random variables are very nice due to the following fact: in order to be able to compute any conceivable probability involving a discrete random variable Y, it is enough to know how to compute the probabilities P[Y = y] for all y ∈ S_Y. Indeed, if you are interested in figuring out what P[Y ∈ B] is, for some set B ⊆ R (e.g., B = {5, 6, 7}, B = [3, 6], or B = [−2, ∞)), we simply pick all y ∈ S_Y which are also in B and sum their probabilities. In mathematical notation, we have

   P[Y ∈ B] = ∑_{y ∈ S_Y ∩ B} P[Y = y].   (1.2.1)

Definition 1.2.5. The probability mass function (pmf) of a discrete random variable Y is the function p_Y defined on the support S_Y of Y by

   p_Y(y) = P[Y = y], y ∈ S_Y.

In practice, we usually present the pmf p_Y in the form of a table (called the distribution table):

   Y ∼  y       y1   y2   y3   . . .
        p_Y(y)  p1   p2   p3   . . .

or, simply,

   Y ∼  y1   y2   y3   . . .
        p1   p2   p3   . . . ,

where the top row lists all the elements y of the support S_Y of Y, and the bottom row lists their probabilities p_Y(y) = P[Y = y].
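Formula (1.2.1) translates directly into a few lines of code. The sketch below (Python, with hypothetical names pmf_die and prob_in; not part of the notes themselves) represents a distribution table as a dictionary from support points to probabilities and computes P[Y ∈ B] by summing over S_Y ∩ B:

```python
# A pmf stored as {support point: probability}; here, a fair die.
pmf_die = {y: 1/6 for y in range(1, 7)}

def prob_in(pmf, B):
    """P[Y in B] via (1.2.1): sum p_Y(y) over the support points that lie in B."""
    return sum(p for y, p in pmf.items() if y in B)

# Sanity check: the probabilities over the whole support sum to 1.
assert abs(sum(pmf_die.values()) - 1) < 1e-12

# Points of B outside the support (here, 7) simply contribute nothing.
p = prob_in(pmf_die, {5, 6, 7})
```

Note that only the support points of Y matter: any part of B outside S_Y is ignored by the sum, exactly as in (1.2.1).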
It is easy to see that the function p_Y has the following properties:

1. p_Y(y) ∈ [0, 1] for all y, and
2. ∑_{y ∈ S_Y} p_Y(y) = 1.

Here is a first round of examples of discrete random variables and their supports.

Example 1.2.6.

1. A fair (unbiased) coin is tossed and the value observed is denoted by Y. Since the only possible values Y can take are H or T, and the set S = {H, T} is clearly finite, Y is a discrete random variable. Its distribution is given by the following table:

   y       H    T
   p_Y(y)  1/2  1/2

Both H and T are possible (each happens with probability 1/2), so no smaller set S will have the property that P[Y ∈ S] = 1. Consequently, the support S_Y of Y is S = {H, T}.

2. A die is thrown and the number obtained is recorded and denoted by Y. The possible values of Y are {1, 2, 3, 4, 5, 6} and each happens with probability 1/6, so Y is discrete with support S_Y = {1, 2, 3, 4, 5, 6}. Its distribution is given by the table

   y       1    2    3    4    5    6
   p_Y(y)  1/6  1/6  1/6  1/6  1/6  1/6

3. A fair coin is tossed repeatedly until the first H is observed; the number of Ts observed before that is denoted by Y. In this case we know that Y can take any of the values N0 = {0, 1, 2, . . . } and that there is no finite upper bound for it. Nevertheless, we know that Y cannot take values that are not non-negative integers. Therefore, Y is N0-valued and, in fact, S_Y = N0 is its support. Indeed, we have P[Y = y] = 2^(−y−1) for y ∈ N0, i.e.,

   Y ∼  y       0    1    2    . . .
        p_Y(y)  1/2  1/4  1/8  . . .

4. A card is drawn randomly from a standard deck, and the result is denoted by Y. This example is similar to 2. above, since Y takes one of finitely many values, and all values are equally likely. The difference is that the result is not a number anymore. The set S of all possible values can be represented as the set of all pairs like (♠, 7), where the first entry denotes the picked card's suit (in {♥, ♠, ♣, ♦}), and the second is a number between 1 and 13.
It is, of course, possible to use different conventions and use the set {2, 3, . . . , 9, 10, J, Q, K, A} for the second component. The point is that the values Y takes are not numbers.

1.3 Events and Bernoulli random variables

Random variables Y which can only take one of the two values 0 and 1, i.e., for which S_Y ⊆ {0, 1}, are called indicators or Bernoulli random variables, and are very useful in probability and statistics (and elsewhere). The name comes from the fact that you should think of such variables as signal lights: if Y = 1 an event of interest has happened, and if Y = 0 it has not. In other words, Y indicates the occurrence of an event.

One reason Bernoulli random variables are so useful is that they let us manipulate events without ever leaving the language of random variables. Here is an example:

Example 1.3.1. Suppose that two dice are thrown so that Y1 and Y2 are the numbers obtained (both Y1 and Y2 are discrete random variables with S_{Y1} = S_{Y2} = {1, 2, 3, 4, 5, 6}). If we are interested in the probability that their sum is at least 9, we proceed as follows. We define the random variable W - the sum of Y1 and Y2 - by W = Y1 + Y2. Another random variable, let us call it Y, is a Bernoulli random variable defined by

   Y = 1 if W ≥ 9, and Y = 0 if W < 9.

With such a set-up, Y signals whether the event of interest has happened, and we can state our original problem in terms of Y, namely "Compute P[Y = 1]!".

This example is, admittedly, a little contrived. The point, however, is that anything can be phrased in terms of random variables; thus, if you know how to work with random variables, i.e., know how to compute their distributions, you can solve any problem in probability that comes your way.

Another reason Bernoulli random variables are useful is the fact that we can do arithmetic with them.

Example 1.3.2. 70 coins are tossed and their outcomes are denoted by W1, W2, . . .
, W70. All Wi are random variables with values in {H, T} (and therefore not Bernoulli random variables), but they can be easily recoded into Bernoulli random variables as follows:

   Yi = 1 if Wi = H, and Yi = 0 if Wi = T.

Once you have the "dictionary" {1 ↔ H, 0 ↔ T}, the random variables Yi and Wi carry exactly the same information. The advantage of using Yi is that the random variable

   N = Y1 + Y2 + · · · + Y70,

which takes values in S_N = {0, 1, 2, . . . , 70}, counts the number of heads among W1, . . . , W70. Similarly, the random variable

   M = Y1 × Y2 × · · · × Y70

is a Bernoulli random variable itself. What event does it indicate?

1.4 Some widely used discrete random variables

The distribution of a random variable is sometimes defined as "the collection of all possible probabilities associated to it". This sounds a bit abstract and, at least in the discrete case, obscures the practical significance of this important concept. We have learned that for discrete random variables the knowledge of the pmf or the distribution table (such as the one in part 1., 2. or 3. of Example 1.2.6) amounts to the knowledge of the whole distribution.

It turns out that many random variables in widely different contexts come with the same (or similar) distribution tables, and that some of those appear so often that they deserve to be named (so that we don't have to write the distribution table every time). The following example lists some of those named distributions. There are many others, but we will not need them in these notes.

Example 1.4.1.

1. Bernoulli distribution. We have already encountered this distribution in our discussion of indicator random variables above. It is characterized by the distribution table

   0      1
   1 − p  p        (1.4.1)

where p can be any number in (0, 1).
Strictly speaking, each value of p defines a different distribution, so it would be more correct to speak of a parametric family of distributions, with p ∈ (0, 1) being the parameter. In order not to write down the table (1.4.1) every time, we also use the notation Y ∼ B(p). For example, the Bernoulli random variable which takes the value 1 when a fair coin falls H and 0 when it falls T has a B(1/2)-distribution.

An experiment (random occurrence) which can end in two possible ways (usually called success and failure, even though those names should not always be taken literally) is often called a Bernoulli trial. If we encode success as 1 and failure as 0, each Bernoulli trial gives rise to a Bernoulli random variable.

2. Binomial distribution. A random variable whose distribution table looks like this

   0     1                        . . .  n − 1                      n
   q^n   (n choose 1) p q^(n−1)   . . .  (n choose n−1) p^(n−1) q   p^n

for some n ∈ N, p ∈ (0, 1) and q = 1 − p, i.e., whose pmf is p_Y(k) = (n choose k) p^k q^(n−k) for k = 0, 1, . . . , n, is said to have the binomial distribution, usually denoted by b(n, p). Remember that the binomial coefficient (n choose k) is given by

   (n choose k) = n! / (k! (n − k)!), where n! = n(n − 1)(n − 2) · . . . · 2 · 1.

Binomial distributions form a parametric family with two parameters, n ∈ N and p ∈ (0, 1), and each pair (n, p) corresponds to a different binomial distribution.

Figure 1. The probability mass function (pmf) of a typical binomial distribution.

Recall that the binomial distribution arises as the "number of successes in n independent Bernoulli trials", i.e., it counts the number of H in n independent tosses of a biased coin whose probability of H is p.

3. Geometric distribution. The geometric distribution is similar to the binomial in that it counts the number of "successes" in independent, repeated Bernoulli trials. The difference is that the number of trials is no longer fixed (i.e., equal to n); instead, we keep tossing until we get our first success.
Since the trials are independent, if the probability of success in each trial is p ∈ (0, 1), the probability that exactly k failures occur before the first success is q^k p, where q = 1 − p. Therefore, the geometric distribution - denoted by g(p) - comes with the following table:

   0   1     2       3       . . .
   p   q p   q^2 p   q^3 p   . . .

Figure 2. The probability mass function (pmf) of a typical geometric distribution.

Caveat: When defining the geometric distribution, some books count the number of trials up to and including the first success, i.e., add the final success to the count. This shifts everything by 1 and leads to a distribution with support N (and not N0). While this is no big deal, the ambiguity tends to be confusing at times and leads to bugs in software. For us, the geometric distribution will always start from 0. The distribution which counts the final success will be referred to as the shifted geometric distribution, but we'll try to avoid it altogether.

4. Poisson distribution. This is also a family of distributions, parameterized by a single parameter λ > 0 and denoted by P(λ). Its support is N0 and its distribution table is

   0       1        2            3            4            . . .
   e^(−λ)  e^(−λ)λ  e^(−λ)λ^2/2  e^(−λ)λ^3/3!  e^(−λ)λ^4/4!  . . .

The closed form for the pmf is

   p_Y(y) = e^(−λ) λ^y / y!, y ∈ N0.

The Poisson distribution arises as a limit of binomial distributions when n → ∞ and p → 0 while np ∼ λ.

Figure 3. The probability mass function (pmf) of a typical Poisson distribution with λ > 1.

1.5 Expectations and standard deviations

Expectations and standard deviations provide summaries of numerical random variables - they give us some information about them without overwhelming us with the entire distribution table.
The expectation can be thought of as a center of the distribution, while the standard deviation gives you an idea about its spread. (This should be taken with a grain of salt - after all, what exactly do we mean by a center or a spread of a distribution?)

Definition 1.5.1. For a discrete random variable Y with support S_Y ⊆ R, we define the expectation E[Y] of Y by

   E[Y] = ∑_{y ∈ S_Y} y p_Y(y),   (1.5.1)

if the (possibly infinite) sum converges absolutely, i.e., as long as

   ∑_{y ∈ S_Y} |y| p_Y(y) < ∞.   (1.5.2)

When the sum in (1.5.2) above diverges (i.e., takes the value +∞), we say that the expectation of Y is not defined.

Perhaps the most important property of the expectation is its linearity:

Theorem 1.5.2. If E[Y1] and E[Y2] are both defined, then so is E[αY1 + βY2], for any two constants α, β. Moreover,

   E[αY1 + βY2] = α E[Y1] + β E[Y2].

In order to define the standard deviation, we first need to define the variance. Like the expectation, the variance may or may not be defined (depending on whether the sums used to compute it converge absolutely or not). Since we will be working only with distributions for which the existence of expectations is never a problem, we do not mention this issue in the sequel.

Definition 1.5.3. The variance of the random variable Y is

   Var[Y] = E[(Y − µ_Y)^2] = ∑_{y ∈ S_Y} (y − µ_Y)^2 p_Y(y), where µ_Y = E[Y].

The standard deviation of Y is sd[Y] = √Var[Y].

The fundamental properties of the variance/standard deviation are given in the following theorem:

Theorem 1.5.4. Suppose that Y1 and Y2 are random variables and that α is a constant. Then

1. Var[αY1] = α^2 Var[Y1], and
2. if, additionally, Y1 and Y2 are independent, then Var[Y1 + Y2] = Var[Y1] + Var[Y2].

Caveat: These properties are not the same as the properties of the expectation.
First of all, the constant comes out of the variance with a square; second, the variance of the sum equals the sum of the individual variances only under additional assumptions, such as the independence of the two variables.

Finally, here is a very useful alternative formula for the variance of a random variable:

Proposition 1.5.5. Var[Y] = E[Y^2] − (E[Y])^2.

Let us compute expectations and variances/standard deviations for our most important examples.

Example 1.5.6.

1. Bernoulli distribution. Let Y ∼ B(p) be a Bernoulli random variable with parameter p. Then (remember, q is a shortcut for 1 − p)

   E[Y] = 0 × q + 1 × p = p.

Using Proposition 1.5.5, we get

   Var[Y] = E[Y^2] − (E[Y])^2 = 0^2 × q + 1^2 × p − p^2 = p − p^2 = p(1 − p) = pq,

and, so, sd[Y] = √(pq).

2. Binomial distribution. Moving on to the binomial, Y ∼ b(n, p), we could either use the formula (1.5.1) and try to evaluate the sum

   E[Y] = ∑_{k=0}^{n} k (n choose k) p^k q^(n−k),

or use the properties of the expectation from Theorem 1.5.2. To do the latter, we remember that the distribution of a binomial is the same as the distribution of a sum of n (independent) Bernoullis. So if we write Y = Y1 + · · · + Yn, where each of Y1, . . . , Yn has the B(p) distribution, Theorem 1.5.2 yields

   E[Y] = E[Y1] + E[Y2] + · · · + E[Yn] = np.   (1.5.3)

A similar simplification can be achieved in the computation of the variance, too. While the independence of Y1, . . . , Yn was not needed for (1.5.3), it is crucial for Theorem 1.5.4:

   Var[Y] = Var[Y1] + · · · + Var[Yn] = npq,

and, so, sd[Y] = √(npq).

3. Geometric distribution. The trick from 2. above cannot be applied to geometric random variables. If nothing else, this is because Theorem 1.5.2 can only be applied to a given (fixed, nonrandom) number n of random variables. We can still use the definition (1.5.1) and evaluate an infinite sum:

   E[Y] = ∑_{k=0}^{∞} k p q^k.
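Before manipulating this sum analytically, it can be probed numerically. The sketch below (illustrative Python; the cutoff 2000, the seed, and the sample size are arbitrary choices of mine) compares a truncated version of the sum with the empirical mean of simulated geometric draws; the two agree closely:

```python
import random

p, q = 0.3, 0.7   # success probability p, failure probability q = 1 - p

# Truncate the infinite sum  sum_{k>=0} k * p * q^k  at a large cutoff;
# the neglected tail is vanishingly small because q^k decays geometrically.
truncated_sum = sum(k * p * q**k for k in range(2_000))

def sample_geometric(p, rng):
    """Simulate g(p): count the failures before the first success."""
    failures = 0
    while rng.random() >= p:
        failures += 1
    return failures

rng = random.Random(0)
n = 100_000
empirical_mean = sum(sample_geometric(p, rng) for _ in range(n)) / n
```

Such a Monte Carlo check is a useful habit whenever a closed-form computation is about to get delicate.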
Instead of doing that, let us proceed somewhat informally and note that we can think of a geometric random variable as follows: with probability p our first toss is a success and Y = 0; with probability q our first toss is a failure and we restart the experiment from the second toss, making sure to add the first failure to the count. Therefore,

   E[Y] = p × 0 + q × (1 + E[Y]),

and, so, E[Y] = q/p. Similar reasoning can be applied to obtain

   E[Y^2] = p × 0 + q E[(1 + Y)^2] = q + 2q E[Y] + q E[Y^2] = q + 2q^2/p + q E[Y^2],

which yields Var[Y] = E[Y^2] − (E[Y])^2 = q/p^2 and sd[Y] = √q / p.

4. Poisson distribution. We know that the Poisson distribution arises as a limit of binomial distributions when n → ∞, p → 0 and np ∼ λ. We can expect, therefore, that its expectation and variance behave accordingly, i.e., that for Y ∼ P(λ) we have

   E[Y] = λ and Var[Y] = λ.   (1.5.4)

The reasoning behind Var[Y] = λ uses the formula Var[Y] = npq for Y ∼ b(n, p) and plugs in q ≈ 1, since q = 1 − p and p → 0. A more rigorous way of showing that (1.5.4) is correct is to evaluate the sums

   E[Y] = ∑_{k=0}^{∞} k p_Y(k) = ∑_{k=0}^{∞} k e^(−λ) λ^k / k!  and  E[Y^2] = ∑_{k=0}^{∞} k^2 p_Y(k) = ∑_{k=0}^{∞} k^2 e^(−λ) λ^k / k!,

and use Proposition 1.5.5. The sums can be evaluated explicitly, but since the focus of these notes is not on the evaluation of infinite sums, we skip the details.

1.6 Problems

Problem 1.6.1. A die is rolled 5 times; let the obtained numbers be given by Y1, . . . , Y5. Use counting to compute the probability that

1. all of Y1, . . . , Y5 are even;
2. at most 4 of Y1, . . . , Y5 are odd;
3. the values of Y1, . . . , Y5 are all different from each other.

Problem 1.6.2. Identify the supports of the following random variables:

1. Y + 1, where Y ∼ B(p) (Bernoulli),
2. Y^2, where Y ∼ b(n, p) (binomial),
3. Y − 5, where Y ∼ g(p) (geometric),
4. 2Y, where Y ∼ P(λ) (Poisson).

Problem 1.6.3.
Let Y denote the number of tosses of a fair die before the first 6 is obtained (so that, if we get a 6 on the first try, Y = 0). The support S_Y of Y is

(a) {0, 1, 2, 3, 4, . . . }
(b) {1, 2, 3, 4, 5, 6}
(c) {1/6, 1/6, 1/6, 1/6, 1/6}
(d) {1/6, (5/6) × 1/6, (5/6)^2 × 1/6, (5/6)^3 × 1/6, . . . }
(e) none of the above

Problem 1.6.4. The probability that Janet makes a free throw is 0.6. What is the probability that she will make at least 16 out of 23 (independent) throws? Write down the answer as a sum - no need to evaluate it.

Problem 1.6.5. Let Y1 and Y2 be random variables with distributions

   Y1 ∼  1    2    3    4        Y2 ∼  1    2
         1/4  1/4  1/4  1/4            1/2  1/2

Then

(a) Y1 + Y2 ∼  2    3    4    5    6
               1/8  1/4  1/4  1/4  1/8

(b) S_{Y1+Y2} = S_{Y1} ∪ S_{Y2}
(c) Y1 is binomially distributed
(d) the events {Y1 = 1} and {Y2 = 2} are mutually exclusive
(e) none of the above

Problem 1.6.6. (*) Bob and Alice alternate taking customer calls at a call center, with Alice always taking the first call. The number of calls during a day has a Poisson distribution with parameter λ > 0.

1. What is the probability that Bob will take the last call of the day (that includes the case when there are 0 calls)? (Hint: What is the Taylor series of the function cosh(x) = (e^x + e^(−x))/2 around x = 0?)
2. Who is more likely to take the last call, Alice or Bob? As above, if there are no calls, we give the "last call" to Bob.

Problem 1.6.7. Three unbiased and independent coins are tossed. Let Y1 be the total number of heads on the first two coins, and let Y be the random variable which is equal to Y1 if the third coin comes up heads and −Y1 if it comes up tails. Compute Var[Y].

Problem 1.6.8. A die is thrown and a coin is tossed independently of it. Let Y be the random variable which is equal to the number on the die in case the coin comes up heads and twice the number on the die if it comes up tails.

1. What is the support S_Y of Y? What is its distribution (pmf)?
2. Compute E[Y] and Var[Y].

Problem 1.6.9. n people vote in a general election with only two candidates running. The vote of person i is denoted by Yi, and it can take the values 0 and 1, depending on which candidate they voted for (we encode one of the candidates as 0 and the other as 1). We assume that the votes are independent of each other and that each person votes for candidate 1 with probability p. If the total number of votes for candidate 1 is denoted by Y, then

(a) Y is a geometric random variable
(b) Y^2 is a binomial random variable
(c) Y is uniform on {0, 1, . . . , n}
(d) Var[Y] ≤ E[Y]
(e) none of the above

Problem 1.6.10. A discrete random variable Y is said to have a discrete uniform distribution on {0, 1, 2, . . . , n}, denoted by Y ∼ u(n), if its distribution table looks like this:

   0        1        2        . . .  n
   1/(n+1)  1/(n+1)  1/(n+1)  . . .  1/(n+1)

Compute the expectation and the variance of u(n). You may use the following identities: 1 + 2 + · · · + n = n(n + 1)/2 and 1^2 + 2^2 + · · · + n^2 = n(n + 1)(2n + 1)/6.

Problem 1.6.11. (*) Let Y be a discrete random variable such that S_Y ⊆ N0. By counting the same thing in two different ways, explain why

   E[Y] = ∑_{n ∈ N} P[Y ≥ n].

This is called the tail formula for the expectation.

Problem 1.6.12. Let X be a geometric random variable with parameter p ∈ (0, 1), i.e., X ∼ g(p), and let Y = 2^(−X). Write down (the first few entries in) the distribution table of Y. Compute E[Y] = E[2^(−X)].

Problem 1.6.13. Let Y1 and Y2 be uncorrelated discrete random variables such that Var[2Y1 − Y2] = 17 and Var[Y1 + 2Y2] = 5. Compute Var[Y1 − Y2]. (Note: Y1 and Y2 are uncorrelated if E[(Y1 − E[Y1])(Y2 − E[Y2])] = 0.) (Hint: What is Var[αY1 + βY2] in terms of Var[Y1] and Var[Y2] when Y1 and Y2 are uncorrelated?)

Problem 1.6.14. Let Y1 and Y2 be uncorrelated random variables such that sd[Y1 + Y2] = 5.
Then sd[Y1 − Y2] =

(a) 1
(b) √2
(c) √3
(d) 5
(e) not enough information is given

Problem 1.6.15. (*) A mail lady has l ∈ N letters in her bag when she starts her shift and is scheduled to visit n ∈ N different households during her round. If each letter is equally likely to be addressed to any one of the n households, and the letters are delivered independently of each other, what is the expected number of households that will receive at least one letter? (Note: It is quite possible that some households will receive more than one letter.)

Lecture 2: Probability review - continuous random variables

2.1 Probability density functions (pdfs)

Some random variables naturally take one of a continuum of values, and cannot be associated with a countable set. The simplest example is the uniform random variable Y on [0, 1] (also known as a random number), which can take any value in the interval [0, 1], with the probability of it landing between a and b, where 0 < a < b < 1, given by

   P[Y ∈ [a, b]] = b − a.   (2.1.1)

One of the most counterintuitive things about Y is that P[Y = y] = 0 for any y ∈ [0, 1], even though we know that Y will take some value in [0, 1]. Therefore, unlike in the discrete case, where the probabilities given by the pmf p_Y(y) = P[Y = y] contain all the information, in the case of the uniform distribution these probabilities are completely uninformative. The right question to ask is the one in (2.1.1), i.e., one needs to focus on probabilities of values in intervals. The class of random variables where such questions come with an easy-to-represent answer is called continuous. More precisely:

Definition 2.1.1.
A random variable Y is said to have a continuous distribution if there exists a function f_Y : R → [0, ∞) such that

   P[Y ∈ [a, b]] = ∫_a^b f_Y(y) dy for all a < b.

The function f_Y is called the probability density function (pdf) of Y.

Not any function can serve as a pdf. The pdf of any random variable will always have the following properties:

1. f_Y(y) ≥ 0 for all y, and
2. ∫_{−∞}^{∞} f_Y(y) dy = 1, since P[Y ∈ (−∞, ∞)] = 1.

It can be shown that any such function is the pdf of some continuous random variable, but we will focus on a small number of important examples in these notes.

Caveat:

1. There are random variables which are neither continuous nor discrete, but we will not encounter them in these notes (even though some random variables important in applications - e.g., in insurance - fall into this category).

2. One should think of the pdf f_Y as an analogue of the pmf in the discrete case, but this analogy should not be stretched too far. For example, we can easily have f_Y(y) > 1 at some y, or even on an entire interval. This is a consequence of the fact that f_Y(y) is not the probability of anything. It is a probability density, i.e., for small (in the sense of a limit) ∆y > 0 we have

   P[Y ∈ [y, y + ∆y]] ≈ f_Y(y) ∆y,

i.e., f_Y(y) is, approximately, the quotient of the probability of an interval and the length of that same interval.

2.2 The "indicator" notation

Before we list some of the most important examples of continuous random variables, we need to introduce a very useful notational tool.

Definition 2.2.1. For a set A ⊆ R, the function 1_A : R → R, given by

   1_A(y) = 1 if y ∈ A, and 1_A(y) = 0 otherwise,

is called the indicator of A.

As its name already suggests, the indicator of A indicates whether its argument y belongs to the set A or not. The graph of a typical indicator - when A is an interval [a, b] - is given in Figure 1.
Figure 1. The indicator function 1_{[1,2]} of the interval [1, 2].

Indicators are useful when dealing with functions that are defined by different formulas on different parts of their domain.

Example 2.2.2. The uniform distribution U(l, r) is a slight generalization of the uniform U(0, 1) distribution mentioned above. It models a number randomly chosen in the interval [l, r] such that the probability of getting a point in a subinterval [a, b] ⊆ [l, r] is proportional to its length b − a. Since the probability of choosing some point in [l, r] is 1, by definition, we have to have P[Y ∈ [a, b]] = (b − a)/(r − l) for all l ≤ a < b ≤ r. To show that this is a continuous distribution, we need to show that it admits a pdf, i.e., a function f_Y such that

   ∫_a^b f_Y(y) dy = (b − a)/(r − l) for all l ≤ a < b ≤ r.

For a, b < l or a, b > r we must have P[Y ∈ [a, b]] = 0, so

   ∫_a^b f_Y(y) dy = 0 for a, b ∈ R \ [l, r].

These two requirements force

   f_Y(y) = 1/(r − l) for y ∈ [l, r] and f_Y(y) = 0 for y ∉ [l, r],   (2.2.1)

and we can easily check that this f_Y is, indeed, the pdf of Y. The indicator notation can be used to write (2.2.1) in a more compact way:

   f_Y(y) = (1/(r − l)) 1_{[l,r]}(y).

Not only does this give a single formula valid for all y, it also reveals that [l, r] is the "effective" part of the domain of f_Y. We can think of f_Y as "the constant 1/(r − l), but only on the interval [l, r]; it is zero everywhere else".

The interval-indicator notation will come into its own a bit later, when we discuss densities of several random variables (random vectors), but for now let us comment on how it allows us to write any integral as an integral over (−∞, ∞). The idea behind this is that any function f multiplied by the indicator 1_{[a,b]} stays the same on [a, b], but takes the value 0 everywhere else.
Therefore,

   ∫_a^b f(y) dy = ∫_{−∞}^{∞} f(y) 1_{[a,b]}(y) dy,

because the integral of the function 0 is 0, even when taken over infinite intervals.

Finally, let us introduce another notation for indicator functions. It turns out to be more intuitive, at least for intervals, and will do wonders for the evaluation of iterated integrals. Since the condition y ∈ [a, b] can be written as a ≤ y ≤ b, we sometimes write 1_{a ≤ y ≤ b} instead of 1_{[a,b]}(y).

2.3 First examples of continuous random variables

Example 2.3.1.

1. Uniform distribution. We have already encountered the uniform distribution U(l, r) on the interval [l, r], and we have shown that it is a continuous distribution with the pdf

   f_Y(y) = (1/(r − l)) 1_{[l,r]}(y).

As always, this is really a whole family of distributions, parameterized by the two real parameters l and r.

Figure 2. The density function (pdf) of the uniform U(l, r) distribution.

2. Normal distribution. The family of normal distributions - denoted by Y ∼ N(µ, σ) - is also parameterized by two parameters, µ ∈ R and σ > 0, and its pdf is given by the (at first sight complicated) formula

   f_Y(y) = (1/√(2πσ^2)) e^(−(y−µ)^2/(2σ^2)), y ∈ R.

The normal distribution is symmetric around µ and its standard deviation (as we shall see shortly) is σ; its graph is shown in Figure 3.

Figure 3. The density function (pdf) of the normal distribution N(µ, σ).

The function f_Y is defined by the above formula for each y ∈ R, and it is a nontrivial task to show that it is, indeed, a pdf of anything. The difficulty lies in evaluating the integral ∫_{−∞}^{∞} f_Y(y) dy and showing that it equals 1. This is, indeed, true, but needs a bit more mathematics than we care to get into right now. The probabilities P[Y ∈ [a, b]] are not any easier to compute for concrete a and b and, in general, do not admit a closed form (formula). That is why we used to use tables of precomputed approximate values (we use software today).
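To illustrate what "we use software today" looks like in practice: P[Y ∈ [a, b]] for Y ∼ N(µ, σ) can be computed from the standard normal cdf Φ, which can in turn be expressed through the error function in Python's standard library as Φ(z) = (1 + erf(z/√2))/2. A minimal sketch (the function names are my own):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal cdf Phi(z), expressed through the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob(a, b, mu, sigma):
    """P[Y in [a, b]] for Y ~ N(mu, sigma), by standardizing the endpoints."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)

# About 68.3% of the mass lies within one standard deviation of the mean.
p_one_sd = normal_prob(-1, 1, 0, 1)
```

This is exactly what the old printed tables contained: precomputed values of Φ, from which interval probabilities are obtained by standardizing and subtracting.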
Nevertheless, the normal distribution is, arguably, the most important distribution in probability and statistics. The main reason is that it appears in the central limit theorem (which we will talk more about later) and, therefore, shows up whenever a large number of independent random influences act at the same time.

3. Exponential distribution. The exponential distribution is a continuous analogue of the geometric distribution and is used in modeling lifetimes of light bulbs or waiting times in supermarket checkout lines. It comes in a parametric family E(τ), parameterized by the positive parameter τ > 0. Its pdf is given by

   f_Y(y) = (1/τ) e^(−y/τ) 1_{[0,∞)}(y).

Figure 4. The density function (pdf) of the exponential E(τ) distribution.

The use of an interval indicator in the expression above signals that f_Y is positive only for y ≥ 0, and that, in turn, means that an exponential random variable cannot take negative values.

Caveat: Many books use a different parametrization of the exponential family, namely Y ∼ E(λ) with f_Y(y) = λ e^(−λy) 1_{[0,∞)}(y), so that, effectively, λ = 1/τ. Both parameters have meaningful interpretations and, depending on the context, one can be more natural than the other. Keep this in mind to avoid unnecessary confusion.

2.4 Expectations and standard deviations

The definition of the expectation will look similar to the one in the discrete case, but sums will be replaced by integrals. Once the expectation is defined, everything else can be repeated verbatim from the previous lecture.

Definition 2.4.1. For a continuous random variable Y with pdf f_Y, we define the expectation E[Y] of Y by

   E[Y] = ∫_{−∞}^{∞} y f_Y(y) dy,   (2.4.1)

as long as ∫_{−∞}^{∞} |y| f_Y(y) dy < ∞. When the latter integral is +∞, we say that the expectation of Y is not defined.
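Definition (2.4.1) is easy to check numerically. The sketch below (an illustration, not part of the notes' toolkit; the names and the choice of U(2, 5) are mine) approximates E[Y] for the uniform U(2, 5) distribution by a midpoint Riemann sum over its effective support:

```python
def uniform_pdf(y, l=2.0, r=5.0):
    """pdf of U(l, r): the constant 1/(r - l) on [l, r], zero elsewhere."""
    return 1.0 / (r - l) if l <= y <= r else 0.0

def expectation(pdf, lo, hi, n=100_000):
    """Midpoint Riemann approximation of E[Y] = integral of y * f_Y(y) dy."""
    h = (hi - lo) / n
    midpoints = (lo + (i + 0.5) * h for i in range(n))
    return sum(m * pdf(m) for m in midpoints) * h

mean = expectation(uniform_pdf, 2.0, 5.0)
```

The approximation returns 3.5 (up to floating-point error), matching the general formula E[Y] = (l + r)/2 derived for U(l, r) in Example 2.4.2 below.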
The definitions of the variance and the standard deviation are analogous to their discrete versions:
$$\mathrm{Var}[Y] = \int_{-\infty}^{\infty} (y - \mu_Y)^2 f_Y(y)\,dy, \ \text{ where } \mu_Y = E[Y], \quad \text{and} \quad \mathrm{sd}[Y] = \sqrt{\mathrm{Var}[Y]}.$$
Theorem ?? and Proposition ?? are valid exactly as written in the continuous case, too.

Let us compute expectations and variances/standard deviations of the distributions from Example 2.3.1.

Example 2.4.2.

1. Uniform distribution. The computations needed for the expectation and the variance of the uniform $U(l,r)$ distribution are quite simple:
$$E[Y] = \int_{-\infty}^{\infty} y\, f_Y(y)\,dy = \tfrac{1}{r-l}\int_{-\infty}^{\infty} y\,\mathbf{1}_{[l,r]}(y)\,dy = \tfrac{1}{r-l}\int_l^r y\,dy = \tfrac{1}{r-l}\,\tfrac{r^2-l^2}{2} = \tfrac{l+r}{2}.$$
Similarly,
$$\mathrm{Var}[Y] = \int_{-\infty}^{\infty} \big(y - \tfrac{l+r}{2}\big)^2 f_Y(y)\,dy = \tfrac{1}{r-l}\int_l^r \big(y - \tfrac{l+r}{2}\big)^2\,dy = \tfrac{1}{r-l}\Big[\tfrac{1}{3}\big(y - \tfrac{l+r}{2}\big)^3\Big]_l^r = \tfrac{1}{12}(r-l)^2.$$

2. Normal distribution. To compute the expectation of the normal distribution $N(\mu,\sigma)$, we need to evaluate the following integral
$$E[Y] = \int_{-\infty}^{\infty} y\, \tfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy.$$
We change the variable $z = (y - \mu)/\sigma$ to obtain
$$E[Y] = \int_{-\infty}^{\infty} (\sigma z + \mu)\, \tfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz = \tfrac{\sigma}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z\, e^{-\frac{1}{2}z^2}\,dz + \mu \int_{-\infty}^{\infty} \tfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\,dz = \mu.$$
The integral next to $\mu$ evaluates to 1, because it is simply the integral of the density function $f$ of the standard normal $N(0,1)$. The integral next to $\frac{\sigma}{\sqrt{2\pi}}$ is 0 because it is an integral of an odd function over the entire $\mathbb{R}$.

To compute the variance, we need to evaluate the integral
$$\mathrm{Var}[Y] = \int_{-\infty}^{\infty} (y - \mu)^2\, \tfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy,$$
because we now know that $\mu_Y = \mu$. The same change of variables as above yields:
$$\mathrm{Var}[Y] = \sigma^2\, \tfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^2 e^{-\frac{1}{2}z^2}\,dz = \sigma^2,$$
where we used the fact (which can be obtained using integration by parts, but we skip the details here) that $\int_{-\infty}^{\infty} z^2 e^{-\frac{1}{2}z^2}\,dz = \sqrt{2\pi}$.

3. Exponential distribution.
The integrals involved in the evaluation of the expectation and the variance of the exponential distribution are simpler and only involve a bit of integration by parts, so we skip the details. It should be noted that the interval-indicator notation we used to define the pdf of the exponential tells us immediately what bounds to use for integration. For $Y \sim E(\tau)$, we have
$$E[Y] = \int_{-\infty}^{\infty} y\, f_Y(y)\,dy = \int_{-\infty}^{\infty} \tfrac{y}{\tau}\, e^{-y/\tau}\,\mathbf{1}_{[0,\infty)}(y)\,dy = \int_0^{\infty} \tfrac{y}{\tau}\, e^{-y/\tau}\,dy = \tau.$$
Therefore $\mu_Y = \tau$ and, so,
$$\mathrm{Var}[Y] = E[Y^2] - (E[Y])^2 = \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy - \tau^2 = \int_0^{\infty} \tfrac{y^2}{\tau}\, e^{-y/\tau}\,dy - \tau^2.$$
To evaluate the first integral on the right, we change variables $z = y/\tau$, so that
$$\mathrm{Var}[Y] = \tau^2 \int_0^{\infty} z^2 e^{-z}\,dz - \tau^2 = 2\tau^2 - \tau^2 = \tau^2,$$
where we used the fact (which can be derived by integration by parts) that $\int_0^{\infty} z^2 e^{-z}\,dz = 2$.

2.5 Moments

The expectation is the integral of the first power $y = y^1$ multiplied by the pdf, and the variance involves a similar integral with $y$ replaced by $y^2$. Integrals of higher powers of $y$ are important in statistics; not as important as the expectation and variance, but still important enough to have names:

Definition 2.5.1. For a random variable $Y$ with pdf $f_Y$ and $k = 1, 2, \dots$, we define the $k$-th (raw) moment $\mu_k$ by
$$\mu_k = E[Y^k] = \int_{-\infty}^{\infty} y^k f_Y(y)\,dy,$$
as well as the $k$-th central moment $\mu_k^c$ by
$$\mu_k^c = E[(Y - E[Y])^k] = \int_{-\infty}^{\infty} (y - \mu_1)^k f_Y(y)\,dy.$$

We see immediately from the definition that the expectation (mean) is the first (raw) moment and that the variance is the second central moment, i.e., $\mu_1 = E[Y]$ and $\mu_2^c = \mathrm{Var}[Y]$. The third and fourth moments of the standardized random variable, namely,
$$E\Big[\Big(\tfrac{Y - E[Y]}{\mathrm{sd}[Y]}\Big)^3\Big] \quad \text{and} \quad E\Big[\Big(\tfrac{Y - E[Y]}{\mathrm{sd}[Y]}\Big)^4\Big],$$
are called skewness and kurtosis, respectively. It is easy to see that, in terms of moments, we can express skewness as $\mu_3^c/(\mu_2^c)^{3/2}$ and kurtosis as $\mu_4^c/(\mu_2^c)^2$.

Example 2.5.2.

1. Uniform distribution. We leave this to the reader as an exercise in the Problems section below.

2.
Normal distribution. Let us compute the central moments, too. Since $Y - \mu \sim N(0,\sigma)$ whenever $Y \sim N(\mu,\sigma)$, central moments of $N(\mu,\sigma)$ are nothing but raw moments of $N(0,\sigma)$. For that, we need to compute the integrals
$$\int_{-\infty}^{\infty} y^k f_Y(y)\,dy = \int_{-\infty}^{\infty} \tfrac{1}{\sqrt{2\pi\sigma^2}}\, y^k e^{-\frac{1}{2\sigma^2}y^2}\,dy. \qquad (2.5.1)$$
For odd $k$, these are integrals of odd functions over the entire $\mathbb{R}$ and, therefore, their value is 0, i.e., $\mu_k = 0$ for $k$ odd. For even $k$, there is no such shortcut, and the integral in (2.5.1) can be computed by parts:
$$\int_{-\infty}^{\infty} y^k e^{-\frac{1}{2\sigma^2}y^2}\,dy = \tfrac{1}{k+1}\, y^{k+1} e^{-\frac{1}{2\sigma^2}y^2}\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} \tfrac{1}{k+1}\, y^{k+1} \big({-\tfrac{1}{\sigma^2}}y\big)\, e^{-\frac{1}{2\sigma^2}y^2}\,dy.$$
Since $\lim_{y \to \pm\infty} y^{k+1} e^{-\frac{1}{2\sigma^2}y^2} = 0$, we obtain
$$\tfrac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} y^k e^{-\frac{1}{2\sigma^2}y^2}\,dy = \tfrac{1}{\sqrt{2\pi\sigma^2}}\,\tfrac{1}{\sigma^2(k+1)}\int_{-\infty}^{\infty} y^{k+2}\, e^{-\frac{1}{2\sigma^2}y^2}\,dy.$$
Written more compactly, $\mu_{k+2} = \sigma^2 (k+1)\, \mu_k$. Starting from $\mu_2 = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} y^2 e^{-\frac{1}{2\sigma^2}y^2}\,dy = \sigma^2$, we get
$$\mu_k = \sigma^k\, (k-1)\times(k-3)\times\cdots\times 5\times 3\times 1, \ \text{ for } k \text{ even.}$$

3. Exponential distribution. A similar integration-by-parts procedure as above allows us to compute the (raw) moments of the exponential (we skip the details):
$$\mu_k = \int_0^{\infty} y^k\, \tfrac{1}{\tau}\, e^{-y/\tau}\,dy = \tau^k\, k\times(k-1)\times\cdots\times 2\times 1 = \tau^k\, k!,$$
for $k = 1, 2, 3, \dots$. The central moments are not so important, and do not admit such a nice closed formula.

2.6 Problems

Problem 2.6.1. Let $Y$ be a continuous random variable whose pdf $f_Y$ is given by
$$f_Y(y) = \begin{cases} c\,y^2, & y \in [-1,1], \\ 0, & \text{otherwise,}\end{cases}$$
for some constant $c$.
1. Write down an expression for $f_Y$ using the interval-indicator notation.
2. What is the value of $c$?
3. Compute $E[Y]$ and $\mathrm{sd}[Y]$.

Problem 2.6.2. Let $Y$ be a continuous random variable with the pdf $f_Y(y) = \frac{15}{4}\, y^2(1-y^2)\,\mathbf{1}_{\{-1 \le y \le 1\}}$. Compute $P[Y^2 \le 1/4]$.

Problem 2.6.3. The random variable $Y$ has the pdf $f_Y(y) = \frac{3}{2}\, y^2\,\mathbf{1}_{\{-1 \le y \le 1\}}$. Compute the probability $P[2Y^2 \ge 1]$.

Problem 2.6.4. Let $Y$ have the pdf $f(y) = \frac{1}{\pi(1+y^2)}$ for $y \in (-\infty,\infty)$. Compute the probability that $Y^{-2}$ lies in the interval $[1/4, 4]$.

Problem 2.6.5 (The exponential distribution).
Suppose that the random variable $Y$ follows an exponential distribution with parameter $\tau > 0$, i.e., $Y \sim E(\tau)$, so that $Y$ is a continuous random variable with the density function $f_Y$ given by $f_Y(y) = \frac{1}{\tau}\, e^{-y/\tau}\,\mathbf{1}_{\{y \ge 0\}}$.

Compute the following quantities:
1. $P[Y = 0]$,
2. $P[Y \le 0]$,
3. $P[Y \le y]$ for $y \in (-\infty, \infty)$,
4. $P[Y > 1]$,
5. $P[|Y - 2| > 1]$,
6. $E[Y]$,
7. $E[Y^2]$,
8. $\mathrm{Var}[Y]$,
9. the mode of $Y$ (look up mode if you don't know what it is),
10. the median of $Y$ (look up median if you don't know what it is),
11. (Optional) $P[\lfloor Y \rfloor \text{ is odd}]$, where $\lfloor a \rfloor$ denotes the largest integer $\le a$. Which one is bigger, $P[\lfloor Y \rfloor \text{ is odd}]$ or $P[\lfloor Y \rfloor \text{ is even}]$? Explain without using any calculations.

Problem 2.6.6 (The triangular distribution). We say that the random variable $Y$ follows the triangular distribution with parameters $l < r$ if it is continuous with pdf $f_Y$ given by
$$f_Y(y) = c\,(y - l)\,\mathbf{1}_{[l,\, \frac{1}{2}(l+r)]}(y) + c\,(r - y)\,\mathbf{1}_{[\frac{1}{2}(l+r),\, r]}(y).$$
1. Determine the value of the constant $c$.
2. Compute the expectation and the standard deviation of $Y$.
3. Assuming that $l = -1$ and $r = 1$, compute $P\big[\,|Y - E[Y]| \ge \mathrm{sd}[Y]\,\big]$.

Problem 2.6.7 (Moments of the uniform distribution). Let $Y$ follow the uniform distribution $U(l,r)$ on the interval $[l,r]$, where $l < r$, i.e., its density is given by
$$f_Y(y) = \tfrac{1}{r-l}\,\mathbf{1}_{\{l \le y \le r\}} = \begin{cases} 1/(r-l), & \text{if } y \in [l,r], \\ 0, & \text{otherwise.}\end{cases}$$
Compute the moments $\mu_k$ and the central moments $\mu_k^c$, $k = 1, 2, \dots$, of $Y$.

Lecture 3: Cumulative distribution functions and derived quantities
Course: Mathematical Statistics
Term: Fall 2018
Instructor: Gordan Žitković

When we talk about the distribution of a discrete random variable, we write down its pmf (or a distribution table), and when the variable is continuous, we give its pdf. There are other ways of expressing the same information; depending on the context, these other ways can be much more useful or effective.
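One such alternative description, the cumulative distribution function of the next section, can be tabulated mechanically from a distribution table. A small preview sketch (the two-coin variable here, the number of heads in two fair tosses, is used purely as an illustration):

```python
def cdf(table, y):
    # F_Y(y) = P[Y <= y]: sum p_Y(u) over all support points u <= y
    return sum(p for u, p in table.items() if u <= y)

# Number of heads in two fair coin tosses: P[Y=0]=1/4, P[Y=1]=1/2, P[Y=2]=1/4
two_coins = {0: 0.25, 1: 0.5, 2: 0.25}
print([cdf(two_coins, y) for y in (-1, 0, 0.5, 1, 2, 3)])
# -> [0, 0.25, 0.25, 0.75, 1.0, 1.0]
```

Note how the values stay constant between support points and jump exactly at them; this is the "staircase" shape formalized below.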
3.1 Cumulative distribution functions (cdf)

Definition 3.1.1. For a random variable $Y$, discrete or continuous, we define its cumulative distribution function (cdf) $F_Y : \mathbb{R} \to [0,1]$ by
$$F_Y(y) = P[Y \le y], \quad y \in \mathbb{R}.$$

The first, obvious, advantage of the cdf is that it can be used for both discrete and continuous random variables. Since it is defined as a probability of an event, $F_Y(y)$ can be computed (at least in principle) from the distribution table in the discrete case
$$F_Y(y) = \sum_{u \in S_Y,\ u \le y} p_Y(u),$$
or from the pdf (in the continuous case):
$$F_Y(y) = \int_{-\infty}^{y} f_Y(u)\,du. \qquad (3.1.1)$$
As we shall see in the examples, going the other way in the discrete case is possible, but the formula is a bit clumsy. The continuous case is nicer because one could use the fundamental theorem of calculus to conclude that $f_Y(y) = \frac{d}{dy} F_Y(y)$ for $y \in \mathbb{R}$, at least for those $y$ where $f_Y$ is a continuous function.

We know that the pdf $f_Y$ of any random variable $Y$ must be nonnegative and integrate to 1. In a similar way, any cdf will have the following properties:
1. $0 \le F_Y(u) \le 1$,
2. $F_Y$ is nondecreasing, and
3. $\lim_{u \to \infty} F_Y(u) = 1$ and $\lim_{u \to -\infty} F_Y(u) = 0$.

Example 3.1.2.

1. Bernoulli. Let $Y$ be a Bernoulli random variable $B(p)$. To find an expression for $F_Y$, we first note that $F_Y(y) = 0$ for $y < 0$. This follows directly from the definition - $Y$ takes values 0 or 1, so $P[Y \le y] = 0$ as soon as $y < 0$. Similarly, $F_Y(y) = 1$ for $y \ge 1$. What happens in the middle? For any $y \in [0,1)$, the only way for $Y \le y$ to be true is if $Y = 0$. Therefore, $F_Y(y) = P[Y \le y] = P[Y = 0] = q$ for $y \in [0,1)$. A picture makes it even easier to grasp:

Figure 1. The cumulative distribution function (CDF) for the Bernoulli $B(p)$ distribution.

2. Discrete with finite support. Let $Y$ be a discrete random variable with a finite support $S_Y = \{y_1, \dots, y_n\}$ and let its distribution table be given by

$y_1$  $y_2$  $\dots$  $y_n$
$p_1$  $p_2$  $\dots$  $p_n$

Following the same reasoning as in the Bernoulli case, we get the following expression for the cdf
$$F_Y(y) = \begin{cases} 0, & y < y_1, \\ p_1, & y_1 \le y < y_2, \\ p_1 + p_2, & y_2 \le y < y_3, \\ \quad\vdots \\ p_1 + p_2 + \cdots + p_{n-1}, & y_{n-1} \le y < y_n, \\ 1, & y \ge y_n. \end{cases}$$
Again, a picture is easier to parse:

Figure 2. The cumulative distribution of a discrete distribution with support $\{y_1,\dots,y_n\}$ and the associated probabilities $\{p_1,\dots,p_n\}$.

3. Uniform. The cdf of the uniform distribution $U(l,r)$ will no longer have "jumps". In fact, that is the reason behind calling continuous distributions continuous. Here, we use the expression (3.1.1) and integrate the pdf $f_Y$ of the uniform distribution from $-\infty$ to $y$. As above, $F_Y(y) = 0$ for $y < l$ because $f_Y(y) = 0$ for $y < l$ and integration of 0 yields 0. To see what is going on between $l$ and $r$, we pick $y \in [l,r]$ and note that
$$\int_{-\infty}^{y} f_Y(u)\,du = \int_l^y f_Y(u)\,du = \int_l^y \tfrac{1}{r-l}\,\mathbf{1}_{[l,r]}(u)\,du = \tfrac{1}{r-l}\int_l^y du = \tfrac{y-l}{r-l}.$$
Finally, for $y > r$, we have $F_Y(y) = 1$. Alternatively, we could have used the definition of $F_Y$ to conclude directly that
$$F_Y(y) = P[Y \le y] = \begin{cases} 0, & y < l, \\ \frac{y-l}{r-l}, & y \in [l,r], \\ 1, & y > r. \end{cases}$$

Figure 3. The cumulative distribution of a uniform $U(l,r)$ distribution.

4. Normal distribution. The cdf of the normal distribution $N(\mu,\sigma)$,
$$F_Y(y) = \int_{-\infty}^{y} \tfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(u-\mu)^2}{2\sigma^2}}\,du,$$
does not have an explicit expression in terms of elementary functions (not even for $\mu = 0$ and $\sigma = 1$). That is why you had to use tables (or software) to compute various probabilities associated to the normal in your probability class. Using mathematical software, one can evaluate this integral numerically, and the resulting picture is given below:

Figure 4. The cumulative distribution of a normal $N(\mu,\sigma)$ distribution.

5. Exponential distribution.
The integration in the computation of the cdf $F_Y$ of an exponentially-distributed random variable $Y \sim E(\tau)$ can be performed quite easily and completely explicitly. First of all, for $y < 0$, we clearly have $F_Y(y) = 0$. For $y > 0$, we compute
$$F_Y(y) = \int_{-\infty}^{y} \tfrac{1}{\tau}\, e^{-u/\tau}\,\mathbf{1}_{[0,\infty)}(u)\,du = \int_0^y \tfrac{1}{\tau}\, e^{-u/\tau}\,du = 1 - e^{-y/\tau}, \quad y > 0.$$

Figure 5. CDF of the exponential distribution $E(\tau)$.

3.2 Quantiles

The notion of a quantile is familiar to almost everyone, even if you have not learned it formally in a class. You do know what "top 1%" means, right? The formal definition is easy once we have the notion of a cdf at our disposal:

Definition 3.2.1. For $\alpha \in (0,1)$, we define the $\alpha$-quantile of the distribution of the random variable $Y$ as the number $q_Y(\alpha) \in \mathbb{R}$ with the property that $F_Y(q_Y(\alpha)) = \alpha$, i.e., $P[Y \le q_Y(\alpha)] = \alpha$.

Caveat: Defined this way, the quantile $q_Y(\alpha)$ need not exist for every $\alpha$. This can be remedied by adopting a more careful definition but, since we will not have to deal with this problem in these notes - and whenever we need quantiles, they will happily exist - we simply ignore it. If you want to think about this a bit more, try to figure out which quantiles of the Bernoulli distribution actually exist, i.e., for which $\alpha$ we can find a number $q$ such that $P[Y \le q] = \alpha$, when $Y$ is Bernoulli. Is such a $q$ uniquely determined?

Example 3.2.2. Normal quantiles. In practice, one finds quantiles by inverting the cdf; graphically, this amounts to finding $\alpha$ on the vertical axis and then finding a value $q$ on the horizontal axis such that $F_Y(q) = \alpha$. For example, Figure 4 in Example 3.1.2, part 4, above reveals that, for $Y \sim N(\mu,\sigma)$, we have (approximately) $q_Y(0.16) = \mu - \sigma$, $q_Y(0.5) = \mu$, $q_Y(0.84) = \mu + \sigma$ and $q_Y(0.98) = \mu + 2\sigma$. This is very much related to the well-known 68-95-99.7 rule.
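When the cdf has a closed form, quantiles can be found by solving $F_Y(q) = \alpha$ explicitly instead of graphically. For the exponential, $1 - e^{-q/\tau} = \alpha$ gives $q_Y(\alpha) = -\tau \ln(1-\alpha)$; a small sketch verifying this round trip:

```python
from math import exp, log

def exp_cdf(y, tau):
    # F_Y(y) = 1 - e^{-y/tau} for y >= 0, and 0 for y < 0
    return 1.0 - exp(-y / tau) if y >= 0 else 0.0

def exp_quantile(alpha, tau):
    # Solve 1 - e^{-q/tau} = alpha for q
    return -tau * log(1.0 - alpha)

tau = 2.0
median = exp_quantile(0.5, tau)   # equals tau * ln(2)
print(median, exp_cdf(median, tau))
```

The same pattern (invert the cdf algebraically when possible, numerically otherwise) applies to any continuous distribution with a strictly increasing cdf.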
3.3 Survival and hazard functions

Survival and hazard functions are especially important for an area of statistics called survival analysis, but are also a part of the vocabulary of general statistics.

Definition 3.3.1. Let $Y$ be a random variable with cdf $F_Y$.
1. The survival function $S_Y(y)$ of $Y$ is defined by $S_Y(y) = 1 - F_Y(y)$ for $y \in \mathbb{R}$.
2. If $Y$ is continuous, the hazard function $h_Y(y)$ is given by $h_Y(y) = \frac{f_Y(y)}{S_Y(y)}$ for $y$ with $F_Y(y) < 1$.

These quantities have natural interpretations when $Y$ is thought of as a lifetime (of a particle, bulb, bacterium, individual, etc.). Fixing, for convenience, the interpretation that $Y$ is the age at death of an individual, we have:
1. $S_Y(y)$ is the probability that the individual will survive at least $y$ years.
2. $h_Y(y)\,\Delta y$ is the (conditional) probability that the individual will die some time in the (small) interval $[y, y + \Delta y]$, given that it has survived until $y$.

Example 3.3.2. Let $Y$ be an exponential random variable with parameter $\tau$. Then $S_Y(y) = e^{-y/\tau}$ and $h_Y(y) = \frac{1}{\tau}$ for $y \ge 0$. In words, exponentially-distributed lifetimes have constant hazard functions - "the probability of dying in the next $\Delta y$ is constant and does not depend on the age $y$." For comparison, Figure 6 below features some real data about humans, where the hazard rate is far from constant.

Figure 6. The survival (left) and the hazard (right) functions of the empirical distribution of the ages of death of all female individuals born in the US in 1917.

3.4 Problems

Problem 3.4.1. Two (unbiased, independent) coins are tossed, and the total number of heads is denoted by $Y$. Write an expression for the CDF of $Y$ and sketch its graph.

Problem 3.4.2.
Which of the following pairs of functions could be the pdf and the cdf (respectively) of some probability distribution:
(a) $f(x) = x^2$, $F(x) = \frac{1}{3}x^3$
(b) $f(x) = \cos(x)$, $F(x) = \sin(x)$
(c) $f(x) = 2e^{-2x}\,\mathbf{1}_{\{x>0\}}$, $F(x) = (1 - e^{-2x})\,\mathbf{1}_{\{x>0\}}$
(d) $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, $F(x) = 1 - e^{-x^2}$
(e) $f(x) = \mathbf{1}_{\{x>0\}}$, $F(x) = x\,\mathbf{1}_{\{x>0\}}$

Problem 3.4.3. Let $Y$ be a random variable with CDF $F_Y$, and let $q_Y : (0,1) \to \mathbb{R}$ be its quantile function (we assume it exists for each $\alpha \in (0,1)$). What is the relationship between the graphs of $F_Y$ and $q_Y$, i.e., how do you get one from the other?

Problem 3.4.4. Let $Y$ be a continuous random variable with the density $f_Y$ given by $f_Y(y) = c\,y^2(1-y)\,\mathbf{1}_{[0,1]}(y)$, for an appropriate constant $c$.
1. Sketch the graph of $f_Y$ and find the value of the constant $c$.
2. Compute the cumulative distribution function (cdf) $F_Y$ and the survival function $S_Y$ of $Y$.
3. What is the domain of the hazard function? Compute the hazard function $h_Y$ itself.
4. Find the mode of $Y$.
5. Compute the $\frac{5}{16}$-th quantile of $Y$. (Note: Guess and verify.)

Problem 3.4.5. Let $Y$ be a random variable with the pdf $f_Y(y) = 2y\,\mathbf{1}_{\{0 \le y \le 1\}}$. Compute the hazard function $h_Y$ of $Y$.

Problem 3.4.6. Let $Y$ be a uniform random variable on the interval $[0,100]$. The hazard function $h_Y$ of the distribution of $Y$ is given by
(a) $\frac{1}{y}\,\mathbf{1}_{\{y>0\}}$ for $y \in (-\infty, 100)$
(b) $\frac{1}{100-y}\,\mathbf{1}_{\{y>0\}}$ for $y \in (-\infty, 100)$
(c) $\mathbf{1}_{\{y<0\}} + \frac{100-y}{100}\,\mathbf{1}_{\{0 \le y \le 100\}}$ for $y \in (-\infty, 100]$
(d) $(100-y)\,\mathbf{1}_{\{y \in [0,100)\}}$ for $y \in [0,\infty)$
(e) none of the above

Problem 3.4.7. The expected lifetime of a bulb is $h$ (in hours). Assuming that the bulb lifetimes are exponentially distributed, compute
1. the probability that the bulb is still functional at time $h$,
2. the half-life of the bulb, i.e., a number $t^*$ such that the probability that the bulb is still functional after $t^*$ hours is exactly 1/2.

Problem 3.4.8. Compute the $\alpha$-quantile $q_Y(\alpha)$ for $\alpha = 0.75$, where $Y$ is the uniform distribution $U(4,8)$ on $[4,8]$.
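Returning to Example 3.3.2, the constant-hazard property of the exponential is easy to check numerically; a sketch (assuming the $E(\tau)$ parametrization used in these notes):

```python
from math import exp

def survival(y, tau):
    # S_Y(y) = 1 - F_Y(y) = e^{-y/tau} for Y ~ E(tau), y >= 0
    return exp(-y / tau)

def hazard(y, tau):
    # h_Y(y) = f_Y(y) / S_Y(y)
    pdf = (1.0 / tau) * exp(-y / tau)
    return pdf / survival(y, tau)

tau = 4.0
# The hazard is 1/tau = 0.25 regardless of the age y:
print([hazard(y, tau) for y in (0.0, 1.0, 10.0, 50.0)])
```

A non-exponential lifetime (such as the one in Figure 6) would show a hazard that changes with $y$ under the same computation.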
Lecture 4: Functions of random variables
Course: Mathematical Statistics
Term: Fall 2017
Instructor: Gordan Žitković

Let $Y$ be a random variable, discrete or continuous, and let $g$ be a function from $\mathbb{R}$ to $\mathbb{R}$, which we think of as a transformation. For example, $Y$ could be the height of a randomly chosen person in a given population in inches, and $g$ could be the function which transforms inches to centimeters, i.e., $g(y) = 2.54 \times y$. Then $W = g(Y)$ is also a random variable, but its distribution (pdf), mean, variance, etc., will differ from those of $Y$. Transformations of random variables play a central role in statistics, and we will learn how to work with them in this section.

4.1 Computing expectations

Expectations of functions of random variables are easy to compute, thanks to the following result, sometimes known as the fundamental formula.

Theorem 4.1.1. Suppose that $Y$ is a random variable, $g$ is a transformation, i.e., a real function, and $W = g(Y)$. Then
1. if $Y$ is discrete, with pmf $p_Y$, we have $E[W] = \sum_{y \in S_Y} g(y)\, p_Y(y)$,
2. if $Y$ is continuous, with pdf $f_Y$, we have $E[W] = \int_{-\infty}^{\infty} g(y)\, f_Y(y)\,dy$.

We have already used this formula, without knowing it, when we wrote down a formula for the variance
$$\mathrm{Var}[Y] = E[(Y - \mu_Y)^2] = \int_{-\infty}^{\infty} (y - \mu_Y)^2 f_Y(y)\,dy.$$
Indeed, we applied the transformation $g(y) = (y - \mu_Y)^2$ to $Y$ and then computed the expectation of the new random variable $W = g(Y) = (Y - \mu_Y)^2$.

Example 4.1.2. The stopping distance (the distance traveled by the car from the moment the brake is applied to the moment it stops) in feet is a quadratic function of the car's speed (in mph), i.e., $g(y) = c\,y^2$, where $c$ is a constant which depends on the physical characteristics of the car, its brakes, the road surface, etc. For the purposes of this example, let's take a realistic value $c = 0.07$ (in appropriate units).
In a certain traffic study, the distribution of cars' speeds at the onset of braking is empirically determined to be uniform on the interval [60, 90], measured in miles per hour. What is the expected value of the stopping distance? The stopping distance $W$ is given by $W = g(Y)$, where $g(y) = 0.07\,y^2$, and so, according to our formula, we have
$$E[W] = \int_{-\infty}^{\infty} g(y)\, f_Y(y)\,dy = \int_{-\infty}^{\infty} 0.07\,y^2\, \tfrac{1}{90-60}\,\mathbf{1}_{[60,90]}(y)\,dy = \tfrac{0.07}{30}\int_{60}^{90} y^2\,dy = 399.$$
If we compute the expected speed, we get $E[Y] = \frac{1}{30}\int_{60}^{90} y\,dy = 75$, and if we compute the stopping distance of a car traveling at 75 mph, we get $g(75) = 393.75$. It follows that the average (expected) stopping distance is not the same as the stopping distance corresponding to the average speed. Why is that?

Caveat: What we observed at the end of Example 4.1.2 is so important that it should be repeated: In general, $E[g(Y)] \ne g(E[Y])$! In fact, the only time we can guarantee equality for any $Y$ is when $g$ is an affine function, i.e., when $g(y) = \alpha y + \beta$ for some constants $\alpha$ and $\beta$.

4.2 The cdf-method

The fundamental formula of Theorem 4.1.1 is useful for computing expectations, but it has nothing to say about the distribution of $W = g(Y)$. For example, we may wonder whether the distribution of stopping distances is uniform on some interval, just like the distribution of velocities at the onset of braking in Example 4.1.2. There are several methods for answering this question, and we start with the one which almost always works - the cdf-method.

Suppose that we know the cdf $F_Y$ of $Y$ and that we are interested in the distribution of $W = g(Y)$. Using the definition of the cdf $F_W$ of $W$, we can write
$$F_W(w) = P[W \le w] = P[g(Y) \le w].$$
The probability on the right is not quite the cdf of $Y$, but if it can be rewritten in terms of probabilities involving $Y$, or, better, the cdf of $Y$, we are in business:

1.
If $g$ is strictly increasing, then it admits an inverse function $g^{-1}$ and we can write
$$F_W(w) = P[g(Y) \le w] = P[Y \le g^{-1}(w)] = F_Y(g^{-1}(w)),$$
and we have an expression for $F_W$ in terms of $F_Y$. Once $F_W$ is known, it can be used further to compute the pdf (in the continuous case) or the pmf (in the discrete case), or . . .

2. A very similar computation can be made if $g$ is strictly decreasing. The only difference is that now $P[g(Y) \le w] = P[Y \ge g^{-1}(w)]$. In the continuous case we have $P[Y \ge y] = 1 - F_Y(y)$ (why only in the continuous case?), so
$$F_W(w) = P[g(Y) \le w] = P[Y \ge g^{-1}(w)] = 1 - F_Y(g^{-1}(w)).$$

3. The function $g$ is neither increasing nor decreasing, but the inequality $g(y) \le w$ can be "solved" in simple terms. To understand what is meant by this, have a look at the examples below.

Example 4.2.1.

1. Linear transformations. Let $Y$ be any random variable, and let $W = g(Y)$, where $g(y) = a + by$ is a linear transformation with $b > 0$. Since $b > 0$, the function $g$ is strictly increasing. Therefore,
$$F_W(w) = P[g(Y) \le w] = P[a + bY \le w] = P[Y \le \tfrac{w-a}{b}] = F_Y(\tfrac{w-a}{b}).$$
This expression is especially nice if $Y$ is a continuous random variable because then so is $W$, and we have
$$f_W(w) = \tfrac{d}{dw} F_W(w) = \tfrac{d}{dw} F_Y(\tfrac{w-a}{b}) = f_Y(\tfrac{w-a}{b})\,\tfrac{1}{b},$$
where the last equality follows from the chain rule. Here are some important special cases:

a) Linear transformations of a normal. If $Y \sim N(0,1)$ is a unit normal, then $f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}$, and so, $f_W(w) = \frac{1}{\sqrt{2\pi b^2}}\, e^{-\frac{(w-a)^2}{2b^2}}$. We recognize this as the pdf of the normal distribution, but this time with parameters $a$ and $b$. The inverse of this computation lies behind the familiar z-score transformation: if $Y \sim N(\mu, \sigma)$, then $Z = \frac{Y - \mu}{\sigma} \sim N(0,1)$.

b) Linear transformations of a uniform. If $Y \sim U(0,1)$ is a random number, $g(y) = a + by$ and $W = g(Y)$, then $F_W(w) = F_Y((w-a)/b)$ and, so,
$$f_W(w) = \tfrac{1}{b}\, f_Y((w-a)/b) = \tfrac{1}{b}\,\mathbf{1}_{[0,1]}((w-a)/b).$$
When we talked about indicators, we mentioned that a different notation for the same function can simplify computations in some cases. Here is the case in point. If we replace $\mathbf{1}_{[0,1]}((w-a)/b)$ by $\mathbf{1}_{\{0 \le (w-a)/b \le 1\}}$, we can rearrange the expression inside $\{\,\}$ and get
$$f_W(w) = \tfrac{1}{b}\,\mathbf{1}_{\{a \le w \le a+b\}} = \tfrac{1}{b}\,\mathbf{1}_{[a, a+b]}(w),$$
and we readily recognize $f_W$ as the pdf of another uniform distribution, this time on the interval $[a, a+b]$. If we wanted to transform $U(0,1)$ into $U(l,r)$, we would simply need to pick $a = l$ and $b = r - l$.

It is not a coincidence that linear transformations of normal and uniform random variables result in random variables in the same parametric families (albeit with different parameters). Parametric families are often (but not always) chosen to have this exact property.

2. Inverse exponential distribution. Let $Y \sim E(\tau)$ be an exponentially distributed random variable, and let $g(y) = 1/y$. The function $g$ is strictly decreasing on $(0,\infty)$ and so, for $w > 0$, we have
$$F_W(w) = P[1/Y \le w] = P[Y \ge 1/w] = 1 - F_Y(1/w) = e^{-\frac{1}{\tau w}}.$$
This computation will not work for $w \le 0$, but we know that $W$ always takes positive values, as it is the reciprocal of $Y$, which is always positive. Therefore, $F_W(w) = 0$ for $w \le 0$. We can differentiate the expression for $F_W$ to obtain the pdf $f_W$:
$$f_W(w) = e^{-\frac{1}{\tau w}}\, \tfrac{1}{\tau w^2}\, \mathbf{1}_{(0,\infty)}(w).$$
This pdf cannot be recognized as the pdf of any of our named distributions, but it is sometimes called the inverse exponential distribution, and it is used in wireless communications.

Figure 1. The pdf of the inverse exponential distribution.

3. $\chi^2$-distribution. Let $Y \sim N(0,1)$ be the unit normal random variable, and let $g(y) = y^2$. This is an example of a transformation which is neither increasing nor decreasing. We can still try to make sense of the expression $g(y) \le w$, i.e., $y^2 \le w$, and use the cdf-method:
$$F_W(w) = P[Y^2 \le w].$$
For $w < 0$ it is impossible that $Y^2 \le w$, so we immediately conclude that $F_W(w) = 0$ for $w < 0$. When $w \ge 0$, we have
$$P[Y^2 \le w] = P[Y \in [-\sqrt{w}, \sqrt{w}]] = P[Y \le \sqrt{w}] - P[Y < -\sqrt{w}].$$
Since $P[Y = -\sqrt{w}] = 0$ (as $Y$ is continuous), we get
$$F_W(w) = \begin{cases} 0, & w < 0, \\ F_Y(\sqrt{w}) - F_Y(-\sqrt{w}), & w \ge 0. \end{cases}$$
We differentiate both sides in $w$ and use the fact that $F_Y$ is a normal cdf, so that $\frac{d}{dy} F_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y^2}$, to obtain
$$f_W(w) = \tfrac{1}{\sqrt{2\pi w}}\, e^{-\frac{1}{2}w}\, \mathbf{1}_{(0,\infty)}(w). \qquad (4.2.1)$$
A random variable with the pdf $f_W(w)$ of (4.2.1) above is said to have a $\chi^2$-distribution (pronounced [kai-skwer]). It is very important in statistics, and we will spend a lot more space on it later.

Figure 2. The pdf of the $\chi^2$ distribution.

4.3 The h-method

The application of the cdf-method can sometimes be streamlined, leading to the so-called h-method or the method of transformations. It works when $Y$ is a continuous random variable and when the transformation function $g$ admits an inverse function $h$. Supposing that is the case, remember that, when $g$ is increasing, we have
$$F_W(w) = F_Y(g^{-1}(w)) = F_Y(h(w)).$$
If we assume that everything is differentiable and that $Y$ and $W$ admit pdfs $f_Y$ and $f_W$, we can take a derivative in $w$ to obtain
$$f_W(w) = \tfrac{d}{dw} F_W(w) = \tfrac{d}{dw}\big(F_Y(h(w))\big) = F_Y'(h(w))\,\tfrac{d}{dw}h(w) = f_Y(h(w))\, h'(w).$$
Another way of deriving the same formula is to interpret the pdf $f_Y(y)$ as the quantity such that $P[Y \in [y, y + \Delta y]] \approx f_Y(y)\,\Delta y$ when $\Delta y > 0$ is "small". Applying the same to $W = g(Y)$ in two ways yields
$$P[W \in [w, w + \Delta w]] \approx f_W(w)\,\Delta w,$$
and also, assuming that $g$ is increasing,
$$P[W \in [w, w + \Delta w]] = P[g(Y) \in [w, w + \Delta w]] = P[Y \in [h(w), h(w + \Delta w)]] \approx P[Y \in [h(w), h(w) + \Delta w\, h'(w)]] = f_Y(h(w))\, h'(w)\,\Delta w.$$
The approximate equality $\approx$ between the third and the fourth probability above uses the fact that $h(w + \Delta w) \approx h(w) + \Delta w\, h'(w)$, which is nothing but a consequence of the definition of the derivative
$$h'(w) = \lim_{\Delta w \to 0} \tfrac{h(w + \Delta w) - h(w)}{\Delta w}.$$
It could also be seen as the first-order Taylor formula for $h$ around $w$. The derivation above can be made fully rigorous, leading to the following theorem (why does the absolute value $|h'(w)|$ appear there?). The word interval in it means $(a,b)$, where either $a$ or $b$ could be infinite (so that, for example, $\mathbb{R}$ itself is also an interval).

Theorem 4.3.1 (The h-method). Suppose that the function $g$
1. is defined on an interval $I \subseteq \mathbb{R}$,
2. has an image which is an interval $J \subseteq \mathbb{R}$, and
3. has a continuously-differentiable inverse function $h : J \to I$.
Suppose that $Y$ is a continuous random variable with pdf $f_Y$ such that $f_Y(y) = 0$ for $y \notin I$. Then $W = g(Y)$ is also a continuous random variable and its pdf is given by the following formula:
$$f_W(w) = f_Y(h(w))\, |h'(w)|\, \mathbf{1}_{\{w \in J\}}.$$
Note: in almost all applications $I = \{y \in \mathbb{R} : f_Y(y) > 0\}$, for a properly defined version of $f_Y$, and $J = g(I)$.

Example 4.3.2.

1. Let $Y$ be a continuous random variable with pdf $f_Y(y) = \frac{1}{\pi(1+y^2)}$. The distribution of $Y$ is called the Cauchy distribution. We define $W = g(Y)$, where $g(y) = \arctan(y)$. The function $g$ is defined on the interval $I = \mathbb{R}$ and its image is the interval $J = (-\pi/2, \pi/2)$. Moreover, its inverse is the function $h : J \to I$ given by $h(w) = \tan(w)$. This function admits a derivative $h'(w) = 1/\cos^2(w)$. Theorem 4.3.1 can be applied and it states that
$$f_W(w) = \tfrac{1}{\pi(1+\tan^2(w))}\, \tfrac{1}{\cos^2(w)}\, \mathbf{1}_{\{w \in (-\pi/2, \pi/2)\}} = \tfrac{1}{\pi}\, \mathbf{1}_{\{w \in (-\pi/2, \pi/2)\}}.$$
This allows us to identify the distribution of $W$ as uniform on the interval $(-\pi/2, \pi/2)$, i.e., $W \sim U(-\pi/2, \pi/2)$.

2. Let $Y \sim E(\tau)$, and let $g(y) = \sqrt{y}$.
The function $g$ is defined on $I = (0,\infty)$ and maps it into $J = (0,\infty)$, and its inverse is $h(w) = w^2 : J \to I$. The pdf $f_Y$ of $Y$ is given by $f_Y(y) = \frac{1}{\tau}\exp(-y/\tau)\,\mathbf{1}_{\{y>0\}}$ and, so, by Theorem 4.3.1, $W = \sqrt{Y} = g(Y)$ is a continuous random variable with density
$$f_W(w) = \tfrac{2}{\tau}\, w\, \exp(-w^2/\tau)\, \mathbf{1}_{\{w>0\}},$$
where we removed the absolute value around $h'(w) = 2w$ because of the indicator $\mathbf{1}_{\{w>0\}}$. This is known as the Weibull distribution.

3. Let $Y$ and $W$ be as in Example 4.1.2, i.e., $Y \sim U(60, 90)$ and $g(y) = c\,y^2$, where $c = 0.07$. Then $g : \mathbb{R} \to \mathbb{R}$ is neither increasing nor decreasing, and does not admit an inverse. However, it is increasing on the set $(60, 90)$ where the random variable $Y$ takes its values, i.e., where $f_Y(y) > 0$ (we can exclude the endpoints 60 and 90 because they happen with probability 0). If we restrict $g$ to the interval $I = (60, 90)$, it admits an inverse $h : J \to I$, where $J = (g(60), g(90)) = (252, 567)$, and
$$h(w) = \sqrt{\tfrac{w}{c}}, \quad w \in J.$$
The pdf $f_W(w)$ of $W$ is then given by
$$f_W(w) = f_Y(h(w))\, h'(w) = \tfrac{1}{90-60}\, \tfrac{1}{2\sqrt{c\,w}}\, \mathbf{1}_{\{w \in J\}}.$$

Figure 3. The pdfs of $Y$ and $W = g(Y)$ (with both axes scaled differently).

4.4 Problems

Problem 4.4.1. Let $Y$ be an exponential random variable with parameter $\tau > 0$. Compute the cdf $F_W$ and the pdf $f_W$ of the random variable $W = Y^3$.

Problem 4.4.2. Let $Y$ be a uniformly distributed random variable on the interval $[0,1]$, and let $W = \exp(Y)$. Compute $E[W]$, the CDF $F_W$ of $W$ and the pdf $f_W$ of $W$.

Problem 4.4.3. A scientist measures the side of a cubical box and the result is a random variable (due to the measurement error) $Y$, which we assume is normally distributed with mean $\mu = 1$ and $\sigma = 0.1$ (both in feet). In other words, the true measurement of the side of the box is 1 foot, but the scientist does not know that; she only knows the value of $Y$.
1. What is the distribution of the volume $W$ of the box?
2.
What is the probability that the scientist’s measurement overestimates the volume of the box by more than 10%? (Hint: Review z-scores and com- putations of probabilities related to the normal distribution. You will not need to integrate anything here, but may need to use software (if you know how) or look into a normal table. We will talk more about how to do that later. For now, use any method you want.) Problem 4.4.4. Let Y be a continuous random variable with the density func- tion fY given by fY(y) = { 2+3y 16 , y ∈ (1, 3) 0, otherwise. The pdf fW(w) of W = Y2 is (a) fW(w) = 2+3 √ w 16 √ w 1{w∈(1,3)} (b) fW(w) = 2+3 √ w 32 √ w 1{w∈(1,9)} (c) fW(w) = 2+3w16√w 1{w∈(1,9)} (d) fW(w) = 2+3w32√w 1{w∈(1,3)} (e) none of the above Problem 4.4.5. The Maxwell-Boltzmann distribution describes the distribu- tion of speeds of particles in an (idealized) gas, and its pdf is given by fY(y) = 4β3√ π y2e−βy 2 1(0,∞)(y), where β > 0 is a constant that depends on the properties of the particular gas studied. Last Updated: September 25, 2019 Lecture 4: Functions of random variables 10 of 11 The kinetic energy of a gas particle with speed y is 12 my 2. What is the dis- tribution (pdf) of the kinetic energy if the particle speed follows the Maxwell- Boltzmann distribution? Problem 4.4.6. Let Y be a uniform distribution on (0, 1). The distribution of the random variable − 12 log(Y) is (a) exponential E(τ) with τ = 2 (b) exponential E(τ) with τ = 1/2 (c) uniform U(0, 1/2) on (0, 1/2) (d) uniform U(0, 2) on (0, 2) (e) none of the above Problem 4.4.7. Let Y be an exponential random variable with parameter τ > −1. Then E[e−Y] = (a) τ (b) 11+τ (c) 1 1−τ (d) τ 1−τ (e) none of the above Problem 4.4.8. Let Y be a random variable with the pdf fY(y) = 1π(1+y2) . The pdf of W = 1/Y2 is (a) 2 π √ w(1+w)1{w>0} (b) 1 π √ w(1+w)1{w>0} (c) w 2 π(1+w4) (d) 2w 2 π(1+w4) (e) none of the above (Hint: You do not need to actually compute the cdf of Y.) Problem 4.4.9. 
The pdf of W = 1/Y², where Y ∼ E(τ), is

(a) (2/τ) y^(−3/2) e^(−y/τ) 1{y>0}
(b) (1/(2τ)) (−y^(−3/2)) e^(−√y/τ) 1{y>0}
(c) (1/τ) e^(−1/(y²τ)) 1{y>0}
(d) (1/(2τ)) y^(−3/2) e^(−1/(τ√y)) 1{y>0}
(e) none of the above

Problem 4.4.10. Let Y be a uniform random variable on [−1, 1], and let W = Y². The pdf of W is

(a) 1/(4√|w|) 1{−1<w<1}
(b) (1/√w) 1{0<w<1}
(c) (1/(2√w)) 1{0<w<1}
(d) 2w 1{0<w<1}
(e) none of the above

Problem 4.4.11. Let Y be a uniform random variable on [0, 1], and let W = Y². The pdf of W is

(a) 1/(2√|w|) 1{−1<w<1}
(b) (1/√w) 1{0<w<1}
(c) (1/(2√w)) 1{0<w<1}
(d) 2w 1{0<w<1}
(e) none of the above

Problem 4.4.12. Fuel efficiency of a sample of cars has been measured by a group of American engineers, and it turns out that the distribution of gas-mileage is uniform on the interval [10 mpg, 30 mpg], where mpg stands for miles per gallon. In particular, the average gas-mileage is 20 mpg. A group of European engineers decided to redo the statistics, but, being European, they used European units. Instead of miles per gallon, they used liters per 100 km. (Note that the ratio is reversed in Europe - higher numbers in liters per 100 km correspond to worse gas-mileage.) If one mile is 1.609 km and one gallon is 3.785 liters, the average gas-mileage of 20 mpg, obtained by the Americans, translates into 11.762 liters per 100 km. However, the average gas-mileage obtained by the Europeans was different from 11.762, even though they used the same sample and made no computational errors. How can that be? What did they get? Is the distribution of gas-mileage still uniform when expressed in European units? If not, what is its pdf?

Problem 4.4.13. Let Y be a uniformly distributed random variable on the interval (0, 1). Find a function g such that the random variable W = g(Y) has the exponential distribution with parameter τ = 1. (Hint: Use the h-method.)
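Answers to change-of-variables problems like the ones above can be sanity-checked by simulation. The sketch below (a minimal check in Python, using only the standard library; the choice τ = 2 and the sample size are arbitrary) tests the Weibull example from Section 4.3: if Y ∼ E(τ), then W = √Y should have pdf fW(w) = (2/τ) w e^(−w²/τ) 1{w>0}, which implies E[W] = √(πτ)/2 and P[W ≤ 1] = P[Y ≤ 1] = 1 − e^(−1/τ).

```python
import math
import random

# Monte Carlo check of the change-of-variables result: if Y is
# exponential with mean tau, then W = sqrt(Y) has the Weibull pdf
# f_W(w) = (2/tau) * w * exp(-w**2 / tau) for w > 0.
# (tau = 2 and n = 200_000 are arbitrary choices for this check.)
random.seed(0)
tau = 2.0
n = 200_000

# random.expovariate takes the *rate* lambda = 1/mean.
samples = [math.sqrt(random.expovariate(1.0 / tau)) for _ in range(n)]

# Analytic values implied by the Weibull density above:
#   E[W]      = sqrt(pi * tau) / 2
#   P[W <= 1] = 1 - exp(-1 / tau)
mean_mc = sum(samples) / n
mean_exact = math.sqrt(math.pi * tau) / 2

prob_mc = sum(1 for w in samples if w <= 1.0) / n
prob_exact = 1.0 - math.exp(-1.0 / tau)

print(f"E[W]: MC = {mean_mc:.4f}, exact = {mean_exact:.4f}")
print(f"P[W <= 1]: MC = {prob_mc:.4f}, exact = {prob_exact:.4f}")
```

With 200,000 samples the Monte Carlo estimates typically agree with the analytic values to two or three decimal places; a mismatch here would signal an error in the derived density.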
Lecture 5
Probability review - joint distributions

Course: Mathematical Statistics
Term: Fall 2017
Instructor: Gordan Žitković

5.1 Random vectors

So far we talked about the distribution of a single random variable Y. In the discrete case we used the notion of a pmf (or a probability table), and in the continuous case the notion of the pdf, to describe that distribution and to compute various related quantities (probabilities, expectations, variances, moments). Now we turn to distributions of several random variables put together.

Just as several real numbers arranged in an order make a vector, n random variables defined in the same setting (on the same probability space) make a random vector. We typically denote the components of random vectors by subscripts, so that (Y1, Y2, Y3) is a typical 3-dimensional random vector. We can also think of a random vector as a random point in an n-dimensional space. This way, a random pair (Y1, Y2) can be thought of as a random point in the plane, with Y1 and Y2 interpreted as its x- and y-coordinates.

There is a significant (and somewhat unexpected) difference between the distribution of a random vector and the pair of distributions of its components, taken separately. This is not the case with non-random quantities: a point in the plane is uniquely determined by its (two) coordinates. The distribution of a random point in the plane, however, is not determined by the distributions of its projections onto the coordinate axes. The situation can be illustrated by the following example:

Example 5.1.1.

1. Let us toss two unbiased coins, and let us call the outcomes Y1 and Y2. Assuming that the tosses are unrelated, the probabilities of the following four outcomes

{Y1 = H, Y2 = H}, {Y1 = H, Y2 = T}, {Y1 = T, Y2 = H}, {Y1 = T, Y2 = T}

are the same, namely 1/4. In particular, the probabilities that the first coin lands on H or T are the same, namely 1/2.
The distribution tables for both Y1 and Y2 are the same and look like this:

H    T
1/2  1/2

Let us now repeat the same experiment, but with two coins attached to each other (say, welded together) as in the picture:

Figure 1. Two quarters welded together, so that when one falls on heads the other must fall on tails, and vice versa.

We can still toss them and call the outcome of the first one Y1 and the outcome of the second one Y2. Since they are welded, it can never happen that Y1 = H and Y2 = H at the same time, and it can never happen that Y1 = T and Y2 = T at the same time, either. Therefore, of the four outcomes above only two "survive",

{Y1 = H, Y2 = T}, {Y1 = T, Y2 = H},

and each happens with probability 1/2. The distribution of Y1 considered separately from Y2 is the same as in the non-welded case, namely

H    T
1/2  1/2

and the same goes for Y2. This is one of the simplest examples, but it already strikes at the heart of the matter: randomness in one part of the system may depend on the randomness in the other part.

2. Here is an artistic (geometric) view of an analogous phenomenon. The projections of a 3D object on two orthogonal planes do not determine the object entirely. Sculptor Markus Raetz used that fact to create the sculpture entitled "Yes/No":

Figure 2. Yes/No - A "typographical" sculpture by Markus Raetz

Not to be outdone, I decided to create a different typographical sculpture with the same projections (I could not find the exact same font Markus is using, so you need to pretend that my projections match his completely). It is not hard to see that my sculpture differs significantly from Markus's, but they both have (almost) the same projections, namely the words "Yes" and "No".

Figure 3. My own attempt at a "typographical" sculpture, using SketchUp.
You should pretend that Markus's fonts and mine are the same.

5.2 Joint distributions - the discrete case

So, in order to describe the distribution of the random vector (Y1, . . . , Yn), we need more than just the individual distributions of its components Y1, . . . , Yn. In the discrete case, the events whose probabilities finally made their way into the distribution table were of the form {Y = i}, for all i in the support SY of Y. For several random variables, we need to know their joint distribution, i.e., the probabilities of all combinations

{Y1 = i1, Y2 = i2, . . . , Yn = in},

over the set of all combinations (i1, i2, . . . , in) of possible values our random variables can take. These numbers cannot comfortably fit into a table, except in the case n = 2, where we talk about the joint distribution table, which looks like this:

       j1                    j2                    . . .
i1     P[Y1 = i1, Y2 = j1]   P[Y1 = i1, Y2 = j2]   . . .
i2     P[Y1 = i2, Y2 = j1]   P[Y1 = i2, Y2 = j2]   . . .
...    ...                   ...                   . . .

Example 5.2.1. Two dice are thrown (independently of each other) and their outcomes are denoted by Y1 and Y2. Since P[Y1 = i, Y2 = j] = 1/36 for any i, j ∈ {1, 2, . . . , 6}, the joint distribution table of (Y1, Y2) looks like this:

     1     2     3     4     5     6
1    1/36  1/36  1/36  1/36  1/36  1/36
2    1/36  1/36  1/36  1/36  1/36  1/36
3    1/36  1/36  1/36  1/36  1/36  1/36
4    1/36  1/36  1/36  1/36  1/36  1/36
5    1/36  1/36  1/36  1/36  1/36  1/36
6    1/36  1/36  1/36  1/36  1/36  1/36

The situation is more interesting if Y1 still denotes the outcome of the first die, but Z now stands for the sum of the numbers on the two dice.
It is not hard to see that the joint distribution table of (Y1, Z) now looks like this:

     2     3     4     5     6     7     8     9     10    11    12
1    1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     0     0
2    0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     0     0
3    0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     0     0
4    0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  0     0
5    0     0     0     0     1/36  1/36  1/36  1/36  1/36  1/36  0
6    0     0     0     0     0     1/36  1/36  1/36  1/36  1/36  1/36

Going from the joint distribution of the random vector (Y1, Y2, . . . , Yn) to the individual (called marginal) distributions of Y1, Y2, . . . , Yn is easy. To compute P[Y1 = i] we need to sum P[Y1 = i, Y2 = i2, . . . , Yn = in] over all combinations (i2, . . . , in), where i2, . . . , in range through all possible values Y2, . . . , Yn can take.

Example 5.2.2. Continuing the previous example, let us compute the marginal distributions of Y1, Y2 and Z. For Y1 we sum the probabilities in each row of the joint distribution table of (Y1, Y2) to obtain

1    2    3    4    5    6
1/6  1/6  1/6  1/6  1/6  1/6

The same table is obtained for the marginal distribution of Y2 (even though we sum over columns this time). For the marginal distribution of Z, we use the joint distribution table of (Y1, Z) and sum over columns:

2     3     4     5     6     7     8     9     10    11    12
1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Once the distribution table of a random vector (Y1, . . . , Yn) is given, we can compute (in theory) the probability of any event concerning the random variables Y1, . . . , Yn, by simply summing over the set of appropriate entries in the joint distribution table.

Example 5.2.3. We continue with the random variables Y1, Y2 and Z defined above and ask the following question: what is the probability that the two dice have the same outcome? In other words, what is P[Y1 = Y2]?
The entries in the table corresponding to this event are the ones on the main diagonal:

     1     2     3     4     5     6
1    1/36  1/36  1/36  1/36  1/36  1/36
2    1/36  1/36  1/36  1/36  1/36  1/36
3    1/36  1/36  1/36  1/36  1/36  1/36
4    1/36  1/36  1/36  1/36  1/36  1/36
5    1/36  1/36  1/36  1/36  1/36  1/36
6    1/36  1/36  1/36  1/36  1/36  1/36

so that P[Y1 = Y2] = 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 1/6.

5.3 Joint distributions - the continuous case

Just as in the univariate case (the case of a single random variable), the continuous analogue of the distribution table (or the pmf) is the pdf. Recall that the pdf fY(y) of a single random variable Y is the function with the property that

P[Y ∈ [a, b]] = ∫_a^b fY(y) dy.

In the multivariate case (the case of a random vector, i.e., several random variables), the pdf of the random vector (Y1, . . . , Yn) becomes a function of several variables fY1,...,Yn(y1, . . . , yn), and it is characterized by the property that

P[Y1 ∈ [a1, b1], Y2 ∈ [a2, b2], . . . , Yn ∈ [an, bn]] = ∫_{a1}^{b1} ∫_{a2}^{b2} · · · ∫_{an}^{bn} fY1,...,Yn(y1, . . . , yn) dyn dyn−1 . . . dy2 dy1.

This formula is better understood if interpreted geometrically. The left-hand side is the probability that the random vector (Y1, . . . , Yn) (think of it as a random point in Rn) lies in the region [a1, b1] × . . . × [an, bn], while the right-hand side is the integral of fY1,...,Yn over the same region.

Example 5.3.1. A point is randomly and uniformly chosen inside a square with side 1. That means that any two regions of equal area inside the square have the same probability of containing the point. We denote the two coordinates of this point by Y1 and Y2 (even though X and Y would be more natural), and their joint pdf by fY1,Y2.
Since the probabilities are computed by integrating fY1,Y2 over various regions in the square, there is no reason for f to take different values at different points inside the square; this means that fY1,Y2(y1, y2) = c for some constant c > 0, for all points (y1, y2) in the square. Our random point never falls outside the square, so the value of f outside the square should be 0. Pdfs (either in one or in several dimensions) integrate to 1, so we conclude that f should be given by

fY1,Y2(y1, y2) = 1 for (y1, y2) ∈ [0, 1]², and fY1,Y2(y1, y2) = 0 otherwise.

Once a pdf of a random vector (Y1, . . . , Yn) is given, we can compute all kinds of probabilities with it. For any region A ⊂ Rn (not only for rectangles of the form [a1, b1] × . . . × [an, bn]), we have

P[(Y1, . . . , Yn) ∈ A] = ∫∫ · · · ∫_A fY1,...,Yn(y1, . . . , yn) dyn . . . dy1.

As is almost always the case in the multivariate setting, this is much better understood through an example:

Example 5.3.2. Let (Y1, Y2) be the random uniform point in the square [0, 1]² from the previous example. To compute the probability that the distance from (Y1, Y2) to the origin (0, 0) is at most 1, we define

A = {(y1, y2) ∈ [0, 1]² : √(y1² + y2²) ≤ 1}.

Figure 4. The region A.

Since fY1,Y2(y1, y2) = 1 for all (y1, y2) ∈ A, we have

P[(Y1, Y2) is at most 1 unit away from (0, 0)] = P[(Y1, Y2) ∈ A] = ∫∫_A fY1,Y2(y1, y2) dy2 dy1 = ∫∫_A 1 dy1 dy2 = area(A) = π/4.

The calculations in the previous example sometimes fall under the heading of geometric probability, because the probability π/4 we obtained is simply the ratio of the area of A to the area of [0, 1]² (just like one computes a uniform probability in a finite setting by dividing the number of "favorable" cases by the total number). This works only if the underlying pdf is uniform. In practice, pdfs are rarely uniform.

Example 5.3.3.
Let (Y1, Y2) be a random vector with the pdf

fY1,Y2(y1, y2) = 6y1 for 0 ≤ y1 ≤ y2 ≤ 1, and fY1,Y2(y1, y2) = 0 otherwise,

or, in the indicator notation, fY1,Y2(y1, y2) = 6y1 1{0≤y1≤y2≤1}. Here is a sketch of what f looks like:

Figure 5. The pdf of (Y1, Y2)

This still corresponds to a distribution of a random point in the unit square, but this distribution is no longer