ROBUST STATISTICAL METHODS with R

Jana Jurečková
Jan Picek

Published in 2006 by
Chapman & Hall/CRC, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2006 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 1-58488-454-1 (Hardcover)
International Standard Book Number-13: 978-1-58488-454-5 (Hardcover)
Library of Congress Card Number 2005053192

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Jureckova, Jana, 1940-
Robust statistical methods with R / Jana Jureckova, Jan Picek.
p. cm.
Includes bibliographical references and indexes.
ISBN-13: 978-1-58488-454-5 (acid-free paper)
ISBN-10: 1-58488-454-1 (acid-free paper)
1. Robust statistics. 2. R (Computer program language)--Statistical methods. I. Picek, Jan, 1965- II. Title.
QA276.J868 2006
519.5--dc22
2005053192

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Taylor & Francis Group is the Academic Division of Informa plc.
Contents

Preface
Authors
Introduction

1 Mathematical tools of robustness
1.1 Statistical model
1.2 Illustration on statistical estimation
1.3 Statistical functional
1.4 Fisher consistency
1.5 Some distances of probability measures
1.6 Relations between distances
1.7 Differentiable statistical functionals
1.8 Gâteau derivative
1.9 Fréchet derivative
1.10 Hadamard (compact) derivative
1.11 Large sample distribution of empirical functional
1.12 Computation and software notes
1.13 Problems and complements

2 Basic characteristics of robustness
2.1 Influence function
2.2 Discretized form of influence function
2.3 Qualitative robustness
2.4 Quantitative characteristics of robustness based on influence function
2.5 Maximum bias
2.6 Breakdown point
2.7 Tail-behavior measure of a statistical estimator
2.8 Variance of asymptotic normal distribution
2.9 Problems and complements

3 Robust estimators of real parameter
3.1 Introduction
3.2 M-estimators
3.3 M-estimator of location parameter
3.4 Finite sample minimax property of M-estimator
3.5 Moment convergence of M-estimators
3.6 Studentized M-estimators
3.7 L-estimators
3.8 Moment convergence of L-estimators
3.9 Sequential M- and L-estimators
3.10 R-estimators
3.11 Numerical illustration
3.12 Computation and software notes
3.13 Problems and complements

4 Robust estimators in linear model
4.1 Introduction
4.2 Least squares method
4.3 M-estimators
4.4 GM-estimators
4.5 S-estimators and MM-estimators
4.6 L-estimators, regression quantiles
4.7 Regression rank scores
4.8 Robust scale statistics
4.9 Estimators with high breakdown points
4.10 One-step versions of estimators
4.11 Numerical illustrations
4.12 Computation and software notes
4.13 Problems and complements

5 Multivariate location model
5.1 Introduction
5.2 Multivariate M-estimators of location and scatter
5.3 High breakdown estimators of multivariate location and scatter
5.4 Admissibility and shrinkage
5.5 Numerical illustrations and software notes
5.6 Problems and complements

6 Some large sample properties of robust procedures
6.1 Introduction
6.2 M-estimators
6.3 L-estimators
6.4 R-estimators
6.5 Interrelationships of M-, L- and R-estimators
6.6 Minimaximally robust estimators
6.7 Problems and complements

7 Some goodness-of-fit tests
7.1 Introduction
7.2 Tests of normality of the Shapiro-Wilk type with nuisance regression and scale parameters
7.3 Goodness-of-fit tests for general distribution with nuisance regression and scale
7.4 Numerical illustration
7.5 Computation and software notes

Appendix A: R system
A.1 Brief R overview

References
Subject index
Author index

Preface

Robust statistical procedures have become a part of the general statistical consciousness. Yet students first learn descriptive statistics and the classical statistical procedures, and only later do students and practical statisticians hear that the classical procedures should be used with great caution, and that their favorite and simple least squares estimator and other procedures should be replaced with robust or nonparametric alternatives.
To be convinced to use robust methods, one needs a reasonable motivation; and everybody needs a motivation of their own: a mathematically oriented person demands a theoretical proof, while a practitioner prefers to see the numerical results. Both aspects are important.

The robust statistical procedures became known to the Prague statistical society by the end of the 1960s, thanks to Jaroslav Hájek and his contacts with Peter J. Huber, Peter J. Bickel and other outstanding statisticians. Frank Hampel presented his "Some small sample asymptotics," also touching M-estimates, at Hájek's conference in Prague in 1973, and published his paper in the Proceedings. Thus we had our own experience with the first skepticism toward the robust methods, but by 1980 we started organizing regular workshops for applied statisticians called ROBUST. The 14th ROBUST will be held in January 2006. On this occasion, we express our appreciation to Jaromír Antoch, the main organizer.

The course "Robust Statistical Methods" is now a part of the master study of statistics at Charles University in Prague and is followed by all statistical students. The present book draws on experience obtained during these courses. We supplement the theory with examples and computational procedures in the system R. We chose R as a suitable tool because R seems to be one of the best statistical environments available. It is also free, and the R project is an open source project: the code you are using is always available for you to see. Detailed information about the system R and the R project is available from http://www.r-project.org/. The prepared procedures and datasets, not available from public resources, can be found on the website http://www.fp.vslib.cz/kap/picek/robust/.

We acknowledge the support of the Czech Republic Grant 201/05/2340, the Research Project MSM 0021620839 of Charles University in Prague, and MSM 467488501 of the Technical University in Liberec. The authors would also like to thank the editors and anonymous referees who contributed considerably to the readability of the text. Our gratitude also belongs to our families for their support and patience.

Jana Jurečková, Jan Picek
Prague and Liberec

Authors

Jana Jurečková is a professor of statistics and probability at Charles University in Prague, Czech Republic. She is a coauthor of Robust Statistical Inference: Asymptotics and Inter-Relations (with P. K. Sen, John Wiley & Sons, 1996) and of Adaptive Regression (with Y. Dodge, Springer, 2000). She is a fellow of the Institute of Mathematical Statistics and an elected member of the International Statistical Institute. She was an associate editor of the Annals of Statistics for six years, and previously for two years in the 1980s, and is now an associate editor of the Journal of the American Statistical Association. Jurečková participates in extensive international cooperation, mainly with statisticians in Belgium, Canada, France, Switzerland and the United States.

Jan Picek is an associate professor of applied mathematics at the Technical University in Liberec, Czech Republic.

Introduction

If we analyze data with the aid of classical statistical procedures based on parametric models, we usually tacitly assume that the regression is linear, that the observations are independent and homoscedastic, and that the errors are normally distributed.
Today, however, we can simulate data from any probability distribution and from various models with high-speed computers and follow the graphics, which was not possible before; we then observe that these assumptions are often violated. We are then mainly interested in the two following questions:

a) When should we still use the classical statistical procedures, and when are they still optimal?

b) Are there other statistical procedures that are not so closely connected with special models and conditions?

The classical procedures are typically parametric: the model is fully specified up to the values of several scalar or vector parameters. These parameters typically correspond to the probability distributions of the random errors of the model. If we succeed in estimating these parameters or in testing a hypothesis on their domain, we can use our data and reach a definite conclusion. However, this conclusion is correct only under the validity of our model.

An opposite approach uses nonparametric procedures. These procedures are independent of, or only weakly dependent on, the special shape of the basic probability distribution, and they behave reasonably well (though not exactly optimally) for a broad class of distribution functions, e.g., the class of distribution functions with densities, possibly symmetric. The discrete probability distributions do not create a serious problem, because their forms usually follow the character of the experiment. A typical representative of nonparametric statistical procedures is the class of rank tests of statistical hypotheses: the "null distributions" of the test criterion (i.e., the distributions under the hypothesis H0 of interest) coincide under all continuous probability distribution functions of the observations.

Unlike the parametric models with scalar or vector parameters, the nonparametric models consider the whole density function or the regression function as an unknown parameter of infinite dimension. If this functional parameter is only a nuisance, i.e., if our conclusions concern other entities of the model, then we try to avoid its estimation, if possible. The statistical procedures that treat the unknown density, regression function or influence function as a nuisance parameter, while the inference concerns something else, are known as semiparametric procedures; they were developed mainly during the past 30 years. On the other hand, if the unknown density, regression function, etc. is itself our main interest, then we try to find its best possible estimate or tests of hypotheses on its shape (goodness-of-fit tests).

Unlike the nonparametric procedures, the robust statistical procedures do not try to behave necessarily well for a broad class of models; rather, they are optimal in some way in a neighborhood of some probability distribution, e.g., the normal one. Starting with the 1940s, statisticians successively observed that even small deviations from the normal distribution could be harmful and can strongly impact the quality of the popular least squares estimator, the classical F-test and other classical methods. Hence, the robust statistical procedures were developed as modifications of the classical procedures that do not fail under small deviations from the assumed conditions. They are optimal, in a specified sense, in a special neighborhood of a fixed probability distribution, defined with respect to a specified distance of probability measures.
As such, the robust procedures are more efficient than the nonparametric ones, which pay for their universality with some loss of efficiency. When we speak of robust statistical procedures, we usually have estimation procedures in mind. There are also robust tests of statistical hypotheses, namely tests of the Wald type based on robust estimators; but whenever possible we recommend using the rank tests instead, mainly for their simplicity and high efficiency.

In the list of various statistical procedures, we cannot omit the adaptive procedures, which tend to the optimal parametric estimator or test with an increasing number of observations, either in probability or almost surely. Hence, these procedures adapt themselves to the pertaining parametric model with an increasing number of observations, which would be highly desirable. However, this convergence is typically very slow, and the optimality is attained only under an unrealistically large number of observations. There also exist partially adaptive procedures that successively tend to the best of a prescribed finite set of possible decisions; the choice of the prescribed finite set naturally determines the success of our inference.

The adaptive, nonparametric, robust and semiparametric methods developed successively, mainly since the 1940s, and they continue to develop, as do the robust procedures in multivariate statistical models. As such, they are not disjoint from each other; there are no sharp boundaries between these classes, and some concepts and aspects appear in all of them. This book, too, though oriented mainly to robust statistical procedures, often touches other statistical methods. Our ultimate goal is to show and demonstrate which alternative procedures to apply when we are not sure of our model.

Mathematically, we consider the robust procedures as statistical functionals, defined on the space of distribution functions, and we are interested in their behavior in a neighborhood of a specific distribution or model. This neighborhood is defined with respect to a specified distance; hence, we should first consider possible distances on the space of distribution functions and the pertaining basic characteristics of statistical functionals, such as their continuity and derivatives. This is the theoretical background of the robust statistical procedures. However, robust procedures were developed as an alternative to practical statistical procedures, and they should be applied to practical problems. Keeping this in mind, we also must pay great attention to the computational aspects, and refer to the computational programs that are available, or provide our own. We hope that the readers will use, with understanding, robust procedures to solve their problems.

CHAPTER 1

Mathematical tools of robustness

1.1 Statistical model

A random experiment leads to some observed values; denote them X1, ..., Xn. To make a formal analysis of the experiment and its results, we include everything in the frame of a statistical model. The classical statistical model assumes that the vector X = (X1, ..., Xn) can attain values in a sample space X (or Xn), and that the subsets of X are the random events of our interest. If X is finite, then there is no problem working with the family of all its subsets. However, some spaces X can be too rich, as, e.g., the n-dimensional Euclidean space; then we do not consider all its subsets, but restrict our considerations only to some properly selected subsets/events.
In order to describe the experiments and the events mathematically, we consider a family of events that creates a σ-field, i.e., that is closed with respect to countable unions and the complements of its elements. Let us denote it B (or Bn). The probabilistic behavior of the random vector X is described by its probability distribution P, which is a set function defined on B. The classical statistical model is a family P = {Pθ, θ ∈ Θ} of probability distributions, to which our specific distribution P also belongs. While we can observe X1, ..., Xn, the parameter θ is unobservable. It is a real number or a vector and can take on any value in the parametric space Θ ⊆ R^p, where p is a positive integer.

The triple {X, B, Pθ : θ ∈ Θ} is the parametric statistical model. The components X1, ..., Xn are often independent copies of a random variable X, but they can also form a segment of a time series. The model is multivariate when the components Xi, i = 1, ..., n are themselves vectors in R^k for some positive integer k.

The type of experiment often fully determines the character of the parametric model. We easily recognize a special discrete probability distribution, such as the Bernoulli (alternative), binomial, multinomial, Poisson and hypergeometric distributions.

Example 1.1 The binomial random variable is the number of successes among n independent Bernoulli trials: the i-th Bernoulli trial can result either in a success with probability p (then we put Xi = 1) or in a failure with probability 1 − p (then we put Xi = 0). In the case of n trials, the binomial random variable X is equal to X = \sum_{i=1}^n X_i and can take all integer values 0, 1, ..., n; specifically,

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k},  k = 0, ..., n,  0 ≤ p ≤ 1.

Example 1.2 Let us have n independent trials, each of them leading to exactly one of k different outcomes, to the i-th one with probability p_i, i = 1, ..., k, \sum_{i=1}^k p_i = 1. Then the i-th component X_i of the multinomial random vector X is the number of trials leading to the outcome i, and

P(X_1 = x_1, ..., X_k = x_k) = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}

for any vector (x_1, ..., x_k) of integers, 0 ≤ x_i ≤ n, i = 1, ..., k, satisfying \sum_{i=1}^k x_i = n.

Example 1.3 The Poisson random variable X is, e.g., the number of clients arriving in the system in a unit interval, the number of electrons emitted from the cathode in a unit interval, etc. Then X can take on all nonnegative integers, and if λ is the intensity of the arrivals, emissions, etc., then

P(X = k) = e^{−λ} \frac{λ^k}{k!},  k = 0, 1, ....

Example 1.4 The hypergeometric random variable X is, e.g., the number of defective items in a sample of size n taken from a finite set of products. If there are M defectives in the set of N products, then

P(X = k) = \frac{\binom{M}{k} \binom{N−M}{n−k}}{\binom{N}{n}}

for all integers k satisfying 0 ≤ k ≤ M and 0 ≤ n − k ≤ N − M; this probability is equal to 0 otherwise.

Among the continuous probability distributions, characterized by densities, we most easily identify the asymmetric distributions concentrated on a halfline. For instance, the waiting time or the duration of a service can be characterized by the gamma distribution.

Example 1.5 The gamma random variable X has the density function (see Figure 1.1)

f(x) = \frac{a^b x^{b−1} e^{−ax}}{Γ(b)} if x ≥ 0,  f(x) = 0 if x < 0,

where a and b are positive constants. The special case b = 1 is the exponential distribution.
Figure 1.1 The density function of the gamma distribution with b = 3 and a = 1.

In technical practice we can find many other similar examples. However, the symmetric distributions with continuous probability densities are hardly distinguishable from each other by simply looking at the data. A similar problem arises with asymmetric distributions extending over the whole real line. We should either test a hypothesis on their shape or, lacking knowledge of the distribution shape, use a robust or nonparametric method of inference.

Most of the statistical procedures elaborated in the past were derived under the normality assumption, i.e., under the condition that the observed data come from a population with the Gaussian/normal distribution. People believed that every symmetric probability distribution described by a density is approximately normal. The procedures based on the normality assumption usually have a simple algebraic structure; thus one is tempted to use them in all situations in which the data can take on symmetrically all real values, forgetting the original normality assumption. For instance, the most popular least squares estimator (LSE) of regression or other parameters, though seemingly universal, is closely connected with the normal distribution of the measurement errors. That itself would not matter, but the LSE fails when even a small fraction of the data comes from another population whose distribution has heavier tails than the normal one, or when the dataset is contaminated by outliers.

At present, these facts can be easily demonstrated numerically with simulated data (see the sketch at the end of this section), while this was not possible before the era of high-speed computers. But these facts are not only verified with computers; the close connection of the least squares and the normal distribution is also supported by strong theoretical arguments, based on characterizations of the normal distribution by means of properties of estimators and other procedures. For instance, Kagan, Linnik, and Rao (1973) proved that the least squares estimator (LSE) of the regression parameter in the linear regression model with a continuous distribution of the errors is admissible with respect to the quadratic risk (i.e., there is no other estimator with uniformly better quadratic risk) if and only if the distribution of the measurement errors is normal.

The Student t-test, the Snedecor F-test and the test of the linear hypothesis were derived under the normality assumption. While the t-test is relatively robust to deviations from normality, the F-test is very sensitive in this sense and should be replaced with a rank test, unless the normal distribution is taken for granted.

If we are not sure of the parametric form of the model, we can use either of the following possible alternative procedures:

a) Nonparametric approach: We give up a parametrization of Pθ by a real or vector parameter θ, and replace the family {Pθ : θ ∈ Θ} with a broader family of probability distributions.

b) Robust approach: We introduce an appropriate measure of distance of probability distributions on the sample space X, and study the stability of the classical procedures, optimal for the model Pθ, under small deviations from this model. At the same time, we try to modify the classical procedures slightly (i.e., to find robust procedures) to reduce their sensitivity.
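The announced demonstration can be as simple as the following minimal sketch, written in the style of the computation notes of Section 1.12. The sample size, the seed, the 5% contamination ratio and the heavy-tailed t-distribution with 2 degrees of freedom are our own arbitrary illustrative choices.

> ### mean vs. median under 5% heavy-tailed contamination of N(0,1)
> set.seed(1)
> m <- 1000; n <- 100; eps <- 0.05
> est <- replicate(m, {
+   k <- rbinom(1, n, eps)              # random number of contaminants
+   x <- c(rnorm(n - k), rt(k, df = 2)) # (1-eps) N(0,1) + eps t_2 sample
+   c(mean(x), median(x))
+ })
> apply(est, 1, var)   # simulated variances of the mean and the median

One can expect the simulated variance of the mean to be inflated by the heavy-tailed fraction, while the median remains almost unaffected, even though under strict normality the mean would be the more efficient of the two.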
1.2 Illustration on statistical estimation

Let X1, ..., Xn be independent observations, identically distributed with some probability distribution Pθ, where θ is an unobservable parameter, θ ∈ Θ ⊆ R^p. Let F(x, θ) be the distribution function of Pθ. Our problem is to estimate the unobservable parameter θ. We have several possibilities, for instance:

(1) the maximum likelihood method;

(2) the moment method;

(3) the method of χ²-minimum, or another method minimizing some distance of the empirical and the true distributions;

(4) methods based on sufficient statistics (Rao-Blackwell Theorem) and on complete sufficient statistics (Lehmann-Scheffé Theorem).

In the context of sufficient statistics, remember the very useful fact in nonparametric models that the ordered sample (the vector of order statistics) X_{n:1} ≤ X_{n:2} ≤ ... ≤ X_{n:n} is a complete sufficient statistic for the family of probability distributions with densities \prod_{i=1}^n f(x_i), where f is an arbitrary continuous density. This corresponds to the model in which the observations create an independent random sample from an arbitrary continuous distribution. If θ is a one-dimensional parameter, thus a real number, we are intuitively led to the class of L-estimators of the type

T_n = \sum_{i=1}^n c_{ni} h(X_{n:i}),

based on order statistics, with suitable coefficients c_{ni}, i = 1, ..., n, and a suitable function h(·).

(5) Minimization of some (criterion) function of the observations and of θ, e.g., the minimization

\sum_{i=1}^n ρ(X_i, θ) := min,  θ ∈ Θ,

with a suitable non-constant function ρ(·, ·). As an example we can consider ρ(x, θ) = −log f(x, θ), leading to the maximum likelihood estimator θ̂n. The estimators of this type are called M-estimators, or estimators of the maximum likelihood type.

(6) An inversion of the rank tests of the shift in location, of the significance of regression, etc., leads to the class of R-estimators, based on the ranks of the observations or of their residuals.

These M-, L- and R-estimators, and some other robust methods, create the main subject of this book.

1.3 Statistical functional

Consider a random variable X with probability distribution Pθ with distribution function F, where Pθ ∈ P = {Pθ : θ ∈ Θ ⊆ R^p}. Then in many cases θ can be looked upon as a functional θ = T(P) defined on P; we can also write θ = T(F). Intuitively, a natural estimator of θ, based on observations X1, ..., Xn, is T(Pn), where Pn is the empirical probability distribution of the vector (X1, ..., Xn), i.e.,

Pn(A) = \frac{1}{n} \sum_{i=1}^n I[X_i ∈ A],  A ∈ B.  (1.1)

In other words, Pn is the uniform distribution on the set {X1, ..., Xn}, because Pn({Xi}) = 1/n, i = 1, ..., n. The distribution function pertaining to Pn is the empirical distribution function

Fn(x) = Pn((−∞, x]) = \frac{1}{n} \sum_{i=1}^n I[X_i ≤ x],  x ∈ R.  (1.2)

Example 1.6

(1) Expected value: T(P) = ∫_R x dP = EX, and

T(Pn) = ∫_R x dPn = X̄n = \frac{1}{n} \sum_{i=1}^n X_i.

(2) Variance: T(P) = var X = ∫_R x² dP − (EX)², and

T(Pn) = \frac{1}{n} \sum_{i=1}^n X_i² − X̄n².

(3) If T(P) = ∫_R h(x) dP, where h is an arbitrary P-integrable function, then an empirical counterpart of T(P) is

T(Pn) = \frac{1}{n} \sum_{i=1}^n h(X_i).

(4) Conversely, we can find a statistical functional corresponding to a given statistical estimator: for instance, the geometric mean of observations X1, ..., Xn is defined as

T(Pn) = Gn = \left( \prod_{i=1}^n X_i \right)^{1/n},  log Gn = \frac{1}{n} \sum_{i=1}^n log X_i = ∫_R log x dPn,

hence the corresponding statistical functional has the form

T(P) = exp{∫_R log x dP}.

Similarly, the harmonic mean T(Pn) = Hn of observations X1, ..., Xn is defined as

1/Hn = \frac{1}{n} \sum_{i=1}^n \frac{1}{X_i},

and the corresponding statistical functional has the form

T(P) = H = \left( ∫_R \frac{1}{x} dP \right)^{−1}.

The empirical counterparts of these functionals are computed in the sketch below.
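The following lines, a small supplement of ours in the spirit of Section 1.12, evaluate the empirical functionals of Example 1.6 on a simulated sample; the gamma sample is an arbitrary positive-valued choice.

> ### empirical functionals of Example 1.6 on a simulated sample
> x <- rgamma(100, 3, 1)
> mean(x)                  # (1) expected value
> mean(x^2) - mean(x)^2    # (2) variance functional (1/n version)
> exp(mean(log(x)))        # (4) geometric mean Gn
> 1/mean(1/x)              # (4) harmonic mean Hn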
Statistical functionals were first considered by von Mises (1947).

The estimator T(Pn) should tend to T(P), as n → ∞, with respect to some type of convergence defined on the space of probability measures. Mostly it is the convergence in probability or in distribution, or the almost sure convergence; but an important characteristic is also the large sample bias of the estimator T(Pn), i.e., lim_{n→∞} |E[T(Pn) − T(P)]|, which corresponds to the convergence in the mean. Because we also need to study the behavior of T(Pn) in a neighborhood of P, we consider an expansion of the functional difference T(Pn) − T(P) of the Taylor type. To do this, we need some concepts of functional analysis, such as various distances of Pn and P and their relations, and the continuity and differentiability of the functional T with respect to the considered distance.

1.4 Fisher consistency

A reasonable statistical estimator should have the natural property of Fisher consistency, introduced by R. A. Fisher (1922). We say that an estimator θ̂n, based on observations X1, ..., Xn with probability distribution P, is a Fisher consistent estimator of parameter θ if, written as a functional θ̂n = T(Pn) of the empirical probability distribution of the vector (X1, ..., Xn), n = 1, 2, ..., it satisfies T(P) = θ. The following example shows that this condition is not always automatically satisfied.

Example 1.7 Let θ = var X = T(P) = ∫_R x² dP − (∫_R x dP)² be the variance of P. Then the sample variance θ̂n = T(Pn) = \frac{1}{n} \sum_{i=1}^n (X_i − X̄n)² is Fisher consistent; but it is biased, because E θ̂n = (1 − 1/n) θ. On the other hand, the unbiased estimator of the variance S_n² = \frac{1}{n−1} \sum_{i=1}^n (X_i − X̄n)² is not a Fisher consistent estimator of θ, because

S_n² = \frac{n}{n−1} T(Pn)  and  \frac{n}{n−1} T(P) ≠ T(P).

From the robustness point of view, the natural property of Fisher consistency of an estimator is more important than its unbiasedness; hence it should be checked first on a statistical functional (see the sketch below).
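A short numerical complement of our own to Example 1.7: both estimators converge to θ, but only T(Pn) is the plug-in value of the variance functional, while S_n² carries the extra factor n/(n − 1). The sample below is an arbitrary choice.

> ### Example 1.7 numerically: T(Pn) versus the unbiased S_n^2
> x <- rnorm(50, 0, 2)          # true variance theta = 4
> n <- length(x)
> TPn <- mean(x^2) - mean(x)^2  # the variance functional at Pn
> Sn2 <- var(x)                 # R's var() divides by n - 1
> c(TPn, Sn2, (n - 1)/n * Sn2)  # the last value reproduces T(Pn) exactly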
1.5 Some distances of probability measures

Let X be a metric space with metric d, separable and complete, and denote B the σ-field of its Borel subsets. Furthermore, let P be the system of all probability measures on the space (X, B). Then P is a convex set, and on P we can introduce various distances of its two elements P, Q ∈ P. Let us briefly describe some such distances, mostly used in mathematical statistics. For those who want to learn more about these and other distances and related topics, we refer to the literature of functional analysis and probability theory, e.g., Billingsley (1998) or Fabian et al. (2001).

(1) The Prohorov distance:

d_P(P, Q) = inf{ε > 0 : P(A) ≤ Q(A^ε) + ε ∀A ∈ B, A ≠ ∅},

where A^ε = {x ∈ X : inf_{y∈A} d(x, y) ≤ ε} is a closed ε-neighborhood of a non-empty set A.

(2) The Lévy distance: X = R is the real line; let F, G be the distribution functions of probability measures P, Q; then

d_L(F, G) = inf{ε > 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε ∀x ∈ R}.

(3) The total variation distance:

d_V(P, Q) = sup_{A∈B} |P(A) − Q(A)|.

We easily verify that 2 d_V(P, Q) = ∫_X |dP − dQ|.

(4) The Kolmogorov distance: X = R is the real line and F, G are the distribution functions of probability measures P, Q; then

d_K(F, G) = sup_{x∈R} |F(x) − G(x)|.

(5) The Hellinger distance:

d_H(P, Q) = \left\{ ∫_X (\sqrt{dP} − \sqrt{dQ})² \right\}^{1/2}.

If f = dP/dµ and g = dQ/dµ are densities of P, Q with respect to some measure µ, then the Hellinger distance can be rewritten in the form

(d_H(P, Q))² = ∫_X (\sqrt{f} − \sqrt{g})² dµ = 2 \left( 1 − ∫_X \sqrt{fg} dµ \right).

(6) The Lipschitz distance: Assume that d(x, y) ≤ 1 ∀x, y ∈ X (otherwise we take the metric d′ = d/(1 + d)); then

d_{Li}(P, Q) = sup_{ψ∈L} \left| ∫_X ψ dP − ∫_X ψ dQ \right|,

where L = {ψ : X → R : |ψ(x) − ψ(y)| ≤ d(x, y)} is the set of Lipschitz functions.

(7) The Kullback-Leibler divergence: Let p, q be the densities of probability distributions P, Q with respect to a measure µ (the Lebesgue measure on the real line or the counting measure); then

d_{KL}(Q, P) = ∫ q(x) ln \frac{q(x)}{p(x)} dµ(x).

The Kullback-Leibler divergence is not a metric, because it is not symmetric in P, Q and does not satisfy the triangle inequality.

More on distances of probability measures can be found in Gibbs and Su (2002), Liese and Vajda (1987), Rachev (1991), Reiss (1989) and Zolotarev (1983), among others.

1.6 Relations between distances

The family P of all probability measures on (X, B) is a metric space with respect to each of the distances described above. On this metric space we can study the continuity and other properties of the statistical functional T(P). Because we are interested in the behavior of the functional not only at the distribution P, but also in its neighborhood, we come to the question of which distance is more sensitive to small deviations from P. The following inequalities between the distances show not only which distance eventually dominates the others, but also illustrate their relations. Their verification we leave as an exercise:

d_H²(P, Q) ≤ 2 d_V(P, Q) ≤ 2 d_H(P, Q),
d_P²(P, Q) ≤ d_{Li}(P, Q) ≤ 2 d_P(P, Q)  ∀ P, Q ∈ P,  (1.3)
\frac{1}{2} d_V²(P, Q) ≤ d_{KL}(P, Q).

If X = R, then it further holds:

d_L(P, Q) ≤ d_P(P, Q) ≤ d_V(P, Q),
d_L(P, Q) ≤ d_K(P, Q) ≤ d_V(P, Q)  ∀ P, Q ∈ P.  (1.4)

Example 1.8 Let P be the exponential distribution with density f(x) = e^{−x} for x ≥ 0 and f(x) = 0 for x < 0, and let Q be the uniform distribution R(0, 1) with density g(x) = 1 for 0 ≤ x ≤ 1 and g(x) = 0 otherwise. Then

2 d_V(P, Q) = ∫_0^1 (1 − e^{−x}) dx + ∫_1^∞ e^{−x} dx = 1 + \frac{1}{e} − 1 + \frac{1}{e} = \frac{2}{e},

hence d_V(exp, R(0, 1)) ≈ 0.3679. Furthermore,

d_K(P, Q) = sup_{x≥0} |1 − e^{−x} − x I[0 ≤ x ≤ 1] − I[x > 1]| = e^{−1} ≈ 0.3679,

and

d_H²(exp, R(0, 1)) = 2 \left( 1 − ∫_0^1 \sqrt{e^{−x}} dx \right) = 2 \left( \frac{2}{\sqrt{e}} − 1 \right),

thus d_H(exp, R(0, 1)) ≈ 0.6528. Finally,

d_{KL}(R(0, 1), exp) = ∫_0^1 ln \frac{1}{e^{−x}} dx = \frac{1}{2}.

These values are verified numerically in the sketch below.
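The inequalities (1.3) and (1.4) can be checked numerically for the pair of distributions in Example 1.8. The following lines are our own illustration, anticipating the R function integrate used in Section 1.12.

> ### numerical check of (1.3) and (1.4) for Example 1.8
> dV <- (integrate(function(x) 1 - exp(-x), 0, 1)$value +
+        integrate(function(x) exp(-x), 1, Inf)$value)/2
> dK <- exp(-1)                   # derived in Example 1.8
> dH <- sqrt(2*(1 - integrate(function(x) exp(-x/2), 0, 1)$value))
> dKL <- 0.5                      # d_KL(R(0,1), exp)
> c(dV, dK, dH)                   # approx. 0.3679, 0.3679, 0.6528
> dH^2 <= 2*dV                    # TRUE, first part of (1.3)
> 2*dV <= 2*dH                    # TRUE, second part of (1.3)
> dV^2/2 <= dKL                   # TRUE, third line of (1.3)
> ### here dK equals dV: the extreme case of dK <= dV in (1.4)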
1.7 Differentiable statistical functionals

Let again P be the family of all probability measures on the space (X, B, µ), and assume that X is a complete separable metric space with metric d and that B is the system of the Borel subsets of X. Choose some distance δ on P and consider the statistical functional T(·) defined on P. If we want to analyze an expansion of T(·) around P, analogous to the Taylor expansion, we must introduce the concept of a derivative of a statistical functional. There are several possible definitions of the derivative; we shall consider three of them, the Gâteau, the Fréchet and the Hadamard derivative, and compare their properties from the statistical point of view.

Definition 1.1 Let P, Q ∈ P and let t ∈ [0, 1]. Then the probability distribution

P_t(Q) = (1 − t)P + tQ  (1.5)

is called the contamination of P by Q in ratio t.

Remark 1.1 P_t(Q) is a probability distribution, because P is convex. P_0(Q) = P means an absence of contamination, while P_1(Q) = Q means full contamination.

1.8 Gâteau derivative

Fix two distributions P, Q ∈ P and denote ϕ(t) = T((1 − t)P + tQ), 0 ≤ t ≤ 1. Suppose that the function ϕ(t) has the finite n-th derivative ϕ^{(n)}, that the derivatives ϕ^{(k)} are continuous in the interval (0, 1), and that the right-hand derivatives ϕ^{(k)}_+ are right-continuous at t = 0, k = 1, ..., n − 1. Then we can consider the Taylor expansion around u ∈ (0, 1):

ϕ(t) = ϕ(u) + \sum_{k=1}^{n−1} \frac{ϕ^{(k)}(u)}{k!} (t − u)^k + \frac{ϕ^{(n)}(v)}{n!} (t − u)^n,  v ∈ [u, t].  (1.6)

We are mostly interested in the expansion to the right of u = 0, which corresponds to a small contamination of P. For that we replace the derivatives ϕ^{(k)}(0) with the right-hand derivatives ϕ^{(k)}_+(0). The derivative ϕ′_+(0) is called the Gâteau derivative of the functional T in P in direction Q.

Definition 1.2 We say that the functional T is differentiable in the Gâteau sense in P in direction Q if there exists the limit

T′_Q(P) = lim_{t→0+} \frac{T(P + t(Q − P)) − T(P)}{t}.  (1.7)

T′_Q(P) is called the Gâteau derivative of T in P in direction Q.

Remark 1.2

a) The Gâteau derivative T′_Q(P) of the functional T is equal to the ordinary right derivative of the function ϕ at the point 0, i.e., T′_Q(P) = ϕ′(0+).

b) Similarly defined is the Gâteau derivative of order k:

T^{(k)}_Q(P) = \left[ \frac{d^k}{dt^k} T(P + t(Q − P)) \right]_{t=0+} = ϕ^{(k)}(0+).

c) In the special case when Q is the Dirac probability measure Q = δ_x, assigning probability 1 to the one-point set {x}, we shall use the simpler notation T′_{δx}(P) = T′_x(P).

In the special case t = 1, u = 0, the Taylor expansion (1.6) reduces to the form

T(Q) − T(P) = \sum_{k=1}^{n−1} \frac{T^{(k)}_Q(P)}{k!} + \frac{1}{n!} \left[ \frac{d^n}{dt^n} T(P + t(Q − P)) \right]_{t=t^*},  (1.8)

where 0 ≤ t^* ≤ 1.

Example 1.9

(a) Expected value: T(P) = ∫_X x dP = E_P X. Then

ϕ(t) = ∫_X x d((1 − t)P + tQ) = (1 − t)E_P X + t E_Q X
⟹ ϕ′(t) = E_Q X − E_P X,
T′_Q(P) = ϕ′(0+) = E_Q X − E_P X.

Finally we obtain for Q = δ_x:

T′_x(P) = x − E_P X.

(b) Variance: T(P) = var_P X = E_P(X²) − (E_P X)². Then

T((1 − t)P + tQ) = ∫_X x² d((1 − t)P + tQ) − \left[ ∫_X x d((1 − t)P + tQ) \right]²,

hence

ϕ(t) = (1 − t)E_P X² + t E_Q X² − (1 − t)²(E_P X)² − t²(E_Q X)² − 2t(1 − t)E_P X · E_Q X,
ϕ′(t) = −E_P X² + E_Q X² + 2(1 − t)(E_P X)² − 2t(E_Q X)² − 2(1 − 2t)E_P X · E_Q X.

This further implies

lim_{t→0+} ϕ′(t) = T′_Q(P) = E_Q X² − E_P X² − 2 E_P X · E_Q X + 2(E_P X)²,

and finally we obtain for Q = δ_x:

T′_x(P) = x² − E_P X² − 2x E_P X + 2(E_P X)² = (x − E_P X)² − var_P X.

A numerical check of the limit (1.7) for this functional is sketched below.
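The limit (1.7) can be approximated numerically. The following sketch of ours contaminates P = N(0, 1) by δ_x and compares the difference quotient for the variance functional with the value (x − E_P X)² − var_P X from Example 1.9(b); the closed-form moments of the contaminated mixture give ϕ(t), and the point x = 2 and step t = 10^{-6} are arbitrary choices.

> ### numerical Gateau derivative (1.7) of the variance functional
> x <- 2; t <- 1e-6
> phi <- function(t) (1 - t) + t*x^2 - (t*x)^2  # var of (1-t)P + t*delta_x
> (phi(t) - phi(0))/t            # difference quotient: approx. 3
> (x - 0)^2 - 1                  # IF from Example 1.9(b): exactly 3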
1.9 Fréchet derivative

Definition 1.3 We say that the functional T is differentiable in P in the Fréchet sense if there exists a linear functional L_P(Q − P) such that

lim_{t→0} \frac{T(P + t(Q − P)) − T(P)}{t} = L_P(Q − P)  (1.9)

uniformly in Q ∈ P, δ(P, Q) ≤ C, for any fixed C ∈ (0, ∞). The linear functional L_P(Q − P) is called the Fréchet derivative of the functional T in P in direction Q.

Remark 1.3

a) Because L_P is a linear functional, there exists a function g : X → R such that

L_P(Q − P) = ∫_X g d(Q − P).  (1.10)

b) If T is differentiable in the Fréchet sense, then it is differentiable in the Gâteau sense, too; i.e., there exists T′_Q(P) ∀Q ∈ P, and it holds that

T′_Q(P) = L_P(Q − P)  ∀Q ∈ P.  (1.11)

Especially,

T′_x(P) = L_P(δ_x − P) = g(x) − ∫_X g dP,  (1.12)

and this further implies

E_P(T′_x(P)) = ∫_X T′_x(P) dP = 0.  (1.13)

c) Let Pn be the empirical probability distribution of the vector (X1, ..., Xn). Then Pn − P = \frac{1}{n} \sum_{i=1}^n (δ_{Xi} − P). Hence, because L_P is a linear functional,

L_P(Pn − P) = \frac{1}{n} \sum_{i=1}^n L_P(δ_{Xi} − P) = \frac{1}{n} \sum_{i=1}^n T′_{Xi}(P) = T′_{Pn}(P).  (1.14)

Proof of (1.11): Actually, because L_P(·) is a linear functional, we get by (1.9)

T′_Q(P) = lim_{t→0+} \frac{T(P + t(Q − P)) − T(P)}{t}
= lim_{t→0+} \left\{ \frac{T(P + t(Q − P)) − T(P)}{t} − L_P(Q − P) \right\} + L_P(Q − P)
= 0 + L_P(Q − P) = L_P(Q − P). □

1.10 Hadamard (compact) derivative

If there exists a linear functional L(Q − P) such that the convergence (1.9) is uniform, not necessarily for bounded subsets of the metric space (P, δ) containing P (i.e., for all Q satisfying δ(P, Q) ≤ C, 0 < C < ∞), but only for Q from any fixed compact set K ⊂ P containing P, then we say that the functional T is differentiable in the Hadamard sense, and we call the functional L(Q − P) the Hadamard (compact) derivative of T.

The Fréchet differentiable functional is obviously also Hadamard differentiable, and it is, in turn, also Gâteau differentiable, similarly as in Remark 1.3. We refer to Fernholz (1983) and to Fabian et al. (2001) for more properties of differentiable functionals.

The Fréchet differentiability imposes rather restrictive conditions on the functional, which are not satisfied namely by the robust functionals. On the other hand, when we have a Fréchet differentiable functional, we can easily derive the large sample (normal) distribution of its empirical counterpart when the number n of observations infinitely increases. If the functional is not sufficiently smooth, we can sometimes derive the large sample normal distribution of its empirical counterpart with the aid of the Hadamard derivative. If we only want to prove that T(Pn) is a consistent estimator of T(P), then it suffices to consider the continuity of T(P). The Gâteau derivative T′_x(P), called the influence function of the functional T, is one of the most important characteristics of its robustness and will be studied in detail in Chapter 2.
1.11 Large sample distribution of the empirical functional

Consider again the metric space (P, δ) of all probability distributions on (X, B), with a metric δ satisfying

\sqrt{n} δ(Pn, P) = O_p(1) as n → ∞,  (1.15)

where Pn is the empirical probability distribution of the random sample (X1, ..., Xn), n = 1, 2, .... The convergence (1.15) holds, e.g., for the Kolmogorov distance of the empirical distribution function from the true one, which is the most important for statistical applications; but it holds also for other distances.

As an illustration of the use of the functional derivatives, let us show that the Fréchet differentiability, together with the classical central limit theorem, always gives the large sample (asymptotic) distribution of the empirical functional T(Pn).

Theorem 1.1 Let T be a statistical functional, Fréchet differentiable in P, and assume that the empirical probability distribution Pn of the random sample (X1, ..., Xn) satisfies condition (1.15) as n → ∞. If the variance of the Gâteau derivative T′_{X1}(P) is positive, var_P T′_{X1}(P) > 0, then the sequence \sqrt{n}(T(Pn) − T(P)) is asymptotically normally distributed as n → ∞, namely

L(\sqrt{n}(T(Pn) − T(P))) → N(0, var_P T′_{X1}(P)).  (1.16)

Proof. By (1.14), T′_{Pn}(P) = \frac{1}{n} \sum_{i=1}^n T′_{Xi}(P), and further by (1.8) and condition (1.15) we obtain

\sqrt{n}(T(Pn) − T(P)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n T′_{Xi}(P) + Rn
= \frac{1}{\sqrt{n}} \sum_{i=1}^n L_P(δ_{Xi} − P) + \sqrt{n} o(δ(Pn, P))  (1.17)
= \frac{1}{\sqrt{n}} \sum_{i=1}^n T′_{Xi}(P) + o_p(1).

If the common variance var_P T′_{Xi}(P) = var_P T′_{X1}(P), i = 1, ..., n, is finite, then (1.16) follows from (1.17) and from the classical central limit theorem. □

Example 1.10 Let T(P) = var_P X = σ²; then

T(Pn) = S_n² = \frac{1}{n} \sum_{i=1}^n (X_i − X̄n)²

and, by Example 1.9(b), T′_x(P) = (x − E_P X)² − var_P X, hence

var_P T′_X(P) = E_P(X − E_P X)^4 − [E_P(X − E_P X)²]² = µ4 − µ2²,

and by Theorem 1.1 we get the large sample distribution of the sample variance:

L(\sqrt{n}(S_n² − σ²)) → N(0, µ4 − µ2²).

A simulation check of this approximation is given at the end of Section 1.12.

1.12 Computation and software notes

We chose R (a language and environment for statistical computing and graphics) as a suitable tool for numerical illustrations and computation. R seems to us to be one of the best statistical environments available. It is also free, and the R project is an open source project: the code you are using is always available for you to view. Detailed information about R and the R project is available from http://www.r-project.org/.

Chapter 1 focuses mainly on the theoretical background. However, Section 1.1 mentioned some specific distribution shapes. The R system has built-in functions to compute the density, distribution function and quantile function for many standard distributions, including the ones mentioned in Section 1.1 (see Table 1.1).

Table 1.1 R function names and parameters for selected probability distributions.

Distribution      R name   Parameters
binomial          binom    size, prob
exponential       exp      rate
gamma             gamma    shape, scale
hypergeometric    hyper    m, n, k
normal            norm     mean, sd
Poisson           pois     lambda
uniform           unif     min, max

The first letter of the name of the R function indicates the function: dXXXX, pXXXX and qXXXX are, respectively, the density, distribution and quantile functions. The first argument of the function is the quantile q for the density and distribution functions, and the probability p for the quantile functions. Additional arguments specify the parameters. These functions can be used as statistical tables. Here are some examples:

• P(X = 3) for the binomial distribution with n = 20, p = 0.1:

> dbinom(3,20,0.1)
[1] 0.1901199

• P(X ≤ 5) for the Poisson distribution with λ = 2:

> ppois(5,2)
[1] 0.9834364

• the 95% quantile of the standard normal distribution:

> qnorm(0.95)
[1] 1.644854

• the density function of the gamma distribution with b = 3 and a = 1 (see Figure 1.1):

> plot(seq(0,8,by=0.01),dgamma(seq(0,8,by=0.01),3,1),
+ type="l", ylab="",xlab="")

The R system enables the generation of random data. The corresponding functions have the prefix r and first argument n, the size of the sample required. For example, we can generate a sample of size 1000 from the gamma distribution (b = 3, a = 1) by

> rgamma(1000,3,1)

R has the function hist to plot histograms:

> hist(rgamma(1000,3,1), prob=TRUE,nclass=16)

We obtain Figure 1.2; compare it with Figure 1.1.
Figure 1.2 The histogram of the simulated sample from the gamma distribution with b = 3 and a = 1.

In Section 1.5, some distances of probability measures were introduced and illustrated in Example 1.8. There we needed to compute integrals. We can use the R function integrate to solve that problem. Compare the following example with the calculation in Example 1.8:

> integrate(function(x) {1-exp(-x)},0,1)
0.3678794 with absolute error < 4.1e-15
> integrate(function(x) {exp(-x)},1, Inf)
0.3678794 with absolute error < 2.1e-05

R can also help us in the case of a discrete distribution. Let P be the binomial distribution with parameters n = 100, p = 0.01. Let Q be the Poisson distribution with parameter λ = np = 1. Then

2 d_V(P, Q) = \sum_{k=0}^{100} \left| \binom{100}{k} 0.01^k 0.99^{100−k} − \frac{e^{−1}}{k!} \right| + \sum_{k=101}^{∞} \frac{e^{−1}}{k!}

> sum(abs(choose(100,0:100)*0.01^(0:100)*(0.99)^(100:0)
+ -exp(-1)/factorial(0:100)))+1-sum(exp(-1)/factorial(0:100))
[1] 0.005550589
> ## or also
> sum(abs(dbinom(0:100,100,0.01)-dpois(0:100,1)))+1-ppois(100,1)
[1] 0.005550589

Thus d_V(Bi(100, 0.01), Po(1)) ≈ 0.0028. Similarly,

d_K(Bi(100, 0.01), Po(1)) ≈ 0.0018,
d_H(Bi(100, 0.01), Po(1)) ≈ 0.0036,
d_{KL}(Bi(100, 0.01), Po(1)) ≈ 0.000025,

because

> ### Kolmogorov distance
> max(abs(pbinom(0:100,100,0.01)-ppois(0:100,1)))
[1] 0.0018471
> ### Hellinger distance
> sqrt(sum((sqrt(dbinom(0:100,100,0.01))
+ -sqrt(dpois(0:100,1)))^2))
[1] 0.003562329
> ### Kullback-Leibler divergence (Q,P)
> sum(dpois(0:100,1)*log(dpois(0:100,1)/
+ dbinom(0:100,100,0.01)))
[1] 2.551112e-05
> ### Kullback-Leibler divergence (P,Q)
> sum(dbinom(0:100,100,0.01)*log(dbinom(0:100,100,0.01)/
+ dpois(0:100,1)))
[1] 2.525253e-05
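As a further illustration, we can check the large sample approximation of Example 1.10 by simulation. For the exponential distribution with rate 1 one has σ² = 1 and µ4 − µ2² = 9 − 1 = 8, so \sqrt{n}(S_n² − σ²) should be approximately N(0, 8); the numbers of replications and observations below are our own arbitrary choices.

> ### Example 1.10 by simulation: exp(1) data, limiting N(0, 8)
> set.seed(2)
> n <- 1000; m <- 2000
> z <- replicate(m, {x <- rexp(n); sqrt(n)*(mean(x^2) - mean(x)^2 - 1)})
> var(z)                                 # should be close to 8
> hist(z, prob=TRUE, nclass=30)
> curve(dnorm(x, 0, sqrt(8)), add=TRUE)  # limiting normal density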
1.13 Problems and complements

1.1 Let Q be the binomial distribution with parameters n, p and let P be the Poisson distribution with parameter λ = np. Then

\frac{1}{2} (d_V(Q, P))² ≤ d_{KL}(Q, P),
\frac{p²}{4} ≤ d_{KL}(Q, P) ≤ \left( \frac{1}{4} + \frac{np³}{3} + \frac{p}{2} + \frac{1}{4n} \right) p²,
\frac{1}{16} min(p, np²) ≤ d_V(Q, P) ≤ 2p(1 − e^{−np}),
d_{KL}(Q, P) ≤ \frac{p²}{2(1 − p)},
lim_{n→∞} n² d_{KL}(Q, P) = \frac{λ²}{4}.

See Barbour and Hall (1984), Csiszár (1967), Harremoës and Ruzankin (2004), Kontoyannis et al. (2005) and Pinsker (1960) for demonstrations.

1.2 Wasserstein-Kantorovich distance of distribution functions F, G of random variables X, Y:

• L1-distance on F1 = {F : ∫_{−∞}^{∞} |x| dF(x) < ∞}:

d_W^{(1)}(F, G) = ∫_0^1 |F^{−1}(t) − G^{−1}(t)| dt.

Show that d_W^{(1)}(F, G) = ∫_{−∞}^{∞} |F(x) − G(x)| dx (Dobrushin (1970)). Show that d_W^{(1)}(F, G) = inf{E|X − Y|}, where the infimum is over all jointly distributed X and Y with respective marginals F and G.

• L2-distance on F2 = {F : ∫_{−∞}^{∞} x² dF(x) < ∞}:

d_W^{(2)}(F, G) = ∫_0^1 [F^{−1}(t) − G^{−1}(t)]² dt.

Show that d_W^{(2)}(F, G) = inf{E(X − Y)²}, where the infimum is over all jointly distributed X and Y with respective marginals F and G (Mallows (1972)).

• Weighted L1-distance:

d_W^{(1)}(F, G) = ∫_0^1 |F^{−1}(t) − G^{−1}(t)| w(t) dt,  ∫_0^1 w(t) dt = 1.

1.3 Show that d_P(F, G) ≤ d_W^{(1)}(F, G) (Dobrushin (1970)).

1.4 Let (X1, ..., Xn) and (Y1, ..., Yn) be two independent random samples from distribution functions F, G such that ∫_{−∞}^{∞} x dF(x) = ∫_{−∞}^{∞} y dG(y) = 0. Let Fn, Gn be the distribution functions of n^{−1/2} \sum_{i=1}^n X_i and n^{−1/2} \sum_{i=1}^n Y_i, respectively; then d_W^{(2)}(Fn, Gn) ≤ d_W^{(2)}(F, G) (Mallows (1972)).

1.5 χ²-distance: Let p, q be the densities of probability distributions P, Q with respect to a measure µ (µ can be a counting measure). Then d_{χ²}(P, Q) is defined as

d_{χ²}(P, Q) = ∫_{x∈X: p(x),q(x)>0} \frac{(p(x) − q(x))²}{q(x)} dµ(x).

Then 0 ≤ d_{χ²}(P, Q) ≤ ∞, and d_{χ²} is independent of the choice of the dominating measure. It is not a metric, because it is not symmetric in P, Q. The distance d_{χ²} dates back to Pearson in the 1930s and has many applications in statistical inference. The following relations hold between d_{χ²} and other distances:

(i) d_H(P, Q) ≤ \sqrt{2} (d_{χ²}(P, Q))^{1/4}

(ii) If the sample space X is countable, then d_V(P, Q) ≤ \frac{1}{2} \sqrt{d_{χ²}(P, Q)}

(iii) d_{KL}(P, Q) ≤ d_{χ²}(P, Q)

1.6 Let P be the exponential distribution and let Q be the uniform distribution (see Example 1.8). Then

d_{χ²}(Q, P) = ∫_0^1 \frac{(1 − e^{−x})²}{e^{−x}} dx = e − e^{−1} − 2,

hence d_{χ²}(R(0, 1), exp) ≈ 0.350402. Furthermore,

d_{χ²}(P, Q) = ∫_0^1 (e^{−x} − 1)² dx = −\frac{1}{2} e^{−2} + 2e^{−1} − \frac{1}{2},

hence d_{χ²}(exp, R(0, 1)) ≈ 0.168091.

1.7 Bhattacharyya distance: Let p, q be the densities of probability distributions P, Q with respect to a measure µ. Then d_B(P, Q) is defined as

d_B(P, Q) = log \left( ∫_{x∈X: p(x),q(x)>0} \sqrt{p(x)} \sqrt{q(x)} dµ(x) \right)^{−1}

(Bhattacharyya (1943)). Furthermore, for a comparison,

d_B(exp, R(0, 1)) = log \left( ∫_0^1 \sqrt{e^{−x}} dx \right)^{−1} = −log \left( 2 − \frac{2}{\sqrt{e}} \right) ≈ 0.239605.

1.8 Verify that 2 d_V(P, Q) = ∫_X |dP − dQ|.

1.9 Check the inequalities (1.3).

1.10 Check the inequalities (1.4).

1.11 Compute the Wasserstein-Kantorovich distances d_W^{(1)}(F, G) and d_W^{(2)}(F, G) for the exponential distribution and the uniform distribution (as in Example 1.8).

CHAPTER 2

Basic characteristics of robustness

2.1 Influence function

The expansion (1.17) of the difference T(Pn) − T(P) says that

T(Pn) − T(P) = \frac{1}{n} \sum_{i=1}^n T′_{Xi}(P) + n^{−1/2} Rn,  (2.1)

where the remainder term is asymptotically negligible, n^{−1/2} Rn = o_p(n^{−1/2}) as n → ∞. Then we can consider \frac{1}{n} \sum_{i=1}^n T′_{Xi}(P) as an error of the estimation of T(P) by T(Pn), and T′_{Xi}(P) as the contribution of X_i to this error, or as an influence of X_i on this error. From this point of view, it is natural to interpret the Gâteau derivative T′_x(P), x ∈ X, as an influence function of the functional T(P).

Definition 2.1 The Gâteau derivative of the functional T in distribution P in the direction of the Dirac distribution δ_x, x ∈ X, is called the influence function of T in P; thus

IF(x; T, P) = T′_x(P) = lim_{t→0+} \frac{T(P_t(δ_x)) − T(P)}{t},  (2.2)

where P_t(δ_x) = (1 − t)P + tδ_x.

As the first main properties of IF, let us mention:

a) E_P(IF(x; T, P)) = ∫_X T′_x(P) dP = 0; hence the average influence of all points x on the estimation error is zero.

b) If T is a Fréchet differentiable functional satisfying condition (1.15), and var_P(IF(x; T, P)) = E_P(IF(x; T, P))² > 0, then

L(\sqrt{n}(T(Pn) − T(P))) → N(0, var_P(IF(x; T, P))).

Example 2.1

(a) Expected value: T(P) = E_P(X) = m_P; then T(Pn) = X̄n and

IF(x; T, P) = T′_x(P) = x − m_P,
E_P(IF(x; T, P)) = 0,
var_P(IF(x; T, P)) = var_P X = σ_P²,
E_Q(IF(x; T, P)) = m_Q − m_P for Q ≠ P,
L(\sqrt{n}(X̄n − m_P)) → N(0, σ_P²),

provided P is the true probability distribution of the random sample (X1, ..., Xn).

(b) Variance: T(P) = var_P X = σ_P²; then

IF(x; T, P) = (x − m_P)² − σ_P²,
E_P(IF(x; T, P)) = 0,
var_P(IF(x; T, P)) = µ4 − µ2² = µ4 − σ_P^4,
E_Q(IF(x; T, P)) = E_Q(X − m_P)² − σ_P²
= σ_Q² + (m_Q − m_P)² + 2E_Q(X − m_Q)(m_Q − m_P) − σ_P²
= σ_Q² − σ_P² + (m_Q − m_P)².

The expansion (2.1) can also be observed empirically; see the sketch below.
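The following simulation, our own addition, illustrates the expansion (2.1) for the variance functional of Example 2.1(b): the estimation error T(Pn) − T(P) is tracked by the average of the influence values; the seed and sample size are arbitrary.

> ### expansion (2.1) for the variance functional, P = N(0,1)
> set.seed(3)
> n <- 200; x <- rnorm(n)
> mean(x^2) - mean(x)^2 - 1       # T(Pn) - T(P)
> mean((x - 0)^2 - 1)             # (1/n) sum of IF(X_i; T, P)
> ### the two values differ only by -mean(x)^2, of order O_p(1/n)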
2.2 Discretized form of the influence function

Let (X1, ..., Xn) be the vector of observations and denote Tn = T(Pn) = Tn(X1, ..., Xn) its empirical functional. Consider what happens if we add another observation Y to X1, ..., Xn. The influence of Y on Tn is characterized by the difference

Tn+1(X1, ..., Xn, Y) − Tn(X1, ..., Xn) := I(Tn, Y).  (2.3)

Because

Pn = \frac{1}{n} \sum_{i=1}^n δ_{Xi},
Pn+1 = \frac{1}{n+1} \left( \sum_{i=1}^n δ_{Xi} + δ_Y \right) = \frac{n}{n+1} Pn + \frac{1}{n+1} δ_Y = \left( 1 − \frac{1}{n+1} \right) Pn + \frac{1}{n+1} δ_Y,

we can say that Pn+1 arose from Pn by its contamination by the one-point distribution δ_Y in ratio 1/(n+1); hence

I(Tn, Y) = T \left[ \left( 1 − \frac{1}{n+1} \right) Pn + \frac{1}{n+1} δ_Y \right] − T(Pn).

Because

lim_{n→∞} (n+1) I(Tn, Y) = lim_{n→∞} \frac{T[(1 − \frac{1}{n+1}) Pn + \frac{1}{n+1} δ_Y] − T(Pn)}{\frac{1}{n+1}} = IF(Y; T, P),  (2.4)

(n+1) I(Tn, Y) can be considered a discretized form of the influence function. The supremum of |I(Tn, Y)| over Y then represents a measure of sensitivity of the empirical functional Tn with respect to an additional observation, under fixed X1, ..., Xn.

Definition 2.2 The number

S(Tn) = sup_Y |I(Tn(X1, ..., Xn), Y)|  (2.5)

is called the sensitivity of the functional Tn(X1, ..., Xn) to an additional observation.

Example 2.2

(a) Expected value: T(P) = E_P X, Tn = X̄n, Tn+1 = X̄n+1. Then

Tn+1 = \frac{1}{n+1} (n X̄n + Y),
I(Tn, Y) = \left( \frac{n}{n+1} − 1 \right) X̄n + \frac{1}{n+1} Y = \frac{1}{n+1} (Y − X̄n),
(n+1) I(Tn, Y) = Y − X̄n → Y − E_P X in probability as n → ∞,
S(X̄n) = \frac{1}{n+1} sup_Y |Y − X̄n| = ∞.

Thus the sample mean has an infinite sensitivity to an additional observation.

(b) Median: Let n = 2m + 1 and let X(1) ≤ ... ≤ X(n) be the observations ordered in increasing magnitude. Then Tn = Tn(X1, ..., Xn) = X(m+1), and Tn+1 = Tn+1(X1, ..., Xn, Y) takes on the following values, depending on the position of Y among the other observations:

Tn+1 = (X(m) + X(m+1))/2 if Y ≤ X(m),
Tn+1 = (X(m+1) + X(m+2))/2 if Y ≥ X(m+2),
Tn+1 = (Y + X(m+1))/2 if X(m) ≤ Y ≤ X(m+2).

Hence, the influence of adding Y to X1, ..., Xn on the median is measured by

I(Tn, Y) = (X(m) − X(m+1))/2 if Y ≤ X(m),
I(Tn, Y) = (X(m+2) − X(m+1))/2 if Y ≥ X(m+2),
I(Tn, Y) = (Y − X(m+1))/2 if X(m) ≤ Y ≤ X(m+2).

Among the three possible values of |I(Tn, Y)|, the value |(Y − X(m+1))/2| is the smallest; thus the sensitivity of the median to an additional observation is equal to

S(Tn) = max \left\{ \frac{1}{2} (X(m+1) − X(m)), \frac{1}{2} (X(m+2) − X(m+1)) \right\},

and it is finite under any fixed X1, ..., Xn. Both sensitivities are visualized in the sketch below.
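The difference (2.3) is easy to plot. The following sketch, our own supplement, draws the discretized influence curves (n+1) I(Tn, Y) of the sample mean and the sample median over a grid of added values Y; the sample size and the grid are arbitrary choices.

> ### sensitivity curves (n+1)*I(Tn, Y) of the mean and the median
> set.seed(4)
> n <- 21; x <- rnorm(n)
> Y <- seq(-10, 10, by = 0.1)
> SC <- function(Tfun) sapply(Y, function(y)
+   (n + 1)*(Tfun(c(x, y)) - Tfun(x)))
> plot(Y, SC(mean), type = "l")   # straight line Y - mean(x): unbounded
> lines(Y, SC(median), lty = 2)   # flattens outside the central observations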
2.3 Qualitative robustness

As we have seen in Example 2.1, the influence functions of the expectation and the variance are unbounded and can assume arbitrarily large values. Moreover, Example 2.2 shows that adding one more observation can cause a breakdown of the sample mean. The least squares estimator (LSE) behaves analogously (in fact, the mean is a special form of the least squares estimator). Remember the Kagan, Linnik and Rao theorem, mentioned in Section 1.1, which illustrates the large sensitivity of the LSE to deviations from the normal distribution of errors. Intuitively it means that the least squares estimator (and the mean) are very non-robust.

How can we mathematically express this intuitive non-robustness property, and how shall we define the concept of robustness? Historically, this concept developed over a rather long period, since many statisticians observed a sensitivity of statistical procedures to deviations from assumed models and analyzed it from various points of view. It is interesting that the physicists and astronomers, who tried to determine the values of various physical, geophysical and astronomic parameters by means of an average of several measurements, were the first to notice the sensitivity of the mean and the variance to outlying observations. This interesting part of statistical history is nicely described in the book by Stigler (1986). The history goes back to 1757, when R. J. Boskovich, analyzing his experiments aimed at a characterization of the shape of the globe, proposed an estimation method alternative to the least squares. E. S. Pearson noticed the sensitivity of the classical analysis of variance procedures to deviations from normality in 1931. J. W. Tukey and his Princeton group started a systematic study of possible alternatives to the least squares in the 1940s. The name "robust" was first used by Box in 1953. Box and Anderson (1955) characterized as robust a statistical procedure that is little sensitive to changes of the nuisance or unimportant parameters, while it is sensitive (efficient) with respect to its parameter of interest.

When we speak about the robustness of a statistical procedure, we usually mean its robustness with respect to deviations from the assumed distribution of errors. However, other types of robustness are also important, such as robustness with respect to the assumed independence of observations, an assumption that is often violated in practice.

The first mathematical definition of robustness was formulated by Hampel (1968, 1971), who based the concept of robustness of a statistical functional on its continuity in a neighborhood of the considered probability distribution. The continuity and the neighborhood were considered with respect to the Prohorov metric on the space P.

Let a random variable (or random vector) X take on values in the sample space (X, B); denote P its probability distribution. We shall try to characterize mathematically the robustness of the functional T(P) = T(X). This functional is estimated with the aid of observations X1, ..., Xn that are independent copies of X. More precisely, we estimate T by the empirical functional Tn(Pn) = Tn(X1, ..., Xn), based on the empirical distribution Pn of X1, ..., Xn. Instead of the empirical functional, Tn is often called the (sample) statistic.

Hampel's definition of the (qualitative) robustness is based on the Prohorov metric d_P on the system P of probability measures on the sample space.

Definition 2.3 We say that the sequence of statistics (empirical functionals) {Tn} is qualitatively robust for the probability distribution P if to any ε > 0 there exist a δ > 0 and a positive integer n0 such that, for all Q ∈ P and n ≥ n0,

d_P(P, Q) < δ ⟹ d_P(L_P(Tn), L_Q(Tn)) < ε,  (2.6)

where L_P(Tn) and L_Q(Tn) denote the probability distributions of Tn under P and Q, respectively.

This robustness is only qualitative: it only says whether the functional is or is not robust, but it does not numerically measure the level of this characteristic. Because such robustness concerns only the behavior of the functional in a small neighborhood of P, it is in fact infinitesimal. We can obviously replace the Prohorov metric with another suitable metric on the space P, e.g., the Lévy metric.

However, we do not only want to see whether T is or is not robust. We want to compare various functionals with each other and see which is more robust than the other. To do this, we must characterize the robustness by some quantitative measure. There are many possible quantifications of robustness. However, when using such quantitative measures, we should be aware that replacing a complicated concept with just one number can cause a bias and suppress important information.
2.4 Quantitative characteristics of robustness based on influence function

The influence function is one of the most important characteristics of a statistical functional/estimator. The value $IF(x; T, P)$ measures the effect of a contamination of the functional T by a single value x. Hence, a robust functional T should have a bounded influence function. However, even the fact that T is a qualitatively robust functional does not automatically mean that its influence function $IF(x; T, P)$ is bounded. As we shall see later, an example of such a functional is the R-estimator of the shift parameter that is an inversion of the van der Waerden rank test; while it is qualitatively robust, its influence function is unbounded.

The most popular quantitative characteristics of robustness of a functional T, based on the influence function, are its global and local sensitivities:

a) The global sensitivity of the functional T under distribution P is the maximum absolute value of the influence function in x under P, i.e.,
$$\gamma^* = \sup_{x \in \mathcal{X}} |IF(x; T, P)| \qquad (2.7)$$

b) The local sensitivity of the functional T under distribution P is the value
$$\lambda^* = \sup_{x, y;\ x \neq y} \left|\frac{IF(y; T, P) - IF(x; T, P)}{y - x}\right| \qquad (2.8)$$
which indicates the effect of the replacement of value x by value y on the functional T.

The following example illustrates the difference between the global and local sensitivities.

Example 2.3
(a) Mean: $T(P) = E_P(X)$, $IF(x; T, P) = x - E_P X \ \Longrightarrow\ \gamma^* = \infty$, $\lambda^* = 1$; the mean is not robust, but it is not sensitive to local changes.

(b) Variance: $T(P) = \mathrm{var}_P X = \sigma_P^2$,
$$IF(x; T, P) = (x - E_P(X))^2 - \sigma_P^2, \qquad \gamma^* = \infty$$
$$\lambda^* = \sup_{y \neq x} \left|\frac{(x - E_P(X))^2 - (y - E_P(X))^2}{x - y}\right| = \sup_{y \neq x} \left|\frac{x^2 - y^2 - 2(x - y)E_P X}{x - y}\right| = \sup_{y \neq x} |x + y - 2E_P X| = \infty$$
hence the variance is non-robust with respect to large as well as to small (local) changes.

2.5 Maximum bias

Assume that the true distribution function $F_0$ lies in some family $\mathcal{F}$. Another natural measure of robustness of the functional T is its maximal bias (maxbias) over $\mathcal{F}$,
$$b(\mathcal{F}) = \sup_{F \in \mathcal{F}} \{|T(F) - T(F_0)|\} \qquad (2.9)$$
The family $\mathcal{F}$ can have various forms; for example, it can be a neighborhood of a fixed distribution $F_0$ with respect to some distance described in Section 1.5. In the robustness analysis, $\mathcal{F}$ is often the ε-contaminated neighborhood of a fixed distribution function $F_0$, which has the form
$$\mathcal{F}_{F_0, \varepsilon} = \{F : F = (1 - \varepsilon)F_0 + \varepsilon G,\ G \text{ an unknown distribution function}\} \qquad (2.10)$$
The value ε of the contamination ratio is considered as known, and so is the central distribution function $F_0$. When estimating the location parameter θ of $F(x - \theta)$, where F is an unknown member of $\mathcal{F}$, the central distribution $F_0$ is usually taken as symmetric around zero and unimodal, while the contaminating distribution G can run either over symmetric or over asymmetric distribution functions. We then speak about symmetric or asymmetric contaminations.
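Sampling from a member of the neighborhood (2.10) is straightforward: each observation is drawn from $F_0$ with probability $1 - \varepsilon$ and from G with probability ε. A minimal R sketch follows, where the choices $F_0 = N(0, 1)$, $G = N(10, 1)$ and $\varepsilon = 0.1$ are purely illustrative.

## One sample from F = (1 - eps) F0 + eps G, cf. (2.10).
set.seed(2)
eps    <- 0.1; n <- 1000
from.G <- rbinom(n, 1, eps) == 1            # indicators of contaminated points
x      <- ifelse(from.G,
                 rnorm(n, mean = 10),       # G : illustrative N(10, 1)
                 rnorm(n, mean = 0))        # F0: standard normal
c(mean = mean(x), median = median(x))       # the mean is dragged towards G,
                                            # the median stays near 0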
Many statistical functionals are monotone with respect to the stochastic ordering of distribution functions (or random variables), defined in the following way: a random variable X with distribution function F is stochastically smaller than a random variable Y with distribution function G if
$$F(x) \geq G(x) \quad \forall x \in \mathbb{R}$$
A monotone statistical functional thus attains its maxbias either at the stochastically largest member $F_\infty$ or at the stochastically smallest member $F_{-\infty}$ of $\mathcal{F}_{F_0, \varepsilon}$, hence
$$b(\mathcal{F}_{F_0, \varepsilon}) = \max\{|T(F_\infty) - T(F_0)|,\ |T(F_{-\infty}) - T(F_0)|\} \qquad (2.11)$$
The following example well illustrates the role of the maximal bias; it shows that while the mean is non-robust, the median is universally robust with respect to the maxbias criterion.

Example 2.4
(i) Mean: $T(F) = E_F(X)$; if $F_0$ is symmetric around zero and so are all contaminating distributions G, all having finite first moments, then T(F) is unbiased for all $F \in \mathcal{F}_{F_0, \varepsilon}$, hence $b(\mathcal{F}_{F_0, \varepsilon}) = 0$. However, under an asymmetric contamination, $b(\mathcal{F}_{F_0, \varepsilon}) = |E(F_\infty) - E(F_0)| = \infty$, where $F_\infty = (1 - \varepsilon)F_0 + \varepsilon\delta_\infty$ is the stochastically largest member of $\mathcal{F}_{F_0, \varepsilon}$.

(ii) Median: Because the median is nondecreasing with respect to the stochastic ordering of distributions, its maximum absolute bias over an asymmetric ε-contaminated neighborhood of a symmetric distribution function $F_0$ is attained either at $F_\infty = (1 - \varepsilon)F_0 + \varepsilon\delta_\infty$ (the stochastically largest distribution of $\mathcal{F}_{F_0, \varepsilon}$), or at $F_{-\infty} = (1 - \varepsilon)F_0 + \varepsilon\delta_{-\infty}$ (the stochastically smallest distribution of $\mathcal{F}_{F_0, \varepsilon}$). The median of $F_\infty$ is attained at $x_0$ satisfying
$$(1 - \varepsilon)F_0(x_0) = \frac{1}{2} \ \Longrightarrow\ x_0 = F_0^{-1}\left(\frac{1}{2(1 - \varepsilon)}\right)$$
while the median of $F_{-\infty}$ is $x_0^-$ such that
$$(1 - \varepsilon)F_0(x_0^-) + \varepsilon = \frac{1}{2} \ \Longrightarrow\ x_0^- = F_0^{-1}\left(1 - \frac{1}{2(1 - \varepsilon)}\right) = -x_0$$
hence the maxbias of the median is equal to $x_0$.

Let T(F) be any other functional such that its estimate $T(F_n) = T(X_1, \ldots, X_n)$, based on the empirical distribution function $F_n$, is translation equivariant, i.e., $T(X_1 + c, \ldots, X_n + c) = T(X_1, \ldots, X_n) + c$ for any $c \in \mathbb{R}$. Then obviously $T(F(\cdot - c)) = T(F(\cdot)) + c$. We shall show that the maxbias of T cannot be smaller than $x_0$. Consider two contaminations of $F_0$,
$$F_+ = (1 - \varepsilon)F_0 + \varepsilon G_+, \qquad F_- = (1 - \varepsilon)F_0 + \varepsilon G_-$$
where
$$G_+(x) = \begin{cases} 0 & \ldots\ x \leq x_0 \\ \frac{1}{\varepsilon}\big\{1 - (1 - \varepsilon)[F_0(x) + F_0(2x_0 - x)]\big\} & \ldots\ x \geq x_0 \end{cases}$$
and
$$G_-(x) = \begin{cases} \frac{1}{\varepsilon}\big\{(1 - \varepsilon)[F_0(x + 2x_0) - F_0(x)]\big\} & \ldots\ x < -x_0 \\ 1 & \ldots\ x \geq -x_0 \end{cases}$$
Notice that $F_-(x - x_0) = F_+(x + x_0)$, hence $T(F_-) + x_0 = T(F_+) - x_0$ and $T(F_+) - T(F_-) = 2x_0$; thus the maxbias of T at $F_0$ cannot be smaller than $x_0$. This shows that the median has the smallest maxbias among all translation equivariant functionals.

If T(F) is a nonlinear functional, or if it is defined implicitly as a solution of a minimization or of a system of equations, then it is difficult to calculate (2.11) precisely. In that case we consider the maximum asymptotic bias of T(F) over a neighborhood $\mathcal{F}$ of $F_0$. More precisely, let $X_1, \ldots, X_n$ be independent identically distributed observations with distribution function $F \in \mathcal{F}$, and let $F_n$ be the empirical distribution function. Assume that, as the number of observations increases, $T(F_n)$ has an asymptotically normal distribution for every $F \in \mathcal{F}$ in the sense that
$$P\big(\sqrt{n}\,(T(F_n) - T(F)) \leq x\big) \to \Phi\left(\frac{x}{\sigma(T, F)}\right) \quad \text{as } n \to \infty$$
with variance $\sigma^2(T, F)$ depending on T and F.
Then the maximal asymptotic bias (asymptotic maxbias) of T over $\mathcal{F}$ is defined as
$$\sup\{|T(F) - T(F_0)| : F \in \mathcal{F}\} \qquad (2.12)$$
We shall return to the asymptotic maxbias later in the context of some robust estimators that are either nonlinear or defined implicitly as a solution of a minimization or of a system of equations.

2.6 Breakdown point

The breakdown point, introduced by Donoho and Huber in 1983, is a very popular quantitative characteristic of robustness. To describe this characteristic, start from a random sample $x^{(0)} = (x_1, \ldots, x_n)$ and consider the corresponding value $T_n(x^{(0)})$ of an estimator of functional T. Imagine that in this "initial" sample we can replace any m components by arbitrary values, possibly very unfavorable, even infinite. Denote the new sample after the replacement by $x^{(m)}$, and let $T_n(x^{(m)})$ be the pertaining value of the estimator. The breakdown point of estimator $T_n$ for the sample $x^{(0)}$ is the number
$$\varepsilon_n^*(T_n, x^{(0)}) = \frac{m^*(x^{(0)})}{n}$$
where $m^*(x^{(0)})$ is the smallest integer m for which
$$\sup_{x^{(m)}} \big\|T_n(x^{(m)}) - T_n(x^{(0)})\big\| = \infty$$
i.e., the smallest fraction of the observations that, being replaced with arbitrary values, can take $T_n$ up to infinity. Some estimators have a universal breakdown point, when $m^*$ is independent of the initial sample $x^{(0)}$. Then we can also calculate the limit $\varepsilon^* = \lim_{n\to\infty} \varepsilon_n^*$, which is often also called the breakdown point. We can modify the breakdown point in such a way that, instead of replacing m components, we extend the sample by some m (unfavorable) values.

Example 2.5
(a) The average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$: here $\varepsilon_n^*(\bar{X}_n, x^{(0)}) = \frac{1}{n}$, hence $\lim_{n\to\infty} \varepsilon_n^*(\bar{X}_n, x^{(0)}) = 0$ for any initial sample $x^{(0)}$.

(b) Median $\tilde{X}_n = X_{(\frac{n+1}{2})}$ (consider n odd, for simplicity): here $\varepsilon_n^*(\tilde{X}_n, x^{(0)}) = \frac{n+1}{2n}$, thus $\lim_{n\to\infty} \varepsilon_n^*(\tilde{X}_n, x^{(0)}) = \frac{1}{2}$ for any initial sample $x^{(0)}$.

2.7 Tail–behavior measure of a statistical estimator

The tail–behavior measure is surprisingly intuitive, mainly in estimating the shift and regression parameters. We will first illustrate this measure on the shift parameter, and then return to regression at a suitable place. Let $(X_1, \ldots, X_n)$ be a random sample from a population with continuous distribution function $F(x - \theta)$, $\theta \in \mathbb{R}$. The problem of interest is that of estimating the parameter θ. A reasonable estimator of the shift parameter should be translation equivariant: $T_n$ is translation equivariant if
$$T_n(X_1 + c, \ldots, X_n + c) = T_n(X_1, \ldots, X_n) + c \quad \forall c \in \mathbb{R} \ \text{and} \ \forall X_1, \ldots, X_n$$
The performance of such an estimator can be characterized by the probabilities $P_\theta(|T_n - \theta| > a)$, analyzed either under fixed $a > 0$ and $n \to \infty$, or under fixed n and $a \to \infty$. Indeed, if $\{T_n\}$ is a consistent estimator of θ, then $\lim_{n\to\infty} P_\theta(|T_n - \theta| > a) = 0$ under any fixed $a > 0$. Such a characteristic was studied, e.g., by Bahadur (1967), Fu (1975, 1980) and Sievers (1978), who suggested the limit
$$\lim_{n\to\infty}\left\{-\frac{1}{n}\ln P_\theta(|T_n - \theta| > a)\right\} \quad \text{under fixed } a > 0$$
(provided it exists) as a measure of efficiency of the estimator $T_n$, and compared estimators from this point of view. On the other hand, a good estimator $T_n$ also satisfies the convergence
$$\lim_{a\to\infty} P_\theta(|T_n - \theta| > a) = \lim_{a\to\infty} P_0(|T_n| > a) = 0 \qquad (2.13)$$
and this convergence should be as fast as possible. The probabilities $P_\theta(T_n - \theta > a)$ and $P_\theta(T_n - \theta < -a)$, for a sufficiently large, are called the right and the left tails, respectively, of the probability distribution of $T_n$. If $T_n$ is symmetrically distributed around θ, then both its tails are characterized by probability (2.13). This probability should rapidly tend to zero.
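How fast the probability in (2.13) decays can be approximated by simulation. In the following R sketch, the mean and the median are compared under light-tailed normal and heavy-tailed Cauchy errors; the sample size, the threshold a, the number of replications and the two error distributions are all arbitrary illustrative choices.

## Monte Carlo approximation of P_0(|T_n| > a), cf. (2.13).
tail.prob <- function(rdist, Tn, n = 11, a = 4, nrep = 1e5)
  mean(replicate(nrep, abs(Tn(rdist(n))) > a))
set.seed(3)
tail.prob(rnorm,   mean)     # normal errors: both tails essentially zero
tail.prob(rnorm,   median)
tail.prob(rcauchy, mean)     # Cauchy errors: the mean keeps heavy tails ...
tail.prob(rcauchy, median)   # ... while the median's tails are far lighter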
However, the speed of this convergence cannot be arbitrarily high. We shall show that the rate of convergence of the tails of a translation equivariant estimator is bounded, and that its upper bound depends on the behavior of $1 - F(a)$ and $F(-a)$ for large $a > 0$. Let us illustrate this upper bound on a model with symmetric distribution function satisfying $F(-x) = 1 - F(x)$, $x \in \mathbb{R}$. Jurečková (1981) introduced the following tail-behavior measure of an equivariant estimator $T_n$:
$$B(T_n; a) = \frac{-\ln P_\theta(|T_n - \theta| > a)}{-\ln(1 - F(a))} = \frac{-\ln P_0(|T_n| > a)}{-\ln(1 - F(a))}, \quad a > 0 \qquad (2.14)$$
The values $B(T_n; a)$ for large a show how many times faster the probability $P_0(|T_n| > a)$ tends to 0 than $1 - F(a)$ does, as $a \to \infty$. The best estimator $T_n$ is one with the largest possible values of $B(T_n; a)$ for large a. The lower and upper bounds for $B(T_n; a)$, and thus for the rate of convergence of its tails, are formulated in the following lemma.

Lemma 2.1 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$, $0 < F(x) < 1$, $F(-x) = 1 - F(x)$, $x, \theta \in \mathbb{R}$. Let $T_n$ be an equivariant estimator of θ such that, for any fixed n,
$$\min_{1 \leq i \leq n} X_i > 0 \ \Longrightarrow\ T_n(X_1, \ldots, X_n) > 0 \qquad (2.15)$$
$$\max_{1 \leq i \leq n} X_i < 0 \ \Longrightarrow\ T_n(X_1, \ldots, X_n) < 0$$
Then, under any fixed n,
$$1 \leq \liminf_{a\to\infty} B(T_n; a) \leq \limsup_{a\to\infty} B(T_n; a) \leq n \qquad (2.16)$$

Proof. Indeed, if $T_n$ is equivariant, then
$$P_0(|T_n(X_1, \ldots, X_n)| > a) = P_0(T_n(X_1, \ldots, X_n) > a) + P_0(T_n(X_1, \ldots, X_n) < -a)$$
$$= P_0(T_n(X_1 - a, \ldots, X_n - a) > 0) + P_0(T_n(X_1 + a, \ldots, X_n + a) < 0)$$
$$\geq P_0\Big(\min_{1 \leq i \leq n} X_i > a\Big) + P_0\Big(\max_{1 \leq i \leq n} X_i < -a\Big) = (1 - F(a))^n + (F(-a))^n$$
hence
$$-\ln P_0(|T_n(X_1, \ldots, X_n)| > a) \leq -\ln 2 - n \ln(1 - F(a)) \ \Longrightarrow\ \limsup_{a\to\infty} \frac{-\ln P_0(|T_n| > a)}{-\ln(1 - F(a))} \leq n$$
Similarly,
$$P_0(|T_n(X_1, \ldots, X_n)| > a) \leq P_0\Big(\min_{1 \leq i \leq n} X_i \leq -a\Big) + P_0\Big(\max_{1 \leq i \leq n} X_i \geq a\Big)$$
$$= 1 - (1 - F(-a))^n + 1 - (F(a))^n = 2\{1 - (F(a))^n\} = 2(1 - F(a))\big[1 + F(a) + \ldots + (F(a))^{n-1}\big] \leq 2n(1 - F(a))$$
hence
$$-\ln P_0(|T_n(X_1, \ldots, X_n)| > a) \geq -\ln 2 - \ln n - \ln(1 - F(a)) \ \Longrightarrow\ \liminf_{a\to\infty} \frac{-\ln P_0(|T_n| > a)}{-\ln(1 - F(a))} \geq 1 \qquad \square$$

If $T_n$ attains the upper bound in (2.16), then it is obviously optimal for the distribution function F, because its tails tend to zero n-times faster (on the logarithmic scale) than $1 - F(a)$, which is the upper bound. However, we still have the following questions:
• Is the upper bound attainable, and for which $T_n$ and F?
• Is there any estimator $T_n$ attaining high values of $B(T_n; a)$ robustly over a broad class of distribution functions?

It turns out that the sample mean $\bar{X}_n$ can attain both the lower and the upper bound in (2.16): it attains the upper bound under the normal distribution and under exponentially tailed distributions, while it attains the lower bound for the Cauchy distribution and other heavy-tailed distributions. This demonstrates a high non-robustness of $\bar{X}_n$ even from the tail-behavior aspect. On the other hand, the sample median $\tilde{X}_n$ is robust even with respect to tails: $\tilde{X}_n$ does not attain the upper bound in (2.16); on the contrary, the limit $\lim_{a\to\infty} B(\tilde{X}_n; a)$ always lies in the middle of the range between 1 and n for a broad class of distribution functions. These conclusions are in good concordance with the robustness concepts. The following theorem gives them a mathematical form.

Theorem 2.1 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$, $0 < F(x) < 1$, $F(-x) = 1 - F(x)$, $x, \theta \in \mathbb{R}$.
(i) Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ be the sample mean.
If F has exponential tails, i.e.,
$$\lim_{a\to\infty} \frac{-\ln(1 - F(a))}{b\,a^r} = 1 \quad \text{for some } b > 0,\ r \geq 1 \qquad (2.17)$$
then
$$\lim_{a\to\infty} B(\bar{X}_n; a) = n \qquad (2.18)$$
(ii) If F has heavy tails in the sense that
$$\lim_{a\to\infty} \frac{-\ln(1 - F(a))}{m \ln a} = 1 \quad \text{for some } m > 0 \qquad (2.19)$$
then
$$\lim_{a\to\infty} B(\bar{X}_n; a) = 1 \qquad (2.20)$$
(iii) Let $\tilde{X}_n$ be the sample median. Then, for F satisfying either (2.17) or (2.19),
$$\frac{n}{2} \leq \liminf_{a\to\infty} B(\tilde{X}_n; a) \leq \limsup_{a\to\infty} B(\tilde{X}_n; a) \leq \frac{n}{2} + 1 \quad \text{for } n \text{ even, and} \qquad (2.21)$$
$$\lim_{a\to\infty} B(\tilde{X}_n; a) = \frac{n+1}{2} \quad \text{for } n \text{ odd} \qquad (2.22)$$

Remark 2.1 The distribution functions with exponential tails, satisfying (2.17), will briefly be called type I. This class includes the normal distribution (r = 2) and the logistic and Laplace distributions (r = 1). The distribution functions with heavy tails, satisfying (2.19), will be called type II. The Cauchy distribution (m = 1) or the t-distribution with m > 1 degrees of freedom belongs here.

Proof of Theorem 2.1. (i) It is sufficient to prove that the exponentially tailed F has a finite expectation
$$E_\varepsilon = E_0\big[\exp\{n(1 - \varepsilon)b|\bar{X}_n|^r\}\big] < \infty \qquad (2.23)$$
for arbitrary $\varepsilon \in (0, 1)$. Indeed, we then conclude from the Markov inequality that
$$P_0(|\bar{X}_n| > a) \leq E_\varepsilon \cdot \exp\{-n(1 - \varepsilon)b a^r\}$$
$$\Longrightarrow\ \liminf_{a\to\infty} \frac{-\ln P_0(|\bar{X}_n| > a)}{b a^r} \geq \lim_{a\to\infty} \frac{n(1 - \varepsilon)b a^r - \ln E_\varepsilon}{b a^r} = n(1 - \varepsilon)$$
and, ε being arbitrary, we arrive at proposition (2.18). The finite expectation (2.23) we get from the Hölder inequality:
$$E_0\big[\exp\{n(1 - \varepsilon)b|\bar{X}_n|^r\}\big] \leq E_0\Big[\exp\Big\{(1 - \varepsilon)b\sum_{i=1}^n |X_i|^r\Big\}\Big] \qquad (2.24)$$
$$\leq \big(E_0\big[\exp\{(1 - \varepsilon)b|X_1|^r\}\big]\big)^n = 2^n\left(\int_0^\infty \exp\{(1 - \varepsilon)b x^r\}\, dF(x)\right)^n$$
It follows from (2.17) that, given $\varepsilon > 0$, there exists an $A_\varepsilon > 0$ such that
$$1 - F(a) < \exp\Big\{-\Big(1 - \frac{\varepsilon}{2}\Big)b a^r\Big\} \quad \text{for any } a \geq A_\varepsilon$$
The last integral in (2.24) can be successively rewritten in the following way:
$$\int_0^\infty \exp\{(1 - \varepsilon)b x^r\}\, dF(x) = \int_0^{A_\varepsilon} \exp\{(1 - \varepsilon)b x^r\}\, dF(x) - \int_{A_\varepsilon}^\infty \exp\{(1 - \varepsilon)b x^r\}\, d(1 - F(x))$$
$$= \int_0^{A_\varepsilon} \exp\{(1 - \varepsilon)b x^r\}\, dF(x) + (1 - F(A_\varepsilon)) \cdot \exp\{(1 - \varepsilon)b A_\varepsilon^r\} + \int_{A_\varepsilon}^\infty (1 - F(x))(1 - \varepsilon)b r x^{r-1} \cdot \exp\{(1 - \varepsilon)b x^r\}\, dx$$
$$\leq \int_0^{A_\varepsilon} \exp\{(1 - \varepsilon)b x^r\}\, dF(x) + \exp\Big\{-\frac{\varepsilon}{2}b A_\varepsilon^r\Big\} + \int_{A_\varepsilon}^\infty (1 - \varepsilon)b r x^{r-1} \cdot \exp\Big\{-\frac{\varepsilon}{2}b x^r\Big\}\, dx < \infty$$
and that leads to proposition (i).

(ii) If F has heavy tails, then
$$P_0(|\bar{X}_n| > a) = P_0(\bar{X}_n > a) + P_0(\bar{X}_n < -a)$$
$$\geq P_0\big(X_1 > -a, \ldots, X_{n-1} > -a,\ X_n > (2n - 1)a\big) + P_0\big(X_1 < a, \ldots, X_{n-1} < a,\ X_n < -(2n - 1)a\big)$$
$$= 2(F(a))^{n-1}\big[1 - F((2n - 1)a)\big]$$
hence
$$\limsup_{a\to\infty} B(\bar{X}_n; a) \leq \lim_{a\to\infty} \frac{-\ln\big[1 - F((2n - 1)a)\big]}{m \ln a} = \lim_{a\to\infty} \frac{-\ln\big[1 - F((2n - 1)a)\big]}{m \ln((2n - 1)a)} = 1$$

(iii) Let $\tilde{X}_n$ be the sample median and let n be odd. Then $\tilde{X}_n$ is the middle order statistic of the sample $X_1, \ldots, X_n$, i.e., $\tilde{X}_n = X_{(m)}$, $m = \frac{n+1}{2}$, and $F(\tilde{X}_n) = U_{(m)}$ has the beta-distribution. Then
$$P_0(|\tilde{X}_n| > a) = P_0(\tilde{X}_n > a) + P_0(\tilde{X}_n < -a) = 2n\binom{n-1}{m-1}\int_{F(a)}^1 u^{m-1}(1 - u)^{m-1}\, du \leq 2n\binom{n-1}{m-1}(1 - F(a))^m$$
and similarly
$$P_0(|\tilde{X}_n| > a) \geq 2n\binom{n-1}{m-1}(F(a))^{m-1}(1 - F(a))^m$$
which leads to (2.22) after taking logarithms. We proceed analogously for n even. □

2.8 Variance of asymptotic normal distribution

If an estimator $T_n$ of the functional $T(\cdot)$ is asymptotically normally distributed as $n \to \infty$,
$$\mathcal{L}_P\big(\sqrt{n}\,(T_n - T(P))\big) \to \mathcal{N}(0, V^2(P, T))$$
then another possible robustness measure of T is the supremum
$$\sigma^2(T) = \sup_{P \in \mathcal{P}_0} V^2(P, T)$$
of the variance $V^2(P, T)$ over a neighborhood $\mathcal{P}_0 \subset \mathcal{P}$ of the assumed model. The estimator minimizing $\sup_{P \in \mathcal{P}_0} V^2(P, T)$ over a specified class $\mathcal{T}$ of estimators of parameter θ is called minimaximally robust in the class $\mathcal{T}$.
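The supremum of the variance over a contamination neighborhood can again be explored by simulation. The following hedged R sketch estimates $n \cdot \mathrm{var}(T_n)$ for the mean and the median under the contaminated normal model $(1 - \varepsilon)N(0, 1) + \varepsilon N(0, 9)$; the model, the sample size and the replication count are illustrative choices only.

## Simulated n * var(T_n) under (1 - eps) N(0,1) + eps N(0,9).
sim.var <- function(Tn, eps, n = 50, nrep = 20000) {
  est <- replicate(nrep, {
    cont <- rbinom(n, 1, eps) == 1
    Tn(ifelse(cont, rnorm(n, sd = 3), rnorm(n)))
  })
  n * var(est)
}
set.seed(4)
sapply(c(0, 0.05, 0.20), sim.var, Tn = mean)    # grows roughly like 1 + 8*eps
sapply(c(0, 0.05, 0.20), sim.var, Tn = median)  # grows far more slowly

The variance of the mean inflates linearly with the contamination ratio, while that of the median stays nearly stable; this is the behavior that the minimax criterion quantifies.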
We shall show in the sequel that the classes of M-estimators, L-estimators and R-estimators all contain a minimaximally robust estimator of the shift and regression parameters in a class of contaminated normal distributions.

2.9 Problems and complements

2.1 Show that both the sample mean and the sample median of the random sample $X_1, \ldots, X_n$ are nondecreasing in each argument $X_i$, $i = 1, \ldots, n$.

2.2 Characterize the distributions satisfying
$$\lim_{a\to\infty} \frac{-\ln(1 - F(a + c))}{-\ln(1 - F(a))} = 1 \qquad (2.25)$$
for any fixed $c \in \mathbb{R}$. Show that this class contains distributions of type I and type II.

2.3 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$, where F is symmetric, absolutely continuous, $0 < F(x) < 1$ for $x \in \mathbb{R}$, and satisfies (2.25). Let $T_n(X_1, \ldots, X_n)$ be a translation equivariant estimator of θ, nondecreasing in each argument $X_i$, $i = 1, \ldots, n$. Then $T_n$ has a universal breakdown point $m_n^* = m_n^*(T_n)$ and there exists a constant A such that
$$X_{n:m_n^*} - A \leq T_n(X_1, \ldots, X_n) \leq X_{n:n-m_n^*+1} + A$$
where $X_{n:1} \leq X_{n:2} \leq \ldots \leq X_{n:n}$ are the order statistics of the sample $X_1, \ldots, X_n$. (Hint: see He et al. (1990).)

2.4 Let $T_n(X_1, \ldots, X_n)$ be a translation equivariant estimator of θ, nondecreasing in each argument $X_i$, $i = 1, \ldots, n$. Then, under the conditions of Problem 2.2,
$$m_n^* \leq \liminf_{a\to\infty} B(a, T_n) \leq \limsup_{a\to\infty} B(a, T_n) \leq n - m_n^* + 1 \qquad (2.26)$$
Illustrate this on the sample median. (Hint: see He et al. (1990).)

2.5 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$. Compute the breakdown point of
$$T_n = \frac{1}{2}\left(X_{n:1} + X_{n:n}\right)$$
This estimator is called the midrange (see the next chapter).

2.6 Show that the midrange (Problem 2.5) of the random sample $X_1, \ldots, X_n$ is nondecreasing in each argument $X_i$, $i = 1, \ldots, n$. Illustrate (2.26) for this estimator.

2.7 Determine whether the gamma distribution (Example 1.5) has exponential or heavy tails.

CHAPTER 3

Robust estimators of real parameter

3.1 Introduction

Let $X_1, \ldots, X_n$ be a random sample from a population with probability distribution P. The distribution is generally unknown; we only assume that its distribution function F belongs to some class $\mathcal{F}$ of distribution functions. We look for an appropriate estimator of a parameter θ that can be expressed as a functional T(P) of P. The same parameter θ can be characterized by means of several functionals; e.g., the center of symmetry is simultaneously the expected value, the median and the mode of the distribution, and admits other possible characterizations. Some functionals T(P) are characterized implicitly as a root of an equation (or of a system of equations) or as a solution of a minimization (maximization) problem: such are the maximum likelihood estimator, the moment estimator, etc. An estimator of the parameter θ is obtained as an empirical functional, i.e., when one replaces P in the functional $T(\cdot)$ with the empirical distribution corresponding to the vector of observations $X_1, \ldots, X_n$.

We shall mainly deal with three broad classes of robust estimators of a real parameter: M-estimators, L-estimators and R-estimators. We shall later extend these classes to other models, mainly to the linear regression model.

3.2 M-estimators

The class of M-estimators was introduced by P. J. Huber in 1964; the properties of M-estimators are studied in his book (Huber (1981)), and also in the books by Andrews et al. (1972), Antoch et al.
(1998), Bunke and Bunke (1986), Dodge and Jurečková (2000), Hampel et al. (1986), Jurečková and Sen (1996), Lecoutre and Tassi (1987), Rieder (1994), Rousseeuw and Leroy (1987), Staudte and Sheather (1990), and others.

The M-estimator $T_n$ is defined as a solution of the minimization problem
$$\sum_{i=1}^n \rho(X_i, \theta) := \min \quad \text{with respect to } \theta \in \Theta, \quad \text{or} \quad E_{P_n}[\rho(X, \theta)] = \min,\ \theta \in \Theta \qquad (3.1)$$
where $\rho(\cdot, \cdot)$ is a properly chosen function. The class of M-estimators also covers the maximum likelihood estimator (MLE) of the parameter θ in the parametric model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$: if $f(x, \theta)$ is the density function of $P_\theta$, then the MLE is a solution of the minimization
$$\sum_{i=1}^n \big(-\log f(X_i, \theta)\big) = \min, \quad \theta \in \Theta$$
If ρ in (3.1) is differentiable in θ with a continuous derivative $\psi(\cdot, \theta) = \frac{\partial}{\partial\theta}\rho(\cdot, \theta)$, then $T_n$ is a root (or one of the roots) of the equation
$$\sum_{i=1}^n \psi(X_i, \theta) = 0, \quad \theta \in \Theta \qquad (3.2)$$
hence
$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, T_n) = E_{P_n}[\psi(X, T_n)] = 0, \quad T_n \in \Theta \qquad (3.3)$$
We see from (3.1) and (3.3) that the M-functional, the statistical functional corresponding to $T_n$, is defined as a solution of the minimization
$$\int_{\mathcal{X}} \rho(x, T(P))\, dP(x) = E_P[\rho(X, T(P))] := \min, \quad T(P) \in \Theta \qquad (3.4)$$
or as a solution of the equation
$$\int_{\mathcal{X}} \psi(x, T(P))\, dP(x) = E_P[\psi(X, T(P))] = 0$$
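In the location model $\rho(x, \theta) = \rho(x - \theta)$, equation (3.2) can be solved numerically. The R sketch below uses the Huber function $\psi(x) = \max(-k, \min(k, x))$, which is discussed in detail in the following sections; the tuning constant k = 1.345 and the data are illustrative assumptions.

## M-estimator of location: root of sum(psi(X_i - theta)) = 0, cf. (3.2).
psi.huber <- function(x, k = 1.345) pmax(-k, pmin(k, x))
m.est <- function(x, k = 1.345)
  uniroot(function(theta) sum(psi.huber(x - theta, k)),
          interval = range(x))$root   # psi is nondecreasing, so the sum
                                      # changes sign between min(x) and max(x)
set.seed(5)
x <- c(rnorm(20), 50)                 # a clean sample plus one gross outlier
c(mean = mean(x), huber = m.est(x), median = median(x))

With the outlier present, the Huber M-estimate stays close to the median, while the mean is pulled far towards the outlying value.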