ROBUST STATISTICAL METHODS
with R
Jana Jurečková
Jan Picek
Published in 2006 by
Chapman & Hall/CRC 
Taylor & Francis Group 
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2006 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-10: 1-58488-454-1 (Hardcover) 
International Standard Book Number-13: 978-1-58488-454-5 (Hardcover) 
Library of Congress Card Number 2005053192
This book contains information obtained from authentic and highly regarded sources. Reprinted material is
quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts
have been made to publish reliable data and information, but the author and the publisher cannot assume
responsibility for the validity of all materials or for the consequences of their use.
No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and
recording, or in any information storage or retrieval system, without written permission from the publishers. 
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration
for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate
system of payment has been arranged. 
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only
for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Jureckova, Jana, 1940-
Robust statistical methods with R / Jana Jureckova, Jan Picek.
p. cm.
Includes bibliographical references and indexes.
ISBN-13: 978-1-58488-454-5 (acid-free paper)
ISBN-10: 1-58488-454-1 (acid-free paper)
1. Robust statistics. 2. R (Computer program language)--Statistical methods. I. Picek, Jan, 1965- II. Title.
QA276.J868 2006
519.5--dc22 2005053192
Visit the Taylor & Francis Web site at 
http://www.taylorandfrancis.com
and the CRC Press Web site at 
http://www.crcpress.com
Taylor & Francis Group 
is the Academic Division of Informa plc.
Contents
Preface ix
Authors xi
Introduction 1
1 Mathematical tools of robustness 5
1.1 Statistical model 5
1.2 Illustration on statistical estimation 8
1.3 Statistical functional 9
1.4 Fisher consistency 11
1.5 Some distances of probability measures 12
1.6 Relations between distances 13
1.7 Differentiable statistical functionals 14
1.8 Gâteau derivative 15
1.9 Fréchet derivative 17
1.10 Hadamard (compact) derivative 18
1.11 Large sample distribution of empirical functional 18
1.12 Computation and software notes 19
1.13 Problems and complements 23
2 Basic characteristics of robustness 27
2.1 Influence function 27
2.2 Discretized form of influence function 28
2.3 Qualitative robustness 30
2.4 Quantitative characteristics of robustness based on influence
function 32
2.5 Maximum bias 33
2.6 Breakdown point 35
2.7 Tail–behavior measure of a statistical estimator 36
2.8 Variance of asymptotic normal distribution 41
2.9 Problems and complements 41
3 Robust estimators of real parameter 43
3.1 Introduction 43
3.2 M -estimators 43
3.3 M -estimator of location parameter 45
3.4 Finite sample minimax property of M -estimator 54
3.5 Moment convergence of M -estimators 58
3.6 Studentized M -estimators 61
3.7 L-estimators 63
3.8 Moment convergence of L-estimators 70
3.9 Sequential M - and L-estimators 72
3.10 R-estimators 74
3.11 Numerical illustration 77
3.12 Computation and software notes 80
3.13 Problems and complements 83
4 Robust estimators in linear model 85
4.1 Introduction 85
4.2 Least squares method 87
4.3 M -estimators 94
4.4 GM -estimators 98
4.5 S-estimators and MM -estimators 100
4.6 L-estimators, regression quantiles 101
4.7 Regression rank scores 104
4.8 Robust scale statistics 106
4.9 Estimators with high breakdown points 109
4.10 One-step versions of estimators 110
4.11 Numerical illustrations 112
4.12 Computation and software notes 115
4.13 Problems and complements 126
5 Multivariate location model 129
5.1 Introduction 129
5.2 Multivariate M -estimators of location and scatter 129
5.3 High breakdown estimators of multivariate location and scatter 132
5.4 Admissibility and shrinkage 133
5.5 Numerical illustrations and software notes 134
5.6 Problems and complements 139
6 Some large sample properties of robust procedures 141
6.1 Introduction 141
6.2 M -estimators 142
6.3 L-estimators 144
6.4 R-estimators 146
6.5 Interrelationships of M -, L- and R-estimators 146
6.6 Minimaximally robust estimators 150
6.7 Problems and complements 153
7 Some goodness-of-fit tests 155
7.1 Introduction 155
7.2 Tests of normality of the Shapiro-Wilk type with nuisance
regression and scale parameters 155
7.3 Goodness-of-fit tests for general distribution with nuisance
regression and scale 158
7.4 Numerical illustration 160
7.5 Computation and software notes 166
Appendix A: R system 173
A.1 Brief R overview 174
References 181
Subject index 191
Author index 195
Preface
Robust statistical procedures became a part of the general statistical con-
sciousness. Yet, students first learn descriptive statistics and the classical sta-
tistical procedures. Only later students and practical statisticians hear that
the classical procedures should be used with great caution and that their
favorite and simple least squares estimator and other procedures should be
replaced with the robust or nonparametric alternatives. To be convinced to
use the robust methods, one needs a reasonable motivation; but everybody
needs motivation of their own: a mathematically–oriented person demands
a theoretical proof, while a practitioner prefers to see the numerical results.
Both aspects are important.
The robust statistical procedures became known to the Prague statistical so-
ciety by the end of the 1960s, thanks to Jaroslav Hájek and his contacts with
Peter J. Huber, Peter J. Bickel and other outstanding statisticians. Frank
Hampel presented his “Some small sample asymptotics,” also touching M-
estimates, at Hájek’s conference in Prague in 1973; and published his paper
in the Proceedings. Thus, we had our own experience with the first skepti-
cism toward the robust methods, but by 1980 we started organizing regular
workshops for applied statisticians called ROBUST. The 14th ROBUST will
be in January 2006. On this occasion, we express our appreciation to Jaromı́r
Antoch, the main organizer.
The course “Robust Statistical Methods” is now a part of the master's program in statistics at Charles University in Prague and is taken by all statistics students. The present book draws on experience obtained during these courses.
We supplement the theory with examples and computational procedures in
the system R.
We chose R as a suitable tool because R seems to be one of the best statistical
environments available. It is also free and the R project is an open source
project. The code you are using is always available for you to see. Detailed
information about the system R and the R project is available from
http://www.r-project.org/. The prepared procedures and datasets that are not available from public resources can be found on the website:
http://www.fp.vslib.cz/kap/picek/robust/.
We acknowledge the support of the Czech Republic Grant 201/05/2340, the
Research Projects MSM 0021620839 of Charles University in Prague, and
MSM 467488501 of Technical University in Liberec.
The authors would also like to thank the editors and anonymous referees who
contributed considerably to the readability of the text. Our gratitude also be-
longs to our families for their support and patience.
Jana Jurečková
Jan Picek
Prague and Liberec
Authors
Jana Jurečková is a professor of statistics and probability at Charles Uni-
versity in Prague, Czech Republic. She is a coauthor of Robust Statistical
Inference: Asymptotics and Inter-Relations (with P.K. Sen, John Wiley &
Sons, 1996) and of Adaptive Regression (with Y. Dodge, Springer, 2000). She
is a fellow of the Institute of Mathematical Statistics and an elected member
of the International Statistical Institute. She was an associate editor of the
Annals of Statistics for six years, and previously for two years in the 1980s;
and now is an associate editor of Journal of the American Statistical Asso-
ciation. Jurečková participates in extensive international cooperation, mainly
with statisticians of Belgium, Canada, France, Switzerland and United States.
Jan Picek is an associate professor of applied mathematics at Technical Uni-
versity in Liberec, Czech Republic.
Introduction
If we analyze data with the aid of classical statistical procedures based on parametric models, we usually tacitly assume that the regression is linear, that the observations are independent and homoscedastic, and that the errors are normally distributed. However, now that we can simulate data from any probability distribution and from various models with our high-speed computers and inspect the graphics, which was not possible before, we observe that these assumptions are often violated. We are then mainly interested in the two following questions:
a) When should we still use the classical statistical procedures, and when are
they still optimal?
b) Are there other statistical procedures that are not so closely connected
with special models and conditions?
The classical procedures are typically parametric: the model is fully specified
up to the values of several scalar or vector parameters. These parameters typ-
ically correspond to the probability distributions of the random errors of the
model. If we succeed in estimating these parameters or in testing a hypothesis
on their domain, we can use our data and reach a definite conclusion. However,
this conclusion is correct only under the validity of our model.
An opposite approach is using the nonparametric procedures. These procedures
are independent of or only weakly dependent on the special shape of the basic
probability distribution, and they behave reasonably well (though not optimally) for a broad class of distribution functions, e.g., the class of distribution functions with densities, possibly symmetric. The discrete probability dis-
tributions do not create a serious problem, because their forms usually follow
the character of the experiment. A typical representative of nonparametric
statistical procedures is the class of the rank tests of statistical hypotheses: the
“null distributions” of the test criterion (i.e., the distributions under the hy-
pothesis H0 of interest) coincide under all continuous probability distribution
functions of the observations.
Unlike the parametric models with scalar or vector parameters, the nonpara-
metric models consider the whole density function or the regression function
as an unknown parameter of infinite dimension. If this functional parameter
is only a nuisance, i.e., if our conclusions concern other entities of the model,
then we try to avoid its estimation, if possible. The statistical procedures —
considering the unknown density, regression function or the influence func-
tion as a nuisance parameter, while the inference concerns something else —
are known as semiparametric procedures; they were developed mainly during
the past 30 years. On the other hand, if just the unknown density, regression
functions, etc. are our main interests, then we try to find their best possible
estimates or the tests of hypotheses on their shapes (goodness-of-fit tests).
Unlike the nonparametric procedures, the robust statistical procedures do not
try to behave necessarily well for a broad class of models, but they are optimal
in some way in a neighborhood of some probability distribution, e.g., normal.
Starting in the 1940s, statisticians successively observed that even small deviations from the normal distribution could be harmful and could strongly impact the quality of the popular least squares estimator, the classical F-test and other classical methods. Hence, the robust statistical procedures
were developed as modifications of the classical procedures, which do not fail
under small deviations from the assumed conditions. They are optimal, in a
specified sense, in a special neighborhood of a fixed probability distribution,
defined with respect to a specified distance of probability measures. As such,
the robust procedures are more efficient than the nonparametric ones that pay
for their universality by some loss of their efficiency.
When we speak of robust statistical procedures, we usually have the estima-
tion procedures in mind. There are also robust tests of statistical hypotheses,
namely tests of the Wald type, based on the robust estimators; but when-
ever possible we recommend using the rank tests instead, mainly for their
simplicity and high efficiency.
In the list of various statistical procedures, we cannot omit the adaptive proce-
dures that tend to the optimal parametric estimator or test with an increasing
number of observations, either in probability or almost surely. Hence, these
procedures adapt themselves to the pertaining parametric model with an in-
creasing number of observations, which would be highly desirable. However, this convergence is typically very slow and the optimality is attained only for an unrealistically large number of observations. Partially adaptive procedures also exist
that successively tend to be best from a prescribed finite set of possible de-
cisions; the choice of prescribed finite set naturally determines the success of
our inference.
The adaptive, nonparametric, robust and semiparametric methods have developed successively, mainly since the 1940s, and they continue to develop, as do the robust procedures in multivariate statistical models. As such, they are not disjoint
from each other, there are no sharp boundaries between these classes, and
some concepts and aspects appear in all of them. This book, too, though
oriented mainly to robust statistical procedures, often touches other statistical
methods. Our ultimate goal is to show and demonstrate which alternative
procedures to apply when we are not sure of our model.
Mathematically, we consider the robust procedures as the statistical func-
tionals, defined on the space of distribution functions, and we are interested
in their behavior in a neighborhood of a specific distribution or a model.
This neighborhood is defined with respect to a specified distance; hence, we
should first consider possible distances on the space of distribution functions and the pertaining basic characteristics of statistical functionals, such as their continuity and derivatives. This is the theoretical background of the robust
statistical procedures. However, robust procedures were developed as an al-
ternative to the practical statistical procedures, and they should be applied
to practical problems. Keeping this in mind, we also must pay great attention
to the computational aspects, and refer to the computational programs that
are available or provide our own. As such, we hope that the readers will use,
with understanding, robust procedures to solve their problems.
CHAPTER 1
Mathematical tools of robustness
1.1 Statistical model
A random experiment leads to some observed values; denote them X1, . . . , Xn.
To make a formal analysis of the experiment and its results, we include everything in the frame of a statistical model. The classical statistical model
assumes that the vector X = (X1, . . . , Xn) can attain the values in a sample
space X (or Xn) and the subsets of X are random events of our interest.
If X is finite, then there is no problem working with the family of all its
subsets. However, some space X can be too rich, as, e.g., the n-dimensional
Euclidean space; then we do not consider all its subsets, but restrict our consi-
derations only to some properly selected subsets/events. In order to describe
the experiments and the events mathematically, we consider the family of
events that creates a σ-field, i.e., that is closed with respect to the countable
unions and the complements of its elements. Let us denote it as B (or Bn).
The probabilistic behavior of random vector X is described by the probability
distribution P, which is a set function defined on B. The classical statistical
model is a family P = {Pθ, θ ∈ Θ} of probability distributions, to which our
specific distribution P also belongs. While we can observe X1, . . . , Xn, the
parameter θ is unobservable. It is a real number or a vector and can take on
any value in the parametric space Θ ⊆ Rp, where p is a positive integer.
The triple {X ,B, Pθ : θ ∈ Θ} is the parametric statistical model. The com-
ponents X1, . . . , Xn are often independent copies of a random variable X,
but they can also form a segment of a time series. The model is multivariate,
when the components Xi, i = 1, . . . , n are themselves vectors in Rk with some
positive integer k.
The type of experiment often fully determines the character of the paramet-
ric model. We easily recognize a special discrete probability distribution, such as the Bernoulli (alternative), binomial, multinomial, Poisson and hypergeometric distributions.
Example 1.1 The binomial random variable is the number of successes among n independent Bernoulli trials: the i-th Bernoulli trial results either in a success with probability p, in which case we put $X_i = 1$, or in a failure with probability $1 - p$, in which case we put $X_i = 0$. In the case of n trials, the binomial random variable X is equal to $X = \sum_{i=1}^{n} X_i$ and can take all integer values $0, 1, \ldots, n$; specifically,
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, \ldots, n, \quad 0 \le p \le 1$$
Example 1.2 Let us have n independent trials, each of them leading to exactly one of k different outcomes, to the i-th one with probability $p_i$, $i = 1, \ldots, k$, $\sum_{i=1}^{k} p_i = 1$. Then the i-th component $X_i$ of the multinomial random vector X is the number of trials leading to the outcome i, and
$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$$
for any vector $(x_1, \ldots, x_k)$ of integers, $0 \le x_i \le n$, $i = 1, \ldots, k$, satisfying $\sum_{i=1}^{k} x_i = n$.
Example 1.3 The Poisson random variable X is, e.g., the number of clients
arriving in the system in a unit interval, the number of electrons emitted
from the cathode in a unit interval, etc. Then X can take on all nonnegative
integers, and if λ is the intensity of arrivals, emission, etc., then
$$P(X = k) = e^{-\lambda}\,\frac{\lambda^k}{k!}, \qquad k = 0, 1, \ldots$$
Example 1.4 The hypergeometric random variable X is, e.g., a number of
defective items in a sample of size n taken from a finite set of products. If
there are M defectives in the set of N products, then
$$P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$$
for all integers k satisfying 0 ≤ k ≤ M and 0 ≤ n−k ≤ N−M ; this probability
is equal to 0 otherwise.
Among the continuous probability distributions, characterized by the densi-
ties, we most easily identify the asymmetric distributions concentrated on a
halfline. For instance, the waiting time or the duration of a service can be
characterized by the gamma distribution.
Example 1.5 The gamma random variable X has density function (see Fig-
ure 1.1)
$$f(x) = \begin{cases} \dfrac{a^b x^{b-1} e^{-ax}}{\Gamma(b)} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$
where a and b are positive constants. The special case b = 1 is the exponential distribution.
Figure 1.1 The density function of gamma distribution with b = 3 and a = 1.
In technical practice we can find many other similar examples.
However, the symmetric distributions with continuous probability densities are hardly distinguishable from each other by simply looking at the data. The same problem arises with asymmetric distributions extended over the whole real line. We should either test a hypothesis on their shape or, lacking knowledge of the distribution shape, use a robust or nonparametric method of inference.
Most of the statistical procedures elaborated in the past were derived under
the normality assumption, i.e., under the condition that the observed data
come from a population with the Gaussian/normal distribution. People be-
lieved that every symmetric probability distribution described by a density
is approximately normal. The procedures based on the normality assumption
usually have a simple algebraic structure, thus one is tempted to use them
in all situations in which the data can take on symmetrically all real values,
forgetting the original normality assumption. For instance, the most popu-
lar least squares estimator (LSE) of regression or other parameters, though
seemingly universal, is closely connected with the normal distribution of the
measurement errors. That itself would not matter, but the LSE fails when
even a small fraction of data comes from another population whose distribu-
tion has heavier tails than the normal one, or when the dataset is contaminated
by some outliers.
At present, these facts can be easily demonstrated numerically with simulated
data, while this was not possible before the era of high-speed computers. But
these facts are not only verified with computers; the close connection of the
least squares and the normal distribution is also supported by strong theoret-
ical arguments, based on the characterizations of the normal distribution by
means of properties of estimators and other procedures. For instance, Kagan,
Linnik, and Rao (1973) proved that the least squares estimator (LSE) of the
regression parameter in the linear regression model with a continuous distri-
bution of the errors is admissible with respect to the quadratic risk (i.e., there
is no other estimator with uniformly better quadratic risk), if and only if the
distribution of the measurement errors is normal.
The Student t-test, the Snedecor F -test and the test of the linear hypothesis
were derived under the normality assumption. While the t-test is relatively
robust to deviations from the normality, the F -test is very sensitive in this
sense and should be replaced with a rank test, unless the normal distribution
is taken for granted.
If we are not sure of the parametric form of the model, we can use either of the following alternative procedures:
a) Nonparametric approach: We give up a parametrization of Pθ by a real or
vector parameter θ, and replace the family {Pθ : θ ∈ Θ} with a broader
family of probability distributions.
b) Robust approach: We introduce an appropriate measure of distance of
statistical procedures made on the sample space X , and study the stability
of the classical procedures, optimal for the model Pθ, under small deviations
from this model. At the same time, we try to modify slightly the classical
procedures (i.e., to find robust procedures) to reduce their sensitivity.
1.2 Illustration on statistical estimation
Let X1, . . . , Xn be independent observations, identically distributed with some probability distribution Pθ, where θ is an unobservable parameter, θ ∈ Θ ⊆ Rp.
Let F (x, θ) be the distribution function of Pθ. Our problem is to estimate the
unobservable parameter θ.
We have several possibilities, for instance
(1) maximum likelihood method
(2) moment method
(3) method of χ2-minimum, or another method minimizing another distance of the empirical and the true distributions
(4) method based on the sufficient statistics (Rao-Blackwell Theorem) and
on the complete sufficient statistics (Lehmann-Scheffé Theorem)
In the context of sufficient statistics, remember the very useful fact in non-
parametric models that the ordered sample (the vector of order statistics)
Xn:1 ≤ Xn:2 ≤ . . . ≤ Xn:n is a complete sufficient statistic for the family of
probability distributions with densities $\prod_{i=1}^{n} f(x_i)$, where f is an arbitrary
continuous density. This corresponds to the model in which the observa-
tions create an independent random sample from an arbitrary continuous
distribution. If θ is a one-dimensional parameter, thus a real number, we
are intuitively led to the class of L-estimators of the type
$$T_n = \sum_{i=1}^{n} c_{ni}\, h(X_{n:i})$$
based on order statistics, with suitable coefficients cni, i = 1, . . . , n, and a
suitable function h(·).
(5) Minimization of some (criterion) function of observations and of θ : e.g.,
the minimization
$$\sum_{i=1}^{n} \rho(X_i, \theta) := \min, \qquad \theta \in \Theta$$
with a suitable non-constant function ρ(·, ·). As an example we can consider
ρ(x, θ) = − log f(x, θ), leading to the maximum likelihood estimator θ̂n.
The estimators of this type are called M -estimators, or estimators of the
maximum likelihood type.
(6) An inversion of the rank tests of the shift in location, of the significance
of regression, etc. leads to the class of R-estimators, based on the ranks of
the observations or of their residuals.
It is the M-, L- and R-estimators, together with some other robust methods, that form the main subject of this book.
1.3 Statistical functional
Consider a random variable X with probability distribution Pθ with distri-
bution function F, where Pθ ∈ P = {Pθ : θ ∈ Θ ⊆ Rp}. Then in many
cases θ can be looked at as a functional θ = T (P ) defined on P ; we can
also write θ = T (F ). Intuitively, a natural estimator of θ, based on observa-
tions X1, . . . , Xn, is T (Pn), where Pn is the empirical probability distribution
of vector (X1, . . . , Xn), i.e.,
$$P_n(A) = \frac{1}{n}\sum_{i=1}^{n} I[X_i \in A], \qquad A \in \mathcal{B} \tag{1.1}$$
In other words, $P_n$ is the uniform distribution on the set $\{X_1, \ldots, X_n\}$, because $P_n(\{X_i\}) = \frac{1}{n}$, $i = 1, \ldots, n$. The distribution function pertaining to $P_n$ is the empirical distribution function
$$F_n(x) = P_n((-\infty, x]) = \frac{1}{n}\sum_{i=1}^{n} I[X_i \le x], \qquad x \in \mathbb{R} \tag{1.2}$$
Example 1.6
(1) Expected value:
$$T(P) = \int_{\mathbb{R}} x\, dP = EX, \qquad T(P_n) = \int_{\mathbb{R}} x\, dP_n = \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
(2) Variance:
$$T(P) = \operatorname{var} X = \int_{\mathbb{R}} x^2\, dP - (EX)^2, \qquad T(P_n) = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}_n^2$$
(3) If $T(P) = \int_{\mathbb{R}} h(x)\, dP$, where h is an arbitrary P-integrable function, then an empirical counterpart of $T(P)$ is
$$T(P_n) = \frac{1}{n}\sum_{i=1}^{n} h(X_i)$$
(4) Conversely, we can find a statistical functional corresponding to a given statistical estimator: for instance, the geometric mean of observations $X_1, \ldots, X_n$ is defined as
$$T(P_n) = G_n = \Big(\prod_{i=1}^{n} X_i\Big)^{1/n}, \qquad \log G_n = \frac{1}{n}\sum_{i=1}^{n} \log X_i = \int_{\mathbb{R}} \log x\, dP_n$$
hence the corresponding statistical functional has the form
$$T(P) = \exp\Big\{\int_{\mathbb{R}} \log x\, dP\Big\}$$
Similarly, the harmonic mean $T(P_n) = H_n$ of observations $X_1, \ldots, X_n$ is defined as
$$\frac{1}{H_n} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{X_i}$$
and the corresponding statistical functional has the form
$$T(P) = H = \Big(\int_{\mathbb{R}} \frac{1}{x}\, dP\Big)^{-1}$$
Statistical functionals were first considered by von Mises (1947).
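As a small computational aside (our illustrative sketch, not part of the original text), the empirical functionals in item (4) of Example 1.6 are immediate to evaluate in R from a sample of positive observations:

## Empirical functionals of Example 1.6(4): geometric and harmonic means
x  <- rgamma(100, 3, 1)          # a positive random sample
Gn <- exp(mean(log(x)))          # geometric mean, T(Pn) = exp{ int log x dPn }
Hn <- 1 / mean(1 / x)            # harmonic mean,  T(Pn) = ( int (1/x) dPn )^(-1)
c(geometric = Gn, harmonic = Hn, arithmetic = mean(x))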
The estimator T (Pn) should tend to T (P ), as n → ∞, with respect to some
type of convergence defined on the space of probability measures. Mostly, it is a
convergence in probability and in distribution, or almost sure convergence; but
an important characteristic also is the large sample bias of estimator T (Pn),
i.e., limn→∞ |E[T (Pn) − T (P )]|, which corresponds to the convergence in the
mean. Because we need to study the behavior of T (Pn) also in a neighborhood
of P, we consider an expansion of the functional difference T(Pn) − T(P) of the Taylor type. To do this, we need some concepts of functional analysis, such as various distances between Pn and P and their relations, and the continuity and differentiability of the functional T with respect to the considered distance.
1.4 Fisher consistency
A reasonable statistical estimator should have the natural property of Fisher
consistency, introduced by R. A. Fisher (1922). We say that estimator θ̂n,
based on observations X1 . . . , Xn with probability distribution P , is a Fisher
consistent estimator of parameter θ, if, written as a functional θ̂n = T (Pn)
of the empirical probability distribution of vector (X1, . . . , Xn), n = 1, . . . ,
it satisfies T (P ) = θ. The following example shows that this condition is not
always automatically satisfied.
Example 1.7 Let $\theta = \operatorname{var} X = T(P) = \int_{\mathbb{R}} x^2\, dP - \big(\int_{\mathbb{R}} x\, dP\big)^2$ be the variance of P. Then the sample variance $\hat{\theta}_n = T(P_n) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ is Fisher consistent; but it is biased, because $E\hat{\theta}_n = \big(1 - \frac{1}{n}\big)\theta$. On the other hand, the unbiased estimator of the variance $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ is not a Fisher consistent estimator of $\theta$, because
$$S_n^2 = \frac{n}{n-1}\, T(P_n) \qquad \text{and} \qquad \frac{n}{n-1}\, T(P) \ne T(P)$$
From the robustness point of view, the natural property of Fisher consistency
of an estimator is more important than its unbiasedness; hence it should be
first checked on a statistical functional.
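The difference between Fisher consistency and unbiasedness in Example 1.7 can also be seen numerically. The following small simulation (our own sketch, not taken from the book) estimates the expectations of the two estimators for normal samples of size n = 10 with θ = 1:

## Example 1.7 numerically: T(Pn) is Fisher consistent but biased,
## S_n^2 (the usual var()) is unbiased but not Fisher consistent
set.seed(3)
n <- 10
est <- replicate(10000, {
  x <- rnorm(n)                                   # theta = var X = 1
  c(TPn = mean((x - mean(x))^2), Sn2 = var(x))
})
rowMeans(est)    # approximately (1 - 1/n) = 0.9 and 1, respectively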
1.5 Some distances of probability measures
Let X be a metric space with metric d, separable and complete, and denote
B the σ-field of its Borel subsets. Furthermore, let P be the system of all
probability measures on the space (X ,B). Then P is a convex set, and on P
we can introduce various distances of its two elements P, Q ∈ P .
Let us briefly describe some such distances, mostly used in mathematical
statistics. For those who want to learn more about such and other distances
and other related topics, we refer to the literature of the functional analysis
and the probability theory, e.g., Billingsley (1998) or Fabian et al. (2001).
(1) The Prohorov distance:
$$d_P(P, Q) = \inf\{\varepsilon > 0 : P(A) \le Q(A^{\varepsilon}) + \varepsilon \ \ \forall A \in \mathcal{B},\ A \ne \emptyset\}$$
where $A^{\varepsilon} = \{x \in \mathcal{X} : \inf_{y \in A} d(x, y) \le \varepsilon\}$ is a closed ε-neighborhood of a non-empty set A.
(2) The Lévy distance: $\mathcal{X} = \mathbb{R}$ is the real line; let F, G be the distribution functions of probability measures P, Q; then
$$d_L(F, G) = \inf\{\varepsilon > 0 : F(x - \varepsilon) - \varepsilon \le G(x) \le F(x + \varepsilon) + \varepsilon \ \ \forall x \in \mathbb{R}\}$$
(3) The total variation distance:
$$d_V(P, Q) = \sup_{A \in \mathcal{B}} |P(A) - Q(A)|$$
We easily verify that $2 d_V(P, Q) = \int_{\mathcal{X}} |dP - dQ|$.
(4) The Kolmogorov distance: $\mathcal{X} = \mathbb{R}$ is the real line and F, G are the distribution functions of probability measures P, Q; then
$$d_K(F, G) = \sup_{x \in \mathbb{R}} |F(x) - G(x)|$$
(5) The Hellinger distance:
$$d_H(P, Q) = \left\{\int_{\mathcal{X}} \big(\sqrt{dP} - \sqrt{dQ}\big)^2\right\}^{1/2}$$
If $f = \frac{dP}{d\mu}$ and $g = \frac{dQ}{d\mu}$ are densities of P, Q with respect to some measure μ, then the Hellinger distance can be rewritten in the form
$$\big(d_H(P, Q)\big)^2 = \int_{\mathcal{X}} \big(\sqrt{f} - \sqrt{g}\big)^2\, d\mu = 2\Big(1 - \int_{\mathcal{X}} \sqrt{fg}\, d\mu\Big)$$
(6) The Lipschitz distance: Assume that $d(x, y) \le 1$ for all $x, y \in \mathcal{X}$ (otherwise we take the metric $d' = \frac{d}{1+d}$); then
$$d_{Li}(P, Q) = \sup_{\psi \in \mathcal{L}} \left|\int_{\mathcal{X}} \psi\, dP - \int_{\mathcal{X}} \psi\, dQ\right|$$
where $\mathcal{L} = \{\psi : \mathcal{X} \mapsto \mathbb{R} : |\psi(x) - \psi(y)| \le d(x, y)\}$ is the set of the Lipschitz functions.
(7) The Kullback-Leibler divergence: Let p, q be the densities of probability distributions P, Q with respect to a measure μ (the Lebesgue measure on the real line or the counting measure); then
$$d_{KL}(Q, P) = \int q(x) \ln \frac{q(x)}{p(x)}\, d\mu(x)$$
The Kullback-Leibler divergence is not a metric, because it is not symmetric in P, Q and does not satisfy the triangle inequality.
More on distances of probability measures can be found in Gibbs and Su
(2002), Liese and Vajda (1987), Rachev (1991), Reiss (1989) and Zolotarev
(1983), among others.
1.6 Relations between distances
The family P of all probability measures on (X ,B) is a metric space with
respect to each of the distances described above. On this metric space we
can study the continuity and other properties of the statistical functional
T(P). Because we are interested in the behavior of the functional not only at the distribution P but also in its neighborhood, we come to the question of which distance is more sensitive to small deviations from P.
The following inequalities between the distances show not only which distance eventually dominates the others, but also illustrate their relations. We leave their verification as an exercise:
$$d_H^2(P, Q) \le 2 d_V(P, Q) \le 2 d_H(P, Q)$$
$$d_P^2(P, Q) \le d_{Li}(P, Q) \le 2 d_P(P, Q) \qquad \forall\, P, Q \in \mathcal{P} \tag{1.3}$$
$$\tfrac{1}{2}\, d_V^2(P, Q) \le d_{KL}(P, Q)$$
If $\mathcal{X} = \mathbb{R}$, then it further holds that
$$d_L(P, Q) \le d_P(P, Q) \le d_V(P, Q)$$
$$d_L(P, Q) \le d_K(P, Q) \le d_V(P, Q) \qquad \forall\, P, Q \in \mathcal{P} \tag{1.4}$$
Example 1.8 Let P be the exponential distribution with density
$$f(x) = \begin{cases} e^{-x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
and let Q be the uniform distribution R(0, 1) with density
$$g(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Then
$$2 d_V(P, Q) = \int_0^1 \big(1 - e^{-x}\big)\, dx + \int_1^{\infty} e^{-x}\, dx = 1 + \frac{1}{e} - 1 + \frac{1}{e} = \frac{2}{e}$$
hence $d_V(\mathrm{exp}, R(0,1)) \approx 0.3679$. Furthermore,
$$d_K(P, Q) = \sup_{x \ge 0} \big|1 - e^{-x} - x I[0 \le x \le 1] - I[x > 1]\big| = e^{-1} \approx 0.3679$$
and
$$d_H^2(\mathrm{exp}, R(0,1)) = 2\Big(1 - \int_0^1 \sqrt{e^{-x}}\, dx\Big) = 2\Big(\frac{2}{\sqrt{e}} - 1\Big)$$
thus $d_H(\mathrm{exp}, R(0,1)) \approx 0.6528$. Finally,
$$d_{KL}(R(0,1), \mathrm{exp}) = \int_0^1 \ln \frac{1}{e^{-x}}\, dx = \frac{1}{2}$$
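The values of Example 1.8 are easy to check numerically; the following lines (our own check, in the spirit of Section 1.12, not taken from the book) approximate the Kolmogorov distance on a fine grid and compute the Hellinger distance with integrate:

## Numerical check of Example 1.8 for Exp(1) and R(0,1)
x  <- seq(0, 10, by = 0.0001)
dK <- max(abs(pexp(x) - punif(x)))                      # sup |F(x) - G(x)|, about 0.3679
dH <- sqrt(2 * (1 - integrate(function(x)
         sqrt(dexp(x) * dunif(x)), 0, 1)$value))        # about 0.6528
c(dK, dH)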
1.7 Differentiable statistical functionals
Let again P be the family of all probability measures on the space (X ,B, µ),
and assume that X is a complete separable metric space with metric d and
that B is the system of the Borel subsets of X . Choose some distance δ on P
and consider the statistical functional T (·) defined on P . If we want to analyze
an expansion of T (·) around P, analogous to the Taylor expansion, we must
introduce the concept of a derivative of a statistical functional. There are several possible definitions of the derivative, and we shall consider three of them: the Gâteau, the Fréchet and the Hadamard derivatives, and compare their properties from the statistical point of view.
Definition 1.1 Let P, Q ∈ P and let t ∈ [0, 1].
Then the probability distribution
Pt(Q) = (1 − t)P + tQ (1.5)
is called the contamination of P by Q in ratio t.
Remark 1.1 Pt(Q) is a probability distribution, because P is convex.
P0(Q) = P means an absence of the contamination, while P1(Q) = Q means
the full contamination.
1.8 Gâteau derivative
Fix two distributions $P, Q \in \mathcal{P}$ and denote $\varphi(t) = T((1-t)P + tQ)$, $0 \le t \le 1$. Suppose that the function $\varphi(t)$ has a finite n-th derivative $\varphi^{(n)}$, that the derivatives $\varphi^{(k)}$ are continuous in the interval (0, 1), and that the right-hand derivatives $\varphi^{(k)}_{+}$ are right-continuous at $t = 0$, $k = 1, \ldots, n-1$. Then we can consider the Taylor expansion around $u \in (0, 1)$
$$\varphi(t) = \varphi(u) + \sum_{k=1}^{n-1} \frac{\varphi^{(k)}(u)}{k!}\,(t-u)^k + \frac{\varphi^{(n)}(v)}{n!}\,(t-u)^n, \qquad v \in [u, t] \tag{1.6}$$
We are mostly interested in the expansion to the right of $u = 0$, which corresponds to a small contamination of P. For that we replace the derivatives $\varphi^{(k)}(0)$ with the right-hand derivatives $\varphi^{(k)}_{+}(0)$. The derivative $\varphi'_{+}(0)$ is called the Gâteau derivative of functional T in P in direction Q.
Definition 1.2 We say that functional T is differentiable in the Gâteau sense in P in direction Q, if there exists the limit
$$T'_Q(P) = \lim_{t \to 0+} \frac{T(P + t(Q-P)) - T(P)}{t} \tag{1.7}$$
$T'_Q(P)$ is called the Gâteau derivative of T in P in direction Q.
Remark 1.2
a) The Gâteau derivative $T'_Q(P)$ of functional T is equal to the ordinary right derivative of the function $\varphi$ at the point 0, i.e., $T'_Q(P) = \varphi'(0+)$.
b) The Gâteau derivative of order k is defined similarly:
$$T^{(k)}_Q(P) = \left[\frac{d^k}{dt^k}\, T(P + t(Q-P))\right]_{t=0+} = \varphi^{(k)}(0+)$$
c) In the special case when Q is the Dirac probability measure $Q = \delta_x$, assigning probability 1 to the one-point set $\{x\}$, we shall use the simpler notation $T'_{\delta_x}(P) = T'_x(P)$.
In the special case $t = 1$, $u = 0$, the Taylor expansion (1.6) reduces to the form
$$T(Q) - T(P) = \sum_{k=1}^{n-1} \frac{T^{(k)}_Q(P)}{k!} + \frac{1}{n!}\left[\frac{d^n}{dt^n}\, T(P + t(Q-P))\right]_{t=t^*} \tag{1.8}$$
where $0 \le t^* \le 1$.
Example 1.9
(a) Expected value:
$$T(P) = \int_{\mathcal{X}} x\, dP = E_P X$$
$$\varphi(t) = \int_{\mathcal{X}} x\, d\big((1-t)P + tQ\big) = (1-t)E_P X + t E_Q X \ \Longrightarrow\ \varphi'(t) = E_Q X - E_P X$$
$$T'_Q(P) = \varphi'(0+) = E_Q X - E_P X$$
Finally we obtain for $Q = \delta_x$
$$T'_x(P) = x - E_P X$$
(b) Variance:
$$T(P) = \operatorname{var}_P X = E_P(X^2) - (E_P X)^2$$
$$T\big((1-t)P + tQ\big) = \int_{\mathcal{X}} x^2\, d\big((1-t)P + tQ\big) - \left[\int_{\mathcal{X}} x\, d\big((1-t)P + tQ\big)\right]^2$$
$$\Longrightarrow\ \varphi(t) = (1-t)E_P X^2 + t E_Q X^2 - (1-t)^2 (E_P X)^2 - t^2 (E_Q X)^2 - 2t(1-t)\, E_P X \cdot E_Q X$$
$$\varphi'(t) = -E_P X^2 + E_Q X^2 + 2(1-t)(E_P X)^2 - 2t\,(E_Q X)^2 - 2(1-2t)\, E_P X \cdot E_Q X$$
This further implies
$$\lim_{t \to 0+} \varphi'(t) = T'_Q(P) = E_Q X^2 - E_P X^2 - 2 E_P X \cdot E_Q X + 2(E_P X)^2$$
and finally we obtain for $Q = \delta_x$
$$T'_x(P) = x^2 - E_P X^2 - 2x E_P X + 2(E_P X)^2 = (x - E_P X)^2 - \operatorname{var}_P X$$
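The limit in Example 1.9(b) can be approximated numerically by the difference quotient (1.7). In the following sketch (ours, not from the book) we take P = N(0, 1), contaminate it by a Dirac measure δ_x in a small ratio t, and compare the quotient with (x − E_P X)² − var_P X = x² − 1:

## Numerical Gateau derivative of the variance functional at P = N(0,1)
var_mixture <- function(t, x) {
  m1 <- t * x                       # first moment of (1 - t) N(0,1) + t delta_x
  m2 <- (1 - t) * 1 + t * x^2       # second moment of the mixture
  m2 - m1^2                         # variance of the contaminated distribution
}
x <- 2; t <- 1e-6
(var_mixture(t, x) - var_mixture(0, x)) / t    # close to x^2 - 1 = 3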
1.9 Fréchet derivative
Definition 1.3 We say that functional T is differentiable in P in the Fréchet sense, if there exists a linear functional $L_P(Q - P)$ such that
$$\lim_{t \to 0} \frac{T(P + t(Q-P)) - T(P)}{t} = L_P(Q - P) \tag{1.9}$$
uniformly in $Q \in \mathcal{P}$, $\delta(P, Q) \le C$, for any fixed $C \in (0, \infty)$.
The linear functional $L_P(Q - P)$ is called the Fréchet derivative of functional T in P in direction Q.
Remark 1.3
a) Because $L_P$ is a linear functional, there exists a function $g : \mathcal{X} \mapsto \mathbb{R}$ such that
$$L_P(Q - P) = \int_{\mathcal{X}} g\, d(Q - P) \tag{1.10}$$
b) If T is differentiable in the Fréchet sense, then it is differentiable in the Gâteau sense, too, i.e., there exists $T'_Q(P)$ for all $Q \in \mathcal{P}$, and it holds that
$$T'_Q(P) = L_P(Q - P) \qquad \forall Q \in \mathcal{P} \tag{1.11}$$
Especially,
$$T'_x(P) = L_P(\delta_x - P) = g(x) - \int_{\mathcal{X}} g\, dP \tag{1.12}$$
and this further implies
$$E_P(T'_x(P)) = \int_{\mathcal{X}} T'_x(P)\, dP = 0. \tag{1.13}$$
c) Let $P_n$ be the empirical probability distribution of the vector $(X_1, \ldots, X_n)$. Then $P_n - P = \frac{1}{n}\sum_{i=1}^{n} (\delta_{X_i} - P)$. Hence, because $L_P$ is a linear functional,
$$L_P(P_n - P) = \frac{1}{n}\sum_{i=1}^{n} L_P(\delta_{X_i} - P) = \frac{1}{n}\sum_{i=1}^{n} T'_{X_i}(P) = T'_{P_n}(P) \tag{1.14}$$
Proof of (1.11): Because $L_P(\cdot)$ is a linear functional, we get by (1.9)
$$T'_Q(P) = \lim_{t \to 0+} \frac{T(P + t(Q-P)) - T(P)}{t} = \lim_{t \to 0+} \left[\frac{T(P + t(Q-P)) - T(P)}{t} - L_P(Q - P)\right] + L_P(Q - P) = 0 + L_P(Q - P) = L_P(Q - P) \qquad \square$$
1.10 Hadamard (compact) derivative
If there exists a linear functional L(Q − P ) such that the convergence (1.9)
is uniform not necessarily for bounded subsets of the metric space (P , δ) con-
taining P, i.e., for all Q satisfying δ(P, Q) ≤ C, 0 < C < ∞, but only for Q
from any fixed compact set K ⊂ P containing P ; then we say that functional
T is differentiable in the Hadamard sense, and we call the functional L(Q−P )
the Hadamard (compact) derivative of T.
The Fréchet differentiable functional is obviously also Hadamard differen-
tiable, and it is, in turn, also Gâteau differentiable, similarly as in Remark 1.3.
We refer to Fernholz (1983) and to Fabian et al. (2001) for more properties of
differentiable functionals.
The Fréchet differentiability imposes rather restrictive conditions on the functional that are not satisfied, in particular, by the robust functionals. On the other
hand, when we have a Fréchet differentiable functional, we can easily derive
the large sample (normal) distribution of its empirical counterpart, when the
number n of observations infinitely increases. If the functional is not suffi-
ciently smooth, we can sometimes derive the large sample normal distribution
of its empirical counterpart with the aid of the Hadamard derivative. If we only
want to prove that T (Pn) is a consistent estimator of T (P ), then it suffices to
consider the continuity of T (P ).
The Gâteau derivative $T'_x(P)$, called the influence function of the functional T, is one of the most important characteristics of its robustness and will be studied in detail in Chapter 2.
1.11 Large sample distribution of empirical functional
Consider again the metric space (P , δ) of all probability distributions on
(X ,B), with metric δ satisfying
$$\sqrt{n}\,\delta(P_n, P) = O_p(1) \quad \text{as } n \to \infty, \tag{1.15}$$
where $P_n$ is the empirical probability distribution of the random sample $(X_1, \ldots, X_n)$, $n = 1, 2, \ldots$. The convergence (1.15) holds, e.g., for the Kolmogorov distance of the empirical distribution function from the true one,
which is the most important for statistical applications; but it holds also for
other distances.
As an illustration of the use of the functional derivatives, let us show that
the Fréchet differentiability, together with the classical central limit theorem,
always give the large sample (asymptotic) distribution of the empirical func-
tional T (Pn).
Theorem 1.1 Let T be a statistical functional, Fréchet differentiable in P, and assume that the empirical probability distribution $P_n$ of the random sample $(X_1, \ldots, X_n)$ satisfies the condition (1.15) as $n \to \infty$. If the variance of the Gâteau derivative $T'_{X_1}(P)$ is positive, $\operatorname{var}_P T'_{X_1}(P) > 0$, then the sequence $\sqrt{n}\,(T(P_n) - T(P))$ is asymptotically normally distributed as $n \to \infty$, namely
$$\mathcal{L}\Big(\sqrt{n}\,\big(T(P_n) - T(P)\big)\Big) \longrightarrow \mathcal{N}\Big(0,\ \operatorname{var}_P T'_{X_1}(P)\Big) \tag{1.16}$$
Proof. By (1.14), $T'_{P_n}(P) = \frac{1}{n}\sum_{i=1}^{n} T'_{X_i}(P)$, and further by (1.8) and condition (1.15) we obtain
$$\sqrt{n}\,\big(T(P_n) - T(P)\big) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} T'_{X_i}(P) + R_n = \sqrt{n}\, L_P(P_n - P) + \sqrt{n}\; o\big(\delta(P_n, P)\big) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} T'_{X_i}(P) + o_p(1) \tag{1.17}$$
If the common variance $\operatorname{var}_P T'_{X_i}(P) = \operatorname{var}_P T'_{X_1}(P)$, $i = 1, \ldots, n$, is finite, then (1.16) follows from (1.17) and from the classical central limit theorem. $\square$
Example 1.10 Let $T(P) = \operatorname{var}_P X = \sigma^2$; then
$$T(P_n) = S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$$
and, by Example 1.9(b),
$$T'_x(P) = (x - E_P X)^2 - \operatorname{var}_P X$$
hence
$$\operatorname{var}_P T'_X(P) = E_P(X - E_P X)^4 - E_P^2(X - E_P X)^2 = \mu_4 - \mu_2^2$$
and by Theorem 1.1 we get the large sample distribution of the sample variance:
$$\mathcal{L}\big(\sqrt{n}\,(S_n^2 - \sigma^2)\big) \longrightarrow \mathcal{N}\big(0,\ \mu_4 - \mu_2^2\big)$$
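The convergence in Example 1.10 is easy to illustrate by simulation. In the following sketch (our own illustration, not part of the original text) we sample from the exponential distribution with rate 1, for which σ² = 1 and μ4 = 9, so the limiting variance μ4 − μ2² equals 8:

## Simulation check of Example 1.10: sqrt(n) (S_n^2 - sigma^2) ~ N(0, mu4 - mu2^2)
set.seed(1)
n <- 1000; nrep <- 2000
stat <- replicate(nrep, {
  x  <- rexp(n, 1)
  s2 <- mean((x - mean(x))^2)       # T(Pn), the 1/n sample variance
  sqrt(n) * (s2 - 1)
})
var(stat)                           # should be close to 8
qqnorm(stat); qqline(stat)          # approximate normality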
1.12 Computation and software notes
We chose R (a language and environment for statistical computing and graph-
ics) as a suitable tool for numerical illustration and the computation. R seems
to us to be one of the best statistical environments available. It is also free
and the R project is an open source project. The code you are using is always
available for you to view. Detailed information about R and the R project is
available from http://www.r-project.org/.
Chapter 1 focuses mainly on the theoretical background. However, Section 1.1
mentioned some specific distribution shapes.
The R system has built-in functions to compute the density, distribution func-
tion and quantile function for many standard distributions, including ones
mentioned in Section 1.1 (see Table 1.1).
Table 1.1 R function names and parameters for selected probability distributions.
Distribution R name Parameters
binomial binom size, prob
exponential exp rate
gamma gamma shape, scale
hypergeometric hyper m, n, k
normal norm mean, sd
Poisson pois lambda
uniform unif min, max
The first letter of the name of the R function indicates the function: dXXXX,
pXXXX, qXXXX are, respectively, the density, distribution and quantile func-
tions. The first argument of the function is the quantile q for the densities
and distribution functions, and the probability p for quantile functions. Addi-
tional arguments specify the parameters. These functions can be used as the
statistical tables.
Here are some examples:
• P (X = 3) - binomial distribution with n=20, p = 0.1
> dbinom(3,20,0.1)
[1] 0.1901199
• P (X ≤ 5) - Poisson distribution with λ=2
> ppois(5,2)
[1] 0.9834364
• 95% – quantile of standard normal distribution
> qnorm(0.95)
[1] 1.644854
• The density function of gamma distribution with b = 3 and a = 1 ( see
Figure 1.1).
> plot(seq(0,8,by=0.01),dgamma(seq(0,8,by=0.01),3,1),
+ type="l", ylab="",xlab="")
System R enables the generation of random data. The corresponding functions
have prefix r and first argument n, the size of the sample required. For exam-
ple, we can generate a sample of size 1000 from gamma distribution (b = 3;
a = 1) by
> rgamma(1000,3,1)
R has a function hist to plot histograms.
> hist(rgamma(1000,3,1), prob=TRUE,nclass=16)
We obtain Figure 1.2, compare with Figure 1.1.
Figure 1.2 The histogram of the simulated sample from gamma distribution with
b = 3 and a = 1.
In Section 1.5, some distances of probability measures were introduced and
illustrated in Example 1.8. There we need to compute an integral. We can
also use R function integrate to solve that problem. Compare the following
example with the calculation in Example 1.8.
> integrate(function(x) {1-exp(-x)},0,1)
0.3678794 with absolute error < 4.1e-15
> integrate(function(x) {exp(-x)},1, Inf)
0.3678794 with absolute error < 2.1e-05
R can also help us in the case of a discrete distribution. Let P be the binomial distribution with parameters n = 100, p = 0.01, and let Q be the Poisson distribution with parameter λ = np = 1. Then
$$2 d_V(P, Q) = \sum_{k=0}^{100} \left|\binom{n}{k}\, 0.01^k\, 0.99^{100-k} - \frac{e^{-1}}{k!}\right| + \sum_{k=101}^{\infty} \frac{e^{-1}}{k!}$$
> sum(abs(choose(100,0:100)*0.01^(0:100)*(0.99)^(100:0)
+ -exp(-1)/factorial(0:100)))+1-sum(exp(-1)/factorial(0:100))
[1] 0.005550589
> ## or also
> sum(abs(dbinom(0:100,100,0.01)-dpois(0:100,1)))+1-ppois(100,1)
[1] 0.005550589
Thus dV (Bi(100, 0.01), Po(1)) ≈ 0.0028.
Similarly,
dK(Bi(100, 0.01), Po(1)) ≈ 0.0018, dH(Bi(100, 0.01), Po(1)) ≈ 0.0036,
dKL(Bi(100, 0.01), Po(1)) ≈ 0.000025 because
> ### Kolmogorov distance
> max(abs(pbinom(0:100,100,0.01)-ppois(0:100,1)))
[1] 0.0018471
> ### Hellinger distance
> sqrt(sum((sqrt(dbinom(0:100,100,0.01))
+ -sqrt(dpois(0:100,1)))^2))
[1] 0.003562329
>### Kullback-Leibler divergence (Q,P)
> sum(dpois(0:100,1)*log(dpois(0:100,1)/
+ dbinom(0:100,100,0.01)))
[1] 2.551112e-05
>### Kullback-Leibler divergence (P,Q)
> sum(dbinom(0:100,100,0.01)*log(dbinom(0:100,100,0.01)/
+ dpois(0:100,1)))
[1] 2.525253e-05
1.13 Problems and complements
1.1 Let Q be the binomial distribution with parameters n, p and let P be
the Poisson distribution with parameter λ = np; then
$$\tfrac{1}{2}\big(d_V(Q, P)\big)^2 \le d_{KL}(Q, P)$$
$$\frac{p^2}{4} \le d_{KL}(Q, P) \le \left(\frac{1}{4} + \frac{np^3}{3} + \frac{p}{2} + \frac{1}{4n}\right) p^2$$
$$\frac{1}{16}\min(p,\, np^2) \le d_V(Q, P) \le 2p\big(1 - e^{-np}\big)$$
$$d_{KL}(Q, P) \le \frac{p^2}{2(1-p)}$$
$$\lim_{n \to \infty} n^2\, d_{KL}(Q, P) = \frac{\lambda^2}{4}$$
See Barbour and Hall (1984), Csiszár (1967), Harremoës and Ruzankin
(2004), Kontoyannis et al. (2005), Pinsker (1960) for demonstrations.
1.2 Wasserstein-Kantorovich distance of distribution functions F, G of random variables X, Y:
• $L_1$-distance on $\mathcal{F}_1 = \{F : \int_{-\infty}^{\infty} |x|\, dF(x) < \infty\}$:
$$d_W^{(1)}(F, G) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|\, dt$$
Show that $d_W^{(1)}(F, G) = \int_{-\infty}^{\infty} |F(x) - G(x)|\, dx$ (Dobrushin (1970)).
Show that $d_W^{(1)}(F, G) = \inf\{E|X - Y|\}$, where the infimum is over all jointly distributed X and Y with respective marginals F and G.
• $L_2$-distance on $\mathcal{F}_2 = \{F : \int_{-\infty}^{\infty} x^2\, dF(x) < \infty\}$:
$$d_W^{(2)}(F, G) = \int_0^1 [F^{-1}(t) - G^{-1}(t)]^2\, dt$$
Show that $d_W^{(2)}(F, G) = \inf\{E(X - Y)^2\}$, where the infimum is over all jointly distributed X and Y with respective marginals F and G (Mallows (1972)).
• Weighted $L_1$-distance:
$$d_W^{(1)}(F, G) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|\, w(t)\, dt, \qquad \int_0^1 w(t)\, dt = 1$$
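For a quick numerical illustration of these definitions (ours, not part of the problem), the quantile-function form can be evaluated directly with integrate; for F = N(0, 1) and G = N(1, 1), a pure shift by 1, both distances equal 1:

## Wasserstein-Kantorovich distances between N(0,1) and N(1,1)
dW1 <- integrate(function(t) abs(qnorm(t) - qnorm(t, mean = 1)),   0, 1)$value
dW2 <- integrate(function(t)    (qnorm(t) - qnorm(t, mean = 1))^2, 0, 1)$value
c(dW1, dW2)    # both close to 1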
1.3 Show that $d_P(F, G) \le d_W^{(1)}(F, G)$ (Dobrushin (1970)).
1.4 Let $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$ be two independent random samples from distribution functions F, G such that $\int_{-\infty}^{\infty} x\, dF(x) = \int_{-\infty}^{\infty} y\, dG(y) = 0$. Let $F_n$, $G_n$ be the distribution functions of $n^{-1/2}\sum_{i=1}^{n} X_i$ and $n^{-1/2}\sum_{i=1}^{n} Y_i$, respectively; then
$$d_W^{(2)}(F_n, G_n) \le d_W^{(2)}(F, G)$$
(Mallows (1972)).
1.5 The χ²-distance: Let p, q be the densities of probability distributions P, Q with respect to a measure μ (μ can be the counting measure). Then $d_{\chi^2}(P, Q)$ is defined as
$$d_{\chi^2}(P, Q) = \int_{x \in \mathcal{X}:\; p(x), q(x) > 0} \frac{(p(x) - q(x))^2}{q(x)}\, d\mu(x)$$
Then $0 \le d_{\chi^2}(P, Q) \le \infty$ and $d_{\chi^2}$ is independent of the choice of the dominating measure. It is not a metric, because it is not symmetric in P, Q. The distance $d_{\chi^2}$ dates back to Pearson in the 1930s and has many applications in statistical inference. The following relations hold between $d_{\chi^2}$ and other distances:
(i) $d_H(P, Q) \le \sqrt{2}\,\big(d_{\chi^2}(P, Q)\big)^{1/4}$
(ii) If the sample space $\mathcal{X}$ is countable, then $d_V(P, Q) \le \frac{1}{2}\sqrt{d_{\chi^2}(P, Q)}$
(iii) $d_{KL}(P, Q) \le d_{\chi^2}(P, Q)$
1.6 Let P be the exponential distribution and let Q be the uniform distribution (see Example 1.8). Then
$$d_{\chi^2}(Q, P) = \int_0^1 \frac{(1 - e^{-x})^2}{e^{-x}}\, dx = e - e^{-1} - 2$$
hence $d_{\chi^2}(R(0,1), \mathrm{exp}) \approx 0.350402$. Furthermore,
$$d_{\chi^2}(P, Q) = \int_0^1 \big(e^{-x} - 1\big)^2\, dx = -\frac{1}{2}e^{-2} + 2e^{-1} - \frac{1}{2}$$
hence $d_{\chi^2}(\mathrm{exp}, R(0,1)) \approx 0.168091$.
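Both values can be verified with integrate, in the style of Section 1.12 (our own check, not taken from the book):

## chi^2-distances between Exp(1) and R(0,1), cf. Problem 1.6
integrate(function(x) (1 - exp(-x))^2 / exp(-x), 0, 1)$value   # about 0.350402
integrate(function(x) (exp(-x) - 1)^2,           0, 1)$value   # about 0.168091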
1.7 Bhattacharyya distance: Let p, q be the densities of probability distributions P, Q with respect to a measure μ. Then $d_B(P, Q)$ is defined as
$$d_B(P, Q) = \log\left(\int_{x \in \mathcal{X}:\; p(x), q(x) > 0} \sqrt{p(x)}\,\sqrt{q(x)}\; d\mu(x)\right)^{-1}$$
(Bhattacharyya (1943)). Furthermore, for a comparison,
$$d_B(\mathrm{exp}, R(0,1)) = \log\left(\int_0^1 \sqrt{e^{-x}}\, dx\right)^{-1} = -\log\Big(2 - \frac{2}{\sqrt{e}}\Big) \approx 0.239605$$
1.8 Verify that $2 d_V(P, Q) = \int_{\mathcal{X}} |dP - dQ|$.
1.9 Check the inequalities (1.3).
1.10 Check the inequalities (1.4).
1.11 Compute the Wasserstein-Kantorovich distances $d_W^{(1)}(F, G)$ and $d_W^{(2)}(F, G)$ for the exponential distribution and the uniform distribution (as in Example 1.8).
CHAPTER 2
Basic characteristics of robustness
2.1 Influence function
Expansion (1.17) of the difference $T(P_n) - T(P)$ says that
$$T(P_n) - T(P) = \frac{1}{n}\sum_{i=1}^{n} T'_{X_i}(P) + n^{-1/2} R_n \tag{2.1}$$
where the remainder term is asymptotically negligible, $n^{-1/2} R_n = o_p(n^{-1/2})$ as $n \to \infty$. Then we can consider $\frac{1}{n}\sum_{i=1}^{n} T'_{X_i}(P)$ as an error of estimating $T(P)$ by $T(P_n)$, and $T'_{X_i}(P)$ as a contribution of $X_i$ to this error, or as an
influence of Xi on this error. From this point of view, a natural interpretation
of the Gâteau derivative T ′x(P ), x ∈ X is to call it an influence function of
functional T (P ).
Definition 2.1 The Gâteau derivative of functional T in distribution P in
the direction of Dirac distribution δx, x ∈ X is called the influence function
of T in P ; thus
$$IF(x; T, P) = T'_x(P) = \lim_{t \to 0+} \frac{T(P_t(\delta_x)) - T(P)}{t} \tag{2.2}$$
where $P_t(\delta_x) = (1-t)P + t\delta_x$.
As the first main properties of IF, let us mention:
a) $E_P(IF(x; T, P)) = \int_{\mathcal{X}} T'_x(P)\, dP = 0$, hence the average influence of all points x on the estimation error is zero.
b) If T is a Fréchet differentiable functional satisfying condition (1.15), and
$$\operatorname{var}_P(IF(x; T, P)) = E_P\big(IF(x; T, P)\big)^2 > 0$$
then
$$\mathcal{L}\Big(\sqrt{n}\,\big(T(P_n) - T(P)\big)\Big) \longrightarrow \mathcal{N}\Big(0,\ \operatorname{var}_P(IF(x; T, P))\Big)$$
Example 2.1
(a) Expected value: $T(P) = E_P(X) = m_P$; then
$$T(P_n) = \bar{X}_n$$
$$IF(x; T, P) = T'_x(P) = x - m_P$$
$$E_P(IF(x; T, P)) = 0, \qquad \operatorname{var}_P(IF(x; T, P)) = \operatorname{var}_P X = \sigma_P^2$$
$$E_Q(IF(x; T, P)) = m_Q - m_P \quad \text{for } Q \ne P$$
$$\mathcal{L}\big(\sqrt{n}\,(\bar{X}_n - m_P)\big) \longrightarrow \mathcal{N}(0, \sigma_P^2)$$
provided P is the true probability distribution of the random sample $(X_1, \ldots, X_n)$.
(b) Variance: $T(P) = \operatorname{var}_P X = \sigma_P^2$; then
$$IF(x; T, P) = (x - m_P)^2 - \sigma_P^2$$
$$E_P(IF(x; T, P)) = 0, \qquad \operatorname{var}_P(IF(x; T, P)) = \mu_4 - \mu_2^2 = \mu_4 - \sigma_P^4$$
$$E_Q(IF(x; T, P)) = E_Q(X - m_P)^2 - \sigma_P^2 = \sigma_Q^2 + (m_Q - m_P)^2 + 2E_Q(X - m_Q)(m_Q - m_P) - \sigma_P^2 = \sigma_Q^2 - \sigma_P^2 + (m_Q - m_P)^2$$
2.2 Discretized form of influence function
Let (X1, . . . , Xn) be the vector of observations and denote Tn = T (Pn) =
Tn(X1, . . . , Xn) as its empirical functional. Consider what happens if we add
another observation Y to X1, . . . , Xn. The influence of Y on Tn is characterized
by the difference
$$T_{n+1}(X_1, \ldots, X_n, Y) - T_n(X_1, \ldots, X_n) := I(T_n, Y) \tag{2.3}$$
Because
$$P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$$
$$P_{n+1} = \frac{1}{n+1}\Big(\sum_{i=1}^{n} \delta_{X_i} + \delta_Y\Big) = \frac{n}{n+1}\, P_n + \frac{1}{n+1}\,\delta_Y = \Big(1 - \frac{1}{n+1}\Big) P_n + \frac{1}{n+1}\,\delta_Y$$
we can say that $P_{n+1}$ arose from $P_n$ by its contamination by the one-point distribution $\delta_Y$ in ratio $\frac{1}{n+1}$; hence
$$I(T_n, Y) = T\Big[\Big(1 - \frac{1}{n+1}\Big) P_n + \frac{1}{n+1}\,\delta_Y\Big] - T(P_n)$$
Because
$$\lim_{n \to \infty}\, (n+1)\, I(T_n, Y) = \lim_{n \to \infty} \frac{T\big[\big(1 - \frac{1}{n+1}\big) P_n + \frac{1}{n+1}\,\delta_Y\big] - T(P_n)}{\frac{1}{n+1}} = IF(Y; T, P) \tag{2.4}$$
$(n+1)\, I(T_n, Y)$ can be considered as a discretized form of the influence function. The supremum of $|I(T_n, Y)|$ over Y then represents a measure of sensitivity of the empirical functional $T_n$ with respect to an additional observation, under fixed $X_1, \ldots, X_n$.
Definition 2.2 The number
$$S(T_n) = \sup_{Y} |I(T_n(X_1, \ldots, X_n), Y)| \tag{2.5}$$
is called a sensitivity of functional Tn(X1, . . . , Xn) to an additional observa-
tion.
Example 2.2
(a) Expected value:
$$T(P) = E_P X, \qquad T_n = \bar{X}_n, \qquad T_{n+1} = \bar{X}_{n+1} = \frac{1}{n+1}\big(n\bar{X}_n + Y\big)$$
$$I(T_n, Y) = \Big(\frac{n}{n+1} - 1\Big)\bar{X}_n + \frac{1}{n+1}\, Y = \frac{1}{n+1}\big(Y - \bar{X}_n\big)$$
$$\Longrightarrow\ (n+1)\, I(T_n, Y) = Y - \bar{X}_n \ \stackrel{P}{\longrightarrow}\ Y - E_P X \quad \text{as } n \to \infty$$
$$\Longrightarrow\ S(\bar{X}_n) = \frac{1}{n+1}\,\sup_Y |Y - \bar{X}_n| = \infty$$
Thus, the sample mean has an infinite sensitivity to an additional observation.
(b) Median:
Let $n = 2m+1$ and let $X_{(1)} \le \ldots \le X_{(n)}$ be the observations ordered in increasing magnitude. Then $T_n = T_n(X_1, \ldots, X_n) = X_{(m+1)}$, and $T_{n+1} = T_{n+1}(X_1, \ldots, X_n, Y)$ takes on the following values, depending on the position of Y among the other observations:
$$T_{n+1} = \begin{cases} \tfrac{1}{2}\big(X_{(m)} + X_{(m+1)}\big) & Y \le X_{(m)} \\ \tfrac{1}{2}\big(X_{(m+1)} + X_{(m+2)}\big) & Y \ge X_{(m+2)} \\ \tfrac{1}{2}\big(Y + X_{(m+1)}\big) & X_{(m)} \le Y \le X_{(m+2)} \end{cases}$$
Hence, the influence of adding Y to $X_1, \ldots, X_n$ on the median is measured by
$$I(T_n, Y) = \begin{cases} \tfrac{1}{2}\big(X_{(m)} - X_{(m+1)}\big) & Y \le X_{(m)} \\ \tfrac{1}{2}\big(X_{(m+2)} - X_{(m+1)}\big) & Y \ge X_{(m+2)} \\ \tfrac{1}{2}\big(Y - X_{(m+1)}\big) & X_{(m)} \le Y \le X_{(m+2)} \end{cases}$$
Of the three possible values of $|I(T_n, Y)|$, the middle one, $\tfrac{1}{2}|Y - X_{(m+1)}|$, never exceeds the larger of the other two; thus the sensitivity of the median to an additional observation is equal to
$$S(T_n) = \max\Big\{\tfrac{1}{2}\big(X_{(m+1)} - X_{(m)}\big),\ \tfrac{1}{2}\big(X_{(m+2)} - X_{(m+1)}\big)\Big\}$$
and it is finite under any fixed $X_1, \ldots, X_n$.
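The contrast between parts (a) and (b) of Example 2.2 is easy to see numerically. The following R sketch (our own illustration, not part of the book) plots the discretized influence (n + 1) I(Tn, Y) of the mean and of the median for a fixed simulated sample, as the additional observation Y varies:

## Sensitivity curves (n + 1) * I(Tn, Y) for the mean and the median
set.seed(2)
x <- rnorm(21)                        # fixed sample, n = 21
y <- seq(-10, 10, by = 0.1)           # values of the additional observation Y
sc <- function(stat, x, y)
  sapply(y, function(yy) (length(x) + 1) * (stat(c(x, yy)) - stat(x)))
plot(y, sc(mean, x, y), type = "l", xlab = "Y", ylab = "(n+1) I(Tn, Y)")
lines(y, sc(median, x, y), lty = 2)   # bounded, unlike the mean
legend("topleft", legend = c("mean", "median"), lty = 1:2)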
2.3 Qualitative robustness
As we have seen in Example 2.1, the influence functions of the expectation and
variance are unbounded and can assume arbitrarily large values. Moreover,
Example 2.2 shows that adding one more observation can cause a breakdown
of the sample mean. The least squares estimator (LSE) behaves analogously
(in fact, the mean is a special form of the least squares estimator). Remember
the Kagan, Linnik and Rao theorem, mentioned in Section 1.1, that illustrates
a large sensitivity of the LSE to deviations from the normal distribution of
errors. Intuitively it means that the least squares estimator (and the mean)
are very non-robust.
How can we mathematically express this intuitive non-robustness property,
and how shall we define the concept of robustness? Historically, this con-
cept has been developing over a rather long period, since many statisticians
observed a sensitivity of statistical procedures to deviations from assumed
models, and analyzed it from various points of view.
It is interesting that the physicists and astronomers, who tried to determine
values of various physical, geophysical and astronomic parameters by means
of an average of several measurements, were the first to notice the sensitivity
of the mean and the variance to outlying observations. This interesting part
of the statistical history is nicely described in the book by Stigler (1986). The history goes back to 1757, when R. J. Boskovich, analyzing his experiments aimed at characterizing the shape of the globe, proposed an estimation method alternative to the least squares. E. S. Pearson noticed the sensitivity of the classical analysis of variance procedures to deviations from normality in 1931. J. W. Tukey and his Princeton group began a systematic study of possible alternatives to the least squares in the 1940s. The name
“robust” was first used by Box in 1953. Box and Anderson (1955) character-
ized as robust such a statistical procedure that is little sensitive to changes of
the nuisance or unimportant parameters, while it is sensitive (efficient) to its
parameter of interest.
When we speak about robustness of a statistical procedure, we usually mean
its robustness with respect to deviations from the assumed distribution of
errors. However, other types of robustness are also important, such as robustness with respect to the assumed independence of observations, an assumption that is often violated in practice. The first mathematical definition of robustness was formulated
by Hampel (1968, 1971), who based the concept of robustness of a statistical
functional on its continuity in a neighborhood of the considered probability
distribution. The continuity and neighborhood were considered with respect
to the Prohorov metric on the space P .
Let a random variable (or random vector) X take on values in the sample
space (X ,B); denote P as its probability distribution. We shall try to char-
acterize mathematically the robustness of the functional T (P ) = T (X). This
functional is estimated with the aid of observations X1, . . . , Xn, that are in-
dependent copies of X. More precisely, we estimate T by the empirical func-
tional Tn(Pn) = Tn(X1, . . . , Xn), based on the empirical distribution Pn of
X1, . . . , Xn. Instead of the empirical functional, Tn is often called the (sam-
ple) statistic. Hampel’s definition of the (qualitative) robustness is based on
the Prohorov metric dP on the system P of probability measures on the sample
space.
Definition 2.3 We say that the sequence of statistics (empirical functionals)
{Tn} is qualitatively robust for probability distribution P, if to any ε > 0 there
exists a δ > 0 and a positive integer n0 such that, for all Q ∈ P and n ≥ n0,
$$d_P(P, Q) < \delta \ \Longrightarrow\ d_P\big(\mathcal{L}_P(T_n), \mathcal{L}_Q(T_n)\big) < \varepsilon \tag{2.6}$$
where LP (Tn) and LQ(Tn) denote the probability distributions of Tn under P
and Q, respectively.
This robustness is only qualitative: it only says whether the functional is or is not robust, but it does not numerically measure the level of this characteristic. Because such robustness concerns only the behavior of the functional
in a small neighborhood of P0, it is in fact infinitesimal. We can obviously
replace the Prohorov metric with another suitable metric on space P , e.g.,
the Lévy metric.
However, we do not only want to see whether T is or is not robust. We want to
compare the various functionals with each other and see which is more robust
than the other. To do this, we must characterize the robustness with some
quantitative measure. There are many possible quantifications of the robust-
ness. However, using such quantitative measures, be aware that a replacement
of a complicated concept with just one number can cause a bias and suppress
important information.
2.4 Quantitative characteristics of robustness based on influence
function
Influence function is one of the most important characteristics of the sta-
tistical functional/estimator. The value IF (x; T, P ) measures the effect of a
contamination of functional T by a single value x. Hence, a robust functional
T should have a bounded influence function. However, even the fact that T is
a qualitatively robust functional does not automatically mean that its influ-
ence function IF (x; T, P ) is bounded. As we see later, an example of such a
functional is the R-estimator of the shift parameter, which is an inversion of
the van der Waerden rank test; while it is qualitatively robust, its influence
function is unbounded.
The most popular quantitative characteristics of robustness of functional T,
based on the influence function, are its global and local sensitivities:
a) The global sensitivity of the functional T under distribution P is the
maximum absolute value of the influence function in x under P, i.e.,
γ∗ = sup_{x∈X} |IF(x; T, P)|        (2.7)
b) The local sensitivity of the functional T under distribution P is the value
λ∗ = sup_{x≠y} | ( IF(y; T, P) − IF(x; T, P) ) / ( y − x ) |        (2.8)
that indicates the effect of the replacement of value x by value y on the
functional T.
The following example illustrates the difference between the global and local
sensitivities.
Example 2.3
(a) Mean
T(P) = E_P(X),   IF(x; T, P) = x − E_P X   =⇒   γ∗ = ∞,  λ∗ = 1;
the mean is not robust, but it is not sensitive to the local changes.
(b) Variance

T(P) = var_P X = σ²_P,   IF(x; T, P) = (x − E_P(X))² − σ²_P,   γ∗ = ∞,

λ∗ = sup_{y≠x} | ( (x − E_P(X))² − (y − E_P(X))² ) / ( x − y ) |
   = sup_{y≠x} | ( x² − y² − 2(x − y)E_P X ) / ( x − y ) | = sup_{y≠x} |x + y − 2E_P X| = ∞,

hence the variance is non-robust both to large as well as to small (local)
changes.
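A quick numerical counterpart of this example is the sensitivity curve, the finite-sample analogue of the influence function. The following R sketch (our own illustration, not taken from the text; the sample and the grid of contaminating values are arbitrary choices) contaminates a fixed sample by a single point x and plots the standardized effect on the mean and on the variance:

## Sensitivity curves of the mean and the variance:
## SC_n(x) = (n + 1) * (T(x_1, ..., x_n, x) - T(x_1, ..., x_n))
set.seed(1)
x0 <- rnorm(50)                               # a "clean" sample from N(0,1)
sc <- function(stat, sample, x)
  (length(sample) + 1) * (stat(c(sample, x)) - stat(sample))
xs <- seq(-10, 10, length.out = 401)
sc.mean <- sapply(xs, function(x) sc(mean, x0, x))
sc.var  <- sapply(xs, function(x) sc(var,  x0, x))
## the curve of the mean grows linearly in x (gamma* = Inf, lambda* = 1),
## the curve of the variance grows quadratically (both sensitivities infinite)
matplot(xs, cbind(sc.mean, sc.var), type = "l", lty = 1:2, col = 1:2,
        xlab = "contaminating value x", ylab = "sensitivity curve")
legend("top", legend = c("mean", "variance"), lty = 1:2, col = 1:2)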
2.5 Maximum bias
Assume that the true distribution function F0 lies in some family F . Another
natural measure of robustness of the functional T is its maximal bias (maxbias)
over F ,
b(F) = sup_{F∈F} {|T(F) − T(F0)|}        (2.9)
The family F can have various forms; for example, it can be a neighborhood
of a fixed distribution F0 with respect to some distance described in Section
1.5. In the robustness analysis, F is often the ε-contaminated neighborhood
of a fixed distribution function F0, that has the form
F_{F0,ε} = {F : F = (1 − ε)F0 + εG,  G an unknown distribution function}        (2.10)
The contamination ratio ε is considered known, and so is the central distribution function F0. When estimating the location parameter θ of F(x − θ), where F is an unknown member of F, the central distribution F0 is usually taken to be symmetric around zero and unimodal, while the
contaminating distribution G can run either over symmetric or asymmetric
distribution functions. We then speak about symmetric or asymmetric con-
taminations.
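For simulation experiments it is often convenient to draw samples directly from a contaminated model of the form (2.10). A small R sketch (our own illustration; the choices F0 = N(0, 1) and asymmetric contaminating G = N(10, 1) are arbitrary):

## draw n observations from (1 - eps) * N(0,1) + eps * G, here G = N(10,1)
rcontam <- function(n, eps, rG = function(k) rnorm(k, mean = 10)) {
  bad <- rbinom(n, size = 1, prob = eps)      # indicators of contamination
  ifelse(bad == 1, rG(n), rnorm(n))
}
set.seed(2)
x <- rcontam(1000, eps = 0.1)
c(mean = mean(x), median = median(x))         # the mean is pulled towards G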
Many statistical functionals are monotone with respect to the stochastic or-
dering of distribution functions (or random variables), defined in the following
way:
Random variable X with distribution function F is stochastically smaller than
random variable Y with distribution function G, if
F (x) ≥ G(x) ∀x ∈ R
The monotone statistical functional thus attains its maxbias either at the
stochastically largest member F∞ or at the stochastically smallest member
F−∞ of FF0,ε, hence
b(F_{F0,ε}) = max{ |T(F∞) − T(F0)|, |T(F−∞) − T(F0)| }        (2.11)
The following example well illustrates the role of the maximal bias; it shows
that while the mean is non-robust, the median is universally robust with
respect to the maxbias criterion.
Example 2.4
(i) Mean
T (F ) = EF (X); if F0 is symmetric around zero and so are all contaminating
distributions G, all having finite first moments, then T (F ) is unbiased for all
F ∈ FF0,ε, hence b(FF0,ε) = 0. However, under an asymmetric contamina-
tion, b(FF0,ε) = |E(F∞) − E(F0)| = ∞, where F∞ = (1 − ε)F0 + εδ∞, the
stochastically largest member of FF0,ε.
(ii) Median.
Because the median is nondecreasing with respect to the stochastic ordering of
distributions, its maximum absolute bias over an asymmetric ε-contaminated
neighborhood of a symmetric distribution function F0 is attained either at
F∞ = (1 − ε)F0 + εδ∞ (the stochastically largest distribution of FF0,ε), or at
F−∞ = (1 − ε)F0 + εδ−∞ (the stochastically smallest distribution of FF0,ε).
The median of F∞ is attained at the point x0 satisfying

(1 − ε)F0(x0) = 1/2   =⇒   x0 = F0^{−1}( 1 / (2(1 − ε)) ),

while the median of F−∞ is the point x0^− such that

(1 − ε)F0(x0^−) + ε = 1/2   =⇒   x0^− = F0^{−1}( 1 − 1 / (2(1 − ε)) ) = −x0;

hence the maxbias of the median is equal to x0.
Let T (F ) be any other functional such that its estimate T (Fn) = T (X1, . . . ,
Xn), based on the empirical distribution function Fn, is translation equivari-
ant, i.e., T (X1 + c, . . . , Xn + c) = T (X1, . . . , Xn) + c for any c ∈ R. Then
obviously T (F (· − c)) = T (F (·)) + c. We shall show that the maxbias of T
cannot be smaller than x0.
Consider two contaminations of F0,
F+ = (1 − ε)F0 + εG+,    F− = (1 − ε)F0 + εG−,

where

G+(x) = 0 for x ≤ x0,   G+(x) = (1/ε){1 − (1 − ε)[F0(x) + F0(2x0 − x)]} for x ≥ x0,

and

G−(x) = (1/ε)(1 − ε)[F0(x + 2x0) − F0(x)] for x < −x0,   G−(x) = 1 for x ≥ −x0.
Notice that F−(x − x0) = F+(x + x0), hence T (F−) + x0 = T (F+) − x0 and
T (F+) − T (F−) = 2x0; thus the maxbias of T at F0 cannot be smaller than
x0. It shows that the median has the smallest maxbias among all translation
equivariant functionals.
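For a concrete central distribution the maxbias x0 of the median is easy to evaluate numerically. The following R sketch (ours; F0 is taken to be standard normal purely for illustration) tabulates x0 for several contamination levels ε:

## maxbias of the median under asymmetric eps-contamination of F0 = N(0,1):
## x0 = F0^{-1}( 1 / (2 * (1 - eps)) )
maxbias.median <- function(eps, qF0 = qnorm) qF0(1 / (2 * (1 - eps)))
eps <- c(0.05, 0.10, 0.20, 0.30)
round(rbind(eps = eps, maxbias = maxbias.median(eps)), 3)
## the bias stays finite for every eps < 1/2, whereas the maxbias of the
## mean is infinite under the same asymmetric contamination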
If T(F) is a nonlinear functional, or if it is defined implicitly as a solution of a minimization problem or of a system of equations, then it is difficult to calculate (2.11) precisely. Then we consider the maximum asymptotic bias of T(F) over
a neighborhood F of F0. More precisely, let X1, . . . , Xn be independent iden-
tically distributed observations, distributed according to distribution function
F ∈ F, and let Fn be the empirical distribution function. Assume that, as the number of observations increases, T(Fn) has an asymptotically normal distribution for every F ∈ F in the sense that

P( √n (T(Fn) − T(F)) ≤ x ) → Φ( x / σ(T, F) )   as n → ∞,

with a variance σ²(T, F) depending on T and F. Then the maximal asymptotic
bias (asymptotic maxbias) of T over F is defined as
sup {|T (F ) − T (F0)| : F ∈ F} (2.12)
We shall return to the asymptotic maxbias later in the context of some robust
estimators that are either nonlinear or defined implicitly as a solution of a
minimization or a system of equations.
2.6 Breakdown point
The breakdown point, introduced by Donoho and Huber in 1983, is a very pop-
ular quantitative characteristic of robustness. To describe this characteristic,
start from a random sample x^(0) = (x1, . . . , xn) and consider the corresponding value Tn(x^(0)) of an estimator of functional T. Imagine that in this “initial” sample we can replace any m components by arbitrary values, possibly very unfavorable, even infinite. Denote the new sample after the replacement by x^(m), and let Tn(x^(m)) be the pertaining value of the estimator.
The breakdown point of estimator Tn for sample x^(0) is the number

ε∗_n(Tn, x^(0)) = m∗(x^(0)) / n,

where m∗(x^(0)) is the smallest integer m for which

sup_{x^(m)} ‖Tn(x^(m)) − Tn(x^(0))‖ = ∞,
i.e., the smallest part of the observations that, being replaced with arbitrary
values, can lead Tn up to infinity. Some estimators have a universal breakdown point, for which m∗ is independent of the initial sample x^(0). Then we can also calculate the limit ε∗ = lim_{n→∞} ε∗_n, which is often also called the breakdown point.
We can modify the breakdown point in such a way that, instead of replacing
m components, we extend the sample by some m (unfavorable) values.
Example 2.5
(a) The average X̄n = (1/n) Σ_{i=1}^n Xi:

ε∗_n(X̄n, x^(0)) = 1/n, hence lim_{n→∞} ε∗_n(X̄n, x^(0)) = 0 for any initial sample x^(0).

(b) The median X̃n = X_((n+1)/2) (consider n odd, for simplicity):

ε∗_n(X̃n, x^(0)) = (n + 1)/(2n), thus lim_{n→∞} ε∗_n(X̃n, x^(0)) = 1/2 for any initial sample x^(0).
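These finite-sample breakdown points are easy to verify by brute force in R: replace m observations by an enormous value and watch when the estimate explodes. A small sketch (our own choice of sample, sample size and replacement value):

## replace the m largest observations by 1e10 and recompute both estimates
set.seed(3)
x0 <- rnorm(21)                               # n = 21, odd
replace.m <- function(x, m, value = 1e10) {
  x[order(x, decreasing = TRUE)[seq_len(m)]] <- value
  x
}
m <- 0:21
means   <- sapply(m, function(k) mean(replace.m(x0, k)))
medians <- sapply(m, function(k) median(replace.m(x0, k)))
cbind(m, means, medians)
## the mean breaks down already for m = 1, while the median stays bounded
## until m = (n + 1)/2 = 11 observations are corrupted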
2.7 Tail–behavior measure of a statistical estimator
The tail-behavior measure is surprisingly intuitive, mainly in the context of estimating the shift and regression parameters. We will first illustrate this measure on the
shift parameter, and then return to regression at a suitable place.
Let (X1, . . . , Xn) be a random sample from a population with continuous
distribution function F (x − θ), θ ∈ R. The problem of interest is that of
estimating parameter θ. A reasonable estimator of the shift parameter should
be translation equivariant: Tn is translation equivariant, if
Tn(X1 + c, . . . , Xn + c) = Tn(X1, . . . , Xn) + c
∀ c ∈ R and ∀X1 . . . , Xn
The performance of such an estimator can be characterized by probabilities
Pθ(|Tn − θ| > a)
analyzed either under fixed a > 0 and n → ∞, or under fixed n and a → ∞.
Indeed, if {Tn} is a consistent estimator of θ, then lim_{n→∞} Pθ(|Tn − θ| > a) = 0 under any fixed a > 0. Such a characteristic was studied, e.g., by Bahadur
(1967), Fu (1975, 1980) and Sievers (1978), who suggested the limit
lim_{n→∞} { −(1/n) ln Pθ(|Tn − θ| > a) }   under fixed a > 0
(provided it exists) as a measure of efficiency of estimator Tn, and compared
estimators from this point of view.
On the other hand, a good estimator Tn also verifies the convergence
lim_{a→∞} Pθ(|Tn − θ| > a) = lim_{a→∞} P0(|Tn| > a) = 0        (2.13)
and this convergence should be as fast as possible. The probabilities Pθ(Tn − θ > a)
and Pθ(Tn − θ < −a), for a sufficiently large, are called the right and the
left tails, respectively, of the probability distribution of Tn. If Tn is symmetri-
cally distributed around θ, then both its tails are characterized by probability
(2.13). This probability should rapidly tend to zero. However, the speed of
this convergence cannot be arbitrarily high. We shall show that the rate of
convergence of tails of a translation equivariant estimator is bounded, and
that its upper bound depends on the behavior of 1 − F (a) and F (−a) for
large a > 0.
Let us illustrate this upper bound on a model with symmetric distribution
function satisfying F (−x) = 1 − F (x) ∀x ∈ R. Jurečková (1981) introduced
the following tail-behavior measure of an equivariant estimator Tn :
B(Tn; a) = ( −ln Pθ(|Tn − θ| > a) ) / ( −ln(1 − F(a)) ) = ( −ln P0(|Tn| > a) ) / ( −ln(1 − F(a)) ),   a > 0        (2.14)
The values B(Tn; a) for a ≫ 0 show how many times faster the probability P0(|Tn| > a) tends to 0 than 1 − F(a), as a → ∞. The best estimator Tn is one attaining the largest possible values of B(Tn; a) for a ≫ 0. The lower and upper bounds
for B(Tn; a), thus for the rate of convergence of its tails, are formulated in the
following lemma:
Lemma 2.1 Let X1, . . . , Xn be a random sample from a population with dis-
tribution function F (x − θ), 0 < F (x) < 1, F (−x) = 1 − F (x), x, θ ∈ R. Let
Tn be an equivariant estimator of θ such that, for any fixed n,

min_{1≤i≤n} Xi > 0  =⇒  Tn(X1, . . . , Xn) > 0,
max_{1≤i≤n} Xi < 0  =⇒  Tn(X1, . . . , Xn) < 0.        (2.15)
Then, under any fixed n
1 ≤ lim inf_{a→∞} B(Tn; a) ≤ lim sup_{a→∞} B(Tn; a) ≤ n        (2.16)
Proof. Indeed, if Tn is equivariant, then
P0(|Tn(X1, . . . , Xn)| > a)
  = P0(Tn(X1, . . . , Xn) > a) + P0(Tn(X1, . . . , Xn) < −a)
  = P0(Tn(X1 − a, . . . , Xn − a) > 0) + P0(Tn(X1 + a, . . . , Xn + a) < 0)
  ≥ P0( min_{1≤i≤n} Xi > a ) + P0( max_{1≤i≤n} Xi < −a )
  = (1 − F(a))^n + (F(−a))^n,

hence

−ln P0(|Tn(X1, . . . , Xn)| > a) ≤ −ln 2 − n ln(1 − F(a))
  =⇒  lim sup_{a→∞} ( −ln P0(|Tn| > a) ) / ( −ln(1 − F(a)) ) ≤ n.

Similarly,

P0(|Tn(X1, . . . , Xn)| > a)
  ≤ P0( min_{1≤i≤n} Xi ≤ −a ) + P0( max_{1≤i≤n} Xi ≥ a )
  = 1 − (1 − F(−a))^n + 1 − (F(a))^n = 2{1 − (F(a))^n}
  = 2(1 − F(a))[1 + F(a) + . . . + (F(a))^{n−1}] ≤ 2n(1 − F(a)),

hence

−ln P0(|Tn(X1, . . . , Xn)| > a) ≥ −ln 2 − ln n − ln(1 − F(a))
  =⇒  lim inf_{a→∞} ( −ln P0(|Tn| > a) ) / ( −ln(1 − F(a)) ) ≥ 1.  □
If Tn attains the upper bound in (2.16), then it is obviously optimal for distribution function F, because its tails tend to zero n times faster than 1 − F(a), which is the maximal possible rate. However, we still have the following questions:
• Is the upper bound attainable, and for which Tn and F?
• Is there any estimator Tn attaining high values of B(Tn; a) robustly for a
broad class of distribution functions?
It turns out that the sample mean X̄n can attain both lower and upper bounds
in (2.16); namely, it attains the upper bound under the normal distribu-
tion and under an exponentially tailed distribution, while it attains the lower
bound only for the Cauchy distribution and for the heavy-tailed distributions.
This demonstrates a high non-robustness of X̄n even from the tail behavior
aspect. On the other hand, the sample median X̃n is robust even with respect
to tails: X̃n does not attain the upper bound in (2.16); on the contrary, the limit lim_{a→∞} B(X̃n; a) always lies in the middle of the range between 1 and n for a broad class of distribution functions.
These conclusions are in good concordance with the robustness concepts. The
following theorem gives them a mathematical form.
Theorem 2.1 Let X1, . . . , Xn be a random sample from a population with
distribution function F (x− θ), 0 < F (x) < 1, F (−x) = 1− F (x), x, θ ∈ R.
(i) Let X̄n = (1/n) Σ_{i=1}^n Xi be the sample mean. If F has exponential tails, i.e.,

lim_{a→∞} ( −ln(1 − F(a)) ) / ( b a^r ) = 1   for some b > 0, r ≥ 1,        (2.17)

then

lim_{a→∞} B(X̄n; a) = n.        (2.18)

(ii) If F has heavy tails in the sense that

lim_{a→∞} ( −ln(1 − F(a)) ) / ( m ln a ) = 1   for some m > 0,        (2.19)

then

lim_{a→∞} B(X̄n; a) = 1.        (2.20)

(iii) Let X̃n be the sample median. Then for F satisfying either (2.17) or (2.19),

n/2 ≤ lim_{a→∞} B(X̃n; a) ≤ n/2 + 1   for n even, and        (2.21)

lim_{a→∞} B(X̃n; a) = (n + 1)/2   for n odd.        (2.22)
Remark 2.1 The distribution functions with exponential tails, satisfying (2.17), will briefly be called type I. This class includes the normal distribution (r = 2) and the logistic and Laplace distributions (r = 1). The distribution functions with heavy tails, satisfying (2.19), will be called type II. The Cauchy distribution (m = 1) and the t-distribution with m > 1 degrees of freedom belong here.
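Before turning to the proof, the statement of the theorem can be checked numerically. For the normal and the Cauchy distributions the tail probabilities of the mean and of the median are available in closed form (the mean of a Cauchy sample is again standard Cauchy, and the median of an odd-sized sample has a binomial tail), so the measure (2.14) can be evaluated exactly in R. The sketch below is our own illustration; the sample size n = 5 and the grid of values a are arbitrary choices:

## exact tail-behavior measure B(Tn; a) of (2.14) for n = 5 observations
B.mean.norm <- function(a, n)                  # sqrt(n) * mean ~ N(0,1)
  (log(2) + pnorm(a * sqrt(n), lower.tail = FALSE, log.p = TRUE)) /
  pnorm(a, lower.tail = FALSE, log.p = TRUE)
B.mean.cauchy <- function(a, n)                # mean of a Cauchy sample ~ Cauchy(0,1)
  (log(2) + pcauchy(a, lower.tail = FALSE, log.p = TRUE)) /
  pcauchy(a, lower.tail = FALSE, log.p = TRUE)
B.median <- function(a, n, pdist) {            # P0(median > a) = P(Bin(n, 1 - F(a)) >= (n+1)/2)
  p <- pdist(a, lower.tail = FALSE)
  (log(2) + pbinom((n + 1)/2 - 1, n, p, lower.tail = FALSE, log.p = TRUE)) /
    log(p)
}
n <- 5; a <- c(2, 5, 10, 20)
round(rbind(a             = a,
            norm.mean     = B.mean.norm(a, n),
            norm.median   = B.median(a, n, pnorm),
            cauchy.mean   = B.mean.cauchy(a, n),
            cauchy.median = B.median(a, n, pcauchy)), 2)
## as a grows, the normal-sample mean approaches the upper bound n = 5,
## the Cauchy-sample mean stays near the lower bound 1, and the median
## tends to (n + 1)/2 = 3 under both types of tails (slowly for the Cauchy)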
Proof of Theorem 2.1.
(i) It is sufficient to prove that the exponentially tailed F has a finite expected value

Eε = E0[ exp{ n(1 − ε) b |X̄n|^r } ] < ∞        (2.23)
for arbitrary ε ∈ (0, 1). Indeed, then we conclude from the Markov inequality
that
P0(|X̄n| > a) ≤ Eε · exp{ −n(1 − ε) b a^r }

=⇒  lim inf_{a→∞} ( −ln P0(|X̄n| > a) ) / ( b a^r ) ≥ lim_{a→∞} ( n(1 − ε) b a^r − ln Eε ) / ( b a^r ) = n(1 − ε),

and we arrive at proposition (2.18).
The finiteness of the expectation (2.23) follows from the Hölder inequality:

E0[ exp{ n(1 − ε) b |X̄n|^r } ] ≤ E0[ exp{ (1 − ε) b Σ_{i=1}^n |Xi|^r } ]        (2.24)
  ≤ ( E0[ exp{ (1 − ε) b |X1|^r } ] )^n = 2^n ( ∫_0^∞ exp{ (1 − ε) b x^r } dF(x) )^n.
It follows from (2.17) that, given ε > 0, there exists an Aε > 0 such that
1 − F(a) < exp{ −(1 − ε/2) b a^r }

holds for any a ≥ Aε.
The last integral in (2.24) can be successively rewritten in the following way:

∫_0^∞ exp{(1 − ε)b x^r} dF(x)
  = ∫_0^{Aε} exp{(1 − ε)b x^r} dF(x) − ∫_{Aε}^∞ exp{(1 − ε)b x^r} d(1 − F(x))
  = ∫_0^{Aε} exp{(1 − ε)b x^r} dF(x) + (1 − F(Aε)) · exp{(1 − ε)b Aε^r}
      + ∫_{Aε}^∞ (1 − F(x)) (1 − ε) b r x^{r−1} · exp{(1 − ε)b x^r} dx
  ≤ ∫_0^{Aε} exp{(1 − ε)b x^r} dF(x) + exp{ −(ε/2) b Aε^r }
      + ∫_{Aε}^∞ (1 − ε) b r x^{r−1} · exp{ −(ε/2) b x^r } dx < ∞,

and that leads to proposition (i).
(ii) If F has heavy tails, then
P0(|X̄n| > a) = P0(X̄n > a) + P0(X̄n < −a)
  ≥ P0( X1 > −a, . . . , X_{n−1} > −a, Xn > (2n − 1)a )
      + P0( X1 < a, . . . , X_{n−1} < a, Xn < −(2n − 1)a )
  = 2 (F(a))^{n−1} [1 − F((2n − 1)a)],

hence

lim sup_{a→∞} B(X̄n; a) ≤ lim sup_{a→∞} ( −ln[1 − F((2n − 1)a)] ) / ( m ln a )
  = lim_{a→∞} ( −ln[1 − F((2n − 1)a)] ) / ( m ln((2n − 1)a) ) = 1.
(iii) Let X̃n be the sample median and n be odd. Then X̃n is the middle order statistic of the sample X1, . . . , Xn, i.e., X̃n = X_(m), m = (n + 1)/2, and F(X̃n) = U_(m) has the beta-distribution. Then

P0(|X̃n| > a) = P0(X̃n > a) + P0(X̃n < −a)
  = 2n (n−1 choose m−1) ∫_{F(a)}^1 u^{m−1} (1 − u)^{m−1} du
  ≤ 2n (n−1 choose m−1) (1 − F(a))^m,

and similarly

P0(|X̃n| > a) ≥ 2n (n−1 choose m−1) (F(a))^{m−1} (1 − F(a))^m,

which leads to (2.22) after taking logarithms. Analogously we proceed for n even.  □
2.8 Variance of asymptotic normal distribution
If estimator Tn of functional T (·) is asymptotically normally distributed as
n → ∞,
L_P( √n (Tn − T(P)) ) → N(0, V²(P, T)),
then another possible robustness measure of T is the supremum of the variance V²(P, T),

σ²(T) = sup_{P∈P0} V²(P, T),
over a neighborhood P0 ⊂ P of the assumed model.
The estimator minimizing sup_{P∈P0} V²(P, T) over a specified class T of estimators of parameter θ is called minimaximally robust in the class T. We
shall show in the sequel that the classes of M-estimators, L-estimators and
R-estimators all contain a minimaximally robust estimator of the shift and
regression parameters in a class of contaminated normal distributions.
2.9 Problems and complements
2.1 Show that both the sample mean and the sample median of the random
sample X1, . . . , Xn are nondecreasing in each argument Xi, i = 1, . . . , n.
2.2 Characterize distributions satisfying
lim_{a→∞} ( −ln(1 − F(a + c)) ) / ( −ln(1 − F(a)) ) = 1        (2.25)

for any fixed c ∈ R. Show that this class contains distributions of both type I and type II.
2.3 Let X1, . . . , Xn be a random sample from a population with distri-
bution function F (x − θ), where F is symmetric, absolutely continuous,
0 < F (x) < 1 for x ∈ R, and satisfying (2.25). Let Tn(X1, . . . , Xn) be
a translation equivariant estimator of θ, nondecreasing in each argument
Xi, i = 1, . . . , n. Then Tn has a universal breakdown point m∗_n = m∗_n(Tn) and there exists a constant A such that

X_{n:m∗_n} − A ≤ Tn(X1, . . . , Xn) ≤ X_{n:n−m∗_n+1} + A,

where X_{n:1} ≤ X_{n:2} ≤ . . . ≤ X_{n:n} are the order statistics of the sample X1, . . . , Xn. (Hint: see He et al. (1990).)
2.4 Let Tn(X1, . . . , Xn) be a translation equivariant estimator of θ, nonde-
creasing in each argument Xi, i = 1, . . . , n. Then, under the conditions of
Problem 2.2,
m∗_n ≤ lim inf_{a→∞} B(a, Tn) ≤ lim sup_{a→∞} B(a, Tn) ≤ n − m∗_n + 1        (2.26)
Illustrate it on the sample median. (Hint: see He et al. (1990)).
2.5 Let X1, . . . , Xn be a random sample from a population with distribution function F(x − θ). Compute the breakdown point of

Tn = (1/2)(X_{n:1} + X_{n:n}).
This estimator is called the midrange (see the next chapter).
2.6 Show that the midrange (Problem 2.5) of the random sample X1, . . . , Xn
is nondecreasing in each argument Xi, i = 1, . . . , n. Illustrate (2.26) for this
estimator.
2.7 Determine whether the gamma distribution (Example 1.5) has exponen-
tial or heavy tails.
CHAPTER 3
Robust estimators of real parameter
3.1 Introduction
Let X1, . . . , Xn be a random sample from a population with probability dis-
tribution P; the distribution is generally unknown, and we only assume that its distribution function F belongs to some class F of distribution functions. We look for an appropriate estimator of the parameter θ that can be expressed as a functional T(P) of P. The same parameter θ can be characterized by means of several functionals; e.g., the center of symmetry is simultaneously the expected value, the median and the mode of the distribution, among other possible characterizations. Some functionals T(P) are characterized implicitly as a root of an equation (or of a system of equations) or as a solution of a minimization (maximization) problem: such are the maximum likelihood estimator, the moment estimator, etc. An estimator of parameter θ is obtained as an empirical functional, i.e., by replacing P in the functional T(·) with the empirical distribution corresponding to the vector of observations X1, . . . , Xn.
We shall mainly deal with three broad classes of robust estimators of the
real parameter: M -estimators, L-estimators, and R-estimators. We shall later
extend these classes to other models, mainly to the linear regression model.
3.2 M-estimators
The class of M-estimators was introduced by P. J. Huber (1964); the prop-
erties of M -estimators are studied in his book (Huber (1981)), and also in
the books by Andrews et al. (1972), Antoch et al. (1998), Bunke and Bunke
(1986), Dodge and Jurečková (2000), Hampel et al. (1986), Jurečková and
Sen (1996), Lecoutre and Tassi (1987), Rieder (1994), Rousseeuw and Leroy
(1987), Staudte and Sheather (1990), and others.
An M-estimator Tn is defined as a solution of the minimization problem

Σ_{i=1}^n ρ(Xi, θ) := min   with respect to θ ∈ Θ,
or        (3.1)
E_{Pn}[ρ(X, θ)] := min,   θ ∈ Θ,
where ρ(·, ·) is a properly chosen function. The class of M-estimators also covers the maximum likelihood estimator (MLE) of parameter θ in the parametric model P = {Pθ, θ ∈ Θ}: if f(x, θ) is the density function of Pθ, then the MLE is a solution of the minimization

Σ_{i=1}^n (−log f(Xi, θ)) := min,   θ ∈ Θ.
If ρ in (3.1) is differentiable in θ with a continuous derivative ψ(·, θ) = (∂/∂θ)ρ(·, θ), then Tn is a root (or one of the roots) of the equation

Σ_{i=1}^n ψ(Xi, θ) = 0,   θ ∈ Θ,        (3.2)
hence

(1/n) Σ_{i=1}^n ψ(Xi, Tn) = E_{Pn}[ψ(X, Tn)] = 0,   Tn ∈ Θ.        (3.3)
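As a concrete illustration of (3.2) and (3.3), consider the location model with the Huber function, whose derivative is ψ_k(u) = max(−k, min(k, u)), applied to the standardized residuals. The following R sketch is our own code, not the book's: the scale is fixed at the MAD, and the tuning constant k = 1.345 is an arbitrary but common choice; the estimating equation is solved with uniroot.

## Huber M-estimator of location: solve sum(psi_k((x_i - theta)/s)) = 0 in theta,
## with the scale s fixed at the MAD and tuning constant k
psi.huber <- function(u, k = 1.345) pmax(-k, pmin(k, u))
m.location <- function(x, k = 1.345) {
  s <- mad(x)                                  # fixed robust scale estimate
  score <- function(theta) sum(psi.huber((x - theta) / s, k))
  uniroot(score, interval = range(x))$root     # score is monotone in theta
}
set.seed(5)
x <- c(rnorm(95), rnorm(5, mean = 20))         # 5 percent gross outliers
c(mean = mean(x), huber = m.location(x), median = median(x))
## the M-estimate stays close to the centre of the clean observations;
## compare, e.g., MASS::huber(x), which uses the same psi-function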
We see from (3.1) and (3.3) that the M -functional, the statistical functional
corresponding to Tn, is defined as a solution of the minimization

∫_X ρ(x, T(P)) dP(x) = E_P[ρ(X, T(P))] := min,   T(P) ∈ Θ,        (3.4)
or as a solution of the equation

∫_X ψ(x, T(P)) dP(x) = E_P[ψ(X, T(P))] = 0,   T(P) ∈ Θ.
