Exact Bagging of k-Nearest Neighbor Predictors

Brian Steele and Dave Patterson
Dept. of Mathematical Sciences, University of Montana

Overview
• Definitions
• k-nearest neighbor predictors
• Accuracy estimation
• Bagging and exact bagging
• Comparisons

Definitions
• $Z_n$ denotes a random sample of $n$ observations from a population $P = \{z_1, z_2, \ldots\}$, where $z_i = (x_i, y_i)$
• $x_i$ is a covariate vector observed for every $z_i \in P$
• $y_i$ is observed only for $z_i \in Z_n$

• Objective: predict $y$, or a function of $y$, using a predictor $\eta$ constructed from $Z_n$
• The distance between $z$ and $z_i \in Z_n$ is measured on the covariates
• $\{z_{(1)}, \ldots, z_{(k)}\}$ denotes the $k$ nearest neighbors of $z$ among $Z_n$
• The elementary $k$-nearest neighbor predictor of $y$ is
$$\hat y = k^{-1} \sum_{i=1}^{k} y_{(i)}$$

The classification problem
• $P$ is partitioned into $c$ classes, labeled $1, \ldots, c$
• $y$ is a scalar identifying the class to which $z$ belongs
• $\Psi$ is the indicator function of an event; hence
$$\Psi(y = g) = \begin{cases} 1 & \text{if } z \in G_g \\ 0 & \text{if } z \notin G_g, \end{cases} \qquad g = 1, \ldots, c$$

• Suppose that $z$ is drawn at random from $P$ and that $x$ is observed but not $y$
• $\Psi(y = g)$ is Bernoulli given $x$, and $E[\Psi(y = g) \mid x] = P(y = g \mid x)$
• Let $P_g(x) = P(y = g \mid x)$

• We would prefer to estimate $P_g(x)$ using only training observations whose covariates are identical to $x$
• There may be no such $x_i$, so we use the $k$ nearest neighbors instead
• The $k$-NN estimator of $P_g(x)$ is
$$\hat P_g(x) = k^{-1} \sum_{i=1}^{k} \Psi[y_{(i)} = g]$$
• The $k$-NN predictor of class membership is
$$\eta(x) = \arg\max_g \hat P_g(x)$$

• Supposing that there are no ties, the realization of $\hat P_g(x)$ is one of $0, 1/k, \ldots, 1$
• The $k$-NN classifier is relatively imprecise when $k$ is small
• The cost of choosing a small $k$ is a potential increase in the mean square error of $\hat P_g(x)$

Bootstrap aggregation
• Bagging is a method of reducing the variance of a classifier's predictions
• Bagging is carried out by drawing $B$ bootstrap samples from $Z_n$
• The $b$th sample produces an estimator of $P_g(x)$, denoted $P^{*b}_g(x)$, computed from the $k$ nearest neighbors within the bootstrap sample

• The bagging estimator of $P_g(x)$ is
$$P^{BA}_g(x) = B^{-1} \sum_{b=1}^{B} P^{*b}_g(x)$$
• The bagging classifier is
$$\eta^{BA}(x) = \arg\max_g P^{BA}_g(x) \qquad (1)$$

• $P^{BA}_g(x)$ is a Monte Carlo approximation to the exact bagging estimator $E_{\hat F}\, P^*_g(x)$
• $\hat F$ is the empirical distribution function placing probability mass $n^{-1}$ on each $z_i \in Z_n$
• $P^*_g(x)$ is the $k$-NN estimator of $P_g(x)$ obtained from a random sample drawn from $\hat F$

• The exact bagging estimator $E_{\hat F}\, P^*_g(x)$ is usually computationally intractable
• Hence, bagging uses Monte Carlo simulation to approximate $E_{\hat F}\, P^*_g(x)$
• For the $k$-NN classifier, however, it is possible to compute $E_{\hat F}\, P^*_g(x)$ analytically

A comparison
• Data were provided by the USDA Forest Service, Northern Region
• 3709 observations on a $c = 15$ class population used for land cover mapping via Landsat imagery (10 covariates)

The difference in estimated percent accuracy is plotted against $k$, along with approximate 95% confidence intervals.

[Figure: difference in estimated accuracy (%) versus number of neighbors ($k$), with approximate 95% confidence intervals.]

• This is typical: exact bagging produces small but consistent improvements

A digression
• Accuracy estimation for k-nearest neighbor classifiers requires care
• Cross-validation estimators have small biases but relatively large variances

[Figure: estimated accuracy (%) as a function of neighborhood size $k$ for ordinary $k$-NN and exact bagging $k$-NN, comparing cross-validation, leave-one-out bootstrap, .632 bootstrap, and .632+ bootstrap estimators.]

• The .632 and .632+ bootstrap estimators are optimistically biased for small $k$
• The leave-one-out bootstrap estimator is pessimistically biased
• $n$-fold cross-validation is most appropriate for assessing the conditional accuracy of $\eta$ given $Z_n$
• Bootstrap estimators are most appropriate for assessing the unconditional accuracy of $\eta$
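Before turning to regression, a minimal sketch may help make the classification estimators above concrete. The Python code below assumes Euclidean distance on the covariates; the function names and toy data are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the elementary k-NN class-probability
# estimator and its Monte Carlo bagged version (assumed names,
# Euclidean distance; not the authors' code).
import numpy as np

def knn_class_probs(X, y, x, k, n_classes):
    """P-hat_g(x): fraction of the k nearest neighbors of x in each class g."""
    dist = np.linalg.norm(X - x, axis=1)      # distance measured on the covariates
    nearest = np.argsort(dist)[:k]            # indices of the k nearest neighbors
    return np.bincount(y[nearest], minlength=n_classes) / k

def bagged_class_probs(X, y, x, k, n_classes, B=100, seed=None):
    """P^BA_g(x): average of the k-NN estimator over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    probs = np.zeros(n_classes)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)      # bootstrap sample drawn from F-hat
        probs += knn_class_probs(X[idx], y[idx], x, k, n_classes)
    return probs / B

# Toy usage: two Gaussian classes in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)
x0 = np.array([1.0, 1.0])
p_knn = knn_class_probs(X, y, x0, k=5, n_classes=2)
p_bag = bagged_class_probs(X, y, x0, k=5, n_classes=2, B=200, seed=1)
print(p_knn, p_bag, "classifier picks class", np.argmax(p_bag))  # eta(x) = argmax_g
```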
k-NN regression
• $y$ is a continuous random variable
• Use a linear combination of the $k$ nearest covariate vectors $x_{(1)}, \ldots, x_{(k)}$ to predict $y$
• $N_k(z)$ denotes the set of the $k$ nearest neighbors of $z$ among $Z_n$

• Assume that within $N_k(z)$, $E(y) = x^T \beta(z)$
• $\beta(z)$ is specific to $N_k(z)$
• This model is a compromise between a continuously varying $\beta$ and the single global parameter of ordinary least squares regression

• Let $\psi(z_i) = \Psi[z_i \in N_k(z)]$
• An estimator of $\beta(z)$ can be obtained by minimizing the prediction error
$$S(z) = \sum_{i=1}^{n} \psi(z_i)\,\left[y_i - x_i^T \beta(z)\right]^2$$

• A matrix formulation is
$$S(z) = [y - X\beta(z)]^T \Psi(z)\,[y - X\beta(z)]$$
where
• $y = (y_1, \ldots, y_n)^T$
• $X = (x_{ij})$
• $\Psi(z)$ is a diagonal matrix with $\psi(z_1), \ldots, \psi(z_n)$ on the diagonal

• Differentiating $S(z)$ with respect to $\beta(z)$ yields the normal equations
$$X^T \Psi(z)\, y = X^T \Psi(z)\, X \beta(z)$$
• If $X^T \Psi(z)\, X$ is of full rank, then
$$\hat\beta(z) = [X^T \Psi(z)\, X]^{-1} X^T \Psi(z)\, y$$
and the prediction at $x$ is $\hat y = x^T \hat\beta(z)$

Problems
• Using only $k$ observations implies that the mean square error of $\hat y$ will be large
• Presumably the $\beta(z)$'s vary more smoothly than the $k$-NN model implies, and information about $\beta(z)$ can be extracted from neighbors of $z$ that are close to $z$ but not in the $k$-neighborhood
• Observations near the boundary of the $k$-neighborhood may be less informative than observations near the center of the neighborhood
• $X^T \Psi(z)\, X$ will not be of full rank if $k < p$

Improving on k-nearest neighbor regression
• We propose to use the exact bootstrap aggregation approach to smooth the parameter vectors
• Set the bootstrap expectations of the two sides of the normal equations equal:
$$E_{\hat F}[X^T \Psi(z)\, y] = E_{\hat F}[X^T \Psi(z)\, X \beta(z)] = E_{\hat F}[X^T \Psi(z)\, X]\,\beta(z)$$
• The solution is
$$\beta_s(z) = E_{\hat F}[X^T \Psi(z)\, X]^{-1} E_{\hat F}[X^T \Psi(z)\, y]$$

Computing expectations
• Consider
$$E_{\hat F}[X^T \Psi(z)\, y] = \begin{pmatrix} E_{\hat F}[x_1^T \Psi(z)\, y] \\ \vdots \\ E_{\hat F}[x_p^T \Psi(z)\, y] \end{pmatrix}$$
where $x_j$ is the $j$th column of $X$
• The $j$th element of this vector is
$$E_{\hat F}[x_j^T \Psi(z)\, y] = \sum_{i=1}^{n} x_{ij}\, y_i\, P_{\hat F}[z_i \in N^*_k(z)]$$
where $N^*_k(z)$ is the set of the $k$ nearest neighbors of $z$ among a bootstrap sample drawn from $\hat F$

Another expression
• Suppose that the rows of $X$ and the elements of $y$ have been arranged according to the distance of $x_i$ from $x$, $i = 1, \ldots, n$
• Denote the ordered matrix by $X_o$ and the ordered vector by $y_o$
• The diagonal matrix $A_k$ contains $P_{\hat F}[z_{(i)} \in N^*_k(z)]$ on the diagonal

Then
$$E_{\hat F}[X^T \Psi(z)\, y] = X_o^T A_k y_o, \qquad E_{\hat F}[X^T \Psi(z)\, X] = X_o^T A_k X_o$$
and
$$\beta_s(z) = (X_o^T A_k X_o)^{-1} X_o^T A_k y_o$$

Computing $P_{\hat F}[z_{(i)} \in N^*_k(z)]$
• $P_{\hat F}[z^*_{(r)} = z_{(i)}]$ is the probability that the $i$th nearest neighbor of $z$ in $Z_n$ is the $r$th closest in a bootstrap sample
• Note that
$$P_{\hat F}[z_{(i)} \in N^*_k(z)] = \sum_{r=1}^{k} P_{\hat F}[z^*_{(r)} = z_{(i)}]$$

It can be shown that
$$P_{\hat F}[z^*_{(r)} = z_{(i)}] = F_{i/n}(r, n-r+1) - F_{(i-1)/n}(r, n-r+1)$$
where
$$F_p(r, n-r+1) = \frac{n!}{(r-1)!\,(n-r)!} \int_0^p u^{r-1} (1-u)^{n-r}\, du$$
is the cumulative distribution function of a beta random variable with parameters $r$ and $n-r+1$.
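Since these order-statistic probabilities are differences of beta CDFs, the diagonal of $A_k$ can be computed directly. The sketch below is an assumed Python/scipy implementation of the formula above, not the authors' code.

```python
# Illustrative computation of P_Fhat[z_(i) in N*_k(z)] via the
# Beta(r, n-r+1) CDF, following the formula in the text
# (assumed implementation, not the authors' code).
import numpy as np
from scipy.stats import beta

def neighbor_weights(n, k):
    """Diagonal of A_k: sum over r = 1..k of
    P[z*_(r) = z_(i)] = F_{i/n}(r, n-r+1) - F_{(i-1)/n}(r, n-r+1)."""
    i = np.arange(1, n + 1)                   # distance rank of z_(i) in Z_n
    w = np.zeros(n)
    for r in range(1, k + 1):
        w += beta.cdf(i / n, r, n - r + 1) - beta.cdf((i - 1) / n, r, n - r + 1)
    return w

w = neighbor_weights(n=20, k=5)
print(np.round(w[:8], 3))   # weights decay smoothly with distance rank
print(w.sum())              # sums to k: each of the k positions carries mass 1
```

Every training point receives a positive weight that decays smoothly with distance rank, which is exactly how $A_k$ smooths the hard 0/1 neighborhood indicator $\psi$.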
Some remarks
• Inverting $X_o^T A_k X_o$ is much less of a problem than inverting $X^T \Psi(z)\, X$, because $\operatorname{rank}(X_o^T A_k X_o) = \min[\operatorname{rank}(X_o^T X_o), \operatorname{rank}(A_k)]$
• Both $k$-nearest neighbor regression and exact bagging $k$-nearest neighbor regression are varieties of local linear regression
• Resampling methods are necessary for estimating prediction error and prediction intervals

Some examples
• Time series smoothing: for $z_t \in N_k(z)$, assume the model $E(y_t) = \beta_0(t) + \beta_1(t)\,t$, where $y_t$ is the response variable at time $t$ and $\beta_0(t)$ and $\beta_1(t)$ are intercept and slope parameters at time $t$
• $\hat\beta_1(t)$ is the estimated rate of change as a function of time

[Figure: temperature data for the Northern Hemisphere*, $k = 18$. Upper panel: $k$-NN mean, exact bagging $k$-NN mean, $k$-NN regression, and exact bagging $k$-NN regression fits by year. Lower panel: estimated slopes by year for $k$-NN regression and exact bagging $k$-NN regression.]
*Jones, P.D. (1988) Journal of Climatology, 1, 654-60.

USDA Forest Service Forest Inventory and Analysis (FIA) data from northern Idaho ($n = 973$)
• A monitoring program for forest management with systematically located permanent plot installations
• Objective: predict forest attributes away from plot locations using remotely sensed predictor variables such as Landsat 7 TM imagery and topographic variables
• For comparative purposes, we measure accuracy by
$$1 - \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}$$

[Figure: basal area (entire plot; plots with less than 20 sq. ft. were excluded). Accuracy ($1 - \mathrm{MSE}/\mathrm{Var}$) versus number of neighbors ($k$) under the .632 bootstrap and cross-validation, for $k$-NN, exact bagging $k$-NN, ordinary least squares, $k$-NN regression, and exact bagging $k$-NN regression.]

[Figure: canopy cover (entire plot); same layout as the basal area figure.]

We fit all possible models ($2^p$) and use the leave-one-out bootstrap estimates of prediction error ($B = 200$) to identify a best model.

[Figure: model fit (relative accuracy) plotted against the number of predictors $p$, for $k$-NN, exact bagging $k$-NN, ordinary least squares, $k$-NN regression, and exact bagging $k$-NN regression.]

Remarks
• Exact bagging provides modest and consistent improvements in accuracy
• When $k$ is large, there is little difference between ordinary $k$-nearest neighbor and exact bagging $k$-nearest neighbor predictors
• $k$-nearest neighbor regression is potentially valuable in remote sensing applications
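As a final illustration, the sketch below puts the pieces together: exact bagging $k$-NN regression computed as weighted least squares with the $A_k$ weights in place of the 0/1 neighborhood indicators, via $\beta_s(z) = (X_o^T A_k X_o)^{-1} X_o^T A_k y_o$. This is assumed Python/numpy/scipy code, not the authors' implementation.

```python
# Illustrative sketch of exact bagging k-NN regression
# (assumed code, not the authors').
import numpy as np
from scipy.stats import beta

def neighbor_weights(n, k):
    """Diagonal of A_k (see the beta-CDF formula above)."""
    i = np.arange(1, n + 1)
    return sum(beta.cdf(i / n, r, n - r + 1) - beta.cdf((i - 1) / n, r, n - r + 1)
               for r in range(1, k + 1))

def exact_bagging_knn_predict(X, y, x, k):
    """beta_s(z) = (Xo' A_k Xo)^{-1} Xo' A_k yo; prediction is x' beta_s."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))  # sort rows by distance to x
    Xo, yo = X[order], y[order]
    a = neighbor_weights(len(y), k)                    # smooth weights for all n rows
    XtA = Xo.T * a                                     # equals Xo^T @ diag(a)
    beta_s = np.linalg.solve(XtA @ Xo, XtA @ yo)       # exact bagging normal equations
    return x @ beta_s

# Toy usage: noisy straight line, intercept column included in X.
rng = np.random.default_rng(0)
t = rng.uniform(-2.0, 2.0, 200)
X = np.column_stack([np.ones(200), t])
y = 1.0 + 2.0 * t + rng.normal(0.0, 0.3, 200)
print(exact_bagging_knn_predict(X, y, np.array([1.0, 0.5]), k=25))  # near 2.0
```

Because every row receives a positive weight, $X_o^T A_k X_o$ remains invertible even when $k < p$, in line with the remarks above.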
Some references
• Breiman, L. (1996) Bagging predictors. Machine Learning, 24, 123-140.
• Hutson, A.D., Ernst, M.D. (2000) The exact bootstrap mean and variance of an L-estimator. J. R. Statist. Soc. B, 62, 89-94.
• Steele, B.M., Patterson, D.A. (2000) Ideal bootstrap estimation of expected prediction error for k-nearest neighbor classifiers: Applications for classification and error assessment. Statistics and Computing, 10, 349-355.

Additional references
• Efron, B., Tibshirani, R. (1997) Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association, 92, 548-560.
• Hall, P., Samworth, R.J. (2005) Properties of bagged nearest neighbour classifiers. J. R. Statist. Soc. B, 67, 363-379.
• Hastie, T., Tibshirani, R., Friedman, J. (2001) The Elements of Statistical Learning. Springer, New York.
• Steele, B.M., Patterson, D.A., Redmond, R.L. (2003) Toward estimation of map accuracy without a probability sample. Environmental and Ecological Statistics, 10, 333-356.