Learning From Data

AMLbook.com

The book website contains supporting material for instructors and readers.
In particular, Dynamic e-Chapters covering advanced topics.

Learning From Data
A Short Course

Yaser S. Abu-Mostafa
California Institute of Technology

Malik Magdon-Ismail
Rensselaer Polytechnic Institute

Hsuan-Tien Lin
National Taiwan University

AMLbook.com
Yaser S. Abu-Mostafa
Departments of Electrical Engineering and Computer Science
California Institute of Technology
Pasadena, CA 91125, USA
yaser@caltech.edu

Malik Magdon-Ismail
Department of Computer Science
Rensselaer Polytechnic Institute
Troy, NY 12180, USA
magdon@cs.rpi.edu

Hsuan-Tien Lin
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, 106, Taiwan
htlin@csie.ntu.edu.tw
ISBN 10: 1-60049-006-9
ISBN 13: 978-1-60049-006-4
©2012 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. 1.20

All rights reserved. This work may not be translated or copied in whole or in part
without the written permission of the authors. No part of this publication may be
reproduced, stored in a retrieval system, or transmitted in any form or by any
means—electronic, mechanical, photocopying, scanning, or otherwise—without prior
written permission of the authors, except as permitted under Section 107 or 108 of
the 1976 United States Copyright Act.

Limit of Liability/Disclaimer of Warranty: While the authors have used their best
efforts in preparing this book, they make no representation or warranties with
respect to the accuracy or completeness of the contents of this book and specifically
disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your
situation. You should consult with a professional where appropriate. The authors
shall not be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
The use in this publication of tradenames, trademarks, service marks, and similar
terms, even if they are not identified as such, is not to be taken as an expression of
opinion as to whether or not they are subject to proprietary rights.
This book was typeset by the authors and was printed and bound in the United
States of America.
To our teachers, and to our students
Preface
This book is designed for a short course on machine learning. It is a short
course, not a hurried course. From over a decade of teaching this material, we
have distilled what we believe to be the core topics that every student of the
subject should know. We chose the title 'learning from data' that faithfully
describes what the subject is about, and made it a point to cover the topics in
a story-like fashion. Our hope is that the reader can learn all the fundamentals
of the subject by reading the book cover to cover.
Learning from data has distinct theoretical and practical tracks. If you
read two books that focus on one track or the other, you may feel that you
are reading about two different subjects altogether. In this book, we balance
the theoretical and the practical, the mathematical and the heuristic. Our
criterion for inclusion is relevance. Theory that establishes the conceptual
framework for learning is included, and so are heuristics that impact the
performance of real learning systems. Strengths and weaknesses of the different
parts are spelled out. Our philosophy is to say it like it is: what we know,
what we don't know, and what we partially know.
The book can be taught in exactly the order it is presented. The notable
exception may be Chapter 2, which is the most theoretical chapter of the book.
The theory of generalization that this chapter covers is central to learning
from data, and we made an effort to make it accessible to a wide readership.
However, instructors who are more interested in the practical side may skim
over it, or delay it until after the practical methods of Chapter 3 are taught.
You will notice that we included exercises (in gray boxes) throughout the
text. The main purpose of these exercises is to engage the reader and enhance
understanding of a particular topic being covered. Our reason for separating
the exercises out is that they are not crucial to the logical flow. Nevertheless,
they contain useful information, and we strongly encourage you to read them,
even if you don't do them to completion. Instructors may find some of the
exercises appropriate as 'easy' homework problems, and we also provide
additional problems of varying difficulty in the Problems section at the end of
each chapter.
To help instructors with preparing their lectures based on the book, we
provide supporting material on the book's website (AMLbook.com). There is
also a forum that covers additional topics in learning from data. We will
discuss these further in the Epilogue of this book.
Acknowledgment (in alphabetical order for each group): We would like to
express our gratitude to the alumni of our Learning Systems Group at Caltech
who gave us detailed expert feedback: Zehra Cataltepe, Ling Li, Amrit Pratap,
and Joseph Sill. We thank the many students and colleagues who gave us useful
feedback during the development of this book, especially Chun-Wei Liu. The
Caltech Library staff, especially Kristin Buxton and David McCaslin, have
given us excellent advice and help in our self-publishing effort. We also thank
Lucinda Acosta for her help throughout the writing of this book.

Last, but not least, we would like to thank our families for their encouragement,
their support, and most of all their patience as they endured the time
demands that writing a book has imposed on us.
Yaser S. Abu-Mostafa, Pasadena, California.
Malik Magdon-Ismail, Troy, New York.
Hsuan-Tien Lin, Taipei, Taiwan.
March, 2012.
Contents

Preface                                                            vii

1  The Learning Problem                                              1
   1.1  Problem Setup                                                1
        1.1.1  Components of Learning                                3
        1.1.2  A Simple Learning Model                               5
        1.1.3  Learning versus Design                                9
   1.2  Types of Learning                                           11
        1.2.1  Supervised Learning                                  11
        1.2.2  Reinforcement Learning                               12
        1.2.3  Unsupervised Learning                                13
        1.2.4  Other Views of Learning                              14
   1.3  Is Learning Feasible?                                       15
        1.3.1  Outside the Data Set                                 16
        1.3.2  Probability to the Rescue                            18
        1.3.3  Feasibility of Learning                              24
   1.4  Error and Noise                                             27
        1.4.1  Error Measures                                       28
        1.4.2  Noisy Targets                                        30
   1.5  Problems                                                    33

2  Training versus Testing                                          39
   2.1  Theory of Generalization                                    39
        2.1.1  Effective Number of Hypotheses                       41
        2.1.2  Bounding the Growth Function                         46
        2.1.3  The VC Dimension                                     50
        2.1.4  The VC Generalization Bound                          53
   2.2  Interpreting the Generalization Bound                       55
        2.2.1  Sample Complexity                                    57
        2.2.2  Penalty for Model Complexity                         58
        2.2.3  The Test Set                                         59
        2.2.4  Other Target Types                                   61
   2.3  Approximation-Generalization Tradeoff                       62
        2.3.1  Bias and Variance                                    62
        2.3.2  The Learning Curve                                   66
   2.4  Problems                                                    69

3  The Linear Model                                                 77
   3.1  Linear Classification                                       77
        3.1.1  Non-Separable Data                                   79
   3.2  Linear Regression                                           82
        3.2.1  The Algorithm                                        84
        3.2.2  Generalization Issues                                87
   3.3  Logistic Regression                                         88
        3.3.1  Predicting a Probability                             89
        3.3.2  Gradient Descent                                     93
   3.4  Nonlinear Transformation                                    99
        3.4.1  The Z Space                                          99
        3.4.2  Computation and Generalization                      104
   3.5  Problems                                                   109

4  Overfitting                                                     119
   4.1  When Does Overfitting Occur?                               119
        4.1.1  A Case Study: Overfitting with Polynomials          120
        4.1.2  Catalysts for Overfitting                           123
   4.2  Regularization                                             126
        4.2.1  A Soft Order Constraint                             128
        4.2.2  Weight Decay and Augmented Error                    132
        4.2.3  Choosing a Regularizer: Pill or Poison?             134
   4.3  Validation                                                 137
        4.3.1  The Validation Set                                  138
        4.3.2  Model Selection                                     141
        4.3.3  Cross Validation                                    145
        4.3.4  Theory Versus Practice                              151
   4.4  Problems                                                   154

5  Three Learning Principles                                       167
   5.1  Occam's Razor                                              167
   5.2  Sampling Bias                                              171
   5.3  Data Snooping                                              173
   5.4  Problems                                                   178

Epilogue                                                           181

Further Reading                                                    183

Appendix  Proof of the VC Bound                                    187
   A.1  Relating Generalization Error to In-Sample Deviations      188
   A.2  Bounding Worst Case Deviation Using the Growth Function    190
   A.3  Bounding the Deviation between In-Sample Errors            191

Notation                                                           193

Index                                                              197
Notation

A complete table of the notation used in this book is included on page 193,
right before the index of terms. We suggest referring to it as needed.
Chapter 1
The Learning Problem
If you show a picture to a three-year-old and ask if there is a tree in it, you will
likely get the correct answer. If you ask a thirty-year-old what the definition
of a tree is, you will likely get an inconclusive answer. We didn't learn what
a tree is by studying the mathematical definition of trees. We learned it by
looking at trees. In other words, we learned from 'data'.

Learning from data is used in situations where we don't have an analytic
solution, but we do have data that we can use to construct an empirical
solution. This premise covers a lot of territory, and indeed learning from data is
one of the most widely used techniques in science, engineering, and economics,
among other fields.

In this chapter, we present examples of learning from data and formalize
the learning problem. We also discuss the main concepts associated with
learning, and the different paradigms of learning that have been developed.
1.1 Problem Setup
What do financial forecasting, medical diagnosis, computer vision, and search
engines have in common? They all have successfully utilized learning from
data. The repertoire of such applications is quite impressive. Let us open the
discussion with a real-life application to see how learning from data works.

Consider the problem of predicting how a movie viewer would rate the
various movies out there. This is an important problem if you are a company
that rents out movies, since you want to recommend to different viewers the
movies they will like. Good recommender systems are so important to business
that the movie rental company Netflix offered a prize of one million dollars to
anyone who could improve their recommendations by a mere 10%.

The main difficulty in this problem is that the criteria that viewers use to
rate movies are quite complex. Trying to model those explicitly is no easy task,
so it may not be possible to come up with an analytic solution. However, we
know that the historical rating data reveal a lot about how people rate movies,
so we may be able to construct a good empirical solution. There is a great
deal of data available to movie rental companies, since they often ask their
viewers to rate the movies that they have already seen.

[Figure 1.1: A model for how a viewer rates a movie. Viewer factors and movie
factors are matched, and the contributions from each factor are added to produce
the predicted rating.]
Figure 1.1 illustrates a specific approach that was widely used in the
million-dollar competition. Here is how it works. You describe a movie as
a long array of different factors, e.g., how much comedy is in it, how
complicated is the plot, how handsome is the lead actor, etc. Now, you describe
each viewer with corresponding factors: how much do they like comedy, do
they prefer simple or complicated plots, how important are the looks of the
lead actor, and so on. How this viewer will rate that movie is now estimated
based on the match/mismatch of these factors. For example, if the movie is
pure comedy and the viewer hates comedies, the chances are he won't like it.
If you take dozens of these factors describing many facets of a movie's content
and a viewer's taste, the conclusion based on matching all the factors will be
a good predictor of how the viewer will rate the movie.

The power of learning from data is that this entire process can be automated,
without any need for analyzing movie content or viewer taste. To do
so, the learning algorithm 'reverse-engineers' these factors based solely on previous
ratings. It starts with random factors, then tunes these factors to make
them more and more aligned with how viewers have rated movies before, until
they are ultimately able to predict how viewers rate movies in general. The
factors we end up with may not be as intuitive as 'comedy content', and in
fact can be quite subtle or even incomprehensible. After all, the algorithm is
only trying to find the best way to predict how a viewer would rate a movie,
not necessarily explain to us how it is done. This algorithm was part of the
winning solution in the million-dollar competition.
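As a rough illustration of the factor-matching idea (not the actual method used
in the competition), a predicted rating can be computed by adding up per-factor
contributions, i.e., an inner product between viewer and movie factor vectors.
The factor names and numbers below are hypothetical, chosen only for illustration:

    import numpy as np

    # Hypothetical factors: [comedy content, plot complexity, lead actor's looks]
    viewer = np.array([0.9, 0.2, 0.4])   # how much this viewer cares about each factor
    movie = np.array([0.8, 0.1, 0.6])    # how much of each factor this movie has

    # Add the contributions from each factor to get the predicted rating (arbitrary scale)
    predicted_rating = float(np.dot(viewer, movie))
    print(predicted_rating)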
1.1.1 Components of Learning

The movie rating application captures the essence of learning from data, and
so do many other applications from vastly different fields. In order to abstract
the common core of the learning problem, we will pick one application and
use it as a metaphor for the different components of the problem. Let us take
credit approval as our metaphor.

Suppose that a bank receives thousands of credit card applications every
day, and it wants to automate the process of evaluating them. Just as in the
case of movie ratings, the bank knows of no magical formula that can pinpoint
when credit should be approved, but it has a lot of data. This calls for learning
from data, so the bank uses historical records of previous customers to figure
out a good formula for credit approval.

Each customer record has personal information related to credit, such as
annual salary, years in residence, outstanding loans, etc. The record also keeps
track of whether approving credit for that customer was a good idea, i.e., did
the bank make money on that customer. This data guides the construction of
a successful formula for credit approval that can be used on future applicants.
Let us give names and symbols to the main components of this learning
problem. There is the input x (customer information that is used to make
a credit decision), the unknown target function f: X → Y (ideal formula for
credit approval), where X is the input space (set of all possible inputs x), and Y
is the output space (set of all possible outputs, in this case just a yes/no
decision). There is a data set D of input-output examples (x_1, y_1), · · · , (x_N, y_N),
where y_n = f(x_n) for n = 1, . . . , N (inputs corresponding to previous customers
and the correct credit decision for them in hindsight). The examples are often
referred to as data points. Finally, there is the learning algorithm that uses the
data set D to pick a formula g: X → Y that approximates f. The algorithm
chooses g from a set of candidate formulas under consideration, which we call
the hypothesis set H. For instance, H could be the set of all linear formulas
from which the algorithm would choose the best linear fit to the data, as we
will introduce later in this section.

When a new customer applies for credit, the bank will base its decision
on g (the hypothesis that the learning algorithm produced), not on f (the
ideal target function which remains unknown). The decision will be good only
to the extent that g faithfully replicates f. To achieve that, the algorithm
chooses g that best matches f on the training examples of previous customers,
with the hope that it will continue to match f on new customers. Whether
or not this hope is justified remains to be seen. Figure 1.2 illustrates the
components of the learning problem.

[Figure 1.2: Basic setup of the learning problem. The unknown target function
f: X → Y (ideal credit approval formula) generates the training examples
(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N) (historical records of credit customers);
the learning algorithm searches the hypothesis set H (set of candidate formulas)
and produces the final hypothesis g ≈ f (learned credit approval formula).]
Exercise 1.1

Express each of the following tasks in the framework of learning from data by
specifying the input space X, output space Y, target function f: X → Y,
and the specifics of the data set that we will learn from.

(a) Medical diagnosis: A patient walks in with a medical history and some
    symptoms, and you want to identify the problem.
(b) Handwritten digit recognition (for example postal zip code recognition
    for mail sorting).
(c) Determining if an email is spam or not.
(d) Predicting how an electric load varies with price, temperature, and
    day of the week.
(e) A problem of interest to you for which there is no analytic solution,
    but you have data from which to construct an empirical solution.
We will use the setup in Figure 1.2 as our definition of the learning problem.
Later on, we will consider a number of refinements and variations to this basic
setup as needed. However, the essence of the problem will remain the same.
There is a target to be learned. It is unknown to us. We have a set of examples
generated by the target. The learning algorithm uses these examples to look
for a hypothesis that approximates the target.
1.1.2 A Simple Learning Model

Let us consider the different components of Figure 1.2. Given a specific learning
problem, the target function and training examples are dictated by the
problem. However, the learning algorithm and hypothesis set are not. These
are solution tools that we get to choose. The hypothesis set and learning
algorithm are referred to informally as the learning model.

Here is a simple model. Let X = R^d be the input space, where R^d is the
d-dimensional Euclidean space, and let Y = {+1, -1} be the output space,
denoting a binary (yes/no) decision. In our credit example, different coordinates
of the input vector x ∈ R^d correspond to salary, years in residence,
outstanding debt, and the other data fields in a credit application. The binary
output y corresponds to approving or denying credit. We specify the
hypothesis set H through a functional form that all the hypotheses h ∈ H
share. The functional form h(x) that we choose here gives different weights to
the different coordinates of x, reflecting their relative importance in the credit
decision. The weighted coordinates are then combined to form a 'credit score'
and the result is compared to a threshold value. If the applicant passes the
threshold, credit is approved; if not, credit is denied:
    Approve credit if  Σ_{i=1}^{d} w_i x_i > threshold,
    Deny credit if     Σ_{i=1}^{d} w_i x_i < threshold.

This formula can be written more compactly as

    h(x) = sign( ( Σ_{i=1}^{d} w_i x_i ) + b ),     (1.1)

where x_1, · · · , x_d are the components of the vector x; h(x) = +1 means 'approve
credit' and h(x) = -1 means 'deny credit'; sign(s) = +1 if s > 0 and
sign(s) = -1 if s < 0.¹ The weights are w_1, · · · , w_d, and the threshold is
determined by the bias term b since in Equation (1.1), credit is approved if
Σ_{i=1}^{d} w_i x_i > -b.
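As a minimal sketch, Equation (1.1) can be evaluated in Python as below; the
applicant fields, weights, and bias are hypothetical values chosen only for
illustration, not taken from the book:

    import numpy as np

    def perceptron_hypothesis(x, w, b):
        """Evaluate h(x) = sign(w . x + b): +1 approves credit, -1 denies it."""
        score = np.dot(w, x) + b
        return 1 if score > 0 else -1   # we ignore the sign(0) technicality here

    # Hypothetical applicant: [annual salary in k$, years in residence, outstanding debt in k$]
    x = np.array([65.0, 4.0, 12.0])
    # Hypothetical weights: salary and residence help, outstanding debt hurts
    w = np.array([0.8, 1.5, -2.0])
    b = -30.0   # credit is approved if the weighted sum exceeds the threshold -b = 30

    print(perceptron_hypothesis(x, w, b))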
This model of H is called the perceptron, a name that it got in the context
of artificial intelligence.¹ The learning algorithm will search H by looking for
weights and bias that perform well on the data set. Some of the weights
w_1, · · · , w_d may end up being negative, corresponding to an adverse effect on
credit approval. For instance, the weight of the 'outstanding debt' field should
come out negative since more debt is not good for credit. The bias value b
may end up being large or small, reflecting how lenient or stringent the bank
should be in extending credit. The optimal choices of weights and bias define
the final hypothesis g ∈ H that the algorithm produces.

¹The value of sign(s) when s = 0 is a simple technicality that we ignore for the moment.

[Figure 1.3: Perceptron classification of linearly separable data in a two-dimensional
input space. (a) Some training examples will be misclassified (blue points in red
region and vice versa) for certain values of the weight parameters which define the
separating line. (b) A final hypothesis that classifies all training examples correctly.
(○ is +1 and × is -1.)]
Exercise 1.2

Suppose that we use a perceptron to detect spam messages. Let's say
that each email message is represented by the frequency of occurrence of
keywords, and the output is +1 if the message is considered spam.

(a) Can you think of some keywords that will end up with a large positive
    weight in the perceptron?
(b) How about keywords that will get a negative weight?
(c) What parameter in the perceptron directly affects how many borderline
    messages end up being classified as spam?
Figure 1.3 illustrates what a perceptron does in a two-dimensional case (d = 2).
The plane is split by a line into two regions, the +1 decision region and the -1
decision region. Different values for the parameters w_1, w_2, b correspond to
different lines w_1 x_1 + w_2 x_2 + b = 0. If the data set is linearly separable, there
will be a choice for these parameters that classifies all the training examples
correctly.
To simplify the notation of the perceptron formula, we will treat the bias b
as a weight w_0 = b and merge it with the other weights into one vector
w = [w_0, w_1, · · · , w_d]^T, where ^T denotes the transpose of a vector, so w is a
column vector. We also treat x as a column vector and modify it to become x =
[x_0, x_1, · · · , x_d]^T, where the added coordinate x_0 is fixed at x_0 = 1. Formally
speaking, the input space is now

    X = {1} × R^d = { [x_0, x_1, · · · , x_d]^T | x_0 = 1, x_1 ∈ R, · · · , x_d ∈ R }.

With this convention, w^T x = Σ_{i=0}^{d} w_i x_i, and Equation (1.1) can be rewritten
in vector form as

    h(x) = sign(w^T x).     (1.2)

We now introduce the perceptron learning algorithm (PLA). The algorithm
will determine what w should be, based on the data. Let us assume that the
data set is linearly separable, which means that there is a vector w that
makes (1.2) achieve the correct decision h(x_n) = y_n on all the training examples,
as shown in Figure 1.3.

Our learning algorithm will find this w using a simple iterative method.
Here is how it works. At iteration t, where t = 0, 1, 2, . . ., there is a current
value of the weight vector, call it w(t). The algorithm picks an example
from (x_1, y_1), · · · , (x_N, y_N) that is currently misclassified, call it (x(t), y(t)), and
uses it to update w(t). Since the example is misclassified, we have y(t) ≠
sign(w^T(t) x(t)). The update rule is

    w(t + 1) = w(t) + y(t) x(t).     (1.3)

This rule moves the boundary in the direction of classifying x(t) correctly, as
depicted in the figure above. The algorithm continues with further iterations
until there are no longer misclassified examples in the data set.
Exercise 1.3

The weight update rule in (1.3) has the nice interpretation that it moves
in the direction of classifying x(t) correctly.

(a) Show that y(t) w^T(t) x(t) < 0. [Hint: x(t) is misclassified by w(t).]
(b) Show that y(t) w^T(t + 1) x(t) > y(t) w^T(t) x(t). [Hint: Use (1.3).]
(c) As far as classifying x(t) is concerned, argue that the move from w(t)
    to w(t + 1) is a move 'in the right direction'.
Although the update rule in (1.3) considers only one training example at a
time and may 'mess up' the classification of the other examples that are not
involved in the current iteration, it turns out that the algorithm is guaranteed
to arrive at the right solution in the end. The proof is the subject of Problem
1.3. The result holds regardless of which example we choose from among
the misclassified examples in (x_1, y_1), · · · , (x_N, y_N) at each iteration, and
regardless of how we initialize the weight vector to start the algorithm. For
simplicity, we can pick one of the misclassified examples at random (or cycle
through the examples and always choose the first misclassified one), and we
can initialize w(0) to the zero vector.

Within the infinite space of all weight vectors, the perceptron algorithm
manages to find a weight vector that works, using a simple iterative process.
This illustrates how a learning algorithm can effectively search an infinite
hypothesis set using a finite number of simple steps. This feature is characteristic
of many techniques that are used in learning, some of which are far more
sophisticated than the perceptron learning algorithm.
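To make the iterative process concrete, here is a minimal sketch of the perceptron
learning algorithm in Python, using the augmented-vector convention of
Equation (1.2) and the update rule (1.3). The function name and the choice of
picking a misclassified example at random are our own; the sketch assumes a
linearly separable data set so that the loop eventually terminates:

    import numpy as np

    def pla(X, y, max_iters=10_000, rng=np.random.default_rng(0)):
        """Perceptron learning algorithm (sketch).
        X: N x d array of inputs (without the x_0 coordinate); y: N array of +/-1 labels."""
        X_aug = np.column_stack([np.ones(len(X)), X])   # prepend the fixed coordinate x_0 = 1
        w = np.zeros(X_aug.shape[1])                     # initialize w(0) to the zero vector
        for _ in range(max_iters):
            misclassified = np.where(np.sign(X_aug @ w) != y)[0]
            if len(misclassified) == 0:                  # no misclassified examples remain
                break
            t = rng.choice(misclassified)                # pick a misclassified example at random
            w = w + y[t] * X_aug[t]                      # update rule (1.3): w(t+1) = w(t) + y(t) x(t)
        return w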
Exercise 1.4

Let us create our own target function f and data set D and see how the
perceptron learning algorithm works. Take d = 2 so you can visualize the
problem, and choose a random line in the plane as your target function,
where one side of the line maps to +1 and the other maps to -1. Choose
the inputs x_n of the data set as random points in the plane, and evaluate
the target function on each x_n to get the corresponding output y_n.

Now, generate a data set of size 20. Try the perceptron learning algorithm
on your data set and see how long it takes to converge and how well the
final hypothesis g matches your target f. You can find other ways to play
with this experiment in Problem 1.4.
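One possible way to set up the experiment in Exercise 1.4 is sketched below; it
reuses the pla function from the sketch above, and the sampling region [-1, 1]²
and random seed are arbitrary choices of ours:

    import numpy as np

    rng = np.random.default_rng(1)

    # Target f: a random line through two random points in [-1, 1]^2;
    # one side of the line maps to +1, the other to -1
    p, q = rng.uniform(-1, 1, size=(2, 2))
    a, b, c = q[1] - p[1], p[0] - q[0], q[0] * p[1] - p[0] * q[1]
    f = lambda X: np.sign(X @ np.array([a, b]) + c)

    X = rng.uniform(-1, 1, size=(20, 2))   # 20 random input points x_n
    y = f(X)                                # corresponding outputs y_n

    w = pla(X, y)                           # run the PLA sketch from above
    g = lambda X: np.sign(np.column_stack([np.ones(len(X)), X]) @ w)
    print(np.mean(g(X) == y))               # should print 1.0: all training points classified correctly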
The perceptron learning algorithm succeeds in achieving its goal: finding a
hypothesis that classifies all the points in the data set D = {(x_1, y_1), · · · , (x_N, y_N)}
correctly. Does this mean that this hypothesis will also be successful in
classifying new data points that are not in D? This turns out to be the key question
in the theory of learning, a question that will be thoroughly examined in this
book.
[Figure 1.4: The learning approach to coin classification. (a) Training data of
pennies, nickels, dimes, and quarters (1, 5, 10, and 25 cents) are represented
in a size-mass space where they fall into clusters. (b) A classification rule is
learned from the data set by separating the four clusters. A new coin will
be classified according to the region in the size-mass plane that it falls into.]
1.1.3 Learning versus Design

So far, we have discussed what learning is. Now, we discuss what it is not. The
goal is to distinguish between learning and a related approach that is used for
similar problems. While learning is based on data, this other approach does
not use data. It is a 'design' approach based on specifications, and is often
discussed alongside the learning approach in pattern recognition literature.

Consider the problem of recognizing coins of different denominations, which
is relevant to vending machines, for example. We want the machine to recognize
quarters, dimes, nickels and pennies. We will contrast the 'learning from
data' approach and the 'design from specifications' approach for this problem.
We assume that each coin will be represented by its size and mass, a
two-dimensional input.

In the learning approach, we are given a sample of coins from each of
the four denominations and we use these coins as our data set. We treat
the size and mass as the input vector, and the denomination as the output.
Figure 1.4(a) shows what the data set may look like in the input space. There
is some variation of size and mass within each class, but by and large coins
of the same denomination cluster together. The learning algorithm searches
for a hypothesis that classifies the data set well. If we want to classify a new
coin, the machine measures its size and mass, and then classifies it according
to the learned hypothesis in Figure 1.4(b).
In the design approach, we call the United States Mint and ask them about
the specifications of different coins. We also ask them about the number
of coins of each denomination in circulation, in order to get an estimate of
the relative frequency of each coin. Finally, we make a physical model of
the variations in size and mass due to exposure to the elements and due to
errors in measurement. We put all of this information together and compute
the full joint probability distribution of size, mass, and coin denomination
(Figure 1.5(a)). Once we have that joint distribution, we can construct the
optimal decision rule to classify coins based on size and mass (Figure 1.5(b)).
The rule chooses the denomination that has the highest probability for a given
size and mass, thus achieving the smallest possible probability of error.²

[Figure 1.5: The design approach to coin classification. (a) A probabilistic
model for the size, mass, and denomination of coins is derived from known
specifications. The figure shows the high probability region for each denomination
(1, 5, 10, and 25 cents) according to the model. (b) A classification
rule is derived analytically to minimize the probability of error in classifying
a coin based on size and mass. The resulting regions for each denomination
are shown.]
The main difference between the learning approach and the design approach
is the role that data plays. In the design approach, the problem is well
specified and one can analytically derive f without the need to see any data.
In the learning approach, the problem is much less specified, and one needs
data to pin down what f is.

Both approaches may be viable in some applications, but only the learning
approach is possible in many applications where the target function is
unknown. We are not trying to compare the utility or the performance of the
two approaches. We are just making the point that the design approach is
distinct from learning. This book is about learning.

²This is called Bayes optimal decision theory. Some learning models are based on the
same theory by estimating the probability from data.
Exercise 1.5

Which of the following problems are more suited for the learning approach
and which are more suited for the design approach?

(a) Determining the age at which a particular medical test should be
    performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection
1.2 Types of Learning

The basic premise of learning from data is the use of a set of observations to
uncover an underlying process. It is a very broad premise, and difficult to fit
into a single framework. As a result, different learning paradigms have arisen
to deal with different situations and different assumptions. In this section, we
introduce some of these paradigms.

The learning paradigm that we have discussed so far is called supervised
learning. It is the most studied and most utilized type of learning, but it is
not the only one. Some variations of supervised learning are simple enough
to be accommodated within the same framework. Other variations are more
profound and lead to new concepts and techniques that take on lives of their
own. The most important variations have to do with the nature of the data
set.
1.2.1 Supervised Learning

When the training data contains explicit examples of what the correct output
should be for given inputs, then we are within the supervised learning setting
that we have covered so far. Consider the hand-written digit recognition
problem (task (b) of Exercise 1.1). A reasonable data set for this problem is
a collection of images of hand-written digits, and for each image, what the
digit actually is. We thus have a set of examples of the form ( image , digit ).
The learning is supervised in the sense that some 'supervisor' has taken the
trouble to look at each input, in this case an image, and determine the correct
output, in this case one of the ten categories {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

While we are on the subject of variations, there is more than one way that
a data set can be presented to the learning process. Data sets are typically
created and presented to us in their entirety at the outset of the learning process.
For instance, historical records of customers in the credit-card application,
and previous movie ratings of customers in the movie rating application, are
already there for us to use. This protocol of a 'ready' data set is the most
common in practice, and it is what we will focus on in this book. However, it
is worth noting that two variations of this protocol have attracted a significant
body of work.

One is active learning, where the data set is acquired through queries that
we make. Thus, we get to choose a point x in the input space, and the
supervisor reports to us the target value for x. As you can see, this opens
the possibility for strategic choice of the point x to maximize its information
value, similar to asking a strategic question in a game of 20 questions.

Another variation is called online learning, where the data set is given to
the algorithm one example at a time. This happens when we have streaming
data that the algorithm has to process 'on the run'. For instance, when
the movie recommendation system discussed in Section 1.1 is deployed, online
learning can process new ratings from current users and movies. Online
learning is also useful when we have limitations on computing and storage
that preclude us from processing the whole data as a batch. We should note
that online learning can be used in different paradigms of learning, not just in
supervised learning.
1.2.2 Reinforcement Learning

When the training data does not explicitly contain the correct output for each
input, we are no longer in a supervised learning setting. Consider a toddler
learning not to touch a hot cup of tea. The experience of such a toddler
would typically comprise a set of occasions when the toddler confronted a hot
cup of tea and was faced with the decision of touching it or not touching it.
Presumably, every time she touched it, the result was a high level of pain, and
every time she didn't touch it, a much lower level of pain resulted (that of an
unsatisfied curiosity). Eventually, the toddler learns that she is better off not
touching the hot cup.

The training examples did not spell out what the toddler should have done,
but they instead graded different actions that she has taken. Nevertheless, she
uses the examples to reinforce the better actions, eventually learning what she
should do in similar situations. This characterizes reinforcement learning,
where the training example does not contain the target output, but instead
contains some possible output together with a measure of how good that
output is. In contrast to supervised learning where the training examples were of
the form ( input , correct output ), the examples in reinforcement learning are
of the form

    ( input , some output , grade for this output ).

Importantly, the example does not say how good other outputs would have
been for this particular input.
Reinforcement learning is especially useful for learning how to play a game.
Imagine a situation in backgammon where you have a choice between different
actions and you want to identify the best action. It is not a trivial task to
ascertain what the best action is at a given stage of the game, so we cannot
easily create supervised learning examples. If you use reinforcement learning
instead, all you need to do is to take some action and report how well things
went, and you have a training example. The reinforcement learning algorithm
is left with the task of sorting out the information coming from different
examples to find the best line of play.
1.2.3 Unsupervised Learning

In the unsupervised setting, the training data does not contain any output
information at all. We are just given input examples x_1, · · · , x_N. You may
wonder how we could possibly learn anything from mere inputs. Consider the
coin classification problem that we discussed earlier in Figure 1.4. Suppose
that we didn't know the denomination of any of the coins in the data set. This
unlabeled data is shown in Figure 1.6(a). We still get similar clusters, but they
are now unlabeled so all points have the same 'color'. The decision regions
in unsupervised learning may be identical to those in supervised learning, but
without the labels (Figure 1.6(b)). However, the correct clustering is less
obvious now, and even the number of clusters may be ambiguous.

[Figure 1.6: Unsupervised learning of coin classification. (a) The same data
set of coins in Figure 1.4(a) is again represented in the size-mass space, but
without being labeled. They still fall into clusters. (b) An unsupervised
classification rule treats the four clusters as different types. The rule may
be somewhat ambiguous, as type 1 and type 2 could be viewed as one cluster.]

Nonetheless, this example shows that we can learn something from the
inputs by themselves. Unsupervised learning can be viewed as the task of
spontaneously finding patterns and structure in input data. For instance, if
our task is to categorize a set of books into topics, and we only use general
properties of the various books, we can identify books that have similar
properties and put them together in one category, without naming that category.
Unsupervised learning can also be viewed as a way to create a higher-level
representation of the data. Imagine that you don't speak a word of
Spanish, but your company will relocate you to Spain next month. They
will arrange for Spanish lessons once you are there, but you would like to
prepare yourself a bit before you go. All you have access to is a Spanish radio
station. For a full month, you continuously bombard yourself with Spanish;
this is an unsupervised learning experience since you don't know the meaning
of the words. However, you gradually develop a better representation of the
language in your brain by becoming more tuned to its common sounds and
structures. When you arrive in Spain, you will be in a better position to start
your Spanish lessons. Indeed, unsupervised learning can be a precursor to
supervised learning. In other cases, it is a stand-alone technique.
Exercise 1.6

For each of the following tasks, identify which type of learning is involved
(supervised, reinforcement, or unsupervised) and the training data to be
used. If a task can fit more than one type, explain how and describe the
training data for each type.

(a) Recommending a book to a user in an online bookstore
(b) Playing tic-tac-toe
(c) Categorizing movies into different types
(d) Learning to play music
(e) Credit limit: Deciding the maximum allowed debt for each bank customer
Our main focus in this book will be supervised learning, which is the most
popular form of learning from data.
1.2.4 Other Views of Learning

The study of learning has evolved somewhat independently in a number of
fields that started historically at different times and in different domains, and
these fields have developed different emphases and even different jargons. As a
result, learning from data is a diverse subject with many aliases in the scientific
literature. The main field dedicated to the subject is called machine learning,
a name that distinguishes it from human learning. We briefly mention two
other important fields that approach learning from data in their own ways.

Statistics shares the basic premise of learning from data, namely the use
of a set of observations to uncover an underlying process. In this case, the
process is a probability distribution and the observations are samples from that
distribution. Because statistics is a mathematical field, emphasis is given to
situations where most of the questions can be answered with rigorous proofs.
As a result, statistics focuses on somewhat idealized models and analyzes them
in great detail. This is the main difference between the statistical approach
to learning and how we approach the subject here. We make less restrictive
assumptions and deal with more general models than in statistics. Therefore,
we end up with weaker results that are nonetheless broadly applicable.
Data mining is a practical field that focuses on finding patterns, correlations,
or anomalies in large relational databases. For example, we could be
looking at medical records of patients and trying to detect a cause-effect
relationship between a particular drug and long-term effects. We could also be
looking at credit card spending patterns and trying to detect potential fraud.
Technically, data mining is the same as learning from data, with more emphasis
on data analysis than on prediction. Because databases are usually huge,
computational issues are often critical in data mining. Recommender systems,
which were illustrated in Section 1.1 with the movie rating example, are also
considered part of data mining.
1.3 Is Learning Feasible?
The target function f is the object of learning. The most important assertion
about the target function is that it is unknown. We really mean unknown.
This raises a natural question. How could a limited data set reveal enough
information to pin down the entire target function? Figure 1.7 illustrates this
difficulty. A simple learning task with 6 training examples of a ±1 target
function is shown. Try to learn what the function is, then apply it to the test
input given. Do you get -1 or +1? Now, show the problem to your friends
and see if they get the same answer.

[Figure 1.7: A visual learning problem. The first two rows show the training
examples (each input x is a 9-bit vector represented visually as a 3 × 3 black-
and-white array). The inputs in the first row have f(x) = -1, and the inputs
in the second row have f(x) = +1. Your task is to learn from this data set
what f is, then apply f to the test input at the bottom. Do you get -1
or +1?]
T he chances are the answers were no t unanim ous, and for good reason.
T here is sim ply m ore th a n one function th a t fits the 6 tra in ing examples, and
some of these functions have a value of — 1 on the te s t po int and o thers have a
value of + 1 . For instance, if th e tru e / is + 1 when th e p a tte rn is sym m etric,
the value for the te s t po int would be +1 . If th e tru e / is + 1 when the top left
square of the p a tte rn is w hite, the value for the te s t po int would be —1. B oth
functions agree w ith all th e exam ples in the d a ta set, so there isn’t enough
inform ation to tell us which would be th e correct answer.
T his does no t bode well for th e feasibility of learning. To m ake m atters
worse, we will now see th a t the difficulty we experienced in th is simple problem
is the rule, no t the exception.
1.3.1 Outside the Data Set

When we get the training data D, e.g., the first two rows of Figure 1.7, we
know the value of f on all the points in D. This doesn't mean that we have
learned f, since it doesn't guarantee that we know anything about f outside
of D. We know what we have already seen, but that's not learning. That's
memorizing.

Does the data set D tell us anything outside of D that we didn't know
before? If the answer is yes, then we have learned something. If the answer is
no, we can conclude that learning is not feasible.

Since we maintain that f is an unknown function, we can prove that f
remains unknown outside of D. Instead of going through a formal proof for
the general case, we will illustrate the idea in a concrete case. Consider a
Boolean target function over a three-dimensional input space X = {0,1}³.
We are given a data set D of five examples represented in the table below. We
denote the binary output by ○/● for visual clarity.

        x_n        y_n
      0 0 0         ○
      0 0 1         ●
      0 1 0         ●
      0 1 1         ○
      1 0 0         ●

where y_n = f(x_n) for n = 1, 2, 3, 4, 5. The advantage of this simple Boolean
case is that we can enumerate the entire input space (since there are only 2³ = 8
distinct input vectors), and we can enumerate the set of all possible target
functions (since f is a Boolean function on 3 Boolean inputs, and there are
only 2⁸ = 256 distinct Boolean functions on 3 Boolean inputs).
Let us look at the problem of learning f. Since f is unknown except
inside D, any function that agrees with D could conceivably be f. The table
below shows all such functions f_1, · · · , f_8. It also shows the data set D (in
blue) and what the final hypothesis g may look like.

[Table: for each of the eight points x ∈ {0,1}³, the table lists the output y_n
given by D on the five training points, a possible final hypothesis g, and the
values of the candidate target functions f_1, · · · , f_8. All of f_1, · · · , f_8 agree
with D on the five training points; on the three remaining points (101, 110,
and 111), marked by red question marks for g, they take all eight possible
combinations of ○/● values.]
The final hypothesis g is chosen based on the five examples in D. The table
shows the case where g is chosen to match f on these examples.

If we remain true to the notion of unknown target, we cannot exclude any
of f_1, · · · , f_8 from being the true f. Now, we have a dilemma. The whole
purpose of learning f is to be able to predict the value of f on points that we
haven't seen before. The quality of the learning will be determined by how
close our prediction is to the true value. Regardless of what g predicts on
the three points we haven't seen before (those outside of D, denoted by red
question marks), it can agree or disagree with the target, depending on which
of f_1, · · · , f_8 turns out to be the true target. It is easy to verify that any 3
bits that replace the red question marks are as good as any other 3 bits.
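This is easy to check by brute force. The short sketch below (our own
illustration, not from the book) enumerates all 256 Boolean functions on three
bits, keeps the 8 that agree with the data set D above, and prints their values
on the three unseen points; every 3-bit pattern shows up exactly once:

    from itertools import product

    # The five training points of D and their outputs (using 1 for ● and 0 for ○)
    D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}
    points = list(product([0, 1], repeat=3))          # all 8 points of X = {0,1}^3
    unseen = [x for x in points if x not in D]        # the 3 points outside D

    # Each of the 2^8 = 256 target functions is a tuple of 8 output bits, one per point
    consistent = [f for f in product([0, 1], repeat=8)
                  if all(f[points.index(x)] == y for x, y in D.items())]

    print(len(consistent))                             # 8 functions agree with D
    for f in consistent:                               # their values on the 3 unseen points:
        print([f[points.index(x)] for x in unseen])    # every 3-bit pattern appears exactly once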
Exercise 1.7

For each of the following learning scenarios in the above problem, evaluate
the performance of g on the three points in X outside D. To measure the
performance, compute how many of the 8 possible target functions agree
with g on all three points, on two of them, on one of them, and on none
of them.

(a) H has only two hypotheses, one that always returns '●' and one that
    always returns '○'. The learning algorithm picks the hypothesis that
    matches the data set the most.
(b) The same H, but the learning algorithm now picks the hypothesis
    that matches the data set the least.
(c) H = {XOR} (only one hypothesis which is always picked), where
    XOR is defined by XOR(x) = ● if the number of 1's in x is odd and
    XOR(x) = ○ if the number is even.
(d) H contains all possible hypotheses (all Boolean functions on three
    variables), and the learning algorithm picks the hypothesis that agrees
    with all training examples, but otherwise disagrees the most with the
    XOR.
[Figure 1.8: A random sample is picked from a bin of red and green marbles.
ν = fraction of red marbles in the sample; μ = probability of red marbles in
the bin. The probability μ of red marbles in the bin is unknown. What does
the fraction ν of red marbles in the sample tell us about μ?]
It doesn't matter what the algorithm does or what hypothesis set H is used.
Whether H has a hypothesis that perfectly agrees with D (as depicted in the
table) or not, and whether the learning algorithm picks that hypothesis or
picks another one that disagrees with D (different green bits), it makes no
difference whatsoever as far as the performance outside of D is concerned. Yet
the performance outside D is all that matters in learning!

This dilemma is not restricted to Boolean functions, but extends to the
general learning problem. As long as f is an unknown function, knowing D
cannot exclude any pattern of values for f outside of D. Therefore, the
predictions of g outside of D are meaningless.

Does this mean that learning from data is doomed? If so, this will be a
very short book. Fortunately, learning is alive and well, and we will see
why. We won't have to change our basic assumption to do that. The target
function will continue to be unknown, and we still mean unknown.
1.3.2 Probability to the Rescue

We will show that we can indeed infer something outside D using only D, but
in a probabilistic way. What we infer may not be much compared to learning
a full target function, but it will establish the principle that we can reach
outside D. Once we establish that, we will take it to the general learning
problem and pin down what we can and cannot learn.

Let's take the simplest case of picking a sample, and see when we can say
something about the objects outside the sample. Consider a bin that contains
red and green marbles, possibly infinitely many. The proportion of red and
green marbles in the bin is such that if we pick a marble at random, the
probability that it will be red is μ and the probability that it will be green
is 1 - μ. We assume that the value of μ is unknown to us.
We pick a random sample of N independent marbles (with replacement)
from this bin, and observe the fraction ν of red marbles within the sample
(Figure 1.8). What does the value of ν tell us about the value of μ?

One answer is that regardless of the colors of the N marbles that we picked,
we still don't know the color of any marble that we didn't pick. We can get
mostly green marbles in the sample while the bin has mostly red marbles.
Although this is certainly possible, it is by no means probable.

Exercise 1.8

If μ = 0.9, what is the probability that a sample of 10 marbles will have
ν ≤ 0.1? [Hints: 1. Use binomial distribution. 2. The answer is a very
small number.]
The situation is similar to taking a poll. A random sample from a population
tends to agree with the views of the population at large. The probability
distribution of the random variable ν in terms of the parameter μ is well
understood, and when the sample size is big, ν tends to be close to μ.

To quantify the relationship between ν and μ, we use a simple bound called
the Hoeffding Inequality. It states that for any sample size N,

    P[ |ν - μ| > ε ] ≤ 2e^{-2ε²N}    for any ε > 0.     (1.4)

Here, P[·] denotes the probability of an event, in this case with respect to
the random sample we pick, and ε is any positive value we choose. Putting
Inequality (1.4) in words, it says that as the sample size N grows, it becomes
exponentially unlikely that ν will deviate from μ by more than our 'tolerance' ε.

The only quantity that is random in (1.4) is ν, which depends on the random
sample. By contrast, μ is not random. It is just a constant, albeit unknown to
us. There is a subtle point here. The utility of (1.4) is to infer the value of μ
using the value of ν, although it is μ that affects ν, not vice versa. However,
since the effect is that ν tends to be close to μ, we infer that μ 'tends' to be
close to ν.

Although P[ |ν - μ| > ε ] depends on μ, as μ appears in the argument and
also affects the distribution of ν, we are able to bound the probability by 2e^{-2ε²N},
which does not depend on μ. Notice that only the size N of the sample affects
the bound, not the size of the bin. The bin can be large or small, finite or
infinite, and we still get the same bound when we use the same sample size.
Exercise 1.9
If µ = 0.9, use the Hoeffding Inequality to bound the probability that a sample of 10 marbles will have ν ≤ 0.1 and compare the answer to the previous exercise.
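As a companion to the sketch after Exercise 1.8, the following lines evaluate the Hoeffding bound for the same setting; the choice ε = 0.8 is our own reading of the event ν ≤ 0.1 when µ = 0.9, so treat it as an illustration rather than the official solution.

    from math import exp

    N, eps = 10, 0.8                  # with mu = 0.9, the event nu <= 0.1 forces |nu - mu| >= 0.8
    print(2 * exp(-2 * eps**2 * N))   # Hoeffding bound; compare with the exact probability above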
If we choose ε to be very small in order to make ν a good approximation of µ, we need a larger sample size N to make the RHS of Inequality (1.4) small. We can then assert that it is likely that ν will indeed be a good approximation of µ. Although this assertion does not give us the exact value of µ, and doesn't even guarantee that the approximate value holds, knowing that we are within ±ε of µ most of the time is a significant improvement over not knowing anything at all.

The fact that the sample was randomly selected from the bin is the reason we are able to make any kind of statement about µ being close to ν. If the sample was not randomly selected but picked in a particular way, we would lose the benefit of the probabilistic analysis and we would again be in the dark outside of the sample.
How does the bin model relate to the learning problem? It seems that the unknown here was just the value of µ while the unknown in learning is an entire function f: X → Y. The two situations can be connected. Take any single hypothesis h ∈ H and compare it to f on each point x ∈ X. If h(x) = f(x), color the point x green. If h(x) ≠ f(x), color the point x red. The color that each point gets is not known to us, since f is unknown. However, if we pick x at random according to some probability distribution P over the input space X, we know that x will be red with some probability, call it µ, and green with probability 1 − µ. Regardless of the value of µ, the space X now behaves like the bin in Figure 1.8.

The training examples play the role of a sample from the bin. If the inputs x_1, ..., x_N in D are picked independently according to P, we will get a random sample of red (h(x_n) ≠ f(x_n)) and green (h(x_n) = f(x_n)) points. Each point will be red with probability µ and green with probability 1 − µ. The color of each point will be known to us since both h(x_n) and f(x_n) are known for n = 1, ..., N (the function h is our hypothesis so we can evaluate it on any point, and f(x_n) = y_n is given to us for all points in the data set D). The learning problem is now reduced to a bin problem, under the assumption that the inputs in D are picked independently according to some distribution P on X. Any P will translate to some µ in the equivalent bin. Since µ is allowed to be unknown, P can be unknown to us as well. Figure 1.9 adds this probabilistic component to the basic learning setup depicted in Figure 1.2.
With this equivalence, the Hoeffding Inequality can be applied to the learning problem, allowing us to make a prediction outside of D. Using ν to predict µ tells us something about f, although it doesn't tell us what f is. What µ tells us is the error rate h makes in approximating f. If ν happens to be close to zero, we can predict that h will approximate f well over the entire input space. If not, we are out of luck.

Unfortunately, we have no control over ν in our current situation, since ν is based on a particular hypothesis h. In real learning, we explore an entire hypothesis set H, looking for some h ∈ H that has a small error rate. If we have only one hypothesis to begin with, we are not really learning, but rather 'verifying' whether that particular hypothesis is good or bad. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning.
Figure 1.9: Probability added to the basic learning setup
To do that, we start by introducing more descriptive names for the different concepts that we will use. The error rate within the sample, which corresponds to ν in the bin model, will be called the in-sample error,

    E_in(h) = (fraction of D where f and h disagree)
            = (1/N) Σ_{n=1}^{N} [[h(x_n) ≠ f(x_n)]],

where [[statement]] = 1 if the statement is true, and = 0 if the statement is false. We have made explicit the dependency of E_in on the particular h that we are considering. In the same way, we define the out-of-sample error

    E_out(h) = P[h(x) ≠ f(x)],

which corresponds to µ in the bin model. The probability is based on the distribution P over X which is used to sample the data points x.
Figure 1.10: Multiple bins depict the learning problem with M hypotheses
Substituting the new notation E_in(h) for ν and E_out(h) for µ, the Hoeffding Inequality (1.4) can be rewritten as

    P[|E_in(h) − E_out(h)| > ε] ≤ 2e^{−2ε²N}    for any ε > 0,    (1.5)

where N is the number of training examples. The in-sample error E_in, just like ν, is a random variable that depends on the sample. The out-of-sample error E_out, just like µ, is unknown but not random.
Let us consider an entire hypothesis set H instead of just one hypothesis h, and assume for the moment that H has a finite number of hypotheses

    H = {h_1, h_2, ..., h_M}.

We can construct a bin equivalent in this case by having M bins as shown in Figure 1.10. Each bin still represents the input space X, with the red marbles in the mth bin corresponding to the points x ∈ X where h_m(x) ≠ f(x). The probability of red marbles in the mth bin is E_out(h_m) and the fraction of red marbles in the mth sample is E_in(h_m), for m = 1, ..., M. Although the Hoeffding Inequality (1.5) still applies to each bin individually, the situation becomes more complicated when we consider all the bins simultaneously. Why is that? The inequality stated that

    P[|E_in(h) − E_out(h)| > ε] ≤ 2e^{−2ε²N}    for any ε > 0,

where the hypothesis h is fixed before you generate the data set, and the probability is with respect to random data sets D; we emphasize that the assumption "h is fixed before you generate the data set" is critical to the validity of this bound. If you are allowed to change h after you generate the data set, the assumptions that are needed to prove the Hoeffding Inequality no longer hold. With multiple hypotheses in H, the learning algorithm picks
the final hypothesis g based on D, i.e. after generating the data set. The statement we would like to make is not

    "P[|E_in(h_m) − E_out(h_m)| > ε] is small"

(for any particular, fixed h_m ∈ H), but rather

    "P[|E_in(g) − E_out(g)| > ε] is small"    for the final hypothesis g.

The hypothesis g is not fixed ahead of time before generating the data, because which hypothesis is selected to be g depends on the data. So, we cannot just plug in g for h in the Hoeffding Inequality. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm.
Exercise 1.10
Here is an experiment that illustrates the difference between a single bin and multiple bins. Run a computer simulation for flipping 1,000 fair coins (a sketch of such a simulation follows this exercise). Flip each coin independently 10 times. Let's focus on 3 coins as follows: c_1 is the first coin flipped; c_rand is a coin you choose at random; c_min is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie). Let ν_1, ν_rand and ν_min be the fraction of heads you obtain for the respective three coins. For a coin, let µ be its probability of heads.
(a) What is µ for the three coins selected?
(b) Repeat this entire experiment a large number of times (e.g., 100,000 runs of the entire experiment) to get several instances of ν_1, ν_rand and ν_min, and plot the histograms of the distributions of ν_1, ν_rand and ν_min. Notice that which coins end up being c_rand and c_min may differ from one run to another.
(c) Using (b), plot estimates for P[|ν − µ| > ε] as a function of ε, together with the Hoeffding bound 2e^{−2ε²N} (on the same graph).
(d) Which coins obey the Hoeffding bound, and which ones do not? Explain why.
(e) Relate part (d) to the multiple bins in Figure 1.10.
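The following is a minimal simulation sketch for Exercise 1.10, written in Python with NumPy; the run count is kept smaller than the exercise suggests so it executes quickly, and the variable names are ours rather than anything prescribed by the exercise.

    import numpy as np

    rng = np.random.default_rng(0)
    runs, coins, flips = 10_000, 1000, 10        # fewer runs than suggested, just to keep it fast

    nu_1, nu_rand, nu_min = [], [], []
    for _ in range(runs):
        heads = rng.binomial(flips, 0.5, size=coins) / flips   # fraction of heads for each coin
        nu_1.append(heads[0])                                  # the first coin
        nu_rand.append(heads[rng.integers(coins)])             # a randomly chosen coin
        nu_min.append(heads.min())                             # the coin with the fewest heads

    # Empirical P[|nu - mu| > eps] for each coin versus the Hoeffding bound (mu = 0.5, N = 10)
    for eps in (0.1, 0.2, 0.3, 0.4):
        bound = 2 * np.exp(-2 * eps**2 * flips)
        rates = [np.mean(np.abs(np.array(nu) - 0.5) > eps) for nu in (nu_1, nu_rand, nu_min)]
        print(eps, [float(r) for r in rates], float(bound))    # c_min is the interesting case in (d)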
The way to get around this is to try to bound P[|E_in(g) − E_out(g)| > ε] in a way that does not depend on which g the learning algorithm picks. There is a simple but crude way of doing that. Since g has to be one of the h_m's regardless of the algorithm and the sample, it is always true that

    "|E_in(g) − E_out(g)| > ε"  ⟹  "|E_in(h_1) − E_out(h_1)| > ε
                                      or |E_in(h_2) − E_out(h_2)| > ε
                                      or ...
                                      or |E_in(h_M) − E_out(h_M)| > ε",
where B_1 ⟹ B_2 means that event B_1 implies event B_2. Although the events on the RHS cover a lot more than the LHS, the RHS has the property we want: the hypotheses h_m are fixed. We now apply two basic rules in probability:

    if B_1 ⟹ B_2, then P[B_1] ≤ P[B_2],

and, if B_1, B_2, ..., B_M are any events, then

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].

The second rule is known as the union bound. Putting the two rules together, we get

    P[|E_in(g) − E_out(g)| > ε] ≤ P[ |E_in(h_1) − E_out(h_1)| > ε
                                     or |E_in(h_2) − E_out(h_2)| > ε
                                     or ...
                                     or |E_in(h_M) − E_out(h_M)| > ε ]
                                ≤ Σ_{m=1}^{M} P[|E_in(h_m) − E_out(h_m)| > ε].

Applying the Hoeffding Inequality (1.5) to the M terms one at a time, we can bound each term in the sum by 2e^{−2ε²N}. Substituting, we get

    P[|E_in(g) − E_out(g)| > ε] ≤ 2Me^{−2ε²N}.    (1.6)
Mathematically, this is a 'uniform' version of (1.5). We are trying to simultaneously approximate all E_out(h_m)'s by the corresponding E_in(h_m)'s. This allows the learning algorithm to choose any hypothesis based on E_in and expect that the corresponding E_out will uniformly follow suit, regardless of which hypothesis is chosen.

The downside for uniform estimates is that the probability bound 2Me^{−2ε²N} is a factor of M looser than the bound for a single hypothesis, and will only be meaningful if M is finite. We will improve on that in Chapter 2.
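To get a feel for how loose the union bound in (1.6) can be, one can simply tabulate 2Me^{−2ε²N} for a few values of M and N; the particular numbers below are arbitrary choices of ours, only meant to show when the bound stops being meaningful (i.e., exceeds 1).

    from math import exp

    eps = 0.1
    for M in (1, 100, 10_000):
        for N in (100, 1_000, 10_000):
            bound = 2 * M * exp(-2 * eps**2 * N)
            print(f"M={M:>6}  N={N:>6}  bound={bound:.2e}" + ("  (vacuous)" if bound >= 1 else ""))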
1.3.3 Feasibility of Learning
We have introduced two apparently conflicting arguments about the feasibility of learning. One argument says that we cannot learn anything outside of D, and the other says that we can. We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible:

1. Let us reconcile the two arguments. The question of whether D tells us anything outside of D that we didn't know before has two different answers. If we insist on a deterministic answer, which means that D tells us something certain about f outside of D, then the answer is no. If we accept a probabilistic answer, which means that D tells us something likely about f outside of D, then the answer is yes.
Exercise 1.11
We are given a data set D of 25 training examples from an unknown target function f: X → Y, where X = R and Y = {−1, +1}. To learn f, we use a simple hypothesis set H = {h_1, h_2} where h_1 is the constant +1 function and h_2 is the constant −1.
We consider two learning algorithms, S (smart) and C (crazy). S chooses the hypothesis that agrees the most with D and C chooses the other hypothesis deliberately. Let us see how these algorithms perform out of sample from the deterministic and probabilistic points of view. Assume in the probabilistic view that there is a probability distribution on X, and let P[f(x) = +1] = p.
(a) Can S produce a hypothesis that is guaranteed to perform better than random on any point outside D?
(b) Suppose that all the examples in D have y_n = +1. Is it possible that the hypothesis that C produces turns out to be better than the hypothesis that S produces?
(c) If p = 0.9, what is the probability that S will produce a better hypothesis than C?
(d) Is there any value of p for which it is more likely than not that C will produce a better hypothesis than S?
By adopting the probabilistic view, we get a positive answer to the feasibility question without paying too much of a price. The only assumption we make in the probabilistic framework is that the examples in D are generated independently. We don't insist on using any particular probability distribution, or even on knowing what distribution is used. However, whatever distribution we use for generating the examples, we must also use when we evaluate how well g approximates f (Figure 1.9). That's what makes the Hoeffding Inequality applicable. Of course this ideal situation may not always happen in practice, and some variations of it have been explored in the literature.
2. Let us pin down what we mean by the feasibility of learning. Learning produces a hypothesis g to approximate the unknown target function f. If learning is successful, then g should approximate f well, which means E_out(g) ≈ 0. However, this is not what we get from the probabilistic analysis. What we get instead is E_out(g) ≈ E_in(g). We still have to make E_in(g) ≈ 0 in order to conclude that E_out(g) ≈ 0.

We cannot guarantee that we will find a hypothesis that achieves E_in(g) ≈ 0, but at least we will know if we find it. Remember that E_out(g) is an unknown quantity, since f is unknown, but E_in(g) is a quantity that we can evaluate. We have thus traded the condition E_out(g) ≈ 0, one that we cannot ascertain, for the condition E_in(g) ≈ 0, which we can ascertain. What enabled this is the Hoeffding Inequality (1.6):

    P[|E_in(g) − E_out(g)| > ε] ≤ 2Me^{−2ε²N}

that assures us that E_out(g) ≈ E_in(g), so we can use E_in as a proxy for E_out.
Exercise 1.12
A friend comes to you with a learning problem. She says the target function f is completely unknown, but she has 4,000 data points. She is willing to pay you to solve her problem and produce for her a g which approximates f. What is the best that you can promise her among the following:
(a) After learning you will provide her with a g that you will guarantee approximates f well out of sample.
(b) After learning you will provide her with a g, and with high probability the g which you produce will approximate f well out of sample.
(c) One of two things will happen.
    (i) You will produce a hypothesis g;
    (ii) You will declare that you failed.
    If you do return a hypothesis g, then with high probability the g which you produce will approximate f well out of sample.
One should note that there are cases where we won't insist that E_in(g) ≈ 0. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error. All we hope for is a forecast that gets it right more often than not. If we get that, our bets will win in the long run. This means that a hypothesis that has E_in(g) somewhat below 0.5 will work, provided of course that E_out(g) is close enough to E_in(g).

The feasibility of learning is thus split into two questions:

1. Can we make sure that E_out(g) is close enough to E_in(g)?
2. Can we make E_in(g) small enough?

The Hoeffding Inequality (1.6) addresses the first question only. The second question is answered after we run the learning algorithm on the actual data and see how small we can get E_in to be.
Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play. One such insight has to do with the 'complexity' of these components.

The complexity of H. If the number of hypotheses M goes up, we run more risk that E_in(g) will be a poor estimator of E_out(g) according to Inequality (1.6). M can be thought of as a measure of the 'complexity' of the hypothesis set H that we use. If we want an affirmative answer to the first question, we need to keep the complexity of H in check. However, if we want an affirmative answer to the second question, we stand a better chance if H is more complex, since g has to come from H. So, a more complex H gives us more flexibility in finding some g that fits the data well, leading to small E_in(g). This tradeoff in the complexity of H is a major theme in learning theory that we will study in detail in Chapter 2.
The complexity of f. Intuitively, a complex target function f should be harder to learn than a simple f. Let us examine if this can be inferred from the two questions above. A close look at Inequality (1.6) reveals that the complexity of f does not affect how well E_in(g) approximates E_out(g). If we fix the hypothesis set and the number of training examples, the inequality provides the same bound whether we are trying to learn a simple f (for instance a constant function) or a complex f (for instance a highly nonlinear function). However, this doesn't mean that we can learn complex functions as easily as we learn simple functions. Remember that (1.6) affects the first question only. If the target function is complex, the second question comes into play since the data from a complex f are harder to fit than the data from a simple f. This means that we will get a worse value for E_in(g) when f is complex. We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower E_in(g), but then E_out won't be as close to E_in per (1.6). Either way we look at it, a complex f is harder to learn as we expected. In the extreme case, if f is too complex, we may not be able to learn it at all.

Fortunately, most target functions in real life are not too complex; we can learn them from a reasonable D using a reasonable H. This is obviously a practical observation, not a mathematical statement. Even when we cannot learn a particular f, we will at least be able to tell that we can't. As long as we make sure that the complexity of H gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data.
1.4 Error and Noise

We close this chapter by revisiting two notions in the learning problem in order to bring them closer to the real world. The first notion is what approximation means when we say that our hypothesis approximates the target function well. The second notion is about the nature of the target function. In many situations, there is noise that makes the output of f not uniquely determined by the input. What are the ramifications of having such a 'noisy' target on the learning problem?
1.4.1 Error Measures

Learning is not expected to replicate the target function perfectly. The final hypothesis g is only an approximation of f. To quantify how well g approximates f, we need to define an error measure (also called an error function in the literature; the error is sometimes referred to as cost, objective, or risk) that quantifies how far we are from the target.

The choice of an error measure affects the outcome of the learning process. Different error measures may lead to different choices of the final hypothesis, even if the target and the data are the same, since the value of a particular error measure may be small while the value of another error measure in the same situation is large. Therefore, which error measure we use has consequences for what we learn. What are the criteria for choosing one error measure over another? We address this question here.

First, let's formalize this notion a bit. An error measure quantifies how well each hypothesis h in the model approximates the target function f,

    Error = E(h, f).

While E(h, f) is based on the entirety of h and f, it is almost universally defined based on the errors on individual input points x. If we define a pointwise error measure e(h(x), f(x)), the overall error will be the average value of this pointwise error. So far, we have been working with the classification error e(h(x), f(x)) = [[h(x) ≠ f(x)]].

In an ideal world, E(h, f) should be user-specified. The same learning task in different contexts may warrant the use of different error measures. One may view E(h, f) as the 'cost' of using h when you should use f. This cost depends on what h is used for, and cannot be dictated just by our learning techniques. Here is a case in point.

Example 1.1 (Fingerprint verification). Consider the problem of verifying that a fingerprint belongs to a particular person. What is the appropriate error measure?
The target function takes as input a fingerprint, and returns +1 if it belongs to the right person, and −1 if it belongs to an intruder.
There are two types of error that our hypothesis h can make here. If the correct person is rejected (h = −1 but f = +1), it is called false reject, and if an incorrect person is accepted (h = +1 but f = −1), it is called false accept. The two types of error can be displayed in a table:

                            f
                     +1             −1
    h   +1       no error      false accept
        −1     false reject      no error

How should the error measure be defined in this problem? If the right person is accepted or an intruder is rejected, the error is clearly zero. We need to specify the error values for a false accept and for a false reject. The right values depend on the application.
Consider two potential clients of this fingerprint system. One is a supermarket who will use it at the checkout counter to verify that you are a member of a discount program. The other is the CIA who will use it at the entrance to a secure facility to verify that you are authorized to enter that facility.

For the supermarket, a false reject is costly because if a customer gets wrongly rejected, she may be discouraged from patronizing the supermarket in the future. All future revenue from this annoyed customer is lost. On the other hand, the cost of a false accept is minor. You just gave away a discount to someone who didn't deserve it, and that person left their fingerprint in your system - they must be bold indeed.

For the CIA, a false accept is a disaster. An unauthorized person will gain access to a highly sensitive facility. This should be reflected in a much higher cost for the false accept. False rejects, on the other hand, can be tolerated since authorized persons are employees (rather than customers as with the supermarket). The inconvenience of retrying when rejected is just part of the job, and they must deal with it.
The costs of the different types of errors can be tabulated in a matrix. For our examples, the matrices might look like:

                   f                              f
              +1      −1                     +1       −1
    h  +1      0       1           h  +1     0      1000
       −1     10       0              −1     1         0

         Supermarket                        CIA

These matrices should be used to weight the different types of errors when we compute the total error. When the learning algorithm minimizes a cost-weighted error measure, it automatically takes into consideration the utility of the hypothesis that it will produce. In the supermarket and CIA scenarios, this could lead to two completely different final hypotheses. □
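As an illustration of how a risk matrix like the ones above enters the computation (anticipating Problem 1.11), here is a small sketch that evaluates a cost-weighted in-sample error on toy data; the data and the helper name are ours, only the two matrices come from the example.

    # Each cost matrix maps (h(x), f(x)) pairs, in the order (+1, -1), to a cost.
    SUPERMARKET = {(+1, +1): 0, (+1, -1): 1,    (-1, +1): 10, (-1, -1): 0}
    CIA         = {(+1, +1): 0, (+1, -1): 1000, (-1, +1): 1,  (-1, -1): 0}

    def weighted_in_sample_error(h_values, y_values, cost):
        """Average cost of h's decisions on the data, weighted by the risk matrix."""
        total = sum(cost[(h, y)] for h, y in zip(h_values, y_values))
        return total / len(y_values)

    # Toy predictions and labels, just to exercise the function.
    h_values = [+1, +1, -1, -1, +1]
    y_values = [+1, -1, -1, +1, +1]
    print(weighted_in_sample_error(h_values, y_values, SUPERMARKET))  # the false reject costs 10 here
    print(weighted_in_sample_error(h_values, y_values, CIA))          # the false accept costs 1000 here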
The moral of this example is that the choice of the error measure depends on how the system is going to be used, rather than on any inherent criterion that we can independently determine during the learning process. However, this ideal choice may not be possible in practice for two reasons. One is that the user may not provide an error specification, which is not uncommon. The other is that the weighted cost may be a difficult objective function for optimizers to work with. Therefore, we often look for other ways to define the error measure, sometimes with purely practical or analytic considerations in mind. We have already seen an example of this with the simple binary error used in this chapter, and we will see other error measures in later chapters.

Figure 1.11: The general (supervised) learning problem
1.4.2 Noisy Targets

In many practical applications, the data we learn from are not generated by a deterministic target function. Instead, they are generated in a noisy way such that the output is not uniquely determined by the input. For instance, in the credit-card example we presented in Section 1.1, two customers may have identical salaries, outstanding loans, etc., but end up with different credit behavior. Therefore, the credit 'function' is not really a deterministic function, but a noisy one.

This situation can be readily modeled within the same framework that we have. Instead of y = f(x), we can take the output y to be a random variable that is affected by, rather than determined by, the input x. Formally, we have a target distribution P(y | x) instead of a target function y = f(x). A data point (x, y) is now generated by the joint distribution P(x, y) = P(x)P(y | x).

One can think of a noisy target as a deterministic target plus added noise. If y is real-valued for example, one can take the expected value of y given x to be the deterministic f(x), and consider y − f(x) as pure noise that is added to f.

This view suggests that a deterministic target function can be considered a special case of a noisy target, just with zero noise. Indeed, we can formally express any function f as a distribution P(y | x) by choosing P(y | x) to be zero for all y except y = f(x). Therefore, there is no loss of generality if we consider the target to be a distribution rather than a function. Figure 1.11 modifies the previous Figures 1.2 and 1.9 to illustrate the general learning problem, covering both deterministic and noisy targets.
Exercise 1.13
Consider the bin model for a hypothesis h that makes an error with probability µ in approximating a deterministic target function f (both h and f are binary functions). If we use the same h to approximate a noisy version of f given by

    P(y | x) = λ        if y = f(x),
               1 − λ    if y ≠ f(x).

(a) What is the probability of error that h makes in approximating y?
(b) At what value of λ will the performance of h be independent of µ?
[Hint: The noisy target will look completely random.]
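The following Monte Carlo sketch mirrors the setup of Exercise 1.13 so you can sanity-check your answer to part (a); the particular values of µ and λ are free parameters here, and the estimator is ours, not part of the exercise.

    import random

    def error_rate(mu, lam, trials=100_000, seed=0):
        """Estimate P[h(x) != y] when h disagrees with f w.p. mu and y flips f w.p. 1 - lam."""
        rng = random.Random(seed)
        errors = 0
        for _ in range(trials):
            h_wrong = rng.random() < mu          # event h(x) != f(x)
            y_flipped = rng.random() >= lam      # event y != f(x), probability 1 - lam
            if h_wrong != y_flipped:             # h(x) != y exactly when one (not both) occurred
                errors += 1
        return errors / trials

    print(error_rate(mu=0.3, lam=0.8))           # compare with your closed-form answer to (a)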
There is a difference between the role of P(y | x) and the role of P(x) in the learning problem. While both distributions model probabilistic aspects of x and y, the target distribution P(y | x) is what we are trying to learn, while the input distribution P(x) only quantifies the relative importance of the point x in gauging how well we have learned.

Our entire analysis of the feasibility of learning applies to noisy target functions as well. Intuitively, this is because the Hoeffding Inequality (1.6) applies to an arbitrary, unknown target function. Assume we randomly picked all the y's according to the distribution P(y | x) over the entire input space X. This realization of P(y | x) is effectively a target function. Therefore, the inequality will be valid no matter which particular random realization the 'target function' happens to be.

This does not mean that learning a noisy target is as easy as learning a deterministic one. Remember the two questions of learning? With the same learning model, E_out may be as close to E_in in the noisy case as it is in the deterministic case, but E_in itself will likely be worse in the noisy case since it is hard to fit the noise.

In Chapter 2, where we prove a stronger version of (1.6), we will assume the target to be a probability distribution P(y | x), thus covering the general case.
1.5 Problems
Problem 1.1 We have 2 opaque bags, each containing 2 balls. One bag has 2 black balls and the other has a black and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black? [Hint: Use Bayes' Theorem: P[A and B] = P[A | B] P[B] = P[B | A] P[A].]
Problem 1.2 Consider the perceptron in two dimensions: h(x) = sign(wᵀx) where w = [w_0, w_1, w_2]ᵀ and x = [1, x_1, x_2]ᵀ. Technically, x has three coordinates, but we call this perceptron two-dimensional because the first coordinate is fixed at 1.
(a) Show that the regions on the plane where h(x) = +1 and h(x) = −1 are separated by a line. If we express this line by the equation x_2 = a x_1 + b, what are the slope a and intercept b in terms of w_0, w_1, w_2?
(b) Draw a picture for the cases w = [1, 2, 3]ᵀ and w = −[1, 2, 3]ᵀ.
In more than two dimensions, the +1 and −1 regions are separated by a hyperplane, the generalization of a line.
Problem 1.3 Prove that the PLA eventually converges to a linear separator for separable data. The following steps will guide you through the proof. Let w* be an optimal set of weights (one which separates the data). The essential idea in this proof is to show that the PLA weights w(t) get "more aligned" with w* with every iteration. For simplicity, assume that w(0) = 0.
(a) Let ρ = min_{1≤n≤N} y_n (w*ᵀ x_n). Show that ρ > 0.
(b) Show that wᵀ(t)w* ≥ wᵀ(t−1)w* + ρ, and conclude that wᵀ(t)w* ≥ tρ.
    [Hint: Use induction.]
(c) Show that ‖w(t)‖² ≤ ‖w(t−1)‖² + ‖x(t−1)‖².
    [Hint: y(t−1)·(wᵀ(t−1)x(t−1)) ≤ 0 because x(t−1) was misclassified by w(t−1).]
(d) Show by induction that ‖w(t)‖² ≤ tR², where R = max_{1≤n≤N} ‖x_n‖.
(e) Using (b) and (d), show that

        wᵀ(t)w* / ‖w(t)‖ ≥ √t · ρ/R,

    and hence prove that

        t ≤ R²‖w*‖² / ρ².

    [Hint: wᵀ(t)w* / (‖w(t)‖ ‖w*‖) ≤ 1. Why?]
In practice, PLA converges more quickly than the bound R²‖w*‖²/ρ² suggests. Nevertheless, because we do not know ρ in advance, we can't determine the number of iterations to convergence, which does pose a problem if the data is non-separable.
Problem 1.4 In Exercise 1.4, we use an artificial data set to study the perceptron learning algorithm. This problem leads you to explore the algorithm further with data sets of different sizes and dimensions. (A minimal implementation sketch follows the problem statement.)
(a) Generate a linearly separable data set of size 20 as indicated in Exercise 1.4. Plot the examples {(x_n, y_n)} as well as the target function f on a plane. Be sure to mark the examples from different classes differently, and add labels to the axes of the plot.
(b) Run the perceptron learning algorithm on the data set above. Report the number of updates that the algorithm takes before converging. Plot the examples {(x_n, y_n)}, the target function f, and the final hypothesis g in the same figure. Comment on whether f is close to g.
(c) Repeat everything in (b) with another randomly generated data set of size 20. Compare your results with (b).
(d) Repeat everything in (b) with another randomly generated data set of size 100. Compare your results with (b).
(e) Repeat everything in (b) with another randomly generated data set of size 1,000. Compare your results with (b).
(f) Modify the algorithm such that it takes x_n ∈ R^10 instead of R². Randomly generate a linearly separable data set of size 1,000 with x_n ∈ R^10 and feed the data set to the algorithm. How many updates does the algorithm take to converge?
(g) Repeat the algorithm on the same data set as (f) for 100 experiments. In the iterations of each experiment, pick x(t) randomly instead of deterministically. Plot a histogram for the number of updates that the algorithm takes to converge.
(h) Summarize your conclusions with respect to accuracy and running time as a function of N and d.
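For readers attempting Problem 1.4, here is one possible minimal PLA sketch in Python with NumPy; the target generation, the dimension, and the data set size are our own choices for illustration, and Exercise 1.4 (referenced by the problem) specifies the actual data-generation procedure.

    import numpy as np

    def generate_data(N, d, rng):
        """Random target weights and a linearly separable data set of size N in d dimensions."""
        w_target = rng.normal(size=d + 1)                       # includes the bias coordinate
        X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, d))])
        y = np.sign(X @ w_target)
        y[y == 0] = 1                                           # avoid points exactly on the boundary
        return X, y

    def pla(X, y, max_updates=100_000):
        """Perceptron learning algorithm: update on a misclassified point until none remain."""
        w = np.zeros(X.shape[1])
        for updates in range(max_updates):
            misclassified = np.where(np.sign(X @ w) != y)[0]
            if len(misclassified) == 0:
                return w, updates
            n = misclassified[0]                                # deterministic pick; part (g) asks for random
            w = w + y[n] * X[n]
        return w, max_updates

    rng = np.random.default_rng(1)
    X, y = generate_data(20, 2, rng)
    w, updates = pla(X, y)
    print("updates to converge:", updates)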
Problem 1.5 The perceptron learning algorithm works like this: In each iteration t, pick a random (x(t), y(t)) and compute the 'signal' s(t) = wᵀ(t)x(t). If y(t)·s(t) ≤ 0, update w by

    w(t + 1) ← w(t) + y(t)·x(t).

One may argue that this algorithm does not take the 'closeness' between s(t) and y(t) into consideration. Let's look at another perceptron learning algorithm: In each iteration, pick a random (x(t), y(t)) and compute s(t). If y(t)·s(t) ≤ 1, update w by

    w(t + 1) ← w(t) + η·(y(t) − s(t))·x(t),

where η is a constant. That is, if s(t) agrees with y(t) well (their product is > 1), the algorithm does nothing. On the other hand, if s(t) is further from y(t), the algorithm changes w(t) more. In this problem, you are asked to implement this algorithm and study its performance. (A sketch of the update loop appears after the problem.)
(a) Generate a training data set of size 100 similar to that used in Exercise 1.4. Generate a test data set of size 10,000 from the same process. To get g, run the algorithm above with η = 100 on the training data set, until a maximum of 1,000 updates has been reached. Plot the training data set, the target function f, and the final hypothesis g on the same figure. Report the error on the test set.
(b) Use the data set in (a) and redo everything with η = 1.
(c) Use the data set in (a) and redo everything with η = 0.01.
(d) Use the data set in (a) and redo everything with η = 0.0001.
(e) Compare the results that you get from (a) to (d).
The algorithm above is a variant of the so-called Adaline (Adaptive Linear Neuron) algorithm for perceptron learning.
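The update rule of Problem 1.5 can be written in a few lines; this sketch is only meant to show the loop structure, not the full experiment, and the commented usage assumes the hypothetical generate_data helper from the sketch after Problem 1.4.

    import numpy as np

    def adaline_variant(X, y, eta, max_updates=1000, seed=0):
        """Run the modified perceptron rule: update whenever y(t)*s(t) <= 1."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        updates = 0
        for _ in range(50 * max_updates):        # iteration cap so the loop always terminates
            t = rng.integers(len(y))             # pick a random example
            s = w @ X[t]                         # the 'signal' s(t)
            if y[t] * s <= 1:
                w = w + eta * (y[t] - s) * X[t]
                updates += 1
                if updates >= max_updates:
                    break
        return w

    # Example usage (assumes generate_data from the previous sketch is in scope):
    # X, y = generate_data(100, 2, np.random.default_rng(2))
    # g = adaline_variant(X, y, eta=0.01)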
Problem 1.6 Consider a sample of 10 marbles drawn independently from a bin that holds red and green marbles. The probability of a red marble is µ. For µ = 0.05, µ = 0.5, and µ = 0.8, compute the probability of getting no red marbles (ν = 0) in the following cases.
(a) We draw only one such sample. Compute the probability that ν = 0.
(b) We draw 1,000 independent samples. Compute the probability that (at least) one of the samples has ν = 0.
(c) Repeat (b) for 1,000,000 independent samples.
Problem 1.7 A sample of heads and tails is created by tossing a coin a number of times independently. Assume we have a number of coins that generate different samples independently. For a given coin, let the probability of heads (probability of error) be µ. The probability of obtaining k heads in N tosses of this coin is given by the binomial distribution:

    P[k | N, µ] = (N choose k) µ^k (1 − µ)^{N−k}.

Remember that the training error ν is k/N.
(a) Assume the sample size (N) is 10. If all the coins have µ = 0.05, compute the probability that at least one coin will have ν = 0 for the case of 1 coin, 1,000 coins, 1,000,000 coins. Repeat for µ = 0.8.
(b) For the case N = 6 and 2 coins with µ = 0.5 for both coins, plot the probability

    P[ max_i |ν_i − µ_i| > ε ]

for ε in the range [0, 1] (the max is over coins). On the same plot show the bound that would be obtained using the Hoeffding Inequality. Remember that for a single coin, the Hoeffding bound is

    P[|ν − µ| > ε] ≤ 2e^{−2ε²N}.

[Hint: Use P[A or B] = P[A] + P[B] − P[A and B] = P[A] + P[B] − P[A]P[B], where the last equality follows by independence, to evaluate P[max ...].]
Problem 1.8 The Hoeffding Inequality is one form of the law of large numbers. One of the simplest forms of that law is the Chebyshev Inequality, which you will prove here.
(a) If t is a non-negative random variable, prove that for any α > 0, P[t ≥ α] ≤ E(t)/α.
(b) If u is any random variable with mean µ and variance σ², prove that for any α > 0, P[(u − µ)² ≥ α] ≤ σ²/α. [Hint: Use (a).]
(c) If u_1, ..., u_N are iid random variables, each with mean µ and variance σ², and u = (1/N) Σ_{n=1}^{N} u_n, prove that for any α > 0,

    P[(u − µ)² ≥ α] ≤ σ²/(Nα).

Notice that the RHS of this Chebyshev Inequality goes down linearly in N, while the counterpart in Hoeffding's Inequality goes down exponentially. In Problem 1.9, we develop an exponential bound using a similar approach.
Problem 1.9 In this problem, we derive a form of the law of large numbers that has an exponential bound, called the Chernoff bound. We focus on the simple case of flipping a fair coin, and use an approach similar to Problem 1.8.
(a) Let t be a (finite) random variable, α be a positive constant, and s be a positive parameter. If T(s) = E_t(e^{st}), prove that

    P[t ≥ α] ≤ e^{−sα} T(s).

[Hint: e^{st} is monotonically increasing in t.]
(b) Let u_1, ..., u_N be iid random variables, and let u = (1/N) Σ_{n=1}^{N} u_n. If U(s) = E_{u_n}(e^{s u_n}) (for any n), prove that

    P[u ≥ α] ≤ (e^{−sα} U(s))^N.

(c) Suppose P[u_n = 0] = P[u_n = 1] = 1/2 (fair coin). Evaluate U(s) as a function of s, and minimize e^{−sα} U(s) with respect to s for fixed α, 0 < α < 1.
(d) Conclude in (c) that, for 0 < ε < 1/2,

    P[u ≥ E(u) + ε] ≤ 2^{−βN},

where β = 1 + (1/2 + ε) log₂(1/2 + ε) + (1/2 − ε) log₂(1/2 − ε) and E(u) = 1/2. Show that β > 0, hence the bound is exponentially decreasing in N.
Problem 1.10 Assume that X = {x_1, x_2, ..., x_N, x_{N+1}, ..., x_{N+M}} and Y = {−1, +1} with an unknown target function f: X → Y. The training data set D is (x_1, y_1), ..., (x_N, y_N). Define the off-training-set error of a hypothesis h with respect to f by

    E_off(h, f) = (1/M) Σ_{m=1}^{M} [[h(x_{N+m}) ≠ f(x_{N+m})]].

(a) Say f(x) = +1 for all x and

    h(x) = +1,  for x = x_k and k is odd and 1 ≤ k ≤ M + N,
           −1,  otherwise.

What is E_off(h, f)?
(b) We say that a target function f can 'generate' D in a noiseless setting if y_n = f(x_n) for all (x_n, y_n) ∈ D. For a fixed D of size N, how many possible f: X → Y can generate D in a noiseless setting?
(c) For a given hypothesis h and an integer k between 0 and M, how many of those f in (b) satisfy E_off(h, f) = k/M?
(d) For a given hypothesis h, if all those f that generate D in a noiseless setting are equally likely in probability, what is the expected off-training-set error E_f[E_off(h, f)]?
(e) A deterministic algorithm A is defined as a procedure that takes D as an input, and outputs a hypothesis h = A(D). Argue that for any two deterministic algorithms A_1 and A_2,

    E_f[E_off(A_1(D), f)] = E_f[E_off(A_2(D), f)].

You have now proved that in a noiseless setting, for a fixed D, if all possible f are equally likely, any two deterministic algorithms are equivalent in terms of the expected off-training-set error. Similar results can be proved for more general settings.
Problem 1.11 The matrix which tabulates the cost of various errors for the CIA and Supermarket applications in Example 1.1 is called a risk or loss matrix.
For the two risk matrices in Example 1.1, explicitly write down the in-sample error E_in that one should minimize to obtain g. This in-sample error should weight the different types of errors based on the risk matrix. [Hint: Consider y_n = +1 and y_n = −1 separately.]
Problem 1.12 This problem investigates how changing the error measure can change the result of the learning process. You have N data points y_1 ≤ ... ≤ y_N and wish to estimate a 'representative' value.
(a) If your algorithm is to find the hypothesis h that minimizes the in-sample sum of squared deviations,

    E_in(h) = Σ_{n=1}^{N} (h − y_n)²,

then show that your estimate will be the in-sample mean,

    h_mean = (1/N) Σ_{n=1}^{N} y_n.

(b) If your algorithm is to find the hypothesis h that minimizes the in-sample sum of absolute deviations,

    E_in(h) = Σ_{n=1}^{N} |h − y_n|,

then show that your estimate will be the in-sample median h_med, which is any value for which half the data points are at most h_med and half the data points are at least h_med.
(c) Suppose y_N is perturbed to y_N + ε, where ε → ∞. So, the single data point y_N becomes an outlier. What happens to your two estimators h_mean and h_med?
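A quick numerical illustration of part (c) of Problem 1.12, with data values of our own choosing, shows how the two estimators react to a single outlier.

    import statistics

    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(statistics.mean(data), statistics.median(data))        # both sit near the center

    data_outlier = data[:-1] + [5.0 + 1e6]                       # perturb the largest point
    print(statistics.mean(data_outlier), statistics.median(data_outlier))
    # the mean is dragged away by the outlier; the median barely moves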
Chapter 2
Training versus Testing
Before the final exam, a professor may hand out some practice problems and solutions to the class. Although these problems are not the exact ones that will appear on the exam, studying them will help you do better. They are the 'training set' in your learning.

If the professor's goal is to help you do better in the exam, why not give out the exam problems themselves? Well, nice try ☺. Doing well in the exam is not the goal in and of itself. The goal is for you to learn the course material. The exam is merely a way to gauge how well you have learned the material. If the exam problems are known ahead of time, your performance on them will no longer accurately gauge how well you have learned.

The same distinction between training and testing happens in learning from data. In this chapter, we will develop a mathematical theory that characterizes this distinction. We will also discuss the conceptual and practical implications of the contrast between training and testing.
2.1 Theory of Generalization
The out-of-sample error E_out measures how well our training on D has generalized to data that we have not seen before. E_out is based on the performance over the entire input space X. Intuitively, if we want to estimate the value of E_out using a sample of data points, these points must be 'fresh' test points that have not been used for training, similar to the questions on the final exam that have not been used for practice.

The in-sample error E_in, by contrast, is based on data points that have been used for training. It expressly measures training performance, similar to your performance on the practice problems that you got before the final exam. Such performance has the benefit of looking at the solutions and adjusting accordingly, and may not reflect the ultimate performance in a real test. We began the analysis of in-sample error in Chapter 1, and we will extend this analysis to the general case in this chapter. We will also make the contrast between a training set and a test set more precise.
A word of warning: this chapter is the heaviest in this book in terms of mathematical abstraction. To make it easier on the not-so-mathematically inclined, we will tell you which part you can safely skip without 'losing the plot'. The mathematical results provide fundamental insights into learning from data, and we will interpret these results in practical terms.

Generalization error. We have already discussed how the value of E_in does not always generalize to a similar value of E_out. Generalization is a key issue in learning. One can define the generalization error as the discrepancy between E_in and E_out. (Sometimes 'generalization error' is used as another name for E_out, but not in this book.) The Hoeffding Inequality (1.6) provides a way to characterize the generalization error with a probabilistic bound,

    P[|E_in(g) − E_out(g)| > ε] ≤ 2Me^{−2ε²N}

for any ε > 0. This can be rephrased as follows. Pick a tolerance level δ, for example δ = 0.05, and assert with probability at least 1 − δ that
    E_out(g) ≤ E_in(g) + sqrt( (1/(2N)) ln(2M/δ) ).    (2.1)

We refer to the type of inequality in (2.1) as a generalization bound because it bounds E_out in terms of E_in. To see that the Hoeffding Inequality implies this generalization bound, we rewrite (1.6) as follows: with probability at least 1 − 2Me^{−2ε²N}, |E_out − E_in| ≤ ε, which implies E_out ≤ E_in + ε. We may now identify δ = 2Me^{−2ε²N}, from which ε = sqrt( (1/(2N)) ln(2M/δ) ), and (2.1) follows.
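The error bar in (2.1) is easy to tabulate; the following lines, with a tolerance level and hypothesis-set size chosen arbitrarily by us, show how the bar shrinks as N grows.

    from math import log, sqrt

    delta, M = 0.05, 1000          # tolerance level and (finite) number of hypotheses, our choices
    for N in (100, 1_000, 10_000, 100_000):
        error_bar = sqrt(log(2 * M / delta) / (2 * N))
        print(f"N={N:>7}  error bar={error_bar:.4f}")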
Notice that the other side of |E_out − E_in| ≤ ε also holds, that is, E_out ≥ E_in − ε for all h ∈ H. This is important for learning, but in a more subtle way. Not only do we want to know that the hypothesis g that we choose (say the one with the best training error) will continue to do well out of sample (i.e., E_out ≤ E_in + ε), but we also want to be sure that we did the best we could with our H (no other hypothesis h ∈ H has E_out(h) significantly better than E_out(g)). The E_out(h) ≥ E_in(h) − ε direction of the bound assures us that we couldn't do much better, because every hypothesis with a higher E_in than the g we have chosen will have a comparably higher E_out.
The error bound sqrt( (1/(2N)) ln(2M/δ) ) in (2.1), or 'error bar' if you will, depends on M, the size of the hypothesis set H. If H is an infinite set, the bound goes to infinity and becomes meaningless. Unfortunately, almost all interesting learning models have infinite H, including the simple perceptron which we discussed in Chapter 1.

In order to study generalization in such models, we need to derive a counterpart to (2.1) that deals with infinite H. We would like to replace M with
something finite, so that the bound is meaningful. To do this, we notice that the way we got the M factor in the first place was by taking the disjunction of events:

    "|E_in(h_1) − E_out(h_1)| > ε" or
    "|E_in(h_2) − E_out(h_2)| > ε" or
        ...
    "|E_in(h_M) − E_out(h_M)| > ε",    (2.2)

which is guaranteed to include the event "|E_in(g) − E_out(g)| > ε" since g is always one of the hypotheses in H. We then over-estimated the probability using the union bound. Let B_m be the (Bad) event that "|E_in(h_m) − E_out(h_m)| > ε". Then,

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].
If the events B_1, B_2, ..., B_M are strongly overlapping, the union bound becomes particularly loose, as illustrated in the figure to the right for an example with 3 hypotheses; the areas of different events correspond to their probabilities. The union bound says that the total area covered by B_1, B_2, or B_3 is smaller than the sum of the individual areas, which is true but is a gross overestimate when the areas overlap heavily as in this example. The events "|E_in(h_m) − E_out(h_m)| > ε", m = 1, ..., M, are often strongly overlapping. If h_1 is very similar to h_2 for instance, the two events "|E_in(h_1) − E_out(h_1)| > ε" and "|E_in(h_2) − E_out(h_2)| > ε" are likely to coincide for most data sets. In a typical learning model, many hypotheses are indeed very similar. If you take the perceptron model for instance, as you slowly vary the weight vector w, you get infinitely many hypotheses that differ from each other only infinitesimally.

The mathematical theory of generalization hinges on this observation. Once we properly account for the overlaps of the different hypotheses, we will be able to replace the number of hypotheses M in (2.1) by an effective number which is finite even when M is infinite, and establish a more useful condition under which E_out is close to E_in.
2.1.1 Effective Number of Hypotheses
We now introduce the growth function, the quantity that will formalize the effective number of hypotheses. The growth function is what will replace M in the generalization bound (2.1). It is a combinatorial quantity that captures how different the hypotheses in H are, and hence how much overlap the different events in (2.2) have.
We will start by defining the growth function and studying its basic properties. Next, we will show how we can bound the value of the growth function. Finally, we will show that we can replace M in the generalization bound with the growth function. These three steps will yield the generalization bound that we need, which applies to infinite H. We will focus on binary target functions for the purpose of this analysis, so each h ∈ H maps X to {−1, +1}.

The definition of the growth function is based on the number of different hypotheses that H can implement, but only over a finite sample of points rather than over the entire input space X. If h ∈ H is applied to a finite sample x_1, ..., x_N ∈ X, we get an N-tuple h(x_1), ..., h(x_N) of ±1's. Such an N-tuple is called a dichotomy since it splits x_1, ..., x_N into two groups: those points for which h is −1 and those for which h is +1. Each h ∈ H generates a dichotomy on x_1, ..., x_N, but two different h's may generate the same dichotomy if they happen to give the same pattern of ±1's on this particular sample.
Definition 2.1. Let x_1, ..., x_N ∈ X. The dichotomies generated by H on these points are defined by

    H(x_1, ..., x_N) = { (h(x_1), ..., h(x_N)) | h ∈ H }.    (2.3)

One can think of the dichotomies H(x_1, ..., x_N) as a set of hypotheses just like H is, except that the hypotheses are seen through the eyes of N points only. A larger H(x_1, ..., x_N) means H is more 'diverse' - generating more dichotomies on x_1, ..., x_N. The growth function is based on the number of dichotomies.

Definition 2.2. The growth function is defined for a hypothesis set H by

    m_H(N) = max_{x_1,...,x_N ∈ X} |H(x_1, ..., x_N)|,

where | · | denotes the cardinality (number of elements) of a set.
In words, m_H(N) is the maximum number of dichotomies that can be generated by H on any N points. To compute m_H(N), we consider all possible choices of N points x_1, ..., x_N from X and pick the one that gives us the most dichotomies. Like M, m_H(N) is a measure of the number of hypotheses in H, except that a hypothesis is now considered on N points instead of the entire X. For any H, since H(x_1, ..., x_N) ⊆ {−1, +1}^N (the set of all possible dichotomies on any N points), the value of m_H(N) is at most |{−1, +1}^N|, hence

    m_H(N) ≤ 2^N.

If H is capable of generating all possible dichotomies on x_1, ..., x_N, then H(x_1, ..., x_N) = {−1, +1}^N and we say that H can shatter x_1, ..., x_N. This signifies that H is as diverse as can be on this particular sample.
(a) (b) (c)
Figure 2.1: Illustration of the growth function for a two-dimensional perceptron. The dichotomy of red versus blue on the 3 collinear points in part (a) cannot be generated by a perceptron, but all 8 dichotomies on the 3 points in part (b) can. By contrast, the dichotomy of red versus blue on the 4 points in part (c) cannot be generated by a perceptron. At most 14 out of the possible 16 dichotomies on any 4 points can be generated.

Example 2.1. If X is the Euclidean plane and H is a two-dimensional perceptron, what are m_H(3) and m_H(4)? Figure 2.1(a) shows a dichotomy on 3 points that the perceptron cannot generate, while Figure 2.1(b) shows another 3 points that the perceptron can shatter, generating all 2³ = 8 dichotomies. Because the definition of m_H(N) is based on the maximum number of dichotomies, m_H(3) = 8 in spite of the case in Figure 2.1(a).

In the case of 4 points, Figure 2.1(c) shows a dichotomy that the perceptron cannot generate. One can verify that there are no 4 points that the perceptron can shatter. The most a perceptron can do on any 4 points is 14 dichotomies out of the possible 16, where the 2 missing dichotomies are as depicted in Figure 2.1(c) with blue and red corresponding to −1, +1 or to +1, −1. Hence, m_H(4) = 14. □
Let us now illustrate how to compute m_H(N) for some simple hypothesis sets. These examples will confirm the intuition that m_H(N) grows faster when the hypothesis set H becomes more complex. This is what we expect of a quantity that is meant to replace M in the generalization bound (2.1).

Example 2.2. Let us find a formula for m_H(N) in each of the following cases.

1. Positive rays: H consists of all hypotheses h: R → {−1, +1} of the form h(x) = sign(x − a), i.e., the hypotheses are defined in a one-dimensional input space, and they return −1 to the left of some value a and +1 to the right of a.

   [Figure: the real line with points x_1, x_2, ..., x_N; h(x) = −1 to the left of the threshold a and h(x) = +1 to the right.]
To compute m_H(N), we notice that given N points, the line is split by the points into N + 1 regions. The dichotomy we get on the N points is decided by which region contains the value a. As we vary a, we will get N + 1 different dichotomies. Since this is the most we can get for any N points, the growth function is

    m_H(N) = N + 1.

Notice that if we picked N points where some of the points coincided (which is allowed), we will get fewer than N + 1 dichotomies. This does not affect the value of m_H(N) since it is defined based on the maximum number of dichotomies.

2. Positive intervals: H consists of all hypotheses in one dimension that return +1 within some interval and −1 otherwise. Each hypothesis is specified by the two end values of that interval.

   [Figure: the real line with points x_1, x_2, ..., x_N; h(x) = +1 inside the interval and h(x) = −1 outside it.]

To compute m_H(N), we notice that given N points, the line is again split by the points into N + 1 regions. The dichotomy we get is decided by which two regions contain the end values of the interval, resulting in (N+1 choose 2) different dichotomies. If both end values fall in the same region, the resulting hypothesis is the constant −1 regardless of which region it is. Adding up these possibilities, we get

    m_H(N) = (N+1 choose 2) + 1 = (1/2)N² + (1/2)N + 1.

Notice that m_H(N) grows as the square of N, faster than the linear m_H(N) of the 'simpler' positive ray case.
3. Convex sets: H consists of all hypotheses in two dimensions h: R² → {−1, +1} that are positive inside some convex set and negative elsewhere (a set is convex if the line segment connecting any two points in the set lies entirely within the set).

To compute m_H(N) in this case, we need to choose the N points carefully. Per the next figure, choose N points on the perimeter of a circle. Now consider any dichotomy on these points, assigning an arbitrary pattern of ±1's to the N points. If you connect the +1 points with a polygon, the hypothesis made up of the closed interior of the polygon (which has to be convex since its vertices are on the perimeter of a circle) agrees with the dichotomy on all N points. For the dichotomies that have less than three +1 points, the convex set will be a line segment, a point, or an empty set.
2 . T r a i n i n g v e r s u s T e s t i n g 2 . 1 . T h e o r y o f G e n e r a l i z a t i o n
This means that any dichotomy on these N points can be realized using a convex hypothesis, so H manages to shatter these points and the growth function has the maximum possible value

    m_H(N) = 2^N.

Notice that if the N points were chosen at random in the plane rather than on the perimeter of a circle, many of the points would be 'internal' and we wouldn't be able to shatter all the points with convex hypotheses as we did for the perimeter points. However, this doesn't matter as far as m_H(N) is concerned, since it is defined based on the maximum (2^N in this case). □
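The counts in Example 2.2 can be checked by brute force. The following sketch (ours, not part of the text; the function names are our own) enumerates the dichotomies that positive rays and positive intervals generate on N distinct points and compares the counts with the formulas N + 1 and (1/2)N² + (1/2)N + 1.

    import itertools

    def region_reps(points):
        # One representative value inside each of the N+1 regions cut out by the points.
        xs = sorted(points)
        reps = [xs[0] - 1.0]
        reps += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
        reps += [xs[-1] + 1.0]
        return reps

    def dichotomies_positive_rays(points):
        # h(x) = sign(x - a): -1 to the left of a, +1 to the right.
        return {tuple(1 if x > a else -1 for x in points) for a in region_reps(points)}

    def dichotomies_positive_intervals(points):
        # +1 strictly inside (l, r), -1 outside; include the constant -1 hypothesis.
        dichos = {tuple([-1] * len(points))}
        for l, r in itertools.combinations(region_reps(points), 2):
            dichos.add(tuple(1 if l < x < r else -1 for x in points))
        return dichos

    for N in range(1, 8):
        pts = list(range(N))
        rays = len(dichotomies_positive_rays(pts))
        ints = len(dichotomies_positive_intervals(pts))
        print(N, rays, N + 1, ints, N * (N + 1) // 2 + 1)   # counts match the formulas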
It is not practical to try to compute m_H(N) for every hypothesis set we use. Fortunately, we don't have to. Since m_H(N) is meant to replace M in (2.1), we can use an upper bound on m_H(N) instead of the exact value, and the inequality in (2.1) will still hold. Getting a good bound on m_H(N) will prove much easier than computing m_H(N) itself, thanks to the notion of a break point.
Definition 2.3. If no data set of size k can be shattered by H, then k is said to be a break point for H.

If k is a break point, then m_H(k) < 2^k. Example 2.1 shows that k = 4 is a break point for two-dimensional perceptrons. In general, it is easier to find a break point for H than to compute the full growth function for that H.
Exercise 2.1
By inspection, find a break point k for each hypothesis set in Example 2.2 (if there is one). Verify that m_H(k) < 2^k using the formulas derived in that Example.
We now use the break point k to derive a bound on the growth function m_H(N) for all values of N. For example, the fact that no 4 points can be shattered by
the two-dimensional perceptron puts a significant constraint on the number of dichotomies that can be realized by the perceptron on 5 or more points. We will exploit this idea to get a significant bound on m_H(N) in general.
2.1.2 Bounding the Growth Function
The most important fact about growth functions is that if the condition m_H(N) = 2^N breaks at any point, we can bound m_H(N) for all values of N by a simple polynomial based on this break point. The fact that the bound is polynomial is crucial. Absent a break point (as is the case in the convex hypothesis example), m_H(N) = 2^N for all N. If m_H(N) replaced M in Equation (2.1), the bound on the generalization error would not go to zero regardless of how many training examples N we have. However, if m_H(N) can be bounded by a polynomial (any polynomial), the generalization error will go to zero as N → ∞. This means that we will generalize well given a sufficient number of examples.
Begin safe skip: If you trust our math, you can skip the following part without compromising the logical sequence. A similar green box will tell you when to rejoin.
To prove the polynomial bound, we will introduce a combinatorial quantity that counts the maximum number of dichotomies given that there is a break point, without having to assume any particular form of H. This bound will therefore apply to any H.
Definition 2.4. B(N, k) is the maximum number of dichotomies on N points such that no subset of size k of the N points can be shattered by these dichotomies.

The definition of B(N, k) assumes a break point k, then tries to find the most dichotomies on N points without imposing any further restrictions. Since B(N, k) is defined as a maximum, it will serve as an upper bound for any m_H(N) that has a break point k:

    m_H(N) ≤ B(N, k)   if k is a break point for H.
The notation B comes from 'Binomial', and the reason will become clear shortly. To evaluate B(N, k), we start with the two boundary conditions k = 1 and N = 1:

    B(N, 1) = 1;
    B(1, k) = 2   for k > 1.
B(N, 1) = 1 for all N since, if no subset of size 1 can be shattered, then only one dichotomy can be allowed; a second, different dichotomy would have to differ on at least one point, and then that subset of size 1 would be shattered. B(1, k) = 2 for k > 1 since, in this case, there do not even exist subsets of size k; the constraint is vacuously true and we have 2 possible dichotomies (+1 and −1) on the one point.
We now assume N ≥ 2 and k ≥ 2 and try to develop a recursion. Consider the B(N, k) dichotomies in Definition 2.4, where no k points can be shattered. We list these dichotomies in the following table.
[Table: the B(N, k) dichotomies listed as rows of ±1 entries, one column per point x_1, x_2, ..., x_{N−1}, x_N; the rows are grouped into a set S_1 and a set S_2 = S_2^+ ∪ S_2^−, as described next, with the number of rows in each group indicated as α and β.]
where x_1, ..., x_N in the table are labels for the N points of the dichotomy. We have chosen a convenient order in which to list the dichotomies, as follows. Consider the dichotomies on x_1, ..., x_{N−1}. Some dichotomies on these N − 1 points appear only once (with either +1 or −1 in the x_N column, but not both). We collect these dichotomies in the set S_1. The remaining dichotomies on the first N − 1 points appear twice, once with +1 and once with −1 in the x_N column. We collect these dichotomies in the set S_2, which can be divided into two equal parts, S_2^+ and S_2^− (with +1 and −1 in the x_N column, respectively). Let S_1 have α rows, and let S_2^+ and S_2^− have β rows each. Since the total number of rows in the table is B(N, k) by construction, we have

    B(N, k) = α + 2β.   (2.4)

The total number of different dichotomies on the first N − 1 points is given by α + β; since S_2^+ and S_2^− are identical on these N − 1 points, their dichotomies are redundant. Since no subset of k of these first N − 1 points can
be shattered (since no k-subset of all N points can be shattered), we deduce that

    α + β ≤ B(N − 1, k)   (2.5)

by definition of B. Further, no subset of size k − 1 of the first N − 1 points can be shattered by the dichotomies in S_2^+. If there existed such a subset, then taking the corresponding set of dichotomies in S_2^− and adding x_N to the data points yields a subset of size k that is shattered, which we know cannot exist in this table by definition of B(N, k). Therefore,

    β ≤ B(N − 1, k − 1).   (2.6)

Substituting the two inequalities (2.5) and (2.6) into (2.4), we get

    B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1).   (2.7)
We can use (2.7) to recursively compute a bound on B(N, k), as shown in the following table.

                        k
              1    2    3    4    5    6   ...
        N = 1 1    2    2    2    2    2   ...
        N = 2 1    3    4    4    4    4   ...
        N = 3 1    4    7    8    8    8   ...
        N = 4 1    5   11    .
        N = 5 1    6    .
        N = 6 1    7    .

where the first row (N = 1) and the first column (k = 1) are the boundary conditions that we already calculated. We can also use the recursion to bound B(N, k) analytically.
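The recursion (2.7) and its boundary conditions translate directly into a small dynamic program. The sketch below (ours, not from the text) fills the table above and prints, next to each row, the binomial sums of Lemma 2.3 so the two can be compared; in fact equality holds (Problem 2.4).

    from math import comb

    def B_bound(N_max, k_max):
        # Upper bounds on B(N, k) from the boundary conditions and the
        # recursion (2.7): B(N, k) <= B(N-1, k) + B(N-1, k-1).
        B = [[0] * (k_max + 1) for _ in range(N_max + 1)]
        for N in range(1, N_max + 1):
            B[N][1] = 1                     # no subset of size 1 may be shattered
        for k in range(2, k_max + 1):
            B[1][k] = 2                     # constraint is vacuous on a single point
        for N in range(2, N_max + 1):
            for k in range(2, k_max + 1):
                B[N][k] = B[N - 1][k] + B[N - 1][k - 1]
        return B

    B = B_bound(6, 6)
    for N in range(1, 7):
        recursion_row = [B[N][k] for k in range(1, 7)]
        sauer_row = [sum(comb(N, i) for i in range(k)) for k in range(1, 7)]
        print(N, recursion_row, sauer_row)   # the two rows coincide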
Lemma 2.3 (Sauer's Lemma).

    B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i).

Proof. The statement is true whenever k = 1 or N = 1, by inspection. The proof is by induction on N. Assume the statement is true for all N ≤ N_0 and all k. We need to prove the statement for N = N_0 + 1 and all k. Since the statement is already true when k = 1 (for all values of N) by the initial condition, we only need to worry about k ≥ 2. By (2.7),

    B(N_0 + 1, k) ≤ B(N_0, k) + B(N_0, k − 1).
Applying the induction hypothesis to each term on the RHS, we get

    B(N_0 + 1, k) ≤ Σ_{i=0}^{k−1} (N_0 choose i) + Σ_{i=0}^{k−2} (N_0 choose i)
                  = 1 + Σ_{i=1}^{k−1} (N_0 choose i) + Σ_{i=1}^{k−1} (N_0 choose i−1)
                  = 1 + Σ_{i=1}^{k−1} [ (N_0 choose i) + (N_0 choose i−1) ]
                  = 1 + Σ_{i=1}^{k−1} (N_0+1 choose i)
                  = Σ_{i=0}^{k−1} (N_0+1 choose i),

where the combinatorial identity (N_0+1 choose i) = (N_0 choose i) + (N_0 choose i−1) has been used. This identity can be proved by noticing that to calculate the number of ways to pick i objects from N_0 + 1 distinct objects, either the first object is included, in (N_0 choose i−1) ways, or the first object is not included, in (N_0 choose i) ways. We have thus proved the induction step, so the statement is true for all N and k. ∎
It turns out that B(N, k) in fact equals Σ_{i=0}^{k−1} (N choose i) (Problem 2.4), but we only need the inequality of Lemma 2.3 to bound the growth function. For a given break point k, the bound Σ_{i=0}^{k−1} (N choose i) is polynomial in N, as each term in the sum is polynomial (of degree i ≤ k − 1). Since B(N, k) is an upper bound on any m_H(N) that has a break point k, we have proved
End safe skip: Those who skipped are now rejoining us. The next theorem states that any growth function m_H(N) with a break point is bounded by a polynomial.
Theorem 2.4. If m_H(k) < 2^k for some value k, then

    m_H(N) ≤ Σ_{i=0}^{k−1} (N choose i)   (2.8)

for all N. The RHS is polynomial in N of degree k − 1.
The implication of Theorem 2.4 is that if H has a break point, we have what we want to ensure good generalization: a polynomial bound on m_H(N).
Exercise 2.2

(a) Verify the bound of Theorem 2.4 in the three cases of Example 2.2:

    (i) Positive rays: H consists of all hypotheses in one dimension of the form h(x) = sign(x − a).
    (ii) Positive intervals: H consists of all hypotheses in one dimension that are positive within some interval and negative elsewhere.
    (iii) Convex sets: H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere.

    (Note: you can use the break points you found in Exercise 2.1.)
(b) Does there exist a hypothesis set for which m_H(N) = N + 2^⌊N/2⌋ (where ⌊N/2⌋ is the largest integer ≤ N/2)?
2.1.3 The VC Dimension

Theorem 2.4 bounds the entire growth function in terms of any break point. The smaller the break point, the better the bound. This leads us to the following definition of a single parameter that characterizes the growth function.

Definition 2.5. The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by d_vc(H) or simply d_vc, is the largest value of N for which m_H(N) = 2^N. If m_H(N) = 2^N for all N, then d_vc(H) = ∞.

If d_vc is the VC dimension of H, then k = d_vc + 1 is a break point for m_H, since m_H(N) cannot equal 2^N for any N > d_vc by definition. It is easy to see that no smaller break point exists, since H can shatter d_vc points, hence it can also shatter any subset of these points.
Exercise 2.3
Compute the VC dimension of H for the hypothesis sets in parts (i), (ii), (iii) of Exercise 2.2(a).
Since k = d_vc + 1 is a break point for m_H, Theorem 2.4 can be rewritten in terms of the VC dimension:

    m_H(N) ≤ Σ_{i=0}^{d_vc} (N choose i).   (2.9)

Therefore, the VC dimension is the order of the polynomial bound on m_H(N). It is also the best we can do using this line of reasoning, because no smaller break point than k = d_vc + 1 exists. The form of the polynomial bound can be further simplified to make the dependency on d_vc more salient. We state a useful form here, which can be proved by induction (Problem 2.5):

    m_H(N) ≤ N^{d_vc} + 1.   (2.10)
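As a quick numerical sanity check (ours, not from the text) of the simplified bound (2.10), one can verify that the binomial sum in (2.9) never exceeds N^{d_vc} + 1 for a range of small values; the general statement is Problem 2.5.

    from math import comb

    # Check sum_{i=0}^{dvc} C(N, i) <= N^dvc + 1 for small dvc and N.
    for dvc in range(1, 6):
        for N in range(1, 40):
            assert sum(comb(N, i) for i in range(dvc + 1)) <= N ** dvc + 1, (dvc, N)
    print("the bound (2.10) holds for all tested (dvc, N)")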
Now that the growth function has been bounded in terms of the VC dimension, we have only one more step left in our analysis, which is to replace the number of hypotheses M in the generalization bound (2.1) with the growth function m_H(N). If we manage to do that, the VC dimension will play a pivotal role in the generalization question. If we were to directly replace M by m_H(N) in (2.1), we would get a bound of the form
    E_out ≤ E_in + sqrt( (1/(2N)) ln( 2 m_H(N) / δ ) ).
Unless d_vc(H) = ∞, we know that m_H(N) is bounded by a polynomial in N; thus, ln m_H(N) grows logarithmically in N regardless of the order of the polynomial, and so it will be crushed by the 1/N factor. Therefore, for any fixed tolerance δ, the bound on E_out will be arbitrarily close to E_in for sufficiently large N.

Only if d_vc(H) = ∞ will this argument fail, as the growth function in this case is exponential in N. For any finite value of d_vc, the error bar will converge to zero at a speed determined by d_vc, since d_vc is the order of the polynomial. The smaller d_vc is, the faster the convergence to zero.
It turns out that we cannot just replace M with m_H(N) in the generalization bound (2.1); rather, we need to make other adjustments, as we will see shortly. However, the general idea above is correct, and d_vc will still play the role that we discussed here. One implication of this discussion is that there is a division of models into two classes. The 'good models' have finite d_vc, and for sufficiently large N, E_in will be close to E_out; for good models, the in-sample performance generalizes to out of sample. The 'bad models' have infinite d_vc. With a bad model, no matter how large the data set is, we cannot make generalization conclusions from E_in to E_out based on the VC analysis.²
Because of its significant role, it is worthwhile to try to gain some insight about the VC dimension before we proceed to the formalities of deriving the new generalization bound. One way to gain insight about d_vc is to try to compute it for learning models that we are familiar with. Perceptrons are one case where we can compute d_vc exactly. This is done in two steps. First, we show that d_vc is at least a certain value; then we show that it is at most the same value. There is a logical difference in arguing that d_vc is at least a certain value, as opposed to at most a certain value. This is because

    d_vc ≥ N  ⟺  there exists a data set D of size N such that H shatters D,

hence we have different conclusions in the following cases.

1. There is a set of N points that can be shattered by H. In this case, we can conclude that d_vc ≥ N.
² In some cases with infinite d_vc, such as the convex sets that we discussed, alternative analysis based on an 'average' growth function can establish good generalization behavior.
2. Any set of N points can be shattered by H. In this case, we have more than enough information to conclude that d_vc ≥ N.

3. There is a set of N points that cannot be shattered by H. Based only on this information, we cannot conclude anything about the value of d_vc.

4. No set of N points can be shattered by H. In this case, we can conclude that d_vc < N.
Exercise 2.4
Consider the input space X = {1} × R^d (including the constant coordinate x_0 = 1). Show that the VC dimension of the perceptron (with d + 1 parameters, counting w_0) is exactly d + 1 by showing that it is at least d + 1 and at most d + 1, as follows.

(a) To show that d_vc ≥ d + 1, find d + 1 points in X that the perceptron can shatter. [Hint: Construct a nonsingular (d + 1) × (d + 1) matrix whose rows represent the d + 1 points, then use the nonsingularity to argue that the perceptron can shatter these points.]

(b) To show that d_vc ≤ d + 1, show that no set of d + 2 points in X can be shattered by the perceptron. [Hint: Represent each point in X as a vector of length d + 1, then use the fact that any d + 2 vectors of length d + 1 have to be linearly dependent. This means that some vector is a linear combination of all the other vectors. Now, if you choose the class of these other vectors carefully, then the classification of the dependent vector will be dictated. Conclude that there is some dichotomy that cannot be implemented, and therefore that for N ≥ d + 2, m_H(N) < 2^N.]
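The hint in part (a) can be illustrated numerically. In the sketch below (ours, not from the text), the d + 1 chosen points are the origin and the d standard basis vectors, each with the constant coordinate x_0 = 1 prepended; since the resulting matrix X is nonsingular, solving Xw = y realizes every dichotomy y, so the perceptron shatters these points.

    import itertools
    import numpy as np

    d = 3
    # d + 1 points in X = {1} x R^d (our own choice): the origin and the d
    # standard basis vectors, each with the constant coordinate x0 = 1 prepended.
    X = np.column_stack([np.ones(d + 1), np.vstack([np.zeros(d), np.eye(d)])])
    assert abs(np.linalg.det(X)) > 1e-9            # the matrix is nonsingular

    for y in itertools.product([-1.0, 1.0], repeat=d + 1):
        w = np.linalg.solve(X, np.array(y))        # X w = y exactly
        assert np.array_equal(np.sign(X @ w), np.array(y))
    print("all", 2 ** (d + 1), "dichotomies on these", d + 1, "points are realized")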
The VC dimension of a d-dimensional perceptron³ is indeed d + 1. This is consistent with Figure 2.1 for the case d = 2, which shows a VC dimension of 3. The perceptron case provides a nice intuition about the VC dimension, since d + 1 is also the number of parameters in this model. One can view the VC dimension as measuring the 'effective' number of parameters. The more parameters a model has, the more diverse its hypothesis set is, which is reflected in a larger value of the growth function m_H(N). In the case of perceptrons, the effective parameters correspond to explicit parameters in the model, namely w_0, w_1, ..., w_d. In other models, the effective parameters may be less obvious or implicit. The VC dimension measures these effective parameters or 'degrees of freedom' that enable the model to express a diverse set of hypotheses.

Diversity is not necessarily a good thing in the context of generalization. For example, the set of all possible hypotheses is as diverse as can be, so m_H(N) = 2^N for all N and d_vc(H) = ∞. In this case, no generalization at all is to be expected, as the final version of the generalization bound will show.
³ X = {1} × R^d is considered d-dimensional since the first coordinate x_0 = 1 is fixed.
2.1.4 The VC Generalization Bound

If we treated the growth function as an effective number of hypotheses, and replaced M in the generalization bound (2.1) with m_H(N), the resulting bound would be

    E_out(g) ≤ E_in(g) + sqrt( (1/(2N)) ln( 2 m_H(N) / δ ) ).   (2.11)

It turns out that this is not exactly the form that will hold. The quantities in red need to be technically modified to make (2.11) true. The correct bound, which is called the VC generalization bound, is given in the following theorem; it holds for any binary target function f, any hypothesis set H, any learning algorithm A, and any input probability distribution P.
Theorem 2.5 (VC generalization bound). For any tolerance δ > 0,

    E_out(g) ≤ E_in(g) + sqrt( (8/N) ln( 4 m_H(2N) / δ ) )   (2.12)

with probability ≥ 1 − δ.
If you compare the blue items in (2.12) to their red counterparts in (2.11), you notice that all the blue items move the bound in the weaker direction. However, as long as the VC dimension is finite, the error bar still converges to zero (albeit at a slower rate), since m_H(2N) is also polynomial of order d_vc in N, just like m_H(N). This means that, with enough data, each and every hypothesis in an infinite H with a finite VC dimension will generalize well from E_in to E_out. The key is that the effective number of hypotheses, represented by the finite growth function, has replaced the actual number of hypotheses in the bound.
The VC generalization bound is the most important mathematical result in the theory of learning. It establishes the feasibility of learning with infinite hypothesis sets. Since the formal proof is somewhat lengthy and technical, we illustrate the main ideas in a sketch of the proof, and include the formal proof as an appendix. There are two parts to the proof: the justification that the growth function can replace the number of hypotheses in the first place, and the reason why we had to change the red items in (2.11) into the blue items in (2.12).
Sketch of the proof. The data set D is the source of randomization in the original Hoeffding Inequality. Consider the space of all possible data sets. Let us think of this space as a 'canvas' (Figure 2.2(a)). Each D is a point on that canvas. The probability of a point is determined by which x_n's in X happen to be in that particular D, and is calculated based on the distribution P over X. Let's think of probabilities of different events as areas on that canvas, so the total area of the canvas is 1.
Figure 2.2: Illustration of the proof of the VC bound, where the 'canvas' represents the space of all data sets, with areas corresponding to probabilities. (a) For a given hypothesis, the colored points correspond to data sets where E_in does not generalize well to E_out; the Hoeffding Inequality guarantees a small colored area. (b) Union Bound: for several hypotheses, the union bound assumes no overlaps, so the total colored area is large. (c) VC Bound: the VC bound keeps track of overlaps, so it estimates the total area of bad generalization to be relatively small.
For a given hypothesis h ∈ H, the event "|E_in(h) − E_out(h)| > ε" consists of all points D for which the statement is true. For a particular h, let us paint all these 'bad' points using one color. What the basic Hoeffding Inequality tells us is that the colored area on the canvas will be small (Figure 2.2(a)).

Now, if we take another h ∈ H, the event "|E_in(h) − E_out(h)| > ε" may contain different points, since the event depends on h. Let us paint these points with a different color. The area covered by all the points we colored will be at most the sum of the two individual areas, which is the case only if the two areas have no points in common. This is the worst case that the union bound considers. If we keep throwing in a new colored area for each h ∈ H, and never overlap with previous colors, the canvas will soon be mostly covered in color (Figure 2.2(b)). Even if each h contributed very little, the sheer number of hypotheses will eventually make the colored area cover the whole canvas. This was the problem with using the union bound in the Hoeffding Inequality (1.6), and not taking the overlaps of the colored areas into consideration.
The bulk of the VC proof deals with how to account for the overlaps. Here is the idea. If you were told that the hypotheses in H are such that each point on the canvas that is colored will be colored 100 times (because of 100 different h's), then the total colored area is now 1/100 of what it would have been if the colored points had not overlapped at all. This is the essence of the VC bound, as illustrated in Figure 2.2(c). The argument goes as follows.
Many hypotheses share the same dichotomy on a given D, since there are finitely many dichotomies even with an infinite number of hypotheses. Any statement based on D alone will be simultaneously true or simultaneously false for all the hypotheses that look the same on that particular D. What the growth function enables us to do is to account for this kind of hypothesis redundancy in a precise way, so we can get a factor similar to the '100' in the above example.
When H is infinite, the redundancy factor will also be infinite, since the hypotheses will be divided among a finite number of dichotomies. Therefore, the reduction in the total colored area when we take the redundancy into consideration will be dramatic. If it happens that the number of dichotomies is only a polynomial, the reduction will be so dramatic as to bring the total probability down to a very small value. This is the essence of the proof of Theorem 2.5.
The reason m_H(2N) appears in the VC bound instead of m_H(N) is that the proof uses a sample of 2N points instead of N points. Why do we need 2N points? The event "|E_in(h) − E_out(h)| > ε" depends not only on D, but also on the entire X, because E_out(h) is based on X. This breaks the main premise of grouping h's based on their behavior on D, since aspects of each h outside of D affect the truth of "|E_in(h) − E_out(h)| > ε." To remedy that, we consider the artificial event "|E_in(h) − E'_in(h)| > ε" instead, where E_in and E'_in are based on two samples D and D', each of size N. This is where the 2N comes from. It accounts for the total size of the two samples D and D'. Now, the truth of the statement "|E_in(h) − E'_in(h)| > ε" depends exclusively on the total sample of size 2N, and the above redundancy argument will hold.
Of course we have to justify why the two-sample condition "|E_in(h) − E'_in(h)| > ε" can replace the original condition "|E_in(h) − E_out(h)| > ε." In doing so, we end up having to shrink the ε's by a factor of 4, and we also end up with a factor of 2 in the estimate of the overall probability. This accounts for the 8/N instead of the 1/(2N) in the VC bound, and for having 4 instead of 2 as the multiplicative factor of the growth function. When you put all this together, you get the formula in (2.12). □
2.2 Interpreting the Generalization Bound

The VC generalization bound (2.12) is a universal result in the sense that it applies to all hypothesis sets, learning algorithms, input spaces, probability distributions, and binary target functions. It can be extended to other types of target functions as well. Given the generality of the result, one would suspect that the bound it provides may not be particularly tight in any given case, since the same bound has to cover a lot of different cases. Indeed, the bound is quite loose.
Exercise 2.5
Suppose we have a simple learning model whose growth function is m_H(N) = N + 1, hence d_vc = 1. Use the VC bound (2.12) to estimate the probability that E_out will be within 0.1 of E_in given 100 training examples. [Hint: The estimate will be ridiculous.]
Why is the VC bound so loose? The slack in the bound can be attributed to a number of technical factors. Among them,

1. The basic Hoeffding Inequality used in the proof already has a slack. The inequality gives the same bound whether E_out is close to 0.5 or close to zero. However, the variance of E_in is quite different in these two cases. Therefore, having one bound capture both cases will result in some slack.

2. Using m_H(N) to quantify the number of dichotomies on N points, regardless of which N points are in the data set, gives us a worst-case estimate. This does allow the bound to be independent of the probability distribution P over X. However, we would get a more tuned bound if we considered specific x_1, ..., x_N and used |H(x_1, ..., x_N)| or its expected value instead of the upper bound m_H(N). For instance, in the case of convex sets in two dimensions, which we examined in Example 2.2, if you pick N points at random in the plane, they will likely have far fewer dichotomies than 2^N, while m_H(N) = 2^N.

3. Bounding m_H(N) by a simple polynomial of order d_vc, as given in (2.10), will contribute further slack to the VC bound.
Some effort could be put into tightening the VC bound, but many highly technical attempts in the literature have resulted in only diminishing returns. The reality is that the VC line of analysis leads to a very loose bound. Why did we bother to go through the analysis then? Two reasons. First, the VC analysis is what establishes the feasibility of learning for infinite hypothesis sets, the only kind we use in practice. Second, although the bound is loose, it tends to be equally loose for different learning models, and hence is useful for comparing the generalization performance of these models. This is an observation from practical experience, not a mathematical statement. In real applications, learning models with lower d_vc tend to generalize better than those with higher d_vc. Because of this observation, the VC analysis proves useful in practice, and some rules of thumb have emerged in terms of the VC dimension. For instance, requiring that N be at least 10 × d_vc to get decent generalization is a popular rule of thumb.

Thus, the VC bound can be used as a guideline for generalization, relatively if not absolutely. With this understanding, let us look at the different ways the bound is used in practice.
2.2.1 Sample Complexity

The sample complexity denotes how many training examples N are needed to achieve a certain generalization performance. The performance is specified by two parameters, ε and δ. The error tolerance ε determines the allowed generalization error, and the confidence parameter δ determines how often the error tolerance ε is violated. How fast N grows as ε and δ become smaller⁴ indicates how much data is needed to get good generalization.
We can use the VC bound to estimate the sample complexity for a given learning model. Fix δ > 0, and suppose we want the generalization error to be at most ε. From Equation (2.12), the generalization error is bounded by sqrt( (8/N) ln( 4 m_H(2N) / δ ) ), so it suffices to make sqrt( (8/N) ln( 4 m_H(2N) / δ ) ) ≤ ε. It follows that

    N ≥ (8/ε²) ln( 4 m_H(2N) / δ )

suffices to obtain generalization error at most ε (with probability at least 1 − δ). This gives an implicit bound for the sample complexity N, since N appears on both sides of the inequality. If we replace m_H(2N) in (2.12) by its polynomial upper bound in (2.10), which is based on the VC dimension, we get a similar bound

    N ≥ (8/ε²) ln( 4 ((2N)^{d_vc} + 1) / δ ),   (2.13)
which is again implicit in N. We can obtain a numerical value for N using simple iterative methods.
Example 2.6. Suppose that we have a learning model with d_vc = 3 and would like the generalization error to be at most 0.1 with confidence 90% (so ε = 0.1 and δ = 0.1). How big a data set do we need? Using (2.13), we need

    N ≥ (8/0.1²) ln( 4 ((2N)³ + 1) / 0.1 ).

Trying an initial guess of N = 1,000 in the RHS, we get

    N ≥ (8/0.1²) ln( 4 ((2 × 1,000)³ + 1) / 0.1 ) ≈ 21,193.

We then try the new value N = 21,193 in the RHS and continue this iterative process, rapidly converging to an estimate of N ≈ 30,000. If d_vc were 4, a similar calculation will find that N ≈ 40,000. For d_vc = 5, we get N ≈ 50,000. You can see that the inequality suggests that the number of examples needed is approximately proportional to the VC dimension, as has been observed in practice. The constant of proportionality it suggests is 10,000, which is a gross overestimate; a more practical constant of proportionality is closer to 10. □
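The iteration in Example 2.6 is easy to automate. The following sketch (ours, not from the text) iterates the implicit bound (2.13) to a fixed point for the values of d_vc used in the example.

    from math import log

    def sample_complexity(dvc, eps, delta, n0=1000.0, iters=100):
        # Iterate N = (8 / eps^2) * ln(4 * ((2N)^dvc + 1) / delta), the implicit
        # bound (2.13), starting from an initial guess n0.
        N = n0
        for _ in range(iters):
            N = (8.0 / eps ** 2) * log(4.0 * ((2.0 * N) ** dvc + 1.0) / delta)
        return N

    for dvc in (3, 4, 5):
        print(dvc, round(sample_complexity(dvc, eps=0.1, delta=0.1)))
    # prints roughly 30,000, 40,000 and 50,000, as in Example 2.6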
⁴ The term 'complexity' comes from a similar metaphor in computational complexity.
2.2.2 Penalty for Model Complexity

Sample complexity fixes the performance parameters ε (generalization error) and δ (confidence parameter) and estimates how many examples N are needed. In most practical situations, however, we are given a fixed data set D, so N is also fixed. In this case, the relevant question is what performance can we expect given this particular N. The bound in (2.12) answers this question: with probability at least 1 − δ,

    E_out(g) ≤ E_in(g) + sqrt( (8/N) ln( 4 m_H(2N) / δ ) ).

If we use the polynomial bound based on d_vc instead of m_H(2N), we get another valid bound on the out-of-sample error,

    E_out(g) ≤ E_in(g) + sqrt( (8/N) ln( 4 ((2N)^{d_vc} + 1) / δ ) ).   (2.14)
Example 2.7. Suppose that N = 100 and we have a 90% confidence requirement (δ = 0.1). We could ask what error bar can we offer with this confidence, if H has d_vc = 1. Using (2.14), we have

    E_out(g) ≤ E_in(g) + sqrt( (8/100) ln( 4 × 201 / 0.1 ) ) ≈ E_in(g) + 0.848   (2.15)

with confidence ≥ 90%. This is a pretty poor bound on E_out. Even if E_in = 0, E_out may still be close to 1. If N = 1,000, then we get E_out(g) ≤ E_in(g) + 0.301, a somewhat more respectable bound. □
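The error bars in Example 2.7 come from evaluating the square-root term of (2.14) directly, as in this small sketch (ours, not from the text):

    from math import log, sqrt

    def vc_error_bar(N, dvc, delta):
        # The penalty term of (2.14), using the polynomial bound (2N)^dvc + 1
        # in place of m_H(2N).
        return sqrt(8.0 / N * log(4.0 * ((2.0 * N) ** dvc + 1.0) / delta))

    print(vc_error_bar(100, 1, 0.1))    # about 0.848, as in Example 2.7
    print(vc_error_bar(1000, 1, 0.1))   # about 0.301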
Let us look more closely at the two parts that make up the bound on E_out in (2.12). The first part is E_in, and the second part is a term that increases as the VC dimension of H increases:

    E_out(g) ≤ E_in(g) + Ω(N, H, δ),   (2.16)

where

    Ω(N, H, δ) = sqrt( (8/N) ln( 4 m_H(2N) / δ ) ) ≤ sqrt( (8/N) ln( 4 ((2N)^{d_vc} + 1) / δ ) ).

One way to think of Ω(N, H, δ) is that it is a penalty for model complexity. It penalizes us by worsening the bound on E_out when we use a more complex H (larger d_vc).
Figure 2.3: When we use a more complex learning model, one that has higher VC dimension d_vc, we are likely to fit the training data better, resulting in a lower in-sample error, but we pay a higher penalty for model complexity. A combination of the two, which estimates the out-of-sample error, thus attains a minimum at some intermediate d_vc*.
If someone manages to fit a simpler model with the same training error, they will get a more favorable estimate for E_out. The penalty Ω(N, H, δ) gets worse if we insist on higher confidence (lower δ), and it gets better when we have more training examples, as we would expect.

Although Ω(N, H, δ) goes up when H has a higher VC dimension, E_in is likely to go down with a higher VC dimension, as we have more choices within H to fit the data. Therefore, we have a tradeoff: more complex models help E_in and hurt Ω(N, H, δ). The optimal model is a compromise that minimizes a combination of the two terms, as illustrated informally in Figure 2.3.
2.2.3 The Test Set
As we have seen, the generalization bound gives us a loose estimate of the out-of-sample error E_out based on E_in. While the estimate can be useful as a guideline for the training process, it is next to useless if the goal is to get an accurate forecast of E_out. If you are developing a system for a customer, you need a more accurate estimate so that your customer knows how well the system is expected to perform.

An alternative approach that we alluded to in the beginning of this chapter is to estimate E_out by using a test set, a data set that was not involved in the training process. The final hypothesis g is evaluated on the test set, and the result is taken as an estimate of E_out. We would like to now take a closer look at this approach.

Let us call the error we get on the test set E_test. When we report E_test as our estimate of E_out, we are in fact asserting that E_test generalizes very well to E_out. After all, E_test is just a sample estimate like E_in. How do we know
that E_test generalizes well? We can answer this question with authority now that we have developed the theory of generalization in concrete mathematical terms.

The effective number of hypotheses that matters in the generalization behavior of E_test is 1. There is only one hypothesis as far as the test set is concerned, and that's the final hypothesis g that the training phase produced. This hypothesis would not change if we used a different test set, as it would if we used a different training set. Therefore, the simple Hoeffding Inequality is valid in the case of a test set. Had the choice of g been affected by the test set in any shape or form, it wouldn't be considered a test set any more and the simple Hoeffding Inequality would not apply.

Therefore, the generalization bound that applies to E_test is the simple Hoeffding Inequality with one hypothesis. This is a much tighter bound than the VC bound. For example, if you have 1,000 data points in the test set, E_test will be within ±5% of E_out with probability ≥ 98%. The bigger the test set you use, the more accurate E_test will be as an estimate of E_out.
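The figure of ±5% with probability ≥ 98% follows from the single-hypothesis Hoeffding bound P[|E_test − E_out| > ε] ≤ 2e^{−2ε²N}. A one-line check (ours, not from the text):

    from math import exp

    # Hoeffding bound for a single, fixed hypothesis:
    # P[|E_test - E_out| > eps] <= 2 * exp(-2 * eps^2 * N).
    N, eps = 1000, 0.05
    p_bad = 2.0 * exp(-2.0 * eps ** 2 * N)
    print(p_bad)    # about 0.013, so E_test is within +/-0.05 of E_out
                    # with probability greater than 98%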
Exercise 2.6
A data set has 600 examples. To properly test the performance of the final hypothesis, you set aside a randomly selected subset of 200 examples which are never used in the training phase; these form a test set. You use a learning model with 1,000 hypotheses and select the final hypothesis g based on the 400 training examples. We wish to estimate E_out(g). We have access to two estimates: E_in(g), the in-sample error on the 400 training examples; and E_test(g), the test error on the 200 test examples that were set aside.

(a) Using a 5% error tolerance (δ = 0.05), which estimate has the higher 'error bar'?

(b) Is there any reason why you shouldn't reserve even more examples for testing?
Another aspect that distinguishes the test set from the training set is that the test set is not biased. Both sets are finite samples that are bound to have some variance due to sample size, but the test set doesn't have an optimistic or pessimistic bias in its estimate of E_out. The training set has an optimistic bias, since it was used to choose a hypothesis that looked good on it. The VC generalization bound implicitly takes that bias into consideration, and that's why it gives a huge error bar. The test set just has straight finite-sample variance, but no bias. When you report the value of E_test to your customer and they try your system on new data, they are as likely to be pleasantly surprised as unpleasantly surprised, though quite likely not to be surprised at all.
There is a price to be paid for having a test set. The test set does not affect the outcome of our learning process, which only uses the training set. The test set just tells us how well we did. Therefore, if we set aside some
of the data points provided by the customer as a test set, we end up using fewer examples for training. Since the training set is used to select one of the hypotheses in H, training examples are essential to finding a good hypothesis. If we take a big chunk of the data for testing and end up with too few examples for training, we may not get a good hypothesis from the training part even if we can reliably evaluate it in the testing part. We may end up reporting to the customer, with high confidence mind you, that the g we are delivering is terrible. There is thus a tradeoff to setting aside test examples. We will address that tradeoff in more detail and learn some clever tricks to get around it in Chapter 4.

In some of the learning literature, E_test is used as synonymous with E_out. When we report experimental results in this book, we will often treat E_test based on a large test set as if it were E_out because of the closeness of the two quantities.
2.2.4 Other Target Types

Although the VC analysis was based on binary target functions, it can be extended to real-valued functions, as well as to other types of functions. The proofs in those cases are quite technical, and they do not add to the insight that the VC analysis of binary functions provides. Therefore, we will introduce an alternative approach that covers real-valued functions and provides new insights into generalization. The approach is based on bias-variance analysis, and will be discussed in the next section.

In order to deal with real-valued functions, we need to adapt the definitions of E_in and E_out that have so far been based on binary functions. We defined E_in and E_out in terms of binary error: either h(x) = f(x) or else h(x) ≠ f(x). If f and h are real-valued, a more appropriate error measure would gauge how far f(x) and h(x) are from each other, rather than just whether their values are exactly the same.
An error measure that is commonly used in this case is the squared error e(h(x), f(x)) = (h(x) − f(x))². We can define in-sample and out-of-sample versions of this error measure. The out-of-sample error is based on the expected value of the error measure over the entire input space X,

    E_out(h) = E_x[ (h(x) − f(x))² ],

while the in-sample error is based on averaging the error measure over the data set,

    E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − f(x_n))².

These definitions make E_in a sample estimate of E_out, just as it was in the case of binary functions. In fact, the error measure used for binary functions can also be expressed as a squared error.
Exercise 2.7
For binary target functions, show that P[h(x) ≠ f(x)] can be written as an expected value of a mean-squared error measure in the following cases.

(a) The convention used for the binary function is 0 or 1.
(b) The convention used for the binary function is ±1.

[Hint: The difference between (a) and (b) is just a scale.]
Just as the sample frequency of error converges to the overall probability of error per Hoeffding's Inequality, the sample average of squared error converges to the expected value of that error (assuming finite variance). This is a manifestation of what is referred to as the 'law of large numbers', and Hoeffding's Inequality is just one form of that law. The same issues of the data set size and the hypothesis set complexity come into play just as they did in the VC analysis.
2.3 Approximation-Generalization Tradeoff

The VC analysis showed us that the choice of H needs to strike a balance between approximating f on the training data and generalizing on new data. The ideal H is a singleton hypothesis set containing only the target function. Unfortunately, we are better off buying a lottery ticket than hoping to have this H. Since we do not know the target function, we resort to a larger model, hoping that it will contain a good hypothesis, and hoping that the data will pin down that hypothesis. When you select your hypothesis set, you should balance these two conflicting goals: to have some hypothesis in H that can approximate f, and to enable the data to zoom in on the right hypothesis.
The VC generalization bound is one way to look at this tradeoff. If H is too simple, we may fail to approximate f well and end up with a large in-sample error term. If H is too complex, we may fail to generalize well because of the large model complexity term. There is another way to look at the approximation-generalization tradeoff which we will present in this section. It is particularly suited for squared error measures, rather than the binary error used in the VC analysis. The new way provides a different angle: instead of bounding E_out by E_in plus a penalty term Ω, we will decompose E_out into two different error terms.
2.3.1 Bias and Variance

The bias-variance decomposition of out-of-sample error is based on squared error measures. The out-of-sample error is

    E_out(g^(D)) = E_x[ (g^(D)(x) − f(x))² ],   (2.17)
where E_x denotes the expected value with respect to x (based on the probability distribution on the input space X). We have made explicit the dependence of the final hypothesis g^(D) on the data set D, as this will play a key role in the current analysis. We can rid Equation (2.17) of the dependence on a particular data set by taking the expectation with respect to all data sets. We then get the expected out-of-sample error for our learning model, independent of any particular realization of the data set,

    E_D[ E_out(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))² ] ]
                        = E_x[ E_D[ (g^(D)(x) − f(x))² ] ]
                        = E_x[ E_D[ g^(D)(x)² ] − 2 E_D[ g^(D)(x) ] f(x) + f(x)² ].
The term E_D[ g^(D)(x) ] gives an 'average function', which we denote by ḡ(x). One can interpret ḡ(x) in the following operational way. Generate many data sets D_1, ..., D_K and apply the learning algorithm to each data set to produce final hypotheses g_1, ..., g_K. We can then estimate the average function for any x by ḡ(x) ≈ (1/K) Σ_{k=1}^{K} g_k(x). Essentially, we are viewing g(x) as a random variable, with the randomness coming from the randomness in the data set; ḡ(x) is the expected value of this random variable (for a particular x), and ḡ is a function, the average function, composed of these expected values. The function ḡ is a little counterintuitive; for one thing, ḡ need not be in the model's hypothesis set, even though it is the average of functions that are.
Exercise 2.8
(a) Show that if H is closed under linear combination (any linear combination of hypotheses in H is also a hypothesis in H), then ḡ ∈ H.
(b) Give a model for which the average function ḡ is not in the model's hypothesis set. [Hint: Use a very simple model.]
(c) For binary classification, do you expect ḡ to be a binary function?
We can now rewrite the expected out-of-sample error in terms of ḡ:

    E_D[ E_out(g^(D)) ] = E_x[ E_D[ g^(D)(x)² ] − 2 ḡ(x) f(x) + f(x)² ]
                        = E_x[ E_D[ g^(D)(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)² ]
                        = E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] + (ḡ(x) − f(x))² ],

where the last reduction follows since ḡ(x) is constant with respect to D. The term (ḡ(x) − f(x))² measures how much the average function that we would learn using different data sets D deviates from the target function that generated these data sets. This term is appropriately called the bias:

    bias(x) = (ḡ(x) − f(x))²,
as it measures how much our learning model is biased away from the target function.⁵ This is because ḡ has the benefit of learning from an unlimited number of data sets, so it is only limited in its ability to approximate f by the limitation in the learning model itself. The term E_D[ (g^(D)(x) − ḡ(x))² ] is the variance of the random variable g^(D)(x),

    var(x) = E_D[ (g^(D)(x) − ḡ(x))² ],

which measures the variation in the final hypothesis, depending on the data set. We thus arrive at the bias-variance decomposition of out-of-sample error,

    E_D[ E_out(g^(D)) ] = E_x[ bias(x) + var(x) ]
                        = bias + var,
where bias = E_x[ bias(x) ] and var = E_x[ var(x) ]. Our derivation assumed that the data was noiseless. A similar derivation with noise in the data would lead to an additional noise term in the out-of-sample error (Problem 2.22). The noise term is unavoidable no matter what we do, so the terms we are interested in are really the bias and var.

The approximation-generalization tradeoff is captured in the bias-variance decomposition. To illustrate, let's consider two extreme cases: a very small model (with one hypothesis) and a very large one with all hypotheses.
Very small model. Since there is only one hypothesis, both the average function ḡ and the final hypothesis g^(D) will be the same, for any data set. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target f, and unless we are extremely lucky, we expect a large bias.

Very large model. The target function is in H. Different data sets will lead to different hypotheses that agree with f on the data set, and are spread around f in the red region. Thus, bias ≈ 0 because ḡ is likely to be close to f. The var is large (heuristically represented by the size of the red region in the figure).
One can also view the variance as a measure of 'instability' in the learning model. Instability manifests in wild reactions to small variations or idiosyncrasies in the data, resulting in vastly different hypotheses.
⁵ What we call bias is sometimes called bias² in the literature.
Example 2.8. Consider a target function f(x) = sin(πx) and a data set of size N = 2. We sample x uniformly in [−1, 1] to generate a data set (x_1, y_1), (x_2, y_2), and fit the data using one of two models:

    H_0: set of all lines of the form h(x) = b;
    H_1: set of all lines of the form h(x) = ax + b.

For H_0, we choose the constant hypothesis that best fits the data (the horizontal line at the midpoint, b = (y_1 + y_2)/2). For H_1, we choose the line that passes through the two data points (x_1, y_1) and (x_2, y_2). Repeating this process with many data sets, we can estimate the bias and the variance. The figures which follow show the resulting fits on the same (random) data sets for both models.

[Figure: the fits produced by H_0 (left) and H_1 (right) on the same random data sets.]
With H_1, the learned hypothesis is wilder and varies extensively depending on the data set. The bias-var analysis is summarized in the next figures.
[Figure: the average hypothesis ḡ (red) for each model, with var(x) indicated by the gray shaded region ḡ(x) ± sqrt(var(x)). For H_0: bias = 0.50, var = 0.25. For H_1: bias = 0.21, var = 1.69.]
For H_1, the average hypothesis ḡ (red line) is a reasonable fit with a fairly small bias of 0.21. However, the large variability leads to a high var of 1.69, resulting in a large expected out-of-sample error of 1.90. With the simpler
model H_0, the fits are much less volatile and we have a significantly lower var of 0.25, as indicated by the shaded region. However, the average fit is now the zero function, resulting in a higher bias of 0.50. The total out-of-sample error has a much smaller expected value of 0.75. The simpler model wins by significantly decreasing the var at the expense of a smaller increase in bias. Notice that we are not comparing how well the red curves (the average hypotheses) fit the sine. These curves are only conceptual, since in real learning we do not have access to the multitude of data sets needed to generate them. We have one data set, and the simpler model results in a better out-of-sample error on average as we fit our model to just this one data set. However, the var term decreases as N increases, so if we get a bigger and bigger data set, the bias term will be the dominant part of E_out, and H_1 will win. □
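The bias and var values quoted in Example 2.8 can be reproduced approximately by Monte Carlo. The sketch below (ours, not from the text) generates many data sets of size 2, fits both models, estimates ḡ on a grid, and averages the squared deviations; with enough data sets it prints values close to 0.50/0.25 for H_0 and 0.21/1.69 for H_1.

    import numpy as np

    rng = np.random.default_rng(0)
    n_datasets = 20_000
    x_grid = np.linspace(-1, 1, 201)          # grid used to approximate E_x[.]
    f_grid = np.sin(np.pi * x_grid)

    def fit_h0(x, y):
        # H0: constant hypothesis, the horizontal line at the midpoint of the y's.
        return np.full_like(x_grid, y.mean())

    def fit_h1(x, y):
        # H1: the line through the two data points.
        a = (y[1] - y[0]) / (x[1] - x[0])
        return a * (x_grid - x[0]) + y[0]

    for name, fit in [("H0", fit_h0), ("H1", fit_h1)]:
        preds = np.empty((n_datasets, x_grid.size))
        for i in range(n_datasets):
            x = rng.uniform(-1, 1, size=2)    # a data set of size N = 2, no noise
            preds[i] = fit(x, np.sin(np.pi * x))
        g_bar = preds.mean(axis=0)            # estimate of the average function
        bias = np.mean((g_bar - f_grid) ** 2)
        var = np.mean(preds.var(axis=0))
        print(name, round(bias, 2), round(var, 2))   # ~0.50/0.25 and ~0.21/1.69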
The learning algorithm plays a role in the bias-variance analysis that it did not play in the VC analysis. Two points are worth noting.

1. By design, the VC analysis is based purely on the hypothesis set H, independently of the learning algorithm A. In the bias-variance analysis, both H and the algorithm A matter. With the same H, using a different learning algorithm can produce a different g^(D). Since g^(D) is the building block of the bias-variance analysis, this may result in different bias and var terms.

2. Although the bias-variance analysis is based on squared-error measure, the learning algorithm itself does not have to be based on minimizing the squared error. It can use any criterion to produce g^(D) based on D. However, once the algorithm produces g^(D), we measure its bias and variance using squared error.
Unfortunately, the bias and variance cannot be computed in practice, since they depend on the target function and the input probability distribution (both unknown). Thus, the bias-variance decomposition is a conceptual tool which is helpful when it comes to developing a model. There are two typical goals when we consider bias and variance. The first is to try to lower the variance without significantly increasing the bias, and the second is to lower the bias without significantly increasing the variance. These goals are achieved by different techniques, some principled and some heuristic. Regularization is one of these techniques that we will discuss in Chapter 4. Reducing the bias without increasing the variance requires some prior information regarding the target function to steer the selection of H in the direction of f, and this task is largely application-specific. On the other hand, reducing the variance without compromising the bias can be done through general techniques.
2.3.2 The Learning Curve

We close this chapter with an important plot that illustrates the tradeoffs that we have seen so far. The learning curves summarize the behavior of the
in-sample and out-of-sample errors as we vary the size of the training set. After learning with a particular data set D of size N, the final hypothesis g^(D) has in-sample error E_in(g^(D)) and out-of-sample error E_out(g^(D)), both of which depend on D. As we saw in the bias-variance analysis, the expectation with respect to all data sets of size N gives the expected errors: E_D[E_in(g^(D))] and E_D[E_out(g^(D))]. These expected errors are functions of N, and are called the learning curves of the model. We illustrate the learning curves for a simple learning model and a complex one, based on actual experiments.

[Figure: learning curves for a Simple Model (left) and a Complex Model (right), showing expected E_in and E_out versus the number of data points N.]
Notice that for the simple model, the learning curves converge more quickly, but to worse ultimate performance, than for the complex model. This behavior is typical in practice. For both simple and complex models, the out-of-sample learning curve is decreasing in N, while the in-sample learning curve is increasing in N. Let us take a closer look at these curves and interpret them in terms of the different approaches to generalization that we have discussed.

In the VC analysis, E_out was expressed as the sum of E_in and a generalization error that was bounded by Ω, the penalty for model complexity. In the bias-variance analysis, E_out was expressed as the sum of a bias and a variance. The following learning curves illustrate these two approaches side by side.
[Figure: two learning-curve plots of expected error versus the number of data points N, shown side by side: VC Analysis (E_out split into E_in, the in-sample error, plus the generalization error) and Bias-Variance Analysis.]
The VC analysis bounds the generalization error, which is illustrated on the left.⁶ The bias-variance analysis is illustrated on the right. The bias-variance illustration is somewhat idealized, since it assumes that, for every N, the average learned hypothesis ḡ has the same performance as the best approximation to f in the learning model.

When the number of data points increases, we move to the right on the learning curves and both the generalization error and the variance term shrink, as expected. The learning curve also illustrates an important point about E_in. As N increases, E_in edges toward the smallest error that the learning model can achieve in approximating f. For small N, the value of E_in is actually smaller than that 'smallest possible' error. This is because the learning model has an easier task for smaller N; it only needs to approximate f on the N points regardless of what happens outside those points. Therefore, it can achieve a superior fit on those points, albeit at the expense of an inferior fit on the rest of the points, as shown by the corresponding value of E_out.
⁶ For the learning curve, we take the expected values of all quantities with respect to D of size N.
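Learning curves like the ones above can be estimated by simulation. As an illustration (ours, not from the text), the sketch below reuses the two models of Example 2.8, now fit to N points by least squares, and averages E_in and E_out over many data sets for each N; the in-sample curve rises and the out-of-sample curve falls as N grows.

    import numpy as np

    rng = np.random.default_rng(1)
    x_grid = np.linspace(-1, 1, 501)

    def f(x):
        return np.sin(np.pi * x)

    def expected_errors(degree, N, trials=2000):
        # Average E_in and E_out of a least-squares polynomial fit of the given
        # degree (0 for H0, 1 for H1) over many data sets of size N.
        e_in = e_out = 0.0
        for _ in range(trials):
            x = rng.uniform(-1, 1, size=N)
            y = f(x)
            coefs = np.polyfit(x, y, degree)
            e_in += np.mean((np.polyval(coefs, x) - y) ** 2)
            e_out += np.mean((np.polyval(coefs, x_grid) - f(x_grid)) ** 2)
        return e_in / trials, e_out / trials

    for N in (2, 5, 10, 20, 50, 100):
        print(N, expected_errors(0, N), expected_errors(1, N))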
2.4 Problems

Problem 2.1   In Equation (2.1), set δ = 0.03 and let

    ε(M, N, δ) = sqrt( (1/(2N)) ln( 2M / δ ) ).
(a) For M = 1, how many examples do we need to make ε ≤ 0.05?
(b) For M = 100, how many examples do we need to make ε ≤ 0.05?
(c) For M = 10,000, how many examples do we need to make ε ≤ 0.05?
Problem 2.2   Show that for the learning model of positive rectangles (aligned horizontally or vertically), m_H(4) = 2^4 and m_H(5) < 2^5. Hence, give a bound for m_H(N).
Problem 2.3   Compute the maximum number of dichotomies, m_H(N), for these learning models, and consequently compute d_vc, the VC dimension.

(a) Positive or negative ray: H contains the functions which are +1 on [a, ∞) (for some a) together with those that are +1 on (−∞, a] (for some a).

(b) Positive or negative interval: H contains the functions which are +1 on an interval [a, b] and −1 elsewhere, or −1 on an interval [a, b] and +1 elsewhere.

(c) Two concentric spheres in R^d: H contains the functions which are +1 for a ≤ sqrt(x_1² + ⋯ + x_d²) ≤ b.
Problem 2.4   Show that B(N, k) = Σ_{i=0}^{k−1} (N choose i) by showing the other direction to Lemma 2.3, namely that

    B(N, k) ≥ Σ_{i=0}^{k−1} (N choose i).

To do so, construct a specific set of Σ_{i=0}^{k−1} (N choose i) dichotomies that does not shatter any subset of k variables. [Hint: Try limiting the number of −1's in each dichotomy.]
Problem 2.5   Prove by induction that

    Σ_{i=0}^{D} (N choose i) ≤ N^D + 1,

and hence that

    m_H(N) ≤ N^{d_vc} + 1.
Problem 2.6   Prove that for N ≥ d,

    Σ_{i=0}^{d} (N choose i) ≤ (eN/d)^d.

We suggest you first show the following intermediate steps.

(a) Σ_{i=0}^{d} (N choose i) ≤ Σ_{i=0}^{d} (N choose i) (N/d)^{d−i} ≤ (N/d)^d Σ_{i=0}^{N} (N choose i) (d/N)^i.

(b) Σ_{i=0}^{N} (N choose i) (d/N)^i ≤ e^d. [Hints: Binomial theorem; (1 + 1/x)^x ≤ e for x > 0.]

Hence, argue that m_H(N) ≤ (eN/d_vc)^{d_vc}.
Problem 2.7    Plot the bounds for m_H(N) given in Problems 2.5 and 2.6 for d_vc = 2 and d_vc = 5. When do you prefer one bound over the other?
Problem 2.8    Which of the following are possible growth functions m_H(N) for some hypothesis set:
    1 + N;   1 + N + (N choose 2);   2^N;   2^⌊√N⌋;   …
Problem 2.9    [hard] For the perceptron in d dimensions, show that
    m_H(N) = 2 Σ_{i=0}^{d} (N−1 choose i).
[Hint: Cover (1965) in Further Reading.]
Use this formula to verify that d_vc = d + 1 by evaluating m_H(d + 1) and m_H(d + 2). Plot m_H(N)/2^N for d = 10 and N ∈ [1, 40]. If you generate a random dichotomy on N points in 10 dimensions, give an upper bound on the probability that the dichotomy will be separable for N = 10, 20, 40.
Problem 2.10    Show that m_H(2N) ≤ m_H(N)^2, and hence obtain a generalization bound which only involves m_H(N).
Problem 2.11    Suppose m_H(N) = N + 1, so d_vc = 1. You have 100 training examples. Use the generalization bound to give a bound for Eout with confidence 90%. Repeat for N = 10,000.
Problem 2.12    For an H with d_vc = 10, what sample size do you need (as prescribed by the generalization bound) to have a 95% confidence that your generalization error is at most 0.05?
Problem 2.13
(a) Let H = {h_1, h_2, ..., h_M} with finite M. Prove that d_vc(H) ≤ log_2 M.
(b) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions d_vc(H_k), derive and prove the tightest upper and lower bound that you can get on d_vc(∩_{k=1}^{K} H_k).
(c) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions d_vc(H_k), derive and prove the tightest upper and lower bounds that you can get on d_vc(∪_{k=1}^{K} H_k).
Problem 2.14    Let H_1, H_2, ..., H_K be K hypothesis sets (K ≥ 1), each with the same finite VC dimension d_vc. Let H = H_1 ∪ H_2 ∪ ... ∪ H_K be the union of these models.
(a) Show that d_vc(H) < K(d_vc + 1).
(b) Suppose that ℓ satisfies 2^ℓ > 2K ℓ^{d_vc}. Show that d_vc(H) ≤ ℓ.
(c) Hence, show that
    d_vc(H) ≤ min( K(d_vc + 1), 7(d_vc + K) log_2(d_vc K) ).
That is, d_vc(H) = O( max(d_vc, K) log_2 max(d_vc, K) ) is not too bad.
Problem 2.15    The monotonically increasing hypothesis set is
    H = { h | x_1 ≥ x_2 ⟹ h(x_1) ≥ h(x_2) },
where x_1 ≥ x_2 if and only if the inequality is satisfied for every component.
(a) Give an example of a monotonic classifier in two dimensions, clearly showing the +1 and −1 regions.
(b) Compute m_H(N) and hence the VC dimension. [Hint: Consider a set of N points generated by first choosing one point, and then generating the next point by increasing the first component and decreasing the second component until N points are obtained.]
Problem 2.16    In this problem, we will consider X = R. That is, x is a one-dimensional variable. For a hypothesis set
    H = { h_c | h_c(x) = sign( Σ_{i=0}^{D} c_i x^i ) },
prove that the VC dimension of H is exactly (D + 1) by showing that
(a) There are (D + 1) points which are shattered by H.
(b) There are no (D + 2) points which are shattered by H.
Problem 2.17    The VC dimension depends on the input space as well as H. For a fixed H, consider two input spaces X_1 ⊆ X_2. Show that the VC dimension of H with respect to input space X_1 is at most the VC dimension of H with respect to input space X_2.
How can the result of this problem be used to answer part (b) in Problem 2.16? [Hint: How is Problem 2.16 related to a perceptron in D dimensions?]
Problem 2.18    The VC dimension of the perceptron hypothesis set corresponds to the number of parameters (w_0, w_1, ..., w_d) of the set, and this observation is 'usually' true for other hypothesis sets. However, we will present a counter-example here. Prove that the following hypothesis set for x ∈ R has an infinite VC dimension:
    H = { h_α | h_α(x) = (−1)^{⌊αx⌋}, where α ∈ R },
where ⌊A⌋ is the biggest integer ≤ A (the floor function). This hypothesis set has only one parameter α but 'enjoys' an infinite VC dimension. [Hint: Consider x_1, ..., x_N, where x_n = 10^n, and show how to implement an arbitrary dichotomy y_1, ..., y_N.]
Problem 2.19    This problem derives a bound for the VC dimension of a complex hypothesis set that is built from simpler hypothesis sets via composition. Let H_1, ..., H_K be hypothesis sets with VC dimensions d_1, ..., d_K. Fix h_1, ..., h_K, where h_i ∈ H_i. Define a vector z obtained from x to have components h_i(x). Note that x ∈ R^d, but z ∈ {−1, +1}^K. Let H̃ be a hypothesis set of functions that take inputs in R^K. So
    h̃ ∈ H̃ :  z ∈ R^K → {+1, −1},
and suppose that H̃ has VC dimension d̃.
We can apply a hypothesis in H̃ to the z constructed from (h_1, ..., h_K). This is the composition of the hypothesis set H̃ with (H_1, ..., H_K). More formally, the composed hypothesis set H = H̃ ∘ (H_1, ..., H_K) is defined by h ∈ H if
    h(x) = h̃(h_1(x), ..., h_K(x)),    h̃ ∈ H̃;  h_i ∈ H_i.
(a) Show that
    m_H(N) ≤ m_H̃(N) Π_{i=1}^{K} m_{H_i}(N).    (2.18)
[Hint: Fix N points x_1, ..., x_N and fix h_1, ..., h_K. This generates N transformed points z_1, ..., z_N. These z_1, ..., z_N can be dichotomized in at most m_H̃(N) ways, hence for fixed (h_1, ..., h_K), (x_1, ..., x_N) can be dichotomized in at most m_H̃(N) ways. Through the eyes of x_1, ..., x_N, at most how many hypotheses are there (effectively) in H_i? Use this bound to bound the effective number of K-tuples (h_1, ..., h_K) that need to be considered. Finally, argue that you can bound the number of dichotomies that can be implemented by the product of the number of possible K-tuples (h_1, ..., h_K) and the number of dichotomies per K-tuple.]
(b) Use the bound m(N) ≤ (eN/d_vc)^{d_vc} to get a bound for m_H(N) in terms of d̃, d_1, ..., d_K.
(c) Let D = d̃ + Σ_{i=1}^{K} d_i, and assume that D > 2e log_2 D. Show that
    d_vc(H) ≤ 2D log_2 D.
(d) If the H_i and H̃ are all perceptron hypothesis sets, show that
    d_vc(H) = O( dK log(dK) ).
In the next chapter, we will further develop the simple linear model. This linear
model is the building block of many other models, such as neural networks.
The results of this problem show how to bound the VC dimension of the more
complex models built in this manner.
Problem 2.20    There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ.
(a) Original VC-bound:
    ε ≤ sqrt( (8/N) ln( 4 m_H(2N) / δ ) ).
(b) Rademacher Penalty Bound:
    ε ≤ sqrt( 2 ln(2N m_H(N)) / N ) + sqrt( (2/N) ln(1/δ) ) + 1/N.
(c) Parrondo and Van den Broek:
    ε ≤ sqrt( (1/N) ( 2ε + ln( 6 m_H(2N) / δ ) ) ).
(d) Devroye:
    ε ≤ sqrt( (1/(2N)) ( 4ε(1 + ε) + ln( 4 m_H(N²) / δ ) ) ).
Note that (c) and (d) are implicit bounds in ε. Fix d_vc = 50 and δ = 0.05 and plot these bounds as a function of N. Which is best?
Problem 2.21    Assume the following theorem to hold
Theorem.    P[ (E_out(g) − E_in(g)) / sqrt(E_out(g)) > ε ] ≤ c · m_H(2N) exp(−ε²N/4),
where c is a constant that is a little bigger than 6.
This bound is useful because sometimes what we care about is not the absolute generalization error but instead a relative generalization error (one can imagine that a generalization error of 0.01 is more significant when E_out = 0.01 than when E_out = 0.5). Convert this to a generalization bound by showing that with probability at least 1 − δ,
    E_out(g) ≤ E_in(g) + (ξ/2) [ 1 + sqrt( 1 + 4 E_in(g)/ξ ) ],
where ξ = (4/N) ln( c · m_H(2N) / δ ).
Problem 2.22    When there is noise in the data, E_out(g^(D)) = E_{x,y}[ (g^(D)(x) − y(x))² ], where y(x) = f(x) + ε. If ε is a zero-mean noise random variable with variance σ², show that the bias-variance decomposition becomes
    E_D[ E_out(g^(D)) ] = σ² + bias + var.
Problem 2.23    Consider the learning problem in Example 2.8, where the input space is X = [−1, +1], the target function is f(x) = sin(πx), and the input probability distribution is uniform on X. Assume that the training set D has only two data points (picked independently), and that the learning algorithm picks the hypothesis that minimizes the in-sample mean squared error. In this problem, we will dig deeper into this case.
For each of the following learning models, find (analytically or numerically) (i) the best hypothesis that approximates f in the mean-squared-error sense (assume that f is known for this part), (ii) the expected value (with respect to D) of the hypothesis that the learning algorithm produces, and (iii) the expected out-of-sample error and its bias and var components.
(a) The learning model consists of all hypotheses of the form h(x) = ax + b (if you need to deal with the infinitesimal-probability case of two identical data points, choose the hypothesis tangential to f).
(b) The learning model consists of all hypotheses of the form h(x) = ax. This case was not covered in Example 2.8.
(c) The learning model consists of all hypotheses of the form h(x) = b.
Problem 2.24    Consider a simplified learning scenario. Assume that the input dimension is one. Assume that the input variable x is uniformly distributed in the interval [−1, 1]. The data set consists of 2 points {x_1, x_2} and assume that the target function is f(x) = x². Thus, the full data set is D = {(x_1, x_1²), (x_2, x_2²)}. The learning algorithm returns the line fitting these two points as g (H consists of functions of the form h(x) = ax + b). We are interested in the test performance (E[E_out]) of our learning system with respect to the squared error measure, the bias and the var.
(a) Give the analytic expression for the average function ḡ(x).
(b) Describe an experiment that you could run to determine (numerically) ḡ(x), E[E_out], bias, and var.
(c) Run your experiment and report the results. Compare E[E_out] (expectation is with respect to data sets) with bias + var. Provide a plot of your ḡ(x) and f(x) (on the same plot).
(d) Compute analytically what E[E_out], bias and var should be.
Chapter 3
The Linear Model
We often wonder how to draw a line between two categories; right versus wrong, personal versus professional life, useful email versus spam, to name a few. A line is intuitively our first choice for a decision boundary. In learning, as in life, a line is also a good first choice.
In Chapter 1, we (and the machine ☺) learned a procedure to 'draw a line' between two categories based on data (the perceptron learning algorithm). We started by taking the hypothesis set H that included all possible lines (actually hyperplanes). The algorithm then searched for a good line in H by iteratively correcting the errors made by the current candidate line, in an attempt to improve E_in. As we saw in Chapter 2, the linear model - the set of lines - has a small VC dimension and so is able to generalize well from E_in to E_out.
The aim of this chapter is to further develop the basic linear model into a powerful tool for learning from data. We branch into three important problems: the classification problem that we have seen and two other important problems called regression and probability estimation. The three problems come with different but related algorithms, and cover a lot of territory in learning from data. As a rule of thumb, when faced with learning problems, it is generally a winning strategy to try a linear model first.
3.1 Linear Classification
The linear model for classifying data into two classes uses a hypothesis set of linear classifiers, where each h has the form
    h(x) = sign(w^T x),
for some column vector w ∈ R^{d+1}, where d is the dimensionality of the input space, and the added coordinate x_0 = 1 corresponds to the bias 'weight' w_0 (recall that the input space X = {1} × R^d is considered d-dimensional since the added coordinate x_0 = 1 is fixed). We will use h and w interchangeably
to refer to the hypothesis when the context is clear. When we left Chapter 1, we had two basic criteria for learning:
1. Can we make sure that E_out(g) is close to E_in(g)? This ensures that what we have learned in sample will generalize out of sample.
2. Can we make E_in(g) small? This ensures that what we have learned in sample is a good hypothesis.
The first criterion was studied in Chapter 2. Specifically, the VC dimension of the linear model is only d + 1 (Exercise 2.4). Using the VC generalization bound (2.12), and the bound (2.10) on the growth function in terms of the VC dimension, we conclude that with high probability,
    E_out(g) = E_in(g) + O( sqrt( (d/N) ln N ) ).    (3.1)
Thus, when N is sufficiently large, E_in and E_out will be close to each other (see the definition of O(·) in the Notation table), and the first criterion for learning is fulfilled.
The second criterion, making sure that E_in is small, requires first and foremost that there is some linear hypothesis that has small E_in. If there isn't such a linear hypothesis, then learning certainly can't find one. So, let's suppose for the moment that there is a linear hypothesis with small E_in. In fact, let's suppose that the data is linearly separable, which means there is some hypothesis w* with E_in(w*) = 0. We will deal with the case when this is not true shortly.
In Chapter 1, we introduced the perceptron learning algorithm (PLA). Start with an arbitrary weight vector w(0). Then, at every time step t ≥ 0, select any misclassified data point (x(t), y(t)), and update w(t) as follows:
    w(t + 1) = w(t) + y(t) x(t).
The intuition is that the update is attempting to correct the error in classifying x(t). The remarkable thing is that this incremental approach of learning based on one data point at a time works. As discussed in Problem 1.3, it can be proved that the PLA will eventually stop updating, ending at a solution w_PLA with E_in(w_PLA) = 0. Although this result applies to a restricted setting (linearly separable data), it is a significant step. The PLA is clever - it doesn't naively test every linear hypothesis to see if it (the hypothesis) separates the data; that would take infinitely long. Using an iterative approach, the PLA manages to search an infinite hypothesis set and output a linear separator in (provably) finite time.
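In code, the update rule takes only a few lines. The following is a minimal sketch in Python (not the book's implementation); it assumes the data matrix X already carries the constant coordinate x_0 = 1 and that the data is linearly separable, so the loop eventually terminates.

    import numpy as np

    def pla(X, y, w=None):
        """Perceptron learning algorithm (sketch).
        X : (N, d+1) array whose first column is the constant x0 = 1.
        y : (N,) array of +1/-1 labels, assumed linearly separable.
        Returns a weight vector with zero in-sample error."""
        N, d1 = X.shape
        w = np.zeros(d1) if w is None else w.copy()
        while True:
            # indices of the currently misclassified points
            mis = np.where(np.sign(X @ w) != y)[0]
            if len(mis) == 0:              # E_in = 0, so stop
                return w
            n = np.random.choice(mis)      # pick any misclassified point
            w += y[n] * X[n]               # w(t+1) = w(t) + y(t) x(t)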
As far as PLA is concerned, linear separability is a property of the data, not the target. A linearly separable D could have been generated either from a linearly separable target, or (by chance) from a target that is not linearly separable. The convergence proof of PLA guarantees that the algorithm will
Figure 3.1: Data sets that are not linearly separable but are (a) linearly separable after discarding a few examples, or (b) separable by a more sophisticated curve.
work in both these cases, and produce a hypothesis with E_in = 0. Further, in both cases, you can be confident that this performance will generalize well out of sample, according to the VC bound.
Exercise 3.1
Will PLA ever stop updating if the data is not linearly separable?
3.1.1 Non-Separable Data
We now address the case where the data is not linearly separable. Figure 3.1 shows two data sets that are not linearly separable. In Figure 3.1(a), the data becomes linearly separable after the removal of just two examples, which could be considered noisy examples or outliers. In Figure 3.1(b), the data can be separated by a circle rather than a line. In both cases, there will always be a misclassified training example if we insist on using a linear hypothesis, and hence PLA will never terminate. In fact, its behavior becomes quite unstable, and can jump from a good perceptron to a very bad one within one update; the quality of the resulting E_in cannot be guaranteed. In Figure 3.1(a), it seems appropriate to stick with a line, but to somehow tolerate noise and output a hypothesis with a small E_in, not necessarily E_in = 0. In Figure 3.1(b), the linear model does not seem to be the correct model in the first place, and we will discuss a technique called nonlinear transformation for this situation in Section 3.4.
The situation in Figure 3.1(a) is actually encountered very often: even though a linear classifier seems appropriate, the data may not be linearly separable because of outliers or noise. To find a hypothesis with the minimum E_in, we need to solve the combinatorial optimization problem:
    min_{w ∈ R^{d+1}}  (1/N) Σ_{n=1}^{N} [[ sign(w^T x_n) ≠ y_n ]],    (3.2)
where the quantity being minimized is E_in(w).
The difficulty in solving this problem arises from the discrete nature of both sign(·) and [[·]]. In fact, minimizing E_in(w) in (3.2) in the general case is known to be NP-hard, which means there is no known efficient algorithm for it, and if you discovered one, you would become really, really famous ☺. Thus, one has to resort to approximately minimizing E_in.
One approach for getting an approximate solution is to extend PLA through a simple modification into what is called the pocket algorithm. Essentially, the pocket algorithm keeps 'in its pocket' the best weight vector encountered up to iteration t in PLA. At the end, the best weight vector will be reported as the final hypothesis. This simple algorithm is shown below.
The pocket algorithm:
1: Set the pocket weight vector ŵ to w(0) of PLA.
2: for t = 0, ..., T − 1 do
3:     Run PLA for one update to obtain w(t + 1).
4:     Evaluate E_in(w(t + 1)).
5:     If w(t + 1) is better than ŵ in terms of E_in, set ŵ to w(t + 1).
6: Return ŵ.
The original PLA only checks some of the examples using w(t) to identify (x(t), y(t)) in each iteration, while the pocket algorithm needs an additional step that evaluates all examples using w(t + 1) to get E_in(w(t + 1)). The additional step makes the pocket algorithm much slower than PLA. In addition, there is no guarantee for how fast the pocket algorithm can converge to a good E_in. Nevertheless, it is a useful algorithm to have on hand because of its simplicity. Other, more efficient approaches for obtaining good approximate solutions have been developed based on different optimization techniques, as shown later in this chapter.
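A minimal sketch of the pocket idea, layered on single PLA updates, might look as follows; this is illustrative only, and it assumes X carries the x_0 = 1 coordinate and y holds ±1 labels.

    import numpy as np

    def in_sample_error(w, X, y):
        # fraction of misclassified points
        return np.mean(np.sign(X @ w) != y)

    def pocket(X, y, T=1000):
        """Pocket algorithm (sketch): run PLA updates for T steps,
        keeping the best weight vector seen so far 'in the pocket'."""
        w = np.zeros(X.shape[1])               # w(0)
        w_pocket, best_err = w.copy(), in_sample_error(w, X, y)
        for _ in range(T):
            mis = np.where(np.sign(X @ w) != y)[0]
            if len(mis) == 0:
                return w                       # data separated, nothing to improve
            n = np.random.choice(mis)
            w = w + y[n] * X[n]                # one PLA update
            err = in_sample_error(w, X, y)     # extra pass over all examples
            if err < best_err:                 # better than the pocket weights?
                best_err, w_pocket = err, w.copy()
        return w_pocket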
Exercise 3.2
Take d = 2 and create a data set D of size N = 100 that is not linearly separable. You can do so by first choosing a random line in the plane as your target function and the inputs x_n of the data set as random points in the plane. Then, evaluate the target function on each x_n to get the corresponding output y_n. Finally, flip the labels of N/10 randomly selected y_n's and the data set will likely become non-separable.
Now, try the pocket algorithm on your data set using T = 1,000 iterations. Repeat the experiment 20 times. Then, plot the average E_in(w(t)) and the average E_in(ŵ) (which is also a function of t) on the same figure and see how they behave when t increases. Similarly, use a test set of size 1,000 and plot a figure to show how E_out(w(t)) and E_out(ŵ) behave.
Example 3.1 (Handwritten digit recognition). We sample some digits from the US Postal Service Zip Code Database. These 16 × 16 pixel images are preprocessed from the scanned handwritten zip codes. The goal is to recognize the digit in each image. We alluded to this task in part (b) of Exercise 1.1. A quick look at the images reveals that this is a non-trivial task (even for a human), and typical human E_out is about 2.5%. Common confusion occurs between the digits {4, 9} and {2, 7}. A machine-learned hypothesis which can achieve such an error rate would be highly desirable.
[Images of sample handwritten digits from the zip-code database.]
Let's first decompose the big task of separating ten digits into smaller tasks of separating two of the digits. Such a decomposition approach from multiclass to binary classification is commonly used in many learning algorithms. We will focus on digits {1, 5} for now. A human approach to determining the digit corresponding to an image is to look at the shape (or other properties) of the black pixels. Thus, rather than carrying all the information in the 256 pixels, it makes sense to summarize the information contained in the image into a few features. Let's look at two important features here: intensity and symmetry. Digit 5 usually occupies more black pixels than digit 1, and hence the average pixel intensity of digit 5 is higher. On the other hand, digit 1 is symmetric while digit 5 is not. Therefore, if we define asymmetry as the average absolute difference between an image and its flipped versions, and symmetry as the negation of asymmetry, digit 1 would result in a higher symmetry value. A scatter plot for these intensity and symmetry features for some of the digits is shown next.
While the digits can be roughly separated by a line in the plane representing these two features, there are poorly written digits (such as the '5' depicted in the top-left corner) that prevent a perfect linear separation.
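A rough sketch of how the two features might be computed from a 16 × 16 image is given below; the flip axis and the pixel scaling are assumptions for illustration, not the preprocessing actually used for the data set.

    import numpy as np

    def digit_features(img):
        """img: (16, 16) array of pixel intensities in [0, 1] (assumed).
        Returns (intensity, symmetry) as described in the text."""
        intensity = img.mean()                     # average pixel intensity
        flipped = np.fliplr(img)                   # horizontal flip (an assumption)
        asymmetry = np.abs(img - flipped).mean()   # average absolute difference
        symmetry = -asymmetry                      # symmetry = negation of asymmetry
        return intensity, symmetry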
We now run PLA and pocket on the data set and see what happens. Since the data set is not linearly separable, PLA will not stop updating. In fact, as can be seen in Figure 3.2(a), its behavior can be quite unstable. When it is forcibly terminated at iteration 1,000, PLA gives a line that has a poor E_in = 2.24% and E_out = 6.37%. On the other hand, if the pocket algorithm is applied to the same data set, as shown in Figure 3.2(b), we can obtain a line that has a better E_in = 0.45% and a better E_out = 1.89%. □
3.2 Linear Regression
Linear regression is another useful linear model that applies to real-valued target functions.¹ It has a long history in statistics, where it has been studied in great detail, and has various applications in social and behavioral sciences. Here, we discuss linear regression from a learning perspective, where we derive the main results with minimal assumptions.
Let us revisit our application in credit approval, this time considering a regression problem rather than a classification problem. Recall that the bank has customer records that contain information fields related to personal credit, such as annual salary, years in residence, outstanding loans, etc. Such variables can be used to learn a linear classifier to decide on credit approval. Instead of just making a binary decision (approve or not), the bank also wants to set a proper credit limit for each approved customer. Credit limits are traditionally determined by human experts. The bank wants to automate this task, as it did with credit approval.
¹Regression, a term inherited from earlier work in statistics, means y is real-valued.
Figure 3.2: Comparison of two linear classification algorithms for separating digits 1 and 5. E_in and E_out are plotted versus iteration number and below that is the learned hypothesis g. (a) A version of the PLA which selects a random training example and updates w if that example is misclassified (hence the flat regions when no update is made). This version avoids searching all the data at every iteration. (b) The pocket algorithm.
This is a regression learning problem. The bank uses historical records to construct a data set D of examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where x_n is customer information and y_n is the credit limit set by one of the human experts in the bank. Note that y_n is now a real number (positive in this case) instead of just a binary value ±1. The bank wants to use learning to find a hypothesis g that replicates how human experts determine credit limits.
Since there is more than one human expert, and since each expert may not be perfectly consistent, our target will not be a deterministic function y = f(x). Instead, it will be a noisy target formalized as a distribution of the random variable y that comes from the different views of different experts as well as the variation within the views of each expert. That is, the label y_n comes from some distribution P(y | x) instead of a deterministic function f(x). Nonetheless, as we discussed in previous chapters, the nature of the problem is not changed. We have an unknown distribution P(x, y) that generates
each (x_n, y_n), and we want to find a hypothesis g that minimizes the error between g(x) and y with respect to that distribution.
The choice of a linear model for this problem presumes that there is a linear combination of the customer information fields that would properly approximate the credit limit as determined by human experts. If this assumption does not hold, we cannot achieve a small error with a linear model. We will deal with this situation when we discuss nonlinear transformation later in the chapter.
3.2.1 The Algorithm
The linear regression algorithm is based on minimizing the squared error between h(x) and y:²
    E_out(h) = E[ (h(x) − y)² ],
where the expected value is taken with respect to the joint probability distribution P(x, y). The goal is to find a hypothesis that achieves a small E_out(h). Since the distribution P(x, y) is unknown, E_out(h) cannot be computed. Similar to what we did in classification, we resort to the in-sample version instead,
    E_in(h) = (1/N) Σ_{n=1}^{N} ( h(x_n) − y_n )².
In linear regression, h takes the form of a linear combination of the components of x. That is,
    h(x) = Σ_{i=0}^{d} w_i x_i = w^T x,
where x_0 = 1 and x ∈ {1} × R^d as usual, and w ∈ R^{d+1}. For the special case of linear h, it is very useful to have a matrix representation of E_in(h). First, define the data matrix X ∈ R^{N×(d+1)} to be the N × (d+1) matrix whose rows are the inputs x_n as row vectors, and define the target vector y ∈ R^N to be the column vector whose components are the target values y_n. The in-sample error is a function of w and the data X, y:
    E_in(w) = (1/N) Σ_{n=1}^{N} ( w^T x_n − y_n )²
            = (1/N) ‖Xw − y‖²    (3.3)
            = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y ),    (3.4)
where ‖·‖ is the Euclidean norm of a vector, and (3.3) follows because the nth component of the vector Xw − y is exactly w^T x_n − y_n. The linear regression
²The term 'linear regression' has been historically confined to squared error measures.
Figure 3.3: The solution hypothesis (in blue) of the linear regression algorithm in one dimension (a line) and in two dimensions. The sum of squared errors is minimized.
algorithm is derived by minimizing E_in(w) over all possible w ∈ R^{d+1}, as formalized by the following optimization problem:
    w_lin = argmin_{w ∈ R^{d+1}} E_in(w).    (3.5)
Figure 3.3 illustrates the solution in one and two dimensions. Since Equation (3.4) implies that E_in(w) is differentiable, we can use standard matrix calculus to find the w that minimizes E_in(w) by requiring that the gradient of E_in with respect to w is the zero vector, i.e., ∇E_in(w) = 0. The gradient is a (column) vector whose ith component is [∇E_in(w)]_i = ∂E_in(w)/∂w_i. By explicitly computing these components, the reader can verify the following gradient identities,
    ∇_w(w^T A w) = (A + A^T) w,    ∇_w(w^T b) = b.
These identities are the matrix analog of ordinary differentiation of quadratic and linear functions. To obtain the gradient of E_in, we take the gradient of each term in (3.4) to obtain
    ∇E_in(w) = (2/N) ( X^T X w − X^T y ).
Note that both w and ∇E_in(w) are column vectors. Finally, to get ∇E_in(w) to be 0, one should solve for w that satisfies
    X^T X w = X^T y.
If X^T X is invertible, w = X†y, where X† = (X^T X)^{−1} X^T is the pseudo-inverse of X. The resulting w is the unique optimal solution to (3.5). If X^T X is not
invertible, a pseudo-inverse can still be defined, but the solution will not be unique (see Problem 3.15). In practice, X^T X is invertible in most of the cases since N is often much bigger than d + 1, so there will likely be d + 1 linearly independent vectors x_n. We have thus derived the following linear regression algorithm.
Linear regression algorithm:
1: Construct the matrix X and the vector y from the data set (x_1, y_1), ..., (x_N, y_N), where each x includes the x_0 = 1 bias coordinate, as follows:
    X = [ x_1^T ; x_2^T ; ... ; x_N^T ]  (the N × (d+1) input data matrix),
    y = [ y_1 ; y_2 ; ... ; y_N ]  (the target vector).
2: Compute the pseudo-inverse X† of the matrix X. If X^T X is invertible,
    X† = (X^T X)^{−1} X^T.
3: Return w_lin = X†y.
This algorithm is sometimes referred to as ordinary least squares (OLS). It may seem that, compared with the perceptron learning algorithm, linear regression doesn't really look like 'learning', in the sense that the hypothesis w_lin comes from an analytic solution (matrix inversion and multiplications) rather than from iterative learning steps. Well, as long as the hypothesis w_lin has a decent out-of-sample error, then learning has occurred. Linear regression is a rare case where we have an analytic formula for learning that is easy to evaluate. This is one of the reasons why the technique is so widely used. It should be noted that there are methods for computing the pseudo-inverse directly without inverting a matrix, and that these methods are numerically more stable than matrix inversion.
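A minimal numpy sketch of the algorithm follows. Using a least-squares routine rather than explicitly inverting X^T X reflects the remark above about numerically stable computation; the helper name and the assumption that X already contains the x_0 = 1 column are ours.

    import numpy as np

    def linear_regression(X, y):
        """One-shot linear regression (sketch).
        X : (N, d+1) data matrix with the constant column x0 = 1.
        y : (N,) real-valued targets.
        Returns w_lin = X^+ y."""
        # lstsq solves min_w ||Xw - y||^2 via a stable factorization,
        # equivalent to pinv(X) @ y but without forming (X^T X)^{-1}.
        w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w_lin

    # Usage sketch (X0 is a hypothetical raw data matrix without the bias column):
    # X = np.column_stack([np.ones(len(X0)), X0])
    # w = linear_regression(X, y)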
Linear regression has been analyzed in great detail in statistics. We would like to mention one of the analysis tools here since it relates to in-sample and out-of-sample errors, and that is the hat matrix H. Here is how H is defined. The linear regression weight vector w_lin is an attempt to map the inputs X to the outputs y. However, w_lin does not produce y exactly, but produces an estimate
    ŷ = X w_lin,
which differs from y due to in-sample error. Substituting the expression for w_lin (assuming X^T X is invertible), we get
    ŷ = X (X^T X)^{−1} X^T y.
Therefore, the estimate ŷ is a linear transformation of the actual y through matrix multiplication with H, where
    H = X (X^T X)^{−1} X^T.    (3.6)
Since ŷ = Hy, the matrix H 'puts a hat' on y, hence the name. The hat matrix is a very special matrix. For one thing, H² = H, which can be verified using the above expression for H. This and other properties of H will facilitate the analysis of in-sample and out-of-sample errors of linear regression.
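These properties are easy to check numerically on a small random data matrix, as in the following illustrative sketch (ours, not part of the text).

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 20, 3
    X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])

    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix, Equation (3.6)

    print(np.allclose(H, H.T))                 # symmetric
    print(np.allclose(H @ H, H))               # H^2 = H (idempotent)
    print(np.isclose(np.trace(H), d + 1))      # trace(H) = d + 1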
Exercise 3.3
Consider the hat matrix H = X(X^T X)^{−1} X^T, where X is an N by d + 1 matrix, and X^T X is invertible.
(a) Show that H is symmetric.
(b) Show that H^K = H for any positive integer K.
(c) If I is the identity matrix of size N, show that (I − H)^K = I − H for any positive integer K.
(d) Show that trace(H) = d + 1, where the trace is the sum of diagonal elements. [Hint: trace(AB) = trace(BA).]
3.2.2 Generalization Issues
Linear regression looks for the optimal weight vector in terms of the in-sample error E_in, which leads to the usual generalization question: Does this guarantee decent out-of-sample error E_out? The short answer is yes. There is a regression version of the VC generalization bound (3.1) that similarly bounds E_out. In the case of linear regression in particular, there are also exact formulas for the expected E_out and E_in that can be derived under simplifying assumptions. The general form of the result is
    E_out(g) = E_in(g) + O(d/N),
where E_out(g) and E_in(g) are the expected values. This is comparable to the classification bound in (3.1).
Exercise 3.4
Consider a noisy target y = w*^T x + ε for generating the data, where ε is a noise term with zero mean and σ² variance, independently generated for every example (x, y). The expected error of the best possible linear fit to this target is thus σ².
For the data D = {(x_1, y_1), ..., (x_N, y_N)}, denote the noise in y_n as ε_n and let ε = [ε_1, ε_2, ..., ε_N]^T; assume that X^T X is invertible. By following
the steps below, show that the expected in-sample error of linear regression with respect to D is given by
    E_D[ E_in(w_lin) ] = σ² ( 1 − (d + 1)/N ).
(a) Show that the in-sample estimate of y is given by ŷ = Xw* + Hε.
(b) Show that the in-sample error vector y − ŷ can be expressed by a matrix times ε. What is the matrix?
(c) Express E_in(w_lin) in terms of ε using (b), and simplify the expression using Exercise 3.3(c).
(d) Prove that E_D[ E_in(w_lin) ] = σ²(1 − (d+1)/N) using (c) and the independence of ε_1, ..., ε_N. [Hint: The sum of the diagonal elements of a matrix (the trace) will play a role. See Exercise 3.3(d).]
For the expected out-of-sample error, we take a special case which is easy to analyze. Consider a test data set D_test = {(x_1, y'_1), ..., (x_N, y'_N)}, which shares the same input vectors x_n with D but with a different realization of the noise terms. Denote the noise in y'_n as ε'_n and let ε' = [ε'_1, ε'_2, ..., ε'_N]^T. Define E_test(w_lin) to be the average squared error on D_test.
(e) Prove that E_{D,ε'}[ E_test(w_lin) ] = σ²(1 + (d+1)/N).
The special test error E_test is a very restricted case of the general out-of-sample error. Some detailed analysis shows that similar results can be obtained for the general case, as shown in Problem 3.11.
Figure 3.4 illustrates the learning curve of linear regression under the assumptions of Exercise 3.4. The best possible linear fit has expected error σ². The expected in-sample error is smaller, equal to σ²(1 − (d+1)/N) for N ≥ d + 1. The learned linear fit has eaten into the in-sample noise as much as it could with the d + 1 degrees of freedom that it has at its disposal. This occurs because the fitting cannot distinguish the noise from the 'signal.' On the other hand, the expected out-of-sample error is σ²(1 + (d+1)/N), which is more than the unavoidable error of σ². The additional error reflects the drift in w_lin due to fitting the in-sample noise.
3.3 Logistic Regression
The core of the linear model is the 'signal' s = w^T x that combines the input variables linearly. We have seen two models based on this signal, and we are now going to introduce a third. In linear regression, the signal itself is taken as the output, which is appropriate if you are trying to predict a real response that could be unbounded. In linear classification, the signal is thresholded at zero to produce a ±1 output, appropriate for binary decisions. A third possibility, which has wide application in practice, is to output a probability,
a value between 0 and 1. Our new model is called logistic regression. It has similarities to both previous models, as the output is real (like regression) but bounded (like classification).
Example 3.2 (Prediction of heart attacks). Suppose we want to predict the occurrence of heart attacks based on a person's cholesterol level, blood pressure, age, weight, and other factors. Obviously, we cannot predict a heart attack with any certainty, but we may be able to predict how likely it is to occur given these factors. Therefore, an output that varies continuously between 0 and 1 would be a more suitable model than a binary decision. The closer y is to 1, the more likely that the person will have a heart attack. □
3.3.1 Predicting a Probability
Linear classification uses a hard threshold on the signal s = w^T x,
    h(x) = sign(w^T x),
while linear regression uses no threshold at all,
    h(x) = w^T x.
In our new model, we need something in between these two cases that smoothly restricts the output to the probability range [0, 1]. One choice that accomplishes this goal is the logistic regression model,
    h(x) = θ(w^T x),
where θ is the so-called logistic function θ(s) = e^s / (1 + e^s), whose output is between 0 and 1.
The output can be interpreted as a probability for a binary event (heart attack or no heart attack, digit '1' versus digit '5', etc.). Linear classification also deals with a binary event, but the difference is that the 'classification' in logistic regression is allowed to be uncertain, with intermediate values between 0 and 1 reflecting this uncertainty. The logistic function θ is referred to as a soft threshold, in contrast to the hard threshold in classification. It is also called a sigmoid because its shape looks like a flattened out 's'.
Exercise 3.5
Another popular soft threshold is the hyperbolic tangent
    tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s}).
(a) How is tanh related to the logistic function θ? [Hint: shift and scale]
(b) Show that tanh(s) converges to a hard threshold for large |s|, and converges to no threshold for small |s|. [Hint: Formalize the figure below.]
[Figure: tanh compared with the hard threshold and the linear (no-threshold) case.]
The specific formula of θ(s) will allow us to define an error measure for learning that has analytical and computational advantages, as we will see shortly. Let us first look at the target that logistic regression is trying to learn. The target is a probability, say of a patient being at risk for heart attack, that depends on the input x (the characteristics of the patient). Formally, we are trying to learn the target function
    f(x) = P[y = +1 | x].
The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability, e.g., patients who had heart attacks and patients who didn't. Therefore, the data is in fact generated by a noisy target P(y | x),
    P(y | x) = { f(x)        for y = +1;
                 1 − f(x)    for y = −1.    (3.7)
To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these noisy ±1 examples.
Error measure. The standard error measure e(h(x), y) used in logistic regression is based on the notion of likelihood: how 'likely' is it that we would get this output y from the input x if the target distribution P(y | x) was indeed captured by our hypothesis h(x)? Based on (3.7), that likelihood would be
    P(y | x) = { h(x)        for y = +1;
                 1 − h(x)    for y = −1.
We substitute for h(x) by its value θ(w^T x), and use the fact that 1 − θ(s) = θ(−s) (easy to verify) to get
    P(y | x) = θ(y w^T x).    (3.8)
One of our reasons for choosing the mathematical form θ(s) = e^s/(1 + e^s) is that it leads to this simple expression for P(y | x).
Since the data points (x_1, y_1), ..., (x_N, y_N) are independently generated, the probability of getting all the y_n's in the data set from the corresponding x_n's would be the product
    Π_{n=1}^{N} P(y_n | x_n).
The method of maximum likelihood selects the hypothesis h which maximizes this probability.³ We can equivalently minimize a more convenient quantity,
    −(1/N) ln( Π_{n=1}^{N} P(y_n | x_n) ) = (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) ),
since '−(1/N) ln(·)' is a monotonically decreasing function. Substituting with Equation (3.8), we would be minimizing
    (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n w^T x_n) )
with respect to the weight vector w. The fact that we are minimizing this quantity allows us to treat it as an 'error measure.' Substituting the functional form for θ(y_n w^T x_n) produces the in-sample error measure for logistic regression,
    E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} ).    (3.9)
The implied pointwise error measure is e(h(x_n), y_n) = ln(1 + e^{−y_n w^T x_n}). Notice that this error measure is small when y_n w^T x_n is large and positive, which would imply that sign(w^T x_n) = y_n. Therefore, as our intuition would expect, the error measure encourages w to 'classify' each x_n correctly.
³Although the method of maximum likelihood is intuitively plausible, its rigorous justification as an inference tool continues to be discussed in the statistics community.
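In code, the error measure of Equation (3.9) is essentially a one-liner; the sketch below is illustrative and assumes X carries the x_0 = 1 coordinate and y holds ±1 labels.

    import numpy as np

    def logistic_ein(w, X, y):
        """In-sample cross-entropy error, Equation (3.9):
        E_in(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))."""
        margins = y * (X @ w)                      # y_n w^T x_n for every n
        # np.logaddexp(0, -m) computes ln(1 + e^{-m}) without overflow
        return np.mean(np.logaddexp(0.0, -margins))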
Exercise 3.6 [Cross-entropy error measure]
(a) More generally, if we are learning from ±1 data to predict a noisy target P(y | x) with candidate hypothesis h, show that the maximum likelihood method reduces to the task of finding h that minimizes
    E_in(w) = (1/N) Σ_{n=1}^{N} ( [[y_n = +1]] ln(1/h(x_n)) + [[y_n = −1]] ln(1/(1 − h(x_n))) ).
(b) For the case h(x) = θ(w^T x), argue that minimizing the in-sample error in part (a) is equivalent to minimizing the one in (3.9).
For two probability distributions {p, 1 − p} and {q, 1 − q} with binary outcomes, the cross-entropy (from information theory) is
    p log(1/q) + (1 − p) log(1/(1 − q)).
The in-sample error in part (a) corresponds to a cross-entropy error measure on the data point (x_n, y_n), with p = [[y_n = +1]] and q = h(x_n).
For linear classification, we saw that minimizing E_in for the perceptron is a combinatorial optimization problem; to solve it, we introduced a number of algorithms such as the perceptron learning algorithm and the pocket algorithm. For linear regression, we saw that training can be done using the analytic pseudo-inverse algorithm for minimizing E_in by setting ∇E_in(w) = 0. These algorithms were developed based on the specific form of linear classification or linear regression, so none of them would apply to logistic regression.
To train logistic regression, we will take an approach similar to linear regression in that we will try to set ∇E_in(w) = 0. Unfortunately, unlike the case of linear regression, the mathematical form of the gradient of E_in for logistic regression is not easy to manipulate, so an analytic solution is not feasible.
Exercise 3.7
For logistic regression, show that
    ∇E_in(w) = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T x_n})
             = (1/N) Σ_{n=1}^{N} −y_n x_n θ(−y_n w^T x_n).
Argue that a 'misclassified' example contributes more to the gradient than a correctly classified one.
Instead of analytically setting the gradient to zero, we will iteratively set it to zero. To do so, we will introduce a new algorithm, gradient descent. Gradient
descent is a very general algorithm that can be used to train many other learning models with smooth error measures. For logistic regression, gradient descent has particularly nice properties.
3.3.2 Gradient Descent
Gradient descent is a general technique for minimizing a twice-differentiable function, such as E_in(w) in logistic regression. A useful physical analogy of gradient descent is a ball rolling down a hilly surface. If the ball is placed on a hill, it will roll down, coming to rest at the bottom of a valley. The same basic idea underlies gradient descent. E_in(w) is a 'surface' in a high-dimensional space. At step 0, we start somewhere on this surface, at w(0), and try to roll down this surface, thereby decreasing E_in. One thing which you immediately notice from the physical analogy is that the ball will not necessarily come to rest in the lowest valley of the entire surface. Depending on where you start the ball rolling, you will end up at the bottom of one of the valleys - a local minimum. In general, the same applies to gradient descent. Depending on your starting weights, the path of descent will take you to a local minimum in the error surface.
A particular advantage for logistic regression with the cross-entropy error is that the picture looks much nicer. There is only one valley! So, it does not matter where you start your ball rolling, it will always roll down to the same (unique) global minimum. This is a consequence of the fact that E_in(w) is a convex function of w, a mathematical property that implies a single 'valley.' This means that gradient descent will not be trapped in local minima when minimizing such convex error measures.⁴
Let's now determine how to 'roll' down the E_in-surface. We would like to take a step in the direction of steepest descent, to gain the biggest bang for our buck. Suppose that we take a small step of size η in the direction of a unit vector v̂. The new weights are w(0) + ηv̂. Since η is small, using the Taylor expansion to first order, we compute the change in E_in as
    ΔE_in = E_in(w(0) + ηv̂) − E_in(w(0))
          = η ∇E_in(w(0))^T v̂ + O(η²)
          ≥ −η ‖∇E_in(w(0))‖,
⁴In fact, the squared in-sample error in linear regression is also convex, which is why the analytic solution found by the pseudo-inverse is guaranteed to have optimal in-sample error.
where we have ignored the small term O(η²). Since v̂ is a unit vector, equality holds if and only if
    v̂ = − ∇E_in(w(0)) / ‖∇E_in(w(0))‖.    (3.10)
This direction, specified by v̂, leads to the largest decrease in E_in for a given step size η.
Exercise 3.8
The claim that v̂ is the direction which gives largest decrease in E_in only holds for small η. Why?
There is nothing to prevent us from continuing to take steps of size η, re-evaluating the direction v̂_t at each iteration t = 0, 1, 2, .... How large a step should one take at each iteration? This is a good question, and to gain some insight, let's look at the following examples.
[Figure: three scenarios for the step size - η too small; η too large; a variable η that is just right.]
A fixed step size (if it is too small) is inefficient when you are far from the local minimum. On the other hand, too large a step size when you are close to the minimum leads to bouncing around, possibly even increasing E_in. Ideally, we would like to take large steps when far from the minimum to get in the right ballpark quickly, and then small (more careful) steps when close to the minimum. A simple heuristic can accomplish this: far from the minimum, the norm of the gradient is typically large, and close to the minimum, it is small. Thus, we could set η_t = η‖∇E_in‖ to obtain the desired behavior for the variable step size; choosing the step size proportional to the norm of the gradient will also conveniently cancel the term normalizing the unit vector v̂ in Equation (3.10), leading to the fixed learning rate gradient descent algorithm for minimizing E_in (with redefined η):
Fixed learning rate gradient descent:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:     Compute the gradient g_t = ∇E_in(w(t)).
4:     Set the direction to move, v_t = −g_t.
5:     Update the weights: w(t + 1) = w(t) + ηv_t.
6:     Iterate to the next step until it is time to stop.
7: Return the final weights.
In the algorithm, v_t is a direction that is no longer restricted to unit length. The parameter η (the learning rate) has to be specified. A typically good choice for η is around 0.1 (a purely practical observation). To use gradient descent, one must compute the gradient. This can be done explicitly for logistic regression (see Exercise 3.7).
Example 3.3. Gradient descent is a general algorithm for minimizing twice-differentiable functions. We can apply it to the logistic regression in-sample error to return weights that approximately minimize
    E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} ).
Logistic regression algorithm:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:     Compute the gradient
           g_t = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T(t) x_n}).
4:     Set the direction to move, v_t = −g_t.
5:     Update the weights: w(t + 1) = w(t) + ηv_t.
6:     Iterate to the next step until it is time to stop.
7: Return the final weights w.
□
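A minimal Python rendering of this algorithm is sketched below; the iteration cap and learning rate are assumed values standing in for a full termination rule, and the code is illustrative rather than a reference implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))    # logistic function theta(s)

    def logistic_regression_gd(X, y, eta=0.1, max_iters=10000):
        """Batch gradient descent for logistic regression (sketch).
        X : (N, d+1) with bias column, y : (N,) labels in {-1, +1}."""
        N, d1 = X.shape
        w = np.zeros(d1)                    # w(0); zeros work well here
        for _ in range(max_iters):
            # g_t = -(1/N) sum_n y_n x_n theta(-y_n w^T x_n)   (Exercise 3.7)
            margins = y * (X @ w)
            g = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0)
            w = w - eta * g                 # move opposite to the gradient
        return w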
Initialization and termination. We have two more loose ends to tie: the first is how to choose w(0), the initial weights, and the second is how to set the criterion for "... until it is time to stop" in step 6 of the gradient descent algorithm. In some cases, such as logistic regression, initializing the weights w(0) as zeros works well. However, in general, it is safer to initialize the weights randomly, so as to avoid getting stuck on a perfectly symmetric hilltop. Choosing each weight independently from a Normal distribution with zero mean and small variance usually works well in practice.
That takes care of initialization, so we now move on to termination. How do we decide when to stop? Termination is a non-trivial topic in optimization. One simple approach, as we encountered in the pocket algorithm, is to set an upper bound on the number of iterations, where the upper bound is typically in the thousands, depending on the amount of training time we have. The problem with this approach is that there is no guarantee on the quality of the final weights.
Another plausible approach is based on the gradient being zero at any minimum. A natural termination criterion would be to stop once ‖g_t‖ drops below a certain threshold. Eventually this must happen, but we do not know when it will happen. For logistic regression, a combination of the two conditions (setting a large upper bound for the number of iterations, and a small lower bound for the size of the gradient) usually works well in practice.
There is a problem with relying solely on the size of the gradient to stop, which is that you might stop prematurely, as illustrated on the right. When the iteration reaches a relatively flat region (which is more common than you might suspect), the algorithm will prematurely stop when we may want to continue. So one solution is to require that termination occurs only if the error change is small and the error itself is small. Ultimately a combination of termination criteria (a maximum number of iterations, marginal error improvement, coupled with a small value for the error itself) works reasonably well.
Example 3.4. By way of summarizing linear models, we revisit our old friend the credit example. If the goal is to decide whether to approve or deny, then we are in the realm of classification; if you want to assign an amount of credit line, then linear regression is appropriate; if you want to predict the probability that someone will default, use logistic regression.
The three linear models have their respective goals, error measures, and algorithms. Nonetheless, they not only share similar sets of linear hypotheses, but are in fact related in other ways. We would like to point out one important relationship: Both logistic regression and linear regression can be used in linear classification. Here is how.
Logistic regression produces a final hypothesis g(x) which is our estimate of P[y = +1 | x]. Such an estimate can easily be used for classification by
setting a threshold on g(x); a natural threshold is 1/2, which corresponds to classifying +1 if +1 is more likely. This choice for threshold corresponds to using the logistic regression weights as weights in the perceptron for classification. Not only can logistic regression weights be used for classification in this way, but they can also be used as a way to train the perceptron model. The perceptron learning problem (3.2) is a very hard combinatorial optimization problem. The convexity of E_in in logistic regression makes the optimization problem much easier to solve. Since the logistic function is a soft version of a hard threshold, the logistic regression weights should be good weights for classification using the perceptron.
A similar relationship exists between classification and linear regression. Linear regression can be used with any real-valued target function, which includes real values that are ±1. If w_lin^T x is fit to ±1 values, sign(w_lin^T x) will likely agree with these values and make good classification predictions. In other words, the linear regression weights w_lin, which are easily computed using the pseudo-inverse, are also an approximate solution for the perceptron model. The weights can be directly used for classification, or used as an initial condition for the pocket algorithm to give it a head start. □
Exercise 3.9
Consider pointwise error measures e_class(s, y) = [[y ≠ sign(s)]], e_sq(s, y) = (y − s)², and e_log(s, y) = ln(1 + exp(−ys)), where the signal s = w^T x.
(a) For y = +1, plot e_class, e_sq and (1/ln 2) e_log versus s, on the same plot.
(b) Show that e_class(s, y) ≤ e_sq(s, y), and hence that the classification error is upper bounded by the squared error.
(c) Show that e_class(s, y) ≤ (1/ln 2) e_log(s, y), and hence obtain an upper bound (up to a constant factor) using the logistic regression error.
These bounds indicate that minimizing the squared or logistic regression error should also decrease the classification error, which justifies using the weights returned by linear or logistic regression as approximations for classification.
Stochastic gradient descent. The version of gradient descent we have described so far is known as batch gradient descent - the gradient is computed for the error on the whole data set before a weight update is done. A sequential version of gradient descent known as stochastic gradient descent (SGD) turns out to be very efficient in practice. Instead of considering the full batch gradient on all N training data points, we consider a stochastic version of the gradient. First, pick a training data point (x_n, y_n) uniformly at random (hence the name 'stochastic'), and consider only the error on that data point
(in the case of logistic regression),
    e_n(w) = ln( 1 + e^{−y_n w^T x_n} ).
The gradient of this single data point's error is used for the weight update in exactly the same way that the gradient was used in batch gradient descent. The gradient needed for the weight update of SGD is (see Exercise 3.7)
    ∇e_n(w) = −y_n x_n / (1 + e^{y_n w^T x_n}),
and the weight update is w ← w − η∇e_n(w). Insight into why SGD works can be gained by looking at the expected value of the change in the weight (the expectation is with respect to the random point that is selected). Since n is picked uniformly at random from {1, ..., N}, the expected weight change is
    −η (1/N) Σ_{n=1}^{N} ∇e_n(w).
This is exactly the same as the deterministic weight change from the batch gradient descent weight update. That is, 'on average' the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out. The computational cost is cheaper by a factor of N, though, since we compute the gradient for only one point per iteration, rather than for all N points as we do in batch gradient descent.
Notice that SGD is similar to PLA in that it decreases the error with respect to one data point at a time. Minimizing the error on one data point may interfere with the error on the rest of the data points that are not considered at that iteration. However, also similar to PLA, the interference cancels out on average as we have just argued.
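To make the update concrete, here is a minimal sketch of SGD for logistic regression (not from the text; the epoch-style random ordering and the fixed learning rate are illustrative choices):

```python
import numpy as np

def logistic_sgd(X, y, eta=0.1, n_epochs=100, seed=None):
    """SGD for logistic regression.

    X : (N, d+1) array of inputs (first column assumed to be the constant 1).
    y : (N,) array of +/-1 labels.
    Returns the learned weight vector w.
    """
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    w = np.zeros(d1)
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            # gradient of e_n(w) = ln(1 + exp(-y_n w^T x_n))
            grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * np.dot(w, X[n])))
            w -= eta * grad
    return w
```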
Exercise 3.10
(a) Define an error for a single data point (x_n, y_n) to be

        e_n(w) = max(0, −y_n w^T x_n).

    Argue that PLA can be viewed as SGD on e_n with learning rate η = 1.
(b) For logistic regression with a very large w, argue that minimizing E_in using SGD is similar to PLA. This is another indication that the logistic regression weights can be used as a good approximation for classification.
SGD is successful in practice, often bea ting th e batch version and o ther m ore
sophisticated algorithm s. In fact, SGD was an iiu jro rtan t p a r t of the algorithm
th a t won the m illion-dollar Netflix com petition , discussed in Section 1.1. It
scales well to large d a ta sets, and is n a tu ra lly su ited to online learning, where
a stream of data points presents itself to the learning algorithm sequentially. The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface. However, it is challenging to choose a suitable termination criterion for SGD. A good stopping criterion should consider the total error on all the data, which can be computationally demanding to evaluate at each iteration.
3.4 N onlinear Transform ation
All formulas for the linear model have used the sum

    w^T x = Σ_{i=0}^{d} w_i x_i                  (3.11)
as the main quantity in computing the hypothesis output. This quantity is linear, not only in the x_i's but also in the w_i's. A closer inspection of the corresponding learning algorithms shows that the linearity in the w_i's is the key property for deriving these algorithms; the x_i's are just constants as far as the algorithm is concerned. This observation opens the possibility for allowing nonlinear versions of the x_i's while still remaining in the analytic realm of linear models, because the form of Equation (3.11) remains linear in the w_i parameters.
Consider the credit limit problem for instance. It makes sense that the 'years in residence' field would affect a person's credit since it is correlated with stability. However, it is less plausible that the credit limit would grow linearly with the number of years in residence. More plausibly, there is a threshold (say 1 year) below which the credit limit is affected negatively and another threshold (say 5 years) above which the credit limit is affected positively. If x_i is the input variable that measures years in residence, then two nonlinear 'features' derived from it, namely [x_i < 1] and [x_i > 5], would allow a linear formula to reflect the credit limit better.
We have already seen the use of features in the classification of handwritten digits, where intensity and symmetry features were derived from input pixels. Nonlinear transforms can be further applied to those features, as we will see shortly, creating more elaborate features and improving the performance. The scope of linear methods expands significantly when we represent the input by a set of appropriate features.
3.4.1 The Z Space

Consider the situation in Figure 3.1(b) where a linear classifier can't fit the data. By transforming the inputs x_1, x_2 in a nonlinear fashion, we will be able to separate the data with more complicated boundaries while still using the
simple PLA as a building block. Let's start by looking at the circle in Figure 3.5(a), which is a replica of the non-separable case in Figure 3.1(b). Points on the circle satisfy the following equation:

    x_1^2 + x_2^2 = 0.6,

which means that the nonlinear hypothesis h(x) = sign(0.6 − x_1^2 − x_2^2) separates the data set perfectly. We can view the hypothesis as a linear one after applying a nonlinear transformation on x. In particular, consider z_0 = 1, z_1 = x_1^2 and z_2 = x_2^2,
    h(x) = sign( w̃_0 · 1 + w̃_1 · x_1^2 + w̃_2 · x_2^2 ) = sign( w̃^T z ),
where the vector z is ob ta ined from x th rough a nonlinear transfo rm $ ,
z = $ (x ) .
We can plot the data in terms of z instead of x, as depicted in Figure 3.5(b). For instance, the point x_1 in Figure 3.5(a) is transformed to the point z_1 in Figure 3.5(b) and the point x_2 is transformed to the point z_2. The space Z, which contains the z vectors, is referred to as the feature space since its coordinates are higher-level features derived from the raw input x. We designate different quantities in Z with a tilde version of their counterparts in X, e.g., the dimensionality of Z is d̃ and the weight vector is w̃.⁶ The transform Φ that takes us from X to Z is called a feature transform, which in this case is
    Φ(x) = (1, x_1^2, x_2^2).                  (3.12)
In general, some points in the Z space may not be valid transforms of any x ∈ X, and multiple points in X may be transformed to the same z ∈ Z, depending on the nonlinear transform Φ.
The usefulness of the transform above is that the nonlinear hypothesis h (circle) in the X space can be represented by a linear hypothesis (line) in the Z space. Indeed, any linear hypothesis h̃ in z corresponds to a (possibly nonlinear) hypothesis of x given by

    h(x) = h̃(Φ(x)).

⁶ z ∈ Z = {1} × R^{d̃}, where d̃ = 2 in this case. We treat Z as d̃-dimensional since the added coordinate z_0 = 1 is fixed.
100
3 . T h e L i n e a r M o d e l 3 . 4 . N o n l i n e a r T r a n s f o r m a t i o n
Figure 3.5: (a) The original data set that is not linearly separable, but separable by a circle. (b) The transformed data set z = Φ(x) that is linearly separable in the Z space. In the figure, x_1 maps to z_1 and x_2 maps to z_2; the circular separator in the X-space maps to the linear separator in the Z-space.
The set of these hypotheses h is denoted by H_Φ. For instance, when using the feature transform in (3.12), each h ∈ H_Φ is a quadratic separator in X that corresponds to some linear separator h̃ in Z.
Exercise 3.11
Consider the feature transform Φ in (3.12). What kind of boundary in X does a hyperplane w̃ in Z correspond to in the following cases? Draw a picture that illustrates an example of each case.
(a) w̃_1 > 0, w̃_2 < 0
(b) w̃_1 > 0, w̃_2 = 0
(c) w̃_1 > 0, w̃_2 > 0, w̃_0 < 0
(d) w̃_1 > 0, w̃_2 > 0, w̃_0 > 0
Because the transformed data set (z_1, y_1), ..., (z_N, y_N) in Figure 3.5(b) is linearly separable in the feature space Z, we can apply PLA on the transformed data set to obtain w̃_PLA, the PLA solution, which gives us a final hypothesis g(x) = sign(w̃_PLA^T z) in the X space, where z = Φ(x). The whole process of applying the feature transform before running PLA for linear classification is depicted in Figure 3.6.
The in-sample error in the input space X is the same as in the feature space Z, so E_in(g) = 0. Hyperplanes that achieve E_in(w̃_PLA) = 0 in Z correspond to separating curves in the original input space X. For instance,
1. Original data: x_n ∈ X.
2. Transform the data: z_n = Φ(x_n) ∈ Z.
3. Separate data in Z-space: g̃(z) = sign(w̃^T z).
4. Classify in X-space: g(x) = g̃(Φ(x)) = sign(w̃^T Φ(x)).
Figure 3.6: The nonlinear transform for separating non-separable data.
as shown in Figure 3.6, the PLA may select the line w̃_PLA = (0.6, −0.6, −1) that separates the transformed data (z_1, y_1), ..., (z_N, y_N). The corresponding hypothesis g(x) = sign(0.6 − 0.6·x_1^2 − x_2^2) will separate the original data (x_1, y_1), ..., (x_N, y_N). In this case, the decision boundary is an ellipse in X.
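The end-to-end recipe of Figure 3.6 is straightforward to code. The sketch below is illustrative only (the transform of (3.12) and a plain PLA loop with a fixed iteration cap are the assumed ingredients):

```python
import numpy as np

def phi(x):
    # feature transform of (3.12): x = (x1, x2) -> z = (1, x1^2, x2^2)
    return np.array([1.0, x[0] ** 2, x[1] ** 2])

def pla(Z, y, max_iter=10000):
    """Plain PLA on already-transformed data Z (N x 3), labels y in {-1, +1}."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        misclassified = np.where(np.sign(Z @ w) != y)[0]
        if len(misclassified) == 0:
            break
        n = misclassified[0]
        w = w + y[n] * Z[n]            # standard PLA update
    return w

def g(x, w_tilde):
    # final hypothesis in the X space: g(x) = sign(w~^T Phi(x))
    return np.sign(np.dot(w_tilde, phi(x)))
```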
How does the feature transform affect the VC bound (3.1)? If we honestly decide on the transform Φ before seeing the data, then with probability at least 1 − δ, the bound (3.1) remains true with d_VC(H_Φ) as the VC dimension. For instance, consider the feature transform Φ in (3.12). We know that Z = {1} × R^2. Since H_Φ is the perceptron in Z, d_VC(H_Φ) ≤ 3 (the ≤ is because some points z ∈ Z may not be valid transforms of any x, so some dichotomies may not be realizable). We can then substitute N, d_VC(H_Φ), and δ into the VC bound. After running PLA on the transformed data set, if we succeed in
getting some g with E_in(g) = 0, we can claim that g will perform well out of sample.
It is very important to understand that the claim above is valid only if you decide on Φ before seeing the data or trying any algorithms. What if we first try using lines to separate the data, fail, and then use the circles? Then we are effectively using a model that contains both lines and circles, and d_VC is no longer 3.
Exercise 3.12
We know that in the Euclidean plane, the perceptron model H cannot implement all 16 dichotomies on 4 points. That is, m_H(4) < 16. Take the feature transform Φ in (3.12).
(a) Show that m_{H_Φ}(3) = 8.
(b) Show that m_{H_Φ}(4) < 16.
(c) Show that m_{H ∪ H_Φ}(4) = 16.
That is, if you used lines, d_VC = 3; if you used ellipses, d_VC = 3; if you used lines and ellipses, d_VC > 3.
Worse yet, if you actually look at the data (e.g., look at the points in Figure 3.1(a)) before deciding on a suitable Φ, you forfeit most of what you learned in Chapter 2. You have inadvertently explored a huge hypothesis space in your mind to come up with a specific Φ that would work for this data set. If you invoke a generalization bound now, you will be charged for the VC dimension of the full space that you explored in your mind, not just the space that Φ creates.
This does not mean that Φ should be chosen blindly. In the credit limit problem for instance, we suggested nonlinear features based on the 'years in residence' field that may be more suitable for linear regression than the raw input. This was based on our understanding of the problem, not on 'snooping' into the training data. Therefore, we pay no price in terms of generalization, and we may well gain a dividend in performance because of a good choice of features.
The feature transform can be general, as long as it is chosen before seeing the data set (we cannot emphasize this enough). For instance, you may have noticed that the feature transform in (3.12) only allows us to get very limited types of quadratic curves. Ellipses that do not center at the origin in X cannot correspond to a hyperplane in Z. To get all possible quadratic curves in X, we could consider the more general feature transform z = Φ_2(x),

    Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2),                  (3.13)

which gives us the flexibility to represent any quadratic curve in X by a hyperplane in Z (the subscript 2 of Φ_2 is for polynomials of degree 2, i.e., quadratic curves). The price we pay is that Z is now five-dimensional instead of two-dimensional, and hence d_VC is doubled from 3 to 6.
E x e r c is e 3 .1 3
Consider the feature transform z = Φ_2(x) in (3.13). How can we use a hyperplane w̃ in Z to represent the following boundaries in X?
(a) The parabola (x_1 − 3)^2 + x_2 = 1.
(b) The circle (x_1 − 3)^2 + (x_2 − 4)^2 = 1.
(c) The ellipse 2(x_1 − 3)^2 + (x_2 − 4)^2 = 1.
(d) The hyperbola (x_1 − 3)^2 − (x_2 − 4)^2 = 1.
(e) The ellipse 2(x_1 + x_2 − 3)^2 + (x_1 − x_2 − 4)^2 = 1.
(f) The line 2x_1 + x_2 = 1.
One may further extend Φ_2 to a feature transform Φ_3 for cubic curves in X, or more generally define the feature transform Φ_Q for degree-Q curves in X. The feature transform Φ_Q is called the Qth order polynomial transform.
The power of the feature transform should be used with care. It may not be worth it to insist on linear separability and employ a highly complex surface to achieve that. Consider the case of Figure 3.1(a). If we insist on a feature transform that linearly separates the data, it may lead to a significant increase of the VC dimension. As we see in Figure 3.7, no line can separate the training examples perfectly, and neither can any quadratic nor any third-order polynomial curves. Thus, we need to use a fourth-order polynomial transform:

    Φ_4(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, x_1^4, x_1^3 x_2, x_1^2 x_2^2, x_1 x_2^3, x_2^4).
If you look a t the fourth-order decision bou n d ary in F igure 3.7(b), you d o n ’t
need the VC analysis to tell you th a t th is is an overkill th a t is unlikely to
generalize well to new d a ta . A b e tte r op tion would have been to ignore th e two
misclassified exam ples in F igure 3.7(a), sep ara te th e o ther exam ples perfectly
w ith the line, and accept the sm all b u t nonzero Ei„. Indeed, som etim es our
best bet is to go w ith a sim pler hypothesis set while to le ra ting a sm all Ei„.
While our discussion of feature transforms has focused on classification problems, these transforms can be applied equally to regression problems. Both linear regression and logistic regression can be implemented in the feature space Z instead of the input space X. For instance, linear regression is often coupled with a feature transform to perform nonlinear regression. The N by d+1 input matrix X in the algorithm is replaced with the N by d̃+1 matrix Z, while the output vector y remains the same.
3.4.2 Computation and Generalization

Although using a larger Q gives us more flexibility in terms of the shape of decision boundaries in X, there is a price to be paid. Computation is one issue, and generalization is the other.
Computation is an issue because the feature transform Φ_Q maps a two-dimensional vector x to d̃ = Q(Q+3)/2 dimensions, which increases the memory
Figure 3.7: Illustration of the nonlinear transform using a data set that is not linearly separable: (a) a line separates the data after omitting a few points, (b) a fourth-order polynomial separates all the points.
and com putational costs. Things could get worse if x is in a higher dimension
to begin with.
Exercise 3.14
Consider the Qth order polynomial transform Φ_Q for X = R^d. What is the dimensionality d̃ of the feature space Z (excluding the fixed coordinate z_0 = 1)? Evaluate your result on d ∈ {2, 3, 5, 10} and Q ∈ {2, 3, 5, 10}.
The other important issue is generalization. If Φ_Q is the feature transform of a two-dimensional input space, there will be d̃ = Q(Q+3)/2 dimensions in Z, and d_VC(H_Φ) can be as high as Q(Q+3)/2 + 1. This means that the second term in the VC bound (3.1) can grow significantly. In other words, we would have a weaker guarantee that E_out will be small. For instance, if we use Φ = Φ_50, the VC dimension of H_Φ could be as high as 50·53/2 + 1 = 1326 instead of the original d_VC = 3. Applying the rule of thumb that the amount of data needed is proportional to the VC dimension, we would need hundreds of times more data than we would if we didn't use a feature transform, in order to achieve the same level of generalization error.
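As a quick sanity check on these counts (not from the text; the monomial-counting formula d̃ = C(Q+d, d) − 1 is the assumption here), a few lines of code tabulate the dimensionality of Φ_Q and reproduce the 1326 figure quoted above for Q = 50, d = 2:

```python
from math import comb

def poly_dim(d, Q):
    # number of monomials of total degree 1..Q in d variables (excludes z_0 = 1)
    return comb(Q + d, d) - 1

for d in (2, 3, 5, 10):
    print(d, [poly_dim(d, Q) for Q in (2, 3, 5, 10)])

print(poly_dim(2, 50) + 1)   # 1326, the VC-dimension bound quoted for Phi_50
```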
Exercise 3.15
High-dimensional feature transforms are by no means the only transforms that we can use. We can take the tradeoff in the other direction, and use low-dimensional feature transforms as well (to achieve an even lower generalization error bar).
Consider the following feature transform, which maps a d-dimensional x to a one-dimensional z, keeping only the kth coordinate of x:

    Φ_(k)(x) = (1, x_k).                  (3.14)

Let H_k be the set of perceptrons in the feature space.
(a) Prove that d_VC(H_k) = 2.
(b) Prove that d_VC(∪_{k=1}^{d} H_k) ≤ 2(log_2 d + 1).
H_k is called the decision stump model on dimension k.
The problem of generalization when we go to a high-dimensional space is sometimes balanced by the advantage we get in approximating the target better. As we have seen in the case of using quadratic curves instead of lines, the transformed data became linearly separable, reducing E_in to 0. In general, when choosing the appropriate dimension for the feature transform, we cannot avoid the approximation-generalization tradeoff:

    higher d̃:  better chance of being linearly separable (E_in ↓), but d_VC ↑
    lower d̃:   possibly not linearly separable (E_in ↑), but d_VC ↓

Therefore, choosing a feature transform before seeing the data is a non-trivial task. When we apply learning to a particular problem, some understanding of the problem can help in choosing features that work well. More generally, there are some guidelines for choosing a suitable transform, or a suitable model, which we will discuss in Chapter 4.
Exercise 3.16
Write down the steps of the algorithm that combines Φ_3 with linear regression. How about using Φ_10 instead? Where is the main computational bottleneck of the resulting algorithm?
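For reference, one way such a combination might look in code (an illustrative sketch, not the book's pseudocode; the explicit monomial ordering for Φ_3 and the use of the pseudo-inverse are the assumptions):

```python
import numpy as np

def phi3(X):
    """Third-order polynomial transform Phi_3 for two-dimensional inputs.

    X : (N, 2) array of raw inputs (x1, x2).
    Returns the (N, 10) matrix Z with columns
    1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1**2, x1 * x2, x2**2,
                            x1**3, x1**2 * x2, x1 * x2**2, x2**3])

def linear_regression(Z, y):
    # pseudo-inverse solution w_lin = (Z^T Z)^{-1} Z^T y
    return np.linalg.pinv(Z) @ y

# usage: w = linear_regression(phi3(X), y); classify with sign(phi3(X_test) @ w)
```

The main cost is forming and pseudo-inverting Z; for Φ_10 the number of columns grows to 66, so the cost of the matrix factorization grows accordingly.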
Example 3.5. Let's revisit the handwritten digit recognition example. We can try a different way of decomposing the big task of separating ten digits into smaller tasks. One decomposition is to separate digit 1 from all the other digits. Using intensity and symmetry as our input variables like we did before, the scatter plot of the training data is shown next. A line can roughly separate digit 1 from the rest, but a more complicated curve might do better.
We use linear regression (for classification), first without any feature transform. The results are shown below (LHS). We get E_in = 2.13% and E_out = 2.38%.
Classification of the digits data ('1' versus 'not 1') using the linear model (left: E_in = 2.13%, E_out = 2.38%) and the third-order polynomial model (right: E_in = 1.75%, E_out = 1.87%).
When we run linear regression with Φ_3, the third-order polynomial transform, we obtain a better fit to the data, with a lower E_in = 1.75%. The result is depicted in the RHS of the figure. In this case, the better in-sample fit also resulted in a better out-of-sample performance, with E_out = 1.87%. □
L in e a r m o d e ls , a fin a l p i tc h . The linear model (for classification or regres
sion) is an often overlooked resource in the arena of learning from data. Since
efficient learning algorithm s exist for linear models, they are low overhead.
T hey are also very robust and have good generalization properties. A sound
policy to follow when learning from data is to first try a linear model. Because of the good generalization properties of linear models, not much can go wrong. If you get a good fit to the data (low E_in), then you are done. If you do not get a good enough fit to the data and decide to go for a more complex model, you will pay a price in terms of the VC dimension as we have seen in Exercise 3.12, but the price is modest.
3.5 Problem s
P ro b lem 3 .1 Consider the double semi-circle “toy” learning task below.
There are two semi-circles of width thk with inner radius rad, separated by
sep as shown (red is —1 and blue is -|-1). The center of the top semi-circle
is aligned with the middle of the edge of the bottom semi-circle. This task
is linearly separable when sep > 0, and not so for sep < 0. Set rad = 10,
thk = 5 and sep = 5. Then, generate 2, 000 examples uniformly, which means
you will have approximately 1,000 examples for each class.
(a ) Run the PLA starting from w = 0 until it converges. Plot the data and
the final hypothesis.
(b) Repeat part (a ) using the linear regression (for classification) to obtain w .
Explain your observations.
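A possible way to generate the data for this problem (an illustrative sketch; the placement of the offset between the two rings and the class-to-ring assignment are assumptions read off the figure):

```python
import numpy as np

def generate_semicircles(rad=10, thk=5, sep=5, n=2000, seed=None):
    """Double semi-circle data: top ring labelled +1, bottom (shifted) ring -1."""
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1, 1], size=n)                # roughly n/2 per class
    # uniform over the ring area: radius density proportional to r
    r = np.sqrt(rng.uniform(rad**2, (rad + thk)**2, size=n))
    theta = rng.uniform(0, np.pi, size=n)
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    neg = labels == -1
    x[neg] = x[neg] + rad + thk / 2                     # shift bottom ring right
    y[neg] = -y[neg] - sep                              # reflect and push down by sep
    return np.column_stack([x, y]), labels
```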
P r o b le m 3 .2 For the double-semi-circle task in Problem 3.1, vary sep in
the range {0 .2 , 0 . 4 , . . . , 5 }. Generate 2, 000 examples and run the PLA starting
with w = 0. Record the number of iterations PLA takes to converge.
Plot sep versus the number of iterations taken for PLA to converge. Explain
your observations. [Hint: Problem 1.3.]
P ro b lem 3 .3 For the double-semi-circle task in Problem 3.1, set sep = —5
and generate 2,000 examples.
(a) W hat will happen if you run PLA on those examples?
(b) Run the pocket algorithm for 100,000 iterations and plot E_in versus the iteration number t.
(c) Plot the data and the final hypothesis in part (b).
( c o n t i n u e d o n n e x t p a g e )
1 0 9
3. T h e L i n e a r M o d e l 3.5. P r o b l e m s
(d ) Use the linear regression algorithm to obtain the weights w , and compare
this result with the pocket algorithm in terms of computation tim e and
quality of the solution.
(e) Repeat (b ) - (d ) with a 3rd order polynomial feature transform.
Problem 3.4    In Problem 1.5, we introduced the Adaptive Linear Neuron (Adaline) algorithm for classification. Here, we derive Adaline from an optimization perspective.
(a) Consider e_n(w) = (max(0, 1 − y_n w^T x_n))^2. Show that e_n(w) is continuous and differentiable. Write down the gradient ∇e_n(w).
(b) Show that e_n(w) is an upper bound for [sign(w^T x_n) ≠ y_n]. Hence, (1/N) Σ_{n=1}^{N} e_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Argue that the Adaline algorithm in Problem 1.5 performs stochastic gradient descent on (1/N) Σ_{n=1}^{N} e_n(w).
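For part (c), a minimal sketch of what one SGD step on this e_n looks like (illustrative; the factor of 2 from the gradient can be absorbed into the learning rate η):

```python
import numpy as np

def adaline_sgd_step(w, x_n, y_n, eta=0.01):
    """One SGD step on e_n(w) = (max(0, 1 - y_n w^T x_n))^2.

    The gradient is -2 y_n x_n max(0, 1 - y_n w^T x_n), so the update only
    fires when the margin y_n w^T x_n is below 1.
    """
    if y_n * np.dot(w, x_n) < 1:
        w = w + 2 * eta * (y_n - np.dot(w, x_n)) * x_n   # uses y_n^2 = 1
    return w
```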
Problem 3.5
(a) Consider

        e_n(w) = max(0, 1 − y_n w^T x_n).

    Show that e_n(w) is continuous and differentiable except when w^T x_n = y_n.
(b) Show that e_n(w) is an upper bound for [sign(w^T x_n) ≠ y_n]. Hence, (1/N) Σ_{n=1}^{N} e_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Apply stochastic gradient descent on (1/N) Σ_{n=1}^{N} e_n(w) (ignoring the singular case of w^T x_n = y_n) and derive a new perceptron learning algorithm.
Problem 3.6    Derive a linear programming algorithm to fit a linear model for classification using the following steps. A linear program is an optimization problem of the following form:

    min_z   c^T z
    subject to   Az ≤ b.

A, b and c are parameters of the linear program and z is the optimization variable. This is a well studied optimization problem and most mathematics software packages will have canned optimization functions to solve linear programs.
(a) For linearly separable data, show that for some w, y_n(w^T x_n) ≥ 1 for n = 1, ..., N.
(b ) Formulate the task of finding a separating w for separable data as a linear
program. You need to specify what the parameters A ,b , c are and what
the optimization variable z is.
(c) If the data is not separable, the condition in (a) cannot hold for every n. Thus, introduce the violation ξ_n ≥ 0 to capture the amount of violation for example x_n. So, for n = 1, ..., N,

        y_n(w^T x_n) ≥ 1 − ξ_n,
        ξ_n ≥ 0.

    Naturally, we would like to minimize the amount of violation. One intuitive approach is to minimize Σ_{n=1}^{N} ξ_n, i.e., we want w and ξ that solve

        min_{w, ξ}   Σ_{n=1}^{N} ξ_n
        subject to   y_n(w^T x_n) ≥ 1 − ξ_n,
                     ξ_n ≥ 0,

    where the inequalities must hold for n = 1, ..., N. Formulate this problem as a linear program.
(d) Argue that the linear program you derived in (c) and the optimization
problem in Problem 3.5 are equivalent.
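One possible concrete formulation (a sketch, not part of the problem; scipy's generic LP solver is used and stacking z = (w, ξ) is the assumed encoding):

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X, y):
    """Fit a linear classifier via the LP of part (c).

    X : (N, d+1) inputs including the bias coordinate; y : (N,) labels in {-1, +1}.
    Variables z = (w, xi); minimize sum(xi) s.t. y_n w^T x_n >= 1 - xi_n, xi_n >= 0.
    """
    N, d1 = X.shape
    c = np.concatenate([np.zeros(d1), np.ones(N)])        # minimize total violation
    # -y_n x_n^T w - xi_n <= -1  encodes  y_n w^T x_n >= 1 - xi_n
    A_ub = np.hstack([-y[:, None] * X, -np.eye(N)])
    b_ub = -np.ones(N)
    bounds = [(None, None)] * d1 + [(0, None)] * N         # w free, xi >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d1]                                      # the weight vector w
```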
P r o b le m 3 .7 Use the linear programming algorithm from Problem 3.6
on the learning task in Problem 3.1 for the separable (sep = 5) and the non-
separable (sep = —5) cases.
Compare your results to the linear regression approach with and without the
3rd order polynomial feature transform.
Problem 3.8    For linear regression, the out-of-sample error is

    E_out(h) = E[(h(x) − y)^2].

Show that among all hypotheses, the one that minimizes E_out is given by

    h*(x) = E[y | x].

The function h* can be treated as a deterministic target function, in which case we can write y = h*(x) + ε(x), where ε(x) is an (input dependent) noise variable. Show that ε(x) has expected value zero.
Problem 3.9    Assuming that X^T X is invertible, show by direct comparison with Equation (3.4) that E_in(w) can be written as

    E_in(w) = (1/N) [ (w − (X^T X)^{−1} X^T y)^T (X^T X) (w − (X^T X)^{−1} X^T y) + y^T (I − X(X^T X)^{−1} X^T) y ].

Use this expression for E_in to obtain w_lin. What is the in-sample error? [Hint: The matrix X^T X is positive definite.]
Problem 3.10    Exercise 3.3 studied some properties of the hat matrix H = X(X^T X)^{−1} X^T, where X is an N by d+1 matrix, and X^T X is invertible.
Show the following additional properties.
(a ) Every eigenvalue of H is either 0 or 1. [Hint: Exercise 3.3(b).]
(b ) Show that the trace of a symmetric matrix equals the sum of its eigen
values. [Hint: Use the spectral theorem and the cyclic property o f the
trace. Note that the same result holds for non-symmetric matrices, but
is a little harder to prove.]
(c) How many eigenvalues of H are 1? W hat is the rank of H? [Hint:
Exercise 3.3(d).]
Problem 3.11    Consider the linear regression problem setup in Exercise 3.4, where the data comes from a genuine linear relationship with added noise. The noise for the different data points is assumed to be iid with zero mean and variance σ^2. Assume that the 2nd moment matrix Σ = E_x[x x^T] is non-singular. Follow the steps below to show that, with high probability, the out-of-sample error on average is

    E_out(w_lin) = σ^2 (1 + (d+1)/N + o(1/N)).

(a) For a test point x, show that the error y − g(x) is

        ε − x^T (X^T X)^{−1} X^T e,

    where ε is the noise realization for the test point and e is the vector of noise realizations on the data.
(b) Take the expectation with respect to the test point, i.e., x and ε, to obtain an expression for E_out. Show that

        E_out = σ^2 + trace( Σ (X^T X)^{−1} X^T e e^T X (X^T X)^{−1} ).

    [Hints: a = trace(a) for any scalar a; trace(AB) = trace(BA); expectation and trace commute.]
(c) What is E_e[e e^T]?
(d) Take the expectation with respect to e to show that, on average,

        E_out = σ^2 + (σ^2/N) trace( Σ ((1/N) X^T X)^{−1} ).

    Note that (1/N) X^T X is an N-sample estimate of Σ, so (1/N) X^T X ≈ Σ. If (1/N) X^T X = Σ, then what is E_out on average?
(e) Show, after taking the expectation with respect to the data noise e, that with high probability,

        E_out = σ^2 (1 + (d+1)/N + o(1/N)).

    [Hint: By the law of large numbers (1/N) X^T X converges in probability to Σ, and so by continuity of the inverse at Σ, ((1/N) X^T X)^{−1} converges in probability to Σ^{−1}.]
Problem 3.12    In linear regression, the in-sample predictions are given by ŷ = Hy, where H = X(X^T X)^{−1} X^T. Show that H is a projection matrix, i.e. H^2 = H. So ŷ is the projection of y onto some space. What is this space?
P r o b le m 3 .1 3 This problem creates a linear regression algorithm from
a good algorithm for linear classification. As illustrated, the idea is to take the
original data and shift it in one direction to get the -|-1 data points; then, shift
it in the opposite direction to get the — 1 data points.
(Left) Original data for the one-dimensional regression problem. (Right) Shifted data viewed as a two-dimensional classification problem.
More generally, the data (x_n, y_n) can be viewed as data points in R^{d+1} by treating the y-value as the (d+1)th coordinate.
( c o n t i n u e d o n n e x t p a g e )
1 1 3
3 . T h e L i n e a r M o d e l 3 . 5 . P r o b l e m s
Now, construct positive and negative points

    D+ = { (x_1, y_1) + a, ..., (x_N, y_N) + a },
    D− = { (x_1, y_1) − a, ..., (x_N, y_N) − a },

where a is a perturbation parameter. You can now use the linear programming algorithm in Problem 3.6 to separate D+ from D−. The resulting separating hyperplane can be used as the regression 'fit' to the original data.
(a) How many weights are learned in the classification problem? How many
weights are needed for the linear fit in the regression problem?
(b) The linear fit requires weights w , where / i(x ) = w ’^x. Suppose the
weights returned by solving the classification problem are Wdass- Derive
an expression for w as a function of Wdass-
(c) Generate a data set y„ = + <Te„ with N = 50, where x „ is uniform on
[0.1] and e„ is zero mean Gaussian noise; set a = 0.1. Plot T>+ and "D_
0
for a = 0.1
(d ) Give comparisons of the resulting fits from running the classification ap>-
proach and the analytic pseudo-inverse algorithm for linear regression.
Problem 3.14    In a regression setting, assume the target function is linear, so f(x) = x^T w_f, and y = X w_f + e, where the entries in e are zero mean, iid with variance σ^2. In this problem, derive the bias and var as follows.
(a) Show that the average function is ḡ(x) = f(x), no matter what the size of the data set, as long as X^T X is invertible. What is the bias?
(b) What is the var? [Hint: Problem 3.11]
Problem 3.15    In the text we derived that the linear regression solution weights must satisfy X^T X w = X^T y. If X^T X is not invertible, the solution w_lin = (X^T X)^{−1} X^T y won't work. In this event, there will be many solutions for w that minimize E_in. Here, you will derive one such solution. Let ρ be the rank of X. Assume that the singular value decomposition (SVD) of X is X = U Γ V^T, where U ∈ R^{N×ρ} satisfies U^T U = I_ρ, V ∈ R^{(d+1)×ρ} satisfies V^T V = I_ρ, and Γ ∈ R^{ρ×ρ} is a positive diagonal matrix.
(a) Show that ρ ≤ d + 1.
(b) Show that w_lin = V Γ^{−1} U^T y satisfies X^T X w_lin = X^T y, and hence is a solution.
(c) Show that for any other solution that satisfies X^T X w = X^T y, ||w_lin|| < ||w||. That is, the solution we have constructed is the minimum norm set of weights that minimizes E_in.
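A compact numerical check of this construction (illustrative; numpy's thin SVD plays the role of X = UΓV^T and the rank is estimated with a tolerance):

```python
import numpy as np

def min_norm_wlin(X, y):
    """Minimum-norm least-squares weights via the thin SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tol = max(X.shape) * np.finfo(float).eps * s.max()
    r = int(np.sum(s > tol))                       # numerical rank rho
    # w_lin = V Gamma^{-1} U^T y restricted to the top rho singular directions
    w = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])
    return w                                       # equivalently np.linalg.pinv(X) @ y
```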
Problem 3.16    In Example 3.4, it is mentioned that the output of the final hypothesis g(x) learned using logistic regression can be thresholded to get a 'hard' (±1) classification. This problem shows how to use the risk matrix introduced in Example 1.1 to obtain such a threshold.
Consider fingerprint verification, as in Example 1.1. After learning from the data using logistic regression, you produce the final hypothesis

    g(x) = P̂[y = +1 | x],

which is your estimate of the probability that y = +1. Suppose that the cost matrix is given by
matrix is given by
True classification
-F l (correct person) —1 (intruder)
“F l 0 Cayou say ^
Cr 0
For a new person with fingerprint x, you compute g(x) and you now need to decide whether to accept or reject the person (i.e., you need a hard classification). So, you will accept if g(x) ≥ κ, where κ is the threshold.
(a) Define cost(accept) as your expected cost if you accept the person. Similarly define cost(reject). Show that

        cost(accept) = (1 − g(x)) c_a,
        cost(reject) = g(x) c_r.

(b) Use part (a) to derive a condition on g(x) for accepting the person and hence show that

        κ = c_a / (c_a + c_r).
(c) Use the cost-matrices for the Supermarket and CIA applications in Ex
ample 1.1 to compute the threshold k for each of these two cases. Give
some intuition for the thresholds you get.
Problem 3.17    Consider the function

    E(u, v) = e^u + e^{2v} + e^{uv} + u^2 − 3uv + 4v^2 − 3u − 5v.

(a) Approximate E(u + Δu, v + Δv) by Ê_1(Δu, Δv), where Ê_1 is the first-order Taylor expansion of E around (u, v) = (0, 0). Suppose Ê_1(Δu, Δv) = a_u Δu + a_v Δv + a. What are the values of a_u, a_v, and a?
(b) Minimize Ê_1 over all possible (Δu, Δv) such that ||(Δu, Δv)|| = 0.5. In this chapter, we proved that the optimal column vector [Δu, Δv]^T is parallel to the column vector −∇E(u, v), which is called the negative gradient direction. Compute the optimal (Δu, Δv) and the resulting E(u + Δu, v + Δv).
(c) Approximate E(u + Δu, v + Δv) by Ê_2(Δu, Δv), where Ê_2 is the second-order Taylor expansion of E around (u, v) = (0, 0). Suppose

        Ê_2(Δu, Δv) = b_uu (Δu)^2 + b_vv (Δv)^2 + b_uv (Δu)(Δv) + b_u Δu + b_v Δv + b.

    What are the values of b_uu, b_vv, b_uv, b_u, b_v, and b?
(d) Minimize Ê_2 over all possible (Δu, Δv) (regardless of length). Use the fact that ∇^2 E(u, v)|_{(0,0)} (the Hessian matrix at (0, 0)) is positive definite to prove that the optimal column vector

        [Δu*, Δv*]^T = −(∇^2 E(u, v))^{−1} ∇E(u, v),

    which is called the Newton direction.
(e) Numerically compute the following values:
    (i) the vector (Δu, Δv) of length 0.5 along the Newton direction, and the resulting E(u + Δu, v + Δv);
    (ii) the vector (Δu, Δv) of length 0.5 that minimizes E(u + Δu, v + Δv), and the resulting E(u + Δu, v + Δv). [Hint: Let Δu = 0.5 sin θ.]
Compare the values of E(u + Δu, v + Δv) in (b), (e-i), and (e-ii). Briefly state your findings.
The negative gradient direction and the Newton direction are quite fundamental
for designing optimization algorithms. It is important to understand these
directions and put them in your toolbox for designing learning algorithms.
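For the numerical parts, a small sketch (illustrative only; finite-difference derivatives stand in for the analytic ones, and the expression for E below simply follows the cleaned statement of the problem above):

```python
import numpy as np

def numerical_grad(f, p, h=1e-6):
    # central-difference gradient of f at point p
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def numerical_hessian(f, p, h=1e-4):
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (numerical_grad(f, p + e) - numerical_grad(f, p - e)) / (2 * h)
    return H

def E(p):
    u, v = p
    # E(u, v) as written in the problem statement above
    return np.exp(u) + np.exp(2*v) + np.exp(u*v) + u**2 - 3*u*v + 4*v**2 - 3*u - 5*v

p0 = np.zeros(2)
g = numerical_grad(E, p0)
neg_grad_dir = -g / np.linalg.norm(g)                          # negative gradient direction
newton_step = -np.linalg.solve(numerical_hessian(E, p0), g)    # Newton direction
```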
Problem 3.18    Take the feature transform Φ_2 in Equation (3.13) as Φ.
(a) Show that d_VC(H_Φ) ≤ 6.
(b) Show that d_VC(H_Φ) > 4. [Hint: Exercise 3.12]
(c) Give an upper bound on d_VC(H_{Φ_k}) for X = R^d.
(d) Define

        Φ̌_2: x → (1, x_1, x_2, x_1 + x_2, x_1 − x_2, x_1^2, x_1 x_2, x_2 x_1, x_2^2)   for x ∈ R^2.

    Argue that d_VC(H_{Φ̌_2}) = d_VC(H_{Φ_2}). In other words, while Φ̌_2(x) ∈ R^9, d_VC(H_{Φ̌_2}) ≤ 6 < 9. Thus, the dimension of Φ(X) only gives an upper bound on d_VC(H_Φ), and the exact value of d_VC(H_Φ) can depend on the components of the transform.
1 1 6
3 . T h e L i n e a r M o d e l 3 . 5 . P r o b l e m s
Problem 3.19    A Transformer thinks the following procedures would work well in learning from two-dimensional data sets of any size. Please point out if there are any potential problems in the procedures:
(a) Use the feature transform

        Φ(x) = (0, ..., 0, 1, 0, ..., 0)  (the 1 in the nth position)   if x = x_n,
        Φ(x) = (0, 0, ..., 0)   otherwise,

    before running PLA.
(b) Use the feature transform Φ with

        Φ_n(x) = exp( −||x − x_n||^2 / (2γ^2) ),

    using some very small γ.
(c) Use the feature transform Φ that consists of all

        Φ_{i,j}(x) = exp( −||x − (i, j)||^2 / (2γ^2) ),

    before running PLA, with i ∈ {0, ..., 1} and j ∈ {0, ..., 1}.
Chapter 4
Overfitting
Paraskavedekatriaphobia^ (fear of Friday the 13th), and superstitions in gen
eral, are perhaps the m ost illustrious cases of the hum an ability to overfit.
U nfortunate events are mem orable, and given a few such mem orable events,
it is n a tu ra l to try and find an explanation. In the future, will there be more
unfortunate events on Friday the 13th’s th an on any other day?
O verfitting is the phenom enon where fitting the observed facts (data) well
no longer indicates th a t we will get a decent out-of-sam ple error, and may
actually lead to the opposite effect. You have probably seen cases of overfit-
ting when the learning model is more complex th an is necessary to represent
the ta rg e t function. T he model uses its additional degrees of freedom to fit
idiosyncrasies in the d a ta (for example, noise), yielding a final hypothesis th a t
is inferior. O verfitting can occur even when the hypothesis set contains only
functions which are fa r simpler th an the ta rg e t function, and so the plot thick
ens © .
T he ability to deal w ith overfitting is w hat separates professionals from
am ateurs in the field of learning from data . We will cover three themes;
W hen does overfitting occur? W hat are the tools to com bat overfitting? How
can one estim ate the degree of overfitting and ‘certify’ th a t a model is good,
or b e tte r th an another? O ur em phasis will be on techniques th a t work well in
practice.
4.1 W hen D oes O verfitting Occur?
O verfitting literally m eans “F ittin g the d a ta more th an is w arranted .” The
m ain case of overfitting is when you pick the hypothesis with lower Ei„, and
it results in higher Eout- This m eans th a t Ei„ alone is no longer a good guide
for learning. Let us start by identifying the cause of overfitting.
¹from the Greek paraskevi (Friday), dekatreis (thirteen), phobia (fear)
1 1 9
4 . O v e r f i t t i n g 4 . 1 . W h e n D o e s O v e r f i t t i n g O c c u r ?
Consider a simple one-dim ensional regression problem w ith five d a ta points.
We do not know the ta rg e t function, so le t’s select a general model, m axim iz
ing our chance to cap tu re th e ta rg e t function. Since 5 d a ta points can be fit
by a 4 th order polynom ial, we select 4 th order polynom ials.
The result is shown on the right. T he
ta rg e t function is a 2nd order polynom ial
(blue curve), w ith a little added noise in
the d a ta points. T hough the ta rg e t is
simple, the learning algorithm used the
full power of the 4 th order polynom ial to
fit the d a ta exactly, b u t the resu lt does
not look anyth ing like the ta rg e t function.
T he d a ta has been ‘overfit.’ T he little
noise in the d a ta has misled th e learning,
for if there were no noise, the fitted red
curve would exactly m atch the ta rg e t. This is a typ ical overfitting scenario,
in which a com plex m odel uses its add itional degrees of freedom to ‘learn ’ the
noise.
The fit has zero in-sam ple erro r b u t huge out-of-sam ple error, so th is is a
case of bad generalization (as discussed in C h ap ter 2) - a likely outcom e when
overfitting is occurring. However, our definition of overfitting goes beyond
bad generalization for any given hypothesis. Instead , overfitting applies to
a process: in th is case, the process of picking a hypothesis w ith lower and
lower Ein resulting in higher and higher Eout-
4.1 .1 A C ase S tudy: O v erfittin g w ith P o ly n o m ia ls
L et’s dig deeper to gain a b e tte r understand ing of w hen overfitting occurs. We
will illustrate the main concepts using data in one dimension and polynomial regression, a special case of a linear model that uses the feature transform x → (1, x, x^2, ...). Consider the two regression problems below:
(a) 10th order target function (b) 50th order target function
In bo th problem s, the ta rg e t function is a polynom ial and th e d a ta set T>
contains 15 d a ta points. In (a), the ta rg e t function is a 10th order polynom ial
Figure 4.1: Fits using 2nd and 10th order polynomials to 15 data points. In (a), the data are noisy and the target is a 10th order polynomial. In (b), the data are noiseless and the target is a 50th order polynomial.
and the sam pled d a ta are noisy (the d a ta do not lie on the targe t function
curve). In (b), the ta rg e t function is a 50th order polynom ial and the d a ta are
noiseless.
The best 2nd and 10th order fits are shown in Figure 4.1, and the in-sample and out-of-sample errors are given in the following table.

              10th order noisy target       50th order noiseless target
              2nd Order    10th Order       2nd Order    10th Order
    E_in        0.050        0.034            0.029        10^{-5}
    E_out       0.127        9.00             0.120        7680
W hat the learning algorithm sees is the data , not the targe t function. In both
cases, the 10th order polynom ial heavily overfits the data , and results in a
nonsensical final hypothesis which does not resemble the targe t function. The
2nd order fits do not cap tu re the full na tu re of the targe t function either, but
they do a t least cap tu re its general trend , resulting in significantly lower out-of-
sam ple error. T he 10th order fits have lower in-sam ple error and higher out-of-
sam ple error, so this is indeed a case of overfitting th a t results in pathologically
bad generalization.
E x e rc is e 4 .1
Let H 2 and H io be the 2nd and 10th order hypothesis sets respectively.
Specify these sets as parameterized sets of functions. Show that H 2 C H io .
These two examples reveal some surprising phenomena. Let's consider first the 10th order target function, Figure 4.1(a). Here is the scenario. Two learners, O (for overfitted) and R (for restricted), know that the target function is a 10th order polynomial, and that they will receive 15 noisy data points. Learner O
Figure 4.2: Learning curves for H_2 (left) and H_10 (right). Overfitting is occurring for N in the shaded gray region because by choosing H_10, which has better E_in, you get worse E_out.
uses model H_10, which is known to contain the target function, and finds the best fitting hypothesis to the data. Learner R uses model H_2, and similarly finds the best fitting hypothesis to the data.
The surprising thing is that learner R wins (lower out-of-sample error) by using the smaller model, even though she has knowingly given up the ability to implement the true target function. Learner R trades off a worse in-sample error for a huge gain in the generalization error, ultimately resulting in lower out-of-sample error. What is funny here? A folklore belief about learning is that best results are obtained by incorporating as much information about the target function as is available. But as we see here, even if we know the order of the target and naively incorporate this knowledge by choosing the model accordingly (H_10), the performance is inferior to that demonstrated by the more 'stable' 2nd order model.
The models H_2 and H_10 were in fact the ones used to generate the learning curves in Chapter 2, and we use those same learning curves to illustrate overfitting in Figure 4.2. If you mentally superimpose the two plots, you can see that there is a range of N for which H_10 has lower E_in but higher E_out than H_2 does, a case in point of overfitting.
Is learner R always going to prevail? Certainly not. For example, if the data was noiseless, then indeed learner O would recover the target function exactly from 15 data points, while learner R would have no hope. This brings us to the second example, Figure 4.1(b). Here, the data is noiseless, but the target function is very complex (50th order polynomial). Again learner R wins, and again because learner O heavily overfits the data. Overfitting is not a disease inflicted only upon complex models with many more degrees of freedom than warranted by the complexity of the target function. In fact the reverse is true here, and overfitting is just as bad. What matters is how the model complexity matches the quantity and quality of the data we have, not how it matches the target function.
4 .1 .2 C a ta ly sts for O verfitting
A skeptical reader should ask w hether the examples in Figure 4.1 are ju st
pathological constructions created by the authors, or is overfitting a real phe
nom enon which has to be considered carefully when learning from data? The
next exercise guides you th rough an experim ental design for studying overfit
ting w ithin our curren t setup. We will use the results from this experim ent
to serve two purposes: to convince you th a t overfitting is not the result of
some rare pathological construction, and to unravel some of the conditions
conducive to overfitting.
Exercise 4.2    [Experimental design for studying overfitting]
This is a reading exercise that sets up an experimental framework to study various aspects of overfitting. The reader interested in implementing the experiment can find the details fleshed out in Problem 4.4. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H_2 and H_10.
The target is a degree-Q_f polynomial, which we write f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where the L_q(x) are polynomials of increasing complexity (the Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) + σε_n and the ε_n are iid (independent and identically distributed) standard Normal random variates.
For a single experiment, with specified values for Q_f, N, σ, generate a random degree-Q_f target function by selecting coefficients a_q independently from a standard Normal, rescaling them so that E_{a,x}[f^2] = 1. Generate a data set, selecting x_1, ..., x_N independently according to P(x) and y_n = f(x_n) + σε_n. Let g_2 and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively, with out-of-sample errors E_out(g_2) and E_out(g_10).
Vary Q_f, N, σ, and for each combination of parameters, run a large number of experiments, each time computing E_out(g_2) and E_out(g_10). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ) using H_2 and H_10.
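A minimal sketch of one run of this experiment (illustrative choices throughout: numpy's Legendre utilities for the target, plain least-squares polynomial fits for g_2 and g_10, and a noise-free test set, which only shifts both out-of-sample errors by the same σ^2):

```python
import numpy as np
from numpy.polynomial import legendre

def run_experiment(Qf, N, sigma, n_test=2000, seed=None):
    """One run of the Exercise 4.2 setup; returns (Eout(g2), Eout(g10))."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(Qf + 1)                       # random Legendre coefficients
    # rescale so E_x[f^2] = 1; for x uniform on [-1,1], E[L_q^2] = 1/(2q+1)
    a /= np.sqrt(np.sum(a**2 / (2 * np.arange(Qf + 1) + 1)))
    f = lambda x: legendre.legval(x, a)

    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)             # noisy training data

    x_test = rng.uniform(-1, 1, n_test)
    y_test = f(x_test)                                    # noiseless target values

    def fit_and_eval(order):
        coef = np.polynomial.polynomial.polyfit(x, y, order)
        pred = np.polynomial.polynomial.polyval(x_test, coef)
        return np.mean((pred - y_test)**2)

    return fit_and_eval(2), fit_and_eval(10)
```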
Exercise 4.2 set up an experiment to study how the noise level σ^2, the target complexity Q_f, and the number of data points N relate to overfitting. We compare the final hypothesis g_10 ∈ H_10 (larger model) to the final hypothesis g_2 ∈ H_2 (smaller model). Clearly, E_in(g_10) ≤ E_in(g_2) since g_10 has more degrees of freedom to fit the data. What is surprising is how often g_10 overfits the data, resulting in E_out(g_10) > E_out(g_2). Let us define the overfit measure as E_out(g_10) − E_out(g_2). The more positive this measure is, the more severe overfitting would be.
Figure 4.3 shows how the extent of overfitting depends on certain parameters of the learning problem (the results are from our implementation of Exercise 4.2). In the figure, the colors map to the level of overfitting, with redder
Figure 4.3: How overfitting depends on the noise σ^2, the target function complexity Q_f, and the number of data points N. The colors map to the overfit measure E_out(g_10) − E_out(g_2). In (a), stochastic noise: overfitting as a function of σ^2 and N, with Q_f = 20; as σ^2 increases we are adding stochastic noise to the data. In (b), deterministic noise: overfitting as a function of Q_f and N, with σ^2 = 0.1; as Q_f increases we are adding deterministic noise to the data.
regions showing worse overfitting. These red regions are large— overfitting is
real, and here to stay.
Figure 4.3(a) reveals that there is less overfitting when the noise level σ^2 drops or when the number of data points N increases (the linear pattern in Figure 4.3(a) is typical). Since the 'signal' f is normalized to E[f^2] = 1, the noise level σ^2 is automatically calibrated to the signal level. Noise leads the learning astray, and the larger, more complex model is more susceptible to noise than the simpler one because it has more ways to go astray. Figure 4.3(b) reveals that target function complexity Q_f affects overfitting in a similar way to noise, albeit nonlinearly. To summarize:

    Number of data points ↑    Overfitting ↓
    Noise ↑                    Overfitting ↑
    Target complexity ↑        Overfitting ↑
Deterministic noise. Why does a higher target complexity lead to more overfitting when comparing the same two models? The intuition is that for a given learning model, there is a best approximation to the target function. The part of the target function 'outside' this best fit acts like noise in the data. We can call this deterministic noise to differentiate it from the random stochastic noise. Just as stochastic noise cannot be modeled, the deterministic noise is that part of the target function which cannot be modeled. The learning algorithm should not attempt to fit the noise; however, it cannot distinguish noise from signal. On a finite data set, the algorithm inadvertently uses some
of the degrees of freedom to fit the noise, which can result in overfitting and a spurious final hypothesis.

Figure 4.4: Deterministic noise. h* is the best fit to f in H_2. The shading illustrates the deterministic noise for this learning problem.

Figure 4.4 illustrates deterministic noise for a quadratic model fitting a more complex target function. While stochastic and deterministic noise have similar effects on overfitting, there are two basic differences between the two types of noise. First, if we generated the same data (x values) again, the deterministic noise would not change but the stochastic noise would. Second, different models capture different 'parts' of the target function, hence the same data set will have different deterministic noise depending on which model we use. In reality, we work with one model at a time and have only one data set on hand. Hence, we have one realization of the noise to work with and the algorithm cannot differentiate between the two types of noise.
E x erc ise 4 .3
Deterministic noise depends on H, as some models approximate / better
than others,
(a ) Assume H is fixed and we increase the complexity of / . W ill deter
ministic noise in general go up or down? Is there a higher or lower
tendency to overfit?
(b ) Assume / is fixed and we decrease the complexity of H. Will deter
ministic noise in general go up or down? Is there a higher or lower
tendency to overfit? [Hint: There is a race between two factors that
affect overfitting in opposite ways, but one wins.]
The bias-variance decomposition, which we discussed in Section 2.3.1 (see also Problem 2.22), is a useful tool for understanding how noise affects performance:

    E_D[E_out] = σ^2 + bias + var.

The first two terms reflect the direct impact of the stochastic and deterministic noise. The variance of the stochastic noise is σ^2 and the bias is directly
related to the deterministic noise in that it captures the model's inability to approximate f. The var term is indirectly impacted by both types of noise, capturing a model's susceptibility to being led astray by the noise.
4.2 Regularization

Regularization is our first weapon to combat overfitting. It constrains the learning algorithm to improve out-of-sample error, especially when noise is present. To whet your appetite, look at what a little regularization can do for our first overfitting example in Section 4.1. Though we only used a very small 'amount' of regularization, the fit improves dramatically (the figure compares the fit without and with regularization).
Now th a t we have your a tten tio n , we would like to come clean. R egularization
is as much an a r t as it is a science. M ost of th e m ethods used successfully
in practice are heuristic m ethods. However, these m ethods are grounded in a
m athem atical fram ework th a t is developed for special cases. We will discuss
bo th the m athem atica l and the heuristic, try ing to m ain ta in a balance th a t
reflects the reality of the field.
Speaking of heuristics, one view of regularization is through the lens of the VC bound, which bounds E_out using a model complexity penalty Ω(H):

    E_out(h) ≤ E_in(h) + Ω(H)    for all h ∈ H.                  (4.1)

So, we are better off if we fit the data using a simple H. Extrapolating one step further, we should be better off by fitting the data using a 'simple' h from H. The essence of regularization is to concoct a measure Ω(h) for the complexity of an individual hypothesis. Instead of minimizing E_in(h) alone, one minimizes a combination of E_in(h) and Ω(h). This avoids overfitting by constraining the learning algorithm to fit the data well using a simple hypothesis.
E x a m p le 4 .1 . O ne popu lar regularization technique is weight decay, which
m easures the com plexity of a hypothesis h by the size of the coefficients used
to represent h (e.g. in a linear m odel). T his heuristic prefers m ild lines w ith
small offset and slope, to wild lines w ith bigger offset and slope. We will get to
the mechanics of weight decay shortly, bu t for now le t’s focus on the outcome.
We apply weight decay to fitting the target f(x) = sin(πx) using N = 2 data points (as in Example 2.8). We sample x uniformly in [−1, 1], generate a data set and fit a line to the data (our model is H_1). The figures below show the resulting fits on the same (random) data sets with and without regularization.
Without regularization, the learned function varies extensively depending on the data set. As we have seen in Example 2.8, a constant model scored E_out = 0.75, handily beating the performance of the (unregularized) linear model that scored E_out = 1.90. With a little weight decay regularization, the fits to the same data sets are considerably less volatile. This results in a significantly lower E_out = 0.56 that beats both the constant model and the unregularized linear model.
T he bias-variance decom position helps us to understand how the regular
ized version bea t bo th the unregularized version as well as the constant model.
Average hypothesis ḡ (red) with var(x) indicated by the gray shaded region that is ḡ(x) ± √var(x). Without regularization: bias = 0.21, var = 1.69. With regularization: bias = 0.23, var = 0.33.
As expected, regularization reduced the var term rather dramatically from 1.69 down to 0.33. The price paid in terms of the bias (quality of the average fit) was
modest, only slightly increasing from 0.21 to 0.23. The result was a significant decrease in the expected out-of-sample error because bias + var decreased. This is the crux of regularization. By constraining the learning algorithm to select 'simpler' hypotheses from H, we sacrifice a little bias for a significant gain in the var. □
This example also illustrates why regularization is needed. The linear model is too sophisticated for the amount of data we have, since a line can perfectly fit any 2 points. This need would persist even if we changed the
ta rg e t function, as long as we have either stochastic or determ inistic noise.
T he need for regularization depends on the quantity and quality o f the data.
G iven our m eager d a ta set, our choices were either to take a sim pler model,
such as the m odel w ith constan t functions, or to constra in th e linear model. It
tu rn s out th a t using th e com plex m odel b u t constra in ing the algorithm tow ard
sim pler hypotheses gives us m ore flexibility, and ends up giving the best Eout-
In practice, this is the rule no t the exception.
E nough heuristics. L e t’s develop the m athem atics of regularization.
4.2.1 A Soft Order Constraint

In this section, we derive a regularization method that applies to a wide variety of learning problems. To simplify the math, we will use the concrete setting of regression using Legendre polynomials, the polynomials of increasing complexity used in Exercise 4.2. So, let's first formally introduce you to the Legendre polynomials.
Consider a learning model where H is the set of polynomials in one variable x ∈ [−1, 1]. Instead of expressing the polynomials in terms of consecutive powers of x, we will express them as a combination of Legendre polynomials
in X . Legendre polynom ials are a s tan d a rd set of polynom ials w ith nice ana
lytic properties th a t resu lt in sim pler derivations. T he zero th-order Legendre
polynom ial is the constan t L q {x ) = 1, and the first few Legendre polynom ials
are illustra ted below.
[Figure: the first few Legendre polynomials, e.g. L_2(x) = ½(3x² − 1) and L_4(x) = ⅛(35x⁴ − 30x² + 3).]
As you can see, when the order of the Legendre polynomial increases, the curve gets more complex. Legendre polynomials are orthogonal to each other within x ∈ [−1, 1], and any regular polynomial can be written as a linear combination of Legendre polynomials, just like it can be written as a linear combination of powers of x.
Polynomial models are a special case of linear models in a space Z, under a nonlinear transformation Φ: X → Z. Here, for the Qth order polynomial model, Φ transforms x into a vector z of Legendre polynomials,
\[ \mathbf{z} = \begin{bmatrix} 1 \\ L_1(x) \\ \vdots \\ L_Q(x) \end{bmatrix}. \]
Our hypothesis set H_Q is a linear combination of these polynomials,
\[ \mathcal{H}_Q = \left\{ h \;\Big|\; h(x) = \mathbf{w}^{\rm T}\mathbf{z} = \sum_{q=0}^{Q} w_q L_q(x),\ \mathbf{w} \in \mathbb{R}^{Q+1} \right\}, \]
where L_0(x) = 1. As usual, we will sometimes refer to the hypothesis h by its weight vector w.¹ Since each h is linear in w, we can use the machinery of linear regression from Chapter 3 to minimize the squared error
\[ E_{\rm in}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{w}^{\rm T}\mathbf{z}_n - y_n\right)^2. \qquad (4.2) \]
The case of polynomial regression with squared-error measure illustrates the main ideas of regularization well, and facilitates a solid mathematical derivation. Nonetheless, our discussion will generalize in practice to non-linear, multi-dimensional settings with more general error measures. The baseline algorithm (without regularization) is to minimize E_in over the hypotheses in H_Q to produce the final hypothesis g(x) = w_lin^T z, where w_lin = argmin_w E_in(w).
Exercise 4.4
Let Z = [z_1 ... z_N]^T be the data matrix (assume Z has full column rank); let w_lin = (Z^T Z)^{-1} Z^T y; and let H = Z(Z^T Z)^{-1} Z^T (the hat matrix of Exercise 3.3). Show that
\[ E_{\rm in}(\mathbf{w}) = \frac{(\mathbf{w}-\mathbf{w}_{\rm lin})^{\rm T}\mathrm{Z}^{\rm T}\mathrm{Z}(\mathbf{w}-\mathbf{w}_{\rm lin}) + \mathbf{y}^{\rm T}(\mathrm{I}-\mathrm{H})\mathbf{y}}{N}, \qquad (4.3) \]
where I is the identity matrix.
(a) What value of w minimizes E_in?
(b) What is the minimum in-sample error?
The task of regularization, which results in a final hypothesis w_reg instead of the simple w_lin, is to constrain the learning so as to prevent overfitting the

¹We used w̃ and d̃ for the weight vector and dimension in Z. Since we are explicitly dealing with polynomials and Z is the only space around, we use w and Q for simplicity.
data. We have already seen an example of constraining the learning; the set H_2 can be thought of as a constrained version of H_10 in the sense that some of the H_10 weights are required to be zero. That is, H_2 is a subset of H_10 defined by H_2 = {w | w ∈ H_10; w_q = 0 for q ≥ 3}. Requiring some weights to be 0 is a hard constraint. We have seen that such a hard constraint on the order can help; for example, H_2 is better than H_10 when there is a lot of noise and N is small. Instead of requiring some weights to be zero, we can force the weights to be small but not necessarily zero through a softer constraint such as
\[ \sum_{q=0}^{Q} w_q^2 \le C. \]
This is a 'soft order' constraint because it only encourages each weight to be small, without changing the order of the polynomial by explicitly setting some weights to zero. The in-sample optimization problem becomes:
\[ \min_{\mathbf{w}}\ E_{\rm in}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^{\rm T}\mathbf{w} \le C. \qquad (4.4) \]
The data determines the optimal weight sizes, given the total budget C which determines the amount of regularization; the larger C is, the weaker the constraint and the smaller the amount of regularization. We can define the soft-order-constrained hypothesis set H(C) by
\[ \mathcal{H}(C) = \{ h \mid h(x) = \mathbf{w}^{\rm T}\mathbf{z},\ \mathbf{w}^{\rm T}\mathbf{w} \le C \}. \]
Equation (4.4) is equivalent to minimizing E_in over H(C). If C_1 < C_2, then H(C_1) ⊂ H(C_2) and so d_vc(H(C_1)) ≤ d_vc(H(C_2)), and we expect better generalization with H(C_1). Let the regularized weights w_reg be the solution to (4.4).
Solving for w_reg. If w_lin^T w_lin ≤ C then w_reg = w_lin because w_lin ∈ H(C). If w_lin ∉ H(C), then not only is w_reg^T w_reg ≤ C, but in fact w_reg^T w_reg = C (w_reg uses the entire budget C; see Problem 4.10). We thus need to minimize E_in subject to the equality constraint w^T w = C. The situation is illustrated to the right. The weights w must lie on the surface of the sphere w^T w = C; the normal vector to this surface at w is the vector w itself (also in red). A surface of constant E_in is shown in blue; this surface is a quadratic surface (see Exercise 4.4) and the normal to this surface is ∇E_in(w). In this case, w cannot be optimal because ∇E_in(w) is not parallel to the red normal vector. This means that ∇E_in(w) has some nonzero component along the constraint surface, and by moving a small amount in the opposite direction of this component we can improve E_in, while still
remaining on the surface. If w_reg is to be optimal, then for some positive parameter λ_C
\[ \nabla E_{\rm in}(\mathbf{w}_{\rm reg}) = -2\lambda_C\,\mathbf{w}_{\rm reg}, \]
i.e., ∇E_in must be parallel to w_reg, the normal vector to the constraint surface (the scaling by 2 is for mathematical convenience and the negative sign is because ∇E_in and w are in opposite directions). Equivalently, w_reg satisfies
\[ \nabla\!\left( E_{\rm in}(\mathbf{w}) + \lambda_C\,\mathbf{w}^{\rm T}\mathbf{w} \right)\Big|_{\mathbf{w}=\mathbf{w}_{\rm reg}} = 0, \]
because ∇(w^T w) = 2w. So, for some λ_C > 0, w_reg locally minimizes
\[ E_{\rm in}(\mathbf{w}) + \lambda_C\,\mathbf{w}^{\rm T}\mathbf{w}. \qquad (4.5) \]
The parameter λ_C and the vector w_reg (both of which depend on C and the data) must be chosen so as to simultaneously satisfy the gradient equality and the weight norm constraint w_reg^T w_reg = C.² That λ_C > 0 is intuitive since we are enforcing smaller weights, and minimizing E_in(w) + λ_C w^T w would not lead to smaller weights if λ_C were negative. Note that if w_lin^T w_lin ≤ C, then w_reg = w_lin and minimizing (4.5) still holds with λ_C = 0. Therefore, we have an equivalence between solving the constrained problem (4.4) and the unconstrained minimization of (4.5). This equivalence means that minimizing (4.5) is similar to minimizing E_in using a smaller hypothesis set, which in turn means that we can expect better generalization by minimizing (4.5) than by just minimizing E_in.
Other variations of the constraint in (4.4) can be used to emphasize some weights over the others. Consider the constraint Σ_{q=0}^Q γ_q w_q² ≤ C. The importance γ_q given to weight w_q determines the type of regularization. For example, γ_q = q or γ_q = e^q encourages a low-order fit, and γ_q = (1 + q)^{-1} or γ_q = e^{-q} encourages a high-order fit. In extreme cases, one recovers hard-order constraints by choosing some γ_q = 0 and some γ_q → ∞.
Exercise 4.5 [Tikhonov regularizer]
A more general soft constraint is the Tikhonov regularization constraint
\[ \mathbf{w}^{\rm T}\Gamma^{\rm T}\Gamma\,\mathbf{w} \le C, \]
which can capture relationships among the w_i (the matrix Γ is the Tikhonov regularizer).
(a) What should Γ be to obtain the constraint Σ_{q=0}^Q w_q² ≤ C?
(b) What should Γ be to obtain the constraint (Σ_{q=0}^Q w_q)² ≤ C?
²λ_C is known as a Lagrange multiplier and an alternate derivation of these same results can be obtained via the theory of Lagrange multipliers for constrained optimization.
4.2.2 Weight Decay and Augmented Error
The soft-order constraint for a given value of C is a constrained minimization of E_in. Equation (4.5) suggests that we may equivalently solve an unconstrained minimization of a different function. Let's define the augmented error,
\[ E_{\rm aug}(\mathbf{w}) = E_{\rm in}(\mathbf{w}) + \lambda\,\mathbf{w}^{\rm T}\mathbf{w}, \qquad (4.6) \]
where λ ≥ 0 is now a free parameter at our disposal. The augmented error has two terms. The first is the in-sample error which we are used to minimizing, and the second is a penalty term. Notice that this fits the heuristic view of regularization that we discussed earlier, where the penalty for complexity is defined for each individual h instead of H as a whole. When λ = 0, we have the usual in-sample error. For λ > 0, minimizing the augmented error corresponds to minimizing a penalized in-sample error. The value of λ controls the amount of regularization. The penalty term w^T w enforces a tradeoff between making the in-sample error small and making the weights small, and has become known as weight decay. As discussed in Problem 4.8, if we minimize the augmented error using an iterative method like gradient descent, we will have a reduction of the in-sample error together with a gradual shrinking of the weights, hence the name weight 'decay.' In the statistics community, this type of penalty term is a form of ridge regression.
There is an equivalence between the soft order constraint and augmented error minimization. In the soft-order constraint, the amount of regularization is controlled by the parameter C. From (4.5), there is a particular λ_C (depending on C and the data D), for which minimizing the augmented error E_aug(w) leads to the same final hypothesis w_reg. A larger C allows larger weights and is a weaker soft-order constraint; this corresponds to smaller λ, i.e., less emphasis on the penalty term w^T w in the augmented error. For a particular data set, the optimal value C* leading to minimum out-of-sample error with the soft-order constraint corresponds to an optimal value λ* in the augmented error minimization. If we can find λ*, we can get the minimum E_out.
Have we gained from the augmented error view? Yes, because augmented error minimization is unconstrained, which is generally easier than constrained minimization. For example, we can obtain a closed form solution for linear models or use a method like stochastic gradient descent to carry out the minimization. However, augmented error minimization is not so easy to interpret. There are no values for the weights which are explicitly forbidden, as there are in the soft-order constraint. For a given C, the soft-order constraint corresponds to selecting a hypothesis from the smaller set H(C), and so from our VC analysis we should expect better generalization when C decreases (λ increases). It is through the relationship between λ and C that one has a theoretical justification of weight decay as a method for regularization.
We focused on the soft-order constraint w^T w ≤ C with corresponding augmented error E_aug(w) = E_in(w) + λ w^T w. However, our discussion applies more generally. There is a duality between the minimization of the in-sample
error over a constrained hypothesis set and the unconstrained minimization of an augmented error. We may choose to live in either world, but more often than not, the unconstrained minimization of the augmented error is more convenient.
In our definition of E_aug(w) in Equation (4.6), we only highlighted the dependence on w. There are two other quantities under our control, namely the amount of regularization, λ, and the nature of the regularizer which we chose to be w^T w. In general, the augmented error for a hypothesis h ∈ H is
\[ E_{\rm aug}(h, \lambda, \Omega) = E_{\rm in}(h) + \frac{\lambda}{N}\,\Omega(h). \qquad (4.7) \]
For weight decay, Ω(h) = w^T w, which penalizes large weights. The penalty term has two components: the regularizer Ω(h) (the type of regularization) which penalizes a particular property of h, and the regularization parameter λ (the amount of regularization). The need for regularization goes down as the number of data points goes up, so we factored out 1/N; this allows the optimal choice for λ to be less sensitive to N. This is just a redefinition of the λ that we have been using, in order to make it a more stable parameter that is easier to interpret. Notice how Equation (4.7) resembles the VC bound (4.1) as we anticipated in the heuristic view of regularization. This is why we use the same notation Ω for both the penalty on individual hypotheses Ω(h) and the penalty on the whole set Ω(H). The correspondence between the complexity of H and the complexity of an individual h will be discussed further in Section 5.1.
The regularizer Ω is typically fixed ahead of time, before seeing the data; sometimes the problem itself can dictate an appropriate regularizer.
Exercise 4.6
We have seen both the hard-order constraint and the soft-order constraint. Which do you expect to be more useful for binary classification using the perceptron model? [Hint: sign(w^T x) = sign(α w^T x) for any α > 0.]
The optimal regularization parameter, however, typically depends on the data. The choice of the optimal λ is one of the applications of validation, which we will discuss shortly.
Example 4.2. Linear models with weight decay. Linear models are important enough that it is worthwhile to spell out the details of augmented error minimization in this case. From Exercise 4.4, the augmented error is
\[ E_{\rm aug}(\mathbf{w}) = \frac{(\mathbf{w}-\mathbf{w}_{\rm lin})^{\rm T}\mathrm{Z}^{\rm T}\mathrm{Z}(\mathbf{w}-\mathbf{w}_{\rm lin}) + \lambda\,\mathbf{w}^{\rm T}\mathbf{w} + \mathbf{y}^{\rm T}(\mathrm{I}-\mathrm{H})\mathbf{y}}{N}, \]
where Z is the transformed data matrix and w_lin = (Z^T Z)^{-1} Z^T y. The reader may verify, after taking the derivatives of E_aug and setting ∇_w E_aug = 0, that
\[ \mathbf{w}_{\rm reg} = (\mathrm{Z}^{\rm T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{\rm T}\mathbf{y}. \]
As expected, w_reg will go to zero as λ → ∞, due to the λI term. The predictions on the in-sample data are given by ŷ = Z w_reg = H(λ)y, where
\[ \mathrm{H}(\lambda) = \mathrm{Z}(\mathrm{Z}^{\rm T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{\rm T}. \]
The matrix H(λ) plays an important role in defining the effective complexity of a model. When λ = 0, H is the hat matrix of Exercises 3.3 and 4.4, which satisfies H² = H and trace(H) = d + 1. The vector of in-sample errors, which are also called residuals, is y − ŷ = (I − H(λ))y, and the in-sample error is E_in(w_reg) = (1/N) y^T (I − H(λ))² y. □
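As a concrete illustration (not code from the text), here is a minimal numerical sketch of this example in Python: it builds a Legendre feature matrix Z, computes w_reg = (Z^T Z + λI)^{-1} Z^T y, and forms H(λ). The target, noise level, and the values of N, Q and λ are assumptions made only for the illustration.

import numpy as np
from numpy.polynomial import legendre

# Illustrative data: noisy samples of an assumed target on [-1, 1].
rng = np.random.default_rng(0)
N, Q, lam = 30, 15, 0.1                      # sample size, polynomial order, lambda (assumed)
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.5 * rng.standard_normal(N)

# Z: the n-th row is (L_0(x_n), L_1(x_n), ..., L_Q(x_n)).
Z = legendre.legvander(x, Q)

# Weight decay solution  w_reg = (Z^T Z + lambda I)^{-1} Z^T y.
I = np.eye(Q + 1)
w_reg = np.linalg.solve(Z.T @ Z + lam * I, Z.T @ y)

# H(lambda) = Z (Z^T Z + lambda I)^{-1} Z^T, and E_in(w_reg) = (1/N) y^T (I - H(lambda))^2 y.
H = Z @ np.linalg.solve(Z.T @ Z + lam * I, Z.T)
E_in = np.mean((H @ y - y) ** 2)             # identical to the quadratic-form expression
print(w_reg[:3], E_in)

Setting λ = 0 (with Z of full column rank) recovers the ordinary least-squares solution w_lin and the hat matrix of Exercise 4.4.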
We can now apply weight decay regularization to the first overfitting example that opened this chapter. The results for different λ's are shown in Figure 4.5.
Figure 4.5: Weight decay applied to Example 4.2 with different values for the regularization parameter λ (panels: λ = 0, 0.0001, 0.01, 1; legend: Data, Target, Fit). The red fit gets flatter as we increase λ.
As you can see, even very little regularization goes a long way, but too much regularization results in an overly flat curve at the expense of in-sample fit. Another case we saw earlier is Example 4.1, where we fit a linear model to a sinusoid. The regularization used there was also weight decay, with λ = 0.1.
4.2.3 Choosing a Regularizer: Pill or Poison?
We have presented a number of ways to constrain a model: hard-order constraints where we simply use a lower-order model, soft-order constraints where we constrain the parameters of the model, and augmented error where we add a penalty term to an otherwise unconstrained minimization of error. Augmented error is the most popular form of regularization, for which we need to choose the regularizer Ω(h) and the regularization parameter λ.
In practice, the choice of Ω is largely heuristic. Finding a perfect Ω is as difficult as finding a perfect H. It depends on information that, by the very nature of learning, we don't have. However, there are regularizers we can work with that have stood the test of time, such as weight decay. Some forms of regularization work and some do not, depending on the specific application and the data. Figure 4.5 illustrated that even the amount of regularization
Figure 4.6: Out-of-sample performance for the uniform (panel a) and low-order (panel b) regularizers using model H_15, with σ² = 0.5, Q_f = 15 and N = 30; each panel plots expected E_out against the regularization parameter λ. Overfitting occurs in the shaded region because lower E_in (lower λ) leads to higher E_out. Underfitting occurs when λ is too large, because the learning algorithm has too little flexibility to fit the data.
has to be chosen carefully. Too much regularization (too harsh a constraint) leaves the learning too little flexibility to fit the data and leads to underfitting, which can be just as bad as overfitting.
If so many choices can go wrong, why do we bother with regularization in the first place? Regularization is a necessary evil, with the operative word being necessary. If our model is too sophisticated for the amount of data we have, we are doomed. By applying regularization, we have a chance. By applying the proper regularization, we are in good shape. Let us experiment with two choices of a regularizer for the model H_15 of 15th order polynomials, using the experimental design in Exercise 4.2:
1. A uniform regularizer: Ω_unif(w) = Σ_{q=0}^{15} w_q².
2. A low-order regularizer: Ω_low(w) = Σ_{q=0}^{15} q w_q².
The first encourages all weights to be small, uniformly; the second pays more attention to the higher order weights, encouraging a lower order fit. Figure 4.6 shows the performance for different values of the regularization parameter λ.
As you decrease λ, the optimization pays less attention to the penalty term and more to E_in, and so E_in will decrease (Problem 4.7). In the shaded region, E_out increases as you decrease E_in (decrease λ): the regularization parameter is too small and there is not enough of a constraint on the learning, leading to decreased performance because of overfitting. In the unshaded region, the regularization parameter is too large, over-constraining the learning and not giving it enough flexibility to fit the data, leading to decreased performance because of underfitting. As can be observed from the figure, the price paid for overfitting is generally more severe than underfitting. It usually pays to be conservative.
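Both regularizers are instances of the Tikhonov form of Exercise 4.5 with a diagonal Γ. The sketch below is purely illustrative (the data, target, and value of λ are assumptions, not the experiment behind Figure 4.6); it assumes the derivation of Example 4.2 carries over with w^T w replaced by w^T Γ^T Γ w, which gives w_reg = (Z^T Z + λ Γ^T Γ)^{-1} Z^T y.

import numpy as np
from numpy.polynomial import legendre

def tikhonov_fit(Z, y, lam, Gamma):
    # Minimize E_in(w) + (lambda/N) * w^T Gamma^T Gamma w  (Tikhonov form of the penalty).
    return np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

rng = np.random.default_rng(1)
Q, N, lam = 15, 30, 1.0                                 # assumed experiment sizes
x = rng.uniform(-1, 1, N)
y = legendre.legval(x, rng.standard_normal(Q + 1)) + 0.7 * rng.standard_normal(N)
Z = legendre.legvander(x, Q)

Gamma_unif = np.eye(Q + 1)                              # Omega_unif(w) = sum_q w_q^2
Gamma_low = np.diag(np.sqrt(np.arange(Q + 1.0)))        # Gamma^T Gamma = diag(q): Omega_low(w) = sum_q q w_q^2

w_unif = tikhonov_fit(Z, y, lam, Gamma_unif)
w_low = tikhonov_fit(Z, y, lam, Gamma_low)
print(np.round(w_unif[-4:], 3), np.round(w_low[-4:], 3))  # the low-order regularizer damps the high-order weights more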
Figure 4.7: Performance of the uniform regularizer at different levels of noise, plotting expected E_out against the regularization parameter λ. Panel (a) varies the stochastic noise (σ² = 0, 0.25, 0.5); panel (b) varies the deterministic noise (Q_f = 15, 30, 100). The optimal λ is highlighted for each curve.
The optimal regularization parameter for the two cases is quite different and the performance can be quite sensitive to the choice of regularization parameter. However, the promising message from the figure is that though the behaviors are quite different, the performances of the two regularizers are comparable (around 0.76), if we choose the right λ for each.
We can also use this experiment to study how performance with regularization depends on the noise. In Figure 4.7(a), when σ² = 0, no amount of regularization helps (i.e., the optimal regularization parameter is λ = 0), which is not a surprise because there is no stochastic or deterministic noise in the data (both target and model are 15th order polynomials). As we add more stochastic noise, the overall performance degrades as expected. Note that the optimal value for the regularization parameter increases with noise, which is also expected based on the earlier discussion that the potential to overfit increases as the noise increases; hence, constraining the learning more should help. Figure 4.7(b) shows what happens when we add deterministic noise, keeping the stochastic noise at zero. This is accomplished by increasing Q_f (the target complexity), thereby adding deterministic noise, but keeping everything else the same. Comparing parts (a) and (b) of Figure 4.7 provides another demonstration of how the effects of deterministic and stochastic noise are similar. When either is present, it is helpful to regularize, and the more noise there is, the larger the amount of regularization you need.
What happens if you pick the wrong regularizer? To illustrate, we picked a regularizer which encourages large weights (weight growth) versus weight decay which encourages small weights. As you can see from the accompanying plot of E_out against the regularization parameter λ, in this case weight growth does not help the cause of overfitting. If we happened to choose weight growth as our regularizer, we would still be OK as long as we have
a good way to pick the regularization parameter: the optimal regularization parameter in this case is λ = 0, and we are no worse off than not regularizing. No regularizer will be ideal for all settings, or even for a specific setting since we never have perfect information, but they all tend to work with varying success, if the amount of regularization λ is set to the correct level. Thus, the entire burden rests on picking the right λ, a task that can be addressed by a technique called validation, which is the topic of the next section.
The lesson learned is that some form of regularization is necessary, as learning is quite sensitive to stochastic and deterministic noise. The best way to constrain the learning is in the 'direction' of the target function, and more of a constraint is needed when there is more noise. Even though we don't know either the target function or the noise, regularization helps by reducing the impact of the noise. Most common models have hypothesis sets which are naturally parameterized so that smaller parameters lead to smoother hypotheses. Thus, a weight decay type of regularizer constrains the learning towards smoother hypotheses. This helps, because stochastic noise is 'high frequency' (non-smooth). Similarly, deterministic noise (the part of the target function which cannot be modeled) also tends to be non-smooth. Thus, constraining the learning towards smoother hypotheses 'hurts' our ability to overfit the noise more than it hurts our ability to fit the useful information. These are empirical observations, not theoretically justifiable statements.
Regularization and the VC dimension. Regularization (for example soft-order selection by minimizing the augmented error) poses a problem for the VC line of reasoning. As λ goes up, the learning algorithm changes but the hypothesis set does not, so d_vc will not change. We argued that an increase in λ in the augmented error corresponds to a decrease in C in the soft-order constrained model. So, more regularization corresponds to an effectively smaller model, and we expect better generalization for a small increase in E_in, even though the VC dimension of the model we are actually using with augmented error does not change. This suggests a heuristic that works well in practice, which is to use an 'effective VC dimension' instead of the VC dimension. For linear perceptrons, the VC dimension equals the number of free parameters d + 1, and so an effective number of parameters is a good surrogate for the VC dimension in the VC bound. The effective number of parameters will go down as λ increases, and so the effective VC dimension will reflect better generalization with increased regularization. Problems 4.13, 4.14, and 4.15 explore the notion of an effective number of parameters.
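As one illustration (an assumption on our part, not a definition taken from this section; the precise definitions are developed in Problems 4.13-4.15), a commonly used surrogate takes the effective number of parameters to be trace(H(λ)): it equals d + 1 at λ = 0 and shrinks as λ grows.

import numpy as np

def effective_parameters(Z, lam):
    # trace of H(lambda) = Z (Z^T Z + lambda I)^{-1} Z^T, used here as an assumed
    # surrogate for the effective number of parameters.
    d_plus_1 = Z.shape[1]
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d_plus_1), Z.T)
    return np.trace(H)

rng = np.random.default_rng(2)
Z = rng.standard_normal((50, 6))               # toy feature matrix with d + 1 = 6
for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, round(effective_parameters(Z, lam), 2))
# lam = 0 gives trace(H) = 6; the value decreases as lam increases.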
4.3 Validation
So far, we have identified overfitting as a problem, noise (stochastic and deterministic) as a cause, and regularization as a cure. In this section, we introduce another cure, called validation. One can think of both regularization and
validation as attempts at minimizing E_out rather than just E_in. Of course the true E_out is not available to us, so we need an estimate of E_out based on information available to us in sample. In some sense, this is the Holy Grail of machine learning: to find an in-sample estimate of the out-of-sample error. Regularization attempts to minimize E_out by working through the equation
\[ E_{\rm out}(h) = E_{\rm in}(h) + \underbrace{\text{overfit penalty}}_{\text{regularization estimates this quantity}}, \]
and concocting a heuristic term that emulates the penalty term. Validation, on the other hand, cuts to the chase and estimates the out-of-sample error directly:
\[ \underbrace{E_{\rm out}(h)}_{\text{validation estimates this quantity}} = E_{\rm in}(h) + \text{overfit penalty}. \]
Estimating the out-of-sample error directly is nothing new to us. In Section 2.2.3, we introduced the idea of a test set, a subset of D that is not involved in the learning process and is used to evaluate the final hypothesis. The test error E_test, unlike the in-sample error E_in, is an unbiased estimate of E_out.
4.3.1 The Validation Set
The idea of a validation set is almost identical to that of a test set. We remove a subset from the data; this subset is not used in training. We then use this held-out subset to estimate the out-of-sample error. The held-out set is effectively out-of-sample, because it has not been used during the learning.
However, there is a difference between a validation set and a test set. Although the validation set will not be directly used for training, it will be used in making certain choices in the learning process. The minute a set affects the learning process in any way, it is no longer a test set. However, as we will see, the way the validation set is used in the learning process is so benign that its estimate of E_out remains almost intact.
Let us first look at how the validation set is created. The first step is to partition the data set D into a training set D_train of size (N − K) and a validation set D_val of size K. Any partitioning method which does not depend on the values of the data points will do; for example, we can select N − K points at random for training and the remaining for validation.
Now, we run the learning algorithm using the training set D_train to obtain a final hypothesis g⁻ ∈ H, where the 'minus' superscript indicates that some data points were taken out of the training. We then compute the validation error for g⁻ using the validation set D_val:
\[ E_{\rm val}(g^-) = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\rm val}} e\!\left(g^-(\mathbf{x}_n), y_n\right), \]
where e(g⁻(x), y) is the pointwise error measure which we introduced in Section 1.4.1. For classification, e(g⁻(x), y) = [g⁻(x) ≠ y] and for regression using squared error, e(g⁻(x), y) = (g⁻(x) − y)².
The validation error is an unbiased estimate of E_out because the final hypothesis g⁻ was created independently of the data points in the validation set. Indeed, taking the expectation of E_val with respect to the data points in D_val,
\[ \mathbb{E}_{\mathcal{D}_{\rm val}}\!\left[E_{\rm val}(g^-)\right] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\rm val}} \mathbb{E}_{\mathbf{x}_n}\!\left[e(g^-(\mathbf{x}_n), y_n)\right] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\rm val}} E_{\rm out}(g^-) = E_{\rm out}(g^-). \qquad (4.8) \]
The first step uses the linearity of expectation, and the second step follows because e(g⁻(x_n), y_n) depends only on x_n and so
\[ \mathbb{E}_{\mathcal{D}_{\rm val}}\!\left[e(g^-(\mathbf{x}_n), y_n)\right] = \mathbb{E}_{\mathbf{x}_n}\!\left[e(g^-(\mathbf{x}_n), y_n)\right] = E_{\rm out}(g^-). \]
How reliable is E_val at estimating E_out? In the case of classification, one can use the VC bound to predict how good the validation error is as an estimate for the out-of-sample error. We can view D_val as an 'in-sample' data set on which we computed the error of the single hypothesis g⁻. We can thus apply the VC bound for a finite model with one hypothesis in it (the Hoeffding bound). With high probability,
\[ E_{\rm out}(g^-) \le E_{\rm val}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.9) \]
While Inequality (4.9) applies to binary target functions, we may use the variance of E_val as a more generally applicable measure of the reliability. The next exercise studies how the variance of E_val depends on K (the size of the validation set), and implies that a similar bound holds for regression. The conclusion is that the error between E_val(g⁻) and E_out(g⁻) drops as σ(g⁻)/√K, where σ(g⁻) is bounded by a constant in the case of classification.
Exercise 4.7
Fix g⁻ (learned from D_train) and define σ²_val := Var_{D_val}[E_val(g⁻)]. We consider how σ²_val depends on K. Let
\[ \sigma^2(g^-) = \mathrm{Var}_{\mathbf{x}}\!\left[e(g^-(\mathbf{x}), y)\right] \]
be the pointwise variance in the out-of-sample error of g⁻.
(a) Show that σ²_val = (1/K) σ²(g⁻).
(b) In a classification problem, where e(g⁻(x), y) = [g⁻(x) ≠ y], express σ²_val in terms of P[g⁻(x) ≠ y].
(c) Show that for any g⁻ in a classification problem, σ²_val ≤ 1/(4K).
(d) Is there a uniform upper bound for Var[E_val(g⁻)] similar to (c) in the case of regression with squared error e(g⁻(x), y) = (g⁻(x) − y)²? [Hint: The squared error is unbounded.]
(e) For regression with squared error, if we train using fewer points (smaller N − K) to get g⁻, do you expect σ²(g⁻) to be higher or lower? [Hint: For continuous, non-negative random variables, higher mean often implies higher variance.]
(f) Conclude that increasing the size of the validation set can result in a better or a worse estimate of E_out.
The expected validation error for H_2 is illustrated in Figure 4.8, where we used the experimental design in Exercise 4.2, with Q_f = 10, N = 40 and noise level 0.4. The expected validation error equals E_out(g⁻), per Equation (4.8).
Figure 4.8: The expected validation error E[E_val(g⁻)] as a function of the size of the validation set, K; the shaded area is E[E_val] ± σ_val.
The figure clearly shows that there is a price to be paid for setting aside K data points to get this unbiased estimate of E_out: when we set aside more data for validation, there are fewer training data points and so g⁻ becomes worse; E_out(g⁻), and hence the expected validation error, increases (the blue curve). As we expect, the uncertainty in E_val as measured by σ_val (size of the shaded region) is decreasing with K, up to the point where the variance σ²(g⁻) gets really bad. This point comes when the number of training data points becomes critically small, as in Exercise 4.7(e). If K is neither too small nor too large, E_val provides a good estimate of E_out. A rule of thumb in practice is to set K = N/5 (set aside 20% of the data for validation).
We have established two conflicting demands on K. It has to be big enough for E_val to be reliable, and it has to be small enough so that the training set with N − K points is big enough to get a decent g⁻. Inequality (4.9) quantifies the first demand. The second demand is quantified by the learning curve
discussed in Section 2.3.2 (also the blue curve in Figure 4.8, from right to left), which shows how the expected out-of-sample error goes down as the number of training data points goes up. The fact that more training data lead to a better final hypothesis has been extensively verified empirically, although it is challenging to prove theoretically.
Restoring D. Although the learning curve suggests that taking out K data points for validation and using only N − K for training will cost us in terms of E_out, we do not have to pay that price! The purpose of validation is to estimate the out-of-sample performance, and E_val happens to be a good estimate of E_out(g⁻). This does not mean that we have to output g⁻ as our final hypothesis. The primary goal is to get the best possible hypothesis, so we should output g, the hypothesis trained on the entire set D. The secondary goal is to estimate E_out, which is what validation allows us to do. Based on our discussion of learning curves, E_out(g) ≤ E_out(g⁻), so
\[ E_{\rm out}(g) \le E_{\rm out}(g^-) \le E_{\rm val}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.10) \]
Figure 4.9: Using a validation set to estimate E_out.
The first inequality is subdued because it was not rigorously proved. If we first train with N − K data points, validate with the remaining K data points and then retrain using all the data to get g, the validation error we got will likely still be better at estimating E_out(g) than the estimate using the VC bound with E_in(g), especially for large hypothesis sets with big d_vc.
So far, we have treated the validation set as a way to estimate E_out, without involving it in any decisions that affect the learning process. Estimating E_out is a useful role by itself; a customer would typically want to know how good the final hypothesis is (in fact, the inequalities in (4.10) suggest that the validation error is a pessimistic estimate of E_out, so your customer is likely to be pleasantly surprised when he tries your system on new data). However, as we will see next, an important role of a validation set is in fact to guide the learning process. That's what distinguishes a validation set from a test set.
4.3.2 Model Selection
By far, the most important use of validation is for model selection. This could mean the choice between a linear model and a nonlinear model, the choice of the order of polynomial in a model, the choice of the value of a regularization
Figure 4.10: Optimistic bias of the validation error when using a validation
set for the model selected.
parameter, or any other choice that affects the learning process. In almost every learning situation, there are some choices to be made and we need a principled way of making these choices.
The leap is to realize that validation can be used to estimate the out-of-sample error for more than one model. Suppose we have M models H_1, ..., H_M. Validation can be used to select one of these models. Use the training set D_train to learn a final hypothesis g⁻_m for each model. Now evaluate each model on the validation set to obtain the validation errors E_1, ..., E_M, where
\[ E_m = E_{\rm val}(g_m^-), \qquad m = 1, \ldots, M. \]
The validation errors estimate the out-of-sample error E_out(g⁻_m) for each H_m.
Exercise 4.8
Is E_m an unbiased estimate for the out-of-sample error E_out(g⁻_m)?
It is now a simple matter to select the model with lowest validation error. Let m* be the index of the model which achieves the minimum validation error. So for H_{m*}, E_{m*} ≤ E_m for m = 1, ..., M. The model H_{m*} is the model selected based on the validation errors. Note that E_{m*} is no longer an unbiased estimate of E_out(g⁻_{m*}). Since we selected the model with minimum validation error, E_{m*} will have an optimistic bias. This optimistic bias when selecting between H_2 and H_5 is illustrated in Figure 4.10, using the experimental design described in Exercise 4.2 with Q_f = 3, σ² = 0.4 and N = 35.
Exercise 4.9
Referring to Figure 4.10, why are both curves increasing with K? Why do they converge to each other with increasing K?
How good is the generalization error for this entire process of model selection using validation? Consider a new model H_val consisting of the final hypotheses learned from the training data using each model H_1, ..., H_M:
\[ \mathcal{H}_{\rm val} = \{ g_1^-, g_2^-, \ldots, g_M^- \}. \]
Model selection using the validation set chose one of the hypotheses in H_val based on its performance on D_val. Since the model H_val was obtained before ever looking at the data in the validation set, this process is entirely equivalent to learning a hypothesis from H_val using the data in D_val. The validation errors E_val(g⁻_m) are 'in-sample' errors for this learning process and so we may apply the VC bound for finite hypothesis sets, with |H_val| = M:
\[ E_{\rm out}(g_{m^*}^-) \le E_{\rm val}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.11) \]
What if we didn't use a validation set to choose the model? One alternative would be to use the in-sample errors from each model as the model selection criterion. Specifically, pick the model which gives a final hypothesis with minimum in-sample error. This is equivalent to picking the hypothesis with minimum in-sample error from the grand model which contains all the hypotheses in each of the M original models. If we want a bound on the out-of-sample error for the final hypothesis that results from this selection, we need to apply the VC penalty for this grand hypothesis set which is the union of the M hypothesis sets (see Problem 2.14). Since this grand hypothesis set can have a huge VC dimension, the bound in (4.11) will generally be tighter.
The goal of model selection is to select the best model and output the best hypothesis from that model. Specifically, we want to select the model m for which E_out(g_m) will be minimum when we retrain with all the data. Model selection using a validation set relies on the leap of faith that if E_out(g⁻_m) is minimum, then E_out(g_m) is also minimum. The validation errors E_m estimate E_out(g⁻_m), so, modulo our leap of faith, the validation set should pick the right model. No matter which model m* is selected, however, based on the discussion of learning curves in the previous section, we should not output g⁻_{m*} as the final hypothesis. Rather, once m* is selected using validation, learn using all the data and output g_{m*}, which satisfies
Figure 4.11: Using a validation set for model selection.
\[ E_{\rm out}(g_{m^*}) \le E_{\rm out}(g_{m^*}^-) \le E_{\rm val}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.12) \]
Again, the first inequality is subdued because we didn't prove it.
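A minimal sketch of the whole pipeline in Python (illustrative only: the candidate models H_2 and H_5 follow the comparison of Figure 4.12, but the target, noise level and split size are assumptions): train each model on D_train, pick m* by the validation error, then retrain the winner on all of D.

import numpy as np

rng = np.random.default_rng(3)
N, K = 35, 10                                    # data size and validation size (assumed)
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.4 * rng.standard_normal(N)   # assumed noisy target

idx = rng.permutation(N)
train, val = idx[K:], idx[:K]                    # D_train has N - K points, D_val has K

def sq_err(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

models = [2, 5]                                  # polynomial orders: H_2 and H_5
E_val = []
for deg in models:
    g_minus = np.polyfit(x[train], y[train], deg)      # g_m^- learned on D_train only
    E_val.append(sq_err(g_minus, x[val], y[val]))      # E_m = E_val(g_m^-)

m_star = int(np.argmin(E_val))                   # model with the smallest validation error
g_final = np.polyfit(x, y, models[m_star])       # retrain on all the data; output g_{m*}
print("selected order:", models[m_star], "validation errors:", np.round(E_val, 3))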
Figure 4.12: Model selection between H_2 and H_5 using a validation set. The solid black line uses E_in for model selection, which always selects H_5. The dotted line shows the optimal model selection, if we could select the model based on the true out-of-sample error. This is unachievable, but a useful benchmark. The best performer is clearly the validation set, outputting g_{m*}. For suitable K, even g⁻_{m*} is better than in-sample selection.
Continuing our experiment from Figure 4.10, we evaluate the out-of-sample performance when using a validation set to select between the models H_2 and H_5. The results are shown in Figure 4.12. Validation is a clear winner over using E_in for model selection.
Exercise 4.10
(a) From Figure 4.12, E[E_out(g_{m*})] is initially decreasing. How can this be, if E[E_out(g⁻_m)] is increasing in K for each m?
(b) From Figure 4.12 we see that E[E_out(g_{m*})] is initially decreasing, and then it starts to increase. What are the possible reasons for this?
(c) When K = 1, E[E_out(g⁻_{m*})] < E[E_out(g_{m*})]. How can this be, if the learning curves for both models are decreasing?
Example 4.3. We can use a validation set to select the value of the regularization parameter in the augmented error of (4.6). Although the most important part of a model is the hypothesis set, every hypothesis set has an associated learning algorithm which selects the final hypothesis g. Two models may be different only in their learning algorithm, while working with the same hypothesis set. Changing the value of λ in the augmented error changes the learning algorithm (the criterion by which g is selected) and effectively changes the model.
Based on this discussion, consider the M different models corresponding to the same hypothesis set H but with M different choices for λ in the augmented error. So, we have (H, λ_1), (H, λ_2), ..., (H, λ_M) as our M different models. We
may, for example, choose λ_1 = 0, λ_2 = 0.01, λ_3 = 0.02, ..., λ_M = 10. Using a validation set to choose one of these M models amounts to determining the value of λ to within a resolution of 0.01. □
We have analyzed validation for model selection based on a finite number of models. If validation is used to choose the value of a parameter, for example λ as in the previous example, then the value of M will depend on the resolution to which we determine that parameter. In the limit, the selection is actually among an infinite number of models since the value of λ can be any real number. What happens to bounds like (4.11) and (4.12) which depend on M? Just as the Hoeffding bound for a finite hypothesis set did not collapse when we moved to infinite hypothesis sets with finite VC dimension, bounds like (4.11) and (4.12) will not completely collapse either. We can derive VC-type bounds here too, because even though there are an infinite number of models, these models are all very similar; they differ only slightly in the value of λ. As a rule of thumb, what matters is the number of parameters we are trying to set. If we have only one or a few parameters, the estimates based on a decent sized validation set would be reliable. The more choices we make based on the same validation set, the more 'contaminated' the validation set becomes and the less reliable its estimates will be. The more we use the validation set to fine tune the model, the more the validation set becomes like a training set used to 'learn the right model'; and we all know how limited a training set is in its ability to estimate E_out.
You will be hard pressed to find a serious learning problem in which validation is not used. Validation is a conceptually simple technique, easy to apply in almost any setting, and requires no specific knowledge about the details of a model. The main drawback is the reduced size of the training set, but that can be significantly mitigated through a modified version of validation which we discuss next.
4.3.3 Cross Validation
Validation relies on the following chain of reasoning,
\[ E_{\rm out}(g) \ \underset{\text{(small }K\text{)}}{\approx}\ E_{\rm out}(g^-) \ \underset{\text{(large }K\text{)}}{\approx}\ E_{\rm val}(g^-), \]
which highlights the dilemma we face in trying to select K. We are going to output g. When K is large, there is a discrepancy between the two out-of-sample errors E_out(g⁻) (which E_val directly estimates) and E_out(g) (which is the final error when we learn using all the data D). We would like to choose K as small as possible in order to minimize the discrepancy between E_out(g⁻) and E_out(g); ideally K = 1. However, if we make this choice, we lose the reliability of the validation estimate as the bound on the RHS of (4.9) becomes huge. The validation error E_val(g⁻) will still be an unbiased estimate of E_out(g⁻)
(g⁻ is trained on N − 1 points), but it will be so unreliable as to be useless since it is based on only one data point. This brings us to the cross validation estimate of out-of-sample error. We will focus on the leave-one-out version which corresponds to a validation set of size K = 1, and is also the easiest case to illustrate. More popular versions typically use larger K, but the essence of the method is the same.
There are N ways to partition the data into a training set of size N − 1 and a validation set of size 1. Specifically, let
\[ \mathcal{D}_n = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_{n+1}, y_{n+1}), \ldots, (\mathbf{x}_N, y_N) \]
be the data set D after leaving out data point (x_n, y_n). Denote the final hypothesis learned from D_n by g_n⁻. Let e_n be the error made by g_n⁻ on its validation set, which is just the single data point {(x_n, y_n)}:
\[ e_n = E_{\rm val}(g_n^-) = e\!\left(g_n^-(\mathbf{x}_n), y_n\right). \]
The cross validation estimate is the average value of the e_n's,
\[ E_{\rm cv} = \frac{1}{N}\sum_{n=1}^{N} e_n. \]
Figure 4.13: Illustration of leave-one-out cross validation for a linear fit using three data points. The average of the three red errors obtained by the linear fits leaving out one data point at a time is E_cv.
Figure 4.13 illustrates cross validation on a simple example. Each e_n is a wild, yet unbiased estimate for the corresponding E_out(g_n⁻), which follows after setting K = 1 in (4.8). With cross validation, we have N functions g_1⁻, ..., g_N⁻ together with the N error estimates e_1, ..., e_N. The hope is that these N errors together would be almost equivalent to estimating E_out on a reliable validation set of size N, while at the same time we managed to use N − 1 points to obtain each g_n⁻. Let's try to understand why E_cv is a good estimator of E_out.
First and foremost, E_cv is an unbiased estimator of 'E_out(g⁻)'. We have to be a little careful here because we don't have a single hypothesis g⁻, as we did when using a single validation set. Depending on the (x_n, y_n) that was taken out, each g_n⁻ can be a different hypothesis. To understand the sense in which E_cv estimates E_out, we need to revisit the concept of the learning curve.
Ideally, we would like to know E_out(g). The final hypothesis g is the result of learning on a random data set D of size N. It is almost as useful to know the expected performance of your model when you learn on a data set of size N; the hypothesis g is just one such instance of learning on a data set of size N. This expected performance averaged over data sets of size N, when viewed as a function of N, is exactly the learning curve shown in Figure 4.2. More formally, for a given model, let
\[ E_{\rm out}(N) = \mathbb{E}_{\mathcal{D}}\!\left[E_{\rm out}(g)\right] \]
be the expectation (over data sets D of size N) of the out-of-sample error produced by the model. The expected value of E_cv is exactly E_out(N − 1). This is true because it is true for each individual validation error e_n:
\[ \mathbb{E}_{\mathcal{D}}[e_n] = \mathbb{E}_{\mathcal{D}_n}\,\mathbb{E}_{(\mathbf{x}_n, y_n)}\!\left[e\!\left(g_n^-(\mathbf{x}_n), y_n\right)\right] = \mathbb{E}_{\mathcal{D}_n}\!\left[E_{\rm out}(g_n^-)\right] = E_{\rm out}(N-1). \]
Since this equality holds for each e_n, it also holds for the average. We highlight this result by making it a theorem.
Theorem 4.4. E_cv is an unbiased estimate of E_out(N − 1) (the expectation of the model performance, E[E_out], over data sets of size N − 1).
Now that we have our cross validation estimate of E_out, there is no need to output any of the g_n⁻ as our final hypothesis. We might as well squeeze every last drop of performance and retrain using the entire data set D, outputting g as the final hypothesis and getting the benefit of going from N − 1 to N on the learning curve. In this case, the cross validation estimate will on average be an upper estimate for the out-of-sample error: E_out(g) ≲ E_cv, so expect to be pleasantly surprised, albeit slightly.
Figure 4.14: Using cross validation to estimate E_out.
With just simple validation and a validation set of size K = 1, we know that the validation estimate will not be reliable. How reliable is the cross validation estimate E_cv? We can measure the reliability using the variance of E_cv.
Unfortunately, while we were able to pin down the expectation of E_cv, the variance is not so easy.
If the N cross validation errors e_1, ..., e_N were equivalent to N errors on a totally separate validation set of size N, then E_cv would indeed be a reliable estimate, for decent-sized N. The equivalence would hold if the individual e_n's were independent of each other. Of course, this is too optimistic. Consider two validation errors e_n, e_m. The validation error e_n depends on g_n⁻, which was trained on data containing (x_m, y_m). Thus, e_n has a dependency on (x_m, y_m). The validation error e_m is computed using (x_m, y_m) directly, and so it also has a dependency on (x_m, y_m). Consequently, there is a possible correlation between e_n and e_m through the data point (x_m, y_m). That correlation wouldn't be there if we were validating a single hypothesis using N fresh (independent) data points.
How much worse is the cross validation estimate as compared to an estimate based on a truly independent set of N validation errors? A VC-type probabilistic bound, or even computation of the asymptotic variance of the cross validation estimate (Problem 4.23), is challenging. One way to quantify the reliability of E_cv is to compute how many fresh validation data points would have a comparable reliability to E_cv, and Problem 4.24 discusses one way to do this. There are two extremes for this effective size. On the high end is N, which means that the cross validation errors are essentially independent. On the low end is 1, which means that E_cv is only as good as any single one of the individual cross validation errors e_n, i.e., the cross validation errors are totally dependent. While one cannot prove anything theoretically, in practice the reliability of E_cv is much closer to the higher end.
[Figure: the effective number of fresh examples giving a comparable estimate of E_out, on a scale from 1 to N.]
Cross validation for model selection. In Figure 4.11, the estimates E_m for the out-of-sample error of model H_m were obtained using the validation set. Instead, we may use cross validation estimates to obtain E_m: use cross validation to obtain estimates of the out-of-sample error for each model H_1, ..., H_M, and select the model with the smallest cross validation error. Now, train this model selected by cross validation using all the data to output a final hypothesis, making the usual leap of faith that E_out(g⁻) tracks E_out(g) well.
Example 4.5. In Figure 4.13, we illustrated cross validation for estimating E_out of a linear model (h(x) = ax + b) using a simple experiment with three data points generated from a constant target function with noise. We now consider a second model, the constant model (h(x) = b). We can also use cross validation to estimate E_out for the constant model, illustrated in Figure 4.15.
Figure 4.15: Leave-one-out cross validation error for a constant fit.
If we use the in-sample error after fitting all the data (three points), then the linear model wins because it can use its additional degree of freedom to fit the data better. The same is true with the cross validation data sets of size two: the linear model has perfect in-sample error. But, with cross validation, what matters is the error on the outstanding point in each of these fits. Even to the naked eye, the average of the cross validation errors is smaller for the constant model, which obtained E_cv = 0.065 versus E_cv = 0.184 for the linear model. The constant model wins, according to cross validation. The constant model also has lower E_out and so cross validation selected the correct model in this example. □
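A tiny sketch of this comparison (the three data points below are made up for illustration; they are not the points of Figures 4.13 and 4.15): leave out each point in turn, fit the constant and the linear model to the remaining two, and average the squared errors on the held-out point.

import numpy as np

def loo_cv_error(x, y, degree):
    # Leave-one-out cross validation error for a polynomial fit of the given degree.
    errors = []
    for n in range(len(x)):
        mask = np.arange(len(x)) != n                 # D_n: all points except (x_n, y_n)
        coeffs = np.polyfit(x[mask], y[mask], degree)
        errors.append((np.polyval(coeffs, x[n]) - y[n]) ** 2)   # e_n
    return np.mean(errors)                            # E_cv

# Three illustrative points from a noisy constant target (values are assumptions).
x = np.array([-0.8, 0.1, 0.9])
y = np.array([0.2, -0.1, 0.3])

print("constant model E_cv:", round(loo_cv_error(x, y, 0), 3))
print("linear   model E_cv:", round(loo_cv_error(x, y, 1), 3))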
One important use of validation is to estimate the optimal regularization parameter λ, as described in Example 4.3. We can use cross validation for the same purpose, as summarized in the algorithm below.
Cross validation for selecting λ:
1: Define M models by choosing different values for λ in the augmented error: (H, λ_1), (H, λ_2), ..., (H, λ_M).
2: for each model m = 1, ..., M do
3:    Use the cross validation module in Figure 4.14 to estimate E_cv(m), the cross validation error for model m.
4: Select the model m* with minimum E_cv(m*).
5: Use model (H, λ_{m*}) and all the data D to obtain the final hypothesis g_{m*}. Effectively, you have estimated the optimal λ.
We see from Figure 4.14 that estimating E_cv for just a single model requires N rounds of learning on D_1, ..., D_N, each of size N − 1. So the cross validation algorithm above requires MN rounds of learning. This is a formidable task. If we could analytically obtain E_cv, that would be a big bonus, but analytic results are often difficult to come by for cross validation. One exception is in the case of linear models, where we are able to derive an exact analytic formula for the cross validation estimate.
Analytic computation of E_cv for linear models. Recall that for linear regression with weight decay, w_reg = (Z^T Z + λI)^{-1} Z^T y, and the in-sample predictions are
\[ \hat{\mathbf{y}} = \mathrm{H}(\lambda)\mathbf{y}, \]
where H(λ) = Z(Z^T Z + λI)^{-1} Z^T. Given H, ŷ, and y, it turns out that we can analytically compute the cross validation estimate as:
\[ E_{\rm cv} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - \mathrm{H}_{nn}(\lambda)}\right)^2. \qquad (4.13) \]
Notice that the cross validation estimate is very similar to the in-sample error, E_in = (1/N) Σ_n (ŷ_n − y_n)², differing only by a normalization of each term in the sum by a factor 1/(1 − H_nn(λ))². One use for this analytic formula is that it can be directly optimized to obtain the best regularization parameter λ. A proof of this remarkable formula is given in Problem 4.26.
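The formula lends itself to a one-line computation once H(λ) is available. The sketch below (illustrative, with a random toy data set) evaluates Equation (4.13) and checks it against the brute-force leave-one-out value; the two agree up to numerical precision.

import numpy as np

def analytic_loocv(Z, y, lam):
    # E_cv = (1/N) * sum_n ((yhat_n - y_n) / (1 - H_nn(lambda)))^2   -- Equation (4.13)
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)
    y_hat = H @ y
    return np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)

def brute_force_loocv(Z, y, lam):
    N, d = Z.shape
    errs = []
    for n in range(N):
        mask = np.arange(N) != n
        w = np.linalg.solve(Z[mask].T @ Z[mask] + lam * np.eye(d), Z[mask].T @ y[mask])
        errs.append((Z[n] @ w - y[n]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(5)
Z = rng.standard_normal((25, 4))     # toy transformed data matrix (assumed)
y = rng.standard_normal(25)
print(analytic_loocv(Z, y, 0.5), brute_force_loocv(Z, y, 0.5))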
Even when we cannot derive such an analytic characterization of cross validation, the technique widely results in good out-of-sample error estimates in practice, and so the computational burden is often worth enduring. Also, as with using a validation set, cross validation applies in almost any setting without requiring specific knowledge about the details of the models.
So far, we have lived in a world of unlimited computation, and all that mattered was out-of-sample error; in reality, computation time can be of consequence, especially with huge data sets. For this reason, leave-one-out cross validation may not be the method of choice.⁴ A popular derivative of leave-one-out cross validation is V-fold cross validation.⁵ In V-fold cross validation, the data are partitioned into V disjoint sets (or folds) D_1, ..., D_V, each of size approximately N/V; each set D_v in this partition serves as a validation set to compute a validation error for a hypothesis g⁻ learned on a training set which is the complement of the validation set, D \ D_v. So, you always validate a hypothesis on data that was not used for training that particular hypothesis. The V-fold cross validation error is the average of the V validation errors that are obtained, one from each validation set D_v. Leave-one-out cross validation is the same as N-fold cross validation. The gain from choosing V ≪ N is computational. The drawback is that you will be estimating E_out for a hypothesis g⁻ trained on less data (as compared with leave-one-out) and so the discrepancy between E_out(g) and E_out(g⁻) will be larger. A common choice in practice is 10-fold cross validation, and one of the folds is illustrated below.
[Figure: the data D partitioned into ten folds D_1, ..., D_10; one fold serves as the validation set while the remaining nine are used for training.]
"‘S ta b ility p ro b lo in s have a lso b een re p o rte d in leave-o n e-o u t.
■'Some a u th o rs call it b '-l'o ld c ro ss v a lid a tio n , b u t we choose \ ' so as n o t to co nfuse w ith
th e size of tlie v a lid a tio n se t K .
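A minimal V-fold sketch (illustrative; the fold count, data, and the polynomial model are assumptions): permute the indices, split them into V roughly equal folds, validate on each fold a hypothesis trained on the remaining folds, and average the V validation errors.

import numpy as np

def v_fold_cv_error(x, y, degree, V=10, seed=0):
    # V-fold cross validation error for a polynomial fit of the given degree.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), V)    # V disjoint validation sets D_v
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), val_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)    # g^- trained on D \ D_v
        errors.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)                                  # average of the V validation errors

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 100)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(100)      # assumed noisy target
for deg in (1, 3, 10):
    print(deg, round(v_fold_cv_error(x, y, deg), 4))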
4.3.4 Theory Versus Practice
Both validation and cross validation present challenges for the mathematical theory of learning, similar to the challenges presented by regularization. The theory of generalization, in particular the VC analysis, forms the foundation for learnability. It provides us with guidelines under which it is possible to make a generalization conclusion with high probability. It is not straightforward, and sometimes not possible, to rigorously carry these conclusions over to the analysis of validation, cross validation, or regularization. What is possible, and indeed quite effective, is to use the theory as a guideline. In the case of regularization, constraining the choice of a hypothesis leads to better generalization, as we would intuitively expect, even if the hypothesis set remains technically the same. In the case of validation, making a choice for few parameters does not overly contaminate the validation estimate of E_out, even if the VC guarantee for these estimates is too weak. In the case of cross validation, the benefit of averaging several validation errors is observed, even if the estimates are not independent.
Although these techniques were based on sound theoretical foundation, they are to be considered heuristics because they do not have a full mathematical justification in the general case. Learning from data is an empirical task with theoretical underpinnings. We prove what we can prove, but we use the theory as a guideline when we don't have a conclusive proof. In a practical application, heuristics may win over a rigorous approach that makes unrealistic assumptions. The only way to be convinced about what works and what doesn't in a given situation is to try out the techniques and see for yourself. The basic message in this chapter can be summarized as follows.
1. Noise (stochastic or deterministic) affects learning adversely, leading to overfitting.

2. Regularization helps to prevent overfitting by constraining the model, reducing the impact of the noise, while still giving us flexibility to fit the data.

3. Validation and cross validation are useful techniques for estimating E_out. One important use of validation is model selection, in particular to estimate the amount of regularization to use.
Example 4.6. We illustrate validation on the handwritten digit classification task of deciding whether a digit is 1 or not (see also Example 3.1) based on the two features which measure the symmetry and average intensity of the digit. The data is shown in Figure 4.16(a).
Figure 4.16: (a) The digits data, of which 500 are selected as the training set. (b) The data are transformed via the 5th order polynomial transform to a 20-dimensional feature vector. We show the performance curves as we vary the number of these features used for classification.
We have randomly selected 500 data points as the training data and the remaining are used as a test set for evaluation. We considered a nonlinear feature transform to a 5th order polynomial feature space:

(1, x_1, x_2) -> (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, ..., x_1^5, x_1^4 x_2, x_1^3 x_2^2, x_1^2 x_2^3, x_1 x_2^4, x_2^5).
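For readers who want to reproduce this transform, here is one way to generate the 5th order polynomial features in Python; the helper name and the ordering of the monomials are assumptions made for illustration (for two inputs and degree 5 it produces the 20 non-constant monomials).

    import numpy as np
    from itertools import combinations_with_replacement

    def poly_transform(X, degree=5):
        """Map each row (x1, x2) of X to all monomials x1^i * x2^j with
        1 <= i + j <= degree; for two inputs and degree 5 this gives the
        20 non-constant features (the constant is carried by the bias weight)."""
        features = []
        for k in range(1, degree + 1):
            for combo in combinations_with_replacement(range(X.shape[1]), k):
                features.append(np.prod(X[:, list(combo)], axis=1))
        return np.column_stack(features)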
Figure 4.16(b) shows the in-sample error as you use more of the transformed features, increasing the dimension from 1 to 20. As you add more dimensions (increase the complexity of the model), the in-sample error drops, as expected. The out-of-sample error drops at first, and then starts to increase, as we hit the approximation-generalization tradeoff. The leave-one-out cross validation error tracks the behavior of the out-of-sample error quite well. If we were to pick a model based on the in-sample error, we would use all 20 dimensions. The cross validation error is minimized between 5-7 feature dimensions; we take 6 feature dimensions as the model selected by cross validation. The table below summarizes the resulting performance metrics:
                      E_in     E_out
  No Validation        0%      2.5%
  Cross Validation     0.8%    1.5%
Cross validation results in a performance improvement of about 1%, which is a massive relative improvement (40% reduction in error rate).
Exercise 4.11

In this particular experiment, the black curve (E_cv) is sometimes below and sometimes above the red curve (E_out). If we repeated this experiment many times, and plotted the average black and red curves, would you expect the black curve to lie above or below the red curve?
It is illuminating to see the actual classification boundaries learned with and without validation. These resulting classifiers, together with the 500 in-sample data points, are shown in the next figure.
[Figure: the learned decision boundaries. Left: 20-dim classifier (no validation), E_in = 0%, E_out = 2.5%. Right: 6-dim classifier (LOO-CV), E_in = 0.8%, E_out = 1.5%.]
It is clear that the worse out-of-sample performance of the classifier picked without validation is due to the overfitting of a few noisy points in the training data. While the training data is perfectly separated, the shape of the resulting boundary seems highly contorted, which is a symptom of overfitting. Does this remind you of the first example that opened the chapter? There, albeit in a toy example, we similarly obtained a highly contorted fit. As you can see, overfitting is real, and here to stay! □
4.4 Problems
Problem 4.1 Plot the monomials of order i, φ_i(x) = x^i. As you increase the order, does this correspond to the intuitive notion of increasing complexity?
Problem 4.2 Consider the feature transform z = [L_0(x), L_1(x), L_2(x)]^T and the linear model h(x) = w^T z. For the hypothesis with w = [1, -1, 1]^T, what is h(x) explicitly as a function of x? What is its degree?
Problem 4.3 The Legendre Polynomials are a family of orthogonal polynomials which are useful for regression. The first two Legendre Polynomials are L_0(x) = 1, L_1(x) = x. The higher order Legendre Polynomials are defined by the recursion:

L_k(x) = ((2k-1)/k) x L_{k-1}(x) - ((k-1)/k) L_{k-2}(x).

(a) What are the first six Legendre Polynomials? Use the recursion to develop an efficient algorithm to compute L_0(x), ..., L_K(x) given x. Your algorithm should run in time linear in K. Plot the first six Legendre polynomials.

(b) Show that L_k(x) is a linear combination of monomials x^k, x^{k-2}, ... (either all odd or all even order, with highest order k). Thus,

L_k(-x) = (-1)^k L_k(x).

(c) Show that ((x^2 - 1)/k) dL_k(x)/dx = x L_k(x) - L_{k-1}(x). [Hint: use induction.]

(d) Use part (c) to show that L_k satisfies Legendre's differential equation

d/dx ( (x^2 - 1) dL_k(x)/dx ) = k(k+1) L_k(x).

This means that the Legendre Polynomials are eigenfunctions of a Hermitian linear differential operator and, from Sturm-Liouville theory, they form an orthogonal basis for continuous functions on [-1, 1].

(e) Use the recurrence to show directly the orthogonality property:

∫_{-1}^{1} dx L_k(x) L_ℓ(x) = 0 if ℓ ≠ k, and 2/(2k+1) if ℓ = k.

[Hint: use induction on k, with ℓ ≤ k. Use the recurrence for L_k and consider separately the four cases ℓ = k, k-1, k-2 and ℓ < k-2. For the case ℓ = k you will need to compute the integral ∫_{-1}^{1} dx x^2 L_{k-1}^2(x). In order to do this, you could use the differential equation in part (c), multiply by x L_k and then integrate both sides (the LHS can be integrated by parts). Now solve the resulting equation for ∫_{-1}^{1} dx x^2 L_{k-1}^2(x).]
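As a companion to part (a), here is a minimal sketch of the linear-time recursion in Python; the function name is an illustrative choice.

    import numpy as np

    def legendre_upto(K, x):
        """Return [L_0(x), ..., L_K(x)] for an array x, using the three-term
        recursion L_k = ((2k-1)/k) x L_{k-1} - ((k-1)/k) L_{k-2}; time linear in K."""
        L = [np.ones_like(x), np.asarray(x, dtype=float)]
        for k in range(2, K + 1):
            L.append(((2 * k - 1) * x * L[k - 1] - (k - 1) * L[k - 2]) / k)
        return L[:K + 1]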
Problem 4.4 This problem is a detailed version of Exercise 4.2. We set up an experimental framework which the reader may use to study various aspects of overfitting. The input space is X = [-1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H_2 and H_10. The target function is a polynomial of degree Q_f, which we write as f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where L_q(x) are the Legendre polynomials. We use the Legendre polynomials because they are a convenient orthogonal basis for the polynomials on [-1, 1] (see Section 4.2 and Problem 4.3 for some basic information on Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) + σ ε_n and ε_n are iid standard Normal random variates.

For a single experiment, with specified values for Q_f, N, σ, generate a random degree-Q_f target function by selecting coefficients a_q independently from a standard Normal, rescaling them so that E_{a,x}[f^2] = 1. Generate a data set, selecting x_1, ..., x_N independently from P(x) and y_n = f(x_n) + σ ε_n. Let g_2 and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively, with respective out-of-sample errors E_out(g_2) and E_out(g_10).
(a) Why do we normalize f? [Hint: how would you interpret σ?]

(b) How can we obtain g_2, g_10? [Hint: pose the problem as linear regression and use the technology from Chapter 3.]

(c) How can we compute E_out analytically for a given g_10?

(d) Vary Q_f, N, σ and for each combination of parameters, run a large number of experiments, each time computing E_out(g_2) and E_out(g_10). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ) using H_2 and H_10. Let

E_out(H_2) = average over experiments( E_out(g_2) ),
E_out(H_10) = average over experiments( E_out(g_10) ).

Define the overfit measure E_out(H_10) - E_out(H_2). When is the overfit measure significantly positive (i.e., overfitting is serious) as opposed to significantly negative? Try the choices Q_f ∈ {1, 2, ..., 100}, N ∈ {20, 25, ..., 120}, σ^2 ∈ {0, 0.05, 0.1, ..., 2}.

Explain your observations.

(e) Why do we take the average over many experiments? Use the variance to select an acceptable number of experiments to average over.
(f) Repeat this experiment for classification, where the target function is a noisy perceptron, f = sign( Σ_{q=1}^{Q_f} a_q L_q(x) + ε ). Notice that a_0 = 0, and the a_q's should be normalized so that E_{a,x}[ ( Σ_{q=1}^{Q_f} a_q L_q(x) )^2 ] = 1. For classification, the models H_2, H_10 contain the sign of the 2nd and 10th order polynomials respectively. You may use a learning algorithm for non-separable data from Chapter 3.
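A minimal sketch of the data-generating part of this experiment is given below, assuming NumPy's Legendre utilities; normalizing each random target so that E_x[f^2] = 1 (using the orthogonality property of Problem 4.3(e)) and fitting H_Q by least squares on monomial features are illustrative choices.

    import numpy as np
    from numpy.polynomial import legendre

    def generate_target_and_data(Qf, N, sigma, rng=None):
        """One instance of the overfitting experiment: a random degree-Qf
        Legendre target, normalized so that E_x[f^2] = 1, and a noisy data set."""
        rng = rng or np.random.default_rng()
        a = rng.standard_normal(Qf + 1)
        # With x uniform on [-1,1], E_x[L_q L_p] = delta_qp / (2q+1) (Problem 4.3(e)),
        # so E_x[f^2] = sum_q a_q^2 / (2q+1); rescale a accordingly.
        a /= np.sqrt(np.sum(a ** 2 / (2 * np.arange(Qf + 1) + 1)))
        x = rng.uniform(-1, 1, N)
        y = legendre.legval(x, a) + sigma * rng.standard_normal(N)
        return a, x, y

    def fit_HQ(x, y, Q):
        """Best least-squares fit from H_Q: linear regression on 1, x, ..., x^Q."""
        Z = np.vander(x, Q + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return w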
Problem 4.5 If λ < 0 in the augmented error E_aug(w) = E_in(w) + λ w^T w, what soft order constraint does this correspond to? [Hint: λ < 0 encourages large weights.]
Problem 4.6 In the augmented error minimization with Γ = I and λ > 0:

(a) Show that ||w_reg|| ≤ ||w_lin||, justifying the term weight decay. [Hint: start by assuming that ||w_reg|| > ||w_lin|| and derive a contradiction.] In fact a stronger statement holds: ||w_reg|| is decreasing in λ.

(b) Explicitly verify this for linear models. [Hint:

w_reg^T w_reg = u^T (Z^T Z + λI)^{-2} u,

where u = Z^T y and Z is the transformed data matrix. Show that Z^T Z + λI has the same eigenvectors with correspondingly larger eigenvalues as Z^T Z. Expand u in the eigenbasis of Z^T Z. For a matrix A, how are the eigenvectors and eigenvalues of A^{-1} related to those of A?]
Problem 4.7 Show that the in-sample error

E_in(w_reg) = (1/N) y^T (I - H(λ))^2 y

from Example 4.2 is an increasing function of λ, where H(λ) = Z(Z^T Z + λI)^{-1} Z^T and Z is the transformed data matrix.

To do so, let the SVD of Z be Z = U S V^T and let Z^T Z have eigenvalues σ_1^2, ..., σ_d^2. Define the vector a = U^T y. Show that

E_in(w_reg) = E_in(w_lin) + (1/N) Σ_{i=1}^{d} a_i^2 ( 1 - σ_i^2/(σ_i^2 + λ) )^2

and proceed from there.
Problem 4.8 In the augmented error minimization with Γ = I and λ > 0, assume that E_in is differentiable and use gradient descent to minimize E_aug:

w(t+1) ← w(t) - η ∇E_aug(w(t)).

Show that the update rule above is the same as

w(t+1) ← (1 - 2ηλ) w(t) - η ∇E_in(w(t)).

Note: This is the origin of the name 'weight decay': w(t) decays before being updated by the gradient of E_in.
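The claimed update rule is easy to state in code; the sketch below is illustrative only (the function name is ours), and shows where the name 'weight decay' comes from.

    def weight_decay_step(w, grad_Ein, eta, lam):
        """One gradient descent step on E_aug(w) = E_in(w) + lam * w.w:
        the weights first shrink by the factor (1 - 2*eta*lam), then move
        along the negative gradient of E_in."""
        return (1 - 2 * eta * lam) * w - eta * grad_Ein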
Problem 4.9 In Tikhonov regularization, the regularized weights are given by w_reg = (Z^T Z + λ Γ^T Γ)^{-1} Z^T y. The Tikhonov regularizer Γ is a k × (d+1) matrix, each row corresponding to a d+1 dimensional vector. Each row of Z corresponds to a d+1 dimensional vector (the first component is 1). For each row of Γ, construct a virtual example (z_i, 0), for i = 1, ..., k, where z_i is the vector obtained from the ith row of Γ after scaling it by √λ, and the target value is 0. Add these k virtual examples to the data, to construct an augmented data set, and consider non-regularized regression with this augmented data.

(a) Show that, for the augmented data,

Z_aug = [ Z ; √λ Γ ]   (the rows of √λ Γ stacked below Z)   and   y_aug = [ y ; 0 ].

(b) Show that solving the least squares problem with Z_aug and y_aug results in the same regularized weight w_reg, i.e. w_reg = (Z_aug^T Z_aug)^{-1} Z_aug^T y_aug.

This result may be interpreted as follows: an equivalent way to accomplish weight-decay-type regularization with linear models is to create a bunch of virtual examples all of whose target values are zero.
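The equivalence in part (b) can be checked numerically. The sketch below is illustrative (function names and test data are ours), comparing the closed-form Tikhonov solution with ordinary least squares on the augmented data.

    import numpy as np

    def tikhonov_direct(Z, y, Gamma, lam):
        """w_reg = (Z^T Z + lam * Gamma^T Gamma)^{-1} Z^T y."""
        return np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

    def tikhonov_via_virtual_examples(Z, y, Gamma, lam):
        """Plain least squares on the augmented data: stack sqrt(lam)*Gamma
        below Z, with target value 0 for every virtual example."""
        Z_aug = np.vstack([Z, np.sqrt(lam) * Gamma])
        y_aug = np.concatenate([y, np.zeros(Gamma.shape[0])])
        w, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)
        return w

    # Quick numerical check on random data (illustrative only).
    rng = np.random.default_rng(0)
    Z, y, Gamma = rng.standard_normal((50, 4)), rng.standard_normal(50), np.eye(4)
    print(np.allclose(tikhonov_direct(Z, y, Gamma, 2.0),
                      tikhonov_via_virtual_examples(Z, y, Gamma, 2.0)))   # True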
Problem 4.10 In this problem, you will investigate the relationship between the soft order constraint and the augmented error. The regularized weight w_reg is a solution to

min E_in(w) subject to w^T Γ^T Γ w ≤ C.

(a) If w_lin^T Γ^T Γ w_lin ≤ C, then what is w_reg?

(b) If w_lin^T Γ^T Γ w_lin > C, the situation is illustrated below. The constraint is satisfied in the shaded region and the contours of constant E_in are the ellipsoids (why ellipsoids?). What is w_reg^T Γ^T Γ w_reg?

(c) Show that with

λ_C = -(1/(2C)) w_reg^T ∇E_in(w_reg),

w_reg minimizes E_in(w) + λ_C w^T Γ^T Γ w. [Hint: use the previous part to solve for w_reg as an equality constrained optimization problem using the method of Lagrange multipliers.]
(d) Show that the following hold for λ_C:

(i) If w_lin^T Γ^T Γ w_lin ≤ C, then λ_C = 0 (w_lin itself satisfies the constraint).

(ii) If w_lin^T Γ^T Γ w_lin > C, then λ_C > 0 (the penalty term is positive).

(iii) If w_lin^T Γ^T Γ w_lin > C, then λ_C is a strictly decreasing function of C. [Hint: show that dλ_C/dC < 0 for C ∈ [0, w_lin^T Γ^T Γ w_lin].]
Problem 4.11 For the linear model in Exercise 4.2, the target function is a polynomial of degree Q_f; the model is H_Q, with polynomials up to order Q. Assume Q ≥ Q_f. w_lin = (Z^T Z)^{-1} Z^T y, and y = Z w_f + ε, where w_f is the target function and Z is the matrix containing the transformed data.

(a) Show that w_lin = w_f + (Z^T Z)^{-1} Z^T ε. What is the average function ḡ? Show that bias = 0 (recall that bias(x) = (ḡ(x) - f(x))^2).

(b) Show that

var = (σ^2/N) trace( Σ_Φ E_Z[ ((1/N) Z^T Z)^{-1} ] ),

where Σ_Φ = E[Φ(x) Φ^T(x)]. [Hints: var = E[(g^(D) - ḡ)^2]; first take the expectation with respect to ε, then with respect to Φ(x), the test point, and the last remaining expectation will be with respect to Z. You will need the cyclic property of the trace.]

(c) Argue that to first order in 1/N,

var ≈ σ^2 (Q + 1)/N.

[Hint: (1/N) Σ_{n=1}^{N} Φ(x_n) Φ^T(x_n) is the in-sample estimate of Σ_Φ. By the law of large numbers, (1/N) Z^T Z = Σ_Φ + o(1).]

For the well specified linear model, the bias is zero and the variance is increasing as the model gets larger (Q increases), but decreasing in N.
Problem 4.12 Use the setup in Problem 4.11 with Q ≥ Q_f. Consider regression with weight decay using a linear model H_Q in the transformed space with input probability distribution such that E[z z^T] = I. The regularized weights are given by w_reg = (Z^T Z + λI)^{-1} Z^T y, where y = Z w_f + ε.

(a) Show that w_reg = w_f - λ(Z^T Z + λI)^{-1} w_f + (Z^T Z + λI)^{-1} Z^T ε.

(b) Argue that, to first order in 1/N,

var ≈ (σ^2/N) trace( H^2(λ) ),

where H(λ) = Z(Z^T Z + λI)^{-1} Z^T.
If we plot the bias and var, we get a figure that is very similar to Figure 2.3, where the tradeoff was based on fit and complexity rather than bias and var. Here, the bias is increasing in λ (as expected) and in ||w_f||; the variance is decreasing in λ. When λ = 0, trace(H^2(λ)) = Q + 1 and so trace(H^2(λ)) appears to be playing the role of an effective number of parameters.
Problem 4.13 Within the linear regression setting, many attempts have been made to quantify the effective number of parameters in a model. Three possibilities are:

(i) d_eff(λ) = 2 trace(H(λ)) - trace(H^2(λ))
(ii) d_eff(λ) = trace(H(λ))
(iii) d_eff(λ) = trace(H^2(λ))

where H(λ) = Z(Z^T Z + λI)^{-1} Z^T and Z is the transformed data matrix. To obtain d_eff, one must first compute H(λ) as though you are doing regression. One can then heuristically use d_eff in place of d_VC in the VC-bound.

(a) When λ = 0, show that for all three choices, d_eff = d + 1, where d is the dimension in the Z space.

(b) When λ > 0, show that 0 ≤ d_eff ≤ d + 1 and d_eff is decreasing in λ for all three choices. [Hint: Use the singular value decomposition.]
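For experimentation, the three measures are straightforward to compute once H(λ) is formed; the sketch below is an illustrative NumPy helper (the function name is ours).

    import numpy as np

    def effective_dof(Z, lam):
        """Return the three heuristic measures of the effective number of
        parameters from Problem 4.13, computed from H(lam) for weight decay."""
        H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)
        tr_H, tr_H2 = np.trace(H), np.trace(H @ H)
        return 2 * tr_H - tr_H2, tr_H, tr_H2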
Problem 4.14 The observed target values y can be separated into the true target values f and the noise ε, y = f + ε. The components of ε are iid with variance σ^2 and expectation 0. For linear regression with weight decay regularization, by taking the expected value of the in-sample error in (4.2), show that

E_ε[E_in] = (1/N) f^T (I - H(λ))^2 f + (σ^2/N) trace( (I - H(λ))^2 ),

where d_eff = 2 trace(H(λ)) - trace(H^2(λ)), as defined in Problem 4.13(i), H(λ) = Z(Z^T Z + λI)^{-1} Z^T and Z is the transformed data matrix.
(a) If the noise was not overfit, what should the term involving σ^2 be, and why?

(b) Hence, argue that the degree to which the noise has been overfit is σ^2 d_eff / N. Interpret the dependence of this result on the parameters d_eff and N, to justify the use of d_eff as an effective number of parameters.
Problem 4.15 We further investigate d_eff of Problems 4.13 and 4.14. We know that H(λ) = Z(Z^T Z + λ Γ^T Γ)^{-1} Z^T. When Γ is square and invertible, as is usually the case (for example with weight decay, Γ = I), denote Z̃ = Z Γ^{-1}. Let s_0^2, ..., s_d^2 be the eigenvalues of Z̃^T Z̃ (s_i^2 > 0 when Z̃ has full column rank).

(a) For d_eff(λ) = trace(2H(λ) - H^2(λ)), show that

d_eff(λ) = d + 1 - Σ_{i=0}^{d} λ^2/(s_i^2 + λ)^2.

(b) For d_eff(λ) = trace(H(λ)), show that

d_eff(λ) = d + 1 - Σ_{i=0}^{d} λ/(s_i^2 + λ).

(c) For d_eff(λ) = trace(H^2(λ)), show that

d_eff(λ) = Σ_{i=0}^{d} s_i^4/(s_i^2 + λ)^2.

In all cases, for λ > 0, 0 < d_eff(λ) < d + 1, d_eff(0) = d + 1 and d_eff is decreasing in λ. [Hint: use the singular value decomposition Z̃ = U S V^T, where U, V are orthogonal and S is diagonal with entries s_i.]
Problem 4.16 For linear models and the general Tikhonov regularizer Γ with penalty term (λ/N) w^T Γ^T Γ w in the augmented error, show that

w_reg = (Z^T Z + λ Γ^T Γ)^{-1} Z^T y,

where Z is the feature matrix.

(a) Show that the in-sample predictions are

ŷ = H(λ) y,

where H(λ) = Z(Z^T Z + λ Γ^T Γ)^{-1} Z^T.

(b) Simplify this in the case Γ = Z and obtain w_reg in terms of w_lin. This is called uniform weight decay.
Problem 4.17 To model uncertainty in the measurement of the inputs, assume that the true inputs x̂_n are the observed inputs x_n perturbed by some noise ε_n: the true inputs are given by x̂_n = x_n + ε_n. Assume that the ε_n are independent of (x_n, y_n) with covariance matrix E[ε_n ε_n^T] = σ_x^2 I and mean
E[ε_n] = 0. The learning algorithm minimizes the expected in-sample error Ê_in, where the expectation is with respect to the uncertainty in the true x̂_n:

Ê_in(w) = E_{ε_1,...,ε_N}[ (1/N) Σ_{n=1}^{N} (w^T x̂_n - y_n)^2 ].

Show that the weights ŵ_lin which result from minimizing Ê_in are equivalent to the weights which would have been obtained by minimizing E_in = (1/N) Σ_{n=1}^{N} (w^T x_n - y_n)^2 for the observed data, with Tikhonov regularization. What are Γ and λ (see Problem 4.16 for the general Tikhonov regularizer)?

One can interpret this result as follows: regularization enforces a robustness to potential measurement errors (noise) in the observed inputs.
Problem 4.18 In a regression setting, assume the target function is linear, so f(x) = w_f^T x, and y = X w_f + ε, where the entries in ε are iid with zero mean and variance σ^2. Assume a regularization term (λ/N) w^T w and that E[x x^T] = I. In this problem derive the optimal value for λ as follows.

(a) Show that the average function is ḡ(x) = f(x)/(1 + λ/N). What is the bias?

(b) Show that var is asymptotically (σ^2 (d+1)/N) (1 + λ/N)^{-2}. [Hint: Problem 4.12.]

(c) Use the bias and asymptotic variance to obtain an expression for E[E_out]. Optimize this with respect to λ to obtain the optimal regularization parameter. [Answer: λ* = σ^2 (d+1)/||w_f||^2.]

(d) Explain the dependence of the optimal regularization parameter on the parameters of the learning problem. [Hint: write λ* = σ^2 (d+1)/||w_f||^2.]
Problem 4.19 [The Lasso algorithm] Rather than a soft order constraint on the squares of the weights, one could use the absolute values of the weights:

min E_in(w) subject to Σ_{i=0}^{d} |w_i| ≤ C.

The model is called the lasso algorithm.

(a) Formulate and implement this as a quadratic program. Use the experimental design in Problem 4.4 to compare the lasso algorithm with the quadratic penalty by giving plots of E_out versus regularization parameter.

(b) What is the augmented error? Is it more convenient to optimize?

(c) With d = 5 and λ = 3, compare the weights from the lasso versus the quadratic penalty. [Hint: Look at the number of non-zero weights.]
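For part (c), the qualitative difference between the two penalties is easy to see numerically. The sketch below uses scikit-learn's Lasso and Ridge estimators as stand-ins for the L1- and L2-penalized fits (their exact objective scaling differs from the augmented error in the text, and the synthetic data and parameter values are illustrative only).

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    # Sparsity of an L1 penalty versus an L2 penalty on a small synthetic
    # problem with more parameters than data points (illustrative values).
    rng = np.random.default_rng(1)
    X, y = rng.standard_normal((3, 5)), rng.standard_normal(3)
    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=0.1).fit(X, y)
    print("lasso non-zero weights:", np.sum(np.abs(lasso.coef_) > 1e-8))
    print("ridge non-zero weights:", np.sum(np.abs(ridge.coef_) > 1e-8))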
Problem 4.20 In this problem, you will explore a consistency condition for weight decay. Suppose that we make an invertible linear transform of the data:

z_n = A x_n,    ỹ_n = α y_n.

Intuitively, linear regression should not be affected by a linear transform. This means that the new optimal weights should be given by a corresponding linear transform of the old optimal weights.

(a) Suppose w minimizes the in-sample error for the original problem. Show that for the transformed problem, the optimal weights are

w̃ = α (A^T)^{-1} w.

(b) Suppose the regularization penalty term in the augmented error is w^T X^T X w for the original data and w^T Z^T Z w for the transformed data. On the original data, the regularized solution is w_reg(λ). Show that for the transformed problem, the same linear transform of w_reg(λ) gives the corresponding regularized weights for the transformed problem:

w̃_reg(λ) = α (A^T)^{-1} w_reg(λ).
Problem 4.21 The Tikhonov smoothness regularizer penalizes derivatives of h, e.g., Ω(h) = ∫ dx ( d^2 h(x)/dx^2 )^2. For linear models with feature transform z = Φ(x), show that this penalty term is of the form w^T Γ^T Γ w. What is Γ?
Problem 4.22 You have a data set with 100 data points. You have 100 models each with VC-dimension 10. You set aside 25 points for validation. You select the model which produced minimum validation error of 0.25. Give a bound on the out-of-sample error for this selected function.

Suppose you instead trained each model on all the data and selected the function with minimum in-sample error. The resulting in-sample error is 0.15. Give a bound on the out-of-sample error in this case. [Hint: Use the bound in Problem 2.14 to bound the VC-dimension of the union of all the models.]
Problem 4.23 This problem investigates the covariance of the leave-one-out cross validation errors, Cov_D[e_n, e_m]. Assume that for well behaved models, the learning process is 'stable', and so the change in the learned hypothesis should be small, 'O(1/N)', if a new data point is added to a data set of size N. Write g_n^- = g^{(D_{n,m})} + δ_n and g_m^- = g^{(D_{n,m})} + δ_m, where g^{(D_{n,m})} is the learned hypothesis on the data minus the nth and mth data points, and δ_n, δ_m are the corrections after addition of the nth and mth data points respectively.
(a) Show that

Var_D[E_cv] = (1/N^2) Σ_{n=1}^{N} Var_D[e_n] + (1/N^2) Σ_{n≠m} Cov_D[e_n, e_m].

(b) Show that Cov_D[e_n, e_m] = Var_{D_{N-2}}[ E_out(g^{(D_{n,m})}) ] + higher order terms in δ_n, δ_m.

(c) Assume that any terms involving δ_n, δ_m are O(1/N). Argue that

Var_D[E_cv] = (1/N) Var_D[e_1] + Var_D[E_out(g)] + O(1/N).

Does Var_D[e_1] decay to zero with N? What about Var_D[E_out(g)]?

(d) Use the experimental design in Problem 4.4 to study Var_D[E_cv] and give a log-log plot of Var_D[E_cv]/Var_D[e_1] versus N. What is the decay rate?
Problem 4.24 For d = 3, generate a random data set with N points as follows. For each point, each dimension of x has a standard Normal distribution. Similarly, generate a (d+1)-dimensional target weight vector w_f, and set y_n = w_f^T x_n + σ ε_n, where ε_n is noise (also from a standard Normal distribution) and σ^2 is the noise variance; set σ to 0.5.

Use linear regression with weight decay regularization to estimate w_f with w_reg. Set the regularization parameter to 0.05/N.

(a) For N ∈ {d+15, d+25, ..., d+115}, compute the cross validation errors e_1, ..., e_N and E_cv. Repeat the experiment (say) 10^5 times, maintaining the average and variance over the experiments of e_1, e_2 and E_cv.

(b) How should your average of the e_1's relate to the average of the E_cv's; how about to the average of the e_2's? Support your claim using results from your experiment.

(c) What are the contributors to the variance of the e_1's?

(d) If the cross validation errors were truly independent, how should the variance of the e_1's relate to the variance of the E_cv's?

(e) One measure of the effective number of fresh examples used in computing E_cv is the ratio of the variance of the e_1's to that of the E_cv's. Explain why, and plot, versus N, the effective number of fresh examples (N_eff) as a percentage of N. You should find that N_eff is close to N.

(f) If you increase the amount of regularization, will N_eff go up or down? Explain your reasoning. Run the same experiment with λ = 2.5/N and compare your results from part (e) to verify your conjecture.
Problem 4.25 When using a validation set for model selection, all models were learned on the same D_train of size N - K, and validated on the same D_val of size K. We have the VC-bound (see Equation (4.12)):

E_out(g_{m*}) ≤ E_val(g_{m*}) + O( √( ln M / K ) ).
Suppose that instead, you had no control over the validation process. So M learners, each with their own models, present you with the results of their validation processes on different validation sets. Here is what you know about each learner:

Each learner m reports to you the size of their validation set K_m, and the validation error E_val(m). The learners may have used different data sets, except that they faithfully learned on a training set and validated on a held-out validation set which was only used for validation purposes.

As the model selector, you have to decide which learner to go with.

(a) Should you select the learner with minimum validation error? If yes, why? If no, why not? [Hint: think VC-bound.]

(b) If all models are validated on the same validation set as described in the text, why is it okay to select the learner with the lowest validation error?

(c) After selecting learner m* (say), show that

P[ E_out(m*) > E_val(m*) + ε ] ≤ M e^{-2ε^2 κ(ε)},

where κ(ε) = -(1/(2ε^2)) ln( (1/M) Σ_{m=1}^{M} e^{-2ε^2 K_m} ) is an "average" validation set size.

(d) Show that with probability at least 1 - δ, E_out ≤ E_val + ε*, for any ε* which satisfies ε* ≥ √( ln(M/δ) / (2κ(ε*)) ).

(e) Show that min_m K_m ≤ κ(ε) ≤ (1/M) Σ_{m=1}^{M} K_m. Is this bound better or worse than the bound when all models use the same validation set size (equal to the average validation set size (1/M) Σ_{m=1}^{M} K_m)?
Problem 4.26 In this problem, derive the formula for the exact expression for the leave-one-out cross validation error for linear regression. Let Z be the data matrix whose rows correspond to the transformed data points z_n = Φ(x_n).

(a) Show that:

Z^T Z = Σ_{n=1}^{N} z_n z_n^T;    Z^T y = Σ_{n=1}^{N} z_n y_n;    H_nn(λ) = z_n^T A(λ)^{-1} z_n,

where A = A(λ) = Z^T Z + λ Γ^T Γ and H(λ) = Z A(λ)^{-1} Z^T. Hence, show that when (z_n, y_n) is left out, Z^T Z → Z^T Z - z_n z_n^T, and Z^T y → Z^T y - z_n y_n.

(b) Compute w_n^-, the weight vector learned when the nth data point is left out, and show that:

w_n^- = ( A - z_n z_n^T )^{-1} ( Z^T y - z_n y_n ).
[Hint: use the identity (A - x x^T)^{-1} = A^{-1} + (A^{-1} x x^T A^{-1})/(1 - x^T A^{-1} x).]

(c) Using (a) and (b), show that

w_n^- = w + ((z_n^T w - y_n)/(1 - H_nn)) A^{-1} z_n,

where w is the regression weight vector using all the data.

(d) The prediction on the validation point is given by z_n^T w_n^-. Show that

z_n^T w_n^- = (ŷ_n - y_n H_nn) / (1 - H_nn).

(e) Show that

e_n = ( (ŷ_n - y_n) / (1 - H_nn) )^2,

and hence prove Equation (4.13).
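Equation (4.13) makes leave-one-out cross validation cheap for linear models: one fit on all the data suffices. The sketch below is an illustrative NumPy implementation (the function name and the default Γ = I are ours).

    import numpy as np

    def loocv_error(Z, y, lam, Gamma=None):
        """Leave-one-out cross validation error for regularized linear regression,
        computed from a single fit via e_n = ((yhat_n - y_n) / (1 - H_nn))^2."""
        if Gamma is None:
            Gamma = np.eye(Z.shape[1])                  # weight decay
        A_inv = np.linalg.inv(Z.T @ Z + lam * Gamma.T @ Gamma)
        H = Z @ A_inv @ Z.T
        y_hat = H @ y
        h = np.diag(H)
        return np.mean(((y_hat - y) / (1 - h)) ** 2)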
Problem 4.27 Cross validation gives an accurate estimate of Ē_out(N - 1), but it can be quite sensitive, leading to problems in model selection. A common heuristic for 'regularizing' cross validation is to use a measure of error σ_cv(H) for the cross validation estimate in model selection.

(a) One choice for σ_cv is the standard deviation of the leave-one-out errors divided by √N, σ_cv ≈ (1/√N) √( var(e_1, ..., e_N) ). Why divide by √N?

(b) For linear models, show that

√N σ_cv = √( (1/N) Σ_{n=1}^{N} ( (ŷ_n - y_n)/(1 - H_nn) )^4 - E_cv^2 ).

(c) (i) Given the best model H*, the conservative one-sigma approach selects the simplest model within σ_cv(H*) of the best.

(ii) The bound minimizing approach selects the model which minimizes E_cv(H) + σ_cv(H).

Use the experimental design in Problem 4.4 to compare these approaches with the 'unregularized' cross validation estimate as follows. Fix Q_f = 15, Q = 20, and σ = 1. Use each of the two methods proposed here as well as traditional cross validation to select the optimal value of the regularization parameter λ in the range {0.05, 0.10, 0.15, ..., 5} using weight decay regularization, Ω(w) = (λ/N) w^T w. Plot the resulting out-of-sample error for the model selected using each method as a function of N, with N in the range {2×Q, 3×Q, ..., 10×Q}.

What are your conclusions?
Chapter 5
Three Learning Principles
The study of learning from data highlights some general principles that are fascinating concepts in their own right. Having gone through the mathematical analysis and empirical illustrations of the first few chapters, we have a good foundation from which to articulate some of these principles and explain them in concrete terms.

In this chapter, we will discuss three principles. The first one is related to the choice of model and is called Occam's razor. The other two are related to data; sampling bias establishes an important principle about obtaining the data, and data snooping establishes an important principle about handling the data. A genuine understanding of these principles will protect you from the most common pitfalls in learning from data, and allow you to interpret generalization performance properly.
5.1 Occam's Razor
Although it is not an exact quote of Einstein's, it is often attributed to him that "An explanation of the data should be made as simple as possible, but no simpler." A similar principle, Occam's Razor, dates from the 14th century and is attributed to William of Occam, where the 'razor' is meant to trim down the explanation to the bare minimum that is consistent with the data.

In the context of learning, the penalty for model complexity which was introduced in Section 2.2 is a manifestation of Occam's razor. If E_in(g) = 0, then the explanation (hypothesis) is consistent with the data. In this case, the most plausible explanation, with the lowest estimate of E_out given in the VC bound (2.14), happens when the complexity of the explanation (measured by d_VC(H)) is as small as possible. Here is a statement of the underlying principle.

The simplest model that fits the data is also the most plausible.
Applying this principle, we should choose as simple a model as we think we can get away with. Although the principle that simpler is better may be intuitive, it is neither precise nor self-evident. When we apply the principle to learning from data, there are two basic questions to be asked.

1. What does it mean for a model to be simple?

2. How do we know that simpler is better?

Let's start with the first question. There are two distinct approaches to defining the notion of complexity, one based on a family of objects and the other based on an individual object. We have already seen both approaches in our analysis. The VC dimension in Chapter 2 is a measure of complexity, and it is based on the hypothesis set H as a whole, i.e., based on a family of objects. The regularization term of the augmented error in Chapter 4 is also a measure of complexity, but in this case it is the complexity of an individual object, namely the hypothesis h.
The two approaches to defining complexity are not encountered only in learning from data; they are a recurring theme whenever complexity is discussed. For instance, in information theory, entropy is a measure of complexity based on a family of objects, while minimum description length is a related measure based on individual objects. There is a reason why this is a recurring theme. The two approaches to defining complexity are in fact related.

When we say a family of objects is complex, we mean that the family is 'big'. That is, it contains a large variety of objects. Therefore, each individual object in the family is one of many. By contrast, a simple family of objects is 'small'; it has relatively few objects, and each individual object is one of few.

Why is the sheer number of objects an indication of the level of complexity? The reason is that both the number of objects in a family and the complexity of an object are related to how many parameters are needed to specify the object. When you increase the number of parameters in a learning model, you simultaneously increase how diverse H is and how complex the individual h is. For example, consider 17th order polynomials versus 3rd order polynomials. There is more variety in 17th order polynomials, and at the same time the individual 17th order polynomial is more complex than a 3rd order polynomial.

The most common definitions of object complexity are based on the number of bits needed to describe an object. Under such definitions, an object is simple if it has a short description. Therefore, a simple object is not only intrinsically simple (as it can be described succinctly), but it also has to be one of few, since there are fewer objects that have short descriptions than there are that have long descriptions, as a matter of simple counting.
Exercise 5.1

Consider hypothesis sets H_1 and H_100 that contain Boolean functions on 10 Boolean variables, so X = {-1, +1}^10. H_1 contains all Boolean functions
which evaluate to +1 on exactly one input point, and to -1 elsewhere; H_100 contains all Boolean functions which evaluate to +1 on exactly 100 input points, and to -1 elsewhere.

(a) How big (number of hypotheses) are H_1 and H_100?

(b) How many bits are needed to specify one of the hypotheses in H_1?

(c) How many bits are needed to specify one of the hypotheses in H_100?
We now address the second question. When Occam's razor says that simpler is better, it doesn't mean simpler is more elegant. It means simpler has a better chance of being right. Occam's razor is about performance, not about aesthetics. If a complex explanation of the data performs better, we will take it.

The argument that simpler has a better chance of being right goes as follows. We are trying to fit a hypothesis to our data D = {(x_1, y_1), ..., (x_N, y_N)} (assume the y_n's are binary). There are fewer simple hypotheses than there are complex ones. With complex hypotheses, there would be enough of them to shatter x_1, ..., x_N, so it is certain that we can fit the data set regardless of what the labels y_1, ..., y_N are, even if these are completely random. Therefore, fitting the data does not mean much. If, instead, we have a simple model with few hypotheses and we still found one that perfectly fits the dichotomy D = {(x_1, y_1), ..., (x_N, y_N)}, this is surprising, and therefore it means something.

Occam's Razor has been formally proved under different sets of idealized conditions. The above argument captures the essence of these proofs; if something is less likely to happen, then when it does happen it is more significant. Let us look at an example.
Example 5.1. Suppose that one constructs a physical theory about the resistivity of a metal under various temperatures. In this theory, aside from some constants that need to be determined, the resistivity ρ has a linear dependence on the temperature T. In order to verify that the theory is correct and to obtain the unknown constants, 3 scientists conduct the following three experiments and present their data to you.
[Figure: resistivity ρ versus temperature T data reported by Scientist 1, Scientist 2, and Scientist 3.]
It is clear that Scientist 3 has produced the most convincing evidence for the theory. If the measurements are exact, then Scientist 2 has managed to falsify the theory and we are back to the drawing board. What about Scientist 1? While he has not falsified the theory, has he provided any evidence for it? The answer is no, for we can reverse the question. Suppose that the theory was not correct, what could the data have done to prove him wrong? Nothing, since any two points can be joined by a line. Therefore, the model is not just likely to fit the data in this case, it is certain to do so. This renders the fit totally insignificant when it does happen. □

This example illustrates a concept related to Occam's Razor, which is the axiom of non-falsifiability. The axiom asserts that the data should have some chance of falsifying a hypothesis, if we are to conclude that it can provide evidence for the hypothesis. One way to guarantee that every data set has some chance at falsification is for the VC dimension of the hypothesis set to be less than N, the number of data points. This is discussed further in Problem 5.1. Here is another example of the same concept.
Example 5.2. Financial firms try to pick good traders (predictors of whether the market will go up or not). Suppose that each trader is tested on their prediction (up or down) over the next 5 days and those who perform well will be hired. One might think that this process should produce better and better traders on Wall Street. Viewed as a learning problem, consider each trader to be a prediction hypothesis. Suppose that the hiring pool is 'complex'; we are interviewing 2^5 traders who happen to be a diverse set of people such that their predictions over the next 5 days are all different. Necessarily one of these traders gets it all correct, and will be hired. Hiring the trader through this process may or may not be a good thing, since the process will pick someone even if the traders are just flipping coins to make their predictions. A perfect predictor always exists in this group, so finding one doesn't mean much. If we were interviewing only two traders, and one of them made perfect predictions, that would mean something. □
Exercise 5.2

Suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night football game. You keenly watch each Monday and to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week's prediction, a payment of $50.00 is required. Should you pay?

(a) How many possible predictions of win-lose are there for 5 games?

(b) If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, how many people should he target to begin with?
(c) After the first letter 'predicting' the outcome of the first game, how many of the original recipients does he target with the second letter?

(d) How many letters altogether will have been sent at the end of the 5 weeks?

(e) If the cost of printing and mailing out each letter is $0.50, how much would the sender make if the recipient of 5 correct predictions sent in the $50.00?

(f) Can you relate this situation to the growth function and the credibility of fitting the data?
Learning from data takes Occam's Razor to another level, going beyond "as simple as possible, but no simpler." Indeed, we may opt for 'a simpler fit than possible', namely an imperfect fit of the data using a simple model over a perfect fit using a more complex one. The reason is that the price we pay for a perfect fit in terms of the penalty for model complexity in (2.14) may be too much in comparison to the benefit of the better fit. This idea was illustrated in Figure 3.7, and is a manifestation of overfitting. The idea is also the rationale behind the recommended policy in Chapter 3: first try a linear model - one of the simplest models in the arena of learning from data.
5.2 Sampling Bias
A vivid example of sampling bias happened in the 1948 US presidential election between Truman and Dewey. On election night, a major newspaper carried out a telephone poll to ask people how they voted. The poll indicated that Dewey won, and the paper was so confident about the small error bar in its poll that it declared Dewey the winner in its headline. When the actual votes were counted, Dewey lost - to the delight of a smiling Truman.
[Photograph © Associated Press.]
This was not a case of statistical anomaly, where the newspaper was just incredibly unlucky (remember the δ in the VC bound?). It was a case where the sample was doomed from the get-go, regardless of its size. Even if the experiment were repeated, the result would be the same. In 1948, telephones were expensive and those who had them tended to be in an elite group that favored Dewey much more than the average voter did. Since the newspaper did its poll by telephone, it inadvertently used an in-sample distribution that was different from the out-of-sample distribution. That is what sampling bias is.
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
Applying this principle, we should make sure that the training and testing distributions are the same; if not, our results may be invalid, or, at the very least, require careful interpretation.

If you recall, the VC analysis made very few assumptions, but one assumption it did make was that the data set D is generated from the same distribution that the final hypothesis g is tested on. In practice, we may encounter data sets that were not generated under those ideal conditions. There are some techniques in statistics and in learning to compensate for the 'mismatch' between training and testing, but not in cases where D was generated with the exclusion of certain parts of the input space, such as the exclusion of households with no telephones in the above example. There is nothing that can be done when this happens, other than to admit that the result will not be reliable - statistical bounds like Hoeffding and VC require a match between the training and testing distributions.
There are many examples of how sampling bias can be introduced in data collection. In some cases it is inadvertently introduced by an oversight, as in the case of Dewey and Truman. In other cases, it is introduced because certain types of data are not available. For instance, in our credit example of Chapter 1, the bank created the training set from the database of previous customers and how they performed for the bank. Such a set necessarily excludes those who applied to the bank for credit cards and were rejected, because the bank does not have data on how they would have performed if they were accepted. Since future applicants will come from a mixed population including