Learning From Data

AMLbook.com

The book website contains supporting material for instructors and readers.
In particular, Dynamic e-Chapters covering advanced topics.

Learning From Data
A Short Course

Yaser S. Abu-Mostafa
California Institute of Technology

Malik Magdon-Ismail
Rensselaer Polytechnic Institute

Hsuan-Tien Lin
National Taiwan University

AMLbook.com
Yaser S. Abu-Mostafa
Departments of Electrical Engineering and Computer Science
California Institute of Technology
Pasadena, CA 91125, USA
yaser@caltech.edu

Malik Magdon-Ismail
Department of Computer Science
Rensselaer Polytechnic Institute
Troy, NY 12180, USA
magdon@cs.rpi.edu

Hsuan-Tien Lin
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, 106, Taiwan
htlin@csie.ntu.edu.tw
ISBN 10: 1-60049-006-9
ISBN 13: 978-1-60049-006-4
©2012 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin. 1.20

All rights reserved. This work may not be translated or copied in whole or in part
without the written permission of the authors. No part of this publication may be
reproduced, stored in a retrieval system, or transmitted in any form or by any
means—electronic, mechanical, photocopying, scanning, or otherwise—without prior
written permission of the authors, except as permitted under Section 107 or 108 of
the 1976 United States Copyright Act.

Limit of Liability/Disclaimer of Warranty: While the authors have used their best
efforts in preparing this book, they make no representation or warranties with
respect to the accuracy or completeness of the contents of this book and specifically
disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your
situation. You should consult with a professional where appropriate. The authors
shall not be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
The use in this publication of tradenames, trademarks, service marks, and similar
terms, even if they are not identified as such, is not to be taken as an expression of
opinion as to whether or not they are subject to proprietary rights.
This book was typeset by the authors and was printed and bound in the United
States of America.
To our teachers, and to our students
Preface
This book is designed for a short course on machine learning. It is a short
course, not a hurried course. From over a decade of teaching this material, we
have distilled what we believe to be the core topics that every student of the
subject should know. We chose the title 'learning from data' that faithfully
describes what the subject is about, and made it a point to cover the topics in
a story-like fashion. Our hope is that the reader can learn all the fundamentals
of the subject by reading the book cover to cover.
Learning from data has distinct theoretical and practical tracks. If you
read two books that focus on one track or the other, you may feel that you
are reading about two different subjects altogether. In this book, we balance
the theoretical and the practical, the mathematical and the heuristic. Our
criterion for inclusion is relevance. Theory that establishes the conceptual
framework for learning is included, and so are heuristics that impact the
performance of real learning systems. Strengths and weaknesses of the different
parts are spelled out. Our philosophy is to say it like it is: what we know,
what we don't know, and what we partially know.
The book can be taught in exactly the order it is presented. The notable
exception may be Chapter 2, which is the most theoretical chapter of the book.
The theory of generalization that this chapter covers is central to learning
from data, and we made an effort to make it accessible to a wide readership.
However, instructors who are more interested in the practical side may skim
over it, or delay it until after the practical methods of Chapter 3 are taught.
You will notice that we included exercises (in gray boxes) throughout the
text. The main purpose of these exercises is to engage the reader and enhance
understanding of a particular topic being covered. Our reason for separating
the exercises out is that they are not crucial to the logical flow. Nevertheless,
they contain useful information, and we strongly encourage you to read them,
even if you don't do them to completion. Instructors may find some of the
exercises appropriate as 'easy' homework problems, and we also provide
additional problems of varying difficulty in the Problems section at the end of
each chapter.
To help instructors with preparing their lectures based on the book, we
provide supporting material on the book's website (AMLbook.com). There is
also a forum that covers additional topics in learning from data. We will
discuss these further in the Epilogue of this book.
Acknowledgment (in alphabetical order for each group): We would like to
express our gratitude to the alumni of our Learning Systems Group at Caltech
who gave us detailed expert feedback: Zehra Cataltepe, Ling Li, Amrit Pratap,
and Joseph Sill. We thank the many students and colleagues who gave us useful
feedback during the development of this book, especially Chun-Wei Liu. The
Caltech Library staff, especially Kristin Buxton and David McCaslin, have
given us excellent advice and help in our self-publishing effort. We also thank
Lucinda Acosta for her help throughout the writing of this book.

Last, but not least, we would like to thank our families for their encouragement,
their support, and most of all their patience as they endured the time
demands that writing a book has imposed on us.
Yaser S. Abu-Mostafa, Pasadena, California.
Malik Magdon-Ismail, Troy, New York.
Hsuan-Tien Lin, Taipei, Taiwan.
March, 2012.
Contents

Preface                                                            vii

1  The Learning Problem                                              1
   1.1  Problem Setup                                                1
        1.1.1  Components of Learning                                3
        1.1.2  A Simple Learning Model                               5
        1.1.3  Learning versus Design                                9
   1.2  Types of Learning                                           11
        1.2.1  Supervised Learning                                  11
        1.2.2  Reinforcement Learning                               12
        1.2.3  Unsupervised Learning                                13
        1.2.4  Other Views of Learning                              14
   1.3  Is Learning Feasible?                                       15
        1.3.1  Outside the Data Set                                 16
        1.3.2  Probability to the Rescue                            18
        1.3.3  Feasibility of Learning                              24
   1.4  Error and Noise                                             27
        1.4.1  Error Measures                                       28
        1.4.2  Noisy Targets                                        30
   1.5  Problems                                                    33

2  Training versus Testing                                          39
   2.1  Theory of Generalization                                    39
        2.1.1  Effective Number of Hypotheses                       41
        2.1.2  Bounding the Growth Function                         46
        2.1.3  The VC Dimension                                     50
        2.1.4  The VC Generalization Bound                          53
   2.2  Interpreting the Generalization Bound                       55
        2.2.1  Sample Complexity                                    57
        2.2.2  Penalty for Model Complexity                         58
        2.2.3  The Test Set                                         59
        2.2.4  Other Target Types                                   61
   2.3  Approximation-Generalization Tradeoff                       62
        2.3.1  Bias and Variance                                    62
        2.3.2  The Learning Curve                                   66
   2.4  Problems                                                    69

3  The Linear Model                                                 77
   3.1  Linear Classification                                       77
        3.1.1  Non-Separable Data                                   79
   3.2  Linear Regression                                           82
        3.2.1  The Algorithm                                        84
        3.2.2  Generalization Issues                                87
   3.3  Logistic Regression                                         88
        3.3.1  Predicting a Probability                             89
        3.3.2  Gradient Descent                                     93
   3.4  Nonlinear Transformation                                    99
        3.4.1  The Z Space                                          99
        3.4.2  Computation and Generalization                      104
   3.5  Problems                                                   109

4  Overfitting                                                     119
   4.1  When Does Overfitting Occur?                               119
        4.1.1  A Case Study: Overfitting with Polynomials          120
        4.1.2  Catalysts for Overfitting                           123
   4.2  Regularization                                             126
        4.2.1  A Soft Order Constraint                             128
        4.2.2  Weight Decay and Augmented Error                    132
        4.2.3  Choosing a Regularizer: Pill or Poison?             134
   4.3  Validation                                                 137
        4.3.1  The Validation Set                                  138
        4.3.2  Model Selection                                     141
        4.3.3  Cross Validation                                    145
        4.3.4  Theory Versus Practice                              151
   4.4  Problems                                                   154

5  Three Learning Principles                                       167
   5.1  Occam's Razor                                              167
   5.2  Sampling Bias                                              171
   5.3  Data Snooping                                              173
   5.4  Problems                                                   178

Epilogue                                                           181

Further Reading                                                    183

Appendix  Proof of the VC Bound                                    187
   A.1  Relating Generalization Error to In-Sample Deviations      188
   A.2  Bounding Worst Case Deviation Using the Growth Function    190
   A.3  Bounding the Deviation between In-Sample Errors            191

Notation                                                           193

Index                                                              197
Notation

A complete table of the notation used in this book is included on page 193,
right before the index of terms. We suggest referring to it as needed.
Chapter 1
The Learning Problem
If you show a picture to a three-year-old and ask if there is a tree in it, you will
likely get the correct answer. If you ask a thirty-year-old what the definition
of a tree is, you will likely get an inconclusive answer. We didn't learn what
a tree is by studying the mathematical definition of trees. We learned it by
looking at trees. In other words, we learned from 'data'.

Learning from data is used in situations where we don't have an analytic
solution, but we do have data that we can use to construct an empirical
solution. This premise covers a lot of territory, and indeed learning from data is
one of the most widely used techniques in science, engineering, and economics,
among other fields.

In this chapter, we present examples of learning from data and formalize
the learning problem. We also discuss the main concepts associated with
learning, and the different paradigms of learning that have been developed.
1.1 Problem Setup
What do financial forecasting, medical diagnosis, computer vision, and search
engines have in common? They all have successfully utilized learning from
data. The repertoire of such applications is quite impressive. Let us open the
discussion with a real-life application to see how learning from data works.

Consider the problem of predicting how a movie viewer would rate the
various movies out there. This is an important problem if you are a company
that rents out movies, since you want to recommend to different viewers the
movies they will like. Good recommender systems are so important to business
that the movie rental company Netflix offered a prize of one million dollars to
anyone who could improve their recommendations by a mere 10%.

The main difficulty in this problem is that the criteria that viewers use to
rate movies are quite complex. Trying to model those explicitly is no easy task,
so it may not be possible to come up with an analytic solution. However, we
know that the historical rating data reveal a lot about how people rate movies,
so we may be able to construct a good empirical solution. There is a great
deal of data available to movie rental companies, since they often ask their
viewers to rate the movies that they have already seen.

[Figure 1.1: A model for how a viewer rates a movie. Viewer factors and movie
factors are matched, and the contributions from each factor are added to produce
the predicted rating.]
Figure 1.1 illustrates a specific approach that was widely used in the
million-dollar competition. Here is how it works. You describe a movie as
a long array of different factors, e.g., how much comedy is in it, how
complicated is the plot, how handsome is the lead actor, etc. Now, you describe
each viewer with corresponding factors: how much do they like comedy, do
they prefer simple or complicated plots, how important are the looks of the
lead actor, and so on. How this viewer will rate that movie is now estimated
based on the match/mismatch of these factors. For example, if the movie is
pure comedy and the viewer hates comedies, the chances are he won't like it.
If you take dozens of these factors describing many facets of a movie's content
and a viewer's taste, the conclusion based on matching all the factors will be
a good predictor of how the viewer will rate the movie.

The power of learning from data is that this entire process can be automated,
without any need for analyzing movie content or viewer taste. To do
so, the learning algorithm 'reverse-engineers' these factors based solely on previous
ratings. It starts with random factors, then tunes these factors to make
them more and more aligned with how viewers have rated movies before, until
they are ultimately able to predict how viewers rate movies in general. The
factors we end up with may not be as intuitive as 'comedy content', and in
fact can be quite subtle or even incomprehensible. After all, the algorithm is
only trying to find the best way to predict how a viewer would rate a movie,
not necessarily explain to us how it is done. This algorithm was part of the
winning solution in the million-dollar competition.
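As a rough illustration of the factor-matching idea (not the actual method used
in the competition), a predicted rating can be computed by adding up per-factor
contributions, i.e., an inner product between viewer and movie factor vectors.
The factor names and numbers below are hypothetical, chosen only for illustration:

    import numpy as np

    # Hypothetical factors: [comedy content, plot complexity, lead actor's looks]
    viewer = np.array([0.9, 0.2, 0.4])   # how much this viewer cares about each factor
    movie = np.array([0.8, 0.1, 0.6])    # how much of each factor this movie has

    # Add the contributions from each factor to get the predicted rating (arbitrary scale)
    predicted_rating = float(np.dot(viewer, movie))
    print(predicted_rating)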
1.1.1 Components of Learning

The movie rating application captures the essence of learning from data, and
so do many other applications from vastly different fields. In order to abstract
the common core of the learning problem, we will pick one application and
use it as a metaphor for the different components of the problem. Let us take
credit approval as our metaphor.

Suppose that a bank receives thousands of credit card applications every
day, and it wants to automate the process of evaluating them. Just as in the
case of movie ratings, the bank knows of no magical formula that can pinpoint
when credit should be approved, but it has a lot of data. This calls for learning
from data, so the bank uses historical records of previous customers to figure
out a good formula for credit approval.

Each customer record has personal information related to credit, such as
annual salary, years in residence, outstanding loans, etc. The record also keeps
track of whether approving credit for that customer was a good idea, i.e., did
the bank make money on that customer. This data guides the construction of
a successful formula for credit approval that can be used on future applicants.
Let us give names and symbols to the main components of this learning
problem. There is the input x (customer information that is used to make
a credit decision), the unknown target function f: X → Y (ideal formula for
credit approval), where X is the input space (set of all possible inputs x), and Y
is the output space (set of all possible outputs, in this case just a yes/no
decision). There is a data set D of input-output examples (x_1, y_1), · · · , (x_N, y_N),
where y_n = f(x_n) for n = 1, . . . , N (inputs corresponding to previous customers
and the correct credit decision for them in hindsight). The examples are often
referred to as data points. Finally, there is the learning algorithm that uses the
data set D to pick a formula g: X → Y that approximates f. The algorithm
chooses g from a set of candidate formulas under consideration, which we call
the hypothesis set H. For instance, H could be the set of all linear formulas
from which the algorithm would choose the best linear fit to the data, as we
will introduce later in this section.

When a new customer applies for credit, the bank will base its decision
on g (the hypothesis that the learning algorithm produced), not on f (the
ideal target function which remains unknown). The decision will be good only
to the extent that g faithfully replicates f. To achieve that, the algorithm
chooses g that best matches f on the training examples of previous customers,
with the hope that it will continue to match f on new customers. Whether
or not this hope is justified remains to be seen. Figure 1.2 illustrates the
components of the learning problem.

[Figure 1.2: Basic setup of the learning problem. The unknown target function
f: X → Y (ideal credit approval formula) generates the training examples
(x_1, y_1), (x_2, y_2), · · · , (x_N, y_N) (historical records of credit customers);
the learning algorithm searches the hypothesis set H (set of candidate formulas)
and produces the final hypothesis g ≈ f (learned credit approval formula).]
Exercise 1.1

Express each of the following tasks in the framework of learning from data by
specifying the input space X, output space Y, target function f: X → Y,
and the specifics of the data set that we will learn from.

(a) Medical diagnosis: A patient walks in with a medical history and some
    symptoms, and you want to identify the problem.
(b) Handwritten digit recognition (for example postal zip code recognition
    for mail sorting).
(c) Determining if an email is spam or not.
(d) Predicting how an electric load varies with price, temperature, and
    day of the week.
(e) A problem of interest to you for which there is no analytic solution,
    but you have data from which to construct an empirical solution.
We will use the setup in Figure 1.2 as our definition of the learning problem.
Later on, we will consider a number of refinements and variations to this basic
setup as needed. However, the essence of the problem will remain the same.
There is a target to be learned. It is unknown to us. We have a set of examples
generated by the target. The learning algorithm uses these examples to look
for a hypothesis that approximates the target.
1.1.2 A Simple Learning Model

Let us consider the different components of Figure 1.2. Given a specific learning
problem, the target function and training examples are dictated by the
problem. However, the learning algorithm and hypothesis set are not. These
are solution tools that we get to choose. The hypothesis set and learning
algorithm are referred to informally as the learning model.

Here is a simple model. Let X = R^d be the input space, where R^d is the
d-dimensional Euclidean space, and let Y = {+1, -1} be the output space,
denoting a binary (yes/no) decision. In our credit example, different coordinates
of the input vector x ∈ R^d correspond to salary, years in residence,
outstanding debt, and the other data fields in a credit application. The binary
output y corresponds to approving or denying credit. We specify the
hypothesis set H through a functional form that all the hypotheses h ∈ H
share. The functional form h(x) that we choose here gives different weights to
the different coordinates of x, reflecting their relative importance in the credit
decision. The weighted coordinates are then combined to form a 'credit score'
and the result is compared to a threshold value. If the applicant passes the
threshold, credit is approved; if not, credit is denied:
    Approve credit if  Σ_{i=1}^{d} w_i x_i > threshold,
    Deny credit if     Σ_{i=1}^{d} w_i x_i < threshold.

This formula can be written more compactly as

    h(x) = sign( ( Σ_{i=1}^{d} w_i x_i ) + b ),     (1.1)

where x_1, · · · , x_d are the components of the vector x; h(x) = +1 means 'approve
credit' and h(x) = -1 means 'deny credit'; sign(s) = +1 if s > 0 and
sign(s) = -1 if s < 0.¹ The weights are w_1, · · · , w_d, and the threshold is
determined by the bias term b since in Equation (1.1), credit is approved if
Σ_{i=1}^{d} w_i x_i > -b.
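As a minimal sketch, Equation (1.1) can be evaluated in Python as below; the
applicant fields, weights, and bias are hypothetical values chosen only for
illustration, not taken from the book:

    import numpy as np

    def perceptron_hypothesis(x, w, b):
        """Evaluate h(x) = sign(w . x + b): +1 approves credit, -1 denies it."""
        score = np.dot(w, x) + b
        return 1 if score > 0 else -1   # we ignore the sign(0) technicality here

    # Hypothetical applicant: [annual salary in k$, years in residence, outstanding debt in k$]
    x = np.array([65.0, 4.0, 12.0])
    # Hypothetical weights: salary and residence help, outstanding debt hurts
    w = np.array([0.8, 1.5, -2.0])
    b = -30.0   # credit is approved if the weighted sum exceeds the threshold -b = 30

    print(perceptron_hypothesis(x, w, b))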
This model of H is called the perceptron, a name that it got in the context
of artificial intelligence.¹ The learning algorithm will search H by looking for
weights and bias that perform well on the data set. Some of the weights
w_1, · · · , w_d may end up being negative, corresponding to an adverse effect on
credit approval. For instance, the weight of the 'outstanding debt' field should
come out negative since more debt is not good for credit. The bias value b
may end up being large or small, reflecting how lenient or stringent the bank
should be in extending credit. The optimal choices of weights and bias define
the final hypothesis g ∈ H that the algorithm produces.

¹The value of sign(s) when s = 0 is a simple technicality that we ignore for the moment.

[Figure 1.3: Perceptron classification of linearly separable data in a two-dimensional
input space. (a) Some training examples will be misclassified (blue points in red
region and vice versa) for certain values of the weight parameters which define the
separating line. (b) A final hypothesis that classifies all training examples correctly.
(○ is +1 and × is -1.)]
Exercise 1.2

Suppose that we use a perceptron to detect spam messages. Let's say
that each email message is represented by the frequency of occurrence of
keywords, and the output is +1 if the message is considered spam.

(a) Can you think of some keywords that will end up with a large positive
    weight in the perceptron?
(b) How about keywords that will get a negative weight?
(c) What parameter in the perceptron directly affects how many borderline
    messages end up being classified as spam?
Figure 1.3 illustrates what a perceptron does in a two-dimensional case (d = 2).
The plane is split by a line into two regions, the +1 decision region and the -1
decision region. Different values for the parameters w_1, w_2, b correspond to
different lines w_1 x_1 + w_2 x_2 + b = 0. If the data set is linearly separable, there
will be a choice for these parameters that classifies all the training examples
correctly.
To simplify the notation of the perceptron formula, we will treat the bias b
as a weight w_0 = b and merge it with the other weights into one vector
w = [w_0, w_1, · · · , w_d]^T, where ^T denotes the transpose of a vector, so w is a
column vector. We also treat x as a column vector and modify it to become x =
[x_0, x_1, · · · , x_d]^T, where the added coordinate x_0 is fixed at x_0 = 1. Formally
speaking, the input space is now

    X = {1} × R^d = { [x_0, x_1, · · · , x_d]^T | x_0 = 1, x_1 ∈ R, · · · , x_d ∈ R }.

With this convention, w^T x = Σ_{i=0}^{d} w_i x_i, and Equation (1.1) can be rewritten
in vector form as

    h(x) = sign(w^T x).     (1.2)

We now introduce the perceptron learning algorithm (PLA). The algorithm
will determine what w should be, based on the data. Let us assume that the
data set is linearly separable, which means that there is a vector w that
makes (1.2) achieve the correct decision h(x_n) = y_n on all the training examples,
as shown in Figure 1.3.

Our learning algorithm will find this w using a simple iterative method.
Here is how it works. At iteration t, where t = 0, 1, 2, . . ., there is a current
value of the weight vector, call it w(t). The algorithm picks an example
from (x_1, y_1), · · · , (x_N, y_N) that is currently misclassified, call it (x(t), y(t)), and
uses it to update w(t). Since the example is misclassified, we have y(t) ≠
sign(w^T(t) x(t)). The update rule is

    w(t + 1) = w(t) + y(t) x(t).     (1.3)

This rule moves the boundary in the direction of classifying x(t) correctly, as
depicted in the figure above. The algorithm continues with further iterations
until there are no longer misclassified examples in the data set.
Exercise 1.3

The weight update rule in (1.3) has the nice interpretation that it moves
in the direction of classifying x(t) correctly.

(a) Show that y(t) w^T(t) x(t) < 0. [Hint: x(t) is misclassified by w(t).]
(b) Show that y(t) w^T(t + 1) x(t) > y(t) w^T(t) x(t). [Hint: Use (1.3).]
(c) As far as classifying x(t) is concerned, argue that the move from w(t)
    to w(t + 1) is a move 'in the right direction'.
Although the update rule in (1.3) considers only one training example at a
time and may 'mess up' the classification of the other examples that are not
involved in the current iteration, it turns out that the algorithm is guaranteed
to arrive at the right solution in the end. The proof is the subject of Problem
1.3. The result holds regardless of which example we choose from among
the misclassified examples in (x_1, y_1), · · · , (x_N, y_N) at each iteration, and
regardless of how we initialize the weight vector to start the algorithm. For
simplicity, we can pick one of the misclassified examples at random (or cycle
through the examples and always choose the first misclassified one), and we
can initialize w(0) to the zero vector.

Within the infinite space of all weight vectors, the perceptron algorithm
manages to find a weight vector that works, using a simple iterative process.
This illustrates how a learning algorithm can effectively search an infinite
hypothesis set using a finite number of simple steps. This feature is characteristic
of many techniques that are used in learning, some of which are far more
sophisticated than the perceptron learning algorithm.
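To make the iterative process concrete, here is a minimal sketch of the perceptron
learning algorithm in Python, using the augmented-vector convention of
Equation (1.2) and the update rule (1.3). The function name and the choice of
picking a misclassified example at random are our own; the sketch assumes a
linearly separable data set so that the loop eventually terminates:

    import numpy as np

    def pla(X, y, max_iters=10_000, rng=np.random.default_rng(0)):
        """Perceptron learning algorithm (sketch).
        X: N x d array of inputs (without the x_0 coordinate); y: N array of +/-1 labels."""
        X_aug = np.column_stack([np.ones(len(X)), X])   # prepend the fixed coordinate x_0 = 1
        w = np.zeros(X_aug.shape[1])                     # initialize w(0) to the zero vector
        for _ in range(max_iters):
            misclassified = np.where(np.sign(X_aug @ w) != y)[0]
            if len(misclassified) == 0:                  # no misclassified examples remain
                break
            t = rng.choice(misclassified)                # pick a misclassified example at random
            w = w + y[t] * X_aug[t]                      # update rule (1.3): w(t+1) = w(t) + y(t) x(t)
        return w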
Exercise 1.4

Let us create our own target function f and data set D and see how the
perceptron learning algorithm works. Take d = 2 so you can visualize the
problem, and choose a random line in the plane as your target function,
where one side of the line maps to +1 and the other maps to -1. Choose
the inputs x_n of the data set as random points in the plane, and evaluate
the target function on each x_n to get the corresponding output y_n.

Now, generate a data set of size 20. Try the perceptron learning algorithm
on your data set and see how long it takes to converge and how well the
final hypothesis g matches your target f. You can find other ways to play
with this experiment in Problem 1.4.
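One possible way to set up the experiment in Exercise 1.4 is sketched below; it
reuses the pla function from the sketch above, and the sampling region [-1, 1]²
and random seed are arbitrary choices of ours:

    import numpy as np

    rng = np.random.default_rng(1)

    # Target f: a random line through two random points in [-1, 1]^2;
    # one side of the line maps to +1, the other to -1
    p, q = rng.uniform(-1, 1, size=(2, 2))
    a, b, c = q[1] - p[1], p[0] - q[0], q[0] * p[1] - p[0] * q[1]
    f = lambda X: np.sign(X @ np.array([a, b]) + c)

    X = rng.uniform(-1, 1, size=(20, 2))   # 20 random input points x_n
    y = f(X)                                # corresponding outputs y_n

    w = pla(X, y)                           # run the PLA sketch from above
    g = lambda X: np.sign(np.column_stack([np.ones(len(X)), X]) @ w)
    print(np.mean(g(X) == y))               # should print 1.0: all training points classified correctly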
The perceptron learning algorithm succeeds in achieving its goal: finding a
hypothesis that classifies all the points in the data set D = {(x_1, y_1), · · · , (x_N, y_N)}
correctly. Does this mean that this hypothesis will also be successful in
classifying new data points that are not in D? This turns out to be the key question
in the theory of learning, a question that will be thoroughly examined in this
book.
[Figure 1.4: The learning approach to coin classification. (a) Training data of
pennies, nickels, dimes, and quarters (1, 5, 10, and 25 cents) are represented
in a size-mass space where they fall into clusters. (b) A classification rule is
learned from the data set by separating the four clusters. A new coin will
be classified according to the region in the size-mass plane that it falls into.]
1.1.3 Learning versus Design

So far, we have discussed what learning is. Now, we discuss what it is not. The
goal is to distinguish between learning and a related approach that is used for
similar problems. While learning is based on data, this other approach does
not use data. It is a 'design' approach based on specifications, and is often
discussed alongside the learning approach in pattern recognition literature.

Consider the problem of recognizing coins of different denominations, which
is relevant to vending machines, for example. We want the machine to recognize
quarters, dimes, nickels and pennies. We will contrast the 'learning from
data' approach and the 'design from specifications' approach for this problem.
We assume that each coin will be represented by its size and mass, a
two-dimensional input.

In the learning approach, we are given a sample of coins from each of
the four denominations and we use these coins as our data set. We treat
the size and mass as the input vector, and the denomination as the output.
Figure 1.4(a) shows what the data set may look like in the input space. There
is some variation of size and mass within each class, but by and large coins
of the same denomination cluster together. The learning algorithm searches
for a hypothesis that classifies the data set well. If we want to classify a new
coin, the machine measures its size and mass, and then classifies it according
to the learned hypothesis in Figure 1.4(b).
In the design approach, we call the United States Mint and ask them about
the specifications of different coins. We also ask them about the number
of coins of each denomination in circulation, in order to get an estimate of
the relative frequency of each coin. Finally, we make a physical model of
the variations in size and mass due to exposure to the elements and due to
errors in measurement. We put all of this information together and compute
the full joint probability distribution of size, mass, and coin denomination
(Figure 1.5(a)). Once we have that joint distribution, we can construct the
optimal decision rule to classify coins based on size and mass (Figure 1.5(b)).
The rule chooses the denomination that has the highest probability for a given
size and mass, thus achieving the smallest possible probability of error.²

[Figure 1.5: The design approach to coin classification. (a) A probabilistic
model for the size, mass, and denomination of coins is derived from known
specifications. The figure shows the high probability region for each denomination
(1, 5, 10, and 25 cents) according to the model. (b) A classification
rule is derived analytically to minimize the probability of error in classifying
a coin based on size and mass. The resulting regions for each denomination
are shown.]
The main difference between the learning approach and the design approach
is the role that data plays. In the design approach, the problem is well
specified and one can analytically derive f without the need to see any data.
In the learning approach, the problem is much less specified, and one needs
data to pin down what f is.

Both approaches may be viable in some applications, but only the learning
approach is possible in many applications where the target function is
unknown. We are not trying to compare the utility or the performance of the
two approaches. We are just making the point that the design approach is
distinct from learning. This book is about learning.

²This is called Bayes optimal decision theory. Some learning models are based on the
same theory by estimating the probability from data.
Exercise 1.5

Which of the following problems are more suited for the learning approach
and which are more suited for the design approach?

(a) Determining the age at which a particular medical test should be
    performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection
1.2 Types of Learning

The basic premise of learning from data is the use of a set of observations to
uncover an underlying process. It is a very broad premise, and difficult to fit
into a single framework. As a result, different learning paradigms have arisen
to deal with different situations and different assumptions. In this section, we
introduce some of these paradigms.

The learning paradigm that we have discussed so far is called supervised
learning. It is the most studied and most utilized type of learning, but it is
not the only one. Some variations of supervised learning are simple enough
to be accommodated within the same framework. Other variations are more
profound and lead to new concepts and techniques that take on lives of their
own. The most important variations have to do with the nature of the data
set.
1.2.1 Supervised Learning

When the training data contains explicit examples of what the correct output
should be for given inputs, then we are within the supervised learning setting
that we have covered so far. Consider the hand-written digit recognition
problem (task (b) of Exercise 1.1). A reasonable data set for this problem is
a collection of images of hand-written digits, and for each image, what the
digit actually is. We thus have a set of examples of the form ( image , digit ).
The learning is supervised in the sense that some 'supervisor' has taken the
trouble to look at each input, in this case an image, and determine the correct
output, in this case one of the ten categories {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

While we are on the subject of variations, there is more than one way that
a data set can be presented to the learning process. Data sets are typically
created and presented to us in their entirety at the outset of the learning process.
For instance, historical records of customers in the credit-card application,
and previous movie ratings of customers in the movie rating application, are
already there for us to use. This protocol of a 'ready' data set is the most
common in practice, and it is what we will focus on in this book. However, it
is worth noting that two variations of this protocol have attracted a significant
body of work.

One is active learning, where the data set is acquired through queries that
we make. Thus, we get to choose a point x in the input space, and the
supervisor reports to us the target value for x. As you can see, this opens
the possibility for strategic choice of the point x to maximize its information
value, similar to asking a strategic question in a game of 20 questions.

Another variation is called online learning, where the data set is given to
the algorithm one example at a time. This happens when we have streaming
data that the algorithm has to process 'on the run'. For instance, when
the movie recommendation system discussed in Section 1.1 is deployed, online
learning can process new ratings from current users and movies. Online
learning is also useful when we have limitations on computing and storage
that preclude us from processing the whole data as a batch. We should note
that online learning can be used in different paradigms of learning, not just in
supervised learning.
1.2.2 Reinforcement Learning

When the training data does not explicitly contain the correct output for each
input, we are no longer in a supervised learning setting. Consider a toddler
learning not to touch a hot cup of tea. The experience of such a toddler
would typically comprise a set of occasions when the toddler confronted a hot
cup of tea and was faced with the decision of touching it or not touching it.
Presumably, every time she touched it, the result was a high level of pain, and
every time she didn't touch it, a much lower level of pain resulted (that of an
unsatisfied curiosity). Eventually, the toddler learns that she is better off not
touching the hot cup.

The training examples did not spell out what the toddler should have done,
but they instead graded different actions that she has taken. Nevertheless, she
uses the examples to reinforce the better actions, eventually learning what she
should do in similar situations. This characterizes reinforcement learning,
where the training example does not contain the target output, but instead
contains some possible output together with a measure of how good that
output is. In contrast to supervised learning where the training examples were of
the form ( input , correct output ), the examples in reinforcement learning are
of the form

    ( input , some output , grade for this output ).

Importantly, the example does not say how good other outputs would have
been for this particular input.
Reinforcement learning is especially useful for learning how to play a game.
Imagine a situation in backgammon where you have a choice between different
actions and you want to identify the best action. It is not a trivial task to
ascertain what the best action is at a given stage of the game, so we cannot
easily create supervised learning examples. If you use reinforcement learning
instead, all you need to do is to take some action and report how well things
went, and you have a training example. The reinforcement learning algorithm
is left with the task of sorting out the information coming from different
examples to find the best line of play.
1.2.3 Unsupervised Learning

In the unsupervised setting, the training data does not contain any output
information at all. We are just given input examples x_1, · · · , x_N. You may
wonder how we could possibly learn anything from mere inputs. Consider the
coin classification problem that we discussed earlier in Figure 1.4. Suppose
that we didn't know the denomination of any of the coins in the data set. This
unlabeled data is shown in Figure 1.6(a). We still get similar clusters, but they
are now unlabeled so all points have the same 'color'. The decision regions
in unsupervised learning may be identical to those in supervised learning, but
without the labels (Figure 1.6(b)). However, the correct clustering is less
obvious now, and even the number of clusters may be ambiguous.

[Figure 1.6: Unsupervised learning of coin classification. (a) The same data
set of coins in Figure 1.4(a) is again represented in the size-mass space, but
without being labeled. They still fall into clusters. (b) An unsupervised
classification rule treats the four clusters as different types. The rule may
be somewhat ambiguous, as type 1 and type 2 could be viewed as one cluster.]

Nonetheless, this example shows that we can learn something from the
inputs by themselves. Unsupervised learning can be viewed as the task of
spontaneously finding patterns and structure in input data. For instance, if
our task is to categorize a set of books into topics, and we only use general
properties of the various books, we can identify books that have similar
properties and put them together in one category, without naming that category.
Unsupervised learning can also be viewed as a way to create a higher-level
representation of the data. Imagine that you don't speak a word of
Spanish, but your company will relocate you to Spain next month. They
will arrange for Spanish lessons once you are there, but you would like to
prepare yourself a bit before you go. All you have access to is a Spanish radio
station. For a full month, you continuously bombard yourself with Spanish;
this is an unsupervised learning experience since you don't know the meaning
of the words. However, you gradually develop a better representation of the
language in your brain by becoming more tuned to its common sounds and
structures. When you arrive in Spain, you will be in a better position to start
your Spanish lessons. Indeed, unsupervised learning can be a precursor to
supervised learning. In other cases, it is a stand-alone technique.
Exercise 1.6

For each of the following tasks, identify which type of learning is involved
(supervised, reinforcement, or unsupervised) and the training data to be
used. If a task can fit more than one type, explain how and describe the
training data for each type.

(a) Recommending a book to a user in an online bookstore
(b) Playing tic-tac-toe
(c) Categorizing movies into different types
(d) Learning to play music
(e) Credit limit: Deciding the maximum allowed debt for each bank customer
Our main focus in this book will be supervised learning, which is the most
popular form of learning from data.
1.2.4 Other Views of Learning

The study of learning has evolved somewhat independently in a number of
fields that started historically at different times and in different domains, and
these fields have developed different emphases and even different jargons. As a
result, learning from data is a diverse subject with many aliases in the scientific
literature. The main field dedicated to the subject is called machine learning,
a name that distinguishes it from human learning. We briefly mention two
other important fields that approach learning from data in their own ways.

Statistics shares the basic premise of learning from data, namely the use
of a set of observations to uncover an underlying process. In this case, the
process is a probability distribution and the observations are samples from that
distribution. Because statistics is a mathematical field, emphasis is given to
situations where most of the questions can be answered with rigorous proofs.
As a result, statistics focuses on somewhat idealized models and analyzes them
in great detail. This is the main difference between the statistical approach
to learning and how we approach the subject here. We make less restrictive
assumptions and deal with more general models than in statistics. Therefore,
we end up with weaker results that are nonetheless broadly applicable.
Data mining is a practical field that focuses on finding patterns, correlations,
or anomalies in large relational databases. For example, we could be
looking at medical records of patients and trying to detect a cause-effect
relationship between a particular drug and long-term effects. We could also be
looking at credit card spending patterns and trying to detect potential fraud.
Technically, data mining is the same as learning from data, with more emphasis
on data analysis than on prediction. Because databases are usually huge,
computational issues are often critical in data mining. Recommender systems,
which were illustrated in Section 1.1 with the movie rating example, are also
considered part of data mining.
1.3 Is Learning Feasible?
The target function f is the object of learning. The most important assertion
about the target function is that it is unknown. We really mean unknown.
This raises a natural question. How could a limited data set reveal enough
information to pin down the entire target function? Figure 1.7 illustrates this
difficulty. A simple learning task with 6 training examples of a ±1 target
function is shown. Try to learn what the function is, then apply it to the test
input given. Do you get -1 or +1? Now, show the problem to your friends
and see if they get the same answer.

[Figure 1.7: A visual learning problem. The first two rows show the training
examples (each input x is a 9-bit vector represented visually as a 3 × 3 black-
and-white array). The inputs in the first row have f(x) = -1, and the inputs
in the second row have f(x) = +1. Your task is to learn from this data set
what f is, then apply f to the test input at the bottom. Do you get -1
or +1?]
T he chances are the answers were no t unanim ous, and for good reason.
T here is sim ply m ore th a n one function th a t fits the 6 tra in ing examples, and
some of these functions have a value of — 1 on the te s t po int and o thers have a
value of + 1 . For instance, if th e tru e / is + 1 when th e p a tte rn is sym m etric,
the value for the te s t po int would be +1 . If th e tru e / is + 1 when the top left
square of the p a tte rn is w hite, the value for the te s t po int would be —1. B oth
functions agree w ith all th e exam ples in the d a ta set, so there isn’t enough
inform ation to tell us which would be th e correct answer.
T his does no t bode well for th e feasibility of learning. To m ake m atters
worse, we will now see th a t the difficulty we experienced in th is simple problem
is the rule, no t the exception.
1.3.1 Outside the Data Set

When we get the training data D, e.g., the first two rows of Figure 1.7, we
know the value of f on all the points in D. This doesn't mean that we have
learned f, since it doesn't guarantee that we know anything about f outside
of D. We know what we have already seen, but that's not learning. That's
memorizing.

Does the data set D tell us anything outside of D that we didn't know
before? If the answer is yes, then we have learned something. If the answer is
no, we can conclude that learning is not feasible.

Since we maintain that f is an unknown function, we can prove that f
remains unknown outside of D. Instead of going through a formal proof for
the general case, we will illustrate the idea in a concrete case. Consider a
Boolean target function over a three-dimensional input space X = {0,1}³.
We are given a data set D of five examples represented in the table below. We
denote the binary output by ○/● for visual clarity.

        x_n        y_n
      0 0 0         ○
      0 0 1         ●
      0 1 0         ●
      0 1 1         ○
      1 0 0         ●

where y_n = f(x_n) for n = 1, 2, 3, 4, 5. The advantage of this simple Boolean
case is that we can enumerate the entire input space (since there are only 2³ = 8
distinct input vectors), and we can enumerate the set of all possible target
functions (since f is a Boolean function on 3 Boolean inputs, and there are
only 2⁸ = 256 distinct Boolean functions on 3 Boolean inputs).
Let us look at the problem of learning f. Since f is unknown except
inside D, any function that agrees with D could conceivably be f. The table
below shows all such functions f_1, · · · , f_8. It also shows the data set D (in
blue) and what the final hypothesis g may look like.

[Table: for each of the eight points x ∈ {0,1}³, the table lists the output y_n
given by D on the five training points, a possible final hypothesis g, and the
values of the candidate target functions f_1, · · · , f_8. All of f_1, · · · , f_8 agree
with D on the five training points; on the three remaining points (101, 110,
and 111), marked by red question marks for g, they take all eight possible
combinations of ○/● values.]
The final hypothesis g is chosen based on the five examples in D. The table
shows the case where g is chosen to match f on these examples.

If we remain true to the notion of unknown target, we cannot exclude any
of f_1, · · · , f_8 from being the true f. Now, we have a dilemma. The whole
purpose of learning f is to be able to predict the value of f on points that we
haven't seen before. The quality of the learning will be determined by how
close our prediction is to the true value. Regardless of what g predicts on
the three points we haven't seen before (those outside of D, denoted by red
question marks), it can agree or disagree with the target, depending on which
of f_1, · · · , f_8 turns out to be the true target. It is easy to verify that any 3
bits that replace the red question marks are as good as any other 3 bits.
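This is easy to check by brute force. The short sketch below (our own
illustration, not from the book) enumerates all 256 Boolean functions on three
bits, keeps the 8 that agree with the data set D above, and prints their values
on the three unseen points; every 3-bit pattern shows up exactly once:

    from itertools import product

    # The five training points of D and their outputs (using 1 for ● and 0 for ○)
    D = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 0, (1, 0, 0): 1}
    points = list(product([0, 1], repeat=3))          # all 8 points of X = {0,1}^3
    unseen = [x for x in points if x not in D]        # the 3 points outside D

    # Each of the 2^8 = 256 target functions is a tuple of 8 output bits, one per point
    consistent = [f for f in product([0, 1], repeat=8)
                  if all(f[points.index(x)] == y for x, y in D.items())]

    print(len(consistent))                             # 8 functions agree with D
    for f in consistent:                               # their values on the 3 unseen points:
        print([f[points.index(x)] for x in unseen])    # every 3-bit pattern appears exactly once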
Exercise 1.7

For each of the following learning scenarios in the above problem, evaluate
the performance of g on the three points in X outside D. To measure the
performance, compute how many of the 8 possible target functions agree
with g on all three points, on two of them, on one of them, and on none
of them.

(a) H has only two hypotheses, one that always returns '●' and one that
    always returns '○'. The learning algorithm picks the hypothesis that
    matches the data set the most.
(b) The same H, but the learning algorithm now picks the hypothesis
    that matches the data set the least.
(c) H = {XOR} (only one hypothesis which is always picked), where
    XOR is defined by XOR(x) = ● if the number of 1's in x is odd and
    XOR(x) = ○ if the number is even.
(d) H contains all possible hypotheses (all Boolean functions on three
    variables), and the learning algorithm picks the hypothesis that agrees
    with all training examples, but otherwise disagrees the most with the
    XOR.
[Figure 1.8: A random sample is picked from a bin of red and green marbles.
ν = fraction of red marbles in the sample; μ = probability of red marbles in
the bin. The probability μ of red marbles in the bin is unknown. What does
the fraction ν of red marbles in the sample tell us about μ?]
It doesn't matter what the algorithm does or what hypothesis set H is used.
Whether H has a hypothesis that perfectly agrees with D (as depicted in the
table) or not, and whether the learning algorithm picks that hypothesis or
picks another one that disagrees with D (different green bits), it makes no
difference whatsoever as far as the performance outside of D is concerned. Yet
the performance outside D is all that matters in learning!

This dilemma is not restricted to Boolean functions, but extends to the
general learning problem. As long as f is an unknown function, knowing D
cannot exclude any pattern of values for f outside of D. Therefore, the
predictions of g outside of D are meaningless.

Does this mean that learning from data is doomed? If so, this will be a
very short book. Fortunately, learning is alive and well, and we will see
why. We won't have to change our basic assumption to do that. The target
function will continue to be unknown, and we still mean unknown.
1.3.2 Probability to the Rescue

We will show that we can indeed infer something outside D using only D, but
in a probabilistic way. What we infer may not be much compared to learning
a full target function, but it will establish the principle that we can reach
outside D. Once we establish that, we will take it to the general learning
problem and pin down what we can and cannot learn.

Let's take the simplest case of picking a sample, and see when we can say
something about the objects outside the sample. Consider a bin that contains
red and green marbles, possibly infinitely many. The proportion of red and
green marbles in the bin is such that if we pick a marble at random, the
probability that it will be red is μ and the probability that it will be green
is 1 - μ. We assume that the value of μ is unknown to us.
We pick a random sample of N independent marbles (with replacement)
from this bin, and observe the fraction ν of red marbles within the sample
(Figure 1.8). What does the value of ν tell us about the value of μ?

One answer is that regardless of the colors of the N marbles that we picked,
we still don't know the color of any marble that we didn't pick. We can get
mostly green marbles in the sample while the bin has mostly red marbles.
Although this is certainly possible, it is by no means probable.

Exercise 1.8

If μ = 0.9, what is the probability that a sample of 10 marbles will have
ν ≤ 0.1? [Hints: 1. Use binomial distribution. 2. The answer is a very
small number.]
The situation is similar to taking a poll. A random sample from a population
tends to agree with the views of the population at large. The probability
distribution of the random variable ν in terms of the parameter μ is well
understood, and when the sample size is big, ν tends to be close to μ.

To quantify the relationship between ν and μ, we use a simple bound called
the Hoeffding Inequality. It states that for any sample size N,

    P[ |ν - μ| > ε ] ≤ 2e^{-2ε²N}    for any ε > 0.     (1.4)

Here, P[·] denotes the probability of an event, in this case with respect to
the random sample we pick, and ε is any positive value we choose. Putting
Inequality (1.4) in words, it says that as the sample size N grows, it becomes
exponentially unlikely that ν will deviate from μ by more than our 'tolerance' ε.

The only quantity that is random in (1.4) is ν, which depends on the random
sample. By contrast, μ is not random. It is just a constant, albeit unknown to
us. There is a subtle point here. The utility of (1.4) is to infer the value of μ
using the value of ν, although it is μ that affects ν, not vice versa. However,
since the effect is that ν tends to be close to μ, we infer that μ 'tends' to be
close to ν.

Although P[ |ν - μ| > ε ] depends on μ, as μ appears in the argument and
also affects the distribution of ν, we are able to bound the probability by 2e^{-2ε²N},
which does not depend on μ. Notice that only the size N of the sample affects
the bound, not the size of the bin. The bin can be large or small, finite or
infinite, and we still get the same bound when we use the same sample size.
Exercise 1.9
If µ = 0.9, use the Hoeffding Inequality to bound the probability that a sample of 10 marbles will have ν ≤ 0.1 and compare the answer to the previous exercise.
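As a companion to the sketch after Exercise 1.8, the following lines evaluate the Hoeffding bound for the same setting; the choice ε = 0.8 is our own reading of the event ν ≤ 0.1 when µ = 0.9, so treat it as an illustration rather than the official solution.

    from math import exp

    N, eps = 10, 0.8                  # with mu = 0.9, the event nu <= 0.1 forces |nu - mu| >= 0.8
    print(2 * exp(-2 * eps**2 * N))   # Hoeffding bound; compare with the exact probability above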
If we choose ε to be very small in order to make ν a good approximation of µ, we need a larger sample size N to make the RHS of Inequality (1.4) small. We can then assert that it is likely that ν will indeed be a good approximation of µ. Although this assertion does not give us the exact value of µ, and doesn't even guarantee that the approximate value holds, knowing that we are within ±ε of µ most of the time is a significant improvement over not knowing anything at all.

The fact that the sample was randomly selected from the bin is the reason we are able to make any kind of statement about µ being close to ν. If the sample was not randomly selected but picked in a particular way, we would lose the benefit of the probabilistic analysis and we would again be in the dark outside of the sample.
How does the bin model relate to the learning problem? It seems that the unknown here was just the value of µ while the unknown in learning is an entire function f: X → Y. The two situations can be connected. Take any single hypothesis h ∈ H and compare it to f on each point x ∈ X. If h(x) = f(x), color the point x green. If h(x) ≠ f(x), color the point x red. The color that each point gets is not known to us, since f is unknown. However, if we pick x at random according to some probability distribution P over the input space X, we know that x will be red with some probability, call it µ, and green with probability 1 − µ. Regardless of the value of µ, the space X now behaves like the bin in Figure 1.8.

The training examples play the role of a sample from the bin. If the inputs x_1, ..., x_N in D are picked independently according to P, we will get a random sample of red (h(x_n) ≠ f(x_n)) and green (h(x_n) = f(x_n)) points. Each point will be red with probability µ and green with probability 1 − µ. The color of each point will be known to us since both h(x_n) and f(x_n) are known for n = 1, ..., N (the function h is our hypothesis so we can evaluate it on any point, and f(x_n) = y_n is given to us for all points in the data set D). The learning problem is now reduced to a bin problem, under the assumption that the inputs in D are picked independently according to some distribution P on X. Any P will translate to some µ in the equivalent bin. Since µ is allowed to be unknown, P can be unknown to us as well. Figure 1.9 adds this probabilistic component to the basic learning setup depicted in Figure 1.2.
With this equivalence, the Hoeffding Inequality can be applied to the learning problem, allowing us to make a prediction outside of D. Using ν to predict µ tells us something about f, although it doesn't tell us what f is. What µ tells us is the error rate h makes in approximating f. If ν happens to be close to zero, we can predict that h will approximate f well over the entire input space. If not, we are out of luck.

Unfortunately, we have no control over ν in our current situation, since ν is based on a particular hypothesis h. In real learning, we explore an entire hypothesis set H, looking for some h ∈ H that has a small error rate. If we have only one hypothesis to begin with, we are not really learning, but rather 'verifying' whether that particular hypothesis is good or bad. Let us see if we can extend the bin equivalence to the case where we have multiple hypotheses in order to capture real learning.
Figure 1.9: Probability added to the basic learning setup
To do that, we start by introducing more descriptive names for the different concepts that we will use. The error rate within the sample, which corresponds to ν in the bin model, will be called the in-sample error,

    E_in(h) = (fraction of D where f and h disagree)
            = (1/N) Σ_{n=1}^{N} [[h(x_n) ≠ f(x_n)]],

where [[statement]] = 1 if the statement is true, and = 0 if the statement is false. We have made explicit the dependency of E_in on the particular h that we are considering. In the same way, we define the out-of-sample error

    E_out(h) = P[h(x) ≠ f(x)],

which corresponds to µ in the bin model. The probability is based on the distribution P over X which is used to sample the data points x.
Figure 1.10: Multiple bins depict the learning problem with M hypotheses
Substituting the new notation E_in(h) for ν and E_out(h) for µ, the Hoeffding Inequality (1.4) can be rewritten as

    P[|E_in(h) − E_out(h)| > ε] ≤ 2e^{−2ε²N}    for any ε > 0,    (1.5)

where N is the number of training examples. The in-sample error E_in, just like ν, is a random variable that depends on the sample. The out-of-sample error E_out, just like µ, is unknown but not random.
Let us consider an entire hypothesis set H instead of just one hypothesis h, and assume for the moment that H has a finite number of hypotheses

    H = {h_1, h_2, ..., h_M}.

We can construct a bin equivalent in this case by having M bins as shown in Figure 1.10. Each bin still represents the input space X, with the red marbles in the mth bin corresponding to the points x ∈ X where h_m(x) ≠ f(x). The probability of red marbles in the mth bin is E_out(h_m) and the fraction of red marbles in the mth sample is E_in(h_m), for m = 1, ..., M. Although the Hoeffding Inequality (1.5) still applies to each bin individually, the situation becomes more complicated when we consider all the bins simultaneously. Why is that? The inequality stated that

    P[|E_in(h) − E_out(h)| > ε] ≤ 2e^{−2ε²N}    for any ε > 0,

where the hypothesis h is fixed before you generate the data set, and the probability is with respect to random data sets D; we emphasize that the assumption "h is fixed before you generate the data set" is critical to the validity of this bound. If you are allowed to change h after you generate the data set, the assumptions that are needed to prove the Hoeffding Inequality no longer hold. With multiple hypotheses in H, the learning algorithm picks
the final hypothesis g based on D, i.e. after generating the data set. The statement we would like to make is not

    "P[|E_in(h_m) − E_out(h_m)| > ε] is small"

(for any particular, fixed h_m ∈ H), but rather

    "P[|E_in(g) − E_out(g)| > ε] is small"    for the final hypothesis g.

The hypothesis g is not fixed ahead of time before generating the data, because which hypothesis is selected to be g depends on the data. So, we cannot just plug in g for h in the Hoeffding Inequality. The next exercise considers a simple coin experiment that further illustrates the difference between a fixed h and the final hypothesis g selected by the learning algorithm.
Exercise 1.10
Here is an experiment that illustrates the difference between a single bin and multiple bins. Run a computer simulation for flipping 1,000 fair coins (a sketch of such a simulation follows this exercise). Flip each coin independently 10 times. Let's focus on 3 coins as follows: c_1 is the first coin flipped; c_rand is a coin you choose at random; c_min is the coin that had the minimum frequency of heads (pick the earlier one in case of a tie). Let ν_1, ν_rand and ν_min be the fraction of heads you obtain for the respective three coins. For a coin, let µ be its probability of heads.
(a) What is µ for the three coins selected?
(b) Repeat this entire experiment a large number of times (e.g., 100,000 runs of the entire experiment) to get several instances of ν_1, ν_rand and ν_min, and plot the histograms of the distributions of ν_1, ν_rand and ν_min. Notice that which coins end up being c_rand and c_min may differ from one run to another.
(c) Using (b), plot estimates for P[|ν − µ| > ε] as a function of ε, together with the Hoeffding bound 2e^{−2ε²N} (on the same graph).
(d) Which coins obey the Hoeffding bound, and which ones do not? Explain why.
(e) Relate part (d) to the multiple bins in Figure 1.10.
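The following is a minimal simulation sketch for Exercise 1.10, written in Python with NumPy; the run count is kept smaller than the exercise suggests so it executes quickly, and the variable names are ours rather than anything prescribed by the exercise.

    import numpy as np

    rng = np.random.default_rng(0)
    runs, coins, flips = 10_000, 1000, 10        # fewer runs than suggested, just to keep it fast

    nu_1, nu_rand, nu_min = [], [], []
    for _ in range(runs):
        heads = rng.binomial(flips, 0.5, size=coins) / flips   # fraction of heads for each coin
        nu_1.append(heads[0])                                  # the first coin
        nu_rand.append(heads[rng.integers(coins)])             # a randomly chosen coin
        nu_min.append(heads.min())                             # the coin with the fewest heads

    # Empirical P[|nu - mu| > eps] for each coin versus the Hoeffding bound (mu = 0.5, N = 10)
    for eps in (0.1, 0.2, 0.3, 0.4):
        bound = 2 * np.exp(-2 * eps**2 * flips)
        rates = [np.mean(np.abs(np.array(nu) - 0.5) > eps) for nu in (nu_1, nu_rand, nu_min)]
        print(eps, [float(r) for r in rates], float(bound))    # c_min is the interesting case in (d)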
The way to get around this is to try to bound P[|E_in(g) − E_out(g)| > ε] in a way that does not depend on which g the learning algorithm picks. There is a simple but crude way of doing that. Since g has to be one of the h_m's regardless of the algorithm and the sample, it is always true that

    "|E_in(g) − E_out(g)| > ε"  ⟹  "|E_in(h_1) − E_out(h_1)| > ε
                                      or |E_in(h_2) − E_out(h_2)| > ε
                                      or ...
                                      or |E_in(h_M) − E_out(h_M)| > ε",
where B_1 ⟹ B_2 means that event B_1 implies event B_2. Although the events on the RHS cover a lot more than the LHS, the RHS has the property we want: the hypotheses h_m are fixed. We now apply two basic rules in probability:

    if B_1 ⟹ B_2, then P[B_1] ≤ P[B_2],

and, if B_1, B_2, ..., B_M are any events, then

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].

The second rule is known as the union bound. Putting the two rules together, we get

    P[|E_in(g) − E_out(g)| > ε] ≤ P[ |E_in(h_1) − E_out(h_1)| > ε
                                     or |E_in(h_2) − E_out(h_2)| > ε
                                     or ...
                                     or |E_in(h_M) − E_out(h_M)| > ε ]
                                ≤ Σ_{m=1}^{M} P[|E_in(h_m) − E_out(h_m)| > ε].

Applying the Hoeffding Inequality (1.5) to the M terms one at a time, we can bound each term in the sum by 2e^{−2ε²N}. Substituting, we get

    P[|E_in(g) − E_out(g)| > ε] ≤ 2Me^{−2ε²N}.    (1.6)
Mathematically, this is a 'uniform' version of (1.5). We are trying to simultaneously approximate all E_out(h_m)'s by the corresponding E_in(h_m)'s. This allows the learning algorithm to choose any hypothesis based on E_in and expect that the corresponding E_out will uniformly follow suit, regardless of which hypothesis is chosen.

The downside for uniform estimates is that the probability bound 2Me^{−2ε²N} is a factor of M looser than the bound for a single hypothesis, and will only be meaningful if M is finite. We will improve on that in Chapter 2.
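To get a feel for how loose the union bound in (1.6) can be, one can simply tabulate 2Me^{−2ε²N} for a few values of M and N; the particular numbers below are arbitrary choices of ours, only meant to show when the bound stops being meaningful (i.e., exceeds 1).

    from math import exp

    eps = 0.1
    for M in (1, 100, 10_000):
        for N in (100, 1_000, 10_000):
            bound = 2 * M * exp(-2 * eps**2 * N)
            print(f"M={M:>6}  N={N:>6}  bound={bound:.2e}" + ("  (vacuous)" if bound >= 1 else ""))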
1.3.3 Feasibility of Learning
We have introduced two apparently conflicting arguments about the feasibility of learning. One argument says that we cannot learn anything outside of D, and the other says that we can. We would like to reconcile these two arguments and pinpoint the sense in which learning is feasible:

1. Let us reconcile the two arguments. The question of whether D tells us anything outside of D that we didn't know before has two different answers. If we insist on a deterministic answer, which means that D tells us something certain about f outside of D, then the answer is no. If we accept a probabilistic answer, which means that D tells us something likely about f outside of D, then the answer is yes.
Exercise 1.11
We are given a data set D of 25 training examples from an unknown target function f: X → Y, where X = R and Y = {−1, +1}. To learn f, we use a simple hypothesis set H = {h_1, h_2} where h_1 is the constant +1 function and h_2 is the constant −1.
We consider two learning algorithms, S (smart) and C (crazy). S chooses the hypothesis that agrees the most with D and C chooses the other hypothesis deliberately. Let us see how these algorithms perform out of sample from the deterministic and probabilistic points of view. Assume in the probabilistic view that there is a probability distribution on X, and let P[f(x) = +1] = p.
(a) Can S produce a hypothesis that is guaranteed to perform better than random on any point outside D?
(b) Suppose that all the examples in D have y_n = +1. Is it possible that the hypothesis that C produces turns out to be better than the hypothesis that S produces?
(c) If p = 0.9, what is the probability that S will produce a better hypothesis than C?
(d) Is there any value of p for which it is more likely than not that C will produce a better hypothesis than S?
By adopting the probabilistic view, we get a positive answer to the feasibility question without paying too much of a price. The only assumption we make in the probabilistic framework is that the examples in D are generated independently. We don't insist on using any particular probability distribution, or even on knowing what distribution is used. However, whatever distribution we use for generating the examples, we must also use when we evaluate how well g approximates f (Figure 1.9). That's what makes the Hoeffding Inequality applicable. Of course this ideal situation may not always happen in practice, and some variations of it have been explored in the literature.
2. Let us pin down what we mean by the feasibility of learning. Learning produces a hypothesis g to approximate the unknown target function f. If learning is successful, then g should approximate f well, which means E_out(g) ≈ 0. However, this is not what we get from the probabilistic analysis. What we get instead is E_out(g) ≈ E_in(g). We still have to make E_in(g) ≈ 0 in order to conclude that E_out(g) ≈ 0.

We cannot guarantee that we will find a hypothesis that achieves E_in(g) ≈ 0, but at least we will know if we find it. Remember that E_out(g) is an unknown quantity, since f is unknown, but E_in(g) is a quantity that we can evaluate. We have thus traded the condition E_out(g) ≈ 0, one that we cannot ascertain, for the condition E_in(g) ≈ 0, which we can ascertain. What enabled this is the Hoeffding Inequality (1.6):

    P[|E_in(g) − E_out(g)| > ε] ≤ 2Me^{−2ε²N}

that assures us that E_out(g) ≈ E_in(g), so we can use E_in as a proxy for E_out.
Exercise 1.12
A friend comes to you with a learning problem. She says the target function f is completely unknown, but she has 4,000 data points. She is willing to pay you to solve her problem and produce for her a g which approximates f. What is the best that you can promise her among the following:
(a) After learning you will provide her with a g that you will guarantee approximates f well out of sample.
(b) After learning you will provide her with a g, and with high probability the g which you produce will approximate f well out of sample.
(c) One of two things will happen.
    (i) You will produce a hypothesis g;
    (ii) You will declare that you failed.
    If you do return a hypothesis g, then with high probability the g which you produce will approximate f well out of sample.
One should note that there are cases where we won't insist that E_in(g) ≈ 0. Financial forecasting is an example where market unpredictability makes it impossible to get a forecast that has anywhere near zero error. All we hope for is a forecast that gets it right more often than not. If we get that, our bets will win in the long run. This means that a hypothesis that has E_in(g) somewhat below 0.5 will work, provided of course that E_out(g) is close enough to E_in(g).

The feasibility of learning is thus split into two questions:

1. Can we make sure that E_out(g) is close enough to E_in(g)?
2. Can we make E_in(g) small enough?

The Hoeffding Inequality (1.6) addresses the first question only. The second question is answered after we run the learning algorithm on the actual data and see how small we can get E_in to be.
Breaking down the feasibility of learning into these two questions provides further insight into the role that different components of the learning problem play. One such insight has to do with the 'complexity' of these components.

The complexity of H. If the number of hypotheses M goes up, we run more risk that E_in(g) will be a poor estimator of E_out(g) according to Inequality (1.6). M can be thought of as a measure of the 'complexity' of the hypothesis set H that we use. If we want an affirmative answer to the first question, we need to keep the complexity of H in check. However, if we want an affirmative answer to the second question, we stand a better chance if H is more complex, since g has to come from H. So, a more complex H gives us more flexibility in finding some g that fits the data well, leading to small E_in(g). This tradeoff in the complexity of H is a major theme in learning theory that we will study in detail in Chapter 2.
The complexity of f. Intuitively, a complex target function f should be harder to learn than a simple f. Let us examine if this can be inferred from the two questions above. A close look at Inequality (1.6) reveals that the complexity of f does not affect how well E_in(g) approximates E_out(g). If we fix the hypothesis set and the number of training examples, the inequality provides the same bound whether we are trying to learn a simple f (for instance a constant function) or a complex f (for instance a highly nonlinear function). However, this doesn't mean that we can learn complex functions as easily as we learn simple functions. Remember that (1.6) affects the first question only. If the target function is complex, the second question comes into play since the data from a complex f are harder to fit than the data from a simple f. This means that we will get a worse value for E_in(g) when f is complex. We might try to get around that by making our hypothesis set more complex so that we can fit the data better and get a lower E_in(g), but then E_out won't be as close to E_in per (1.6). Either way we look at it, a complex f is harder to learn as we expected. In the extreme case, if f is too complex, we may not be able to learn it at all.

Fortunately, most target functions in real life are not too complex; we can learn them from a reasonable D using a reasonable H. This is obviously a practical observation, not a mathematical statement. Even when we cannot learn a particular f, we will at least be able to tell that we can't. As long as we make sure that the complexity of H gives us a good Hoeffding bound, our success or failure in learning f can be determined by our success or failure in fitting the training data.
1.4 Error and Noise

We close this chapter by revisiting two notions in the learning problem in order to bring them closer to the real world. The first notion is what approximation means when we say that our hypothesis approximates the target function well. The second notion is about the nature of the target function. In many situations, there is noise that makes the output of f not uniquely determined by the input. What are the ramifications of having such a 'noisy' target on the learning problem?
1.4.1 Error Measures

Learning is not expected to replicate the target function perfectly. The final hypothesis g is only an approximation of f. To quantify how well g approximates f, we need to define an error measure (also called an error function in the literature; the error is sometimes referred to as cost, objective, or risk) that quantifies how far we are from the target.

The choice of an error measure affects the outcome of the learning process. Different error measures may lead to different choices of the final hypothesis, even if the target and the data are the same, since the value of a particular error measure may be small while the value of another error measure in the same situation is large. Therefore, which error measure we use has consequences for what we learn. What are the criteria for choosing one error measure over another? We address this question here.

First, let's formalize this notion a bit. An error measure quantifies how well each hypothesis h in the model approximates the target function f,

    Error = E(h, f).

While E(h, f) is based on the entirety of h and f, it is almost universally defined based on the errors on individual input points x. If we define a pointwise error measure e(h(x), f(x)), the overall error will be the average value of this pointwise error. So far, we have been working with the classification error e(h(x), f(x)) = [[h(x) ≠ f(x)]].

In an ideal world, E(h, f) should be user-specified. The same learning task in different contexts may warrant the use of different error measures. One may view E(h, f) as the 'cost' of using h when you should use f. This cost depends on what h is used for, and cannot be dictated just by our learning techniques. Here is a case in point.

Example 1.1 (Fingerprint verification). Consider the problem of verifying that a fingerprint belongs to a particular person. What is the appropriate error measure?
The target function takes as input a fingerprint, and returns +1 if it belongs to the right person, and −1 if it belongs to an intruder.
There are two types of error that our hypothesis h can make here. If the correct person is rejected (h = −1 but f = +1), it is called false reject, and if an incorrect person is accepted (h = +1 but f = −1), it is called false accept. The two types of error can be displayed in a table:

                            f
                     +1             −1
    h   +1       no error      false accept
        −1     false reject      no error

How should the error measure be defined in this problem? If the right person is accepted or an intruder is rejected, the error is clearly zero. We need to specify the error values for a false accept and for a false reject. The right values depend on the application.
Consider two potential clients of this fingerprint system. One is a supermarket who will use it at the checkout counter to verify that you are a member of a discount program. The other is the CIA who will use it at the entrance to a secure facility to verify that you are authorized to enter that facility.

For the supermarket, a false reject is costly because if a customer gets wrongly rejected, she may be discouraged from patronizing the supermarket in the future. All future revenue from this annoyed customer is lost. On the other hand, the cost of a false accept is minor. You just gave away a discount to someone who didn't deserve it, and that person left their fingerprint in your system - they must be bold indeed.

For the CIA, a false accept is a disaster. An unauthorized person will gain access to a highly sensitive facility. This should be reflected in a much higher cost for the false accept. False rejects, on the other hand, can be tolerated since authorized persons are employees (rather than customers as with the supermarket). The inconvenience of retrying when rejected is just part of the job, and they must deal with it.
The costs of the different types of errors can be tabulated in a matrix. For our examples, the matrices might look like:

                   f                              f
              +1      −1                     +1       −1
    h  +1      0       1           h  +1     0      1000
       −1     10       0              −1     1         0

         Supermarket                        CIA

These matrices should be used to weight the different types of errors when we compute the total error. When the learning algorithm minimizes a cost-weighted error measure, it automatically takes into consideration the utility of the hypothesis that it will produce. In the supermarket and CIA scenarios, this could lead to two completely different final hypotheses. □
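As an illustration of how a risk matrix like the ones above enters the computation (anticipating Problem 1.11), here is a small sketch that evaluates a cost-weighted in-sample error on toy data; the data and the helper name are ours, only the two matrices come from the example.

    # Each cost matrix maps (h(x), f(x)) pairs, in the order (+1, -1), to a cost.
    SUPERMARKET = {(+1, +1): 0, (+1, -1): 1,    (-1, +1): 10, (-1, -1): 0}
    CIA         = {(+1, +1): 0, (+1, -1): 1000, (-1, +1): 1,  (-1, -1): 0}

    def weighted_in_sample_error(h_values, y_values, cost):
        """Average cost of h's decisions on the data, weighted by the risk matrix."""
        total = sum(cost[(h, y)] for h, y in zip(h_values, y_values))
        return total / len(y_values)

    # Toy predictions and labels, just to exercise the function.
    h_values = [+1, +1, -1, -1, +1]
    y_values = [+1, -1, -1, +1, +1]
    print(weighted_in_sample_error(h_values, y_values, SUPERMARKET))  # the false reject costs 10 here
    print(weighted_in_sample_error(h_values, y_values, CIA))          # the false accept costs 1000 here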
The moral of this example is that the choice of the error measure depends on how the system is going to be used, rather than on any inherent criterion that we can independently determine during the learning process. However, this ideal choice may not be possible in practice for two reasons. One is that the user may not provide an error specification, which is not uncommon. The other is that the weighted cost may be a difficult objective function for optimizers to work with. Therefore, we often look for other ways to define the error measure, sometimes with purely practical or analytic considerations in mind. We have already seen an example of this with the simple binary error used in this chapter, and we will see other error measures in later chapters.

Figure 1.11: The general (supervised) learning problem
1.4.2 Noisy Targets

In many practical applications, the data we learn from are not generated by a deterministic target function. Instead, they are generated in a noisy way such that the output is not uniquely determined by the input. For instance, in the credit-card example we presented in Section 1.1, two customers may have identical salaries, outstanding loans, etc., but end up with different credit behavior. Therefore, the credit 'function' is not really a deterministic function, but a noisy one.

This situation can be readily modeled within the same framework that we have. Instead of y = f(x), we can take the output y to be a random variable that is affected by, rather than determined by, the input x. Formally, we have a target distribution P(y | x) instead of a target function y = f(x). A data point (x, y) is now generated by the joint distribution P(x, y) = P(x)P(y | x).

One can think of a noisy target as a deterministic target plus added noise. If y is real-valued for example, one can take the expected value of y given x to be the deterministic f(x), and consider y − f(x) as pure noise that is added to f.

This view suggests that a deterministic target function can be considered a special case of a noisy target, just with zero noise. Indeed, we can formally express any function f as a distribution P(y | x) by choosing P(y | x) to be zero for all y except y = f(x). Therefore, there is no loss of generality if we consider the target to be a distribution rather than a function. Figure 1.11 modifies the previous Figures 1.2 and 1.9 to illustrate the general learning problem, covering both deterministic and noisy targets.
Exercise 1.13
Consider the bin model for a hypothesis h that makes an error with probability µ in approximating a deterministic target function f (both h and f are binary functions). If we use the same h to approximate a noisy version of f given by

    P(y | x) = λ        if y = f(x),
               1 − λ    if y ≠ f(x).

(a) What is the probability of error that h makes in approximating y?
(b) At what value of λ will the performance of h be independent of µ?
[Hint: The noisy target will look completely random.]
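The following Monte Carlo sketch mirrors the setup of Exercise 1.13 so you can sanity-check your answer to part (a); the particular values of µ and λ are free parameters here, and the estimator is ours, not part of the exercise.

    import random

    def error_rate(mu, lam, trials=100_000, seed=0):
        """Estimate P[h(x) != y] when h disagrees with f w.p. mu and y flips f w.p. 1 - lam."""
        rng = random.Random(seed)
        errors = 0
        for _ in range(trials):
            h_wrong = rng.random() < mu          # event h(x) != f(x)
            y_flipped = rng.random() >= lam      # event y != f(x), probability 1 - lam
            if h_wrong != y_flipped:             # h(x) != y exactly when one (not both) occurred
                errors += 1
        return errors / trials

    print(error_rate(mu=0.3, lam=0.8))           # compare with your closed-form answer to (a)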
There is a difference between the role of P(y | x) and the role of P(x) in the learning problem. While both distributions model probabilistic aspects of x and y, the target distribution P(y | x) is what we are trying to learn, while the input distribution P(x) only quantifies the relative importance of the point x in gauging how well we have learned.

Our entire analysis of the feasibility of learning applies to noisy target functions as well. Intuitively, this is because the Hoeffding Inequality (1.6) applies to an arbitrary, unknown target function. Assume we randomly picked all the y's according to the distribution P(y | x) over the entire input space X. This realization of P(y | x) is effectively a target function. Therefore, the inequality will be valid no matter which particular random realization the 'target function' happens to be.

This does not mean that learning a noisy target is as easy as learning a deterministic one. Remember the two questions of learning? With the same learning model, E_out may be as close to E_in in the noisy case as it is in the deterministic case, but E_in itself will likely be worse in the noisy case since it is hard to fit the noise.

In Chapter 2, where we prove a stronger version of (1.6), we will assume the target to be a probability distribution P(y | x), thus covering the general case.
1.5 Problems
Problem 1.1 We have 2 opaque bags, each containing 2 balls. One bag has 2 black balls and the other has a black and a white ball. You pick a bag at random and then pick one of the balls in that bag at random. When you look at the ball it is black. You now pick the second ball from that same bag. What is the probability that this ball is also black? [Hint: Use Bayes' Theorem: P[A and B] = P[A | B] P[B] = P[B | A] P[A].]
Problem 1.2 Consider the perceptron in two dimensions: h(x) = sign(wᵀx) where w = [w_0, w_1, w_2]ᵀ and x = [1, x_1, x_2]ᵀ. Technically, x has three coordinates, but we call this perceptron two-dimensional because the first coordinate is fixed at 1.
(a) Show that the regions on the plane where h(x) = +1 and h(x) = −1 are separated by a line. If we express this line by the equation x_2 = a x_1 + b, what are the slope a and intercept b in terms of w_0, w_1, w_2?
(b) Draw a picture for the cases w = [1, 2, 3]ᵀ and w = −[1, 2, 3]ᵀ.
In more than two dimensions, the +1 and −1 regions are separated by a hyperplane, the generalization of a line.
Problem 1.3 Prove that the PLA eventually converges to a linear separator for separable data. The following steps will guide you through the proof. Let w* be an optimal set of weights (one which separates the data). The essential idea in this proof is to show that the PLA weights w(t) get "more aligned" with w* with every iteration. For simplicity, assume that w(0) = 0.
(a) Let ρ = min_{1≤n≤N} y_n (w*ᵀ x_n). Show that ρ > 0.
(b) Show that wᵀ(t)w* ≥ wᵀ(t−1)w* + ρ, and conclude that wᵀ(t)w* ≥ tρ.
    [Hint: Use induction.]
(c) Show that ‖w(t)‖² ≤ ‖w(t−1)‖² + ‖x(t−1)‖².
    [Hint: y(t−1)·(wᵀ(t−1)x(t−1)) ≤ 0 because x(t−1) was misclassified by w(t−1).]
(d) Show by induction that ‖w(t)‖² ≤ tR², where R = max_{1≤n≤N} ‖x_n‖.
(e) Using (b) and (d), show that

        wᵀ(t)w* / ‖w(t)‖ ≥ √t · ρ/R,

    and hence prove that

        t ≤ R²‖w*‖² / ρ².

    [Hint: wᵀ(t)w* / (‖w(t)‖ ‖w*‖) ≤ 1. Why?]
In practice, PLA converges more quickly than the bound R²‖w*‖²/ρ² suggests. Nevertheless, because we do not know ρ in advance, we can't determine the number of iterations to convergence, which does pose a problem if the data is non-separable.
Problem 1.4 In Exercise 1.4, we use an artificial data set to study the perceptron learning algorithm. This problem leads you to explore the algorithm further with data sets of different sizes and dimensions. (A minimal implementation sketch follows the problem statement.)
(a) Generate a linearly separable data set of size 20 as indicated in Exercise 1.4. Plot the examples {(x_n, y_n)} as well as the target function f on a plane. Be sure to mark the examples from different classes differently, and add labels to the axes of the plot.
(b) Run the perceptron learning algorithm on the data set above. Report the number of updates that the algorithm takes before converging. Plot the examples {(x_n, y_n)}, the target function f, and the final hypothesis g in the same figure. Comment on whether f is close to g.
(c) Repeat everything in (b) with another randomly generated data set of size 20. Compare your results with (b).
(d) Repeat everything in (b) with another randomly generated data set of size 100. Compare your results with (b).
(e) Repeat everything in (b) with another randomly generated data set of size 1,000. Compare your results with (b).
(f) Modify the algorithm such that it takes x_n ∈ R^10 instead of R². Randomly generate a linearly separable data set of size 1,000 with x_n ∈ R^10 and feed the data set to the algorithm. How many updates does the algorithm take to converge?
(g) Repeat the algorithm on the same data set as (f) for 100 experiments. In the iterations of each experiment, pick x(t) randomly instead of deterministically. Plot a histogram for the number of updates that the algorithm takes to converge.
(h) Summarize your conclusions with respect to accuracy and running time as a function of N and d.
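For readers attempting Problem 1.4, here is one possible minimal PLA sketch in Python with NumPy; the target generation, the dimension, and the data set size are our own choices for illustration, and Exercise 1.4 (referenced by the problem) specifies the actual data-generation procedure.

    import numpy as np

    def generate_data(N, d, rng):
        """Random target weights and a linearly separable data set of size N in d dimensions."""
        w_target = rng.normal(size=d + 1)                       # includes the bias coordinate
        X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, d))])
        y = np.sign(X @ w_target)
        y[y == 0] = 1                                           # avoid points exactly on the boundary
        return X, y

    def pla(X, y, max_updates=100_000):
        """Perceptron learning algorithm: update on a misclassified point until none remain."""
        w = np.zeros(X.shape[1])
        for updates in range(max_updates):
            misclassified = np.where(np.sign(X @ w) != y)[0]
            if len(misclassified) == 0:
                return w, updates
            n = misclassified[0]                                # deterministic pick; part (g) asks for random
            w = w + y[n] * X[n]
        return w, max_updates

    rng = np.random.default_rng(1)
    X, y = generate_data(20, 2, rng)
    w, updates = pla(X, y)
    print("updates to converge:", updates)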
Problem 1.5 The perceptron learning algorithm works like this: In each iteration t, pick a random (x(t), y(t)) and compute the 'signal' s(t) = wᵀ(t)x(t). If y(t)·s(t) ≤ 0, update w by

    w(t + 1) ← w(t) + y(t)·x(t).

One may argue that this algorithm does not take the 'closeness' between s(t) and y(t) into consideration. Let's look at another perceptron learning algorithm: In each iteration, pick a random (x(t), y(t)) and compute s(t). If y(t)·s(t) ≤ 1, update w by

    w(t + 1) ← w(t) + η·(y(t) − s(t))·x(t),

where η is a constant. That is, if s(t) agrees with y(t) well (their product is > 1), the algorithm does nothing. On the other hand, if s(t) is further from y(t), the algorithm changes w(t) more. In this problem, you are asked to implement this algorithm and study its performance. (A sketch of the update loop appears after the problem.)
(a) Generate a training data set of size 100 similar to that used in Exercise 1.4. Generate a test data set of size 10,000 from the same process. To get g, run the algorithm above with η = 100 on the training data set, until a maximum of 1,000 updates has been reached. Plot the training data set, the target function f, and the final hypothesis g on the same figure. Report the error on the test set.
(b) Use the data set in (a) and redo everything with η = 1.
(c) Use the data set in (a) and redo everything with η = 0.01.
(d) Use the data set in (a) and redo everything with η = 0.0001.
(e) Compare the results that you get from (a) to (d).
The algorithm above is a variant of the so-called Adaline (Adaptive Linear Neuron) algorithm for perceptron learning.
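The update rule of Problem 1.5 can be written in a few lines; this sketch is only meant to show the loop structure, not the full experiment, and the commented usage assumes the hypothetical generate_data helper from the sketch after Problem 1.4.

    import numpy as np

    def adaline_variant(X, y, eta, max_updates=1000, seed=0):
        """Run the modified perceptron rule: update whenever y(t)*s(t) <= 1."""
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        updates = 0
        for _ in range(50 * max_updates):        # iteration cap so the loop always terminates
            t = rng.integers(len(y))             # pick a random example
            s = w @ X[t]                         # the 'signal' s(t)
            if y[t] * s <= 1:
                w = w + eta * (y[t] - s) * X[t]
                updates += 1
                if updates >= max_updates:
                    break
        return w

    # Example usage (assumes generate_data from the previous sketch is in scope):
    # X, y = generate_data(100, 2, np.random.default_rng(2))
    # g = adaline_variant(X, y, eta=0.01)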
Problem 1.6 Consider a sample of 10 marbles drawn independently from a bin that holds red and green marbles. The probability of a red marble is µ. For µ = 0.05, µ = 0.5, and µ = 0.8, compute the probability of getting no red marbles (ν = 0) in the following cases.
(a) We draw only one such sample. Compute the probability that ν = 0.
(b) We draw 1,000 independent samples. Compute the probability that (at least) one of the samples has ν = 0.
(c) Repeat (b) for 1,000,000 independent samples.
Problem 1.7 A sample of heads and tails is created by tossing a coin a number of times independently. Assume we have a number of coins that generate different samples independently. For a given coin, let the probability of heads (probability of error) be µ. The probability of obtaining k heads in N tosses of this coin is given by the binomial distribution:

    P[k | N, µ] = (N choose k) µ^k (1 − µ)^{N−k}.

Remember that the training error ν is k/N.
(a) Assume the sample size (N) is 10. If all the coins have µ = 0.05, compute the probability that at least one coin will have ν = 0 for the case of 1 coin, 1,000 coins, 1,000,000 coins. Repeat for µ = 0.8.
(b) For the case N = 6 and 2 coins with µ = 0.5 for both coins, plot the probability

    P[ max_i |ν_i − µ_i| > ε ]

for ε in the range [0, 1] (the max is over coins). On the same plot show the bound that would be obtained using the Hoeffding Inequality. Remember that for a single coin, the Hoeffding bound is

    P[|ν − µ| > ε] ≤ 2e^{−2ε²N}.

[Hint: Use P[A or B] = P[A] + P[B] − P[A and B] = P[A] + P[B] − P[A]P[B], where the last equality follows by independence, to evaluate P[max ...].]
Problem 1.8 The Hoeffding Inequality is one form of the law of large numbers. One of the simplest forms of that law is the Chebyshev Inequality, which you will prove here.
(a) If t is a non-negative random variable, prove that for any α > 0, P[t ≥ α] ≤ E(t)/α.
(b) If u is any random variable with mean µ and variance σ², prove that for any α > 0, P[(u − µ)² ≥ α] ≤ σ²/α. [Hint: Use (a).]
(c) If u_1, ..., u_N are iid random variables, each with mean µ and variance σ², and u = (1/N) Σ_{n=1}^{N} u_n, prove that for any α > 0,

    P[(u − µ)² ≥ α] ≤ σ²/(Nα).

Notice that the RHS of this Chebyshev Inequality goes down linearly in N, while the counterpart in Hoeffding's Inequality goes down exponentially. In Problem 1.9, we develop an exponential bound using a similar approach.
Problem 1.9 In this problem, we derive a form of the law of large numbers that has an exponential bound, called the Chernoff bound. We focus on the simple case of flipping a fair coin, and use an approach similar to Problem 1.8.
(a) Let t be a (finite) random variable, α be a positive constant, and s be a positive parameter. If T(s) = E_t(e^{st}), prove that

    P[t ≥ α] ≤ e^{−sα} T(s).

[Hint: e^{st} is monotonically increasing in t.]
(b) Let u_1, ..., u_N be iid random variables, and let u = (1/N) Σ_{n=1}^{N} u_n. If U(s) = E_{u_n}(e^{s u_n}) (for any n), prove that

    P[u ≥ α] ≤ (e^{−sα} U(s))^N.

(c) Suppose P[u_n = 0] = P[u_n = 1] = 1/2 (fair coin). Evaluate U(s) as a function of s, and minimize e^{−sα} U(s) with respect to s for fixed α, 0 < α < 1.
(d) Conclude in (c) that, for 0 < ε < 1/2,

    P[u ≥ E(u) + ε] ≤ 2^{−βN},

where β = 1 + (1/2 + ε) log₂(1/2 + ε) + (1/2 − ε) log₂(1/2 − ε) and E(u) = 1/2. Show that β > 0, hence the bound is exponentially decreasing in N.
Problem 1.10 Assume that X = {x_1, x_2, ..., x_N, x_{N+1}, ..., x_{N+M}} and Y = {−1, +1} with an unknown target function f: X → Y. The training data set D is (x_1, y_1), ..., (x_N, y_N). Define the off-training-set error of a hypothesis h with respect to f by

    E_off(h, f) = (1/M) Σ_{m=1}^{M} [[h(x_{N+m}) ≠ f(x_{N+m})]].

(a) Say f(x) = +1 for all x and

    h(x) = +1,  for x = x_k and k is odd and 1 ≤ k ≤ M + N,
           −1,  otherwise.

What is E_off(h, f)?
(b) We say that a target function f can 'generate' D in a noiseless setting if y_n = f(x_n) for all (x_n, y_n) ∈ D. For a fixed D of size N, how many possible f: X → Y can generate D in a noiseless setting?
(c) For a given hypothesis h and an integer k between 0 and M, how many of those f in (b) satisfy E_off(h, f) = k/M?
(d) For a given hypothesis h, if all those f that generate D in a noiseless setting are equally likely in probability, what is the expected off-training-set error E_f[E_off(h, f)]?
(e) A deterministic algorithm A is defined as a procedure that takes D as an input, and outputs a hypothesis h = A(D). Argue that for any two deterministic algorithms A_1 and A_2,

    E_f[E_off(A_1(D), f)] = E_f[E_off(A_2(D), f)].

You have now proved that in a noiseless setting, for a fixed D, if all possible f are equally likely, any two deterministic algorithms are equivalent in terms of the expected off-training-set error. Similar results can be proved for more general settings.
Problem 1.11 The matrix which tabulates the cost of various errors for the CIA and Supermarket applications in Example 1.1 is called a risk or loss matrix.
For the two risk matrices in Example 1.1, explicitly write down the in-sample error E_in that one should minimize to obtain g. This in-sample error should weight the different types of errors based on the risk matrix. [Hint: Consider y_n = +1 and y_n = −1 separately.]
Problem 1.12 This problem investigates how changing the error measure can change the result of the learning process. You have N data points y_1 ≤ ... ≤ y_N and wish to estimate a 'representative' value.
(a) If your algorithm is to find the hypothesis h that minimizes the in-sample sum of squared deviations,

    E_in(h) = Σ_{n=1}^{N} (h − y_n)²,

then show that your estimate will be the in-sample mean,

    h_mean = (1/N) Σ_{n=1}^{N} y_n.

(b) If your algorithm is to find the hypothesis h that minimizes the in-sample sum of absolute deviations,

    E_in(h) = Σ_{n=1}^{N} |h − y_n|,

then show that your estimate will be the in-sample median h_med, which is any value for which half the data points are at most h_med and half the data points are at least h_med.
(c) Suppose y_N is perturbed to y_N + ε, where ε → ∞. So, the single data point y_N becomes an outlier. What happens to your two estimators h_mean and h_med?
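A quick numerical illustration of part (c) of Problem 1.12, with data values of our own choosing, shows how the two estimators react to a single outlier.

    import statistics

    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(statistics.mean(data), statistics.median(data))        # both sit near the center

    data_outlier = data[:-1] + [5.0 + 1e6]                       # perturb the largest point
    print(statistics.mean(data_outlier), statistics.median(data_outlier))
    # the mean is dragged away by the outlier; the median barely moves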
Chapter 2
Training versus Testing
Before the final exam, a professor may hand out some practice problems and solutions to the class. Although these problems are not the exact ones that will appear on the exam, studying them will help you do better. They are the 'training set' in your learning.

If the professor's goal is to help you do better in the exam, why not give out the exam problems themselves? Well, nice try ☺. Doing well in the exam is not the goal in and of itself. The goal is for you to learn the course material. The exam is merely a way to gauge how well you have learned the material. If the exam problems are known ahead of time, your performance on them will no longer accurately gauge how well you have learned.

The same distinction between training and testing happens in learning from data. In this chapter, we will develop a mathematical theory that characterizes this distinction. We will also discuss the conceptual and practical implications of the contrast between training and testing.
2.1 Theory of Generalization
The out-of-sample error E_out measures how well our training on D has generalized to data that we have not seen before. E_out is based on the performance over the entire input space X. Intuitively, if we want to estimate the value of E_out using a sample of data points, these points must be 'fresh' test points that have not been used for training, similar to the questions on the final exam that have not been used for practice.

The in-sample error E_in, by contrast, is based on data points that have been used for training. It expressly measures training performance, similar to your performance on the practice problems that you got before the final exam. Such performance has the benefit of looking at the solutions and adjusting accordingly, and may not reflect the ultimate performance in a real test. We began the analysis of in-sample error in Chapter 1, and we will extend this analysis to the general case in this chapter. We will also make the contrast between a training set and a test set more precise.
A word of warning: this chapter is the heaviest in this book in terms of mathematical abstraction. To make it easier on the not-so-mathematically inclined, we will tell you which part you can safely skip without 'losing the plot'. The mathematical results provide fundamental insights into learning from data, and we will interpret these results in practical terms.

Generalization error. We have already discussed how the value of E_in does not always generalize to a similar value of E_out. Generalization is a key issue in learning. One can define the generalization error as the discrepancy between E_in and E_out. (Sometimes 'generalization error' is used as another name for E_out, but not in this book.) The Hoeffding Inequality (1.6) provides a way to characterize the generalization error with a probabilistic bound,

    P[|E_in(g) − E_out(g)| > ε] ≤ 2Me^{−2ε²N}

for any ε > 0. This can be rephrased as follows. Pick a tolerance level δ, for example δ = 0.05, and assert with probability at least 1 − δ that
    E_out(g) ≤ E_in(g) + sqrt( (1/(2N)) ln(2M/δ) ).    (2.1)

We refer to the type of inequality in (2.1) as a generalization bound because it bounds E_out in terms of E_in. To see that the Hoeffding Inequality implies this generalization bound, we rewrite (1.6) as follows: with probability at least 1 − 2Me^{−2ε²N}, |E_out − E_in| ≤ ε, which implies E_out ≤ E_in + ε. We may now identify δ = 2Me^{−2ε²N}, from which ε = sqrt( (1/(2N)) ln(2M/δ) ), and (2.1) follows.
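The error bar in (2.1) is easy to tabulate; the following lines, with a tolerance level and hypothesis-set size chosen arbitrarily by us, show how the bar shrinks as N grows.

    from math import log, sqrt

    delta, M = 0.05, 1000          # tolerance level and (finite) number of hypotheses, our choices
    for N in (100, 1_000, 10_000, 100_000):
        error_bar = sqrt(log(2 * M / delta) / (2 * N))
        print(f"N={N:>7}  error bar={error_bar:.4f}")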
Notice that the other side of |E_out − E_in| ≤ ε also holds, that is, E_out ≥ E_in − ε for all h ∈ H. This is important for learning, but in a more subtle way. Not only do we want to know that the hypothesis g that we choose (say the one with the best training error) will continue to do well out of sample (i.e., E_out ≤ E_in + ε), but we also want to be sure that we did the best we could with our H (no other hypothesis h ∈ H has E_out(h) significantly better than E_out(g)). The E_out(h) ≥ E_in(h) − ε direction of the bound assures us that we couldn't do much better, because every hypothesis with a higher E_in than the g we have chosen will have a comparably higher E_out.
The error bound sqrt( (1/(2N)) ln(2M/δ) ) in (2.1), or 'error bar' if you will, depends on M, the size of the hypothesis set H. If H is an infinite set, the bound goes to infinity and becomes meaningless. Unfortunately, almost all interesting learning models have infinite H, including the simple perceptron which we discussed in Chapter 1.

In order to study generalization in such models, we need to derive a counterpart to (2.1) that deals with infinite H. We would like to replace M with
something finite, so that the bound is meaningful. To do this, we notice that the way we got the M factor in the first place was by taking the disjunction of events:

    "|E_in(h_1) − E_out(h_1)| > ε" or
    "|E_in(h_2) − E_out(h_2)| > ε" or
        ...
    "|E_in(h_M) − E_out(h_M)| > ε",    (2.2)

which is guaranteed to include the event "|E_in(g) − E_out(g)| > ε" since g is always one of the hypotheses in H. We then over-estimated the probability using the union bound. Let B_m be the (Bad) event that "|E_in(h_m) − E_out(h_m)| > ε". Then,

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].
If the events B_1, B_2, ..., B_M are strongly overlapping, the union bound becomes particularly loose, as illustrated in the figure to the right for an example with 3 hypotheses; the areas of different events correspond to their probabilities. The union bound says that the total area covered by B_1, B_2, or B_3 is smaller than the sum of the individual areas, which is true but is a gross overestimate when the areas overlap heavily as in this example. The events "|E_in(h_m) − E_out(h_m)| > ε", m = 1, ..., M, are often strongly overlapping. If h_1 is very similar to h_2 for instance, the two events "|E_in(h_1) − E_out(h_1)| > ε" and "|E_in(h_2) − E_out(h_2)| > ε" are likely to coincide for most data sets. In a typical learning model, many hypotheses are indeed very similar. If you take the perceptron model for instance, as you slowly vary the weight vector w, you get infinitely many hypotheses that differ from each other only infinitesimally.

The mathematical theory of generalization hinges on this observation. Once we properly account for the overlaps of the different hypotheses, we will be able to replace the number of hypotheses M in (2.1) by an effective number which is finite even when M is infinite, and establish a more useful condition under which E_out is close to E_in.
2.1.1 Effective Number of Hypotheses
We now introduce the growth function, the quantity that will formalize the effective number of hypotheses. The growth function is what will replace M in the generalization bound (2.1). It is a combinatorial quantity that captures how different the hypotheses in H are, and hence how much overlap the different events in (2.2) have.
We will start by defining the growth function and studying its basic properties. Next, we will show how we can bound the value of the growth function. Finally, we will show that we can replace M in the generalization bound with the growth function. These three steps will yield the generalization bound that we need, which applies to infinite H. We will focus on binary target functions for the purpose of this analysis, so each h ∈ H maps X to {−1, +1}.

The definition of the growth function is based on the number of different hypotheses that H can implement, but only over a finite sample of points rather than over the entire input space X. If h ∈ H is applied to a finite sample x_1, ..., x_N ∈ X, we get an N-tuple h(x_1), ..., h(x_N) of ±1's. Such an N-tuple is called a dichotomy since it splits x_1, ..., x_N into two groups: those points for which h is −1 and those for which h is +1. Each h ∈ H generates a dichotomy on x_1, ..., x_N, but two different h's may generate the same dichotomy if they happen to give the same pattern of ±1's on this particular sample.
Definition 2.1. Let x_1, ..., x_N ∈ X. The dichotomies generated by H on these points are defined by

    H(x_1, ..., x_N) = { (h(x_1), ..., h(x_N)) | h ∈ H }.    (2.3)

One can think of the dichotomies H(x_1, ..., x_N) as a set of hypotheses just like H is, except that the hypotheses are seen through the eyes of N points only. A larger H(x_1, ..., x_N) means H is more 'diverse' - generating more dichotomies on x_1, ..., x_N. The growth function is based on the number of dichotomies.

Definition 2.2. The growth function is defined for a hypothesis set H by

    m_H(N) = max_{x_1,...,x_N ∈ X} |H(x_1, ..., x_N)|,

where | · | denotes the cardinality (number of elements) of a set.
In words, m_H(N) is the maximum number of dichotomies that can be generated by H on any N points. To compute m_H(N), we consider all possible choices of N points x_1, ..., x_N from X and pick the one that gives us the most dichotomies. Like M, m_H(N) is a measure of the number of hypotheses in H, except that a hypothesis is now considered on N points instead of the entire X. For any H, since H(x_1, ..., x_N) ⊆ {−1, +1}^N (the set of all possible dichotomies on any N points), the value of m_H(N) is at most |{−1, +1}^N|, hence

    m_H(N) ≤ 2^N.

If H is capable of generating all possible dichotomies on x_1, ..., x_N, then H(x_1, ..., x_N) = {−1, +1}^N and we say that H can shatter x_1, ..., x_N. This signifies that H is as diverse as can be on this particular sample.
(a) (b) (c)
Figure 2.1: Illustration of the growth function for a two-dimensional perceptron. The dichotomy of red versus blue on the 3 collinear points in part (a) cannot be generated by a perceptron, but all 8 dichotomies on the 3 points in part (b) can. By contrast, the dichotomy of red versus blue on the 4 points in part (c) cannot be generated by a perceptron. At most 14 out of the possible 16 dichotomies on any 4 points can be generated.

Example 2.1. If X is the Euclidean plane and H is a two-dimensional perceptron, what are m_H(3) and m_H(4)? Figure 2.1(a) shows a dichotomy on 3 points that the perceptron cannot generate, while Figure 2.1(b) shows another 3 points that the perceptron can shatter, generating all 2³ = 8 dichotomies. Because the definition of m_H(N) is based on the maximum number of dichotomies, m_H(3) = 8 in spite of the case in Figure 2.1(a).

In the case of 4 points, Figure 2.1(c) shows a dichotomy that the perceptron cannot generate. One can verify that there are no 4 points that the perceptron can shatter. The most a perceptron can do on any 4 points is 14 dichotomies out of the possible 16, where the 2 missing dichotomies are as depicted in Figure 2.1(c) with blue and red corresponding to −1, +1 or to +1, −1. Hence, m_H(4) = 14. □
Let us now illustrate how to compute m_H(N) for some simple hypothesis sets. These examples will confirm the intuition that m_H(N) grows faster when the hypothesis set H becomes more complex. This is what we expect of a quantity that is meant to replace M in the generalization bound (2.1).

Example 2.2. Let us find a formula for m_H(N) in each of the following cases.

1. Positive rays: H consists of all hypotheses h: R → {−1, +1} of the form h(x) = sign(x − a), i.e., the hypotheses are defined in a one-dimensional input space, and they return −1 to the left of some value a and +1 to the right of a.

   [Figure: the real line with points x_1, x_2, ..., x_N; h(x) = −1 to the left of the threshold a and h(x) = +1 to the right.]
To compute m_H(N), we notice that given N points, the line is split by the points into N + 1 regions. The dichotomy we get on the N points is decided by which region contains the value a. As we vary a, we will get N + 1 different dichotomies. Since this is the most we can get for any N points, the growth function is

    m_H(N) = N + 1.

Notice that if we picked N points where some of the points coincided (which is allowed), we will get fewer than N + 1 dichotomies. This does not affect the value of m_H(N) since it is defined based on the maximum number of dichotomies.

2. Positive intervals: H consists of all hypotheses in one dimension that return +1 within some interval and −1 otherwise. Each hypothesis is specified by the two end values of that interval.

   [Figure: the real line with points x_1, x_2, ..., x_N; h(x) = +1 inside the interval and h(x) = −1 outside it.]

To compute m_H(N), we notice that given N points, the line is again split by the points into N + 1 regions. The dichotomy we get is decided by which two regions contain the end values of the interval, resulting in (N+1 choose 2) different dichotomies. If both end values fall in the same region, the resulting hypothesis is the constant −1 regardless of which region it is. Adding up these possibilities, we get

    m_H(N) = (N+1 choose 2) + 1 = (1/2)N² + (1/2)N + 1.

Notice that m_H(N) grows as the square of N, faster than the linear m_H(N) of the 'simpler' positive ray case.
3. Convex sets: H consists of all hypotheses in two dimensions h: R² → {−1, +1} that are positive inside some convex set and negative elsewhere (a set is convex if the line segment connecting any two points in the set lies entirely within the set).

To compute m_H(N) in this case, we need to choose the N points carefully. Per the next figure, choose N points on the perimeter of a circle. Now consider any dichotomy on these points, assigning an arbitrary pattern of ±1's to the N points. If you connect the +1 points with a polygon, the hypothesis made up of the closed interior of the polygon (which has to be convex since its vertices are on the perimeter of a circle) agrees with the dichotomy on all N points. For the dichotomies that have less than three +1 points, the convex set will be a line segment, a point, or an empty set.
2 . T r a i n i n g v e r s u s T e s t i n g 2 . 1 . T h e o r y o f G e n e r a l i z a t i o n
This means that any dichotomy on these N points can be realized using a convex hypothesis, so H manages to shatter these points and the growth function has the maximum possible value

    m_H(N) = 2^N.

Notice that if the N points were chosen at random in the plane rather than on the perimeter of a circle, many of the points would be 'internal' and we wouldn't be able to shatter all the points with convex hypotheses as we did for the perimeter points. However, this doesn't matter as far as m_H(N) is concerned, since it is defined based on the maximum (2^N in this case). □
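The counts in Example 2.2 can be checked by brute force. The following sketch (ours, not part of the text; the function names are our own) enumerates the dichotomies that positive rays and positive intervals generate on N distinct points and compares the counts with the formulas N + 1 and (1/2)N² + (1/2)N + 1.

    import itertools

    def region_reps(points):
        # One representative value inside each of the N+1 regions cut out by the points.
        xs = sorted(points)
        reps = [xs[0] - 1.0]
        reps += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
        reps += [xs[-1] + 1.0]
        return reps

    def dichotomies_positive_rays(points):
        # h(x) = sign(x - a): -1 to the left of a, +1 to the right.
        return {tuple(1 if x > a else -1 for x in points) for a in region_reps(points)}

    def dichotomies_positive_intervals(points):
        # +1 strictly inside (l, r), -1 outside; include the constant -1 hypothesis.
        dichos = {tuple([-1] * len(points))}
        for l, r in itertools.combinations(region_reps(points), 2):
            dichos.add(tuple(1 if l < x < r else -1 for x in points))
        return dichos

    for N in range(1, 8):
        pts = list(range(N))
        rays = len(dichotomies_positive_rays(pts))
        ints = len(dichotomies_positive_intervals(pts))
        print(N, rays, N + 1, ints, N * (N + 1) // 2 + 1)   # counts match the formulas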
It is not practical to try to compute m_H(N) for every hypothesis set we use. Fortunately, we don't have to. Since m_H(N) is meant to replace M in (2.1), we can use an upper bound on m_H(N) instead of the exact value, and the inequality in (2.1) will still hold. Getting a good bound on m_H(N) will prove much easier than computing m_H(N) itself, thanks to the notion of a break point.
Definition 2.3. If no data set of size k can be shattered by H, then k is said to be a break point for H.

If k is a break point, then m_H(k) < 2^k. Example 2.1 shows that k = 4 is a break point for two-dimensional perceptrons. In general, it is easier to find a break point for H than to compute the full growth function for that H.
Exercise 2.1
By inspection, find a break point k for each hypothesis set in Example 2.2 (if there is one). Verify that m_H(k) < 2^k using the formulas derived in that Example.
We now use the break point k to derive a bound on the growth function m_H(N) for all values of N. For example, the fact that no 4 points can be shattered by
the two-dimensional perceptron puts a significant constraint on the number of dichotomies that can be realized by the perceptron on 5 or more points. We will exploit this idea to get a significant bound on m_H(N) in general.
2.1.2 Bounding the Growth Function
The most important fact about growth functions is that if the condition m_H(N) = 2^N breaks at any point, we can bound m_H(N) for all values of N by a simple polynomial based on this break point. The fact that the bound is polynomial is crucial. Absent a break point (as is the case in the convex hypothesis example), m_H(N) = 2^N for all N. If m_H(N) replaced M in Equation (2.1), the bound on the generalization error would not go to zero regardless of how many training examples N we have. However, if m_H(N) can be bounded by a polynomial (any polynomial), the generalization error will go to zero as N → ∞. This means that we will generalize well given a sufficient number of examples.
Begin safe skip: If you trust our math, you can skip the following part without compromising the logical sequence. A similar green box will tell you when to rejoin.
To prove the polynomial bound, we will introduce a combinatorial quantity that counts the maximum number of dichotomies given that there is a break point, without having to assume any particular form of H. This bound will therefore apply to any H.
Definition 2.4. B(N, k) is the maximum number of dichotomies on N points such that no subset of size k of the N points can be shattered by these dichotomies.

The definition of B(N, k) assumes a break point k, then tries to find the most dichotomies on N points without imposing any further restrictions. Since B(N, k) is defined as a maximum, it will serve as an upper bound for any m_H(N) that has a break point k:

    m_H(N) ≤ B(N, k)   if k is a break point for H.
The notation B comes from 'Binomial', and the reason will become clear shortly. To evaluate B(N, k), we start with the two boundary conditions k = 1 and N = 1:

    B(N, 1) = 1;
    B(1, k) = 2   for k > 1.
B(N, 1) = 1 for all N since, if no subset of size 1 can be shattered, then only one dichotomy can be allowed; a second, different dichotomy would have to differ on at least one point, and then that subset of size 1 would be shattered. B(1, k) = 2 for k > 1 since, in this case, there do not even exist subsets of size k; the constraint is vacuously true and we have 2 possible dichotomies (+1 and −1) on the one point.
We now assume N ≥ 2 and k ≥ 2 and try to develop a recursion. Consider the B(N, k) dichotomies in Definition 2.4, where no k points can be shattered. We list these dichotomies in the following table.
[Table: the B(N, k) dichotomies listed as rows of ±1 entries, one column per point x_1, x_2, ..., x_{N−1}, x_N; the rows are grouped into a set S_1 and a set S_2 = S_2^+ ∪ S_2^−, as described next, with the number of rows in each group indicated as α and β.]
where x_1, ..., x_N in the table are labels for the N points of the dichotomy. We have chosen a convenient order in which to list the dichotomies, as follows. Consider the dichotomies on x_1, ..., x_{N−1}. Some dichotomies on these N − 1 points appear only once (with either +1 or −1 in the x_N column, but not both). We collect these dichotomies in the set S_1. The remaining dichotomies on the first N − 1 points appear twice, once with +1 and once with −1 in the x_N column. We collect these dichotomies in the set S_2, which can be divided into two equal parts, S_2^+ and S_2^− (with +1 and −1 in the x_N column, respectively). Let S_1 have α rows, and let S_2^+ and S_2^− have β rows each. Since the total number of rows in the table is B(N, k) by construction, we have

    B(N, k) = α + 2β.   (2.4)

The total number of different dichotomies on the first N − 1 points is given by α + β; since S_2^+ and S_2^− are identical on these N − 1 points, their dichotomies are redundant. Since no subset of k of these first N − 1 points can
be shattered (since no k-subset of all N points can be shattered), we deduce that

    α + β ≤ B(N − 1, k)   (2.5)

by definition of B. Further, no subset of size k − 1 of the first N − 1 points can be shattered by the dichotomies in S_2^+. If there existed such a subset, then taking the corresponding set of dichotomies in S_2^− and adding x_N to the data points yields a subset of size k that is shattered, which we know cannot exist in this table by definition of B(N, k). Therefore,

    β ≤ B(N − 1, k − 1).   (2.6)

Substituting the two inequalities (2.5) and (2.6) into (2.4), we get

    B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1).   (2.7)
We can use (2.7) to recursively compute a bound on B(N, k), as shown in the following table.

                        k
              1    2    3    4    5    6   ...
        N = 1 1    2    2    2    2    2   ...
        N = 2 1    3    4    4    4    4   ...
        N = 3 1    4    7    8    8    8   ...
        N = 4 1    5   11    .
        N = 5 1    6    .
        N = 6 1    7    .

where the first row (N = 1) and the first column (k = 1) are the boundary conditions that we already calculated. We can also use the recursion to bound B(N, k) analytically.
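The recursion (2.7) and its boundary conditions translate directly into a small dynamic program. The sketch below (ours, not from the text) fills the table above and prints, next to each row, the binomial sums of Lemma 2.3 so the two can be compared; in fact equality holds (Problem 2.4).

    from math import comb

    def B_bound(N_max, k_max):
        # Upper bounds on B(N, k) from the boundary conditions and the
        # recursion (2.7): B(N, k) <= B(N-1, k) + B(N-1, k-1).
        B = [[0] * (k_max + 1) for _ in range(N_max + 1)]
        for N in range(1, N_max + 1):
            B[N][1] = 1                     # no subset of size 1 may be shattered
        for k in range(2, k_max + 1):
            B[1][k] = 2                     # constraint is vacuous on a single point
        for N in range(2, N_max + 1):
            for k in range(2, k_max + 1):
                B[N][k] = B[N - 1][k] + B[N - 1][k - 1]
        return B

    B = B_bound(6, 6)
    for N in range(1, 7):
        recursion_row = [B[N][k] for k in range(1, 7)]
        sauer_row = [sum(comb(N, i) for i in range(k)) for k in range(1, 7)]
        print(N, recursion_row, sauer_row)   # the two rows coincide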
Lemma 2.3 (Sauer's Lemma).

    B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i).

Proof. The statement is true whenever k = 1 or N = 1, by inspection. The proof is by induction on N. Assume the statement is true for all N ≤ N_0 and all k. We need to prove the statement for N = N_0 + 1 and all k. Since the statement is already true when k = 1 (for all values of N) by the initial condition, we only need to worry about k ≥ 2. By (2.7),

    B(N_0 + 1, k) ≤ B(N_0, k) + B(N_0, k − 1).
Applying the induction hypothesis to each term on the RHS, we get

    B(N_0 + 1, k) ≤ Σ_{i=0}^{k−1} (N_0 choose i) + Σ_{i=0}^{k−2} (N_0 choose i)
                  = 1 + Σ_{i=1}^{k−1} (N_0 choose i) + Σ_{i=1}^{k−1} (N_0 choose i−1)
                  = 1 + Σ_{i=1}^{k−1} [ (N_0 choose i) + (N_0 choose i−1) ]
                  = 1 + Σ_{i=1}^{k−1} (N_0+1 choose i)
                  = Σ_{i=0}^{k−1} (N_0+1 choose i),

where the combinatorial identity (N_0+1 choose i) = (N_0 choose i) + (N_0 choose i−1) has been used. This identity can be proved by noticing that to calculate the number of ways to pick i objects from N_0 + 1 distinct objects, either the first object is included, in (N_0 choose i−1) ways, or the first object is not included, in (N_0 choose i) ways. We have thus proved the induction step, so the statement is true for all N and k. ∎
It turns out that B(N, k) in fact equals Σ_{i=0}^{k−1} (N choose i) (Problem 2.4), but we only need the inequality of Lemma 2.3 to bound the growth function. For a given break point k, the bound Σ_{i=0}^{k−1} (N choose i) is polynomial in N, as each term in the sum is polynomial (of degree i ≤ k − 1). Since B(N, k) is an upper bound on any m_H(N) that has a break point k, we have proved
End safe skip: Those who skipped are now rejoining us. The next theorem states that any growth function m_H(N) with a break point is bounded by a polynomial.
Theorem 2.4. If m_H(k) < 2^k for some value k, then

    m_H(N) ≤ Σ_{i=0}^{k−1} (N choose i)   (2.8)

for all N. The RHS is polynomial in N of degree k − 1.
The implication of Theorem 2.4 is that if H has a break point, we have what we want to ensure good generalization: a polynomial bound on m_H(N).
Exercise 2.2

(a) Verify the bound of Theorem 2.4 in the three cases of Example 2.2:

    (i) Positive rays: H consists of all hypotheses in one dimension of the form h(x) = sign(x − a).
    (ii) Positive intervals: H consists of all hypotheses in one dimension that are positive within some interval and negative elsewhere.
    (iii) Convex sets: H consists of all hypotheses in two dimensions that are positive inside some convex set and negative elsewhere.

    (Note: you can use the break points you found in Exercise 2.1.)
(b) Does there exist a hypothesis set for which m_H(N) = N + 2^⌊N/2⌋ (where ⌊N/2⌋ is the largest integer ≤ N/2)?
2.1.3 The VC Dimension

Theorem 2.4 bounds the entire growth function in terms of any break point. The smaller the break point, the better the bound. This leads us to the following definition of a single parameter that characterizes the growth function.

Definition 2.5. The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by d_vc(H) or simply d_vc, is the largest value of N for which m_H(N) = 2^N. If m_H(N) = 2^N for all N, then d_vc(H) = ∞.

If d_vc is the VC dimension of H, then k = d_vc + 1 is a break point for m_H, since m_H(N) cannot equal 2^N for any N > d_vc by definition. It is easy to see that no smaller break point exists, since H can shatter d_vc points, hence it can also shatter any subset of these points.
Exercise 2.3
Compute the VC dimension of H for the hypothesis sets in parts (i), (ii), (iii) of Exercise 2.2(a).
Since k = d_vc + 1 is a break point for m_H, Theorem 2.4 can be rewritten in terms of the VC dimension:

    m_H(N) ≤ Σ_{i=0}^{d_vc} (N choose i).   (2.9)

Therefore, the VC dimension is the order of the polynomial bound on m_H(N). It is also the best we can do using this line of reasoning, because no smaller break point than k = d_vc + 1 exists. The form of the polynomial bound can be further simplified to make the dependency on d_vc more salient. We state a useful form here, which can be proved by induction (Problem 2.5):

    m_H(N) ≤ N^{d_vc} + 1.   (2.10)
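As a quick numerical sanity check (ours, not from the text) of the simplified bound (2.10), one can verify that the binomial sum in (2.9) never exceeds N^{d_vc} + 1 for a range of small values; the general statement is Problem 2.5.

    from math import comb

    # Check sum_{i=0}^{dvc} C(N, i) <= N^dvc + 1 for small dvc and N.
    for dvc in range(1, 6):
        for N in range(1, 40):
            assert sum(comb(N, i) for i in range(dvc + 1)) <= N ** dvc + 1, (dvc, N)
    print("the bound (2.10) holds for all tested (dvc, N)")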
Now that the growth function has been bounded in terms of the VC dimension, we have only one more step left in our analysis, which is to replace the number of hypotheses M in the generalization bound (2.1) with the growth function m_H(N). If we manage to do that, the VC dimension will play a pivotal role in the generalization question. If we were to directly replace M by m_H(N) in (2.1), we would get a bound of the form
    E_out ≤ E_in + sqrt( (1/(2N)) ln( 2 m_H(N) / δ ) ).
Unless d_vc(H) = ∞, we know that m_H(N) is bounded by a polynomial in N; thus, ln m_H(N) grows logarithmically in N regardless of the order of the polynomial, and so it will be crushed by the 1/N factor. Therefore, for any fixed tolerance δ, the bound on E_out will be arbitrarily close to E_in for sufficiently large N.

Only if d_vc(H) = ∞ will this argument fail, as the growth function in this case is exponential in N. For any finite value of d_vc, the error bar will converge to zero at a speed determined by d_vc, since d_vc is the order of the polynomial. The smaller d_vc is, the faster the convergence to zero.
It turns out that we cannot just replace M with m_H(N) in the generalization bound (2.1); rather, we need to make other adjustments, as we will see shortly. However, the general idea above is correct, and d_vc will still play the role that we discussed here. One implication of this discussion is that there is a division of models into two classes. The 'good models' have finite d_vc, and for sufficiently large N, E_in will be close to E_out; for good models, the in-sample performance generalizes to out of sample. The 'bad models' have infinite d_vc. With a bad model, no matter how large the data set is, we cannot make generalization conclusions from E_in to E_out based on the VC analysis.²
Because of its significant role, it is worthwhile to try to gain some insight about the VC dimension before we proceed to the formalities of deriving the new generalization bound. One way to gain insight about d_vc is to try to compute it for learning models that we are familiar with. Perceptrons are one case where we can compute d_vc exactly. This is done in two steps. First, we show that d_vc is at least a certain value; then we show that it is at most the same value. There is a logical difference in arguing that d_vc is at least a certain value, as opposed to at most a certain value. This is because

    d_vc ≥ N  ⟺  there exists a data set D of size N such that H shatters D,

hence we have different conclusions in the following cases.

1. There is a set of N points that can be shattered by H. In this case, we can conclude that d_vc ≥ N.
² In some cases with infinite d_vc, such as the convex sets that we discussed, alternative analysis based on an 'average' growth function can establish good generalization behavior.
2. Any set of N points can be shattered by H. In this case, we have more than enough information to conclude that d_vc ≥ N.

3. There is a set of N points that cannot be shattered by H. Based only on this information, we cannot conclude anything about the value of d_vc.

4. No set of N points can be shattered by H. In this case, we can conclude that d_vc < N.
Exercise 2.4
Consider the input space X = {1} × R^d (including the constant coordinate x_0 = 1). Show that the VC dimension of the perceptron (with d + 1 parameters, counting w_0) is exactly d + 1 by showing that it is at least d + 1 and at most d + 1, as follows.

(a) To show that d_vc ≥ d + 1, find d + 1 points in X that the perceptron can shatter. [Hint: Construct a nonsingular (d + 1) × (d + 1) matrix whose rows represent the d + 1 points, then use the nonsingularity to argue that the perceptron can shatter these points.]

(b) To show that d_vc ≤ d + 1, show that no set of d + 2 points in X can be shattered by the perceptron. [Hint: Represent each point in X as a vector of length d + 1, then use the fact that any d + 2 vectors of length d + 1 have to be linearly dependent. This means that some vector is a linear combination of all the other vectors. Now, if you choose the class of these other vectors carefully, then the classification of the dependent vector will be dictated. Conclude that there is some dichotomy that cannot be implemented, and therefore that for N ≥ d + 2, m_H(N) < 2^N.]
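The hint in part (a) can be illustrated numerically. In the sketch below (ours, not from the text), the d + 1 chosen points are the origin and the d standard basis vectors, each with the constant coordinate x_0 = 1 prepended; since the resulting matrix X is nonsingular, solving Xw = y realizes every dichotomy y, so the perceptron shatters these points.

    import itertools
    import numpy as np

    d = 3
    # d + 1 points in X = {1} x R^d (our own choice): the origin and the d
    # standard basis vectors, each with the constant coordinate x0 = 1 prepended.
    X = np.column_stack([np.ones(d + 1), np.vstack([np.zeros(d), np.eye(d)])])
    assert abs(np.linalg.det(X)) > 1e-9            # the matrix is nonsingular

    for y in itertools.product([-1.0, 1.0], repeat=d + 1):
        w = np.linalg.solve(X, np.array(y))        # X w = y exactly
        assert np.array_equal(np.sign(X @ w), np.array(y))
    print("all", 2 ** (d + 1), "dichotomies on these", d + 1, "points are realized")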
The VC dimension of a d-dimensional perceptron³ is indeed d + 1. This is consistent with Figure 2.1 for the case d = 2, which shows a VC dimension of 3. The perceptron case provides a nice intuition about the VC dimension, since d + 1 is also the number of parameters in this model. One can view the VC dimension as measuring the 'effective' number of parameters. The more parameters a model has, the more diverse its hypothesis set is, which is reflected in a larger value of the growth function m_H(N). In the case of perceptrons, the effective parameters correspond to explicit parameters in the model, namely w_0, w_1, ..., w_d. In other models, the effective parameters may be less obvious or implicit. The VC dimension measures these effective parameters or 'degrees of freedom' that enable the model to express a diverse set of hypotheses.

Diversity is not necessarily a good thing in the context of generalization. For example, the set of all possible hypotheses is as diverse as can be, so m_H(N) = 2^N for all N and d_vc(H) = ∞. In this case, no generalization at all is to be expected, as the final version of the generalization bound will show.
³ X = {1} × R^d is considered d-dimensional since the first coordinate x_0 = 1 is fixed.
2.1.4 The VC Generalization Bound

If we treated the growth function as an effective number of hypotheses, and replaced M in the generalization bound (2.1) with m_H(N), the resulting bound would be

    E_out(g) ≤ E_in(g) + sqrt( (1/(2N)) ln( 2 m_H(N) / δ ) ).   (2.11)

It turns out that this is not exactly the form that will hold. The quantities in red need to be technically modified to make (2.11) true. The correct bound, which is called the VC generalization bound, is given in the following theorem; it holds for any binary target function f, any hypothesis set H, any learning algorithm A, and any input probability distribution P.
Theorem 2.5 (VC generalization bound). For any tolerance δ > 0,

    E_out(g) ≤ E_in(g) + sqrt( (8/N) ln( 4 m_H(2N) / δ ) )   (2.12)

with probability ≥ 1 − δ.
If you compare the blue items in (2.12) to their red counterparts in (2.11), you notice that all the blue items move the bound in the weaker direction. However, as long as the VC dimension is finite, the error bar still converges to zero (albeit at a slower rate), since m_H(2N) is also polynomial of order d_vc in N, just like m_H(N). This means that, with enough data, each and every hypothesis in an infinite H with a finite VC dimension will generalize well from E_in to E_out. The key is that the effective number of hypotheses, represented by the finite growth function, has replaced the actual number of hypotheses in the bound.
The VC generalization bound is the most important mathematical result in the theory of learning. It establishes the feasibility of learning with infinite hypothesis sets. Since the formal proof is somewhat lengthy and technical, we illustrate the main ideas in a sketch of the proof, and include the formal proof as an appendix. There are two parts to the proof: the justification that the growth function can replace the number of hypotheses in the first place, and the reason why we had to change the red items in (2.11) into the blue items in (2.12).
Sketch of the proof. The data set D is the source of randomization in the original Hoeffding Inequality. Consider the space of all possible data sets. Let us think of this space as a 'canvas' (Figure 2.2(a)). Each D is a point on that canvas. The probability of a point is determined by which x_n's in X happen to be in that particular D, and is calculated based on the distribution P over X. Let's think of probabilities of different events as areas on that canvas, so the total area of the canvas is 1.
Figure 2.2: Illustration of the proof of the VC bound, where the 'canvas' represents the space of all data sets, with areas corresponding to probabilities. (a) For a given hypothesis, the colored points correspond to data sets where E_in does not generalize well to E_out; the Hoeffding Inequality guarantees a small colored area. (b) Union Bound: for several hypotheses, the union bound assumes no overlaps, so the total colored area is large. (c) VC Bound: the VC bound keeps track of overlaps, so it estimates the total area of bad generalization to be relatively small.
For a given hypothesis h ∈ H, the event "|E_in(h) − E_out(h)| > ε" consists of all points D for which the statement is true. For a particular h, let us paint all these 'bad' points using one color. What the basic Hoeffding Inequality tells us is that the colored area on the canvas will be small (Figure 2.2(a)).

Now, if we take another h ∈ H, the event "|E_in(h) − E_out(h)| > ε" may contain different points, since the event depends on h. Let us paint these points with a different color. The area covered by all the points we colored will be at most the sum of the two individual areas, which is the case only if the two areas have no points in common. This is the worst case that the union bound considers. If we keep throwing in a new colored area for each h ∈ H, and never overlap with previous colors, the canvas will soon be mostly covered in color (Figure 2.2(b)). Even if each h contributed very little, the sheer number of hypotheses will eventually make the colored area cover the whole canvas. This was the problem with using the union bound in the Hoeffding Inequality (1.6), and not taking the overlaps of the colored areas into consideration.
The bulk of the VC proof deals with how to account for the overlaps. Here is the idea. If you were told that the hypotheses in H are such that each point on the canvas that is colored will be colored 100 times (because of 100 different h's), then the total colored area is now 1/100 of what it would have been if the colored points had not overlapped at all. This is the essence of the VC bound, as illustrated in Figure 2.2(c). The argument goes as follows.
Many hypotheses share the same dichotomy on a given D, since there are finitely many dichotomies even with an infinite number of hypotheses. Any statement based on D alone will be simultaneously true or simultaneously false for all the hypotheses that look the same on that particular D. What the growth function enables us to do is to account for this kind of hypothesis redundancy in a precise way, so we can get a factor similar to the '100' in the above example.
When H is infinite, the redundancy factor will also be infinite, since the hypotheses will be divided among a finite number of dichotomies. Therefore, the reduction in the total colored area when we take the redundancy into consideration will be dramatic. If it happens that the number of dichotomies is only a polynomial, the reduction will be so dramatic as to bring the total probability down to a very small value. This is the essence of the proof of Theorem 2.5.
The reason m_H(2N) appears in the VC bound instead of m_H(N) is that the proof uses a sample of 2N points instead of N points. Why do we need 2N points? The event "|E_in(h) − E_out(h)| > ε" depends not only on D, but also on the entire X, because E_out(h) is based on X. This breaks the main premise of grouping h's based on their behavior on D, since aspects of each h outside of D affect the truth of "|E_in(h) − E_out(h)| > ε." To remedy that, we consider the artificial event "|E_in(h) − E'_in(h)| > ε" instead, where E_in and E'_in are based on two samples D and D', each of size N. This is where the 2N comes from. It accounts for the total size of the two samples D and D'. Now, the truth of the statement "|E_in(h) − E'_in(h)| > ε" depends exclusively on the total sample of size 2N, and the above redundancy argument will hold.
Of course we have to justify why the two-sample condition "|E_in(h) − E'_in(h)| > ε" can replace the original condition "|E_in(h) − E_out(h)| > ε." In doing so, we end up having to shrink the ε's by a factor of 4, and we also end up with a factor of 2 in the estimate of the overall probability. This accounts for the 8/N instead of the 1/(2N) in the VC bound, and for having 4 instead of 2 as the multiplicative factor of the growth function. When you put all this together, you get the formula in (2.12). □
2.2 Interpreting the Generalization Bound

The VC generalization bound (2.12) is a universal result in the sense that it applies to all hypothesis sets, learning algorithms, input spaces, probability distributions, and binary target functions. It can be extended to other types of target functions as well. Given the generality of the result, one would suspect that the bound it provides may not be particularly tight in any given case, since the same bound has to cover a lot of different cases. Indeed, the bound is quite loose.
Exercise 2.5
Suppose we have a simple learning model whose growth function is m_H(N) = N + 1, hence d_vc = 1. Use the VC bound (2.12) to estimate the probability that E_out will be within 0.1 of E_in given 100 training examples. [Hint: The estimate will be ridiculous.]
Why is the VC bound so loose? The slack in the bound can be attributed to a number of technical factors. Among them,

1. The basic Hoeffding Inequality used in the proof already has a slack. The inequality gives the same bound whether E_out is close to 0.5 or close to zero. However, the variance of E_in is quite different in these two cases. Therefore, having one bound capture both cases will result in some slack.

2. Using m_H(N) to quantify the number of dichotomies on N points, regardless of which N points are in the data set, gives us a worst-case estimate. This does allow the bound to be independent of the probability distribution P over X. However, we would get a more tuned bound if we considered specific x_1, ..., x_N and used |H(x_1, ..., x_N)| or its expected value instead of the upper bound m_H(N). For instance, in the case of convex sets in two dimensions, which we examined in Example 2.2, if you pick N points at random in the plane, they will likely have far fewer dichotomies than 2^N, while m_H(N) = 2^N.

3. Bounding m_H(N) by a simple polynomial of order d_vc, as given in (2.10), will contribute further slack to the VC bound.
Some effort could be put into tightening the VC bound, but many highly technical attempts in the literature have resulted in only diminishing returns. The reality is that the VC line of analysis leads to a very loose bound. Why did we bother to go through the analysis then? Two reasons. First, the VC analysis is what establishes the feasibility of learning for infinite hypothesis sets, the only kind we use in practice. Second, although the bound is loose, it tends to be equally loose for different learning models, and hence is useful for comparing the generalization performance of these models. This is an observation from practical experience, not a mathematical statement. In real applications, learning models with lower d_vc tend to generalize better than those with higher d_vc. Because of this observation, the VC analysis proves useful in practice, and some rules of thumb have emerged in terms of the VC dimension. For instance, requiring that N be at least 10 × d_vc to get decent generalization is a popular rule of thumb.

Thus, the VC bound can be used as a guideline for generalization, relatively if not absolutely. With this understanding, let us look at the different ways the bound is used in practice.
2.2.1 Sample Complexity

The sample complexity denotes how many training examples N are needed to achieve a certain generalization performance. The performance is specified by two parameters, ε and δ. The error tolerance ε determines the allowed generalization error, and the confidence parameter δ determines how often the error tolerance ε is violated. How fast N grows as ε and δ become smaller⁴ indicates how much data is needed to get good generalization.
We can use the VC bound to estimate the sample complexity for a given learning model. Fix δ > 0, and suppose we want the generalization error to be at most ε. From Equation (2.12), the generalization error is bounded by sqrt( (8/N) ln( 4 m_H(2N) / δ ) ), so it suffices to make sqrt( (8/N) ln( 4 m_H(2N) / δ ) ) ≤ ε. It follows that

    N ≥ (8/ε²) ln( 4 m_H(2N) / δ )

suffices to obtain generalization error at most ε (with probability at least 1 − δ). This gives an implicit bound for the sample complexity N, since N appears on both sides of the inequality. If we replace m_H(2N) in (2.12) by its polynomial upper bound in (2.10), which is based on the VC dimension, we get a similar bound

    N ≥ (8/ε²) ln( 4 ((2N)^{d_vc} + 1) / δ ),   (2.13)
which is again implicit in N. We can obtain a numerical value for N using simple iterative methods.
Example 2.6. Suppose that we have a learning model with d_vc = 3 and would like the generalization error to be at most 0.1 with confidence 90% (so ε = 0.1 and δ = 0.1). How big a data set do we need? Using (2.13), we need

    N ≥ (8/0.1²) ln( 4 ((2N)³ + 1) / 0.1 ).

Trying an initial guess of N = 1,000 in the RHS, we get

    N ≥ (8/0.1²) ln( 4 ((2 × 1,000)³ + 1) / 0.1 ) ≈ 21,193.

We then try the new value N = 21,193 in the RHS and continue this iterative process, rapidly converging to an estimate of N ≈ 30,000. If d_vc were 4, a similar calculation will find that N ≈ 40,000. For d_vc = 5, we get N ≈ 50,000. You can see that the inequality suggests that the number of examples needed is approximately proportional to the VC dimension, as has been observed in practice. The constant of proportionality it suggests is 10,000, which is a gross overestimate; a more practical constant of proportionality is closer to 10. □
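The iteration in Example 2.6 is easy to automate. The following sketch (ours, not from the text) iterates the implicit bound (2.13) to a fixed point for the values of d_vc used in the example.

    from math import log

    def sample_complexity(dvc, eps, delta, n0=1000.0, iters=100):
        # Iterate N = (8 / eps^2) * ln(4 * ((2N)^dvc + 1) / delta), the implicit
        # bound (2.13), starting from an initial guess n0.
        N = n0
        for _ in range(iters):
            N = (8.0 / eps ** 2) * log(4.0 * ((2.0 * N) ** dvc + 1.0) / delta)
        return N

    for dvc in (3, 4, 5):
        print(dvc, round(sample_complexity(dvc, eps=0.1, delta=0.1)))
    # prints roughly 30,000, 40,000 and 50,000, as in Example 2.6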
⁴ The term 'complexity' comes from a similar metaphor in computational complexity.
2.2.2 Penalty for Model Complexity

Sample complexity fixes the performance parameters ε (generalization error) and δ (confidence parameter) and estimates how many examples N are needed. In most practical situations, however, we are given a fixed data set D, so N is also fixed. In this case, the relevant question is what performance can we expect given this particular N. The bound in (2.12) answers this question: with probability at least 1 − δ,

    E_out(g) ≤ E_in(g) + sqrt( (8/N) ln( 4 m_H(2N) / δ ) ).

If we use the polynomial bound based on d_vc instead of m_H(2N), we get another valid bound on the out-of-sample error,

    E_out(g) ≤ E_in(g) + sqrt( (8/N) ln( 4 ((2N)^{d_vc} + 1) / δ ) ).   (2.14)
Example 2.7. Suppose that N = 100 and we have a 90% confidence requirement (δ = 0.1). We could ask what error bar can we offer with this confidence, if H has d_vc = 1. Using (2.14), we have

    E_out(g) ≤ E_in(g) + sqrt( (8/100) ln( 4 × 201 / 0.1 ) ) ≈ E_in(g) + 0.848   (2.15)

with confidence ≥ 90%. This is a pretty poor bound on E_out. Even if E_in = 0, E_out may still be close to 1. If N = 1,000, then we get E_out(g) ≤ E_in(g) + 0.301, a somewhat more respectable bound. □
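The error bars in Example 2.7 come from evaluating the square-root term of (2.14) directly, as in this small sketch (ours, not from the text):

    from math import log, sqrt

    def vc_error_bar(N, dvc, delta):
        # The penalty term of (2.14), using the polynomial bound (2N)^dvc + 1
        # in place of m_H(2N).
        return sqrt(8.0 / N * log(4.0 * ((2.0 * N) ** dvc + 1.0) / delta))

    print(vc_error_bar(100, 1, 0.1))    # about 0.848, as in Example 2.7
    print(vc_error_bar(1000, 1, 0.1))   # about 0.301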
Let us look more closely at the two parts that make up the bound on E_out in (2.12). The first part is E_in, and the second part is a term that increases as the VC dimension of H increases:

    E_out(g) ≤ E_in(g) + Ω(N, H, δ),   (2.16)

where

    Ω(N, H, δ) = sqrt( (8/N) ln( 4 m_H(2N) / δ ) ) ≤ sqrt( (8/N) ln( 4 ((2N)^{d_vc} + 1) / δ ) ).

One way to think of Ω(N, H, δ) is that it is a penalty for model complexity. It penalizes us by worsening the bound on E_out when we use a more complex H (larger d_vc).
Figure 2.3: When we use a more complex learning model, one that has higher VC dimension d_vc, we are likely to fit the training data better, resulting in a lower in-sample error, but we pay a higher penalty for model complexity. A combination of the two, which estimates the out-of-sample error, thus attains a minimum at some intermediate d_vc*.
If someone manages to fit a simpler model with the same training error, they will get a more favorable estimate for E_out. The penalty Ω(N, H, δ) gets worse if we insist on higher confidence (lower δ), and it gets better when we have more training examples, as we would expect.

Although Ω(N, H, δ) goes up when H has a higher VC dimension, E_in is likely to go down with a higher VC dimension, as we have more choices within H to fit the data. Therefore, we have a tradeoff: more complex models help E_in and hurt Ω(N, H, δ). The optimal model is a compromise that minimizes a combination of the two terms, as illustrated informally in Figure 2.3.
2.2.3 The Test Set
As we have seen, the generalization bound gives us a loose estimate of the out-of-sample error E_out based on E_in. While the estimate can be useful as a guideline for the training process, it is next to useless if the goal is to get an accurate forecast of E_out. If you are developing a system for a customer, you need a more accurate estimate so that your customer knows how well the system is expected to perform.

An alternative approach that we alluded to in the beginning of this chapter is to estimate E_out by using a test set, a data set that was not involved in the training process. The final hypothesis g is evaluated on the test set, and the result is taken as an estimate of E_out. We would like to now take a closer look at this approach.

Let us call the error we get on the test set E_test. When we report E_test as our estimate of E_out, we are in fact asserting that E_test generalizes very well to E_out. After all, E_test is just a sample estimate like E_in. How do we know
that E_test generalizes well? We can answer this question with authority now that we have developed the theory of generalization in concrete mathematical terms.

The effective number of hypotheses that matters in the generalization behavior of E_test is 1. There is only one hypothesis as far as the test set is concerned, and that's the final hypothesis g that the training phase produced. This hypothesis would not change if we used a different test set, as it would if we used a different training set. Therefore, the simple Hoeffding Inequality is valid in the case of a test set. Had the choice of g been affected by the test set in any shape or form, it wouldn't be considered a test set any more and the simple Hoeffding Inequality would not apply.

Therefore, the generalization bound that applies to E_test is the simple Hoeffding Inequality with one hypothesis. This is a much tighter bound than the VC bound. For example, if you have 1,000 data points in the test set, E_test will be within ±5% of E_out with probability ≥ 98%. The bigger the test set you use, the more accurate E_test will be as an estimate of E_out.
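The figure of ±5% with probability ≥ 98% follows from the single-hypothesis Hoeffding bound P[|E_test − E_out| > ε] ≤ 2e^{−2ε²N}. A one-line check (ours, not from the text):

    from math import exp

    # Hoeffding bound for a single, fixed hypothesis:
    # P[|E_test - E_out| > eps] <= 2 * exp(-2 * eps^2 * N).
    N, eps = 1000, 0.05
    p_bad = 2.0 * exp(-2.0 * eps ** 2 * N)
    print(p_bad)    # about 0.013, so E_test is within +/-0.05 of E_out
                    # with probability greater than 98%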
Exercise 2.6
A data set has 600 examples. To properly test the performance of the final hypothesis, you set aside a randomly selected subset of 200 examples which are never used in the training phase; these form a test set. You use a learning model with 1,000 hypotheses and select the final hypothesis g based on the 400 training examples. We wish to estimate E_out(g). We have access to two estimates: E_in(g), the in-sample error on the 400 training examples; and E_test(g), the test error on the 200 test examples that were set aside.

(a) Using a 5% error tolerance (δ = 0.05), which estimate has the higher 'error bar'?

(b) Is there any reason why you shouldn't reserve even more examples for testing?
Another aspect that distinguishes the test set from the training set is that the test set is not biased. Both sets are finite samples that are bound to have some variance due to sample size, but the test set doesn't have an optimistic or pessimistic bias in its estimate of E_out. The training set has an optimistic bias, since it was used to choose a hypothesis that looked good on it. The VC generalization bound implicitly takes that bias into consideration, and that's why it gives a huge error bar. The test set just has straight finite-sample variance, but no bias. When you report the value of E_test to your customer and they try your system on new data, they are as likely to be pleasantly surprised as unpleasantly surprised, though quite likely not to be surprised at all.
There is a price to be paid for having a test set. The test set does not affect the outcome of our learning process, which only uses the training set. The test set just tells us how well we did. Therefore, if we set aside some
of the data points provided by the customer as a test set, we end up using fewer examples for training. Since the training set is used to select one of the hypotheses in H, training examples are essential to finding a good hypothesis. If we take a big chunk of the data for testing and end up with too few examples for training, we may not get a good hypothesis from the training part even if we can reliably evaluate it in the testing part. We may end up reporting to the customer, with high confidence mind you, that the g we are delivering is terrible. There is thus a tradeoff to setting aside test examples. We will address that tradeoff in more detail and learn some clever tricks to get around it in Chapter 4.

In some of the learning literature, E_test is used as synonymous with E_out. When we report experimental results in this book, we will often treat E_test based on a large test set as if it were E_out because of the closeness of the two quantities.
2.2.4 Other Target Types

Although the VC analysis was based on binary target functions, it can be extended to real-valued functions, as well as to other types of functions. The proofs in those cases are quite technical, and they do not add to the insight that the VC analysis of binary functions provides. Therefore, we will introduce an alternative approach that covers real-valued functions and provides new insights into generalization. The approach is based on bias-variance analysis, and will be discussed in the next section.

In order to deal with real-valued functions, we need to adapt the definitions of E_in and E_out that have so far been based on binary functions. We defined E_in and E_out in terms of binary error: either h(x) = f(x) or else h(x) ≠ f(x). If f and h are real-valued, a more appropriate error measure would gauge how far f(x) and h(x) are from each other, rather than just whether their values are exactly the same.
An error measure that is commonly used in this case is the squared error e(h(x), f(x)) = (h(x) − f(x))². We can define in-sample and out-of-sample versions of this error measure. The out-of-sample error is based on the expected value of the error measure over the entire input space X,

    E_out(h) = E_x[ (h(x) − f(x))² ],

while the in-sample error is based on averaging the error measure over the data set,

    E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − f(x_n))².

These definitions make E_in a sample estimate of E_out, just as it was in the case of binary functions. In fact, the error measure used for binary functions can also be expressed as a squared error.
Exercise 2.7
For binary target functions, show that P[h(x) ≠ f(x)] can be written as an expected value of a mean-squared error measure in the following cases.

(a) The convention used for the binary function is 0 or 1.
(b) The convention used for the binary function is ±1.

[Hint: The difference between (a) and (b) is just a scale.]
Just as the sample frequency of error converges to the overall probability of error per Hoeffding's Inequality, the sample average of squared error converges to the expected value of that error (assuming finite variance). This is a manifestation of what is referred to as the 'law of large numbers', and Hoeffding's Inequality is just one form of that law. The same issues of the data set size and the hypothesis set complexity come into play just as they did in the VC analysis.
2.3 Approximation-Generalization Tradeoff

The VC analysis showed us that the choice of H needs to strike a balance between approximating f on the training data and generalizing on new data. The ideal H is a singleton hypothesis set containing only the target function. Unfortunately, we are better off buying a lottery ticket than hoping to have this H. Since we do not know the target function, we resort to a larger model, hoping that it will contain a good hypothesis, and hoping that the data will pin down that hypothesis. When you select your hypothesis set, you should balance these two conflicting goals: to have some hypothesis in H that can approximate f, and to enable the data to zoom in on the right hypothesis.
The VC generalization bound is one way to look at this tradeoff. If H is too simple, we may fail to approximate f well and end up with a large in-sample error term. If H is too complex, we may fail to generalize well because of the large model complexity term. There is another way to look at the approximation-generalization tradeoff which we will present in this section. It is particularly suited for squared error measures, rather than the binary error used in the VC analysis. The new way provides a different angle: instead of bounding E_out by E_in plus a penalty term Ω, we will decompose E_out into two different error terms.
2.3.1 Bias and Variance

The bias-variance decomposition of out-of-sample error is based on squared error measures. The out-of-sample error is

    E_out(g^(D)) = E_x[ (g^(D)(x) − f(x))² ],   (2.17)
where E_x denotes the expected value with respect to x (based on the probability distribution on the input space X). We have made explicit the dependence of the final hypothesis g^(D) on the data set D, as this will play a key role in the current analysis. We can rid Equation (2.17) of the dependence on a particular data set by taking the expectation with respect to all data sets. We then get the expected out-of-sample error for our learning model, independent of any particular realization of the data set,

    E_D[ E_out(g^(D)) ] = E_D[ E_x[ (g^(D)(x) − f(x))² ] ]
                        = E_x[ E_D[ (g^(D)(x) − f(x))² ] ]
                        = E_x[ E_D[ g^(D)(x)² ] − 2 E_D[ g^(D)(x) ] f(x) + f(x)² ].
The term E_D[ g^(D)(x) ] gives an 'average function', which we denote by ḡ(x). One can interpret ḡ(x) in the following operational way. Generate many data sets D_1, ..., D_K and apply the learning algorithm to each data set to produce final hypotheses g_1, ..., g_K. We can then estimate the average function for any x by ḡ(x) ≈ (1/K) Σ_{k=1}^{K} g_k(x). Essentially, we are viewing g(x) as a random variable, with the randomness coming from the randomness in the data set; ḡ(x) is the expected value of this random variable (for a particular x), and ḡ is a function, the average function, composed of these expected values. The function ḡ is a little counterintuitive; for one thing, ḡ need not be in the model's hypothesis set, even though it is the average of functions that are.
Exercise 2.8
(a) Show that if H is closed under linear combination (any linear combination of hypotheses in H is also a hypothesis in H), then ḡ ∈ H.
(b) Give a model for which the average function ḡ is not in the model's hypothesis set. [Hint: Use a very simple model.]
(c) For binary classification, do you expect ḡ to be a binary function?
We can now rewrite the expected out-of-sample error in terms of ḡ:

    E_D[ E_out(g^(D)) ] = E_x[ E_D[ g^(D)(x)² ] − 2 ḡ(x) f(x) + f(x)² ]
                        = E_x[ E_D[ g^(D)(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)² ]
                        = E_x[ E_D[ (g^(D)(x) − ḡ(x))² ] + (ḡ(x) − f(x))² ],

where the last reduction follows since ḡ(x) is constant with respect to D. The term (ḡ(x) − f(x))² measures how much the average function that we would learn using different data sets D deviates from the target function that generated these data sets. This term is appropriately called the bias:

    bias(x) = (ḡ(x) − f(x))²,
as it measures how much our learning model is biased away from the target function.⁵ This is because ḡ has the benefit of learning from an unlimited number of data sets, so it is only limited in its ability to approximate f by the limitation in the learning model itself. The term E_D[ (g^(D)(x) − ḡ(x))² ] is the variance of the random variable g^(D)(x),

    var(x) = E_D[ (g^(D)(x) − ḡ(x))² ],

which measures the variation in the final hypothesis, depending on the data set. We thus arrive at the bias-variance decomposition of out-of-sample error,

    E_D[ E_out(g^(D)) ] = E_x[ bias(x) + var(x) ]
                        = bias + var,
where bias = E_x[ bias(x) ] and var = E_x[ var(x) ]. Our derivation assumed that the data was noiseless. A similar derivation with noise in the data would lead to an additional noise term in the out-of-sample error (Problem 2.22). The noise term is unavoidable no matter what we do, so the terms we are interested in are really the bias and var.

The approximation-generalization tradeoff is captured in the bias-variance decomposition. To illustrate, let's consider two extreme cases: a very small model (with one hypothesis) and a very large one with all hypotheses.
Very small model. Since there is only one hypothesis, both the average function ḡ and the final hypothesis g^(D) will be the same, for any data set. Thus, var = 0. The bias will depend solely on how well this single hypothesis approximates the target f, and unless we are extremely lucky, we expect a large bias.

Very large model. The target function is in H. Different data sets will lead to different hypotheses that agree with f on the data set, and are spread around f in the red region. Thus, bias ≈ 0 because ḡ is likely to be close to f. The var is large (heuristically represented by the size of the red region in the figure).
One can also view the variance as a measure of 'instability' in the learning model. Instability manifests in wild reactions to small variations or idiosyncrasies in the data, resulting in vastly different hypotheses.
⁵ What we call bias is sometimes called bias² in the literature.
Example 2.8. Consider a target function f(x) = sin(πx) and a data set of size N = 2. We sample x uniformly in [−1, 1] to generate a data set (x_1, y_1), (x_2, y_2), and fit the data using one of two models:

    H_0: set of all lines of the form h(x) = b;
    H_1: set of all lines of the form h(x) = ax + b.

For H_0, we choose the constant hypothesis that best fits the data (the horizontal line at the midpoint, b = (y_1 + y_2)/2). For H_1, we choose the line that passes through the two data points (x_1, y_1) and (x_2, y_2). Repeating this process with many data sets, we can estimate the bias and the variance. The figures which follow show the resulting fits on the same (random) data sets for both models.

[Figure: the fits produced by H_0 (left) and H_1 (right) on the same random data sets.]
With H_1, the learned hypothesis is wilder and varies extensively depending on the data set. The bias-var analysis is summarized in the next figures.
[Figure: the average hypothesis ḡ (red) for each model, with var(x) indicated by the gray shaded region ḡ(x) ± sqrt(var(x)). For H_0: bias = 0.50, var = 0.25. For H_1: bias = 0.21, var = 1.69.]
For H_1, the average hypothesis ḡ (red line) is a reasonable fit with a fairly small bias of 0.21. However, the large variability leads to a high var of 1.69, resulting in a large expected out-of-sample error of 1.90. With the simpler
model H_0, the fits are much less volatile and we have a significantly lower var of 0.25, as indicated by the shaded region. However, the average fit is now the zero function, resulting in a higher bias of 0.50. The total out-of-sample error has a much smaller expected value of 0.75. The simpler model wins by significantly decreasing the var at the expense of a smaller increase in bias. Notice that we are not comparing how well the red curves (the average hypotheses) fit the sine. These curves are only conceptual, since in real learning we do not have access to the multitude of data sets needed to generate them. We have one data set, and the simpler model results in a better out-of-sample error on average as we fit our model to just this one data set. However, the var term decreases as N increases, so if we get a bigger and bigger data set, the bias term will be the dominant part of E_out, and H_1 will win. □
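The bias and var values quoted in Example 2.8 can be reproduced approximately by Monte Carlo. The sketch below (ours, not from the text) generates many data sets of size 2, fits both models, estimates ḡ on a grid, and averages the squared deviations; with enough data sets it prints values close to 0.50/0.25 for H_0 and 0.21/1.69 for H_1.

    import numpy as np

    rng = np.random.default_rng(0)
    n_datasets = 20_000
    x_grid = np.linspace(-1, 1, 201)          # grid used to approximate E_x[.]
    f_grid = np.sin(np.pi * x_grid)

    def fit_h0(x, y):
        # H0: constant hypothesis, the horizontal line at the midpoint of the y's.
        return np.full_like(x_grid, y.mean())

    def fit_h1(x, y):
        # H1: the line through the two data points.
        a = (y[1] - y[0]) / (x[1] - x[0])
        return a * (x_grid - x[0]) + y[0]

    for name, fit in [("H0", fit_h0), ("H1", fit_h1)]:
        preds = np.empty((n_datasets, x_grid.size))
        for i in range(n_datasets):
            x = rng.uniform(-1, 1, size=2)    # a data set of size N = 2, no noise
            preds[i] = fit(x, np.sin(np.pi * x))
        g_bar = preds.mean(axis=0)            # estimate of the average function
        bias = np.mean((g_bar - f_grid) ** 2)
        var = np.mean(preds.var(axis=0))
        print(name, round(bias, 2), round(var, 2))   # ~0.50/0.25 and ~0.21/1.69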
The learning algorithm plays a role in the bias-variance analysis that it did not play in the VC analysis. Two points are worth noting.

1. By design, the VC analysis is based purely on the hypothesis set H, independently of the learning algorithm A. In the bias-variance analysis, both H and the algorithm A matter. With the same H, using a different learning algorithm can produce a different g^(D). Since g^(D) is the building block of the bias-variance analysis, this may result in different bias and var terms.

2. Although the bias-variance analysis is based on squared-error measure, the learning algorithm itself does not have to be based on minimizing the squared error. It can use any criterion to produce g^(D) based on D. However, once the algorithm produces g^(D), we measure its bias and variance using squared error.
Unfortunately, the bias and variance cannot be computed in practice, since they depend on the target function and the input probability distribution (both unknown). Thus, the bias-variance decomposition is a conceptual tool which is helpful when it comes to developing a model. There are two typical goals when we consider bias and variance. The first is to try to lower the variance without significantly increasing the bias, and the second is to lower the bias without significantly increasing the variance. These goals are achieved by different techniques, some principled and some heuristic. Regularization is one of these techniques that we will discuss in Chapter 4. Reducing the bias without increasing the variance requires some prior information regarding the target function to steer the selection of H in the direction of f, and this task is largely application-specific. On the other hand, reducing the variance without compromising the bias can be done through general techniques.
2.3.2 The Learning Curve

We close this chapter with an important plot that illustrates the tradeoffs that we have seen so far. The learning curves summarize the behavior of the
in-sample and out-of-sample errors as we vary the size of the training set. After learning with a particular data set D of size N, the final hypothesis g^(D) has in-sample error E_in(g^(D)) and out-of-sample error E_out(g^(D)), both of which depend on D. As we saw in the bias-variance analysis, the expectation with respect to all data sets of size N gives the expected errors: E_D[E_in(g^(D))] and E_D[E_out(g^(D))]. These expected errors are functions of N, and are called the learning curves of the model. We illustrate the learning curves for a simple learning model and a complex one, based on actual experiments.

[Figure: learning curves for a Simple Model (left) and a Complex Model (right), showing expected E_in and E_out versus the number of data points N.]
Notice that for the simple model, the learning curves converge more quickly, but to worse ultimate performance, than for the complex model. This behavior is typical in practice. For both simple and complex models, the out-of-sample learning curve is decreasing in N, while the in-sample learning curve is increasing in N. Let us take a closer look at these curves and interpret them in terms of the different approaches to generalization that we have discussed.

In the VC analysis, E_out was expressed as the sum of E_in and a generalization error that was bounded by Ω, the penalty for model complexity. In the bias-variance analysis, E_out was expressed as the sum of a bias and a variance. The following learning curves illustrate these two approaches side by side.
[Figure: two learning-curve plots of expected error versus the number of data points N, shown side by side: VC Analysis (E_out split into E_in, the in-sample error, plus the generalization error) and Bias-Variance Analysis.]
The VC analysis bounds the generalization error, which is illustrated on the left.⁶ The bias-variance analysis is illustrated on the right. The bias-variance illustration is somewhat idealized, since it assumes that, for every N, the average learned hypothesis ḡ has the same performance as the best approximation to f in the learning model.

When the number of data points increases, we move to the right on the learning curves and both the generalization error and the variance term shrink, as expected. The learning curve also illustrates an important point about E_in. As N increases, E_in edges toward the smallest error that the learning model can achieve in approximating f. For small N, the value of E_in is actually smaller than that 'smallest possible' error. This is because the learning model has an easier task for smaller N; it only needs to approximate f on the N points regardless of what happens outside those points. Therefore, it can achieve a superior fit on those points, albeit at the expense of an inferior fit on the rest of the points, as shown by the corresponding value of E_out.
⁶ For the learning curve, we take the expected values of all quantities with respect to D of size N.
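Learning curves like the ones above can be estimated by simulation. As an illustration (ours, not from the text), the sketch below reuses the two models of Example 2.8, now fit to N points by least squares, and averages E_in and E_out over many data sets for each N; the in-sample curve rises and the out-of-sample curve falls as N grows.

    import numpy as np

    rng = np.random.default_rng(1)
    x_grid = np.linspace(-1, 1, 501)

    def f(x):
        return np.sin(np.pi * x)

    def expected_errors(degree, N, trials=2000):
        # Average E_in and E_out of a least-squares polynomial fit of the given
        # degree (0 for H0, 1 for H1) over many data sets of size N.
        e_in = e_out = 0.0
        for _ in range(trials):
            x = rng.uniform(-1, 1, size=N)
            y = f(x)
            coefs = np.polyfit(x, y, degree)
            e_in += np.mean((np.polyval(coefs, x) - y) ** 2)
            e_out += np.mean((np.polyval(coefs, x_grid) - f(x_grid)) ** 2)
        return e_in / trials, e_out / trials

    for N in (2, 5, 10, 20, 50, 100):
        print(N, expected_errors(0, N), expected_errors(1, N))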
2.4 Problems

Problem 2.1   In Equation (2.1), set δ = 0.03 and let

    ε(M, N, δ) = sqrt( (1/(2N)) ln( 2M / δ ) ).
(a) For M = 1, how many examples do we need to make ε ≤ 0.05?
(b) For M = 100, how many examples do we need to make ε ≤ 0.05?
(c) For M = 10,000, how many examples do we need to make ε ≤ 0.05?
Problem 2.2   Show that for the learning model of positive rectangles (aligned horizontally or vertically), m_H(4) = 2^4 and m_H(5) < 2^5. Hence, give a bound for m_H(N).
Problem 2.3   Compute the maximum number of dichotomies, m_H(N), for these learning models, and consequently compute d_vc, the VC dimension.

(a) Positive or negative ray: H contains the functions which are +1 on [a, ∞) (for some a) together with those that are +1 on (−∞, a] (for some a).

(b) Positive or negative interval: H contains the functions which are +1 on an interval [a, b] and −1 elsewhere, or −1 on an interval [a, b] and +1 elsewhere.

(c) Two concentric spheres in R^d: H contains the functions which are +1 for a ≤ sqrt(x_1² + ⋯ + x_d²) ≤ b.
Problem 2.4   Show that B(N, k) = Σ_{i=0}^{k−1} (N choose i) by showing the other direction to Lemma 2.3, namely that

    B(N, k) ≥ Σ_{i=0}^{k−1} (N choose i).

To do so, construct a specific set of Σ_{i=0}^{k−1} (N choose i) dichotomies that does not shatter any subset of k variables. [Hint: Try limiting the number of −1's in each dichotomy.]
Problem 2.5   Prove by induction that

    Σ_{i=0}^{D} (N choose i) ≤ N^D + 1,

and hence that

    m_H(N) ≤ N^{d_vc} + 1.
Problem 2.6   Prove that for N ≥ d,

    Σ_{i=0}^{d} (N choose i) ≤ (eN/d)^d.

We suggest you first show the following intermediate steps.

(a) Σ_{i=0}^{d} (N choose i) ≤ Σ_{i=0}^{d} (N choose i) (N/d)^{d−i} ≤ (N/d)^d Σ_{i=0}^{N} (N choose i) (d/N)^i.

(b) Σ_{i=0}^{N} (N choose i) (d/N)^i ≤ e^d. [Hints: Binomial theorem; (1 + 1/x)^x ≤ e for x > 0.]

Hence, argue that m_H(N) ≤ (eN/d_vc)^{d_vc}.
Problem 2.7    Plot the bounds for m_H(N) given in Problems 2.5 and 2.6 for d_vc = 2 and d_vc = 5. When do you prefer one bound over the other?
Problem 2.8    Which of the following are possible growth functions m_H(N) for some hypothesis set:
    1 + N;   1 + N + (N choose 2);   2^N;   2^⌊√N⌋;   …
Problem 2.9    [hard] For the perceptron in d dimensions, show that
    m_H(N) = 2 Σ_{i=0}^{d} (N−1 choose i).
[Hint: Cover (1965) in Further Reading.]
Use this formula to verify that d_vc = d + 1 by evaluating m_H(d + 1) and m_H(d + 2). Plot m_H(N)/2^N for d = 10 and N ∈ [1, 40]. If you generate a random dichotomy on N points in 10 dimensions, give an upper bound on the probability that the dichotomy will be separable for N = 10, 20, 40.
Problem 2.10    Show that m_H(2N) ≤ m_H(N)^2, and hence obtain a generalization bound which only involves m_H(N).
Problem 2.11    Suppose m_H(N) = N + 1, so d_vc = 1. You have 100 training examples. Use the generalization bound to give a bound for Eout with confidence 90%. Repeat for N = 10,000.
Problem 2.12    For an H with d_vc = 10, what sample size do you need (as prescribed by the generalization bound) to have a 95% confidence that your generalization error is at most 0.05?
Problem 2.13
(a) Let H = {h_1, h_2, ..., h_M} with finite M. Prove that d_vc(H) ≤ log_2 M.
(b) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions d_vc(H_k), derive and prove the tightest upper and lower bound that you can get on d_vc(∩_{k=1}^{K} H_k).
(c) For hypothesis sets H_1, H_2, ..., H_K with finite VC dimensions d_vc(H_k), derive and prove the tightest upper and lower bounds that you can get on d_vc(∪_{k=1}^{K} H_k).
Problem 2.14    Let H_1, H_2, ..., H_K be K hypothesis sets (K ≥ 1), each with the same finite VC dimension d_vc. Let H = H_1 ∪ H_2 ∪ ... ∪ H_K be the union of these models.
(a) Show that d_vc(H) < K(d_vc + 1).
(b) Suppose that ℓ satisfies 2^ℓ > 2K ℓ^{d_vc}. Show that d_vc(H) ≤ ℓ.
(c) Hence, show that
    d_vc(H) ≤ min( K(d_vc + 1), 7(d_vc + K) log_2(d_vc K) ).
That is, d_vc(H) = O( max(d_vc, K) log_2 max(d_vc, K) ) is not too bad.
Problem 2.15    The monotonically increasing hypothesis set is
    H = { h | x_1 ≥ x_2 ⟹ h(x_1) ≥ h(x_2) },
where x_1 ≥ x_2 if and only if the inequality is satisfied for every component.
(a) Give an example of a monotonic classifier in two dimensions, clearly showing the +1 and −1 regions.
(b) Compute m_H(N) and hence the VC dimension. [Hint: Consider a set of N points generated by first choosing one point, and then generating the next point by increasing the first component and decreasing the second component until N points are obtained.]
Problem 2.16    In this problem, we will consider X = R. That is, x is a one-dimensional variable. For a hypothesis set
    H = { h_c | h_c(x) = sign( Σ_{i=0}^{D} c_i x^i ) },
prove that the VC dimension of H is exactly (D + 1) by showing that
(a) There are (D + 1) points which are shattered by H.
(b) There are no (D + 2) points which are shattered by H.
Problem 2.17    The VC dimension depends on the input space as well as H. For a fixed H, consider two input spaces X_1 ⊆ X_2. Show that the VC dimension of H with respect to input space X_1 is at most the VC dimension of H with respect to input space X_2.
How can the result of this problem be used to answer part (b) in Problem 2.16? [Hint: How is Problem 2.16 related to a perceptron in D dimensions?]
Problem 2.18    The VC dimension of the perceptron hypothesis set corresponds to the number of parameters (w_0, w_1, ..., w_d) of the set, and this observation is 'usually' true for other hypothesis sets. However, we will present a counter-example here. Prove that the following hypothesis set for x ∈ R has an infinite VC dimension:
    H = { h_α | h_α(x) = (−1)^{⌊αx⌋}, where α ∈ R },
where ⌊A⌋ is the biggest integer ≤ A (the floor function). This hypothesis set has only one parameter α but 'enjoys' an infinite VC dimension. [Hint: Consider x_1, ..., x_N, where x_n = 10^n, and show how to implement an arbitrary dichotomy y_1, ..., y_N.]
Problem 2.19    This problem derives a bound for the VC dimension of a complex hypothesis set that is built from simpler hypothesis sets via composition. Let H_1, ..., H_K be hypothesis sets with VC dimensions d_1, ..., d_K. Fix h_1, ..., h_K, where h_i ∈ H_i. Define a vector z obtained from x to have components h_i(x). Note that x ∈ R^d, but z ∈ {−1, +1}^K. Let H̃ be a hypothesis set of functions that take inputs in R^K. So
    h̃ ∈ H̃ :  z ∈ R^K → {+1, −1},
and suppose that H̃ has VC dimension d̃.
We can apply a hypothesis in H̃ to the z constructed from (h_1, ..., h_K). This is the composition of the hypothesis set H̃ with (H_1, ..., H_K). More formally, the composed hypothesis set H = H̃ ∘ (H_1, ..., H_K) is defined by h ∈ H if
    h(x) = h̃(h_1(x), ..., h_K(x)),    h̃ ∈ H̃;  h_i ∈ H_i.
(a) Show that
    m_H(N) ≤ m_H̃(N) Π_{i=1}^{K} m_{H_i}(N).    (2.18)
[Hint: Fix N points x_1, ..., x_N and fix h_1, ..., h_K. This generates N transformed points z_1, ..., z_N. These z_1, ..., z_N can be dichotomized in at most m_H̃(N) ways, hence for fixed (h_1, ..., h_K), (x_1, ..., x_N) can be dichotomized in at most m_H̃(N) ways. Through the eyes of x_1, ..., x_N, at most how many hypotheses are there (effectively) in H_i? Use this bound to bound the effective number of K-tuples (h_1, ..., h_K) that need to be considered. Finally, argue that you can bound the number of dichotomies that can be implemented by the product of the number of possible K-tuples (h_1, ..., h_K) and the number of dichotomies per K-tuple.]
(b) Use the bound m(N) ≤ (eN/d_vc)^{d_vc} to get a bound for m_H(N) in terms of d̃, d_1, ..., d_K.
(c) Let D = d̃ + Σ_{i=1}^{K} d_i, and assume that D > 2e log_2 D. Show that
    d_vc(H) ≤ 2D log_2 D.
(d) If the H_i and H̃ are all perceptron hypothesis sets, show that
    d_vc(H) = O( dK log(dK) ).
In the next chapter, we will further develop the simple linear model. This linear
model is the building block of many other models, such as neural networks.
The results of this problem show how to bound the VC dimension of the more
complex models built in this manner.
Problem 2.20    There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ.
(a) Original VC-bound:
    ε ≤ sqrt( (8/N) ln( 4 m_H(2N) / δ ) ).
(b) Rademacher Penalty Bound:
    ε ≤ sqrt( 2 ln(2N m_H(N)) / N ) + sqrt( (2/N) ln(1/δ) ) + 1/N.
(c) Parrondo and Van den Broek:
    ε ≤ sqrt( (1/N) ( 2ε + ln( 6 m_H(2N) / δ ) ) ).
(d) Devroye:
    ε ≤ sqrt( (1/(2N)) ( 4ε(1 + ε) + ln( 4 m_H(N²) / δ ) ) ).
Note that (c) and (d) are implicit bounds in ε. Fix d_vc = 50 and δ = 0.05 and plot these bounds as a function of N. Which is best?
Problem 2.21    Assume the following theorem to hold
Theorem.    P[ (E_out(g) − E_in(g)) / sqrt(E_out(g)) > ε ] ≤ c · m_H(2N) exp(−ε²N/4),
where c is a constant that is a little bigger than 6.
This bound is useful because sometimes what we care about is not the absolute generalization error but instead a relative generalization error (one can imagine that a generalization error of 0.01 is more significant when E_out = 0.01 than when E_out = 0.5). Convert this to a generalization bound by showing that with probability at least 1 − δ,
    E_out(g) ≤ E_in(g) + (ξ/2) [ 1 + sqrt( 1 + 4 E_in(g)/ξ ) ],
where ξ = (4/N) ln( c · m_H(2N) / δ ).
Problem 2.22    When there is noise in the data, E_out(g^(D)) = E_{x,y}[ (g^(D)(x) − y(x))² ], where y(x) = f(x) + ε. If ε is a zero-mean noise random variable with variance σ², show that the bias-variance decomposition becomes
    E_D[ E_out(g^(D)) ] = σ² + bias + var.
Problem 2.23    Consider the learning problem in Example 2.8, where the input space is X = [−1, +1], the target function is f(x) = sin(πx), and the input probability distribution is uniform on X. Assume that the training set D has only two data points (picked independently), and that the learning algorithm picks the hypothesis that minimizes the in-sample mean squared error. In this problem, we will dig deeper into this case.
For each of the following learning models, find (analytically or numerically) (i) the best hypothesis that approximates f in the mean-squared-error sense (assume that f is known for this part), (ii) the expected value (with respect to D) of the hypothesis that the learning algorithm produces, and (iii) the expected out-of-sample error and its bias and var components.
(a) The learning model consists of all hypotheses of the form h(x) = ax + b (if you need to deal with the infinitesimal-probability case of two identical data points, choose the hypothesis tangential to f).
(b) The learning model consists of all hypotheses of the form h(x) = ax. This case was not covered in Example 2.8.
(c) The learning model consists of all hypotheses of the form h(x) = b.
Problem 2.24    Consider a simplified learning scenario. Assume that the input dimension is one. Assume that the input variable x is uniformly distributed in the interval [−1, 1]. The data set consists of 2 points {x_1, x_2} and assume that the target function is f(x) = x². Thus, the full data set is D = {(x_1, x_1²), (x_2, x_2²)}. The learning algorithm returns the line fitting these two points as g (H consists of functions of the form h(x) = ax + b). We are interested in the test performance (E[E_out]) of our learning system with respect to the squared error measure, the bias and the var.
(a) Give the analytic expression for the average function ḡ(x).
(b) Describe an experiment that you could run to determine (numerically) ḡ(x), E[E_out], bias, and var.
(c) Run your experiment and report the results. Compare E[E_out] (expectation is with respect to data sets) with bias + var. Provide a plot of your ḡ(x) and f(x) (on the same plot).
(d) Compute analytically what E[E_out], bias and var should be.
Chapter 3
The Linear Model
We often wonder how to draw a line between two categories; right versus wrong, personal versus professional life, useful email versus spam, to name a few. A line is intuitively our first choice for a decision boundary. In learning, as in life, a line is also a good first choice.
In Chapter 1, we (and the machine ☺) learned a procedure to 'draw a line' between two categories based on data (the perceptron learning algorithm). We started by taking the hypothesis set H that included all possible lines (actually hyperplanes). The algorithm then searched for a good line in H by iteratively correcting the errors made by the current candidate line, in an attempt to improve E_in. As we saw in Chapter 2, the linear model - the set of lines - has a small VC dimension and so is able to generalize well from E_in to E_out.
The aim of this chapter is to further develop the basic linear model into a powerful tool for learning from data. We branch into three important problems: the classification problem that we have seen and two other important problems called regression and probability estimation. The three problems come with different but related algorithms, and cover a lot of territory in learning from data. As a rule of thumb, when faced with learning problems, it is generally a winning strategy to try a linear model first.
3.1 Linear Classification
The linear model for classifying data into two classes uses a hypothesis set of linear classifiers, where each h has the form
    h(x) = sign(w^T x),
for some column vector w ∈ R^{d+1}, where d is the dimensionality of the input space, and the added coordinate x_0 = 1 corresponds to the bias 'weight' w_0 (recall that the input space X = {1} × R^d is considered d-dimensional since the added coordinate x_0 = 1 is fixed). We will use h and w interchangeably
to refer to the hypothesis when the context is clear. When we left Chapter 1, we had two basic criteria for learning:
1. Can we make sure that E_out(g) is close to E_in(g)? This ensures that what we have learned in sample will generalize out of sample.
2. Can we make E_in(g) small? This ensures that what we have learned in sample is a good hypothesis.
The first criterion was studied in Chapter 2. Specifically, the VC dimension of the linear model is only d + 1 (Exercise 2.4). Using the VC generalization bound (2.12), and the bound (2.10) on the growth function in terms of the VC dimension, we conclude that with high probability,
    E_out(g) = E_in(g) + O( sqrt( (d/N) ln N ) ).    (3.1)
Thus, when N is sufficiently large, E_in and E_out will be close to each other (see the definition of O(·) in the Notation table), and the first criterion for learning is fulfilled.
The second criterion, making sure that E_in is small, requires first and foremost that there is some linear hypothesis that has small E_in. If there isn't such a linear hypothesis, then learning certainly can't find one. So, let's suppose for the moment that there is a linear hypothesis with small E_in. In fact, let's suppose that the data is linearly separable, which means there is some hypothesis w* with E_in(w*) = 0. We will deal with the case when this is not true shortly.
In Chapter 1, we introduced the perceptron learning algorithm (PLA). Start with an arbitrary weight vector w(0). Then, at every time step t ≥ 0, select any misclassified data point (x(t), y(t)), and update w(t) as follows:
    w(t + 1) = w(t) + y(t) x(t).
The intuition is that the update is attempting to correct the error in classifying x(t). The remarkable thing is that this incremental approach of learning based on one data point at a time works. As discussed in Problem 1.3, it can be proved that the PLA will eventually stop updating, ending at a solution w_PLA with E_in(w_PLA) = 0. Although this result applies to a restricted setting (linearly separable data), it is a significant step. The PLA is clever - it doesn't naively test every linear hypothesis to see if it (the hypothesis) separates the data; that would take infinitely long. Using an iterative approach, the PLA manages to search an infinite hypothesis set and output a linear separator in (provably) finite time.
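In code, the update rule takes only a few lines. The following is a minimal sketch in Python (not the book's implementation); it assumes the data matrix X already carries the constant coordinate x_0 = 1 and that the data is linearly separable, so the loop eventually terminates.

    import numpy as np

    def pla(X, y, w=None):
        """Perceptron learning algorithm (sketch).
        X : (N, d+1) array whose first column is the constant x0 = 1.
        y : (N,) array of +1/-1 labels, assumed linearly separable.
        Returns a weight vector with zero in-sample error."""
        N, d1 = X.shape
        w = np.zeros(d1) if w is None else w.copy()
        while True:
            # indices of the currently misclassified points
            mis = np.where(np.sign(X @ w) != y)[0]
            if len(mis) == 0:              # E_in = 0, so stop
                return w
            n = np.random.choice(mis)      # pick any misclassified point
            w += y[n] * X[n]               # w(t+1) = w(t) + y(t) x(t)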
As far as PLA is concerned, linear separability is a property of the data, not the target. A linearly separable D could have been generated either from a linearly separable target, or (by chance) from a target that is not linearly separable. The convergence proof of PLA guarantees that the algorithm will
Figure 3.1: Data sets that are not linearly separable but are (a) linearly separable after discarding a few examples, or (b) separable by a more sophisticated curve.
work in both these cases, and produce a hypothesis with E_in = 0. Further, in both cases, you can be confident that this performance will generalize well out of sample, according to the VC bound.
Exercise 3.1
Will PLA ever stop updating if the data is not linearly separable?
3.1.1 Non-Separable Data
We now address the case where the data is not linearly separable. Figure 3.1 shows two data sets that are not linearly separable. In Figure 3.1(a), the data becomes linearly separable after the removal of just two examples, which could be considered noisy examples or outliers. In Figure 3.1(b), the data can be separated by a circle rather than a line. In both cases, there will always be a misclassified training example if we insist on using a linear hypothesis, and hence PLA will never terminate. In fact, its behavior becomes quite unstable, and can jump from a good perceptron to a very bad one within one update; the quality of the resulting E_in cannot be guaranteed. In Figure 3.1(a), it seems appropriate to stick with a line, but to somehow tolerate noise and output a hypothesis with a small E_in, not necessarily E_in = 0. In Figure 3.1(b), the linear model does not seem to be the correct model in the first place, and we will discuss a technique called nonlinear transformation for this situation in Section 3.4.
The situation in Figure 3.1(a) is actually encountered very often: even though a linear classifier seems appropriate, the data may not be linearly separable because of outliers or noise. To find a hypothesis with the minimum E_in, we need to solve the combinatorial optimization problem:
    min_{w ∈ R^{d+1}}  (1/N) Σ_{n=1}^{N} [[ sign(w^T x_n) ≠ y_n ]],    (3.2)
where the quantity being minimized is E_in(w).
The difficulty in solving this problem arises from the discrete nature of both sign(·) and [[·]]. In fact, minimizing E_in(w) in (3.2) in the general case is known to be NP-hard, which means there is no known efficient algorithm for it, and if you discovered one, you would become really, really famous ☺. Thus, one has to resort to approximately minimizing E_in.
One approach for getting an approximate solution is to extend PLA through a simple modification into what is called the pocket algorithm. Essentially, the pocket algorithm keeps 'in its pocket' the best weight vector encountered up to iteration t in PLA. At the end, the best weight vector will be reported as the final hypothesis. This simple algorithm is shown below.
The pocket algorithm:
1: Set the pocket weight vector ŵ to w(0) of PLA.
2: for t = 0, ..., T − 1 do
3:     Run PLA for one update to obtain w(t + 1).
4:     Evaluate E_in(w(t + 1)).
5:     If w(t + 1) is better than ŵ in terms of E_in, set ŵ to w(t + 1).
6: Return ŵ.
The original PLA only checks some of the examples using w(t) to identify (x(t), y(t)) in each iteration, while the pocket algorithm needs an additional step that evaluates all examples using w(t + 1) to get E_in(w(t + 1)). The additional step makes the pocket algorithm much slower than PLA. In addition, there is no guarantee for how fast the pocket algorithm can converge to a good E_in. Nevertheless, it is a useful algorithm to have on hand because of its simplicity. Other, more efficient approaches for obtaining good approximate solutions have been developed based on different optimization techniques, as shown later in this chapter.
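A minimal sketch of the pocket idea, layered on single PLA updates, might look as follows; this is illustrative only, and it assumes X carries the x_0 = 1 coordinate and y holds ±1 labels.

    import numpy as np

    def in_sample_error(w, X, y):
        # fraction of misclassified points
        return np.mean(np.sign(X @ w) != y)

    def pocket(X, y, T=1000):
        """Pocket algorithm (sketch): run PLA updates for T steps,
        keeping the best weight vector seen so far 'in the pocket'."""
        w = np.zeros(X.shape[1])               # w(0)
        w_pocket, best_err = w.copy(), in_sample_error(w, X, y)
        for _ in range(T):
            mis = np.where(np.sign(X @ w) != y)[0]
            if len(mis) == 0:
                return w                       # data separated, nothing to improve
            n = np.random.choice(mis)
            w = w + y[n] * X[n]                # one PLA update
            err = in_sample_error(w, X, y)     # extra pass over all examples
            if err < best_err:                 # better than the pocket weights?
                best_err, w_pocket = err, w.copy()
        return w_pocket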
Exercise 3.2
Take d = 2 and create a data set D of size N = 100 that is not linearly separable. You can do so by first choosing a random line in the plane as your target function and the inputs x_n of the data set as random points in the plane. Then, evaluate the target function on each x_n to get the corresponding output y_n. Finally, flip the labels of N/10 randomly selected y_n's and the data set will likely become non-separable.
Now, try the pocket algorithm on your data set using T = 1,000 iterations. Repeat the experiment 20 times. Then, plot the average E_in(w(t)) and the average E_in(ŵ) (which is also a function of t) on the same figure and see how they behave when t increases. Similarly, use a test set of size 1,000 and plot a figure to show how E_out(w(t)) and E_out(ŵ) behave.
Example 3.1 (Handwritten digit recognition). We sample some digits from the US Postal Service Zip Code Database. These 16 × 16 pixel images are preprocessed from the scanned handwritten zip codes. The goal is to recognize the digit in each image. We alluded to this task in part (b) of Exercise 1.1. A quick look at the images reveals that this is a non-trivial task (even for a human), and typical human E_out is about 2.5%. Common confusion occurs between the digits {4, 9} and {2, 7}. A machine-learned hypothesis which can achieve such an error rate would be highly desirable.
[Images of sample handwritten digits from the zip-code database.]
Let's first decompose the big task of separating ten digits into smaller tasks of separating two of the digits. Such a decomposition approach from multiclass to binary classification is commonly used in many learning algorithms. We will focus on digits {1, 5} for now. A human approach to determining the digit corresponding to an image is to look at the shape (or other properties) of the black pixels. Thus, rather than carrying all the information in the 256 pixels, it makes sense to summarize the information contained in the image into a few features. Let's look at two important features here: intensity and symmetry. Digit 5 usually occupies more black pixels than digit 1, and hence the average pixel intensity of digit 5 is higher. On the other hand, digit 1 is symmetric while digit 5 is not. Therefore, if we define asymmetry as the average absolute difference between an image and its flipped versions, and symmetry as the negation of asymmetry, digit 1 would result in a higher symmetry value. A scatter plot for these intensity and symmetry features for some of the digits is shown next.
While the digits can be roughly separated by a line in the plane representing these two features, there are poorly written digits (such as the '5' depicted in the top-left corner) that prevent a perfect linear separation.
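A rough sketch of how the two features might be computed from a 16 × 16 image is given below; the flip axis and the pixel scaling are assumptions for illustration, not the preprocessing actually used for the data set.

    import numpy as np

    def digit_features(img):
        """img: (16, 16) array of pixel intensities in [0, 1] (assumed).
        Returns (intensity, symmetry) as described in the text."""
        intensity = img.mean()                     # average pixel intensity
        flipped = np.fliplr(img)                   # horizontal flip (an assumption)
        asymmetry = np.abs(img - flipped).mean()   # average absolute difference
        symmetry = -asymmetry                      # symmetry = negation of asymmetry
        return intensity, symmetry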
We now run PLA and pocket on the data set and see what happens. Since the data set is not linearly separable, PLA will not stop updating. In fact, as can be seen in Figure 3.2(a), its behavior can be quite unstable. When it is forcibly terminated at iteration 1,000, PLA gives a line that has a poor E_in = 2.24% and E_out = 6.37%. On the other hand, if the pocket algorithm is applied to the same data set, as shown in Figure 3.2(b), we can obtain a line that has a better E_in = 0.45% and a better E_out = 1.89%. □
3.2 Linear Regression
Linear regression is another useful linear model that applies to real-valued target functions.¹ It has a long history in statistics, where it has been studied in great detail, and has various applications in social and behavioral sciences. Here, we discuss linear regression from a learning perspective, where we derive the main results with minimal assumptions.
Let us revisit our application in credit approval, this time considering a regression problem rather than a classification problem. Recall that the bank has customer records that contain information fields related to personal credit, such as annual salary, years in residence, outstanding loans, etc. Such variables can be used to learn a linear classifier to decide on credit approval. Instead of just making a binary decision (approve or not), the bank also wants to set a proper credit limit for each approved customer. Credit limits are traditionally determined by human experts. The bank wants to automate this task, as it did with credit approval.
¹Regression, a term inherited from earlier work in statistics, means y is real-valued.
Figure 3.2: Comparison of two linear classification algorithms for separating digits 1 and 5. E_in and E_out are plotted versus iteration number and below that is the learned hypothesis g. (a) A version of the PLA which selects a random training example and updates w if that example is misclassified (hence the flat regions when no update is made). This version avoids searching all the data at every iteration. (b) The pocket algorithm.
This is a regression learning problem. The bank uses historical records to construct a data set D of examples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where x_n is customer information and y_n is the credit limit set by one of the human experts in the bank. Note that y_n is now a real number (positive in this case) instead of just a binary value ±1. The bank wants to use learning to find a hypothesis g that replicates how human experts determine credit limits.
Since there is more than one human expert, and since each expert may not be perfectly consistent, our target will not be a deterministic function y = f(x). Instead, it will be a noisy target formalized as a distribution of the random variable y that comes from the different views of different experts as well as the variation within the views of each expert. That is, the label y_n comes from some distribution P(y | x) instead of a deterministic function f(x). Nonetheless, as we discussed in previous chapters, the nature of the problem is not changed. We have an unknown distribution P(x, y) that generates
each (x_n, y_n), and we want to find a hypothesis g that minimizes the error between g(x) and y with respect to that distribution.
The choice of a linear model for this problem presumes that there is a linear combination of the customer information fields that would properly approximate the credit limit as determined by human experts. If this assumption does not hold, we cannot achieve a small error with a linear model. We will deal with this situation when we discuss nonlinear transformation later in the chapter.
3.2.1 The Algorithm
The linear regression algorithm is based on minimizing the squared error between h(x) and y:²
    E_out(h) = E[ (h(x) − y)² ],
where the expected value is taken with respect to the joint probability distribution P(x, y). The goal is to find a hypothesis that achieves a small E_out(h). Since the distribution P(x, y) is unknown, E_out(h) cannot be computed. Similar to what we did in classification, we resort to the in-sample version instead,
    E_in(h) = (1/N) Σ_{n=1}^{N} ( h(x_n) − y_n )².
In linear regression, h takes the form of a linear combination of the components of x. That is,
    h(x) = Σ_{i=0}^{d} w_i x_i = w^T x,
where x_0 = 1 and x ∈ {1} × R^d as usual, and w ∈ R^{d+1}. For the special case of linear h, it is very useful to have a matrix representation of E_in(h). First, define the data matrix X ∈ R^{N×(d+1)} to be the N × (d+1) matrix whose rows are the inputs x_n as row vectors, and define the target vector y ∈ R^N to be the column vector whose components are the target values y_n. The in-sample error is a function of w and the data X, y:
    E_in(w) = (1/N) Σ_{n=1}^{N} ( w^T x_n − y_n )²
            = (1/N) ‖Xw − y‖²    (3.3)
            = (1/N) ( w^T X^T X w − 2 w^T X^T y + y^T y ),    (3.4)
where ‖·‖ is the Euclidean norm of a vector, and (3.3) follows because the nth component of the vector Xw − y is exactly w^T x_n − y_n. The linear regression
²The term 'linear regression' has been historically confined to squared error measures.
Figure 3.3: The solution hypothesis (in blue) of the linear regression algorithm in one dimension (a line) and in two dimensions. The sum of squared errors is minimized.
algorithm is derived by minimizing E_in(w) over all possible w ∈ R^{d+1}, as formalized by the following optimization problem:
    w_lin = argmin_{w ∈ R^{d+1}} E_in(w).    (3.5)
Figure 3.3 illustrates the solution in one and two dimensions. Since Equation (3.4) implies that E_in(w) is differentiable, we can use standard matrix calculus to find the w that minimizes E_in(w) by requiring that the gradient of E_in with respect to w is the zero vector, i.e., ∇E_in(w) = 0. The gradient is a (column) vector whose ith component is [∇E_in(w)]_i = ∂E_in(w)/∂w_i. By explicitly computing these components, the reader can verify the following gradient identities,
    ∇_w(w^T A w) = (A + A^T) w,    ∇_w(w^T b) = b.
These identities are the matrix analog of ordinary differentiation of quadratic and linear functions. To obtain the gradient of E_in, we take the gradient of each term in (3.4) to obtain
    ∇E_in(w) = (2/N) ( X^T X w − X^T y ).
Note that both w and ∇E_in(w) are column vectors. Finally, to get ∇E_in(w) to be 0, one should solve for w that satisfies
    X^T X w = X^T y.
If X^T X is invertible, w = X†y, where X† = (X^T X)^{−1} X^T is the pseudo-inverse of X. The resulting w is the unique optimal solution to (3.5). If X^T X is not
invertible, a pseudo-inverse can still be defined, but the solution will not be unique (see Problem 3.15). In practice, X^T X is invertible in most of the cases since N is often much bigger than d + 1, so there will likely be d + 1 linearly independent vectors x_n. We have thus derived the following linear regression algorithm.
Linear regression algorithm:
1: Construct the matrix X and the vector y from the data set (x_1, y_1), ..., (x_N, y_N), where each x includes the x_0 = 1 bias coordinate, as follows:
    X = [ x_1^T ; x_2^T ; ... ; x_N^T ]  (the N × (d+1) input data matrix),
    y = [ y_1 ; y_2 ; ... ; y_N ]  (the target vector).
2: Compute the pseudo-inverse X† of the matrix X. If X^T X is invertible,
    X† = (X^T X)^{−1} X^T.
3: Return w_lin = X†y.
This algorithm is sometimes referred to as ordinary least squares (OLS). It may seem that, compared with the perceptron learning algorithm, linear regression doesn't really look like 'learning', in the sense that the hypothesis w_lin comes from an analytic solution (matrix inversion and multiplications) rather than from iterative learning steps. Well, as long as the hypothesis w_lin has a decent out-of-sample error, then learning has occurred. Linear regression is a rare case where we have an analytic formula for learning that is easy to evaluate. This is one of the reasons why the technique is so widely used. It should be noted that there are methods for computing the pseudo-inverse directly without inverting a matrix, and that these methods are numerically more stable than matrix inversion.
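A minimal numpy sketch of the algorithm follows. Using a least-squares routine rather than explicitly inverting X^T X reflects the remark above about numerically stable computation; the helper name and the assumption that X already contains the x_0 = 1 column are ours.

    import numpy as np

    def linear_regression(X, y):
        """One-shot linear regression (sketch).
        X : (N, d+1) data matrix with the constant column x0 = 1.
        y : (N,) real-valued targets.
        Returns w_lin = X^+ y."""
        # lstsq solves min_w ||Xw - y||^2 via a stable factorization,
        # equivalent to pinv(X) @ y but without forming (X^T X)^{-1}.
        w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w_lin

    # Usage sketch (X0 is a hypothetical raw data matrix without the bias column):
    # X = np.column_stack([np.ones(len(X0)), X0])
    # w = linear_regression(X, y)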
Linear regression has been analyzed in great detail in statistics. We would like to mention one of the analysis tools here since it relates to in-sample and out-of-sample errors, and that is the hat matrix H. Here is how H is defined. The linear regression weight vector w_lin is an attempt to map the inputs X to the outputs y. However, w_lin does not produce y exactly, but produces an estimate
    ŷ = X w_lin,
which differs from y due to in-sample error. Substituting the expression for w_lin (assuming X^T X is invertible), we get
    ŷ = X (X^T X)^{−1} X^T y.
Therefore, the estimate ŷ is a linear transformation of the actual y through matrix multiplication with H, where
    H = X (X^T X)^{−1} X^T.    (3.6)
Since ŷ = Hy, the matrix H 'puts a hat' on y, hence the name. The hat matrix is a very special matrix. For one thing, H² = H, which can be verified using the above expression for H. This and other properties of H will facilitate the analysis of in-sample and out-of-sample errors of linear regression.
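These properties are easy to check numerically on a small random data matrix, as in the following illustrative sketch (ours, not part of the text).

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 20, 3
    X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])

    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix, Equation (3.6)

    print(np.allclose(H, H.T))                 # symmetric
    print(np.allclose(H @ H, H))               # H^2 = H (idempotent)
    print(np.isclose(np.trace(H), d + 1))      # trace(H) = d + 1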
Exercise 3.3
Consider the hat matrix H = X(X^T X)^{−1} X^T, where X is an N by d + 1 matrix, and X^T X is invertible.
(a) Show that H is symmetric.
(b) Show that H^K = H for any positive integer K.
(c) If I is the identity matrix of size N, show that (I − H)^K = I − H for any positive integer K.
(d) Show that trace(H) = d + 1, where the trace is the sum of diagonal elements. [Hint: trace(AB) = trace(BA).]
3.2.2 Generalization Issues
Linear regression looks for the optimal weight vector in terms of the in-sample error E_in, which leads to the usual generalization question: Does this guarantee decent out-of-sample error E_out? The short answer is yes. There is a regression version of the VC generalization bound (3.1) that similarly bounds E_out. In the case of linear regression in particular, there are also exact formulas for the expected E_out and E_in that can be derived under simplifying assumptions. The general form of the result is
    E_out(g) = E_in(g) + O(d/N),
where E_out(g) and E_in(g) are the expected values. This is comparable to the classification bound in (3.1).
Exercise 3.4
Consider a noisy target y = w*^T x + ε for generating the data, where ε is a noise term with zero mean and σ² variance, independently generated for every example (x, y). The expected error of the best possible linear fit to this target is thus σ².
For the data D = {(x_1, y_1), ..., (x_N, y_N)}, denote the noise in y_n as ε_n and let ε = [ε_1, ε_2, ..., ε_N]^T; assume that X^T X is invertible. By following
the steps below, show that the expected in-sample error of linear regression with respect to D is given by
    E_D[ E_in(w_lin) ] = σ² ( 1 − (d + 1)/N ).
(a) Show that the in-sample estimate of y is given by ŷ = Xw* + Hε.
(b) Show that the in-sample error vector y − ŷ can be expressed by a matrix times ε. What is the matrix?
(c) Express E_in(w_lin) in terms of ε using (b), and simplify the expression using Exercise 3.3(c).
(d) Prove that E_D[ E_in(w_lin) ] = σ²(1 − (d+1)/N) using (c) and the independence of ε_1, ..., ε_N. [Hint: The sum of the diagonal elements of a matrix (the trace) will play a role. See Exercise 3.3(d).]
For the expected out-of-sample error, we take a special case which is easy to analyze. Consider a test data set D_test = {(x_1, y'_1), ..., (x_N, y'_N)}, which shares the same input vectors x_n with D but with a different realization of the noise terms. Denote the noise in y'_n as ε'_n and let ε' = [ε'_1, ε'_2, ..., ε'_N]^T. Define E_test(w_lin) to be the average squared error on D_test.
(e) Prove that E_{D,ε'}[ E_test(w_lin) ] = σ²(1 + (d+1)/N).
The special test error E_test is a very restricted case of the general out-of-sample error. Some detailed analysis shows that similar results can be obtained for the general case, as shown in Problem 3.11.
Figure 3.4 illustrates the learning curve of linear regression under the assumptions of Exercise 3.4. The best possible linear fit has expected error σ². The expected in-sample error is smaller, equal to σ²(1 − (d+1)/N) for N ≥ d + 1. The learned linear fit has eaten into the in-sample noise as much as it could with the d + 1 degrees of freedom that it has at its disposal. This occurs because the fitting cannot distinguish the noise from the 'signal.' On the other hand, the expected out-of-sample error is σ²(1 + (d+1)/N), which is more than the unavoidable error of σ². The additional error reflects the drift in w_lin due to fitting the in-sample noise.
3.3 Logistic Regression
The core of the linear model is the 'signal' s = w^T x that combines the input variables linearly. We have seen two models based on this signal, and we are now going to introduce a third. In linear regression, the signal itself is taken as the output, which is appropriate if you are trying to predict a real response that could be unbounded. In linear classification, the signal is thresholded at zero to produce a ±1 output, appropriate for binary decisions. A third possibility, which has wide application in practice, is to output a probability,
a value between 0 and 1. Our new model is called logistic regression. It has similarities to both previous models, as the output is real (like regression) but bounded (like classification).
Example 3.2 (Prediction of heart attacks). Suppose we want to predict the occurrence of heart attacks based on a person's cholesterol level, blood pressure, age, weight, and other factors. Obviously, we cannot predict a heart attack with any certainty, but we may be able to predict how likely it is to occur given these factors. Therefore, an output that varies continuously between 0 and 1 would be a more suitable model than a binary decision. The closer y is to 1, the more likely that the person will have a heart attack. □
3.3.1 Predicting a Probability
Linear classification uses a hard threshold on the signal s = w^T x,
    h(x) = sign(w^T x),
while linear regression uses no threshold at all,
    h(x) = w^T x.
In our new model, we need something in between these two cases that smoothly restricts the output to the probability range [0, 1]. One choice that accomplishes this goal is the logistic regression model,
    h(x) = θ(w^T x),
where θ is the so-called logistic function θ(s) = e^s / (1 + e^s), whose output is between 0 and 1.
The output can be interpreted as a probability for a binary event (heart attack or no heart attack, digit '1' versus digit '5', etc.). Linear classification also deals with a binary event, but the difference is that the 'classification' in logistic regression is allowed to be uncertain, with intermediate values between 0 and 1 reflecting this uncertainty. The logistic function θ is referred to as a soft threshold, in contrast to the hard threshold in classification. It is also called a sigmoid because its shape looks like a flattened out 's'.
Exercise 3.5
Another popular soft threshold is the hyperbolic tangent
    tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s}).
(a) How is tanh related to the logistic function θ? [Hint: shift and scale]
(b) Show that tanh(s) converges to a hard threshold for large |s|, and converges to no threshold for small |s|. [Hint: Formalize the figure below.]
[Figure: tanh compared with the hard threshold and the linear (no-threshold) case.]
The specific formula of θ(s) will allow us to define an error measure for learning that has analytical and computational advantages, as we will see shortly. Let us first look at the target that logistic regression is trying to learn. The target is a probability, say of a patient being at risk for heart attack, that depends on the input x (the characteristics of the patient). Formally, we are trying to learn the target function
    f(x) = P[y = +1 | x].
The data does not give us the value of f explicitly. Rather, it gives us samples generated by this probability, e.g., patients who had heart attacks and patients who didn't. Therefore, the data is in fact generated by a noisy target P(y | x),
    P(y | x) = { f(x)        for y = +1;
                 1 − f(x)    for y = −1.    (3.7)
To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these noisy ±1 examples.
Error measure. The standard error measure e(h(x), y) used in logistic regression is based on the notion of likelihood: how 'likely' is it that we would get this output y from the input x if the target distribution P(y | x) was indeed captured by our hypothesis h(x)? Based on (3.7), that likelihood would be
    P(y | x) = { h(x)        for y = +1;
                 1 − h(x)    for y = −1.
We substitute for h(x) by its value θ(w^T x), and use the fact that 1 − θ(s) = θ(−s) (easy to verify) to get
    P(y | x) = θ(y w^T x).    (3.8)
One of our reasons for choosing the mathematical form θ(s) = e^s/(1 + e^s) is that it leads to this simple expression for P(y | x).
Since the data points (x_1, y_1), ..., (x_N, y_N) are independently generated, the probability of getting all the y_n's in the data set from the corresponding x_n's would be the product
    Π_{n=1}^{N} P(y_n | x_n).
The method of maximum likelihood selects the hypothesis h which maximizes this probability.³ We can equivalently minimize a more convenient quantity,
    −(1/N) ln( Π_{n=1}^{N} P(y_n | x_n) ) = (1/N) Σ_{n=1}^{N} ln( 1 / P(y_n | x_n) ),
since '−(1/N) ln(·)' is a monotonically decreasing function. Substituting with Equation (3.8), we would be minimizing
    (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n w^T x_n) )
with respect to the weight vector w. The fact that we are minimizing this quantity allows us to treat it as an 'error measure.' Substituting the functional form for θ(y_n w^T x_n) produces the in-sample error measure for logistic regression,
    E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} ).    (3.9)
The implied pointwise error measure is e(h(x_n), y_n) = ln(1 + e^{−y_n w^T x_n}). Notice that this error measure is small when y_n w^T x_n is large and positive, which would imply that sign(w^T x_n) = y_n. Therefore, as our intuition would expect, the error measure encourages w to 'classify' each x_n correctly.
³Although the method of maximum likelihood is intuitively plausible, its rigorous justification as an inference tool continues to be discussed in the statistics community.
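In code, the error measure of Equation (3.9) is essentially a one-liner; the sketch below is illustrative and assumes X carries the x_0 = 1 coordinate and y holds ±1 labels.

    import numpy as np

    def logistic_ein(w, X, y):
        """In-sample cross-entropy error, Equation (3.9):
        E_in(w) = (1/N) sum_n ln(1 + exp(-y_n w^T x_n))."""
        margins = y * (X @ w)                      # y_n w^T x_n for every n
        # np.logaddexp(0, -m) computes ln(1 + e^{-m}) without overflow
        return np.mean(np.logaddexp(0.0, -margins))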
Exercise 3.6 [Cross-entropy error measure]
(a) More generally, if we are learning from ±1 data to predict a noisy target P(y | x) with candidate hypothesis h, show that the maximum likelihood method reduces to the task of finding h that minimizes
    E_in(w) = (1/N) Σ_{n=1}^{N} ( [[y_n = +1]] ln(1/h(x_n)) + [[y_n = −1]] ln(1/(1 − h(x_n))) ).
(b) For the case h(x) = θ(w^T x), argue that minimizing the in-sample error in part (a) is equivalent to minimizing the one in (3.9).
For two probability distributions {p, 1 − p} and {q, 1 − q} with binary outcomes, the cross-entropy (from information theory) is
    p log(1/q) + (1 − p) log(1/(1 − q)).
The in-sample error in part (a) corresponds to a cross-entropy error measure on the data point (x_n, y_n), with p = [[y_n = +1]] and q = h(x_n).
For linear classification, we saw that minimizing E_in for the perceptron is a combinatorial optimization problem; to solve it, we introduced a number of algorithms such as the perceptron learning algorithm and the pocket algorithm. For linear regression, we saw that training can be done using the analytic pseudo-inverse algorithm for minimizing E_in by setting ∇E_in(w) = 0. These algorithms were developed based on the specific form of linear classification or linear regression, so none of them would apply to logistic regression.
To train logistic regression, we will take an approach similar to linear regression in that we will try to set ∇E_in(w) = 0. Unfortunately, unlike the case of linear regression, the mathematical form of the gradient of E_in for logistic regression is not easy to manipulate, so an analytic solution is not feasible.
Exercise 3.7
For logistic regression, show that
    ∇E_in(w) = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T x_n})
             = (1/N) Σ_{n=1}^{N} −y_n x_n θ(−y_n w^T x_n).
Argue that a 'misclassified' example contributes more to the gradient than a correctly classified one.
Instead of analytically setting the gradient to zero, we will iteratively set it to zero. To do so, we will introduce a new algorithm, gradient descent. Gradient
descent is a very general algorithm that can be used to train many other learning models with smooth error measures. For logistic regression, gradient descent has particularly nice properties.
3.3.2 Gradient Descent
Gradient descent is a general technique for minimizing a twice-differentiable function, such as E_in(w) in logistic regression. A useful physical analogy of gradient descent is a ball rolling down a hilly surface. If the ball is placed on a hill, it will roll down, coming to rest at the bottom of a valley. The same basic idea underlies gradient descent. E_in(w) is a 'surface' in a high-dimensional space. At step 0, we start somewhere on this surface, at w(0), and try to roll down this surface, thereby decreasing E_in. One thing which you immediately notice from the physical analogy is that the ball will not necessarily come to rest in the lowest valley of the entire surface. Depending on where you start the ball rolling, you will end up at the bottom of one of the valleys - a local minimum. In general, the same applies to gradient descent. Depending on your starting weights, the path of descent will take you to a local minimum in the error surface.
A particular advantage for logistic regression with the cross-entropy error is that the picture looks much nicer. There is only one valley! So, it does not matter where you start your ball rolling, it will always roll down to the same (unique) global minimum. This is a consequence of the fact that E_in(w) is a convex function of w, a mathematical property that implies a single 'valley.' This means that gradient descent will not be trapped in local minima when minimizing such convex error measures.⁴
Let's now determine how to 'roll' down the E_in-surface. We would like to take a step in the direction of steepest descent, to gain the biggest bang for our buck. Suppose that we take a small step of size η in the direction of a unit vector v̂. The new weights are w(0) + ηv̂. Since η is small, using the Taylor expansion to first order, we compute the change in E_in as
    ΔE_in = E_in(w(0) + ηv̂) − E_in(w(0))
          = η ∇E_in(w(0))^T v̂ + O(η²)
          ≥ −η ‖∇E_in(w(0))‖,
⁴In fact, the squared in-sample error in linear regression is also convex, which is why the analytic solution found by the pseudo-inverse is guaranteed to have optimal in-sample error.
where we have ignored the small term O(η²). Since v̂ is a unit vector, equality holds if and only if
    v̂ = − ∇E_in(w(0)) / ‖∇E_in(w(0))‖.    (3.10)
This direction, specified by v̂, leads to the largest decrease in E_in for a given step size η.
Exercise 3.8
The claim that v̂ is the direction which gives largest decrease in E_in only holds for small η. Why?
There is nothing to prevent us from continuing to take steps of size η, re-evaluating the direction v̂_t at each iteration t = 0, 1, 2, .... How large a step should one take at each iteration? This is a good question, and to gain some insight, let's look at the following examples.
[Figure: three scenarios for the step size - η too small; η too large; a variable η that is just right.]
A fixed step size (if it is too small) is inefficient when you are far from the local minimum. On the other hand, too large a step size when you are close to the minimum leads to bouncing around, possibly even increasing E_in. Ideally, we would like to take large steps when far from the minimum to get in the right ballpark quickly, and then small (more careful) steps when close to the minimum. A simple heuristic can accomplish this: far from the minimum, the norm of the gradient is typically large, and close to the minimum, it is small. Thus, we could set η_t = η‖∇E_in‖ to obtain the desired behavior for the variable step size; choosing the step size proportional to the norm of the gradient will also conveniently cancel the term normalizing the unit vector v̂ in Equation (3.10), leading to the fixed learning rate gradient descent algorithm for minimizing E_in (with redefined η):
Fixed learning rate gradient descent:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:     Compute the gradient g_t = ∇E_in(w(t)).
4:     Set the direction to move, v_t = −g_t.
5:     Update the weights: w(t + 1) = w(t) + ηv_t.
6:     Iterate to the next step until it is time to stop.
7: Return the final weights.
In the algorithm, v_t is a direction that is no longer restricted to unit length. The parameter η (the learning rate) has to be specified. A typically good choice for η is around 0.1 (a purely practical observation). To use gradient descent, one must compute the gradient. This can be done explicitly for logistic regression (see Exercise 3.7).
Example 3.3. Gradient descent is a general algorithm for minimizing twice-differentiable functions. We can apply it to the logistic regression in-sample error to return weights that approximately minimize
    E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} ).
Logistic regression algorithm:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, ... do
3:     Compute the gradient
           g_t = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T(t) x_n}).
4:     Set the direction to move, v_t = −g_t.
5:     Update the weights: w(t + 1) = w(t) + ηv_t.
6:     Iterate to the next step until it is time to stop.
7: Return the final weights w.
□
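A minimal Python rendering of this algorithm is sketched below; the iteration cap and learning rate are assumed values standing in for a full termination rule, and the code is illustrative rather than a reference implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))    # logistic function theta(s)

    def logistic_regression_gd(X, y, eta=0.1, max_iters=10000):
        """Batch gradient descent for logistic regression (sketch).
        X : (N, d+1) with bias column, y : (N,) labels in {-1, +1}."""
        N, d1 = X.shape
        w = np.zeros(d1)                    # w(0); zeros work well here
        for _ in range(max_iters):
            # g_t = -(1/N) sum_n y_n x_n theta(-y_n w^T x_n)   (Exercise 3.7)
            margins = y * (X @ w)
            g = -(X * (y * sigmoid(-margins))[:, None]).mean(axis=0)
            w = w - eta * g                 # move opposite to the gradient
        return w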
Initialization and termination. We have two more loose ends to tie: the first is how to choose w(0), the initial weights, and the second is how to set the criterion for "... until it is time to stop" in step 6 of the gradient descent algorithm. In some cases, such as logistic regression, initializing the weights w(0) as zeros works well. However, in general, it is safer to initialize the weights randomly, so as to avoid getting stuck on a perfectly symmetric hilltop. Choosing each weight independently from a Normal distribution with zero mean and small variance usually works well in practice.
That takes care of initialization, so we now move on to termination. How do we decide when to stop? Termination is a non-trivial topic in optimization. One simple approach, as we encountered in the pocket algorithm, is to set an upper bound on the number of iterations, where the upper bound is typically in the thousands, depending on the amount of training time we have. The problem with this approach is that there is no guarantee on the quality of the final weights.
Another plausible approach is based on the gradient being zero at any minimum. A natural termination criterion would be to stop once ‖g_t‖ drops below a certain threshold. Eventually this must happen, but we do not know when it will happen. For logistic regression, a combination of the two conditions (setting a large upper bound for the number of iterations, and a small lower bound for the size of the gradient) usually works well in practice.
There is a problem with relying solely on the size of the gradient to stop, which is that you might stop prematurely, as illustrated on the right. When the iteration reaches a relatively flat region (which is more common than you might suspect), the algorithm will prematurely stop when we may want to continue. So one solution is to require that termination occurs only if the error change is small and the error itself is small. Ultimately a combination of termination criteria (a maximum number of iterations, marginal error improvement, coupled with a small value for the error itself) works reasonably well.
Example 3.4. By way of summarizing linear models, we revisit our old friend the credit example. If the goal is to decide whether to approve or deny, then we are in the realm of classification; if you want to assign an amount of credit line, then linear regression is appropriate; if you want to predict the probability that someone will default, use logistic regression.
The three linear models have their respective goals, error measures, and algorithms. Nonetheless, they not only share similar sets of linear hypotheses, but are in fact related in other ways. We would like to point out one important relationship: Both logistic regression and linear regression can be used in linear classification. Here is how.
Logistic regression produces a final hypothesis g(x) which is our estimate of P[y = +1 | x]. Such an estimate can easily be used for classification by
setting a threshold on g(x); a natural threshold is 1/2, which corresponds to classifying +1 if +1 is more likely. This choice for threshold corresponds to using the logistic regression weights as weights in the perceptron for classification. Not only can logistic regression weights be used for classification in this way, but they can also be used as a way to train the perceptron model. The perceptron learning problem (3.2) is a very hard combinatorial optimization problem. The convexity of E_in in logistic regression makes the optimization problem much easier to solve. Since the logistic function is a soft version of a hard threshold, the logistic regression weights should be good weights for classification using the perceptron.
A similar relationship exists between classification and linear regression. Linear regression can be used with any real-valued target function, which includes real values that are ±1. If w_lin^T x is fit to ±1 values, sign(w_lin^T x) will likely agree with these values and make good classification predictions. In other words, the linear regression weights w_lin, which are easily computed using the pseudo-inverse, are also an approximate solution for the perceptron model. The weights can be directly used for classification, or used as an initial condition for the pocket algorithm to give it a head start. □
Exercise 3.9
Consider pointwise error measures e_class(s, y) = [[y ≠ sign(s)]], e_sq(s, y) = (y − s)², and e_log(s, y) = ln(1 + exp(−ys)), where the signal s = w^T x.
(a) For y = +1, plot e_class, e_sq and (1/ln 2) e_log versus s, on the same plot.
(b) Show that e_class(s, y) ≤ e_sq(s, y), and hence that the classification error is upper bounded by the squared error.
(c) Show that e_class(s, y) ≤ (1/ln 2) e_log(s, y), and hence obtain an upper bound (up to a constant factor) using the logistic regression error.
These bounds indicate that minimizing the squared or logistic regression error should also decrease the classification error, which justifies using the weights returned by linear or logistic regression as approximations for classification.
Stochastic gradient descent. The version of gradient descent we have described so far is known as batch gradient descent - the gradient is computed for the error on the whole data set before a weight update is done. A sequential version of gradient descent known as stochastic gradient descent (SGD) turns out to be very efficient in practice. Instead of considering the full batch gradient on all N training data points, we consider a stochastic version of the gradient. First, pick a training data point (x_n, y_n) uniformly at random (hence the name 'stochastic'), and consider only the error on that data point
(in the case of logistic regression),
    e_n(w) = ln( 1 + e^{−y_n w^T x_n} ).
The gradient of this single data point's error is used for the weight update in exactly the same way that the gradient was used in batch gradient descent. The gradient needed for the weight update of SGD is (see Exercise 3.7)
    ∇e_n(w) = −y_n x_n / (1 + e^{y_n w^T x_n}),
and the weight update is w ← w − η∇e_n(w). Insight into why SGD works can be gained by looking at the expected value of the change in the weight (the expectation is with respect to the random point that is selected). Since n is picked uniformly at random from {1, ..., N}, the expected weight change is
    −η (1/N) Σ_{n=1}^{N} ∇e_n(w).
This is exactly the same as the deterministic weight change from the batch gradient descent weight update. That is, 'on average' the minimization proceeds in the right direction, but is a bit wiggly. In the long run, these random fluctuations cancel out. The computational cost is cheaper by a factor of N, though, since we compute the gradient for only one point per iteration, rather than for all N points as we do in batch gradient descent.
Notice that SGD is similar to PLA in that it decreases the error with respect to one data point at a time. Minimizing the error on one data point may interfere with the error on the rest of the data points that are not considered at that iteration. However, also similar to PLA, the interference cancels out on average as we have just argued.
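To make the update concrete, here is a minimal sketch of SGD for logistic regression (not from the text; the epoch-style random ordering and the fixed learning rate are illustrative choices):

```python
import numpy as np

def logistic_sgd(X, y, eta=0.1, n_epochs=100, seed=None):
    """SGD for logistic regression.

    X : (N, d+1) array of inputs (first column assumed to be the constant 1).
    y : (N,) array of +/-1 labels.
    Returns the learned weight vector w.
    """
    rng = np.random.default_rng(seed)
    N, d1 = X.shape
    w = np.zeros(d1)
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            # gradient of e_n(w) = ln(1 + exp(-y_n w^T x_n))
            grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * np.dot(w, X[n])))
            w -= eta * grad
    return w
```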
Exercise 3.10
(a) Define an error for a single data point (x_n, y_n) to be

        e_n(w) = max(0, −y_n w^T x_n).

    Argue that PLA can be viewed as SGD on e_n with learning rate η = 1.
(b) For logistic regression with a very large w, argue that minimizing E_in using SGD is similar to PLA. This is another indication that the logistic regression weights can be used as a good approximation for classification.
SGD is successful in practice, often bea ting th e batch version and o ther m ore
sophisticated algorithm s. In fact, SGD was an iiu jro rtan t p a r t of the algorithm
th a t won the m illion-dollar Netflix com petition , discussed in Section 1.1. It
scales well to large d a ta sets, and is n a tu ra lly su ited to online learning, where
a stream of data points presents itself to the learning algorithm sequentially. The randomness introduced by processing one data point at a time can be a plus, helping the algorithm to avoid flat regions and local minima in the case of a complicated error surface. However, it is challenging to choose a suitable termination criterion for SGD. A good stopping criterion should consider the total error on all the data, which can be computationally demanding to evaluate at each iteration.
3.4 N onlinear Transform ation
All formulas for the linear model have used the sum

    w^T x = Σ_{i=0}^{d} w_i x_i                  (3.11)
as the main quantity in computing the hypothesis output. This quantity is linear, not only in the x_i's but also in the w_i's. A closer inspection of the corresponding learning algorithms shows that the linearity in the w_i's is the key property for deriving these algorithms; the x_i's are just constants as far as the algorithm is concerned. This observation opens the possibility for allowing nonlinear versions of the x_i's while still remaining in the analytic realm of linear models, because the form of Equation (3.11) remains linear in the w_i parameters.
Consider the credit limit problem for instance. It makes sense that the 'years in residence' field would affect a person's credit since it is correlated with stability. However, it is less plausible that the credit limit would grow linearly with the number of years in residence. More plausibly, there is a threshold (say 1 year) below which the credit limit is affected negatively and another threshold (say 5 years) above which the credit limit is affected positively. If x_i is the input variable that measures years in residence, then two nonlinear 'features' derived from it, namely [x_i < 1] and [x_i > 5], would allow a linear formula to reflect the credit limit better.
We have already seen the use of features in the classification of handwritten digits, where intensity and symmetry features were derived from input pixels. Nonlinear transforms can be further applied to those features, as we will see shortly, creating more elaborate features and improving the performance. The scope of linear methods expands significantly when we represent the input by a set of appropriate features.
3.4.1 The Z Space

Consider the situation in Figure 3.1(b) where a linear classifier can't fit the data. By transforming the inputs x_1, x_2 in a nonlinear fashion, we will be able to separate the data with more complicated boundaries while still using the
simple PLA as a building block. Let's start by looking at the circle in Figure 3.5(a), which is a replica of the non-separable case in Figure 3.1(b). Points on the circle satisfy the following equation:

    x_1^2 + x_2^2 = 0.6,

which means that the nonlinear hypothesis h(x) = sign(0.6 − x_1^2 − x_2^2) separates the data set perfectly. We can view the hypothesis as a linear one after applying a nonlinear transformation on x. In particular, consider z_0 = 1, z_1 = x_1^2 and z_2 = x_2^2,
    h(x) = sign( w̃_0 · 1 + w̃_1 · x_1^2 + w̃_2 · x_2^2 ) = sign( w̃^T z ),
where the vector z is ob ta ined from x th rough a nonlinear transfo rm $ ,
z = $ (x ) .
We can plot the data in terms of z instead of x, as depicted in Figure 3.5(b). For instance, the point x_1 in Figure 3.5(a) is transformed to the point z_1 in Figure 3.5(b) and the point x_2 is transformed to the point z_2. The space Z, which contains the z vectors, is referred to as the feature space since its coordinates are higher-level features derived from the raw input x. We designate different quantities in Z with a tilde version of their counterparts in X, e.g., the dimensionality of Z is d̃ and the weight vector is w̃.⁶ The transform Φ that takes us from X to Z is called a feature transform, which in this case is
    Φ(x) = (1, x_1^2, x_2^2).                  (3.12)
In general, some points in the Z space may not be valid transforms of any x ∈ X, and multiple points in X may be transformed to the same z ∈ Z, depending on the nonlinear transform Φ.
The usefulness of the transform above is that the nonlinear hypothesis h (circle) in the X space can be represented by a linear hypothesis (line) in the Z space. Indeed, any linear hypothesis h̃ in z corresponds to a (possibly nonlinear) hypothesis of x given by

    h(x) = h̃(Φ(x)).

⁶ z ∈ Z = {1} × R^{d̃}, where d̃ = 2 in this case. We treat Z as d̃-dimensional since the added coordinate z_0 = 1 is fixed.
100
3 . T h e L i n e a r M o d e l 3 . 4 . N o n l i n e a r T r a n s f o r m a t i o n
Figure 3.5: (a) The original data set that is not linearly separable, but separable by a circle. (b) The transformed data set z = Φ(x) that is linearly separable in the Z space. In the figure, x_1 maps to z_1 and x_2 maps to z_2; the circular separator in the X-space maps to the linear separator in the Z-space.
The set of these hypotheses h is denoted by H_Φ. For instance, when using the feature transform in (3.12), each h ∈ H_Φ is a quadratic separator in X that corresponds to some linear separator h̃ in Z.
Exercise 3.11
Consider the feature transform Φ in (3.12). What kind of boundary in X does a hyperplane w̃ in Z correspond to in the following cases? Draw a picture that illustrates an example of each case.
(a) w̃_1 > 0, w̃_2 < 0
(b) w̃_1 > 0, w̃_2 = 0
(c) w̃_1 > 0, w̃_2 > 0, w̃_0 < 0
(d) w̃_1 > 0, w̃_2 > 0, w̃_0 > 0
Because the transformed data set (z_1, y_1), ..., (z_N, y_N) in Figure 3.5(b) is linearly separable in the feature space Z, we can apply PLA on the transformed data set to obtain w̃_PLA, the PLA solution, which gives us a final hypothesis g(x) = sign(w̃_PLA^T z) in the X space, where z = Φ(x). The whole process of applying the feature transform before running PLA for linear classification is depicted in Figure 3.6.
The in-sample error in the input space X is the same as in the feature space Z, so E_in(g) = 0. Hyperplanes that achieve E_in(w̃_PLA) = 0 in Z correspond to separating curves in the original input space X. For instance,
1. Original data: x_n ∈ X.
2. Transform the data: z_n = Φ(x_n) ∈ Z.
3. Separate data in Z-space: g̃(z) = sign(w̃^T z).
4. Classify in X-space: g(x) = g̃(Φ(x)) = sign(w̃^T Φ(x)).
Figure 3.6: The nonlinear transform for separating non-separable data.
as shown in Figure 3.6, the PLA may select the line w̃_PLA = (0.6, −0.6, −1) that separates the transformed data (z_1, y_1), ..., (z_N, y_N). The corresponding hypothesis g(x) = sign(0.6 − 0.6·x_1^2 − x_2^2) will separate the original data (x_1, y_1), ..., (x_N, y_N). In this case, the decision boundary is an ellipse in X.
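The end-to-end recipe of Figure 3.6 is straightforward to code. The sketch below is illustrative only (the transform of (3.12) and a plain PLA loop with a fixed iteration cap are the assumed ingredients):

```python
import numpy as np

def phi(x):
    # feature transform of (3.12): x = (x1, x2) -> z = (1, x1^2, x2^2)
    return np.array([1.0, x[0] ** 2, x[1] ** 2])

def pla(Z, y, max_iter=10000):
    """Plain PLA on already-transformed data Z (N x 3), labels y in {-1, +1}."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        misclassified = np.where(np.sign(Z @ w) != y)[0]
        if len(misclassified) == 0:
            break
        n = misclassified[0]
        w = w + y[n] * Z[n]            # standard PLA update
    return w

def g(x, w_tilde):
    # final hypothesis in the X space: g(x) = sign(w~^T Phi(x))
    return np.sign(np.dot(w_tilde, phi(x)))
```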
How does the feature transform affect the VC bound (3.1)? If we honestly decide on the transform Φ before seeing the data, then with probability at least 1 − δ, the bound (3.1) remains true with d_VC(H_Φ) as the VC dimension. For instance, consider the feature transform Φ in (3.12). We know that Z = {1} × R^2. Since H_Φ is the perceptron in Z, d_VC(H_Φ) ≤ 3 (the ≤ is because some points z ∈ Z may not be valid transforms of any x, so some dichotomies may not be realizable). We can then substitute N, d_VC(H_Φ), and δ into the VC bound. After running PLA on the transformed data set, if we succeed in
getting some g with E_in(g) = 0, we can claim that g will perform well out of sample.
It is very important to understand that the claim above is valid only if you decide on Φ before seeing the data or trying any algorithms. What if we first try using lines to separate the data, fail, and then use the circles? Then we are effectively using a model that contains both lines and circles, and d_VC is no longer 3.
Exercise 3.12
We know that in the Euclidean plane, the perceptron model H cannot implement all 16 dichotomies on 4 points. That is, m_H(4) < 16. Take the feature transform Φ in (3.12).
(a) Show that m_{H_Φ}(3) = 8.
(b) Show that m_{H_Φ}(4) < 16.
(c) Show that m_{H ∪ H_Φ}(4) = 16.
That is, if you used lines, d_VC = 3; if you used ellipses, d_VC = 3; if you used lines and ellipses, d_VC > 3.
Worse yet, if you actually look at the data (e.g., look at the points in Figure 3.1(a)) before deciding on a suitable Φ, you forfeit most of what you learned in Chapter 2. You have inadvertently explored a huge hypothesis space in your mind to come up with a specific Φ that would work for this data set. If you invoke a generalization bound now, you will be charged for the VC dimension of the full space that you explored in your mind, not just the space that Φ creates.
This does not mean that Φ should be chosen blindly. In the credit limit problem for instance, we suggested nonlinear features based on the 'years in residence' field that may be more suitable for linear regression than the raw input. This was based on our understanding of the problem, not on 'snooping' into the training data. Therefore, we pay no price in terms of generalization, and we may well gain a dividend in performance because of a good choice of features.
The feature transform can be general, as long as it is chosen before seeing the data set (we cannot emphasize this enough). For instance, you may have noticed that the feature transform in (3.12) only allows us to get very limited types of quadratic curves. Ellipses that do not center at the origin in X cannot correspond to a hyperplane in Z. To get all possible quadratic curves in X, we could consider the more general feature transform z = Φ_2(x),

    Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2),                  (3.13)

which gives us the flexibility to represent any quadratic curve in X by a hyperplane in Z (the subscript 2 of Φ_2 is for polynomials of degree 2, i.e., quadratic curves). The price we pay is that Z is now five-dimensional instead of two-dimensional, and hence d_VC is doubled from 3 to 6.
E x e r c is e 3 .1 3
Consider the feature transform z = Φ_2(x) in (3.13). How can we use a hyperplane w̃ in Z to represent the following boundaries in X?
(a) The parabola (x_1 − 3)^2 + x_2 = 1.
(b) The circle (x_1 − 3)^2 + (x_2 − 4)^2 = 1.
(c) The ellipse 2(x_1 − 3)^2 + (x_2 − 4)^2 = 1.
(d) The hyperbola (x_1 − 3)^2 − (x_2 − 4)^2 = 1.
(e) The ellipse 2(x_1 + x_2 − 3)^2 + (x_1 − x_2 − 4)^2 = 1.
(f) The line 2x_1 + x_2 = 1.
One may further extend Φ_2 to a feature transform Φ_3 for cubic curves in X, or more generally define the feature transform Φ_Q for degree-Q curves in X. The feature transform Φ_Q is called the Qth order polynomial transform.
The power of the feature transform should be used with care. It may not be worth it to insist on linear separability and employ a highly complex surface to achieve that. Consider the case of Figure 3.1(a). If we insist on a feature transform that linearly separates the data, it may lead to a significant increase of the VC dimension. As we see in Figure 3.7, no line can separate the training examples perfectly, and neither can any quadratic nor any third-order polynomial curves. Thus, we need to use a fourth-order polynomial transform:

    Φ_4(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, x_1^4, x_1^3 x_2, x_1^2 x_2^2, x_1 x_2^3, x_2^4).
If you look a t the fourth-order decision bou n d ary in F igure 3.7(b), you d o n ’t
need the VC analysis to tell you th a t th is is an overkill th a t is unlikely to
generalize well to new d a ta . A b e tte r op tion would have been to ignore th e two
misclassified exam ples in F igure 3.7(a), sep ara te th e o ther exam ples perfectly
w ith the line, and accept the sm all b u t nonzero Ei„. Indeed, som etim es our
best bet is to go w ith a sim pler hypothesis set while to le ra ting a sm all Ei„.
While our discussion of feature transforms has focused on classification problems, these transforms can be applied equally to regression problems. Both linear regression and logistic regression can be implemented in the feature space Z instead of the input space X. For instance, linear regression is often coupled with a feature transform to perform nonlinear regression. The N by d+1 input matrix X in the algorithm is replaced with the N by d̃+1 matrix Z, while the output vector y remains the same.
3.4.2 Computation and Generalization

Although using a larger Q gives us more flexibility in terms of the shape of decision boundaries in X, there is a price to be paid. Computation is one issue, and generalization is the other.
Computation is an issue because the feature transform Φ_Q maps a two-dimensional vector x to d̃ = Q(Q+3)/2 dimensions, which increases the memory
Figure 3.7: Illustration of the nonlinear transform using a data set that is not linearly separable: (a) a line separates the data after omitting a few points, (b) a fourth-order polynomial separates all the points.
and com putational costs. Things could get worse if x is in a higher dimension
to begin with.
Exercise 3.14
Consider the Qth order polynomial transform Φ_Q for X = R^d. What is the dimensionality d̃ of the feature space Z (excluding the fixed coordinate z_0 = 1)? Evaluate your result on d ∈ {2, 3, 5, 10} and Q ∈ {2, 3, 5, 10}.
The other important issue is generalization. If Φ_Q is the feature transform of a two-dimensional input space, there will be d̃ = Q(Q+3)/2 dimensions in Z, and d_VC(H_Φ) can be as high as Q(Q+3)/2 + 1. This means that the second term in the VC bound (3.1) can grow significantly. In other words, we would have a weaker guarantee that E_out will be small. For instance, if we use Φ = Φ_50, the VC dimension of H_Φ could be as high as 50·53/2 + 1 = 1326 instead of the original d_VC = 3. Applying the rule of thumb that the amount of data needed is proportional to the VC dimension, we would need hundreds of times more data than we would if we didn't use a feature transform, in order to achieve the same level of generalization error.
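As a quick sanity check on these counts (not from the text; the monomial-counting formula d̃ = C(Q+d, d) − 1 is the assumption here), a few lines of code tabulate the dimensionality of Φ_Q and reproduce the 1326 figure quoted above for Q = 50, d = 2:

```python
from math import comb

def poly_dim(d, Q):
    # number of monomials of total degree 1..Q in d variables (excludes z_0 = 1)
    return comb(Q + d, d) - 1

for d in (2, 3, 5, 10):
    print(d, [poly_dim(d, Q) for Q in (2, 3, 5, 10)])

print(poly_dim(2, 50) + 1)   # 1326, the VC-dimension bound quoted for Phi_50
```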
Exercise 3.15
High-dimensional feature transforms are by no means the only transforms that we can use. We can take the tradeoff in the other direction, and use low-dimensional feature transforms as well (to achieve an even lower generalization error bar).
Consider the following feature transform, which maps a d-dimensional x to a one-dimensional z, keeping only the kth coordinate of x:

    Φ_(k)(x) = (1, x_k).                  (3.14)

Let H_k be the set of perceptrons in the feature space.
(a) Prove that d_VC(H_k) = 2.
(b) Prove that d_VC(∪_{k=1}^{d} H_k) ≤ 2(log_2 d + 1).
H_k is called the decision stump model on dimension k.
The problem of generalization when we go to a high-dimensional space is sometimes balanced by the advantage we get in approximating the target better. As we have seen in the case of using quadratic curves instead of lines, the transformed data became linearly separable, reducing E_in to 0. In general, when choosing the appropriate dimension for the feature transform, we cannot avoid the approximation-generalization tradeoff:

    higher d̃:  better chance of being linearly separable (E_in ↓), but d_VC ↑
    lower d̃:   possibly not linearly separable (E_in ↑), but d_VC ↓

Therefore, choosing a feature transform before seeing the data is a non-trivial task. When we apply learning to a particular problem, some understanding of the problem can help in choosing features that work well. More generally, there are some guidelines for choosing a suitable transform, or a suitable model, which we will discuss in Chapter 4.
Exercise 3.16
Write down the steps of the algorithm that combines Φ_3 with linear regression. How about using Φ_10 instead? Where is the main computational bottleneck of the resulting algorithm?
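For reference, one way such a combination might look in code (an illustrative sketch, not the book's pseudocode; the explicit monomial ordering for Φ_3 and the use of the pseudo-inverse are the assumptions):

```python
import numpy as np

def phi3(X):
    """Third-order polynomial transform Phi_3 for two-dimensional inputs.

    X : (N, 2) array of raw inputs (x1, x2).
    Returns the (N, 10) matrix Z with columns
    1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3.
    """
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1**2, x1 * x2, x2**2,
                            x1**3, x1**2 * x2, x1 * x2**2, x2**3])

def linear_regression(Z, y):
    # pseudo-inverse solution w_lin = (Z^T Z)^{-1} Z^T y
    return np.linalg.pinv(Z) @ y

# usage: w = linear_regression(phi3(X), y); classify with sign(phi3(X_test) @ w)
```

The main cost is forming and pseudo-inverting Z; for Φ_10 the number of columns grows to 66, so the cost of the matrix factorization grows accordingly.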
Example 3.5. Let's revisit the handwritten digit recognition example. We can try a different way of decomposing the big task of separating ten digits into smaller tasks. One decomposition is to separate digit 1 from all the other digits. Using intensity and symmetry as our input variables like we did before, the scatter plot of the training data is shown next. A line can roughly separate digit 1 from the rest, but a more complicated curve might do better.
We use linear regression (for classification), first without any feature transform. The results are shown below (LHS). We get E_in = 2.13% and E_out = 2.38%.
Classification of the digits data ('1' versus 'not 1') using the linear model (left: E_in = 2.13%, E_out = 2.38%) and the third-order polynomial model (right: E_in = 1.75%, E_out = 1.87%).
When we run linear regression with Φ_3, the third-order polynomial transform, we obtain a better fit to the data, with a lower E_in = 1.75%. The result is depicted in the RHS of the figure. In this case, the better in-sample fit also resulted in a better out-of-sample performance, with E_out = 1.87%. □
L in e a r m o d e ls , a fin a l p i tc h . The linear model (for classification or regres
sion) is an often overlooked resource in the arena of learning from data. Since
efficient learning algorithm s exist for linear models, they are low overhead.
T hey are also very robust and have good generalization properties. A sound
policy to follow when learning from data is to first try a linear model. Because of the good generalization properties of linear models, not much can go wrong. If you get a good fit to the data (low E_in), then you are done. If you do not get a good enough fit to the data and decide to go for a more complex model, you will pay a price in terms of the VC dimension as we have seen in Exercise 3.12, but the price is modest.
3.5 Problem s
P ro b lem 3 .1 Consider the double semi-circle “toy” learning task below.
There are two semi-circles of width thk with inner radius rad, separated by
sep as shown (red is —1 and blue is -|-1). The center of the top semi-circle
is aligned with the middle of the edge of the bottom semi-circle. This task
is linearly separable when sep > 0, and not so for sep < 0. Set rad = 10,
thk = 5 and sep = 5. Then, generate 2, 000 examples uniformly, which means
you will have approximately 1,000 examples for each class.
(a ) Run the PLA starting from w = 0 until it converges. Plot the data and
the final hypothesis.
(b) Repeat part (a ) using the linear regression (for classification) to obtain w .
Explain your observations.
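A possible way to generate the data for this problem (an illustrative sketch; the placement of the offset between the two rings and the class-to-ring assignment are assumptions read off the figure):

```python
import numpy as np

def generate_semicircles(rad=10, thk=5, sep=5, n=2000, seed=None):
    """Double semi-circle data: top ring labelled +1, bottom (shifted) ring -1."""
    rng = np.random.default_rng(seed)
    labels = rng.choice([-1, 1], size=n)                # roughly n/2 per class
    # uniform over the ring area: radius density proportional to r
    r = np.sqrt(rng.uniform(rad**2, (rad + thk)**2, size=n))
    theta = rng.uniform(0, np.pi, size=n)
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    neg = labels == -1
    x[neg] = x[neg] + rad + thk / 2                     # shift bottom ring right
    y[neg] = -y[neg] - sep                              # reflect and push down by sep
    return np.column_stack([x, y]), labels
```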
P r o b le m 3 .2 For the double-semi-circle task in Problem 3.1, vary sep in
the range {0 .2 , 0 . 4 , . . . , 5 }. Generate 2, 000 examples and run the PLA starting
with w = 0. Record the number of iterations PLA takes to converge.
Plot sep versus the number of iterations taken for PLA to converge. Explain
your observations. [Hint: Problem 1.3.]
P ro b lem 3 .3 For the double-semi-circle task in Problem 3.1, set sep = —5
and generate 2,000 examples.
(a) W hat will happen if you run PLA on those examples?
(b) Run the pocket algorithm for 100,000 iterations and plot E_in versus the iteration number t.
(c) Plot the data and the final hypothesis in part (b).
( c o n t i n u e d o n n e x t p a g e )
1 0 9
3. T h e L i n e a r M o d e l 3.5. P r o b l e m s
(d ) Use the linear regression algorithm to obtain the weights w , and compare
this result with the pocket algorithm in terms of computation tim e and
quality of the solution.
(e) Repeat (b ) - (d ) with a 3rd order polynomial feature transform.
Problem 3.4    In Problem 1.5, we introduced the Adaptive Linear Neuron (Adaline) algorithm for classification. Here, we derive Adaline from an optimization perspective.
(a) Consider e_n(w) = (max(0, 1 − y_n w^T x_n))^2. Show that e_n(w) is continuous and differentiable. Write down the gradient ∇e_n(w).
(b) Show that e_n(w) is an upper bound for [sign(w^T x_n) ≠ y_n]. Hence, (1/N) Σ_{n=1}^{N} e_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Argue that the Adaline algorithm in Problem 1.5 performs stochastic gradient descent on (1/N) Σ_{n=1}^{N} e_n(w).
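For part (c), a minimal sketch of what one SGD step on this e_n looks like (illustrative; the factor of 2 from the gradient can be absorbed into the learning rate η):

```python
import numpy as np

def adaline_sgd_step(w, x_n, y_n, eta=0.01):
    """One SGD step on e_n(w) = (max(0, 1 - y_n w^T x_n))^2.

    The gradient is -2 y_n x_n max(0, 1 - y_n w^T x_n), so the update only
    fires when the margin y_n w^T x_n is below 1.
    """
    if y_n * np.dot(w, x_n) < 1:
        w = w + 2 * eta * (y_n - np.dot(w, x_n)) * x_n   # uses y_n^2 = 1
    return w
```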
Problem 3.5
(a) Consider

        e_n(w) = max(0, 1 − y_n w^T x_n).

    Show that e_n(w) is continuous and differentiable except when w^T x_n = y_n.
(b) Show that e_n(w) is an upper bound for [sign(w^T x_n) ≠ y_n]. Hence, (1/N) Σ_{n=1}^{N} e_n(w) is an upper bound for the in-sample classification error E_in(w).
(c) Apply stochastic gradient descent on (1/N) Σ_{n=1}^{N} e_n(w) (ignoring the singular case of w^T x_n = y_n) and derive a new perceptron learning algorithm.
Problem 3.6    Derive a linear programming algorithm to fit a linear model for classification using the following steps. A linear program is an optimization problem of the following form:

    min_z   c^T z
    subject to   Az ≤ b.

A, b and c are parameters of the linear program and z is the optimization variable. This is a well studied optimization problem and most mathematics software packages will have canned optimization functions to solve linear programs.
(a) For linearly separable data, show that for some w, y_n(w^T x_n) ≥ 1 for n = 1, ..., N.
(b ) Formulate the task of finding a separating w for separable data as a linear
program. You need to specify what the parameters A ,b , c are and what
the optimization variable z is.
(c) If the data is not separable, the condition in (a) cannot hold for every n. Thus, introduce the violation ξ_n ≥ 0 to capture the amount of violation for example x_n. So, for n = 1, ..., N,

        y_n(w^T x_n) ≥ 1 − ξ_n,
        ξ_n ≥ 0.

    Naturally, we would like to minimize the amount of violation. One intuitive approach is to minimize Σ_{n=1}^{N} ξ_n, i.e., we want w and ξ that solve

        min_{w, ξ}   Σ_{n=1}^{N} ξ_n
        subject to   y_n(w^T x_n) ≥ 1 − ξ_n,
                     ξ_n ≥ 0,

    where the inequalities must hold for n = 1, ..., N. Formulate this problem as a linear program.
(d) Argue that the linear program you derived in (c) and the optimization
problem in Problem 3.5 are equivalent.
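One possible concrete formulation (a sketch, not part of the problem; scipy's generic LP solver is used and stacking z = (w, ξ) is the assumed encoding):

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X, y):
    """Fit a linear classifier via the LP of part (c).

    X : (N, d+1) inputs including the bias coordinate; y : (N,) labels in {-1, +1}.
    Variables z = (w, xi); minimize sum(xi) s.t. y_n w^T x_n >= 1 - xi_n, xi_n >= 0.
    """
    N, d1 = X.shape
    c = np.concatenate([np.zeros(d1), np.ones(N)])        # minimize total violation
    # -y_n x_n^T w - xi_n <= -1  encodes  y_n w^T x_n >= 1 - xi_n
    A_ub = np.hstack([-y[:, None] * X, -np.eye(N)])
    b_ub = -np.ones(N)
    bounds = [(None, None)] * d1 + [(0, None)] * N         # w free, xi >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d1]                                      # the weight vector w
```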
P r o b le m 3 .7 Use the linear programming algorithm from Problem 3.6
on the learning task in Problem 3.1 for the separable (sep = 5) and the non-
separable (sep = —5) cases.
Compare your results to the linear regression approach with and without the
3rd order polynomial feature transform.
Problem 3.8    For linear regression, the out-of-sample error is

    E_out(h) = E[(h(x) − y)^2].

Show that among all hypotheses, the one that minimizes E_out is given by

    h*(x) = E[y | x].

The function h* can be treated as a deterministic target function, in which case we can write y = h*(x) + ε(x), where ε(x) is an (input dependent) noise variable. Show that ε(x) has expected value zero.
Problem 3.9    Assuming that X^T X is invertible, show by direct comparison with Equation (3.4) that E_in(w) can be written as

    E_in(w) = (1/N) [ (w − (X^T X)^{−1} X^T y)^T (X^T X) (w − (X^T X)^{−1} X^T y) + y^T (I − X(X^T X)^{−1} X^T) y ].

Use this expression for E_in to obtain w_lin. What is the in-sample error? [Hint: The matrix X^T X is positive definite.]
Problem 3.10    Exercise 3.3 studied some properties of the hat matrix H = X(X^T X)^{−1} X^T, where X is an N by d+1 matrix, and X^T X is invertible.
Show the following additional properties.
(a ) Every eigenvalue of H is either 0 or 1. [Hint: Exercise 3.3(b).]
(b ) Show that the trace of a symmetric matrix equals the sum of its eigen
values. [Hint: Use the spectral theorem and the cyclic property o f the
trace. Note that the same result holds for non-symmetric matrices, but
is a little harder to prove.]
(c) How many eigenvalues of H are 1? W hat is the rank of H? [Hint:
Exercise 3.3(d).]
Problem 3.11    Consider the linear regression problem setup in Exercise 3.4, where the data comes from a genuine linear relationship with added noise. The noise for the different data points is assumed to be iid with zero mean and variance σ^2. Assume that the 2nd moment matrix Σ = E_x[x x^T] is non-singular. Follow the steps below to show that, with high probability, the out-of-sample error on average is

    E_out(w_lin) = σ^2 (1 + (d+1)/N + o(1/N)).

(a) For a test point x, show that the error y − g(x) is

        ε − x^T (X^T X)^{−1} X^T e,

    where ε is the noise realization for the test point and e is the vector of noise realizations on the data.
(b) Take the expectation with respect to the test point, i.e., x and ε, to obtain an expression for E_out. Show that

        E_out = σ^2 + trace( Σ (X^T X)^{−1} X^T e e^T X (X^T X)^{−1} ).

    [Hints: a = trace(a) for any scalar a; trace(AB) = trace(BA); expectation and trace commute.]
(c) What is E_e[e e^T]?
(d) Take the expectation with respect to e to show that, on average,

        E_out = σ^2 + (σ^2/N) trace( Σ ((1/N) X^T X)^{−1} ).

    Note that (1/N) X^T X is an N-sample estimate of Σ, so (1/N) X^T X ≈ Σ. If (1/N) X^T X = Σ, then what is E_out on average?
(e) Show, after taking the expectation with respect to the data noise e, that with high probability,

        E_out = σ^2 (1 + (d+1)/N + o(1/N)).

    [Hint: By the law of large numbers (1/N) X^T X converges in probability to Σ, and so by continuity of the inverse at Σ, ((1/N) X^T X)^{−1} converges in probability to Σ^{−1}.]
Problem 3.12    In linear regression, the in-sample predictions are given by ŷ = Hy, where H = X(X^T X)^{−1} X^T. Show that H is a projection matrix, i.e. H^2 = H. So ŷ is the projection of y onto some space. What is this space?
P r o b le m 3 .1 3 This problem creates a linear regression algorithm from
a good algorithm for linear classification. As illustrated, the idea is to take the
original data and shift it in one direction to get the -|-1 data points; then, shift
it in the opposite direction to get the — 1 data points.
(Left) Original data for the one-dimensional regression problem. (Right) Shifted data viewed as a two-dimensional classification problem.
More generally, the data (x_n, y_n) can be viewed as data points in R^{d+1} by treating the y-value as the (d+1)th coordinate.
( c o n t i n u e d o n n e x t p a g e )
1 1 3
3 . T h e L i n e a r M o d e l 3 . 5 . P r o b l e m s
Now, construct positive and negative points

    D+ = { (x_1, y_1) + a, ..., (x_N, y_N) + a },
    D− = { (x_1, y_1) − a, ..., (x_N, y_N) − a },

where a is a perturbation parameter. You can now use the linear programming algorithm in Problem 3.6 to separate D+ from D−. The resulting separating hyperplane can be used as the regression 'fit' to the original data.
(a) How many weights are learned in the classification problem? How many
weights are needed for the linear fit in the regression problem?
(b) The linear fit requires weights w , where / i(x ) = w ’^x. Suppose the
weights returned by solving the classification problem are Wdass- Derive
an expression for w as a function of Wdass-
(c) Generate a data set y„ = + <Te„ with N = 50, where x „ is uniform on
[0.1] and e„ is zero mean Gaussian noise; set a = 0.1. Plot T>+ and "D_
0
for a = 0.1
(d ) Give comparisons of the resulting fits from running the classification ap>-
proach and the analytic pseudo-inverse algorithm for linear regression.
Problem 3.14    In a regression setting, assume the target function is linear, so f(x) = x^T w_f, and y = X w_f + e, where the entries in e are zero mean, iid with variance σ^2. In this problem, derive the bias and var as follows.
(a) Show that the average function is ḡ(x) = f(x), no matter what the size of the data set, as long as X^T X is invertible. What is the bias?
(b) What is the var? [Hint: Problem 3.11]
Problem 3.15    In the text we derived that the linear regression solution weights must satisfy X^T X w = X^T y. If X^T X is not invertible, the solution w_lin = (X^T X)^{−1} X^T y won't work. In this event, there will be many solutions for w that minimize E_in. Here, you will derive one such solution. Let ρ be the rank of X. Assume that the singular value decomposition (SVD) of X is X = U Γ V^T, where U ∈ R^{N×ρ} satisfies U^T U = I_ρ, V ∈ R^{(d+1)×ρ} satisfies V^T V = I_ρ, and Γ ∈ R^{ρ×ρ} is a positive diagonal matrix.
(a) Show that ρ ≤ d + 1.
(b) Show that w_lin = V Γ^{−1} U^T y satisfies X^T X w_lin = X^T y, and hence is a solution.
(c) Show that for any other solution that satisfies X^T X w = X^T y, ||w_lin|| < ||w||. That is, the solution we have constructed is the minimum norm set of weights that minimizes E_in.
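A compact numerical check of this construction (illustrative; numpy's thin SVD plays the role of X = UΓV^T and the rank is estimated with a tolerance):

```python
import numpy as np

def min_norm_wlin(X, y):
    """Minimum-norm least-squares weights via the thin SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tol = max(X.shape) * np.finfo(float).eps * s.max()
    r = int(np.sum(s > tol))                       # numerical rank rho
    # w_lin = V Gamma^{-1} U^T y restricted to the top rho singular directions
    w = Vt[:r].T @ ((U[:, :r].T @ y) / s[:r])
    return w                                       # equivalently np.linalg.pinv(X) @ y
```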
Problem 3.16    In Example 3.4, it is mentioned that the output of the final hypothesis g(x) learned using logistic regression can be thresholded to get a 'hard' (±1) classification. This problem shows how to use the risk matrix introduced in Example 1.1 to obtain such a threshold.
Consider fingerprint verification, as in Example 1.1. After learning from the data using logistic regression, you produce the final hypothesis

    g(x) = P̂[y = +1 | x],

which is your estimate of the probability that y = +1. Suppose that the cost matrix is given by
matrix is given by
True classification
-F l (correct person) —1 (intruder)
“F l 0 Cayou say ^
Cr 0
For a new person with fingerprint x, you compute g(x) and you now need to decide whether to accept or reject the person (i.e., you need a hard classification). So, you will accept if g(x) ≥ κ, where κ is the threshold.
(a) Define cost(accept) as your expected cost if you accept the person. Similarly define cost(reject). Show that

        cost(accept) = (1 − g(x)) c_a,
        cost(reject) = g(x) c_r.

(b) Use part (a) to derive a condition on g(x) for accepting the person and hence show that

        κ = c_a / (c_a + c_r).
(c) Use the cost-matrices for the Supermarket and CIA applications in Ex
ample 1.1 to compute the threshold k for each of these two cases. Give
some intuition for the thresholds you get.
Problem 3.17    Consider the function

    E(u, v) = e^u + e^{2v} + e^{uv} + u^2 − 3uv + 4v^2 − 3u − 5v.

(a) Approximate E(u + Δu, v + Δv) by Ê_1(Δu, Δv), where Ê_1 is the first-order Taylor expansion of E around (u, v) = (0, 0). Suppose Ê_1(Δu, Δv) = a_u Δu + a_v Δv + a. What are the values of a_u, a_v, and a?
(b) Minimize Ê_1 over all possible (Δu, Δv) such that ||(Δu, Δv)|| = 0.5. In this chapter, we proved that the optimal column vector [Δu, Δv]^T is parallel to the column vector −∇E(u, v), which is called the negative gradient direction. Compute the optimal (Δu, Δv) and the resulting E(u + Δu, v + Δv).
(c) Approximate E(u + Δu, v + Δv) by Ê_2(Δu, Δv), where Ê_2 is the second-order Taylor expansion of E around (u, v) = (0, 0). Suppose

        Ê_2(Δu, Δv) = b_uu (Δu)^2 + b_vv (Δv)^2 + b_uv (Δu)(Δv) + b_u Δu + b_v Δv + b.

    What are the values of b_uu, b_vv, b_uv, b_u, b_v, and b?
(d) Minimize Ê_2 over all possible (Δu, Δv) (regardless of length). Use the fact that ∇^2 E(u, v)|_{(0,0)} (the Hessian matrix at (0, 0)) is positive definite to prove that the optimal column vector

        [Δu*, Δv*]^T = −(∇^2 E(u, v))^{−1} ∇E(u, v),

    which is called the Newton direction.
(e) Numerically compute the following values:
    (i) the vector (Δu, Δv) of length 0.5 along the Newton direction, and the resulting E(u + Δu, v + Δv);
    (ii) the vector (Δu, Δv) of length 0.5 that minimizes E(u + Δu, v + Δv), and the resulting E(u + Δu, v + Δv). [Hint: Let Δu = 0.5 sin θ.]
Compare the values of E(u + Δu, v + Δv) in (b), (e-i), and (e-ii). Briefly state your findings.
The negative gradient direction and the Newton direction are quite fundamental
for designing optimization algorithms. It is important to understand these
directions and put them in your toolbox for designing learning algorithms.
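For the numerical parts, a small sketch (illustrative only; finite-difference derivatives stand in for the analytic ones, and the expression for E below simply follows the cleaned statement of the problem above):

```python
import numpy as np

def numerical_grad(f, p, h=1e-6):
    # central-difference gradient of f at point p
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p); e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def numerical_hessian(f, p, h=1e-4):
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = h
        H[:, i] = (numerical_grad(f, p + e) - numerical_grad(f, p - e)) / (2 * h)
    return H

def E(p):
    u, v = p
    # E(u, v) as written in the problem statement above
    return np.exp(u) + np.exp(2*v) + np.exp(u*v) + u**2 - 3*u*v + 4*v**2 - 3*u - 5*v

p0 = np.zeros(2)
g = numerical_grad(E, p0)
neg_grad_dir = -g / np.linalg.norm(g)                          # negative gradient direction
newton_step = -np.linalg.solve(numerical_hessian(E, p0), g)    # Newton direction
```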
Problem 3.18    Take the feature transform Φ_2 in Equation (3.13) as Φ.
(a) Show that d_VC(H_Φ) ≤ 6.
(b) Show that d_VC(H_Φ) > 4. [Hint: Exercise 3.12]
(c) Give an upper bound on d_VC(H_{Φ_k}) for X = R^d.
(d) Define

        Φ̌_2: x → (1, x_1, x_2, x_1 + x_2, x_1 − x_2, x_1^2, x_1 x_2, x_2 x_1, x_2^2)   for x ∈ R^2.

    Argue that d_VC(H_{Φ̌_2}) = d_VC(H_{Φ_2}). In other words, while Φ̌_2(x) ∈ R^9, d_VC(H_{Φ̌_2}) ≤ 6 < 9. Thus, the dimension of Φ(X) only gives an upper bound on d_VC(H_Φ), and the exact value of d_VC(H_Φ) can depend on the components of the transform.
1 1 6
3 . T h e L i n e a r M o d e l 3 . 5 . P r o b l e m s
Problem 3.19    A Transformer thinks the following procedures would work well in learning from two-dimensional data sets of any size. Please point out if there are any potential problems in the procedures:
(a) Use the feature transform

        Φ(x) = (0, ..., 0, 1, 0, ..., 0)  (the 1 in the nth position)   if x = x_n,
        Φ(x) = (0, 0, ..., 0)   otherwise,

    before running PLA.
(b) Use the feature transform Φ with

        Φ_n(x) = exp( −||x − x_n||^2 / (2γ^2) ),

    using some very small γ.
(c) Use the feature transform Φ that consists of all

        Φ_{i,j}(x) = exp( −||x − (i, j)||^2 / (2γ^2) ),

    before running PLA, with i ∈ {0, ..., 1} and j ∈ {0, ..., 1}.
Chapter 4
Overfitting
Paraskavedekatriaphobia^ (fear of Friday the 13th), and superstitions in gen
eral, are perhaps the m ost illustrious cases of the hum an ability to overfit.
U nfortunate events are mem orable, and given a few such mem orable events,
it is n a tu ra l to try and find an explanation. In the future, will there be more
unfortunate events on Friday the 13th’s th an on any other day?
O verfitting is the phenom enon where fitting the observed facts (data) well
no longer indicates th a t we will get a decent out-of-sam ple error, and may
actually lead to the opposite effect. You have probably seen cases of overfit-
ting when the learning model is more complex th an is necessary to represent
the ta rg e t function. T he model uses its additional degrees of freedom to fit
idiosyncrasies in the d a ta (for example, noise), yielding a final hypothesis th a t
is inferior. O verfitting can occur even when the hypothesis set contains only
functions which are fa r simpler th an the ta rg e t function, and so the plot thick
ens © .
T he ability to deal w ith overfitting is w hat separates professionals from
am ateurs in the field of learning from data . We will cover three themes;
W hen does overfitting occur? W hat are the tools to com bat overfitting? How
can one estim ate the degree of overfitting and ‘certify’ th a t a model is good,
or b e tte r th an another? O ur em phasis will be on techniques th a t work well in
practice.
4.1 W hen D oes O verfitting Occur?
O verfitting literally m eans “F ittin g the d a ta more th an is w arranted .” The
m ain case of overfitting is when you pick the hypothesis with lower Ei„, and
it results in higher Eout- This m eans th a t Ei„ alone is no longer a good guide
for learning. Let us start by identifying the cause of overfitting.
¹from the Greek paraskevi (Friday), dekatreis (thirteen), phobia (fear)
1 1 9
4 . O v e r f i t t i n g 4 . 1 . W h e n D o e s O v e r f i t t i n g O c c u r ?
Consider a simple one-dim ensional regression problem w ith five d a ta points.
We do not know the ta rg e t function, so le t’s select a general model, m axim iz
ing our chance to cap tu re th e ta rg e t function. Since 5 d a ta points can be fit
by a 4 th order polynom ial, we select 4 th order polynom ials.
The result is shown on the right. T he
ta rg e t function is a 2nd order polynom ial
(blue curve), w ith a little added noise in
the d a ta points. T hough the ta rg e t is
simple, the learning algorithm used the
full power of the 4 th order polynom ial to
fit the d a ta exactly, b u t the resu lt does
not look anyth ing like the ta rg e t function.
T he d a ta has been ‘overfit.’ T he little
noise in the d a ta has misled th e learning,
for if there were no noise, the fitted red
curve would exactly m atch the ta rg e t. This is a typ ical overfitting scenario,
in which a com plex m odel uses its add itional degrees of freedom to ‘learn ’ the
noise.
The fit has zero in-sam ple erro r b u t huge out-of-sam ple error, so th is is a
case of bad generalization (as discussed in C h ap ter 2) - a likely outcom e when
overfitting is occurring. However, our definition of overfitting goes beyond
bad generalization for any given hypothesis. Instead , overfitting applies to
a process: in th is case, the process of picking a hypothesis w ith lower and
lower Ein resulting in higher and higher Eout-
4.1 .1 A C ase S tudy: O v erfittin g w ith P o ly n o m ia ls
L et’s dig deeper to gain a b e tte r understand ing of w hen overfitting occurs. We
will illustrate the main concepts using data in one dimension and polynomial regression, a special case of a linear model that uses the feature transform x → (1, x, x^2, ...). Consider the two regression problems below:
(a) 10th order target function (b) 50th order target function
In bo th problem s, the ta rg e t function is a polynom ial and th e d a ta set T>
contains 15 d a ta points. In (a), the ta rg e t function is a 10th order polynom ial
Figure 4.1: Fits using 2nd and 10th order polynomials to 15 data points. In (a), the data are noisy and the target is a 10th order polynomial. In (b), the data are noiseless and the target is a 50th order polynomial.
and the sam pled d a ta are noisy (the d a ta do not lie on the targe t function
curve). In (b), the ta rg e t function is a 50th order polynom ial and the d a ta are
noiseless.
The best 2nd and 10th order fits are shown in Figure 4.1, and the in-sample and out-of-sample errors are given in the following table.

              10th order noisy target       50th order noiseless target
              2nd Order    10th Order       2nd Order    10th Order
    E_in        0.050        0.034            0.029        10^{-5}
    E_out       0.127        9.00             0.120        7680
W hat the learning algorithm sees is the data , not the targe t function. In both
cases, the 10th order polynom ial heavily overfits the data , and results in a
nonsensical final hypothesis which does not resemble the targe t function. The
2nd order fits do not cap tu re the full na tu re of the targe t function either, but
they do a t least cap tu re its general trend , resulting in significantly lower out-of-
sam ple error. T he 10th order fits have lower in-sam ple error and higher out-of-
sam ple error, so this is indeed a case of overfitting th a t results in pathologically
bad generalization.
E x e rc is e 4 .1
Let H 2 and H io be the 2nd and 10th order hypothesis sets respectively.
Specify these sets as parameterized sets of functions. Show that H 2 C H io .
These two examples reveal some surprising phenomena. Let's consider first the 10th order target function, Figure 4.1(a). Here is the scenario. Two learners, O (for overfitted) and R (for restricted), know that the target function is a 10th order polynomial, and that they will receive 15 noisy data points. Learner O
Figure 4.2: Learning curves for H_2 (left) and H_10 (right). Overfitting is occurring for N in the shaded gray region because by choosing H_10, which has better E_in, you get worse E_out.
uses model H_10, which is known to contain the target function, and finds the best fitting hypothesis to the data. Learner R uses model H_2, and similarly finds the best fitting hypothesis to the data.
The surprising thing is that learner R wins (lower out-of-sample error) by using the smaller model, even though she has knowingly given up the ability to implement the true target function. Learner R trades off a worse in-sample error for a huge gain in the generalization error, ultimately resulting in lower out-of-sample error. What is funny here? A folklore belief about learning is that best results are obtained by incorporating as much information about the target function as is available. But as we see here, even if we know the order of the target and naively incorporate this knowledge by choosing the model accordingly (H_10), the performance is inferior to that demonstrated by the more 'stable' 2nd order model.
The models H_2 and H_10 were in fact the ones used to generate the learning curves in Chapter 2, and we use those same learning curves to illustrate overfitting in Figure 4.2. If you mentally superimpose the two plots, you can see that there is a range of N for which H_10 has lower E_in but higher E_out than H_2 does, a case in point of overfitting.
Is learner R always going to prevail? Certainly not. For example, if the data was noiseless, then indeed learner O would recover the target function exactly from 15 data points, while learner R would have no hope. This brings us to the second example, Figure 4.1(b). Here, the data is noiseless, but the target function is very complex (50th order polynomial). Again learner R wins, and again because learner O heavily overfits the data. Overfitting is not a disease inflicted only upon complex models with many more degrees of freedom than warranted by the complexity of the target function. In fact the reverse is true here, and overfitting is just as bad. What matters is how the model complexity matches the quantity and quality of the data we have, not how it matches the target function.
4 .1 .2 C a ta ly sts for O verfitting
A skeptical reader should ask w hether the examples in Figure 4.1 are ju st
pathological constructions created by the authors, or is overfitting a real phe
nom enon which has to be considered carefully when learning from data? The
next exercise guides you th rough an experim ental design for studying overfit
ting w ithin our curren t setup. We will use the results from this experim ent
to serve two purposes: to convince you th a t overfitting is not the result of
some rare pathological construction, and to unravel some of the conditions
conducive to overfitting.
Exercise 4.2    [Experimental design for studying overfitting]
This is a reading exercise that sets up an experimental framework to study various aspects of overfitting. The reader interested in implementing the experiment can find the details fleshed out in Problem 4.4. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H_2 and H_10.
The target is a degree-Q_f polynomial, which we write f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where the L_q(x) are polynomials of increasing complexity (the Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) + σε_n and the ε_n are iid (independent and identically distributed) standard Normal random variates.
For a single experiment, with specified values for Q_f, N, σ, generate a random degree-Q_f target function by selecting coefficients a_q independently from a standard Normal, rescaling them so that E_{a,x}[f^2] = 1. Generate a data set, selecting x_1, ..., x_N independently according to P(x) and y_n = f(x_n) + σε_n. Let g_2 and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively, with out-of-sample errors E_out(g_2) and E_out(g_10).
Vary Q_f, N, σ, and for each combination of parameters, run a large number of experiments, each time computing E_out(g_2) and E_out(g_10). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ) using H_2 and H_10.
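A minimal sketch of one run of this experiment (illustrative choices throughout: numpy's Legendre utilities for the target, plain least-squares polynomial fits for g_2 and g_10, and a noise-free test set, which only shifts both out-of-sample errors by the same σ^2):

```python
import numpy as np
from numpy.polynomial import legendre

def run_experiment(Qf, N, sigma, n_test=2000, seed=None):
    """One run of the Exercise 4.2 setup; returns (Eout(g2), Eout(g10))."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(Qf + 1)                       # random Legendre coefficients
    # rescale so E_x[f^2] = 1; for x uniform on [-1,1], E[L_q^2] = 1/(2q+1)
    a /= np.sqrt(np.sum(a**2 / (2 * np.arange(Qf + 1) + 1)))
    f = lambda x: legendre.legval(x, a)

    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)             # noisy training data

    x_test = rng.uniform(-1, 1, n_test)
    y_test = f(x_test)                                    # noiseless target values

    def fit_and_eval(order):
        coef = np.polynomial.polynomial.polyfit(x, y, order)
        pred = np.polynomial.polynomial.polyval(x_test, coef)
        return np.mean((pred - y_test)**2)

    return fit_and_eval(2), fit_and_eval(10)
```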
Exercise 4.2 set up an experiment to study how the noise level σ^2, the target complexity Q_f, and the number of data points N relate to overfitting. We compare the final hypothesis g_10 ∈ H_10 (larger model) to the final hypothesis g_2 ∈ H_2 (smaller model). Clearly, E_in(g_10) ≤ E_in(g_2) since g_10 has more degrees of freedom to fit the data. What is surprising is how often g_10 overfits the data, resulting in E_out(g_10) > E_out(g_2). Let us define the overfit measure as E_out(g_10) − E_out(g_2). The more positive this measure is, the more severe overfitting would be.
Figure 4.3 shows how the extent of overfitting depends on certain parameters of the learning problem (the results are from our implementation of Exercise 4.2). In the figure, the colors map to the level of overfitting, with redder
Figure 4.3: How overfitting depends on the noise σ^2, the target function complexity Q_f, and the number of data points N. The colors map to the overfit measure E_out(g_10) − E_out(g_2). In (a), stochastic noise: overfitting as a function of σ^2 and N, with Q_f = 20; as σ^2 increases we are adding stochastic noise to the data. In (b), deterministic noise: overfitting as a function of Q_f and N, with σ^2 = 0.1; as Q_f increases we are adding deterministic noise to the data.
regions showing worse overfitting. These red regions are large— overfitting is
real, and here to stay.
Figure 4.3(a) reveals that there is less overfitting when the noise level σ^2 drops or when the number of data points N increases (the linear pattern in Figure 4.3(a) is typical). Since the 'signal' f is normalized to E[f^2] = 1, the noise level σ^2 is automatically calibrated to the signal level. Noise leads the learning astray, and the larger, more complex model is more susceptible to noise than the simpler one because it has more ways to go astray. Figure 4.3(b) reveals that target function complexity Q_f affects overfitting in a similar way to noise, albeit nonlinearly. To summarize:

    Number of data points ↑    Overfitting ↓
    Noise ↑                    Overfitting ↑
    Target complexity ↑        Overfitting ↑
Deterministic noise. Why does a higher target complexity lead to more overfitting when comparing the same two models? The intuition is that for a given learning model, there is a best approximation to the target function. The part of the target function 'outside' this best fit acts like noise in the data. We can call this deterministic noise to differentiate it from the random stochastic noise. Just as stochastic noise cannot be modeled, the deterministic noise is that part of the target function which cannot be modeled. The learning algorithm should not attempt to fit the noise; however, it cannot distinguish noise from signal. On a finite data set, the algorithm inadvertently uses some
of the degrees of freedom to fit the noise, which can result in overfitting and a spurious final hypothesis.

Figure 4.4: Deterministic noise. h* is the best fit to f in H_2. The shading illustrates the deterministic noise for this learning problem.

Figure 4.4 illustrates deterministic noise for a quadratic model fitting a more complex target function. While stochastic and deterministic noise have similar effects on overfitting, there are two basic differences between the two types of noise. First, if we generated the same data (x values) again, the deterministic noise would not change but the stochastic noise would. Second, different models capture different 'parts' of the target function, hence the same data set will have different deterministic noise depending on which model we use. In reality, we work with one model at a time and have only one data set on hand. Hence, we have one realization of the noise to work with and the algorithm cannot differentiate between the two types of noise.
E x erc ise 4 .3
Deterministic noise depends on H, as some models approximate / better
than others,
(a ) Assume H is fixed and we increase the complexity of / . W ill deter
ministic noise in general go up or down? Is there a higher or lower
tendency to overfit?
(b ) Assume / is fixed and we decrease the complexity of H. Will deter
ministic noise in general go up or down? Is there a higher or lower
tendency to overfit? [Hint: There is a race between two factors that
affect overfitting in opposite ways, but one wins.]
The bias-variance decomposition, which we discussed in Section 2.3.1 (see also Problem 2.22), is a useful tool for understanding how noise affects performance:

    E_D[E_out] = σ^2 + bias + var.

The first two terms reflect the direct impact of the stochastic and deterministic noise. The variance of the stochastic noise is σ^2 and the bias is directly
related to the deterministic noise in that it captures the model's inability to approximate f. The var term is indirectly impacted by both types of noise, capturing a model's susceptibility to being led astray by the noise.
4.2 Regularization

Regularization is our first weapon to combat overfitting. It constrains the learning algorithm to improve out-of-sample error, especially when noise is present. To whet your appetite, look at what a little regularization can do for our first overfitting example in Section 4.1. Though we only used a very small 'amount' of regularization, the fit improves dramatically (the figure compares the fit without and with regularization).
Now th a t we have your a tten tio n , we would like to come clean. R egularization
is as much an a r t as it is a science. M ost of th e m ethods used successfully
in practice are heuristic m ethods. However, these m ethods are grounded in a
m athem atical fram ework th a t is developed for special cases. We will discuss
bo th the m athem atica l and the heuristic, try ing to m ain ta in a balance th a t
reflects the reality of the field.
Speaking of heuristics, one view of regularization is through the lens of the VC bound, which bounds E_out using a model complexity penalty Ω(H):

    E_out(h) ≤ E_in(h) + Ω(H)    for all h ∈ H.                  (4.1)

So, we are better off if we fit the data using a simple H. Extrapolating one step further, we should be better off by fitting the data using a 'simple' h from H. The essence of regularization is to concoct a measure Ω(h) for the complexity of an individual hypothesis. Instead of minimizing E_in(h) alone, one minimizes a combination of E_in(h) and Ω(h). This avoids overfitting by constraining the learning algorithm to fit the data well using a simple hypothesis.
E x a m p le 4 .1 . O ne popu lar regularization technique is weight decay, which
m easures the com plexity of a hypothesis h by the size of the coefficients used
to represent h (e.g. in a linear m odel). T his heuristic prefers m ild lines w ith
small offset and slope, to wild lines w ith bigger offset and slope. We will get to
the mechanics of weight decay shortly, bu t for now le t’s focus on the outcome.
We apply weight decay to fitting the target f(x) = sin(πx) using N = 2 data points (as in Example 2.8). We sample x uniformly in [−1, 1], generate a data set and fit a line to the data (our model is H_1). The figures below show the resulting fits on the same (random) data sets with and without regularization.
Without regularization, the learned function varies extensively depending on the data set. As we have seen in Example 2.8, a constant model scored E_out = 0.75, handily beating the performance of the (unregularized) linear model that scored E_out = 1.90. With a little weight decay regularization, the fits to the same data sets are considerably less volatile. This results in a significantly lower E_out = 0.56 that beats both the constant model and the unregularized linear model.
T he bias-variance decom position helps us to understand how the regular
ized version bea t bo th the unregularized version as well as the constant model.
Average hypothesis ḡ (red) with var(x) indicated by the gray shaded region that is ḡ(x) ± √var(x). Without regularization: bias = 0.21, var = 1.69. With regularization: bias = 0.23, var = 0.33.
As expected, regularization reduced the var term rather dramatically from 1.69 down to 0.33. The price paid in terms of the bias (quality of the average fit) was
modest, only slightly increasing from 0.21 to 0.23. The result was a significant decrease in the expected out-of-sample error because bias + var decreased. This is the crux of regularization. By constraining the learning algorithm to select 'simpler' hypotheses from H, we sacrifice a little bias for a significant gain in the var. □
This example also illustrates why regularization is needed. The linear model is too sophisticated for the amount of data we have, since a line can perfectly fit any 2 points. This need would persist even if we changed the
ta rg e t function, as long as we have either stochastic or determ inistic noise.
T he need for regularization depends on the quantity and quality o f the data.
G iven our m eager d a ta set, our choices were either to take a sim pler model,
such as the m odel w ith constan t functions, or to constra in th e linear model. It
tu rn s out th a t using th e com plex m odel b u t constra in ing the algorithm tow ard
sim pler hypotheses gives us m ore flexibility, and ends up giving the best Eout-
In practice, this is the rule no t the exception.
E nough heuristics. L e t’s develop the m athem atics of regularization.
4.2.1 A Soft Order Constraint

In this section, we derive a regularization method that applies to a wide variety of learning problems. To simplify the math, we will use the concrete setting of regression using Legendre polynomials, the polynomials of increasing complexity used in Exercise 4.2. So, let's first formally introduce you to the Legendre polynomials.
Consider a learning model where H is the set of polynomials in one variable x ∈ [−1, 1]. Instead of expressing the polynomials in terms of consecutive powers of x, we will express them as a combination of Legendre polynomials
in X . Legendre polynom ials are a s tan d a rd set of polynom ials w ith nice ana
lytic properties th a t resu lt in sim pler derivations. T he zero th-order Legendre
polynom ial is the constan t L q {x ) = 1, and the first few Legendre polynom ials
are illustra ted below.
[Figure: the first few Legendre polynomials, e.g. L_2(x) = ½(3x² − 1) and L_4(x) = ⅛(35x⁴ − 30x² + 3).]
As you can see, when the order of the Legendre polynomial increases, the curve gets more complex. Legendre polynomials are orthogonal to each other within x ∈ [−1, 1], and any regular polynomial can be written as a linear combination of Legendre polynomials, just like it can be written as a linear combination of powers of x.
Polynomial models are a special case of linear models in a space Z, under a nonlinear transformation Φ: X → Z. Here, for the Qth order polynomial model, Φ transforms x into a vector z of Legendre polynomials,
\[ \mathbf{z} = \begin{bmatrix} 1 \\ L_1(x) \\ \vdots \\ L_Q(x) \end{bmatrix}. \]
Our hypothesis set H_Q is a linear combination of these polynomials,
\[ \mathcal{H}_Q = \left\{ h \;\Big|\; h(x) = \mathbf{w}^{\rm T}\mathbf{z} = \sum_{q=0}^{Q} w_q L_q(x),\ \mathbf{w} \in \mathbb{R}^{Q+1} \right\}, \]
where L_0(x) = 1. As usual, we will sometimes refer to the hypothesis h by its weight vector w.¹ Since each h is linear in w, we can use the machinery of linear regression from Chapter 3 to minimize the squared error
\[ E_{\rm in}(\mathbf{w}) = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{w}^{\rm T}\mathbf{z}_n - y_n\right)^2. \qquad (4.2) \]
The case of polynomial regression with squared-error measure illustrates the main ideas of regularization well, and facilitates a solid mathematical derivation. Nonetheless, our discussion will generalize in practice to non-linear, multi-dimensional settings with more general error measures. The baseline algorithm (without regularization) is to minimize E_in over the hypotheses in H_Q to produce the final hypothesis g(x) = w_lin^T z, where w_lin = argmin_w E_in(w).
Exercise 4.4
Let Z = [z_1 ... z_N]^T be the data matrix (assume Z has full column rank); let w_lin = (Z^T Z)^{-1} Z^T y; and let H = Z(Z^T Z)^{-1} Z^T (the hat matrix of Exercise 3.3). Show that
\[ E_{\rm in}(\mathbf{w}) = \frac{(\mathbf{w}-\mathbf{w}_{\rm lin})^{\rm T}\mathrm{Z}^{\rm T}\mathrm{Z}(\mathbf{w}-\mathbf{w}_{\rm lin}) + \mathbf{y}^{\rm T}(\mathrm{I}-\mathrm{H})\mathbf{y}}{N}, \qquad (4.3) \]
where I is the identity matrix.
(a) What value of w minimizes E_in?
(b) What is the minimum in-sample error?
The task of regularization, which results in a final hypothesis w_reg instead of the simple w_lin, is to constrain the learning so as to prevent overfitting the

¹We used w̃ and d̃ for the weight vector and dimension in Z. Since we are explicitly dealing with polynomials and Z is the only space around, we use w and Q for simplicity.
data. We have already seen an example of constraining the learning; the set H_2 can be thought of as a constrained version of H_10 in the sense that some of the H_10 weights are required to be zero. That is, H_2 is a subset of H_10 defined by H_2 = {w | w ∈ H_10; w_q = 0 for q ≥ 3}. Requiring some weights to be 0 is a hard constraint. We have seen that such a hard constraint on the order can help; for example, H_2 is better than H_10 when there is a lot of noise and N is small. Instead of requiring some weights to be zero, we can force the weights to be small but not necessarily zero through a softer constraint such as
\[ \sum_{q=0}^{Q} w_q^2 \le C. \]
This is a 'soft order' constraint because it only encourages each weight to be small, without changing the order of the polynomial by explicitly setting some weights to zero. The in-sample optimization problem becomes:
\[ \min_{\mathbf{w}}\ E_{\rm in}(\mathbf{w}) \quad \text{subject to} \quad \mathbf{w}^{\rm T}\mathbf{w} \le C. \qquad (4.4) \]
The data determines the optimal weight sizes, given the total budget C which determines the amount of regularization; the larger C is, the weaker the constraint and the smaller the amount of regularization. We can define the soft-order-constrained hypothesis set H(C) by
\[ \mathcal{H}(C) = \{ h \mid h(x) = \mathbf{w}^{\rm T}\mathbf{z},\ \mathbf{w}^{\rm T}\mathbf{w} \le C \}. \]
Equation (4.4) is equivalent to minimizing E_in over H(C). If C_1 < C_2, then H(C_1) ⊂ H(C_2) and so d_vc(H(C_1)) ≤ d_vc(H(C_2)), and we expect better generalization with H(C_1). Let the regularized weights w_reg be the solution to (4.4).
Solving for w_reg. If w_lin^T w_lin ≤ C then w_reg = w_lin because w_lin ∈ H(C). If w_lin ∉ H(C), then not only is w_reg^T w_reg ≤ C, but in fact w_reg^T w_reg = C (w_reg uses the entire budget C; see Problem 4.10). We thus need to minimize E_in subject to the equality constraint w^T w = C. The situation is illustrated to the right. The weights w must lie on the surface of the sphere w^T w = C; the normal vector to this surface at w is the vector w itself (also in red). A surface of constant E_in is shown in blue; this surface is a quadratic surface (see Exercise 4.4) and the normal to this surface is ∇E_in(w). In this case, w cannot be optimal because ∇E_in(w) is not parallel to the red normal vector. This means that ∇E_in(w) has some nonzero component along the constraint surface, and by moving a small amount in the opposite direction of this component we can improve E_in, while still
remaining on the surface. If w_reg is to be optimal, then for some positive parameter λ_C
\[ \nabla E_{\rm in}(\mathbf{w}_{\rm reg}) = -2\lambda_C\,\mathbf{w}_{\rm reg}, \]
i.e., ∇E_in must be parallel to w_reg, the normal vector to the constraint surface (the scaling by 2 is for mathematical convenience and the negative sign is because ∇E_in and w are in opposite directions). Equivalently, w_reg satisfies
\[ \nabla\!\left( E_{\rm in}(\mathbf{w}) + \lambda_C\,\mathbf{w}^{\rm T}\mathbf{w} \right)\Big|_{\mathbf{w}=\mathbf{w}_{\rm reg}} = 0, \]
because ∇(w^T w) = 2w. So, for some λ_C > 0, w_reg locally minimizes
\[ E_{\rm in}(\mathbf{w}) + \lambda_C\,\mathbf{w}^{\rm T}\mathbf{w}. \qquad (4.5) \]
The parameter λ_C and the vector w_reg (both of which depend on C and the data) must be chosen so as to simultaneously satisfy the gradient equality and the weight norm constraint w_reg^T w_reg = C.² That λ_C > 0 is intuitive since we are enforcing smaller weights, and minimizing E_in(w) + λ_C w^T w would not lead to smaller weights if λ_C were negative. Note that if w_lin^T w_lin ≤ C, then w_reg = w_lin and minimizing (4.5) still holds with λ_C = 0. Therefore, we have an equivalence between solving the constrained problem (4.4) and the unconstrained minimization of (4.5). This equivalence means that minimizing (4.5) is similar to minimizing E_in using a smaller hypothesis set, which in turn means that we can expect better generalization by minimizing (4.5) than by just minimizing E_in.
Other variations of the constraint in (4.4) can be used to emphasize some weights over the others. Consider the constraint Σ_{q=0}^Q γ_q w_q² ≤ C. The importance γ_q given to weight w_q determines the type of regularization. For example, γ_q = q or γ_q = e^q encourages a low-order fit, and γ_q = (1 + q)^{-1} or γ_q = e^{-q} encourages a high-order fit. In extreme cases, one recovers hard-order constraints by choosing some γ_q = 0 and some γ_q → ∞.
Exercise 4.5 [Tikhonov regularizer]
A more general soft constraint is the Tikhonov regularization constraint
\[ \mathbf{w}^{\rm T}\Gamma^{\rm T}\Gamma\,\mathbf{w} \le C, \]
which can capture relationships among the w_i (the matrix Γ is the Tikhonov regularizer).
(a) What should Γ be to obtain the constraint Σ_{q=0}^Q w_q² ≤ C?
(b) What should Γ be to obtain the constraint (Σ_{q=0}^Q w_q)² ≤ C?
²λ_C is known as a Lagrange multiplier and an alternate derivation of these same results can be obtained via the theory of Lagrange multipliers for constrained optimization.
4.2.2 Weight Decay and Augmented Error
The soft-order constraint for a given value of C is a constrained minimization of E_in. Equation (4.5) suggests that we may equivalently solve an unconstrained minimization of a different function. Let's define the augmented error,
\[ E_{\rm aug}(\mathbf{w}) = E_{\rm in}(\mathbf{w}) + \lambda\,\mathbf{w}^{\rm T}\mathbf{w}, \qquad (4.6) \]
where λ ≥ 0 is now a free parameter at our disposal. The augmented error has two terms. The first is the in-sample error which we are used to minimizing, and the second is a penalty term. Notice that this fits the heuristic view of regularization that we discussed earlier, where the penalty for complexity is defined for each individual h instead of H as a whole. When λ = 0, we have the usual in-sample error. For λ > 0, minimizing the augmented error corresponds to minimizing a penalized in-sample error. The value of λ controls the amount of regularization. The penalty term w^T w enforces a tradeoff between making the in-sample error small and making the weights small, and has become known as weight decay. As discussed in Problem 4.8, if we minimize the augmented error using an iterative method like gradient descent, we will have a reduction of the in-sample error together with a gradual shrinking of the weights, hence the name weight 'decay.' In the statistics community, this type of penalty term is a form of ridge regression.
There is an equivalence between the soft order constraint and augmented error minimization. In the soft-order constraint, the amount of regularization is controlled by the parameter C. From (4.5), there is a particular λ_C (depending on C and the data D), for which minimizing the augmented error E_aug(w) leads to the same final hypothesis w_reg. A larger C allows larger weights and is a weaker soft-order constraint; this corresponds to smaller λ, i.e., less emphasis on the penalty term w^T w in the augmented error. For a particular data set, the optimal value C* leading to minimum out-of-sample error with the soft-order constraint corresponds to an optimal value λ* in the augmented error minimization. If we can find λ*, we can get the minimum E_out.
Have we gained from the augmented error view? Yes, because augmented error minimization is unconstrained, which is generally easier than constrained minimization. For example, we can obtain a closed form solution for linear models or use a method like stochastic gradient descent to carry out the minimization. However, augmented error minimization is not so easy to interpret. There are no values for the weights which are explicitly forbidden, as there are in the soft-order constraint. For a given C, the soft-order constraint corresponds to selecting a hypothesis from the smaller set H(C), and so from our VC analysis we should expect better generalization when C decreases (λ increases). It is through the relationship between λ and C that one has a theoretical justification of weight decay as a method for regularization.
We focused on the soft-order constraint w^T w ≤ C with corresponding augmented error E_aug(w) = E_in(w) + λ w^T w. However, our discussion applies more generally. There is a duality between the minimization of the in-sample
error over a constrained hypothesis set and the unconstrained minimization of an augmented error. We may choose to live in either world, but more often than not, the unconstrained minimization of the augmented error is more convenient.
In our definition of E_aug(w) in Equation (4.6), we only highlighted the dependence on w. There are two other quantities under our control, namely the amount of regularization, λ, and the nature of the regularizer which we chose to be w^T w. In general, the augmented error for a hypothesis h ∈ H is
\[ E_{\rm aug}(h, \lambda, \Omega) = E_{\rm in}(h) + \frac{\lambda}{N}\,\Omega(h). \qquad (4.7) \]
For weight decay, Ω(h) = w^T w, which penalizes large weights. The penalty term has two components: the regularizer Ω(h) (the type of regularization) which penalizes a particular property of h, and the regularization parameter λ (the amount of regularization). The need for regularization goes down as the number of data points goes up, so we factored out 1/N; this allows the optimal choice for λ to be less sensitive to N. This is just a redefinition of the λ that we have been using, in order to make it a more stable parameter that is easier to interpret. Notice how Equation (4.7) resembles the VC bound (4.1) as we anticipated in the heuristic view of regularization. This is why we use the same notation Ω for both the penalty on individual hypotheses Ω(h) and the penalty on the whole set Ω(H). The correspondence between the complexity of H and the complexity of an individual h will be discussed further in Section 5.1.
The regularizer Ω is typically fixed ahead of time, before seeing the data; sometimes the problem itself can dictate an appropriate regularizer.
Exercise 4.6
We have seen both the hard-order constraint and the soft-order constraint. Which do you expect to be more useful for binary classification using the perceptron model? [Hint: sign(w^T x) = sign(α w^T x) for any α > 0.]
The optimal regularization parameter, however, typically depends on the data. The choice of the optimal λ is one of the applications of validation, which we will discuss shortly.
Example 4.2. Linear models with weight decay. Linear models are important enough that it is worthwhile to spell out the details of augmented error minimization in this case. From Exercise 4.4, the augmented error is
\[ E_{\rm aug}(\mathbf{w}) = \frac{(\mathbf{w}-\mathbf{w}_{\rm lin})^{\rm T}\mathrm{Z}^{\rm T}\mathrm{Z}(\mathbf{w}-\mathbf{w}_{\rm lin}) + \lambda\,\mathbf{w}^{\rm T}\mathbf{w} + \mathbf{y}^{\rm T}(\mathrm{I}-\mathrm{H})\mathbf{y}}{N}, \]
where Z is the transformed data matrix and w_lin = (Z^T Z)^{-1} Z^T y. The reader may verify, after taking the derivatives of E_aug and setting ∇_w E_aug = 0, that
\[ \mathbf{w}_{\rm reg} = (\mathrm{Z}^{\rm T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{\rm T}\mathbf{y}. \]
As expected, w_reg will go to zero as λ → ∞, due to the λI term. The predictions on the in-sample data are given by ŷ = Z w_reg = H(λ)y, where
\[ \mathrm{H}(\lambda) = \mathrm{Z}(\mathrm{Z}^{\rm T}\mathrm{Z} + \lambda\mathrm{I})^{-1}\mathrm{Z}^{\rm T}. \]
The matrix H(λ) plays an important role in defining the effective complexity of a model. When λ = 0, H is the hat matrix of Exercises 3.3 and 4.4, which satisfies H² = H and trace(H) = d + 1. The vector of in-sample errors, which are also called residuals, is y − ŷ = (I − H(λ))y, and the in-sample error is E_in(w_reg) = (1/N) y^T (I − H(λ))² y. □
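As a concrete illustration (not code from the text), here is a minimal numerical sketch of this example in Python: it builds a Legendre feature matrix Z, computes w_reg = (Z^T Z + λI)^{-1} Z^T y, and forms H(λ). The target, noise level, and the values of N, Q and λ are assumptions made only for the illustration.

import numpy as np
from numpy.polynomial import legendre

# Illustrative data: noisy samples of an assumed target on [-1, 1].
rng = np.random.default_rng(0)
N, Q, lam = 30, 15, 0.1                      # sample size, polynomial order, lambda (assumed)
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.5 * rng.standard_normal(N)

# Z: the n-th row is (L_0(x_n), L_1(x_n), ..., L_Q(x_n)).
Z = legendre.legvander(x, Q)

# Weight decay solution  w_reg = (Z^T Z + lambda I)^{-1} Z^T y.
I = np.eye(Q + 1)
w_reg = np.linalg.solve(Z.T @ Z + lam * I, Z.T @ y)

# H(lambda) = Z (Z^T Z + lambda I)^{-1} Z^T, and E_in(w_reg) = (1/N) y^T (I - H(lambda))^2 y.
H = Z @ np.linalg.solve(Z.T @ Z + lam * I, Z.T)
E_in = np.mean((H @ y - y) ** 2)             # identical to the quadratic-form expression
print(w_reg[:3], E_in)

Setting λ = 0 (with Z of full column rank) recovers the ordinary least-squares solution w_lin and the hat matrix of Exercise 4.4.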
We can now apply weight decay regularization to the first overfitting example that opened this chapter. The results for different λ's are shown in Figure 4.5.
Figure 4.5: Weight decay applied to Example 4.2 with different values for the regularization parameter λ (panels: λ = 0, 0.0001, 0.01, 1; legend: Data, Target, Fit). The red fit gets flatter as we increase λ.
As you can see, even very little regularization goes a long way, but too much regularization results in an overly flat curve at the expense of in-sample fit. Another case we saw earlier is Example 4.1, where we fit a linear model to a sinusoid. The regularization used there was also weight decay, with λ = 0.1.
4.2.3 Choosing a Regularizer: Pill or Poison?
We have presented a number of ways to constrain a model: hard-order constraints where we simply use a lower-order model, soft-order constraints where we constrain the parameters of the model, and augmented error where we add a penalty term to an otherwise unconstrained minimization of error. Augmented error is the most popular form of regularization, for which we need to choose the regularizer Ω(h) and the regularization parameter λ.
In practice, the choice of Ω is largely heuristic. Finding a perfect Ω is as difficult as finding a perfect H. It depends on information that, by the very nature of learning, we don't have. However, there are regularizers we can work with that have stood the test of time, such as weight decay. Some forms of regularization work and some do not, depending on the specific application and the data. Figure 4.5 illustrated that even the amount of regularization
Figure 4.6: Out-of-sample performance for the uniform (panel a) and low-order (panel b) regularizers using model H_15, with σ² = 0.5, Q_f = 15 and N = 30; each panel plots expected E_out against the regularization parameter λ. Overfitting occurs in the shaded region because lower E_in (lower λ) leads to higher E_out. Underfitting occurs when λ is too large, because the learning algorithm has too little flexibility to fit the data.
has to be chosen carefully. Too much regularization (too harsh a constraint) leaves the learning too little flexibility to fit the data and leads to underfitting, which can be just as bad as overfitting.
If so many choices can go wrong, why do we bother with regularization in the first place? Regularization is a necessary evil, with the operative word being necessary. If our model is too sophisticated for the amount of data we have, we are doomed. By applying regularization, we have a chance. By applying the proper regularization, we are in good shape. Let us experiment with two choices of a regularizer for the model H_15 of 15th order polynomials, using the experimental design in Exercise 4.2:
1. A uniform regularizer: Ω_unif(w) = Σ_{q=0}^{15} w_q².
2. A low-order regularizer: Ω_low(w) = Σ_{q=0}^{15} q w_q².
The first encourages all weights to be small, uniformly; the second pays more attention to the higher order weights, encouraging a lower order fit. Figure 4.6 shows the performance for different values of the regularization parameter λ.
As you decrease λ, the optimization pays less attention to the penalty term and more to E_in, and so E_in will decrease (Problem 4.7). In the shaded region, E_out increases as you decrease E_in (decrease λ): the regularization parameter is too small and there is not enough of a constraint on the learning, leading to decreased performance because of overfitting. In the unshaded region, the regularization parameter is too large, over-constraining the learning and not giving it enough flexibility to fit the data, leading to decreased performance because of underfitting. As can be observed from the figure, the price paid for overfitting is generally more severe than underfitting. It usually pays to be conservative.
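Both regularizers are instances of the Tikhonov form of Exercise 4.5 with a diagonal Γ. The sketch below is purely illustrative (the data, target, and value of λ are assumptions, not the experiment behind Figure 4.6); it assumes the derivation of Example 4.2 carries over with w^T w replaced by w^T Γ^T Γ w, which gives w_reg = (Z^T Z + λ Γ^T Γ)^{-1} Z^T y.

import numpy as np
from numpy.polynomial import legendre

def tikhonov_fit(Z, y, lam, Gamma):
    # Minimize E_in(w) + (lambda/N) * w^T Gamma^T Gamma w  (Tikhonov form of the penalty).
    return np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

rng = np.random.default_rng(1)
Q, N, lam = 15, 30, 1.0                                 # assumed experiment sizes
x = rng.uniform(-1, 1, N)
y = legendre.legval(x, rng.standard_normal(Q + 1)) + 0.7 * rng.standard_normal(N)
Z = legendre.legvander(x, Q)

Gamma_unif = np.eye(Q + 1)                              # Omega_unif(w) = sum_q w_q^2
Gamma_low = np.diag(np.sqrt(np.arange(Q + 1.0)))        # Gamma^T Gamma = diag(q): Omega_low(w) = sum_q q w_q^2

w_unif = tikhonov_fit(Z, y, lam, Gamma_unif)
w_low = tikhonov_fit(Z, y, lam, Gamma_low)
print(np.round(w_unif[-4:], 3), np.round(w_low[-4:], 3))  # the low-order regularizer damps the high-order weights more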
Figure 4.7: Performance of the uniform regularizer at different levels of noise, plotting expected E_out against the regularization parameter λ. Panel (a) varies the stochastic noise (σ² = 0, 0.25, 0.5); panel (b) varies the deterministic noise (Q_f = 15, 30, 100). The optimal λ is highlighted for each curve.
The optimal regularization parameter for the two cases is quite different and the performance can be quite sensitive to the choice of regularization parameter. However, the promising message from the figure is that though the behaviors are quite different, the performances of the two regularizers are comparable (around 0.76), if we choose the right λ for each.
We can also use this experiment to study how performance with regularization depends on the noise. In Figure 4.7(a), when σ² = 0, no amount of regularization helps (i.e., the optimal regularization parameter is λ = 0), which is not a surprise because there is no stochastic or deterministic noise in the data (both target and model are 15th order polynomials). As we add more stochastic noise, the overall performance degrades as expected. Note that the optimal value for the regularization parameter increases with noise, which is also expected based on the earlier discussion that the potential to overfit increases as the noise increases; hence, constraining the learning more should help. Figure 4.7(b) shows what happens when we add deterministic noise, keeping the stochastic noise at zero. This is accomplished by increasing Q_f (the target complexity), thereby adding deterministic noise, but keeping everything else the same. Comparing parts (a) and (b) of Figure 4.7 provides another demonstration of how the effects of deterministic and stochastic noise are similar. When either is present, it is helpful to regularize, and the more noise there is, the larger the amount of regularization you need.
What happens if you pick the wrong regularizer? To illustrate, we picked a regularizer which encourages large weights (weight growth) versus weight decay which encourages small weights. As you can see from the accompanying plot of E_out against the regularization parameter λ, in this case weight growth does not help the cause of overfitting. If we happened to choose weight growth as our regularizer, we would still be OK as long as we have
a good way to pick the regularization parameter: the optimal regularization parameter in this case is λ = 0, and we are no worse off than not regularizing. No regularizer will be ideal for all settings, or even for a specific setting since we never have perfect information, but they all tend to work with varying success, if the amount of regularization λ is set to the correct level. Thus, the entire burden rests on picking the right λ, a task that can be addressed by a technique called validation, which is the topic of the next section.
The lesson learned is that some form of regularization is necessary, as learning is quite sensitive to stochastic and deterministic noise. The best way to constrain the learning is in the 'direction' of the target function, and more of a constraint is needed when there is more noise. Even though we don't know either the target function or the noise, regularization helps by reducing the impact of the noise. Most common models have hypothesis sets which are naturally parameterized so that smaller parameters lead to smoother hypotheses. Thus, a weight decay type of regularizer constrains the learning towards smoother hypotheses. This helps, because stochastic noise is 'high frequency' (non-smooth). Similarly, deterministic noise (the part of the target function which cannot be modeled) also tends to be non-smooth. Thus, constraining the learning towards smoother hypotheses 'hurts' our ability to overfit the noise more than it hurts our ability to fit the useful information. These are empirical observations, not theoretically justifiable statements.
Regularization and the VC dimension. Regularization (for example soft-order selection by minimizing the augmented error) poses a problem for the VC line of reasoning. As λ goes up, the learning algorithm changes but the hypothesis set does not, so d_vc will not change. We argued that an increase in λ in the augmented error corresponds to a decrease in C in the soft-order constrained model. So, more regularization corresponds to an effectively smaller model, and we expect better generalization for a small increase in E_in, even though the VC dimension of the model we are actually using with augmented error does not change. This suggests a heuristic that works well in practice, which is to use an 'effective VC dimension' instead of the VC dimension. For linear perceptrons, the VC dimension equals the number of free parameters d + 1, and so an effective number of parameters is a good surrogate for the VC dimension in the VC bound. The effective number of parameters will go down as λ increases, and so the effective VC dimension will reflect better generalization with increased regularization. Problems 4.13, 4.14, and 4.15 explore the notion of an effective number of parameters.
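As one illustration (an assumption on our part, not a definition taken from this section; the precise definitions are developed in Problems 4.13-4.15), a commonly used surrogate takes the effective number of parameters to be trace(H(λ)): it equals d + 1 at λ = 0 and shrinks as λ grows.

import numpy as np

def effective_parameters(Z, lam):
    # trace of H(lambda) = Z (Z^T Z + lambda I)^{-1} Z^T, used here as an assumed
    # surrogate for the effective number of parameters.
    d_plus_1 = Z.shape[1]
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(d_plus_1), Z.T)
    return np.trace(H)

rng = np.random.default_rng(2)
Z = rng.standard_normal((50, 6))               # toy feature matrix with d + 1 = 6
for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, round(effective_parameters(Z, lam), 2))
# lam = 0 gives trace(H) = 6; the value decreases as lam increases.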
4.3 Validation
So far, we have identified overfitting as a problem, noise (stochastic and deterministic) as a cause, and regularization as a cure. In this section, we introduce another cure, called validation. One can think of both regularization and
validation as attempts at minimizing E_out rather than just E_in. Of course the true E_out is not available to us, so we need an estimate of E_out based on information available to us in sample. In some sense, this is the Holy Grail of machine learning: to find an in-sample estimate of the out-of-sample error. Regularization attempts to minimize E_out by working through the equation
\[ E_{\rm out}(h) = E_{\rm in}(h) + \underbrace{\text{overfit penalty}}_{\text{regularization estimates this quantity}}, \]
and concocting a heuristic term that emulates the penalty term. Validation, on the other hand, cuts to the chase and estimates the out-of-sample error directly:
\[ \underbrace{E_{\rm out}(h)}_{\text{validation estimates this quantity}} = E_{\rm in}(h) + \text{overfit penalty}. \]
Estimating the out-of-sample error directly is nothing new to us. In Section 2.2.3, we introduced the idea of a test set, a subset of D that is not involved in the learning process and is used to evaluate the final hypothesis. The test error E_test, unlike the in-sample error E_in, is an unbiased estimate of E_out.
4.3.1 The Validation Set
The idea of a validation set is almost identical to that of a test set. We remove a subset from the data; this subset is not used in training. We then use this held-out subset to estimate the out-of-sample error. The held-out set is effectively out-of-sample, because it has not been used during the learning.
However, there is a difference between a validation set and a test set. Although the validation set will not be directly used for training, it will be used in making certain choices in the learning process. The minute a set affects the learning process in any way, it is no longer a test set. However, as we will see, the way the validation set is used in the learning process is so benign that its estimate of E_out remains almost intact.
Let us first look at how the validation set is created. The first step is to partition the data set D into a training set D_train of size (N − K) and a validation set D_val of size K. Any partitioning method which does not depend on the values of the data points will do; for example, we can select N − K points at random for training and the remaining for validation.
Now, we run the learning algorithm using the training set D_train to obtain a final hypothesis g⁻ ∈ H, where the 'minus' superscript indicates that some data points were taken out of the training. We then compute the validation error for g⁻ using the validation set D_val:
\[ E_{\rm val}(g^-) = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\rm val}} e\!\left(g^-(\mathbf{x}_n), y_n\right), \]
where e(g⁻(x), y) is the pointwise error measure which we introduced in Section 1.4.1. For classification, e(g⁻(x), y) = [g⁻(x) ≠ y] and for regression using squared error, e(g⁻(x), y) = (g⁻(x) − y)².
The validation error is an unbiased estimate of E_out because the final hypothesis g⁻ was created independently of the data points in the validation set. Indeed, taking the expectation of E_val with respect to the data points in D_val,
\[ \mathbb{E}_{\mathcal{D}_{\rm val}}\!\left[E_{\rm val}(g^-)\right] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\rm val}} \mathbb{E}_{\mathbf{x}_n}\!\left[e(g^-(\mathbf{x}_n), y_n)\right] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\rm val}} E_{\rm out}(g^-) = E_{\rm out}(g^-). \qquad (4.8) \]
The first step uses the linearity of expectation, and the second step follows because e(g⁻(x_n), y_n) depends only on x_n and so
\[ \mathbb{E}_{\mathcal{D}_{\rm val}}\!\left[e(g^-(\mathbf{x}_n), y_n)\right] = \mathbb{E}_{\mathbf{x}_n}\!\left[e(g^-(\mathbf{x}_n), y_n)\right] = E_{\rm out}(g^-). \]
How reliable is E_val at estimating E_out? In the case of classification, one can use the VC bound to predict how good the validation error is as an estimate for the out-of-sample error. We can view D_val as an 'in-sample' data set on which we computed the error of the single hypothesis g⁻. We can thus apply the VC bound for a finite model with one hypothesis in it (the Hoeffding bound). With high probability,
\[ E_{\rm out}(g^-) \le E_{\rm val}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.9) \]
While Inequality (4.9) applies to binary target functions, we may use the variance of E_val as a more generally applicable measure of the reliability. The next exercise studies how the variance of E_val depends on K (the size of the validation set), and implies that a similar bound holds for regression. The conclusion is that the error between E_val(g⁻) and E_out(g⁻) drops as σ(g⁻)/√K, where σ(g⁻) is bounded by a constant in the case of classification.
Exercise 4.7
Fix g⁻ (learned from D_train) and define σ²_val := Var_{D_val}[E_val(g⁻)]. We consider how σ²_val depends on K. Let
\[ \sigma^2(g^-) = \mathrm{Var}_{\mathbf{x}}\!\left[e(g^-(\mathbf{x}), y)\right] \]
be the pointwise variance in the out-of-sample error of g⁻.
(a) Show that σ²_val = (1/K) σ²(g⁻).
(b) In a classification problem, where e(g⁻(x), y) = [g⁻(x) ≠ y], express σ²_val in terms of P[g⁻(x) ≠ y].
(c) Show that for any g⁻ in a classification problem, σ²_val ≤ 1/(4K).
(d) Is there a uniform upper bound for Var[E_val(g⁻)] similar to (c) in the case of regression with squared error e(g⁻(x), y) = (g⁻(x) − y)²? [Hint: The squared error is unbounded.]
(e) For regression with squared error, if we train using fewer points (smaller N − K) to get g⁻, do you expect σ²(g⁻) to be higher or lower? [Hint: For continuous, non-negative random variables, higher mean often implies higher variance.]
(f) Conclude that increasing the size of the validation set can result in a better or a worse estimate of E_out.
The expected validation error for H_2 is illustrated in Figure 4.8, where we used the experimental design in Exercise 4.2, with Q_f = 10, N = 40 and noise level 0.4. The expected validation error equals E_out(g⁻), per Equation (4.8).
Figure 4.8: The expected validation error E[E_val(g⁻)] as a function of the size of the validation set, K; the shaded area is E[E_val] ± σ_val.
The figure clearly shows that there is a price to be paid for setting aside K data points to get this unbiased estimate of E_out: when we set aside more data for validation, there are fewer training data points and so g⁻ becomes worse; E_out(g⁻), and hence the expected validation error, increases (the blue curve). As we expect, the uncertainty in E_val as measured by σ_val (size of the shaded region) is decreasing with K, up to the point where the variance σ²(g⁻) gets really bad. This point comes when the number of training data points becomes critically small, as in Exercise 4.7(e). If K is neither too small nor too large, E_val provides a good estimate of E_out. A rule of thumb in practice is to set K = N/5 (set aside 20% of the data for validation).
We have established two conflicting demands on K. It has to be big enough for E_val to be reliable, and it has to be small enough so that the training set with N − K points is big enough to get a decent g⁻. Inequality (4.9) quantifies the first demand. The second demand is quantified by the learning curve
discussed in Section 2.3.2 (also the blue curve in Figure 4.8, from right to left), which shows how the expected out-of-sample error goes down as the number of training data points goes up. The fact that more training data lead to a better final hypothesis has been extensively verified empirically, although it is challenging to prove theoretically.
Restoring D. Although the learning curve suggests that taking out K data points for validation and using only N − K for training will cost us in terms of E_out, we do not have to pay that price! The purpose of validation is to estimate the out-of-sample performance, and E_val happens to be a good estimate of E_out(g⁻). This does not mean that we have to output g⁻ as our final hypothesis. The primary goal is to get the best possible hypothesis, so we should output g, the hypothesis trained on the entire set D. The secondary goal is to estimate E_out, which is what validation allows us to do. Based on our discussion of learning curves, E_out(g) ≤ E_out(g⁻), so
\[ E_{\rm out}(g) \le E_{\rm out}(g^-) \le E_{\rm val}(g^-) + O\!\left(\frac{1}{\sqrt{K}}\right). \qquad (4.10) \]
Figure 4.9: Using a validation set to estimate E_out.
The first inequality is subdued because it was not rigorously proved. If we first train with N − K data points, validate with the remaining K data points and then retrain using all the data to get g, the validation error we got will likely still be better at estimating E_out(g) than the estimate using the VC bound with E_in(g), especially for large hypothesis sets with big d_vc.
So far, we have treated the validation set as a way to estimate E_out, without involving it in any decisions that affect the learning process. Estimating E_out is a useful role by itself; a customer would typically want to know how good the final hypothesis is (in fact, the inequalities in (4.10) suggest that the validation error is a pessimistic estimate of E_out, so your customer is likely to be pleasantly surprised when he tries your system on new data). However, as we will see next, an important role of a validation set is in fact to guide the learning process. That's what distinguishes a validation set from a test set.
4.3.2 Model Selection
By far, the most important use of validation is for model selection. This could mean the choice between a linear model and a nonlinear model, the choice of the order of polynomial in a model, the choice of the value of a regularization
Figure 4.10: Optimistic bias of the validation error when using a validation
set for the model selected.
parameter, or any other choice that affects the learning process. In almost every learning situation, there are some choices to be made and we need a principled way of making these choices.
The leap is to realize that validation can be used to estimate the out-of-sample error for more than one model. Suppose we have M models H_1, ..., H_M. Validation can be used to select one of these models. Use the training set D_train to learn a final hypothesis g⁻_m for each model. Now evaluate each model on the validation set to obtain the validation errors E_1, ..., E_M, where
\[ E_m = E_{\rm val}(g_m^-), \qquad m = 1, \ldots, M. \]
The validation errors estimate the out-of-sample error E_out(g⁻_m) for each H_m.
Exercise 4.8
Is E_m an unbiased estimate for the out-of-sample error E_out(g⁻_m)?
It is now a simple matter to select the model with lowest validation error. Let m* be the index of the model which achieves the minimum validation error. So for H_{m*}, E_{m*} ≤ E_m for m = 1, ..., M. The model H_{m*} is the model selected based on the validation errors. Note that E_{m*} is no longer an unbiased estimate of E_out(g⁻_{m*}). Since we selected the model with minimum validation error, E_{m*} will have an optimistic bias. This optimistic bias when selecting between H_2 and H_5 is illustrated in Figure 4.10, using the experimental design described in Exercise 4.2 with Q_f = 3, σ² = 0.4 and N = 35.
Exercise 4.9
Referring to Figure 4.10, why are both curves increasing with K? Why do they converge to each other with increasing K?
How good is the generalization error for this entire process of model selection using validation? Consider a new model H_val consisting of the final hypotheses learned from the training data using each model H_1, ..., H_M:
\[ \mathcal{H}_{\rm val} = \{ g_1^-, g_2^-, \ldots, g_M^- \}. \]
Model selection using the validation set chose one of the hypotheses in H_val based on its performance on D_val. Since the model H_val was obtained before ever looking at the data in the validation set, this process is entirely equivalent to learning a hypothesis from H_val using the data in D_val. The validation errors E_val(g⁻_m) are 'in-sample' errors for this learning process and so we may apply the VC bound for finite hypothesis sets, with |H_val| = M:
\[ E_{\rm out}(g_{m^*}^-) \le E_{\rm val}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.11) \]
What if we didn't use a validation set to choose the model? One alternative would be to use the in-sample errors from each model as the model selection criterion. Specifically, pick the model which gives a final hypothesis with minimum in-sample error. This is equivalent to picking the hypothesis with minimum in-sample error from the grand model which contains all the hypotheses in each of the M original models. If we want a bound on the out-of-sample error for the final hypothesis that results from this selection, we need to apply the VC penalty for this grand hypothesis set which is the union of the M hypothesis sets (see Problem 2.14). Since this grand hypothesis set can have a huge VC dimension, the bound in (4.11) will generally be tighter.
The goal of model selection is to select the best model and output the best hypothesis from that model. Specifically, we want to select the model m for which E_out(g_m) will be minimum when we retrain with all the data. Model selection using a validation set relies on the leap of faith that if E_out(g⁻_m) is minimum, then E_out(g_m) is also minimum. The validation errors E_m estimate E_out(g⁻_m), so, modulo our leap of faith, the validation set should pick the right model. No matter which model m* is selected, however, based on the discussion of learning curves in the previous section, we should not output g⁻_{m*} as the final hypothesis. Rather, once m* is selected using validation, learn using all the data and output g_{m*}, which satisfies
Figure 4.11: Using a validation set for model selection.
\[ E_{\rm out}(g_{m^*}) \le E_{\rm out}(g_{m^*}^-) \le E_{\rm val}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \qquad (4.12) \]
Again, the first inequality is subdued because we didn't prove it.
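A minimal sketch of the whole pipeline in Python (illustrative only: the candidate models H_2 and H_5 follow the comparison of Figure 4.12, but the target, noise level and split size are assumptions): train each model on D_train, pick m* by the validation error, then retrain the winner on all of D.

import numpy as np

rng = np.random.default_rng(3)
N, K = 35, 10                                    # data size and validation size (assumed)
x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + 0.4 * rng.standard_normal(N)   # assumed noisy target

idx = rng.permutation(N)
train, val = idx[K:], idx[:K]                    # D_train has N - K points, D_val has K

def sq_err(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

models = [2, 5]                                  # polynomial orders: H_2 and H_5
E_val = []
for deg in models:
    g_minus = np.polyfit(x[train], y[train], deg)      # g_m^- learned on D_train only
    E_val.append(sq_err(g_minus, x[val], y[val]))      # E_m = E_val(g_m^-)

m_star = int(np.argmin(E_val))                   # model with the smallest validation error
g_final = np.polyfit(x, y, models[m_star])       # retrain on all the data; output g_{m*}
print("selected order:", models[m_star], "validation errors:", np.round(E_val, 3))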
Figure 4.12: Model selection between H_2 and H_5 using a validation set. The solid black line uses E_in for model selection, which always selects H_5. The dotted line shows the optimal model selection, if we could select the model based on the true out-of-sample error. This is unachievable, but a useful benchmark. The best performer is clearly the validation set, outputting g_{m*}. For suitable K, even g⁻_{m*} is better than in-sample selection.
Continuing our experiment from Figure 4.10, we evaluate the out-of-sample performance when using a validation set to select between the models H_2 and H_5. The results are shown in Figure 4.12. Validation is a clear winner over using E_in for model selection.
Exercise 4.10
(a) From Figure 4.12, E[E_out(g_{m*})] is initially decreasing. How can this be, if E[E_out(g⁻_m)] is increasing in K for each m?
(b) From Figure 4.12 we see that E[E_out(g_{m*})] is initially decreasing, and then it starts to increase. What are the possible reasons for this?
(c) When K = 1, E[E_out(g⁻_{m*})] < E[E_out(g_{m*})]. How can this be, if the learning curves for both models are decreasing?
Example 4.3. We can use a validation set to select the value of the regularization parameter in the augmented error of (4.6). Although the most important part of a model is the hypothesis set, every hypothesis set has an associated learning algorithm which selects the final hypothesis g. Two models may be different only in their learning algorithm, while working with the same hypothesis set. Changing the value of λ in the augmented error changes the learning algorithm (the criterion by which g is selected) and effectively changes the model.
Based on this discussion, consider the M different models corresponding to the same hypothesis set H but with M different choices for λ in the augmented error. So, we have (H, λ_1), (H, λ_2), ..., (H, λ_M) as our M different models. We
may, for example, choose λ_1 = 0, λ_2 = 0.01, λ_3 = 0.02, ..., λ_M = 10. Using a validation set to choose one of these M models amounts to determining the value of λ to within a resolution of 0.01. □
We have analyzed validation for model selection based on a finite number of models. If validation is used to choose the value of a parameter, for example λ as in the previous example, then the value of M will depend on the resolution to which we determine that parameter. In the limit, the selection is actually among an infinite number of models since the value of λ can be any real number. What happens to bounds like (4.11) and (4.12) which depend on M? Just as the Hoeffding bound for a finite hypothesis set did not collapse when we moved to infinite hypothesis sets with finite VC dimension, bounds like (4.11) and (4.12) will not completely collapse either. We can derive VC-type bounds here too, because even though there are an infinite number of models, these models are all very similar; they differ only slightly in the value of λ. As a rule of thumb, what matters is the number of parameters we are trying to set. If we have only one or a few parameters, the estimates based on a decent sized validation set would be reliable. The more choices we make based on the same validation set, the more 'contaminated' the validation set becomes and the less reliable its estimates will be. The more we use the validation set to fine tune the model, the more the validation set becomes like a training set used to 'learn the right model'; and we all know how limited a training set is in its ability to estimate E_out.
You will be hard pressed to find a serious learning problem in which validation is not used. Validation is a conceptually simple technique, easy to apply in almost any setting, and requires no specific knowledge about the details of a model. The main drawback is the reduced size of the training set, but that can be significantly mitigated through a modified version of validation which we discuss next.
4.3.3 Cross Validation
Validation relies on the following chain of reasoning,
\[ E_{\rm out}(g) \ \underset{\text{(small }K\text{)}}{\approx}\ E_{\rm out}(g^-) \ \underset{\text{(large }K\text{)}}{\approx}\ E_{\rm val}(g^-), \]
which highlights the dilemma we face in trying to select K. We are going to output g. When K is large, there is a discrepancy between the two out-of-sample errors E_out(g⁻) (which E_val directly estimates) and E_out(g) (which is the final error when we learn using all the data D). We would like to choose K as small as possible in order to minimize the discrepancy between E_out(g⁻) and E_out(g); ideally K = 1. However, if we make this choice, we lose the reliability of the validation estimate as the bound on the RHS of (4.9) becomes huge. The validation error E_val(g⁻) will still be an unbiased estimate of E_out(g⁻)
(g⁻ is trained on N − 1 points), but it will be so unreliable as to be useless since it is based on only one data point. This brings us to the cross validation estimate of out-of-sample error. We will focus on the leave-one-out version which corresponds to a validation set of size K = 1, and is also the easiest case to illustrate. More popular versions typically use larger K, but the essence of the method is the same.
There are N ways to partition the data into a training set of size N − 1 and a validation set of size 1. Specifically, let
\[ \mathcal{D}_n = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_{n-1}, y_{n-1}), (\mathbf{x}_{n+1}, y_{n+1}), \ldots, (\mathbf{x}_N, y_N) \]
be the data set D after leaving out data point (x_n, y_n). Denote the final hypothesis learned from D_n by g_n⁻. Let e_n be the error made by g_n⁻ on its validation set, which is just the single data point {(x_n, y_n)}:
\[ e_n = E_{\rm val}(g_n^-) = e\!\left(g_n^-(\mathbf{x}_n), y_n\right). \]
The cross validation estimate is the average value of the e_n's,
\[ E_{\rm cv} = \frac{1}{N}\sum_{n=1}^{N} e_n. \]
Figure 4.13: Illustration of leave-one-out cross validation for a linear fit using three data points. The average of the three red errors obtained by the linear fits leaving out one data point at a time is E_cv.
Figure 4.13 illustrates cross validation on a simple example. Each e_n is a wild, yet unbiased estimate for the corresponding E_out(g_n⁻), which follows after setting K = 1 in (4.8). With cross validation, we have N functions g_1⁻, ..., g_N⁻ together with the N error estimates e_1, ..., e_N. The hope is that these N errors together would be almost equivalent to estimating E_out on a reliable validation set of size N, while at the same time we managed to use N − 1 points to obtain each g_n⁻. Let's try to understand why E_cv is a good estimator of E_out.
First and foremost, E_cv is an unbiased estimator of 'E_out(g⁻)'. We have to be a little careful here because we don't have a single hypothesis g⁻, as we did when using a single validation set. Depending on the (x_n, y_n) that was taken out, each g_n⁻ can be a different hypothesis. To understand the sense in which E_cv estimates E_out, we need to revisit the concept of the learning curve.
Ideally, we would like to know E_out(g). The final hypothesis g is the result of learning on a random data set D of size N. It is almost as useful to know the expected performance of your model when you learn on a data set of size N; the hypothesis g is just one such instance of learning on a data set of size N. This expected performance averaged over data sets of size N, when viewed as a function of N, is exactly the learning curve shown in Figure 4.2. More formally, for a given model, let
\[ E_{\rm out}(N) = \mathbb{E}_{\mathcal{D}}\!\left[E_{\rm out}(g)\right] \]
be the expectation (over data sets D of size N) of the out-of-sample error produced by the model. The expected value of E_cv is exactly E_out(N − 1). This is true because it is true for each individual validation error e_n:
\[ \mathbb{E}_{\mathcal{D}}[e_n] = \mathbb{E}_{\mathcal{D}_n}\,\mathbb{E}_{(\mathbf{x}_n, y_n)}\!\left[e\!\left(g_n^-(\mathbf{x}_n), y_n\right)\right] = \mathbb{E}_{\mathcal{D}_n}\!\left[E_{\rm out}(g_n^-)\right] = E_{\rm out}(N-1). \]
Since this equality holds for each e_n, it also holds for the average. We highlight this result by making it a theorem.
Theorem 4.4. E_cv is an unbiased estimate of E_out(N − 1) (the expectation of the model performance, E[E_out], over data sets of size N − 1).
Now that we have our cross validation estimate of E_out, there is no need to output any of the g_n⁻ as our final hypothesis. We might as well squeeze every last drop of performance and retrain using the entire data set D, outputting g as the final hypothesis and getting the benefit of going from N − 1 to N on the learning curve. In this case, the cross validation estimate will on average be an upper estimate for the out-of-sample error: E_out(g) ≲ E_cv, so expect to be pleasantly surprised, albeit slightly.
Figure 4.14: Using cross validation to estimate E_out.
With just simple validation and a validation set of size K = 1, we know that the validation estimate will not be reliable. How reliable is the cross validation estimate E_cv? We can measure the reliability using the variance of E_cv.
Unfortunately, while we were able to pin down the expectation of E_cv, the variance is not so easy.
If the N cross validation errors e_1, ..., e_N were equivalent to N errors on a totally separate validation set of size N, then E_cv would indeed be a reliable estimate, for decent-sized N. The equivalence would hold if the individual e_n's were independent of each other. Of course, this is too optimistic. Consider two validation errors e_n, e_m. The validation error e_n depends on g_n⁻, which was trained on data containing (x_m, y_m). Thus, e_n has a dependency on (x_m, y_m). The validation error e_m is computed using (x_m, y_m) directly, and so it also has a dependency on (x_m, y_m). Consequently, there is a possible correlation between e_n and e_m through the data point (x_m, y_m). That correlation wouldn't be there if we were validating a single hypothesis using N fresh (independent) data points.
How much worse is the cross validation estimate as compared to an estimate based on a truly independent set of N validation errors? A VC-type probabilistic bound, or even computation of the asymptotic variance of the cross validation estimate (Problem 4.23), is challenging. One way to quantify the reliability of E_cv is to compute how many fresh validation data points would have a comparable reliability to E_cv, and Problem 4.24 discusses one way to do this. There are two extremes for this effective size. On the high end is N, which means that the cross validation errors are essentially independent. On the low end is 1, which means that E_cv is only as good as any single one of the individual cross validation errors e_n, i.e., the cross validation errors are totally dependent. While one cannot prove anything theoretically, in practice the reliability of E_cv is much closer to the higher end.
[Figure: the effective number of fresh examples giving a comparable estimate of E_out, on a scale from 1 to N.]
Cross validation for model selection. In Figure 4.11, the estimates E_m for the out-of-sample error of model H_m were obtained using the validation set. Instead, we may use cross validation estimates to obtain E_m: use cross validation to obtain estimates of the out-of-sample error for each model H_1, ..., H_M, and select the model with the smallest cross validation error. Now, train this model selected by cross validation using all the data to output a final hypothesis, making the usual leap of faith that E_out(g⁻) tracks E_out(g) well.
Example 4.5. In Figure 4.13, we illustrated cross validation for estimating E_out of a linear model (h(x) = ax + b) using a simple experiment with three data points generated from a constant target function with noise. We now consider a second model, the constant model (h(x) = b). We can also use cross validation to estimate E_out for the constant model, illustrated in Figure 4.15.
Figure 4.15: Leave-one-out cross validation error for a constant fit.
If we use the in-sample error after fitting all the data (three points), then the linear model wins because it can use its additional degree of freedom to fit the data better. The same is true with the cross validation data sets of size two: the linear model has perfect in-sample error. But, with cross validation, what matters is the error on the outstanding point in each of these fits. Even to the naked eye, the average of the cross validation errors is smaller for the constant model, which obtained E_cv = 0.065 versus E_cv = 0.184 for the linear model. The constant model wins, according to cross validation. The constant model also has lower E_out and so cross validation selected the correct model in this example. □
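A tiny sketch of this comparison (the three data points below are made up for illustration; they are not the points of Figures 4.13 and 4.15): leave out each point in turn, fit the constant and the linear model to the remaining two, and average the squared errors on the held-out point.

import numpy as np

def loo_cv_error(x, y, degree):
    # Leave-one-out cross validation error for a polynomial fit of the given degree.
    errors = []
    for n in range(len(x)):
        mask = np.arange(len(x)) != n                 # D_n: all points except (x_n, y_n)
        coeffs = np.polyfit(x[mask], y[mask], degree)
        errors.append((np.polyval(coeffs, x[n]) - y[n]) ** 2)   # e_n
    return np.mean(errors)                            # E_cv

# Three illustrative points from a noisy constant target (values are assumptions).
x = np.array([-0.8, 0.1, 0.9])
y = np.array([0.2, -0.1, 0.3])

print("constant model E_cv:", round(loo_cv_error(x, y, 0), 3))
print("linear   model E_cv:", round(loo_cv_error(x, y, 1), 3))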
One important use of validation is to estimate the optimal regularization parameter λ, as described in Example 4.3. We can use cross validation for the same purpose, as summarized in the algorithm below.
Cross validation for selecting λ:
1: Define M models by choosing different values for λ in the augmented error: (H, λ_1), (H, λ_2), ..., (H, λ_M).
2: for each model m = 1, ..., M do
3:    Use the cross validation module in Figure 4.14 to estimate E_cv(m), the cross validation error for model m.
4: Select the model m* with minimum E_cv(m*).
5: Use model (H, λ_{m*}) and all the data D to obtain the final hypothesis g_{m*}. Effectively, you have estimated the optimal λ.
We see from Figure 4.14 that estimating E_cv for just a single model requires N rounds of learning on D_1, ..., D_N, each of size N − 1. So the cross validation algorithm above requires MN rounds of learning. This is a formidable task. If we could analytically obtain E_cv, that would be a big bonus, but analytic results are often difficult to come by for cross validation. One exception is in the case of linear models, where we are able to derive an exact analytic formula for the cross validation estimate.
Analytic computation of E_cv for linear models. Recall that for linear regression with weight decay, w_reg = (Z^T Z + λI)^{-1} Z^T y, and the in-sample predictions are
\[ \hat{\mathbf{y}} = \mathrm{H}(\lambda)\mathbf{y}, \]
where H(λ) = Z(Z^T Z + λI)^{-1} Z^T. Given H, ŷ, and y, it turns out that we can analytically compute the cross validation estimate as:
\[ E_{\rm cv} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - \mathrm{H}_{nn}(\lambda)}\right)^2. \qquad (4.13) \]
Notice that the cross validation estimate is very similar to the in-sample error, E_in = (1/N) Σ_n (ŷ_n − y_n)², differing only by a normalization of each term in the sum by a factor 1/(1 − H_nn(λ))². One use for this analytic formula is that it can be directly optimized to obtain the best regularization parameter λ. A proof of this remarkable formula is given in Problem 4.26.
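The formula lends itself to a one-line computation once H(λ) is available. The sketch below (illustrative, with a random toy data set) evaluates Equation (4.13) and checks it against the brute-force leave-one-out value; the two agree up to numerical precision.

import numpy as np

def analytic_loocv(Z, y, lam):
    # E_cv = (1/N) * sum_n ((yhat_n - y_n) / (1 - H_nn(lambda)))^2   -- Equation (4.13)
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)
    y_hat = H @ y
    return np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)

def brute_force_loocv(Z, y, lam):
    N, d = Z.shape
    errs = []
    for n in range(N):
        mask = np.arange(N) != n
        w = np.linalg.solve(Z[mask].T @ Z[mask] + lam * np.eye(d), Z[mask].T @ y[mask])
        errs.append((Z[n] @ w - y[n]) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(5)
Z = rng.standard_normal((25, 4))     # toy transformed data matrix (assumed)
y = rng.standard_normal(25)
print(analytic_loocv(Z, y, 0.5), brute_force_loocv(Z, y, 0.5))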
Even when we cannot derive such an analytic characterization of cross validation, the technique widely results in good out-of-sample error estimates in practice, and so the computational burden is often worth enduring. Also, as with using a validation set, cross validation applies in almost any setting without requiring specific knowledge about the details of the models.
So far, we have lived in a world of unlimited computation, and all that mattered was out-of-sample error; in reality, computation time can be of consequence, especially with huge data sets. For this reason, leave-one-out cross validation may not be the method of choice.⁴ A popular derivative of leave-one-out cross validation is V-fold cross validation.⁵ In V-fold cross validation, the data are partitioned into V disjoint sets (or folds) D_1, ..., D_V, each of size approximately N/V; each set D_v in this partition serves as a validation set to compute a validation error for a hypothesis g⁻ learned on a training set which is the complement of the validation set, D \ D_v. So, you always validate a hypothesis on data that was not used for training that particular hypothesis. The V-fold cross validation error is the average of the V validation errors that are obtained, one from each validation set D_v. Leave-one-out cross validation is the same as N-fold cross validation. The gain from choosing V ≪ N is computational. The drawback is that you will be estimating E_out for a hypothesis g⁻ trained on less data (as compared with leave-one-out) and so the discrepancy between E_out(g) and E_out(g⁻) will be larger. A common choice in practice is 10-fold cross validation, and one of the folds is illustrated below.
[Figure: the data D partitioned into ten folds D_1, ..., D_10; one fold serves as the validation set while the remaining nine are used for training.]
"‘S ta b ility p ro b lo in s have a lso b een re p o rte d in leave-o n e-o u t.
■'Some a u th o rs call it b '-l'o ld c ro ss v a lid a tio n , b u t we choose \ ' so as n o t to co nfuse w ith
th e size of tlie v a lid a tio n se t K .
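A minimal V-fold sketch (illustrative; the fold count, data, and the polynomial model are assumptions): permute the indices, split them into V roughly equal folds, validate on each fold a hypothesis trained on the remaining folds, and average the V validation errors.

import numpy as np

def v_fold_cv_error(x, y, degree, V=10, seed=0):
    # V-fold cross validation error for a polynomial fit of the given degree.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), V)    # V disjoint validation sets D_v
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), val_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)    # g^- trained on D \ D_v
        errors.append(np.mean((np.polyval(coeffs, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)                                  # average of the V validation errors

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 100)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(100)      # assumed noisy target
for deg in (1, 3, 10):
    print(deg, round(v_fold_cv_error(x, y, deg), 4))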
4.3.4 Theory Versus Practice
Both validation and cross validation present challenges for the mathematical theory of learning, similar to the challenges presented by regularization. The theory of generalization, in particular the VC analysis, forms the foundation for learnability. It provides us with guidelines under which it is possible to make a generalization conclusion with high probability. It is not straightforward, and sometimes not possible, to rigorously carry these conclusions over to the analysis of validation, cross validation, or regularization. What is possible, and indeed quite effective, is to use the theory as a guideline. In the case of regularization, constraining the choice of a hypothesis leads to better generalization, as we would intuitively expect, even if the hypothesis set remains technically the same. In the case of validation, making a choice for few parameters does not overly contaminate the validation estimate of E_out, even if the VC guarantee for these estimates is too weak. In the case of cross validation, the benefit of averaging several validation errors is observed, even if the estimates are not independent.
Although these techniques were based on sound theoretical foundation, they are to be considered heuristics because they do not have a full mathematical justification in the general case. Learning from data is an empirical task with theoretical underpinnings. We prove what we can prove, but we use the theory as a guideline when we don't have a conclusive proof. In a practical application, heuristics may win over a rigorous approach that makes unrealistic assumptions. The only way to be convinced about what works and what doesn't in a given situation is to try out the techniques and see for yourself. The basic message in this chapter can be summarized as follows.
1. Noise (stochastic or deterministic) affects learning adversely, leading to overfitting.

2. Regularization helps to prevent overfitting by constraining the model, reducing the impact of the noise, while still giving us flexibility to fit the data.

3. Validation and cross validation are useful techniques for estimating E_out. One important use of validation is model selection, in particular to estimate the amount of regularization to use.
Example 4.6. We illustrate validation on the handwritten digit classification task of deciding whether a digit is 1 or not (see also Example 3.1) based on the two features which measure the symmetry and average intensity of the digit. The data is shown in Figure 4.16(a).
Figure 4.16: (a) The digits data, of which 500 are selected as the training set. (b) The data are transformed via the 5th order polynomial transform to a 20-dimensional feature vector. We show the performance curves as we vary the number of these features used for classification.
We have randomly selected 500 data points as the training data and the remaining are used as a test set for evaluation. We considered a nonlinear feature transform to a 5th order polynomial feature space:

(1, x_1, x_2) -> (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, ..., x_1^5, x_1^4 x_2, x_1^3 x_2^2, x_1^2 x_2^3, x_1 x_2^4, x_2^5).
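For readers who want to reproduce this transform, here is one way to generate the 5th order polynomial features in Python; the helper name and the ordering of the monomials are assumptions made for illustration (for two inputs and degree 5 it produces the 20 non-constant monomials).

    import numpy as np
    from itertools import combinations_with_replacement

    def poly_transform(X, degree=5):
        """Map each row (x1, x2) of X to all monomials x1^i * x2^j with
        1 <= i + j <= degree; for two inputs and degree 5 this gives the
        20 non-constant features (the constant is carried by the bias weight)."""
        features = []
        for k in range(1, degree + 1):
            for combo in combinations_with_replacement(range(X.shape[1]), k):
                features.append(np.prod(X[:, list(combo)], axis=1))
        return np.column_stack(features)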
Figure 4.16(b) shows the in-sample error as you use more of the transformed features, increasing the dimension from 1 to 20. As you add more dimensions (increase the complexity of the model), the in-sample error drops, as expected. The out-of-sample error drops at first, and then starts to increase, as we hit the approximation-generalization tradeoff. The leave-one-out cross validation error tracks the behavior of the out-of-sample error quite well. If we were to pick a model based on the in-sample error, we would use all 20 dimensions. The cross validation error is minimized between 5-7 feature dimensions; we take 6 feature dimensions as the model selected by cross validation. The table below summarizes the resulting performance metrics:
                      E_in     E_out
  No Validation        0%      2.5%
  Cross Validation     0.8%    1.5%
Cross validation results in a performance improvement of about 1%, which is a massive relative improvement (40% reduction in error rate).
Exercise 4.11

In this particular experiment, the black curve (E_cv) is sometimes below and sometimes above the red curve (E_out). If we repeated this experiment many times, and plotted the average black and red curves, would you expect the black curve to lie above or below the red curve?
It is illuminating to see the actual classification boundaries learned with and without validation. These resulting classifiers, together with the 500 in-sample data points, are shown in the next figure.
[Figure: the learned decision boundaries. Left: 20-dim classifier (no validation), E_in = 0%, E_out = 2.5%. Right: 6-dim classifier (LOO-CV), E_in = 0.8%, E_out = 1.5%.]
It is clear that the worse out-of-sample performance of the classifier picked without validation is due to the overfitting of a few noisy points in the training data. While the training data is perfectly separated, the shape of the resulting boundary seems highly contorted, which is a symptom of overfitting. Does this remind you of the first example that opened the chapter? There, albeit in a toy example, we similarly obtained a highly contorted fit. As you can see, overfitting is real, and here to stay! □
4.4 Problems
Problem 4.1 Plot the monomials of order i, φ_i(x) = x^i. As you increase the order, does this correspond to the intuitive notion of increasing complexity?
Problem 4.2 Consider the feature transform z = [L_0(x), L_1(x), L_2(x)]^T and the linear model h(x) = w^T z. For the hypothesis with w = [1, -1, 1]^T, what is h(x) explicitly as a function of x? What is its degree?
Problem 4.3 The Legendre Polynomials are a family of orthogonal polynomials which are useful for regression. The first two Legendre Polynomials are L_0(x) = 1, L_1(x) = x. The higher order Legendre Polynomials are defined by the recursion:

L_k(x) = ((2k-1)/k) x L_{k-1}(x) - ((k-1)/k) L_{k-2}(x).

(a) What are the first six Legendre Polynomials? Use the recursion to develop an efficient algorithm to compute L_0(x), ..., L_K(x) given x. Your algorithm should run in time linear in K. Plot the first six Legendre polynomials.

(b) Show that L_k(x) is a linear combination of monomials x^k, x^{k-2}, ... (either all odd or all even order, with highest order k). Thus,

L_k(-x) = (-1)^k L_k(x).

(c) Show that ((x^2 - 1)/k) dL_k(x)/dx = x L_k(x) - L_{k-1}(x). [Hint: use induction.]

(d) Use part (c) to show that L_k satisfies Legendre's differential equation

d/dx ( (x^2 - 1) dL_k(x)/dx ) = k(k+1) L_k(x).

This means that the Legendre Polynomials are eigenfunctions of a Hermitian linear differential operator and, from Sturm-Liouville theory, they form an orthogonal basis for continuous functions on [-1, 1].

(e) Use the recurrence to show directly the orthogonality property:

∫_{-1}^{1} dx L_k(x) L_ℓ(x) = 0 if ℓ ≠ k, and 2/(2k+1) if ℓ = k.

[Hint: use induction on k, with ℓ ≤ k. Use the recurrence for L_k and consider separately the four cases ℓ = k, k-1, k-2 and ℓ < k-2. For the case ℓ = k you will need to compute the integral ∫_{-1}^{1} dx x^2 L_{k-1}^2(x). In order to do this, you could use the differential equation in part (c), multiply by x L_k and then integrate both sides (the LHS can be integrated by parts). Now solve the resulting equation for ∫_{-1}^{1} dx x^2 L_{k-1}^2(x).]
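As a companion to part (a), here is a minimal sketch of the linear-time recursion in Python; the function name is an illustrative choice.

    import numpy as np

    def legendre_upto(K, x):
        """Return [L_0(x), ..., L_K(x)] for an array x, using the three-term
        recursion L_k = ((2k-1)/k) x L_{k-1} - ((k-1)/k) L_{k-2}; time linear in K."""
        L = [np.ones_like(x), np.asarray(x, dtype=float)]
        for k in range(2, K + 1):
            L.append(((2 * k - 1) * x * L[k - 1] - (k - 1) * L[k - 2]) / k)
        return L[:K + 1]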
Problem 4.4 This problem is a detailed version of Exercise 4.2. We set up an experimental framework which the reader may use to study various aspects of overfitting. The input space is X = [-1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H_2 and H_10. The target function is a polynomial of degree Q_f, which we write as f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where L_q(x) are the Legendre polynomials. We use the Legendre polynomials because they are a convenient orthogonal basis for the polynomials on [-1, 1] (see Section 4.2 and Problem 4.3 for some basic information on Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N), where y_n = f(x_n) + σ ε_n and ε_n are iid standard Normal random variates.

For a single experiment, with specified values for Q_f, N, σ, generate a random degree-Q_f target function by selecting coefficients a_q independently from a standard Normal, rescaling them so that E_{a,x}[f^2] = 1. Generate a data set, selecting x_1, ..., x_N independently from P(x) and y_n = f(x_n) + σ ε_n. Let g_2 and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively, with respective out-of-sample errors E_out(g_2) and E_out(g_10).
(a) Why do we normalize f? [Hint: how would you interpret σ?]

(b) How can we obtain g_2, g_10? [Hint: pose the problem as linear regression and use the technology from Chapter 3.]

(c) How can we compute E_out analytically for a given g_10?

(d) Vary Q_f, N, σ and for each combination of parameters, run a large number of experiments, each time computing E_out(g_2) and E_out(g_10). Averaging these out-of-sample errors gives estimates of the expected out-of-sample error for the given learning scenario (Q_f, N, σ) using H_2 and H_10. Let

E_out(H_2) = average over experiments( E_out(g_2) ),
E_out(H_10) = average over experiments( E_out(g_10) ).

Define the overfit measure E_out(H_10) - E_out(H_2). When is the overfit measure significantly positive (i.e., overfitting is serious) as opposed to significantly negative? Try the choices Q_f ∈ {1, 2, ..., 100}, N ∈ {20, 25, ..., 120}, σ^2 ∈ {0, 0.05, 0.1, ..., 2}.

Explain your observations.

(e) Why do we take the average over many experiments? Use the variance to select an acceptable number of experiments to average over.
(f) Repeat this experiment for classification, where the target function is a noisy perceptron, f = sign( Σ_{q=1}^{Q_f} a_q L_q(x) + ε ). Notice that a_0 = 0, and the a_q's should be normalized so that E_{a,x}[ ( Σ_{q=1}^{Q_f} a_q L_q(x) )^2 ] = 1. For classification, the models H_2, H_10 contain the sign of the 2nd and 10th order polynomials respectively. You may use a learning algorithm for non-separable data from Chapter 3.
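A minimal sketch of the data-generating part of this experiment is given below, assuming NumPy's Legendre utilities; normalizing each random target so that E_x[f^2] = 1 (using the orthogonality property of Problem 4.3(e)) and fitting H_Q by least squares on monomial features are illustrative choices.

    import numpy as np
    from numpy.polynomial import legendre

    def generate_target_and_data(Qf, N, sigma, rng=None):
        """One instance of the overfitting experiment: a random degree-Qf
        Legendre target, normalized so that E_x[f^2] = 1, and a noisy data set."""
        rng = rng or np.random.default_rng()
        a = rng.standard_normal(Qf + 1)
        # With x uniform on [-1,1], E_x[L_q L_p] = delta_qp / (2q+1) (Problem 4.3(e)),
        # so E_x[f^2] = sum_q a_q^2 / (2q+1); rescale a accordingly.
        a /= np.sqrt(np.sum(a ** 2 / (2 * np.arange(Qf + 1) + 1)))
        x = rng.uniform(-1, 1, N)
        y = legendre.legval(x, a) + sigma * rng.standard_normal(N)
        return a, x, y

    def fit_HQ(x, y, Q):
        """Best least-squares fit from H_Q: linear regression on 1, x, ..., x^Q."""
        Z = np.vander(x, Q + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return w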
Problem 4.5 If λ < 0 in the augmented error E_aug(w) = E_in(w) + λ w^T w, what soft order constraint does this correspond to? [Hint: λ < 0 encourages large weights.]
Problem 4.6 In the augmented error minimization with Γ = I and λ > 0:

(a) Show that ||w_reg|| ≤ ||w_lin||, justifying the term weight decay. [Hint: start by assuming that ||w_reg|| > ||w_lin|| and derive a contradiction.] In fact a stronger statement holds: ||w_reg|| is decreasing in λ.

(b) Explicitly verify this for linear models. [Hint:

w_reg^T w_reg = u^T (Z^T Z + λI)^{-2} u,

where u = Z^T y and Z is the transformed data matrix. Show that Z^T Z + λI has the same eigenvectors with correspondingly larger eigenvalues as Z^T Z. Expand u in the eigenbasis of Z^T Z. For a matrix A, how are the eigenvectors and eigenvalues of A^{-1} related to those of A?]
Problem 4.7 Show that the in-sample error

E_in(w_reg) = (1/N) y^T (I - H(λ))^2 y

from Example 4.2 is an increasing function of λ, where H(λ) = Z(Z^T Z + λI)^{-1} Z^T and Z is the transformed data matrix.

To do so, let the SVD of Z be Z = U S V^T and let Z^T Z have eigenvalues σ_1^2, ..., σ_d^2. Define the vector a = U^T y. Show that

E_in(w_reg) = E_in(w_lin) + (1/N) Σ_{i=1}^{d} a_i^2 ( 1 - σ_i^2/(σ_i^2 + λ) )^2

and proceed from there.
Problem 4.8 In the augmented error minimization with Γ = I and λ > 0, assume that E_in is differentiable and use gradient descent to minimize E_aug:

w(t+1) ← w(t) - η ∇E_aug(w(t)).

Show that the update rule above is the same as

w(t+1) ← (1 - 2ηλ) w(t) - η ∇E_in(w(t)).

Note: This is the origin of the name 'weight decay': w(t) decays before being updated by the gradient of E_in.
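The claimed update rule is easy to state in code; the sketch below is illustrative only (the function name is ours), and shows where the name 'weight decay' comes from.

    def weight_decay_step(w, grad_Ein, eta, lam):
        """One gradient descent step on E_aug(w) = E_in(w) + lam * w.w:
        the weights first shrink by the factor (1 - 2*eta*lam), then move
        along the negative gradient of E_in."""
        return (1 - 2 * eta * lam) * w - eta * grad_Ein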
Problem 4.9 In Tikhonov regularization, the regularized weights are given by w_reg = (Z^T Z + λ Γ^T Γ)^{-1} Z^T y. The Tikhonov regularizer Γ is a k × (d+1) matrix, each row corresponding to a d+1 dimensional vector. Each row of Z corresponds to a d+1 dimensional vector (the first component is 1). For each row of Γ, construct a virtual example (z_i, 0), for i = 1, ..., k, where z_i is the vector obtained from the ith row of Γ after scaling it by √λ, and the target value is 0. Add these k virtual examples to the data, to construct an augmented data set, and consider non-regularized regression with this augmented data.

(a) Show that, for the augmented data,

Z_aug = [ Z ; √λ Γ ]   (the rows of √λ Γ stacked below Z)   and   y_aug = [ y ; 0 ].

(b) Show that solving the least squares problem with Z_aug and y_aug results in the same regularized weight w_reg, i.e. w_reg = (Z_aug^T Z_aug)^{-1} Z_aug^T y_aug.

This result may be interpreted as follows: an equivalent way to accomplish weight-decay-type regularization with linear models is to create a bunch of virtual examples all of whose target values are zero.
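The equivalence in part (b) can be checked numerically. The sketch below is illustrative (function names and test data are ours), comparing the closed-form Tikhonov solution with ordinary least squares on the augmented data.

    import numpy as np

    def tikhonov_direct(Z, y, Gamma, lam):
        """w_reg = (Z^T Z + lam * Gamma^T Gamma)^{-1} Z^T y."""
        return np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

    def tikhonov_via_virtual_examples(Z, y, Gamma, lam):
        """Plain least squares on the augmented data: stack sqrt(lam)*Gamma
        below Z, with target value 0 for every virtual example."""
        Z_aug = np.vstack([Z, np.sqrt(lam) * Gamma])
        y_aug = np.concatenate([y, np.zeros(Gamma.shape[0])])
        w, *_ = np.linalg.lstsq(Z_aug, y_aug, rcond=None)
        return w

    # Quick numerical check on random data (illustrative only).
    rng = np.random.default_rng(0)
    Z, y, Gamma = rng.standard_normal((50, 4)), rng.standard_normal(50), np.eye(4)
    print(np.allclose(tikhonov_direct(Z, y, Gamma, 2.0),
                      tikhonov_via_virtual_examples(Z, y, Gamma, 2.0)))   # True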
Problem 4.10 In this problem, you will investigate the relationship between the soft order constraint and the augmented error. The regularized weight w_reg is a solution to

min E_in(w) subject to w^T Γ^T Γ w ≤ C.

(a) If w_lin^T Γ^T Γ w_lin ≤ C, then what is w_reg?

(b) If w_lin^T Γ^T Γ w_lin > C, the situation is illustrated below. The constraint is satisfied in the shaded region and the contours of constant E_in are the ellipsoids (why ellipsoids?). What is w_reg^T Γ^T Γ w_reg?

(c) Show that with

λ_C = -(1/(2C)) w_reg^T ∇E_in(w_reg),

w_reg minimizes E_in(w) + λ_C w^T Γ^T Γ w. [Hint: use the previous part to solve for w_reg as an equality constrained optimization problem using the method of Lagrange multipliers.]
(d) Show that the following hold for λ_C:

(i) If w_lin^T Γ^T Γ w_lin ≤ C, then λ_C = 0 (w_lin itself satisfies the constraint).

(ii) If w_lin^T Γ^T Γ w_lin > C, then λ_C > 0 (the penalty term is positive).

(iii) If w_lin^T Γ^T Γ w_lin > C, then λ_C is a strictly decreasing function of C. [Hint: show that dλ_C/dC < 0 for C ∈ [0, w_lin^T Γ^T Γ w_lin].]
Problem 4.11 For the linear model in Exercise 4.2, the target function is a polynomial of degree Q_f; the model is H_Q, with polynomials up to order Q. Assume Q ≥ Q_f. w_lin = (Z^T Z)^{-1} Z^T y, and y = Z w_f + ε, where w_f is the target function and Z is the matrix containing the transformed data.

(a) Show that w_lin = w_f + (Z^T Z)^{-1} Z^T ε. What is the average function ḡ? Show that bias = 0 (recall that bias(x) = (ḡ(x) - f(x))^2).

(b) Show that

var = (σ^2/N) trace( Σ_Φ E_Z[ ((1/N) Z^T Z)^{-1} ] ),

where Σ_Φ = E[Φ(x) Φ^T(x)]. [Hints: var = E[(g^(D) - ḡ)^2]; first take the expectation with respect to ε, then with respect to Φ(x), the test point, and the last remaining expectation will be with respect to Z. You will need the cyclic property of the trace.]

(c) Argue that to first order in 1/N,

var ≈ σ^2 (Q + 1)/N.

[Hint: (1/N) Σ_{n=1}^{N} Φ(x_n) Φ^T(x_n) is the in-sample estimate of Σ_Φ. By the law of large numbers, (1/N) Z^T Z = Σ_Φ + o(1).]

For the well specified linear model, the bias is zero and the variance is increasing as the model gets larger (Q increases), but decreasing in N.
Problem 4.12 Use the setup in Problem 4.11 with Q ≥ Q_f. Consider regression with weight decay using a linear model H_Q in the transformed space with input probability distribution such that E[z z^T] = I. The regularized weights are given by w_reg = (Z^T Z + λI)^{-1} Z^T y, where y = Z w_f + ε.

(a) Show that w_reg = w_f - λ(Z^T Z + λI)^{-1} w_f + (Z^T Z + λI)^{-1} Z^T ε.

(b) Argue that, to first order in 1/N,

var ≈ (σ^2/N) trace( H^2(λ) ),

where H(λ) = Z(Z^T Z + λI)^{-1} Z^T.
If we plot the bias and var, we get a figure that is very similar to Figure 2.3, where the tradeoff was based on fit and complexity rather than bias and var. Here, the bias is increasing in λ (as expected) and in ||w_f||; the variance is decreasing in λ. When λ = 0, trace(H^2(λ)) = Q + 1 and so trace(H^2(λ)) appears to be playing the role of an effective number of parameters.
Problem 4.13 Within the linear regression setting, many attempts have been made to quantify the effective number of parameters in a model. Three possibilities are:

(i) d_eff(λ) = 2 trace(H(λ)) - trace(H^2(λ))
(ii) d_eff(λ) = trace(H(λ))
(iii) d_eff(λ) = trace(H^2(λ))

where H(λ) = Z(Z^T Z + λI)^{-1} Z^T and Z is the transformed data matrix. To obtain d_eff, one must first compute H(λ) as though you are doing regression. One can then heuristically use d_eff in place of d_VC in the VC-bound.

(a) When λ = 0, show that for all three choices, d_eff = d + 1, where d is the dimension in the Z space.

(b) When λ > 0, show that 0 ≤ d_eff ≤ d + 1 and d_eff is decreasing in λ for all three choices. [Hint: Use the singular value decomposition.]
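For experimentation, the three measures are straightforward to compute once H(λ) is formed; the sketch below is an illustrative NumPy helper (the function name is ours).

    import numpy as np

    def effective_dof(Z, lam):
        """Return the three heuristic measures of the effective number of
        parameters from Problem 4.13, computed from H(lam) for weight decay."""
        H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)
        tr_H, tr_H2 = np.trace(H), np.trace(H @ H)
        return 2 * tr_H - tr_H2, tr_H, tr_H2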
Problem 4.14 The observed target values y can be separated into the true target values f and the noise ε, y = f + ε. The components of ε are iid with variance σ^2 and expectation 0. For linear regression with weight decay regularization, by taking the expected value of the in-sample error in (4.2), show that

E_ε[E_in] = (1/N) f^T (I - H(λ))^2 f + (σ^2/N) trace( (I - H(λ))^2 ),

where d_eff = 2 trace(H(λ)) - trace(H^2(λ)), as defined in Problem 4.13(i), H(λ) = Z(Z^T Z + λI)^{-1} Z^T and Z is the transformed data matrix.
(a) If the noise was not overfit, what should the term involving σ^2 be, and why?

(b) Hence, argue that the degree to which the noise has been overfit is σ^2 d_eff / N. Interpret the dependence of this result on the parameters d_eff and N, to justify the use of d_eff as an effective number of parameters.
Problem 4.15 We further investigate d_eff of Problems 4.13 and 4.14. We know that H(λ) = Z(Z^T Z + λ Γ^T Γ)^{-1} Z^T. When Γ is square and invertible, as is usually the case (for example with weight decay, Γ = I), denote Z̃ = Z Γ^{-1}. Let s_0^2, ..., s_d^2 be the eigenvalues of Z̃^T Z̃ (s_i^2 > 0 when Z̃ has full column rank).

(a) For d_eff(λ) = trace(2H(λ) - H^2(λ)), show that

d_eff(λ) = d + 1 - Σ_{i=0}^{d} λ^2/(s_i^2 + λ)^2.

(b) For d_eff(λ) = trace(H(λ)), show that

d_eff(λ) = d + 1 - Σ_{i=0}^{d} λ/(s_i^2 + λ).

(c) For d_eff(λ) = trace(H^2(λ)), show that

d_eff(λ) = Σ_{i=0}^{d} s_i^4/(s_i^2 + λ)^2.

In all cases, for λ > 0, 0 < d_eff(λ) < d + 1, d_eff(0) = d + 1 and d_eff is decreasing in λ. [Hint: use the singular value decomposition Z̃ = U S V^T, where U, V are orthogonal and S is diagonal with entries s_i.]
Problem 4.16 For linear models and the general Tikhonov regularizer Γ with penalty term (λ/N) w^T Γ^T Γ w in the augmented error, show that

w_reg = (Z^T Z + λ Γ^T Γ)^{-1} Z^T y,

where Z is the feature matrix.

(a) Show that the in-sample predictions are

ŷ = H(λ) y,

where H(λ) = Z(Z^T Z + λ Γ^T Γ)^{-1} Z^T.

(b) Simplify this in the case Γ = Z and obtain w_reg in terms of w_lin. This is called uniform weight decay.
Problem 4.17 To model uncertainty in the measurement of the inputs, assume that the true inputs x̂_n are the observed inputs x_n perturbed by some noise ε_n: the true inputs are given by x̂_n = x_n + ε_n. Assume that the ε_n are independent of (x_n, y_n) with covariance matrix E[ε_n ε_n^T] = σ_x^2 I and mean
E[ε_n] = 0. The learning algorithm minimizes the expected in-sample error Ê_in, where the expectation is with respect to the uncertainty in the true x̂_n:

Ê_in(w) = E_{ε_1,...,ε_N}[ (1/N) Σ_{n=1}^{N} (w^T x̂_n - y_n)^2 ].

Show that the weights ŵ_lin which result from minimizing Ê_in are equivalent to the weights which would have been obtained by minimizing E_in = (1/N) Σ_{n=1}^{N} (w^T x_n - y_n)^2 for the observed data, with Tikhonov regularization. What are Γ and λ (see Problem 4.16 for the general Tikhonov regularizer)?

One can interpret this result as follows: regularization enforces a robustness to potential measurement errors (noise) in the observed inputs.
Problem 4.18 In a regression setting, assume the target function is linear, so f(x) = w_f^T x, and y = X w_f + ε, where the entries in ε are iid with zero mean and variance σ^2. Assume a regularization term (λ/N) w^T w and that E[x x^T] = I. In this problem derive the optimal value for λ as follows.

(a) Show that the average function is ḡ(x) = f(x)/(1 + λ/N). What is the bias?

(b) Show that var is asymptotically (σ^2 (d+1)/N) (1 + λ/N)^{-2}. [Hint: Problem 4.12.]

(c) Use the bias and asymptotic variance to obtain an expression for E[E_out]. Optimize this with respect to λ to obtain the optimal regularization parameter. [Answer: λ* = σ^2 (d+1)/||w_f||^2.]

(d) Explain the dependence of the optimal regularization parameter on the parameters of the learning problem. [Hint: write λ* = σ^2 (d+1)/||w_f||^2.]
Problem 4.19 [The Lasso algorithm] Rather than a soft order constraint on the squares of the weights, one could use the absolute values of the weights:

min E_in(w) subject to Σ_{i=0}^{d} |w_i| ≤ C.

The model is called the lasso algorithm.

(a) Formulate and implement this as a quadratic program. Use the experimental design in Problem 4.4 to compare the lasso algorithm with the quadratic penalty by giving plots of E_out versus regularization parameter.

(b) What is the augmented error? Is it more convenient to optimize?

(c) With d = 5 and λ = 3, compare the weights from the lasso versus the quadratic penalty. [Hint: Look at the number of non-zero weights.]
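For part (c), the qualitative difference between the two penalties is easy to see numerically. The sketch below uses scikit-learn's Lasso and Ridge estimators as stand-ins for the L1- and L2-penalized fits (their exact objective scaling differs from the augmented error in the text, and the synthetic data and parameter values are illustrative only).

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    # Sparsity of an L1 penalty versus an L2 penalty on a small synthetic
    # problem with more parameters than data points (illustrative values).
    rng = np.random.default_rng(1)
    X, y = rng.standard_normal((3, 5)), rng.standard_normal(3)
    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=0.1).fit(X, y)
    print("lasso non-zero weights:", np.sum(np.abs(lasso.coef_) > 1e-8))
    print("ridge non-zero weights:", np.sum(np.abs(ridge.coef_) > 1e-8))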
Problem 4.20 In this problem, you will explore a consistency condition for weight decay. Suppose that we make an invertible linear transform of the data:

z_n = A x_n,    ỹ_n = α y_n.

Intuitively, linear regression should not be affected by a linear transform. This means that the new optimal weights should be given by a corresponding linear transform of the old optimal weights.

(a) Suppose w minimizes the in-sample error for the original problem. Show that for the transformed problem, the optimal weights are

w̃ = α (A^T)^{-1} w.

(b) Suppose the regularization penalty term in the augmented error is w^T X^T X w for the original data and w^T Z^T Z w for the transformed data. On the original data, the regularized solution is w_reg(λ). Show that for the transformed problem, the same linear transform of w_reg(λ) gives the corresponding regularized weights for the transformed problem:

w̃_reg(λ) = α (A^T)^{-1} w_reg(λ).
Problem 4.21 The Tikhonov smoothness regularizer penalizes derivatives of h, e.g., Ω(h) = ∫ dx ( d^2 h(x)/dx^2 )^2. For linear models with feature transform z = Φ(x), show that this penalty term is of the form w^T Γ^T Γ w. What is Γ?
Problem 4.22 You have a data set with 100 data points. You have 100 models each with VC-dimension 10. You set aside 25 points for validation. You select the model which produced minimum validation error of 0.25. Give a bound on the out-of-sample error for this selected function.

Suppose you instead trained each model on all the data and selected the function with minimum in-sample error. The resulting in-sample error is 0.15. Give a bound on the out-of-sample error in this case. [Hint: Use the bound in Problem 2.14 to bound the VC-dimension of the union of all the models.]
Problem 4.23 This problem investigates the covariance of the leave-one-out cross validation errors, Cov_D[e_n, e_m]. Assume that for well behaved models, the learning process is 'stable', and so the change in the learned hypothesis should be small, 'O(1/N)', if a new data point is added to a data set of size N. Write g_n^- = g^{(D_{n,m})} + δ_n and g_m^- = g^{(D_{n,m})} + δ_m, where g^{(D_{n,m})} is the learned hypothesis on the data minus the nth and mth data points, and δ_n, δ_m are the corrections after addition of the nth and mth data points respectively.
(a) Show that

Var_D[E_cv] = (1/N^2) Σ_{n=1}^{N} Var_D[e_n] + (1/N^2) Σ_{n≠m} Cov_D[e_n, e_m].

(b) Show that Cov_D[e_n, e_m] = Var_{D_{N-2}}[ E_out(g^{(D_{n,m})}) ] + higher order terms in δ_n, δ_m.

(c) Assume that any terms involving δ_n, δ_m are O(1/N). Argue that

Var_D[E_cv] = (1/N) Var_D[e_1] + Var_D[E_out(g)] + O(1/N).

Does Var_D[e_1] decay to zero with N? What about Var_D[E_out(g)]?

(d) Use the experimental design in Problem 4.4 to study Var_D[E_cv] and give a log-log plot of Var_D[E_cv]/Var_D[e_1] versus N. What is the decay rate?
Problem 4.24 For d = 3, generate a random data set with N points as follows. For each point, each dimension of x has a standard Normal distribution. Similarly, generate a (d+1)-dimensional target weight vector w_f, and set y_n = w_f^T x_n + σ ε_n, where ε_n is noise (also from a standard Normal distribution) and σ^2 is the noise variance; set σ to 0.5.

Use linear regression with weight decay regularization to estimate w_f with w_reg. Set the regularization parameter to 0.05/N.

(a) For N ∈ {d+15, d+25, ..., d+115}, compute the cross validation errors e_1, ..., e_N and E_cv. Repeat the experiment (say) 10^5 times, maintaining the average and variance over the experiments of e_1, e_2 and E_cv.

(b) How should your average of the e_1's relate to the average of the E_cv's; how about to the average of the e_2's? Support your claim using results from your experiment.

(c) What are the contributors to the variance of the e_1's?

(d) If the cross validation errors were truly independent, how should the variance of the e_1's relate to the variance of the E_cv's?

(e) One measure of the effective number of fresh examples used in computing E_cv is the ratio of the variance of the e_1's to that of the E_cv's. Explain why, and plot, versus N, the effective number of fresh examples (N_eff) as a percentage of N. You should find that N_eff is close to N.

(f) If you increase the amount of regularization, will N_eff go up or down? Explain your reasoning. Run the same experiment with λ = 2.5/N and compare your results from part (e) to verify your conjecture.
Problem 4.25 When using a validation set for model selection, all models were learned on the same D_train of size N - K, and validated on the same D_val of size K. We have the VC-bound (see Equation (4.12)):

E_out(g_{m*}) ≤ E_val(g_{m*}) + O( √( ln M / K ) ).
Suppose that instead, you had no control over the validation process. So M learners, each with their own models, present you with the results of their validation processes on different validation sets. Here is what you know about each learner:

Each learner m reports to you the size of their validation set K_m, and the validation error E_val(m). The learners may have used different data sets, except that they faithfully learned on a training set and validated on a held-out validation set which was only used for validation purposes.

As the model selector, you have to decide which learner to go with.

(a) Should you select the learner with minimum validation error? If yes, why? If no, why not? [Hint: think VC-bound.]

(b) If all models are validated on the same validation set as described in the text, why is it okay to select the learner with the lowest validation error?

(c) After selecting learner m* (say), show that

P[ E_out(m*) > E_val(m*) + ε ] ≤ M e^{-2ε^2 κ(ε)},

where κ(ε) = -(1/(2ε^2)) ln( (1/M) Σ_{m=1}^{M} e^{-2ε^2 K_m} ) is an "average" validation set size.

(d) Show that with probability at least 1 - δ, E_out ≤ E_val + ε*, for any ε* which satisfies ε* ≥ √( ln(M/δ) / (2κ(ε*)) ).

(e) Show that min_m K_m ≤ κ(ε) ≤ (1/M) Σ_{m=1}^{M} K_m. Is this bound better or worse than the bound when all models use the same validation set size (equal to the average validation set size (1/M) Σ_{m=1}^{M} K_m)?
Problem 4.26 In this problem, derive the formula for the exact expression for the leave-one-out cross validation error for linear regression. Let Z be the data matrix whose rows correspond to the transformed data points z_n = Φ(x_n).

(a) Show that:

Z^T Z = Σ_{n=1}^{N} z_n z_n^T;    Z^T y = Σ_{n=1}^{N} z_n y_n;    H_nn(λ) = z_n^T A(λ)^{-1} z_n,

where A = A(λ) = Z^T Z + λ Γ^T Γ and H(λ) = Z A(λ)^{-1} Z^T. Hence, show that when (z_n, y_n) is left out, Z^T Z → Z^T Z - z_n z_n^T, and Z^T y → Z^T y - z_n y_n.

(b) Compute w_n^-, the weight vector learned when the nth data point is left out, and show that:

w_n^- = ( A - z_n z_n^T )^{-1} ( Z^T y - z_n y_n ).
[Hint: use the identity (A - x x^T)^{-1} = A^{-1} + (A^{-1} x x^T A^{-1})/(1 - x^T A^{-1} x).]

(c) Using (a) and (b), show that

w_n^- = w + ((z_n^T w - y_n)/(1 - H_nn)) A^{-1} z_n,

where w is the regression weight vector using all the data.

(d) The prediction on the validation point is given by z_n^T w_n^-. Show that

z_n^T w_n^- = (ŷ_n - y_n H_nn) / (1 - H_nn).

(e) Show that

e_n = ( (ŷ_n - y_n) / (1 - H_nn) )^2,

and hence prove Equation (4.13).
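Equation (4.13) makes leave-one-out cross validation cheap for linear models: one fit on all the data suffices. The sketch below is an illustrative NumPy implementation (the function name and the default Γ = I are ours).

    import numpy as np

    def loocv_error(Z, y, lam, Gamma=None):
        """Leave-one-out cross validation error for regularized linear regression,
        computed from a single fit via e_n = ((yhat_n - y_n) / (1 - H_nn))^2."""
        if Gamma is None:
            Gamma = np.eye(Z.shape[1])                  # weight decay
        A_inv = np.linalg.inv(Z.T @ Z + lam * Gamma.T @ Gamma)
        H = Z @ A_inv @ Z.T
        y_hat = H @ y
        h = np.diag(H)
        return np.mean(((y_hat - y) / (1 - h)) ** 2)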
Problem 4.27 Cross validation gives an accurate estimate of Ē_out(N - 1), but it can be quite sensitive, leading to problems in model selection. A common heuristic for 'regularizing' cross validation is to use a measure of error σ_cv(H) for the cross validation estimate in model selection.

(a) One choice for σ_cv is the standard deviation of the leave-one-out errors divided by √N, σ_cv ≈ (1/√N) √( var(e_1, ..., e_N) ). Why divide by √N?

(b) For linear models, show that

√N σ_cv = √( (1/N) Σ_{n=1}^{N} ( (ŷ_n - y_n)/(1 - H_nn) )^4 - E_cv^2 ).

(c) (i) Given the best model H*, the conservative one-sigma approach selects the simplest model within σ_cv(H*) of the best.

(ii) The bound minimizing approach selects the model which minimizes E_cv(H) + σ_cv(H).

Use the experimental design in Problem 4.4 to compare these approaches with the 'unregularized' cross validation estimate as follows. Fix Q_f = 15, Q = 20, and σ = 1. Use each of the two methods proposed here as well as traditional cross validation to select the optimal value of the regularization parameter λ in the range {0.05, 0.10, 0.15, ..., 5} using weight decay regularization, Ω(w) = (λ/N) w^T w. Plot the resulting out-of-sample error for the model selected using each method as a function of N, with N in the range {2×Q, 3×Q, ..., 10×Q}.

What are your conclusions?
Chapter 5
Three Learning Principles
The study of learning from data highlights some general principles that are fascinating concepts in their own right. Having gone through the mathematical analysis and empirical illustrations of the first few chapters, we have a good foundation from which to articulate some of these principles and explain them in concrete terms.

In this chapter, we will discuss three principles. The first one is related to the choice of model and is called Occam's razor. The other two are related to data; sampling bias establishes an important principle about obtaining the data, and data snooping establishes an important principle about handling the data. A genuine understanding of these principles will protect you from the most common pitfalls in learning from data, and allow you to interpret generalization performance properly.
5.1 Occam's Razor
Although it is not an exact quote of Einstein's, it is often attributed to him that "An explanation of the data should be made as simple as possible, but no simpler." A similar principle, Occam's Razor, dates from the 14th century and is attributed to William of Occam, where the 'razor' is meant to trim down the explanation to the bare minimum that is consistent with the data.

In the context of learning, the penalty for model complexity which was introduced in Section 2.2 is a manifestation of Occam's razor. If E_in(g) = 0, then the explanation (hypothesis) is consistent with the data. In this case, the most plausible explanation, with the lowest estimate of E_out given in the VC bound (2.14), happens when the complexity of the explanation (measured by d_VC(H)) is as small as possible. Here is a statement of the underlying principle.

The simplest model that fits the data is also the most plausible.
Applying this principle, we should choose as simple a model as we think we can get away with. Although the principle that simpler is better may be intuitive, it is neither precise nor self-evident. When we apply the principle to learning from data, there are two basic questions to be asked.

1. What does it mean for a model to be simple?

2. How do we know that simpler is better?

Let's start with the first question. There are two distinct approaches to defining the notion of complexity, one based on a family of objects and the other based on an individual object. We have already seen both approaches in our analysis. The VC dimension in Chapter 2 is a measure of complexity, and it is based on the hypothesis set H as a whole, i.e., based on a family of objects. The regularization term of the augmented error in Chapter 4 is also a measure of complexity, but in this case it is the complexity of an individual object, namely the hypothesis h.
The two approaches to defining complexity are not encountered only in learning from data; they are a recurring theme whenever complexity is discussed. For instance, in information theory, entropy is a measure of complexity based on a family of objects, while minimum description length is a related measure based on individual objects. There is a reason why this is a recurring theme. The two approaches to defining complexity are in fact related.

When we say a family of objects is complex, we mean that the family is 'big'. That is, it contains a large variety of objects. Therefore, each individual object in the family is one of many. By contrast, a simple family of objects is 'small'; it has relatively few objects, and each individual object is one of few.

Why is the sheer number of objects an indication of the level of complexity? The reason is that both the number of objects in a family and the complexity of an object are related to how many parameters are needed to specify the object. When you increase the number of parameters in a learning model, you simultaneously increase how diverse H is and how complex the individual h is. For example, consider 17th order polynomials versus 3rd order polynomials. There is more variety in 17th order polynomials, and at the same time the individual 17th order polynomial is more complex than a 3rd order polynomial.

The most common definitions of object complexity are based on the number of bits needed to describe an object. Under such definitions, an object is simple if it has a short description. Therefore, a simple object is not only intrinsically simple (as it can be described succinctly), but it also has to be one of few, since there are fewer objects that have short descriptions than there are that have long descriptions, as a matter of simple counting.
Exercise 5.1

Consider hypothesis sets H_1 and H_100 that contain Boolean functions on 10 Boolean variables, so X = {-1, +1}^10. H_1 contains all Boolean functions
which evaluate to +1 on exactly one input point, and to -1 elsewhere; H_100 contains all Boolean functions which evaluate to +1 on exactly 100 input points, and to -1 elsewhere.

(a) How big (number of hypotheses) are H_1 and H_100?

(b) How many bits are needed to specify one of the hypotheses in H_1?

(c) How many bits are needed to specify one of the hypotheses in H_100?
We now address the second question. When Occam's razor says that simpler is better, it doesn't mean simpler is more elegant. It means simpler has a better chance of being right. Occam's razor is about performance, not about aesthetics. If a complex explanation of the data performs better, we will take it.

The argument that simpler has a better chance of being right goes as follows. We are trying to fit a hypothesis to our data D = {(x_1, y_1), ..., (x_N, y_N)} (assume the y_n's are binary). There are fewer simple hypotheses than there are complex ones. With complex hypotheses, there would be enough of them to shatter x_1, ..., x_N, so it is certain that we can fit the data set regardless of what the labels y_1, ..., y_N are, even if these are completely random. Therefore, fitting the data does not mean much. If, instead, we have a simple model with few hypotheses and we still found one that perfectly fits the dichotomy D = {(x_1, y_1), ..., (x_N, y_N)}, this is surprising, and therefore it means something.

Occam's Razor has been formally proved under different sets of idealized conditions. The above argument captures the essence of these proofs; if something is less likely to happen, then when it does happen it is more significant. Let us look at an example.
Example 5.1. Suppose that one constructs a physical theory about the resistivity of a metal under various temperatures. In this theory, aside from some constants that need to be determined, the resistivity ρ has a linear dependence on the temperature T. In order to verify that the theory is correct and to obtain the unknown constants, 3 scientists conduct the following three experiments and present their data to you.
[Figure: resistivity ρ versus temperature T data reported by Scientist 1, Scientist 2, and Scientist 3.]
It is clear that Scientist 3 has produced the most convincing evidence for the theory. If the measurements are exact, then Scientist 2 has managed to falsify the theory and we are back to the drawing board. What about Scientist 1? While he has not falsified the theory, has he provided any evidence for it? The answer is no, for we can reverse the question. Suppose that the theory was not correct, what could the data have done to prove him wrong? Nothing, since any two points can be joined by a line. Therefore, the model is not just likely to fit the data in this case, it is certain to do so. This renders the fit totally insignificant when it does happen. □

This example illustrates a concept related to Occam's Razor, which is the axiom of non-falsifiability. The axiom asserts that the data should have some chance of falsifying a hypothesis, if we are to conclude that it can provide evidence for the hypothesis. One way to guarantee that every data set has some chance at falsification is for the VC dimension of the hypothesis set to be less than N, the number of data points. This is discussed further in Problem 5.1. Here is another example of the same concept.
Example 5.2. Financial firms try to pick good traders (predictors of whether the market will go up or not). Suppose that each trader is tested on their prediction (up or down) over the next 5 days and those who perform well will be hired. One might think that this process should produce better and better traders on Wall Street. Viewed as a learning problem, consider each trader to be a prediction hypothesis. Suppose that the hiring pool is 'complex'; we are interviewing 2^5 traders who happen to be a diverse set of people such that their predictions over the next 5 days are all different. Necessarily one of these traders gets it all correct, and will be hired. Hiring the trader through this process may or may not be a good thing, since the process will pick someone even if the traders are just flipping coins to make their predictions. A perfect predictor always exists in this group, so finding one doesn't mean much. If we were interviewing only two traders, and one of them made perfect predictions, that would mean something. □
Exercise 5.2

Suppose that for 5 weeks in a row, a letter arrives in the mail that predicts the outcome of the upcoming Monday night football game. You keenly watch each Monday and to your surprise, the prediction is correct each time. On the day after the fifth game, a letter arrives, stating that if you wish to see next week's prediction, a payment of $50.00 is required. Should you pay?

(a) How many possible predictions of win-lose are there for 5 games?

(b) If the sender wants to make sure that at least one person receives correct predictions on all 5 games from him, how many people should he target to begin with?
(c) After the first letter 'predicting' the outcome of the first game, how many of the original recipients does he target with the second letter?

(d) How many letters altogether will have been sent at the end of the 5 weeks?

(e) If the cost of printing and mailing out each letter is $0.50, how much would the sender make if the recipient of 5 correct predictions sent in the $50.00?

(f) Can you relate this situation to the growth function and the credibility of fitting the data?
Learning from data takes Occam's Razor to another level, going beyond "as simple as possible, but no simpler." Indeed, we may opt for 'a simpler fit than possible', namely an imperfect fit of the data using a simple model over a perfect fit using a more complex one. The reason is that the price we pay for a perfect fit in terms of the penalty for model complexity in (2.14) may be too much in comparison to the benefit of the better fit. This idea was illustrated in Figure 3.7, and is a manifestation of overfitting. The idea is also the rationale behind the recommended policy in Chapter 3: first try a linear model - one of the simplest models in the arena of learning from data.
5.2 Sampling Bias
A vivid example of sampling bias happened in the 1948 US presidential election between Truman and Dewey. On election night, a major newspaper carried out a telephone poll to ask people how they voted. The poll indicated that Dewey won, and the paper was so confident about the small error bar in its poll that it declared Dewey the winner in its headline. When the actual votes were counted, Dewey lost - to the delight of a smiling Truman.
[Photograph © Associated Press.]
This was not a case of statistical anomaly, where the newspaper was just incredibly unlucky (remember the δ in the VC bound?). It was a case where the sample was doomed from the get-go, regardless of its size. Even if the experiment were repeated, the result would be the same. In 1948, telephones were expensive and those who had them tended to be in an elite group that favored Dewey much more than the average voter did. Since the newspaper did its poll by telephone, it inadvertently used an in-sample distribution that was different from the out-of-sample distribution. That is what sampling bias is.
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
Applying this principle, we should make sure that the training and testing distributions are the same; if not, our results may be invalid, or, at the very least, require careful interpretation.

If you recall, the VC analysis made very few assumptions, but one assumption it did make was that the data set D is generated from the same distribution that the final hypothesis g is tested on. In practice, we may encounter data sets that were not generated under those ideal conditions. There are some techniques in statistics and in learning to compensate for the 'mismatch' between training and testing, but not in cases where D was generated with the exclusion of certain parts of the input space, such as the exclusion of households with no telephones in the above example. There is nothing that can be done when this happens, other than to admit that the result will not be reliable - statistical bounds like Hoeffding and VC require a match between the training and testing distributions.
There are many examples of how sampling bias can be introduced in data collection. In some cases it is inadvertently introduced by an oversight, as in the case of Dewey and Truman. In other cases, it is introduced because certain types of data are not available. For instance, in our credit example of Chapter 1, the bank created the training set from the database of previous customers and how they performed for the bank. Such a set necessarily excludes those who applied to the bank for credit cards and were rejected, because the bank does not have data on how they would have performed if they were accepted. Since future applicants will come from a mixed population including