
Data Mining and Analysis:
Fundamental Concepts and Algorithms
Mohammed J. Zaki
Wagner Meira Jr.
Contents
Preface

1 Data Mining and Analysis
  1.1 Data Matrix
  1.2 Attributes
  1.3 Data: Algebraic and Geometric View
    1.3.1 Distance and Angle
    1.3.2 Mean and Total Variance
    1.3.3 Orthogonal Projection
    1.3.4 Linear Independence and Dimensionality
  1.4 Data: Probabilistic View
    1.4.1 Bivariate Random Variables
    1.4.2 Multivariate Random Variable
    1.4.3 Random Sample and Statistics
  1.5 Data Mining
    1.5.1 Exploratory Data Analysis
    1.5.2 Frequent Pattern Mining
    1.5.3 Clustering
    1.5.4 Classification
  1.6 Further Reading
  1.7 Exercises

I Data Analysis Foundations

2 Numeric Attributes
  2.1 Univariate Analysis
    2.1.1 Measures of Central Tendency
    2.1.2 Measures of Dispersion
  2.2 Bivariate Analysis
    2.2.1 Measures of Location and Dispersion
    2.2.2 Measures of Association
  2.3 Multivariate Analysis
  2.4 Data Normalization
  2.5 Normal Distribution
    2.5.1 Univariate Normal Distribution
    2.5.2 Multivariate Normal Distribution
  2.6 Further Reading
  2.7 Exercises

3 Categorical Attributes
  3.1 Univariate Analysis
    3.1.1 Bernoulli Variable
    3.1.2 Multivariate Bernoulli Variable
  3.2 Bivariate Analysis
    3.2.1 Attribute Dependence: Contingency Analysis
  3.3 Multivariate Analysis
    3.3.1 Multi-way Contingency Analysis
  3.4 Distance and Angle
  3.5 Discretization
  3.6 Further Reading
  3.7 Exercises

4 Graph Data
  4.1 Graph Concepts
  4.2 Topological Attributes
  4.3 Centrality Analysis
    4.3.1 Basic Centralities
    4.3.2 Web Centralities
  4.4 Graph Models
    4.4.1 Erdös-Rényi Random Graph Model
    4.4.2 Watts-Strogatz Small-world Graph Model
    4.4.3 Barabási-Albert Scale-free Model
  4.5 Further Reading
  4.6 Exercises

5 Kernel Methods
  5.1 Kernel Matrix
    5.1.1 Reproducing Kernel Map
    5.1.2 Mercer Kernel Map
  5.2 Vector Kernels
  5.3 Basic Kernel Operations in Feature Space
  5.4 Kernels for Complex Objects
    5.4.1 Spectrum Kernel for Strings
    5.4.2 Diffusion Kernels on Graph Nodes
  5.5 Further Reading
  5.6 Exercises

6 High-Dimensional Data
  6.1 High-Dimensional Objects
  6.2 High-Dimensional Volumes
  6.3 Hypersphere Inscribed within Hypercube
  6.4 Volume of Thin Hypersphere Shell
  6.5 Diagonals in Hyperspace
  6.6 Density of the Multivariate Normal
  6.7 Appendix: Derivation of Hypersphere Volume
  6.8 Further Reading
  6.9 Exercises

7 Dimensionality Reduction
  7.1 Background
  7.2 Principal Component Analysis
    7.2.1 Best Line Approximation
    7.2.2 Best Two-dimensional Approximation
    7.2.3 Best r-dimensional Approximation
    7.2.4 Geometry of PCA
  7.3 Kernel Principal Component Analysis (Kernel PCA)
  7.4 Singular Value Decomposition
    7.4.1 Geometry of SVD
    7.4.2 Connection between SVD and PCA
  7.5 Further Reading
  7.6 Exercises

II Frequent Pattern Mining

8 Itemset Mining
  8.1 Frequent Itemsets and Association Rules
  8.2 Itemset Mining Algorithms
    8.2.1 Level-Wise Approach: Apriori Algorithm
    8.2.2 Tidset Intersection Approach: Eclat Algorithm
    8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm
  8.3 Generating Association Rules
  8.4 Further Reading
  8.5 Exercises
9 Summarizing Itemsets
  9.1 Maximal and Closed Frequent Itemsets
  9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm
  9.3 Mining Closed Frequent Itemsets: Charm Algorithm
  9.4 Non-Derivable Itemsets
  9.5 Further Reading
  9.6 Exercises

10 Sequence Mining
  10.1 Frequent Sequences
  10.2 Mining Frequent Sequences
    10.2.1 Level-Wise Mining: GSP
    10.2.2 Vertical Sequence Mining: SPADE
    10.2.3 Projection-Based Sequence Mining: PrefixSpan
  10.3 Substring Mining via Suffix Trees
    10.3.1 Suffix Tree
    10.3.2 Ukkonen's Linear Time Algorithm
  10.4 Further Reading
  10.5 Exercises

11 Graph Pattern Mining
  11.1 Isomorphism and Support
  11.2 Candidate Generation
    11.2.1 Canonical Code
  11.3 The gSpan Algorithm
    11.3.1 Extension and Support Computation
    11.3.2 Canonicality Checking
  11.4 Further Reading
  11.5 Exercises

12 Pattern and Rule Assessment
  12.1 Rule and Pattern Assessment Measures
    12.1.1 Rule Assessment Measures
    12.1.2 Pattern Assessment Measures
    12.1.3 Comparing Multiple Rules and Patterns
  12.2 Significance Testing and Confidence Intervals
    12.2.1 Fisher Exact Test for Productive Rules
    12.2.2 Permutation Test for Significance
    12.2.3 Bootstrap Sampling for Confidence Interval
  12.3 Further Reading
  12.4 Exercises
III Clustering

13 Representative-based Clustering
  13.1 K-means Algorithm
  13.2 Kernel K-means
  13.3 Expectation Maximization (EM) Clustering
    13.3.1 EM in One Dimension
    13.3.2 EM in d-Dimensions
    13.3.3 Maximum Likelihood Estimation
    13.3.4 Expectation-Maximization Approach
  13.4 Further Reading
  13.5 Exercises

14 Hierarchical Clustering
  14.1 Preliminaries
  14.2 Agglomerative Hierarchical Clustering
    14.2.1 Distance between Clusters
    14.2.2 Updating Distance Matrix
    14.2.3 Computational Complexity
  14.3 Further Reading
  14.4 Exercises and Projects

15 Density-based Clustering
  15.1 The DBSCAN Algorithm
  15.2 Kernel Density Estimation
    15.2.1 Univariate Density Estimation
    15.2.2 Multivariate Density Estimation
    15.2.3 Nearest Neighbor Density Estimation
  15.3 Density-based Clustering: DENCLUE
  15.4 Further Reading
  15.5 Exercises

16 Spectral and Graph Clustering
  16.1 Graphs and Matrices
  16.2 Clustering as Graph Cuts
    16.2.1 Clustering Objective Functions: Ratio and Normalized Cut
    16.2.2 Spectral Clustering Algorithm
    16.2.3 Maximization Objectives: Average Cut and Modularity
  16.3 Markov Clustering
  16.4 Further Reading
  16.5 Exercises
17 Clustering Validation
  17.1 External Measures
    17.1.1 Matching Based Measures
    17.1.2 Entropy Based Measures
    17.1.3 Pair-wise Measures
    17.1.4 Correlation Measures
  17.2 Internal Measures
  17.3 Relative Measures
    17.3.1 Cluster Stability
    17.3.2 Clustering Tendency
  17.4 Further Reading
  17.5 Exercises

IV Classification

18 Probabilistic Classification
  18.1 Bayes Classifier
    18.1.1 Estimating the Prior Probability
    18.1.2 Estimating the Likelihood
  18.2 Naive Bayes Classifier
  18.3 Further Reading
  18.4 Exercises

19 Decision Tree Classifier
  19.1 Decision Trees
  19.2 Decision Tree Algorithm
    19.2.1 Split-point Evaluation Measures
    19.2.2 Evaluating Split-points
    19.2.3 Computational Complexity
  19.3 Further Reading
  19.4 Exercises

20 Linear Discriminant Analysis
  20.1 Optimal Linear Discriminant
  20.2 Kernel Discriminant Analysis
  20.3 Further Reading
  20.4 Exercises

21 Support Vector Machines
  21.1 Linear Discriminants and Margins
  21.2 SVM: Linear and Separable Case
  21.3 Soft Margin SVM: Linear and Non-Separable Case
    21.3.1 Hinge Loss
    21.3.2 Quadratic Loss
  21.4 Kernel SVM: Nonlinear Case
  21.5 SVM Training Algorithms
    21.5.1 Dual Solution: Stochastic Gradient Ascent
    21.5.2 Primal Solution: Newton Optimization

22 Classification Assessment
  22.1 Classification Performance Measures
    22.1.1 Contingency Table Based Measures
    22.1.2 Binary Classification: Positive and Negative Class
    22.1.3 ROC Analysis
  22.2 Classifier Evaluation
    22.2.1 K-fold Cross-Validation
    22.2.2 Bootstrap Resampling
    22.2.3 Confidence Intervals
    22.2.4 Comparing Classifiers: Paired t-Test
  22.3 Bias-Variance Decomposition
    22.3.1 Ensemble Classifiers
  22.4 Further Reading
  22.5 Exercises

Index
Preface
This book is an outgrowth of data mining courses at RPI and UFMG; the RPI course
has been offered every Fall since 1998, whereas the UFMG course has been offered
since 2002. While there are several good books on data mining and related topics,
we felt that many of them are either too high-level or too advanced. Our goal was
to write an introductory text which focuses on the fundamental algorithms in data
mining and analysis. It lays the mathematical foundations for the core data mining
methods, with key concepts explained when first encountered; the book also tries to
build the intuition behind the formulas to aid understanding.
The main parts of the book include exploratory data analysis, frequent pattern
mining, clustering and classification. The book lays the basic foundations of these
tasks, and it also covers cutting edge topics like kernel methods, high dimensional
data analysis, and complex graphs and networks. It integrates concepts from related
disciplines like machine learning and statistics, and is also ideal for a course on data
analysis. Most of the prerequisite material is covered in the text, especially on linear
algebra, and probability and statistics.
The book includes many examples to illustrate the main technical concepts. It
also has end of chapter exercises, which have been used in class. All of the algorithms
in the book have been implemented by the authors. We suggest that the reader use
their favorite data analysis and mining software to work through our examples, and
to implement the algorithms we describe in text; we recommend the R software,
or the Python language with its NumPy package. The datasets used and other
supplementary material like project ideas, slides, and so on, are available online at
the book’s companion site and its mirrors at RPI and UFMG
• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook
Having understood the basic principles and algorithms in data mining and data
analysis, the readers will be well equipped to develop their own methods or use more
advanced techniques.
Suggested Roadmaps
The chapter dependency graph is shown in Figure 1. We suggest some typical
roadmaps for courses and readings based on this book. For an undergraduate level
course, we suggest the following chapters: 1-3, 8, 10, 12-15, 17-19, and 21-22. For an
undergraduate course without exploratory data analysis, we recommend Chapters
1, 8-15, 17-19, and 21-22. For a graduate course, one possibility is to quickly go
over the material in Part I, or to assume it as background reading and to directly
cover Chapters 9-22; the other parts of the book, namely frequent pattern mining
(Part II), clustering (Part III), and classification (Part IV) can be covered in any
order. For a course on data analysis, we suggest Chapters 1-7, 13-14, 15 (Section
2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest
Chapters 4, 5, 7 (Sections 1-3), 11-12, 13 (Sections 1-2), 16-17, 20-22.
[Figure 1: Chapter Dependencies]
Acknowledgments
Initial drafts of this book have been used in many data mining courses. We received
many valuable comments and corrections from both the faculty and students. Our
thanks go to
• Muhammad Abulaish, Jamia Millia Islamia, India
• Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
• Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
• Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
• Ayhan Demiriz, Sakarya University, Turkey
• Murat Dundar, Indiana University Purdue University at Indianapolis
• Jun Luke Huan, University of Kansas
• Ruoming Jin, Kent State University
• Latifur Khan, University of Texas, Dallas
• Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
• Suat Ozdemir, Gazi University, Turkey
• Naren Ramakrishnan, Virginia Polytechnic and State University
• Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei,
Brazil
• Saeed Salem, North Dakota State University
• Ankur Teredesai, University of Washington, Tacoma
• Hannu Toivonen, University of Helsinki, Finland
• Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
• Jason T.L. Wang, New Jersey Institute of Technology
• Jianyong Wang, Tsinghua University, China
• Jiong Yang, Case Western Reserve University
• Jieping Ye, Arizona State University
We would like to thank all the students enrolled in our data mining courses at RPI
and UFMG, and also the anonymous reviewers who provided technical comments
on various chapters. In addition, we thank CNPq, CAPES, FAPEMIG, Inweb –
the National Institute of Science and Technology for the Web, and Brazil’s Science
without Borders program for their support. We thank Lauren Cowles, our editor at
Cambridge University Press, for her guidance and patience in realizing this book.
Finally, on a more personal front, MJZ would like to dedicate the book to Amina,
Abrar, Afsah, and his parents, and WMJ would like to dedicate the book to Patricia,
Gabriel, Marina and his parents, Wagner and Marlene. This book would not have
been possible without their patience and support.
Troy Mohammed J. Zaki
Belo Horizonte Wagner Meira, Jr.
Summer 2013
Chapter 1
Data Mining and Analysis
Data mining is the process of discovering insightful, interesting, and novel patterns,
as well as descriptive, understandable and predictive models from large-scale data.
We begin this chapter by looking at basic properties of data modeled as a data ma-
trix. We emphasize the geometric and algebraic views, as well as the probabilistic
interpretation of data. We then discuss the main data mining tasks, which span ex-
ploratory data analysis, frequent pattern mining, clustering and classification, laying
out the road-map for the book.
1.1 Data Matrix
Data can often be represented or abstracted as an n×d data matrix, with n rows and
d columns, where rows correspond to entities in the dataset, and columns represent
attributes or properties of interest. Each row in the data matrix records the observed
attribute values for a given entity. The n × d data matrix is given as

$$\mathbf{D} = \begin{pmatrix} & X_1 & X_2 & \cdots & X_d \\ \mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

where $\mathbf{x}_i$ denotes the i-th row, which is a d-tuple given as

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$$

and where $X_j$ denotes the j-th column, which is an n-tuple given as

$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$
Depending on the application domain, rows may also be referred to as entities,
instances, examples, records, transactions, objects, points, feature-vectors, tuples and
so on. Likewise, columns may also be called attributes, properties, features, dimen-
sions, variables, fields, and so on. The number of instances n is referred to as the
size of the data, whereas the number of attributes d is called the dimensionality of
the data. The analysis of a single attribute is referred to as univariate analysis,
whereas the simultaneous analysis of two attributes is called bivariate analysis and
the simultaneous analysis of more than two attributes is called multivariate analysis.

       sepal length   sepal width   petal length   petal width   class
            X1             X2             X3            X4         X5
x1         5.9            3.0            4.2           1.5        Iris-versicolor
x2         6.9            3.1            4.9           1.5        Iris-versicolor
x3         6.6            2.9            4.6           1.3        Iris-versicolor
x4         4.6            3.2            1.4           0.2        Iris-setosa
x5         6.0            2.2            4.0           1.0        Iris-versicolor
x6         4.7            3.2            1.3           0.2        Iris-setosa
x7         6.5            3.0            5.8           2.2        Iris-virginica
x8         5.8            2.7            5.1           1.9        Iris-virginica
...        ...            ...            ...           ...        ...
x149       7.7            3.8            6.7           2.2        Iris-virginica
x150       5.1            3.4            1.5           0.2        Iris-setosa

Table 1.1: Extract from the Iris Dataset (measurements in centimeters)
Example 1.1: Table 1.1 shows an extract of the Iris dataset; the complete data
forms a 150×5 data matrix. Each entity is an Iris flower, and the attributes include
sepal length, sepal width, petal length and petal width in centimeters, and
the type or class of the Iris flower. The first row is given as the 5-tuple
x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor)
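To make the data matrix concrete in code, here is a minimal sketch (not from the text; it assumes the Python/NumPy setup recommended in the Preface) that stores the numeric attributes of the Table 1.1 extract as an 8 × 4 array, keeping the categorical class attribute in a separate list since a NumPy array holds values of a single type.

```python
import numpy as np

# Numeric attributes of the Table 1.1 extract: each row is a flower, the
# columns are sepal length, sepal width, petal length, petal width (in cm).
D = np.array([
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
    [6.0, 2.2, 4.0, 1.0],
    [4.7, 3.2, 1.3, 0.2],
    [6.5, 3.0, 5.8, 2.2],
    [5.8, 2.7, 5.1, 1.9],
])

# The categorical class attribute is kept as a separate list.
cls = ["Iris-versicolor", "Iris-versicolor", "Iris-versicolor", "Iris-setosa",
       "Iris-versicolor", "Iris-setosa", "Iris-virginica", "Iris-virginica"]

n, d = D.shape        # size n = 8 and dimensionality d = 4 for this extract
print(n, d, D[0])     # D[0] is the numeric part of the first row x1
```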
Not all datasets are in the form of a data matrix. For instance, more complex
datasets can be in the form of sequences (e.g., DNA, Proteins), text, time-series,
images, audio, video, and so on, which may need special techniques for analysis.
However, in many cases even if the raw data is not a data matrix it can usually be
transformed into that form via feature extraction. For example, given a database of
images, we can create a data matrix where rows represent images and columns corre-
spond to image features like color, texture, and so on. Sometimes, certain attributes
may have special semantics associated with them requiring special treatment. For
instance, temporal or spatial attributes are often treated differently. It is also worth
noting that traditional data analysis assumes that each entity or instance is inde-
pendent. However, given the interconnected nature of the world we live in, this
assumption may not always hold. Instances may be connected to other instances via
various kinds of relationships, giving rise to a data graph, where a node represents
an entity and an edge represents the relationship between two entities.
1.2 Attributes
Attributes may be classified into two main types depending on their domain, i.e.,
depending on the types of values they take on.
Numeric Attributes A numeric attribute is one that has a real-valued or integer-
valued domain. For example, Age with domain(Age) = N, where N denotes the set
of natural numbers (non-negative integers), is numeric, and so is petal length in
Table 1.1, with domain(petal length) = R+ (the set of all positive real numbers).
Numeric attributes that take on a finite or countably infinite set of values are called
discrete, whereas those that can take on any real value are called continuous. As a
special case of discrete, if an attribute has as its domain the set {0, 1}, it is called a
binary attribute. Numeric attributes can be further classified into two types:
• Interval-scaled: For these kinds of attributes only differences (addition or sub-
traction) make sense. For example, attribute temperature measured in ◦C or
◦F is interval-scaled. If it is 20 ◦C on one day and 10 ◦C on the following
day, it is meaningful to talk about a temperature drop of 10 ◦C, but it is not
meaningful to say that it is twice as cold as the previous day.
• Ratio-scaled: Here one can compute both differences as well as ratios between
values. For example, for attribute Age, we can say that someone who is 20
years old is twice as old as someone who is 10 years old.
Categorical Attributes A categorical attribute is one that has a set-valued do-
main composed of a set of symbols. For example, Sex and Education could be
categorical attributes with their domains given as
domain(Sex) = {M, F}
domain(Education) = {HighSchool, BS, MS, PhD}
Categorical attributes may be of two types:
• Nominal: The attribute values in the domain are unordered, and thus only
equality comparisons are meaningful. That is, we can check only whether the
value of the attribute for two given instances is the same or not. For example,
Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with
domain(class) = {iris-setosa, iris-versicolor, iris-virginica}.
• Ordinal: The attribute values are ordered, and thus both equality comparisons
(is one value equal to another) and inequality comparisons (is one value less
than or greater than another) are allowed, though it may not be possible to
quantify the difference between values. For example, Education is an ordi-
nal attribute, since its domain values are ordered by increasing educational
qualification.
1.3 Data: Algebraic and Geometric View
If the d attributes or dimensions in the data matrix D are all numeric, then each
row can be considered as a d-dimensional point
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$$

or equivalently, each row may be considered as a d-dimensional column vector (all
vectors are assumed to be column vectors by default)

$$\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix} = (x_{i1} \;\, x_{i2} \;\, \cdots \;\, x_{id})^T \in \mathbb{R}^d$$

where T is the matrix transpose operator.

The d-dimensional Cartesian coordinate space is specified via the d unit vectors,
called the standard basis vectors, along each of the axes. The j-th standard basis
vector $\mathbf{e}_j$ is the d-dimensional unit vector whose j-th component is 1 and the rest of
the components are 0

$$\mathbf{e}_j = (0, \ldots, 1_j, \ldots, 0)^T$$

Any other vector in $\mathbb{R}^d$ can be written as a linear combination of the standard basis
vectors. For example, each of the points $\mathbf{x}_i$ can be written as the linear combination

$$\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\,\mathbf{e}_j$$
where the scalar value xij is the coordinate value along the j-th axis or attribute.
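As a quick numerical illustration (a sketch, not from the text), the standard basis of R^3 can be built with NumPy and a point recovered as the linear combination above:

```python
import numpy as np

e = np.eye(3)   # rows e[0], e[1], e[2] are the standard basis vectors of R^3

# A 3-dimensional point written as a linear combination of the basis vectors,
# with its coordinates as the scalar coefficients.
x = 5.9 * e[0] + 3.0 * e[1] + 4.2 * e[2]
print(x)        # [5.9 3.  4.2]
```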
[Figure 1.1: Row x1 as a point and vector in (a) R^2 and (b) R^3]
[Figure 1.2: Scatter plot: sepal length versus sepal width. The solid circle shows the mean point.]
Example 1.2: Consider the Iris data in Table 1.1. If we project the entire
data onto the first two attributes, then each row can be considered as a point
or a vector in 2-dimensional space. For example, the projection of the 5-tuple
x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor) on the first two attributes is shown in
Figure 1.1a. Figure 1.2 shows the scatter plot of all the n = 150 points in the
2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b
shows x1 as a point and vector in 3-dimensional space, by projecting the data onto
the first three attributes. The point (5.9, 3.0, 4.2) can be seen as specifying the
coefficients in the linear combination of the standard basis vectors in R3
$$\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3 = 5.9\begin{pmatrix}1\\0\\0\end{pmatrix} + 3.0\begin{pmatrix}0\\1\\0\end{pmatrix} + 4.2\begin{pmatrix}0\\0\\1\end{pmatrix} = \begin{pmatrix}5.9\\3.0\\4.2\end{pmatrix}$$
Each numeric column or attribute can also be treated as a vector in an
n-dimensional space $\mathbb{R}^n$

$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$

If all attributes are numeric, then the data matrix D is in fact an n × d matrix,
also written as $\mathbf{D} \in \mathbb{R}^{n \times d}$, given as

$$\mathbf{D} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_d \\ | & | & & | \end{pmatrix}$$

As we can see, we can consider the entire dataset as an n × d matrix, or equivalently
as a set of n row vectors $\mathbf{x}_i^T \in \mathbb{R}^d$ or as a set of d column vectors $X_j \in \mathbb{R}^n$.
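In code, the two views correspond to slicing the data matrix along its two axes; a small NumPy sketch (not from the text):

```python
import numpy as np

# A toy numeric data matrix with n = 3 points (rows) and d = 2 attributes (columns).
D = np.array([[5.9, 3.0],
              [6.9, 3.1],
              [6.6, 2.9]])

x2 = D[1, :]     # second row vector (a point in R^d)
X1 = D[:, 0]     # first column vector (an attribute in R^n)
print(x2, X1)
```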
1.3.1 Distance and Angle
Treating data instances and attributes as vectors, and the entire dataset as a matrix,
enables one to apply both geometric and algebraic methods to aid in the data mining
and analysis tasks.
Let $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$ be two m-dimensional vectors given as

$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$

Dot Product The dot product between a and b is defined as the scalar value

$$\mathbf{a}^T\mathbf{b} = (a_1 \;\, a_2 \;\, \cdots \;\, a_m) \times \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i$$

Length The Euclidean norm or length of a vector $\mathbf{a} \in \mathbb{R}^m$ is defined as

$$\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$

The unit vector in the direction of a is given as

$$\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}$$

By definition u has length $\|\mathbf{u}\| = 1$, and it is also called a normalized vector, which
can be used in lieu of a in some analysis tasks.

The Euclidean norm is a special case of a general class of norms, known as Lp-norm,
defined as

$$\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m} |a_i|^p\right)^{\frac{1}{p}}$$

for any p ≠ 0. Thus, the Euclidean norm corresponds to the case when p = 2.

Distance From the Euclidean norm we can define the Euclidean distance between
a and b, as follows

$$\delta(\mathbf{a},\mathbf{b}) = \|\mathbf{a}-\mathbf{b}\| = \sqrt{(\mathbf{a}-\mathbf{b})^T(\mathbf{a}-\mathbf{b})} = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2} \qquad (1.1)$$
Thus, the length of a vector is simply its distance from the zero vector 0, all of whose
elements are 0, i.e., ‖a‖ = ‖a− 0‖ = δ(a,0).
From the general Lp-norm we can define the corresponding Lp-distance function,
given as follows
δp(a,b) = ‖a− b‖p (1.2)
Angle The cosine of the smallest angle between vectors a and b, also called the
cosine similarity, is given as

$$\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \qquad (1.3)$$

Thus, the cosine of the angle between a and b is given as the dot product of the unit
vectors $\frac{\mathbf{a}}{\|\mathbf{a}\|}$ and $\frac{\mathbf{b}}{\|\mathbf{b}\|}$.
The Cauchy-Schwartz inequality states that for any vectors a and b in Rm
|aTb| ≤ ‖a‖ · ‖b‖
It follows immediately from the Cauchy-Schwartz inequality that
−1 ≤ cos θ ≤ 1
Since the smallest angle θ ∈ [0◦, 180◦] and since cos θ ∈ [−1, 1], the cosine similarity
value ranges from +1 corresponding to an angle of 0◦, to −1 corresponding to an
angle of 180◦ (or π radians).
Orthogonality Two vectors a and b are said to be orthogonal if and only if $\mathbf{a}^T\mathbf{b} = 0$,
which in turn implies that cos θ = 0, that is, the angle between them is 90° or π/2
radians. In this case, we say that they have no similarity.
Example 1.3 (Distance and Angle): Figure 1.3 shows the two vectors
$$\mathbf{a} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \quad \text{and} \quad \mathbf{b} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$

Using (1.1), the Euclidean distance between them is given as

$$\delta(\mathbf{a},\mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16+1} = \sqrt{17} = 4.12$$

The distance can also be computed as the magnitude of the vector

$$\mathbf{a}-\mathbf{b} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 4 \\ -1 \end{pmatrix}$$
[Figure 1.3: Distance and Angle. Unit vectors are shown in gray.]
since $\|\mathbf{a}-\mathbf{b}\| = \sqrt{4^2 + (-1)^2} = \sqrt{17} = 4.12$.

The unit vector in the direction of a is given as

$$\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2+3^2}}\begin{pmatrix} 5 \\ 3 \end{pmatrix} = \frac{1}{\sqrt{34}}\begin{pmatrix} 5 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.86 \\ 0.51 \end{pmatrix}$$

The unit vector in the direction of b can be computed similarly

$$\mathbf{u}_b = \begin{pmatrix} 0.24 \\ 0.97 \end{pmatrix}$$

These unit vectors are also shown in gray in Figure 1.3.

By (1.3) the cosine of the angle between a and b is given as

$$\cos\theta = \frac{\begin{pmatrix} 5 \\ 3 \end{pmatrix}^T \begin{pmatrix} 1 \\ 4 \end{pmatrix}}{\sqrt{5^2+3^2}\,\sqrt{1^2+4^2}} = \frac{17}{\sqrt{34 \times 17}} = \frac{1}{\sqrt{2}}$$

We can get the angle by computing the inverse of the cosine

$$\theta = \cos^{-1}\left(1/\sqrt{2}\right) = 45°$$
Let us consider the Lp-norm for a with p = 3; we get

$$\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$

The distance between a and b using (1.2) for the Lp-norm with p = 3 is given as

$$\|\mathbf{a}-\mathbf{b}\|_3 = \left\|(4,-1)^T\right\|_3 = \left(|4|^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
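The quantities in Example 1.3 are easy to verify numerically; the following NumPy sketch (not from the text) reproduces them, with np.linalg.norm computing the Lp-norm through its ord argument.

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dot = a @ b                                    # dot product a^T b = 17
dist = np.linalg.norm(a - b)                   # Euclidean distance sqrt(17) ~ 4.12
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(cos_theta))       # smallest angle ~ 45 degrees

l3_norm = np.linalg.norm(a, ord=3)             # (5^3 + 3^3)^(1/3) ~ 5.34
l3_dist = np.linalg.norm(a - b, ord=3)         # (|4|^3 + |-1|^3)^(1/3) ~ 4.02

print(round(dist, 2), round(theta, 1), round(l3_norm, 2), round(l3_dist, 2))
```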
1.3.2 Mean and Total Variance
Mean The mean of the data matrix D is the vector obtained as the average of all
the row-vectors

$$\text{mean}(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$$

Total Variance The total variance of the data matrix D is the average squared
distance of each point from the mean

$$\text{var}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n} \delta(\mathbf{x}_i, \boldsymbol{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n} \|\mathbf{x}_i - \boldsymbol{\mu}\|^2 \qquad (1.4)$$
Simplifying (1.4) we obtain

$$\begin{aligned}
\text{var}(\mathbf{D}) &= \frac{1}{n}\sum_{i=1}^{n}\left(\|\mathbf{x}_i\|^2 - 2\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\boldsymbol{\mu}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) + n\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\boldsymbol{\mu}^T\boldsymbol{\mu} + n\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2\right) - \|\boldsymbol{\mu}\|^2
\end{aligned}$$

The total variance is thus the difference between the average of the squared mag-
nitude of the data points and the squared magnitude of the mean (average of the
points).
Centered Data Matrix Often we need to center the data matrix by making the
mean coincide with the origin of the data space. The centered data matrix is obtained
by subtracting the mean from all the points

$$\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} - \begin{pmatrix} \boldsymbol{\mu}^T \\ \boldsymbol{\mu}^T \\ \vdots \\ \boldsymbol{\mu}^T \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T - \boldsymbol{\mu}^T \\ \mathbf{x}_2^T - \boldsymbol{\mu}^T \\ \vdots \\ \mathbf{x}_n^T - \boldsymbol{\mu}^T \end{pmatrix} = \begin{pmatrix} \mathbf{z}_1^T \\ \mathbf{z}_2^T \\ \vdots \\ \mathbf{z}_n^T \end{pmatrix} \qquad (1.5)$$

where $\mathbf{z}_i = \mathbf{x}_i - \boldsymbol{\mu}$ represents the centered point corresponding to $\mathbf{x}_i$, and $\mathbf{1} \in \mathbb{R}^n$
is the n-dimensional vector all of whose elements have value 1. The mean of the
centered data matrix Z is $\mathbf{0} \in \mathbb{R}^d$, since we have subtracted the mean $\boldsymbol{\mu}$ from all the
points $\mathbf{x}_i$.
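A short NumPy sketch (not from the text) of the mean, total variance, and centering operations just defined, for an arbitrary small numeric data matrix:

```python
import numpy as np

D = np.array([[5.9, 3.0],      # any numeric n x d data matrix
              [6.9, 3.1],
              [4.6, 3.2],
              [6.0, 2.2]])

mu = D.mean(axis=0)                                # mean vector (average of the rows)
var_D = np.mean(np.sum((D - mu) ** 2, axis=1))     # total variance, Eq. (1.4)

# Equivalent form derived above: average squared magnitude minus ||mu||^2.
var_alt = np.mean(np.sum(D ** 2, axis=1)) - np.sum(mu ** 2)

Z = D - mu          # centered data matrix Z = D - 1.mu^T (via broadcasting)
print(mu, round(var_D, 4), round(var_alt, 4), Z.mean(axis=0))
```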
[Figure 1.4: Orthogonal Projection]
1.3.3 Orthogonal Projection
Often in data mining we need to project a point or vector onto another vector, for
example to obtain a new point after a change of the basis vectors. Let a,b ∈ Rm
be two m-dimensional vectors. An orthogonal decomposition of the vector b in the
direction of another vector a, illustrated in Figure 1.4, is given as
$$\mathbf{b} = \mathbf{b}_{\parallel} + \mathbf{b}_{\perp} = \mathbf{p} + \mathbf{r} \qquad (1.6)$$

where $\mathbf{p} = \mathbf{b}_{\parallel}$ is parallel to a, and $\mathbf{r} = \mathbf{b}_{\perp}$ is perpendicular or orthogonal to a. The
vector p is called the orthogonal projection or simply projection of b on the vector
a. Note that the point $\mathbf{p} \in \mathbb{R}^m$ is the point closest to b on the line passing through
a. Thus, the magnitude of the vector $\mathbf{r} = \mathbf{b} - \mathbf{p}$ gives the perpendicular distance
between b and a, which is often interpreted as the residual or error vector between
the points b and p.

We can derive an expression for p by noting that $\mathbf{p} = c\mathbf{a}$ for some scalar c, since
p is parallel to a. Thus, $\mathbf{r} = \mathbf{b} - \mathbf{p} = \mathbf{b} - c\mathbf{a}$. Since p and r are orthogonal, we have

$$\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b} - c\mathbf{a}) = c\mathbf{a}^T\mathbf{b} - c^2\mathbf{a}^T\mathbf{a} = 0$$

which implies that

$$c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}$$

Therefore, the projection of b on a is given as

$$\mathbf{p} = \mathbf{b}_{\parallel} = c\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \qquad (1.7)$$
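Equation (1.7) translates directly into a few lines of NumPy (a sketch, not from the text); the residual r is orthogonal to a, which can be checked with a dot product.

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

c = (a @ b) / (a @ a)       # scalar c = a^T b / a^T a
p = c * a                   # orthogonal projection of b on a, Eq. (1.7)
r = b - p                   # residual (error) vector

print(p, r, a @ r)          # a @ r is 0 (up to round-off), so r is orthogonal to a
```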
[Figure 1.5: Projecting the Centered Data onto the Line ℓ]
Example 1.4: Restricting the Iris dataset to the first two dimensions, sepal
length and sepal width, the mean point is given as
$$\text{mean}(\mathbf{D}) = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$

which is shown as the black circle in Figure 1.2. The corresponding centered data
is shown in Figure 1.5, and the total variance is var(D) = 0.868 (centering does
not change this value).

Figure 1.5 shows the projection of each point onto the line ℓ, which is the line
that maximizes the separation between the class iris-setosa (squares) and the
other two classes (circles and triangles). The line ℓ is given as the set of all points
$(x_1, x_2)^T$ satisfying the constraint

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = c\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix}$$

for all scalars $c \in \mathbb{R}$.
1.3.4 Linear Independence and Dimensionality
Given the data matrix
$$\mathbf{D} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{pmatrix}^T = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}$$
we are often interested in the linear combinations of the rows (points) or the columns
(attributes). For instance, different linear combinations of the original d attributes
yield new derived attributes, which play a key role in feature extraction and dimen-
sionality reduction.
Given any set of vectors v1,v2, · · · ,vk in an m-dimensional vector space Rm,
their linear combination is given as
c1v1 + c2v2 + · · ·+ ckvk
where ci ∈ R are scalar values. The set of all possible linear combinations of the k
vectors is called the span, denoted as span(v1, · · · ,vk), which is itself a vector space
being a subspace of Rm. If span(v1, · · · ,vk) = Rm, then we say that v1, · · · ,vk is a
spanning set for Rm.
Row and Column Space There are several interesting vector spaces associated
with the data matrix D, two of which are the column space and row space of D.
The column space of D, denoted col(D), is the set of all linear combinations of the
d column vectors or attributes Xj ∈ Rn, i.e.,
col(D) = span(X1,X2, · · · ,Xd)
By definition col(D) is a subspace of Rn. The row space of D, denoted row(D), is
the set of all linear combinations of the n row vectors or points xi ∈ Rd, i.e.,
row(D) = span(x1,x2, · · · ,xn)
By definition row(D) is a subspace of Rd. Note also that the row space of D is the
column space of DT
row(D) = col(DT )
Linear Independence We say that the vectors v1, · · · ,vk are linearly dependent if
at least one vector can be written as a linear combination of the others. Alternatively,
the k vectors are linearly dependent if there are scalars c1, c2, · · · , ck, at least one of
which is not zero, such that
c1v1 + c2v2 + · · · + ckvk = 0
On the other hand, v1, · · · ,vk are linearly independent if and only if
c1v1 + c2v2 + · · ·+ ckvk = 0 implies c1 = c2 = · · · = ck = 0
Simply put, a set of vectors is linearly independent if none of them can be written
as a linear combination of the other vectors in the set.
Dimension and Rank Let S be a subspace of Rm. A basis for S is a set of
vectors in S, say v1, · · · ,vk, that are linearly independent and they span S, i.e.,
span(v1, · · · ,vk) = S. In fact, a basis is a minimal spanning set. If the vectors in
the basis are pair-wise orthogonal, they are said to form an orthogonal basis for S.
If, in addition, they are also normalized to be unit vectors, then they make up an
orthonormal basis for S. For instance, the standard basis for Rm is an orthonormal
basis consisting of the vectors
e1 =

1
0
...
0
 e2 =

0
1
...
0
 · · · em =

0
0
...
1

Any two bases for S must have the same number of vectors, and the number of
vectors in a basis for S is called the dimension of S, denoted as dim(S). Since S is
a subspace of Rm, we must have dim(S) ≤ m.
It is a remarkable fact that, for any matrix, the dimension of its row and column
space is the same, and this dimension is also called the rank of the matrix. For
the data matrix D ∈ Rn×d, we have rank(D) ≤ min(n, d), which follows from
the fact that the column space can have dimension at most d, and the row space
can have dimension at most n. Thus, even though the data points are ostensibly
in a d dimensional attribute space (the extrinsic dimensionality), if rank(D) < d,
then the data points reside in a lower dimensional subspace of Rd, and in this case
rank(D) gives an indication about the intrinsic dimensionality of the data. In fact,
with dimensionality reduction methods it is often possible to approximate D ∈ Rn×d
with a derived data matrix D′ ∈ Rn×k, which has much lower dimensionality, i.e.,
k ≪ d. In this case k may reflect the “true” intrinsic dimensionality of the data.
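The rank, and hence an upper bound on the intrinsic dimensionality, can be checked numerically with NumPy's matrix_rank; in the hypothetical sketch below (not from the text) the third attribute is a linear combination of the first two, so the rank is 2 even though d = 3.

```python
import numpy as np

# Columns X1, X2, X3 with X3 = X1 + 2*X2, so the columns are linearly dependent.
D = np.array([[1.0, 2.0, 5.0],
              [2.0, 0.0, 2.0],
              [3.0, 1.0, 5.0],
              [0.0, 4.0, 8.0]])

print(np.linalg.matrix_rank(D))   # 2: the points lie in a 2-dimensional subspace of R^3
```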
Example 1.5: The line ℓ in Figure 1.5 is given as $\ell = \text{span}\left((-2.15 \;\; 2.75)^T\right)$,
with dim(ℓ) = 1. After normalization, we obtain the orthonormal basis for ℓ as
the unit vector

$$\frac{1}{\sqrt{12.19}}\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix} = \begin{pmatrix} -0.615 \\ 0.788 \end{pmatrix}$$
1.4 Data: Probabilistic View
The probabilistic view of the data assumes that each numeric attribute X is a random
variable, defined as a function that assigns a real number to each outcome of an
experiment (i.e., some process of observation or measurement). Formally, X is a
function X : O → R, where O, the domain of X, is the set of all possible outcomes
of the experiment, also called the sample space, and R, the range of X, is the set of
real numbers. If the outcomes are numeric, and represent the observed values of the
random variable, then X : O → O is simply the identity function: X(v) = v for all
v ∈ O. The distinction between the outcomes and the value of the random variable
is important, since we may want to treat the observed values differently depending
on the context, as seen in Example 1.6.
A random variable X is called a discrete random variable if it takes on only
a finite or countably infinite number of values in its range, whereas X is called a
continuous random variable if it can take on any value in its range.
5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9
5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8
5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6
4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6
6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6
4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3
4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7
4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3
5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9
5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1
Table 1.2: Iris Dataset: sepal length (in centimeters)
Example 1.6: Consider the sepal length attribute (X1) for the Iris dataset in
Table 1.1. All n = 150 values of this attribute are shown in Table 1.2, which lie
in the range [4.3, 7.9] with centimeters as the unit of measurement. Let us assume
that these constitute the set of all possible outcomes O.
By default, we can consider the attribute X1 to be a continuous random vari-
able, given as the identity function X1(v) = v, since the outcomes (sepal length
values) are all numeric.
On the other hand, if we want to distinguish between Iris flowers with short
and long sepal lengths, with long being, say, a length of 7cm or more, we can define
a discrete random variable A as follows
\[
A(v) =
\begin{cases}
0 & \text{if } v < 7 \\
1 & \text{if } v \geq 7
\end{cases}
\]
In this case the domain of A is [4.3, 7.9], and its range is {0, 1}. Thus, A assumes
non-zero probability only at the discrete values 0 and 1.
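In code, this mapping is just an indicator function applied to each observed value. The sketch below is illustrative only; the array `sepal_length` is a placeholder holding a few of the values from Table 1.2.

import numpy as np

sepal_length = np.array([5.9, 6.9, 6.6, 4.6, 6.0])   # placeholder for the 150 values in Table 1.2
A = (sepal_length >= 7).astype(int)                  # A(v) = 1 if v >= 7, else 0
print(A)                                             # [0 0 0 0 0] for these five outcomes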
Probability Mass Function If X is discrete, the probability mass function of X
is defined as
f(x) = P (X = x) for all x ∈ R
In other words, the function f gives the probability P (X = x) that the random
variable X has the exact value x. The name “probability mass function” intuitively
conveys the fact that the probability is concentrated or massed at only discrete values
in the range of X, and is zero for all other values. f must also obey the basic rules
of probability. That is, f must be non-negative
f(x) ≥ 0
and the sum of all probabilities should add to 1
\[
\sum_x f(x) = 1
\]
Example 1.7 (Bernoulli and Binomial Distribution): In Example 1.6, A
was defined as discrete random variable representing long sepal length. From the
sepal length data in Table 1.2 we find that only 13 Irises have sepal length of at
least 7cm. We can thus estimate the probability mass function of A as follows
\[
f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p
\]
and
\[
f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p
\]
In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1],
which denotes the probability of a success, i.e., the probability of picking an Iris
with a long sepal length at random from the set of all points. On the other hand,
1−p is the probability of a failure, i.e., of not picking an Iris with long sepal length.
Let us consider another discrete random variable B, denoting the number of
Irises with long sepal length in m independent Bernoulli trials with probability of
success p. In this case, B takes on the discrete values [0,m], and its probability
mass function is given by the Binomial distribution
\[
f(k) = P(B = k) = \binom{m}{k} p^k (1-p)^{m-k}
\]
The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking k long
sepal length Irises out of the m trials. For each selection of k long sepal length
Irises, the total probability of the k successes is $p^k$, and the total probability of the m − k
failures is $(1-p)^{m-k}$. For example, since p = 0.087 from above, the probability of
observing exactly k = 2 Irises with long sepal length in m = 10 trials is given as
\[
f(2) = P(B = 2) = \binom{10}{2} (0.087)^2 (0.913)^8 = 0.164
\]
Figure 1.6 shows the full probability mass function for different values of k for
m = 10. Since p is quite small, the probability of k successes in so few trials falls
off rapidly as k increases, becoming practically zero for values of k ≥ 6.
Figure 1.6: Binomial Distribution: Probability Mass Function (m = 10, p = 0.087)
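The probabilities plotted in Figure 1.6 can be recomputed directly from the binomial formula; the following sketch uses only the Python standard library and the estimates m = 10 and p = 0.087 from the example.

from math import comb

m, p = 10, 0.087

def binom_pmf(k):
    # P(B = k) = C(m, k) * p^k * (1 - p)^(m - k)
    return comb(m, k) * p**k * (1 - p)**(m - k)

print(round(binom_pmf(2), 3))                           # 0.164, as computed above
print([round(binom_pmf(k), 3) for k in range(m + 1)])   # full PMF for k = 0, ..., 10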
Probability Density Function If X is continuous, its range is the entire set
of real numbers R. The probability of any specific value x is only one out of the
infinitely many possible values in the range of X, which means that P (X = x) = 0
for all x ∈ R. However, this does not mean that the value x is impossible, since
in that case we would conclude that all values are impossible! What it means is
that the probability mass is spread so thinly over the range of values, that it can be
measured only over intervals [a, b] ⊂ R, rather than at specific points. Thus, instead
of the probability mass function, we define the probability density function, which
specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R
\[
P\left(X \in [a, b]\right) = \int_a^b f(x)\,dx
\]
As before, the density function f must satisfy the basic laws of probability
\[
f(x) \geq 0, \text{ for all } x \in \mathbb{R}
\]
and
\[
\int_{-\infty}^{\infty} f(x)\,dx = 1
\]
We can get an intuitive understanding of the density function f by considering
the probability density over a small interval of width 2ǫ > 0, centered at x, namely
$[x - \epsilon, x + \epsilon]$
\[
P\left(X \in [x-\epsilon, x+\epsilon]\right) = \int_{x-\epsilon}^{x+\epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x)
\]
\[
f(x) \simeq \frac{P\left(X \in [x-\epsilon, x+\epsilon]\right)}{2\epsilon} \tag{1.8}
\]
f(x) thus gives the probability density at x, given as the ratio of the probability
mass to the width of the interval, i.e., the probability mass per unit distance. Thus,
it is important to note that P(X = x) ≠ f(x).
Even though the probability density function f(x) does not specify the probabil-
ity P (X = x), it can be used to obtain the relative probability of one value x1 over
another x2, since for a given ǫ > 0, by (1.8), we have
\[
\frac{P(X \in [x_1 - \epsilon, x_1 + \epsilon])}{P(X \in [x_2 - \epsilon, x_2 + \epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \tag{1.9}
\]
Thus, if f(x1) is larger than f(x2), then values of X close to x1 are more probable
than values close to x2, and vice versa.
Example 1.8 (Normal Distribution): Consider again the sepal length values
from the Iris dataset, as shown in Table 1.2. Let us assume that these values
follow a Gaussian or normal density function, given as
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ \frac{-(x-\mu)^2}{2\sigma^2} \right\}
\]
Figure 1.7: Normal Distribution: Probability Density Function (µ = 5.84, σ² = 0.681)
There are two parameters of the normal density distribution, namely, µ, which
represents the mean value, and σ2, which represents the variance of the values
(these parameters will be discussed in Chapter 2). Figure 1.7 shows the character-
istic “bell” shape plot of the normal distribution. The parameters, µ = 5.84 and
σ2 = 0.681, were estimated directly from the data for sepal length in Table 1.2.
Whereas $f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}} \exp\{0\} = 0.483$, we emphasize that
the probability of observing X = µ is zero, i.e., P (X = µ) = 0. Thus, P (X = x)
is not given by f(x), rather, P (X = x) is given as the area under the curve for an
infinitesimally small interval [x − ǫ, x + ǫ] centered at x, with ǫ > 0. Figure 1.7
illustrates this with the shaded region centered at µ = 5.84. From (1.8), we have
P (X = µ) ≃ 2ǫ · f(µ) = 2ǫ · 0.483 = 0.967ǫ
As ǫ → 0, we get P(X = µ) → 0. However, based on (1.9) we can claim that the
probability of observing values close to the mean value µ = 5.84 is about 2.69 times the
probability of observing values close to x = 7, since
\[
\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69
\]
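The density ratio above is easy to verify numerically. The sketch below assumes the sepal length values are loaded into a NumPy array `x` (here only a placeholder slice of Table 1.2 is shown); it estimates µ and σ² from the data and evaluates the normal density.

import numpy as np

x = np.array([5.9, 6.9, 6.6, 4.6, 6.0])   # placeholder; in practice all 150 values of Table 1.2
mu, var = x.mean(), x.var()               # sample estimates of mu and sigma^2

def normal_pdf(v, mu, var):
    return np.exp(-(v - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# With the full dataset (mu = 5.84, var = 0.681) this ratio is about 2.69.
print(normal_pdf(mu, mu, var) / normal_pdf(7.0, mu, var))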
Cumulative Distribution Function For any random variable X, whether dis-
crete or continuous, we can define the cumulative distribution function (CDF) F :
$\mathbb{R} \to [0, 1]$, that gives the probability of observing a value at most some given value x
\[
F(x) = P(X \leq x) \text{ for all } -\infty < x < \infty
\]
When X is discrete, F is given as
\[
F(x) = P(X \leq x) = \sum_{u \leq x} f(u)
\]
and when X is continuous, F is given as
\[
F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\,du
\]
Figure 1.8: Cumulative Distribution Function for the Binomial Distribution
Example 1.9 (Cumulative Distribution Function): Figure 1.8 shows the cu-
mulative distribution function for the binomial distribution in Figure 1.6. It has
the characteristic step shape (right continuous, non-decreasing), as expected for a
discrete random variable. F (x) has the same value F (k) for all x ∈ [k, k + 1) with
0 ≤ k < m, where m is the number of trials and k is the number of successes.
The closed (filled) and open circles demarcate the corresponding closed and open
interval [k, k + 1). For instance, F (x) = 0.404 = F (0) for all x ∈ [0, 1).
Figure 1.9: Cumulative Distribution Function for the Normal Distribution
Figure 1.9 shows the cumulative distribution function for the normal density
function shown in Figure 1.7. As expected, for a continuous random variable,
the CDF is also continuous, and non-decreasing. Since the normal distribution is
symmetric about the mean, we have F (µ) = P (X ≤ µ) = 0.5.
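For the discrete case, the CDF in Figure 1.8 is just the running sum of the binomial probabilities. A minimal sketch, reusing m = 10 trials and p = 13/150 from the earlier example, is:

from math import comb

m, p = 10, 13 / 150
pmf = [comb(m, k) * p**k * (1 - p)**(m - k) for k in range(m + 1)]

def F(x):
    # F(x) = P(B <= x): accumulate the mass at every integer value u <= x
    return sum(pmf[k] for k in range(m + 1) if k <= x)

print(round(F(0.5), 3))   # 0.404 = F(0) for all x in [0, 1), the first step in Figure 1.8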
1.4.1 Bivariate Random Variables
Instead of considering each attribute as a random variable, we can also perform pair-
wise analysis by considering a pair of attributes, X1 and X2, as a bivariate random
variable
\[
X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
\]
X : O → R² is a function that assigns to each outcome in the sample space a pair
of real numbers, i.e., a 2-dimensional vector $(x_1, x_2)^T \in \mathbb{R}^2$. As in the univariate case,
if the outcomes are numeric, then the default is to assume X to be the identity
function.
Joint Probability Mass Function If X1 and X2 are both discrete random vari-
ables then X has a joint probability mass function given as follows
f(x) = f(x1, x2) = P (X1 = x1,X2 = x2) = P (X = x)
f must satisfy the following two conditions
\[
f(x) = f(x_1, x_2) \geq 0 \text{ for all } -\infty < x_1, x_2 < \infty
\]
\[
\sum_{x} f(x) = \sum_{x_1} \sum_{x_2} f(x_1, x_2) = 1
\]
Joint Probability Density Function If X1 and X2 are both continuous random
variables then X has a joint probability density function f given as follows
\[
P(X \in W) = \iint_{x \in W} f(x)\,dx = \iint_{(x_1,x_2)^T \in W} f(x_1, x_2)\,dx_1\,dx_2
\]
where W ⊂ R² is some subset of the 2-dimensional space of reals. f must also satisfy
the following two conditions
\[
f(x) = f(x_1, x_2) \geq 0 \text{ for all } -\infty < x_1, x_2 < \infty
\]
\[
\int_{\mathbb{R}^2} f(x)\,dx = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 = 1
\]
As in the univariate case, the probability mass $P(x) = P\left((x_1, x_2)^T\right) = 0$ for
any particular point x. However, we can use f to compute the probability density
at x. Consider the square region $W = \left([x_1 - \epsilon, x_1 + \epsilon], [x_2 - \epsilon, x_2 + \epsilon]\right)$, i.e., a
window of width $2\epsilon$ centered at $x = (x_1, x_2)^T$. The probability density at x can be
approximated as
\[
P(X \in W) = P\left(X \in \left([x_1 - \epsilon, x_1 + \epsilon], [x_2 - \epsilon, x_2 + \epsilon]\right)\right)
= \int_{x_1-\epsilon}^{x_1+\epsilon} \int_{x_2-\epsilon}^{x_2+\epsilon} f(x_1, x_2)\,dx_1\,dx_2
\simeq 2\epsilon \cdot 2\epsilon \cdot f(x_1, x_2)
\]
which implies that
\[
f(x_1, x_2) \simeq \frac{P(X \in W)}{(2\epsilon)^2}
\]
The relative probability of one value $(a_1, a_2)$ versus another $(b_1, b_2)$ can therefore be
computed via the probability density function
\[
\frac{P\left(X \in ([a_1 - \epsilon, a_1 + \epsilon], [a_2 - \epsilon, a_2 + \epsilon])\right)}{P\left(X \in ([b_1 - \epsilon, b_1 + \epsilon], [b_2 - \epsilon, b_2 + \epsilon])\right)} \simeq \frac{(2\epsilon)^2 \cdot f(a_1, a_2)}{(2\epsilon)^2 \cdot f(b_1, b_2)} = \frac{f(a_1, a_2)}{f(b_1, b_2)}
\]
Figure 1.10: Joint Probability Mass Function: X1 (long sepal length), X2 (long sepal width)
Example 1.10 (Bivariate Distributions): Consider the sepal length and
sepal width attributes in the Iris dataset, plotted in Figure 1.2. Let A denote
the Bernoulli random variable corresponding to long sepal length (at least 7cm),
as defined in Example 1.7.
Define another Bernoulli random variable B corresponding to long sepal width,
say, at least 3.5cm. Let $X = (A, B)^T$ be the discrete bivariate random variable, then
the joint probability mass function of X can be estimated from the data as follows
\[
f(0, 0) = P(A = 0, B = 0) = \frac{116}{150} = 0.773
\]
\[
f(0, 1) = P(A = 0, B = 1) = \frac{21}{150} = 0.140
\]
\[
f(1, 0) = P(A = 1, B = 0) = \frac{10}{150} = 0.067
\]
\[
f(1, 1) = P(A = 1, B = 1) = \frac{3}{150} = 0.020
\]
Figure 1.10 shows a plot of this probability mass function.
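These four estimates are simply normalized co-occurrence counts. A sketch of the computation is shown below; `sepal_length` and `sepal_width` are placeholders standing in for the two Iris attributes.

import numpy as np

# Placeholders: in practice these arrays hold the 150 observed values of the two attributes.
sepal_length = np.array([5.9, 6.9, 7.7, 4.6])
sepal_width  = np.array([3.0, 3.1, 3.8, 3.6])

A = (sepal_length >= 7.0).astype(int)   # long sepal length
B = (sepal_width >= 3.5).astype(int)    # long sepal width

joint_pmf = {(a, b): np.mean((A == a) & (B == b)) for a in (0, 1) for b in (0, 1)}
print(joint_pmf)   # on the full dataset these fractions are 0.773, 0.140, 0.067, 0.020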
Treating attributes X1 and X2 in the Iris dataset (see Table 1.1) as con-
tinuous random variables, we can define a continuous bivariate random variable
$X = (X_1, X_2)^T$. Assuming that X follows a bivariate normal distribution, its joint
probability density function is given as
\[
f(x \mid \mu, \Sigma) = \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\left\{ -\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2} \right\}
\]
Here µ and Σ are the parameters of the bivariate normal distribution, representing
the 2-dimensional mean vector and covariance matrix, which will be discussed in
detail in Chapter 2. Further |Σ| denotes the determinant of Σ. The plot of the
bivariate normal density is given in Figure 1.11, with mean
\[
\mu = (5.843, 3.054)^T
\]
and covariance matrix
\[
\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}
\]
It is important to emphasize that the function f(x) specifies only the probability
density at x, and f(x) ≠ P(X = x). As before, we have P(X = x) = 0.
Figure 1.11: Bivariate Normal Density: µ = (5.843, 3.054)^T (black point)
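A few lines of NumPy suffice to evaluate this density for the quoted parameters; the sketch below is an illustration, not the authors' code.

import numpy as np

mu = np.array([5.843, 3.054])
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])

def bivariate_normal_pdf(x, mu, Sigma):
    d = x - mu
    # f(x) = exp(-(x - mu)^T Sigma^{-1} (x - mu) / 2) / (2*pi*sqrt(|Sigma|))
    quad = d @ np.linalg.inv(Sigma) @ d
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

print(bivariate_normal_pdf(mu, mu, Sigma))   # density at the mean, the peak visible in Figure 1.11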
Joint Cumulative Distribution Function The joint cumulative distribution
function for two random variables X1 and X2 is defined as the function F , such
that for all values x1, x2 ∈ (−∞,∞),
F (x) = F (x1, x2) = P (X1 ≤ x1 and X2 ≤ x2) = P (X ≤ x)
Statistical Independence Two random variables X1 and X2 are said to be (sta-
tistically) independent if, for every W1 ⊂ R and W2 ⊂ R, we have
P (X1 ∈W1 and X2 ∈W2) = P (X1 ∈W1) · P (X2 ∈W2)
Furthermore, if X1 and X2 are independent, then the following two conditions are
also satisfied
F (x) = F (x1, x2) = F1(x1) · F2(x2)
f(x) = f(x1, x2) = f1(x1) · f2(x2)
where Fi is the cumulative distribution function, and fi is the probability mass or
density function for random variable Xi.
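As a concrete check (an illustration using the estimates from Example 1.10), one can compare the estimated joint mass f(1, 1) with the product of the marginals f1(1) · f2(1):

joint = {(0, 0): 116/150, (0, 1): 21/150, (1, 0): 10/150, (1, 1): 3/150}

fA1 = joint[(1, 0)] + joint[(1, 1)]   # marginal P(A = 1) = 13/150
fB1 = joint[(0, 1)] + joint[(1, 1)]   # marginal P(B = 1) = 24/150

print(round(joint[(1, 1)], 3), round(fA1 * fB1, 3))
# 0.02 versus 0.014: the joint mass differs from the product of the marginals,
# so long sepal length and long sepal width do not look independent in this sample.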
1.4.2 Multivariate Random Variable
A d-dimensional multivariate random variable X = (X1,X2, · · · ,Xd)T , also called a
vector random variable, is defined as a function that assigns a vector of real numbers
to each outcome in the sample space, i.e., X : O → Rd. The range of X can be
denoted as a vector x = (x1, x2, · · · , xd)T . In case all Xj are numeric, then X is
by default assumed to be the identity function. In other words, if all attributes are
numeric, we can treat each outcome in the sample space (i.e., each point in the data
matrix) as a vector random variable. On the other hand, if the attributes are not
all numeric, then X maps the outcomes to numeric vectors in its range.
If all Xj are discrete, then X is jointly discrete and its joint probability mass
function f is given as
f(x) = P (X = x)
f(x1, x2, · · · , xd) = P (X1 = x1,X2 = x2, · · · ,Xd = xd)
If all Xj are continuous, then X is jointly continuous and its joint probability density
function is given as
\[
P(X \in W) = \int \cdots \int_{x \in W} f(x)\,dx
\]
\[
P\left((X_1, X_2, \cdots, X_d)^T \in W\right) = \int \cdots \int_{(x_1, x_2, \cdots, x_d)^T \in W} f(x_1, x_2, \cdots, x_d)\,dx_1\,dx_2 \cdots dx_d
\]
for any d-dimensional region W ⊆ R^d.
The laws of probability must be obeyed as usual, i.e., f(x) ≥ 0 and sum of f
over all x in the range of X must be 1. The joint cumulative distribution function
of X = (X1, · · · ,Xd)T is given as
F (x) = P (X ≤ x)
F (x1, x2, · · · , xd) = P (X1 ≤ x1,X2 ≤ x2, · · · ,Xd ≤ xd)
for every point x ∈ Rd.
We say that X1,X2, · · · ,Xd are independent random variables if and only if, for
every region Wi ⊂ R, we have
\[
P(X_1 \in W_1 \text{ and } X_2 \in W_2 \cdots \text{ and } X_d \in W_d) = P(X_1 \in W_1) \cdot P(X_2 \in W_2) \cdots P(X_d \in W_d) \tag{1.10}
\]
If X1, X2, · · · , Xd are independent then the following conditions are also satisfied
\[
F(x) = F(x_1, \cdots, x_d) = F_1(x_1) \cdot F_2(x_2) \cdots F_d(x_d)
\]
\[
f(x) = f(x_1, \cdots, x_d) = f_1(x_1) \cdot f_2(x_2) \cdots f_d(x_d) \tag{1.11}
\]
where Fi is the cumulative distribution function, and fi is the probability mass or
density function for random variable Xi.
1.4.3 Random Sample and Statistics
The probability mass or density function of a random variable X may follow some
known form, or as is often the case in data analysis, it may be unknown. When
the probability function is not known, it may still be convenient to assume that
the values follow some known distribution, based on the characteristics of the data.
However, even in this case, the parameters of the distribution may still be unknown.
Thus, in general, either the parameters, or the entire distribution, may have to be
estimated from the data.
In statistics, the word population is used to refer to the set or universe of all en-
tities under study. Usually we are interested in certain characteristics or parameters
of the entire population (e.g., the mean age of all computer science students in the
US). However, looking at the entire population may not be feasible or may be too
expensive. Instead, we try to make inferences about the population parameters by
drawing a random sample from the population, and by computing appropriate statis-
tics from the sample that give estimates of the corresponding population parameters
of interest.
Univariate Sample Given a random variable X, a random sample of size n from
X is defined as a set of n independent and identically distributed (IID) random
variables S1, S2, · · · , Sn, i.e., all of the Si’s are statistically independent of each
other, and follow the same probability mass or density function as X.
If we treat attribute X as a random variable, then each of the observed values
of X, namely, xi (1 ≤ i ≤ n), are themselves treated as identity random variables,
and the observed data is assumed to be a random sample drawn from X. That is,
all xi are considered to be mutually independent and identically distributed as X.
By (1.11) their joint probability function is given as
\[
f(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_X(x_i)
\]
where fX is the probability mass or density function for X.
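Because of this product form, in practice the joint probability of a random sample is usually computed as a sum of log-densities. A minimal sketch, assuming a normal fX with the parameters estimated earlier (µ = 5.84, σ² = 0.681), is:

import numpy as np

mu, var = 5.84, 0.681
x = np.array([5.9, 6.9, 6.6, 4.6, 6.0])   # placeholder random sample of sepal length values

# log f(x1, ..., xn) = sum_i log fX(xi) for an IID sample
log_f = -0.5 * np.log(2 * np.pi * var) - (x - mu)**2 / (2 * var)
print(log_f.sum())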
Multivariate Sample For multivariate parameter estimation, the n data points
xi (with 1 ≤ i ≤ n) constitute a d-dimensional multivariate random sample drawn
from the vector random variable X = (X1,X2, · · · ,Xd). That is, xi are assumed to
be independent and identically distributed, and thus their joint distribution is given
as
\[
f(x_1, x_2, \cdots, x_n) = \prod_{i=1}^{n} f_X(x_i) \tag{1.12}
\]
where fX is the probability mass or density function for the vector random variable X.