CLUSTER ANALYSIS 
 
Amrender Kumar 
I.A.S.R.I., Library Avenue, New Delhi – 110 012 
akjha@iasri.res.in 
 
Cluster analysis encompasses many diverse techniques for discovering structure within 
complex bodies of data. In a typical example, one has a sample of data units (subjects, persons, 
cases), each described by scores on selected variables (attributes, characteristics, 
measurements). The objective is to group either the data units or the variables into clusters such 
that elements within a cluster have a high degree of "natural association" among themselves 
while the clusters are relatively distinct from one another. Searching the data for a structure of 
"natural" groupings is an important exploratory technique. The most important techniques for 
data classification are 
(1) Cluster analysis 
(2) Discriminant analysis 
 
Although both cluster and discriminant analysis classify objects into categories, discriminant 
analysis requires group membership to be known for the cases used to derive the 
classification rule, whereas in cluster analysis group membership is unknown for all cases. In 
addition to membership, the number of groups is also generally unknown. Cluster analysis is 
the more primitive technique in that no assumptions are made concerning the number of groups 
or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities). 
Thus, in cluster analysis the inputs are similarity measures or the data from which 
these can be computed. The definition of similarity or homogeneity varies from analysis to 
analysis and depends on the objective of the study. 
 
Cluster analysis is a technique used for combining observations into groups such that: 
(a) each group is homogeneous or compact with respect to certain characteristics, i.e., 
observations in each group are similar to each other; 
(b) each group is different from the other groups with respect to those characteristics, 
i.e., observations of one group differ from the observations of other groups. 
 
The need for cluster analysis arises naturally in many fields such as the life sciences, 
medicine, engineering, agriculture, the social sciences, etc. In medicine, cluster analysis is used to 
identify diseases and their stages; for example, by examining patients who are diagnosed as 
depressed, one may find several distinct sub-groups of patients with different types 
of depression. In marketing, cluster analysis is used to identify persons with similar buying 
habits; by examining their characteristics it becomes possible to plan future marketing 
strategies more efficiently. 
 
Steps in Cluster Analysis 
The objective of cluster analysis is to group observations into clusters such that each cluster is 
as homogeneous as possible with respect to the clustering variables. The main steps in a 
cluster analysis are: 
(i) Select a measure of similarity. 
(ii) Decide on the type of clustering technique to be used. 
(iii) Select the clustering method for the chosen technique. 
(iv) Decide on the number of clusters. 
(v) Interpret the cluster solution. 
 
No general statement about cluster analysis is possible, as a vast number of clustering methods 
have been developed in several different fields, with different definitions of clusters and of 
similarity. There are many kinds of clusters, namely: 
• Disjoint clusters, where every object appears in a single cluster. 
• Hierarchical clusters, where one cluster can be completely contained in another 
cluster, but no other kind of overlap is permitted. 
• Overlapping clusters. 
• Fuzzy clusters, defined by a probability of membership of each object in each 
cluster. 
 
Similarity Measures 
A measure of closeness is required to form simple group structures from complex data sets. A 
great deal of subjectivity is involved in the choice of a similarity measure. Important 
considerations are the nature of the variables (discrete, continuous, or binary), the scales of 
measurement (nominal, ordinal, interval, ratio, etc.), and subject-matter knowledge. If items 
are to be clustered, proximity is usually indicated by some sort of distance, whereas variables 
are usually grouped on the basis of some measure of association, such as the correlation 
coefficient. Some of the measures are given below. 
 
Qualitative Variables 
Consider k binary variables observed on each of n units. For a pair of units i and j, the k 
responses can be cross-classified as: 

                              jth unit 
                       Yes          No           Total 
 ith unit   Yes        K11          K12          K11 + K12 
            No         K21          K22          K21 + K22 
            Total      K11 + K21    K12 + K22    K 

where K11 is the number of variables on which both units respond "Yes", and so on. 

Simple matching coefficient 

(% matches)   dij = (K11 + K22)/K    (i, j = 1, 2, …, n) 

For example, if two units agree on 7 out of k = 10 binary variables, dij = 7/10 = 0.7. 
This can easily be generalised to polytomous responses. 
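
In SAS, matching-type similarities between units can be computed with the DISTANCE 
procedure. The following is a minimal sketch, assuming a hypothetical data set binary with 
four 0/1 variables q1-q4, and assuming that METHOD=MATCH requests the simple matching 
coefficient for nominal-scaled variables: 

/* hypothetical binary data: 3 units, 4 binary variables */ 
data binary; 
   input unit q1-q4; 
   cards; 
1 1 0 1 1 
2 1 1 1 0 
3 0 0 1 1 
; 
/* METHOD=MATCH: proportion of variables on which two units agree */ 
proc distance data=binary method=match out=sim; 
   var nominal(q1-q4); 
   id unit; 
run; 
proc print data=sim; 
run; 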
 
Quantitative Variables 
In the case of k quantitative variables recorded on n cases, the observations can be arranged 
as the n × k data matrix 

   X11  X12  X13  . . .  X1k 
   X21  X22  X23  . . .  X2k 
    .    .    .           . 
    .    .    .           . 
   Xn1  Xn2  Xn3  . . .  Xnk 
Similarity: rij (i, j = 1, 2, …, n) is the correlation between the observations Xik and Xjk 
computed across the k variables (not the same as the correlation between two variables). 

Distance: the Euclidean distance 

   dij = √[ Σk (Xik − Xjk)² ] 

The X's are usually standardised before the distances are computed. For a single variable the 
distance reduces to |Xik − Xjk|. 
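
In SAS, Euclidean distances between units can likewise be obtained from PROC DISTANCE 
(PROC CLUSTER can also compute them internally from coordinate data). A minimal sketch, 
using the variable names of the food-nutrient example later in this note, and assuming the 
VAR statement's STD=STD option standardises each variable to mean 0 and standard 
deviation 1: 

/* standardised Euclidean distances between the n units */ 
proc distance data=test method=euclid out=dist; 
   var interval(Calories Protein Fat Calcium Iron / std=std); 
   id Food; 
run; 

Alternatively, the STD option of the PROC CLUSTER statement standardises the variables 
before clustering. 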
 
Hierarchical Clustering Technique 
Hierarchical clustering techniques proceed by either a series of successive mergers or a 
series of successive divisions. Consider the natural (agglomerative) process of grouping: 
• Each unit is an entity to start with. 
• Merge first the two units that are most similar (smallest dij); the merged pair now 
forms a single entity. 
• Examine the mutual distances between the resulting (n − 1) entities. 
• Merge the two that are most similar. 
• Repeat the process, merging until everything is combined into one entity. 
• At each stage of the agglomerative process, note the distance between the two 
merging entities. 
• Choose the stage that shows a sudden jump in this distance, since a jump indicates 
that two very dissimilar entities are being merged (see the sketch after this list). 
This choice can be subjective. 
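
The jump criterion can be checked numerically from the OUTTREE data set that PROC 
CLUSTER creates. A rough sketch, assuming (as in the example later in this note) an output 
tree data set named tree in which the merge nodes are named CLn and carry the merge 
distance in the _HEIGHT_ variable: 

/* gap between successive merge heights; a large jump suggests 
   cutting the tree just before that merge                      */ 
proc sort data=tree out=merges; 
   where _name_ =: 'CL';        /* keep merge (non-leaf) nodes only */ 
   by _height_; 
run; 
data jumps; 
   set merges; 
   jump = _height_ - lag(_height_); 
run; 
proc print data=jumps; 
   var _ncl_ _name_ _height_ jump; 
run; 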
 
A number of different rules or methods have been suggested for computing the distance 
between two clusters. In fact, the various hierarchical clustering algorithms differ mainly in 
how the distance between two clusters is computed. Some of the popular methods, 
illustrated in the sketch after this list, are: 
• Single linkage – works on the principle of the smallest distance, or nearest 
neighbour. 
• Complete linkage – works on the principle of the largest distance, or farthest 
neighbour. 
• Average linkage – works on the principle of average distance: the average of the 
distances between every unit of one entity and every unit of the other entity. 
• Centroid – the distance between two clusters is the distance between their 
centroids (mean vectors); when two clusters are merged, the centroid of the new 
cluster is recomputed from the combined items. 
• Ward's – forms clusters by maximising within-cluster homogeneity; the within-
group sum of squares is used as the measure of homogeneity, and at each step the 
merge producing the smallest increase in it is chosen. 
• Two-stage density linkage – 
   • units are first assigned to modal entities on the basis of density 
(frequency) estimates from the kth nearest neighbours; 
   • modal entities are allowed to join later on. 
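
In SAS these rules are selected through the METHOD= option of PROC CLUSTER. A 
minimal sketch, again borrowing the variable names of the food-nutrient example below: 

/* METHOD= may be SINGLE, COMPLETE, AVERAGE, CENTROID, WARD or 
   TWOSTAGE; for TWOSTAGE, K= gives the number of nearest 
   neighbours used in the density estimates                     */ 
proc cluster data=test method=ward out=tree_w; 
   var Calories Protein Fat Calcium Iron; 
   id Food; 
run; 

proc cluster data=test method=twostage k=5 out=tree_t; 
   var Calories Protein Fat Calcium Iron; 
   id Food; 
run; 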
Non-hierarchical Clustering Technique 
In non-hierarchical clustering, the data are divided into k partitions or groups, with each 
partition representing a cluster. Therefore, as opposed to hierarchical clustering, the number 
of clusters must be known a priori. Non-hierarchical clustering techniques basically follow 
these steps (a SAS sketch follows the list): 
1. Select k initial cluster centroids or seeds, where k is the number of clusters desired. 
2. Assign each observation to the cluster to which it is closest. 
3. Reassign or re-allocate each observation to one of the k clusters according to a pre-
determined stopping rule. 
4. Stop if there is no re-allocation of data points, or if the reassignment satisfies the 
criterion set by the stopping rule. Otherwise go to step 2. 
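
SAS implements this scheme in PROC FASTCLUS, described in the next section. A minimal 
sketch with k = 3 clusters on the food-nutrient data: 

/* k-means with 3 clusters; MAXITER= bounds the reassignment passes */ 
proc fastclus data=test maxclusters=3 maxiter=20 out=kout; 
   var Calories Protein Fat Calcium Iron; 
   id Food; 
run; 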
 
SAS Cluster Procedures 
The SAS procedures for clustering are oriented towards disjoint or hierarchical clusters 
computed from coordinate data, distances, or a correlation or covariance matrix. The 
following procedures are used for clustering: 
CLUSTER    Does hierarchical clustering of observations. 
FASTCLUS   Finds disjoint clusters of observations using a k-means method applied to 
           coordinate data; recommended for large data sets. 
VARCLUS    Does hierarchical as well as disjoint (non-hierarchical) clustering of variables. 
TREE       Draws tree diagrams, or dendrograms, using output from the CLUSTER or 
           VARCLUS procedures. 
 
The TREE procedure is considered very important because it produces a tree diagram, 
also known as a dendrogram, from a data set created by the CLUSTER or VARCLUS 
procedure, and it can also create output data sets giving the results of hierarchical clustering 
as a tree structure. The following terminology relates to the TREE procedure: 
Leaves           The objects that are clustered. 
Root             The cluster containing all the objects. 
Branch           A cluster containing at least two objects, but not all of them. 
Node             A general term for leaves, branches, and the root. 
Parent & Child   If A is the union of clusters B and C, then A is the parent and B and C 
                 are the children. 
 
Specifications 
The TREE procedure is invoked by the following statements: 

PROC TREE <options>; 

Optional statements: 

NAME variable; 
HEIGHT variable; 
PARENT variable; 
BY variables; 
COPY variables; 
FREQ variable; 
ID variable; 
If the data set has been created by CLUSTER or VARCLUS, the only required statement is 
PROC TREE. The optional statements listed above are specified after the PROC TREE 
statement. 
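
For instance, a minimal invocation on an output tree data set named tree might look as 
follows (the HORIZONTAL option and the use of _NCL_ as the height variable are sketched 
here as assumptions about the procedure's options): 

/* horizontal dendrogram, with the height axis showing the 
   number of clusters instead of the merge distance          */ 
proc tree data=tree horizontal; 
   height _ncl_; 
run; 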
 
Example: Given below are food-nutrient data on calories, protein, fat, calcium and iron for 
27 food items. The objective of the study is to identify suitable clusters of the food items 
based on the five variables. 
 
Food Items Calories Protein Fat Calcium Iron 
1 340 20 28 9 2.6 
2 245 21 17 9 2.7 
3 420 15 39 7 2 
4 375 19 32 9 2.6 
5 180 22 10 17 3.7 
6 115 20 3 8 1.4 
7 170 25 7 12 1.5 
8 160 26 5 14 5.9 
9 265 20 20 9 2.6 
10 300 18 25 9 2.3 
11 340 20 28 9 2.5 
12 340 19 29 9 2.5 
13 355 19 30 9 2.4 
14 205 18 14 7 2.5 
15 185 23 9 9 2.7 
16 135 22 4 25 0.6 
17 70 11 1 82 6 
18 45 7 1 74 5.4 
19 90 14 2 38 0.8 
20 135 16 5 15 0.5 
21 200 19 13 5 1 
22 155 16 9 157 1.8 
23 195 16 11 14 1.3 
24 120 17 5 159 0.7 
25 180 22 9 367 2.5 
26 170 25 7 7 1.2 
27 170 23 1 98 2.6 
/* the 27 data lines from the table above are entered after CARDS */ 
data test; 
   input Food Calories Protein Fat Calcium Iron; 
   cards; 
; 
proc cluster data=test method=centroid rmsstd rsquare nonorm out=tree; 
   id Food; 
   var Calories Protein Fat Calcium Iron; 
run; 
proc tree data=tree out=clus3 nclusters=3; 

/* It will divide the data into three clusters */ 

   id Food; 
   copy Calories Protein Fat Calcium Iron; 
run; 
proc sort data=clus3; 
   by cluster; 
run; 
proc print data=clus3; 
   by cluster; 
   var Food Calories Protein Fat Calcium Iron; 
run; 
Output From SAS 
 The CLUSTER Procedure 
 Centroid Hierarchical Cluster Analysis 
 Variable Mean Std Dev Skewness Kurtosis Bimodality 
 
 Calories 209.6 99.6332 0.5320 -0.6024 0.4619 
 Protein 19.0000 4.2517 -0.8237 1.3274 0.3565 
 Fat 13.4815 11.2570 0.7900 -0.6237 0.5892 
 Calcium 43.9630 78.0343 3.1590 11.3445 0.7456 
 Iron 2.3815 1.4613 1.2298 1.4689 0.5182 
 
 Root-Mean-Square Total-Sample Standard Deviation = 56.85606 
 Cluster History 
 RMS Cent 
 NCL ------Clusters Joined------- FREQ STD SPRSQ RSQ Dist 
 
 26 1 11 2 0.0316 0.0000 1.00 0.1 
 25 CL26 12 3 0.3661 0.0000 1.00 1.4151 
 24 7 26 2 1.5840 0.0000 1.00 5.009 
 23 14 21 2 1.8235 0.0000 1.00 5.7663 
 22 5 15 2 3.0332 0.0001 1.00 9.5917 
 21 CL23 23 3 3.2444 0.0002 1.00 11.531 
 20 16 20 2 3.7015 0.0002 .999 11.705 
 19 CL24 8 3 3.3143 0.0002 .999 12.081 
 18 CL25 13 4 3.3914 0.0004 .999 15.108 
 17 CL22 CL19 5 4.9157 0.0008 .998 16.519 
 16 2 9 2 6.4032 0.0005 .998 20.249 
 15 6 CL20 3 6.5865 0.0009 .997 23.409 
 14 17 18 2 8.3986 0.0008 .996 26.559 
 13 CL17 CL21 8 7.7564 0.0036 .992 28.445 
 12 CL18 4 5 6.9370 0.0019 .990 31.423 
 11 22 24 2 11.1679 0.0015 .989 35.316 
 10 CL15 19 4 11.3232 0.0035 .985 44.563 
 9 CL16 10 3 12.5993 0.0033 .982 45.537 
 8 CL13 CL10 12 16.8070 0.0274 .955 65.69 
 7 CL11 27 3 19.4452 0.0075 .947 68.821 
 6 CL12 3 6 14.3419 0.0099 .937 70.822 
 5 CL6 CL9 9 24.3675 0.0405 .897 92.253 
 4 CL14 CL7 5 30.4181 0.0342 .862 109.44 
 3 CL8 CL4 17 31.2342 0.1047 .758 111.66 
 2 CL5 CL3 26 49.8743 0.4977 .260 188.52 
 1 CL2 25 27 56.8561 0.2601 .000 336.92 
 
 
------------------------------------------ CLUSTER=1 ------------------------------------ 
 
 Obs Food Calories Protein Fat Calcium Iron 
 
 1 1 340 20 28 9 2.6 
 2 11 340 20 28 9 2.5 
 3 12 340 19 29 9 2.5 
 4 13 355 19 30 9 2.4 
 5 2 245 21 17 9 2.7 
 6 9 265 20 20 9 2.6 
 7 4 375 19 32 9 2.6 
 8 10 300 18 25 9 2.3 
 9 3 420 15 39 7 2.0 
 
 
------------------------------------------ CLUSTER=2 ------------------------------------ 
 
 Obs Food Calories Protein Fat Calcium Iron 
 
 10 7 170 25 7 12 1.5 
 11 26 170 25 7 7 1.2 
 12 14 205 18 14 7 2.5 
 13 21 200 19 13 5 1.0 
 14 5 180 22 10 17 3.7 
 15 15 185 23 9 9 2.7 
 16 23 195 16 11 14 1.3 
 17 16 135 22 4 25 0.6 
 18 20 135 16 5 15 0.5 
 19 8 160 26 5 14 5.9 
 20 6 115 20 3 8 1.4 
 21 17 70 11 1 82 6.0 
 22 18 45 7 1 74 5.4 
 23 22 155 16 9 157 1.8 
 24 24 120 17 5 159 0.7 
 25 19 90 14 2 38 0.8 
 26 27 170 23 1 98 2.6 
 
 
------------------------------------------ CLUSTER=3 ------------------------------------ 
 
 Obs Food Calories Protein Fat Calcium Iron 
 
 27 25 180 22 9 367 2.5 
[Dendrogram produced by PROC TREE: height axis (0 to 400) against Food, with the leaves 
in the order 1 11 12 13 4 3 2 9 10 5 15 7 26 8 14 21 23 6 16 20 19 17 18 22 24 27 25.] 
 
Data Entry and Procedure in SPSS 
 
Analyze… 
 Classify… 
 K-Means Cluster 
 Hierarchical Cluster 
 Discriminant 
 
Hierarchical Cluster 
[Screenshot: SPSS Hierarchical Cluster dialog] 

Output 
[Screenshot: SPSS output window] 
SPSS Syntax for above data 
CLUSTER Calories Protein Fat Calcium Iron 
 /METHOD CENTROID 
 /MEASURE= SEUCLID 
 /PRINT SCHEDULE CLUSTER(3) 
 /PRINT DISTANCE 
 /PLOT DENDROGRAM VICICLE. 
 
