Data Mining
Cluster Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 7
Introduction to Data Mining
by
Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 *
What is Cluster Analysis?
Inter-cluster distances are maximized
Intra-cluster distances are minimized
Applications of Cluster Analysis
What is not Cluster Analysis?
Types of Clusterings
Other Distinctions Between Sets of Clusters
Types of Clusters
Types of Clusters: Objective Function
Types of Clusters: Objective Function …
Characteristics of the Input Data Are Important
Clustering Algorithms
K-means Clustering
K-means Clustering – Details
Evaluating K-means Clusters
Problems with Selecting Initial Points
Solutions to Initial Centroids Problem
Handling Empty Clusters
Updating Centers Incrementally
Pre-processing and Post-processing
Bisecting K-means
Limitations of K-means
Strengths of Hierarchical Clustering
Hierarchical Clustering
Agglomerative Clustering Algorithm
Compute the proximity matrix
Let each data point be a cluster
Repeat
Merge the two closest clusters
Update the proximity matrix
Until only a single cluster remains
Hierarchical Clustering: Group Average
Cluster Similarity: Ward’s Method
Hierarchical Clustering: Time and Space requirements
Hierarchical Clustering: Problems and Limitations
DBSCAN
Cluster Validity
Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
Evaluating how well the results of a cluster analysis fit the data without reference to external information.
– Use only the data
Comparing the results of two different sets of cluster analyses to determine which is better.
Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Different Aspects of Cluster Validation
Measures of Cluster Validity
Measuring Cluster Validity Via Correlation
Framework for Cluster Validity
Where |Ci| is the size of cluster i
Internal Measures: Cohesion and Separation
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis.
Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes
Final Comment on Cluster Validity
å
å
=
Î
=
K
i
C
x
i
i
x
m
dist
SSE
1
2
)
,
(
å
å
Î
–
=
i
C
x
i
i
m
x
WSS
2
)
(
å
–
=
i
i
i
m
m
C
BSS
2
)
(