Chapter 9: Clustering and Segmentation — Grouping Strategies in Data Mining

4.1 The Dendrogram as Data Structure and Argument

Hierarchical clustering produces not a single partition but a nested sequence of partitions — a dendrogram — that encodes every possible clustering from “one cluster containing all points” to “n clusters each containing one point.” This structure is profoundly different from what K-means produces. Rather than making a commitment to a fixed number of clusters, hierarchical methods defer the decision: they compute the full hierarchy and let the user (or a downstream criterion) choose where to “cut” the tree.

This deferral is both a strength and a weakness. The strength: the dendrogram is a complete description of the data’s hierarchical structure, enabling multi-resolution analysis. A biologist studying species relationships, an economist mapping market sectors, or a linguist tracing semantic similarity can all examine the tree at different granularities and extract insights unavailable to flat clustering methods. The weakness: constructing the full dendrogram is expensive — classical agglomerative algorithms run in O(n² log n) time and O(n²) space, making them impractical for datasets with more than ~50,000 points without approximation.

4.2 Agglomerative Strategies: The Linkage Taxonomy

Agglomerative (bottom-up) hierarchical clustering begins with each point as its own cluster and repeatedly merges the two clusters deemed most similar. The defining choice — and the source of the linkage taxonomy — is how inter-cluster similarity is measured once clusters contain more than one point:

Single Linkage (Nearest Neighbor): distance between clusters = minimum pairwise distance. Produces elongated, “chaining” clusters; extremely sensitive to outliers; finds non-convex shapes naturally but pathologically [10].
Complete Linkage (Furthest Neighbor): distance = maximum pairwise distance. Produces compact, roughly spherical clusters; resistant to chaining; tends to fragment large clusters with internal variation [10].
Average Linkage (UPGMA): distance = mean of all pairwise distances between members of the two clusters. A compromise that is widely used in phylogenetics; less sensitive to outliers than single linkage but produces less compact clusters than complete linkage [11].
Ward’s Method: merges the two clusters that minimize the increase in total within-cluster variance. Equivalent to K-means in its objective function; produces the most compact and homogeneous clusters; assumes Euclidean space [12].
Centroid Linkage (UPGMC): distance = Euclidean distance between cluster centroids. Can produce inversions (where a merge at a higher step has a smaller distance than a lower step), complicating dendrogram interpretation [13].

graph TD
    A[Agglomerative Linkage Methods] --> B[Single Linkage]
    A --> C[Complete Linkage]
    A --> D[Average Linkage]
    A --> E[Ward's Method]
    A --> F[Centroid Linkage]

    B --> B1[✓ Non-convex clusters\n✗ Chaining effect\n✗ Outlier sensitivity]
    C --> C1[✓ Compact clusters\n✗ Fragmentation\n✗ Large-cluster bias]
    D --> D1[✓ Balanced trade-off\n✓ Phylogenetics standard\n✗ No geometric interpretation]
    E --> E1[✓ Minimizes variance\n✓ K-means equivalent\n✗ Euclidean only]
    F --> F1[✓ Centroid-based\n✗ Inversion possible\n✗ Euclidean only]

A critical and underappreciated issue with hierarchical clustering is instability under perturbation. Small changes to the dataset — adding a single point, removing an outlier, or introducing minor measurement noise — can produce dramatically different dendrograms, particularly with single and complete linkage. This sensitivity makes hierarchical clustering unreliable for generating reproducible scientific findings from noisy data [14]. Ward’s method is empirically more stable, but no linkage strategy is immune.

4.3 Divisive Clustering and the DIANA Algorithm

Divisive (top-down) hierarchical clustering inverts the process: it begins with all points in one cluster and recursively splits. The DIANA (DIvisive ANAlysis) algorithm [15], due to Kaufman and Rousseeuw, identifies the “splinter group” — the point with the largest average dissimilarity to its cluster — and iteratively grows this group by adding points more similar to it than to the current main cluster. Divisive methods are theoretically appealing (the first split, made with global context, may be more reliable than the last merge in agglomerative methods) but computationally prohibitive: considering all possible bipartitions at each step is exponential. DIANA approximates by using a greedy strategy, but even this runs in O(n²) per split, making it practical only for moderate-sized datasets.

5. Density-Based Methods: Clusters as Regions of Concentration

5.1 The DBSCAN Revolution

In 1996, Ester, Kriegel, Sander, and Xu published a paper at KDD that introduced DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [16]. The paper won the KDD Test of Time award in 2014 — a recognition that the algorithm had proven more durable and more practically useful than its contemporary competitors. DBSCAN’s core insight was radical in its simplicity: a cluster is a dense region of space, separated from other clusters by regions of lower density. Points in sparse regions are not forced into any cluster — they are labeled as noise.

Formally, DBSCAN operates with two parameters: ε (epsilon, the neighborhood radius) and MinPts (the minimum number of points required to form a dense region). A point is a core point if its ε-neighborhood contains at least MinPts points. Two core points are density-connected if there exists a chain of core points linking them, each within ε of the next. A cluster is a maximal set of density-connected points. Points that are not core points but fall within a core point’s ε-neighborhood are border points; all others are noise.

This design has three immediate consequences that distinguish DBSCAN from centroid-based methods: (1) clusters can have arbitrary shapes — elongated, crescent, annular, or fractal; (2) the number of clusters is determined automatically by the data structure, not specified in advance; and (3) outliers are explicitly identified rather than assigned to the nearest cluster. For spatial data, geographic applications, and any domain where clusters are genuinely non-convex, these properties are transformative.

Key Insight: DBSCAN’s parameter sensitivity is its primary failure mode. ε and MinPts must be set globally — a single pair of values defines density for the entire dataset. When clusters have significantly varying densities (a common real-world condition), a single ε will simultaneously over-merge sparse clusters and fragment dense ones. This is not a bug; it is an architectural limitation of the density-connectivity model.

5.2 OPTICS: Ordering Points to Identify Clustering Structure

OPTICS (Ordering Points To Identify the Clustering Structure) [17] addresses DBSCAN’s single-density limitation by replacing the binary partition (cluster / noise) with a continuous reachability plot. Instead of extracting clusters with fixed ε, OPTICS computes, for each point, its core distance (the distance to the MinPts-th nearest neighbor) and its reachability distance with respect to every other point. Points are then ordered by processing order, and the reachability distances are plotted — producing a landscape where valleys correspond to clusters and peaks to noise or cluster boundaries.

From a single OPTICS run, clusters at any density threshold can be extracted by choosing a reachability cutoff — effectively making OPTICS a parameterized family of DBSCAN clusterings. This hierarchical view of density is powerful but introduces its own challenge: the reachability plot must be interpreted, and automated valley-finding in arbitrary reachability plots is non-trivial. The original DBSCAN∗ extraction heuristic, and later the more sophisticated OPTICS-Xi algorithm, attempt this automatically, but edge cases and parameter sensitivity at the extraction step persist [18].

5.3 HDBSCAN: Hierarchy Meets Density

HDBSCAN (Hierarchical DBSCAN) [19], introduced by Campello, Moulavi, and Sander in 2013, synthesizes the hierarchical view of OPTICS with the robustness of density connectivity and adds a cluster stability extraction algorithm that automatically selects the most significant clusters from the hierarchy. The algorithm replaces DBSCAN’s binary classification with a soft, persistence-based notion of cluster significance: a cluster persists as long as a substantial portion of its points remain in the cluster as ε decreases. Clusters that appear and disappear quickly (low persistence) are treated as noise; those that persist across a wide range of ε are considered genuine structure.

HDBSCAN has become the practical successor to DBSCAN for most applications, offering automatic cluster extraction, built-in soft clustering (each point can have a membership probability across clusters), and robust performance on datasets with varying densities. Its main limitation is computational: the construction of the mutual reachability graph and minimum spanning tree runs in O(n² log n) for naive implementations, though approximate nearest-neighbor methods can bring this closer to O(n log n) for large datasets [20].

6. Spectral and Graph-Based Clustering

6.1 From Geometry to Topology

Spectral clustering represents a conceptual leap: rather than working directly in the original feature space, it transforms the data into a representation where cluster structure becomes apparent even when it is geometrically complex. The transformation is based on the graph Laplacian — a matrix derived from the pairwise similarity graph of the data whose eigenvectors encode the data’s “topological skeleton.”

The algorithm proceeds in three phases: (1) construct a similarity graph (typically k-nearest-neighbor or ε-ball graph) weighted by a similarity function (typically Gaussian kernel); (2) compute the first k eigenvectors of the normalized graph Laplacian; (3) treat these eigenvectors as new feature representations and apply K-means in this embedded space. The effect is remarkable: data that forms concentric rings, interlocking crescents, or any other non-convex shape in the original space often becomes linearly separable in the spectral embedding [21].

The theoretical foundation connects spectral clustering to normalized graph cuts (the Shi-Malik algorithm [22]) and to the random-walk view of the data manifold (the Ng-Jordan-Weiss algorithm [23]). The connection to manifold learning — particularly to Laplacian Eigenmaps and Diffusion Maps — situates spectral clustering within a broader ecosystem of graph-based dimensionality reduction methods.

6.2 Community Detection as Clustering

Community detection in networks is clustering applied to graph-structured data — finding groups of nodes that are more densely connected internally than to the rest of the graph. The modularity-based approaches (Louvain [24], Leiden [25]) optimize a quality function (modularity Q) that measures the difference between observed edge density within communities and the expected density under a null model. The Louvain algorithm’s hierarchical optimization, operating in O(n log n) time, brought community detection to networks with hundreds of millions of nodes — enabling the analysis of social networks, citation graphs, biological interaction networks, and internet topology at scales previously inaccessible.

flowchart LR
    A[Input: Data Points] --> B[Construct Similarity Graph]
    B --> C[Compute Graph Laplacian L]
    C --> D[Eigendecomposition of L]
    D --> E[Select k Eigenvectors\nas New Features]
    E --> F[Apply K-means\nin Spectral Space]
    F --> G[Cluster Assignments]

    style A fill:#e3f2fd,stroke:#1565c0
    style G fill:#e8f5e9,stroke:#2e7d32

    H[Limitations] --> H1[Eigenvector computation: O\(n³\)]
    H --> H2[Sensitivity to graph construction]
    H --> H3[k must still be specified]
    H --> H4[Poor generalization to new points]

A fundamental limitation of spectral methods — often glossed over in tutorials — is their out-of-sample problem: once the spectral embedding is computed, there is no principled way to project a new data point into the embedded space without recomputing the entire eigenvector decomposition. For clustering tasks, this is often acceptable (clustering is typically a batch operation), but it renders spectral methods inapplicable to streaming or real-time segmentation scenarios [26].

7. Fuzzy and Probabilistic Clustering

7.1 Fuzzy C-means: Soft Boundaries for Hard Problems

Hard clustering imposes a crisp, binary assignment: each point belongs to exactly one cluster. This is epistemically violent. A customer who shops at both luxury boutiques and discount stores belongs, in some meaningful sense, to multiple behavioral segments. A gene whose expression profile straddles two regulatory programs participates in both. A news article discussing both economics and technology is not cleanly classifiable. Fuzzy clustering acknowledges this reality by replacing binary membership with a continuous membership function.

Fuzzy C-means (FCM) [27], introduced by Bezdek in 1981, generalizes K-means by replacing hard cluster assignments with membership degrees u_ij ∈ [0,1], where u_ij represents the degree to which point i belongs to cluster j, subject to the constraint that memberships sum to 1 across all clusters for each point. The objective becomes:

Jm = Σi=1n Σj=1c uijm · ||xi − μj||²

where m > 1 is the fuzzification parameter controlling how much memberships can be shared. As m → 1, FCM approaches hard K-means; as m → ∞, all memberships converge to 1/c (complete fuzziness). The choice of m is a hyperparameter with significant impact on results but little principled guidance — typically m = 2 is used by convention [28].

7.2 Gaussian Mixture Models: Probabilistic Foundations

Gaussian Mixture Models (GMMs) provide a fully probabilistic foundation for soft clustering. Rather than minimizing a heuristic objective like FCM, GMMs assume that the data is generated by a mixture of k Gaussian distributions, each with its own mean and covariance matrix. Cluster membership is then a posterior probability, computed via Bayes’ theorem from the observed data and the model parameters [29].

Parameter estimation for GMMs is accomplished via the Expectation-Maximization (EM) algorithm [30]: the E-step computes the posterior probability (responsibility) of each Gaussian component for each data point; the M-step updates the component parameters (means, covariances, mixing weights) to maximize the expected log-likelihood given those responsibilities. The algorithm alternates until convergence — guaranteed to find a local maximum of the likelihood but not the global maximum.

GMMs generalize K-means in an important sense: K-means is equivalent to a GMM with spherical covariance matrices and hard assignments (the E-step replaced by a maximum-a-posteriori assignment). Full-covariance GMMs can model ellipsoidal clusters of arbitrary orientation and eccentricity, making them substantially more flexible. The Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) can be used to select the number of components, giving GMMs a more principled answer to the “how many clusters” question than K-means offers [31].

7.3 Dirichlet Process Mixture Models: Nonparametric Bayes

The ultimate answer to the model selection problem is to place a nonparametric prior over the number of clusters and let the data determine the appropriate complexity. Dirichlet Process Mixture Models (DPMMs) [32] accomplish this by placing a Dirichlet Process prior over the mixing distribution, which implicitly places a prior over the number of components that allows the model to grow as more data is observed. The “Chinese Restaurant Process” metaphor captures the model’s behavior: each new data point joins an existing cluster with probability proportional to that cluster’s current size, or starts a new cluster with probability proportional to a concentration parameter α — creating a rich-get-richer dynamic balanced by the possibility of discovering new structure.

DPMMs offer automatic cluster number determination and principled uncertainty quantification, but at significant computational cost. Exact inference is intractable; practitioners use Markov Chain Monte Carlo (MCMC) methods or variational inference approximations, both of which require careful implementation and extensive tuning. In practice, DPMMs have found their strongest application in natural language processing (topic modeling) and bioinformatics (single-cell sequencing analysis) — domains where the number of latent categories is genuinely unknown and the computational investment is justified [33].

8. Cluster Validity: Measuring the Unmeasurable

Having surveyed the major algorithm families, we must address a question that practitioners often sidestep: how do we know if a clustering is good? Without ground-truth labels, standard accuracy metrics are inapplicable. Cluster validity indices attempt to fill this gap, measuring quality through proxies like cohesion (how similar are points within a cluster?), separation (how different are clusters from each other?), and stability (does the clustering change under small perturbations?).

graph TD
    A[Cluster Validity Indices] --> B[Internal Indices\nNo ground truth needed]
    A --> C[External Indices\nRequire ground truth]
    A --> D[Relative Indices\nCompare across k]

    B --> B1[Silhouette Score\nCohesion vs. separation]
    B --> B2[Davies-Bouldin Index\nIntra/inter cluster ratio]
    B --> B3[Calinski-Harabasz Index\nVariance ratio criterion]
    B --> B4[Dunn Index\nMin inter / max intra]

    C --> C1[Adjusted Rand Index\nARI]
    C --> C2[Normalized Mutual Information\nNMI]
    C --> C3[Fowlkes-Mallows Index\nPrecision-recall for clusters]
    C --> C4[V-measure\nHomogeneity + completeness]

    D --> D1[Elbow Method\nWCSS vs. k]
    D --> D2[Gap Statistic\nvs. null reference]
    D --> D3[BIC / AIC\nFor probabilistic models]

The silhouette score [7] is the most widely used internal validity index. For each point i, it computes a(i) (mean distance to other points in the same cluster) and b(i) (mean distance to points in the nearest other cluster). The silhouette coefficient s(i) = (b(i) − a(i)) / max(a(i), b(i)) ranges from −1 (misclassified point) to +1 (well-clustered point). The mean silhouette over all points summarizes clustering quality. However, the silhouette score is biased toward compact, convex clusters and can systematically misrank non-convex solutions — penalizing exactly the cases where density-based and spectral methods excel [34].

Research Gap: No single cluster validity index is universally applicable. All existing indices embed assumptions about cluster geometry, density, or size distribution. The development of validity indices that are algorithm-agnostic, geometry-agnostic, and computationally tractable for large datasets is an active and unsolved research problem [35].

9. Open Problems and Emerging Directions

The clustering literature is vast, but several fundamental problems remain open:

High-dimensional clustering: In spaces with hundreds or thousands of dimensions, the curse of dimensionality makes all pairwise distances nearly equal, undermining the geometric intuitions behind every major algorithm family. Subspace clustering — finding clusters that are locally defined in different subspaces of the feature space — addresses this theoretically but remains computationally expensive and difficult to tune [36].
Streaming and online clustering: Most clustering algorithms assume batch access to static datasets. Real-world data streams — sensor networks, social media feeds, financial tick data — require algorithms that can update cluster assignments incrementally without reprocessing the entire dataset. CluStream [37], DenStream [38], and their descendants address this but sacrifice theoretical guarantees for computational tractability.
Multi-view and multi-modal clustering: When data is described by multiple, potentially heterogeneous representations (images + text, genomics + clinical variables), single-view clustering methods discard information. Multi-view clustering methods attempt to integrate these views but face the fundamental question of how to weight their contributions [39].
Deep clustering: The integration of deep representation learning with clustering — using neural networks to learn a clustering-favorable embedding end-to-end — has shown dramatic empirical gains (DEC [40], IDEC [41]) but poses new challenges for theoretical analysis and stability. The mutual reinforcement between the clustering objective and the representation learning objective can lead to degenerate solutions without careful regularization.

10. Practical Taxonomy: A Decision Framework

After surveying the full landscape, we offer a practical decision framework for algorithm selection. This is not a recipe — it is a set of questions whose answers narrow the space of appropriate choices:

flowchart TD
    Q1{Do you know k\nin advance?}
    Q2{Are clusters\nconvex/spherical?}
    Q3{Is noise/outlier\ndetection needed?}
    Q4{Is scalability\n> 100k points critical?}
    Q5{Need soft/probabilistic\nmemberships?}
    Q6{Is data\ngraph-structured?}

    Q1 -->|Yes| Q2
    Q1 -->|No| Q3

    Q2 -->|Yes| Q4
    Q2 -->|No| Q3

    Q4 -->|Yes| A1[Mini-Batch K-means]
    Q4 -->|No| A2[K-means++ or\nWard's Hierarchical]

    Q3 -->|Yes| Q4b{Scale?}
    Q3 -->|No| Q5

    Q4b -->|Large| A3[HDBSCAN]
    Q4b -->|Small/Medium| A4[DBSCAN or OPTICS]

    Q5 -->|Yes| Q6b{Parametric?}
    Q5 -->|No| Q6

    Q6b -->|Yes| A5[Gaussian Mixture Model]
    Q6b -->|No| A6[DPMM / Variational Bayes]

    Q6 -->|Yes| A7[Community Detection\nLouvain / Leiden]
    Q6 -->|No| A8[Spectral Clustering]

This framework inevitably oversimplifies — no decision tree can substitute for domain knowledge and empirical experimentation. But it encodes the key differentiating assumptions that distinguish algorithm families and provides a principled starting point for practitioners who face a new clustering problem without a clear prior.

Chapter Summary

This chapter has surveyed the major families of clustering and segmentation algorithms through a taxonomic lens, revealing not just what each algorithm does but why it makes the choices it makes and what those choices cost. The partitional family (K-means and variants) offers speed, scalability, and interpretability at the cost of convexity assumptions and the need to pre-specify k. Hierarchical methods trade computational efficiency for a complete multi-resolution view of cluster structure, with linkage choice encoding critical assumptions about cluster shape and cohesion. Density-based methods (DBSCAN, OPTICS, HDBSCAN) liberate clustering from geometric constraints but introduce sensitivity to density parameters and neighbor graph construction. Spectral and graph-based methods handle the most complex non-linear structures but require expensive eigendecompositions and face out-of-sample generalization challenges. Fuzzy and probabilistic methods (FCM, GMMs, DPMMs) provide principled uncertainty quantification but at the cost of increased model complexity and computational demand.

Across all families, the same fundamental tensions recur: expressiveness vs. tractability, automation vs. control, geometric assumptions vs. topological flexibility, hard boundaries vs. soft memberships. No algorithm resolves all tensions simultaneously. The expert data miner — and the researcher pushing the field forward — is one who understands these trade-offs deeply enough to choose deliberately, diagnose failures accurately, and recognize the gaps where the existing taxonomy offers no good answer.

Chapter 10 will take the opposite perspective: rather than finding groups, it will focus on exceptions — the points that refuse to belong, the anomalies that signal system failures, fraud, disease, and discovery. Anomaly detection is, in a sense, the dark matter of data mining: rarely the focus, but often the most consequential finding.

References

Steinhaus, H. (1956). Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences, 4(12), 801–804.
Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2), 245–248. https://doi.org/10.1007/s10994-009-5103-0
Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1027–1035. https://dl.acm.org/doi/10.5555/1283383.1283494
Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. https://doi.org/10.1002/9780470316801
Ng, R. T., & Han, J. (2002). CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering, 14(5), 1003–1016. https://doi.org/10.1109/TKDE.2002.1033770
Sculley, D. (2010). Web-scale k-means clustering. Proceedings of the 19th International Conference on World Wide Web (WWW), 1177–1178. https://doi.org/10.1145/1772690.1772862
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411–423. https://doi.org/10.1111/1467-9868.00293
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. ICDT 2001, LNCS 1973, 420–434. https://doi.org/10.1007/3-540-44503-X_27
Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall.
Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 38, 1409–1438.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244. https://doi.org/10.1080/01621459.1963.10500845
Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? Journal of Classification, 31(3), 274–295. https://doi.org/10.1007/s00357-014-9161-z
Ben-David, S., Von Luxburg, U., & Pál, D. (2006). A sober look at clustering stability. COLT 2006, LNCS 4005, 5–19. https://doi.org/10.1007/11776420_4
Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data (DIANA algorithm, Chapter 6). Wiley. https://doi.org/10.1002/9780470316801
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96 Proceedings, 226–231. https://dl.acm.org/doi/10.5555/3001460.3001507
Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49–60. https://doi.org/10.1145/304181.304187
Brecheisen, S., Kriegel, H. P., Kröger, P., & Pfeifle, M. (2004). Visually mining through cluster hierarchies. SIAM International Conference on Data Mining. https://doi.org/10.1137/1.9781611972740.10
Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. PAKDD 2013, LNCS 7819, 160–172. https://doi.org/10.1007/978-3-642-37456-2_14
McInnes, L., Healy, J., & Astels, S. (2017). HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205. https://doi.org/10.21105/joss.00205
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. https://doi.org/10.1007/s11222-007-9033-z
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905. https://doi.org/10.1109/34.868688
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. NIPS 2001, 849–856. https://dl.acm.org/doi/10.5555/2980539.2980649
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics, P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008
Traag, V. A., Waltman, L., & Van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z
Bengio, Y., Paiement, J. F., Vincent, P., et al. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. NIPS 2003, 177–184. https://dl.acm.org/doi/10.5555/2981345.2981368
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press. https://doi.org/10.1007/978-1-4757-0450-1
Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy Cluster Analysis. Wiley.
McLachlan, G., & Peel, D. (2000). Finite Mixture Models. Wiley. https://doi.org/10.1002/0471721182
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136
Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2), 209–230. https://doi.org/10.1214/aos/1176342360
Blei, D. M., Griffiths, T. L., & Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2), 7. https://doi.org/10.1145/1667053.1667056
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2–3), 107–145. https://doi.org/10.1023/A:1012801612483
Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64, 53–62. https://doi.org/10.1016/j.patrec.2015.04.009
Luxburg, U. V., Williamson, R. C., & Guyon, I. (2012). Clustering: Science or art? ICML Workshop on Unsupervised and Transfer Learning, 65–80. https://proceedings.mlr.press/v27/luxburg12a.html
Parsons, L., Haque, E., & Liu, H. (2004). Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6(1), 90–105. https://doi.org/10.1145/1007730.1007731
Aggarwal, C. C., Han, J., Wang, J., & Yu, P. S. (2003). A framework for clustering evolving data streams. VLDB 2003, 81–92. https://doi.org/10.1016/B978-012722442-8/50016-1
Cao, F., Ester, M., Qian, W., & Zhou, A. (2006). Density-based clustering over an evolving data stream with noise. SIAM International Conference on Data Mining. https://doi.org/10.1137/1.9781611972764.29
Zhao, H., Liu, H., & Fu, Y. (2016). Incomplete multi-modal visual data grouping. IJCAI 2016, 2392–2398. https://dl.acm.org/doi/10.5555/3060832.3060997
Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. ICML 2016, PMLR 48, 478–487. https://proceedings.mlr.press/v48/xieb16.html
Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved deep embedded clustering with local structure preservation. IJCAI 2017, 1753–1759. https://doi.org/10.24963/ijcai.2017/243

Abstract data clusters and node networks representing clustering algorithms

Clustering & Segmentation: Grouping Strategies

📚 Academic Citation:
Ivchenko, I. & Ivchenko, O. (2026). Clustering and Segmentation — Grouping Strategies in Data Mining. Intellectual Data Analysis Series. Odessa National Polytechnic University.
DOI: 10.5281/zenodo.18672455

Abstract

Clustering stands at the heart of unsupervised data mining — it is the art of asking a machine to look at raw, unlabeled data and answer a deceptively simple question: what belongs together? This chapter offers a comprehensive taxonomy of clustering and segmentation strategies, tracing the intellectual lineage of each major family of algorithms and exposing the often-overlooked gaps in how these methods are applied across real-world domains. We examine partitional methods (K-means and its descendants), hierarchical strategies (agglomerative and divisive), density-based discovery (DBSCAN, OPTICS, HDBSCAN), spectral and graph-theoretic approaches, and fuzzy clustering frameworks. For each family, we map the theoretical assumptions, the algorithmic mechanics, the known failure modes, and the open research problems. By the end of this chapter, the reader will possess not just a catalog of clustering algorithms, but a principled lens for choosing, critiquing, and extending them.

1. The Silence Before Labels: Why Clustering Matters

Imagine receiving a letter written in an unknown language, with no dictionary, no grammar guide, and no native speaker to ask. Your only tool is pattern recognition: you look for recurring symbols, notice which characters appear together, observe the lengths of word-like clusters, and slowly build a mental map of structure. This is, in essence, what clustering algorithms do — they impose structure on structureless data.

The practical stakes are high. When a genomics lab sequences thousands of gene expression profiles, there are no pre-assigned disease labels. When a telecom company logs billions of call-detail records, no one has annotated which behavioral patterns indicate churn risk. When astronomers catalog millions of celestial objects, no cosmic taxonomy arrives pre-filled. In each of these domains, clustering is the first act of sense-making — the algorithm that transforms a fog of data points into a landscape of recognizable terrain.

Yet despite decades of research and hundreds of published algorithms, clustering remains one of the most theoretically contested areas of data mining. Unlike supervised learning, where correctness can be measured against ground truth, clustering has no single correct answer. The “right” number of groups, the “right” definition of similarity, and even the “right” shape of clusters are all judgment calls — embedded in algorithmic choices that practitioners often make unconsciously. This chapter makes those choices explicit.

Key Insight: Clustering is not a single problem but a family of related problems, each requiring different assumptions about what “similarity” means, what “group structure” looks like, and what “good” segmentation achieves. Choosing an algorithm is choosing a worldview.

2. Taxonomy of Clustering Approaches

Before diving into individual algorithms, we establish a meta-taxonomy — a map of maps. Clustering algorithms can be classified along several orthogonal dimensions: their computational strategy (partitional vs. hierarchical), their handling of cluster geometry (convex vs. arbitrary shapes), their treatment of uncertainty (hard vs. soft/fuzzy assignment), and their sensitivity to data density (centroid-based vs. density-based). The following diagram captures the primary families:

graph TD
    A[Clustering Algorithms] --> B[Partitional]
    A --> C[Hierarchical]
    A --> D[Density-Based]
    A --> E[Spectral / Graph-Based]
    A --> F[Fuzzy / Probabilistic]

    B --> B1[K-means]
    B --> B2[K-medoids / PAM]
    B --> B3[Mini-Batch K-means]
    B --> B4[K-means++]

    C --> C1[Agglomerative]
    C --> C2[Divisive / DIANA]
    C1 --> C1a[Single Linkage]
    C1 --> C1b[Complete Linkage]
    C1 --> C1c[Ward's Method]
    C1 --> C1d[Average Linkage]

    D --> D1[DBSCAN]
    D --> D2[OPTICS]
    D --> D3[HDBSCAN]
    D --> D4[DENCLUE]

    E --> E1[Spectral Clustering]
    E --> E2[Graph Partitioning]
    E --> E3[Community Detection]

    F --> F1[Fuzzy C-means]
    F --> F2[Gaussian Mixture Models]
    F --> F3[Dirichlet Process]

This taxonomy is not merely academic decoration. Each branch embeds specific assumptions about the data-generating process, and violations of those assumptions can produce spectacularly wrong results. A K-means algorithm applied to non-convex clusters will draw boundaries that make no geometric sense. A density-based method applied to uniformly distributed data will find nothing. Understanding which branch to reach for — and when — is the central skill this chapter aims to develop.

3. Partitional Methods: The K-means Family

3.1 K-means: The Algorithm That Ate the World

Few algorithms have traveled as far from their origins as K-means. Proposed independently by Stuart Lloyd in 1957 (in a Bell Labs technical memo that circulated informally for years before formal publication in 1982) and by Hugo Steinhaus in 1956 [1], the algorithm is deceptively simple: choose k centroids, assign each point to its nearest centroid, recompute centroids as the mean of assigned points, repeat until convergence. The simplicity is both its greatest strength and its greatest vulnerability.

The formal objective is to minimize the within-cluster sum of squares (WCSS):

J = Σi=1k Σx∈Ci ||x − μi||²

where μ_i is the mean (centroid) of cluster C_i. This is a combinatorial optimization problem known to be NP-hard in its exact form [2], so K-means is actually solving an approximate version via coordinate descent — alternating between assignment and update steps until a local minimum is reached. The algorithm is guaranteed to converge but not to find the global optimum.

This local-optima problem plagued practitioners for decades. The standard remedy was to run K-means multiple times with random initializations and keep the best result — crude but effective enough to persist in production systems worldwide. The more principled solution arrived in 2007 with K-means++ [3], which replaces random initialization with a probabilistic seeding procedure: the first centroid is chosen uniformly at random, and each subsequent centroid is chosen with probability proportional to its squared distance from the nearest already-chosen centroid. This simple change provides an O(log k) approximation guarantee and dramatically improves both speed of convergence and quality of solution in practice.

3.2 Variants: Robustness, Scale, and Shape

The K-means family has proliferated to address three recurring weaknesses: sensitivity to outliers (solved by K-medoids), computational cost on large datasets (solved by Mini-Batch K-means), and the assumption of spherical clusters (addressed, with limited success, by various distance-function modifications).

K-medoids (PAM — Partitioning Around Medoids) [4] replaces the centroid with an actual data point (the medoid) — the point that minimizes the sum of distances to all other points in its cluster. This makes the algorithm robust to outliers and applicable to non-Euclidean spaces (you only need a distance matrix, not a vector space). The cost is computational: PAM’s original implementation runs in O(k(n-k)²) per iteration, making it impractical for large datasets. The CLARA and CLARANS algorithms [5] introduced sampling strategies to bring K-medoids to scale, trading exactness for speed.

Mini-Batch K-means [6] tackles the large-scale problem from a different angle: instead of processing all data points per iteration, it samples a random mini-batch and updates centroids based only on that batch. The resulting solution is slightly suboptimal but achieves orders-of-magnitude speedup with minimal loss in cluster quality — a trade-off that has made it the de facto standard for clustering datasets with millions of points.

graph LR
    A[K-means Core Problem] --> B[Sensitivity to Outliers]
    A --> C[Poor Initialization]
    A --> D[Assumes Spherical Clusters]
    A --> E[Scalability on Large Data]
    A --> F[Unknown k]

    B --> B1[K-medoids / PAM]
    C --> C1[K-means++]
    D --> D1[Kernel K-means]
    D --> D2[Spectral Clustering]
    E --> E1[Mini-Batch K-means]
    E --> E2[CLARA / CLARANS]
    F --> F1[Elbow Method]
    F --> F2[Silhouette Score]
    F --> F3[Gap Statistic]
    F --> F4[BIC / AIC with GMM]

The most persistent weakness of the entire K-means family — and the one for which no fully satisfactory solution exists — is the requirement to specify k in advance. A battery of heuristics has been proposed: the elbow method (looking for a “kink” in the WCSS vs. k curve), the silhouette score [7] (measuring cohesion vs. separation), the gap statistic [8] (comparing WCSS to a null reference distribution), and information-theoretic criteria like BIC applied to Gaussian Mixture Models. Each heuristic fails in specific regimes — hierarchical or non-convex data often yields no clear elbow, silhouette scores are biased toward compact spherical clusters, and the gap statistic can be computationally expensive. This remains an open problem in the field.

Research Gap: No universally applicable method for automatic determination of k exists. Current approaches rely on assumptions (sphericality, Gaussianity, uniform density) that frequently fail in practice. This is particularly acute in high-dimensional spaces, where distance concentration effects make all pairwise distances similar, undermining the geometric intuitions on which k-selection heuristics depend [9].

4. Hierarchical Clustering: A Forest of Possibilities