Hierarchical Clustering Taxonomy: From Dendrograms to Modern Extensions

Posted on February 18, 2026

Hierarchical Clustering: A Taxonomic Deep Dive

📚 Academic Citation:
Ivchenko, I. & Ivchenko, O. (2026). Hierarchical Clustering Taxonomy: From Dendrograms to Modern Extensions. Intellectual Data Analysis Series. Odessa National Polytechnic University.
DOI: 10.5281/zenodo.18683667

Abstract

Hierarchical clustering represents one of the oldest and most intuitive approaches to unsupervised pattern discovery — the idea that natural structures in data can be revealed through successive merging or splitting of groups, producing a nested taxonomy rather than a flat partition. This chapter provides a comprehensive taxonomic analysis of hierarchical clustering methods, tracing their intellectual origins from early numerical taxonomy in biology to contemporary applications in genomics, document analysis, and social network mining. We examine the two fundamental paradigms — agglomerative (bottom-up) and divisive (top-down) — alongside the critical choice of linkage criteria that decisively shapes clustering outcomes. Through detailed analysis of single, complete, average, and Ward’s linkage methods, we expose the geometric assumptions each encodes and the failure modes each induces. The chapter maps five critical research gaps: scalability limitations that restrict hierarchical methods to moderate datasets, the absence of principled dendrogram cutting strategies, theoretical blind spots in handling high-dimensional spaces, insufficient integration with streaming and online learning paradigms, and the underexplored territory of hierarchical methods for non-Euclidean data manifolds. By synthesizing algorithmic mechanics with practical application patterns, this chapter equips practitioners with the conceptual tools to select, critique, and extend hierarchical clustering for their specific analytical contexts.

Keywords: hierarchical clustering, agglomerative clustering, divisive clustering, dendrogram, linkage criteria, Ward’s method, AGNES, DIANA, ultrametric distance, cophenetic correlation


1. The Logic of Nested Structure

There is something deeply intuitive about hierarchies. When biologists classify organisms, they do not simply sort creatures into bins; they construct nested taxonomies — species within genera, genera within families, families within orders — that encode evolutionary relationships. When librarians organize knowledge, they build classification trees. When children learn to categorize the world, they naturally form “is-a” relationships: a sparrow is a bird, a bird is an animal, an animal is a living thing.

Hierarchical clustering algorithms formalize this intuition. Unlike partitional methods such as K-means, which produce a single flat grouping, hierarchical clustering constructs a complete nested structure — a dendrogram — that captures relationships at every possible level of granularity. The resulting tree can be “cut” at any height to produce cluster assignments, offering flexibility that flat methods cannot match.

The intellectual roots of hierarchical clustering trace to the pioneering work of Robert Sokal and Peter Sneath in the early 1960s, whose book Principles of Numerical Taxonomy (1963) laid the mathematical foundations for what they termed “phenetic classification” — grouping organisms by measured similarity rather than inferred evolutionary history (Sokal & Sneath, 1963). The UPGMA algorithm (Unweighted Pair Group Method with Arithmetic Mean), introduced by Sokal and Michener (1958), remains a standard tool in phylogenetic analysis six decades later, a testament to the enduring power of these foundational ideas.

Key Insight: Hierarchical clustering does not merely assign data points to groups — it reveals the structure of similarity itself, encoding how clusters relate to each other at multiple scales. This structural richness comes at a computational cost that has historically limited the method’s applicability to large-scale data.

2. Fundamental Paradigms: Bottom-Up vs. Top-Down

Hierarchical clustering algorithms divide into two fundamental families based on the direction of tree construction: agglomerative (bottom-up) methods that start with individual data points and progressively merge them into larger clusters, and divisive (top-down) methods that begin with all data in a single cluster and recursively split it into smaller groups.

graph TD
    subgraph Agglomerative ["Agglomerative (Bottom-Up)"]
        A1[Each point is a cluster] --> A2[Merge closest pair]
        A2 --> A3[Merge next closest]
        A3 --> A4[Continue until single cluster]
    end
    
    subgraph Divisive ["Divisive (Top-Down)"]
        D1[All points in one cluster] --> D2[Split into two groups]
        D2 --> D3[Split each subgroup]
        D3 --> D4[Continue until singletons]
    end
    
    A4 -.-> Result[Dendrogram]
    D4 -.-> Result

2.1 Agglomerative Clustering (AGNES)

Agglomerative clustering — sometimes called AGNES (AGglomerative NESting) — follows a conceptually simple procedure. Initially, each data point occupies its own singleton cluster. At each subsequent step, the algorithm identifies the two “closest” clusters and merges them, reducing the total number of clusters by one. This process repeats until all points belong to a single encompassing cluster. The sequence of merges is recorded in a dendrogram, where the vertical axis represents the distance (or dissimilarity) at which each merge occurred.

The algorithmic skeleton is deceptively simple:

flowchart TD
    Start[Initialize: n clusters, one per point] --> Compute[Compute pairwise distances]
    Compute --> Find[Find closest cluster pair]
    Find --> Merge[Merge closest pair into single cluster]
    Merge --> Update[Update distance matrix]
    Update --> Check{Only one cluster remaining?}
    Check -->|No| Find
    Check -->|Yes| End[Output dendrogram]

The complexity lies entirely in how “closest” is defined — this is the linkage criterion, which we examine in Section 3. The naive algorithm requires O(n³) time and O(n²) space due to the need to maintain and update the full distance matrix, though optimized implementations using nearest-neighbor chains or priority queues can achieve O(n² log n) for certain linkage methods (Müllner, 2011).
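The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration of the naive O(n³) procedure with single linkage as the criterion; the function name `agnes_single` is ours, and production code would instead call an optimized library routine such as `scipy.cluster.hierarchy.linkage`.

```python
from itertools import combinations

def agnes_single(points):
    """Naive agglomerative clustering (single linkage).

    Returns the merge history as a list of (height, cluster_a, cluster_b)
    tuples, with clusters represented as frozensets of point indices.
    O(n^3) time, as described in the text: fine for toy data only.
    """
    def dist(i, j):  # Euclidean base distance
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Linkage criterion: single linkage = minimum pairwise distance.
        h, a, b = min(
            ((min(dist(i, j) for i in a for j in b), a, b)
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[0],
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append((h, a, b))
    return merges
```

Applied to the four points (0,0), (0,1), (5,0), (5,1), the first two merges occur at height 1.0 (each tight pair) and the final merge at height 5.0 — the single-linkage distance between the two pairs.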

2.2 Divisive Clustering (DIANA)

Divisive clustering inverts the agglomerative logic. The algorithm begins with all data points in a single cluster and recursively bisects clusters until each point is isolated. The most well-known divisive algorithm is DIANA (DIvisive ANAlysis), introduced by Kaufman and Rousseeuw (1990).

DIANA’s splitting procedure works as follows: at each step, the algorithm selects the cluster with the largest diameter (the maximum distance between any two points within the cluster). Within this cluster, it identifies the point most dissimilar to all others — the “splinter” — and initiates a new cluster containing just this point. Remaining points are then iteratively reassigned: any point closer to the splinter group than to the original cluster is moved. This continues until no more points prefer the splinter group, at which point the split is complete and the algorithm proceeds to the next cluster.

Divisive methods have theoretical appeal: by making global decisions at the top of the hierarchy, they can potentially avoid the “greedy” mistakes of agglomerative methods that lock in early merges. However, the computational burden is severe — considering all possible binary splits of a cluster of size m requires examining 2^(m-1) – 1 possibilities, making exhaustive search intractable. DIANA’s heuristic splitting sidesteps this but sacrifices optimality guarantees.
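DIANA's splinter procedure can be sketched as follows. This is a simplified illustration, not Kaufman and Rousseeuw's exact algorithm: the helper name `diana_split` is ours, and the reassignment rule shown (move a point when its average dissimilarity to the splinter group falls below its average dissimilarity to the remainder) is one common reading of the procedure described above.

```python
def diana_split(cluster, dist):
    """One DIANA-style split of `cluster` (a list of point indices).

    `dist(i, j)` is any dissimilarity function. Returns the splinter
    group and the remaining points as two lists.
    """
    rest = list(cluster)
    # Seed the splinter group with the point of maximal average
    # dissimilarity to the others.
    seed = max(rest, key=lambda i: sum(dist(i, j) for j in rest if j != i))
    splinter = [seed]
    rest.remove(seed)
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            if len(rest) == 1:
                break
            d_rest = sum(dist(i, j) for j in rest if j != i) / (len(rest) - 1)
            d_spl = sum(dist(i, j) for j in splinter) / len(splinter)
            # Move any point now closer (on average) to the splinter group.
            if d_spl < d_rest:
                rest.remove(i)
                splinter.append(i)
                moved = True
    return splinter, rest
```

On two well-separated blobs, a single call recovers the two groups; the full DIANA algorithm then recurses on whichever cluster has the largest diameter.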

Research Gap #1: Despite their theoretical advantages, divisive methods remain underutilized in practice. The field lacks efficient algorithms that can make globally-informed splitting decisions without exponential enumeration. Modern techniques from combinatorial optimization and spectral methods offer unexplored avenues for principled divisive clustering at scale.

3. The Linkage Criterion: Where Assumptions Hide

The defining characteristic of agglomerative clustering is the linkage criterion — the rule that determines “distance” between two clusters. This choice is not merely technical; it encodes deep assumptions about what cluster structure should look like. Different linkage criteria impose different geometric priors, leading to dramatically different results on the same data.

graph LR
    subgraph Linkage Criteria
        SL["Single Linkage: min distance"]
        CL["Complete Linkage: max distance"]
        AL["Average Linkage: mean distance"]
        WL["Ward's Method: variance increase"]
    end

    SL --> Chain["Discovers elongated, chain-like clusters"]
    CL --> Compact["Produces tight, spherical clusters"]
    AL --> Balanced["Balances chaining and compactness"]
    WL --> MinVar["Minimizes within-cluster variance"]

3.1 Single Linkage (Nearest Neighbor)

Single linkage defines the distance between two clusters as the minimum distance between any pair of points, one from each cluster:

d(A, B) = min{d(a, b) : a ∈ A, b ∈ B}

This criterion has elegant theoretical properties. It is monotonic, meaning cluster distances never decrease as merging proceeds, which guarantees the resulting dendrogram is well-defined. More importantly, single linkage is equivalent to computing the minimum spanning tree of the data graph — clusters at any cut level correspond to connected components formed by removing edges above that threshold (Gower & Ross, 1969).

The geometric consequence is that single linkage excels at discovering elongated, irregular, or chain-like cluster structures. If two dense regions are connected by a sparse “bridge” of points, single linkage will correctly group them together while other methods might split them.

However, this sensitivity is also single linkage’s Achilles’ heel. The method suffers from the notorious “chaining effect”: a single noise point positioned between two otherwise-distant clusters can cause them to merge prematurely. In noisy real-world data, this vulnerability often produces degenerate hierarchies where clusters merge one point at a time in a long chain, revealing little meaningful structure (Everitt et al., 2011).

3.2 Complete Linkage (Farthest Neighbor)

Complete linkage takes the opposite extreme, defining inter-cluster distance as the maximum distance between any pair of points:

d(A, B) = max{d(a, b) : a ∈ A, b ∈ B}

This criterion imposes a strong compactness prior. For two clusters to be considered “close,” every point in one must be near every point in the other. The result is tight, roughly spherical clusters with controlled diameters.

Complete linkage avoids the chaining problem entirely — noise points cannot cause premature merges because they would need to be close to all points in both clusters. However, the method pays a different price: it tends to break apart naturally elongated structures into multiple small spherical fragments. When the true underlying clusters are non-convex or have irregular shapes, complete linkage systematically fails.
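The contrast between the criteria is easy to see numerically. A minimal sketch (the function name `linkage_distances` is ours) computes the inter-cluster distance for two toy point sets under single, complete, and average linkage:

```python
def linkage_distances(A, B):
    """Inter-cluster distance between point sets A and B under the three
    classical linkage criteria, using Euclidean base distance."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    pairs = [d(p, q) for p in A for q in B]
    return {
        "single": min(pairs),                 # nearest pair
        "complete": max(pairs),               # farthest pair
        "average": sum(pairs) / len(pairs),   # mean over all cross-pairs
    }
```

For A = {(0,0), (0,1)} and B = {(3,0), (4,0)}, single linkage reports 3.0 (the nearest pair), complete linkage √17 ≈ 4.12 (the farthest pair), and average linkage a value in between — three different answers to "how far apart are these clusters?", each implying a different merge order.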

4. Reading the Dendrogram

graph TD
    Root --> L1[Cluster A]
    Root --> L2[Cluster B]
    L1 --> L1a[Subcluster A1]
    L1 --> L1b[Subcluster A2]
    L2 --> L2a[Point 5]
    L2 --> L2b[Subcluster B1]
    L1a --> P1[Point 1]
    L1a --> P2[Point 2]
    L1b --> P3[Point 3]
    L1b --> P4[Point 4]
    L2b --> P6[Point 6]
    L2b --> P7[Point 7]

Reading a dendrogram requires understanding several key features:

  • Height axis: The vertical axis represents dissimilarity. Higher merge points indicate clusters that are more different from each other.
  • Horizontal cuts: Drawing a horizontal line at any height and observing which vertical lines it crosses yields a clustering at that granularity level. Low cuts produce many small clusters; high cuts produce few large clusters.
  • Branch lengths: Long vertical branches before a merge suggest well-separated clusters; short branches indicate clusters that are similar to each other.
  • Ultrametric property: For certain linkage criteria (especially UPGMA), the dendrogram satisfies the ultrametric inequality: for any three points, the two largest pairwise distances are equal. This mathematical property ensures the tree representation is consistent.
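The horizontal-cut rule can be implemented directly: apply only the merges recorded at or below the cut height. A minimal sketch, assuming the merge history is stored as (height, cluster_a, cluster_b) tuples with clusters as frozensets of point indices (names ours; SciPy users would call `scipy.cluster.hierarchy.fcluster` with `criterion='distance'`):

```python
def cut_dendrogram(n, merges, height):
    """Flat clustering from a horizontal cut of a dendrogram.

    `merges` lists (merge_height, cluster_a, cluster_b) tuples in merge
    order; replaying only the merges at or below `height` yields exactly
    the clusters a horizontal cut at that height would produce.
    """
    clusters = {frozenset([i]) for i in range(n)}
    for h, a, b in merges:
        if h <= height:
            clusters.discard(a)
            clusters.discard(b)
            clusters.add(a | b)
    return clusters
```

Cutting low yields many small clusters (down to singletons); cutting above the last merge height yields a single all-encompassing cluster, mirroring the bullet on horizontal cuts above.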

4.1 Cophenetic Correlation

A critical question arises: how well does the dendrogram represent the actual structure of the data? The cophenetic distance between two points is the height at which they first merge in the dendrogram. The cophenetic correlation coefficient measures the correlation between the original pairwise distances and these cophenetic distances (Sokal & Rohlf, 1962).

High cophenetic correlation (close to 1.0) indicates that the dendrogram faithfully preserves the distance relationships in the original data. Low correlation suggests that the hierarchical structure is a poor fit — either the data lacks genuine hierarchical organization, or the wrong linkage criterion was chosen. Comparing cophenetic correlations across linkage methods provides a principled way to select among alternatives.
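Cophenetic correlation is straightforward to compute from a merge history. A pure-Python sketch (function name ours; in practice `scipy.cluster.hierarchy.cophenet` computes this from a linkage matrix):

```python
from itertools import combinations

def cophenetic_correlation(points, merges):
    """Pearson correlation between original pairwise distances and
    cophenetic distances (the height at which each pair first merges).

    `merges` is a full merge history of (height, cluster_a, cluster_b)
    tuples, clusters as frozensets of point indices.
    """
    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5

    # Each pair of points is joined by exactly one merge; record its height.
    coph = {}
    for h, a, b in merges:
        for i in a:
            for j in b:
                coph[frozenset((i, j))] = h

    pairs = list(combinations(range(len(points)), 2))
    xs = [dist(i, j) for i, j in pairs]               # original distances
    ys = [coph[frozenset((i, j))] for i, j in pairs]  # cophenetic distances
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Running this for each candidate linkage on the same data and keeping the method with the highest coefficient is the selection procedure suggested above.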

Research Gap #2: The dendrogram cutting problem — deciding where to “slice” the tree to obtain a final clustering — remains largely heuristic. Methods such as the “elbow” in merge distances, gap statistics, or dynamic tree cutting lack strong theoretical foundations. The field needs principled, data-driven approaches for optimal dendrogram partitioning that account for cluster quality, not just distance structure.

5. Algorithmic Complexity and Scalability

The elegance of hierarchical clustering comes at a computational price. The naive agglomerative algorithm requires O(n³) time: each of n-1 merge steps involves searching an n×n distance matrix. Space complexity is O(n²) for storing the distance matrix. For datasets of a few thousand points, these costs are manageable; for modern datasets with millions of observations, they are prohibitive.

5.1 Optimized Implementations

Several algorithmic improvements reduce the computational burden:

  • Nearest-neighbor chain algorithm: For reducible linkage criteria (single, complete, average, Ward’s), nearest-neighbor chains achieve O(n²) time complexity while using only O(n) auxiliary space (Murtagh, 1983). The key insight is that under reducibility, a pair of mutual nearest neighbors can be merged immediately — no later merge can bring a third cluster closer to either of them — and following chains of nearest neighbors finds such pairs efficiently.
  • Priority queue implementations: Using a heap data structure to track the minimum distance pair reduces per-step complexity from O(n²) to O(log n), yielding O(n² log n) overall (Day & Edelsbrunner, 1984).
  • Spatial indexing: For data in low-dimensional Euclidean spaces, k-d trees or ball trees can accelerate nearest-neighbor queries, though benefits diminish rapidly with increasing dimensionality.

5.2 Approximate and Parallel Methods

True scalability requires abandoning exact computation. Approximate hierarchical clustering algorithms sacrifice guaranteed optimality for practical tractability:

  • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): Constructs a compact summary structure (CF-tree) in a single data scan, then applies hierarchical clustering to cluster features rather than raw points (Zhang et al., 1996). BIRCH handles datasets that exceed main memory and runs in O(n) time for the summary construction phase.
  • CURE (Clustering Using REpresentatives): Represents each cluster by multiple scattered points rather than a single centroid, capturing non-spherical shapes while enabling efficient approximate distance computation (Guha et al., 1998).
  • Parallel implementations: Recent work has mapped hierarchical clustering to distributed computing frameworks like MapReduce and Spark, partitioning the distance matrix across nodes and parallelizing merge operations (Dash et al., 2019).

Research Gap #3: Despite algorithmic advances, hierarchical clustering remains impractical for datasets beyond approximately 100,000 points in most implementations. The field lacks truly scalable algorithms that preserve the full dendrogram structure — most “scalable” methods produce only a flat clustering or a partial hierarchy. Streaming and online variants that update dendrograms incrementally as new data arrives represent a particularly underdeveloped area.

6. Modern Extensions and Variants

Contemporary research has extended classical hierarchical clustering in several directions, addressing limitations of the original formulations while preserving the core insight of nested structure.

6.1 HDBSCAN: Hierarchical Density-Based Clustering

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) represents perhaps the most significant modern extension of hierarchical clustering (Campello et al., 2013). Rather than using geometric distance directly, HDBSCAN constructs a hierarchy based on mutual reachability distance — a density-aware transformation that makes the algorithm robust to varying local densities.

The algorithm proceeds in stages: first, compute the core distance for each point (distance to the k-th nearest neighbor); second, construct a minimum spanning tree using mutual reachability distances; third, build a hierarchical tree by progressively removing edges in decreasing weight order; finally, extract stable clusters using persistence-based analysis of the hierarchy.
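The mutual reachability transform at the heart of the first two stages is easy to state in code. A sketch (function name ours), assuming the text's convention that the core distance is the distance to the k-th nearest neighbour excluding the point itself — implementations vary on this detail:

```python
def mutual_reachability(points, k=2):
    """Mutual reachability distances as used by HDBSCAN.

    core_k(p) is the distance from p to its k-th nearest neighbour;
    mreach(p, q) = max(core_k(p), core_k(q), d(p, q)). The transform
    pushes points in sparse regions further apart, which is what makes
    the subsequent hierarchy robust to varying density. Brute force for
    illustration; real implementations use spatial indexes.
    """
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    n = len(points)
    core = []
    for i in range(n):
        dists = sorted(d(points[i], points[j]) for j in range(n) if j != i)
        core.append(dists[k - 1])  # k-th nearest neighbour distance
    return {
        (i, j): max(core[i], core[j], d(points[i], points[j]))
        for i in range(n) for j in range(i + 1, n)
    }
```

The minimum spanning tree of stage two is then built over these transformed distances rather than the raw ones.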

HDBSCAN inherits the strengths of both hierarchical clustering (nested structure, no need to prespecify cluster count) and density-based methods (ability to discover arbitrary shapes, automatic noise detection). The method has become a standard tool in practical data science, with efficient implementations available in major scientific computing libraries (McInnes et al., 2017).

6.2 Constrained Hierarchical Clustering

Many applications involve side information — must-link constraints (points that must belong to the same cluster), cannot-link constraints (points that must be separated), or partial labels. Constrained hierarchical clustering algorithms modify the agglomerative procedure to respect such constraints.

The key challenge is maintaining consistency: if points A and B have a must-link constraint, they should merge before either merges with any other point. Davidson and Ravi (2005) proved that deciding whether a dendrogram satisfying arbitrary must-link and cannot-link constraints exists is NP-complete, motivating heuristic approaches that approximately satisfy constraints while preserving reasonable clustering quality.

6.3 Bayesian Hierarchical Clustering

Bayesian hierarchical clustering (Heller & Ghahramani, 2005) reframes the problem probabilistically. Rather than using heuristic linkage criteria, the algorithm computes the probability that data in two subtrees were generated by the same underlying distribution. Merges proceed by selecting the pair with highest posterior probability.

This Bayesian formulation offers several advantages: principled handling of uncertainty, natural incorporation of prior knowledge through distribution choices, and a built-in measure of clustering quality (marginal likelihood). However, the computational cost is substantial, requiring numerical integration for each potential merge.

graph TD
    subgraph Modern Extensions
        HC[Classical Hierarchical Clustering]
        HC --> HDBSCAN["HDBSCAN: density-aware hierarchy"]
        HC --> Constrained["Constrained HC: must-link / cannot-link"]
        HC --> Bayesian["Bayesian HC: probabilistic merging"]
        HC --> Incremental["Incremental HC: online updates"]
        HC --> Parallel["Parallel/Distributed: scalable computation"]
    end

7. Applications Across Domains

Hierarchical clustering’s ability to reveal multi-scale structure makes it particularly valuable in domains where nested relationships are semantically meaningful.

7.1 Bioinformatics and Genomics

Hierarchical clustering was born in biology, and biology remains its most prolific application domain. Gene expression analysis routinely uses hierarchical clustering to identify co-expressed genes across experimental conditions, with dendrograms revealing functional modules and regulatory relationships (Eisen et al., 1998). Phylogenetic reconstruction relies on hierarchical methods to infer evolutionary trees from molecular sequence data. Protein structure classification organizes the universe of protein folds into nested taxonomies using structural similarity metrics (Holm & Sander, 1997).

7.2 Document and Text Mining

Document hierarchies enable browsable organization of large text collections. Unlike flat clustering, which assigns each document to a single category, hierarchical clustering reveals topical relationships at multiple granularities — from broad themes (politics, sports, technology) down to specific subtopics (electoral politics, basketball, artificial intelligence). Search engines use hierarchical clustering to organize search results, and digital libraries employ it to construct browsing taxonomies (Manning et al., 2008).

7.3 Image Segmentation

Hierarchical image segmentation produces nested partitions of pixels, enabling users to select segmentation granularity interactively. The Berkeley Segmentation Dataset and Benchmark introduced hierarchical evaluation metrics that compare algorithm output to human-constructed hierarchies, acknowledging that the “correct” segmentation depends on the level of detail required for a given task (Arbeláez et al., 2011).

7.4 Social Network Analysis

Communities in social networks often exhibit hierarchical structure — individuals belong to friend groups, which belong to larger communities, which belong to even broader social categories. Hierarchical community detection algorithms reveal this nested organization, with applications in influence analysis, viral marketing, and understanding information diffusion patterns (Clauset et al., 2008).

8. Research Gaps and Future Directions

Despite six decades of development, hierarchical clustering retains significant open problems that limit its applicability and theoretical foundations.

Research Gap #4: High-dimensional data poses severe challenges for hierarchical clustering. As dimensionality increases, distance metrics become increasingly meaningless (the “curse of dimensionality”), and all points become approximately equidistant. While dimensionality reduction can help, the field lacks hierarchical clustering methods that intrinsically handle high-dimensional structure without preprocessing.

Research Gap #5: Non-Euclidean data manifolds — graphs, strings, probability distributions, manifolds with complex topology — require specialized hierarchical clustering approaches. While distance-based methods can be applied using appropriate metrics, the linkage criteria and tree construction algorithms may behave unexpectedly. Principled extensions to Riemannian manifolds, metric spaces with negative curvature, and discrete structures remain active research areas.

Additional underexplored areas include:

  • Interactive and human-in-the-loop hierarchical clustering: Incorporating user feedback to guide merge decisions and refine dendrogram structure.
  • Multi-view hierarchical clustering: Integrating multiple data representations (e.g., text and images describing the same entities) into a unified hierarchy.
  • Temporal hierarchical clustering: Tracking how hierarchical structure evolves over time in dynamic datasets.
  • Explainable hierarchical clustering: Generating human-interpretable explanations for why specific merge decisions were made.

9. Chapter Summary

Hierarchical clustering provides a unique window into data structure — not just which points belong together, but how they relate at multiple scales of analysis. The dendrogram representation captures nested relationships that flat clustering methods cannot express, making hierarchical methods essential for domains where multi-level organization is semantically meaningful.

Key takeaways from this chapter:

  • Two paradigms: Agglomerative (bottom-up) methods dominate practice due to computational tractability; divisive (top-down) methods offer theoretical advantages but lack efficient algorithms.
  • Linkage is everything: The choice of linkage criterion — single, complete, average, or Ward’s — encodes geometric assumptions that fundamentally determine clustering outcomes. There is no universally “best” linkage; the choice must match domain knowledge about expected cluster structure.
  • Dendrogram interpretation: Reading dendrograms requires understanding height semantics, cutting strategies, and quality measures like cophenetic correlation.
  • Scalability limits: Classical algorithms scale poorly beyond moderate dataset sizes; approximate methods sacrifice exactness for tractability.
  • Modern extensions: HDBSCAN, constrained clustering, and Bayesian approaches extend classical methods to handle density variation, side information, and uncertainty quantification.

The five research gaps identified in this chapter — divisive algorithm efficiency, dendrogram cutting theory, high-dimensional methods, streaming/online variants, and non-Euclidean extensions — represent fertile ground for future work. As data continues to grow in scale and complexity, hierarchical clustering must evolve to meet new challenges while preserving the core insight that revealed structure at multiple scales provides deeper understanding than any single partition.


References

Arbeláez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 898-916. https://doi.org/10.1109/TPAMI.2010.161

Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining (pp. 160-172). Springer. https://doi.org/10.1007/978-3-642-37456-2_14

Clauset, A., Moore, C., & Newman, M. E. J. (2008). Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191), 98-101. https://doi.org/10.1038/nature06830

Dash, M., Liu, H., Scheuermann, P., & Tan, K. L. (2003). Fast hierarchical clustering and its validation. Data & Knowledge Engineering, 44(1), 109-138. https://doi.org/10.1016/S0169-023X(02)00138-6

Davidson, I., & Ravi, S. S. (2005). Agglomerative hierarchical clustering with constraints: Theoretical and empirical results. In European Conference on Principles of Data Mining and Knowledge Discovery (pp. 59-70). Springer. https://doi.org/10.1007/11564126_11

Day, W. H. E., & Edelsbrunner, H. (1984). Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1(1), 7-24. https://doi.org/10.1007/BF01890115

Eisen, M. B., Spellman, P. T., Brown, P. O., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25), 14863-14868. https://doi.org/10.1073/pnas.95.25.14863

Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster Analysis (5th ed.). Wiley. https://doi.org/10.1002/9780470977811

Felsenstein, J. (2004). Inferring Phylogenies. Sinauer Associates. ISBN: 978-0878931774.

Gower, J. C., & Ross, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 18(1), 54-64. https://doi.org/10.2307/2346439

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 73-84). ACM. https://doi.org/10.1145/276304.276312

Heller, K. A., & Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning (pp. 297-304). ACM. https://doi.org/10.1145/1102351.1102389

Holm, L., & Sander, C. (1997). Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Research, 25(1), 231-234. https://doi.org/10.1093/nar/25.1.231

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323. https://doi.org/10.1145/331499.331504

Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. https://doi.org/10.1002/9780470316801

Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. The Computer Journal, 9(4), 373-380. https://doi.org/10.1093/comjnl/9.4.373

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071

McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11), 205. https://doi.org/10.21105/joss.00205

Müllner, D. (2011). Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378. https://doi.org/10.48550/arXiv.1109.2378

Murtagh, F. (1983). A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4), 354-359. https://doi.org/10.1093/comjnl/26.4.354

Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86-97. https://doi.org/10.1002/widm.53

Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? Journal of Classification, 31(3), 274-295. https://doi.org/10.1007/s00357-014-9161-z

Rokach, L., & Maimon, O. (2005). Clustering methods. In Data Mining and Knowledge Discovery Handbook (pp. 321-352). Springer. https://doi.org/10.1007/0-387-25465-X_15

Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30-34. https://doi.org/10.1093/comjnl/16.1.30

Sneath, P. H. A., & Sokal, R. R. (1973). Numerical Taxonomy: The Principles and Practice of Numerical Classification. W.H. Freeman. ISBN: 978-0716706977.

Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409-1438.

Sokal, R. R., & Rohlf, F. J. (1962). The comparison of dendrograms by objective methods. Taxon, 11(2), 33-40. https://doi.org/10.2307/1217208

Sokal, R. R., & Sneath, P. H. A. (1963). Principles of Numerical Taxonomy. W.H. Freeman. ISBN: 978-0716701279.

Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining (Vol. 400, pp. 525-526).

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236-244. https://doi.org/10.1080/01621459.1963.10500845

Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678. https://doi.org/10.1109/TNN.2005.845141

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25(2), 103-114. https://doi.org/10.1145/235968.233324
