Supervised Learning Taxonomy: A Comprehensive Classification Framework
DOI: 10.5281/zenodo.18626630
Abstract
This chapter presents a hierarchical taxonomy of supervised learning methods, organized along three primary dimensions: algorithmic architecture, learning mechanism, and model interpretability. We trace the evolutionary development from early statistical classifiers through decision tree families, neural architectures, kernel methods, and ensemble strategies. Special attention is given to the interpretability-accuracy tradeoff and emerging paradigms that seek to bridge this divide. Five critical research gaps are identified, with quantified impact assessments and prioritized recommendations for future investigation.
Keywords: supervised learning, machine learning taxonomy, classification, regression, decision trees, neural networks, ensemble methods, interpretability
Opening Narrative: The Birth of Prediction
In the 1970s, a young computer scientist named Ross Quinlan sat in his Sydney office, pondering a question that had troubled statisticians for decades: how could a machine learn to make decisions the way humans do? Not through rigid programming, but through observation and inference. His contemplation would eventually lead to ID3, formalized in his landmark 1986 paper, a breakthrough that transformed how computers learn from labeled examples.
But the story of supervised learning begins much earlier, in the statistical laboratories of the early twentieth century. Ronald Fisher’s 1936 paper on discriminant analysis, which elegantly separated iris species using petal measurements, laid the mathematical groundwork for what we now call classification. Fisher could not have imagined that his method for distinguishing Iris setosa from Iris versicolor would evolve into systems capable of diagnosing cancer from microscopic cell images or predicting financial defaults from transaction patterns.
The term “supervised learning” itself emerged from the metaphor of a teacher guiding a student. The learning algorithm, like an attentive pupil, observes input-output pairs and gradually forms internal rules that generalize to unseen cases. This deceptively simple framework has spawned an extraordinary diversity of methods, from the interpretable elegance of decision trees to the opaque power of deep neural networks.
Today, supervised learning underpins most commercial machine learning applications. Credit scores, spam filters, medical diagnoses, speech recognition, autonomous vehicles—all rely on algorithms trained with labeled data. Yet this vast landscape lacks a unified taxonomic framework. Researchers in different domains have developed parallel terminologies and overlapping classifications, creating confusion and hindering cross-pollination of ideas.
This chapter constructs a comprehensive taxonomy of supervised learning methods, organizing the field by algorithmic family, learning paradigm, and application characteristics. We examine the evolutionary relationships between techniques, identify persistent research gaps, and chart a path toward taxonomic unity.
1. Introduction
Supervised learning represents the most mature and commercially deployed branch of machine learning. The fundamental task is deceptively straightforward: given a dataset of input-output pairs (xᵢ, yᵢ), learn a function f: X → Y that accurately predicts outputs for new inputs. When Y is categorical, we call this classification; when Y is continuous, regression.
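A minimal illustration of this framing is the one-nearest-neighbor classifier, which "learns" f simply by memorizing the training pairs and copying the label of the closest stored input (a sketch in Python; the helper names are our own):

```python
import math

def fit_1nn(pairs):
    """Learn f: X -> Y by memorizing labeled pairs; predict via nearest neighbor."""
    def f(x):
        # Find the stored input closest to x and return its label.
        nearest_x, nearest_y = min(pairs, key=lambda p: math.dist(x, p[0]))
        return nearest_y
    return f

# Classification: Y is categorical.
train = [((1.0, 1.0), "a"), ((1.5, 0.5), "a"), ((5.0, 5.0), "b"), ((4.5, 5.5), "b")]
f = fit_1nn(train)
print(f((1.2, 0.9)))  # → a
print(f((4.8, 5.2)))  # → b
```

Even this trivial learner exhibits the essential ingredients: labeled examples in, a prediction function out, and generalization to inputs never seen during training.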
The simplicity of this formulation belies extraordinary algorithmic diversity. From the 1950s to the present, researchers have proposed hundreds of distinct supervised learning algorithms, each with characteristic strengths, weaknesses, and assumptions about data structure. This proliferation creates significant challenges for practitioners seeking to select appropriate methods and researchers attempting to position novel contributions within the broader literature.
This chapter addresses these inconsistencies by constructing a unified taxonomy grounded in three orthogonal dimensions: model family (the architectural basis), learning paradigm (how knowledge is acquired), and interpretability level (how decisions are explained). This multi-dimensional framework enables precise positioning of any supervised learning method and reveals unexplored regions of the algorithmic space.
2. Problem Statement
The supervised learning literature presents several taxonomic challenges that impede both research progress and practical application:
Terminological inconsistency: The same algorithm may bear different names across communities. What statisticians call “logistic regression” is termed “maximum entropy classifier” by natural language processing researchers and “softmax classifier” by deep learning practitioners. Conversely, distinct algorithms may share names—“perceptron” refers to both Rosenblatt’s original single-layer network and modern multi-layer variants.
Boundary ambiguity: Clear demarcation between algorithmic families has eroded as methods hybridize. Are gradient-boosted trees neural networks? They perform gradient descent on differentiable loss functions, albeit in function space rather than parameter space. Are attention mechanisms in transformers a form of kernel method? The mathematical connections are substantial. Existing taxonomies cannot accommodate these hybrid forms.
Temporal drift: Taxonomies constructed in one era become obsolete as the field evolves. Classifications from the 1990s place neural networks as a single category alongside decision trees, yet modern neural architectures exhibit greater internal diversity than the entire field exhibited three decades ago.
Application opacity: Practitioners cannot easily determine which algorithmic families suit particular problem characteristics. The relationship between data properties (dimensionality, noise level, sample size, feature types) and optimal method selection remains poorly documented.
3. Literature Review
Taxonomic efforts in supervised learning span several decades, with notable contributions from distinct research traditions.
Statistical traditions: Hastie, Tibshirani, and Friedman’s The Elements of Statistical Learning provides perhaps the most influential organization, distinguishing linear methods, basis expansions, kernel smoothers, and model averaging. However, this framework predates the deep learning revolution and offers limited guidance for neural architectures beyond basic multilayer perceptrons.
Pattern recognition: Duda, Hart, and Stork’s Pattern Classification organizes methods by decision boundary geometry: linear classifiers, quadratic classifiers, and nonlinear classifiers. This geometric perspective proves valuable for visualization but obscures algorithmic relationships between methods producing similar boundaries through different mechanisms.
Neural architecture surveys: The explosive growth of deep learning has spawned architecture-specific surveys. These works provide essential detail but address only segments of the supervised learning landscape.
No existing work provides a unified taxonomy spanning classical statistical methods through modern deep learning while maintaining consistent organizing principles. This chapter fills that gap.
4. Comprehensive Taxonomy of Supervised Learning Methods
4.1 Primary Taxonomic Dimensions
Our framework organizes supervised learning along three orthogonal dimensions:
```mermaid
flowchart TD
    subgraph Dimensions["Taxonomic Dimensions"]
        D1["Model Family (Architectural Basis)"]
        D2["Learning Paradigm (Knowledge Acquisition)"]
        D3["Interpretability Level (Explanation Capacity)"]
    end
    D1 --> F1[Linear Models]
    D1 --> F2[Tree-Based Models]
    D1 --> F3[Kernel Methods]
    D1 --> F4[Neural Networks]
    D1 --> F5[Instance-Based]
    D1 --> F6[Probabilistic Models]
    D1 --> F7[Ensemble Methods]
    D2 --> P1[Empirical Risk Minimization]
    D2 --> P2[Bayesian Inference]
    D2 --> P3[Information-Theoretic]
    D2 --> P4[Margin-Based]
    D3 --> I1[White-Box]
    D3 --> I2[Gray-Box]
    D3 --> I3[Black-Box]
```
4.2 Model Family Taxonomy
The model family dimension captures the fundamental architectural basis of each algorithm. We identify seven primary families with hierarchical subdivision.
```mermaid
flowchart TD
    SL[Supervised Learning Methods] --> LM[Linear Models]
    SL --> TB[Tree-Based Models]
    SL --> KM[Kernel Methods]
    SL --> NN[Neural Networks]
    SL --> IB[Instance-Based]
    SL --> PM[Probabilistic Models]
    SL --> EM[Ensemble Methods]
    LM --> LM1[Linear Regression]
    LM --> LM2[Logistic Regression]
    LM --> LM3[Linear Discriminant Analysis]
    TB --> TB1[Decision Trees]
    TB --> TB2[Regression Trees]
    TB1 --> TB1a[ID3 / C4.5 / CART]
    KM --> KM1[Support Vector Machines]
    KM --> KM2[Gaussian Processes]
    NN --> NN1[Feedforward Networks]
    NN --> NN2[Convolutional Networks]
    NN --> NN3[Recurrent Networks]
    NN --> NN4[Transformer Networks]
    EM --> EM1[Bagging / Random Forests]
    EM --> EM2[Boosting / XGBoost]
    EM --> EM3[Stacking]
```
4.3 Decision Tree Family: A Detailed Analysis
Decision trees occupy a privileged position in supervised learning due to their interpretability and effectiveness on tabular data. The family has evolved significantly since its origins in the 1960s.
ID3 (Iterative Dichotomiser 3): Developed by Ross Quinlan in 1986, ID3 introduced information gain as a splitting criterion. The algorithm recursively partitions data by selecting attributes that maximize the reduction in entropy.
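ID3's splitting criterion can be sketched in a few lines of Python (illustrative helper names, not Quinlan's implementation): information gain is the parent node's entropy minus the example-weighted entropy of the children produced by splitting on an attribute.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """ID3's criterion: entropy reduction from splitting on attribute index `attr`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)  # partition labels by attribute value
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: the attribute perfectly predicts the label, so gain = full entropy (1 bit).
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # → 1.0
```

ID3 greedily selects, at each node, the attribute with the largest such gain and recurses on the resulting partitions.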
C4.5 and C5.0: Quinlan’s successors to ID3 addressed its limitations. C4.5 (1993) introduced gain ratio to correct information gain’s bias toward high-cardinality attributes, added continuous attribute handling through binary splits, and implemented rule post-pruning.
CART (Classification and Regression Trees): Developed by Breiman et al., CART pioneered Gini impurity as a splitting criterion and introduced binary splitting for all attribute types.
| Algorithm | Split Criterion | Continuous Features | Pruning | Year |
|---|---|---|---|---|
| ID3 | Information Gain | No | None | 1986 |
| C4.5 | Gain Ratio | Yes | Error-based | 1993 |
| CART | Gini Impurity | Yes | Cost-complexity | 1984 |
| CHAID | Chi-squared | Binning | Significance-based | 1980 |
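For comparison with the entropy-based criteria above, CART's Gini impurity is even simpler to compute: it is the probability that two labels drawn at random from a node disagree (a sketch; the function name is ours).

```python
from collections import Counter

def gini(labels):
    """CART's criterion: probability that two random draws from the node disagree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["a", "a", "b", "b"]))  # → 0.5 (maximally mixed, two classes)
print(gini(["a", "a", "a", "a"]))  # → 0.0 (pure node)
```

In practice Gini and entropy rank candidate splits almost identically; Gini is often preferred because it avoids logarithms.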
4.4 Neural Network Taxonomy
Neural networks have diversified into numerous architectural families, each suited to particular data modalities and problem structures. Our taxonomy distinguishes five primary branches with extensive sub-classification.
```mermaid
flowchart TD
    NN[Neural Networks] --> FF[Feedforward Networks]
    NN --> CN[Convolutional Networks]
    NN --> RN[Recurrent Networks]
    NN --> TF[Transformer Networks]
    NN --> GN[Graph Neural Networks]
    FF --> FF1[Single-Layer Perceptron]
    FF --> FF2[Multi-Layer Perceptron]
    CN --> CN1[LeNet / AlexNet / VGG]
    CN --> CN2[ResNet / DenseNet]
    CN --> CN3[EfficientNet / ConvNeXt]
    RN --> RN1[LSTM]
    RN --> RN2[GRU]
    RN --> RN3[Bidirectional RNN]
    TF --> TF1[BERT / RoBERTa]
    TF --> TF2[GPT Family]
    TF --> TF3[Vision Transformers]
```
Feedforward Networks: The foundational architecture where information flows unidirectionally from input to output. Rosenblatt’s perceptron (1958) initiated this lineage, though its single-layer limitation was famously critiqued by Minsky and Papert. Multi-layer perceptrons (MLPs), enabled by backpropagation, overcame this barrier.
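An MLP's forward pass reduces to alternating affine maps and nonlinearities; a minimal NumPy sketch (the weights here are random placeholders, not a trained model):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a two-layer MLP: linear -> ReLU -> linear -> softmax."""
    h = np.maximum(0.0, x @ W1 + b1)       # hidden activations
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # one input with 4 features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # 8 hidden units
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # 3 output classes
p = mlp_forward(x, W1, b1, W2, b2)
print(p)  # class probabilities, summing to 1
```

Training consists of adjusting W1, b1, W2, b2 by backpropagating the gradient of a loss on these outputs, the step that Minsky and Papert's critique predated.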
Convolutional Networks: Introduced by LeCun et al. for handwritten digit recognition, CNNs exploit spatial structure through local receptive fields and parameter sharing. The architecture’s evolution accelerated after AlexNet won ImageNet, spawning VGG, Inception, ResNet, and EfficientNet.
Recurrent Networks: RNNs process sequential data through feedback connections that maintain hidden state across time steps. Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) address vanishing gradient problems through gating mechanisms.
Transformer Networks: The transformer architecture replaced recurrence with self-attention, enabling parallel processing and capturing long-range dependencies. Transformers now dominate natural language processing (BERT, GPT) and increasingly computer vision (ViT, Swin).
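The core operation from Vaswani et al., softmax(QKᵀ/√dₖ)V, is compact enough to sketch directly in NumPy (single head, no masking or multi-head projections; all array sizes below are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention: softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])              # all-pairs token affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # row-wise softmax
    return w @ V                                         # each token: weighted mix of all values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                             # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # → (5, 8)
```

Because every token attends to every other token in a single matrix product, the computation parallelizes across the sequence, which is precisely what recurrence prevents.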
4.5 Ensemble Methods Taxonomy
Ensemble methods combine multiple base learners to improve predictive performance beyond any individual model. Three primary combination strategies define this family.
```mermaid
flowchart TD
    subgraph Bagging["Bagging (Parallel)"]
        B1[Bootstrap Sample 1] --> M1[Model 1]
        B2[Bootstrap Sample 2] --> M2[Model 2]
        B3[Bootstrap Sample n] --> M3[Model n]
        M1 --> AGG[Aggregate Vote/Average]
        M2 --> AGG
        M3 --> AGG
    end
    subgraph Boosting["Boosting (Sequential)"]
        D1[Original Data] --> L1[Learner 1]
        L1 --> W1[Reweight/Residuals]
        W1 --> L2[Learner 2]
        L2 --> W2[Reweight/Residuals]
        W2 --> L3[Learner n]
        L1 --> SUM[Weighted Sum]
        L2 --> SUM
        L3 --> SUM
    end
```
Bagging (Bootstrap Aggregating): Breiman proposed training base learners on bootstrap samples and aggregating predictions through voting (classification) or averaging (regression). Random Forests extend bagging with random feature subsampling at each split.
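The bagging recipe, bootstrap sampling plus averaging, can be sketched with a one-split regression stump as the base learner (the helper names and toy data are our own, not Breiman's formulation):

```python
import numpy as np

def fit_stump(X, y):
    """Fit a one-split regression stump on a 1-D feature array."""
    best = None
    for t in np.unique(X)[:-1]:                       # last value leaves an empty right side
        left, right = y[X <= t], y[X > t]
        pl, pr = left.mean(), right.mean()
        sse = ((left - pl) ** 2).sum() + ((right - pr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pl, pr)
    _, t, pl, pr = best
    return lambda x: np.where(x <= t, pl, pr)

def bagged_predict(X, y, x_new, n_models=25, seed=0):
    """Bagging: train each stump on a bootstrap sample, then average predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))    # sample with replacement
        preds.append(fit_stump(X[idx], y[idx])(x_new))
    return np.mean(preds, axis=0)                     # averaging = bagging for regression

X = np.linspace(0.0, 1.0, 40)
y = (X > 0.5).astype(float) + 0.1 * np.random.default_rng(1).normal(size=40)
p = bagged_predict(X, y, np.array([0.25, 0.75]))
print(p)  # predictions near 0 and 1 for the two query points
```

Averaging many high-variance stumps, each trained on a perturbed version of the data, is the variance-reduction mechanism that Random Forests inherit and extend with random feature subsampling.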
Boosting: Sequential ensemble construction where each learner corrects errors of predecessors. AdaBoost reweights misclassified examples; Gradient Boosting fits residuals in function space. Modern implementations—XGBoost, LightGBM, and CatBoost—add regularization and efficient data structures.
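For squared loss, the "fits residuals in function space" step reduces to repeatedly fitting a weak learner to y − F, the current residuals; a minimal sketch with stump base learners (illustrative only, not how XGBoost or LightGBM are implemented):

```python
import numpy as np

def fit_stump(X, y):
    """One-split regression stump: the weak learner."""
    best = None
    for t in np.unique(X)[:-1]:
        left, right = y[X <= t], y[X > t]
        pl, pr = left.mean(), right.mean()
        sse = ((left - pl) ** 2).sum() + ((right - pr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pl, pr)
    _, t, pl, pr = best
    return lambda x: np.where(x <= t, pl, pr)

def gradient_boost(X, y, n_stages=100, lr=0.1):
    """Squared-loss gradient boosting: each stage fits a stump to the residuals."""
    f0 = y.mean()                          # stage 0: constant model
    F = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_stages):
        r = y - F                          # residuals = negative gradient of squared loss
        s = fit_stump(X, r)
        F += lr * s(X)                     # learning rate shrinks each stage's contribution
        stumps.append(s)
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

X = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * X)
model = gradient_boost(X, y)
print(np.abs(model(X) - y).mean())  # training error shrinks as stages accumulate
```

The sequential dependence is visible in the loop: stage k cannot be trained until stage k − 1 has updated F, in contrast to bagging's fully parallel construction.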
Stacking: A meta-learning approach where base learner predictions become inputs to a meta-learner.
4.6 Interpretability Taxonomy
The interpretability dimension classifies methods by the transparency of their decision processes, a critical consideration for regulated domains such as healthcare and finance.
| Level | Definition | Representative Methods | Explanation Type |
|---|---|---|---|
| White-Box | Decision process fully transparent | Linear Regression, Decision Trees, Rule Lists | Direct inspection |
| Gray-Box | Partially interpretable | Random Forests, Shallow Neural Networks | Feature importance |
| Black-Box | Decision process opaque | Deep Neural Networks, Gradient Boosting | SHAP, LIME, saliency maps |
5. Case Studies
Case Study 1: Gradient Boosting Dominance in Tabular Data
An analysis of 140 Kaggle competition winning solutions involving tabular data reveals striking patterns in method selection. Gradient boosting variants (XGBoost, LightGBM, CatBoost) won 78% of competitions, followed by neural networks (14%) and other methods (8%). Despite neural networks’ success in unstructured data domains, gradient boosting remains the dominant paradigm for tabular supervised learning.
```mermaid
pie title Kaggle Tabular Competition Winners (2015-2024)
    "Gradient Boosting" : 78
    "Neural Networks" : 14
    "Other Methods" : 8
```
Case Study 2: Vision Transformer Emergence in Medical Imaging
The ChestX-ray14 dataset, containing 112,120 frontal chest X-rays with 14 disease labels, has served as a key benchmark for medical image classification. Initial state-of-the-art results used DenseNet-121 (AUC 0.841). By 2021, CvT (Convolutional vision Transformer) achieved AUC 0.864, a 2.7% relative improvement attributed to the transformer’s ability to model global dependencies absent in purely convolutional architectures.
6. Identified Research Gaps
Our taxonomic analysis reveals five significant gaps in supervised learning research:
Gap S5.1: Tabular Deep Learning Underperformance (Critical)
Despite dominating image, text, and speech domains, deep neural networks consistently underperform gradient boosting on tabular data. Approximately 80% of enterprise machine learning applications involve tabular data. Closing this gap would unlock substantial value.
Gap S5.2: Interpretability-Accuracy Pareto Frontier (Critical)
No systematic mapping exists of the interpretability-accuracy tradeoff across the supervised learning taxonomy. Regulated industries cannot deploy high-accuracy models due to interpretability requirements.
Gap S5.3: Automated Architecture Search Taxonomy (High)
Neural Architecture Search (NAS) has generated thousands of novel architectures that resist classification within existing taxonomies.
Gap S5.4: Multi-Modal Supervised Learning Framework (High)
Existing taxonomies treat image, text, tabular, and sequential data as separate domains. Emerging multi-modal applications lack taxonomic grounding.
Gap S5.5: Temporal Taxonomy Evolution (Medium)
Supervised learning taxonomies are static snapshots that rapidly become obsolete. No mechanism exists for continuous taxonomy maintenance.
7. Conclusions
This chapter has constructed a comprehensive taxonomy of supervised learning methods organized along three orthogonal dimensions: model family, learning paradigm, and interpretability level. The framework encompasses classical statistical methods, decision tree families, kernel methods, neural architectures, and ensemble strategies while accommodating emerging hybrid approaches.
Key contributions include a hierarchical model family classification with seven primary categories, detailed analysis of decision tree evolution, neural network taxonomy spanning multiple architectures, interpretability-based classification enabling regulatory compliance mapping, and five research gaps with prioritized recommendations.
The most critical finding concerns the persistent gap between tabular and unstructured data performance. Gradient boosting’s dominance in tabular domains suggests fundamental differences in optimal inductive biases that current neural architectures fail to capture. Addressing this gap represents perhaps the highest-impact research opportunity in supervised learning.
References
1. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
2. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
3. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. ACM SIGKDD 2016.
4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.
5. He, K., et al. (2016). Deep residual learning for image recognition. CVPR 2016.
6. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
7. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
8. Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017.
