Beyond the Benchmark: What AI Looks Like When It Actually Works

Posted on March 9, 2026 by Admin

📚 Academic Citation:
Ivchenko, O. (2026). Beyond the Benchmark: What AI Looks Like When It Actually Works. Future of AI Series. Odesa National Polytechnic University.
DOI: 10.5281/zenodo.18926904

Abstract

The most consequential question in applied artificial intelligence is not whether a model achieves state-of-the-art on a leaderboard. It is whether the model does something useful when connected to reality — to messy data, constrained infrastructure, and users who need answers rather than probabilities. This article examines what AI actually looks like when it crosses that boundary. Drawing on four deployed research systems — a medical imaging platform, a geopolitical risk engine, a pharmaceutical portfolio optimizer, and an enterprise decision toolkit — it argues that the defining characteristic of real-world AI is not accuracy but composability: the degree to which models, data pipelines, and interfaces can be assembled and reassembled by researchers without intermediaries. All referenced systems are live and openly accessible via the Stabilarity Research Platform API.


1. The Distance Between Demo and Deployment

There is a well-documented phenomenon in applied machine learning that practitioners call the demo gap: the distance between a model that performs convincingly on a curated test set and one that adds value in a production environment. This gap is not primarily technical. The models exist. The compute is accessible. The frameworks are mature. The gap is structural — it lives in the space between what a model can do and what a research or clinical environment can absorb.

Consider the trajectory of medical imaging AI. By 2020, deep learning models for chest X-ray classification were demonstrably matching radiologist performance on benchmark datasets (Ardila et al., 2019). By 2023, fewer than 5% of radiology departments in Eastern Europe had integrated any form of AI-assisted screening into routine workflow (Pesapane et al., 2022). The bottleneck was not the model. It was the absence of infrastructure that would allow a researcher in Odessa or Kharkiv to run an inference call against a validated model without building a deployment pipeline from scratch.

This is the problem that openly accessible research platforms are positioned to solve — not by replacing clinical infrastructure, but by removing the activation energy required to try.

2. Medical Imaging: Six Models, One Call

The ScanLab platform currently exposes six deep learning models for medical image classification, accessible via a single POST request. These include pneumonia detection from chest radiographs (trained on 5,856 images, validated on Ukrainian hospital data), COVID-19 classification with three-class output distinguishing COVID-19 from bacterial pneumonia and normal findings, melanoma detection from dermoscopy images following ISIC 2020 methodology, brain tumor detection from MRI scans, and a multi-label chest pathology classifier covering 14 conditions based on NIH CXR14 methodology.

Each inference response includes a confidence score, ICD-10 code suggestions for positive findings, and cost-effectiveness metrics — estimated clinician hours saved and cost savings in USD. These are not cosmetic additions. They are what translate a classification into a decision-support artifact that a clinician or researcher can act on.

The practical implication is that a researcher studying AI adoption in Ukrainian healthcare does not need to train a model, provision GPU compute, or negotiate API access with a vendor. The call is:

curl -H "X-API-Key: YOUR_KEY" \
  -F "file=@chest_xray.jpg" \
  -F "model=pneumonia" \
  https://hub.stabilarity.com/api/v1/scanlab/predict

This is not a trivial reduction. It is the difference between a research idea that requires six months of infrastructure work and one that can be prototyped in an afternoon. The activation energy reduction matters more than any marginal accuracy improvement at this stage of AI adoption in the region.
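For a researcher working in Python rather than the shell, the response handling is equally short. The sketch below skips the live network call and shows only how the returned JSON might feed a triage decision; the exact field names (label, confidence, icd10_codes, cost_effectiveness) are assumptions based on the response contents described above, not a documented schema.

```python
# Illustrative sketch only: turn a ScanLab prediction JSON into a
# decision-support summary. Field names are assumptions inferred from
# the response contents described in the text.
def summarize_prediction(resp: dict, threshold: float = 0.8) -> dict:
    """Route a finding to decision support or human review by confidence."""
    confident = resp["confidence"] >= threshold
    return {
        "finding": resp["label"],
        "route": "clinical_decision_support" if confident else "human_review",
        "icd10": resp.get("icd10_codes", []),
        "hours_saved": resp.get("cost_effectiveness", {}).get("clinician_hours_saved"),
    }

# Mock response in place of a live call:
mock = {
    "label": "pneumonia",
    "confidence": 0.93,
    "icd10_codes": ["J18.9"],
    "cost_effectiveness": {"clinician_hours_saved": 0.4, "cost_savings_usd": 31},
}
print(summarize_prediction(mock))
```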

graph TD
    A[Raw Clinical Data] --> B[Preprocessing Pipeline]
    B --> C[Feature Extraction]
    C --> D[ML Model Inference]
    D --> E{Confidence Threshold}
    E -->|High| F[Clinical Decision Support]
    E -->|Low| G[Human Review Queue]
    F --> H[Outcome Tracking]
    G --> H
    H --> I[Model Retraining Loop]
    I --> D

3. Geopolitical Risk: When the World Is the Input

Machine learning applied to geopolitical risk is a different category of problem. The data is heterogeneous, the labels are contested, and the cost of a false negative is not a misclassified image but a misallocated institutional response. The field has nonetheless produced rigorous quantitative frameworks. The GDELT Project aggregates 300 categories of conflict and cooperation events across 65 languages in near real-time (Leetaru & Schrodt, 2013). ACLED codes political violence and protest events across 100+ countries with researcher-validated methodology (Raleigh et al., 2010). The challenge is not data access — it is integration.

The Stabilarity Geopolitical Risk API aggregates these sources into an 87-country risk index with war, political, and economic risk components, supplemented by macro indicators including VIX, oil price, inflation, and food security index. What makes this useful for researchers is not the index itself — similar indices exist — but the composability. Any tool, any notebook, any model that needs regional context can retrieve it in one GET request. The result is that a researcher building an enterprise AI adoption model does not need to separately solve the problem of regional risk adjustment. They call the API.

The practical demonstration of this composability is visible in our own tooling. The AI Use Case Classifier and the ROI Calculator both retrieve live country risk data at runtime and adjust their outputs accordingly. A recommendation generated for a deployment in Switzerland carries a different risk multiplier than the same recommendation generated for a deployment in a high-conflict region. The underlying model is identical. The context is not.
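The mechanics of that adjustment fit in a few lines. The linear discount below is an illustrative assumption; the article specifies only that a 0-100 composite risk index feeds a country-specific multiplier, not the functional form.

```python
# Sketch: adjust an ROI estimate by a country risk multiplier, as the
# ROI Calculator does at runtime. The linear form and the `sensitivity`
# parameter are illustrative assumptions, not the platform's formula.
def risk_multiplier(risk_index: float, sensitivity: float = 0.5) -> float:
    """Map a 0-100 composite risk index to a discount multiplier in (0, 1]."""
    return 1.0 - sensitivity * (risk_index / 100.0)

def adjusted_roi(base_roi: float, risk_index: float) -> float:
    return base_roi * risk_multiplier(risk_index)

# Identical model output, different context:
print(adjusted_roi(2.4, 12))  # low-risk deployment
print(adjusted_roi(2.4, 78))  # high-conflict region
```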

4. Portfolio Optimization: ML Where the Spreadsheet Still Dominates

Pharmaceutical portfolio management remains one of the domains most resistant to AI adoption despite being analytically well-suited for it. The data is structured and largely complete. The objective functions are well-defined. The decisions are high-stakes and recurring. Yet survey data consistently shows that the majority of pharmaceutical companies at the regional level manage portfolio decisions through spreadsheet-based processes (Scannell et al., 2022).

The HPF-P framework — the Holistic Portfolio Framework for Pharma — addresses this directly. Its core concept is the Decision Readiness Index (DRI): a composite metric that classifies each SKU in a portfolio by its readiness for decision-making across six dimensions. SKUs are grouped into five decision categories, each mapped to a distinct optimization strategy. The result is a reweighted portfolio that reflects not just historical revenue performance but forward-looking decision readiness.

The HPF-P API exposes this pipeline in three calls: retrieve a sample portfolio to understand the input format, POST your own portfolio data, receive a full six-module analysis. The output is a structured JSON with per-SKU DRI scores, group assignments, recommended strategies, and optimized portfolio weights. A researcher studying AI adoption barriers in pharmaceutical management can use this not as a production system, but as a working demonstration that the ML approach is not only theoretically sound but operationally viable.
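A minimal sketch of the DRI step, under stated assumptions: the six dimension names, the equal weighting, and the five group labels with equal-width buckets are all illustrative. The article specifies only that six dimensions feed a composite score and that SKUs fall into five decision categories.

```python
# Illustrative DRI computation. Dimension names, weights, group labels,
# and bucket boundaries are assumptions; only "six dimensions" and
# "five decision categories" come from the text.
from statistics import mean

DIMENSIONS = ["data", "demand", "margin", "supply", "regulatory", "strategy"]  # assumed
GROUPS = ["divest", "harvest", "maintain", "grow", "invest"]  # assumed labels

def dri_score(sku: dict) -> float:
    """Unweighted composite of six 0-1 dimension scores."""
    return mean(sku[d] for d in DIMENSIONS)

def dri_group(score: float) -> str:
    """Five equal-width buckets over [0, 1]."""
    return GROUPS[min(int(score * 5), 4)]

sku = dict(zip(DIMENSIONS, [0.9, 0.7, 0.8, 0.6, 0.95, 0.75]))
s = dri_score(sku)
print(round(s, 3), dri_group(s))
```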

flowchart LR
    subgraph "Data Layer"
        D1[Multi-source Feeds]
        D2[Historical Context]
        D3[Structured Datasets]
    end
    subgraph "Analysis Layer"
        A1[Risk Scoring Engine]
        A2[Scenario Modeling]
        A3[Uncertainty Quantification]
    end
    subgraph "Output Layer"
        O1[Risk Index 0-100]
        O2[Confidence Interval]
        O3[Policy Implications]
    end
    D1 --> A1
    D2 --> A2
    D3 --> A3
    A1 --> O1
    A2 --> O2
    A3 --> O3

5. Observability: The Infrastructure AI Forgot to Build

One of the least discussed problems in deployed AI is observability — the ability to understand what a model is doing, why it made a specific decision, and how its behavior has changed over time. OpenTelemetry (OTel) has emerged as the de facto standard for distributed system observability (OpenTelemetry, 2023), but its application to AI pipelines remains immature. Most AI observability tooling either focuses on model metrics in isolation or requires significant integration work to connect traces to semantic meaning.

The OTel AI Inspector addresses this at the analysis layer. Given an OTel trace from an AI pipeline, it applies a four-layer scoring model — semantic richness (L1), cost attribution (L2), failure traceability (L3), and compliance readiness (L4) — and returns a structured assessment of what the trace reveals and what it obscures. It runs entirely in the browser. No data is transmitted. No integration is required. A researcher with an OTel trace can assess its information quality in under a minute.
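The scoring idea can be sketched as a per-layer attribute check over the spans of a trace. The attribute names and the checks themselves are illustrative assumptions; only the four layer labels (L1-L4) come from the Inspector's model.

```python
# Sketch of four-layer trace scoring: the fraction of spans satisfying
# each layer's check. Attribute names below are hypothetical examples,
# not OTel AI Inspector internals.
def score_trace(spans: list[dict]) -> dict:
    checks = {
        "L1_semantic": lambda s: "ai.prompt" in s or "ai.model" in s,
        "L2_cost": lambda s: "ai.tokens.total" in s or "ai.cost.usd" in s,
        "L3_failure": lambda s: "error.type" in s or s.get("status") == "OK",
        "L4_compliance": lambda s: "ai.data.classification" in s,
    }
    n = len(spans) or 1
    return {layer: sum(check(s) for s in spans) / n for layer, check in checks.items()}

spans = [
    {"ai.model": "pneumonia-v2", "ai.tokens.total": 512, "status": "OK"},
    {"status": "OK"},  # a span that reveals almost nothing
]
print(score_trace(spans))
```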

This matters because the gap in AI observability is not a gap in tooling — it is a gap in methodology. The question of what a trace should contain to support AI accountability is an open research question. Having a scoring framework that can be applied to real traces accelerates the empirical work needed to answer it.

6. Composability as the Core Research Primitive

The thread running through all four systems is composability. Each system exposes a clean interface. Each can be called from any environment that can send an HTTP request. Each is documented well enough that a researcher unfamiliar with the underlying model can use it productively. And critically, each can be combined with the others.

The geopolitical risk API feeds into the classifier and the ROI calculator. The ScanLab models can be integrated into cost-effectiveness research workflows. The HPF pipeline can be used alongside the AI adoption tools to contextualize portfolio decisions within broader AI maturity assessments. This is not architectural coincidence. It reflects a design principle: research infrastructure should be composable first, optimized second.

The academic literature on AI adoption consistently identifies integration complexity as a primary barrier (Coombs et al., 2020). Composable, openly accessible research APIs do not eliminate that barrier — they lower the floor. They make it possible for a researcher to reach the question of whether AI is useful for a given problem without first having to solve the problem of how to access AI at all.

7. What Openness Actually Requires

Declaring a platform open is easy. Making it genuinely usable by researchers outside the team that built it is harder. It requires documentation that covers not just the happy path but the failure modes. It requires rate limits that are generous enough to support real experimentation. It requires authentication that is simple enough that a researcher does not need a vendor relationship to get started. And it requires a commitment to stability — a researcher who builds a tool on an API that disappears in six months has not gained capability; they have incurred technical debt.

The Stabilarity platform enforces these requirements structurally. Every community member receives a personal API key with a rate limit sufficient for research workloads (100 requests per minute). The platform is self-hosted, with no dependency on third-party infrastructure that could be deprecated. API changes are versioned and documented. The source of truth for each system’s behavior is a machine-readable specification, not tribal knowledge.
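On the client side, the stated 100 requests/minute limit translates into a simple pacing discipline. The fixed-interval throttle below is a generic sketch, not platform code or an official SDK.

```python
# Generic client-side pacer for a 100 requests/minute limit:
# at most `limit` calls per `window` seconds, spaced evenly.
import time

class Throttle:
    def __init__(self, limit: int = 100, window: float = 60.0):
        self.min_interval = window / limit
        self.last = float("-inf")  # no call made yet

    def wait(self) -> float:
        """Sleep just long enough to respect the rate; return the delay used."""
        now = time.monotonic()
        delay = max(0.0, self.last + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self.last = time.monotonic()
        return delay

t = Throttle(limit=100)  # 100 requests/minute -> one call every 0.6 s
print([round(t.wait(), 2) for _ in range(2)])
```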

None of this is novel as engineering practice. What is notable is seeing it applied consistently to research infrastructure that is typically treated as an afterthought — the undocumented Jupyter notebook that only runs on the author’s laptop, the model checkpoint that requires three emails to access, the dataset that lives in a Google Drive folder with permissions set to “anyone with the link who knows to ask.”

8. Verdict: 🟢 The Capability Exists. The Infrastructure Gap Is Closing.

AI is capable of substantially more than most research environments currently use it for. This is not a statement about future potential — it is a present-tense observation. The models exist. The data pipelines exist. The APIs exist. What has been missing is the layer of accessible, composable, openly documented infrastructure that allows researchers to reach these capabilities without first building them.

That layer is being built. Imperfectly, incrementally, and in open view. The systems described here are not finished. They are deployed, documented, and available — which is the condition that makes research feedback possible. A researcher who finds a limitation in the pneumonia detector, a gap in the geopolitical risk model, or a missing optimization strategy in the HPF pipeline can now observe that limitation empirically rather than hypothetically.

That is what AI looks like when it actually works: not a solved problem, but a legible one.


sequenceDiagram
    participant R as Researcher
    participant API as Stabilarity API
    participant M as Model Layer
    participant D as Data Store
    R->>API: POST /v1/inference {payload}
    API->>D: Fetch contextual data
    D-->>API: Dataset response
    API->>M: Run model pipeline
    M-->>API: Prediction + confidence
    API-->>R: JSON result + trace_id
    R->>API: GET /v1/trace/{trace_id}
    API-->>R: Observability payload
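The request sequence in the diagram can be expressed as a small helper that builds both calls. The endpoint paths are taken from the diagram and the header name follows the earlier curl example; everything else is an assumption for illustration.

```python
# Sketch: construct the inference POST and the follow-up trace GET from
# the sequence diagram. No requests are sent; this only builds the calls.
def build_requests(api_key: str, payload: dict, trace_id=None):
    base = "https://hub.stabilarity.com/api"
    headers = {"X-API-Key": api_key}
    reqs = [("POST", f"{base}/v1/inference", headers, payload)]
    if trace_id:  # second call only once a trace_id has been returned
        reqs.append(("GET", f"{base}/v1/trace/{trace_id}", headers, None))
    return reqs

for method, url, _, _ in build_requests("YOUR_KEY", {"model": "pneumonia"}, "abc123"):
    print(method, url)
```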

References

  • Ardila, D., et al. (2019). End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25, 954–961. https://doi.org/10.1038/s41591-019-0447-x
  • Coombs, C., et al. (2020). The strategic effects of artificial intelligence. International Journal of Information Management, 55, 102179. https://doi.org/10.1016/j.ijinfomgt.2020.102179
  • Leetaru, K., & Schrodt, P. A. (2013). GDELT: Global Data on Events, Location and Tone. ISA Annual Convention. https://www.gdeltproject.org/
  • OpenTelemetry. (2023). Concepts overview. https://opentelemetry.io/docs/concepts/
  • Pesapane, F., et al. (2022). Barriers and facilitators for AI implementation in radiology. European Radiology. https://doi.org/10.1007/s00330-021-08408-x
  • Raleigh, C., et al. (2010). Introducing ACLED. Journal of Peace Research, 47(5), 651–660. https://doi.org/10.1177/0022343310370914
  • Scannell, J. W., et al. (2022). How to improve R&D productivity. Nature Reviews Drug Discovery. https://doi.org/10.1038/d41573-020-00087-5
  • Ivchenko, O. (2026). Stabilarity Research Platform Is Now Open — Free API Access for All Researchers. Stabilarity Research Hub. https://hub.stabilarity.com/stabilarity-research-platform-is-now-open-free-api-access-for-all-researchers/
  • Ivchenko, O. & Grybeniuk, D. (2026). The Coverage Gap: What AI Can Do vs. What We Actually Use It For. Stabilarity Research Hub — AI Economics Series. Odesa National Polytechnic University. https://doi.org/10.5281/zenodo.18911661
  • Ivchenko, O. & Grybeniuk, D. (2026). Feedback Loop Economics: The Cost Architecture of Self-Improving AI Systems. Stabilarity Research Hub — AI Economics Series. https://doi.org/10.5281/zenodo.18910135
  • Ivchenko, O. & Grybeniuk, D. (2026). State of Medical AI Adoption: 1,200 Devices Approved, 81% of Hospitals at Zero. Stabilarity Medical ML Series. Odesa National Polytechnic University. https://doi.org/10.5281/zenodo.18752906
  • Ivchenko, O. (2026). Longitudinal Report Generation with LLM-Based Agents: Architecture, Consistency Mechanisms, and Empirical Evidence. Stabilarity Research Hub. https://doi.org/10.5281/zenodo.18928461
Disclosure: All systems referenced in this article are built and maintained by the author at the Stabilarity Research Hub. They are openly accessible at hub.stabilarity.com/api-gateway/. No commercial relationship exists with any referenced data provider.