Five Years in the Deep End: How Two Researchers Are Mapping the Uncharted Territory of AI

Posted on February 17, 2026 · Updated February 21, 2026
[Image: Neural network visualization]


Oleh Ivchenko & Dmytro Grybeniuk · Stabilarity Hub · 2026

In a hospital radiology department in Kyiv, a doctor named Iryna stares at a scan on her monitor. An AI system blinks its verdict: no malignancy detected. She trusts it. She is right to trust it. But here’s the thing about Iryna’s story — she was also lucky. And the difference between those two things is precisely what Oleh Ivchenko and Dmytro Grybeniuk have spent five years trying to understand.

📚 About This Article:
Ivchenko, O. & Grybeniuk, D. (2026). Five Years in the Deep End: How Two Researchers Are Mapping the Uncharted Territory of AI. Stabilarity Research Hub.
Type: Narrative Profile  |  License: CC BY 4.0

The Wall Nobody Talked About

It is 2021. The AI hype cycle is at full tilt. Every pitch deck has a machine learning slide. Every startup brochure promises “intelligent automation.” Two Ukrainian entrepreneurs — Oleh Ivchenko, co-founder of data startup Flaidata, and Dmytro Grybeniuk, building growth intelligence tools at Gromus — are knee-deep in it. They are not skeptics. They are believers. That’s what makes what happens next so interesting.

Because builders, unlike theorists, have to make AI actually work. They have to deliver results on Tuesday, not publish papers about results. And what Oleh and Dmytro kept encountering — across clients, across sectors, across conversations with other builders — was a gap so consistent it had to be systematic. AI would work brilliantly in one deployment and catastrophically fail in what looked like an identical one. The demos were always perfect. The production systems were a different story.

“Everyone was optimizing for the demo,” Oleh would later explain. “Nobody was measuring why the demo stopped being true in the real world.”

WebSummit and the Wrong Question

In November 2021, their early startup work earned them a slot in WebSummit Alpha — the competition that surfaces the most promising early-stage startups from a pool of thousands. They finished in the top 10. It was validation, global exposure, and a room full of the sharpest investors and founders in technology. It was also, unexpectedly, the moment that sharpened the question that would drive the next five years of research.

Standing in the halls of Lisbon’s Altice Arena, Oleh was in conversation with a London-based venture capitalist — sharp, well-briefed, the kind who’s heard a thousand AI pitches and knows the vocabulary cold. The VC listened carefully, nodded, and asked the question that ends most startup conversations: “But can it scale?”

Oleh answered. But something didn’t sit right. On the flight home, he turned the exchange over and over. The question wasn’t wrong, exactly. It was just — too early. Too surface. Everyone in that building was asking “can it scale?” Nobody was asking “should it scale, and under what conditions, and how do we know when it’s actually working?” The VC’s question assumed the hard problem was distribution. Oleh was starting to suspect the hard problem was something far more fundamental.

Scale an AI system that works, and you get leverage. Scale one that fails unpredictably, and you get a crisis with infrastructure around it.

A Hundred Thousand Dollar Education

That same year, Oleh and Dmytro encountered a situation that would haunt the early chapters of their research: a $100,000 investment in an Estonian data analytics algorithm — a serious piece of work, built by serious engineers. On paper, the system performed. In testing environments, it was impressive. In real-world deployment, it delivered inconsistent results that nobody could confidently explain.

A hundred thousand dollars is a meaningful amount of money when you’re building a startup. But the real cost wasn’t the investment — it was the decision-making downstream. If you can’t trust the output of an AI system, every choice made on its basis becomes a gamble. And in business, as in medicine, gambling with high-stakes decisions is not a strategy. It’s a liability.

This was the first serious encounter with what would become a central theme of their research: the economics of AI aren’t about the cost of building AI — they’re about the cost of trusting it. And trusting it wrong.

“The cost of AI isn’t compute. It’s the cost of the decisions made on its output — and what happens when those decisions are wrong.”
— Oleh Ivchenko, Stabilarity Hub
```mermaid
timeline
    title Stabilarity Hub Research Timeline
    2021 : Flaidata & Gromus — AI hype vs. reality
         : WebSummit Alpha Top 10
         : $100K Estonian algorithm — first AI economics lesson
         : The question forms - WHY does AI fail?
    2022 : Medical AI research begins
         : First papers on ML in Ukrainian medical imaging
         : AI Economics framework in development
    2023 : 20+ medical AI papers published
         : AI Economics organizational assessment framework formalized
         : Anticipatory Intelligence series launched
    2024 : 35 medical imaging papers completed
         : Population-level bias findings published (23–40% performance gap)
         : AI Economics Series released
    2025 : Spec-Driven AI Development research
         : Enterprise AI waste quantified (40–60%)
         : Frameworks being adopted in enterprise
    2026 : Convergence insight - Decision Science is the missing layer
         : Research cited across healthcare, enterprise, academia
```

Act Two: Going Deep

In 2022, Oleh and Dmytro made a decision that marks the transition from entrepreneurs who observed a problem to researchers who committed to solving it. They would stop reacting to AI failures on a case-by-case basis and start mapping the territory systematically. Stabilarity Hub was born not as a think-tank but as a research engine — one driven by practical questions, not theoretical ones.

Three research pillars emerged. Each one looked different on the surface. At the core, they were asking the same question from different angles.

Pillar One: Where AI Meets Medicine — and What Happens When the Training Data Lies

The medical imaging research began with a deceptively simple observation: the AI diagnostic tools being deployed in Ukrainian hospitals were mostly trained on Western patient data. Western hospital equipment. Western radiologist annotation styles. This seemed, at first, like an operational detail. It turned out to be a chasm.

Over three years, Oleh and a team of collaborators published 35 papers examining machine learning in medical imaging diagnosis, with a specific focus on Ukrainian healthcare contexts. The findings were striking in their consistency: algorithms trained on Western hospital data performed 23 to 40 percent worse when applied to Ukrainian patient cohorts.

The gap wasn’t random. It was traceable to specific, systematic differences: population-level variations in how diseases present, the age and calibration of imaging equipment in post-Soviet medical infrastructure, and — most subtly — differences in how radiologists annotate images. A radiologist trained in Kyiv, working with older equipment and a different patient demographic, sees and labels things differently than one in Hamburg or Seattle. The AI systems had learned one dialect of medicine. Ukrainian hospitals were speaking another.

This brings us back to Iryna, the radiologist in Kyiv. When she trusted that AI verdict — no malignancy detected — she was right. The patient was fine. But here’s what the research exposed: she was making a decision informed by a system that hadn’t been validated on populations like her patients. In that particular case, the AI happened to be correct. In a statistically meaningful number of cases across the Ukrainian healthcare system, it wasn’t. And nobody was measuring the gap. Nobody even had a framework to measure it.

What makes this research groundbreaking isn’t the finding that bias exists in AI medical systems — that’s been noted before. It’s the specificity: here is exactly what the bias is, here is exactly where it comes from, here is exactly how large the performance penalty is, and here is the population bearing the risk. That level of specificity is what moves the work from observation to actionable knowledge.
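To make the measurement concrete, here is a minimal sketch of what a cohort-stratified validation check can look like, assuming a scikit-learn-style classifier and labeled scans from both a Western reference cohort and a local Ukrainian cohort. The function names and the acceptance threshold are illustrative, not the team's published tooling.

```python
# Minimal sketch: quantify the gap between the population an AI model was
# trained on and a local patient cohort before trusting it clinically.
# Names (evaluate_cohort, MIN_ACCEPTABLE_DROP) are illustrative assumptions;
# any scikit-learn-compatible classifier would work here.
from sklearn.metrics import roc_auc_score


def evaluate_cohort(model, images, labels):
    """Return AUC of the model's malignancy scores on one patient cohort."""
    scores = model.predict_proba(images)[:, 1]  # probability of malignancy
    return roc_auc_score(labels, scores)


def population_gap(model, reference_cohort, local_cohort):
    """Measure how much performance degrades on the local population."""
    auc_reference = evaluate_cohort(model, *reference_cohort)
    auc_local = evaluate_cohort(model, *local_cohort)
    relative_drop = (auc_reference - auc_local) / auc_reference
    return auc_reference, auc_local, relative_drop


MIN_ACCEPTABLE_DROP = 0.05  # e.g. flag anything worse than a 5% relative drop

# Usage, with pre-loaded cohorts (illustrative variable names):
# auc_ref, auc_loc, drop = population_gap(model, (X_west, y_west), (X_kyiv, y_kyiv))
# if drop > MIN_ACCEPTABLE_DROP:
#     print(f"Do not deploy without local revalidation: {drop:.0%} relative drop")
```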

  • 35: peer-reviewed papers on ML in medical imaging diagnosis
  • 23–40%: performance degradation on Ukrainian patient cohorts vs. Western-trained AI
  • 3%: enterprise AI projects using formal specifications (vs. prompt engineering)

Pillar Two: The Hidden Economics of AI — What Nobody Is Actually Measuring

Consider a European enterprise that made a decision you’d hear celebrated at every technology conference in 2023: they deployed GPT-4 for customer service. The vision was compelling — a tireless, infinitely patient, always-available AI that could handle queries at scale. They committed seriously. Two million euros in implementation, integration, training, and infrastructure.

The results? Functional. Not transformative. When the post-mortem was done, the uncomfortable finding emerged: a €50,000 solution — a well-configured, smaller model with proper routing logic and caching — would have achieved equivalent outcomes for their specific use case. The €1.95 million gap wasn’t waste through negligence. It was waste through a measurement failure. Nobody had assessed whether the organization was ready to use AI outputs responsibly before deciding which AI to buy.

This is the insight at the heart of Stabilarity Hub’s AI Economics Series, and it led to one of the more useful diagnostic frameworks to emerge from the research — one that reframes the AI ROI question entirely.

The framework separates two distinct organizational capabilities: the ability to identify where AI should inform decisions, and the ability to act correctly on AI outputs once it does. An organization can have perfect clarity on the first — the right data, the right infrastructure, the right use cases — and still fail catastrophically on the second, if frontline teams can’t contextualize AI recommendations, if governance structures don’t catch errors, if there’s no feedback loop. In that case, the AI investment becomes noise in an expensive system.

Organizations strong on identifying use cases but weak on organizational readiness to act on AI outputs, the research found, consistently overspent and got poor ROI. This wasn’t a loose correlation — it was a pattern reliable enough to use predictively. Measure both dimensions before deployment, and you could forecast with meaningful accuracy whether an investment would deliver or disappoint.
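As an illustration of how such a two-dimension, pre-deployment assessment could be operationalized, here is a small sketch. The 0 to 1 scores, the 0.6 threshold, and the class name are assumptions made for the example, not the framework's actual instruments.

```python
# Illustrative sketch of the two-dimension pre-deployment check described above.
# Scoring scale and threshold are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class ReadinessAssessment:
    use_case_identification: float  # 0.0-1.0: can we identify WHERE AI helps?
    execution_readiness: float      # 0.0-1.0: can we ACT on AI outputs correctly?

    def forecast(self, threshold: float = 0.6) -> str:
        """Map the two scores to the quadrants of the framework."""
        high_id = self.use_case_identification >= threshold
        high_exec = self.execution_readiness >= threshold
        if high_id and high_exec:
            return "Strong ROI expected: reliable outcomes"
        if high_id and not high_exec:
            return "Overspend likely: the most common failure mode"
        return "Wrong use cases: wasted capability"


# Usage:
print(ReadinessAssessment(0.8, 0.3).forecast())
# -> "Overspend likely: the most common failure mode"
```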

The implication is uncomfortable for the AI vendor ecosystem: the problem isn’t usually the AI. It’s the organization’s infrastructure for using AI responsibly. And that’s not a problem you can sell your way out of.

Pillar Three: Anticipatory Intelligence — The Gap Between Knowing and Preparing

Dmytro Grybeniuk’s research focus is, in many ways, the most ambitious of the three pillars. It starts with a question that sounds simple: what is AI actually for?

Most enterprise AI, if you strip away the branding, does one of two things: it tells you what happened (analytics, dashboards, reporting) or it tells you what’s likely to happen based on patterns in historical data (predictive modeling, forecasting). Both are valuable. Both are, in a fundamental sense, backward-looking. They are sophisticated mirrors — extraordinarily detailed reflections of the past.

What Dmytro’s Anticipatory Intelligence series documents is the gap between prediction and preparation. Prediction says: “Based on historical patterns, there is a 67% probability of supply chain disruption in Q3.” Anticipation says: “Here is what you should be building, staffing, and contracting for right now, given both that probability and the set of scenarios that fall outside historical patterns.”

The gap is not small. Across 20 domains examined in the series — from healthcare to logistics to financial services to defense — current enterprise AI demonstrates a consistent failure to generate actionable preparation guidance for scenarios outside its training distribution. In plain language: AI is good at telling you what to expect if the world keeps working the way it has. It is poor at helping you prepare for the world when it doesn’t.

This is not a compute problem. Throwing more processing power or larger models at anticipatory intelligence doesn’t solve it. The gap is architectural and conceptual. The research argues it requires a fundamentally different approach to how AI systems are designed — not just bigger, but different in kind.

```mermaid
graph TD
    A[AI System Output] --> B{Organization Readiness}
    B --> C["Use Case Identification: Can we identify WHERE AI helps?"]
    B --> D["Execution Readiness: Can we ACT on AI outputs correctly?"]
    C --> E{AI Performance Score}
    D --> E
    E --> F["High Identification + High Execution<br/>✅ Strong ROI, Reliable Outcomes"]
    E --> G["High Identification + Low Execution<br/>⚠️ Overspend, Poor ROI<br/>The most common failure mode"]
    E --> H["Low Identification + Any Execution<br/>❌ Wrong use cases, wasted capability"]
    style F fill:#276749,color:#fff
    style G fill:#c53030,color:#fff
    style H fill:#742a2a,color:#fff
    style E fill:#2b6cb0,color:#fff
```

Act Three: The Frontier (2025–2026)

By 2025, something had shifted in how Oleh and Dmytro were working. The three research pillars, which began as separate inquiries, were converging. Findings from medical AI kept rhyming with findings from AI economics. Patterns in anticipatory intelligence kept echoing patterns in how organizations fail to use AI responsibly. The same root problem kept surfacing from different angles.

The newest research thread — Spec-Driven AI Development — crystallized this convergence.

The Specification Gap

Here is a finding that should unsettle every CTO and AI architect reading this: AI systems built from precise formal specifications outperform prompt-engineered systems by 2 to 5 times on task completion metrics. The performance advantage is not marginal. It is substantial, consistent, and reproducible across domains.

The follow-up finding makes this more alarming: only 3% of enterprise AI projects use formal specifications. The other 97% rely on prompt engineering — essentially, sophisticated trial and error dressed in the language of engineering. It works, often enough to ship. It rarely works well enough to be the foundation of a reliable system.

Prompt engineering is intuitive. Specification-driven development is rigorous. The difference between them is roughly the difference between giving directions verbally and handing someone a detailed map with GPS coordinates. Both might get a person to the destination. One of them works reliably at scale, in variable conditions, with multiple drivers.
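To make the contrast tangible, here is a hedged sketch of the same task expressed both ways: a free-form prompt versus a small formal specification whose outputs can be checked automatically. The field names and limits are invented for illustration and are not a schema from the research.

```python
# Illustrative contrast: ad hoc prompt vs. a machine-checkable specification.
# All fields and limits below are assumptions, not a published schema.
from dataclasses import dataclass

PROMPT_ENGINEERED = "Summarise this support ticket and tell me how urgent it is."


@dataclass
class TicketTriageSpec:
    """Formal specification: inputs, outputs, and acceptance criteria are explicit."""
    input_fields: tuple = ("ticket_text", "customer_tier")
    output_fields: tuple = ("summary", "urgency")
    allowed_urgency: tuple = ("low", "medium", "high")
    max_summary_words: int = 40

    def validate(self, output: dict) -> list[str]:
        """Return a list of violations; an empty list means the output conforms."""
        errors = []
        for key in self.output_fields:
            if key not in output:
                errors.append(f"missing field: {key}")
        if output.get("urgency") not in self.allowed_urgency:
            errors.append(f"urgency must be one of {self.allowed_urgency}")
        if len(str(output.get("summary", "")).split()) > self.max_summary_words:
            errors.append("summary exceeds word limit")
        return errors


# Every model response can be checked against the spec before it reaches a human:
spec = TicketTriageSpec()
print(spec.validate({"summary": "Customer cannot log in.", "urgency": "urgent"}))
# -> ["urgency must be one of ('low', 'medium', 'high')"]
```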

The Waste Problem

The AI economics research has now reached a finding that finance teams at AI-heavy organizations need to reckon with: in most organizations, 40 to 60 percent of AI spend is wasted. Not 10%. Not an acceptable margin. Nearly half of every dollar flowing into AI infrastructure is being consumed by redundant inference calls that proper caching would eliminate, and by premium model deployments being used for tasks that a fraction-of-the-cost model handles with equivalent accuracy.
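The two levers named here, caching and model routing, are straightforward to sketch. The example below is illustrative only; the routing rule, the cache design, and the model roles are assumptions, not recommendations from the series.

```python
# Sketch of the two savings levers: a response cache that removes redundant
# inference calls, and a router that reserves the premium model for requests
# that actually need it. Routing rule and model roles are illustrative only.
import hashlib

_cache: dict[str, str] = {}


def cached_completion(prompt: str, call_model) -> str:
    """Return a cached answer when the identical prompt has been seen before."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # call_model is any completion callable
    return _cache[key]


def route(prompt: str, cheap_model, premium_model) -> str:
    """Send short, routine prompts to the cheap model; escalate the rest."""
    needs_premium = len(prompt.split()) > 300 or "legal" in prompt.lower()
    model = premium_model if needs_premium else cheap_model
    return cached_completion(prompt, model)
```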

This isn’t a criticism of the AI vendors. It’s a systems design failure. Organizations are buying the biggest engine available without asking what they’re actually trying to move or how far. The result is expensive AI that delivers modest value — and CFOs who are beginning to notice the gap between the promise and the invoice.

The Convergence Insight

Medical AI failing Ukrainian patients because training data didn’t match local reality. Enterprise AI burning millions because nobody measured organizational readiness before deployment. Anticipatory intelligence failing because AI systems aren’t designed to reason about what they haven’t seen. Spec-driven development delivering 2–5x gains that 97% of the industry ignores.

These are different problems. They are the same problem.

The Convergence Thesis

Medical AI economics + Anticipatory Intelligence + Spec-Driven Development all point to the same root failure: AI integration without decision science is just expensive automation. The missing layer isn’t more compute, bigger models, or better prompts. It is the discipline of understanding how humans and organizations make decisions — and designing AI systems that fit into those processes reliably, not just impressively.

Act Four: Why It Matters Beyond the Papers

Research is easy to admire in the abstract. The harder question is whether it changes anything. For Stabilarity Hub, the answer is emerging in three concrete directions.

In Ukrainian healthcare, the 35-paper medical AI body of work is being engaged by institutions grappling with how to deploy AI diagnostics responsibly in a healthcare system that is simultaneously under-resourced and under enormous strain. The research doesn’t just document the bias problem — it provides a framework for measuring it, correcting for it, and validating AI tools against local populations before trusting them with clinical decisions. That framework is now being used.

In enterprise, the organizational assessment framework is increasingly being referenced by AI architects and strategy consultants as a pre-deployment diagnostic tool. The question “how ready is your organization to act on AI outputs?” is proving more useful than “which model are you using?” — and organizations are starting to realize it. CFOs who’ve lived through expensive AI disappointments are especially receptive.

And in the broader AI research community, Stabilarity Hub’s work on anticipatory intelligence is contributing to a conversation that is slowly shifting from “how do we make AI smarter?” to “how do we make AI genuinely useful for decisions humans have never made before?” That’s a harder question. It’s also the right one.

None of this happened because Oleh and Dmytro sat in an ivory tower formulating theories. It happened because they were builders first — people who saw the gap between AI’s promise and AI’s delivery and decided that documenting it, mapping it, and developing frameworks to close it was worth five years of rigorous, unglamorous work. The kind of work that doesn’t make headlines until it changes something real.

The Question That Still Won’t Go Away

Back in Lisbon in 2021, a venture capitalist asked Oleh if his AI could scale. He answered correctly. He just answered the wrong question.

Five years later, the questions Stabilarity Hub is asking — and beginning to answer — are the ones the industry is finally catching up to. Can AI work reliably across different populations, not just the populations it was trained on? Can organizations be honestly assessed for their readiness to use AI before they spend millions deploying it? Can AI systems be designed to help humans prepare for what hasn’t happened yet, not just analyze what has? Can formal specifications replace the expensive guesswork of prompt engineering at enterprise scale?

These aren’t the flashy questions. They don’t make for a thirty-second pitch. But they are the questions that will separate the organizations that build durable AI advantage from the ones that spend fortunes on impressive demos and wonder why the numbers don’t improve.

And somewhere in Kyiv, a radiologist named Iryna is looking at another scan. Another AI verdict. This time, because of work that started with a question nobody could answer in 2021, she has something better than luck. She has a framework for knowing when to trust what the machine tells her — and when to look again.


About the Researchers

Oleh Ivchenko

Co-founder of Stabilarity Hub. Before research, Oleh built AI products at the sharp end of the market — co-founding Flaidata in the data analytics space and earning a Top 10 finish at WebSummit Alpha 2021. His research spans medical AI validation, AI economics, and the decision science frameworks that determine whether AI investments deliver value or consume it. He asks uncomfortable questions and documents the answers with precision.

Dmytro Grybeniuk

Co-founder of Stabilarity Hub and the architect of its Anticipatory Intelligence research program. Previously at Gromus, where he encountered firsthand the limits of AI systems that could predict but not prepare. His 20-domain series on anticipatory intelligence is reshaping how organizations think about the gap between forecasting and readiness. He believes the next decade of AI progress will be defined not by what AI can predict, but by what it helps humans prepare for.

Stabilarity Hub · Mapping the uncharted territory of AI, one rigorous question at a time.
