Observability for AI Systems: Why OpenTelemetry Is Not Enough and What the Community Needs

Posted on March 4, 2026 (updated March 6, 2026)
AI Observability & Monitoring · Technical Research · Article 1 of 1
By Oleh Ivchenko

🔧 Try it: OTel AI Inspector — paste your OTel trace JSON and get your L1-L4 coverage score instantly. Free, client-side, no data sent to server.


📚 Academic Citation: Ivchenko, Oleh (2026). Observability for AI Systems: Why OpenTelemetry Is Not Enough and What the Community Needs. Research article, ONPU. DOI: 10.5281/zenodo.18864333

Oleh Ivchenko · Odessa National Polytechnic University · March 2026


Abstract

Modern AI systems deployed in production remain fundamentally opaque to the engineers who operate them. While OpenTelemetry has emerged as the de facto standard for distributed systems observability, its extension to AI and large language model (LLM) workloads exposes critical gaps: latency traces do not capture hallucination rates, infrastructure metrics do not surface semantic drift, and no vendor-agnostic standard exists for quality observability. This paper surveys the current landscape of AI observability tooling, identifies structural gaps in the existing standards, and proposes a four-layer taxonomy — from infrastructure signals to business impact — that could form the foundation for a unified community standard. We establish a research agenda for the AI Observability & Monitoring series.


1. Introduction: The Black Box in Production

The deployment of machine learning systems into production environments has outpaced our ability to understand their behavior at runtime. For two decades, the software engineering community built robust observability practices around distributed systems: the “three pillars” of logs, metrics, and traces [1], formalized by the OpenTelemetry project [2] and operationalized through platforms like Prometheus [3], Grafana [4], Datadog [5], and Dynatrace [6]. These tools answer the question: Is the system behaving as its code specifies?

For AI systems, this question is necessary but insufficient. A language model serving responses with 200ms P99 latency and zero HTTP 5xx errors may simultaneously be hallucinating factual claims at a 12% rate, producing outputs that have drifted from its fine-tuning distribution, or consuming token budgets at a rate that will bankrupt the product in three months. None of these failures are visible in a conventional APM dashboard.

This is the foundational problem of AI observability: the contract of an AI system is semantic, not syntactic. Its correctness cannot be verified by inspecting code paths alone — it requires reasoning about the relationship between inputs, outputs, and intended behavior in a way that classical observability infrastructure was never designed to support.

The stakes are not theoretical. As organizations move AI systems from pilot to production — deploying LLM agents, retrieval-augmented generation (RAG) pipelines, and multi-model orchestration at scale — the absence of principled observability creates compounding operational debt. Engineers lack the instrumentation to detect regressions, attribute costs, debug agent behavior, or demonstrate compliance.

This paper surveys what the community has built, what is missing, and what a comprehensive standard would require.


2. What Exists Today: The Current Landscape

2.1 OpenTelemetry and the GenAI Semantic Conventions

OpenTelemetry (OTel) has achieved remarkable adoption as a vendor-neutral observability framework, providing SDKs, APIs, and a Collector architecture for traces, metrics, and logs across distributed systems [2]. In 2024, the OpenTelemetry community established the GenAI Special Interest Group (SIG), which produced a set of semantic conventions specifically for AI workloads [7].

The GenAI semantic conventions define span attributes for LLM calls, including:

  • gen_ai.system — the AI system being used (e.g., openai, anthropic)
  • gen_ai.request.model — the model identifier
  • gen_ai.request.max_tokens — requested token limit
  • gen_ai.response.finish_reasons — why generation stopped
  • gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens — token counts
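
Concretely, an instrumented LLM call might carry attributes like the following. The values are illustrative; the attribute names follow the conventions listed above:

```python
# Illustrative span attributes for a single LLM call, following the
# OTel GenAI semantic conventions. All values here are made up.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4",            # example model identifier
    "gen_ai.request.max_tokens": 1024,
    "gen_ai.response.finish_reasons": ["stop"],
    "gen_ai.usage.prompt_tokens": 340,
    "gen_ai.usage.completion_tokens": 580,
}

# Every GenAI attribute lives under the "gen_ai." namespace.
assert all(key.startswith("gen_ai.") for key in span_attributes)
```

In practice these attributes are set on an OTel span by the instrumentation library rather than constructed by hand.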

The opentelemetry-instrumentation-openai package [8] provides automatic instrumentation for OpenAI API calls, generating spans that conform to these conventions. This represents genuine progress: engineers deploying OpenAI-backed systems can now see LLM calls as first-class spans in their distributed traces.

However, the GenAI semantic conventions, in their current form, address only the infrastructure surface of AI calls — what was requested and what was returned, measured in time and tokens. They do not address output quality, semantic correctness, or behavioral drift.

2.2 OpenLLMetry: OTel-Native LLM Observability

Traceloop’s OpenLLMetry [9] extends OpenTelemetry specifically for LLM applications. It provides instrumentation for major LLM providers and frameworks (LangChain, LlamaIndex, Haystack, Semantic Kernel) and generates OTel-compatible spans with LLM-specific attributes. OpenLLMetry’s core contribution is breadth: it makes OTel instrumentation available across the fragmented ecosystem of LLM libraries without requiring per-library instrumentation code.

OpenLLMetry also introduces the concept of workflow spans that can group related LLM calls into logical tasks — a partial solution to the multi-hop agent tracing problem. However, it remains a library-level instrumentation layer and inherits the same semantic limitations as the OTel GenAI conventions.

2.3 Application-Layer Observability Platforms

A cluster of purpose-built platforms has emerged to address what OTel does not:

| Platform | Primary Focus | OTel-Native | Self-Hostable | Quality Signals |
| --- | --- | --- | --- | --- |
| LangSmith (LangChain) | LLM debugging & evaluation | Partial | No | Yes (via evaluators) |
| Langfuse | LLM observability & tracing | Yes (OTLP) | Yes | Yes (scores API) |
| Helicone | Cost & usage analytics | No | Proxy-based | Limited |
| Phoenix (Arize AI) | ML + LLM observability | Yes | Yes | Yes (evaluations) |

LangSmith [10] offers deep integration with LangChain applications, providing trace visualization, prompt management, and an evaluation framework. Its evaluation system allows users to define quality metrics and run them against traced outputs — the closest existing approach to semantic observability. However, it is tightly coupled to the LangChain ecosystem and requires sending data to LangChain’s cloud infrastructure.

Langfuse [11] is an open-source alternative with strong OTLP support, a scores API for attaching quality signals to traces, and self-hosting capability. Its architecture is more OTel-aligned than LangSmith’s, making it a viable bridge platform for teams already invested in OTel infrastructure.

Phoenix by Arize AI [12] brings the company’s background in ML monitoring to LLM observability, offering embedding visualizations, retrieval analysis for RAG pipelines, and hallucination detection via LLM-as-judge evaluation. Its ML heritage gives it stronger drift detection capabilities than application-focused platforms.

Helicone [13] operates as a proxy layer, intercepting LLM API calls to capture usage and cost data without SDK integration. Its simplicity is its strength, but the proxy architecture limits it to request/response capture and forecloses deeper span instrumentation.

2.4 Experiment Tracking vs. Production Monitoring

MLflow [14] and Weights & Biases [15] represent the experiment tracking paradigm: tooling designed for the model development phase, tracking hyperparameters, metrics, and artifacts across training runs. Both have extended their platforms toward deployment and monitoring, but the conceptual gap remains. Experiment tracking optimizes for reproducibility (can I recreate this result?), while production monitoring optimizes for reliability (is this system behaving correctly right now?). The data models, storage requirements, and operational patterns are fundamentally different, and the conflation of these concerns has impeded the development of purpose-built production monitoring standards.

2.5 Prometheus and Grafana Patterns for ML

The Prometheus + Grafana stack [3][4] is widely used for ML infrastructure monitoring: GPU utilization, memory consumption, inference throughput, and queue depth. Custom metrics can be exported via the Prometheus client libraries, and teams have developed patterns for tracking model-level statistics — prediction distributions, output value histograms, and request rates by input feature category.

This approach provides genuine value for detecting infrastructure-level anomalies and broad behavioral shifts. However, it requires significant custom engineering effort, lacks standardized metric naming conventions for AI workloads, and provides no native support for semantic quality signals.
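
To make the custom-engineering burden concrete, here is a minimal, dependency-free sketch that bins model prediction scores into a histogram and renders it in the Prometheus text exposition format. The metric name is invented for illustration; a real deployment would use the official Prometheus client library instead:

```python
def render_prediction_histogram(scores, buckets=(0.25, 0.5, 0.75, 1.0),
                                metric="model_prediction_score"):
    """Render a histogram of prediction scores in Prometheus text exposition format."""
    lines = [f"# TYPE {metric} histogram"]
    for le in buckets:
        # Histogram buckets are cumulative: count of scores <= upper bound.
        cumulative = sum(1 for s in scores if s <= le)
        lines.append(f'{metric}_bucket{{le="{le}"}} {cumulative}')
    lines.append(f'{metric}_bucket{{le="+Inf"}} {len(scores)}')
    lines.append(f"{metric}_sum {sum(scores)}")
    lines.append(f"{metric}_count {len(scores)}")
    return "\n".join(lines)

exposition = render_prediction_histogram([0.1, 0.4, 0.42, 0.9])
```

A sidecar exposing this text on an HTTP endpoint is enough for Prometheus to scrape it, which is precisely the kind of per-team plumbing a standard should make unnecessary.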


3. The Structural Gaps

The landscape survey reveals four categories of gaps that no existing solution adequately addresses:

3.1 Model Quality Signals Are Not First-Class Observables

Existing observability infrastructure treats model outputs as opaque byte strings. Token counts are measured; semantic content is not. There is no standard mechanism for attaching quality signals — factuality scores, semantic similarity to reference outputs, coherence ratings, or refusal rates — to trace spans in a way that propagates through the observability pipeline to dashboards, alerts, and SLO calculations.

Some platforms provide post-hoc evaluation frameworks (LangSmith evaluators, Phoenix evaluations), but these operate outside the trace collection path and cannot power real-time alerting or SLO compliance calculations.

3.2 Multi-Hop Agent Tracing Is Unsolved

LLM agents that orchestrate tool calls, sub-agent invocations, and multi-step reasoning chains present a tracing challenge that OTel’s span model handles poorly. A ReAct-pattern agent [16] that executes five tool calls across three external APIs before producing a final response generates a trace tree where:

  1. The causally significant spans are at leaf level (tool outputs that shaped the final response)
  2. The overall quality of the response cannot be attributed to any individual span
  3. The “reasoning” between tool calls is invisible to the tracer

The OTel span model supports parent-child relationships but provides no semantic primitives for representing reasoning chains, attention to intermediate results, or counterfactual attribution.

3.3 Drift Detection Is External and Ad Hoc

Data drift and model drift — the degradation of model performance as the statistical distribution of inputs diverges from training data — are recognized problems with mature detection algorithms [17]. However, drift detection is universally implemented as a separate batch pipeline, operating on collected data with significant latency, rather than as a first-class observable integrated into the real-time telemetry pipeline.

The consequence is that drift is detected reactively: after user-facing quality has degraded, after business metrics have moved, and often only after manual investigation. No existing observability standard treats drift as a signal that should appear in the same pipeline as latency or error rate.

3.4 Cost Attribution Is Coarse

LLM inference costs are substantial and variable. Token-level cost attribution — mapping cost to specific users, features, request types, or business units — requires custom engineering on top of existing telemetry. The GenAI semantic conventions capture token counts, but the translation from token counts to costs (which varies by model, provider, and time) is not standardized, and cost signals are not integrated into standard observability dashboards or alerting frameworks.


4. A Proposed Taxonomy: Four Layers of AI Observability

We propose a four-layer taxonomy that structures AI observability concerns from infrastructure to business impact:

graph TD
    L4["L4: Business Impact\nTask completion rate\nUser satisfaction proxy\nRevenue attribution\nSLO compliance"]
    L3["L3: Semantic Quality\nFactuality scores\nRelevance and coherence\nHallucination rate\nRefusal rate"]
    L2["L2: Model Behavior\nLatency (TTFT, TBT)\nToken usage and costs\nError rates by type\nFinish reasons"]
    L1["L1: Infrastructure\nCPU / GPU / Memory\nNetwork and queue depth\nService health\nClassic OTel spans"]

    L4 --> L3
    L3 --> L2
    L2 --> L1

L1: Infrastructure covers the traditional observability domain: compute resource utilization, network performance, service availability, and the OTel spans that already instrument HTTP calls and database queries. This layer is mature; the community has solved it.

L2: Model Behavior covers the signals specific to model inference: time to first token (TTFT), token generation throughput, token budget consumption, finish reason distributions, and error categorization (rate limit, content filter, context length exceeded, etc.). The GenAI semantic conventions address this layer incompletely; it is achievable with existing OTel primitives.
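
Most L2 signals reduce to timestamps taken around a streaming response. A minimal sketch of measuring TTFT and mean time between tokens, with a simulated token stream standing in for a provider's streaming iterator:

```python
import time

def measure_streaming_latency(token_stream):
    """Consume a token stream, returning (ttft_s, mean_tbt_s, token_count)."""
    start = time.monotonic()
    ttft = None
    arrivals = []
    for _ in token_stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # time to first token
        arrivals.append(now)
    # Mean gap between consecutive token arrivals.
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_tbt, len(arrivals)

def fake_stream(n=5, delay=0.001):
    """Stand-in for a provider streaming iterator."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tbt, n_tokens = measure_streaming_latency(fake_stream())
```

In a real deployment these values would be recorded as span attributes or histogram metrics rather than returned to the caller.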

L3: Semantic Quality is the critical unsolved layer. It encompasses signals that require reasoning about the content of model outputs: factuality relative to source documents, semantic similarity to expected outputs, coherence and fluency, refusal and safety filter activation rates, and hallucination detection. These signals require either model-based evaluation (LLM-as-judge) or embedding-based similarity computation — neither of which fits naturally into the synchronous, low-overhead span collection path.

L4: Business Impact connects model behavior to user and business outcomes: task completion rates, user satisfaction proxies (thumbs up/down, session continuation, return rate), revenue attribution for AI-assisted transactions, and SLO compliance expressed in business terms. This layer requires joining telemetry data with product analytics, which existing observability infrastructure does not natively support.

4.1 Tool Coverage by Layer

graph LR
    subgraph "L1: Infrastructure"
        OTel["OTel + Prometheus"]
        Grafana["Grafana"]
    end
    subgraph "L2: Model Behavior"
        OpenLLMetry["OpenLLMetry"]
        Helicone["Helicone"]
        Langfuse2["Langfuse"]
    end
    subgraph "L3: Semantic Quality"
        LangSmith["LangSmith"]
        Phoenix["Phoenix (Arize)"]
        Langfuse3["Langfuse (scores)"]
    end
    subgraph "L4: Business Impact"
        Custom["Custom Engineering Required"]
    end

5. Agent Trace Flow: The Multi-Hop Problem

A representative agent trace demonstrates the instrumentation challenges at the architectural level:

sequenceDiagram
    participant User
    participant Agent as LLM Agent
    participant OTel as OTel Collector
    participant Tool1 as Search API
    participant LLM as LLM Provider

    User->>Agent: Query
    Agent->>OTel: span: agent.run [start]
    
    Agent->>LLM: plan(query)
    LLM-->>Agent: tool call plan
    Agent->>OTel: span: gen_ai.completion [tokens=340]
    
    Agent->>Tool1: search(query)
    Tool1-->>Agent: results
    Agent->>OTel: span: tool.search [latency=820ms]
    
    Agent->>LLM: reason(query, results)
    LLM-->>Agent: final_response
    Agent->>OTel: span: gen_ai.completion [tokens=580]
    
    Agent->>User: response
    Agent->>OTel: span: agent.run [end, total_tokens=920]

    Note over OTel: Missing: hallucination rate, semantic quality, drift signal

The diagram highlights the instrumentation boundary: OTel can capture every span, latency, and token count. It cannot capture whether the final response is factually correct, whether the search results were relevant, or whether the model’s reasoning chain is coherent. These gaps are not gaps in OTel’s implementation — they are gaps in its conceptual model.


6. What the Community Needs: A Research Agenda

Addressing the structural gaps requires contributions at multiple levels:

6.1 Semantic Conventions for Quality Signals

The OpenTelemetry GenAI SIG should extend its semantic conventions to include quality signal attributes at L3. This requires defining:

  • A standard schema for attaching evaluation scores to spans (by score type, evaluator identity, and confidence)
  • Conventions for representing retrieval quality in RAG pipelines (hit rate, relevance score, faithfulness)
  • Standard span events for semantic quality checkpoints
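
As a sketch of what such a schema could look like, the helper below flattens one evaluation score into namespaced span attributes. The `gen_ai.evaluation.*` names are a proposal for illustration, not part of the published conventions:

```python
def score_to_span_attributes(score_type, value, evaluator, confidence):
    """Flatten an evaluation score into span attributes under a proposed namespace."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("scores are normalized to [0, 1]")
    return {
        "gen_ai.evaluation.score.type": score_type,    # e.g. "factuality"
        "gen_ai.evaluation.score.value": value,
        "gen_ai.evaluation.evaluator": evaluator,      # which judge produced it
        "gen_ai.evaluation.confidence": confidence,
    }

attrs = score_to_span_attributes("factuality", 0.87, "llm-judge:gpt-4", 0.9)
```

Recording the evaluator's identity alongside the score matters: two judges can disagree, and dashboards need to distinguish their signals.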

6.2 Asynchronous Evaluation Integration

Real-time semantic evaluation (LLM-as-judge at inference time) adds latency and cost that may be prohibitive. The community needs standardized patterns for asynchronous evaluation: collecting raw inputs and outputs at inference time (L2), running quality evaluation as a background process, and propagating results back to the trace with a causal link to the original span.
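
A minimal sketch of this pattern using a background worker thread: outputs are enqueued at inference time, scored off the hot path, and the result is keyed by span ID so it can later be attached to the original trace. The scoring function is a placeholder standing in for a real LLM-as-judge call:

```python
import queue
import threading

eval_queue = queue.Queue()
results_by_span = {}   # span_id -> quality scores, linked back to the trace

def evaluator_worker():
    """Background evaluator: scores model outputs off the inference hot path."""
    while True:
        item = eval_queue.get()
        if item is None:          # shutdown sentinel
            break
        span_id, output = item
        # Placeholder heuristic standing in for an LLM-as-judge evaluation.
        score = min(1.0, len(output) / 100)
        results_by_span[span_id] = {"gen_ai.evaluation.score.value": score}
        eval_queue.task_done()

worker = threading.Thread(target=evaluator_worker, daemon=True)
worker.start()

# At inference time: record the span, then enqueue the output for evaluation.
eval_queue.put(("span-abc123", "The capital of France is Paris."))
eval_queue.join()                 # only for this demo; production would not block
eval_queue.put(None)
worker.join()
```

The open standardization question is the last step: how the delayed result is written back to the telemetry backend with a causal link (trace ID, span ID) to the span it evaluates.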

6.3 Drift as a First-Class Signal

Drift detection should be defined as an observable with a standard metric type and alert semantics. Proposed conventions:

  • ai.drift.input_distribution — divergence from training input distribution
  • ai.drift.output_distribution — shift in output value distribution
  • ai.drift.embedding_distance — cosine distance from reference embedding centroid

These metrics should be exportable via standard Prometheus exposition format and queryable via PromQL.
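
For illustration, `ai.drift.embedding_distance` can be computed with nothing beyond a reference centroid and a cosine distance. The toy two-dimensional vectors below stand in for real embedding vectors:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Reference centroid from a sample of training-time embeddings (toy 2-D data).
reference = centroid([[1.0, 0.0], [0.0, 1.0]])

# Distance of a live embedding from the reference distribution:
# this value would be exported as the ai.drift.embedding_distance gauge.
drift = cosine_distance([1.0, 0.0], reference)
```

Exported as a gauge, this value becomes alertable with ordinary PromQL threshold rules, exactly like latency or error rate.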

6.4 Cost Attribution at the Trace Level

A standard for cost attribution would define how token counts (captured at L2) are translated to monetary costs and associated with business-level dimensions (user, feature, organization, deployment). This requires:

  • A provider-maintained registry of token costs per model
  • A standard span attribute for computed cost (gen_ai.usage.cost_usd)
  • Cost rollup semantics for agent traces with multiple LLM calls
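
A sketch of how these pieces compose, with a hypothetical price table (the model names and figures below are placeholders, not any provider's actual pricing):

```python
# Hypothetical per-1K-token prices; real prices vary by provider and change over time.
PRICE_PER_1K = {
    "model-a": {"prompt": 0.010, "completion": 0.030},
    "model-b": {"prompt": 0.001, "completion": 0.002},
}

def span_cost_usd(model, prompt_tokens, completion_tokens):
    """Translate token counts (an L2 signal) into a cost attribute for one span."""
    p = PRICE_PER_1K[model]
    return (prompt_tokens * p["prompt"] + completion_tokens * p["completion"]) / 1000

def rollup_cost(spans):
    """Sum per-call costs over an agent trace containing multiple LLM calls."""
    return sum(span_cost_usd(s["model"], s["prompt_tokens"], s["completion_tokens"])
               for s in spans)

trace = [
    {"model": "model-a", "prompt_tokens": 300, "completion_tokens": 40},
    {"model": "model-b", "prompt_tokens": 500, "completion_tokens": 80},
]
total = rollup_cost(trace)   # would be attached to the root span of the trace
```

The hard part a standard must settle is not the arithmetic but the registry: who maintains the price table, and how dashboards handle price changes over time.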

6.5 Community Reference Implementation

The most impactful near-term contribution would be an open-source reference implementation demonstrating all four layers in a single instrumented application: a Python-based LLM agent with complete L1-L4 observability, exportable to any OTel-compatible backend. This reference implementation would demonstrate the async evaluation pattern, provide example Grafana dashboards covering all four layers, include drift detection with Prometheus metrics, and enable the community to experiment with the proposed conventions before standardization.


7. Conclusion

OpenTelemetry represents an extraordinary achievement in standardizing observability infrastructure. Its extension to AI workloads through the GenAI semantic conventions is a meaningful step, but it addresses only the first two layers of a four-layer challenge. The community has produced valuable application-layer tooling — Langfuse, Phoenix, LangSmith — but these tools remain siloed, vendor-specific, or ecosystem-constrained.

The path to principled AI observability runs through the OpenTelemetry community: extending semantic conventions to L3 quality signals, standardizing asynchronous evaluation patterns, and treating drift detection as a first-class observable rather than a batch analytics afterthought.

This paper establishes the taxonomy and identifies the research agenda. Subsequent articles in this series will address each gap in depth — from practical instrumentation patterns to the statistical foundations of semantic drift detection — with the goal of producing community-ready specification proposals.

The black box in production is a solvable problem. Solving it requires the same community consensus-building that produced OpenTelemetry itself.


References

[1] Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability Engineering. O’Reilly Media.

[2] OpenTelemetry Authors. (2024). OpenTelemetry Specification. https://opentelemetry.io/docs/specs/otel/

[3] Prometheus Authors. (2024). Prometheus: Monitoring System & Time Series Database. https://prometheus.io/

[4] Grafana Labs. (2024). Grafana: The open observability platform. https://grafana.com/

[5] Datadog Inc. (2024). AI Observability. https://www.datadoghq.com/product/ai-observability/

[6] Dynatrace. (2024). AI-powered observability. https://www.dynatrace.com/

[7] OpenTelemetry GenAI SIG. (2024). Semantic Conventions for Generative AI Systems. https://opentelemetry.io/docs/specs/semconv/gen-ai/

[8] OpenTelemetry Authors. (2024). opentelemetry-instrumentation-openai. https://github.com/open-telemetry/opentelemetry-python-contrib

[9] Traceloop. (2024). OpenLLMetry: Open-source observability for your LLM application. https://github.com/traceloop/openllmetry

[10] LangChain. (2024). LangSmith. https://smith.langchain.com/

[11] Langfuse. (2024). Open Source LLM Engineering Platform. https://langfuse.com/

[12] Arize AI. (2024). Phoenix: AI Observability & Evaluation. https://phoenix.arize.com/

[13] Helicone. (2024). Open-source LLM Observability. https://www.helicone.ai/

[14] Zaharia, M., et al. (2018). Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Eng. Bull., 41(4), 39-45.

[15] Weights & Biases. (2024). Weights & Biases: The AI Developer Platform. https://wandb.ai/

[16] Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.

[17] Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 1-37.

[18] Fowler, M. (2019). Observability and Monitoring. https://martinfowler.com/articles/domain-oriented-observability.html

[19] DORA Research Program. (2023). Accelerate State of DevOps Report. Google Cloud.

[20] Liang, P., et al. (2022). Holistic Evaluation of Language Models. arXiv:2211.09110.

[21] Zhao, W. X., et al. (2023). A Survey of Large Language Models. arXiv:2303.18223.

[23] Shen, T., et al. (2023). Large Language Models Aligned to Human Preferences through RLHF. arXiv:2307.01003

