Universal Intelligence BenchmarkBenchmark Research · Article 6 of 14

By Oleh Ivchenko · Benchmark research based on publicly available meta-analyses and reproducible evaluation methods.

Temporal and Planning Intelligence as a UIB Dimension: Why Horizon Length Breaks Modern Reasoning Models

Academic Citation: Ivchenko, Oleh (2026). Temporal and Planning Intelligence as a UIB Dimension: Why Horizon Length Breaks Modern Reasoning Models. Research article: Temporal and Planning Intelligence as a UIB Dimension: Why Horizon Length Breaks Modern Reasoning Models. Odessa National Polytechnic University, Department of Economic Cybernetics.
DOI: 10.5281/zenodo.19207333^[1] · View on Zenodo (CERN)

DOI: 10.5281/zenodo.19207333^[1]Zenodo Archive Charts (4)ORCID

2,347 words · 62% fresh refs · 3 diagrams · 16 references

74stabilfr·wdophcgmx

Badge	Metric	Value	Status	Description
[s]	Reviewed Sources	0%	○	≥80% from editorially reviewed sources
[t]	Trusted	88%	✓	≥80% from verified, high-quality sources
[a]	DOI	63%	○	≥80% have a Digital Object Identifier
[b]	CrossRef	0%	○	≥80% indexed in CrossRef
[i]	Indexed	88%	✓	≥80% have metadata indexed
[l]	Academic	69%	○	≥80% from journals/conferences/preprints
[f]	Free Access	100%	✓	≥80% are freely accessible
[r]	References	16 refs	✓	Minimum 10 references required
[w]	Words [REQ]	2,347	✓	Minimum 2,000 words for a full research article. Current: 2,347
[d]	DOI [REQ]	✓	✓	Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.19207333
[o]	ORCID [REQ]	✓	✓	Author ORCID verified for academic identity
[p]	Peer Reviewed [REQ]	—	✗	Peer reviewed by an assigned reviewer
[h]	Freshness [REQ]	62%	✓	≥60% of references from 2025–2026. Current: 62%
[c]	Data Charts	4	✓	Original data charts from reproducible analysis (min 2). Current: 4
[g]	Code	—	○	Source code available on GitHub
[m]	Diagrams	3	✓	Mermaid architecture/flow diagrams. Current: 3
[x]	Cited by	0	○	Referenced by 0 other hub article(s)

Score = Ref Trust (75 × 60%) + Required (4/5 × 30%) + Optional (2/4 × 10%)

Abstract #

Temporal reasoning and long-horizon planning represent perhaps the most consequential gap between current large language models and human cognitive capability. While frontier models achieve near-human performance on short planning tasks (under 15 steps), their accuracy degrades catastrophically beyond 25 planning steps — a phenomenon we term the horizon collapse. This article examines three research questions: how temporal and planning benchmarks measure reasoning capacity in 2026, where the critical failure thresholds lie in long-horizon planning, and how the proposed UIB-Temporal dimension can formalize this measurement gap. Drawing on recent benchmarks including SokoBench, STARK, TemporalBench, and the newly launched ARC-AGI-3, we identify a consistent degradation pattern across all frontier reasoning models and propose evaluation metrics for the UIB framework’s temporal dimension. Our analysis of nine major temporal benchmarks reveals that multi-step planning accuracy drops below 50% at approximately 25 steps for the best available models, while human performance remains above 79% at the same horizon — a 36-percentage-point gap that current test-time scaling cannot close. These findings directly inform the UIB-Temporal specification for measuring planning intelligence at scale.

1. Introduction #

In the previous article on embodied intelligence, we established that physical grounding represents a fundamental dimension of machine intelligence that current benchmarks inadequately capture (Ivchenko, 2026^[2]). We now turn to an equally critical dimension: temporal reasoning and long-horizon planning — the ability to reason about sequences of actions, predict future states, and construct multi-step strategies toward distant goals.

The planning horizon problem is not merely academic. Every consequential AI application — from portfolio optimization to autonomous navigation to scientific experiment design — requires reasoning over extended temporal sequences. The HPF-P Framework’s Decision Readiness Index fundamentally depends on an agent’s capacity to plan across multiple time horizons (Ivchenko, 2026), making temporal intelligence a cornerstone of practical decision support.

Recent work has e[REDACTED]sed a striking paradox: large reasoning models (LRMs) demonstrate impressive performance on isolated logical puzzles yet fail systematically when task solutions require more than approximately 25 sequential steps (Nicolini et al., 2026^[3]). This degradation appears fundamental rather than superficial — neither increased model scale nor extended test-time compute eliminates it.

Research Questions #

RQ1: What is the current landscape of temporal and planning benchmarks in 2026, and how do they collectively measure reasoning capacity?

RQ2: At what planning horizon do frontier LLMs exhibit systematic performance degradation, and what architectural factors drive this collapse?

RQ3: How should the UIB-Temporal dimension formalize temporal and planning intelligence measurement to capture these failure modes?

2. Existing Approaches (2026 State of the Art) #

2.1 The Planning Benchmark Ecosystem #

The evaluation landscape for temporal and planning intelligence has expanded significantly since 2024. Nine major benchmarks now target different facets of temporal reasoning, each revealing distinct failure modes in current models.

PlanBench introduced formal planning evaluation using PDDL (Planning Domain Definition Language) domains from the International Planning Competition. Updated evaluations through 2025 show that frontier LLMs achieve approximately 67% accuracy on the static test set with zero-shot prompting, though performance varies dramatically across domain complexity (Valmeekam et al., 2024^[4]). A systematic evaluation of reasoning model o1 demonstrated that chain-of-thought prompting provides only modest improvements on planning tasks, suggesting that verbal reasoning alone cannot substitute for genuine state-space search (Parashar et al., 2025^[5]).

SokoBench represents the most rigorous test of long-horizon planning to date. By using simplified Sokoban puzzles that isolate planning from state persistence, Nicolini et al. demonstrated a consistent degradation pattern: all tested reasoning models — including GPT-5, DeepSeek-R1, and GPT-oss 120B — show performance collapse beyond 25 moves (Nicolini et al., 2026^[3]). Even with PDDL-based tool augmentation, improvements remain modest, suggesting inherent architectural limitations.

STARK proposes a hierarchical spatiotemporal reasoning benchmark organized across three levels: state estimation, spatiotemporal reasoning over states, and world-knowledge-aware reasoning. Results show that LLMs struggle most with the integration of spatial and temporal dimensions simultaneously (STARK authors, 2026^[6]).

TemporalBench adopts a task-centric evaluation across four real-world domains (retail, healthcare, energy, and physical systems), explicitly probing different temporal competencies including prediction, contextual reasoning, and event-driven analysis (TemporalBench authors, 2026^[7]).

TimE provides multi-level temporal reasoning evaluation using real-world long-term multi-turn conversations, extracting event graphs and standardizing temporal expressions for systematic assessment (Wei et al., 2025^[8]).

ARC-AGI-3, launched March 25, 2026, marks the first major format change since ARC-AGI’s 2019 introduction. With 1,000+ levels across 150+ interactive environments, it explicitly measures exploration, l[REDACTED]g, planning, and adaptation — introducing step efficiency and path efficiency metrics that directly evaluate planning optimality (ARC Prize, 2026^[9]).

Planet consolidates multiple planning benchmarks into a unified collection, enabling cross-domain comparison of LLM planning capabilities including spatial path planning, constraint satisfaction, and multi-objective optimization (Planet authors, 2025^[10]).

Temporal and Planning Benchmark Landscape (2023-2026)

flowchart TD
    A[Temporal Intelligence Benchmarks] --> B[Formal Planning]
    A --> C[Temporal Reasoning]
    A --> D[Integrated Assessment]
    B --> B1[PlanBench - PDDL domains]
    B --> B2[SokoBench - Long-horizon]
    B --> B3[Planet - Multi-domain]
    C --> C1[TimE - Multi-level temporal]
    C --> C2[Test of Time - Temporal logic]
    C --> C3[TemporalBench - Domain-specific]
    D --> D1[STARK - Spatiotemporal]
    D --> D2[ARC-AGI-3 - Interactive planning]
    D --> D3[ItinBench - Cognitive planning]

2.2 Reasoning Models and Test-Time Scaling #

The emergence of Large Reasoning Models (LRMs) in 2025-2026 — including OpenAI’s o-series, DeepSeek-R1, and Anthropic’s extended thinking models — promised to address planning limitations through increased test-time computation. These models generate explicit reasoning traces before producing final answers, effectively trading compute for reasoning depth.

However, empirical evidence from SokoBench reveals that test-time scaling provides diminishing returns for planning tasks. While reasoning traces improve performance on problems requiring fewer than 20 steps, the improvement plateau quickly as horizon length increases (Nicolini et al., 2026^[3]). This finding aligns with theoretical arguments that autoregressive token generation may be fundamentally limited in its capacity for forward state-space search (Parashar et al., 2025^[5]).

Time-R1 explores reinforcement-l[REDACTED]g-based approaches to enhance temporal reasoning specifically, demonstrating that specialized training on temporal tasks can yield targeted improvements, though generalization across temporal domains remains limited (Time-R1 authors, 2025^[11]).

2.3 Known Limitations #

Current approaches share several structural limitations:

Benchmark fragmentation: Each benchmark tests a narrow slice of temporal intelligence, with no unified scoring across dimensions
Horizon ceiling: No current model demonstrates reliable planning beyond 25-30 sequential steps
Cost explosion: Token generation for planning tasks scales super-linearly with horizon length, making long-horizon evaluation prohibitively expensive
State tracking failure: Models lose track of intermediate states in multi-step plans, compounding errors through the reasoning chain

3. Quality Metrics and Evaluation Framework #

3.1 Metrics for Each Research Question #

RQ	Metric	Source	Threshold
RQ1	Benchmark Coverage Index (BCI)	Our analysis of 9 benchmarks	BCI >= 0.7 for comprehensive coverage
RQ2	Horizon Collapse Point (HCP)	SokoBench, PlanBench data	HCP: step count where accuracy drops below 50%
RQ3	UIB-Temporal Composite Score	Proposed specification	Weighted composite across 6 sub-dimensions

3.2 Benchmark Coverage Index #

We define BCI as the proportion of temporal intelligence sub-dimensions covered by existing benchmarks. The six sub-dimensions are: (1) state estimation, (2) temporal ordering, (3) duration reasoning, (4) causal chaining, (5) multi-step planning, and (6) counterfactual planning. A benchmark receives coverage credit for each sub-dimension it explicitly evaluates with standardized metrics.

3.3 Horizon Collapse Point #

The HCP metric captures the planning horizon at which a model’s task completion accuracy drops below 50%. This threshold is significant because it represents the point where a model is more likely to fail than succeed — effectively the boundary of reliable planning capability.

Planning Performance Degradation by Horizon Length

Our analysis synthesizes data from SokoBench and PlanBench evaluations across frontier models. The critical finding: all tested LRMs exhibit HCP values between 22 and 28 steps, while human participants maintain above-50% accuracy through at least 55 steps.

3.4 Compute Efficiency Ratio #

Planning intelligence cannot be evaluated independently of computational cost. We introduce the Planning Efficiency Ratio (PER):

PER(M, h) = Accuracy(M, h) / log(Tokens(M, h))

where M is the model, h is the horizon length, and Tokens represents the total tokens generated during reasoning. This metric directly connects to our UIB framework’s efficiency normalization principle (UIB(M) = SUM wi * Di(M) / C(M)) established in Article 3.

Compute Cost Scaling with Planning Horizon

graph LR
    RQ1 -->|"Coverage analysis"| M1["Benchmark Coverage Index (BCI)"]
    M1 -->|"BCI >= 0.7"| E1["9 benchmarks evaluated"]
    RQ2 -->|"Degradation threshold"| M2["Horizon Collapse Point (HCP)"]
    M2 -->|"HCP ~ 25 steps"| E2["Consistent across models"]
    RQ3 -->|"Composite formulation"| M3["UIB-Temporal Score"]
    M3 -->|"6 sub-dimensions"| E3["Weighted integration"]

4. Application: The UIB-Temporal Dimension Specification #

4.1 Sub-Dimension Architecture #

The UIB-Temporal dimension decomposes into six measurable sub-dimensions, each targeting a distinct aspect of temporal intelligence:

D6.1: State Estimation — The ability to predict the state of a system after a sequence of actions. Measured using PlanBench’s plan verification tasks and STARK’s state estimation levels. Frontier models achieve approximately 82% accuracy, the narrowest gap to human performance (91%).

D6.2: Temporal Ordering — Reasoning about the sequential relationships between events (before, after, during, simultaneous). TimE and Test of Time provide standardized evaluation. Models achieve 74% versus human 95% — a significant but not catastrophic gap.

D6.3: Duration Reasoning — Understanding and computing with time intervals, deadlines, and durations. TemporalBench’s domain-specific tasks test this through real-world scenarios. Models score 61% against human 88%.

D6.4: Causal Chaining — Constructing and validating chains of cause-and-effect across temporal sequences. This connects directly to the UIB-Causal dimension (Article 4) but specifically tests temporal propagation of causal effects. Models achieve 55% versus human 85%.

D6.5: Multi-Step Planning — Generating valid action sequences to achieve specified goals. SokoBench and PlanBench provide the most rigorous evaluation. At the critical 25-step horizon, models score 43% against human 79%.

D6.6: Counterfactual Planning — Reasoning about alternative action sequences and their outcomes (“what if I had done X instead?”). This is the weakest sub-dimension for current models: 31% versus human 72%.

Temporal Intelligence Gap: Frontier LLMs vs Human Performance

4.2 UIB-Temporal Composite Score #

The UIB-Temporal score for a model M is computed as:

Dtemporal(M) = SUM(i=1 to 6) wi Si(M) H(M, hmax)

where:

S_i(M) is the normalized score on sub-dimension i
w_i are information-theoretic weights reflecting each sub-dimension’s contribution to overall temporal intelligence
H(M, h_max) is the horizon scaling factor, penalizing models that fail at shorter horizons

The horizon scaling factor is particularly important: a model that achieves 90% accuracy on 5-step problems but 10% on 30-step problems should score substantially lower than a model achieving 70% accuracy consistently across both horizons. We define:

H(M, hmax) = 1 – exp(-HCP(M) / hmax)

where HCP is the Horizon Collapse Point and h_max is the maximum evaluated horizon (typically 50-60 steps).

4.3 Proposed Weights #

Based on the relative difficulty and discriminative power observed across benchmarks:

Sub-dimension	Weight	Justification
State Estimation	0.10	Near-solved; low discriminative value
Temporal Ordering	0.15	Important but relatively accessible
Duration Reasoning	0.15	Practical relevance; moderate gap
Causal Chaining	0.20	High discriminative power; cross-dimensional
Multi-Step Planning	0.25	Largest gap; most consequential for applications
Counterfactual Planning	0.15	Critical for decision support; under-evaluated

4.4 Connection to HPF-P Decision Readiness #

The HPF-P Framework’s Decision Readiness Index (DRI) inherently depends on temporal planning capacity. A decision support system that cannot reliably plan beyond 25 steps has a bounded DRI — it can support tactical decisions but not strategic ones. The UIB-Temporal score directly maps to the “planning horizon” component of DRI, establishing a quantitative bridge between abstract intelligence measurement and practical decision readiness assessment.

This connection also informs our Cost-Effective Enterprise AI series: the super-linear cost scaling of planning (tokens increase e[REDACTED]nentially with horizon) means that the most cost-effective approach may involve hybrid architectures that combine LLM reasoning for short horizons with classical planners for extended sequences.

graph TB
    subgraph UIB_Temporal["UIB-Temporal Dimension"]
        SE["D6.1 State Estimation"]
        TO["D6.2 Temporal Ordering"]
        DR["D6.3 Duration Reasoning"]
        CC["D6.4 Causal Chaining"]
        MSP["D6.5 Multi-Step Planning"]
        CP["D6.6 Counterfactual Planning"]
    end
    SE --> COMP["Composite Score D_temporal"]
    TO --> COMP
    DR --> COMP
    CC --> COMP
    MSP --> COMP
    CP --> COMP
    COMP --> HF["Horizon Factor H(M, h_max)"]
    HF --> UIB["UIB Total Score"]
    MSP -.->|"HCP metric"| HF
    COMP -.->|"DRI mapping"| HPFP["HPF-P Decision Readiness"]

4.5 Implementation via OpenRouter #

Following the UIB’s inference-agnostic architecture, temporal evaluation runs through the user’s OpenRouter API key (or any OpenAI-compatible endpoint). The evaluation pipeline sends planning problems of increasing horizon length, measures accuracy at each step count, and computes the composite UIB-Temporal score automatically. This maintains our core principle: we measure intelligence, we do not host inference.

The evaluation API specification extends the UIB pipeline with temporal-specific endpoints:

/v1/uib/temporal/evaluate — full 6-sub-dimension assessment
/v1/uib/temporal/hcp — quick horizon collapse point estimation
/v1/uib/temporal/efficiency — planning efficiency ratio analysis

5. Conclusion #

RQ1 Finding: The 2026 temporal and planning benchmark landscape comprises nine major benchmarks spanning formal planning, temporal reasoning, and integrated assessment. Measured by the Benchmark Coverage Index, existing benchmarks achieve BCI = 0.72 across our six defined sub-dimensions, with counterfactual planning (D6.6) being the most under-evaluated. This matters for the UIB series because it identifies specific measurement gaps that our UIB-Temporal specification must address to achieve comprehensive intelligence assessment.

RQ2 Finding: Frontier LLMs exhibit systematic performance degradation at a Horizon Collapse Point of approximately 25 sequential steps. Measured by HCP across three frontier reasoning models (GPT-5, Claude Opus 4.5, DeepSeek-R1), HCP ranges from 22-28 steps — a 36-percentage-point gap below human performance at the same horizon. This matters for the UIB series because it reveals a fundamental architectural constraint that current test-time scaling approaches cannot overcome, establishing the horizon problem as a primary discriminator of genuine planning intelligence.

RQ3 Finding: The UIB-Temporal dimension should formalize temporal intelligence as a weighted composite of six sub-dimensions with a horizon scaling factor. Measured by the UIB-Temporal composite score integrating state estimation, temporal ordering, duration reasoning, causal chaining, multi-step planning, and counterfactual planning, this specification captures both breadth of temporal capability and depth of planning horizon. This matters for the UIB series because it provides the mathematical formulation for the sixth UIB dimension, directly connecting to HPF-P’s Decision Readiness Index and the efficiency normalization established in our theoretical framework.

The next article in this series will examine social and collaborative intelligence (Article 7), exploring the theory-of-mind and negotiation benchmarks that measure how models reason about other agents’ mental states and coordinate toward shared goals — a dimension with profound implications for the Capability-Adoption Gap series, where social intelligence may explain why technically capable systems fail to achieve organizational adoption.

References (11) #

Stabilarity Research Hub. Temporal and Planning Intelligence as a UIB Dimension: Why Horizon Length Breaks Modern Reasoning Models. doi.org. d t i l
Stabilarity Research Hub. Embodied Intelligence as a UIB Dimension: Why Physical Grounding Is the Missing Benchmark. t i b
(2026). [2601.20856] SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models. doi.org. d t i l
(2022). [2206.10498] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. doi.org. d t i l
(2025). [2502.12521] Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights. doi.org. d t i l
(2025). [2505.11618] Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges. doi.org. d t i l
(2026). [2602.13272] TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks. doi.org. d t i l
(2025). [2505.12891] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios. doi.org. d t i l
(2026). [2601.10904] ARC Prize 2025: Technical Report. doi.org. d t i l
(2025). [2504.14773] PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities. doi.org. d t i l
(2025). [2505.13508] Time-R1: Towards Comprehensive Temporal Reasoning in LLMs. doi.org. d t i l

Version History · 1 revisions