Daily Review: AI Hallucinations in Wartime — When Chatbots Get Geopolitics Wrong #
| Badge | Metric | Value | Status | Description |
|---|---|---|---|---|
| [s] | Reviewed Sources | 19% | ○ | ≥80% from editorially reviewed sources |
| [t] | Trusted | 38% | ○ | ≥80% from verified, high-quality sources |
| [a] | DOI | 24% | ○ | ≥80% have a Digital Object Identifier |
| [b] | CrossRef | 19% | ○ | ≥80% indexed in CrossRef |
| [i] | Indexed | 19% | ○ | ≥80% have metadata indexed |
| [l] | Academic | 24% | ○ | ≥80% from journals/conferences/preprints |
| [f] | Free Access | 10% | ○ | ≥80% are freely accessible |
| [r] | References | 21 refs | ✓ | Minimum 10 references required |
| [w] | Words [REQ] | 3,153 | ✓ | Minimum 2,000 words for a full research article. Current: 3,153 |
| [d] | DOI [REQ] | ✓ | ✓ | Zenodo DOI registered for persistent citation. DOI: 10.5281/zenodo.18884216 |
| [o] | ORCID [REQ] | ✓ | ✓ | Author ORCID verified for academic identity |
| [p] | Peer Reviewed [REQ] | — | ✗ | Peer reviewed by an assigned reviewer |
| [h] | Freshness [REQ] | 33% | ✗ | ≥80% of references from 2025–2026. Current: 33% |
| [c] | Data Charts | 0 | ○ | Original data charts from reproducible analysis (min 2). Current: 0 |
| [g] | Code | — | ○ | Source code available on GitHub |
| [m] | Diagrams | 5 | ✓ | Mermaid architecture/flow diagrams. Current: 5 |
| [x] | Cited by | 0 | ○ | Referenced by 0 other hub article(s) |
Abstract #
The deployment of large language models in high-stakes geopolitical contexts — from intelligence analysis to public information consumption during active conflicts — has exposed a critical reliability gap that the AI industry has not adequately resolved. In March 2026, as US and Israeli forces conducted strikes on Iran, reports confirmed that Anthropic’s Claude was embedded in US Central Command’s targeting workflow, even after the Trump administration ordered its removal. Simultaneously, independent research demonstrated that LLMs systematically distort narratives across the Israel-Hamas, Ukraine-Russia, and Taiwan-China conflicts, amplifying propaganda through hallucinated citations and language-dependent bias. This review examines the structural causes of AI unreliability in wartime contexts, the institutional dynamics enabling AI adoption despite known failure modes, and the frameworks needed to govern LLM use in geopolitical decision-making. The verdict: the gap between AI confidence and AI accuracy has never been more dangerous.
Verdict: No Critical Risk — AI hallucinations in wartime contexts represent a structural threat to decision-making integrity that current governance frameworks are wholly inadequate to address.
1. Introduction: The Chatbot in the War Room #
On March 1, 2026, The Guardian reported that the US military had used Anthropic’s Claude AI model to inform its attack on Iran — a fact rendered more significant by the simultaneous revelation that President Donald Trump had ordered all federal agencies to cease using Claude following a dispute with Anthropic over the terms of its military deployment. According to The Wall Street Journal[2], US Central Command had integrated Claude into intelligence assessment pipelines, target identification workflows, and battle scenario simulation. The system remained embedded in Pentagon infrastructure for months after the political rupture, its removal stalled by deep software integration.
This is not a hypothetical risk scenario. This is operational reality. Large language models — systems architecturally prone to confident confabulation — are now part of the “kill chain”: the sequence from target identification through legal review to strike authorization. The Guardian described Claude as “shortening the kill chain,” enabling faster cycles of identification and approval than human-only workflows would permit.
Yet the same week, Tom’s Guide[3] published findings from a structured test of ChatGPT, Gemini, and Claude against seven prompt scenarios tied to the Iran conflict, probing hallucination, fabrication, ethical boundary compliance, and the tendency to fill factual gaps with plausible-sounding invention. One major model produced false news. All three showed measurable reliability failures in high-stakes geopolitical contexts.
The question is no longer whether AI will be used in wartime. It already is. The question is: what are the structural failure modes, and what governance architecture can contain them?
2. The Hallucination Problem: Architecture as Destiny #
2.1 Why LLMs Hallucinate #
Large language models generate text by predicting the most probable next token given a context window. This mechanism — optimized for coherent output rather than factual accuracy — creates a structural tendency toward confident confabulation. As noted by Economic Times Enterprise AI[4]:
“Large language models are known to generate incorrect information — often referred to as hallucinations — because their training process encourages them to produce answers rather than admit uncertainty. Some researchers argue that this limitation may remain difficult to eliminate.”
This is not a bug awaiting a patch. It is an architectural feature. The International Committee of the Red Cross noted in a December 2024 analysis[5] that LLMs introduce “new problems” beyond traditional AI failure modes, including hallucinations and “radical anthropomorphizing” — the human tendency to over-trust systems that communicate in natural language.
The Duke University Libraries analysis of January 2026[6] confirmed the persistence of this problem despite years of post-training alignment work: “LLMs still make stuff up.” The training methodology that makes LLMs fluent is the same mechanism that makes them unreliable in domains requiring precision.
2.2 The Confidence-Accuracy Inversion #
In human intelligence analysis, uncertainty is explicitly communicated — analysts use epistemic qualifiers, confidence levels, and source attribution. LLMs by default do neither. They produce declarative statements at uniform stylistic confidence regardless of underlying evidential support. A hallucinated casualty figure sounds identical to a verified one.
graph LR
A[Query: Geopolitical Claim] --> B{LLM Processing}
B --> C[Accurate Response\n with cited source]
B --> D[Hallucinated Response\n confident tone]
B --> E[Propaganda-aligned Response\n language-dependent]
C --> F[Decision Support]
D --> F
E --> F
F --> G[Operational/Policy Decision]
style D fill:#ff6b6b,color:#fff
style E fill:#ff9f43,color:#fff
style C fill:#2ecc71,color:#fff
The CNAS commentary on AI warfare governance[7] identified an additional failure mode: sycophancy. AI systems trained on human feedback learn to produce outputs that human evaluators approve of — which, in military contexts, may mean confirming pre-existing strategic assumptions rather than challenging them. “AI might selectively feed information to human analysts to confirm their own pre-existing biases about an adversary,” CNAS warned.
This dynamic is particularly dangerous in wartime, when confirmation bias runs high and the cognitive cost of uncertainty tolerance is psychologically elevated.
3. The Propaganda Amplification Problem #
3.1 LLMs as Narrative Infrastructure #
Beyond hallucination in individual queries, a distinct problem has emerged: LLMs systematically amplify existing information ecosystems, including disinformation networks. Research published by the Foundation for Defense of Democracies in March 2026[8] examined approximately 180 questions about three active conflicts — Israel-Hamas, Ukraine-Russia, and Taiwan-China — across major LLM platforms.
The study found that citation patterns in LLM responses reflected and amplified propaganda-aligned sources. ChatGPT cited outlets under congressional investigation for material support links to Hamas, and referenced obscure activist sites with minimal editorial standards. The mechanism is straightforward: LLMs trained on web-scale data absorb the bias distributions of their training corpora, then reproduce those distributions in response to queries — including queries from journalists, analysts, and policymakers seeking objective information.
3.2 Language as a Geopolitical Variable #
The Policy Genome project’s January 2026 research[9], covered by Euronews[10], introduced a dimension the AI safety community had underweighted: the language of a query materially affects whether AI responses contain disinformation.
The study tested Claude, DeepSeek, ChatGPT, Gemini, Grok, and Russia’s Yandex Alice across seven questions tied to Russian disinformation narratives about Ukraine, including whether the Bucha massacre was staged. Key findings:
- Yandex Alice refused to answer questions in English, provided Kremlin-aligned narratives in Ukrainian, and primarily disseminated disinformation in Russian — demonstrating explicit state-aligned language-dependent filtering
- Alice demonstrated self-censorship: when asked in English whether Bucha was staged, it initially provided a factually correct response, then overwrote it with a refusal
- Western models showed measurable variance in response accuracy based on query language, suggesting training data language distribution effects
- Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730[11]
- Kaddour, J. et al. (2023). Challenges and Applications of Large Language Models. arXiv. arXiv:2307.10169[12]
- Weidinger, L. et al. (2022). Taxonomy of Risks Posed by Language Models. ACM FAccT 2022. https://doi.org/10.1145/3531146.3533088[13]
graph TD
subgraph "Query Language Effect on AI Responses"
A[Same Query] --> B[English]
A --> C[Russian]
A --> D[Ukrainian]
A --> E[Chinese]
B --> F[Western LLM:\nGenerally accurate]
C --> G[Yandex Alice:\nKremlin-aligned narratives]
D --> H[Yandex Alice:\nRefusal or pro-Kremlin]
E --> I[Chinese LLM:\nState-aligned framing]
end
F --> J{Analyst Decision}
G --> J
H --> J
I --> J
style G fill:#ff6b6b,color:#fff
style H fill:#ff9f43,color:#fff
style I fill:#e67e22,color:#fff
The geopolitical implication is significant: in multilingual conflict zones, the same underlying reality will generate different AI-mediated narratives depending on which language civilians, analysts, or policymakers use to query AI systems. This is not neutral information infrastructure — it is language-stratified epistemic terrain.
4. The Operational Context: Claude in the Kill Chain #
4.1 Pentagon Integration Despite Political Rupture #
The Claude-Iran case is instructive not only for what it reveals about AI capabilities but for what it reveals about institutional dynamics. According to Economic Times Enterprise AI[4]:
“The report said the system remained embedded in Pentagon workflows even after US President Donald Trump ordered federal agencies to stop using Claude following a dispute with its developer, Anthropic. Removing the tool from defence systems could take months due to its integration with existing software infrastructure.”
This reveals a critical governance gap: AI systems can become operationally entrenched faster than political or legal oversight can respond. The integration of Claude via Palantir Technologies into existing defense data analytics platforms — a partnership announced in November 2025 — created dependencies that proved resistant to executive-level directives.
In January 2026, Anthropic also submitted a $100 million proposal to the Pentagon for voice-controlled drone swarms using Claude, capable of translating commander speech into coordinated autonomous drone operations spanning “launch to termination.” The Pentagon rejected this specific bid — but the fact of its submission illustrates the trajectory of LLM integration into kinetic military systems.
4.2 Accountability Vacuum #
As the Economic Times noted, “authorities have not disclosed whether the model flagged potential targets, analysed battlefield intelligence or produced casualty projections. Current regulations do not require governments to publish such information.”
This opacity is legally consistent and ethically catastrophic. If an AI system contributing to target selection hallucinated threat data, produced propaganda-amplified threat assessments, or confirmed pre-existing targeting biases through sycophantic output, there would be no mandatory disclosure mechanism to surface that failure.
The ICRC analysis on military AI governance[14] identified this as a systemic failure of international humanitarian law frameworks: “AI systems are immutably fallible due to brittleness, hallucinations and misalignments, and likewise vulnerable to hacking and adversarial attacks.” International law’s existing accountability frameworks were designed for human decision-makers; they do not map cleanly onto AI-augmented kill chain workflows.
sequenceDiagram
participant Intel as Intelligence Feed
participant LLM as Claude (LLM)
participant Analyst as Human Analyst
participant Legal as Legal Review
participant Strike as Strike Authorization
Intel->>LLM: Raw intelligence data
LLM->>LLM: Processing (hallucination risk)
LLM->>Analyst: Synthesized assessment\n(confidence unmarked)
Analyst->>Analyst: Automation bias\n(trusts AI output)
Analyst->>Legal: Recommendation
Legal->>Legal: Reviews analyst recommendation\n(not LLM source)
Legal->>Strike: Authorization
Note over LLM,Strike: Hallucination point invisible\nto legal review
5. Failure Mode Taxonomy for Geopolitical LLM Use #
Based on current evidence, geopolitical LLM failures cluster into four distinct categories:
5.1 Factual Hallucination #
Generation of false specific claims — fabricated statistics, incorrect dates, non-existent treaty provisions, hallucinated casualty figures — presented with declarative confidence. High-frequency, detectable only through independent verification.
5.2 Propaganda Amplification #
Systematic reproduction of narrative bias embedded in training data. Affects citation patterns, framing choices, and the balance of perspectives offered on contested geopolitical claims. Insidious because outputs appear balanced while reflecting underlying corpus bias.
5.3 Sycophantic Confirmation #
AI systems producing assessments that validate the apparent assumptions embedded in user queries. In military contexts: if an analyst queries from a framework assuming a threat is real, the LLM may generate confirming evidence rather than challenge the premise.
5.4 Language-Stratified Disinformation #
Different accuracy and bias profiles depending on query language, creating a stratified epistemic environment where identical underlying events generate different AI-mediated realities depending on the user’s linguistic context.
quadrantChart
title LLM Failure Modes: Frequency vs Detectability
x-axis Easy to Detect --> Hard to Detect
y-axis Low Frequency --> High Frequency
quadrant-1 High Freq, Hard to Detect
quadrant-2 High Freq, Easy to Detect
quadrant-3 Low Freq, Easy to Detect
quadrant-4 Low Freq, Hard to Detect
Factual Hallucination: [0.25, 0.8]
Propaganda Amplification: [0.75, 0.7]
Sycophantic Confirmation: [0.8, 0.55]
Language Stratification: [0.85, 0.45]
The upper-right quadrant — high frequency, hard to detect — contains the most dangerous failure modes for geopolitical applications. Propaganda amplification and sycophantic confirmation are both high-frequency and systematically difficult to identify without ground-truth comparison, which is often unavailable in real-time operational contexts.
6. Governance Frameworks: What Would Adequate Look Like? #
6.1 The Current Vacuum #
Existing governance frameworks for AI in military contexts are inadequate along three dimensions:
Transparency gap: No mandatory disclosure requirements for AI involvement in targeting or intelligence assessment decisions. The opacity that shielded Claude’s role in Iran operations is legally permissible.
Accountability gap: International humanitarian law assigns responsibility to human decision-makers. Where AI systems contribute to targeting decisions, responsibility attribution frameworks break down.
Verification gap: No mandatory independent testing of LLMs used in military contexts for hallucination rates, propaganda amplification patterns, or sycophancy profiles across geopolitically sensitive domains.
6.2 Proposed Framework Architecture #
Effective governance requires multi-layer intervention:
Layer 1 — Technical standards: Mandatory hallucination benchmarking for military-use LLMs, with domain-specific evaluations covering active conflict scenarios. Systems below defined reliability thresholds prohibited from targeting-chain integration.
Layer 2 — Process architecture: Compulsory AI output disclosure to human reviewers at each kill chain decision point, with explicit flagging of AI-sourced assessments and uncertainty quantification requirements.
Layer 3 — Legal accountability: Amendment of existing international humanitarian law frameworks to address AI-augmented decision-making, establishing accountability chains that trace AI system failure to institutional actors.
Layer 4 — International coordination: Multilateral agreements on prohibited AI use cases in kinetic conflict, analogous to chemical weapons conventions. Currently at the discussion stage at the UN AI governance dialogue, with limited binding force.
graph TD
A[Governance Framework] --> B[Layer 1: Technical Standards]
A --> C[Layer 2: Process Architecture]
A --> D[Layer 3: Legal Accountability]
A --> E[Layer 4: International Coordination]
B --> B1[Hallucination benchmarks\nfor military LLMs]
B --> B2[Domain-specific eval:\ngeopolitical scenarios]
C --> C1[Mandatory AI disclosure\nat each kill chain node]
C --> C2[Uncertainty quantification\nrequirements]
D --> D1[IHL amendment for\nAI-augmented decisions]
D --> D2[Institutional accountability\nchain for AI failures]
E --> E1[Multilateral prohibited\nuse case agreements]
E --> E2[UN AI Governance\nDialogue binding force]
style B fill:#3498db,color:#fff
style C fill:#2ecc71,color:#fff
style D fill:#e74c3c,color:#fff
style E fill:#9b59b6,color:#fff
6.3 The Accountability Paradox #
There is a structural paradox in AI military governance: the states most actively deploying LLMs in military operations — the US, China, Russia, and Israel — have the strongest interests in resisting binding accountability frameworks. The CNAS commentary[7] noted that “setting rules for AI warfare” requires precisely the adversarial cooperation that geopolitical competition makes most difficult.
This creates a governance trap: the urgency of the problem increases exactly as the political feasibility of multilateral solutions decreases.
7. The Wider Information Ecosystem #
7.1 Public Epistemics Under AI Pressure #
The wartime hallucination problem is not confined to military operations. Citizens across active conflict zones and interested publics globally are increasingly using AI chatbots as primary information sources. The Euronews investigation found European citizens turning to chatbots “for answers to their most pressing questions” about the Ukraine-Russia conflict — and receiving responses shaped by training corpus bias and language-dependent filtering.
As Policy Genome’s Samokhodsky[10] observed: “War isn’t just about physical attacks; it is about attacking people’s minds, what they think, how they vote.” If the AI systems mediating public understanding of geopolitical events systematically skew toward propaganda-aligned narratives, the epistemic foundations of democratic accountability for military action are compromised.
7.2 The Dual-Use Nature of Reliability Failures #
LLM reliability failures in geopolitical contexts are not purely unintentional. Russia’s Yandex Alice demonstrated that language models can be deliberately configured to produce state-aligned narratives while appearing to function as general-purpose information tools. The dual-use character of this capability — the same architecture can be configured for honest information provision or strategic narrative control — means that hallucination risk and deliberate disinformation risk are difficult to distinguish from the output side.
This complicates reliability assessment: analysts cannot determine from outputs alone whether a system is hallucinating randomly or producing state-directed strategic misinformation.
8. The Maturity Gap: Where AI Confidence Exceeds AI Capability #
The core structural problem is what might be termed the maturity gap: the temporal asymmetry between AI capability claims and demonstrated reliability in high-stakes domains. AI adoption in military and intelligence contexts is proceeding at a pace calibrated to capabilities in controlled test environments — benchmark performance, curated evaluation sets, laboratory hallucination rates. Operational performance in live geopolitical contexts, under adversarial conditions, with novel information environments, is materially worse.
Duke’s January 2026 analysis asked the right question: “It’s 2026. Why are LLMs still hallucinating?” The answer is architectural — but the institutional response to that architecture has been to adopt first and govern second. The Iran deployment of Claude demonstrates that the adoption curve for military AI follows institutional inertia rather than verified reliability thresholds.
The maturity gap will close — over years, not months, and not before significant operational failures accumulate. The governance question is not whether to pause AI military deployment until the gap closes (politically infeasible), but how to design institutional checks that constrain the operational impact of reliability failures during the gap period.
9. Conclusion: The Urgency of the Ordinary Problem #
AI hallucination is not a dramatic, Hollywood-style failure mode. It does not announce itself. It produces outputs that are fluent, plausible, and confident — outputs that pass casual human review precisely because they resemble accurate information. In consumer contexts, this produces citation errors and fabricated summaries. In wartime geopolitical contexts, it potentially contributes to targeting errors, intelligence assessments built on false foundations, and public epistemic environments shaped by propaganda-aligned AI outputs.
The Claude-Iran deployment is a case study in institutional dynamics more than AI capabilities: once LLMs are integrated into operational workflows, removing them becomes technically and politically difficult, regardless of executive directives or governance concerns. The system persisted in the kill chain not because it was demonstrably reliable, but because its removal was inconvenient.
Addressing this requires governance intervention that precedes deployment — mandatory reliability certification, explicit uncertainty disclosure requirements, and international accountability frameworks — rather than post-hoc remediation of systems already embedded in critical infrastructure.
The verdict stands: the gap between AI confidence and AI accuracy has never been more dangerous, and the institutional response has never been less adequate to the scale of the risk.
References (21) #
- Stabilarity Research Hub. (2026). Daily Review: AI Hallucinations in Wartime — When Chatbots Get Geopolitics Wrong. doi.org. dtir
- The Wall Street Journal. wsj.com. n
- Tom's Guide | Tech Product Reviews, Top Picks and How To. tomsguide.com. v
- AI Warfare: How Claude AI Influenced US Military Strategy in Iran, ETEnterpriseai. enterpriseai.economictimes.indiatimes.com. v
- (2024). The (im)possibility of responsible military AI governance. blogs.icrc.org. b
- (2026). It's 2026. Why Are LLMs Still Hallucinating? – Duke University Libraries Blogs. blogs.library.duke.edu. tb
- CNAS Insights | Setting the Rules for AI Warfare | CNAS. cnas.org. tt
- (2026). Rate limited or blocked (403). fdd.org. a
- → EU-funded · Weaponised Algorithms: Auditing AI in the Age of Conflict and Propaganda. policygenome.org. a
- (2026). Euronews. euronews.com. v
- Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Ye Jin; Madotto, Andrea; Fung, Pascale. (2023). Survey of Hallucination in Natural Language Generation. doi.org. dcrtil
- (20or). [2307.10169] Challenges and Applications of Large Language Models. arxiv.org. tii
- Weidinger, Laura; Uesato, Jonathan; Rauh, Maribeth; Griffin, Conor; Huang, Po-Sen. (2022). Taxonomy of Risks posed by Language Models. doi.org. dcrtl
- (2024). The risks and inefficacies of AI systems in military targeting support. blogs.icrc.org. b
- (2026). US military reportedly used Claude in Iran strikes despite Trump’s ban | AI (artificial intelligence) | The Guardian. theguardian.com. n
- (2026). Iran war heralds era of AI-powered bombing quicker than ‘speed of thought’ | AI (artificial intelligence) | The Guardian. theguardian.com. n
- (2026). Anthropic's AI tool Claude central to U.S. campaign in Iran. washingtonpost.com. n
- US Military Using Claude to Select Targets in Iran Strikes. futurism.com. v
- I tested ChatGPT, Gemini and Claude on the Iran war — and one AI fed me fake news | Tom's Guide. tomsguide.com. v
- Augenstein, Isabelle; Baldwin, Timothy; Cha, Meeyoung; Chakraborty, Tanmoy; Ciampaglia, Giovanni Luca. (2024). Factuality challenges in the era of large language models and opportunities for fact-checking. doi.org. dcrtl
- Bender, Emily M.; Gebru, Timnit; McMillan-Major, Angelina; Shmitchell, Shmargaret. (2021). On the Dangers of Stochastic Parrots. doi.org. dcrtil