
Specification Languages for AI: From Natural Language to Formal Methods

Posted on February 18, 2026

📚 Academic Citation:
Ivchenko, O. (2026). Specification Languages for AI: From Natural Language to Formal Methods. Spec-Driven AI Development Series. Odessa National Polytechnic University.
DOI: 10.5281/zenodo.18684610

Abstract

Artificial intelligence systems present a fundamental specification challenge: how do we precisely describe what a learning system should do when its behaviour emerges from data rather than explicit programming? This article surveys the landscape of specification languages and approaches available to AI practitioners — from accessible natural language techniques like Gherkin-based behaviour-driven development to rigorous formal methods grounded in temporal logic and model checking. We examine AI-specific frameworks including Model Cards, Datasheets for Datasets, and the emerging ecosystem of regulatory standards (ISO/IEC 42001, NIST AI RMF, EU AI Act). Through comparative analysis, we argue that effective AI specification is not a binary choice between natural language flexibility and formal rigour, but a multi-layered discipline that selects tools appropriate to specification scope, audience, and criticality. Practitioners who master this spectrum gain measurably better outcomes in safety, auditability, and long-term maintainability of production AI systems.


1. The Specification Gap in AI Development

In classical software engineering, a specification gap — the distance between what a system is intended to do and what it actually does — can be traced directly to imprecise or incomplete requirements. In AI systems, this gap has an additional, more insidious dimension: the system’s behaviour is not programmed but learned, making it simultaneously harder to specify in advance and harder to validate against any given specification after the fact.

Sculley et al. [1] famously characterised machine learning systems as carrying “hidden technical debt” — where the model code is a small fraction of the overall system complexity, surrounded by vast, undocumented infrastructure. A significant portion of that debt is specification debt: informal, ambiguous, or entirely absent descriptions of what the model is supposed to achieve, under what conditions, and with what guarantees. The consequences of this debt accumulate slowly at first, then catastrophically — in the form of fairness violations, safety incidents, regulatory non-compliance, and the near-impossibility of meaningful model audits.

This article maps the available landscape of specification tools for AI, from those familiar to software engineers (user stories, BDD scenarios, contracts) through AI-native frameworks (Model Cards, datasheets) to formal methods with mathematical guarantees. The goal is not to prescribe a single approach but to help practitioners navigate the spectrum with principled criteria.

graph LR
    A[Informal\nNatural Language] --> B[Semi-Formal\nStructured Templates]
    B --> C[Executable\nSpecifications]
    C --> D[Formal\nMathematical Specs]

    style A fill:#fff3cd,stroke:#856404
    style B fill:#cfe2ff,stroke:#084298
    style C fill:#d1e7dd,stroke:#0f5132
    style D fill:#f8d7da,stroke:#842029

    subgraph "Examples"
        A1["User stories\nSlack threads\nReadme prose"]
        B1["Model Cards\nDatasheets\nGherkin/BDD"]
        C1["Property tests\nML Test Score\nChecklist suites"]
        D1["TLA+\nAlloy\nTemporal logic"]
    end

    A --- A1
    B --- B1
    C --- C1
    D --- D1

Figure 1: The specification formality spectrum for AI systems — from natural language narratives to mathematical proofs.

2. Natural Language Specifications: Foundations and Limits

2.1 User Stories and Acceptance Criteria

The most prevalent form of AI specification in practice is also the least formal: natural language narratives in the form of user stories and acceptance criteria. In their canonical form (“As a [user], I want [feature], so that [benefit]”), user stories communicate intent efficiently but provide almost no basis for systematic validation. For a recommendation engine, a user story might read: “As a customer, I want relevant product suggestions, so that I discover items I didn’t know I needed.” This tells us nothing about what “relevant” means, how to measure it, whether it holds across demographic groups, or how it degrades under distribution shift.

The limitation is not inherent to user stories themselves but to the absence of corresponding operational definitions. Effective natural language specification supplements stories with explicit acceptance criteria that are measurable and unambiguous: “The recommendation system shall achieve a click-through rate ≥ 3.5% on held-out validation data, measured monthly, disaggregated by user cohort (new/returning) with no cohort showing CTR below 2.5%.” This moves specification from intent to contract — still natural language, but with testable claims.
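A contract phrased this way can be checked mechanically. The sketch below encodes the illustrative CTR criterion from the example above as a compliance check; the thresholds mirror that criterion, and the aggregate is simplified to an unweighted mean of cohort CTRs (a real check would weight by cohort traffic):

```python
AGGREGATE_CTR_MIN = 0.035   # overall click-through-rate floor from the criterion
COHORT_CTR_MIN = 0.025      # no cohort may fall below this

def check_ctr_spec(cohort_ctrs):
    """Return a list of specification violations (empty list = compliant).

    Simplification: the aggregate is the unweighted mean of cohort CTRs."""
    violations = []
    aggregate = sum(cohort_ctrs.values()) / len(cohort_ctrs)
    if aggregate < AGGREGATE_CTR_MIN:
        violations.append(f"aggregate CTR {aggregate:.3f} below {AGGREGATE_CTR_MIN}")
    for cohort, ctr in cohort_ctrs.items():
        if ctr < COHORT_CTR_MIN:
            violations.append(f"cohort '{cohort}' CTR {ctr:.3f} below {COHORT_CTR_MIN}")
    return violations

print(check_ctr_spec({"new": 0.031, "returning": 0.042}))  # → []
print(check_ctr_spec({"new": 0.021, "returning": 0.055}))  # one cohort violation
```

Run monthly against fresh evaluation data, a check like this turns the acceptance criterion into a regression gate rather than a paragraph in a document.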

2.2 Behaviour-Driven Development (BDD) for AI

Behaviour-Driven Development, with its Gherkin syntax (Given/When/Then), provides a semi-formal bridge between natural language and executable tests [2]. Originally designed for web and business logic applications, BDD has found increasing application in ML system specification, particularly for specifying model behaviour rather than model internals [3].

A BDD scenario for a medical image classification system might read:

Feature: Pneumonia Detection
  As a radiologist support system
  I want to flag chest X-rays with likely pneumonia
  So that urgent cases receive priority review

  Scenario: High-confidence positive case
    Given a chest X-ray with visible bilateral infiltrates
    When the model processes the image
    Then the pneumonia probability score shall exceed 0.85
    And the case shall be flagged for urgent review
    And an explanation shall identify the relevant lung regions

  Scenario: Demographic invariance
    Given two chest X-rays with identical clinical findings
    When one belongs to a patient aged 25 and one aged 75
    Then the probability scores shall differ by no more than 0.05

  Scenario: Distribution shift detection
    Given a chest X-ray from a scanner model not seen during training
    When the model processes the image
    Then a calibration uncertainty flag shall be raised
    And the confidence score shall be appropriately discounted

This level of specification achieves several goals simultaneously: it communicates system behaviour to clinical stakeholders, provides executable test scaffolding for the engineering team, and documents fairness and robustness requirements that are increasingly mandated by regulators. Ribeiro et al.’s CheckList framework [4] extends this paradigm systematically, defining behavioural test suites that cover minimum functionality, invariance, and directional expectation tests across NLP systems.
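An invariance test in the CheckList sense can be expressed in a few lines. The sketch below uses a toy keyword-based classifier as a stand-in model; the perturbation function and example texts are illustrative, not drawn from the CheckList paper:

```python
def toy_sentiment(text):
    """Stand-in model; any classifier with the same signature works here."""
    return "negative" if "terrible" in text or "awful" in text else "positive"

def invariance_pass_rate(model, texts, perturb):
    """Fraction of examples whose prediction survives the perturbation."""
    return sum(model(t) == model(perturb(t)) for t in texts) / len(texts)

# INV-style test: swapping a person's name must never flip the prediction.
swap_name = lambda t: t.replace("Alice", "Bob")
texts = ["Alice had a terrible flight.", "Alice loved the service."]
rate = invariance_pass_rate(toy_sentiment, texts, swap_name)
print(rate)  # → 1.0 (the toy model ignores names)
```

The same harness, pointed at a production model and a curated perturbation library, becomes an executable fairness and robustness specification.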

2.3 The Oracle Problem

Natural language specifications for AI face what Barr et al. [5] call the oracle problem: the difficulty of determining, for any given input, what the correct output should be. In traditional software, a specification like “compute the square root of x” has a clear oracle. For a sentiment classifier operating on ambiguous social media text, no such oracle exists — reasonable experts disagree, context changes the answer, and the notion of ground truth is itself contested.

This is not merely a philosophical problem. It has direct engineering consequences: you cannot write automated tests without an oracle, you cannot measure specification compliance without one, and you cannot legally defend model decisions without one. Specification languages that ignore the oracle problem leave practitioners with unverifiable documents. The frameworks discussed in subsequent sections represent various strategies for constructing tractable oracles.

3. Semi-Formal Specifications: Structure Without Proof

3.1 Model Cards

Introduced by Mitchell et al. [6] at Google, Model Cards are structured documentation artifacts that accompany ML models. They occupy the semi-formal tier of the specification spectrum: sufficiently structured to enable comparison across models and systematic auditing, yet flexible enough to accommodate the diversity of ML applications. A Model Card documents model details, intended use, out-of-scope uses, factors (including demographic and environmental), evaluation metrics, evaluation data, training data, quantitative analyses, ethical considerations, and caveats.

The critical insight in Model Cards is the specification of disaggregated evaluation results — performance reported not just on aggregate test sets but across population subgroups, environmental conditions, and instrumentation variations. This transforms the evaluation section of a Model Card into a partial specification of expected behaviour: the model is implicitly specified to perform at documented levels across documented subgroups. Deviations from these specifications constitute measurable failures.
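This reading of a Model Card as a partial specification can be made operational: the documented subgroup results become floors that each new evaluation is audited against. A minimal sketch, with illustrative subgroup names and thresholds:

```python
# Documented disaggregated performance, treated as a behavioural floor.
model_card_spec = {
    "cohort:urban":  {"f1_min": 0.88},
    "cohort:rural":  {"f1_min": 0.82},
    "scanner:typeA": {"f1_min": 0.85},
}

def audit(spec, observed):
    """Map each subgroup to True (meets its documented level) or False."""
    return {g: observed.get(g, 0.0) >= s["f1_min"] for g, s in spec.items()}

result = audit(model_card_spec, {"cohort:urban": 0.90,
                                 "cohort:rural": 0.79,
                                 "scanner:typeA": 0.86})
failing = [g for g, ok in result.items() if not ok]
print(failing)  # the rural cohort deviates from its documented level
```

A subgroup missing from the observed results fails the audit by construction, which is deliberate: an unevaluated subgroup is an unverified claim.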

Since their introduction, Model Cards have been adopted by major ML platforms (Hugging Face, TensorFlow Hub, Google Cloud AI) and increasingly required by enterprise AI governance policies. The EU AI Act [7] effectively mandates Model Card-equivalent documentation for high-risk AI systems, further cementing their role as de facto industry standard.

3.2 Datasheets for Datasets

Gebru et al. [8] proposed Datasheets for Datasets by analogy with electronic component datasheets — standardised documentation that enables informed selection and appropriate use. Where Model Cards specify model behaviour, Datasheets specify the data artefacts on which models are built, covering motivation, composition, collection process, preprocessing, uses, distribution, and maintenance.

From a specification perspective, Datasheets serve as a data contract: they make explicit the distributional assumptions baked into a dataset, enabling practitioners to reason about whether a model trained on that data will satisfy behavioural specifications in a given deployment context. A medical imaging dataset collected exclusively from urban tertiary care centres, if documented honestly, immediately signals potential specification violations when the model is deployed in rural primary care — without any additional testing required.

The Dataset Nutrition Label [9], developed in parallel, takes a more quantitative approach, encoding dataset properties as structured metadata with automated computation where possible. Together, these frameworks represent the emergence of a data specification layer — a recognition that AI system behaviour is jointly determined by model architecture and training data, and that specifying only the former is fundamentally incomplete.
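The data-contract reading of a Datasheet can be quantified. The sketch below compares a documented collection-site composition against a target deployment population using total variation distance; the site categories and proportions are illustrative:

```python
def contract_gap(documented, deployment):
    """Total variation distance between two categorical composition mixes."""
    keys = set(documented) | set(deployment)
    return 0.5 * sum(abs(documented.get(k, 0.0) - deployment.get(k, 0.0))
                     for k in keys)

# Composition documented in the datasheet vs. the intended deployment site mix.
datasheet = {"urban_tertiary": 0.95, "rural_primary": 0.05}
target    = {"urban_tertiary": 0.20, "rural_primary": 0.80}

gap = contract_gap(datasheet, target)
print(f"composition gap = {gap:.2f}")  # a large gap flags a likely violation
```

The point is not the specific distance measure but that an honest datasheet makes this comparison possible before any model testing is done.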

graph TD
    A[AI System Specification] --> B[Model Specification\nModel Card]
    A --> C[Data Specification\nDatasheet / Nutrition Label]
    A --> D[System Specification\nArchitecture + Integration]
    A --> E[Behavioural Specification\nBDD / Property Tests]

    B --> B1["Model purpose\nExpected performance\nDisaggregated results\nEthical considerations"]
    C --> C1["Data provenance\nComposition + bias\nCollection methodology\nMaintenance policy"]
    D --> D1["Component contracts\nAPI specifications\nFailure modes\nMonitoring targets"]
    E --> E1["Functional scenarios\nInvariance tests\nFairness constraints\nRobustness bounds"]

    style A fill:#1a365d,color:#fff
    style B fill:#2b6cb0,color:#fff
    style C fill:#2b6cb0,color:#fff
    style D fill:#2b6cb0,color:#fff
    style E fill:#2b6cb0,color:#fff

Figure 2: Four layers of AI system specification — model, data, system, and behaviour — each requiring distinct specification tools.

3.3 The ML Test Score and Structured Quality Frameworks

Breck et al. [10] at Google proposed the ML Test Score as a structured quality checklist covering features and data, model development, ML infrastructure, and monitoring. While framed as a testing framework, the ML Test Score is simultaneously a specification framework: the 28 tests it defines collectively constitute a precise description of what a production-ready ML system must do. Teams that achieve high ML Test Scores have, in effect, specified and validated their systems against a comprehensive (if not exhaustive) behavioural contract.

Ashmore et al. [11] extended this approach into the full ML lifecycle with their assurance framework, mapping specification, verification, and validation activities to each lifecycle phase. Their work is particularly valuable because it explicitly addresses the temporal dimension of AI specifications — the fact that model behaviour evolves through retraining, fine-tuning, and deployment drift, requiring specifications to be living documents rather than waterfall-style fixed requirements.

4. Formal Methods: Mathematical Precision for Critical AI

4.1 Why Formal Methods for AI?

Formal methods — the use of mathematically rigorous techniques for specification, development, and verification of software and hardware — have a long history in safety-critical systems [12]. Their application to AI systems is newer and more contested, but the case for them is compelling in domains where failures have catastrophic consequences: autonomous vehicles, medical devices, financial infrastructure, and critical national infrastructure.

The fundamental advantage of formal specifications is that they admit proof: given a formal specification and a formal model of the system, it is in principle possible to prove that the system satisfies (or violates) the specification, or to exhaustively search for counter-examples. This transforms verification from “we tested a lot of cases” to “we have mathematical certainty about all cases of type X.” In practice, the state-space explosion problem limits the scope of what can be proven, but even partial formal verification provides substantially stronger guarantees than testing alone.

4.2 Temporal Logic and Model Checking

Linear Temporal Logic (LTL) and Computation Tree Logic (CTL) provide formal languages for specifying properties of reactive systems over time — precisely what AI systems in production environments are. LTL operators allow specification of properties like:

  • Safety: □(prediction_confidence < threshold → escalate_to_human) — “always: if confidence is below threshold, escalate”
  • Liveness: ◇(data_drift_detected → model_retrained) — “eventually: once drift is detected, the model is retrained”
  • Fairness: □◇(minority_cohort_served) — “always eventually: the minority cohort receives service”

Model checkers like SPIN, NuSMV, and PRISM can automatically verify whether a system model satisfies LTL/CTL specifications, producing counter-example traces when violations are found. The challenge in applying these tools to AI is constructing appropriate abstract models of ML system behaviour — a technically demanding but tractable problem for well-constrained systems [13].
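A lightweight complement to model checking is runtime monitoring: instead of proving a temporal property over all possible traces, the property is checked on the observed trace. The sketch below encodes the safety property from the bullet list, □(prediction_confidence < threshold → escalate_to_human), as a trace monitor; the threshold and trace encoding are illustrative:

```python
THRESHOLD = 0.7  # illustrative escalation threshold

def monitor_safety(trace):
    """trace: iterable of (confidence, escalated) pairs.
    True iff every low-confidence step was escalated — the runtime
    analogue of the LTL safety property □(conf < t → escalate)."""
    return all(escalated for conf, escalated in trace if conf < THRESHOLD)

good_trace = [(0.9, False), (0.5, True), (0.95, False)]
bad_trace  = [(0.9, False), (0.5, False)]  # low confidence, not escalated

assert monitor_safety(good_trace)
assert not monitor_safety(bad_trace)
```

A monitor like this verifies far less than a model checker (only the trace actually seen), but it runs in production and catches specification violations as they happen.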

4.3 TLA+ for Distributed AI Systems

Lamport’s TLA+ (Temporal Logic of Actions Plus) [14] has found application in specifying distributed AI training and inference systems, where race conditions, fault tolerance, and consistency guarantees are critical. TLA+ specifications of model serving infrastructure can verify properties like:

  • Exactly-once prediction logging under network partition
  • Model version consistency during rolling updates
  • Graceful degradation when upstream data pipelines fail
  • Consistency of feature stores under concurrent read/write workloads

Major technology companies have used TLA+ to find subtle bugs in distributed systems before deployment [15]. For AI infrastructure, where a race condition in the feature pipeline can silently corrupt predictions, this level of specification rigour is increasingly justified.

4.4 Alloy for Structural Specifications

Jackson’s Alloy [16] provides a lightweight formal modelling language based on relational logic, with an automatic analysis tool (the Alloy Analyzer) that performs bounded exhaustive checking of specifications. Alloy is particularly useful for specifying the structural properties of AI systems: data schemas, permission models, pipeline topology, and component interaction patterns. Unlike TLA+, Alloy focuses on finding design flaws rather than verifying temporal properties — making it appropriate earlier in the specification lifecycle, when system structure is still being determined.

4.5 Formal Neural Network Verification

A specialised branch of formal methods has emerged specifically for neural network verification. Seshia et al. [17] survey the emerging field of formal specification for deep neural networks, covering properties such as:

  • Local robustness: For any input x, perturbations within ε-ball produce the same classification
  • Monotonicity: Increasing feature f always increases (or never decreases) the output
  • Fairness constraints: Predictions are independent of protected attribute a given other features
  • Reachability: Certain output regions are unreachable given constrained input domains

Tools like Reluplex, Marabou, α,β-CROWN, and VNN-Comp competitors can verify such properties for neural networks with millions of parameters — though computational cost scales poorly with network size, currently limiting practical application to smaller, safety-critical components.
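The gap between these formal verifiers and testing is worth making concrete. A random search inside the ε-ball can find counter-examples but can never prove robustness; the sketch below illustrates this with a toy linear classifier (all names, inputs, and the ε values are illustrative):

```python
import random

def toy_classify(x):
    """Stand-in for a trained network: threshold on the feature sum."""
    return int(sum(x) > 1.0)

def find_counterexample(classify, x, eps, trials=1000, seed=0):
    """Randomly sample the ε-ball (L∞) around x; return an input that flips
    the label, or None if none was found. Unlike a formal verifier's
    result, None is NOT a proof of robustness."""
    rng = random.Random(seed)
    base = classify(x)
    for _ in range(trials):
        x_p = [xi + rng.uniform(-eps, eps) for xi in x]
        if classify(x_p) != base:
            return x_p
    return None

print(find_counterexample(toy_classify, [0.2, 0.2], eps=0.05))   # → None
print(find_counterexample(toy_classify, [0.45, 0.5], eps=0.1) is not None)
```

Tools like Marabou replace the sampling loop with an exhaustive search over the network's piecewise-linear structure, which is what turns "no counter-example found" into "no counter-example exists".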

graph TD
    A[Formal Specification\nfor AI Systems] --> B[System-Level\nFormal Methods]
    A --> C[Neural Network\nVerification]
    A --> D[Statistical\nFormal Guarantees]

    B --> B1[TLA+\nDistributed systems\nInfrastructure properties]
    B --> B2[LTL/CTL\nTemporal properties\nReactive behaviour]
    B --> B3[Alloy\nStructural design\nRelational models]

    C --> C1[Robustness\nAdversarial bounds]
    C --> C2[Fairness\nConstraints]
    C --> C3[Reachability\nSafety regions]

    D --> D1[PAC Learning\nGeneralisation bounds]
    D --> D2[Differential Privacy\nFormal privacy]
    D --> D3[Conformal Prediction\nCoverage guarantees]

    style A fill:#1a365d,color:#fff
    style B fill:#276749,color:#fff
    style C fill:#744210,color:#fff
    style D fill:#553c9a,color:#fff

Figure 3: Three branches of formal specification for AI — system-level infrastructure, neural network properties, and statistical guarantees.

4.6 Statistical Formal Guarantees

A uniquely AI-native form of formal specification leverages statistical learning theory to provide provable guarantees over model behaviour. PAC (Probably Approximately Correct) learning theory [18] provides formal bounds on generalisation error: given sufficient training data, a learner in a class with bounded VC dimension will, with probability ≥ 1-δ, achieve error ≤ ε on the underlying distribution. While these bounds are often loose in practice, they constitute formal specifications of model behaviour that are mathematically verifiable.
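Restated as a formula, one standard uniform-convergence form of this bound reads as follows (exact constants vary between textbooks, so treat the constants here as illustrative; R is the true risk and R̂ the empirical risk):

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size m,
% every hypothesis h in a class \mathcal{H} of VC dimension d satisfies
% (one common form; constants differ by source):
R(h) \;\le\; \hat{R}(h)
  \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}
```

Read as a specification, the bound says: for a fixed model class and sample size, the generalisation gap is formally bounded, uniformly over everything the learner could output.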

More practically powerful are conformal prediction methods [19], which provide distribution-free coverage guarantees without assumptions about the data-generating process: “With probability ≥ 90%, the true label will be in the predicted set.” This is a formal specification that can be verified empirically and enforced in deployment — providing a bridge between formal methods and practical ML engineering.
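The split conformal procedure is simple enough to sketch in full. The toy example below calibrates a symmetric prediction interval for a 1-D regression problem and then checks the coverage claim empirically; the base predictor, data-generating process, and seed are all illustrative:

```python
import math
import random

rng = random.Random(42)

def base_predict(x):
    """Stand-in point predictor: the data below follow y ≈ 2x + noise."""
    return 2.0 * x

def draw(n):
    return [(x, 2.0 * x + rng.gauss(0, 0.5)) for x in (rng.random() for _ in range(n))]

# Split conformal calibration: nonconformity score = |y - prediction|.
calibration = draw(200)
scores = sorted(abs(y - base_predict(x)) for x, y in calibration)
alpha = 0.1                                         # target 90% coverage
k = math.ceil((1 - alpha) * (len(scores) + 1)) - 1  # conformal quantile index
q = scores[min(k, len(scores) - 1)]

def predict_interval(x):
    return (base_predict(x) - q, base_predict(x) + q)

# The coverage specification is checkable on fresh exchangeable data:
test_data = draw(500)
coverage = sum(
    1 for x, y in test_data
    if predict_interval(x)[0] <= y <= predict_interval(x)[1]
) / len(test_data)
print(f"empirical coverage: {coverage:.2f}")  # should land near 0.90
```

The guarantee is marginal (on average over exchangeable draws) rather than per-input, but it holds with no assumptions on the base predictor, which is what makes it deployable as a specification.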

Differential privacy [20] provides yet another category of formal guarantee: mathematical bounds on information leakage about individual training examples. For systems handling sensitive data, differential privacy specifications are increasingly required by regulators and provide formally verifiable privacy contracts.
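The ε-differential-privacy guarantee is a formal property of the mechanism, not the model: for datasets differing in one record, output distributions differ by at most a factor of e^ε. A minimal sketch of the classic Laplace mechanism for a counting query (sensitivity 1); this is an illustration of the definition, not a production implementation:

```python
import math
import random

def laplace_sample(scale, rng):
    """Sample Laplace(scale) noise via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon, rng):
    """ε-DP noisy count: true count + Laplace(1/ε) noise.
    A counting query has sensitivity 1, so scale = 1/ε suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_sample(1.0 / epsilon, rng)

rng = random.Random(7)
ages = [23, 45, 31, 67, 52, 29, 71]
noisy = dp_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(round(noisy, 2))  # true answer is 2; noise scale is 1/0.5 = 2
```

The specification here is verifiable by inspection of the mechanism: given the stated sensitivity and scale, the ε bound follows mathematically, independent of the data.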

5. Industry Standards and Regulatory Frameworks

5.1 ISO/IEC 42001: AI Management Systems

Published in 2023, ISO/IEC 42001 is the first international management system standard specifically addressing AI. Modelled on the structure of ISO 9001 (quality) and ISO 27001 (security), it provides a framework for establishing, implementing, maintaining, and continually improving an AI management system. From a specification perspective, ISO 42001 mandates that organisations document:

  • The intended purpose and context of AI systems
  • Risk and impact assessment processes
  • Data governance and quality controls
  • Performance monitoring and continual improvement processes
  • Human oversight mechanisms

ISO 42001 does not prescribe specific specification languages or formats — it specifies the what of AI system documentation, leaving the how to the organisation. This makes it compatible with all the approaches discussed in this article, positioning them as potential implementation mechanisms for the standard’s requirements.

5.2 NIST AI Risk Management Framework

The US National Institute of Standards and Technology AI Risk Management Framework (AI RMF 1.0) [21] provides a structured approach to managing AI risks across four core functions: Govern, Map, Measure, and Manage. The framework’s “Map” function is essentially a specification function — it requires organisations to systematically describe the context, intended uses, deployment conditions, and associated risks of AI systems in sufficient detail to support downstream risk measurement and management.

The NIST AI RMF is notable for its explicit attention to trustworthiness characteristics: accuracy, reliability, safety, security, explainability and interpretability, privacy, fairness, and accountability. Each of these is simultaneously a specification dimension — a property that must be defined, measured, and managed — and a domain where specific specification tools (formal methods for safety, fairness metrics for equity, differential privacy for privacy) are appropriate.

5.3 The EU AI Act: Specification as Compliance

The European Union’s AI Act [7], entering full force progressively through 2025-2027, creates legally binding specification requirements for AI systems classified as high-risk (which includes medical devices, employment tools, critical infrastructure, biometric systems, and many others). Required technical documentation includes:

  • Detailed description of the AI system, its intended purpose, and design specifications
  • Training, validation, and testing data specifications
  • Performance metric definitions and validation results
  • Known and foreseeable risks and mitigation measures
  • Post-market monitoring specifications
  • Human oversight mechanisms and their specifications

The AI Act effectively mandates that organisations deploying high-risk AI in the EU adopt semi-formal specification practices equivalent to, or exceeding, the Model Card and Datasheet frameworks. The regulatory pressure this creates is expected to accelerate adoption of structured specification practices across the industry globally, as multinational organisations standardise on the highest applicable compliance bar.

graph LR
    subgraph Standards["Regulatory & Standards Landscape"]
        ISO[ISO/IEC 42001\nAI Management Systems]
        NIST[NIST AI RMF\nRisk Management]
        EU[EU AI Act\nHigh-Risk Requirements]
        IEEE[IEEE P2801\nAI Quality Management]
    end

    subgraph Spec["Specification Requirements"]
        S1[Intent & Context\nDocumentation]
        S2[Data Governance\nSpecs]
        S3[Performance &\nFairness Metrics]
        S4[Risk & Safety\nDocumentation]
        S5[Human Oversight\nMechanisms]
        S6[Monitoring &\nDrift Detection]
    end

    ISO --> S1
    ISO --> S2
    ISO --> S5
    NIST --> S3
    NIST --> S4
    NIST --> S6
    EU --> S1
    EU --> S2
    EU --> S3
    EU --> S4
    EU --> S5
    EU --> S6

    style ISO fill:#1a365d,color:#fff
    style NIST fill:#276749,color:#fff
    style EU fill:#744210,color:#fff
    style IEEE fill:#553c9a,color:#fff

Figure 4: Regulatory and standards landscape — common specification requirements mandated across major frameworks.

5.4 IEEE Standards for AI Quality

The IEEE has developed multiple standards relevant to AI specification. IEEE 2801 (Recommended Practice for the Quality Management of Datasets for Medical AI) provides detailed specification templates for medical AI datasets — effectively formalising the Datasheet approach for a regulated domain. IEEE 7000 (Model Process for Addressing Ethical Concerns During System Design) provides process specifications for incorporating ethical requirements into AI system development. These domain-specific standards complement the general frameworks above, providing specification guidance tailored to the unique risks and requirements of specific industries.

6. Comparing Specification Approaches: A Multi-Dimensional Analysis

No single specification approach is universally optimal. The appropriate choice depends on factors including the criticality of the system, the technical sophistication of stakeholders, the regulatory environment, available tooling, and the stage of the development lifecycle. The following comparative analysis provides a structured basis for selection.

| Dimension | Natural Language | BDD / Gherkin | Model Cards | Formal Methods |
|---|---|---|---|---|
| Accessibility | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
| Verifiability | ⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| Completeness | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Tooling Maturity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Regulatory Alignment | ⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Scalability | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |

Key Insight: The most effective AI specification strategies are layered — combining accessible natural language narratives for stakeholder alignment, semi-formal frameworks (Model Cards, BDD) for team coordination and regulatory compliance, and targeted formal methods for safety-critical components. The error is not choosing the “wrong” level but applying a single level uniformly where a hierarchy of specificity is required.

7. A Layered Specification Strategy for Enterprise AI

Drawing on the frameworks surveyed above, we propose a layered specification strategy for enterprise AI systems that matches specification rigour to specification scope and risk:

graph TD
    L1["Layer 1: Intent Specification\n(Natural Language)\nUser stories, use cases, business objectives\nAudience: stakeholders, executives, product owners"]
    L2["Layer 2: Behavioural Specification\n(Semi-Formal: BDD/CheckList)\nScenarios, invariance tests, fairness constraints\nAudience: product teams, QA, regulators"]
    L3["Layer 3: Artefact Specification\n(Structured Docs: Model Cards, Datasheets)\nModel properties, data lineage, disaggregated performance\nAudience: compliance, auditors, model consumers"]
    L4["Layer 4: System Specification\n(Semi-Formal: Architecture + Contracts)\nComponent interfaces, API contracts, failure modes\nAudience: engineering teams, SRE"]
    L5["Layer 5: Critical Component Specification\n(Formal: TLA+, Neural Verification)\nMathematical properties for safety-critical parts\nAudience: safety engineers, regulators"]

    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> L5

    style L1 fill:#fff3cd,stroke:#856404
    style L2 fill:#cfe2ff,stroke:#084298
    style L3 fill:#d1e7dd,stroke:#0f5132
    style L4 fill:#f8d7da,stroke:#842029
    style L5 fill:#e2d9f3,stroke:#432874

Figure 5: Five-layer specification strategy for enterprise AI — each layer addresses a distinct audience, scope, and level of formality.

This layered approach respects the economic reality of specification investment: formal methods are expensive to apply and require specialised expertise, but they provide uniquely strong guarantees where applied. The key engineering decision is not “formal or informal” but “which components justify formal treatment.” In practice, this means formal specification is appropriate for:

  • Safety-critical decision components (medical, automotive, financial risk)
  • Distributed infrastructure where race conditions and consistency guarantees matter
  • Components subject to adversarial input (security-critical ML)
  • Fairness-critical classification decisions (employment, credit, criminal justice)

For the remaining 80-90% of AI system components, semi-formal approaches — Model Cards, Datasheets, BDD scenarios — provide a strong quality foundation at manageable cost. Natural language specifications retain value as stakeholder communication artefacts and living documentation, provided they are consistently maintained and recognisably distinct from the verifiable specifications they accompany.

8. The Specification-Execution Gap in AI vs. Traditional Software

A fundamental difference between specifying traditional software and specifying AI systems deserves explicit attention: the specification-execution gap. In traditional software, a correct implementation of a correct specification produces correct behaviour. In ML systems, even a correctly implemented model trained on clean data may violate specifications due to distributional mismatch between training and deployment environments.

This means that AI specifications must include not only what the system should do but also the conditions under which it should be trusted to do it. Amershi et al.’s Software Engineering for Machine Learning study [22] identified this as one of the most significant challenges practitioners face: specifications that are technically correct at training time become progressively invalid as deployment conditions evolve.

The implication for specification practice is clear: AI specifications must include explicit scope conditions (the distributional assumptions under which specifications hold), drift detection specifications (how to detect when scope conditions are violated), and update specifications (the conditions under which the system must be retrained or deprecated). Without these, specifications are snapshots masquerading as contracts.
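A drift-detection specification of this kind can itself be made executable. The sketch below computes a two-sample Kolmogorov–Smirnov statistic over a monitored feature and maps it through a severity-to-response protocol; the severity thresholds and response names are illustrative choices, not standard values:

```python
def ks_statistic(ref, live):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    all_vals = sorted(set(ref) | set(live))
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    return max(abs(ecdf(ref, v) - ecdf(live, v)) for v in all_vals)

def drift_response(stat, minor=0.1, major=0.25, critical=0.5):
    """Severity protocol: map a drift statistic to an operational response."""
    if stat >= critical:
        return "deprecate_or_override"
    if stat >= major:
        return "retrain"
    if stat >= minor:
        return "recalibrate"
    return "in_scope"

reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # feature values at training time
shifted   = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1]   # feature values in deployment

print(drift_response(ks_statistic(reference, reference)))  # → in_scope
print(drift_response(ks_statistic(reference, shifted)))    # → deprecate_or_override
```

Wiring such a check into monitoring is what converts a scope condition from prose into an enforced boundary of the specification's validity.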

graph LR
    A[Static Specification\n'The model shall achieve F1 ≥ 0.85'] --> B{Deployment\nCondition Check}
    B -->|In-distribution| C[Specification\nValid ✓]
    B -->|Distribution Shift| D[Specification\nPotentially Invalid ⚠️]
    D --> E[Drift Detection\nSpecification Triggered]
    E --> F{Severity?}
    F -->|Minor| G[Recalibration\nProtocol]
    F -->|Major| H[Retraining\nRequired]
    F -->|Critical| I[System Deprecation\nor Human Override]

    style A fill:#1a365d,color:#fff
    style C fill:#276749,color:#fff
    style D fill:#744210,color:#fff
    style I fill:#7b1f1f,color:#fff

Figure 6: Dynamic specification compliance — AI specifications must encode conditions of validity and responses to specification drift, not just static performance targets.

9. Practical Recommendations

Based on the survey above, the following practical recommendations emerge for teams seeking to elevate their AI specification practice:

9.1 Start with the Regulatory Minimum

Teams operating in regulated contexts (healthcare, finance, employment, public services in the EU) should begin by implementing Model Cards and Datasheets as their baseline specification artefacts. These are well-tooled, widely understood, and directly aligned with emerging regulatory requirements. The cost is low relative to the compliance risk of proceeding without them.

9.2 Instrument BDD for Behavioural Regression

BDD scenario libraries, when automated and run as part of the continuous integration pipeline, provide a lightweight form of specification compliance checking. Teams should maintain a curated suite of behavioural tests that directly reflect key specification claims — particularly around fairness, robustness, and minimum performance thresholds. Ribeiro et al.’s CheckList methodology [4] provides a structured template for building such suites.
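In the spirit of CheckList's test types, the sketch below shows a minimum functionality test (MFT) and an invariance test (INV) as plain CI-runnable functions. The `predict_sentiment` stub is a hypothetical stand-in for a real model endpoint; the tests encode specification claims, not the model itself.

```python
def predict_sentiment(text: str) -> str:
    """Toy stand-in model: keyword lookup in place of a real classifier."""
    return "positive" if "great" in text.lower() else "negative"

def test_minimum_functionality() -> None:
    """MFT: unambiguous inputs must receive the expected label."""
    assert predict_sentiment("The service was great") == "positive"
    assert predict_sentiment("The service was awful") == "negative"

def test_invariance_to_irrelevant_change() -> None:
    """INV: a perturbation that does not change meaning (here, casing)
    must not flip the prediction."""
    base = "The food was GREAT"
    assert predict_sentiment(base) == predict_sentiment(base.lower())
```

Run under pytest or any test runner on every model release, each failing test names the exact specification claim the new model violates, which is precisely the behavioural-regression signal this section argues for.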

9.3 Apply Formal Methods Proportionally

Resist the temptation to either dismiss formal methods as impractical or to over-apply them uniformly. A risk-proportionate approach — formal specification and verification for the specific components where failures are catastrophic, semi-formal for the rest — is both more economically sustainable and more intellectually honest about where formal guarantees actually matter.

9.4 Treat Specifications as Living Documents

AI system specifications must be versioned alongside models and data artefacts, with explicit change management processes. A specification that describes an outdated model version is not merely useless — it is actively misleading to auditors, users, and the engineering team itself. Integrate specification updates into the model release process as a mandatory step, not an afterthought.
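One way to make "mandatory step, not an afterthought" concrete is a release gate that compares the specification's manifest against the model's. The manifest keys below are hypothetical; a real pipeline would read them from a model registry and a versioned spec repository.

```python
def spec_release_gate(model_manifest: dict, spec_manifest: dict) -> list:
    """Return blocking issues when the specification lags the model release.

    Keys ("version", "dataset_hash", "model_version") are illustrative —
    substitute whatever identifiers your registry actually records.
    """
    issues = []
    if spec_manifest.get("model_version") != model_manifest.get("version"):
        issues.append("spec describes a different model version")
    if spec_manifest.get("dataset_hash") != model_manifest.get("dataset_hash"):
        issues.append("spec references stale training data")
    return issues  # empty list → release may proceed
```

Failing the build on a non-empty issue list turns the stale-specification hazard described above into a mechanically enforced invariant.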

9.5 Include Scope and Drift Conditions Explicitly

Every AI specification should include a section explicitly documenting the distributional assumptions under which specifications hold, the monitoring signals that indicate specification boundary violations, and the operational response protocols for each category of violation. Without this, specifications are incomplete by construction.
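The three elements named above — distributional assumptions, monitoring signals, and response protocols — can be expressed as a machine-checkable spec fragment. The structure and values below are hypothetical, illustrating the shape of such a section rather than a published schema.

```python
# Illustrative scope-condition section of an AI specification.
# Field names, signals, and thresholds are hypothetical examples.
SCOPE_SPEC = {
    "distributional_assumptions": [
        "inputs are adult chest X-rays from scanner types A and B",
        "positive-class prevalence between 2% and 15%",
    ],
    "monitoring_signals": {
        "input_psi": {"threshold": 0.25},
        "prevalence": {"min": 0.02, "max": 0.15},
    },
    "response_protocols": {
        "input_psi": "trigger retraining review",
        "prevalence": "escalate to clinical owner",
    },
}

def validate_scope_spec(spec: dict) -> list:
    """Completeness check: every monitored signal must map to a
    response protocol, otherwise the spec is incomplete by construction."""
    return [signal for signal in spec["monitoring_signals"]
            if signal not in spec["response_protocols"]]
```

The validator enforces exactly the completeness criterion this section states: a signal without a response protocol is a specification gap, and it can be caught in CI rather than in production.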

10. Conclusion

The landscape of AI specification languages is richer and more mature than many practitioners recognise. From the elegant brevity of Gherkin scenarios to the mathematical rigour of temporal logic, tools exist for every point on the formality spectrum and every specification challenge AI systems present. The field’s maturation is further accelerated by regulatory pressures — ISO 42001, NIST AI RMF, and the EU AI Act together constitute the most significant push toward systematic AI specification in the technology’s history.

The central argument of this article is that effective AI specification is a multi-layered discipline requiring deliberate selection of tools matched to audience, scope, and risk. Teams that treat specification as a binary choice — “formal or informal,” “rigorous or pragmatic” — will consistently end up with the wrong tool for the job. Teams that develop fluency across the full spectrum will build AI systems that are not only better engineered, but demonstrably so — to their own teams, their organisations, their regulators, and ultimately their users.

In the articles that follow in this series, we will build on this foundation — examining how AI-specific requirements are captured, how development paradigms compare in practice, and how spec-driven principles translate into concrete architectural and tooling choices. The specification layer is not overhead. It is the difference between AI systems that are trusted and those that merely function.


References

  1. Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28. https://proceedings.neurips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
  2. Smart, J. F. (2014). BDD in Action: Behavior-Driven Development for the Whole Software Lifecycle. Manning Publications.
  3. Kuchenbecker, S. (2018). Test-driven machine learning. Proceedings of the International Conference on Software Engineering. https://doi.org/10.1145/3183377.3183402
  4. Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. Proceedings of ACL 2020. https://doi.org/10.18653/v1/2020.acl-main.442
  5. Barr, E. T., Harman, M., McMinn, P., Shahbaz, M., & Yoo, S. (2015). The oracle problem in software testing: A survey. IEEE Transactions on Software Engineering, 41(5), 507–525. https://doi.org/10.1109/TSE.2014.2372785
  6. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3287560.3287596
  7. European Commission. (2021). Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
  8. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé, H. III, & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723
  9. Holland, S., Hosny, A., Newman, S., Joseph, J., & Chmielinski, K. (2020). The dataset nutrition label: A framework to drive higher data quality standards. SSRN. https://doi.org/10.2139/ssrn.3280453
  10. Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML test score: A rubric for ML production readiness and technical debt reduction. Proceedings of IEEE International Conference on Big Data. https://doi.org/10.1109/BigData.2017.8258038
  11. Ashmore, R., Calinescu, R., & Paterson, C. (2021). Assuring the machine learning lifecycle: Desiderata, methods, and challenges. ACM Computing Surveys, 54(5). https://doi.org/10.1145/3453444
  12. Rushby, J. (1995). Formal methods and the certification of critical systems. Technical Report SRI-CSL-93-7. SRI International. https://www.csl.sri.com/papers/csl-93-7/
  13. Luckcuck, M., Farrell, M., Dennis, L. A., Dixon, C., & Fisher, M. (2019). Formal specification and verification of autonomous robotic systems: A survey. ACM Computing Surveys, 52(5). https://doi.org/10.1145/3342355
  14. Lamport, L. (2002). Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley. https://doi.org/10.5555/579617
  15. Newcombe, C., Rath, T., Zhang, F., Munteanu, B., Brooker, M., & Deardeuff, M. (2015). How Amazon web services uses formal methods. Communications of the ACM, 58(4), 66–73. https://doi.org/10.1145/2699417
  16. Jackson, D. (2012). Software Abstractions: Logic, Language, and Analysis (2nd ed.). MIT Press. https://doi.org/10.7551/mitpress/8153.001.0001
  17. Seshia, S. A., Sadigh, D., & Sastry, S. S. (2018). Formal specification for deep neural networks. Proceedings of the International Symposium on Automated Technology for Verification and Analysis. https://doi.org/10.1007/978-3-030-01090-4_1
  18. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142. https://doi.org/10.1145/1968.1972
  19. Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. https://doi.org/10.1007/b106715
  20. Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. Proceedings of the Theory of Cryptography Conference. https://doi.org/10.1007/11681878_14
  21. NIST. (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1. https://doi.org/10.6028/NIST.AI.100-1
  22. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software engineering for machine learning: A case study. Proceedings of ICSE-SEIP 2019. https://doi.org/10.1109/ICSE-SEIP.2019.00042
  23. Wing, J. M. (1990). A specifier’s introduction to formal methods. Computer, 23(9), 8–23. https://doi.org/10.1109/2.58215
  24. Zhang, J., Chen, B., Hao, D., Liu, Y., Liu, L., & Hao, L. (2020). Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering. https://doi.org/10.1109/TSE.2019.2962027
  25. Sun, Y., Huang, X., Kroening, D., Sharp, J., Hill, M., & Beard, R. (2018). Testing deep neural networks. arXiv preprint. https://arxiv.org/abs/1803.04792
  26. ISO/IEC. (2023). ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system. International Organization for Standardization.
  27. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org
  28. Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
  29. Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V. X., Doshi-Velez, F., Jung, K., Heller, K., Kale, D., Saeed, M., Ossorio, P. N., Thadaney-Israni, S., & Goldenberg, A. (2019). Do no harm: A roadmap for responsible machine learning for health care. Nature Medicine, 25, 1337–1340. https://doi.org/10.1038/s41591-019-0548-6
  30. Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint. https://arxiv.org/abs/1606.06565
  31. Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1, 389–399. https://doi.org/10.1038/s42256-019-0088-2
  32. Rajan, H., & Bhatt, D. (2019). Formal specifications for machine learning. IEEE Design & Test, 36(5), 39–46. https://doi.org/10.1109/MDAT.2019.2895401

About the Author: Oleh Ivchenko is an Innovation Tech Lead, ML Scientist, and PhD Candidate in Economic Cybernetics, specialising in ML applications in enterprise systems and decision support. This article represents the author’s personal research and opinion; any resemblance to specific organisations or proprietary methodologies is coincidental.

Disclaimer: This is a preprint research article, not peer-reviewed. Content represents the author’s synthesis of publicly available research. This article was written with AI assistance and reflects original analysis and editorial judgment. Not professional legal, regulatory, or technical advice.
