
The Spec-First Revolution
Oleh Ivchenko. (2026). The Spec-First Revolution: Why Enterprise AI Needs Formal Specifications. Spec-Driven AI Development Series. Odessa National Polytechnic University.
DOI: 10.5281/zenodo.18666032
Abstract
The rapid adoption of artificial intelligence in enterprise systems has revealed a critical gap between traditional software engineering practices and AI development methodologies. While conventional software development has long relied on formal specifications to ensure quality, maintainability, and compliance, AI projects often proceed in an experiment-driven manner that prioritizes rapid prototyping over systematic design. This article introduces spec-driven AI development—a paradigm that brings rigorous specification practices to machine learning systems—and argues for its necessity in enterprise contexts. Drawing on my 14 years in software engineering and 7 years specializing in enterprise AI, I explore the historical evolution from traditional software specifications to AI-specific formal methods, articulate core principles of the spec-first approach, and compare it with prevailing ad-hoc and experiment-driven methodologies. Through technical analysis and practical examples, I demonstrate that spec-driven development is not merely an academic ideal but a business imperative for organizations deploying AI at scale.

Keywords: specification-driven development, formal methods, AI engineering, requirements engineering, software quality, MLOps

1. Introduction: The Specification Crisis in AI Development
In my seven years working with enterprise AI systems, I’ve witnessed a recurring pattern: AI projects that begin with impressive proof-of-concept results often struggle—or fail entirely—when transitioning to production. The culprit is rarely technical sophistication. Instead, the problem lies in a fundamental disconnect between how we build traditional software and how we build AI systems.

Traditional software engineering has evolved over decades to rely on specifications—formal or semi-formal descriptions of what a system should do, how it should behave under various conditions, and what constraints it must satisfy. These specifications serve as contracts between stakeholders, blueprints for implementation, and foundations for testing and validation [1, 2].

Yet when I observe AI teams, I often see them operating without comparable artifacts. Instead, they follow an experiment-driven approach: try a model, evaluate metrics, iterate. This methodology, borrowed from academic research, works well in controlled settings but creates serious challenges in enterprise environments where regulatory compliance, safety, explainability, and long-term maintenance are non-negotiable [3, 4].

The transition from experiment-driven to spec-driven AI development represents more than a methodological refinement—it’s a paradigm shift comparable to the move from ad-hoc programming to structured programming in the 1970s [5]. This article articulates why this shift is necessary, what it entails, and how it differs from current practices.

1.1. The Current State: Experiment-Driven AI
Let me illustrate with a scenario I’ve encountered multiple times. A data science team receives a business problem: “We need to predict customer churn.” They immediately dive into exploratory data analysis, experiment with various algorithms (logistic regression, random forests, gradient boosting, neural networks), tune hyperparameters, and eventually present a model with 87% accuracy on a test set. Management approves, and the model goes to production. Six months later, problems emerge:

– The model performs poorly on a customer segment that wasn’t well-represented in training data
– It violates a regulatory requirement about protected attributes that wasn’t explicitly checked
– Its predictions drift as customer behavior changes, but there’s no specification defining acceptable drift bounds
– When a competitor launches a new product, the model fails catastrophically because this scenario was never specified

This isn’t a failure of the data scientists—they followed standard practices. The failure is systemic: the absence of specifications meant there was no clear contract defining what “correct” behavior looks like beyond aggregate metrics [6, 7].

```mermaid
graph TD
    A[Business Problem] --> B[Exploratory Data Analysis]
    B --> C[Model Experimentation]
    C --> D[Metric Optimization]
    D --> E{Metrics Good?}
    E -->|No| C
    E -->|Yes| F[Deploy]
    F --> G[Production Issues]
    G --> H[Reactive Fixes]
    H --> C
    style G fill:#ff6b6b
    style H fill:#ff6b6b
    classDef problemNode fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px
```
Figure 1: The experiment-driven AI development cycle. Problems emerge in production, leading to reactive fixes without addressing root causes—the absence of formal specifications.
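The churn scenario above can be made concrete: an aggregate accuracy check passes while the per-segment check that a specification would mandate fails. A minimal sketch with toy numbers (the segment names, data, and 0.8 thresholds are illustrative, not from the article):

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(preds)

# Toy predictions: the model does well overall but poorly on one segment.
data = {
    "retail":     ([1, 1, 0, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0, 1, 0]),  # 8/8 correct
    "enterprise": ([1, 0, 0, 1],             [0, 1, 0, 1]),              # 2/4 correct
}

all_preds  = [p for preds, _ in data.values() for p in preds]
all_labels = [l for _, labels in data.values() for l in labels]

# Aggregate check (experiment-driven): passes at 10/12 ≈ 0.83.
assert accuracy(all_preds, all_labels) >= 0.8

# Per-segment check (spec-driven): the underrepresented segment fails.
violations = {
    seg: accuracy(preds, labels)
    for seg, (preds, labels) in data.items()
    if accuracy(preds, labels) < 0.8
}
print(violations)  # the enterprise segment violates the 0.8 floor
```

The aggregate metric hides exactly the failure mode described above; a specification that names the segments and sets a per-segment floor makes the gap visible before deployment.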
1.2. Why Specifications Matter for AI
Specifications serve multiple critical functions in software systems [8, 9]:

1. Communication: They establish a shared understanding among stakeholders—business owners, domain experts, developers, testers, and regulators.
2. Validation: They provide criteria for determining whether a system is correct, enabling systematic testing rather than ad-hoc evaluation.
3. Verification: They enable formal or semi-formal proofs that implementations satisfy requirements.
4. Documentation: They create lasting artifacts that support maintenance, evolution, and knowledge transfer.
5. Contractual clarity: They define responsibilities and acceptance criteria, reducing disputes and ambiguity.

For traditional software, these functions are well-established. For AI systems, they’re equally critical but substantially more challenging because:

– AI systems learn from data rather than being explicitly programmed [10]
– Their behavior is probabilistic rather than deterministic [11]
– They can exhibit emergent behaviors not present in training data [12]
– They degrade over time as the world changes (concept drift) [13, 14]
– Their reasoning is often opaque, making it hard to verify compliance with specifications [15]

Despite these challenges—or rather, because of them—specifications are even more essential for AI than for traditional software. Without them, we have no principled way to define, verify, or maintain correct AI behavior [16, 17].

2. Historical Evolution: From Software Specs to AI Specs
To understand where we need to go with AI specifications, it’s valuable to trace where we’ve been with software specifications more broadly.

2.1. Traditional Software Specification Practices
The history of software specifications parallels the history of software engineering itself. Early programming was ad-hoc, but as systems grew in complexity and criticality, formal methods emerged [18, 19].

Pre-formal era (1950s-1960s): Software was specified informally through natural language descriptions and diagrams. Testing was manual and often incomplete.

Formal methods emergence (1970s-1980s): Pioneers like Dijkstra, Hoare, and Liskov introduced formal specification languages and proof systems. Hoare logic enabled reasoning about program correctness [20]. The Vienna Development Method (VDM), Z notation, and B-Method provided mathematical foundations for specification [21, 22].

Design-by-contract (1990s): Bertrand Meyer’s Eiffel language popularized preconditions, postconditions, and invariants as first-class specification artifacts [23]. This made formal methods more accessible to practitioners.

Agile and lightweight specifications (2000s): Behavior-Driven Development (BDD) and Test-Driven Development (TDD) brought specification practices into mainstream agile development, emphasizing executable specifications and continuous validation [24, 25].

```mermaid
timeline
    title Evolution of Software Specification Practices
    1950s-1960s : Ad-hoc specifications
                : Natural language
                : Manual testing
    1970s-1980s : Formal methods
                : Hoare logic, VDM, Z notation
                : Mathematical proofs
    1990s : Design-by-contract
          : Preconditions, postconditions
          : Eiffel, JML
    2000s : Agile specifications
          : BDD, TDD
          : Executable specs
    2010s : Model-based testing
          : Property-based testing
          : Continuous validation
    2020s : AI specification frameworks
          : Model cards, datasheets
          : Fairness/safety specs
```
Figure 2: Timeline showing the evolution of software specification practices, culminating in emerging AI specification frameworks.
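The design-by-contract idea from the 1990s era above translates readily outside Eiffel. A minimal Python sketch of runtime-checked preconditions and postconditions (the `contract` decorator and example function are illustrative, not an established library API):

```python
import functools

def contract(pre=None, post=None):
    """Attach a precondition and a postcondition to a function,
    in the spirit of Eiffel's design-by-contract."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if pre is not None:
                assert pre(*args, **kwargs), "precondition violated"
            result = fn(*args, **kwargs)
            if post is not None:
                assert post(result), "postcondition violated"
            return result
        return wrapper
    return decorate

@contract(pre=lambda xs: len(xs) > 0, post=lambda r: r >= 0)
def mean_absolute(xs):
    """Mean of absolute values; only defined for non-empty input."""
    return sum(abs(x) for x in xs) / len(xs)

print(mean_absolute([-2, 4]))  # 3.0; mean_absolute([]) fails the precondition
```

The contract is checked on every call, which is the key shift this history traces: the specification lives next to the code and is executable, not a separate prose document.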
2.2. The AI Specification Gap
The deep learning revolution that began in 2012 with AlexNet [26] created a specification gap. Suddenly, systems that learned complex functions from data were achieving superhuman performance on specific tasks, but the specification practices that had evolved for traditional software didn’t translate cleanly [27]. I’ve observed three factors contributing to this gap:

1. Cultural divide: AI development emerged largely from academic machine learning research, where the goal is often to advance the state-of-the-art on benchmark datasets. Software engineering’s emphasis on specifications, testing, and maintenance was less central [28].
2. Technical challenges: Specifying learned behavior is fundamentally harder than specifying programmed behavior. How do you write a formal specification for “recognize cats in images” that’s both precise and implementable? [29]
3. Rapid evolution: The field has been moving so quickly that practitioners prioritize keeping up with new architectures and techniques over establishing robust engineering practices [30].

2.3. Emerging AI Specification Frameworks
The last few years have seen growing recognition of the specification gap and early efforts to address it. Several frameworks have emerged [31, 32]:

Model Cards [33]: Developed by Google researchers, model cards document the intended use, training data, evaluation metrics, and known limitations of machine learning models. They’re analogous to datasheets in electronics—standardized documentation that enables informed usage.

Datasheets for Datasets [34]: These document the motivation, composition, collection process, and recommended uses for datasets, addressing the fact that data quality fundamentally determines model behavior.

FactSheets [35]: IBM’s framework for transparent AI documentation, covering multiple dimensions including fairness, explainability, and robustness.

Fairness specifications [36, 37]: Formal definitions of fairness (demographic parity, equalized odds, etc.) that can be mathematically verified and enforced during training and deployment.

Robustness specifications [38, 39]: Formal guarantees about model behavior under distribution shift, adversarial perturbations, or out-of-distribution inputs.

These are important steps, but they’re still early. Most AI systems in production lack comprehensive specifications, and many organizations don’t yet have processes for creating and maintaining them [40].

3. Core Principles of Spec-Driven AI Development
Having established the need for AI specifications and their historical context, I now articulate the core principles that define spec-driven AI development. These principles, distilled from my experience deploying AI in regulated industries, provide a framework for systematic AI engineering.

3.1. Specification Before Implementation
Principle: Define what the system should do before building it.

This seems obvious—it’s standard practice in traditional software. Yet in AI projects, I routinely see teams diving into model development before clearly articulating requirements. The spec-first approach demands that we:

1. Document the problem we’re solving and why
2. Identify stakeholders and their requirements
3. Specify functional behavior (what the system should predict/decide)
4. Specify non-functional requirements (performance, fairness, robustness, latency, etc.)
5. Define test scenarios and acceptance criteria

Only after these specifications exist should implementation begin [41, 42].

```mermaid
graph LR
    A[Requirements Gathering] --> B[Functional Specification]
    B --> C[Non-Functional Specification]
    C --> D[Test Specification]
    D --> E[Implementation]
    E --> F[Validation Against Specs]
    F --> G{Meets Specs?}
    G -->|No| E
    G -->|Yes| H[Deployment]
    style A fill:#4ecdc4
    style B fill:#4ecdc4
    style C fill:#4ecdc4
    style D fill:#4ecdc4
    style F fill:#ffe66d
    classDef specPhase fill:#4ecdc4,stroke:#2a9d8f,stroke-width:2px
    classDef validationPhase fill:#ffe66d,stroke:#f4a261,stroke-width:2px
```
Figure 3: Spec-driven development workflow. Specification activities (blue) precede implementation, and validation (yellow) explicitly checks compliance with specifications.
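The five spec-first steps can be captured as a machine-readable artifact that exists before any model code. A minimal sketch of such a specification record (the `ModelSpec` class, field names, and thresholds are illustrative assumptions, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ModelSpec:
    """A lightweight specification written and agreed upon before implementation."""
    problem: str            # what we are solving and why (step 1)
    stakeholders: tuple     # who has requirements on the system (step 2)
    functional: dict        # what the system predicts/decides (step 3)
    non_functional: dict    # performance, fairness, latency, ... (step 4)
    acceptance: dict = field(default_factory=dict)  # testable criteria (step 5)

churn_spec = ModelSpec(
    problem="Predict 90-day customer churn to target retention offers",
    stakeholders=("marketing", "compliance", "data science"),
    functional={"input": "customer activity features",
                "output": "churn probability in [0, 1]"},
    non_functional={"latency_ms": 50, "max_fairness_gap": 0.05},
    acceptance={"min_recall": 0.90, "min_precision": 0.85},
)

# Implementation starts only once a record like this exists.
print(churn_spec.acceptance)
```

Freezing the dataclass makes the spec immutable once agreed; changing it requires creating a new, versioned instance, which anticipates the change-management discipline discussed later in the article.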
3.2. Explicit Uncertainty Quantification
Principle: Specifications must account for the inherent uncertainty in AI systems.

Unlike traditional software that (ideally) behaves deterministically, AI systems are probabilistic. Specifications must acknowledge this [43, 44]:

– Rather than “the system shall classify emails correctly,” specify “the system shall classify spam with precision ≥ 0.95 and recall ≥ 0.90 on the validation distribution”
– Rather than “the system shall be fair,” specify “the false positive rate for protected groups shall not differ by more than 5 percentage points”
– Rather than “the system shall be robust,” specify “accuracy shall remain above 85% when input features are perturbed by up to 10%”

These specifications transform vague requirements into testable criteria [45].

3.3. Behavioral Specification Over Algorithmic Specification
Principle: Specify what the system should do, not how it should do it.

This is a key distinction from traditional specifications. In classical software, we might specify an algorithm (e.g., “use quicksort”). For AI, we typically specify behavior and constraints, leaving the choice of architecture, algorithm, and training procedure to the implementation phase [46]. For example, for a credit risk model:

Behavioral specification:
– Input: Applicant financial data (income, debts, credit history, employment)
– Output: Risk score [0, 1] and approval decision {approve, reject}
– Constraint 1: Rejection rate for protected groups must satisfy demographic parity within 5%
– Constraint 2: False positive rate (rejecting a good applicant) must be < 15%
– Constraint 3: Model must provide explanations with feature importance scores
– Constraint 4: Performance must not degrade more than 5% over 6 months

Not specified: Whether to use logistic regression, gradient boosting, or neural networks. That’s an implementation detail [47].

3.4. Continuous Validation
Principle: Specifications must be continuously validated throughout the system lifecycle.

AI systems change over time even if the code doesn’t, because the world changes [48]. Spec-driven AI therefore requires:

– Pre-deployment validation: Comprehensive testing against specifications before release
– Deployment-time validation: Smoke tests confirming the deployed model meets specifications
– Post-deployment monitoring: Continuous evaluation of whether production behavior satisfies specifications
– Triggered revalidation: Automatic retraining and revalidation when specification violations are detected

```mermaid
graph TD
    A[Specification] --> B[Implementation]
    B --> C[Pre-deployment Testing]
    C --> D{Meets Specs?}
    D -->|No| B
    D -->|Yes| E[Deployment]
    E --> F[Production Monitoring]
    F --> G{Still Meets Specs?}
    G -->|Yes| F
    G -->|No| H[Alert]
    H --> I[Root Cause Analysis]
    I --> J{Fix Required?}
    J -->|Data Drift| K[Retrain]
    J -->|Spec Violation| B
    J -->|Spec Update Needed| A
    K --> C
    style D fill:#ffe66d
    style G fill:#ffe66d
    style H fill:#ff6b6b
    classDef decision fill:#ffe66d,stroke:#f4a261,stroke-width:2px
    classDef alert fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px
```
Figure 4: Continuous validation lifecycle. Specifications are validated pre-deployment, at deployment, and continuously in production. Violations trigger analysis and appropriate responses.
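The monitoring stage of this lifecycle reduces to a periodic check of production metrics against the spec, with violations routed to a response. A minimal sketch (the thresholds, metric names, and response strings are illustrative assumptions):

```python
SPEC = {"min_accuracy": 0.85, "max_drift": 0.10}

def check_compliance(metrics, spec=SPEC):
    """Return the list of specification violations for one monitoring window."""
    violations = []
    if metrics["accuracy"] < spec["min_accuracy"]:
        violations.append("accuracy below specified floor")
    if metrics["drift"] > spec["max_drift"]:
        violations.append("drift exceeds specified bound")
    return violations

def respond(violations):
    """Map violations to the responses sketched in Figure 4."""
    if not violations:
        return "continue monitoring"
    if any("drift" in v for v in violations):
        return "trigger retraining and revalidation"
    return "alert for root-cause analysis"

print(respond(check_compliance({"accuracy": 0.91, "drift": 0.03})))
print(respond(check_compliance({"accuracy": 0.88, "drift": 0.14})))
```

In practice these checks would run on a schedule against live metrics, but the essential property holds even in this sketch: the spec, not an engineer's intuition, decides when the model is no longer acceptable.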
3.5. Traceability and Auditability
Principle: Maintain bidirectional traceability between requirements, specifications, implementations, and validation results.

In regulated industries—finance, healthcare, automotive—this isn’t optional [49, 50]. But I argue it should be standard practice everywhere. Traceability enables:

– Impact analysis: When requirements change, identify affected specifications and implementations
– Compliance demonstration: Show regulators or auditors that requirements are met
– Root cause analysis: When failures occur, trace back to specification or implementation gaps
– Knowledge preservation: Maintain institutional memory as teams change

Modern MLOps platforms increasingly support this through experiment tracking, model registries, and metadata management [51, 52].

3.6. Multi-Stakeholder Specifications
Principle: Specifications must incorporate requirements from all stakeholders, not just technical metrics.

AI systems impact diverse groups—end users, business owners, compliance officers, affected populations. Spec-driven development requires eliciting and balancing requirements from all stakeholders [53, 54]:

– Business: Revenue impact, cost constraints, time-to-market
– Users: Usability, transparency, ability to contest decisions
– Domain experts: Scientific validity, clinical accuracy, domain-specific constraints
– Legal/Compliance: Regulatory requirements, liability concerns, privacy
– Ethics: Fairness, bias mitigation, societal impact
– Operations: Maintainability, monitoring, incident response

In practice, I’ve found that creating multi-dimensional scorecards that explicitly track all stakeholder requirements—not just accuracy—is essential for successful AI deployment.

4. Spec-Driven vs. Experiment-Driven: A Detailed Comparison
Having articulated the principles of spec-driven AI, I now contrast it systematically with experiment-driven development—the prevailing approach in many organizations.

4.1. Development Process
Experiment-Driven:
1. Receive problem statement
2. Acquire/explore data
3. Try various models
4. Optimize metrics
5. Deploy best performer
6. React to production issues

Spec-Driven:
1. Elicit stakeholder requirements
2. Create formal specifications
3. Design validation strategy
4. Acquire/explore data in context of specifications
5. Implement to meet specifications
6. Validate against specifications
7. Deploy with monitoring for spec compliance
8. Proactively maintain specifications

The key difference: specifications guide every phase, rather than metrics alone driving the process [55].

4.2. Success Criteria
Experiment-Driven: Success is typically measured by performance on a held-out test set (accuracy, AUC, F1, etc.). If the model achieves good numbers, it’s deemed successful [56].

Spec-Driven: Success is multidimensional and explicitly defined. A model isn’t successful unless it meets:

– Performance specifications (accuracy, latency, throughput)
– Fairness specifications
– Robustness specifications
– Explainability specifications
– Regulatory compliance specifications
– Business value specifications

A model with 95% accuracy that violates fairness specifications is not successful [57, 58].

```mermaid
graph TD
    subgraph "Experiment-Driven Success Criteria"
        A1[High Test Accuracy] --> A2{Success?}
        A2 -->|Yes| A3[Deploy]
    end
    subgraph "Spec-Driven Success Criteria"
        B1[Performance Specs] --> B7{All Specs Met?}
        B2[Fairness Specs] --> B7
        B3[Robustness Specs] --> B7
        B4[Explainability Specs] --> B7
        B5[Compliance Specs] --> B7
        B6[Business Specs] --> B7
        B7 -->|Yes| B8[Deploy]
        B7 -->|No| B9[Iterate]
    end
    style A2 fill:#ffe66d
    style B7 fill:#ffe66d
    classDef decision fill:#ffe66d,stroke:#f4a261,stroke-width:2px
```
Figure 5: Comparison of success criteria. Experiment-driven approaches typically focus on a single metric, while spec-driven approaches evaluate multiple dimensions explicitly.
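The spec-driven gate in Figure 5 is a conjunction: every dimension must pass. A minimal sketch in which high accuracy does not rescue a fairness violation (the dimension names, metric keys, and thresholds are illustrative assumptions):

```python
# Each specification dimension is a named, testable predicate over metrics.
SPECS = {
    "performance": lambda m: m["accuracy"] >= 0.90,
    "fairness":    lambda m: m["fpr_gap"] <= 0.05,  # gap in false positive rates
    "latency":     lambda m: m["p99_latency_ms"] <= 100,
}

def deployable(metrics):
    """A model deploys only if every specification dimension is satisfied."""
    failed = [name for name, check in SPECS.items() if not check(metrics)]
    return (len(failed) == 0, failed)

# 95% accuracy, but the fairness spec is violated: not successful.
ok, failed = deployable({"accuracy": 0.95, "fpr_gap": 0.09, "p99_latency_ms": 40})
print(ok, failed)  # False ['fairness']
```

Returning the names of the failed dimensions, rather than a bare boolean, is what turns the gate into an actionable iterate-and-fix signal rather than a simple reject.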
4.3. Risk Management
Experiment-Driven: Risks are often discovered reactively, after deployment. Teams may not systematically consider edge cases, distribution shift, or adversarial scenarios unless they’ve been burned by them before [59].

Spec-Driven: Risks are identified proactively during specification. Requirements gathering explicitly asks: “What could go wrong? What edge cases matter? What happens when the world changes?” These scenarios become part of the specification and are systematically tested [60, 61].

4.4. Maintenance and Evolution
Experiment-Driven: When models degrade or new requirements emerge, teams often retrain or modify models without a clear framework for deciding what constitutes acceptable change. “The new model has better accuracy” becomes the justification for replacement, even if it fails on critical edge cases [62].

Spec-Driven: Specifications provide a stable contract. Models can be updated as long as they continue to satisfy specifications—or specifications can be deliberately updated with versioning and impact analysis. This enables systematic evolution rather than ad-hoc changes [63].

4.5. When Each Approach Makes Sense
I’m not arguing that spec-driven development is universally superior. The appropriate approach depends on context [64]:

Experiment-driven makes sense when:
– You’re doing exploratory research
– Requirements are genuinely unknown
– The cost of failure is low
– Rapid iteration and learning are paramount
– You’re in the “playground” phase of a project

Spec-driven is essential when:
– The system will impact people’s lives (healthcare, finance, hiring, criminal justice)
– Regulatory compliance is required
– Long-term maintenance is necessary
– Multiple teams or stakeholders are involved
– The cost of failure is high
– Explainability and accountability matter

In my experience, many AI projects begin in experiment-driven mode but transition to production contexts where spec-driven practices become necessary. Organizations that recognize this transition and adapt their processes accordingly are far more successful [65].

5. Challenges and Open Problems
Despite its benefits, spec-driven AI development faces real challenges. In this section, I candidly address the difficulties I’ve encountered and where the field needs to mature.

5.1. The Specification Problem for Learned Behavior
The fundamental challenge: how do you write precise specifications for behavior that’s learned from examples rather than explicitly programmed? [66]

For traditional software: “The login function shall accept a username and password, query the authentication database, and return true if credentials match, false otherwise.” This is precise and verifiable.

For AI: “The image classifier shall correctly identify objects in images.” What does “correctly” mean? Correctly according to whom? What about ambiguous cases where experts disagree?

Current approaches include:
– Example-based specifications: Provide examples of correct behavior, but this doesn’t generalize to all cases [67]
– Property-based specifications: Specify properties that should hold (e.g., “prediction should be invariant to irrelevant feature changes”), but identifying the right properties is non-trivial [68]
– Probabilistic specifications: Specify statistical properties (e.g., “accuracy ≥ 90% on distribution D”), but this requires defining D precisely [69]

This remains an active research area, and I believe hybrid approaches combining multiple specification styles will be necessary [70].

5.2. Specification Overhead
Spec-driven development requires upfront investment. Writing comprehensive specifications takes time. For organizations used to moving fast and breaking things, this can feel like bureaucratic overhead [71].

My response: the overhead is real but justified. The cost of specifications is front-loaded, while the cost of their absence is back-loaded and often much larger. When AI systems fail in production—causing harm, violating regulations, or simply performing poorly—the cost of debugging, fixing, and redeploying far exceeds the cost of upfront specification [72].

That said, we need better tools to reduce specification overhead. Natural language processing and automated specification mining from requirements documents, codebases, and historical issues are promising directions [73, 74].

5.3. Evolving Specifications
Specifications shouldn’t be set in stone. As we learn more about the problem, as business needs change, and as the world evolves, specifications must adapt [75]. But uncontrolled specification changes can undermine their value.

The solution: treat specifications as versioned artifacts with change management processes. Document why specifications changed, analyze the impact on existing implementations, and explicitly revalidate [76]. This is standard practice in safety-critical systems engineering and should become standard in AI.

5.4. Validating Black-Box Models
Many modern AI models—especially large deep neural networks—are black boxes. Even if we have specifications, verifying that a model satisfies them can be computationally intractable [77, 78]. Approaches to this challenge include:

– Sampling-based validation: Extensively test on diverse scenarios, providing probabilistic confidence [79]
– Formal verification: For specific properties (e.g., local robustness), formal methods can provide guarantees, though they’re computationally expensive and limited in scope [80, 81]
– Interpretable-by-design models: Use architectures that are inherently more verifiable, accepting some performance trade-offs [82]

There’s no silver bullet, and different applications will require different validation strategies based on their risk profiles.

6. Toward a Spec-Driven Future
The AI field is maturing. The initial excitement of “AI can do amazing things!” is giving way to the sober realization that “AI is hard to deploy reliably at scale.” This is exactly the moment when engineering discipline becomes essential [83]. I envision a future where:

Specifications are first-class artifacts: Every production AI system has comprehensive, versioned specifications that are actively maintained alongside code and models [84].

Tools automate specification tasks: AI-assisted specification generation, automated test generation from specifications, and continuous validation dashboards become standard [85, 86].

Education emphasizes specifications: Data science and ML engineering programs teach specification practices as core competencies, not afterthoughts [87].

Regulations require specifications: As AI regulation evolves, formal specifications become a compliance requirement, driving adoption [88, 89].

Culture shifts from research to engineering: Organizations recognize that production AI is engineering, not research, and adopt appropriate practices [90].

```mermaid
graph LR
    A[Current State:<br/>Ad-hoc AI] --> B[Transition Period:<br/>Mixed Practices]
    B --> C[Mature State:<br/>Spec-Driven AI]
    A1[Few specs<br/>Metric-driven<br/>Reactive] -.-> A
    B1[Growing awareness<br/>Early frameworks<br/>Pilot projects] -.-> B
    C1[Universal specs<br/>Systematic validation<br/>Proactive maintenance] -.-> C
    style A fill:#ff6b6b
    style B fill:#ffe66d
    style C fill:#4ecdc4
    classDef problem fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px
    classDef transition fill:#ffe66d,stroke:#f4a261,stroke-width:2px
    classDef mature fill:#4ecdc4,stroke:#2a9d8f,stroke-width:2px
```
Figure 6: The evolution toward spec-driven AI. The field is currently transitioning from ad-hoc practices toward systematic, specification-driven approaches.
This isn’t fantasy. I see this evolution already beginning in regulated industries and safety-critical applications. The question is how quickly it will spread to the broader AI ecosystem.

7. Conclusion: The Imperative for Enterprise AI
The central argument of this article is straightforward: enterprise AI cannot succeed at scale without formal specifications. The experiment-driven approach that has dominated the field is insufficient for systems that must be reliable, fair, compliant, maintainable, and accountable. Spec-driven AI development addresses these needs by:

1. Establishing clear contracts between stakeholders
2. Enabling systematic validation and verification
3. Supporting long-term maintenance and evolution
4. Facilitating regulatory compliance and auditability
5. Reducing risk and cost of failures

The transition won’t be easy. It requires cultural change, new skills, better tools, and organizational commitment. But it’s necessary. As AI becomes increasingly embedded in critical systems—from healthcare diagnosis to financial decisions to autonomous vehicles—the stakes are too high to continue with ad-hoc practices [91].

In the subsequent articles in this series, I’ll explore specific aspects of spec-driven AI in depth: specification languages and frameworks (Article 2), requirements engineering for AI (Article 3), comparative analysis of development paradigms (Article 4), implementation patterns (Article 5), tooling (Article 6), case studies (Article 7), and future directions (Article 8).

For now, I leave you with this: the next time you start an AI project, ask yourself: “Do I have a specification?” If the answer is no, that’s where you should begin.

References
[1] Meyer, B. (1985). “On Formalism in Specifications.” IEEE Software, 2(1), 6-26. DOI: [10.1109/MS.1985.229776](https://doi.org/10.1109/MS.1985.229776)
[2] Ghezzi, C., Jazayeri, M., & Mandrioli, D. (2002). Fundamentals of Software Engineering. Pearson Education. DOI: [10.5555/579193](https://doi.org/10.5555/579193)
[3] Sculley, D., et al. (2015). “Hidden Technical Debt in Machine Learning Systems.” NeurIPS, 2503-2511. Available: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems
[4] Paleyes, A., Urma, R. G., & Lawrence, N. D. (2022). “Challenges in Deploying Machine Learning: A Survey of Case Studies.” ACM Computing Surveys, 55(6), 1-29. DOI: [10.1145/3533378](https://doi.org/10.1145/3533378)
[5] Dijkstra, E. W. (1968). “Go To Statement Considered Harmful.” Communications of the ACM, 11(3), 147-148. DOI: [10.1145/362929.362947](https://doi.org/10.1145/362929.362947)
[6] Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” IEEE Big Data, 1123-1132. DOI: [10.1109/BigData.2017.8258038](https://doi.org/10.1109/BigData.2017.8258038)
[7] Amershi, S., et al. (2019). “Software Engineering for Machine Learning: A Case Study.” ICSE-SEIP, 291-300. DOI: [10.1109/ICSE-SEIP.2019.00042](https://doi.org/10.1109/ICSE-SEIP.2019.00042)
[8] Lamport, L. (2002). Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley. ISBN: 0-321-14306-X
[9] Wing, J. M. (1990). “A Specifier’s Introduction to Formal Methods.” Computer, 23(9), 8-23. DOI: [10.1109/2.58215](https://doi.org/10.1109/2.58215)
[10] Jordan, M. I., & Mitchell, T. M. (2015). “Machine Learning: Trends, Perspectives, and Prospects.” Science, 349(6245), 255-260. DOI: [10.1126/science.aaa8415](https://doi.org/10.1126/science.aaa8415)
[11] Ghahramani, Z. (2015). “Probabilistic Machine Learning and Artificial Intelligence.” Nature, 521(7553), 452-459. DOI: [10.1038/nature14541](https://doi.org/10.1038/nature14541)
[12] Nakkiran, P., et al. (2021). “Deep Double Descent: Where Bigger Models and More Data Hurt.” Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003. DOI: [10.1088/1742-5468/ac3a74](https://doi.org/10.1088/1742-5468/ac3a74)
[13] Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). “A Survey on Concept Drift Adaptation.” ACM Computing Surveys, 46(4), 1-37. DOI: [10.1145/2523813](https://doi.org/10.1145/2523813)
[14] Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., & Zhang, G. (2018). “Learning Under Concept Drift: A Review.” IEEE Transactions on Knowledge and Data Engineering, 31(12), 2346-2363. DOI: [10.1109/TKDE.2018.2876857](https://doi.org/10.1109/TKDE.2018.2876857)
[15] Rudin, C. (2019). “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead.” Nature Machine Intelligence, 1(5), 206-215. DOI: [10.1038/s42256-019-0048-x](https://doi.org/10.1038/s42256-019-0048-x)
[16] Varshney, K. R. (2019). “Engineering Safety in Machine Learning.” IEEE ISTAS, 1-5. DOI: [10.1109/ISTAS48451.2019.8937920](https://doi.org/10.1109/ISTAS48451.2019.8937920)
[17] Amodei, D., et al. (2016). “Concrete Problems in AI Safety.” arXiv preprint arXiv:1606.06565. DOI: [10.48550/arXiv.1606.06565](https://doi.org/10.48550/arXiv.1606.06565)
[18] Boehm, B. W. (1979). “Guidelines for Verifying and Validating Software Requirements and Design Specifications.” Euro IFIP, 711-719. Available: https://dl.acm.org/doi/10.5555/1499456
[19] Parnas, D. L. (1972). “On the Criteria To Be Used in Decomposing Systems into Modules.” Communications of the ACM, 15(12), 1053-1058. DOI: [10.1145/361598.361623](https://doi.org/10.1145/361598.361623)
[20] Hoare, C. A. R. (1969). “An Axiomatic Basis for Computer Programming.” Communications of the ACM, 12(10), 576-580. DOI: [10.1145/363235.363259](https://doi.org/10.1145/363235.363259)
[21] Jones, C. B. (1990). Systematic Software Development Using VDM. Prentice Hall. ISBN: 0-13-880733-7
[22] Abrial, J. R. (1996). The B-Book: Assigning Programs to Meanings. Cambridge University Press. DOI: [10.1017/CBO9780511624162](https://doi.org/10.1017/CBO9780511624162)
[23] Meyer, B. (1992). “Applying ‘Design by Contract’.” Computer, 25(10), 40-51. DOI: [10.1109/2.161279](https://doi.org/10.1109/2.161279)
[24] North, D. (2006). “Introducing BDD.” Better Software Magazine, 12-17. Available: https://dannorth.net/introducing-bdd/
[25] Beck, K. (2003). Test-Driven Development: By Example. Addison-Wesley. ISBN: 0-321-14653-0
[26] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS, 1097-1105. DOI: [10.1145/3065386](https://doi.org/10.1145/3065386)
[27] Zhang, J. M., Harman, M., Ma, L., & Liu, Y. (2022). “Machine Learning Testing: Survey, Landscapes and Horizons.” IEEE Transactions on Software Engineering, 48(1), 1-36. DOI: [10.1109/TSE.2019.2962027](https://doi.org/10.1109/TSE.2019.2962027)
[28] Hooker, S. (2021). “Moving Beyond ‘Algorithmic Bias Is a Data Problem’.” Patterns, 2(4), 100241. DOI: [10.1016/j.patter.2021.100241](https://doi.org/10.1016/j.patter.2021.100241)
[29] Souza, A., et al. (2021). “Specification-Guided Learning of Nash Equilibria with High Social Welfare.” AAAI, 10754-10762. DOI: [10.1609/aaai.v35i12.17286](https://doi.org/10.1609/aaai.v35i12.17286)
[30] Sambasivan, N., et al. (2021). “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI.” CHI, 1-15. DOI: [10.1145/3411764.3445518](https://doi.org/10.1145/3411764.3445518)
[31] Gebru, T., et al. (2021). “Datasheets for Datasets.” Communications of the ACM, 64(12), 86-92. DOI: [10.1145/3458723](https://doi.org/10.1145/3458723)
[32] Arnold, M., et al. (2019). “FactSheets: Increasing Trust in AI Services through Supplier’s Declarations of Conformity.” IBM Journal of Research and Development, 63(4/5), 6:1-6:13. DOI: [10.1147/JRD.2019.2942288](https://doi.org/10.1147/JRD.2019.2942288)
[33] Mitchell, M., et al. (2019). “Model Cards for Model Reporting.” FAT*, 220-229. DOI: [10.1145/3287560.3287596](https://doi.org/10.1145/3287560.3287596)
[34] Gebru, T., et al. (2018). “Datasheets for Datasets.” arXiv preprint arXiv:1803.09010. DOI: [10.48550/arXiv.1803.09010](https://doi.org/10.48550/arXiv.1803.09010)
[35] Arnold, M., et al. (2019). “FactSheets: Increasing Trust in AI Services.” IBM Journal of Research and Development, 63(4/5), 6:1-6:13. DOI: [10.1147/JRD.2019.2942288](https://doi.org/10.1147/JRD.2019.2942288)
[36] Hardt, M., Price, E., & Srebro, N. (2016). “Equality of Opportunity in Supervised Learning.” NeurIPS, 3315-3323. Available: https://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning
[37] Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning: Limitations and Opportunities. MIT Press. Available: http://www.fairmlbook.org
[38] Cohen, J., Rosenfeld, E., & Kolter, Z. (2019). “Certified Adversarial Robustness via Randomized Smoothing.” ICML, 1310-1320. Available: http://proceedings.mlr.press/v97/cohen19c.html
[39] Katz, G., et al. (2017). “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.” CAV, 97-117. DOI: [10.1007/978-3-319-63387-9_5](https://doi.org/10.1007/978-3-319-63387-9_5)
[40] Rakotoarison, H., & Papamartzivanos, D. (2022). “Empirical Study of the State of MLOps.” arXiv preprint arXiv:2203.08642. DOI: [10.48550/arXiv.2203.08642](https://doi.org/10.48550/arXiv.2203.08642)
[41] Vogelsang, A., & Borg, M. (2019). “Requirements Engineering for Machine Learning: Perspectives from Data Scientists.” RE Workshops, 245-251. DOI: [10.1109/REW.2019.00050](https://doi.org/10.1109/REW.2019.00050)
[42] Belani, H., Vukovic, M., & Car, Ž. (2019).
“Requirements Engineering Challenges in Building AI-Based Complex Systems.” EASE, 1-6. DOI: [10.1145/3319008.3319439](https://doi.org/10.1145/3319008.3319439) [43] Gal, Y., & Ghahramani, Z. (2016). “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” ICML, 1050-1059. Available: http://proceedings.mlr.press/v48/gal16.html [44] Abdar, M., et al. (2021). “A Review of Uncertainty Quantification in Deep Learning.” Information Fusion, 76, 243-297. DOI: [10.1016/j.inffus.2021.05.008](https://doi.org/10.1016/j.inffus.2021.05.008) [45] Rahaman, R., & Thiery, A. (2021). “Uncertainty Quantification and Deep Ensembles.” NeurIPS, 20063-20075. Available: https://proceedings.neurips.cc/paper/2021/hash/a70dc40477bc2adceef4d2c90f47eb82-Abstract.html [46] Pei, K., Cao, Y., Yang, J., & Jana, S. (2017). “DeepXplore: Automated Whitebox Testing of Deep Learning Systems.” SOSP, 1-18. DOI: [10.1145/3132747.3132785](https://doi.org/10.1145/3132747.3132785) [47] Tian, Y., et al. (2018). “DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars.” ICSE, 303-314. DOI: [10.1145/3180155.3180220](https://doi.org/10.1145/3180155.3180220) [48] Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. (2009). Dataset Shift in Machine Learning. MIT Press. ISBN: 978-0-262-17005-5 [49] NIST (2021). “Artificial Intelligence Risk Management Framework.” NIST AI 100-1. DOI: [10.6028/NIST.AI.100-1](https://doi.org/10.6028/NIST.AI.100-1) [50] European Commission (2021). “Proposal for a Regulation on Artificial Intelligence.” COM(2021) 206 final. Available: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206 [51] Zaharia, M., et al. (2018). “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Engineering Bulletin, 41(4), 39-45. Available: http://sites.computer.org/debull/A18dec/p39.pdf [52] Baylor, D., et al. (2017). “TFX: A TensorFlow-Based Production-Scale Machine Learning Platform.” KDD, 1387-1395. 
DOI: [10.1145/3097983.3098021](https://doi.org/10.1145/3097983.3098021) [53] Holstein, K., Wortman Vaughan, J., Daumé III, H., Dudik, M., & Wallach, H. (2019). “Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?” CHI, 1-16. DOI: [10.1145/3290605.3300830](https://doi.org/10.1145/3290605.3300830) [54] Madaio, M. A., et al. (2020). “Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI.” CHI, 1-14. DOI: [10.1145/3313831.3376445](https://doi.org/10.1145/3313831.3376445) [55] He, J., Baxter, D., Zhou, J., & Xu, X. (2021). “A Comparison of Machine Learning Systems for Continuous Integration.” ASE, 1373-1377. DOI: [10.1109/ASE51524.2021.9678577](https://doi.org/10.1109/ASE51524.2021.9678577) [56] Davis, A., & Khoshgoftaar, T. M. (2021). “The Relationship Between Precision-Recall and ROC Curves for Imbalanced Datasets.” Journal of Biomedical Informatics, 115, 103687. DOI: [10.1016/j.jbi.2021.103687](https://doi.org/10.1016/j.jbi.2021.103687) [57] Bellamy, R. K., et al. (2019). “AI Fairness 360: An Extensible Toolkit for Detecting and Mitigating Algorithmic Bias.” IBM Journal of Research and Development, 63(4/5), 4:1-4:15. DOI: [10.1147/JRD.2019.2942287](https://doi.org/10.1147/JRD.2019.2942287) [58] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). “A Survey on Bias and Fairness in Machine Learning.” ACM Computing Surveys, 54(6), 1-35. DOI: [10.1145/3457607](https://doi.org/10.1145/3457607) [59] Humbatova, N., Jahangirova, G., Bavota, G., Riccio, V., Stocco, A., & Tonella, P. (2020). “Taxonomy of Real Faults in Deep Learning Systems.” ICSE, 1110-1121. DOI: [10.1145/3377811.3380395](https://doi.org/10.1145/3377811.3380395) [60] Huang, X., et al. (2020). “A Survey of Safety and Trustworthiness of Deep Neural Networks.” Computer Science Review, 37, 100270. 
DOI: [10.1016/j.cosrev.2020.100270](https://doi.org/10.1016/j.cosrev.2020.100270) [61] Ashmore, R., Calinescu, R., & Paterson, C. (2021). “Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges.” ACM Computing Surveys, 54(5), 1-39. DOI: [10.1145/3453444](https://doi.org/10.1145/3453444) [62] Wan, Z., Xia, X., Lo, D., & Murphy, G. C. (2021). “How Does Machine Learning Change Software Development Practices?” IEEE Transactions on Software Engineering, 47(9), 1857-1871. DOI: [10.1109/TSE.2019.2937083](https://doi.org/10.1109/TSE.2019.2937083) [63] Klaise, J., Van Looveren, A., Vacanti, G., & Coca, A. (2021). “Monitoring and Explainability of Models in Production.” arXiv preprint arXiv:2007.06299. DOI: [10.48550/arXiv.2007.06299](https://doi.org/10.48550/arXiv.2007.06299) [64] Shankar, S., et al. (2022). “Operationalizing Machine Learning: An Interview Study.” arXiv preprint arXiv:2209.09125. DOI: [10.48550/arXiv.2209.09125](https://doi.org/10.48550/arXiv.2209.09125) [65] Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., & Crnkovic, I. (2019). “A Taxonomy of Software Engineering Challenges for Machine Learning Systems.” Journal of Systems and Software, 157, 110389. DOI: [10.1016/j.jss.2019.07.100](https://doi.org/10.1016/j.jss.2019.07.100) [66] Barash, G., et al. (2022). “Bridging the Gap: Providing Post-Hoc Symbolic Explanations for Sequential Decision-Making Problems.” ICLR. Available: https://openreview.net/forum?id=o80Z8CkSfHh [67] Zhou, Z. H. (2018). “Learning with Limited Supervision.” National Science Review, 5(1), 44-53. DOI: [10.1093/nsr/nwx106](https://doi.org/10.1093/nsr/nwx106) [68] Seshia, S. A., Sadigh, D., & Sastry, S. S. (2022). “Toward Verified Artificial Intelligence.” Communications of the ACM, 65(7), 46-55. DOI: [10.1145/3503914](https://doi.org/10.1145/3503914) [69] Albarghouthi, A., D’Antoni, L., Drews, S., & Nori, A. V. (2017). “FairSquare: Probabilistic Verification of Program Fairness.” OOPSLA, 1-30. 
DOI: [10.1145/3133904](https://doi.org/10.1145/3133904) [70] Drechsler, R., & Soeken, M. (2021). Formal Specification Level: Concepts and Methods. Springer. DOI: [10.1007/978-3-030-63143-4](https://doi.org/10.1007/978-3-030-63143-4) [71] Nahar, N., Zhou, S., Lewis, G., & Kästner, C. (2022). “Collaboration Challenges in Building ML-Enabled Systems.” ICSE, 413-425. DOI: [10.1145/3510003.3510209](https://doi.org/10.1145/3510003.3510209) [72] Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). “Data Lifecycle Challenges in Production Machine Learning.” ACM SIGMOD Record, 47(2), 17-28. DOI: [10.1145/3299887.3299891](https://doi.org/10.1145/3299887.3299891) [73] Khomh, F., Adams, B., Cheng, J., Fokaefs, M., & Antoniol, G. (2018). “Software Engineering for Machine Learning Applications.” IEEE Software, 35(5), 18-26. DOI: [10.1109/MS.2018.290110840](https://doi.org/10.1109/MS.2018.290110840) [74] Zolfaghari, M., et al. (2022). “Machine Learning for Software Engineering: Models, Methods, and Applications.” Empirical Software Engineering, 27(5), 1-61. DOI: [10.1007/s10664-022-10183-w](https://doi.org/10.1007/s10664-022-10183-w) [75] Nascimento, E., et al. (2020). “Understanding Development Process of Machine Learning Systems.” ESEM, 1-6. DOI: [10.1145/3382494.3410680](https://doi.org/10.1145/3382494.3410680) [76] Raghothaman, M., Dwyer, M. B., & Elbaum, S. (2021). “Synthesizing Tests for Data-Centric Programs.” ICSE, 1400-1412. DOI: [10.1109/ICSE43902.2021.00126](https://doi.org/10.1109/ICSE43902.2021.00126) [77] Katz, G., Barrett, C., Dill, D. L., Julian, K., & Kochenderfer, M. J. (2017). “Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks.” CAV, 97-117. DOI: [10.1007/978-3-319-63387-9_5](https://doi.org/10.1007/978-3-319-63387-9_5) [78] Huang, X., et al. (2017). “Safety Verification of Deep Neural Networks.” CAV, 3-29. DOI: [10.1007/978-3-319-63387-9_1](https://doi.org/10.1007/978-3-319-63387-9_1) [79] Wicker, M., Huang, X., & Kwiatkowska, M. 
(2018). “Feature-Guided Black-Box Safety Testing of Deep Neural Networks.” TACAS, 408-426. DOI: [10.1007/978-3-319-89960-2_22](https://doi.org/10.1007/978-3-319-89960-2_22) [80] Singh, G., Gehr, T., Püschel, M., & Vechev, M. (2019). “An Abstract Domain for Certifying Neural Networks.” POPL, 41:1-41:30. DOI: [10.1145/3290354](https://doi.org/10.1145/3290354) [81] Wang, S., et al. (2018). “Formal Security Analysis of Neural Networks using Symbolic Intervals.” USENIX Security, 1599-1614. Available: https://www.usenix.org/conference/usenixsecurity18/presentation/wang-shiqi [82] Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L., & Zhong, C. (2022). “Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges.” Statistics Surveys, 16, 1-85. DOI: [10.1214/21-SS133](https://doi.org/10.1214/21-SS133) [83] Arpteg, A., Brinne, B., Crnkovic-Friis, L., & Bosch, J. (2018). “Software Engineering Challenges of Deep Learning.” SEAA, 50-59. DOI: [10.1109/SEAA.2018.00018](https://doi.org/10.1109/SEAA.2018.00018) [84] Hutchinson, B., et al. (2021). “Towards Accountability for Machine Learning Datasets.” FAccT, 487-498. DOI: [10.1145/3442188.3445918](https://doi.org/10.1145/3442188.3445918) [85] Zhang, T., et al. (2023). “Automated Machine Learning: Past, Present and Future.” AI Open, 4, 1-20. DOI: [10.1016/j.aiopen.2023.01.001](https://doi.org/10.1016/j.aiopen.2023.01.001) [86] Karmaker, S. K., et al. (2021). “AutoML to Date and Beyond: Challenges and Opportunities.” ACM Computing Surveys, 54(8), 1-36. DOI: [10.1145/3470918](https://doi.org/10.1145/3470918) [87] Hulten, G. (2018). Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress. DOI: [10.1007/978-1-4842-3432-7](https://doi.org/10.1007/978-1-4842-3432-7) [88] Floridi, L., et al. (2022). “AI4People—An Ethical Framework for a Good AI Society.” Minds and Machines, 28(4), 689-707. DOI: [10.1007/s11023-018-9482-5](https://doi.org/10.1007/s11023-018-9482-5) [89] Brundage, M., et al. 
(2020). “Toward Trustworthy AI Development.” arXiv preprint arXiv:2004.07213. DOI: [10.48550/arXiv.2004.07213](https://doi.org/10.48550/arXiv.2004.07213) [90] Giray, G. (2021). “A Software Engineering Perspective on Engineering Machine Learning Systems.” Journal of Software: Evolution and Process, 33(2), e2335. DOI: [10.1002/smr.2335](https://doi.org/10.1002/smr.2335) [91] Kaur, D., Uslu, S., Rittichier, K. J., & Durresi, A. (2022). “Trustworthy Artificial Intelligence: A Review.” ACM Computing Surveys, 55(2), 1-38. DOI: [10.1145/3491209](https://doi.org/10.1145/3491209)Acknowledgments: This work was conducted as part of ongoing research at Odessa National Polytechnic University. The author thanks colleagues for valuable discussions on specification practices in production AI systems. Conflict of Interest: The author declares no conflicts of interest. Data Availability: No new data were generated for this article. All referenced works are publicly available through their respective DOI links.