Stabilarity Hub

AI Transforming Science: Math, Biology, and Discovery 2025

Posted on February 2, 2026 (updated February 25, 2026) by Admin

AI Transforming Science: From Mathematics to Medicine

📚 Academic Citation: Ivchenko, O. (2026). AI Transforming Science: From Mathematics to Medicine. AI in Science Series. Odesa National Polytechnic University. DOI: 10.5281/zenodo.18748877

Abstract

2025 marked a watershed year for AI-driven scientific discovery, with systems transitioning from computational tools to active research partners. Google DeepMind’s AlphaEvolve discovered novel algorithms for fundamental mathematical and computational problems, recovering roughly 0.7% of Google’s worldwide compute resources and finding new solutions to open problems that have challenged mathematicians for decades. Simultaneously, Microsoft’s MAI-DxO diagnostic orchestrator achieved 85.5% accuracy on complex medical cases—more than four times the accuracy of a panel of experienced physicians—while reducing estimated diagnostic costs by 70%. This article examines these breakthroughs, their methodologies, infrastructure impacts, and implications for the future of scientific research. The evidence suggests we are witnessing the emergence of AI as a genuine co-investigator capable of generating, testing, and validating novel scientific hypotheses across disciplines.


Introduction: AI as Research Partner

Scientific discovery has historically followed a human-centered paradigm: researchers formulate hypotheses, design experiments, collect data, and derive insights through analysis. AI systems have long assisted this process—automating calculations, searching literature, analyzing datasets—but always as tools guided by human intelligence. 2025 represents a phase transition: AI systems now actively participate in the creative, hypothesis-generating phases of research, proposing novel solutions that humans verify rather than merely executing human-designed procedures.

This shift is evident across multiple domains. In mathematics, Google DeepMind’s AlphaEvolve discovered new algorithms for problems ranging from matrix multiplication to geometric sphere packing. In medicine, Microsoft’s MAI-DxO diagnostic orchestrator matched or exceeded human diagnostic accuracy on complex cases involving rare diseases and atypical presentations. In materials science, AI systems propose novel compounds with desired properties. In climate science, AI-enhanced models improve long-term forecasting accuracy.

What distinguishes 2025’s developments from earlier AI research assistance is autonomy in hypothesis generation. Previous systems optimized known approaches or searched predefined solution spaces. AlphaEvolve and its contemporaries explore open-ended search spaces, proposing genuinely novel approaches that humans might not consider, then validating them through automated evaluation. This closed loop—generate, test, validate, iterate—mirrors the scientific method itself.


AlphaEvolve: Evolutionary Algorithm Discovery

AlphaEvolve, announced by Google DeepMind in May 2025, represents a new class of AI research tool: an evolutionary coding agent that combines large language model creativity with automated verification in an evolutionary framework. Unlike specialized systems like AlphaFold (protein structure) or AlphaTensor (tensor decomposition), AlphaEvolve applies to any problem whose solution can be expressed as an algorithm and automatically evaluated.

Architecture and Methodology

AlphaEvolve employs an ensemble approach leveraging multiple Gemini models:

  • Gemini Flash: Fast, efficient model that maximizes breadth of ideas explored
  • Gemini Pro: Powerful model providing depth and insightful suggestions
  • Automated evaluators: Domain-specific verification systems that test proposed solutions
  • Evolutionary selection: Framework that identifies promising solutions and uses them as seeds for next generation

The system operates through iterative cycles:

  1. Generation: LLMs propose programs implementing algorithmic solutions
  2. Evaluation: Automated evaluators verify correctness and measure performance
  3. Selection: Best-performing solutions are stored in a programs database
  4. Mutation: Promising solutions undergo variations to explore nearby solution space
  5. Iteration: Process repeats with improved solutions as prompts for new generations

This architecture mirrors biological evolution: variation (LLM creativity), selection (automated evaluation), and inheritance (using successful solutions as templates). The key innovation is applying this to entire codebases rather than single functions, enabling discovery of complex, multi-component algorithms.

flowchart TD
    A["LLM Generates Algorithm Code"] --> B["Automated Evaluators"]
    B --> C{"Performance Acceptable?"}
    C -->|Yes| D["Programs Database"]
    C -->|No| E["Discarded"]
    D --> F["Select Best Solutions"]
    F --> G["Mutation via LLM Variation"]
    G --> A

    style A fill:#4CAF50,color:white
    style D fill:#2196F3,color:white
    style F fill:#FF9800,color:white
Figure 1: AlphaEvolve Evolutionary Cycle
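The generate-evaluate-select-mutate cycle can be sketched as a minimal Python skeleton. Everything here is illustrative: the real system uses Gemini models as the mutation operator and domain-specific evaluators, whereas this toy evolves a short list of numbers toward a target with a hand-written scorer.

```python
import random

# Toy AlphaEvolve-style evolutionary loop (all names hypothetical).
# The "program" is just a list of floats; the evaluator scores closeness
# to a target: a stand-in for "run the candidate and measure performance".

def evaluate(program, target=(3.0, 1.0, 4.0)):
    """Automated evaluator: higher score = better candidate."""
    return -sum((p - t) ** 2 for p, t in zip(program, target))

def mutate(program, rng, scale=0.5):
    """Stand-in for LLM-driven variation of a promising solution."""
    return [p + rng.uniform(-scale, scale) for p in program]

def evolve(generations=200, population=20, seed=0):
    rng = random.Random(seed)
    # Programs database, seeded with random candidates.
    db = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(population)]
    for _ in range(generations):
        db.sort(key=evaluate, reverse=True)   # selection
        parents = db[: population // 4]       # keep the best quartile
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population - len(parents))]
        db = parents + children               # next generation
    return max(db, key=evaluate)

best = evolve()
print(best, evaluate(best))
```

Because the best candidates are carried over unchanged (elitism), the top score never regresses; the LLM-as-mutator is what distinguishes the real system from this random-perturbation sketch.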

Infrastructure Deployments

AlphaEvolve’s impact extends beyond academic demonstrations to production deployments across Google’s computing infrastructure:

Data Center Scheduling: AlphaEvolve discovered a heuristic for Borg, Google’s cluster management system, that continuously recovers 0.7% of Google’s worldwide compute resources. This solution has been in production for over a year. To contextualize: 0.7% of Google’s global infrastructure represents an immense absolute quantity of compute—equivalent to adding multiple large data centers without building new facilities.

Hardware Design: AlphaEvolve proposed a Verilog rewrite removing unnecessary bits in a highly optimized arithmetic circuit for matrix multiplication. After passing rigorous functional verification, this optimization was integrated into an upcoming Tensor Processing Unit (TPU) design. By working in standard chip design languages, AlphaEvolve enables collaboration between AI and human hardware engineers.

AI Training Optimization: By discovering smarter ways to partition large matrix multiplication operations, AlphaEvolve accelerated a critical kernel in Gemini’s architecture by 23%, translating to a 1% reduction in overall training time. Given that training frontier AI models requires months of compute on massive clusters, this represents substantial cost and energy savings. Perhaps more significantly, AlphaEvolve reduced kernel optimization time from weeks of expert engineering effort to days of automated experiments.
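The relationship between a kernel-level speedup and the overall saving is simple Amdahl-style arithmetic. The sketch below is illustrative only, since the kernel's exact share of training time was not published: a 23% kernel speedup producing a 1% overall saving implies the kernel occupied roughly 5% of the training step.

```python
# Amdahl-style sanity check for the 23% kernel speedup -> 1% training-time
# saving reported for Gemini. (Illustrative arithmetic; the kernel's actual
# share of runtime was not disclosed.)

def overall_saving(kernel_fraction, kernel_speedup):
    """Fraction of total runtime saved when a kernel becomes kernel_speedup x faster."""
    return kernel_fraction * (1 - 1 / kernel_speedup)

def implied_kernel_fraction(total_saving, kernel_speedup):
    """Invert: what share of total runtime must the kernel have occupied?"""
    return total_saving / (1 - 1 / kernel_speedup)

f = implied_kernel_fraction(total_saving=0.01, kernel_speedup=1.23)
print(f"implied kernel share of training time: {f:.1%}")  # ~5.3%
```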

Low-Level GPU Optimization: AlphaEvolve achieved up to 32.5% speedup for FlashAttention kernel implementations in Transformer-based models by optimizing low-level GPU instructions—a domain typically too complex for direct human modification, even by experts. Compilers handle this optimization, but AlphaEvolve found improvements compilers missed.


Mathematical Discoveries

AlphaEvolve’s application to pure mathematics demonstrates its versatility beyond practical engineering problems. In collaboration with mathematician Terence Tao (Fields Medalist) and other researchers, AlphaEvolve was tested on over 50 open problems in mathematical analysis, geometry, combinatorics, and number theory.

Performance on Open Problems

  • ~75% of cases: Rediscovered state-of-the-art solutions
  • ~20% of cases: Improved upon previously best-known solutions
  • Setup time: Most experiments configured in hours, not weeks

Notable specific achievements include:

Matrix Multiplication Algorithms: AlphaEvolve discovered an algorithm to multiply 4×4 complex-valued matrices using 48 scalar multiplications, improving upon Strassen’s 1969 algorithm, previously the best known for this case. This represents a significant advance over AlphaTensor (2022), which specialized in matrix multiplication but only found improvements for binary arithmetic at this matrix size. AlphaEvolve’s approach—designing a gradient-based optimization procedure from minimal code skeletons—demonstrates meta-algorithmic creativity: discovering algorithms for discovering algorithms.
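To situate the 48-multiplication result: a naive 4×4 product uses 64 scalar multiplications, and fully recursive Strassen uses 49. A few lines of Python make the counts explicit:

```python
# Scalar-multiplication counts for an n x n matrix product (n a power of 2).
# Strassen's trick replaces the 8 block products at each recursion level
# with 7, so the count is 7^log2(n).

def strassen_mults(n):
    """Scalar multiplications used by fully recursive Strassen."""
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)

print(4 ** 3)             # naive 4x4: 64
print(strassen_mults(4))  # Strassen: 49
print(48)                 # AlphaEvolve (complex-valued entries): 48
```

Shaving a single multiplication looks modest, but recursive application of a better base case compounds: the base-case count determines the asymptotic exponent of the resulting algorithm.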

Kissing Number Problem: AlphaEvolve established a new lower bound for the 11-dimensional kissing number problem—the maximum number of non-overlapping unit spheres that can simultaneously touch a central unit sphere. AlphaEvolve discovered a configuration of 593 outer spheres, improving the previous bound. This geometric challenge has fascinated mathematicians since Newton debated it in the 1690s, with rigorous solutions known only for dimensions 1, 2, 3, 4, 8, and 24. Improvements in other dimensions represent genuine mathematical progress.
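Verifying a proposed configuration is exactly the kind of automated evaluation AlphaEvolve depends on: every outer sphere's center must lie at distance 2 from the origin (tangent to the central unit sphere), and centers must be pairwise at least distance 2 apart (no overlap). The sketch below demonstrates such a checker in 2D, where the kissing number is 6:

```python
import math

# Checker for a candidate kissing configuration in any dimension:
# centers at distance 2 from the origin, pairwise distances >= 2.

def is_kissing_configuration(centers, tol=1e-9):
    for c in centers:
        if abs(math.dist(c, [0.0] * len(c)) - 2.0) > tol:
            return False  # not tangent to the central sphere
    for i, a in enumerate(centers):
        for b in centers[i + 1:]:
            if math.dist(a, b) < 2.0 - tol:
                return False  # two outer spheres overlap
    return True

# 2D demo: six unit circles around a central unit circle (a hexagon).
hexagon = [(2 * math.cos(k * math.pi / 3), 2 * math.sin(k * math.pi / 3))
           for k in range(6)]
print(is_kissing_configuration(hexagon))  # True
```

In 11 dimensions the same check applies to 593 centers; the hard part, which AlphaEvolve automated, is searching the vast space of candidate configurations, not verifying one.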

Finite-Field Kakeya Conjecture: In work combining AlphaEvolve with Gemini Deep Think and AlphaProof, the system rediscovered and improved proofs for the finite-field Kakeya conjecture. Significantly, this created a closed AI research loop: AlphaEvolve generated proof approaches, Gemini Deep Think verified logical coherence, and AlphaProof formalized results in machine-verifiable form. This workflow—generate, verify, formalize—demonstrates AI systems handling the complete research pipeline without human intervention at intermediate steps.

Terence Tao’s Assessment

In a November 2025 blog post, Terence Tao described AlphaEvolve as “proving so powerful that mathematicians might try to translate their non-optimization problems into ones that the AI can solve.” This represents a methodological reversal: rather than adapting AI to mathematical conventions, mathematicians may adapt their problem formulations to leverage AI strengths.

Tao noted that while AlphaEvolve handles optimization problems across diverse mathematical disciplines (number theory, geometry, combinatorics), these remain “only a small fraction of all the problems that mathematicians care about.” The system excels at problems with clear objective functions and automated verification but struggles with problems requiring conceptual insights, definitions of new mathematical objects, or proofs involving non-algorithmic reasoning.

Nevertheless, Tao’s collaborative experience with DeepMind resulted in a co-authored paper demonstrating AlphaEvolve solving or improving results on 67 problems—a scale of mathematical contribution unprecedented for an AI system. Nature described the system as a “spectacular” general-purpose science AI, noting that AlphaEvolve represents a shift from domain-specific AI (like AlphaFold) to general-purpose research assistance.


MAI-DxO: AI Diagnostic Orchestrator

While AlphaEvolve demonstrates AI creativity in abstract domains, Microsoft’s Medical AI Diagnostic Orchestrator (MAI-DxO) shows AI surpassing human expertise in complex, high-stakes medical diagnostics. Announced in June 2025, MAI-DxO achieved 85.5% accuracy on 304 complex medical cases from the New England Journal of Medicine’s case record series—versus 20% accuracy for a panel of 21 U.S. and U.K. physicians.

pie title AI vs Human Performance 2025
    "AI Diagnostic Accuracy" : 85.5
    "Human Diagnostic Accuracy" : 20

Figure 2: MAI-DxO vs Physicians on Complex Cases (304 NEJM cases)

Methodology and Architecture

MAI-DxO is not a single model but an orchestration system that coordinates multiple specialized AI components:

  • Clinical reasoning engine: Trained on medical licensing examination materials and case literature
  • Diagnostic hypothesis generator: Proposes differential diagnoses based on symptoms and test results
  • Test ordering system: Decides which diagnostic tests to request based on cost-benefit analysis
  • Literature search: Retrieves relevant medical research for rare presentations
  • Reasoning transparency layer: Generates human-readable explanations of diagnostic logic

flowchart LR
    A["Patient Data"] --> B["Clinical Reasoning Engine"]
    A --> C["Hypothesis Generator"]
    B --> D["Orchestrator"]
    C --> D
    D --> E["Test Ordering"]
    D --> F["Literature Search"]
    E --> G["Reasoning Transparency"]
    F --> G
    G --> H["Diagnosis + Explanation"]

    style D fill:#9C27B0,color:white
    style H fill:#4CAF50,color:white

Figure 3: MAI-DxO Diagnostic Orchestration Architecture

Critically, MAI-DxO was evaluated in an interactive format mimicking clinical workflow: both the AI and human physicians could ask questions, order tests, receive results, and revise diagnoses iteratively. This distinguishes MAI-DxO from earlier diagnostic AI systems evaluated on static datasets without the ability to request additional information.

Performance Breakdown

The more than fourfold accuracy advantage (85.5% vs. 20%) is striking but requires context:

  • Case complexity: NEJM case records represent diagnostically challenging cases submitted for educational publication—not representative of typical clinical encounters
  • Rare diseases: Many cases involve rare conditions that physicians encounter infrequently, where AI’s exhaustive literature knowledge provides advantage
  • Time pressure: The evaluation structure may not have allowed physicians the consultation time they’d typically take for such complex cases
  • Atypical presentations: Cases often featured unusual symptom combinations requiring pattern matching across diverse medical subfields

Microsoft emphasized that MAI-DxO particularly excelled on rare and complex conditions—precisely the cases where diagnostic errors cause the most harm and where physician uncertainty is highest. The system demonstrated:

  • Cost efficiency: Estimated 70% reduction in diagnostic costs by ordering targeted tests rather than broad screening panels
  • Reasoning transparency: Explanations traceable to specific symptoms, test results, and medical literature
  • Configurability: Can operate within defined cost constraints, allowing explicit exploration of cost-value trade-offs
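MAI-DxO's actual test-ordering policy is not public, but the cost-value trade-off it exposes can be illustrated with a simple greedy budget policy. All test names, values, and costs below are hypothetical:

```python
# Hypothetical sketch of cost-constrained test ordering: rank candidate
# tests by expected diagnostic value per dollar and order greedily until
# the budget is exhausted. This is NOT MAI-DxO's real policy; it only
# illustrates the configurable cost-value trade-off described above.

def order_tests(candidates, budget):
    """candidates: list of (name, expected_value, cost).
    Returns (chosen test names, total spent)."""
    ranked = sorted(candidates, key=lambda t: t[1] / t[2], reverse=True)
    chosen, spent = [], 0.0
    for name, value, cost in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

tests = [
    ("CBC",        0.30,   25.0),  # cheap, broad screen
    ("MRI brain",  0.55,  900.0),  # high value, expensive
    ("TSH",        0.25,   40.0),
    ("Full panel", 0.60, 1500.0),  # broad screening panel
]
print(order_tests(tests, budget=1000.0))
```

Under a $1,000 budget this policy orders the two cheap targeted tests and the MRI while skipping the broad panel, which is the behavior the 70% cost-reduction estimate attributes to targeted rather than shotgun testing.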

Clinical Implications and Limitations

MAI-DxO’s performance raises important questions about AI’s role in medical decision-making. Microsoft’s framing as a tool toward “medical superintelligence” suggests AI diagnostic capabilities may exceed human expertise in specific domains—a controversial claim in medical ethics and practice.

Critical limitations include:

  • No patient interaction: Evaluation used case summaries, not actual patient encounters requiring communication, empathy, and rapport
  • Physical examination: AI cannot perform hands-on examination techniques that may reveal diagnostic clues
  • Context integration: Physicians incorporate social, cultural, and individual patient factors that may not appear in case records
  • Accountability: When AI diagnostic errors occur, legal and ethical responsibility remains ambiguous
  • Explainability threshold: While MAI-DxO provides explanations, physicians may not find them sufficiently transparent for high-stakes decisions

Nevertheless, MAI-DxO’s performance suggests a plausible near-term role as a diagnostic co-pilot for complex cases: the AI generates differential diagnoses and suggests tests, which physicians evaluate, contextually adjust, and ultimately approve or modify. This collaborative model preserves physician judgment while leveraging AI’s pattern recognition and literature synthesis capabilities.


Broader AI in Science Trends

AlphaEvolve and MAI-DxO exemplify broader patterns in AI-assisted research:

Drug Discovery and Molecular Design

AI systems screen millions of molecular candidates for desired properties, dramatically reducing early-stage drug discovery timelines. AlphaFold’s protein structure predictions (2020-2024) enabled structure-based drug design at scale. 2025 saw experimental validation of several AI-designed drug candidates entering clinical trials, though approval remains years away.

Materials Science

AI-discovered materials with custom properties (thermal conductivity, tensile strength, chemical stability) are being synthesized and tested. The Materials Project and similar databases provide training data for models that predict material properties from atomic structure, accelerating the discovery-to-synthesis pipeline.

Climate and Environmental Science

AI-enhanced climate models incorporate machine learning components for cloud dynamics, ocean circulation, and ice sheet behavior—phenomena difficult to model from first principles. These hybrid models show improved long-term forecasting accuracy while requiring fewer computational resources than traditional fully physics-based models.

Genomics and Personalized Medicine

AI systems analyze genomic data to identify disease-associated variants, predict drug responses based on genetic profiles, and design personalized treatment protocols. The integration of genomic, clinical, and lifestyle data enables precision medicine approaches tailored to individual patients.


Challenges and Limitations

Despite impressive capabilities, AI research systems face significant limitations:

Verification and Reproducibility

AI-generated hypotheses require independent verification. AlphaEvolve’s mathematical results were validated by human mathematicians, and its infrastructure optimizations underwent extensive testing before deployment. However, the scale and speed of AI hypothesis generation may outpace human verification capacity, creating a potential credibility bottleneck.

Interpretability

Understanding why an AI system proposes a particular solution remains challenging. AlphaEvolve generates code that human engineers can read and understand, providing some interpretability. But the evolutionary path that led to a solution may involve thousands of intermediate steps that are impractical to review. This creates a tension: we can verify the final result works, but may not understand the discovery process.

Scope Limitations

Current AI research systems excel at optimization problems with clear objective functions and automated evaluation. They struggle with:

  • Conceptual breakthroughs: Defining new theoretical frameworks or mathematical objects
  • Qualitative reasoning: Insights requiring judgment, taste, or aesthetic considerations
  • Cross-domain synthesis: Combining insights from disparate fields in non-obvious ways
  • Problem formulation: Identifying which questions are worth asking—still primarily human-driven

Resource Requirements

AlphaEvolve and similar systems require substantial computational resources. Training the underlying LLMs, running thousands of evolutionary iterations, and evaluating candidate solutions across diverse test cases demands infrastructure accessible primarily to well-resourced organizations. This risks concentrating scientific AI capabilities in the hands of a few large tech companies, potentially exacerbating existing inequalities in research capacity.


Future Outlook

Google DeepMind announced plans for an Early Access Program for AlphaEvolve, initially targeting selected academic users, with broader availability under exploration. This democratization of access will be critical for determining whether AI research tools concentrate or distribute scientific capability.

The progression from domain-specific systems (AlphaFold for proteins, AlphaTensor for tensor decomposition) to general-purpose research agents (AlphaEvolve) suggests we may be approaching universal research assistants applicable across scientific domains. The key requirements are:

  1. Algorithmic problem formulation: Express the research question as an optimization problem
  2. Automated evaluation: Verify proposed solutions without human intervention
  3. Adequate compute: Sufficient resources to explore solution spaces

Many scientific problems already meet these criteria; others may be reformulated to leverage AI capabilities, as Terence Tao suggested for mathematics.
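A toy example makes the three requirements concrete: the question "how far apart can five points in a unit square be spread?" is (1) formulated as maximizing a single objective, (2) scored by a fully automated evaluator, and (3) attacked with compute (here plain random search standing in for LLM-guided evolution):

```python
import itertools
import math
import random

# (1) Formulation: maximize the minimum pairwise distance of n points
#     in the unit square (a packing-type problem).
# (2) Automated evaluation: the objective below needs no human in the loop.
# (3) Compute: naive random search; a real research agent would search
#     far more intelligently.

def min_pairwise_distance(points):
    """Automated evaluator: the objective to maximize."""
    return min(math.dist(a, b) for a, b in itertools.combinations(points, 2))

def random_search(n_points=5, iterations=2000, seed=1):
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(iterations):
        pts = [(rng.random(), rng.random()) for _ in range(n_points)]
        score = min_pairwise_distance(pts)
        if score > best_score:
            best, best_score = pts, score
    return best, best_score

pts, score = random_search()
print(f"best minimum separation found: {score:.3f}")
```

The known optimum for five points is sqrt(2)/2 ≈ 0.707 (four corners plus the center); the gap between what random search finds and that optimum is exactly the gap a smarter, AlphaEvolve-style search is designed to close.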

Microsoft’s vision of “medical superintelligence” exemplifies ambitions for AI capabilities exceeding human expertise in specialized domains. Whether this represents genuine superintelligence or narrow task-specific superiority remains debatable, but the trend is clear: AI systems increasingly outperform human experts on well-defined problems with clear success metrics.

The next frontier likely involves AI systems proposing which questions to investigate—moving from executing research agendas to helping formulate them. This requires capabilities in:

  • Identifying gaps: Recognizing unexplored areas in scientific knowledge
  • Impact prediction: Estimating which questions, if answered, would have greatest scientific or practical value
  • Resource allocation: Prioritizing research directions based on feasibility and importance
  • Cross-domain synthesis: Connecting insights from disparate fields to formulate novel hypotheses

Some of these capabilities are emerging. AI systems already assist with literature synthesis and gap identification. The transition to active agenda-setting remains largely unrealized but may characterize the next phase of AI research assistance.


Conclusion

2025 represents a transition point in AI-assisted science. Systems like AlphaEvolve demonstrate AI capabilities extending beyond tool-use to genuine hypothesis generation and validation. The evidence base is substantial: novel algorithms improving real infrastructure efficiency, solutions to open mathematical problems, diagnostic performance exceeding human experts on complex cases.

However, characterizing AI as an autonomous “research partner” risks anthropomorphizing systems that remain, fundamentally, sophisticated pattern recognition and optimization tools operating within carefully constructed frameworks. AlphaEvolve generates algorithms, but humans design the evolutionary framework, specify objective functions, and interpret results. MAI-DxO diagnoses diseases, but physicians select cases, validate conclusions, and make final treatment decisions.

The productive framing may be collaborative intelligence: AI systems excel at exhaustive search, pattern recognition, and rapid iteration across vast solution spaces; humans excel at problem formulation, conceptual insight, and contextual judgment. The most impressive achievements—Terence Tao’s collaboration with AlphaEvolve, Microsoft’s diagnostic orchestration integrating multiple specialized systems—leverage both capabilities in complementary ways.

As these systems become more capable and accessible, critical questions emerge about equity, verification, and accountability. Will AI research tools concentrate scientific capability in well-resourced institutions, or democratize access to powerful research assistance? How do we verify AI-generated hypotheses at scale when they outpace human review capacity? Who bears responsibility when AI recommendations lead to incorrect conclusions or harmful decisions?

These governance challenges should not obscure the fundamental achievement: AI systems now actively contribute to extending human knowledge in mathematics, medicine, computing, and other domains. Whether we call this “partnership,” “superintelligence,” or simply “very impressive tools,” the trajectory is clear and consequential for the future of scientific inquiry.


References

  • AlphaEvolve Team (2025). AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms. Google DeepMind. https://deepmind.google/blog/alphaevolve/
  • Google DeepMind (2025). AI as a research partner: Advancing theoretical computer science with AlphaEvolve. Google Research Blog. https://research.google/blog/ai-as-a-research-partner/
  • Microsoft AI (2025). The Path to Medical Superintelligence. Microsoft AI Blog. https://microsoft.ai/news/the-path-to-medical-superintelligence/
  • Tao, T. (2025). “Mathematical exploration and discovery at scale.” Terence Tao’s Blog. November 5, 2025.
  • Staufer, L., et al. (2025). “AlphaEvolve and mathematical problem solving: A collaboration with Google DeepMind.” arXiv preprint.
  • Fortune (2025, July 3). “Microsoft claims its AI tool can diagnose complex medical cases four times more accurately than doctors.” https://fortune.com/2025/07/03/microsoft-ai-diagnostic-orchestrator/
  • Nature (2025, May 15). “DeepMind unveils ‘spectacular’ general-purpose science AI.” doi:10.1038/d41586-025-01523-z
  • New Scientist (2025, November 18). “Mathematicians say Google’s AI tools are supercharging their research.”
  • Decrypt (2025, November 6). “Google DeepMind’s AlphaEvolve AI Finds New Paths to Unsolved Math Problems.”

This article is part of the AI in Science research series, documenting AI’s transformative impact on scientific discovery across disciplines.
Author: Oleh Ivchenko, PhD Candidate | Innovation Tech Lead | ML Scientist
Series: AI in Science | Published: February 23, 2026 | Rewritten: February 23, 2026
