AI Coding Far Outpaces Biology: Infrastructure Determines How Fast AI Gets Deployed

AI thrives in coding but stalls in biology — the bottleneck is infrastructure, not model capability.
AI has made remarkable strides in programming thanks to structured data, instant feedback loops, and modern tooling, yet progress in biology remains slow. The root cause isn't model capability but infrastructure: biological databases were built for human scientists, not AI Agents. Like cities designed before cars, these systems need fundamental restructuring — standardized APIs, machine-native data platforms, and unified knowledge graphs — to unlock AI's potential for scientific discovery.
AI Is Racing Ahead in Coding but Crawling in Biology — What's Going Wrong?
AI's progress in programming has been remarkable: from GitHub Copilot to a range of code Agents, AI can now tackle complex software engineering tasks on its own. GitHub Copilot launched in 2021 as a joint effort between GitHub and OpenAI; built on the OpenAI Codex model, it auto-completes and generates code snippets from context. Since then, code Agents have evolved further, from Devin (billed as the first AI software engineer) to AI IDEs like Cursor and Windsurf and terminal tools like Claude Code, taking AI coding from "autocomplete assistant" to "autonomous Agent." On standardized benchmarks like SWE-bench, top AI systems now resolve over 50% of the benchmark's real GitHub issues, which means they can navigate complex multi-file codebases, understand project architecture, and submit code changes that pass tests.
Yet in biology, despite massive investment, AI progress has been comparatively slow. What's behind this gap?
A blog post from New Science offers an elegant analogy: for AI Agents, biological databases are like cities built before the invention of the automobile — driving through them is maddening because the infrastructure was designed for an entirely different mode of "transportation."
Coding vs. Biology: A Structural Infrastructure Gap
Why Coding Became AI's First Breakthrough Domain
The programming domain has inherent structural advantages that make it the easiest arena for AI to deliver value:
- Uniform data formats: Code is inherently structured text, naturally suited for language model processing
- Instant feedback loops: Code can be compiled, executed, and tested — AI can quickly validate its own output
- Modern infrastructure: Tools like Git, API documentation, and package managers are inherently machine-friendly
- Clear evaluation criteria: Code either runs and passes its tests or it doesn't, so success and failure are largely unambiguous
Feedback loops play a critical role in AI learning. In programming, this loop is extremely tight: AI generates code → compiler checks syntax → test suite runs → clear pass/fail signal is returned → strategy is adjusted accordingly. The entire process can complete in seconds. This dense, immediate feedback enables reinforcement learning and iterative optimization strategies to operate efficiently, making it one of the key drivers behind the rapid improvement of AI coding capabilities.
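The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular coding Agent works: the "model" is a hypothetical fixed list of attempts (`CANDIDATES`), and `evaluate` and `feedback_loop` are names invented for this example. Python's built-in `compile` stands in for the compiler's syntax check.

```python
# Hypothetical stand-in for a model: a fixed list of successive attempts.
CANDIDATES = [
    "def add(a, b)\n    return a + b\n",   # syntax error: missing colon
    "def add(a, b):\n    return a - b\n",  # compiles, but the logic is wrong
    "def add(a, b):\n    return a + b\n",  # correct
]

# Test suite: (arguments, expected result) pairs for the function `add`.
TESTS = [((2, 3), 5), ((-1, 1), 0)]


def evaluate(source, tests):
    """Check syntax, then run the tests. Returns (passed, feedback message)."""
    try:
        code = compile(source, "<candidate>", "exec")  # syntax check
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    namespace = {}
    try:
        exec(code, namespace)  # define the candidate function
        for args, expected in tests:
            got = namespace["add"](*args)
            if got != expected:
                return False, f"add{args} returned {got}, expected {expected}"
    except Exception as exc:
        return False, f"runtime error: {exc!r}"
    return True, "all tests passed"


def feedback_loop(candidates, tests):
    """Try candidates in order, recording the signal each attempt produces."""
    history = []
    for attempt, source in enumerate(candidates, start=1):
        passed, message = evaluate(source, tests)
        history.append((attempt, passed, message))
        if passed:
            break
    return history


history = feedback_loop(CANDIDATES, TESTS)
for attempt, passed, message in history:
    print(attempt, "PASS" if passed else "FAIL", message)
```

Each failed attempt returns a specific, machine-readable reason within milliseconds, which is the kind of signal a real system would feed back into its next attempt.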
The Core Dilemma of Biological Data Infrastructure
Biological databases tell a completely different story. Decades' worth of them, from genome databases to protein structure repositories, were designed for human researchers. For example, GenBank (a nucleotide sequence database) has been around since 1982 and currently contains over 250 million nucleic acid sequences; UniProt (a protein database) holds more than 250 million protein sequence entries; and PDB (the Protein Data Bank) stores over 200,000 experimentally determined 3D structures. These databases use different data formats (such as FASTA, PDB format, and GFF), are maintained independently by different institutions, and differ in their query interfaces and data models.
The interfaces, data formats, and query methods of these databases all assume the user is a human scientist who can understand context and handle ambiguous information.
AlphaFold's breakthrough in protein structure prediction does not change this picture. It was a point solution: it solved a well-defined computational problem rather than systematically addressing how AI interacts with biological data infrastructure.
When AI Agents attempt to use these databases, the challenges include:
- Inconsistent data formats that make cross-database interoperability difficult
- Lack of standardized machine-readable interfaces
- Vast amounts of tacit knowledge embedded in human-readable but machine-opaque documentation
- Long experimental validation cycles that prevent the rapid iteration possible with code
The impact of the last point runs deeper than it first appears. Biological experiment validation cycles can span days (cell culture), weeks (animal experiments), or even years (clinical trials). This enormous difference in time scales means AI cannot learn and optimize through rapid trial-and-error in biology the way it does in programming. Methods that depend on dense feedback, such as reinforcement learning, are far harder to apply here.
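A rough back-of-the-envelope calculation makes the gap in time scales concrete. The cycle lengths below are assumptions chosen only to match the orders of magnitude mentioned above (seconds, days, weeks), and the calculation assumes cycles run strictly one after another with no parallel experiments.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Assumed cycle lengths, for illustration only.
CYCLE_SECONDS = {
    "code test run (assumed 10 seconds)": 10,
    "cell culture (assumed 3 days)": 3 * SECONDS_PER_DAY,
    "animal experiment (assumed 6 weeks)": 6 * 7 * SECONDS_PER_DAY,
}


def cycles_per_year(cycle_seconds):
    """Number of complete sequential feedback cycles that fit in one year."""
    return SECONDS_PER_YEAR // cycle_seconds


for label, seconds in CYCLE_SECONDS.items():
    print(f"{label}: {cycles_per_year(seconds):,} cycles per year")
```

Under these assumptions, a code test loop gets millions of feedback signals per year, while a wet-lab loop gets somewhere between a handful and about a hundred.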
"Cities Built Before Cars" — The Best Metaphor for Understanding AI Infrastructure Bottlenecks
This analogy is remarkably apt. Imagine those medieval European cities — narrow alleyways, irregular blocks, no parking spaces. These cities were designed with horse-drawn carriages and pedestrians in mind, not modern automobiles. You can drive through them, but the efficiency is abysmal.
Biological data infrastructure faces the same problem. When these systems were designed, the "user" was a PhD student with a notebook, not an AI Agent that needs programmatic access to massive datasets.
How to Build Agent-Friendly Scientific Data Infrastructure
This question points to the core bottleneck for AI applications in science: we need to fundamentally rethink the design philosophy of data infrastructure.
Short-Term, Actionable Solutions
- Build standardized API layers for existing databases
- Develop middleware for data format conversion
- Establish machine-readable metadata standards
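As one illustration of format-conversion middleware, the sketch below turns FASTA text into JSON records with explicit fields. It is a simplified example under stated assumptions: the record fields (`id`, `description`, `sequence`, `length`, `source_format`) are invented for this sketch rather than taken from any standard, the sample sequences are made up, and real FASTA headers vary by database. A production converter would more likely build on an established parser such as Biopython's `SeqIO`.

```python
import json

# Made-up sample input; not real database entries.
SAMPLE_FASTA = """\
>seq1 example peptide (made up)
MKTAYIAK
QRQISFVK
>seq2 example nucleotide (made up)
ACGT
"""


def _make_record(header, chunks):
    """Build one structured record from a header line and its sequence lines."""
    identifier, _, description = header.partition(" ")
    sequence = "".join(chunks)
    return {
        "id": identifier,
        "description": description,
        "sequence": sequence,
        "length": len(sequence),
        "source_format": "fasta",
    }


def parse_fasta(text):
    """Convert FASTA text into a list of JSON-serializable records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap across several lines
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


records = parse_fasta(SAMPLE_FASTA)
print(json.dumps(records, indent=2))
```

The point of the conversion is that a consumer no longer needs to know FASTA's line-wrapping and header conventions; it only needs to read named fields.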
Long-Term, Fundamental Restructuring
- Design Agent-native scientific data platforms from scratch
- Build automated experimental validation pipelines
- Create unified knowledge graphs that span across databases
Agent-native represents a fundamental shift from the current "human-first, machine-adapted" design paradigm. It means treating AI Agents as first-class users from the very beginning of system design: data stored in structured, semantically explicit formats; all operations exposed through programmatic APIs rather than GUIs; rich, machine-parseable metadata; and permissions and rate limits adapted to Agents' high-frequency access patterns. This is analogous to the Web's evolution from "webpage-first" to "API-first": the emergence of RESTful APIs and GraphQL let programs interact efficiently with web services without having to parse HTML pages the way web crawlers do. Scientific data infrastructure needs to undergo a similar paradigm shift.
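To make "operations exposed through programmatic APIs" and "machine-parseable metadata" more concrete, here is a small sketch of a self-describing operation catalogue. Everything in it is hypothetical: the toy records, the `operation` decorator, and the `describe` and `call` helpers are invented for illustration and do not correspond to any existing platform.

```python
import inspect
import json

# Toy in-memory "database"; the records are made up for illustration.
RECORDS = {
    "toy-1": {"id": "toy-1", "kind": "protein", "length": 16},
    "toy-2": {"id": "toy-2", "kind": "nucleotide", "length": 4},
}

REGISTRY = {}


def operation(func):
    """Register func and derive its parameter metadata from type annotations."""
    signature = inspect.signature(func)
    REGISTRY[func.__name__] = {
        "callable": func,
        "description": inspect.getdoc(func),
        "parameters": {
            name: param.annotation.__name__
            for name, param in signature.parameters.items()
        },
    }
    return func


@operation
def get_record(record_id: str) -> dict:
    """Return one record by its identifier."""
    return RECORDS[record_id]


@operation
def find_by_min_length(kind: str, min_length: int) -> list:
    """Return ids of records of a given kind with at least min_length residues."""
    return sorted(
        record["id"]
        for record in RECORDS.values()
        if record["kind"] == kind and record["length"] >= min_length
    )


def describe():
    """Machine-readable catalogue an Agent can read instead of human docs."""
    return [
        {
            "name": name,
            "description": entry["description"],
            "parameters": entry["parameters"],
        }
        for name, entry in REGISTRY.items()
    ]


def call(name, **arguments):
    """Invoke a registered operation by name with keyword arguments."""
    return REGISTRY[name]["callable"](**arguments)


print(json.dumps(describe(), indent=2))
print(call("find_by_min_length", kind="protein", min_length=10))
```

An Agent working against such a catalogue can discover what it is allowed to do, and with which arguments, without scraping documentation written for people.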
Building unified knowledge graphs is another critical direction. Knowledge graphs organize information in graph structures, expressing knowledge through nodes (entities) and edges (relationships). In a biological context, this means integrating different types of biological entities — genes, proteins, diseases, drugs, metabolic pathways — and their interrelationships into a computable network. Google's Knowledge Graph and Hetionet in the biomedical domain are relevant examples. However, building such graphs faces core challenges including ontology alignment (different databases define the same concept differently), relation extraction (automatically extracting entity relationships from literature), and dynamic updates (biological knowledge evolves rapidly).
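The node-and-edge idea can be illustrated with a toy graph built from a handful of well-known relationships. This is a sketch only: the relation names, the `ALIASES` table (a crude stand-in for ontology alignment), and the helper functions are assumptions made for this example and do not reflect the schema of Hetionet or any other real knowledge graph.

```python
from collections import defaultdict, deque

# A few (subject, relation, object) triples, chosen for illustration.
TRIPLES = [
    ("gene:TP53", "encodes", "protein:p53"),
    ("gene:ABL1", "encodes", "protein:ABL1"),
    ("drug:imatinib", "inhibits", "protein:ABL1"),
    ("gene:ABL1", "associated_with", "disease:chronic myeloid leukemia"),
    ("gene:TP53", "associated_with", "disease:Li-Fraumeni syndrome"),
]

# Different sources name the same entity differently; map them to one node.
ALIASES = {
    "drug:Gleevec": "drug:imatinib",
    "uniprot:P04637": "protein:p53",
}


def canonical(node):
    """Resolve an alias to the graph's canonical node name."""
    return ALIASES.get(node, node)


def build_graph(triples):
    """Adjacency list with explicit inverse edges so paths can go both ways."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        s, o = canonical(subject), canonical(obj)
        graph[s].append((relation, o))
        graph[o].append((f"inverse_{relation}", s))
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns [node, relation, node, ...] or None."""
    start, goal = canonical(start), canonical(goal)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [relation, neighbor])
    return None


graph = build_graph(TRIPLES)
path = shortest_path(graph, "drug:Gleevec", "disease:chronic myeloid leukemia")
print(" -> ".join(path))
```

Once entities from separate sources share one graph, a question like "how is this drug connected to this disease?" becomes a path query rather than a manual search across several databases.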
Beyond Biology: Broader Lessons for AI Infrastructure
This observation extends well beyond biology. Any domain where AI has yet to fully penetrate deserves scrutiny on whether its infrastructure is "Agent-friendly." Legal documents, medical records, industrial data — the digital infrastructure in these fields was mostly designed for humans.
The next wave of AI breakthroughs may not depend on improvements in model capabilities themselves, but on whether we can build the right "road systems" for AI. Just as cities needed to redesign their roads for automobiles, scientific data infrastructure needs fundamental restructuring for the AI era.
This is perhaps one of the most underestimated investment opportunities in AI infrastructure today. When we discuss the future of AI, we tend to focus on bigger models and more compute, overlooking a basic truth: even the most powerful car can't drive fast through medieval alleyways. Paving modern "roads" for AI — standardized data interfaces, machine-native knowledge representations, automated validation pipelines — may be the real key to unlocking AI's potential for scientific discovery.