How Jane Street Built a Custom AI Programming Toolchain for OCaml

Introduction: When a Top Quantitative Trading Firm Meets LLMs

John Crepezzi, who leads the AI Assistance team at Jane Street, one of the world's best-known quantitative trading firms, recently gave a detailed talk on how the firm applies large language models to its internal development tools. The talk is noteworthy not just because of Jane Street's technical reputation, but because the firm faces an unusual challenge: nearly its entire technology stack is built on OCaml, a niche programming language.

As a result, off-the-shelf AI programming tools are a poor fit, and the team had to build much of its AI-assisted development ecosystem itself. The lessons learned along the way are valuable for any team looking to apply LLMs within a non-mainstream technology stack.

Why Off-the-Shelf AI Programming Tools Don't Work

The Peculiarity of OCaml

OCaml (Objective Caml), created in 1996 by France's National Institute for Research in Digital Science and Technology (INRIA), is a prominent member of the ML language family. It combines functional, imperative, and object-oriented programming paradigms, and is known for its powerful static type system, type inference capabilities, and high-performance runtime. OCaml's type system catches a large class of runtime errors at compile time—critical for ensuring correctness in financial trading systems.

Jane Street's adoption of OCaml was no accident. The company began using OCaml at scale around 2005, believing its type safety and expressiveness would reduce the risk of logic errors in trading systems. Jane Street has since become arguably OCaml's most prominent industrial user and contributor, maintaining widely used open-source libraries such as Core and Async. Outside the firm, OCaml is used mainly in theorem proving, formal verification, and programming language tooling, but Jane Street uses it for nearly everything: web applications are written in OCaml and compiled to JavaScript via Js_of_ocaml, Vim plugins are written in OCaml and transpiled to Vimscript via VCaml, and even FPGA designs are written using the Hardcaml library.

Model support for OCaml is extremely poor, for a simple reason—there's too little training data. John stated plainly that the volume of OCaml code inside Jane Street likely exceeds the total amount of OCaml code everywhere else in the world combined.

The Integration Cost of a Custom Tool Ecosystem

Beyond the language itself, Jane Street relies on a largely in-house toolchain: a custom build system, a distributed build environment, a code review system called Iron (similar to a PR system), and a massive monorepo based on Mercurial. On top of that, 67% of employees use Emacs.

A monorepo (monolithic repository) is an engineering practice in which all of a company's project code lives in a single version control repository; it has been widely adopted by large technology companies such as Google, Meta, and Twitter. Compared to multi-repo approaches, monorepos make cross-project refactoring easier, dependency relationships more transparent, and code reuse more convenient. Jane Street chose Mercurial (Hg) over Git partly because of Mercurial's historical advantages in large-scale repository management and its extensibility.

For AI-assisted development, a monorepo means the model must work with dependency relationships spanning millions of lines of code. That places higher demands on how the context window is used and is a key reason RAG (Retrieval-Augmented Generation) is especially valuable for code. Taken together, these technology choices make integrating external AI programming tools extraordinarily difficult.

[Figure: Jane Street's technology choice challenges]

From Meta's Paper to Self-Trained Models

Inspiration from the CodeCompose Paper

Jane Street's confidence came from Meta's CodeCompose paper. Published in 2023, it described in detail how Meta built a code completion system and reported the results of fine-tuning models for the Hack language. Hack is a programming language released by Meta in 2014. It evolved from PHP, adds a gradual type system, and was designed for Meta's large-scale server-side development. Hack is used almost exclusively within Meta, a situation closely analogous to OCaml at Jane Street: both are "single major customer" languages with very little public training data. CodeCompose's core contribution, from Jane Street's perspective, was showing that fine-tuning on internal enterprise code can significantly improve model performance on a niche language. There is also a technical link between the two languages: Hack's toolchain, notably its type checker, was itself implemented largely in OCaml.

However, the team quickly discovered that the naive idea of "just show the model a bunch of code and it'll learn" was overly simplistic. To achieve good results, the model needs to see large quantities of training samples whose shape matches the target task.

A Clear Goal: Generating Applicable Diffs

The team set a clear objective: users type a natural language description in their editor, and the model generates diffs that may span multiple files. These diffs need to apply cleanly and have a high probability of passing type checking. The target scope was kept under 100 lines.

The training data shape is: context + prompt + diff. The question was: how do you obtain such data at scale?
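The data shape and the scope constraint above can be sketched in a few lines of Python. This is an illustration only: the names `TrainingExample`, `changed_line_count`, and `within_target_scope`, and the choice of a unified diff as the diff format, are assumptions and not Jane Street's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    """One supervised sample: context + prompt -> diff (hypothetical schema)."""
    context: str  # code the model sees around the edit location
    prompt: str   # terse, editor-style instruction
    diff: str     # target change, assumed here to be a unified diff


def changed_line_count(unified_diff: str) -> int:
    """Count added and removed lines, skipping '---'/'+++' file headers.

    Simplification: a removed line whose own text starts with '--' would be
    mistaken for a header; a real parser would track hunk boundaries.
    """
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            count += 1
    return count


def within_target_scope(example: TrainingExample, max_lines: int = 100) -> bool:
    """True if the diff changes fewer than max_lines lines (assumed budget)."""
    return changed_line_count(example.diff) < max_lines


example = TrainingExample(
    context="let x = 1\nlet y = x\n",
    prompt="set x to 2",
    diff=(
        "--- a/foo.ml\n"
        "+++ b/foo.ml\n"
        "@@ -1,2 +1,2 @@\n"
        "-let x = 1\n"
        "+let x = 2\n"
        " let y = x\n"
    ),
)
print(changed_line_count(example.diff), within_target_scope(example))
```

A filter like `within_target_scope` is one way to discard candidate samples that are too large for the target task before they reach training.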

Workspace Snapshots: An Innovative Approach to Training Data

Why Features and Commits Aren't Enough

The most intuitive data source is Features (similar to PRs) from the code review system, since each has a human-written description and corresponding code changes. There are two problems, however. First, Feature descriptions are written very differently from editor instructions: the former are formal, multi-paragraph descriptions, while the latter are terse commands like "fix that bug". Second, Features typically involve 500-1000 lines of changes, far exceeding the target scope.

Commits aren't ideal either—Jane Street's commits primarily serve as checkpoints during development, lacking meaningful descriptions and not representing isolated changes.

[Figure: Training data acquisition challenges]

Core Design of the Workspace Snapshot Approach

The final solution was Workspace Snapshotting. The system takes a snapshot of each developer's workstation every 20 seconds, simultaneously recording build state. By identifying specific patterns, the team can extract high-quality training data:

Green-Red-Green pattern: A developer makes an isolated change (from passing build to broken build back to passing)
Red-Green pattern: A developer encounters an error and fixes it, useful for training the model to recover from errors
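
As a rough sketch of how these two patterns might be detected, the Python below scans a time-ordered list of snapshots for build-state transitions. The `Snapshot` fields and the `find_patterns` function are hypothetical; the source does not describe the real record format or matching rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    timestamp: int     # seconds since the session started (assumed field)
    build_green: bool  # build state recorded with the snapshot
    tree_id: str       # hypothetical identifier for the workspace contents


Pattern = Tuple[str, Snapshot, Snapshot]  # (kind, start snapshot, end snapshot)


def find_patterns(snaps: List[Snapshot]) -> List[Pattern]:
    """Find Red-Green and Green-Red-Green spans in time-ordered snapshots."""
    patterns: List[Pattern] = []
    last_green: Optional[Snapshot] = None  # last green before the current red run
    prev: Optional[Snapshot] = None
    in_red_run = False
    for snap in snaps:
        if snap.build_green:
            if in_red_run and prev is not None:
                # Red -> Green: the developer fixed a broken build.
                patterns.append(("red-green", prev, snap))
                if last_green is not None:
                    # Green -> Red... -> Green: one isolated change.
                    patterns.append(("green-red-green", last_green, snap))
            last_green = snap
            in_red_run = False
        else:
            in_red_run = True
        prev = snap
    return patterns


states = [True, False, False, True, True, False, True]
snaps = [Snapshot(20 * i, green, f"tree{i}") for i, green in enumerate(states)]
for kind, start, end in find_patterns(snaps):
    print(kind, start.timestamp, end.timestamp)
```

In this sketch the diff for a Green-Red-Green sample would be taken between the two green snapshots, and the diff for a Red-Green sample between the last red snapshot and the first green one after it.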

For the description component, the team uses an LLM to generate detailed change descriptions, then progressively condenses them to the length a human would actually type in an editor. This approach cleverly transforms everyday development activity into structured training data.

Reinforcement Learning and the Code Evaluation Service

Defining "Good Code"

The second phase of training is reinforcement learning, which requires defining what "good code" means:

Parseable: Passes the OCaml parser
Type-checks: Passes static type system verification
Compiles and passes tests: The ultimate standard

Applying reinforcement learning (RL) to code generation is an important recent development in the LLM field. Unlike RLHF (Reinforcement Learning from Human Feedback), which relies on human-annotated preferences, the code domain has naturally verifiable reward signals: whether code compiles and whether a test suite passes are objective checks that require no human intervention. This approach is often described as reinforcement learning from execution feedback. Earlier work such as OpenAI's Codex and DeepMind's AlphaCode already relied on executing generated programs against tests to evaluate or filter candidates, and subsequent research has used the same signal as a training reward. The "verifiable rewards" idea is also being extended to other formally checkable domains such as mathematical proofs and logical reasoning.
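
To make the three criteria concrete, here is a small Python sketch of a staged reward. The source lists the criteria only; the partial-credit values (0.25, 0.5) and the names `EvalResult` and `reward` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Outcome of checking one generated diff (hypothetical record)."""
    parses: bool
    type_checks: bool
    builds_and_tests_pass: bool


def reward(result: EvalResult) -> float:
    """Map the staged checks to a scalar reward.

    The stages are ordered: a diff that does not parse cannot type-check,
    and one that does not type-check cannot build. The numeric values are
    illustrative assumptions, not values from the talk.
    """
    if not result.parses:
        return 0.0
    if not result.type_checks:
        return 0.25
    if not result.builds_and_tests_pass:
        return 0.5
    return 1.0


print(reward(EvalResult(parses=True, type_checks=True, builds_and_tests_pass=False)))
```

A strictly binary reward (1.0 only when everything passes) is the simpler alternative; graded credit is shown here only because the criteria form a natural ladder.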

CES: Code Evaluation Service Architecture

The team built a Code Evaluation Service (CES) whose core design principle is pre-warmed builds. Each worker holds a build that is green at a given revision, then continuously receives model-generated diffs, applies them, and reports whether the build stays green or turns red. Because the build environment is already warm, a check that would otherwise incur minutes of cold-start build time returns within seconds, which makes large-scale RL training loops feasible. Run continuously over a period of months, this feedback loop progressively improves the model's ability to generate code that compiles and passes tests.
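
The control flow of such a worker can be sketched as follows. This Python toy shows only the loop structure: hold a known-green base, apply each candidate to a copy, report green or red, and leave the base untouched. The `EvalWorker` class, the dictionary-based `Patch` stand-in for a real diff, and the `toy_build` check are all assumptions; a real incremental build system is not modeled.

```python
from typing import Callable, Dict, Iterable, List

Workspace = Dict[str, str]  # path -> file contents
Patch = Dict[str, str]      # path -> new contents (stand-in for a real diff)


class EvalWorker:
    """Holds one known-green base revision and evaluates patches against it."""

    def __init__(self, base: Workspace, is_green: Callable[[Workspace], bool]):
        if not is_green(base):
            raise ValueError("base revision must build green")
        self._base = dict(base)
        self._is_green = is_green

    def evaluate(self, patch: Patch) -> bool:
        """Apply the patch to a copy of the base and report green (True) or red."""
        candidate = {**self._base, **patch}  # the base itself is never mutated
        return self._is_green(candidate)


def evaluate_all(worker: EvalWorker, patches: Iterable[Patch]) -> List[bool]:
    return [worker.evaluate(patch) for patch in patches]


def toy_build(workspace: Workspace) -> bool:
    """Toy stand-in for 'the build is green': no file contains a marker string."""
    return all("TYPE_ERROR" not in source for source in workspace.values())


worker = EvalWorker({"a.ml": "let x = 1", "b.ml": "let y = 2"}, toy_build)
results = evaluate_all(
    worker,
    [{"a.ml": "let x = 2"}, {"b.ml": "let y = TYPE_ERROR"}],
)
print(results)
```

Because every patch is evaluated against the same base, one bad diff cannot poison the results for the diffs that follow it.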

[Figure: Code Evaluation Service architecture]

The same infrastructure is also used for evaluation—a portion of RL data is reserved as a test set for continuous monitoring of model performance.

Hard Lessons from Training

Training can produce catastrophic yet hilarious results. The team once trained a code review model, investing months of effort, only to have the model respond "I'll do it tomorrow" when first asked to review code. The reason was simple—the training data contained large volumes of real responses from human reviewers, and humans do indeed say things like that. This case powerfully illustrates the importance of establishing meaningful evaluation frameworks.

AID Architecture: A Unified AI Development Environment Across Editors

Architecture Design Principles

Facing the need to support three editors (Neovim, VS Code, Emacs), the team established three principles:

Don't reinvent the wheel: Context construction and prompting strategies are written once
Stay flexible: Be able to switch models or prompting strategies at any time
Be measurable: Collect metrics on latency, diff application success rates, etc.

[Figure: AID architecture design]

The AID Sidecar Service Implementation

Sidecar is an architectural pattern that originated in the microservices and cloud-native world, where it is widely used in container orchestration systems such as Kubernetes. Its core idea is to decouple auxiliary functionality from the main application and run it as an independent process alongside that application, communicating via local IPC (Inter-Process Communication) or HTTP. The pattern has three advantages: the main application doesn't need to know the implementation details of the auxiliary logic, the auxiliary service can be updated and deployed independently, and main applications built on different technology stacks can share the same auxiliary service. Similar designs appear in other AI programming tools. GitHub Copilot's editor plugins, for example, talk to a separate language server process instead of embedding the logic in each editor.

The final architecture is a sidecar service called AID (AI Development Environment) that runs on each developer's local machine. AID handles all the heavy lifting—building prompts, organizing context, checking build state, communicating with LLMs. Each editor only needs to implement a thin UI layer.
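
The division of labor can be sketched in Python as below. Everything here is hypothetical (the `Sidecar` class, the `EditorRequest` fields, and the callable used as a model backend); the point is only that prompt construction lives in one place and the model behind it can be swapped without touching any editor.

```python
from dataclasses import dataclass
from typing import Callable, Dict

ModelBackend = Callable[[str], str]  # prompt text -> model response


@dataclass(frozen=True)
class EditorRequest:
    """What a thin editor layer sends to the sidecar (assumed fields)."""
    editor: str
    path: str
    contents: str
    instruction: str


class Sidecar:
    """Owns context building and model selection; editors only render results."""

    def __init__(self, backends: Dict[str, ModelBackend], active: str):
        if active not in backends:
            raise KeyError(active)
        self._backends = dict(backends)
        self._active = active

    def switch_backend(self, name: str) -> None:
        """Swap the model without any change on the editor side."""
        if name not in self._backends:
            raise KeyError(name)
        self._active = name

    def build_prompt(self, request: EditorRequest) -> str:
        """Written once and shared by every editor."""
        return (
            f"File: {request.path}\n{request.contents}\n\n"
            f"Instruction: {request.instruction}"
        )

    def handle(self, request: EditorRequest) -> str:
        return self._backends[self._active](self.build_prompt(request))


# Stub backends stand in for real model calls.
sidecar = Sidecar(
    {"model_a": lambda p: "A:" + p, "model_b": lambda p: "B:" + p},
    active="model_a",
)
request = EditorRequest("emacs", "foo.ml", "let x = 1", "rename x to count")
print(sidecar.handle(request).splitlines()[0])
sidecar.switch_backend("model_b")
print(sidecar.handle(request).splitlines()[0])
```

In a real deployment the editors would reach the sidecar over local IPC or HTTP; the direct method call here just keeps the sketch self-contained.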

The advantages of this architecture are obvious: updating AID doesn't require restarting the editor—just restarting the sidecar service gives everyone the latest version. The team can also run A/B tests—routing 50% of users to one model and 50% to another, comparing acceptance rates.
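
One common way to implement such a split is deterministic hashing of a user identifier, sketched below in Python. The function names, the arm labels, and the event format are assumptions; the source says only that users are split between models and acceptance rates are compared.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple


def assign_arm(user_id: str, arms: Sequence[str] = ("model_a", "model_b")) -> str:
    """Deterministically map a user to an arm so they always see the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return arms[int.from_bytes(digest[:8], "big") % len(arms)]


def acceptance_rates(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accepted / shown per arm from (arm, accepted) events."""
    shown: Dict[str, int] = defaultdict(int)
    accepted: Dict[str, int] = defaultdict(int)
    for arm, was_accepted in events:
        shown[arm] += 1
        if was_accepted:
            accepted[arm] += 1
    return {arm: accepted[arm] / shown[arm] for arm in shown}


events = [
    ("model_a", True),
    ("model_a", False),
    ("model_b", True),
    ("model_b", True),
]
print(assign_arm("alice") == assign_arm("alice"))
print(acceptance_rates(events))
```

Hashing avoids storing an assignment table and keeps each user on the same model across sidecar restarts.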

Editor-Specific Interaction Experiences

In VS Code, AID appears as a sidebar interface, similar to Copilot but supporting multiple candidate results. In Emacs, considering that users are accustomed to working in text buffers, the team built the entire AI interaction experience as a Markdown buffer—users can move, copy, and edit it like a regular file, appending new content via keyboard shortcuts.

Key Takeaways and Future Directions

Jane Street's AI engineering practices reveal several important principles:

Data acquisition requires creativity: Workspace snapshotting is an ingenious solution that transforms everyday development activity into high-quality training data
Verifiability is key: Using compilers and tests as reinforcement learning reward signals is more reliable and scalable than human annotation
Architectural pluggability: In an era of rapid LLM iteration, keeping system layers decoupled is crucial
Evaluation first: Without a reliable evaluation framework, training investments can be completely wasted

The team is currently exploring new applications of RAG in editors, large-scale multi-agent workflows, and deep integration of reasoning models. The core methodology remains unchanged: stay pluggable, build solid foundations, and enable other teams across the company to extend platform capabilities through domain-specific tools.

Summary

Jane Street's use of OCaml, an extremely niche language, renders off-the-shelf AI programming tools virtually unusable, necessitating a fully custom LLM-assisted development system
The team uses workspace snapshots (captured every 20 seconds) combined with build state changes to automatically generate high-quality training data, solving the problem of traditional PR/commit data being a poor fit
They built a Code Evaluation Service (CES) that uses compilers and tests as reinforcement learning reward signals, running for months to align the model's ability to generate compilable code
They designed the AID (AI Development Environment) sidecar architecture, writing core logic once to support VS Code, Emacs, and Neovim
Core engineering philosophy: maintain pluggability, establish reliable evaluation frameworks, and ensure rapid LLM iteration propagates to all end users at minimal cost