Building an AI Software Factory from Scratch: A Cursor Engineer's Hands-On Experience with Multi-Agent Collaboration

Introduction: The Vision and Reality of a Software Factory

Cursor engineer Eric shared his hands-on experience at the AI Engineer conference with dogfooding the product internally and gradually building a "software factory." Dogfooding means a company uses its own product in daily work. The practice is commonly traced to Microsoft's early "eat your own dog food" culture, and it aims to surface product shortcomings in real work scenarios so the team can iterate quickly. Eric admitted that they have not yet reached a fully automated software factory, but many internal subsystems already run quite autonomously.

Just as real-world hardware factories need assembly lines, management layers, and observability, software factories can borrow the same concepts. The talk covered the full path: autonomy levels, factory prerequisites, running the factory, and scaling up.

Six Levels of Software Automation

Eric referenced a blog post by Dan Shapiro from earlier this year, dividing software automation into six levels:

Level 0-1: Spicy Autocomplete — Cursor's starting point in 2022-2023
Level 2-3: Pair programming stage, where humans and Agents interact back and forth, Agents generate most code, humans review
Level 4: Manager mode — delegate as much work as possible to Agents, humans primarily review outputs rather than code itself
Level 5: Dark Factory — fully black-box operation where Agents autonomously code, test, and deploy; humans only provide intent and instructions

The "Dark Factory" concept originates from manufacturing, where it refers to fully unmanned, fully automated factories that don't even need lighting: because no one works inside, there's no need to turn on the lights. In software, this means that once humans have supplied intent, no one intervenes in requirements analysis, code generation, test verification, or deployment, and the system keeps producing usable software artifacts like a black box.

Eric said he's currently at Level 4 — for most software projects, he delegates work to Agents as much as possible, only occasionally looking at code directly.

Why Build an AI Software Factory

Three core motivations for building a software factory:

Throughput increase: Agents can run 24/7 without sleep or food. You can run more Agents simultaneously to increase output.

Consistent output: Assembly-line-style work produces consistent output, so a well-built factory delivers very stable quality. Without guardrails, though, Agent behavior becomes increasingly random, and that drift is precisely the signal that constraints need strengthening.

Amplify creativity: You can better leverage your own taste and creativity instead of waiting to write code line by line.

Core Elements of Building the Factory

Primitives & Patterns

Code structure is crucial. A modular codebase lets Agents discover all relevant files within a single directory without needing to grep across the entire codebase. Eric made an insightful point: if you as a human can easily get up to speed on a new codebase, an Agent probably can too.

Patterns matter equally — are there standardized authentication methods, startup scripts, test-writing conventions? If these boilerplate patterns exist, you can have Agents replicate by referencing existing implementations. This is essentially making implicit knowledge explicit: when team best practices are encoded as reusable templates, both new human engineers and AI Agents can quickly understand "how things should be done here."

Guardrails

Guardrails have three dimensions:

Hooks: Restrict Agents from modifying specific sensitive code areas such as encryption or authentication modules. Hooks work much like Git's pre-commit hooks: they intercept an operation the Agent is about to perform and validate it before it goes through.
Rules: Eric emphasized that rules should emerge dynamically rather than being installed up front to cover every possible case. Create a rule only when you notice an Agent going off track. This aligns with the agile "just enough" documentation philosophy: excessive upfront rules create maintenance burden, while reactively created rules tend to be more precise and targeted.
Tests: Let Agents verify their own work by running tests to confirm existing functionality isn't broken.
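
As a rough illustration of the hook idea above, here is a minimal Python sketch that checks a proposed edit against a list of protected areas. The glob patterns, function names, and return shape are assumptions made for this example; they are not Cursor's actual hook interface.

```python
from fnmatch import fnmatch

# Hypothetical protected areas; a real project would list its own paths.
PROTECTED_GLOBS = ["src/auth/*", "src/crypto/*"]


def blocked_paths(changed_files, protected=PROTECTED_GLOBS):
    """Return the changed files that fall inside a protected area."""
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in protected)
    ]


def check_edit(changed_files):
    """Allow the edit only if no protected file is touched."""
    blocked = blocked_paths(changed_files)
    return {"allow": not blocked, "blocked": blocked}
```

A hook runner could call check_edit with the files an Agent wants to modify and refuse the operation when "allow" is False, handing the blocked paths back to the Agent as feedback.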

Enablers

This is the most exciting part — giving Agents more capabilities:

Skills & MCP: Let Agents access external context and understand how to implement specific features. MCP (Model Context Protocol) is an open protocol released by Anthropic in late 2024 that defines standardized interfaces for AI model interactions with external tools and data sources — like a USB port for the AI world. Through MCP, Agents can connect to databases, call APIs, read documentation, greatly expanding their capability boundaries.
Feature Flags: Agents can autonomously add feature flags, merge PRs, and tell you "turn on this flag if you want to try it, roll back if you don't like it." Feature flags are a mature software engineering practice allowing teams to dynamically enable or disable specific features without redeploying code. They're widely used in gradual rollouts, A/B testing, and risk control — when an Agent-generated feature has issues, simply flipping the switch rolls back instantly without reverting code.
Self-bootstrapping environments: Can the Agent spin up its own dev environment? If so, it can scale out across as many isolated VMs as you are willing to run.
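
To make the feature flag pattern concrete, the sketch below gates a new code path behind a flag that defaults to off. The FeatureFlags class and the "new_checkout" flag name are hypothetical and exist only for illustration; production teams typically use a flag service rather than an in-memory dictionary.

```python
class FeatureFlags:
    """In-memory flag store; real systems usually back this with a service."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name):
        # Unknown flags default to off so new code paths stay dark.
        return self._flags.get(name, False)


def render_checkout(flags):
    # "new_checkout" is a hypothetical flag an Agent might add with its PR.
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Because the flag is off by default, the Agent's PR can merge without changing behavior. Turning the flag on tries the feature, and turning it off again restores the old path without a code revert.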

Running the Factory: From Worker to Manager

Mindset Shift

Once the factory is running, the most important thing is shifting your mindset:

From worker to manager
From synchronous to asynchronous
From writing code to reviewing outputs

Eric noted this aligns perfectly with human organizational management principles: small team → add people → need managers → managers of managers. The Agent world is the same — you're just continuously climbing abstraction levels.

Parallelization & Context Management

As a manager, you need to think about how to divide and parallelize work. Key principles:

One work unit per Agent
Avoid two Agents modifying the same code area simultaneously (creates merge conflicts)
Retain the code's "tribal knowledge" — understanding data flow, user needs, key modules
Front-load context: Through plans or detailed specs, give Agents sufficient information before sending tasks
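
One way to apply the "no two Agents in the same code area" principle is to check planned work units for overlapping directory scopes before dispatching them. The following is a minimal sketch under that assumption; the unit names and the idea of declaring a directory scope per unit are illustrative, not something described in the talk.

```python
from itertools import combinations


def overlapping_units(work_units):
    """Return pairs of work units whose directory scopes overlap.

    work_units maps a unit name to the set of directories it may modify.
    """
    def overlaps(a, b):
        # Two scopes collide if one directory equals or contains the other.
        return any(
            x == y or x.startswith(y + "/") or y.startswith(x + "/")
            for x in a
            for y in b
        )

    return [
        (first, second)
        for first, second in combinations(sorted(work_units), 2)
        if overlaps(work_units[first], work_units[second])
    ]
```

Any pair this returns is a candidate for merging into one work unit or for running sequentially instead of in parallel.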

"Tribal Knowledge" is an important concept in software engineering, referring to implicit knowledge that exists in team members' heads but isn't documented — like "this API was designed this way because of a historical customer's special requirement" or "this code looks redundant but deleting it causes edge cases to crash." In Agent collaboration, making this knowledge explicit and injecting it into context becomes especially critical.

Eric shared his daily workflow: typically running 5-10 Cloud Agents simultaneously, planning the next task or handling small synchronous matters during wait times. He prefers the rhythm of "plan synchronously, execute asynchronously."
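
The "plan synchronously, execute asynchronously" rhythm can be sketched with asyncio: specs are written first, then one Agent run is started per spec with a cap on parallelism. The run_agent function here is a placeholder invented for this example; a real version would call whatever agent service you use.

```python
import asyncio


async def run_agent(task, spec):
    """Stand-in for a cloud Agent run; a real call would hit an agent API."""
    await asyncio.sleep(0)  # placeholder for long-running remote work
    return {"task": task, "status": "ready_for_review", "spec": spec}


async def dispatch(specs, max_parallel=5):
    """Run one Agent per planned spec, capped at max_parallel at a time."""
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(task, spec):
        async with limit:
            return await run_agent(task, spec)

    return await asyncio.gather(*(guarded(t, s) for t, s in specs.items()))
```

Calling asyncio.run(dispatch(specs)) with a dictionary of task names and specs returns the results in the order the specs were given, ready for review.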

The Importance of Isolated Environments

Eric strongly recommends using isolated VM environments rather than shared workspaces:

Each VM can run an independent database, internal tools, and the Cursor application itself. While more expensive and complex to set up, once built, it scales to 100 or even 1000 Agents.

Cursor internally may run thousands of Agents daily, all working on copies of the codebase. This isolation strategy borrows from containerization and microservices architecture thinking — each Agent has a completely independent runtime environment with no resource contention or state pollution, enabling true horizontal scaling.

The Automation Flywheel: Real-World Examples

Eric showcased several internal automation practices at Cursor:

Daily report automation: Automatically aggregates information from Slack and GitHub to generate daily work summaries.

PR comment learning: Automatically collects human review comments from merged PRs, stored as high-value signals to help Agents continuously learn. The core insight: human feedback in code reviews contains vast amounts of implicit knowledge about code quality, architecture preferences, and team norms — a more vivid training signal than any static documentation.

Smart Code Owner: Replaces the traditional static code owner mechanism. The system evaluates each PR's risk level: low-risk changes (e.g., variable renaming) get auto-approved, while high-risk changes pull in the relevant people for review. According to the talk, this addresses the bottleneck that code owners cause in 20% of cases. The traditional CODEOWNERS file is a GitHub mechanism that designates review owners for specific code paths. When branch protection requires code owner review, any change to such a path must be approved by a designated owner before merging, which often becomes a process bottleneck on teams that commit frequently.
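
The talk does not describe how risk is scored, so the sketch below uses a simple rule-based stand-in to show the routing shape only. The sensitive path prefixes and the line-count threshold are assumptions chosen for illustration.

```python
# Hypothetical sensitive areas and size threshold, chosen for illustration.
SENSITIVE_PREFIXES = ("src/auth/", "src/payments/")
MAX_AUTO_APPROVE_LINES = 20


def route_pull_request(changed_files, lines_changed):
    """Return 'auto_approve' for low-risk PRs, else 'human_review'."""
    touches_sensitive = any(
        path.startswith(SENSITIVE_PREFIXES) for path in changed_files
    )
    if touches_sensitive or lines_changed > MAX_AUTO_APPROVE_LINES:
        return "human_review"
    return "auto_approve"
```

A real system would likely combine signals like these with a model's judgment of the diff, but the outcome is the same two-way split: approve automatically or pull in a reviewer.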

Continuous learning plugin: Automatically analyzes historical conversation records to extract memories and rules. For example, if you repeatedly correct an Agent to "use this component not that one," the system automatically generates corresponding rules.
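
A minimal sketch of the "repeated correction becomes a rule" idea follows. It assumes corrections have already been extracted from conversation history as (wrong, preferred) pairs, which is the hard part and is not shown; the threshold and the rule wording are also assumptions.

```python
from collections import Counter


def propose_rules(corrections, min_repeats=3):
    """Turn corrections seen at least min_repeats times into candidate rules.

    corrections is a list of (wrong, preferred) pairs pulled from past
    conversations; how they are extracted is outside this sketch.
    """
    counts = Counter(corrections)
    return [
        f"Use {preferred} instead of {wrong}."
        for (wrong, preferred), seen in counts.most_common()
        if seen >= min_repeats
    ]
```

Requiring several repeats before proposing a rule keeps one-off corrections from turning into permanent constraints.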

Feature flag auto-cleanup: When a feature flag runs at 100% for over two weeks, the system automatically creates a Linear ticket triggering a Cloud Agent to remove that flag. This addresses a common technical debt problem — feature flags that aren't cleaned up promptly fill code with conditional branches, increasing complexity and maintenance costs.
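
The detection step of this cleanup loop can be sketched as a filter over flag rollout data. The input shape (flag name mapped to rollout percent and the date it reached that percent) is an assumption for this example; creating the ticket and triggering the Agent are left out.

```python
from datetime import date, timedelta


def flags_to_clean_up(flags, today, min_age=timedelta(days=14)):
    """Return flag names fully rolled out for longer than min_age.

    flags maps a flag name to (rollout_percent, date it reached that percent).
    """
    return sorted(
        name
        for name, (percent, since) in flags.items()
        if percent == 100 and today - since > min_age
    )
```

Each returned name would become one cleanup ticket, which keeps every removal a small, isolated work unit for a single Agent.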

Scaling: From 5 Agents to 1000 Agents

Observe and Improve

The core of scaling is establishing a feedback flywheel:

Observe Agent output results
Identify deviations (e.g., incorrect database schema naming)
Create rules or design systems to correct them
Let Agents use these improvements in the next iteration

This is essentially a PDCA (Plan-Do-Check-Act) cycle applied to AI collaboration: continuous observation and continuous improvement, so that the quality of the system's output improves steadily over time.

GUI Testing & Verification

Eric demonstrated using Cloud Agent's computer use capability for GUI testing — the Agent starts a local server, actually clicks UI elements, and records operation videos for humans to quickly verify. This is especially valuable for frontend change verification. Computer use is a capability that lets AI Agents directly operate graphical interfaces — Agents can move the mouse, click buttons, and type text just like humans, thereby verifying UI-level correctness.

Cursor Workers: Localized Scaling

Cursor's newly released Workers feature allows running the same infrastructure and orchestration layer as Cloud Agents on any machine. You can start a worker daemon on a Mac mini, local VM, or any cloud platform, managed uniformly through Cursor Cloud. This means teams can leverage existing compute resources (like idle dev machines or private clouds) to run Agents without fully depending on Cursor's cloud resources, while also meeting compliance requirements in data-sensitive scenarios.

Key Recommendations & Summary

Eric concluded with five core recommendations:

Be intentional: Think clearly about the actual problem you're solving
Don't outsource important decisions: Security, payments, authentication — critical decisions must be made by humans
Build tools and systems: Find flywheels and codify them
Store context: Save Agent conversation logs and quality outputs to help Agents understand what "good" looks like
Set Agents free: Think about what Agents need. A team at Lovable gave their Agents a "vent tool" — Agents complain in a Slack channel about resources they can't access, and these complaints become extremely valuable improvement signals

A software factory isn't built overnight; it requires continuous investment and iteration. But as model capabilities improve, the investment in building and maintaining these systems becomes increasingly worthwhile, because the frameworks and guardrails they provide hold more long-term value than hand-written code.

Key Takeaways

Software automation spans six levels, from autocomplete to fully autonomous dark factories; most developers are currently at Level 2-3 pair programming
Building a software factory requires three core elements: code primitives & patterns, guardrails (hooks/rules/tests), and enablers (skills/MCP/isolated environments)
The mindset shift from worker to manager is key — plan synchronously, execute asynchronously, front-load context, use isolated VM environments for infinite scaling
Automation flywheels include smart Code Owners, PR comment learning, feature flag auto-cleanup, continuous learning plugins, and other real-world examples
Rules should emerge dynamically rather than be preset; storing context and building tool systems has more long-term value than manual coding