Harness Engineering in Practice: Building an Enterprise E-Commerce System with Claude Code

From Prompt Engineering to Harness Engineering: Three Paradigm Shifts in AI Programming

If you're still stuck in the "just write a prompt and call it a day" phase, you may already be falling behind. Recently, a concept called Harness Engineering has been rapidly gaining traction in enterprise AI programming. It represents a fundamental shift from "Q&A-style" AI-assisted development to an "engineered pipeline" approach.

This article walks through the complete path from concept to implementation of Harness Engineering, based on a real-world Java e-commerce system project. It breaks down how programmers can upgrade their AI programming capabilities from a personal productivity tool to an enterprise-grade production system.

[Figure: AI Engineering Paradigm Evolution]

Three Stages of AI Engineering Evolution

Stage 1: Prompt Engineering

When ChatGPT first launched, everyone was studying how to write better prompts — the most basic form of human-AI interaction, essentially a question-and-answer exchange. Ask well, get good answers; ask poorly, get poor answers.

Core prompt engineering techniques include Role Prompting, Chain of Thought, Few-shot Learning, and more. These techniques work remarkably well for simple tasks, but for complex enterprise-level projects, the information capacity and control precision of a single prompt fall far short.

Stage 2: Context Engineering

As tasks grew more complex, prompts alone were no longer sufficient. For example, if you ask AI to write an article mimicking someone's style without providing reference material, the AI has no idea what the target style looks like. This is where you need to feed the AI extensive context — coding standards, historical documentation, reference examples — so it can "learn" before it "executes."

The rise of context engineering is closely tied to advances in large language model context windows. Early GPT-3.5 supported only 4K tokens of context, severely limiting the reference information developers could provide. With Claude 3 supporting 200K tokens and Gemini 1.5 supporting contexts on the order of a million tokens, developers could finally inject entire code repositories, API documentation, and design specifications into the model in a single pass. Supporting technology stacks such as RAG (Retrieval-Augmented Generation) and vector databases matured alongside this shift: massive documents are first chunked and vectorized for storage, and the fragments most relevant to a user query are then retrieved and assembled as context. However, the core bottleneck of context engineering is that it remains a "single input, single output" paradigm, lacking the ability to continuously control and course-correct the AI's execution process.
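The chunk, vectorize, and retrieve flow described above can be sketched in a few lines. This is a minimal illustration, not how any particular RAG stack works: the bag-of-words `embed` function is a stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Assemble the retrieved chunks and the task into a single prompt."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nTask: {query}"


if __name__ == "__main__":
    docs = [
        "Orders can be refunded within seven days of delivery.",
        "Controllers must not call repositories directly.",
        "Logistics tracking numbers are stored on the shipment record.",
    ]
    chunks = [c for d in docs for c in chunk(d)]
    print(build_prompt("How are logistics tracking numbers stored?", chunks, k=1))
```

Note that the result is still one prompt in and one answer out, which is exactly the limitation the paragraph above points to.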

By the author's estimate, over 95% of developers are still stuck at this stage. Whether you're using Claude Code, Cursor, or Copilot, most people's workflow essentially boils down to: provide some context → write a prompt → let AI generate code.

[Figure: Limitations of the Context Engineering Stage]

Stage 3: Harness Engineering

The English word "harness" refers to the gear and reins used to control a horse. Applied to the AI domain:

Large language model = A powerful, spirited horse
Harness = The reins and gear that make the horse follow commands
Harness + Large language model = A true intelligent Agent

Harness Engineering isn't as simple as passing in some context. It requires extensive constraint specifications, interactive feedback during execution, and continuous corrective control to enable AI to complete complex enterprise-level tasks. It is likely to be the dominant paradigm in AI programming for the next two to three years.

The "feedback loop" here draws on the negative feedback mechanism from cybernetics. In practice, it typically involves three layers of feedback. The first layer is immediate feedback: after the AI generates code, lint checks and type validation run instantly, and failures trigger automatic corrections. The second layer is test feedback: unit tests and integration tests run, and the error messages from failed test cases help the AI locate and fix issues. The third layer is human feedback: developers approve and confirm at critical decision points, such as architecture choices and security-sensitive operations. This multi-layered mechanism is loosely analogous to RLHF (Reinforcement Learning from Human Feedback), except that it corrects outputs at execution time rather than updating model weights, which makes it more structured and controllable and helps keep AI output within the enterprise's acceptable quality range.
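As a rough sketch of how the three layers could be wired together, the loop below runs cheap checks first, feeds the first failure back to the generator, and only asks a human once everything automated passes. Everything here is an assumption made for illustration: `generate` stands in for a model call, the lint layer is reduced to a syntax check, and `shipping_fee` is a hypothetical function under test.

```python
import ast
from typing import Callable, Optional

# A check returns an error message, or None when the code passes.
Check = Callable[[str], Optional[str]]


def lint_check(code: str) -> Optional[str]:
    """Layer 1, immediate feedback: reduced here to a syntax check."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"syntax error: {exc.msg} (line {exc.lineno})"


def unit_test_check(code: str) -> Optional[str]:
    """Layer 2, test feedback: one hypothetical unit test on shipping_fee."""
    namespace: dict = {}
    exec(code, namespace)  # a real harness would run tests in an isolated sandbox
    fee = namespace.get("shipping_fee")
    if fee is None:
        return "shipping_fee is not defined"
    result = fee(0)
    if result != 0:
        return f"shipping_fee(0) returned {result!r}, expected 0"
    return None


def first_failure(code: str, checks: list[Check]) -> Optional[str]:
    """Run checks in order and return the first error message, if any."""
    for check in checks:
        error = check(code)
        if error is not None:
            return error
    return None


def run_with_feedback(
    generate: Callable[[Optional[str]], str],
    checks: list[Check],
    approve: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate, check, and regenerate with feedback; layer 3 is human approval."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate(feedback)
        feedback = first_failure(code, checks)
        if feedback is None:
            return code if approve(code) else None
    return None  # gave up: escalate to a developer
```

Ordering the checks from cheapest to most expensive keeps the human reviewer out of the loop until the automated layers have nothing left to report.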

Setting Up the Practice Environment: Toolchain Selection and Configuration

Development Tool Stack

The technology stack used in this hands-on project:

IDE: VS Code
AI Plugin: Claude Code (VS Code plugin version)
Backend Model: Volcano Engine Coding Plan (~200 RMB/month), primarily using Zhipu GLM series

[Figure: Practice Environment Configuration]

Why This Combination?

Claude Code's engineering capabilities rank in the top tier among current AI programming tools and are practically essential for professional developers. Its core advantage lies in its Agent architecture: unlike code completion in GitHub Copilot, which primarily works at the line or function level, Claude Code can autonomously read project file structures, analyze dependencies, execute terminal commands, run tests, and iteratively modify code based on the results. Under the hood, it relies on Anthropic's Tool Use capability, which allows the model to invoke external tools such as file read/write, shell execution, and search during the reasoning process. More critically, Claude Code supports CLAUDE.md project specification files and Custom Commands, which are precisely the technical foundation that makes the Harness Engineering system possible: developers can inject enterprise coding standards, architectural constraints, and review criteria into the AI's workflow in a structured manner.

The choice of the domestic Zhipu GLM model over Claude's native model for the backend has an important rationale:

If a non-top-tier model (domestic GLM) combined with the Harness specification system can successfully deliver enterprise-level projects, then switching to top-tier models like Claude 4 or GPT will only yield better results.

If that holds, it shows that the Harness Engineering system itself contributes more value than model capability alone.

[Figure: Model Plan Configuration]

Domestic Large Model Code Generation Capability Rankings

Based on hands-on testing and comparison, here's the recommended ranking for domestic large models in code generation scenarios:

Zhipu GLM — Top tier domestically, with consistently stable overall performance. Zhipu AI is a large model company incubated by Tsinghua University's technical team. Its GLM (General Language Model) series employs a unique autoregressive blank-filling pre-training paradigm and has released the CodeGeeX series of specialized programming models, continuously optimizing on mainstream code evaluation benchmarks like HumanEval and MBPP.
Alibaba Qwen — A viable alternative
Xiaomi MiMo — Impressive recent performance
Doubao, MiniMax, DeepSeek, and others each have their own strengths

Choosing domestic models also has an important practical consideration: data compliance. For enterprise-level projects involving core business logic, cross-border data transfer poses legal risks. Domestic models can be deployed privately or accessed through in-country cloud services so that data never leaves the country, which is a hard requirement in industries like finance and government. Volcano Engine, as ByteDance's cloud service platform, provides unified API access and billing management for models, lowering the barrier for enterprises to integrate multiple domestic models.

Core Philosophy: The Engineering System Behind a Single Command

Same on the Surface, Worlds Apart Underneath

In the demonstration, a seemingly simple command:

Strictly follow the Turing Shop project's Harness specifications and add an order logistics tracking feature to this project

What's the difference between this command and a casual developer typing "add a logistics feature for me"? The answer: the underlying execution flow is worlds apart.

Skill-Driven End-to-End Automation

Behind this command is an entire set of pre-defined Skills (skill scripts) driving the process:

Requirements Analysis Skill — AI first understands the business requirements and breaks them down into technical tasks
Coding Skill — Generates code following the project's coding standards
Unit Testing Skill — Automatically generates test cases
Continuous Integration Skill — Triggers the CI/CD pipeline
Deployment Skill — Completes environment deployment
Code Review Skill — Automated Code Review

Six or seven core Skills like these span the entire development lifecycle, forming an AI-driven development pipeline. This is the essence of Harness Engineering: it's not about having AI write a piece of code, but about having AI complete every step from requirements to production release according to enterprise-grade specifications.

From a technical implementation perspective, this Skill orchestration is essentially an engineered implementation of the AI Agent architecture. An Agent refers to an AI system capable of perceiving its environment, making autonomous decisions, and executing actions — distinct from simple conversational AI. Each Skill typically consists of three components: trigger conditions, execution logic (including system prompts, tool call chains, and output format constraints), and validation rules (Guard Rails). Multiple Skills are orchestrated through Directed Acyclic Graphs (DAGs) or state machines, similar to Pipeline definitions in traditional CI/CD. Current mainstream Agent frameworks like LangChain, CrewAI, and AutoGen all provide similar multi-step task orchestration capabilities, but Harness Engineering places greater emphasis on deep integration with existing enterprise DevOps workflows.
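To make the three-part Skill structure and DAG orchestration concrete, here is a minimal sketch using Python's standard `graphlib` module. It is not how Claude Code Skills are actually defined; the `Skill` dataclass, the shared `state` dictionary, and `run_pipeline` are hypothetical names chosen for illustration, and each `execute` callable stands in for prompts plus tool calls.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class Skill:
    name: str
    trigger: Callable[[dict], bool]    # trigger condition
    execute: Callable[[dict], dict]    # execution logic; returns updates to shared state
    validate: Callable[[dict], bool]   # validation rule (guard rail)
    depends_on: list[str] = field(default_factory=list)


def run_pipeline(skills: list[Skill], state: dict) -> list[str]:
    """Run skills in dependency order; stop as soon as a guard rail fails."""
    by_name = {skill.name: skill for skill in skills}
    graph = {skill.name: set(skill.depends_on) for skill in skills}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        skill = by_name[name]
        if not skill.trigger(state):
            continue  # trigger condition not met, skip this skill
        state.update(skill.execute(state))
        if not skill.validate(state):
            raise RuntimeError(f"guard rail failed after skill '{name}'")
        executed.append(name)
    return executed


if __name__ == "__main__":
    skills = [
        Skill("unit_testing", lambda s: "code" in s, lambda s: {"tests_passed": True},
              lambda s: s["tests_passed"], depends_on=["coding"]),
        Skill("coding", lambda s: "tasks" in s, lambda s: {"code": "// generated code"},
              lambda s: bool(s["code"]), depends_on=["requirements"]),
        Skill("requirements", lambda s: "request" in s,
              lambda s: {"tasks": ["add tracking endpoint"]}, lambda s: len(s["tasks"]) > 0),
    ]
    print(run_pipeline(skills, {"request": "add order logistics tracking"}))
```

Declaring dependencies rather than a fixed order means a new Skill can be added by stating only what it needs to run after, much as a stage is added to a CI/CD pipeline definition.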

Traditional CI/CD pipelines typically include stages like code compilation, static analysis, unit testing, integration testing, artifact building, and environment deployment, driven by tools like Jenkins, GitLab CI, and GitHub Actions. The innovation of Harness Engineering lies in embedding AI capabilities into every node of the pipeline: at the code commit stage, AI automatically checks compliance with architectural specifications; at the testing stage, AI automatically generates incremental test cases based on code changes; at the Code Review stage, AI provides improvement suggestions based on the team's historical review standards. It's worth noting that "Harness" is also the name of a well-known DevOps platform company (harness.io) focused on software delivery automation. That company is distinct from the Harness Engineering concept discussed here, but the two are conceptually aligned: both emphasize standardizing and automating complex processes through engineering approaches.
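One way to make an architectural specification checkable at the commit stage is to pair the AI review with a deterministic rule whose findings can be fed back to the model. The sketch below scans a Java source file for imports that cross layers in a disallowed direction. The layering rules, the `com.example.shop` base package, and the function names are all hypothetical, not the Turing Shop project's actual specification.

```python
import re

# Hypothetical layering rule: each layer may import only from the layers listed for it.
ALLOWED_IMPORTS = {
    "controller": {"service", "dto"},
    "service": {"repository", "dto", "entity"},
    "repository": {"entity"},
}
LAYERS = set(ALLOWED_IMPORTS) | {"dto", "entity"}

PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.*]+)\s*;", re.MULTILINE)


def layer_of(qualified_name: str, base: str):
    """Return the layer segment directly under `base`, or None if outside the project."""
    if not qualified_name.startswith(base + "."):
        return None
    segment = qualified_name[len(base) + 1:].split(".")[0]
    return segment if segment in LAYERS else None


def find_violations(java_source: str, base: str = "com.example.shop") -> list[str]:
    """List imports in one Java file that break the layering rule."""
    match = PACKAGE_RE.search(java_source)
    if match is None:
        return []
    source_layer = layer_of(match.group(1), base)
    if source_layer not in ALLOWED_IMPORTS:
        return []  # file is outside the layers this rule covers
    violations = []
    for imported in IMPORT_RE.findall(java_source):
        target = layer_of(imported, base)
        if target and target != source_layer and target not in ALLOWED_IMPORTS[source_layer]:
            violations.append(f"{source_layer} must not import {imported}")
    return violations
```

Each violation message is phrased so it can be handed straight back to the AI as correction feedback, or surfaced to a reviewer in the pipeline log.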

Key Insights for Enterprise Implementation

Why Conceptual Tutorials Don't Work

Most online content about Harness stays at the conceptual level — what are rules, what are constraints, what are feedback loops. After consuming it, you've memorized a bunch of terminology but have no idea how to apply it, and you forget everything within a couple of days.

The truly valuable learning path is: Start with a hands-on project and understand the concepts through doing. Once you've seen how Skills are invoked, how specifications constrain AI behavior, and how feedback corrects output, going back to read those theoretical documents will make everything click.

This "practice-first" learning methodology has deep theoretical foundations in software engineering education. Cognitive science research shows that the acquisition paths for procedural knowledge (knowing how) and declarative knowledge (knowing what) are fundamentally different — the former must be internalized through repeated practice, while the latter can be acquired through reading but is easily forgotten. Harness Engineering involves extensive engineering decisions and tuning experience, making it a classic case of procedural knowledge. Learning through project practice is therefore far more efficient than reading conceptual documentation.

How Effective Is It in Practice?

According to the presenter, multiple Harness Engineering projects developed in collaboration with enterprises have been successfully deployed with significant results. These reports suggest that Harness is not an armchair concept but an approach that has been validated in enterprise settings.

Final Thoughts: From Using AI to Building AI Engineering Systems

Harness Engineering represents the leap in AI programming from "personal assistant" to "engineered productivity." Its core isn't about which model you use, but rather:

Whether you've established a comprehensive specification system
Whether you have Skill orchestration covering the entire workflow
Whether you've achieved a human-AI collaborative feedback loop

For programmers who want to stay competitive in the AI era, understanding and mastering Harness Engineering will be the critical step from "knowing how to use AI tools" to "knowing how to build AI engineering systems." This isn't just a technical capability upgrade — it's a transformation in engineering mindset, shifting focus from "what can AI generate" to "how to systematically ensure AI consistently and reliably delivers high-quality results."
