Hands-On Tutorial: Build an AI Agent from Scratch with 200 Lines of Python

Project Overview: Understanding Core Agent Architecture with Minimal Code

For developers looking to get started with AI Agent development, the biggest challenge often isn't the technical difficulty itself — it's the overwhelming number of concepts, heavyweight frameworks, and not knowing where to begin. A beginner-friendly tutorial on Bilibili offers an elegantly simple yet complete approach: building an AI Agent with core capabilities from scratch using just 200 lines of Python code.

We implemented a simple AI agent project

The design philosophy behind this project is clear: understand the concepts first, then validate them with code. Each lesson focuses on a core term in the Agent domain, demonstrating its role through actual coding, and ultimately assembling all modules into a complete intelligent agent. For developers with Python experience, this is a low-barrier, high-efficiency learning path.

Breaking Down the Five Core Agent Modules

Prompt: The Agent's "Personality Configuration"

The prompt is the starting point of an Agent — it defines the behavioral boundaries and response style of the intelligent agent. In this project, the prompt isn't just a simple system instruction; it's designed as the Agent's "operating system," defining how the agent should think, make decisions, and interact with users.

Prompt Engineering has been one of the most critical technical practices in large model applications since 2023. In Agent scenarios, prompt design is far more complex than for a typical ChatBot — it needs to include role definitions, behavioral constraints, output format specifications, tool-calling instructions, and other multi-layered structures. OpenAI's System Prompt mechanism provides the technical foundation for this, allowing developers to inject persistent behavioral instructions before a conversation begins. Common prompt design patterns in the industry include: the ReAct (Reasoning + Acting) pattern, which alternates between thinking and acting; the Few-shot pattern, which guides the Agent's output format through examples; and the Chain-of-Thought pattern, which forces the Agent to show its reasoning process. The essence of all these patterns is transforming large language models from passive text-completion tools into proactive task executors through carefully crafted text instructions.

Then we use code to demonstrate the role of each concept

Memory: Giving Conversations Context

An Agent without memory is essentially just a Q&A machine. The memory module enables the agent to track conversation history, understand contextual relationships, and deliver more coherent, intelligent responses. Within the constraint of 200 lines of code, implementing a basic but effective memory mechanism is key to understanding what distinguishes an Agent from an ordinary ChatBot.

In cognitive science and AI research, an Agent's memory system is typically divided into three levels: Short-term Memory, Long-term Memory, and Working Memory. Short-term memory corresponds to the context window of the current conversation, limited by the model's token length (e.g., GPT-4's 128K tokens). Long-term memory requires external storage support, typically encoding historical information as vectors stored in vector databases (such as Pinecone or Milvus). Working memory is the subset of information the Agent is currently processing, similar to a human's attentional focus. In practice, memory management also involves information compression (Summarization), forgetting mechanisms, and retrieval strategies — all critical technical challenges in evolving from a simple ChatBot to a truly intelligent agent.

Tool Use: The Bridge to the Real World

Tool calling is the core capability that distinguishes an Agent from a pure language model. By defining callable external tools (such as search engines, calculators, API endpoints, etc.), an Agent can break through the knowledge boundaries of the language model and perform real-world actions. Implementing this module helps us understand the underlying logic of Function Calling.

Function Calling is a core API capability introduced by OpenAI in June 2023, solving a fundamental problem: how to make language models reliably generate structured function call requests. Here's how it works: developers pre-define a set of function descriptions in JSON Schema format (including function names, parameter types, and parameter descriptions) and send these descriptions along with the user's message to the model. The model determines whether a function call is needed based on user intent, and if so, outputs JSON parameters conforming to the Schema. The developer then executes the function locally and returns the result to the model for a final response. The revolutionary aspect of this mechanism is that it transforms LLMs from pure text generators into control centers capable of operating external systems. Currently, all major models (GPT-4, Claude, Gemini) support this capability, and the open-source model community has achieved similar effects through fine-tuning.

RAG (Retrieval-Augmented Generation): Connecting Private Knowledge Bases

RAG (Retrieval-Augmented Generation) enables an Agent to answer questions based on a specific knowledge base, rather than relying solely on the model's pre-trained knowledge. In practice, this means you can turn your Agent into a domain expert that provides precise answers based on your documents and data.

RAG technology was proposed by Meta AI in 2020 and has become the standard architecture for enterprise-level AI applications. Its core workflow consists of three steps: Indexing, Retrieval, and Generation. During the indexing phase, documents are split into semantically complete text chunks, converted into high-dimensional vectors via embedding models (such as OpenAI's text-embedding-3-small), and stored in a vector database. During the retrieval phase, the user query is similarly converted into a vector, and the most relevant text chunks are found using cosine similarity or ANN (Approximate Nearest Neighbor) algorithms. During the generation phase, the retrieved text is injected into the prompt as context, allowing the large model to generate answers based on this information. Compared to pure model inference, RAG's advantages include updatable knowledge, traceability, controllability, and mitigation of model hallucination issues. Current RAG evolution directions include multimodal RAG, Graph RAG (based on knowledge graphs), and Agentic RAG (where the Agent autonomously decides retrieval strategies).

Skill: Composable Capability Extension Units

And the recently popular concept of Skills

The Skill module is a trending concept in recent Agent development. Unlike tool calling, Skills emphasize composable, reusable capability units. A Skill can contain multi-step operational logic — similar to installing a "plugin" for the Agent — enabling it to complete more complex task workflows.

The Skill concept originated from Microsoft's Semantic Kernel framework and shares similarities with Chains in LangChain and agent capability definitions in AutoGen, but places greater emphasis on modularity and composability. A Skill typically encapsulates complete task execution logic, including input validation, multi-step reasoning, tool orchestration, and output formatting. For example, a "Data Analysis Skill" might internally chain together data reading, cleaning, statistical computation, and visualization steps. This design philosophy is similar to microservice architecture in software engineering — breaking down complex capabilities into independently developable, testable, and deployable units through standardized interfaces. Since 2024, OpenAI's GPTs and ByteDance's Coze platform have adopted similar plugin-based capability extension mechanisms, and Skills are becoming a universal capability exchange unit in the Agent ecosystem.

Incremental Development: Building from Simple to Complete

We'll add the code step by step

This project adopts an incremental development strategy, with each step adding a new module on top of the previous one. The benefits of this approach include:

Each step is runnable: No need to wait until all code is written to see results
Clear module boundaries: The responsibilities and interfaces of each functional module are immediately apparent
Easy to debug and understand: Issues can be quickly traced to specific modules

Learning Recommendations and Target Audience

This tutorial is suitable for the following developers:

Beginners with Python experience who find AI Agent concepts unclear
Intermediate developers who've used frameworks like LangChain but want to understand the underlying principles
Engineers preparing to transition into the large model space who need to quickly build a knowledge framework

While 200 lines of code can't cover every detail of a production-grade Agent (such as error handling, concurrency control, security mechanisms, etc.), it provides a clear mental model. Once you understand this skeleton, learning mature frameworks like LangChain and AutoGen becomes significantly more efficient.

Conclusion

From prompts to memory, from tool calling to RAG, and finally to Skill extensions — these five modules form the core architecture of modern AI Agents. Connecting them with 200 lines of Python code is not just a coding exercise; it's a systematic cognitive construction of the Agent development paradigm. Mastering this approach to building intelligent agents from scratch lays a solid foundation for diving deeper into complex Agent frameworks.