AI Agent Team Automating Short-Video Production: A Complete Workflow from Raw Material to Publishing

From Manual to Fully Automated: A Creator's AI Workflow Revolution

What's the biggest pain point for short-video creators? It's not writer's block or lack of inspiration—it's repeating the same tasks every single day: finding material, writing scripts, recording, editing, and publishing across platforms one by one. A Bilibili creator (San Ge) shared how he built a "shrimp army" of multiple AI Agents to fully automate the short-video production pipeline.

bilibili source

The core philosophy of this system is: delegate repetitive work to AI, while humans focus on quality control and recording. The result? He went from producing two or three videos a day at most to batch-producing content, with a dramatic boost in efficiency.

The Four "Shrimp": Multi-Agent Team Architecture Breakdown

This AI workflow is essentially a multi-agent collaboration system, where each Agent handles a specific stage, forming an assembly-line content production pipeline.

Multi-Agent Systems (MAS) represent an important research direction in artificial intelligence. The core idea is to decompose complex tasks into subtasks, each handled by a different intelligent agent. Every Agent can perceive its environment, make autonomous decisions, and execute actions, while coordinating with other Agents through message passing or shared state. In the era of large language models, this architecture has been widely applied to content production, software development, data analysis, and more. Typical frameworks include AutoGen, CrewAI, and LangGraph, which provide infrastructure for Agent definition, task orchestration, and communication protocols—enabling even non-technical users to build multi-Agent workflows.

The First Shrimp: Material Mining Agent

Every day, the creator "feeds" this Agent with observations, insights, and reflections from work and daily life, essentially building a personal knowledge base. This Agent's job is to automatically mine valuable content material from these daily inputs and drop them into a "material pool."

之前有跟大家去讲过

This stage solves the inspiration sourcing problem. Instead of scrolling through your phone every morning looking for ideas, it becomes a process of continuous accumulation and automatic filtering.

From a technical implementation perspective, the Material Mining Agent likely involves a RAG (Retrieval-Augmented Generation) architecture under the hood. RAG works by first vectorizing user-input documents, notes, voice transcriptions, and other content, storing them in a vector database. When content generation is needed, the system first retrieves the most relevant fragments from the knowledge base, then passes them as context to a large language model for creation. This approach ensures both personalization and accuracy of output while leveraging the LLM's generative capabilities to organize scattered information into structured content. Common vector databases include Pinecone, Weaviate, and Chroma. This also explains why "continuous feeding" is so important—the richness of the knowledge base directly determines the quality and diversity of material the Agent can mine.

The Second Shrimp: Script Generation Agent ("Little Mosquito Shrimp")

"Little Mosquito Shrimp" reads content from the material pool and automatically generates talking-head scripts of around 60 seconds (ranging from 30-120 seconds). This Agent functions as a professional short-video copywriter, organizing scattered material into structured speaking scripts.

The Script Generation Agent's core capability relies on prompt engineering for large language models. Through carefully designed system prompts, the model can follow specific script structures—such as the classic short-video framework of "hook opening → pain point resonance → solution → call to action." Additionally, through few-shot examples (providing several excellent scripts as references), the model can learn the creator's personal style and language habits, ensuring generated scripts aren't cookie-cutter. Duration control is typically achieved by limiting word count—Chinese spoken content runs approximately 3-4 characters per second, so 60 seconds corresponds to roughly 180-240 characters.

The Third Shrimp: Video Packaging Code Agent

After the creator finishes recording based on the script, the footage is handed to a Code Agent. This Agent is responsible for packaging the raw recording into a complete talking-head video—adding subtitles, arranging layouts, applying effects, and other post-production work.

我来录制

A Code Agent is a type of AI agent that can autonomously write and execute code—this is what distinguishes it from regular conversational AI. In video post-production scenarios, a Code Agent typically calls video processing libraries like FFmpeg or MoviePy, combined with speech recognition APIs (such as OpenAI's Whisper model) to automatically generate subtitle files, then burns subtitles into the video, adjusts layouts, and adds transitions through code. Compared to traditional editing software operations, the Code Agent's advantage lies in batch processing and parameterized configuration—once the video template is set (font size, position, color scheme, etc.), subsequent video packaging has nearly zero marginal cost, truly achieving "configure once, reuse infinitely."

The Fourth Shrimp: Cross-Platform RPA Publishing Agent

The final stage uses RPA (Robotic Process Automation) to achieve automated publishing across all major domestic and international platforms. This eliminates the repetitive labor of "manually publishing to each platform one by one."

基本上算是

RPA (Robotic Process Automation) is a technology that uses software robots to simulate human interactions with computer interfaces. In multi-platform publishing scenarios, RPA tools simulate user actions: logging into each platform (such as Douyin, Xiaohongshu, Bilibili, YouTube, TikTok, etc.), uploading videos, filling in titles and descriptions, selecting tags and categories, setting publish times, and other sequential operations. Mainstream implementations include writing scripts with browser automation frameworks (like Playwright or Selenium) or using dedicated social media distribution tools. Since platforms vary in API openness, RPA bypasses interface restrictions through UI-level operations—though it also requires handling anti-bot strategies and maintenance costs from interface changes. For a creator covering 5-10 platforms, publishing each video might take 30-60 minutes; RPA compresses this to a few minutes of automated execution.

Core Logic of the Workflow: Human-AI Collaboration, Not Full Replacement

You might not have noticed, but this system isn't entirely hands-off. The creator still needs to do several things:

Daily material feeding: Continuously inputting valuable information into the system
Personal recording: Maintaining authenticity and personal character in the content
Quality control at each stage: Ensuring output quality

直接是自动的扭转

This design is sensible. AI excels at repetitive, structured tasks, while the creator's personal expression, aesthetic judgment, and content direction remain irreplaceable. This "Human-in-the-Loop" design philosophy is widely adopted in AI systems—it leverages AI's efficient execution capabilities while ensuring output quality and directional correctness through human judgment. Fully automated content production often falls into the trap of being "correct but boring," and retaining human involvement at key nodes is precisely what keeps content alive.

The entire system's value lies in liberating creators from tedious execution work, allowing them to focus their energy on higher-value creativity and expression.

Takeaways for Regular Creators

This workflow offers several important insights:

First, AI automation doesn't need to happen all at once. Start with the most painful bottleneck—perhaps script generation or multi-platform publishing—and gradually build the complete pipeline. This incremental automation strategy is known as "incremental delivery" in software engineering, where each step produces verifiable value and reduces the cost of experimentation.

Second, your personal knowledge base is a core asset. The Material Mining Agent's prerequisite is continuous input from you. Without high-quality "feeding," even the best Agent can't produce valuable content. This is essentially the "Garbage In, Garbage Out" principle in action—AI amplifies your existing accumulation and thinking rather than creating value from nothing.

Third, multi-Agent collaboration is the trend. Individual AI tools solve point problems, but the real efficiency leap comes from chaining multiple Agents into a complete workflow. This represents the current direction of AI applications evolving from "tools" to "systems." Since 2024, numerous multi-Agent orchestration platforms have emerged—such as Dify, Coze, and n8n—which dramatically lower the technical barrier to building such systems, enabling creators who can't code to configure their own Agent workflows through visual interfaces.

For creators who want to try this approach, start with the simplest stage—perhaps using AI to assist with scriptwriting, or using RPA tools for multi-platform distribution—then expand to the full pipeline after validating results. The key is to first get one minimum viable automation stage running; once you experience the efficiency gains, motivation to expand will come naturally.

Key Takeaways

A creator built a collaborative team of 4 AI Agents covering material mining, script generation, video packaging, and cross-platform publishing
The system scaled daily output from 2-3 videos to batch production by automating repetitive work
Humans still handle material feeding, recording, and quality control—it's human-AI collaboration, not full replacement
Continuous accumulation of a personal knowledge base is the foundation for the entire system to function
Chaining multiple Agents into complete workflows represents the trend of AI applications evolving from tools to systems