Deep Dive into Tencent Marvis: How a System-Level AI Assistant Redefines Human-Computer Interaction

From Agent to System-Level Assistant: Three Stages of Product Evolution

Agents are nothing new, but what is their next form? In the AI field, an Agent is an intelligent system that can make decisions, plan, and execute on its own. Unlike traditional single-turn Q&A AI, an Agent can decompose a complex task into multiple steps, invoke tools autonomously, access external resources, and adjust its strategy based on intermediate results. Since 2023, as large language model capabilities have leaped forward, Agents have moved rapidly from academic concept to product, and architectural paradigms such as ReAct (Reasoning + Acting) and Plan-and-Execute have become mainstream. Looking back over the past year or so of development, three stages stand out:
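To make the "act, observe, adjust" cycle concrete, here is a minimal sketch of a ReAct-style loop. It is an illustration under stated assumptions, not how any product named in this article is implemented: the `policy` callable is a hypothetical stand-in for the language model, and the `lookup` tool and its data are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# A "policy" stands in for the language model (hypothetical interface): given
# the task and the history of (action, observation) pairs, it returns the next
# (tool_name, tool_input), or ("finish", answer) when it is done.
Policy = Callable[[str, List[Tuple[str, str]]], Tuple[str, str]]


def run_react_loop(task: str, policy: Policy,
                   tools: Dict[str, Callable[[str], str]],
                   max_steps: int = 5) -> str:
    """Alternate between choosing an action and observing its result."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        tool_name, tool_input = policy(task, history)
        if tool_name == "finish":
            return tool_input
        if tool_name not in tools:
            observation = f"error: unknown tool {tool_name!r}"
        else:
            observation = tools[tool_name](tool_input)
        # The observation is fed back so the next decision can adapt to it.
        history.append((f"{tool_name}({tool_input})", observation))
    return "stopped: step limit reached"


def scripted_policy(task: str, history: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Toy stand-in for an LLM: look a value up once, then report it."""
    if not history:
        return ("lookup", "python_version")
    return ("finish", f"The answer is {history[-1][1]}")


# Invented tool and data, for illustration only.
tools = {"lookup": lambda key: {"python_version": "3.12"}.get(key, "not found")}
answer = run_react_loop("Which Python version is installed?", scripted_policy, tools)
print(answer)
```

The step limit matters in practice: without it, a policy that never returns "finish" would loop forever.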

Stage One started with Manus and the pure Agent form, which includes Claude Code, Codex, and domestic products like Wordware and Tree. Most of these are essentially command-line AI agents. Claude Code is Anthropic's command-line AI programming assistant, which can understand codebase context directly in the terminal, perform file operations, and run commands; OpenAI's Codex is the productized form of its code generation model. What these tools share is a CLI (Command Line Interface) as the interaction entry point: developers drive the Agent with natural language instructions to write, debug, and refactor code, which in effect embeds large model reasoning into developers' existing workflows.

Stage Two was the "wrapper" craze sparked by OpenAI's Operator and similar products. Cloud products of every kind kept appearing, but in essence they dressed the Agent in a prettier outfit.

Stage Three is the form now emerging: human-centered, system-level AI assistants.

Source (bilibili): "Agent 的下个形态是什么？Marvis 给出了一个答案" ("What Is the Next Form of Agents? Marvis Offers an Answer")

Here's a key distinction: with Cloud-type products, the protagonist is always the Agent itself, and all the packaging is designed to make the Agent more usable. True innovation should establish an entirely new form, relegating the Agent to a supporting role in the background rather than center stage. Just as when Windows first appeared, it didn't make the command line better—it redefined the human-machine interaction relationship from a completely new perspective. The leap from memorizing command syntax to intuitive graphical operations shows that paradigm shifts in interaction are often more revolutionary than functional enhancements.

Tencent Marvis Core Features: Centered on Human Operating Habits

Tencent's recently launched Marvis product embodies this trend. Upon opening it, beyond the common Agent dialogue, scheduled tasks, and skill marketplace, the most distinctive change is the local knowledge base feature.

System-Level File Management Capabilities

In Marvis's knowledge base, you can:

View all installed applications, with support for opening, uninstalling, and other operations
Browse the computer's file system, with common file operations directly accessible
Leverage pre-built indexes for lightning-fast search
Build semantic-level indexes through dedicated "Documents" and "Gallery" categories

After obtaining user authorization, Marvis builds semantic indexes for files and brings them into the scope of AI semantic search. Semantic indexing differs from traditional keyword indexing: an Embedding model converts text, images, and other content into high-dimensional vectors, and retrieval matches them in vector space using a metric such as cosine similarity. As a result, a document can be retrieved even when the query shares no literal terms with it, as long as the two are semantically close. Implementing this typically requires a vector index or vector database (such as the FAISS library or Milvus), plus Embedding models suited to each modality, for example a multimodal model like CLIP for cross-modal image-text retrieval. The current feature set isn't extensive, but it gives the first sense that this product is centered not on the Agent but on how people actually use their computers.
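The matching step can be sketched in a few lines. This is only an illustration of cosine-similarity ranking, not Marvis's implementation: the three-dimensional vectors below are hand-made stand-ins for real embeddings (which usually have hundreds of dimensions and come from a model), and the filenames are invented.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec: List[float], index: Dict[str, List[float]],
           top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hand-made vectors standing in for model-generated embeddings (assumption).
index = {
    "IMG_0412.png": [0.9, 0.1, 0.0],    # pretend: an architecture diagram
    "budget.xlsx": [0.0, 0.2, 0.9],     # pretend: a spreadsheet
    "notes_ep12.pdf": [0.7, 0.6, 0.1],  # pretend: lecture notes
}
query = [1.0, 0.2, 0.0]  # pretend embedding of the query "Transformer"

results = search(query, index)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Note that the top hit's filename shares no characters with the query; the ranking comes entirely from vector geometry. A linear scan like this is fine for a handful of files, while libraries such as FAISS exist to keep the same lookup fast over millions of vectors.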

Semantic Search in Practice

Take a content creator's daily scenario as an example: each video episode has its own folder containing PPTs, video assets, images, research paper PDFs, and more. When you want to find something but can't remember which episode it was in, Marvis's knowledge base comes to the rescue.

It can display all the papers scattered across various locations and use large models to understand and classify images. Even more powerful is natural language image search: searching for "Transformer", for example, can directly locate a Transformer architecture diagram even if the image filename contains no such keyword. The technical principle behind this kind of search is typically as follows: the system pre-generates a semantic vector for each image with a vision-language model (such as CLIP), encodes the natural language query into a vector at search time, and performs cross-modal retrieval by vector similarity, so the search no longer depends on filenames or tags.

You can also ask it in the chat box to perform contextual analysis based on documents on your computer, turning your entire machine into one large knowledge base. This essentially brings RAG (Retrieval-Augmented Generation) down from cloud knowledge bases to the personal local file system, turning everyone's computer into a private, conversational knowledge base.
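The retrieve-then-generate flow behind RAG can be sketched as follows. This is a simplified illustration, not Marvis's pipeline: the file paths and contents are invented, and ranking by word overlap is a deliberately crude stand-in for the embedding-based retrieval a real system would use. The sketch stops at building the prompt; sending it to a model is left out.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: Dict[str, str], top_k: int = 2) -> List[str]:
    """Rank document paths by word overlap with the question.

    Word overlap is a stand-in; a real system would rank by embedding
    similarity instead.
    """
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda path: len(question_words & tokenize(documents[path])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, documents: Dict[str, str], top_k: int = 2) -> str:
    """Place the retrieved passages in front of the question."""
    paths = retrieve(question, documents, top_k)
    context = "\n".join(f"[{path}] {documents[path]}" for path in paths)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented local files, for illustration only.
documents = {
    "ep12/notes.txt": "Attention lets the transformer weigh every token against every other token.",
    "ep07/budget.txt": "Camera rental and editing costs for the spring episodes.",
    "ep12/paper_summary.txt": "The transformer paper replaces recurrence with self-attention.",
}
question = "How does the transformer use attention?"
prompt = build_prompt(question, documents)
print(prompt)
```

The key design point is that the model only sees the few passages that were retrieved, which is what lets an entire disk act as a knowledge base without fitting into a single prompt.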

Practical Scenarios: From SSH Configuration to Local Model Deployment

System-Level Task Automation

Marvis demonstrates real value when handling system-level tasks. Take a practical need: you use a Mac every day but want to offload certain GPU tasks to a Windows machine with an RTX 4090. SSH (Secure Shell) is an encrypted network protocol that lets users operate another computer securely over a network. In AI development, many developers use a Mac as their daily workstation, but deep learning training usually relies on the CUDA compute of NVIDIA GPUs, so connecting over SSH to a workstation with a high-end graphics card is common practice. Manual setup, however, means typing various commands, setting passwords, configuring firewall rules, and handling key authentication and port forwarding, a series of steps that is a barrier for anyone without an operations background. Marvis can complete the Windows SSH configuration for you directly, automating these tedious steps.
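For a sense of what the client side of this setup involves, the sketch below renders a host entry for the Mac's `~/.ssh/config` file. It is an illustration, not what Marvis generates: the alias, IP address, user name, and key path are all hypothetical, and the Windows side (enabling the SSH server, opening the firewall, installing the public key) is not covered. `Host`, `HostName`, `User`, `Port`, `IdentityFile`, and `IdentitiesOnly` are standard OpenSSH client options.

```python
from typing import Optional


def render_ssh_host_entry(alias: str, hostname: str, user: str,
                          port: int = 22,
                          identity_file: Optional[str] = None) -> str:
    """Render one Host block in OpenSSH client config syntax."""
    lines = [
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        f"    Port {port}",
    ]
    if identity_file:
        # Use key authentication with this specific key only.
        lines.append(f"    IdentityFile {identity_file}")
        lines.append("    IdentitiesOnly yes")
    return "\n".join(lines) + "\n"


# All values below are placeholders for illustration.
entry = render_ssh_host_entry(
    alias="gpu-box",
    hostname="192.168.1.50",
    user="dev",
    identity_file="~/.ssh/id_ed25519",
)
print(entry)
```

With an entry like this in place, `ssh gpu-box` works from the terminal, and tools that read the same config file, such as VS Code's Remote-SSH extension, can offer the alias as a connection target.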

Once configured, connecting from the Mac with VS Code's Remote-SSH extension is very convenient. In real-world testing, running Karpathy's miniGPT-2 training code on three device types (CUDA GPU, CPU, and the Mac's MPS) shows clear performance differences. MPS (Metal Performance Shaders) is Apple's GPU acceleration framework, built on the Metal graphics API. PyTorch has supported an MPS backend since version 1.12, so Mac users can use the GPU and unified memory architecture of Apple Silicon (M1/M2/M3/M4 series chips) for model training and inference without an NVIDIA graphics card. MPS still can't match high-end discrete GPUs in the CUDA ecosystem for deep learning performance, but it is practical for small-to-medium scale experiments.
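Running the same training script on all three device types usually comes down to one device-selection step. The sketch below shows a common pattern, assuming PyTorch 1.12 or later when it is installed; the preference order (CUDA, then MPS, then CPU) is a choice made for this example, and the script falls back to "cpu" if PyTorch is missing.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple's MPS backend, then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps exists from PyTorch 1.12; guard for older versions.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


device = detect_device()
print(f"training would run on: {device}")
```

In a training script, the returned string is what gets passed to calls such as `model.to(device)`, so the rest of the code stays identical across the Windows workstation and the Mac.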

Built-in Local Models: Privacy and Efficiency Combined

Another pain point: many small tasks on your computer don't need the compute of a cloud-based large model; a small local model is sufficient. Previously, you might have needed Ollama or LM Studio to run local models, then used Claude Code to modify configurations to connect to them, which is not a low barrier. Ollama provides a Docker-like command-line experience for pulling and running open-source models like Llama and Mistral with a single command; LM Studio provides a graphical interface that lowers the barrier to entry. The core advantages of running models locally are data privacy (sensitive information never leaves the device), no network latency (no round-trips to a remote server, though local inference speed still depends on the hardware), and no per-call cost (no API fees). But configuring local models to work with other AI tools typically requires manually setting API endpoints and model parameters, which is exactly what Marvis aims to simplify.
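To show what "manually setting API endpoints and model parameters" looks like, here is a sketch of calling a locally running Ollama server from Python using only the standard library. It assumes Ollama is listening on its default address and that a model has already been pulled; the model name "llama3" is a placeholder. This is not how Marvis connects to its built-in models, which the article does not describe.

```python
import json
import urllib.request

# Ollama's default local address and its text-generation endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3",
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generation request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3", timeout: float = 60.0) -> str:
    """Send the request; requires a running Ollama server with the model pulled."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


request = build_generate_request("Summarize: SSH is an encrypted network protocol.")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling `generate("...")` performs the actual request. Since the URL points at localhost, the prompt never leaves the machine, which is the privacy property described above.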

Marvis has this capability built in:

Privacy Mode: Downloads and automatically runs local models, with all tasks processed entirely locally—sensitive data never leaves the device
Efficiency Mode: Edge-cloud collaboration, automatically determining which tasks use cloud compute and when to use local compute

Edge-Cloud Collaboration is a compute scheduling strategy that decides dynamically whether a task runs on the local device (edge) or on cloud servers, based on task complexity, privacy requirements, and latency sensitivity. Simple tasks like text summarization and format conversion can be completed quickly by small local models (around the 7B parameter scale); complex tasks like long document analysis and multi-step reasoning are routed to much larger cloud-based models (tens of billions of parameters or more). This architecture requires an intelligent routing layer that evaluates task characteristics and makes the dispatch decision, and it is a key technical path for future AI assistants to balance cost, privacy, and capability.
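A routing layer of this kind can be sketched as a small rule-based function. The rules, task kinds, and the 4000-token threshold below are assumptions invented for the example; Marvis's actual routing logic is not public in the source, and a production router might use a learned classifier rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str                        # e.g. "summarize", "multi_step_reasoning"
    input_tokens: int                # rough size of the input
    contains_sensitive_data: bool = False


# Hypothetical policy values, chosen only for illustration.
LOCAL_FRIENDLY_KINDS = {"summarize", "format_convert"}
LOCAL_TOKEN_LIMIT = 4000


def route(task: Task, privacy_mode: bool = False) -> str:
    """Return "local" or "cloud" for a task."""
    # Privacy always wins: sensitive data never leaves the device.
    if privacy_mode or task.contains_sensitive_data:
        return "local"
    # Simple, short tasks are cheap enough for a small local model.
    if task.kind in LOCAL_FRIENDLY_KINDS and task.input_tokens <= LOCAL_TOKEN_LIMIT:
        return "local"
    # Everything else goes to the larger cloud model.
    return "cloud"


examples = [
    Task("summarize", 800),
    Task("summarize", 50_000),
    Task("multi_step_reasoning", 1_200),
    Task("multi_step_reasoning", 1_200, contains_sensitive_data=True),
]
for task in examples:
    print(task.kind, task.input_tokens, "->", route(task))
```

Passing `privacy_mode=True` corresponds to the Privacy Mode described above, where every task stays local; leaving it off corresponds to Efficiency Mode, where the router decides per task.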

When processing tasks, Marvis also features a "Studio" view that displays the process of multiple built-in Agents with different roles collaborating to complete tasks—intuitive and engaging. This Multi-Agent System architecture lets different specialized Agents each handle their responsibilities—some for information retrieval, some for code execution, some for result verification—working together through coordination mechanisms to accomplish complex tasks.
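The division of labor among role-specific Agents can be sketched as a pipeline in which each role reads a shared state and adds its contribution. This is a minimal illustration with invented roles and placeholder logic, not the coordination mechanism Marvis uses; each function stands in for what would be a model-backed Agent, and a sequential handoff is only the simplest of several possible coordination schemes.

```python
from typing import Callable, Dict, List

# Each agent takes the shared state and returns an updated copy of it.
Agent = Callable[[Dict[str, str]], Dict[str, str]]


def retriever(state: Dict[str, str]) -> Dict[str, str]:
    """Information retrieval role: attach evidence for the task (placeholder)."""
    return {**state, "evidence": f"notes about {state['task']}"}


def executor(state: Dict[str, str]) -> Dict[str, str]:
    """Execution role: produce a draft result from the evidence (placeholder)."""
    return {**state, "draft": f"Result based on {state['evidence']}"}


def verifier(state: Dict[str, str]) -> Dict[str, str]:
    """Verification role: check that the draft actually uses the evidence."""
    ok = state["evidence"] in state["draft"]
    return {**state, "verified": "yes" if ok else "no"}


def run_pipeline(task: str, agents: List[Agent]) -> Dict[str, str]:
    """Run the agents in order, recording which roles handled the task."""
    state: Dict[str, str] = {"task": task}
    trace: List[str] = []
    for agent in agents:
        state = agent(state)
        trace.append(agent.__name__)
    state["trace"] = " -> ".join(trace)
    return state


final_state = run_pipeline("SSH setup", [retriever, executor, verifier])
print(final_state["trace"])
print(final_state["verified"])
```

The recorded trace is the kind of information a view like "Studio" could display: which role acted at each step and what it handed to the next.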

Future Outlook: Agent and Operating System Becoming One

This product form—centered on the human perspective and targeting system-level Agent assistance—differs from simply stacking features on top of Agents. In the future, as more system-level capabilities are opened up, API-ified, and CLI-ified, becoming easier for Agents to invoke, this type of product will become increasingly useful—until you can't even perceive its existence, completely merged with the operating system.

This trend is highly consistent with the evolutionary pattern of computer interaction history: the best technology is often "invisible" technology. Just as the TCP/IP protocol stack is completely transparent to ordinary users, future AI Agents will sink into the operating system's foundation, becoming infrastructure like file systems and memory management—users only need to express intent, and the system automatically orchestrates AI capabilities for execution.

Of course, looking at current functionality, Marvis still has room for improvement. For example, the AI dialogue above and the knowledge base content aren't perfectly unified—the experience still feels somewhat fragmented. But the direction is correct—the future of Agents isn't more powerful Agents, but making Agents disappear into the system, returning humans to the protagonist role.

This is perhaps the critical step from "AI tool" to "AI operating system." When an Agent is no longer an application you need to actively open, but rather an intelligent foundation permeating every file operation, every search, and every system configuration, we will have truly entered a new era of human-machine collaboration.