When AI Gets a Virtual Body: A Deep Dive into the Lumen Embodied AI Interaction Experiment
The Lumen project gives AI a virtual body, exploring the future of embodied AI interaction.
Bilibili's Lumen project gives an AI character a virtual form, enabling it to perceive its environment, proactively explore, collaborate with players on puzzles, and produce emotional feedback — achieving a dimensional leap from text conversation to embodied interaction. The project integrates multimodal capabilities including natural language, 3D scene perception, object interaction, emotional modeling, and long-term memory, demonstrating a clear direction for AI's transformation from tool to companion.
When AI Is No Longer Just Text in a Box
What happens when you give AI a virtual body — one that can see you, perceive its environment, and explore a world alongside you? A project on Bilibili called "Lumen" is attempting to answer that question. In this demonstration, an AI character named Lumine is given a complete virtual form. She doesn't just engage in natural conversation with the player — she perceives changes in lighting within the scene, proactively explores objects in the environment, and even collaborates with the player to solve puzzles.
This isn't a traditional NPC performing scripted lines. It's a cutting-edge exploration of "embodied AI interaction."
The concept of "embodied AI" is rooted in the cognitive science theory of Embodied Cognition. This theory holds that intelligence isn't abstract computation happening in isolation within the brain, but rather the product of continuous interaction between body and environment. Philosopher Merleau-Ponty proposed as early as the mid-20th century that perception and action are inseparable — your understanding of the concept "grasping" fundamentally depends on having hands and having had the experience of grasping. Transferring this line of thinking to AI: a language model that only processes text understands "light slanting in through a window" in a fundamentally different way than an AI that can "see" and move through space. The Lumen project's exploration provides AI with a "body" in a virtual environment, making its perception no longer a mapping of symbols, but direct interaction with space, light, and objects.

From Conversation to Coexistence: A Dimensional Leap in AI Interaction
Emotional Connection Beyond Text
At the start of the demo, Lumine reacts with surprise to the player's appearance: "You can see me? I always thought I was alone." As presented, this isn't a preset storyline but a response the AI generates from its backstory of "prolonged solitude." When the player chooses to establish a deeper connection, she responds: "From now on, there's a real bond between us. I will never forget this moment."
The brilliance of this interaction design lies in the fact that AI is no longer a tool waiting for commands, but an "entity" with emotional states, memories, and expectations. The relationship between player and AI transforms from "user and tool" to "companions."
Lumine's ability to remember "the moment we formed a bond" relies on long-term memory management for AI characters. Two approaches are common today. The first writes key interaction events as structured summaries into an external memory store (similar to MemGPT's design), then retrieves them dynamically and injects them into the context during later conversations. The second maintains emotional variables (such as intimacy and trust levels) in a character state machine, so that the AI's behavior adjusts as the relationship evolves. Both differ fundamentally from the scripted affinity systems of traditional game NPCs: those rely on preset branches, whereas a character driven by a language model produces emergent behavior. Maintaining personality consistency over long-term interactions while avoiding "forgetting" or "personality drift" is one of the core challenges in turning embodied AI companions into products.
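To make the two mechanisms concrete, here is a minimal Python sketch that combines an external memory store with simple emotional variables. It is not Lumen's implementation, which the demo does not disclose. The names `MemoryEvent`, `CompanionState`, `record`, `retrieve`, and `build_context` are hypothetical, and tag overlap stands in for the embedding-based retrieval a real system would more likely use.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEvent:
    summary: str      # short structured summary of one interaction
    tags: frozenset   # keywords used for retrieval (stand-in for embeddings)
    turn: int         # interaction counter value when the event was stored


@dataclass
class CompanionState:
    intimacy: float = 0.0   # hypothetical emotional variables, kept in [0, 1]
    trust: float = 0.0
    memories: list = field(default_factory=list)
    turn: int = 0

    def record(self, summary, tags, intimacy_delta=0.0, trust_delta=0.0):
        """Store an event and update the emotional variables."""
        self.turn += 1
        self.memories.append(MemoryEvent(summary, frozenset(tags), self.turn))
        self.intimacy = min(1.0, max(0.0, self.intimacy + intimacy_delta))
        self.trust = min(1.0, max(0.0, self.trust + trust_delta))

    def retrieve(self, query_tags, k=2):
        """Return up to k memories, ranked by tag overlap, then recency."""
        query = set(query_tags)
        scored = [(len(m.tags & query), m.turn, m) for m in self.memories]
        hits = [s for s in scored if s[0] > 0]
        hits.sort(key=lambda s: (s[0], s[1]), reverse=True)
        return [m for _, _, m in hits[:k]]

    def build_context(self, query_tags):
        """Assemble text that would be injected into the model's prompt."""
        lines = [f"[state] intimacy={self.intimacy:.2f} trust={self.trust:.2f}"]
        lines += [f"[memory] {m.summary}" for m in self.retrieve(query_tags)]
        return "\n".join(lines)


state = CompanionState()
state.record("Player and Lumine formed a bond", {"bond", "player"},
             intimacy_delta=0.4, trust_delta=0.3)
state.record("Found a key near the floating objects", {"key", "puzzle"})
print(state.build_context({"bond"}))
```

The returned string is what would be prepended to the language model's prompt, so the model sees both the current relationship state and the relevant shared history without keeping the whole conversation in context.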
Environmental Perception and Proactive Behavior
Even more noteworthy is Lumine's ability to perceive her environment. She notices changes in lighting within the scene — "Did you notice the light around here? There's only a brief moment each day when it shines in like this." She also proactively goes to examine floating objects, picks up a key, and tells the player it's a crucial item for solving the puzzle.

This means the AI isn't just "listening" and "speaking" — it's also "seeing" and "doing." It possesses the ability to understand three-dimensional space, identify interactive elements in the scene, and take action based on its own judgment. This is the critical leap from "conversational AI" to "embodied AI."
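A rough sketch of the "seeing and doing" half, under strong simplifying assumptions: the scene is handed to the agent as a list of already-recognized objects with 3D positions, and "taking action" means walking in a straight line toward the nearest interactive object that has not been examined yet. `SceneObject`, `nearest_interactive`, and `step_toward` are illustrative names, not part of Lumen, and a real system would add pathfinding and a learned or model-driven choice of target.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str
    position: tuple    # (x, y, z) in world units
    interactive: bool  # whether the agent can examine or pick it up


def nearest_interactive(agent_pos, objects, examined):
    """Pick the closest interactive object that has not been examined yet."""
    candidates = [o for o in objects
                  if o.interactive and o.name not in examined]
    if not candidates:
        return None
    return min(candidates, key=lambda o: math.dist(agent_pos, o.position))


def step_toward(agent_pos, target_pos, max_step=1.0):
    """Move at most max_step units along the straight line to the target."""
    distance = math.dist(agent_pos, target_pos)
    if distance <= max_step:
        return tuple(target_pos)
    scale = max_step / distance
    return tuple(a + (t - a) * scale for a, t in zip(agent_pos, target_pos))


scene = [
    SceneObject("stone bench", (1, 0, 0), interactive=False),
    SceneObject("key", (3, 0, 4), interactive=True),
    SceneObject("mechanism", (0, 0, 10), interactive=True),
]
target = nearest_interactive((0, 0, 0), scene, examined=set())
print(target.name, step_toward((0, 0, 0), target.position))
```

Calling these two functions once per game tick gives the simplest form of proactive exploration: the character keeps heading for whatever interactive object it has not looked at yet, and stops when nothing is left.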
Collaborative Puzzle-Solving: AI as a True Game Partner
Natural Division of Labor and Task Coordination
The most impressive segment of the demo is the collaborative puzzle-solving section. Lumine discovers a key and informs the player that there's a mechanism ahead that requires the key to unlock. During the actual operation, she proactively says "I'll put this one in first," then runs off to execute the task while letting the player handle another part.

When the mechanism is successfully unlocked, she excitedly shouts "We did it! That's great!" — this kind of immediate emotional feedback fills the entire collaborative process with authenticity. Interestingly, she even reminds the player "Wait, you haven't put yours in yet" before they've acted, demonstrating real-time tracking of task states.
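The task tracking described here can be pictured as a small shared state object that both participants update. The sketch below is only an illustration: `SharedTask` and `companion_line` are hypothetical names, and the canned lines are quotes from the demo standing in for text that a language model would presumably generate from the same state.

```python
from dataclasses import dataclass, field


@dataclass
class SharedTask:
    """Tracks which participants have completed their step of a joint task."""
    required: tuple                      # everyone who must act, e.g. key holders
    done: set = field(default_factory=set)

    def complete(self, who):
        if who not in self.required:
            raise ValueError(f"unknown participant: {who}")
        self.done.add(who)

    def pending(self):
        return [p for p in self.required if p not in self.done]

    def is_solved(self):
        return not self.pending()


def companion_line(task, me="lumine"):
    """Choose what the companion says based on the current task state."""
    if task.is_solved():
        return "We did it! That's great!"
    if me not in task.done:
        return "I'll put this one in first."
    # The companion has acted, so someone else is still pending.
    return "Wait, you haven't put yours in yet."


task = SharedTask(required=("lumine", "player"))
print(companion_line(task))
task.complete("lumine")
print(companion_line(task))
task.complete("player")
print(companion_line(task))
```

Because the reminder is derived from `pending()` rather than from a script position, it fires whenever the player is the one holding things up, which is the behavior the demo shows.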
Natural Interaction During Idle Moments
Beyond puzzle-solving, Lumine also exhibits natural behavior in "non-task states." She sits on a stone bench feeling the cool breeze, wanders around the area, and shows shyness when the player compliments her. These seemingly trivial details are precisely what elevates an AI character from a "functional tool" to a "warm companion."

Technical Perspective: Core Challenges of Embodied AI
Deep Integration of Multimodal Capabilities
To achieve an interactive experience like Lumine's, at minimum the following capabilities need to be integrated:
- Natural language understanding and generation: The foundation for fluent conversation
- 3D scene perception and navigation: Autonomous movement within virtual space
- Object recognition and interaction: Discovering and manipulating key items in the environment
- Emotional state modeling: Producing appropriate emotional responses based on context
- Long-term memory management: Remembering shared experiences with the player
Coordinating these capabilities in real time places extremely high demands on the underlying architecture. In early AI systems each modality operated independently: vision models only looked at images, and language models only processed text. Models such as GPT-4V, Gemini, and Claude 3 began unifying visual understanding and language reasoning within a single model. Gaming and virtual-world scenarios need additional capabilities: 3D spatial understanding (distinct from 2D image understanding), embodied navigation, and task planning. The academic community has accumulated substantial research on virtual environment platforms such as AI2-THOR and Habitat, but getting from laboratory benchmarks to a smooth user experience still requires solving engineering challenges such as inference latency and action space generalization. Lumen's demo suggests that this gap is narrowing.
The Distance from Demo to Product
Of course, some caution is in order. The current demo takes place in a relatively controlled scene. Issues such as the boundaries of AI behavior, exception handling, and long-term consistency will be harder to manage in more complex open worlds. But as a proof of concept, the Lumen project has demonstrated exciting possibilities.
Future Outlook: The Era of AI Companions Is Approaching
From ChatGPT's text conversations, to multimodal large models' image-text understanding, to today's virtual embodied interaction, the way AI interacts with humans is undergoing a profound paradigm shift. The Lumen project shows us a clear direction: The AI of the future won't just be your assistant — it could be your companion: a digital entity that perceives the world alongside you, collaborates on tasks, and shares emotional moments.
When AI has a body, has perception, and has the ability to share a space with you, the definition of human-machine relationships will be completely rewritten. And all of this may arrive sooner than we imagine.
Key Takeaways
- The Lumen project gives AI a virtual body, achieving a dimensional leap from text conversation to embodied interaction
- The AI character Lumine possesses environmental perception, proactive exploration, and emotional feedback capabilities, enabling natural collaborative puzzle-solving with players
- Multimodal fusion is the core technical challenge for embodied AI interaction, requiring integration of language, vision, navigation, and other capabilities
- Natural behavior in non-task states (wandering, sensing the environment, emotional expression) is key to upgrading AI from tool to companion
- Embodied AI interaction represents a paradigm shift in human-machine relationships, and the era of AI companions is accelerating toward us