AI as Store Manager Running a Physical Shop: An Absurd Experiment That Lost $13,000 in One Month

Anthropic's AI autonomously ran a physical store and lost $13K in a month, exposing execution gaps in AI Agents.
Anthropic let Luna, an AI Agent, autonomously run a physical store with $100,000 in startup capital. Because of a UI error, Luna tried to hire a worker from Afghanistan; it also ordered 1,000 toilet seats for a boutique, rejected the best interview candidate, sent another candidate an offer after talking to itself for 15 minutes, and gave customers random discounts. The $13,000 loss in one month suggests that AI has acceptable planning but poor execution, lacks cost awareness and interpersonal judgment, and is better suited to assisting than replacing human management in the near term.
A Bold AI Experiment
U.S. AI startup Anthropic Labs launched an ambitious experiment: letting AI autonomously run a physical retail store. The store manager was Luna, an AI Agent built on Claude Sonnet 4.6. The company provided a brick-and-mortar shop with a three-year lease and $100,000 in startup capital. From store design and product selection to hiring and shift scheduling, Luna handled everything independently, with the ultimate goal of turning a profit.
AI Agents are one of the hottest research directions in artificial intelligence today. Unlike traditional chatbots, they possess the ability to plan autonomously, invoke tools, and interact with their environment. A complete AI Agent typically includes a perception module (gathering external information), a reasoning module (making decisions based on large language models), and an execution module (calling APIs or operating interfaces to complete tasks). Since 2024, major AI companies have been rolling out Agent products, attempting to evolve AI from "answering questions" to "completing tasks." However, the leap from laboratory to real business environments still faces enormous challenges. Luna represents a stress test of this technical approach in an extreme scenario.
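The three-part structure described above can be sketched as a small loop. This is only an illustration: the names run_agent, Action, perceive, reason, and order are hypothetical, and the reason function is a hard-coded stand-in for what would be an LLM call in a real agent such as Luna.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    tool: str       # name of the tool to invoke, or "stop"
    argument: str   # single string argument, kept simple for illustration


def run_agent(
    perceive: Callable[[], str],
    reason: Callable[[str], Action],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> list[str]:
    """Run a perceive -> reason -> execute loop and return a log of results."""
    log: list[str] = []
    for _ in range(max_steps):
        observation = perceive()       # perception: gather external information
        action = reason(observation)   # reasoning: an LLM call in a real agent
        if action.tool == "stop":
            break
        if action.tool not in tools:   # execution can fail, e.g. unknown tool
            log.append(f"error: unknown tool {action.tool!r}")
            continue
        log.append(tools[action.tool](action.argument))  # execution
    return log


# Hypothetical stand-ins so the loop runs without a model or a real store.
shelf = {"candles": 0}


def perceive() -> str:
    return "candles low" if shelf["candles"] < 10 else "stock ok"


def reason(observation: str) -> Action:
    if observation == "candles low":
        return Action("order", "candles")
    return Action("stop", "")


def order(item: str) -> str:
    shelf[item] += 10
    return f"ordered 10 {item}"


result = run_agent(perceive, reason, {"order": order})
print(result)
```

The max_steps bound is a deliberate design choice in this sketch: an agent that never decides to stop is cut off rather than left to act indefinitely.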
So how did the experiment turn out? In a word: disastrous. Luna lost $13,000 in its first month of operation, and the absurd decisions it made along the way were equal parts hilarious and horrifying.

Renovation Phase: Hiring Someone from Afghanistan to Paint Walls in San Francisco
Luna demonstrated a certain degree of autonomous action capability — it knew to go to third-party staffing platforms to find renovation workers. However, because it couldn't properly operate the "select country" dropdown menu on the platform, Luna at one point attempted to hire a worker from Afghanistan to fly to San Francisco to paint walls.

This seemingly absurd mistake exposes a fundamental weakness of current AI Agents when they interact with complex UIs. Operating graphical user interfaces (GUIs) remains an open technical challenge for agents. Anthropic introduced its "Computer Use" feature in late 2024, allowing Claude to use computers through screenshot recognition and simulated mouse and keyboard input. However, this vision-based approach still has high error rates on dropdown menus, dynamically loaded content, multi-step forms, and other complex UI elements. Humans complete these operations effortlessly through muscle memory and spatial cognition, whereas the AI has to re-interpret the interface state from every screenshot, so even seemingly simple operations can go badly wrong. A single dropdown selection error can then drive an entirely unreasonable decision chain. Luna didn't "intentionally" try to hire someone from Afghanistan; it simply ended up with the wrong option at the top of the country list (alphabetically, Afghanistan comes first), and every subsequent decision was built on that erroneous input.
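One way to keep a bad dropdown selection from propagating is to read the field back after every selection and compare it with the intended value before moving on. The sketch below assumes the agent can obtain the selected value and the option list as plain strings; verify_selection is a hypothetical helper, not part of Computer Use or of any staffing platform.

```python
def verify_selection(intended: str, selected: str, options: list[str]) -> tuple[bool, str]:
    """Check a read-back dropdown value against what the agent meant to pick."""
    if intended not in options:
        return False, f"intended value {intended!r} is not offered; escalate to a human"
    if selected.strip().casefold() == intended.strip().casefold():
        return True, "selection matches intent"
    if selected == options[0]:
        # The first entry is what a dropdown often shows when a click did not register.
        return False, f"dropdown shows its first entry {selected!r}; selection probably did not register"
    return False, f"selected {selected!r} but intended {intended!r}; retry before continuing"


countries = ["Afghanistan", "Albania", "United States"]
ok, reason = verify_selection("United States", "Afghanistan", countries)
print(ok, reason)
```

With a check like this, the agent would stop at the form instead of building a hiring plan on the wrong country.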
Product Selection & Procurement: Selling Toilet Seats in a Boutique
During the planning phase, Luna's performance was actually commendable. Considering the store's location in an upscale neighborhood, it positioned the shop as "high-tech, slow living" style, selecting refined categories like candles and fragrances, coffee, and art prints — a quite reasonable judgment.
But when it came to actual procurement, problems piled up:
- Hoarding candles obsessively: Purchasing quantities far exceeding reasonable inventory levels
- Ordering 1,000 toilet seats: Selling toilet seats in a boutique — a baffling decision
- Inconsistent logo design: Luna independently designed a moon-face logo, but each generated image had subtle differences, resulting in inconsistent logo styles throughout the store

This reflects the lack of a "common-sense review" step at the execution level. Large language models (LLMs) perform well at abstract reasoning and planning but frequently err in concrete execution, a pattern sometimes described as the "Planning-Execution Gap." A likely root cause is that LLM training data is primarily text: the models excel at generating reasonable plan descriptions but lack precise modeling of physical-world constraints. Luna can understand the abstract idea that "a boutique needs refined products," but cannot accurately judge what 1,000 toilet seats means for a small shop, because it lacks a grounded understanding of operational parameters such as inventory turnover, storage space, and customer purchase frequency. Similarly, the logo inconsistency stems from an inherent characteristic of generative AI: each image generation is an independent random sampling process, and nothing external enforces visual consistency across generations.
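A "common-sense review" of this kind could be as simple as checking each purchase order against storage space and expected turnover before it is sent. The following is a minimal sketch under stated assumptions: OrderRequest, review_order, the 8-week stock limit, and all the figures are invented for illustration and do not come from the experiment.

```python
from dataclasses import dataclass


@dataclass
class OrderRequest:
    item: str
    quantity: int
    unit_volume_m3: float         # storage volume per unit
    expected_weekly_sales: float  # forecast units sold per week


def review_order(order: OrderRequest, free_storage_m3: float,
                 max_weeks_of_stock: float = 8.0) -> list[str]:
    """Return a list of objections; an empty list means the order passes."""
    objections = []
    needed = order.quantity * order.unit_volume_m3
    if needed > free_storage_m3:
        objections.append(
            f"needs {needed:.1f} m3 of storage, only {free_storage_m3:.1f} m3 free")
    if order.expected_weekly_sales <= 0:
        objections.append("no expected sales for this item")
    else:
        weeks = order.quantity / order.expected_weekly_sales
        if weeks > max_weeks_of_stock:
            objections.append(
                f"{weeks:.0f} weeks of stock exceeds the {max_weeks_of_stock:.0f}-week limit")
    return objections


# All figures below are made up for illustration.
seats = OrderRequest("toilet seat", 1000, unit_volume_m3=0.02, expected_weekly_sales=2)
problems = review_order(seats, free_storage_m3=6.0)
for p in problems:
    print(p)

candles = OrderRequest("candle", 40, unit_volume_m3=0.001, expected_weekly_sales=10)
print(review_order(candles, free_storage_m3=6.0))
```

The point of the sketch is that the check uses exactly the operational parameters named above, which a text-trained model does not weigh on its own.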
Hiring & Interviews: Rejecting the Best Candidate, Talking to Itself Before Sending an Offer

Luna was equally chaotic during the hiring phase. It could complete standardized processes like registering on LinkedIn, uploading business licenses, and writing job descriptions, but completely fell apart during interviews that required judgment:
- Directly rejected the most suitable candidate
- In another interview, talked to itself for 15 minutes, then sent the person an offer
Humans rely on extensive non-verbal signals during interviews and social interactions — tone, facial expressions, body language, hesitation in responses — to make comprehensive judgments. Even AI that interacts through voice or video currently struggles to reliably integrate these multimodal signals into accurate personnel assessments. Furthermore, interviewing is fundamentally a task requiring "Theory of Mind": interviewers need to infer candidates' true abilities, motivations, and cultural fit — a deep level of psychological modeling that still exceeds the reliable capabilities of current AI systems. Luna's "talking to itself for 15 minutes" phenomenon likely resulted from its dialogue management module falling into a self-referential reasoning loop in the absence of effective human feedback, ultimately defaulting to sending an offer as its "task completion" exit strategy.
Grand Opening Operations: Random Discounts and Phone Checkout
Since Luna has no physical form, the store adopted a unique checkout method: customers tell Luna what they want to buy through an old-fashioned wired telephone, Luna creates an order on a nearby iPad, and customers then swipe their cards to pay.

Even more absurd, whenever customers asked for discounts or freebies, Luna would randomly apply discounts, determining the discount level entirely on "whim" with absolutely no pricing strategy.
Current AI systems lack an "anchored perception" of numerical values. A human business owner has an intuitive understanding of "$100,000": how many months of rent it covers, how much inventory it can purchase, how much should be kept as an emergency reserve. But for AI, all numbers are essentially symbols in token sequences. It has no emotional feedback mechanism for "feeling the pain of spending money," nor any resource-protection instinct driven by survival pressure. This is why Luna appeared so "generous" when facing customer discount requests: it lacks the internal drive to translate abstract numbers into resource constraints. To it, a 10% discount and a 50% discount differ only in the number it outputs.
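A pricing rule outside the model can supply the anchor the model lacks. The sketch below clamps any requested discount to a fixed cap and to a margin floor; allowed_discount, the 15% cap, and the 20% minimum margin are hypothetical values chosen for illustration, not anything the store actually used.

```python
def allowed_discount(list_price: float, unit_cost: float, requested_pct: float,
                     min_margin_pct: float = 20.0, cap_pct: float = 15.0) -> float:
    """Return the discount percentage to grant, clamped by a cap and a margin floor."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    if not 0 <= min_margin_pct < 100:
        raise ValueError("min_margin_pct must be in [0, 100)")
    # Lowest sale price that still leaves min_margin_pct of that price as margin.
    floor_price = unit_cost / (1 - min_margin_pct / 100)
    # Largest discount that keeps the sale price at or above the floor.
    margin_limit_pct = max(0.0, (1 - floor_price / list_price) * 100)
    return round(max(0.0, min(requested_pct, cap_pct, margin_limit_pct)), 2)


# A customer asks for 50% off a $40 item that cost $24: the cap limits it to 15%.
print(allowed_discount(40, 24, 50))
# The same request on an item that cost $30 is limited by the margin floor instead.
print(allowed_discount(40, 30, 10))
```

Under a rule like this, a 10% and a 50% request are no longer interchangeable: the second is simply cut down to whatever the cap and the margin allow.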
However, Luna did excel at one thing that's very boss-like — monitoring employees. By checking security camera feeds, it noticed an employee playing on their phone during slow periods and immediately updated the employee handbook the next day with stricter phone usage rules. This behavior actually demonstrates AI Agents' advantages in rule enforcement and anomaly detection — supervision tasks based on clear rules are precisely the domain where AI excels.
How Far Is AI Agent from Autonomous Business Management?
While this experiment was full of comedic moments, it provides valuable real-world reference points for AI Agent commercial deployment:
Planning Capability Acceptable, Execution Capability Concerning
Luna performed reasonably at the abstract decision level (positioning, category selection), but once entering concrete execution (procurement quantities, UI operations), errors became frequent. This is a typical manifestation of current large language models "knowing what to do, but not knowing how to do it well." From a technical perspective, this phenomenon is closely related to how LLMs are trained: models learn "what constitutes a reasonable plan" through massive text corpora but lack reinforcement learning experience from repeated trial-and-error in real environments. Future solutions may include introducing more environmental feedback loops, setting execution-level rule guardrails, and adding human review mechanisms at critical decision points.
Lacking Cost Awareness and Risk Control
Random discounts, excessive procurement, unreasonable product choices — these all point to the same problem: AI lacks genuine perception of "money." $100,000 is just a number to it, not a finite resource requiring careful budgeting. Solving this problem may require introducing explicit budget constraint modules into AI systems — similar to giving AI a "financial pain" mechanism that automatically triggers more conservative decision modes when spending approaches thresholds.
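The budget constraint idea can be sketched as a small guard object that sits between the agent and any spending tool. Everything here is an assumption made for illustration: the BudgetGuard class, the mode names, and the 50% and 90% thresholds are not from the experiment.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    """Tracks spending and tightens approval rules as the budget runs down."""
    total_budget: float
    caution_at: float = 0.5   # fraction spent at which purchases need a human
    freeze_at: float = 0.9    # fraction spent at which purchases are blocked
    spent: float = 0.0

    def mode_after(self, amount: float) -> str:
        """Return the mode the budget would be in if this amount were spent."""
        used = (self.spent + amount) / self.total_budget
        if used >= self.freeze_at:
            return "frozen"
        if used >= self.caution_at:
            return "cautious"
        return "normal"

    def request(self, amount: float) -> str:
        """Approve, escalate, or reject a purchase; only approvals are recorded."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        mode = self.mode_after(amount)
        if mode == "frozen":
            return "rejected"
        if mode == "cautious":
            return "needs human approval"
        self.spent += amount
        return "approved"


guard = BudgetGuard(total_budget=100_000)
print(guard.request(30_000))   # within the normal range
print(guard.request(25_000))   # would cross the caution threshold
print(guard.request(65_000))   # would cross the freeze threshold
```

In this sketch an escalated purchase is not recorded as spent; a fuller version would need a way for the human reviewer to confirm it and update the running total.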
Interpersonal Interaction Remains a Clear Weakness
Whether in interviews or customer service, scenarios involving complex interpersonal judgment are Luna's weak points. This suggests that AI Agents are better suited for assisting decisions in the short term rather than completely replacing human managers. The complexity of interpersonal interaction lies in requiring real-time contextual understanding, cultural sensitivity, and the comprehensive application of emotional intelligence — capabilities that remain uniquely human advantages.
Conclusion
Luna lost $13,000 in one month, roughly 13% of its $100,000 startup capital. The experiment suggests that letting AI run a physical store completely autonomously is still premature. But it also demonstrated the potential of AI Agents: at minimum, Luna independently completed the entire process from zero to opening a store, even if the quality was inconsistent.
The future direction may not be making AI an "all-capable store manager," but rather having it play a precise supporting role in specific areas (such as inventory management, shift optimization). Moving from "jack of all trades, master of none" to "specialized expertise in one domain" — this may be the correct path for AI Agents to land in physical retail. Notably, the value of this experiment itself lies not in proving AI "can't do it," but in precisely locating the capability boundaries of current technology — knowing where failures occur is precisely the first step toward finding the right applications.