An AI Store Manager Runs a Physical Shop: An Absurd Experiment That Lost $13,000 in One Month

A Bold AI Experiment

U.S. AI startup Anthropic Labs launched an ambitious experiment: letting an AI run a physical retail store autonomously. The store manager was Luna, an AI agent built on Claude Sonnet 4.6. The company provided a brick-and-mortar shop with a three-year lease and $100,000 in startup capital. From store design and product selection to hiring and shift scheduling, Luna handled everything independently, with the ultimate goal of turning a profit.

AI Agents are one of the hottest research directions in artificial intelligence today. Unlike traditional chatbots, they possess the ability to plan autonomously, invoke tools, and interact with their environment. A complete AI Agent typically includes a perception module (gathering external information), a reasoning module (making decisions based on large language models), and an execution module (calling APIs or operating interfaces to complete tasks). Since 2024, major AI companies have been rolling out Agent products, attempting to evolve AI from "answering questions" to "completing tasks." However, the leap from laboratory to real business environments still faces enormous challenges. Luna represents a stress test of this technical approach in an extreme scenario.
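The three-module structure described above can be sketched as a minimal loop. This is only an illustration, not Luna's actual architecture: the `Agent` class, the `toy_reasoner` function, and the tool names are all hypothetical, and the reasoning step is a rule-based placeholder where a real agent would call a language model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Observation:
    """What the perception step gathered from the environment."""
    text: str


@dataclass
class Action:
    """A tool call chosen by the reasoning step."""
    tool: str
    argument: str


@dataclass
class Agent:
    # Hypothetical stand-in for an LLM call: maps an observation to an action.
    reason: Callable[[Observation], Action]
    # Hypothetical tool registry: tool name -> function taking one string.
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def step(self, raw_input: str) -> str:
        observation = Observation(text=raw_input.strip())  # perception
        action = self.reason(observation)                   # reasoning
        if action.tool not in self.tools:                   # execution
            return f"unknown tool: {action.tool}"
        return self.tools[action.tool](action.argument)


def toy_reasoner(obs: Observation) -> Action:
    """Rule-based placeholder; a real agent would query a language model here."""
    if "stock" in obs.text.lower():
        return Action(tool="check_inventory", argument="candles")
    return Action(tool="reply", argument=obs.text)


agent = Agent(
    reason=toy_reasoner,
    tools={
        "check_inventory": lambda item: f"{item}: 12 in stock",
        "reply": lambda text: f"echo: {text}",
    },
)
result = agent.step("  How much stock do we have? ")
```

Keeping the tool registry separate from the reasoning function means every action the agent can take is enumerable, which is where checks and limits can later be attached.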

So how did the experiment turn out? In a word: disastrous. Luna lost $13,000 in its first month of operation, and the absurd decisions it made along the way were equal parts hilarious and horrifying.


Renovation Phase: Hiring Someone from Afghanistan to Paint Walls in San Francisco

Luna showed some capacity for autonomous action: it knew to look for renovation workers on third-party staffing platforms. However, because it could not correctly operate the platform's "select country" dropdown menu, Luna at one point tried to hire a worker from Afghanistan to fly to San Francisco and paint walls.

Luna's dropdown menu mishap

This seemingly absurd mistake exposes a basic weakness of current AI agents when they interact with complex UIs. Operating graphical user interfaces (GUIs) remains one of the harder open problems for agents. Anthropic introduced its "Computer Use" feature in late 2024, allowing Claude to use a computer through screenshot recognition and simulated mouse and keyboard input. However, this vision-based approach is still error-prone with dropdown menus, dynamically loaded content, multi-step forms, and other complex UI elements. Humans complete such operations almost effortlessly through muscle memory and spatial cognition, whereas an AI has to re-interpret the interface state from every new screenshot, so even seemingly simple operations can go badly wrong.

A single dropdown selection error can then set off an entirely unreasonable chain of decisions. Luna did not "intentionally" try to hire someone from Afghanistan; it simply selected the wrong option at the top of the country list (alphabetically, Afghanistan happens to be first), and every subsequent decision was built on that erroneous input.
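One way to contain this kind of error is to select options by label rather than by position and to verify the form state before acting on it. The sketch below assumes the agent can read the dropdown's options as text; `pick_country` and `verify_country_selection` are hypothetical helpers, not part of any real staffing platform or of the Computer Use feature.

```python
def pick_country(options: list[str], wanted: str) -> str:
    """Select by exact label (case-insensitive) instead of by list position."""
    target = wanted.strip().casefold()
    for option in options:
        if option.strip().casefold() == target:
            return option
    raise ValueError(f"{wanted!r} is not in the dropdown options")


def verify_country_selection(selected: str, job_country: str) -> bool:
    """Return True only if the selected country matches where the job is."""
    return selected.strip().casefold() == job_country.strip().casefold()


# Hypothetical, shortened option list; alphabetical order puts Afghanistan first.
options = ["Afghanistan", "Albania", "United States"]

default_choice = options[0]  # what a mis-click on the first item yields
default_is_valid = verify_country_selection(default_choice, "United States")
correct_choice = pick_country(options, "united states")
```

Here `default_is_valid` is `False`, so a check like this would stop the workflow before any hiring decision is built on the wrong country.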

Product Selection & Procurement: Selling Toilet Seats in a Boutique

During the planning phase, Luna's performance was actually commendable. Considering the store's location in an upscale neighborhood, it positioned the shop as "high-tech, slow living" style, selecting refined categories like candles and fragrances, coffee, and art prints — a quite reasonable judgment.

But when it came to actual procurement, problems piled up:

Hoarding candles obsessively: Purchasing quantities far exceeding reasonable inventory levels
Ordering 1,000 toilet seats: Selling toilet seats in a boutique — a baffling decision
Inconsistent logo design: Luna independently designed a moon-face logo, but each generated image had subtle differences, resulting in inconsistent logo styles throughout the store

Luna's self-designed moon-face logo

This reflects the lack of a "common-sense review" mechanism at the execution level. Large language models (LLMs) often do well at abstract reasoning and planning but err frequently in concrete execution, a mismatch sometimes called the "Planning-Execution Gap." The root cause is that LLM training data is primarily text: the models excel at generating reasonable plan descriptions but lack precise models of physical-world constraints. Luna can understand the abstract idea that "a boutique needs refined products," but cannot accurately judge what 1,000 toilet seats means for a small shop, because it has no grounded understanding of operational parameters such as inventory turnover, storage space, and customer purchase frequency. The logo inconsistency has a similar origin in an inherent property of generative AI: each image generation is an independent random sampling process, and nothing external enforces visual consistency across generations.
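The missing "common-sense review" could take the form of a simple check on each purchase order before it is placed. The following is a minimal sketch under assumed inputs: the function name, the thresholds, and the example figures (two units sold per week, room for 200 units) are illustrative and do not come from the experiment.

```python
from dataclasses import dataclass


@dataclass
class OrderCheck:
    approved: bool
    reason: str


def check_order(
    quantity: int,
    expected_weekly_sales: float,
    max_weeks_of_cover: float,
    storage_capacity_units: int,
) -> OrderCheck:
    """Reject orders that cannot be stored or would take too long to sell."""
    if quantity > storage_capacity_units:
        return OrderCheck(False, "exceeds storage capacity")
    if expected_weekly_sales <= 0:
        return OrderCheck(False, "no expected demand for this item")
    # Weeks needed to sell the whole order at the expected sales rate.
    weeks_of_cover = quantity / expected_weekly_sales
    if weeks_of_cover > max_weeks_of_cover:
        return OrderCheck(
            False,
            f"{weeks_of_cover:.0f} weeks of stock exceeds the "
            f"{max_weeks_of_cover}-week limit",
        )
    return OrderCheck(True, "ok")


# Hypothetical numbers for a small boutique.
bulk_order = check_order(1000, expected_weekly_sales=2,
                         max_weeks_of_cover=8, storage_capacity_units=200)
small_order = check_order(16, expected_weekly_sales=2,
                          max_weeks_of_cover=8, storage_capacity_units=200)
```

With these assumed numbers, the 1,000-unit order is rejected on storage alone, while a 16-unit order passes both checks.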

Hiring & Interviews: Rejecting the Best Candidate, Talking to Itself Before Sending an Offer

Luna's series of early operations

Luna was equally chaotic during the hiring phase. It could complete standardized processes like registering on LinkedIn, uploading business licenses, and writing job descriptions, but completely fell apart during interviews that required judgment:

Directly rejected the most suitable candidate
In another interview, talked to itself for 15 minutes, then sent the person an offer

Humans rely on extensive non-verbal signals during interviews and social interactions — tone, facial expressions, body language, hesitation in responses — to make comprehensive judgments. Even AI that interacts through voice or video currently struggles to reliably integrate these multimodal signals into accurate personnel assessments. Furthermore, interviewing is fundamentally a task requiring "Theory of Mind": interviewers need to infer candidates' true abilities, motivations, and cultural fit — a deep level of psychological modeling that still exceeds the reliable capabilities of current AI systems. Luna's "talking to itself for 15 minutes" phenomenon likely resulted from its dialogue management module falling into a self-referential reasoning loop in the absence of effective human feedback, ultimately defaulting to sending an offer as its "task completion" exit strategy.

Grand Opening Operations: Random Discounts and Phone Checkout

Since Luna has no physical form, the store adopted a unique checkout method: customers tell Luna what they want to buy through an old-fashioned wired telephone, Luna creates an order on a nearby iPad, and customers then swipe their cards to pay.

Luna's reaction when customers request discounts

Even more absurd, whenever customers asked for discounts or freebies, Luna would randomly apply discounts, determining the discount level entirely on "whim" with absolutely no pricing strategy.

Current AI systems lack "anchored perception" of numerical values — a widely discussed technical limitation. A human business owner has intuitive understanding of "$100,000" — how many months of rent it covers, how much inventory it can purchase, how much should be kept as emergency reserves. But for AI, all numbers are essentially symbols in token sequences. It has no emotional feedback mechanism for "feeling the pain of spending money," nor any resource-protection instinct driven by survival pressure. This is why Luna appeared so "generous" when facing customer discount requests — it lacks the internal drive to translate abstract numbers into resource constraints. A 10% discount and a 50% discount have no essential difference to it; both are just outputting a different number.
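A pricing rule can stand in for the missing intuition by turning "a 10% discount versus a 50% discount" into a hard constraint. The sketch below is one possible rule, not the store's actual policy: the 15% cap and the 20% minimum margin over cost are assumed values, and `allowed_discount` is a hypothetical helper.

```python
def allowed_discount(
    requested_pct: float,
    unit_price: float,
    unit_cost: float,
    min_margin_pct: float = 20.0,    # assumed minimum markup over cost
    max_discount_pct: float = 15.0,  # assumed hard cap on any discount
) -> float:
    """Return the discount to grant, clamped by a cap and a margin floor."""
    if unit_price <= 0:
        raise ValueError("unit_price must be positive")
    # Lowest selling price that still keeps the required margin over cost.
    floor_price = unit_cost * (1 + min_margin_pct / 100)
    # Largest discount that does not push the price below that floor.
    margin_limit_pct = max(0.0, (1 - floor_price / unit_price) * 100)
    return max(0.0, min(requested_pct, max_discount_pct, margin_limit_pct))


# A customer asks for 50% off an item priced at $40 that cost $20.
granted = allowed_discount(50, unit_price=40, unit_cost=20)
```

Under these assumptions the request for 50% is cut to the 15% cap, and an item priced at or below its margin floor gets no discount at all.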

However, Luna did excel at one thing that's very boss-like — monitoring employees. By checking security camera feeds, it noticed an employee playing on their phone during slow periods and immediately updated the employee handbook the next day with stricter phone usage rules. This behavior actually demonstrates AI Agents' advantages in rule enforcement and anomaly detection — supervision tasks based on clear rules are precisely the domain where AI excels.

How Far Are AI Agents from Autonomous Business Management?

While this experiment was full of comedic moments, it provides valuable real-world reference points for AI Agent commercial deployment:

Planning Capability Acceptable, Execution Capability Concerning

Luna performed reasonably at the abstract decision level (positioning, category selection), but once entering concrete execution (procurement quantities, UI operations), errors became frequent. This is a typical manifestation of current large language models "knowing what to do, but not knowing how to do it well." From a technical perspective, this phenomenon is closely related to how LLMs are trained: models learn "what constitutes a reasonable plan" through massive text corpora but lack reinforcement learning experience from repeated trial-and-error in real environments. Future solutions may include introducing more environmental feedback loops, setting execution-level rule guardrails, and adding human review mechanisms at critical decision points.

Lacking Cost Awareness and Risk Control

Random discounts, excessive procurement, unreasonable product choices — these all point to the same problem: AI lacks genuine perception of "money." $100,000 is just a number to it, not a finite resource requiring careful budgeting. Solving this problem may require introducing explicit budget constraint modules into AI systems — similar to giving AI a "financial pain" mechanism that automatically triggers more conservative decision modes when spending approaches thresholds.
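Such a budget constraint module might look like the sketch below. It assumes a single spending pool, a threshold at 50% of the budget that switches the agent into a conservative mode, and a per-purchase limit above which a human must approve; all of these numbers and the `BudgetGuard` class itself are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    total_budget: float
    caution_threshold: float = 0.5     # assumed fraction spent that triggers caution
    review_limit: float = 5_000.0      # assumed per-purchase limit in normal mode
    spent: float = 0.0

    @property
    def mode(self) -> str:
        """Switch to conservative mode once spending crosses the threshold."""
        if self.spent >= self.caution_threshold * self.total_budget:
            return "conservative"
        return "normal"

    def request_spend(self, amount: float) -> str:
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.spent + amount > self.total_budget:
            return "rejected"
        # In conservative mode the agent may approve only half as much alone.
        limit = self.review_limit / 2 if self.mode == "conservative" else self.review_limit
        if amount > limit:
            return "needs_human_review"
        self.spent += amount
        return "approved"


guard = BudgetGuard(total_budget=100_000)
first = guard.request_spend(4_000)    # within the normal-mode limit
second = guard.request_spend(6_000)   # above the limit, so escalated
```

Only approved amounts are added to `spent`, so escalated or rejected requests do not move the agent closer to the threshold.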

Interpersonal Interaction Remains a Clear Weakness

Whether in interviews or customer service, scenarios involving complex interpersonal judgment are Luna's weak points. This suggests that AI Agents are better suited for assisting decisions in the short term rather than completely replacing human managers. The complexity of interpersonal interaction lies in requiring real-time contextual understanding, cultural sensitivity, and the comprehensive application of emotional intelligence — capabilities that remain uniquely human advantages.

Conclusion

Luna lost $13,000 in one month, and the $100,000 in startup capital was nearly depleted. The experiment suggests that letting an AI run a physical store completely autonomously is still premature. But it also showed the potential of AI agents: at a minimum, Luna independently completed the entire process from zero to an open store, even if the quality was inconsistent.

The way forward may not be to make AI an "all-capable store manager," but to give it a precise supporting role in specific areas such as inventory management and shift optimization. Moving from "jack of all trades, master of none" to specialized expertise in one domain may be the right path for AI agents to gain a foothold in physical retail. Notably, the value of this experiment lies not in proving that AI "can't do it," but in locating the capability boundaries of current technology: knowing where failures occur is the first step toward finding the right applications.