AI Product Development in Practice: Model Selection, Building Moats, and Paths to Commercialization

Core Principle: Don't Train Models from Scratch

In AI product development, model selection is one of the most critical strategic decisions. BloombergGPT is a frequently cited cautionary tale: Bloomberg invested heavily in training a finance-specific model from scratch, only to see most of that advantage erased by general-purpose models such as GPT-4.

BloombergGPT is a 50-billion-parameter LLM announced in March 2023, trained from scratch by Bloomberg's team on proprietary financial data (financial news, SEC filings, research reports, and more, drawn from archives built up over roughly 40 years) mixed with public datasets. The project required a compute budget commonly estimated in the millions of dollars, plus significant time from top researchers. Yet GPT-4, which OpenAI had released about two weeks earlier, was soon reported to match or exceed it on a range of financial NLP benchmarks, while also offering general reasoning capabilities that BloombergGPT lacked. This case illustrates the "follower's dilemma" in foundation models: while you spend a year training a specialized model, general-purpose models may have already expanded to cover your use case.

For the vast majority of teams, model strategy should follow this priority order:

1. Call APIs directly: Start by rapidly prototyping with top models like Claude, GPT, and Gemini
2. Fine-tuning: Only consider this after exhausting prompt engineering, Agent design, and model switching
3. Self-hosted open-source models: Migrate for cost or data privacy reasons after product-market fit
4. Training from scratch: Almost never advisable

Fine-tuning means continuing to train a pretrained foundation model's parameters on a domain-specific or task-specific dataset. Common approaches include full-parameter fine-tuning and parameter-efficient techniques such as LoRA (Low-Rank Adaptation) and QLoRA. The core risk of fine-tuning is the "generational leap" problem: you might spend three months fine-tuning a legal document analysis model on top of GPT-3.5, only for GPT-4 to achieve equivalent results through zero-shot prompting. Fine-tuning also faces technical challenges such as catastrophic forgetting, overfitting, and high data quality requirements.
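To make "parameter-efficient" concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning versus LoRA, which keeps the original d x k matrix frozen and trains two low-rank factors, B (d x r) and A (r x k). The layer size and rank are illustrative assumptions, not values taken from any particular model.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters under LoRA: B is d x r, A is r x k, W stays frozen."""
    return r * (d + k)


# Assumed sizes for illustration: a 4096 x 4096 projection adapted with rank 8.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
# full: 16,777,216  lora: 65,536  ratio: 256x
```

The small trainable footprint lowers the cost of a fine-tuning run, but it does not remove the generational-leap risk described above: the adapter is still tied to one base model.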

Before resorting to fine-tuning, teams should fully exploit the potential of prompt engineering and Agent design. Prompt engineering means carefully crafting input prompts to guide model outputs, using strategies such as few-shot prompting, Chain-of-Thought, and role-setting. Agent design uses the LLM as a reasoning core and combines it with tool calling, memory systems, and planning capabilities to build autonomous agents. Neither approach modifies model parameters, so when the underlying model is upgraded, the upper-layer logic can usually migrate with little or no rework. In practice, many teams report that well-designed system prompts combined with RAG (Retrieval-Augmented Generation) cover the large majority of use cases (a commonly quoted rule of thumb is over 80%) without any fine-tuning.
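A minimal sketch of why prompt-level logic survives a model upgrade: the prompt template (role-setting, few-shot examples, and a Chain-of-Thought cue) depends only on a "prompt in, text out" function. The names `ModelFn`, `FewShotPrompt`, `old_model`, and `new_model` are hypothetical, and the two stand-in models return canned strings rather than calling any real API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
# In a real product this would wrap a vendor SDK call.
ModelFn = Callable[[str], str]


@dataclass
class FewShotPrompt:
    role: str                        # role-setting line
    examples: List[Tuple[str, str]]  # (question, answer) demonstrations

    def render(self, question: str) -> str:
        parts = [self.role, ""]
        for q, a in self.examples:
            parts += [f"Q: {q}", f"A: {a}", ""]
        # Chain-of-Thought cue appended to the final answer slot.
        parts += [f"Q: {question}", "A: Let's think step by step."]
        return "\n".join(parts)


def answer(model: ModelFn, prompt: FewShotPrompt, question: str) -> str:
    """Upper-layer logic: knows nothing about which model sits underneath."""
    return model(prompt.render(question)).strip()


# Stand-in models for illustration only.
def old_model(prompt: str) -> str:
    return " stub answer from old model "


def new_model(prompt: str) -> str:
    return " stub answer from new model "


prompt = FewShotPrompt(
    role="You are a careful financial filings analyst.",
    examples=[("What form reports annual results?", "Form 10-K")],
)
print(answer(old_model, prompt, "What form reports quarterly results?"))
print(answer(new_model, prompt, "What form reports quarterly results?"))
```

Swapping `old_model` for `new_model` changes one argument; the template and the calling code stay as they are.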

The biggest risk of fine-tuning is that the next generation of foundation models may natively achieve what you spent enormous effort fine-tuning for. BuzzFeed's approach is worth emulating: they first validated product viability with the GPT API and only later, after multiple iterations, cut operating costs dramatically through fine-tuning and self-hosting.

Product Building: Go Deep in a Narrow Domain

The Moat Is in the System, Not the Model

The moat for AI products doesn't lie in the underlying model, but in the entire engineering system built around it: evaluation frameworks, security and privacy handling, caching mechanisms, workflow orchestration, and more. These system-level capabilities are the real competitive barriers.

Specifically, evaluation frameworks ensure that model upgrades or prompt adjustments don't introduce regressions; security and privacy handling includes PII (Personally Identifiable Information) masking, output content moderation, and adversarial attack protection; caching mechanisms use semantic similarity matching to cache repeated queries, significantly reducing latency and cost; workflow orchestration breaks complex tasks into multiple LLM call steps with conditional logic and error handling to form reliable automation pipelines. These engineering capabilities take time to accumulate and are deeply coupled with specific business scenarios, making them difficult for competitors to simply replicate.
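The caching idea above can be sketched as follows. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.8 similarity threshold is arbitrary, and the linear scan would be replaced by a vector index in production. `SemanticCache` and `cached_call` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the cached answer of the most similar query, if similar enough."""
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((toy_embed(query), answer))


def cached_call(cache: SemanticCache, model: Callable[[str], str], query: str) -> str:
    hit = cache.get(query)
    if hit is not None:
        return hit           # no LLM call: lower latency and cost
    answer = model(query)    # cache miss: pay for one model call
    cache.put(query, answer)
    return answer


calls = []

def counting_model(query: str) -> str:
    """Stand-in for an LLM call that records how often it is invoked."""
    calls.append(query)
    return f"answer #{len(calls)}"


cache = SemanticCache()
first = cached_call(cache, counting_model, "What was Apple's revenue in 2023?")
second = cached_call(cache, counting_model, "what was apple's revenue in 2023")
third = cached_call(cache, counting_model, "Summarize Tesla's risk factors")
print(first, second, third, len(calls))
```

The second query differs only in casing and punctuation, so it is served from the cache; the third is unrelated and triggers a fresh call. Choosing the threshold is itself an evaluation problem: too low and users receive answers to questions they did not ask.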

Go Narrow and Go Deep

Rather than building a generic but mediocre Q&A chatbot, focus on a vertical scenario and excel at it. For example, build an intelligent Q&A system specifically for SEC financial filings that delivers an experience far superior to general-purpose solutions. Being transparent about your system's capability boundaries actually makes it easier to build user trust.

The underlying logic is this: general-purpose LLMs perform "good enough but not expert-level" in any single domain. When you focus on a narrow domain, you can build specialized knowledge bases, design domain-specific evaluation criteria, and optimize outputs for specific formats, creating a significant experience gap. Users will pay for "noticeably better in my specific use case" rather than "can do everything but nothing well."

Strategic Procrastination

Don't reinvent the wheel for generic features — wait for mature solutions to emerge in the market and integrate them directly. Concentrate R&D resources on directions that are tightly bound to your business scenario and offer differentiated advantages.

Execution Path: Build Your Evaluation System from Day One

The right AI product development workflow is:

1. Rapidly prototype using the strongest model + prompt engineering
2. Build an evaluation system from day one, with human evaluation and data labeling
3. Continuously collect high-quality data to start the data flywheel
4. Iterate and optimize based on feedback

AI product evaluation systems typically include multiple layers: unit-level evaluation (testing output quality of individual LLM calls), pipeline-level evaluation (testing end-to-end performance of multi-step Agents), and product-level evaluation (testing actual user experience metrics). In practice, teams need to establish golden datasets, design automated scoring metrics (such as factual accuracy, hallucination rate, format compliance rate, etc.), and combine these with human evaluation to form a closed loop. Common tools in the industry include LangSmith, Braintrust, Promptfoo, and others. AI product development without an evaluation system is like software development without unit tests — you never know whether a change is an improvement or a regression.
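A unit-level evaluation loop along these lines can be very small. The sketch below assumes a hypothetical output contract (a JSON object with a string `answer` field), a three-item golden dataset invented for illustration, and two stub systems standing in for "before" and "after" a prompt change. It computes format compliance and accuracy, then flags any metric that regressed.

```python
import json
from typing import Callable, Dict, List

# Tiny hypothetical golden dataset; real ones are curated and much larger.
GOLDEN = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]


def is_valid_format(raw: str) -> bool:
    """Format compliance: output must be a JSON object with a string 'answer'."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and isinstance(obj.get("answer"), str)


def evaluate(system: Callable[[str], str], golden: List[dict]) -> Dict[str, float]:
    compliant = correct = 0
    for case in golden:
        raw = system(case["input"])
        if is_valid_format(raw):
            compliant += 1
            if json.loads(raw)["answer"].strip() == case["expected"]:
                correct += 1
    n = len(golden)
    return {"format_compliance": compliant / n, "accuracy": correct / n}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Names of metrics where the candidate scores below the baseline."""
    return [m for m in baseline if candidate[m] < baseline[m]]


ANSWERS = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}

def baseline_system(question: str) -> str:
    """Stub for the current production prompt: always well-formed and correct."""
    return json.dumps({"answer": ANSWERS[question]})

def candidate_system(question: str) -> str:
    """Stub for a changed prompt that breaks the output format on one input."""
    if question == "3*3":
        return "The answer is 9."
    return json.dumps({"answer": ANSWERS[question]})


base = evaluate(baseline_system, GOLDEN)
cand = evaluate(candidate_system, GOLDEN)
print(base, cand, regressions(base, cand))
```

Run on every prompt or model change, a check like this plays the role that unit tests play in ordinary software: the candidate above would be blocked because both metrics dropped.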

The Data Flywheel is the most powerful positive feedback loop in AI products: product goes live and collects user interaction data → labeling team assesses data quality → high-quality data improves the model or optimizes prompts → product experience improves → attracts more users → generates more data. Once this flywheel starts spinning, latecomers can't catch up even with the same foundation model, because your accumulated domain data and user feedback are unique competitive assets. Tesla's self-driving and TikTok's recommendation algorithm are classic examples of data flywheels.

The core goal of building an LLMOps pipeline is to shorten the feedback loop. The pace of change in AI is extremely fast, and the ability to iterate quickly directly determines whether a product lives or dies.

Plan Today's Product with Tomorrow's Prices

LLM API costs are dropping exponentially. Using GPT-4-level capability as a benchmark, API prices fell by roughly two orders of magnitude in under 18 months. When GPT-4 launched in March 2023, input tokens cost $30 per million; when GPT-4o mini launched in July 2024, broadly comparable capability cost about $0.15 per million input tokens, a 200x reduction. This decline stems from multiple factors: advances in model distillation, inference hardware optimization, maturing quantization techniques, and intensifying market competition (price wars among OpenAI, Google, Anthropic, and the open-source community). If the trend continues, a product spending $10,000/month on API costs today might need only a few hundred dollars to maintain the same service scale a year from now.
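The arithmetic behind that projection can be checked with a few lines. The sketch takes the two price points quoted above ($30 and $0.15 per million input tokens) and assumes, purely for illustration, that the decline was spread evenly over an 18-month window and continues at the same monthly rate. Both assumptions are simplifications; real prices fall in steps.

```python
def monthly_price_factor(start_price: float, end_price: float, months: float) -> float:
    """Constant month-over-month multiplier implied by two price points."""
    return (end_price / start_price) ** (1.0 / months)


def projected_monthly_bill(bill_today: float, factor: float, months_ahead: float) -> float:
    """Bill for the same usage after `months_ahead` months of steady decline."""
    return bill_today * factor ** months_ahead


fold = 30.0 / 0.15                              # overall price reduction
factor = monthly_price_factor(30.0, 0.15, 18)   # assumed 18-month window
bill_next_year = projected_monthly_bill(10_000, factor, 12)

print(f"overall reduction: {fold:.0f}x")
print(f"implied monthly price drop: {1 - factor:.1%}")
print(f"$10,000/month today -> about ${bill_next_year:,.0f}/month in 12 months")
```

Under these assumptions the projected bill lands at roughly $300 a month, consistent with the "few hundred dollars" estimate. A shorter window or a slower future decline would change the number, which is why the projection belongs in a planning spreadsheet rather than a budget commitment.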

When planning products, evaluate commercial viability using future prices — solutions that seem too expensive today may have viable business models within a year. Be bold and forward-looking: build the product first and validate its value.

But stay clear-eyed: the distance from demo to reliable product is extremely long. Just as autonomous driving took two to three decades from technical demo to actual productization, AI products similarly require patience — continuous investment in reliability and user experience. The probabilistic nature of LLMs means outputs always carry uncertainty, and the core challenge of productization is controlling that uncertainty within a range users can accept — this requires extensive edge case handling, fallback strategy design, and user expectation management.
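One common way to bound that uncertainty is a validate-then-fall-back wrapper: every output is checked against an explicit rule, failures trigger a retry or a different model, and a safe default is returned when nothing passes. The sketch below is one possible shape for this, not a prescribed design; `call_with_fallback` and the three stand-in models are hypothetical and make no real API calls.

```python
from typing import Callable, Sequence

ModelFn = Callable[[str], str]  # hypothetical interface: prompt in, text out


def call_with_fallback(
    models: Sequence[ModelFn],
    prompt: str,
    validate: Callable[[str], bool],
    attempts_per_model: int = 2,
    default: str = "Sorry, I can't answer that reliably right now.",
) -> str:
    """Try each model in order; return the first output that passes validation."""
    for model in models:
        for _ in range(attempts_per_model):
            try:
                output = model(prompt)
            except Exception:
                continue  # call failed: retry, then move on to the next model
            if validate(output):
                return output
    return default  # nothing passed: degrade gracefully instead of guessing


# Stand-in models for illustration only.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")

def rambling_secondary(prompt: str) -> str:
    return "I think maybe possibly..."

def steady_backup(prompt: str) -> str:
    return "ANSWER: 42"

def has_answer_prefix(text: str) -> bool:
    return text.startswith("ANSWER:")


result = call_with_fallback(
    [flaky_primary, rambling_secondary, steady_backup], "q", has_answer_prefix
)
print(result)  # ANSWER: 42
```

The explicit default matters as much as the retries: telling the user the system cannot answer is part of the expectation management described above.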

Key Takeaways

For the vast majority of companies, training models from scratch is unwise — prioritize APIs, then consider fine-tuning and self-hosting
AI product moats lie in system engineering capabilities like evaluation, security, and caching built around the model
Going deep in a narrow domain builds more user trust than building generic but mediocre products
Build evaluation systems and data flywheels from day one to ensure continuous iteration capability
Plan today's products with tomorrow's prices — API costs are dropping exponentially