Simon Willison Updates WebRTC Voice Tool: Now Supports Document Context Conversations and GPT-Realtime-2

Background: From Experimental Tool to Practical Voice Assistant

Renowned developer Simon Willison recently updated his OpenAI WebRTC voice interaction tool, adding a document context feature that lets users have voice conversations with AI based on specific document content directly in the browser. The tool was originally launched in December 2024 as a way to experiment with OpenAI's newly released WebRTC real-time audio API.

Simon Willison is a co-creator of the Python web framework Django and the author of Datasette, an open-source data exploration tool. He's highly influential in the AI tool development community, consistently documenting and sharing his experiments with various AI APIs through his blog. His projects are known for their "minimum viable tool" philosophy — achieving practical functionality with as little code as possible while remaining fully open source, providing the developer community with a wealth of directly referenceable API integration examples.

The core driver behind this update is the brand-new model OpenAI released last month — GPT-Realtime-2, officially positioned as "the first voice model with GPT-5-level reasoning capabilities," with a knowledge cutoff date of September 30, 2024.

[Screenshot: OpenAI WebRTC Audio Session tool interface]

Two Key Updates: GPT-Realtime-2 and Document Context

GPT-Realtime-2 Model Integration

Simon mentioned that he had been waiting for the GPT-Realtime-2 model to appear in the ChatGPT iPhone app, but it never showed up. So he decided to go back to his own experimental tool and connect to this more powerful model directly via the API. Users can now select different model versions from a dropdown menu in the tool's interface, including the latest gpt-realtime-2.

GPT-Realtime-2 is a model specifically designed by OpenAI for real-time voice interaction scenarios. By OpenAI's positioning, it is a substantial step up in reasoning capability over previous real-time audio models. Earlier real-time voice models (such as the gpt-4o-realtime series) excelled in voice naturalness but often fell short when handling questions requiring deep thinking. GPT-Realtime-2 is meant to fill this gap, enabling voice interactions to go beyond simple Q&A and support technical discussions, document analysis, and other cognitively demanding conversation scenarios.

This means developers and power users can experience OpenAI's latest voice reasoning capabilities before the official ChatGPT app — a unique advantage of building your own tools.

Document Context: A Knowledge Anchor for Voice Conversations

This is the most practical new feature in this update. Users can paste large blocks of document text into a dedicated text area before starting a voice session. The AI can then discuss and answer questions based on that document content during the voice conversation.

From a technical implementation perspective, this feature is a simpler alternative to Retrieval-Augmented Generation (RAG) rather than RAG itself, because it has no retrieval step. In a standard RAG architecture, the system splits external documents into chunks, vectorizes them, stores them in a database, and retrieves only the relevant chunks to inject into the model context when users ask questions. Simon's tool takes a more direct approach: it passes the entire document to the model as part of the system prompt. This approach is limited by the model's context window size, but its simplicity and lack of additional infrastructure make it well suited to quick interactions with a single document.
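To make the "whole document in the system prompt" idea concrete, here is a minimal Python sketch. It is not taken from Simon's tool. The context window size, the characters-per-token ratio, the prompt wording, and the helper names are all hypothetical, and the use of a `session.update` event to carry the instructions is an assumption about how such a tool might pass the document to the Realtime API.

```python
import json

# Hypothetical numbers for illustration only; real limits depend on the model.
ASSUMED_CONTEXT_TOKENS = 32_000
ASSUMED_CHARS_PER_TOKEN = 4  # rough rule of thumb for English text


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer would be more accurate."""
    return -(-len(text) // ASSUMED_CHARS_PER_TOKEN)  # ceiling division


def build_instructions(document: str) -> str:
    """Place the whole document in the system prompt, with no retrieval step."""
    return (
        "Answer questions using the document below.\n\n"
        "<document>\n" + document.strip() + "\n</document>"
    )


def build_session_update(document: str, reserve_tokens: int = 4_000) -> str:
    """Return a JSON `session.update` event carrying the document as instructions.

    Raises ValueError when the document would not leave `reserve_tokens`
    of headroom for the conversation itself.
    """
    instructions = build_instructions(document)
    needed = estimate_tokens(instructions) + reserve_tokens
    if needed > ASSUMED_CONTEXT_TOKENS:
        raise ValueError(f"document too large: ~{needed} tokens needed")
    event = {"type": "session.update", "session": {"instructions": instructions}}
    return json.dumps(event)
```

The size check is the only place where the context window limit shows up. A RAG pipeline would replace `build_instructions` with a retrieval step that selects a few chunks per question, at the cost of an index and an embedding model.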

From the screenshot, we can see that Simon's demo involved pasting a Markdown document about "whether DuckDB can safely execute untrusted SQL the way Datasette runs SQLite." The technical background here is worth expanding on: DuckDB is an embedded analytical database (OLAP), similar in positioning to SQLite but optimized for analytical queries, and has rapidly gained popularity in the data engineering space in recent years. Datasette is Simon's own tool that can instantly publish SQLite databases as interactive web APIs and interfaces. The core security question in the demo document is that Datasette allows users to execute arbitrary SQL queries — SQLite's sandboxing properties make this relatively safe, but DuckDB has more powerful capabilities like file system access, presenting different challenges for security isolation. The AI was then able to engage in a voice discussion about this technical topic, with the transcript panel at the bottom showing the model analyzing DuckDB's security issues.

The use cases for this feature are extensive:

Technical document review: Paste a code snippet or technical specification and quickly discuss issues via voice
Research paper study: Paste academic paper content and explore key arguments through conversation
Meeting preparation: Import meeting materials and quickly familiarize yourself with the content through voice Q&A
Learning assistance: Use textbook content as context for interactive voice-based learning

The Elegant Simplicity of the WebRTC Implementation

The design philosophy of this tool is worth noting. The entire interface is remarkably clean: an API Token input field, voice selection (defaulting to Coral), model selection, a collapsible document context area, and start/mute buttons. At the bottom is a real-time transcript panel displaying the AI's most recent voice output as text.

It runs entirely in the browser, communicating directly with OpenAI's real-time audio API via the WebRTC protocol. WebRTC (Web Real-Time Communication) is an open standard originally developed by Google and later standardized by the W3C and IETF. It enables real-time audio, video, and data transmission between browsers without plugins or additional software. It was designed around peer-to-peer (P2P) connections, in which media streams flow directly between two endpoints instead of through an application server. In this tool the two endpoints are the user's browser and OpenAI's servers, so the benefit comes less from bypassing servers than from WebRTC's low-latency media transport: the user's speech streams to OpenAI as it is spoken, and the model's voice response streams back as it is generated, which makes the interaction feel close to natural conversation. A traditional HTTP request-response pattern would add a noticeable wait on every turn, and nobody wants to wait several seconds for a response mid-conversation.
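Before any audio flows, a WebRTC client has to exchange session descriptions (SDP) with the other endpoint. The Python sketch below shows only that signaling step, as a request that is built but not sent. In the actual tool the browser's `RTCPeerConnection` would create the offer. The endpoint URL and the use of the `model` query parameter are assumptions based on OpenAI's published WebRTC examples and may not match the current API or the tool's code. `build_sdp_request` is a hypothetical helper name.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: endpoint taken from OpenAI's published WebRTC examples;
# check the current Realtime API documentation before relying on it.
ASSUMED_REALTIME_URL = "https://api.openai.com/v1/realtime"


def build_sdp_request(
    offer_sdp: str, api_token: str, model: str
) -> urllib.request.Request:
    """Build (but do not send) the HTTP POST that carries the SDP offer.

    The response body to such a request would be the SDP answer, which the
    client applies as its remote description to complete the handshake.
    """
    url = f"{ASSUMED_REALTIME_URL}?{urlencode({'model': model})}"
    return urllib.request.Request(
        url,
        data=offer_sdp.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/sdp",
        },
    )
```

Only this handshake uses ordinary HTTP. Once the offer and answer are exchanged, audio travels over the WebRTC media connection, which is where the latency advantage described above comes from.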

One detail worth mentioning: users need to provide their own OpenAI API Token, meaning all costs are charged directly to the user's own account. The tool itself involves no intermediate servers or additional costs.

Implications for Voice AI Application Developers

This project reflects an important trend: when official product iteration can't keep pace with API capabilities, the developer community fills the gap on its own. Simon explicitly stated that his reason for updating this tool was that the ChatGPT app hadn't yet integrated GPT-Realtime-2.

For developers looking to build voice AI applications, this open-source tool provides an excellent reference implementation. The addition of the document context feature also demonstrates a key product insight: voice interaction shouldn't exist in isolation — it needs to be combined with the user's specific knowledge context to truly deliver value. This philosophy aligns closely with the current consensus in AI application development that "context is everything" — whether it's RAG systems, Agent frameworks, or voice assistants, AI tools that can effectively leverage external knowledge always solve real-world problems better than generic conversational AI.

With GPT-Realtime-2 bringing GPT-5-level reasoning capabilities, the quality of document-based voice conversations should improve, making tools like this more practical. Interested readers can visit Simon's tool page to try it out for themselves.
