Agora RTC + MQTT Replacing WebSocket: A Practical Guide to Decoupled Architecture for AI Voice Interaction Devices
Bilibili creator Wang Shixiong shared an embedded AI voice interaction architecture using Agora RTC and MQTT to replace traditional WebSocket. The solution separates signaling (MQTT) from media (RTC) channels, achieving complete device-server decoupling with cross-platform communication support for ESP32, mobile apps, mini-programs, Linux, and more, while ensuring audio quality and signaling reliability even in weak network conditions.
Introduction
Most AI voice interaction devices today build their voice channel on WebSocket. The approach is simple, but it has clear shortcomings in device-server coupling, cross-platform communication, and adaptation to low-bandwidth environments. Drawing on his own embedded development board project, Bilibili content creator Wang Shixiong shared a voice interaction architecture that replaces WebSocket with Agora RTC plus MQTT, fully decoupling the device side from the server side while aiming for commercial-grade stability and scalability.
Why Abandon the WebSocket Approach?
Pain Points of Traditional WebSocket Solutions
Most mainstream AI voice interaction solutions, open-source projects and commercial products alike, use WebSocket as the transport for both voice data and signaling. Under this architecture the device and server are tightly coupled: if the server is not running, the device has nothing to connect to, and the entire system becomes unavailable.
WebSocket is a full-duplex protocol that runs over TCP. It was standardized in 2011 as RFC 6455 to address HTTP's inability to push data from server to client. In embedded IoT, its design assumptions become structural weaknesses. The protocol requires a persistent TCP connection to one specific server, so when that server is unreachable the client is effectively offline. WebSocket also has no built-in QoS (Quality of Service) mechanism, so recovering from dropped connections and lost messages on a weak network is left entirely to the application layer. Finally, a complete WebSocket stack typically occupies 50-100KB of ROM on MCUs such as the ESP32, which adds pressure on devices with only 4MB of flash.
On the ESP32 and other microcontrollers, flash and memory budgets are core concerns that developers have to watch constantly, so this footprint matters.
Advantages of the RTC + MQTT Dual-Channel Architecture
Wang Shixiong reworked the system into a dual-channel architecture in which Agora RTC carries the voice channel and MQTT carries the signaling channel. Separating the signaling channel from the media channel is a classic pattern in telecom-grade real-time communication systems and can be traced back to SIP (Session Initiation Protocol): SIP only handles session establishment, modification, and termination, while the actual voice data travels over independent RTP streams. In this solution, MQTT takes the signaling role (device status reporting, session control, LLM response text delivery), while Agora RTC takes the media role (bidirectional audio streams). Each channel has a single job, and a failure in one does not take down the other.
This design brings several key benefits:
- Complete decoupling between device and server: Even if the server hasn't started, the device can still connect normally to the Agora channel and to the MQTT broker (an EMQX server)
- Independent operation between devices: Each terminal is an independent node, flexibly combined through channel mechanisms
- MQTT protocol is more lightweight: Low latency, low bandwidth, with far less ROM footprint on embedded devices than WebSocket
- RTC ensures audio quality: Stable audio transmission quality even in weak network environments
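The decoupling benefit in the list above can be illustrated with a toy, in-memory broker. This is a minimal sketch rather than the author's implementation: `ToyBroker`, the topic name `devices/esp32-01/status`, and the payload strings are all hypothetical, and no real MQTT client or network is involved. It only shows that a device can announce its state before any server process exists, and that the server still receives that state when it starts later.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str, str], None]


class ToyBroker:
    """In-memory stand-in for an MQTT broker (illustration only, no networking)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self._retained: Dict[str, str] = {}

    def publish(self, topic: str, payload: str, retain: bool = False) -> None:
        if retain:
            # Remember the latest value so late subscribers still see it.
            self._retained[topic] = payload
        for handler in self._subscribers[topic]:
            handler(topic, payload)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)
        if topic in self._retained:
            # Deliver the retained value immediately on subscribe.
            handler(topic, self._retained[topic])


broker = ToyBroker()

# The device comes online first; no server process exists yet.
broker.publish("devices/esp32-01/status", "online", retain=True)

# The server starts later and still learns the device state from the broker.
server_inbox: List[Tuple[str, str]] = []
broker.subscribe("devices/esp32-01/status", lambda t, p: server_inbox.append((t, p)))

# From here on, live signaling also flows through the broker.
broker.publish("devices/esp32-01/status", "listening", retain=True)

print(server_inbox)
```

Neither side holds a reference to the other: both only know the broker and a topic name, which is why either one can restart without breaking its peer.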
MQTT (Message Queuing Telemetry Transport) was created in 1999 by Andy Stanford-Clark of IBM and Arlen Nipper for oil pipeline SCADA telemetry. Its core design philosophy is "serving constrained devices on unreliable networks." The fixed header can be as small as 2 bytes, compared with the hundreds of bytes of header overhead typical of an HTTP request, which is a significant advantage on bandwidth-constrained cellular networks (such as 2G or NB-IoT). MQTT uses a Publish/Subscribe (Pub/Sub) model in which a Broker sits between message senders and receivers, and this is the key mechanism behind device-server decoupling. The protocol has three built-in QoS levels (0, 1, and 2: at most once, at least once, and exactly once) to make signaling delivery reliable. It also supports the Will Message mechanism (Last Will and Testament), where the Broker automatically publishes a preset message when a device disconnects abnormally, which suits online-status tracking for IoT devices.
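To make the header-size claim concrete, the following sketch hand-encodes MQTT 3.1.1 packets with the standard library only. The function names and the topic `dev/1/state` are illustrative assumptions; a real device would use an MQTT client library rather than building bytes by hand.

```python
def encode_remaining_length(length: int) -> bytes:
    """Encode the MQTT 'remaining length' field (1 to 4 bytes, 7 bits per byte)."""
    if not 0 <= length <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        digit = length % 128
        length //= 128
        if length > 0:
            digit |= 0x80  # continuation bit: more length bytes follow
        out.append(digit)
        if length == 0:
            return bytes(out)


def build_publish(topic: str, payload: bytes, qos: int = 0, packet_id: int = 1) -> bytes:
    """Build an MQTT 3.1.1 PUBLISH packet (DUP and RETAIN flags left at 0)."""
    if qos not in (0, 1, 2):
        raise ValueError("QoS must be 0, 1 or 2")
    topic_bytes = topic.encode("utf-8")
    body = len(topic_bytes).to_bytes(2, "big") + topic_bytes
    if qos > 0:
        # The packet identifier is present only for QoS 1 and 2.
        body += packet_id.to_bytes(2, "big")
    body += payload
    first_byte = (3 << 4) | (qos << 1)  # packet type 3 = PUBLISH, QoS in bits 2..1
    return bytes([first_byte]) + encode_remaining_length(len(body)) + body


# PINGREQ is packet type 12 with no body, so the whole packet is the 2-byte header.
PINGREQ = bytes([0xC0, 0x00])

packet = build_publish("dev/1/state", b"idle", qos=1)
print(len(PINGREQ), len(packet), packet[:2].hex())
```

For any packet whose body is shorter than 128 bytes, the fixed header stays at 2 bytes: one byte for type and flags, one for the remaining length. The rest of the packet is topic, optional packet identifier, and payload.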

Cross-Platform Communication: One Protocol Covering All Terminals
Unified RTC Communication Protocol
The biggest highlight of this architecture is its cross-platform scalability. Based on the RTC protocol, whether it's an ESP32 embedded device, mobile APP, mini-program, H5 page, or Linux device, all can achieve interconnection through the same communication protocol.
Much of today's RTC (Real-Time Communication) technology builds on WebRTC, which Google open-sourced in 2011 and which was then standardized through the W3C and IETF. Its media transport uses SRTP (Secure Real-time Transport Protocol) over UDP rather than TCP, and this is the most fundamental difference from WebSocket. Because UDP is connectionless and does not retransmit, an RTC stack can actively compensate for jitter and packet loss with FEC (Forward Error Correction), a jitter buffer, NetEQ, and similar algorithms, instead of waiting for TCP retransmissions that cause latency to accumulate. Agora adds its own SD-RTN (Software Defined Real-time Network) global acceleration network, which uses intelligent routing and is described as keeping end-to-end latency within 100ms across complex domestic and international networks. This is its core commercial value compared with native WebRTC.
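As an illustration of how FEC repairs loss without retransmission, here is a minimal single-parity XOR sketch. It is not the scheme Agora or WebRTC actually uses (production codecs and FEC formats are more elaborate); the function names and the four dummy "frames" are assumptions chosen to keep the idea visible.

```python
from typing import List, Optional


def xor_parity(packets: List[bytes]) -> bytes:
    """XOR equally sized packets together into one parity packet."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        if len(pkt) != len(parity):
            raise ValueError("packets must share one size")
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[Optional[bytes]]:
    """Rebuild at most one missing packet (None) from the survivors and the parity."""
    missing = [i for i, pkt in enumerate(received) if pkt is None]
    if len(missing) > 1:
        raise ValueError("single-parity FEC can repair only one loss per group")
    if not missing:
        return list(received)
    survivors = [pkt for pkt in received if pkt is not None]
    repaired = list(received)
    # XOR of all survivors plus the parity equals the lost packet.
    repaired[missing[0]] = xor_parity(survivors + [parity])
    return repaired


frames = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_parity(frames)  # sent alongside the four audio frames

# Frame 1 is lost in transit; the receiver never asks for a resend.
on_the_wire = [frames[0], None, frames[2], frames[3]]
print(recover(on_the_wire, parity) == frames)
```

The cost is one extra packet per group of four, a bandwidth overhead paid up front in exchange for not waiting a round trip when a packet goes missing.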

Specific multi-platform access methods are as follows:
- ESP32 device side: Create rooms based on ESP32's C language SDK
- Mobile APP side: Create rooms based on mobile SDK
- Linux device side: Create rooms based on the Linux version RN architecture SDK
- Mini-program/H5: Access via Web SDK
As long as two devices are in the same channel on the signaling server, real-time communication can be achieved. At the code level, no major modifications are needed—just follow the specifications of each platform's SDK.
Extensible Application Scenarios
This solution is not limited to AI voice chat and can be extended to various real-time communication scenarios:
- Voice intercom: Real-time voice communication similar to walkie-talkies
- Multi-person conferencing: Multi-person real-time voice conference rooms
- Access control systems: Remote voice conversation and control
- Inter-device calls: Cross-device real-time calls between ESP32 boards and Linux boards
Server-Side Architecture Transformation and Technology Selection
Deep Transformation Based on the Xiaozhi Open-Source Project
Wang Shixiong's server side is based on the open-source project "Xiaozhi