Agora RTC + MQTT Replacing WebSocket: A Practical Guide to Decoupled Architecture for AI Voice Interaction Devices

Introduction

Most AI voice interaction devices today build their voice channel on WebSocket. The approach is simple, but it has clear shortcomings in device-server coupling, cross-platform communication, and adaptation to low-bandwidth environments. Bilibili content creator Wang Shixiong, drawing on his embedded development board project, shared a voice interaction architecture that replaces WebSocket with Agora RTC plus MQTT. The design decouples the device side from the server side and aims for commercial-grade stability and scalability.

Why Abandon the WebSocket Approach?

Pain Points of Traditional WebSocket Solutions

Most mainstream AI voice interaction solutions, whether open-source projects or commercial products, rely on WebSocket to carry both voice data and signaling. Under this architecture the device and server are tightly coupled: if the server is not running, the device cannot establish its connection, and the entire system becomes unavailable.

WebSocket is a full-duplex communication protocol over TCP, standardized in 2011 as RFC 6455. It was originally designed to solve HTTP's inability to push data from server to client. In the embedded IoT domain, however, its design assumptions become structural weaknesses. The protocol requires a persistent TCP connection, so once the server is unreachable the client is unusable. WebSocket also has no built-in QoS (Quality of Service) mechanism, so handling message loss and reconnection on weak networks is left entirely to the application layer. Finally, a complete WebSocket protocol stack typically occupies 50-100KB of ROM on MCUs like the ESP32, which adds pressure on devices with only 4MB of Flash.

This footprint matters because memory space and resource consumption are core concerns in ESP32 and other microcontroller development, and developers must monitor them constantly.

Advantages of the RTC + MQTT Dual-Channel Architecture

Wang Shixiong converted the system to a dual-channel architecture in which Agora RTC carries the voice channel and MQTT carries the signaling channel. Separating the signaling channel from the media channel is a classic pattern in telecom-grade real-time communication systems, traceable to the design of SIP (Session Initiation Protocol): SIP itself only handles session establishment, modification, and termination, while the actual voice data travels over independent RTP streams. In this solution, MQTT takes the signaling role (device status reporting, session control, LLM response text delivery), while Agora RTC takes the media role (bidirectional audio streams). Each channel has a single job, and a failure in one does not take down the other.
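The three signaling roles listed above can be sketched as small JSON messages on per-device MQTT topics. This is only an illustrative layout: the topic names, the `SignalingMessage` class, and the `rtc_channel` field are assumptions made for this example and do not come from Wang Shixiong's project.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical topic layout; these names are assumptions, not the project's.
TOPIC_TEMPLATES = {
    "status": "devices/{device_id}/status",      # device status reporting
    "session": "devices/{device_id}/session",    # session control
    "llm_text": "devices/{device_id}/llm_text",  # LLM response text delivery
}


@dataclass
class SignalingMessage:
    device_id: str
    kind: str   # one of the keys in TOPIC_TEMPLATES
    body: dict

    def topic(self) -> str:
        return TOPIC_TEMPLATES[self.kind].format(device_id=self.device_id)

    def payload(self) -> bytes:
        # Compact JSON keeps the signaling payload small.
        return json.dumps(asdict(self), separators=(",", ":")).encode("utf-8")


def decode(payload: bytes) -> SignalingMessage:
    return SignalingMessage(**json.loads(payload.decode("utf-8")))


# Signaling only names the RTC channel to join; no audio travels over MQTT.
start = SignalingMessage(
    "esp32-01", "session", {"action": "start", "rtc_channel": "room-42"}
)
```

Under this layout the audio itself never appears in a signaling payload, which is what lets the two channels fail independently.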

This design brings several key benefits:

Complete decoupling between device and server: Even if the server hasn't started, the device can still connect normally to Agora channels and to the MQTT broker (EMQX)
Independent operation between devices: Each terminal is an independent node, flexibly combined through channel mechanisms
MQTT protocol is more lightweight: Low latency, low bandwidth, with far less ROM footprint on embedded devices than WebSocket
RTC ensures audio quality: Stable audio transmission quality even in weak network environments
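The decoupling benefit in the list above comes from the publish/subscribe model: the device talks only to a broker, never directly to the application server. The toy broker below is a minimal in-memory sketch of that idea, not a real MQTT implementation. `ToyBroker` and the topic name are hypothetical, and it supports exact-match topics only.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str, bytes], None]


class ToyBroker:
    """In-memory stand-in for an MQTT broker (exact-match topics only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver to current subscribers and return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


broker = ToyBroker()

# The device publishes while no application server is running.
# The call still succeeds; there is simply nobody listening yet.
delivered_before = broker.publish("devices/esp32-01/status", b"online")

# Later the server starts and subscribes, with no change on the device side.
received = []
broker.subscribe("devices/esp32-01/status", lambda t, p: received.append((t, p)))
delivered_after = broker.publish("devices/esp32-01/status", b"online")
```

The device code is identical before and after the server appears, which is the property the list describes.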

MQTT (Message Queuing Telemetry Transport) was designed in 1999 by Andy Stanford-Clark of IBM and Arlen Nipper of Arcom for oil pipeline SCADA systems. Its core design philosophy is "serving constrained devices on unreliable networks." The minimum fixed header is only 2 bytes. Compared with the hundreds of bytes of header overhead in a typical HTTP request, that is a significant advantage on bandwidth-constrained cellular networks (such as 2G/NB-IoT). MQTT uses a Publish/Subscribe (Pub/Sub) model that decouples message senders from receivers through a Broker, and this is precisely the mechanism that achieves device-server decoupling. The protocol has three built-in QoS levels (0, 1, and 2) to ensure signaling reliability. It also supports the Will Message (Last Will) mechanism, in which the Broker automatically publishes a preset message when a device disconnects abnormally, which makes it well suited to tracking the online status of IoT devices.
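To make the 2-byte figure concrete, the sketch below hand-encodes two MQTT 3.1.1 control packets: a PINGREQ, which consists of the fixed header alone, and a QoS 0 PUBLISH. It is meant only to show the wire overhead. The function names and the sample topic are assumptions for this example, and a real device would use an MQTT client library rather than building packets by hand.

```python
def encode_remaining_length(n: int) -> bytes:
    """MQTT variable-length integer: 7 bits per byte, top bit means 'more follows'."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80
        out.append(digit)
        if n == 0:
            return bytes(out)


# PINGREQ: packet type 12 in the high nibble, remaining length 0.
PINGREQ = bytes([0xC0]) + encode_remaining_length(0)


def encode_publish_qos0(topic: str, payload: bytes) -> bytes:
    """PUBLISH with QoS 0, DUP=0, RETAIN=0 (MQTT 3.1.1, no packet identifier)."""
    topic_bytes = topic.encode("utf-8")
    variable_header = len(topic_bytes).to_bytes(2, "big") + topic_bytes
    body = variable_header + payload
    return bytes([0x30]) + encode_remaining_length(len(body)) + body


packet = encode_publish_qos0("d/1/s", b"on")
overhead = len(packet) - len(b"d/1/s") - len(b"on")  # fixed header + topic length field
```

For this short message the protocol adds 4 bytes in total: the 2-byte fixed header plus the 2-byte topic length prefix.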

[Figure: Voice interaction demo]

Cross-Platform Communication: One Protocol Covering All Terminals

Unified RTC Communication Protocol

The biggest highlight of this architecture is its cross-platform scalability. Based on the RTC protocol, whether it's an ESP32 embedded device, mobile APP, mini-program, H5 page, or Linux device, all can achieve interconnection through the same communication protocol.

RTC (Real-Time Communication) technology is closely associated with WebRTC, which Google open-sourced in 2011 and pushed into the W3C/IETF standards process. Its core transport layer uses SRTP (Secure Real-time Transport Protocol) over UDP rather than TCP, which is the most fundamental difference from WebSocket. Because UDP is connectionless, RTC can actively compensate for network jitter and packet loss with FEC (Forward Error Correction), jitter buffering, NetEQ, and similar algorithms, instead of waiting for retransmissions as TCP does, which lets latency accumulate. Agora operates its own SD-RTN (Software Defined Real-time Network) global acceleration network, which uses intelligent routing and is reported to keep end-to-end latency within 100ms across complex domestic and international network environments. This is its core commercial value compared to native WebRTC.
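The idea behind FEC can be illustrated with the simplest possible scheme: one XOR parity packet per group of audio frames, which lets the receiver rebuild a single lost frame without asking for a retransmission. This is a teaching sketch under stated assumptions (equal-length frames, at most one loss per group). It is not the algorithm Agora or WebRTC actually uses, and all function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """XOR of all packets in the group (equal-length packets assumed)."""
    return reduce(xor_bytes, packets)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("single-parity FEC can repair only one loss per group")
    result = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR-ing the parity with every surviving packet leaves the lost one.
        result[missing[0]] = reduce(xor_bytes, present, parity)
    return result


frames = [b"AAAA", b"BBBB", b"CCCC"]   # three audio frames in one group
parity = make_parity(frames)           # sent alongside the frames
arrived = [frames[0], None, frames[2]] # the second frame was lost in transit
restored = recover(arrived, parity)
```

The cost is one extra packet per group; the gain is that a single loss is repaired locally, with no round trip to the sender.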

[Figure: Why choose Agora RTC]

Specific multi-platform access methods are as follows:

ESP32 device side: Create rooms based on ESP32's C language SDK
Mobile APP side: Create rooms based on mobile SDK
Linux device side: Create rooms based on the Linux version RN architecture SDK
Mini-program/H5: Access via Web SDK

As long as two devices are in the same channel on the signaling server, real-time communication can be achieved. At the code level, no major modifications are needed—just follow the specifications of each platform's SDK.

Extensible Application Scenarios

This solution is not limited to AI voice chat and can be extended to various real-time communication scenarios:

Voice intercom: Real-time voice communication similar to walkie-talkies
Multi-person conferencing: Multi-person real-time voice conference rooms
Access control systems: Remote voice conversation and control
Inter-device calls: Cross-device real-time calls between ESP32 boards and Linux boards

Server-Side Architecture Transformation and Technology Selection

Deep Transformation Based on the Xiaozhi Open-Source Project

Wang Shixiong's server side is based on the open-source project "Xiaozhi