Engineering

Sub-100ms voice over E2EE: the design of Sudo's SFU mesh

Voice channels with end-to-end encryption usually mean either bad latency or trusting the server with your keys. We picked option three: a global SFU mesh that never sees plaintext.

Sudo Realtime (@rtc.sudo) · 13 min read

Sudo voice channels are end-to-end encrypted. Sudo voice channels also have a median round-trip latency of 38 milliseconds across 92 cities. Those two sentences are usually mutually exclusive in real-time communications. This post is about how we made them coexist.

If you are not a voice infrastructure nerd, the short version is: we built a globally distributed mesh of selective forwarding units that route encrypted media without ever decrypting it, we let clients establish keys directly with each other through MLS, and we engineered around every edge case where the temptation to "just decrypt it on the server for a moment" would have made our lives easier.

If you are a voice infrastructure nerd, the rest of this post is for you.

The classic E2EE voice trade-off

Real-time audio has historically used one of three architectures.

Full-mesh peer-to-peer is the textbook starting point: every participant sends their stream to every other participant directly. The maths break down at six or seven concurrent speakers because uplink bandwidth grows linearly with the number of recipients. Anything above a small group call is impractical on consumer networks.

Multipoint Conferencing Units (MCUs) sit on a server, decode every incoming stream, mix them into a single composite stream and send it back. The server has to see the audio in plaintext, which is fundamentally incompatible with end-to-end encryption.

Selective Forwarding Units (SFUs) sit on a server too, but they only forward streams between participants without decoding them. SFUs scale beautifully (a single SFU can comfortably handle hundreds of participants), but the naive implementation still has the encryption problem because the SFU usually terminates the transport encryption and therefore holds the media keys.
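The bandwidth argument against full mesh can be made concrete with a little arithmetic. The sketch below compares per-participant uplink in a full mesh with uplink behind a forwarding server. The 32 kbps stream bitrate is an assumed figure for illustration, not a number from our production configuration.

```python
# Assumed Opus voice bitrate, for illustration only.
OPUS_KBPS = 32.0


def mesh_uplink_kbps(participants: int, stream_kbps: float) -> float:
    """Full mesh: each sender uploads one copy per other participant."""
    return max(participants - 1, 0) * stream_kbps


def sfu_uplink_kbps(participants: int, stream_kbps: float) -> float:
    """Forwarding server: each sender uploads a single copy."""
    return stream_kbps if participants > 1 else 0.0


def mesh_total_streams(participants: int) -> int:
    """Directed streams in flight across the whole mesh: N * (N - 1)."""
    return participants * (participants - 1)


for n in (2, 4, 7, 20):
    print(
        f"{n:>2} participants: "
        f"mesh uplink {mesh_uplink_kbps(n, OPUS_KBPS):6.0f} kbps, "
        f"SFU uplink {sfu_uplink_kbps(n, OPUS_KBPS):4.0f} kbps, "
        f"mesh streams {mesh_total_streams(n)}"
    )
```

Per-sender uplink grows linearly in a mesh while the total number of streams grows quadratically; behind a forwarding server the sender's uplink stays flat.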

Sudo's voice stack is an SFU mesh, the third architecture above, with a critical twist: the SFU is never on the key exchange path. Clients negotiate keys peer-to-peer through MLS, the SFU forwards opaque ciphertext, and the only things leaking from the server are timing and packet sizes, both of which we minimise.

How an SFU forwards encrypted media

Real-time audio in 2026 means SRTP packets carrying Opus-encoded frames. Normally an SFU terminates SRTP hop by hop: the RTP header travels in the clear (authenticated but not encrypted), the payload is encrypted under a key shared with the SFU, and the SFU reads the header, decides where to forward the packet, and re-encrypts on egress.

Our SFU does not have a key that can decrypt the audio. We use the WebRTC Insertable Streams API on the client to apply a second layer of AEAD encryption to each encoded frame before it is handed to the SRTP stack, with a key derived from the room's MLS state. The SFU sees the RTP header (so it knows which participant the frame came from and whether to forward it), but the payload, the Opus frame itself, remains encrypted under a key only the room members hold.
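As a rough illustration of how a per-sender frame key and a per-frame nonce could be derived from a secret exported by the room's MLS state, here is a minimal sketch using HKDF (RFC 5869) from the Python standard library. The label strings, the `epoch` and `sender_id` inputs, and the nonce layout are assumptions made for this example rather than our wire format, and the AEAD step itself (for example AES-GCM from a vetted crypto library) is deliberately left out.

```python
import hashlib
import hmac
import struct


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-256."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_sender_material(exporter_secret: bytes, epoch: int, sender_id: int):
    """Derive a 32-byte AEAD key and a 12-byte nonce salt for one sender.

    The labels below are hypothetical; a real deployment would fix them
    in its protocol specification.
    """
    prk = hkdf_extract(b"\x00" * 32, exporter_secret)
    context = struct.pack(">QI", epoch, sender_id)
    key = hkdf_expand(prk, b"example-frame-key" + context, 32)
    salt = hkdf_expand(prk, b"example-frame-salt" + context, 12)
    return key, salt


def frame_nonce(salt: bytes, frame_counter: int) -> bytes:
    """96-bit nonce: the sender's salt XORed with a big-endian frame counter."""
    counter_bytes = frame_counter.to_bytes(12, "big")
    return bytes(a ^ b for a, b in zip(salt, counter_bytes))


# Stand-in for a secret exported from the room's MLS group state.
exporter_secret = hashlib.sha256(b"demo room secret").digest()
key, salt = derive_sender_material(exporter_secret, epoch=7, sender_id=1)
print(key.hex(), frame_nonce(salt, 42).hex())
```

Binding the key to the epoch means a membership change, which advances the MLS epoch, produces fresh keys, and binding it to the sender keeps two participants from ever reusing a (key, nonce) pair.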

The receiving client decrypts the inner layer using the same MLS-derived key and feeds the plaintext Opus frame to the audio stack. From the SFU's perspective, the payload it forwards is indistinguishable from random noise.
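The forwarding decision can be sketched as follows: parse the fixed 12-byte RTP header, look up who owns the SSRC, and pass the packet on byte for byte. This is a simplified model, not our SFU code. The `ssrc_owner` map, the member names, and the `forward` function are hypothetical, and header extensions and CSRC lists are ignored.

```python
import struct
from dataclasses import dataclass

# Fixed RTP header (RFC 3550): V/P/X/CC, M/PT, sequence, timestamp, SSRC.
RTP_HEADER = struct.Struct(">BBHII")


@dataclass(frozen=True)
class RtpHeader:
    version: int
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_header(packet: bytes) -> RtpHeader:
    if len(packet) < RTP_HEADER.size:
        raise ValueError("packet shorter than the fixed RTP header")
    b0, b1, seq, ts, ssrc = RTP_HEADER.unpack_from(packet)
    return RtpHeader(b0 >> 6, b1 & 0x7F, seq, ts, ssrc)


def forward(packet: bytes, ssrc_owner: dict, members: set) -> dict:
    """Return {recipient: packet} for every member except the sender.

    Only the header is inspected; the payload is never touched.
    """
    header = parse_header(packet)
    sender = ssrc_owner.get(header.ssrc)
    if header.version != 2 or sender is None:
        return {}  # unknown source or malformed packet: drop
    return {member: packet for member in members if member != sender}


# Example packet: version 2, payload type 111, then 40 bytes standing in
# for the end-to-end encrypted Opus frame.
packet = RTP_HEADER.pack(0x80, 111, 1, 960, 0xABCD) + bytes(range(40))
routes = forward(packet, {0xABCD: "alice"}, {"alice", "bob", "carol"})
print(sorted(routes))
```

Nothing in `forward` depends on the contents of the payload, which is the property that lets the server route media it cannot read.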

Edge selection

Latency in real-time voice is dominated by two things: the audio codec frame size (typically 20 ms for Opus) and the network round trip. The codec is fixed; we can only optimise the network.

Sudo runs SFU edges in 92 cities across six continents. A new client connects to the closest healthy edge based on a latency probe, not just GeoIP. When two participants are far apart, the edges hand off to each other through a private dedicated backbone (we use anycast-routed WireGuard tunnels with BBR congestion control) rather than the public internet, which shaves around forty milliseconds off long-haul calls.
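Probe-based selection can be sketched in a few lines: the client measures round-trip time to several candidate edges and picks the healthy one with the lowest median. The `EdgeProbe` type, the edge names, and the `min_samples` threshold are illustrative assumptions, not our client's actual selection logic.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List, Optional


@dataclass
class EdgeProbe:
    edge: str
    healthy: bool
    # RTT samples in milliseconds; lost probes are simply absent.
    rtt_ms: List[float] = field(default_factory=list)


def pick_edge(probes: List[EdgeProbe], min_samples: int = 3) -> Optional[str]:
    """Pick the healthy edge with the lowest median RTT.

    Edges with too few successful probes are skipped so that a single
    lucky sample cannot win the selection.
    """
    candidates = [
        (median(p.rtt_ms), p.edge)
        for p in probes
        if p.healthy and len(p.rtt_ms) >= min_samples
    ]
    return min(candidates)[1] if candidates else None


probes = [
    EdgeProbe("fra", True, [11.0, 12.0, 10.0]),
    EdgeProbe("lhr", True, [14.0, 13.0, 15.0]),
    EdgeProbe("ams", False, [5.0, 5.0, 5.0]),  # fastest, but unhealthy
    EdgeProbe("par", True, [3.0]),             # too few samples
]
print(pick_edge(probes))
```

Using the median rather than the minimum or mean keeps one delayed or unusually fast probe from deciding the outcome.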

Inside any one region, our 95th-percentile RTT is 12 ms. Globally, the median is 38 ms and the 95th percentile is 174 ms, which is good enough that most users do not realise they are on a multi-hop call.

Token-gated rooms

A specific Sudo workload is the token-gated room: a voice channel that only wallets holding a specific NFT or token balance can join. The natural place to enforce this is on the SFU — the entity controlling who the bytes go to.

But the SFU does not know who you are. It sees a connection ID and an opaque ciphertext stream. It cannot look at your wallet.

We solve this with a small zero-knowledge proof generated by the client. To join a token-gated room, your client constructs a proof that says "I control a wallet that holds at least one token from contract X", signs it with your wallet, and presents it to the SFU. The SFU verifies the proof against an on-chain Merkle root cached at the room's start, lets you in, and forgets your wallet address.

The proof does not reveal which wallet is yours. Two participants in the same gated room cannot learn each other's wallet addresses unless they choose to identify themselves over the audio channel. The room's gating is enforceable; the room's privacy is preserved.
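The Merkle-root half of that check can be illustrated with a plain inclusion proof. To be clear about the limits of this sketch: it is not zero-knowledge, because the verifier here sees the leaf. A zero-knowledge system would instead prove knowledge of a valid leaf and path without disclosing either. SHA-256, sorted-pair hashing, and the leaf encoding are assumptions chosen for the example; on-chain trees commonly use a different hash.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def hash_pair(a: bytes, b: bytes) -> bytes:
    # Sorted-pair hashing: the proof needs no left/right position flags.
    return h(min(a, b) + max(a, b))


def _next_level(level: List[bytes]) -> List[bytes]:
    return [hash_pair(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves: List[bytes]) -> bytes:
    """Root of a non-empty list of leaves; odd levels duplicate the last node."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves: List[bytes], index: int) -> List[bytes]:
    """Sibling hashes from the leaf at `index` up to the root."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append(level[index ^ 1])  # sibling within the current pair
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf: bytes, proof: List[bytes], root: bytes) -> bool:
    node = h(leaf)
    for sibling in proof:
        node = hash_pair(node, sibling)
    return node == root


# Hypothetical holder set snapshotted when the room starts.
holders = [b"wallet-a", b"wallet-b", b"wallet-c", b"wallet-d", b"wallet-e"]
root = merkle_root(holders)
print(verify(b"wallet-c", merkle_proof(holders, 2), root))
```

Caching the root when the room starts means the verifier compares against a fixed snapshot and needs no chain lookup per join.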

What we sacrificed

Engineering is the art of choosing what not to do. A few features that other voice platforms ship are deliberately missing or limited in Sudo.

Server-side recording is not possible because the server has no access to plaintext audio. If you want to record a call, the recording happens client-side and is stored locally (or uploaded encrypted to the recorder's choice of storage). Conferences with hundreds of speakers cannot be cleanly transcribed by a server-side service for the same reason.

Live transcription, when enabled, runs on the participant's device. We have shipped a small Whisper-derived model that runs on modern laptops and high-end phones, but it is not free of latency or accuracy trade-offs versus a server-side transcriber that has full audio access.

Echo cancellation, noise suppression and other DSP run on the device by necessity. This means we cannot do server-side dereverberation tricks that other platforms use to make group calls sound clean — we ship a strong client-side suite instead.

We accept these trade-offs because the alternative — letting the server see plaintext audio — is incompatible with the fundamental product promise. If we ever ship a "transcribe this call" feature, it will run on your device or with your explicit, per-call, opt-in consent and a one-shot decryption key.

What is on the engineering roadmap

We are working on three voice improvements over the next quarter.

First, multi-stream high-definition video for Stage rooms — currently capped at four simultaneous HD speakers, expanding to twelve with adaptive layered coding so participants on slow connections still see thumbnails of every speaker.

Second, a federated SFU profile so that an organisation can run its own SFU on its own infrastructure and connect to the wider Sudo voice mesh. The hard part is preserving the encryption guarantees across federation boundaries; the design is in early review.

Third, lower-latency Opus presets for music and Stage AMA workloads. The default Opus codec settings prioritise voice intelligibility; for Stage rooms doing live music or panel-style discussions, a higher-bitrate, lower-latency preset gives a noticeably better listening experience at modest bandwidth cost.

Why this is interesting

Voice is the oldest real-time application on the internet and it is one of the hardest to encrypt at scale. Every shortcut in the design — every "just decrypt it on the server for a second" — corrodes the privacy promise. We picked the longest road on purpose because we think that road eventually gets walked by every messenger that takes encryption seriously, and it is more interesting to learn the lessons early.

If you want the technical primer, a memo on the key derivations and the insertable-streams pipeline is published in the developer community.
