Part of series: Vanta In Practice

iroh in production: encrypted-note gossip on a 1-minute-block chain


The L2 sidecar I wrote about previously has four jobs: watch L1, serve a REST API, snapshot state, and gossip with peers. The first three are well-trod tokio-task territory. The fourth is the one that actually matters for L2 decentralisation, because if every peer has to fetch encrypted notes from one REST server, that REST server is a centralisation point, and the privacy chain isn’t really a privacy chain.

This post is the deep dive on the gossip layer specifically. The transport is iroh.computer — a pure-Rust QUIC stack with an opinionated NAT-traversal story and a built-in gossip protocol that does most of what we need. The integration code lives in vanta/vanta-node/src/gossip.rs, which is where I’d point you to read first if you want the unvarnished version.

Why iroh

The architecture doc puts the rationale tersely. From doc/vanta-architecture.md:

P2P: iroh.computer — pure Rust, QUIC-based, NAT traversal, gossip protocol, content-addressed blobs. Chosen over libp2p for simplicity, built-in QUIC + NAT hole-punching, and document sync (useful for offline branch-and-merge).

— doc/vanta-architecture.md

That’s the polite version. Let me unpack it with a tradeoff table that does not pull punches.

L2 gossip transport options I actually considered
| Option | Cost | Latency | Blast radius | Notes |
| --- | --- | --- | --- | --- |
| iroh.computer (chose this) | ~1.5 MB extra binary; one Rust crate; fixed config | QUIC, per-stream ordering, hole-punching by default | Production users at n0-computer; small but active maintainer team | Pure Rust. NAT-traversal-as-default is the killer feature. |
| libp2p (rust-libp2p) | Bigger dependency tree, more configuration, more knobs | Comparable on QUIC transport once configured | IPFS, Filecoin, Polkadot; battle-tested | Configuration tax was the killer. We do not need yamux, mplex, mdns, AND noise + tls. We need one of each. |
| Custom yamux-over-QUIC | Maintenance burden of every NAT-traversal edge case | Whatever you implement | Nobody else runs this | Reinvents NAT traversal. The interns will resent us. |
| NATS or other broker | A broker. Defeats the entire premise of decentralised L2. | Fast, but topology-dependent | Operations matter on a single-binary chain | Not seriously considered. Listed for completeness. |

The “configuration tax” point is the one I want to underline. libp2p is in principle the right answer; we used it on an earlier prototype. The problem was that every libp2p deployment is a snowflake — yamux vs mplex, noise vs tls, mdns vs static seeds, gossipsub v1.0 vs v1.1 — and getting two different libp2p deployments to talk predictably across a real residential-NAT network was a recurring time sink.

iroh ships an opinionated default. There is one transport (QUIC), one ALPN per protocol, and one NAT-traversal story (n0-relay-assisted hole-punching). When it works it works the same way every time. When it fails, the failure modes are bounded and documented.

Topology

The Vanta L2 gossip topology is one topic per chain, with content-addressed blob references for any payload that’s too big for the gossip message-size limit (we cap at 64 KB per message, which is enough headroom for a single encrypted note plus headers).

Every node joins the same topic. Every message broadcast on that topic ends up at every other peer (eventually — this is gossip, not multicast, so it’s O(log N) hops in expectation). The N0 relays are a fallback for peers behind symmetric NATs or other hole-punching-resistant boundaries; once a direct path is found, the relay drops out.

The topic is a SHA-256 of a fixed string in gossip.rs:42:

fn vanta_topic() -> TopicId {
    use sha2::{Sha256, Digest};
    // Hash the fixed namespace string down to the 32-byte gossip topic id.
    let mut hasher = Sha256::new();
    hasher.update(b"Vanta/L2/Gossip/v1");
    let hash = hasher.finalize();
    let mut bytes = [0u8; 32];
    bytes.copy_from_slice(&hash);
    TopicId::from_bytes(bytes)
}

Vanta/L2/Gossip/v1. The v1 is intentional: when we ship a breaking change to the message format, we’ll bump to v2 and the two networks will simply not see each other. That’s the cleanest cross-version migration story we have, and it’s a single-line change.
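That partition is easy to convince yourself of: the two namespace strings hash to unrelated topic ids, so v1 and v2 subscriptions are disjoint by construction. A minimal standalone check, using the same sha2 crate:

use sha2::{Digest, Sha256};

// Hash a namespace string down to a 32-byte topic id, mirroring vanta_topic().
fn topic_for(namespace: &[u8]) -> [u8; 32] {
    Sha256::digest(namespace).into()
}

fn main() {
    // Different version suffix, different topic: the two networks never meet.
    assert_ne!(
        topic_for(b"Vanta/L2/Gossip/v1"),
        topic_for(b"Vanta/L2/Gossip/v2")
    );
}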

The message shape

Three message kinds, all bincode-serialised:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum GossipMessage {
    /// A new note commitment was appended on L2.
    NewCommitment { commitment: Hash },
    /// A nullifier was revealed, i.e. some note was spent.
    NullifierRevealed { nullifier: Hash },
    /// Opaque ciphertext for wallets to trial-decrypt.
    EncryptedNote(EncryptedNote),
}

Hash is a 32-byte alias from vanta_core. EncryptedNote is an opaque ciphertext blob plus a recipient hint that wallets use to do trial-decryption. The ciphertext is encrypted-to-recipient-pubkey using the same envelope scheme described in the nullifier-set post; vanta-node cannot decrypt a note even if it tries.

The relevant point is what’s not here. There’s no “request-response” message. There’s no “inventory” or “bloom filter” or pull-based sync. iroh-gossip is broadcast-only; if a peer joins late, they catch up via the L1 watcher (which scans block history) and then receive new state via gossip going forward. Decoupling history-sync from real-time-sync is a simplification: gossip is always real-time, history is always re-derived from L1.

The send path

Three small fan-out helpers, one private send method, in gossip.rs:53:

impl GossipHandle {
    pub async fn broadcast_commitment(&self, commitment: Hash) -> Result<()> {
        let msg = GossipMessage::NewCommitment { commitment };
        self.broadcast(&msg).await
    }

    pub async fn broadcast_nullifier(&self, nullifier: Hash) -> Result<()> {
        let msg = GossipMessage::NullifierRevealed { nullifier };
        self.broadcast(&msg).await
    }

    pub async fn broadcast_encrypted_note(&self, enc: EncryptedNote) -> Result<()> {
        let msg = GossipMessage::EncryptedNote(enc);
        self.broadcast(&msg).await
    }

    async fn broadcast(&self, msg: &GossipMessage) -> Result<()> {
        let bytes = bincode::serialize(msg)?;
        self.sender.broadcast(Bytes::from(bytes)).await?;
        Ok(())
    }
}

The GossipHandle is Clone and gets passed to the API server, the L1 watcher, and the swap module. Whoever has the handle can broadcast. The handle is a wrapper around iroh_gossip::api::GossipSender, which is a tokio-friendly mpsc-style channel into iroh’s outbound queue.

bincode::serialize is fine here because the message types are all simple plain-old-data with no #[serde(skip)] or recursion. The deserialization path (next section) is where the gotchas live.
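One guard worth considering on top of this: checking the encoded size against the 64 KB cap before the bytes reach iroh's outbound queue, so an oversized message fails fast at the call site instead of deep in the gossip layer. A hypothetical sketch (encode_checked and MAX_MESSAGE_SIZE are mine, not vanta-node's):

use anyhow::{bail, Result};

// Mirrors the max_message_size configured on the Gossip instance in start().
const MAX_MESSAGE_SIZE: usize = 65536;

// Hypothetical helper, not in gossip.rs: serialize, then reject anything
// that would bounce off the gossip message-size limit.
fn encode_checked(msg: &GossipMessage) -> Result<Vec<u8>> {
    let bytes = bincode::serialize(msg)?;
    if bytes.len() > MAX_MESSAGE_SIZE {
        bail!("gossip message too large: {} bytes (cap {MAX_MESSAGE_SIZE})", bytes.len());
    }
    Ok(bytes)
}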

The receive path

start() in gossip.rs:88 is the function that brings up the whole gossip stack. It does five things:

  1. Build an Endpoint with the presets::N0 relay configuration.
  2. Spawn a Gossip instance with a 64 KB max-message-size.
  3. Wire a Router that accepts inbound gossip connections on the gossip ALPN.
  4. Subscribe to the Vanta topic with the user’s bootstrap peer list.
  5. Spawn a tokio task to drain the inbound stream into apply_gossip_message.

// 1. Endpoint with the n0 relay preset.
let endpoint = Endpoint::builder(presets::N0)
    .bind()
    .await?;

// 2. Gossip instance, capped at 64 KB per message.
let gossip = Gossip::builder()
    .max_message_size(65536)
    .spawn(endpoint.clone());

// 3. Router: inbound connections negotiating the gossip ALPN go to `gossip`.
let router = Router::builder(endpoint.clone())
    .accept(GOSSIP_ALPN, gossip.clone())
    .spawn();

// 4. Join the Vanta topic with the configured bootstrap peers.
let topic_id = vanta_topic();
let topic = gossip.subscribe(topic_id, peer_ids).await?;
let (sender, mut receiver) = topic.split();

The accept(GOSSIP_ALPN, gossip.clone()) call is what tells the router “any inbound QUIC connection that negotiates the gossip ALPN gets handed to this Gossip instance.” iroh multiplexes multiple protocols on one endpoint; today we only run gossip, but the same router could in principle accept blob-sync or document-sync ALPNs.
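The drain task from step 5 isn't shown in the snippet above. Its shape is roughly the following sketch (assuming the receiver half from topic.split() and the handler shown next; the real gossip.rs may differ in details):

tokio::spawn(async move {
    // Pump every gossip event through the handler until the stream ends.
    // (try_next comes from the stream extension trait in scope in gossip.rs.)
    loop {
        match receiver.try_next().await {
            Ok(Some(event)) => handle_gossip_event(&state, &peer_counter, event).await,
            Ok(None) => break, // stream closed: topic dropped or endpoint shut down
            Err(e) => {
                tracing::warn!("gossip receive stream error: {e}");
                break;
            }
        }
    }
});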

Each event the loop pulls off the stream lands in handle_gossip_event. There are three event types we care about:

async fn handle_gossip_event(
    state: &L2State,
    peer_counter: &std::sync::Arc<std::sync::atomic::AtomicUsize>,
    event: iroh_gossip::api::Event,
) {
    use std::sync::atomic::Ordering;
    match event {
        iroh_gossip::api::Event::Received(message) => {
            match bincode::deserialize::<GossipMessage>(&message.content) {
                Ok(msg) => apply_gossip_message(state, msg),
                Err(e) => {
                    tracing::debug!("Failed to deserialize gossip message: {e}");
                }
            }
        }
        iroh_gossip::api::Event::NeighborUp(peer_id) => {
            let n = peer_counter.fetch_add(1, Ordering::Relaxed) + 1;
            tracing::info!("Gossip peer joined: {} (now {n})", peer_id);
        }
        iroh_gossip::api::Event::NeighborDown(peer_id) => {
            peer_counter
                .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| {
                    Some(v.saturating_sub(1))
                })
                .ok();
            let n = peer_counter.load(Ordering::Relaxed);
            tracing::info!("Gossip peer left: {} (now {n})", peer_id);
        }
        _ => {}
    }
}

The _ => {} is loud silence: iroh’s Event enum has more variants than we care about (relay-state changes, lurker-mode signals) and we explicitly ignore them.

The saturating-decrement gotcha

The first version of NeighborDown was peer_counter.fetch_sub(1, Ordering::Relaxed). In a happy path this was fine — every NeighborUp pairs with exactly one NeighborDown, the counter goes up and down, and /status shows the right number.

In the actual iroh deployment, NeighborDown can fire without a corresponding NeighborUp ever having been observed. (Reasons: the event stream can drop messages under backpressure; a peer can be “down” from this node’s perspective before this node has joined the topic enough to consider them “up.”) The bug surfaced as /status returning peer_count: 18446744073709551614: two unmatched decrements had wrapped the counter from 0 through usize::MAX down to usize::MAX - 1. Counting backwards in unsigned arithmetic is a strict no.

The fix is the fetch_update + saturating_sub pattern in the snippet above. It’s slower than a single atomic op (it’s a CAS loop) but it’s load-bearingly correct: the counter never goes negative, and on the rare double-down-without-up the counter just stays at its current value.
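Both halves of the pattern are standard library behaviour and easy to demonstrate in isolation:

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    // fetch_sub wraps: one unmatched decrement on a zero counter lands at usize::MAX.
    let wrapping = AtomicUsize::new(0);
    wrapping.fetch_sub(1, Ordering::Relaxed);
    assert_eq!(wrapping.load(Ordering::Relaxed), usize::MAX);

    // fetch_update + saturating_sub clamps at zero instead (at the cost of a CAS loop).
    let saturating = AtomicUsize::new(0);
    saturating
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| Some(v.saturating_sub(1)))
        .ok();
    assert_eq!(saturating.load(Ordering::Relaxed), 0);
}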

This is the kind of thing you don’t notice until production. TODO: Dax confirm we want to ship peer_count over /status as a u32 and saturate there too — even with the in-memory fix, a 64-bit counter shipped to a frontend could in principle overflow JavaScript’s Number.MAX_SAFE_INTEGER if something ever went really wrong upstream.

A toy iroh-shape demo

We can’t actually run iroh in a Sandbox — iroh isn’t WASM-portable, and it wants real UDP sockets. But we can simulate the message-flow shape in plain Node, which is sometimes useful for understanding the topology when you read the Rust code.

[Interactive demo: iroh-shape gossip demo (Node), runnable on the original page.]

This is the shape of gossip flooding. iroh’s actual implementation uses HyParView + Plumtree — more sophisticated, with eager-push trees and lazy-pull repair — but the user-facing semantic is the same: broadcast on a topic, every peer eventually sees the message, exactly once.
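If you'd rather read the shape than run it, here it is as a minimal Rust sketch: a fixed toy graph, dedup on first sighting, breadth-first rounds standing in for gossip hops. Nothing here is iroh API; it's the flood semantic only.

use std::collections::{HashSet, VecDeque};

fn main() {
    // Hypothetical 6-node topology as an adjacency list.
    let peers: Vec<Vec<usize>> = vec![
        vec![1, 2],    // 0
        vec![0, 3],    // 1
        vec![0, 3, 4], // 2
        vec![1, 2, 5], // 3
        vec![2, 5],    // 4
        vec![3, 4],    // 5
    ];

    let mut seen = HashSet::new();
    let mut queue = VecDeque::from([0usize]); // message originates at node 0
    let mut rounds = 0;

    while !queue.is_empty() {
        // One round = every pending delivery gets processed once.
        for _ in 0..queue.len() {
            let node = queue.pop_front().unwrap();
            if !seen.insert(node) {
                continue; // duplicate delivery: dropped, never re-forwarded
            }
            for &n in &peers[node] {
                if !seen.contains(&n) {
                    queue.push_back(n);
                }
            }
        }
        rounds += 1;
    }

    println!("all {} peers reached in {} rounds", seen.len(), rounds);
}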

Encrypted notes specifically

The largest message type, EncryptedNote, is what wallets actually consume. The flow is:

  1. Sender’s wallet generates a shielded transaction. Part of the witness is a ciphertext addressed to the recipient’s pubkey.
  2. Sender’s vanta-node (via the desktop app) calls broadcast_encrypted_note(ciphertext).
  3. iroh-gossip floods every peer in the topic. Every L2 node — including the recipient’s — has the ciphertext in memory.
  4. The recipient’s wallet calls /notes/scan against its local vanta-node, which trial-decrypts every recently seen ciphertext against the wallet’s secret key.
  5. If a trial decryption succeeds, the wallet has detected a payment.

There is no “addressing.” There is no “the recipient asks for their notes.” Every peer has every note. Each peer’s wallet decides which notes are theirs by trying to decrypt. This is the same architectural pattern Zcash Sapling uses — a public ciphertext stream with private addressability — and it’s why the gossip layer can be totally untrusted: peers see ciphertexts, recipients see plaintext.
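Step 4 in code shape, with placeholder types standing in for vanta_core's real ones. try_decrypt here is a hypothetical stand-in for the actual envelope scheme:

// Placeholder types; the real ones live in vanta_core.
struct SecretKey([u8; 32]);
struct EncryptedNote {
    ciphertext: Vec<u8>,
    recipient_hint: [u8; 8], // narrows the scan; not enough to identify the recipient
}
struct Note {
    value: u64,
}

// Hypothetical stand-in for the envelope decryption: Some only when the
// ciphertext was actually addressed to this key.
fn try_decrypt(_sk: &SecretKey, _enc: &EncryptedNote) -> Option<Note> {
    None
}

// The wallet-side scan: ownership is decided purely by whether trial
// decryption succeeds, so the gossip layer never learns who a note is for.
fn scan_notes(sk: &SecretKey, recent: &[EncryptedNote]) -> Vec<Note> {
    recent.iter().filter_map(|enc| try_decrypt(sk, enc)).collect()
}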

What’s not in this implementation

A few things to flag, both for honesty and for the next person to read this.

No gossip-layer backpressure. If a peer publishes 10,000 encrypted notes in a second, every other peer’s receive-loop task has to deserialize all of them. There’s no rate limit, no back-off, no “too many pending events” signal. This is fine on a 1-minute-block chain where the pool’s submission rate is bounded, but it would be a real problem on a 250 ms-block chain.
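Even a crude guard would bound that failure mode. A hypothetical windowed counter that drops everything over a per-second budget before deserialization (none of this exists in vanta-node today, and the budget is made up):

use std::time::{Duration, Instant};

const MAX_PER_WINDOW: u32 = 1_000; // arbitrary budget for illustration

// Hypothetical rate guard: call allow() before bincode::deserialize and
// drop the message on the floor when the window's budget is spent.
struct RateGuard {
    window_start: Instant,
    count: u32,
}

impl RateGuard {
    fn new() -> Self {
        Self { window_start: Instant::now(), count: 0 }
    }

    fn allow(&mut self) -> bool {
        if self.window_start.elapsed() >= Duration::from_secs(1) {
            self.window_start = Instant::now();
            self.count = 0;
        }
        self.count += 1;
        self.count <= MAX_PER_WINDOW
    }
}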

No peer reputation. Every peer is equal. A misbehaving peer (sending malformed messages, spamming) is just ignored on a per-message basis. We don’t disconnect them, ban them, or de-prefer them in routing. iroh has the primitives (endpoint.close_peer) but we don’t use them.

No persistence across restarts. When vanta-node restarts, it forgets every peer it has ever seen and re-bootstraps from the static seed list. This costs ~2 seconds on warm starts. The L1 watcher catches state up from chain history regardless, so this isn’t a correctness concern, but a peer cache would shave the startup window.

No multi-topic. All Vanta nodes are on one topic. We’ll need at least mainnet/testnet split when there’s a testnet to speak of; right now the topic is Vanta/L2/Gossip/v1 and that’s literally the only topic that exists. TODO: Dax confirm we add Vanta/L2/Gossip/regtest when the regtest deploy lands.
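The change itself is a few lines. A hypothetical network-aware variant of vanta_topic(), using the suffix naming from the TODO above (not real code yet):

use iroh_gossip::proto::TopicId;
use sha2::{Digest, Sha256};

// Hypothetical: one topic per network, e.g. "v1" for mainnet today and
// "regtest" once that deploy lands.
fn vanta_topic_for(network: &str) -> TopicId {
    let hash = Sha256::digest(format!("Vanta/L2/Gossip/{network}"));
    TopicId::from_bytes(hash.into())
}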

What I changed my mind about

I’d been libp2p-curious for a long time. The crate is mature, it’s used by IPFS and Polkadot, the docs are pretty good. I started the Vanta L2 with a libp2p prototype and it worked.

Two things made me switch.

The configuration burden is per-developer. Every new contributor who touches vanta-node would need to internalise the libp2p configuration matrix (or worse: would copy-paste it from somewhere and not understand what they were copying). iroh’s presets::N0 is a single import. The cognitive load is bounded.

NAT traversal is solved-default. libp2p’s NAT traversal is a la carte: configure DCUtR, configure STUN, configure relays. iroh’s is built in. On a privacy chain whose users include anyone with a residential ISP, NAT traversal is not optional and the failure mode (peer can’t be reached) cascades into “wallet stuck waiting for sync.” Defaulting it on saved a class of bug I was tired of debugging.

The cost of the switch was about a week of integration work. I’d take that trade every time. iroh has bugs (the saturating-decrement story above is one of mine), but they’re bugs at a scope I can hold in my head.
