Skill Issue Dev | Dax the Dev
Part of series: Bluesky After Hours

Listening to the Bluesky Firehose for Accidental Haikus


The Bluesky firehose is one of the great ambient APIs. It’s a WebSocket at wss://bsky.network/xrpc/com.atproto.sync.subscribeRepos that streams every public post, like, repost, and follow on the entire network in real time, encoded as DAG-CBOR frames carrying IPLD blocks. As of late 2024 it was clocking around 1,200 events/second. You can watch the firehose live on jaz.land, but the more interesting use case is “consume it from a Rust binary on a Mac Mini and do something stupid with it.”

So I did. The repo is Dax911/bsky-firehose-listener, and the moment it became interesting is commit 291b985 — all msg + haiku — on 2024-12-01. The diff touches one file, +79 / -38, and what it added was support for likes, reposts, and follows, plus real-time detection of English haikus, saved to a file.

Why a haiku detector

Because the firehose is too much information to consume directly. One second of the stream is over a thousand events. Most of them are uninteresting tweets. Some of them are accidentally beautiful three-line poems that scan as 5-7-5 syllables. The ratio is maybe one haiku per ten thousand posts. A real-time filter at that ratio gives you a slow, ambient stream of poetry, which is much more pleasant than a firehose.

The detector is two functions:

use whatlang::detect;

fn is_english(text: &str) -> bool {
    // whatlang's best guess for the post's language must be English.
    detect(text).map_or(false, |info| info.lang() == whatlang::Lang::Eng)
}

use syllarust::estimate_syllables;

fn is_haiku(text: &str) -> bool {
    // Explicit newlines are line breaks; otherwise pretend every
    // five words is a line.
    let lines: Vec<String> = if text.contains('\n') {
        text.lines().map(|s| s.to_string()).collect()
    } else {
        text.split_whitespace()
            .collect::<Vec<&str>>()
            .chunks(5)
            .map(|chunk| chunk.join(" "))
            .collect::<Vec<String>>()
    };

    if lines.len() != 3 {
        return false;
    }

    let syllables: Vec<usize> = lines.iter().map(|line| estimate_syllables(line)).collect();
    syllables == vec![5, 7, 5]
}

whatlang::detect does language detection from a single string in low-tens-of-microseconds. syllarust::estimate_syllables is an English-language syllable estimator based on the heuristic of “count vowel groups, subtract silent-e, add a fudge factor for -le endings.” Both are fast enough to run on every post in the firehose without falling behind.
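For intuition, here’s a hypothetical sketch of that heuristic — not syllarust’s actual code, just the vowel-group idea described above:

```rust
// Hypothetical sketch of the vowel-group heuristic, NOT syllarust's
// real implementation: count runs of vowels, drop a silent trailing
// "e", but keep "-le" endings as a syllable.
fn sketch_syllables(word: &str) -> usize {
    let w = word.to_lowercase();
    let is_vowel = |c: char| "aeiouy".contains(c);
    let mut groups = 0;
    let mut prev_vowel = false;
    for c in w.chars() {
        let v = is_vowel(c);
        if v && !prev_vowel {
            groups += 1; // a new vowel group starts a syllable
        }
        prev_vowel = v;
    }
    // "haze" loses its silent e (-> 1), "little" keeps "-le" (-> 2).
    if w.ends_with('e') && !w.ends_with("le") && groups > 1 {
        groups -= 1;
    }
    groups.max(1)
}
```

Real estimators pile on more special cases (“-ed”, “-es”, diphthong lists), which is why this stays a sketch.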

The line-splitting heuristic is the magic

Here’s the bit that made it work:

let lines: Vec<String> = if text.contains('\n') {
    text.lines().map(|s| s.to_string()).collect()
} else {
    text.split_whitespace()
        .collect::<Vec<&str>>()
        .chunks(5)
        .map(|chunk| chunk.join(" "))
        .collect::<Vec<String>>()
};

If the post has newlines, treat newlines as line breaks. Otherwise, chunk the words into groups of 5 and pretend each chunk is a line.

The “groups of 5” branch is what catches the accidental haikus — single-line tweets that just happen to scan as 5-7-5. About one in ten haikus in my output file came from that branch. Posts where the author had no idea they’d written a poem because they’d written it as a tweet.

The branch is also statistically biased. A 15-word post that gets chunked 5-5-5 is way more likely to clear the syllable check than the same post split 4-7-4 or 6-5-4. So the detector preferentially finds posts that are roughly evenly word-distributed in the right chunk shape. That’s a feature, not a bug — the same statistical bias is what makes English poetry feel “natural” when you write it.
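Run in isolation, the chunking branch makes that bias concrete: only single-line posts of 11–15 words can produce exactly three chunks at all, and a 15-word post always lands in the 5-5-5 shape. A minimal sketch (hypothetical function name, same logic as the branch above):

```rust
// The whitespace branch of is_haiku, extracted: groups of five words
// become pretend "lines".
fn chunk_into_lines(text: &str) -> Vec<String> {
    text.split_whitespace()
        .collect::<Vec<&str>>()
        .chunks(5)
        .map(|chunk| chunk.join(" "))
        .collect()
}
```

Everything outside the 11–15 word band is rejected by the `lines.len() != 3` check before a single syllable gets counted.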

Saving them to disk

use std::fs::OpenOptions;
use std::io::Write;

fn save_haiku_to_file(haiku: &str, cid: &str) -> std::io::Result<()> {
    // Append-only log; create the file on first write.
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("haikus.txt")?;
    writeln!(file, "CID: {}\n{}\n", cid, haiku)?;
    Ok(())
}

haikus.txt is the output. CID-prefixed because Bluesky records are content-addressed — the CID is a SHA-256-based hash that lets you go back and find the original post in the AT Protocol record store later, even if the user deletes it (the CID survives in the firehose log and in any indexer that captured it).

The CAR-file decoding pain

The most painful part of the listener is not the haiku logic. It’s the firehose protocol. Each WebSocket binary frame contains two concatenated DAG-CBOR objects: a header (with op, t, etc.) and a body. The body is itself a CAR file (Content-Addressable aRchive) containing all the IPLD blocks for the commit. To get a single post’s text you have to:

  1. Parse the header DAG-CBOR.
  2. Check op == 1 (Message) and t == "#commit".
  3. Parse the body as a Commit struct.
  4. Walk commit.ops to find create ops on app.bsky.feed.post.
  5. Look up the CID of each op in the CAR file’s blocks.
  6. Decode the matching block as a post::Record.
  7. Read record.text.

That’s a lot of decoding for what ends up being a string. Rust handles it well — the atrium-api and serde_ipld_dagcbor crates abstract steps 1–6, and the throughput on a single core is sufficient — but when I first wrote the listener (commit 1311836 — feat: initial working commit, 2024-10-24), I spent a full evening debugging “valid data turns out to be invalid” errors caused by the cursor-position trick on the very first line:

let mut cursor = Cursor::new(data.as_slice());
serde_ipld_dagcbor::from_reader::<Ipld, _>(&mut cursor)
    .expect_err("Somehow bsky only sends 1 frame.");
let (metadata, data) = data.split_at(cursor.position() as usize);

This is the only way to find the boundary between the two concatenated DAG-CBOR objects. The decoder reads the first object cleanly, then errors because there are trailing bytes it wasn’t asked to interpret — and when it errors, the cursor is parked exactly at the end of the first object. split_at(cursor.position()) then separates the header bytes from the body bytes. That’s a textbook example of a “use the parser as a finger” trick — the cursor’s position after the failed read is the parser’s best guess at the boundary.
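The trick generalizes to any self-delimiting encoding. Here’s a toy stand-in (length-prefixed records instead of DAG-CBOR, hypothetical names) that shows the same move: decode one object, then split the buffer wherever the cursor stopped:

```rust
use std::io::{Cursor, Read};

// Toy decoder: one record = 1 length byte + that many payload bytes.
// Like the DAG-CBOR from_reader call above, it leaves the cursor
// parked exactly at the end of the object it consumed.
fn read_record(cursor: &mut Cursor<&[u8]>) -> std::io::Result<Vec<u8>> {
    let mut len = [0u8; 1];
    cursor.read_exact(&mut len)?;
    let mut payload = vec![0u8; len[0] as usize];
    cursor.read_exact(&mut payload)?;
    Ok(payload)
}

// The "parser as a finger": consume the first object, then split the
// frame at the cursor position to find where the second one starts.
fn split_frame(frame: &[u8]) -> (&[u8], &[u8]) {
    let mut cursor = Cursor::new(frame);
    let _ = read_record(&mut cursor);
    frame.split_at(cursor.position() as usize)
}
```

Swap the toy decoder for serde_ipld_dagcbor’s from_reader and you have the firehose version.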

Adding likes, reposts, and follows

The other half of the diff was the broader event handling:

match operation.path.as_str() {
    path if path.starts_with("app.bsky.feed.post") => {
        // post::Record handling, plus haiku detection
    },
    path if path.starts_with("app.bsky.feed.like") => {
        if let Ok(record) = serde_ipld_dagcbor::from_reader::<like::Record, _>(data.as_slice()) {
            info!("New like: {:?} - Subject: {}", operation.cid, record.subject.uri);
        }
    },
    path if path.starts_with("app.bsky.feed.repost") => {
        if let Ok(record) = serde_ipld_dagcbor::from_reader::<repost::Record, _>(data.as_slice()) {
            info!("New repost: {:?} - Subject: {}", operation.cid, record.subject.uri);
        }
    },
    path if path.starts_with("app.bsky.graph.follow") => {
        if let Ok(record) = serde_ipld_dagcbor::from_reader::<follow::Record, _>(data.as_slice()) {
            info!("New follow: {:?} - Subject: {:?}", operation.cid, record.subject);
        }
    },
    _ => {
        info!("Unknown event type: {}", operation.path);
    }
}

Each event type has its own AT Protocol lexicon — app.bsky.feed.like, app.bsky.graph.follow, etc. — and each lexicon is a separate Record type generated from the protocol’s JSON schema. The atrium_api crate gives you typed structs for all of them, so consuming a like is just like::Record deserialization. Adding a new event type is two lines of code.

This is the moment a firehose listener stops being “I want to read posts” and becomes “I have programmatic access to every social action on the network.” That’s the actual interesting capability. Haikus are a fun output. Tracking the graph of who’s following whom in real time is a different kind of project.

What I learned

The firehose is more interesting as a substrate than as a feed. Reading every post is overwhelming and useless. Filtering every post through a 50-line heuristic and reading only the survivors is delightful. The same is true for likes (filter for “first like ever from this account on this account” — anniversary detection) and follows (filter for “burst of follows in a 60s window from disjoint accounts” — manipulation detection).
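The follow-burst idea is just a small sliding-window filter. A minimal sketch (hypothetical names; a real version would keep one of these per followed account):

```rust
use std::collections::{HashSet, VecDeque};

// Hypothetical sketch of the manipulation filter described above: flag
// a subject once it receives follows from `threshold` distinct
// accounts inside a rolling `window_secs` window.
struct FollowBurstDetector {
    window_secs: u64,
    threshold: usize,
    recent: VecDeque<(u64, String)>, // (epoch seconds, follower DID)
}

impl FollowBurstDetector {
    fn new(window_secs: u64, threshold: usize) -> Self {
        Self { window_secs, threshold, recent: VecDeque::new() }
    }

    // Feed every follow event in; returns true when this one completes
    // a suspicious burst.
    fn observe(&mut self, now: u64, follower: &str) -> bool {
        // Evict events that fell out of the window.
        while self.recent.front().map_or(false, |(t, _)| now - t > self.window_secs) {
            self.recent.pop_front();
        }
        self.recent.push_back((now, follower.to_string()));
        // Disjoint accounts only: repeats from one DID don't count.
        let distinct: HashSet<&str> =
            self.recent.iter().map(|(_, f)| f.as_str()).collect();
        distinct.len() >= self.threshold
    }
}
```

Same shape as the haiku detector: cheap per-event state, one boolean out.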

Rust’s CBOR/CAR ecosystem is mature and fast. atrium-api + serde_ipld_dagcbor + rs_car get you to native-throughput consumption of the AT Protocol firehose with no heroic effort. I was getting through 1,500 ev/s on a single core comfortably.

The User-Agent matters even on a public firehose. Bluesky’s relay operators throttle clients that hammer the endpoint without identifying themselves. The constant USER_AGENT: &str = "bsky-firehose-listener (https://github.com/angeloanan/bsky-firehose-listener)" is the original author’s; I left it in because the relay knew that string. Changing it cost me an hour of debugging when I forked the repo and got rate-limited.

Trade-offs

Why English-only haikus? Because syllarust only does English. You could plug in a multilingual syllable estimator, but Japanese haikus rely on moras, not syllables, and the heuristic stops working. The right answer for cross-language haiku detection is per-language pipelines, which is a real project, not a side-quest.

Why save to a flat file? Because I never ran this for more than a weekend at a time and the output file was a few hundred KB. A real version would push to a queue and persist to a database with author/time/CID. This version persists to haikus.txt and gets git add-ed when I think the file’s full.

Why no relay-side filtering? Because the AT Protocol relay doesn’t let consumers filter at the source. You get the whole firehose and filter on your end. That’s the cost of an open protocol — every consumer pays for every post regardless of what they care about.

What I’d do next

If I had another afternoon I’d:

  • Wire the haiku detector to a Bluesky bot account that replies to the original post with 🌸 detected a haiku 🌸. The poet usually has no idea they wrote one.
  • Cluster haikus by topic. The whatlang step is wasted if I don’t also classify the post.
  • Cross-reference haikus against the like-graph: are haikus disproportionately liked compared to non-haiku posts? My weak prior is yes.

Side-quests are how you stay practiced with weird APIs. The next time someone hands me a Kafka topic with millions of events per second and says “find the interesting ones,” I have muscle memory for “decode → filter with cheap heuristic → log to flat file → look at output, profit.”
