How Random is a Local LLM? A Rust Benchmark with Redis
A Rust harness that asks Ollama models for "a random number between 1 and 100" thousands of times, parses every response with regex, stores results in Redis, and pits them against a real RNG. Spoiler: 42 wins.
- FROM
- Dax the Dev <[email protected]>
- SOURCE
- https://blog.skill-issue.dev/blog/ai37_llm_random_numbers/
- FILED
- 2024-04-25 02:54 UTC
- REVISED
- 2024-04-25 02:54 UTC
- TIME
- 10 min read
- SERIES
- Rust Side-Quests
There’s a piece of folk knowledge in the LLM crowd that says: ask any chatbot for “a random number between 1 and 100” enough times, and you’ll see a clear bias toward the same handful of numbers. 7. 17. 42. 73. The exact set varies by model, but the bias is robust across most LLMs.
I’d seen the screenshots on Twitter. I had a half-day in April 2024 and a Mac Mini running Ollama. So I built a benchmark — ai37 — to actually measure it. The whole project lives at Dax911/ai37, and the commit that turned it from “demo” into “actually a benchmark” is fc5c80c — :sparkles: Rust rng on 2024-04-25.
This post is about what the harness looks like, why I built it in Rust instead of a 20-line Python script, and what I learned from running it.
The shape of the experiment
The premise is simple enough to write on the back of a napkin:
- Pick a question. ("Generate a random number between 1 and 100, inclusive. Reply with only the number.")
- Pick a model. (openhermes:latest, llama2-uncensored:latest, etc.)
- Send the prompt 1,000+ times.
- Parse the response. Extract the first integer between 2 and 99.
- Store the response, the parsed number, the model, the response time, and the timestamp in Redis.
- Aggregate.
You could write all of that in a Python notebook in fifteen minutes. The reason I wrote it in Rust is that step 3 is the bottleneck: Ollama serves one inference at a time per model, and even on M1 hardware a single completion takes 1–4 seconds. To get a meaningful sample size in reasonable wall-clock time you have to fan out across multiple concurrent requests, manage a Redis connection pool, and not let one slow model stall the whole run. Tokio + reqwest + a MultiplexedConnection to Redis got me to ~1,000 prompts in under three minutes. The Python equivalent would have been a thousand-prompt script that ran for an hour.
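For a feel of the fan-out, here's a minimal sketch, assuming Ollama's standard POST /api/generate endpoint with stream: false and reqwest's json feature; the CONCURRENCY cap and run_batch helper are illustrative names, not code from the repo:

```rust
use futures::stream::{self, StreamExt};
use serde_json::json;

// Illustrative fan-out: at most CONCURRENCY requests in flight, so one
// slow completion can't stall the run. Assumes reqwest's `json` feature.
const CONCURRENCY: usize = 4;

async fn run_batch(
    client: &reqwest::Client,
    model: &str,
    prompt: &str,
    n: usize,
) -> Vec<Result<String, reqwest::Error>> {
    stream::iter(0..n)
        .map(|_| async move {
            let resp = client
                .post("http://localhost:11434/api/generate")
                .json(&json!({ "model": model, "prompt": prompt, "stream": false }))
                .send()
                .await?;
            // With stream = false, Ollama replies with a single JSON object
            // whose "response" field holds the whole completion.
            let body: serde_json::Value = resp.json().await?;
            Ok(body["response"].as_str().unwrap_or_default().to_string())
        })
        .buffer_unordered(CONCURRENCY)
        .collect()
        .await
}
```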
The harness
From src/main.rs, this is the result struct:
```rust
#[derive(Debug)]
struct ApiQueryResult {
    request_id: u64,
    endpoint_url: String,
    question: String,
    response_time: u128,
    http_status_code: u16,
    response_body: String,
    error_message: Option<String>,
    chosen_number: Option<i32>,
    model: String,
    request_datetime: DateTime<Utc>,
    contained_additional_text: bool,
}
```
Every field on this struct exists because at some point I lost data and wished I had it. response_body is verbatim what the model said. chosen_number is what the regex extracted. contained_additional_text is the binary flag for “did the model say only 42 or did it say Sure! Here's your number: 42.”
The reason chosen_number is an Option<i32> and not just an i32 is the most important design choice in the whole harness: sometimes the model doesn’t reply with a number at all. llama2-uncensored once replied to me with "I cannot generate a random number for you, as I am an AI language model designed to provide informational and educational responses..." That’s not a refusal in the safety sense — that’s the model genuinely not understanding what’s being asked. The harness has to record that and not crash.
Regex was the right call here
```rust
fn extract_number_from_response(response: &str) -> Option<i32> {
    let re = Regex::new(r"\d+").unwrap();
    let mut numbers: Vec<i32> = Vec::new();
    for cap in re.captures_iter(response) {
        if let Some(number_str) = cap.get(0) {
            if let Ok(number) = number_str.as_str().parse::<i32>() {
                if number >= 2 && number <= 99 {
                    numbers.push(number);
                }
            }
        }
    }
    numbers.into_iter().next()
}
```
There are three subtle things in this 14-line function (illustrated with a few assertions after this list):
- Find every integer, not just the first. Models will sometimes say "between 1 and 100... I'd say 73." That's three numbers, and the third one is the answer. You have to examine all of them.
- Filter to the valid range (2–99 inclusive). This eliminates "1" from "between 1 and 100" if the model just echoed the prompt back. It also eliminates "100", because the prompt says exclusive in some variants. The boundary numbers are the most common false positives.
- Take the first survivor. Counter-intuitively this is the right heuristic, because most models that emit multiple integers do so as "between [LOW] and [HIGH], my answer is [N]". The [LOW] and [HIGH] are filtered out by the range check; [N] survives, and the first survivor is the answer.
Could you parse this with a more sophisticated NER pipeline? Sure. Could you fine-tune a small classifier? Sure. But this is a benchmark of LLM randomness, not a benchmark of how clever I can be at extracting numbers from text. The dumber the parser, the easier it is to defend the conclusion.
Storing in Redis was load-bearing
Each result becomes a Redis hash with a unique key:
```rust
let unique_key = format!(
    "rust-basic-rng:{}:{}",
    Utc::now().timestamp_millis(),
    number
);
let data = vec![("number", number.to_string())];
let _: () = con.hset_multiple(&unique_key, &data).await?;
```
The key shape — <model>:<timestamp_ms>:<number> — means I can:
- KEYS rust-basic-rng:* to list every result from the control RNG.
- KEYS *:1714013094:* to list every model's responses in a 1-ms window (used for "did models converge in time?" analysis).
- HGETALL <key> to recover the full record.
This is not the right schema for a real database. There's no compound index, no fast WHERE number = 42 query without scanning every key. But Redis on a Mac Mini doing a KEYS * over 5,000 entries is still a sub-100ms operation, and the entire dataset fits comfortably in memory.
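If the dataset ever outgrew KEYS, the read-back path could use SCAN instead. A minimal sketch, assuming the redis crate's async API; the load_model_results helper is hypothetical, not from the repo:

```rust
use redis::AsyncCommands;
use std::collections::HashMap;

// Hypothetical read-back helper (not in the repo): SCAN for one model's
// keys instead of a blocking KEYS, then HGETALL each to get the record.
async fn load_model_results(
    con: &mut redis::aio::MultiplexedConnection,
    model: &str,
) -> redis::RedisResult<Vec<HashMap<String, String>>> {
    let mut keys: Vec<String> = Vec::new();
    {
        // The SCAN iterator borrows `con` mutably, so collect the keys
        // first and issue the HGETALLs after it is dropped.
        let mut iter = con.scan_match::<_, String>(format!("{model}:*")).await?;
        while let Some(key) = iter.next_item().await {
            keys.push(key);
        }
    }
    let mut rows = Vec::new();
    for key in keys {
        rows.push(con.hgetall(&key).await?);
    }
    Ok(rows)
}
```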
The bigger reason for Redis is that I wanted to resume the run if my laptop hibernated. Streaming straight to a CSV would have meant losing any in-flight results if the script crashed. Redis takes the writes out-of-process; a crash loses at most one inference's worth of data.
The control: a real RNG
I added the control in that same commit (fc5c80c):
```rust
async fn generate_and_store_random_numbers(
    con: &mut MultiplexedConnection,
    n: usize,
    min: i32,
    max: i32,
) -> redis::RedisResult<()> {
    let mut rng = rand::thread_rng();
    for _ in 0..n {
        let number = rng.gen_range(min..=max);
        let unique_key = format!(
            "rust-basic-rng:{}:{}",
            Utc::now().timestamp_millis(),
            number
        );
        // ...
    }
    Ok(())
}
```
Why bother including a rand::thread_rng() baseline? Because a benchmark with no baseline isn’t a benchmark, it’s an anecdote. The story “LLMs say 42 too often” is only meaningful if you also know what a real RNG’s frequency distribution looks like over the same number of trials. With 1,000 trials over 98 distinct values, a uniform RNG will produce a frequency-of-mode that’s also non-uniform — the most common number will still appear ~3× more often than the least common, just by chance. You need that baseline to say “the LLM bias is real” instead of “the LLM happened to produce a non-uniform sample.”
The control RNG isn’t there because anyone questions whether rand::thread_rng() is uniform. It’s there because the comparison statistic only works if both arms are sampled the same way.
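You can sanity-check that intuition without any LLM in the loop. A quick simulation (mine, not code from the repo) of 1,000 uniform draws over the same 98 values:

```rust
use rand::Rng;
use std::collections::HashMap;

// Illustrative simulation: how lopsided does a genuinely uniform
// sample look at n = 1,000 over the values 2..=99?
fn main() {
    let mut rng = rand::thread_rng();
    let mut counts: HashMap<i32, u32> = HashMap::new();
    for _ in 0..1_000 {
        *counts.entry(rng.gen_range(2..=99)).or_insert(0) += 1;
    }
    let max = counts.values().max().unwrap();
    let min = counts.values().min().unwrap();
    // Expected count per value is ~10.2; the gap between max and min
    // is pure chance, and it's the spread any claimed LLM bias must beat.
    println!("most common appeared {max}x, least common {min}x");
}
```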
The analyze.py companion
The same commit added a small Python script for the actual stats:
```
 analyze.py | 46 ++++++++++++++++++++++++++++++++++++++++++++++
```
(Yes, the leading space in the filename is real. I never noticed; git accepted it; nobody depends on it; the commit immortalized it.)
analyze.py opens Redis, scans the keys for each model, builds a Counter, normalizes to frequency, and pretty-prints the top 10 most-common numbers per model. That’s it. The script is 46 lines and it’s where the actual scientific output came from. Rust did the data collection; Python did the stats. The right tool for each job.
What the data showed
I’m not going to publish the raw numbers because the runs I have are from 2024 against ollama models that have since been retrained, and I don’t trust the conclusions to generalize to today’s checkpoints. But the qualitative finding matched the folk knowledge:
- Both Ollama models I tested were significantly biased toward 7, 17, 42, 73, 77.
- The Rust RNG was uniform in the chi-square sense at 1,000 samples (p > 0.05; the statistic is spelled out below).
- llama2-uncensored had a worse bias than openhermes in the sense that its mode-frequency was higher (the most common number appeared more often as a fraction of total samples).
- Both LLMs avoided multiples of 10: 30, 50, 60 were under-represented relative to 33, 47, 61. My theory: models have learned that "round numbers don't sound random," so they overcorrect away from them.
The most-common-overall LLM answer was 42. Of course it was 42.
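For the record, "uniform in the chi-square sense" means the standard goodness-of-fit statistic against a flat expected count over the 98 possible values:

$$
\chi^2 = \sum_{i=2}^{99} \frac{(O_i - E_i)^2}{E_i},
\qquad E_i = \frac{n}{98} \approx 10.2 \text{ at } n = 1{,}000
$$

With 97 degrees of freedom, the 5% critical value is about 121. The control RNG's statistic came in under it; "significantly biased" above means the LLMs' did not.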
What this taught me
The technical thing I learned was that regex parsing is fine for almost any LLM output extraction problem if you constrain the output range tightly. I’d been reaching for JSON-mode prompts and structured-output APIs for things that a 14-line \d+ regex would solve.
The bigger thing was about benchmarking discipline: “is this thing biased?” is not a yes/no question without a baseline. Half the AI Twitter takes I read in 2024 were claims of LLM bias against an implicit baseline of “perfectly uniform behavior,” which no statistical process exhibits at finite sample sizes. The boring controls are what make the spicy claims defensible.
If you want a Rust harness for benchmarking any local model, ai37 is the template. It’s 200 lines of Rust, a 46-line Python analyzer, and a Redis dependency. Add a model, change the regex, change the question. The architecture survives.
Trade-offs
Why Ollama instead of OpenAI/Anthropic? Cost. 5,000 inferences at 4¢ each is $200 for a science-fair experiment. Ollama on a Mac Mini costs roughly the electricity of leaving a machine on overnight.
Why Redis instead of SQLite? Resilience to mid-run crashes. SQLite would also work; the schema is trivial. The reason I went Redis is I had it running for another project (the Rust pipeline part of Building A Better Cryptocurrency) and adding a hash schema was 5 lines.
Why filter to 2–99 instead of allowing the boundary? Because half the failure modes of LLMs are “echoing the prompt back.” Filtering 1 and 100 out cleanly distinguishes “the model picked an answer” from “the model parroted the question.” You lose two valid sample values; you gain a much cleaner dataset.
Further reading
- ai37 on GitHub — the harness, the analyzer, the (lost) Redis dump.
- Ollama — the local-LLM runner I benchmarked against.
- rand crate docs — thread_rng().gen_range(...) is what makes the control arm honest.
- Building A Better Cryptocurrency — the project Redis was already running for.