
Nuclear reactors taught me to ship software

Watchstanding, casualty drills, and pre-task briefs map onto code review, on-call, and disaster recovery more cleanly than any management book I have ever read.

FROM: Dax the Dev <[email protected]>
SOURCE: https://blog.skill-issue.dev/blog/nuclear_reactors_taught_me_to_ship/
FILED: 2024-09-15 08:00 UTC
REVISED: 2024-09-15 08:00 UTC
TIME: 9 min read
SERIES: Origin Stories
TAGS: #career #navy #narrative #engineering-culture #on-call #discipline

Before I had a GitHub graph I had a watchstation. I came up as an Electronics Technician (Nuclear) in the US Navy — six-and-change years of reactor instrumentation, control circuitry, and the kind of standing-around-staring-at-a-panel that looks like nothing right up until the moment it isn’t. I won’t talk about specific platforms or specific hulls. I will talk about the habits, because the habits are the part that travels.

The thesis of this post is short: every meaningful engineering practice I have at thirty-something — code review, on-call rotations, post-mortems, the way I write a runbook — was already drilled into me by the time I was twenty. I just didn’t know yet that “Naval Reactors” was preparing me to be a senior IC and, eventually, a CEO.

Here is what actually transferred.

Watchstanding is on-call with consequences

When you stand a reactor watch you are not “available.” You are on the watch. You have a logbook, a panel of indications, a procedure binder with tabs you can find in the dark, and a relief schedule that has nothing to do with how tired you are. The handoff is formal: every relevant indication, every plant evolution in progress, every ongoing concern, every casualty in your rear-view mirror that the next watch needs to know about. You sign for it. They sign for it. Now they have it.

When I started doing software incident response, I noticed how informal the handoff was. “Hey, anything weird tonight?” “Eh, the deploy was fine, see you Monday.” This is insane. A modern production system has more state than the secondary plant of a submarine. Where is the watchstanders’ log? Where is the formal turnover?

Some of the best engineering teams I’ve worked with have a chat-channel pinned message that gets edited every shift: current state of staging, current state of prod, anything in flight, anything degraded. It is, structurally, the same artifact as a watchstander’s log. It exists for the same reason: the next person who has to make a decision should not have to reconstruct context from Slack scrollback.

If you take one habit away from this post, make it this: keep a turnover log. Write it for the human who will read it at 3am with one eye open. That human is sometimes future-you.
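
What that looks like varies by team, but the turnover note itself can be tiny. Here is a minimal sketch of the pinned message; every name, number, and field in it is a placeholder of mine, not a standard:

```
## Turnover: day shift -> night shift, 2024-09-14

Prod:      green. Error rate normal since the 14:10 deploy.
Staging:   degraded. Search index rebuild in progress, ETA ~02:00.
In flight: schema migration is merged but NOT applied to prod yet.
           Do not apply it without the on-call DBA.
Watch:     payment-webhook retries elevated since 16:00;
           vendor status page says "investigating."

Handed off by: @day-oncall    Accepted by: @night-oncall
```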

Pre-task briefs are code review

There is a Navy ritual called a pre-task brief. Before a non-trivial evolution — pulling a primary sample, swapping a controller card, anything where a slip can scram the plant — you brief. The team gathers, the lead walks the procedure step by step, every person says back what their role is, and the off-nominal cases get explicit attention.

The first time I sat through a “design review” in the software industry, I thought I was being trolled. Someone was presenting a design doc. People were nodding. Nobody was saying back their role. Nobody was naming the failure mode they personally would be responsible for catching. There was no “what do we do if we have to roll this back at 30% of the rollout.”

A good code review is a pre-task brief in slow motion. The author walks the procedure (the diff). The reviewers say back what the change does (the description). And — this is the part most teams skip — someone explicitly owns the rollback case. If this breaks production at 0200, who picks up the page, and what is the first thing they do? If you can’t answer, the change is not ready to merge. Doesn’t matter how clean the code is.
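
One cheap way to force that conversation is to put it in the pull-request template, so skipping it has to be a deliberate act. A sketch under my own field names, not a prescription:

```
## What this change does
One paragraph, written so a reviewer can say it back in their own words.

## Risk
- Blast radius if this is wrong:
- How we will notice (dashboard, alert):

## Rollback
- Rollback owner (a person, not a team):
- First action if this breaks prod at 0200:
- Safe to roll back mid-deploy? (migrations, queues, caches):
```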

When I’m on a team that has stopped doing pre-task briefs, you can feel it. Deploys go out with vibes attached. Three deploys later, somebody discovers a regression nobody can name the source of. That’s the cost of skipping the brief.

Casualty drills make muscle memory cheap

The Navy reactor world drills constantly. Steam-rupture drill. Loss-of-cooling drill. Fire-in-machinery-spaces drill. Scram drill from every state the plant is allowed to be in, plus a few it isn’t. The point is not to surprise you; the point is the opposite. The point is for the response to be so deeply baked in that the surprise of the real event doesn’t get to consume any of your cognition. You don’t need cognition for the response — you have it. You need cognition for the anomaly, the part of the casualty that the drill didn’t cover.

In software we call this game day, chaos engineering, fire drill. We do it badly, and most teams don’t do it at all. The teams that do it well (Netflix’s monkeys, the Stripe game-day playbook, the GCP/AWS internal exercises that occasionally leak out as conference talks) discover that the drills are not really about the failures they simulate. They are about the coordination friction the drills surface. The deploy script that nobody on call has run in six months. The dashboard nobody knows the URL of at 4am. The runbook that says “page Steve” and Steve left the company in February.

If you’ve never done a drill in your software career, do one. Start small. Pick a Tuesday afternoon. Kill staging. Watch what happens. The first drill always exposes something embarrassing. That’s the entire point. The reactor world figured this out three generations ago.
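
A one-page drill plan is enough to start. Something like the sketch below, with the scenario swapped for whatever scares your team most; every detail here is hypothetical:

```
# Game day: Tuesday 14:00, staging only

Scenario:       operator makes the staging database unreachable.
Participants:   this week's on-call responds; one observer keeps a timeline.
Expected:       alert fires within 5 minutes, failover runbook is followed,
                service recovers within 20 minutes.
Stop condition: anything touches prod, or the clock passes 15:30.
Output:         timeline, list of surprises, runbook diffs filed by Friday.
```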

After-action review is the post-mortem the right way around

Naval Reactors does an after-action review on every casualty drill and every real event. The format is unromantic. What did we expect to happen? What actually happened? Where did our model of the plant diverge from the plant? What changes do we make so that, the next time the plant does that, we are not surprised in the same way?

Notice what isn’t on that list: blame. Notice what is on it: a working theory of why your model of the system was wrong, and a delta you commit to before the meeting ends.
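
Written down, the whole format fits on one screen. This is roughly how I template it; the skeleton is mine, the four questions are the old ones:

```
# After-action review: <event or drill>, <date>

1. What did we expect to happen?
2. What actually happened? (timeline, with timestamps)
3. Where did our model of the system diverge from the system?
4. What changes do we commit to before this meeting ends?
   - change / owner / due date
   - change / owner / due date
```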

The blameless post-mortem culture you’ve read about in Etsy and Google blog posts is not new. It’s old. The reactor world has been running it since before I was born, because in the reactor world blame is operationally useless: the next watch is going to be staffed by the same humans, and humans don’t get less human under blame; they just get worse at reporting things. The Navy figured out that you have to make it cheap to admit you got something wrong, or the reports stop, and when the reports stop, the system goes blind.

I’m proud of every team I’ve been on that runs honest post-mortems. I am extremely suspicious of any team that does not.

Procedure-in-hand is just runbook.md

You don’t operate the plant from memory. You operate it from a procedure, and the procedure is an artifact you can hand someone. If a step in the procedure is wrong, you don’t quietly do the right thing — you stop, you flag it, and you change the procedure. The procedure is the source of truth, not the human currently executing it.

When I joined my first software shop, I remember being baffled that production-deploy steps lived in the heads of two senior engineers. There was no deploy.md. The whole thing was tribal. So I sat in the next time one of them deployed prod, took notes in real time, turned them into a markdown file, opened a PR, and titled it “deploy.md (first draft, please correct anything I got wrong).” The reception was “huh, that’s a good idea, why didn’t we have one of these.”

A runbook is a procedure. Treat it the way the reactor world treats procedures: it’s the source of truth; if it’s wrong, you fix the procedure, not the operator. The day after an incident, the diff to your runbooks should be visible. If it isn’t, you didn’t actually learn anything from the incident.
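
That first deploy.md was not sophisticated, and it doesn’t have to be. A runbook that earns its keep looks roughly like this; the script names, channel, and thresholds are placeholders, not anyone’s real setup:

```
# deploy.md: production deploy

Preconditions: CI green on main, no open incident, you can reach the VPN.

1. Announce in #deploys that a prod deploy of <sha> is starting.
2. Run the deploy script, e.g. ./scripts/deploy.sh production <sha>
3. Watch the error-rate dashboard for 10 minutes.
4. If errors double or the deploy stalls, run the rollback script
   and page on-call. Do not debug first; roll back first.
5. Record the deploy (sha, time, your name) in the turnover note.

If a step above is wrong, fix this file; the procedure is the source
of truth, not the person running it.
```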

The sea story I’m allowed to tell

Here is one I can share. Mid-watch, deep in a long underway, an indication on a panel started doing a thing it wasn’t supposed to do. Not dangerous — just wrong. Specifically, an analog gauge was reading a value that wasn’t physically possible given the state of the plant.

Now: I was a junior tech. I had two options. (1) Convince myself the gauge was probably fine, finish my watch, go to bed. (2) Wake up the senior watch, who was already short on sleep, to look at a gauge that was probably broken.

I picked (2). The senior watch came down, looked at the gauge, looked at the related indications on the rest of the panel, ran one cross-check, and identified the actual problem in under three minutes — which was, as suspected, a failed transducer rather than anything dramatic. Then he went back to bed.

What I remember most is what he said on his way out: “Always wake me up. The gauge being wrong is fine. The gauge being wrong and you not telling me is not fine.”

That has become how I think about engineering escalation. The cost of false alarms is small. The cost of an alarm that should have happened and didn’t is enormous. Every senior engineer I now mentor gets some version of that lecture. Always page. Always escalate. The gauge being wrong is fine. Nobody knowing the gauge is wrong is not fine.

Why I’m writing this in 2024

I’m writing this between the Rusty Pipes posts and whatever comes next, because every interview I do for senior-and-above engineering roles eventually arrives at the same question: what makes you you?

The honest answer is: a Navy reactor compartment. Long before I learned git rebase, I learned that a panel does not lie, that a procedure is a contract with the future, and that the most dangerous person on a watchstation is the one who thinks the indications are probably fine.

If you ever wonder why I write the way I write, why my code reviews look the way they look, why my on-call instincts default to “wake me up, I’d rather lose sleep than lose data” — that’s why. I keep a watch. The plant just looks like Postgres now.
