Fuses: a simple pattern for preventing runaway automation
I’ve built and maintained countless “janitor” systems in my career: cronjobs which go and clean up old resources like user accounts which should no longer be active. On multiple occasions, one of those janitors has gone off the rails and locked the entire company out of their accounts.
Now, whenever I build systems like this, I include a “fuse”.
The fuse defines a maximum value I expect the system to handle at any one time: the number of accounts to delete, say, or the number of permissions to revoke. If the janitor ever tries to exceed this value, the fuse “blows” and requires a human to intervene.
The time a wobbly API suspended the entire company
At one company, we had an automation to put staff into Google Groups based on attributes from the HR system.
- Everyone with an engineering job title was put in an engineers@ group.
- Anyone with a direct report was put in a managers@ group.
- etc.
As this automation already touched both the HR system and Google Workspace APIs, we extended it to also handle offboarding: when a staff member was no longer active in the HR system, their Google account was suspended.
This is all very simple to implement:
- List everyone from the HR system.
- List all the Google Workspace accounts.
- Reconcile the two, using the HR system as the source of truth:
  - Add accounts to any new groups.
  - Remove accounts from any old groups.
  - Suspend any accounts which should no longer be active.
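As a rough illustration, the reconcile step above could look something like the following. This is a minimal sketch, not the original automation: the `StaffMember`, `GoogleAccount` and `Plan` types and the `reconcile` and `difference` functions are all hypothetical, and the two input slices stand in for whatever the HR and Google Workspace APIs actually return.

```go
package main

import (
	"fmt"
	"sort"
)

// StaffMember is a hypothetical record from the HR system.
type StaffMember struct {
	Email  string
	Active bool
	Groups []string // groups this person should be in, derived from HR attributes
}

// GoogleAccount is a hypothetical view of a Google Workspace account.
type GoogleAccount struct {
	Email     string
	Suspended bool
	Groups    []string // groups the account is currently in
}

// Plan lists every change the sync wants to make, keyed by account email.
type Plan struct {
	AddToGroup      map[string][]string
	RemoveFromGroup map[string][]string
	Suspend         []string
}

// reconcile compares the two lists, treating the HR system as the source of truth.
func reconcile(hr []StaffMember, google []GoogleAccount) Plan {
	plan := Plan{
		AddToGroup:      map[string][]string{},
		RemoveFromGroup: map[string][]string{},
	}
	active := map[string]StaffMember{}
	for _, s := range hr {
		if s.Active {
			active[s.Email] = s
		}
	}
	for _, acct := range google {
		staff, ok := active[acct.Email]
		if !ok {
			// Not active in HR (or missing from the HR response): suspend.
			if !acct.Suspended {
				plan.Suspend = append(plan.Suspend, acct.Email)
			}
			continue
		}
		if add := difference(staff.Groups, acct.Groups); len(add) > 0 {
			plan.AddToGroup[acct.Email] = add
		}
		if remove := difference(acct.Groups, staff.Groups); len(remove) > 0 {
			plan.RemoveFromGroup[acct.Email] = remove
		}
	}
	sort.Strings(plan.Suspend)
	return plan
}

// difference returns the items of a that are not in b.
func difference(a, b []string) []string {
	inB := map[string]bool{}
	for _, x := range b {
		inB[x] = true
	}
	var out []string
	for _, x := range a {
		if !inB[x] {
			out = append(out, x)
		}
	}
	return out
}

func main() {
	hr := []StaffMember{{Email: "ana@example.com", Active: true, Groups: []string{"engineers"}}}
	google := []GoogleAccount{
		{Email: "ana@example.com", Groups: []string{"managers"}},
		{Email: "bob@example.com"},
	}
	plan := reconcile(hr, google)
	fmt.Println("suspend:", plan.Suspend)
	fmt.Println("add:", plan.AddToGroup)
	fmt.Println("remove:", plan.RemoveFromGroup)
}
```

Note that in this sketch an account that is simply absent from the HR list is treated the same as one belonging to an inactive staff member. That is the natural way to write it, and it matters later in the story.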
And for a long time, this hummed along perfectly fine.
Dealing with failures gracefully?
One day, the HR system had some performance issues.
But, rather than failing entirely (like returning a 504 Gateway Timeout error), it degraded “gracefully”.
Specifically, the endpoint to list staff members just returned whatever it had managed to list from the database before timing out.
{
  "people": [
    { "id": 1, "..." },
    { "id": 2, "..." },
    // oops, ran out of time to fetch the rest...
  ]
}
And so our automation dutifully suspended the Google accounts of everyone not in this API response — 95% of the company 💥
The Fuse pattern
Electrical fuses
In the UK, all plugs must have a fuse: they start at 3 Amp for small devices (which is still comically large for e.g. a phone charger) and go up to 13 Amp for more heavy-duty appliances.
These fuses are extremely simple: they contain some thin wire that, when too much current passes through, melts and breaks the circuit.

[Image: an electrical insert fuse, via https://commons.wikimedia.org/wiki/File:Electrical_insert_fuse.jpg]
When designing your device, you’re forced to pick a safe upper bound for the amount of current it should draw. If your phone charger suddenly starts drawing over 3A, something’s definitely gone very wrong.
Software fuses
Our HR→Google sync automation should have had a fuse. We should have decided on the maximum number of accounts we expected to suspend in one go, and made the sync refuse loudly if that value was exceeded.
It could have been a simple if-statement:
accountsToSuspend := reconcileStaffGroups(hrState, googleState)
if len(accountsToSuspend) > 10 {
	panic("fused: tried to suspend more than 10 user accounts")
}
for _, account := range accountsToSuspend {
	// ...
}
But it would have completely prevented this incident.
Battleshorts
Ok, but what about when you really do need to exceed the value you set as your fuse?
Normally, 10 accounts might be unthinkable to suspend in one daily cronjob run, but one year the company runs an internship programme, and now you’ve got a whole cohort who need suspending on the same day.
You could always modify the code to install a larger fuse, but that’s slow and you’ll probably forget to restore the original value. Instead, you can add a “battleshort” mode.
Battleshort is a condition in which some military equipment can be placed so it does not shut down when circumstances would be damaging to the equipment or personnel. The origin of the term is to bridge or “short” the fuses of an electrical apparatus before entering combat, so that the fuse blowing will not stop the equipment from operating.
– https://en.wikipedia.org/wiki/Battleshort
This can be as simple as adding a boolean field in the request.
isBattleshort := req.FormValue("battleshorts") == "true"
accountsToSuspend := reconcileStaffGroups(hrState, googleState)
if !isBattleshort && len(accountsToSuspend) > 10 {
	panic("fused: tried to suspend more than 10 user accounts")
}
for _, account := range accountsToSuspend {
	// ...
}
Now the fuse is temporarily ignored and all the interns will be suspended correctly.
But, usually I’d include some extra guardrails too, like:
- Requiring the human to specify the exact number of affected entities in order to bypass the fuse.
- Requiring multi-party authorization so that two people need to approve this operation.
- Including a “reason” field which gets logged in the audit trail.
Just something with enough friction that battleshorts are only used in exceptional cases, and don’t become normalised.
Counter-arguments
“But I don’t know up-front how many accounts I’ll suspend!”
Yes, implementing fuses requires designing your system such that it knows the complete set of actions it will do, before it starts doing anything. It creates a plan first, and then executes it.
This is a good thing.
It might use less memory to iterate through the staff list and update/suspend each account in turn. But, by separating the planning stage from the execution you get multiple benefits:
- You can install a fuse!
- You can have a “dry run” mode which tells you the actions that would be performed without actually doing them.
- It’s way easier to unit test: you can assert correctness on the plan output without having to mock API calls.
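Here is a small sketch of what that separation might look like, with the fuse and the dry-run mode both living in the execution step. The `Action` type and the `buildPlan` and `execute` functions are hypothetical, and the `apply` callback stands in for the real API call, so the planning logic can be checked without mocking anything.

```go
package main

import "fmt"

// Action is one planned change. The type and its fields are hypothetical.
type Action struct {
	Kind  string // e.g. "suspend"
	Email string
}

// buildPlan decides what to do without calling any external API.
func buildPlan(activeInHR map[string]bool, googleAccounts []string) []Action {
	var actions []Action
	for _, email := range googleAccounts {
		if !activeInHR[email] {
			actions = append(actions, Action{Kind: "suspend", Email: email})
		}
	}
	return actions
}

// execute checks the fuse, then either prints the plan (dry run) or applies it.
func execute(actions []Action, fuse int, dryRun bool, apply func(Action) error) error {
	if len(actions) > fuse {
		return fmt.Errorf("fused: %d actions planned, limit is %d", len(actions), fuse)
	}
	for _, a := range actions {
		if dryRun {
			fmt.Printf("dry run: would %s %s\n", a.Kind, a.Email)
			continue
		}
		if err := apply(a); err != nil {
			return fmt.Errorf("%s %s: %w", a.Kind, a.Email, err)
		}
	}
	return nil
}

func main() {
	hr := map[string]bool{"ana@example.com": true}
	accounts := []string{"ana@example.com", "bob@example.com"}
	actions := buildPlan(hr, accounts)

	// Dry run: nothing is applied, so no callback is needed.
	if err := execute(actions, 10, true, nil); err != nil {
		fmt.Println(err)
	}

	// Real run with a stand-in for the API call.
	err := execute(actions, 10, false, func(a Action) error {
		fmt.Println("suspending", a.Email)
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
```

Because the fuse is checked before the loop starts, a plan that is too large is refused in its entirety, in a dry run as well as a real one.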
If you really, really can’t architect your system this way, at least have a rate limiter.
“Couldn’t we just use a rate limiter?”
There’s a maximum number of things we expect to happen within a certain timeframe. Sounds like a job for a rate limiter, right?
And, yes, a 10 accounts/day limit would have mitigated most of the impact in this case (just 10 people incorrectly suspended, rather than the entire company). But, it’s definitely an inferior solution:
- You now need a stateful component to keep track of this rate limit.
- Trying to suspend hundreds of accounts at one time is obviously incorrect. So why are you allowing the first 10 suspensions to go through anyway?
- If this happens over a holiday, the system might keep suspending 10 accounts/day until someone looks at it a week later.
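The holiday scenario is easy to put numbers on. The sketch below is a toy simulation, not a real rate limiter: the function names are hypothetical, and the figure of 500 wrongly flagged accounts is an assumed example rather than a number from the incident. It counts how many accounts end up suspended after a week of daily runs under each approach.

```go
package main

import "fmt"

// suspendedWithRateLimit returns how many accounts are suspended after the
// given number of daily runs when each run may suspend at most perDay accounts.
func suspendedWithRateLimit(pending, perDay, days int) int {
	total := 0
	for d := 0; d < days && pending > 0; d++ {
		n := perDay
		if pending < n {
			n = pending
		}
		total += n
		pending -= n
	}
	return total
}

// suspendedWithFuse returns how many accounts are suspended after the given
// number of daily runs when a run is refused outright if it exceeds the fuse.
func suspendedWithFuse(pending, fuse, days int) int {
	total := 0
	for d := 0; d < days && pending > 0; d++ {
		if pending > fuse {
			continue // fuse blown: nothing happens until a human intervenes
		}
		total += pending
		pending = 0
	}
	return total
}

func main() {
	// 500 accounts wrongly flagged, nobody looks at the system for 7 days.
	fmt.Println(suspendedWithRateLimit(500, 10, 7)) // prints 70
	fmt.Println(suspendedWithFuse(500, 10, 7))      // prints 0
}
```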
Overall, a fuse is simpler and safer.