Multiple runners
Running one runner gives you the airgap property. Running two or more gives you redundancy — and it's safe to do even if you've never thought about distributed consensus.
The dedup guarantee
Every job dispatched to a runner pool goes through a single atomic UPDATE. Only the gateway instance whose UPDATE returns a row wins; every other instance sees rowCount = 0 and skips. Each job is executed exactly once, no matter how many runners are connected.
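The claim semantics can be sketched as an in-memory model. This is illustrative only: the job shape, the tryClaim name, and the commented SQL (table jobs, columns status and claimed_at) are assumptions, not the shipped schema.

```typescript
type JobStatus = "pending" | "claimed";

interface Job {
  id: string;
  status: JobStatus;
  claimedAt?: number; // epoch ms, set when claimed
}

// Mimics the atomic claim:
//   UPDATE jobs SET status = 'claimed', claimed_at = NOW()
//   WHERE id = $1 AND status = 'pending' RETURNING *;
// Returns the row on success, or null when another gateway already
// won (the SQL equivalent of rowCount = 0).
function tryClaim(jobs: Map<string, Job>, jobId: string, now: number): Job | null {
  const job = jobs.get(jobId);
  if (!job || job.status !== "pending") return null;
  job.status = "claimed";
  job.claimedAt = now;
  return job;
}

// Two gateways race for the same job: exactly one wins.
const jobs = new Map<string, Job>([["j1", { id: "j1", status: "pending" }]]);
const winner = tryClaim(jobs, "j1", Date.now());
const loser = tryClaim(jobs, "j1", Date.now());
console.log(winner !== null, loser === null); // true true
```

In the real system the atomicity comes from the database, not application code; the model above only shows why a second claim attempt sees nothing to claim.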
The claim also gets a deadline: if a gateway crashes after claiming but before dispatching, a sweeper reverts rows matching status='claimed' AND claimed_at < NOW() - INTERVAL '30s' back to pending, so the job redispatches.
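A minimal sketch of that sweeper pass, again modeled in memory. The 30-second deadline matches the text; the row shape and the sweep function are illustrative assumptions.

```typescript
type Status = "pending" | "claimed";

interface Row {
  id: string;
  status: Status;
  claimedAt?: number; // epoch ms
}

const CLAIM_TTL_MS = 30_000;

// Mimics: UPDATE jobs SET status = 'pending', claimed_at = NULL
//         WHERE status = 'claimed' AND claimed_at < NOW() - INTERVAL '30s';
// Returns how many stale claims were reverted.
function sweep(rows: Row[], nowMs: number): number {
  let reverted = 0;
  for (const row of rows) {
    if (
      row.status === "claimed" &&
      row.claimedAt !== undefined &&
      row.claimedAt < nowMs - CLAIM_TTL_MS
    ) {
      row.status = "pending";
      row.claimedAt = undefined;
      reverted++;
    }
  }
  return reverted;
}

const now = Date.now();
const rows: Row[] = [
  { id: "stale", status: "claimed", claimedAt: now - 60_000 }, // gateway crashed
  { id: "fresh", status: "claimed", claimedAt: now - 5_000 },  // still dispatching
];
console.log(sweep(rows, now)); // 1 (only the stale claim reverts)
```

Because the revert is itself a single UPDATE guarded by the same status column, a redispatched job goes back through the claim path and keeps the exactly-once property.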
Running 100 runners is safe in the current release. Running them efficiently — least-loaded dispatch, capability-aware routing — is a future optimization. Today's picker is random-over-healthy, which is fine up to the point where a single-gateway bottleneck appears.
Graceful failover
If a runner drops its WebSocket mid-job:
- Gateway detects the socket close.
- The in-flight pending Promise is rejected with Runner <id> disconnected mid-job.
- The caller sees a 502 with runner_failed. Clients should retry; by the next attempt, the claim + dispatch lands on a surviving runner.
No in-flight state is held on the gateway beyond the Promise — the runner has to own its own connection pool to your DB. That's by design; each runner is a self-contained unit.
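The gateway-side bookkeeping described above can be sketched as one pending Promise per in-flight job, all rejected when the runner's socket closes. The RunnerConn class and dispatch method are illustrative names, not the real API; real code would also write a job frame to the WebSocket.

```typescript
type Pending = {
  resolve: (v: unknown) => void;
  reject: (e: Error) => void;
};

class RunnerConn {
  private pending = new Map<string, Pending>();

  constructor(public readonly id: string) {}

  // Register a pending Promise for a dispatched job; it settles when
  // the runner replies, or rejects if the socket dies first.
  dispatch(jobId: string): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.pending.set(jobId, { resolve, reject });
    });
  }

  // Called from the WebSocket 'close' handler: fail every in-flight
  // job so callers get runner_failed and can retry.
  onSocketClose(): void {
    for (const [jobId, p] of this.pending) {
      p.reject(new Error(`Runner ${this.id} disconnected mid-job`));
      this.pending.delete(jobId);
    }
  }
}

const conn = new RunnerConn("runner-1");
conn.dispatch("job-42").catch((e: Error) => console.log(e.message));
conn.onSocketClose(); // logs: Runner runner-1 disconnected mid-job
```

Since the Map is the only per-job state the gateway holds, dropping the connection drops everything; the retry path re-runs the claim UPDATE against a surviving runner.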
Horizontal scaling pattern
- Start with one runner. It's single-process, handles a surprising amount of throughput (each runner reuses DB connections via the bridge's pool).
- If you want no downtime during deploys, run two. Rolling restart one at a time; the other covers all jobs during the gap.
- If you want more throughput, run more. Dispatch picks a connected runner at random, so load naturally spreads.
- If one runner is consistently busier than the others, that usually means long-tail jobs are landing unevenly rather than a dispatch bug. Partitioning by source isn't supported yet; reach out if you hit this.
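The random-over-healthy picker mentioned above is simple enough to sketch. The Runner shape and the healthy flag are assumptions for illustration; the injectable rand parameter exists only to make the sketch testable.

```typescript
interface Runner {
  id: string;
  healthy: boolean; // connected and passing heartbeats
}

// Pick uniformly at random among healthy, connected runners.
// Returns null when no runner is available to serve the job.
function pickRunner(
  runners: Runner[],
  rand: () => number = Math.random
): Runner | null {
  const healthy = runners.filter((r) => r.healthy);
  if (healthy.length === 0) return null;
  return healthy[Math.floor(rand() * healthy.length)];
}

const pool: Runner[] = [
  { id: "a", healthy: true },
  { id: "b", healthy: false }, // mid-restart, excluded
  { id: "c", healthy: true },
];
console.log(pickRunner(pool)?.id); // "a" or "c", never "b"
```

Uniform random selection spreads load well when jobs are similar in cost; it is exactly the long-tail case above where it starts to show imbalance, which is why least-loaded dispatch is listed as a future optimization.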
Rolling restarts
The safe order:
- Drain (optional): the runner has no local queue, so there's nothing to drain. Jobs in flight complete; the container just exits.
- Stop the runner (docker stop / scale down one replica).
- The gateway closes its side of the socket; remaining runners serve all new jobs.
- Start the new runner (with the same rk_token and id).
- It handshakes, becomes live, and picks up work.