Multiple runners
Running one runner gives you the airgap property. Running two or more gives you redundancy — and it's safe to do even if you've never thought about distributed consensus.
The dedup guarantee
Every job dispatched to a runner pool goes through a single atomic UPDATE. Only the gateway instance whose UPDATE returns a row wins; every other instance sees rowCount = 0 and skips. Each job is executed exactly once, no matter how many runners are connected.
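The claim semantics can be sketched as an in-memory model. This is illustrative only: the job shape, the tryClaim name, and the commented SQL (table jobs, columns status and claimed_at) are assumptions, not the shipped schema.

```typescript
type JobStatus = "pending" | "claimed";

interface Job {
  id: string;
  status: JobStatus;
  claimedAt?: number; // epoch ms, set when claimed
}

// Mimics the atomic claim:
//   UPDATE jobs SET status = 'claimed', claimed_at = NOW()
//   WHERE id = $1 AND status = 'pending' RETURNING *;
// Returns the row on success, or null when another gateway already
// won (the SQL equivalent of rowCount = 0).
function tryClaim(jobs: Map<string, Job>, jobId: string, now: number): Job | null {
  const job = jobs.get(jobId);
  if (!job || job.status !== "pending") return null;
  job.status = "claimed";
  job.claimedAt = now;
  return job;
}

// Two gateways race for the same job: exactly one wins.
const jobs = new Map<string, Job>([["j1", { id: "j1", status: "pending" }]]);
const winner = tryClaim(jobs, "j1", Date.now());
const loser = tryClaim(jobs, "j1", Date.now());
console.log(winner !== null, loser === null); // true true
```

In the real system the atomicity comes from the database, not application code; the model above only shows why a second claim attempt sees nothing to claim.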
The claim also gets a deadline: if a gateway crashes after claiming but before dispatching, a sweeper reverts rows matching status='claimed' AND claimed_at < NOW() - INTERVAL '30s' back to pending, so the job redispatches.
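A minimal sketch of that sweeper pass, again modeled in memory. The 30-second deadline matches the text; the row shape and the sweep function are illustrative assumptions.

```typescript
type Status = "pending" | "claimed";

interface Row {
  id: string;
  status: Status;
  claimedAt?: number; // epoch ms
}

const CLAIM_TTL_MS = 30_000;

// Mimics: UPDATE jobs SET status = 'pending', claimed_at = NULL
//         WHERE status = 'claimed' AND claimed_at < NOW() - INTERVAL '30s';
// Returns how many stale claims were reverted.
function sweep(rows: Row[], nowMs: number): number {
  let reverted = 0;
  for (const row of rows) {
    if (
      row.status === "claimed" &&
      row.claimedAt !== undefined &&
      row.claimedAt < nowMs - CLAIM_TTL_MS
    ) {
      row.status = "pending";
      row.claimedAt = undefined;
      reverted++;
    }
  }
  return reverted;
}

const now = Date.now();
const rows: Row[] = [
  { id: "stale", status: "claimed", claimedAt: now - 60_000 }, // gateway crashed
  { id: "fresh", status: "claimed", claimedAt: now - 5_000 },  // still dispatching
];
console.log(sweep(rows, now)); // 1 (only the stale claim reverts)
```

Because the revert is itself a single UPDATE guarded by the same status column, a redispatched job goes back through the claim path and keeps the exactly-once property.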
Running 100 runners is safe in the current release. Running them efficiently — least-loaded dispatch, capability-aware routing — is a future optimization. Today's picker is random-over-healthy, which is fine up to the point where a single-gateway bottleneck appears.
Graceful failover
If a runner drops its WebSocket mid-job:
- Gateway detects the socket close.
- The in-flight pending Promise is rejected with Runner <id> disconnected mid-job.
- The caller sees a 502 with runner_failed. Clients should retry; by the next attempt, the claim + dispatch lands on a surviving runner.
No in-flight state is held on the gateway beyond the Promise — the runner has to own its own connection pool to your DB. That's by design; each runner is a self-contained unit.
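The gateway-side bookkeeping described above can be sketched as one pending Promise per in-flight job, all rejected when the runner's socket closes. The RunnerConn class and dispatch method are illustrative names, not the real API; real code would also write a job frame to the WebSocket.

```typescript
type Pending = {
  resolve: (v: unknown) => void;
  reject: (e: Error) => void;
};

class RunnerConn {
  private pending = new Map<string, Pending>();

  constructor(public readonly id: string) {}

  // Register a pending Promise for a dispatched job; it settles when
  // the runner replies, or rejects if the socket dies first.
  dispatch(jobId: string): Promise<unknown> {
    return new Promise((resolve, reject) => {
      this.pending.set(jobId, { resolve, reject });
    });
  }

  // Called from the WebSocket 'close' handler: fail every in-flight
  // job so callers get runner_failed and can retry.
  onSocketClose(): void {
    for (const [jobId, p] of this.pending) {
      p.reject(new Error(`Runner ${this.id} disconnected mid-job`));
      this.pending.delete(jobId);
    }
  }
}

const conn = new RunnerConn("runner-1");
conn.dispatch("job-42").catch((e: Error) => console.log(e.message));
conn.onSocketClose(); // logs: Runner runner-1 disconnected mid-job
```

Since the Map is the only per-job state the gateway holds, dropping the connection drops everything; the retry path re-runs the claim UPDATE against a surviving runner.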
Horizontal scaling pattern
- Start with one runner. It's single-process, handles a surprising amount of throughput (each runner reuses DB connections via the bridge's pool).
- If you want no downtime during deploys, run two. Rolling restart one at a time; the other covers all jobs during the gap.
- If you want more throughput, run more. Dispatch picks a connected runner at random, so load naturally spreads.
- If one runner is consistently busier than the others, that usually means long-tail jobs are landing unevenly rather than a dispatch bug. Partitioning by source isn't supported yet; reach out if you hit this.
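The random-over-healthy picker mentioned above is simple enough to sketch. The Runner shape and the healthy flag are assumptions for illustration; the injectable rand parameter exists only to make the sketch testable.

```typescript
interface Runner {
  id: string;
  healthy: boolean; // connected and passing heartbeats
}

// Pick uniformly at random among healthy, connected runners.
// Returns null when no runner is available to serve the job.
function pickRunner(
  runners: Runner[],
  rand: () => number = Math.random
): Runner | null {
  const healthy = runners.filter((r) => r.healthy);
  if (healthy.length === 0) return null;
  return healthy[Math.floor(rand() * healthy.length)];
}

const pool: Runner[] = [
  { id: "a", healthy: true },
  { id: "b", healthy: false }, // mid-restart, excluded
  { id: "c", healthy: true },
];
console.log(pickRunner(pool)?.id); // "a" or "c", never "b"
```

Uniform random selection spreads load well when jobs are similar in cost; it is exactly the long-tail case above where it starts to show imbalance, which is why least-loaded dispatch is listed as a future optimization.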
Rolling restarts
The safe order:
- Drain (optional): the runner has no local queue, so there's nothing to drain. Jobs in flight complete; the container just exits.
- Stop the runner (docker stop / scale down one replica).
- The gateway closes its side of the socket; remaining runners serve all new jobs.
- Start the new runner (with the same rk_token and id).
- It handshakes, becomes live, and picks up work.