
Go: Distributed Scheduler, Affinity, Scoring and Anti-Starvation

This week, I had to solve a scheduling problem. Jobs to distribute to workers. "Easy," I thought, "a queue, some consumers, a SELECT, and move on."

Except the workers aren't interchangeable. Some can only process PDFs, others only Excel, and a few are versatile. And there are users waiting for real-time responses while batch jobs pile up in the queue. And multiple scheduler instances run in parallel.

And then there's the straggler. A low-priority job on a rare resource, because there's only one compatible worker. Higher priority jobs cut in line, again and again. It's been waiting for three hours. It never started. And that's when a worker crashes.

The "easy" lasted about 15 minutes. After that, it was three classic scheduling pathologies, some math, and a few distributed concurrency pitfalls. Here's the result.

The problem: it's not a simple queue

Imagine a system where jobs of different types must be executed by specialized workers. Each worker can only process certain types of jobs:

[Figure: Schedule jobs problem]

Unlike a classic FIFO queue where first-come-first-served applies, here the scheduler must solve a constrained assignment problem:

  • Job/worker affinity: a type X job can only run on a worker that supports X

  • Limited resources: fixed number of execution slots

  • Distribution: multiple scheduler instances run in parallel

  • Two consumption modes: queued jobs (the scheduler chooses when to execute them) and on-demand requests (a user is actively waiting for a response)

It's the combination of these constraints that makes the problem interesting. Taken individually, they're simple. Together, they create three pathologies that must be handled simultaneously.

Three pathologies to understand

1. Starvation

A job whose type is supported by few workers risks waiting indefinitely if "popular" jobs keep arriving:

[Figure: Schedule jobs starvation]

This is a classic scheduling problem. If you always choose the job with the highest static priority, a low-priority job on a rare worker can wait forever. In scheduling theory, this is the starvation problem: the lack of a progress guarantee.

2. Wasting versatility

A versatile worker (supporting multiple types) shouldn't take a job that a specialized worker could handle. If the versatile worker takes a type A job while a specialized type A worker is free, the versatile worker is wasted. When a type C job arrives (that only the versatile worker can handle), no one can take it.

Bad choice:
  Versatile worker [A,B,C] takes a type A job
  → Specialized worker [A] stays idle
  → Type C job arrives, no one can handle it

Good choice:
  Specialized worker [A] takes the type A job
  → Versatile worker [A,B,C] remains available for B or C

It's like sending the cardiac surgeon to apply a bandage when a nurse is free. The surgeon can technically do it, but it's a waste of rare expertise.

3. User urgency vs fairness

A user making an on-demand request is actively waiting. They should jump ahead of queued jobs, but not absolutely: otherwise a continuous stream of on-demand requests would starve the queued jobs, and we'd fall back to problem #1.

Architecture: the Actor Model

Before tackling scoring, a word on synchronization architecture. The scheduler maintains shared mutable state: the pool of free slots, indexes by type, counters. How do we synchronize all this?

The classic answer would be mutexes. But anyone who's spent nights tracking down a deadlock knows that mutexes are an inexhaustible source of subtle bugs. Here, the choice is an Actor Model: a single goroutine owns all mutable state, and other components communicate with it via channels.

The single goroutine listens on multiple channels: worker registration, unregistration (stop or crash), slot release, on-demand request, new job notification from the database, and periodic heartbeat. Everything goes through a Go select.

Zero mutexes in our code. Channels are packed with them internally, as anyone who's read the Go runtime knows. But the difference is in the guarantees: with explicit mutexes, it's the developer who must lock, unlock, and not mess up the order. The compiler checks nothing. With channels and a select, the compiler constrains usage. Mutable state is confined to a single goroutine, messages arrive sequentially, and a whole class of bugs (double-lock, priority inversion, forgotten lock) disappears by construction.

The advantage is that all reasoning about state is sequential within that goroutine. No race condition possible on internal structures. Concurrency complexity is confined to channels, which are thread-safe by construction.
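
To make this concrete, here's a minimal sketch of what the actor loop could look like. The type and channel names are illustrative, not the actual implementation:

package scheduler

import (
	"context"
	"time"
)

// Illustrative message types; the real ones carry worker metadata and job payloads.
type workerInfo struct {
	id    int
	types []int // job types this worker's slots can execute
}

type jobRequest struct {
	jobID int
}

type scheduler struct {
	registerCh   chan workerInfo // a worker announces itself and its slots
	unregisterCh chan int        // worker stop or crash (worker ID)
	releaseCh    chan int        // a slot finished its job (slot ID)
	onDemandCh   chan jobRequest // a user is actively waiting for a response
	notifyCh     chan struct{}   // "new job in the database" hint
}

// run is the single goroutine that owns all mutable state.
func (s *scheduler) run(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second) // periodic heartbeat
	defer ticker.Stop()
	for {
		select {
		case <-s.registerCh:
			// add the worker's slots to the pool and per-type index
		case <-s.unregisterCh:
			// remove all of the worker's slots, free or occupied
		case <-s.releaseCh:
			// mark the slot free again and update the index
		case <-s.onDemandCh:
			// add the request to the in-memory on-demand candidates
		case <-s.notifyCh:
			// a new queued job exists: run a scheduling pass
		case <-ticker.C:
			// periodic pass, also catches missed notifications
		case <-ctx.Done():
			return
		}
	}
}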

The slot pool: inverted index by type

The core data structure is the slot pool. The fundamental question the scheduler must constantly answer is: "for a given job type, which slots are available?" If we scan all slots for each decision, we get O(s) complexity where s is the total number of slots. That's acceptable for 10 slots, much less so for 1000.

The solution is an inverted index by type. Each job type is identified by an integer, and each slot likewise. As shown in the article on Go maps, maps are very performant in Go, especially with integer keys. That's exactly our case.

We maintain three maps: one for all slots indexed by their ID, one for free slots indexed by job type, and one for the capacity counter per type.
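
In Go, a sketch of that pool could look like this (field names are illustrative):

// The pool is only ever touched by the scheduler goroutine, so no locking.
type slot struct {
	id     int
	worker int   // worker this slot belongs to
	types  []int // job types this slot can execute
}

type slotPool struct {
	slots    map[int]*slot         // all slots, keyed by slot ID
	free     map[int]map[int]*slot // free slots, indexed by job type then slot ID
	capacity map[int]int           // total slots registered per job type
}

// hasFree answers "is there a free slot for type X?" in O(1).
func (p *slotPool) hasFree(jobType int) bool {
	return len(p.free[jobType]) > 0
}

// release puts a slot back into the per-type index; O(1) per supported type.
func (p *slotPool) release(slotID int) {
	s, ok := p.slots[slotID]
	if !ok {
		return // slot already removed (worker crash): releasing it is a no-op
	}
	for _, t := range s.types {
		if p.free[t] == nil {
			p.free[t] = make(map[int]*slot)
		}
		p.free[t][slotID] = s
	}
}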

It's the same principle as a database index. Instead of scanning the table, we maintain an auxiliary structure that answers in O(1) the question "are there slots for type X?" and in O(n) "give me the best slot for type X" where n is the number of compatible slots (not the total count).

The cost is index maintenance: at each scheduling loop iteration, we modify the index (slot acquisition, release, counter updates). But thanks to maps, adding, removing and updating the accounting is O(1). The overhead is negligible compared to the gain on lookups.

The scoring system

Let's get to the heart of the matter: how do we choose the next job to execute among all eligible candidates?

Where do candidates come from?

Candidate jobs come from two sources. On-demand requests are sent directly to the scheduler via a channel and are already in local memory. Queued jobs are fetched from the database by a query that only returns jobs whose type matches a free slot. We limit ourselves to a few results: the number of jobs requested from the database equals the number of available slots × 2. This gives margin for failed claims (we'll get to that) without needlessly loading jobs we can't execute. Scoring then applies to this combined list.
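
As an illustration, and assuming a PostgreSQL jobs table with type, priority and created_at columns (the column names are guesses; only id, status and worker_id appear in the claim query later), the fetch could look like this:

import (
	"context"
	"database/sql"
	"time"

	"github.com/lib/pq" // array support for the ANY($1) parameter
)

// candidate is the in-memory form shared by queued jobs and on-demand requests.
type candidate struct {
	id        int64
	jobType   int64
	priority  int // business priority, 0 to 10
	createdAt time.Time
	onDemand  bool
}

// fetchQueuedCandidates loads pending jobs whose type matches a free slot.
// limit is the number of free slots × 2, leaving margin for failed claims.
func fetchQueuedCandidates(ctx context.Context, db *sql.DB, freeTypes []int64, limit int) ([]candidate, error) {
	const q = `
		SELECT id, type, priority, created_at
		FROM jobs
		WHERE status = 'pending' AND type = ANY($1)
		ORDER BY created_at
		LIMIT $2`
	rows, err := db.QueryContext(ctx, q, pq.Array(freeTypes), limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []candidate
	for rows.Next() {
		var c candidate // onDemand stays false: on-demand requests arrive via a channel
		if err := rows.Scan(&c.id, &c.jobType, &c.priority, &c.createdAt); err != nil {
			return nil, err
		}
		out = append(out, c)
	}
	return out, rows.Err()
}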

To make a distribution decision, the scheduler therefore has:

  • The list of eligible waiting jobs: for each, its type, its business priority (an integer from 0 to 10), its creation date, and whether it's in queue or on-demand mode

  • The list of free slots: for each, the job types it supports

From this data, we calculate a score for each job. The job with the highest score will be executed first.

The formula

For each candidate job, we calculate a score:

\begin{aligned} \text{score}(job) =\ & \text{base\_priority} \times 1024 \\ +\ & \text{age} \times 16 \\ +\ & \text{rarity\_bonus} \\ +\ & \text{on\_demand\_bonus} \end{aligned}

Where:

  • \text{base\_priority}: business priority of the job, integer from 0 to 10 (10 being highest priority)

  • \text{age}: job wait time in seconds

  • \text{rarity\_bonus}: bonus inversely proportional to the number of compatible slots

  • \text{on\_demand\_bonus}: bonus applied only to on-demand mode jobs

This is a weighted linear combination of four factors: each captures a different aspect of the job's urgency. The constants are chosen as powers of 2 so the compiler optimizes them to bit shifts. Let's detail them.

Component 1: Base priority (0-10k points)

This is the business priority assigned to the job. Simple multiplication:

\text{score} \mathrel{+}= \text{priority} \times 1024

The factor 1024 serves to give room to other components. A priority 5 job starts with 5120 points, a priority 0 job with 0 points. This gap must be bridged by other components for a low-priority job to pass a high-priority one. This is intentional: business priority should weigh heavily.

Component 2: Anti-starvation through aging (16 points/second)

This component guarantees that no job waits indefinitely. A job's score grows linearly with its wait time:

\text{score} \mathrel{+}= \text{age} \times 16

Fundamental property: two lines with identical slopes but different y-intercepts never cross. All the "age" components share the same slope; only the starting point (base priority) differs. So between two jobs submitted at the same instant, the priority order never flips. What aging guarantees is something else: a job submitted earlier has a head start in age, so a low-priority job that keeps waiting will eventually outscore any freshly arrived high-priority job. (The on-demand bonus is the only component with a different slope; we'll get to it.)

We can calculate exactly when a low-priority job will catch up to a high-priority job. The initial score gap between two jobs is their priority difference × 1024. Aging bridges this gap at a rate of 16 points per second:

\text{Crossover time} = \frac{\text{priority gap} \times 1024}{16} = \text{priority gap} \times 64 \text{ seconds}

Concretely: a priority 0 job will catch up to a priority 5 job after 320 seconds, or about 5 minutes. This is a business parameter: if 5 minutes is too long, increase the aging coefficient. If too short, decrease it. The ratio \frac{1024}{16} = 64 seconds per priority unit is the tuning knob.

Component 3: Rarity bonus

A job with few compatible slots receives a bonus. This is a hyperbolic (inverse) function:

\text{rarity\_bonus} = \frac{500}{\text{compatible\_slots}}

Where:

  • \text{compatible\_slots}: number of free slots that can execute this job

  Compatible slots   Bonus   Decrease
  1                  500     n/a
  2                  250     -50%
  4                  125     -75%
  8                  62      -88%

This is a diminishing marginal utility function: the first available option has enormous value, subsequent ones less and less. A job with only one compatible slot must seize it. A job with 10 options can afford to wait.

The choice of a hyperbola isn't arbitrary. We could have used a decaying exponential or a logarithmic function. The hyperbola has the advantage of being simple to compute (a single division), having a very strong effect for n=1 (500 points, about half a base priority unit), and converging quickly to zero. Here, no power-of-2 optimization is possible: the divisor would need to be 2^x for the compiler to replace it with a shift, and the divisor is a runtime variable.

In practice: for n \geq 4, the bonus becomes negligible (125 points = 8 seconds of aging). Rarity only plays a significant role for truly constrained jobs.

Component 4: On-demand bonus

A user is actively waiting → high priority, but not absolute:

\text{score} \mathrel{+}= 4096 + \text{age} \times 32

The initial 4096 point bonus equals 4 base priority units. An on-demand job jumps ahead of almost everyone immediately. But a high-priority queued job that's been waiting long can still pass ahead.

Accelerated aging (total slope of 48/s: 16 base + 32 bonus) ensures that waiting on-demand jobs are handled increasingly quickly compared to queued jobs.
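
Reusing the candidate struct from the fetch sketch above, the whole formula fits in a few lines (a sketch with the coefficients described, not the actual code):

// score computes a candidate's priority at time now.
// compatibleSlots is the number of free slots able to execute this job.
func score(c candidate, compatibleSlots int, now time.Time) int64 {
	age := int64(now.Sub(c.createdAt) / time.Second) // wait time in seconds

	s := int64(c.priority) * 1024 // component 1: business priority
	s += age * 16                 // component 2: anti-starvation aging

	if compatibleSlots > 0 { // component 3: hyperbolic rarity bonus
		s += 500 / int64(compatibleSlots)
	}

	if c.onDemand { // component 4: a user is actively waiting
		s += 4096 + age*32
	}
	return s
}

Candidates are then sorted by decreasing score, and the claim loop (further down) walks that order.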

Visualization: score lines

[Figure: Score crossover graph]

We can read directly from this graph that the two lines cross at 256 seconds, or about 4 minutes. A priority 0 queued job waiting for more than 4 minutes will pass ahead of a freshly arrived on-demand job. This is intentional: patience is eventually rewarded, even against urgency.

Slot selection: prefer specialists

Once the best job is chosen by scoring, we still need to decide which slot to execute it on. This is the slot/job matching problem, and it's more subtle than it appears.

The goal is to preserve versatile slots for rare jobs. The algorithm is simple: among slots compatible with the job type, we iterate through the list and keep the one that supports the fewest different types.

Concrete example:

"PDF" type job arrives
Available slots:
  - Slot A: [PDF]               ← 1 type  → CHOSEN
  - Slot B: [PDF, Excel]        ← 2 types
  - Slot C: [PDF, Excel, Index] ← 3 types

Result: Slot A is used, B and C remain available for Excel/Index

This is a greedy heuristic. It's not globally optimal (we could imagine cases where keeping the specialist free for a future job would be better), but it's O(n) and gives good results in practice. Global optimality would require predicting future jobs, which is a much more complex problem.
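
On top of the slot pool sketched earlier, that greedy selection is a short loop:

// pickSlot returns the free slot compatible with jobType that supports the
// fewest types, preserving versatile slots for rarer jobs. O(n) in the number
// of compatible free slots.
func (p *slotPool) pickSlot(jobType int) *slot {
	var best *slot
	for _, s := range p.free[jobType] {
		if best == nil || len(s.types) < len(best.types) {
			best = s
		}
	}
	return best
}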

Distributed concurrency: the double-claim

Multiple scheduler instances can run in parallel. Two might want the same job. This is the classic double-claim problem.

[Figure: Double-claim]

Solution: optimistic claim

The solution is optimistic claiming. Rather than acquiring a lock before reading candidate jobs (expensive and a source of contention), we let each instance work independently and handle conflicts at claim time:

UPDATE jobs
SET status = 'running', worker_id = $1
WHERE id = $2 AND status = 'pending'
RETURNING id

[Figure: Double-claim solution]

The WHERE status = 'pending' clause is key. An UPDATE in SQL is an atomic operation: if two schedulers try to claim the same job, the first changes the status to 'running' and the second's WHERE status = 'pending' condition no longer matches. The loser sees 0 rows affected. No error, no deadlock, just an empty result.

Retry algorithm

When a claim fails, we move to the next candidate in the score-sorted list. If more than 5 claims fail in a row (adjust this value based on observed contention), our list is stale: we refetch candidates from the database and start over.
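
A sketch of that loop, reusing the claim query above and the threshold of 5 (besides the earlier imports it needs the standard errors package; the exact type of worker_id is a guess):

// claimNext walks the score-sorted candidates and tries to claim each one.
// The second return value is true when the local list is stale and should be refetched.
func claimNext(ctx context.Context, db *sql.DB, workerID int64, sorted []candidate) (*candidate, bool, error) {
	failures := 0
	for i := range sorted {
		var id int64
		err := db.QueryRowContext(ctx,
			`UPDATE jobs
			 SET status = 'running', worker_id = $1
			 WHERE id = $2 AND status = 'pending'
			 RETURNING id`,
			workerID, sorted[i].id).Scan(&id)
		switch {
		case err == nil:
			return &sorted[i], false, nil // claim won
		case errors.Is(err, sql.ErrNoRows):
			failures++ // another scheduler instance got there first
			if failures > 5 {
				return nil, true, nil // local list is stale: refetch and start over
			}
		default:
			return nil, false, err
		}
	}
	return nil, true, nil // list exhausted: refetch
}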

Failure handling: worker crash

A worker can crash at any time. We must handle two cases: slots that were free (easy, we remove them) and slots that were executing a job (more delicate).

Each slot communicates with its worker through a file descriptor. Between the crash and its detection by the scheduler, a now-invalid FD might still be used. It's not the scheduler that protects against this, it's the kernel: writing to a dead FD returns EPIPE or EBADF, and the job fails cleanly.

Accepted edge case: if two workers crash and restart in reverse order, the kernel can redistribute file descriptors in a crossed manner. The scheduler, not yet notified of the crashes, then sends a job to a worker that potentially can't handle it. Three possible outcomes: the worker can handle it and that's great, it can't and the job fails, or worse it handles it incorrectly. This case is accepted: the probability is low and the cost of protection (synchronous handshake on each write) doesn't justify the complexity.

When the scheduler detects the crash, it does an unregister: we remove all the worker's slots from the index, whether free or occupied.

For occupied slots, the running job will try to write to an invalid file descriptor and get an EPIPE (broken pipe), or read and get 0. In both cases, the job will be marked as failed, and the release will find that the slot no longer exists and be silently ignored. The slot's absence from the main map serves as a safeguard: any operation on a non-existent slot is a no-op.
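
The unregister itself, still on the pool sketched earlier (which records the owning worker on each slot):

// removeWorker drops every slot owned by the crashed worker, free or busy.
func (p *slotPool) removeWorker(workerID int) {
	for id, s := range p.slots {
		if s.worker != workerID {
			continue
		}
		delete(p.slots, id)
		for _, t := range s.types {
			delete(p.free[t], id) // no-op if the slot was busy (not in the free index)
			p.capacity[t]--
		}
	}
}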

When the worker restarts, it registers with new slots and new file descriptors. The state is clean.

Algorithmic complexity

Let's summarize the complexities:

  Operation                  Complexity
  Check capacity by type     O(1) - lookup in capacity
  Acquire a slot             O(s) - s = slots compatible with the type
  Calculate a job's score    O(1) - a few arithmetic operations
  Sort candidates            O(n \log n) - standard Go sort, n = eligible jobs

The total complexity of a scheduling cycle is dominated by the sort: O(n \log n). In practice, n is bounded by the number of available slots × 2 (the fetch limit). For reasonable values (a few dozen jobs), it's negligible. Some perspective: during a single network round-trip to the database, we'd have time to sort a million entries. Still, it's always good to save CPU, especially when it costs nothing in readability.

Summary of guarantees

The system combines six mechanisms that each address a specific pathology:

  • Inverted index by type: O(1) access to compatible slots, no linear scan

  • Multi-criteria scoring: weighted linear combination, tunable via coefficients

  • Linear aging: mathematical guarantee that no job waits indefinitely (anti-starvation)

  • Hyperbolic rarity bonus: constrained jobs aren't disadvantaged

  • Selection of the most specialized slot: preserving versatility

  • Optimistic SQL claim: distributed concurrency without distributed locking

All running in a single-writer Actor Model that eliminates race conditions on internal state.

Is it optimal? No. An omniscient scheduler that knew future jobs could do better. But between theoretical optimality and what's practical in production, there's a chasm that well-calibrated heuristics bridge quite adequately.