It is 6 AM on a Tuesday. A nurse manager has the next two weeks of her unit’s schedule open in a spreadsheet, the same one she has used for nine years. One of her travelers picked up a last-minute overnight that ends at 7 AM today, which means his Tuesday day shift would put him over a hard daily-hours cap. She knows this because she has the cap memorized, not because the spreadsheet does. She swaps him onto Wednesday. That cascades: a per diem now has a hole on Wednesday, so she pulls a name off a different unit, who happens to live forty-five minutes away and prefers not to drive in for a single shift. She makes a note to call him at a reasonable hour and ask anyway.
This is the work the load balancer is replacing. Not the strategic part---the manager still owns the unit. The mechanical part: the eight hundred small constraint checks and preference trades she runs in her head between 6 AM and 7 AM every Tuesday, often before she has finished her coffee.
It looks like a clean optimization problem. It isn’t.
Why scheduling resists a clean solver
The instinct, when you describe scheduling to an engineer, is to reach for a solver. Encode the constraints, encode the objective, hand it to a constraint-optimization engine, take the answer. Such engines do exist, and we use one. But the solver is the smallest part of the system, and treating it as the whole system is how scheduling tools fail.
Three things make this harder than the textbook version.
Constraints are not all the same kind. Some must hold or the schedule is invalid: a worker cannot exceed a daily-hours cap, cannot be scheduled without the credentials the unit requires, cannot be in two places at once. Others are preferences: the unit they like working on, the days of the week that fit their life, the distance they are willing to drive. A solver that treats both as costs to minimize will produce schedules that quietly violate the hard ones in service of the soft ones. A solver that treats both as inviolable will refuse to produce a schedule at all.
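To make the distinction concrete, here is a minimal sketch of the two kinds of constraint. The names and the specific rules are invented for illustration, not our production model; the point is that a hard violation short-circuits to "infeasible" while soft constraints only contribute cost.

```ruby
# Hard constraints gate feasibility; soft constraints only add to a cost
# the solver minimizes. All names and rules here are illustrative.

HardConstraint = Struct.new(:name, :check) do
  def satisfied?(assignment)
    check.call(assignment)
  end
end

SoftConstraint = Struct.new(:name, :penalty) do
  def cost(assignment)
    penalty.call(assignment)
  end
end

HARD = [
  HardConstraint.new(:max_daily_hours, ->(a) { a[:hours_today] <= a[:daily_cap] }),
  HardConstraint.new(:credentialed,    ->(a) { a[:credentials].include?(a[:required_credential]) })
]

SOFT = [
  SoftConstraint.new(:off_preferred_unit, ->(a) { a[:unit] == a[:preferred_unit] ? 0 : 10 }),
  SoftConstraint.new(:long_commute,       ->(a) { a[:commute_minutes] > 30 ? 5 : 0 })
]

def score(assignment)
  # Any hard violation means "not a schedule", not "an expensive schedule".
  return :infeasible unless HARD.all? { |c| c.satisfied?(assignment) }

  SOFT.sum { |c| c.cost(assignment) }
end
```

An assignment over the daily cap scores `:infeasible` no matter how well it does on preferences; an in-cap assignment on the wrong unit with a long commute just scores worse than one on the preferred unit.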
The output has to be defensible to a human. A nurse manager who has been running her unit for fifteen years is not going to accept a schedule that says “trust me.” She needs to see what changed, who moved, what trade got made, and why. If the system cannot present the answer in the same vocabulary she uses on a Tuesday morning, the answer doesn’t matter.
Inputs are messy. Hospital units don’t share a clean roster of workers with us. Different facilities run different HR systems with different employee IDs, different credential records, different definitions of which units a worker is allowed on. Before we can optimize over a workforce, we have to know who the workforce is---and that turns out to be most of the engineering.
Hard versus soft, and what it took to ship one
The first live hard constraint in the load balancer is MaxDailyHours, which we shipped earlier this month. From a distance this looks trivial: don’t schedule a worker for more than N hours in a day. In practice it’s the constraint we spent the most time on, because shipping it safely meant getting the whole hard/soft distinction right inside the model.
A few things had to be true at once. The cap had to be expressible per worker, not just globally: contract terms differ, role-specific caps differ, some facilities have union rules that override the default. The cap had to apply across shifts that may have been created by different paths into the system, including shifts that overlap into the next calendar day. The solver had to treat a MaxDailyHours violation as infeasible, not as a high cost it could pay down; we don’t want a $50 incentive turning a cap breach into the “optimal” assignment.
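The cross-midnight case is the one that bites the manager in the opening anecdote: an overnight ending at 7 AM contributes hours to the same calendar day as the day shift that follows it. A rough sketch of that accounting, with invented field names and none of the per-worker cap lookup:

```ruby
require "date"
require "time"

# Illustrative only: how a shift's hours get attributed to a calendar day,
# so that an overnight ending at 7 AM and a day shift starting at 7 AM
# both count against the same day's cap.

Shift = Struct.new(:starts_at, :ends_at)

# Hours a shift contributes to a given calendar date (shifts may cross midnight).
def hours_on(shift, date)
  day_start = Time.new(date.year, date.month, date.day)
  day_end   = day_start + 24 * 3600
  overlap_start = [shift.starts_at, day_start].max
  overlap_end   = [shift.ends_at, day_end].min
  [(overlap_end - overlap_start) / 3600.0, 0].max
end

def max_daily_hours_ok?(shifts, date, cap_hours)
  shifts.sum { |s| hours_on(s, date) } <= cap_hours
end
```

An 11 PM to 7 AM overnight contributes seven hours to the second day; stack a twelve-hour day shift on top and a sixteen-hour cap is breached, which is exactly the swap the manager caught by memory.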
We also wanted soft preferences to keep their teeth. The temptation when you ship a hard constraint is to start moving everything into the hard tier so it “always works.” That collapses the model. Preferred unit, day-of-week pattern, distance from home: real preferences, weighed by the solver, but they bend in service of the schedule existing at all. Hard constraints don’t bend.
The pattern is the part we want to keep. Each new hard constraint goes through the same shape: define it as a property of the worker or the assignment, prove it can be evaluated cheaply in the solver’s inner loop, prove it can be displayed to a manager in plain language when it bites, and ship it behind a flag. MaxDailyHours is the first; it’s the template.
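The shape each hard constraint follows can be sketched as a small class. The class name, flag wiring, and message wording below are invented; what matters is the template: cheap to evaluate in the inner loop, explainable in plain language, and inert when its flag is off.

```ruby
# Sketch of the template a hard constraint follows; names are illustrative.
class MaxDailyHoursConstraint
  def initialize(cap_hours:, flag_enabled: true)
    @cap = cap_hours
    @enabled = flag_enabled
  end

  # Shipped behind a flag: a disabled constraint never bites.
  def enabled?
    @enabled
  end

  # Cheap enough for the solver's inner loop: one comparison per candidate.
  def violated?(worker_hours_today)
    enabled? && worker_hours_today > @cap
  end

  # Plain language a manager can read when the constraint fires.
  def explain(worker_name, worker_hours_today)
    "#{worker_name} would be at #{worker_hours_today}h, over the #{@cap}h daily cap."
  end
end
```

The per-worker part of the story lives in construction: each worker gets a constraint instance built from their contract terms, role, and facility rules rather than a global default.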
The run pipeline
The solver is a function. The load balancer is the system around it. The system is what makes the function safe to run against a real unit, on a real morning, with a manager waiting on the output.
A single run is a sequence of phases. RunCoordinator is the thing that owns them. It picks up a request to balance a unit (or a group of units, or a whole facility for a given window), figures out which phases need to run, and walks them in order. Each phase can fail; the coordinator decides whether failure is fatal, retryable, or downgradable to a soft warning the manager sees alongside the output.
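The coordinator's control flow is roughly this shape. The outcome vocabulary (`:ok`, `:warning`, `:fatal`, `:retryable`) and the retry policy are assumptions for the sketch, not the real RunCoordinator API:

```ruby
# Illustrative phase-walking loop; outcome symbols and retry policy are
# invented for the sketch, not the production RunCoordinator.
class RunCoordinator
  Result = Struct.new(:status, :warnings)

  def initialize(phases, max_retries: 2)
    @phases = phases          # ordered hash of phase name => callable
    @max_retries = max_retries
  end

  def run
    warnings = []
    @phases.each do |name, phase|
      attempts = 0
      loop do
        case phase.call
        when :ok      then break
        when :warning then warnings << name; break
        when :fatal   then return Result.new(:failed, warnings)
        when :retryable
          attempts += 1
          return Result.new(:failed, warnings) if attempts > @max_retries
        else
          raise ArgumentError, "unknown phase outcome from #{name}"
        end
      end
    end
    Result.new(:succeeded, warnings)
  end
end
```

A transiently failing phase gets retried a bounded number of times; a phase that downgrades to a warning still lets the run succeed, with the warning carried alongside the output for the manager to see.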
Two of those phases are the workhorses on the back end. BalancerPipelineJob handles the solver hand-off itself---marshaling the inputs into the shape the solver expects, running it, catching the output, validating that it satisfies all hard constraints (yes, again, after the solver claimed to). BalancerRunCaptureJob persists the result: the assignments, the KPIs, the inputs that produced them, the version of the constraint model that was active when the run fired. The capture step is unglamorous and load-bearing. Every run leaves a record. Six months from now, when a manager wants to understand why a worker was assigned to a particular shift on a particular morning, the answer is in the captured run.
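What a captured run might look like, field-wise: the record ties the outputs to the exact inputs and constraint-model version that produced them, so the six-months-later question has an answer. Every field name here is invented; only the shape is the point.

```ruby
require "json"
require "digest"
require "time"

# Sketch of a persisted run record (field names invented). Hashing the
# inputs makes it cheap to later prove which inputs produced which output.
def capture_run(run_id:, inputs:, assignments:, kpis:, constraint_model_version:)
  {
    run_id: run_id,
    captured_at: Time.now.utc.iso8601,
    constraint_model_version: constraint_model_version,
    inputs_digest: Digest::SHA256.hexdigest(JSON.generate(inputs)),
    inputs: inputs,
    assignments: assignments,
    kpis: kpis
  }
end
```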
Sitting above all of this is Active Admin: every run is visible, every input is inspectable, every phase’s status is on the page. Ops can re-run a phase, force a re-validate on a worker (more on that below), open a snapshot. We didn’t build a custom internal dashboard for this because we didn’t need to. The Rails admin we already have is the right shape: list view, detail view, action button. Boring on purpose.
You can’t optimize over workers you don’t have
The phase that runs first, before the solver fires, is worker sync. This is the part that surprised us in scope.
The shape of the problem: a hospital unit has a roster of nurses. Those nurses exist in the hospital’s HR system (UKG, in the case of the facilities driving this work), with employee IDs we don’t generate and don’t control. They also need to exist in our system, with our IDs, our credential records, and our preference history, so that the solver has anything to optimize over. The bridge between the two is the worker sync pipeline.
It was supposed to take a few weeks. It took several months, in six phases, because the messiness was real.
Some workers had placeholder emails on the hospital side, because their onboarding was incomplete. Some had ID collisions between facilities: the same nurse, working at two hospitals owned by the same system, with two different IDs and slightly different demographic records. Some had credentials in our system that no longer existed in the hospital’s, because the hospital had stopped requiring a certification we still tracked. The validation rules accumulated. The WorkerCreatorService we shipped to handle the create-or-match decision now consults a chain of validators before it commits, and most of the volume in that chain is rules we didn’t write until we saw the failure.
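The chain itself is mechanically simple; the value is in the accumulated rules. A sketch with a three-rule sample modeled on the failures above (the real chain is larger, and these exact rules are invented for illustration):

```ruby
# Sketch of a validator chain ahead of a create-or-match commit.
# Each validator returns an error string or nil; any error blocks the commit.
VALIDATORS = [
  # Placeholder emails from incomplete onboarding shouldn't create accounts.
  ->(rec) { rec[:email].to_s.match?(/placeholder|noreply/i) ? "placeholder email" : nil },
  # Same external ID, different demographics across facilities: suspected collision.
  ->(rec) { rec[:name_mismatch] ? "suspected ID collision across facilities" : nil },
  # Credentials we track that the hospital no longer requires are flagged stale.
  ->(rec) {
    stale = rec[:credentials].to_a - rec[:hospital_credentials].to_a
    stale.any? ? "stale credential: #{stale.join(', ')}" : nil
  }
]

def validate_worker(record)
  errors = VALIDATORS.filter_map { |v| v.call(record) }
  errors.empty? ? [:ok, []] : [:rejected, errors]
end
```

The Active Admin re-validate action described below is, in this sketch, just `validate_worker` run against one record on demand instead of against a whole sync batch.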
The Active Admin re-validate button is the operational pressure-release valve. When a sync record looks wrong (wrong unit, missing credential, suspected duplicate) an ops engineer can force a single worker through the validation chain again without running a full sync. It saved the rollout from a class of failure where one bad worker record blocked an entire unit’s schedule.
The lesson, restated, is the one the data-engineering side of the house has been making for years: the optimizer is the easy part. The substrate underneath is most of the engineering. You cannot run a constraint solver against a workforce you cannot describe, and describing the workforce of a multi-hospital health system is, itself, the project.
What the manager sees
The output of a balancer run is a schedule. The presentation of the output is a snapshot, paired with a KPI overlay.
A snapshot is the unit’s schedule at a point in time, captured. When the balancer produces a new schedule, the manager sees the old one and the new one side by side, with the differences highlighted: who moved, where, why. The KPI overlay sits on top: FTE utilization across the affected workers, total hours assigned, count of off-preference assignments, count of open shifts before and after.
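Computing the overlay is a diff over a small set of KPI functions. Here is a sketch covering three of the four KPIs named above (field names invented; FTE utilization omitted to keep it short):

```ruby
# Sketch of the KPI overlay: the same metrics computed over the old and
# new snapshot, presented side by side. Assignment fields are invented.
def kpis(schedule)
  {
    total_hours:    schedule.sum { |a| a[:hours] },
    off_preference: schedule.count { |a| a[:worker] && a[:unit] != a[:preferred_unit] },
    open_shifts:    schedule.count { |a| a[:worker].nil? }
  }
end

def overlay(old_schedule, new_schedule)
  before = kpis(old_schedule)
  after  = kpis(new_schedule)
  before.keys.to_h { |k| [k, { before: before[k], after: after[k] }] }
end
```

The "eight off-preference assignments instead of three" claim in the next paragraph is just `overlay(...)[:off_preference]` rendered for the manager.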
The overlay matters because it gives the manager something to argue with. “The new schedule has eight off-preference assignments instead of three” is a real claim, not a black-box judgment. The manager can decide: that’s the trade I’m willing to make, or it isn’t, in which case she can pin a constraint, re-run, and look again. The system isn’t trying to make the decision for her. It’s trying to do the eight hundred mechanical checks so she can make the decision.
--- Engineering