Megazord: rebuilding shift ingestion for Works

Oct 22, 2025 · 8 min
works · open shift

A pricing analyst pinged us last year with a screenshot of a query that had been running for six minutes. She was trying to answer what looked like a small question: across a multi-hospital system, how did Tuesday night fill rate on med-surg compare to the same nights a year earlier, with the same staffing standard applied. Her query joined three ingestion tables---two shaped to one scheduling system, one shaped to another---through a units lookup that didn’t line up between the two. By the time the query came back, she had switched to something else. When she eventually got the answer, it was wrong by about eight percentage points, because the two source schemas disagreed on what counted as an open shift.

That was the failure mode the old ingestion produced over and over. A pricing decision based on a fill rate that drifted between vendors. A dashboard that showed a healthy week to one customer and a panic-stations week to another because the underlying tables were normalized to whichever scheduling system that customer happened to use. A time-of-day analysis that took six minutes because the storage shape was tuned for the import, not for the question.

We rebuilt the ingestion path. The rewrite is called Megazord, and it’s the substrate the rest of Works sits on.

The anatomy of a single Megazord substrate row. Top: one row with four labeled fields (unit, interval, demand_qty, source attribution). Bottom: a 12-hour shift drawn as a horizontal bar with 48 small mint cells underneath, each representing one 15-minute block. An annotation marks the shape that every source --- full integration, CSV upload, manager UI --- produces.

The hot path no one names

In Works, shift demand is the input to almost every interesting decision. Pricing reads demand to compute incentives. Load balancing reads demand to redistribute coverage across units. Fill-rate analytics, FTE attainment, and the dashboards an operations director looks at every morning all read the same underlying numbers. Slow ingestion means slow decisions. Vendor-shaped ingestion means every decision inherits that shape.

The old design coupled ingestion to query. Each scheduling-system integration wrote into tables shaped like the source system: column names borrowed from the vendor’s schema, joins that worked because the import knew the vendor’s conventions, unit identifiers carried through end-to-end. It worked when we had one integration family and a single customer. By the time we were live on two integration families and rolling out across more customers, the cracks were obvious. A fill-rate query that needed to span both vendor shapes became a sprawl of joins. A new product surface that wanted demand from a third source---a CSV the customer dropped in manually, or a shift the on-shift manager added through our UI---had nowhere coherent to land it.

The teams above us weren’t asking for faster ingestion. They were asking for things that ingestion was quietly making impossible: closed-loop pricing that needed sub-second fill-rate reads against a year of history, load balancing that needed one canonical view of where the gaps were, FTE attainment numbers that lined up across vendors. We were saying no to all of them, and the real reason was that the substrate underneath wasn’t ready.

Decouple ingestion from query, normalize into a substrate

The architectural move was simple to describe and load-bearing in practice: separate ingestion from query, and normalize demand into a time-blocked substrate.

Ingestion becomes a one-way pipeline. Each source---a full integration from a scheduling system, a CSV upload, a shift entered through our manager UI---has a small adapter that knows how to read its native shape and emit normalized demand. Adapters don’t get to define the storage schema. They produce demand blocks. The blocks are what land in the substrate.
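A minimal sketch of the adapter contract, in Python. All names here (`DemandBlock`, `VendorAAdapter`, the raw field names) are hypothetical illustrations, not the real codebase: the point is that the adapter is allowed to know the vendor's shape, but can only emit the one normalized block shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

BLOCK = timedelta(minutes=15)

@dataclass(frozen=True)
class DemandBlock:
    unit_id: str              # canonical unit, not the vendor's identifier
    interval_start: datetime  # 15-minute aligned
    demand_qty: int           # staff needed during this block
    source: str               # attribution: which adapter produced it

def fifteen_minute_grid(start: datetime, end: datetime):
    """Yield 15-minute-aligned block starts covering [start, end)."""
    t = start
    while t < end:
        yield t
        t += BLOCK

class VendorAAdapter:
    """Knows the vendor's native shape; allowed to be vendor-specific and ugly."""
    SOURCE = "vendor_a"

    def emit(self, raw: dict) -> list[DemandBlock]:
        # Translate the vendor's shift record into normalized blocks.
        start = datetime.fromisoformat(raw["shiftStartTs"])
        end = datetime.fromisoformat(raw["shiftEndTs"])
        return [
            DemandBlock(raw["deptCode"], t, int(raw["openings"]), self.SOURCE)
            for t in fifteen_minute_grid(start, end)
        ]
```

A 12-hour shift run through `emit` comes out as 48 blocks; a new source means a new `emit`, not a new storage schema.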

The substrate is a single table (logically; physically it’s partitioned for the obvious reasons) where every row is a 15-minute interval of demand against a unit. A 12-hour night shift becomes 48 blocks. A four-hour gap a manager fills late in the day becomes 16 blocks. Whether the demand came in through a full integration with one vendor, a different integration with another vendor, a CSV that arrived by email, or a manual entry, the substrate row looks the same.

Once everything is in the same shape, queries get to read one table. Fill rate at the unit level for last Tuesday night is a range scan with a group-by. Fill rate for an entire year, across every customer, broken out by shift type, is the same query with a wider range. The pricing controller, the dashboard, and the FTE attainment view all read the same rows.
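The "range scan with a group-by" can be sketched as a plain aggregation over block rows. The tuple layout and `filled_qty` field are assumptions for illustration; the real schema isn't shown in this post.

```python
from collections import defaultdict

def fill_rate_by_unit(rows):
    """rows: iterable of (unit_id, demand_qty, filled_qty) block rows,
    already range-scanned to the window of interest.
    Returns unit_id -> filled / demanded across all blocks in the window."""
    demand = defaultdict(int)
    filled = defaultdict(int)
    for unit_id, demand_qty, filled_qty in rows:
        demand[unit_id] += demand_qty
        filled[unit_id] += filled_qty
    return {u: filled[u] / demand[u] for u in demand if demand[u] > 0}
```

Widening the query from one Tuesday night to a full year changes only the range scan that feeds `rows`; the aggregation is identical.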

Before-and-after diagram of the ingestion path. On the left, the legacy design: three vendor-specific parsers each writing into their own ad-hoc tables, with slow aggregate queries fanning across all three. On the right, Megazord: multiple sources—a scheduling-system integration, a CSV upload, the manager UI—all flowing through normalized adapters into a single time-blocked demand substrate, with fast slice queries reading from one place.

The boundary between adapter and substrate is the part of the design that did the most work. Adapters are allowed to be vendor-specific and a little ugly. The substrate isn’t. Every adapter has to produce the same shape: a unit, an interval, a demand quantity, an attribution to where it came from. When a new source shows up---and they keep showing up---the work is to write an adapter, not to thread a new column through every downstream query.

Why 15-minute blocks

We argued about the block size. Five minutes would have been more faithful to clock granularity. Thirty minutes would have been cheaper. Fifteen turned out to be the right tradeoff.

Fifteen minutes lines up with how shifts and breaks are actually scheduled in the systems we ingest from. Most start and end times in the source data are 15-minute aligned. A coarser grid would have forced us to invent rules for truncating edge cases. A finer grid would have produced rows the source systems couldn’t fill without lying.

It’s also fine enough to compute fill rate per shift gap, not just per shift. A 12-hour shift that goes uncovered for the first two hours and then gets picked up is a different operational story than a 12-hour shift that never fills. With block-level fill rate, the difference is a 40/48 versus a 0/48 ratio against the same shift. Aggregate-level fill rate flattens both into “partially filled” and walks past the part of the story the operations team needs to see.
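The arithmetic is small enough to write down. In this illustrative sketch, the late pickup leaves 8 of its 48 blocks empty, for a 40/48 fill ratio, while the shift that never fills sits at 0/48; an aggregate per-shift flag would call both "partially filled" or "unfilled" and lose the gap.

```python
def block_fill_ratio(filled_flags):
    """filled_flags: one bool per 15-minute block of a shift."""
    return sum(filled_flags) / len(filled_flags)

# A 12-hour shift uncovered for its first two hours, then picked up:
late_pickup = [False] * 8 + [True] * 40   # 8 of 48 blocks go unfilled
never_filled = [False] * 48               # the whole shift goes uncovered
```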

And fifteen minutes is coarse enough that a year of demand for a large customer fits in a working set we can query fast. A focused performance sprint on the substrate’s read path pushed year-long fill-rate queries to sub-250ms. A control loop that reads its setpoint and its measurement on the order of a quarter of a second can run continuously; a control loop that reads them on the order of six minutes is a batch job pretending to be a controller.

One substrate, many sources. A horizontal 24-hour timeline divided into 15-minute slots. Three feeder lanes above the timeline—a full scheduling-system integration, a CSV upload, and a manual entry through the manager UI—each contribute demand that lands in the same blocks. Below the timeline, a fill-rate strip computed per block shows where coverage is complete and where the gaps actually sit.

Three sources, one shape

The substrate has to ingest from three classes of source, and the operational reality of each one is different.

| Source | Fidelity | Adapter weight | Volume | Latency to fill-rate read |
| --- | --- | --- | --- | --- |
| Full integration | Highest: unit assignments, shift templates, posted-vs-filled status | Heaviest code in the ingestion path | Bulk of demand | Near-real-time |
| CSV upload | Medium: only what the spreadsheet carries | Small adapter that validates columns and maps to block shape | Recurring batches from smaller groups | Per-cadence (often daily or weekly) |
| Manager UI | Highest for the override case | Smallest | Smallest by volume | Seconds, so pricing and balancing see the override immediately |

Attribution is preserved on every row, regardless of source: we know where the demand came from, when it was ingested, and which adapter version produced it. The substrate doesn’t care which adapter wrote the block, which is the invariant that lets a fill-rate query span all three without reconciliation.
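A toy illustration of that invariant, with made-up field order and attribution strings: the aggregate never branches on source, but the attribution stays on every row for tracing a number back.

```python
# Blocks from three different sources, one shape:
# (unit, hour, demand_qty, filled_qty, source attribution) -- hypothetical layout.
blocks = [
    ("med-surg-3", 19, 2, 2, "vendor_a"),
    ("med-surg-3", 19, 1, 0, "csv_upload/2025-10-20"),
    ("med-surg-3", 19, 1, 1, "manager_ui"),
]

# No reconciliation step: the fill-rate aggregate ignores attribution entirely.
demand = sum(b[2] for b in blocks)
filled = sum(b[3] for b in blocks)

# ...but attribution is still there when a number needs to be traced back.
sources = {b[4].split("/")[0] for b in blocks}
```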

What it looks like in production

Megazord went live for the first integration family, then the second a year later. At cutover, the headline number was that the new substrate was about 4x faster than the legacy ingestion on the workloads we measured---roughly, the time it took to land a customer’s demand and have it readable by the dashboards and the pricing pipeline.

4x at cutover was good enough to unblock the immediate work, but not fast enough for the read patterns the closed-loop pricing controller wanted. A later performance sprint went after the read path: partitioning strategy, the indices we cared about, a small set of query rewrites against the substrate’s most-used shapes. Year-long fill-rate queries that had been taking several seconds came in under 250 milliseconds. The pricing controller, and the rest of the analytics surface, got the read budget it needed.

What surprised us most: as soon as the substrate existed, every adjacent product surface wanted to read from it. Dashboards that had been hitting the old ingestion tables migrated. A new FTE attainment view that would have been a multi-week schema design exercise was a one-day query. A load balancer prototype that had been stuck in design review because the demand model wasn’t coherent across vendors got unstuck. The substrate did its job by being the thing other things could read.

What the substrate unlocked

“Closing the loop on shift pricing” describes a PID-style controller that reads fill rate and writes dollars-per-hour. That controller isn’t buildable without Megazord. A controller that has to wait six minutes to see what its last move did is not a controller. The substrate is what makes the loop closable.

The same goes for load balancing. Moving open shifts across units, between facilities in a multi-hospital system, and across timeframes requires a single coherent view of where the gaps are. The substrate is that view. Without it, every redistribution decision is a guess that compounds.

And FTE attainment---the operations metric finance and clinical leadership both care about---becomes a query against the substrate rather than a manual reconciliation across vendor exports.

--- Engineering
