Integrating with UKG: caches, credentials, and worker sync at hospital scale

Dec 2, 2025 · 8 min

A shift gets cancelled inside a hospital’s UKG instance. Not by a manager pulling the lever in our app---by an automated process on UKG’s side that nobody upstream of the timekeeping system asked for. The cancellation lands in a webhook payload, hits our ingestion pipeline, and, if we’re not careful, mirrors itself back into Works. The clinician who was confirmed on that shift gets a notification an hour later that their assignment is gone. The unit manager opens a ticket. Operations spends the afternoon untangling a ghost.

That class of bug is one of about twenty we worked through across roughly six pull requests in a single quarter, rebuilding the way Works exchanges data with UKG. None of them were the kind of bug you find in an integration guide. They’re the kind you find by running the integration against real hospital data for months.

The work split into three pieces: a multi-read cache that collapsed the dominant access pattern into batched reads, mTLS credential management so credentials are auditable and rotatable per customer environment, and a worker-sync pipeline that handles two systems trying to agree on who is employed and on what terms.

Why UKG is the system of record

For many of the large hospital systems we work with, UKG is the ground truth for two things that matter to a shift marketplace. First, who is actually employed by the system. Second, when those people are scheduled, working, on PTO, or out. The HR side and the timekeeping side both live in UKG, which means that any answer we give about who is working tomorrow night either agrees with UKG or is wrong.

If you’re building a marketplace that helps a hospital fill open shifts with their own employees first and contingent staff second, you cannot have an opinion that disagrees with UKG. You can have a faster opinion, or a richer one. You cannot have a different one.

That constraint shapes how Works talks to UKG. Reads have to be cheap enough to run constantly. Writes have to be precise enough that we don’t overwrite something we shouldn’t. Credentials have to be auditable so that a customer’s security team can answer which system touched our HR data, and when. And the synchronization has to handle the fact that UKG was there first, has its own opinions, and isn’t going to change.

The multi-read cache

[Figure: the multi-read cache. Roughly 67,000 redundant single-row location and worker reads per day fan into a cache layer that batches by entity type, with a request-scoped memo and scoped invalidation; only a small number of bulk calls reach the UKG API.]

When we first profiled the integration, the access pattern was lopsided. Two entity types---locations and workers---dominated every call path that touched UKG. A single page render in Works might ask for a worker’s details four times, because four different components on the page each independently resolved the same worker. A shift-pricing pass would iterate through a unit roster and pull every location it touched, one row at a time. None of these calls was wrong. Each was just a single-row API call, repeated tens of thousands of times a day.

The numbers were specific enough to be embarrassing. When we measured the redundancy, it ran to roughly 67,000 calls per day that could be collapsed without losing freshness. UKG’s rate-limit envelope is finite, and we were burning it on data we already had.

The fix wasn’t exotic. We built a cache in front of the UKG client with three properties:

  • Batch by entity type, not by call site. The cache exposes get_locations(ids) and get_workers(ids). The underlying UKG calls collapse to bulk reads. A page that asks for one worker and a job that asks for two thousand workers go through the same path. Single-row reads become an emergent special case of batched reads.
  • Request-scoped memoization. Inside a single request or job, the cache serves repeated lookups from memory. We don’t make the same network call twice for the same key within the same unit of work, even if the underlying TTL hasn’t expired.
  • Scoped invalidation, not blanket TTLs. Locations and workers don’t change every minute. They change on specific events: an HR record updates, a unit gets renamed, a worker’s status flips. The cache invalidates against those events, with a coarse TTL as backstop. Most reads serve from cache; the cache is right because the events drive the invalidation, not the clock.
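
In code, the shape is roughly this. A minimal sketch, assuming a Rails.cache-style store and hypothetical bulk-read methods on the UKG client (fetch_workers_bulk, fetch_locations_bulk); the real implementation is more involved:

```ruby
# Sketch of the multi-read cache. UkgClient's bulk methods and the
# Rails.cache-style store API are assumptions, not the production code.
class UkgReadCache
  def initialize(client:, store:)
    @client = client # wraps the UKG API with bulk-read endpoints
    @store  = store  # shared cache; coarse TTL is the backstop only
    @memo   = {}     # request-scoped memo: one network call per key per unit of work
  end

  def get_workers(ids)
    get_batch(:worker, ids) { |missing| @client.fetch_workers_bulk(missing) }
  end

  def get_locations(ids)
    get_batch(:location, ids) { |missing| @client.fetch_locations_bulk(missing) }
  end

  # Event-driven invalidation: called from HR-update, rename, and
  # status-change handlers, so events drive freshness rather than the clock.
  def invalidate(type, id)
    @store.delete("#{type}:#{id}")
    @memo.delete("#{type}:#{id}")
  end

  private

  def get_batch(type, ids)
    keys    = ids.map { |id| "#{type}:#{id}" }
    hits    = keys.to_h { |key| [key, @memo[key] || @store.read(key)] }
    missing = ids.zip(keys).select { |_, key| hits[key].nil? }.map(&:first)

    unless missing.empty?
      yield(missing).each do |record| # one bulk call covers every miss
        key = "#{type}:#{record.fetch(:id)}"
        @store.write(key, record, expires_in: 3600) # TTL backstop
        hits[key] = record
      end
    end

    keys.map { |key| @memo[key] = hits[key] }
  end
end
```

A component that resolves one worker and a roster job that resolves two thousand both go through get_workers; the single-row read is just a batch of one.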

Call volume dropped by roughly 67,000 calls per day, against a system whose rate limits matter and whose latency tail matters more. Less obviously, it made the integration sane to reason about. Engineers writing new features stopped having to ask “should I make this call?” The answer is always: ask the cache.

mTLS credential management

UKG integrations authenticate per customer environment. Each hospital system has its own UKG tenant, its own keys, its own rotation cadence. The first version we shipped lived in environment variables and a handful of secrets managers. It worked. It also meant that giving a customer a clear answer to “show me which credential touched our data, and when” required spelunking through deploy configs.

We moved UKG credentials behind a credential management layer in Active Admin, with mTLS as the transport. A few things follow.

Every UKG integration runs over mutually authenticated TLS, with a customer-specific client certificate. That gives the customer’s security team an auditable answer about which integration session touched their system, beyond just the IP and timestamp: the cert itself is the identity. Rotation is a first-class action, not a redeploy. An operator generates a new cert, the credential record updates atomically, and the next call to UKG carries the new identity.
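
The transport wiring itself is ordinary Ruby. A minimal sketch, assuming the credential record exposes PEM-encoded certificate and key fields (the field names are illustrative):

```ruby
require "net/http"
require "openssl"
require "uri"

# Build a per-tenant mTLS connection from a credential record managed in
# the Credentials Editor. Field names on `credential` are illustrative.
def ukg_connection(credential)
  uri  = URI(credential.base_url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl     = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  # The client certificate is the integration's identity: rotate the
  # credential record and the next call carries the new cert.
  http.cert = OpenSSL::X509::Certificate.new(credential.client_cert_pem)
  http.key  = OpenSSL::PKey.read(credential.client_key_pem)
  http
end
```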

The credentials live in a Credentials Editor in Active Admin. The interesting part isn’t the editor itself---it’s that the editor works against UAT and review apps, not just production. A common failure mode in credential management is that production credentials are well-managed and the test environments use whatever was lying around. We pulled the test-environment credentials into the same surface as production, with the same audit trail. The way to keep credential discipline is to never have an environment where the discipline doesn’t apply.

The Active Admin tooling pairs with our storage-layer audit, so every credential edit is recorded as a write against the right actor, not against the application-generic system user. Two-way integrations live or die on the audit posture around their credentials. We wanted that posture before any customer asked for it.

The worker-sync pipeline

[Figure: the worker-sync pipeline. A UKG payload (employee_id, email, worker_type, workforce_group, pay codes with monetary amounts) flows through an async, idempotent ingestion job into a conflict-resolution layer (placeholder emails, employee-ID collisions, worker-type reconciliation, cancellation races, money pay-code passthrough) and out to Works; UKG-side cancel signals feed back but are dropped, not mirrored.]

The third piece is the pipeline that turns a UKG payload into a Works user, a Works workforce group, and, more recently, a Works shift that carries the right monetary pay code.

Sounds straightforward. It isn’t. Two-way data exchange between an HR system that has been deployed inside a hospital for years and a newer marketplace means the two systems disagree on details that nobody planned for them to disagree on. A non-exhaustive list of what we’ve had to handle:

Placeholder emails. Not every UKG record has an email. Some have a placeholder: an obviously fake address, firstname.lastname@notset, or the literal string unknown. A naive sync treats those as real identifiers and either fails on uniqueness or matches multiple humans onto the same Works user. We added a placeholder-email policy that recognizes the common shapes, defers user creation until a real email arrives, and keeps the UKG-side employee ID as the durable handle in the meantime.
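
A sketch of the policy’s shape. The patterns and the DeferredWorker model are illustrative; the real list is grown from shapes observed in customer tenants:

```ruby
# Recognize the common placeholder shapes. Patterns are illustrative.
PLACEHOLDER_EMAIL_PATTERNS = [
  /@notset\z/i,              # firstname.lastname@notset
  /\Aunknown\z/i,            # the literal string "unknown"
  /@example\.(com|org)\z/i,  # dummy domains
  /\Ano-?reply@/i
].freeze

def placeholder_email?(value)
  v = value.to_s.strip
  v.empty? || PLACEHOLDER_EMAIL_PATTERNS.any? { |pattern| v.match?(pattern) }
end

# On ingest: defer user creation until a real email arrives, keeping the
# UKG employee ID as the durable handle. DeferredWorker is hypothetical.
def resolve_user(payload)
  if placeholder_email?(payload[:email])
    DeferredWorker.upsert_by_employee_id(payload)
  else
    User.find_or_create_by(email: payload[:email].strip.downcase)
  end
end
```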

Employee ID collisions. Two UKG tenants reuse the same employee ID space, or a single tenant reassigns IDs after a workforce-management migration. We can’t assume employee_id is globally unique. We can’t even assume it’s tenant-unique forever. The sync keys on (tenant, employee_id, hire_date) and uses the conflict-resolution layer to decide which Works user a colliding payload should map to.
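
In code, that keying is roughly a composite key plus a defer path when the key only partially matches. A sketch; SyncedWorker and ConflictResolution are hypothetical names:

```ruby
# The sync key is the tuple (tenant, employee_id, hire_date), never
# employee_id alone.
def upsert_worker(tenant, payload)
  attrs = {
    tenant_id:   tenant.id,
    employee_id: payload.fetch(:employee_id),
    hire_date:   payload.fetch(:hire_date)
  }
  record = SyncedWorker.find_by(attrs)
  return record.update!(payload.slice(:email, :worker_type)) if record

  if SyncedWorker.exists?(attrs.slice(:tenant_id, :employee_id))
    # Same tenant and employee ID but a different hire date: a reassigned
    # or migrated ID. Hand it to conflict resolution instead of guessing.
    ConflictResolution.defer(tenant, payload)
  else
    SyncedWorker.create!(payload.merge(tenant_id: tenant.id))
  end
end
```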

Worker-type mismatches. UKG’s worker-type vocabulary doesn’t map one-to-one onto Works’s. A per diem in UKG might correspond to one of two Works categories depending on the workforce group, and a contractor in UKG might not be a Works user at all. The pipeline normalizes through a tenant-specific mapping table; when it sees a worker type it doesn’t recognize, it defers rather than guessing.
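
A sketch of the normalize-or-defer step, with illustrative model names:

```ruby
# Normalize a UKG worker type through the tenant-specific mapping table,
# deferring when the vocabulary is unrecognized.
def map_worker_type(tenant, ukg_type, workforce_group)
  mapping = tenant.worker_type_mappings.find_by(
    ukg_type: ukg_type, workforce_group: workforce_group
  )
  if mapping
    # works_category may be nil: some UKG types (e.g. certain contractors)
    # are deliberately not Works users at all.
    mapping.works_category
  else
    SyncDeferral.create!(tenant: tenant,
                         reason: :unknown_worker_type,
                         detail: { ukg_type: ukg_type })
    :deferred # never guess a category
  end
end
```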

Money pay codes on shifts. The sync now carries pay-code information through to the shift, including the monetary amount associated with the code. That sounds like a small detail and is actually the difference between Works doing its own FTE-attainment math and Works asking UKG every time. We did this through a dedicated async job rather than inline, so a large bulk update doesn’t stall the rest of the pipeline.
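
The job itself is small; the point is that it runs off the hot path. A sketch with ActiveJob (the model, columns, and queue name are illustrative):

```ruby
# Dedicated async job so a bulk pay-code update can't stall ingestion.
class ShiftPayCodeJob < ApplicationJob
  queue_as :ukg_pay_codes
  retry_on StandardError, wait: 30.seconds, attempts: 5

  def perform(shift_id, pay_code, amount_cents)
    shift = Shift.find(shift_id)
    # Carrying the monetary amount with the code lets Works run its own
    # FTE-attainment math instead of asking UKG every time.
    shift.update!(pay_code: pay_code, pay_code_amount_cents: amount_cents)
  end
end
```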

The cancellation race. UKG sometimes cancels a shift on its own side: a unit re-templates, an HR rule fires, a payroll cycle closes. If Works mirrors that cancellation downstream, it cancels a confirmed clinician’s assignment for reasons that have nothing to do with the assignment. We learned to drop those mirror cancellations on the floor, with an audit row, rather than propagate them. It’s the kind of fix that’s invisible when it works and very visible when it doesn’t.
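
A sketch of the guard, assuming the ingestion layer can classify a cancellation’s origin from the payload (the classification and model names are illustrative):

```ruby
# Drop UKG-originated mirror cancellations with an audit row instead of
# propagating them to a confirmed assignment.
def handle_cancellation(event)
  if event.source == :ukg_internal
    AuditRow.create!(
      kind:     :dropped_mirror_cancellation,
      shift_id: event.shift_id,
      payload:  event.raw
    )
    return # dropped on the floor, by design, with a paper trail
  end

  cancel_assignment!(event.shift_id) # a real, Works-originated cancellation
end
```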

The pipeline does all of this asynchronously, with retries, behind a single ingestion job. The conflict-resolution rules are written down in code instead of left to engineers’ memories.

What this unlocks

The point of building any of this isn’t the integration itself. The integration is plumbing. The point is what becomes possible once the plumbing is right.

With UKG flowing cleanly into Works, the ingestion layer treats UKG as one of several sources of shift demand and worker state. Shift-pricing decisions can take real-time worker availability into account. FTE-attainment math runs against actual UKG data instead of estimates. The placeholder emails and ID collisions that used to surface as customer-facing bugs get caught at ingestion and resolved before they become incidents.

--- Engineering
