A travel nurse worked three twelves last week at a hospital where we placed her. The vendor management system the hospital uses generated an invoice line: a name string, a unit string, an hour-type label, a count, a week-ending date. Somewhere on our side is the invoice we already generated for the same shifts, with her real profile UUID, the canonical clinical unit, and the contracted rates. The two records need to meet, agree, and roll up into the export that drives weekly payment.
That meeting is the reconciliation problem. The vendor says “SMITH, JANE”. We know her by UUID. One platform calls the unit “ICU-MED-3W”. We have a clinical unit ID. The hour type is “REG” on one feed and “Regular Hours” on another. A row that read “10 hours OT” last week is “8 hours OT” this week because a timesheet got corrected upstream. More than fifty vendor management systems are in our path. Each one is its own dialect.
This is the inbound side’s mirror image. We recently wrote about collapsing 30+ vendor job postings into one job model; the pipeline shape rhymes, but the problem is different. That post is about turning a vendor’s definition of an open position into a Job a clinician can apply to. This one is about turning a vendor’s definition of a worked invoice into a reconciled ledger entry so the right number gets paid.
The two design decisions I keep reaching for when the alternative is a status column that lies: derived state, and query-enforced idempotency.
Four steps, one shape
Every vendor workflow runs the same four steps. FilesCollector fetches the raw files. Serializer parses them into typed rows. Normalizer turns rows into domain records. Matcher auto-reconciles each line against the corresponding internal invoice. The shape never changes; the per-vendor code lives almost entirely in the Normalizer and a thin wrapper at the top.
FilesCollector -> portal scrape, REST API, or inbound email
Serializer -> typed rows via VmsField schema + SHA256 dedup
Normalizer -> per-vendor subclass writes the domain records
Matcher -> External Identifier lookup -> internal invoice
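Concretely, the wiring looks something like this. A minimal sketch: the four step names are real, but the factory method and everything else here is illustrative shorthand, not our actual class layout.

# Sketch of the orchestration. The step names are real; the .for
# factory and the method names are assumptions for illustration.
class VmsInvoiceImportWorkflow
  def initialize(vendor_key)
    @vendor_key = vendor_key
  end

  def run
    files = FilesCollector.for(@vendor_key).fetch         # portal / API / email
    rows  = Serializer.for(@vendor_key).parse(files)      # typed rows + row_hash
    items = Normalizer.for(@vendor_key).normalize(rows)   # domain records
    Matcher.new.match_all(items)                          # attach internal invoices
  end
end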
FilesCollector handles the inputs we don’t control. Portal vendors get Capybara against headless Chrome with a small library of per-vendor selectors. API vendors get authenticated REST. For vendors whose integration is an emailed attachment, an inbound webhook posts the file into the same workflow record the other paths produce. Manual kickoffs skip this step: an ops user uploads a file through an admin form and the pipeline picks up at step two.
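For the portal path, a collector is a short Capybara script over per-vendor selector data. A sketch with an invented vendor, URL, labels, and env vars; real download handling is vendor-specific and omitted:

require "capybara"

# Illustrative portal collector. Everything vendor-specific here
# (URL, field labels, credentials) is made up for the sketch.
class AcmePortalCollector
  def fetch
    session = Capybara::Session.new(:selenium_chrome_headless)
    session.visit("https://portal.example-vms.com/login")
    session.fill_in "Username", with: ENV.fetch("ACME_PORTAL_USER")
    session.fill_in "Password", with: ENV.fetch("ACME_PORTAL_PASS")
    session.click_button "Sign in"
    session.click_link "Weekly invoice export" # triggers the file download
    # ...wait for the download, then hand the raw bytes to step two
  end
end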
Serializer parses raw spreadsheet bytes into VmsInvoiceImportRow records using a declarative column schema. The same VmsField shape we use on the inbound job side does the work here: declare a source column, a target type, an optional format, and the base serializer handles coercion. A SHA256 of the row goes into a row_hash column so re-importing the same file updates timestamps without duplicating data.
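To make that concrete, here is roughly what a vendor's column schema looks like. The DSL below is a sketch, not the actual VmsField API, and the vendor and its column headers are invented:

require "digest"

# Illustrative declarative schema: source column, target type, optional format.
class AcmeInvoiceSerializer < BaseVmsSerializer
  field :clinician_name, source: "Employee",    type: :string
  field :hour_type,      source: "Pay Code",    type: :string
  field :hours,          source: "Units",       type: :decimal
  field :week_ending,    source: "Week Ending", type: :date, format: "%m/%d/%Y"

  # Dedup: hash the coerced values so re-importing the same file
  # updates timestamps instead of inserting duplicate rows.
  def row_hash(values)
    Digest::SHA256.hexdigest(values.join("|"))
  end
end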
Normalizer is where the per-vendor code lives. The base class handles structure: group rows by invoice ID, find or create the VmsReverseInvoice, materialize line items. Vendor subclasses override extraction. One platform sends names as Last, First in an Employee column. Another splits role and unit across three columns. A third tucks the company identifier into a free-text notes field, because someone made that call years ago and it stuck. A factory routes each workflow to its subclass at construction time, which keeps the base class free of conditionals.
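A vendor subclass is usually a handful of extraction overrides. A sketch, assuming hook methods like extract_clinician_name on the base class; the hook names and the factory constant are illustrative:

# One vendor sends "Last, First" in a single Employee column.
class AcmeNormalizer < BaseVmsNormalizer
  def extract_clinician_name(row)
    last, first = row.clinician_name.split(",", 2).map(&:strip)
    "#{first} #{last}"
  end
end

# The factory: route each workflow to its subclass at construction
# time, so the base class stays free of per-vendor conditionals.
NORMALIZER_CLASSES = { "acme" => AcmeNormalizer }.freeze

def normalizer_for(workflow)
  NORMALIZER_CLASSES.fetch(workflow.vendor_key).new(workflow)
end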
Matcher is one method. Given an unmatched line item, look up the ExternalIdentifier for the vendor’s nurse name, find the internal invoice for that clinician and week, attach. The looping logic, the retries, and the “don’t crash the whole workflow on one bad row” behavior all sit in the base class.
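The base-class loop is the part worth sketching: one bad row records a failure and the rest of the import keeps going. Method names here are assumptions:

# Sketch of the base-class matching loop. match(item) is the one
# per-line method; record_match_failure! is a hypothetical helper.
def match_all(items)
  items.each do |item|
    match(item)
  rescue StandardError => e
    item.record_match_failure!(e.message) # queue it, don't crash the workflow
  end
end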
When I add a new vendor I don’t touch the pipeline. I add a Normalizer subclass and a workflow class. The framework is the contract; the diff is the dialect.
Rows have a state, and it changes
VmsInvoiceImportRow is the typed shape every parsed line lives in before it becomes a domain record. The state machine has three states:
raw -> just parsed, not yet turned into a VmsReverseInvoiceItem
synced -> a domain item exists and matches this row’s current values
stale -> a domain item existed, but the row’s values have changed
raw is the initial state. synced means we’ve already materialized this row into a VmsReverseInvoiceItem and the values agree. stale is what happens when a vendor re-sends the file with corrections: the old item is soft-deleted, a new item is created with a fresh identity, and the row transitions to synced against the new item.
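In code, the sync step might look like this. A sketch, assuming soft-delete via a discarded_at column and a helper that computes the deterministic UUID described next; column and helper names are assumptions:

def sync!(row)
  new_uuid = row.computed_item_uuid # UUID v5 over the vendor's composite key
  existing = row.reverse_invoice_item

  if existing && existing.uuid != new_uuid
    row.update!(state: :stale)                   # values changed upstream
    existing.update!(discarded_at: Time.current) # soft-delete the old item
  end

  item = VmsReverseInvoiceItem.find_or_create_by!(uuid: new_uuid) do |i|
    i.attributes = row.item_attributes # hypothetical attribute mapping
  end
  row.update!(state: :synced, reverse_invoice_item: item)
end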
The identity of that new item is the load-bearing choice. The UUID is computed deterministically from whichever fields make a row unique for that vendor:
item_uuid = UUIDTools::UUID.sha1_create(namespace, [
  invoice_id, clinician_name, week_ending, hour_type
].join("|"))
UUID v5 over a namespace and a stable composite key. Same inputs, same UUID, every time. Re-running a normalizer on the same file produces the same item rows. Re-running it after a correction produces a different UUID for the changed row, which is the signal the state machine needs to mark the old item stale and create the new one. Dedup is deterministic; change detection is structural.
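The property is easy to convince yourself of in a console. The namespace below is just an example UUID; the composite-key strings are invented:

require "uuidtools"

ns = UUIDTools::UUID.parse("6ba7b810-9dad-11d1-80b4-00c04fd430c8")
a  = UUIDTools::UUID.sha1_create(ns, "INV-1|SMITH, JANE|2024-01-07|REG")
b  = UUIDTools::UUID.sha1_create(ns, "INV-1|SMITH, JANE|2024-01-07|REG")
c  = UUIDTools::UUID.sha1_create(ns, "INV-1|SMITH, JANE|2024-01-07|OT")

a == b # => true: same inputs, same UUID, across runs and machines
a == c # => false: a changed field is a new identity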
Map once, match forever
The hardest part of reconciliation isn’t parsing. It’s identity. The vendor knows the clinician by name string; we know the clinician by UUID. The pipeline can’t guess. There are too many ways for SMITH, JANE to be wrong: two clinicians with the same name, a married-name change that hit our profile and not the vendor’s, a typo propagating through one vendor’s system for years.
The bridge is a table called ExternalIdentifier. Each row links a vendor scope, a vendor identity string, and an internal profile ID. The first time a vendor sends a name string we haven't seen from that vendor, the Matcher creates an ExternalIdentifier with internal_id set to NULL. The line item stays unmatched. A backoffice user (not an engineer) opens the unmatched queue, picks the right internal profile, and saves. The internal_id is now populated, and every subsequent import for that vendor name auto-matches.
Matching is deliberately strict. Given an internal clinician and a week-ending date, the Matcher expects exactly one internal invoice. Zero matches means we don’t have an invoice for that work yet. Multiple matches means something upstream is wrong: a duplicate placement, a week-boundary disagreement, a contract rebuilt mid-week. Both cases fail closed and end up in the queue a human looks at. A fuzzy match that’s wrong costs much more than a strict miss a human resolves.
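Put together, the per-line match is a find-or-create plus a strict lookup. A sketch with illustrative model and column names:

# Sketch of match(item); fail closed on anything but exactly one invoice.
def match(item)
  ident = ExternalIdentifier.find_or_create_by!(
    vendor_scope:   item.vendor_scope,
    external_value: item.clinician_name
  )
  return if ident.internal_id.nil? # unmatched until a human maps the name

  invoices = Invoice.where(
    profile_id:  ident.internal_id,
    week_ending: item.week_ending
  )
  return unless invoices.count == 1 # zero or many: leave it for the queue

  item.update!(invoice: invoices.first)
end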
The useful property of this design is that it learns. Week one for a new vendor, the unmatched queue is long. Many weeks later, the queue is almost empty: mostly new clinicians on existing vendors. The mapping table is data, the data accumulates, the human step decays. Same shape as the rule-table classifier on the inbound job side: operational judgment lives in data, not code, and the system gets better as the data fills in.
State you can’t make stale
VmsReverseInvoiceItem has five meaningful states: missing_match, has_contract, reconciled, exported, and exception. None of them are stored as a status column. The state of an item is derived from the related records at query time:
missing_match -> no contract is linked to the item
has_contract -> a contract exists but no active balance entry does
reconciled -> an active balance entry exists
exported -> the item appears in any confirmed weekly snapshot
exception -> exception_at is set on the item directly. This is the one explicit override, used when a human marks the line as needing out-of-band handling.
State is derived, never stale. The reconciliation state of an item is whatever a query against the underlying contract, balance entry, and snapshot tables says it is, right now. There is no reconciled boolean to fall out of sync, because there is no reconciled boolean. If a balance entry gets archived, the item is no longer reconciled the next time you ask. No migration, no callback, no “keep the flag in sync” cron.
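The named scopes are most of the implementation. A sketch, with assumed association and scope names (BalanceEntry.active and SnapshotItem.confirmed are assumptions):

class VmsReverseInvoiceItem < ApplicationRecord
  # reconciled: an active balance entry exists, right now.
  scope :reconciled, -> {
    joins(:balance_entries).merge(BalanceEntry.active).distinct
  }

  # exported: the item appears in a confirmed weekly snapshot.
  scope :exported, -> {
    where(id: SnapshotItem.confirmed.select(:item_id))
  }
end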
We’ve been bitten enough times by status columns drifting from the records they were supposed to summarize that we now treat derived state as the default. A boolean that mirrors a join is a lie waiting to happen.
The tradeoff is real. You can’t answer “show me all reconciled items” with a simple WHERE clause; you have to join, and the join touches multiple tables. We pay that price with indexes and a small library of named scopes that hide the joins. What we don’t pay for is correctness incidents. When the source of truth and the cached flag disagree, the bug is always in the cache.
Idempotency at query time
The end of the pipeline is the export to our accounting system. A weekly snapshot bundles the items that are ready, attaches metadata, hands them off, and is then locked. Once confirmed, a snapshot can’t change. The next snapshot has to be additive: no item that was in last week’s snapshot can appear in this week’s.
The procedural version is a status column that flips from ready_to_export to exported when the snapshot is built. We’ve written that version. It has a recurring failure mode: the upload fails halfway, the column update doesn’t roll back cleanly, and you have items marked exported that aren’t in any snapshot, or items in a snapshot that aren’t marked exported. Reconciling the reconciler is its own job.
The shape we landed on enforces the delta at query time. The snapshot builder selects items that are reconciled and not already in any previous snapshot:
ReverseInvoiceItem
.reconciled
.where.not(id: SnapshotItem.confirmed.select(:item_id))
There is no exported boolean. The query is the contract. If a snapshot upload fails, the items aren’t in the SnapshotItem set yet, and they’ll be picked up next time. If a snapshot succeeds, its items get a SnapshotItem row and the query naturally excludes them on the next build. The only state that matters is whether the snapshot record exists and is confirmed.
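The build-then-confirm flow follows from the query. A sketch; the upload call and the confirmed_at column are assumptions:

# Build the snapshot from the query, upload, then confirm.
snapshot = Snapshot.transaction do
  s = Snapshot.create!(week_ending: week_ending)
  ReverseInvoiceItem.reconciled
    .where.not(id: SnapshotItem.confirmed.select(:item_id))
    .find_each { |item| s.snapshot_items.create!(item: item) }
  s
end

upload_to_accounting!(snapshot) # hypothetical; if it raises, nothing is confirmed
snapshot.update!(confirmed_at: Time.current) # only now do its items leave the pool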
Same shape as our argument for storage-layer audit over callback-layer audit: check reality, don’t trust a flag that was supposed to mirror reality. Reality doesn’t drift.
What the shape buys
We’ve been running this pipeline against more than fifty vendor workflows long enough that the architectural choices have been stress-tested.
Step separation buys testability. Each step has a narrow shape: files in, rows out; rows in, items out; items in, matches out. Tests at any layer mock the layer below. We replay real production files through the Serializer in unit tests without standing up a portal scrape.
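A replay test is just a captured file and the Serializer, nothing else. A sketch in RSpec; the fixture path and class name are invented:

RSpec.describe AcmeInvoiceSerializer do
  it "parses a captured production file into typed rows" do
    bytes = File.binread("spec/fixtures/vms/acme/sample_week.xlsx")
    rows  = described_class.new.parse(bytes)

    expect(rows).to all(be_a(VmsInvoiceImportRow))
    expect(rows.map(&:row_hash)).to all(be_present)
  end
end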
Factory-based extensibility makes the long tail cheap. The platform shape is already known when we add a new vendor; the diff is the dialect.
Deterministic UUIDs make reruns boring. When a vendor sends a corrected file, I re-run the workflow without thinking about it. The state machine handles the changes, the UUID v5 derivation handles identity, the snapshot query handles export idempotency. I’ve never had to write a “fix the reconciliation that re-ran twice” one-off.
Derived state means the ops UI never lies. The unmatched queue shows what is unmatched right now. When a contract gets attached or a balance entry gets archived, the queue updates on the next refresh, with no reconciliation-of-reconciliation job.
What the pipeline doesn't solve is the underlying messiness of the vendor world. Portals change selectors without notice. API tokens rotate on schedules we don't set. A vendor that sent us MM/DD/YYYY for years switches one subset of their platform to YYYY-MM-DD. The Serializer coerces unparseable values to null and surfaces them as health metrics, because a hard failure stops a whole import and Tuesday morning is the wrong time to find out a chunk of invoices won't parse. Most of the operational work is keeping the inputs healthy, not reworking the internals.
--- Vinicius, Engineering