Auto-curation: agents that fill out forms

Apr 8, 2026 · 10 min
marketplace · ai

Every day, thousands of jobs arrive in Trusted’s platform from vendor management systems. Each one is a raw snapshot: codes, abbreviations, fragments, attached files. Before any clinician sees it, someone has to figure out what unit it’s really for, whether the rate is plausible, what credentials are required, whether the posting should even exist. We call that curation, and for most of Trusted’s history, a human did it.

Last week, every incoming job from one of our largest vendor relationships went live without a human touching it. The auto-curation rate is roughly 96%, and the system has been live in production since the end of March with no regressions against the human-curated baseline. The cost per curated job is about thirteen cents.

The architecture underneath is what we’ll keep reusing for every other autopilot we ship.

What curation actually is

Curation is not one task. It’s at least four.

| Concern | Tool count | What the agent owns | One example |
| --- | --- | --- | --- |
| Decision | 2 | Whether this job proceeds to the marketplace, is rejected, or escalates to a human. | Reject obvious junk; escalate a posting with conflicting bill-rate fields. |
| Clinical Unit | 4 | Mapping the raw unit string from the VMS onto a canonical unit in our taxonomy. | VMS says ICU-MED-3W; the agent pins it to Medical-Surgical ICU. |
| Job Details | 17 | Normalizing shift type, hours per week, bill rate, number of positions, start date, end date, location, and the rest of the structured record. | VMS provides fragments; the agent fills out the structured marketplace record. |
| Job Rules | 24 | Resolving certifications, years of experience, EMR proficiency, charge-nurse status, floating policies, holiday coverage---every constraint on who can accept the shift. | Required BLS plus two years of recent unit experience plus a specific EMR. |

Each of these requires looking things up, reading other parts of the job, applying soft judgment, and writing the result into a structured form. A human curator does this well; what they can’t do is keep pace with ingestion volume that scales with every new vendor relationship. Manual review is a queue. Queues get longer.

Why the first attempt didn’t work

Our first version, built in Q4 of last year, was the obvious move: send the LLM the raw job, hand it a giant JSON schema covering every curation output, ask for the fully curated record in one pass.

It mostly worked. When it didn’t, it failed in the worst possible way.

The schema was large enough that the API would sometimes reject the request outright. The model returned valid JSON that didn’t match reality, with no clear path to interrogate why it had landed there---all the reasoning was inside one opaque call. Extending the system meant editing one shared schema, so every change risked breaking every other field. Debugging meant reading raw JSON outputs and guessing at the model’s thought process.

We rebuilt it. What we rebuilt, we rebuilt more than once. The agent that handles job rules---the largest of the four---went through four named iterations before I was willing to put it in front of production volume. Each round changed something specific that the previous round had made invisible: the boundary between job rules and job details, the right level of granularity for tool calls, how aggressive to be about escalating to a human instead of guessing.

The pattern I kept finding: every failure was localizable only after we had decomposed concerns. With one big agent answering everything, a wrong unit assignment looked exactly like a wrong rate which looked exactly like a wrong rule. With concerns separated, a failure pointed at one prompt and one tool surface. We could fix it and ship without holding our breath about what else we’d just changed.

v2: four agents, one form

[Figure: four agent cards chained left to right: Decision (2 tools), Clinical Unit (4 tools), Job Details (17 tools), Job Rules (24 tools), each running a think → tools → submit loop.]

The new architecture replaces the single one-shot call with four specialized agents, chained.

Each agent has one concern. Each has its own system prompt, its own tool set, and its own success criterion. Each runs the same loop:

think → call tools → observe → submit

The four agents:

A curation decision agent (two tools) that decides whether the job proceeds, gets rejected, or is escalated to a human.

A clinical unit mapping agent (four tools) that pins the job to a canonical unit in our taxonomy.

A job details agent (seventeen tools) that normalizes shift type, rate, hours, count, and the rest of the structured record.

A job rules agent (twenty-four tools) that resolves every requirement and constraint on the shift.

The tool counts aren’t arbitrary. Each tool corresponds to a specific atomic action---looking up a certification, searching the unit catalog, setting a single form field, validating a date range, submitting the final state. The smallest agent has the simplest job. The largest agent (job rules) handles enough complexity that we’re already considering splitting it into more specialists.

Underneath, all four use the same reasoning model, configured at medium effort. Same model, four different sets of context, four different goals.
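In code, each agent's loop might look something like this minimal Python sketch. run_agent, model.complete, tool_result, and the rest are illustrative names, not our actual internals:

def run_agent(agent, form, job_snapshot):
    # Each agent sees only its own system prompt, its own tools, and the raw job
    messages = [agent.system_prompt, render(job_snapshot)]
    while True:
        # think: one reasoning-model call, which may request tool calls
        turn = model.complete(messages, tools=agent.tools, effort="medium")
        for call in turn.tool_calls:
            # call tools → observe: every result lands back in context
            result = agent.tools[call.name](**call.arguments)
            messages.append(tool_result(call, result))
            # submit is terminal only when the Form's validation passes;
            # a validation error is just another observation to correct from
            if call.name == "form.submit" and result.ok:
                return form.state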

The decomposition is the first thing that earned its keep. When v1 had a bug in unit mapping, you had to read through a 20-page prompt to find the relevant section. In v2, the unit mapping agent has its own prompt that fits on a screen, owned by the engineer who knows that surface. When something breaks, you know which agent broke, and the prompt to read is small.

The Form pattern

[Figure: an Agent issues tool calls (set_shift_type, set_bill_rate, add_cert, submit) into a Form with filled rows for unit, bill rate, and cert; the Form passes through a validation gate into the curated job.]

The agents don’t talk to the database. They don’t issue queries. They don’t even produce the final job record directly. What they do is fill out a form---the same form a human curator fills out in our internal Manage app. We call this the Form pattern.

The Form is a virtual representation of the curator UI, exposed to the agent as tools.

form.set_shift_type(value, reason="job description says '3x12 nights'")
form.set_bill_rate(value, reason="VMS rate field is 78.50/hr")
form.add_certification(code="BLS", reason="required in original posting")
form.submit()

The Form is seeded from the existing job snapshot, so the agent starts where a human curator would start, with whatever fields are already populated. From there, the agent can search (looking up canonical units, certifications, EMR codes), mutate (writing to specific form fields), and at the end, call one terminal tool: submit.

Submit is where the pattern pays off. The Form submits to the same validation layer the human curator UI uses. If the model wrote an invalid combination of fields, the validation error comes back to the agent on the next turn---the same error message a human curator would see in the UI. Wrong nurse-to-patient ratio for that unit type? Same error. The agent gets to correct itself, exactly the way a human curator clicks Save, sees a red error, and tries again.
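A rough sketch of that round-trip, assuming a validate_job function shared with the curator UI; all names here are illustrative:

class Form:
    def __init__(self, job_snapshot):
        # Seeded from the existing snapshot, like the human curator's screen
        self.fields = dict(job_snapshot.fields)
        self.pending = []  # mutations accumulate here until submit

    def set_field(self, name, value, reason):
        self.pending.append((name, value, reason))  # nothing persists yet

    def submit(self):
        draft = {**self.fields, **{name: value for name, value, _ in self.pending}}
        errors = validate_job(draft)    # the same rules the human UI runs on Save
        if errors:
            return errors               # nothing is written; the agent sees the
                                        # error next turn and can try again
        audit_log.record(self.pending)  # each mutation with its model-supplied reason
        self.fields = draft             # only a fully valid state ever lands
        return None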

Five things follow from the pattern.

The agent never persists a half-correct state. Mutations accumulate in the Form until submit. A failed validation rolls back the entire submission attempt. We don’t end up with jobs in production that are almost-right-but-wrong-in-one-field.

The validation contract is shared with the human UI. There is no parallel set of rules for “what AI is allowed to do.” By construction, the agent can only do what the curator UI lets a human do. Constraints live in one place.

Every mutation comes paired with a model-supplied reason. When an auditor later asks why the system pinned a job to one unit instead of another, the audit log has the agent’s answer in its own words, attached to the specific tool call that set the field. An actual entry, lightly cleaned:

form.set_clinical_unit(
  "Medical-Surgical ICU",
  reason="VMS unit string 'ICU-MED-3W' matched canonical unit via taxonomy lookup; confirmed by shift description mentioning med-surg patients and 1:3 nurse-to-patient ratio."
)

The reason is the difference between a system that produced an answer and a system that can be held accountable for the answer.

Extensibility is local. Adding a new field is the same work as adding it to the human UI plus exposing one tool; see the sketch after this list. Adding a new agent for a new concern doesn't touch the other agents.

The pattern is portable. Anywhere a human currently fills out a structured form in our app is a place where this same architecture can run. Job curation is the first place we’ve put it; it will not be the last.
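To make the extensibility point concrete: with a hypothetical form_tool helper, exposing a new field (say, the holiday-coverage rule from the table above) can be a single registration, because validation and auditing already live on the Form:

@form_tool(name="set_holiday_coverage")
def set_holiday_coverage(form, value: bool, reason: str):
    # One new UI field, one new tool; the shared validation layer
    # and the audit log pick the field up with no extra wiring
    form.set_field("holiday_coverage", value, reason)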

The loop that keeps it honest

[Figure: the improvement wheel: Audit → Feedback → Eval → Prompt → Release, with a loop back from Release to Audit; each turn of the wheel compounds the next.]

Shipping an agentic system to production is not done at release. The release is the start of the loop, and the loop is what compounds quality over time.

Every auto-curated job is logged turn by turn: model, tokens, tool calls, results, reasons. Auditors review the auto-curated job alongside the original VMS snapshot and either approve it (thumbs up, sometimes with a small note) or correct it (thumbs down, with a note explaining what was wrong). Negative feedback represents a real production error. Positive feedback with a note usually represents minor drift---something a human noticed that wasn’t worth blocking the job, but was worth recording.
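The shape of one feedback record, roughly; the field names are illustrative:

feedback = {
    "job_id": "...",           # the auto-curated job under review
    "verdict": "thumbs_down",  # approve or correct
    "note": "Shift type should be nights; the description says '3x12 nights'.",
    "trace_id": "...",         # links back to the turn-by-turn agent log
}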

That feedback note is the most valuable artifact the system produces. It’s the only place where the human’s intent is rendered in natural language, attached to a specific real-world job, with the model’s full reasoning trace alongside it. We treat it as the single best source of ground truth on what good curation looks like.

The feedback flows into evals. Every eval case is a real production job, frozen at the input/output boundary: input is the raw VMS snapshot, expected output is the curated state a human eventually approved. The agents run against the frozen input. The output gets compared against the expected via three kinds of assertions: exact match where it should be exact, structural match where shape matters more than literal values, and LLM-judged equivalence where the output is correct as long as it’s semantically right.
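A sketch of what one frozen case might assert; the helpers and the case number are hypothetical:

def test_job_details_case_0142():
    # Input: the raw VMS snapshot, frozen at the moment it arrived
    output = run_agents(load_snapshot("evals/job_details/0142.json"))

    # Exact match where it should be exact
    assert output.bill_rate == 78.50

    # Structural match where shape matters more than literal values
    assert {"BLS"} <= set(output.certifications)

    # LLM-judged equivalence where semantic correctness is enough
    assert judge_equivalent(output.unit, "Medical-Surgical ICU")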

Evals live in the repo. They run locally on every prompt change. They’re part of the dev loop, not a separate QA stage. Four suites, one per agent. When a feedback note flags a regression, the regression becomes a new eval case before the prompt change goes back out.

One early case showed how the loop is supposed to feel. An auditor flagged a job that the unit-mapping agent had pinned to Telemetry when the right answer was Step-Down---a borderline case where two adjacent unit types share most of their criteria. The note was three sentences, but it was specific: the agent had over-weighted the presence of a cardiac-monitoring requirement and under-weighted the nurse-to-patient ratio in the original posting. We froze that job as a new eval case, sharpened the unit-mapping prompt to weigh ratio language explicitly, and re-ran the four suites. The new case passed; nothing else regressed; the change went out the same day. Multiply that pattern by a few hundred over six months and you get a system measurably better than it was at launch, by a path you can read in the eval history.

Prompt and tool updates feed the next release. Each turn of the wheel---audit, feedback, eval, prompt update, release---compounds the next.

Where we are

Our first vendor is live---one of the largest in our footprint. Every job from that relationship is being auto-curated end-to-end, with no manual intervention. The success rate over the last seven days is roughly 96%, measured against auditor thumbs-up versus thumbs-down on the resulting job.

A few honest things to say about that number.

Not every negative is critical. Some of the failures are minor drift---fields a human auditor would have caught and corrected, but that don’t affect whether the right clinician finds the right shift. The real range across the last few weeks has been 85 to 96%, depending on which slice you look at and how generously you score borderline cases.

There have been no regressions. Since going live at the end of March, the system has improved monotonically against the human-curated baseline that ran the queue before it.

The cost per curated job is roughly thirteen cents---the model spend plus a small amount of platform overhead. Multiplied by daily volume, the operating-cost line is a number we’d happily put in front of any CFO who’s ever priced a manual curation team.
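To make that concrete with round, illustrative numbers: at a thousand jobs a day, thirteen cents a job is about $130 a day, on the order of $4,000 a month.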

This is, internally, the first multi-turn reasoning agent we’ve deployed against marketplace operations in any health system relationship. It will not be the last.

What the Form pattern tells us about agents in production

Some of what we learned is specific to job curation. Most of it isn’t.

Decompose by concern, not by step. v1 tried to produce the entire curation result in one call. v2 has one agent per concern, and each agent sees only the prompt and tools relevant to its own concern. The decomposition reduces prompt surface area, makes failures localizable, and lets different engineers own different agents without conflicts.

Don’t let the agent touch the system of record directly. Give it a Form. The Form mediates between the agent and the database. Validation lives on the Form. Auditing lives on the Form. The agent acts; the Form enforces.

Share validation with the human UI. The strongest constraint on agent behavior is the constraint you already wrote for humans. Don’t build a second one. An agent that can do everything a curator can do---and only what a curator can do---is the agent you want in production.

Make every mutation explain itself. The reason field on every tool call has paid for itself ten times over. It’s the audit trail, the training signal for evals, and the diagnostic surface when something goes wrong.

Treat the feedback loop as the product. The agent at any given moment is a snapshot. The compounding asset is the loop: frozen eval cases, auditor notes, prompt iterations. The loop is what makes the next snapshot better than this one.

What’s next

One vendor is fully on autopilot. A second is close behind, already producing strong eval results from a smaller starting point. A third is queued. After that, the rest of our vendor footprint.

The same Form-pattern infrastructure runs each one. The work specific to each new vendor is mostly in the prompts, the unit-mapping data, and the rule taxonomies. The architecture stays.

--- Felipe, Engineering
