GLaDOS: the LLM layer

Sep 18, 2024 · 9 min
marketplace · AI

The first AI feature you ship in a Rails app usually lives in a service class. There’s an HTTP client, a prompt assembled inline, a JSON.parse wrapped in a rescue, a couple of retry blocks, and a Rails.logger.info so you can find the response in production. It works. You move on.

The second feature lives somewhere else. It has its own service class, its own retry policy, a slightly different error handler, a different log line, a slightly different way of squeezing structured output out of the model. By the fifth feature, you have five different patterns for the same five problems: retrying, logging, validating shape, surfacing failures to a human, chaining one call into the next. None of them are wrong, and none of them are the same.

This is how almost every Rails app with AI features evolves, because writing an integration directly against the provider SDK is fast to ship and slow to maintain. Every new feature pays the same setup cost. Every production incident plays out a little differently depending on which feature broke. The moment you want to switch providers, or fan out to a second one, the work is spread across the codebase in a dozen shapes.

We built our way out of it. The result is GLaDOS, an internal pack that handles every LLM call across our platform.

The prompt as a first-class object

[Figure: The GLaDOS prompt lifecycle. Define a prompt class → call it with variables → provider request → structured response → schema validation → OperationLog write → on_success callback. Transient failures loop back through retries; hitting max retries falls through to on_failure.]

The core abstraction in GLaDOS is Glados::Prompt. A feature defines a subclass, configures the provider and temperature, writes a system prompt and a user prompt, optionally declares a JSON schema for structured output, and calls it. The framework handles everything else.

class QualificationPrompt < Glados::Prompt
  config provider: :gemini, temperature: 0.0

  def user_prompt
    <<~PROMPT
      Evaluate whether this clinician meets the requirement.

      Requirement: #{requirement.text}
      Clinician evidence: #{evidence.as_json}
    PROMPT
  end

  json_schema do
    object do
      string :verdict, enum: %w[pass fail unknown]
      string :reason
    end
  end

  def self.on_success(result, qualification_ai_execution_id:, **)
    execution = QualificationAIExecution.find(qualification_ai_execution_id)
    QualificationAI::Callback::OnSuccess.new(execution:, result:).call
  end
end

That’s the whole feature-side surface. Building the request, calling the provider, validating the response, retrying on transient failures, writing the operation log, calling back into application code on success or failure --- all of it lives inside the framework. The author of QualificationPrompt writes the prompt and the schema. Nothing else.

Two execution modes sit on the same prompt class. .call runs synchronously and returns the validated result. .call_async enqueues a Sidekiq job that runs the prompt and dispatches the on_success callback when it lands. Same class, same definition. The choice is per call site.
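For concreteness, the two call sites might look like this --- a sketch, assuming prompt variables are passed as keyword arguments (as QualificationPrompt above suggests) and extra keywords are echoed back into on_success; the exact signatures are illustrative:

# Synchronous: returns the validated, schema-conforming result.
result = QualificationPrompt.call(requirement:, evidence:)
result["verdict"] # => "pass", "fail", or "unknown"

# Asynchronous: enqueues a Sidekiq job; on_success fires when the
# response lands, with the extra keywords handed back.
QualificationPrompt.call_async(
  requirement:,
  evidence:,
  qualification_ai_execution_id: execution.id
)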

Every execution produces an OperationLog row (Glados::OperationLog). The log carries the full request, the full response, the provider that served it, the latency, the token counts, and a pointer to a parent log if the call is part of a multi-step pipeline. Every AI feature in the codebase produces the same shape of audit trail, because they all go through the same executor.
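As an illustration of what that buys you when debugging --- the accessor names below are assumptions standing in for the fields just listed, not the real schema:

log = Glados::OperationLog.find(log_id)
log.request   # full request payload sent to the provider
log.response  # full response, verbatim
log.provider  # which provider served the call
log.parent_id # previous step, when the call is part of a pipeline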

The first time you debug a flaky AI feature in production, you appreciate that every feature’s history lives in one table with one schema. The hundredth time, you stop noticing. That’s the point.

Tools as classes, agents as loops

Single-turn prompts cover most cases. Some features need the model to call out, get information, and decide what to do next. GLaDOS supports that through .agentic_call, a multi-turn loop that runs the prompt against the model, executes any tool calls the model returns, feeds the results back in, and continues until the model emits a final answer or hits a max-iterations cap.

Tools are first-class. Each is a Glados::Tools::Function subclass with a description, a typed parameter schema, and a call class method. The framework handles the request-tool-result-request handshake.

class FindFacility < Glados::Tools::Function
  description "Search the verified facility database by name, city, and state."

  parameters do
    string :name, description: "Facility name as entered by the clinician."
    string :city, description: "City to narrow the search.", required: false
    string :state, description: "Two-letter state code.", required: false
    integer :page, description: "Page number, 25 results per page.", required: false
  end

  def self.call(name:, city: nil, state: nil, page: 1)
    Facility.verified.search(name:, city:, state:).page(page).limit(25).map(&:as_match)
  end
end

The production example sits behind our work history flow. A clinician types a facility name, the agent calls FindFacility, and the tool returns up to 25 verified facilities. If the first search doesn’t resolve a single confident match, the agent refines: different name, add a city, narrow by state, page through results, until it converges on the facility or determines no match exists. The tool call, the result, and the agent’s next turn are all stitched together in OperationLog, parent-and-child, so the whole trajectory is replayable from one row in the database.
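Wiring that up might look something like the sketch below. FacilityMatchPrompt, the tools: option, and the way the raw input reaches the prompt are illustrative, not the exact internal API:

class FacilityMatchPrompt < Glados::Prompt
  config provider: :gemini, temperature: 0.0, tools: [FindFacility]

  def user_prompt
    "Match this facility to the verified database: #{facility_name}"
  end
end

# Runs the multi-turn loop: prompt, tool calls, tool results, repeat,
# until the model emits a final answer or hits the iteration cap.
result = FacilityMatchPrompt.agentic_call(facility_name: raw_input)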

We’ve written about why constraining agents through the same boundaries humans use is the right architectural choice in Agents that fill out forms. Tools-as-classes is the same idea on the inbound side. The agent doesn’t get a free SQL connection or a raw HTTP client. It gets a small, named set of functions with declared parameters, the same way a junior engineer gets a small, named set of methods on a service object.

Two providers, one interface

GLaDOS speaks to two model providers today through a factory pattern. The prompt class declares which one it wants. Credential verification in particular has dual configurations and chooses at call time based on the credential type or current system load.

From the feature side, the provider is an implementation detail. Swapping it is a one-line change in the prompt class. The calling code doesn’t change. The schema doesn’t change. The callbacks don’t change.

This sounds like over-engineering when you have one feature and feels like the cheapest win in the codebase when you have ten. The day a provider rate-limits you, or a new model lands that’s better-and-cheaper for a specific task, you make the change in the prompt class and the rest of the system carries on.
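The change itself really is one line. A sketch --- :other_provider is a stand-in, since the second provider isn’t named here:

class QualificationPrompt < Glados::Prompt
  config provider: :other_provider, temperature: 0.0 # was provider: :gemini

  # Prompt, schema, and callbacks stay exactly as they were.
end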

Six features, one framework

[Figure: The GLaDOS feature map. Six features under one GLaDOS banner: credential verification (extract, match, verify chain), qualification (rules feeding AI evaluation), interaction summary (call, SMS, and email channels), autocuration of ingested jobs, professional summary built from clinician experience, and work history facility matching via agentic tool calls.]

GLaDOS isn’t theoretical. Six distinct features in our platform run through it today.

Feature: Credential verification
What GLaDOS provides: Three-step pipeline: extract, match, verify, chained via with_previous_log so step two reads step one’s OperationLog without anyone passing context by hand. The same shape handles skills checklists and reference forms.
Prompt classes: Three (extract, match, verify)
Sync or async: Async
Distinguishing detail: Pipeline chaining across OperationLog parent/child links.

Feature: Qualification
What GLaDOS provides: QualificationPrompt evaluation for the ambiguous rules a deterministic engine can’t decide --- a free-text experience field against a structured requirement, a license format we haven’t seen before. Verdict plus reason returned as structured JSON.
Prompt classes: QualificationPrompt
Sync or async: Sync
Distinguishing detail: The AI half of the rules-and-AI architecture in Automatic Qualifications.

Feature: Interaction summarization
What GLaDOS provides: After a call, SMS thread, or email exchange wraps up, SummarizationPrompt generates a structured summary attached to the clinician record. Advocates see it in the internal ops surface alongside the raw history.
Prompt classes: SummarizationPrompt
Sync or async: Async
Distinguishing detail: Surfaces alongside the interaction record; makes the record useful at a glance.

Feature: Autocuration
What GLaDOS provides: When a new job lands from a vendor feed, JobAICurationPrompt fills missing fields including specialty inference, shift parsing, and certification requirements. Structured output maps directly onto the job record.
Prompt classes: JobAICurationPrompt
Sync or async: Async
Distinguishing detail: Output maps onto the job record without translation.

Feature: Professional summary generation
What GLaDOS provides: When a qualification completes, ProfessionalSummaryPrompt assembles the professional summary section of the submission packet from the clinician’s experience data. Certifications, charting systems, education, and trauma experience come back in schema order.
Prompt classes: ProfessionalSummaryPrompt
Sync or async: Async
Distinguishing detail: The packet generator drops the output straight in.

Feature: Work history facility matching
What GLaDOS provides: The agentic loop in action. A clinician types a facility, the agent matches it against the verified facility database via the FindFacility tool, and the resolved facility ID ends up on the work history row.
Prompt classes: Agentic (.agentic_call with FindFacility)
Sync or async: Sync
Distinguishing detail: Multi-turn tool use; full trajectory replayable from one OperationLog row.

The thing to notice across this list is what isn’t in it. Six features and no per-feature retry policy, no per-feature log format, no per-feature provider client, no per-feature audit table. Every one inherits the same infrastructure and produces the same shape of operation log.

Feedback comes free

A side effect of the shared interface is that human feedback infrastructure plugs in once. The QualificationAIExecution has a feedback record. The credential AI verification has an assessment record. The shape is consistent because the executions are consistent.

That consistency means every feature contributes to one signal stream. When an advocate overrides a qualification verdict, the structured feedback lands the same way as when an ops reviewer corrects a credential extraction. Model evaluation, prompt iteration, eval-set building --- all of it works against a unified corpus of production signal, not a stack of feature-specific logs that have to be reconciled before anything useful comes out.

We’re not done with this part. Real work remains on closing the loop tightly enough that prompt changes are evaluated against the full feedback corpus before they ship. But the foundation (one log table, one feedback shape, one place to look) made that work tractable to plan, where on five separate integrations it would have been a quarter of yak-shaving before model evaluation could start.

Why it compounds

The argument for building a framework before you need one is usually wrong. The first time you write a piece of infrastructure, you don’t know what it’s actually for. You guess, the guess turns out to be the wrong shape, and the framework becomes the thing every feature works around.

We waited. The first three AI features at Trusted didn’t run on GLaDOS, because GLaDOS didn’t exist yet. Each one shipped as its own service-class integration, and each one taught us something specific about what the shared interface needed to handle: structured output validation, the retry-on-transient-failure pattern, the async path for long-running calls, the multi-step chain. By the time we extracted the framework, we had three production integrations’ worth of opinions about what it should look like.

The compounding shows up afterward. The first prompt class in GLaDOS gets you, with no extra work: logging, retries, structured output validation, async execution via Sidekiq, multi-turn conversation support, tool calling, and the audit trail. The tenth prompt class gets all of that plus the ability to read the operation log of an earlier step in the same pipeline. Our credential verification pipeline --- extract, then match, then verify --- works because each step can see the previous step’s context without anyone passing it explicitly.
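A sketch of what that chaining might look like on the feature side --- the with_previous_log name comes from the feature list above, while the class names and the way a result exposes its log are assumptions:

# Step one runs and writes its OperationLog.
extraction = CredentialExtractPrompt.call(document:)

# Step two declares its parent; the framework surfaces step one's
# context from OperationLog without the caller passing it by hand.
match = CredentialMatchPrompt
  .with_previous_log(extraction.operation_log)
  .call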

The cost of the eleventh AI feature is the cost of writing one prompt class and one schema. The first ten paid for the framework. The eleventh inherits it.

--- Engineering
