Evaluation-driven LLM systems in production

Production LLM systems accumulate technical debt fast. The first version works — until it doesn't. An output looks off on a new input type. A model update shifts behavior. Latency spikes on certain workflows. And the instinct is to tweak the prompt and see if it helps.

That is not a system. That is guessing with extra steps.

The shift I made was treating LLM workflows the way I treat software: experiment, measure, compare, decide. Every prompt change becomes an experiment run against a dataset. Every orchestration strategy is compared against others. Every model swap is evaluated, not assumed.

This article is about the evaluation infrastructure I built to make that possible — and what shipping it taught me.

The Problem

We had an LLM workflow that needed to produce reliable outputs across a wide variety of real-world inputs. The original pipeline worked on the happy path. Quality varied on edge cases. Some inputs produced weak outputs consistently. Manual prompt iteration helped sometimes, but it was slow, subjective, and not reproducible.

The core issue: we had no way to know whether a change made things better. Better for whom? Better on what inputs? Better by how much? Without those answers, "better prompts" is not engineering. It is intuition dressed up as work.

What we needed:

A curated dataset representing the real input distribution
Ground truth labels for what a correct output looks like
A repeatable way to run variants and compare them
Metrics that made comparison objective
Tracking to catch regressions and observe trends over time

The Core Inputs

Before you can run experiments, you need to define what you are experimenting on. We settled on four inputs as the foundation of the system.

Signature

The signature defines the LLM task as a structured contract: what fields go in, what fields come out, and how each field is described. In DSPy, this looks like:

import dspy


class ImageAttributeExtractor(dspy.Signature):
    """Analyze an image and extract structured creative attributes."""

    # Inputs: the image itself plus free-text context about its intended use.
    image: dspy.Image = dspy.InputField(desc="The input image")
    context: str = dspy.InputField(desc="Additional context about the image or intended use")
    # Outputs: the extracted attributes and the model's self-reported confidence.
    attributes: str = dspy.OutputField(desc="Extracted attributes as structured output")
    confidence: float = dspy.OutputField(desc="Confidence score for the extraction")

The signature is not a prompt. It is the schema. It separates what the task is from how it should be solved — which is the right separation when you are running experiments.

Prompt / Instructions

The natural language framing of the task: constraints, tone, examples, edge case guidance. This is what most people call "the prompt." In our system, it is one variable among several — not the whole system.

Dataset

A curated set of representative inputs collected from real or realistic workflows. This was the hardest input to get right. A dataset that does not reflect your actual input distribution will produce misleading results. We reviewed inputs from production, identified failure modes we already knew about, and ensured coverage across common cases, edge cases, and boundary conditions.

Ground Truth

Expected outputs or evaluation labels for each dataset example.

Without ground truth, the system becomes subjective prompt tweaking. You are comparing outputs that look slightly different with no way to know which is actually better. With ground truth, you can compute accuracy, detect regressions automatically, and track improvement over time with real signal.

The ground truth investment is what separates evaluation-driven development from iteration by feeling.
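As a minimal sketch of what ground truth makes possible, the snippet below scores a batch of predictions against labels and flags a drop against a stored baseline. The function names, the case-insensitive normalization, the sample labels, and the 0.02 tolerance are all illustrative assumptions rather than details of the system described here.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions equal to their ground truth label after normalization."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, labels))
    return hits / len(labels)


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when accuracy dropped by more than `tolerance` (an assumed threshold)."""
    return (baseline - current) > tolerance


# Made-up labels and predictions, for illustration only.
labels = ["portrait", "landscape", "product", "abstract"]
predictions = ["Portrait", "landscape", "product", "portrait"]

accuracy = exact_match_accuracy(predictions, labels)
print(f"accuracy={accuracy:.2f}, regression={is_regression(accuracy, baseline=0.90)}")
```

The same two functions can run in CI against every prompt or model change, which is what turns "it looks better" into a pass or fail signal.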

The Experiment Loop

[Figure: Evaluation system architecture. 01 Inputs (signature, prompt, dataset, ground truth) → experiment runner → 02 Program variants (dspy.Predict, dspy.ChainOfThought, CoT + self-context, CoT + self-context + self-consistency, LLM judge evaluator, retrieval + feedback loop) → 03 Model matrix (Model A, Model B, Model C) → 04 Evaluation (accuracy, error categories, judge scores, latency and cost, failure modes, run consistency) → 05 MLflow tracking (experiment runs, metrics dashboard, artifact logging, model/prompt comparison, failure breakdown, run history) → select best workflow → 06 Production (deploy LLM workflow, monitor outputs, feed failures back into the dataset).]

With inputs defined, we ran the same dataset through multiple LLM program variants and compared results across them. The variants represented different orchestration strategies — not just prompt changes.

dspy.Predict — Baseline

Direct prediction. Input goes in, output comes out. No scaffolding. Useful as a baseline to understand raw model capability before adding any orchestration overhead.

dspy.ChainOfThought

Adds an explicit reasoning step before the final output. The model articulates intermediate reasoning before producing the answer. For tasks requiring judgment or multi-step inference, CoT consistently outperformed direct prediction in our experiments.

ChainOfThought + Self-Context

Enriches the prompt context with additional information: prior outputs, user-specific state, workflow history, or derived intermediate results. The model reasons over more information, which helps on tasks where context matters significantly. Particularly useful when the task involves understanding patterns across a session or evolving workflow state.

ChainOfThought + Self-Context + Self-Consistency

Runs multiple generations independently and uses agreement or aggregation to produce a final output. Self-consistency is expensive because it multiplies inference calls, but it measurably improves reliability on tasks where single-pass outputs are unstable. For outputs that require high confidence, it is worth measuring whether the reliability gain justifies that cost.
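The voting idea can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's actual code: `self_consistent_answer` is a hypothetical helper, and `noisy_model` is a random stand-in for one stochastic LLM call.

```python
import random
from collections import Counter
from typing import Callable, Tuple


def self_consistent_answer(sample: Callable[[], str], n_samples: int = 5) -> Tuple[str, float]:
    """Call `sample` n_samples times and return (majority answer, agreement ratio)."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples


# Hypothetical stand-in for one stochastic LLM call; a real system would
# invoke the program variant here instead.
rng = random.Random(0)


def noisy_model() -> str:
    return rng.choice(["warm", "warm", "warm", "cool"])


answer, agreement = self_consistent_answer(noisy_model, n_samples=7)
print(answer, round(agreement, 2))
```

The agreement ratio is useful beyond picking the answer: a low ratio marks inputs where the model is unstable. The cost is visible in the signature, since `n_samples` is a direct multiplier on inference calls.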

LLM Judge as Evaluator

For tasks where exact matching is not a valid metric, we used a separate judge model to score outputs. The judge evaluates whether the output is correct, coherent, and within acceptable bounds. LLM judges introduce their own variance, so we used them selectively, only on tasks where structured scoring was insufficient and human-like judgment was genuinely needed.

Retrieval and Feedback Loop for Self-Context

Instead of static self-context, this variant enriched the program with examples drawn directly from the curated dataset, including failed cases from prior runs. In DSPy, this means adding those examples to data[] and letting the optimizer select demonstrations from them at compile time, so the compiled program carries them as few-shot context on every call:

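import dspy

# curated_examples, eval_metric, and program are defined elsewhere in the pipeline.
data = [
    dspy.Example(
        image=ex.image,
        context=ex.context,
        attributes=ex.attributes,
        confidence=ex.confidence,
    ).with_inputs("image", "context")  # mark the input fields; the rest act as labels
    for ex in curated_examples  # includes prior failure cases
]

# Select few-shot demonstrations from data, keeping bootstrapped traces that pass eval_metric.
optimizer = dspy.BootstrapFewShot(metric=eval_metric)
compiled = optimizer.compile(program, trainset=data)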

This created a feedback loop: production failures were added back into data[], so future compiled programs had direct visibility into the failure modes the system had already encountered.

Each of these variants ran across multiple models. The result was a matrix of (strategy × model) experiments against the same dataset — a comparable view of what actually works, not what looks reasonable.
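A rough sketch of such a matrix runner follows. It assumes each strategy is a builder that takes a model name and returns a callable program. The toy builders `build_upper` and `build_echo` and the three-example dataset are placeholders, where a real run would construct DSPy modules bound to each model.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# A "program" here is any callable mapping an input string to an output string.
Program = Callable[[str], str]


def run_matrix(
    strategies: Dict[str, Callable[[str], Program]],
    models: List[str],
    dataset: List[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Score every (strategy, model) pair on the same (input, expected) dataset."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    results = {}
    for (strategy_name, build), model in product(strategies.items(), models):
        program = build(model)  # bind the strategy to a concrete model
        correct = sum(program(x) == expected for x, expected in dataset)
        results[(strategy_name, model)] = correct / len(dataset)
    return results


# Toy stand-ins: real builders would return programs configured for `model`.
def build_upper(model: str) -> Program:
    return lambda text: text.upper()


def build_echo(model: str) -> Program:
    return lambda text: text


dataset = [("a", "A"), ("b", "B"), ("C", "C")]
scores = run_matrix({"upper": build_upper, "echo": build_echo}, ["model-a", "model-b"], dataset)
for (strategy, model), accuracy in sorted(scores.items()):
    print(f"{strategy:>6} x {model}: {accuracy:.2f}")
```

Because every cell is scored on the identical dataset, differences between cells can be attributed to the strategy or the model rather than to the inputs.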

What We Measured

Accuracy was the primary metric, but not the only one.

Task-specific correctness. Does the output match the expected answer? For structured outputs, this was exact or near-exact matching. For open-ended outputs, we relied on LLM judge scores.

Failure categories. We grouped errors by type: wrong classification, hallucinated fields, format violations, edge case failures. Knowing why the model fails is as important as knowing how often. Failure categories told us which variants to investigate and which orchestration changes were worth trying.
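One way to turn failures into countable categories is a small classifier over raw outputs. The sketch below assumes JSON outputs with a fixed two-field schema. `REQUIRED_FIELDS`, the category names, and the sample outputs are illustrative, not the taxonomy used in this system.

```python
import json
from collections import Counter
from typing import Optional

REQUIRED_FIELDS = {"style", "palette"}  # assumed schema, for illustration only


def categorize_failure(raw_output: str, expected: dict) -> Optional[str]:
    """Return a failure category, or None when the output matches the expected answer."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return "format_violation"
    if not isinstance(parsed, dict):
        return "format_violation"
    if set(parsed) - REQUIRED_FIELDS:
        return "hallucinated_field"  # the model invented a field outside the schema
    if REQUIRED_FIELDS - set(parsed):
        return "format_violation"  # a required field is missing
    if parsed != expected:
        return "wrong_classification"
    return None


expected = {"style": "minimal", "palette": "warm"}
outputs = [
    '{"style": "minimal", "palette": "warm"}',
    '{"style": "ornate", "palette": "warm"}',
    '{"style": "minimal", "palette": "warm", "mood": "calm"}',
    "style: minimal",
]
categories = (categorize_failure(o, expected) for o in outputs)
breakdown = Counter(c for c in categories if c is not None)
print(dict(breakdown))
```

Logging the per-category counts alongside accuracy shows whether a variant reduced errors overall or just moved them from one category to another.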

Model comparison. The same strategy with different models sometimes produced meaningfully different results. Tracking this separately let us make model selection decisions on evidence rather than assumption.

Latency and cost. Self-consistency and retrieval variants are more expensive per call. We tracked latency and estimated cost per variant to evaluate whether accuracy gains justified the added overhead.

Consistency across repeated runs. For non-deterministic outputs, we ran each variant multiple times to measure variance. High variance is a production risk signal — even if average accuracy is acceptable, unstable outputs create downstream problems.
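Run-to-run consistency can be summarized with a mean and a standard deviation per variant. In this sketch `run_stability` is a hypothetical helper and the accuracy numbers are invented to show the shape of the comparison; they are not results from these experiments.

```python
from statistics import mean, pstdev
from typing import Dict, List


def run_stability(accuracies_by_variant: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Summarize repeated-run accuracy per variant as mean and population std dev."""
    return {
        name: {"mean": mean(runs), "std": pstdev(runs)}
        for name, runs in accuracies_by_variant.items()
    }


# Invented numbers, for illustration only.
runs = {
    "predict": [0.70, 0.78, 0.62],
    "cot_self_consistency": [0.81, 0.82, 0.80],
}
summary = run_stability(runs)
for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.3f} std={stats['std']:.3f}")
```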

The Important Production Lesson

One of the LLM calls in our workflow returned a percentage value — a numeric output representing a score or confidence level. The model could reason well around the problem. But the actual number it produced was unstable: similar inputs would generate outputs spread across a wider range than was acceptable for the use case.

The instinct is to fix this with a better prompt. We tried. It helped marginally.

What actually solved it was changing the workflow structure.

We added a self-consistency layer specifically for this call: run the inference multiple times, collect the distribution of outputs, and apply a light aggregation step to produce a stable final value. The spread across runs also became a useful diagnostic signal — high variance indicated low model confidence, which fed into downstream logic for how aggressively to act on the output.
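A sketch of that layer, under assumptions: the aggregation used here is a median and the spread is a population standard deviation, neither of which is confirmed as the exact choice in this system. `stable_percentage`, `max_spread`, and the sample values are hypothetical.

```python
from statistics import median, pstdev
from typing import Callable, Tuple


def stable_percentage(
    sample: Callable[[], float], n_samples: int = 5, max_spread: float = 10.0
) -> Tuple[float, float, bool]:
    """Return (median value, spread, is_confident) over repeated numeric outputs."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Clamp each sample to the valid percentage range before aggregating.
    values = [min(100.0, max(0.0, sample())) for _ in range(n_samples)]
    spread = pstdev(values)
    # A wide spread is treated as low confidence (max_spread is an assumed cutoff).
    return median(values), spread, spread <= max_spread


# Stand-in for repeated LLM calls: one outlier among otherwise similar scores.
samples = iter([40.0, 42.0, 41.0, 90.0, 43.0])
value, spread, confident = stable_percentage(lambda: next(samples), n_samples=5)
print(value, round(spread, 1), confident)
```

The median ignores the single outlier that would have skewed a mean, and the `confident` flag gives downstream logic a way to act less aggressively when the runs disagree.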

In production LLM systems, sometimes the best improvement is not a better prompt. It is a better control loop around the model.

This is the lesson that does not come from benchmark papers. The model was capable. The single-pass output was not reliable. The fix was structural, not lexical.

Tracking Experiments with MLflow

Every experiment run was logged to MLflow. Without experiment tracking, you lose the ability to compare variants, catch regressions, or understand optimization trends over time. A prompt that looked good two weeks ago may be quietly regressing — without logging, you will not know until something breaks in production.

We tracked:

Accuracy per run, per variant, per model
Error categories as structured metrics
LLM judge scores as logged artifacts
Latency and cost estimates per variant
Prompt and signature versions as artifacts, not just inline strings
Run history to detect regressions across time
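The items above map onto MLflow params, metrics, and artifacts. The following is a hedged sketch rather than the logging code from this system: `build_run_payload` and `log_run` are hypothetical helpers, the experiment name and values are placeholders, and `log_run` needs the `mlflow` package installed (it is imported lazily so the pure helper runs without it).

```python
from typing import Any, Dict


def build_run_payload(
    strategy: str,
    model: str,
    prompt_version: str,
    accuracy: float,
    latency_ms: float,
    failure_counts: Dict[str, int],
) -> Dict[str, Dict[str, Any]]:
    """Split one experiment result into MLflow-style params and numeric metrics."""
    metrics = {"accuracy": accuracy, "latency_ms": latency_ms}
    # Error categories become one numeric metric per category.
    metrics.update({f"failures.{name}": float(n) for name, n in failure_counts.items()})
    params = {"strategy": strategy, "model": model, "prompt_version": prompt_version}
    return {"params": params, "metrics": metrics}


def log_run(payload: Dict[str, Dict[str, Any]], prompt_text: str,
            experiment: str = "llm-eval") -> None:
    """Log one run to MLflow (requires the mlflow package)."""
    import mlflow  # lazy import: only needed when actually logging

    mlflow.set_experiment(experiment)
    params = payload["params"]
    with mlflow.start_run(run_name=f"{params['strategy']}-{params['model']}"):
        mlflow.log_params(params)
        mlflow.log_metrics(payload["metrics"])
        mlflow.log_text(prompt_text, "prompt.txt")  # prompt stored as an artifact


# Placeholder values, for illustration only.
payload = build_run_payload(
    strategy="cot", model="model-a", prompt_version="v3",
    accuracy=0.8, latency_ms=1200.0, failure_counts={"format_violation": 2},
)
print(payload)
```

Calling `log_run(payload, prompt_text)` once per cell of the experiment matrix yields runs that can be filtered and compared by strategy, model, and prompt version in the MLflow UI.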

The dashboard views we used most often:

Accuracy by DSPy strategy across all models
Failure category breakdown per variant
Latency vs. accuracy tradeoff across the variant matrix
Run history to identify when accuracy shifted after a change

Experiment tracking turns optimization from a local activity into a shared, auditable record. When someone asks "why did you choose this workflow?", the answer is a logged run with metrics — not "it felt right."

Shipping the Best Workflow to Production

After running the full experiment matrix, we selected the best-performing variant based on the combined view of accuracy, failure analysis, latency, and cost. For our use case, CoT + self-context with targeted self-consistency on the numeric output call was the winning combination.
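One simple way to encode a "combined view" is to treat latency and cost as budgets and maximize accuracy within them. That rule is an assumption made for this sketch, not necessarily how the decision was made here. The `VariantResult` fields and all numbers below are invented placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariantResult:
    name: str
    accuracy: float
    p95_latency_s: float
    cost_per_call_usd: float


def select_variant(
    results: List[VariantResult], max_latency_s: float, max_cost_usd: float
) -> Optional[VariantResult]:
    """Pick the most accurate variant that fits both budgets; None if none qualifies."""
    eligible = [
        r for r in results
        if r.p95_latency_s <= max_latency_s and r.cost_per_call_usd <= max_cost_usd
    ]
    return max(eligible, key=lambda r: r.accuracy, default=None)


# Invented numbers, for illustration only.
results = [
    VariantResult("predict", 0.71, 0.8, 0.002),
    VariantResult("cot", 0.80, 1.5, 0.004),
    VariantResult("cot_self_consistency", 0.86, 6.0, 0.020),
]
best = select_variant(results, max_latency_s=3.0, max_cost_usd=0.01)
print(best.name if best else "no variant fits the budget")
```

Making the budgets explicit parameters keeps the decision auditable: a different latency ceiling produces a different winner, and the reason is recorded in the call.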

The selected workflow was deployed to production. But the eval system did not stop mattering at launch.

Production failures — outputs that were wrong or flagged by downstream validation — were routed back into the evaluation dataset. Over time, the dataset became a living record of real failure modes encountered in production, not just theoretical edge cases. Every future experiment ran against a richer, more grounded set of examples.
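A minimal sketch of that feedback step, assuming each example is a dict with an `inputs` mapping. The deduplication by input hash and the `source` tag are illustrative choices, and `merge_failures` is a hypothetical helper.

```python
import hashlib
import json
from typing import Dict, List


def example_key(example: Dict) -> str:
    """Stable fingerprint of an example's inputs, used to skip duplicates."""
    canonical = json.dumps(example["inputs"], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_failures(dataset: List[Dict], failures: List[Dict]) -> List[Dict]:
    """Return the dataset plus any failures whose inputs are not already present."""
    seen = {example_key(ex) for ex in dataset}
    merged = list(dataset)  # copy so the original dataset is left untouched
    for failure in failures:
        key = example_key(failure)
        if key not in seen:
            seen.add(key)
            merged.append({**failure, "source": "production_failure"})
    return merged


base = [{"inputs": {"context": "banner"}, "expected": "x"}]
fails = [
    {"inputs": {"context": "banner"}, "expected": "y"},  # duplicate input, skipped
    {"inputs": {"context": "icon"}, "expected": "z"},
]
merged = merge_failures(base, fails)
print(len(merged), merged[-1]["source"])
```

Tagging the origin of each example makes it possible to report accuracy separately on production-derived cases, which are often the hardest slice of the dataset.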

The eval loop is not a pre-deployment benchmark. It is part of the production development cycle.

Key Engineering Decisions

Treat prompts as versioned experiment artifacts, not config strings
Evaluate against representative datasets, not against intuition about what looks good
Compare orchestration strategies, not just models — the control structure around the model matters as much as the model itself
Use self-consistency when single-pass outputs are unstable, not as a default
Use LLM judges only where deterministic scoring is genuinely insufficient
Track every experiment run in MLflow, including failed runs
Feed production failures back into the eval dataset continuously
Optimize for reliability on your specific task and input distribution, not raw benchmark performance

Tradeoffs

None of this is free, and most of these decisions involve real trade-offs.

LLM judges are useful for qualitative evaluation but introduce their own variance and can reinforce biases in the judging model. We validated judge scoring against human-labeled examples before relying on it, and used judges selectively — not as a default replacement for structured metrics.
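Validating a judge can start with a plain agreement rate against human labels. In the sketch below `judge_agreement` is a hypothetical helper, the verdicts are made up, and how much agreement is enough before trusting the judge is a decision the example leaves open.

```python
from typing import Sequence


def judge_agreement(judge_verdicts: Sequence[bool], human_verdicts: Sequence[bool]) -> float:
    """Share of examples where the judge's pass/fail verdict matches the human label."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not human_verdicts:
        raise ValueError("need at least one labeled example")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)


# Made-up verdicts, for illustration only.
judge = [True, True, False, True, False, False, True, True]
human = [True, False, False, True, False, True, True, True]
agreement = judge_agreement(judge, human)
print(f"judge/human agreement: {agreement:.2f}")
```

Raw agreement can look high when most outputs pass, so it is worth also inspecting the disagreements themselves before relying on the judge.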

Self-consistency improves reliability but multiplies inference cost and latency. It is worth the investment on high-stakes, unstable outputs. It is not worth it on every call in a workflow.

Retrieval-based self-context improves context relevance but requires ongoing corpus maintenance. Stale or irrelevant retrieved examples degrade output quality. The garbage-in problem applies here as much as anywhere.

Bigger models often score higher on accuracy but cost more and run slower. We weighed accuracy gains against cost for each call in the workflow, rather than picking one best model and applying it uniformly.

Dataset maintenance is ongoing. Production input distributions drift. A dataset that accurately represented inputs at launch will be partially stale six months later. The eval infrastructure is only as good as the data feeding it.

Final Takeaways

Production LLM development is not about finding the best model. It is about building the best system around the model — and having the measurement infrastructure to know when you have.

Evals turn prompt iteration into engineering. Without ground truth and a repeatable test loop, you are iterating in the dark.
DSPy is useful because it treats LLM workflows as programs. The signature, orchestration strategy, and prompt are separate concerns you can experiment on independently.
MLflow makes experimentation visible. Shared, logged, comparable results replace individual intuition with something closer to institutional knowledge.
Self-consistency and retrieval are workflow design choices. The model does not change; the control loop around it does.
The eval loop continues after launch. Production failures are data. The best systems treat them that way.

LLM reliability comes from the system, not the model.