Progressical

The business case for harness engineering

A better harness on a cheaper model often beats a worse harness on a more expensive one.

The Meta-Harness paper found 6× performance variance from harness changes alone — on the same model, same benchmark. That gap translates directly into cost and quality. Here's how.

Meta-Harness paper · Stanford / MIT · 2026

An optimization loop with access to scores plus raw execution traces reached 56.7% best accuracy. The version with scores plus summaries reached 38.7% — worse than scores alone.

Same model. Same benchmark. Different harness. The performance gap from harness changes alone was larger than the gap between generations of frontier models.

How harness engineering pays

Harness optimization works through two levers.

Lever 1 · Cost reduction

Cut LLM spend without cutting quality.

LLM API spend is primarily a function of context window usage. Most production harnesses include 20–40% of context that adds no quality signal — over-broad retrieval chunks, redundant conversation history, verbose system prompts assembled under deadline pressure.

A tuned harness reduces average token cost by 15–35% without measurable quality loss.

Example: a team spending $12,000/month often has $2,000–$4,000 of recoverable spend from over-broad retrieval and compression policies never revisited after launch.
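
One way to sanity-check a figure like this is to tally average tokens per context component and estimate how much of each could be trimmed. The sketch below is illustrative only: the component names, token counts, and removable fractions are made-up placeholders rather than measured data, and it assumes spend scales roughly with input tokens (output tokens are ignored).

```python
from dataclasses import dataclass


@dataclass
class ContextComponent:
    name: str
    avg_tokens: int            # average tokens this component adds per request
    removable_fraction: float  # hypothetical share removable without quality loss


def recoverable_share(components):
    """Fraction of total context tokens estimated to be removable."""
    total = sum(c.avg_tokens for c in components)
    removable = sum(c.avg_tokens * c.removable_fraction for c in components)
    return removable / total


# Placeholder numbers for illustration; replace with values from your own traces.
components = [
    ContextComponent("system_prompt", 1_000, 0.20),
    ContextComponent("retrieval_chunks", 4_000, 0.40),
    ContextComponent("conversation_history", 3_000, 0.20),
    ContextComponent("user_message", 2_000, 0.00),
]

monthly_spend = 12_000  # dollars per month, as in the example above
share = recoverable_share(components)
recoverable_dollars = round(monthly_spend * share)
print(f"Estimated recoverable share: {share:.0%} (~${recoverable_dollars:,}/month)")
```

The removable fractions are the part that needs evidence: each one should be checked against an eval set before the corresponding context is actually cut.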

Lever 2 · Quality improvement

Improve retention without changing the model.

The Meta-Harness paper measured what happens when you change only the harness — same model, same benchmark. The difference between the best and worst harness configuration was 6×.

That variance translates to product outcomes. A 2–5 percentage point improvement in 7-day retention is typical after a Rebuild on a consumer AI feature.

Sample result

12% → 17%

7-day retention · consumer mental health app · pilot in progress

Estimate your numbers


Your numbers

Monthly LLM spend: $8,000 (range $1,000 to $50,000)

Baseline retention: 12% (range 3% to 40%)

Users per month: 5k (range 500 to 100k)

Estimated outcomes

Conservative token savings

15% reduction in context window usage

$1.2k/mo

Typical token savings

30% reduction in context window usage

$2.4k/mo

Rebuild payback period

$35k ÷ savings midpoint

19.4 months

Conservative retention lift

+100 retained users/mo

12% → 14%

Typical retention lift

+200 retained users/mo

12% → 16%

Token savings: spend × 0.15 (conservative) to × 0.30 (typical). Payback: $35,000 ÷ monthly savings midpoint. Retention lift estimated from Progressical pilot data. Actual results depend on your harness, metric, and user segment.
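
The arithmetic behind these estimates can be sketched as follows. The savings multipliers and the $35,000 Rebuild price come from the text above; the function name, the parameter names, and the fixed +2 and +4 percentage-point retention lifts are assumptions inferred from the displayed example (12% → 14% and 12% → 16%), not a published formula.

```python
REBUILD_PRICE = 35_000         # Rebuild engagement price in dollars
CONSERVATIVE_SAVINGS = 0.15    # share of monthly spend saved, conservative case
TYPICAL_SAVINGS = 0.30         # share of monthly spend saved, typical case
CONSERVATIVE_LIFT_POINTS = 2   # assumed retention lift, in percentage points
TYPICAL_LIFT_POINTS = 4        # assumed retention lift, in percentage points


def estimate_outcomes(monthly_spend, retention_pct, users_per_month):
    """Rough order-of-magnitude estimates; not a substitute for an audit."""
    conservative = monthly_spend * CONSERVATIVE_SAVINGS
    typical = monthly_spend * TYPICAL_SAVINGS
    midpoint = (conservative + typical) / 2
    return {
        "conservative_savings": conservative,
        "typical_savings": typical,
        "payback_months": REBUILD_PRICE / midpoint,
        "conservative_retention_pct": retention_pct + CONSERVATIVE_LIFT_POINTS,
        "typical_retention_pct": retention_pct + TYPICAL_LIFT_POINTS,
        "conservative_retained_users": users_per_month * CONSERVATIVE_LIFT_POINTS / 100,
        "typical_retained_users": users_per_month * TYPICAL_LIFT_POINTS / 100,
    }


# Default inputs shown above: $8,000/month spend, 12% retention, 5k users/month.
result = estimate_outcomes(monthly_spend=8_000, retention_pct=12, users_per_month=5_000)
for key, value in result.items():
    print(f"{key}: {value:,.1f}")
```

With the default inputs this gives $1,200 and $2,400 in monthly savings, a payback of about 19.4 months, and 100 to 200 additional retained users per month, matching the figures shown.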

What the engagement price buys

What's included.

Audit

$15,000

Two weeks

Find out exactly where the money is going and why quality is lower than it should be. The prioritized fix list makes every subsequent decision cheaper.

Rebuild

$35,000

Four weeks

Fix the highest-cost signal leaks. Includes the eval set that proves the fix worked and catches the next regression automatically.


Operations

$5,000/month

Ongoing

Keep the improvements from drifting as models, traffic, and user behavior change.

How every engagement starts

Run a real audit, not a calculator.

The calculator gives you a rough order of magnitude. The audit gives you the actual numbers — grounded in your harness, your traces, your metric. If we don't find three things to fix, you don't pay.

Start with an audit