Beyond the prompt: building reliable AI products

The hardest part of shipping an AI product isn't the model — it's everything around it. Here's the system I wrap around every Claude-powered feature I ship.

Eval before product-market fit

A new feature doesn't go to users until it has an eval set. Even five hand-curated examples beat shipping blind. The eval set lives in a CSV, gets re-run on every prompt change, and the deltas are reviewed in PR.

Retries with shape constraints

Models occasionally return malformed output. I never let a single bad response surface to a user. Wrap every model call in:

Schema validation (zod or pydantic)
Retry on validation failure with the original error appended to the prompt
Hard cap of three attempts; fall back to a deterministic template

evals — pnpm

$ pnpm eval --suite=auth-router
running 12 cases · model=claude-sonnet-4-6
  ✓ classifies password reset (12ms)
  ✓ classifies signup intent (15ms)
  ✓ rejects ambiguous handoff to human (28ms)
  ...
12/12 passed · cost $0.04

Guardrails that match the threat

Not every product needs a content moderation pipeline. The right question is: "What's the worst plausible output, and what's the cost of letting it slip?" For a public chat, you need filtering. For an internal tool with twelve users, you don't.

Observability is non-negotiable

Log every model input, output, latency, and cost. Tag traces with the user, feature, and prompt version. When something goes wrong in week six, you'll know.

The order I ship in

Prompt + scaffolded tool use that runs end-to-end
Eval set + automated regression check
Retries and fallbacks
Observability
Cost guardrails (caps, alerts)
Then invite users

Skip steps and you'll pay for them later. Always.

Retries with shape constraints

Models occasionally return malformed output. I never let a single bad response surface to a user. Wrap every model call in:

Schema validation (zod or pydantic)

Retry on validation failure with the original error appended to the prompt

Hard cap of three attempts; fall back to a deterministic template

evals — pnpm

$ pnpm eval --suite=auth-router
running 12 cases · model=claude-sonnet-4-6
  ✓ classifies password reset (12ms)
  ✓ classifies signup intent (15ms)
  ✓ rejects ambiguous handoff to human (28ms)
  ...
12/12 passed · cost $0.04