Your AI app should not break silently.
We write and maintain the regression test suite for your AI agent or prompt. Describe your app in plain English; we handle the YAML, the LLM-as-judge rubrics, the test cases, the alerts. You hear from us when behaviour drifts, not before.
Built on Promptfoo · run by humans · no framework to learn
The bugs your unit tests can't see.
An LLM call goes in, text comes out. A unit test has nothing to assert on except that the call happened. The interesting regressions live in the text:
- Silent drift
Your matching agent recommends the wrong vendor 4% of the time after a model update. Nobody notices for two weeks.
- Prompt regression
Someone tweaks the system prompt to fix one bug and breaks three others. No regression suite catches it.
- AI-tells leak
Your AI-written outreach says 'I hope this finds you well' to a $50k client even though the prompt forbids it.
- Cohort identifiers vanish
The humanizer drops 'cohort-4' and says 'next cohort'. Real Panya finding from week one of running our own eval.
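Two of the failure modes above can be checked on the text itself, with no model in the loop. The sketch below is a minimal illustration, not our production suite: the function name `find_regressions`, the banned-phrase list, and the assumed identifier shape (`cohort-` followed by digits) are all hypothetical.

```python
import re

# Hypothetical banned-phrase list; both phrases are quoted on this page.
BANNED_PHRASES = ("i hope this finds you well", "i wanted to reach out")

# Assumed identifier shape: "cohort-" followed by digits, e.g. "cohort-4".
COHORT_ID = re.compile(r"\bcohort-\d+\b", re.IGNORECASE)


def find_regressions(source: str, output: str) -> list[str]:
    """Return human-readable failures for one (input, output) pair."""
    failures = []
    lowered = output.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            failures.append(f"banned phrase: {phrase!r}")
    # Every cohort identifier present in the input must survive into the output.
    kept = {ident.lower() for ident in COHORT_ID.findall(output)}
    for ident in COHORT_ID.findall(source):
        if ident.lower() not in kept:
            failures.append(f"dropped identifier: {ident!r}")
    return failures
```

An empty list means the pair passed. Anything else is a regression worth an alert.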
We built this for ourselves first.
Panya.health is an AI matchmaker we run for the GLP-1 market. Three LLM steps sit between a quiz answer and a real email landing in someone's inbox: a humanizer, a Trust Supervisor that grades outreach, and a content supervisor that grades social drafts. We built eval suites for all three. They found six real production bugs:
- The humanizer dropped cohort identifiers. "UAE cohort-4" became "next UAE user cohort". Real loss of metadata we use for scheduling and learnings.
- Banned phrases leaked in ~20% of runs. "I wanted to reach out" and em dashes both made it through despite the prompt rule.
- The Trust Supervisor was returning six different JSON schemas. Six test cases produced six different verdict shapes. The production parser defaulted missing fields silently, so a draft leaking PII could be persisted as "pii_risk: low". The held-draft review queue had no real PII gate.
- Confidence didn't track score. Score 15 + pii_risk "high" still returned confidence "high". The model couldn't reliably map score and PII risk to a confidence level, so we made that mapping deterministic.
- Content supervisor: false-positive em-dashes. Hyphenated phrases like "region-by-region" got flagged as em dashes. Same model also missed actual em dashes. Pure regex is now authoritative.
- Content supervisor: missed exclamation points. "We shipped supply this week!" got past the no-exclamation rule. Now a deterministic check runs after the LLM call.
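The two content supervisor fixes come down to running plain pattern checks after the LLM call and treating their result as final. A minimal sketch, assuming the rule is "no em dash character and no exclamation point anywhere in the draft"; the name `style_violations` is hypothetical.

```python
import re

# Matches only the literal em dash (U+2014). ASCII hyphens, as in
# "region-by-region", can never match this pattern.
EM_DASH = re.compile("\u2014")

# Assumption: any "!" is a violation, including inside quoted text.
EXCLAMATION = re.compile(r"!")


def style_violations(draft: str) -> list[str]:
    """Deterministic style check, run after the LLM supervisor has graded the draft."""
    violations = []
    if EM_DASH.search(draft):
        violations.append("em dash")
    if EXCLAMATION.search(draft):
        violations.append("exclamation point")
    return violations
```

Because the check is deterministic, the same draft always gets the same answer, which the LLM grader could not guarantee.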
All six fixes shipped in the same sprint. All three eval suites are green: humanize 5/5, outreach supervisor 6/6, content supervisor 7/7. 18/18 across the agent pipeline. The drift is caught the moment a prompt change introduces it, not when a customer notices a weird email or a vendor sees their PII land on someone else's intake.
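The Trust Supervisor fixes follow one idea: reject any verdict that does not match the expected shape, and compute confidence in code instead of trusting the model's value. The sketch below is an illustration under stated assumptions. The field names `score`, `pii_risk` and `confidence` come from the findings above; the 0-100 score range, the `low`/`medium`/`high` value set, the thresholds, and the names `parse_verdict`, `derive_confidence` and `VerdictError` are hypothetical.

```python
import json

REQUIRED_FIELDS = {"score": int, "pii_risk": str}  # assumed minimal schema
PII_LEVELS = {"low", "medium", "high"}             # assumed value set


class VerdictError(ValueError):
    """Raised when the supervisor's JSON does not match the expected shape."""


def derive_confidence(score: int, pii_risk: str) -> str:
    # Hypothetical thresholds on an assumed 0-100 score.
    if pii_risk == "high" or score < 50:
        return "low"
    if pii_risk == "medium" or score < 80:
        return "medium"
    return "high"


def parse_verdict(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise VerdictError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise VerdictError("verdict must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        # Fail loudly instead of defaulting: a missing pii_risk must never read as "low".
        if name not in data:
            raise VerdictError(f"missing field: {name}")
        value = data[name]
        if isinstance(value, bool) or not isinstance(value, expected):
            raise VerdictError(f"wrong type for {name}")
    if data["pii_risk"] not in PII_LEVELS:
        raise VerdictError(f"unknown pii_risk: {data['pii_risk']!r}")
    # Overwrite whatever confidence the model returned with the computed one.
    data["confidence"] = derive_confidence(data["score"], data["pii_risk"])
    return data
```

With this shape, the case from the findings (score 15, pii_risk "high", model-reported confidence "high") comes back with confidence "low", and a verdict with no `pii_risk` raises instead of being persisted.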
Three tiers. Tell us which fits.
Pricing is a hypothesis. We're calibrating during beta. The first 50 signups land in the free tier and help us figure out what's worth what.
- 1 agent / prompt covered
- Up to 20 test cases
- Weekly run, Slack alert on regression
- Hand-built suite (we write the assertions)

- Up to 3 agents / prompts
- Up to 100 test cases total
- Run on every commit + nightly
- We add new tests when bugs ship
- LLM-as-judge rubrics included

- Unlimited agents / prompts
- Unlimited test cases
- Per-PR pre-merge gate
- Custom rubrics, owned dataset
- Slack + email + webhook alerts
- Priority support
Beta waitlist
Drop your email. We'll reach out when we have capacity to onboard the next batch.
Not an observability tool. Not a logging platform. Not a model router. Just regression coverage for the parts of your AI app that unit tests can't see, written and maintained by people who know what AI tells look like in production.
By the same people who built panya.health · run by humans