Your AI app should not break silently.
We write and maintain the regression test suite for your AI agent or prompt. Describe your app in plain English; we handle the YAML, the LLM-as-judge rubrics, the test cases, the alerts. You hear from us when behaviour drifts, not before.
Built on Promptfoo · run by humans · no framework to learn
The bugs your unit tests can't see.
An LLM call goes in, text comes out. There is nothing for a unit test to assert on except that the call happened. The interesting regressions live in the text:
- Silent drift: your matching agent recommends the wrong vendor 4% of the time after a model update. Nobody notices for two weeks.
- Prompt regression: someone tweaks the system prompt to fix one bug and breaks three others. No regression suite catches it.
- AI-tells leak: your AI-written outreach says 'I hope this finds you well' to a $50k client even though the prompt forbids it.
- Cohort identifiers vanish: the humanizer drops 'cohort-4' and says 'next cohort'. A real Panya finding from week one of running our own eval.
We built this for ourselves first.
Panya.health is an AI matchmaker we run for the GLP-1 market. Every outreach email runs through a humanizer agent before send. On the first real run of our eval suite, we found two production bugs in 25 seconds:
- The humanizer dropped cohort identifiers. "UAE cohort-4" became "next UAE user cohort". A real loss of metadata we use for scheduling and learning.
- Banned phrases leaked in ~20% of runs. "I wanted to reach out" and em dashes both made it through despite the prompt rule.
Both fixes shipped the same day. The eval went from 0/5 to 5/5 passing. Drift is now caught the moment a prompt change introduces it, not when a customer notices a weird email.
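Both bugs translate into simple, deterministic checks. A sketch of the assertions that would catch them (variable names and exact phrasing are illustrative):

```yaml
# Checks matching the two production bugs -- illustrative sketch
tests:
  - vars:
      draft: UAE cohort-4 welcome email    # hypothetical input
    assert:
      # Cohort identifiers must survive the humanizer verbatim
      - type: contains
        value: cohort-4
      # Banned phrases and em dashes must not leak through
      - type: not-icontains
        value: I wanted to reach out
      - type: not-contains
        value: "—"
```
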
Three tiers. Tell us which fits.
Pricing is a hypothesis. We're calibrating during beta — the first 50 signups land in the free tier and help us figure out what's worth what.
- 1 agent / prompt covered
- Up to 20 test cases
- Weekly run, Slack alert on regression
- Hand-built suite (we write the assertions)

- Up to 3 agents / prompts
- Up to 100 test cases total
- Run on every commit + nightly
- We add new tests when bugs ship
- LLM-as-judge rubrics included

- Unlimited agents / prompts
- Unlimited test cases
- Per-PR pre-merge gate
- Custom rubrics, owned dataset
- Slack + email + webhook alerts
- Priority support
Beta waitlist
Drop your email. We'll reach out when we have capacity to onboard the next batch.
Not an observability tool. Not a logging platform. Not a model router. Just regression coverage for the parts of your AI app that unit tests can't see, written and maintained by people who know what AI tells look like in production.
By the same people who built panya.health · run by humans