Honest question: how do you test your AI agents before production?
AI agents are being shipped faster than ever.
But I keep noticing something worrying when talking to founders and engineers:
Most teams don’t really test their agents: they prompt them a few times and hope for the best.
No structured evals.
No multi-turn testing.
No consistency checks.
No guardrails.
Just… vibes.
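Even something minimal beats vibes. Here's a rough sketch of what a structured eval with a consistency check can look like. Everything here is hypothetical: `run_agent` is a stand-in for whatever client or orchestration layer you actually use, and the test cases are dummy examples.

```python
# Minimal sketch of a structured eval with a crude consistency check.
# `run_agent` is a placeholder; replace it with a real call into your agent.

def run_agent(prompt: str) -> str:
    # Stand-in for your real agent invocation (API client, framework, etc.).
    return f"stubbed answer to: {prompt}"

EVAL_CASES = [
    # (prompt, substring the answer is expected to contain) -- dummy examples
    ("What is the refund window for annual plans?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
]

def run_evals(n_repeats: int = 3) -> None:
    for prompt, expected in EVAL_CASES:
        # Run the same prompt several times to see whether answers drift.
        answers = [run_agent(prompt) for _ in range(n_repeats)]
        passed = all(expected.lower() in a.lower() for a in answers)
        consistent = len(set(answers)) == 1
        print(f"{prompt!r}: pass={passed} consistent={consistent}")

if __name__ == "__main__":
    run_evals()
```

Substring checks are blunt, but even this catches regressions that "prompt it a few times" never will.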
This feels very similar to the early days of backend systems:
no monitoring
no alerts
bugs discovered by users
Eventually, observability became non-negotiable.
I think AI is hitting that same phase now.
So I’m curious:
How do you evaluate your LLMs or AI agents today?
manual prompts?
scripted test cases?
human review?
automated evals?
or nothing at all?
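For anyone in the "scripted test cases" camp: multi-turn flows are where agents tend to fall apart, and they're scriptable too. A rough sketch below, again with a hypothetical agent interface and made-up expectations.

```python
# Minimal sketch of a scripted multi-turn test case.
# `run_agent_turn` is a placeholder for an agent call that accepts prior turns.

from typing import List, Tuple

def run_agent_turn(history: List[Tuple[str, str]], user_msg: str) -> str:
    # Stand-in: replace with a real call that passes the conversation history.
    return f"stubbed reply to: {user_msg}"

# Each step: (user message, substring the reply should contain) -- dummy data.
MULTI_TURN_CASE = [
    ("I want to cancel my subscription.", "confirm"),
    ("Yes, cancel it.", "cancelled"),
]

def run_multi_turn_case() -> bool:
    history: List[Tuple[str, str]] = []
    ok = True
    for user_msg, expected in MULTI_TURN_CASE:
        reply = run_agent_turn(history, user_msg)
        history.append((user_msg, reply))
        if expected.lower() not in reply.lower():
            ok = False
    return ok

if __name__ == "__main__":
    print("multi-turn case passed:", run_multi_turn_case())
```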
We’ve been working on an AI reliability / agent evaluation framework internally and recently opened it up for teams who want to stress-test real AI workflows.
If you’re experimenting with agents and want to see how fragile (or solid) they really are, you can check it out here:
We’re not pushing sales; we’re mostly looking for builders willing to test, break things, and share feedback so we can understand where current eval approaches fall short.
Would love to learn how others are handling this.
Are AI evals already part of your stack, or are we still in “ship first, debug later” mode?