Derek Cheng

Lessons Learned from Building Agents

PH builders: what are key lessons you’ve learned — whether technical or product or GTM — from building agents? This is still such a new discipline that it would be great to share amongst this community of builders.

I’ll kick off with an experience that had us scratching our heads for months last year…

Our product, @Tonkotsu, runs a bunch of coding agents in parallel. They’re built on top of Sonnet 4.5 and do coding tasks by repeatedly calling tools to read and write code. We review task failures daily and started noticing something strange: some sessions would devolve into an infinite loop of Sonnet repeating an incorrect tool call over and over (sometimes up to 50 times!) until the session hit limits.

What was absolutely wild was that the model knew it was stuck and even started berating itself for its mistakes:

I’ve failed 17 consecutive times with the exact same error. I keep calling replace_file with only the file_path parameter and never include the content parameter.


After 17 consecutive failures, I need to break this pattern.

It took multiple rounds of experimentation to fix, but the experience gave us a window into LLM behavior at the edges. Surprisingly, there are real parallels with human behavior too. If you’re interested, full write-up here: https://blog.tonkotsu.ai/p/ive-failed-17-consecutive-times-with

What lessons have you learned, whether technical or product or GTM? I would love for this thread to be a place for us to learn real-world lessons from each other.



Replies

AJ

Technical lessons:

Sonnet 4.5 as an interaction layer needs very specific system prompting and context seeding to truly keep up with a person. As a standalone agent it requires a fair degree of specificity and explicit permission to fail; otherwise you get context-rotting spirals.

My rule of thumb is to include explicit instructions that failures and errors are okay, and to stop when the same failure repeats.
Then I can orchestrate a recovery with another agent, or send in a recovery prompt if the session is still early.
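
Roughly what that looks like as a sketch (the prompt wording and helper names here are illustrative, not my actual setup):

```python
# Simplified sketch: give the agent explicit permission to fail, and stop the
# loop when the same failure repeats so a recovery step can take over.
# call_model / run_tool / recover are placeholders for your own stack.

SYSTEM_PROMPT = (
    "You are a coding agent. It is OK for a tool call to fail; report the "
    "error and try a different approach. If the same error happens twice in "
    "a row, STOP and summarize what went wrong instead of retrying."
)

MAX_REPEATED_FAILURES = 2

def run_agent(task: str, call_model, run_tool, recover) -> str:
    history = [{"role": "user", "content": task}]
    last_error, repeats = None, 0

    while True:
        step = call_model(SYSTEM_PROMPT, history)  # a tool call or a final answer
        if step.is_final:
            return step.text

        result = run_tool(step.tool_name, step.tool_args)
        history += [step.as_message(), result.as_message()]

        # Count consecutive identical failures.
        if result.error and result.error == last_error:
            repeats += 1
        else:
            repeats = 0
        last_error = result.error

        if repeats >= MAX_REPEATED_FAILURES:
            # Hand off: either a fresh agent or a targeted recovery prompt.
            return recover(history, last_error)
```

The point is that the stop condition lives outside the model, so the recovery step always gets a chance to run.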

Claude models have a tendency to make the scope bigger than it needs to be, so you really need to specify, or have explicit instructions, that simplicity is a value.

Claude performs a lot better when it has a why for the feature or task. It helps avoid mistakes and ambiguity.

Bad spec example:

Create an authentication flow with supabase auth and user signup and login modals

Good spec:

Create the authentication flow for <app>, use a signup and a login modal, use supabase auth. The objective is to provide frictionless onboarding. Use google auth and github auth in addition to email/password.

The difference is mentioning the app itself and the objective.

For Tonkotsu agents clarifying questions don't apply, but some coding tools are more interactive, so it's best practice to give the AI a directive to ask them.

Derek Cheng

@build_with_aj Have you found you've had to build your product's UX in a way that guides users to do the above?

AJ

@derekattonkotsu I've mostly made products where AI works in the background, the context is pre-seeded by data, and the user just adds requests, so I'm not sure the question applies.

That being said, you absolutely need to subtly encourage users to be specific with requests, and that is a core tenet of good HMIX (human-model interaction experience).

Cassandra King (BOSS.Tech)

@derekattonkotsu Pretty interesting. How are you handling repetitive failures in your agents? Curious what mechanisms you have implemented to detect and break patterns like this, and what strategies or tools help you identify loops or recurring issues. Also, do you have an understanding of the model's decision-making process? Obviously, I came here with a sh*tton of questions and no answers LOL!
But how can we make models more transparent to devs and users?

Derek Cheng

@cassi_cassi We went through a few rounds of experimentation for this. We started with something pretty simple: when the LLM made a bad tool call, we enhanced the error message we returned to it to be very clear about what was wrong. That didn't work.

Then we tried something more sophisticated: on the next turn we disabled tool calling and asked the LLM to think deeply about what it did wrong and how it should actually call the tool correctly. This was effective in the sense that the LLM did reflect and identify what it did wrong. But then when we re-enabled tool calling in the subsequent turn, it would just make the same old mistake! 🤦
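
Roughly, that reflection turn looked like this (heavily simplified; call_model is a stand-in for whatever LLM client you use, not code from our product):

```python
# Simplified sketch of the reflection-only turn. Passing tools=None means the
# model cannot emit a tool call on this turn, only text.
REFLECTION_PROMPT = (
    "Your last replace_file call failed because it was missing the `content` "
    "parameter. Before doing anything else, explain what went wrong and spell "
    "out the exact arguments a correct call would need."
)

def reflection_turn(history: list, call_model) -> str:
    messages = history + [{"role": "user", "content": REFLECTION_PROMPT}]
    analysis = call_model(messages=messages, tools=None)
    # The analysis was usually right -- the model named the missing parameter --
    # but once tools were re-enabled on the next turn, it often repeated the
    # same mistake anyway.
    return analysis
```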

Ultimately we tried something where we embedded a JSON template of the correct tool call format and this actually worked. If you want all the gory details, we go pretty in depth here: https://blog.tonkotsu.ai/p/ive-failed-17-consecutive-times-with
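
For a rough sense of the shape of the fix (a simplified illustration, not our production code; in this sketch the template goes into the tool error text, and the helper is made up):

```python
import json

# Simplified illustration: when the model repeats the same malformed
# replace_file call, the error we return embeds a JSON template of a valid
# call, with the missing field spelled out.

def replace_file_error(received_args: dict) -> str:
    template = {
        "tool": "replace_file",
        "arguments": {
            "file_path": received_args.get("file_path", "<path to the file>"),
            "content": "<REQUIRED: the full new contents of the file>",
        },
    }
    return (
        "Error: replace_file was called without the required `content` parameter.\n"
        "Your next call must include every field in this template:\n"
        + json.dumps(template, indent=2)
    )
```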

The key lessons we learned: (1) have a rigorous quality process -- we might not have caught this without our daily failure reviews, and (2) LLMs can behave in strange ways and you have to experiment.