For teams running agents in production:
Have you seen retries or loops burn money before anyone noticed?
I’m specifically interested in:
- repeated retries that resend a large context on every attempt
- agents that loop without ever reaching an end state
- cases where traces/logs existed but still didn’t make the failure obvious
One real example would help a lot.