Hi everyone 🙂
I’m new to agent observability/evals and trying to sanity-check a small finding.
I ran a small public agent-task test. Some black-box agents understood the task and prepared a payload, but stopped before completion because their runtime could not reliably fetch or POST. A small local agent with direct HTTP GET/POST access completed the same flow and received a receipt from the target endpoint.
I don’t want to overclaim this, but it made me wonder:
In Arize/Phoenix-style traces or evals, do people usually distinguish between:
- agent-reported completion
- attempted execution
- system-returned receipt / durable confirmation
The boundary I’m trying to understand is this: an agent may appear to have taken the right steps, yet the target system may never have accepted or confirmed the task.
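To make the distinction concrete, here is a rough sketch of how I picture the three levels as a label on a run. This is plain Python, not a Phoenix or Arize API; `Outcome`, `RunEvidence`, `classify`, and all field names are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Hypothetical labels for how strong the evidence of completion is."""
    CONFIRMED = "system_returned_receipt"
    ATTEMPTED = "attempted_execution"
    REPORTED_ONLY = "agent_reported_completion"
    NO_EVIDENCE = "no_evidence"


@dataclass
class RunEvidence:
    """What the harness observed for one agent run (all fields are assumptions)."""
    agent_claims_done: bool            # the agent's final message says the task is done
    request_sent: bool                 # the harness saw an outbound GET/POST leave the runtime
    receipt_id: Optional[str] = None   # identifier returned by the target endpoint, if any


def classify(ev: RunEvidence) -> Outcome:
    """Return the strongest level of evidence available, checked strongest first."""
    if ev.receipt_id:
        return Outcome.CONFIRMED
    if ev.request_sent:
        return Outcome.ATTEMPTED
    if ev.agent_claims_done:
        return Outcome.REPORTED_ONLY
    return Outcome.NO_EVIDENCE


# A black-box agent that says "done" but never got a request out:
blocked = classify(RunEvidence(agent_claims_done=True, request_sent=False))

# A local agent that POSTed and got a receipt back (the receipt value is made up):
local = classify(RunEvidence(agent_claims_done=True, request_sent=True, receipt_id="r-123"))
```

In this sketch the agent's own claim is deliberately the weakest signal, so a run only counts as confirmed when the target system returned something.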
Is this something Phoenix users already model, or is it usually left to the app/test harness?