Most evaluation tools still optimize for developer-centric metrics and vanity benchmark scores. Business leaders need outcome-centric metrics they can sign off on.
The biggest gap is realism: single prompts and static benchmark tasks are weak proxies for how AI agents behave in production environments.
A practical evaluation framework should include repeated runs, scenario-based stress tests, and business risk indicators like cost-per-success and tail-risk exposure.
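As a minimal sketch of what such business risk indicators might look like in practice, the snippet below computes success rate, cost-per-success, and a p95 cost as a tail-risk proxy from repeated runs of one scenario. The run records, field layout, and dollar figures are all illustrative assumptions, not a real evaluation harness.

```python
# Hypothetical run records from repeated evaluations of one agent scenario.
# Each record is (succeeded, cost_usd); the data here is purely illustrative.
runs = [
    (True, 0.12), (True, 0.15), (False, 0.40),
    (True, 0.11), (True, 0.90), (False, 0.35),
    (True, 0.14), (True, 0.13), (True, 0.16), (True, 0.12),
]

successes = sum(1 for ok, _ in runs if ok)
total_cost = sum(cost for _, cost in runs)

success_rate = successes / len(runs)
cost_per_success = total_cost / successes  # business-facing unit economics

# Tail-risk exposure: the cost at the 95th percentile of observed runs
# (nearest-rank on the sorted costs; a real harness might use interpolation).
costs_sorted = sorted(cost for _, cost in runs)
p95_cost = costs_sorted[int(0.95 * (len(costs_sorted) - 1))]

print(f"success rate: {success_rate:.0%}")
print(f"cost per success: ${cost_per_success:.2f}")
print(f"p95 cost (tail risk): ${p95_cost:.2f}")
```

Reporting cost-per-success rather than raw accuracy makes failed runs show up as a direct expense, and the p95 cost surfaces the occasional expensive outlier that an average would hide.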