ServiceNow releases EVA, the first open framework jointly scoring voice agent task accuracy and conversational quality across 20 systems.
ServiceNow researchers released EVA (End-to-end Voice Agent evaluation framework), an open-source benchmark that evaluates conversational voice agents on two simultaneous dimensions: EVA-A (task accuracy) and EVA-X (conversational experience). The framework uses a bot-to-bot architecture to simulate realistic multi-turn spoken conversations. An initial dataset of 50 airline scenarios (flight rebooking, cancellations, vouchers) is included, with more domains planned. Benchmarks were run across 20 systems, including cascade pipelines, speech-to-speech models, and Large Audio Language Models, revealing a consistent accuracy-experience tradeoff.
Most existing voice eval frameworks test a single dimension in isolation (STT accuracy, turn-taking, or task completion) rather than all of them together. EVA's bot-to-bot architecture runs full multi-turn spoken conversations and scores both task success and conversational quality simultaneously. The key technical finding: optimizing for one score actively degrades the other, which means a single-metric eval loop can hide real production failures.
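The bot-to-bot loop can be pictured as a scripted user bot driving the agent under test through a multi-turn conversation, with the full transcript collected for dual scoring afterward. This is a minimal sketch only; the class and method names (`user_bot`, `agent.respond`, etc.) are illustrative assumptions, not EVA's actual API.

```python
# Hypothetical sketch of a bot-to-bot evaluation loop. All interfaces here
# are illustrative assumptions, not the EVA framework's real API.
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "user_bot" or "agent"
    text: str     # transcript of the simulated speech for this turn


def run_scenario(user_bot, agent, max_turns=10):
    """Simulate a multi-turn conversation between a scripted user bot and
    the voice agent under test, returning the full transcript. The
    transcript can then be scored twice: once for task success (EVA-A)
    and once for conversational experience (EVA-X)."""
    transcript = []
    utterance = user_bot.open()  # user bot opens, e.g. "I need to rebook my flight"
    for _ in range(max_turns):
        transcript.append(Turn("user_bot", utterance))
        reply = agent.respond(utterance)
        transcript.append(Turn("agent", reply))
        utterance = user_bot.next(reply)  # user bot reacts to the agent's reply
        if utterance is None:  # goal reached or conversation abandoned
            break
    return transcript
```

Separating conversation simulation from scoring is what lets one run produce both metrics, which is the point of the joint EVA-A/EVA-X design.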
Clone the EVA repo this week and run your existing voice agent against the 50 airline scenarios to get separate EVA-A and EVA-X scores. If the gap between the two exceeds 20 points, your agent likely has a structural design problem, not a model problem.
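The gap check above can be sketched as a one-function diagnostic. The 20-point threshold comes from the article; the function itself and the assumption that both scores sit on a 0-100 scale are hypothetical, not part of the EVA framework.

```python
# Illustrative diagnostic for the accuracy-experience gap described above.
# The 20-point threshold is from the article; the function and the assumed
# 0-100 score scale are hypothetical, not part of EVA.
def diagnose(eva_a: float, eva_x: float, threshold: float = 20.0) -> str:
    """Compare the task-accuracy (EVA-A) and conversational-experience
    (EVA-X) scores and flag a structural imbalance between them."""
    gap = abs(eva_a - eva_x)
    if gap > threshold:
        return f"structural design problem: {gap:.0f}-point EVA-A/EVA-X gap"
    return f"balanced: {gap:.0f}-point EVA-A/EVA-X gap"
```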
Run: git clone https://github.com/ServiceNow/eva && cd eva