ServiceNow releases EVA, the first open framework jointly scoring voice agent task accuracy and conversational quality across 20 systems.
ServiceNow researchers released EVA (End-to-end Voice Agent evaluation framework), an open-source benchmark that evaluates conversational voice agents on two simultaneous dimensions: EVA-A (task accuracy) and EVA-X (conversational experience). The framework uses a bot-to-bot architecture to simulate realistic multi-turn spoken conversations. An initial dataset of 50 airline scenarios (flight rebooking, cancellations, vouchers) is included, with more domains planned. Benchmarks were run across 20 systems, including cascade pipelines, speech-to-speech models, and Large Audio Language Models, revealing a consistent accuracy-experience tradeoff.
Most existing voice eval frameworks test a single dimension in isolation (STT accuracy, turn-taking, or task completion) rather than all of them together. EVA's bot-to-bot architecture runs full multi-turn spoken conversations and scores both task success and conversational quality simultaneously. The key technical finding: optimizing for one score actively degrades the other, which means a single-metric eval loop can hide real production failures.
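The bot-to-bot loop can be pictured as a scripted user bot driving the agent under test through a multi-turn conversation, with the full transcript collected for dual scoring afterward. This is a minimal sketch only; the class and method names (`user_bot`, `agent.respond`, etc.) are illustrative assumptions, not EVA's actual API.

```python
# Hypothetical sketch of a bot-to-bot evaluation loop. All interfaces here
# are illustrative assumptions, not the EVA framework's real API.
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # "user_bot" or "agent"
    text: str     # transcript of the simulated speech for this turn


def run_scenario(user_bot, agent, max_turns=10):
    """Simulate a multi-turn conversation between a scripted user bot and
    the voice agent under test, returning the full transcript. The
    transcript can then be scored twice: once for task success (EVA-A)
    and once for conversational experience (EVA-X)."""
    transcript = []
    utterance = user_bot.open()  # user bot opens, e.g. "I need to rebook my flight"
    for _ in range(max_turns):
        transcript.append(Turn("user_bot", utterance))
        reply = agent.respond(utterance)
        transcript.append(Turn("agent", reply))
        utterance = user_bot.next(reply)  # user bot reacts to the agent's reply
        if utterance is None:  # goal reached or conversation abandoned
            break
    return transcript
```

Separating conversation simulation from scoring is what lets one run produce both metrics, which is the point of the joint EVA-A/EVA-X design.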
Clone the EVA repo this week and run your existing voice agent against the 50 airline scenarios to get separate EVA-A and EVA-X scores. If the gap between the two exceeds 20 points, your agent likely has a structural design problem, not a model problem.
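The gap check above can be sketched as a one-function diagnostic. The 20-point threshold comes from the article; the function itself and the assumption that both scores sit on a 0-100 scale are hypothetical, not part of the EVA framework.

```python
# Illustrative diagnostic for the accuracy-experience gap described above.
# The 20-point threshold is from the article; the function and the assumed
# 0-100 score scale are hypothetical, not part of EVA.
def diagnose(eva_a: float, eva_x: float, threshold: float = 20.0) -> str:
    """Compare the task-accuracy (EVA-A) and conversational-experience
    (EVA-X) scores and flag a structural imbalance between them."""
    gap = abs(eva_a - eva_x)
    if gap > threshold:
        return f"structural design problem: {gap:.0f}-point EVA-A/EVA-X gap"
    return f"balanced: {gap:.0f}-point EVA-A/EVA-X gap"
```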
Run: git clone https://github.com/ServiceNow/eva && cd eva