All recordings are from real human speakers—no TTS synthesis. Each sample preserves the authentic ambient background audio for realistic evaluation.
21 scenarios test mid-utterance intent changes requiring dynamic state rollback—the hardest failure mode across all models.
Sequences of API calls across 4 domains (Travel, Finance, Housing, E-Commerce) with deterministic outputs for automatic scoring.
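Because the mock APIs return deterministic outputs, a scenario can be scored by comparing the agent's emitted tool calls against a gold trace. A minimal sketch of such a scorer, assuming calls are recorded as name/argument pairs (the record format is illustrative, not the benchmark's actual schema):

```python
# Sketch of automatic scoring against a gold tool-call trace.
# The {"name": ..., "args": ...} record format is an assumption,
# not the benchmark's actual schema.

def score_trace(predicted, gold):
    """Return (tool_selection_acc, argument_acc) for one scenario."""
    n = max(len(gold), 1)
    tool_hits = arg_hits = 0
    for pred, ref in zip(predicted, gold):
        if pred["name"] == ref["name"]:
            tool_hits += 1                  # right tool selected
            if pred["args"] == ref["args"]:
                arg_hits += 1               # and with exactly the right arguments
    return tool_hits / n, arg_hits / n

pred = [{"name": "search_flights",
         "args": {"destination": "Boston", "date": "2026-03-05"}}]
gold = [{"name": "search_flights",
         "args": {"destination": "Boston", "date": "2026-03-05"}}]
print(score_trace(pred, gold))  # (1.0, 1.0)
```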
Fine-grained breakdown into first-response, tool-call, and task-completion latency reveals where each system bottlenecks.
Performance profile across 7 dimensions.
Overall performance and turn-taking metrics across all 6 systems. GPT-Realtime leads on accuracy; Gemini 3.1 on latency.
*Columns 2–5 report tool use; columns 6–9 report turn-taking dynamics.*

| Model | Tool Sel ↑ | Arg Acc ↑ | Resp Qual ↑ | Pass@1 ↑ | Turn-take ↑ | Latency ↓ | Interrupt ↓ | Filler ↓ |
|---|---|---|---|---|---|---|---|---|
| GPT-Realtime | 0.876 | 0.680 | 0.792 | 0.600 | 96.0% | 6.89s | 13.5% | 16.9% |
| Gemini 2.5 Live | 0.786 | 0.593 | 0.554 | 0.490 | 92.0% | 7.26s | 14.1% | 8.9% |
| Gemini 3.1 Live | 0.817 | 0.588 | 0.718 | 0.540 | 78.0% | 4.25s | 19.2% | 31.7% |
| Grok | 0.797 | 0.542 | 0.617 | 0.430 | 94.0% | 6.65s | 25.5% | 44.3% |
| Ultravox | 0.794 | 0.513 | 0.510 | 0.410 | 96.0% | 8.40s | 47.9% | 88.0% |
| Cascaded | 0.803 | 0.562 | 0.655 | 0.450 | 100.0% | 10.12s | 33.0% | 26.9% |
Pass@1 performance degrades with scenario complexity. GPT-Realtime leads decisively (0.750 Easy, 0.433 Hard); Grok shows the steepest decline (0.200 Hard).
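Pass@1 here is the fraction of scenarios the agent completes on its first (and only) attempt. Broken out per difficulty level, it can be computed as in this sketch (the level labels and log format are illustrative):

```python
from collections import defaultdict

def pass_at_1_by_level(runs):
    """Pass@1 per difficulty level.

    runs: iterable of (level, first_try_success) pairs, where
    first_try_success is 1 if the single attempt completed the
    task and 0 otherwise.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, ok in runs:
        totals[level] += 1
        hits[level] += ok
    return {level: hits[level] / totals[level] for level in totals}

runs = [("easy", 1), ("easy", 1), ("easy", 0),
        ("hard", 1), ("hard", 0), ("hard", 0)]
print(pass_at_1_by_level(runs))
```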
Self-correction remains the hardest disfluency type—even the best model (GPT-Realtime) succeeds on only 58.8% of rollback scenarios.
Finance is the strongest domain (GPT-Realtime: 0.960); Housing is consistently the hardest (best: 0.308). GPT-Realtime leads all four domains.
Latency decomposition reveals where each system bottlenecks: first word, tool call, or task completion.
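One way to compute such a decomposition from timestamped interaction logs, with all three latencies measured from the end of the user's utterance (the field names are hypothetical, not the benchmark's log schema):

```python
from dataclasses import dataclass

@dataclass
class InteractionLog:
    # All timestamps in seconds from session start; the field names
    # are hypothetical, not the benchmark's log schema.
    user_end: float    # user finished speaking
    first_word: float  # agent's first audible word
    tool_call: float   # agent's first tool invocation
    task_done: float   # task-completion confirmation

def decompose_latency(log: InteractionLog) -> dict:
    """Split end-to-end delay into first-response, tool-call, and
    task-completion latency, each measured from end of user speech."""
    return {
        "first_response": round(log.first_word - log.user_end, 2),
        "tool_call": round(log.tool_call - log.user_end, 2),
        "task_completion": round(log.task_done - log.user_end, 2),
    }

log = InteractionLog(user_end=3.2, first_word=4.1, tool_call=5.0, task_done=9.8)
print(decompose_latency(log))
# {'first_response': 0.9, 'tool_call': 1.8, 'task_completion': 6.6}
```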
Every recording is annotated for five disfluency categories, enabling fine-grained robustness analysis.
"I'd like to um search for flights to uh Boston"
Redundant acoustic tokens (um, uh) that may degrade reasoning accuracy or inflate latency.
"Book a flight to... [3s silence] ...New York on March 5th"
Mid-utterance silences during information retrieval that challenge end-of-turn detection.
"Can you check my my my order status for for order 5523"
Combinations of fillers and word repetitions that test the system's parsing ability.
"Search for hotels in— actually, find me flights to LA"
Abandoning an initial request for a new intent; model must discard obsolete context.
"Book me a flight to New York— wait, make that Boston"
Updating parameters mid-sentence; requires dynamic state rollback—the hardest category.
- `search_flights(destination, date)`
- `book_ticket(passenger_name, flight_id)`
- `update_travel_profile(document_type, document_number)`
- `query_card_benefits(card_last_4, category)`
- `calculate_currency_exchange(amount, from_currency, to_currency)`
- `modify_autopay_source(new_account_id)`
- `search_apartments(max_budget, amenities)`
- `update_search_filter(condition, new_value)`
- `check_order_status(order_id)`
- `cancel_pending_action(action_type)`
- `process_exchange(order_id, new_shipping_address)`

Listen to real user queries and compare how different voice agents respond. All inputs feature natural disfluencies recorded in uncontrolled environments.
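Tool inventories like this are typically exposed to an agent as JSON-schema function declarations. A generic sketch for one Travel-domain tool, following the widespread function-calling convention rather than the benchmark's exact tool format:

```python
import json

# search_flights declared as a function schema. The JSON-schema layout
# follows the common function-calling convention and is an assumption,
# not the benchmark's exact tool format.
search_flights_tool = {
    "name": "search_flights",
    "description": "Search available flights for a destination and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "e.g. an ISO date"},
        },
        "required": ["destination", "date"],
    },
}
print(json.dumps(search_flights_tool, indent=2))
```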
If you find our work useful, please consider citing:
@article{lin2026fdbv3,
  title   = {Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency},
  author  = {Lin, Guan-Ting and Chen, Chen and Chen, Zhehuai and Lee, Hung-yi},
  journal = {arXiv preprint},
  year    = {2026}
}