Full-Duplex-Bench-v3

Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency

Guan-Ting Lin¹, Chen Chen², Zhehuai Chen², Hung-yi Lee¹

¹National Taiwan University  ·  ²NVIDIA

Overview

[Figure: Full-Duplex-Bench-v3 overview]

Key Highlights

🎤 Real Human Speech

All recordings are from real human speakers—no TTS synthesis. Each sample preserves the authentic ambient background audio for realistic evaluation.

🔄 Self-Correction & Rollback

21 scenarios test mid-utterance intent changes requiring dynamic state rollback—the hardest failure mode across all models.

🔗 Multi-Step Tool Chaining

Sequences of API calls across 4 domains (Travel, Finance, Housing, E-Commerce) with deterministic outputs for automatic scoring.

Latency Decomposition

Fine-grained breakdown into first-response, tool-call, and task-completion latency reveals where each system bottlenecks.

Model Comparison

Performance profile across 7 dimensions.

Main Results

Overall performance and turn-taking metrics across all 6 systems. GPT-Realtime leads on accuracy; Gemini 3.1 Live has the lowest latency.

The first four metric columns measure tool use; the last four measure turn-taking dynamics.

Model             Tool Sel ↑   Arg Acc ↑   Resp Qual ↑   Pass@1 ↑   Turn-take ↑   Latency ↓   Interrupt ↓   Filler ↓
GPT-Realtime      0.876        0.680       0.792         0.600      96.0%         6.89s       13.5%         16.9%
Gemini 2.5 Live   0.786        0.593       0.554         0.490      92.0%         7.26s       14.1%         8.9%
Gemini 3.1 Live   0.817        0.588       0.718         0.540      78.0%         4.25s       19.2%         31.7%
Grok              0.797        0.542       0.617         0.430      94.0%         6.65s       25.5%         44.3%
Ultravox          0.794        0.513       0.510         0.410      96.0%         8.40s       47.9%         88.0%
Cascaded          0.803        0.562       0.655         0.450      100.0%        10.12s      33.0%         26.9%
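
To make the tool-use columns concrete, here is a minimal scoring sketch for Tool Sel, Arg Acc, and Pass@1 (Resp Qual is judged on the spoken response and is omitted here). It is one plausible implementation, not the released benchmark code; the ToolCall class and function names are ours, and it assumes each scenario ships a gold sequence of expected API calls.

from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str    # API name, e.g. "search_flights"
    args: dict   # keyword arguments passed to the API

def tool_selection(pred: ToolCall, gold: ToolCall) -> float:
    # Tool Sel: did the agent pick the right API?
    return float(pred.name == gold.name)

def argument_accuracy(pred: ToolCall, gold: ToolCall) -> float:
    # Arg Acc: fraction of gold arguments the agent reproduced exactly.
    if not gold.args:
        return 1.0
    hits = sum(pred.args.get(k) == v for k, v in gold.args.items())
    return hits / len(gold.args)

def pass_at_1(preds: list, golds: list) -> float:
    # Pass@1: the entire multi-step chain must match on the first attempt.
    return float(len(preds) == len(golds) and all(
        p.name == g.name and p.args == g.args for p, g in zip(preds, golds)))

gold = [ToolCall("search_flights", {"destination": "Boston", "date": "2026-03-05"})]
pred = [ToolCall("search_flights", {"destination": "Boston", "date": "2026-03-05"})]
print(tool_selection(pred[0], gold[0]),
      argument_accuracy(pred[0], gold[0]),
      pass_at_1(pred, gold))   # 1.0 1.0 1.0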

Detailed Analysis

Pass@1 performance degrades with scenario complexity. GPT-Realtime leads decisively (0.750 Easy, 0.433 Hard); Grok shows the steepest decline (0.200 Hard).

Self-correction remains the hardest disfluency type—even the best model (GPT-Realtime) succeeds on only 58.8% of rollback scenarios.

Finance is the strongest domain (GPT-Realtime: 0.960); Housing is consistently the hardest (best: 0.308). GPT-Realtime leads all four domains.

Latency decomposition reveals where each system bottlenecks: first word, tool call, or task completion.
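
As a sketch of how such a decomposition can be computed, assume each session log records three timestamps measured from the end of the user's utterance; the event names below are our own illustration, not the benchmark's actual logging schema.

def decompose_latency(events: dict) -> dict:
    # Split end-to-end latency into the three stages named above.
    return {
        # First-response: silence before the agent says anything.
        "first_response": events["first_agent_word"],
        # Tool-call: additional time until the first API call is issued.
        "tool_call": events["first_tool_call"] - events["first_agent_word"],
        # Task-completion: remaining time until the final answer is spoken.
        "task_completion": events["task_done"] - events["first_tool_call"],
    }

events = {"first_agent_word": 0.8, "first_tool_call": 2.1, "task_done": 6.9}
print(decompose_latency(events))   # ~0.8s, ~1.3s, ~4.8s; the stages sum to 6.9s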

Disfluency Categories

Every recording is annotated for five disfluency categories, enabling fine-grained robustness analysis.

Filler

"I'd like to um search for flights to uh Boston"

Redundant acoustic tokens (um, uh) that may degrade reasoning accuracy or inflate latency.

Pause

"Book a flight to... [3s silence] ...New York on March 5th"

Mid-utterance silences while the user recalls or looks up information, which challenge end-of-turn detection.

Hesitation

"Can you check my my my order status for for order 5523"

Combinations of fillers and word repetitions that test the system's parsing ability.

False Start

"Search for hotels in— actually, find me flights to LA"

Abandoning an initial request for a new intent; the model must discard the obsolete context.

Self-Correction

"Book me a flight to New York— wait, make that Boston"

Updating parameters mid-sentence; requires dynamic state rollback—the hardest category.
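
To make the rollback requirement concrete, a check like the sketch below can verify that the agent's final tool call carries the corrected value and none of the abandoned one. The helper name and fields are hypothetical, for illustration only.

def rolled_back(final_args: dict, corrected: dict, abandoned: dict) -> bool:
    # True iff every corrected value is used and no abandoned value leaks through.
    uses_corrected = all(final_args.get(k) == v for k, v in corrected.items())
    leaks_abandoned = any(final_args.get(k) == v for k, v in abandoned.items())
    return uses_corrected and not leaks_abandoned

# "Book me a flight to New York— wait, make that Boston"
final_args = {"destination": "Boston", "date": "2026-03-05"}
print(rolled_back(final_args,
                  corrected={"destination": "Boston"},
                  abandoned={"destination": "New York"}))   # True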

Task Domains & Mock APIs

Travel & Identity

  • search_flights(destination, date)
  • book_ticket(passenger_name, flight_id)
  • update_travel_profile(document_type, document_number)

💰 Finance & Billing

  • query_card_benefits(card_last_4, category)
  • calculate_currency_exchange(amount, from_currency, to_currency)
  • modify_autopay_source(new_account_id)

🏠 Housing & Location

  • search_apartments(max_budget, amenities)
  • update_search_filter(condition, new_value)

🛒 E-Commerce Support

  • check_order_status(order_id)
  • cancel_pending_action(action_type)
  • process_exchange(order_id, new_shipping_address)
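
Deterministic outputs mean each mock API is a pure function of its arguments, so the same spoken request always yields the same payload to score against. Below is a minimal sketch reusing the search_flights signature above; the payload fields and pricing logic are invented for illustration.

import hashlib

def search_flights(destination: str, date: str) -> dict:
    # Same (destination, date) always returns the same flights, so agent
    # transcripts can be scored automatically against fixed expected outputs.
    seed = hashlib.sha256(f"{destination}|{date}".encode()).hexdigest()
    return {
        "query": {"destination": destination, "date": date},
        "flights": [
            {"flight_id": "FL-" + seed[:6].upper(), "price_usd": 180 + int(seed[6:8], 16)},
            {"flight_id": "FL-" + seed[8:14].upper(), "price_usd": 240 + int(seed[14:16], 16)},
        ],
    }

print(search_flights("Boston", "2026-03-05"))   # identical output on every run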

Audio Demos

Listen to real user queries and compare how different voice agents respond. All inputs feature natural disfluencies recorded in uncontrolled environments.

Citation

If you find our work useful, please consider citing:

@article{lin2026fdbv3,
  title   = {Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency},
  author  = {Lin, Guan-Ting and Chen, Chen and Chen, Zhehuai and Lee, Hung-yi},
  journal = {arXiv preprint},
  year    = {2026}
}