All recordings are from real human speakers—no TTS synthesis. Each sample preserves the authentic ambient background audio for realistic evaluation.
21 scenarios test mid-utterance intent changes requiring dynamic state rollback—the hardest failure mode across all models.
Sequences of API calls across 4 domains (Travel, Finance, Housing, E-Commerce) with deterministic outputs for automatic scoring.
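Because the mock APIs return deterministic outputs, a scenario can be scored by comparing the agent's emitted tool calls against a gold trace. A minimal sketch of such a scorer, assuming calls are recorded as name/argument pairs (the record format is illustrative, not the benchmark's actual schema):

```python
# Sketch of automatic scoring against a gold tool-call trace.
# The {"name": ..., "args": ...} record format is an assumption,
# not the benchmark's actual schema.

def score_trace(predicted, gold):
    """Return (tool_selection_acc, argument_acc) for one scenario."""
    n = max(len(gold), 1)
    tool_hits = arg_hits = 0
    for pred, ref in zip(predicted, gold):
        if pred["name"] == ref["name"]:
            tool_hits += 1                  # right tool selected
            if pred["args"] == ref["args"]:
                arg_hits += 1               # and with exactly the right arguments
    return tool_hits / n, arg_hits / n

pred = [{"name": "search_flights",
         "args": {"destination": "Boston", "date": "2026-03-05"}}]
gold = [{"name": "search_flights",
         "args": {"destination": "Boston", "date": "2026-03-05"}}]
print(score_trace(pred, gold))  # (1.0, 1.0)
```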
Fine-grained breakdown into first-response, tool-call, and task-completion latency reveals where each system bottlenecks.
Performance profile across 7 dimensions.
Overall performance and turn-taking metrics across all 6 systems. GPT-Realtime leads on accuracy; Gemini 3.1 on latency.
*Columns 2–5 report tool use; columns 6–9 report turn-taking dynamics.*

| Model | Tool Sel ↑ | Arg Acc ↑ | Resp Qual ↑ | Pass@1 ↑ | Turn-take ↑ | Latency ↓ | Interrupt ↓ | Filler ↓ |
|---|---|---|---|---|---|---|---|---|
| GPT-Realtime | 0.876 | 0.680 | 0.792 | 0.600 | 96.0% | 6.89s | 13.5% | 16.9% |
| Gemini 2.5 Live | 0.786 | 0.593 | 0.554 | 0.490 | 92.0% | 7.26s | 14.1% | 8.9% |
| Gemini 3.1 Live | 0.817 | 0.588 | 0.718 | 0.540 | 78.0% | 4.25s | 19.2% | 31.7% |
| Grok | 0.797 | 0.542 | 0.617 | 0.430 | 94.0% | 6.65s | 25.5% | 44.3% |
| Ultravox | 0.794 | 0.513 | 0.510 | 0.410 | 96.0% | 8.40s | 47.9% | 88.0% |
| Cascaded | 0.803 | 0.562 | 0.655 | 0.450 | 100.0% | 10.12s | 33.0% | 26.9% |
Pass@1 performance degrades with scenario complexity. GPT-Realtime leads decisively (0.750 Easy, 0.433 Hard); Grok shows the steepest decline (0.200 Hard).
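Pass@1 here is the fraction of scenarios the agent completes on its first (and only) attempt. Broken out per difficulty level, it can be computed as in this sketch (the level labels and log format are illustrative):

```python
from collections import defaultdict

def pass_at_1_by_level(runs):
    """Pass@1 per difficulty level.

    runs: iterable of (level, first_try_success) pairs, where
    first_try_success is 1 if the single attempt completed the
    task and 0 otherwise.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, ok in runs:
        totals[level] += 1
        hits[level] += ok
    return {level: hits[level] / totals[level] for level in totals}

runs = [("easy", 1), ("easy", 1), ("easy", 0),
        ("hard", 1), ("hard", 0), ("hard", 0)]
print(pass_at_1_by_level(runs))
```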
Self-correction remains the hardest disfluency type—even the best model (GPT-Realtime) succeeds on only 58.8% of rollback scenarios.
Finance is the strongest domain (GPT-Realtime: 0.960); Housing is consistently the hardest (best: 0.308). GPT-Realtime leads all four domains.
Latency decomposition reveals where each system bottlenecks: first word, tool call, or task completion.
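One way to compute such a decomposition from timestamped interaction logs, with all three latencies measured from the end of the user's utterance (the field names are hypothetical, not the benchmark's log schema):

```python
from dataclasses import dataclass

@dataclass
class InteractionLog:
    # All timestamps in seconds from session start; the field names
    # are hypothetical, not the benchmark's log schema.
    user_end: float    # user finished speaking
    first_word: float  # agent's first audible word
    tool_call: float   # agent's first tool invocation
    task_done: float   # task-completion confirmation

def decompose_latency(log: InteractionLog) -> dict:
    """Split end-to-end delay into first-response, tool-call, and
    task-completion latency, each measured from end of user speech."""
    return {
        "first_response": round(log.first_word - log.user_end, 2),
        "tool_call": round(log.tool_call - log.user_end, 2),
        "task_completion": round(log.task_done - log.user_end, 2),
    }

log = InteractionLog(user_end=3.2, first_word=4.1, tool_call=5.0, task_done=9.8)
print(decompose_latency(log))
# {'first_response': 0.9, 'tool_call': 1.8, 'task_completion': 6.6}
```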
Every recording is annotated for five disfluency categories, enabling fine-grained robustness analysis.
"I'd like to um search for flights to uh Boston"
Redundant acoustic tokens (um, uh) that may degrade reasoning accuracy or inflate latency.
"Book a flight to... [3s silence] ...New York on March 5th"
Mid-utterance silences during information retrieval that challenge end-of-turn detection.
"Can you check my my my order status for for order 5523"
Combinations of fillers and word repetitions that test the system's parsing ability.
"Search for hotels in— actually, find me flights to LA"
Abandoning an initial request for a new intent; model must discard obsolete context.
"Book me a flight to New York— wait, make that Boston"
Updating parameters mid-sentence; requires dynamic state rollback—the hardest category.
- `search_flights(destination, date)`
- `book_ticket(passenger_name, flight_id)`
- `update_travel_profile(document_type, document_number)`
- `query_card_benefits(card_last_4, category)`
- `calculate_currency_exchange(amount, from_currency, to_currency)`
- `modify_autopay_source(new_account_id)`
- `search_apartments(max_budget, amenities)`
- `update_search_filter(condition, new_value)`
- `check_order_status(order_id)`
- `cancel_pending_action(action_type)`
- `process_exchange(order_id, new_shipping_address)`

Listen to real user queries and compare how different voice agents respond. All inputs feature natural disfluencies recorded in uncontrolled environments.
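Tool inventories like this are typically exposed to an agent as JSON-schema function declarations. A generic sketch for one Travel-domain tool, following the widespread function-calling convention rather than the benchmark's exact tool format:

```python
import json

# search_flights declared as a function schema. The JSON-schema layout
# follows the common function-calling convention and is an assumption,
# not the benchmark's exact tool format.
search_flights_tool = {
    "name": "search_flights",
    "description": "Search available flights for a destination and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "destination": {"type": "string"},
            "date": {"type": "string", "description": "e.g. an ISO date"},
        },
        "required": ["destination", "date"],
    },
}
print(json.dumps(search_flights_tool, indent=2))
```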
If you find our work useful, please consider citing:
@article{lin2026fdbv3,
  title   = {Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency},
  author  = {Lin, Guan-Ting and Chen, Chen and Chen, Zhehuai and Lee, Hung-yi},
  journal = {arXiv preprint},
  year    = {2026}
}