Turn-Taking Research
Three independent open-source models for voice conversation dynamics — from raw audio pretraining to production CPU deployment.
Turn-taking is the hardest unsolved problem in voice AI.
Highlights: French end-of-turn detection from a model 140× smaller (0.5B beats 70B) · all three models run on CPU, no GPU, production-ready.
Executive Summary
What we built
Three independent turn-taking research projects: DualTurn (a published paper on learning turn-taking from dual-channel generative speech pretraining), Semantic Turn-Taking (an open-source model predicting agent actions from conversation context), and Anyreach Turn Detector (a production multilingual model that outperforms Llama 3.3 70B on French end-of-turn detection).
Why it matters
Turn-taking is the hardest unsolved problem in voice AI. Standard voice agents use silence timeouts — they wait for a gap and hope for the best. This causes unnatural delays, false interruptions, and robotic conversations. We have invested in three separate research tracks to understand and solve this at a fundamental level.
Results
- DualTurn: wF1 0.707 on otoSpeech — 53% relative improvement over VAP baseline
- DualTurn: BC F1 0.349 vs 0.000 for VAP — generative pretraining is the key unlock
- Semantic Turn-Taking: 91.82% accuracy on TEN benchmark, ONNX CPU inference in 128–191ms
- Anyreach Turn Detector: F1 0.957 on French — beats Llama 3.3 70B (0.825) with a 0.5B model
- All three models run on CPU without GPU infrastructure
Best for
- Voice agents requiring natural, anticipatory turn-taking
- Teams building multilingual voice AI (English + French)
- Researchers studying conversational dynamics and turn-taking signals
- Production deployments needing CPU-native inference
Limitations
- DualTurn trained on English two-party speech only (453h)
- Semantic Turn-Taking SwDA accuracy is 65.96% — lower due to domain mismatch
- Anyreach Turn Detector covers English and French; other languages not yet evaluated
DualTurn
The first use of speech-to-speech generative pretraining as a representation-learning stage for explicit turn-taking prediction in modular ASR-LLM-TTS pipelines. DualTurn learns conversational dynamics by predicting both speakers' future audio autoregressively on 453 hours of dual-channel data — with no manual labels — then fine-tunes lightweight classification heads to predict five discrete agent actions.
Generative pretraining — not model scale — is the key unlock. Without Stage 1 pretraining, backchannel F1 caps at 0.08 regardless of architecture or model size. With it: 0.349 — a 4× jump. The LLM backbone alone contributes only +0.002 BC F1 without pretraining; the backbone is the vessel, not the source.
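To make the two-stage recipe concrete, here is a minimal sketch of Stage 2: a lightweight classification head reading per-frame hidden states from the Stage-1 pretrained backbone. The `TurnTakingHead` class, the 896-dimensional feature size, and the placeholder features are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# The five discrete agent actions DualTurn predicts in Stage 2.
ACTIONS = ["start_talking", "continue_listening", "start_listening",
           "continue_talking", "backchannel"]

class TurnTakingHead(nn.Module):
    """Lightweight classification head on top of generatively pretrained
    backbone features (illustrative sketch, not the released code)."""

    def __init__(self, hidden_dim: int = 896, num_actions: int = len(ACTIONS)):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_actions)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, time, hidden_dim) from the Stage-1 backbone.
        # One action logit vector per frame; the agent acts on the latest frame.
        return self.proj(frame_features)

# Usage sketch: in practice the features come from the pretrained backbone.
head = TurnTakingHead()
backbone_features = torch.randn(1, 50, 896)    # placeholder features
action_logits = head(backbone_features)        # shape (1, 50, 5)
next_action = ACTIONS[action_logits[0, -1].argmax().item()]
```

The point of the sketch is the asymmetry the results show: the head is tiny, and almost all of the backchannel signal comes from what the pretrained backbone already encodes.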
Key Results
- Two-stage training: Stage 1 generative pretraining on 453h dual-channel audio (no labels required), Stage 2 lightweight classification heads
- 5 agent actions: Start-Talking, Continue-Listening, Start-Listening, Continue-Talking, Backchannel
- 240ms anticipation — predicts turn end before it happens
- Runs at ~78ms on CPU with KV-caching; ~27ms on A100 — production-ready
- Codebook analysis: 56% of turn-end discrimination from semantics (CB0), 26% from prosody (CB1)
- Open datasets: otoSpeech (289h, 1,125 conversations) and Switchboard splits released on HuggingFace
Agent Action wF1 — Switchboard
Higher is better. Same held-out test split as VAP and Wang et al.
| Model | wF1 | BC F1 |
|---|---|---|
| DualTurn (Qwen2.5-0.5B LoRA) (ours) | 0.633 | 0.349 |
| VAP (5.8M) | 0.389 | 0.000 |
Agent Action wF1 — otoSpeech
Internal dataset. VAP is the primary published baseline for this task.
| Model | wF1 | BC F1 |
|---|---|---|
| DualTurn (Qwen2.5-0.5B LoRA) (ours) | 0.707 | 0.349 |
| VAP (5.8M) | 0.461 | 0.000 |
Semantic Turn-Taking
A fine-tuned Qwen2.5-0.5B-Instruct model that predicts what a voice agent should do next, given the conversation transcript. Unlike acoustic approaches (VAD, silence detection), this model uses the semantic content of the conversation. Fine-tuned on ~154K synthetic examples, available as PyTorch and ONNX INT8 for production CPU deployment.
Four actions map directly to voice agent behavior: start_speaking (user finished their turn), continue_listening (user is mid-utterance), start_listening (user interrupted the agent), continue_speaking (user gave a backchannel). One model call, one clear action.
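A minimal inference sketch, assuming the checkpoint loads with the Hugging Face `transformers` API and that each action is a single added token; `MODEL_ID` and the exact prompt layout are placeholders, not the published interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "anyreach/semantic-turn-taking"  # placeholder repo id
ACTIONS = ["start_speaking", "continue_listening",
           "start_listening", "continue_speaking"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# ChatML-formatted conversation followed by the <|predict|> trigger.
prompt = (
    "<|im_start|>assistant\nHow can I help you today?<|im_end|>\n"
    "<|im_start|>user\nI'd like to change my<|im_end|>\n"
    "<|predict|>"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token logits

# Restrict the distribution to the four action tokens (assumed to be
# single special tokens) and read off the agent's next action.
action_ids = tokenizer.convert_tokens_to_ids(ACTIONS)
probs = torch.softmax(logits[action_ids], dim=-1)
print(dict(zip(ACTIONS, probs.tolist())))
```

One forward pass yields the full distribution over actions, so the agent can also apply its own thresholds (for example, being more conservative about interrupting).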
Key Results
- Base model: Qwen2.5-0.5B-Instruct (494M parameters), fine-tuned on ~154K synthetic conversations
- 91.82% accuracy on TEN benchmark (binary end-of-turn detection)
- ONNX INT8 quantized: 473MB, 128–191ms on CPU — production-ready without a GPU
- Open benchmark dataset released: TEN (428 examples), SwDA (2,688 examples), Synthetic (36 examples)
- Input: ChatML conversation + <|predict|> trigger. Output: probability over 4 action tokens
Binary Classification — End of Turn vs. Not
start_speaking / continue_speaking mapped to EOU; continue_listening / start_listening mapped to Not-EOU.
| Subset | N | Accuracy | F1 (macro) |
|---|---|---|---|
| TEN | 428 | 91.82% | 91.80% |
| SwDA | 2,688 | 65.96% | 51.46% |
| Synthetic | 36 | 86.11% | 85.57% |
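The binary numbers above are derived from the four-way predictions. A small sketch of the EOU mapping and metric computation with scikit-learn, assuming gold and predicted action labels are available as parallel lists:

```python
from sklearn.metrics import accuracy_score, f1_score

# Collapse the four actions into the binary end-of-utterance (EOU) task.
EOU_MAP = {
    "start_speaking": 1,      # user finished their turn   -> EOU
    "continue_speaking": 1,   # user gave a backchannel    -> EOU
    "continue_listening": 0,  # user is mid-utterance      -> Not-EOU
    "start_listening": 0,     # user interrupted the agent -> Not-EOU
}

def binary_scores(gold_actions, pred_actions):
    y_true = [EOU_MAP[a] for a in gold_actions]
    y_pred = [EOU_MAP[a] for a in pred_actions]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }

# Example: two correct predictions, one miss.
print(binary_scores(
    ["start_speaking", "continue_listening", "start_speaking"],
    ["start_speaking", "continue_listening", "continue_listening"],
))
```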
Inference Latency
| Format | Size | Short (8 tok) | Long (54 tok) |
|---|---|---|---|
| PyTorch GPU (fp16) | 942 MB | 26 ms | 34 ms |
| PyTorch CPU (fp32) | 942 MB | 165 ms | 289 ms |
| ONNX CPU (q8) (ours) | 473 MB | 128 ms | 191 ms |
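For the CPU path, the INT8 export can be served with ONNX Runtime. The sketch below assumes the export is compatible with `optimum`'s causal-LM loader; `MODEL_ID` is a placeholder repo id and the action tokens are the same assumed special tokens as above.

```python
import torch
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "anyreach/semantic-turn-taking-onnx-int8"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = ORTModelForCausalLM.from_pretrained(MODEL_ID)  # CPU execution provider

prompt = "<|im_start|>user\nOkay so my order number is<|im_end|>\n<|predict|>"
inputs = tokenizer(prompt, return_tensors="pt")
logits = model(**inputs).logits[0, -1]

action_ids = tokenizer.convert_tokens_to_ids(
    ["start_speaking", "continue_listening", "start_listening", "continue_speaking"])
probs = torch.softmax(logits[action_ids], dim=-1)
```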
Anyreach Turn Detector
A custom fine-tuned Qwen2.5-0.5B-Instruct model for binary end-of-turn detection, built to run on CPU. Benchmarked against the strongest available baselines including Llama 3.3 70B — a model 140× larger. We also trained a SmolLM2-135M variant for ultra-low compute environments. Evaluated on English and French.
On French, our 0.5B model achieves F1 0.957 — outperforming Llama 3.3 70B (F1 0.825) and livekit's multilingual model (F1 0.746). Competitive multilingual accuracy at a fraction of the compute and cost.
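In a voice agent, the detector gates when the agent is allowed to respond. A minimal sketch of that decision, assuming a `transformers`-loadable checkpoint and two label tokens; `MODEL_ID`, the token names, and the 0.5 threshold are illustrative assumptions, not the released interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "anyreach/turn-detector-qwen2.5-0.5b"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def end_of_turn_probability(transcript: str) -> float:
    """Score a partial user transcript for end-of-turn (binary task)."""
    inputs = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # Placeholder label tokens; the released checkpoint may differ.
    eou_id, cont_id = tokenizer.convert_tokens_to_ids(["<|eou|>", "<|continue|>"])
    return torch.softmax(logits[[eou_id, cont_id]], dim=-1)[0].item()

# French example: respond only when the detector is confident the turn ended.
if end_of_turn_probability("Je voudrais annuler ma commande s'il vous plaît") > 0.5:
    print("agent: start speaking")
else:
    print("agent: keep listening")
```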
Key Results
- Binary task: end-of-turn vs. continue-listening — optimized for low-latency voice agent deployment
- SmolLM2-135M variant for ultra-low compute; Qwen2.5-0.5B variant for higher accuracy
- English: SmolLM2-135M achieves F1 0.906 — outperforms livekit 135M (0.659) by 37%
- French: Qwen2.5-0.5B achieves F1 0.957 — surpasses Llama 3.3 70B (0.825)
- CPU-native deployment, no GPU required
English Benchmark
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Llama 3.3 70B | 0.934 | 0.984 | 0.897 | 0.939 |
| Anyreach Turn Detector (SmolLM2-135M) (ours) | 0.889 | 0.856 | 0.963 | 0.906 |
| livekit/turn-detector-multilingual (Qwen2.5-0.5B) | 0.855 | 0.806 | 0.975 | 0.882 |
| Anyreach Turn Detector (Qwen2.5-0.5B) (ours) | 0.864 | 0.849 | 0.918 | 0.881 |
| latishab/turnsense (135M) | 0.775 | 0.720 | 0.976 | 0.829 |
| livekit/turn-detector-en (135M) | 0.714 | 0.981 | 0.496 | 0.659 |
French Benchmark
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Anyreach Turn Detector (Qwen2.5-0.5B) (ours) | 0.955 | 0.924 | 0.992 | 0.957 |
| Llama 3.3 70B | 0.817 | 0.805 | 0.846 | 0.825 |
| Llama 3.1 8B | 0.798 | 0.750 | 0.910 | 0.822 |
| livekit/turn-detector-multilingual (Qwen2.5-0.5B) | 0.685 | 0.628 | 0.918 | 0.746 |
Frequently Asked Questions
Common questions about our turn-taking research and models.
Methodology
How we built, trained, and evaluated this model.
Dataset
DualTurn uses dual-channel conversational audio from otoSpeech (internal, 24kHz) and Switchboard (telephone, 8kHz). Semantic Turn-Taking uses ~154K synthetic fine-tuning examples evaluated on the semantic-turn-taking-benchmark (TEN, SwDA, Synthetic splits). Anyreach Turn Detector uses internal conversational data for English and French evaluation.
Labeling
DualTurn uses fully self-supervised labels (EOT, HOLD, BOT, BC, VAD, FVAD) derived algorithmically from audio timing — no manual annotation required. Semantic Turn-Taking uses synthetically generated conversation examples. Turn Detector uses curated end-of-turn labels.
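As an illustration of how timing-only labels can be derived, the sketch below reads coarse EOT, BOT, and BC events off dual-channel voice-activity segments. The thresholds and rules here are simplified placeholders, not the paper's exact labeling algorithm.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) voice-activity span for one speaker

def label_events(spk_a: List[Segment], spk_b: List[Segment],
                 bc_max_dur: float = 1.0) -> List[Tuple[float, str]]:
    """Derive coarse turn-taking events from dual-channel VAD timings.
    Illustrative only: thresholds and rules are placeholders."""
    events = []
    for start, end in spk_b:
        # A short B segment fully inside A's speech counts as a backchannel.
        overlapped = any(a_start < end and a_end > start for a_start, a_end in spk_a)
        if overlapped and (end - start) <= bc_max_dur:
            events.append((start, "BC"))    # backchannel by B
        else:
            events.append((start, "BOT"))   # B begins a turn
    for _, end in spk_a:
        events.append((end, "EOT"))         # A's segment ends: candidate end of turn
    return sorted(events)

print(label_events(spk_a=[(0.0, 3.2), (5.0, 8.4)], spk_b=[(2.5, 2.9), (8.6, 10.0)]))
```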
Evaluation Protocol
DualTurn: weighted F1 (wF1) and BC F1 on 5-class agent action prediction. Semantic Turn-Taking: accuracy and macro F1 on binary and multi-class turn-taking tasks. Turn Detector: accuracy, precision, recall, and F1 on binary end-of-turn detection.
Known Limitations
- DualTurn limited to English two-party speech; multilingual and multi-party evaluation is future work
- Semantic Turn-Taking SwDA performance (65.96%) reflects domain mismatch between synthetic training data and the SwDA corpus style
- Turn Detector French evaluation uses internal data; independent third-party evaluation pending
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
