Turn-Taking Research
Three independent open-source models for voice conversation dynamics — from raw audio pretraining to production CPU deployment.
Turn-taking is the hardest unsolved problem in voice AI.
Highlights: French end-of-turn detection from a model 140× smaller (0.5B beats 70B) · all three models run on CPU, no GPU, production-ready.
Executive Summary
What we built
Three independent turn-taking research projects: DualTurn (a published paper on learning turn-taking from dual-channel generative speech pretraining), Semantic Turn-Taking (an open-source model predicting agent actions from conversation context), and Anyreach Turn Detector (a production multilingual model that outperforms Llama 3.3 70B on French end-of-turn detection).
Why it matters
Turn-taking is the hardest unsolved problem in voice AI. Standard voice agents use silence timeouts — they wait for a gap and hope for the best. This causes unnatural delays, false interruptions, and robotic conversations. We have invested in three separate research tracks to understand and solve this at a fundamental level.
Results
- DualTurn: wF1 0.707 on otoSpeech — 53% relative improvement over VAP baseline
- DualTurn: BC F1 0.349 vs 0.000 for VAP — generative pretraining is the key unlock
- Semantic Turn-Taking: 91.82% accuracy on TEN benchmark, ONNX CPU inference in 128–191ms
- Anyreach Turn Detector: F1 0.957 on French — beats Llama 3.3 70B (0.825) with a 0.5B model
- All three models run on CPU without GPU infrastructure
Best for
- Voice agents requiring natural, anticipatory turn-taking
- Teams building multilingual voice AI (English + French)
- Researchers studying conversational dynamics and turn-taking signals
- Production deployments needing CPU-native inference
Limitations
- DualTurn trained on English two-party speech only (453h)
- Semantic Turn-Taking SwDA accuracy is 65.96% — lower due to domain mismatch
- Anyreach Turn Detector covers English and French; other languages not yet evaluated
DualTurn
The first use of speech-to-speech generative pretraining as a representation-learning stage for explicit turn-taking prediction in modular ASR-LLM-TTS pipelines. DualTurn learns conversational dynamics by predicting both speakers' future audio autoregressively on 453 hours of dual-channel data — with no manual labels — then fine-tunes lightweight classification heads to predict five discrete agent actions.
Generative pretraining — not model scale — is the key unlock. Without Stage 1 pretraining, backchannel F1 caps at 0.08 regardless of architecture or model size. With it: 0.349 — a 4× jump. The LLM backbone alone contributes only +0.002 BC F1 without pretraining; the backbone is the vessel, not the source.
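To make the two-stage recipe concrete, here is a minimal sketch of Stage 2: a lightweight classification head reading per-frame hidden states from the Stage-1 pretrained backbone. The `TurnTakingHead` class, the 896-dimensional feature size, and the placeholder features are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# The five discrete agent actions DualTurn predicts in Stage 2.
ACTIONS = ["start_talking", "continue_listening", "start_listening",
           "continue_talking", "backchannel"]

class TurnTakingHead(nn.Module):
    """Lightweight classification head on top of generatively pretrained
    backbone features (illustrative sketch, not the released code)."""

    def __init__(self, hidden_dim: int = 896, num_actions: int = len(ACTIONS)):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_actions)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, time, hidden_dim) from the Stage-1 backbone.
        # One action logit vector per frame; the agent acts on the latest frame.
        return self.proj(frame_features)

# Usage sketch: in practice the features come from the pretrained backbone.
head = TurnTakingHead()
backbone_features = torch.randn(1, 50, 896)    # placeholder features
action_logits = head(backbone_features)        # shape (1, 50, 5)
next_action = ACTIONS[action_logits[0, -1].argmax().item()]
```

The point of the sketch is the asymmetry the results show: the head is tiny, and almost all of the backchannel signal comes from what the pretrained backbone already encodes.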
Key Results
- Two-stage training: Stage 1 generative pretraining on 453h dual-channel audio (no labels required), Stage 2 lightweight classification heads
- 5 agent actions: Start-Talking, Continue-Listening, Start-Listening, Continue-Talking, Backchannel
- 240ms anticipation — predicts turn end before it happens
- Runs at ~78ms on CPU with KV-caching; ~27ms on A100 — production-ready
- Codebook analysis: 56% of turn-end discrimination from semantics (CB0), 26% from prosody (CB1)
- Open datasets: otoSpeech (289h, 1,125 conversations) and Switchboard splits released on HuggingFace
Agent Action wF1 — Switchboard
Higher is better. Same held-out test split as VAP and Wang et al.
| Model | wF1 | BC F1 |
|---|---|---|
| DualTurn (Qwen2.5-0.5B LoRA) (ours) | 0.633 | 0.349 |
| VAP (5.8M) | 0.389 | 0.000 |
Agent Action wF1 — otoSpeech
Internal dataset. VAP is the primary published baseline for this task.
| Model | wF1 | BC F1 |
|---|---|---|
| DualTurn (Qwen2.5-0.5B LoRA) (ours) | 0.707 | 0.349 |
| VAP (5.8M) | 0.461 | 0.000 |
Semantic Turn-Taking
A fine-tuned Qwen2.5-0.5B-Instruct model that predicts what a voice agent should do next, given the conversation transcript. Unlike acoustic approaches (VAD, silence detection), this model uses the semantic content of the conversation. Fine-tuned on ~154K synthetic examples, available as PyTorch and ONNX INT8 for production CPU deployment.
Four actions map directly to voice agent behavior: start_speaking (user finished their turn), continue_listening (user is mid-utterance), start_listening (user interrupted the agent), continue_speaking (user gave a backchannel). One model call, one clear action.
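A minimal inference sketch, assuming the checkpoint loads with the Hugging Face `transformers` API and that each action is a single added token; `MODEL_ID` and the exact prompt layout are placeholders, not the published interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "anyreach/semantic-turn-taking"  # placeholder repo id
ACTIONS = ["start_speaking", "continue_listening",
           "start_listening", "continue_speaking"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

# ChatML-formatted conversation followed by the <|predict|> trigger.
prompt = (
    "<|im_start|>assistant\nHow can I help you today?<|im_end|>\n"
    "<|im_start|>user\nI'd like to change my<|im_end|>\n"
    "<|predict|>"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # next-token logits

# Restrict the distribution to the four action tokens (assumed to be
# single special tokens) and read off the agent's next action.
action_ids = tokenizer.convert_tokens_to_ids(ACTIONS)
probs = torch.softmax(logits[action_ids], dim=-1)
print(dict(zip(ACTIONS, probs.tolist())))
```

One forward pass yields the full distribution over actions, so the agent can also apply its own thresholds (for example, being more conservative about interrupting).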
Key Results
- Base model: Qwen2.5-0.5B-Instruct (494M parameters), fine-tuned on ~154K synthetic conversations
- 91.82% accuracy on TEN benchmark (binary end-of-turn detection)
- ONNX INT8 quantized: 473MB, 128–191ms on CPU — production-ready without a GPU
- Open benchmark dataset released: TEN (428 examples), SwDA (2,688 examples), Synthetic (36 examples)
- Input: ChatML conversation + <|predict|> trigger. Output: probability over 4 action tokens
Binary Classification — End of Turn vs. Not
start_speaking / continue_speaking mapped to EOU; continue_listening / start_listening mapped to Not-EOU.
| Subset | N | Accuracy | F1 (macro) |
|---|---|---|---|
| TEN | 428 | 91.82% | 91.80% |
| SwDA | 2,688 | 65.96% | 51.46% |
| Synthetic | 36 | 86.11% | 85.57% |
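The binary numbers above are derived from the four-way predictions. A small sketch of the EOU mapping and metric computation with scikit-learn, assuming gold and predicted action labels are available as parallel lists:

```python
from sklearn.metrics import accuracy_score, f1_score

# Collapse the four actions into the binary end-of-utterance (EOU) task.
EOU_MAP = {
    "start_speaking": 1,      # user finished their turn   -> EOU
    "continue_speaking": 1,   # user gave a backchannel    -> EOU
    "continue_listening": 0,  # user is mid-utterance      -> Not-EOU
    "start_listening": 0,     # user interrupted the agent -> Not-EOU
}

def binary_scores(gold_actions, pred_actions):
    y_true = [EOU_MAP[a] for a in gold_actions]
    y_pred = [EOU_MAP[a] for a in pred_actions]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }

# Example: two correct predictions, one miss.
print(binary_scores(
    ["start_speaking", "continue_listening", "start_speaking"],
    ["start_speaking", "continue_listening", "continue_listening"],
))
```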
Inference Latency
| Format | Size | Short (8 tok) | Long (54 tok) |
|---|---|---|---|
| PyTorch GPU (fp16) | 942 MB | 26 ms | 34 ms |
| PyTorch CPU (fp32) | 942 MB | 165 ms | 289 ms |
| ONNX CPU (q8) (ours) | 473 MB | 128 ms | 191 ms |
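For the CPU path, the INT8 export can be served with ONNX Runtime. The sketch below assumes the export is compatible with `optimum`'s causal-LM loader; `MODEL_ID` is a placeholder repo id and the action tokens are the same assumed special tokens as above.

```python
import torch
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "anyreach/semantic-turn-taking-onnx-int8"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = ORTModelForCausalLM.from_pretrained(MODEL_ID)  # CPU execution provider

prompt = "<|im_start|>user\nOkay so my order number is<|im_end|>\n<|predict|>"
inputs = tokenizer(prompt, return_tensors="pt")
logits = model(**inputs).logits[0, -1]

action_ids = tokenizer.convert_tokens_to_ids(
    ["start_speaking", "continue_listening", "start_listening", "continue_speaking"])
probs = torch.softmax(logits[action_ids], dim=-1)
```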
Anyreach Turn Detector
A custom fine-tuned Qwen2.5-0.5B-Instruct model for binary end-of-turn detection, built to run on CPU. Benchmarked against the strongest available baselines including Llama 3.3 70B — a model 140× larger. We also trained a SmolLM2-135M variant for ultra-low compute environments. Evaluated on English and French.
On French, our 0.5B model achieves F1 0.957 — outperforming Llama 3.3 70B (F1 0.825) and livekit's multilingual model (F1 0.746). Competitive multilingual accuracy at a fraction of the compute and cost.
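In a voice agent, the detector gates when the agent is allowed to respond. A minimal sketch of that decision, assuming a `transformers`-loadable checkpoint and two label tokens; `MODEL_ID`, the token names, and the 0.5 threshold are illustrative assumptions, not the released interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "anyreach/turn-detector-qwen2.5-0.5b"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def end_of_turn_probability(transcript: str) -> float:
    """Score a partial user transcript for end-of-turn (binary task)."""
    inputs = tokenizer(transcript, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # Placeholder label tokens; the released checkpoint may differ.
    eou_id, cont_id = tokenizer.convert_tokens_to_ids(["<|eou|>", "<|continue|>"])
    return torch.softmax(logits[[eou_id, cont_id]], dim=-1)[0].item()

# French example: respond only when the detector is confident the turn ended.
if end_of_turn_probability("Je voudrais annuler ma commande s'il vous plaît") > 0.5:
    print("agent: start speaking")
else:
    print("agent: keep listening")
```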
Key Results
- Binary task: end-of-turn vs. continue-listening — optimized for low-latency voice agent deployment
- SmolLM2-135M variant for ultra-low compute; Qwen2.5-0.5B variant for higher accuracy
- English: SmolLM2-135M achieves F1 0.906 — outperforms livekit 135M (0.659) by 37%
- French: Qwen2.5-0.5B achieves F1 0.957 — surpasses Llama 3.3 70B (0.825)
- CPU-native deployment, no GPU required
English Benchmark
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Llama 3.3 70B | 0.934 | 0.984 | 0.897 | 0.939 |
| Anyreach Turn Detector (SmolLM2-135M) (ours) | 0.889 | 0.856 | 0.963 | 0.906 |
| livekit/turn-detector-multilingual (Qwen2.5-0.5B) | 0.855 | 0.806 | 0.975 | 0.882 |
| Anyreach Turn Detector (Qwen2.5-0.5B) (ours) | 0.864 | 0.849 | 0.918 | 0.881 |
| latishab/turnsense (135M) | 0.775 | 0.720 | 0.976 | 0.829 |
| livekit/turn-detector-en (135M) | 0.714 | 0.981 | 0.496 | 0.659 |
French Benchmark
| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Anyreach Turn Detector (Qwen2.5-0.5B) (ours) | 0.955 | 0.924 | 0.992 | 0.957 |
| Llama 3.3 70B | 0.817 | 0.805 | 0.846 | 0.825 |
| Llama 3.1 8B | 0.798 | 0.750 | 0.910 | 0.822 |
| livekit/turn-detector-multilingual (Qwen2.5-0.5B) | 0.685 | 0.628 | 0.918 | 0.746 |
Frequently Asked Questions
Common questions about our turn-taking research and models.
Methodology
How we built, trained, and evaluated this model.
Dataset
DualTurn uses dual-channel conversational audio from otoSpeech (internal, 24kHz) and Switchboard (telephone, 8kHz). Semantic Turn-Taking uses ~154K synthetic fine-tuning examples evaluated on the semantic-turn-taking-benchmark (TEN, SwDA, Synthetic splits). Anyreach Turn Detector uses internal conversational data for English and French evaluation.
Labeling
DualTurn uses fully self-supervised labels (EOT, HOLD, BOT, BC, VAD, FVAD) derived algorithmically from audio timing — no manual annotation required. Semantic Turn-Taking uses synthetically generated conversation examples. Turn Detector uses curated end-of-turn labels.
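As an illustration of how timing-only labels can be derived, the sketch below reads coarse EOT, BOT, and BC events off dual-channel voice-activity segments. The thresholds and rules here are simplified placeholders, not the paper's exact labeling algorithm.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) voice-activity span for one speaker

def label_events(spk_a: List[Segment], spk_b: List[Segment],
                 bc_max_dur: float = 1.0) -> List[Tuple[float, str]]:
    """Derive coarse turn-taking events from dual-channel VAD timings.
    Illustrative only: thresholds and rules are placeholders."""
    events = []
    for start, end in spk_b:
        # A short B segment fully inside A's speech counts as a backchannel.
        overlapped = any(a_start < end and a_end > start for a_start, a_end in spk_a)
        if overlapped and (end - start) <= bc_max_dur:
            events.append((start, "BC"))    # backchannel by B
        else:
            events.append((start, "BOT"))   # B begins a turn
    for _, end in spk_a:
        events.append((end, "EOT"))         # A's segment ends: candidate end of turn
    return sorted(events)

print(label_events(spk_a=[(0.0, 3.2), (5.0, 8.4)], spk_b=[(2.5, 2.9), (8.6, 10.0)]))
```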
Evaluation Protocol
DualTurn: weighted F1 (wF1) and BC F1 on 5-class agent action prediction. Semantic Turn-Taking: accuracy and macro F1 on binary and multi-class turn-taking tasks. Turn Detector: accuracy, precision, recall, and F1 on binary end-of-turn detection.
Known Limitations
- DualTurn limited to English two-party speech; multilingual and multi-party evaluation is future work
- Semantic Turn-Taking SwDA performance (65.96%) reflects domain mismatch between synthetic training data and the SwDA corpus style
- Turn Detector French evaluation uses internal data; independent third-party evaluation pending
Ready to see this in action?
Book a technical walkthrough with our team to see how this research applies to your use case.
