eess.AS · cs.CL · cs.SD

DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

Shangeth Rajaa · Anyreach AI · March 2026 · arXiv:2603.08216
✓ Accepted at Interspeech 2026

DualTurn learns conversational turn-taking from raw dual-channel audio — no labels needed — and predicts five discrete agent actions that map directly to voice agent behavior. Runs on CPU. Anticipates turn boundaries 240ms early.

Abstract

Production ASR-LLM-TTS pipelines lack natural turn-taking — they rely on silence timeouts that cause unnatural delays and false interruptions. Speech-to-speech models handle turn-taking naturally but sacrifice tool-calling and complex reasoning. We present DualTurn, a turn-taking component that bridges this gap by learning conversational dynamics through generative pretraining on dual-channel audio. In Stage 1, a Qwen2.5-0.5B backbone is pretrained to predict both speakers' future audio autoregressively using continuous Mimi codec embeddings — requiring no manual annotation. In Stage 2, twelve lightweight classification heads predict six self-supervised turn-taking signals per channel, which compose into five discrete agent actions. To our knowledge, DualTurn is the first to use S2S generative pretraining as a representation-learning stage for explicit turn-taking prediction in modular pipelines. On Switchboard and otoSpeech held-out sets, DualTurn achieves 0.633 and 0.707 weighted F1 on agent actions, compared to 0.389 and 0.461 for VAP — the strongest published baseline. Backchannel F1 reaches 0.349 versus VAP's 0.000, with ablations showing generative pretraining is responsible for over 99% of this gain. The model runs continuously on a single CPU at approximately 78ms latency with 240ms anticipation of turn boundaries.


The Problem

Think of a dinner party conversation. When someone finishes speaking, you don't wait for two seconds of silence before replying. You pick up on subtle cues — a drop in pitch, a completed phrase, a breath — and begin preparing your response before they're even done.

Voice agents don't do this. The standard approach: detect N milliseconds of silence, then respond. This silence timeout causes two failure modes:

  • False interruptions: the agent cuts in during a thinking pause, mid-sentence
  • Dead air: the agent waits too long after the user finishes, creating an awkward gap

Speech-to-speech (S2S) models handle this naturally — they operate on raw audio and learn conversation flow implicitly. But they can't call tools, run business logic, or slot into existing infrastructure. Production voice stacks run ASR → LLM → TTS. These pipelines are powerful but turn-taking-blind.

DualTurn fills this gap: a turn-taking component that plugs into any modular pipeline and brings S2S-quality conversational timing to it.


What We Built

DualTurn listens to both sides of a conversation simultaneously and decides, every 240ms, what the voice agent should do next. It outputs one of five actions:

  • Start-Talking: User finished. Agent should respond.
  • Continue-Listening: User is mid-sentence. Keep waiting.
  • Start-Listening: User interrupted. Agent stops talking.
  • Continue-Talking: User said "uh-huh". Agent keeps going.
  • Backchannel: Short acknowledgment. No floor change.

These actions compose directly with any ASR-LLM-TTS pipeline — no architectural changes required. DualTurn runs as a parallel process on a single CPU thread at ~78ms latency.
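To make "compose directly" concrete, here is a hypothetical integration sketch in Python. The AgentAction enum mirrors the five actions above, while the pipeline methods (start_response, cancel_speech, play_ack) are illustrative names of our own, not part of DualTurn:

```python
# Hypothetical integration sketch (not the paper's API): mapping DualTurn's
# five agent actions onto an ASR -> LLM -> TTS pipeline's control calls.
from enum import Enum, auto

class AgentAction(Enum):
    START_TALKING = auto()       # user finished -> trigger LLM + TTS response
    CONTINUE_LISTENING = auto()  # user mid-sentence -> keep buffering ASR
    START_LISTENING = auto()     # user interrupted -> cancel TTS playback
    CONTINUE_TALKING = auto()    # user backchanneled -> keep current TTS playing
    BACKCHANNEL = auto()         # emit a short acknowledgment, user keeps the floor

def dispatch(action: AgentAction, pipeline) -> None:
    """Apply a predicted action to a pipeline object exposing start_response(),
    cancel_speech(), and play_ack() -- illustrative method names only."""
    if action is AgentAction.START_TALKING:
        pipeline.start_response()   # run the LLM on the buffered transcript, synthesize a reply
    elif action is AgentAction.START_LISTENING:
        pipeline.cancel_speech()    # stop TTS immediately and yield the floor
    elif action is AgentAction.BACKCHANNEL:
        pipeline.play_ack()         # e.g. a short "mm-hmm" without taking the turn
    # CONTINUE_LISTENING / CONTINUE_TALKING: no state change needed
```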

To our knowledge, DualTurn is the first model to use speech-to-speech generative pretraining as a representation-learning stage for explicit turn-taking prediction in modular pipelines.


How It Works

Stage 1 — Generative Pretraining

The model is first trained to predict the future: what will both speakers say next? This is done entirely from raw audio, with no human labels.

Audio is encoded using Mimi, a neural codec that converts speech into continuous 512-dimensional embeddings at 12.5 frames per second. Each speaker gets their own channel. A Qwen2.5-0.5B backbone processes both channels concatenated. A lightweight depth predictor autoregressively predicts the next audio tokens for both speakers simultaneously — across 453 hours of real English conversation.
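A minimal sketch of the Stage 1 setup, with stand-ins for every heavy component: random tensors in place of Mimi embeddings, a tiny Transformer in place of Qwen2.5-0.5B, and a single linear layer as the depth predictor. Feature-wise concatenation of the two channels and an MSE objective on the next-frame embeddings are our assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

MIMI_DIM, HIDDEN = 512, 256   # Mimi: 512-d continuous embeddings at 12.5 fps

class DualChannelPretrainer(nn.Module):
    """Stage 1 stand-in: predict both speakers' next-frame embeddings from the past."""
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(2 * MIMI_DIM, HIDDEN)              # concat user + agent channels
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for Qwen2.5-0.5B
        self.depth_predictor = nn.Linear(HIDDEN, 2 * MIMI_DIM)      # predicts both next frames

    def forward(self, user_emb, agent_emb):
        x = self.proj_in(torch.cat([user_emb, agent_emb], dim=-1))  # (B, T, HIDDEN)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))  # autoregressive mask
        h = self.backbone(x, mask=causal)
        return self.depth_predictor(h)                              # (B, T, 2 * MIMI_DIM)

# Toy training step on random "Mimi embeddings" standing in for hours of real audio.
B, T = 2, 50
user, agent = torch.randn(B, T, MIMI_DIM), torch.randn(B, T, MIMI_DIM)
model = DualChannelPretrainer()
pred = model(user[:, :-1], agent[:, :-1])                  # predict frame t+1 from frames <= t
target = torch.cat([user[:, 1:], agent[:, 1:]], dim=-1)
loss = nn.functional.mse_loss(pred, target)                # assumed loss on continuous embeddings
loss.backward()
```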

This forces the model to internalize conversational structure: when does one speaker stop? When does the other begin? What acoustic patterns signal completion versus a mid-thought pause? It learns all of this from audio alone.

After Stage 1, the depth predictor is discarded. Only the backbone — now rich with conversational understanding — is kept for Stage 2.

Stage 2 — Turn-Taking Classification Heads

Twelve lightweight heads are attached to the frozen backbone — six per speaker channel. Each head predicts one self-supervised signal derived automatically from audio timing (no manual annotation):

Signal   Definition
EOT      Speech offset + other speaker takes the floor within 4s
HOLD     Speech offset + same speaker resumes (mid-turn pause)
BOT      Speech onset ≥1s after the other speaker
BC       Short isolated utterance ≤1s (backchannel)
VAD      Binary voice activity per frame
FVAD     Future voice activity at 4 horizons (0–2s ahead)
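As a rough illustration of how these labels can be derived from timing alone, the sketch below computes EOT, HOLD, and BC from per-channel speech segments. The windows follow the table above, but the segment bookkeeping, helper names, and the omitted isolation check for BC are ours:

```python
# Simplified sketch: deriving EOT / HOLD / BC labels from dual-channel speech
# segments (onset, offset) in seconds. Definitions follow the table above; the
# implementation details here are ours, not the paper's code.
from typing import List, Tuple

Segment = Tuple[float, float]  # (onset, offset) of one continuous speech region

def label_offsets(own: List[Segment], other: List[Segment],
                  eot_window: float = 4.0) -> List[str]:
    """For each speech offset on this channel, decide EOT (other speaker takes
    the floor within eot_window) vs HOLD (same speaker resumes first)."""
    labels = []
    for i, (_, off) in enumerate(own):
        own_next = own[i + 1][0] if i + 1 < len(own) else float("inf")
        other_next = min((s for s, _ in other if s >= off), default=float("inf"))
        if other_next <= off + eot_window and other_next < own_next:
            labels.append("EOT")   # other speaker took the floor
        elif own_next < float("inf"):
            labels.append("HOLD")  # same speaker resumed: mid-turn pause
        else:
            labels.append("EOT")   # end of recording treated as turn end (our choice)
    return labels

def is_backchannel(seg: Segment, max_len: float = 1.0) -> bool:
    """BC: a short utterance no longer than max_len seconds
    (the isolation check is omitted here for brevity)."""
    return (seg[1] - seg[0]) <= max_len
```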

These six per-channel signals compose into the five agent actions using either zero-parameter heuristic thresholds or a logistic regression probe. Inference runs with 240ms stride and KV-caching, at ~78ms on CPU.
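One plausible reading of the zero-parameter heuristic composition is sketched below; the 0.5 cutoffs, the subset of signals used, and the agent_speaking flag are assumptions for illustration, not the paper's published rules:

```python
# Illustrative heuristic composition of per-channel signal probabilities into one
# of the five agent actions. Thresholds and state layout are assumptions.
def compose_action(user: dict, agent_speaking: bool, thr: float = 0.5) -> str:
    """user holds probabilities for the user's channel:
    keys 'vad', 'eot', 'bc' (voice activity, end-of-turn, backchannel)."""
    if agent_speaking:
        if user["vad"] > thr and user["bc"] <= thr:
            return "Start-Listening"    # real interruption: stop talking
        return "Continue-Talking"       # silence or a backchannel: keep going
    else:
        if user["bc"] > thr:
            return "Backchannel"        # acknowledge without taking the floor
        if user["vad"] <= thr and user["eot"] > thr:
            return "Start-Talking"      # user finished: respond
        return "Continue-Listening"     # user still holds the floor

# Example: user paused and the end-of-turn head is confident -> respond.
print(compose_action({"vad": 0.1, "eot": 0.9, "bc": 0.05}, agent_speaking=False))
```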


Results

DualTurn is evaluated against VAP — the strongest published baseline for this task — on two held-out test sets, using the same splits as prior work.

Agent Action Classification — weighted F1

Higher is better across all 5 agent action classes.

Model                        Switchboard wF1   otoSpeech wF1   BC F1
DualTurn (0.5B LoRA, ours)   0.633             0.707           0.349
VAP (5.8M)                   0.389             0.461           0.000

  • +53% weighted F1 on otoSpeech vs VAP
  • BC F1: 0.000 → 0.349 (VAP cannot detect backchannels at all)
  • 240ms of anticipation before each turn end

On word-level turn prediction, DualTurn achieves AUC 0.963 (logistic regression probe), compared to 0.880 for Wang et al.'s 3.1B-parameter model. DualTurn's median shift anticipation is −360ms before turn end, versus VAP's −140ms.


Key Findings

Generative pretraining is the key ingredient

Without Stage 1, backchannel F1 caps at 0.08 — regardless of whether you use an LSTM or a large LLM. With generative pretraining: 0.349. A 4× improvement. The LLM backbone alone (without pretraining) contributes only +0.002 BC F1. The backbone is the vessel, not the source.

Semantics dominate turn-end discrimination

Ablating individual Mimi codebooks shows 56% of turn-end discrimination comes from CB0 (semantic content), 26% from CB1 (broad prosody), and only 18% from CB2–7 (fine-grained acoustics). Conversations end on meaning, not just sound.

Continuous embeddings beat discrete tokens

Using continuous Mimi embeddings (wF1 0.633, BC F1 0.349) substantially outperforms discrete codebook indices (wF1 0.602, BC F1 0.072). Quantization destroys the prosodic nuance that backchannel detection depends on.

ASR supervision hurts

Adding an ASR auxiliary objective during Stage 1 causes BC F1 to drop from 0.349 to 0.085. Forcing the model to align with transcription text actively suppresses the prosodic representations that make backchannel detection possible.


Datasets

Stage 1 pretraining uses 453 hours of real dual-channel English conversation. Evaluation uses the held-out Switchboard and otoSpeech test sets, following the same splits as prior work.


Citation

@misc{rajaa2026dualturnlearningturntakingdualchannel,
      title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
      author={Shangeth Rajaa},
      year={2026},
      eprint={2603.08216},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.08216},
}