Raon-SpeechChat-9B

Raon-SpeechChat Logo

Homepage GitHub
Hugging Face X
License

Demo | Technical Report | Blog (Coming soon)

Raon-SpeechChat-9B is a full-duplex speech language model that enables real-time, simultaneous listen-and-speak conversation in English. Built on top of Raon-Speech-9B, it extends the base model with full-duplex decoding — the model can listen to a user and generate speech responses at the same time, supporting natural turn-taking, backchannels ("uh-huh", "mm-hmm"), and barge-in handling.

Key Features

  • Full-Duplex Conversation: Simultaneous listen-and-speak decoding — the model processes user speech and generates responses in real time, just like a natural conversation.
  • End-to-End Speech Language Model: Built on Qwen3 (36 layers, 4096 hidden dim), Voxtral-Mini-4B-Realtime-2602 Audio Encoder (32 layers), Mimi codec (32 quantizers), ECAPA-TDNN speaker encoder, Qwen3OmniMoeTalkerCodePredictor (5 layers, 1024 hidden dim), and Qwen3-based Talker (4 layers, 2048 hidden dim).
  • English Support: Real-time conversational speech understanding and generation in English.
  • Backchannel Responses: Dedicated backchannel token (<|audio_output_backchannel|>) for natural conversational feedback like "uh-huh" and "mm-hmm", with adjustable frequency via backchannel penalty.
  • Speak-First / Listen-First Modes: Configurable via runtime token forcing — the model can either wait for user speech before responding (listen-first) or begin speaking immediately (speak-first).
  • Persona-Driven Conversations: 17 built-in personas with customizable system prompts, context injection, and persona catalog support.
  • Speaker Voice Conditioning: Optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
  • HuggingFace Transformers Integration: Load and run directly via AutoModel.from_pretrained with trust_remote_code=True — no custom package installation required.

Benchmark Results

Raon-SpeechChat performs strongly on conversational speech capabilities such as pause handling, backchanneling, smooth turn-taking, interruption handling, overlap robustness, and multi-turn dialogue.

Raon-SpeechChat Benchmark Results

Requirements

pip install 'transformers>=4.57.1,<5.0' torch torchaudio soundfile accelerate

# Optional
pip install speechbrain  # for speaker voice conditioning
pip install gradio       # for Gradio demo
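The transformers pin above matters for the remote-code loading path. As an illustrative sanity check (not part of the official setup), a small helper can verify that an installed version string satisfies the `>=4.57.1,<5.0` constraint:

```python
def transformers_version_ok(version: str) -> bool:
    """Check a version string against the >=4.57.1,<5.0 pin above.

    Illustrative helper only; in practice pip enforces the constraint.
    Pre-release suffixes (e.g. "4.58.0.dev0") are ignored by the [:3] slice.
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    return (4, 57, 1) <= parts < (5, 0)

print(transformers_version_ok("4.57.1"))  # True: lower bound is inclusive
print(transformers_version_ok("5.0.0"))   # False: upper bound is exclusive
```

In a real environment, `importlib.metadata.version("transformers")` supplies the installed version string to check.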

Quick Start

Option 1: Load from Hub (recommended)

No `pip install raon` is needed; the model code is downloaded from the Hub via `trust_remote_code`.

import importlib
import torch
from transformers import AutoModel

MODEL_ID = "KRAFTON/Raon-SpeechChat-9B"

# Load model (downloads code + weights from Hub)
_model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, dtype=torch.bfloat16, device_map="cuda")

# Get RaonPipeline from Hub module
hub_module = importlib.import_module(type(_model).__module__)
RaonPipeline = hub_module.RaonPipeline
del _model

# Create pipeline
pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
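The `importlib` step above works because `trust_remote_code=True` materializes the repo's Python files as an importable dynamic module. The same pattern generalizes to any other public symbol shipped alongside the model class (a sketch; whether other symbols exist in this repo is an assumption):

```python
import importlib

def symbols_from_hub_model(model, *names):
    """Fetch named attributes from the dynamic module a Hub model was loaded from."""
    module = importlib.import_module(type(model).__module__)
    return [getattr(module, name) for name in names]
```

This avoids hard-coding the auto-generated dynamic module path, which can change between transformers versions.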

Option 2: With raon package installed

pip install -e .  # or: uv sync

from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-SpeechChat-9B")

# From local path
pipe = RaonPipeline("/path/to/raon-duplex-model")
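Speaker voice conditioning (see Key Features) takes a reference audio clip. As a minimal stdlib-only sketch of preparing one, the snippet below reads a mono 16-bit PCM WAV into normalized float samples; the sample rate the pipeline expects and how the reference is passed in are assumptions here (check the repo), and in practice `soundfile` or `torchaudio` from Requirements is the more convenient route:

```python
import struct
import wave

def load_wav_mono_float(path: str):
    """Read a mono 16-bit PCM WAV into floats in [-1, 1] plus its sample rate.

    Illustrative only; real code would typically use soundfile.read(path).
    """
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1, "expected mono reference audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        sample_rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw)
    return [s / 32768.0 for s in samples], sample_rate
```

The resulting float list and sample rate can then be handed to whatever reference-audio argument the pipeline exposes.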

Related Models

  • Raon-Speech-9B — Base speech language model supporting STT, TTS, TextQA, and SpeechChat tasks.

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

© 2026 KRAFTON
