Raon-Speech-9B


Technical Report | Blog (Coming soon)

Raon-Speech is a 9B-parameter speech language model that delivers state-of-the-art speech understanding, answering, and generation in English and Korean. The model transforms a pre-trained LLM into a SpeechLM that both understands and generates speech without compromising the original language capabilities. It is trained on millions of hours of English-Korean speech-text data in three stages: (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training.

Key Features

  • End-to-End Speech Language Model: 9B-parameter multimodal model built on Qwen3 (36 layers, 4096 hidden dim), Qwen3OmniMoeAudioEncoder (24 layers), Mimi codec (32 quantizers), and ECAPA-TDNN speaker encoder.
  • Bilingual Support: State-of-the-art speech understanding, answering, and generation in both English and Korean.
  • Multi-Task Capabilities: Supports STT (audio → text), TTS (text → audio), TextQA (text + audio → text), and SpeechChat (audio → text) in a single unified model.
  • Speaker Voice Conditioning: TTS with optional speaker reference audio for voice cloning via ECAPA-TDNN embeddings.
  • TTS Continuation: Generate speech that naturally continues from a reference audio, with prefill-based continuation for seamless prosody.
  • Multi-Reward DPO Post-Training: Three-stage training pipeline — (1) speech encoder-decoder alignment, (2) end-to-end SpeechLM pre-training, and (3) multi-reward DPO-based post-training — for high-quality speech generation.
  • HuggingFace Transformers Integration: Load and run directly via AutoModel.from_pretrained with trust_remote_code=True — no custom package installation required.

Benchmark Results

Latency measured on LibriSpeech test-clean samples via streaming TTS on a single GPU. All values are averages.

| Metric | RTX 6000 Pro | L40S |
| ------ | ------------ | ---- |
| RTF | 0.27 (3.7× real-time) | 0.45 (2.2× real-time) |
| TTFT | 617 ms | 887 ms |
| TBT | 135 ms | 233 ms |

  • RTF (Real-Time Factor): Lower is faster. Values below 1.0 mean faster-than-real-time synthesis.
  • TTFT (Time to First Token): Latency until the first audio chunk is returned.
  • TBT (Time Between Tokens): Average interval between consecutive audio chunks.
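As a concrete illustration of how these three metrics relate, the sketch below computes them from a list of chunk-arrival timestamps. The timestamps and durations are made-up illustrative values, not the benchmark runs above:

```python
# Sketch: derive RTF, TTFT, and TBT from streaming-TTS chunk timestamps.
def streaming_metrics(request_t, chunk_times, audio_duration_s):
    """chunk_times: wall-clock arrival time of each audio chunk (seconds)."""
    ttft = chunk_times[0] - request_t                 # time to first token
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    tbt = sum(gaps) / len(gaps)                       # mean time between tokens
    wall = chunk_times[-1] - request_t                # total synthesis time
    rtf = wall / audio_duration_s                     # < 1.0 = faster than real-time
    return ttft, tbt, rtf

# 10 s of audio synthesized in 2.7 s of wall time -> RTF 0.27 (3.7x real-time)
ttft, tbt, rtf = streaming_metrics(0.0, [0.6, 1.3, 2.0, 2.7], 10.0)
```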

Requirements

pip install "transformers>=4.57.1" torch torchaudio soundfile accelerate

# Optional
pip install speechbrain  # for TTS with speaker voice conditioning
pip install gradio       # for Gradio demo
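Since the install line above pins transformers>=4.57.1, a quick programmatic check can fail fast before loading the model. This is a stdlib-only sketch that assumes plain numeric version strings (no dev/rc suffixes):

```python
# Sketch: fail fast if an installed package is older than the minimum version.
from importlib import metadata

def version_tuple(v):
    """Convert '4.57.1' -> (4, 57, 1); assumes plain numeric versions."""
    return tuple(int(p) for p in v.split(".")[:3])

def check_min_version(pkg, minimum):
    installed = metadata.version(pkg)
    if version_tuple(installed) < version_tuple(minimum):
        raise RuntimeError(f"{pkg} {installed} < required {minimum}")
    return installed

# e.g. check_min_version("transformers", "4.57.1")
```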

Quick Start

Option 1: Load from Hub (recommended)

No pip install raon is needed; the pipeline class is loaded from the model repo's remote code.

from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

_cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(_cfg, "_commit_hash", None),
)
del _cfg

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")

Option 2: With raon package installed

git clone https://github.com/krafton-ai/Raon-Speech.git
cd Raon-Speech/raon
pip install -e .  # or: uv sync

from raon import RaonPipeline

# From Hub (local code + Hub weights)
pipe = RaonPipeline("KRAFTON/Raon-Speech-9B")

# From local path
pipe = RaonPipeline("/path/to/raon-model")

Tasks

STT (Audio → Text)

text = pipe.stt("audio.wav")

TTS (Text → Audio)

# Without speaker conditioning
audio, sr = pipe.tts("Hello, how are you?")
pipe.save_audio((audio, sr), "output.wav")

# With speaker conditioning (requires speechbrain)
audio, sr = pipe.tts("Hello, how are you?", speaker_audio="speaker_ref.wav")
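For longer scripts it can help to synthesize sentence by sentence and join the results. A minimal sketch, assuming each clip is a 1-D float array at a shared sample rate (as returned by pipe.tts):

```python
import numpy as np

# Sketch: join per-sentence clips with a short silence gap between them.
def join_clips(clips, sr, gap_s=0.3):
    silence = np.zeros(int(sr * gap_s), dtype=np.float32)
    parts = []
    for i, clip in enumerate(clips):
        if i:
            parts.append(silence)  # insert a gap between consecutive clips
        parts.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(parts)

# e.g. clips = [pipe.tts(s)[0] for s in sentences]
#      pipe.save_audio((join_clips(clips, sr), sr), "long.wav")
```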

TextQA (Text + Audio → Text)

answer = pipe.textqa("What is the speaker saying?", audio="audio.wav")

SpeechChat (Audio → Text)

answer = pipe.speech_chat("question.wav")

Chat (Multimodal)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "audio.wav"},
            {"type": "text", "text": "Transcribe and summarise this audio."},
        ],
    },
]
response = pipe.chat(messages)
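Because the content list mixes typed entries, a small helper keeps multi-part messages tidy. wrap_part and user_message below are illustrative helpers, not part of the raon API:

```python
# Sketch: build a chat message from mixed text/audio parts.
# wrap_part and user_message are hypothetical helpers, not part of the raon API.
def wrap_part(item):
    if isinstance(item, str) and item.lower().endswith((".wav", ".mp3", ".flac")):
        return {"type": "audio", "audio": item}
    return {"type": "text", "text": str(item)}

def user_message(*items):
    return {"role": "user", "content": [wrap_part(i) for i in items]}

messages = [user_message("audio.wav", "Transcribe and summarise this audio.")]
# response = pipe.chat(messages)
```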

Deployment (vLLM-Omni)

1. Clone & Build

git clone https://github.com/krafton-ai/vllm-omni.git
cd vllm-omni
docker build -f docker/Dockerfile.ci -t vllm-omni .

2. Serve

docker run --rm --gpus all \
  --shm-size=16g \
  -p 8000:8000 \
  vllm-omni \
  bash -c "vllm serve KRAFTON/Raon-Speech-9B --omni --port 8000 --trust-remote-code"

3. Test — TTS

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "response_format": "wav"
  }' --output output.wav

4. Test — TTS with voice cloning

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "ref_audio": "data:audio/wav;base64,'$(base64 -w0 speaker_ref.wav)'",
    "task_type": "Base",
    "response_format": "wav"
  }' --output cloned.wav
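The same request can be built from Python. The sketch below constructs the base64 data URI that the ref_audio field expects (the endpoint behavior is assumed to match the curl example above); it writes a tiny silent reference clip so the helper runs standalone:

```python
import base64
import wave

def wav_data_uri(path):
    """Return the data URI string expected by the ref_audio field."""
    with open(path, "rb") as f:
        return "data:audio/wav;base64," + base64.b64encode(f.read()).decode("ascii")

# Demo only: create a 0.1 s silent mono 16 kHz reference clip.
with wave.open("speaker_ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)

payload = {
    "input": "Hello, how are you?",
    "model": "KRAFTON/Raon-Speech-9B",
    "ref_audio": wav_data_uri("speaker_ref.wav"),
    "task_type": "Base",
    "response_format": "wav",
}
# POST payload as JSON to http://localhost:8000/v1/audio/speech
```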

5. Test — STT

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "KRAFTON/Raon-Speech-9B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "audio_url", "audio_url": {"url": "data:audio/wav;base64,'"$(base64 -w0 audio.wav)"'"}},
          {"type": "text", "text": "Transcribe the audio into text."}
        ]
      }
    ]
  }'

License

This repository is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

© 2026 KRAFTON
