🥀 SODA-4B-base

SODA (Scaling Open Discrete Audio) is a suite of discrete audio foundation models trained from scratch using next-token prediction on interleaved semantic, acoustic, and text tokens.

This is the 4B model trained on 500B tokens.

Model Details

  • Parameters: 4B (non-embedding)*
  • Training Tokens: 500B
  • Training Data: 95% Speech (Yodas + Emilia) + 5% Text (Nemotron-CC)
  • Tokenizer: soda-research/marin-mimi-bpe-8cb-16k-tokenizer
  • Architecture: Qwen3-based Transformer (cold-start, random initialization)

*Excludes embedding layers for our scaling law analysis, following the methodology in Kaplan et al. (2020).

Note: SODA was trained exclusively on English speech data and does not currently support other languages.

Unified Next-Token Prediction

SODA treats all audio/text tasks as next-token prediction under a single architecture. Audio is represented as discrete tokens using the Mimi codec, and interleaved with text tokens using special delimiters.

# audio-first interleaved sequence
interleaved_seq1 = "<|begin_of_text|><|audio_start|>{AUDIO_TOKENS_1}<|audio_end|><|text_start|>{TEXT_TOKENS_1}<|text_end|><|audio_start|>{AUDIO_TOKENS_2}<|audio_end|><|text_start|>{TEXT_TOKENS_2}<|text_end|>...<|end_of_text|>"

# text-first interleaved sequence
interleaved_seq2 = "<|begin_of_text|><|text_start|>{TEXT_TOKENS_1}<|text_end|><|audio_start|>{AUDIO_TOKENS_1}<|audio_end|><|text_start|>{TEXT_TOKENS_2}<|text_end|><|audio_start|>{AUDIO_TOKENS_2}<|audio_end|>...<|end_of_text|>"

Special Tokens

  • <|begin_of_text|>: Marks the start of every sequence
  • <|text_start|> / <|text_end|>: Delimit text segments (may appear multiple times when a sequence contains multiple utterances)
  • <|audio_start|> / <|audio_end|>: Delimit audio segments (may appear multiple times when a sequence contains multiple utterances)
  • <|end_of_text|>: Marks the end of the complete sequence

Note: Typically a tokenizer (like ours) automatically prepends <|begin_of_text|> to the input, so you don't need to include it manually in your prompts. A sequence can contain multiple utterances (i.e., multiple chunks), each with its own text/audio delimiters.
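
For illustration, a two-utterance, audio-first sequence could be assembled as below. The audio token strings here are placeholders; in practice they come from the Mimi encoder (see the Utility Functions section).

```python
# Sketch: assembling a two-utterance, audio-first interleaved sequence.
# "AUDIO_1" / "AUDIO_2" are placeholders for encoded Mimi token strings.
utterances = [
    ("AUDIO_1", "hello there"),
    ("AUDIO_2", "how are you"),
]

parts = []
for audio_tokens, text in utterances:
    parts.append(f"<|audio_start|>{audio_tokens}<|audio_end|>")
    parts.append(f"<|text_start|>{text}<|text_end|>")

# <|begin_of_text|> is intentionally omitted: the tokenizer prepends it.
sequence = "".join(parts) + "<|end_of_text|>"
print(sequence)
```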

Task Prompting Examples

Below are prompting examples for each task (the auto-prepended <|begin_of_text|> is omitted):

# Audio Continuation
<|audio_start|> {audio_context}
→ model generates: {continued_audio} ... # possibly with interleaved text tokens

# Zero-Shot TTS (Voice Cloning)
<|text_start|> {transcript} <|text_end|> <|audio_start|> {prompt_audio} <|audio_end|> <|text_start|> {target_text} <|text_end|> <|audio_start|>
→ model generates: {target_audio} <|audio_end|>

# TTS (Unconditioned)
<|text_start|> {text} <|text_end|> <|audio_start|>
→ model generates: {audio} <|audio_end|>

# ASR (Automatic Speech Recognition)
<|audio_start|> {input_audio} <|audio_end|> <|text_start|>
→ model generates: {transcription} <|text_end|>

Other tasks can also be formulated as next-token prediction by constructing appropriate prompt sequences. However, the current pre-trained models may require fine-tuning to perform such tasks effectively, as in-context learning abilities have not yet fully emerged.
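
As a sketch of this pattern, a small helper (hypothetical, not part of the released code) can assemble a prompt from ordered text/audio segments, leaving the final segment open for the model to complete:

```python
# Hypothetical helper: build a next-token-prediction prompt from an
# ordered list of (modality, content) segments, then open the target
# modality so the model fills it in.
def build_prompt(segments, open_modality):
    delims = {
        "text": ("<|text_start|>", "<|text_end|>"),
        "audio": ("<|audio_start|>", "<|audio_end|>"),
    }
    prompt = ""
    for modality, content in segments:
        start, end = delims[modality]
        prompt += f"{start}{content}{end}"
    prompt += delims[open_modality][0]  # leave the target segment open
    return prompt

# ASR-style prompt: audio in, text out
asr_prompt = build_prompt([("audio", "AUDIO_TOKENS")], open_modality="text")
# -> "<|audio_start|>AUDIO_TOKENS<|audio_end|><|text_start|>"
```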

Loading the Model

Note: For audio conversion, you will need the helper functions defined in the Utility Functions section below to convert between waveforms and discrete audio tokens.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, MimiModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load SODA
model_name = "soda-research/soda-4b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map="auto")
model.eval()

# Load Mimi audio codec
mimi_model = MimiModel.from_pretrained("kyutai/mimi").to(device)

Task Examples

Alternatively, see our Gradio demo for a complete example of how SODA can be used: https://huggingface.co/spaces/potsawee/soda-demo/blob/main/app.py

1. Audio Continuation

Continue generating speech from an audio prefix.

import librosa

audio, sr = librosa.load("audio_prefix.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)
audio_str = audio_to_str(audio_24k, mimi_model, device)

prompt = f"<|audio_start|>{audio_str}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000, # ~10 s of audio (100 tokens per second)
        min_new_tokens=100,  # ~1 s of audio
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

# Extract all audio segments from the output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_parts = generated_text.split("<|audio_start|>")
audio_segments = []
for part in audio_parts[1:]:
    content = part.split("<|audio_end|>")[0].strip()
    if content:
        audio_segments.append(content)

full_audio_str = "".join(audio_segments)
full_audio_str = full_audio_str[: (len(full_audio_str) // 8) * 8]  # align to 8 codebooks
audio_out = str_to_audio(full_audio_str, mimi_model, device)

import soundfile as sf
sf.write("continuation_output.wav", audio_out.T, 24000)

2. Zero-Shot TTS (Voice Cloning)

Given a reference voice and target text, generate speech in that voice.

import librosa

# Load reference audio (resample to 24kHz for Mimi)
audio, sr = librosa.load("reference_voice.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)

# Encode reference audio to token string
audio_str = audio_to_str(audio_24k, mimi_model, device)

# Construct prompt
prompt_text = "The transcript of the reference audio."
target_text = "The text you want to synthesize in this voice."
prompt = (
    f"<|text_start|>{prompt_text}<|text_end|>"
    f"<|audio_start|>{audio_str}<|audio_end|>"
    f"<|text_start|>{target_text}<|text_end|>"
    f"<|audio_start|>"
)

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1500,
        do_sample=True,
        temperature=0.9,
        top_p=0.8,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|audio_end|>"),
    )

# Decode generated audio
generated_str = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
generated_str = generated_str.replace("<|audio_end|>", "")
generated_str = generated_str[: (len(generated_str) // 8) * 8]  # align to 8 codebooks
audio_out = str_to_audio(generated_str, mimi_model, device)

# Save output
import soundfile as sf
sf.write("tts_output.wav", audio_out.T, 24000)

3. TTS (Unconditioned)

Generate speech from text without a reference voice (the model hallucinates a voice).

# Construct prompt
text = "The text you want to synthesize."
prompt = f"<|text_start|>{text}<|text_end|><|audio_start|>"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1500,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|audio_end|>"),
    )

# Decode generated audio
generated_str = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
generated_str = generated_str.replace("<|audio_end|>", "")
generated_str = generated_str[: (len(generated_str) // 8) * 8]
audio_out = str_to_audio(generated_str, mimi_model, device)

# Save output
import soundfile as sf
sf.write("tts_uncond_output.wav", audio_out.T, 24000)

4. Automatic Speech Recognition (ASR)

Transcribe speech to text.

import librosa

audio, sr = librosa.load("input_speech.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)
audio_str = audio_to_str(audio_24k, mimi_model, device)

prompt = f"<|audio_start|>{audio_str}<|audio_end|><|text_start|>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|text_end|>"),
    )

generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
transcription = generated_text.replace("<|text_end|>", "").strip()
print("Transcription:", transcription)

5. Text Generation

Generate text continuations (SODA also supports text-only generation).

prompt = "<|text_start|>The future of artificial intelligence"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(generated_text)

Utility Functions

The following utilities convert between audio waveforms and the discrete token strings used by SODA. You will need them for all audio tasks above.

import numpy as np
import torch
from transformers import MimiModel

UNICODE_OFFSET = 0xE000  # start of the Unicode Private Use Area
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048
MIMI_SAMPLE_RATE = 24000

def codes_to_chars(codes, codebook_size=CODEBOOK_SIZE, unicode_offset=UNICODE_OFFSET):
    """Convert Mimi codec output (num_codebooks, seq_len) β†’ string."""
    if isinstance(codes, torch.Tensor):
        codes = codes.cpu().numpy()
    codes = codes.copy()
    for i in range(codes.shape[0]):
        codes[i] += unicode_offset + i * codebook_size
    codes = codes.T.reshape(-1)
    return "".join([chr(c) for c in codes])

def chars_to_codes(chars, num_codebooks=NUM_CODEBOOKS, codebook_size=CODEBOOK_SIZE, unicode_offset=UNICODE_OFFSET):
    """Convert string β†’ Mimi codec codes (num_codebooks, seq_len)."""
    codes = np.array([ord(c) for c in chars])
    codes = codes.reshape(-1, num_codebooks).T
    for i in range(codes.shape[0]):
        codes[i] -= unicode_offset + i * codebook_size
    return torch.tensor(codes)

def audio_to_str(audio_numpy, mimi_model, device):
    """Encode audio waveform (24kHz) β†’ discrete token string."""
    audio_tensor = torch.tensor(audio_numpy).to(device).unsqueeze(0)
    if len(audio_tensor.shape) == 2:
        audio_tensor = audio_tensor.unsqueeze(1)
    with torch.no_grad():
        audio_codes = mimi_model.encode(audio_tensor)
    codes = audio_codes[0][0].cpu()[:NUM_CODEBOOKS, :]
    return codes_to_chars(codes)

def str_to_audio(audio_str, mimi_model, device):
    """Decode discrete token string β†’ audio waveform (24kHz)."""
    codes = chars_to_codes(audio_str).to(device).unsqueeze(0)
    with torch.no_grad():
        audio_decoded = mimi_model.decode(codes).audio_values[0]
    return audio_decoded.cpu().numpy()
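
As a sanity check, the offset mapping above can be exercised round-trip on synthetic codes. This is a numpy-only sketch mirroring codes_to_chars / chars_to_codes; it does not require the Mimi model.

```python
import numpy as np

UNICODE_OFFSET = 0xE000  # start of the Unicode Private Use Area
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048

# Synthetic codes: shape (num_codebooks, seq_len), values in [0, 2048)
rng = np.random.default_rng(0)
codes = rng.integers(0, CODEBOOK_SIZE, size=(NUM_CODEBOOKS, 5))

# Encode: give each codebook its own disjoint 2048-character range, then
# flatten frame by frame (transpose), so 8 characters == 1 Mimi frame
shifted = codes + UNICODE_OFFSET + np.arange(NUM_CODEBOOKS)[:, None] * CODEBOOK_SIZE
audio_str = "".join(chr(int(c)) for c in shifted.T.reshape(-1))

# Decode: invert the mapping
decoded = np.array([ord(c) for c in audio_str]).reshape(-1, NUM_CODEBOOKS).T
decoded -= UNICODE_OFFSET + np.arange(NUM_CODEBOOKS)[:, None] * CODEBOOK_SIZE

assert (decoded == codes).all()
assert len(audio_str) == codes.shape[1] * NUM_CODEBOOKS  # 8 chars per frame
```

This frame-major layout is also why generated strings must be truncated to a multiple of 8 characters before decoding, as done in the task examples above.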

Citation

@article{soda2026,
  author    = {Manakul, Potsawee and Held, William and Gan, Woody Haosheng and Bartelds, Martijn and Sun, Guangzhi and Yang, Diyi},
  title     = {Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens},
  journal   = {arXiv preprint arXiv:2602.xxxxx},
  year      = {2026},
}