# 🥤 SODA-4B-base
SODA (Scaling Open Discrete Audio) is a suite of discrete audio foundation models trained from scratch using next-token prediction on interleaved semantic, acoustic, and text tokens.
This is the 4B model trained on 500B tokens.
- Project Page: https://soda-audio.github.io
- Paper: Scaling Open Discrete Audio Foundation Models
- Code: GitHub
- All Models & Data: soda-research on HuggingFace
## Model Details

| Property | Value |
|---|---|
| Parameters | 4B (non-embedding)* |
| Training Tokens | 500B |
| Training Data | 95% speech (YODAS + Emilia) + 5% text (Nemotron-CC) |
| Tokenizer | `soda-research/marin-mimi-bpe-8cb-16k-tokenizer` |
| Architecture | Qwen3-based Transformer (cold start, random initialization) |

\*Excludes embedding layers for our scaling-law analysis, following the methodology of Kaplan et al. (2020).
**Note:** SODA was trained exclusively on English speech data and does not currently support other languages.
## Unified Next-Token Prediction
SODA treats all audio/text tasks as next-token prediction under a single architecture. Audio is represented as discrete tokens using the Mimi codec, and interleaved with text tokens using special delimiters.
```python
# audio-first interleaved sequence
interleaved_seq1 = "<|begin_of_text|><|audio_start|>{AUDIO_TOKENS_1}<|audio_end|><|text_start|>{TEXT_TOKENS_1}<|text_end|><|audio_start|>{AUDIO_TOKENS_2}<|audio_end|><|text_start|>{TEXT_TOKENS_2}<|text_end|>...<|end_of_text|>"

# text-first interleaved sequence
interleaved_seq2 = "<|begin_of_text|><|text_start|>{TEXT_TOKENS_1}<|text_end|><|audio_start|>{AUDIO_TOKENS_1}<|audio_end|><|text_start|>{TEXT_TOKENS_2}<|text_end|><|audio_start|>{AUDIO_TOKENS_2}<|audio_end|>...<|end_of_text|>"
```
### Special Tokens

- `<|begin_of_text|>`: marks the start of every sequence
- `<|text_start|>` / `<|text_end|>`: delimit text segments (may appear multiple times in a sequence containing multiple utterances)
- `<|audio_start|>` / `<|audio_end|>`: delimit audio segments (may appear multiple times in a sequence containing multiple utterances)
- `<|end_of_text|>`: marks the end of the complete sequence

**Note:** Typically a tokenizer (like ours) automatically prepends `<|begin_of_text|>` to the input, so you don't need to include it manually in your prompts. A sequence can contain multiple utterances (i.e., multiple chunks), each with its own text/audio delimiters.
## Task Prompting Examples

Below are prompting examples for common tasks (omitting `<|begin_of_text|>`, which the tokenizer prepends automatically):
```
# Audio Continuation
<|audio_start|> {audio_context}
→ model generates: {continued_audio} ...  # possibly with interleaved text tokens

# Zero-Shot TTS (Voice Cloning)
<|text_start|> {transcript} <|text_end|> <|audio_start|> {prompt_audio} <|audio_end|> <|text_start|> {target_text} <|text_end|> <|audio_start|>
→ model generates: {target_audio} <|audio_end|>

# TTS (Unconditioned)
<|text_start|> {text} <|text_end|> <|audio_start|>
→ model generates: {audio} <|audio_end|>

# ASR (Automatic Speech Recognition)
<|audio_start|> {input_audio} <|audio_end|> <|text_start|>
→ model generates: {transcription} <|text_end|>
```
Other tasks can also be formulated as next-token prediction by constructing appropriate prompt sequences. However, current pre-trained models may require fine-tuning to perform these tasks effectively, as in-context learning skills have not yet fully emerged.
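Since the prompt patterns above are plain string concatenation, new task prompts can be assembled programmatically. The sketch below uses a hypothetical `build_prompt` helper (not part of the SODA codebase) with placeholder strings rather than real audio tokens:

```python
def build_prompt(segments):
    """Concatenate (modality, content) pairs into a SODA-style prompt string.

    `segments` is a list of ("text" | "audio", str) pairs. The final pair may
    use content=None to emit only an opening delimiter, cueing the model to
    generate that modality next.
    """
    out = []
    for modality, content in segments:
        out.append(f"<|{modality}_start|>")
        if content is None:
            break  # leave the last segment open for the model to complete
        out.append(content)
        out.append(f"<|{modality}_end|>")
    return "".join(out)

# ASR-style prompt: closed audio segment in, open text segment out
prompt = build_prompt([("audio", "{input_audio}"), ("text", None)])
# → "<|audio_start|>{input_audio}<|audio_end|><|text_start|>"
```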
## Loading the Model
**Note:** For audio conversion, you will need the helper functions defined in the Utility Functions section below to convert between waveforms and discrete audio tokens.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, MimiModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load SODA
model_name = "soda-research/soda-4b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map="auto")
model.eval()

# Load the Mimi audio codec
mimi_model = MimiModel.from_pretrained("kyutai/mimi").to(device)
```
## Task Examples

Alternatively, see our Gradio demo for a working end-to-end example of how SODA can be used: https://huggingface.co/spaces/potsawee/soda-demo/blob/main/app.py
### 1. Audio Continuation
Continue generating speech from an audio prefix.
```python
import librosa
import soundfile as sf

audio, sr = librosa.load("audio_prefix.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)
audio_str = audio_to_str(audio_24k, mimi_model, device)

prompt = f"<|audio_start|>{audio_str}"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,  # corresponds to ~10s of audio
        min_new_tokens=100,   # corresponds to ~1s of audio
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

# Extract all audio segments from the output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_parts = generated_text.split("<|audio_start|>")
audio_segments = []
for part in audio_parts[1:]:
    content = part.split("<|audio_end|>")[0] if "<|audio_end|>" in part else part.strip()
    if content:
        audio_segments.append(content)
full_audio_str = "".join(audio_segments)
full_audio_str = full_audio_str[: (len(full_audio_str) // 8) * 8]  # align to 8 codebooks

audio_out = str_to_audio(full_audio_str, mimi_model, device)
sf.write("continuation_output.wav", audio_out.T, 24000)
```
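The `max_new_tokens`/`min_new_tokens` comments above follow from simple arithmetic: assuming Mimi's 12.5 Hz frame rate with 8 codebooks, one second of audio costs about 100 raw codec tokens (the BPE tokenizer may merge codec tokens, so treat this as an approximation):

```python
MIMI_FRAME_RATE = 12.5  # Mimi codec frames per second
NUM_CODEBOOKS = 8
TOKENS_PER_SECOND = MIMI_FRAME_RATE * NUM_CODEBOOKS  # 100 raw codec tokens per second

def tokens_to_seconds(n_tokens):
    """Approximate audio duration represented by n_tokens raw codec tokens."""
    return n_tokens / TOKENS_PER_SECOND

print(tokens_to_seconds(1000))  # max_new_tokens=1000 → ~10 s
print(tokens_to_seconds(100))   # min_new_tokens=100  → ~1 s
```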
### 2. Zero-Shot TTS (Voice Cloning)
Given a reference voice and target text, generate speech in that voice.
```python
import librosa
import soundfile as sf

# Load reference audio (resample to 24kHz for Mimi)
audio, sr = librosa.load("reference_voice.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)

# Encode reference audio to a token string
audio_str = audio_to_str(audio_24k, mimi_model, device)

# Construct prompt
prompt_text = "The transcript of the reference audio."
target_text = "The text you want to synthesize in this voice."
prompt = (
    f"<|text_start|>{prompt_text}<|text_end|>"
    f"<|audio_start|>{audio_str}<|audio_end|>"
    f"<|text_start|>{target_text}<|text_end|>"
    f"<|audio_start|>"
)

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1500,
        do_sample=True,
        temperature=0.9,
        top_p=0.8,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|audio_end|>"),
    )

# Decode generated audio
generated_str = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
generated_str = generated_str.replace("<|audio_end|>", "")
generated_str = generated_str[: (len(generated_str) // 8) * 8]  # align to 8 codebooks
audio_out = str_to_audio(generated_str, mimi_model, device)

# Save output
sf.write("tts_output.wav", audio_out.T, 24000)
```
### 3. TTS (Unconditioned)
Generate speech from text without a reference voice (the model hallucinates a voice).
```python
import soundfile as sf

# Construct prompt
text = "The text you want to synthesize."
prompt = f"<|text_start|>{text}<|text_end|><|audio_start|>"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1500,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|audio_end|>"),
    )

# Decode generated audio
generated_str = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
generated_str = generated_str.replace("<|audio_end|>", "")
generated_str = generated_str[: (len(generated_str) // 8) * 8]  # align to 8 codebooks
audio_out = str_to_audio(generated_str, mimi_model, device)

# Save output
sf.write("tts_uncond_output.wav", audio_out.T, 24000)
```
### 4. Automatic Speech Recognition (ASR)
Transcribe speech to text.
```python
import librosa

audio, sr = librosa.load("input_speech.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)
audio_str = audio_to_str(audio_24k, mimi_model, device)

prompt = f"<|audio_start|>{audio_str}<|audio_end|><|text_start|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|text_end|>"),
    )

generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
transcription = generated_text.replace("<|text_end|>", "").strip()
print("Transcription:", transcription)
```
### 5. Text Generation
Generate text continuations (SODA also supports text-only generation).
```python
prompt = "<|text_start|>The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(generated_text)
```
## Utility Functions
The following utilities convert between audio waveforms and the discrete token strings used by SODA. You will need them for all audio tasks above.
```python
import numpy as np
import torch

UNICODE_OFFSET = 0xE000  # start of the Unicode Private Use Area
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048
MIMI_SAMPLE_RATE = 24000


def codes_to_chars(codes, codebook_size=CODEBOOK_SIZE, unicode_offset=UNICODE_OFFSET):
    """Convert Mimi codec output (num_codebooks, seq_len) → string."""
    if isinstance(codes, torch.Tensor):
        codes = codes.cpu().numpy()
    codes = codes.copy()
    for i in range(codes.shape[0]):
        codes[i] += unicode_offset + i * codebook_size  # shift each codebook into its own character range
    codes = codes.T.reshape(-1)  # interleave codebooks frame by frame
    return "".join([chr(c) for c in codes])


def chars_to_codes(chars, num_codebooks=NUM_CODEBOOKS, codebook_size=CODEBOOK_SIZE, unicode_offset=UNICODE_OFFSET):
    """Convert string → Mimi codec codes (num_codebooks, seq_len)."""
    codes = np.array([ord(c) for c in chars])
    codes = codes.reshape(-1, num_codebooks).T  # undo the frame-wise interleaving
    for i in range(codes.shape[0]):
        codes[i] -= unicode_offset + i * codebook_size
    return torch.tensor(codes)


def audio_to_str(audio_numpy, mimi_model, device):
    """Encode an audio waveform (24kHz) → discrete token string."""
    audio_tensor = torch.tensor(audio_numpy).to(device).unsqueeze(0)
    if len(audio_tensor.shape) == 2:
        audio_tensor = audio_tensor.unsqueeze(1)
    with torch.no_grad():
        audio_codes = mimi_model.encode(audio_tensor)
    codes = audio_codes[0][0].cpu()[:NUM_CODEBOOKS, :]
    return codes_to_chars(codes)


def str_to_audio(audio_str, mimi_model, device):
    """Decode a discrete token string → audio waveform (24kHz)."""
    codes = chars_to_codes(audio_str).to(device).unsqueeze(0)
    with torch.no_grad():
        audio_decoded = mimi_model.decode(codes).audio_values[0]
    return audio_decoded.cpu().numpy()
```
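As a sanity check on the packing scheme, the offset-and-interleave mapping used by `codes_to_chars`/`chars_to_codes` can be verified with a self-contained roundtrip on synthetic codes (NumPy only, no codec required):

```python
import numpy as np

UNICODE_OFFSET = 0xE000
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048

rng = np.random.default_rng(0)
codes = rng.integers(0, CODEBOOK_SIZE, size=(NUM_CODEBOOKS, 5))  # (codebooks, frames)

# Pack: shift each codebook into its own character range, then interleave frame by frame
offsets = UNICODE_OFFSET + np.arange(NUM_CODEBOOKS)[:, None] * CODEBOOK_SIZE
chars = "".join(chr(c) for c in (codes + offsets).T.reshape(-1))

# Unpack: undo the interleaving and subtract the per-codebook offsets
unpacked = np.array([ord(c) for c in chars]).reshape(-1, NUM_CODEBOOKS).T - offsets

assert len(chars) == codes.size       # one character per code
assert (unpacked == codes).all()      # lossless roundtrip
```

Note that with 8 codebooks of 2048 entries, the highest offsets run past the Basic Multilingual Plane's Private Use Area, which is fine for Python strings but worth knowing when inspecting the token strings.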
## Citation

```bibtex
@article{soda2026,
  author  = {Manakul, Potsawee and Held, William and Gan, Woody Haosheng and Bartelds, Martijn and Sun, Guangzhi and Yang, Diyi},
  title   = {Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens},
  journal = {arXiv preprint arXiv:2602.xxxxx},
  year    = {2026},
}
```