# 🥤 SODA-4B-base
SODA (Scaling Open Discrete Audio) is a suite of discrete audio foundation models trained from scratch using next-token prediction on interleaved semantic, acoustic, and text tokens.
This is the 4B model trained on 500B tokens.
- Project Page: https://soda-audio.github.io
- Paper: Scaling Open Discrete Audio Foundation Models
- Code: GitHub
- All Models & Data: soda-research on HuggingFace
## Model Details

| Property | Value |
|---|---|
| Parameters | 4B (non-embedding)* |
| Training Tokens | 500B |
| Training Data | 95% speech (YODAS + Emilia) + 5% text (Nemotron-CC) |
| Tokenizer | `soda-research/marin-mimi-bpe-8cb-16k-tokenizer` |
| Architecture | Qwen3-based Transformer (cold start, random initialization) |

\*Excludes embedding layers for our scaling-law analysis, following the methodology of Kaplan et al. (2020).
**Note:** SODA was trained exclusively on English speech data and does not currently support other languages.
## Unified Next-Token Prediction
SODA treats all audio/text tasks as next-token prediction under a single architecture. Audio is represented as discrete tokens using the Mimi codec, and interleaved with text tokens using special delimiters.
```python
# audio-first interleaved sequence
interleaved_seq1 = "<|begin_of_text|><|audio_start|>{AUDIO_TOKENS_1}<|audio_end|><|text_start|>{TEXT_TOKENS_1}<|text_end|><|audio_start|>{AUDIO_TOKENS_2}<|audio_end|><|text_start|>{TEXT_TOKENS_2}<|text_end|>...<|end_of_text|>"

# text-first interleaved sequence
interleaved_seq2 = "<|begin_of_text|><|text_start|>{TEXT_TOKENS_1}<|text_end|><|audio_start|>{AUDIO_TOKENS_1}<|audio_end|><|text_start|>{TEXT_TOKENS_2}<|text_end|><|audio_start|>{AUDIO_TOKENS_2}<|audio_end|>...<|end_of_text|>"
```
### Special Tokens

- `<|begin_of_text|>`: marks the start of every sequence
- `<|text_start|>` / `<|text_end|>`: delimit text segments (may appear multiple times in a sequence containing multiple utterances)
- `<|audio_start|>` / `<|audio_end|>`: delimit audio segments (may appear multiple times in a sequence containing multiple utterances)
- `<|end_of_text|>`: marks the end of the complete sequence

**Note:** Typically a tokenizer (like ours) automatically prepends `<|begin_of_text|>` to the input, so you don't need to include it manually in your prompts. A sequence can contain multiple utterances (i.e., multiple chunks), each with its own text/audio delimiters.
## Task Prompting Examples

Below are prompting examples for common tasks (omitting `<|begin_of_text|>`, which the tokenizer prepends automatically):
```
# Audio Continuation
<|audio_start|> {audio_context}
→ model generates: {continued_audio} ...  # possibly with interleaved text tokens

# Zero-Shot TTS (Voice Cloning)
<|text_start|> {transcript} <|text_end|> <|audio_start|> {prompt_audio} <|audio_end|> <|text_start|> {target_text} <|text_end|> <|audio_start|>
→ model generates: {target_audio} <|audio_end|>

# TTS (Unconditioned)
<|text_start|> {text} <|text_end|> <|audio_start|>
→ model generates: {audio} <|audio_end|>

# ASR (Automatic Speech Recognition)
<|audio_start|> {input_audio} <|audio_end|> <|text_start|>
→ model generates: {transcription} <|text_end|>
```
Other tasks can also be formulated as next-token prediction by constructing appropriate prompt sequences. However, current pre-trained models may require fine-tuning to perform these tasks effectively, as in-context learning skills have not yet fully emerged.
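Since the prompt patterns above are plain string concatenation, new task prompts can be assembled programmatically. The sketch below uses a hypothetical `build_prompt` helper (not part of the SODA codebase) with placeholder strings rather than real audio tokens:

```python
def build_prompt(segments):
    """Concatenate (modality, content) pairs into a SODA-style prompt string.

    `segments` is a list of ("text" | "audio", str) pairs. The final pair may
    use content=None to emit only an opening delimiter, cueing the model to
    generate that modality next.
    """
    out = []
    for modality, content in segments:
        out.append(f"<|{modality}_start|>")
        if content is None:
            break  # leave the last segment open for the model to complete
        out.append(content)
        out.append(f"<|{modality}_end|>")
    return "".join(out)

# ASR-style prompt: closed audio segment in, open text segment out
prompt = build_prompt([("audio", "{input_audio}"), ("text", None)])
# → "<|audio_start|>{input_audio}<|audio_end|><|text_start|>"
```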
## Loading the Model
**Note:** For audio conversion, you will need the helper functions defined in the Utility Functions section below to convert between waveforms and discrete audio tokens.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, MimiModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load SODA
model_name = "soda-research/soda-4b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map="auto")
model.eval()

# Load the Mimi audio codec
mimi_model = MimiModel.from_pretrained("kyutai/mimi").to(device)
```
## Task Examples

Alternatively, see our Gradio demo for a working end-to-end example of how SODA can be used: https://huggingface.co/spaces/potsawee/soda-demo/blob/main/app.py
### 1. Audio Continuation
Continue generating speech from an audio prefix.
```python
import librosa
import soundfile as sf

audio, sr = librosa.load("audio_prefix.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)
audio_str = audio_to_str(audio_24k, mimi_model, device)

prompt = f"<|audio_start|>{audio_str}"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,  # corresponds to ~10s of audio
        min_new_tokens=100,   # corresponds to ~1s of audio
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

# Extract all audio segments from the output
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_parts = generated_text.split("<|audio_start|>")
audio_segments = []
for part in audio_parts[1:]:
    content = part.split("<|audio_end|>")[0] if "<|audio_end|>" in part else part.strip()
    if content:
        audio_segments.append(content)
full_audio_str = "".join(audio_segments)
full_audio_str = full_audio_str[: (len(full_audio_str) // 8) * 8]  # align to 8 codebooks

audio_out = str_to_audio(full_audio_str, mimi_model, device)
sf.write("continuation_output.wav", audio_out.T, 24000)
```
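The `max_new_tokens`/`min_new_tokens` comments above follow from simple arithmetic: assuming Mimi's 12.5 Hz frame rate with 8 codebooks, one second of audio costs about 100 raw codec tokens (the BPE tokenizer may merge codec tokens, so treat this as an approximation):

```python
MIMI_FRAME_RATE = 12.5  # Mimi codec frames per second
NUM_CODEBOOKS = 8
TOKENS_PER_SECOND = MIMI_FRAME_RATE * NUM_CODEBOOKS  # 100 raw codec tokens per second

def tokens_to_seconds(n_tokens):
    """Approximate audio duration represented by n_tokens raw codec tokens."""
    return n_tokens / TOKENS_PER_SECOND

print(tokens_to_seconds(1000))  # max_new_tokens=1000 → ~10 s
print(tokens_to_seconds(100))   # min_new_tokens=100  → ~1 s
```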
### 2. Zero-Shot TTS (Voice Cloning)
Given a reference voice and target text, generate speech in that voice.
```python
import librosa
import soundfile as sf

# Load reference audio (resample to 24kHz for Mimi)
audio, sr = librosa.load("reference_voice.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)

# Encode reference audio to a token string
audio_str = audio_to_str(audio_24k, mimi_model, device)

# Construct prompt
prompt_text = "The transcript of the reference audio."
target_text = "The text you want to synthesize in this voice."
prompt = (
    f"<|text_start|>{prompt_text}<|text_end|>"
    f"<|audio_start|>{audio_str}<|audio_end|>"
    f"<|text_start|>{target_text}<|text_end|>"
    f"<|audio_start|>"
)

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1500,
        do_sample=True,
        temperature=0.9,
        top_p=0.8,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|audio_end|>"),
    )

# Decode generated audio
generated_str = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
generated_str = generated_str.replace("<|audio_end|>", "")
generated_str = generated_str[: (len(generated_str) // 8) * 8]  # align to 8 codebooks
audio_out = str_to_audio(generated_str, mimi_model, device)

# Save output
sf.write("tts_output.wav", audio_out.T, 24000)
```
### 3. TTS (Unconditioned)
Generate speech from text without a reference voice (the model hallucinates a voice).
```python
import soundfile as sf

# Construct prompt
text = "The text you want to synthesize."
prompt = f"<|text_start|>{text}<|text_end|><|audio_start|>"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1500,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|audio_end|>"),
    )

# Decode generated audio
generated_str = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
generated_str = generated_str.replace("<|audio_end|>", "")
generated_str = generated_str[: (len(generated_str) // 8) * 8]  # align to 8 codebooks
audio_out = str_to_audio(generated_str, mimi_model, device)

# Save output
sf.write("tts_uncond_output.wav", audio_out.T, 24000)
```
### 4. Automatic Speech Recognition (ASR)
Transcribe speech to text.
```python
import librosa

audio, sr = librosa.load("input_speech.wav", sr=None)
audio_24k = librosa.resample(audio, orig_sr=sr, target_sr=24000)
audio_str = audio_to_str(audio_24k, mimi_model, device)

prompt = f"<|audio_start|>{audio_str}<|audio_end|><|text_start|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|text_end|>"),
    )

generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
transcription = generated_text.replace("<|text_end|>", "").strip()
print("Transcription:", transcription)
```
### 5. Text Generation
Generate text continuations (SODA also supports text-only generation).
```python
prompt = "<|text_start|>The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(generated_text)
```
## Utility Functions
The following utilities convert between audio waveforms and the discrete token strings used by SODA. You will need them for all audio tasks above.
```python
import numpy as np
import torch

UNICODE_OFFSET = 0xE000  # start of the Unicode Private Use Area
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048
MIMI_SAMPLE_RATE = 24000


def codes_to_chars(codes, codebook_size=CODEBOOK_SIZE, unicode_offset=UNICODE_OFFSET):
    """Convert Mimi codec output (num_codebooks, seq_len) → string."""
    if isinstance(codes, torch.Tensor):
        codes = codes.cpu().numpy()
    codes = codes.copy()
    for i in range(codes.shape[0]):
        codes[i] += unicode_offset + i * codebook_size  # shift each codebook into its own character range
    codes = codes.T.reshape(-1)  # interleave codebooks frame by frame
    return "".join([chr(c) for c in codes])


def chars_to_codes(chars, num_codebooks=NUM_CODEBOOKS, codebook_size=CODEBOOK_SIZE, unicode_offset=UNICODE_OFFSET):
    """Convert string → Mimi codec codes (num_codebooks, seq_len)."""
    codes = np.array([ord(c) for c in chars])
    codes = codes.reshape(-1, num_codebooks).T  # undo the frame-wise interleaving
    for i in range(codes.shape[0]):
        codes[i] -= unicode_offset + i * codebook_size
    return torch.tensor(codes)


def audio_to_str(audio_numpy, mimi_model, device):
    """Encode an audio waveform (24kHz) → discrete token string."""
    audio_tensor = torch.tensor(audio_numpy).to(device).unsqueeze(0)
    if len(audio_tensor.shape) == 2:
        audio_tensor = audio_tensor.unsqueeze(1)
    with torch.no_grad():
        audio_codes = mimi_model.encode(audio_tensor)
    codes = audio_codes[0][0].cpu()[:NUM_CODEBOOKS, :]
    return codes_to_chars(codes)


def str_to_audio(audio_str, mimi_model, device):
    """Decode a discrete token string → audio waveform (24kHz)."""
    codes = chars_to_codes(audio_str).to(device).unsqueeze(0)
    with torch.no_grad():
        audio_decoded = mimi_model.decode(codes).audio_values[0]
    return audio_decoded.cpu().numpy()
```
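As a sanity check on the packing scheme, the offset-and-interleave mapping used by `codes_to_chars`/`chars_to_codes` can be verified with a self-contained roundtrip on synthetic codes (NumPy only, no codec required):

```python
import numpy as np

UNICODE_OFFSET = 0xE000
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048

rng = np.random.default_rng(0)
codes = rng.integers(0, CODEBOOK_SIZE, size=(NUM_CODEBOOKS, 5))  # (codebooks, frames)

# Pack: shift each codebook into its own character range, then interleave frame by frame
offsets = UNICODE_OFFSET + np.arange(NUM_CODEBOOKS)[:, None] * CODEBOOK_SIZE
chars = "".join(chr(c) for c in (codes + offsets).T.reshape(-1))

# Unpack: undo the interleaving and subtract the per-codebook offsets
unpacked = np.array([ord(c) for c in chars]).reshape(-1, NUM_CODEBOOKS).T - offsets

assert len(chars) == codes.size       # one character per code
assert (unpacked == codes).all()      # lossless roundtrip
```

Note that with 8 codebooks of 2048 entries, the highest offsets run past the Basic Multilingual Plane's Private Use Area, which is fine for Python strings but worth knowing when inspecting the token strings.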
## Citation

```bibtex
@article{soda2026,
  author  = {Manakul, Potsawee and Held, William and Gan, Woody Haosheng and Bartelds, Martijn and Sun, Guangzhi and Yang, Diyi},
  title   = {Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens},
  journal = {arXiv preprint arXiv:2602.xxxxx},
  year    = {2026},
}
```