Calliope SNAC 4B Base (4K)
Stage-1 multilingual SNAC prior for the Calliope text-to-speech project: a continued pretrain of nvidia/Nemotron-H-4B-Base-8K with the vocabulary augmented by 12,288 SNAC codec tokens and a slot router that enforces the codec's C·M·F·F·M·F·F frame pattern at audio-mode positions.
This is the HuggingFace safetensors version, loadable via AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True). The Megatron-Bridge FSDP DCP format is at the sibling repo zeroae/calliope-snac-4b-base-4k.megatron (private) for Bridge-based continued training.
What this is and isn't. This is a pretrained prior, not a finished TTS system. Training mixed text-only and SNAC-audio-only documents; the cross-modal text→SNAC bridge is a separate stage-2 finetune objective and was not learned here. Use this checkpoint as the starting point for a TTS finetune, not as an end-to-end speech model.
Quick start: text generation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
REPO = "zeroae/calliope-snac-4b-base-4k"
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="cuda",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
# Text-only generation (the text path is preserved at near-base quality).
# The slot router masks all SNAC tokens to -inf in text mode, so text
# generation is unaffected by the augmented vocab.
ids = tokenizer("In multilingual TTS, prosody", return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0]))
The trust_remote_code=True flag pulls modeling_nemotron_h_augmented.py from this repo, which wraps the base NemotronH-4B with the slot-router logits mask (text mode masks all SNAC tokens, audio mode masks all text; enforced per position).
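To make the masking rule concrete, here is a minimal pure-Python sketch of a slot-router state machine. It is an illustration, not the code in modeling_nemotron_h_augmented.py. The id ranges and marker ids come from this card; the helper names (allowed_range, advance, follows_pattern) are hypothetical, and the rule that [/SNAC] may only appear at a frame boundary (slot 0) is an assumption, since the card does not say how the close marker escapes the audio-mode mask.

```python
# Illustrative sketch only; NOT the implementation shipped in this repo.
SNAC_OPEN, SNAC_CLOSE = 100, 101            # marker ids from this card
TEXT_RANGE = (0, 131072)                    # base vocabulary (contains both markers)
RANGES = {
    "C": (131072, 135168),                  # coarse codes
    "M": (135168, 139264),                  # mid codes
    "F": (139264, 143360),                  # fine codes
}
SLOT_PATTERN = ["C", "M", "F", "F", "M", "F", "F"]


def allowed_range(in_slot_mode: bool, slot_counter: int) -> tuple[int, int]:
    """Half-open [lo, hi) id range left unmasked at the next position."""
    if not in_slot_mode:
        return TEXT_RANGE
    return RANGES[SLOT_PATTERN[slot_counter % len(SLOT_PATTERN)]]


def advance(in_slot_mode: bool, slot_counter: int, token: int) -> tuple[bool, int]:
    """Update (in_slot_mode, slot_counter) after `token` has been emitted."""
    if token == SNAC_OPEN:
        return True, 0
    if token == SNAC_CLOSE:
        return False, 0
    if in_slot_mode:
        return True, (slot_counter + 1) % len(SLOT_PATTERN)
    return False, 0


def follows_pattern(tokens: list[int]) -> bool:
    """Check that a token sequence respects the text/audio masking rule."""
    state = (False, 0)
    for tok in tokens:
        in_slot_mode, slot = state
        lo, hi = allowed_range(in_slot_mode, slot)
        # ASSUMPTION: the close marker is accepted only at a frame boundary.
        closing = in_slot_mode and slot == 0 and tok == SNAC_CLOSE
        if not (lo <= tok < hi or closing):
            return False
        state = advance(in_slot_mode, slot, tok)
    return True
```

A checker like this can be handy for validating stage-2 training sequences before tokenized data is written to disk.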
End-to-end: generate SNAC frames → decode to audio
This pretrained prior has no text→SNAC bridge (see disclaimer above); the example below shows the unconditional end-to-end pipeline that the slot router makes work: prompt with the [SNAC] marker, generate tokens (which the slot router constrains to the C·M·F·F·M·F·F frame pattern), parse them back into the three SNAC codebooks, and decode to a waveform via the upstream hubertsiuzdak/snac_24khz codec.
Expect the audio to be babble or noise: the model is sampling unconditionally from its learned audio distribution, and no text guides the content. The point is to demonstrate the mechanics; quality requires a stage-2 finetune that learns the text→SNAC bridge.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC # pip install snac
import torchaudio
REPO = "zeroae/calliope-snac-4b-base-4k"
# --- 1. Load LM ---------------------------------------------------------
tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda",
    low_cpu_mem_usage=True, trust_remote_code=True,
).eval()
# Vocab layout (from augmented.yaml, also visible in the repo)
SNAC_OPEN, SNAC_CLOSE = 100, 101
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264 # start of each codebook range
N_FRAMES = 50 # ~4 s at SNAC-24kHz's coarse rate
N_TOKENS = N_FRAMES * 7 # 7 tokens / frame (C,M,F,F,M,F,F)
# --- 2. Generate inside an [SNAC] ... span ------------------------------
# The slot router (modeling_nemotron_h_augmented.py) carries its
# (in_slot_mode, slot_counter) state across forward calls via
# self._slot_router_state, so KV caching just works: prefill computes
# routing from initial state, subsequent forwards advance from the
# cached final state. No special flags needed.
prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
with torch.no_grad():
    out = model.generate(
        prompt,
        max_new_tokens=N_TOKENS,
        do_sample=True, temperature=0.8, top_p=0.95,
    )
# --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
gen = out[0, prompt.shape[1]:].tolist()
gen = gen[: (len(gen) // 7) * 7] # truncate to whole frames
c_codes, m_codes, f_codes = [], [], []
for i in range(0, len(gen), 7):
    frame = gen[i:i + 7]
    c_codes.append(frame[0] - C_BASE)  # slot 0: C
    m_codes.append(frame[1] - M_BASE)  # slot 1: M
    f_codes.append(frame[2] - F_BASE)  # slot 2: F
    f_codes.append(frame[3] - F_BASE)  # slot 3: F
    m_codes.append(frame[4] - M_BASE)  # slot 4: M
    f_codes.append(frame[5] - F_BASE)  # slot 5: F
    f_codes.append(frame[6] - F_BASE)  # slot 6: F
# Sanity-check the slot router did its job (codes within [0, 4096))
assert all(0 <= c < 4096 for c in c_codes), "C codes out of range; slot router off?"
assert all(0 <= m < 4096 for m in m_codes), "M codes out of range"
assert all(0 <= f < 4096 for f in f_codes), "F codes out of range"
# --- 4. Decode the three codebooks to a 24 kHz waveform -----------------
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
codes = [
    torch.tensor([c_codes], dtype=torch.long, device="cuda"),  # [B=1, N_FRAMES]
    torch.tensor([m_codes], dtype=torch.long, device="cuda"),  # [B=1, 2*N_FRAMES]
    torch.tensor([f_codes], dtype=torch.long, device="cuda"),  # [B=1, 4*N_FRAMES]
]
with torch.no_grad():
    audio = codec.decode(codes)  # [1, 1, num_samples]
# --- 5. Save ------------------------------------------------------------
torchaudio.save("calliope_unconditional.wav", audio.squeeze(0).cpu(), sample_rate=24000)
print(f"saved {audio.shape[-1] / 24000:.2f} s of audio "
      f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")
Dependencies: pip install snac torchaudio in addition to transformers torch. Wall-clock for 50 frames (~4 s of audio): a few seconds on a GB10 with KV caching on (the default).
Token-budget rule of thumb: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate max_new_tokens for a given duration:
N_TOKENS = int(seconds * 12) * 7 # 7 tokens per frame
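The rule of thumb can be wrapped in a few small helpers, sketched below. The 12 Hz rate and 7 tokens per frame come from this card, as does the 4096-token trained context; the function names are hypothetical, and the default prompt length of 2 assumes the BOS + [SNAC] prompt used in the example above.

```python
FRAME_RATE_HZ = 12       # approximate SNAC-24kHz coarse frame rate (from this card)
TOKENS_PER_FRAME = 7     # C, M, F, F, M, F, F
TRAINED_CONTEXT = 4096   # trained sequence length of this checkpoint


def tokens_for_seconds(seconds: float) -> int:
    """max_new_tokens needed for roughly `seconds` of audio (whole frames only)."""
    return int(seconds * FRAME_RATE_HZ) * TOKENS_PER_FRAME


def seconds_for_tokens(n_tokens: int) -> float:
    """Approximate audio duration covered by the whole frames in `n_tokens`."""
    return (n_tokens // TOKENS_PER_FRAME) / FRAME_RATE_HZ


def max_frames_in_context(prompt_len: int = 2) -> int:
    """Whole frames that fit in the trained context after the prompt.

    ASSUMPTION: prompt_len=2 corresponds to the BOS + [SNAC] prompt.
    """
    return (TRAINED_CONTEXT - prompt_len) // TOKENS_PER_FRAME


print(tokens_for_seconds(4))                     # token budget for ~4 s
print(max_frames_in_context())                   # frames that fit in 4K
print(max_frames_in_context() / FRAME_RATE_HZ)   # ...and the seconds they cover
```

Under these numbers a single 4K window holds a bit under 50 seconds of audio, which bounds how long an utterance one generation call can cover.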
Why this demo's audio sounds bad (and that's expected)
The model has never seen text + [SNAC]…[/SNAC] parallel sequences, only text-only documents and SNAC-only documents, mixed at the batch level. Unconditional sampling from the SNAC distribution produces something codec-plausible (the slot router guarantees the bit-stream is structurally valid, and the codec can always decode), but it has no semantic content. It's the analogue of letting a language model generate without a prompt: you get gibberish that has the shape of the training distribution. A stage-2 TTS finetune on text→SNAC parallel data is what makes this conditional and intelligible.
Architecture
| Field | Value |
|---|---|
| Base model | nvidia/Nemotron-H-4B-Base-8K (hybrid Mamba + attention, 52 layers) |
| Parameters | ~4.56 B (4 B base + augmented embedding/lm_head rows) |
| Vocabulary size | 143,360 (131,072 base + 12,288 SNAC; the 2 markers reuse ids 100 and 101 inside the base range, with 254 reserved-special ids unchanged) |
| New tokens | SNAC_C_* (4096), SNAC_M_* (4096), SNAC_F_* (4096), [SNAC] (id 100), [/SNAC] (id 101) |
| Vocab init | mean_resizing (multivariate-normal-matched to existing embedding distribution; Hewitt 2021) |
| Slot router | slot_pattern: [C, M, F, F, M, F, F]; masks logits to the relevant range at each frame position; [SNAC]/[/SNAC] markers flip into/out of audio mode |
| Context length (trained) | 4096 (architectural cap is 8192 inherited from base; 8K inference is unverified for this checkpoint: the slot-router state machine should extend, but no measurement exists) |
| Precision | bfloat16 weights |
| Tokenizer | NemotronH base tokenizer with the 12,290 new tokens appended |
The SNAC frame layout is [C, M, F, F, M, F, F]: 7 tokens per coarse frame, one coarse (C) → two mid (M) → four fine (F), matching SNAC-24kHz's 1:2:4 residual-quantizer hierarchy.
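For preparing stage-2 data it helps to have both directions of this layout. The sketch below flattens three codebooks into token ids and back, using the same slot-to-codebook order as the parsing loop in the end-to-end example. The range starts are the ones quoted in that example; the function names are hypothetical.

```python
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264  # range starts from this card


def interleave(c: list[int], m: list[int], f: list[int]) -> list[int]:
    """Flatten codebooks of length n, 2n, 4n into C,M,F,F,M,F,F token ids."""
    n = len(c)
    if len(m) != 2 * n or len(f) != 4 * n:
        raise ValueError("expected codebook lengths n, 2n, 4n")
    tokens = []
    for i in range(n):
        tokens += [
            C_BASE + c[i],
            M_BASE + m[2 * i],
            F_BASE + f[4 * i],
            F_BASE + f[4 * i + 1],
            M_BASE + m[2 * i + 1],
            F_BASE + f[4 * i + 2],
            F_BASE + f[4 * i + 3],
        ]
    return tokens


def deinterleave(tokens: list[int]) -> tuple[list[int], list[int], list[int]]:
    """Inverse of interleave; trailing partial frames are dropped."""
    c, m, f = [], [], []
    usable = (len(tokens) // 7) * 7
    for i in range(0, usable, 7):
        frame = tokens[i:i + 7]
        c.append(frame[0] - C_BASE)
        m += [frame[1] - M_BASE, frame[4] - M_BASE]
        f += [frame[2] - F_BASE, frame[3] - F_BASE,
              frame[5] - F_BASE, frame[6] - F_BASE]
    return c, m, f
```

A round trip such as `deinterleave(interleave(c, m, f)) == (c, m, f)` is a cheap unit test to run on any encoded corpus.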
Training summary
| Field | Value |
|---|---|
| Wall-clock | 12 days (2026-05-08 → 2026-05-21) |
| Iterations | 75,000 (warmup 457 linear → cosine decay → min_lr) |
| Global batch size | 8 (mbs=1 × 8-step gradient accumulation, dp=1) |
| Sequence length | 4096 |
| Tokens consumed | ~2.46 B (75k × 8 × 4096) |
| Single-pass | Yes: 600,000 of the bin's 601,910 unique samples; no epoch wrap; no overfitting by construction |
| Optimizer | Adam (β₁=0.9, β₂=0.95, ε=1e-8), weight_decay=0.1, grad-clip 1.0 |
| Peak LR | 5e-5 (backbone) / 5e-4 (decoupled: embedding + lm_head) |
| Min LR | 5e-7 / 5e-6 |
| Hardware | 2× NVIDIA DGX Spark (GB10, sm_121, 128 GB unified each), FSDP ZeRO-3 sharded across both nodes, RoCE interconnect |
| Framework | Megatron-Bridge 0.3.1 + Megatron-Core 0.16.1; resumed across 3 platform-level interruptions |
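As a rough picture of the learning-rate schedule in the table, the sketch below implements linear warmup followed by cosine decay to min_lr. It is a generic formulation, not Megatron's scheduler: it assumes warmup starts from zero and that the cosine decays over the remaining 75,000 − 457 iterations, neither of which the table states.

```python
import math

TOTAL_ITERS = 75_000
WARMUP_ITERS = 457


def lr_at(step: int, peak_lr: float, min_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    ASSUMPTIONS: warmup starts at 0 and the decay horizon is TOTAL_ITERS.
    """
    if step < WARMUP_ITERS:
        return peak_lr * step / WARMUP_ITERS
    progress = min((step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Backbone and decoupled (embedding + lm_head) groups share the same shape.
for step in (0, 457, 37_500, 75_000):
    print(step, lr_at(step, 5e-5, 5e-7), lr_at(step, 5e-4, 5e-6))
```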
Final validation losses (iter 75,000)
Held-out per-source dev splits, blended at the same per-phase weights as training:
| Metric | Loss | Perplexity |
|---|---|---|
| lm (weighted across all positions) | 3.944 | 51.6 |
| loss/range_C (SNAC coarse codes) | 4.190 | 66.0 |
| loss/range_M (SNAC mid codes) | 4.509 | 90.9 |
| loss/range_F (SNAC fine codes) | 4.840 | 126.5 |
| loss/text (FineWeb rehearsal) | 2.305 | 10.03 |
| loss/[SNAC] (open marker) | 0.440 | 1.55 |
| loss/[/SNAC] (close marker) | 0.020 | 1.02 |
The coarse-to-fine ordering C < M < F is preserved at every eval over the run, consistent with SNAC's residual hierarchy. Text PPL ~10 is approximately base NemotronH-Base quality (the 30% text rehearsal anchor held throughout); augmenting the vocabulary did not cause catastrophic forgetting of the base model's text capability.
For context, the random-baseline PPL over each 4096-token SNAC range is 4096; the iter-0 augmented-baseline PPL was ~7200 (slightly above random because the newly-added rows perturbed the softmax). The final range_C PPL of 66 is about 62× better than random on coarse codes.
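The perplexities and the comparison against random follow from the losses directly, assuming the losses are mean cross-entropy in nats. A short sketch, with the loss values copied from the table above:

```python
import math

UNIFORM_PPL = 4096  # a uniform guess over the 4096 codes of one SNAC range


def perplexity(loss_nats: float) -> float:
    """Perplexity from a mean cross-entropy loss, assumed to be in nats."""
    return math.exp(loss_nats)


range_losses = {"range_C": 4.190, "range_M": 4.509, "range_F": 4.840}

print(f"uniform-baseline loss: {math.log(UNIFORM_PPL):.3f} nats")
for name, loss in range_losses.items():
    ppl = perplexity(loss)
    print(f"{name}: ppl={ppl:.1f}, {UNIFORM_PPL / ppl:.0f}x below uniform")
```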
Data composition
Bin total: 601,910 samples × 4096 tokens ≈ 2.465 B tokens. 44 sources pooled across 10 languages, blended via per-source disjoint phase windows + without-replacement sampling (provably no document seen by more than one phase or twice within a phase).
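The bin size and the single-pass claim can be checked with a few lines of arithmetic. All constants below are taken from this card; the variable names are only illustrative.

```python
SAMPLES_IN_BIN = 601_910
SEQ_LEN = 4096
ITERS = 75_000
GLOBAL_BATCH = 8

bin_tokens = SAMPLES_IN_BIN * SEQ_LEN        # tokens available in the bin
samples_consumed = ITERS * GLOBAL_BATCH      # one sample per sequence per step
tokens_consumed = samples_consumed * SEQ_LEN
single_pass = samples_consumed <= SAMPLES_IN_BIN

print(f"bin:      {bin_tokens / 1e9:.3f} B tokens")
print(f"consumed: {tokens_consumed / 1e9:.3f} B tokens "
      f"({samples_consumed:,} of {SAMPLES_IN_BIN:,} samples)")
print(f"single pass: {single_pass}")
```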
Realized per-phase quality mix (% of phase, train split):
| Phase | Iters | text | bulk-noisy audio | clean-read | studio |
|---|---|---|---|---|---|
| 1: Broad foundation | 0 → 34,041 | 25.5% | 61.9% | 8.7% | 3.9% |
| 2: Diversity → quality | 34,041 → 58,271 | 22.7% | 52.1% | 16.9% | 8.3% |
| 3: Studio + speaker balance | 58,271 → 70,752 | 16.5% | 30.8% | 30.2% | 22.5% |
| 4: Anneal | 70,752 → 75,000 | 10.2% | 14.3% | 43.8% | 31.7% |
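When reading loss curves against this schedule, a small lookup from iteration to phase is convenient. The sketch below uses the boundaries from the table and assumes half-open windows (a boundary iteration belongs to the phase that starts there), which the table does not specify; the names are hypothetical.

```python
from bisect import bisect_right

TOTAL_ITERS = 75_000
PHASE_STARTS = [0, 34_041, 58_271, 70_752]
PHASE_NAMES = [
    "Broad foundation",
    "Diversity -> quality",
    "Studio + speaker balance",
    "Anneal",
]


def phase_for_iter(it: int) -> int:
    """1-based phase number for a training iteration.

    ASSUMPTION: phase windows are half-open, [start, next_start).
    """
    if not 0 <= it < TOTAL_ITERS:
        raise ValueError(f"iteration {it} outside [0, {TOTAL_ITERS})")
    return bisect_right(PHASE_STARTS, it)


# Length of each phase in iterations.
phase_lengths = [
    end - start
    for start, end in zip(PHASE_STARTS, PHASE_STARTS[1:] + [TOTAL_ITERS])
]

for number, (name, length) in enumerate(zip(PHASE_NAMES, phase_lengths), start=1):
    print(f"phase {number} ({name}): {length:,} iters, "
          f"{100 * length / TOTAL_ITERS:.1f}% of the run")
```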
Source corpora (all encoded into Megatron .bin/.idx format via the project's format-snac → format-phases pipeline):
- Text rehearsal: FineWeb-EN (sample-10BT) + FineWeb-2 (deu/spa/fra/ita/por/kor/jpn/rus/cmn)
- Bulk-noisy audio: Emilia-YODAS (EN/ZH/DE/FR/JA/KO), Emilia-zh, GigaSpeech-XL (audiobook/podcast/youtube), VoxPopuli (DE/FR/IT/PL/ES)
- Clean read: LibriTTS-R (EN), MLS (DE/FR/ES/IT/PT)
- Studio + crowd: HiFi-TTS, VCTK, AISHELL-1, AISHELL-3, CommonVoice-17 (EN/DE/FR/ES/IT/PT/JA/KO/RU)
Intended use
- Starting point for a stage-2 TTS finetune on parallel text→SNAC data. The cross-modal bridge is not present in this checkpoint; supervised finetuning on text-aligned SNAC sequences is what lights it up.
- Multilingual SNAC perplexity benchmarking across the 10 NemotronH-supported languages.
- Acoustic embedding extraction: pool residual stream activations over a SNAC sequence for downstream classification (language ID, speaker family, audio quality scoring).
- Audio-only continuation / infilling: given partial SNAC, generate a plausible continuation. Distribution-in, distribution-out.
Out of scope and limitations
- Not a TTS system. Stage-1 mixed text-only and SNAC-only documents. There is no learned bridge between text and audio in this checkpoint. Prompting it with text and expecting SNAC output (or vice versa) will not work cleanly without a stage-2 finetune.
- No speaker conditioning. Speaker tokens / voice control are deferred to the downstream TTS finetune by design.
- 4K context, not 8K. Architecturally the base supports 8K; the augmented model's slot router was never exercised on sequences > 4K. Use 4K and below for now.
- Languages outside NemotronH's supported 10 were dropped during data design; do not expect quality on e.g. Polish, Indonesian, Vietnamese, Thai, Arabic.
- HiFi-TTS was in the training mix. If your downstream evaluation uses HiFi-TTS speakers as "held-out studio voices," this prior has already seen them, so the strict version of the held-out-voice gate cannot be measured on this checkpoint. (Future Calliope stage-1 versions hold HiFi-TTS out entirely; see the v3 plan link below.)
- Convergence ceiling. The loss plateaued well above the original optimistic target (lm 3.0-3.5 hoped, lm 3.94 reached). The conservative forecast fit at iter-20k called the final values to within 0.02 nats: the model is on its forecast trajectory, just lower-quality than initial intuition suggested. Diagnostically traced to LR/optimizer regime (high decoupled_lr perturbing already-converged text rows, body-learning bottleneck), not to data quantity. A v3 design is planned that addresses these (WSD LR schedule, lower decoupled_lr, mean-init confirmed already in use, marginal data blending).
Format details
This repository ships the model in standard HuggingFace safetensors format. Files:
| File | Purpose |
|---|---|
| model.safetensors (~17 GB) | All weights, bfloat16 |
| config.json | NemotronH config + vocab_size: 143360 |
| configuration_nemotron_h.py | Config class (base NemotronH) |
| modeling_nemotron_h.py | Base NemotronH modeling (vendored to avoid transformers version drift) |
| modeling_nemotron_h_augmented.py | NemotronHAugmentedForCausalLM: wrapper that reads augmented.yaml at __init__ and applies the slot-router logits mask in forward |
| augmented.yaml | Slot router + range definitions; read by the augmented modeling class at load time |
| tokenizer.json, tokenizer_config.json | Augmented tokenizer |
| generation_config.json | Default decoding params |
| __init__.py | Empty, so the dir is a valid Python package for trust_remote_code |
The Megatron-Bridge FSDP DCP format of the same trained weights is at the sibling private repo zeroae/calliope-snac-4b-base-4k.megatron; use that one if you want to continue pretraining via Bridge.
Provenance & references
- W&B run: ynbrpbmx (project calliope, exp stage1-phase-full-v2)
- Original training plan: docs/superpowers/plans/2026-05-02-snac-stage1-multilingual-prior.md
- Predictions vs forecast vs actuals: 2026-05-07-snac-stage1-phase-end-predictions.md, 2026-05-10-snac-stage1-forecast.md
- v3 design (next iteration: WSD schedule, marginal blend, HiFi-TTS holdout): 2026-05-16-snac-stage1-v3.md
- Base model: nvidia/Nemotron-H-4B-Base-8K
- Codec: hubertsiuzdak/snac_24khz
License
This checkpoint is a derivative of nvidia/Nemotron-H-4B-Base-8K. The base model's license terms apply to redistribution and use of these weights. Refer to the linked base model for the authoritative license; this repo does not extend or restrict those terms.
The augmented modeling code (modeling_nemotron_h_augmented.py, augmented.yaml, augmentation specs) is © Zero A.E., LLC and licensed for research use under terms TBD; please contact the org before commercial use.