Calliope SNAC 4B Base (4K)

Stage-1 multilingual SNAC prior for the Calliope text-to-speech project — a continued-pretrain of nvidia/Nemotron-H-4B-Base-8K with the vocabulary augmented by 12,288 SNAC codec tokens and a slot router that enforces the codec's C·M·F·F·M·F·F frame pattern at audio-mode positions.

This is the HuggingFace safetensors version, loadable via AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True). The Megatron-Bridge FSDP DCP format is at the sibling repo zeroae/calliope-snac-4b-base-4k.megatron (private) for Bridge-based continued training.

What this is and isn't. This is a pretrained prior, not a finished TTS system. Training mixed text-only and SNAC-audio-only documents — the cross-modal text→SNAC bridge is a separate stage-2 finetune objective and was not learned here. Use this checkpoint as the starting point for a TTS finetune, not as an end-to-end speech model.

Quick start: text generation

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "zeroae/calliope-snac-4b-base-4k"

tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="cuda",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Text-only generation (the text path is preserved at near-base quality).
# The slot router masks all SNAC tokens to -inf in text mode, so text
# generation is unaffected by the augmented vocab.
ids = tokenizer("In multilingual TTS, prosody", return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0]))
```

The trust_remote_code=True flag pulls modeling_nemotron_h_augmented.py from this repo, which wraps the base NemotronH-4B with the slot-router logits mask (text mode masks all SNAC tokens, audio mode masks all text — enforced per-position).
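The per-position masking described above can be pictured as a small state machine. The following is a minimal pure-Python sketch, not the code in modeling_nemotron_h_augmented.py: the function names `allowed_ranges` and `mask_logits` are hypothetical, `TEXT_RANGE` assumes every base-vocabulary id counts as text, and the sketch does not model how the real router admits `[/SNAC]` from inside audio mode, which this card does not describe.

```python
from typing import List, Tuple

# Values taken from this card: marker ids and the start of each 4096-wide range.
SNAC_OPEN, SNAC_CLOSE = 100, 101
RANGES = {"C": (131072, 135168), "M": (135168, 139264), "F": (139264, 143360)}
SLOT_PATTERN = ["C", "M", "F", "F", "M", "F", "F"]
TEXT_RANGE = (0, 131072)  # assumption: every base-vocab id counts as "text"


def allowed_ranges(tokens: List[int]) -> List[Tuple[int, int]]:
    """For each position, return the [lo, hi) id range the NEXT token may come from."""
    in_audio, slot = False, 0
    out = []
    for t in tokens:
        if t == SNAC_OPEN:          # enter audio mode at slot 0 (a C position)
            in_audio, slot = True, 0
        elif t == SNAC_CLOSE:       # leave audio mode
            in_audio = False
        elif in_audio:              # an audio token was consumed: advance the slot
            slot = (slot + 1) % len(SLOT_PATTERN)
        out.append(RANGES[SLOT_PATTERN[slot]] if in_audio else TEXT_RANGE)
    return out


def mask_logits(logits: List[float], allowed: Tuple[int, int]) -> List[float]:
    """Set every logit outside the allowed [lo, hi) range to -inf."""
    lo, hi = allowed
    return [x if lo <= i < hi else float("-inf") for i, x in enumerate(logits)]
```

Because the state depends only on the tokens seen so far, it can be carried across forward calls, which is consistent with the KV-caching behaviour described in the end-to-end example.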

End-to-end: generate SNAC frames → decode to audio

This pretrained prior has no text→SNAC bridge (see disclaimer above); the example below shows the unconditional end-to-end pipeline that the slot router makes work: prompt with the [SNAC] marker, generate tokens (which the slot router constrains to the C·M·F·F·M·F·F frame pattern), parse them back into the three SNAC codebooks, and decode to a waveform via the upstream hubertsiuzdak/snac_24khz codec.

Expect the audio to be babble or noise: the model is sampling unconditionally from its learned audio distribution, and no text guides the content. The point is to demonstrate the mechanics; intelligible speech requires a stage-2 finetune that learns the text→SNAC bridge.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC          # pip install snac
import torchaudio

REPO = "zeroae/calliope-snac-4b-base-4k"

# --- 1. Load LM ---------------------------------------------------------
tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda",
    low_cpu_mem_usage=True, trust_remote_code=True,
).eval()

# Vocab layout (from augmented.yaml, also visible in the repo)
SNAC_OPEN, SNAC_CLOSE = 100, 101
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264   # start of each codebook range
N_FRAMES = 50                                      # ~4 s at SNAC-24kHz's coarse rate
N_TOKENS = N_FRAMES * 7                            # 7 tokens / frame (C,M,F,F,M,F,F)

# --- 2. Generate inside an [SNAC] ... span ------------------------------
# The slot router (modeling_nemotron_h_augmented.py) carries its
# (in_slot_mode, slot_counter) state across forward calls via
# self._slot_router_state, so KV caching just works: prefill computes
# routing from initial state, subsequent forwards advance from the
# cached final state. No special flags needed.
prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
with torch.no_grad():
    out = model.generate(
        prompt,
        max_new_tokens=N_TOKENS,
        do_sample=True, temperature=0.8, top_p=0.95,
    )

# --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
gen = out[0, prompt.shape[1]:].tolist()
# Defensive: if the model closes the span early, drop [/SNAC] and anything after it.
if SNAC_CLOSE in gen:
    gen = gen[: gen.index(SNAC_CLOSE)]
gen = gen[: (len(gen) // 7) * 7]                    # truncate to whole frames
c_codes, m_codes, f_codes = [], [], []
for i in range(0, len(gen), 7):
    frame = gen[i:i + 7]
    c_codes.append(frame[0] - C_BASE)               # slot 0: C
    m_codes.append(frame[1] - M_BASE)               # slot 1: M
    f_codes.append(frame[2] - F_BASE)               # slot 2: F
    f_codes.append(frame[3] - F_BASE)               # slot 3: F
    m_codes.append(frame[4] - M_BASE)               # slot 4: M
    f_codes.append(frame[5] - F_BASE)               # slot 5: F
    f_codes.append(frame[6] - F_BASE)               # slot 6: F

# Sanity-check the slot router did its job (codes within [0, 4096))
assert all(0 <= c < 4096 for c in c_codes), "C codes out of range (slot router off?)"
assert all(0 <= m < 4096 for m in m_codes), "M codes out of range"
assert all(0 <= f < 4096 for f in f_codes), "F codes out of range"

# --- 4. Decode the three codebooks to a 24 kHz waveform -----------------
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
codes = [
    torch.tensor([c_codes], dtype=torch.long, device="cuda"),   # [B=1, N_FRAMES]
    torch.tensor([m_codes], dtype=torch.long, device="cuda"),   # [B=1, 2*N_FRAMES]
    torch.tensor([f_codes], dtype=torch.long, device="cuda"),   # [B=1, 4*N_FRAMES]
]
with torch.no_grad():
    audio = codec.decode(codes)                                 # [1, 1, num_samples]

# --- 5. Save ------------------------------------------------------------
torchaudio.save("calliope_unconditional.wav", audio.squeeze(0).cpu(), sample_rate=24000)
print(f"saved {audio.shape[-1] / 24000:.2f} s of audio  "
      f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")
```

Dependencies: pip install snac torchaudio in addition to transformers torch. Wall-clock for 50 frames (~4 s of audio): a few seconds on a GB10 with KV caching on (the default).

Token-budget rule of thumb: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate max_new_tokens for a given duration:

```python
N_TOKENS = int(seconds * 12) * 7      # 7 tokens per frame
```
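The same rule of thumb can be run in reverse to estimate how much audio fits in the trained 4K context. This is a small sketch under the card's approximate figures (about 12 coarse frames per second, 7 tokens per frame); the helper names `tokens_for` and `max_seconds` are illustrative, and the default `prompt_tokens=2` assumes a BOS plus `[SNAC]` prompt as in the example above.

```python
FRAME_RATE_HZ = 12      # approximate SNAC-24kHz coarse rate quoted above
TOKENS_PER_FRAME = 7    # C, M, F, F, M, F, F
TRAINED_CONTEXT = 4096  # trained sequence length from this card


def tokens_for(seconds: float) -> int:
    """max_new_tokens needed for roughly `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ) * TOKENS_PER_FRAME


def max_seconds(context: int = TRAINED_CONTEXT, prompt_tokens: int = 2) -> float:
    """Longest audio span (whole frames only) that fits after the prompt."""
    frames = (context - prompt_tokens) // TOKENS_PER_FRAME
    return frames / FRAME_RATE_HZ


if __name__ == "__main__":
    print(tokens_for(4.0))   # token budget for about four seconds
    print(max_seconds())     # seconds of audio that fit in the trained context
```

With these assumptions, 584 whole frames fit after a two-token prompt, which is a little under 49 seconds of audio.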

Why this demo's audio sounds bad (and that's expected)

The model has never seen text + [SNAC]…[/SNAC] parallel sequences — only text-only documents and SNAC-only documents, mixed at the batch level. Unconditional sampling from the SNAC distribution produces something codec-plausible (the slot router guarantees the bit-stream is structurally valid, and the codec can always decode), but it has no semantic content. It's the analogue of letting a language model generate without a prompt — you get gibberish that has the shape of the training distribution. A stage-2 TTS finetune on text → SNAC parallel data is what makes this conditional and intelligible.

Architecture

Field	Value
Base model	`nvidia/Nemotron-H-4B-Base-8K` (hybrid Mamba + attention, 52 layers)
Parameters	~4.56 B (4 B base + augmented embedding/lm_head rows)
Vocabulary size	143,360 (131,072 base + 12,288 SNAC; the 2 markers take ids 100 and 101 inside the base range, with 254 reserved-special ids left unchanged)
New tokens	`SNAC_C_*` (4096), `SNAC_M_*` (4096), `SNAC_F_*` (4096), `[SNAC]` (id 100), `[/SNAC]` (id 101)
Vocab init	`mean_resizing` (multivariate-normal-matched to existing embedding distribution; Hewitt 2021)
Slot router	`slot_pattern: [C, M, F, F, M, F, F]` — masks logits to the relevant range at each frame position; `[SNAC]`/`[/SNAC]` markers flip into/out of audio mode
Context length (trained)	4096 (architectural cap is 8192 inherited from base; 8K inference is unverified for this checkpoint — the slot-router state machine should extend, but no measurement exists)
Precision	bfloat16 weights
Tokenizer	NemotronH base tokenizer extended with 12,290 new tokens (12,288 SNAC codes from id 131,072 upward, plus the 2 markers at ids 100 and 101)

The SNAC frame layout is [C, M, F, F, M, F, F] — 7 tokens per coarse frame, one coarse (C) → two mid (M) → four fine (F) — matching SNAC-24kHz's 1:2:4 residual-quantizer hierarchy.
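For audio-only continuation you need the opposite direction from the demo's parser: turning codec indices into token ids. Here is a minimal sketch of both directions, using the range starts from the end-to-end example; `pack` and `unpack` are hypothetical helper names, and the within-frame ordering simply mirrors the demo's parsing loop.

```python
from typing import List, Tuple

C_BASE, M_BASE, F_BASE = 131072, 135168, 139264  # range starts, as in the demo
SLOTS = [("C", C_BASE), ("M", M_BASE), ("F", F_BASE), ("F", F_BASE),
         ("M", M_BASE), ("F", F_BASE), ("F", F_BASE)]


def pack(c: List[int], m: List[int], f: List[int]) -> List[int]:
    """Interleave per-codebook indices into the 7-token frame layout."""
    assert len(m) == 2 * len(c) and len(f) == 4 * len(c), "need a 1:2:4 ratio"
    books = {"C": iter(c), "M": iter(m), "F": iter(f)}
    # One pass over SLOTS per coarse frame; each slot pulls the next code
    # from its codebook and shifts it into that codebook's token-id range.
    return [next(books[name]) + base for _ in c for name, base in SLOTS]


def unpack(tokens: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Inverse of pack: split token ids back into (C, M, F) index lists."""
    assert len(tokens) % 7 == 0, "expected whole frames"
    out = {"C": [], "M": [], "F": []}
    for i, t in enumerate(tokens):
        name, base = SLOTS[i % 7]
        out[name].append(t - base)
    return out["C"], out["M"], out["F"]
```

Wrapping the packed ids between `[SNAC]` (100) and, optionally, `[/SNAC]` (101) gives a sequence in the same shape the model was trained on.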

Training summary


Field	Value
Wall-clock	12 days (2026-05-08 → 2026-05-21)
Iterations	75,000 (warmup 457 linear → cosine decay → min_lr)
Global batch size	8 (mbs=1 × 8-step gradient accumulation, dp=1)
Sequence length	4096
Tokens consumed	~2.46 B (75k × 8 × 4096)
Single-pass	Yes — 600,000 of the bin's 601,910 unique samples; no epoch wrap; no overfitting by construction
Optimizer	Adam (β₁=0.9, β₂=0.95, ε=1e-8), weight_decay=0.1, grad-clip 1.0
Peak LR	5e-5 (backbone) / 5e-4 (decoupled — embedding + lm_head)
Min LR	5e-7 / 5e-6
Hardware	2× NVIDIA DGX Spark (GB10, sm_121, 128 GB unified each), FSDP ZeRO-3 sharded across both nodes, RoCE interconnect
Framework	Megatron-Bridge 0.3.1 + Megatron-Core 0.16.1; resumed across 3 platform-level interruptions
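To make the schedule and token count in the table concrete, here is a small sketch of a linear-warmup plus cosine-decay curve using the tabulated values. It assumes the cosine runs from the end of warmup to iteration 75,000 and lands exactly on min_lr; the scheduler actually used in Megatron-Bridge may differ in details such as the warmup starting value. `lr_at` is an illustrative name.

```python
import math

WARMUP, TOTAL = 457, 75_000
PEAK_LR, MIN_LR = 5e-5, 5e-7   # backbone values; decoupled rows use 5e-4 / 5e-6


def lr_at(it: int, peak: float = PEAK_LR, floor: float = MIN_LR) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor` at TOTAL."""
    if it < WARMUP:
        return peak * (it + 1) / WARMUP
    progress = min((it - WARMUP) / (TOTAL - WARMUP), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


# Tokens consumed: iterations x global batch size x sequence length.
tokens = TOTAL * 8 * 4096

if __name__ == "__main__":
    for it in (0, WARMUP, TOTAL // 2, TOTAL):
        print(it, f"{lr_at(it):.3e}")
    print(f"{tokens / 1e9:.2f} B tokens")
```

Calling `lr_at(it, peak=5e-4, floor=5e-6)` gives the corresponding curve for the decoupled embedding and lm_head learning rate.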

Final validation losses (iter 75,000)

Held-out per-source dev splits, blended at the same per-phase weights as training:

Metric	Loss	Perplexity
`lm` (weighted across all positions)	3.944	51.6
`loss/range_C` (SNAC coarse codes)	4.190	66.0
`loss/range_M` (SNAC mid codes)	4.509	90.9
`loss/range_F` (SNAC fine codes)	4.840	126.5
`loss/text` (FineWeb rehearsal)	2.305	10.03
`loss/[SNAC]` (open marker)	0.440	1.55
`loss/[/SNAC]` (close marker)	0.020	1.02

The coarse-to-fine ordering C < M < F is preserved at every eval over the run, consistent with SNAC's residual hierarchy. Text PPL of ~10 is approximately base NemotronH-Base quality (the 30% text rehearsal anchor held throughout), so adding the augmented vocabulary did not cause catastrophic forgetting of the base model's text capability.

For context, the random-baseline PPL over each 4096-token SNAC range is 4096; the iter-0 augmented-baseline PPL was ~7200 (above random because the newly added rows perturbed the softmax). Final range_C PPL 66 ≈ 62× better than random on coarse codes.
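The perplexity column and the comparison against the uniform baseline follow directly from the losses, assuming they are reported in nats (perplexity = exp(loss)). A short sketch that reproduces the arithmetic from the tabulated values:

```python
import math

# Validation losses in nats, copied from the table above.
val_loss = {"lm": 3.944, "range_C": 4.190, "range_M": 4.509,
            "range_F": 4.840, "text": 2.305}

ppl = {name: math.exp(loss) for name, loss in val_loss.items()}
RANDOM_PPL = 4096   # uniform distribution over one 4096-code range

if __name__ == "__main__":
    for name in ("range_C", "range_M", "range_F"):
        gain = RANDOM_PPL / ppl[name]
        print(f"{name}: ppl {ppl[name]:.1f}, {gain:.0f}x better than uniform")
```

The uniform baseline corresponds to a loss of ln(4096) ≈ 8.32 nats, so the per-range losses sit roughly 3.5 to 4.1 nats below chance.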

Data composition

Bin total: 601,910 samples × 4096 tokens = 2.465 B tokens. 44 sources pooled across 10 languages, blended via per-source disjoint phase windows and without-replacement sampling, so by construction no document is seen by more than one phase or twice within a phase.

Realized per-phase quality mix (% of phase, train split):

Phase	Iters	text	bulk-noisy audio	clean-read	studio
1 — Broad foundation	0 – 34,041	25.5%	61.9%	8.7%	3.9%
2 — Diversity → quality	34,041 – 58,271	22.7%	52.1%	16.9%	8.3%
3 — Studio + speaker balance	58,271 – 70,752	16.5%	30.8%	30.2%	22.5%
4 — Anneal	70,752 – 75,000	10.2%	14.3%	43.8%	31.7%
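The per-phase percentages can be combined into an overall realized mix by weighting each phase by its length. This sketch assumes the percentages are by sample count and that every iteration consumes the same number of samples (global batch size 8), so iteration counts are valid weights; the short category keys are just labels for the four table columns.

```python
# (start_iter, end_iter, {category: percent of phase}) from the table above
PHASES = [
    (0,      34_041, {"text": 25.5, "bulk": 61.9, "clean": 8.7,  "studio": 3.9}),
    (34_041, 58_271, {"text": 22.7, "bulk": 52.1, "clean": 16.9, "studio": 8.3}),
    (58_271, 70_752, {"text": 16.5, "bulk": 30.8, "clean": 30.2, "studio": 22.5}),
    (70_752, 75_000, {"text": 10.2, "bulk": 14.3, "clean": 43.8, "studio": 31.7}),
]

total_iters = sum(end - start for start, end, _ in PHASES)

# Iteration-weighted average of each category's share across all phases.
overall = {
    cat: sum((end - start) * mix[cat] for start, end, mix in PHASES) / total_iters
    for cat in PHASES[0][2]
}

if __name__ == "__main__":
    for cat, pct in overall.items():
        print(f"{cat}: {pct:.1f}% of the run")
```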

Source corpora (all encoded into Megatron .bin/.idx format via the project's format-snac → format-phases pipeline):

Text rehearsal: FineWeb-EN (sample-10BT) + FineWeb-2 (deu/spa/fra/ita/por/kor/jpn/rus/cmn)
Bulk-noisy audio: Emilia-YODAS (EN/ZH/DE/FR/JA/KO), Emilia-zh, GigaSpeech-XL (audiobook/podcast/youtube), VoxPopuli (DE/FR/IT/PL/ES)
Clean read: LibriTTS-R (EN), MLS (DE/FR/ES/IT/PT)
Studio + crowd: HiFi-TTS, VCTK, AISHELL-1, AISHELL-3, CommonVoice-17 (EN/DE/FR/ES/IT/PT/JA/KO/RU)

Intended use

Starting point for a stage-2 TTS finetune on parallel text → SNAC data. The cross-modal bridge is not present in this checkpoint; supervised finetuning on text-aligned SNAC sequences is what lights it up.
Multilingual SNAC perplexity benchmarking across the 10 NemotronH-supported languages.
Acoustic embedding extraction — pool residual stream activations over a SNAC sequence for downstream classification (language ID, speaker family, audio quality scoring).
Audio-only continuation / infilling — given partial SNAC, generate plausible continuation. Distribution-in, distribution-out.

Out of scope and limitations

Not a TTS system. Stage-1 mixed text-only and SNAC-only documents. There is no learned bridge between text and audio in this checkpoint. Prompting it with text and expecting SNAC output (or vice versa) will not work cleanly without a stage-2 finetune.
No speaker conditioning. Speaker tokens / voice control are deferred to the downstream TTS finetune by design.
4K context, not 8K. Architecturally the base supports 8K; the augmented model's slot router was never exercised on sequences > 4K. Use 4K and below for now.
Languages outside NemotronH's supported 10 were dropped during data design — do not expect quality on e.g. Polish, Indonesian, Vietnamese, Thai, Arabic.
HiFi-TTS was in the training mix. If your downstream evaluation uses HiFi-TTS speakers as "held-out studio voices," this prior has already seen them — the strict version of the held-out-voice gate cannot be measured on this checkpoint. (Future Calliope stage-1 versions hold HiFi-TTS out entirely; see the v3 plan link below.)
Convergence ceiling. The loss plateaued well above the original optimistic target (lm 3.0-3.5 hoped, lm 3.94 reached). The conservative forecast fit at iter 20k called the final values to within 0.02 nats, so the model is on its forecast trajectory, just lower-quality than initial intuition suggested. The shortfall was traced to the LR/optimizer regime (a high decoupled_lr perturbing already-converged text rows, plus a body-learning bottleneck), not to data quantity. A v3 design is planned to address these (WSD LR schedule, lower decoupled_lr, mean-init confirmed already in use, marginal data blending).

Format details

This repository ships the model in standard HuggingFace safetensors format. Files:

File	Purpose
`model.safetensors` (~17 GB)	All weights, bfloat16
`config.json`	NemotronH config + `vocab_size: 143360`
`configuration_nemotron_h.py`	Config class (base NemotronH)
`modeling_nemotron_h.py`	Base NemotronH modeling (vendored to avoid `transformers` version drift)
`modeling_nemotron_h_augmented.py`	`NemotronHAugmentedForCausalLM` — wrapper that reads `augmented.yaml` at `__init__` and applies the slot-router logits mask in `forward`
`augmented.yaml`	Slot router + range definitions; read by the augmented modeling class at load time
`tokenizer.json`, `tokenizer_config.json`	Augmented tokenizer
`generation_config.json`	Default decoding params
`__init__.py`	Empty, so the dir is a valid Python package for `trust_remote_code`

The Megatron-Bridge FSDP DCP format of the same trained weights is at the sibling private repo zeroae/calliope-snac-4b-base-4k.megatron — use that one if you want to continue pretraining via Bridge.

Provenance & references

W&B run: ynbrpbmx (project calliope, exp stage1-phase-full-v2)
Original training plan: docs/superpowers/plans/2026-05-02-snac-stage1-multilingual-prior.md
Predictions vs forecast vs actuals: 2026-05-07-snac-stage1-phase-end-predictions.md, 2026-05-10-snac-stage1-forecast.md
v3 design (next iteration — WSD schedule, marginal blend, HiFi-TTS holdout): 2026-05-16-snac-stage1-v3.md
Base model: nvidia/Nemotron-H-4B-Base-8K
Codec: hubertsiuzdak/snac_24khz

License

This checkpoint is a derivative of nvidia/Nemotron-H-4B-Base-8K. The base model's license terms apply to redistribution and use of these weights. Refer to the linked base model for the authoritative license; this repo does not extend or restrict those terms.

The augmented modeling code (modeling_nemotron_h_augmented.py, augmented.yaml, augmentation specs) is © Zero A.E., LLC and licensed for research use under terms TBD — please contact the org before commercial use.
