Calliope SNAC 4B Base (4K)

Stage-1 multilingual SNAC prior for the Calliope text-to-speech project: a continued pretrain of nvidia/Nemotron-H-4B-Base-8K with the vocabulary augmented by 12,288 SNAC codec tokens and a slot router that enforces the codec's C·M·F·F·M·F·F frame pattern at audio-mode positions.

This is the HuggingFace safetensors version, loadable via AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True). The Megatron-Bridge FSDP DCP format is at the sibling repo zeroae/calliope-snac-4b-base-4k.megatron (private) for Bridge-based continued training.

What this is and isn't. This is a pretrained prior, not a finished TTS system. Training mixed text-only and SNAC-audio-only documents — the cross-modal text→SNAC bridge is a separate stage-2 finetune objective and was not learned here. Use this checkpoint as the starting point for a TTS finetune, not as an end-to-end speech model.

Quick start: text generation

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "zeroae/calliope-snac-4b-base-4k"

tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO,
    dtype=torch.bfloat16,
    device_map="cuda",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Text-only generation (the text path is preserved at near-base quality).
# The slot router masks all SNAC tokens to -inf in text mode, so text
# generation is unaffected by the augmented vocab.
ids = tokenizer("In multilingual TTS, prosody", return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0]))

The trust_remote_code=True flag pulls modeling_nemotron_h_augmented.py from this repo, which wraps the base NemotronH-4B with the slot-router logits mask (text mode masks all SNAC tokens, audio mode masks all text; enforced per position).
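The masking rule described above can be illustrated with a small standalone sketch. It is not the code in modeling_nemotron_h_augmented.py, which is not reproduced in this card. The function names are hypothetical, the range boundaries are the ones given in this card's vocab layout, and the handling of the [/SNAC] marker in audio mode is left out because the card does not specify it.

```python
import math

# Range boundaries as listed in this card's vocab layout.
TEXT_VOCAB = 131072                      # base (text) ids are [0, 131072)
RANGE_START = {"C": 131072, "M": 135168, "F": 139264}
RANGE_SIZE = 4096
SLOT_PATTERN = ["C", "M", "F", "F", "M", "F", "F"]


def allowed_range(in_audio_mode, slot_counter):
    """Return the half-open id range [lo, hi) permitted at one position (hypothetical helper)."""
    if not in_audio_mode:
        return 0, TEXT_VOCAB             # text mode: every SNAC code is masked
    kind = SLOT_PATTERN[slot_counter % len(SLOT_PATTERN)]
    lo = RANGE_START[kind]
    return lo, lo + RANGE_SIZE           # audio mode: only this slot's codebook


def mask_logits(logits, in_audio_mode, slot_counter):
    """Set every logit outside the allowed range to -inf."""
    lo, hi = allowed_range(in_audio_mode, slot_counter)
    return [x if lo <= i < hi else -math.inf for i, x in enumerate(logits)]
```

A list-based mask keeps the sketch dependency-free; a real implementation would more likely apply the same range test to a logits tensor.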

End-to-end: generate SNAC frames → decode to audio

This pretrained prior has no text→SNAC bridge (see disclaimer above); the example below shows the unconditional end-to-end pipeline that the slot router makes work: prompt with the [SNAC] marker, generate tokens (which the slot router constrains to the C·M·F·F·M·F·F frame pattern), parse them back into the three SNAC codebooks, and decode to a waveform via the upstream hubertsiuzdak/snac_24khz codec.

Expect the audio to be babble or noise: the model is sampling unconditionally from its learned audio distribution, and no text guides the content. The point is to demonstrate the mechanics; quality requires a stage-2 finetune that learns the text→SNAC bridge.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC          # pip install snac
import torchaudio

REPO = "zeroae/calliope-snac-4b-base-4k"

# --- 1. Load LM ---------------------------------------------------------
tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda",
    low_cpu_mem_usage=True, trust_remote_code=True,
).eval()

# Vocab layout (from augmented.yaml, also visible in the repo)
SNAC_OPEN, SNAC_CLOSE = 100, 101
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264   # start of each codebook range
N_FRAMES = 50                                      # ~4 s at SNAC-24kHz's coarse rate
N_TOKENS = N_FRAMES * 7                            # 7 tokens / frame (C,M,F,F,M,F,F)

# --- 2. Generate inside an [SNAC] ... span ------------------------------
# The slot router (modeling_nemotron_h_augmented.py) carries its
# (in_slot_mode, slot_counter) state across forward calls via
# self._slot_router_state, so KV caching just works: prefill computes
# routing from initial state, subsequent forwards advance from the
# cached final state. No special flags needed.
prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
with torch.no_grad():
    out = model.generate(
        prompt,
        max_new_tokens=N_TOKENS,
        do_sample=True, temperature=0.8, top_p=0.95,
    )

# --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
gen = out[0, prompt.shape[1]:].tolist()
gen = gen[: (len(gen) // 7) * 7]                    # truncate to whole frames
c_codes, m_codes, f_codes = [], [], []
for i in range(0, len(gen), 7):
    frame = gen[i:i + 7]
    c_codes.append(frame[0] - C_BASE)               # slot 0: C
    m_codes.append(frame[1] - M_BASE)               # slot 1: M
    f_codes.append(frame[2] - F_BASE)               # slot 2: F
    f_codes.append(frame[3] - F_BASE)               # slot 3: F
    m_codes.append(frame[4] - M_BASE)               # slot 4: M
    f_codes.append(frame[5] - F_BASE)               # slot 5: F
    f_codes.append(frame[6] - F_BASE)               # slot 6: F

# Sanity-check the slot router did its job (codes within [0, 4096))
assert all(0 <= c < 4096 for c in c_codes), "C codes out of range: slot router off?"
assert all(0 <= m < 4096 for m in m_codes), "M codes out of range"
assert all(0 <= f < 4096 for f in f_codes), "F codes out of range"

# --- 4. Decode the three codebooks to a 24 kHz waveform -----------------
codec = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")
codes = [
    torch.tensor([c_codes], dtype=torch.long, device="cuda"),   # [B=1, N_FRAMES]
    torch.tensor([m_codes], dtype=torch.long, device="cuda"),   # [B=1, 2*N_FRAMES]
    torch.tensor([f_codes], dtype=torch.long, device="cuda"),   # [B=1, 4*N_FRAMES]
]
with torch.no_grad():
    audio = codec.decode(codes)                                 # [1, 1, num_samples]

# --- 5. Save ------------------------------------------------------------
torchaudio.save("calliope_unconditional.wav", audio.squeeze(0).cpu(), sample_rate=24000)
print(f"saved {audio.shape[-1] / 24000:.2f} s of audio  "
      f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")

Dependencies: pip install snac torchaudio in addition to transformers torch. Wall-clock for 50 frames (~4 s of audio): a few seconds on a GB10 with KV caching on (the default).

Token-budget rule of thumb: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate max_new_tokens for a given duration:

N_TOKENS = int(seconds * 12) * 7      # 7 tokens per frame
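
The same rule of thumb can be wrapped in a small helper that also reports how much audio fits in the trained 4096-token context. This is a sketch: the function names are hypothetical, the 12 Hz rate is the approximate figure quoted above, and the two-token overhead assumes a prompt of BOS plus [SNAC] as in the end-to-end example.

```python
COARSE_HZ = 12          # approximate coarse frame rate quoted in this card
TOKENS_PER_FRAME = 7    # C, M, F, F, M, F, F
TRAINED_CONTEXT = 4096  # sequence length used in training


def tokens_for_duration(seconds):
    """max_new_tokens needed for `seconds` of audio, rounded down to whole frames."""
    return int(seconds * COARSE_HZ) * TOKENS_PER_FRAME


def max_duration_seconds(context=TRAINED_CONTEXT, prompt_tokens=2):
    """Longest audio span (in seconds) whose frames fit in `context` after the prompt."""
    frames = (context - prompt_tokens) // TOKENS_PER_FRAME
    return frames / COARSE_HZ


print(tokens_for_duration(4))            # 336
print(round(max_duration_seconds(), 1))  # 48.7
```

Under these assumptions a single [SNAC] span tops out a little under 49 seconds of audio before it exceeds the trained context.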

Why this demo's audio sounds bad (and that's expected)

The model has never seen text + [SNAC]…[/SNAC] parallel sequences, only text-only documents and SNAC-only documents, mixed at the batch level. Unconditional sampling from the SNAC distribution produces something codec-plausible (the slot router guarantees the bit-stream is structurally valid, and the codec can always decode), but it has no semantic content. It's the analogue of letting a language model generate without a prompt: you get gibberish that has the shape of the training distribution. A stage-2 TTS finetune on text → SNAC parallel data is what makes this conditional and intelligible.

Architecture

| Field | Value |
|---|---|
| Base model | nvidia/Nemotron-H-4B-Base-8K (hybrid Mamba + attention, 52 layers) |
| Parameters | ~4.56 B (4 B base + augmented embedding/lm_head rows) |
| Vocabulary size | 143,360 (131,072 base + 12,288 SNAC; the 2 markers use ids 100 and 101 inside the base range, and 254 reserved-special ids are unchanged) |
| New tokens | SNAC_C_* (4096), SNAC_M_* (4096), SNAC_F_* (4096), [SNAC] (id 100), [/SNAC] (id 101) |
| Vocab init | mean_resizing (multivariate-normal-matched to existing embedding distribution; Hewitt 2021) |
| Slot router | slot_pattern: [C, M, F, F, M, F, F]; masks logits to the relevant range at each frame position; [SNAC]/[/SNAC] markers flip into/out of audio mode |
| Context length (trained) | 4096 (architectural cap is 8192, inherited from base; 8K inference is unverified for this checkpoint: the slot-router state machine should extend, but no measurement exists) |
| Precision | bfloat16 weights |
| Tokenizer | NemotronH base tokenizer extended with the 12,290 new tokens (12,288 SNAC codes + 2 markers) |

The SNAC frame layout is [C, M, F, F, M, F, F]: 7 tokens per coarse frame, one coarse (C) → two mid (M) → four fine (F), matching SNAC-24kHz's 1:2:4 residual-quantizer hierarchy.
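
The frame layout and the marker behaviour described above can be checked on a token sequence with a small validator. This is a sketch, not the router's implementation: the function name is hypothetical, the ids are the ones listed in this card, and it assumes [/SNAC] is only legal at a frame boundary, which the card does not state.

```python
SNAC_OPEN, SNAC_CLOSE = 100, 101
TEXT_VOCAB = 131072
RANGE_START = {"C": 131072, "M": 135168, "F": 139264}
RANGE_SIZE = 4096
SLOT_PATTERN = ["C", "M", "F", "F", "M", "F", "F"]


def is_structurally_valid(token_ids):
    """Walk the sequence, tracking (in_audio, slot) as a small state machine."""
    in_audio, slot = False, 0
    for tok in token_ids:
        if not in_audio:
            if tok >= TEXT_VOCAB:
                return False              # SNAC code while in text mode
            if tok == SNAC_OPEN:
                in_audio, slot = True, 0  # marker flips into audio mode
        elif tok == SNAC_CLOSE:
            if slot != 0:
                return False              # assumption: close only at a frame boundary
            in_audio = False              # marker flips back to text mode
        else:
            lo = RANGE_START[SLOT_PATTERN[slot]]
            if not lo <= tok < lo + RANGE_SIZE:
                return False              # wrong codebook for this slot
            slot = (slot + 1) % len(SLOT_PATTERN)
    return True
```

A check like this is a cheap guard to run on generated ids before subtracting the range offsets and handing the codes to the codec.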

Training summary

| Field | Value |
|---|---|
| Wall-clock | 12 days (2026-05-08 → 2026-05-21) |
| Iterations | 75,000 (warmup 457 linear → cosine decay → min_lr) |
| Global batch size | 8 (mbs=1 × 8-step gradient accumulation, dp=1) |
| Sequence length | 4096 |
| Tokens consumed | ~2.46 B (75k × 8 × 4096) |
| Single-pass | Yes: 600,000 of the bin's 601,910 unique samples; no epoch wrap; no overfitting by construction |
| Optimizer | Adam (β₁=0.9, β₂=0.95, ε=1e-8), weight_decay=0.1, grad-clip 1.0 |
| Peak LR | 5e-5 (backbone) / 5e-4 (decoupled: embedding + lm_head) |
| Min LR | 5e-7 / 5e-6 |
| Hardware | 2× NVIDIA DGX Spark (GB10, sm_121, 128 GB unified each), FSDP ZeRO-3 sharded across both nodes, RoCE interconnect |
| Framework | Megatron-Bridge 0.3.1 + Megatron-Core 0.16.1; resumed across 3 platform-level interruptions |
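
The learning-rate schedule in the summary (457 linear warmup iterations, then cosine decay to min_lr over 75,000 iterations) can be sketched as a function of the iteration. This is an illustration under stated assumptions, not the Megatron-Bridge scheduler: the function name is hypothetical, and it assumes warmup starts from zero and the cosine decay spans exactly the remaining iterations.

```python
import math

WARMUP_ITERS = 457
TOTAL_ITERS = 75_000


def lr_at(iteration, peak_lr, min_lr):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if iteration < WARMUP_ITERS:
        return peak_lr * iteration / WARMUP_ITERS        # assumption: warmup starts at 0
    progress = (iteration - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 at the peak, 0 at the end
    return min_lr + (peak_lr - min_lr) * cosine


# Backbone and decoupled (embedding + lm_head) groups share the same shape.
for it in (0, 457, 37_500, 75_000):
    print(it, lr_at(it, 5e-5, 5e-7), lr_at(it, 5e-4, 5e-6))
```

The decoupled group stays at 10× the backbone rate at the peak and at the floor, matching the Peak LR and Min LR rows.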

Final validation losses (iter 75,000)

Held-out per-source dev splits, blended at the same per-phase weights as training:

| Metric | Loss | Perplexity |
|---|---|---|
| lm (weighted across all positions) | 3.944 | 51.6 |
| loss/range_C (SNAC coarse codes) | 4.190 | 66.0 |
| loss/range_M (SNAC mid codes) | 4.509 | 90.9 |
| loss/range_F (SNAC fine codes) | 4.840 | 126.5 |
| loss/text (FineWeb rehearsal) | 2.305 | 10.03 |
| loss/[SNAC] (open marker) | 0.440 | 1.55 |
| loss/[/SNAC] (close marker) | 0.020 | 1.02 |

The coarse-to-fine ordering C < M < F is preserved at every eval over the run, consistent with SNAC's residual hierarchy. Text PPL ~10 is approximately base NemotronH-Base quality (the 30% text rehearsal anchor held throughout): adding the augmented vocabulary did not cause catastrophic forgetting of the base model's text capability.

For context, the random-baseline PPL over each 4096-token SNAC range is 4096; the iter-0 augmented-baseline PPL was ~7200 (above random, by roughly 1.8×, because the newly added rows perturbed the softmax). Final range_C PPL 66 ≈ 62× better than random on coarse codes.
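
The perplexities and the random-baseline comparison follow from the losses by exponentiation. The sketch below redoes that arithmetic, assuming the reported losses are in nats (natural log); differences in the last digit against the reported perplexities come from the losses being rounded to three decimals.

```python
import math

# Final validation losses (nats) as reported for iter 75,000.
losses = {
    "lm": 3.944,
    "range_C": 4.190,
    "range_M": 4.509,
    "range_F": 4.840,
    "text": 2.305,
}

perplexity = {name: math.exp(loss) for name, loss in losses.items()}

RANGE_SIZE = 4096                        # uniform guessing over one SNAC range
random_loss = math.log(RANGE_SIZE)       # loss of that uniform guess, in nats
gain_over_random = RANGE_SIZE / perplexity["range_C"]

for name, ppl in perplexity.items():
    print(f"{name:8s} loss={losses[name]:.3f}  ppl={ppl:.1f}")
print(f"random baseline loss={random_loss:.2f}, range_C gain={gain_over_random:.0f}x")
```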

Data composition

Bin total: 601,910 samples × 4096 tokens = 2.465 B tokens. 44 sources pooled across 10 languages, blended via per-source disjoint phase windows plus without-replacement sampling (by construction, no document is seen by more than one phase or twice within a phase).
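
As a consistency check, the bin size and the single-pass claim follow from the numbers in the training summary (75,000 iterations, global batch 8, sequence length 4096). The sketch below only redoes that arithmetic; the variable names are illustrative, and it assumes every sample is exactly 4096 tokens.

```python
SEQ_LEN = 4096
BIN_SAMPLES = 601_910
ITERATIONS = 75_000
GLOBAL_BATCH = 8

bin_tokens = BIN_SAMPLES * SEQ_LEN            # 2,465,423,360
samples_seen = ITERATIONS * GLOBAL_BATCH      # 600,000
tokens_seen = samples_seen * SEQ_LEN          # 2,457,600,000
unused_samples = BIN_SAMPLES - samples_seen   # 1,910

print(f"bin:      {bin_tokens / 1e9:.3f} B tokens")
print(f"consumed: {tokens_seen / 1e9:.3f} B tokens "
      f"({samples_seen / BIN_SAMPLES:.2%} of the bin)")
print(f"unused:   {unused_samples} samples")
```

Because fewer samples are consumed than the bin holds, the run never wraps into a second epoch.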

Realized per-phase quality mix (% of phase, train split):

| Phase | Iters | text | bulk-noisy audio | clean-read | studio |
|---|---|---|---|---|---|
| 1: Broad foundation | 0 – 34,041 | 25.5% | 61.9% | 8.7% | 3.9% |
| 2: Diversity → quality | 34,041 – 58,271 | 22.7% | 52.1% | 16.9% | 8.3% |
| 3: Studio + speaker balance | 58,271 – 70,752 | 16.5% | 30.8% | 30.2% | 22.5% |
| 4: Anneal | 70,752 – 75,000 | 10.2% | 14.3% | 43.8% | 31.7% |
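
The phase windows can be turned into iteration and sample counts with a few lines of arithmetic. This is a sketch over the table's own numbers: it assumes a global batch of 8 samples per iteration (from the training summary) and, for the run-level text share, that each phase's mix is uniform within the phase and is measured in samples.

```python
GLOBAL_BATCH = 8

# (name, start_iter, end_iter, text share of the phase) from the phase table.
phases = [
    ("1 Broad foundation",         0,      34_041, 0.255),
    ("2 Diversity -> quality",     34_041, 58_271, 0.227),
    ("3 Studio + speaker balance", 58_271, 70_752, 0.165),
    ("4 Anneal",                   70_752, 75_000, 0.102),
]

total_iters = 0
text_weighted = 0.0
for name, start, end, text_share in phases:
    iters = end - start
    total_iters += iters
    text_weighted += iters * text_share
    print(f"{name:28s} {iters:6d} iters  {iters * GLOBAL_BATCH:7d} samples")

# Run-level text share under the uniform-within-phase assumption.
print(f"overall text share: {text_weighted / total_iters:.1%}")
```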

Source corpora (all encoded into Megatron .bin/.idx format via the project's format-snac → format-phases pipeline):

Intended use

  • Starting point for a stage-2 TTS finetune on parallel text → SNAC data. The cross-modal bridge is not present in this checkpoint; supervised finetuning on text-aligned SNAC sequences is what lights it up.
  • Multilingual SNAC perplexity benchmarking across the 10 NemotronH-supported languages.
  • Acoustic embedding extraction: pool residual stream activations over a SNAC sequence for downstream classification (language ID, speaker family, audio quality scoring).
  • Audio-only continuation / infilling: given partial SNAC, generate plausible continuation. Distribution-in, distribution-out.
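
For the audio-only continuation use case, known SNAC codes have to be flattened back into the C, M, F, F, M, F, F token order before they can serve as a prompt. The sketch below is the inverse of the frame parsing in the end-to-end example. The helper names are hypothetical, the codes are assumed to be plain Python integer lists in the codec's 1:2:4 layout, and the BOS id is passed in by the caller rather than assumed.

```python
SNAC_OPEN = 100
C_BASE, M_BASE, F_BASE = 131072, 135168, 139264


def interleave_frames(c_codes, m_codes, f_codes):
    """Flatten per-codebook indices into C,M,F,F,M,F,F token ids (hypothetical helper)."""
    n = len(c_codes)
    if len(m_codes) != 2 * n or len(f_codes) != 4 * n:
        raise ValueError("expected 1:2:4 code counts for coarse:mid:fine")
    tokens = []
    for i in range(n):
        m0, m1 = m_codes[2 * i: 2 * i + 2]
        f0, f1, f2, f3 = f_codes[4 * i: 4 * i + 4]
        tokens += [
            C_BASE + c_codes[i],
            M_BASE + m0, F_BASE + f0, F_BASE + f1,
            M_BASE + m1, F_BASE + f2, F_BASE + f3,
        ]
    return tokens


def continuation_prompt(bos_id, c_codes, m_codes, f_codes):
    """BOS + [SNAC] + the known frames; generation would continue from here."""
    return [bos_id, SNAC_OPEN] + interleave_frames(c_codes, m_codes, f_codes)
```

The resulting list can be wrapped in a tensor and passed to model.generate in place of the two-token prompt used in the unconditional example.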

Out of scope and limitations

  • Not a TTS system. Stage-1 mixed text-only and SNAC-only documents. There is no learned bridge between text and audio in this checkpoint. Prompting it with text and expecting SNAC output (or vice versa) will not work cleanly without a stage-2 finetune.
  • No speaker conditioning. Speaker tokens / voice control are deferred to the downstream TTS finetune by design.
  • 4K context, not 8K. Architecturally the base supports 8K; the augmented model's slot router was never exercised on sequences > 4K. Use 4K and below for now.
  • Languages outside NemotronH's supported 10 were dropped during data design; do not expect quality on e.g. Polish, Indonesian, Vietnamese, Thai, Arabic.
  • HiFi-TTS was in the training mix. If your downstream evaluation uses HiFi-TTS speakers as "held-out studio voices," this prior has already seen them, so the strict version of the held-out-voice gate cannot be measured on this checkpoint. (Future Calliope stage-1 versions hold HiFi-TTS out entirely; see the v3 plan link below.)
  • Convergence ceiling. The loss plateaued well above the original optimistic target (lm 3.0-3.5 hoped, lm 3.94 reached). The conservative forecast fit at iter-20k called the final values to within 0.02 nats: the model is on its forecast trajectory, just lower-quality than initial intuition suggested. The shortfall was diagnostically traced to the LR/optimizer regime (high decoupled_lr perturbing already-converged text rows, body-learning bottleneck), not to data quantity. A v3 design is planned that addresses these (WSD LR schedule, lower decoupled_lr, mean-init confirmed already in use, marginal data blending).

Format details

This repository ships the model in standard HuggingFace safetensors format. Files:

| File | Purpose |
|---|---|
| model.safetensors (~17 GB) | All weights, bfloat16 |
| config.json | NemotronH config + vocab_size: 143360 |
| configuration_nemotron_h.py | Config class (base NemotronH) |
| modeling_nemotron_h.py | Base NemotronH modeling (vendored to avoid transformers version drift) |
| modeling_nemotron_h_augmented.py | NemotronHAugmentedForCausalLM: wrapper that reads augmented.yaml at __init__ and applies the slot-router logits mask in forward |
| augmented.yaml | Slot router + range definitions; read by the augmented modeling class at load time |
| tokenizer.json, tokenizer_config.json | Augmented tokenizer |
| generation_config.json | Default decoding params |
| __init__.py | Empty, so the dir is a valid Python package for trust_remote_code |

The Megatron-Bridge FSDP DCP format of the same trained weights is at the sibling private repo zeroae/calliope-snac-4b-base-4k.megatron; use that one if you want to continue pretraining via Bridge.

Provenance & references

License

This checkpoint is a derivative of nvidia/Nemotron-H-4B-Base-8K. The base model's license terms apply to redistribution and use of these weights. Refer to the linked base model for the authoritative license; this repo does not extend or restrict those terms.

The augmented modeling code (modeling_nemotron_h_augmented.py, augmented.yaml, augmentation specs) is © Zero A.E., LLC and licensed for research use under terms TBD; please contact the org before commercial use.
