psodre committed 12 days ago

Commit d8b0850 (verified) · 1 parent: a00d1f8

Upload README.md with huggingface_hub

Files changed (1)

README.md +6 -7

@@ -91,18 +91,17 @@ N_FRAMES = 50                                      # ~4 s at SNAC-24kHz's coarse
 N_TOKENS = N_FRAMES * 7                            # 7 tokens / frame (C,M,F,F,M,F,F)
 # --- 2. Generate inside an [SNAC] ... span ------------------------------
-# IMPORTANT: use_cache=False is required. The slot router (in
-# modeling_nemotron_h_augmented.py) re-derives state from the full
-# input_ids on every forward; with the default KV-cached generate
-# (which only re-feeds the new token), routing state would reset to
-# 0 each step and the C/M/F pattern would not be enforced.
 prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
 with torch.no_grad():
     out = model.generate(
         prompt,
         max_new_tokens=N_TOKENS,
         do_sample=True, temperature=0.8, top_p=0.95,
-        use_cache=False,                            # see note above
     )
 # --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
@@ -140,7 +139,7 @@ print(f"saved {audio.shape[-1] / 24000:.2f} s of audio  "
       f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")
 ```
-**Dependencies**: `pip install snac torchaudio` in addition to `transformers torch`. Total wall-clock for 50 frames (~4 s of audio): roughly 30 – 60 s on one GB10 with `use_cache=False`.
 **Token-budget rule of thumb**: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate `max_new_tokens` for a given duration:
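
Note on the routing change: the comment removed in this commit says the router re-derived its state from the full `input_ids`, while its replacement says the `(in_slot_mode, slot_counter)` pair is carried across forward calls. The toy sketch below contrasts the two behaviours. It is not the code in `modeling_nemotron_h_augmented.py`: `RouterState`, `route`, and the marker ids are hypothetical stand-ins, and only the C,M,F,F,M,F,F slot order and the state pair are taken from the README.

```python
from dataclasses import dataclass

PATTERN = "CMFFMFF"              # slot order within one frame (from the README)
SNAC_OPEN, SNAC_CLOSE = -1, -2   # hypothetical marker ids, for illustration only


@dataclass
class RouterState:
    """Hypothetical stand-in for the state pair named in the README comment."""
    in_slot_mode: bool = False
    slot_counter: int = 0


def route(tokens, state):
    """Label each token with its slot, starting from `state`.

    Returns (labels, new_state). "-" marks a token outside an audio span.
    """
    in_slot, counter = state.in_slot_mode, state.slot_counter
    labels = []
    for tok in tokens:
        if tok == SNAC_OPEN:
            in_slot, counter = True, 0
            labels.append("-")
        elif tok == SNAC_CLOSE:
            in_slot = False
            labels.append("-")
        elif in_slot:
            labels.append(PATTERN[counter % len(PATTERN)])
            counter += 1
        else:
            labels.append("-")
    return labels, RouterState(in_slot, counter)


seq = [0, SNAC_OPEN] + list(range(100, 110))   # BOS, open marker, 10 audio tokens

# Old behaviour without a cache: every forward sees the whole sequence.
full, _ = route(seq, RouterState())

# New behaviour with a cache: prefill once, then feed one token at a time
# while carrying the router state forward.
carried, state = route(seq[:2], RouterState())
for tok in seq[2:]:
    step, state = route([tok], state)
    carried += step

# Old router combined with a cache: state starts from scratch at each step,
# so the single new token is never seen as being inside the span.
reset = []
for tok in seq[2:]:
    step, _ = route([tok], RouterState())
    reset += step

print("".join(full[2:]))     # slots from the full sequence
print("".join(carried[2:]))  # same slots from incremental steps
print("".join(reset))        # pattern lost when state is not carried
```

In this toy model the carried-state path reproduces the full-sequence labels, which is the property the new comment relies on when it says KV caching needs no special flags.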

 N_TOKENS = N_FRAMES * 7                            # 7 tokens / frame (C,M,F,F,M,F,F)
 # --- 2. Generate inside an [SNAC] ... span ------------------------------
+# The slot router (modeling_nemotron_h_augmented.py) carries its
+# (in_slot_mode, slot_counter) state across forward calls via
+# self._slot_router_state, so KV caching just works: prefill computes
+# routing from initial state, subsequent forwards advance from the
+# cached final state. No special flags needed.
 prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
 with torch.no_grad():
     out = model.generate(
         prompt,
         max_new_tokens=N_TOKENS,
         do_sample=True, temperature=0.8, top_p=0.95,
     )
 # --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
       f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")
 ```
+**Dependencies**: `pip install snac torchaudio` in addition to `transformers torch`. Wall-clock for 50 frames (~4 s of audio): a few seconds on a GB10 with KV caching on (the default).
 **Token-budget rule of thumb**: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate `max_new_tokens` for a given duration:
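
The diff context stops before the README's own snippet. As a rough sketch of the arithmetic behind the rule of thumb, assuming the coarse rate is exactly 12 Hz (the README only says "~12 Hz") and using a hypothetical helper name `tokens_for_duration`:

```python
import math

COARSE_RATE_HZ = 12     # assumption: the README gives this only approximately
TOKENS_PER_FRAME = 7    # one frame is C,M,F,F,M,F,F


def tokens_for_duration(seconds: float) -> int:
    """Return a max_new_tokens budget covering at least `seconds` of audio."""
    n_frames = math.ceil(seconds * COARSE_RATE_HZ)   # round up to whole frames
    return n_frames * TOKENS_PER_FRAME


print(tokens_for_duration(4.0))   # 336 (48 frames)
```

Rounding up to whole frames keeps the budget a multiple of 7, so generation does not stop partway through a frame. The README's example uses 50 frames (350 tokens) for its "~4 s" clip, which is slightly more than this estimate.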