psodre committed on
Commit d8b0850 · verified · 1 Parent(s): a00d1f8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +6 -7
README.md CHANGED
@@ -91,18 +91,17 @@ N_FRAMES = 50 # ~4 s at SNAC-24kHz's coarse
  N_TOKENS = N_FRAMES * 7 # 7 tokens / frame (C,M,F,F,M,F,F)

  # --- 2. Generate inside an [SNAC] ... span ------------------------------
- # IMPORTANT: use_cache=False is required. The slot router (in
- # modeling_nemotron_h_augmented.py) re-derives state from the full
- # input_ids on every forward; with the default KV-cached generate
- # (which only re-feeds the new token), routing state would reset to
- # 0 each step and the C/M/F pattern would not be enforced.
  prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
  with torch.no_grad():
      out = model.generate(
          prompt,
          max_new_tokens=N_TOKENS,
          do_sample=True, temperature=0.8, top_p=0.95,
-         use_cache=False, # see note above
      )

  # --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
@@ -140,7 +139,7 @@ print(f"saved {audio.shape[-1] / 24000:.2f} s of audio "
  f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")
  ```

- **Dependencies**: `pip install snac torchaudio` in addition to `transformers torch`. Total wall-clock for 50 frames (~4 s of audio): roughly 30–60 s on one GB10 with `use_cache=False`.

  **Token-budget rule of thumb**: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate `max_new_tokens` for a given duration:

 
  N_TOKENS = N_FRAMES * 7 # 7 tokens / frame (C,M,F,F,M,F,F)

  # --- 2. Generate inside an [SNAC] ... span ------------------------------
+ # The slot router (modeling_nemotron_h_augmented.py) carries its
+ # (in_slot_mode, slot_counter) state across forward calls via
+ # self._slot_router_state, so KV caching just works: prefill computes
+ # routing from initial state, subsequent forwards advance from the
+ # cached final state. No special flags needed.
  prompt = torch.tensor([[tok.bos_token_id, SNAC_OPEN]], device="cuda")
  with torch.no_grad():
      out = model.generate(
          prompt,
          max_new_tokens=N_TOKENS,
          do_sample=True, temperature=0.8, top_p=0.95,
      )

  # --- 3. Parse the C/M/F/F/M/F/F frames back into codebook indices --------
 
  f"({len(c_codes)} frames, {len(c_codes) + len(m_codes) + len(f_codes)} codes)")
  ```

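The step 2 comment in this diff contrasts a router that re-derives its state from the full `input_ids` with one that carries `(in_slot_mode, slot_counter)` between forward calls. The toy sketch below illustrates why that matters under KV caching, where each decoding step sees only the newest token. It is not the code in `modeling_nemotron_h_augmented.py`: `ToyRouter`, `route_incrementally`, the token ids, and the "label the head for the next token" convention are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical token ids and slot pattern, chosen only for this illustration.
SNAC_OPEN, SNAC_CLOSE = 1000, 1001
PATTERN = ("C", "M", "F", "F", "M", "F", "F")  # 7 slots per frame


@dataclass
class ToyRouter:
    """Toy stand-in for a slot router; not the implementation in the repo."""
    in_slot_mode: bool = False
    slot_counter: int = 0

    def route(self, token_ids):
        """Return, per input token, the head that should predict the next token.

        The state on `self` is advanced and kept after the call returns.
        """
        labels = []
        for tok in token_ids:
            if tok == SNAC_OPEN:
                self.in_slot_mode, self.slot_counter = True, 0
            elif tok == SNAC_CLOSE:
                self.in_slot_mode, self.slot_counter = False, 0
            elif self.in_slot_mode:
                self.slot_counter = (self.slot_counter + 1) % len(PATTERN)
            labels.append(PATTERN[self.slot_counter] if self.in_slot_mode else "text")
        return labels


def route_incrementally(seq, prefill_len, carry_state):
    """Mimic cached decoding: one prefill call, then one token per call."""
    router = ToyRouter()
    labels = router.route(seq[:prefill_len])
    for tok in seq[prefill_len:]:
        if not carry_state:
            router = ToyRouter()  # state is lost between forward calls
        labels += router.route([tok])
    return labels


seq = [0, SNAC_OPEN, 5, 6, 7, 8]  # 0 plays the role of BOS
full = ToyRouter().route(seq)      # whole sequence in a single forward
cached = route_incrementally(seq, prefill_len=2, carry_state=True)
uncached = route_incrementally(seq, prefill_len=2, carry_state=False)

print(full)      # ['text', 'C', 'M', 'F', 'F', 'M']
print(cached)    # same labels as `full`
print(uncached)  # ['text', 'C', 'text', 'text', 'text', 'text']
```

With the state carried over, token-by-token routing matches the single full-sequence pass; with the state reset on every call, the pattern is lost after the prefill.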
+ **Dependencies**: `pip install snac torchaudio` in addition to `transformers torch`. Wall-clock for 50 frames (~4 s of audio): a few seconds on a GB10 with KV caching on (the default).

  **Token-budget rule of thumb**: SNAC-24kHz's coarse rate is ~12 Hz, so one frame ≈ 83 ms of audio. To pre-allocate `max_new_tokens` for a given duration:
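As a minimal sketch of that rule of thumb, the arithmetic can be wrapped in a small helper. The function name `max_new_tokens_for` is hypothetical. The 12 Hz rate and the 7 tokens per frame are taken from the text above. The sketch rounds up to whole frames and does not reserve room for any closing token, which is an assumption.

```python
import math

COARSE_RATE_HZ = 12    # approximate SNAC-24kHz coarse frame rate (from the text)
TOKENS_PER_FRAME = 7   # C,M,F,F,M,F,F


def max_new_tokens_for(duration_s: float) -> int:
    """Round the duration up to whole frames, then expand frames to tokens."""
    n_frames = math.ceil(duration_s * COARSE_RATE_HZ)
    return n_frames * TOKENS_PER_FRAME


print(max_new_tokens_for(4.0))  # 48 frames * 7 = 336
print(max_new_tokens_for(0.5))  # 6 frames * 7 = 42
```

Rounding up rather than down means the generated audio is never shorter than requested, at the cost of at most one extra frame.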