Hsopgamers committed on
Commit 176c235 · verified · 1 Parent(s): ffc7d09

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ audio_prompts/EARS[[:space:]]p004[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
+ audio_prompts/EARS[[:space:]]p005[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
+ audio_prompts/EARS[[:space:]]p028[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
+ audio_prompts/EARS[[:space:]]p036[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
+ audio_prompts/expresso_02_ex03-ex01_calm_005.mp3 filter=lfs diff=lfs merge=lfs -text
+ audio_prompts/freesound_demon_chant(use_forcespeaker).mp3 filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ .DS_Store
+ __pycache__/
+ *.py[cod]
+ .venv/
+ venv/
+ .env
+ .idea/
+ .vscode/
LICENSE ADDED
@@ -0,0 +1,22 @@
+ MIT License
+
+ Copyright (c) 2025 Jordan Darefsky
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
README.md CHANGED
@@ -1,12 +1,151 @@
  ---
- title: Echo Tts
- emoji: 🏆
- colorFrom: indigo
- colorTo: purple
+ title: echo-tts
+ app_file: gradio_app.py
  sdk: gradio
- sdk_version: 6.6.0
- app_file: app.py
- pinned: false
+ sdk_version: 5.49.1
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Echo-TTS
+
+ A multi-speaker text-to-speech model with speaker reference conditioning. See the [blog post](https://jordandarefsky.com/blog/2025/echo/) for technical details.
+
+ **Model:** [jordand/echo-tts-base](https://huggingface.co/jordand/echo-tts-base) | **Demo:** [echo-tts-preview](https://huggingface.co/spaces/jordand/echo-tts-preview)
+
+ This work was made possible by the TPU Research Cloud (TRC).
+
+ ## Responsible Use
+
+ Don't use this model to:
+ - Impersonate real people without their consent
+ - Generate deceptive audio (e.g., fraud, misinformation, deepfakes)
+
+ You are responsible for complying with local laws regarding biometric data and voice cloning.
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Requires Python 3.10+ and a CUDA-capable GPU with at least 8GB VRAM.
+
+ ## Quick Start
+
+ ### Gradio UI
+
+ ```bash
+ python gradio_app.py
+ ```
+
+ ### Python API
+
+ ```python
+ from inference import (
+     load_model_from_hf,
+     load_fish_ae_from_hf,
+     load_pca_state_from_hf,
+     load_audio,
+     sample_pipeline,
+     sample_euler_cfg_independent_guidances,
+ )
+ from functools import partial
+ import torchaudio
+
+ # Load models (downloads from HuggingFace on first run)
+ model = load_model_from_hf(delete_blockwise_modules=True)
+ fish_ae = load_fish_ae_from_hf()
+ pca_state = load_pca_state_from_hf()
+
+ # Load speaker reference (or set to None for no reference)
+ speaker_audio = load_audio("speaker.wav").cuda()
+
+ # Configure sampler
+ sample_fn = partial(
+     sample_euler_cfg_independent_guidances,
+     num_steps=40,
+     cfg_scale_text=3.0,
+     cfg_scale_speaker=8.0,
+     cfg_min_t=0.5,
+     cfg_max_t=1.0,
+     truncation_factor=None,
+     rescale_k=None,
+     rescale_sigma=None,
+     speaker_kv_scale=None,
+     speaker_kv_max_layers=None,
+     speaker_kv_min_t=None,
+     sequence_length=640,  # ~30 seconds
+ )
+
+ # Generate
+ text = "[S1] Hello, this is a test of the Echo TTS model."
+ audio_out, _ = sample_pipeline(
+     model=model,
+     fish_ae=fish_ae,
+     pca_state=pca_state,
+     sample_fn=sample_fn,
+     text_prompt=text,
+     speaker_audio=speaker_audio,
+     rng_seed=0,
+ )
+
+ torchaudio.save("output.wav", audio_out[0].cpu(), 44100)
+ ```
+
+ See also:
+ - `inference.py` -- lower-level usage example at the bottom of the file
+ - `inference_blockwise.py` -- examples of blockwise/continuation generation
+
+ ## Low VRAM (8GB)
+
+ In `gradio_app.py`, adjust:
+
+ ```python
+ FISH_AE_DTYPE = torch.bfloat16        # instead of float32
+ DEFAULT_SAMPLE_LATENT_LENGTH = 576    # instead of 640 (or lower, depending on what fits)
+ ```
+
+ ## Tips
+
+ ### Generation Length
+
+ Echo is trained to generate up to 30 seconds of audio (640 latents) given text and reference audio. Because the text supplied during training always corresponded to at most 30 seconds of audio, at inference the model will try to fit any text prompt into the 30 seconds of generated audio (so, e.g., long text prompts may result in faster speaking rates). Shorter text prompts work fine and produce shorter outputs, as the model generates latent padding automatically.
+
+ If "Sample Latent Length" (under Custom Shapes in the Gradio UI) / `sequence_length` is set to less than 640, the model will attempt to generate the prefix corresponding to that length. For example, if you set it to 320 and supply ~30 seconds' worth of text, the model will likely generate the first half of the text rather than try to fit all of it into the first 15 seconds.
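The 640-latents-per-30-seconds ratio above makes it easy to pick a `sequence_length` for a target duration. A small illustrative helper (the name `seconds_to_latents` is ours, not part of the repo):

```python
LATENTS_PER_SECOND = 640 / 30  # ~21.3 latents per second of audio

def seconds_to_latents(seconds: float, max_latents: int = 640) -> int:
    """Rough sequence_length for a target duration, capped at the 30 s maximum."""
    return min(max_latents, round(seconds * LATENTS_PER_SECOND))

print(seconds_to_latents(15))  # -> 320 (half the maximum)
print(seconds_to_latents(45))  # -> 640 (clamped to the 30 s cap)
```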
+
+ ### Reference Audio
+
+ You can condition on up to 5 minutes of reference audio, but shorter clips (e.g., 10 seconds or less) also work well.
+
+ ### Force Speaker (KV Scaling)
+
+ Sometimes out-of-distribution text for a given reference speaker will cause the model to generate a different speaker entirely. Enabling "Force Speaker" (which scales the speaker KV activations for a portion of timesteps; default scale 1.5) generally fixes this, but high values may introduce artifacts or "overconditioning." Aim for the lowest scale that produces the correct speaker: 1.0 is the baseline, 1.5 is the default when enabled and will usually force the speaker, and lower values (e.g., 1.3 or 1.1) may suffice.
+
+ ### Text Prompt Format
+
+ Text prompts use the format from [WhisperD](https://huggingface.co/jordand/whisper-d-v1a). By default, colons, semicolons, and em-dashes are normalized to commas (see `tokenizer_encode` in `inference.py`), and "[S1] " is prepended to the prompt if not already present. Commas generally function as pauses. Exclamation points (and other non-bland punctuation) may increase expressiveness but can occasionally lower quality; improving controllability is an important direction for future work.
+
+ The included text presets are stylistically in-distribution with the WhisperD transcription style.
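The normalization described above can be sketched in a few lines. This is a hypothetical helper for illustration only; the actual logic lives in `tokenizer_encode` in `inference.py` and may differ in details:

```python
def normalize_prompt(text: str) -> str:
    # Normalize colons, semicolons, and em-dashes to commas.
    for ch in (":", ";", "\u2014"):
        text = text.replace(ch, ",")
    # Prepend the speaker tag if it is not already present.
    if not text.startswith("[S1] "):
        text = "[S1] " + text
    return text

print(normalize_prompt("Hello world; this is a test: with pauses"))
# -> [S1] Hello world, this is a test, with pauses
```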
+
+ ### Blockwise Generation
+
+ `inference_blockwise.py` includes blockwise sampling, which allows generating audio in smaller blocks as well as producing continuations of existing audio (where the prefix and continuation are up to 30 seconds combined). The model released on HF is a fully fine-tuned model (not the LoRA described in the blog post). Because the S1-DAC decoder is causal, blockwise generation enables audio streaming (not included in the current code). Blockwise functionality hasn't been thoroughly tested and may benefit from different (e.g., smaller) CFG scales.
+
+ ## License
+
+ Code in this repo is MIT-licensed except where file headers specify otherwise (e.g., `autoencoder.py` is Apache-2.0).
+
+ Regardless of the model license, audio outputs are CC-BY-NC-SA-4.0 due to the dependency on the Fish Speech S1-DAC autoencoder, which is itself CC-BY-NC-SA-4.0.
+
+ We have chosen to release the Echo-TTS weights under CC-BY-NC-SA-4.0.
+
+ For the included audio prompts, see `audio_prompts/LICENSE`.
+
+ ## Citation
+
+ ```bibtex
+ @misc{darefsky2025echo,
+   author = {Darefsky, Jordan},
+   title  = {Echo-TTS},
+   year   = {2025},
+   url    = {https://jordandarefsky.com/blog/2025/echo/}
+ }
+ ```
audio_prompts/EARS p004 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68947a209bc11064f749ca0a61b7959243df83565a0e462b87dfc0ffe03aa7b0
+ size 1526439
audio_prompts/EARS p005 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:07344d073eb3e22c249ebfe15f31f4ba63fd9f17c71aeee93da199ff3b53fc45
+ size 1351147
audio_prompts/EARS p028 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8351eed5982f1fb5763a475c0fb69dba98a4bb49b0f2bbab12b978ff2b0fedeb
+ size 1211565
audio_prompts/EARS p036 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ce77dbb86ea7c29edf2b9804ce9c9315334e9cfeef532dc0c50898a09bae1583
+ size 1227585
audio_prompts/LICENSE ADDED
@@ -0,0 +1,26 @@
+ The audio files in this folder are provided for demonstration purposes and
+ are sourced from the following datasets. Please refer to their original
+ licenses for terms of use.
+
+ EARS Dataset (CC-BY-NC-4.0)
+ ---------------------------
+ - EARS p004 freeform.mp3
+ - EARS p005 freeform.mp3
+ - EARS p028 freeform.mp3
+ - EARS p036 freeform.mp3
+
+ Source: https://github.com/facebookresearch/ears_dataset
+
+ Expresso Dataset (CC-BY-NC-4.0)
+ -------------------------------
+ - expresso_02_ex03-ex01_calm_005.mp3
+
+ Source: https://speechbot.github.io/expresso/
+
+ Freesound (CC0)
+ ---------------
+ - freesound_demon_chant(use_forcespeaker).mp3
+
+ Source: https://freesound.org/s/419507/
+ Author: DylanTheFish
+
audio_prompts/expresso_02_ex03-ex01_calm_005.mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:98855b5b8b6c265a643edeb23ce5cd772391cb90754822e2a0370ea5188225f5
+ size 4802350
audio_prompts/freesound_demon_chant(use_forcespeaker).mp3 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:471f67fff5ea613ec4617b9822b1396da123a1133f199925436a2c40e5d1eb91
+ size 303438
autoencoder.py ADDED
@@ -0,0 +1,1225 @@
+ # SPDX-License-Identifier: Apache-2.0
+
+ # This file contains portions adapted from:
+ #   • Descript Audio Codec (DAC) — MIT License (full text appended below)
+ #   • Fish-Speech S1 DAC Autoencoder — reference implementation (Apache-2.0 / CC-BY-NC),
+ #     rewritten here in a single-file Torch module for interoperability and transparency.
+ #
+ # OVERALL LICENSE (this file): Apache-2.0, except where explicitly marked:
+ #   # SPDX-License-Identifier: MIT
+ # Keep these notices and the embedded MIT text if you redistribute this file.
+
+ # NOTE
+ # Self-contained autoencoder implementation of Fish-S1-DAC (inlining DAC code to avoid dependencies).
+ # Code in this module has been largely copy-and-pasted from the Fish-S1-DAC and DAC repositories,
+ # and refactored with help from ChatGPT/Claude (these models also helped with licensing).
+ # Thus, it differs stylistically from the rest of the codebase (and is likely internally inconsistent as well).
+
+ from __future__ import annotations
+
+ import math
+ from dataclasses import dataclass
+ from typing import List, Optional, Tuple, Union
+
+ import numpy as np
+ import torch
+ from torch import Tensor, nn
+ from torch.nn import functional as F
+ from torch.nn.utils.parametrizations import weight_norm
+ from torch.nn.utils.parametrize import remove_parametrizations
+
+ from einops import rearrange
+
+
+ # --------------------------------------------------------------------
+ # Shared helpers
+ # --------------------------------------------------------------------
+
+ def find_multiple(n: int, k: int) -> int:
+     return n if n % k == 0 else n + k - (n % k)
+
+ def unpad1d(x: Tensor, paddings: Tuple[int, int]) -> Tensor:
+     """Remove padding from x, handling properly zero padding. Only for 1d!"""
+     padding_left, padding_right = paddings
+     assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
+     assert (padding_left + padding_right) <= x.shape[-1]
+     end = x.shape[-1] - padding_right
+     return x[..., padding_left:end]
+
+ def get_extra_padding_for_conv1d(
+     x: Tensor, kernel_size: int, stride: int, padding_total: int = 0
+ ) -> int:
+     """See pad_for_conv1d; enough right pad so striding evenly covers length."""
+     length = x.shape[-1]
+     n_frames = (length - kernel_size + padding_total) / stride + 1
+     ideal_length = (math.ceil(n_frames) - 1) * stride + (kernel_size - padding_total)
+     return ideal_length - length
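The arithmetic in `get_extra_padding_for_conv1d` can be sanity-checked without tensors. This sketch re-implements it for a plain integer length (the function name `extra_padding` is ours):

```python
import math

def extra_padding(length: int, kernel_size: int, stride: int, padding_total: int = 0) -> int:
    # Mirrors get_extra_padding_for_conv1d, but on a plain length instead of a tensor.
    n_frames = (length - kernel_size + padding_total) / stride + 1
    ideal_length = (math.ceil(n_frames) - 1) * stride + (kernel_size - padding_total)
    return ideal_length - length

# With length 10, kernel 4, stride 3: (10 - 4) / 3 + 1 = 3 full frames, so no extra pad.
print(extra_padding(10, 4, 3))  # -> 0
# With length 11 the last window would be incomplete, so 2 samples of right-pad are added.
print(extra_padding(11, 4, 3))  # -> 2
```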
+
+ def pad1d(
+     x: Tensor,
+     paddings: Tuple[int, int],
+     mode: str = "zeros",
+     value: float = 0.0,
+ ) -> Tensor:
+     """
+     Reflect-safe 1D pad: if reflect would underflow on small inputs, insert
+     temporary right zero-pad before reflecting.
+     """
+     length = x.shape[-1]
+     padding_left, padding_right = paddings
+     assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
+     if mode == "reflect":
+         max_pad = max(padding_left, padding_right)
+         extra_pad = 0
+         if length <= max_pad:
+             extra_pad = max_pad - length + 1
+             x = F.pad(x, (0, extra_pad))
+         padded = F.pad(x, (padding_left, padding_right), mode, value)
+         end = padded.shape[-1] - extra_pad
+         return padded[..., :end]
+     else:
+         return F.pad(x, (padding_left, padding_right), mode, value)
+
+
+ # --------------------------------------------------------------------
+ # DAC Layers (adapted) — MIT
+ # Original: https://github.com/descriptinc/descript-audio-codec/blob/main/dac/nn/layers.py
+ # SPDX-License-Identifier: MIT
+ # --------------------------------------------------------------------
+
+ def WNConv1d(*args, **kwargs):
+     return weight_norm(nn.Conv1d(*args, **kwargs))
+
+ def WNConvTranspose1d(*args, **kwargs):
+     return weight_norm(nn.ConvTranspose1d(*args, **kwargs))
+
+ @torch.jit.script
+ def snake(x: Tensor, alpha: Tensor) -> Tensor:
+     shape = x.shape
+     x = x.reshape(shape[0], shape[1], -1)
+     x = x + (alpha + 1e-9).reciprocal() * torch.sin(alpha * x).pow(2)
+     x = x.reshape(shape)
+     return x
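The snake activation above computes x + sin²(αx) / (α + 1e-9). A scalar re-implementation (ignoring the tensor reshapes, which only flatten trailing dimensions) shows it is nearly the identity wherever sin(αx) vanishes:

```python
import math

def snake_scalar(x: float, alpha: float) -> float:
    # Scalar version of the snake activation: x + sin^2(alpha * x) / (alpha + 1e-9)
    return x + (1.0 / (alpha + 1e-9)) * math.sin(alpha * x) ** 2

print(snake_scalar(0.0, 1.0))  # -> 0.0 (zero is a fixed point)
# At x = pi with alpha = 1, sin(pi) ~ 0, so the output is essentially x itself.
print(snake_scalar(math.pi, 1.0))
# At x = pi/2 with alpha = 1, sin^2 = 1, so roughly x + 1 is added.
print(snake_scalar(math.pi / 2, 1.0))
```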
+
+ class Snake1d(nn.Module):
+     def __init__(self, channels: int):
+         super().__init__()
+         self.alpha = nn.Parameter(torch.ones(1, channels, 1))
+
+     def forward(self, x: Tensor) -> Tensor:
+         return snake(x, self.alpha)
+
+ # --------------------------------------------------------------------
+ # DAC Vector Quantize (adapted) — MIT
+ # Original: https://github.com/descriptinc/descript-audio-codec/blob/main/dac/nn/quantize.py
+ # SPDX-License-Identifier: MIT
+ # --------------------------------------------------------------------
+
+ class VectorQuantize(nn.Module):
+     """
+     VQ with factorized, l2-normalized codes (ViT-VQGAN style).
+     I/O in (B, D, T).
+     """
+     def __init__(self, input_dim: int, codebook_size: int, codebook_dim: int):
+         super().__init__()
+         self.codebook_size = codebook_size
+         self.codebook_dim = codebook_dim
+         self.in_proj = WNConv1d(input_dim, codebook_dim, kernel_size=1)
+         self.out_proj = WNConv1d(codebook_dim, input_dim, kernel_size=1)
+         self.codebook = nn.Embedding(codebook_size, codebook_dim)
+
+     def forward(self, z: Tensor):
+         z_e = self.in_proj(z)  # (B, D, T)
+         z_q, indices = self.decode_latents(z_e)
+         commitment_loss = F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
+         codebook_loss = F.mse_loss(z_q, z_e.detach(), reduction="none").mean([1, 2])
+         z_q = z_e + (z_q - z_e).detach()  # straight-through
+         z_q = self.out_proj(z_q)
+         return z_q, commitment_loss, codebook_loss, indices, z_e
+
+     def embed_code(self, embed_id: Tensor) -> Tensor:
+         return F.embedding(embed_id, self.codebook.weight)
+
+     def decode_code(self, embed_id: Tensor) -> Tensor:
+         return self.embed_code(embed_id).transpose(1, 2)
+
+     def decode_latents(self, latents: Tensor) -> Tuple[Tensor, Tensor]:
+         encodings = rearrange(latents, "b d t -> (b t) d")
+         codebook = self.codebook.weight
+         encodings = F.normalize(encodings)
+         codebook = F.normalize(codebook)
+         dist = (
+             encodings.pow(2).sum(1, keepdim=True)
+             - 2 * encodings @ codebook.t()
+             + codebook.pow(2).sum(1, keepdim=True).t()
+         )
+         indices = rearrange((-dist).max(1)[1], "(b t) -> b t", b=latents.size(0))
+         z_q = self.decode_code(indices)
+         return z_q, indices
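`decode_latents` above picks, for each timestep, the codebook entry with the smallest squared L2 distance after both sides are unit-normalized, which is equivalent to maximum cosine similarity. A small pure-Python sketch of that lookup (illustrative only, not the tensorized implementation):

```python
import math

def nearest_code(vec, codebook):
    # Unit-normalize the query and each code, then pick the code with the
    # smallest squared L2 distance (equivalently, the largest cosine similarity).
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]
    q = normalize(vec)
    best, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        c = normalize(code)
        dist = sum((a - b) ** 2 for a, b in zip(q, c))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best

codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_code([0.9, 0.1], codebook))  # -> 0 (closest in direction, not magnitude)
print(nearest_code([0.2, 5.0], codebook))  # -> 1
```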
+
+
+ class ResidualVectorQuantize(nn.Module):
+     """SoundStream-style residual VQ stack."""
+     def __init__(
+         self,
+         input_dim: int = 512,
+         n_codebooks: int = 9,
+         codebook_size: int = 1024,
+         codebook_dim: Union[int, List[int]] = 8,
+         quantizer_dropout: float = 0.0,
+     ):
+         super().__init__()
+         if isinstance(codebook_dim, int):
+             codebook_dim = [codebook_dim for _ in range(n_codebooks)]
+
+         self.n_codebooks = n_codebooks
+         self.codebook_dim = codebook_dim
+         self.codebook_size = codebook_size
+
+         self.quantizers = nn.ModuleList([
+             VectorQuantize(input_dim, codebook_size, codebook_dim[i])
+             for i in range(n_codebooks)
+         ])
+         self.quantizer_dropout = quantizer_dropout
+
+     def forward(self, z: Tensor, n_quantizers: Optional[int] = None):
+         z_q = 0
+         residual = z
+         commitment_loss = 0
+         codebook_loss = 0
+
+         codebook_indices = []
+         latents = []
+
+         if n_quantizers is None:
+             n_quantizers = self.n_codebooks
+         if self.training:
+             n_quantizers = torch.ones((z.shape[0],)) * self.n_codebooks + 1
+             dropout = torch.randint(1, self.n_codebooks + 1, (z.shape[0],))
+             n_dropout = int(z.shape[0] * self.quantizer_dropout)
+             n_quantizers[:n_dropout] = dropout[:n_dropout]
+             n_quantizers = n_quantizers.to(z.device)
+
+         for i, quantizer in enumerate(self.quantizers):
+             if self.training is False and i >= n_quantizers:
+                 break
+
+             z_q_i, commit_i, codebk_i, indices_i, z_e_i = quantizer(residual)
+
+             mask = (torch.full((z.shape[0],), fill_value=i, device=z.device) < n_quantizers)
+             z_q = z_q + z_q_i * mask[:, None, None]
+             residual = residual - z_q_i
+
+             commitment_loss += (commit_i * mask).mean()
+             codebook_loss += (codebk_i * mask).mean()
+
+             codebook_indices.append(indices_i)
+             latents.append(z_e_i)
+
+         codes = torch.stack(codebook_indices, dim=1)
+         latents = torch.cat(latents, dim=1)
+
+         return z_q, codes, latents, commitment_loss, codebook_loss
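The residual pattern in the loop above (quantize, subtract, quantize the remainder) can be illustrated with scalar rounding in place of learned codebooks. This is purely illustrative and not part of the module:

```python
def residual_quantize(value: float, stages: int = 3, step: float = 1.0):
    # Each stage quantizes the current residual to the nearest multiple of a
    # progressively finer step, then passes the remainder to the next stage.
    approx, residual, codes = 0.0, value, []
    for _ in range(stages):
        q = round(residual / step) * step
        codes.append(q)
        approx += q
        residual -= q
        step /= 10.0
    return approx, codes

approx, codes = residual_quantize(3.14159)
print(codes)   # coarse-to-fine contributions from each stage
print(approx)  # reconstruction improves as stages are added
```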
+
+     def from_codes(self, codes: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
+         z_q = 0.0
+         z_p = []
+         n_codebooks = codes.shape[1]
+         for i in range(n_codebooks):
+             z_p_i = self.quantizers[i].decode_code(codes[:, i, :])
+             z_p.append(z_p_i)
+             z_q_i = self.quantizers[i].out_proj(z_p_i)
+             z_q = z_q + z_q_i
+         return z_q, torch.cat(z_p, dim=1), codes
+
+     def from_latents(self, latents: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
+         z_q = 0
+         z_p = []
+         codes = []
+         dims = np.cumsum([0] + [q.codebook_dim for q in self.quantizers])
+         n_codebooks = np.where(dims <= latents.shape[1])[0].max(axis=0, keepdims=True)[0]
+         for i in range(n_codebooks):
+             j, k = dims[i], dims[i + 1]
+             z_p_i, codes_i = self.quantizers[i].decode_latents(latents[:, j:k, :])
+             z_p.append(z_p_i)
+             codes.append(codes_i)
+             z_q_i = self.quantizers[i].out_proj(z_p_i)
+             z_q = z_q + z_q_i
+         return z_q, torch.cat(z_p, dim=1), torch.stack(codes, dim=1)
+
+
+ # --------------------------------------------------------------------
+ # S1 DAC rvq
+ # --------------------------------------------------------------------
+
+ @dataclass
+ class VQResult:
+     z: Tensor
+     codes: Tensor
+     latents: Tensor
+     codebook_loss: Tensor
+     commitment_loss: Tensor
+     semantic_distill_z: Optional[Tensor] = None
+
+
+ class CausalConvNet(nn.Module):
+     def __init__(
+         self,
+         in_channels,
+         out_channels,
+         kernel_size,
+         dilation=1,
+         stride=1,
+         groups=1,
+         padding=None,
+     ):
+         super().__init__()
+         self.conv = nn.Conv1d(
+             in_channels, out_channels, kernel_size,
+             stride=stride, dilation=dilation, groups=groups,
+         )
+         self.stride = stride
+         self.kernel_size = (kernel_size - 1) * dilation + 1
+         self.dilation = dilation
+         self.padding = self.kernel_size - self.stride
+
+     def forward(self, x: Tensor) -> Tensor:
+         pad = self.padding
+         extra = get_extra_padding_for_conv1d(x, self.kernel_size, self.stride, pad)
+         x = pad1d(x, (pad, extra), mode="constant", value=0)
+         return self.conv(x).contiguous()
+
+     def weight_norm(self, name="weight", dim=0):
+         self.conv = weight_norm(self.conv, name=name, dim=dim)
+         return self
+
+     def remove_weight_norm(self):
+         self.conv = remove_parametrizations(self.conv, "weight")
+         return self
+
+
+ class CausalTransConvNet(nn.Module):
+     def __init__(self, in_channels, out_channels, kernel_size, dilation=1, stride=1, padding=None):
+         super().__init__()
+         self.conv = nn.ConvTranspose1d(
+             in_channels, out_channels, kernel_size,
+             stride=stride, dilation=dilation
+         )
+         self.stride = stride
+         self.kernel_size = kernel_size
+
+     def forward(self, x: Tensor) -> Tensor:
+         x = self.conv(x)
+         pad = self.kernel_size - self.stride
+         padding_right = math.ceil(pad)
+         padding_left = pad - padding_right
+         x = unpad1d(x, (padding_left, padding_right))
+         return x.contiguous()
+
+     def weight_norm(self, name="weight", dim=0):
+         self.conv = weight_norm(self.conv, name=name, dim=dim)
+         return self
+
+     def remove_weight_norm(self):
+         self.conv = remove_parametrizations(self.conv, "weight")
+         return self
+
+
+ def CausalWNConv1d(*args, **kwargs):
+     return CausalConvNet(*args, **kwargs).weight_norm()
+
+ def CausalWNConvTranspose1d(*args, **kwargs):
+     return CausalTransConvNet(*args, **kwargs).weight_norm()
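With the causal padding above (left pad of effective kernel size minus stride, plus the extra right pad), a strided `CausalConvNet` maps an input of length L to an output of length ceil(L / stride). This pure-Python check of the length arithmetic mirrors the tensor code (an assumption, since it re-derives rather than runs the conv):

```python
import math

def causal_out_length(length: int, kernel_size: int, stride: int, dilation: int = 1) -> int:
    # Effective kernel and left padding as computed in CausalConvNet.__init__ / forward.
    eff_kernel = (kernel_size - 1) * dilation + 1
    pad = eff_kernel - stride
    # Extra right pad so that striding evenly covers the padded signal.
    n_frames = (length - eff_kernel + pad) / stride + 1
    extra = (math.ceil(n_frames) - 1) * stride + (eff_kernel - pad) - length
    padded = length + pad + extra
    # Standard conv1d output-length formula on the padded input.
    return (padded - eff_kernel) // stride + 1

# A stride-2 causal conv halves the length (rounding up), regardless of kernel size.
print(causal_out_length(100, kernel_size=4, stride=2))  # -> 50
print(causal_out_length(101, kernel_size=7, stride=2))  # -> 51
```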
332
+
333
+ class ConvNeXtBlock(nn.Module):
334
+ r"""ConvNeXt Block (1D).
335
+ DwConv -> (N, C, L) β†’ (N, L, C) -> LN -> Linear -> GELU -> Linear -> (N, C, L) with residual
336
+ """
337
+ def __init__(
338
+ self,
339
+ dim: int,
340
+ layer_scale_init_value: float = 1e-6,
341
+ mlp_ratio: float = 4.0,
342
+ kernel_size: int = 7,
343
+ dilation: int = 1,
344
+ ):
345
+ super().__init__()
346
+ convnet_type = CausalConvNet
347
+ self.dwconv = convnet_type(
348
+ dim, dim, kernel_size=kernel_size,
349
+ groups=dim, dilation=dilation,
350
+ ) # depthwise conv
351
+ self.norm = nn.LayerNorm(dim, eps=1e-6)
352
+ self.pwconv1 = nn.Linear(dim, int(mlp_ratio * dim))
353
+ self.act = nn.GELU()
354
+ self.pwconv2 = nn.Linear(int(mlp_ratio * dim), dim)
355
+ self.gamma = (
356
+ nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
357
+ if layer_scale_init_value > 0 else None
358
+ )
359
+
360
+ def forward(self, x: Tensor, apply_residual: bool = True) -> Tensor:
361
+ inp = x
362
+ x = self.dwconv(x)
363
+ x = x.permute(0, 2, 1) # (N, C, L) -> (N, L, C)
364
+ x = self.norm(x)
365
+ x = self.pwconv1(x)
366
+ x = self.act(x)
367
+ x = self.pwconv2(x)
368
+ if self.gamma is not None:
369
+ x = self.gamma * x
370
+ x = x.permute(0, 2, 1) # (N, L, C) -> (N, C, L)
371
+ if apply_residual:
372
+ x = inp + x
373
+ return x
374
+
375
+
376
+ class DownsampleResidualVectorQuantize(nn.Module):
377
+ def __init__(
378
+ self,
379
+ input_dim: int = 1024,
380
+ n_codebooks: int = 9,
381
+ codebook_dim: int = 8,
382
+ quantizer_dropout: float = 0.5,
383
+ codebook_size: int = 1024,
384
+ semantic_codebook_size: int = 4096,
385
+ downsample_factor: Tuple[int, ...] = (2, 2),
386
+ downsample_dims: Optional[Tuple[int, ...]] = None,
387
+ pre_module: Optional[nn.Module] = None,
388
+ post_module: Optional[nn.Module] = None,
389
+ semantic_predictor_module: Optional[nn.Module] = None,
390
+ ):
391
+ super().__init__()
392
+
393
+ if downsample_dims is None:
394
+ downsample_dims = tuple(input_dim for _ in range(len(downsample_factor)))
395
+
396
+ all_dims = (input_dim,) + tuple(downsample_dims)
397
+
398
+ self.semantic_quantizer = ResidualVectorQuantize(
399
+ input_dim=input_dim,
400
+ n_codebooks=1,
401
+ codebook_size=semantic_codebook_size,
402
+ codebook_dim=codebook_dim,
403
+ quantizer_dropout=0.0,
404
+ )
405
+
406
+ self.quantizer = ResidualVectorQuantize(
407
+ input_dim=input_dim,
408
+ n_codebooks=n_codebooks,
409
+ codebook_size=codebook_size,
410
+ codebook_dim=codebook_dim,
411
+ quantizer_dropout=quantizer_dropout,
412
+ )
413
+
414
+ convnet_type = CausalConvNet
415
+ transconvnet_type = CausalTransConvNet
416
+
417
+ self.downsample = nn.Sequential(
418
+ *[
419
+ nn.Sequential(
420
+ convnet_type(all_dims[idx], all_dims[idx + 1], kernel_size=factor, stride=factor),
421
+ ConvNeXtBlock(dim=all_dims[idx + 1]),
422
+ )
423
+ for idx, factor in enumerate(downsample_factor)
424
+ ]
425
+ )
426
+
427
+ self.upsample = nn.Sequential(
428
+ *[
429
+ nn.Sequential(
430
+ transconvnet_type(all_dims[idx + 1], all_dims[idx], kernel_size=factor, stride=factor),
431
+ ConvNeXtBlock(dim=all_dims[idx]),
432
+ )
433
+ for idx, factor in reversed(list(enumerate(downsample_factor)))
434
+ ]
435
+ )
436
+
437
+ self.apply(self._init_weights)
438
+ self.pre_module = pre_module if pre_module is not None else nn.Identity()
439
+ self.post_module = post_module if post_module is not None else nn.Identity()
440
+ self.semantic_predictor_module = (
441
+ semantic_predictor_module if semantic_predictor_module is not None else nn.Identity()
442
+ )
443
+
444
+ @staticmethod
445
+ def _init_weights(m):
446
+ if isinstance(m, (nn.Conv1d, nn.Linear)):
447
+ nn.init.trunc_normal_(m.weight, std=0.02)
448
+ if getattr(m, "bias", None) is not None:
449
+ nn.init.constant_(m.bias, 0)
450
+
451
+ def forward(self, z: Tensor, n_quantizers: Optional[int] = None, semantic_len: Optional[Tensor] = None, **kwargs):
452
+ # z: (B, D, T)
453
+ original_shape = z.shape
454
+ if semantic_len is None:
455
+ semantic_len = torch.LongTensor([z.shape[-1]])
456
+
457
+ z = self.downsample(z)
458
+ z = self.pre_module(z) # (B, D, T) or (B, T, D) depending on module; original uses channels-first in/out
459
+
460
+ semantic_z, semantic_codes, semantic_latents, semantic_commitment_loss, semantic_codebook_loss = \
461
+ self.semantic_quantizer(z)
462
+ residual_z = z - semantic_z
463
+ residual_z, codes, latents, commitment_loss, codebook_loss = self.quantizer(residual_z, n_quantizers=n_quantizers)
464
+ z = semantic_z + residual_z
465
+ commitment_loss = commitment_loss + semantic_commitment_loss
466
+ codebook_loss = codebook_loss + semantic_codebook_loss
467
+ codes = torch.cat([semantic_codes, codes], dim=1)
468
+ latents = torch.cat([semantic_latents, latents], dim=1)
469
+ z = self.post_module(z)
470
+ z = self.upsample(z)
471
+
472
+ # Pad or crop z to match original shape (time dimension)
473
+ diff = original_shape[-1] - z.shape[-1]
474
+ right = 0
475
+ left = abs(diff) - right
476
+ if diff > 0:
477
+ z = F.pad(z, (left, right))
478
+ elif diff < 0:
479
+ z = z[..., left:]
480
+
481
+ return VQResult(
482
+ z=z, codes=codes, latents=latents,
483
+ commitment_loss=commitment_loss, codebook_loss=codebook_loss,
484
+ )
+
+     def decode(self, indices: Tensor) -> Tensor:
+         new_indices = torch.zeros_like(indices)
+         new_indices[:, 0] = torch.clamp(indices[:, 0], max=self.semantic_quantizer.codebook_size - 1)
+         new_indices[:, 1:] = torch.clamp(indices[:, 1:], max=self.quantizer.codebook_size - 1)
+
+         z_q_semantic = self.semantic_quantizer.from_codes(new_indices[:, :1])[0]
+         z_q_residual = self.quantizer.from_codes(new_indices[:, 1:])[0]
+         z_q = z_q_semantic + z_q_residual
+         z_q = self.post_module(z_q)
+         z_q = self.upsample(z_q)
+         return z_q
+
+
+ # --------------------------------------------------------------------
+ # Transformer stack
+ # --------------------------------------------------------------------
+
+ @dataclass
+ class ModelArgs:
+     block_size: int = 2048
+     n_layer: int = 8
+     n_head: int = 8
+     dim: int = 512
+     intermediate_size: int = 1536
+     n_local_heads: int = -1
+     head_dim: int = 64
+     rope_base: float = 10000
+     norm_eps: float = 1e-5
+     dropout_rate: float = 0.1
+     attn_dropout_rate: float = 0.1
+     channels_first: bool = True  # to be compatible with conv1d input/output
+     pos_embed_type: str = "rope"  # "rope" or "conformer"
+     max_relative_position: int = 128
+
+     def __post_init__(self):
+         if self.n_local_heads == -1:
+             self.n_local_heads = self.n_head
+         if self.intermediate_size is None:
+             hidden_dim = 4 * self.dim
+             n_hidden = int(2 * hidden_dim / 3)
+             self.intermediate_size = find_multiple(n_hidden, 256)
+         assert self.pos_embed_type in ["rope", "conformer"]
+
+
+ class KVCache(nn.Module):
+     def __init__(self, max_batch_size, max_seq_length, n_heads, head_dim, dtype=torch.bfloat16):
+         super().__init__()
+         cache_shape = (max_batch_size, n_heads, max_seq_length, head_dim)
+         self.register_buffer("k_cache", torch.zeros(cache_shape, dtype=dtype))
+         self.register_buffer("v_cache", torch.zeros(cache_shape, dtype=dtype))
+
+     def update(self, input_pos: Tensor, k_val: Tensor, v_val: Tensor):
+         # input_pos: [S], k_val: [B, H, S, D]
+         assert input_pos.shape[0] == k_val.shape[2]
+         k_out = self.k_cache
+         v_out = self.v_cache
+         k_out[:, :, input_pos] = k_val
+         v_out[:, :, input_pos] = v_val
+         return (
+             k_out[:, :, : input_pos.max() + 1, :],
+             v_out[:, :, : input_pos.max() + 1, :],
+         )
+
+     def clear_cache(self, prompt_len: int):
+         self.k_cache[:, :, prompt_len:, :].fill_(0)
+         self.v_cache[:, :, prompt_len:, :].fill_(0)
+
+
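The cache update semantics above can be sketched in plain Python (hypothetical stand-in, no torch): writes land at their positions in a preallocated buffer, and only the prefix up to the highest written position is returned to the attention op.

```python
# Hedged sketch of KVCache.update above: in-place writes into a fixed
# buffer, returning the valid prefix (plain-Python, single head/batch).
def kv_update(cache, input_pos, vals):
    for p, v in zip(input_pos, vals):
        cache[p] = v
    return cache[: max(input_pos) + 1]

cache = [None] * 8
print(kv_update(cache, [0, 1, 2], ["k0", "k1", "k2"]))  # ['k0', 'k1', 'k2']
print(kv_update(cache, [3], ["k3"]))                    # ['k0', 'k1', 'k2', 'k3']
```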
+ class Transformer(nn.Module):
+     def __init__(self, config: ModelArgs) -> None:
+         super().__init__()
+         self.config = config
+
+         self.layers = nn.ModuleList(TransformerBlock(config) for _ in range(config.n_layer))
+         self.norm = RMSNorm(config.dim, eps=config.norm_eps)
+
+         if config.pos_embed_type == "rope":
+             freqs_cis = precompute_freqs_cis(self.config.block_size, self.config.head_dim, self.config.rope_base)
+             self.register_buffer("freqs_cis", freqs_cis)
+         else:
+             self.register_buffer("freqs_cis", None)
+
+         causal_mask = torch.tril(torch.ones(self.config.block_size, self.config.block_size, dtype=torch.bool))
+         self.register_buffer("causal_mask", causal_mask)
+
+         self.max_batch_size = -1
+         self.max_seq_length = -1
+         self.use_kv_cache = False
+
+     def setup_caches(self, max_batch_size, max_seq_length):
+         head_dim = self.config.dim // self.config.n_head
+         max_seq_length = find_multiple(max_seq_length, 8)
+         self.max_seq_length = max_seq_length
+         self.max_batch_size = max_batch_size
+         dtype = self.norm.weight.dtype
+         device = self.norm.weight.device
+
+         for b in self.layers:
+             b.attention.kv_cache = KVCache(
+                 max_batch_size, max_seq_length, self.config.n_local_heads, head_dim, dtype
+             ).to(device)
+
+         self.use_kv_cache = True
+
+     def forward(self, x: Tensor, input_pos: Optional[Tensor] = None, mask: Optional[Tensor] = None) -> Tensor:
+         if self.config.pos_embed_type == "rope":
+             assert self.freqs_cis is not None
+             freqs_cis = self.freqs_cis[input_pos]
+         else:
+             freqs_cis = None
+
+         if mask is None:
+             if not self.training and self.use_kv_cache:
+                 mask = self.causal_mask[None, None, input_pos]
+                 mask = mask[..., : input_pos.max() + 1]
+             else:
+                 mask = self.causal_mask[None, None, input_pos]
+                 mask = mask[..., input_pos]
+
+         for layer in self.layers:
+             x = layer(x, input_pos, freqs_cis, mask)
+         x = self.norm(x)
+         return x
+
+
+ class TransformerBlock(nn.Module):
+     def __init__(self, config: ModelArgs) -> None:
+         super().__init__()
+         self.attention = Attention(config)
+         self.feed_forward = FeedForward(config)
+         self.ffn_norm = RMSNorm(config.dim, eps=config.norm_eps)
+         self.attention_norm = RMSNorm(config.dim, eps=config.norm_eps)
+         self.attention_layer_scale = LayerScale(config.dim, inplace=True)
+         self.ffn_layer_scale = LayerScale(config.dim, inplace=True)
+
+     def forward(self, x: Tensor, input_pos: Tensor, freqs_cis: Tensor, mask: Tensor) -> Tensor:
+         h = x + self.attention_layer_scale(
+             self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
+         )
+         out = h + self.ffn_layer_scale(self.feed_forward(self.ffn_norm(h)))
+         return out
+
+
+ class Attention(nn.Module):
+     def __init__(self, config: ModelArgs):
+         super().__init__()
+         assert config.dim % config.n_head == 0
+
+         total_head_dim = (config.n_head + 2 * config.n_local_heads) * config.head_dim
+         self.wqkv = nn.Linear(config.dim, total_head_dim, bias=False)
+         self.wo = nn.Linear(config.head_dim * config.n_head, config.dim, bias=False)
+         self.kv_cache = None
+
+         self.n_head = config.n_head
+         self.head_dim = config.head_dim
+         self.n_local_heads = config.n_local_heads
+         self.dim = config.dim
+         self.attn_dropout_rate = config.attn_dropout_rate
+         self.pos_embed_type = config.pos_embed_type
+
+         if self.pos_embed_type == "conformer":
+             self.max_relative_position = config.max_relative_position
+             num_pos_embeddings = 2 * config.max_relative_position + 1
+             self.rel_pos_embeddings = nn.Parameter(torch.zeros(num_pos_embeddings, self.head_dim))
+             nn.init.normal_(self.rel_pos_embeddings, mean=0.0, std=0.02)
+
+     def _compute_conformer_pos_scores(self, q: Tensor, seqlen: int) -> Tensor:
+         positions = torch.arange(seqlen, device=q.device)
+         relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0)  # [S, S]
+         relative_positions = torch.clamp(relative_positions + self.max_relative_position,
+                                          0, 2 * self.max_relative_position)
+         rel_embeddings = self.rel_pos_embeddings[relative_positions]  # [S, S, D]
+         q = q.transpose(1, 2)  # [B, S, H, D]
+         rel_logits = torch.matmul(q, rel_embeddings.transpose(-2, -1))  # [B, S, H, S]
+         rel_logits = rel_logits.transpose(1, 2)  # [B, H, S, S]
+         return rel_logits
+
+     def forward(self, x: Tensor, freqs_cis: Tensor, mask: Tensor, input_pos: Optional[Tensor] = None) -> Tensor:
+         bsz, seqlen, _ = x.shape
+
+         kv_size = self.n_local_heads * self.head_dim
+         # query projection is n_head * head_dim wide; only K/V use the
+         # (possibly smaller) n_local_heads width
+         q, k, v = self.wqkv(x).split([self.n_head * self.head_dim, kv_size, kv_size], dim=-1)
+         context_seqlen = seqlen
+
+         q = q.view(bsz, seqlen, self.n_head, self.head_dim)
+         k = k.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
+         v = v.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
+
+         if self.pos_embed_type == "rope":
+             q = apply_rotary_emb(q, freqs_cis)
+             k = apply_rotary_emb(k, freqs_cis)
+
+         q, k, v = map(lambda t: t.transpose(1, 2), (q, k, v))
+
+         if self.kv_cache is not None:
+             k, v = self.kv_cache.update(input_pos, k, v)
+
+         k = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
+         v = v.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
+
+         if self.pos_embed_type == "conformer":
+             scale = 1.0 / math.sqrt(self.head_dim)
+             scores = torch.matmul(q, k.transpose(-2, -1)) * scale
+             rel_scores = self._compute_conformer_pos_scores(q, seqlen)
+             scores = scores + rel_scores
+             if mask is not None:
+                 scores = scores.masked_fill(~mask, float("-inf"))
+             attn = F.softmax(scores, dim=-1)
+             if self.attn_dropout_rate > 0 and self.training:
+                 attn = F.dropout(attn, p=self.attn_dropout_rate)
+             y = torch.matmul(attn, v)
+         else:
+             y = F.scaled_dot_product_attention(
+                 q, k, v,
+                 dropout_p=self.attn_dropout_rate if self.training else 0.0,
+                 attn_mask=mask,
+             )
+         y = y.transpose(1, 2).contiguous().view(bsz, seqlen, self.head_dim * self.n_head)
+         y = self.wo(y)
+         return y
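The `repeat_interleave` head sharing above can be sketched in plain Python (hypothetical stand-in): in grouped-query attention each of the `n_local_heads` K/V heads serves `n_head // n_local_heads` query heads.

```python
# Hedged sketch of the K/V head repetition above: each shared head is
# duplicated consecutively, matching repeat_interleave along the head dim.
def repeat_kv_heads(kv_heads, n_head):
    rep = n_head // len(kv_heads)
    out = []
    for h in kv_heads:
        out.extend([h] * rep)  # consecutive copies per shared head
    return out

print(repeat_kv_heads(["kv0", "kv1"], 4))  # ['kv0', 'kv0', 'kv1', 'kv1']
```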
+
+
+ class FeedForward(nn.Module):
+     def __init__(self, config: ModelArgs) -> None:
+         super().__init__()
+         self.w1 = nn.Linear(config.dim, config.intermediate_size, bias=False)
+         self.w3 = nn.Linear(config.dim, config.intermediate_size, bias=False)
+         self.w2 = nn.Linear(config.intermediate_size, config.dim, bias=False)
+         self.dropout = nn.Dropout(config.dropout_rate)
+
+     def forward(self, x: Tensor) -> Tensor:
+         return self.w2(self.dropout(F.silu(self.w1(x)) * self.w3(x)))
+
+
+ class RMSNorm(nn.Module):
+     def __init__(self, dim: int, eps: float = 1e-5):
+         super().__init__()
+         self.eps = eps
+         self.weight = nn.Parameter(torch.ones(dim))
+
+     def _norm(self, x):
+         return x * torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + self.eps)
+
+     def forward(self, x: Tensor) -> Tensor:
+         output = self._norm(x.float()).type_as(x)
+         return output * self.weight
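The RMSNorm math above reduces to a single scalar divide per feature vector; a plain-Python sketch (hypothetical stand-in, omitting the learned `weight` scale):

```python
# Hedged sketch of RMSNorm._norm above: divide each feature by the root
# mean square of the vector, with eps added for numerical stability.
import math

def rmsnorm(xs, eps=1e-5):
    rms = math.sqrt(sum(v * v for v in xs) / len(xs) + eps)
    return [v / rms for v in xs]

print([round(v, 3) for v in rmsnorm([3.0, 4.0])])  # [0.849, 1.131]
```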
+
+
+ class LayerScale(nn.Module):
+     def __init__(self, dim: int, init_values: Union[float, Tensor] = 1e-2, inplace: bool = False) -> None:
+         super().__init__()
+         self.inplace = inplace
+         self.gamma = nn.Parameter(init_values * torch.ones(dim))
+
+     def forward(self, x: Tensor) -> Tensor:
+         return x.mul_(self.gamma) if self.inplace else x * self.gamma
+
+
+ class WindowLimitedTransformer(Transformer):
+     """Transformer with window-limited causal attention."""
+     def __init__(
+         self,
+         config: ModelArgs,
+         input_dim: int = 512,
+         window_size: Optional[int] = None,
+         causal: bool = True,
+         look_ahead_conv: Optional[nn.Module] = None,
+     ):
+         super().__init__(config)
+         self.window_size = window_size
+         self.causal = causal
+         self.channels_first = config.channels_first
+         self.look_ahead_conv = look_ahead_conv if look_ahead_conv is not None else nn.Identity()
+         self.input_proj = nn.Linear(input_dim, config.dim) if input_dim != config.dim else nn.Identity()
+         self.output_proj = nn.Linear(config.dim, input_dim) if input_dim != config.dim else nn.Identity()
+
+     def make_window_limited_mask(self, max_length: int, x_lens: Optional[Tensor] = None) -> Tensor:
+         if self.causal:
+             mask = torch.tril(torch.ones(max_length, max_length))
+             row_indices = torch.arange(max_length).view(-1, 1)
+             window_size = self.window_size or max_length
+             valid_range = (row_indices - window_size + 1).clamp(min=0)
+             column_indices = torch.arange(max_length)
+             mask = (column_indices >= valid_range) & mask.bool()
+         else:
+             raise NotImplementedError
+         mask = mask.bool()[None, None]
+         return mask
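The window-limited mask above admits a compact plain-Python sketch (hypothetical stand-in, no torch): query row `i` may attend keys `j` with `max(0, i - window + 1) <= j <= i`.

```python
# Hedged sketch of make_window_limited_mask above: causal attention
# restricted to a sliding window of the most recent `window` positions.
def window_limited_mask(n, window):
    return [[max(0, i - window + 1) <= j <= i for j in range(n)]
            for i in range(n)]

for row in window_limited_mask(4, 2):
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [0, 1, 1, 0]
# [0, 0, 1, 1]
```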
+
+     def make_mask(self, max_length: int, x_lens: Optional[Tensor] = None) -> Tensor:
+         if self.causal:
+             mask = torch.tril(torch.ones(max_length, max_length))
+         else:
+             mask = torch.ones(max_length, max_length)
+         mask = mask.bool()[None, None]
+         if x_lens is not None:
+             # mask out key positions beyond each sequence's valid length
+             mask = mask.repeat(len(x_lens), 1, 1, 1)
+             for i, x_len in enumerate(x_lens):
+                 mask[i, :, :, x_len:] = False
+         return mask
+
+     def forward(self, x: Tensor, x_lens: Optional[Tensor] = None) -> Tensor:
+         if self.channels_first:
+             x = x.transpose(1, 2)
+         x = self.input_proj(x)
+         x = self.look_ahead_conv(x)
+         input_pos = torch.arange(x.shape[1], device=x.device)
+         max_length = x.shape[1]
+         if self.window_size is not None:
+             mask = self.make_window_limited_mask(max_length, x_lens)
+         else:
+             mask = self.make_mask(max_length, x_lens)
+         mask = mask.to(x.device)
+         x = super().forward(x, input_pos, mask)
+         x = self.output_proj(x)
+         if self.channels_first:
+             x = x.transpose(1, 2)
+         return x
+
+
+ def precompute_freqs_cis(
+     seq_len: int, n_elem: int, base: int = 10000, dtype: torch.dtype = torch.bfloat16
+ ) -> Tensor:
+     freqs = 1.0 / (base ** (torch.arange(0, n_elem, 2)[: (n_elem // 2)].float() / n_elem))
+     t = torch.arange(seq_len, device=freqs.device)
+     freqs = torch.outer(t, freqs)
+     freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
+     cache = torch.stack([freqs_cis.real, freqs_cis.imag], dim=-1)
+     return cache.to(dtype=dtype)
+
+
+ def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
+     xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
+     freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2)
+     x_out2 = torch.stack(
+         [
+             xshaped[..., 0] * freqs_cis[..., 0] - xshaped[..., 1] * freqs_cis[..., 1],
+             xshaped[..., 1] * freqs_cis[..., 0] + xshaped[..., 0] * freqs_cis[..., 1],
+         ],
+         -1,
+     )
+     x_out2 = x_out2.flatten(3)
+     return x_out2.type_as(x)
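The rotary-embedding products above are a 2-D rotation per feature pair; a plain-Python sketch (hypothetical stand-in, matching the real/imaginary terms in `apply_rotary_emb`):

```python
# Hedged sketch of RoPE above: each (even, odd) feature pair is rotated
# by angle t * base**(-2i/d) for position t and pair index i.
import math

def rope_pair(x0, x1, t, i, d, base=10000.0):
    theta = t * base ** (-2 * i / d)
    c, s = math.cos(theta), math.sin(theta)
    return x0 * c - x1 * s, x1 * c + x0 * s

print(rope_pair(1.0, 2.0, t=0, i=0, d=64))  # (1.0, 2.0) -- position 0 is the identity
```

Because each pair is only rotated, vector norms are preserved, which is why RoPE leaves attention-score magnitudes well behaved.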
+
+
+ def init_weights(m):
+     if isinstance(m, nn.Conv1d):
+         nn.init.trunc_normal_(m.weight, std=0.02)
+         if m.bias is not None:
+             nn.init.constant_(m.bias, 0)
+
+
+ # --------------------------------------------------------------------
+ # Top-level AE
+ # --------------------------------------------------------------------
+
+ class EncoderBlock(nn.Module):
+     def __init__(
+         self,
+         dim: int = 16,
+         stride: int = 1,
+         causal: bool = False,
+         n_t_layer: int = 0,
+         transformer_general_config=None,
+     ):
+         super().__init__()
+         conv_class = CausalWNConv1d if causal else WNConv1d
+         transformer_module = (
+             nn.Identity()
+             if n_t_layer == 0
+             else WindowLimitedTransformer(
+                 causal=causal,
+                 input_dim=dim,
+                 window_size=512,
+                 config=transformer_general_config(
+                     n_layer=n_t_layer,
+                     n_head=dim // 64,
+                     dim=dim,
+                     intermediate_size=dim * 3,
+                 ),
+             )
+         )
+         self.block = nn.Sequential(
+             # three multi-receptive-field residual units
+             ResidualUnit(dim // 2, dilation=1, causal=causal),
+             ResidualUnit(dim // 2, dilation=3, causal=causal),
+             ResidualUnit(dim // 2, dilation=9, causal=causal),
+             Snake1d(dim // 2),
+             conv_class(dim // 2, dim, kernel_size=2 * stride, stride=stride, padding=math.ceil(stride / 2)),
+             transformer_module,
+         )
+
+     def forward(self, x: Tensor) -> Tensor:
+         return self.block(x)
+
+
+ class ResidualUnit(nn.Module):
+     def __init__(self, dim: int = 16, dilation: int = 1, causal: bool = False):
+         super().__init__()
+         conv_class = CausalWNConv1d if causal else WNConv1d
+         pad = ((7 - 1) * dilation) // 2
+         self.block = nn.Sequential(
+             Snake1d(dim),
+             conv_class(dim, dim, kernel_size=7, dilation=dilation, padding=pad),
+             Snake1d(dim),
+             conv_class(dim, dim, kernel_size=1),
+         )
+         self.causal = causal
+
+     def forward(self, x: Tensor) -> Tensor:
+         y = self.block(x)
+         pad = x.shape[-1] - y.shape[-1]
+         if pad > 0:
+             if self.causal:
+                 x = x[..., :-pad]
+             else:
+                 x = x[..., pad // 2 : -pad // 2]
+         return x + y
+
+
+ class Encoder(nn.Module):
+     def __init__(
+         self,
+         d_model: int = 64,
+         strides: List[int] = [2, 4, 8, 8],
+         d_latent: int = 64,
+         n_transformer_layers: List[int] = [0, 0, 4, 4],
+         transformer_general_config: Optional[ModelArgs] = None,
+         causal: bool = False,
+     ):
+         super().__init__()
+         conv_class = CausalWNConv1d if causal else WNConv1d
+         layers: List[nn.Module] = [conv_class(1, d_model, kernel_size=7, padding=3)]
+         for stride, n_t_layer in zip(strides, n_transformer_layers):
+             d_model *= 2
+             layers.append(
+                 EncoderBlock(
+                     d_model, stride=stride, causal=causal,
+                     n_t_layer=n_t_layer, transformer_general_config=transformer_general_config,
+                 )
+             )
+         layers += [Snake1d(d_model), conv_class(d_model, d_latent, kernel_size=3, padding=1)]
+         self.block = nn.Sequential(*layers)
+         self.enc_dim = d_model
+
+     def forward(self, x: Tensor) -> Tensor:
+         return self.block(x)
+
+
+ class DecoderBlock(nn.Module):
+     def __init__(
+         self,
+         input_dim: int = 16,
+         output_dim: int = 8,
+         stride: int = 1,
+         causal: bool = False,
+         n_t_layer: int = 0,
+         transformer_general_config=None,
+     ):
+         super().__init__()
+         conv_trans_class = CausalWNConvTranspose1d if causal else WNConvTranspose1d
+         transformer_module = (
+             nn.Identity()
+             if n_t_layer == 0
+             else WindowLimitedTransformer(
+                 causal=causal,
+                 input_dim=input_dim,
+                 window_size=None,
+                 config=transformer_general_config(
+                     n_layer=n_t_layer,
+                     n_head=input_dim // 64,
+                     dim=input_dim,
+                     intermediate_size=input_dim * 3,
+                 ),
+             )
+         )
+         self.block = nn.Sequential(
+             Snake1d(input_dim),
+             conv_trans_class(input_dim, output_dim, kernel_size=2 * stride, stride=stride, padding=math.ceil(stride / 2)),
+             ResidualUnit(output_dim, dilation=1, causal=causal),
+             ResidualUnit(output_dim, dilation=3, causal=causal),
+             ResidualUnit(output_dim, dilation=9, causal=causal),
+         )
+
+     def forward(self, x: Tensor) -> Tensor:
+         return self.block(x)
+
+
+ class Decoder(nn.Module):
+     def __init__(
+         self,
+         input_channel: int,
+         channels: int,
+         rates: List[int],
+         d_out: int = 1,
+         causal: bool = False,
+         n_transformer_layers: List[int] = [0, 0, 0, 0],
+         transformer_general_config=None,
+     ):
+         super().__init__()
+         conv_class = CausalWNConv1d if causal else WNConv1d
+         layers: List[nn.Module] = [conv_class(input_channel, channels, kernel_size=7, padding=3)]
+         for i, (stride, n_t_layer) in enumerate(zip(rates, n_transformer_layers)):
+             input_dim = channels // 2**i
+             output_dim = channels // 2 ** (i + 1)
+             layers.append(
+                 DecoderBlock(
+                     input_dim, output_dim, stride, causal=causal,
+                     n_t_layer=n_t_layer, transformer_general_config=transformer_general_config,
+                 )
+             )
+         layers += [Snake1d(output_dim), conv_class(output_dim, d_out, kernel_size=7, padding=3), nn.Tanh()]
+         self.model = nn.Sequential(*layers)
+
+     def forward(self, x: Tensor) -> Tensor:
+         return self.model(x)
+
+
+ class DAC(nn.Module):
+     def __init__(
+         self,
+         encoder_dim: int = 64,
+         encoder_rates: List[int] = [2, 4, 8, 8],
+         latent_dim: Optional[int] = None,
+         decoder_dim: int = 1536,
+         decoder_rates: List[int] = [8, 8, 4, 2],
+         quantizer: Optional[nn.Module] = None,
+         sample_rate: int = 44100,
+         causal: bool = True,
+         encoder_transformer_layers: List[int] = [0, 0, 0, 0],
+         decoder_transformer_layers: List[int] = [0, 0, 0, 0],
+         transformer_general_config=None,
+     ):
+         super().__init__()
+
+         self.encoder_dim = encoder_dim
+         self.encoder_rates = encoder_rates
+         self.decoder_dim = decoder_dim
+         self.decoder_rates = decoder_rates
+         self.sample_rate = sample_rate
+
+         if latent_dim is None:
+             latent_dim = encoder_dim * (2 ** len(encoder_rates))
+         self.latent_dim = latent_dim
+
+         self.hop_length = int(np.prod(encoder_rates))
+         self.encoder = Encoder(
+             encoder_dim, encoder_rates, latent_dim, causal=causal,
+             n_transformer_layers=encoder_transformer_layers,
+             transformer_general_config=transformer_general_config,
+         )
+         self.quantizer = quantizer
+         self.decoder = Decoder(
+             latent_dim, decoder_dim, decoder_rates, causal=causal,
+             n_transformer_layers=decoder_transformer_layers,
+             transformer_general_config=transformer_general_config,
+         )
+         self.apply(init_weights)
+
+         self.delay = self.get_delay()
+         self.frame_length = self.hop_length * 4
+
+     def get_output_length(self, input_length: int) -> int:
+         length = input_length
+         for stride in self.encoder_rates:
+             length = math.ceil(length / stride)
+         return length
+
+     def get_delay(self) -> int:
+         l_out = self.get_output_length(0)
+         L = l_out
+
+         layers = [layer for layer in self.modules() if isinstance(layer, (nn.Conv1d, nn.ConvTranspose1d))]
+         for layer in reversed(layers):
+             d = layer.dilation[0]
+             k = layer.kernel_size[0]
+             s = layer.stride[0]
+             if isinstance(layer, nn.ConvTranspose1d):
+                 L = ((L - d * (k - 1) - 1) / s) + 1
+             elif isinstance(layer, nn.Conv1d):
+                 L = (L - 1) * s + d * (k - 1) + 1
+             L = math.ceil(L)
+
+         l_in = L
+         return (l_in - l_out) // 2
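The hop-length and output-length arithmetic above can be sketched in plain Python (torch-free stand-in): `hop_length` is the product of the encoder strides, and each stride ceil-divides the running length.

```python
# Hedged sketch of DAC.hop_length and get_output_length above.
import math

def output_length(input_length, strides):
    length = input_length
    for s in strides:
        length = math.ceil(length / s)
    return length

strides = [2, 4, 8, 8]  # encoder_rates used elsewhere in this file
hop = math.prod(strides)
print(hop, output_length(44100, strides))  # 512 87 -> ~1s of 44.1kHz audio gives 87 latent frames
```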
+
+     def preprocess(self, audio_data: Tensor, sample_rate: Optional[int]) -> Tensor:
+         if sample_rate is None:
+             sample_rate = self.sample_rate
+         assert sample_rate == self.sample_rate
+
+         length = audio_data.shape[-1]
+         right_pad = math.ceil(length / self.hop_length) * self.hop_length - length
+         audio_data = F.pad(audio_data, (0, right_pad))
+         return audio_data
+
+     def encode(
+         self,
+         audio_data: Tensor,
+         audio_lengths: Optional[Tensor] = None,
+         n_quantizers: Optional[int] = None,
+         **kwargs,
+     ):
+         """Encode audio to quantized code indices."""
+         if audio_data.ndim == 2:
+             audio_data = audio_data.unsqueeze(1)
+         length = audio_data.shape[-1]
+         right_pad = math.ceil(length / self.frame_length) * self.frame_length - length
+         audio_data = F.pad(audio_data, (0, right_pad))
+         if audio_lengths is None:
+             audio_lengths = torch.LongTensor([length + right_pad]).to(audio_data.device)
+
+         z = self.encoder(audio_data)
+         vq_results = self.quantizer(z, n_quantizers, **kwargs)
+         indices = vq_results.codes
+         indices_lens = torch.ceil(audio_lengths / self.frame_length).long()
+         return indices, indices_lens
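The right-padding in `encode` above rounds audio up to a multiple of `frame_length` (`hop_length * 4`, i.e. 2048 with the default strides); a plain-Python sketch:

```python
# Hedged sketch of the right-pad computation in encode() above.
import math

def right_pad_amount(length, frame_length):
    return math.ceil(length / frame_length) * frame_length - length

print(right_pad_amount(44100, 2048))  # 956 -> padded length 45056 = 22 frames
```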
+
+     def decode(self, indices: Tensor, feature_lengths: Tensor):
+         """Decode code indices to audio."""
+         if indices.ndim == 2:
+             indices = indices[None]
+         z = self.quantizer.decode(indices)
+         audio_lengths = feature_lengths * self.frame_length
+         return self.decoder(z), audio_lengths
+
+     def encode_to_codes(self, audio: Tensor, audio_lengths: Optional[Tensor] = None, n_quantizers: Optional[int] = None, **kw):
+         return self.encode(audio, audio_lengths, n_quantizers, **kw)
+
+     def decode_codes(self, indices: Tensor, feature_lengths: Tensor):
+         return self.decode(indices, feature_lengths)
+
+     @torch.no_grad()
+     def encode_zq(self, audio_data: Tensor) -> Tensor:
+         indices, _ = self.encode(audio_data)
+         new_indices = torch.zeros_like(indices)
+         new_indices[:, 0] = torch.clamp(indices[:, 0], max=self.quantizer.semantic_quantizer.codebook_size - 1)
+         new_indices[:, 1:] = torch.clamp(indices[:, 1:], max=self.quantizer.quantizer.codebook_size - 1)
+
+         z_q_semantic = self.quantizer.semantic_quantizer.from_codes(new_indices[:, :1])[0]
+         z_q_residual = self.quantizer.quantizer.from_codes(new_indices[:, 1:])[0]
+         z_q = z_q_semantic + z_q_residual
+         return z_q
+
+     @torch.no_grad()
+     def decode_zq(self, z_q: Tensor) -> Tensor:
+         z_q = self.quantizer.post_module(z_q)
+         z_q = self.quantizer.upsample(z_q)
+         return self.decoder(z_q)
+
+     @property
+     def device(self) -> torch.device:
+         return next(self.parameters()).device
+
+     @property
+     def dtype(self) -> torch.dtype:
+         return next(self.parameters()).dtype
+
+ # --------------------------------------------------------------------
+ # Build helpers
+ # --------------------------------------------------------------------
+
+ def build_ae(**cfg) -> DAC:
+     """Factory used by external loaders."""
+     # Shared transformer config for the RVQ pre/post modules
+     q_config = ModelArgs(
+         block_size=4096, n_layer=8, n_head=16, dim=1024,
+         intermediate_size=3072, head_dim=64, norm_eps=1e-5,
+         dropout_rate=0.1, attn_dropout_rate=0.1, channels_first=True,
+     )
+
+     def make_transformer():
+         return WindowLimitedTransformer(
+             causal=True, window_size=128, input_dim=1024, config=q_config
+         )
+
+     quantizer = DownsampleResidualVectorQuantize(
+         input_dim=1024, n_codebooks=9, codebook_size=1024, codebook_dim=8,
+         quantizer_dropout=0.5, downsample_factor=(2, 2),
+         semantic_codebook_size=4096,
+         pre_module=make_transformer(),
+         post_module=make_transformer(),
+     )
+
+     def transformer_general_config(**kw):
+         return ModelArgs(
+             block_size=kw.get("block_size", 16384),
+             n_layer=kw.get("n_layer", 8),
+             n_head=kw.get("n_head", 8),
+             dim=kw.get("dim", 512),
+             intermediate_size=kw.get("intermediate_size", 1536),
+             n_local_heads=kw.get("n_local_heads", -1),
+             head_dim=kw.get("head_dim", 64),
+             rope_base=kw.get("rope_base", 10000),
+             norm_eps=kw.get("norm_eps", 1e-5),
+             dropout_rate=kw.get("dropout_rate", 0.1),
+             attn_dropout_rate=kw.get("attn_dropout_rate", 0.1),
+             channels_first=kw.get("channels_first", True),
+         )
+
+     dac = DAC(
+         encoder_dim=64, encoder_rates=[2, 4, 8, 8], latent_dim=1024,
+         decoder_dim=1536, decoder_rates=[8, 8, 4, 2],
+         quantizer=quantizer, sample_rate=44100, causal=True,
+         encoder_transformer_layers=[0, 0, 0, 4],
+         decoder_transformer_layers=[4, 0, 0, 0],
+         transformer_general_config=transformer_general_config,
+     )
+     return dac
+
+
+ __all__ = [
+     "DAC",
+     "build_ae",
+     "VectorQuantize",
+     "ResidualVectorQuantize",
+     "DownsampleResidualVectorQuantize",
+ ]
+
+
+ # ----- BEGIN DAC MIT LICENSE -----
+ # MIT License
+ # Copyright (c) 2023-present, Descript
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the "Software"), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ # ----- END DAC MIT LICENSE -----
+
gradio_app.py ADDED
@@ -0,0 +1,994 @@
+
+ import os
+ import json
+ import time
+ import secrets
+ import logging
+ import warnings
+ from pathlib import Path
+ from typing import Tuple, Any
+ from functools import partial
+
+ # see lines ~40-50 for running on lower VRAM GPUs
+
+ logging.getLogger("huggingface_hub").setLevel(logging.ERROR)
+
+ import gradio as gr
+ import torch
+ import torchaudio
+
+ from inference import (
+     load_model_from_hf,
+     load_fish_ae_from_hf,
+     load_pca_state_from_hf,
+     load_audio,
+     ae_reconstruct,
+     sample_pipeline,
+     compile_model,
+     compile_fish_ae,
+     sample_euler_cfg_independent_guidances,
+ )
+
+ # --------------------------------------------------------------------
+ # IF ON AN 8GB VRAM GPU, SET FISH_AE_DTYPE TO bfloat16 AND DEFAULT_SAMPLE_LATENT_LENGTH TO < 640 (e.g., 576)
+
+ # Configuration
+ MODEL_DTYPE = torch.bfloat16
+ FISH_AE_DTYPE = torch.float32
+ # FISH_AE_DTYPE = torch.bfloat16  # USE THIS IF OOM ON AN 8GB VRAM GPU
+
+ DEFAULT_SAMPLE_LATENT_LENGTH = 640  # decrease if OOM on an 8GB VRAM GPU
+ # DEFAULT_SAMPLE_LATENT_LENGTH = 576  # (example, ~27 seconds rather than ~30; can change depending on what fits in VRAM)
+
+ # NOTE: peak S1-DAC decoding VRAM > peak latent sampling VRAM, so decoding in chunks (which is possible as S1-DAC is causal) would allow full 640-length generation on lower-VRAM GPUs
+
+ # --------------------------------------------------------------------
+
+ # Audio prompt library for the custom audio panel (included in repo)
+ AUDIO_PROMPT_FOLDER = Path("./audio_prompts")
+
+ # --------------------------------------------------------------------
+
+ TEXT_PRESETS_PATH = Path("./text_presets.txt")
+ SAMPLER_PRESETS_PATH = Path("./sampler_presets.json")
+
+ TEMP_AUDIO_DIR = Path("./temp_gradio_audio")
+ TEMP_AUDIO_DIR.mkdir(parents=True, exist_ok=True)
+
+ # --------------------------------------------------------------------
+ # Model loading (eager for local use)
+ model = load_model_from_hf(dtype=MODEL_DTYPE, delete_blockwise_modules=True)
+ fish_ae = load_fish_ae_from_hf(dtype=FISH_AE_DTYPE)
+ pca_state = load_pca_state_from_hf()
+
+ model_compiled = None
+ fish_ae_compiled = None
+
+ # --------------------------------------------------------------------
68
+ # Helper functions
69
+ def make_stem(prefix: str, user_id: str | None = None) -> str:
70
+ """Create unique filename stem: prefix__user__timestamp_random or prefix__timestamp_random if no user_id."""
71
+ ts = int(time.time() * 1000)
72
+ rand = secrets.token_hex(4)
73
+ if user_id:
74
+ return f"{prefix}__{user_id}__{ts}_{rand}"
75
+ return f"{prefix}__{ts}_{rand}"
76
+
77
+
78
+ def cleanup_temp_audio(dir_: Path, user_id: str | None, max_age_sec: int = 60 * 5):
79
+ """Remove old files globally and all previous files for this user."""
80
+ now = time.time()
81
+
82
+ for p in dir_.glob("*"):
83
+ try:
84
+ if p.is_file() and (now - p.stat().st_mtime) > max_age_sec:
85
+ p.unlink(missing_ok=True)
86
+ except Exception:
87
+ pass
88
+
89
+ if user_id:
90
+ for p in dir_.glob(f"*__{user_id}__*"):
91
+ try:
92
+ if p.is_file():
93
+ p.unlink(missing_ok=True)
94
+ except Exception:
95
+ pass
96
+
97
+
98
+ def save_audio_with_format(audio_tensor: torch.Tensor, base_path: Path, filename: str, sample_rate: int, audio_format: str) -> Path:
99
+ """Save audio in specified format, fallback to WAV if MP3 encoding fails."""
100
+ if audio_format == "mp3":
101
+ try:
102
+ output_path = base_path / f"{filename}.mp3"
103
+ torchaudio.save(
104
+ str(output_path),
105
+ audio_tensor,
106
+ sample_rate,
107
+ format="mp3",
108
+ encoding="mp3",
109
+ bits_per_sample=None,
110
+ )
111
+ return output_path
112
+ except Exception as e:
113
+ print(f"MP3 encoding failed: {e}, falling back to WAV")
114
+ output_path = base_path / f"{filename}.wav"
115
+ torchaudio.save(str(output_path), audio_tensor, sample_rate)
116
+ return output_path
117
+
118
+ output_path = base_path / f"{filename}.wav"
119
+ torchaudio.save(str(output_path), audio_tensor, sample_rate)
120
+ return output_path
121
+
122
+
123
+ def to_bool(val: Any) -> bool:
124
+ """Parse truthy values from common string/bool inputs."""
125
+ return str(val).strip().lower() not in {"", "0", "false", "off", "none", "no"}
126
+
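A quick standalone check of the truthy-parsing behavior (the helper is restated here so the snippet runs on its own; it mirrors `to_bool` above):

```python
def to_bool(val):
    """Anything outside the explicit falsy set counts as True."""
    return str(val).strip().lower() not in {"", "0", "false", "off", "none", "no"}

assert to_bool("True") and to_bool(1) and to_bool("yes")
assert not to_bool("off") and not to_bool("0") and not to_bool(None) and not to_bool("")
```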
127
+
128
+ def find_min_bucket_gte(values_str: str, actual_length: int) -> int | None:
129
+ """Parse comma-separated values and find minimum value >= actual_length.
130
+
131
+ If a single value is provided (no comma), returns that value directly.
132
+ If comma-separated, finds the smallest bucket that can fit the content.
133
+ Returns None if empty string.
134
+ """
135
+ if not values_str or not values_str.strip():
136
+ return None
137
+
138
+ values_str = values_str.strip()
139
+
140
+ # Single value case - return as-is
141
+ if "," not in values_str:
142
+ return int(values_str)
143
+
144
+ # Multiple values - find minimum >= actual_length
145
+ values = [int(v.strip()) for v in values_str.split(",") if v.strip()]
146
+ if not values:
147
+ return None
148
+
149
+ # Find minimum value >= actual_length
150
+ candidates = [v for v in values if v >= actual_length]
151
+ if candidates:
152
+ return min(candidates)
153
+
154
+ # If no value is >=, return the maximum (best effort)
155
+ return max(values)
156
+
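The bucket-selection logic can be exercised in isolation (the function is restated so the snippet runs standalone; the bucket strings match the UI defaults used later in this file):

```python
def find_min_bucket_gte(values_str, actual_length):
    """Smallest bucket >= actual_length from a comma-separated list."""
    if not values_str or not values_str.strip():
        return None
    values_str = values_str.strip()
    if "," not in values_str:
        return int(values_str)
    values = [int(v.strip()) for v in values_str.split(",") if v.strip()]
    if not values:
        return None
    candidates = [v for v in values if v >= actual_length]
    return min(candidates) if candidates else max(values)

assert find_min_bucket_gte("640, 2816, 6400", 700) == 2816   # smallest bucket that fits
assert find_min_bucket_gte("640, 2816, 6400", 9000) == 6400  # nothing fits: best-effort max
assert find_min_bucket_gte("768", 123) == 768                # single value returned as-is
assert find_min_bucket_gte("", 10) is None                   # blank -> no padding
```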
157
+
158
+ def generate_audio(
159
+ text_prompt: str,
160
+ speaker_audio_path: str,
161
+ num_steps: int,
162
+ rng_seed: int,
163
+ cfg_scale_text: float,
164
+ cfg_scale_speaker: float,
165
+ cfg_min_t: float,
166
+ cfg_max_t: float,
167
+ truncation_factor: float,
168
+ rescale_k: float,
169
+ rescale_sigma: float,
170
+ force_speaker: bool,
171
+ speaker_kv_scale: float,
172
+ speaker_kv_min_t: float,
173
+ speaker_kv_max_layers: int,
174
+ reconstruct_first_30_seconds: bool,
175
+ use_custom_shapes: bool,
176
+ max_text_byte_length: str,
177
+ max_speaker_latent_length: str,
178
+ sample_latent_length: str,
179
+ audio_format: str,
180
+ use_compile: bool,
181
+ show_original_audio: bool,
182
+ session_id: str,
183
+ ) -> Tuple[Any, Any, Any, Any, Any, Any, Any, Any, Any]:
184
+ """Generate audio using the model."""
185
+ global model_compiled, fish_ae_compiled
186
+
187
+ if use_compile:
188
+ if model_compiled is None:
189
+ try:
190
+ model_compiled = compile_model(model)
191
+ fish_ae_compiled = compile_fish_ae(fish_ae)
192
+ except Exception as e:
193
+ print(f"Compilation wrapping failed: {str(e)}")
194
+ model_compiled = None
195
+ fish_ae_compiled = None
196
+ use_compile = False
197
+
198
+ active_model = model_compiled if use_compile else model
199
+ active_fish_ae = fish_ae_compiled if use_compile else fish_ae
200
+
201
+ cleanup_temp_audio(TEMP_AUDIO_DIR, session_id)
202
+
203
+ start_time = time.time()
204
+
205
+ num_steps_int = min(max(int(num_steps), 1), 80)
206
+ rng_seed_int = int(rng_seed) if rng_seed is not None else 0
207
+ cfg_scale_text_val = float(cfg_scale_text)
208
+ cfg_scale_speaker_val = float(cfg_scale_speaker) if cfg_scale_speaker is not None else None
209
+ cfg_min_t_val = float(cfg_min_t)
210
+ cfg_max_t_val = float(cfg_max_t)
211
+ truncation_factor_val = float(truncation_factor)
212
+ rescale_k_val = float(rescale_k) if rescale_k != 1.0 else None
213
+ rescale_sigma_val = float(rescale_sigma)
214
+
215
+ speaker_kv_enabled = bool(force_speaker)
216
+ if speaker_kv_enabled:
217
+ speaker_kv_scale_val = float(speaker_kv_scale) if speaker_kv_scale is not None else None
218
+ speaker_kv_min_t_val = float(speaker_kv_min_t) if speaker_kv_min_t is not None else None
219
+ speaker_kv_max_layers_val = int(speaker_kv_max_layers) if speaker_kv_max_layers is not None else None
220
+ else:
221
+ speaker_kv_scale_val = None
222
+ speaker_kv_min_t_val = None
223
+ speaker_kv_max_layers_val = None
224
+
225
+ # Load speaker audio early so we can compute actual lengths for bucket selection
226
+ use_zero_speaker = not speaker_audio_path or speaker_audio_path == ""
227
+ speaker_audio = load_audio(speaker_audio_path).cuda() if not use_zero_speaker else None
228
+
229
+ if use_custom_shapes:
230
+ # Compute actual text byte length
231
+ actual_text_byte_length = len(text_prompt.encode("utf-8")) + 1 # +1 for BOS token
232
+
233
+ # Compute actual speaker latent length (audio_samples // 2048, rounded down to a multiple of 4)
234
+ AE_DOWNSAMPLE_FACTOR = 2048
235
+ if speaker_audio is not None:
236
+ actual_speaker_latent_length = (speaker_audio.shape[-1] // AE_DOWNSAMPLE_FACTOR) // 4 * 4
237
+ else:
238
+ actual_speaker_latent_length = 0
239
+
240
+ # Find appropriate bucket sizes from comma-separated values
241
+ pad_to_max_text_length = find_min_bucket_gte(max_text_byte_length, actual_text_byte_length)
242
+ pad_to_max_speaker_latent_length = find_min_bucket_gte(max_speaker_latent_length, actual_speaker_latent_length)
243
+ sample_latent_length_val = int(sample_latent_length) if sample_latent_length.strip() else (DEFAULT_SAMPLE_LATENT_LENGTH or 640)
244
+ else:
245
+ pad_to_max_text_length = None
246
+ pad_to_max_speaker_latent_length = None
247
+ sample_latent_length_val = DEFAULT_SAMPLE_LATENT_LENGTH or 640
248
+
249
+
250
+ sample_fn = partial(
251
+ sample_euler_cfg_independent_guidances,
252
+ num_steps=num_steps_int,
253
+ cfg_scale_text=cfg_scale_text_val,
254
+ cfg_scale_speaker=cfg_scale_speaker_val,
255
+ cfg_min_t=cfg_min_t_val,
256
+ cfg_max_t=cfg_max_t_val,
257
+ truncation_factor=truncation_factor_val,
258
+ rescale_k=rescale_k_val,
259
+ rescale_sigma=rescale_sigma_val,
260
+ speaker_kv_scale=speaker_kv_scale_val,
261
+ speaker_kv_min_t=speaker_kv_min_t_val,
262
+ speaker_kv_max_layers=speaker_kv_max_layers_val,
263
+ sequence_length=sample_latent_length_val,
264
+ )
265
+
266
+ audio_out, normalized_text = sample_pipeline(
267
+ model=active_model,
268
+ fish_ae=active_fish_ae,
269
+ pca_state=pca_state,
270
+ sample_fn=sample_fn,
271
+ text_prompt=text_prompt,
272
+ speaker_audio=speaker_audio,
273
+ rng_seed=rng_seed_int,
274
+ pad_to_max_text_length=pad_to_max_text_length,
275
+ pad_to_max_speaker_latent_length=pad_to_max_speaker_latent_length,
276
+ normalize_text=True,
277
+ )
278
+
279
+ audio_to_save = audio_out[0].cpu()
280
+
281
+ stem = make_stem("generated", session_id)
282
+ output_path = save_audio_with_format(audio_to_save, TEMP_AUDIO_DIR, stem, 44100, audio_format)
283
+
284
+ generation_time = time.time() - start_time
285
+ time_str = f"⏱️ Total generation time: {generation_time:.2f}s"
286
+ text_display = f"**Text Prompt (normalized):**\n\n{normalized_text}"
287
+
288
+ recon_output_path = None
289
+ original_output_path = None
290
+
291
+ if reconstruct_first_30_seconds and speaker_audio is not None:
292
+ audio_recon = ae_reconstruct(
293
+ fish_ae=fish_ae,
294
+ pca_state=pca_state,
295
+ audio=torch.nn.functional.pad(
296
+ speaker_audio[..., :2048 * 640],
297
+ (0, max(0, 2048 * 640 - speaker_audio.shape[-1])),
298
+ )[None],
299
+ )[..., : speaker_audio.shape[-1]]
300
+
301
+ recon_stem = make_stem("speaker_recon", session_id)
302
+ recon_output_path = save_audio_with_format(audio_recon.cpu()[0], TEMP_AUDIO_DIR, recon_stem, 44100, audio_format)
303
+
304
+ if show_original_audio and speaker_audio is not None:
305
+ original_stem = make_stem("original_audio", session_id)
306
+ original_output_path = save_audio_with_format(speaker_audio.cpu(), TEMP_AUDIO_DIR, original_stem, 44100, audio_format)
307
+
308
+ show_reference_section = (show_original_audio or reconstruct_first_30_seconds) and speaker_audio is not None
309
+ return (
310
+ gr.update(),
311
+ gr.update(value=str(output_path), visible=True),
312
+ gr.update(value=text_display, visible=True),
313
+ gr.update(value=str(original_output_path) if original_output_path else None, visible=True),
314
+ gr.update(value=time_str, visible=True),
315
+ gr.update(value=str(recon_output_path) if recon_output_path else None, visible=True),
316
+ gr.update(visible=(show_original_audio and speaker_audio is not None)),
317
+ gr.update(visible=(reconstruct_first_30_seconds and speaker_audio is not None)),
318
+ gr.update(visible=show_reference_section),
319
+ )
320
+
321
+
322
+ # UI Helper Functions
323
+ def load_text_presets():
324
+ """Load text presets from file with category and word count."""
325
+ if TEXT_PRESETS_PATH.exists():
326
+ with open(TEXT_PRESETS_PATH, "r", encoding="utf-8") as f:
327
+ lines = [line.strip() for line in f if line.strip()]
328
+
329
+ result = []
330
+ for line in lines:
331
+ if " | " in line:
332
+ parts = line.split(" | ", 1)
333
+ category = parts[0]
334
+ text = parts[1]
335
+ else:
336
+ category = "Uncategorized"
337
+ text = line
338
+
339
+ word_count = len(text.split())
340
+ result.append([category, str(word_count), text])
341
+
342
+ return result
343
+ return []
344
+
345
+
346
+ def select_text_preset(evt: gr.SelectData):
347
+ """Handle text preset selection - extract text from the row."""
348
+ if evt.value:
349
+ if isinstance(evt.index, (tuple, list)) and len(evt.index) >= 2:
350
+ row_index = evt.index[0]
351
+ else:
352
+ row_index = evt.index
353
+
354
+ presets_data = load_text_presets()
355
+ if isinstance(row_index, int) and row_index < len(presets_data):
356
+ text = presets_data[row_index][2]
357
+ return gr.update(value=text)
358
+ return gr.update()
359
+
360
+
361
+ def toggle_mode(mode):
362
+ """Toggle advanced settings section visibility."""
363
+ show_advanced = mode == "Advanced Mode"
364
+ return gr.update(visible=show_advanced)
365
+
366
+
367
+ def update_force_row(force_speaker):
368
+ """Show KV scaling controls when Force Speaker is enabled."""
369
+ return gr.update(visible=bool(force_speaker))
370
+
371
+
372
+ def apply_cfg_preset(preset_name):
373
+ """Apply CFG guidance preset."""
374
+ presets = {
375
+ "higher speaker": (3.0, 8.0, 0.5, 1.0),
376
+ "large guidances": (8.0, 8.0, 0.5, 1.0),
377
+ }
378
+
379
+ if preset_name not in presets:
380
+ return [gr.update()] * 5
381
+
382
+ text_scale, speaker_scale, min_t, max_t = presets[preset_name]
383
+
384
+ return [
385
+ gr.update(value=text_scale),
386
+ gr.update(value=speaker_scale),
387
+ gr.update(value=min_t),
388
+ gr.update(value=max_t),
389
+ gr.update(value="Custom"),
390
+ ]
391
+
392
+
393
+ def apply_speaker_kv_preset(preset_name):
394
+ """Apply speaker KV attention control preset."""
395
+ if preset_name == "enable":
396
+ return [
397
+ gr.update(value=True),
398
+ gr.update(visible=True),
399
+ gr.update(value="Custom"),
400
+ ]
401
+ if preset_name == "off":
402
+ return [
403
+ gr.update(value=False),
404
+ gr.update(visible=False),
405
+ gr.update(value="Custom"),
406
+ ]
407
+ return [gr.update()] * 3
408
+
409
+
410
+ def apply_truncation_preset(preset_name):
411
+ """Apply truncation & temporal rescaling preset."""
412
+ presets = {
413
+ "flat": (0.8, 1.2, 3.0),
414
+ "sharp": (0.9, 0.96, 3.0),
415
+ "baseline(sharp)": (1.0, 1.0, 3.0),
416
+ }
417
+
418
+ if preset_name == "custom" or preset_name not in presets:
419
+ return [gr.update()] * 4
420
+
421
+ truncation, rescale_k, rescale_sigma = presets[preset_name]
422
+
423
+ return [
424
+ gr.update(value=truncation),
425
+ gr.update(value=rescale_k),
426
+ gr.update(value=rescale_sigma),
427
+ gr.update(value="Custom"),
428
+ ]
429
+
430
+
431
+ def load_sampler_presets():
432
+ """Load sampler presets from JSON file."""
433
+ if SAMPLER_PRESETS_PATH.exists():
434
+ with open(SAMPLER_PRESETS_PATH, "r") as f:
435
+ return json.load(f)
436
+
437
+ default_presets = {
438
+ "Independent-High-Speaker-CFG": {
439
+ "num_steps": "40",
440
+ "cfg_scale_text": "3.0",
441
+ "cfg_scale_speaker": "8.0",
442
+ "cfg_min_t": "0.5",
443
+ "cfg_max_t": "1.0",
444
+ "truncation_factor": "1.",
445
+ "rescale_k": "1.",
446
+ "rescale_sigma": "3.0"
447
+ }
448
+ }
449
+ with open(SAMPLER_PRESETS_PATH, "w") as f:
450
+ json.dump(default_presets, f, indent=2)
451
+ return default_presets
452
+
453
+
454
+ def apply_sampler_preset(preset_name):
455
+ """Apply a sampler preset to all fields."""
456
+ presets = load_sampler_presets()
457
+ if preset_name == "Custom" or preset_name not in presets:
458
+ return [gr.update()] * 13
459
+
460
+ preset = presets[preset_name]
461
+ speaker_kv_enabled = to_bool(preset.get("speaker_kv_enable", False))
462
+
463
+ def to_num(val, default):
464
+ try:
465
+ return float(val) if isinstance(val, str) else val
466
+ except (ValueError, TypeError):
467
+ return default
468
+
469
+ return [
470
+ gr.update(value=int(to_num(preset.get("num_steps", "40"), 40))),
471
+ gr.update(value=to_num(preset.get("cfg_scale_text", "3.0"), 3.0)),
472
+ gr.update(value=to_num(preset.get("cfg_scale_speaker", "5.0"), 5.0)),
473
+ gr.update(value=to_num(preset.get("cfg_min_t", "0.5"), 0.5)),
474
+ gr.update(value=to_num(preset.get("cfg_max_t", "1.0"), 1.0)),
475
+ gr.update(value=to_num(preset.get("truncation_factor", "0.8"), 0.8)),
476
+ gr.update(value=to_num(preset.get("rescale_k", "1.2"), 1.2)),
477
+ gr.update(value=to_num(preset.get("rescale_sigma", "3.0"), 3.0)),
478
+ gr.update(value=speaker_kv_enabled),
479
+ gr.update(visible=speaker_kv_enabled),
480
+ gr.update(value=to_num(preset.get("speaker_kv_scale", "1.5"), 1.5)),
481
+ gr.update(value=to_num(preset.get("speaker_kv_min_t", "0.9"), 0.9)),
482
+ gr.update(value=int(to_num(preset.get("speaker_kv_max_layers", "24"), 24))),
483
+ ]
484
+
485
+
486
+ AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm", ".aac", ".opus"}
487
+
488
+
489
+ def get_audio_prompt_files(search_query: str = ""):
490
+ """Get list of audio files from the audio prompt folder, optionally filtered by search query."""
491
+ if AUDIO_PROMPT_FOLDER is None or not AUDIO_PROMPT_FOLDER.exists():
492
+ return []
493
+
494
+ files = sorted([f.name for f in AUDIO_PROMPT_FOLDER.iterdir() if f.is_file() and f.suffix.lower() in AUDIO_EXTS], key=str.lower)
495
+
496
+ # Filter by search query if provided
497
+ if search_query.strip():
498
+ query_lower = search_query.lower()
499
+ files = [f for f in files if query_lower in f.lower()]
500
+
501
+ return [[file] for file in files]
502
+
503
+
504
+ def filter_audio_prompts(search_query: str):
505
+ """Filter audio prompts based on search query."""
506
+ return gr.update(value=get_audio_prompt_files(search_query))
507
+
508
+
509
+ def select_audio_prompt_file(evt: gr.SelectData):
510
+ """Handle audio prompt file selection from table."""
511
+ if evt.value and AUDIO_PROMPT_FOLDER is not None:
512
+ file_path = AUDIO_PROMPT_FOLDER / evt.value
513
+ if file_path.exists():
514
+ return gr.update(value=str(file_path))
515
+ return gr.update()
516
+
517
+
518
+ # UI styling and helpers
519
+ LINK_CSS = """
520
+ .preset-inline { display:flex; align-items:baseline; gap:6px; margin-top:-4px; margin-bottom:-12px; }
521
+ .preset-inline .title { font-weight:600; font-size:.95rem; }
522
+ .preset-inline .dim { color:#666; margin:0 4px; }
523
+ a.preset-link { color: #0a5bd8; text-decoration: underline; cursor: pointer; font-weight: 400; }
524
+ a.preset-link:hover { text-decoration: none; opacity: 0.8; }
525
+ .dark a.preset-link,
526
+ [data-theme="dark"] a.preset-link { color: #60a5fa !important; }
527
+ .dark a.preset-link:hover,
528
+ [data-theme="dark"] a.preset-link:hover { color: #93c5fd !important; }
529
+ .dark .preset-inline .dim,
530
+ [data-theme="dark"] .preset-inline .dim { color: #9ca3af !important; }
531
+ .proxy-btn { position:absolute; width:0; height:0; overflow:hidden; padding:0 !important; margin:0 !important; border:0 !important; opacity:0; pointer-events:none; }
532
+ .gr-group { border: 1px solid #d1d5db !important; background: #f3f4f6 !important; }
533
+ .dark .gr-group,
534
+ [data-theme="dark"] .gr-group { border: 1px solid #4b5563 !important; background: #1f2937 !important; }
535
+ .generated-audio-player { border: 3px solid #667eea !important; border-radius: 12px !important; padding: 20px !important; background: linear-gradient(135deg, rgba(102, 126, 234, 0.08) 0%, rgba(118, 75, 162, 0.05) 100%) !important; box-shadow: 0 4px 12px rgba(102, 126, 234, 0.2) !important; margin: 1rem 0 !important; }
536
+ .generated-audio-player > div { background: transparent !important; }
537
+ #component-mode-selector { text-align: center; padding: 1rem 0; }
538
+ #component-mode-selector label { font-size: 1.1rem !important; font-weight: 600 !important; margin-bottom: 0.5rem !important; }
539
+ #component-mode-selector .wrap { justify-content: center !important; }
540
+ #component-mode-selector fieldset { border: 2px solid #e5e7eb !important; border-radius: 8px !important; padding: 1rem !important; background: #f9fafb !important; }
541
+ .dark #component-mode-selector fieldset,
542
+ [data-theme="dark"] #component-mode-selector fieldset { border: 2px solid #4b5563 !important; background: #1f2937 !important; }
543
+ .section-separator { height: 3px !important; background: linear-gradient(90deg, transparent 0%, #667eea 20%, #764ba2 80%, transparent 100%) !important; border: none !important; margin: 2rem 0 !important; }
544
+ .dark .section-separator,
545
+ [data-theme="dark"] .section-separator { background: linear-gradient(90deg, transparent 0%, #667eea 20%, #764ba2 80%, transparent 100%) !important; }
546
+ .gradio-container h1,
547
+ .gradio-container h2 { font-weight: 700 !important; margin-top: 1.5rem !important; margin-bottom: 1rem !important; }
548
+ .tip-box { background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%) !important; border-left: 4px solid #f59e0b !important; border-radius: 8px !important; padding: 1rem 1.5rem !important; margin: 1rem 0 !important; box-shadow: 0 2px 4px rgba(245, 158, 11, 0.1) !important; }
549
+ .tip-box strong { color: #92400e !important; }
550
+ .dark .tip-box,
551
+ [data-theme="dark"] .tip-box { background: linear-gradient(135deg, #451a03 0%, #78350f 100%) !important; border-left: 4px solid #f59e0b !important; }
552
+ .dark .tip-box strong,
553
+ [data-theme="dark"] .tip-box strong { color: #fbbf24 !important; }
554
+ """
555
+
556
+ JS_CODE = r"""
557
+ function () {
558
+ const appEl = document.querySelector("gradio-app");
559
+ const root = appEl && appEl.shadowRoot ? appEl.shadowRoot : document;
560
+ function clickHiddenButtonById(id) {
561
+ if (!id) return;
562
+ const host = root.getElementById(id);
563
+ if (!host) return;
564
+ const realBtn = host.querySelector("button, [role='button']") || host;
565
+ realBtn.click();
566
+ }
567
+ root.addEventListener("click", (ev) => {
568
+ const a = ev.target.closest("a.preset-link");
569
+ if (!a) return;
570
+ ev.preventDefault();
571
+ ev.stopPropagation();
572
+ ev.stopImmediatePropagation();
573
+ clickHiddenButtonById(a.getAttribute("data-fire"));
574
+ return false;
575
+ }, true);
576
+ }
577
+ """
578
+
579
+
580
+ def init_session():
581
+ """Initialize session ID for this browser tab/session."""
582
+ return secrets.token_hex(8)
583
+
584
+
585
+ with gr.Blocks(title="Echo-TTS", css=LINK_CSS, js=JS_CODE) as demo:
586
+ gr.Markdown("# Echo-TTS")
587
+ gr.Markdown("*Jordan Darefsky, 2025. See technical details [here](https://jordandarefsky.com/blog/2025/echo/)*")
588
+
589
+ gr.Markdown("**License Notice:** All audio outputs are subject to non-commercial use [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).")
590
+
591
+ gr.Markdown("**Responsible Use:** Do not use this model to impersonate real people without their explicit consent or to generate deceptive audio.")
592
+
593
+ with gr.Accordion("📖 Quick Start Instructions", open=True):
594
+ gr.Markdown(
595
+ """
596
+ 1. Upload or record a short reference clip (or leave blank for no speaker reference).
597
+ 2. Pick a text preset or type your own prompt.
598
+ 3. Click **Generate Audio**.
599
+
600
+ <div class="tip-box">
601
+ 💡 **Tip:** If the generated voice does not match the reference, enable "Force Speaker" and regenerate.
602
+ </div>
603
+ """
604
+ )
605
+
606
+ session_id_state = gr.State(None)
607
+
608
+ gr.Markdown("# Speaker Reference")
609
+ with gr.Row():
610
+ if AUDIO_PROMPT_FOLDER is not None and AUDIO_PROMPT_FOLDER.exists():
611
+ with gr.Column(scale=1, min_width=200):
612
+ gr.Markdown("#### Audio Library (click to load)")
613
+ audio_prompt_search = gr.Textbox(
614
+ label="",
615
+ placeholder="🔍 Search audio prompts...",
616
+ lines=1,
617
+ max_lines=1,
618
+ )
619
+ audio_prompt_table = gr.Dataframe(
620
+ value=get_audio_prompt_files(),
621
+ headers=["Filename"],
622
+ datatype=["str"],
623
+ row_count=(10, "dynamic"),
624
+ col_count=(1, "fixed"),
625
+ interactive=False,
626
+ label="",
627
+ )
628
+ with gr.Column(scale=2):
629
+ custom_audio_input = gr.Audio(
630
+ sources=["upload", "microphone"],
631
+ type="filepath",
632
+ label="Speaker Reference Audio (first five minutes used; blank for no speaker reference)",
633
+ max_length=600,
634
+ )
635
+
636
+ gr.HTML('<hr class="section-separator">')
637
+ gr.Markdown("# Text Prompt")
638
+ with gr.Accordion("Text Presets", open=True):
639
+ text_presets_table = gr.Dataframe(
640
+ value=load_text_presets(),
641
+ headers=["Category", "Words", "Preset Text"],
642
+ datatype=["str", "str", "str"],
643
+ row_count=(3, "dynamic"),
644
+ col_count=(3, "fixed"),
645
+ interactive=False,
646
+ column_widths=["12%", "6%", "82%"],
647
+ )
648
+ text_prompt = gr.Textbox(label="Text Prompt", placeholder="[S1] Enter your text prompt here...", lines=4)
649
+
650
+ gr.HTML('<hr class="section-separator">')
651
+ gr.Markdown("# Generation")
652
+
653
+ with gr.Row():
654
+ with gr.Column(scale=1):
655
+ pass
656
+ with gr.Column(scale=2):
657
+ mode_selector = gr.Radio(
658
+ choices=["Simple Mode", "Advanced Mode"],
659
+ value="Simple Mode",
660
+ label="",
661
+ info=None,
662
+ elem_id="component-mode-selector",
663
+ )
664
+ with gr.Column(scale=1):
665
+ pass
666
+
667
+ with gr.Accordion("⚙️ Generation Parameters", open=True):
668
+ with gr.Row(equal_height=False):
669
+ presets = load_sampler_presets()
670
+ preset_keys = list(presets.keys())
671
+ first_preset = preset_keys[0] if preset_keys else "Custom"
672
+
673
+ with gr.Column(scale=2):
674
+ preset_dropdown = gr.Dropdown(
675
+ choices=["Custom"] + preset_keys,
676
+ value=first_preset,
677
+ label="Sampler Preset",
678
+ info="Load preset configurations",
679
+ )
680
+
681
+ with gr.Column(scale=0.8, min_width=100):
682
+ num_steps = gr.Number(
683
+ label="Steps",
684
+ value=40,
685
+ info="Sampling steps (Try 20-80)",
686
+ precision=0,
687
+ minimum=5,
688
+ step=5,
689
+ maximum=80,
690
+ )
691
+
692
+ with gr.Column(scale=0.8, min_width=100):
693
+ rng_seed = gr.Number(label="RNG Seed", value=0, info="Seed for noise", precision=0)
694
+
695
+ with gr.Column(scale=3):
696
+ with gr.Group():
697
+ gr.HTML(
698
+ """
699
+ <div class="preset-inline">
700
+ <span class="title">Speaker KV Attention Scaling</span>
701
+ </div>
702
+ """
703
+ )
704
+ spk_kv_preset_enable = gr.Button("", elem_id="spk_kv_enable", elem_classes=["proxy-btn"])
705
+ spk_kv_preset_off = gr.Button("", elem_id="spk_kv_off", elem_classes=["proxy-btn"])
706
+ force_speaker = gr.Checkbox(
707
+ label='"Force Speaker" (KV scaling)',
708
+ value=False,
709
+ info="Enable to more strongly match the reference speaker (though higher values may degrade quality)",
710
+ )
711
+ with gr.Row(visible=False) as speaker_kv_row:
712
+ speaker_kv_scale = gr.Number(label="KV Scale", value=1.5, info="Scale factor (>1 -> larger effect; try 1.5, 1.2, ...)", minimum=0, step=0.1)
713
+ speaker_kv_min_t = gr.Number(
714
+ label="KV Min t",
715
+ value=0.9,
716
+ info="(0-1), scale applied from t=1.0 down to this value",
717
+ minimum=0,
718
+ maximum=1,
719
+ step=0.05,
720
+ )
721
+ speaker_kv_max_layers = gr.Number(
722
+ label="Max Layers",
723
+ value=24,
724
+ info="(0-24), scale applied in first N layers",
725
+ precision=0,
726
+ minimum=0,
727
+ maximum=24,
728
+ )
729
+
730
+ with gr.Column(visible=False) as advanced_mode_column:
731
+ compile_checkbox = gr.Checkbox(
732
+ label="Compile Model",
733
+ value=False,
734
+ info="Compile for faster runs (~10-30% faster); forces Custom Shapes on to avoid excessive recompilation.",
735
+ )
736
+ use_custom_shapes_checkbox = gr.Checkbox(
737
+ label="Use Custom Shapes (Advanced)",
738
+ value=False,
739
+ info="Override default generation length and/or force latent and text padding (if unchecked, no padding is used and the latent generation length is 640 ≈ 30s).",
740
+ )
741
+
742
+ with gr.Row(visible=False) as custom_shapes_row:
743
+ max_text_byte_length = gr.Textbox(
744
+ label="Max Text Byte Length (padded)",
745
+ value="768",
746
+ info="Single value or comma-separated buckets (auto-selects min >= length); 768 = max; leave blank for no padding",
747
+ scale=1,
748
+ )
749
+ max_speaker_latent_length = gr.Textbox(
750
+ label="Max Speaker Latent Length (padded)",
751
+ value="640, 2816, 6400",
752
+ info="Single value or comma-separated buckets (auto-selects min >= length); 640≈30s, 2816≈2min, 6400≈5min (max); leave blank for no padding",
753
+ scale=1,
754
+ )
755
+ sample_latent_length = gr.Textbox(
756
+ label="Sample Latent Length",
757
+ value=str(DEFAULT_SAMPLE_LATENT_LENGTH),
758
+ info="Maximum sample latent length (640≈30s max seen during training; smaller works well for generating prefixes)",
759
+ scale=1,
760
+ )
761
+
762
+ with gr.Row():
763
+ with gr.Column(scale=1):
764
+ with gr.Group():
765
+ gr.HTML(
766
+ """
767
+ <div class="preset-inline">
768
+ <span class="title">Truncation &amp; Temporal Rescaling</span><span class="dim">(</span>
769
+ <a href="javascript:void(0)" class="preset-link" data-fire="trunc_flat">flat</a>
770
+ <span class="dim">,</span>
771
+ <a href="javascript:void(0)" class="preset-link" data-fire="trunc_sharp">sharp</a>
772
+ <span class="dim">,</span>
773
+ <a href="javascript:void(0)" class="preset-link" data-fire="trunc_baseline">baseline(sharp)</a>
774
+ <span class="dim">)</span>
775
+ </div>
776
+ """
777
+ )
778
+ trunc_preset_flat = gr.Button("", elem_id="trunc_flat", elem_classes=["proxy-btn"])
779
+ trunc_preset_sharp = gr.Button("", elem_id="trunc_sharp", elem_classes=["proxy-btn"])
780
+ trunc_preset_baseline = gr.Button("", elem_id="trunc_baseline", elem_classes=["proxy-btn"])
781
+ with gr.Row():
782
+ truncation_factor = gr.Number(
783
+ label="Truncation Factor",
784
+ value=0.8,
785
+ info="Multiplies the initial noise (<1 can reduce artifacts)",
786
+ minimum=0,
787
+ step=0.05,
788
+ )
789
+ rescale_k = gr.Number(
790
+ label="Rescale k", value=1.2, info="<1=sharpen, >1=flatten, 1=off", minimum=0, step=0.05
791
+ )
792
+ rescale_sigma = gr.Number(
793
+ label="Rescale σ", value=3.0, info="Sigma parameter", minimum=0, step=0.1
794
+ )
795
+
796
+ with gr.Column(scale=1):
797
+ with gr.Group():
798
+ gr.HTML(
799
+ """
800
+ <div class="preset-inline">
801
+ <span class="title">CFG Guidance</span><span class="dim">(</span>
802
+ <a href="javascript:void(0)" class="preset-link" data-fire="cfg_higher">higher speaker</a>
803
+ <span class="dim">,</span>
804
+ <a href="javascript:void(0)" class="preset-link" data-fire="cfg_large">large guidances</a>
805
+ <span class="dim">)</span>
806
+ </div>
807
+ """
808
+ )
809
+ cfg_preset_higher_speaker = gr.Button("", elem_id="cfg_higher", elem_classes=["proxy-btn"])
810
+ cfg_preset_large_guidances = gr.Button("", elem_id="cfg_large", elem_classes=["proxy-btn"])
811
+ with gr.Row():
812
+ cfg_scale_text = gr.Number(
813
+ label="Text CFG Scale", value=3.0, info="Guidance strength for text", minimum=0, step=0.5
814
+ )
815
+ cfg_scale_speaker = gr.Number(
816
+ label="Speaker CFG Scale",
817
+ value=5.0,
818
+ info="Guidance strength for speaker",
819
+ minimum=0,
820
+ step=0.5,
821
+ )
822
+
823
+ with gr.Row():
824
+ cfg_min_t = gr.Number(
825
+ label="CFG Min t", value=0.5, info="(0-1), CFG applied when t >= val", minimum=0, maximum=1, step=0.05
826
+ )
827
+ cfg_max_t = gr.Number(
+ label="CFG Max t", value=1.0, info="(0-1), CFG applied when t <= val", minimum=0, maximum=1, step=0.05
+ )
+
+ with gr.Row(equal_height=True):
+ audio_format = gr.Radio(choices=["wav", "mp3"], value="wav", label="Format", scale=1, min_width=90)
+ generate_btn = gr.Button("Generate Audio", variant="primary", size="lg", scale=10)
+ with gr.Column(scale=1):
+ show_original_audio = gr.Checkbox(label="Re-display Original Audio (full 5-minute cropped mono)", value=False)
+ reconstruct_first_30_seconds = gr.Checkbox(
+ label="Show Autoencoder Reconstruction (only first 30s of reference)", value=False
+ )
+
+ gr.HTML('<hr class="section-separator">')
+ with gr.Accordion("Generated Audio", open=True, visible=True) as generated_section:
+ generation_time_display = gr.Markdown("", visible=False)
+ with gr.Group(elem_classes=["generated-audio-player"]):
+ generated_audio = gr.Audio(label="Generated Audio", visible=True)
+ text_prompt_display = gr.Markdown("", visible=False)
+
+ gr.Markdown("---")
+ reference_audio_header = gr.Markdown("#### Reference Audio", visible=False)
+
+ with gr.Accordion("Original Audio (5 min Cropped Mono)", open=False, visible=False) as original_accordion:
+ original_audio = gr.Audio(label="Original Reference Audio (5 min)", visible=True)
+
+ with gr.Accordion("Autoencoder Reconstruction of First 30s of Reference", open=False, visible=False) as reference_accordion:
+ reference_audio = gr.Audio(label="Decoded Reference Audio (30s)", visible=True)
+
+ # Event handlers
+ if AUDIO_PROMPT_FOLDER is not None and AUDIO_PROMPT_FOLDER.exists():
+ audio_prompt_table.select(select_audio_prompt_file, outputs=[custom_audio_input])
+ audio_prompt_search.change(filter_audio_prompts, inputs=[audio_prompt_search], outputs=[audio_prompt_table])
+
+ text_presets_table.select(select_text_preset, outputs=text_prompt)
+
+ mode_selector.change(toggle_mode, inputs=[mode_selector], outputs=[advanced_mode_column])
+
+ force_speaker.change(update_force_row, inputs=[force_speaker], outputs=[speaker_kv_row])
+
+ def toggle_custom_shapes(enabled):
+ return gr.update(visible=enabled)
+
+ use_custom_shapes_checkbox.change(
+ toggle_custom_shapes,
+ inputs=[use_custom_shapes_checkbox],
+ outputs=[custom_shapes_row],
+ )
+
+ def on_compile_change(compile_enabled):
+ """When compile is enabled, force custom shapes to be enabled."""
+ if compile_enabled:
+ return (
+ gr.update(value=True), # use_custom_shapes_checkbox
+ gr.update(visible=True), # custom_shapes_row
+ )
+ return (
+ gr.update(),
+ gr.update(),
+ )
+
+ compile_checkbox.change(
+ on_compile_change,
+ inputs=[compile_checkbox],
+ outputs=[use_custom_shapes_checkbox, custom_shapes_row],
+ )
+
+ cfg_preset_higher_speaker.click(
+ lambda: apply_cfg_preset("higher speaker"), outputs=[cfg_scale_text, cfg_scale_speaker, cfg_min_t, cfg_max_t, preset_dropdown]
+ )
+ cfg_preset_large_guidances.click(
+ lambda: apply_cfg_preset("large guidances"), outputs=[cfg_scale_text, cfg_scale_speaker, cfg_min_t, cfg_max_t, preset_dropdown]
+ )
+
+ spk_kv_preset_enable.click(lambda: apply_speaker_kv_preset("enable"), outputs=[force_speaker, speaker_kv_row, preset_dropdown])
+ spk_kv_preset_off.click(lambda: apply_speaker_kv_preset("off"), outputs=[force_speaker, speaker_kv_row, preset_dropdown])
+
+ trunc_preset_flat.click(lambda: apply_truncation_preset("flat"), outputs=[truncation_factor, rescale_k, rescale_sigma, preset_dropdown])
+ trunc_preset_sharp.click(lambda: apply_truncation_preset("sharp"), outputs=[truncation_factor, rescale_k, rescale_sigma, preset_dropdown])
+ trunc_preset_baseline.click(
+ lambda: apply_truncation_preset("baseline(sharp)"), outputs=[truncation_factor, rescale_k, rescale_sigma, preset_dropdown]
+ )
+
+ preset_dropdown.change(
+ apply_sampler_preset,
+ inputs=preset_dropdown,
+ outputs=[
+ num_steps,
+ cfg_scale_text,
+ cfg_scale_speaker,
+ cfg_min_t,
+ cfg_max_t,
+ truncation_factor,
+ rescale_k,
+ rescale_sigma,
+ force_speaker,
+ speaker_kv_row,
+ speaker_kv_scale,
+ speaker_kv_min_t,
+ speaker_kv_max_layers,
+ ],
+ )
+
+ generate_btn.click(
+ generate_audio,
+ inputs=[
+ text_prompt,
+ custom_audio_input,
+ num_steps,
+ rng_seed,
+ cfg_scale_text,
+ cfg_scale_speaker,
+ cfg_min_t,
+ cfg_max_t,
+ truncation_factor,
+ rescale_k,
+ rescale_sigma,
+ force_speaker,
+ speaker_kv_scale,
+ speaker_kv_min_t,
+ speaker_kv_max_layers,
+ reconstruct_first_30_seconds,
+ use_custom_shapes_checkbox,
+ max_text_byte_length,
+ max_speaker_latent_length,
+ sample_latent_length,
+ audio_format,
+ compile_checkbox,
+ show_original_audio,
+ session_id_state,
+ ],
+ outputs=[
+ generated_section,
+ generated_audio,
+ text_prompt_display,
+ original_audio,
+ generation_time_display,
+ reference_audio,
+ original_accordion,
+ reference_accordion,
+ reference_audio_header,
+ ],
+ )
+
+ demo.load(init_session, outputs=[session_id_state]).then(
+ lambda: apply_sampler_preset(list(load_sampler_presets().keys())[0]),
+ outputs=[
+ num_steps,
+ cfg_scale_text,
+ cfg_scale_speaker,
+ cfg_min_t,
+ cfg_max_t,
+ truncation_factor,
+ rescale_k,
+ rescale_sigma,
+ force_speaker,
+ speaker_kv_row,
+ speaker_kv_scale,
+ speaker_kv_min_t,
+ speaker_kv_max_layers,
+ ],
+ )
+
+
+ if __name__ == "__main__":
+ demo.launch(
+ allowed_paths=[str(AUDIO_PROMPT_FOLDER)]
+ )
inference.py ADDED
@@ -0,0 +1,462 @@
+ from dataclasses import dataclass
+ from typing import Callable, List, Tuple
+
+ from huggingface_hub import hf_hub_download
+ import safetensors.torch as st
+ import torch
+ import torchaudio
+ from torchcodec.decoders import AudioDecoder
+
+ from autoencoder import DAC, build_ae
+ from model import EchoDiT
+
+
+ def load_model_from_hf(repo_id: str = "jordand/echo-tts-base", device: str = "cuda", dtype: torch.dtype | None = torch.bfloat16, compile: bool = False, token: str | None = None, delete_blockwise_modules: bool = False) -> EchoDiT:
+     with torch.device("meta"):
+         model = EchoDiT(
+             latent_size=80, model_size=2048, num_layers=24, num_heads=16,
+             intermediate_size=5888, norm_eps=1e-5,
+             text_vocab_size=256, text_model_size=1280, text_num_layers=14,
+             text_num_heads=10, text_intermediate_size=3328,
+             speaker_patch_size=4, speaker_model_size=1280, speaker_num_layers=14,
+             speaker_num_heads=10, speaker_intermediate_size=3328,
+             timestep_embed_size=512, adaln_rank=256,
+         )
+     w_path = hf_hub_download(repo_id, "pytorch_model.safetensors", token=token)
+     state = st.load_file(w_path, device="cpu")
+
+     if delete_blockwise_modules:
+         state = {k: v for k, v in state.items() if not (
+             k.startswith("latent_encoder.") or
+             k.startswith("latent_norm") or
+             ".wk_latent" in k or
+             ".wv_latent" in k
+         )}
+
+     if dtype is not None:
+         state = {k: v.to(dtype=dtype) for k, v in state.items()}
+
+     state = {k: v.to(device=device) for k, v in state.items()}
+
+     model.load_state_dict(state, strict=False, assign=True)
+     model = model.eval()
+
+     if compile:
+         model = compile_model(model)
+
+     return model
+
+ def compile_model(model: EchoDiT) -> EchoDiT:
+     model = torch.compile(model)
+     model.get_kv_cache_text = torch.compile(model.get_kv_cache_text)
+     model.get_kv_cache_speaker = torch.compile(model.get_kv_cache_speaker)
+     model.get_kv_cache_latent = torch.compile(model.get_kv_cache_latent)
+     return model
+
+ def load_fish_ae_from_hf(repo_id: str = "jordand/fish-s1-dac-min", device: str = "cuda", dtype: torch.dtype | None = torch.float32, compile: bool = False, token: str | None = None) -> DAC:
+     with torch.device("meta"):
+         fish_ae = build_ae()
+
+     w_path = hf_hub_download(repo_id, "pytorch_model.safetensors", token=token)
+     if dtype is not None and dtype != torch.float32:
+         state = st.load_file(w_path, device="cpu")
+         state = {k: v.to(dtype=dtype) for k, v in state.items()}
+         state = {k: v.to(device=device) for k, v in state.items()}
+         fish_ae.load_state_dict(state, strict=False, assign=True)
+     else:
+         state = st.load_file(w_path, device=device)
+         fish_ae.load_state_dict(state, strict=False, assign=True)
+
+     fish_ae = fish_ae.eval().to(device)
+
+     if compile:
+         fish_ae = compile_fish_ae(fish_ae)
+
+     return fish_ae
+
+ def compile_fish_ae(fish_ae: DAC) -> DAC:
+     fish_ae.quantizer.upsample = torch.compile(fish_ae.quantizer.upsample)
+     fish_ae.quantizer.downsample = torch.compile(fish_ae.quantizer.downsample)
+     fish_ae.quantizer.pre_module = torch.compile(fish_ae.quantizer.pre_module)
+     fish_ae.quantizer.post_module = torch.compile(fish_ae.quantizer.post_module)
+     return fish_ae
+
+
+ @dataclass
+ class PCAState:
+     pca_components: torch.Tensor
+     pca_mean: torch.Tensor
+     latent_scale: float
+
+ def load_pca_state_from_hf(repo_id: str = "jordand/echo-tts-base", device: str = "cuda", filename: str = "pca_state.safetensors", token: str | None = None) -> PCAState:
+     p_path = hf_hub_download(repo_id, filename, token=token)
+     t = st.load_file(p_path, device=device)
+     return PCAState(
+         pca_components=t["pca_components"],
+         pca_mean=t["pca_mean"],
+         latent_scale=float(t["latent_scale"].item()),
+     )
+
+
+ # ________
+
+ def load_audio(path: str, max_duration: int = 300) -> torch.Tensor:
+     decoder = AudioDecoder(path)
+     sr = decoder.metadata.sample_rate
+     audio = decoder.get_samples_played_in_range(0, max_duration)
+     audio = audio.data.mean(dim=0).unsqueeze(0)
+     audio = torchaudio.functional.resample(audio, sr, 44_100)
+     audio = audio / torch.maximum(audio.abs().max(), torch.tensor(1.))
+     # is this better than clipping? should we target a specific energy level?
+     return audio
+
+ def tokenizer_encode(text: str, append_bos: bool = True, normalize: bool = True, return_normalized_text: bool = False) -> torch.Tensor | Tuple[torch.Tensor, str]:
+     if normalize:
+         text = text.replace("…", "...")
+         text = text.replace("’", "'")
+         text = text.replace("“", '"')
+         text = text.replace("”", '"')
+         text = text.replace("\n", " ")
+         text = text.replace(":", ",")
+         text = text.replace(";", ",")
+         text = text.replace("—", ", ")
+         if not text.startswith("[") and not text.startswith("(") and 'S1' not in text and 'S2' not in text:
+             text = "[S1] " + text
+
+     b = list(text.encode("utf-8"))
+     if append_bos:
+         b.insert(0, 0)
+
+     if return_normalized_text:
+         return torch.tensor(b), text
+
+     return torch.tensor(b)
+
+ def get_text_input_ids_and_mask(text_arr: List[str], max_length: int | None, device: str | None = None, normalize: bool = True, return_normalized_text: bool = False, pad_to_max: bool = True) -> Tuple[torch.Tensor, torch.Tensor] | Tuple[torch.Tensor, torch.Tensor, List[str]]:
+     encoded_texts = [tokenizer_encode(text, normalize=normalize, return_normalized_text=True) for text in text_arr]
+
+     if max_length is None:
+         max_length = max(len(enc) for enc, _ in encoded_texts)
+
+     tokens = torch.zeros((len(text_arr), max_length), dtype=torch.int32)
+     mask = torch.zeros((len(text_arr), max_length), dtype=torch.bool)
+
+     for i, (encoded, _) in enumerate(encoded_texts):
+         length = min(len(encoded), max_length)
+         tokens[i, :length] = encoded[:length]
+         mask[i, :length] = 1
+
+     if not pad_to_max and max_length is not None:
+         tokens, mask = tokens[:, :max_length], mask[:, :max_length]
+
+     if device is not None:
+         tokens, mask = tokens.to(device), mask.to(device)
+
+     if return_normalized_text:
+         return tokens, mask, [text for _, text in encoded_texts]
+     return tokens, mask
+
+ # ________
+
+ @torch.inference_mode()
+ def ae_encode(fish_ae: DAC, pca_state: PCAState, audio: torch.Tensor) -> torch.Tensor:
+     assert audio.ndim == 3 and audio.shape[1] == 1  # (b, 1, length)
+     z_q = fish_ae.encode_zq(audio).float()
+     z_q = (z_q.transpose(1, 2) - pca_state.pca_mean) @ pca_state.pca_components.T
+     z_q = z_q * pca_state.latent_scale
+     return z_q
+
+ @torch.inference_mode()
+ def ae_decode(fish_ae: DAC, pca_state: PCAState, z_q: torch.Tensor) -> torch.Tensor:
+     z_q = (z_q / pca_state.latent_scale) @ pca_state.pca_components + pca_state.pca_mean
+     return fish_ae.decode_zq(z_q.transpose(1, 2).to(fish_ae.dtype)).float()
+
+ @torch.inference_mode()
+ def ae_reconstruct(fish_ae: DAC, pca_state: PCAState, audio: torch.Tensor) -> torch.Tensor:
+     assert audio.ndim == 3 and audio.shape[1] == 1  # (b, 1, length)
+     z_q = ae_encode(fish_ae, pca_state, audio.to(fish_ae.dtype))
+     return ae_decode(fish_ae, pca_state, z_q)
+
+ # ________
+
+ @torch.inference_mode()
+ def get_speaker_latent_and_mask(
+     fish_ae: DAC,
+     pca_state: PCAState,
+     audio: torch.Tensor,  # (1, length)
+     max_speaker_latent_length: int = 6400,  # pretrained max length
+     audio_chunk_size: int = 640 * 2048,  # (~30 seconds, 1/10 max speaker condition size; max chunk seen in training)
+     pad_to_max: bool = False,
+     divis_by_patch_size: int | None = 4,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+     # gets the speaker latent and mask from audio; computes in chunks and concatenates (similar to the training setup)
+
+     AE_DOWNSAMPLE_FACTOR = 2048
+     max_audio_len_length = max_speaker_latent_length * AE_DOWNSAMPLE_FACTOR
+
+     assert audio.ndim == 2 and audio.shape[0] == 1  # (1, length)
+     audio = audio[:, :max_audio_len_length]
+
+     latent_arr = []
+
+     for i in range(0, audio.shape[1], audio_chunk_size):
+         audio_chunk = audio[:, i:i + audio_chunk_size]
+         if audio_chunk.shape[1] < audio_chunk_size:
+             audio_chunk = torch.nn.functional.pad(audio_chunk, (0, audio_chunk_size - audio_chunk.shape[1]))
+
+         latent_chunk = ae_encode(fish_ae, pca_state, audio_chunk.unsqueeze(0))
+         latent_arr.append(latent_chunk)
+
+     speaker_latent = torch.cat(latent_arr, dim=1)
+
+     actual_latent_length = audio.shape[1] // AE_DOWNSAMPLE_FACTOR
+     speaker_mask = (torch.arange(speaker_latent.shape[1], device=speaker_latent.device) < actual_latent_length).unsqueeze(0)
+
+     if pad_to_max and speaker_latent.shape[1] < max_speaker_latent_length:
+         speaker_latent = torch.nn.functional.pad(speaker_latent, (0, 0, 0, max_speaker_latent_length - speaker_latent.shape[1]))
+         speaker_mask = torch.nn.functional.pad(speaker_mask, (0, max_speaker_latent_length - speaker_mask.shape[1]))
+     elif not pad_to_max:
+         speaker_latent = speaker_latent[:, :actual_latent_length]
+         speaker_mask = speaker_mask[:, :actual_latent_length]
+
+     if divis_by_patch_size is not None:
+         speaker_latent = speaker_latent[:, :speaker_latent.shape[1] // divis_by_patch_size * divis_by_patch_size]
+         speaker_mask = speaker_mask[:, :speaker_mask.shape[1] // divis_by_patch_size * divis_by_patch_size]
+
+     return speaker_latent, speaker_mask
+
+
+ # ________
+
+ def find_flattening_point(data, target_value=0.0, window_size=20, std_threshold=0.05):
+     # simple heuristic to find the end of latent generations; slow and can be improved
+     # (data is (length, 80))
+     padded_data = torch.cat([data, torch.zeros(window_size, *data.shape[1:], device=data.device, dtype=data.dtype)])
+     for i in range(len(padded_data) - window_size):
+         window = padded_data[i:i + window_size]
+         if window.std() < std_threshold and abs(window.mean() - target_value) < 0.1:
+             return i
+     return len(data)
+
+ def crop_audio_to_flattening_point(audio: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
+     # (audio is (..., length), latent is (length, 80))
+     flattening_point = find_flattening_point(latent)
+     return audio[..., :flattening_point * 2048]
+
+ SampleFn = Callable[
+     [EchoDiT, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, int],
+     torch.Tensor
+ ]
+
+ @torch.inference_mode()
+ def sample_pipeline(
+     model: EchoDiT,
+     fish_ae: DAC,
+     pca_state: PCAState,
+     sample_fn: SampleFn,
+     text_prompt: str,
+     speaker_audio: torch.Tensor | None,
+     rng_seed: int,
+     pad_to_max_speaker_latent_length: int | None = None,
+     pad_to_max_text_length: int | None = None,
+     normalize_text: bool = True,
+ ) -> Tuple[torch.Tensor, str]:
+     MAX_SPEAKER_LATENT_LENGTH = 6400  # max seen during training, though maybe can go higher?
+     MAX_TEXT_LENGTH = 768
+
+     device, dtype = model.device, model.dtype
+
+     text_input_ids, text_mask, normalized_text = get_text_input_ids_and_mask([text_prompt], max_length=min(pad_to_max_text_length or MAX_TEXT_LENGTH, MAX_TEXT_LENGTH), device=device, normalize=normalize_text, return_normalized_text=True, pad_to_max=(pad_to_max_text_length is not None))
+
+     if speaker_audio is None:
+         speaker_latent = torch.zeros((1, pad_to_max_speaker_latent_length or 4, 80), device=device, dtype=dtype)
+         speaker_mask = torch.zeros((1, pad_to_max_speaker_latent_length or 4), device=device, dtype=torch.bool)
+     else:
+         speaker_latent, speaker_mask = get_speaker_latent_and_mask(
+             fish_ae,
+             pca_state,
+             speaker_audio.to(fish_ae.dtype).to(device),
+             max_speaker_latent_length=pad_to_max_speaker_latent_length or MAX_SPEAKER_LATENT_LENGTH,
+             pad_to_max=(pad_to_max_speaker_latent_length is not None),
+         )
+
+     latent_out = sample_fn(model, speaker_latent, speaker_mask, text_input_ids, text_mask, rng_seed)
+
+     audio_out = ae_decode(fish_ae, pca_state, latent_out)
+
+     audio_out = crop_audio_to_flattening_point(audio_out, latent_out[0])
+
+     return audio_out, normalized_text[0]
+
+
+ # ________
+
+ KVCache = List[Tuple[torch.Tensor, torch.Tensor]]
+
+ def _concat_kv_caches(*caches: KVCache) -> KVCache:
+     # helper that concatenates multiple KV caches along the batch dimension
+     num_layers = len(caches[0])
+     result = []
+     for i in range(num_layers):
+         k = torch.cat([c[i][0] for c in caches], dim=0)
+         v = torch.cat([c[i][1] for c in caches], dim=0)
+         result.append((k, v))
+     return result
+
+ def _multiply_kv_cache(cache: KVCache, scale: float, max_layers: int | None = None) -> None:
+     # helper that multiplies KV cache values in-place, for speaker-KV scaling
+     num_layers = len(cache) if max_layers is None else min(max_layers, len(cache))
+     for i in range(num_layers):
+         k, v = cache[i]
+         k.mul_(scale)
+         v.mul_(scale)
+
+ def _temporal_score_rescale(
+     v_pred: torch.Tensor, x_t: torch.Tensor, t: float, rescale_k: float, rescale_sigma: float
+ ) -> torch.Tensor:
+     # for https://arxiv.org/pdf/2510.01184
+     if t < 1:
+         snr = (1 - t) ** 2 / (t ** 2)
+         ratio = (snr * rescale_sigma ** 2 + 1) / (snr * rescale_sigma ** 2 / rescale_k + 1)
+         return 1 / (1 - t) * (ratio * ((1 - t) * v_pred + x_t) - x_t)
+     return v_pred
+
+
+ @torch.inference_mode()
+ def sample_euler_cfg_independent_guidances(
+     model: EchoDiT,
+     speaker_latent: torch.Tensor,
+     speaker_mask: torch.Tensor,
+     text_input_ids: torch.Tensor,
+     text_mask: torch.Tensor,
+     rng_seed: int,
+     num_steps: int,
+     cfg_scale_text: float,
+     cfg_scale_speaker: float,
+     cfg_min_t: float,
+     cfg_max_t: float,
+     truncation_factor: float | None,
+     rescale_k: float | None,
+     rescale_sigma: float | None,
+     speaker_kv_scale: float | None,
+     speaker_kv_max_layers: int | None,
+     speaker_kv_min_t: float | None,
+     sequence_length: int | None = None,
+ ) -> torch.Tensor:
+     if sequence_length is None:
+         sequence_length = 640  # max sequence length during training
+
+     INIT_SCALE = 0.999  # so that we can apply rescale to the first step
+
+     device, dtype = model.device, model.dtype
+     batch_size = text_input_ids.shape[0]
+
+     rng = torch.Generator(device=device).manual_seed(rng_seed)
+
+     t_schedule = torch.linspace(1., 0., num_steps + 1, device=device) * INIT_SCALE
+
+     text_mask_uncond = torch.zeros_like(text_mask)
+     speaker_mask_uncond = torch.zeros_like(speaker_mask)
+
+     kv_text_cond = model.get_kv_cache_text(text_input_ids, text_mask)
+     kv_speaker_cond = model.get_kv_cache_speaker(speaker_latent.to(dtype))
+
+     if speaker_kv_scale is not None:
+         _multiply_kv_cache(kv_speaker_cond, speaker_kv_scale, speaker_kv_max_layers)
+
+     # masks prevent the decoder from attending to unconds:
+     kv_text_full = _concat_kv_caches(kv_text_cond, kv_text_cond, kv_text_cond)
+     kv_speaker_full = _concat_kv_caches(kv_speaker_cond, kv_speaker_cond, kv_speaker_cond)
+
+     full_text_mask = torch.cat([text_mask, text_mask_uncond, text_mask], dim=0)
+     full_speaker_mask = torch.cat([speaker_mask, speaker_mask, speaker_mask_uncond], dim=0)
+
+     x_t = torch.randn((batch_size, sequence_length, 80), device=device, dtype=torch.float32, generator=rng)
+     if truncation_factor is not None:
+         x_t = x_t * truncation_factor
+
+     for i in range(num_steps):
+         t, t_next = t_schedule[i], t_schedule[i + 1]
+
+         has_cfg = ((t >= cfg_min_t) * (t <= cfg_max_t)).item()
+
+         if has_cfg:
+             v_cond, v_uncond_text, v_uncond_speaker = model(
+                 x=torch.cat([x_t, x_t, x_t], dim=0).to(dtype),
+                 t=(torch.ones((batch_size * 3,), device=device) * t).to(dtype),
+                 text_mask=full_text_mask,
+                 speaker_mask=full_speaker_mask,
+                 kv_cache_text=kv_text_full,
+                 kv_cache_speaker=kv_speaker_full,
+             ).float().chunk(3, dim=0)
+             v_pred = v_cond + cfg_scale_text * (v_cond - v_uncond_text) + cfg_scale_speaker * (v_cond - v_uncond_speaker)  # can also use a single, joint unconditional for fewer NFE
+         else:
+             v_pred = model(
+                 x=x_t.to(dtype),
+                 t=(torch.ones((batch_size,), device=device) * t).to(dtype),
+                 text_mask=text_mask,
+                 speaker_mask=speaker_mask,
+                 kv_cache_text=kv_text_cond,
+                 kv_cache_speaker=kv_speaker_cond,
+             ).float()
+
+         # optional temporal score rescaling: https://arxiv.org/pdf/2510.01184
+         if rescale_k is not None and rescale_sigma is not None:
+             v_pred = _temporal_score_rescale(v_pred, x_t, t, rescale_k, rescale_sigma)
+
+         # optional speaker-KV scaling: undo the scale once t drops below speaker_kv_min_t
+         if speaker_kv_scale is not None and t_next < speaker_kv_min_t and t >= speaker_kv_min_t:
+             _multiply_kv_cache(kv_speaker_cond, 1. / speaker_kv_scale, speaker_kv_max_layers)
+             kv_speaker_full = _concat_kv_caches(kv_speaker_cond, kv_speaker_cond, kv_speaker_cond)
+
+         x_t = x_t + v_pred * (t_next - t)
+
+     return x_t
+
+
+ # ___________________________________________________________
+ # simple example
+
+ if __name__ == "__main__":
+     model = load_model_from_hf(delete_blockwise_modules=True)
+     fish_ae = load_fish_ae_from_hf()
+     pca_state = load_pca_state_from_hf()
+
+     speaker_audio_path = "/path/to/speaker/audio.wav"
+     speaker_audio = load_audio(speaker_audio_path).cuda()
+     speaker_latent, speaker_mask = get_speaker_latent_and_mask(fish_ae, pca_state, speaker_audio)
+
+     text = "[S1] Alright, I'm going to demo this new model called Echo TTS. Hopefully this works, I'm super excited to try this and see what it can do."
+     text_input_ids, text_mask = get_text_input_ids_and_mask([text], max_length=None, device="cuda")
+
+     latent_out = sample_euler_cfg_independent_guidances(
+         model=model,
+         speaker_latent=speaker_latent,
+         speaker_mask=speaker_mask,
+         text_input_ids=text_input_ids,
+         text_mask=text_mask,
+         rng_seed=0,
+         num_steps=40,
+         cfg_scale_text=3.0,
+         cfg_scale_speaker=8.0,
+         cfg_min_t=0.5,
+         cfg_max_t=1.0,
+         truncation_factor=0.8,
+         rescale_k=None,
+         rescale_sigma=None,
+         speaker_kv_scale=None,
+         speaker_kv_max_layers=None,
+         speaker_kv_min_t=None,
+         sequence_length=640,  # (max 640; shorter lengths will generate prefixes, not necessarily full generations)
+     )
+     audio_out = ae_decode(fish_ae, pca_state, latent_out)
+     audio_out = crop_audio_to_flattening_point(audio_out, latent_out[0])
+     torchaudio.save("output.wav", audio_out[0].cpu(), 44100)
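`tokenizer_encode` above is a byte-level tokenizer: there is no vocabulary file, only punctuation normalization, an optional `[S1]` speaker tag, and the raw UTF-8 bytes of the text with a BOS byte of 0 (matching `text_vocab_size=256`). A minimal, self-contained sketch of that behavior (reimplemented here for illustration, with both left and right curly quotes mapped to straight quotes; it is not imported from `inference.py` and returns a plain list instead of a tensor):

```python
# Self-contained sketch of the byte-level tokenization used by tokenizer_encode
# (mirrors the normalization rules above; illustrative, not the module itself).

def normalize_and_encode(text: str, append_bos: bool = True) -> list[int]:
    # Normalize punctuation toward what the model saw during training.
    for src, dst in [("…", "..."), ("’", "'"), ("“", '"'), ("”", '"'),
                     ("\n", " "), (":", ","), (";", ","), ("—", ", ")]:
        text = text.replace(src, dst)
    # Prepend a speaker tag when the prompt has none.
    if not text.startswith(("[", "(")) and "S1" not in text and "S2" not in text:
        text = "[S1] " + text
    ids = list(text.encode("utf-8"))  # raw UTF-8 bytes (vocab size 256)
    return ([0] + ids) if append_bos else ids  # 0 is the BOS byte

# Round-trip the bytes (minus BOS) back to text to see the normalized prompt:
print(bytes(normalize_and_encode("Hello; world…")[1:]).decode("utf-8"))
```

The print shows the normalized prompt `[S1] Hello, world...`, which is what actually gets fed to the text encoder.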
inference_blockwise.py ADDED
@@ -0,0 +1,220 @@
1
+ from typing import List
2
+
3
+ import torch
4
+
5
+ from inference import (
6
+ KVCache,
7
+ _concat_kv_caches,
8
+ _multiply_kv_cache,
9
+ _temporal_score_rescale,
10
+ )
11
+ from model import EchoDiT
12
+
13
+
14
+ @torch.inference_mode()
15
+ def sample_blockwise_euler_cfg_independent_guidances(
16
+ model: EchoDiT,
17
+ speaker_latent: torch.Tensor,
18
+ speaker_mask: torch.Tensor,
19
+ text_input_ids: torch.Tensor,
20
+ text_mask: torch.Tensor,
21
+ rng_seed: int,
22
+ block_sizes: List[int],
23
+ num_steps: int,
24
+ cfg_scale_text: float,
25
+ cfg_scale_speaker: float,
26
+ cfg_min_t: float,
27
+ cfg_max_t: float,
28
+ truncation_factor: float | None,
29
+ rescale_k: float | None,
30
+ rescale_sigma: float | None,
31
+ speaker_kv_scale: float | None,
32
+ speaker_kv_max_layers: int | None,
33
+ speaker_kv_min_t: float | None,
34
+ continuation_latent: torch.Tensor | None = None,
35
+ ) -> torch.Tensor:
36
+
37
+ INIT_SCALE = 0.999 # so that we can apply rescale to first step
38
+
39
+ device, dtype = model.device, model.dtype
40
+ batch_size = text_input_ids.shape[0]
41
+
42
+ rng = torch.Generator(device=device).manual_seed(rng_seed)
43
+
44
+ t_schedule = torch.linspace(1., 0., num_steps + 1, device=device) * INIT_SCALE
45
+
46
+ text_mask_uncond = torch.zeros_like(text_mask)
47
+ speaker_mask_uncond = torch.zeros_like(speaker_mask)
48
+
49
+ kv_text_cond = model.get_kv_cache_text(text_input_ids, text_mask)
50
+ kv_speaker_cond = model.get_kv_cache_speaker(speaker_latent.to(dtype))
51
+
52
+ # masks prevent decoder from attending to unconds:
53
+ kv_text_full = _concat_kv_caches(kv_text_cond, kv_text_cond, kv_text_cond)
54
+ kv_speaker_full = _concat_kv_caches(kv_speaker_cond, kv_speaker_cond, kv_speaker_cond)
55
+
56
+ full_text_mask = torch.cat([text_mask, text_mask_uncond, text_mask], dim=0)
57
+ full_speaker_mask = torch.cat([speaker_mask, speaker_mask, speaker_mask_uncond], dim=0)
58
+
59
+ prefix_latent = torch.zeros((batch_size, sum(block_sizes) , 80), device=device, dtype=torch.float32)
60
+
61
+ start_pos = 0
62
+ if continuation_latent is not None:
63
+ continuation_len = continuation_latent.shape[1]
64
+ prefix_latent = torch.cat([continuation_latent, prefix_latent], dim=1)
65
+ start_pos = continuation_len
66
+
67
+ for block_size in block_sizes:
68
+ if speaker_kv_scale is not None:
69
+ _multiply_kv_cache(kv_speaker_cond, speaker_kv_scale, speaker_kv_max_layers)
70
+ kv_speaker_full = _concat_kv_caches(kv_speaker_cond, kv_speaker_cond, kv_speaker_cond)
71
+
72
+ full_prefix_latent = torch.cat([prefix_latent, prefix_latent, prefix_latent], dim=0)
73
+ kv_latent_full = model.get_kv_cache_latent(full_prefix_latent.to(dtype))
74
+ kv_latent_cond = [(k[:batch_size], v[:batch_size]) for k, v in kv_latent_full]
75
+
76
+ x_t = torch.randn((batch_size, block_size, 80), device=device, dtype=torch.float32, generator=rng)
77
+ if truncation_factor is not None:
78
+ x_t = x_t * truncation_factor
79
+
80
+ for i in range(num_steps):
81
+ t, t_next = t_schedule[i], t_schedule[i + 1]
82
+
83
+ has_cfg = ((t >= cfg_min_t) * (t <= cfg_max_t)).item()
84
+
85
+ if has_cfg:
86
+ v_cond, v_uncond_text, v_uncond_speaker = model(
87
+ x=torch.cat([x_t, x_t, x_t], dim=0).to(dtype),
88
+ t=(torch.ones((batch_size * 3,), device=device) * t).to(dtype),
89
+ text_mask=full_text_mask,
90
+ speaker_mask=full_speaker_mask,
91
+ start_pos=start_pos,
92
+ kv_cache_text=kv_text_full,
93
+ kv_cache_speaker=kv_speaker_full,
94
+ kv_cache_latent=kv_latent_full,
95
+ ).float().chunk(3, dim=0)
96
+ v_pred = v_cond + cfg_scale_text * (v_cond - v_uncond_text) + cfg_scale_speaker * (v_cond - v_uncond_speaker)
97
+ else:
98
+ v_pred = model(
99
+ x=x_t.to(dtype),
100
+ t=(torch.ones((batch_size,), device=device) * t).to(dtype),
101
+ text_mask=text_mask,
102
+ speaker_mask=speaker_mask,
103
+ start_pos=start_pos,
104
+ kv_cache_text=kv_text_cond,
105
+ kv_cache_speaker=kv_speaker_cond,
106
+ kv_cache_latent=kv_latent_cond,
107
+ ).float()
108
+
109
+ # optional temporal score rescaling: https://arxiv.org/pdf/2510.01184
110
+ if rescale_k is not None and rescale_sigma is not None:
111
+ v_pred = _temporal_score_rescale(v_pred, x_t, t, rescale_k, rescale_sigma)
112
+
113
+ # optional kv speaker scaling
114
+ if speaker_kv_scale is not None and t_next < speaker_kv_min_t and t >= speaker_kv_min_t:
115
+ _multiply_kv_cache(kv_speaker_cond, 1. / speaker_kv_scale, speaker_kv_max_layers)
116
+ kv_speaker_full = _concat_kv_caches(kv_speaker_cond, kv_speaker_cond, kv_speaker_cond)
117
+
118
+ x_t = x_t + v_pred * (t_next - t)
119
+
120
+ prefix_latent[:, start_pos:start_pos + block_size] = x_t
121
+ start_pos += block_size
122
+
123
+ return prefix_latent
124
+
125
+
126
+ if __name__ == "__main__":
127
+ import torchaudio
128
+ from inference import (
129
+ load_model_from_hf,
130
+ load_fish_ae_from_hf,
131
+ load_pca_state_from_hf,
132
+ load_audio,
133
+ get_text_input_ids_and_mask,
134
+ get_speaker_latent_and_mask,
135
+ ae_encode,
136
+ ae_decode,
137
+ crop_audio_to_flattening_point,
138
+ )
139
+
140
+ model = load_model_from_hf()
141
+ fish_ae = load_fish_ae_from_hf()
142
+ pca_state = load_pca_state_from_hf()
143
+
144
+
145
+ # example 1, generate 320 in three blocks
146
+
147
+ speaker_audio_path = "/path/to/speaker/audio.wav"
148
+ speaker_audio = load_audio(speaker_audio_path).cuda()
149
+ speaker_latent, speaker_mask = get_speaker_latent_and_mask(fish_ae, pca_state, speaker_audio)
+
+ text = "[S1] Alright, I'm going to demo this new model called Echo TTS."
+ text_input_ids, text_mask = get_text_input_ids_and_mask([text], max_length=None, device="cuda")
+
+ latent_out = sample_blockwise_euler_cfg_independent_guidances(
+     model=model,
+     speaker_latent=speaker_latent,
+     speaker_mask=speaker_mask,
+     text_input_ids=text_input_ids,
+     text_mask=text_mask,
+     rng_seed=0,
+     block_sizes=[128, 128, 64],  # sums to 320, so ~15 seconds; supports up to 640
+     num_steps=40,
+     cfg_scale_text=3.0,
+     cfg_scale_speaker=5.0,
+     cfg_min_t=0.5,
+     cfg_max_t=1.0,
+     truncation_factor=0.8,
+     rescale_k=None,
+     rescale_sigma=None,
+     speaker_kv_scale=None,
+     speaker_kv_max_layers=None,
+     speaker_kv_min_t=None,
+ )
+ audio_out = ae_decode(fish_ae, pca_state, latent_out)
+ audio_out = crop_audio_to_flattening_point(audio_out, latent_out[0])
+ torchaudio.save("output_blockwise.wav", audio_out[0].cpu(), 44100)
+
+
+ # ___________________________________________________________
+ # example 2: with continuation latent (use the same speaker audio as the first example, and continue from its partial output)
+
+ continuation_audio_path = "output_blockwise.wav"  # can be any path
+ continuation_audio = load_audio(continuation_audio_path).cuda()
+ continuation_latent, continuation_mask = get_speaker_latent_and_mask(fish_ae, pca_state, continuation_audio)
+
+ continuation_latent = continuation_latent[:, :continuation_mask.sum()]
+
+ text = "[S1] Alright, I'm going to demo this new model called Echo TTS, and now, we're going to continue from the audio we already generated and add some more text."
+ # NOTE: this MUST include the transcript of the continuation prefix; you can use https://huggingface.co/jordand/whisper-d-v1a to get an in-distribution transcription automatically.
+
+ text_input_ids, text_mask = get_text_input_ids_and_mask([text], max_length=None, device="cuda")
+
+ continuation_block_sizes = [256]  # generate up to ~12 more seconds
+ # NOTE: these do not include the continuation latent length, so sum(block_sizes) + continuation_latent.shape[1] should be < 640 (to stay in-distribution with the training data)
+
+ latent_out_continued = sample_blockwise_euler_cfg_independent_guidances(
+     model=model,
+     speaker_latent=speaker_latent,
+     speaker_mask=speaker_mask,
+     text_input_ids=text_input_ids,
+     text_mask=text_mask,
+     rng_seed=0,
+     block_sizes=continuation_block_sizes,
+     num_steps=40,
+     cfg_scale_text=3.0,
+     cfg_scale_speaker=3.0,
+     cfg_min_t=0.5,
+     cfg_max_t=1.0,
+     truncation_factor=0.8,
+     rescale_k=None,
+     rescale_sigma=None,
+     speaker_kv_scale=None,
+     speaker_kv_max_layers=None,
+     speaker_kv_min_t=None,
+     continuation_latent=continuation_latent,
+ )
+ audio_out_continued = ae_decode(fish_ae, pca_state, latent_out_continued)
+ audio_out_continued = crop_audio_to_flattening_point(audio_out_continued, latent_out_continued[0])
+ torchaudio.save("output_blockwise_continued.wav", audio_out_continued[0].cpu(), 44100)
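The budget arithmetic in the notes above (block sizes exclude the continuation prefix, and the combined latent length should stay under 640, since 320 latents is roughly 15 seconds) can be sketched as a small check in plain Python. The helper names and the `max_total_latents=640` default are illustrative assumptions, not part of the repo:

```python
def continuation_budget_ok(block_sizes, prefix_len, max_total_latents=640):
    """Check that generated blocks plus the continuation prefix stay in-distribution."""
    return sum(block_sizes) + prefix_len < max_total_latents

def max_new_latents(prefix_len, max_total_latents=640):
    """Largest number of new latents that can still be requested after a prefix."""
    return max(0, max_total_latents - 1 - prefix_len)

# A 320-latent first pass (~15 s) leaves room for a [256] continuation block.
ok = continuation_budget_ok([256], prefix_len=320)
```

This is just the `sum(block_sizes) + continuation_latent.shape[1] < 640` condition from the comment, packaged so it can be asserted before sampling.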
model.py ADDED
@@ -0,0 +1,642 @@
+ from typing import Tuple, List
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
+     freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)] / dim))
+     t = torch.arange(end)
+     freqs = torch.outer(t, freqs)
+     freqs_cis = torch.complex(torch.cos(freqs), torch.sin(freqs))
+     return freqs_cis
+
+
+ def apply_rotary_emb(
+     x: torch.Tensor,
+     freqs_cis: torch.Tensor,
+ ) -> torch.Tensor:
+     x_ = torch.view_as_complex(x.float().reshape(*x.shape[:3], -1, 2))
+     x_ = x_ * freqs_cis[..., None, :]
+     x_ = torch.view_as_real(x_).reshape(x.shape)
+     return x_.type_as(x)
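`precompute_freqs_cis` stores one complex phase `e^(i * t * freq)` per position and per channel pair, and `apply_rotary_emb` rotates each (even, odd) channel pair by that phase. The same mechanism in plain Python with `cmath`, as a sketch of the math rather than the tensor API:

```python
import cmath

def freqs_cis(dim, end, theta=10000.0):
    # freqs[k] = theta^(-2k/dim); one complex phase per position t and channel pair k.
    freqs = [theta ** (-(2 * k) / dim) for k in range(dim // 2)]
    return [[cmath.exp(1j * t * f) for f in freqs] for t in range(end)]

def rotate(pairs, phases):
    # Each (even, odd) channel pair is treated as one complex number and rotated.
    return [complex(a, b) * p for (a, b), p in zip(pairs, phases)]

fc = freqs_cis(dim=4, end=8)
out = rotate([(1.0, 0.0), (0.0, 1.0)], fc[1])
```

Two properties worth noting: position 0 gets the identity rotation (all phases equal 1), and rotation never changes a pair's magnitude, which is why RoPE can be applied after the q/k norms above without disturbing their scale.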
+
+
+ def get_timestep_embedding(
+     timestep: torch.Tensor,
+     embed_size: int,
+ ) -> torch.Tensor:
+     assert embed_size % 2 == 0
+
+     half = embed_size // 2
+
+     freqs = 1000 * torch.exp(
+         -torch.log(torch.tensor(10000.0)) *
+         torch.arange(start=0, end=half, dtype=torch.float32) / half
+     ).to(timestep.device)
+
+     args = timestep[..., None] * freqs[None]
+     embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+
+     return embedding.to(timestep.dtype)
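This is the standard sinusoidal embedding, with frequencies scaled by 1000 so diffusion times in [0, 1] sweep a wide phase range. The same computation mirrored in plain Python (for a scalar timestep; illustrative, not the repo's API):

```python
import math

def timestep_embedding(t: float, embed_size: int) -> list[float]:
    assert embed_size % 2 == 0
    half = embed_size // 2
    # freqs[i] = 1000 * 10000^(-i / half), matching the torch version above.
    freqs = [1000.0 * math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    # First half cosines, second half sines, as in torch.cat([cos, sin], dim=-1).
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]
```

At `t = 0` the embedding is all ones followed by all zeros, which makes the zero-timestep conditioning a fixed, easily learnable vector.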
+
+
+ class LowRankAdaLN(nn.Module):
+     def __init__(
+         self,
+         model_size: int,
+         rank: int,
+         eps: float
+     ):
+         super().__init__()
+         self.eps = eps
+
+         self.shift_down = nn.Linear(model_size, rank, bias=False)
+         self.scale_down = nn.Linear(model_size, rank, bias=False)
+         self.gate_down = nn.Linear(model_size, rank, bias=False)
+
+         self.shift_up = nn.Linear(rank, model_size, bias=True)
+         self.scale_up = nn.Linear(rank, model_size, bias=True)
+         self.gate_up = nn.Linear(rank, model_size, bias=True)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         cond_embed: torch.Tensor,
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+
+         shift, scale, gate = cond_embed.chunk(3, dim=-1)
+
+         shift = self.shift_up(self.shift_down(F.silu(shift))) + shift
+         scale = self.scale_up(self.scale_down(F.silu(scale))) + scale
+         gate = self.gate_up(self.gate_down(F.silu(gate))) + gate
+
+         x_dtype = x.dtype
+         x = x.float()
+         x = x * torch.rsqrt(torch.pow(x, 2).mean(dim=-1, keepdim=True) + self.eps)
+         x = x * (scale + 1) + shift
+
+         gate = torch.tanh(gate)
+
+         return x.to(x_dtype), gate
+
+
+ class RMSNorm(nn.Module):  # could also just use torch's built-in RMSNorm
+     def __init__(
+         self,
+         model_size: int | Tuple[int, int],
+         eps: float
+     ):
+         super().__init__()
+         self.eps = eps
+
+         if isinstance(model_size, int):
+             model_size = (model_size, )
+         self.weight = nn.Parameter(torch.ones(model_size))
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         x_dtype = x.dtype
+         x = x.float()
+         x = x * torch.rsqrt(torch.pow(x, 2).mean(dim=-1, keepdim=True) + self.eps)
+         x = x * self.weight
+         return x.to(x_dtype)
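RMSNorm divides by the root-mean-square of the features (no mean subtraction, unlike LayerNorm) and rescales by a learned weight; the AdaLN block above applies the same normalization before its shift/scale. A scalar sketch in plain Python (unit weights assumed for illustration):

```python
import math

def rms_norm(x: list[float], weight: list[float], eps: float = 1e-5) -> list[float]:
    # x / sqrt(mean(x^2) + eps), then an elementwise learned scale.
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
```

With unit weights the output always has RMS very close to 1 (exactly 1 up to the `eps` term), which is the invariant the norm is there to enforce.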
+
+
+ class SelfAttention(nn.Module):
+     def __init__(
+         self,
+         model_size: int,
+         num_heads: int,
+         is_causal: bool,
+         norm_eps: float
+     ):
+         super().__init__()
+         self.num_heads = num_heads
+         self.is_causal = is_causal
+
+         self.wq = nn.Linear(model_size, model_size, bias=False)
+         self.wk = nn.Linear(model_size, model_size, bias=False)
+         self.wv = nn.Linear(model_size, model_size, bias=False)
+         self.wo = nn.Linear(model_size, model_size, bias=False)
+         self.gate = nn.Linear(model_size, model_size, bias=False)
+
+         assert model_size % num_heads == 0
+         self.q_norm = RMSNorm((num_heads, model_size // num_heads), eps=norm_eps)
+         self.k_norm = RMSNorm((num_heads, model_size // num_heads), eps=norm_eps)
+
+     def forward(self, x: torch.Tensor, mask: torch.Tensor | None, freqs_cis: torch.Tensor) -> torch.Tensor:
+         batch_size, seq_len = x.shape[:2]
+
+         xq = self.wq(x).reshape(batch_size, seq_len, self.num_heads, -1)
+         xk = self.wk(x).reshape(batch_size, seq_len, self.num_heads, -1)
+         xv = self.wv(x).reshape(batch_size, seq_len, self.num_heads, -1)
+
+         gate = self.gate(x)
+
+         xq = self.q_norm(xq)
+         xk = self.k_norm(xk)
+
+         xq = apply_rotary_emb(xq, freqs_cis[:seq_len])
+         xk = apply_rotary_emb(xk, freqs_cis[:seq_len])
+
+         if mask is not None:
+             assert mask.ndim == 2  # (b, s)
+             mask = mask[:, None, None]
+
+         output = F.scaled_dot_product_attention(
+             query=xq.transpose(1, 2),
+             key=xk.transpose(1, 2),
+             value=xv.transpose(1, 2),
+             attn_mask=mask,
+             is_causal=self.is_causal
+         ).transpose(1, 2)
+
+         output = output.reshape(batch_size, seq_len, -1)
+         output = output * torch.sigmoid(gate)
+
+         output = self.wo(output)
+
+         return output
+
+
+ class JointAttention(nn.Module):
+     def __init__(
+         self,
+         model_size: int,
+         num_heads: int,
+         text_model_size: int,
+         speaker_model_size: int,
+         speaker_patch_size: int,
+         norm_eps: float
+     ):
+         super().__init__()
+         self.speaker_patch_size = speaker_patch_size
+         self.num_heads = num_heads
+
+         self.wq = nn.Linear(model_size, model_size, bias=False)
+         self.wk = nn.Linear(model_size, model_size, bias=False)
+         self.wv = nn.Linear(model_size, model_size, bias=False)
+
+         self.wk_text = nn.Linear(text_model_size, model_size, bias=False)
+         self.wv_text = nn.Linear(text_model_size, model_size, bias=False)
+
+         self.wk_speaker = nn.Linear(speaker_model_size, model_size, bias=False)
+         self.wv_speaker = nn.Linear(speaker_model_size, model_size, bias=False)
+
+         self.wk_latent = nn.Linear(speaker_model_size, model_size, bias=False)
+         self.wv_latent = nn.Linear(speaker_model_size, model_size, bias=False)
+
+         assert model_size % num_heads == 0
+         self.head_dim = model_size // num_heads
+         self.q_norm = RMSNorm((num_heads, self.head_dim), eps=norm_eps)
+         self.k_norm = RMSNorm((num_heads, self.head_dim), eps=norm_eps)
+
+         self.gate = nn.Linear(model_size, model_size, bias=False)
+
+         self.wo = nn.Linear(model_size, model_size, bias=False)
+
+     def _apply_rotary_half(self, y: torch.Tensor, fc: torch.Tensor) -> torch.Tensor:
+         y1, y2 = y.chunk(2, dim=-2)
+         y1 = apply_rotary_emb(y1, fc)
+         return torch.cat([y1, y2], dim=-2)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         text_mask: torch.Tensor,
+         speaker_mask: torch.Tensor,
+         freqs_cis: torch.Tensor,
+         kv_cache_text: Tuple[torch.Tensor, torch.Tensor],
+         kv_cache_speaker: Tuple[torch.Tensor, torch.Tensor],
+         start_pos: int | None,
+         kv_cache_latent: Tuple[torch.Tensor, torch.Tensor] | None
+     ) -> torch.Tensor:
+         batch_size, seq_len = x.shape[:2]
+
+         xq = self.wq(x).reshape(batch_size, seq_len, self.num_heads, -1)
+         xk_self = self.wk(x).reshape(batch_size, seq_len, self.num_heads, -1)
+         xv_self = self.wv(x).reshape(batch_size, seq_len, self.num_heads, -1)
+
+         xq = self.q_norm(xq)
+         xk_self = self.k_norm(xk_self)
+
+         gate = self.gate(x)
+
+         if start_pos is None:
+             start_pos = 0
+
+         freqs_q = freqs_cis[start_pos : start_pos + seq_len]
+
+         xq = self._apply_rotary_half(xq, freqs_q)
+         xk_self = self._apply_rotary_half(xk_self, freqs_q)
+
+         xk_text, xv_text = kv_cache_text
+         xk_speaker, xv_speaker = kv_cache_speaker
+
+         if kv_cache_latent is None or kv_cache_latent[0].shape[1] == 0:
+             xk_latent = torch.zeros((batch_size, 0, self.num_heads, xq.shape[-1]), device=x.device, dtype=x.dtype)
+             xv_latent = torch.zeros((batch_size, 0, self.num_heads, xq.shape[-1]), device=x.device, dtype=x.dtype)
+             latent_mask = torch.zeros((batch_size, 0), dtype=torch.bool, device=x.device)
+         else:
+             xk_latent, xv_latent = kv_cache_latent
+             latent_positions = torch.arange(xk_latent.shape[1], device=x.device, dtype=torch.long) * self.speaker_patch_size
+             latent_mask = (latent_positions[None, :] < start_pos).expand(batch_size, xk_latent.shape[1])
+
+         xk = torch.cat([xk_self, xk_latent, xk_text, xk_speaker], dim=1)
+         xv = torch.cat([xv_self, xv_latent, xv_text, xv_speaker], dim=1)
+
+         self_mask = torch.ones((batch_size, seq_len), dtype=torch.bool, device=x.device)
+
+         mask = torch.cat([self_mask, latent_mask, text_mask, speaker_mask], dim=1)
+         mask = mask[:, None, None]
+
+         output = F.scaled_dot_product_attention(
+             query=xq.transpose(1, 2),
+             key=xk.transpose(1, 2),
+             value=xv.transpose(1, 2),
+             attn_mask=mask,
+             is_causal=False
+         ).transpose(1, 2)
+
+         output = output.reshape(batch_size, seq_len, -1)
+         output = output * torch.sigmoid(gate)
+
+         output = self.wo(output)
+
+         return output
+
+     def get_kv_cache_text(self, text_state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+         batch_size = text_state.shape[0]
+         xk = self.wk_text(text_state).reshape(batch_size, text_state.shape[1], self.num_heads, -1)
+         xv = self.wv_text(text_state).reshape(batch_size, text_state.shape[1], self.num_heads, -1)
+         xk = self.k_norm(xk)
+         return xk, xv
+
+     def get_kv_cache_speaker(self, speaker_state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+         batch_size = speaker_state.shape[0]
+         xk = self.wk_speaker(speaker_state).reshape(batch_size, speaker_state.shape[1], self.num_heads, -1)
+         xv = self.wv_speaker(speaker_state).reshape(batch_size, speaker_state.shape[1], self.num_heads, -1)
+         xk = self.k_norm(xk)
+         return xk, xv
+
+     def get_kv_cache_latent(self, latent_state: torch.Tensor, freqs_cis: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+         batch_size = latent_state.shape[0]
+         seq_len = latent_state.shape[1]
+         xk = self.wk_latent(latent_state).reshape(batch_size, seq_len, self.num_heads, -1)
+         xv = self.wv_latent(latent_state).reshape(batch_size, seq_len, self.num_heads, -1)
+         xk = self.k_norm(xk)
+
+         xk = self._apply_rotary_half(xk, freqs_cis)
+
+         return xk, xv
+
+
+ class MLP(nn.Module):
+     def __init__(
+         self,
+         model_size: int,
+         intermediate_size: int
+     ):
+         super().__init__()
+         self.w1 = nn.Linear(model_size, intermediate_size, bias=False)
+         self.w3 = nn.Linear(model_size, intermediate_size, bias=False)
+         self.w2 = nn.Linear(intermediate_size, model_size, bias=False)
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         return self.w2(F.silu(self.w1(x)) * self.w3(x))
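This is the SwiGLU-style gated MLP: `w1`'s output passes through SiLU and multiplicatively gates `w3`'s output before the `w2` down-projection. The gating nonlinearity in scalar form, as a plain-Python sketch:

```python
import math

def silu(x: float) -> float:
    # silu(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_gate(a: float, b: float) -> float:
    # a plays the role of w1(x), b of w3(x); w2 would then project the product back down.
    return silu(a) * b
```

SiLU is near-zero for large negative inputs and near-identity for large positive ones, so each intermediate channel smoothly switches the corresponding `w3` channel on or off.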
+
+
+ class EncoderTransformerBlock(nn.Module):
+     def __init__(
+         self,
+         model_size: int,
+         num_heads: int,
+         intermediate_size: int,
+         is_causal: bool,
+         norm_eps: float
+     ):
+         super().__init__()
+         self.attention = SelfAttention(
+             model_size=model_size,
+             num_heads=num_heads,
+             is_causal=is_causal,
+             norm_eps=norm_eps
+         )
+         self.mlp = MLP(
+             model_size=model_size,
+             intermediate_size=intermediate_size
+         )
+
+         self.attention_norm = RMSNorm(model_size, norm_eps)
+         self.mlp_norm = RMSNorm(model_size, norm_eps)
+
+     def forward(self, x: torch.Tensor, mask: torch.Tensor | None, freqs_cis: torch.Tensor) -> torch.Tensor:
+         x = x + self.attention(self.attention_norm(x), mask, freqs_cis)
+         x = x + self.mlp(self.mlp_norm(x))
+
+         return x
+
+
+ class TransformerBlock(nn.Module):
+     def __init__(
+         self,
+         model_size: int,
+         num_heads: int,
+         intermediate_size: int,
+         norm_eps: float,
+         text_model_size: int,
+         speaker_model_size: int,
+         speaker_patch_size: int,
+         adaln_rank: int,
+     ):
+         super().__init__()
+         self.attention = JointAttention(
+             model_size=model_size,
+             num_heads=num_heads,
+             text_model_size=text_model_size,
+             speaker_model_size=speaker_model_size,
+             speaker_patch_size=speaker_patch_size,
+             norm_eps=norm_eps
+         )
+
+         self.mlp = MLP(
+             model_size=model_size,
+             intermediate_size=intermediate_size
+         )
+
+         self.attention_adaln = LowRankAdaLN(model_size=model_size, rank=adaln_rank, eps=norm_eps)
+         self.mlp_adaln = LowRankAdaLN(model_size=model_size, rank=adaln_rank, eps=norm_eps)
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         cond_embed: torch.Tensor,
+         text_mask: torch.Tensor,
+         speaker_mask: torch.Tensor,
+         freqs_cis: torch.Tensor,
+         kv_cache_text: Tuple[torch.Tensor, torch.Tensor],
+         kv_cache_speaker: Tuple[torch.Tensor, torch.Tensor],
+         start_pos: int | None,
+         kv_cache_latent: Tuple[torch.Tensor, torch.Tensor] | None,
+     ) -> torch.Tensor:
+
+         x_norm, attention_gate = self.attention_adaln(x, cond_embed)
+         x = x + attention_gate * self.attention(x_norm, text_mask, speaker_mask, freqs_cis, kv_cache_text, kv_cache_speaker, start_pos, kv_cache_latent)
+
+         x_norm, mlp_gate = self.mlp_adaln(x, cond_embed)
+         x = x + mlp_gate * self.mlp(x_norm)
+
+         return x
+
+
+ class TextEncoder(nn.Module):
+     def __init__(
+         self,
+         vocab_size: int,
+         model_size: int,
+         num_layers: int,
+         num_heads: int,
+         intermediate_size: int,
+         norm_eps: float,
+     ):
+         super().__init__()
+         self.text_embedding = nn.Embedding(vocab_size, model_size)
+
+         self.blocks = nn.ModuleList()
+         for i in range(num_layers):
+             block = EncoderTransformerBlock(
+                 model_size=model_size,
+                 num_heads=num_heads,
+                 intermediate_size=intermediate_size,
+                 is_causal=False,
+                 norm_eps=norm_eps
+             )
+             self.blocks.append(block)
+
+         self.head_dim = model_size // num_heads
+
+     def forward(self, input_ids: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
+         x = self.text_embedding(input_ids)
+
+         freqs_cis = precompute_freqs_cis(self.head_dim, input_ids.shape[1]).to(x.device)  # could cache
+
+         for block in self.blocks:
+             x = block(x, mask, freqs_cis)
+
+         return x
+
+
+ class SpeakerEncoder(nn.Module):
+     def __init__(
+         self,
+         latent_size: int,
+         patch_size: int,
+         model_size: int,
+         num_layers: int,
+         num_heads: int,
+         intermediate_size: int,
+         norm_eps: float,
+     ):
+         super().__init__()
+         self.patch_size = patch_size
+
+         self.in_proj = nn.Linear(latent_size * patch_size, model_size, bias=True)
+
+         self.blocks = nn.ModuleList()
+         for i in range(num_layers):
+             block = EncoderTransformerBlock(
+                 model_size=model_size,
+                 num_heads=num_heads,
+                 intermediate_size=intermediate_size,
+                 is_causal=True,
+                 norm_eps=norm_eps
+             )
+             self.blocks.append(block)
+
+         self.head_dim = model_size // num_heads
+
+     def forward(self, latent: torch.Tensor) -> torch.Tensor:
+         x = latent.reshape(*latent.shape[:-2], latent.shape[-2] // self.patch_size, latent.shape[-1] * self.patch_size)
+
+         x = self.in_proj(x)
+         x = x / 6.  # this helped with initial activation dynamics in early ablations; could also bake into in_proj
+
+         freqs_cis = precompute_freqs_cis(self.head_dim, x.shape[1]).to(x.device)  # could cache
+
+         for block in self.blocks:
+             x = block(x, None, freqs_cis)
+
+         return x
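The reshape at the top of `SpeakerEncoder.forward` is a patchify step: with a row-major layout, viewing a `(T, C)` latent as `(T // patch_size, C * patch_size)` concatenates each group of `patch_size` consecutive frames into one flat token. A plain-Python sketch of that grouping (illustrative helper, not part of the repo):

```python
def patchify(latent: list[list[float]], patch_size: int) -> list[list[float]]:
    # Group patch_size consecutive frames (each a feature vector) into one flat token,
    # mirroring latent.reshape(..., T // patch_size, C * patch_size) above.
    assert len(latent) % patch_size == 0
    return [
        [v for frame in latent[i : i + patch_size] for v in frame]
        for i in range(0, len(latent), patch_size)
    ]

tokens = patchify([[1, 2], [3, 4], [5, 6], [7, 8]], patch_size=2)
```

This is also why the DiT's `speaker_mask` is strided by `speaker_patch_size` before use: one encoder token covers `patch_size` latent frames.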
+
+
+ class EchoDiT(nn.Module):
+     def __init__(
+         self,
+         latent_size: int,
+         #
+         model_size: int,
+         num_layers: int,
+         num_heads: int,
+         intermediate_size: int,
+         norm_eps: float,
+         #
+         text_vocab_size: int,
+         text_model_size: int,
+         text_num_layers: int,
+         text_num_heads: int,
+         text_intermediate_size: int,
+         #
+         speaker_patch_size: int,
+         speaker_model_size: int,
+         speaker_num_layers: int,
+         speaker_num_heads: int,
+         speaker_intermediate_size: int,
+         #
+         timestep_embed_size: int,
+         adaln_rank: int,
+     ):
+         super().__init__()
+         self.speaker_patch_size = speaker_patch_size
+         self.timestep_embed_size = timestep_embed_size
+
+         self.text_encoder = TextEncoder(
+             vocab_size=text_vocab_size,
+             model_size=text_model_size,
+             num_layers=text_num_layers,
+             num_heads=text_num_heads,
+             intermediate_size=text_intermediate_size,
+             norm_eps=norm_eps,
+         )
+         self.speaker_encoder = SpeakerEncoder(
+             latent_size=latent_size,
+             patch_size=speaker_patch_size,
+             model_size=speaker_model_size,
+             num_layers=speaker_num_layers,
+             num_heads=speaker_num_heads,
+             intermediate_size=speaker_intermediate_size,
+             norm_eps=norm_eps,
+         )
+         self.latent_encoder = SpeakerEncoder(
+             latent_size=latent_size,
+             patch_size=speaker_patch_size,
+             model_size=speaker_model_size,
+             num_layers=speaker_num_layers,
+             num_heads=speaker_num_heads,
+             intermediate_size=speaker_intermediate_size,
+             norm_eps=norm_eps,
+         )
+         self.text_norm = RMSNorm(text_model_size, norm_eps)
+         self.speaker_norm = RMSNorm(speaker_model_size, norm_eps)
+         self.latent_norm = RMSNorm(speaker_model_size, norm_eps)
+
+         self.cond_module = nn.Sequential(
+             nn.Linear(timestep_embed_size, model_size, bias=False),
+             nn.SiLU(),
+             nn.Linear(model_size, model_size, bias=False),
+             nn.SiLU(),
+             nn.Linear(model_size, model_size * 3, bias=False),
+         )
+
+         self.in_proj = nn.Linear(latent_size, model_size, bias=True)
+
+         self.blocks = nn.ModuleList()
+         for i in range(num_layers):
+             block = TransformerBlock(
+                 model_size=model_size,
+                 num_heads=num_heads,
+                 intermediate_size=intermediate_size,
+                 norm_eps=norm_eps,
+                 text_model_size=text_model_size,
+                 speaker_model_size=speaker_model_size,
+                 speaker_patch_size=speaker_patch_size,
+                 adaln_rank=adaln_rank,
+             )
+             self.blocks.append(block)
+
+         self.out_norm = RMSNorm(model_size, norm_eps)
+         self.out_proj = nn.Linear(model_size, latent_size, bias=True)
+
+         self.head_dim = model_size // num_heads
+
+     def forward(
+         self,
+         x: torch.Tensor,
+         t: torch.Tensor,
+         text_mask: torch.Tensor,
+         speaker_mask: torch.Tensor,
+         kv_cache_text: List[Tuple[torch.Tensor, torch.Tensor]],
+         kv_cache_speaker: List[Tuple[torch.Tensor, torch.Tensor]],
+         start_pos: int | None = None,
+         kv_cache_latent: List[Tuple[torch.Tensor, torch.Tensor]] | None = None,
+     ) -> torch.Tensor:
+
+         if start_pos is None:
+             start_pos = 0
+
+         max_pos = start_pos + x.shape[1]
+         freqs_cis = precompute_freqs_cis(self.head_dim, max_pos).to(x.device)  # could cache
+
+         speaker_mask = speaker_mask[..., ::self.speaker_patch_size]
+
+         cond_embed = self.cond_module(get_timestep_embedding(t, self.timestep_embed_size))
+         cond_embed = cond_embed[:, None]
+
+         x = self.in_proj(x)
+
+         for i, block in enumerate(self.blocks):
+             x = block(
+                 x=x,
+                 cond_embed=cond_embed,
+                 text_mask=text_mask,
+                 speaker_mask=speaker_mask,
+                 freqs_cis=freqs_cis,
+                 kv_cache_text=kv_cache_text[i],
+                 kv_cache_speaker=kv_cache_speaker[i],
+                 start_pos=start_pos,
+                 kv_cache_latent=kv_cache_latent[i] if kv_cache_latent is not None else None,
+             )
+
+         x = self.out_norm(x)
+         x = self.out_proj(x)
+
+         return x.float()
+
+     def get_kv_cache_text(
+         self,
+         text_input_ids: torch.Tensor,
+         text_mask: torch.Tensor | None,
+     ) -> List[Tuple[torch.Tensor, torch.Tensor]]:
+         text_state = self.text_encoder(text_input_ids, text_mask)
+         text_state = self.text_norm(text_state)
+         return [block.attention.get_kv_cache_text(text_state) for block in self.blocks]
+
+     def get_kv_cache_speaker(
+         self,
+         speaker_latent: torch.Tensor,
+     ) -> List[Tuple[torch.Tensor, torch.Tensor]]:
+         speaker_state = self.speaker_encoder(speaker_latent)
+         speaker_state = self.speaker_norm(speaker_state)
+         return [block.attention.get_kv_cache_speaker(speaker_state) for block in self.blocks]
+
+     def get_kv_cache_latent(
+         self,
+         prefix_latent: torch.Tensor,
+     ) -> List[Tuple[torch.Tensor, torch.Tensor]]:
+         latent_state = self.latent_encoder(prefix_latent)
+         latent_state = self.latent_norm(latent_state)
+
+         seq_len = latent_state.shape[1]
+         max_pos = seq_len * self.speaker_patch_size
+         freqs_cis = precompute_freqs_cis(self.head_dim, max_pos).to(latent_state.device)  # could cache
+         positions = torch.arange(seq_len, device=latent_state.device) * self.speaker_patch_size
+         freqs_latent = freqs_cis[positions]
+
+         return [block.attention.get_kv_cache_latent(latent_state, freqs_latent) for block in self.blocks]
+
+     @property
+     def device(self) -> torch.device: return next(self.parameters()).device
+
+     @property
+     def dtype(self) -> torch.dtype: return next(self.parameters()).dtype
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ torch>=2.9.1
+ torchaudio>=2.9.1
+ torchcodec>=0.8.1
+ huggingface-hub
+ numpy
+ safetensors
+ einops
+ gradio==5.49.1
sampler_presets.json ADDED
@@ -0,0 +1,62 @@
+ {
+     "Independent-High-Speaker-CFG": {
+         "num_steps": "40",
+         "cfg_scale_text": "3.0",
+         "cfg_scale_speaker": "8.0",
+         "cfg_min_t": "0.5",
+         "cfg_max_t": "1.0",
+         "truncation_factor": "1.",
+         "rescale_k": "1.",
+         "rescale_sigma": "3.0"
+     },
+     "Independent-High-Speaker-CFG-Flat": {
+         "num_steps": "40",
+         "cfg_scale_text": "3.0",
+         "cfg_scale_speaker": "8.0",
+         "cfg_min_t": "0.5",
+         "cfg_max_t": "1.0",
+         "truncation_factor": "0.8",
+         "rescale_k": "1.2",
+         "rescale_sigma": "3.0"
+     },
+     "Independent-High-CFG": {
+         "num_steps": "40",
+         "cfg_scale_text": "8.0",
+         "cfg_scale_speaker": "8.0",
+         "cfg_min_t": "0.5",
+         "cfg_max_t": "1.0",
+         "truncation_factor": "1.",
+         "rescale_k": "1.",
+         "rescale_sigma": "3.0"
+     },
+     "Independent-High-CFG-Flat": {
+         "num_steps": "40",
+         "cfg_scale_text": "8.0",
+         "cfg_scale_speaker": "8.0",
+         "cfg_min_t": "0.5",
+         "cfg_max_t": "1.0",
+         "truncation_factor": "0.8",
+         "rescale_k": "1.2",
+         "rescale_sigma": "3.0"
+     },
+     "Independent-Low-CFG": {
+         "num_steps": "40",
+         "cfg_scale_text": "3.0",
+         "cfg_scale_speaker": "3.0",
+         "cfg_min_t": "0.5",
+         "cfg_max_t": "1.0",
+         "truncation_factor": "1.",
+         "rescale_k": "1.",
+         "rescale_sigma": "3.0"
+     },
+     "Independent-Low-CFG-Flat": {
+         "num_steps": "40",
+         "cfg_scale_text": "3.0",
+         "cfg_scale_speaker": "3.0",
+         "cfg_min_t": "0.5",
+         "cfg_max_t": "1.0",
+         "truncation_factor": "0.8",
+         "rescale_k": "1.2",
+         "rescale_sigma": "3.0"
+     }
+ }
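Note that every preset value is stored as a string (convenient for populating UI fields). To feed a preset straight into the sampler's keyword arguments, the strings need to be cast back to numbers. A minimal loader sketch, assuming `num_steps` is the only integer field (the function name and cast rule are assumptions, not part of the repo):

```python
import json

def load_preset(presets_json: str, name: str) -> dict:
    # Cast each string field: int for num_steps, float for everything else.
    raw = json.loads(presets_json)[name]
    return {k: int(v) if k == "num_steps" else float(v) for k, v in raw.items()}

presets = '{"Independent-Low-CFG": {"num_steps": "40", "cfg_scale_text": "3.0", "truncation_factor": "1."}}'
cfg = load_preset(presets, "Independent-Low-CFG")
```

`float("1.")` handles the trailing-dot values like `"1."` in the file without any special casing.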
text_presets.txt ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Reading | [S1] The old lighthouse keeper had seen many storms in his thirty years on the rock, but nothing like this. The fog rolled in thick as wool, swallowing the beam of light before it could reach the churning waves below. Then he heard it, three short bells from the channel, where no ship should be at this hour. He grabbed his lantern and peered into the mist, his heart pounding. Something was out there, something that shouldn't exist.
2
+
3
+ Reading | [S1] Deep beneath the ocean's surface, where sunlight fades to perpetual twilight, extraordinary creatures have evolved in ways that defy imagination. Bioluminescent jellyfish pulse with ethereal blue light, while giant squid hunt in the crushing darkness. At depths of over two miles, the pressure is immense, enough to collapse a submarine, yet life persists.
4
+
5
+ Reading | [S1] The telegram arrived on a Tuesday morning in June, nineteen forty-three. Margaret's hands trembled as she tore open the envelope, dreading the words she knew might be inside. Her brother had shipped out to North Africa six months ago, and his letters had grown increasingly sparse.
6
+
7
+ Cartoon | [S1] After giving everything some more thought, I've decided it's in the best interest of humanity to acquire Nexus AI. (laughs) I've spoken with the CEO and he's on board. Well (laughs), at least that's the impression he gave initially.
8
+
9
+ Single (Disfluent) | [S1] ... explore how we can design, create interfaces that are not confusing, but at the same time can be powerful. Um, you know, I think, uh, in the, the famous, um, usability book, it's, uh, it's this, um, um, oh, geez, I'm, I'm blanking on the term, uh, uh, the, the rule about, um, uh, it's like the simplicity rule. I can't recall. Oh, cognitive load maybe.
10
+
11
+ Single (Disfluent) | [S1] Uh, complacency when the motivation isn't structured properly. Like for example, if you, if you're in the cor- if you work in the corporation for many years, a lot of corporate employees, they just, they're, they're aiming for that stock vesting and they're, they're doing just a sufficient job to, to, to reach that vesting and, and they don't, they're not performing any better than that. Um, and so I think, um, that showed me an important insight. Yeah.
12
+
13
+ Single (Disfluent) | [S1] We see the pattern of revelations, major shifts. I think Neptune in Pisces, which that transit has been happening all of 2021, and Neptune will remain in the sign of Pisces until March of 2029. So it's several years more of this transit. And what it brings is a lot of things, you know, the thing that I tend to emphasize is the profound dissolution or profound changes
14
+
15
+ Single (Disfluent) | [S1] I asked her, "Do you have like a phrase you use," and she mentioned she actually does. Like when things get tense, when there's like a moment, like if her, if her roommate is like venting about work drama or just like is stressed, and her, her roommate like deals with anxiety, I'm like, "Oh, this is probably how it feels to live with me." But, um, and like if, if, if things are rough, like she'll internally just like use this practice where she's like, like, "Not my problem, not mine to carry, not mine to handle, not mine to change." Like she'll sort of repeat that. So that's interesting.
16
+
17
+ Single (Disfluent) | [S1] If I examine the, the, if, if you examine the range of options, uh, beginning from, like, say, individual all the way, right? There will be some revenue stream, uh, there will be some purchase, there'll be some hardware profit margin for someone who creates a smart product, um, uh, there will be memberships, personal and business, uh, and then there'll be usage-based, right? So I still believe that that's kinda how, those are all the metrics. To your point, what is a membership? Up to now, folks
18
+
19
+ Single (Disfluent) | [S1] I think, if, if we can keep it under 25 points allowed, sure, our odds improve significantly. We wouldn't need to put up huge numbers ourselves, or at least that's the theory. And I should, I want to share some other stats which might be a bit outside our current discussion, but regarding this compared to 2018, the team's final four games that year, they managed 18 points total.
20
+
21
+ Singing | [S1] (singing) Amazing grace, how sweet the sound, that saved a wretch like me. I once was lost, but now am found, was blind, but now I see.
22
+
23
+ Conversation | [S1] Alright then. So, so 18 years you spent in that, uh, in that role, but alongside that in, in, was it while you were working that position in '93, you started doing some work with the network? [S2] Uh, yes. It was somewhere around '93. I, I, I played tennis pretty well, you know? I, I, I competed as a tennis player. And the, I got a chance to do some broadcasting over in Brisbane.
24
+
25
+ Conversation | [S1] ... that will provide the analytics component- [S2] Right. [S1] ... to ideally get you to adopt some of their other tools. And- [S2] (laughs) [S1] ... some of those features are valuable too. [S2] That's interesting. [S1] Mailchimp, I mean, that's campaign manage-, uh, not exactly campaign management, but messaging platforms. [S2] Uh-huh. [S1] The, the companies that are, you know,
26
+
27
+ Conversation | [S1] They were like, they were pumped for it, going wild for it, and it disappeared immediately. [S2] Yeah, I think it's about people understanding what's available first. Um... [S1] I think the finish on that one too was really nice. [S2] Yeah. [S1] I mean, that was pretty awesome. [S2] Have you seen those new editions?
28
+
29
+ Conversation | [S1] He was just practicing with them and they were on rotation. [S2] So that was probably in January. [S1] I think startup stereotypes, there is some like that, but some of them, I think they need to be changed. Like we don't all work twenty-hour days. [S2] No, they just need to, it's called not, it's based in Silicon Valley. [S1] Yeah. [S2] But the stereotypes would apply if they, it was called Techlife- [S1] Palo Alto. [S2] ... Cupertino or Mountain View, California.
+
+
+ Conversation | [S1] That's a nice overview. [S2] We were at the downtown cinema. [S1] By that, you mean the one in Riverside? [S2] Yeah. [S1] Yeah. So not exactly downtown. [S2] Not exactly downtown, yeah. [S1] I know a little bit about that area. [S2] (laughs) [S1] You know, Millbrook doesn't have a cinema. [S2] (laughs) It's the closest one for us. It's the closest. [S1] Yeah, that's true. [S2] The most nearby. [S1] Riverside is nearby. [S2] Riverside's close. [S1] That's fair. [S2] Support nearby. [S1] You can say, say Riverside, definitely. [S2] Well, yeah, fair enough.
+
+ Conversation | [S1] But they also, they also discovered, um, they also discovered like patterns in the desert, um, near Peru, like in the Atacama Desert. [S2] Yeah. [S1] Um, and like, it was like, of like perfectly, like, geo- geometric shapes. And they're like, "Yo, this is definitely not like formed by wind. This has to be artificial." [S2] Yeah, it's too precise.
+
+ Conversation | [S1] 'Cause I, yeah, there, there has to be a way that they can just make the, the system recognize that, no, you did not earn this- [S2] (laughs) [S1] ... on your own. You still have to go and complete one if you want it for your own- [S2] Right. [S1] ... like, profile. [S2] Right. Mm-hmm. [S1] So, yeah. [S2] Um, yeah. So let's actually move into multiplayer.
+
+ Conversation | [S1] Yeah. [S2] Yeah. TRS as a whole is just relaxed. [S1] But anyway, you know that Mirror app that launched and then got removed like a month later? [S2] Mirror, what, like, to your future? [S1] Yeah. [S2] Oh. [S1] So basically, there was an app, there's a show coming out. [S2] This is a show. [S1] Coming, I don't know what it is. [S2] Yeah, yeah, yeah. [S1] Like 2026 or something. Basically, Marcus, have you heard about this? [S2] I'm sorry, I don't know. No, I don't have an, it's an app- [S1] Okay, so I'll explain. I'll explain. [S2] Yeah. [S1] For context. So there's this app that launched in terms of the show called Mirror.
+
+ Conversation | [S1] Jamie Patterson, right? [S2] No, I know where- [S1] I know where- [S2] ... Patterson works as well. I know where- [S1] I know- I know he used to work near- on this street, and this is a weird street. [S2] The only person who I don't know where they work, Jamie. But anyway, why are we even talking about who works where? [S1] It was a- it was- it was a really weird street name where Jamie worked. [S2] I- I drove past this street on my commute. [S1] No, you didn't. [S2] Yeah, I did. [S1] No, you drove past the street that my street is down the street of. [S2] Nice. There's, like, one street in Oakfield, I think I'll be able to find it, mate.