LongCat-AudioDiT Env-TTS — 6000-step fine-tune
Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task: given a reference environment audio clip, a reference speaker audio clip, and three text streams (env caption / speaker caption / target speech text), the model generates speech that renders the target text in the referenced environment with the referenced speaker's timbre.
Differences from the base model
The transformer adds six learnable boundary tokens (three latent-space, three text-space):
latent sequence : [<boe> z_env <bos> z_spk <bon> z_target]
text sequence : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]
encode_multistream_text(env, spk, target, drop_env_text=…, drop_spk_text=…, drop_target_text=…) is the new text-encoding entry point.
AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent in place of prompt_audio,
so the inference path can feed the boundary-tokenized three-stream prompt directly.
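The two sequence layouts above can be sketched as plain index arithmetic. This is only an illustration, not the repo's implementation: the helper `layout_spans` and the example stream lengths are hypothetical, and the sketch assumes each boundary token occupies exactly one sequence position.

```python
from typing import Dict, List, Tuple


def layout_spans(segments: List[Tuple[str, str, int]]) -> Dict[str, Tuple[int, int]]:
    """Return half-open (start, end) spans for boundary tokens and streams.

    `segments` lists (boundary_token, stream_name, stream_length) in order.
    Assumption: every boundary token takes exactly one position.
    """
    spans: Dict[str, Tuple[int, int]] = {}
    pos = 0
    for boundary, stream, length in segments:
        spans[boundary] = (pos, pos + 1)  # one slot for the learnable token
        pos += 1
        spans[stream] = (pos, pos + length)  # the stream's own frames/tokens
        pos += length
    return spans


# Hypothetical latent frame counts for the env, speaker, and target streams.
latent = layout_spans(
    [("<boe>", "z_env", 40), ("<bos>", "z_spk", 60), ("<bon>", "z_target", 100)]
)

# Hypothetical text token counts for the three text streams.
text = layout_spans(
    [
        ("<boe_t>", "env_text_emb", 8),
        ("<bos_t>", "spk_text_emb", 6),
        ("<bon_t>", "target_text_emb", 20),
    ]
)

latent_len = latent["z_target"][1]       # total latent sequence length
text_len = text["target_text_emb"][1]    # total text sequence length
```

Under that assumption, each assembled sequence is exactly three positions longer than the sum of its stream lengths, and the span of `z_target` marks the region the model is asked to generate.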
Training summary
| Field | Value |
|---|---|
| Steps | 6000 |
| Effective batch | 16 × grad_accum 2 × 2 GPU = 64 rows / step |
| Learning rate | cosine 5e-5 (warmup 250) |
| AdamW | β₁=0.9, β₂=0.999, wd=0.01 |
| EMA | disabled |
| LoRA | r=32, alpha=32, target = attn + ffn |
| Full-train | boundary tokens + AdaLN + text_conv + latent_embed + input_embed + output_proj + time_embed |
| Audio filter | target duration ∈ [3, 45] s |
| RMS normalize | three-stream independent to -23 dBFS (target_rms=0.0708) |
| Augmentation | noise + RIR on spk_audio (DNS5 64GB) |
| Data | ChristianYang/Env-TTS-Clean |
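Two rows of the table can be checked with a few lines of arithmetic: the effective batch size, and the linear RMS value that corresponds to -23 dBFS. The sketch below is not the training code. The function names, the synthetic test tone, and the handling of near-silent input are assumptions made for illustration.

```python
import math
from typing import List


def dbfs_to_rms(level_dbfs: float) -> float:
    """Convert a level in dBFS to linear RMS amplitude (full scale = 1.0)."""
    return 10.0 ** (level_dbfs / 20.0)


def rms(samples: List[float]) -> float:
    """Root-mean-square amplitude of a mono signal; 0.0 for empty input."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_normalize(samples: List[float], target_rms: float, eps: float = 1e-8) -> List[float]:
    """Scale a signal so its RMS equals target_rms.

    Assumption: near-silent input is returned unchanged rather than amplified.
    """
    current = rms(samples)
    if current < eps:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]


target_rms = dbfs_to_rms(-23.0)  # about 0.0708, matching the table

# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = rms_normalize(tone, target_rms)

# Effective batch: per-GPU batch 16, gradient accumulation 2, 2 GPUs.
rows_per_step = 16 * 2 * 2
```

"Three-stream independent" would then mean calling a function like `rms_normalize` separately on the environment, speaker, and target audio, so that each stream lands at the same level regardless of the others.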
How to load
The model uses custom code in this repo, so pass trust_remote_code=True:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-6000Step",
    trust_remote_code=True,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)
For end-to-end env-tts inference (three-stream prompt plus ASR fallback for missing
env/spk text), see the training repo's inference_env_tts.py.
License
Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.