LongCat-AudioDiT Env-TTS — 6000-step fine-tune

Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task. Given a reference environment audio clip, a reference speaker audio clip, and three text streams (env caption / speaker caption / target speech text), the model generates speech that renders the target text in the referenced environment with the referenced speaker's timbre.

Differences from the base model

The transformer adds six learnable boundary tokens (three latent-space, three text-space):

latent sequence : [<boe>  z_env  <bos>  z_spk  <bon>  z_target]
text sequence   : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]
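
The layout above can be sketched as index bookkeeping. This is a minimal illustration, not the repo's code: the helper name layout_spans and the per-stream lengths are assumptions, and each stream's frames are represented by placeholder labels rather than real latents.

```python
from typing import Dict, List, Tuple

# Boundary-token names come from the layout above; everything else is illustrative.
LATENT_BOUNDARIES = ("<boe>", "<bos>", "<bon>")


def layout_spans(env_len: int, spk_len: int, target_len: int,
                 boundaries: Tuple[str, str, str] = LATENT_BOUNDARIES,
                 ) -> Tuple[List[str], Dict[str, Tuple[int, int]]]:
    """Return per-position labels and the [start, end) span of each stream."""
    labels: List[str] = []
    spans: Dict[str, Tuple[int, int]] = {}
    streams = (("env", env_len), ("spk", spk_len), ("target", target_len))
    for boundary, (name, length) in zip(boundaries, streams):
        labels.append(boundary)          # one boundary token opens each stream
        start = len(labels)
        labels.extend([name] * length)   # placeholder labels for the stream's frames
        spans[name] = (start, len(labels))
    return labels, spans


labels, spans = layout_spans(env_len=4, spk_len=3, target_len=5)
print(len(labels))      # 12 frames + 3 boundary tokens
print(spans["target"])  # where the target frames sit in the joint sequence
```

The same bookkeeping applies to the text sequence with <boe_t>, <bos_t>, and <bon_t> passed as boundaries.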

encode_multistream_text(env, spk, target, drop_env_text=…, drop_spk_text=…, drop_target_text=…) is the new entry point for text encoding. AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent in place of prompt_audio, so the inference path can feed the boundary-tokenized three-stream prompt directly.
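
What the drop_* flags do internally is not documented here. One plausible reading is per-stream conditioning dropout, sketched below with a toy stand-in. The function toy_multistream_text, the whitespace "embedding", and the choice to keep the boundary token while omitting a dropped stream's embeddings are all assumptions; the real encode_multistream_text may handle dropped streams differently (for example with a null embedding).

```python
from typing import Callable, List

# Text-space boundary tokens, named as in the layout above.
TEXT_BOUNDARIES = ("<boe_t>", "<bos_t>", "<bon_t>")


def toy_multistream_text(env: str, spk: str, target: str,
                         embed: Callable[[str], List[str]] = str.split,
                         drop_env_text: bool = False,
                         drop_spk_text: bool = False,
                         drop_target_text: bool = False) -> List[str]:
    """Toy stand-in: a dropped stream adds no embeddings, its boundary token stays."""
    streams = ((env, drop_env_text), (spk, drop_spk_text), (target, drop_target_text))
    sequence: List[str] = []
    for boundary, (text, dropped) in zip(TEXT_BOUNDARIES, streams):
        sequence.append(boundary)
        if not dropped:
            sequence.extend(embed(text))  # hypothetical embedding: whitespace tokens
    return sequence


full = toy_multistream_text("busy cafe", "calm female voice", "hello there")
no_env = toy_multistream_text("busy cafe", "calm female voice", "hello there",
                              drop_env_text=True)
print(full)
print(no_env)
```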

Training summary

Field            Value
Steps            6000
Effective batch  16 × grad_accum 2 × 2 GPUs = 64 rows / step
Learning rate    5e-5, cosine schedule (warmup 250)
AdamW            β₁=0.9, β₂=0.999, wd=0.01
EMA              disabled
LoRA             r=32, alpha=32, target = attn + ffn
Full-train       boundary tokens + AdaLN + text_conv + latent_embed + input_embed + output_proj + time_embed
Audio filter     target duration ∈ [3, 45] s
RMS normalize    each of the three streams independently, to -23 dBFS (target_rms=0.0708)
Augmentation     noise + RIR on spk_audio (DNS5 64GB)
Data             ChristianYang/Env-TTS-Clean
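
The numbers in the table can be checked with a few lines of arithmetic. The sketch below is not the training code. The function names are made up for illustration, the first batch factor is assumed to be a per-GPU micro-batch, and the schedule shape (linear warmup counted in steps, then cosine decay to zero from a 5e-5 peak) is an assumption, since the table only says "cosine" and "warmup 250".

```python
import math
from typing import List


def effective_batch(micro_batch: int = 16, grad_accum: int = 2, gpus: int = 2) -> int:
    """Rows contributing to one optimizer step."""
    return micro_batch * grad_accum * gpus


def dbfs_to_rms(dbfs: float) -> float:
    """Convert an RMS level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def rms_normalize(samples: List[float], target_rms: float = 0.0708) -> List[float]:
    """Scale one stream so its RMS equals target_rms; silence is returned unchanged."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)
    gain = target_rms / rms
    return [s * gain for s in samples]


def lr_at(step: int, peak: float = 5e-5, warmup: int = 250, total: int = 6000) -> float:
    """Assumed schedule: linear warmup to peak, then cosine decay to zero."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(effective_batch())              # 64
print(round(dbfs_to_rms(-23.0), 4))   # 0.0708

# "Independently" means each stream gets its own gain.
env, spk, target = [0.5, -0.5, 0.5], [0.01, -0.02, 0.01], [0.2, 0.0, -0.2]
env_n, spk_n, target_n = (rms_normalize(x) for x in (env, spk, target))
```

Normalizing each stream separately keeps a loud environment clip from setting the level of a quiet speaker reference, and vice versa.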

How to load

The model uses custom code in this repo, so pass trust_remote_code=True:

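from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers run the custom model code shipped in this repo.
model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-6000Step",
    trust_remote_code=True,
).cuda().eval()  # move to GPU and switch to inference mode

# The text encoder's name is read from the loaded model config.
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)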

For end-to-end env-tts inference (three-stream prompt, with an ASR fallback when the env or spk text is missing), see inference_env_tts.py in the training repo.
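
The fallback idea can be sketched independently of any particular ASR model. This is not taken from inference_env_tts.py: the helper text_or_transcript, the fake_asr stub, and the file names are hypothetical, and the real script may decide differently what counts as "missing".

```python
from typing import Callable, Optional


def text_or_transcript(text: Optional[str], audio_path: str,
                       transcribe: Callable[[str], str]) -> str:
    """Use the supplied text when present; otherwise transcribe the reference audio."""
    if text is not None and text.strip():
        return text
    return transcribe(audio_path)


def fake_asr(path: str) -> str:
    """Stand-in for a real ASR model, used only to make the sketch runnable."""
    return f"transcript of {path}"


env_text = text_or_transcript(None, "env_ref.wav", fake_asr)
spk_text = text_or_transcript("a calm female voice", "spk_ref.wav", fake_asr)
print(env_text)
print(spk_text)
```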

License

This model inherits the license of the original meituan-longcat/LongCat-AudioDiT-1B.

