LongCat-AudioDiT Env-TTS — 6000-step fine-tune

Fine-tune of meituan-longcat/LongCat-AudioDiT-1B for the three-stream env-tts task. Given a reference environment audio clip, a reference speaker audio clip, and three text streams (env caption / speaker caption / target speech text), the model generates speech that says the target text in the reference speaker's timbre, placed in the reference environment.

Differences from the base model

The transformer adds six learnable boundary tokens (three latent-space, three text-space):

latent sequence : [<boe>  z_env  <bos>  z_spk  <bon>  z_target]
text sequence   : [<boe_t> env_text_emb <bos_t> spk_text_emb <bon_t> target_text_emb]
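The layout above can be sketched as simple index bookkeeping. This is only an illustration of where each stream lands in the concatenated sequence, assuming every boundary token occupies exactly one position and the streams are concatenated in the order shown. The helper name `stream_slices` and the example lengths are hypothetical and are not part of the repo.

```python
def stream_slices(env_len, spk_len, target_len):
    """Locate each stream inside a boundary-tokenized sequence.

    Assumed layout: [<boe> env <bos> spk <bon> target], with each
    boundary token taking exactly one position (an assumption).
    Returns ({stream_name: slice}, total_sequence_length).
    """
    slices = {}
    pos = 0
    streams = (("env", env_len), ("spk", spk_len), ("target", target_len))
    for name, length in streams:
        pos += 1  # skip the boundary token that precedes this stream
        slices[name] = slice(pos, pos + length)
        pos += length
    return slices, pos


# Hypothetical lengths, for illustration only.
slices, total = stream_slices(env_len=4, spk_len=3, target_len=5)
print(total)             # 15 = 3 boundary tokens + 12 stream positions
print(slices["target"])  # slice(10, 15, None)
```

The same bookkeeping applies to the text sequence, with the text-space boundary tokens in place of the latent-space ones.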

encode_multistream_text(env, spk, target, drop_env_text=…, drop_spk_text=…, drop_target_text=…) is the new entry point. AudioDiTModel.forward(...) also accepts a pre-assembled prompt_latent in place of prompt_audio, so the inference path can feed the boundary-tokenized three-stream prompt directly.

Training summary

Field	Value
Steps	6000
Effective batch	16 × grad_accum 2 × 2 GPU = 64 rows / step
Learning rate	cosine 5e-5 (warmup 250)
AdamW	β₁=0.9, β₂=0.999, wd=0.01
EMA	disabled
LoRA	r=32, alpha=32, target = attn + ffn
Full-train	boundary tokens + AdaLN + text_conv + latent_embed + input_embed + output_proj + time_embed
Audio filter	target duration ∈ [3, 45] s
RMS normalize	three-stream independent to -23 dBFS (target_rms=0.0708)
Augmentation	noise + RIR on spk_audio (DNS5 64GB)
Data	ChristianYang/Env-TTS-Clean
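The arithmetic behind a few rows of the table can be checked with a short sketch. It is not the training code. It assumes that 16 is the per-GPU micro-batch, that dBFS is measured relative to a full-scale amplitude of 1.0, that warmup is linear and counted in steps, and that the cosine schedule decays to zero at step 6000. All function names here are hypothetical.

```python
import math


def effective_batch(per_gpu_batch, grad_accum, num_gpus):
    """Rows consumed per optimizer step."""
    return per_gpu_batch * grad_accum * num_gpus


def dbfs_to_rms(dbfs):
    """Convert an RMS level in dBFS to linear amplitude (full scale = 1.0)."""
    return 10.0 ** (dbfs / 20.0)


def rms_normalize(samples, target_rms, eps=1e-8):
    """Scale one stream so that its RMS equals target_rms.

    Each of the three streams would be passed through this separately.
    No clipping protection is applied in this sketch.
    """
    if not samples:
        return []
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    gain = target_rms / max(rms, eps)
    return [x * gain for x in samples]


def cosine_lr(step, peak_lr=5e-5, warmup=250, total_steps=6000):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(effective_batch(16, 2, 2))   # 64
target_rms = dbfs_to_rms(-23.0)
print(round(target_rms, 4))        # 0.0708
```

With LoRA r=32 and alpha=32, the usual LoRA scaling factor alpha / r is 1.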

How to load

The model uses custom code in this repo, so pass trust_remote_code=True:

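from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True is required because the model class is defined in this repo.
model = AutoModel.from_pretrained(
    "ChristianYang/LongCat-AudioDiT-Env-TTS-1B-6000Step",
    trust_remote_code=True,
).cuda().eval()  # move to the GPU and switch to inference mode

# The tokenizer of the text encoder is named in the model config.
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)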

For end-to-end env-tts inference (three-stream prompt + ASR fallback for missing env/spk text) see the training repo's inference_env_tts.py.
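The ASR fallback mentioned above could look roughly like the following. This is a guess at the general shape, not the logic of inference_env_tts.py, which is not reproduced here. The helper `resolve_text` and the `transcribe` callable are hypothetical stand-ins for whatever ASR model the script actually uses.

```python
from typing import Callable, Optional


def resolve_text(
    text: Optional[str],
    audio_path: str,
    transcribe: Callable[[str], str],
) -> str:
    """Return the supplied text, or transcribe the audio when text is missing.

    `transcribe` is a hypothetical callable (audio path -> text); plug in
    any ASR backend. Blank or whitespace-only text counts as missing.
    """
    if text is not None and text.strip():
        return text.strip()
    return transcribe(audio_path).strip()


# Stand-in ASR for illustration; a real run would call an ASR model here.
def fake_asr(path: str) -> str:
    return f"transcript of {path}"


spk_text = resolve_text(None, "spk_ref.wav", fake_asr)
env_text = resolve_text("rain on a tin roof", "env_ref.wav", fake_asr)
print(spk_text)  # transcript of spk_ref.wav
print(env_text)  # rain on a tin roof
```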

License

Inherits the original meituan-longcat/LongCat-AudioDiT-1B license.

Safetensors

Model size: 1B params
Tensor type: F32

Model tree for ChristianYang/LongCat-AudioDiT-Env-TTS-1B-6000Step

Base model: meituan-longcat/LongCat-AudioDiT-1B
Finetuned (11): this model