
🎬 Fun-CineForge: A Unified Dataset Pipeline and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

Fun-CineForge contains an end-to-end dataset pipeline for producing large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using this pipeline, we constructed the first large-scale Chinese television dubbing dataset CineDub-CN, which includes rich annotations and diverse scenes. In monologue, narration, dialogue, and multi-speaker scenes, our dubbing model consistently outperforms state-of-the-art methods in terms of audio quality, lip-sync, timbre transition, and instruction following.

Open Source 🎬

Visit https://funcineforge.github.io/ for CineDub-CN dataset samples and demo samples.

GitHub link: https://github.com/FunAudioLLM/FunCineForge/

Modelscope link: https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/

CineDub samples: available on Hugging Face and ModelScope.

Dataset Pipeline 🔨

Environmental Installation

The Fun-CineForge dataset pipeline toolkit requires only a Python environment to run.

# Conda
git clone git@github.com:FunAudioLLM/FunCineForge.git
conda create -n FunCineForge python=3.10 -y && conda activate FunCineForge
sudo apt-get install ffmpeg
# Initial settings
python setup.py

Data collection

To produce your own data, we recommend collecting movies or television series that meet the following requirements.

  1. Video source: TV dramas or movies (not documentaries) with frequent monologue or dialogue scenes and clear, unobstructed faces (e.g., no masks or veils).
  2. Speech requirements: standard pronunciation, clear articulation, and a prominent vocal track. Avoid material with heavy dialects, excessive background noise, or highly colloquial speech.
  3. Image requirements: high resolution, clear facial detail, and sufficient lighting; avoid extremely dark or strongly backlit scenes.

How to use

  • [1] Standardize video format and name; trim the beginning and end of long videos; extract the audio from the trimmed video. (The default trims 10 seconds from both the beginning and the end.)
python normalize_trim.py --root datasets/raw_zh --intro 10 --outro 10
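Conceptually, the trimming step is a thin wrapper around ffmpeg. The sketch below is illustrative only (it assumes the video duration is already known; the helper names are not the actual normalize_trim.py internals):

```python
import subprocess

def build_trim_cmd(src, dst, duration, intro=10.0, outro=10.0):
    """Build an ffmpeg command that drops `intro` seconds from the start
    and `outro` seconds from the end of a video of known `duration`."""
    span = max(duration - intro - outro, 0.0)  # clamp so short clips never go negative
    return [
        "ffmpeg", "-y",
        "-ss", str(intro),   # seek past the intro
        "-t", str(span),     # keep only the middle span
        "-i", src,
        "-c", "copy",        # stream copy: fast, no re-encode
        dst,
    ]

def trim(src, dst, duration, intro=10.0, outro=10.0):
    subprocess.run(build_trim_cmd(src, dst, duration, intro, outro), check=True)
```

Stream copy (`-c copy`) avoids re-encoding, which keeps this step fast on long source videos.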
  • [2] Speech Separation. Separate the vocals from the background music in the extracted audio.
cd speech_separation
python run.py --root datasets/clean/zh --gpus 0 1 2 3
  • [3] VideoClipper. For long videos, VideoClipper produces sentence-level subtitle files and clips the long video into segments based on the timestamps. It currently supports both Chinese and English; below is a Chinese example. GPU acceleration is recommended for English.
cd video_clip
bash run.sh --stage 1 --stop_stage 2 --input datasets/raw_zh --output datasets/clean/zh --lang zh --device cpu
  • Video duration limit and cleanup check. (Without --execute, files slated for deletion are only printed. After reviewing the list, add --execute to perform the deletion.)
python clean_video.py --root datasets/clean/zh
python clean_srt.py --root datasets/clean/zh --lang zh
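Step [3] hinges on the sentence-level SRT timestamps. A minimal parser for the common `HH:MM:SS,mmm` cue layout might look like the following (a sketch, not the actual VideoClipper code):

```python
import re

# Matches the standard SRT timestamp layout HH:MM:SS,mmm
TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_seconds(ts):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' to seconds."""
    h, m, s, ms = (int(g) for g in TS.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(text):
    """Return (start_sec, end_sec, subtitle_text) for each SRT cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = (t.strip() for t in lines[1].split("-->"))
        cues.append((to_seconds(start), to_seconds(end), " ".join(lines[2:])))
    return cues
```

The resulting (start, end) pairs are what the clipping stage would hand to ffmpeg to cut the long video into segments.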
  • [4] Speaker Diarization. Multimodal active speaker detection produces RTTM files, identifies the speaker's facial frames, and extracts frame-level face and lip data for each speaker.
cd speaker_diarization
bash run.sh --stage 1 --stop_stage 4 --hf_access_token hf_xxx --root datasets/clean/zh --gpus "0 1 2 3"
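The RTTM files produced in step [4] follow the standard ten-field NIST layout (`SPEAKER <file> <chan> <onset> <dur> … <name> …`), so the speaker segments can be read back with a few lines (field positions follow the RTTM convention; the helper itself is illustrative):

```python
def parse_rttm(lines):
    """Parse SPEAKER records from an RTTM file into
    (speaker, start_sec, end_sec) tuples."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # ignore SPKR-INFO and other record types
        onset, dur = float(fields[3]), float(fields[4])
        segments.append((fields[7], onset, onset + dur))  # field 7 = speaker name
    return segments
```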
  • [5] Multimodal CoT Correction. Built on general-purpose MLLMs, this step takes audio, ASR text, and RTTM files as input, uses Chain-of-Thought (CoT) reasoning to extract clues, and corrects the outputs of the specialized models. It also annotates character age, gender, and vocal timbre. Experimental results show that this strategy reduces the CER from 4.53% to 0.94% and the speaker diarization error rate from 8.38% to 1.20%, achieving quality comparable to or better than manual transcription. Adding --resume enables resumable CoT inference, avoiding wasted resources from repeated inference. Both Chinese and English are now supported.
python cot.py --root_dir datasets/clean/zh --lang zh --provider google --model gemini-3-pro-preview --api_key xxx --resume
python cot.py --root_dir datasets/clean/en --lang en --provider google --model gemini-3-pro-preview --api_key xxx --resume
python build_datasets.py --root_zh datasets/clean/zh --root_en datasets/clean/en --out_dir datasets/clean --save
  • (Reference) Extract speech tokens with the CosyVoice3 tokenizer for LLM training.
python speech_tokenizer.py --root datasets/clean/zh

Dubbing Model ⚙️

We have open-sourced the inference code and the infer.sh script, and provide test cases in the data folder for you to try. Inference runs on a consumer-grade GPU. Run the following commands:

cd exps
bash infer.sh

The API for multi-speaker dubbing from raw videos and SRT scripts is under development ...

Recent Updates 🚀

  • 2025/12/18: Fun-CineForge dataset pipeline toolkit is online! 🔥
  • 2026/01/19: Chinese demo samples and CineDub-CN dataset samples released. 🔥
  • 2026/01/25: Fixed some environment and runtime issues.
  • 2026/02/09: Optimized the data pipeline and added support for English videos.
  • 2026/03/05: English demo samples and CineDub-EN dataset samples released. 🔥
  • 2026/03/16: Open-sourced inference code and checkpoints. 🔥

Publication 📚

If you use our dataset or code, please cite the following paper:

@misc{liu2026funcineforgeunifieddatasettoolkit,
    title={FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes}, 
    author={Jiaxuan Liu and Yang Xiang and Han Zhao and Xiangang Li and Zhenhua Ling},
    year={2026},
    eprint={2601.14777},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
}

Communicate 🍟

We welcome discussion in the Fun-CineForge GitHub Issues, and you are welcome to contact us for collaborative development. For any questions, please contact the developers.

Disclaimer

This repository contains research artifacts:

⚠️ Currently not a commercial product of Tongyi Lab.

⚠️ Released for academic research / cutting-edge exploration purposes

⚠️ CineDub Dataset samples are subject to specific license terms.
