---
datasets:
- alibabasglab/VoxCeleb2-mix
language:
- en
library_name: pytorch
license: apache-2.0
pipeline_tag: audio-to-audio
tags:
- audio-visual
- speech-separation
- cocktail-party
- multimodal
- lip-reading
- audio-processing
---

# Dolphin: Efficient Audio-Visual Speech Separation


## Model Overview

**Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip-movement) cues. It achieves **state-of-the-art performance** while running **6× faster** and using **50% fewer parameters** than previous methods.

🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)

## Key Features

- 🎯 **Balanced Quality & Efficiency**: SOTA separation quality without iterative refinement
- 🔬 **DP-LipCoder**: Lightweight video encoder that produces discrete, audio-aligned lip semantic tokens
- 🌐 **Global-Local Attention**: Multi-scale attention capturing long-range context and fine-grained detail
- 🚀 **Edge-Friendly**: >50% parameter reduction, >2.4× lower MACs, >6× faster inference

## Performance

**VoxCeleb2 Benchmark:**

| Metric | Value |
|--------|-------|
| SI-SNRi | **16.1 dB** |
| SDRi | **16.3 dB** |
| PESQ | **3.45** |
| ESTOI | **0.93** |
| Parameters | **51.3M** (vs. 112M for IIANet) |
| MACs | **417G** (vs. 1009G for IIANet) |
| Inference Speed | **0.015 s** per 4 s clip (vs. 0.100 s for IIANet) |

## Quick Start

### Installation

```bash
pip install torch torchvision torchaudio
pip install huggingface_hub
```

### Inference Example

```python
import torch
import yaml
from huggingface_hub import hf_hub_download

# Download the config and checkpoint from the Hub
config_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="conf.yml")
model_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="best_model.pth")

# Load the model (import the Dolphin class from the repository code)
with open(config_path) as f:
    config = yaml.safe_load(f)

model = Dolphin(**config['model'])
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Prepare inputs
# audio: [batch, samples] - 16 kHz audio
# video: [batch, frames, 1, height, width] - grayscale lip frames
audio_mixture = torch.randn(1, 64000)          # 4 seconds at 16 kHz
video_frames = torch.randn(1, 100, 1, 88, 88)  # 4 s at 25 fps, 88x88 resolution

# Separate speech
with torch.no_grad():
    separated_audio = model(audio_mixture, video_frames)
```

### Complete Pipeline with Video Input

For end-to-end video processing with face detection and tracking, see our [inference script](https://github.com/JusperLee/Dolphin/blob/main/inference.py):

```bash
git clone https://github.com/JusperLee/Dolphin.git
cd Dolphin
python inference.py \
    --input video.mp4 \
    --output ./output \
    --speakers 2 \
    --config checkpoints/vox2/conf.yml
```

## Model Architecture

### Components

1. **DP-LipCoder** (Video Encoder)
   - Dual-path architecture: visual compression + semantic encoding
   - Vector quantization yields discrete lip semantic tokens
   - Knowledge distillation from AV-HuBERT
   - Only **8.5M parameters**
2. **Audio Encoder**
   - Convolutional encoder for time-frequency representation
   - Extracts multi-scale acoustic features
3. **Global-Local Attention Separator**
   - Single-pass TDANet-based architecture
   - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
   - **Local Attention (LA)**: Heat-diffusion attention for noise suppression
   - No iterative refinement needed
4. **Audio Decoder**
   - Reconstructs the separated waveform from enhanced features

### Input/Output Specifications

**Inputs:**
- `audio`: Mixed audio waveform, shape `[batch, samples]`, 16 kHz sampling rate
- `video`: Grayscale lip-region frames, shape `[batch, frames, 1, 88, 88]`, 25 fps

**Output:**
- `separated_audio`: Separated target speech, shape `[batch, samples]`, 16 kHz

## Training Details

- **Dataset**: VoxCeleb2 (2-speaker mixtures at 0 dB SNR)
- **Training**: ~200K steps with the Adam optimizer
- **Augmentation**: Random mixing, noise addition, video frame dropout
- **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
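For reference, the sketch below implements the SI-SNR objective from its standard definition. It is illustrative rather than the repo's actual loss code (which may, for instance, add permutation-invariant training across speakers), and the function name `si_snr_loss` is ours:

```python
import torch

def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SNR averaged over the batch; inputs have shape [batch, samples]."""
    # Remove the mean so the measure is invariant to DC offset
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: s_target = (<est, t> / ||t||^2) * t
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    s_target = dot * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    # SI-SNR in dB; negate so that minimizing the loss maximizes SI-SNR
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_snr.mean()
```

A call like `loss = si_snr_loss(separated_audio, clean_reference)` would then be backpropagated as usual.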
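On the inference side, here is a hedged sketch of getting a real recording into the shapes listed under Input/Output Specifications above. The audio path is a placeholder, and the lip-crop tensor is stubbed with random data: actual mouth-region extraction (face detection, tracking, and cropping to 88×88 grayscale at 25 fps) is what the repo's `inference.py` handles.

```python
import torch
import torchaudio

# Audio: load, downmix to mono, and resample to the expected 16 kHz
waveform, sr = torchaudio.load("mixture.wav")  # placeholder path; [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)  # [1, samples]
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

# Keep a 4 s clip: 64000 samples at 16 kHz pairs with 100 video frames at 25 fps
waveform = waveform[:, :64000]

# Video: stand-in for [batch, frames, 1, 88, 88] grayscale lip crops at 25 fps
lip_crops = torch.rand(1, 100, 1, 88, 88)  # replace with real cropped mouth frames
```

These tensors can then be passed to the model exactly as in the Quick Start example.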
## Use Cases

- 🎧 **Hearing Aids**: Camera-based speech enhancement
- 💼 **Video Conferencing**: Noise suppression with visual context
- 🚗 **In-Car Assistants**: Driver speech extraction
- 🥽 **AR/VR**: Immersive communication in noisy environments
- 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems

## Limitations

- Requires a frontal or near-frontal face view for optimal performance
- Works best with 25 fps video input
- Trained on English speech (may need fine-tuning for other languages)
- Performance degrades with severe occlusions or low lighting

## Citation

```bibtex
@misc{li2025dolphin,
  title={Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention},
  author={Kai Li and Kejun Gao and Xiaolin Hu},
  year={2025},
  eprint={2509.23610},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2509.23610}
}
```

## License

Apache-2.0 License. See [LICENSE](https://github.com/JusperLee/Dolphin/blob/main/LICENSE) for details.

## Acknowledgments

Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face team for hosting!

## Contact

- 📧 Email: tsinghua.kaili@gmail.com
- 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
- 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)

---

**Developed by the Audio and Speech Group at Tsinghua University** 🎓