Improve model card: Add project page link
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,10 +1,11 @@
 ---
-
-license: apache-2.0
 datasets:
 - alibabasglab/VoxCeleb2-mix
 language:
 - en
+library_name: pytorch
+license: apache-2.0
+pipeline_tag: audio-to-audio
 tags:
 - audio-visual
 - speech-separation
@@ -12,8 +13,6 @@ tags:
 - multimodal
 - lip-reading
 - audio-processing
-pipeline_tag: audio-to-audio
-library_name: pytorch
 ---
 
 # Dolphin: Efficient Audio-Visual Speech Separation
@@ -27,7 +26,7 @@ library_name: pytorch
 
 **Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
 
-🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin)
+🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)
 
 ## Key Features
 
@@ -107,55 +106,55 @@ python inference.py \
 
 ### Components
 
-1.
-
-
-
-
+1. **DP-LipCoder** (Video Encoder)
+   - Dual-path architecture: visual compression + semantic encoding
+   - Vector quantization for discrete lip semantic tokens
+   - Knowledge distillation from AV-HuBERT
+   - Only **8.5M parameters**
 
-2.
-
-
+2. **Audio Encoder**
+   - Convolutional encoder for time-frequency representation
+   - Extracts multi-scale acoustic features
 
-3.
-
-
-
-
+3. **Global-Local Attention Separator**
+   - Single-pass TDANet-based architecture
+   - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
+   - **Local Attention (LA)**: Heat diffusion attention for noise suppression
+   - No iterative refinement needed
 
-4.
-
+4. **Audio Decoder**
+   - Reconstructs separated waveform from enhanced features
 
 ### Input/Output Specifications
 
 **Inputs:**
--
--
+- `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
+- `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
 
 **Output:**
--
+- `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
 
 ## Training Details
 
--
--
--
--
+- **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
+- **Training**: ~200K steps with Adam optimizer
+- **Augmentation**: Random mixing, noise addition, video frame dropout
+- **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
 
 ## Use Cases
 
--
--
--
--
--
+- 🎧 **Hearing Aids**: Camera-based speech enhancement
+- 💼 **Video Conferencing**: Noise suppression with visual context
+- 🚗 **In-Car Assistants**: Driver speech extraction
+- 🥽 **AR/VR**: Immersive communication in noisy environments
+- 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
 
 ## Limitations
 
--
--
--
--
+- Requires frontal or near-frontal face view for optimal performance
+- Works best with 25fps video input
+- Trained on English speech (may need fine-tuning for other languages)
+- Performance degrades with severe occlusions or low lighting
 
 ## Citation
 
@@ -181,9 +180,9 @@ Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face t
 
 ## Contact
 
--
--
--
+- 📧 Email: tsinghua.kaili@gmail.com
+- 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
+- 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
 
 ---
 
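A note on the I/O contract in the diff above: the documented shapes imply a fixed 16 kHz / 25 fps alignment, i.e. 640 audio samples per video frame. Below is a minimal sketch that builds dummy inputs with those shapes and checks the alignment; the `model(...)` call is hypothetical and only stands in for the repo's actual entry point (`inference.py`), which the card does not spell out.

```python
import torch

# Shapes and rates as documented in the card (sketch only, not the repo's API).
batch, seconds = 1, 4
sr, fps = 16_000, 25

mixture = torch.randn(batch, seconds * sr)           # audio: [batch, samples] @ 16kHz
lips = torch.randn(batch, seconds * fps, 1, 88, 88)  # video: [batch, frames, 1, 88, 88] @ 25fps

# Both streams must cover the same time span: 16000 / 25 = 640 samples per frame.
assert mixture.shape[-1] / sr == lips.shape[1] / fps

# Hypothetical forward call returning separated speech of shape [batch, samples]:
# separated = model(mixture, lips)
```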
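Since the updated card names SI-SNR as the training loss, here is the textbook scale-invariant SNR in PyTorch for readers unfamiliar with it. This is the standard definition, not code taken from the Dolphin repository.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for [batch, samples] waveforms."""
    # Zero-mean both signals so the projection below is insensitive to offsets.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the "target" component.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * ref
    noise = est - target
    ratio = torch.sum(target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Training maximizes SI-SNR, so the loss is its negative mean over the batch.
    return -si_snr(est, ref).mean()
```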
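To try the released checkpoint locally, the usual Hub download route applies. A sketch, assuming the model repo id matches the demo space name (`JusperLee/Dolphin`):

```python
from huggingface_hub import snapshot_download

# Download the full model repo; the repo id is assumed from the demo space name.
local_dir = snapshot_download(repo_id="JusperLee/Dolphin")
print(local_dir)  # pass this path to the repo's inference.py
```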