nielsr HF Staff committed on
Commit
d3588ef
·
verified ·
1 Parent(s): f84ce1e

Improve model card: Add project page link


This PR enhances the model card by adding a link to the project page (`https://cslikai.cn/Dolphin`) in the "Links" section. This page provides additional context and resources for the Dolphin model, improving discoverability for researchers and users.

The existing `library_name: pytorch` is retained, since the sample usage demonstrates direct PyTorch compatibility and keeping it ensures the automated code snippet works correctly. All other metadata and content remain unchanged, as they are already well documented.

Files changed (1)
  1. README.md +38 -39
README.md CHANGED
@@ -1,10 +1,11 @@
  ---
- 
- license: apache-2.0
  datasets:
  - alibabasglab/VoxCeleb2-mix
  language:
  - en
+ library_name: pytorch
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
  tags:
  - audio-visual
  - speech-separation
@@ -12,8 +13,6 @@ tags:
  - multimodal
  - lip-reading
  - audio-processing
- pipeline_tag: audio-to-audio
- library_name: pytorch
  ---
  
  # Dolphin: Efficient Audio-Visual Speech Separation
@@ -27,7 +26,7 @@ library_name: pytorch
  
  **Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
  
- 🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin)
+ 🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)
  
  ## Key Features
  
@@ -107,55 +106,55 @@ python inference.py \
  
  ### Components
  
- 1. **DP-LipCoder** (Video Encoder)
- - Dual-path architecture: visual compression + semantic encoding
- - Vector quantization for discrete lip semantic tokens
- - Knowledge distillation from AV-HuBERT
- - Only **8.5M parameters**
+ 1. **DP-LipCoder** (Video Encoder)
+ - Dual-path architecture: visual compression + semantic encoding
+ - Vector quantization for discrete lip semantic tokens
+ - Knowledge distillation from AV-HuBERT
+ - Only **8.5M parameters**
  
- 2. **Audio Encoder**
- - Convolutional encoder for time-frequency representation
- - Extracts multi-scale acoustic features
+ 2. **Audio Encoder**
+ - Convolutional encoder for time-frequency representation
+ - Extracts multi-scale acoustic features
  
- 3. **Global-Local Attention Separator**
- - Single-pass TDANet-based architecture
- - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
- - **Local Attention (LA)**: Heat diffusion attention for noise suppression
- - No iterative refinement needed
+ 3. **Global-Local Attention Separator**
+ - Single-pass TDANet-based architecture
+ - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
+ - **Local Attention (LA)**: Heat diffusion attention for noise suppression
+ - No iterative refinement needed
  
- 4. **Audio Decoder**
- - Reconstructs separated waveform from enhanced features
+ 4. **Audio Decoder**
+ - Reconstructs separated waveform from enhanced features
  
  ### Input/Output Specifications
  
  **Inputs:**
- - `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
- - `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
+ - `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
+ - `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
  
  **Output:**
- - `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
+ - `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
  
  ## Training Details
  
- - **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
- - **Training**: ~200K steps with Adam optimizer
- - **Augmentation**: Random mixing, noise addition, video frame dropout
- - **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
+ - **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
+ - **Training**: ~200K steps with Adam optimizer
+ - **Augmentation**: Random mixing, noise addition, video frame dropout
+ - **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
  
  ## Use Cases
  
- - 🎧 **Hearing Aids**: Camera-based speech enhancement
- - 💼 **Video Conferencing**: Noise suppression with visual context
- - 🚗 **In-Car Assistants**: Driver speech extraction
- - 🥽 **AR/VR**: Immersive communication in noisy environments
- - 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
+ - 🎧 **Hearing Aids**: Camera-based speech enhancement
+ - 💼 **Video Conferencing**: Noise suppression with visual context
+ - 🚗 **In-Car Assistants**: Driver speech extraction
+ - 🥽 **AR/VR**: Immersive communication in noisy environments
+ - 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
  
  ## Limitations
  
- - Requires frontal or near-frontal face view for optimal performance
- - Works best with 25fps video input
- - Trained on English speech (may need fine-tuning for other languages)
- - Performance degrades with severe occlusions or low lighting
+ - Requires frontal or near-frontal face view for optimal performance
+ - Works best with 25fps video input
+ - Trained on English speech (may need fine-tuning for other languages)
+ - Performance degrades with severe occlusions or low lighting
  
  ## Citation
  
@@ -181,9 +180,9 @@ Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face t
  
  ## Contact
  
- - 📧 Email: tsinghua.kaili@gmail.com
- - 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
- - 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
+ - 📧 Email: tsinghua.kaili@gmail.com
+ - 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
+ - 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
  
  ---
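For reference, the SI-SNR loss named in the card's Training Details section can be sketched in PyTorch. This is a generic textbook formulation for illustration only, not the repository's actual implementation; the function name `si_snr` is our own:

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-Invariant SNR in dB for batched waveforms of shape [batch, samples]."""
    # Zero-mean both signals so the metric ignores DC offset.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference; the projection is the "target" part.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    target = dot / ((ref ** 2).sum(dim=-1, keepdim=True) + eps) * ref
    noise = est - target
    # Ratio of target energy to residual-noise energy, expressed in dB.
    ratio = (target ** 2).sum(dim=-1) / ((noise ** 2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)

# Training would minimize the negative SI-SNR, e.g.:
# loss = -si_snr(separated_audio, clean_speech).mean()
```

Because the estimate is projected onto the reference before the ratio is taken, the loss is invariant to the overall gain of the model's output, which is why it is the standard objective for speech separation.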