nielsr HF Staff committed on
Commit
d3588ef
·
verified ·
1 Parent(s): f84ce1e

Improve model card: Add project page link


This PR enhances the model card by adding a link to the project page (`https://cslikai.cn/Dolphin`) in the "Links" section. This page provides additional context and resources for the Dolphin model, improving discoverability for researchers and users.

The existing `library_name: pytorch` is retained, since the sample usage demonstrates direct PyTorch compatibility and keeping it ensures the automated code snippet works correctly. All other metadata and content remain unchanged, as they are already well documented.

Files changed (1)
  1. README.md +38 -39
README.md CHANGED
@@ -1,10 +1,11 @@
  ---
- 
- license: apache-2.0
  datasets:
  - alibabasglab/VoxCeleb2-mix
  language:
  - en
+ library_name: pytorch
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
  tags:
  - audio-visual
  - speech-separation
@@ -12,8 +13,6 @@ tags:
  - multimodal
  - lip-reading
  - audio-processing
- pipeline_tag: audio-to-audio
- library_name: pytorch
  ---
  
  # Dolphin: Efficient Audio-Visual Speech Separation
@@ -27,7 +26,7 @@ library_name: pytorch
  
  **Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
  
- 🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin)
+ 🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)
  
  ## Key Features
  
@@ -107,55 +106,55 @@ python inference.py \
  
  ### Components
  
- 1. **DP-LipCoder** (Video Encoder)
- - Dual-path architecture: visual compression + semantic encoding
- - Vector quantization for discrete lip semantic tokens
- - Knowledge distillation from AV-HuBERT
- - Only **8.5M parameters**
+ 1. **DP-LipCoder** (Video Encoder)
+ - Dual-path architecture: visual compression + semantic encoding
+ - Vector quantization for discrete lip semantic tokens
+ - Knowledge distillation from AV-HuBERT
+ - Only **8.5M parameters**
  
- 2. **Audio Encoder**
- - Convolutional encoder for time-frequency representation
- - Extracts multi-scale acoustic features
+ 2. **Audio Encoder**
+ - Convolutional encoder for time-frequency representation
+ - Extracts multi-scale acoustic features
  
- 3. **Global-Local Attention Separator**
- - Single-pass TDANet-based architecture
- - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
- - **Local Attention (LA)**: Heat diffusion attention for noise suppression
- - No iterative refinement needed
+ 3. **Global-Local Attention Separator**
+ - Single-pass TDANet-based architecture
+ - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
+ - **Local Attention (LA)**: Heat diffusion attention for noise suppression
+ - No iterative refinement needed
  
- 4. **Audio Decoder**
- - Reconstructs separated waveform from enhanced features
+ 4. **Audio Decoder**
+ - Reconstructs separated waveform from enhanced features
  
  ### Input/Output Specifications
  
  **Inputs:**
- - `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
- - `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
+ - `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
+ - `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
  
  **Output:**
- - `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
+ - `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
  
  ## Training Details
  
- - **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
- - **Training**: ~200K steps with Adam optimizer
- - **Augmentation**: Random mixing, noise addition, video frame dropout
- - **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
+ - **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
+ - **Training**: ~200K steps with Adam optimizer
+ - **Augmentation**: Random mixing, noise addition, video frame dropout
+ - **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
  
  ## Use Cases
  
- - 🎧 **Hearing Aids**: Camera-based speech enhancement
- - 💼 **Video Conferencing**: Noise suppression with visual context
- - 🚗 **In-Car Assistants**: Driver speech extraction
- - 🥽 **AR/VR**: Immersive communication in noisy environments
- - 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
+ - 🎧 **Hearing Aids**: Camera-based speech enhancement
+ - 💼 **Video Conferencing**: Noise suppression with visual context
+ - 🚗 **In-Car Assistants**: Driver speech extraction
+ - 🥽 **AR/VR**: Immersive communication in noisy environments
+ - 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
  
  ## Limitations
  
- - Requires frontal or near-frontal face view for optimal performance
- - Works best with 25fps video input
- - Trained on English speech (may need fine-tuning for other languages)
- - Performance degrades with severe occlusions or low lighting
+ - Requires frontal or near-frontal face view for optimal performance
+ - Works best with 25fps video input
+ - Trained on English speech (may need fine-tuning for other languages)
+ - Performance degrades with severe occlusions or low lighting
  
  ## Citation
  
@@ -181,9 +180,9 @@ Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face t
  
  ## Contact
  
- - 📧 Email: tsinghua.kaili@gmail.com
- - 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
- - 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
+ - 📧 Email: tsinghua.kaili@gmail.com
+ - 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
+ - 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
  
  ---
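For reference, the SI-SNR loss named in the card's Training Details section can be sketched in PyTorch. This is a generic textbook formulation for illustration only, not the repository's actual implementation; the function name `si_snr` is our own:

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-Invariant SNR in dB for batched waveforms of shape [batch, samples]."""
    # Zero-mean both signals so the metric ignores DC offset.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference; the projection is the "target" part.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    target = dot / ((ref ** 2).sum(dim=-1, keepdim=True) + eps) * ref
    noise = est - target
    # Ratio of target energy to residual-noise energy, expressed in dB.
    ratio = (target ** 2).sum(dim=-1) / ((noise ** 2).sum(dim=-1) + eps)
    return 10.0 * torch.log10(ratio + eps)

# Training would minimize the negative SI-SNR, e.g.:
# loss = -si_snr(separated_audio, clean_speech).mean()
```

Because the estimate is projected onto the reference before the ratio is taken, the loss is invariant to the overall gain of the model's output, which is why it is the standard objective for speech separation.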