Improve model card: Add project page link

#1 opened by nielsr (HF Staff)
Files changed (1): README.md (+38 -39)
README.md CHANGED
@@ -1,10 +1,11 @@
 ---
-
-license: apache-2.0
 datasets:
 - alibabasglab/VoxCeleb2-mix
 language:
 - en
+library_name: pytorch
+license: apache-2.0
+pipeline_tag: audio-to-audio
 tags:
 - audio-visual
 - speech-separation
@@ -12,8 +13,6 @@ tags:
 - multimodal
 - lip-reading
 - audio-processing
-pipeline_tag: audio-to-audio
-library_name: pytorch
 ---
 
 # Dolphin: Efficient Audio-Visual Speech Separation
@@ -27,7 +26,7 @@ library_name: pytorch
 
 **Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
 
-🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin)
+🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)
 
 ## Key Features
 
@@ -107,55 +106,55 @@ python inference.py \
 
 ### Components
 
-1. **DP-LipCoder** (Video Encoder)
-   - Dual-path architecture: visual compression + semantic encoding
-   - Vector quantization for discrete lip semantic tokens
-   - Knowledge distillation from AV-HuBERT
-   - Only **8.5M parameters**
+1. **DP-LipCoder** (Video Encoder)
+   - Dual-path architecture: visual compression + semantic encoding
+   - Vector quantization for discrete lip semantic tokens
+   - Knowledge distillation from AV-HuBERT
+   - Only **8.5M parameters**
 
-2. **Audio Encoder**
-   - Convolutional encoder for time-frequency representation
-   - Extracts multi-scale acoustic features
+2. **Audio Encoder**
+   - Convolutional encoder for time-frequency representation
+   - Extracts multi-scale acoustic features
 
-3. **Global-Local Attention Separator**
-   - Single-pass TDANet-based architecture
-   - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
-   - **Local Attention (LA)**: Heat diffusion attention for noise suppression
-   - No iterative refinement needed
+3. **Global-Local Attention Separator**
+   - Single-pass TDANet-based architecture
+   - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
+   - **Local Attention (LA)**: Heat diffusion attention for noise suppression
+   - No iterative refinement needed
 
-4. **Audio Decoder**
-   - Reconstructs separated waveform from enhanced features
+4. **Audio Decoder**
+   - Reconstructs separated waveform from enhanced features
 
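For readers of the card, here is a minimal sketch of how these four components could compose into a single forward pass. All module names, layer sizes, and the fusion scheme are illustrative placeholders built around the I/O contract below, not the actual classes from the Dolphin repository:

```python
# Illustrative composition of the four Dolphin stages; placeholder modules,
# not the real repository code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, STRIDE = 256, 160  # assumed feature width and encoder hop size


class DolphinSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # 1. Video encoder (DP-LipCoder stand-in): lip frames -> per-frame embeddings
        self.video_enc = nn.Conv3d(1, DIM, kernel_size=(1, 88, 88))
        # 2. Audio encoder: waveform -> feature map
        self.audio_enc = nn.Conv1d(1, DIM, 2 * STRIDE, stride=STRIDE, padding=STRIDE // 2)
        # 3. Separator: fuse both streams, one global-attention pass, predict a mask
        self.fuse = nn.Conv1d(2 * DIM, DIM, 1)
        self.attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
        self.mask = nn.Conv1d(DIM, DIM, 1)
        # 4. Audio decoder: masked features -> waveform
        self.decoder = nn.ConvTranspose1d(DIM, 1, 2 * STRIDE, stride=STRIDE, padding=STRIDE // 2)

    def forward(self, audio, video):
        a = self.audio_enc(audio.unsqueeze(1))                    # [B, DIM, Ta]
        v = self.video_enc(video.transpose(1, 2)).flatten(2)      # [B, DIM, Tv]
        v = F.interpolate(v, size=a.shape[-1])                    # align 25 fps to audio frame rate
        x = self.fuse(torch.cat([a, v], dim=1)).transpose(1, 2)   # [B, Ta, DIM]
        x, _ = self.attn(x, x, x)                                 # coarse global attention
        m = torch.sigmoid(self.mask(x.transpose(1, 2)))           # separation mask
        return self.decoder(a * m).squeeze(1)                     # [B, samples]


model = DolphinSketch()
out = model(torch.randn(2, 16000), torch.randn(2, 25, 1, 88, 88))
print(out.shape)  # torch.Size([2, 16000])
```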
 
 ### Input/Output Specifications
 
 **Inputs:**
-- `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
-- `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
+- `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
+- `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
 
 **Output:**
-- `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
+- `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
 
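These shapes imply a duration contract: at 16 kHz audio and 25 fps video, each video frame corresponds to 16000 / 25 = 640 audio samples. A small illustrative check of that contract (the helper name is hypothetical, not part of the Dolphin API):

```python
# Validate that an audio/video pair covers the same duration
# (640 samples per lip frame). Illustrative helper, not Dolphin API.
import torch

SAMPLE_RATE, FPS = 16000, 25


def check_av_pair(audio: torch.Tensor, video: torch.Tensor) -> None:
    assert audio.dim() == 2, "audio must be [batch, samples]"
    assert video.shape[2:] == (1, 88, 88), "video must be [batch, frames, 1, 88, 88]"
    samples_per_frame = SAMPLE_RATE // FPS  # 640
    expected = video.shape[1] * samples_per_frame
    assert audio.shape[1] == expected, (
        f"duration mismatch: {audio.shape[1]} samples vs "
        f"{video.shape[1]} frames (expected {expected} samples)"
    )


check_av_pair(torch.randn(2, 16000), torch.randn(2, 25, 1, 88, 88))  # 1 s of A/V
```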
 
 ## Training Details
 
-- **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
-- **Training**: ~200K steps with Adam optimizer
-- **Augmentation**: Random mixing, noise addition, video frame dropout
-- **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
+- **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
+- **Training**: ~200K steps with Adam optimizer
+- **Augmentation**: Random mixing, noise addition, video frame dropout
+- **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
 
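The SI-SNR objective has a standard closed form: with zero-mean signals, s_target = (⟨ŝ, s⟩ / ‖s‖²) s, e = ŝ − s_target, and SI-SNR = 10·log₁₀(‖s_target‖² / ‖e‖²); training minimizes its negative. A reference implementation in that standard form (the exact Dolphin training code may differ in details such as clamping or permutation-invariant wrapping):

```python
# Standard negative-SI-SNR loss: zero-mean both signals, project the
# estimate onto the reference, and measure the energy ratio in dB.
import torch


def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """est, ref: [batch, samples] -> scalar loss (negative SI-SNR in dB)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Scale-invariant target: projection of the estimate onto the reference
    target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - target
    ratio = target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -(10 * torch.log10(ratio + eps)).mean()


loss = si_snr_loss(torch.randn(4, 16000), torch.randn(4, 16000))
```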
 
 ## Use Cases
 
-- 🎧 **Hearing Aids**: Camera-based speech enhancement
-- 💼 **Video Conferencing**: Noise suppression with visual context
-- 🚗 **In-Car Assistants**: Driver speech extraction
-- 🥽 **AR/VR**: Immersive communication in noisy environments
-- 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
+- 🎧 **Hearing Aids**: Camera-based speech enhancement
+- 💼 **Video Conferencing**: Noise suppression with visual context
+- 🚗 **In-Car Assistants**: Driver speech extraction
+- 🥽 **AR/VR**: Immersive communication in noisy environments
+- 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
 
 ## Limitations
 
-- Requires frontal or near-frontal face view for optimal performance
-- Works best with 25fps video input
-- Trained on English speech (may need fine-tuning for other languages)
-- Performance degrades with severe occlusions or low lighting
+- Requires frontal or near-frontal face view for optimal performance
+- Works best with 25fps video input
+- Trained on English speech (may need fine-tuning for other languages)
+- Performance degrades with severe occlusions or low lighting
 
 ## Citation
 
@@ -181,9 +180,9 @@ Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face t
 
 ## Contact
 
-- 📧 Email: tsinghua.kaili@gmail.com
-- 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
-- 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
+- 📧 Email: tsinghua.kaili@gmail.com
+- 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
+- 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
 
 ---