Improve model card: Add project page link
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,10 +1,11 @@
 ---
-
-license: apache-2.0
 datasets:
 - alibabasglab/VoxCeleb2-mix
 language:
 - en
+library_name: pytorch
+license: apache-2.0
+pipeline_tag: audio-to-audio
 tags:
 - audio-visual
 - speech-separation
@@ -12,8 +13,6 @@ tags:
 - multimodal
 - lip-reading
 - audio-processing
-pipeline_tag: audio-to-audio
-library_name: pytorch
 ---
 
 # Dolphin: Efficient Audio-Visual Speech Separation
@@ -27,7 +26,7 @@ library_name: pytorch
 
 **Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
 
-🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin)
+🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)
 
 ## Key Features
 
@@ -107,55 +106,55 @@ python inference.py \
 
 ### Components
 
-1.
-
-
-
-
+1. **DP-LipCoder** (Video Encoder)
+   - Dual-path architecture: visual compression + semantic encoding
+   - Vector quantization for discrete lip semantic tokens
+   - Knowledge distillation from AV-HuBERT
+   - Only **8.5M parameters**
 
-2.
-
-
+2. **Audio Encoder**
+   - Convolutional encoder for time-frequency representation
+   - Extracts multi-scale acoustic features
 
-3.
-
-
-
-
+3. **Global-Local Attention Separator**
+   - Single-pass TDANet-based architecture
+   - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
+   - **Local Attention (LA)**: Heat diffusion attention for noise suppression
+   - No iterative refinement needed
 
-4.
-
+4. **Audio Decoder**
+   - Reconstructs separated waveform from enhanced features
 
 ### Input/Output Specifications
 
 **Inputs:**
--
--
+- `audio`: Mixed audio waveform, shape `[batch, samples]`, 16kHz sampling rate
+- `video`: Grayscale lip region frames, shape `[batch, frames, 1, 88, 88]`, 25fps
 
 **Output:**
--
+- `separated_audio`: Separated target speech, shape `[batch, samples]`, 16kHz
 
 ## Training Details
 
--
--
--
--
+- **Dataset**: VoxCeleb2 (2-speaker mixtures at 0dB SNR)
+- **Training**: ~200K steps with Adam optimizer
+- **Augmentation**: Random mixing, noise addition, video frame dropout
+- **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio)
 
 ## Use Cases
 
--
--
--
--
--
+- 🎧 **Hearing Aids**: Camera-based speech enhancement
+- 💼 **Video Conferencing**: Noise suppression with visual context
+- 🚗 **In-Car Assistants**: Driver speech extraction
+- 🥽 **AR/VR**: Immersive communication in noisy environments
+- 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
 
 ## Limitations
 
--
--
--
--
+- Requires frontal or near-frontal face view for optimal performance
+- Works best with 25fps video input
+- Trained on English speech (may need fine-tuning for other languages)
+- Performance degrades with severe occlusions or low lighting
 
 ## Citation
 
@@ -181,9 +180,9 @@ Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face t
 
 ## Contact
 
--
--
--
+- 📧 Email: tsinghua.kaili@gmail.com
+- 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
+- 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
 
 ---
 
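A note on the I/O contract in the diff above: the documented shapes imply a fixed 16 kHz / 25 fps alignment, i.e. 640 audio samples per video frame. Below is a minimal sketch that builds dummy inputs with those shapes and checks the alignment; the `model(...)` call is hypothetical and only stands in for the repo's actual entry point (`inference.py`), which the card does not spell out.

```python
import torch

# Shapes and rates as documented in the card (sketch only, not the repo's API).
batch, seconds = 1, 4
sr, fps = 16_000, 25

mixture = torch.randn(batch, seconds * sr)           # audio: [batch, samples] @ 16kHz
lips = torch.randn(batch, seconds * fps, 1, 88, 88)  # video: [batch, frames, 1, 88, 88] @ 25fps

# Both streams must cover the same time span: 16000 / 25 = 640 samples per frame.
assert mixture.shape[-1] / sr == lips.shape[1] / fps

# Hypothetical forward call returning separated speech of shape [batch, samples]:
# separated = model(mixture, lips)
```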
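Since the updated card names SI-SNR as the training loss, here is the textbook scale-invariant SNR in PyTorch for readers unfamiliar with it. This is the standard definition, not code taken from the Dolphin repository.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for [batch, samples] waveforms."""
    # Zero-mean both signals so the projection below is insensitive to offsets.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the "target" component.
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    energy = torch.sum(ref ** 2, dim=-1, keepdim=True) + eps
    target = dot / energy * ref
    noise = est - target
    ratio = torch.sum(target ** 2, dim=-1) / (torch.sum(noise ** 2, dim=-1) + eps)
    return 10 * torch.log10(ratio + eps)

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Training maximizes SI-SNR, so the loss is its negative mean over the batch.
    return -si_snr(est, ref).mean()
```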
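To try the released checkpoint locally, the usual Hub download route applies. A sketch, assuming the model repo id matches the demo space name (`JusperLee/Dolphin`):

```python
from huggingface_hub import snapshot_download

# Download the full model repo; the repo id is assumed from the demo space name.
local_dir = snapshot_download(repo_id="JusperLee/Dolphin")
print(local_dir)  # pass this path to the repo's inference.py
```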