niobures committed on
Commit 4305b2e · verified · 1 Parent(s): 0943c37

Dolphin (code, models, paper)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ Dolphin.[[:space:]]Efficient[[:space:]]Audio-Visual[[:space:]]Speech[[:space:]]Separation[[:space:]]with[[:space:]]Discrete[[:space:]]Lip[[:space:]]Semantics[[:space:]]and[[:space:]]Multi-Scale[[:space:]]Global-Local[[:space:]]Attention.pdf filter=lfs diff=lfs merge=lfs -text
Dolphin. Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0aaf438b5a925e11239a33303c3278bfdf58829931000555017617f2924ca219
+ size 6363139
code/Dolphin.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:31e8333103bf2f33023f21e87fca2c90d0a7217da1e42289618531767a5ea465
+ size 832673765
model/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
model/README.md ADDED
@@ -0,0 +1,189 @@
+ ---
+ datasets:
+ - alibabasglab/VoxCeleb2-mix
+ language:
+ - en
+ library_name: pytorch
+ license: apache-2.0
+ pipeline_tag: audio-to-audio
+ tags:
+ - audio-visual
+ - speech-separation
+ - cocktail-party
+ - multimodal
+ - lip-reading
+ - audio-processing
+ ---
+
+ # Dolphin: Efficient Audio-Visual Speech Separation
+
+ <p align="center">
+   <img src="https://github.com/JusperLee/Dolphin/raw/main/assets/icon.png" alt="Dolphin Logo" width="120"/>
+ </p>
+
+ ## Model Overview
+
+ **Dolphin** is an efficient audio-visual speech separation model that extracts target speech from noisy environments by combining acoustic and visual (lip movement) cues. It achieves **state-of-the-art performance** while being **6× faster** and using **50% fewer parameters** than previous methods.
+
+ 🔗 **Links**: [📄 Paper](https://arxiv.org/abs/2509.23610) | [💻 Code](https://github.com/JusperLee/Dolphin) | [🎮 Demo](https://huggingface.co/spaces/JusperLee/Dolphin) | [🌐 Project Page](https://cslikai.cn/Dolphin)
+
+ ## Key Features
+
+ - 🎯 **Balanced Quality & Efficiency**: SOTA separation quality without iterative refinement
+ - 🔬 **DP-LipCoder**: Lightweight video encoder producing discrete, audio-aligned semantic tokens
+ - 🌐 **Global-Local Attention**: Multi-scale attention for long-range context and fine-grained detail
+ - 🚀 **Edge-Friendly**: >50% fewer parameters, >2.4× lower MACs, >6× faster inference
+
+ ## Performance
+
+ **VoxCeleb2 Benchmark:**
+
+ | Metric | Value |
+ |--------|-------|
+ | SI-SNRi | **16.1 dB** |
+ | SDRi | **16.3 dB** |
+ | PESQ | **3.45** |
+ | ESTOI | **0.93** |
+ | Parameters | **51.3M** (vs. 112M for IIANet) |
+ | MACs | **417G** (vs. 1009G for IIANet) |
+ | Inference Speed | **0.015 s per 4 s clip** (vs. 0.100 s for IIANet) |
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install torch torchvision torchaudio
+ pip install huggingface_hub
+ ```
+
+ ### Inference Example
+
+ ```python
+ import torch
+ import yaml
+ from huggingface_hub import hf_hub_download
+
+ # The Dolphin class lives in the GitHub repo (config.json maps AutoModel to
+ # "dolphin.Dolphin"); clone https://github.com/JusperLee/Dolphin and run from
+ # its root, or add it to PYTHONPATH. The exact module path may differ.
+ from dolphin import Dolphin
+
+ # Download model and config
+ config_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="conf.yml")
+ model_path = hf_hub_download(repo_id="JusperLee/Dolphin", filename="best_model.pth")
+
+ # Load model
+ with open(config_path) as f:
+     config = yaml.safe_load(f)
+
+ model = Dolphin(**config['model'])
+ model.load_state_dict(torch.load(model_path, map_location='cpu'))
+ model.eval()
+
+ # Prepare inputs
+ # audio: [batch, samples] - 16kHz audio
+ # video: [batch, frames, 1, height, width] - grayscale lip frames
+ audio_mixture = torch.randn(1, 64000)  # 4 seconds at 16kHz
+ video_frames = torch.randn(1, 100, 1, 88, 88)  # 4s at 25fps, 88x88 resolution
+
+ # Separate speech
+ with torch.no_grad():
+     separated_audio = model(audio_mixture, video_frames)
+ ```
+
+ ### Complete Pipeline with Video Input
+
+ For end-to-end video processing with face detection and tracking, see our [inference script](https://github.com/JusperLee/Dolphin/blob/main/inference.py):
+
+ ```bash
+ git clone https://github.com/JusperLee/Dolphin.git
+ cd Dolphin
+ python inference.py \
+     --input video.mp4 \
+     --output ./output \
+     --speakers 2 \
+     --config checkpoints/vox2/conf.yml
+ ```
+
+ ## Model Architecture
+
+ ### Components
+
+ 1. **DP-LipCoder** (Video Encoder)
+    - Dual-path architecture: visual compression + semantic encoding
+    - Vector quantization for discrete lip semantic tokens
+    - Knowledge distillation from AV-HuBERT
+    - Only **8.5M parameters**
+
+ 2. **Audio Encoder**
+    - Convolutional encoder for time-frequency representation
+    - Extracts multi-scale acoustic features
+
+ 3. **Global-Local Attention Separator** (see the sketch after this list)
+    - Single-pass TDANet-based architecture
+    - **Global Attention (GA)**: Coarse-grained self-attention for long-range dependencies
+    - **Local Attention (LA)**: Heat diffusion attention for noise suppression
+    - No iterative refinement needed
+
+ 4. **Audio Decoder**
+    - Reconstructs the separated waveform from enhanced features
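+
+ To make the global/local split concrete, here is a minimal PyTorch sketch of one separator stage. This is an illustration, not the released implementation: the channel width (128), head count (8), and local kernel size (65) come from `config.json`; the pooling factor, the depthwise-conv stand-in for heat diffusion attention, and the layer layout are simplifying assumptions.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class GlobalLocalBlock(nn.Module):
+     """Illustrative GA/LA stage: coarse self-attention + fine local filtering."""
+     def __init__(self, channels=128, heads=8, local_kernel=65, pool=4):
+         super().__init__()
+         self.pool = pool
+         # Global branch: multi-head self-attention over a downsampled sequence
+         self.mha = nn.MultiheadAttention(channels, heads, dropout=0.05, batch_first=True)
+         self.norm_g = nn.LayerNorm(channels)
+         # Local branch: depthwise conv as a stand-in for heat diffusion attention
+         self.local = nn.Conv1d(channels, channels, local_kernel,
+                                padding=local_kernel // 2, groups=channels)
+         self.norm_l = nn.LayerNorm(channels)
+
+     def forward(self, x):  # x: [batch, time, channels]
+         # Global: pool to a coarse rate, attend, then upsample back to full rate
+         coarse = F.avg_pool1d(x.transpose(1, 2), self.pool).transpose(1, 2)
+         g, _ = self.mha(coarse, coarse, coarse)
+         g = F.interpolate(g.transpose(1, 2), size=x.shape[1]).transpose(1, 2)
+         x = self.norm_g(x + g)
+         # Local: fine-grained per-channel filtering at the full rate
+         y = self.local(x.transpose(1, 2)).transpose(1, 2)
+         return self.norm_l(x + y)
+
+ blk = GlobalLocalBlock()
+ print(blk(torch.randn(1, 400, 128)).shape)  # torch.Size([1, 400, 128])
+ ```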
+
+ ### Input/Output Specifications
+
+ **Inputs:**
+ - `audio`: Mixed audio waveform, shape `[batch, samples]`, 16 kHz sampling rate
+ - `video`: Grayscale lip-region frames, shape `[batch, frames, 1, 88, 88]`, 25 fps
+
+ **Output:**
+ - `separated_audio`: Separated target speech, shape `[batch, samples]`, 16 kHz
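+
+ Because the model fuses the two streams along time, audio and video must cover the same duration: at 16 kHz audio and 25 fps video, that is 640 audio samples per video frame. A quick sanity check (a generic helper written for this card, not part of the released code):
+
+ ```python
+ import torch
+
+ def check_alignment(audio, video, sr=16_000, fps=25):
+     """Assert both modalities span the same duration (sr/fps samples per frame)."""
+     assert audio.shape[-1] * fps == video.shape[1] * sr, (
+         f"audio covers {audio.shape[-1] / sr:.2f}s, "
+         f"video covers {video.shape[1] / fps:.2f}s"
+     )
+
+ check_alignment(torch.randn(1, 64000), torch.randn(1, 100, 1, 88, 88))  # passes
+ ```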
+
+ ## Training Details
+
+ - **Dataset**: VoxCeleb2 (2-speaker mixtures at 0 dB SNR)
+ - **Training**: ~200K steps with the Adam optimizer
+ - **Augmentation**: Random mixing, noise addition, video frame dropout
+ - **Loss**: SI-SNR (Scale-Invariant Signal-to-Noise Ratio); see the sketch below
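+
+ SI-SNR scores the estimate against the reference up to an arbitrary gain, so the loss cannot be gamed by rescaling the output. A minimal implementation of the standard definition (written for this card, not copied from the repo):
+
+ ```python
+ import torch
+
+ def si_snr(est, ref, eps=1e-8):
+     """Scale-Invariant SNR in dB for [batch, samples] waveforms."""
+     est = est - est.mean(dim=-1, keepdim=True)  # remove DC offset
+     ref = ref - ref.mean(dim=-1, keepdim=True)
+     # Project the estimate onto the reference to get the scaled target
+     s_target = (est * ref).sum(-1, keepdim=True) * ref \
+         / (ref.pow(2).sum(-1, keepdim=True) + eps)
+     e_noise = est - s_target
+     return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))
+
+ # Training maximizes SI-SNR, i.e. minimizes its negative
+ loss = -si_snr(torch.randn(1, 64000), torch.randn(1, 64000)).mean()
+ ```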
+
+ ## Use Cases
+
+ - 🎧 **Hearing Aids**: Camera-based speech enhancement
+ - 💼 **Video Conferencing**: Noise suppression with visual context
+ - 🚗 **In-Car Assistants**: Driver speech extraction
+ - 🥽 **AR/VR**: Immersive communication in noisy environments
+ - 📱 **Edge Devices**: Efficient deployment on mobile/embedded systems
+
+ ## Limitations
+
+ - Requires a frontal or near-frontal face view for optimal performance
+ - Works best with 25 fps video input
+ - Trained on English speech (may need fine-tuning for other languages)
+ - Performance degrades with severe occlusions or low lighting
+
+ ## Citation
+
+ ```bibtex
+ @misc{li2025dolphin,
+   title={Dolphin: Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention},
+   author={Kai Li and Kejun Gao and Xiaolin Hu},
+   year={2025},
+   eprint={2509.23610},
+   archivePrefix={arXiv},
+   primaryClass={cs.SD},
+   url={https://arxiv.org/abs/2509.23610}
+ }
+ ```
+
+ ## License
+
+ Apache-2.0 License. See [LICENSE](https://github.com/JusperLee/Dolphin/blob/main/LICENSE) for details.
+
+ ## Acknowledgments
+
+ Built with inspiration from IIANet and SepReformer. Thanks to the Hugging Face team for hosting!
+
+ ## Contact
+
+ - 📧 Email: tsinghua.kaili@gmail.com
+ - 🐛 Issues: [GitHub Issues](https://github.com/JusperLee/Dolphin/issues)
+ - 💬 Discussions: [GitHub Discussions](https://github.com/JusperLee/Dolphin/discussions)
+
+ ---
+
+ **Developed by the Audio and Speech Group at Tsinghua University** 🎓
model/config.json ADDED
@@ -0,0 +1,136 @@
+ {
+   "model_type": "dolphin",
+   "task": "audio_visual_speech_separation",
+   "framework": "pytorch",
+   "license": "apache-2.0",
+   "tags": [
+     "audio",
+     "speech-separation",
+     "audio-visual",
+     "pytorch",
+     "dolphin"
+   ],
+   "architectures": [
+     "Dolphin"
+   ],
+   "auto_map": {
+     "AutoModel": "dolphin.Dolphin"
+   },
+   "num_stages": 4,
+   "sample_rate": 16000,
+   "vpre_channels": 3872,
+   "vmid_channels": 512,
+   "vin_channels": 64,
+   "vout_channels": 64,
+   "module_audio_enc": {
+     "in_channels": 1,
+     "out_channels": 256,
+     "kernel_size": 16,
+     "stride": 4,
+     "groups": 1,
+     "bias": false
+   },
+   "module_feature_projector": {
+     "num_channels": 256,
+     "in_channels": 256,
+     "out_channels": 128,
+     "kernel_size": 1,
+     "bias": false
+   },
+   "module_separator": {
+     "num_stages": 4,
+     "relative_positional_encoding": {
+       "in_channels": 128,
+       "num_heads": 8,
+       "maxlen": 2000,
+       "embed_v": false
+     },
+     "enc_stage": {
+       "global_blocks": {
+         "in_channels": 128,
+         "num_mha_heads": 8,
+         "dropout_rate": 0.05
+       },
+       "local_blocks": {
+         "in_channels": 128,
+         "kernel_size": 65,
+         "dropout_rate": 0.05
+       },
+       "down_conv_layer": {
+         "in_channels": 128,
+         "samp_kernel_size": 5
+       }
+     },
+     "simple_fusion": {
+       "out_channels": 128
+     },
+     "dec_stage": {
+       "global_blocks": {
+         "in_channels": 128,
+         "num_mha_heads": 8,
+         "dropout_rate": 0.05
+       },
+       "local_blocks": {
+         "in_channels": 128,
+         "kernel_size": 65,
+         "dropout_rate": 0.05
+       },
+       "spk_attention": {
+         "in_channels": 128,
+         "num_mha_heads": 8,
+         "dropout_rate": 0.05
+       }
+     }
+   },
+   "module_output_layer": {
+     "in_channels": 256,
+     "out_channels": 128
+   },
+   "module_audio_dec": {
+     "in_channels": 256,
+     "out_channels": 1,
+     "kernel_size": 16,
+     "stride": 4,
+     "bias": false
+   },
+   "video_encoder_params": {
+     "layers": [
+       "residual",
+       "compress_space",
+       "consecutive_residual",
+       "compress_space",
+       "consecutive_residual",
+       "linear_attend_space",
+       "compress_space",
+       "consecutive_residual",
+       "attend_space"
+     ],
+     "image_size": 88,
+     "in_channel": 1,
+     "init_channel": 4,
+     "max_dim": 32,
+     "input_conv_kernel_size": [
+       7,
+       7,
+       7
+     ],
+     "output_conv_kernel_size": [
+       3,
+       3,
+       3
+     ],
+     "residual_conv_kernel_size": 3,
+     "pad_mode": "constant",
+     "attn_dim_head": 32,
+     "attn_heads": 8,
+     "attn_dropout": 0.0,
+     "flash_attn": true,
+     "linear_attn_dim_head": 8,
+     "linear_attn_heads": 16,
+     "num_quantizers": 1,
+     "codebook_size": 256,
+     "codebook_dim": 64,
+     "commitment_cost": 1.0,
+     "distill_cost": 1.0
+   }
+ }
model/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9be694e4150588ca0af8447fae184b6262a3cf43587928bd6001eee5b4eefb8a
+ size 28391276
model/source.txt ADDED
@@ -0,0 +1 @@
+ https://huggingface.co/JusperLee/Dolphin