Add verified dataset sources with paper citations
README.md CHANGED
@@ -298,11 +298,11 @@ fused_embedding = model.encode_multimodal(
 
 The model architecture includes a Whisper audio encoder, but this release only trained on image-text data. Future releases will add audio-text alignment using:
 
-| Dataset | Size |
-|---------|------|-------------|
-| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) |
-| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K |
-| [Clotho](https://zenodo.org/
+| Dataset | Size | Source | Paper |
+|---------|------|--------|-------|
+| [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 403K clips | HuggingFace (CVSSP, University of Surrey) | [arXiv:2303.17395](https://arxiv.org/abs/2303.17395) |
+| [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K clips | GitHub (Seoul National University) | [NAACL-HLT 2019](https://aclanthology.org/N19-1011/) |
+| [Clotho](https://zenodo.org/records/3490684) | 6K clips | Zenodo (Tampere University) | [ICASSP 2020](https://ieeexplore.ieee.org/document/9052990) |
 
 This will enable:
 - Audio-to-text retrieval