HuaminChen committed
Commit f241f4b · verified · 1 Parent(s): f2cea71

Add verified dataset sources with paper citations

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -298,11 +298,11 @@ fused_embedding = model.encode_multimodal(
 
  The model architecture includes a Whisper audio encoder, but this release only trained on image-text data. Future releases will add audio-text alignment using:
 
- | Dataset | Size | Description |
- |---------|------|-------------|
- | [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 400K | Largest audio-caption dataset |
- | [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K | YouTube audio with human captions |
- | [Clotho](https://zenodo.org/record/3490684) | 6K | High-quality multi-annotator captions |
+ | Dataset | Size | Source | Paper |
+ |---------|------|--------|-------|
+ | [WavCaps](https://huggingface.co/datasets/cvssp/WavCaps) | 403K clips | HuggingFace (CVSSP, University of Surrey) | [arXiv:2303.17395](https://arxiv.org/abs/2303.17395) |
+ | [AudioCaps](https://github.com/cdjkim/audiocaps) | 46K clips | GitHub (Seoul National University) | [NAACL-HLT 2019](https://aclanthology.org/N19-1011/) |
+ | [Clotho](https://zenodo.org/records/3490684) | 6K clips | Zenodo (Tampere University) | [ICASSP 2020](https://ieeexplore.ieee.org/document/9052990) |
 
  This will enable:
  - Audio-to-text retrieval