niobures's picture
CLAP (code, models, paper)
f8ffad7 verified
# Microsoft CLAP
## Input
**audio file**
Audio file in wav format to use as the model's input.
Default file name is [input.wav](./input.wav)
(source: https://freesound.org/people/InspectorJ/sounds/456440/)
**text file**
A text file containing sentences separated by new lines.
Default file name is [captions.txt](./captions.txt)
## Output
**Cosine similarities**
Cosine similarity between the input audio and the sentences in the text file.
## Usage
Internet connection is required when running the script for the first time, as the model files will be automatically downloaded.
Running the script will compute the cosine similarities between the audio and the captions, using audio and language encoder models train by contrastive training.
You can switch the versions of the encoder model's weight (2022 or 2023) by specifying the version using the argument ```-v``` or ```--version```.
For more information on arguments, try running ```python3 msclap.py --help```
```bash
$ python3 msclap.py -t captions.txt -a input.wav -v 2023
INFO arg_utils.py (13) : Start!
INFO arg_utils.py (158) : env_id updated to 0
INFO arg_utils.py (163) : env_id: 0
INFO arg_utils.py (166) : CPU
INFO msclap.py (167) : input_text: ['Dog barking.', 'Birds whistling.', 'Car passing by.', 'Wind blowing.', 'Water flowing.', 'People talking.']
INFO msclap.py (170) : inference has started...
Similarity:
Birds whistling.: 0.41247469186782837
Wind blowing.: 0.2643369734287262
Water flowing.: 0.23884761333465576
Car passing by.: 0.22803542017936707
People talking.: 0.17387858033180237
Dog barking.: 0.11309497803449631
INFO msclap.py (192) : Script finished successfully.
```
## Reference
* [CLAP](https://github.com/microsoft/CLAP)
## Framework
Pytorch
## Model Format
ONNX opset=11
## Netron
[caption_model_2023.onnx.prototxt]()
[audio_model_2023.onnx.prototxt]()
[caption_model_2022.onnx.prototxt]()
[audio_model_2022.onnx.prototxt]()