niobures
/

CLAP

Model card Files Files and versions

CLAP / models /onnx /ailia-models /Microsoft-CLAP /code /README.md

niobures's picture

CLAP (code, models, paper)

f8ffad7 verified 3 months ago

|

history blame contribute delete

1.99 kB

	# Microsoft CLAP

	## Input

	audio file

	Audio file in wav format to use as the model's input.
	Default file name is [input.wav](./input.wav)
	(source: https://freesound.org/people/InspectorJ/sounds/456440/)



	text file

	A text file containing sentences separated by new lines.
	Default file name is [captions.txt](./captions.txt)

	## Output

	Cosine similarities

	Cosine similarity between the input audio and the sentences in the text file.

	## Usage
	Internet connection is required when running the script for the first time, as the model files will be automatically downloaded.

	Running the script will compute the cosine similarities between the audio and the captions, using audio and language encoder models train by contrastive training.

	You can switch the versions of the encoder model's weight (2022 or 2023) by specifying the version using the argument ```-v``` or ```--version```.
	For more information on arguments, try running ```python3 msclap.py --help```
	```bash
	$ python3 msclap.py -t captions.txt -a input.wav -v 2023
	INFO arg_utils.py (13) : Start!
	INFO arg_utils.py (158) : env_id updated to 0
	INFO arg_utils.py (163) : env_id: 0
	INFO arg_utils.py (166) : CPU
	INFO msclap.py (167) : input_text: ['Dog barking.', 'Birds whistling.', 'Car passing by.', 'Wind blowing.', 'Water flowing.', 'People talking.']
	INFO msclap.py (170) : inference has started...
	Similarity:
	Birds whistling.: 0.41247469186782837
	Wind blowing.: 0.2643369734287262
	Water flowing.: 0.23884761333465576
	Car passing by.: 0.22803542017936707
	People talking.: 0.17387858033180237
	Dog barking.: 0.11309497803449631
	INFO msclap.py (192) : Script finished successfully.
	```

	## Reference

	* [CLAP](https://github.com/microsoft/CLAP)

	## Framework

	Pytorch

	## Model Format

	ONNX opset=11

	## Netron

	[caption_model_2023.onnx.prototxt]()

	[audio_model_2023.onnx.prototxt]()

	[caption_model_2022.onnx.prototxt]()

	[audio_model_2022.onnx.prototxt]()