	---
	license: mit
	tags:
	- open_clip
	- bioacoustics
	- multimodal
	- zero-shot-retrieval
	---

	# BioVITA

	BioVITA is a 3-modal (Audio × Image × Text) representation learning model for wildlife species recognition, trained on the BioVITA dataset.

	- Image / Text encoder: ViT-L/14 fine-tuned from [BioCLIP-2](https://huggingface.co/imageomics/bioclip-2)
	- Audio encoder: [CLAP (HTSAT-unfused)](https://huggingface.co/laion/clap-htsat-unfused) fine-tuned with a linear projection adapter

	## Files

	| File | Description |
	|------|-------------|
	| `open_clip_pytorch_model.bin` | Image and text encoder weights (OpenCLIP ViT-L/14) |
	| `open_clip_config.json` | OpenCLIP model config |
	| `clap_weights.pth` | Audio encoder (CLAP) and adapter weights |
	| `tokenizer*.json` / `vocab.json` / `merges.txt` | Tokenizer files |
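
The image and text weights use the file names that OpenCLIP expects for Hub-hosted checkpoints, so they can probably be loaded without the release code. The sketch below assumes the repository resolves through OpenCLIP's `hf-hub:` prefix; that is an assumption based on the file names, not something this card confirms. The helper name `load_image_text_encoder` is hypothetical. The audio encoder (`clap_weights.pth`) is not covered here and most likely needs the release code.

```python
# Assumption: the repo follows OpenCLIP's Hub layout (open_clip_config.json +
# open_clip_pytorch_model.bin), so the "hf-hub:" prefix can resolve it.
HUB_ID = "hf-hub:risashinoda/BioVITA"


def load_image_text_encoder(hub_id: str = HUB_ID):
    """Load the image/text encoder, its eval preprocessing, and its tokenizer.

    Hypothetical helper, not part of the BioVITA release code.
    Requires `pip install open_clip_torch` and network access to the Hub.
    """
    import open_clip  # imported lazily so this module loads without it

    # create_model_and_transforms returns (model, train_preprocess, eval_preprocess)
    model, _, preprocess = open_clip.create_model_and_transforms(hub_id)
    tokenizer = open_clip.get_tokenizer(hub_id)
    model.eval()  # inference mode: disables dropout and similar layers
    return model, preprocess, tokenizer


if __name__ == "__main__":
    model, preprocess, tokenizer = load_image_text_encoder()
    print(type(model).__name__)
```

If loading fails, fall back to the release code linked under Usage, which is the supported path.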

	## Usage

	With the [BioVITA release code](https://github.com/dahlian00/BioVITA):

	```bash
	# Extract features (image + text + audio)
	torchrun --nproc_per_node=8 eval/extract_features.py \
	    --ids_dir path/to/benchmark/ids \
	    --feat_root path/to/output \
	    --tag biovita \
	    --vita_model_id risashinoda/BioVITA \
	    --modalities audio,image,text

	# Evaluate on the BioVITA benchmark (use the same --feat_root and --tag as above)
	python eval/eval_benchmark.py \
	    --base_dir path/to/benchmark \
	    --ids_dir path/to/benchmark/ids \
	    --feat_root path/to/output \
	    --tag biovita
	```
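
To make the retrieval step more concrete, here is a minimal sketch of cross-modal zero-shot retrieval over already extracted features. It is not the logic of `eval/eval_benchmark.py`, and the on-disk format written by `extract_features.py` is not described in this card, so the example builds small arrays in memory. The function names (`l2_normalize`, `retrieve_top_k`, `recall_at_k`) and the toy labels are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_feats, gallery_feats, k: int = 5) -> np.ndarray:
    """Return, for each query, the indices of the k most similar gallery items."""
    q = l2_normalize(np.asarray(query_feats, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float64))
    sims = q @ g.T  # shape: (num_queries, num_gallery)
    # Sort by descending similarity; a stable sort keeps ties in index order.
    return np.argsort(-sims, axis=1, kind="stable")[:, :k]


def recall_at_k(query_labels, gallery_labels, top_k: np.ndarray) -> float:
    """Fraction of queries whose label appears among their top-k gallery labels."""
    gallery_labels = np.asarray(gallery_labels)
    hits = [
        bool(np.any(gallery_labels[row] == label))
        for label, row in zip(query_labels, top_k)
    ]
    return float(np.mean(hits))


if __name__ == "__main__":
    # Toy audio queries and text gallery in a shared 2-D embedding space.
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
    top = retrieve_top_k(audio, text, k=2)
    print(top.tolist())
    print(recall_at_k(["frog", "owl"], ["frog", "owl", "bat"], top[:, :1]))
```

The same functions apply to any pair of modalities (audio to image, image to text, and so on), since all three encoders project into one shared embedding space.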

	## Citation

	```bibtex
	@inproceedings{shinoda2026biovita,
	title = {BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
	author = {Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
	booktitle = {CVPR},
	year = {2026},
	}
	```