How to use risashinoda/BioVITA with OpenCLIP:

```python
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:risashinoda/BioVITA')
tokenizer = open_clip.get_tokenizer('hf-hub:risashinoda/BioVITA')
```
---
license: mit
tags:
- open_clip
- bioacoustics
- multimodal
- zero-shot-retrieval
---

# BioVITA

**BioVITA** is a 3-modal (Audio × Image × Text) representation learning model for wildlife species recognition, trained on the BioVITA dataset.

- Image / Text encoder: ViT-L/14 fine-tuned from [BioCLIP-2](https://huggingface.co/imageomics/bioclip-2)
- Audio encoder: [CLAP (HTSAT-unfused)](https://huggingface.co/laion/clap-htsat-unfused) fine-tuned with a linear projection adapter
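
The "linear projection adapter" on the audio side can be pictured with a small sketch: an affine map takes an audio feature into the embedding space shared with the image and text encoders, and the result is L2-normalized so that dot products act as cosine similarities. This is an illustration only. The `LinearAdapter` class, the toy dimensions, and the random initialization are assumptions, not the released implementation, whose actual layer sizes are not stated here.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class LinearAdapter:
    """Hypothetical affine map y = W x + b from audio features to the shared space."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(in_dim)
        # Random weights stand in for trained parameters.
        self.weight = [
            [rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, features):
        if len(features) != len(self.weight[0]):
            raise ValueError("feature size does not match the adapter input size")
        return [
            sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(self.weight, self.bias)
        ]


# Toy sizes (assumed): an 8-d audio feature projected into a 4-d shared space.
adapter = LinearAdapter(in_dim=8, out_dim=4)
audio_feature = [0.1 * i for i in range(8)]
audio_embedding = l2_normalize(adapter(audio_feature))
```

Once audio, image, and text embeddings are all unit-length vectors in the same space, comparing any two modalities reduces to a dot product.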

## Files

| File | Description |
|------|-------------|
| `open_clip_pytorch_model.bin` | Image & text encoder weights (OpenCLIP ViT-L/14) |
| `open_clip_config.json` | OpenCLIP model config |
| `clap_weights.pth` | Audio encoder (CLAP) + adapter weights |
| `tokenizer*.json` / `vocab.json` / `merges.txt` | Tokenizer files |
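
If you only need one of these files, for example the audio checkpoint, a sketch along the following lines may help. It assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function. The `BIOVITA_FILES` mapping and the `fetch_biovita_file` helper are hypothetical names introduced for illustration, and only files named explicitly in the table are listed.

```python
# Files named explicitly in the table above, with their described roles.
BIOVITA_FILES = {
    "open_clip_pytorch_model.bin": "image & text encoder weights",
    "open_clip_config.json": "OpenCLIP model config",
    "clap_weights.pth": "audio encoder (CLAP) + adapter weights",
    "vocab.json": "tokenizer file",
    "merges.txt": "tokenizer file",
}


def fetch_biovita_file(filename, repo_id="risashinoda/BioVITA"):
    """Download one listed file from the Hub and return its local cache path."""
    if filename not in BIOVITA_FILES:
        raise ValueError(f"{filename!r} is not one of the listed files")
    # Imported lazily so the mapping above can be used without the dependency.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


# Example (requires network access):
# path = fetch_biovita_file("clap_weights.pth")
```

The internal layout of `clap_weights.pth` is not described here, so inspect the checkpoint before relying on any particular key names.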

## Usage

With the [BioVITA release code](https://github.com/dahlian00/BioVITA):

```bash
# Extract features (image + text + audio)
torchrun --nproc_per_node=8 eval/extract_features.py \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita \
    --vita_model_id risashinoda/BioVITA \
    --modalities audio,image,text

# Evaluate on BioVITA benchmark
python eval/eval_benchmark.py \
    --base_dir path/to/benchmark \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita
```
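
To give a feel for what cross-modal retrieval over extracted features involves, here is a minimal recall@k sketch on toy vectors. It is not the logic of `eval/eval_benchmark.py`: the on-disk feature format, the metric the benchmark actually reports, and the `recall_at_k` helper are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k=1):
    """Fraction of queries whose top-k gallery items include the correct species."""
    gallery = [l2_normalize(g) for g in gallery_feats]
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        q = l2_normalize(feat)
        # Cosine similarity between the query and every gallery item.
        scores = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_feats)


# Toy audio-to-image retrieval: 2 audio queries against 3 image embeddings.
audio_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_labels = ["owl", "heron"]
image_feats = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]
image_labels = ["owl", "frog", "heron"]

r1 = recall_at_k(audio_feats, audio_labels, image_feats, image_labels, k=1)
r2 = recall_at_k(audio_feats, audio_labels, image_feats, image_labels, k=2)
print(r1, r2)  # 0.5 1.0
```

The same function works for any query/gallery pairing (audio to text, image to audio, and so on) as long as both sets of features live in the shared embedding space.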

## Citation

```bibtex
@inproceedings{shinoda2026biovita,
  title = {BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
  author = {Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
  booktitle = {CVPR},
  year = {2026},
}
```