---
license: mit
tags:
- open_clip
- bioacoustics
- multimodal
- zero-shot-retrieval
---
# BioVITA
**BioVITA** is a 3-modal (Audio × Image × Text) representation learning model for wildlife species recognition, trained on the BioVITA dataset.
- Image / Text encoder: ViT-L/14 fine-tuned from [BioCLIP-2](https://huggingface.co/imageomics/bioclip-2)
- Audio encoder: [CLAP (HTSAT-unfused)](https://huggingface.co/laion/clap-htsat-unfused) fine-tuned with a linear projection adapter
## Files
| File | Description |
|------|-------------|
| `open_clip_pytorch_model.bin` | Image & text encoder weights (OpenCLIP ViT-L/14) |
| `open_clip_config.json` | OpenCLIP model config |
| `clap_weights.pth` | Audio encoder (CLAP) + adapter weights |
| `tokenizer*.json` / `vocab.json` / `merges.txt` | Tokenizer files |
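
The file names above follow the layout that OpenCLIP uses for Hub-hosted checkpoints, so the image and text encoder can probably be loaded directly with `open_clip`. The snippet below is a minimal sketch under that assumption; it has not been confirmed against this repository. The helper names `load_image_text_encoder` and `embed_texts` are hypothetical. Loading `clap_weights.pth` is left to the release code, since the checkpoint layout is not described here.

```python
# Sketch only: assumes the repository is loadable through OpenCLIP's
# "hf-hub:" prefix, as the file layout suggests.
REPO_ID = "risashinoda/BioVITA"
HUB_NAME = f"hf-hub:{REPO_ID}"


def load_image_text_encoder(hub_name=HUB_NAME):
    """Return (model, preprocess, tokenizer) for the image/text encoder."""
    import open_clip  # imported lazily so this module loads without it

    # create_model_and_transforms returns (model, train_transform, eval_transform)
    model, _, preprocess = open_clip.create_model_and_transforms(hub_name)
    tokenizer = open_clip.get_tokenizer(hub_name)
    model.eval()
    return model, preprocess, tokenizer


def embed_texts(model, tokenizer, texts):
    """Encode a list of strings into L2-normalized text embeddings."""
    import torch

    tokens = tokenizer(texts)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)


# Example (downloads the weights on first use):
# model, preprocess, tokenizer = load_image_text_encoder()
# text_feats = embed_texts(model, tokenizer, ["Corvus corax", "Parus major"])
```

Image embeddings would follow the same pattern with `preprocess` and `model.encode_image`.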
## Usage
With the [BioVITA release code](https://github.com/dahlian00/BioVITA):
```bash
# Extract features (image + text + audio)
torchrun --nproc_per_node=8 eval/extract_features.py \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita \
    --vita_model_id risashinoda/BioVITA \
    --modalities audio,image,text

# Evaluate on the BioVITA benchmark
python eval/eval_benchmark.py \
    --base_dir path/to/benchmark \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita
```
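
To make the zero-shot retrieval step more concrete, the following is a minimal sketch of cross-modal retrieval by cosine similarity: a query embedding from one modality is ranked against a gallery of embeddings from another. It is not the release code's evaluation logic. The on-disk format written by `extract_features.py` is not documented here, so the sketch uses toy 3-dimensional vectors in place of real features, and the species names are purely illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_scores(query, gallery):
    """Cosine similarity between one query vector and each gallery vector."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]


def top_k(query, gallery, labels, k=1):
    """Return the k best-matching (label, score) pairs, highest score first."""
    scores = cosine_scores(query, gallery)
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return [(labels[i], scores[i]) for i in ranked[:k]]


# Toy placeholders: one "audio" query and three "text" gallery embeddings.
audio_query = [0.9, 0.1, 0.0]
text_gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
species = ["Corvus corax", "Parus major", "Strix aluco"]

best = top_k(audio_query, text_gallery, species, k=2)
for name, score in best:
    print(f"{name}: {score:.3f}")
```

The same ranking applies to any pair of modalities (audio to image, image to text, and so on), provided the embeddings live in the shared space.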
## Citation
```bibtex
@inproceedings{shinoda2026biovita,
title = {BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
author = {Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
booktitle = {CVPR},
year = {2026},
}
```