BioVITA
BioVITA is a tri-modal (audio × image × text) representation learning model for wildlife species recognition, trained on the BioVITA dataset.
- Image / text encoder: ViT-L/14 fine-tuned from BioCLIP-2
- Audio encoder: CLAP (HTSAT-unfused) fine-tuned with a linear projection adapter (see the sketch below)
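The audio branch is aligned with the image/text tower by passing CLAP embeddings through the linear projection adapter. A minimal sketch of that projection, assuming a 512-d CLAP embedding and the 768-d ViT-L/14 joint space (the trained CLAP and adapter weights ship in clap_weights.pth, and the release code contains the actual loader):

```python
import torch

# Hypothetical adapter: a single linear projection from CLAP's
# embedding space (assumed 512-d for HTSAT-unfused) into the
# OpenCLIP ViT-L/14 joint space (768-d).
audio_adapter = torch.nn.Linear(512, 768, bias=False)

clap_embedding = torch.randn(1, 512)           # stand-in for a CLAP audio embedding
audio_feature = audio_adapter(clap_embedding)  # now lives in the shared space
audio_feature = audio_feature / audio_feature.norm(dim=-1, keepdim=True)
```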
Files
| File | Description |
|---|---|
| `open_clip_pytorch_model.bin` | Image & text encoder weights (OpenCLIP ViT-L/14) |
| `open_clip_config.json` | OpenCLIP model config |
| `clap_weights.pth` | Audio encoder (CLAP) + adapter weights |
| `tokenizer*.json` / `vocab.json` / `merges.txt` | Tokenizer files |
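Individual files can be fetched programmatically with huggingface_hub, for example the audio-encoder checkpoint:

```python
from huggingface_hub import hf_hub_download

# Download the CLAP + adapter checkpoint from this repository.
clap_path = hf_hub_download(repo_id="risashinoda/BioVITA", filename="clap_weights.pth")
print(clap_path)
```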
Usage
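With OpenCLIP (image/text encoder and tokenizer):

```python
import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:risashinoda/BioVITA')
tokenizer = open_clip.get_tokenizer('hf-hub:risashinoda/BioVITA')
```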
With the BioVITA release code:
```bash
# Extract features (image + text + audio)
torchrun --nproc_per_node=8 eval/extract_features.py \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita \
    --vita_model_id risashinoda/BioVITA \
    --modalities audio,image,text

# Evaluate on the BioVITA benchmark
python eval/eval_benchmark.py \
    --base_dir path/to/benchmark \
    --ids_dir path/to/benchmark/ids \
    --feat_root path/to/output \
    --tag biovita
```
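The on-disk layout of the extracted features is defined by the release code. Assuming it yields one row-aligned feature matrix per modality (file names below are hypothetical), cross-modal retrieval reduces to cosine similarity:

```python
import numpy as np

# Hypothetical file names; the actual layout under --feat_root is
# determined by extract_features.py in the release code.
audio = np.load("path/to/output/biovita_audio.npy")
image = np.load("path/to/output/biovita_image.npy")

# L2-normalize so cosine similarity becomes a plain dot product.
audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
image = image / np.linalg.norm(image, axis=1, keepdims=True)
sims = audio @ image.T

# Recall@1 for audio-to-image retrieval with row-aligned pairs.
recall_at_1 = (sims.argmax(axis=1) == np.arange(len(sims))).mean()
print(f"audio->image R@1: {recall_at_1:.3f}")
```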
Citation
```bibtex
@inproceedings{shinoda2026biovita,
  title     = {BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
  author    = {Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
  booktitle = {CVPR},
  year      = {2026},
}
```