AZOTH - Open-Vocabulary Object Detection (ANIMA Module)

Part of the ANIMA Perception Suite by Robot Flow Labs.

Model

AZOTH is a real-time open-vocabulary object detection module built on LLMDet (CVPR 2025). It detects arbitrary objects from natural language queries without predefined class vocabularies.

  • Architecture: GroundingDINO + Swin-T backbone + BERT text encoder
  • Variant: LLMDet-Tiny (172.9M parameters)
  • Input: RGB image + text query (natural language)
  • Output: Bounding boxes + confidence scores

Paper

LLMDet: Learning to Understand Visual Grounding with LLMs. arXiv:2602.05730 | CVPR 2025

Evaluation - LVIS minival

Metric        This Model   Paper
AP (all)      37.7         34.8
AP_rare       36.2         30.1
AP_common     28.8         35.2
AP_frequent   38.4         37.3

This baseline exceeds the paper's reported AP on LVIS minival overall (37.7 vs 34.8) and on rare and frequent categories, though AP_common (28.8) falls short of the paper's 35.2.

Exported Formats

Format           File                                    Size     Use Case
SafeTensors      pytorch/azoth_v1_baseline.safetensors   ~660MB   Fast loading, HF transformers
PyTorch (.pth)   pytorch/azoth_v1_baseline.pth           ~660MB   Training, fine-tuning
ONNX             Deferred                                -        Complex architecture; export on target
TensorRT         Deferred                                -        Generate on target hardware (Jetson/L4)
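The SafeTensors file can be inspected without loading the full model: the format is an 8-byte little-endian header length followed by a JSON header mapping tensor names to dtype, shape, and byte offsets. A minimal stdlib-only sketch of that format, using a dummy 2x2 tensor rather than the actual checkpoint:

```python
import json
import struct

def write_safetensors(path, tensors):
    """Write a minimal .safetensors file from {name: (dtype, shape, raw_bytes)}."""
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, data) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(data)]}
        blobs.append(data)
        offset += len(data)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # u64 little-endian header size
        f.write(header_bytes)
        for blob in blobs:
            f.write(blob)

def read_safetensors_header(path):
    """Read only the JSON header: tensor names, dtypes, shapes, offsets."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

# Dummy 2x2 float32 tensor (16 bytes), little-endian
data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
write_safetensors("dummy.safetensors", {"layer.weight": ("F32", [2, 2], data)})
print(read_safetensors_header("dummy.safetensors"))
```

The same header read against azoth_v1_baseline.safetensors lists every weight tensor in the checkpoint without allocating the ~660MB of tensor data.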

Usage

from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from PIL import Image
import torch

model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "ilessio-aiflowlab/project_azoth",
    subfolder="pytorch"
)
processor = AutoProcessor.from_pretrained(
    "ilessio-aiflowlab/project_azoth",
    subfolder="pytorch"
)

image = Image.open("test.jpg")
inputs = processor(images=image, text="a person. a car. a dog.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, threshold=0.3, target_sizes=[(image.height, image.width)]
)
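The threshold=0.3 argument keeps only confident detections. Conceptually, the filtering step boils down to the following plain-Python sketch (an illustration of the logic, not the transformers implementation):

```python
def filter_detections(boxes, scores, labels, threshold=0.3):
    """Keep (box, score, label) triples whose score clears the threshold."""
    kept = [(b, s, l) for b, s, l in zip(boxes, scores, labels) if s >= threshold]
    # Sort by descending confidence, as most detection APIs do
    return sorted(kept, key=lambda t: t[1], reverse=True)

# Hypothetical raw detections for the query "a person. a car. a dog."
boxes = [[10, 20, 110, 220], [5, 5, 50, 50], [0, 0, 640, 480]]
scores = [0.91, 0.12, 0.45]
labels = ["a person", "a dog", "a car"]
print(filter_detections(boxes, scores, labels))  # "a dog" (0.12) is dropped
```

Note the query format in the snippet above: classes are separated by periods ("a person. a car. a dog."), which is the prompt convention GroundingDINO-style models expect.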

Training Data

Trained on GroundingCap-1M (CVPR 2025):

  • COCO 2017 (205K samples)
  • GQA (358K samples)
  • LLaVA-ReCap-558K (307K samples)
  • Flickr30k (79K samples)
  • V3Det (166K samples)
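The per-dataset counts above sum to roughly 1.1M grounded samples, consistent with the GroundingCap-1M name:

```python
# Sample counts (in thousands) from the training data list above
counts = {"COCO 2017": 205, "GQA": 358, "LLaVA-ReCap-558K": 307,
          "Flickr30k": 79, "V3Det": 166}
total_k = sum(counts.values())
print(f"{total_k}K samples (~{total_k / 1000:.1f}M)")  # 1115K (~1.1M)
```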

Training Config

  • Optimizer: AdamW, weight_decay=0.05
  • LR: 1e-4, cosine decay
  • Warmup: 1000 iterations
  • Epochs: 12 (1x schedule)
  • Batch size: 16 (8 GPUs x 2)
  • Mixed precision: FP16
  • Hardware: 8x NVIDIA L4 (23GB each)
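The LR schedule above (linear warmup for 1000 iterations, then cosine decay) can be sketched as follows; total_steps is a hypothetical value, since the true step count depends on dataset size and batch size:

```python
import math

def lr_at(step, base_lr=1e-4, warmup=1000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear ramp-up
    progress = (step - warmup) / (total_steps - warmup)  # 0.0 -> 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))       # tiny LR at the first step
print(lr_at(999))     # reaches base_lr at the end of warmup
print(lr_at(99_999))  # decays to ~0 by the final step
```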

Directory Structure

pytorch/          SafeTensors + PyTorch weights + tokenizer
checkpoints/      Best training checkpoint (resume training)
configs/          Training TOML configs
reports/          LVIS evaluation reports (JSON)
paper.pdf         LLMDet CVPR 2025 paper

License

Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
