AZOTH - Open-Vocabulary Object Detection (ANIMA Module)

Part of the ANIMA Perception Suite by Robot Flow Labs.

Model

AZOTH is a real-time open-vocabulary object detection module built on LLMDet (CVPR 2025). It detects arbitrary objects from natural language queries without predefined class vocabularies.

  • Architecture: GroundingDINO + Swin-T backbone + BERT text encoder
  • Variant: LLMDet-Tiny (172.9M parameters)
  • Input: RGB image + text query (natural language)
  • Output: Bounding boxes + confidence scores

Paper

LLMDet: Learning to Understand Visual Grounding with LLMs. arXiv:2602.05730 | CVPR 2025

Evaluation - LVIS minival

Metric        This Model   Paper
AP (all)      37.7         34.8
AP_rare       36.2         30.1
AP_common     28.8         35.2
AP_frequent   38.4         37.3

This baseline exceeds the paper's reported AP on LVIS minival overall (37.7 vs 34.8) and on rare and frequent categories, though AP_common (28.8) falls short of the paper's 35.2.

Exported Formats

Format           File                                    Size     Use Case
SafeTensors      pytorch/azoth_v1_baseline.safetensors   ~660MB   Fast loading, HF transformers
PyTorch (.pth)   pytorch/azoth_v1_baseline.pth           ~660MB   Training, fine-tuning
ONNX             Deferred                                -        Complex architecture; export on target
TensorRT         Deferred                                -        Generate on target hardware (Jetson/L4)
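The SafeTensors file can be inspected without loading the full model: the format is an 8-byte little-endian header length followed by a JSON header mapping tensor names to dtype, shape, and byte offsets. A minimal stdlib-only sketch of that format, using a dummy 2x2 tensor rather than the actual checkpoint:

```python
import json
import struct

def write_safetensors(path, tensors):
    """Write a minimal .safetensors file from {name: (dtype, shape, raw_bytes)}."""
    header, blobs, offset = {}, [], 0
    for name, (dtype, shape, data) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(data)]}
        blobs.append(data)
        offset += len(data)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # u64 little-endian header size
        f.write(header_bytes)
        for blob in blobs:
            f.write(blob)

def read_safetensors_header(path):
    """Read only the JSON header: tensor names, dtypes, shapes, offsets."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

# Dummy 2x2 float32 tensor (16 bytes), little-endian
data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
write_safetensors("dummy.safetensors", {"layer.weight": ("F32", [2, 2], data)})
print(read_safetensors_header("dummy.safetensors"))
```

The same header read against azoth_v1_baseline.safetensors lists every weight tensor in the checkpoint without allocating the ~660MB of tensor data.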

Usage

from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from PIL import Image
import torch

model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "ilessio-aiflowlab/project_azoth",
    subfolder="pytorch"
)
processor = AutoProcessor.from_pretrained(
    "ilessio-aiflowlab/project_azoth",
    subfolder="pytorch"
)

image = Image.open("test.jpg")
inputs = processor(images=image, text="a person. a car. a dog.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, threshold=0.3, target_sizes=[(image.height, image.width)]
)
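The threshold=0.3 argument keeps only confident detections. Conceptually, the filtering step boils down to the following plain-Python sketch (an illustration of the logic, not the transformers implementation):

```python
def filter_detections(boxes, scores, labels, threshold=0.3):
    """Keep (box, score, label) triples whose score clears the threshold."""
    kept = [(b, s, l) for b, s, l in zip(boxes, scores, labels) if s >= threshold]
    # Sort by descending confidence, as most detection APIs do
    return sorted(kept, key=lambda t: t[1], reverse=True)

# Hypothetical raw detections for the query "a person. a car. a dog."
boxes = [[10, 20, 110, 220], [5, 5, 50, 50], [0, 0, 640, 480]]
scores = [0.91, 0.12, 0.45]
labels = ["a person", "a dog", "a car"]
print(filter_detections(boxes, scores, labels))  # "a dog" (0.12) is dropped
```

Note the query format in the snippet above: classes are separated by periods ("a person. a car. a dog."), which is the prompt convention GroundingDINO-style models expect.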

Training Data

Trained on GroundingCap-1M (CVPR 2025):

  • COCO 2017 (205K samples)
  • GQA (358K samples)
  • LLaVA-ReCap-558K (307K samples)
  • Flickr30k (79K samples)
  • V3Det (166K samples)
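The per-dataset counts above sum to roughly 1.1M grounded samples, consistent with the GroundingCap-1M name:

```python
# Sample counts (in thousands) from the training data list above
counts = {"COCO 2017": 205, "GQA": 358, "LLaVA-ReCap-558K": 307,
          "Flickr30k": 79, "V3Det": 166}
total_k = sum(counts.values())
print(f"{total_k}K samples (~{total_k / 1000:.1f}M)")  # 1115K (~1.1M)
```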

Training Config

  • Optimizer: AdamW, weight_decay=0.05
  • LR: 1e-4, cosine decay
  • Warmup: 1000 iterations
  • Epochs: 12 (1x schedule)
  • Batch size: 16 (8 GPUs x 2)
  • Mixed precision: FP16
  • Hardware: 8x NVIDIA L4 (23GB each)
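The LR schedule above (linear warmup for 1000 iterations, then cosine decay) can be sketched as follows; total_steps is a hypothetical value, since the true step count depends on dataset size and batch size:

```python
import math

def lr_at(step, base_lr=1e-4, warmup=1000, total_steps=100_000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear ramp-up
    progress = (step - warmup) / (total_steps - warmup)  # 0.0 -> 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))       # tiny LR at the first step
print(lr_at(999))     # reaches base_lr at the end of warmup
print(lr_at(99_999))  # decays to ~0 by the final step
```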

Directory Structure

pytorch/          SafeTensors + PyTorch weights + tokenizer
checkpoints/      Best training checkpoint (resume training)
configs/          Training TOML configs
reports/          LVIS evaluation reports (JSON)
paper.pdf         LLMDet CVPR 2025 paper

License

Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
