AZOTH – Open-Vocabulary Object Detection (ANIMA Module)
Part of the ANIMA Perception Suite by Robot Flow Labs.
Model
AZOTH is a real-time open-vocabulary object detection module built on LLMDet (CVPR 2025). It detects arbitrary objects from natural language queries without predefined class vocabularies.
- Architecture: GroundingDINO + Swin-T backbone + BERT text encoder
- Variant: LLMDet-Tiny (172.9M parameters)
- Input: RGB image + text query (natural language)
- Output: Bounding boxes + confidence scores
Paper
LLMDet: Learning to Understand Visual Grounding with LLMs. arXiv: 2602.05730 | CVPR 2025
Evaluation – LVIS minival
| Metric | This Model | Paper |
|---|---|---|
| AP (all) | 37.7 | 34.8 |
| AP_rare | 36.2 | 30.1 |
| AP_common | 28.8 | 35.2 |
| AP_frequent | 38.4 | 37.3 |
This baseline exceeds the paper's reported overall AP, AP_rare, and AP_frequent on LVIS minival; AP_common is lower than the paper's figure.
Exported Formats
| Format | File | Size | Use Case |
|---|---|---|---|
| SafeTensors | pytorch/azoth_v1_baseline.safetensors | ~660MB | Fast loading, HF transformers |
| PyTorch (.pth) | pytorch/azoth_v1_baseline.pth | ~660MB | Training, fine-tuning |
| ONNX | Deferred | – | Complex architecture; export on target |
| TensorRT | Deferred | – | Generate on target hardware (Jetson/L4) |
Usage
```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

# Load model and processor from the pytorch/ subfolder
model = AutoModelForZeroShotObjectDetection.from_pretrained(
    "ilessio-aiflowlab/project_azoth",
    subfolder="pytorch",
)
processor = AutoProcessor.from_pretrained(
    "ilessio-aiflowlab/project_azoth",
    subfolder="pytorch",
)

# Text queries are lowercase phrases separated by periods
image = Image.open("test.jpg")
inputs = processor(images=image, text="a person. a car. a dog.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    threshold=0.3,
    target_sizes=[(image.height, image.width)],
)
```
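`post_process_grounded_object_detection` returns one dict per image with `"scores"`, `"labels"`, and `"boxes"` (xyxy pixel coordinates); depending on the transformers version, labels may be matched phrases or token-level ids. A hypothetical helper, `summarize_detections`, sketches how to consume that structure, shown here on a dummy result rather than real model output:

```python
import torch

def summarize_detections(result, min_score=0.3):
    # result: one dict from post_process_grounded_object_detection,
    # assumed to hold "scores" (tensor), "labels" (list of phrases),
    # and "boxes" (tensor of xyxy pixel coordinates).
    rows = []
    for label, score, box in zip(result["labels"], result["scores"], result["boxes"]):
        if score >= min_score:
            rows.append((label, round(float(score), 3), box.tolist()))
    return rows

# Dummy result illustrating the expected structure
dummy = {
    "labels": ["a person", "a dog"],
    "scores": torch.tensor([0.82, 0.12]),
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0], [5.0, 5.0, 50.0, 60.0]]),
}
kept = summarize_detections(dummy, min_score=0.3)
print(kept)
```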
Training Data
Trained on GroundingCap-1M (CVPR 2025):
- COCO 2017 (205K samples)
- GQA (358K samples)
- LLaVA-ReCap-558K (307K samples)
- Flickr30k (79K samples)
- V3Det (166K samples)
Training Config
- Optimizer: AdamW, weight_decay=0.05
- LR: 1e-4, cosine decay
- Warmup: 1000 iterations
- Epochs: 12 (1x schedule)
- Batch size: 16 (8 GPUs x 2)
- Mixed precision: FP16
- Hardware: 8x NVIDIA L4 (23GB each)
Directory Structure
pytorch/ SafeTensors + PyTorch weights + tokenizer
checkpoints/ Best training checkpoint (resume training)
configs/ Training TOML configs
reports/ LVIS evaluation reports (JSON)
paper.pdf LLMDet CVPR 2025 paper
License
Apache 2.0 – Robot Flow Labs / AIFLOW LABS LIMITED