
Oculus Model Benchmarking Guide

This guide explains how to use the test_benchmarks.py script to evaluate the Oculus vision-language model on standard benchmark tasks.

Overview

The benchmark script tests the Oculus model on three key vision-language tasks:

  1. Image Captioning - Generate natural language descriptions of images
  2. Visual Question Answering (VQA) - Answer questions about image content
  3. Object Detection - Detect and localize objects in images

Requirements

System Requirements

  • Apple Silicon Mac (M1, M2, M3, or later)
  • macOS 12.0 or later
  • Python 3.8+
  • 16GB+ RAM recommended

Python Dependencies

Install required packages:

pip install mlx numpy pillow datasets transformers huggingface_hub

Or create a requirements file:

# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0

Then install:

pip install -r requirements.txt

Quick Start

Basic Usage

Run the benchmark with default settings (5 samples per task):

cd /path/to/Oculus  # your local checkout of the repository
python test_benchmarks.py

What Happens

  1. Model Loading: Initializes the Oculus model with default configuration
  2. Dataset Loading: Downloads small subsets of benchmark datasets from HuggingFace
  3. Preprocessing: Resizes and normalizes images for both vision encoders
  4. Inference: Runs the model on each task
  5. Results: Prints detailed metrics and timing information
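The timing portion of steps 3-5 can be sketched as a simple loop (a simplified sketch; `run_task` and `infer` are illustrative names, not the actual functions in test_benchmarks.py):

```python
import time

def run_task(samples, infer):
    """Time inference over each sample (steps 3-5 above, simplified)."""
    results = []
    for sample in samples:
        start = time.perf_counter()
        output = infer(sample)          # preprocess + model forward pass
        elapsed = time.perf_counter() - start
        results.append({"output": output, "time_s": elapsed})
    avg = sum(r["time_s"] for r in results) / len(results)
    return results, avg

# Example with a stub standing in for the model:
results, avg = run_task([1, 2, 3], infer=lambda s: s * 2)
```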

Dataset Information

Image Captioning

  • Dataset: COCO Captions (Karpathy split)
  • Source: yerevann/coco-karpathy
  • Samples: 5 (configurable)
  • Metrics: Inference time, token generation count

Visual Question Answering

  • Dataset: VQAv2 validation set
  • Source: HuggingFaceM4/VQAv2
  • Samples: 5 (configurable)
  • Metrics: Inference time, answer generation

Object Detection

  • Dataset: COCO Detection validation set
  • Source: detection-datasets/coco
  • Samples: 5 (configurable)
  • Metrics: Inference time, confidence scores, bbox predictions
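Assuming the script normalizes each dataset into simple per-task sample dicts, the inputs to the three tasks might look like the following (field names and values are illustrative; the actual keys in test_benchmarks.py may differ):

```python
# Illustrative sample schemas for the three benchmark tasks.
captioning_sample = {
    "image_id": 0,
    "image": None,  # a PIL.Image in practice
    "captions": ["A dog runs on the beach.", "A dog playing in sand."],
}
vqa_sample = {
    "image_id": 1,
    "image": None,
    "question": "What color is the dog?",
    "answers": ["brown"] * 10,  # VQAv2 provides 10 annotator answers
}
detection_sample = {
    "image_id": 2,
    "image": None,
    "boxes": [[34.0, 50.0, 120.0, 200.0]],  # COCO format: [x, y, width, height]
    "labels": [18],                         # COCO category id
}
```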

Configuration

Adjusting Sample Count

Edit the num_samples variable in main():

def main():
    num_samples = 10  # Change this value
    # ...

Model Configuration

The script loads the default Oculus configuration:

  • DINOv3: Large (1.7B parameters)
  • SigLIP2: SO400M (400M parameters)
  • LFM2.5: 1.2B parameters

To use different model sizes, modify the create_oculus_model() call:

model = create_oculus_model(
    dinov3_model_size="base",  # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150
)

Loading Pretrained Weights

⚠️ Important: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.

Using HuggingFace Weights

# In the main() function, after loading the model:
import os
from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN")
)

load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN")
)

load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN")
)

Using Local Weights

# Load from local files; load_weights handles the flat
# "module.submodule.weight" keys stored in .npz/.safetensors files
model.load_weights("/path/to/model_weights.npz")

Expected Output

Sample Output Format

============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...

Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions

...

============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s

Performance Metrics

Timing Metrics

  • Inference Time: Time to process a single sample
  • Average Time: Mean inference time across all samples
  • Total Time: Cumulative time for all samples

Quality Metrics (with pretrained weights)

  • BLEU Score: For captioning (requires reference captions)
  • Accuracy: For VQA (requires ground truth answers)
  • mAP: For detection (requires bounding box annotations)
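For example, VQA accuracy is commonly computed with the VQAv2 "soft accuracy" rule, which gives full credit when at least 3 of the 10 annotators gave the predicted answer (this is a common simplification of the official metric, which averages over annotator subsets):

```python
def vqa_soft_accuracy(prediction, annotator_answers):
    """VQAv2-style soft accuracy: full credit if >= 3 annotators agree."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in annotator_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

answers = ["brown"] * 8 + ["tan"] * 2
print(vqa_soft_accuracy("Brown", answers))            # 1.0
print(round(vqa_soft_accuracy("tan", answers), 2))    # 0.67
```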

Troubleshooting

Out of Memory

If you encounter memory issues:

  1. Reduce the number of samples:

     num_samples = 3  # Reduce from 5 to 3

  2. Use smaller model sizes:

     model = create_oculus_model(
         dinov3_model_size="base",  # Instead of "large"
         siglip2_model_size="so400m",
         num_classes=150
     )

  3. Process samples one at a time (already implemented in the script)

Dataset Loading Failures

If HuggingFace datasets fail to load:

  • Check your internet connection
  • Verify dataset availability on HuggingFace
  • The script automatically falls back to synthetic samples
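The fallback logic might look roughly like this (a sketch; `make_synthetic_samples` and `load_samples` are hypothetical names, and the real script's fallback may differ):

```python
def make_synthetic_samples(n):
    # Hypothetical helper: placeholder records standing in for real images
    # (the real script would build small PIL images here)
    return [{"image": None, "captions": ["synthetic sample"], "image_id": i}
            for i in range(n)]

def load_samples(loader, num_samples):
    """Try the real dataset loader first; fall back to synthetic samples."""
    try:
        return loader(num_samples)
    except Exception as exc:
        print(f"Dataset load failed ({exc}); falling back to synthetic samples")
        return make_synthetic_samples(num_samples)

# Example: a loader that raises (e.g. no network) triggers the fallback
def failing_loader(n):
    raise ConnectionError("no internet")

samples = load_samples(failing_loader, 3)
```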

Import Errors

If you get import errors:

# Install missing dependencies
pip install --upgrade mlx datasets transformers pillow

Advanced Usage

Custom Datasets

To benchmark on your own datasets:

# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0
    },
    # Add more samples...
]

# Run benchmark
benchmark.benchmark_captioning(custom_samples)

Extracting Results

Access detailed results programmatically:

# After running benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save to file
import json
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)

Custom Preprocessing

Modify the ImagePreprocessor class for custom image preprocessing:

class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Build inputs for each encoder here, e.g. resize and
        # normalize the PIL image to each encoder's expected resolution
        dinov3_input = ...   # array shaped for the DINOv3 encoder
        siglip2_input = ...  # array shaped for the SigLIP2 encoder
        return dinov3_input, siglip2_input

Performance Benchmarks (Reference)

On Apple Silicon M2 Max (64GB RAM):

Task              Avg Time  Throughput
Image Captioning  ~2.1s     ~0.5 samples/s
VQA               ~1.8s     ~0.6 samples/s
Object Detection  ~0.8s     ~1.2 samples/s

Note: Times are for randomly initialized models; results with pretrained weights may differ.

Integration with Training Pipeline

To use this benchmark during training:

# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()

Citation

If you use this benchmark in your research, please cite:

@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}

Support

For issues or questions:

  1. Check the main README
  2. Review the architecture documentation
  3. Open an issue on GitHub

License

Same as the main Oculus project.