# Oculus Model Benchmarking Guide

This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.
## Overview

The benchmark script tests the Oculus model on three key vision-language tasks:

- **Image Captioning**: Generate natural language descriptions of images
- **Visual Question Answering (VQA)**: Answer questions about image content
- **Object Detection**: Detect and localize objects in images
## Requirements

### System Requirements

- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16 GB+ RAM recommended

### Python Dependencies

Install the required packages (the `nn` module ships inside the `mlx` package):

```bash
pip install mlx numpy pillow datasets transformers huggingface_hub
```
Or create a requirements file:

```text
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```

Then install:

```bash
pip install -r requirements.txt
```
## Quick Start

### Basic Usage

Run the benchmark with default settings (5 samples per task):

```bash
cd /Users/kanayochukew/railweb/OceanirPublic/Oculus
python test_benchmarks.py
```
### What Happens

1. **Model Loading**: Initializes the Oculus model with the default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information
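The per-sample timing reported in the last step can be sketched as follows. This is an illustrative stand-in, not code from the script: `run_timed` and its `infer` callback are hypothetical names, and the real script wraps the Oculus model's forward pass instead of a plain function.

```python
import time

def run_timed(samples, infer):
    """Run `infer` on each sample and record per-sample wall-clock time.

    `infer` is a stand-in for the model's forward pass.
    """
    results = []
    for sample in samples:
        start = time.perf_counter()
        output = infer(sample)
        elapsed = time.perf_counter() - start
        results.append({"output": output, "time": elapsed})
    # Average inference time across all samples
    avg = sum(r["time"] for r in results) / len(results)
    return results, avg

# Usage with a dummy "model" that just doubles its input
results, avg = run_timed([1, 2, 3], lambda x: x * 2)
print(len(results))  # 3
```

The same loop shape applies to all three tasks; only the `infer` callable and the per-sample metadata change.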
## Dataset Information

### Image Captioning

- Dataset: COCO Captions (Karpathy split)
- Source: `yerevann/coco-karpathy`
- Samples: 5 (configurable)
- Metrics: Inference time, token generation count

### Visual Question Answering

- Dataset: VQAv2 validation set
- Source: `HuggingFaceM4/VQAv2`
- Samples: 5 (configurable)
- Metrics: Inference time, answer generation

### Object Detection

- Dataset: COCO Detection validation set
- Source: `detection-datasets/coco`
- Samples: 5 (configurable)
- Metrics: Inference time, confidence scores, bounding-box predictions
## Configuration

### Adjusting Sample Count

Edit the `num_samples` variable in `main()`:

```python
def main():
    num_samples = 10  # Change this value
    # ...
```
### Model Configuration

The script loads the default Oculus configuration:

- DINOv3: Large (1.7B parameters)
- SigLIP2: SO400M (400M parameters)
- LFM2.5: 1.2B parameters

To use different model sizes, modify the `create_oculus_model()` call:

```python
model = create_oculus_model(
    dinov3_model_size="base",    # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150,
)
```
## Loading Pretrained Weights

⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.

### Using HuggingFace Weights

```python
# In the main() function, after loading the model:
import os

from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights for each component
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN"),
)
load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN"),
)
load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN"),
)
```
### Using Local Weights

```python
# Load weights from a local file (.npz or .safetensors);
# load_weights handles unflattening the parameter names.
model.load_weights("/path/to/model_weights.npz")
```
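If a local checkpoint stores weights for the whole model under dotted, prefixed parameter names (a common convention, though the actual key layout depends on how the checkpoint was saved), you may want to select a single component's weights before loading. A minimal, dependency-free sketch; `select_by_prefix` and the key names are hypothetical:

```python
def select_by_prefix(weights, prefix):
    """Keep only entries whose key starts with `prefix`, stripping it.

    `weights` is a flat {name: array} mapping, e.g. keys like
    "dinov3_encoder.patch_embed.w" (hypothetical naming).
    """
    n = len(prefix)
    return {k[n:]: v for k, v in weights.items() if k.startswith(prefix)}

# Example with a toy flat checkpoint
ckpt = {
    "dinov3_encoder.patch_embed.w": [1.0],
    "language_model.embed.w": [2.0],
}
encoder_only = select_by_prefix(ckpt, "dinov3_encoder.")
print(encoder_only)  # {'patch_embed.w': [1.0]}
```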
## Expected Output

### Sample Output Format

```text
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...
Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions
...

============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```
## Performance Metrics

### Timing Metrics

- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples

### Quality Metrics (with pretrained weights)

- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground-truth answers)
- **mAP**: For detection (requires bounding-box annotations)
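As a rough illustration of what a BLEU-style score measures, here is modified unigram precision (the building block of BLEU-1) in plain Python. The script itself may compute BLEU differently or rely on an external library; this is only a sketch of the idea:

```python
from collections import Counter

def unigram_precision(candidate, references):
    """Modified unigram precision, the core of BLEU-1.

    Each candidate token is credited at most as many times as it
    appears in any single reference (count clipping).
    """
    cand = candidate.lower().split()
    cand_counts = Counter(cand)
    # For each token, the maximum count seen in any one reference
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref.lower().split()).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    return clipped / max(len(cand), 1)

score = unigram_precision(
    "a cat on a mat",
    ["a cat sits on the mat", "there is a cat on a mat"],
)
print(score)  # 1.0: every candidate token is covered by a reference
```

Full BLEU additionally combines higher-order n-gram precisions with a brevity penalty.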
## Troubleshooting

### Out of Memory

If you encounter memory issues:

- Reduce the number of samples:

  ```python
  num_samples = 3  # Reduce from 5 to 3
  ```

- Use smaller model sizes:

  ```python
  model = create_oculus_model(
      dinov3_model_size="base",  # Instead of "large"
      siglip2_model_size="so400m",
      num_classes=150,
  )
  ```

- Process samples one at a time (already implemented in the script)
### Dataset Loading Failures

If HuggingFace datasets fail to load:

- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples
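The synthetic fallback can be approximated like this. `make_synthetic_samples` is an illustrative name rather than the script's actual function, and the real script likely builds PIL images; nested lists are used here only to stay dependency-free:

```python
import random

def make_synthetic_samples(n, size=32, seed=0):
    """Build placeholder samples for when dataset download fails.

    Each "image" is a size x size grid of random RGB triples.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    samples = []
    for i in range(n):
        pixels = [[[rng.randrange(256) for _ in range(3)]
                   for _ in range(size)]
                  for _ in range(size)]
        samples.append({
            "image": pixels,
            "captions": [f"synthetic sample {i}"],
            "image_id": i,
        })
    return samples

samples = make_synthetic_samples(3)
print(len(samples))  # 3
```

Because the fallback samples carry the same keys as real dataset entries, the benchmark loop runs unchanged.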
### Import Errors

If you get import errors, reinstall the dependencies:

```bash
pip install --upgrade mlx datasets transformers pillow
```
## Advanced Usage

### Custom Datasets

To benchmark on your own datasets:

```python
from PIL import Image

# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0,
    },
    # Add more samples...
]

# Run the benchmark
benchmark.benchmark_captioning(custom_samples)
```
### Extracting Results

Access detailed results programmatically:

```python
import json

# After running the benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save everything to a file
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```
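One caveat when saving: `json.dump` raises `TypeError` if the results dictionary contains framework scalar types (for example numpy or MLX floats). Passing `default=float` makes the encoder coerce any otherwise-unserializable numeric value. The `Scalar` class below merely stands in for such a type so the example is self-contained:

```python
import json

class Scalar:
    """Stand-in for a framework scalar (e.g. a numpy float)."""
    def __init__(self, v):
        self.v = v
    def __float__(self):
        return float(self.v)

results = {"captioning": [{"time": Scalar(2.1)}]}

# `default` is called once per unserializable object and converts it.
text = json.dumps(results, default=float)
print(text)  # {"captioning": [{"time": 2.1}]}
```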
### Custom Preprocessing

Subclass `ImagePreprocessor` to customize image preprocessing; `preprocess` must return one input per vision encoder:

```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing here
        return dinov3_input, siglip2_input
```
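For reference, here is a dependency-free sketch of the resize-and-normalize pattern a two-encoder preprocessor follows. The input sizes (224/256) and normalization constants are illustrative assumptions, not the actual Oculus values, and a real implementation would use PIL/numpy rather than nested lists:

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize for an image given as rows of pixels."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]

def normalize(img, mean=0.5, std=0.5):
    """Map 0-255 pixel values to roughly [-1, 1]."""
    return [[(p / 255.0 - mean) / std for p in row] for row in img]

def preprocess(image, dinov3_size=224, siglip2_size=256):
    # One resized, normalized copy per vision encoder
    d = normalize(resize_nearest(image, dinov3_size, dinov3_size))
    s = normalize(resize_nearest(image, siglip2_size, siglip2_size))
    return d, s

# 4x4 grayscale toy image, shrunk to tiny sizes for the demo
toy = [[0, 64, 128, 255]] * 4
d, s = preprocess(toy, dinov3_size=2, siglip2_size=3)
print(len(d), len(s))  # 2 3
```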
## Performance Benchmarks (Reference)

On an Apple Silicon M2 Max (64 GB RAM):

| Task             | Avg Time | Throughput     |
|------------------|----------|----------------|
| Image Captioning | ~2.1s    | ~0.5 samples/s |
| VQA              | ~1.8s    | ~0.6 samples/s |
| Object Detection | ~0.8s    | ~1.2 samples/s |

*Note: Times are for randomly initialized models; pretrained models may differ.*
## Integration with Training Pipeline

To use this benchmark during training:

```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```
## Citation

If you use this benchmark in your research, please cite:

```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```
## Support

For issues or questions:

- Check the main README
- Review the architecture documentation
- Open an issue on GitHub

## License

Same as the main Oculus project.