# Oculus Model Benchmarking Guide
This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.
## Overview
The benchmark script tests the Oculus model on three key vision-language tasks:
1. **Image Captioning** - Generate natural language descriptions of images
2. **Visual Question Answering (VQA)** - Answer questions about image content
3. **Object Detection** - Detect and localize objects in images
## Requirements
### System Requirements
- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16GB+ RAM recommended
### Python Dependencies
Install required packages:
```bash
pip install mlx numpy pillow datasets transformers huggingface_hub
```
Or create a requirements file:
```text
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```
Then install:
```bash
pip install -r requirements.txt
```
## Quick Start
### Basic Usage
Run the benchmark with default settings (5 samples per task):
```bash
cd /path/to/Oculus
python test_benchmarks.py
```
### What Happens
1. **Model Loading**: Initializes the Oculus model with default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information
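The steps above boil down to a timed per-sample loop. The sketch below shows the general shape; `run_inference` is a hypothetical stand-in for the actual Oculus forward pass, not part of the script's API.

```python
import time

def run_inference(sample):
    # Hypothetical stand-in for the actual model call.
    return {"tokens": 23}

def benchmark_task(samples):
    """Time each sample individually, mirroring the script's per-task loop."""
    results = []
    for i, sample in enumerate(samples, start=1):
        start = time.perf_counter()
        output = run_inference(sample)
        elapsed = time.perf_counter() - start
        results.append({"sample": i, "time": elapsed, "output": output})
    return results

results = benchmark_task([{"id": n} for n in range(5)])
```

Timing each sample separately (rather than the whole batch) is what lets the script report both per-sample and average inference times.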
## Dataset Information
### Image Captioning
- **Dataset**: COCO Captions (Karpathy split)
- **Source**: `yerevann/coco-karpathy`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, token generation count
### Visual Question Answering
- **Dataset**: VQAv2 validation set
- **Source**: `HuggingFaceM4/VQAv2`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, answer generation
### Object Detection
- **Dataset**: COCO Detection validation set
- **Source**: `detection-datasets/coco`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, confidence scores, bbox predictions
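Each dataset above arrives with its own field names, so raw records are normalized into a common sample dict before benchmarking. A minimal sketch, using the `image`/`captions`/`image_id` keys shown in the Custom Datasets section; the raw field name (`sentences`) is an assumption about the source dataset, not a guaranteed schema:

```python
def to_caption_sample(record, image_id):
    """Normalize a raw COCO-style record into the dict shape the benchmark
    consumes. The input key "sentences" is an assumed raw field name."""
    return {
        "image": record.get("image"),          # a PIL.Image in the real dataset
        "captions": record.get("sentences", []),
        "image_id": image_id,
    }

raw = {"image": None, "sentences": ["a dog on a beach"]}
sample = to_caption_sample(raw, 0)
```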
## Configuration
### Adjusting Sample Count
Edit the `num_samples` variable in `main()`:
```python
def main():
    num_samples = 10  # Change this value
    # ...
```
### Model Configuration
The script loads the default Oculus configuration:
- **DINOv3**: Large (1.7B parameters)
- **SigLIP2**: SO400M (400M parameters)
- **LFM2.5**: 1.2B parameters
To use different model sizes, modify the `create_oculus_model()` call:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150,
)
```
## Loading Pretrained Weights
⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.
### Using HuggingFace Weights
```python
# In the main() function, after loading the model:
import os
from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf
# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"
# Load pretrained weights
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN"),
)
load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN"),
)
load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN"),
)
```
### Using Local Weights
```python
# Load from local files
import mlx.core as mx
weights = mx.load("/path/to/model_weights.npz")
model.update(weights)
```
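The `.npz` file above is a standard NumPy archive mapping parameter names to arrays. A NumPy round-trip, shown here only to illustrate the container format (the benchmark itself uses `mx.load`, which reads the same format):

```python
import os
import tempfile

import numpy as np

# Save a few named arrays into a single .npz archive; keys become the
# flat weight-dictionary keys seen when loading model checkpoints.
path = os.path.join(tempfile.gettempdir(), "demo_weights.npz")
weights = {"layer1.weight": np.ones((2, 2)), "layer1.bias": np.zeros(2)}
np.savez(path, **weights)

# Load it back: each key maps to its array.
loaded = np.load(path)
```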
## Expected Output
### Sample Output Format
```
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================
[Step 1] Loading Oculus model...
✓ Model loaded successfully
Model Configuration:
DINOv3: DINOv3-ViT-L/16
SigLIP2: SigLIP2-SO400M
Language Model: LFM2.5-1.2B-Base
Total Parameters: 3,806,600,000
[Step 2] Loading benchmark datasets...
Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples
============================================================
BENCHMARKING: Image Captioning
============================================================
[Sample 1/5]
Image ID: 0
Generated tokens: 23 tokens
Inference time: 2.456s
Reference captions: 5 captions
...
============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```
## Performance Metrics
### Timing Metrics
- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples
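The summary numbers in the sample output follow directly from the per-sample times. A minimal sketch of the aggregation (the timing values below are taken from the sample output above, not measured):

```python
def summarize_times(times):
    """Aggregate per-sample inference times into average and total."""
    total = sum(times)
    return {
        "samples": len(times),
        "average_s": total / len(times) if times else 0.0,
        "total_s": total,
    }

# Five hypothetical per-sample times consistent with the sample output.
summary = summarize_times([2.456, 2.101, 1.987, 2.050, 2.021])
```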
### Quality Metrics (with pretrained weights)
- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground truth answers)
- **mAP**: For detection (requires bounding box annotations)
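For reference, the standard VQAv2 metric is a "soft" accuracy: an answer counts as fully correct when at least 3 of the 10 human annotators gave it. A sketch of that formula (the script itself may or may not compute it this way):

```python
def vqa_soft_accuracy(prediction, annotator_answers):
    """VQAv2 soft accuracy: min(#annotators matching the prediction / 3, 1)."""
    matches = sum(1 for a in annotator_answers if a == prediction)
    return min(matches / 3.0, 1.0)

# Three annotators said "cat", so the prediction scores full credit.
acc = vqa_soft_accuracy("cat", ["cat", "cat", "dog", "cat", "kitten"])
```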
## Troubleshooting
### Out of Memory
If you encounter memory issues:
1. Reduce the number of samples:
```python
num_samples = 3 # Reduce from 5 to 3
```
2. Use smaller model sizes:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Instead of "large"
    siglip2_model_size="so400m",
    num_classes=150,
)
```
3. Process samples one at a time (already implemented in the script)
### Dataset Loading Failures
If HuggingFace datasets fail to load:
- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples
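The fallback behavior can be sketched as a try/except around the loader; `load_benchmark_samples` and `failing_loader` are hypothetical names used only to illustrate the pattern:

```python
def load_benchmark_samples(loader, num_samples):
    """Try the real dataset loader; fall back to synthetic samples on failure."""
    try:
        return loader(num_samples)
    except Exception as err:
        print(f"Dataset load failed ({err}); using synthetic samples")
        return [
            {"image": None, "captions": ["synthetic caption"], "image_id": i}
            for i in range(num_samples)
        ]

def failing_loader(n):
    # Simulates a network failure from a real dataset loader.
    raise ConnectionError("no internet")

samples = load_benchmark_samples(failing_loader, 3)
```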
### Import Errors
If you get import errors:
```bash
# Install missing dependencies
pip install --upgrade mlx datasets transformers pillow
```
## Advanced Usage
### Custom Datasets
To benchmark on your own datasets:
```python
from PIL import Image

# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0,
    },
    # Add more samples...
]
# Run benchmark
benchmark.benchmark_captioning(custom_samples)
```
### Extracting Results
Access detailed results programmatically:
```python
# After running benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]
# Save to file
import json
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```
### Custom Preprocessing
Modify the `ImagePreprocessor` class for custom image preprocessing:
```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing
        return dinov3_input, siglip2_input
```
## Performance Benchmarks (Reference)
On Apple Silicon M2 Max (64GB RAM):
| Task | Avg Time | Throughput |
|------|----------|------------|
| Image Captioning | ~2.1s | ~0.5 samples/s |
| VQA | ~1.8s | ~0.6 samples/s |
| Object Detection | ~0.8s | ~1.2 samples/s |
*Note: Times were measured with randomly initialized weights; times with pretrained weights may differ.*
## Integration with Training Pipeline
To use this benchmark during training:
```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor
# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```
## Citation
If you use this benchmark in your research, please cite:
```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```
## Support
For issues or questions:
1. Check the [main README](README.md)
2. Review the [architecture documentation](ARCHITECTURE.md)
3. Open an issue on GitHub
## License
Same as the main Oculus project.