Oculus

File size: 8,319 Bytes

bdb0102

# Oculus Model Benchmarking Guide

This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.

## Overview

The benchmark script tests the Oculus model on three key vision-language tasks:

1. **Image Captioning** - Generate natural language descriptions of images
2. **Visual Question Answering (VQA)** - Answer questions about image content
3. **Object Detection** - Detect and localize objects in images

## Requirements

### System Requirements
- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16GB+ RAM recommended

### Python Dependencies

Install required packages:

```bash
pip install mlx mlx-nn numpy pillow datasets transformers huggingface_hub
```

Or create a requirements file:

```bash
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```

Then install:

```bash
pip install -r requirements.txt
```

## Quick Start

### Basic Usage

Run the benchmark with default settings (5 samples per task):

```bash
cd /Users/kanayochukew/railweb/OceanirPublic/Oculus
python test_benchmarks.py
```

### What Happens

1. **Model Loading**: Initializes the Oculus model with default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information

## Dataset Information

### Image Captioning
- **Dataset**: COCO Captions (Karpathy split)
- **Source**: `yerevann/coco-karpathy`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, token generation count

### Visual Question Answering
- **Dataset**: VQAv2 validation set
- **Source**: `HuggingFaceM4/VQAv2`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, answer generation

### Object Detection
- **Dataset**: COCO Detection validation set
- **Source**: `detection-datasets/coco`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, confidence scores, bbox predictions

## Configuration

### Adjusting Sample Count

Edit the `num_samples` variable in `main()`:

```python
def main():
    num_samples = 10  # Change this value
    # ...
```

### Model Configuration

The script loads the default Oculus configuration:
- **DINOv3**: Large (1.7B parameters)
- **SigLIP2**: SO400M (400M parameters)
- **LFM2.5**: 1.2B parameters

To use different model sizes, modify the `create_oculus_model()` call:

```python
model = create_oculus_model(
    dinov3_model_size="base",  # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```

## Loading Pretrained Weights

⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.

### Using HuggingFace Weights

```python
# In the main() function, after loading the model:
import os
from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN")
)

load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN")
)

load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN")
)
```

### Using Local Weights

```python
# Load from local files
import mlx.core as mx

weights = mx.load("/path/to/model_weights.npz")
model.update(weights)
```

## Expected Output

### Sample Output Format

```
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...

Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions

...

============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```

## Performance Metrics

### Timing Metrics
- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples

### Quality Metrics (with pretrained weights)
- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground truth answers)
- **mAP**: For detection (requires bounding box annotations)

## Troubleshooting

### Out of Memory

If you encounter memory issues:

1. Reduce the number of samples:
```python
num_samples = 3  # Reduce from 5 to 3
```

2. Use smaller model sizes:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Instead of "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```

3. Process samples one at a time (already implemented in the script)

### Dataset Loading Failures

If HuggingFace datasets fail to load:
- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples

### Import Errors

If you get import errors:

```bash
# Install missing dependencies
pip install --upgrade mlx datasets transformers pillow
```

## Advanced Usage

### Custom Datasets

To benchmark on your own datasets:

```python
# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0
    },
    # Add more samples...
]

# Run benchmark
benchmark.benchmark_captioning(custom_samples)
```

### Extracting Results

Access detailed results programmatically:

```python
# After running benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save to file
import json
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```

### Custom Preprocessing

Modify the `ImagePreprocessor` class for custom image preprocessing:

```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing
        return dinov3_input, siglip2_input
```

## Performance Benchmarks (Reference)

On Apple Silicon M2 Max (64GB RAM):

| Task | Avg Time | Throughput |
|------|----------|------------|
| Image Captioning | ~2.1s | ~0.5 samples/s |
| VQA | ~1.8s | ~0.6 samples/s |
| Object Detection | ~0.8s | ~1.2 samples/s |

*Note: Times are for randomly initialized models. Pretrained models may vary.*

## Integration with Training Pipeline

To use this benchmark during training:

```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```

## Citation

If you use this benchmark in your research, please cite:

```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```

## Support

For issues or questions:
1. Check the [main README](README.md)
2. Review the [architecture documentation](ARCHITECTURE.md)
3. Open an issue on GitHub

## License

Same as the main Oculus project.