# Oculus Model Benchmarking Guide

This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.

## Overview

The benchmark script tests the Oculus model on three key vision-language tasks:

1. **Image Captioning** - Generate natural language descriptions of images
2. **Visual Question Answering (VQA)** - Answer questions about image content
3. **Object Detection** - Detect and localize objects in images

## Requirements

### System Requirements

- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16GB+ RAM recommended

### Python Dependencies

Install the required packages (the `mlx.nn` module ships as part of the `mlx` package):

```bash
pip install mlx numpy pillow datasets transformers huggingface_hub
```

Or create a requirements file:

```text
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```

Then install:

```bash
pip install -r requirements.txt
```

## Quick Start

### Basic Usage

Run the benchmark with default settings (5 samples per task):

```bash
cd /Users/kanayochukew/railweb/OceanirPublic/Oculus
python test_benchmarks.py
```

### What Happens

1. **Model Loading**: Initializes the Oculus model with the default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information

## Dataset Information

### Image Captioning

- **Dataset**: COCO Captions (Karpathy split)
- **Source**: `yerevann/coco-karpathy`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, token generation count

### Visual Question Answering

- **Dataset**: VQAv2 validation set
- **Source**: `HuggingFaceM4/VQAv2`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, answer generation

### Object Detection

- **Dataset**: COCO Detection validation set
- **Source**: `detection-datasets/coco`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, confidence scores, bbox predictions

## Configuration

### Adjusting Sample Count

Edit the `num_samples` variable in `main()`:

```python
def main():
    num_samples = 10  # Change this value
    # ...
```

### Model Configuration

The script loads the default Oculus configuration:

- **DINOv3**: Large (1.7B parameters)
- **SigLIP2**: SO400M (400M parameters)
- **LFM2.5**: 1.2B parameters

To use different model sizes, modify the `create_oculus_model()` call:

```python
model = create_oculus_model(
    dinov3_model_size="base",    # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```

## Loading Pretrained Weights

⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.
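As background for the preprocessing step (Step 3 under "What Happens"), the dual-encoder resize-and-normalize logic can be sketched as below. This is a minimal illustration, not the script's actual `ImagePreprocessor`: the function name, the 224×224 resolutions, and the normalization constants (ImageNet statistics for DINOv3, [-1, 1] scaling for SigLIP2) are all assumptions for the sketch.

```python
import numpy as np
from PIL import Image

def preprocess_for_encoders(image, dinov3_size=224, siglip2_size=224):
    """Resize and normalize one image for both vision encoders.

    Returns (dinov3_input, siglip2_input) as float32 CHW arrays.
    """
    def resize_norm(img, size, mean, std):
        img = img.convert("RGB").resize((size, size), Image.BICUBIC)
        arr = np.asarray(img, dtype=np.float32) / 255.0  # HWC, scaled to [0, 1]
        arr = (arr - mean) / std                         # per-channel normalize
        return arr.transpose(2, 0, 1)                    # HWC -> CHW

    # Assumed constants: ImageNet stats for DINOv3, mean=std=0.5 for SigLIP2.
    imagenet_mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    imagenet_std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    dinov3_input = resize_norm(image, dinov3_size, imagenet_mean, imagenet_std)
    siglip2_input = resize_norm(image, siglip2_size, 0.5, 0.5)
    return dinov3_input, siglip2_input
```

In the actual script, the two arrays would then be converted to `mx.array` before being fed to the model.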
### Using HuggingFace Weights

```python
# In the main() function, after loading the model:
import os
from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN")
)

load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN")
)

load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN")
)
```

### Using Local Weights

```python
# Load from a local file. nn.Module.load_weights handles mapping the
# checkpoint's flat keys onto the model's nested parameters.
model.load_weights("/path/to/model_weights.npz")
```

## Expected Output

### Sample Output Format

```
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...
Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions

...
============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```

## Performance Metrics

### Timing Metrics

- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples

### Quality Metrics (with pretrained weights)

- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground truth answers)
- **mAP**: For detection (requires bounding box annotations)

## Troubleshooting

### Out of Memory

If you encounter memory issues:

1. Reduce the number of samples:

   ```python
   num_samples = 3  # Reduce from 5 to 3
   ```

2. Use smaller model sizes:

   ```python
   model = create_oculus_model(
       dinov3_model_size="base",  # Instead of "large"
       siglip2_model_size="so400m",
       num_classes=150
   )
   ```

3. Process samples one at a time (already implemented in the script)

### Dataset Loading Failures

If HuggingFace datasets fail to load:

- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples

### Import Errors

If you get import errors:

```bash
# Install missing dependencies
pip install --upgrade mlx datasets transformers pillow
```

## Advanced Usage

### Custom Datasets

To benchmark on your own datasets:

```python
from PIL import Image

# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0
    },
    # Add more samples...
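    # A second, hypothetical entry. Image.new builds a sample in memory,
    # so the benchmark can run without an image file on disk:
    {
        "image": Image.new("RGB", (256, 256), (127, 127, 127)),
        "captions": ["A plain gray square"],
        "image_id": 1,
    },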
]

# Run the benchmark on the custom samples
benchmark.benchmark_captioning(custom_samples)
```

### Extracting Results

Access detailed results programmatically:

```python
# After running the benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save to file
import json
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```

### Custom Preprocessing

Subclass `ImagePreprocessor` to customize image preprocessing:

```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing here; return one input per vision encoder
        return dinov3_input, siglip2_input
```

## Performance Benchmarks (Reference)

On an Apple Silicon M2 Max (64GB RAM):

| Task | Avg Time | Throughput |
|------|----------|------------|
| Image Captioning | ~2.1s | ~0.5 samples/s |
| VQA | ~1.8s | ~0.6 samples/s |
| Object Detection | ~0.8s | ~1.2 samples/s |

*Note: Times are for randomly initialized models. Pretrained models may vary.*

## Integration with Training Pipeline

To use this benchmark during training:

```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```

## Citation

If you use this benchmark in your research, please cite:

```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```

## Support

For issues or questions:

1. Check the [main README](README.md)
2. Review the [architecture documentation](ARCHITECTURE.md)
3. Open an issue on GitHub

## License

Same as the main Oculus project.