# Oculus Model Benchmarking Guide
This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.
## Overview
The benchmark script tests the Oculus model on three key vision-language tasks:
1. **Image Captioning** - Generate natural language descriptions of images
2. **Visual Question Answering (VQA)** - Answer questions about image content
3. **Object Detection** - Detect and localize objects in images
## Requirements
### System Requirements
- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16GB+ RAM recommended
### Python Dependencies
Install the required packages:
```bash
pip install mlx numpy pillow datasets transformers huggingface_hub
```
Or create a requirements file:
```text
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```
Then install:
```bash
pip install -r requirements.txt
```
## Quick Start
### Basic Usage
Run the benchmark with default settings (5 samples per task):
```bash
cd /path/to/Oculus
python test_benchmarks.py
```
### What Happens
1. **Model Loading**: Initializes the Oculus model with the default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information
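The steps above amount to a per-sample timing loop. A minimal sketch of that loop is below; `run_model` and the sample list are stand-ins for the real Oculus model and dataset, not names from the script:

```python
import time

def run_benchmark(run_model, samples):
    """Time run_model on each sample and collect per-sample results
    (a simplified stand-in for the loop in test_benchmarks.py)."""
    results = []
    for i, sample in enumerate(samples, start=1):
        start = time.perf_counter()
        output = run_model(sample)  # inference on one sample
        elapsed = time.perf_counter() - start
        results.append({"sample": i, "output": output, "inference_time": elapsed})
    avg = sum(r["inference_time"] for r in results) / len(results)
    return results, avg

# Usage with a dummy "model" that just uppercases its input:
results, avg = run_benchmark(lambda s: s.upper(), ["a cat", "a dog"])
```

Processing one sample at a time (rather than batching) keeps peak memory low, which is why the script takes the same approach.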
## Dataset Information
### Image Captioning
- **Dataset**: COCO Captions (Karpathy split)
- **Source**: `yerevann/coco-karpathy`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, token generation count
### Visual Question Answering
- **Dataset**: VQAv2 validation set
- **Source**: `HuggingFaceM4/VQAv2`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, answer generation
### Object Detection
- **Dataset**: COCO Detection validation set
- **Source**: `detection-datasets/coco`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, confidence scores, bbox predictions
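When a dataset cannot be downloaded, the script falls back to synthetic samples (see Troubleshooting below). A sketch of what such a fallback can look like; the field names mirror the captioning samples, but the exact structure is an assumption, not code from the script:

```python
import numpy as np

def make_synthetic_caption_samples(num_samples=5, size=(224, 224)):
    """Build placeholder captioning samples with random RGB images,
    mimicking the structure of real COCO caption samples."""
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    samples = []
    for i in range(num_samples):
        image = rng.integers(0, 256, size=(*size, 3), dtype=np.uint8)
        samples.append({
            "image": image,  # HxWx3 uint8 array standing in for a PIL image
            "captions": [f"synthetic caption {i}"],
            "image_id": i,
        })
    return samples

samples = make_synthetic_caption_samples(3)
```

Synthetic samples only exercise the inference path and timing; quality metrics are meaningless on them.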
## Configuration
### Adjusting Sample Count
Edit the `num_samples` variable in `main()`:
```python
def main():
    num_samples = 10  # Change this value
    # ...
```
### Model Configuration
The script loads the default Oculus configuration:
- **DINOv3**: Large (1.7B parameters)
- **SigLIP2**: SO400M (400M parameters)
- **LFM2.5**: 1.2B parameters
To use different model sizes, modify the `create_oculus_model()` call:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```
## Loading Pretrained Weights
⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.
### Using HuggingFace Weights
```python
# In the main() function, after loading the model:
import os
from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN")
)
load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN")
)
load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN")
)
```
### Using Local Weights
```python
# Load from local files
import mlx.core as mx

weights = mx.load("/path/to/model_weights.npz")
model.update(weights)
```
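The `.npz` file read by `mx.load` is the standard NumPy archive format, so a weight file can be inspected (or produced) with NumPy alone before handing it to the model. A round-trip sketch using stand-in weight names that are assumptions, not the real Oculus parameter names:

```python
import os
import tempfile

import numpy as np

# Stand-in weight arrays written to the .npz container format
weights = {
    "dinov3_encoder.patch_embed.weight": np.zeros((16, 3), dtype=np.float32),
    "language_model.embed.weight": np.ones((8, 4), dtype=np.float32),
}
path = os.path.join(tempfile.gettempdir(), "oculus_weights_demo.npz")
np.savez(path, **weights)

# Inspect what a weight file contains before calling model.update(...)
loaded = np.load(path)
names = sorted(loaded.files)
```

Listing `loaded.files` first is a quick way to check that the archive's parameter names match what `model.update()` expects.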
## Expected Output
### Sample Output Format
```
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...
Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions
...

============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```
## Performance Metrics
### Timing Metrics
- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples
### Quality Metrics (with pretrained weights)
- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground truth answers)
- **mAP**: For detection (requires bounding box annotations)
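As a concrete illustration of the captioning metric, here is a simplified sentence-level BLEU-1 (clipped unigram precision with a brevity penalty). This is not the script's implementation; real evaluations use a library such as `sacrebleu`, NLTK, or `pycocoevalcap`:

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Unigram BLEU with brevity penalty (single-sentence, simplified)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    # Clip each candidate unigram count by its max count in any reference
    cand_counts = Counter(cand)
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand) if cand else 0.0
    # Brevity penalty against the closest reference length
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision

score = bleu1("a cat on a mat", ["a cat sits on a mat"])  # ≈ 0.819
```

COCO provides five reference captions per image, which is why the function accepts a list of references.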
## Troubleshooting
### Out of Memory
If you encounter memory issues:
1. Reduce the number of samples:
```python
num_samples = 3  # Reduce from 5 to 3
```
2. Use smaller model sizes:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Instead of "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```
3. Process samples one at a time (already implemented in the script)
### Dataset Loading Failures
If HuggingFace datasets fail to load:
- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples
### Import Errors
If you get import errors:
```bash
# Install missing dependencies
pip install --upgrade mlx datasets transformers pillow
```
## Advanced Usage
### Custom Datasets
To benchmark on your own datasets:
```python
# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0
    },
    # Add more samples...
]

# Run the benchmark
benchmark.benchmark_captioning(custom_samples)
```
### Extracting Results
Access detailed results programmatically:
```python
# After running benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save to file
import json
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```
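A small helper for summarizing one task's result list can be layered on top; this assumes each result dict carries an `inference_time` key, matching the timing fields the script prints:

```python
def summarize(results):
    """Compute count / mean / min / max inference time for one task's results."""
    times = [r["inference_time"] for r in results if "inference_time" in r]
    if not times:
        return {"count": 0}
    return {
        "count": len(times),
        "mean_s": sum(times) / len(times),
        "min_s": min(times),
        "max_s": max(times),
    }

# Usage with two hand-written results:
stats = summarize([{"inference_time": 2.0}, {"inference_time": 3.0}])
```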
### Custom Preprocessing
Modify the `ImagePreprocessor` class for custom image preprocessing:
```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing
        return dinov3_input, siglip2_input
```
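The two encoders typically expect differently normalized inputs. The constants below (ImageNet mean/std for DINOv3-style inputs, [-1, 1] scaling for SigLIP-style inputs) are common defaults and an assumption here, not values read from the Oculus code:

```python
import numpy as np

# Assumed normalization constants; check the real ImagePreprocessor for
# the values Oculus actually uses.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image):
    """Turn an HxWx3 uint8 image into two differently normalized float inputs."""
    x = image.astype(np.float32) / 255.0
    dinov3_input = (x - IMAGENET_MEAN) / IMAGENET_STD  # ImageNet normalization
    siglip2_input = x * 2.0 - 1.0                      # scale to [-1, 1]
    return dinov3_input, siglip2_input

img = np.full((4, 4, 3), 255, dtype=np.uint8)  # an all-white test image
d, s = preprocess(img)
```

Mismatched normalization is a silent failure mode: the model still runs but produces degraded outputs, so it is worth verifying against the encoder checkpoints you load.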
## Performance Benchmarks (Reference)
On an Apple Silicon M2 Max (64GB RAM):

| Task | Avg Time | Throughput |
|------|----------|------------|
| Image Captioning | ~2.1s | ~0.5 samples/s |
| VQA | ~1.8s | ~0.6 samples/s |
| Object Detection | ~0.8s | ~1.2 samples/s |

*Note: Times are for randomly initialized models. Pretrained models may vary.*
## Integration with Training Pipeline
To use this benchmark during training:
```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```
## Citation
If you use this benchmark in your research, please cite:
```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```
## Support
For issues or questions:
1. Check the [main README](README.md)
2. Review the [architecture documentation](ARCHITECTURE.md)
3. Open an issue on GitHub
## License
Same as the main Oculus project.