# Oculus Model Benchmarking Guide
This guide explains how to use the `test_benchmarks.py` script to evaluate the Oculus vision-language model on standard benchmark tasks.
## Overview
The benchmark script tests the Oculus model on three key vision-language tasks:
1. **Image Captioning** - Generate natural language descriptions of images
2. **Visual Question Answering (VQA)** - Answer questions about image content
3. **Object Detection** - Detect and localize objects in images
## Requirements
### System Requirements
- Apple Silicon Mac (M1, M2, M3, or later)
- macOS 12.0 or later
- Python 3.8+
- 16GB+ RAM recommended
### Python Dependencies
Install the required packages:
```bash
pip install mlx numpy pillow datasets transformers huggingface_hub
```
Or create a requirements file:
```text
# requirements.txt
mlx>=0.0.8
numpy>=1.21.0
pillow>=9.0.0
datasets>=2.14.0
transformers>=4.30.0
huggingface_hub>=0.16.0
```
Then install:
```bash
pip install -r requirements.txt
```
## Quick Start
### Basic Usage
Run the benchmark with default settings (5 samples per task):
```bash
cd /path/to/Oculus
python test_benchmarks.py
```
### What Happens
1. **Model Loading**: Initializes the Oculus model with the default configuration
2. **Dataset Loading**: Downloads small subsets of benchmark datasets from HuggingFace
3. **Preprocessing**: Resizes and normalizes images for both vision encoders
4. **Inference**: Runs the model on each task
5. **Results**: Prints detailed metrics and timing information
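The steps above amount to a per-sample timing loop. A minimal sketch of that loop is below; `run_model` and the sample list are stand-ins for the real Oculus model and dataset, not names from the script:

```python
import time

def run_benchmark(run_model, samples):
    """Time run_model on each sample and collect per-sample results
    (a simplified stand-in for the loop in test_benchmarks.py)."""
    results = []
    for i, sample in enumerate(samples, start=1):
        start = time.perf_counter()
        output = run_model(sample)  # inference on one sample
        elapsed = time.perf_counter() - start
        results.append({"sample": i, "output": output, "inference_time": elapsed})
    avg = sum(r["inference_time"] for r in results) / len(results)
    return results, avg

# Usage with a dummy "model" that just uppercases its input:
results, avg = run_benchmark(lambda s: s.upper(), ["a cat", "a dog"])
```

Processing one sample at a time (rather than batching) keeps peak memory low, which is why the script takes the same approach.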
## Dataset Information
### Image Captioning
- **Dataset**: COCO Captions (Karpathy split)
- **Source**: `yerevann/coco-karpathy`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, token generation count
### Visual Question Answering
- **Dataset**: VQAv2 validation set
- **Source**: `HuggingFaceM4/VQAv2`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, answer generation
### Object Detection
- **Dataset**: COCO Detection validation set
- **Source**: `detection-datasets/coco`
- **Samples**: 5 (configurable)
- **Metrics**: Inference time, confidence scores, bbox predictions
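When a dataset cannot be downloaded, the script falls back to synthetic samples (see Troubleshooting below). A sketch of what such a fallback can look like; the field names mirror the captioning samples, but the exact structure is an assumption, not code from the script:

```python
import numpy as np

def make_synthetic_caption_samples(num_samples=5, size=(224, 224)):
    """Build placeholder captioning samples with random RGB images,
    mimicking the structure of real COCO caption samples."""
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    samples = []
    for i in range(num_samples):
        image = rng.integers(0, 256, size=(*size, 3), dtype=np.uint8)
        samples.append({
            "image": image,  # HxWx3 uint8 array standing in for a PIL image
            "captions": [f"synthetic caption {i}"],
            "image_id": i,
        })
    return samples

samples = make_synthetic_caption_samples(3)
```

Synthetic samples only exercise the inference path and timing; quality metrics are meaningless on them.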
## Configuration
### Adjusting Sample Count
Edit the `num_samples` variable in `main()`:
```python
def main():
    num_samples = 10  # Change this value
    # ...
```
### Model Configuration
The script loads the default Oculus configuration:
- **DINOv3**: Large (1.7B parameters)
- **SigLIP2**: SO400M (400M parameters)
- **LFM2.5**: 1.2B parameters
To use different model sizes, modify the `create_oculus_model()` call:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Options: "small", "base", "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```
## Loading Pretrained Weights
⚠️ **Important**: The benchmark uses a randomly initialized model by default. For meaningful results, load pretrained weights first.
### Using HuggingFace Weights
```python
# In the main() function, after loading the model:
import os
from oculus import load_dinov3_from_hf, load_siglip2_from_hf, load_lfm2_from_hf

# Set your HuggingFace token
os.environ["HF_TOKEN"] = "your_token_here"

# Load pretrained weights
load_dinov3_from_hf(
    model.dinov3_encoder,
    repo_id="facebook/dinov3-vitl16-pretrain-lvd1689m",
    token=os.getenv("HF_TOKEN")
)
load_siglip2_from_hf(
    model.siglip2_encoder,
    repo_id="google/siglip2-so400m-patch16-naflex",
    token=os.getenv("HF_TOKEN")
)
load_lfm2_from_hf(
    model.language_model,
    repo_id="LiquidAI/LFM2.5-1.2B-Base",
    token=os.getenv("HF_TOKEN")
)
```
### Using Local Weights
```python
# Load from local files
import mlx.core as mx

weights = mx.load("/path/to/model_weights.npz")
model.update(weights)
```
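The `.npz` file read by `mx.load` is the standard NumPy archive format, so a weight file can be inspected (or produced) with NumPy alone before handing it to the model. A round-trip sketch using stand-in weight names that are assumptions, not the real Oculus parameter names:

```python
import os
import tempfile

import numpy as np

# Stand-in weight arrays written to the .npz container format
weights = {
    "dinov3_encoder.patch_embed.weight": np.zeros((16, 3), dtype=np.float32),
    "language_model.embed.weight": np.ones((8, 4), dtype=np.float32),
}
path = os.path.join(tempfile.gettempdir(), "oculus_weights_demo.npz")
np.savez(path, **weights)

# Inspect what a weight file contains before calling model.update(...)
loaded = np.load(path)
names = sorted(loaded.files)
```

Listing `loaded.files` first is a quick way to check that the archive's parameter names match what `model.update()` expects.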
## Expected Output
### Sample Output Format
```
============================================================
Oculus Model Benchmark Suite
============================================================
Testing Oculus vision-language model on benchmark tasks
Compatible with MLX and Apple Silicon
============================================================

[Step 1] Loading Oculus model...
✓ Model loaded successfully

Model Configuration:
  DINOv3: DINOv3-ViT-L/16
  SigLIP2: SigLIP2-SO400M
  Language Model: LFM2.5-1.2B-Base
  Total Parameters: 3,806,600,000

[Step 2] Loading benchmark datasets...
Loading COCO Captions dataset (5 samples)...
✓ Loaded 5 COCO caption samples

============================================================
BENCHMARKING: Image Captioning
============================================================

[Sample 1/5]
  Image ID: 0
  Generated tokens: 23 tokens
  Inference time: 2.456s
  Reference captions: 5 captions
...

============================================================
CAPTIONING SUMMARY
============================================================
Total samples: 5
Successful: 5
Failed: 0
Average inference time: 2.123s
Total time: 10.615s
```
## Performance Metrics
### Timing Metrics
- **Inference Time**: Time to process a single sample
- **Average Time**: Mean inference time across all samples
- **Total Time**: Cumulative time for all samples
### Quality Metrics (with pretrained weights)
- **BLEU Score**: For captioning (requires reference captions)
- **Accuracy**: For VQA (requires ground truth answers)
- **mAP**: For detection (requires bounding box annotations)
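As a concrete illustration of the captioning metric, here is a simplified sentence-level BLEU-1 (clipped unigram precision with a brevity penalty). This is not the script's implementation; real evaluations use a library such as `sacrebleu`, NLTK, or `pycocoevalcap`:

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Unigram BLEU with brevity penalty (single-sentence, simplified)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    # Clip each candidate unigram count by its max count in any reference
    cand_counts = Counter(cand)
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand) if cand else 0.0
    # Brevity penalty against the closest reference length
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / max(len(cand), 1))
    return bp * precision

score = bleu1("a cat on a mat", ["a cat sits on a mat"])  # ≈ 0.819
```

COCO provides five reference captions per image, which is why the function accepts a list of references.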
## Troubleshooting
### Out of Memory
If you encounter memory issues:
1. Reduce the number of samples:
```python
num_samples = 3  # Reduce from 5 to 3
```
2. Use smaller model sizes:
```python
model = create_oculus_model(
    dinov3_model_size="base",  # Instead of "large"
    siglip2_model_size="so400m",
    num_classes=150
)
```
3. Process samples one at a time (already implemented in the script)
### Dataset Loading Failures
If HuggingFace datasets fail to load:
- Check your internet connection
- Verify dataset availability on HuggingFace
- The script automatically falls back to synthetic samples
### Import Errors
If you get import errors:
```bash
# Install missing dependencies
pip install --upgrade mlx datasets transformers pillow
```
## Advanced Usage
### Custom Datasets
To benchmark on your own datasets:
```python
# Create custom samples
custom_samples = [
    {
        "image": Image.open("path/to/image.jpg"),
        "captions": ["A custom caption"],
        "image_id": 0
    },
    # Add more samples...
]

# Run the benchmark
benchmark.benchmark_captioning(custom_samples)
```
### Extracting Results
Access detailed results programmatically:
```python
# After running benchmarks
captioning_results = benchmark.results["captioning"]
vqa_results = benchmark.results["vqa"]
detection_results = benchmark.results["detection"]

# Save to file
import json
with open("benchmark_results.json", "w") as f:
    json.dump(benchmark.results, f, indent=2)
```
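A small helper for summarizing one task's result list can be layered on top; this assumes each result dict carries an `inference_time` key, matching the timing fields the script prints:

```python
def summarize(results):
    """Compute count / mean / min / max inference time for one task's results."""
    times = [r["inference_time"] for r in results if "inference_time" in r]
    if not times:
        return {"count": 0}
    return {
        "count": len(times),
        "mean_s": sum(times) / len(times),
        "min_s": min(times),
        "max_s": max(times),
    }

# Usage with two hand-written results:
stats = summarize([{"inference_time": 2.0}, {"inference_time": 3.0}])
```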
### Custom Preprocessing
Modify the `ImagePreprocessor` class for custom image preprocessing:
```python
class CustomPreprocessor(ImagePreprocessor):
    def preprocess(self, image):
        # Your custom preprocessing
        return dinov3_input, siglip2_input
```
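The two encoders typically expect differently normalized inputs. The constants below (ImageNet mean/std for DINOv3-style inputs, [-1, 1] scaling for SigLIP-style inputs) are common defaults and an assumption here, not values read from the Oculus code:

```python
import numpy as np

# Assumed normalization constants; check the real ImagePreprocessor for
# the values Oculus actually uses.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image):
    """Turn an HxWx3 uint8 image into two differently normalized float inputs."""
    x = image.astype(np.float32) / 255.0
    dinov3_input = (x - IMAGENET_MEAN) / IMAGENET_STD  # ImageNet normalization
    siglip2_input = x * 2.0 - 1.0                      # scale to [-1, 1]
    return dinov3_input, siglip2_input

img = np.full((4, 4, 3), 255, dtype=np.uint8)  # an all-white test image
d, s = preprocess(img)
```

Mismatched normalization is a silent failure mode: the model still runs but produces degraded outputs, so it is worth verifying against the encoder checkpoints you load.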
## Performance Benchmarks (Reference)
On an Apple Silicon M2 Max (64GB RAM):

| Task | Avg Time | Throughput |
|------|----------|------------|
| Image Captioning | ~2.1s | ~0.5 samples/s |
| VQA | ~1.8s | ~0.6 samples/s |
| Object Detection | ~0.8s | ~1.2 samples/s |

*Note: Times are for randomly initialized models. Pretrained models may vary.*
## Integration with Training Pipeline
To use this benchmark during training:
```python
# In your training script
from test_benchmarks import OculusBenchmark, ImagePreprocessor

# After each epoch
preprocessor = ImagePreprocessor()
benchmark = OculusBenchmark(model, preprocessor)
benchmark.benchmark_captioning(val_samples)
benchmark.print_final_summary()
```
## Citation
If you use this benchmark in your research, please cite:
```bibtex
@software{oculus2025,
  title={Oculus: Adaptive Semantic Comprehension Hierarchies},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/Oculus}
}
```
## Support
For issues or questions:
1. Check the [main README](README.md)
2. Review the [architecture documentation](ARCHITECTURE.md)
3. Open an issue on GitHub
## License
Same as the main Oculus project.