|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- multimodal |
|
|
- vision-language |
|
|
- openvino |
|
|
- optimum-intel |
|
|
- testing |
|
|
- tiny-model |
|
|
- minicpmo |
|
|
base_model: openbmb/MiniCPM-o-2_6 |
|
|
library_name: transformers |
|
|
pipeline_tag: image-text-to-text |
|
|
--- |
|
|
|
|
|
# Tiny Random MiniCPM-o-2_6 |
|
|
|
|
|
A tiny (~42 MB) randomly initialized version of [MiniCPM-o-2.6](https://huggingface.co/openbmb/MiniCPM-o-2_6) designed for **testing purposes** in the [optimum-intel](https://github.com/huggingface/optimum-intel) library.
|
|
|
|
|
## Purpose |
|
|
|
|
|
This model was created to replace the existing test model at `optimum-intel-internal-testing/tiny-random-MiniCPM-o-2_6` (185 MB) with a smaller alternative for CI/CD testing. Smaller test models reduce: |
|
|
|
|
|
- Download times in CI pipelines |
|
|
- Storage requirements |
|
|
- Test execution time |
|
|
|
|
|
## Size Comparison |
|
|
|
|
|
| Model | Total Size | Model Weights | |
|
|
|-------|------------|---------------| |
|
|
| [openbmb/MiniCPM-o-2_6](https://huggingface.co/openbmb/MiniCPM-o-2_6) (Original) | 17.4 GB | ~17 GB | |
|
|
| [optimum-intel-internal-testing/tiny-random-MiniCPM-o-2_6](https://huggingface.co/optimum-intel-internal-testing/tiny-random-MiniCPM-o-2_6) (Current Test Model) | 185 MB | 169 MB | |
|
|
| **hrithik-dev8/tiny-random-MiniCPM-o-2_6** (This Model) | **~42 MB** | **41.55 MB** | |
|
|
|
|
|
**Result: more than 4× smaller than the current test model (185 MB → ~42 MB)**
|
|
|
|
|
## Model Configuration |
|
|
|
|
|
| Component | This Model | Original | |
|
|
|-----------|------------|----------| |
|
|
| **Vocabulary** | 5,000 tokens | 151,700 tokens | |
|
|
| **LLM Hidden Size** | 128 | 3,584 | |
|
|
| **LLM Layers** | 1 | 40 | |
|
|
| **LLM Attention Heads** | 8 | 28 | |
|
|
| **Vision Hidden Size** | 128 | 1,152 | |
|
|
| **Vision Layers** | 1 | 27 | |
|
|
| **Image Size** | 980 (preserved) | 980 | |
|
|
| **Patch Size** | 14 (preserved) | 14 | |
|
|
| **Audio d_model** | 64 | 1,280 | |
|
|
| **TTS Hidden Size** | 128 | - | |
|
|
|
|
|
## Parameter Breakdown |
|
|
|
|
|
| Component | Parameters | Size (MB) | |
|
|
|-----------|------------|-----------| |
|
|
| TTS/DVAE | 19,339,766 | 36.89 | |
|
|
| LLM | 1,419,840 | 2.71 | |
|
|
| Vision | 835,328 | 1.59 | |
|
|
| Resampler | 91,392 | 0.17 | |
|
|
| Audio | 56,192 | 0.11 | |
|
|
| Other | 20,736 | 0.04 | |
|
|
| **Total** | **21,763,254** | **~41.5** | |
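A breakdown like the one above can be produced by grouping a model's parameters by their top-level module name. The sketch below uses a hand-written list of `(name, element_count)` pairs as a stand-in for iterating `model.named_parameters()` in PyTorch; the component names are illustrative, not the model's exact module names:

```python
from collections import defaultdict

def parameter_breakdown(named_params):
    """Group (parameter_name, element_count) pairs by top-level module."""
    totals = defaultdict(int)
    for name, count in named_params:
        component = name.split(".")[0]  # e.g. "llm.layers.0.weight" -> "llm"
        totals[component] += count
    return dict(totals)

# Stand-in data; with a real model use:
#   [(n, p.numel()) for n, p in model.named_parameters()]
named_params = [
    ("tts.dvae.encoder.weight", 19_339_766),
    ("llm.embed_tokens.weight", 1_419_840),
    ("vpm.encoder.weight", 835_328),
]
print(parameter_breakdown(named_params))
```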
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Why Keep TTS/DVAE Components? |
|
|
|
|
|
The TTS (Text-to-Speech) component, which includes the DVAE (Discrete Variational Auto-Encoder), accounts for approximately 37 MB (~89%) of the model size. While the optimum-intel tests do **not** exercise TTS functionality (they only test image+text → text generation), we retain this component because:
|
|
|
|
|
1. **Structural Consistency**: Removing TTS via `init_tts=False` causes structural differences in the model that lead to numerical divergence between PyTorch and OpenVINO outputs |
|
|
2. **Test Compatibility**: The `test_compare_to_transformers` test compares PyTorch vs OpenVINO outputs and requires exact structural matching |
|
|
3. **Architecture Integrity**: The MiniCPM-o architecture expects TTS weights to be present during model loading |
|
|
|
|
|
### Tokenizer Shrinking |
|
|
|
|
|
The vocabulary was reduced from 151,700 to 5,000 tokens: |
|
|
|
|
|
- **Base tokens**: IDs 0-4899 (first 4,900 most common tokens) |
|
|
- **Special tokens**: IDs 4900-4949 (remapped from original high IDs) |
|
|
- **BPE merges**: Filtered from 151,387 to 4,644 (only merges involving retained tokens) |
|
|
|
|
|
Key token mappings: |
|
|
| Token | ID | |
|
|
|-------|-----| |
|
|
| `<unk>` | 4900 | |
|
|
| `<\|endoftext\|>` | 4901 | |
|
|
| `<\|im_start\|>` | 4902 | |
|
|
| `<\|im_end\|>` | 4903 | |
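The shrinking procedure above can be sketched as follows. This is an illustrative reconstruction, not the exact script used: it keeps tokens whose IDs fall below 4,900, reassigns special tokens to consecutive IDs starting at 4,900 (matching the mapping table), and drops BPE merges whose halves or merged result are no longer in the vocabulary:

```python
def shrink_vocab(vocab, special_tokens, merges, base_size=4900):
    """vocab: {token: id}; special_tokens: ordered list of token strings;
    merges: list of (left, right) BPE pairs. Returns (new_vocab, new_merges)."""
    # Keep only tokens with the lowest base_size IDs.
    new_vocab = {tok: i for tok, i in vocab.items() if i < base_size}
    # Remap special tokens to IDs base_size, base_size + 1, ...
    for offset, tok in enumerate(special_tokens):
        new_vocab[tok] = base_size + offset
    # Keep a merge only if both halves and their concatenation survive.
    new_merges = [
        (a, b) for a, b in merges
        if a in new_vocab and b in new_vocab and a + b in new_vocab
    ]
    return new_vocab, new_merges

vocab = {"a": 0, "b": 1, "ab": 2, "rare": 151_000}
new_vocab, new_merges = shrink_vocab(
    vocab, ["<unk>", "<|endoftext|>"], [("a", "b"), ("a", "rare")]
)
# "<unk>" -> 4900, "<|endoftext|>" -> 4901; the merge touching "rare" is dropped.
```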
|
|
|
|
|
### Reproducibility |
|
|
|
|
|
Model weights are initialized with a fixed random seed (42) to ensure: |
|
|
- Reproducible outputs between runs |
|
|
- Consistent behavior between PyTorch and OpenVINO |
|
|
- Passing of `test_compare_to_transformers`, which compares framework outputs
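Seeded initialization can be illustrated with Python's `random` module as a stand-in for the `torch.manual_seed(42)` call a PyTorch weight-initialization script would make before building the model; the distribution parameters here are arbitrary:

```python
import random

def init_weights(n, seed=42):
    """Draw n pseudo-random 'weights' reproducibly from a fixed seed."""
    rng = random.Random(seed)  # stand-in for torch.manual_seed(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]

# Two runs with the same seed yield identical weights.
assert init_weights(4) == init_weights(4)
```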
|
|
|
|
|
## Test Results |
|
|
|
|
|
Tested with `pytest tests/openvino/test_seq2seq.py -k "minicpmo" -v`: |
|
|
|
|
|
| Test | Status | Notes | |
|
|
|------|--------|-------| |
|
|
| `test_compare_to_transformers` | ✅ PASSED | PyTorch/OpenVINO outputs match | |
|
|
| `test_generate_utils` | ✅ PASSED | Generation pipeline works | |
|
|
| `test_model_can_be_loaded_after_saving` | ⚠️ FAILED | Windows file locking issue (not model-related) | |
|
|
|
|
|
The third test failure is a **Windows-specific issue** where OpenVINO keeps file handles open, preventing cleanup of temporary directories. This is a known platform limitation, not a model defect. The test passes on Linux/macOS. |
|
|
|
|
|
## Usage |
|
|
|
|
|
### For optimum-intel Testing |
|
|
|
|
|
```python |
|
|
# In optimum-intel/tests/openvino/utils_tests.py, update MODEL_NAMES: |
|
|
MODEL_NAMES = { |
|
|
# ... other models ... |
|
|
"minicpmo": "hrithik-dev8/tiny-random-MiniCPM-o-2_6", |
|
|
} |
|
|
``` |
|
|
|
|
|
Then run tests: |
|
|
```bash |
|
|
pytest tests/openvino/test_seq2seq.py -k "minicpmo" -v |
|
|
``` |
|
|
|
|
|
### Basic Model Loading |
|
|
|
|
|
```python |
|
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
model = AutoModel.from_pretrained( |
|
|
"hrithik-dev8/tiny-random-MiniCPM-o-2_6", |
|
|
trust_remote_code=True |
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained( |
|
|
"hrithik-dev8/tiny-random-MiniCPM-o-2_6", |
|
|
trust_remote_code=True |
|
|
) |
|
|
``` |
|
|
|
|
|
## Files Included |
|
|
|
|
|
| File | Size | Description | |
|
|
|------|------|-------------| |
|
|
| `model.safetensors` | 41.55 MB | Model weights (bfloat16) | |
|
|
| `config.json` | 5.33 KB | Model configuration | |
|
|
| `tokenizer.json` | 338.27 KB | Shrunk tokenizer (5,000 tokens) | |
|
|
| `tokenizer_config.json` | 12.78 KB | Tokenizer settings | |
|
|
| `vocab.json` | 85.70 KB | Vocabulary mapping | |
|
|
| `merges.txt` | 36.58 KB | BPE merge rules | |
|
|
| `preprocessor_config.json` | 1.07 KB | Image processor config | |
|
|
| `generation_config.json` | 121 B | Generation settings | |
|
|
| `added_tokens.json` | 1.13 KB | Special tokens | |
|
|
| `special_tokens_map.json` | 1.24 KB | Special token mappings | |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- Python 3.8+ |
|
|
- transformers >= 4.45.0, < 4.52.0 |
|
|
- torch |
|
|
- For OpenVINO testing: optimum-intel with OpenVINO backend |
|
|
|
|
|
## Limitations |
|
|
|
|
|
⚠️ **This model is for testing only** - it produces random/meaningless outputs and should not be used for inference. |