---
license: mit
language:
- en
tags:
- gemma3
- multimodal
- vision-language
- ocr
- synthetic-data
- tinystories
- validation
- tiny-model
pipeline_tag: image-to-text
---

# tinygemma3ocr2m

`shibatch/tinygemma3ocr2m` is a tiny Gemma3-style multimodal validation checkpoint.

The model is intentionally small. It is designed to verify that a Gemma3 multimodal inference implementation can correctly run the full multimodal path:

```text
image -> vision tower -> multimodal projector -> image tokens -> text decoder -> generated text
```

It is **not** intended to be a general OCR model.

The checkpoint is trained on synthetic fixed-style digit OCR data, mixed with a small amount of TinyStories text data so that the text decoder retains basic text-generation behavior.

## Model summary

```text
Repository: shibatch/tinygemma3ocr2m
Model class: Gemma3ForConditionalGeneration
Task: synthetic digit OCR + text-generation sanity
Image size: 128 x 128
Patch size: 16 x 16
Image tokens per image: 64
Image token: <image>
Image token id: 1003
Approximate scale: about 2M parameters
OCR prompt: Read the digits.
OCR target format: digit string only
```

The model is intended to be loaded with Hugging Face Transformers as `Gemma3ForConditionalGeneration`.

## Intended use

This checkpoint is mainly useful for validating custom Gemma3 multimodal inference implementations.

Typical checks include:

* vision tower weight loading;
* multimodal projector execution;
* image token insertion;
* text decoder generation after image conditioning;
* small-model CPU/GPU inference paths;
* safetensors key mapping and tensor layout assumptions.

The expected OCR validation behavior is:

```text
Input image: synthetic 128x128 image containing centered digits
Prompt: Read the digits.
Output: the digit string
```

For example, a rendered image containing `6235317` should generate:

```text
6235317
```

## Non-goals and limitations

This model is deliberately narrow.

It is **not** a general OCR model. It should not be expected to handle natural images, documents, handwriting, rotated text, complex backgrounds, arbitrary fonts, tables, screenshots, Japanese text, or real-world scanned documents.

Known limitations:

* trained on synthetic digit images;
* primarily intended for fixed-style controlled rendering;
* prompt generalization is limited;
* text generation is only a sanity check and is not high quality;
* generated TinyStories-like text may contain repetition or malformed words;
* OCR robustness to strong augmentation is limited at this model scale;
* exact-match accuracy becomes harder for long digit strings because per-digit errors accumulate.

This checkpoint is best understood as a **minimal multimodal validation artifact**, not as a production OCR system.

## Training overview

The final checkpoint was produced in stages.

### Phase 1: text warmup

A small `Gemma3ForCausalLM` text model was trained on TinyStories. The resulting text weights were transplanted into a `Gemma3ForConditionalGeneration` checkpoint.

This transplant-based setup was used because direct text-only warmup inside the multimodal wrapper was less stable for this tiny configuration.

### Phase 2a: single-digit OCR

The model was trained to read one synthetic digit from a 128x128 image. The target output was the digit itself.

### Phase 2b: multi-digit OCR

The model was trained to read multi-digit synthetic images. The standard OCR prompt was:

```text
Read the digits.
```

The model learned to generate the digit string followed by EOS.

### Phase 2b plus text

The final checkpoint keeps the OCR distribution close to Phase 2b and mixes in TinyStories text batches as a light regularizer.

The purpose of the text mix is not to make the model a strong language model. It is only intended to prevent the text decoder from collapsing completely into OCR-only behavior.

A representative mixed-training configuration was:

```text
OCR ratio: 0.90
Text ratio: 0.10
OCR loss weight: 1.0
Text loss weight: 0.2
OCR style: fixed synthetic digits
Font size: 30
Offset: 0
Rotation: 0
Noise: 0
Prompt: Read the digits.
Digit length: 1 to 8
```
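
As a rough illustration of how ratios and loss weights like these might be applied, the sketch below picks a task for each batch and scales its loss. It is not the training script used for this checkpoint: `pick_task` and `weighted_loss` are hypothetical helper names, and choosing one task per batch is an assumption made for the example.

```python
import random

# Values taken from the representative configuration above.
OCR_RATIO = 0.90
LOSS_WEIGHTS = {"ocr": 1.0, "text": 0.2}


def pick_task(rng: random.Random) -> str:
    """Choose which kind of batch to draw next (assumption: one task per batch)."""
    return "ocr" if rng.random() < OCR_RATIO else "text"


def weighted_loss(task: str, raw_loss: float) -> float:
    """Scale a batch loss by the weight configured for its task."""
    return LOSS_WEIGHTS[task] * raw_loss


rng = random.Random(0)
tasks = [pick_task(rng) for _ in range(10_000)]
ocr_fraction = tasks.count("ocr") / len(tasks)
print(f"OCR batches: {ocr_fraction:.3f}")
print(f"a text loss of 2.5 contributes {weighted_loss('text', 2.5):.2f}")
```

With these weights a text batch moves the parameters far less than an OCR batch of the same raw loss, which matches the stated goal of using text only as a light regularizer.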

## Installation

```bash
pip install torch transformers pillow huggingface_hub numpy
```

## OCR example

The following example downloads a sample image from this repository and runs OCR.

```python
import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"

# Tokenizer and model are loaded from the hf/ subfolder of the repository.
tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

# Load the sample image and scale pixels from [0, 255] to [-1, 1], shape (1, 3, H, W).
path = hf_hub_download(repo, "sample_images/sample_00_6235317.png")
img = Image.open(path).convert("RGB")
pix = torch.from_numpy(np.asarray(img, dtype=np.float32) / 127.5 - 1).permute(2, 0, 1)[None].cuda()

# Prompt layout: <bos>, then mm_tokens_per_image image tokens, then the text prompt.
ids = [tok.bos_token_id] + [model.config.image_token_index] * model.config.mm_tokens_per_image
ids += tok.encode("\nRead the digits.\n", add_special_tokens=False)

input_ids = torch.tensor([ids], device="cuda")
attention_mask = torch.ones_like(input_ids)

# Greedy decoding; generation stops at EOS or after 12 new tokens.
out = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pixel_values=pix,
    max_new_tokens=12,
    do_sample=False,
    pad_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)

# Decode only the newly generated tokens.
print(tok.decode(out[0][len(ids):], skip_special_tokens=True))
```

Expected output:

```text
6235317
```

## Text-generation sanity check

The checkpoint also supports direct text-decoder sanity checks through the language model branch.

This example intentionally does **not** use image tokens or `pixel_values`. It directly calls:

```text
model.model.language_model
model.lm_head
```

Example:

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"
tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

ids = [tok.bos_token_id] + tok.encode("Once upon", add_special_tokens=False)
x = torch.tensor([ids], device="cuda")

# Greedy decoding without a KV cache: re-run the full sequence at every step.
with torch.no_grad():
    for _ in range(50):
        h = model.model.language_model(input_ids=x, use_cache=False, return_dict=True).last_hidden_state
        # Take the most likely next token from the logits at the last position.
        nxt = model.lm_head(h)[0, -1].argmax().view(1, 1)
        x = torch.cat([x, nxt], dim=1)
        if int(nxt) == tok.eos_token_id:
            break

print(tok.decode(x[0], skip_special_tokens=True))
```

A successful sanity check means the model continues with roughly TinyStories-like English text. The output does not need to be high quality. The purpose is only to confirm that the text decoder has not collapsed into digit-only output.

Acceptable behavior:

```text
Once upon a time, there was a little girl named Lily...
```

Known imperfect behavior:

```text
malformed names
repetition
awkward TinyStories-like phrasing
occasional non-word fragments
```

These are expected for a model of this size and training mix.

## Evaluation notes

A representative OCR result for the intended fixed-style OCR setting was:

```text
exact_match: 0.87
digit_accuracy: 0.976
length_accuracy: 0.998
```

The important point is that the model can exercise the full multimodal path and produce digit strings with high accuracy in the intended fixed synthetic setting.
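
This card does not spell out how the three metrics are defined, so the sketch below shows one plausible reading, useful mainly for reproducing comparable numbers in a custom inference engine. The definitions are assumptions: `exact_match` is the fraction of predictions identical to the target, `length_accuracy` is the fraction with the same length as the target, and `digit_accuracy` compares digits position by position and divides by the total number of target digits. `ocr_metrics` is a hypothetical helper name.

```python
def ocr_metrics(targets: list[str], predictions: list[str]) -> dict[str, float]:
    """Compute the three metrics under the assumed definitions described above."""
    pairs = list(zip(targets, predictions))
    exact = sum(t == p for t, p in pairs)
    same_len = sum(len(t) == len(p) for t, p in pairs)
    # Position-wise comparison; zip stops at the shorter string, so missing
    # digits count as errors because the denominator is the target length.
    digit_hits = sum(a == b for t, p in pairs for a, b in zip(t, p))
    total_digits = sum(len(t) for t in targets)
    n = len(targets)
    return {
        "exact_match": exact / n,
        "digit_accuracy": digit_hits / total_digits,
        "length_accuracy": same_len / n,
    }


# Toy data for illustration only, not real model output.
targets = ["6235317", "42", "1677216", "9"]
predictions = ["6235317", "42", "1677218", "9"]
metrics = ocr_metrics(targets, predictions)
print(metrics)
```

In the toy data a single wrong digit out of 17 costs one of four exact matches, which illustrates why `exact_match` sits well below `digit_accuracy`.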

For longer digit strings, exact match is much stricter than digit accuracy. Even if each digit is correct with high probability, the probability of a fully correct 8-digit sequence is lower because all digits must be correct simultaneously.
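
To make the arithmetic concrete: if each digit were read correctly with independent probability `p`, an `n`-digit string would be fully correct with probability `p ** n`. The short calculation below uses the reported digit accuracy of 0.976 as `p`. Independence of per-digit errors and a uniform spread of lengths from 1 to 8 are assumptions made for illustration; neither is stated for the actual evaluation.

```python
p = 0.976  # reported digit_accuracy, used here as a per-digit success probability

# Probability that all n digits are right, assuming independent per-digit errors.
exact_by_length = {n: p ** n for n in range(1, 9)}
for n, prob in exact_by_length.items():
    print(f"{n} digits: {prob:.3f}")

# Average over lengths 1..8, assuming each length is equally likely.
mean_exact = sum(exact_by_length.values()) / len(exact_by_length)
print(f"mean over lengths 1-8: {mean_exact:.3f}")
```

Under these assumptions an 8-digit string is fully correct about 82% of the time, and the average over lengths 1 to 8 is about 0.90, which is in the same range as the reported `exact_match` of 0.87.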

## Recommended validation tests

### OCR smoke test

```bash
python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --text 1677216 \
  --font-size 30 \
  --prompt "Read the digits."
```

Expected output:

```text
Prediction:
target: 1677216
prediction: 1677216
raw_text: '1677216'
```

### Batch synthetic OCR test

```bash
python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --eval-synthetic 500 \
  --min-digits 1 \
  --max-digits 8 \
  --font-size 30 \
  --prompt "Read the digits."
```

### Text sanity test

```bash
python test_text_generation_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --prompt "Once upon" \
  --max-new-tokens 50
```

## Expected config properties

A correctly loaded 128x128 checkpoint should report:

```text
image_token_index: 1003
mm_tokens_per_image: 64
vision image_size: 128
vision patch_size: 16
```

If `mm_tokens_per_image` is not 64, or if the vision image size is not 128, the checkpoint is not the intended 128x128 model.
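
A custom loader can run this check directly on the parsed `config.json`. The following is a minimal sketch: `check_config` is a hypothetical helper, and it assumes the vision settings are nested under a `vision_config` key, as in the usual Gemma3 configuration layout in Transformers. Adjust the key paths if the file is organized differently.

```python
def check_config(cfg: dict) -> list[str]:
    """Return a list of mismatches against the expected 128x128 configuration."""
    vision = cfg.get("vision_config", {})
    expected = {
        "image_token_index": (cfg.get("image_token_index"), 1003),
        "mm_tokens_per_image": (cfg.get("mm_tokens_per_image"), 64),
        "vision image_size": (vision.get("image_size"), 128),
        "vision patch_size": (vision.get("patch_size"), 16),
    }
    return [
        f"{name}: got {actual!r}, expected {want!r}"
        for name, (actual, want) in expected.items()
        if actual != want
    ]


# Stand-in for the parsed config.json; the nesting under "vision_config"
# is an assumption about how the file is organized.
cfg = {
    "image_token_index": 1003,
    "mm_tokens_per_image": 64,
    "vision_config": {"image_size": 128, "patch_size": 16},
}
problems = check_config(cfg)
print("config OK" if not problems else "\n".join(problems))
```

Returning a list of mismatches rather than raising on the first one makes it easier to see at a glance whether a different checkpoint variant was loaded.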

## File layout

The recommended repository layout is:

```text
shibatch/tinygemma3ocr2m
README.md
config.json
model.safetensors
tokenizer.json
tokenizer_config.json
special_tokens_map.json
sample_images/
sample_00_6235317.png
sample_images.json
eval_ocr_augmented.json
eval_text_generation.json
safetensors_keys.json
artifact_metadata.json
```

The model files should be placed at the repository root so that this works:

```python
Gemma3ForConditionalGeneration.from_pretrained("shibatch/tinygemma3ocr2m")
PreTrainedTokenizerFast.from_pretrained("shibatch/tinygemma3ocr2m")
```

If the model files are instead placed under an `hf/` subdirectory, as the code examples in this README assume, pass `subfolder="hf"` to `from_pretrained()`.

## Implementation notes

This checkpoint uses repeated `<image>` tokens in the text input. The number of inserted image tokens must match `mm_tokens_per_image`.

For this 128x128 checkpoint:

```text
image_size = 128
patch_size = 16
patch grid = 8 x 8
mm_tokens_per_image = 64
```

The OCR prompt format is approximately:

```text
<bos>
<image> repeated 64 times
\nRead the digits.\n
```

The target tokens are the digit string followed by EOS.
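
The token count and the input layout described in this section can be sketched without loading the model. In the example below, `image_token_count` and `build_ocr_input_ids` are hypothetical helpers, one token per patch (no pooling) is assumed because the 8 x 8 grid equals `mm_tokens_per_image`, and the BOS id and prompt ids are placeholder values rather than real tokenizer output.

```python
def image_token_count(image_size: int, patch_size: int) -> int:
    """Number of patches in a square image, one token per patch (no pooling assumed)."""
    per_side = image_size // patch_size
    return per_side * per_side


def build_ocr_input_ids(
    bos_id: int, image_token_id: int, n_image_tokens: int, prompt_ids: list[int]
) -> list[int]:
    """Lay out <bos>, the repeated image token, then the tokenized prompt."""
    return [bos_id] + [image_token_id] * n_image_tokens + list(prompt_ids)


n_tokens = image_token_count(128, 16)
# bos_id=2 and prompt_ids are placeholders, not real tokenizer output.
ids = build_ocr_input_ids(
    bos_id=2, image_token_id=1003, n_image_tokens=n_tokens, prompt_ids=[10, 11, 12]
)
print(n_tokens, len(ids), ids.count(1003))
```

An inference engine under test can compare its own input sequence against this layout before looking at any generated output, which helps separate tokenization mistakes from vision or projector bugs.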

## Citation / attribution

This is a synthetic tiny validation checkpoint derived from a Gemma3-style architecture and trained for local inference-engine validation. TinyStories was used for basic text warmup and text regularization.

## Disclaimer

This model is for engineering validation and testing. It should not be used for real OCR tasks or user-facing document understanding.