---
license: mit
language:
- en
tags:
- gemma3
- multimodal
- vision-language
- ocr
- synthetic-data
- tinystories
- validation
- tiny-model
pipeline_tag: image-to-text
---

# tinygemma3ocr2m

`shibatch/tinygemma3ocr2m` is a tiny Gemma3-style multimodal validation checkpoint.

The model is intentionally small. It is designed to verify that a Gemma3 multimodal inference implementation can correctly run the full multimodal path:

```text
image -> vision tower -> multimodal projector -> image tokens -> text decoder -> generated text
```

It is **not** intended to be a general OCR model. The checkpoint is trained on synthetic fixed-style digit OCR and mixed with a small amount of TinyStories text data so that the text decoder retains basic text-generation behavior.

## Model summary

```text
Repository: shibatch/tinygemma3ocr2m
Model class: Gemma3ForConditionalGeneration
Task: synthetic digit OCR + text-generation sanity
Image size: 128 x 128
Patch size: 16 x 16
Image tokens per image: 64
Image token:
Image token id: 1003
Approximate scale: about 2M parameters
OCR prompt: Read the digits.
OCR target format: digit string only
```

The model is intended to be loaded with Hugging Face Transformers as `Gemma3ForConditionalGeneration`.

## Intended use

This checkpoint is mainly useful for validating custom Gemma3 multimodal inference implementations.

Typical checks include:

* vision tower weight loading;
* multimodal projector execution;
* image token insertion;
* text decoder generation after image conditioning;
* small-model CPU/GPU inference paths;
* safetensors key mapping and tensor layout assumptions.

The expected OCR validation behavior is:

```text
Input image: synthetic 128x128 image containing centered digits
Prompt: Read the digits.
Output: the digit string
```

For example, a rendered image containing `6235317` should generate:

```text
6235317
```

## Non-goals and limitations

This model is deliberately narrow. It is **not** a general OCR model.
It should not be expected to handle natural images, documents, handwriting, rotated text, complex backgrounds, arbitrary fonts, tables, screenshots, Japanese text, or real-world scanned documents.

Known limitations:

* trained on synthetic digit images;
* primarily intended for fixed-style controlled rendering;
* prompt generalization is limited;
* text generation is only a sanity check and is not high quality;
* generated TinyStories-like text may contain repetition or malformed words;
* OCR robustness to strong augmentation is limited at this model scale;
* exact-match accuracy becomes harder for long digit strings because per-digit errors accumulate.

This checkpoint is best understood as a **minimal multimodal validation artifact**, not as a production OCR system.

## Training overview

The final checkpoint was produced in stages.

### Phase 1: text warmup

A small `Gemma3ForCausalLM` text model was trained on TinyStories. The resulting text weights were transplanted into a `Gemma3ForConditionalGeneration` checkpoint.

This transplant-based setup was used because direct text-only warmup inside the multimodal wrapper was less stable for this tiny configuration.

### Phase 2a: single-digit OCR

The model was trained to read one synthetic digit from a 128x128 image. The target output was the digit itself.

### Phase 2b: multi-digit OCR

The model was trained to read multi-digit synthetic images. The standard OCR prompt was:

```text
Read the digits.
```

The model learned to generate the digit string followed by EOS.

### Phase 2b plus text

The final checkpoint keeps the OCR distribution close to Phase 2b and mixes in TinyStories text batches as a light regularizer.

The purpose of the text mix is not to make the model a strong language model. It is only intended to prevent the text decoder from collapsing completely into OCR-only behavior.
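
The mixing idea can be sketched as a per-step choice between an OCR batch and a text batch, with the loss scaled by a per-task weight. This is only an illustration: the function name `plan_mixed_steps`, the random per-step sampling scheme, and the seed are assumptions, not the actual training code. The default ratio and weights are taken from the representative configuration listed in this section.

```python
import random


def plan_mixed_steps(num_steps, ocr_ratio=0.9, ocr_weight=1.0, text_weight=0.2, seed=0):
    """Return one (task, loss_weight) pair per training step (hypothetical helper)."""
    rng = random.Random(seed)
    plan = []
    for _ in range(num_steps):
        # Draw an OCR batch with probability ocr_ratio, otherwise a text batch.
        if rng.random() < ocr_ratio:
            plan.append(("ocr", ocr_weight))
        else:
            plan.append(("text", text_weight))
    return plan


plan = plan_mixed_steps(1000)
ocr_share = sum(task == "ocr" for task, _ in plan) / len(plan)
print(f"OCR share of steps: {ocr_share:.2f}")
```

In a real loop, each pair would select which data loader to draw from and the factor applied to that step's loss before the backward pass.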
A representative mixed-training configuration was:

```text
OCR ratio: 0.90
Text ratio: 0.10
OCR loss weight: 1.0
Text loss weight: 0.2
OCR style: fixed synthetic digits
Font size: 30
Offset: 0
Rotation: 0
Noise: 0
Prompt: Read the digits.
Digit length: 1 to 8
```

## Installation

```bash
pip install torch transformers pillow huggingface_hub numpy
```

## OCR example

The following example downloads a sample image from this repository and runs OCR.

```python
import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"
tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

# Load the sample image and scale pixels from [0, 255] to [-1, 1], shape (1, 3, H, W).
path = hf_hub_download(repo, "sample_images/sample_00_6235317.png")
img = Image.open(path).convert("RGB")
pix = torch.from_numpy(np.asarray(img, dtype=np.float32) / 127.5 - 1).permute(2, 0, 1)[None].cuda()

# Prompt layout: BOS, the image token repeated mm_tokens_per_image times, then the text prompt.
ids = [tok.bos_token_id] + [model.config.image_token_index] * model.config.mm_tokens_per_image
ids += tok.encode("\nRead the digits.\n", add_special_tokens=False)
input_ids = torch.tensor([ids], device="cuda")
attention_mask = torch.ones_like(input_ids)

out = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pixel_values=pix,
    max_new_tokens=12,
    do_sample=False,
    pad_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)
# Decode only the newly generated tokens.
print(tok.decode(out[0][len(ids):], skip_special_tokens=True))
```

Expected output:

```text
6235317
```

## Text-generation sanity check

The checkpoint also supports direct text-decoder sanity checks through the language model branch.

This example intentionally does **not** use image tokens or `pixel_values`.
It directly calls:

```text
model.model.language_model
model.lm_head
```

Example:

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"
tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

ids = [tok.bos_token_id] + tok.encode("Once upon", add_special_tokens=False)
x = torch.tensor([ids], device="cuda")

# Greedy decoding through the text branch only; the full sequence is re-run each step (no KV cache).
with torch.no_grad():
    for _ in range(50):
        h = model.model.language_model(input_ids=x, use_cache=False, return_dict=True).last_hidden_state
        nxt = model.lm_head(h)[0, -1].argmax().view(1, 1)
        x = torch.cat([x, nxt], dim=1)
        if int(nxt) == tok.eos_token_id:
            break

print(tok.decode(x[0], skip_special_tokens=True))
```

A successful sanity check means the model continues with roughly TinyStories-like English text. The output does not need to be high quality. The purpose is only to confirm that the text decoder has not collapsed into digit-only output.

Acceptable behavior:

```text
Once upon a time, there was a little girl named Lily...
```

Known imperfect behavior:

```text
malformed names
repetition
awkward TinyStories-like phrasing
occasional non-word fragments
```

These are expected for a model of this size and training mix.

## Evaluation notes

A representative OCR result for the intended fixed-style OCR setting was:

```text
exact_match: 0.87
digit_accuracy: 0.976
length_accuracy: 0.998
```

The important point is that the model can exercise the full multimodal path and produce digit strings with high accuracy in the intended fixed synthetic setting.

For longer digit strings, exact match is much stricter than digit accuracy. Even if each digit is correct with high probability, the probability of a fully correct 8-digit sequence is lower because all digits must be correct simultaneously.
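
The compounding effect can be made concrete with a short calculation. The sketch below assumes that per-digit errors are independent and, for the average, that digit lengths 1 to 8 are equally likely. Both are simplifying assumptions, not properties stated for the actual evaluation set, and `exact_match_estimate` is a name chosen here for illustration.

```python
def exact_match_estimate(digit_accuracy, length):
    """Probability that all digits are correct, assuming independent per-digit errors."""
    return digit_accuracy ** length


digit_accuracy = 0.976  # reported digit_accuracy from the evaluation notes

for n in (1, 4, 8):
    print(n, round(exact_match_estimate(digit_accuracy, n), 3))

# Average over lengths 1..8, assuming a uniform length distribution.
mean_em = sum(exact_match_estimate(digit_accuracy, n) for n in range(1, 9)) / 8
print("mean over lengths 1-8:", round(mean_em, 3))
```

Under these assumptions an 8-digit string is fully correct roughly 82% of the time, and the average over lengths 1 to 8 is close to 0.90. That is in the same range as the reported `exact_match` of 0.87, although this simple model does not account for the remaining gap.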
## Recommended validation tests

### OCR smoke test

```bash
python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --text 1677216 \
  --font-size 30 \
  --prompt "Read the digits."
```

Expected output:

```text
Prediction:
target: 1677216
prediction: 1677216
raw_text: '1677216'
```

### Batch synthetic OCR test

```bash
python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --eval-synthetic 500 \
  --min-digits 1 \
  --max-digits 8 \
  --font-size 30 \
  --prompt "Read the digits."
```

### Text sanity test

```bash
python test_text_generation_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --prompt "Once upon" \
  --max-new-tokens 50
```

## Expected config properties

A correctly loaded 128x128 checkpoint should report:

```text
image_token_index: 1003
mm_tokens_per_image: 64
vision image_size: 128
vision patch_size: 16
```

If `mm_tokens_per_image` is not 64, or if the vision image size is not 128, the checkpoint is not the intended 128x128 model.

## File layout

The recommended repository layout is:

```text
shibatch/tinygemma3ocr2m
  README.md
  config.json
  model.safetensors
  tokenizer.json
  tokenizer_config.json
  special_tokens_map.json
  sample_images/
    sample_00_6235317.png
  sample_images.json
  eval_ocr_augmented.json
  eval_text_generation.json
  safetensors_keys.json
  artifact_metadata.json
```

The model files should be placed at the repository root so that this works:

```python
Gemma3ForConditionalGeneration.from_pretrained("shibatch/tinygemma3ocr2m")
PreTrainedTokenizerFast.from_pretrained("shibatch/tinygemma3ocr2m")
```

If the model files are instead placed under an `hf/` subdirectory, pass `subfolder="hf"` to `from_pretrained()`.

## Implementation notes

This checkpoint uses repeated image tokens (token id 1003, `image_token_index`) in the text input. The number of inserted image tokens must match `mm_tokens_per_image`.
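
A custom inference engine can check this requirement before running the decoder. The following is a minimal sketch, not the reference implementation: both helper names are hypothetical, the token count assumes one image token per patch with no pooling after the vision tower (which matches the numbers given for this checkpoint), and the BOS id and prompt ids in the usage lines are placeholders, not this tokenizer's real values.

```python
def expected_image_tokens(image_size, patch_size):
    """Patches in a square grid; assumes one image token per patch (no pooling)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def build_prompt_ids(bos_id, image_token_id, mm_tokens_per_image, prompt_ids):
    """BOS, then the image token repeated, then the tokenized text prompt."""
    return [bos_id] + [image_token_id] * mm_tokens_per_image + list(prompt_ids)


n_tokens = expected_image_tokens(128, 16)

# bos_id=2 and prompt_ids=[10, 11, 12] are placeholders for illustration only.
ids = build_prompt_ids(bos_id=2, image_token_id=1003,
                       mm_tokens_per_image=n_tokens, prompt_ids=[10, 11, 12])

# The number of image-token positions must equal mm_tokens_per_image.
assert ids.count(1003) == n_tokens
print(n_tokens, len(ids))
```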
For this 128x128 checkpoint:

```text
image_size = 128
patch_size = 16
patch grid = 8 x 8
mm_tokens_per_image = 64
```

The OCR prompt format is approximately:

```text
BOS
image token (id 1003) repeated 64 times
\nRead the digits.\n
```

The target tokens are the digit string followed by EOS.

## Citation / attribution

This is a synthetic tiny validation checkpoint derived from a Gemma3-style architecture and trained for local inference-engine validation.

TinyStories was used for basic text warmup and text regularization.

## Disclaimer

This model is for engineering validation and testing. It should not be used for real OCR tasks or user-facing document understanding.