tinygemma3ocr2m

shibatch/tinygemma3ocr2m is a tiny Gemma3-style multimodal validation checkpoint.

The model is intentionally small. It is designed to verify that a Gemma3 multimodal inference implementation can correctly run the full multimodal path:

image -> vision tower -> multimodal projector -> image tokens -> text decoder -> generated text

It is not intended to be a general OCR model.

The checkpoint is trained on synthetic fixed-style digit OCR data, mixed with a small amount of TinyStories text so that the text decoder retains basic text-generation behavior.

Model summary

Repository:               shibatch/tinygemma3ocr2m
Model class:              Gemma3ForConditionalGeneration
Task:                     synthetic digit OCR + text-generation sanity check
Image size:               128 x 128
Patch size:               16 x 16
Image tokens per image:   64
Image token:              <image>
Image token id:           1003
Approximate scale:        about 2M parameters
OCR prompt:               Read the digits.
OCR target format:        digit string only

The model is intended to be loaded with Hugging Face Transformers as Gemma3ForConditionalGeneration.

Intended use

This checkpoint is mainly useful for validating custom Gemma3 multimodal inference implementations.

Typical checks include:

vision tower weight loading;
multimodal projector execution;
image token insertion;
text decoder generation after image conditioning;
small-model CPU/GPU inference paths;
safetensors key mapping and tensor layout assumptions.
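The key-mapping check in the list above can be done without loading any tensor data, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The following is a minimal sketch using only the standard library; the helper names read_safetensors_header and diff_keys are hypothetical and not part of this repository, and the list of expected keys has to come from your own inference engine.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def diff_keys(header, expected_keys):
    """Return (missing, unexpected) tensor names relative to what the engine expects."""
    found = set(header)
    expected = set(expected_keys)
    return sorted(expected - found), sorted(found - expected)
```

Calling read_safetensors_header("model.safetensors") and comparing the result against the engine's own key list surfaces naming and shape mismatches before any weights are copied.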

The expected OCR validation behavior is:

Input image:  synthetic 128x128 image containing centered digits
Prompt:       Read the digits.
Output:       the digit string

For example, a rendered image containing 6235317 should generate:

6235317

Non-goals and limitations

This model is deliberately narrow.

It is not a general OCR model. It should not be expected to handle natural images, documents, handwriting, rotated text, complex backgrounds, arbitrary fonts, tables, screenshots, Japanese text, or real-world scanned documents.

Known limitations:

trained on synthetic digit images;
primarily intended for fixed-style controlled rendering;
prompt generalization is limited;
text generation is only a sanity check and is not high quality;
generated TinyStories-like text may contain repetition or malformed words;
OCR robustness to strong augmentation is limited at this model scale;
exact-match accuracy becomes harder for long digit strings because per-digit errors accumulate.

This checkpoint is best understood as a minimal multimodal validation artifact, not as a production OCR system.

Training overview

The final checkpoint was produced in stages.

Phase 1: text warmup

A small Gemma3ForCausalLM text model was trained on TinyStories. The resulting text weights were transplanted into a Gemma3ForConditionalGeneration checkpoint.

This transplant-based setup was used because direct text-only warmup inside the multimodal wrapper was less stable for this tiny configuration.

Phase 2a: single-digit OCR

The model was trained to read one synthetic digit from a 128x128 image. The target output was the digit itself.

Phase 2b: multi-digit OCR

The model was trained to read multi-digit synthetic images. The standard OCR prompt was:

Read the digits.

The model learned to generate the digit string followed by EOS.

Phase 2b plus text

The final checkpoint keeps the OCR distribution close to Phase 2b and mixes in TinyStories text batches as a light regularizer.

The purpose of the text mix is not to make the model a strong language model. It is only intended to prevent the text decoder from collapsing completely into OCR-only behavior.

A representative mixed-training configuration was:

OCR ratio:         0.90
Text ratio:        0.10
OCR loss weight:   1.0
Text loss weight:  0.2
OCR style:         fixed synthetic digits
Font size:         30
Offset:            0
Rotation:          0
Noise:             0
Prompt:            Read the digits.
Digit length:      1 to 8
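As a rough illustration of how the ratios and loss weights above could interact, the sketch below picks a batch type at random and scales its loss. This is not the actual training code: it assumes the ratio is applied per batch, and the names pick_batch_kind and weighted_loss are hypothetical.

```python
import random

# Values taken from the configuration above.
OCR_RATIO = 0.90
LOSS_WEIGHTS = {"ocr": 1.0, "text": 0.2}


def pick_batch_kind(rng):
    """Choose 'ocr' with probability OCR_RATIO, otherwise 'text' (assumed per-batch mixing)."""
    return "ocr" if rng.random() < OCR_RATIO else "text"


def weighted_loss(kind, raw_loss):
    """Scale the raw loss of a batch by the weight for its kind."""
    return LOSS_WEIGHTS[kind] * raw_loss


rng = random.Random(0)
kinds = [pick_batch_kind(rng) for _ in range(10_000)]
print("observed OCR fraction:", kinds.count("ocr") / len(kinds))
```

With these settings a text batch contributes both less often and with a smaller gradient, which fits its role as a light regularizer.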

Installation

pip install torch transformers pillow huggingface_hub numpy

OCR example

The following example downloads a sample image from this repository and runs OCR.

import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"

tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

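# Download the bundled sample image (it shows the digits 6235317).
path = hf_hub_download(repo, "sample_images/sample_00_6235317.png")
img = Image.open(path).convert("RGB")
# Scale pixels from [0, 255] to [-1, 1], reorder HWC -> CHW, and add a batch dimension.
pix = torch.from_numpy(np.asarray(img, dtype=np.float32) / 127.5 - 1).permute(2, 0, 1)[None].cuda()

# Prompt layout: <bos>, then mm_tokens_per_image <image> tokens, then the text prompt.
ids = [tok.bos_token_id] + [model.config.image_token_index] * model.config.mm_tokens_per_image
ids += tok.encode("\nRead the digits.\n", add_special_tokens=False)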

input_ids = torch.tensor([ids], device="cuda")
attention_mask = torch.ones_like(input_ids)

out = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pixel_values=pix,
    max_new_tokens=12,
    do_sample=False,
    pad_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)

print(tok.decode(out[0][len(ids):], skip_special_tokens=True))

Expected output:

6235317

Text-generation sanity check

The checkpoint also supports direct text-decoder sanity checks through the language model branch.

This example intentionally does not use image tokens or pixel_values. It directly calls:

model.model.language_model
model.lm_head

Example:

import torch
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"
tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

ids = [tok.bos_token_id] + tok.encode("Once upon", add_special_tokens=False)
x = torch.tensor([ids], device="cuda")

# Greedy decoding without a KV cache: the full sequence is re-run at every step.
with torch.no_grad():
    for _ in range(50):
        h = model.model.language_model(input_ids=x, use_cache=False, return_dict=True).last_hidden_state
        nxt = model.lm_head(h)[0, -1].argmax().view(1, 1)
        x = torch.cat([x, nxt], dim=1)
        if int(nxt) == tok.eos_token_id:
            break

print(tok.decode(x[0], skip_special_tokens=True))

A successful sanity check means the model continues with roughly TinyStories-like English text. The output does not need to be high quality. The purpose is only to confirm that the text decoder has not collapsed into digit-only output.

Acceptable behavior:

Once upon a time, there was a little girl named Lily...

Known imperfect behavior:

malformed names
repetition
awkward TinyStories-like phrasing
occasional non-word fragments

These are expected for a model of this size and training mix.

Evaluation notes

A representative OCR result for the intended fixed-style OCR setting was:

exact_match:     0.87
digit_accuracy:  0.976
length_accuracy: 0.998

The important point is that the model can exercise the full multimodal path and produce digit strings with high accuracy in the intended fixed synthetic setting.
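The three metrics can be computed from paired target and prediction strings. The sketch below shows one plausible set of definitions; the exact definitions used for the numbers above are not stated here, so the position-wise digit comparison normalized by target length is an assumption, and ocr_metrics is a hypothetical helper.

```python
def ocr_metrics(targets, predictions):
    """Compute exact-match, per-digit, and length accuracy for paired strings.

    Assumption: digit accuracy compares characters position by position and
    is normalized by the total number of target digits.
    """
    n = len(targets)
    exact = sum(t == p for t, p in zip(targets, predictions))
    same_length = sum(len(t) == len(p) for t, p in zip(targets, predictions))
    total_digits = sum(len(t) for t in targets)
    correct_digits = sum(
        sum(a == b for a, b in zip(t, p)) for t, p in zip(targets, predictions)
    )
    return {
        "exact_match": exact / n,
        "digit_accuracy": correct_digits / total_digits,
        "length_accuracy": same_length / n,
    }


print(ocr_metrics(["123", "45", "6"], ["123", "46", "67"]))
```

In this toy example one of three strings matches exactly, five of six target digits are correct, and two of three predictions have the right length.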

For longer digit strings, exact match is much stricter than digit accuracy. Even if each digit is correct with high probability, the probability of a fully correct 8-digit sequence is lower because all digits must be correct simultaneously.
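The effect can be made concrete with a small calculation. The sketch below assumes that digit errors are independent and that every position has the same accuracy, which is a simplification rather than a measured property of this model.

```python
def exact_match_probability(per_digit_accuracy, length):
    """Probability that all digits are correct, assuming independent per-digit errors."""
    return per_digit_accuracy ** length


for n in (1, 4, 8):
    print(n, round(exact_match_probability(0.976, n), 3))
```

Under this assumption a per-digit accuracy of 0.976 gives roughly 0.91 for four digits and roughly 0.82 for eight digits.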

Recommended validation tests

OCR smoke test

python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --text 1677216 \
  --font-size 30 \
  --prompt "Read the digits."

Expected output:

Prediction:
  target: 1677216
  prediction: 1677216
  raw_text: '1677216'

Batch synthetic OCR test

python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --eval-synthetic 500 \
  --min-digits 1 \
  --max-digits 8 \
  --font-size 30 \
  --prompt "Read the digits."

Text sanity test

python test_text_generation_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --prompt "Once upon" \
  --max-new-tokens 50

Expected config properties

A correctly loaded 128x128 checkpoint should report:

image_token_index:      1003
mm_tokens_per_image:    64
vision image_size:      128
vision patch_size:      16

If mm_tokens_per_image is not 64, or if the vision image size is not 128, the checkpoint is not the intended 128x128 model.
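These properties can be checked programmatically right after loading. The sketch below is a hypothetical helper (check_config is not part of this repository); it assumes the vision settings are exposed as config.vision_config.image_size and config.vision_config.patch_size, and it uses a stand-in object so that it runs without downloading the model.

```python
from types import SimpleNamespace

# Expected values from the table above.
EXPECTED = {
    "image_token_index": 1003,
    "mm_tokens_per_image": 64,
    "vision image_size": 128,
    "vision patch_size": 16,
}


def check_config(config):
    """Return a list of mismatch messages; an empty list means the config matches."""
    actual = {
        "image_token_index": config.image_token_index,
        "mm_tokens_per_image": config.mm_tokens_per_image,
        # Assumed attribute layout for the vision settings.
        "vision image_size": config.vision_config.image_size,
        "vision patch_size": config.vision_config.patch_size,
    }
    return [
        f"{name}: expected {EXPECTED[name]}, got {value}"
        for name, value in actual.items()
        if value != EXPECTED[name]
    ]


# Stand-in for model.config; in practice pass the loaded model's config instead.
stub = SimpleNamespace(
    image_token_index=1003,
    mm_tokens_per_image=64,
    vision_config=SimpleNamespace(image_size=128, patch_size=16),
)
print(check_config(stub))
```

With a real checkpoint, check_config(model.config) should likewise return an empty list.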

File layout

The recommended repository layout is:

shibatch/tinygemma3ocr2m
  README.md
  config.json
  model.safetensors
  tokenizer.json
  tokenizer_config.json
  special_tokens_map.json
  sample_images/
    sample_00_6235317.png
  sample_images.json
  eval_ocr_augmented.json
  eval_text_generation.json
  safetensors_keys.json
  artifact_metadata.json

The model files should be placed at the repository root so that this works:

Gemma3ForConditionalGeneration.from_pretrained("shibatch/tinygemma3ocr2m")
PreTrainedTokenizerFast.from_pretrained("shibatch/tinygemma3ocr2m")

If the model files are instead placed under an hf/ subdirectory, pass subfolder="hf" to from_pretrained().

Implementation notes

This checkpoint uses repeated <image> tokens in the text input. The number of inserted image tokens must match mm_tokens_per_image.

For this 128x128 checkpoint:

image_size = 128
patch_size = 16
patch grid = 8 x 8
mm_tokens_per_image = 64
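The arithmetic behind these numbers is shown below as a minimal sketch. It assumes one image token per patch with no pooling in between, which matches the 8 x 8 = 64 relationship listed above; image_tokens_for is a hypothetical helper name.

```python
def image_tokens_for(image_size, patch_size):
    """Number of image tokens for a square image, assuming one token per patch."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches per side, 8 for 128 / 16
    return side * side


print(image_tokens_for(128, 16))
```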

The OCR prompt format is approximately:

<bos>
<image> repeated 64 times
\nRead the digits.\n

The target tokens are the digit string followed by EOS.
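The prompt layout can be assembled and sanity-checked with plain Python lists before any tensors are created. In the sketch below, build_ocr_input_ids and count_image_tokens are hypothetical helpers, and the BOS id and prompt ids are placeholders; real values come from the tokenizer.

```python
IMAGE_TOKEN_ID = 1003      # image token id from the model summary
MM_TOKENS_PER_IMAGE = 64   # image tokens per image for this checkpoint


def build_ocr_input_ids(bos_id, prompt_ids):
    """Return <bos>, then 64 image tokens, then the already-tokenized prompt."""
    return [bos_id] + [IMAGE_TOKEN_ID] * MM_TOKENS_PER_IMAGE + list(prompt_ids)


def count_image_tokens(input_ids):
    """Count image placeholder tokens; this must equal mm_tokens_per_image."""
    return sum(1 for token in input_ids if token == IMAGE_TOKEN_ID)


# Placeholder ids for illustration only.
ids = build_ocr_input_ids(bos_id=2, prompt_ids=[10, 11, 12])
assert count_image_tokens(ids) == MM_TOKENS_PER_IMAGE
print(len(ids))
```

An inference engine can run the same count on its own input sequence to catch an off-by-one in image token insertion early.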

Citation / attribution

This is a synthetic tiny validation checkpoint derived from a Gemma3-style architecture and trained for local inference-engine validation. TinyStories was used for basic text warmup and text regularization.

Disclaimer

This model is for engineering validation and testing. It should not be used for real OCR tasks or user-facing document understanding.
