---
license: mit
language:
  - en
tags:
  - gemma3
  - multimodal
  - vision-language
  - ocr
  - synthetic-data
  - tinystories
  - validation
  - tiny-model
pipeline_tag: image-to-text
---

# tinygemma3ocr2m

`shibatch/tinygemma3ocr2m` is a tiny Gemma3-style multimodal validation checkpoint.

The model is intentionally small. It is designed to verify that a Gemma3 multimodal inference implementation can correctly run the full multimodal path:

```text
image -> vision tower -> multimodal projector -> image tokens -> text decoder -> generated text
```

It is **not** intended to be a general OCR model.

The checkpoint is trained on synthetic fixed-style digit OCR data, mixed with a small amount of TinyStories text so that the text decoder retains basic text-generation behavior.

## Model summary

```text
Repository:               shibatch/tinygemma3ocr2m
Model class:              Gemma3ForConditionalGeneration
Task:                     synthetic digit OCR + text-generation sanity check
Image size:               128 x 128
Patch size:               16 x 16
Image tokens per image:   64
Image token:              <image>
Image token id:           1003
Parameters:               about 2M
OCR prompt:               Read the digits.
OCR target format:        digit string only
```

The model is intended to be loaded with Hugging Face Transformers as `Gemma3ForConditionalGeneration`.

## Intended use

This checkpoint is mainly useful for validating custom Gemma3 multimodal inference implementations.

Typical checks include:

* vision tower weight loading;
* multimodal projector execution;
* image token insertion;
* text decoder generation after image conditioning;
* small-model CPU/GPU inference paths;
* safetensors key mapping and tensor layout assumptions.

The expected OCR validation behavior is:

```text
Input image:  synthetic 128x128 image containing centered digits
Prompt:       Read the digits.
Output:       the digit string
```

For example, a rendered image containing `6235317` should generate:

```text
6235317
```

## Non-goals and limitations

This model is deliberately narrow.

It is **not** a general OCR model. It should not be expected to handle natural images, documents, handwriting, rotated text, complex backgrounds, arbitrary fonts, tables, screenshots, Japanese text, or real-world scanned documents.

Known limitations:

* trained on synthetic digit images;
* primarily intended for fixed-style controlled rendering;
* prompt generalization is limited;
* text generation is only a sanity check and is not high quality;
* generated TinyStories-like text may contain repetition or malformed words;
* OCR robustness to strong augmentation is limited at this model scale;
* exact-match accuracy becomes harder for long digit strings because per-digit errors accumulate.

This checkpoint is best understood as a **minimal multimodal validation artifact**, not as a production OCR system.

## Training overview

The final checkpoint was produced in stages.

### Phase 1: text warmup

A small `Gemma3ForCausalLM` text model was trained on TinyStories. The resulting text weights were transplanted into a `Gemma3ForConditionalGeneration` checkpoint.

This transplant-based setup was used because direct text-only warmup inside the multimodal wrapper was less stable for this tiny configuration.

### Phase 2a: single-digit OCR

The model was trained to read one synthetic digit from a 128x128 image. The target output was the digit itself.

### Phase 2b: multi-digit OCR

The model was trained to read multi-digit synthetic images. The standard OCR prompt was:

```text
Read the digits.
```

The model learned to generate the digit string followed by EOS.

### Phase 2b plus text

The final checkpoint keeps the OCR distribution close to Phase 2b and mixes in TinyStories text batches as a light regularizer.

The purpose of the text mix is not to make the model a strong language model. It is only intended to prevent the text decoder from collapsing completely into OCR-only behavior.

A representative mixed-training configuration was:

```text
OCR ratio:         0.90
Text ratio:        0.10
OCR loss weight:   1.0
Text loss weight:  0.2
OCR style:         fixed synthetic digits
Font size:         30
Offset:            0
Rotation:          0
Noise:             0
Prompt:            Read the digits.
Digit length:      1 to 8
```
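
As a rough illustration of how the ratios and loss weights above could interact, the sketch below picks a batch kind per step and scales that batch's loss. It is not the training code used for this checkpoint: `pick_batch_kind` and `weighted_loss` are hypothetical helper names, and per-step random sampling is an assumption (the actual run may have interleaved batches differently).

```python
import random

# Values taken from the representative configuration above.
OCR_RATIO = 0.90
LOSS_WEIGHTS = {"ocr": 1.0, "text": 0.2}


def pick_batch_kind(rng: random.Random) -> str:
    """Return "ocr" with probability OCR_RATIO, otherwise "text" (assumed scheme)."""
    return "ocr" if rng.random() < OCR_RATIO else "text"


def weighted_loss(kind: str, raw_loss: float) -> float:
    """Scale a batch's raw loss by the weight configured for its kind."""
    return LOSS_WEIGHTS[kind] * raw_loss


rng = random.Random(0)
kinds = [pick_batch_kind(rng) for _ in range(10_000)]
ocr_fraction = kinds.count("ocr") / len(kinds)
print(f"fraction of OCR batches: {ocr_fraction:.3f}")
print("weighted text loss for raw loss 2.5:", weighted_loss("text", 2.5))
```

With these settings a text batch contributes both less often and with a smaller weight, which matches the stated goal of using text only as a light regularizer.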

## Installation

```bash
pip install torch transformers pillow huggingface_hub numpy
```

## OCR example

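The following example downloads a sample image from this repository and runs greedy OCR decoding. It assumes a CUDA device and model files under an `hf/` subdirectory; drop `subfolder="hf"` if the files sit at the repository root.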

```python
import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"

tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

# Load the sample image and scale pixels from [0, 255] to [-1, 1]; shape (1, 3, H, W).
path = hf_hub_download(repo, "sample_images/sample_00_6235317.png")
img = Image.open(path).convert("RGB")
pix = torch.from_numpy(np.asarray(img, dtype=np.float32) / 127.5 - 1).permute(2, 0, 1)[None].cuda()

# Prompt layout: <bos>, then mm_tokens_per_image copies of <image>, then the text prompt.
ids = [tok.bos_token_id] + [model.config.image_token_index] * model.config.mm_tokens_per_image
ids += tok.encode("\nRead the digits.\n", add_special_tokens=False)

input_ids = torch.tensor([ids], device="cuda")
attention_mask = torch.ones_like(input_ids)

# Greedy decoding; 12 new tokens leaves room for up to 8 digits plus EOS.
out = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    pixel_values=pix,
    max_new_tokens=12,
    do_sample=False,
    pad_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][len(ids):], skip_special_tokens=True))
```

Expected output:

```text
6235317
```

## Text-generation sanity check

The checkpoint also supports direct text-decoder sanity checks through the language model branch.

This example intentionally does **not** use image tokens or `pixel_values`. It directly calls:

```text
model.model.language_model
model.lm_head
```

Example:

```python
import torch
from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

repo = "shibatch/tinygemma3ocr2m"
tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
model = Gemma3ForConditionalGeneration.from_pretrained(
    repo, subfolder="hf", torch_dtype=torch.bfloat16
).cuda().eval()

ids = [tok.bos_token_id] + tok.encode("Once upon", add_special_tokens=False)
x = torch.tensor([ids], device="cuda")

# Greedy decoding without a KV cache: the full sequence is re-run at every step.
with torch.no_grad():  # inference only, so skip autograd bookkeeping
    for _ in range(50):
        h = model.model.language_model(input_ids=x, use_cache=False, return_dict=True).last_hidden_state
        # Project the last position to vocabulary logits and take the argmax.
        nxt = model.lm_head(h)[0, -1].argmax().view(1, 1)
        x = torch.cat([x, nxt], dim=1)
        if int(nxt) == tok.eos_token_id:
            break

print(tok.decode(x[0], skip_special_tokens=True))
```

A successful sanity check means the model continues with roughly TinyStories-like English text. The output does not need to be high quality. The purpose is only to confirm that the text decoder has not collapsed into digit-only output.

Acceptable behavior:

```text
Once upon a time, there was a little girl named Lily...
```

Known imperfect behavior:

* malformed names;
* repetition;
* awkward TinyStories-like phrasing;
* occasional non-word fragments.

These are expected for a model of this size and training mix.

## Evaluation notes

A representative OCR result for the intended fixed-style OCR setting was:

```text
exact_match:     0.87
digit_accuracy:  0.976
length_accuracy: 0.998
```

The important point is that the model can exercise the full multimodal path and produce digit strings with high accuracy in the intended fixed synthetic setting.
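
The three metrics can be computed from target and prediction strings along the following lines. The definitions here are assumptions, since this card does not spell them out: `exact_match` is full-string equality, `length_accuracy` is equality of string lengths, and `digit_accuracy` is a position-wise comparison normalized by the target length. `ocr_metrics` is a hypothetical helper name, not part of the evaluation scripts.

```python
def ocr_metrics(targets: list[str], predictions: list[str]) -> dict[str, float]:
    """Compute exact-match, per-digit, and length accuracy (assumed definitions)."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and equal in count")
    pairs = list(zip(targets, predictions))
    exact = sum(t == p for t, p in pairs)
    same_len = sum(len(t) == len(p) for t, p in pairs)
    # Position-wise comparison over the overlapping prefix: digits missing from
    # the prediction count as errors, extra predicted digits are ignored here.
    digit_hits = sum(sum(a == b for a, b in zip(t, p)) for t, p in pairs)
    digit_total = sum(len(t) for t in targets)
    n = len(targets)
    return {
        "exact_match": exact / n,
        "digit_accuracy": digit_hits / digit_total,
        "length_accuracy": same_len / n,
    }


metrics = ocr_metrics(["6235317", "42", "9"], ["6235317", "43", "99"])
print(metrics)
```

Other reasonable definitions exist (for example, edit-distance-based digit accuracy), so numbers from this sketch are only comparable to the reported ones if the definitions happen to match.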

For longer digit strings, exact match is much stricter than digit accuracy. Even if each digit is correct with high probability, the probability of a fully correct 8-digit sequence is lower because all digits must be correct simultaneously.
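
To make the gap concrete, the sketch below assumes that digit errors are independent and that every digit is correct with the same probability. Under that simplification, exact match for an n-digit string is the per-digit accuracy raised to the power n, so a per-digit accuracy of 0.976 gives roughly 0.82 for 8 digits. Real errors are likely correlated, so this is only an approximation; `exact_match_if_independent` is a hypothetical helper name.

```python
def exact_match_if_independent(digit_accuracy: float, n_digits: int) -> float:
    """Probability that all n digits are correct, assuming independent errors."""
    if not 0.0 <= digit_accuracy <= 1.0:
        raise ValueError("digit_accuracy must be in [0, 1]")
    if n_digits < 1:
        raise ValueError("n_digits must be at least 1")
    return digit_accuracy ** n_digits


for n in (1, 2, 4, 8):
    print(f"{n} digits: {exact_match_if_independent(0.976, n):.3f}")
```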

## Recommended validation tests

### OCR smoke test

```bash
python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --text 1677216 \
  --font-size 30 \
  --prompt "Read the digits."
```

Expected output:

```text
Prediction:
  target: 1677216
  prediction: 1677216
  raw_text: '1677216'
```

### Batch synthetic OCR test

```bash
python test_inference_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --eval-synthetic 500 \
  --min-digits 1 \
  --max-digits 8 \
  --font-size 30 \
  --prompt "Read the digits."
```

### Text sanity test

```bash
python test_text_generation_tinygemma3ocr2m.py \
  --model-dir shibatch/tinygemma3ocr2m \
  --prompt "Once upon" \
  --max-new-tokens 50
```

## Expected config properties

A correctly loaded 128x128 checkpoint should report:

```text
image_token_index:      1003
mm_tokens_per_image:    64
vision image_size:      128
vision patch_size:      16
```

If `mm_tokens_per_image` is not 64, or if the vision image size is not 128, the checkpoint is not the intended 128x128 model.
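
A check along these lines can be automated. The sketch below compares a config-like object against the values listed above; `config_mismatches` is a hypothetical helper, and the attribute path `vision_config.image_size` assumes the vision settings are nested under `vision_config`, as in the Transformers `Gemma3Config`. A `SimpleNamespace` stands in for the real config so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

# Expected values from this model card, keyed by dotted attribute path.
EXPECTED = {
    "image_token_index": 1003,
    "mm_tokens_per_image": 64,
    "vision_config.image_size": 128,
    "vision_config.patch_size": 16,
}


def config_mismatches(config) -> list[str]:
    """Return human-readable mismatches between `config` and EXPECTED."""
    problems = []
    for dotted, want in EXPECTED.items():
        obj = config
        for part in dotted.split("."):
            obj = getattr(obj, part, None)  # missing attributes resolve to None
        if obj != want:
            problems.append(f"{dotted}: expected {want}, got {obj}")
    return problems


# Stand-in for demonstration; in practice pass `model.config` instead.
demo = SimpleNamespace(
    image_token_index=1003,
    mm_tokens_per_image=64,
    vision_config=SimpleNamespace(image_size=128, patch_size=16),
)
print(config_mismatches(demo) or "config matches the 128x128 checkpoint")
```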

## File layout

The recommended repository layout is:

```text
shibatch/tinygemma3ocr2m
  README.md
  config.json
  model.safetensors
  tokenizer.json
  tokenizer_config.json
  special_tokens_map.json
  sample_images/
    sample_00_6235317.png
  sample_images.json
  eval_ocr_augmented.json
  eval_text_generation.json
  safetensors_keys.json
  artifact_metadata.json
```

The model files should be placed at the repository root so that this works:

```python
Gemma3ForConditionalGeneration.from_pretrained("shibatch/tinygemma3ocr2m")
PreTrainedTokenizerFast.from_pretrained("shibatch/tinygemma3ocr2m")
```

If the model files are instead placed under an `hf/` subdirectory, pass `subfolder="hf"` to `from_pretrained()`.

## Implementation notes

This checkpoint uses repeated `<image>` tokens in the text input. The number of inserted image tokens must match `mm_tokens_per_image`.

For this 128x128 checkpoint:

```text
image_size = 128
patch_size = 16
patch grid = 8 x 8
mm_tokens_per_image = 64
```
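
The arithmetic behind these numbers is sketched below. The first helper is plain patch-grid arithmetic. The second assumes that the projector pools the patch grid down to a square grid of `mm_tokens_per_image` tokens, which is how the Transformers Gemma3 projector is understood to work; treat that part as an assumption. Both function names are hypothetical.

```python
import math


def patch_grid_side(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of the image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


def pooling_kernel(image_size: int, patch_size: int, mm_tokens_per_image: int) -> int:
    """Pooling factor from the patch grid to the image-token grid (assumed scheme)."""
    tokens_per_side = math.isqrt(mm_tokens_per_image)
    if tokens_per_side * tokens_per_side != mm_tokens_per_image:
        raise ValueError("mm_tokens_per_image must be a perfect square")
    return patch_grid_side(image_size, patch_size) // tokens_per_side


side = patch_grid_side(128, 16)
print("patch grid:", side, "x", side)
print("patches per image:", side * side)
print("pooling kernel:", pooling_kernel(128, 16, 64))
```

For this checkpoint the patch count already equals `mm_tokens_per_image`, so under the assumed scheme the pooling factor is 1 and each patch maps to one image token.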

The OCR prompt format is approximately:

```text
<bos>
<image> repeated 64 times
\nRead the digits.\n
```

The target tokens are the digit string followed by EOS.
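
For an inference engine that assembles token ids itself, the layout above can be sketched as follows. `build_ocr_input_ids` and `count_image_tokens` are hypothetical helpers; the ids 1003 and 64 come from this card, while the BOS id and prompt ids in the demo are made-up placeholders rather than real tokenizer output.

```python
def build_ocr_input_ids(
    bos_id: int,
    image_token_id: int,
    n_image_tokens: int,
    prompt_ids: list[int],
) -> list[int]:
    """Assemble <bos> + repeated <image> + already-tokenized prompt text."""
    return [bos_id] + [image_token_id] * n_image_tokens + list(prompt_ids)


def count_image_tokens(input_ids: list[int], image_token_id: int) -> int:
    """Count image placeholder tokens, e.g. to compare with mm_tokens_per_image."""
    return sum(1 for t in input_ids if t == image_token_id)


# Placeholder values for illustration: bos_id=2 and prompt_ids are invented.
ids = build_ocr_input_ids(bos_id=2, image_token_id=1003, n_image_tokens=64, prompt_ids=[10, 11, 12])
print("sequence length:", len(ids))
print("image tokens:", count_image_tokens(ids, 1003))
```

Counting the placeholders before the forward pass is a cheap way to catch a mismatch with `mm_tokens_per_image` early, rather than debugging a shape error inside the model.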

## Citation / attribution

This is a synthetic tiny validation checkpoint derived from a Gemma3-style architecture and trained for local inference-engine validation. TinyStories was used for basic text warmup and text regularization.

## Disclaimer

This model is for engineering validation and testing. It should not be used for real OCR tasks or user-facing document understanding.