	---
	license: mit
	language:
	- en
	tags:
	- gemma3
	- multimodal
	- vision-language
	- ocr
	- synthetic-data
	- tinystories
	- validation
	- tiny-model
	pipeline_tag: image-to-text
	---

	# tinygemma3ocr2m

	`shibatch/tinygemma3ocr2m` is a tiny Gemma3-style multimodal validation checkpoint.

	The model is intentionally small. It is designed to verify that a Gemma3 multimodal inference implementation can correctly run the full multimodal path:

	```text
	image -> vision tower -> multimodal projector -> image tokens -> text decoder -> generated text
	```

	It is not intended to be a general OCR model.

	The checkpoint is trained on synthetic fixed-style digit OCR data, mixed with a small amount of TinyStories text so that the text decoder retains basic text-generation behavior.

	## Model summary

	```text
	Repository: shibatch/tinygemma3ocr2m
	Model class: Gemma3ForConditionalGeneration
	Task: synthetic digit OCR + text-generation sanity
	Image size: 128 x 128
	Patch size: 16 x 16
	Image tokens per image: 64
	Image token: <image>
	Image token id: 1003
	Scale: about 2M parameters
	OCR prompt: Read the digits.
	OCR target format: digit string only
	```

	The model is intended to be loaded with Hugging Face Transformers as `Gemma3ForConditionalGeneration`.

	## Intended use

	This checkpoint is mainly useful for validating custom Gemma3 multimodal inference implementations.

	Typical checks include:

	* vision tower weight loading;
	* multimodal projector execution;
	* image token insertion;
	* text decoder generation after image conditioning;
	* small-model CPU/GPU inference paths;
	* safetensors key mapping and tensor layout assumptions.
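
For the last check in the list above, it can help to list tensor names, dtypes, and shapes without loading any weights. The sketch below reads only the JSON header of a safetensors file, which is stored as an 8-byte little-endian length followed by that many bytes of JSON. The helper names and the demo tensor key `demo.weight` are hypothetical and chosen for illustration; they are not taken from this checkpoint.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Return the JSON header of a safetensors blob (tensor name -> dtype/shape/offsets)."""
    # First 8 bytes: unsigned little-endian 64-bit header length.
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8 : 8 + header_len].decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def _demo_blob() -> bytes:
    """Build a tiny in-memory blob holding one hypothetical 2x2 float32 tensor."""
    header = {
        "demo.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # 16 bytes of tensor data
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


demo_header = read_safetensors_header(_demo_blob())
for name, info in sorted(demo_header.items()):
    print(name, info["dtype"], info["shape"])
```

For a checkpoint this small, passing the bytes of the real `model.safetensors` (for example from `open(path, "rb").read()`) is cheap, and the resulting key list can be compared against what a custom loader expects.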

	The expected OCR validation behavior is:

	```text
	Input image: synthetic 128x128 image containing centered digits
	Prompt: Read the digits.
	Output: the digit string
	```

	For example, a rendered image containing `6235317` should generate:

	```text
	6235317
	```

	## Non-goals and limitations

	This model is deliberately narrow.

	It is not a general OCR model. It should not be expected to handle natural images, documents, handwriting, rotated text, complex backgrounds, arbitrary fonts, tables, screenshots, Japanese text, or real-world scanned documents.

	Known limitations:

	* trained on synthetic digit images;
	* primarily intended for fixed-style controlled rendering;
	* prompt generalization is limited;
	* text generation is only a sanity check and is not high quality;
	* generated TinyStories-like text may contain repetition or malformed words;
	* OCR robustness to strong augmentation is limited at this model scale;
	* exact-match accuracy becomes harder for long digit strings because per-digit errors accumulate.

	This checkpoint is best understood as a minimal multimodal validation artifact, not as a production OCR system.

	## Training overview

	The final checkpoint was produced in stages.

	### Phase 1: text warmup

	A small `Gemma3ForCausalLM` text model was trained on TinyStories. The resulting text weights were transplanted into a `Gemma3ForConditionalGeneration` checkpoint.

	This transplant-based setup was used because direct text-only warmup inside the multimodal wrapper was less stable for this tiny configuration.

	### Phase 2a: single-digit OCR

	The model was trained to read one synthetic digit from a 128x128 image. The target output was the digit itself.

	### Phase 2b: multi-digit OCR

	The model was trained to read multi-digit synthetic images. The standard OCR prompt was:

	```text
	Read the digits.
	```

	The model learned to generate the digit string followed by EOS.

	### Phase 2b plus text

	The final checkpoint keeps the OCR distribution close to Phase 2b and mixes in TinyStories text batches as a light regularizer.

	The purpose of the text mix is not to make the model a strong language model. It is only intended to prevent the text decoder from collapsing completely into OCR-only behavior.

	A representative mixed-training configuration was:

	```text
	OCR ratio: 0.90
	Text ratio: 0.10
	OCR loss weight: 1.0
	Text loss weight: 0.2
	OCR style: fixed synthetic digits
	Font size: 30
	Offset: 0
	Rotation: 0
	Noise: 0
	Prompt: Read the digits.
	Digit length: 1 to 8
	```
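
The card does not say how the ratios and loss weights were applied during training. As one plausible reading, the sketch below picks the task of each step at random with the stated OCR ratio and scales the batch loss by the stated task weight. The function names and the per-step sampling scheme are assumptions made for illustration, not the author's training code.

```python
import random

# Values taken from the configuration above.
OCR_RATIO = 0.90
OCR_LOSS_WEIGHT = 1.0
TEXT_LOSS_WEIGHT = 0.2


def pick_batch_kind(rng: random.Random) -> str:
    """Choose "ocr" with probability OCR_RATIO, otherwise "text" (assumed per-step sampling)."""
    return "ocr" if rng.random() < OCR_RATIO else "text"


def weighted_loss(kind: str, raw_loss: float) -> float:
    """Scale the raw loss of a batch by the weight of its task."""
    weight = OCR_LOSS_WEIGHT if kind == "ocr" else TEXT_LOSS_WEIGHT
    return weight * raw_loss


rng = random.Random(0)
kinds = [pick_batch_kind(rng) for _ in range(10_000)]
ocr_fraction = kinds.count("ocr") / len(kinds)
print(f"observed OCR fraction: {ocr_fraction:.3f}")
print("text batch with raw loss 2.5 contributes:", weighted_loss("text", 2.5))
```

Under this reading, text batches are both rare and down-weighted, which matches the stated goal of using them only as a light regularizer.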

	## Installation

	```bash
	pip install torch transformers pillow huggingface_hub numpy
	```

	## OCR example

	The following example downloads a sample image from this repository and runs OCR.

	```python
	import numpy as np
	import torch
	from PIL import Image
	from huggingface_hub import hf_hub_download
	from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

	repo = "shibatch/tinygemma3ocr2m"

	# Load the tokenizer and model from the hf/ subfolder (this example needs a CUDA device).
	tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
	model = Gemma3ForConditionalGeneration.from_pretrained(
	    repo, subfolder="hf", torch_dtype=torch.bfloat16
	).cuda().eval()

	# Fetch the sample image and convert it to a 1 x 3 x H x W tensor scaled to [-1, 1].
	path = hf_hub_download(repo, "sample_images/sample_00_6235317.png")
	img = Image.open(path).convert("RGB")
	pix = torch.from_numpy(np.asarray(img, dtype=np.float32) / 127.5 - 1).permute(2, 0, 1)[None].cuda()

	# Prompt layout: <bos>, the repeated <image> tokens, then the OCR prompt.
	ids = [tok.bos_token_id] + [model.config.image_token_index] * model.config.mm_tokens_per_image
	ids += tok.encode("\nRead the digits.\n", add_special_tokens=False)

	input_ids = torch.tensor([ids], device="cuda")
	attention_mask = torch.ones_like(input_ids)

	# Greedy decoding; generation stops at EOS or after 12 new tokens.
	out = model.generate(
	    input_ids=input_ids,
	    attention_mask=attention_mask,
	    pixel_values=pix,
	    max_new_tokens=12,
	    do_sample=False,
	    pad_token_id=tok.bos_token_id,
	    eos_token_id=tok.eos_token_id,
	)

	# Decode only the newly generated tokens.
	print(tok.decode(out[0][len(ids):], skip_special_tokens=True))
	```

	Expected output:

	```text
	6235317
	```

	## Text-generation sanity check

	The checkpoint also supports direct text-decoder sanity checks through the language model branch.

	This example intentionally does not use image tokens or `pixel_values`. It directly calls:

	```text
	model.model.language_model
	model.lm_head
	```

	Example:

	```python
	import torch
	from transformers import PreTrainedTokenizerFast, Gemma3ForConditionalGeneration

	repo = "shibatch/tinygemma3ocr2m"
	tok = PreTrainedTokenizerFast.from_pretrained(repo, subfolder="hf")
	model = Gemma3ForConditionalGeneration.from_pretrained(
	    repo, subfolder="hf", torch_dtype=torch.bfloat16
	).cuda().eval()

	# Text-only prompt: <bos> followed by the prompt tokens, with no image tokens.
	ids = [tok.bos_token_id] + tok.encode("Once upon", add_special_tokens=False)
	x = torch.tensor([ids], device="cuda")

	# Greedy decoding without a KV cache: re-run the decoder on the full sequence each step.
	for _ in range(50):
	    h = model.model.language_model(input_ids=x, use_cache=False, return_dict=True).last_hidden_state
	    nxt = model.lm_head(h)[0, -1].argmax().view(1, 1)
	    x = torch.cat([x, nxt], dim=1)
	    if int(nxt) == tok.eos_token_id:
	        break

	print(tok.decode(x[0], skip_special_tokens=True))
	```

	A successful sanity check means the model continues with roughly TinyStories-like English text. The output does not need to be high quality. The purpose is only to confirm that the text decoder has not collapsed into digit-only output.

	Acceptable behavior:

	```text
	Once upon a time, there was a little girl named Lily...
	```

	Known imperfect behavior:

	```text
	malformed names
	repetition
	awkward TinyStories-like phrasing
	occasional non-word fragments
	```

	These are expected for a model of this size and training mix.
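
Because the pass criterion is loose, an automated check can be just as loose. The sketch below flags output that has collapsed into digits by measuring the fraction of alphabetic characters. The function name `looks_like_text` and the 0.5 threshold are hypothetical choices for illustration, not part of the repository's test scripts.

```python
def looks_like_text(generated: str, min_alpha_fraction: float = 0.5) -> bool:
    """Return True if at least min_alpha_fraction of non-space characters are letters."""
    chars = [c for c in generated if not c.isspace()]
    if not chars:
        return False  # empty output counts as a failed sanity check
    alpha = sum(c.isalpha() for c in chars)
    return alpha / len(chars) >= min_alpha_fraction


story_like = "Once upon a time, there was a little girl named Lily..."
digit_collapse = "6235317"
print(looks_like_text(story_like), looks_like_text(digit_collapse))
```

A heuristic like this only detects digit-only collapse; it says nothing about fluency, which this checkpoint is not expected to have.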

	## Evaluation notes

	A representative OCR result for the intended fixed-style OCR setting was:

	```text
	exact_match: 0.87
	digit_accuracy: 0.976
	length_accuracy: 0.998
	```

	The important point is that the model can exercise the full multimodal path and produce digit strings with high accuracy in the intended fixed synthetic setting.
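
The card does not define the three metrics precisely. The sketch below shows one plausible set of definitions: exact match compares whole strings, length accuracy compares string lengths, and digit accuracy counts position-wise matches divided by the total number of target digits. The function name `ocr_metrics` and the digit-accuracy definition in particular are assumptions.

```python
def ocr_metrics(predictions, targets):
    """Compute exact-match, digit, and length accuracy under the assumed definitions."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and equally long")

    exact = sum(p == t for p, t in zip(predictions, targets))
    same_length = sum(len(p) == len(t) for p, t in zip(predictions, targets))
    # Position-wise matches; zip stops at the shorter string, so extra
    # predicted digits are neither rewarded nor penalized here.
    digit_hits = sum(
        sum(pc == tc for pc, tc in zip(p, t)) for p, t in zip(predictions, targets)
    )
    total_digits = sum(len(t) for t in targets)

    n = len(targets)
    return {
        "exact_match": exact / n,
        "digit_accuracy": digit_hits / total_digits,
        "length_accuracy": same_length / n,
    }


demo_metrics = ocr_metrics(["123", "45", "7890"], ["123", "46", "789"])
print(demo_metrics)
```

Other definitions, such as edit-distance-based digit accuracy, are equally reasonable, so numbers from this sketch are not directly comparable to the ones reported above.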

	For longer digit strings, exact match is much stricter than digit accuracy. Even if each digit is correct with high probability, the probability of a fully correct 8-digit sequence is lower because all digits must be correct simultaneously.
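
As a rough illustration of this effect, assume every digit is read correctly with the same probability and that errors are independent. Both are simplifying assumptions, and the function name is hypothetical. The exact-match probability for a string of length n is then the per-digit accuracy raised to the power n.

```python
def exact_match_probability(per_digit_accuracy: float, num_digits: int) -> float:
    """Probability that all digits are correct, assuming independent per-digit errors."""
    return per_digit_accuracy ** num_digits


# Use the representative digit accuracy from above as the per-digit probability.
for n in (1, 2, 4, 8):
    print(n, round(exact_match_probability(0.976, n), 3))
```

This is not a reproduction of the reported exact-match figure, which is averaged over strings of mixed length and depends on how errors are actually distributed.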

	## Recommended validation tests

	### OCR smoke test

	```bash
	python test_inference_tinygemma3ocr2m.py \
	    --model-dir shibatch/tinygemma3ocr2m \
	    --text 1677216 \
	    --font-size 30 \
	    --prompt "Read the digits."
	```

	Expected output:

	```text
	Prediction:
	target: 1677216
	prediction: 1677216
	raw_text: '1677216'
	```

	### Batch synthetic OCR test

	```bash
	python test_inference_tinygemma3ocr2m.py \
	    --model-dir shibatch/tinygemma3ocr2m \
	    --eval-synthetic 500 \
	    --min-digits 1 \
	    --max-digits 8 \
	    --font-size 30 \
	    --prompt "Read the digits."
	```

	### Text sanity test

	```bash
	python test_text_generation_tinygemma3ocr2m.py \
	    --model-dir shibatch/tinygemma3ocr2m \
	    --prompt "Once upon" \
	    --max-new-tokens 50
	```

	## Expected config properties

	A correctly loaded 128x128 checkpoint should report:

	```text
	image_token_index: 1003
	mm_tokens_per_image: 64
	vision image_size: 128
	vision patch_size: 16
	```

	If `mm_tokens_per_image` is not 64, or if the vision image size is not 128, the checkpoint is not the intended 128x128 model.
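
A check like this is easy to automate. The sketch below compares a parsed config dictionary against the expected values and returns a list of mismatches. It assumes the vision settings live under a nested `vision_config` key, as in typical Gemma3 `config.json` files; the function name `check_config` and the demo dictionaries are hypothetical.

```python
EXPECTED = {
    "image_token_index": 1003,
    "mm_tokens_per_image": 64,
    "vision image_size": 128,
    "vision patch_size": 16,
}


def check_config(config: dict) -> list:
    """Return human-readable mismatches; an empty list means the config matches."""
    vision = config.get("vision_config", {})  # assumed nesting of the vision settings
    observed = {
        "image_token_index": config.get("image_token_index"),
        "mm_tokens_per_image": config.get("mm_tokens_per_image"),
        "vision image_size": vision.get("image_size"),
        "vision patch_size": vision.get("patch_size"),
    }
    return [
        f"{key}: expected {EXPECTED[key]}, got {observed[key]}"
        for key in EXPECTED
        if observed[key] != EXPECTED[key]
    ]


good_config = {
    "image_token_index": 1003,
    "mm_tokens_per_image": 64,
    "vision_config": {"image_size": 128, "patch_size": 16},
}
wrong_size_config = dict(good_config, vision_config={"image_size": 224, "patch_size": 16})

print(check_config(good_config))
print(check_config(wrong_size_config))
```

In practice the dictionary would come from `json.load` on the checkpoint's `config.json`.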

	## File layout

	The recommended repository layout is:

	```text
	shibatch/tinygemma3ocr2m
	    README.md
	    config.json
	    model.safetensors
	    tokenizer.json
	    tokenizer_config.json
	    special_tokens_map.json
	    sample_images/
	        sample_00_6235317.png
	    sample_images.json
	    eval_ocr_augmented.json
	    eval_text_generation.json
	    safetensors_keys.json
	    artifact_metadata.json

	The model files should be placed at the repository root so that this works:

	```python
	Gemma3ForConditionalGeneration.from_pretrained("shibatch/tinygemma3ocr2m")
	PreTrainedTokenizerFast.from_pretrained("shibatch/tinygemma3ocr2m")
	```

	If the model files are instead placed under an `hf/` subdirectory, as the OCR and text-generation examples in this README assume, pass `subfolder="hf"` to `from_pretrained()`.

	## Implementation notes

	This checkpoint uses repeated `<image>` tokens in the text input. The number of inserted image tokens must match `mm_tokens_per_image`.

	For this 128x128 checkpoint:

	```text
	image_size = 128
	patch_size = 16
	patch grid = 8 x 8
	mm_tokens_per_image = 64
	```
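
The arithmetic behind these numbers can be written as a small helper. The sketch assumes that every patch becomes exactly one image token, with no pooling between the patch grid and the projector output, which is consistent with the 8 x 8 grid and 64 tokens listed above. The function name is hypothetical.

```python
def image_tokens_per_image(image_size: int, patch_size: int) -> int:
    """Number of image tokens for a square image, assuming one token per patch."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


tokens = image_tokens_per_image(128, 16)
print(tokens)
```

An inference engine can use the same calculation to cross-check `mm_tokens_per_image` before inserting image tokens.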

	The OCR prompt format is approximately:

	```text
	<bos>
	<image> repeated 64 times
	\nRead the digits.\n
	```

	The target tokens are the digit string followed by EOS.
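
The layout above can be expressed on plain token-id lists, independent of any tokenizer. In the sketch below only the image token id (1003) and the image token count (64) come from this card. The BOS, EOS, prompt, and digit ids are placeholder values for illustration, and the function names are hypothetical.

```python
IMAGE_TOKEN_ID = 1003      # from this model card
TOKENS_PER_IMAGE = 64      # from this model card


def build_ocr_input_ids(bos_id, prompt_ids):
    """<bos>, then the repeated image token, then the tokenized prompt."""
    return [bos_id] + [IMAGE_TOKEN_ID] * TOKENS_PER_IMAGE + list(prompt_ids)


def build_target_ids(digit_ids, eos_id):
    """Digit-string tokens followed by EOS."""
    return list(digit_ids) + [eos_id]


# Placeholder ids; real values come from the checkpoint's tokenizer.
demo_bos, demo_eos = 2, 1
demo_prompt_ids = [10, 11, 12]
demo_digit_ids = [20, 21, 22]

input_ids = build_ocr_input_ids(demo_bos, demo_prompt_ids)
target_ids = build_target_ids(demo_digit_ids, demo_eos)
print(len(input_ids), input_ids.count(IMAGE_TOKEN_ID), target_ids)
```

Counting the image token in the final sequence is a cheap guard against off-by-one errors when an engine expands the image placeholder itself.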

	## Citation / attribution

	This is a synthetic tiny validation checkpoint derived from a Gemma3-style architecture and trained for local inference-engine validation. TinyStories was used for basic text warmup and text regularization.

	## Disclaimer

	This model is for engineering validation and testing. It should not be used for real OCR tasks or user-facing document understanding.