# GLM-OCR-Manga-LoRA

A LoRA adapter for GLM-OCR, fine-tuned for Japanese manga text recognition.
This adapter significantly improves GLM-OCR's performance on Japanese manga speech bubbles, vertical text, and complex typographic styles by training on ~143k image-text pairs from Manga109-s and manga-synthetic datasets.
## Key Results
| Metric | Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| Character Error Rate | 111.43% | 26.02% | ↓ 76.6% |
| Exact Match Accuracy | 12.72% | 55.91% | ↑ 339.4% |
The baseline CER exceeding 100% is due to severe hallucination — the base model would frequently output hundreds of repetitive characters for short inputs. The LoRA adapter nearly eliminates this failure mode.
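A CER above 100% is possible because the metric divides edit distance by the *reference* length, and insertions are unbounded. A minimal sketch of the computation (plain Levenshtein distance; real evaluations typically use a library such as `jiwer`):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate in percent; exceeds 100% when the
    hypothesis inserts more characters than the reference contains."""
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)

# A 5-character reference with 8 hallucinated repeats appended:
print(cer("こんにちは", "こんにちは" + "は" * 8))  # 160.0
```

Eight spurious insertions against a five-character reference already yield 160% CER, which is how repetitive hallucination drives the baseline past 100%.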
## Model Details
| Property | Value |
|---|---|
| Base Model | zai-org/GLM-OCR (0.9B params) |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA Rank | 8 |
| LoRA Target | All linear layers |
| Precision | BFloat16 |
| Training Hardware | NVIDIA RTX 3060 (12GB VRAM) |
| Training Data | ~143k samples (90/10 train/val split) |
| Epochs | 3 |
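In PEFT terms, the adapter settings in the table roughly correspond to the sketch below. Note that `lora_alpha` is an assumption (a common `2 * r` default), since the card does not state it:

```python
from peft import LoraConfig

# Sketch of the adapter hyperparameters listed above.
# target_modules="all-linear" asks PEFT to wrap every linear layer,
# matching "LoRA Target: All linear layers". lora_alpha is NOT
# stated in the card and is shown here only as a common default.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```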
## Training Data
The training corpus was assembled from two complementary sources:
- Manga109-s — Real manga pages with bounding-box annotations. Text regions were programmatically cropped from speech bubbles using XML coordinate data.
- jzhang533/manga-synthetic — ~58k synthetic manga-style text images providing diverse font styles and orientations.
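The cropping step for Manga109-s can be sketched with the standard library. The element and attribute names (`page index`, `text xmin/ymin/xmax/ymax`) follow the Manga109 annotation schema; treat this as a sketch and verify against your copy of the dataset:

```python
import xml.etree.ElementTree as ET

def extract_text_boxes(xml_string: str):
    """Yield (page_index, (xmin, ymin, xmax, ymax), transcription)
    for every annotated text region in a Manga109-style XML file."""
    root = ET.fromstring(xml_string)
    for page in root.iter("page"):
        index = int(page.get("index"))
        for t in page.iter("text"):
            box = tuple(int(t.get(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
            yield index, box, (t.text or "").strip()

# Each box can then be cropped from the page image, e.g. with Pillow:
#   Image.open(page_path).crop(box).save(out_path)
sample = """
<book title="Example">
  <pages>
    <page index="3" width="1654" height="1170">
      <text id="a1" xmin="100" ymin="50" xmax="220" ymax="180">こんにちは</text>
    </page>
  </pages>
</book>
"""
for index, box, text in extract_text_boxes(sample):
    print(index, box, text)  # 3 (100, 50, 220, 180) こんにちは
```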
All samples were formatted in ShareGPT conversational structure for LLaMA-Factory compatibility.
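A single sample in that layout might look as follows. The field names (`conversations`, `from`, `value`, `images`) follow the common ShareGPT convention used by LLaMA-Factory's multimodal datasets; check the `dataset_info.json` mapping for your setup before relying on them:

```python
import json

def to_sharegpt(image_path: str, transcription: str) -> dict:
    """Wrap one (image, text) pair as a ShareGPT conversation.
    `<image>` is the placeholder LLaMA-Factory substitutes with
    the image features at training time."""
    return {
        "conversations": [
            {"from": "human", "value": "<image>Text Recognition:"},
            {"from": "gpt", "value": transcription},
        ],
        "images": [image_path],
    }

sample = to_sharegpt("crops/page_003_a1.png", "こんにちは")
print(json.dumps(sample, ensure_ascii=False, indent=2))
```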
## How to Use

### With PEFT
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
from PIL import Image

# Load base model + LoRA adapter
processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR", trust_remote_code=True)
base_model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-OCR",
    torch_dtype=torch.bfloat16,  # match the BFloat16 training precision
    device_map="cuda",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "YOUR_HF_USERNAME/GLM-OCR-Manga-LoRA")
model.eval()

# Run inference
image = Image.open("manga_crop.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Text Recognition:"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)  # some processors emit this; generate() rejects it

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
result = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
```
### With the Web UI

```shell
pip install -r requirements.txt
python web_ui.py --lora-path ./path/to/adapter
```
## Training Procedure
Training was performed using LLaMA-Factory with the following configuration:
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size | 4 |
| Gradient Accumulation | 4 steps (effective batch = 16) |
| Gradient Checkpointing | Enabled |
| Cutoff Length | 1024 tokens |
| Total Steps | ~26,856 |
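The reported step count is consistent with the ~143k figure referring to the training split at the effective batch size of 16. A quick sanity check under that assumption:

```python
import math

train_samples = 143_000        # approximate training-split size (assumption)
effective_batch = 4 * 4        # batch size x gradient accumulation
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * 3  # 3 epochs
print(steps_per_epoch, total_steps)  # 8938 26814, close to the reported ~26,856
```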
### Training Loss Dynamics
The model demonstrated stable convergence throughout training:
- Initial validation loss dropped rapidly from ~1.40 → ~0.75 within the first 1,000 steps
- Continued steady decay to ~0.30 over the full training run (~12 hours)
- No anomalous loss spikes were observed, indicating well-tuned hyperparameters
## Limitations
- Domain-specific: Optimized for Japanese manga text. Performance on other languages, document types, or handwriting styles has not been evaluated.
- Input format: Best results are achieved with cropped text regions (speech bubbles). Full-page manga inputs may produce degraded results.
- Furigana: While significantly improved, very small furigana annotations remain challenging.
- Hardware: Inference requires a CUDA-capable GPU with ≥4GB VRAM.
## Citation

```bibtex
@misc{glm-ocr-manga-lora,
  title  = {GLM-OCR-Manga-LoRA: Fine-tuning GLM-OCR for Manga Text Recognition},
  author = {Psyka},
  year   = {2026},
  url    = {https://github.com/Psyka/glm-ocr-manga-finetune},
}
```
## Acknowledgments
- GLM-OCR by ZhipuAI
- LLaMA-Factory by hiyouga
- Manga109-s academic dataset
- manga-synthetic by jzhang533