# GLM-OCR-Manga-LoRA

A LoRA adapter for GLM-OCR, fine-tuned for Japanese manga text recognition.
This adapter significantly improves GLM-OCR's performance on Japanese manga speech bubbles, vertical text, and complex typographic styles by training on ~143k image-text pairs from Manga109-s and manga-synthetic datasets.
## Key Results
| Metric | Baseline | Fine-Tuned | Improvement |
|---|---|---|---|
| Character Error Rate | 111.43% | 26.02% | ↓ 76.6% |
| Exact Match Accuracy | 12.72% | 55.91% | ↑ 339.4% |
The baseline CER exceeding 100% is due to severe hallucination — the base model would frequently output hundreds of repetitive characters for short inputs. The LoRA adapter nearly eliminates this failure mode.
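A CER above 100% is possible because the metric divides edit distance by the *reference* length, and insertions are unbounded. A minimal sketch of the computation (plain Levenshtein distance; real evaluations typically use a library such as `jiwer`):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate in percent; exceeds 100% when the
    hypothesis inserts more characters than the reference contains."""
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)

# A 5-character reference with 8 hallucinated repeats appended:
print(cer("こんにちは", "こんにちは" + "は" * 8))  # 160.0
```

Eight spurious insertions against a five-character reference already yield 160% CER, which is how repetitive hallucination drives the baseline past 100%.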
## Model Details
| Property | Value |
|---|---|
| Base Model | zai-org/GLM-OCR (0.9B params) |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA Rank | 8 |
| LoRA Target | All linear layers |
| Precision | BFloat16 |
| Training Hardware | NVIDIA RTX 3060 (12GB VRAM) |
| Training Data | ~143k samples (90/10 train/val split) |
| Epochs | 3 |
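In PEFT terms, the adapter settings in the table roughly correspond to the sketch below. Note that `lora_alpha` is an assumption (a common `2 * r` default), since the card does not state it:

```python
from peft import LoraConfig

# Sketch of the adapter hyperparameters listed above.
# target_modules="all-linear" asks PEFT to wrap every linear layer,
# matching "LoRA Target: All linear layers". lora_alpha is NOT
# stated in the card and is shown here only as a common default.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```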
## Training Data
The training corpus was assembled from two complementary sources:
- Manga109-s — Real manga pages with bounding-box annotations. Text regions were programmatically cropped from speech bubbles using XML coordinate data.
- jzhang533/manga-synthetic — ~58k synthetic manga-style text images providing diverse font styles and orientations.
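The cropping step for Manga109-s can be sketched with the standard library. The element and attribute names (`page index`, `text xmin/ymin/xmax/ymax`) follow the Manga109 annotation schema; treat this as a sketch and verify against your copy of the dataset:

```python
import xml.etree.ElementTree as ET

def extract_text_boxes(xml_string: str):
    """Yield (page_index, (xmin, ymin, xmax, ymax), transcription)
    for every annotated text region in a Manga109-style XML file."""
    root = ET.fromstring(xml_string)
    for page in root.iter("page"):
        index = int(page.get("index"))
        for t in page.iter("text"):
            box = tuple(int(t.get(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
            yield index, box, (t.text or "").strip()

# Each box can then be cropped from the page image, e.g. with Pillow:
#   Image.open(page_path).crop(box).save(out_path)
sample = """
<book title="Example">
  <pages>
    <page index="3" width="1654" height="1170">
      <text id="a1" xmin="100" ymin="50" xmax="220" ymax="180">こんにちは</text>
    </page>
  </pages>
</book>
"""
for index, box, text in extract_text_boxes(sample):
    print(index, box, text)  # 3 (100, 50, 220, 180) こんにちは
```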
All samples were formatted in ShareGPT conversational structure for LLaMA-Factory compatibility.
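A single sample in that layout might look as follows. The field names (`conversations`, `from`, `value`, `images`) follow the common ShareGPT convention used by LLaMA-Factory's multimodal datasets; check the `dataset_info.json` mapping for your setup before relying on them:

```python
import json

def to_sharegpt(image_path: str, transcription: str) -> dict:
    """Wrap one (image, text) pair as a ShareGPT conversation.
    `<image>` is the placeholder LLaMA-Factory substitutes with
    the image features at training time."""
    return {
        "conversations": [
            {"from": "human", "value": "<image>Text Recognition:"},
            {"from": "gpt", "value": transcription},
        ],
        "images": [image_path],
    }

sample = to_sharegpt("crops/page_003_a1.png", "こんにちは")
print(json.dumps(sample, ensure_ascii=False, indent=2))
```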
## How to Use

### With PEFT
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
from PIL import Image

# Load base model + LoRA adapter
processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR", trust_remote_code=True)
base_model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-OCR",
    torch_dtype=torch.bfloat16,  # match the BFloat16 training precision
    device_map="cuda",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "YOUR_HF_USERNAME/GLM-OCR-Manga-LoRA")
model.eval()

# Run inference
image = Image.open("manga_crop.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Text Recognition:"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)  # some processors emit this; generate() rejects it

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
result = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
```
### With the Web UI

```shell
pip install -r requirements.txt
python web_ui.py --lora-path ./path/to/adapter
```
## Training Procedure
Training was performed using LLaMA-Factory with the following configuration:
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size | 4 |
| Gradient Accumulation | 4 steps (effective batch = 16) |
| Gradient Checkpointing | Enabled |
| Cutoff Length | 1024 tokens |
| Total Steps | ~26,856 |
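The reported step count is consistent with the ~143k figure referring to the training split at the effective batch size of 16. A quick sanity check under that assumption:

```python
import math

train_samples = 143_000        # approximate training-split size (assumption)
effective_batch = 4 * 4        # batch size x gradient accumulation
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * 3  # 3 epochs
print(steps_per_epoch, total_steps)  # 8938 26814, close to the reported ~26,856
```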
### Training Loss Dynamics
The model demonstrated stable convergence throughout training:
- Initial validation loss dropped rapidly from ~1.40 → ~0.75 within the first 1,000 steps
- Continued steady decay to ~0.30 over the full training run (~12 hours)
- No anomalous loss spikes were observed, indicating well-tuned hyperparameters
## Limitations
- Domain-specific: Optimized for Japanese manga text. Performance on other languages, document types, or handwriting styles has not been evaluated.
- Input format: Best results are achieved with cropped text regions (speech bubbles). Full-page manga inputs may produce degraded results.
- Furigana: While significantly improved, very small furigana annotations remain challenging.
- Hardware: Inference requires a CUDA-capable GPU with ≥4GB VRAM.
## Citation

```bibtex
@misc{glm-ocr-manga-lora,
  title  = {GLM-OCR-Manga-LoRA: Fine-tuning GLM-OCR for Manga Text Recognition},
  author = {Psyka},
  year   = {2026},
  url    = {https://github.com/Psyka/glm-ocr-manga-finetune},
}
```
## Acknowledgments
- GLM-OCR by ZhipuAI
- LLaMA-Factory by hiyouga
- Manga109-s academic dataset
- manga-synthetic by jzhang533