# GLM-OCR-Manga-LoRA

A LoRA adapter for GLM-OCR fine-tuned on manga text recognition.

This adapter significantly improves GLM-OCR's performance on Japanese manga speech bubbles, vertical text, and complex typographic styles by training on ~143k image-text pairs from Manga109-s and manga-synthetic datasets.

## Key Results

| Metric | Baseline | Fine-Tuned | Improvement |
| --- | --- | --- | --- |
| Character Error Rate | 111.43% | 26.02% | ↓ 76.6% |
| Exact Match Accuracy | 12.72% | 55.91% | ↑ 339.4% |

Both metrics are self-reported on a withheld Manga109-s test split.

The baseline CER exceeding 100% is due to severe hallucination — the base model would frequently output hundreds of repetitive characters for short inputs. The LoRA adapter nearly eliminates this failure mode.
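For reference, character error rate is Levenshtein edit distance divided by reference length, so a hallucinated hypothesis much longer than its reference drives CER past 100%. A minimal stdlib sketch of the metric (illustrative only, not this project's actual evaluation script):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    # CER = edit distance / reference length — unbounded above 100%
    # when the hypothesis is much longer than the reference.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Short reference vs. a hallucinated, repetitive hypothesis:
print(f"{cer('こんにちは', 'こんにちは' + 'は' * 10):.0%}")  # → 200%
```

Because repetitive hallucinations add many insertions against a short reference, a model that mostly hallucinates can average above 100% CER, as the baseline does here.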

## Model Details

| Property | Value |
| --- | --- |
| Base Model | zai-org/GLM-OCR (0.9B params) |
| Method | LoRA (Low-Rank Adaptation) |
| LoRA Rank | 8 |
| LoRA Target | All linear layers |
| Precision | BFloat16 |
| Training Hardware | NVIDIA RTX 3060 (12GB VRAM) |
| Training Data | ~143k samples (90/10 train/val split) |
| Epochs | 3 |
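For context on what rank 8 means in practice: LoRA freezes each targeted weight matrix and trains a pair of rank-r factors, so the trainable-parameter count per linear layer is r·(d_in + d_out). A stdlib sketch with a hypothetical layer width (GLM-OCR's actual layer shapes are not stated in this card):

```python
def lora_params(d_in: int, d_out: int, r: int = 8) -> int:
    # LoRA replaces a frozen (d_out x d_in) weight update with B @ A,
    # where A is (r x d_in) and B is (d_out x r); only A and B train.
    return r * d_in + d_out * r

# Hypothetical 1536-wide projection, for illustration only:
print(lora_params(1536, 1536))  # → 24576 trainable vs ~2.36M frozen params
```

This is why a 0.9B-parameter model can be fine-tuned on a 12GB consumer GPU: the adapter trains only a small fraction of the weights.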

## Training Data

The training corpus was assembled from two complementary sources:

  1. Manga109-s — Real manga pages with bounding-box annotations. Text regions were programmatically cropped from speech bubbles using XML coordinate data.
  2. jzhang533/manga-synthetic — ~58k synthetic manga-style text images providing diverse font styles and orientations.
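
The cropping step for source 1 can be sketched with the standard library's XML parser. This assumes Manga109-style `<text>` elements carrying `xmin`/`ymin`/`xmax`/`ymax` pixel attributes; the element and attribute names should be checked against your copy of the annotations:

```python
import xml.etree.ElementTree as ET

def text_boxes(xml_string: str):
    """Yield (text, (xmin, ymin, xmax, ymax)) for each annotated text region.

    Assumes Manga109-style <text> elements with pixel coordinates stored
    as attributes; adjust the names if your annotation schema differs.
    """
    root = ET.fromstring(xml_string)
    for node in root.iter("text"):
        box = tuple(int(node.attrib[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
        yield node.text or "", box

sample = '<page index="0"><text xmin="10" ymin="20" xmax="90" ymax="120">こんにちは</text></page>'
for text, box in text_boxes(sample):
    print(text, box)
    # With Pillow, the crop itself is then a single call:
    # Image.open("page_000.png").crop(box).save("crop_000.png")
```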

All samples were formatted in ShareGPT conversational structure for LLaMA-Factory compatibility.
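One such sample might look as follows. This sketch assumes LLaMA-Factory's ShareGPT-style multimodal layout (`conversations` with `from`/`value` turns plus an `images` list) and reuses the "Text Recognition:" prompt from the inference example below; the exact fields used in training are not stated in this card:

```python
import json

def to_sharegpt(image_path: str, transcript: str) -> dict:
    # One training sample: an <image> placeholder in the human turn,
    # the ground-truth transcript as the gpt turn, and the image path
    # listed separately, as LLaMA-Factory's multimodal datasets expect.
    return {
        "conversations": [
            {"from": "human", "value": "<image>Text Recognition:"},
            {"from": "gpt", "value": transcript},
        ],
        "images": [image_path],
    }

sample = to_sharegpt("crops/000123.png", "こんにちは")
print(json.dumps(sample, ensure_ascii=False, indent=2))
```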

## How to Use

### With PEFT

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel
from PIL import Image

# Load base model + LoRA adapter
processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR", trust_remote_code=True)
base_model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-OCR",
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "psyka-101/GLM-OCR-Manga-LoRA")
model.eval()

# Run inference on a cropped text region
image = Image.open("manga_crop.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Text Recognition:"},
]}]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)  # generate() does not accept this key

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt
result = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(result)
```

### With the Web UI

```bash
pip install -r requirements.txt
python web_ui.py --lora-path ./path/to/adapter
```

## Training Procedure

Training was performed using LLaMA-Factory with the following configuration:

| Parameter | Value |
| --- | --- |
| Learning Rate | 1e-4 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size | 4 |
| Gradient Accumulation | 4 steps (effective batch = 16) |
| Gradient Checkpointing | Enabled |
| Cutoff Length | 1024 tokens |
| Total Steps | ~26,856 |
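Assembled into a single LLaMA-Factory run, the settings above would correspond to a YAML file roughly like this. This is a hypothetical reconstruction, not the project's actual config: key names follow LLaMA-Factory's published LoRA SFT examples, the `dataset` value is a placeholder, and fields like `template` are omitted because they are not stated in this card:

```yaml
model_name_or_path: zai-org/GLM-OCR
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all            # shorthand for all linear layers
dataset: manga_ocr          # placeholder name for the ShareGPT-formatted data
cutoff_len: 1024
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
val_size: 0.1
```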

### Training Loss Dynamics

The model demonstrated stable convergence throughout training:

  • Initial validation loss dropped rapidly from ~1.40 → ~0.75 within the first 1,000 steps
  • Continued steady decay to ~0.30 over the full training run (~12 hours)
  • No anomalous loss spikes or divergence, suggesting the hyperparameters were well chosen

## Limitations

  • Domain-specific: Optimized for Japanese manga text. Performance on other languages, document types, or handwriting styles has not been evaluated.
  • Input format: Best results are achieved with cropped text regions (speech bubbles). Full-page manga inputs may produce degraded results.
  • Furigana: While significantly improved, very small furigana annotations remain challenging.
  • Hardware: Inference requires a CUDA-capable GPU with ≥4GB VRAM.

## Citation

```bibtex
@misc{glm-ocr-manga-lora,
  title   = {GLM-OCR-Manga-LoRA: Fine-tuning GLM-OCR for Manga Text Recognition},
  author  = {Psyka},
  year    = {2026},
  url     = {https://github.com/Psyka/glm-ocr-manga-finetune},
}
```

## Acknowledgments
