# GT-REX-v4: Production OCR Model

🦖 **GothiTech Recognition & Extraction eXpert — Version 4**

GT-REX-v4 is a production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.
## 📑 Table of Contents
- GT-REX Variants
- Key Features
- Model Details
- Quick Start
- Installation
- Usage Examples
- Use Cases
- Performance Benchmarks
- Prompt Engineering Guide
- API Integration
- Troubleshooting
- License
- Citation
## ⚙️ GT-REX Variants
GT-REX-v4 ships with three optimized configurations tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.
| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---|---|---|---|---|---|---|
| 🚀 Nano | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
| ⚡ Pro (Default) | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
| 🎯 Ultra | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |
### How to Choose a Variant
- Nano → You need maximum throughput and documents are simple (receipts, IDs, labels).
- Pro → General-purpose. Best balance for invoices, contracts, forms, and reports.
- Ultra → Documents have fine print, dense tables, medical records, or legal footnotes.
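Because all three variants share the same weights, switching is purely a configuration change. A minimal sketch of a selection helper (the `llm_kwargs` function and `VARIANTS` table are illustrative, not part of the model package; they mirror the per-variant settings below):

```python
# Illustrative mapping from variant name to the vLLM settings listed below.
# All three configurations load the same "developerJenis/GT-REX-v4" weights.
VARIANTS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}


def llm_kwargs(variant: str = "pro") -> dict:
    """Return keyword arguments for vllm.LLM for the chosen variant."""
    cfg = VARIANTS[variant.lower()]
    return {
        "model": "developerJenis/GT-REX-v4",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **cfg,
    }
```

Usage would then be `llm = LLM(**llm_kwargs("ultra"))`, keeping the variant choice in one place.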
### 🚀 GT-REX-Nano

*Speed-optimized for high-volume batch processing*
| Setting | Value |
|---|---|
| Resolution | 640 × 640 px |
| Speed | ~1–2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4–6 GB |
| Recommended Batch Size | 256 sequences |
Best for: Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```
### ⚡ GT-REX-Pro (Default)

*Balanced quality and speed for standard enterprise documents*
| Setting | Value |
|---|---|
| Resolution | 1024 × 1024 px |
| Speed | ~2–5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6–10 GB |
| Recommended Batch Size | 128 sequences |
Best for: Contracts, forms, invoices, reports, government documents, insurance claims.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```
### 🎯 GT-REX-Ultra

*Maximum quality with adaptive processing for complex documents*
| Setting | Value |
|---|---|
| Resolution | 1536 × 1536 px |
| Speed | ~5–10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10–15 GB |
| Recommended Batch Size | 64 sequences |
Best for: Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```
## 🎯 Key Features
| Feature | Description |
|---|---|
| High Accuracy | Advanced vision-language architecture for precise text extraction |
| Multi-Language | Handles documents in English and multiple other languages |
| Production Ready | Optimized for deployment with the vLLM inference engine |
| Batch Processing | Process hundreds of documents per minute (Nano variant) |
| Flexible Prompts | Supports structured extraction — JSON, tables, key-value pairs, forms |
| Handwriting Support | Transcribes handwritten text with high fidelity |
| Three Variants | Nano (speed), Pro (balanced), Ultra (accuracy) |
| Structured Output | Extract data directly into JSON, Markdown tables, or custom schemas |
## 📊 Model Details
| Attribute | Value |
|---|---|
| Developer | GothiTech (Jenis Hathaliya) |
| Architecture | Vision-Language Model (VLM) |
| Model Size | ~6.5 GB |
| Parameters | ~7B |
| License | MIT |
| Release Date | February 2026 |
| Precision | BF16 / FP16 |
| Input Resolution | 640px – 1536px (variant dependent) |
| Max Sequence Length | 2048 – 8192 tokens (variant dependent) |
| Inference Engine | vLLM (recommended) |
| Framework | PyTorch / Transformers |
## 🚀 Quick Start

Get running in under 5 minutes:
```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load model (Pro variant — default)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)
```
## 💻 Installation

### Prerequisites
- Python 3.9+
- CUDA 11.8+ (GPU required)
- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)
### Install Dependencies

```bash
pip install vllm pillow torch transformers
```
### Verify Installation

```python
from vllm import LLM

print("vLLM installed successfully!")
```
## 📖 Usage Examples

### Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```
### Structured JSON Extraction

```python
prompt = """Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}"""
```
### Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```
### Key-Value Pair Extraction

```python
prompt = """Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value
..."""
```
### Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```
### Multi-Document Batch Processing
```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```
## 🏢 Use Cases
| Domain | Application | Recommended Variant |
|---|---|---|
| Finance | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| Legal | Contract analysis, clause extraction, legal filings | Ultra |
| Healthcare | Medical records, prescriptions, lab reports | Ultra |
| Government | Form processing, ID verification, tax documents | Pro |
| Insurance | Claims processing, policy documents | Pro |
| Education | Exam paper digitization, handwritten notes | Pro / Ultra |
| Logistics | Shipping labels, waybills, packing lists | Nano |
| Real Estate | Property documents, deeds, mortgage papers | Pro |
| Retail | Product catalogs, price tags, inventory lists | Nano |
## 📈 Performance Benchmarks

### Throughput by Variant (NVIDIA A100 80GB)
| Variant | Single Image | Batch (32) | Batch (128) |
|---|---|---|---|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |
### Accuracy by Document Type (Pro Variant)
| Document Type | Character Accuracy | Field Accuracy |
|---|---|---|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |
> **Note:** Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.
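For rough capacity planning, the batch-of-128 timings can be converted into implied per-minute rates (illustrative arithmetic from the table above, subject to the same caveats):

```python
# Rough throughput implied by the batch-of-128 timings above (A100 80GB).
batch_seconds = {"Nano": 55, "Pro": 170, "Ultra": 380}
for variant, seconds in batch_seconds.items():
    print(f"{variant}: ~{128 / seconds * 60:.0f} docs/min")
```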
## 🧠 Prompt Engineering Guide

Get the best results from GT-REX-v4 with these prompt strategies:

### Do's
- Be specific about what to extract ("Extract the invoice number and total amount")
- Specify output format ("Return as JSON", "Return as Markdown table")
- Provide schema for structured extraction (show the expected JSON keys)
- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")
### Don'ts
- Avoid vague prompts ("What is this?")
- Don't ask for analysis or summarization — GT-REX is optimized for extraction
- Don't include unrelated context in the prompt
### Example Prompts

```python
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```
## 🔌 API Integration

### FastAPI Server Example
```python
import io

from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams

app = FastAPI()

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )
    return {"text": outputs[0].outputs[0].text}
```
## 🛠️ Troubleshooting
| Issue | Solution |
|---|---|
| CUDA Out of Memory | Reduce gpu_memory_utilization or switch to Nano variant |
| Slow inference | Increase max_num_seqs for better batching; use Nano for speed |
| Truncated output | Increase max_tokens in SamplingParams |
| Low accuracy on small text | Switch to Ultra variant for higher resolution |
| Garbled multilingual text | Ensure image resolution is sufficient; try Ultra variant |
## 🔧 Hardware Recommendations
| Variant | Minimum GPU | Recommended GPU |
|---|---|---|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |
## 📜 License
This model is released under the MIT License. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.
## 📖 Citation
If you use GT-REX-v4 in your work, please cite:
```bibtex
@misc{gtrex-v4-2026,
  title  = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/developerJenis/GT-REX-v4},
  note   = {GothiTech Recognition \& Extraction eXpert, Version 4}
}
```
## 🤝 Contact & Support
- Developer: Jenis Hathaliya
- Organization: GothiTech
- HuggingFace: developerJenis
Built with ❤️ by GothiTech
Last updated: February 2026
Model Version: v4.0 | Variants: Nano | Pro | Ultra