GT-REX-v4: Production OCR Model

🦖 GothiTech Recognition & Extraction eXpert — Version 4

Model License: MIT


GT-REX-v4 is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents including invoices, contracts, forms, handwritten notes, and dense tables.


⚙️ GT-REX Variants

GT-REX-v4 ships with three optimized configurations tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---|---|---|---|---|---|---|
| 🚀 Nano | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
| ⚡ Pro (Default) | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
| 🎯 Ultra | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |

How to Choose a Variant

  • Nano → You need maximum throughput and documents are simple (receipts, IDs, labels).
  • Pro → General-purpose. Best balance for invoices, contracts, forms, and reports.
  • Ultra → Documents have fine print, dense tables, medical records, or legal footnotes.
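
The variant tables below map directly onto vLLM constructor arguments, so one convenient pattern (an illustrative helper, not shipped with the model) is to keep the per-variant settings in a single dict:

```python
# Per-variant vLLM settings, taken from the variant tables in this card.
# The VARIANTS dict and llm_kwargs helper are illustrative, not part of the package.
VARIANTS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}


def llm_kwargs(variant: str = "pro") -> dict:
    """Return keyword arguments for vllm.LLM for the chosen variant."""
    settings = VARIANTS[variant]
    return {
        "model": "developerJenis/GT-REX-v4",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **settings,
    }
```

With this in place, switching variants is `LLM(**llm_kwargs("ultra"))` instead of editing five arguments by hand.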

🚀 GT-REX-Nano

Speed-optimized for high-volume batch processing

| Setting | Value |
|---|---|
| Resolution | 640 × 640 px |
| Speed | ~1–2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4–6 GB |
| Recommended Batch Size | 256 sequences |

Best for: Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```

⚡ GT-REX-Pro (Default)

Balanced quality and speed for standard enterprise documents

| Setting | Value |
|---|---|
| Resolution | 1024 × 1024 px |
| Speed | ~2–5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6–10 GB |
| Recommended Batch Size | 128 sequences |

Best for: Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```

🎯 GT-REX-Ultra

Maximum quality with adaptive processing for complex documents

| Setting | Value |
|---|---|
| Resolution | 1536 × 1536 px |
| Speed | ~5–10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10–15 GB |
| Recommended Batch Size | 64 sequences |

Best for: Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```

🎯 Key Features

| Feature | Description |
|---|---|
| High Accuracy | Advanced vision-language architecture for precise text extraction |
| Multi-Language | Handles documents in English and multiple other languages |
| Production Ready | Optimized for deployment with the vLLM inference engine |
| Batch Processing | Process hundreds of documents per minute (Nano variant) |
| Flexible Prompts | Supports structured extraction — JSON, tables, key-value pairs, forms |
| Handwriting Support | Transcribes handwritten text with high fidelity |
| Three Variants | Nano (speed), Pro (balanced), Ultra (accuracy) |
| Structured Output | Extract data directly into JSON, Markdown tables, or custom schemas |

📊 Model Details

| Attribute | Value |
|---|---|
| Developer | GothiTech (Jenis Hathaliya) |
| Architecture | Vision-Language Model (VLM) |
| Model Size | ~6.5 GB |
| Parameters | ~3B |
| License | MIT |
| Release Date | February 2026 |
| Precision | BF16 / FP16 |
| Input Resolution | 640px – 1536px (variant dependent) |
| Max Sequence Length | 2048 – 8192 tokens (variant dependent) |
| Inference Engine | vLLM (recommended) |
| Framework | PyTorch / Transformers |

🚀 Quick Start

Get running in under 5 minutes:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load model (Pro variant — default)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)
```

💻 Installation

Prerequisites

  • Python 3.9+
  • CUDA 11.8+ (GPU required)
  • 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

Install Dependencies

```bash
pip install vllm pillow torch transformers
```

Verify Installation

```python
from vllm import LLM
print("vLLM installed successfully!")
```

📖 Usage Examples

Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```

Structured JSON Extraction

```python
prompt = """Extract the following fields from this invoice as JSON:
{
    "invoice_number": "",
    "date": "",
    "vendor_name": "",
    "total_amount": "",
    "line_items": [
        {"description": "", "quantity": "", "unit_price": "", "amount": ""}
    ]
}"""
```
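
The model returns plain text, so the JSON may arrive wrapped in code fences or surrounded by stray prose. A defensive parsing sketch (the helper name is ours, not part of the model API):

```python
import json
import re


def parse_invoice_json(raw: str) -> dict:
    """Extract the first JSON object from model output, tolerating code fences."""
    # Strip Markdown code fences if the model wrapped its answer in them.
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Slice from the first '{' to the last '}' in case extra prose surrounds the JSON.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(cleaned[start:end + 1])
```

Running the result through `parse_invoice_json` before downstream processing avoids hard failures on lightly decorated output.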

Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```

Key-Value Pair Extraction

```python
prompt = """Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value
..."""
```
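
The `Key: Value` lines that come back can be folded into a dict with a few lines of post-processing (a sketch; the helper name is ours):

```python
def parse_key_values(raw: str) -> dict:
    """Parse 'Key: Value' lines from model output into a dict."""
    pairs = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue  # skip blank lines and any surrounding prose
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs
```

`str.partition` splits on the first colon only, so values containing colons (times, URLs) survive intact.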

Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```

Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```

🏢 Use Cases

| Domain | Application | Recommended Variant |
|---|---|---|
| Finance | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| Legal | Contract analysis, clause extraction, legal filings | Ultra |
| Healthcare | Medical records, prescriptions, lab reports | Ultra |
| Government | Form processing, ID verification, tax documents | Pro |
| Insurance | Claims processing, policy documents | Pro |
| Education | Exam paper digitization, handwritten notes | Pro / Ultra |
| Logistics | Shipping labels, waybills, packing lists | Nano |
| Real Estate | Property documents, deeds, mortgage papers | Pro |
| Retail | Product catalogs, price tags, inventory lists | Nano |

📈 Performance Benchmarks

Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---|---|---|---|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |
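
As a sanity check, the batch columns above imply a per-document throughput. A quick calculation using the table's approximate Nano figures (the helper is illustrative):

```python
def docs_per_minute(batch_size: int, batch_seconds: float) -> float:
    """Convert a batch latency from the table into documents per minute."""
    return batch_size / batch_seconds * 60


# Nano, batch of 128 at ~55s -> roughly 140 docs/min, consistent with
# the 100-150 docs/min figure quoted for Nano in the variants table.
nano_rate = docs_per_minute(128, 55)
```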

Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---|---|---|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

Note: Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.


🧠 Prompt Engineering Guide

Get the best results from GT-REX-v4 with these prompt strategies:

Do's

  • Be specific about what to extract ("Extract the invoice number and total amount")
  • Specify output format ("Return as JSON", "Return as Markdown table")
  • Provide schema for structured extraction (show the expected JSON keys)
  • Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

Don'ts

  • Avoid vague prompts ("What is this?")
  • Don't ask for analysis or summarization — GT-REX is optimized for extraction
  • Don't include unrelated context in the prompt

Example Prompts

```python
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```
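
For the schema-driven prompt above, it helps to validate the returned fields and coerce them to the expected types before use. A minimal sketch, where the schema dict and helper name are ours:

```python
import json

# Mirrors the example schema from the prompt: {name: str, date: str, amount: float}
SCHEMA = {"name": str, "date": str, "amount": float}


def coerce_to_schema(raw: str, schema: dict = SCHEMA) -> dict:
    """Parse model JSON output and coerce each field to its schema type."""
    data = json.loads(raw)
    out = {}
    for field, typ in schema.items():
        if field not in data:
            raise KeyError(f"missing field: {field}")
        out[field] = typ(data[field])  # e.g. "19.99" -> 19.99 for float fields
    return out
```

This catches both missing fields and amounts that came back as strings, which are the two most common failure modes for schema prompts.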

🔌 API Integration

FastAPI Server Example

```python
from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams
import io

app = FastAPI()

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )

    return {"text": outputs[0].outputs[0].text}
```
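
On the client side, the endpoint expects `prompt` as a query parameter and the image as a multipart `file` field. A stdlib-only sketch that builds such a request (the helper name and boundary string are ours; with `requests` installed, `requests.post(url, files=..., params=...)` is the shorter equivalent):

```python
import io
import urllib.parse
import urllib.request


def build_extract_request(base_url: str, image_bytes: bytes,
                          prompt: str = "Extract all text.",
                          boundary: str = "gtrex-form-boundary"):
    """Build a multipart POST request for the /extract endpoint sketched above."""
    # FastAPI reads the non-file default-valued parameter from the query string.
    url = base_url + "?" + urllib.parse.urlencode({"prompt": prompt})
    body = io.BytesIO()
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="file"; filename="doc.png"\r\n')
    body.write(b"Content-Type: image/png\r\n\r\n")
    body.write(image_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url,
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Send the request with `urllib.request.urlopen(req)` against a running server and read the JSON body from the response.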

🛠️ Troubleshooting

| Issue | Solution |
|---|---|
| CUDA Out of Memory | Reduce `gpu_memory_utilization` or switch to the Nano variant |
| Slow inference | Increase `max_num_seqs` for better batching; use Nano for speed |
| Truncated output | Increase `max_tokens` in `SamplingParams` |
| Low accuracy on small text | Switch to the Ultra variant for higher resolution |
| Garbled multilingual text | Ensure image resolution is sufficient; try the Ultra variant |
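
The two out-of-memory remedies above can be automated as a fallback ladder: try the Pro settings first, then retry with cheaper ones. A sketch under the assumption (true in practice for vLLM) that a CUDA OOM during engine startup surfaces as a `RuntimeError`; the `FALLBACKS` list and helper name are ours:

```python
# Progressively cheaper configurations, taken from the variant tables in this card.
FALLBACKS = [
    {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},  # Pro
    {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 64},   # Nano settings, smaller batch
]


def load_with_fallback(llm_factory, fallbacks=FALLBACKS):
    """Try each configuration until one loads without running out of memory.

    llm_factory is any callable that builds the engine from keyword arguments,
    e.g. lambda **kw: LLM(model="developerJenis/GT-REX-v4",
                          trust_remote_code=True, **kw).
    """
    last_error = None
    for settings in fallbacks:
        try:
            return llm_factory(**settings)
        except RuntimeError as err:  # CUDA OOM is raised as a RuntimeError
            last_error = err
    raise last_error
```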

🔧 Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---|---|---|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

📜 License

This model is released under the MIT License. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.


📖 Citation

If you use GT-REX-v4 in your work, please cite:

```bibtex
@misc{gtrex-v4-2026,
  title   = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
  author  = {Hathaliya, Jenis},
  year    = {2026},
  month   = {February},
  url     = {https://huggingface.co/developerJenis/GT-REX-v4},
  note    = {GothiTech Recognition \& Extraction eXpert, Version 4}
}
```

🤝 Contact & Support

  • Developer: Jenis Hathaliya
  • Organization: GothiTech
  • HuggingFace: developerJenis

Built with ❤️ by GothiTech

Last updated: February 2026
Model Version: v4.0 | Variants: Nano | Pro | Ultra
