---
license: mit
language:
- en
- multilingual
tags:
- ocr
- vision-language
- document-understanding
- gothitech
- document-ai
- text-extraction
- invoice-processing
- production
- handwriting-recognition
- table-extraction
pipeline_tag: image-text-to-text
model-index:
- name: GT-REX-v4
  results: []
---
# GT-REX-v4: Production OCR Model
<p align="center">
<strong>🦖 GothiTech Recognition & Extraction eXpert — Version 4</strong>
</p>
<p align="center">
<a href="https://huggingface.co/developerJenis/GT-REX-v4"><img src="https://img.shields.io/badge/🤗_Model-GT--REX--v4-blue" alt="Model"></a>
<a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
<a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a>
<a href="#"><img src="https://img.shields.io/badge/Params-~7B-red" alt="Parameters"></a>
</p>
---
**GT-REX-v4** is a production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.
---
## 📑 Table of Contents
- [GT-REX Variants](#-gt-rex-variants)
- [Key Features](#-key-features)
- [Model Details](#-model-details)
- [Quick Start](#-quick-start)
- [Installation](#-installation)
- [Usage Examples](#-usage-examples)
- [Use Cases](#-use-cases)
- [Performance Benchmarks](#-performance-benchmarks)
- [Prompt Engineering Guide](#-prompt-engineering-guide)
- [API Integration](#-api-integration)
- [Troubleshooting](#-troubleshooting)
- [Hardware Recommendations](#-hardware-recommendations)
- [License](#-license)
- [Citation](#-citation)
- [Contact & Support](#-contact--support)
---
## ⚙️ GT-REX Variants
GT-REX-v4 ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.
| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| **🚀 Nano** | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
| **⚡ Pro** *(Default)* | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
| **🎯 Ultra** | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |
### How to Choose a Variant
- **Nano** → You need maximum throughput and documents are simple (receipts, IDs, labels).
- **Pro** → General-purpose. Best balance for invoices, contracts, forms, and reports.
- **Ultra** → Documents have fine print, dense tables, medical records, or legal footnotes.
---
### 🚀 GT-REX-Nano
**Speed-optimized for high-volume batch processing**
| Setting | Value |
|---------|-------|
| Resolution | 640 × 640 px |
| Speed | ~1–2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4–6 GB |
| Recommended Batch Size | 256 sequences |
**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```
---
### ⚡ GT-REX-Pro (Default)
**Balanced quality and speed for standard enterprise documents**
| Setting | Value |
|---------|-------|
| Resolution | 1024 × 1024 px |
| Speed | ~2–5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6–10 GB |
| Recommended Batch Size | 128 sequences |
**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```
---
### 🎯 GT-REX-Ultra
**Maximum quality with adaptive processing for complex documents**
| Setting | Value |
|---------|-------|
| Resolution | 1536 × 1536 px |
| Speed | ~5–10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10–15 GB |
| Recommended Batch Size | 64 sequences |
**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```
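
Because the three variants differ only in inference settings, it can be convenient to keep those settings in one place. The sketch below collects the values from the Nano, Pro, and Ultra sections; `VARIANT_CONFIGS` and `engine_kwargs` are hypothetical names introduced for illustration.

```python
# Settings copied from the Nano / Pro / Ultra sections of this card.
VARIANT_CONFIGS = {
    "nano": {"max_model_len": 2048, "gpu_memory_utilization": 0.6, "max_num_seqs": 256},
    "pro": {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}


def engine_kwargs(variant: str = "pro") -> dict:
    """Return keyword arguments for vllm.LLM for the given variant name."""
    try:
        settings = VARIANT_CONFIGS[variant.lower()]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None
    return {
        "model": "developerJenis/GT-REX-v4",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **settings,
    }
```

With vLLM installed, the engine would then be created as `LLM(**engine_kwargs("ultra"))`.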
---
## 🎯 Key Features
| Feature | Description |
|---------|-------------|
| **High Accuracy** | Advanced vision-language architecture for precise text extraction |
| **Multi-Language** | Handles documents in English and multiple other languages |
| **Production Ready** | Optimized for deployment with the vLLM inference engine |
| **Batch Processing** | Process roughly 100–150 documents per minute with the Nano variant |
| **Flexible Prompts** | Supports structured extraction — JSON, tables, key-value pairs, forms |
| **Handwriting Support** | Transcribes handwritten text with high fidelity |
| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
| **Structured Output** | Extract data directly into JSON, Markdown tables, or custom schemas |
---
## 📊 Model Details
| Attribute | Value |
|-----------|-------|
| **Developer** | GothiTech (Jenis Hathaliya) |
| **Architecture** | Vision-Language Model (VLM) |
| **Model Size** | ~6.5 GB |
| **Parameters** | ~7B |
| **License** | MIT |
| **Release Date** | February 2026 |
| **Precision** | BF16 / FP16 |
| **Input Resolution** | 640px – 1536px (variant dependent) |
| **Max Sequence Length** | 2048 – 8192 tokens (variant dependent) |
| **Inference Engine** | vLLM (recommended) |
| **Framework** | PyTorch / Transformers |
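
The table lists an input resolution per variant, but the loading snippets in this card do not show where that resolution is applied. One possible approach, and it is only an assumption, is to downscale each image before inference so its longer side matches the variant's resolution. The helper `fit_within` below is a hypothetical name and computes only the target size.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Aspect ratio is preserved; sizes that already fit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and never return a zero-sized dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result could be passed to `image.resize(fit_within(*image.size, 1024))` for the Pro variant. Whether the model's own processor already resizes inputs is not stated in this card.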
---
## 🚀 Quick Start
Get running in under 5 minutes:
```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load model (Pro variant, the default)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)
```
---
## 💻 Installation
### Prerequisites
- Python 3.9+
- CUDA 11.8+ (GPU required)
- VRAM: 4 GB+ (Nano), 8 GB+ (Pro), 12 GB+ (Ultra)
### Install Dependencies
```bash
pip install vllm pillow torch transformers
```
### Verify Installation
```python
from vllm import LLM
print("vLLM installed successfully!")
```
---
## 📖 Usage Examples
### Basic Text Extraction
```python
prompt = "Extract all text from this document image."
```
### Structured JSON Extraction
```python
prompt = """Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}"""
```
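
The model returns plain text, so a JSON response still has to be parsed. The sketch below assumes the output may wrap the object in a Markdown code fence or add a sentence around it; `parse_json_output` is a hypothetical helper, not part of GT-REX.

```python
import json


def parse_json_output(text: str) -> dict:
    """Parse the first JSON object found in model output.

    Decoding starts at the first '{', so leading prose or a code fence is
    skipped. Raises ValueError if no object can be decoded.
    """
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(text[start:])
    return obj
```

This will fail if the surrounding prose itself contains a `{` before the object, so it is best paired with a prompt that asks for JSON only.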
### Table Extraction (Markdown Format)
```python
prompt = "Extract all tables from this document in Markdown table format."
```
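
A Markdown table returned by this prompt can be turned into row dictionaries for downstream code. This is a minimal sketch that assumes every table row starts and ends with `|` and that cells contain no escaped pipes; `parse_markdown_table` is a hypothetical helper name.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Convert the first pipe-delimited Markdown table in text to row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]
```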
### Key-Value Pair Extraction
```python
prompt = """Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value
..."""
```
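
Output in the `Key: Value` layout requested above can be loaded into a dictionary. The sketch below splits each line on its first colon, so values that contain colons (times, URLs) stay intact; `parse_key_values` is a hypothetical helper name.

```python
def parse_key_values(text: str) -> dict[str, str]:
    """Parse 'Key: Value' lines into a dict, splitting on the first colon."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip lines without a colon or with an empty key.
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs
```

Duplicate keys overwrite earlier ones in this version; collect values in a list instead if a form can repeat a field.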
### Handwritten Text Transcription
```python
prompt = "Transcribe all handwritten text from this image accurately."
```
### Multi-Document Batch Processing
```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```
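
The example above opens every image before calling `generate`, which is fine for three files but can exhaust host memory for a large folder. One way to bound memory, sketched here under that assumption, is to split the path list into fixed-size chunks and run the loop once per chunk. `chunked` is a hypothetical helper name.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice returns an empty list once the iterator is exhausted.
    while chunk := list(islice(iterator, size)):
        yield chunk
```

A chunk size equal to `max_num_seqs` (128 for Pro) is a reasonable starting point, though the best value depends on image size and available RAM.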
---
## 🏢 Use Cases
| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
| **Government** | Form processing, ID verification, tax documents | Pro |
| **Insurance** | Claims processing, policy documents | Pro |
| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
| **Logistics** | Shipping labels, waybills, packing lists | Nano |
| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
| **Retail** | Product catalogs, price tags, inventory lists | Nano |
---
## 📈 Performance Benchmarks
### Processing Time by Variant (NVIDIA A100 80GB)
| Variant | Single Image | Batch of 32 (total) | Batch of 128 (total) |
|---------|--------------|---------------------|----------------------|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |
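
The batch timings convert to documents per minute with simple arithmetic. The snippet below uses the batch-of-128 figures from the table above; `docs_per_minute` is a hypothetical helper name.

```python
# Batch-of-128 wall-clock times in seconds, taken from the table above.
BATCH_128_SECONDS = {"Nano": 55, "Pro": 170, "Ultra": 380}


def docs_per_minute(batch_size: int, seconds: float) -> float:
    """Convert a batch size and its total processing time to docs/min."""
    return batch_size / seconds * 60


for variant, seconds in BATCH_128_SECONDS.items():
    print(f"{variant}: ~{docs_per_minute(128, seconds):.0f} docs/min")
```

This works out to roughly 140, 45, and 20 documents per minute for Nano, Pro, and Ultra. The Pro figure sits slightly below the 50–80 docs/min range quoted in the variants table, which is within what the approximate timings would allow.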
### Accuracy by Document Type (Pro Variant)
| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |
> **Note:** Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.
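
This card does not state how character accuracy is computed. A common definition, assumed here, is one minus the character error rate (CER), where CER is the edit distance between the model output and a reference transcript divided by the reference length. The sketch below can be used to measure accuracy on your own documents; both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped to [0, 1]; the reference must be non-empty."""
    if not reference:
        raise ValueError("reference text must not be empty")
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```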
---
## 🧠 Prompt Engineering Guide
Get the best results from GT-REX-v4 with these prompt strategies:
### Do's
- **Be specific** about what to extract ("Extract the invoice number and total amount")
- **Specify output format** ("Return as JSON", "Return as Markdown table")
- **Provide schema** for structured extraction (show the expected JSON keys)
- **Use clear instructions** ("Transcribe exactly as written, preserving spelling errors")
### Don'ts
- Avoid vague prompts ("What is this?")
- Don't ask for analysis or summarization — GT-REX is optimized for **extraction**
- Don't include unrelated context in the prompt
### Example Prompts
```text
# Simple extraction
"Extract all text from this document."
# Targeted extraction
"Extract only the table on this page as a Markdown table."
# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"
# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```
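
The "provide schema" advice can be automated so that prompts stay consistent across document types. A minimal sketch, assuming the empty-template style used in the invoice example of this card; `build_schema_prompt` is a hypothetical helper name.

```python
import json


def build_schema_prompt(schema: dict, document_kind: str = "document") -> str:
    """Build an extraction prompt that embeds an empty JSON template."""
    template = json.dumps(schema, indent=2)
    return (
        f"Extract the following fields from this {document_kind} as JSON. "
        "Return only the JSON object.\n"
        f"{template}"
    )
```

For example, `build_schema_prompt({"invoice_number": "", "total_amount": ""}, "invoice")` yields a prompt that names the document type and shows the expected keys.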
---
## 🔌 API Integration
### FastAPI Server Example
```python
import io

from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams

# File uploads in FastAPI also require the `python-multipart` package.
app = FastAPI()

# Load the model once at startup (Pro settings).
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    # `prompt` is a query parameter; the image arrives as multipart form data.
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Note: llm.generate is a blocking call and holds the event loop while it runs.
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )
    return {"text": outputs[0].outputs[0].text}
```
---
## 🛠️ Troubleshooting
| Issue | Solution |
|-------|----------|
| **CUDA Out of Memory** | Lower `max_num_seqs` or `max_model_len`, reduce `gpu_memory_utilization` to leave headroom for other processes, or switch to the Nano settings |
| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
| **Truncated output** | Increase `max_tokens` in `SamplingParams` |
| **Low accuracy on small text** | Switch to Ultra variant for higher resolution |
| **Garbled multilingual text** | Ensure image resolution is sufficient; try Ultra variant |
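
Truncated output can be detected in code before raising `max_tokens`. In vLLM, each completion carries a `finish_reason` that is `"length"` when generation stopped at the token limit. The sketch below checks that field and proposes a larger budget for a retry; both helper names are hypothetical, and the 8192 ceiling is taken from the Ultra settings.

```python
from types import SimpleNamespace


def is_truncated(completion) -> bool:
    """True when generation stopped because it hit the max_tokens limit."""
    return getattr(completion, "finish_reason", None) == "length"


def next_max_tokens(current: int, ceiling: int = 8192) -> int:
    """Double the token budget for a retry, capped at `ceiling`."""
    return min(current * 2, ceiling)


# Stand-in for outputs[0].outputs[0] from llm.generate, for illustration only.
completion = SimpleNamespace(text="partial text", finish_reason="length")
if is_truncated(completion):
    print(f"Retry with max_tokens={next_max_tokens(4096)}")
```

Prompt and output tokens share the `max_model_len` budget, so a larger `max_tokens` may also require loading the engine with a larger `max_model_len`.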
---
## 🔧 Hardware Recommendations
| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |
---
## 📜 License
This model is released under the **MIT License**. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.
---
## 📖 Citation
If you use GT-REX-v4 in your work, please cite:
```bibtex
@misc{gtrex-v4-2026,
title = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
author = {Hathaliya, Jenis},
year = {2026},
month = {February},
url = {https://huggingface.co/developerJenis/GT-REX-v4},
note = {GothiTech Recognition \& Extraction eXpert, Version 4}
}
```
---
## 🤝 Contact & Support
- **Developer:** Jenis Hathaliya
- **Organization:** GothiTech
- **HuggingFace:** [developerJenis](https://huggingface.co/developerJenis)
---
<p align="center">
Built with ❤️ by <strong>GothiTech</strong>
</p>
<p align="center">
<em>Last updated: February 2026</em><br>
<em>Model Version: v4.0 | Variants: Nano | Pro | Ultra</em>
</p>