---
license: mit
language:
- en
- multilingual
tags:
- ocr
- vision-language
- document-understanding
- gothitech
- document-ai
- text-extraction
- invoice-processing
- production
- handwriting-recognition
- table-extraction
pipeline_tag: image-text-to-text
model-index:
- name: GT-REX-v4
  results: []
---

# GT-REX-v4: Production OCR Model

🦖 GothiTech Recognition & Extraction eXpert, Version 4


---

**GT-REX-v4** is a production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.

---

## 📑 Table of Contents

- [GT-REX Variants](#-gt-rex-variants)
- [Key Features](#-key-features)
- [Model Details](#-model-details)
- [Quick Start](#-quick-start)
- [Installation](#-installation)
- [Usage Examples](#-usage-examples)
- [Use Cases](#-use-cases)
- [Performance Benchmarks](#-performance-benchmarks)
- [Prompt Engineering Guide](#-prompt-engineering-guide)
- [API Integration](#-api-integration)
- [Troubleshooting](#-troubleshooting)
- [License](#-license)
- [Citation](#-citation)

---

## ⚙️ GT-REX Variants

GT-REX-v4 ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights; they differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| **🚀 Nano** | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640 px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
| **⚡ Pro** *(Default)* | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024 px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
| **🎯 Ultra** | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536 px | 10–15 GB | 20–30 docs/min | High-accuracy, fine-detail needs |

### How to Choose a Variant

- **Nano** → You need maximum throughput and the documents are simple (receipts, IDs, labels).
- **Pro** → General-purpose; the best balance for invoices, contracts, forms, and reports.
- **Ultra** → The documents contain fine print, dense tables, medical records, or legal footnotes.
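Because the variants share weights and differ only in inference settings, switching between them can be reduced to a keyword-argument lookup. Below is a minimal sketch of that idea; the `VARIANT_KWARGS` table and the `load_variant` helper are our own illustrative naming, not part of the model's API, and the values are taken from the per-variant configuration examples in this card:

```python
# Inference settings for each GT-REX-v4 variant, copied from the
# per-variant vLLM examples in this card. The dict and the helper
# below are illustrative conveniences, not part of the model API.
VARIANT_KWARGS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}

def load_variant(name: str = "pro"):
    """Build a vLLM engine configured for the chosen variant."""
    from vllm import LLM  # imported lazily; only needed when actually loading

    return LLM(
        model="developerJenis/GT-REX-v4",
        trust_remote_code=True,
        limit_mm_per_prompt={"image": 1},
        **VARIANT_KWARGS[name],
    )
```

Switching variants is then `load_variant("nano")` versus `load_variant("ultra")`, with no other code changes.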
---

### 🚀 GT-REX-Nano

**Speed-optimized for high-volume batch processing**

| Setting | Value |
|---------|-------|
| Resolution | 640 × 640 px |
| Speed | ~1–2 s per image |
| Max Tokens | 2048 |
| GPU Memory | 4–6 GB |
| Recommended Batch Size | 256 sequences |

**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```

---

### ⚡ GT-REX-Pro (Default)

**Balanced quality and speed for standard enterprise documents**

| Setting | Value |
|---------|-------|
| Resolution | 1024 × 1024 px |
| Speed | ~2–5 s per image |
| Max Tokens | 4096 |
| GPU Memory | 6–10 GB |
| Recommended Batch Size | 128 sequences |

**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```

---

### 🎯 GT-REX-Ultra

**Maximum quality with adaptive processing for complex documents**

| Setting | Value |
|---------|-------|
| Resolution | 1536 × 1536 px |
| Speed | ~5–10 s per image |
| Max Tokens | 8192 |
| GPU Memory | 10–15 GB |
| Recommended Batch Size | 64 sequences |

**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```

---

## 🎯 Key Features

| Feature | Description |
|---------|-------------|
| **High Accuracy** | Advanced vision-language architecture for precise text extraction |
| **Multi-Language** | Handles documents in English and multiple other languages |
| **Production Ready** | Optimized for deployment with the vLLM inference engine |
| **Batch Processing** | Process hundreds of documents per minute (Nano variant) |
| **Flexible Prompts** | Supports structured extraction: JSON, tables, key-value pairs, forms |
| **Handwriting Support** | Transcribes handwritten text with high fidelity |
| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
| **Structured Output** | Extract data directly into JSON, Markdown tables, or custom schemas |

---

## 📊 Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | GothiTech (Jenis Hathaliya) |
| **Architecture** | Vision-Language Model (VLM) |
| **Model Size** | ~6.5 GB |
| **Parameters** | ~7B |
| **License** | MIT |
| **Release Date** | February 2026 |
| **Precision** | BF16 / FP16 |
| **Input Resolution** | 640–1536 px (variant dependent) |
| **Max Sequence Length** | 2048–8192 tokens (variant dependent) |
| **Inference Engine** | vLLM (recommended) |
| **Framework** | PyTorch / Transformers |

---

## 🚀 Quick Start

Get running in under 5 minutes:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load the model (Pro variant, the default)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare the input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get the results
result = outputs[0].outputs[0].text
print(result)
```

---

## 💻 Installation

### Prerequisites

- Python 3.9+
- CUDA 11.8+ (GPU required)
- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

### Install Dependencies

```bash
pip install vllm pillow torch transformers
```

### Verify Installation

```python
from vllm import LLM
print("vLLM installed successfully!")
```

---

## 📖 Usage Examples

### Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```

### Structured JSON Extraction

```python
prompt = """Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}"""
```

### Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```

### Key-Value Pair Extraction

```python
prompt = """Extract all key-value pairs from this form. Return as:
Key: Value
Key: Value
..."""
```

### Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```

### Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare the batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect the results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```

---

## 🏢 Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
| **Government** | Form processing, ID verification, tax documents | Pro |
| **Insurance** | Claims processing, policy documents | Pro |
| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
| **Logistics** | Shipping labels, waybills, packing lists | Nano |
| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
| **Retail** | Product catalogs, price tags, inventory lists | Nano |

---

## 📈 Performance Benchmarks

### Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---------|--------------|------------|-------------|
| Nano | ~1.2 s | ~15 s | ~55 s |
| Pro | ~3.5 s | ~45 s | ~170 s |
| Ultra | ~7.0 s | ~110 s | ~380 s |

### Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

> **Note:** Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.

---

## 🧠 Prompt Engineering Guide

Get the best results from GT-REX-v4 with these prompt strategies:

### Do's

- **Be specific** about what to extract ("Extract the invoice number and total amount").
- **Specify the output format** ("Return as JSON", "Return as a Markdown table").
- **Provide a schema** for structured extraction (show the expected JSON keys).
- **Use clear instructions** ("Transcribe exactly as written, preserving spelling errors").

### Don'ts

- Avoid vague prompts ("What is this?").
- Don't ask for analysis or summarization; GT-REX is optimized for **extraction**.
- Don't include unrelated context in the prompt.

### Example Prompts

```text
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```

---

## 🔌 API Integration

### FastAPI Server Example

```python
import io

from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams

app = FastAPI()

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)

@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )
    return {"text": outputs[0].outputs[0].text}
```

---

## 🛠️ Troubleshooting

| Issue | Solution |
|-------|----------|
| **CUDA out of memory** | Reduce `gpu_memory_utilization` or switch to the Nano variant |
| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
| **Truncated output** | Increase `max_tokens` in `SamplingParams` |
| **Low accuracy on small text** | Switch to the Ultra variant for higher resolution |
| **Garbled multilingual text** | Ensure the image resolution is sufficient; try the Ultra variant |

---

## 🔧 Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

---

## 📜 License

This model is released under the **MIT License**. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.
---

## 📖 Citation

If you use GT-REX-v4 in your work, please cite:

```bibtex
@misc{gtrex-v4-2026,
  title  = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/developerJenis/GT-REX-v4},
  note   = {GothiTech Recognition \& Extraction eXpert, Version 4}
}
```

---

## 🤝 Contact & Support

- **Developer:** Jenis Hathaliya
- **Organization:** GothiTech
- **HuggingFace:** [developerJenis](https://huggingface.co/developerJenis)

---

Built with โค๏ธ by GothiTech

Last updated: February 2026
Model Version: v4.0 | Variants: Nano | Pro | Ultra