---
license: mit
language:
- en
- multilingual
tags:
- ocr
- vision-language
- document-understanding
- gothitech
- document-ai
- text-extraction
- invoice-processing
- production
- handwriting-recognition
- table-extraction
pipeline_tag: image-text-to-text
---

# GT-REX: Production OCR Model
GothiTech Recognition and Extraction eXpert
---

**GT-REX** is a state-of-the-art, production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.

---

## Table of Contents

- [GT-REX Variants](#gt-rex-variants)
- [Key Features](#key-features)
- [Model Details](#model-details)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Usage Examples](#usage-examples)
- [Use Cases](#use-cases)
- [Performance Benchmarks](#performance-benchmarks)
- [Prompt Engineering Guide](#prompt-engineering-guide)
- [API Integration](#api-integration)
- [Troubleshooting](#troubleshooting)
- [Hardware Recommendations](#hardware-recommendations)
- [License](#license)
- [Citation](#citation)

---

## GT-REX Variants

GT-REX ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights and differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| **Nano** | Ultra fast | Good | 640 px | 4-6 GB | 100-150 docs/min | High-volume batch processing |
| **Pro** (default) | Fast | High | 1024 px | 6-10 GB | 50-80 docs/min | Standard enterprise workflows |
| **Ultra** | Moderate | Maximum | 1536 px | 10-15 GB | 20-30 docs/min | High-accuracy and fine-detail needs |

### How to Choose a Variant

- **Nano**: you need maximum throughput and the documents are simple (receipts, IDs, labels).
- **Pro**: general-purpose; the best balance for invoices, contracts, forms, and reports.
- **Ultra**: documents contain fine print, dense tables, medical records, or legal footnotes.
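The selection guide above can be sketched as a small lookup that maps a variant name to the vLLM settings used throughout this card. The helper and its name (`llm_kwargs`) are our own illustration, not part of any published GT-REX package; the numbers come from the variant tables below.

```python
# Hypothetical helper (not shipped with GT-REX): map a variant name to the
# vLLM keyword arguments this card recommends for it.
VARIANT_CONFIGS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}

def llm_kwargs(variant: str = "pro") -> dict:
    """Return the LLM(...) keyword arguments recommended for a variant."""
    return {
        "model": "gothitech/GT-REX",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **VARIANT_CONFIGS[variant.lower()],
    }
```

With this in place, `LLM(**llm_kwargs("nano"))` would build the speed-optimized engine with the same arguments shown in the per-variant snippets below.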
---

### GT-REX-Nano

**Speed-optimized for high-volume batch processing**

| Setting | Value |
|---------|-------|
| Resolution | 640 x 640 px |
| Speed | ~1-2 s per image |
| Max Tokens | 2048 |
| GPU Memory | 4-6 GB |
| Recommended Batch Size | 256 sequences |

**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```

---

### GT-REX-Pro (Default)

**Balanced quality and speed for standard enterprise documents**

| Setting | Value |
|---------|-------|
| Resolution | 1024 x 1024 px |
| Speed | ~2-5 s per image |
| Max Tokens | 4096 |
| GPU Memory | 6-10 GB |
| Recommended Batch Size | 128 sequences |

**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```

---

### GT-REX-Ultra

**Maximum quality with adaptive processing for complex documents**

| Setting | Value |
|---------|-------|
| Resolution | 1536 x 1536 px |
| Speed | ~5-10 s per image |
| Max Tokens | 8192 |
| GPU Memory | 10-15 GB |
| Recommended Batch Size | 64 sequences |

**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
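Since each variant has a fixed input resolution, very large scans can be downscaled before submission. The snippet below is our own preprocessing suggestion using Pillow (the helper name `fit_to_resolution` is ours); the model card itself only states the per-variant resolutions and does not document a required preprocessing step.

```python
from PIL import Image

# Our own preprocessing sketch (not from the GT-REX docs): downscale large
# scans to a variant's input resolution while preserving aspect ratio.
def fit_to_resolution(image: Image.Image, max_side: int = 1536) -> Image.Image:
    width, height = image.size
    scale = max_side / max(width, height)
    if scale >= 1.0:
        # Already within bounds; upscaling adds no information, so skip it.
        return image
    new_size = (round(width * scale), round(height * scale))
    return image.resize(new_size, Image.LANCZOS)
```

The resulting image can be passed in `multi_modal_data` exactly as in the Quick Start example.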
```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```

---

## Key Features

| Feature | Description |
|---------|-------------|
| **High Accuracy** | Advanced vision-language architecture for precise text extraction |
| **Multi-Language** | Handles documents in English and multiple other languages |
| **Production Ready** | Optimized for deployment with the vLLM inference engine |
| **Batch Processing** | Process hundreds of documents per minute (Nano variant) |
| **Flexible Prompts** | Supports structured extraction: JSON, tables, key-value pairs, forms |
| **Handwriting Support** | Transcribes handwritten text with high fidelity |
| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
| **Structured Output** | Extract data directly into JSON, Markdown tables, or custom schemas |

---

## Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | GothiTech (Jenis Hathaliya) |
| **Architecture** | Vision-Language Model (VLM) |
| **Model Size** | ~6.5 GB |
| **Parameters** | ~7B |
| **License** | MIT |
| **Release Date** | February 2026 |
| **Precision** | BF16 / FP16 |
| **Input Resolution** | 640-1536 px (variant dependent) |
| **Max Sequence Length** | 2048-8192 tokens (variant dependent) |
| **Inference Engine** | vLLM (recommended) |
| **Framework** | PyTorch / Transformers |

---

## Quick Start

Get running in under 5 minutes:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load the model (Pro variant, the default)
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare the input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get the results
result = outputs[0].outputs[0].text
print(result)
```

---

## Installation

### Prerequisites

- Python 3.9+
- CUDA 11.8+ (GPU required)
- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

### Install Dependencies

```bash
pip install vllm pillow torch transformers
```

### Verify Installation

```python
from vllm import LLM
print("vLLM installed successfully!")
```

---

## Usage Examples

### Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```

### Structured JSON Extraction

```python
prompt = '''Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}'''
```

### Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```

### Key-Value Pair Extraction

```python
prompt = '''Extract all key-value pairs from this form. Return as:
Key: Value
Key: Value'''
```

### Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```

### Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare the batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect the results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```

---

## Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
| **Government** | Form processing, ID verification, tax documents | Pro |
| **Insurance** | Claims processing, policy documents | Pro |
| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
| **Logistics** | Shipping labels, waybills, packing lists | Nano |
| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
| **Retail** | Product catalogs, price tags, inventory lists | Nano |

---

## Performance Benchmarks

### Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---------|--------------|------------|-------------|
| Nano | ~1.2 s | ~15 s | ~55 s |
| Pro | ~3.5 s | ~45 s | ~170 s |
| Ultra | ~7.0 s | ~110 s | ~380 s |

### Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

> **Note:** Benchmark numbers are approximate and may vary with document quality, content complexity, and hardware configuration.

---

## Prompt Engineering Guide

Get the best results from GT-REX with these prompt strategies.

### Tips for Best Results

**Do:**

- Be specific about what to extract ("Extract the invoice number and total amount")
- Specify the output format ("Return as JSON", "Return as a Markdown table")
- Provide a schema for structured extraction (show the expected JSON keys)
- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

**Don't:**

- Use vague prompts ("What is this?")
- Ask for analysis or summarization (GT-REX is optimized for extraction)
- Include unrelated context in the prompt

### Example Prompts

```text
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```

---

## API Integration

### FastAPI Server Example

```python
from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams
import io

app = FastAPI()

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)

@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )
    return {"text": outputs[0].outputs[0].text}
```

### cURL Example

```bash
curl -X POST "http://localhost:8000/extract" \
  -F "file=@invoice.png" \
  -F "prompt=Extract all text from this invoice as JSON."
```

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| **CUDA out of memory** | Reduce `gpu_memory_utilization` or switch to the Nano variant |
| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
| **Truncated output** | Increase `max_tokens` in `SamplingParams` |
| **Low accuracy on small text** | Switch to the Ultra variant for higher resolution |
| **Garbled multilingual text** | Ensure the image resolution is sufficient; try the Ultra variant |
| **Empty output** | Check that the image is loaded correctly and is not blank |
| **Model loading errors** | Ensure `trust_remote_code=True` is set |

---

## Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

---

## License

This model is released under the **MIT License**.
You are free to use, modify, and distribute it for both commercial and non-commercial purposes.

---

## Citation

If you use GT-REX in your work, please cite:

```bibtex
@misc{gtrex-2026,
  title  = {GT-REX: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/gothitech/GT-REX},
  note   = {GothiTech Recognition and Extraction eXpert}
}
```

---

## Contact and Support

- **Developer:** Jenis Hathaliya
- **Organization:** GothiTech
- **HuggingFace:** [gothitech](https://huggingface.co/gothitech)

---

Built by GothiTech
Last updated: February 2026
GT-REX | Variants: Nano | Pro | Ultra