---
license: mit
language:
- en
- multilingual
tags:
- ocr
- vision-language
- document-understanding
- gothitech
- document-ai
- text-extraction
- invoice-processing
- production
- handwriting-recognition
- table-extraction
pipeline_tag: image-text-to-text
---

# GT-REX: Production OCR Model
GothiTech Recognition and Extraction eXpert
---

**GT-REX** is a state-of-the-art, production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.

---

## Table of Contents

- [GT-REX Variants](#gt-rex-variants)
- [Key Features](#key-features)
- [Model Details](#model-details)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Usage Examples](#usage-examples)
- [Use Cases](#use-cases)
- [Performance Benchmarks](#performance-benchmarks)
- [Prompt Engineering Guide](#prompt-engineering-guide)
- [API Integration](#api-integration)
- [Troubleshooting](#troubleshooting)
- [Hardware Recommendations](#hardware-recommendations)
- [License](#license)
- [Citation](#citation)

---

## GT-REX Variants

GT-REX ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights and differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| **Nano** | Ultra fast | Good | 640 px | 4-6 GB | 100-150 docs/min | High-volume batch processing |
| **Pro** (default) | Fast | High | 1024 px | 6-10 GB | 50-80 docs/min | Standard enterprise workflows |
| **Ultra** | Moderate | Maximum | 1536 px | 10-15 GB | 20-30 docs/min | High-accuracy and fine-detail needs |

### How to Choose a Variant

- **Nano**: you need maximum throughput and the documents are simple (receipts, IDs, labels).
- **Pro**: general-purpose; the best balance for invoices, contracts, forms, and reports.
- **Ultra**: documents contain fine print, dense tables, medical records, or legal footnotes.
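The selection guide above can be sketched as a small lookup that maps a variant name to the vLLM settings used throughout this card. The helper and its name (`llm_kwargs`) are our own illustration, not part of any published GT-REX package; the numbers come from the variant tables below.

```python
# Hypothetical helper (not shipped with GT-REX): map a variant name to the
# vLLM keyword arguments this card recommends for it.
VARIANT_CONFIGS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}

def llm_kwargs(variant: str = "pro") -> dict:
    """Return the LLM(...) keyword arguments recommended for a variant."""
    return {
        "model": "gothitech/GT-REX",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **VARIANT_CONFIGS[variant.lower()],
    }
```

With this in place, `LLM(**llm_kwargs("nano"))` would build the speed-optimized engine with the same arguments shown in the per-variant snippets below.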
---

### GT-REX-Nano

**Speed-optimized for high-volume batch processing**

| Setting | Value |
|---------|-------|
| Resolution | 640 x 640 px |
| Speed | ~1-2 s per image |
| Max Tokens | 2048 |
| GPU Memory | 4-6 GB |
| Recommended Batch Size | 256 sequences |

**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```

---

### GT-REX-Pro (Default)

**Balanced quality and speed for standard enterprise documents**

| Setting | Value |
|---------|-------|
| Resolution | 1024 x 1024 px |
| Speed | ~2-5 s per image |
| Max Tokens | 4096 |
| GPU Memory | 6-10 GB |
| Recommended Batch Size | 128 sequences |

**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```

---

### GT-REX-Ultra

**Maximum quality with adaptive processing for complex documents**

| Setting | Value |
|---------|-------|
| Resolution | 1536 x 1536 px |
| Speed | ~5-10 s per image |
| Max Tokens | 8192 |
| GPU Memory | 10-15 GB |
| Recommended Batch Size | 64 sequences |

**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
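Since each variant has a fixed input resolution, very large scans can be downscaled before submission. The snippet below is our own preprocessing suggestion using Pillow (the helper name `fit_to_resolution` is ours); the model card itself only states the per-variant resolutions and does not document a required preprocessing step.

```python
from PIL import Image

# Our own preprocessing sketch (not from the GT-REX docs): downscale large
# scans to a variant's input resolution while preserving aspect ratio.
def fit_to_resolution(image: Image.Image, max_side: int = 1536) -> Image.Image:
    width, height = image.size
    scale = max_side / max(width, height)
    if scale >= 1.0:
        # Already within bounds; upscaling adds no information, so skip it.
        return image
    new_size = (round(width * scale), round(height * scale))
    return image.resize(new_size, Image.LANCZOS)
```

The resulting image can be passed in `multi_modal_data` exactly as in the Quick Start example.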
```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```

---

## Key Features

| Feature | Description |
|---------|-------------|
| **High Accuracy** | Advanced vision-language architecture for precise text extraction |
| **Multi-Language** | Handles documents in English and multiple other languages |
| **Production Ready** | Optimized for deployment with the vLLM inference engine |
| **Batch Processing** | Process hundreds of documents per minute (Nano variant) |
| **Flexible Prompts** | Supports structured extraction: JSON, tables, key-value pairs, forms |
| **Handwriting Support** | Transcribes handwritten text with high fidelity |
| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
| **Structured Output** | Extract data directly into JSON, Markdown tables, or custom schemas |

---

## Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | GothiTech (Jenis Hathaliya) |
| **Architecture** | Vision-Language Model (VLM) |
| **Model Size** | ~6.5 GB |
| **Parameters** | ~7B |
| **License** | MIT |
| **Release Date** | February 2026 |
| **Precision** | BF16 / FP16 |
| **Input Resolution** | 640-1536 px (variant dependent) |
| **Max Sequence Length** | 2048-8192 tokens (variant dependent) |
| **Inference Engine** | vLLM (recommended) |
| **Framework** | PyTorch / Transformers |

---

## Quick Start

Get running in under 5 minutes:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load the model (Pro variant, the default)
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare the input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get the results
result = outputs[0].outputs[0].text
print(result)
```

---

## Installation

### Prerequisites

- Python 3.9+
- CUDA 11.8+ (GPU required)
- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

### Install Dependencies

```bash
pip install vllm pillow torch transformers
```

### Verify Installation

```python
from vllm import LLM
print("vLLM installed successfully!")
```

---

## Usage Examples

### Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```

### Structured JSON Extraction

```python
prompt = '''Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}'''
```

### Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```

### Key-Value Pair Extraction

```python
prompt = '''Extract all key-value pairs from this form. Return as:
Key: Value
Key: Value'''
```

### Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```

### Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare the batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect the results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```

---

## Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
| **Government** | Form processing, ID verification, tax documents | Pro |
| **Insurance** | Claims processing, policy documents | Pro |
| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
| **Logistics** | Shipping labels, waybills, packing lists | Nano |
| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
| **Retail** | Product catalogs, price tags, inventory lists | Nano |

---

## Performance Benchmarks

### Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---------|--------------|------------|-------------|
| Nano | ~1.2 s | ~15 s | ~55 s |
| Pro | ~3.5 s | ~45 s | ~170 s |
| Ultra | ~7.0 s | ~110 s | ~380 s |

### Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

> **Note:** Benchmark numbers are approximate and may vary with document quality, content complexity, and hardware configuration.

---

## Prompt Engineering Guide

Get the best results from GT-REX with these prompt strategies.

### Tips for Best Results

**Do:**

- Be specific about what to extract ("Extract the invoice number and total amount")
- Specify the output format ("Return as JSON", "Return as a Markdown table")
- Provide a schema for structured extraction (show the expected JSON keys)
- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

**Don't:**

- Use vague prompts ("What is this?")
- Ask for analysis or summarization (GT-REX is optimized for extraction)
- Include unrelated context in the prompt

### Example Prompts

```text
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```

---

## API Integration

### FastAPI Server Example

```python
from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams
import io

app = FastAPI()

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)

@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )
    return {"text": outputs[0].outputs[0].text}
```

### cURL Example

```bash
curl -X POST "http://localhost:8000/extract" \
  -F "file=@invoice.png" \
  -F "prompt=Extract all text from this invoice as JSON."
```

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| **CUDA out of memory** | Reduce `gpu_memory_utilization` or switch to the Nano variant |
| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
| **Truncated output** | Increase `max_tokens` in `SamplingParams` |
| **Low accuracy on small text** | Switch to the Ultra variant for higher resolution |
| **Garbled multilingual text** | Ensure the image resolution is sufficient; try the Ultra variant |
| **Empty output** | Check that the image is loaded correctly and is not blank |
| **Model loading errors** | Ensure `trust_remote_code=True` is set |

---

## Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

---

## License

This model is released under the **MIT License**.
You are free to use, modify, and distribute it for both commercial and non-commercial purposes.

---

## Citation

If you use GT-REX in your work, please cite:

```bibtex
@misc{gtrex-2026,
  title  = {GT-REX: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/gothitech/GT-REX},
  note   = {GothiTech Recognition and Extraction eXpert}
}
```

---

## Contact and Support

- **Developer:** Jenis Hathaliya
- **Organization:** GothiTech
- **HuggingFace:** [gothitech](https://huggingface.co/gothitech)

---

Built by GothiTech
Last updated: February 2026
GT-REX | Variants: Nano | Pro | Ultra