---
license: mit
language:
- en
- multilingual
tags:
- ocr
- vision-language
- document-understanding
- gothitech
- document-ai
- text-extraction
- invoice-processing
- production
- handwriting-recognition
- table-extraction
pipeline_tag: image-text-to-text
model-index:
- name: GT-REX-v4
  results: []
---

# GT-REX-v4: Production OCR Model

<p align="center">
  <strong>🦖 GothiTech Recognition & Extraction eXpert, Version 4</strong>
</p>

<p align="center">
  <a href="https://huggingface.co/developerJenis/GT-REX-v4"><img src="https://img.shields.io/badge/🤗_Model-GT--REX--v4-blue" alt="Model"></a>
  <a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
  <a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a>
  <a href="#"><img src="https://img.shields.io/badge/Params-~7B-red" alt="Parameters"></a>
</p>

---

**GT-REX-v4** is a production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it extracts text from complex documents such as invoices, contracts, forms, handwritten notes, and dense tables.

---

## 📑 Table of Contents

- [GT-REX Variants](#-gt-rex-variants)
- [Key Features](#-key-features)
- [Model Details](#-model-details)
- [Quick Start](#-quick-start)
- [Installation](#-installation)
- [Usage Examples](#-usage-examples)
- [Use Cases](#-use-cases)
- [Performance Benchmarks](#-performance-benchmarks)
- [Prompt Engineering Guide](#-prompt-engineering-guide)
- [API Integration](#-api-integration)
- [Troubleshooting](#-troubleshooting)
- [Hardware Recommendations](#-hardware-recommendations)
- [License](#-license)
- [Citation](#-citation)
- [Contact & Support](#-contact--support)

---

## ⚙️ GT-REX Variants

GT-REX-v4 ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights; they differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| **🚀 Nano** | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
| **⚡ Pro** *(Default)* | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
| **🎯 Ultra** | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |

### How to Choose a Variant

- **Nano** → You need maximum throughput and documents are simple (receipts, IDs, labels).
- **Pro** → General-purpose. Best balance for invoices, contracts, forms, and reports.
- **Ultra** → Documents contain fine print, dense tables, medical records, or legal footnotes.

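
The variant table lists a target resolution, but the loading snippets in this card do not show where it is applied. The following is a minimal sketch, assuming the resolution is enforced client-side by downscaling the longer side of each image before inference. That is an assumption, not documented GT-REX behavior, and `VARIANT_MAX_SIDE` and `target_size` are hypothetical names.

```python
# Longest-side limits taken from the variant table (assumed to apply client-side).
VARIANT_MAX_SIDE = {"nano": 640, "pro": 1024, "ultra": 1536}


def target_size(width: int, height: int, variant: str = "pro") -> tuple[int, int]:
    """Return (width, height) scaled so the longer side fits the variant limit.

    The aspect ratio is preserved and images are never upscaled.
    """
    limit = VARIANT_MAX_SIDE[variant]
    longest = max(width, height)
    if longest <= limit:
        return width, height
    scale = limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(target_size(3000, 2000, "pro"))  # (1024, 683)
```

The resulting size can be passed to Pillow's `Image.resize` before the image is attached to a prompt.
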

---

### 🚀 GT-REX-Nano

**Speed-optimized for high-volume batch processing**

| Setting | Value |
|---------|-------|
| Resolution | 640 × 640 px |
| Speed | ~1–2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4–6 GB |
| Recommended Batch Size | 256 sequences |

**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=2048,                # context window (prompt + output tokens)
    gpu_memory_utilization=0.6,        # fraction of GPU memory vLLM may reserve
    max_num_seqs=256,                  # maximum sequences batched together
    limit_mm_per_prompt={"image": 1},  # one image per prompt
)
```

---

### ⚡ GT-REX-Pro (Default)

**Balanced quality and speed for standard enterprise documents**

| Setting | Value |
|---------|-------|
| Resolution | 1024 × 1024 px |
| Speed | ~2–5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6–10 GB |
| Recommended Batch Size | 128 sequences |

**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,                # context window (prompt + output tokens)
    gpu_memory_utilization=0.75,       # fraction of GPU memory vLLM may reserve
    max_num_seqs=128,                  # maximum sequences batched together
    limit_mm_per_prompt={"image": 1},  # one image per prompt
)
```

---

### 🎯 GT-REX-Ultra

**Maximum quality with adaptive processing for complex documents**

| Setting | Value |
|---------|-------|
| Resolution | 1536 × 1536 px |
| Speed | ~5–10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10–15 GB |
| Recommended Batch Size | 64 sequences |

**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.

```python
from vllm import LLM

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=8192,                # context window (prompt + output tokens)
    gpu_memory_utilization=0.85,       # fraction of GPU memory vLLM may reserve
    max_num_seqs=64,                   # maximum sequences batched together
    limit_mm_per_prompt={"image": 1},  # one image per prompt
)
```

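
Because the three variants differ only in engine settings, they can be kept in one lookup table rather than three copies of the constructor call. This is a minimal sketch; the values are copied from the snippets above, while `VARIANT_PRESETS` and `engine_kwargs` are hypothetical names introduced for illustration.

```python
# Engine settings copied from the three variant snippets in this card.
VARIANT_PRESETS = {
    "nano": {"max_model_len": 2048, "gpu_memory_utilization": 0.6, "max_num_seqs": 256},
    "pro": {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}


def engine_kwargs(variant: str = "pro") -> dict:
    """Build keyword arguments for vllm.LLM for the named variant."""
    key = variant.lower()
    if key not in VARIANT_PRESETS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(VARIANT_PRESETS)}")
    return {
        "model": "developerJenis/GT-REX-v4",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **VARIANT_PRESETS[key],
    }


print(engine_kwargs("ultra")["max_model_len"])  # 8192
```

With vLLM installed, the engine would then be created with `LLM(**engine_kwargs("nano"))`.
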

---

## 🎯 Key Features

| Feature | Description |
|---------|-------------|
| **High Accuracy** | 98%+ character accuracy on printed invoices and typed contracts (Pro variant, see benchmarks) |
| **Multi-Language** | Handles documents in English and multiple other languages |
| **Production Ready** | Optimized for deployment with the vLLM inference engine |
| **Batch Processing** | Processes 100–150 documents per minute (Nano variant) |
| **Flexible Prompts** | Supports structured extraction: JSON, tables, key-value pairs, forms |
| **Handwriting Support** | Transcribes handwritten text (92%+ character accuracy, Pro variant) |
| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
| **Structured Output** | Extracts data directly into JSON, Markdown tables, or custom schemas |

---

## 📊 Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | GothiTech (Jenis Hathaliya) |
| **Architecture** | Vision-Language Model (VLM) |
| **Model Size** | ~6.5 GB |
| **Parameters** | ~7B |
| **License** | MIT |
| **Release Date** | February 2026 |
| **Precision** | BF16 / FP16 |
| **Input Resolution** | 640px – 1536px (variant dependent) |
| **Max Sequence Length** | 2048 – 8192 tokens (variant dependent) |
| **Inference Engine** | vLLM (recommended) |
| **Framework** | PyTorch / Transformers |

---

## 🚀 Quick Start

Get running in under 5 minutes:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# 1. Load model (Pro variant, the default)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input (convert to RGB so palette or alpha images load consistently)
image = Image.open("document.png").convert("RGB")
prompt = "Extract all text from this document."

# 3. Run inference (temperature 0.0 selects greedy decoding)
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)
outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)
```

---

## 💻 Installation

### Prerequisites

- Python 3.9+
- CUDA 11.8+ (GPU required)
- GPU memory: 4 GB+ (Nano), 8 GB+ (Pro), 12 GB+ (Ultra)

### Install Dependencies

```bash
pip install vllm pillow torch transformers
```

### Verify Installation

```python
import vllm

# Importing the package and printing its version confirms the install
print(f"vLLM {vllm.__version__} installed successfully!")
```

---

## 📖 Usage Examples

### Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```

### Structured JSON Extraction

```python
prompt = """Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}"""
```

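
The model returns plain text, so the JSON still has to be parsed on the client. The sketch below assumes the output may carry extra words around the JSON object, which this card does not state either way; `parse_json_output` is a hypothetical helper and the sample string is made up for illustration.

```python
import json


def parse_json_output(text: str) -> dict:
    """Return the first JSON object that can be decoded from model output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


raw = 'Result: {"invoice_number": "INV-001", "total_amount": "42.00"} (end)'
print(parse_json_output(raw)["invoice_number"])  # INV-001
```

Raising `ValueError` when nothing decodes lets a pipeline route the document to a retry or manual-review queue instead of silently storing empty fields.
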

### Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```

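
A Markdown table in the output can be turned into rows for downstream code. This is a minimal sketch, assuming the model emits a standard pipe-delimited table with a header separator row; `parse_markdown_table` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited Markdown table in text into row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (len(line) > 1 and line.startswith("|") and line.endswith("|")):
            if rows:
                break  # a non-table line after the table started ends it
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the header separator row, e.g. |------|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


sample = "| Item | Qty |\n|------|-----|\n| Pen | 2 |\n| Ink | 1 |"
print(parse_markdown_table(sample))
```
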

### Key-Value Pair Extraction

```python
prompt = """Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value
..."""
```

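
Output in the `Key: Value` layout requested above can be collected into a dictionary. The sketch assumes one pair per line and splits on the first colon only, so values that contain colons (times, URLs) stay intact; `parse_key_values` is a hypothetical helper and the sample text is made up.

```python
def parse_key_values(text: str) -> dict[str, str]:
    """Collect 'Key: Value' lines into a dict, skipping lines without a key."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        key, value = key.strip(), value.strip()
        if sep and key:
            pairs[key] = value
    return pairs


sample = "Name: Jane Doe\nDate: 2026-02-01\nTime: 10:30\n\n(signature)"
print(parse_key_values(sample))
# {'Name': 'Jane Doe', 'Date': '2026-02-01', 'Time': '10:30'}
```
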

### Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```

### Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch: one request (prompt + image) per document
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path).convert("RGB")
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results (outputs are returned in the same order as the prompts)
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```

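
The batch example opens every image up front, which may not scale to large folders. The following is a minimal sketch, assuming the path list is split into fixed-size chunks and `llm.generate` is called once per chunk; `chunked` is a hypothetical helper and the chunk size shown is illustrative, not a tuned value.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


paths = [f"doc{i}.png" for i in range(1, 6)]
for batch in chunked(paths, 2):
    # In a real pipeline, open these images and call llm.generate here.
    print(batch)
```
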

---

## 🏢 Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
| **Government** | Form processing, ID verification, tax documents | Pro |
| **Insurance** | Claims processing, policy documents | Pro |
| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
| **Logistics** | Shipping labels, waybills, packing lists | Nano |
| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
| **Retail** | Product catalogs, price tags, inventory lists | Nano |

---

## 📈 Performance Benchmarks

### Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---------|-------------|------------|-------------|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |

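
The batch timings can be converted into documents per minute to compare them with the throughput column of the variants table. This is a small worked calculation using the Batch (128) column above; `docs_per_minute` is a hypothetical helper name.

```python
# Batch-of-128 wall-clock times in seconds, from the throughput table.
BATCH_128_SECONDS = {"Nano": 55, "Pro": 170, "Ultra": 380}


def docs_per_minute(batch_size: int, seconds: float) -> float:
    """Convert a batch wall-clock time into documents per minute."""
    return batch_size / seconds * 60


for variant, seconds in BATCH_128_SECONDS.items():
    print(f"{variant}: ~{docs_per_minute(128, seconds):.0f} docs/min")
```

This works out to roughly 140, 45, and 20 docs/min. The Nano and Ultra figures fall inside the ranges quoted in the variants table, while the Pro figure lands slightly under the quoted 50–80 docs/min.
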

### Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

> **Note:** Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.

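
This card does not state how character accuracy was measured. To check these figures on your own documents, one common definition is 1 minus the character error rate (CER), where CER is the Levenshtein edit distance divided by the length of the reference text. The sketch below implements that definition as an assumption; it is not confirmed to be the method behind the table, and both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped to the range [0, 1]."""
    if not reference:
        raise ValueError("reference text must not be empty")
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


# One substituted character ("0" read as "O") in a 12-character reference.
print(f"{character_accuracy('Invoice 1042', 'Invoice 1O42'):.3f}")
```
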

---

## 🧠 Prompt Engineering Guide

Get the best results from GT-REX-v4 with these prompt strategies:

### Do's

- **Be specific** about what to extract ("Extract the invoice number and total amount")
- **Specify output format** ("Return as JSON", "Return as Markdown table")
- **Provide a schema** for structured extraction (show the expected JSON keys)
- **Use clear instructions** ("Transcribe exactly as written, preserving spelling errors")

### Don'ts

- Don't use vague prompts ("What is this?")
- Don't ask for analysis or summarization; GT-REX is optimized for **extraction**
- Don't include unrelated context in the prompt

### Example Prompts

```text
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```

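
Schema-driven prompts can be generated from a dictionary so that the prompt and the code consuming the result share one definition of the fields. This is a minimal sketch that mirrors the wording of the invoice prompt under Usage Examples; `build_extraction_prompt` is a hypothetical helper, and the exact phrasing the model responds to best is not documented here.

```python
import json


def build_extraction_prompt(schema: dict, document_kind: str = "document") -> str:
    """Render a schema dict as an empty JSON template inside an extraction prompt."""
    template = json.dumps(schema, indent=2)
    return f"Extract the following fields from this {document_kind} as JSON:\n{template}"


invoice_schema = {"invoice_number": "", "date": "", "total_amount": ""}
print(build_extraction_prompt(invoice_schema, "invoice"))
```
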

---

## 🔌 API Integration

### FastAPI Server Example

```python
import io

from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at startup (Pro variant settings)
llm = LLM(
    model="developerJenis/GT-REX-v4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    # Decode the uploaded bytes into an RGB image
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    # llm.generate is a blocking call: it holds the event loop while it runs
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )
    return {"text": outputs[0].outputs[0].text}
```

---

## 🛠️ Troubleshooting

| Issue | Solution |
|-------|----------|
| **CUDA Out of Memory** | Reduce `gpu_memory_utilization`, `max_num_seqs`, or `max_model_len`, or switch to the Nano variant |
| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
| **Truncated output** | Increase `max_tokens` in `SamplingParams` (and `max_model_len` if the context window is the limit) |
| **Low accuracy on small text** | Switch to the Ultra variant for higher resolution |
| **Garbled multilingual text** | Ensure image resolution is sufficient; try the Ultra variant |

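
Truncated output can be detected in code rather than by eye. vLLM records a `finish_reason` on each completion, and the value `"length"` means generation stopped at the token limit. The sketch below uses stand-in objects so it runs without a GPU; `was_truncated` is a hypothetical helper.

```python
from types import SimpleNamespace


def was_truncated(completion) -> bool:
    """True when generation stopped because the token limit was reached."""
    return getattr(completion, "finish_reason", None) == "length"


# Stand-ins for vLLM completions; in real code pass outputs[0].outputs[0].
cut_off = SimpleNamespace(text="Invoice No", finish_reason="length")
complete = SimpleNamespace(text="Invoice No. 1042", finish_reason="stop")

print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```
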

---

## 🔧 Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

---

## 📜 License

This model is released under the **MIT License**. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.

---

## 📖 Citation

If you use GT-REX-v4 in your work, please cite:

```bibtex
@misc{gtrex-v4-2026,
  title  = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/developerJenis/GT-REX-v4},
  note   = {GothiTech Recognition \& Extraction eXpert, Version 4}
}
```

---

## 🤝 Contact & Support

- **Developer:** Jenis Hathaliya
- **Organization:** GothiTech
- **HuggingFace:** [developerJenis](https://huggingface.co/developerJenis)

---

<p align="center">
  Built with ❤️ by <strong>GothiTech</strong>
</p>

<p align="center">
  <em>Last updated: February 2026</em><br>
  <em>Model Version: v4.0 | Variants: Nano, Pro, Ultra</em>
</p>