📝 Enhanced model card with GT-REX variants (Nano/Pro/Ultra), benchmarks, and usage guide

Browse files

Files changed (1) hide show

README.md +438 -37

README.md CHANGED Viewed

@@ -12,89 +12,170 @@ tags:
   - text-extraction
   - invoice-processing
   - production
 pipeline_tag: image-text-to-text
 ---
 # GT-REX-v4: Production OCR Model
-**GT-REX-v4** is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing.
 ## ⚙️ GT-REX Variants
-GT-REX-v4 supports **three optimized configurations** for different performance requirements:
 | Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
 |---------|-------|----------|------------|------------|------------|----------|
-| **🚀 Nano** | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4-6 GB | 100-150 docs/min | High-volume batch |
-| **⚡ Pro** | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6-10 GB | 50-80 docs/min | Standard workflows |
-| **🎯 Ultra** | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10-15 GB | 20-30 docs/min | High-accuracy needs |
 ### 🚀 GT-Rex-Nano
 **Speed-optimized for high-volume batch processing**
-- **Resolution**: 640×640px
-- **Speed**: ~1-2s per image
-- **Max Tokens**: 2048
-- **Best for**: Thumbnails, previews, high-throughput pipelines (100+ docs)
 ```python
 llm = LLM(
-    model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
     max_model_len=2048,
     gpu_memory_utilization=0.6,
     max_num_seqs=256,
-    logits_processors=[NGramPerReqLogitsProcessor],
 )
 ```
 ### ⚡ GT-Rex-Pro (Default)
-**Balanced quality and speed for standard documents**
-- **Resolution**: 1024×1024px
-- **Speed**: ~2-5s per image
-- **Max Tokens**: 4096
-- **Best for**: Contracts, forms, invoices, reports
 ```python
 llm = LLM(
-    model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
     max_model_len=4096,
     gpu_memory_utilization=0.75,
     max_num_seqs=128,
-    logits_processors=[NGramPerReqLogitsProcessor],
 )
 ```
 ### 🎯 GT-Rex-Ultra
-**Maximum quality with adaptive processing**
-- **Resolution**: 1536×1536px
-- **Speed**: ~5-10s per image
-- **Max Tokens**: 8192
-- **Best for**: Legal documents, fine print, dense tables, medical records
 ```python
 llm = LLM(
-    model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
     max_model_len=8192,
     gpu_memory_utilization=0.85,
     max_num_seqs=64,
-    logits_processors=[NGramPerReqLogitsProcessor],
 )
 ```
 ## 🎯 Key Features
-- **High Accuracy**: Advanced vision-language architecture for precise text extraction
-- **Multi-Language Support**: Handles documents in multiple languages
-- **Production Ready**: Optimized for deployment with vLLM inference engine
-- **Batch Processing**: Process hundreds of documents per minute
-- **Flexible Prompts**: Support for structured extraction (JSON, tables, forms)
-- **Handwriting Support**: Capable of transcribing handwritten text
-- **Three Optimized Variants**: Nano, Pro, and Ultra for different use cases
 ## 📊 Model Details
@@ -106,18 +187,338 @@ llm = LLM(
 | **Parameters** | ~7B |
 | **License** | MIT |
 | **Release Date** | February 2026 |
-| **Precision** | BF16/FP16 |
-| **Input Resolution** | 640px - 1536px (variant dependent) |
-## 🚀 Use Cases
 ## 💻 Installation
 ```bash
 pip install vllm pillow torch transformers
 ```
 ---
-*Last updated: February 2026*
-*Model Version: v4.0 | Variants: Nano | Pro | Ultra*

   - text-extraction
   - invoice-processing
   - production
+  - handwriting-recognition
+  - table-extraction
 pipeline_tag: image-text-to-text
+model-index:
+  - name: GT-REX-v4
+    results: []
 ---
 # GT-REX-v4: Production OCR Model
+<p align="center">
+  <strong>🦖 GothiTech Recognition & Extraction eXpert — Version 4</strong>
+</p>
+<p align="center">
+  <a href="https://huggingface.co/developerJenis/GT-REX-v4"><img src="https://img.shields.io/badge/🤗_Model-GT--REX--v4-blue" alt="Model"></a>
+  <a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
+  <a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a>
+  <a href="#"><img src="https://img.shields.io/badge/Params-~7B-red" alt="Parameters"></a>
+</p>
+---
+**GT-REX-v4** is a state-of-the-art production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents including invoices, contracts, forms, handwritten notes, and dense tables.
+---
+## 📑 Table of Contents
+- [GT-REX Variants](#-gt-rex-variants)
+- [Key Features](#-key-features)
+- [Model Details](#-model-details)
+- [Quick Start](#-quick-start)
+- [Installation](#-installation)
+- [Usage Examples](#-usage-examples)
+- [Use Cases](#-use-cases)
+- [Performance Benchmarks](#-performance-benchmarks)
+- [Prompt Engineering Guide](#-prompt-engineering-guide)
+- [API Integration](#-api-integration)
+- [Troubleshooting](#-troubleshooting)
+- [License](#-license)
+- [Citation](#-citation)
+---
 ## ⚙️ GT-REX Variants
+GT-REX-v4 ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.
 | Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
 |---------|-------|----------|------------|------------|------------|----------|
+| **🚀 Nano** | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
+| **⚡ Pro** *(Default)* | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
+| **🎯 Ultra** | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |
+### How to Choose a Variant
+- **Nano** → You need maximum throughput and documents are simple (receipts, IDs, labels).
+- **Pro** → General-purpose. Best balance for invoices, contracts, forms, and reports.
+- **Ultra** → Documents have fine print, dense tables, medical records, or legal footnotes.
+---
 ### 🚀 GT-Rex-Nano
 **Speed-optimized for high-volume batch processing**
+| Setting | Value |
+|---------|-------|
+| Resolution | 640 × 640 px |
+| Speed | ~1–2s per image |
+| Max Tokens | 2048 |
+| GPU Memory | 4–6 GB |
+| Recommended Batch Size | 256 sequences |
+**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.
 ```python
+from vllm import LLM
 llm = LLM(
+    model="developerJenis/GT-REX-v4",
     trust_remote_code=True,
     max_model_len=2048,
     gpu_memory_utilization=0.6,
     max_num_seqs=256,
+    limit_mm_per_prompt={"image": 1},
 )
 ```
+---
 ### ⚡ GT-Rex-Pro (Default)
+**Balanced quality and speed for standard enterprise documents**
+| Setting | Value |
+|---------|-------|
+| Resolution | 1024 × 1024 px |
+| Speed | ~2–5s per image |
+| Max Tokens | 4096 |
+| GPU Memory | 6–10 GB |
+| Recommended Batch Size | 128 sequences |
+**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.
 ```python
+from vllm import LLM
 llm = LLM(
+    model="developerJenis/GT-REX-v4",
     trust_remote_code=True,
     max_model_len=4096,
     gpu_memory_utilization=0.75,
     max_num_seqs=128,
+    limit_mm_per_prompt={"image": 1},
 )
 ```
+---
 ### 🎯 GT-Rex-Ultra
+**Maximum quality with adaptive processing for complex documents**
+| Setting | Value |
+|---------|-------|
+| Resolution | 1536 × 1536 px |
+| Speed | ~5–10s per image |
+| Max Tokens | 8192 |
+| GPU Memory | 10–15 GB |
+| Recommended Batch Size | 64 sequences |
+**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.
 ```python
+from vllm import LLM
 llm = LLM(
+    model="developerJenis/GT-REX-v4",
     trust_remote_code=True,
     max_model_len=8192,
     gpu_memory_utilization=0.85,
     max_num_seqs=64,
+    limit_mm_per_prompt={"image": 1},
 )
 ```
+---
 ## 🎯 Key Features
+| Feature | Description |
+|---------|-------------|
+| **High Accuracy** | Advanced vision-language architecture for precise text extraction |
+| **Multi-Language** | Handles documents in English and multiple other languages |
+| **Production Ready** | Optimized for deployment with the vLLM inference engine |
+| **Batch Processing** | Process hundreds of documents per minute (Nano variant) |
+| **Flexible Prompts** | Supports structured extraction — JSON, tables, key-value pairs, forms |
+| **Handwriting Support** | Transcribes handwritten text with high fidelity |
+| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
+| **Structured Output** | Extract data directly into JSON, Markdown tables, or custom schemas |
+---
 ## 📊 Model Details
 | **Parameters** | ~7B |
 | **License** | MIT |
 | **Release Date** | February 2026 |
+| **Precision** | BF16 / FP16 |
+| **Input Resolution** | 640px – 1536px (variant dependent) |
+| **Max Sequence Length** | 2048 – 8192 tokens (variant dependent) |
+| **Inference Engine** | vLLM (recommended) |
+| **Framework** | PyTorch / Transformers |
+---
+## 🚀 Quick Start
+Get running in under 5 minutes:
+```python
+from vllm import LLM, SamplingParams
+from PIL import Image
+# 1. Load model (Pro variant — default)
+llm = LLM(
+    model="developerJenis/GT-REX-v4",
+    trust_remote_code=True,
+    max_model_len=4096,
+    gpu_memory_utilization=0.75,
+    max_num_seqs=128,
+    limit_mm_per_prompt={"image": 1},
+)
+# 2. Prepare input
+image = Image.open("document.png")
+prompt = "Extract all text from this document."
+# 3. Run inference
+sampling_params = SamplingParams(
+    temperature=0.0,
+    max_tokens=4096,
+)
+outputs = llm.generate(
+    [{
+        "prompt": prompt,
+        "multi_modal_data": {"image": image},
+    }],
+    sampling_params=sampling_params,
+)
+# 4. Get results
+result = outputs[0].outputs[0].text
+print(result)
+```
+---
 ## 💻 Installation
+### Prerequisites
+- Python 3.9+
+- CUDA 11.8+ (GPU required)
+- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)
+### Install Dependencies
 ```bash
 pip install vllm pillow torch transformers
 ```
+### Verify Installation
+```python
+from vllm import LLM
+print("vLLM installed successfully!")
+```
 ---
+## 📖 Usage Examples
+### Basic Text Extraction
+```python
+prompt = "Extract all text from this document image."
+```
+### Structured JSON Extraction
+```python
+prompt = """Extract the following fields from this invoice as JSON:
+{
+    "invoice_number": "",
+    "date": "",
+    "vendor_name": "",
+    "total_amount": "",
+    "line_items": [
+        {"description": "", "quantity": "", "unit_price": "", "amount": ""}
+    ]
+}"""
+```
+### Table Extraction (Markdown Format)
+```python
+prompt = "Extract all tables from this document in Markdown table format."
+```
+### Key-Value Pair Extraction
+```python
+prompt = """Extract all key-value pairs from this form.
+Return as:
+Key: Value
+Key: Value
+..."""
+```
+### Handwritten Text Transcription
+```python
+prompt = "Transcribe all handwritten text from this image accurately."
+```
+### Multi-Document Batch Processing
+```python
+from PIL import Image
+from vllm import LLM, SamplingParams
+llm = LLM(
+    model="developerJenis/GT-REX-v4",
+    trust_remote_code=True,
+    max_model_len=4096,
+    gpu_memory_utilization=0.75,
+    max_num_seqs=128,
+    limit_mm_per_prompt={"image": 1},
+)
+# Prepare batch
+image_paths = ["doc1.png", "doc2.png", "doc3.png"]
+prompts = []
+for path in image_paths:
+    img = Image.open(path)
+    prompts.append({
+        "prompt": "Extract all text from this document.",
+        "multi_modal_data": {"image": img},
+    })
+# Run batch inference
+sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
+outputs = llm.generate(prompts, sampling_params=sampling_params)
+# Collect results
+for i, output in enumerate(outputs):
+    print(f"--- Document {i + 1} ---")
+    print(output.outputs[0].text)
+    print()
+```
+---
+## 🏢 Use Cases
+| Domain | Application | Recommended Variant |
+|--------|-------------|---------------------|
+| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
+| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
+| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
+| **Government** | Form processing, ID verification, tax documents | Pro |
+| **Insurance** | Claims processing, policy documents | Pro |
+| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
+| **Logistics** | Shipping labels, waybills, packing lists | Nano |
+| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
+| **Retail** | Product catalogs, price tags, inventory lists | Nano |
+---
+## 📈 Performance Benchmarks
+### Throughput by Variant (NVIDIA A100 80GB)
+| Variant | Single Image | Batch (32) | Batch (128) |
+|---------|-------------|------------|-------------|
+| Nano | ~1.2s | ~15s | ~55s |
+| Pro | ~3.5s | ~45s | ~170s |
+| Ultra | ~7.0s | ~110s | ~380s |
+### Accuracy by Document Type (Pro Variant)
+| Document Type | Character Accuracy | Field Accuracy |
+|---------------|--------------------|----------------|
+| Printed invoices | 98.5%+ | 96%+ |
+| Typed contracts | 98%+ | 95%+ |
+| Handwritten notes | 92%+ | 88%+ |
+| Dense tables | 96%+ | 93%+ |
+| Low-quality scans | 94%+ | 90%+ |
+> **Note:** Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.
+---
+## 🧠 Prompt Engineering Guide
+Get the best results from GT-REX-v4 with these prompt strategies:
+### Do's
+- **Be specific** about what to extract ("Extract the invoice number and total amount")
+- **Specify output format** ("Return as JSON", "Return as Markdown table")
+- **Provide schema** for structured extraction (show the expected JSON keys)
+- **Use clear instructions** ("Transcribe exactly as written, preserving spelling errors")
+### Don'ts
+- Avoid vague prompts ("What is this?")
+- Don't ask for analysis or summarization — GT-REX is optimized for **extraction**
+- Don't include unrelated context in the prompt
+### Example Prompts
+```text
+# Simple extraction
+"Extract all text from this document."
+# Targeted extraction
+"Extract only the table on this page as a Markdown table."
+# Schema-driven extraction
+"Extract data matching this schema: {name: str, date: str, amount: float}"
+# Preservation mode
+"Transcribe this document exactly as written, preserving original formatting."
+```
+---
+## 🔌 API Integration
+### FastAPI Server Example
+```python
+from fastapi import FastAPI, UploadFile
+from PIL import Image
+from vllm import LLM, SamplingParams
+import io
+app = FastAPI()
+llm = LLM(
+    model="developerJenis/GT-REX-v4",
+    trust_remote_code=True,
+    max_model_len=4096,
+    gpu_memory_utilization=0.75,
+    max_num_seqs=128,
+    limit_mm_per_prompt={"image": 1},
+)
+sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
+@app.post("/extract")
+async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
+    image_bytes = await file.read()
+    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
+    outputs = llm.generate(
+        [{
+            "prompt": prompt,
+            "multi_modal_data": {"image": image},
+        }],
+        sampling_params=sampling_params,
+    )
+    return {"text": outputs[0].outputs[0].text}
+```
+---
+## 🛠️ Troubleshooting
+| Issue | Solution |
+|-------|----------|
+| **CUDA Out of Memory** | Reduce `gpu_memory_utilization` or switch to Nano variant |
+| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
+| **Truncated output** | Increase `max_tokens` in `SamplingParams` |
+| **Low accuracy on small text** | Switch to Ultra variant for higher resolution |
+| **Garbled multilingual text** | Ensure image resolution is sufficient; try Ultra variant |
+---
+## 🔧 Hardware Recommendations
+| Variant | Minimum GPU | Recommended GPU |
+|---------|-------------|-----------------|
+| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
+| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
+| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |
+---
+## 📜 License
+This model is released under the **MIT License**. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.
+---
+## 📖 Citation
+If you use GT-REX-v4 in your work, please cite:
+```bibtex
+@misc{gtrex-v4-2026,
+  title   = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
+  author  = {Hathaliya, Jenis},
+  year    = {2026},
+  month   = {February},
+  url     = {https://huggingface.co/developerJenis/GT-REX-v4},
+  note    = {GothiTech Recognition \& Extraction eXpert, Version 4}
+}
+```
+---
+## 🤝 Contact & Support
+- **Developer:** Jenis Hathaliya
+- **Organization:** GothiTech
+- **HuggingFace:** [developerJenis](https://huggingface.co/developerJenis)
+---
+<p align="center">
+  Built with ❤️ by <strong>GothiTech</strong>
+</p>
+<p align="center">
+  <em>Last updated: February 2026</em><br>
+  <em>Model Version: v4.0 | Variants: Nano | Pro | Ultra</em>
+</p>