
	---
	license: mit
	language:
	- en
	- multilingual
	tags:
	- ocr
	- vision-language
	- document-understanding
	- gothitech
	- document-ai
	- text-extraction
	- invoice-processing
	- production
	- handwriting-recognition
	- table-extraction
	pipeline_tag: image-text-to-text
	model-index:
	- name: GT-REX-v4
	  results: []
	---

	# GT-REX-v4: Production OCR Model

	<p align="center">
	<strong>🦖 GothiTech Recognition & Extraction eXpert — Version 4</strong>
	</p>

	<p align="center">
	<a href="https://huggingface.co/developerJenis/GT-REX-v4"><img src="https://img.shields.io/badge/🤗_Model-GT--REX--v4-blue" alt="Model"></a>
	<a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
	<a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a>
	<a href="#"><img src="https://img.shields.io/badge/Params-~7B-red" alt="Parameters"></a>
	</p>

	---

	GT-REX-v4 is a production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it extracts text from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.

	---

	## 📑 Table of Contents

	- [GT-REX Variants](#-gt-rex-variants)
	- [Key Features](#-key-features)
	- [Model Details](#-model-details)
	- [Quick Start](#-quick-start)
	- [Installation](#-installation)
	- [Usage Examples](#-usage-examples)
	- [Use Cases](#-use-cases)
	- [Performance Benchmarks](#-performance-benchmarks)
	- [Prompt Engineering Guide](#-prompt-engineering-guide)
	- [API Integration](#-api-integration)
	- [Troubleshooting](#-troubleshooting)
	- [Hardware Recommendations](#-hardware-recommendations)
	- [License](#-license)
	- [Citation](#-citation)
	- [Contact & Support](#-contact--support)

	---

	## ⚙️ GT-REX Variants

	GT-REX-v4 ships with three optimized configurations tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.

	| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
	|---------|-------|----------|------------|------------|------------|----------|
	| 🚀 Nano | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640 px | 4–6 GB | 100–150 docs/min | High-volume batch processing |
	| ⚡ Pro (Default) | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024 px | 6–10 GB | 50–80 docs/min | Standard enterprise workflows |
	| 🎯 Ultra | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536 px | 10–15 GB | 20–30 docs/min | High-accuracy & fine-detail needs |

	### How to Choose a Variant

	- Nano → You need maximum throughput and documents are simple (receipts, IDs, labels).
	- Pro → General-purpose. Best balance for invoices, contracts, forms, and reports.
	- Ultra → Documents contain fine print, dense tables, or legal footnotes (for example, medical records and legal filings).

	---

	### 🚀 GT-REX-Nano

	Speed-optimized for high-volume batch processing

	| Setting | Value |
	|---------|-------|
	| Resolution | 640 × 640 px |
	| Speed | ~1–2s per image |
	| Max Tokens | 2048 |
	| GPU Memory | 4–6 GB |
	| Recommended Batch Size | 256 sequences |

	Best for: Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

	```python
	from vllm import LLM

	llm = LLM(
	    model="developerJenis/GT-REX-v4",
	    trust_remote_code=True,
	    max_model_len=2048,
	    gpu_memory_utilization=0.6,
	    max_num_seqs=256,
	    limit_mm_per_prompt={"image": 1},
	)
	```

	---

	### ⚡ GT-REX-Pro (Default)

	Balanced quality and speed for standard enterprise documents

	| Setting | Value |
	|---------|-------|
	| Resolution | 1024 × 1024 px |
	| Speed | ~2–5s per image |
	| Max Tokens | 4096 |
	| GPU Memory | 6–10 GB |
	| Recommended Batch Size | 128 sequences |

	Best for: Contracts, forms, invoices, reports, government documents, insurance claims.

	```python
	from vllm import LLM

	llm = LLM(
	    model="developerJenis/GT-REX-v4",
	    trust_remote_code=True,
	    max_model_len=4096,
	    gpu_memory_utilization=0.75,
	    max_num_seqs=128,
	    limit_mm_per_prompt={"image": 1},
	)
	```

	---

	### 🎯 GT-REX-Ultra

	Maximum quality with adaptive processing for complex documents

	| Setting | Value |
	|---------|-------|
	| Resolution | 1536 × 1536 px |
	| Speed | ~5–10s per image |
	| Max Tokens | 8192 |
	| GPU Memory | 10–15 GB |
	| Recommended Batch Size | 64 sequences |

	Best for: Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.

	```python
	from vllm import LLM

	llm = LLM(
	    model="developerJenis/GT-REX-v4",
	    trust_remote_code=True,
	    max_model_len=8192,
	    gpu_memory_utilization=0.85,
	    max_num_seqs=64,
	    limit_mm_per_prompt={"image": 1},
	)
	```
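
The three configurations above differ in only three `LLM` arguments, so they can be kept in a single lookup table. The sketch below is illustrative: `VARIANT_PRESETS` and `llm_kwargs` are hypothetical names, not part of GT-REX-v4 or vLLM, and the values are copied from the snippets above.

```python
# Hypothetical helper: collects the per-variant vLLM settings shown above.
VARIANT_PRESETS = {
    "nano": {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro": {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}


def llm_kwargs(variant: str = "pro") -> dict:
    """Return keyword arguments for vllm.LLM for the given variant name."""
    key = variant.lower()
    if key not in VARIANT_PRESETS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(VARIANT_PRESETS)}"
        )
    # Settings shared by every variant, followed by the variant-specific ones
    return {
        "model": "developerJenis/GT-REX-v4",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **VARIANT_PRESETS[key],
    }
```

With vLLM installed, a variant could then be loaded with `llm = LLM(**llm_kwargs("nano"))`.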

	---

	## 🎯 Key Features

	| Feature | Description |
	|---------|-------------|
	| High Accuracy | Vision-language architecture; 98.5%+ character accuracy on printed invoices (Pro variant) |
	| Multi-Language | Handles documents in English and multiple other languages |
	| Production Ready | Optimized for deployment with the vLLM inference engine |
	| Batch Processing | Process 100–150 documents per minute with the Nano variant |
	| Flexible Prompts | Supports structured extraction: JSON, tables, key-value pairs, forms |
	| Handwriting Support | Transcribes handwritten text (92%+ character accuracy on the Pro variant) |
	| Three Variants | Nano (speed), Pro (balanced), Ultra (accuracy) |
	| Structured Output | Extract data directly into JSON, Markdown tables, or custom schemas |

	---

	## 📊 Model Details

	| Attribute | Value |
	|-----------|-------|
	| Developer | GothiTech (Jenis Hathaliya) |
	| Architecture | Vision-Language Model (VLM) |
	| Model Size | ~6.5 GB |
	| Parameters | ~7B |
	| License | MIT |
	| Release Date | February 2026 |
	| Precision | BF16 / FP16 |
	| Input Resolution | 640 px – 1536 px (variant dependent) |
	| Max Sequence Length | 2048 – 8192 tokens (variant dependent) |
	| Inference Engine | vLLM (recommended) |
	| Framework | PyTorch / Transformers |
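
This card lists a target resolution per variant but does not say where the resize happens. If it has to be done on the caller's side (an assumption), one option is to scale each image so its longer side fits the variant's resolution while keeping the aspect ratio. `VARIANT_MAX_SIDE` and `target_size` are hypothetical names used only for this sketch.

```python
# Assumption: images are resized by the caller before inference.
VARIANT_MAX_SIDE = {"nano": 640, "pro": 1024, "ultra": 1536}


def target_size(width: int, height: int, variant: str = "pro") -> tuple[int, int]:
    """Scale (width, height) so the longer side fits the variant's resolution.

    The aspect ratio is preserved, and images that already fit are unchanged.
    """
    max_side = VARIANT_MAX_SIDE[variant]
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result can be passed straight to `image.resize(target_size(*image.size, "nano"))`.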

	---

	## 🚀 Quick Start

	Get running in under 5 minutes:

	```python
	from vllm import LLM, SamplingParams
	from PIL import Image

	# 1. Load model (Pro variant, the default)
	llm = LLM(
	    model="developerJenis/GT-REX-v4",
	    trust_remote_code=True,
	    max_model_len=4096,
	    gpu_memory_utilization=0.75,
	    max_num_seqs=128,
	    limit_mm_per_prompt={"image": 1},
	)

	# 2. Prepare input (convert to RGB so palette and alpha images load consistently)
	image = Image.open("document.png").convert("RGB")
	prompt = "Extract all text from this document."

	# 3. Run inference (temperature=0.0 selects greedy decoding)
	sampling_params = SamplingParams(
	    temperature=0.0,
	    max_tokens=4096,
	)

	outputs = llm.generate(
	    [{
	        "prompt": prompt,
	        "multi_modal_data": {"image": image},
	    }],
	    sampling_params=sampling_params,
	)

	# 4. Get results
	result = outputs[0].outputs[0].text
	print(result)
	```

	---

	## 💻 Installation

	### Prerequisites

	- Python 3.9+
	- CUDA 11.8+ (GPU required)
	- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

	### Install Dependencies

	```bash
	pip install vllm pillow torch transformers
	```

	### Verify Installation

	```python
	from vllm import LLM
	print("vLLM installed successfully!")
	```

	---

	## 📖 Usage Examples

	### Basic Text Extraction

	```python
	prompt = "Extract all text from this document image."
	```

	### Structured JSON Extraction

	```python
	prompt = """Extract the following fields from this invoice as JSON:
	{
	"invoice_number": "",
	"date": "",
	"vendor_name": "",
	"total_amount": "",
	"line_items": [
	{"description": "", "quantity": "", "unit_price": "", "amount": ""}
	]
	}"""
	```
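
The model returns plain text, so the JSON still has to be parsed by the caller. The sketch below assumes the reply may wrap the object in a Markdown code fence or surrounding prose (that output behavior is not documented here), and `parse_json_reply` is a hypothetical helper name.

```python
import json


def parse_json_reply(text: str) -> dict:
    """Parse the first JSON object found in a model reply."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in reply")
    # raw_decode parses one JSON value and ignores whatever follows it,
    # such as a closing code fence or trailing commentary
    obj, _ = json.JSONDecoder().raw_decode(text[start:])
    return obj
```

Malformed JSON raises `json.JSONDecodeError`, a subclass of `ValueError`, so a single `except ValueError` covers both failure modes.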

	### Table Extraction (Markdown Format)

	```python
	prompt = "Extract all tables from this document in Markdown table format."
	```
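
To use an extracted table programmatically, the Markdown can be converted into row dictionaries. This is a minimal sketch, assuming the reply contains a standard pipe-delimited table with leading and trailing pipes and no escaped pipes inside cells; `parse_markdown_table` is a hypothetical helper.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Convert the first Markdown table in `text` into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # the first table has ended
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]
```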

	### Key-Value Pair Extraction

	```python
	prompt = """Extract all key-value pairs from this form.
	Return as:
	Key: Value
	Key: Value
	..."""
	```
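
The `Key: Value` reply format can be turned into a dictionary with a few lines of parsing. This sketch assumes one pair per line and splits on the first colon only, so values such as times keep their own colons; `parse_key_values` is a hypothetical helper.

```python
def parse_key_values(text: str) -> dict[str, str]:
    """Parse 'Key: Value' lines into a dict, skipping lines without a key."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs
```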

	### Handwritten Text Transcription

	```python
	prompt = "Transcribe all handwritten text from this image accurately."
	```

	### Multi-Document Batch Processing

	```python
	from PIL import Image
	from vllm import LLM, SamplingParams

	llm = LLM(
	    model="developerJenis/GT-REX-v4",
	    trust_remote_code=True,
	    max_model_len=4096,
	    gpu_memory_utilization=0.75,
	    max_num_seqs=128,
	    limit_mm_per_prompt={"image": 1},
	)

	# Prepare batch: one request dict per image
	image_paths = ["doc1.png", "doc2.png", "doc3.png"]
	prompts = []
	for path in image_paths:
	    img = Image.open(path).convert("RGB")
	    prompts.append({
	        "prompt": "Extract all text from this document.",
	        "multi_modal_data": {"image": img},
	    })

	# Run batch inference
	sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
	outputs = llm.generate(prompts, sampling_params=sampling_params)

	# Collect results (outputs are returned in the same order as the prompts)
	for i, output in enumerate(outputs):
	    print(f"--- Document {i + 1} ---")
	    print(output.outputs[0].text)
	    print()
	```
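
For jobs with many files, opening every image up front keeps them all in memory at once. One option is to process the paths in fixed-size chunks. The helper below is a sketch, `chunked` is a hypothetical name, and the right chunk size depends on image size and available RAM.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each chunk of paths can then be opened, passed to `llm.generate`, and released before the next chunk is loaded.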

	---

	## 🏢 Use Cases

	| Domain | Application | Recommended Variant |
	|--------|-------------|---------------------|
	| Finance | Invoice processing, receipt scanning, bank statements | Pro / Nano |
	| Legal | Contract analysis, clause extraction, legal filings | Ultra |
	| Healthcare | Medical records, prescriptions, lab reports | Ultra |
	| Government | Form processing, ID verification, tax documents | Pro |
	| Insurance | Claims processing, policy documents | Pro |
	| Education | Exam paper digitization, handwritten notes | Pro / Ultra |
	| Logistics | Shipping labels, waybills, packing lists | Nano |
	| Real Estate | Property documents, deeds, mortgage papers | Pro |
	| Retail | Product catalogs, price tags, inventory lists | Nano |

	---

	## 📈 Performance Benchmarks

	### Processing Time by Variant (NVIDIA A100 80 GB)

	| Variant | Single Image | Batch of 32 (total) | Batch of 128 (total) |
	|---------|--------------|---------------------|----------------------|
	| Nano | ~1.2s | ~15s | ~55s |
	| Pro | ~3.5s | ~45s | ~170s |
	| Ultra | ~7.0s | ~110s | ~380s |
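
The batch times above convert to documents per minute as 60 × batch size ÷ seconds. The short calculation below applies that to the 128-image column; `docs_per_minute` is a hypothetical helper name.

```python
def docs_per_minute(batch_size: int, seconds: float) -> float:
    """Convert a total batch time into documents per minute."""
    return 60.0 * batch_size / seconds


# Times for a batch of 128 images, taken from the table above
BATCH_128_SECONDS = {"Nano": 55, "Pro": 170, "Ultra": 380}

for variant, seconds in BATCH_128_SECONDS.items():
    print(f"{variant}: ~{docs_per_minute(128, seconds):.0f} docs/min")
# Nano: ~140 docs/min
# Pro: ~45 docs/min
# Ultra: ~20 docs/min
```

The Nano and Ultra figures fall within the ranges quoted in the variant overview (100–150 and 20–30 docs/min), while Pro comes out slightly under its quoted 50–80 docs/min.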

	### Accuracy by Document Type (Pro Variant)

	| Document Type | Character Accuracy | Field Accuracy |
	|---------------|--------------------|----------------|
	| Printed invoices | 98.5%+ | 96%+ |
	| Typed contracts | 98%+ | 95%+ |
	| Handwritten notes | 92%+ | 88%+ |
	| Dense tables | 96%+ | 93%+ |
	| Low-quality scans | 94%+ | 90%+ |

	> Note: Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.
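
This card does not define how character accuracy is measured. A common convention, assumed here, is 1 minus the character error rate (CER), where CER is the Levenshtein edit distance between the prediction and the reference divided by the reference length. The sketch below can be used to check results on your own documents; both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_accuracy(prediction: str, reference: str) -> float:
    """Return 1 - CER, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not prediction else 0.0
    cer = edit_distance(prediction, reference) / len(reference)
    return max(0.0, 1.0 - cer)
```

For example, reading `Invoice 1023` as `Invoice 1O23` is one substitution in 12 characters, which gives an accuracy of about 91.7%.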

	---

	## 🧠 Prompt Engineering Guide

	Get the best results from GT-REX-v4 with these prompt strategies:

	### Do's

	- Be specific about what to extract ("Extract the invoice number and total amount")
	- Specify output format ("Return as JSON", "Return as Markdown table")
	- Provide schema for structured extraction (show the expected JSON keys)
	- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

	### Don'ts

	- Don't use vague prompts ("What is this?")
	- Don't ask for analysis or summarization — GT-REX is optimized for extraction
	- Don't include unrelated context in the prompt

	### Example Prompts

	```text
	# Simple extraction
	"Extract all text from this document."

	# Targeted extraction
	"Extract only the table on this page as a Markdown table."

	# Schema-driven extraction
	"Extract data matching this schema: {name: str, date: str, amount: float}"

	# Preservation mode
	"Transcribe this document exactly as written, preserving original formatting."
	```
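
Schema-driven prompts can be generated from a list of field names rather than written by hand. The sketch below builds a prompt in the style of the invoice example under "Structured JSON Extraction"; `build_extraction_prompt` is a hypothetical helper and the wording of the instruction is an assumption, not a tested template.

```python
import json


def build_extraction_prompt(fields: list[str], doc_type: str = "document") -> str:
    """Build a prompt that asks for the given fields as a JSON object."""
    if not fields:
        raise ValueError("at least one field is required")
    # Empty-string placeholders show the model the expected keys
    skeleton = json.dumps({name: "" for name in fields}, indent=2)
    return (
        f"Extract the following fields from this {doc_type} as JSON. "
        "Use an empty string for any field that is not present.\n"
        f"{skeleton}"
    )
```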

	---

	## 🔌 API Integration

	### FastAPI Server Example

	```python
	# Requires: pip install fastapi uvicorn python-multipart
	from fastapi import FastAPI, UploadFile
	from PIL import Image
	from vllm import LLM, SamplingParams
	import io

	app = FastAPI()

	# Load the model once at startup (Pro variant settings)
	llm = LLM(
	    model="developerJenis/GT-REX-v4",
	    trust_remote_code=True,
	    max_model_len=4096,
	    gpu_memory_utilization=0.75,
	    max_num_seqs=128,
	    limit_mm_per_prompt={"image": 1},
	)

	sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


	@app.post("/extract")
	async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
	    # `prompt` is a query parameter; the image arrives in the multipart body
	    image_bytes = await file.read()
	    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

	    # Note: llm.generate() blocks, so requests are handled one at a time
	    outputs = llm.generate(
	        [{
	            "prompt": prompt,
	            "multi_modal_data": {"image": image},
	        }],
	        sampling_params=sampling_params,
	    )

	    return {"text": outputs[0].outputs[0].text}
	```

	---

	## 🛠️ Troubleshooting

	| Issue | Solution |
	|-------|----------|
	| CUDA Out of Memory | Lower `max_num_seqs` or `max_model_len`, reduce `gpu_memory_utilization` if other processes share the GPU, or switch to the Nano variant |
	| Slow inference | Increase `max_num_seqs` for better batching; use Nano for speed |
	| Truncated output | Increase `max_tokens` in `SamplingParams`; the prompt, image tokens, and output together must fit within `max_model_len` |
	| Low accuracy on small text | Switch to Ultra variant for higher resolution |
	| Garbled multilingual text | Ensure image resolution is sufficient; try Ultra variant |
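
Truncated output can be detected rather than guessed: vLLM sets a `finish_reason` on each completion, and the value `"length"` means generation stopped at the token limit. The sketch below uses stand-in objects so it runs without a GPU; `is_truncated` is a hypothetical helper.

```python
from types import SimpleNamespace


def is_truncated(completion) -> bool:
    """True if generation stopped because it hit the token limit."""
    return getattr(completion, "finish_reason", None) == "length"


# Stand-in objects for demonstration; with vLLM, pass outputs[0].outputs[0] instead.
cut_off = SimpleNamespace(text="Invoice No", finish_reason="length")
complete = SimpleNamespace(text="Invoice No. 42", finish_reason="stop")
print(is_truncated(cut_off), is_truncated(complete))  # True False
```

A pipeline could retry truncated documents with a larger `max_tokens` or route them to the Ultra settings.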

	---

	## 🔧 Hardware Recommendations

	| Variant | Minimum GPU | Recommended GPU |
	|---------|-------------|-----------------|
	| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
	| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
	| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

	---

	## 📜 License

	This model is released under the MIT License. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.

	---

	## 📖 Citation

	If you use GT-REX-v4 in your work, please cite:

	```bibtex
	@misc{gtrex-v4-2026,
	  title  = {GT-REX-v4: Production-Grade OCR with Vision-Language Models},
	  author = {Hathaliya, Jenis},
	  year   = {2026},
	  month  = {February},
	  url    = {https://huggingface.co/developerJenis/GT-REX-v4},
	  note   = {GothiTech Recognition \& Extraction eXpert, Version 4}
	}
	```

	---

	## 🤝 Contact & Support

	- Developer: Jenis Hathaliya
	- Organization: GothiTech
	- HuggingFace: [developerJenis](https://huggingface.co/developerJenis)

	---

	<p align="center">
	Built with ❤️ by <strong>GothiTech</strong>
	</p>

	<p align="center">
	<em>Last updated: February 2026</em><br>
	<em>Model Version: v4.0 | Variants: Nano | Pro | Ultra</em>
	</p>