	---
	license: mit
	language:
	- en
	- multilingual
	tags:
	- ocr
	- vision-language
	- document-understanding
	- gothitech
	- document-ai
	- text-extraction
	- invoice-processing
	- production
	- handwriting-recognition
	- table-extraction
	pipeline_tag: image-text-to-text
	---

	# GT-REX: Production OCR Model

	<p align="center">
	<strong>GothiTech Recognition and Extraction eXpert</strong>
	</p>

	<p align="center">
	<a href="https://huggingface.co/gothitech/GT-REX"><img src="https://img.shields.io/badge/Model-GT--REX-blue" alt="Model"></a>
	<a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
	<a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a>
	<a href="#"><img src="https://img.shields.io/badge/Params-~3B-red" alt="Parameters"></a>
	</p>

	---

	GT-REX is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents including invoices, contracts, forms, handwritten notes, and dense tables.

	---

	## Table of Contents

	- [GT-REX Variants](#gt-rex-variants)
	- [Key Features](#key-features)
	- [Model Details](#model-details)
	- [Quick Start](#quick-start)
	- [Installation](#installation)
	- [Usage Examples](#usage-examples)
	- [Use Cases](#use-cases)
	- [Performance Benchmarks](#performance-benchmarks)
	- [Prompt Engineering Guide](#prompt-engineering-guide)
	- [API Integration](#api-integration)
	- [Troubleshooting](#troubleshooting)
	- [Hardware Recommendations](#hardware-recommendations)
	- [License](#license)
	- [Citation](#citation)
- [Contact and Support](#contact-and-support)

	---

	## GT-REX Variants

	GT-REX ships with three optimized configurations tailored to different performance and accuracy requirements. All variants share the same underlying model weights — they differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| Nano | Ultra Fast | Good | 640px | 4-6 GB | 100-150 docs/min | High-volume batch processing |
| Pro (Default) | Fast | High | 1024px | 6-10 GB | 50-80 docs/min | Standard enterprise workflows |
| Ultra | Moderate | Maximum | 1536px | 10-15 GB | 20-30 docs/min | High-accuracy and fine-detail needs |

	### How to Choose a Variant

	- Nano: You need maximum throughput and documents are simple (receipts, IDs, labels).
	- Pro: General-purpose. Best balance for invoices, contracts, forms, and reports.
	- Ultra: Documents have fine print, dense tables, medical records, or legal footnotes.
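
Because the variants differ only in inference settings, switching between them can be reduced to a lookup. The sketch below is one possible way to do that; `VARIANT_PRESETS` and `build_llm_kwargs` are hypothetical names introduced here, and the values are copied from the per-variant settings listed in this README.

```python
# Hypothetical preset table; the numbers mirror the variant sections of this README.
VARIANT_PRESETS = {
    "nano": {"max_model_len": 2048, "gpu_memory_utilization": 0.6, "max_num_seqs": 256},
    "pro": {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}


def build_llm_kwargs(variant: str = "pro") -> dict:
    """Return keyword arguments for vllm.LLM matching the named variant."""
    try:
        preset = VARIANT_PRESETS[variant.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown variant {variant!r}; expected one of {sorted(VARIANT_PRESETS)}"
        )
    return {
        "model": "gothitech/GT-REX",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **preset,
    }
```

With vLLM installed, the result can be passed straight through, for example `llm = LLM(**build_llm_kwargs("nano"))`.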

	---

### GT-REX-Nano

	Speed-optimized for high-volume batch processing

| Setting | Value |
|---------|-------|
| Resolution | 640 x 640 px |
| Speed | ~1-2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4-6 GB |
| Recommended Batch Size | 256 sequences |

	Best for: Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

# Nano: short context and a large batch for maximum throughput
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```

	---

### GT-REX-Pro (Default)

	Balanced quality and speed for standard enterprise documents

| Setting | Value |
|---------|-------|
| Resolution | 1024 x 1024 px |
| Speed | ~2-5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6-10 GB |
| Recommended Batch Size | 128 sequences |

	Best for: Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

# Pro: balanced context length and batch size (default)
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```

	---

### GT-REX-Ultra

	Maximum quality with adaptive processing for complex documents

| Setting | Value |
|---------|-------|
| Resolution | 1536 x 1536 px |
| Speed | ~5-10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10-15 GB |
| Recommended Batch Size | 64 sequences |

	Best for: Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.

```python
from vllm import LLM

# Ultra: long context and a smaller batch for maximum quality
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```
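
The `LLM(...)` settings above do not include the per-variant resolution. This README does not say whether the model's processor resizes images internally, so the sketch below assumes you may want to downscale on the client before inference. `VARIANT_RESOLUTION` and `target_size` are hypothetical names, and the limits are the resolutions from the variant tables.

```python
# Hypothetical helper: longest-side limits taken from the variant tables in this README.
VARIANT_RESOLUTION = {"nano": 640, "pro": 1024, "ultra": 1536}


def target_size(width: int, height: int, variant: str = "pro") -> tuple:
    """Scale (width, height) so the longer side fits the variant's resolution.

    The aspect ratio is preserved and images that already fit are never upscaled.
    """
    limit = VARIANT_RESOLUTION[variant]
    longest = max(width, height)
    if longest <= limit:
        return width, height
    scale = limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# An A4 page scanned at 300 dpi, prepared for the Pro variant
print(target_size(2480, 3508, "pro"))  # (724, 1024)
```

With Pillow this could be applied as `image = image.resize(target_size(*image.size, "pro"))`.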

	---

	## Key Features

| Feature | Description |
|---------|-------------|
| High Accuracy | Advanced vision-language architecture for precise text extraction |
| Multi-Language | Handles documents in English and multiple other languages |
| Production Ready | Optimized for deployment with the vLLM inference engine |
| Batch Processing | Process hundreds of documents per minute (Nano variant) |
| Flexible Prompts | Supports structured extraction: JSON, tables, key-value pairs, forms |
| Handwriting Support | Transcribes handwritten text with high fidelity |
| Three Variants | Nano (speed), Pro (balanced), Ultra (accuracy) |
| Structured Output | Extract data directly into JSON, Markdown tables, or custom schemas |

	---

	## Model Details

| Attribute | Value |
|-----------|-------|
| Developer | GothiTech (Jenis Hathaliya) |
| Architecture | Vision-Language Model (VLM) |
| Model Size | ~6.5 GB |
| Parameters | ~3B |
| License | MIT |
| Release Date | February 2026 |
| Precision | BF16 / FP16 |
| Input Resolution | 640px - 1536px (variant dependent) |
| Max Sequence Length | 2048 - 8192 tokens (variant dependent) |
| Inference Engine | vLLM (recommended) |
| Framework | PyTorch / Transformers |

	---

	## Quick Start

	Get running in under 5 minutes:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load model (Pro variant - default)
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input
image = Image.open("document.png")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)
```

	---

	## Installation

	### Prerequisites

	- Python 3.9+
	- CUDA 11.8+ (GPU required)
	- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

	### Install Dependencies

	```bash
	pip install vllm pillow torch transformers
	```

	### Verify Installation

	```python
	from vllm import LLM
	print("vLLM installed successfully!")
	```

	---

	## Usage Examples

	### Basic Text Extraction

	```python
	prompt = "Extract all text from this document image."
	```

	### Structured JSON Extraction

```python
prompt = '''Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": [
    {"description": "", "quantity": "", "unit_price": "", "amount": ""}
  ]
}'''
```
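
The model returns its answer as plain text, so the JSON still has to be parsed. The sketch below is one defensive way to do that, assuming the output may be wrapped in a Markdown code fence or surrounded by a sentence of prose. `parse_json_output` is a hypothetical helper, not part of GT-REX or vLLM.

```python
import json
import re

# Markdown code-fence marker, built indirectly so this snippet stays fence-safe.
FENCE = "`" * 3
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_output(text: str) -> dict:
    """Parse a JSON object from model output, tolerating fences and stray prose."""
    fenced = _FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)
    # Keep only the outermost {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])
```

`json.loads` still raises `json.JSONDecodeError` when the extracted span is not valid JSON, which is a useful signal to retry or flag the document for review.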

	### Table Extraction (Markdown Format)

	```python
	prompt = "Extract all tables from this document in Markdown table format."
	```
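
If downstream code needs rows rather than Markdown text, the returned table can be converted with a few lines of Python. This is a minimal sketch that assumes the output contains a single pipe-delimited table with a header row. `parse_markdown_table` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(text: str) -> list:
    """Convert one Markdown table into a list of {header: cell} dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```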

	### Key-Value Pair Extraction

	```python
	prompt = '''Extract all key-value pairs from this form.
	Return as:
	Key: Value
	Key: Value'''
	```
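
Output in the `Key: Value` shape requested above is straightforward to turn into a dictionary. The sketch below assumes one pair per line and splits on the first colon only, so values such as times keep their own colons. `parse_key_values` is a hypothetical helper.

```python
def parse_key_values(text: str) -> dict:
    """Parse 'Key: Value' lines into a dict, ignoring lines without a colon."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs
```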

	### Handwritten Text Transcription

	```python
	prompt = "Transcribe all handwritten text from this image accurately."
	```

	### Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch: one request (prompt + image) per document
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path)
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results (outputs are returned in the same order as the prompts)
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```
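
For folders with thousands of files, opening every image before calling `generate` can exhaust host memory. One option is to process the paths in fixed-size chunks, as in the sketch below. `chunked` is a hypothetical helper, and a chunk size equal to `max_num_seqs` is an assumption rather than a documented recommendation.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` containing at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Example: five paths split into chunks of two
for batch in chunked(["a.png", "b.png", "c.png", "d.png", "e.png"], 2):
    print(batch)
```

Each chunk can then be loaded and passed to `llm.generate` exactly as in the batch example above.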

	---

	## Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| Finance | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| Legal | Contract analysis, clause extraction, legal filings | Ultra |
| Healthcare | Medical records, prescriptions, lab reports | Ultra |
| Government | Form processing, ID verification, tax documents | Pro |
| Insurance | Claims processing, policy documents | Pro |
| Education | Exam paper digitization, handwritten notes | Pro / Ultra |
| Logistics | Shipping labels, waybills, packing lists | Nano |
| Real Estate | Property documents, deeds, mortgage papers | Pro |
| Retail | Product catalogs, price tags, inventory lists | Nano |

	---

	## Performance Benchmarks

	### Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch of 32 (total time) | Batch of 128 (total time) |
|---------|--------------|--------------------------|---------------------------|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |
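
The batch timings above can be converted to documents per minute for comparison with the throughput column in the variants table. The sketch below shows the arithmetic for the 128-document batches; `docs_per_minute` is a hypothetical helper.

```python
def docs_per_minute(batch_size: int, batch_seconds: float) -> float:
    """Convert a batch size and its total wall-clock time into documents per minute."""
    return batch_size / batch_seconds * 60.0


# Total times for a batch of 128, taken from the table above
BATCH_128_SECONDS = {"Nano": 55, "Pro": 170, "Ultra": 380}

for variant, seconds in BATCH_128_SECONDS.items():
    print(f"{variant}: {docs_per_minute(128, seconds):.0f} docs/min")
# Nano: 140 docs/min
# Pro: 45 docs/min
# Ultra: 20 docs/min
```

Nano and Ultra land inside the ranges quoted in the variants table (100-150 and 20-30 docs/min), while Pro comes out slightly under its 50-80 docs/min range.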

	### Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

	> Note: Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.
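
This README does not state how character accuracy was measured. To run a comparable check on your own documents, a common definition is one minus the character error rate, that is, the Levenshtein edit distance divided by the length of the reference text. The sketch below assumes that definition; both function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


# One substituted character out of eight
print(character_accuracy("INV-1001", "INV-1O01"))  # 0.875
```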

	---

	## Prompt Engineering Guide

	Get the best results from GT-REX with these prompt strategies:

	### Tips for Best Results

	Do:
	- Be specific about what to extract ("Extract the invoice number and total amount")
	- Specify output format ("Return as JSON", "Return as Markdown table")
	- Provide schema for structured extraction (show the expected JSON keys)
	- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

	Don't:
	- Use vague prompts ("What is this?")
	- Ask for analysis or summarization (GT-REX is optimized for extraction)
	- Include unrelated context in the prompt

	### Example Prompts

	```text
	# Simple extraction
	"Extract all text from this document."

	# Targeted extraction
	"Extract only the table on this page as a Markdown table."

	# Schema-driven extraction
	"Extract data matching this schema: {name: str, date: str, amount: float}"

	# Preservation mode
	"Transcribe this document exactly as written, preserving original formatting."
	```
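
Schema-driven prompts are easier to keep consistent when they are generated from the schema itself. The sketch below builds a prompt in the style of the invoice example from a Python dict. `build_schema_prompt` is a hypothetical helper, and the exact wording is an assumption you may need to tune.

```python
import json


def build_schema_prompt(schema: dict, document_type: str = "document") -> str:
    """Build an extraction prompt that embeds the expected JSON keys."""
    template = json.dumps(schema, indent=2)
    return (
        f"Extract the following fields from this {document_type} as JSON. "
        "Return only the JSON object, using exactly these keys:\n"
        f"{template}"
    )


prompt = build_schema_prompt(
    {"invoice_number": "", "date": "", "total_amount": ""},
    document_type="invoice",
)
print(prompt)
```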

	---

	## API Integration

	### FastAPI Server Example

```python
import io

from fastapi import FastAPI, Form, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams

app = FastAPI()

# Load the model once at startup (Pro settings)
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(
    file: UploadFile,
    # Read the prompt from the multipart form so it matches the cURL example;
    # form parsing requires the python-multipart package.
    prompt: str = Form("Extract all text."),
):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    # Note: llm.generate is a blocking call
    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )

    return {"text": outputs[0].outputs[0].text}
```

	### cURL Example

```bash
curl -X POST "http://localhost:8000/extract" \
  -F "file=@invoice.png" \
  -F "prompt=Extract all text from this invoice as JSON."
```

	---

	## Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA Out of Memory | Lower `max_num_seqs` or `max_model_len`, reduce `gpu_memory_utilization` if other processes share the GPU, or switch to the Nano variant |
| Slow inference | Increase `max_num_seqs` for better batching; use Nano for speed |
| Truncated output | Increase `max_tokens` in `SamplingParams` |
| Low accuracy on small text | Switch to Ultra variant for higher resolution |
| Garbled multilingual text | Ensure image resolution is sufficient; try Ultra variant |
| Empty output | Check that the image is loaded correctly and is not blank |
| Model loading errors | Ensure `trust_remote_code=True` is set |
	---

	## Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

	---

	## License

	This model is released under the MIT License. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.

	---

	## Citation

	If you use GT-REX in your work, please cite:

```bibtex
@misc{gtrex-2026,
  title  = {GT-REX: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/gothitech/GT-REX},
  note   = {GothiTech Recognition and Extraction eXpert}
}
```

	---

	## Contact and Support

	- Developer: Jenis Hathaliya
	- Organization: GothiTech
	- HuggingFace: [gothitech](https://huggingface.co/gothitech)

	---

	<p align="center">
	Built by <strong>GothiTech</strong>
	</p>

	<p align="center">
	<em>Last updated: February 2026</em><br>
<em>GT-REX | Variants: Nano | Pro | Ultra</em>
	</p>