developerJenis committed on
Commit 0117543 · verified · 1 Parent(s): b07d51c

Add GT-REX model card with Nano/Pro/Ultra variants

Files changed (1):
  1. README.md +484 -139

README.md CHANGED
---
license: mit
language:
- en
- multilingual
tags:
- ocr
- vision-language
- document-understanding
- gothitech
- document-ai
- text-extraction
- invoice-processing
- production
- handwriting-recognition
- table-extraction
pipeline_tag: image-text-to-text
---

# GT-REX: Production OCR Model

<p align="center">
<strong>GothiTech Recognition and Extraction eXpert</strong>
</p>

<p align="center">
<a href="https://huggingface.co/gothitech/GT-REX"><img src="https://img.shields.io/badge/Model-GT--REX-blue" alt="Model"></a>
<a href="#"><img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License: MIT"></a>
<a href="#"><img src="https://img.shields.io/badge/vLLM-Supported-orange" alt="vLLM"></a>
<a href="#"><img src="https://img.shields.io/badge/Params-~7B-red" alt="Parameters"></a>
</p>

---

**GT-REX** is a production-grade OCR model developed by **GothiTech** for enterprise document understanding, text extraction, and intelligent document processing. Built on a Vision-Language Model (VLM) architecture, it delivers high-accuracy text extraction from complex documents, including invoices, contracts, forms, handwritten notes, and dense tables.

---

## Table of Contents

- [GT-REX Variants](#gt-rex-variants)
- [Key Features](#key-features)
- [Model Details](#model-details)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Usage Examples](#usage-examples)
- [Use Cases](#use-cases)
- [Performance Benchmarks](#performance-benchmarks)
- [Prompt Engineering Guide](#prompt-engineering-guide)
- [API Integration](#api-integration)
- [Troubleshooting](#troubleshooting)
- [Hardware Recommendations](#hardware-recommendations)
- [License](#license)
- [Citation](#citation)

---

## GT-REX Variants

GT-REX ships with **three optimized configurations** tailored to different performance and accuracy requirements. All variants share the same underlying model weights; they differ only in inference settings.

| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
|---------|-------|----------|------------|------------|------------|----------|
| **Nano** | Ultra Fast | Good | 640px | 4-6 GB | 100-150 docs/min | High-volume batch processing |
| **Pro** (Default) | Fast | High | 1024px | 6-10 GB | 50-80 docs/min | Standard enterprise workflows |
| **Ultra** | Moderate | Maximum | 1536px | 10-15 GB | 20-30 docs/min | High-accuracy and fine-detail needs |

### How to Choose a Variant

- **Nano**: You need maximum throughput and documents are simple (receipts, IDs, labels).
- **Pro**: General-purpose. Best balance for invoices, contracts, forms, and reports.
- **Ultra**: Documents have fine print, dense tables, medical records, or legal footnotes.

---
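Because the variants share weights and differ only in engine settings, variant selection can be reduced to a lookup over the per-variant settings listed in this card. A minimal sketch (the `VARIANTS` dict and `llm_kwargs` helper are illustrative names, not part of the model's API):

```python
# Variant presets copied from the per-variant settings in this card.
# The model weights are identical; only the inference settings change.
VARIANTS = {
    "nano":  {"max_model_len": 2048, "gpu_memory_utilization": 0.60, "max_num_seqs": 256},
    "pro":   {"max_model_len": 4096, "gpu_memory_utilization": 0.75, "max_num_seqs": 128},
    "ultra": {"max_model_len": 8192, "gpu_memory_utilization": 0.85, "max_num_seqs": 64},
}

def llm_kwargs(variant: str = "pro") -> dict:
    """Return keyword arguments for vllm.LLM for the chosen variant."""
    preset = VARIANTS[variant]
    return {
        "model": "gothitech/GT-REX",
        "trust_remote_code": True,
        "limit_mm_per_prompt": {"image": 1},
        **preset,
    }

# Usage: llm = LLM(**llm_kwargs("nano"))
```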

### GT-REX Nano

**Speed-optimized for high-volume batch processing**

| Setting | Value |
|---------|-------|
| Resolution | 640 x 640 px |
| Speed | ~1-2s per image |
| Max Tokens | 2048 |
| GPU Memory | 4-6 GB |
| Recommended Batch Size | 256 sequences |

**Best for:** Thumbnails, previews, high-throughput pipelines (100+ docs/min), mobile uploads, receipt scanning.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=2048,
    gpu_memory_utilization=0.6,
    max_num_seqs=256,
    limit_mm_per_prompt={"image": 1},
)
```

---

### GT-REX Pro (Default)

**Balanced quality and speed for standard enterprise documents**

| Setting | Value |
|---------|-------|
| Resolution | 1024 x 1024 px |
| Speed | ~2-5s per image |
| Max Tokens | 4096 |
| GPU Memory | 6-10 GB |
| Recommended Batch Size | 128 sequences |

**Best for:** Contracts, forms, invoices, reports, government documents, insurance claims.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)
```

---

### GT-REX Ultra

**Maximum quality with adaptive processing for complex documents**

| Setting | Value |
|---------|-------|
| Resolution | 1536 x 1536 px |
| Speed | ~5-10s per image |
| Max Tokens | 8192 |
| GPU Memory | 10-15 GB |
| Recommended Batch Size | 64 sequences |

**Best for:** Legal documents, fine print, dense tables, medical records, engineering drawings, academic papers, multi-column layouts.

```python
from vllm import LLM

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=8192,
    gpu_memory_utilization=0.85,
    max_num_seqs=64,
    limit_mm_per_prompt={"image": 1},
)
```

---

## Key Features

| Feature | Description |
|---------|-------------|
| **High Accuracy** | Advanced vision-language architecture for precise text extraction |
| **Multi-Language** | Handles documents in English and multiple other languages |
| **Production Ready** | Optimized for deployment with the vLLM inference engine |
| **Batch Processing** | Process hundreds of documents per minute (Nano variant) |
| **Flexible Prompts** | Supports structured extraction: JSON, tables, key-value pairs, forms |
| **Handwriting Support** | Transcribes handwritten text with high fidelity |
| **Three Variants** | Nano (speed), Pro (balanced), Ultra (accuracy) |
| **Structured Output** | Extract data directly into JSON, Markdown tables, or custom schemas |

---

## Model Details

| Attribute | Value |
|-----------|-------|
| **Developer** | GothiTech (Jenis Hathaliya) |
| **Architecture** | Vision-Language Model (VLM) |
| **Model Size** | ~6.5 GB |
| **Parameters** | ~7B |
| **License** | MIT |
| **Release Date** | February 2026 |
| **Precision** | BF16 / FP16 |
| **Input Resolution** | 640px - 1536px (variant dependent) |
| **Max Sequence Length** | 2048 - 8192 tokens (variant dependent) |
| **Inference Engine** | vLLM (recommended) |
| **Framework** | PyTorch / Transformers |

---

## Quick Start

Get running in under 5 minutes:

```python
from vllm import LLM, SamplingParams
from PIL import Image

# 1. Load model (Pro variant - default)
llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# 2. Prepare input
image = Image.open("document.png").convert("RGB")
prompt = "Extract all text from this document."

# 3. Run inference
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=4096,
)

outputs = llm.generate(
    [{
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    }],
    sampling_params=sampling_params,
)

# 4. Get results
result = outputs[0].outputs[0].text
print(result)
```

---

## Installation

### Prerequisites

- Python 3.9+
- CUDA 11.8+ (GPU required)
- 8 GB+ VRAM (Pro variant), 4 GB+ (Nano), 12 GB+ (Ultra)

### Install Dependencies

```bash
pip install vllm pillow torch transformers
```

### Verify Installation

```python
from vllm import LLM
print("vLLM installed successfully!")
```

---

## Usage Examples

### Basic Text Extraction

```python
prompt = "Extract all text from this document image."
```

### Structured JSON Extraction

```python
prompt = '''Extract the following fields from this invoice as JSON:
{
    "invoice_number": "",
    "date": "",
    "vendor_name": "",
    "total_amount": "",
    "line_items": [
        {"description": "", "quantity": "", "unit_price": "", "amount": ""}
    ]
}'''
```
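The model's answer to a schema prompt like this comes back as plain text, so it pays to parse defensively: trim the output to the outermost JSON object (in case the model wraps it in prose or a code fence) before decoding. A small sketch; `parse_json_output` is an illustrative helper, and the sample string is not real model output:

```python
import json

def parse_json_output(text: str) -> dict:
    """Decode a JSON object from raw model output, tolerating extra text around it."""
    # Trim to the outermost {...} in case the model added prose or a code fence.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])

# Illustrative output, not a real GT-REX response:
raw = '```json\n{"invoice_number": "INV-001", "total_amount": "99.00"}\n```'
fields = parse_json_output(raw)
print(fields["invoice_number"])  # INV-001
```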

### Table Extraction (Markdown Format)

```python
prompt = "Extract all tables from this document in Markdown table format."
```

### Key-Value Pair Extraction

```python
prompt = '''Extract all key-value pairs from this form.
Return as:
Key: Value
Key: Value'''
```
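The flat `Key: Value` lines this prompt requests are straightforward to post-process into a dict. A sketch (`parse_key_values` is an illustrative helper; the sample text is not real model output):

```python
def parse_key_values(text: str) -> dict:
    """Turn 'Key: Value' lines from the model into a dict."""
    pairs = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that are not Key: Value pairs
        key, _, value = line.partition(":")  # split on the first colon only
        if key.strip():
            pairs[key.strip()] = value.strip()
    return pairs

# Illustrative output, not a real GT-REX response:
sample = "Name: Jane Doe\nDate: 2026-02-01\nAmount: $120.50"
print(parse_key_values(sample))
```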

### Handwritten Text Transcription

```python
prompt = "Transcribe all handwritten text from this image accurately."
```

### Multi-Document Batch Processing

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

# Prepare batch
image_paths = ["doc1.png", "doc2.png", "doc3.png"]
prompts = []
for path in image_paths:
    img = Image.open(path).convert("RGB")
    prompts.append({
        "prompt": "Extract all text from this document.",
        "multi_modal_data": {"image": img},
    })

# Run batch inference
sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)
outputs = llm.generate(prompts, sampling_params=sampling_params)

# Collect results
for i, output in enumerate(outputs):
    print(f"--- Document {i + 1} ---")
    print(output.outputs[0].text)
    print()
```

---

## Use Cases

| Domain | Application | Recommended Variant |
|--------|-------------|---------------------|
| **Finance** | Invoice processing, receipt scanning, bank statements | Pro / Nano |
| **Legal** | Contract analysis, clause extraction, legal filings | Ultra |
| **Healthcare** | Medical records, prescriptions, lab reports | Ultra |
| **Government** | Form processing, ID verification, tax documents | Pro |
| **Insurance** | Claims processing, policy documents | Pro |
| **Education** | Exam paper digitization, handwritten notes | Pro / Ultra |
| **Logistics** | Shipping labels, waybills, packing lists | Nano |
| **Real Estate** | Property documents, deeds, mortgage papers | Pro |
| **Retail** | Product catalogs, price tags, inventory lists | Nano |

---

## Performance Benchmarks

### Throughput by Variant (NVIDIA A100 80GB)

| Variant | Single Image | Batch (32) | Batch (128) |
|---------|--------------|------------|-------------|
| Nano | ~1.2s | ~15s | ~55s |
| Pro | ~3.5s | ~45s | ~170s |
| Ultra | ~7.0s | ~110s | ~380s |
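As a sanity check, the batch-of-128 timings above imply sustained throughput of roughly 140 docs/min for Nano and about 20 for Ultra, consistent with the variants table; real numbers depend on document content and hardware. The arithmetic, as a quick sketch (timings copied from the table, all approximate):

```python
# Approximate batch-of-128 timings from the table above, in seconds.
BATCH_128_SECONDS = {"Nano": 55, "Pro": 170, "Ultra": 380}

def docs_per_minute(batch_seconds: float, batch_size: int = 128) -> float:
    """Sustained throughput implied by one batch timing."""
    return batch_size / batch_seconds * 60

for variant, seconds in BATCH_128_SECONDS.items():
    print(f"{variant}: ~{docs_per_minute(seconds):.0f} docs/min")
```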

### Accuracy by Document Type (Pro Variant)

| Document Type | Character Accuracy | Field Accuracy |
|---------------|--------------------|----------------|
| Printed invoices | 98.5%+ | 96%+ |
| Typed contracts | 98%+ | 95%+ |
| Handwritten notes | 92%+ | 88%+ |
| Dense tables | 96%+ | 93%+ |
| Low-quality scans | 94%+ | 90%+ |

> **Note:** Benchmark numbers are approximate and may vary based on document quality, content complexity, and hardware configuration.

---

## Prompt Engineering Guide

Get the best results from GT-REX with these prompt strategies:

### Tips for Best Results

**Do:**
- Be specific about what to extract ("Extract the invoice number and total amount")
- Specify the output format ("Return as JSON", "Return as a Markdown table")
- Provide a schema for structured extraction (show the expected JSON keys)
- Use clear instructions ("Transcribe exactly as written, preserving spelling errors")

**Don't:**
- Use vague prompts ("What is this?")
- Ask for analysis or summarization (GT-REX is optimized for extraction)
- Include unrelated context in the prompt

### Example Prompts

```text
# Simple extraction
"Extract all text from this document."

# Targeted extraction
"Extract only the table on this page as a Markdown table."

# Schema-driven extraction
"Extract data matching this schema: {name: str, date: str, amount: float}"

# Preservation mode
"Transcribe this document exactly as written, preserving original formatting."
```

---

## API Integration

### FastAPI Server Example

```python
from fastapi import FastAPI, UploadFile
from PIL import Image
from vllm import LLM, SamplingParams
import io

app = FastAPI()

llm = LLM(
    model="gothitech/GT-REX",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.75,
    max_num_seqs=128,
    limit_mm_per_prompt={"image": 1},
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=4096)


@app.post("/extract")
async def extract_text(file: UploadFile, prompt: str = "Extract all text."):
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    outputs = llm.generate(
        [{
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        }],
        sampling_params=sampling_params,
    )

    return {"text": outputs[0].outputs[0].text}
```

### cURL Example

```bash
curl -X POST "http://localhost:8000/extract" \
  -F "file=@invoice.png" \
  -F "prompt=Extract all text from this invoice as JSON."
```

---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| **CUDA Out of Memory** | Reduce `gpu_memory_utilization` or switch to the Nano variant |
| **Slow inference** | Increase `max_num_seqs` for better batching; use Nano for speed |
| **Truncated output** | Increase `max_tokens` in `SamplingParams` |
| **Low accuracy on small text** | Switch to the Ultra variant for higher resolution |
| **Garbled multilingual text** | Ensure image resolution is sufficient; try the Ultra variant |
| **Empty output** | Check that the image is loaded correctly and is not blank |
| **Model loading errors** | Ensure `trust_remote_code=True` is set |

---

## Hardware Recommendations

| Variant | Minimum GPU | Recommended GPU |
|---------|-------------|-----------------|
| Nano | NVIDIA T4 (16 GB) | NVIDIA A10 (24 GB) |
| Pro | NVIDIA A10 (24 GB) | NVIDIA A100 (40 GB) |
| Ultra | NVIDIA A100 (40 GB) | NVIDIA A100 (80 GB) |

---

## License

This model is released under the **MIT License**. You are free to use, modify, and distribute it for both commercial and non-commercial purposes.

---

## Citation

If you use GT-REX in your work, please cite:

```bibtex
@misc{gtrex-2026,
  title  = {GT-REX: Production-Grade OCR with Vision-Language Models},
  author = {Hathaliya, Jenis},
  year   = {2026},
  month  = {February},
  url    = {https://huggingface.co/gothitech/GT-REX},
  note   = {GothiTech Recognition and Extraction eXpert}
}
```

---

## Contact and Support

- **Developer:** Jenis Hathaliya
- **Organization:** GothiTech
- **HuggingFace:** [gothitech](https://huggingface.co/gothitech)

---

<p align="center">
Built by <strong>GothiTech</strong>
</p>

<p align="center">
<em>Last updated: February 2026</em><br>
<em>GT-REX | Variants: Nano | Pro | Ultra</em>
</p>