Add GT-REX variants (Nano/Pro/Ultra) to model card

Browse files

Files changed (1) hide show

README.md +61 -227

README.md CHANGED Viewed

@@ -19,271 +19,105 @@ pipeline_tag: image-text-to-text
 **GT-REX-v4** is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing.
-## 🎯 Key Features
-- **High Accuracy**: Advanced vision-language architecture for precise text extraction
-- **Multi-Language Support**: Handles documents in multiple languages
-- **Production Ready**: Optimized for deployment with vLLM inference engine
-- **Batch Processing**: Process hundreds of documents per minute
-- **Flexible Prompts**: Support for structured extraction (JSON, tables, forms)
-- **Handwriting Support**: Capable of transcribing handwritten text
-## 📊 Model Details
-| Attribute | Value |
-|-----------|-------|
-| **Developer** | GothiTech (Jenis Hathaliya) |
-| **Architecture** | Vision-Language Model (VLM) |
-| **Model Size** | ~6.5 GB |
-| **Parameters** | ~7B |
-| **License** | MIT |
-| **Release Date** | February 2026 |
-| **Precision** | BF16/FP16 |
-| **Input Resolution** | Up to 1024x1024 |
-## 🚀 Use Cases
-### Enterprise Applications
-- 📄 **Document Digitization**: Convert scanned documents to editable text
-- 🧾 **Invoice & Receipt Processing**: Extract structured data from financial documents
-- 📋 **Form Automation**: Auto-fill and process forms from images
-- 📑 **Contract Analysis**: Extract key terms and clauses from legal documents
-- 🏥 **Medical Records**: Digitize patient records and prescriptions
-- 📦 **Logistics**: Process shipping labels, delivery notes, and manifests
-### Advanced Features
-- ✍️ **Handwriting Recognition**: Transcribe handwritten notes and forms
-- 🌍 **Multi-language OCR**: Support for English, Spanish, French, German, Chinese, and more
-- 📊 **Table Extraction**: Parse complex tables with accurate cell detection
-- 🎨 **Layout Understanding**: Maintain document structure and formatting
-- 🔍 **Selective Extraction**: Target specific fields with custom prompts
-## 💻 Installation
-```bash
-pip install vllm pillow torch transformers
-```
-## 🔧 Usage
-### Basic Usage with vLLM
 ```python
-from vllm import LLM, SamplingParams
-from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
-from PIL import Image
-# Initialize model
 llm = LLM(
     model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
-    max_model_len=4096,
-    gpu_memory_utilization=0.75,
     logits_processors=[NGramPerReqLogitsProcessor],
 )
-# Load document
-image = Image.open('invoice.jpg')
-prompt = '<image>\n<|grounding|>Extract all text from this document.'
-# Generate
-result = llm.generate(
-    {'prompt': prompt, 'multi_modal_data': {'image': image}},
-    SamplingParams(temperature=0.0, max_tokens=2000)
-)
-# Extract text
-print(result.outputs.text)
 ```
-### Structured Data Extraction (JSON)
-```python
-# Extract specific fields in JSON format
-prompt = '''<image>\n<|grounding|>Extract the following information in JSON format:
-- invoice_number
-- date
-- vendor_name
-- total_amount
-- line_items (list)'''
-result = llm.generate(
-    {'prompt': prompt, 'multi_modal_data': {'image': invoice_image}},
-    SamplingParams(temperature=0.0, max_tokens=2000)
-)
-import json
-data = json.loads(result.outputs.text)
-print(data)
-```
-### Batch Processing
-```python
-# Process multiple documents efficiently
-from pathlib import Path
-doc_paths = list(Path('documents/').glob('*.jpg'))
-images = [Image.open(p) for p in doc_paths]
-prompts = [
-    {'prompt': '<image>\n<|grounding|>Extract all text.',
-     'multi_modal_data': {'image': img}}
-    for img in images
-]
-# Batch inference
-results = llm.generate(
-    prompts,
-    SamplingParams(temperature=0.0, max_tokens=2000)
-)
-for i, result in enumerate(results):
-    text = result.outputs.text
-    print(f'Document {i}: {text[:100]}...')
-```
-### Table Extraction
-```python
-# Extract tables with structure preservation
-prompt = '<image>\n<|grounding|>Extract all tables in markdown format.'
-result = llm.generate(
-    {'prompt': prompt, 'multi_modal_data': {'image': table_image}},
-    SamplingParams(temperature=0.0, max_tokens=3000)
-)
-markdown_table = result.outputs.text
-print(markdown_table)
-```
-## 📈 Performance Benchmarks
-| Metric | T4 GPU | V100 GPU | A100 GPU |
-|--------|---------|----------|----------|
-| **Latency (single image)** | 3-5 sec | 2-3 sec | 1-2 sec |
-| **Throughput (batch=8)** | ~60 img/min | ~120 img/min | ~200 img/min |
-| **GPU Memory** | 6-8 GB | 8-10 GB | 10-12 GB |
-| **Max Resolution** | 1024x1024 | 1024x1024 | 1024x1024 |
-## ⚙️ System Requirements
-### Minimum Requirements
-```
-Python >= 3.8
-PyTorch >= 2.0
-CUDA >= 11.8
-GPU Memory: 15GB+ (T4 or better)
-vLLM >= 0.15.0
-```
-### Recommended Setup
-```
-Python 3.10+
-PyTorch 2.1+
-CUDA 12.1+
-GPU: A100 (40GB) or V100 (32GB)
-vLLM 0.16+
-```
-## 🎛️ Advanced Configuration
-### Optimize for Throughput
 ```python
 llm = LLM(
     model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
-    tensor_parallel_size=2,  # Multi-GPU
     max_num_seqs=128,
-    max_num_batched_tokens=8192,
-    gpu_memory_utilization=0.9,
 )
 ```
-### Optimize for Latency
 ```python
 llm = LLM(
     model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
-    max_num_seqs=1,
-    gpu_memory_utilization=0.6,
-    enable_prefix_caching=True,
 )
 ```
-## 📝 Supported Prompt Templates
-### General Extraction
-- `Extract all text from this document`
-- `Transcribe the entire page`
-- `Convert this image to text`
-### Structured Extraction
-- `Extract invoice number, date, and total in JSON format`
-- `Parse all form fields as key-value pairs`
-- `Extract table data in CSV format`
-### Selective Extraction
-- `Extract only the recipient address`
-- `Find and extract all dates`
-- `Extract signature fields`
-## 🏆 Model Capabilities
-✅ **Printed Text**: High accuracy on machine-printed documents
-✅ **Handwriting**: Good performance on clear handwritten text
-✅ **Tables**: Accurate cell detection and structure preservation
-✅ **Multi-column**: Handles complex layouts
-✅ **Low Quality**: Works on scanned and photographed documents
-✅ **Mixed Content**: Text + images + tables in same document
-## 🔒 Limitations
-- Requires GPU for inference (CPU inference not supported)
-- Maximum input resolution: 1024x1024 pixels
-- Performance may vary on heavily degraded or low-contrast images
-- Complex mathematical formulas may require specialized prompts
-## 👨‍💻 Developer
-**Jenis Hathaliya** - AI Engineer at GothiTech
-Specializing in production AI systems, document intelligence, and enterprise ML deployment.
-- 🌐 HuggingFace: [@developerJenis](https://huggingface.co/developerJenis)
-- 💻 GitHub: [@developerJenis](https://github.com/developerJenis)
-- 🏢 Company: GothiTech - AI Solutions for Enterprise
-## 📞 Support & Contact
-For enterprise support, custom deployments, or commercial licensing:
-- Open an issue on GitHub
-- Contact via HuggingFace profile
-## 📄 License
-This model is released under the MIT License. See LICENSE file for details.
-## 🙏 Acknowledgments
-Built with cutting-edge ML frameworks and optimized for production deployment.
-## 📖 Citation
-If you use GT-REX-v4 in your research or production systems, please cite:
-```bibtex
-@misc{gtrex-v4-2026,
-  title={GT-REX-v4: Production OCR Model for Enterprise Document Understanding},
-  author={Jenis Hathaliya},
-  year={2026},
-  publisher={GothiTech},
-  url={https://huggingface.co/developerJenis/GT-REX-v4},
-  note={Production-grade vision-language model for OCR and document AI}
-}
 ```
 ---
 *Last updated: February 2026*

 **GT-REX-v4** is a state-of-the-art production-grade OCR model developed by GothiTech for enterprise document understanding, text extraction, and intelligent document processing.
+## ⚙️ GT-REX Variants
+GT-REX-v4 supports **three optimized configurations** for different performance requirements:
+| Variant | Speed | Accuracy | Resolution | GPU Memory | Throughput | Best For |
+|---------|-------|----------|------------|------------|------------|----------|
+| **🚀 Nano** | ⚡⚡⚡⚡⚡ | ⭐⭐⭐ | 640px | 4-6 GB | 100-150 docs/min | High-volume batch |
+| **⚡ Pro** | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | 1024px | 6-10 GB | 50-80 docs/min | Standard workflows |
+| **🎯 Ultra** | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | 1536px | 10-15 GB | 20-30 docs/min | High-accuracy needs |
+### 🚀 GT-Rex-Nano
+**Speed-optimized for high-volume batch processing**
+- **Resolution**: 640×640px
+- **Speed**: ~1-2s per image
+- **Max Tokens**: 2048
+- **Best for**: Thumbnails, previews, high-throughput pipelines (100+ docs)
 ```python
 llm = LLM(
     model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
+    max_model_len=2048,
+    gpu_memory_utilization=0.6,
+    max_num_seqs=256,
     logits_processors=[NGramPerReqLogitsProcessor],
 )
 ```
+### ⚡ GT-Rex-Pro (Default)
+**Balanced quality and speed for standard documents**
+- **Resolution**: 1024×1024px
+- **Speed**: ~2-5s per image
+- **Max Tokens**: 4096
+- **Best for**: Contracts, forms, invoices, reports
 ```python
 llm = LLM(
     model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
+    max_model_len=4096,
+    gpu_memory_utilization=0.75,
     max_num_seqs=128,
+    logits_processors=[NGramPerReqLogitsProcessor],
 )
 ```
+### 🎯 GT-Rex-Ultra
+**Maximum quality with adaptive processing**
+- **Resolution**: 1536×1536px
+- **Speed**: ~5-10s per image
+- **Max Tokens**: 8192
+- **Best for**: Legal documents, fine print, dense tables, medical records
 ```python
 llm = LLM(
     model='developerJenis/GT-REX-v4',
     trust_remote_code=True,
+    max_model_len=8192,
+    gpu_memory_utilization=0.85,
+    max_num_seqs=64,
+    logits_processors=[NGramPerReqLogitsProcessor],
 )
 ```
+## 🎯 Key Features
+- **High Accuracy**: Advanced vision-language architecture for precise text extraction
+- **Multi-Language Support**: Handles documents in multiple languages
+- **Production Ready**: Optimized for deployment with vLLM inference engine
+- **Batch Processing**: Process hundreds of documents per minute
+- **Flexible Prompts**: Support for structured extraction (JSON, tables, forms)
+- **Handwriting Support**: Capable of transcribing handwritten text
+- **Three Optimized Variants**: Nano, Pro, and Ultra for different use cases
+## 📊 Model Details
+| Attribute | Value |
+|-----------|-------|
+| **Developer** | GothiTech (Jenis Hathaliya) |
+| **Architecture** | Vision-Language Model (VLM) |
+| **Model Size** | ~6.5 GB |
+| **Parameters** | ~7B |
+| **License** | MIT |
+| **Release Date** | February 2026 |
+| **Precision** | BF16/FP16 |
+| **Input Resolution** | 640px - 1536px (variant dependent) |
+## 🚀 Use Cases
+## 💻 Installation
+```bash
+pip install vllm pillow torch transformers
 ```
 ---
 *Last updated: February 2026*
+*Model Version: v4.0 | Variants: Nano | Pro | Ultra*