---
pipeline_tag: image-text-to-text
tags:
- visual-document-understanding
- visual-question-answering
- indian-documents
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- bharatgenai/patram-7b-instruct
---
# Patram-7B-Instruct
Patram-7B-Instruct by BharatGen is a 7B-parameter vision-language model trained from scratch for visual document understanding. As India’s first document foundation model, it is built to tackle complex document analysis.
The model was trained on a carefully curated instruction-tuning dataset that combines diverse public data with custom synthetic data, designed to support a broad spectrum of document-understanding tasks.
## Model Overview
* **Architecture:** Vision Transformer (ViT) + MLP projector + OLMo-7B LLM (see the schematic sketch after this list)
* **Training Data:** BharatDocs-v1, a dataset of diverse Indian documents, plus other open-source document datasets
* **Supported I/O Formats:** The model currently accepts English-language instructions and image files (e.g., PNG, JPEG) as input; output is plain text.
* **Language:** English (support for Indian languages is upcoming)
* **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
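The composition follows the common vision-language recipe: the ViT encodes the page image into patch features, the MLP projector maps those features into the LLM's embedding space, and the projected visual tokens are consumed by the decoder alongside the text tokens. A schematic sketch of this flow (module names, dimensions, and the token-prepend strategy are illustrative assumptions, not the actual implementation):
```python
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    """Illustrative ViT -> MLP projector -> LLM composition; dimensions are made up."""
    def __init__(self, vit, llm, vit_dim=1024, llm_dim=4096):
        super().__init__()
        self.vit = vit                    # vision encoder: image -> patch features
        self.projector = nn.Sequential(   # MLP mapping patch features into LLM space
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                    # decoder-only LLM (OLMo-7B in Patram)

    def forward(self, pixel_values, input_ids):
        patch_feats = self.vit(pixel_values)        # (B, num_patches, vit_dim)
        image_embeds = self.projector(patch_feats)  # (B, num_patches, llm_dim)
        text_embeds = self.llm.embed_tokens(input_ids)
        # Visual tokens are placed before the text tokens for decoding
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```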
## Usage Examples
The model runs with the `transformers` library; the custom Patram processing code is loaded via `trust_remote_code=True`.
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM, GenerationConfig
from PIL import Image
import requests

# Model ID and device setup
model_id = "bharatgenai/patram-7b-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load processor and model (the custom Patram code requires trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True
).to(device)

def get_patram_response(image_path_or_url, question):
    try:
        # Load image from a URL or a local path
        if image_path_or_url.startswith("http"):
            image = Image.open(requests.get(image_path_or_url, stream=True).raw).convert("RGB")
        else:
            image = Image.open(image_path_or_url).convert("RGB")
    except Exception as e:
        print(f"Error loading image: {e}")
        return None

    # Format the prompt as expected by the model
    prompt = f"Question: {question} Answer based on the image."

    try:
        # Preprocess image and text using the processor
        inputs = processor.process(images=[image], text=prompt)
        inputs = {k: v.to(device).unsqueeze(0) for k, v in inputs.items()}

        # Generate with the model's generate_from_batch method (Patram-specific)
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer
        )

        # Extract the newly generated tokens (excluding input tokens) and decode
        generated_tokens = output[0, inputs["input_ids"].size(1):]
        return processor.tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
    except Exception as e:
        print(f"Error during inference: {e}")
        return None

# Example usage:
# image_input = "https://knowscope.in/wp-content/uploads/2025/05/cghd-nag.png"
# question = "Who issued this notice?"
# answer = get_patram_response(image_input, question)
# if answer:
#     print("Answer:", answer)
```
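If GPU memory is tight, the weights can be loaded in half precision, roughly halving the memory footprint. A variant of the loading step above, assuming the remote Patram code tolerates reduced precision (fall back to full precision if outputs degrade):
```python
# Half-precision load on GPU; keep float32 on CPU for numerical stability
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
).to(device)
```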
**Note**: If you're running this on an Apple Silicon (M1/M2/M3/M4/...) machine, follow the official PyTorch and Hugging Face documentation for installing dependencies; a device-selection tweak is sketched after these links:
- [PyTorch's official guide on installation (macOS)](https://pytorch.org/get-started/locally/#:~:text=torch%20torchvision%20torchaudio-,Installing%20on%20macOS,-PyTorch%20can%20be)
- [Hugging Face Transformers performance tips](https://huggingface.co/docs/transformers/main/en/perf_train_special)
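On Apple Silicon, PyTorch exposes the GPU through the `mps` backend; a hedged tweak to the device setup in the example above (operator coverage and performance vary by PyTorch version):
```python
# Prefer CUDA, then Apple's Metal (MPS) backend, then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
```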
## Evaluations
We evaluated Patram-7B-Instruct alongside other open and proprietary vision-language models (VLMs), primarily in the 7B–9B parameter range, across multiple public document benchmarks.
**Benchmarks**: DocVQA, VisualMRC, Patram-Bench
Patram-Bench is an in-house benchmark designed for Indic Document VQA.
**Metric**: G-Eval (LLM-as-a-judge; sketched below the table)
| Model | Overall | DocVQA | Patram-Bench | VisualMRC |
| ---------------------- | ------- | ------ | ------------ | --------- |
| claude-3.7-sonnet | 0.8830 | 0.8480 | 0.8857 | 0.8830 |
| Qwen2.5-VL-7B-Instruct | 0.8759 | 0.8722 | 0.6816 | 0.9169 |
| gemma-3-12b-it | 0.8556 | 0.8451 | 0.6349 | 0.9069 |
| **patram-7b-instruct** | 0.8331 | 0.8550 | 0.6515 | 0.8510 |
| InternVL3-9B | 0.7865 | 0.8681 | 0.6888 | 0.7405 |
| deepseek-vl2 | 0.7581 | 0.8739 | 0.5089 | 0.7144 |
*Note: The benchmarked results reflect the API variants of the respective models.*
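For readers unfamiliar with G-Eval, a judge LLM is shown the question, the reference answer, and the candidate answer, and returns a correctness score that is averaged over the benchmark. A minimal sketch of the idea, with a hypothetical `judge_llm` callable and an illustrative 0–1 scale (not the exact harness used for the table above):
```python
# Illustrative G-Eval-style judging; judge_llm is a hypothetical text-in, text-out callable
JUDGE_PROMPT = """You are grading a document-VQA answer.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Rate the model answer from 0.0 (wrong) to 1.0 (fully correct) based on
factual agreement with the reference. Reply with only the number."""

def g_eval_score(judge_llm, question, reference, candidate):
    reply = judge_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return float(reply.strip())

# Benchmark score = mean of per-example judge scores:
# overall = sum(g_eval_score(judge, q, r, a) for q, r, a in examples) / len(examples)
```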
## Citation
```bibtex
@online{BharatGenPatramLaunch2025,
author = {{BharatGen Team}},
title = {BharatGen Unveils Patram: India's Pioneering Vision-Language Foundation Model for Document Intelligence},
year = {2025},
url = {https://bharatgen.com/blog/patram-launch},
urldate = {2025-06-02}
}
```
## Resources
* **Model**: [huggingface.co/bharatgenai/patram-7b-instruct](https://huggingface.co/bharatgenai/patram-7b-instruct)
* **Project Page**: [bharatgen.com/patram](https://bharatgen.com/patram)
* **Blog**: [bharatgen.com/blog/patram-launch](https://bharatgen.com/blog/patram-launch)
## Authors
* **Principal Investigators**: Prof. Ravi Kiran Sarvadevabhatla, Prof. Ganesh Ramakrishnan
* **Contributors**: BharatGen Team
## Contact
* [Contact Form](https://bharatgen.com/contact)
* Hugging Face Community Tab