The sections below show how to use prithivMLmods/Gliese-OCR-7B-Post2.0-final with Transformers, vLLM, SGLang, and Docker Model Runner.

Libraries

How to use prithivMLmods/Gliese-OCR-7B-Post2.0-final with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/Gliese-OCR-7B-Post2.0-final")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("prithivMLmods/Gliese-OCR-7B-Post2.0-final")
model = AutoModelForMultimodalLM.from_pretrained("prithivMLmods/Gliese-OCR-7B-Post2.0-final")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prithivMLmods/Gliese-OCR-7B-Post2.0-final with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/Gliese-OCR-7B-Post2.0-final"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Gliese-OCR-7B-Post2.0-final",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
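
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `http://localhost:8000`. The helper names `build_chat_payload` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    model="prithivMLmods/Gliese-OCR-7B-Post2.0-final",
    prompt="Describe this image in one sentence.",
    image_url="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

With the server running, `print(chat("http://localhost:8000", payload))` prints the model's reply. Pointing `base_url` at `http://localhost:30000` should work the same way for an SGLang server, since both expose the OpenAI-compatible route.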

Use Docker

docker model run hf.co/prithivMLmods/Gliese-OCR-7B-Post2.0-final

SGLang

How to use prithivMLmods/Gliese-OCR-7B-Post2.0-final with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Gliese-OCR-7B-Post2.0-final" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Gliese-OCR-7B-Post2.0-final",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/Gliese-OCR-7B-Post2.0-final" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request shown in the pip-based SGLang setup above (port 30000).

Docker Model Runner
How to use prithivMLmods/Gliese-OCR-7B-Post2.0-final with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/Gliese-OCR-7B-Post2.0-final
```

Gliese-OCR-7B-Post2.0-final

The Gliese-OCR-7B-Post2.0-final model is a refined and optimized version of Gliese-OCR-7B-Post1.0, built upon the Qwen2.5-VL architecture. It represents the final iteration in the Gliese-OCR series, offering enhanced efficiency, precision, and visualization capabilities for document OCR, visual analysis, and information extraction.

Fine-tuned with extended document visualization data and OCR-focused objectives, this model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports.

Key Enhancements

Optimized Document Visualization and OCR Pipeline: Significantly improved recognition of text, layout, and embedded visuals for structured document understanding.
Context-Aware Multimodal Linking: Enhanced understanding of document context with stronger alignment between text, images, and layout components.
Refined Document Retrieval: Improved retrieval accuracy from complex layouts and multi-page documents.
High-Fidelity Content Extraction: Precise extraction of structured, semi-structured, and unstructured information with advanced text normalization.
Analytical Recognition: Superior reasoning over charts, graphs, tables, and mathematical equations.
Improved Visual Reasoning and Layout Awareness: Trained on document visualization datasets for advanced spatial and semantic comprehension.
State-of-the-Art Performance Across Resolutions: Achieves top results on benchmarks such as DocVQA, InfographicVQA, MathVista, and RealWorldQA.
Extended Multimodal Duration Support: Handles long document sequences and extended videos (20+ minutes).
Final Release Stability: Consolidates all prior improvements for stable and reliable performance.

Quick Start with Transformers

# Requires the qwen-vl-utils package (pip install qwen-vl-utils).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Gliese-OCR-7B-Post2.0-final", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Gliese-OCR-7B-Post2.0-final")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe the document structure and extract key text content."},
        ],
    }
]

# Render the chat template to a prompt string and load the referenced image(s).
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)  # follow the model's device instead of hard-coding "cuda"

# Generate, then strip the prompt tokens so only the newly generated text is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
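
OCR inputs are often local scans rather than hosted images. The sketch below builds a `messages` list for a local file, assuming `process_vision_info` accepts `file://` URIs as described in the Qwen2.5-VL usage notes. The helper name `local_image_message` and the example path are hypothetical.

```python
from pathlib import Path


def local_image_message(image_path: str, prompt: str) -> list:
    """Build a one-turn chat message that points at a local image file."""
    # Path.as_uri() needs an absolute path, so resolve it first.
    image_uri = Path(image_path).resolve().as_uri()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_uri},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical path; replace it with a real scan on disk.
messages = local_image_message("scans/invoice_page_1.png", "Extract all text from this page.")
print(messages[0]["content"][0]["image"])
```

The resulting `messages` can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the URL-based example. Note that `as_uri()` percent-encodes spaces and non-ASCII characters, so for such paths it may be safer to pass the plain absolute path string instead.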

Intended Use

Document visualization and OCR extraction tasks.
Context-aware document retrieval and multimodal linking.
Extraction and LaTeX formatting of equations and structured content.
Analytical document interpretation (charts, tables, graphs, and figures).
Multilingual OCR for enterprise, academic, and research use cases.
Summarization, question answering, and cross-modal reasoning over long documents.
Intelligent robotic or mobile automation guided by visual document input.
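
Several of these uses differ only in the instruction given alongside the image. As a rough illustration, the sketch below maps task names to prompts and builds chat messages in the format used in the Quick Start. The prompt wording, the task names, and the `build_messages` helper are hypothetical and not taken from the model's training setup.

```python
# Hypothetical prompt templates; the wording is illustrative only.
TASK_PROMPTS = {
    "ocr": "Extract all text from this document, preserving the reading order.",
    "latex": "Transcribe every equation in this image as LaTeX.",
    "table": "Convert the table in this image to Markdown.",
    "summary": "Summarize the key points of this document.",
}


def build_messages(image: str, task: str) -> list:
    """Return a one-turn chat message pairing an image with a task prompt."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASK_PROMPTS)}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": TASK_PROMPTS[task]},
            ],
        }
    ]


messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", "latex"
)
```

Keeping the prompts in one table makes it easier to compare phrasings per task; which wording works best for this model is something to check empirically.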

Limitations

Reduced accuracy on heavily degraded or occluded documents.
High computational requirements for large-scale or real-time applications.
Limited optimization for low-resource or edge devices.
Occasional misalignment in text layout or minor hallucinations in outputs.
Performance may vary depending on visual token configuration and context length settings.
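
On the last point, the Qwen2.5-VL documentation describes bounding the number of visual tokens per image through the processor's `min_pixels` and `max_pixels` arguments, with each visual token covering roughly a 28x28 pixel area. The sketch below converts a token budget into those pixel bounds. It assumes this fine-tune keeps the base model's image processor, and the helper names are illustrative; newer Transformers releases may expose these settings differently.

```python
# Assumption from the Qwen2.5-VL docs: one visual token per 28x28 pixel area.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_VISUAL_TOKEN, max_tokens * PIXELS_PER_VISUAL_TOKEN


def approx_visual_tokens(width: int, height: int, min_pixels: int, max_pixels: int) -> int:
    """Rough token estimate after the image is scaled into the pixel budget.

    The real processor also rounds each side to a multiple of 28, so this is
    only an approximation.
    """
    pixels = min(max(width * height, min_pixels), max_pixels)
    return pixels // PIXELS_PER_VISUAL_TOKEN


def load_processor(model_id: str, min_tokens: int = 256, max_tokens: int = 1280):
    """Load the processor with a bounded visual-token budget."""
    from transformers import AutoProcessor  # lazy import; needs transformers installed

    min_pixels, max_pixels = pixel_budget(min_tokens, max_tokens)
    return AutoProcessor.from_pretrained(
        model_id, min_pixels=min_pixels, max_pixels=max_pixels
    )


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
# An A4 page scanned at 300 dpi (2480x3508) is capped at the upper bound.
print(approx_visual_tokens(2480, 3508, min_pixels, max_pixels))  # 1280
```

A larger `max_tokens` keeps more detail for dense pages at the cost of memory and latency, while a smaller one speeds up inference but may blur small print.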