Instructions to use prithivMLmods/Megalodon-OCR-Sync-0713 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use prithivMLmods/Megalodon-OCR-Sync-0713 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/Megalodon-OCR-Sync-0713")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/Megalodon-OCR-Sync-0713")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/Megalodon-OCR-Sync-0713")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use prithivMLmods/Megalodon-OCR-Sync-0713 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "prithivMLmods/Megalodon-OCR-Sync-0713"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "prithivMLmods/Megalodon-OCR-Sync-0713",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker
docker model run hf.co/prithivMLmods/Megalodon-OCR-Sync-0713
- SGLang
How to use prithivMLmods/Megalodon-OCR-Sync-0713 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Megalodon-OCR-Sync-0713" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "prithivMLmods/Megalodon-OCR-Sync-0713",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/Megalodon-OCR-Sync-0713" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "prithivMLmods/Megalodon-OCR-Sync-0713",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        }
                    }
                ]
            }
        ]
    }'
- Docker Model Runner
How to use prithivMLmods/Megalodon-OCR-Sync-0713 with Docker Model Runner:
docker model run hf.co/prithivMLmods/Megalodon-OCR-Sync-0713
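The curl requests shown for vLLM and SGLang target the same OpenAI-compatible route, so they can also be sent from Python. The following is a minimal sketch using only the standard library. `build_chat_payload` and `chat` are hypothetical helper names introduced for illustration, and the optional `max_tokens` field is an addition that does not appear in the curl examples.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/Megalodon-OCR-Sync-0713"


def build_chat_payload(prompt: str, image_url: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,  # assumption: not part of the original curl call
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to <base_url>/v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers place the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With a server started as described above, `chat("http://localhost:8000", payload)` would address vLLM and `chat("http://localhost:30000", payload)` would address SGLang, where `payload` comes from `build_chat_payload`.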
Megalodon-OCR-Sync-0713
The Megalodon-OCR-Sync-0713 model is a fine-tuned version of Qwen2.5-VL-3B-Instruct, optimized for document retrieval, content extraction, and analysis recognition. Built on the Qwen2.5-VL architecture, it strengthens document comprehension through focused training on 200K image pairs drawn from a mixture of captioning datasets. These include 70K pairs from the Corvus-OCR-Caption-Mix dataset, plus modular combinations of other open-source datasets suited to document OCR captioning, image reasoning, and visual analysis. The training mix covers all categories of images at varying dimensions.
Key Enhancements
- Context-Aware Multimodal Extraction and Linking for Documents: Understands document context and establishes connections between multimodal elements within a document.
- Enhanced Document Retrieval: Designed to efficiently locate and extract relevant information from complex document structures and layouts.
- Superior Content Extraction: Optimized for precise extraction of structured and unstructured content from diverse document formats.
- Analysis Recognition: Specialized in recognizing and interpreting analytical content, charts, tables, and visual data representations.
- State-of-the-Art Performance Across Resolutions: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.
- Video Understanding of 20+ Minutes: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multimodal reasoning.
- Visually-Grounded Device Interaction: Enables mobile or robotic device operation from visual inputs and text-based instructions, using contextual understanding and decision-making logic.
Quick Start with Transformers
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Megalodon-OCR-Sync-0713", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Megalodon-OCR-Sync-0713")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template to text and collect the image and video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Note: the model is not expected to perform as well on Indian languages.
Intended Use
This model is intended for:
- Context-aware multimodal extraction and linking for complex document structures.
- High-fidelity document retrieval and content extraction from various document formats.
- Analysis recognition of charts, graphs, tables, and visual data representations.
- Document-based question answering for educational and enterprise applications.
- Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content.
- Retrieval and summarization from long documents, slides, and multi-modal inputs.
- Multilingual document analysis and structured content extraction for global use cases.
- Robotic or mobile automation with vision-guided contextual interaction.
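Several of the uses listed above come down to pairing a document image with a task-specific instruction. A minimal sketch of that pattern follows. The prompt wording and the names `TASK_PROMPTS` and `build_task_messages` are illustrative assumptions, not prompts recommended by the model authors; the message layout mirrors the Quick Start example.

```python
# Hypothetical prompt wording; the model card does not prescribe specific prompts.
TASK_PROMPTS = {
    "ocr": "Extract all text from this document, preserving the reading order.",
    "table": "Extract the table in this image as Markdown.",
    "latex": "Transcribe the mathematical expressions in this image as LaTeX.",
    "summary": "Summarize the key points of this document page.",
}


def build_task_messages(image: str, task: str) -> list:
    """Return a single-turn chat message list for one image and one named task.

    `image` is a URL or local path, in the same form the Quick Start example uses.
    """
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(TASK_PROMPTS)}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": TASK_PROMPTS[task]},
            ],
        }
    ]
```

The returned list can be passed wherever the Quick Start code uses `messages`, for example `build_task_messages("page.png", "latex")` for formula transcription, where `page.png` is a placeholder file name.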
Limitations
- May show degraded performance on extremely low-quality or occluded images.
- Not optimized for real-time applications on low-resource or edge devices due to computational demands.
- Variable accuracy on uncommon or low-resource languages/scripts.
- Long video processing may require substantial memory and is not optimized for streaming applications.
- Visual token settings affect performance; suboptimal configurations can impact results.
- In rare cases, outputs may contain hallucinated or contextually misaligned information.
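On the visual token point: the upstream Qwen2.5-VL documentation describes `min_pixels` and `max_pixels` processor arguments that bound image resolution, with each visual token covering roughly a 28x28 pixel area. Assuming this fine-tune inherits that behavior, the sketch below converts a token range into a pixel budget. `pixel_budget` and `load_processor` are hypothetical helper names, and the 256 to 1280 token range is only the example range from the upstream documentation, not a tuned recommendation for this model.

```python
MODEL_ID = "prithivMLmods/Megalodon-OCR-Sync-0713"

# Assumption from the upstream Qwen2.5-VL docs: one visual token per 28x28 pixels.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> dict:
    """Convert a visual-token range into min_pixels / max_pixels values."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return {
        "min_pixels": min_tokens * PIXELS_PER_VISUAL_TOKEN,
        "max_pixels": max_tokens * PIXELS_PER_VISUAL_TOKEN,
    }


def load_processor(min_tokens: int = 256, max_tokens: int = 1280):
    """Load the processor with a bounded visual-token budget."""
    # Imported lazily so that pixel_budget can be used without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(MODEL_ID, **pixel_budget(min_tokens, max_tokens))
```

A smaller `max_tokens` reduces memory use and latency at the cost of fine detail, which tends to matter for dense document pages; the right trade-off would need to be checked on representative inputs.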
Model tree for prithivMLmods/Megalodon-OCR-Sync-0713
Base model
Qwen/Qwen2.5-VL-3B-Instruct