Instructions to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with libraries, inference providers, notebooks, and local apps. The sections below show how to get started with each option.
- Libraries
- Transformers
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/Enesidaon-VLR-7B-no-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with vLLM:
Install from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "prithivMLmods/Enesidaon-VLR-7B-no-Thinking"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
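Because the vLLM server exposes an OpenAI-compatible API, it can also be called from Python. A minimal sketch using the `openai` client; the base URL and port match the `vllm serve` defaults above, and the image URL and token budget are illustrative:

```python
# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```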
- SGLang
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with SGLang:
Install from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "prithivMLmods/Enesidaon-VLR-7B-no-Thinking" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Enesidaon-VLR-7B-no-Thinking" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with Docker Model Runner:
```bash
docker model run hf.co/prithivMLmods/Enesidaon-VLR-7B-no-Thinking
```
Enesidaon-VLR-7B-no-Thinking
The Enesidaon-VLR-7B-no-Thinking model is an experimental, high-fidelity vision-language reasoning model designed for fine-grained multimodal comprehension. Built on top of Qwen2.5-VL-7B-Instruct, it improves image captioning, sampled video reasoning, and detailed document understanding. Unlike standard approaches, it explicitly anchors its textual reasoning steps to visual coordinates, enabling precise and explainable multimodal reasoning. The model is trained with supervised fine-tuning (SFT) on visually grounded reasoning traces and further optimized with GRPO reinforcement learning, yielding strong chain-of-thought reasoning without overthinking or unnecessary hallucination.
Key Enhancements
- Visually-Grounded Reasoning and Explanation: Explicitly anchors reasoning chains to image regions and document elements for transparent, explainable multimodal outputs.
- Advanced Image Captioning: Produces context-aware, detailed captions with grounded reasoning for improved visual understanding.
- Sampled Video Reasoning: Handles long-duration video inputs with temporal reasoning for content summarization and QA (a video-input sketch follows the Quick Start code below).
- Context-Aware Document Analysis: Excels in document retrieval, structured and unstructured content extraction, and analytical content recognition.
- Fine-Grained Visual Grounding: Enhanced capability for multimodal linking across charts, tables, and graphical elements with spatial grounding.
- Reinforcement-Learned Reasoning: Trained with GRPO to incentivize accurate, grounded reasoning aligned with visual cues.
- State-of-the-Art Benchmarking: Competitive results on OCR, visual QA, and reasoning tasks including DocVQA, MathVista, RealWorldQA, and MTVQA.
Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
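# Load the checkpoint with automatic dtype selection and device placement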
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"prithivMLmods/Enesidaon-VLR-7B-no-Thinking", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image with reasoning."},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
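# Collect the image and video inputs referenced in the messages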
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
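# Generate, then trim the prompt tokens so only the new output is decoded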
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
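For sampled video reasoning, the same pipeline accepts a video entry in the message content. The following is a minimal sketch reusing the `model` and `processor` loaded above; the video path, prompt, and generation settings are illustrative placeholders, and decoding frames requires a video backend (e.g. torchvision or decord) to be available to `qwen_vl_utils`:

```python
from qwen_vl_utils import process_vision_info

# The video path below is a placeholder; local "file://" paths and HTTP(S) URLs are supported.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "Summarize the key events in this video."},
        ],
    }
]

text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```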
Intended Use
This model is intended for:
- Grounded visual reasoning with spatially-aligned chain-of-thought explanations.
- Accurate, explainable image captioning and video reasoning.
- Multimodal document analysis with visually-referenced reasoning steps.
- Analytical content recognition, table/chart interpretation, and structured extraction.
- Multilingual reasoning over documents and visual scenes for global applications.
- Educational and enterprise solutions requiring step-by-step reasoning transparency.
- Robotic and mobile device automation with vision-guided contextual decision-making.
Limitations
- May require high memory for long videos and complex document inputs.
- Performance can degrade with extremely low-resolution or heavily occluded images.
- Not fully optimized for real-time inference on low-resource edge devices.
- Visual token configurations significantly impact grounded reasoning performance (see the processor configuration sketch after this list).
- Some rare cases of reasoning drift or incomplete grounding.
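Since the model builds on Qwen2.5-VL, the visual token budget per image can typically be controlled through the processor's `min_pixels` / `max_pixels` arguments, which bound how many 28x28 pixel patches an image is resized into. A minimal sketch; the bounds below are illustrative assumptions, not tuned recommendations:

```python
from transformers import AutoProcessor

# Bound the pixel budget per image; each visual token corresponds to a 28x28 patch.
# These specific values are illustrative, not recommendations from the model card.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```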
References
- YaRN: Efficient Context Window Extension of Large Language Models
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
