Instructions to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with libraries, inference providers, notebooks, and local apps. The sections below show how to get started with each option.
- Libraries
- Transformers
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/Enesidaon-VLR-7B-no-Thinking")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with vLLM:
Install from pip and serve the model:
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "prithivMLmods/Enesidaon-VLR-7B-no-Thinking"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
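Because the vLLM server exposes an OpenAI-compatible API, it can also be called from Python. A minimal sketch using the `openai` client; the base URL and port match the `vllm serve` defaults above, and the image URL and token budget are illustrative:

```python
# pip install openai
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```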
- SGLang
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with SGLang:
Install from pip and serve the model:
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "prithivMLmods/Enesidaon-VLR-7B-no-Thinking" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Enesidaon-VLR-7B-no-Thinking" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
```
- Docker Model Runner
How to use prithivMLmods/Enesidaon-VLR-7B-no-Thinking with Docker Model Runner:
```bash
docker model run hf.co/prithivMLmods/Enesidaon-VLR-7B-no-Thinking
```
Enesidaon-VLR-7B-no-Thinking
The Enesidaon-VLR-7B-no-Thinking model is an experimental, high-fidelity vision-language reasoning model designed for fine-grained multimodal comprehension. Built on top of Qwen2.5-VL-7B-Instruct, it improves image captioning, sampled video reasoning, and detailed document understanding. Unlike standard approaches, it explicitly anchors its textual reasoning steps to visual coordinates, enabling precise and explainable multimodal reasoning. The model is trained with supervised fine-tuning (SFT) on visually grounded reasoning traces and further optimized with GRPO reinforcement learning, yielding strong chain-of-thought reasoning without overthinking or unnecessary hallucination.
Key Enhancements
- Visually-Grounded Reasoning and Explanation: Explicitly anchors reasoning chains to image regions and document elements for transparent, explainable multimodal outputs.
- Advanced Image Captioning: Produces context-aware, detailed captions with grounded reasoning for improved visual understanding.
- Sampled Video Reasoning: Handles long-duration video inputs with temporal reasoning for content summarization and QA (a video-input sketch follows the Quick Start code below).
- Context-Aware Document Analysis: Excels in document retrieval, structured and unstructured content extraction, and analytical content recognition.
- Fine-Grained Visual Grounding: Enhanced capability for multimodal linking across charts, tables, and graphical elements with spatial grounding.
- Reinforcement-Learned Reasoning: Trained with GRPO to incentivize accurate, grounded reasoning aligned with visual cues.
- State-of-the-Art Benchmarking: Competitive results on OCR, visual QA, and reasoning tasks including DocVQA, MathVista, RealWorldQA, and MTVQA.
Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
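# Load the checkpoint with automatic dtype selection and device placement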
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
"prithivMLmods/Enesidaon-VLR-7B-no-Thinking", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Enesidaon-VLR-7B-no-Thinking")
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
},
{"type": "text", "text": "Describe this image with reasoning."},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
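# Collect the image and video inputs referenced in the messages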
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
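# Generate, then trim the prompt tokens so only the new output is decoded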
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
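For sampled video reasoning, the same pipeline accepts a video entry in the message content. The following is a minimal sketch reusing the `model` and `processor` loaded above; the video path, prompt, and generation settings are illustrative placeholders, and decoding frames requires a video backend (e.g. torchvision or decord) to be available to `qwen_vl_utils`:

```python
from qwen_vl_utils import process_vision_info

# The video path below is a placeholder; local "file://" paths and HTTP(S) URLs are supported.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "Summarize the key events in this video."},
        ],
    }
]

text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```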
Intended Use
This model is intended for:
- Grounded visual reasoning with spatially-aligned chain-of-thought explanations.
- Accurate, explainable image captioning and video reasoning.
- Multimodal document analysis with visually-referenced reasoning steps.
- Analytical content recognition, table/chart interpretation, and structured extraction.
- Multilingual reasoning over documents and visual scenes for global applications.
- Educational and enterprise solutions requiring step-by-step reasoning transparency.
- Robotic and mobile device automation with vision-guided contextual decision-making.
Limitations
- May require high memory for long videos and complex document inputs.
- Performance can degrade with extremely low-resolution or heavily occluded images.
- Not fully optimized for real-time inference on low-resource edge devices.
- Visual token configurations significantly impact grounded reasoning performance (see the processor configuration sketch after this list).
- Some rare cases of reasoning drift or incomplete grounding.
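Since the model builds on Qwen2.5-VL, the visual token budget per image can typically be controlled through the processor's `min_pixels` / `max_pixels` arguments, which bound how many 28x28 pixel patches an image is resized into. A minimal sketch; the bounds below are illustrative assumptions, not tuned recommendations:

```python
from transformers import AutoProcessor

# Bound the pixel budget per image; each visual token corresponds to a 28x28 patch.
# These specific values are illustrative, not recommendations from the model card.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Enesidaon-VLR-7B-no-Thinking",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```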
References
- YaRN: Efficient Context Window Extension of Large Language Models
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
