Instructions for using prithivMLmods/Nemesis-VLMer-7B-0818 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use prithivMLmods/Nemesis-VLMer-7B-0818 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/Nemesis-VLMer-7B-0818")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Run the pipeline and print the generated response
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/Nemesis-VLMer-7B-0818")
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/Nemesis-VLMer-7B-0818")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prithivMLmods/Nemesis-VLMer-7B-0818 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/Nemesis-VLMer-7B-0818"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Nemesis-VLMer-7B-0818",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
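
The same request can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_payload` and `send_chat_request` are illustrative and not part of vLLM, and the sketch assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/Nemesis-VLMer-7B-0818"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build the same JSON body as the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat completions endpoint and return parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Uncomment once the server is running:
# reply = send_chat_request(payload)
# print(reply["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes it easy to inspect the request before sending it, and `base_url` can be pointed at any other OpenAI-compatible server.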

Use Docker

docker model run hf.co/prithivMLmods/Nemesis-VLMer-7B-0818

SGLang

How to use prithivMLmods/Nemesis-VLMer-7B-0818 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Nemesis-VLMer-7B-0818" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Nemesis-VLMer-7B-0818",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
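
OpenAI-compatible servers generally also accept images embedded as base64 `data:` URLs, which helps when the image is a local file rather than a public link. The following is a minimal sketch, assuming the SGLang server above accepts this form. `to_data_url` and `image_part_from_bytes` are hypothetical helpers, not part of SGLang.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename="image.jpg"):
    """Encode raw image bytes as a data URL (hypothetical helper)."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        # Fall back to a generic type when the extension is unknown.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part_from_bytes(image_bytes, filename="image.jpg"):
    """Build an `image_url` content part that carries the image inline."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_bytes, filename)},
    }


# Placeholder bytes for illustration; in practice use open(path, "rb").read().
part = image_part_from_bytes(b"\xff\xd8\xff", filename="photo.jpg")
```

The resulting dictionary can replace the `image_url` entry in the `content` list of the request body shown above.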

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/Nemesis-VLMer-7B-0818" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Nemesis-VLMer-7B-0818",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use prithivMLmods/Nemesis-VLMer-7B-0818 with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/Nemesis-VLMer-7B-0818
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Nemesis-VLMer-7B-0818

The Nemesis-VLMer-7B-0818 model is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for reasoning, content analysis, and visual question answering (VQA). It keeps the Qwen2.5-VL architecture and strengthens multimodal comprehension through focused training on reasoning-oriented and analysis-rich datasets.

Key Enhancements

Context-Aware Multimodal Reasoning and Linking: Advanced capability for understanding multimodal context and establishing connections across text, images, and structured elements.
Enhanced Content Analysis: Designed to efficiently interpret and analyze complex content, ranging from structured text to multimodal information.
Visual Question Answering (VQA): Specialized for accurately answering visual and multimodal queries across diverse domains.
Advanced Reasoning Capabilities: Optimized for logical, mathematical, and contextual reasoning tasks involving charts, tables, and diagrams.
Benchmark Performance: Reports competitive results on reasoning and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.
Long-Video Understanding (20+ minutes): Supports detailed comprehension of long videos for reasoning, summarization, question answering, and multimodal analysis.
Visually-Grounded Device Interaction: Operates mobile or robotic devices from visual inputs and text instructions, using contextual understanding and reasoning to decide on actions.
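
For the video capability listed above, the model card does not show an input format. The sketch below assumes the message layout used by upstream Qwen2.5-VL with `qwen_vl_utils` carries over to this fine-tune. `build_video_messages` is a hypothetical helper, and the path, `fps`, and `max_pixels` values are placeholders.

```python
def build_video_messages(video_path, question, fps=1.0, max_pixels=360 * 420):
    """Build a chat message list with one video and one text prompt.

    Assumption: the "video", "fps", and "max_pixels" keys follow the upstream
    Qwen2.5-VL convention consumed by qwen_vl_utils.process_vision_info.
    """
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_path,
                    "max_pixels": max_pixels,  # per-frame pixel cap
                    "fps": fps,  # frames sampled per second
                },
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_video_messages(
    "file:///path/to/video.mp4",
    "Summarize the main events in this video.",
)
```

Lower `fps` and `max_pixels` values reduce the number of visual tokens, which matters for long videos where memory use grows with duration.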

Quick Start with Transformers🤗

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Nemesis-VLMer-7B-0818", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Nemesis-VLMer-7B-0818")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What reasoning can you infer from this image?"},
        ],
    }
]

# Render the chat template to a prompt string; tokenization happens in the processor call
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Load the images and videos referenced in the messages
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device selected by device_map="auto"
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so that only new tokens are decoded
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
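
`generate` returns each prompt followed by its newly generated tokens, so the list comprehension above slices off the prompt before decoding. The toy example below reproduces that slicing with plain Python lists standing in for token-id tensors. The ids are made up for illustration, and `trim_prompt_tokens` is a hypothetical helper name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Two made-up prompts (padded to the same length) and their full outputs.
input_ids = [[101, 7, 8, 9], [101, 0, 5, 6]]
generated_ids = [[101, 7, 8, 9, 42, 43], [101, 0, 5, 6, 77, 78]]

trimmed = trim_prompt_tokens(input_ids, generated_ids)
print(trimmed)  # [[42, 43], [77, 78]]
```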

Intended Use

This model is intended for:

Context-aware multimodal reasoning and linking across diverse inputs.
High-fidelity content analysis and interpretation for structured and unstructured data.
Visual question answering (VQA) across educational, enterprise, and research applications.
Reasoning-driven analysis of charts, graphs, tables, and visual data representations.
Extraction and LaTeX formatting of mathematical expressions for academic and professional use.
Retrieval, reasoning, and summarization from long documents, slides, and multi-modal sources.
Multilingual reasoning and structured content analysis for global use cases.
Robotic or mobile automation with vision-guided, reasoning-based contextual interaction.

Limitations

May show degraded performance on extremely low-quality or occluded images.
Not optimized for real-time applications on low-resource or edge devices due to computational demands.
Accuracy varies on uncommon or low-resource languages and scripts.
Long video processing may require substantial memory and is not optimized for streaming applications.
Visual token settings (for example, the minimum and maximum pixel budget per image) affect both accuracy and memory use; poorly chosen values can degrade results.
In rare cases, outputs may contain hallucinated or contextually misaligned reasoning steps.
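
On the visual token limitation above: upstream Qwen2.5-VL documentation describes `min_pixels` and `max_pixels` processor arguments that bound the number of visual tokens per image, with one token covering roughly a 28x28 pixel area. The sketch below assumes this fine-tune inherits that behaviour. `pixel_budget` and `load_processor` are hypothetical helpers, and the 256 to 1280 token range is only an example.

```python
# Assumption taken from upstream Qwen2.5-VL docs: one visual token per 28x28 pixels.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual token range into min_pixels/max_pixels values."""
    if min_tokens <= 0 or max_tokens < min_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return {
        "min_pixels": min_tokens * PIXELS_PER_VISUAL_TOKEN,
        "max_pixels": max_tokens * PIXELS_PER_VISUAL_TOKEN,
    }


def load_processor(model_id, min_tokens=256, max_tokens=1280):
    """Load the processor with a bounded visual token budget."""
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(
        model_id, **pixel_budget(min_tokens, max_tokens)
    )


budget = pixel_budget(256, 1280)
print(budget)  # {'min_pixels': 200704, 'max_pixels': 1003520}
```

A smaller upper bound lowers memory use and latency at the cost of fine visual detail, so it is worth tuning against representative inputs.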