
Nemesis-VLMer-7B-0818

Nemesis-VLMer-7B-0818 is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for reasoning, content analysis, and visual question answering (VQA). It keeps the Qwen2.5-VL architecture and adds focused training on reasoning-oriented and analysis-rich datasets to strengthen multimodal comprehension in these three areas.

Key Enhancements

  • Context-Aware Multimodal Reasoning and Linking: Advanced capability for understanding multimodal context and establishing connections across text, images, and structured elements.

  • Enhanced Content Analysis: Designed to efficiently interpret and analyze complex content, ranging from structured text to multimodal information.

  • Visual Question Answering (VQA): Specialized for accurately answering visual and multimodal queries across diverse domains.

  • Advanced Reasoning Capabilities: Optimized for logical, mathematical, and contextual reasoning tasks involving charts, tables, and diagrams.

  • Benchmark Performance: Achieves competitive results on reasoning and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.

  • Long-Video Understanding (20+ minutes): Supports detailed comprehension of long videos (20 minutes or more) for reasoning, summarization, question answering, and multimodal analysis.

  • Visually-Grounded Device Interaction: Enables operation of mobile or robotic devices from visual inputs and text instructions, using contextual understanding and reasoning-driven decision making.
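For the video capability listed above, a minimal sketch of how a video request might be expressed in the same message format the Transformers quick start in this card uses. The `video`, `fps`, and `max_pixels` keys follow the upstream Qwen2.5-VL `qwen_vl_utils` convention; that this fine-tune accepts them unchanged is an assumption, and the helper name `build_video_messages`, the file path, and the default values are purely illustrative.

```python
def build_video_messages(video_uri, question, fps=1.0, max_pixels=360 * 420):
    """Build a single-turn chat message that pairs one video with one question.

    video_uri: local file URI or URL of the video (placeholder in this sketch).
    fps: frames sampled per second; lower values reduce memory use on long videos.
    max_pixels: per-frame pixel cap; an illustrative default, not a tuned value.
    """
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": video_uri,
                    "max_pixels": max_pixels,
                    "fps": fps,
                },
                {"type": "text", "text": question},
            ],
        }
    ]


# Hypothetical path; replace with a real video before running inference.
messages = build_video_messages(
    "file:///path/to/video.mp4", "Summarize the key events in this video."
)
print(messages[0]["content"][0]["type"])
```

The resulting `messages` list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the image example. Lowering `fps` or `max_pixels` is the usual lever when a long video does not fit in memory.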

Quick Start with Transformers 🤗

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Nemesis-VLMer-7B-0818", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/Nemesis-VLMer-7B-0818")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "What reasoning can you infer from this image?"},
        ],
    }
]

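# Render the chat template to a prompt string; tokenization happens in the processor call below.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Load the image and video inputs referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the model's device; with device_map="auto" this is not necessarily "cuda".
inputs = inputs.to(model.device)

# Generate, then drop the prompt tokens so that only newly generated tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)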
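
The trimming step near the end of the quick start is easy to misread, so here is a small standalone illustration using plain Python lists in place of tensors. `model.generate` returns the prompt tokens followed by the new tokens, and slicing each output by the length of its input keeps only the new ones. The token IDs and the helper name `trim_prompt_tokens` are made up for this sketch.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from each generated sequence.

    Each generated sequence starts with its prompt, so slicing from
    len(prompt) onward leaves only the newly generated tokens.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: two prompts and the sequences "generated" from them.
prompt_batch = [[101, 7, 8], [101, 9]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 55]]

trimmed = trim_prompt_tokens(prompt_batch, generated_batch)
print(trimmed)  # [[42, 43], [55]]
```

In the real pipeline the inputs are padded to a common length, so every row is sliced at the same offset; the list version above only shows the idea.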

Intended Use

This model is intended for:

  • Context-aware multimodal reasoning and linking across diverse inputs.
  • High-fidelity content analysis and interpretation for structured and unstructured data.
  • Visual question answering (VQA) across educational, enterprise, and research applications.
  • Reasoning-driven analysis of charts, graphs, tables, and visual data representations.
  • Extraction and LaTeX formatting of mathematical expressions for academic and professional use.
  • Retrieval, reasoning, and summarization from long documents, slides, and multimodal sources.
  • Multilingual reasoning and structured content analysis for global use cases.
  • Robotic or mobile automation with vision-guided, reasoning-based contextual interaction.

Limitations

  • May show degraded performance on extremely low-quality or occluded images.
  • Not optimized for real-time applications on low-resource or edge devices due to computational demands.
  • Variable accuracy on uncommon or low-resource languages or scripts.
  • Long video processing may require substantial memory and is not optimized for streaming applications.
  • Visual token settings (the per-image pixel budget, min_pixels and max_pixels in the Qwen2.5-VL processor) affect both accuracy and memory use; poorly chosen values can degrade results.
  • In rare cases, outputs may contain hallucinated or contextually misaligned reasoning steps.
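
As a rough guide to the visual token settings mentioned in the limitations above, the sketch below converts a desired range of visual tokens per image into the pixel budget that the upstream Qwen2.5-VL processor accepts as `min_pixels` and `max_pixels`. It assumes, as in upstream Qwen2.5-VL, that one visual token covers a 28x28 pixel area; the helper name `pixel_budget` and the 256 to 1280 token range are illustrative, not values recommended by this model card.

```python
from typing import Tuple

# Assumption: as in upstream Qwen2.5-VL, one visual token covers 28x28 pixels.
PIXELS_PER_VISUAL_TOKEN = 28 * 28


def pixel_budget(min_tokens: int, max_tokens: int) -> Tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return (
        min_tokens * PIXELS_PER_VISUAL_TOKEN,
        max_tokens * PIXELS_PER_VISUAL_TOKEN,
    )


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520

# Applying the budget requires transformers and access to the model repository:
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained(
#     "prithivMLmods/Nemesis-VLMer-7B-0818",
#     min_pixels=min_pixels,
#     max_pixels=max_pixels,
# )
```

A smaller `max_pixels` lowers memory use and latency at the cost of fine visual detail, which matters most for dense documents and small text.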

Model size: 8B params · Tensor type: BF16 · Format: Safetensors