# Model Card — Qwen2-VL-ImgChat-2B

## Model Details
- Model Name: Qwen2-VL-ImgChat-2B
- Model Type: Vision-Language Model fine-tuned for multimodal dialog auto-completion
- Language(s): English
- Base Model: Qwen2-VL-2B
- Fine-tuning Dataset: ImageChat
- License: Same as base model (Qwen2-VL license)
- Repository: https://github.com/devichand579/MAC
## Intended Use

### Direct Use
This model generates conversational responses conditioned on both textual and visual context. It is suitable for:
- Multimodal dialog systems
- Image-grounded conversational agents
- Research on multimodal auto-completion
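For dialog auto-completion, each request bundles the grounding image with the prior conversation turns. The sketch below shows one plausible way to map an ImageChat-style exchange into the list-of-messages format that multimodal chat templates in Transformers consume; the helper name and the turn texts are invented for illustration and are not part of this repository.

```python
# Map an ImageChat-style exchange (one image plus alternating speaker turns)
# into the list-of-messages structure consumed by multimodal chat templates.
# Function name and turn texts are illustrative placeholders.
def to_chat_messages(history):
    """history: list of (role, text) tuples for the prior turns."""
    messages = []
    for i, (role, text) in enumerate(history):
        content = [{"type": "text", "text": text}]
        if i == 0:
            # Attach the grounding image to the first turn only.
            content.insert(0, {"type": "image"})
        messages.append({"role": role, "content": content})
    return messages

history = [
    ("user", "What a beautiful coastline."),
    ("assistant", "The cliffs make it look dramatic."),
]
messages = to_chat_messages(history)
```

The resulting `messages` list can then be passed to a processor's `apply_chat_template` to produce the final prompt string.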
### Out-of-Scope Use
The model is not intended for:
- Medical, legal, or financial advice
- Safety-critical decision-making
- Autonomous systems requiring guaranteed correctness
## Limitations and Risks
- Model outputs may contain inaccuracies or biases inherited from training data.
- Performance depends on image relevance and dialogue context quality.
- The model is not explicitly safety-filtered.
## How to Use

Example usage with Hugging Face Transformers:
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Repo id is illustrative; substitute the actual checkpoint path.
model_id = "devichand/Qwen2-VL-ImgChat-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)

image = Image.open("your_image.jpg")
# The chat template inserts the image placeholder tokens Qwen2-VL expects.
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```