# Model Card — Qwen2-VL-ImgChat-2B

## Model Details
- Model Name: Qwen2-VL-ImgChat-2B
- Model Type: Vision-Language Model fine-tuned for multimodal dialog auto-completion
- Language(s): English
- Base Model: Qwen2-VL-2B
- Fine-tuning Dataset: ImageChat
- License: Same as base model (Qwen2-VL license)
- Repository: https://github.com/devichand579/MAC
## Intended Use

### Direct Use
This model generates conversational responses conditioned on both textual and visual context. It is suitable for:
- Multimodal dialog systems
- Image-grounded conversational agents
- Research on multimodal auto-completion
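For dialog auto-completion, each request bundles the grounding image with the prior conversation turns. The sketch below shows one plausible way to map an ImageChat-style exchange into the list-of-messages format that multimodal chat templates in Transformers consume; the helper name and the turn texts are invented for illustration and are not part of this repository.

```python
# Map an ImageChat-style exchange (one image plus alternating speaker turns)
# into the list-of-messages structure consumed by multimodal chat templates.
# Function name and turn texts are illustrative placeholders.
def to_chat_messages(history):
    """history: list of (role, text) tuples for the prior turns."""
    messages = []
    for i, (role, text) in enumerate(history):
        content = [{"type": "text", "text": text}]
        if i == 0:
            # Attach the grounding image to the first turn only.
            content.insert(0, {"type": "image"})
        messages.append({"role": role, "content": content})
    return messages

history = [
    ("user", "What a beautiful coastline."),
    ("assistant", "The cliffs make it look dramatic."),
]
messages = to_chat_messages(history)
```

The resulting `messages` list can then be passed to a processor's `apply_chat_template` to produce the final prompt string.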
### Out-of-Scope Use
The model is not intended for:
- Medical, legal, or financial advice
- Safety-critical decision-making
- Autonomous systems requiring guaranteed correctness
## Limitations and Risks
- Model outputs may contain inaccuracies or biases inherited from training data.
- Performance depends on image relevance and dialogue context quality.
- The model is not explicitly safety-filtered.
## How to Use

Example usage with Hugging Face Transformers:
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Repo id is illustrative; substitute the actual checkpoint path.
model_id = "devichand/Qwen2-VL-ImgChat-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)

image = Image.open("your_image.jpg")
# The chat template inserts the image placeholder tokens Qwen2-VL expects.
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```