# Model Card — Qwen2-VL-ImgChat-2B

## Model Details

- **Model Name:** Qwen2-VL-ImgChat-2B
- **Model Type:** Vision-language model fine-tuned for multimodal dialog auto-completion
- **Language(s):** English
- **Base Model:** Qwen2-VL-2B
- **Fine-tuning Dataset:** ImageChat
- **License:** Same as base model (Qwen2-VL license)
- **Repository:** https://github.com/devichand579/MAC

---

## Intended Use

### Direct Use

This model generates conversational responses conditioned on both textual and visual context. It is suitable for:

- Multimodal dialog systems
- Image-grounded conversational agents
- Research on multimodal auto-completion

### Out-of-Scope Use

The model is not intended for:

- Medical, legal, or financial advice
- Safety-critical decision-making
- Autonomous systems requiring guaranteed correctness

---

## Limitations and Risks

- Model outputs may contain inaccuracies or biases inherited from the training data.
- Performance depends on the relevance of the input image and the quality of the dialogue context.
- The model has not been explicitly safety-filtered.

---

## How to Use

Example usage with Hugging Face Transformers (the repository id below is illustrative and should be replaced with the actual hub id for this checkpoint; the original snippet pointed at a different model):

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

# Adjust to the actual Hugging Face repo id for Qwen2-VL-ImgChat-2B.
model_id = "devichand/Qwen2-VL-ImgChat-2B"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("example.jpg")  # any RGB image
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
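Qwen2-VL checkpoints are typically driven through the Hugging Face chat-template convention rather than raw text, with each turn carrying a list of image and text content items. A minimal sketch of that message structure (the helper name, image path, and question are illustrative, not from this model card):

```python
# Sketch of the chat-style message format consumed by
# processor.apply_chat_template for Qwen2-VL-family models.
# build_messages is a hypothetical helper for this example.
def build_messages(image_path: str, question: str) -> list[dict]:
    """Return a single-turn, image-grounded conversation."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},  # visual context
                {"type": "text", "text": question},      # textual prompt
            ],
        }
    ]

messages = build_messages("example.jpg", "Describe the image.")
```

The resulting list can be passed to `processor.apply_chat_template(messages, add_generation_prompt=True)` to produce the prompt string containing the model's image placeholder tokens.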