# Model Card — Qwen2-VL-ImgChat-2B

## Model Details

- **Model Name:** Qwen2-VL-ImgChat-2B
- **Model Type:** Vision-language model fine-tuned for multimodal dialog auto-completion
- **Language(s):** English
- **Base Model:** Qwen2-VL-2B
- **Fine-tuning Dataset:** ImageChat
- **License:** Same as the base model (Qwen2-VL license)
- **Repository:** https://github.com/devichand579/MAC

---
## Intended Use

### Direct Use

This model generates conversational responses conditioned on both textual and visual context. It is suitable for:

- Multimodal dialog systems
- Image-grounded conversational agents
- Research on multimodal auto-completion

### Out-of-Scope Use

The model is not intended for:

- Medical, legal, or financial advice
- Safety-critical decision-making
- Autonomous systems requiring guaranteed correctness

---
## Limitations and Risks

- Model outputs may contain inaccuracies or biases inherited from the training data.
- Performance depends on the relevance of the input image and the quality of the dialogue context.
- The model is not explicitly safety-filtered.

---

## How to Use

Example usage with Hugging Face Transformers:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

# Substitute the Hugging Face repo ID for this checkpoint.
model_id = "<qwen2-vl-imgchat-2b-repo-id>"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("example.jpg")
inputs = processor(images=image,
                   text="Describe the image.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```