# Model Card — Qwen2-VL-ImgChat-2B

## Model Details
- **Model Name:** Qwen2-VL-ImgChat-2B  
- **Model Type:** Vision-Language Model fine-tuned for multimodal dialog auto-completion  
- **Language(s):** English  
- **Base Model:** Qwen2-VL-2B  
- **Fine-tuning Dataset:** ImageChat  
- **License:** Same as base model (Qwen2-VL license)  
- **Repository:** https://github.com/devichand579/MAC

---

## Intended Use

### Direct Use
This model generates conversational responses conditioned on both textual and visual context. It is suitable for:
- Multimodal dialog systems
- Image-grounded conversational agents
- Research on multimodal auto-completion
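For image-grounded dialog, inputs are typically structured as alternating user/assistant turns with the grounding image attached to the opening user turn. A minimal sketch of such a message structure (the field names follow the common Hugging Face chat-template convention and are an assumption, not this model's documented API):

```python
# Sketch: a multimodal chat history in the common Hugging Face
# chat-template format (assumed convention, not model-specific API).
def build_messages(image_path, turns):
    """Build a chat history whose first user turn carries the image."""
    messages = []
    for i, (role, text) in enumerate(turns):
        content = [{"type": "text", "text": text}]
        if i == 0 and role == "user":
            # Attach the grounding image to the opening user turn.
            content.insert(0, {"type": "image", "image": image_path})
        messages.append({"role": role, "content": content})
    return messages

history = build_messages("photo.jpg", [
    ("user", "What is happening in this picture?"),
    ("assistant", "Two people are playing chess in a park."),
    ("user", "Who seems to be winning?"),
])
```

A history like this would then be flattened into model inputs by the processor before generation.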

### Out-of-Scope Use
The model is not intended for:
- Medical, legal, or financial advice
- Safety-critical decision-making
- Autonomous systems requiring guaranteed correctness

---

## Limitations and Risks
- Model outputs may contain inaccuracies or biases inherited from training data.
- Performance depends on image relevance and the quality of the dialog context.
- The model is not explicitly safety-filtered.

---

## How to Use

Example usage with Hugging Face Transformers:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Repository id assumed to match the model name above;
# replace with the actual Hub id if it differs.
model_id = "devichand/Qwen2-VL-ImgChat-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("example.jpg")  # any RGB image
inputs = processor(images=image,
                   text="Describe the image.",
                   return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(outputs[0], skip_special_tokens=True))
```