---
language:
- th
metrics:
- sacrebleu
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-2.0.0-preview

## Model Overview
Pathumma-llm-vision-2.0.0-preview is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. Built on Qwen2-VL-7B-Instruct (approximately 7 billion parameters), it combines image and text processing to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-2.0.0-preview
- **Base Model**: Qwen/Qwen2-VL-7B-Instruct
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 7 Billion
- **Organization**: NECTEC
- **License**: [Specify License]

## Intended Use
- **Primary Use Cases**: 
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-2.0.0-preview is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data
The model was fine-tuned on several datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Small-Thai-Wikipedia**: Articles in Thai from Wikipedia.

### Dataset Size
- **Training Dataset Size**: 132,946 examples
- **Validation Dataset Size**: not specified

## Training Details
- **Hardware Used**: 
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 4 Nodes
  - **GPUs per Node**: 4 GPUs
  - **Total GPUs Used**: 16 GPUs
- **Fine-tuning Duration**: 20 hours, 34 minutes, and 43 seconds (excluding evaluation)

## Evaluation Results

| Type                                   | Encoder                            | Decoder                             | IPU24-dataset <br>(test) <br>(Sentence SacreBLEU) |
|----------------------------------------|------------------------------------|-------------------------------------|-------------------------------|
| Pathumma-llm-vision-beta-0.0.0         | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct          | 13.45412                      |
| Pathumma-llm-vision-1.0.0              | siglip-so400m-patch14-384          | Meta-Llama-3.1-8B-Instruct          | 17.66370                      |
| Pathumma-llm-vision-2.0.0-preview      | Qwen2-VL-7B-Instruct               | Qwen2-VL-7B-Instruct                | **19.112962**                 |

**Note**: Models that were not specifically fine-tuned on the IPU24 dataset may be less representative of true IPU24 performance.
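
For reference, the sentence-level SacreBLEU reported above can be computed in spirit with the `sacrebleu` library. The snippet below is only a minimal sketch with placeholder captions; it is not the official IPU24 evaluation script, and the exact tokenizer settings used for the table are assumptions.

```python
# Minimal sketch: average sentence-level SacreBLEU over a caption test set.
# Captions below are placeholders; the official IPU24 script and settings may differ.
import sacrebleu

predictions = ["a large brown building with trees beside it"]          # model outputs, one per image
references = [["a large brown school building surrounded by trees"]]   # reference captions per image

scores = [
    sacrebleu.sentence_bleu(pred, refs).score
    for pred, refs in zip(predictions, references)
]
print(f"Mean sentence SacreBLEU: {sum(scores) / len(scores):.5f}")
```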

## Required Libraries

Before you start, ensure you have the following libraries installed:

```bash
pip install transformers==4.48.1 accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```

## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1URMEJr2P_9JK0BvBzFv4NN4824iAf0y4#scrollTo=_S-LoNKcv8ww).
To use the model with the Hugging Face `transformers` library (the example below also uses `torch`, `Pillow`, and `matplotlib`, which are preinstalled on Google Colab):

```python
import re
import time

import torch
import matplotlib.pyplot as plt
from PIL import Image

from peft import get_peft_model, LoraConfig
from qwen_vl_utils import process_vision_info
from transformers import (
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
    Qwen2VLProcessor,
)
```

```python
MODEL_ID = "nectec/Pathumma-llm-vision-2.0.0-preview"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
USE_QLORA = True

lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# 8-bit quantization keeps memory usage low; the commented lines show a 4-bit (NF4) alternative.
if USE_QLORA:
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        # load_in_4bit=True,
        # bnb_4bit_use_double_quant=True,
        # bnb_4bit_quant_type="nf4",
        # bnb_4bit_compute_dtype=torch.bfloat16,
    )
else:
    bnb_config = None


model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config if USE_QLORA else None,
    torch_dtype=torch.bfloat16
)


model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Bound the image resolution fed to the vision encoder
# (Qwen2-VL counts pixels in 28x28 patches).
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28

processor = Qwen2VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)

def encode_via_processor(image, instruction, question):

    if isinstance(image, str):
        local_path = image
        image = Image.open(local_path)

    messages = [
            {
                "role": "system", "content": [{"type": "text", "text": instruction}]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image"
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            },
        ]

    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    ).strip()

    def convert_img(image):
        # Up-scale images whose width or height is smaller than one vision patch
        # (patch_size * merge_size) so the processor can tile them.
        width, height = image.size
        factor = processor.image_processor.patch_size * processor.image_processor.merge_size
        if width < factor:
            image = image.copy().resize((factor, factor * height // width))
        elif height < factor:
            image = image.copy().resize((factor * width // height, factor))
        return image
    image_inputs = [convert_img(image)]

    encoding = processor(
        text=text,
        images=image_inputs,
        videos=None,
        return_tensors="pt",
    )

    ## Remove batch dimension
    # encoding = {k:v.squeeze(dim=0) for k,v in encoding.items()}
    encoding = {k: v.to(DEVICE) for k, v in encoding.items()}
    inputs = encoding
    return inputs


def encode_via_processor_extlib(local_path, instruction, question):
    img_path = "file://" + local_path
    messages = [
            {
                "role": "system", "content": [{"type": "text", "text": instruction}]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": img_path,
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            },
        ]

    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    ).strip()

    image_inputs, video_inputs = process_vision_info(messages)

    encoding = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
    )

    ## Remove batch dimension
    # encoding = {k:v.squeeze(dim=0) for k,v in encoding.items()}
    encoding = {k: v.to(DEVICE) for k, v in encoding.items()}
    inputs = encoding
    return inputs

def inference(inputs):
    start_time = time.time()
    model.eval()
    with torch.inference_mode():
        # Generate
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=.1,
            # repetition_penalty=1.2,
            # top_k=2,
            # top_p=1,
        )
        generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    end_time = time.time()

    # Measure end-to-end generation latency.
    latency_time = end_time - start_time

    # Keep only the assistant's portion of the decoded output.
    predict_output = generated_texts[0]
    response = re.sub(r"assistant(:|\n)?", "<||SEP-ASSIST||>", predict_output).split('<||SEP-ASSIST||>')[-1].strip()

    return predict_output, response, round(latency_time, 3)

instruction = "You are a helpful assistant."

def response_image(img_path, question, instruction=instruction):
    image = Image.open(img_path)
    _, response, latency_time = inference(encode_via_processor(image=image, instruction=instruction, question=question))
    print("RESPONSE".center(60, "="))
    print(response)
    print(latency_time, "sec.")
    print("IMAGE".center(60, "="))
    plt.imshow(image)
    plt.show()

# Output processing (depends on task requirements)
question = "อธิบายภาพนี้"
img_path = "/content/The Most Beautiful Public High School in Every State in America.jpg"
response_image(img_path, question)

>>> ==========================RESPONSE==========================
>>> อาคารสีน้ำตาลขนาดใหญ่ที่มีเสาไฟฟ้าอยู่ด้านหน้าและมีต้นไม้อยู่ด้านข้าง
>>> 7.987 sec.
>>> ===========================IMAGE============================
>>> <IMAGE_MATPLOTLIB>
```
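
The helper `encode_via_processor_extlib` is defined above but not demonstrated. A minimal sketch of calling it, reusing the `inference` function and `instruction` string already defined (the image path below is a placeholder), could look like this:

```python
# Hypothetical usage of the qwen-vl-utils path; the image path is a placeholder.
question = "อธิบายภาพนี้"  # "Describe this image."
local_path = "/content/example.jpg"

inputs = encode_via_processor_extlib(local_path, instruction, question)
_, response, latency_time = inference(inputs)
print(response)
print(latency_time, "sec.")
```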

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {Thirawarit Pitiphiphat and NECTEC Team},
  title = {nectec/Pathumma-llm-vision-2.0.0-preview},
  year = {2025},
  url = {https://huggingface.co/nectec/Pathumma-llm-vision-2.0.0-preview}
}
```

```bibtex
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```

## **Contributor Contact**
**Vision Team**  
Thirawarit Pitiphiphat (thirawarit.pit@ncr.nstda.or.th)<br>
Theerasit Issaranon (theerasit.issaranon@nectec.or.th)

## Contact
For questions or support, please reach out via our Discord server: **https://discord.gg/3WJwJjZt7r**.
