---
language:
- th
metrics:
- sacrebleu
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: visual-question-answering
---
# Pathumma-llm-vision-2.0.0-preview
## Model Overview
Pathumma-llm-vision-2.0.0-preview is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. Built on Qwen2-VL-7B-Instruct (7 billion parameters), it combines image and text processing to understand and generate multi-modal content.
- **Model Name**: Pathumma-llm-vision-2.0.0-preview
- **Base Model**: Qwen/Qwen2-VL-7B-Instruct
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 7 Billion
- **Organization**: NECTEC
- **License**: [Specify License]
## Intended Use
- **Primary Use Cases**:
- Visual Question Answering (VQA)
- Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.
## Model Description
Pathumma-llm-vision-2.0.0-preview is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.
## Training Data
The model was fine-tuned on the following datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Small-Thai-Wikipedia**: Articles in Thai from Wikipedia.
### Dataset Size
- **Training Dataset Size**: 132,946 examples
- **Validation Dataset Size**: - examples
## Training Details
- **Hardware Used**:
- **HPC Cluster**: Lanta
- **Number of Nodes**: 4 Nodes
- **GPUs per Node**: 4 GPUs
- **Total GPUs Used**: 16 GPUs
- **Fine-tuning Duration**: 20 hours, 34 minutes, and 43 seconds (excluding evaluation)
## Evaluation Results
| Type | Encoder | Decoder | IPU24-dataset <br>(test) <br>(Sentence SacreBLEU) |
|----------------------------------------|------------------------------------|-------------------------------------|-------------------------------|
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 17.66370 |
| Pathumma-llm-vision-2.0.0-preview | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct | **19.112962** |
**Note**: Scores for models that were not specifically fine-tuned on the IPU24 dataset may be less representative of their IPU24 performance.
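For reference, the sentence-level SacreBLEU metric reported above can be computed with the `sacrebleu` library. The snippet below is a minimal sketch with made-up strings; the actual IPU24 references and the tokenizer setting used for the reported numbers are not included here and are assumptions.

```python
import sacrebleu

# Hypothetical caption pair for illustration only (not IPU24 data).
hypothesis = "อาคารสีน้ำตาลขนาดใหญ่ที่มีต้นไม้อยู่ด้านข้าง"
references = ["อาคารสีน้ำตาลขนาดใหญ่และต้นไม้"]

# sentence_bleu returns a BLEUScore object; .score is on the 0-100 scale.
# Tokenizer choice matters for Thai; sacrebleu's default is assumed here.
score = sacrebleu.sentence_bleu(hypothesis, references)
print(round(score.score, 5))
```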
## Required Libraries
Before you start, ensure you have the following libraries installed:
```bash
pip install transformers==4.48.1 accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```
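A quick sanity check that the pinned versions imported correctly (the expected version mirrors the pip command above):

```python
import torch
import transformers

print(transformers.__version__)   # expected: 4.48.1
print(torch.cuda.is_available())  # a GPU is strongly recommended for a 7B model
```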
## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1URMEJr2P_9JK0BvBzFv4NN4824iAf0y4#scrollTo=_S-LoNKcv8ww).
To use the model with the Hugging Face `transformers` library:
```python
import re
import time

import matplotlib.pyplot as plt
import torch
from PIL import Image
from peft import get_peft_model, LoraConfig
from qwen_vl_utils import process_vision_info
from transformers import (
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
    Qwen2VLProcessor,
)
```
```python
MODEL_ID = "nectec/Pathumma-llm-vision-2.0.0-preview"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
USE_QLORA = True

lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

bnb_config = None
if USE_QLORA:
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        # load_in_4bit=True,
        # bnb_4bit_use_double_quant=True,
        # bnb_4bit_quant_type="nf4",
        # bnb_4bit_compute_dtype=torch.bfloat16,
    )

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Bound the number of visual tokens per image (each token covers 28x28 pixels).
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28
processor = Qwen2VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)
def encode_via_processor(image, instruction, question):
    if isinstance(image, str):
        local_path = image
        image = Image.open(local_path)
    messages = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        },
    ]
    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    ).strip()

    def convert_img(image):
        # Upscale images smaller than one patch-merge unit so the image
        # processor always receives a valid resolution.
        width, height = image.size
        factor = processor.image_processor.patch_size * processor.image_processor.merge_size
        if width < factor:
            image = image.copy().resize((factor, factor * height // width))
        elif height < factor:
            image = image.copy().resize((factor * width // height, factor))
        return image

    image_inputs = [convert_img(image)]
    encoding = processor(
        text=text,
        images=image_inputs,
        videos=None,
        return_tensors="pt",
    )
    ## Remove batch dimension
    # encoding = {k: v.squeeze(dim=0) for k, v in encoding.items()}
    inputs = {k: v.to(DEVICE) for k, v in encoding.items()}
    return inputs
def encode_via_processor_extlib(local_path, instruction, question):
    # Let qwen-vl-utils load and resize the image from a local path.
    img_path = "file://" + local_path
    messages = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img_path},
                {"type": "text", "text": question},
            ],
        },
    ]
    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    ).strip()
    image_inputs, video_inputs = process_vision_info(messages)
    encoding = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
    )
    ## Remove batch dimension
    # encoding = {k: v.squeeze(dim=0) for k, v in encoding.items()}
    inputs = {k: v.to(DEVICE) for k, v in encoding.items()}
    return inputs
def inference(inputs):
    start_time = time.time()
    model.eval()
    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.1,  # only takes effect with do_sample=True
            # repetition_penalty=1.2,
            # top_k=2,
            # top_p=1,
        )
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    end_time = time.time()
    ## Measure latency.
    latency_time = end_time - start_time
    # Keep only the assistant's turn from the decoded transcript.
    predict_output = generated_texts[0]
    response = re.sub(r"assistant(:|\n)?", "<||SEP-ASSIST||>", predict_output).split("<||SEP-ASSIST||>")[-1].strip()
    return predict_output, response, round(latency_time, 3)
instruction = "You are a helpful assistant."

def response_image(img_path, question, instruction=instruction):
    image = Image.open(img_path)
    # encode_via_processor_extlib(img_path, instruction, question) is an
    # equivalent, path-based alternative to encode_via_processor.
    _, response, latency_time = inference(encode_via_processor(image=image, instruction=instruction, question=question))
    print("RESPONSE".center(60, "="))
    print(response)
    print(latency_time, "sec.")
    print("IMAGE".center(60, "="))
    plt.imshow(image)
    plt.show()

question = "อธิบายภาพนี้"  # "Describe this image."
img_path = "/content/The Most Beautiful Public High School in Every State in America.jpg"
response_image(img_path, question)
```
```
==========================RESPONSE==========================
อาคารสีน้ำตาลขนาดใหญ่ที่มีเสาไฟฟ้าอยู่ด้านหน้าและมีต้นไม้อยู่ด้านข้าง
7.987 sec.
===========================IMAGE============================
<IMAGE_MATPLOTLIB>
```
The example response translates to: "A large brown building with an electricity pole in front and trees at the side."
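If you continue fine-tuning with LoRA adapters, `peft` can fold the adapter weights back into the base model for adapter-free deployment. This is a minimal sketch, assuming the base model was loaded in full precision (merging into an 8-bit quantized model is generally unsupported); the output directory name is illustrative:

```python
# Fold the LoRA weights into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("pathumma-vision-merged")  # illustrative path
processor.save_pretrained("pathumma-vision-merged")
```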
## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.
## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.
## Citation
If you use this model, please cite it as follows:
```bibtex
@misc{PathummaVision,
author = {Thirawarit Pitiphiphat and NECTEC Team},
title = {nectec/Pathumma-llm-vision-2.0.0-preview},
year = {2025},
url = {https://huggingface.co/nectec/Pathumma-llm-vision-2.0.0-preview}
}
```
```bibtex
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}
```
## Contributors
**Vision Team**
Thirawarit Pitiphiphat (thirawarit.pit@ncr.nstda.or.th)<br>
Theerasit Issaranon (theerasit.issaranon@nectec.or.th)
## Contact
For questions or support, please reach out via our Discord server: https://discord.gg/3WJwJjZt7r.