---
license: apache-2.0
language:
- vi
- en
tags:
- vision-language-model
- vlm
- qwen3
- fastvlm
- vietnamese
base_model: Qwen/Qwen3-0.6B
datasets:
- 5CD-AI/Viet-multimodal-open-r1-8k-verified
---
# Belle-VLM: Vietnamese Vision Language Model
## Model Description
Belle-VLM is a Vision Language Model trained for Vietnamese multimodal reasoning tasks.
### Architecture
- **LLM Backbone**: Qwen3-0.6B
- **Vision Encoder**: FastViTHD (MobileCLIP)
- **Projector**: 2-layer MLP (3072 -> 1024)
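The projector above can be sketched numerically: it maps FastViTHD vision features (3072-d, per the card) into the LLM embedding space (1024-d). This is a minimal NumPy illustration of the shapes only; the ReLU activation, weight initialization, and token count are assumptions, not the actual implementation.

```python
import numpy as np

def project(features, w1, b1, w2, b2):
    """2-layer MLP projector sketch: vision dim -> LLM dim."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # first layer + ReLU (assumed activation)
    return hidden @ w2 + b2                       # second layer, no activation

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3072, 1024)) * 0.02, np.zeros(1024)
w2, b2 = rng.normal(size=(1024, 1024)) * 0.02, np.zeros(1024)

patches = rng.normal(size=(256, 3072))  # hypothetical number of image tokens
out = project(patches, w1, b1, w2, b2)
print(out.shape)  # (256, 1024)
```

Each image token thus arrives at the LLM as a 1024-d vector, matching the backbone's embedding size.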
### Training
- **Dataset**: 5CD-AI/Viet-multimodal-open-r1-8k-verified
- **Method**: LoRA fine-tuning
- **Epochs**: 2
- **Learning Rate**: 2e-05
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
    "beyoru/Belle-VLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("beyoru/Belle-VLM", trust_remote_code=True)
```
## Training Details
| Parameter | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-0.6B |
| Vision Tower | mobileclip_l_384 |
| LoRA Rank | 8 |
| LoRA Alpha | 16 |
| Batch Size | 1 x 1 |
| Epochs | 2 |
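For a sense of scale, the rank/alpha values in the table translate directly into adapter size and scaling: a LoRA adapter on one weight matrix of shape `(d_out, d_in)` adds `r * (d_in + d_out)` trainable parameters, and its update is scaled by `alpha / r`. The 1024-d hidden size below is the assumed Qwen3-0.6B embedding width; which projection matrices were adapted is not stated in the card.

```python
def lora_extra_params(d_in, d_out, r=8):
    # LoRA factorizes the update as B @ A with A: (r, d_in), B: (d_out, r)
    return r * (d_in + d_out)

rank, alpha = 8, 16            # values from the table above
scale = alpha / rank           # effective scaling applied to the B @ A update
per_matrix = lora_extra_params(1024, 1024, rank)  # hypothetical square projection
print(scale, per_matrix)       # 2.0 16384
```

So each adapted 1024x1024 matrix contributes only ~16K trainable parameters, versus ~1M frozen ones, which is why rank-8 fine-tuning stays lightweight.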
## License
Apache 2.0