---
license: apache-2.0
language:
- vi
- en
tags:
- vision-language-model
- vlm
- qwen3
- fastvlm
- vietnamese
base_model: Qwen/Qwen3-0.6B
datasets:
- 5CD-AI/Viet-multimodal-open-r1-8k-verified
---
# Belle-VLM: Vietnamese Vision Language Model

## Model Description

Belle-VLM is a Vision Language Model fine-tuned for Vietnamese multimodal reasoning tasks.

### Architecture

- **LLM Backbone**: Qwen3-0.6B
- **Vision Encoder**: FastViTHD (MobileCLIP)
- **Projector**: 2-layer MLP (3072 → 1024)
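The projector maps each vision token from the FastViTHD encoder into the LLM's embedding space. The actual module lives in the model's custom code; the sketch below is a minimal illustration in numpy, assuming a GELU activation between the two layers (the activation and initialization are assumptions, not taken from the model card).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Illustrative 2-layer MLP projector: vision dim 3072 -> LLM dim 1024."""

    def __init__(self, in_dim=3072, out_dim=1024, seed=0):
        rng = np.random.default_rng(seed)
        # layer 1: vision feature dim -> LLM hidden dim
        self.w1 = rng.normal(0.0, 0.02, (in_dim, out_dim))
        self.b1 = np.zeros(out_dim)
        # layer 2: LLM hidden dim -> LLM hidden dim
        self.w2 = rng.normal(0.0, 0.02, (out_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        # x: (num_patches, 3072) vision tokens
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

proj = MLPProjector()
patches = np.zeros((16, 3072))  # 16 dummy vision patches
out = proj(patches)
print(out.shape)                # (16, 1024)
```

The projected tokens are then interleaved with text embeddings before being fed to the Qwen3 backbone.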
### Training

- **Dataset**: 5CD-AI/Viet-multimodal-open-r1-8k-verified
- **Method**: LoRA fine-tuning
- **Epochs**: 2
- **Learning Rate**: 2e-5
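LoRA keeps the base weights frozen and learns a low-rank update, W' = W + (alpha / r) · B·A. A minimal numpy sketch with the rank and alpha from the table below (the 1024×1024 weight shape is illustrative, not taken from the model config):

```python
import numpy as np

r, alpha = 8, 16                          # LoRA rank and alpha from the training config
d_out, d_in = 1024, 1024                  # illustrative weight shape

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, (d_out, d_in))  # frozen base weight
A = rng.normal(0.0, 0.02, (r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, initialized to zero

# effective weight used at inference
W_eff = W + (alpha / r) * (B @ A)

# with B initialized to zero, the update starts as a no-op
print(np.allclose(W_eff, W))              # True
```

Only A and B (roughly 2·r·d parameters per adapted matrix) are trained, which is why LoRA fine-tuning fits a 0.6B backbone on modest hardware.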
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# the model ships custom code, so trust_remote_code=True is required
model = AutoModelForCausalLM.from_pretrained(
    "beyoru/Belle-VLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("beyoru/Belle-VLM", trust_remote_code=True)
```
## Training Details

| Parameter    | Value            |
|--------------|------------------|
| Base Model   | Qwen/Qwen3-0.6B  |
| Vision Tower | mobileclip_l_384 |
| LoRA Rank    | 8                |
| LoRA Alpha   | 16               |
| Batch Size   | 1 x 1            |
| Epochs       | 2                |

## License

Apache 2.0