---
license: apache-2.0
language:
- vi
- en
tags:
- vision-language-model
- vlm
- qwen3
- fastvlm
- vietnamese
base_model: Qwen/Qwen3-0.6B
datasets:
- 5CD-AI/Viet-multimodal-open-r1-8k-verified
---
# Belle-VLM: Vietnamese Vision Language Model

## Model Description

Belle-VLM is a Vision Language Model fine-tuned for Vietnamese multimodal reasoning tasks.

### Architecture

- **LLM Backbone**: Qwen3-0.6B
- **Vision Encoder**: FastViTHD (MobileCLIP)
- **Projector**: 2-layer MLP (3072 → 1024)
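The projector maps each vision token from the FastViTHD encoder into the LLM's embedding space. The actual module lives in the model's custom code; the sketch below is a minimal illustration in numpy, assuming a GELU activation between the two layers (the activation and initialization are assumptions, not taken from the model card).

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Illustrative 2-layer MLP projector: vision dim 3072 -> LLM dim 1024."""

    def __init__(self, in_dim=3072, out_dim=1024, seed=0):
        rng = np.random.default_rng(seed)
        # layer 1: vision feature dim -> LLM hidden dim
        self.w1 = rng.normal(0.0, 0.02, (in_dim, out_dim))
        self.b1 = np.zeros(out_dim)
        # layer 2: LLM hidden dim -> LLM hidden dim
        self.w2 = rng.normal(0.0, 0.02, (out_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        # x: (num_patches, 3072) vision tokens
        return gelu(x @ self.w1 + self.b1) @ self.w2 + self.b2

proj = MLPProjector()
patches = np.zeros((16, 3072))  # 16 dummy vision patches
out = proj(patches)
print(out.shape)                # (16, 1024)
```

The projected tokens are then interleaved with text embeddings before being fed to the Qwen3 backbone.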
### Training

- **Dataset**: 5CD-AI/Viet-multimodal-open-r1-8k-verified
- **Method**: LoRA fine-tuning
- **Epochs**: 2
- **Learning Rate**: 2e-5
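LoRA keeps the base weights frozen and learns a low-rank update, W' = W + (alpha / r) · B·A. A minimal numpy sketch with the rank and alpha from the table below (the 1024×1024 weight shape is illustrative, not taken from the model config):

```python
import numpy as np

r, alpha = 8, 16                          # LoRA rank and alpha from the training config
d_out, d_in = 1024, 1024                  # illustrative weight shape

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, (d_out, d_in))  # frozen base weight
A = rng.normal(0.0, 0.02, (r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, initialized to zero

# effective weight used at inference
W_eff = W + (alpha / r) * (B @ A)

# with B initialized to zero, the update starts as a no-op
print(np.allclose(W_eff, W))              # True
```

Only A and B (roughly 2·r·d parameters per adapted matrix) are trained, which is why LoRA fine-tuning fits a 0.6B backbone on modest hardware.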
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# the model ships custom code, so trust_remote_code=True is required
model = AutoModelForCausalLM.from_pretrained(
    "beyoru/Belle-VLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("beyoru/Belle-VLM", trust_remote_code=True)
```
## Training Details

| Parameter    | Value            |
|--------------|------------------|
| Base Model   | Qwen/Qwen3-0.6B  |
| Vision Tower | mobileclip_l_384 |
| LoRA Rank    | 8                |
| LoRA Alpha   | 16               |
| Batch Size   | 1 x 1            |
| Epochs       | 2                |

## License

Apache 2.0