FTib-VLM
FTib-VLM is a Tibetan vision-language model fine-tuned from Qwen/Qwen3-VL-8B-Instruct for multimodal understanding in low-resource language settings. It is released as part of the FTibSuite project to support reproducible Tibetan multimodal research.
Model Details
- Model: onedday/FTib-VLM
- Base model: Qwen/Qwen3-VL-8B-Instruct
- Architecture: qwen3_vl
- Parameters: ~8.8B
- License: Apache-2.0
Highlights
- Fine-tuned for Tibetan multimodal understanding
- Built on top of a strong open-source VLM backbone
- Trained with a three-stage adaptation pipeline
- Released for research, evaluation, and downstream adaptation
- Shows strong improvements on Tibetan multimodal benchmarks
Intended Use
FTib-VLM is intended for research and experimental applications such as:
- Tibetan image question answering
- Tibetan image description
- Multimodal instruction following
- Tibetan-oriented visual reasoning
- Low-resource vision-language adaptation research
Training Overview
FTib-VLM is fine-tuned from Qwen/Qwen3-VL-8B-Instruct using a three-stage pipeline:
- Continual Pretraining on Tibetan-oriented text data
- Multimodal Alignment on Tibetan image-text pairs
- Multimodal Instruction Tuning on Tibetan multimodal instruction data
The goal is to improve Tibetan multimodal capability while preserving the strengths of the base vision-language model.
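The exact recipe and hyperparameters for these stages are not documented in this card. The sketch below is only a hypothetical illustration of how such a staged pipeline is commonly organized; the dataset names, learning rates, trainable-module choices, and the run_stage helper are placeholders, not the actual FTib-VLM configuration.

```python
# Hypothetical sketch of a three-stage adaptation pipeline (illustrative only).
STAGES = [
    {
        "name": "continual_pretraining",
        "data": "tibetan_text_corpus",            # Tibetan-oriented text data
        "trainable": ["language_model"],           # placeholder module choice
        "lr": 1e-5,
    },
    {
        "name": "multimodal_alignment",
        "data": "tibetan_image_text_pairs",        # Tibetan image-text pairs
        "trainable": ["projector"],                # placeholder module choice
        "lr": 1e-4,
    },
    {
        "name": "multimodal_instruction_tuning",
        "data": "tibetan_multimodal_instructions", # Tibetan multimodal instructions
        "trainable": ["projector", "language_model"],
        "lr": 2e-5,
    },
]

def run_stage(stage):
    # Placeholder: a real pipeline would load the dataset, freeze/unfreeze the
    # listed modules, and launch fine-tuning with the given learning rate.
    print(f"Running {stage['name']} on {stage['data']} (lr={stage['lr']})")

for stage in STAGES:
    run_stage(stage)
```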
Benchmark Summary
FTib-VLM shows clear improvements over the base model on Tibetan multimodal evaluation, including:
- BinaryVQA: 76.01
- MMBench: 67.78
- POPE-random: 80.56
Usage
Install dependencies:
pip install -U transformers accelerate torch pillow
Example
Replace "example.jpg" with your local image path.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

model_id = "onedday/FTib-VLM"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")

# Qwen-VL-style processors expect image placeholder tokens in the text,
# so build the prompt through the chat template rather than passing raw text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
)
# Move all tensor inputs onto the model's device.
inputs = {
    k: v.to(model.device) if hasattr(v, "to") else v
    for k, v in inputs.items()
}

generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
)
output = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(output[0])
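If the bfloat16 weights do not fit in GPU memory, loading the model in 4-bit through bitsandbytes is one option. This is an optional sketch, not part of the official usage above; it assumes a CUDA GPU and that bitsandbytes is installed (pip install bitsandbytes).

```python
# Optional: load the model in 4-bit to reduce GPU memory usage.
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "onedday/FTib-VLM",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The rest of the example (processor, chat template, and generation) stays the same.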
Limitations
- OCR and in-image text understanding remain challenging.
- Benchmark performance does not fully reflect real-world reliability.
- The model is intended primarily for research use.
- As a low-resource adapted model, output quality may vary across domains and prompt styles.
Ethical Considerations
This model is released to support Tibetan multimodal research and improve access to low-resource language technology. However, like other vision-language models, it may produce incorrect, biased, or misleading outputs. It should be used with care in high-stakes or reliability-sensitive scenarios.
Citation
If you use this model, please cite the FTibSuite paper:
@article{xu2026ftibsuite,
title={FTibSuite: A Comprehensive Resource Suite for Tibetan Vision--Language Modeling},
author={Xu, Guixian and Liang, Yide and Su, Zeli and Song, Xuexian and Zhang, Ziyin and Dong, Yushuang and Zhang, Ting and Han, Xu},
year={2026}
}
You may also cite this repository as:
@misc{onedday_ftib_vlm,
title = {FTib-VLM},
author = {onedday},
year = {2026},
howpublished = {\url{https://huggingface.co/onedday/FTib-VLM}}
}