FTib-VLM

FTib-VLM is a Tibetan vision-language model fine-tuned from Qwen/Qwen3-VL-8B-Instruct for multimodal understanding in low-resource language settings. It is released as part of the FTibSuite project to support reproducible Tibetan multimodal research.

Model Details

  • Model: onedday/FTib-VLM
  • Base model: Qwen/Qwen3-VL-8B-Instruct
  • Architecture: qwen3_vl
  • Parameters: ~8.8B
  • License: Apache-2.0
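
As a quick sanity check (assuming network access to the Hugging Face Hub), the declared architecture can be confirmed from the hosted config:

from transformers import AutoConfig

# Download only config.json and report the architecture family.
config = AutoConfig.from_pretrained("onedday/FTib-VLM")
print(config.model_type)  # expected: "qwen3_vl"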

Highlights

  • Fine-tuned for Tibetan multimodal understanding
  • Built on top of a strong open-source VLM backbone
  • Trained with a three-stage adaptation pipeline
  • Released for research, evaluation, and downstream adaptation
  • Shows strong improvements on Tibetan multimodal benchmarks

Intended Use

FTib-VLM is intended for research and experimental applications such as:

  • Tibetan image question answering
  • Tibetan image description
  • Multimodal instruction following
  • Tibetan-oriented visual reasoning
  • Low-resource vision-language adaptation research

Training Overview

FTib-VLM is fine-tuned from Qwen/Qwen3-VL-8B-Instruct using a three-stage pipeline:

  1. Continual Pretraining on Tibetan-oriented text data
  2. Multimodal Alignment on Tibetan image-text pairs
  3. Multimodal Instruction Tuning on Tibetan multimodal instruction data

The goal is to improve Tibetan multimodal capability while preserving the strengths of the base vision-language model.
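
Schematically, the stages differ in the data they consume and the components they update. The sketch below is illustrative only; the dataset names and trainable-module choices are assumptions, not the released training configuration:

# Hypothetical outline of the three-stage pipeline described above.
STAGES = [
    {
        "stage": "continual_pretraining",
        "data": "Tibetan text corpus",             # text-only
        "updates": ["language model"],
    },
    {
        "stage": "multimodal_alignment",
        "data": "Tibetan image-text pairs",        # paired captions
        "updates": ["vision-language projector"],  # assumption
    },
    {
        "stage": "multimodal_instruction_tuning",
        "data": "Tibetan multimodal instructions",
        "updates": ["language model", "projector"],  # assumption
    },
]

for s in STAGES:
    print(f"{s['stage']}: {s['data']} -> updates {s['updates']}")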

Benchmark Summary

FTib-VLM improves clearly over the base model on Tibetan multimodal evaluation. Representative scores:

  • BinaryVQA: 76.01
  • MMBench: 67.78
  • POPE-random: 80.56

Usage

Install dependencies:

pip install -U transformers accelerate torch pillow
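
The qwen3_vl architecture is only registered in recent transformers releases, so it is worth confirming the environment before loading the model (a minimal sanity check):

import torch
import transformers

# qwen3_vl is only available in recent transformers releases; if loading
# fails with an unknown-architecture error, upgrade transformers.
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())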

Example

Replace "example.jpg" with your local image path.

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

model_id = "onedday/FTib-VLM"

# Load the processor (tokenizer + image preprocessor) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 halves memory; requires recent GPU support
    device_map="auto",           # spread layers across available devices
)

image = Image.open("example.jpg").convert("RGB")

# Build a chat-formatted prompt so the instruction-tuned model receives
# the template (and image placeholder tokens) it was trained with.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]

output = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(output[0])
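
For Tibetan image question answering, the same pipeline applies with a different user message; the question below is an arbitrary illustration, not a prompt from the model's training data:

# Swap the text content to ask a question instead of requesting a caption.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are visible in this image?"},
        ],
    }
]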

Limitations

  • OCR and in-image text understanding remain challenging.
  • Benchmark performance does not fully reflect real-world reliability.
  • The model is intended primarily for research use.
  • As a low-resource adapted model, output quality may vary across domains and prompt styles.

Ethical Considerations

This model is released to support Tibetan multimodal research and improve access to low-resource language technology. However, like other vision-language models, it may produce incorrect, biased, or misleading outputs. It should be used with care in high-stakes or reliability-sensitive scenarios.

Citation

If you use this model, please cite the FTibSuite paper:

@article{xu2026ftibsuite,
  title        = {FTibSuite: A Comprehensive Resource Suite for Tibetan Vision--Language Modeling},
  author       = {Xu, Guixian and Liang, Yide and Su, Zeli and Song, Xuexian and Zhang, Ziyin and Dong, Yushuang and Zhang, Ting and Han, Xu},
  year         = {2026}
}

You may also cite this repository as:

@misc{onedday_ftib_vlm,
  title        = {FTib-VLM},
  author       = {onedday},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/onedday/FTib-VLM}}
}
