FTib-VLM

FTib-VLM is a Tibetan vision-language model fine-tuned from Qwen/Qwen3-VL-8B-Instruct for multimodal understanding in low-resource language settings. It is released as part of the FTibSuite project to support reproducible Tibetan multimodal research.

Model Details

  • Model: onedday/FTib-VLM
  • Base model: Qwen/Qwen3-VL-8B-Instruct
  • Architecture: qwen3_vl
  • Parameters: ~8.8B
  • License: Apache-2.0
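
As a quick sanity check (assuming network access to the Hugging Face Hub), the declared architecture can be confirmed from the hosted config:

from transformers import AutoConfig

# Download only config.json and report the architecture family.
config = AutoConfig.from_pretrained("onedday/FTib-VLM")
print(config.model_type)  # expected: "qwen3_vl"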

Highlights

  • Fine-tuned for Tibetan multimodal understanding
  • Built on top of a strong open-source VLM backbone
  • Trained with a three-stage adaptation pipeline
  • Released for research, evaluation, and downstream adaptation
  • Shows strong improvements on Tibetan multimodal benchmarks

Intended Use

FTib-VLM is intended for research and experimental applications such as:

  • Tibetan image question answering
  • Tibetan image description
  • Multimodal instruction following
  • Tibetan-oriented visual reasoning
  • Low-resource vision-language adaptation research

Training Overview

FTib-VLM is fine-tuned from Qwen/Qwen3-VL-8B-Instruct using a three-stage pipeline:

  1. Continual Pretraining on Tibetan-oriented text data
  2. Multimodal Alignment on Tibetan image-text pairs
  3. Multimodal Instruction Tuning on Tibetan multimodal instruction data

The goal is to improve Tibetan multimodal capability while preserving the strengths of the base vision-language model.
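
Schematically, the stages differ in the data they consume and the components they update. The sketch below is illustrative only; the dataset names and trainable-module choices are assumptions, not the released training configuration:

# Hypothetical outline of the three-stage pipeline described above.
STAGES = [
    {
        "stage": "continual_pretraining",
        "data": "Tibetan text corpus",             # text-only
        "updates": ["language model"],
    },
    {
        "stage": "multimodal_alignment",
        "data": "Tibetan image-text pairs",        # paired captions
        "updates": ["vision-language projector"],  # assumption
    },
    {
        "stage": "multimodal_instruction_tuning",
        "data": "Tibetan multimodal instructions",
        "updates": ["language model", "projector"],  # assumption
    },
]

for s in STAGES:
    print(f"{s['stage']}: {s['data']} -> updates {s['updates']}")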

Benchmark Summary

FTib-VLM improves clearly over the base model on Tibetan multimodal evaluation. Representative scores:

  • BinaryVQA: 76.01
  • MMBench: 67.78
  • POPE-random: 80.56

Usage

Install dependencies:

pip install -U transformers accelerate torch pillow
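
The qwen3_vl architecture is only registered in recent transformers releases, so it is worth confirming the environment before loading the model (a minimal sanity check):

import torch
import transformers

# qwen3_vl is only available in recent transformers releases; if loading
# fails with an unknown-architecture error, upgrade transformers.
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())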

Example

Replace "example.jpg" with your local image path.

from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

model_id = "onedday/FTib-VLM"

# Load the processor (tokenizer + image preprocessor) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 halves memory; requires recent GPU support
    device_map="auto",           # spread layers across available devices
)

image = Image.open("example.jpg").convert("RGB")

# Build a chat-formatted prompt so the instruction-tuned model receives
# the template (and image placeholder tokens) it was trained with.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]

output = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(output[0])
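
For Tibetan image question answering, the same pipeline applies with a different user message; the question below is an arbitrary illustration, not a prompt from the model's training data:

# Swap the text content to ask a question instead of requesting a caption.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are visible in this image?"},
        ],
    }
]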

Limitations

  • OCR and in-image text understanding remain challenging.
  • Benchmark performance does not fully reflect real-world reliability.
  • The model is intended primarily for research use.
  • As a low-resource adapted model, output quality may vary across domains and prompt styles.

Ethical Considerations

This model is released to support Tibetan multimodal research and improve access to low-resource language technology. However, like other vision-language models, it may produce incorrect, biased, or misleading outputs. It should be used with care in high-stakes or reliability-sensitive scenarios.

Citation

If you use this model, please cite the FTibSuite paper:

@article{xu2026ftibsuite,
  title        = {FTibSuite: A Comprehensive Resource Suite for Tibetan Vision--Language Modeling},
  author       = {Xu, Guixian and Liang, Yide and Su, Zeli and Song, Xuexian and Zhang, Ziyin and Dong, Yushuang and Zhang, Ting and Han, Xu},
  year         = {2026}
}

You may also cite this repository as:

@misc{onedday_ftib_vlm,
  title        = {FTib-VLM},
  author       = {onedday},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/onedday/FTib-VLM}}
}
