---
library_name: transformers
license: apache-2.0
language:
- multilingual
- af
- am
- ar
- as
- azb
- be
- bg
- bm
- bn
- bo
- bs
- ca
- ceb
- cs
- cy
- da
- de
- du
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- ga
- gd
- gl
- ha
- hi
- hr
- ht
- hu
- id
- ig
- is
- it
- iw
- ja
- jv
- ka
- ki
- kk
- km
- ko
- la
- lb
- ln
- lo
- lt
- lv
- mi
- mr
- ms
- mt
- my
- 'no'
- oc
- pa
- pl
- pt
- qu
- ro
- ru
- sa
- sc
- sd
- sg
- sk
- sl
- sm
- so
- sq
- sr
- ss
- sv
- sw
- ta
- te
- th
- ti
- tl
- tn
- tpi
- tr
- ts
- tw
- uk
- ur
- uz
- vi
- war
- wo
- xh
- yo
- zh
- zu
base_model:
- Qwen/Qwen2.5-7B-Instruct
- timm/ViT-SO400M-14-SigLIP-384
pipeline_tag: image-text-to-text
---

# Centurio Qwen


## Model Details


### Model Description


<!-- Provide a longer summary of what this model is. -->

- **Model type:** Centurio is an open-source multilingual large vision-language model.
- **Training Data:** COMING SOON
- **Languages:** The model was trained with the following 100 languages: `af, am, ar, ar-eg, as, azb, be, bg, bm, bn, bo, bs, ca, ceb, cs, cy, da, de, du, el, en, eo, es, et, eu, fa, fi, fr, ga, gd, gl, ha, hi, hr, ht, hu, id, ig, is, it, iw, ja, jv, ka, ki, kk, km, ko, la, lb, ln, lo, lt, lv, mi, mr, ms, mt, my, no, oc, pa, pl, pt, qu, ro, ru, sa, sc, sd, sg, sk, sl, sm, so, sq, sr, ss, sv, sw, ta, te, th, ti, tl, tn, tpi, tr, ts, tw, uk, ur, uz, vi, war, wo, xh, yo, zh, zu`
- **License:** This work is released under the Apache 2.0 license.


### Model Sources


<!-- Provide the basic links for the model. -->


- **Repository:** [gregor-ge.github.io/Centurio](https://gregor-ge.github.io/Centurio)
- **Paper:** [arXiv](https://arxiv.org/abs/2501.05122)


## Uses


<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->


### Direct Use


<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->


The model can be used directly through the `transformers` library with our custom code.


```python
from transformers import AutoModelForCausalLM, AutoProcessor
import timm  # required for the SigLIP vision encoder loaded by the remote code
from PIL import Image
import requests

url = "https://upload.wikimedia.org/wikipedia/commons/b/bd/Golden_Retriever_Dukedestiny01_drvd.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model_name = "WueNLP/centurio_qwen"

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# The position of each image in the prompt is indicated with '<image_placeholder>'.
prompt = "<image_placeholder>\nBriefly describe the image in German."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # This is the system prompt used during our training.
    {"role": "user", "content": prompt}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)

model_inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
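The list comprehension that strips the prompt from each generated sequence works on plain token-ID lengths; a minimal self-contained illustration with toy IDs (hypothetical values, no model needed):

```python
# Toy token IDs standing in for real tokenizer output (hypothetical values).
input_ids = [[101, 7, 8], [101, 9, 10, 11]]                   # the prompts
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 11, 44]]   # prompts + continuations

# Keep only the newly generated continuation of each sequence.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43], [44]]
```

Without this step, `batch_decode` would return the prompt text again in front of every response.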


#### Multiple Images
We natively support multi-image inputs. You only have to 1) include one `<image_placeholder>` per image in the prompt and 2) pass all images of the *entire batch* as a single flat list:

```python
[...]
# Variables reused from above.

processor.tokenizer.padding_side = "left"  # default is 'right', but it must be 'left' for batched generation to work correctly!

image_multi_1, image_multi_2 = [...]  # prepare additional images

prompt_multi = "What is the difference between the following images?\n<image_placeholder><image_placeholder>\nAnswer in German."

messages_multi = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_multi}
]

text_multi = processor.apply_chat_template(
    messages_multi,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = processor(text=[text, text_multi], images=[image, image_multi_1, image_multi_2], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)

[...]
```
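The two conventions above (one placeholder per image, one flat image list for the whole batch) can be sketched with a small helper; `build_batch` and `IMAGE_TOKEN` are illustrative names for this sketch, not part of the released API:

```python
from typing import List, Tuple

IMAGE_TOKEN = "<image_placeholder>"  # placeholder the Centurio processor expects

def build_batch(samples: List[Tuple[str, List[str]]]) -> Tuple[List[str], List[str]]:
    """Prefix each prompt with one placeholder per image and flatten all
    images of the batch into the single flat list the processor expects."""
    texts, flat_images = [], []
    for prompt, images in samples:
        texts.append(IMAGE_TOKEN * len(images) + "\n" + prompt)
        flat_images.extend(images)
    return texts, flat_images

# Two samples: a single-image prompt and a two-image comparison.
texts, images = build_batch([
    ("Briefly describe the image in German.", ["img_a"]),
    ("What is the difference between the images? Answer in German.", ["img_b", "img_c"]),
])
print(texts[1].count(IMAGE_TOKEN))  # 2
print(images)                       # ['img_a', 'img_b', 'img_c']
```

The placeholders may also sit anywhere inside the prompt (as in the example above); only their count per sample and the flat ordering of the image list matter.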


## Bias, Risks, and Limitations


- General biases, risks, and limitations of large vision-language models, such as hallucinations or biases inherited from the training data, apply.
- This is a research project and *not* recommended for production use.
- Multilingual: Performance and generation quality can differ widely between languages.
- OCR: The model struggles with both small text and writing in non-Latin scripts.


## Citation


<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->


**BibTeX:**


```
@article{centurio2025,
  author       = {Gregor Geigle and
                  Florian Schneider and
                  Carolin Holtermann and
                  Chris Biemann and
                  Radu Timofte and
                  Anne Lauscher and
                  Goran Glava\v{s}},
  title        = {Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models},
  journal      = {arXiv},
  volume       = {abs/2501.05122},
  year         = {2025},
  url          = {https://arxiv.org/abs/2501.05122},
  eprinttype   = {arXiv},
  eprint       = {2501.05122},
}
```