---
language:
- en
- de
- fr
- es
- ru
- zh
base_model:
- microsoft/Florence-2-large
- google/gemma-2-9b
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Image-to-Text
- Image-Text-to-Text
- Translation
datasets:
- Spravil/cc12m_ccmatrix_captions_and_translations
---
# Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation

<a href="https://arxiv.org/abs/2503.09443"><img src="https://img.shields.io/badge/cs.CL-2503.09443-b31b1b?logo=arxiv&logoColor=red"></a>
<a href="https://spravil.com/projects/caption_via_translation/" alt="Project Page"> <img alt="Project page" src="https://img.shields.io/badge/Project Page-blue"></a>

# 11.2B Model

The 11.2B model uses [**Google's Gemma-2**](https://huggingface.co/google/gemma-2-9b) as the decoder and [**Microsoft's Florence-2-large**](https://huggingface.co/microsoft/Florence-2-large) as the encoder, and is trained on a synthetic dataset.
As a **pre-trained version**, its coverage across tasks and languages is currently limited.
It supports **image captioning in English and German** and **multimodal machine translation from English to German, French, Spanish, Russian, and Chinese**.

# Getting Started

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model with its custom remote code.
model = AutoModelForCausalLM.from_pretrained("Spravil/caption-via-translation-11_2B", torch_dtype=torch_dtype, trust_remote_code=True).to(device)

# The Gemma-2 tokenizer handles the decoder side.
tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-2-2b",
    add_bos_token=True,
    add_eos_token=True,
    padding_side="right",
    truncation_side="right",
)
processor = AutoProcessor.from_pretrained("Spravil/caption-via-translation-11_2B", trust_remote_code=True, new_tokenizer=tokenizer, use_encoder_tokenizer=True)

# The prompt combines a target-language token with a task token.
task = "<MORE_DETAILED_CAPTION>"
lang = "de"
prompt = f"<LANG_{lang.upper()}>{task}"

# Fetch an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Generate and decode the caption.
inputs = processor(prompt, images=image, return_tensors="pt").to(device, torch_dtype)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    do_sample=False,
    use_cache=False,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task=task, image_size=(image.width, image.height))
print(parsed_answer)
```
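The output language is selected entirely through the prompt prefix: a `<LANG_XX>` token followed by the task token, as in the snippet above. A minimal sketch of building the prompts for the two supported captioning languages (string construction only, no model download required):

```python
# Build captioning prompts for the two supported caption languages (EN, DE).
task = "<MORE_DETAILED_CAPTION>"
prompts = {lang: f"<LANG_{lang.upper()}>{task}" for lang in ("en", "de")}
print(prompts["de"])  # <LANG_DE><MORE_DETAILED_CAPTION>
```

Passing `prompts["en"]` instead of `prompts["de"]` to the processor switches the caption from German to English.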

# BibTeX

```
@inproceedings{spravil2026scaling,
  title={Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation},
  author={Spravil, Julian and Houben, Sebastian and Behnke, Sven},
  booktitle={Proceedings of the 40th AAAI Conference on Artificial Intelligence},
  year={2026}
}
```