---
language:
- en
- de
- fr
- es
- ru
- zh
base_model:
- microsoft/Florence-2-base
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Image-to-Text
- Image-Text-to-Text
- Translation
datasets:
- Spravil/cc12m_ccmatrix_captions_and_translations
---
# Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation

<a href="https://arxiv.org/abs/2503.09443"><img src="https://img.shields.io/badge/cs.CL-2503.09443-b31b1b?logo=arxiv&logoColor=red"></a>
<a href="https://spravil.com/projects/caption_via_translation/" alt="Project Page"><img alt="Project page" src="https://img.shields.io/badge/Project Page-blue"></a>
# 0.4B Model

The 0.4B model is built on [**Microsoft's Florence-2-base**](https://huggingface.co/microsoft/Florence-2-base) and trained on a synthetic dataset.
As a **pre-trained version**, its coverage of tasks and languages is currently limited.
It supports **image captioning in English and German**, and **multimodal machine translation from English to German, French, Spanish, Russian, and Chinese**.
# Getting Started

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model and its processor; the processor wraps the Gemma tokenizer
# used for the text encoder.
model = AutoModelForCausalLM.from_pretrained(
    "Spravil/caption-via-translation-0_4B",
    torch_dtype=torch_dtype,
    trust_remote_code=True,
).to(device)
tokenizer = AutoTokenizer.from_pretrained(
    "google/gemma-2-2b",
    add_bos_token=True,
    add_eos_token=True,
    padding_side="right",
    truncation_side="right",
)
processor = AutoProcessor.from_pretrained(
    "Spravil/caption-via-translation-0_4B",
    trust_remote_code=True,
    new_tokenizer=tokenizer,
    use_encoder_tokenizer=True,
)

# The prompt combines a language token with a task token.
task = "<MORE_DETAILED_CAPTION>"
lang = "de"
prompt = f"<LANG_{lang.upper()}>{task}"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(prompt, images=image, return_tensors="pt").to(device, torch_dtype)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    num_beams=4,
    do_sample=False,
    use_cache=False,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task=task, image_size=(image.width, image.height))
print(parsed_answer)
```
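
The prompt above is plain string composition: a `<LANG_XX>` token followed by a task token. A minimal helper can make the valid captioning choices explicit; this is a sketch, assuming (per the model description above) that only English and German captions are supported by this pre-trained checkpoint, and `build_caption_prompt` is a hypothetical name, not part of the model's API:

```python
# Languages supported for captioning by this pre-trained checkpoint
# (assumption based on the model description above).
SUPPORTED_CAPTION_LANGS = {"en", "de"}


def build_caption_prompt(lang: str, task: str = "<MORE_DETAILED_CAPTION>") -> str:
    """Compose a <LANG_XX> language token with a task token."""
    if lang not in SUPPORTED_CAPTION_LANGS:
        raise ValueError(f"Captioning is not supported for language: {lang!r}")
    return f"<LANG_{lang.upper()}>{task}"


print(build_caption_prompt("de"))  # <LANG_DE><MORE_DETAILED_CAPTION>
```

Swapping `lang` between `"en"` and `"de"` is all that is needed to change the caption language; the rest of the generation code stays the same.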
68
+
69
+ # Bibtex
70
+ ```
71
+ @inproceedings{spravil2026scaling,
72
+ title={Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation},
73
+ author={Spravil, Julian and Houben, Sebastian and Behnke, Sven},
74
+ booktitle={Proceedings of the 40th AAAI Conference on Artificial Intelligence},
75
+ year={2026}
76
+ }
77
+ ```