---
library_name: transformers
datasets:
- laicsiifes/flickr30k-pt-br-human-generated
language:
- pt
metrics:
- bleu
- rouge
- meteor
- bertscore
- clipscore
base_model:
- microsoft/Phi-3-vision-128k-instruct
pipeline_tag: image-text-to-text
model-index:
- name: Phi-3-Vision-Flickr30K-Native
  results:
  - task:
      name: Image Captioning
      type: image-text-to-text
    dataset:
      name: Flickr30K Portuguese Natively Annotated
      type: laicsiifes/flickr30k-pt-br-human-generated
      split: test
    metrics:
    - name: CIDEr-D
      type: cider
      value: 72.99
    - name: BLEU@4
      type: bleu
      value: 26.74
    - name: ROUGE-L
      type: rouge
      value: 45.78
    - name: METEOR
      type: meteor
      value: 47.45
    - name: BERTScore
      type: bertscore
      value: 72.51
    - name: CLIP-Score
      type: clipscore
      value: 55.10
license: mit
---

# 🎉 Phi-3 Vision Fine-Tuned on Flickr30K Portuguese Natively Annotated for Brazilian Portuguese Image Captioning

Phi-3 Vision ([microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct)) fine-tuned for image captioning on [Flickr30K Portuguese Natively Annotated](https://huggingface.co/datasets/laicsiifes/flickr30k-pt-br-human-generated), a dataset annotated by native Brazilian Portuguese speakers.

## 🤖 Model Description

## 🧑‍💻 How to Get Started with the Model

Use the code below to get started with the model.
- **Install libraries:**

```bash
pip install transformers==4.45.2 bitsandbytes==0.45.2 peft==0.13.2
```

- **Python code:**

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig
from huggingface_hub import login

# authenticate with your Hugging Face access token
login('hf_...')

# load the base model in 4-bit, attach the fine-tuned image captioning
# adapter, and load the corresponding tokenizer and image processor
model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-vision-128k-instruct',
    device_map='cuda',
    trust_remote_code=True,
    _attn_implementation='eager',
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
)
model.load_adapter('laicsiifes/phi3-vision-flickr30k_pt_human_generated')
processor = AutoProcessor.from_pretrained(
    'laicsiifes/phi3-vision-flickr30k_pt_human_generated',
    trust_remote_code=True
)

# download and preprocess an example image
image = Image.open(requests.get(
    'http://images.cocodataset.org/val2014/COCO_val2014_000000458153.jpg',
    stream=True
).raw)

# build the chat prompt; the Portuguese instruction reads: "Write a
# Brazilian Portuguese description for the image in at most 25 words."
text_prompt = processor.tokenizer.apply_chat_template(
    [
        {
            'role': 'user',
            'content': '<|image_1|>\nEscreva uma descrição em português do Brasil '
                       'para a imagem com no máximo 25 palavras.'
        }
    ],
    tokenize=False,
    add_generation_prompt=True
)
inputs = processor(
    text=text_prompt,
    images=image,
    return_tensors='pt'
).to('cuda:0')

# generate the caption and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=25)
prediction = generated_ids[:, inputs['input_ids'].shape[1]:].tolist()
generated_text = processor.batch_decode(prediction, skip_special_tokens=True)[0]
```

```python
import matplotlib.pyplot as plt

# plot the image with the generated caption as its title
plt.imshow(image)
plt.axis('off')
plt.title(generated_text)
plt.show()
```

![image/png](example.png)

## 📈 Results

Evaluation metrics: CIDEr-D, BLEU@4, ROUGE-L, METEOR, BERTScore (using [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased)), and CLIP-Score (using [CAPIVARA](https://huggingface.co/hiaac-nlp/CAPIVARA)).

| Model | #Params | CIDEr-D | BLEU@4 | ROUGE-L | METEOR | BERTScore | CLIP-Score |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **ViTucano 1B** | 1.53B | 69.71 | 22.67 | 43.60 | 48.63 | 72.46 | 56.14 |
| **ViTucano 2B** | 2.88B | 71.49 | 23.75 | 44.30 | **49.49** | **72.60** | 56.47 |
| **PaliGemma** | 2.92B | 55.30 | 19.41 | 39.85 | 48.96 | 70.33 | **59.96** |
| **Phi-3 V** | 4.15B | **72.99** | **26.74** | **45.78** | 47.45 | 72.51 | 55.10 |
| **LLaMa 3.2 V** | 11.70B | 69.13 | 24.79 | 43.11 | 45.99 | 72.08 | 56.38 |

## 📋 BibTeX Entry and Citation Info

Coming soon. For now, please reference the model adapter using its Hugging Face link.
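## 🧮 Metric Illustration

The reported scores compare each generated caption against the dataset's reference captions. As an illustration of the idea behind one of them, ROUGE-L, here is a minimal pure-Python sketch of its LCS-based F-measure (the example captions are hypothetical, and the numbers in the table were computed with standard evaluation toolkits, not this snippet):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a candidate caption and a single reference."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l(
    'um cachorro corre na grama',           # hypothetical generated caption
    'um cachorro correndo na grama verde',  # hypothetical reference caption
)
print(f'{score:.4f}')  # 0.7273
```

In the full evaluation protocol, each image has multiple references, and the score is aggregated over all of them.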