---
license: apache-2.0
language:
- vi
pipeline_tag: image-to-text
library_name: transformers
tags:
- image-captioning
- vietnamese
- vision-encoder-decoder
- deit
- gpt2
---

# Vietnamese Image Captioning Model

A Vietnamese image captioning model trained on Flickr8k and 10k MSCOCO images with captions translated into Vietnamese. The model takes one image as input and generates a one-sentence caption in Vietnamese.

## Architecture

- Encoder: `facebook/deit-base-distilled-patch16-224`
- Decoder: `NlpHUST/gpt2-vietnamese`
- Transformers class: `VisionEncoderDecoderModel`

## Usage

```python
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

model_id = "slyviee/vietnamese-image-captioning"
device = "cuda" if torch.cuda.is_available() else "cpu"

# The tokenizer and weights come from the model repo; the image processor comes from the DeiT encoder checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = VisionEncoderDecoderModel.from_pretrained(model_id).to(device)
model.eval()

# Load the image as RGB and convert it to a batch of pixel values.
image = Image.open("image.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.to(device)

# Generate a caption with beam search, without tracking gradients.
with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=40, num_beams=4)

caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```
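
To caption several images at once, the single-image steps above can be wrapped in a small batching helper. The following is a minimal sketch, not part of the released model: the names `chunked`, `caption_paths`, and `make_hf_captioner` are hypothetical, and the default batch size of 8 is an arbitrary assumption. It reuses the `model`, `image_processor`, and `tokenizer` objects from the Usage snippet with the same generation settings.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def caption_paths(
    paths: Iterable[str],
    caption_batch: Callable[[Sequence[str]], List[str]],
    batch_size: int = 8,  # assumption: tune to the available memory
) -> Dict[str, str]:
    """Map each image path to its caption, calling `caption_batch` once per chunk."""
    results: Dict[str, str] = {}
    for chunk in chunked(list(paths), batch_size):
        captions = caption_batch(chunk)
        if len(captions) != len(chunk):
            raise ValueError("caption_batch must return one caption per path")
        results.update(zip(chunk, captions))
    return results


def make_hf_captioner(model, image_processor, tokenizer, device="cpu",
                      max_new_tokens=40, num_beams=4):
    """Build a `caption_batch` callable from the objects loaded in the Usage snippet."""
    # Imported lazily so the helpers above work without torch or Pillow installed.
    import torch
    from PIL import Image

    def caption_batch(paths: Sequence[str]) -> List[str]:
        # Load every image as RGB and stack them into one batch of pixel values.
        images = [Image.open(p).convert("RGB") for p in paths]
        pixel_values = image_processor(images=images, return_tensors="pt").pixel_values.to(device)
        with torch.no_grad():
            output_ids = model.generate(
                pixel_values, max_new_tokens=max_new_tokens, num_beams=num_beams
            )
        # Decode the whole batch in one call, dropping special tokens.
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    return caption_batch
```

With the objects from the Usage snippet loaded, `caption_paths(["a.jpg", "b.jpg"], make_hf_captioner(model, image_processor, tokenizer, device))` would return a dictionary from path to caption. Duplicate paths collapse into a single entry because the result is keyed by path.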