| license: apache-2.0 | |
| language: | |
| - vi | |
| pipeline_tag: image-to-text | |
| library_name: transformers | |
| tags: | |
| - image-captioning | |
| - vietnamese | |
| - vision-encoder-decoder | |
| - deit | |
| - gpt2 | |
| # Vietnamese Image Captioning | |
| A Vietnamese image captioning model trained on Flickr8k and 10k MSCOCO images whose captions were translated into Vietnamese. | |
| The model takes a single input image and generates a one-sentence caption in Vietnamese. | |
| ## Architecture | |
| - Encoder: `facebook/deit-base-distilled-patch16-224` | |
| - Decoder: `NlpHUST/gpt2-vietnamese` | |
| - Transformers class: `VisionEncoderDecoderModel` | |
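
As a rough illustration of how the two halves fit together, the sketch below works out the shape of the encoder output that the GPT-2 decoder attends to through cross-attention. It assumes the standard DeiT layout implied by the checkpoint name (224x224 input, 16x16 patches, one class token plus one distillation token) and the base hidden size of 768. The function name `encoder_sequence_length` is hypothetical and not part of Transformers.

```python
def encoder_sequence_length(
    image_size: int = 224,
    patch_size: int = 16,
    num_special_tokens: int = 2,
) -> int:
    """Number of tokens the vision encoder emits for one square image.

    Assumption: the image is split into non-overlapping square patches and
    the distilled DeiT variant prepends two extra tokens (class and
    distillation) to the patch sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + num_special_tokens


# Assumed hidden size of the "base" DeiT variant.
HIDDEN_SIZE = 768

seq_len = encoder_sequence_length()
# Expected shape of the encoder output for a batch of one image.
print((1, seq_len, HIDDEN_SIZE))  # (1, 198, 768)
```

Under these assumptions every caption token produced by the decoder can attend to 198 encoder positions: 196 image patches and the two special tokens.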
| ## Usage | |
| ```python | |
| from PIL import Image | |
| import torch | |
| from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel | |
| model_id = "slyviee/vietnamese-image-captioning" | |
| device = "cuda" if torch.cuda.is_available() else "cpu" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) | |
| image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224") | |
| model = VisionEncoderDecoderModel.from_pretrained(model_id).to(device) | |
| model.eval() | |
| image = Image.open("image.jpg").convert("RGB") | |
| pixel_values = image_processor(images=image, return_tensors="pt").pixel_values.to(device) | |
| with torch.no_grad(): | |
|     output_ids = model.generate(pixel_values, max_new_tokens=40, num_beams=4) | |
| caption = tokenizer.decode(output_ids[0], skip_special_tokens=True) | |
| print(caption) | |
| ``` | |
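
The usage snippet captions one image at a time. A minimal sketch of captioning several images in batches follows, assuming `model`, `image_processor`, and `tokenizer` have been loaded as in the snippet. The helper names `batched` and `caption_images` and the default `batch_size` are hypothetical choices for illustration, not part of the model card or of Transformers.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def caption_images(
    paths: Sequence[str],
    model,
    image_processor,
    tokenizer,
    device: str = "cpu",
    batch_size: int = 8,
) -> List[str]:
    """Return one Vietnamese caption per image path, in input order."""
    # Imported here so the helpers can be defined without loading torch or PIL.
    import torch
    from PIL import Image

    captions: List[str] = []
    for batch_paths in batched(paths, batch_size):
        # Convert to RGB so grayscale or RGBA files match the encoder input.
        images = [Image.open(path).convert("RGB") for path in batch_paths]
        pixel_values = image_processor(
            images=images, return_tensors="pt"
        ).pixel_values.to(device)
        with torch.no_grad():
            # Same generation settings as the single-image usage snippet.
            output_ids = model.generate(
                pixel_values, max_new_tokens=40, num_beams=4
            )
        decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions
```

With the objects from the usage snippet in scope, a call might look like `caption_images(["a.jpg", "b.jpg"], model, image_processor, tokenizer, device=device)`, where the file names are placeholders. Smaller batch sizes trade throughput for lower memory use, which matters with beam search on a GPU.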