---
license: apache-2.0
datasets:
- visual-layer/imagenet-1k-vl-enriched
language:
- en
metrics:
- bleu
base_model:
- timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
- openai-community/gpt2
model-index:
- name: imagenet-caption
  results:
  - task:
      type: text-generation
    metrics:
    - name: bleu
      type: bleu
      value: 0.040
      verified: true
---
# About
This project provides an image captioning model trained on the [visual-layer/imagenet-1k-vl-enriched](https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched) dataset. The model architecture combines a ViT backbone [timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k](https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k) for image feature extraction and a GPT-2 language model [openai-community/gpt2](https://huggingface.co/openai-community/gpt2) for text generation.
A custom projection layer maps the image features from the vision backbone into the input embedding space of the language model, bridging the two modalities.
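The exact shape of that projection layer is not published here, so the following is only a minimal sketch of the idea: a single linear layer that turns one pooled ViT feature into a short sequence of "visual tokens" in GPT-2's embedding space. The dimensions (512 for the ViT feature, 768 for GPT-2's hidden size) and the token count are assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn

# Hypothetical projection layer -- dimensions are assumptions, not the
# published architecture. It maps one pooled image feature into a small
# sequence of embeddings the language model can consume as a prefix.
class VisualProjection(nn.Module):
    def __init__(self, vit_dim: int = 512, gpt2_dim: int = 768, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(vit_dim, gpt2_dim * n_tokens)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, vit_dim) -> (batch, n_tokens, gpt2_dim)
        out = self.proj(feats)
        return out.view(feats.size(0), self.n_tokens, -1)

proj = VisualProjection()
x = torch.randn(2, 512)           # stand-in for a batch of ViT features
print(proj(x).shape)              # torch.Size([2, 4, 768])
```

The projected embeddings can then be concatenated with (or used in place of) the text token embeddings when feeding GPT-2.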
## How to use
To run this project, follow these steps:
### Install dependencies
This project uses `uv` for fast dependency management. To install all dependencies, run:
`uv sync`
### Run inference
To test the model and generate captions, run:
`uv run inference.py`
This script processes your input images and prints a caption for each using the trained model.
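Under the hood, decoding from a prefix of projected image embeddings looks roughly like the greedy loop below. This is a self-contained sketch, not the repository's `inference.py`: it uses a tiny randomly initialised GPT-2 (so nothing is downloaded) and a random vector standing in for the ViT feature, and the projection dimensions are assumptions.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

torch.manual_seed(0)

# Tiny, randomly initialised GPT-2 stand-in. The real model card uses
# openai-community/gpt2; a small config keeps this sketch download-free.
cfg = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=100)
lm = GPT2LMHeadModel(cfg).eval()

# Hypothetical projection: one ViT feature (512-d assumed) -> one prefix embedding.
vit_dim = 512
project = nn.Linear(vit_dim, cfg.n_embd)

image_feature = torch.randn(1, vit_dim)       # stand-in for the ViT output
embeds = project(image_feature).unsqueeze(1)  # (1, 1, n_embd) prefix

# Greedy decoding: pick the most likely next token, append its embedding,
# and repeat. A real script would stop at EOS and decode with a tokenizer.
generated = []
with torch.no_grad():
    for _ in range(5):
        logits = lm(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)                      # (1,)
        generated.append(next_id.item())
        next_embed = lm.transformer.wte(next_id).unsqueeze(1)
        embeds = torch.cat([embeds, next_embed], dim=1)

print(generated)  # five token ids; a tokenizer would decode them to text
```

With the real checkpoints, the same loop runs with the trained projection weights and the pretrained GPT-2, and the token ids are decoded back to a caption string.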
## Example
#### Input
![test image](./test_image_0.png)
#### Output
`a boy holding a fish in the woods`