---
license: apache-2.0
datasets:
- visual-layer/imagenet-1k-vl-enriched
language:
- en
metrics:
- bleu
base_model:
- timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
- openai-community/gpt2
results:
- task:
    type: text-generation
  metrics:
  - name: bleu
    type: bleu
    value: 0.040
    verified: true
---

# About

This project provides an image captioning model trained on the [visual-layer/imagenet-1k-vl-enriched](https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched) dataset.

The architecture combines a ViT backbone, [timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k](https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k), for image feature extraction with a GPT-2 language model, [openai-community/gpt2](https://huggingface.co/openai-community/gpt2), for text generation. A custom projection layer maps the image features from the vision backbone into the input space of the language model, bridging the two modalities.

## How to use

To run this app, follow these steps:

### Install dependencies

This project uses uv for fast dependency management. To install all dependencies, run `uv sync`.

### Run inference

To test the model and generate captions, run `uv run inference.py`.

This processes your input images and prints captions generated by the trained model.

## Example

#### Input

![test image](./test_image_0.png)

#### Output

`a boy holding a fish in the woods`
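
## Architecture sketch

The projection layer described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the ViT feature dimension (512 here) is a hypothetical value, the GPT-2 hidden size of 768 comes from the base `openai-community/gpt2` config, and the real layer may add normalization or nonlinearities.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration. The real projection layer in
# this repo may use different sizes or a deeper mapping.
VIT_DIM = 512    # assumed feature size of the vision backbone
GPT2_DIM = 768   # hidden size of openai-community/gpt2

class VisionProjection(nn.Module):
    """Maps per-patch image features from the vision backbone into the
    GPT-2 embedding space, so they can serve as a prefix for caption
    generation."""

    def __init__(self, vit_dim: int = VIT_DIM, gpt2_dim: int = GPT2_DIM):
        super().__init__()
        self.proj = nn.Linear(vit_dim, gpt2_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vit_dim)
        # returns:        (batch, num_patches, gpt2_dim)
        return self.proj(image_features)

# Smoke test with dummy features standing in for real ViT outputs.
# A 384px image with 16px patches yields a 24x24 = 576 patch grid.
features = torch.randn(2, 576, VIT_DIM)
projected = VisionProjection()(features)
print(projected.shape)  # torch.Size([2, 576, 768])
```

The projected sequence can then be passed to GPT-2 as input embeddings (e.g. via `inputs_embeds`), letting the language model attend to the image features while decoding the caption.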