---
license: apache-2.0
datasets:
  - visual-layer/imagenet-1k-vl-enriched
language:
  - en
metrics:
  - bleu
base_model:
  - timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
  - openai-community/gpt2
results:
  - task:
      type: text-generation
    metrics:
      - name: bleu
        type: bleu
        value: 0.04
        verified: true
---

## About

This project provides an image captioning model trained on the [visual-layer/imagenet-1k-vl-enriched](https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched) dataset. The architecture combines a ViT backbone (`timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k`) for image feature extraction with a GPT-2 language model (`openai-community/gpt2`) for text generation.

A custom projection layer maps the image features from the vision backbone into the input embedding space of the language model, bridging the two modalities.
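As an illustration, such a projection layer can be sketched as a single linear map that turns one pooled image feature vector into a short "prefix" of language-model embeddings. The class name `CaptionProjection`, the 512-d ViT feature size, and the prefix length of 10 are assumptions for this sketch, not the repository's actual code; only GPT-2 small's 768-d hidden size is a known quantity.

```python
import torch
import torch.nn as nn

class CaptionProjection(nn.Module):
    """Hypothetical sketch: project ViT features into GPT-2's embedding space."""

    def __init__(self, vit_dim: int = 512, gpt2_dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len = prefix_len
        # Map one pooled image feature to `prefix_len` GPT-2-sized embeddings.
        self.proj = nn.Linear(vit_dim, gpt2_dim * prefix_len)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vit_dim) -> (batch, prefix_len, gpt2_dim)
        batch = image_features.shape[0]
        return self.proj(image_features).view(batch, self.prefix_len, -1)

proj = CaptionProjection()
prefix = proj(torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 10, 768])
```

The resulting prefix can be concatenated in front of the caption token embeddings so GPT-2 conditions its generation on the image.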

## How to use

To run this project, follow these steps:

### Install dependencies

This project uses `uv` for fast dependency management. To install all dependencies, run:

```shell
uv sync
```

### Run inference

To test the model and generate captions, run:

```shell
uv run inference.py
```

This will process your input images and output captions using the trained model.
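The generation step inside such an inference script can be sketched as a greedy decoding loop: the projected image prefix is fed to the language model, and the most likely next token is appended one step at a time. `DummyLM` below stands in for GPT-2 so the sketch is self-contained; all names and sizes here are assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN = 100, 768  # assumed toy vocabulary; 768 matches GPT-2 small

class DummyLM(nn.Module):
    """Stand-in for GPT-2: embeddings in, next-token logits out."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # A real GPT-2 applies transformer blocks here; only the
        # interface matters for this sketch.
        return self.head(inputs_embeds)

@torch.no_grad()
def greedy_caption(lm, prefix, max_new_tokens=5):
    # prefix: (1, prefix_len, HIDDEN) projected image features
    embeds, tokens = prefix, []
    for _ in range(max_new_tokens):
        logits = lm(embeds)                 # (1, seq, VOCAB)
        next_id = logits[:, -1].argmax(-1)  # greedy pick of the next token
        tokens.append(next_id.item())
        # Append the chosen token's embedding and continue decoding.
        embeds = torch.cat([embeds, lm.embed(next_id)[:, None]], dim=1)
    return tokens

lm = DummyLM()
caption_ids = greedy_caption(lm, torch.randn(1, 10, HIDDEN))
print(len(caption_ids))  # 5
```

A real script would replace `DummyLM` with the fine-tuned GPT-2 and decode the token ids back to text with its tokenizer.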

## Example

### Input

*(test image)*

### Output

> a boy holding a fish in the woods