---
license: apache-2.0
datasets:
- visual-layer/imagenet-1k-vl-enriched
language:
- en
metrics:
- bleu
base_model:
- timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
- openai-community/gpt2
results:
- task:
    type: text-generation
  metrics:
  - name: bleu
    type: bleu
    value: 0.040
    verified: true
---
# About

This project provides an image captioning model trained on the [visual-layer/imagenet-1k-vl-enriched](https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched) dataset. The architecture pairs a ViT backbone, [timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k](https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k), for image feature extraction with a GPT-2 language model, [openai-community/gpt2](https://huggingface.co/openai-community/gpt2), for text generation.
A custom projection layer maps the image features from the vision backbone into the input embedding space of the language model, bridging the two modalities.
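The projection layer itself is not included in this card, so the sketch below is only illustrative: it assumes a pooled 512-dimensional ViT feature vector and GPT-2's 768-dimensional hidden size, and expands each image feature into a few "soft" prefix tokens. The class name, layer sizes, and prefix length are all assumptions, not the trained model's actual configuration.

```python
import torch
import torch.nn as nn


class CaptionProjection(nn.Module):
    """Maps a pooled ViT image feature into GPT-2's embedding space.

    Dimensions are assumptions: 512 for the ViT feature and 768 for
    GPT-2's hidden size; adjust to match the actual backbones.
    """

    def __init__(self, vit_dim: int = 512, gpt2_dim: int = 768, n_prefix: int = 4):
        super().__init__()
        self.n_prefix = n_prefix
        self.gpt2_dim = gpt2_dim
        # Expand one image feature vector into n_prefix prefix-token embeddings.
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, gpt2_dim * n_prefix),
            nn.GELU(),
            nn.Linear(gpt2_dim * n_prefix, gpt2_dim * n_prefix),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, vit_dim) -> (batch, n_prefix, gpt2_dim)
        batch = image_features.shape[0]
        return self.proj(image_features).view(batch, self.n_prefix, self.gpt2_dim)


proj = CaptionProjection()
prefix = proj(torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 4, 768])
```

The resulting `(batch, n_prefix, gpt2_dim)` tensor can be concatenated with token embeddings and fed to the language model as a prompt.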
## How to use

To run this project, follow these steps:
### Install dependencies

This project uses [uv](https://github.com/astral-sh/uv) for fast dependency management. To install all dependencies, run:

`uv sync`
### Run inference

To test the model and generate captions, run:

`uv run inference.py`

This will process your input images and output captions using the trained model.
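The inference script loads the real checkpoints, but the decoding step it performs can be illustrated with a self-contained toy: greedy decoding seeded by the projected image-feature prefix. The GRU "language model" below is only a stand-in for GPT-2 so the example runs without downloading weights; every name and size here is illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; sizes are illustrative only.
vocab_size, hidden = 100, 768
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size)
backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for GPT-2 blocks


def greedy_caption(prefix: torch.Tensor, max_len: int = 8, eos_id: int = 0) -> list[int]:
    """Greedily decode token ids, seeded by projected image features."""
    tokens: list[int] = []
    inputs = prefix  # (1, n_prefix, hidden): the projected image prefix
    state = None
    for _ in range(max_len):
        out, state = backbone(inputs, state)
        logits = lm_head(out[:, -1])          # logits for the next token
        next_id = int(logits.argmax(dim=-1))  # greedy choice
        if next_id == eos_id:
            break
        tokens.append(next_id)
        inputs = embed(torch.tensor([[next_id]]))  # feed the chosen token back
    return tokens


caps = greedy_caption(torch.randn(1, 4, hidden))
```

In the real script the token ids would then be detokenized with the GPT-2 tokenizer to produce a caption string.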
## Example

#### Input



#### Output

`a boy holding a fish in the woods`
|