---
license: apache-2.0
datasets:
- visual-layer/imagenet-1k-vl-enriched
language:
- en
metrics:
- bleu
base_model:
- timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
- openai-community/gpt2
results:
- task:
    type: text-generation
  metrics:
  - name: bleu
    type: bleu
    value: 0.040
    verified: true
---

# About

This project provides an image captioning model trained on the [visual-layer/imagenet-1k-vl-enriched](https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched) dataset.

The architecture combines a ViT backbone, [timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k](https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k), for image feature extraction with a GPT-2 language model, [openai-community/gpt2](https://huggingface.co/openai-community/gpt2), for text generation. A custom projection layer maps the image features from the vision backbone into the input space of the language model, bridging the two modalities.

## How to use

To run this app, follow these steps:

### Install dependencies

This project uses uv for fast dependency management. To install all dependencies, run `uv sync`.

### Run inference

To test the model and generate captions, run `uv run inference.py`.

This processes your input images and prints captions generated by the trained model.

## Example

#### Input

![test image](./test_image_0.png)

#### Output

`a boy holding a fish in the woods`
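
## Architecture sketch

The projection layer described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the ViT feature dimension (512 here) is a hypothetical value, the GPT-2 hidden size of 768 comes from the base `openai-community/gpt2` config, and the real layer may add normalization or nonlinearities.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration. The real projection layer in
# this repo may use different sizes or a deeper mapping.
VIT_DIM = 512    # assumed feature size of the vision backbone
GPT2_DIM = 768   # hidden size of openai-community/gpt2

class VisionProjection(nn.Module):
    """Maps per-patch image features from the vision backbone into the
    GPT-2 embedding space, so they can serve as a prefix for caption
    generation."""

    def __init__(self, vit_dim: int = VIT_DIM, gpt2_dim: int = GPT2_DIM):
        super().__init__()
        self.proj = nn.Linear(vit_dim, gpt2_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vit_dim)
        # returns:        (batch, num_patches, gpt2_dim)
        return self.proj(image_features)

# Smoke test with dummy features standing in for real ViT outputs.
# A 384px image with 16px patches yields a 24x24 = 576 patch grid.
features = torch.randn(2, 576, VIT_DIM)
projected = VisionProjection()(features)
print(projected.shape)  # torch.Size([2, 576, 768])
```

The projected sequence can then be passed to GPT-2 as input embeddings (e.g. via `inputs_embeds`), letting the language model attend to the image features while decoding the caption.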