---
license: apache-2.0
datasets:
- visual-layer/imagenet-1k-vl-enriched
language:
- en
metrics:
- bleu
base_model:
- timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k
- openai-community/gpt2
results:
- task:
    type: text-generation
  metrics:
  - name: bleu
    type: bleu
    value: 0.040
    verified: true
---
# About

This project provides an image captioning model trained on the [visual-layer/imagenet-1k-vl-enriched](https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched) dataset. The architecture pairs a ViT backbone, [timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k](https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k), for image feature extraction with a GPT-2 language model, [openai-community/gpt2](https://huggingface.co/openai-community/gpt2), for text generation.
A custom projection layer maps the image features from the vision backbone into the input embedding space of the language model, bridging the two modalities.
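The projection layer itself is not included in this card, so the sketch below is only illustrative: it assumes a pooled 512-dimensional ViT feature vector and GPT-2's 768-dimensional hidden size, and expands each image feature into a few "soft" prefix tokens. The class name, layer sizes, and prefix length are all assumptions, not the trained model's actual configuration.

```python
import torch
import torch.nn as nn


class CaptionProjection(nn.Module):
    """Maps a pooled ViT image feature into GPT-2's embedding space.

    Dimensions are assumptions: 512 for the ViT feature and 768 for
    GPT-2's hidden size; adjust to match the actual backbones.
    """

    def __init__(self, vit_dim: int = 512, gpt2_dim: int = 768, n_prefix: int = 4):
        super().__init__()
        self.n_prefix = n_prefix
        self.gpt2_dim = gpt2_dim
        # Expand one image feature vector into n_prefix prefix-token embeddings.
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, gpt2_dim * n_prefix),
            nn.GELU(),
            nn.Linear(gpt2_dim * n_prefix, gpt2_dim * n_prefix),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, vit_dim) -> (batch, n_prefix, gpt2_dim)
        batch = image_features.shape[0]
        return self.proj(image_features).view(batch, self.n_prefix, self.gpt2_dim)


proj = CaptionProjection()
prefix = proj(torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 4, 768])
```

The resulting `(batch, n_prefix, gpt2_dim)` tensor can be concatenated with token embeddings and fed to the language model as a prompt.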
## How to use

To run this project, follow these steps:
### Install dependencies

This project uses [uv](https://github.com/astral-sh/uv) for fast dependency management. To install all dependencies, run:

`uv sync`
### Run inference

To test the model and generate captions, run:

`uv run inference.py`

This will process your input images and output captions using the trained model.
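The inference script loads the real checkpoints, but the decoding step it performs can be illustrated with a self-contained toy: greedy decoding seeded by the projected image-feature prefix. The GRU "language model" below is only a stand-in for GPT-2 so the example runs without downloading weights; every name and size here is illustrative.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real components; sizes are illustrative only.
vocab_size, hidden = 100, 768
embed = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size)
backbone = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for GPT-2 blocks


def greedy_caption(prefix: torch.Tensor, max_len: int = 8, eos_id: int = 0) -> list[int]:
    """Greedily decode token ids, seeded by projected image features."""
    tokens: list[int] = []
    inputs = prefix  # (1, n_prefix, hidden): the projected image prefix
    state = None
    for _ in range(max_len):
        out, state = backbone(inputs, state)
        logits = lm_head(out[:, -1])          # logits for the next token
        next_id = int(logits.argmax(dim=-1))  # greedy choice
        if next_id == eos_id:
            break
        tokens.append(next_id)
        inputs = embed(torch.tensor([[next_id]]))  # feed the chosen token back
    return tokens


caps = greedy_caption(torch.randn(1, 4, hidden))
```

In the real script the token ids would then be detokenized with the GPT-2 tokenizer to produce a caption string.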
## Example

#### Input



#### Output

`a boy holding a fish in the woods`
|