---
title: Image Captioning Model Comparison
emoji: 🖼️
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# Image Captioning Model Comparison

This Space lets you test three image captioning models in one live Gradio app:

1. Custom EfficientNet-V2-S + Transformer trained on 5k samples
2. Custom EfficientNet-V2-S + Transformer trained on 100k samples
3. BLIP image-captioning base fine-tuned with LoRA on COCO 2014

Upload an image, choose a model, and generate a caption. You can also compare all three models on the same image.

## Files

```text
.
├── app.py
├── custom_caption_model.py
├── requirements.txt
├── README.md
└── models/
    ├── custom_5k/
    │   ├── best_phase-5k.pt
    │   └── vocab-5k.json
    ├── custom_100k/
    │   ├── best_phase-100k.pt
    │   └── vocab-100k.json
    └── blip_lora/
        ├── adapter_config.json
        ├── adapter_model.safetensors
        ├── preprocessor_config.json
        ├── tokenizer.json
        ├── tokenizer_config.json
        ├── special_tokens_map.json
        └── vocab.txt
```

## Notes

The custom models use their original PyTorch architecture and saved vocabularies.

The BLIP model uses the base model `Salesforce/blip-image-captioning-base` plus the LoRA adapter files.

For faster inference, use GPU hardware in the Space settings.
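
Because the app depends on the checkpoint, vocabulary, and adapter files listed under Files, it can help to confirm they are all present before launching. The following is a minimal sketch using only the standard library. The names `EXPECTED_FILES` and `find_missing_files` are hypothetical helpers for illustration and are not part of `app.py`; the file names are copied from the tree above, and the `models` default assumes the script runs from the repository root.

```python
from pathlib import Path

# Expected files per model folder, copied from the tree in the Files section.
# EXPECTED_FILES is a hypothetical name, not something defined in app.py.
EXPECTED_FILES = {
    "custom_5k": ["best_phase-5k.pt", "vocab-5k.json"],
    "custom_100k": ["best_phase-100k.pt", "vocab-100k.json"],
    "blip_lora": [
        "adapter_config.json",
        "adapter_model.safetensors",
        "preprocessor_config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "vocab.txt",
    ],
}


def find_missing_files(models_dir="models"):
    """Return relative paths of expected model files that are absent."""
    root = Path(models_dir)
    missing = []
    for subdir, names in EXPECTED_FILES.items():
        for name in names:
            path = root / subdir / name
            if not path.is_file():
                # Report paths relative to the models folder, with "/" separators.
                missing.append(path.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    missing = find_missing_files()
    if missing:
        print("Missing model files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected model files are present.")
```

Running the script from the repository root lists any files that still need to be uploaded to the Space. This check only verifies that the files exist, not that their contents are valid.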