---
title: Image Captioning Model Comparison
emoji: 🖼️
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: false
---

# Image Captioning Model Comparison

This Space lets you test three image captioning models in one live Gradio app:

1. Custom EfficientNet-V2-S + Transformer trained on 5k samples
2. Custom EfficientNet-V2-S + Transformer trained on 100k samples
3. BLIP image-captioning base fine-tuned with LoRA on COCO 2014

Upload an image, choose a model, and generate a caption. You can also compare all three models on the same image.
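The two modes described above (one selected model, or all three on the same image) can be sketched as a small dispatch layer. This is a minimal illustration only: `MODEL_CHOICES`, `caption_with`, and `compare_all` are hypothetical names, the labels are made up, and the real `app.py` may be organized differently.

```python
# Hypothetical labels; the real app may name its choices differently.
MODEL_CHOICES = ["Custom 5k", "Custom 100k", "BLIP + LoRA"]


def caption_with(model_name, image, captioners):
    """Caption `image` with the single model selected by name."""
    if model_name not in captioners:
        raise ValueError(f"Unknown model: {model_name!r}")
    return captioners[model_name](image)


def compare_all(image, captioners):
    """Caption `image` with every registered model, keeping registration order."""
    return {name: fn(image) for name, fn in captioners.items()}


if __name__ == "__main__":
    # Stand-in captioners so the sketch runs without any model weights.
    stubs = {
        name: (lambda image, n=name: f"<caption from {n}>")
        for name in MODEL_CHOICES
    }
    print(caption_with("BLIP + LoRA", None, stubs))
    print(compare_all(None, stubs))
```

Here `captioners` maps each label to a function that takes an image and returns a caption string, so the same mapping can back both a "generate" button and a "compare" button.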

## Files

```text
.
├── app.py
├── custom_caption_model.py
├── requirements.txt
├── README.md
└── models/
    ├── custom_5k/
    │   ├── best_phase-5k.pt
    │   └── vocab-5k.json
    ├── custom_100k/
    │   ├── best_phase-100k.pt
    │   └── vocab-100k.json
    └── blip_lora/
        ├── adapter_config.json
        ├── adapter_model.safetensors
        ├── preprocessor_config.json
        ├── tokenizer.json
        ├── tokenizer_config.json
        ├── special_tokens_map.json
        └── vocab.txt
```
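Since the app depends on the weights and vocabularies sitting at the paths above, a quick layout check before launching can catch a missing upload early. The sketch below assumes the tree above is complete and relative to the repository root; `EXPECTED_FILES` and `missing_files` are illustrative names, not part of `app.py`.

```python
from pathlib import Path

# Paths copied from the file tree above, relative to the repository root.
EXPECTED_FILES = [
    "app.py",
    "custom_caption_model.py",
    "requirements.txt",
    "README.md",
    "models/custom_5k/best_phase-5k.pt",
    "models/custom_5k/vocab-5k.json",
    "models/custom_100k/best_phase-100k.pt",
    "models/custom_100k/vocab-100k.json",
    "models/blip_lora/adapter_config.json",
    "models/blip_lora/adapter_model.safetensors",
    "models/blip_lora/preprocessor_config.json",
    "models/blip_lora/tokenizer.json",
    "models/blip_lora/tokenizer_config.json",
    "models/blip_lora/special_tokens_map.json",
    "models/blip_lora/vocab.txt",
]


def missing_files(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```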

## Notes

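The custom models are loaded with their original PyTorch architecture and the vocabularies saved alongside their checkpoints (`vocab-5k.json` and `vocab-100k.json`). The BLIP model is the base checkpoint `Salesforce/blip-image-captioning-base` with the LoRA adapter files from `models/blip_lora/` applied on top.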
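Combining a base model with a LoRA adapter typically looks like the sketch below, which uses `transformers` and `peft`. It is an assumption about how the pieces fit together, not a copy of `app.py`: the function names, the choice to load the processor from the adapter folder, and `max_new_tokens=30` are all illustrative.

```python
BASE_MODEL_ID = "Salesforce/blip-image-captioning-base"
ADAPTER_DIR = "models/blip_lora"


def load_blip_lora(adapter_dir=ADAPTER_DIR, base_model_id=BASE_MODEL_ID):
    """Load the BLIP base model and attach the LoRA adapter for inference."""
    # Imported lazily so this module imports even without the heavy dependencies.
    from peft import PeftModel
    from transformers import BlipForConditionalGeneration, BlipProcessor

    # Assumption: the processor files saved next to the adapter are the ones to use.
    processor = BlipProcessor.from_pretrained(adapter_dir)
    base = BlipForConditionalGeneration.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()
    return processor, model


def caption_image(image, processor, model, max_new_tokens=30):
    """Generate one caption for a PIL image (max_new_tokens is an arbitrary default)."""
    import torch

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()
```

Calling `load_blip_lora()` once at startup and reusing the returned `processor` and `model` for every request avoids reloading the weights on each caption.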

For faster inference, select GPU hardware in the Space settings.
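GPU hardware only helps if the model and its inputs are actually placed on the GPU. A minimal sketch of device selection, assuming PyTorch is the inference backend; `pick_device` is a hypothetical helper, not a function from `app.py`.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Running inference on: {pick_device()}")
```

The returned string can be passed to `model.to(...)` and to the input tensors so that both live on the same device.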