---
title: Caption Gen
emoji: 📸
sdk: streamlit
sdk_version: 1.43.0
app_file: app.py
---
# AI Image Caption Generator
A deep learning–based image captioning system built using a **ResNet50 encoder** and an **LSTM decoder**. The model generates natural language descriptions for uploaded images.
## Architecture
* **Encoder:** ResNet50 (frozen backbone)
* **Decoder:** LSTM-based sequence generator
* **Training Dataset:** Flickr8k
* **Inference Framework:** Streamlit
* **Evaluation Metric:** SacreBLEU
The encoder extracts high-level visual features, which are then passed to the decoder to generate captions word by word.
## How It Works
1. User uploads an image.
2. Image is preprocessed and passed through the ResNet50 encoder.
3. Extracted feature vector is fed into the LSTM decoder.
4. Caption is generated using temperature-based sampling.
5. If the image belongs to the Flickr8k dataset, BLEU metrics are displayed.
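The temperature-based sampling in step 4 can be illustrated with a small helper (function names are hypothetical; the app's actual sampler may differ):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample a token id from a logits vector with temperature scaling.

    temperature < 1 sharpens the distribution (more deterministic captions);
    temperature > 1 flattens it (more diverse, but noisier, captions).
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1).item())

def greedy_next_token(logits: torch.Tensor) -> int:
    """Greedy decoding, the limit of sampling as temperature approaches 0."""
    return int(torch.argmax(logits).item())
```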
## Features
* Temperature-controlled caption generation
* SacreBLEU evaluation
* N-gram precision breakdown (1–4 gram)
* Clean Streamlit interface
* Fully CPU-compatible deployment
## Project Structure

```
app.py
models/
    encoder.pth
    decoder.pth
    encoder.py
    decoder.py
utils/
    transforms.py
    vocab.py
    helpers.py
vocabulary.json
requirements.txt
```
## Model Details
* Encoder weights size: ~92 MB
* Decoder weights size: ~32 MB
* Full encoder backbone included in state_dict
* Inference runs on CPU
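A minimal sketch of CPU-safe checkpoint loading, the step that makes GPU-trained weights usable on CPU-only hosts. The helper name and round-trip demo are illustrative; in the app the checkpoints would be `models/encoder.pth` and `models/decoder.pth`.

```python
import os
import tempfile
import torch
import torch.nn as nn

def load_weights_cpu(model: nn.Module, path: str) -> nn.Module:
    """Load a checkpoint onto CPU, even if it was saved from a GPU run."""
    state = torch.load(path, map_location=torch.device("cpu"))
    model.load_state_dict(state)
    model.eval()  # inference mode: disables dropout, fixes batch-norm stats
    return model

# Round-trip demo with a stand-in module instead of the real encoder/decoder.
tiny = nn.Linear(4, 2)
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "decoder.pth")
    torch.save(tiny.state_dict(), path)
    restored = load_weights_cpu(nn.Linear(4, 2), path)
```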
## Limitations
* Trained only on Flickr8k (8,000 images), so vocabulary and scene coverage are narrow
* Performs best on outdoor scenes, people, and animals
* May generalize poorly to unseen domains
* CPU inference can be slow (2–5 seconds per image)
## Setup (Local)

```bash
pip install -r requirements.txt
streamlit run app.py
```
## Deployment
This project is deployed on **Hugging Face Spaces** using Streamlit.
## License
MIT License