Image Captioning System — Dev Scaffold (v1.0.0)

InceptionV3 + Transformer image captioning architecture.

This release contains a deployment scaffold for system validation and infrastructure testing. It is intentionally published before the production training run so that the full serving stack (FastAPI backend, Hugging Face Spaces container, Vercel frontend, GitHub Actions CI/CD) can be exercised end-to-end.

Purpose

  • FastAPI inference serving
  • Hugging Face Hub snapshot_download integration
  • Frontend / backend deployment validation
  • CI/CD pipeline validation
  • Production ML system architecture demonstration

Architecture

  • Encoder: frozen InceptionV3 (ImageNet weights, 2048-dim features)
  • Decoder: single Transformer decoder layer, d_model=512, 8 heads
  • Vocab size: 52 tokens (scaffold) — production target is 15,000 (COCO)
  • Max caption length: 40 tokens
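
The hyperparameters above can be collected in one small config object, which makes the implied per-head width explicit. This is a minimal sketch: the class name ScaffoldConfig and its field names are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaffoldConfig:
    """Values from the Architecture list; class and field names are illustrative."""

    encoder_feature_dim: int = 2048  # InceptionV3 feature width
    d_model: int = 512               # decoder model width
    num_heads: int = 8               # attention heads in the decoder layer
    num_decoder_layers: int = 1      # single Transformer decoder layer
    vocab_size: int = 52             # scaffold vocabulary
    max_caption_len: int = 40        # maximum caption length in tokens

    def __post_init__(self) -> None:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Width of each attention head."""
        return self.d_model // self.num_heads


scaffold = ScaffoldConfig()
print(scaffold.head_dim)  # 512 // 8 = 64
```

Under the same assumptions, the production target would differ only in vocabulary size, for example ScaffoldConfig(vocab_size=15000).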

⚠️ Current limitations

The decoder weights are bootstrap development artefacts produced from a synthetic 10-sentence corpus; they were not trained on the full COCO dataset. Caption outputs will be incoherent and limited to the 52-token scaffold vocabulary. The encoder is fully functional (real ImageNet weights); only the decoder lacks meaningful training.

Future revisions will replace these weights with a model trained on MS COCO 2017 via scripts/train.py and configs/train/stabilized.yaml.

Files

File         Size      SHA-256
model.h5     158 MB    bfe020d920aa2f3d019bf7b5b33904384057372e7c304a9e101a2a59fe110084
vocab.json   566 B     45ec1704d73046303cbd5292590b2e204b194a2d8345dfb84de81370b4ab4eef
vocab.pkl    3,013 B   c6700d2bbcd8dc705d6b0ca53e0f8848baa6225e9b3e836036d94ab5accd306c
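
The checksums listed here can be compared against downloaded copies of the files. The sketch below uses only the Python standard library. The helper names sha256_of and verify are illustrative and not part of the project.

```python
import hashlib
from pathlib import Path

# Digests copied from the Files list in this model card.
EXPECTED_SHA256 = {
    "model.h5": "bfe020d920aa2f3d019bf7b5b33904384057372e7c304a9e101a2a59fe110084",
    "vocab.json": "45ec1704d73046303cbd5292590b2e204b194a2d8345dfb84de81370b4ab4eef",
    "vocab.pkl": "c6700d2bbcd8dc705d6b0ca53e0f8848baa6225e9b3e836036d94ab5accd306c",
}


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weights are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory) -> dict:
    """Return {filename: bool}; missing or mismatching files map to False."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

Calling verify on the directory that holds the downloaded files reports, per file, whether its digest matches the published one.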

Usage

The backend consumes this repo via huggingface_hub.snapshot_download, configured through the following environment variables:

BACKEND_WEIGHTS_HUB_REPO=apoorvrajdev/captioning-inceptionv3-transformer
BACKEND_WEIGHTS_HUB_REVISION=v1.0.0
BACKEND_WEIGHTS_HUB_FILENAME=model.h5
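
A backend might turn the settings above into a snapshot_download call roughly as follows. This is a sketch under assumptions, not the project's actual loader: the helper names hub_settings and fetch_weights are hypothetical, and restricting the download with allow_patterns is an illustrative choice that the real backend may not make (it may also need the vocabulary files).

```python
import os
from pathlib import Path


def hub_settings(env=os.environ) -> dict:
    """Read the three BACKEND_WEIGHTS_HUB_* variables into a plain dict."""
    return {
        "repo_id": env["BACKEND_WEIGHTS_HUB_REPO"],
        # A missing revision becomes None, which the Hub client treats as the default branch.
        "revision": env.get("BACKEND_WEIGHTS_HUB_REVISION"),
        "filename": env["BACKEND_WEIGHTS_HUB_FILENAME"],
    }


def fetch_weights(settings: dict) -> Path:
    """Download the pinned snapshot and return the local path of the weights file."""
    # Imported lazily so hub_settings works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_dir = snapshot_download(
        repo_id=settings["repo_id"],
        revision=settings["revision"],
        allow_patterns=[settings["filename"]],  # assumption: fetch only the weights file
    )
    return Path(snapshot_dir) / settings["filename"]
```

With the three variables exported, fetch_weights(hub_settings()) would return the cached path to model.h5 at revision v1.0.0. Pinning the revision keeps a deployment on the scaffold weights until it is deliberately moved to a later release.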