	---
	license: mit
	library_name: keras
	tags:
	- image-captioning
	- tensorflow
	- keras
	- transformer
	- inceptionv3
	- multimodal
	- dev-scaffold
	pipeline_tag: image-to-text
	---

	# Image Captioning System — Dev Scaffold (v1.0.0)

	InceptionV3 + Transformer image captioning architecture.

	This release contains a deployment scaffold used for system
	validation and infrastructure testing. It is intentionally
	published before the production training run so the full serving
	stack (FastAPI backend, Hugging Face Spaces container, Vercel
	frontend, GitHub Actions CI/CD) can be exercised end-to-end.

	## Purpose

	- FastAPI inference serving
	- Hugging Face Hub `snapshot_download` integration
	- Frontend / backend deployment validation
	- CI/CD pipeline validation
	- Production ML system architecture demonstration
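
A minimal sketch of how a backend might resolve the weights through `snapshot_download`. The helper names `hub_settings` and `fetch_weights` are hypothetical and are not taken from the actual backend code; the environment variable names are the ones listed under Usage. Restricting the download with `allow_patterns` is an assumption, since a real backend would likely also need the vocabulary files.

```python
import os
from pathlib import Path
from typing import Mapping


def hub_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect Hub settings from environment-style variables (hypothetical helper)."""
    return {
        "repo_id": env["BACKEND_WEIGHTS_HUB_REPO"],
        # None lets huggingface_hub fall back to the repo's default branch.
        "revision": env.get("BACKEND_WEIGHTS_HUB_REVISION"),
        "filename": env.get("BACKEND_WEIGHTS_HUB_FILENAME", "model.h5"),
    }


def fetch_weights(env: Mapping[str, str] = os.environ) -> Path:
    """Download the snapshot and return the local path of the weights file."""
    # Imported lazily so the settings helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    settings = hub_settings(env)
    local_dir = snapshot_download(
        repo_id=settings["repo_id"],
        revision=settings["revision"],
        allow_patterns=[settings["filename"]],  # assumption: fetch only the weights file
    )
    return Path(local_dir) / settings["filename"]
```

Calling `fetch_weights()` requires network access and the `huggingface_hub` package; `hub_settings` can be checked offline.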

	## Architecture

	- Encoder: frozen InceptionV3 (ImageNet weights, 2048-dim features)
	- Decoder: single Transformer decoder layer, d_model=512, 8 heads
	- Vocab size: 52 tokens (scaffold) — production target is 15,000 (COCO)
	- Max caption length: 40 tokens
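
The hyperparameters above can be collected in a small config object to make the derived sizes explicit. This is an illustrative sketch only: the class `ScaffoldConfig` and its field names are hypothetical, and the per-head dimension assumes the conventional split of `d_model / num_heads`, which the actual model may not use.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ScaffoldConfig:
    """Hyperparameters listed in the Architecture section (names are hypothetical)."""

    encoder_dim: int = 2048      # InceptionV3 feature size
    d_model: int = 512           # decoder width
    num_heads: int = 8           # attention heads
    num_decoder_layers: int = 1  # single Transformer decoder layer
    vocab_size: int = 52         # scaffold vocabulary
    max_len: int = 40            # max caption length in tokens

    def __post_init__(self) -> None:
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Per-head width under the conventional even split (an assumption)."""
        return self.d_model // self.num_heads

    def logits_shape(self, batch_size: int) -> tuple:
        """Shape of the decoder output: one score per vocabulary token per position."""
        return (batch_size, self.max_len, self.vocab_size)


scaffold = ScaffoldConfig()
# Same architecture with the production vocabulary target from this card.
production = replace(scaffold, vocab_size=15_000)
```

Under these assumptions `scaffold.head_dim` is 64, and only the last dimension of the output changes between the scaffold and the production target.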

	## ⚠️ Current limitations

	The decoder weights are bootstrap development artefacts generated from
	a synthetic 10-sentence corpus, not trained on the full COCO dataset.
	Caption outputs will be incoherent and limited to the 52-token scaffold
	vocabulary. The encoder is fully functional (real ImageNet weights);
	only the decoder is untrained.
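
To illustrate why outputs are bounded by the vocabulary and the length cap, here is a minimal greedy-decoding sketch. It is not the repository's inference code: `greedy_decode`, the `step` callback, and the start/end token ids are hypothetical, and counting the start token toward the 40-token limit is an assumption.

```python
from typing import Callable, List, Sequence

MAX_LEN = 40  # max caption length from the Architecture section


def greedy_decode(
    step: Callable[[List[int]], Sequence[float]],
    start_id: int,
    end_id: int,
    max_len: int = MAX_LEN,
) -> List[int]:
    """Repeatedly pick the highest-scoring token until end_id or max_len.

    `step` stands in for the decoder: given the tokens so far, it returns
    one score per vocabulary entry, so every emitted id is a vocabulary index.
    """
    tokens = [start_id]
    while len(tokens) < max_len:
        scores = step(tokens)
        # Index of the best score, i.e. argmax over the vocabulary.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens
```

With a 52-entry score vector, the returned ids can never leave the range 0 to 51, whatever the decoder weights are.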

	Future revisions will replace these weights with a model trained on
	MS COCO 2017 via `scripts/train.py` and `configs/train/stabilized.yaml`.

	## Files

	| File | Size | SHA-256 |
	|---|---:|---|
	| `model.h5` | 158 MB | `bfe020d920aa2f3d019bf7b5b33904384057372e7c304a9e101a2a59fe110084` |
	| `vocab.json` | 566 B | `45ec1704d73046303cbd5292590b2e204b194a2d8345dfb84de81370b4ab4eef` |
	| `vocab.pkl` | 3,013 B | `c6700d2bbcd8dc705d6b0ca53e0f8848baa6225e9b3e836036d94ab5accd306c` |
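
The checksums in the table can be verified after download with the standard library. A minimal sketch: the function names `sha256_of` and `verify_snapshot` are illustrative, and the directory passed in is assumed to be wherever the files were downloaded.

```python
import hashlib
from pathlib import Path

# Digests copied from the Files table.
EXPECTED_SHA256 = {
    "model.h5": "bfe020d920aa2f3d019bf7b5b33904384057372e7c304a9e101a2a59fe110084",
    "vocab.json": "45ec1704d73046303cbd5292590b2e204b194a2d8345dfb84de81370b4ab4eef",
    "vocab.pkl": "c6700d2bbcd8dc705d6b0ca53e0f8848baa6225e9b3e836036d94ab5accd306c",
}


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so the 158 MB weights file is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(directory, expected=EXPECTED_SHA256) -> dict:
    """Return {filename: True/False}; missing files count as failures."""
    directory = Path(directory)
    return {
        name: (directory / name).is_file() and sha256_of(directory / name) == want
        for name, want in expected.items()
    }
```

A caller could refuse to load the model unless `all(verify_snapshot(local_dir).values())` holds.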

	## Usage

	This repo is consumed by the backend via `huggingface_hub.snapshot_download`:

	```env
	BACKEND_WEIGHTS_HUB_REPO=apoorvrajdev/captioning-inceptionv3-transformer
	BACKEND_WEIGHTS_HUB_REVISION=v1.0.0
	BACKEND_WEIGHTS_HUB_FILENAME=model.h5