---
license: mit
library_name: keras
tags:
- image-captioning
- tensorflow
- keras
- transformer
- inceptionv3
- multimodal
- dev-scaffold
pipeline_tag: image-to-text
---
# Image Captioning System: Dev Scaffold (v1.0.0)

InceptionV3 + Transformer image captioning architecture.

This release contains a **deployment scaffold** used for end-to-end
system validation and infrastructure testing. It is intentionally
published before the production training run so that the full serving
stack (FastAPI backend, Hugging Face Spaces container, Vercel
frontend, GitHub Actions CI/CD) can be exercised in full.
## Purpose

- FastAPI inference serving
- Hugging Face Hub `snapshot_download` integration
- Frontend / backend deployment validation
- CI/CD pipeline validation
- Production ML system architecture demonstration
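
The backend's actual loader is not shown in this card. As a minimal sketch of the `snapshot_download` integration, assuming the backend reads the `BACKEND_WEIGHTS_HUB_*` environment variables listed under Usage, the following resolves the repo, revision, and filename and then fetches a snapshot. The names `HubWeightsSettings`, `load_settings`, and `fetch_weights` are hypothetical, as is the `model.h5` fallback for the filename.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HubWeightsSettings:
    """Hypothetical container for the BACKEND_WEIGHTS_HUB_* settings."""
    repo_id: str
    revision: Optional[str]
    filename: str


def load_settings(env: Mapping[str, str] = os.environ) -> HubWeightsSettings:
    """Read the Hub settings from an environment-like mapping."""
    return HubWeightsSettings(
        repo_id=env["BACKEND_WEIGHTS_HUB_REPO"],
        # None lets the Hub client fall back to the repo's default branch.
        revision=env.get("BACKEND_WEIGHTS_HUB_REVISION"),
        # Assumption: default to the weights file listed in this card.
        filename=env.get("BACKEND_WEIGHTS_HUB_FILENAME", "model.h5"),
    )


def fetch_weights(settings: HubWeightsSettings) -> str:
    """Download the snapshot and return the local path of the weights file."""
    # Imported lazily so load_settings works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_dir = snapshot_download(
        repo_id=settings.repo_id,
        revision=settings.revision,
        # Only pull the files this card lists, not the whole repo.
        allow_patterns=[settings.filename, "vocab.json", "vocab.pkl"],
    )
    return os.path.join(snapshot_dir, settings.filename)
```

Pinning `revision` to a tag such as `v1.0.0` keeps deployments reproducible when the scaffold weights are later replaced.
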
## Architecture

- **Encoder:** frozen InceptionV3 (ImageNet weights, 2048-dim features)
- **Decoder:** single Transformer decoder layer, d_model=512, 8 heads
- **Vocab size:** 52 tokens (scaffold); production target is 15,000 (COCO)
- **Max caption length:** 40 tokens
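
To make these numbers concrete, the sketch below derives two quantities from the hyperparameters above: the per-head width of the attention layer and the size of a token-embedding table for the scaffold and production vocabularies. It assumes a plain `(vocab_size, d_model)` embedding matrix; the `DecoderConfig` class is hypothetical and is not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical holder for the decoder hyperparameters listed above."""
    vocab_size: int
    d_model: int = 512
    num_heads: int = 8
    max_len: int = 40

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        return self.d_model // self.num_heads

    @property
    def embedding_params(self) -> int:
        # Assumption: one (vocab_size, d_model) token-embedding matrix.
        return self.vocab_size * self.d_model


scaffold = DecoderConfig(vocab_size=52)
production = DecoderConfig(vocab_size=15_000)

print(scaffold.head_dim)            # 64
print(scaffold.embedding_params)    # 26624
print(production.embedding_params)  # 7680000
```

Under this assumption the embedding table grows from about 27 thousand to about 7.7 million parameters when the vocabulary moves from the scaffold to the production target.
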
## ⚠️ Current limitations

The decoder weights are **bootstrap development artefacts** generated from
a synthetic 10-sentence corpus; they were not trained on the full COCO dataset.
Caption outputs will be incoherent and limited to the 52-token scaffold
vocabulary. The encoder is fully functional (real ImageNet weights);
only the decoder lacks meaningful training.

Future revisions will replace these weights with a model trained on
MS COCO 2017 via `scripts/train.py` and `configs/train/stabilized.yaml`.
## Files

| File | Size | SHA-256 |
|---|---:|---|
| `model.h5` | 158 MB | `bfe020d920aa2f3d019bf7b5b33904384057372e7c304a9e101a2a59fe110084` |
| `vocab.json` | 566 B | `45ec1704d73046303cbd5292590b2e204b194a2d8345dfb84de81370b4ab4eef` |
| `vocab.pkl` | 3,013 B | `c6700d2bbcd8dc705d6b0ca53e0f8848baa6225e9b3e836036d94ab5accd306c` |
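
The checksums above can be used to confirm that a downloaded snapshot matches this release. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `find_mismatches` are hypothetical and not part of the project.

```python
import hashlib
from pathlib import Path

# Digests copied from the table above.
EXPECTED_SHA256 = {
    "model.h5": "bfe020d920aa2f3d019bf7b5b33904384057372e7c304a9e101a2a59fe110084",
    "vocab.json": "45ec1704d73046303cbd5292590b2e204b194a2d8345dfb84de81370b4ab4eef",
    "vocab.pkl": "c6700d2bbcd8dc705d6b0ca53e0f8848baa6225e9b3e836036d94ab5accd306c",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so the 158 MB weights are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(snapshot_dir, expected=EXPECTED_SHA256):
    """Return the names of files that are missing or have the wrong digest."""
    bad = []
    for name, want in expected.items():
        path = Path(snapshot_dir) / name
        if not path.is_file() or sha256_of(path) != want:
            bad.append(name)
    return bad
```

An empty list from `find_mismatches(snapshot_dir)` means all three files are present and match the published digests.
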
## Usage

This repo is consumed by the backend via `huggingface_hub.snapshot_download`:
| ```env | |
| BACKEND_WEIGHTS_HUB_REPO=apoorvrajdev/captioning-inceptionv3-transformer | |
| BACKEND_WEIGHTS_HUB_REVISION=v1.0.0 | |
| BACKEND_WEIGHTS_HUB_FILENAME=model.h5 |