apoorvrajdev committed on
Commit
cce4499
·
0 Parent(s):

initial commit: add image captioning research project

Files changed (2)
  1. README.md +209 -0
  2. image-captionin-using-dl.ipynb +0 -0
README.md ADDED
@@ -0,0 +1,209 @@
+ # 🖼️ Image Captioning System (CNN + Transformer)
+ 📄 Backed by an IEEE publication (see below)
+
+ ![Python](https://img.shields.io/badge/Python-3.10-blue)
+ ![Deep Learning](https://img.shields.io/badge/Deep%20Learning-TensorFlow-orange)
+ ![Computer Vision](https://img.shields.io/badge/Computer%20Vision-CNN-red)
+ ![NLP](https://img.shields.io/badge/NLP-Transformer-green)
+ ![Dataset](https://img.shields.io/badge/Dataset-COCO-yellow)
+ ![License](https://img.shields.io/badge/License-MIT-lightgrey)
+
+ This project builds an **AI-powered image captioning system** that generates **natural language descriptions from images** using a hybrid **CNN + Transformer architecture**.
+
+ The system understands visual content and produces **context-aware captions**, bridging the gap between **computer vision and natural language processing**.
+
+ ---
+
+ # 🚀 Live Demo
+
+ [![Open Notebook](https://img.shields.io/badge/Open%20Kaggle%20Notebook-GPU-blue)](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
+
+ Or run the full pipeline on Kaggle:
+
+ 👉 https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl
+
+ The notebook includes:
+
+ - End-to-end training pipeline
+ - COCO dataset integration
+ - Transformer-based caption generation
+ - GPU-enabled execution
+
+ ---
+
+ # 📄 IEEE Research Publication
+
+ This project is backed by an **IEEE-published research paper**:
+
+ [![IEEE Paper](https://img.shields.io/badge/View%20Research%20Paper-IEEE-blue)](https://ieeexplore.ieee.org/document/10675203)
+
+ 📄 **Title:** AI Narratives: Bridging Visual Content and Linguistic Expression
+
+ ---
+
+ ### 🧠 Key Contributions
+
+ - Designed a hybrid **CNN + Transformer architecture** for image captioning
+ - Leveraged **InceptionV3** for visual feature extraction
+ - Implemented **attention-based sequence generation**
+ - Achieved improved caption quality, as measured by **BLEU**
+ - Compared multiple CNN backbones (VGG, ResNet, Inception)
+
+ ---
+
+ ### 🚀 Practical Impact
+
+ - Combines **computer vision and NLP** for real-world multimodal applications
+ - Demonstrates the ability to build **end-to-end deep learning pipelines**
+ - Trained and evaluated on the **COCO benchmark dataset**, which is used in industry research
+
+ # 🧠 Model Overview
+
+ The system uses a **two-stage architecture**:
+
+ ### 🔹 Encoder (Vision)
+ - **InceptionV3 (CNN)**
+ - Extracts high-level spatial features from images
+ - Converts image → feature vector
+
+ ### 🔹 Decoder (Language)
+ - **Transformer Decoder**
+ - Generates captions word-by-word using attention
+ - Captures long-range dependencies in text
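The attention step named in the decoder bullets can be illustrated with a minimal, dependency-free sketch of scaled dot-product attention for a single query. This is not the notebook's implementation. The function names `softmax` and `attend` and the toy two-dimensional "image region" vectors are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns (context, weights): the weighted sum of `values` and the
    attention weight assigned to each key.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Toy data (hypothetical): one decoder query attending over three region features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attend([2.0, 0.0], keys=regions, values=regions)
```

In this toy case the query points along the first axis, so the first and third regions receive equal weight and the second receives less. A full Transformer decoder would apply this per head, with learned projections and a causal mask over previously generated words.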
+
+ ---
+
+ # 🔄 Caption Generation Pipeline
+
+ Image → CNN Encoder → Feature Embeddings → Transformer Decoder → Caption
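The last arrow in the pipeline (decoder to caption) is commonly implemented as a greedy decoding loop, sketched below. The README does not say whether the notebook uses greedy search or beam search, so treat this as one possible reading. `greedy_decode`, `toy_decoder`, and the `<start>` / `<end>` token names are hypothetical, and `toy_decoder` merely replays a fixed caption in place of a trained model.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    max_len: int = 20,
    start: str = "<start>",
    end: str = "<end>",
) -> str:
    """Repeatedly append the highest-scoring next token until `end` or `max_len`."""
    tokens = [start]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == end:
            break
        tokens.append(best)
    return " ".join(tokens[1:])  # drop the start token


def toy_decoder(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for the trained decoder (assumption): replays a fixed caption."""
    script = ["a", "man", "riding", "a", "motorcycle", "<end>"]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return {word: (1.0 if word == target else 0.0) for word in set(script)}


caption = greedy_decode(toy_decoder)
```

In a real system, `next_token_scores` would run the Transformer decoder on the image features plus the tokens generated so far.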
+
+ ---
+
+ # 📸 Sample Outputs
+
+ ### 🟢 Example 1
+ **Generated Caption:**
+ `a man is standing on a beach with a surfboard`
+
+ <img width="923" height="906" alt="image" src="https://github.com/user-attachments/assets/64e8412b-1d49-404c-a5b2-1da121b224e2" />
+
+ ---
+
+ ### 🟢 Example 2
+ **Generated Caption:**
+ `a man riding a motorcycle on a street`
+
+ <img width="832" height="857" alt="image" src="https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd" />
+
+ ---
+
+ # 📊 Model Performance
+
+ The model was evaluated using the **BLEU score**, a standard NLP metric for text generation.
+
+ | Metric | Value |
+ |--------|-------|
+ | BLEU Score | ~24 |
+
+ ### Key Observations:
+ - Generates **semantically meaningful captions**
+ - Performs well on **common objects and scenes**
+ - Is somewhat weaker on **complex multi-object scenes**
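To make the metric concrete, here is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), written without external libraries. The README does not state which BLEU variant, n-gram order, or smoothing produced the reported score, so the unsmoothed version below and its `max_n` default are assumptions, and the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU in [0, 1] for one tokenized candidate and its references."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        if not cand_counts:
            return 0.0  # candidate shorter than n tokens
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "a man riding a motorcycle".split()
exact = sentence_bleu(reference, [reference])
partial = sentence_bleu("a man on a bike".split(), [reference], max_n=1)
```

Here `exact` is 1.0 because the candidate matches the reference, and `partial` is 0.6 because three of the five candidate unigrams are matched after clipping. If the table uses the common 0 to 100 convention, a score of about 24 would correspond to roughly 0.24 on this scale.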
+
+ ---
+
+ # 📂 Dataset
+
+ The model is trained on the **COCO 2017 Dataset**, a large-scale benchmark dataset for image captioning.
+
+ Dataset characteristics:
+
+ - 200,000+ images
+ - 80 object categories
+ - Multiple captions per image
+ - Rich annotations for training
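Because each image carries several captions, a typical first step is to group the caption annotations by image. The sketch below assumes the standard COCO captions JSON layout (an `images` list with `id` and `file_name`, and an `annotations` list with `image_id` and `caption`). The helper name `captions_by_image` and the toy records are hypothetical, and the notebook may load the data differently.

```python
from collections import defaultdict


def captions_by_image(coco):
    """Map each image file name to the list of its captions."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)


# Toy stand-in for a parsed captions file (ids and file names are made up).
toy = {
    "images": [
        {"id": 1, "file_name": "surfer.jpg"},
        {"id": 2, "file_name": "biker.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "a man is standing on a beach with a surfboard "},
        {"image_id": 2, "id": 11, "caption": "a man riding a motorcycle on a street"},
        {"image_id": 1, "id": 12, "caption": "a surfer holding a board near the ocean"},
    ],
}
grouped = captions_by_image(toy)
```

With a real download, `toy` would be replaced by the result of `json.load` on the captions annotation file.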
+
+ ---
+
+ # ⚙️ Deep Learning Pipeline
+
+ The project follows a complete deep learning workflow:
+
+ 1. Image preprocessing (resize, normalization)
+ 2. Feature extraction using InceptionV3
+ 3. Caption preprocessing (tokenization, padding)
+ 4. Vocabulary creation
+ 5. Transformer model training
+ 6. Loss optimization (Cross-Entropy)
+ 7. Model evaluation using BLEU score
+ 8. Inference on unseen images
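Steps 3 and 4 above (caption tokenization, padding, and vocabulary creation) can be sketched in plain Python as follows. The special token names, the regular-expression tokenizer, and the helpers `tokenize`, `build_vocab`, and `encode` are assumptions for illustration. The notebook most likely relies on TensorFlow/Keras text utilities instead.

```python
import re
from collections import Counter

PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption):
    """Lowercase, keep alphanumeric words, and wrap with start/end markers."""
    return [START] + re.findall(r"[a-z0-9]+", caption.lower()) + [END]


def build_vocab(captions, min_count=1):
    """Assign an integer id to every word seen at least `min_count` times."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap)[1:-1])
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate([PAD, START, END, UNK] + words)}


def encode(caption, vocab, max_len):
    """Convert a caption to ids, truncating or right-padding to `max_len`."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(caption)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


captions = [
    "A man riding a motorcycle on a street",
    "a man is standing on a beach with a surfboard",
]
vocab = build_vocab(captions)
encoded = encode("a man on a horse", vocab, max_len=8)
```

The word `horse` is not in the vocabulary, so it maps to the `<unk>` id, and the sequence is right-padded with the `<pad>` id to length 8. Note that truncation in this sketch can cut off the `<end>` marker for captions longer than `max_len`.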
+
+ ---
+
+ # 🧰 Technologies Used
+
+ - Python
+ - TensorFlow / Keras
+ - CNN (InceptionV3)
+ - Transformer Architecture
+ - NumPy, Pandas
+ - Matplotlib
+ - Jupyter Notebook
+
+ ---
+
+ # 📁 Project Structure
+
+ ```
+ image-captioning-system
+ │
+ ├── image_captioning.ipynb
+ ├── assets/
+ ├── requirements.txt
+ └── README.md
+ ```
+
+ ---
+
+ # 🧪 Research Contribution
+
+ This project is based on an **IEEE research publication**:
+
+ 📄 AI Narratives: Bridging Visual Content and Linguistic Expression
+
+ Key contributions:
+
+ - Integration of **CNN + Transformer architecture**
+ - Improved caption generation using **attention mechanisms**
+ - Comparative analysis of CNN encoders (VGG, ResNet, Inception)
+ - Enhanced tokenization strategies for better language modeling
+
+ ---
+
+ # ⚠️ Limitations
+
+ - Struggles with highly complex or cluttered scenes
+ - May generate generic captions for rare objects
+ - Requires large datasets and compute for training
+
+ ---
+
+ # 🚀 Future Improvements
+
+ - Replace the CNN encoder with a **Vision Transformer (ViT)**
+ - Use pretrained models like **BLIP / CLIP**
+ - Optimize inference using **TensorRT / ONNX**
+ - Deploy as a **FastAPI-based real-time API**
+ - Add multi-GPU distributed training
+
+ ---
+
+ # 👨‍💻 Author
+
+ **Apoorv Raj**
+ AI Systems Engineer | Deep Learning | ML Infrastructure
+
+ ---
+
+ ⭐ If you found this project useful, consider giving it a **star** on GitHub.
image-captionin-using-dl.ipynb ADDED
The diff for this file is too large to render.