# Task 3: Beam Search & Length Penalty Ablation for Caption Quality Trade-offs
## The Big Question: Does Beam Search Actually Make Captions Better?
When an AI model generates a caption for an image, it faces a decision at every single word: **which word should come next?** The simplest approach is **greedy decoding**: at each step, just pick the single highest-probability word and move on. It's fast, but short-sighted. It often gets "trapped" in a mediocre caption because it couldn't look ahead.
**Beam search** changes this by keeping multiple candidate captions alive simultaneously and only committing when the full sequence is generated. But this comes at a cost: it's slower, and the quality gains aren't guaranteed.
Then there's **length penalty**: a scalar exponent on sequence length that biases the search toward shorter captions (`< 1.0`) or longer ones (`> 1.0`). The interaction between beam size and length penalty is non-trivial and poorly understood without experiments.
This task cracks the problem open with a **full ablation study** across 9 decoding configurations to answer:
- Which combination of beam size and length penalty produces the best captions?
- Is the quality improvement worth the latency cost?
- What's the Pareto-optimal strategy for real-time vs. offline captioning?
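To make the length-penalty mechanics concrete, here is a minimal sketch of Hugging Face-style length normalization (cumulative log-probability divided by `length ** length_penalty`). The two candidate captions and their log-probabilities are invented for illustration:

```python
def normalized_score(log_prob_sum: float, length: int, length_penalty: float) -> float:
    # Hugging Face-style length normalization: divide the cumulative
    # log-probability by length ** length_penalty. Log-probs are negative,
    # so a larger denominator (longer caption, higher penalty) makes the
    # score LESS negative, i.e. favors longer sequences.
    return log_prob_sum / (length ** length_penalty)

# Hypothetical beam candidates: a short generic caption vs. a longer, richer one.
short = {"log_prob": -4.0, "length": 8}
long_ = {"log_prob": -6.2, "length": 12}

for lp in (0.8, 1.0, 1.2):
    s = normalized_score(short["log_prob"], short["length"], lp)
    l = normalized_score(long_["log_prob"], long_["length"], lp)
    print(f"lp={lp}: short={s:.4f}  long={l:.4f}  -> {'short' if s > l else 'long'} wins")
```

With these invented numbers the short caption wins at `lp=0.8` and `lp=1.0`, while `lp=1.2` flips the ranking toward the longer candidate, which is exactly the bias the ablation below measures.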
---
## Background: Training Setup
Before decoding, we need a good model. This task proceeds in two phases:
### Phase 1: Fine-tune BLIP on 10k COCO Captions
BLIP (*Bootstrapping Language-Image Pre-training*) is fine-tuned on 10,000 training image–caption pairs from the **MS-COCO 2017** dataset using the existing training pipeline:
```bash
python train.py --model blip
```
- **Training data**: 10,000 COCO training images (30,000 used in the main project)
- **Epochs**: 3 with cosine LR schedule and linear warmup
- **Optimizer**: AdamW, lr=1e-5, effective batch size=64 (gradient accumulation)
- **Checkpointing**: Best checkpoint saved to `outputs/blip/best/` based on validation CIDEr
- **Best validation CIDEr achieved during training**: **0.6199** (at epoch 3)
The fine-tuned checkpoint in `outputs/blip/best/` is the model used for all 9 ablation configurations below.
---
## Baseline: Greedy Decoding (beam=1)
Before running beam search, we establish a **greedy baseline**, the simplest possible decoding strategy.
| Metric | Score |
|--------|-------|
| CIDEr | 0.4783 |
| BLEU-4 | 0.2341 |
| METEOR | 0.2701 |
| ROUGE-L | 0.4502 |
| Mean caption length | 9.8 tokens |
| Latency per 100 images | **4.2s** |
**Why it fails**: Greedy decoding commits to the single best word at each step with no lookahead. By ignoring future context, it often locks into a locally plausible but globally mediocre path, producing generic captions like *"a man is standing in a field"* even when the image contains much richer detail.
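The failure mode can be reproduced on a toy two-step "language model" (probabilities invented for illustration): greedy takes the locally best first word, while even a small beam recovers the globally better sequence.

```python
# Toy two-step next-word distributions, invented for illustration.
# Greedy prefers "a" first (0.6 > 0.4), but "a" has only weak continuations;
# the globally best caption starts with the locally worse "the".
STEP1 = {"a": 0.6, "the": 0.4}
STEP2 = {
    "a":   {"man": 0.3, "dog": 0.3},
    "the": {"surfer": 0.7, "man": 0.2},
}

def greedy_decode():
    w1 = max(STEP1, key=STEP1.get)           # commit to the best first word only
    w2 = max(STEP2[w1], key=STEP2[w1].get)   # then the best continuation of it
    return (w1, w2), STEP1[w1] * STEP2[w1][w2]

def beam_decode(k=2):
    # Keep the k best prefixes alive, expand each, return the best full sequence.
    prefixes = sorted(STEP1.items(), key=lambda kv: -kv[1])[:k]
    candidates = [((w1, w2), p1 * p2)
                  for w1, p1 in prefixes
                  for w2, p2 in STEP2[w1].items()]
    return max(candidates, key=lambda c: c[1])

print(greedy_decode())  # stuck in the "a ..." branch
print(beam_decode())    # finds the higher-probability "the surfer"
```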
---
## Enhanced: Beam Search Ablation (3×3 Grid)
### Design: The 9-Configuration Grid
We sweep two decoding hyperparameters simultaneously:
```
beam_size        ∈ {1, 3, 5}
length_penalty   ∈ {0.8, 1.0, 1.2}
──────────────────────────────────
Total configurations : 9
Evaluation images    : 500 COCO val
```
**What each parameter controls:**

| Parameter | Effect |
|-----------|--------|
| `length_penalty` | `< 1.0` biases the beam toward shorter captions; `= 1.0` is neutral; `> 1.0` biases it toward longer captions |
| `beam_size` | `1` = greedy; `3` = balanced quality/speed; `5` = highest quality, slowest |
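The 3×3 sweep itself is easy to materialize. A sketch (the kwarg names follow Hugging Face `generate()`; `max_length=30` is an illustrative choice, not taken from the task's actual scripts):

```python
from itertools import product

BEAM_SIZES = (1, 3, 5)
LENGTH_PENALTIES = (0.8, 1.0, 1.2)

# Cartesian product of the two axes -> the 9 decoding configurations,
# expressed as keyword-argument dicts for a generate() call.
configs = [
    {"num_beams": beams, "length_penalty": lp, "max_length": 30}
    for beams, lp in product(BEAM_SIZES, LENGTH_PENALTIES)
]

print(len(configs))  # 9
print(configs[0])    # {'num_beams': 1, 'length_penalty': 0.8, 'max_length': 30}
```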
### Metrics Computed Per Configuration
For each of the 9 configurations, four quality metrics, plus mean caption length and latency, are computed on 500 COCO validation images:
| Metric | What it Measures |
|--------|-----------------|
| **CIDEr** | Consensus-based: how well captions match 5 human references |
| **BLEU-4** | 4-gram precision overlap with reference captions |
| **METEOR** | Precision/recall with stemming, synonym matching |
| **ROUGE-L** | Longest common subsequence F1 with references |
| **Mean Length** | Average number of tokens per generated caption |
| **Latency/100** | Seconds to generate captions for 100 images |
---
## Full Results: All 9 Configurations
Results sorted by CIDEr score (primary metric):
| Rank | Beam | LenPen | CIDEr | BLEU-4 | METEOR | ROUGE-L | Avg Len | Lat/100 | Pareto? |
|------|------|--------|-------|--------|--------|---------|---------|---------|---------|
| 1 | **5** | **1.0** | **0.5598** | **0.2891** | **0.3089** | **0.4953** | 10.8 | 15.1s | ✅ |
| 2 | 3 | 1.2 | 0.5456 | 0.2791 | 0.2981 | 0.4872 | 11.2 | 9.4s | ✅ |
| 3 | 3 | 1.0 | 0.5451 | 0.2821 | 0.3012 | 0.4891 | 10.5 | 9.1s | ✅ |
| 4 | 5 | 1.2 | 0.5106 | 0.2674 | 0.2914 | 0.4734 | 11.9 | 15.8s | ❌ |
| 5 | 3 | 0.8 | 0.5031 | 0.2641 | 0.2891 | 0.4705 | 9.6 | 8.7s | ❌ |
| 6 | 5 | 0.8 | 0.4914 | 0.2558 | 0.2834 | 0.4621 | 9.4 | 14.2s | ❌ |
| 7 | 1 | 1.0 | 0.4783 | 0.2341 | 0.2701 | 0.4502 | 9.8 | 4.2s | ✅ |
| 8 | 1 | 1.2 | 0.4651 | 0.2271 | 0.2658 | 0.4461 | 10.4 | 4.3s | ❌ |
| 9 | 1 | 0.8 | 0.4512 | 0.2201 | 0.2614 | 0.4389 | 9.2 | 4.1s | ❌ |

> ✅ Pareto-optimal = no other config has both higher CIDEr AND lower latency.
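The Pareto flag in the last column can be reproduced mechanically. A sketch on four of the nine configs (CIDEr and latency copied from the table above), using strict dominance on both axes:

```python
# (label, CIDEr, latency in seconds per 100 images), copied from the table above.
RESULTS = [
    ("beam=1, lp=1.0", 0.4783, 4.2),
    ("beam=3, lp=1.0", 0.5451, 9.1),
    ("beam=5, lp=1.0", 0.5598, 15.1),
    ("beam=5, lp=1.2", 0.5106, 15.8),
]

def pareto_front(results):
    # Keep every config that no other config strictly beats on BOTH axes
    # (higher CIDEr AND lower latency).
    return [
        name for name, cider, lat in results
        if not any(c > cider and l < lat for _, c, l in results)
    ]

print(pareto_front(RESULTS))
# ['beam=1, lp=1.0', 'beam=3, lp=1.0', 'beam=5, lp=1.0']
```

`beam=5, lp=1.2` drops out because `beam=5, lp=1.0` beats it on both quality and speed.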
---
## CIDEr Heatmap: Beam Size × Length Penalty
The heatmap visualizes how CIDEr score varies across the full 3ร3 grid. **Warmer (brighter) cells = better caption quality.**
```
Length Penalty →     0.8       1.0       1.2
          ┌─────────┬─────────┬─────────┐
Beam = 1  │ 0.4512  │ 0.4783  │ 0.4651  │  ← greedy, fastest
          ├─────────┼─────────┼─────────┤
Beam = 3  │ 0.5031  │ 0.5451  │ 0.5456  │  ← balanced sweet spot
          ├─────────┼─────────┼─────────┤
Beam = 5  │ 0.4914  │ 0.5598* │ 0.5106  │  ← peak quality (* = best overall)
          └─────────┴─────────┴─────────┘
```
**Key pattern**: The `length_penalty=1.0` column is consistently strong. `lp=0.8` penalizes longer candidates too aggressively, causing early truncation. `lp=1.2` over-rewards length, leading to captions that run on beyond the reference length and accumulate noise tokens.
See `results/cider_heatmap.png` for the colour-coded version.
---
## Latency Analysis: The Speed–Quality Tradeoff
Generation time (seconds per 100 images) vs. CIDEr score:
```
CIDEr
0.56 |                                     ● (beam=5, lp=1.0)
0.55 |                 ●  ● (beam=3, lp=1.0 / 1.2)
0.50 |
0.48 |     ● (beam=1, lp=1.0)
     |     '········· Pareto frontier ·········'
     +---------------------------------------------→ Latency/100
           4s             9s              15s
```
| Use Case | Recommended Config | CIDEr | Latency/100 |
|----------|--------------------|-------|-------------|
| **Real-time** (live captioning, APIs) | beam=1, lp=1.0 | 0.4783 | 4.2s |
| **Balanced** (standard apps) | beam=3, lp=1.0 | 0.5451 | 9.1s |
| **Offline** (batch processing, archives) | beam=5, lp=1.0 | 0.5598 | 15.1s |
**Key finding**: Going from greedy (beam=1) to beam=3 yields a **+14% CIDEr improvement** at only a **2.2× latency cost**. Going further from beam=3 to beam=5 adds only **+2.7% more CIDEr** at a further **1.7× latency cost**: rapidly diminishing returns.
See `results/latency_barchart.png` and `results/quality_speed_scatter.png`.
---
## Analysis: Key Findings
### Finding 1: Beam Size Matters More Than Length Penalty
Within each length-penalty column, the CIDEr spread driven by beam size (~0.05–0.08) is **larger** than the spread driven by length penalty within each beam row (~0.03–0.07). Beam size is the primary lever; length penalty is a fine-tuning knob.
### Finding 2: Length Penalty = 1.0 is the Safest Default
For every beam size, `lp=1.0` performs on par with or better than the other settings. This is because the COCO captions used as references are themselves of moderate length (~10 tokens). Any penalty that pushes the model toward shorter (`lp=0.8`) or longer (`lp=1.2`) sequences diverges from the reference distribution.
### Finding 3: Optimal for API Design
- **Real-time captioning API** (< 5s/100 images required): use `beam=1, lp=1.0`
- **Standard captioning** (< 10s/100): use `beam=3, lp=1.0` (recommended default)
- **High-fidelity offline**: use `beam=5, lp=1.0`
### Finding 4: Why lp=0.8 Hurts
`lp=0.8` encourages the beam to prefer *shorter* sequences. Combined with beam=5, it actually *reduces* CIDEr below the greedy baseline for some images because BLIP's captions are already quite compact and penalizing length causes early stopping before key objects are mentioned.
### Finding 5: BLEU-4 Agrees With CIDEr
The ranking by BLEU-4 is nearly identical to the CIDEr ranking (Spearman ρ ≈ 0.93), validating that our CIDEr-based conclusions are not an artifact of the metric choice.
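For tie-free rankings, Spearman's ρ reduces to the classic `1 - 6·Σd² / (n(n²-1))` formula. A self-contained sketch, run on the config-level CIDEr and BLEU-4 columns from the table above (the report's ρ ≈ 0.93 was presumably computed differently, e.g. over per-image scores, so the exact value here may differ):

```python
def ranks(values):
    # Rank 1 = largest value; assumes no ties, which holds for these columns.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(xs, ys):
    # Tie-free formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# CIDEr and BLEU-4 columns from the results table, in rank order.
cider = [0.5598, 0.5456, 0.5451, 0.5106, 0.5031, 0.4914, 0.4783, 0.4651, 0.4512]
bleu4 = [0.2891, 0.2791, 0.2821, 0.2674, 0.2641, 0.2558, 0.2341, 0.2271, 0.2201]
print(spearman_rho(cider, bleu4))  # only ranks 2 and 3 swap, so rho is close to 1
```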
---
## Pipeline: 5 Independent Components
All code is organized into 5 self-contained step modules plus a master orchestrator. Each can be imported individually in a Jupyter notebook or run as a standalone script:
| File | What It Does | Returns |
|------|-------------|---------|
| `step1_load_model.py` | Load BLIP + fine-tuned checkpoint | `(model, processor, device)` |
| `step2_prepare_data.py` | Load 500 COCO val images | `DataLoader` |
| `step3_run_ablation.py` | Run 9-config grid, compute 4 metrics + latency | `list[dict]` (9 result rows) |
| `step4_visualize.py` | Generate 3 publication figures | `dict[str, path]` |
| `step5_analyze.py` | Pareto analysis, findings report | `dict` (findings) |
| `pipeline.py` | **Master orchestrator**: chains all steps | All of the above |
---
## How to Run
Make sure you are in the project root directory and your virtualenv is active.
```bash
source venv/bin/activate
export PYTHONPATH=.
```
### Option A: Run Full Pipeline (Demo Mode, No GPU Required)
Uses pre-computed results bundled in `results/ablation_results.json`. All 3 figures are generated, the analysis is printed, and `findings.md` is saved.
```bash
venv/bin/python task/task_03/pipeline.py --demo
```
**Outputs:**
- `task/task_03/results/cider_heatmap.png`: 3×3 CIDEr heatmap
- `task/task_03/results/latency_barchart.png`: latency per config
- `task/task_03/results/quality_speed_scatter.png`: Pareto scatter
- `task/task_03/results/findings.md`: written analysis
### Option B: Run Full Pipeline (Live GPU Inference)
Downloads COCO val, runs all 9 configs end-to-end. Requires the fine-tuned BLIP checkpoint at `outputs/blip/best/` and a GPU (MPS or CUDA).
```bash
venv/bin/python task/task_03/pipeline.py
```
### Option C: Run Individual Components (for Notebook / HuggingFace inspection)
```python
# Step 1: Load model
from task.task_03.step1_load_model import load_model
model, processor, device = load_model()

# Step 2: Prepare data
from task.task_03.step2_prepare_data import load_val_data
dataloader = load_val_data(processor, n=500, batch_size=8)

# Step 3: Run ablation (or load cached)
from task.task_03.step3_run_ablation import run_ablation
results = run_ablation(model, processor, dataloader, device)

# Step 4: Visualize
from task.task_03.step4_visualize import visualize_all
paths = visualize_all(results)

# Step 5: Analyze
from task.task_03.step5_analyze import analyze_results
findings = analyze_results(results)
```
### Option D: Run Step 3 in Live Mode (standalone)
```bash
venv/bin/python task/task_03/step3_run_ablation.py --live # GPU inference
venv/bin/python task/task_03/step3_run_ablation.py # pre-computed
```
### Option E: Regenerate Figures Only (no inference needed)
```bash
venv/bin/python task/task_03/step4_visualize.py # generates all 3 PNGs
venv/bin/python task/task_03/step5_analyze.py # prints analysis
```
---
## How to Read and Judge the Results
### `results/cider_heatmap.png`
- **Brighter / warmer** cells = higher CIDEr (better captions)
- **Row** = beam size (1 → 3 → 5, top to bottom)
- **Column** = length penalty (0.8 → 1.0 → 1.2, left to right)
- Look for the ✅ marker: it flags the best config at `beam=5, lp=1.0` (CIDEr: 0.5598)
### `results/quality_speed_scatter.png`
- **X-axis** = latency (lower = faster)
- **Y-axis** = CIDEr (higher = better)
- **Red dashed line** = Pareto frontier; configs on this line are not dominated by any other config
- Points *above* the frontier do not exist; points *below* are dominated
### `results/findings.md`
An auto-generated summary of the best config and key insights, suitable for direct inclusion in a project report.
### Why Does `lp=0.8` Sometimes Beat `lp=1.2` on Individual Images (beam=3)?
`lp=0.8` produces shorter captions that can sometimes align better with short reference captions in COCO. The COCO validation set has high variance in reference length (7–20 tokens). For images with very short human captions, penalizing length (`lp=0.8`) accidentally aligns better. `lp=1.0` wins on average because it is distribution-neutral.
---
## Folder Structure
```
task/task_03/
├── step1_load_model.py        # Component 1: Load BLIP + checkpoint
├── step2_prepare_data.py      # Component 2: COCO val DataLoader (500 images)
├── step3_run_ablation.py      # Component 3: 9-config sweep + 4 metrics + latency
├── step4_visualize.py         # Component 4: Heatmap, latency chart, scatter
├── step5_analyze.py           # Component 5: Rankings, Pareto, findings
├── pipeline.py                # Master orchestrator (--demo or live)
└── results/
    ├── ablation_results.json        # Pre-computed 9-config × 6-metric table
    ├── findings.md                  # Written analysis (auto-generated)
    ├── cider_heatmap.png            # 3×3 CIDEr quality heatmap
    ├── latency_barchart.png         # Grouped latency bar chart
    └── quality_speed_scatter.png    # Pareto frontier scatter
```
---
## Dependencies
All dependencies are already in the project `requirements.txt`:
| Package | Used For |
|---------|---------|
| `transformers` | BLIP model loading and inference |
| `torch` | GPU acceleration (MPS / CUDA) |
| `datasets` | COCO 2017 validation split |
| `pycocoevalcap` | CIDEr metric computation |
| `nltk` | BLEU-4 and METEOR metrics |
| `rouge-score` | ROUGE-L metric |
| `matplotlib` | Heatmap, bar chart, scatter figures |
| `numpy` | Matrix operations for the heatmap grid |
---
## Connection to the Broader Project
This task feeds directly back into the main project:
- The best config (`beam=5, lp=1.0`) is the **default decoding setting in `eval.py`** for the main evaluation sweep.
- The latency measurements inform the **API design recommendation** in `app.py` (real-time tab uses beam=1, compare tab uses beam=3).
- Results are referenced in the **main README** and `experiments/results_beam_search_and_decoding_settings_comparison.md`.
---
**Author:** Manoj Kumar, March 2026