# 🔬 Task 4: Caption Diversity Analysis & Concept Activation Vectors for Style Steering
## 📌 The Big Question: Can We Steer a VLM to Write Longer or Shorter Captions Without Retraining?
When a vision-language model generates a caption for an image, it doesn't just pick words at random: it navigates a high-dimensional **representation space** where different directions correspond to different caption properties. This task asks two deep questions:
1. **How diverse are the captions a model generates for the same image?** When we sample 5 captions using nucleus sampling (p=0.9), do we get genuinely different descriptions or minor paraphrases of the same sentence?
2. **Can we control caption style by directly manipulating the model's hidden states?** We extract "steering directions" from mean hidden states of short vs. detailed captions, then inject `h_steered = h + λ × direction` into the decoder at generation time, with no gradient update and no retraining.
---
## 🧠 Part 1: Caption Diversity Analysis
### What "Diversity" Means Here
For each image, we generate **5 captions** using **nucleus sampling** (`top_p=0.9`, `num_beams=1`). Nucleus sampling draws the next token from the smallest set of tokens whose cumulative probability reaches `p=0.9`. This introduces stochasticity, but how much?
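To make the nucleus rule concrete, here is a minimal, dependency-free sketch of top-p filtering over a toy next-token distribution. The function names and the toy probabilities are illustrative assumptions; the pipeline itself presumably relies on the `top_p` argument of the model's generation call rather than on code like this.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Return {token: renormalised prob} for the smallest set reaching top_p.

    `probs` maps token -> probability (hypothetical toy input).
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # stop once the nucleus covers top_p of the mass
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def sample_next_token(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalised nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens, weights = zip(*nucleus.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy distribution over five candidate tokens (made up for illustration).
toy = {"dog": 0.55, "puppy": 0.25, "animal": 0.12, "cat": 0.05, "car": 0.03}
nucleus = nucleus_filter(toy, top_p=0.9)
print(sorted(nucleus))
print(sample_next_token(toy, top_p=0.9))
```

With this toy distribution the nucleus keeps three of the five tokens; a more sharply peaked distribution would keep fewer, which is what limits diversity.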
We quantify diversity with a single score:
```
diversity_score = unique_ngrams / total_ngrams
```
where ngrams = all **unigrams + bigrams** across the 5 captions for that image.
| Score Range | Category | Meaning |
|-------------|------------|-----------------------------------------------------------|
| > 0.75 | 🌈 Diverse | Captions use substantially different vocabulary each time |
| 0.40–0.75 | Medium | Partial variation: similar structure, different words |
| < 0.40 | 🔄 Repetitive | Almost identical captions: model is not exploring |
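The score and the three bands can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `step3_diversity_analysis.py`: lowercasing, whitespace tokenisation, and pooling n-grams across all 5 captions are guesses about details the text does not specify, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def diversity_score(captions):
    """unique n-grams / total n-grams over unigrams + bigrams of all captions."""
    all_ngrams = []
    for caption in captions:
        tokens = caption.lower().split()  # assumption: lowercase whitespace tokens
        all_ngrams += ngrams(tokens, 1) + ngrams(tokens, 2)
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def diversity_category(score):
    """Map a score to the bands in the table above."""
    if score > 0.75:
        return "diverse"
    if score < 0.40:
        return "repetitive"
    return "medium"

identical = ["a dog on a couch"] * 5          # five copies of one caption
distinct = ["red bus", "two cats"]            # no shared words at all
for captions in (identical, distinct):
    score = diversity_score(captions)
    print(round(score, 3), diversity_category(score))
```

Five identical captions share every n-gram, so the score collapses toward the repetitive band, while captions with no overlapping words score 1.0.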
### Results
| Metric | Value |
|--------------------------|--------|
| Total images analysed | 200 |
| Mean diversity score | 0.5847 |
| Diverse (>0.75) | 37 images (18.5%) |
| Medium (0.40–0.75) | 118 images (59.0%) |
| Repetitive (<0.40) | 45 images (22.5%) |
### Which Images Are Hard to Diversify?
**Repetitive**: Simple, prototypical scenes such as a solitary animal on a plain surface, a single person in a common pose, or a single food item on a plate. The model has high confidence and nucleus sampling collapses to the same phrase cluster.
**Diverse**: Complex multi-object scenes such as busy city streets, sporting events, and family gatherings. The model must choose which aspect to describe, leading to genuinely different captions.
> **Implication**: Caption diversity tracks image complexity. When the model's predictions are extremely confident, diversity stays low even at high top-p values.
---
## 🧭 Part 2: Concept Steering Vectors (CAV-style)
### The Idea: Representation Engineering for Caption Style
The working assumption is that the model encodes caption *style* in the geometry of its hidden states: short captions cluster in one region, detailed captions in another. The **mean difference** between these regions is a **steering direction**, a vector that, when added to hidden states during decoding, nudges the model to generate more detailed (or shorter) text.
This is inspired by Concept Activation Vectors (CAVs) from interpretability research, adapted here for generative steering.
### Step 1: Partitioning Captions by Style
From COCO validation captions (500 samples):
| Style | Word Count | Example |
|-----------|------------|-------------------------------------------------------------|
| **Short** | ≤ 8 words | *"A dog on a couch"* |
| **Medium** | 9–14 words | *"A brown dog is resting on a dark leather couch"* |
| **Detailed** | ≥ 15 words | *"A large brown Labrador is lying comfortably on a black leather sofa next to a cushion"* |
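The partition rule in the table is a word-count threshold. A minimal sketch follows, assuming words are counted by whitespace splitting; `caption_style` and `group_by_style` are hypothetical names, not the functions in `step2_prepare_data.py`.

```python
def caption_style(caption):
    """Classify a caption by its whitespace-delimited word count."""
    n_words = len(caption.split())
    if n_words <= 8:
        return "short"
    if n_words <= 14:
        return "medium"
    return "detailed"

def group_by_style(captions):
    """Bucket captions into the three style groups from the table."""
    groups = {"short": [], "medium": [], "detailed": []}
    for caption in captions:
        groups[caption_style(caption)].append(caption)
    return groups

examples = [
    "A dog on a couch",
    "A brown dog is resting on a dark leather couch",
    "A large brown Labrador is lying comfortably on a black leather sofa next to a cushion",
]
groups = group_by_style(examples)
for style, members in groups.items():
    print(style, len(members))
```

The three example captions from the table land in one group each (5, 10, and 16 words).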
### Step 2: Extracting Mean Hidden States
For each style group, we:
1. Tokenize all captions
2. Pass them through BLIP's **text encoder** (BERT-based)
3. **Mean-pool** the encoder output across all token positions
4. Average across all captions in the group → `μ_style ∈ ℝ^{768}`
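Steps 3 and 4 amount to two nested averages. The sketch below uses plain lists in place of tensors and 3-dimensional toy vectors in place of 768-dimensional ones. Skipping padding positions via an attention mask is an assumption; the text only says the output is pooled "across all token positions".

```python
def mean_pool(token_states, attention_mask):
    """Average the hidden states of real tokens (mask == 1), skipping padding."""
    kept = [h for h, m in zip(token_states, attention_mask) if m == 1]
    dim = len(token_states[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def group_mean(caption_vectors):
    """Average the pooled vectors of all captions in one style group."""
    n = len(caption_vectors)
    dim = len(caption_vectors[0])
    return [sum(v[d] for v in caption_vectors) / n for d in range(dim)]

# Toy encoder outputs: 2 captions, 3 token positions each, hidden size 3.
caption_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [9.0, 9.0, 9.0]]  # last position is padding
caption_b = [[0.0, 2.0, 2.0], [2.0, 2.0, 2.0], [4.0, 2.0, 2.0]]

pooled = [mean_pool(caption_a, [1, 1, 0]), mean_pool(caption_b, [1, 1, 1])]
mu_style = group_mean(pooled)
print(pooled)
print(mu_style)
```

In a real run, the same two averages would presumably be taken over `torch` tensors returned by the text encoder.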
### Step 3: Computing Steering Directions
```python
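import torch.nn.functional as F  # assumption: `normalize` is the L2 normalisation from torch
# mu_short, mu_medium, mu_detailed are the 768-d mean hidden states from Step 2
d_short2detail = F.normalize(mu_detailed - mu_short, dim=0)
d_short2medium = F.normalize(mu_medium - mu_short, dim=0)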
```
Both vectors are L2-normalised, so λ directly sets the Euclidean size of the shift applied to a hidden state, regardless of how far apart the group means are or how many captions were used to compute them.
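Written out, the normalisation gives the steering strength a direct geometric reading. The symbols follow the definitions above, and $h$ denotes a decoder hidden state:

```latex
d_{\text{short2detail}}
  = \frac{\mu_{\text{detailed}} - \mu_{\text{short}}}
         {\lVert \mu_{\text{detailed}} - \mu_{\text{short}} \rVert_2},
\qquad
\lVert d_{\text{short2detail}} \rVert_2 = 1
\;\Longrightarrow\;
\lVert (h + \lambda\, d_{\text{short2detail}}) - h \rVert_2 = |\lambda|
```

A steering strength of 1.0 therefore moves every hidden state by exactly one unit along the style axis, whatever the raw distance between the two group means.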
### Step 4: Applying the Steering Vector
At generation time, we register a **PyTorch forward hook** on every attention sub-layer inside BLIP's text decoder. The hook modifies the hidden state output before it flows to the next layer:
```python
# lam is the steering strength λ (`lambda` is a reserved word in Python)
h_steered = h + lam * d_short2detail
```
This is injected at **every decoder layer at every time step**, creating a persistent bias throughout the decoding process. No gradients are computed; the model weights are not changed.
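The hook mechanism can be sketched on a stand-in layer. This is an illustration under assumptions, not the code in `step5_steer_and_eval.py`: a small `nn.Linear` stands in for one decoder sub-layer, the hidden size is 8 rather than 768, and `make_steering_hook` is a hypothetical helper. Which BLIP modules receive the hook is not shown here.

```python
import torch
from torch import nn

def make_steering_hook(direction, lam):
    """Build a forward hook that returns `output + lam * direction`."""
    def hook(module, inputs, output):
        # Some attention modules return a tuple; steer only the hidden states.
        if isinstance(output, tuple):
            return (output[0] + lam * direction,) + tuple(output[1:])
        return output + lam * direction
    return hook

torch.manual_seed(0)
hidden_size = 8                                # toy size for illustration
layer = nn.Linear(hidden_size, hidden_size)    # stand-in for one decoder sub-layer
direction = nn.functional.normalize(torch.randn(hidden_size), dim=0)

x = torch.randn(2, 5, hidden_size)             # (batch, sequence, hidden)
with torch.no_grad():                          # no gradients are needed for steering
    baseline = layer(x)
    handle = layer.register_forward_hook(make_steering_hook(direction, lam=1.5))
    steered = layer(x)
    handle.remove()                            # detach the hook when generation is done
    restored = layer(x)

shift = steered - baseline
print(torch.allclose(shift, 1.5 * direction.expand_as(baseline), atol=1e-5))
print(torch.equal(restored, baseline))
```

Returning a value from a forward hook replaces the module's output, and removing the handle restores the unsteered behaviour, which is why the weights never need to change.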
---
## 📊 Steering Results: λ Sweep
We sweep λ from −1.0 (push toward shorter style) to +2.0 (push toward detailed style):
| λ | Mean Length | Unique Words | Style Score |
|-----|-------------|--------------|-------------|
| −1.0 | 6.8 words | 6.1 | 0.453 |
| −0.5 | 8.2 words | 7.3 | 0.548 |
| **0.0** (baseline) | **10.1 words** | **8.9** | **0.673** |
| +0.5 | 11.8 words | 10.2 | 0.787 |
| +1.0 | 13.5 words | 11.4 | 0.900 |
| +1.5 | 15.2 words | 12.1 | 0.932 |
| +2.0 | 16.7 words | 12.8 | 0.956 |
**Key finding**: λ=+2.0 adds 6.6 words (+65%) over baseline, while λ=−1.0 shortens captions by 3.3 words (−33%). Mean length **increases monotonically with λ** across the sweep, which supports the claim that the steering direction captures a real style axis.
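The headline numbers follow directly from the mean-length column of the table. A short check, using only values reported above (the variable and function names are illustrative):

```python
# Mean caption length (words) per steering strength, copied from the table above.
sweep = {-1.0: 6.8, -0.5: 8.2, 0.0: 10.1, 0.5: 11.8, 1.0: 13.5, 1.5: 15.2, 2.0: 16.7}
baseline = sweep[0.0]

def change_vs_baseline(lam):
    """Return (absolute change in words, relative change in percent)."""
    delta = sweep[lam] - baseline
    return round(delta, 1), round(100 * delta / baseline)

# Lengths ordered by increasing steering strength.
lengths = [sweep[lam] for lam in sorted(sweep)]
is_monotonic = all(a < b for a, b in zip(lengths, lengths[1:]))

print(change_vs_baseline(2.0))
print(change_vs_baseline(-1.0))
print(is_monotonic)
```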
---
## 🔍 Key Findings
### Finding 1: Diversity Is Bimodal
The distribution of diversity scores is not uniform: it is bimodal, with peaks near 0.30 (in the repetitive band) and 0.70 (at the upper end of the medium band, just below the 0.75 threshold for diverse). Most images fall in the medium range, but simple and complex images cluster away from the center.
### Finding 2: Repetitive Images Are Visually Overconfident
When the model's visual encoder produces a very "clean" signal (e.g., a single dominant object on a plain background), the decoder probability distribution is sharply peaked. Even with p=0.9 nucleus sampling, the effective vocabulary at each step is very small, producing nearly identical captions.
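The link between a sharply peaked distribution and a small effective vocabulary can be illustrated numerically. In the sketch below the logits are made up, and multiplying them by `scale` is only a stand-in for how confident the decoder is; nothing here is measured from BLIP.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, top_p=0.9):
    """Number of tokens in the smallest set whose cumulative probability reaches top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)

logits = [4.0, 3.0, 2.0, 1.0, 0.5, 0.0, -0.5, -1.0]  # hypothetical next-token logits
scales = (0.5, 1.0, 3.0)                             # larger scale = sharper distribution
sizes = [nucleus_size(softmax([s * x for x in logits])) for s in scales]
for scale, size in zip(scales, sizes):
    print(scale, size)
```

As the distribution sharpens, the nucleus shrinks until a single token carries more than 90% of the mass, at which point sampling with `top_p=0.9` can only ever return that token.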
### Finding 3: Steering Vectors Capture Real Style Axes
The steering effect is not noise: the mean caption length increases monotonically with λ across 7 different values, with 20 images per value. This is consistent with the core CAV hypothesis that mean representation differences encode interpretable semantic attributes.
### Finding 4: Optimal λ for Practical Use
- λ ∈ [0.5, 1.0]: Adds 2–4 words, keeps captions fluent and on-topic
- λ > 1.5: Captions start becoming verbose and diverge from the COCO reference distribution (CIDEr would likely drop)
- λ < 0: Produces very terse captions, useful for summarization-style applications
### Finding 5: No Retraining Needed
The full steering effect requires zero gradient computation. The model weights are unchanged. The only requirement is access to a representative set of style-labelled captions to compute the mean vectors, which makes this technique immediately applicable to any BLIP-based deployment.
---
## 🏗️ Pipeline: 7 Independent Components
| File | What It Does | Returns |
|------|-------------|---------|
| `step1_load_model.py` | Load BLIP + fine-tuned checkpoint | `(model, processor, device)` |
| `step2_prepare_data.py` | COCO val DataLoader + style caption sets | `DataLoader`, `dict[style→list[str]]` |
| `step3_diversity_analysis.py` | 5 captions/image (nucleus p=0.9), diversity scores | `list[dict]` |
| `step4_steering_vectors.py` | Extract μ per style, compute d_short2detail | `dict[str, Tensor]` |
| `step5_steer_and_eval.py` | λ-sweep steered generation, length/richness metrics | `list[dict]` |
| `step6_visualize.py` | 3 publication figures (real COCO thumbnails in extremes panel) | `dict[str, path]` |
| `step7_analyze.py` | Rankings, findings, write findings.md | `dict` |
| `pipeline.py` | **Master orchestrator** (--demo or live) | All of the above |
| `demo_gradio.py` | **Interactive user-upload Gradio demo** (HF Spaces) | Gradio Blocks app |
---
## 🚀 How to Run
Make sure you are in the project root directory and your virtualenv is active:
```bash
source venv/bin/activate
export PYTHONPATH=.
```
### Option A: Demo Mode (No GPU Required) ✅ Recommended for Hugging Face Spaces
Uses pre-computed results bundled in `results/*.json`. Generates all 3 figures and findings.md in under 15 seconds.
```bash
venv/bin/python task/task_04/pipeline.py --demo
```
**Outputs:**
- `task/task_04/results/diversity_histogram.png`: diversity score distribution
- `task/task_04/results/diverse_vs_repetitive.png`: caption extremes panel
- `task/task_04/results/steering_lambda_sweep.png`: λ vs. caption length chart
- `task/task_04/results/findings.md`: written analysis
### Option B: Live GPU Inference
Downloads COCO val, runs nucleus sampling on 200 images and steering on 20 images. Requires a GPU (MPS or CUDA) and ~10 GB RAM.
```bash
venv/bin/python task/task_04/pipeline.py
```
### Option C: Individual Steps (Notebook / HuggingFace Inspection)
```python
# Step 1: Load model
from task.task_04.step1_load_model import load_model
model, processor, device = load_model()
# Step 2: Prepare data
from task.task_04.step2_prepare_data import load_val_data, build_style_sets
dataloader = load_val_data(processor, n=200, batch_size=4)
style_sets = build_style_sets(n=500)
# Step 3: Diversity analysis
from task.task_04.step3_diversity_analysis import run_diversity_analysis
records = run_diversity_analysis(model, processor, dataloader, device)
# Step 4: Steering vectors
from task.task_04.step4_steering_vectors import extract_steering_vectors
vectors = extract_steering_vectors(model, processor, style_sets, device)
# Step 5: Steered generation
from task.task_04.step5_steer_and_eval import run_steering_eval
steering_results = run_steering_eval(model, processor, dataloader, device, vectors)
# Step 6: Visualize
from task.task_04.step6_visualize import visualize_all
paths = visualize_all(records, steering_results)
# Step 7: Analyze
from task.task_04.step7_analyze import analyze_results
findings = analyze_results(records, steering_results)
```
### Option D: Run Individual Steps Standalone
```bash
# Diversity analysis (precomputed)
venv/bin/python task/task_04/step3_diversity_analysis.py
venv/bin/python task/task_04/step3_diversity_analysis.py --live # GPU inference
# Steering vectors (precomputed)
venv/bin/python task/task_04/step4_steering_vectors.py
venv/bin/python task/task_04/step4_steering_vectors.py --live
# λ sweep (precomputed)
venv/bin/python task/task_04/step5_steer_and_eval.py
venv/bin/python task/task_04/step5_steer_and_eval.py --live
# Regenerate figures only
venv/bin/python task/task_04/step6_visualize.py
# Print analysis only
venv/bin/python task/task_04/step7_analyze.py
```
---
## 🌑️ Understanding the Figures
### `results/diversity_histogram.png`
- **X-axis**: diversity score (unique n-grams / total n-grams)
- Red-shaded zone: repetitive (< 0.40)
- Blue-shaded zone: diverse (> 0.75)
- Dashed lines: thresholds; dotted line: mean score
- Look at the bimodal shape: it suggests that high- and low-diversity images form distinct populations
### `results/diverse_vs_repetitive.png`
- Left panel: top-3 most diverse images with all 5 captions
- Right panel: top-3 most repetitive images with all 5 captions
- **Image thumbnails**: actual COCO validation images are fetched via `datasets` streaming and embedded at the left column of each row. First run downloads 6 images; subsequent runs load from `results/images/`.
- Compare how different the captions look between the two groups
### `results/steering_lambda_sweep.png`
- X-axis: λ (negative = push toward shorter, positive = push toward detailed)
- Left Y-axis (orange): mean caption length in words
- Right Y-axis (purple): mean unique word count per caption
- The dashed vertical line at λ=0 is the unsteered baseline
- The upward slope of both lines shows that caption length and vocabulary grow with λ, i.e. that steering is effective
---
## 📁 Folder Structure
```
task/task_04/
├── step1_load_model.py            # Component 1: Load BLIP + checkpoint
├── step2_prepare_data.py          # Component 2: COCO DataLoader + style sets
├── step3_diversity_analysis.py    # Component 3: Nucleus diversity (p=0.9)
├── step4_steering_vectors.py      # Component 4: BLIP hidden-state extraction
├── step5_steer_and_eval.py        # Component 5: Forward-hook λ sweep
├── step6_visualize.py             # Component 6: 3 publication figures
├── step7_analyze.py               # Component 7: Rankings & findings.md
├── demo_gradio.py                 # Component 8: User-upload Gradio demo
├── pipeline.py                    # Master orchestrator (--demo or live)
└── results/
    ├── diversity_results.json     # Pre-computed per-image diversity records
    ├── steering_vectors.pt        # d_short2detail, d_short2medium tensors
    ├── steering_vectors_meta.json # Steering vector metadata
    ├── steering_results.json      # λ-sweep metrics table
    ├── findings.md                # Auto-generated written analysis
    ├── diversity_histogram.png    # Diversity score distribution
    ├── diverse_vs_repetitive.png  # Caption extremes panel (with real COCO images)
    ├── steering_lambda_sweep.png  # λ vs length/richness chart
    └── images/                    # Real COCO thumbnails (fetched on first run)
        ├── img_0.jpg
        ├── img_3.jpg
        └── ...                    # 6 total (top-3 diverse + top-3 repetitive)
```
---
## ⚙️ Dependencies
All dependencies are already in the project `requirements.txt`:
| Package | Used For |
|---------|---------|
| `transformers` | BLIP model loading, text encoder, text decoder |
| `torch` | Forward hooks, hidden-state arithmetic |
| `datasets` | COCO 2017 validation split |
| `matplotlib` | Histogram, text panel, dual-axis chart |
| `numpy` | Score aggregations |
| `tqdm` | Progress bars for live inference |
---
## 🔗 Connection to the Broader Project
- **Builds on Task 3**: Uses the same BLIP fine-tuned checkpoint (`outputs/blip/best/`) as the base model for caption generation and hidden-state extraction.
- **Complements the main app**: The diversity analysis exposes a limitation of the current inference pipeline: for simple images, repeated sampling behaves almost like greedy decoding.
- **Novel capability**: Concept steering is a training-free, inference-time technique that can be integrated into the Streamlit demo as a "style slider", allowing users to interactively generate shorter or longer captions from the same image. The `demo_gradio.py` file provides a standalone Gradio interface for this.
- **Connects to Experiment 2 (beam search)**: Diversity analysis shows that nucleus sampling and beam search operate in different regimes: beam search looks for high-probability sequences (low entropy), while nucleus sampling keeps randomness by sampling from the truncated top-p distribution.
- **Leads into Task 5**: The caption diversity pipeline (step2–step3) is the data source for Task 5's toxicity analysis; the same BLIP caption generation flow is extended with safety classification.
---
**Author:** Manoj Kumar, March 2026