🔬 Task 4: Caption Diversity Analysis & Concept Activation Vectors for Style Steering

📌 The Big Question: Can We Steer a VLM to Write Longer or Shorter Captions Without Retraining?

When a vision-language model generates a caption for an image, it doesn't just pick words at random; it navigates a high-dimensional representation space where different directions correspond to different caption properties. This task asks two questions:

  1. How diverse are the captions a model generates for the same image? When we sample 5 captions using nucleus sampling (p=0.9), do we get genuinely different descriptions or minor paraphrases of the same sentence?

  2. Can we control caption style by directly manipulating the model's hidden states? We extract "steering directions" from mean hidden states of short vs. detailed captions, then inject h_steered = h + λ × direction into the decoder at generation time, with no gradient update and no retraining.


🧠 Part 1: Caption Diversity Analysis

What "Diversity" Means Here

For each image, we generate 5 captions using nucleus sampling (top_p=0.9, num_beams=1). Nucleus sampling draws the next token from the smallest set of tokens whose cumulative probability reaches p=0.9, with probabilities renormalised over that set. This introduces stochasticity, but how much?
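A minimal, framework-free sketch of the sampling rule described above. The function names and the toy next-token distribution are made up for illustration; the actual pipeline relies on the model's own generation code rather than anything like this.

```python
import random

def top_p_filter(probs, p=0.9):
    """Return the renormalised nucleus: the smallest set of tokens whose
    cumulative probability reaches p, taking the most likely tokens first."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token from the renormalised nucleus."""
    nucleus = top_p_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

# Toy next-token distribution (hypothetical numbers, for illustration only).
next_token_probs = {"dog": 0.55, "puppy": 0.25, "animal": 0.12, "cat": 0.05, "car": 0.03}

# The nucleus keeps dog, puppy and animal: 0.55 + 0.25 + 0.12 = 0.92 >= 0.9.
nucleus = top_p_filter(next_token_probs, p=0.9)
token = sample_top_p(next_token_probs, p=0.9, rng=random.Random(0))
```

Tokens outside the nucleus ("cat" and "car" here) can never be sampled, which is why a sharply peaked distribution leaves little room for variation.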

We quantify diversity with a single score:

diversity_score = unique_ngrams / total_ngrams

where ngrams = all unigrams + bigrams across the 5 captions for that image.
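The score can be sketched in a few lines. This assumes lowercased whitespace tokenisation and n-grams pooled across all captions of one image; the tokenisation used in step3_diversity_analysis.py is not shown here and may differ, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def diversity_score(captions):
    """unique_ngrams / total_ngrams over unigrams + bigrams of all captions."""
    pooled = []
    for caption in captions:
        tokens = caption.lower().split()  # assumption: whitespace tokenisation
        pooled += ngrams(tokens, 1) + ngrams(tokens, 2)
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)

# Five identical captions: 45 n-grams in total, only 8 of them unique.
repetitive = diversity_score(["a dog on a couch"] * 5)

# Two captions with no shared words: every n-gram is unique.
distinct = diversity_score(["red bus", "two cats"])
```

Because the n-grams are pooled, a word repeated within a single caption also lowers the score, not only repetition across captions.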

| Score Range | Category | Meaning |
|---|---|---|
| > 0.75 | 🌈 Diverse | Captions use substantially different vocabulary each time |
| 0.40–0.75 | Medium | Partial variation: similar structure, different words |
| < 0.40 | 🔄 Repetitive | Almost identical captions: model is not exploring |
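The thresholds in the table above translate into a small bucketing helper. This is only a sketch: the function names are hypothetical, the example scores are invented, and treating the boundary values 0.40 and 0.75 as "medium" is an assumption read off the table.

```python
from collections import Counter

def categorise(score):
    """Map a diversity score to the category used in the table above."""
    if score > 0.75:
        return "diverse"
    if score < 0.40:
        return "repetitive"
    return "medium"  # assumption: 0.40 and 0.75 themselves count as medium

def summarise(scores):
    """Count images per category and report the mean score."""
    counts = Counter(categorise(s) for s in scores)
    mean = sum(scores) / len(scores)
    return counts, mean

# Invented scores, for illustration only.
counts, mean = summarise([0.82, 0.55, 0.31, 0.40, 0.75])
```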

Results

| Metric | Value |
|---|---|
| Total images analysed | 200 |
| Mean diversity score | 0.5847 |
| Diverse (>0.75) | 37 images (18.5%) |
| Medium (0.40–0.75) | 118 images (59.0%) |
| Repetitive (<0.40) | 45 images (22.5%) |

Which Images Are Hard to Diversify?

Repetitive: Simple, prototypical scenes, such as a solitary animal on a plain surface, a single person in a common pose, or a single food item on a plate. The model has high confidence and nucleus sampling collapses to the same phrase cluster.

Diverse: Complex multi-object scenes, such as busy city streets, sporting events, and family gatherings. The model must choose which aspect to describe, leading to genuinely different captions.

Implication: Caption diversity tracks image complexity. When the model's predictions are extremely confident, diversity stays low even at high top-p values.


🧭 Part 2: Concept Steering Vectors (CAV-style)

The Idea: Representation Engineering for Caption Style

Language models store information about caption style in their hidden state geometry. Short captions cluster in one region; detailed captions in another. The mean difference between these regions is a steering direction: a vector that, when added to hidden states during decoding, nudges the model to generate more detailed (or shorter) text.

This is inspired by Concept Activation Vectors (CAVs) from interpretability research, adapted here for generative steering.

Step 1: Partitioning Captions by Style

From COCO validation captions (500 samples):

| Style | Word Count | Example |
|---|---|---|
| Short | ≤ 8 words | "A dog on a couch" |
| Medium | 9–14 words | "A brown dog is resting on a dark leather couch" |
| Detailed | ≥ 15 words | "A large brown Labrador is lying comfortably on a black leather sofa next to a cushion" |
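The partition rule in the table can be sketched as follows. The names `caption_style` and `partition_by_style` are hypothetical and are not the project's `build_style_sets`; counting words by whitespace splitting is an assumption.

```python
def caption_style(caption):
    """Assign a style label from the word count (whitespace-split)."""
    n_words = len(caption.split())
    if n_words <= 8:
        return "short"
    if n_words <= 14:
        return "medium"
    return "detailed"

def partition_by_style(captions):
    """Group captions into the three style buckets."""
    groups = {"short": [], "medium": [], "detailed": []}
    for caption in captions:
        groups[caption_style(caption)].append(caption)
    return groups

# The three example captions from the table above.
examples = [
    "A dog on a couch",
    "A brown dog is resting on a dark leather couch",
    "A large brown Labrador is lying comfortably on a black leather sofa next to a cushion",
]
groups = partition_by_style(examples)
```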

Step 2: Extracting Mean Hidden States

For each style group, we:

  1. Tokenize all captions
  2. Pass them through BLIP's text encoder (BERT-based)
  3. Mean-pool the encoder output across all token positions
  4. Average across all captions in the group → μ_style ∈ ℝ^{768}
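Steps 3 and 4 above amount to two nested averages. The sketch below shows them on plain Python lists with toy 2-dimensional vectors instead of 768-dimensional encoder outputs; in the real pipeline these would be tensor operations. Skipping padding positions via an attention mask is an assumption, since the source only says "across all token positions", and both function names are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1) for one caption."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("caption has no unmasked tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def group_mean(caption_vectors):
    """Average the pooled caption vectors of one style group."""
    n = len(caption_vectors)
    dim = len(caption_vectors[0])
    return [sum(vec[i] for vec in caption_vectors) / n for i in range(dim)]

# Toy encoder outputs: caption A has two real tokens and one padding slot.
caption_a = mean_pool([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], [1, 1, 0])
caption_b = mean_pool([[5.0, 6.0]], [1])

# Mean vector of the (toy) style group.
mu_style = group_mean([caption_a, caption_b])
```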

Step 3: Computing Steering Directions

d_short2detail = normalize(μ_detailed − μ_short)
d_short2medium = normalize(μ_medium   − μ_short)

Both vectors are L2-normalised so that λ has a consistent magnitude interpretation: it is the L2 norm of the shift added to each hidden state, independent of how far apart the raw group means happen to be.
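A small worked example of the difference-and-normalise step, using plain Python and toy 2-dimensional means. The helper names are hypothetical and the numbers are chosen so the result is easy to check by hand.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def steering_direction(mu_target, mu_source):
    """normalize(mu_target - mu_source), as in the formulas above."""
    diff = [t - s for t, s in zip(mu_target, mu_source)]
    return l2_normalise(diff)

# Toy group means: the difference is [3, 4], whose norm is 5.
mu_short = [1.0, 1.0]
mu_detailed = [4.0, 5.0]
d_short2detail = steering_direction(mu_detailed, mu_short)  # approximately [0.6, 0.8]
```

Swapping the arguments gives the opposite direction, which is equivalent to using a negative λ with the original vector.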

Step 4: Applying the Steering Vector

At generation time, we register a PyTorch forward hook on every attention sub-layer inside BLIP's text decoder. The hook modifies the hidden state output before it flows to the next layer:

h_steered = h + λ × d_short2detail

The shift is applied at every decoder layer and at every time step, creating a persistent bias throughout the decoding process. No gradients are computed and the model weights are not changed.
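The arithmetic that each hook performs can be sketched without PyTorch. The version below works on nested lists with toy 2-dimensional hidden states, and the function name `steer` is hypothetical. In the actual pipeline the same update would run inside a forward hook, where `h` is the sub-layer's output tensor.

```python
import math

def steer(hidden_states, direction, lam):
    """Return h + lam * direction for every position's hidden vector.

    hidden_states: list of per-position vectors (one list of floats each).
    direction:     unit-norm steering vector of the same dimension.
    lam:           signed steering strength (negative pushes toward short).
    """
    return [
        [h_i + lam * d_i for h_i, d_i in zip(h, direction)]
        for h in hidden_states
    ]

# Toy example: two positions, a unit-norm direction, and lam = 2.0.
h = [[0.0, 0.0], [1.0, 1.0]]
d = [0.6, 0.8]
h_steered = steer(h, d, lam=2.0)  # approximately [[1.2, 1.6], [2.2, 2.6]]

# Because d has unit norm, every position moves by exactly |lam|.
shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_steered[0], h[0])))
```

Setting `lam=0.0` returns the hidden states unchanged, which corresponds to the unsteered baseline.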


📊 Steering Results: λ Sweep

We sweep λ from −1.0 (push toward shorter style) to +2.0 (push toward detailed style):

| λ | Mean Length (words) | Unique Words | Style Score |
|---|---|---|---|
| −1.0 | 6.8 | 6.1 | 0.453 |
| −0.5 | 8.2 | 7.3 | 0.548 |
| 0.0 (baseline) | 10.1 | 8.9 | 0.673 |
| +0.5 | 11.8 | 10.2 | 0.787 |
| +1.0 | 13.5 | 11.4 | 0.900 |
| +1.5 | 15.2 | 12.1 | 0.932 |
| +2.0 | 16.7 | 12.8 | 0.956 |

Key finding: λ=+2.0 adds 6.6 words (+65%) over the baseline, and λ=−1.0 shortens captions by 3.3 words (−33%). Mean length increases monotonically with λ, which is consistent with the steering direction capturing a real style axis.
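The headline numbers can be re-derived from the mean lengths in the sweep table. The snippet below is only a check of that arithmetic; the variable and function names are hypothetical.

```python
# (lambda, mean caption length in words), copied from the sweep table.
sweep = [(-1.0, 6.8), (-0.5, 8.2), (0.0, 10.1), (0.5, 11.8),
         (1.0, 13.5), (1.5, 15.2), (2.0, 16.7)]

lengths_by_lambda = dict(sweep)
baseline = lengths_by_lambda[0.0]

def delta_vs_baseline(lam):
    """Return (change in words, change in percent) relative to lambda = 0."""
    diff = lengths_by_lambda[lam] - baseline
    return round(diff, 1), round(100 * diff / baseline)

longest = delta_vs_baseline(2.0)    # (6.6, 65)
shortest = delta_vs_baseline(-1.0)  # (-3.3, -33)

# Mean length rises at every step of the sweep.
lengths = [length for _, length in sweep]
is_monotonic = all(a < b for a, b in zip(lengths, lengths[1:]))
```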


🔍 Key Findings

Finding 1: Diversity Is Bimodal

The distribution of diversity scores is not uniform: it is bimodal, with peaks near 0.30 (in the repetitive range) and 0.70 (just below the 0.75 diverse threshold). Most images fall in the medium range, but simple and complex images cluster away from the center.

Finding 2: Repetitive Images Are Visually Overconfident

When the model's visual encoder produces a very "clean" signal (e.g., a single dominant object on a plain background), the decoder probability distribution is sharply peaked. Even with p=0.9 nucleus sampling, the effective vocabulary at each step is very small, producing nearly identical captions.
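To make the "effective vocabulary" point concrete, the sketch below counts how many tokens survive the p=0.9 cutoff for a sharply peaked distribution and for a flat one. Both distributions are invented for illustration, and `nucleus_size` is a hypothetical helper.

```python
def nucleus_size(probs, p=0.9):
    """Number of tokens in the smallest set whose cumulative probability reaches p."""
    cumulative = 0.0
    for size, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return size
    return len(probs)

# Invented next-token distributions over a 5-token vocabulary.
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]  # one dominant token
flat = [0.2, 0.2, 0.2, 0.2, 0.2]         # no clear favourite

peaked_size = nucleus_size(peaked)  # 1: the top token alone exceeds 0.9
flat_size = nucleus_size(flat)      # 5: every token is needed to reach 0.9
```

With a nucleus of size 1, sampling is deterministic at that step, so repeated samples produce the same token.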

Finding 3: Steering Vectors Capture Real Style Axes

The steering effect is not noise: the mean caption length increases monotonically with λ across 7 different values and 20 images per λ. This is consistent with the core CAV hypothesis that mean representation differences encode interpretable semantic attributes.

Finding 4: Optimal λ for Practical Use

  • λ ∈ [0.5, 1.0]: Adds about 2–3 words (1.7 to 3.4 in the sweep above), keeps captions fluent and on-topic
  • λ > 1.5: Captions start becoming verbose and diverge from the COCO reference distribution (CIDEr would likely drop)
  • λ < 0: Produces very terse captions, useful for summarization-style applications

Finding 5: No Retraining Needed

The full steering effect requires zero gradient computation. The model weights are unchanged. The only requirement is access to a representative set of style-labelled captions to compute the mean vectors, which makes this technique readily applicable to other BLIP-based deployments.


🏗️ Pipeline: 7 Independent Components

| File | What It Does | Returns |
|---|---|---|
| step1_load_model.py | Load BLIP + fine-tuned checkpoint | (model, processor, device) |
| step2_prepare_data.py | COCO val DataLoader + style caption sets | DataLoader, dict[style→list[str]] |
| step3_diversity_analysis.py | 5 captions/image (nucleus p=0.9), diversity scores | list[dict] |
| step4_steering_vectors.py | Extract μ per style, compute d_short2detail | dict[str, Tensor] |
| step5_steer_and_eval.py | λ-sweep steered generation, length/richness metrics | list[dict] |
| step6_visualize.py | 3 publication figures (real COCO thumbnails in extremes panel) | dict[str, path] |
| step7_analyze.py | Rankings, findings, write findings.md | dict |
| pipeline.py | Master orchestrator (--demo or live) | All of the above |
| demo_gradio.py | Interactive user-upload Gradio demo (HF Spaces) | Gradio Blocks app |

🚀 How to Run

Make sure you are in the project root directory and your virtualenv is active:

```bash
source venv/bin/activate
export PYTHONPATH=.
```

Option A: Demo Mode (No GPU Required) ✅ Recommended for HuggingFace Spaces

Uses pre-computed results bundled in results/*.json. Generates all 3 figures and findings.md in under 15 seconds.

```bash
venv/bin/python task/task_04/pipeline.py --demo
```

Outputs:

  • task/task_04/results/diversity_histogram.png: diversity score distribution
  • task/task_04/results/diverse_vs_repetitive.png: caption extremes panel
  • task/task_04/results/steering_lambda_sweep.png: λ vs. caption length chart
  • task/task_04/results/findings.md: written analysis

Option B: Live GPU Inference

Downloads COCO val, runs nucleus sampling on 200 images and steering on 20 images. Requires a GPU (MPS or CUDA) and ~10 GB RAM.

```bash
venv/bin/python task/task_04/pipeline.py
```

Option C: Individual Steps (Notebook / HuggingFace Inspection)

```python
# Step 1: Load model
from task.task_04.step1_load_model import load_model
model, processor, device = load_model()

# Step 2: Prepare data
from task.task_04.step2_prepare_data import load_val_data, build_style_sets
dataloader = load_val_data(processor, n=200, batch_size=4)
style_sets = build_style_sets(n=500)

# Step 3: Diversity analysis
from task.task_04.step3_diversity_analysis import run_diversity_analysis
records = run_diversity_analysis(model, processor, dataloader, device)

# Step 4: Steering vectors
from task.task_04.step4_steering_vectors import extract_steering_vectors
vectors = extract_steering_vectors(model, processor, style_sets, device)

# Step 5: Steered generation
from task.task_04.step5_steer_and_eval import run_steering_eval
steering_results = run_steering_eval(model, processor, dataloader, device, vectors)

# Step 6: Visualize
from task.task_04.step6_visualize import visualize_all
paths = visualize_all(records, steering_results)

# Step 7: Analyze
from task.task_04.step7_analyze import analyze_results
findings = analyze_results(records, steering_results)
```

Option D: Run Individual Steps Standalone

```bash
# Diversity analysis (precomputed)
venv/bin/python task/task_04/step3_diversity_analysis.py
venv/bin/python task/task_04/step3_diversity_analysis.py --live  # GPU inference

# Steering vectors (precomputed)
venv/bin/python task/task_04/step4_steering_vectors.py
venv/bin/python task/task_04/step4_steering_vectors.py --live

# λ sweep (precomputed)
venv/bin/python task/task_04/step5_steer_and_eval.py
venv/bin/python task/task_04/step5_steer_and_eval.py --live

# Regenerate figures only
venv/bin/python task/task_04/step6_visualize.py

# Print analysis only
venv/bin/python task/task_04/step7_analyze.py
```

🌑 Understanding the Figures

results/diversity_histogram.png

  • X-axis: diversity score (unique n-grams / total n-grams)
  • Red-shaded zone: repetitive (< 0.40)
  • Blue-shaded zone: diverse (> 0.75)
  • Dashed lines: thresholds; dotted line: mean score
  • Look at the bimodal shape: it suggests that high-diversity and low-diversity images form distinct populations

results/diverse_vs_repetitive.png

  • Left panel: top-3 most diverse images with all 5 captions
  • Right panel: top-3 most repetitive images with all 5 captions
  • Image thumbnails: actual COCO validation images are fetched via datasets streaming and embedded in the left column of each row. The first run downloads 6 images; subsequent runs load them from results/images/.
  • Compare how different the captions look between the two groups

results/steering_lambda_sweep.png

  • X-axis: λ (negative = push toward shorter, positive = push toward detailed)
  • Left Y-axis (orange): mean caption length in words
  • Right Y-axis (purple): mean unique word count per caption
  • The dashed vertical line at λ=0 is the unsteered baseline
  • The slope of both lines confirms that steering is effective

📁 Folder Structure

```text
task/task_04/
├── step1_load_model.py          # Component 1: Load BLIP + checkpoint
├── step2_prepare_data.py        # Component 2: COCO DataLoader + style sets
├── step3_diversity_analysis.py  # Component 3: Nucleus diversity (p=0.9)
├── step4_steering_vectors.py    # Component 4: BLIP hidden-state extraction
├── step5_steer_and_eval.py      # Component 5: Forward-hook λ sweep
├── step6_visualize.py           # Component 6: 3 publication figures
├── step7_analyze.py             # Component 7: Rankings & findings.md
├── demo_gradio.py               # Component 8: User-upload Gradio demo
├── pipeline.py                  # Master orchestrator (--demo or live)
└── results/
    ├── diversity_results.json       # Pre-computed per-image diversity records
    ├── steering_vectors.pt          # d_short2detail, d_short2medium tensors
    ├── steering_vectors_meta.json   # Steering vector metadata
    ├── steering_results.json        # λ-sweep metrics table
    ├── findings.md                  # Auto-generated written analysis
    ├── diversity_histogram.png      # Diversity score distribution
    ├── diverse_vs_repetitive.png    # Caption extremes panel (with real COCO images)
    ├── steering_lambda_sweep.png    # λ vs length/richness chart
    └── images/                      # Real COCO thumbnails (fetched on first run)
        ├── img_0.jpg
        ├── img_3.jpg
        └── ...                      # 6 total (top-3 diverse + top-3 repetitive)
```

⚙️ Dependencies

All dependencies are already in the project requirements.txt:

| Package | Used For |
|---|---|
| transformers | BLIP model loading, text encoder, text decoder |
| torch | Forward hooks, hidden-state arithmetic |
| datasets | COCO 2017 validation split |
| matplotlib | Histogram, text panel, dual-axis chart |
| numpy | Score aggregations |
| tqdm | Progress bars for live inference |

🔗 Connection to the Broader Project

  • Builds on Task 3: Uses the same BLIP fine-tuned checkpoint (outputs/blip/best/) as the base model for caption generation and hidden-state extraction.
  • Complements the main app: The diversity analysis exposes a limitation of the current inference pipeline: for simple images, repeated sampling behaves much like greedy decoding.
  • Novel capability: Concept steering is a zero-shot technique that can be integrated into the Streamlit demo as a "style slider", allowing users to interactively generate shorter or longer captions from the same image. The demo_gradio.py file provides a standalone Gradio interface for this.
  • Connects to Experiment 2 (beam search): Diversity analysis shows that nucleus sampling and beam search operate in different regimes: beam search maximises sequence probability (low diversity), while nucleus sampling draws from the truncated distribution and so preserves entropy up to the top-p cutoff.
  • Leads into Task 5: The caption diversity pipeline (step2–step3) is the data source for Task 5's toxicity analysis; the same BLIP caption generation flow is extended with safety classification.

Author: Manoj Kumar, March 2026