Task 4: Caption Diversity Analysis & Concept Activation Vectors for Style Steering
The Big Question: Can We Steer a VLM to Write Longer or Shorter Captions Without Retraining?
When a vision-language model generates a caption for an image, it doesn't just pick words at random; it navigates a high-dimensional representation space where different directions correspond to different caption properties. This task asks two deep questions:
How diverse are the captions a model generates for the same image? When we sample 5 captions using nucleus sampling (p=0.9), do we get genuinely different descriptions or minor paraphrases of the same sentence?
Can we control caption style by directly manipulating the model's hidden states? We extract "steering directions" from mean hidden states of short vs. detailed captions, then inject `h_steered = h + λ × direction` into the decoder at generation time, with no gradient update and no retraining.
Part 1: Caption Diversity Analysis
What "Diversity" Means Here
For each image, we generate 5 captions using nucleus sampling (top_p=0.9, num_beams=1). Nucleus sampling selects the next token from the smallest vocabulary subset whose cumulative probability exceeds p=0.9. This introduces stochasticity, but how much?
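To make the "smallest subset" definition concrete, here is a minimal sketch of top-p filtering over a toy next-token distribution. The function names `nucleus` and `sample` and the toy probabilities are illustrative assumptions, not part of the project code, and real implementations operate on logits over the full vocabulary.

```python
import random


def nucleus(probs, top_p=0.9):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches top_p, then renormalise the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalised nucleus."""
    filtered = nucleus(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distributions (made up for illustration).
peaked = {"dog": 0.92, "puppy": 0.04, "animal": 0.02, "cat": 0.02}
flat = {"street": 0.30, "city": 0.25, "people": 0.20, "cars": 0.13, "bus": 0.12}

print(len(nucleus(peaked)), len(nucleus(flat)))
```

With the peaked distribution the nucleus contains a single token, so sampling can only ever return that token; with the flatter one all five tokens stay in play.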
We quantify diversity with a single score:
diversity_score = unique_ngrams / total_ngrams
where ngrams = all unigrams + bigrams across the 5 captions for that image.
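A minimal sketch of this score, assuming lowercased whitespace tokenisation and n-grams that do not span caption boundaries (the project's actual tokenisation is not stated here, and the helper names are hypothetical):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def diversity_score(captions):
    """unique_ngrams / total_ngrams over the unigrams and bigrams
    pooled across all captions of one image."""
    pooled = []
    for caption in captions:
        tokens = caption.lower().split()  # assumption: whitespace tokens
        pooled.extend(ngrams(tokens, 1))
        pooled.extend(ngrams(tokens, 2))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


identical = ["a dog on a couch"] * 5
print(round(diversity_score(identical), 4))
```

Five identical copies of a five-word caption give 8 unique n-grams out of 45, so the score falls well inside the repetitive range.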
| Score Range | Category | Meaning |
|---|---|---|
| > 0.75 | Diverse | Captions use substantially different vocabulary each time |
| 0.40–0.75 | Medium | Partial variation: similar structure, different words |
| < 0.40 | Repetitive | Almost identical captions: the model is not exploring |
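The thresholds in the table can be applied mechanically. The sketch below is illustrative only; the function names are hypothetical, and treating both boundary values as "medium" follows the 0.40–0.75 row.

```python
from collections import Counter


def categorise(score):
    """Map a diversity score to the category bands in the table above."""
    if score > 0.75:
        return "diverse"
    if score < 0.40:
        return "repetitive"
    return "medium"  # 0.40 and 0.75 themselves count as medium


def category_shares(scores):
    """Fraction of images falling in each category."""
    counts = Counter(categorise(s) for s in scores)
    return {name: counts[name] / len(scores)
            for name in ("diverse", "medium", "repetitive")}


print(category_shares([0.8, 0.5, 0.5, 0.2]))
```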
Results
| Metric | Value |
|---|---|
| Total images analysed | 200 |
| Mean diversity score | 0.5847 |
| Diverse (>0.75) | 37 images (18.5%) |
| Medium (0.40–0.75) | 118 images (59.0%) |
| Repetitive (<0.40) | 45 images (22.5%) |
Which Images Are Hard to Diversify?
Repetitive: Simple, prototypical scenes, such as a solitary animal on a plain surface, a single person in a common pose, or a single food item on a plate. The model has high confidence and nucleus sampling collapses to the same phrase cluster.
Diverse: Complex multi-object scenes, such as busy city streets, sporting events, and family gatherings. The model must choose which aspect to describe, leading to genuinely different captions.
Implication: Caption diversity depends strongly on image complexity. When the model's predictions are extremely confident, sampled captions stay nearly identical even at high top-p values.
Part 2: Concept Steering Vectors (CAV-style)
The Idea: Representation Engineering for Caption Style
Language models store information about caption style in their hidden state geometry. Short captions cluster in one region; detailed captions in another. The mean difference between these regions is a steering direction: a vector that, when added to hidden states during decoding, nudges the model to generate more detailed (or shorter) text.
This is inspired by Concept Activation Vectors (CAVs) from interpretability research, adapted here for generative steering.
Step 1: Partitioning Captions by Style
From COCO validation captions (500 samples):
| Style | Word Count | Example |
|---|---|---|
| Short | ≤ 8 words | "A dog on a couch" |
| Medium | 9–14 words | "A brown dog is resting on a dark leather couch" |
| Detailed | ≥ 15 words | "A large brown Labrador is lying comfortably on a black leather sofa next to a cushion" |
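The partition rule in the table can be sketched as follows. This is not the project's `build_style_sets`; the function names are hypothetical and word counts are assumed to come from a plain whitespace split.

```python
def style_of(caption):
    """Assign a style label from the whitespace word count."""
    n_words = len(caption.split())
    if n_words <= 8:
        return "short"
    if n_words <= 14:
        return "medium"
    return "detailed"


def partition_by_style(captions):
    """Group captions into the three style buckets."""
    groups = {"short": [], "medium": [], "detailed": []}
    for caption in captions:
        groups[style_of(caption)].append(caption)
    return groups


examples = [
    "A dog on a couch",
    "A brown dog is resting on a dark leather couch",
    "A large brown Labrador is lying comfortably on a black leather sofa next to a cushion",
]
print({style: len(items) for style, items in partition_by_style(examples).items()})
```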
Step 2: Extracting Mean Hidden States
For each style group, we:
- Tokenize all captions
- Pass them through BLIP's text encoder (BERT-based)
- Mean-pool the encoder output across all token positions
- Average across all captions in the group → `μ_style ∈ ℝ^{768}`
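The two averaging steps (over tokens, then over captions) can be illustrated with plain lists. This is a sketch only: it uses toy 2-dimensional vectors instead of 768, the function names are hypothetical, and it assumes each caption contributes only its real tokens (whether the project masks padding before pooling is not stated here).

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def style_mean(caption_token_states):
    """caption_token_states holds one list of token vectors per caption."""
    pooled = [mean_vector(tokens) for tokens in caption_token_states]  # mean over tokens
    return mean_vector(pooled)                                         # mean over captions


# Two toy captions: one with two tokens, one with a single token.
toy_states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
print(style_mean(toy_states))
```

Pooling per caption first means a long caption does not outweigh a short one in the group mean.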
Step 3: Computing Steering Directions
d_short2detail = normalize(μ_detailed − μ_short)
d_short2medium = normalize(μ_medium − μ_short)
Both vectors are L2-normalised so that λ has a consistent magnitude interpretation regardless of the dataset size used to compute the means.
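As a worked example of the difference-then-normalise step, here is a small sketch with toy 2-dimensional means. The values and helper names are made up for illustration; the real vectors live in ℝ^768.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def steering_direction(mu_target, mu_source):
    """Unit vector pointing from the source style mean to the target style mean."""
    diff = [t - s for t, s in zip(mu_target, mu_source)]
    return l2_normalize(diff)


mu_short = [1.0, 1.0]      # toy values
mu_detailed = [4.0, 5.0]   # toy values
d_short2detail = steering_direction(mu_detailed, mu_short)
print(d_short2detail)
```

Here the raw difference is (3, 4) with norm 5, so the direction has unit length and λ alone sets the size of the shift.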
Step 4: Applying the Steering Vector
At generation time, we register a PyTorch forward hook on every attention sub-layer inside BLIP's text decoder. The hook modifies the hidden state output before it flows to the next layer:
h_steered = h + λ × d_short2detail
This is injected at every decoder layer at every time step, creating a persistent bias throughout the decoding process. No gradients are computed; the model weights are not changed.
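The hook mechanism can be sketched on a toy module stack. This is an assumption-laden illustration, not the project's `step5_steer_and_eval.py`: `ToyDecoder` stands in for BLIP's text decoder, plain `nn.Linear` layers stand in for its attention sub-layers, and the tuple branch reflects the fact that real attention modules often return a tuple whose first element is the hidden state.

```python
import torch
from torch import nn


def make_steering_hook(direction, lam):
    """Build a forward hook that adds lam * direction to a module's output."""
    def hook(module, inputs, output):
        # A value returned from a forward hook replaces the module's output.
        if isinstance(output, tuple):
            hidden = output[0]
            shift = lam * direction.to(dtype=hidden.dtype, device=hidden.device)
            return (hidden + shift,) + output[1:]
        shift = lam * direction.to(dtype=output.dtype, device=output.device)
        return output + shift
    return hook


class ToyDecoder(nn.Module):
    """Hypothetical stand-in for a decoder: a stack of linear layers."""
    def __init__(self, hidden=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_layers))

    def forward(self, h):
        for layer in self.layers:
            h = layer(h)
        return h


torch.manual_seed(0)
decoder = ToyDecoder()
direction = torch.nn.functional.normalize(torch.randn(8), dim=0)
h = torch.randn(2, 5, 8)  # (batch, tokens, hidden)

with torch.no_grad():  # no gradients are needed for steering
    baseline = decoder(h)
    handles = [layer.register_forward_hook(make_steering_hook(direction, lam=1.0))
               for layer in decoder.layers]
    try:
        steered = decoder(h)
    finally:
        for handle in handles:
            handle.remove()  # always detach hooks so later calls are unsteered
    restored = decoder(h)

print(torch.allclose(baseline, restored), torch.allclose(baseline, steered))
```

Removing the handles in a `finally` block keeps a failed generation call from leaving the model permanently steered.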
Steering Results: λ Sweep
We sweep λ from −1.0 (push toward shorter style) to +2.0 (push toward detailed style):
| λ | Mean Length | Unique Words | Style Score |
|---|---|---|---|
| −1.0 | 6.8 words | 6.1 | 0.453 |
| −0.5 | 8.2 words | 7.3 | 0.548 |
| 0.0 | 10.1 words | 8.9 | 0.673 (baseline) |
| +0.5 | 11.8 words | 10.2 | 0.787 |
| +1.0 | 13.5 words | 11.4 | 0.900 |
| +1.5 | 15.2 words | 12.1 | 0.932 |
| +2.0 | 16.7 words | 12.8 | 0.956 |
Key finding: λ=+2.0 adds 6.6 words (+65%) over the baseline, and λ=−1.0 shortens captions by 3.3 words (−33%). Mean length increases monotonically with λ across the sweep, which supports the claim that the steering direction captures a real style axis.
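The headline numbers can be re-derived from the mean-length column of the table. The snippet below only re-does that arithmetic; the variable and function names are illustrative.

```python
# (lambda, mean caption length in words), copied from the sweep table
sweep = [(-1.0, 6.8), (-0.5, 8.2), (0.0, 10.1), (0.5, 11.8),
         (1.0, 13.5), (1.5, 15.2), (2.0, 16.7)]
baseline = dict(sweep)[0.0]


def change_vs_baseline(length):
    """Return (absolute change in words, relative change in whole percent)."""
    delta = length - baseline
    return round(delta, 1), round(100 * delta / baseline)


lengths = [length for _, length in sweep]
is_monotonic = all(a < b for a, b in zip(lengths, lengths[1:]))

print(change_vs_baseline(16.7), change_vs_baseline(6.8), is_monotonic)
```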
Key Findings
Finding 1: Diversity Is Bimodal
The distribution of diversity scores is not uniform: it is bimodal, with peaks near 0.30 (inside the repetitive band) and 0.70 (near the top of the medium band, just below the 0.75 diverse threshold). Most images fall in the medium range, but simple and complex images cluster away from the center.
Finding 2: Repetitive Images Are Visually Overconfident
When the model's visual encoder produces a very "clean" signal (e.g., a single dominant object on a plain background), the decoder probability distribution is sharply peaked. Even with p=0.9 nucleus sampling, the effective vocabulary at each step is very small, producing nearly identical captions.
Finding 3: Steering Vectors Capture Real Style Axes
The steering effect is not noise: mean caption length increases monotonically with λ across 7 values, with 20 images per λ. This is consistent with the core CAV hypothesis that mean representation differences encode interpretable semantic attributes.
Finding 4: Optimal λ for Practical Use
- λ ∈ [0.5, 1.0]: Adds about 2–3 words (+1.7 at λ=0.5, +3.4 at λ=1.0), keeps captions fluent and on-topic
- λ > 1.5: Captions become verbose and diverge from the COCO reference distribution (CIDEr would likely drop)
- λ < 0: Produces very terse captions, useful for summarization-style applications
Finding 5: No Retraining Needed
The full steering effect requires zero gradient computation. The model weights are unchanged. The only requirement is access to a representative set of style-labelled captions to compute the mean vectors, which makes this technique immediately applicable to any BLIP-based deployment.
Pipeline: 7 Independent Components
| File | What It Does | Returns |
|---|---|---|
| `step1_load_model.py` | Load BLIP + fine-tuned checkpoint | (model, processor, device) |
| `step2_prepare_data.py` | COCO val DataLoader + style caption sets | DataLoader, dict[style→list[str]] |
| `step3_diversity_analysis.py` | 5 captions/image (nucleus p=0.9), diversity scores | list[dict] |
| `step4_steering_vectors.py` | Extract μ per style, compute d_short2detail | dict[str, Tensor] |
| `step5_steer_and_eval.py` | λ-sweep steered generation, length/richness metrics | list[dict] |
| `step6_visualize.py` | 3 publication figures (real COCO thumbnails in extremes panel) | dict[str, path] |
| `step7_analyze.py` | Rankings, findings, write findings.md | dict |
| `pipeline.py` | Master orchestrator (--demo or live) | All of the above |
| `demo_gradio.py` | Interactive user-upload Gradio demo (HF Spaces) | Gradio Blocks app |
How to Run
Make sure you are in the project root directory and your virtualenv is active:
source venv/bin/activate
export PYTHONPATH=.
Option A: Demo Mode (No GPU Required), Recommended for HuggingFace Spaces
Uses pre-computed results bundled in results/*.json. Generates all 3 figures and findings.md in under 15 seconds.
venv/bin/python task/task_04/pipeline.py --demo
Outputs:
- `task/task_04/results/diversity_histogram.png`: diversity score distribution
- `task/task_04/results/diverse_vs_repetitive.png`: caption extremes panel
- `task/task_04/results/steering_lambda_sweep.png`: λ vs. caption length chart
- `task/task_04/results/findings.md`: written analysis
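After a demo run, a quick way to confirm that all four outputs exist is a small check like the one below. It is a convenience sketch, not part of the pipeline; the helper name `missing_outputs` is hypothetical, and the default path assumes the script is run from the project root.

```python
from pathlib import Path

# File names taken from the output list above.
EXPECTED_OUTPUTS = [
    "diversity_histogram.png",
    "diverse_vs_repetitive.png",
    "steering_lambda_sweep.png",
    "findings.md",
]


def missing_outputs(results_dir="task/task_04/results"):
    """Return the expected output files that are not present in results_dir."""
    results = Path(results_dir)
    return [name for name in EXPECTED_OUTPUTS if not (results / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    print("all outputs present" if not missing else f"missing: {missing}")
```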
Option B: Live GPU Inference
Downloads COCO val, runs nucleus sampling on 200 images and steering on 20 images. Requires a GPU (MPS or CUDA) and ~10 GB RAM.
venv/bin/python task/task_04/pipeline.py
Option C: Individual Steps (Notebook / HuggingFace Inspection)
# Step 1: Load model
from task.task_04.step1_load_model import load_model
model, processor, device = load_model()
# Step 2: Prepare data
from task.task_04.step2_prepare_data import load_val_data, build_style_sets
dataloader = load_val_data(processor, n=200, batch_size=4)
style_sets = build_style_sets(n=500)
# Step 3: Diversity analysis
from task.task_04.step3_diversity_analysis import run_diversity_analysis
records = run_diversity_analysis(model, processor, dataloader, device)
# Step 4: Steering vectors
from task.task_04.step4_steering_vectors import extract_steering_vectors
vectors = extract_steering_vectors(model, processor, style_sets, device)
# Step 5: Steered generation
from task.task_04.step5_steer_and_eval import run_steering_eval
steering_results = run_steering_eval(model, processor, dataloader, device, vectors)
# Step 6: Visualize
from task.task_04.step6_visualize import visualize_all
paths = visualize_all(records, steering_results)
# Step 7: Analyze
from task.task_04.step7_analyze import analyze_results
findings = analyze_results(records, steering_results)
Option D: Run Individual Steps Standalone
# Diversity analysis (precomputed)
venv/bin/python task/task_04/step3_diversity_analysis.py
venv/bin/python task/task_04/step3_diversity_analysis.py --live # GPU inference
# Steering vectors (precomputed)
venv/bin/python task/task_04/step4_steering_vectors.py
venv/bin/python task/task_04/step4_steering_vectors.py --live
# λ sweep (precomputed)
venv/bin/python task/task_04/step5_steer_and_eval.py
venv/bin/python task/task_04/step5_steer_and_eval.py --live
# Regenerate figures only
venv/bin/python task/task_04/step6_visualize.py
# Print analysis only
venv/bin/python task/task_04/step7_analyze.py
Understanding the Figures
results/diversity_histogram.png
- X-axis: diversity score (unique n-grams / total n-grams)
- Red-shaded zone: repetitive (< 0.40)
- Blue-shaded zone: diverse (> 0.75)
- Dashed lines: thresholds; dotted line: mean score
- Look at the bimodal shape: it confirms that high and low diversity images are distinct populations
results/diverse_vs_repetitive.png
- Left panel: top-3 most diverse images with all 5 captions
- Right panel: top-3 most repetitive images with all 5 captions
- Image thumbnails: actual COCO validation images are fetched via `datasets` streaming and embedded at the left column of each row. First run downloads 6 images; subsequent runs load from `results/images/`.
- Compare how different the captions look between the two groups
results/steering_lambda_sweep.png
- X-axis: λ (negative = push toward shorter, positive = push toward detailed)
- Left Y-axis (orange): mean caption length in words
- Right Y-axis (purple): mean unique word count per caption
- The dashed vertical line at λ=0 is the unsteered baseline
- The slope of both lines confirms that steering is effective
Folder Structure
task/task_04/
├── step1_load_model.py              # Component 1: Load BLIP + checkpoint
├── step2_prepare_data.py            # Component 2: COCO DataLoader + style sets
├── step3_diversity_analysis.py      # Component 3: Nucleus diversity (p=0.9)
├── step4_steering_vectors.py        # Component 4: BLIP hidden-state extraction
├── step5_steer_and_eval.py          # Component 5: Forward-hook λ sweep
├── step6_visualize.py               # Component 6: 3 publication figures
├── step7_analyze.py                 # Component 7: Rankings & findings.md
├── demo_gradio.py                   # Component 8: User-upload Gradio demo
├── pipeline.py                      # Master orchestrator (--demo or live)
└── results/
    ├── diversity_results.json       # Pre-computed per-image diversity records
    ├── steering_vectors.pt          # d_short2detail, d_short2medium tensors
    ├── steering_vectors_meta.json   # Steering vector metadata
    ├── steering_results.json        # λ-sweep metrics table
    ├── findings.md                  # Auto-generated written analysis
    ├── diversity_histogram.png      # Diversity score distribution
    ├── diverse_vs_repetitive.png    # Caption extremes panel (with real COCO images)
    ├── steering_lambda_sweep.png    # λ vs length/richness chart
    └── images/                      # Real COCO thumbnails (fetched on first run)
        ├── img_0.jpg
        ├── img_3.jpg
        └── ...                      # 6 total (top-3 diverse + top-3 repetitive)
Dependencies
All dependencies are already in the project requirements.txt:
| Package | Used For |
|---|---|
| `transformers` | BLIP model loading, text encoder, text decoder |
| `torch` | Forward hooks, hidden-state arithmetic |
| `datasets` | COCO 2017 validation split |
| `matplotlib` | Histogram, text panel, dual-axis chart |
| `numpy` | Score aggregations |
| `tqdm` | Progress bars for live inference |
Connection to the Broader Project
- Builds on Task 3: Uses the same BLIP fine-tuned checkpoint (`outputs/blip/best/`) as the base model for caption generation and hidden-state extraction.
- Complements the main app: The diversity analysis exposes a limitation of the current inference pipeline: for simple images, repeated sampling behaves like greedy decoding.
- Novel capability: Concept steering is a zero-shot technique that can be integrated into the Streamlit demo as a "style slider", allowing users to interactively generate shorter or longer captions from the same image. The `demo_gradio.py` file provides a standalone Gradio interface for this.
- Connects to Experiment 2 (beam search): Diversity analysis shows that nucleus sampling and beam search operate in different regimes. Beam search maximises sequence probability (low entropy), while nucleus sampling draws from the truncated top-p distribution and so retains some of its entropy.
- Leads into Task 5: The caption diversity pipeline (step2→step3) is the data source for Task 5's toxicity analysis; the same BLIP caption generation flow is extended with safety classification.
Author: Manoj Kumar, March 2026