# Multilingual V-Score Analysis with SAE Features

```
MVA project. Contributors: Yannis Kolodziej, Tom Mariani, Hai Pham, Valentin Smague
```

This project investigates how **Sparse Autoencoder (SAE) features** inside large language models (LLMs) encode language identity. The core metric is the **v-score** — a per-feature, per-language score that measures how much a SAE feature activates on one language compared to the average across all other languages.

## What is the V-Score?

For a given SAE feature `f` and language `L` (over a set of `K` languages):

```
v(f, L) = mean_activation(f, L) - mean( mean_activation(f, L') for L' ≠ L )
```

Features with a high v-score for language `L` are considered **language-specific** to `L`. By sorting features by their v-score, we obtain a ranked list of the most language-discriminative SAE features per layer.

## Project Structure

```
MVA-SNLP/
├── compute_v_scores.py            # CLI: compute & save v-scores for any model/SAE/language set
├── visualize_v_scores.ipynb       # Visualize saved v-score runs (Figure 1 reproduction + insights)
├── sae_feature_exploration.ipynb  # Hugging Face-based interactive SAE feature exploration
├── extended_visualization.ipynb   # Extended visualizations and additional analyses
├── code_switching_analysis.ipynb  # Code-switching analysis on specific words
│
├── scripts/                       # Ready-to-run bash scripts for each experiment
│   ├── run_gemma_reprod.sh                  # Reproduce Figure 1 (Gemma-2B, 10 languages, 100 texts)
│   ├── run_gemma_diverse_langs_all.sh       # Insight 1: all texts, 10 diverse languages
│   ├── run_gemma_diverse_langs_small.sh     # Insight 1: quick run (25 texts/language)
│   ├── run_gemma_similar_langs.sh           # Insight 2: similar languages (es/pt/gl/ca)
│   ├── run_gemma_underrepresented_langs.sh  # Insight 4: underrepresented languages
│   └── run_qwen_reprod.sh                   # Reproduction with Qwen3-0.6B
│
└── v_score_runs/                  # Saved results (meta.json + v_scores.pt per run)
    ├── run_reprod_fig_1/
    ├── run_insight_1_all/
    ├── run_insight_1_small/
    ├── run_insight_2/
    ├── run_insight_4/
    └── qwen_run_reprod_fig_1/
```

## Supported Models

| Alias | Model | SAE Release |
|---|---|---|
| `gemma-2b` | `google/gemma-2-2b` | `gemma-scope-2b-pt-res-canonical` |
| `qwen3-0.6b` | `Qwen/Qwen3-0.6B` | `mwhanna-qwen3-0.6b-transcoders-lowl0` |

## Quick Start

### 1. Install dependencies

```bash
pip install torch transformers datasets sae-lens matplotlib
```

### 2. Compute v-scores (CLI)

**Reproduce Figure 1** (Gemma-2B, 10 languages):

```bash
bash scripts/run_gemma_reprod.sh
```

**Or run directly with custom settings:**

```bash
python compute_v_scores.py compute \
  --model gemma-2b \
  --languages eng_Latn,fra_Latn,jpn_Jpan,cmn_Hans \
  --layers 0,5,10,15,20 \
  --n-texts-per-lang 100 \
  --out-dir ./v_score_runs/my_run
```

**Use a custom model/SAE not in the presets:**

```bash
python compute_v_scores.py compute \
  --model custom \
  --model-id "your/hf-model-id" \
  --sae-release "your-sae-release" \
  --sae-id-template "layer_{layer}" \
  --languages eng_Latn,fra_Latn \
  --layers 0,5,10 \
  --out-dir ./v_score_runs/custom_run
```

### 3. Visualize results

Open `visualize_v_scores.ipynb` and point it to any `v_score_runs//` directory. The notebook loads `meta.json` and `v_scores.pt` and renders:

- Top language-specific features per layer
- Feature activation heatmaps across languages
- V-score distributions

## CLI Reference

```
python compute_v_scores.py compute [OPTIONS]

Options:
  --model              Preset: gemma-2b | qwen3-0.6b | custom
  --model-id           Override HuggingFace model ID (for --model custom)
  --sae-release        Override sae_lens release name
  --sae-id-template    Template string with {layer}, e.g.
"layer_{layer}/width_16k/canonical" --languages Comma-separated flores_plus language codes --layers Comma-separated layer indices to analyse --n-texts-per-lang Number of FLORES+ texts per language (default: 100, -1 = all) --split FLORES+ split: dev | devtest (default: dev) --out-dir Output directory for meta.json and v_scores.pt --device cuda | cpu (default: auto-detect) ``` ## Language Codes Languages are specified as FLORES+ codes (`lang_Script`). Examples: | Code | Language | |---|---| | `eng_Latn` | English | | `fra_Latn` | French | | `spa_Latn` | Spanish | | `por_Latn` | Portuguese | | `jpn_Jpan` | Japanese | | `cmn_Hans` | Chinese (Simplified) | | `kor_Hang` | Korean | | `tha_Thai` | Thai | | `vie_Latn` | Vietnamese | | `kas_Arab` | Kashmiri (Arabic script) | | `wuu_Hans` | Wu Chinese | | `azb_Arab` | South Azerbaijani | | `nus_Latn` | Nuer | | `arg_Latn` | Aragonese | | `glg_Latn` | Galician | | `cat_Latn` | Catalan | Full list: [FLORES+ dataset](https://huggingface.co/datasets/openlanguagedata/flores_plus) ## Saved Run Format Each run in `v_score_runs/` contains: - **`meta.json`** — run configuration (model, languages, layers, etc.) - **`v_scores.pt`** — PyTorch file with structure: ```python { "layers": { "0": {"top_index_per_lan": Tensor[K, F], "top_values_per_lan": Tensor[K, F]}, "5": {...}, ... } } ``` where `K` = number of languages and `F` = number of SAE features, sorted by v-score descending. 
Load a saved run programmatically:

```python
from compute_v_scores import load_v_score_run

meta, layers = load_v_score_run("./v_score_runs/run_reprod_fig_1")
top_indices = layers[10]["top_index_per_lan"]  # layer 10, shape [K, F]
```

## Experiments

| Script | Description | Languages | Layers |
|---|---|---|---|
| `run_gemma_reprod.sh` | Reproduce Figure 1 | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_all.sh` | Insight 1: all FLORES texts | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_small.sh` | Insight 1: quick (25 texts) | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_similar_langs.sh` | Insight 2: similar languages | es/pt/gl/ca | 0,2,5,10,15,20 |
| `run_gemma_underrepresented_langs.sh` | Insight 4: low-resource languages | 4 rare | 0,2,5,10,15,20 |
| `run_qwen_reprod.sh` | Same setup on Qwen3-0.6B | 10 diverse | 0,2,5,10,15,20 |

## Related: Part 6 — Steering Vectors

The v-score runs produced by this repo feed into a companion project that extends the analysis into **active language steering** using SAE-gated steering vectors:

**[siemovit/snlp](https://github.com/siemovit/snlp/tree/main)** — *Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders* (Part 6 experiments)

That repo implements three experiments on top of the v-scores:

| Experiment | Entry point | What it does |
|---|---|---|
| Baseline steering | `part_6/baseline_experiment.py` | One-layer toy steering demo |
| Adversarial Language Identification (LID) | `part_6/lid_experiment.py` | Steers the model to generate in a target language; measures first-token LID accuracy |
| Cross-Lingual Continuation (CLC) | `part_6/clc_experiment.py` | Prompts in one language, steers the continuation into another |

### Using v-score runs from this repo in the steering repo

The steering repo can import saved v-score runs directly via its export utility:

```bash
cd /path/to/siemovit-snlp
uv run python -m part_6.export_v_scores \
  --run-dir \
  ../MVA-SNLP/v_score_runs/run_reprod_fig_1 \
  --top-k 5
# → results/csv/v_scores_run_reprod_fig_1_top5.csv
```

### Quick start for the steering experiments

```bash
git clone https://github.com/siemovit/snlp.git
cd snlp
uv sync
uv run python download.py --model-name gemma-2-2b

# Adversarial Language Identification (French → English, Gemma-2B)
uv run python -m part_6.lid_experiment \
  --model-name gemma-2-2b \
  --source-lang fr \
  --target-lang en \
  --base-layer 20 \
  --alpha 0.5

# Cross-Lingual Continuation (French → English, Qwen3)
uv run python -m part_6.clc_experiment \
  --model-name qwen \
  --source-lang fr \
  --target-lang en \
  --base-layer 18 \
  --alpha 10.0
```

> **Note:** The LID experiment is memory-heavy. On a Tesla V100, `--train-n 100` can cause OOM — start with the default `--train-n 20` and scale up carefully.

## Related: Extended Analysis — Ablation, Clustering & Synergy

**[VSmague/NLP](https://github.com/VSmague/NLP)** — Extended experiments by Valentin Smague covering ablation studies, feature clustering, and cross-language synergy analysis built on top of the v-scores from this repo.
That repo covers four additional directions:

| Analysis | Script / Notebook | What it does |
|---|---|---|
| Feature ablation | `ablation.py`, `SNLP_ablation_clean.ipynb` | Ablates top language-specific SAE features and measures the effect on model behavior; produces per-language specificity plots |
| Language clustering | `compute_clusters.py`, `compute_matrix.py` | Clusters languages by their v-score feature overlap using MDS and similarity matrices |
| Cross-language synergy | `cross_language_synergy.py` | Measures how much top features for one language also activate on other languages (feature sharing / synergy) |
| Visualization | `visualisation.py`, `reprod.py` | Reproduces v-score bar charts (Figure 1 style) and generates additional plots |

Key outputs stored in the repo:

- `v_scores.png` — reproduced v-score figure
- `ablation_fr.png`, `ablation_specificity.png` — ablation results for French
- `clustering_best.png`, `clustering_comparison.png`, `clustering_mds.png` — language clustering visualizations
- `plots/`, `plots_interaction/`, `plots_synergy/` — full plot collections
- `sae_features/` — saved SAE feature data
- `figures_section5/` — figures for section 5 of the report
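The clustering and synergy analyses both start from comparing the top-ranked feature sets across languages. A minimal self-contained sketch of one plausible overlap measure (Jaccard similarity over each language's top-`k` feature indices; the function name and the use of NumPy are illustrative, not the repo's actual implementation):

```python
import numpy as np

def topk_overlap(top_index_per_lan: np.ndarray, k: int = 100) -> np.ndarray:
    """top_index_per_lan: [K, F] feature indices sorted by descending v-score.

    Returns a [K, K] Jaccard-similarity matrix between the languages'
    top-k feature sets; high off-diagonal values indicate feature sharing.
    """
    sets = [set(row[:k].tolist()) for row in top_index_per_lan]
    K = len(sets)
    sim = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            sim[i, j] = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
    return sim

# Toy example: languages 0 and 1 share their top-2 features, language 2 shares none.
ranked = np.array([[0, 1, 2],
                   [1, 0, 3],
                   [4, 5, 6]])
sim = topk_overlap(ranked, k=2)  # sim[0, 1] == 1.0, sim[0, 2] == 0.0
```

A matrix like this can be fed directly into MDS or hierarchical clustering (after converting similarity to distance, e.g. `1 - sim`), which is the shape of analysis the clustering scripts above perform.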