# Multilingual V-Score Analysis with SAE Features
MVA project.

**Contributors:** Yannis Kolodziej, Tom Mariani, Hai Pham, Valentin Smague
This project investigates how **Sparse Autoencoder (SAE) features** inside large language models (LLMs) encode language identity. The core metric is the **v-score** β€” a per-feature, per-language score that measures how much a SAE feature activates on one language compared to the average across all other languages.
## What is the V-Score?
For a given SAE feature `f` and language `L` (over a set of `K` languages):
```
v(f, L) = mean_activation(f, L) - mean( mean_activation(f, L') for L' β‰  L )
```
Features with a high v-score for language `L` are considered **language-specific** to `L`. By sorting features by their v-score, we obtain a ranked list of the most language-discriminative SAE features per layer.
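As a concrete illustration, the v-score can be computed directly from a table of per-language mean activations. The names and numbers below are illustrative toy values, not the repo's actual API:

```python
# Toy illustration of the v-score definition above.
# mean_acts[L][f] = mean activation of SAE feature f on texts in language L.
mean_acts = {
    "eng_Latn": [0.9, 0.1, 0.2],
    "fra_Latn": [0.1, 0.8, 0.2],
    "jpn_Jpan": [0.1, 0.1, 0.7],
}

def v_score(feature, lang, mean_acts):
    """v(f, L) = mean activation on L minus the mean over all other languages."""
    others = [mean_acts[l][feature] for l in mean_acts if l != lang]
    return mean_acts[lang][feature] - sum(others) / len(others)

# Feature 0 fires mostly on English, so its English v-score is high.
print(round(v_score(0, "eng_Latn", mean_acts), 2))  # 0.8
```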
## Project Structure
```
MVA-SNLP/
β”œβ”€β”€ compute_v_scores.py # CLI: compute & save v-scores for any model/SAE/language set
β”œβ”€β”€ visualize_v_scores.ipynb # Visualize saved v-score runs (Figure 1 reproduction + insights)
β”œβ”€β”€ sae_feature_exploration.ipynb # Hugging Face-based interactive SAE feature exploration
β”œβ”€β”€ extended_visualization.ipynb # Extended visualizations and additional analyses
β”œβ”€β”€ code_switching_analysis.ipynb # Code-switching analysis on specific words
β”‚
β”œβ”€β”€ scripts/ # Ready-to-run bash scripts for each experiment
β”‚ β”œβ”€β”€ run_gemma_reprod.sh # Reproduce Figure 1 (Gemma-2B, 10 languages, 100 texts)
β”‚ β”œβ”€β”€ run_gemma_diverse_langs_all.sh # Insight 1: all texts, 10 diverse languages
β”‚ β”œβ”€β”€ run_gemma_diverse_langs_small.sh # Insight 1: quick run (25 texts/language)
β”‚ β”œβ”€β”€ run_gemma_similar_langs.sh # Insight 2: similar languages (es/pt/gl/ca)
β”‚ β”œβ”€β”€ run_gemma_underrepresented_langs.sh # Insight 4: underrepresented languages
β”‚ └── run_qwen_reprod.sh # Reproduction with Qwen3-0.6B
β”‚
└── v_score_runs/ # Saved results (meta.json + v_scores.pt per run)
β”œβ”€β”€ run_reprod_fig_1/
β”œβ”€β”€ run_insight_1_all/
β”œβ”€β”€ run_insight_1_small/
β”œβ”€β”€ run_insight_2/
β”œβ”€β”€ run_insight_4/
└── qwen_run_reprod_fig_1/
```
## Supported Models
| Alias | Model | SAE Release |
|---|---|---|
| `gemma-2b` | `google/gemma-2-2b` | `gemma-scope-2b-pt-res-canonical` |
| `qwen3-0.6b` | `Qwen/Qwen3-0.6B` | `mwhanna-qwen3-0.6b-transcoders-lowl0` |
## Quick Start
### 1. Install dependencies
```bash
pip install torch transformers datasets sae-lens matplotlib
```
### 2. Compute v-scores (CLI)
**Reproduce Figure 1** (Gemma-2B, 10 languages):
```bash
bash scripts/run_gemma_reprod.sh
```
**Or run directly with custom settings:**
```bash
python compute_v_scores.py compute \
--model gemma-2b \
--languages eng_Latn,fra_Latn,jpn_Jpan,cmn_Hans \
--layers 0,5,10,15,20 \
--n-texts-per-lang 100 \
--out-dir ./v_score_runs/my_run
```
**Use a custom model/SAE not in the presets:**
```bash
python compute_v_scores.py compute \
--model custom \
--model-id "your/hf-model-id" \
--sae-release "your-sae-release" \
--sae-id-template "layer_{layer}" \
--languages eng_Latn,fra_Latn \
--layers 0,5,10 \
--out-dir ./v_score_runs/custom_run
```
### 3. Visualize results
Open `visualize_v_scores.ipynb` and point it to any `v_score_runs/<run_name>/` directory. The notebook loads `meta.json` and `v_scores.pt` and renders:
- Top language-specific features per layer
- Feature activation heatmaps across languages
- V-score distributions
## CLI Reference
```
python compute_v_scores.py compute [OPTIONS]
Options:
--model Preset: gemma-2b | qwen3-0.6b | custom
--model-id Override HuggingFace model ID (for --model custom)
--sae-release Override sae_lens release name
--sae-id-template Template string with {layer}, e.g. "layer_{layer}/width_16k/canonical"
  --languages          Comma-separated FLORES+ language codes (lang_Script)
--layers Comma-separated layer indices to analyse
--n-texts-per-lang Number of FLORES+ texts per language (default: 100, -1 = all)
--split FLORES+ split: dev | devtest (default: dev)
--out-dir Output directory for meta.json and v_scores.pt
--device cuda | cpu (default: auto-detect)
```
## Language Codes
Languages are specified as FLORES+ codes (`lang_Script`). Examples:
| Code | Language |
|---|---|
| `eng_Latn` | English |
| `fra_Latn` | French |
| `spa_Latn` | Spanish |
| `por_Latn` | Portuguese |
| `jpn_Jpan` | Japanese |
| `cmn_Hans` | Chinese (Simplified) |
| `kor_Hang` | Korean |
| `tha_Thai` | Thai |
| `vie_Latn` | Vietnamese |
| `kas_Arab` | Kashmiri (Arabic script) |
| `wuu_Hans` | Wu Chinese |
| `azb_Arab` | South Azerbaijani |
| `nus_Latn` | Nuer |
| `arg_Latn` | Aragonese |
| `glg_Latn` | Galician |
| `cat_Latn` | Catalan |
Full list: [FLORES+ dataset](https://huggingface.co/datasets/openlanguagedata/flores_plus)
## Saved Run Format
Each run in `v_score_runs/` contains:
- **`meta.json`** β€” run configuration (model, languages, layers, etc.)
- **`v_scores.pt`** β€” PyTorch file with structure:
```python
{
"layers": {
"0": {"top_index_per_lan": Tensor[K, F], "top_values_per_lan": Tensor[K, F]},
"5": {...},
...
}
}
```
where `K` is the number of languages and `F` the number of SAE features; each row is sorted by v-score in descending order.
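For intuition, each row of these tensors is simply a descending argsort by v-score. A plain-Python stand-in for one language's row, using hypothetical v-score values:

```python
# One language's row: feature indices and values sorted by v-score, descending
# (a plain-Python stand-in for top_index_per_lan / top_values_per_lan).
v = [0.1, 0.8, -0.2, 0.5]  # hypothetical v-scores for F = 4 features
top_index = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
top_values = [v[i] for i in top_index]
print(top_index)   # [1, 3, 0, 2]
print(top_values)  # [0.8, 0.5, 0.1, -0.2]
```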
Load a saved run programmatically:
```python
from compute_v_scores import load_v_score_run
meta, layers = load_v_score_run("./v_score_runs/run_reprod_fig_1")
top_indices = layers[10]["top_index_per_lan"] # layer 10, shape [K, F]
```
## Experiments
| Script | Description | Languages | Layers |
|---|---|---|---|
| `run_gemma_reprod.sh` | Reproduce Figure 1 | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_all.sh` | Insight 1: all FLORES texts | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_small.sh` | Insight 1: quick (25 texts) | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_similar_langs.sh` | Insight 2: similar languages | es/pt/gl/ca | 0,2,5,10,15,20 |
| `run_gemma_underrepresented_langs.sh` | Insight 4: low-resource languages | 4 rare | 0,2,5,10,15,20 |
| `run_qwen_reprod.sh` | Same setup on Qwen3-0.6B | 10 diverse | 0,2,5,10,15,20 |
## Related: Part 6 β€” Steering Vectors
The v-score runs produced by this repo feed into a companion project that extends the analysis into **active language steering** using SAE-gated steering vectors:
**[siemovit/snlp](https://github.com/siemovit/snlp/tree/main)** β€” *Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders* (Part 6 experiments)
That repo implements three experiments on top of the v-scores:
| Experiment | Entry point | What it does |
|---|---|---|
| Baseline steering | `part_6/baseline_experiment.py` | One-layer toy steering demo |
| Adversarial Language Identification (LID) | `part_6/lid_experiment.py` | Steers model to generate in a target language; measures first-token LID accuracy |
| Cross-Lingual Continuation (CLC) | `part_6/clc_experiment.py` | Prompts in one language, steers continuation into another |
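Conceptually, each of these experiments adds a scaled steering direction to the hidden state at `--base-layer`. A minimal sketch of that idea with toy vectors follows; the actual SAE-gated hook in the companion repo is more involved:

```python
# Conceptual steering sketch: h' = h + alpha * v at the chosen layer.
# h and v are toy stand-ins for a residual-stream vector and a steering direction.
def steer(h, v, alpha):
    return [h_i + alpha * v_i for h_i, v_i in zip(h, v)]

h = [0.25, -0.1, 0.4]    # hidden state (toy dimensions)
v = [1.0, 0.0, 0.0]      # direction assumed to boost the target language
print(steer(h, v, 0.5))  # [0.75, -0.1, 0.4]
```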
### Using v-score runs from this repo in the steering repo
The steering repo can import saved v-score runs directly via its export utility:
```bash
cd /path/to/siemovit-snlp
uv run python -m part_6.export_v_scores \
--run-dir ../MVA-SNLP/v_score_runs/run_reprod_fig_1 \
--top-k 5
# β†’ results/csv/v_scores_run_reprod_fig_1_top5.csv
```
### Quick start for the steering experiments
```bash
git clone https://github.com/siemovit/snlp.git
cd snlp
uv sync
uv run python download.py --model-name gemma-2-2b
# Adversarial Language Identification (French β†’ English, Gemma-2B)
uv run python -m part_6.lid_experiment \
--model-name gemma-2-2b \
--source-lang fr \
--target-lang en \
--base-layer 20 \
--alpha 0.5
# Cross-Lingual Continuation (French β†’ English, Qwen3)
uv run python -m part_6.clc_experiment \
--model-name qwen \
--source-lang fr \
--target-lang en \
--base-layer 18 \
--alpha 10.0
```
> **Note:** The LID experiment is memory-heavy. On a Tesla V100, `--train-n 100` can cause OOM β€” start with the default `--train-n 20` and scale up carefully.
## Related: Extended Analysis β€” Ablation, Clustering & Synergy
**[VSmague/NLP](https://github.com/VSmague/NLP)** β€” Extended experiments by Valentin Smague covering ablation studies, feature clustering, and cross-language synergy analysis built on top of the v-scores from this repo.
That repo covers four additional directions:
| Analysis | Script / Notebook | What it does |
|---|---|---|
| Feature ablation | `ablation.py`, `SNLP_ablation_clean.ipynb` | Ablates top language-specific SAE features and measures the effect on model behavior; produces per-language specificity plots |
| Language clustering | `compute_clusters.py`, `compute_matrix.py` | Clusters languages by their v-score feature overlap using MDS and similarity matrices |
| Cross-language synergy | `cross_language_synergy.py` | Measures how much top features for one language also activate on other languages (feature sharing / synergy) |
| Visualization | `visualisation.py`, `reprod.py` | Reproduces v-score bar charts (Figure 1 style) and generates additional plots |
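One simple proxy for cross-language synergy is the fraction of top-k features two languages share. The sketch below uses hypothetical feature indices, and the repo's exact synergy metric may differ:

```python
# Illustrative synergy proxy: fraction of the top-k features shared between
# two languages' ranked feature lists (the repo's exact metric may differ).
def feature_overlap(top_a, top_b, k=3):
    shared = set(top_a[:k]) & set(top_b[:k])
    return len(shared) / k

# Two hypothetical languages sharing 2 of their top-3 features: overlap = 2/3.
overlap = feature_overlap([5, 9, 2, 7], [9, 2, 11, 5])
```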
Key outputs stored in the repo:
- `v_scores.png` β€” reproduced v-score figure
- `ablation_fr.png`, `ablation_specificity.png` β€” ablation results for French
- `clustering_best.png`, `clustering_comparison.png`, `clustering_mds.png` β€” language clustering visualizations
- `plots/`, `plots_interaction/`, `plots_synergy/` β€” full plot collections
- `sae_features/` β€” saved SAE feature data
- `figures_section5/` β€” figures for section 5 of the report