# Multilingual V-Score Analysis with SAE Features

```
MVA project. Contributors: Yannis Kolodziej, Tom Mariani, Hai Pham, Valentin Smague
```

This project investigates how **Sparse Autoencoder (SAE) features** inside large language models (LLMs) encode language identity. The core metric is the **v-score** — a per-feature, per-language score that measures how much a SAE feature activates on one language compared to the average across all other languages.

## What is the V-Score?

For a given SAE feature `f` and language `L` (over a set of `K` languages):

```
v(f, L) = mean_activation(f, L) - mean( mean_activation(f, L') for L' ≠ L )
```

Features with a high v-score for language `L` are considered **language-specific** to `L`. By sorting features by their v-score, we obtain a ranked list of the most language-discriminative SAE features per layer.

## Project Structure

```
MVA-SNLP/
├── compute_v_scores.py            # CLI: compute & save v-scores for any model/SAE/language set
├── visualize_v_scores.ipynb       # Visualize saved v-score runs (Figure 1 reproduction + insights)
├── sae_feature_exploration.ipynb  # Hugging Face-based interactive SAE feature exploration
├── extended_visualization.ipynb   # Extended visualizations and additional analyses
├── code_switching_analysis.ipynb  # Code-switching analysis on specific words
│
├── scripts/                       # Ready-to-run bash scripts for each experiment
│   ├── run_gemma_reprod.sh                  # Reproduce Figure 1 (Gemma-2B, 10 languages, 100 texts)
│   ├── run_gemma_diverse_langs_all.sh       # Insight 1: all texts, 10 diverse languages
│   ├── run_gemma_diverse_langs_small.sh     # Insight 1: quick run (25 texts/language)
│   ├── run_gemma_similar_langs.sh           # Insight 2: similar languages (es/pt/gl/ca)
│   ├── run_gemma_underrepresented_langs.sh  # Insight 4: underrepresented languages
│   └── run_qwen_reprod.sh                   # Reproduction with Qwen3-0.6B
│
└── v_score_runs/                  # Saved results (meta.json + v_scores.pt per run)
    ├── run_reprod_fig_1/
    ├── run_insight_1_all/
    ├── run_insight_1_small/
    ├── run_insight_2/
    ├── run_insight_4/
    └── qwen_run_reprod_fig_1/
```

## Supported Models

| Alias | Model | SAE Release |
|---|---|---|
| `gemma-2b` | `google/gemma-2-2b` | `gemma-scope-2b-pt-res-canonical` |
| `qwen3-0.6b` | `Qwen/Qwen3-0.6B` | `mwhanna-qwen3-0.6b-transcoders-lowl0` |

## Quick Start

### 1. Install dependencies

```bash
pip install torch transformers datasets sae-lens matplotlib
```

### 2. Compute v-scores (CLI)

**Reproduce Figure 1** (Gemma-2B, 10 languages):

```bash
bash scripts/run_gemma_reprod.sh
```

**Or run directly with custom settings:**

```bash
python compute_v_scores.py compute \
  --model gemma-2b \
  --languages eng_Latn,fra_Latn,jpn_Jpan,cmn_Hans \
  --layers 0,5,10,15,20 \
  --n-texts-per-lang 100 \
  --out-dir ./v_score_runs/my_run
```

**Use a custom model/SAE not in the presets:**

```bash
python compute_v_scores.py compute \
  --model custom \
  --model-id "your/hf-model-id" \
  --sae-release "your-sae-release" \
  --sae-id-template "layer_{layer}" \
  --languages eng_Latn,fra_Latn \
  --layers 0,5,10 \
  --out-dir ./v_score_runs/custom_run
```

### 3. Visualize results

Open `visualize_v_scores.ipynb` and point it to any `v_score_runs//` directory. The notebook loads `meta.json` and `v_scores.pt` and renders:

- Top language-specific features per layer
- Feature activation heatmaps across languages
- V-score distributions

## CLI Reference

```
python compute_v_scores.py compute [OPTIONS]

Options:
  --model              Preset: gemma-2b | qwen3-0.6b | custom
  --model-id           Override HuggingFace model ID (for --model custom)
  --sae-release        Override sae_lens release name
  --sae-id-template    Template string with {layer}, e.g.
"layer_{layer}/width_16k/canonical" --languages Comma-separated flores_plus language codes --layers Comma-separated layer indices to analyse --n-texts-per-lang Number of FLORES+ texts per language (default: 100, -1 = all) --split FLORES+ split: dev | devtest (default: dev) --out-dir Output directory for meta.json and v_scores.pt --device cuda | cpu (default: auto-detect) ``` ## Language Codes Languages are specified as FLORES+ codes (`lang_Script`). Examples: | Code | Language | |---|---| | `eng_Latn` | English | | `fra_Latn` | French | | `spa_Latn` | Spanish | | `por_Latn` | Portuguese | | `jpn_Jpan` | Japanese | | `cmn_Hans` | Chinese (Simplified) | | `kor_Hang` | Korean | | `tha_Thai` | Thai | | `vie_Latn` | Vietnamese | | `kas_Arab` | Kashmiri (Arabic script) | | `wuu_Hans` | Wu Chinese | | `azb_Arab` | South Azerbaijani | | `nus_Latn` | Nuer | | `arg_Latn` | Aragonese | | `glg_Latn` | Galician | | `cat_Latn` | Catalan | Full list: [FLORES+ dataset](https://huggingface.co/datasets/openlanguagedata/flores_plus) ## Saved Run Format Each run in `v_score_runs/` contains: - **`meta.json`** — run configuration (model, languages, layers, etc.) - **`v_scores.pt`** — PyTorch file with structure: ```python { "layers": { "0": {"top_index_per_lan": Tensor[K, F], "top_values_per_lan": Tensor[K, F]}, "5": {...}, ... } } ``` where `K` = number of languages and `F` = number of SAE features, sorted by v-score descending. 
Load a saved run programmatically:

```python
from compute_v_scores import load_v_score_run

meta, layers = load_v_score_run("./v_score_runs/run_reprod_fig_1")
top_indices = layers[10]["top_index_per_lan"]  # layer 10, shape [K, F]
```

## Experiments

| Script | Description | Languages | Layers |
|---|---|---|---|
| `run_gemma_reprod.sh` | Reproduce Figure 1 | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_all.sh` | Insight 1: all FLORES texts | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_diverse_langs_small.sh` | Insight 1: quick (25 texts) | 10 diverse | 0,2,5,10,15,20 |
| `run_gemma_similar_langs.sh` | Insight 2: similar languages | es/pt/gl/ca | 0,2,5,10,15,20 |
| `run_gemma_underrepresented_langs.sh` | Insight 4: low-resource languages | 4 rare | 0,2,5,10,15,20 |
| `run_qwen_reprod.sh` | Same setup on Qwen3-0.6B | 10 diverse | 0,2,5,10,15,20 |

## Related: Part 6 — Steering Vectors

The v-score runs produced by this repo feed into a companion project that extends the analysis into **active language steering** using SAE-gated steering vectors:

**[siemovit/snlp](https://github.com/siemovit/snlp/tree/main)** — *Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders* (Part 6 experiments)

That repo implements three experiments on top of the v-scores:

| Experiment | Entry point | What it does |
|---|---|---|
| Baseline steering | `part_6/baseline_experiment.py` | One-layer toy steering demo |
| Adversarial Language Identification (LID) | `part_6/lid_experiment.py` | Steers the model to generate in a target language; measures first-token LID accuracy |
| Cross-Lingual Continuation (CLC) | `part_6/clc_experiment.py` | Prompts in one language, steers the continuation into another |

### Using v-score runs from this repo in the steering repo

The steering repo can import saved v-score runs directly via its export utility:

```bash
cd /path/to/siemovit-snlp
uv run python -m part_6.export_v_scores \
  --run-dir \
  ../MVA-SNLP/v_score_runs/run_reprod_fig_1 \
  --top-k 5
# → results/csv/v_scores_run_reprod_fig_1_top5.csv
```

### Quick start for the steering experiments

```bash
git clone https://github.com/siemovit/snlp.git
cd snlp
uv sync
uv run python download.py --model-name gemma-2-2b

# Adversarial Language Identification (French → English, Gemma-2B)
uv run python -m part_6.lid_experiment \
  --model-name gemma-2-2b \
  --source-lang fr \
  --target-lang en \
  --base-layer 20 \
  --alpha 0.5

# Cross-Lingual Continuation (French → English, Qwen3)
uv run python -m part_6.clc_experiment \
  --model-name qwen \
  --source-lang fr \
  --target-lang en \
  --base-layer 18 \
  --alpha 10.0
```

> **Note:** The LID experiment is memory-heavy. On a Tesla V100, `--train-n 100` can cause OOM — start with the default `--train-n 20` and scale up carefully.

## Related: Extended Analysis — Ablation, Clustering & Synergy

**[VSmague/NLP](https://github.com/VSmague/NLP)** — Extended experiments by Valentin Smague covering ablation studies, feature clustering, and cross-language synergy analysis built on top of the v-scores from this repo.
That repo covers four additional directions:

| Analysis | Script / Notebook | What it does |
|---|---|---|
| Feature ablation | `ablation.py`, `SNLP_ablation_clean.ipynb` | Ablates top language-specific SAE features and measures the effect on model behavior; produces per-language specificity plots |
| Language clustering | `compute_clusters.py`, `compute_matrix.py` | Clusters languages by their v-score feature overlap using MDS and similarity matrices |
| Cross-language synergy | `cross_language_synergy.py` | Measures how much top features for one language also activate on other languages (feature sharing / synergy) |
| Visualization | `visualisation.py`, `reprod.py` | Reproduces v-score bar charts (Figure 1 style) and generates additional plots |

Key outputs stored in the repo:

- `v_scores.png` — reproduced v-score figure
- `ablation_fr.png`, `ablation_specificity.png` — ablation results for French
- `clustering_best.png`, `clustering_comparison.png`, `clustering_mds.png` — language clustering visualizations
- `plots/`, `plots_interaction/`, `plots_synergy/` — full plot collections
- `sae_features/` — saved SAE feature data
- `figures_section5/` — figures for section 5 of the report
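The clustering and synergy analyses both start from comparing the top-ranked feature sets across languages. A minimal self-contained sketch of one plausible overlap measure (Jaccard similarity over each language's top-`k` feature indices; the function name and the use of NumPy are illustrative, not the repo's actual implementation):

```python
import numpy as np

def topk_overlap(top_index_per_lan: np.ndarray, k: int = 100) -> np.ndarray:
    """top_index_per_lan: [K, F] feature indices sorted by descending v-score.

    Returns a [K, K] Jaccard-similarity matrix between the languages'
    top-k feature sets; high off-diagonal values indicate feature sharing.
    """
    sets = [set(row[:k].tolist()) for row in top_index_per_lan]
    K = len(sets)
    sim = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            sim[i, j] = len(sets[i] & sets[j]) / len(sets[i] | sets[j])
    return sim

# Toy example: languages 0 and 1 share their top-2 features, language 2 shares none.
ranked = np.array([[0, 1, 2],
                   [1, 0, 3],
                   [4, 5, 6]])
sim = topk_overlap(ranked, k=2)  # sim[0, 1] == 1.0, sim[0, 2] == 0.0
```

A matrix like this can be fed directly into MDS or hierarchical clustering (after converting similarity to distance, e.g. `1 - sim`), which is the shape of analysis the clustering scripts above perform.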