| # Multilingual V-Score Analysis with SAE Features |
|
|
| ``` |
| MVA project. |
| Contributors: |
| Yannis Kolodziej, Tom Mariani, Hai Pham, Valentin Smague |
| |
| ``` |
|
|
This project investigates how **Sparse Autoencoder (SAE) features** inside large language models (LLMs) encode language identity. The core metric is the **v-score**: a per-feature, per-language score that measures how much an SAE feature activates on one language compared with its average activation across all other languages.
|
|
| ## What is the V-Score? |
|
|
| For a given SAE feature `f` and language `L` (over a set of `K` languages): |
|
|
| ``` |
v(f, L) = mean_activation(f, L) - mean( mean_activation(f, L') for L' ≠ L )
| ``` |
|
|
| Features with a high v-score for language `L` are considered **language-specific** to `L`. By sorting features by their v-score, we obtain a ranked list of the most language-discriminative SAE features per layer. |
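
In code, given a matrix of per-language mean activations, the definition above reduces to a few tensor operations. Below is a minimal sketch with random data standing in for real activations (variable names are illustrative, not the repo's API):

```python
import torch

torch.manual_seed(0)
K, F = 4, 8                       # K languages, F SAE features
mean_acts = torch.rand(K, F)      # mean_acts[L, f] = mean_activation(f, L)

# v(f, L) = mean_activation(f, L) - mean over the other K-1 languages
totals = mean_acts.sum(dim=0, keepdim=True)    # [1, F]
others_mean = (totals - mean_acts) / (K - 1)   # [K, F]
v_scores = mean_acts - others_mean             # [K, F]

# rank features per language by descending v-score
top_values, top_indices = v_scores.sort(dim=1, descending=True)
```

A handy sanity check: by construction, the v-scores for each feature sum to zero across languages.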
|
|
| ## Project Structure |
|
|
| ``` |
| MVA-SNLP/ |
| βββ compute_v_scores.py # CLI: compute & save v-scores for any model/SAE/language set |
| βββ visualize_v_scores.ipynb # Visualize saved v-score runs (Figure 1 reproduction + insights) |
| βββ sae_feature_exploration.ipynb # Hugging Face-based interactive SAE feature exploration |
| βββ extended_visualization.ipynb # Extended visualizations and additional analyses |
| βββ code_switching_analysis.ipynb # Code-switching analysis on specific words |
| β |
| βββ scripts/ # Ready-to-run bash scripts for each experiment |
| β βββ run_gemma_reprod.sh # Reproduce Figure 1 (Gemma-2B, 10 languages, 100 texts) |
| β βββ run_gemma_diverse_langs_all.sh # Insight 1: all texts, 10 diverse languages |
| β βββ run_gemma_diverse_langs_small.sh # Insight 1: quick run (25 texts/language) |
| β βββ run_gemma_similar_langs.sh # Insight 2: similar languages (es/pt/gl/ca) |
| β βββ run_gemma_underrepresented_langs.sh # Insight 4: underrepresented languages |
| β βββ run_qwen_reprod.sh # Reproduction with Qwen3-0.6B |
| β |
| βββ v_score_runs/ # Saved results (meta.json + v_scores.pt per run) |
| βββ run_reprod_fig_1/ |
| βββ run_insight_1_all/ |
| βββ run_insight_1_small/ |
| βββ run_insight_2/ |
| βββ run_insight_4/ |
| βββ qwen_run_reprod_fig_1/ |
| ``` |
|
|
| ## Supported Models |
|
|
| | Alias | Model | SAE Release | |
| |---|---|---| |
| | `gemma-2b` | `google/gemma-2-2b` | `gemma-scope-2b-pt-res-canonical` | |
| | `qwen3-0.6b` | `Qwen/Qwen3-0.6B` | `mwhanna-qwen3-0.6b-transcoders-lowl0` | |
|
|
| ## Quick Start |
|
|
| ### 1. Install dependencies |
|
|
| ```bash |
| pip install torch transformers datasets sae-lens matplotlib |
| ``` |
|
|
| ### 2. Compute v-scores (CLI) |
|
|
| **Reproduce Figure 1** (Gemma-2B, 10 languages): |
| ```bash |
| bash scripts/run_gemma_reprod.sh |
| ``` |
|
|
| **Or run directly with custom settings:** |
| ```bash |
| python compute_v_scores.py compute \ |
| --model gemma-2b \ |
| --languages eng_Latn,fra_Latn,jpn_Jpan,cmn_Hans \ |
| --layers 0,5,10,15,20 \ |
| --n-texts-per-lang 100 \ |
| --out-dir ./v_score_runs/my_run |
| ``` |
|
|
| **Use a custom model/SAE not in the presets:** |
| ```bash |
| python compute_v_scores.py compute \ |
| --model custom \ |
| --model-id "your/hf-model-id" \ |
| --sae-release "your-sae-release" \ |
| --sae-id-template "layer_{layer}" \ |
| --languages eng_Latn,fra_Latn \ |
| --layers 0,5,10 \ |
| --out-dir ./v_score_runs/custom_run |
| ``` |
|
|
| ### 3. Visualize results |
|
|
| Open `visualize_v_scores.ipynb` and point it to any `v_score_runs/<run_name>/` directory. The notebook loads `meta.json` and `v_scores.pt` and renders: |
| - Top language-specific features per layer |
| - Feature activation heatmaps across languages |
| - V-score distributions |
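
If you prefer a plain script to the notebook, the saved artifacts are ordinary `json`/`torch` files. The sketch below writes a tiny fake run to a temporary directory so it is fully self-contained, then reads it back the way the notebook does (the real files follow the format described under "Saved Run Format"):

```python
import json
import os
import tempfile

import torch

# build a tiny fake run: K=2 languages, F=4 features, one layer
run_dir = tempfile.mkdtemp()
with open(os.path.join(run_dir, "meta.json"), "w") as fh:
    json.dump({"languages": ["eng_Latn", "fra_Latn"], "layers": [0]}, fh)
vals, idx = torch.rand(2, 4).sort(dim=1, descending=True)
torch.save(
    {"layers": {"0": {"top_index_per_lan": idx, "top_values_per_lan": vals}}},
    os.path.join(run_dir, "v_scores.pt"),
)

# read it back
with open(os.path.join(run_dir, "meta.json")) as fh:
    meta = json.load(fh)
data = torch.load(os.path.join(run_dir, "v_scores.pt"))
best_per_lang = data["layers"]["0"]["top_values_per_lan"][:, 0]  # top v-score per language
```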
|
|
| ## CLI Reference |
|
|
| ``` |
| python compute_v_scores.py compute [OPTIONS] |
| |
| Options: |
| --model Preset: gemma-2b | qwen3-0.6b | custom |
| --model-id Override HuggingFace model ID (for --model custom) |
| --sae-release Override sae_lens release name |
| --sae-id-template Template string with {layer}, e.g. "layer_{layer}/width_16k/canonical" |
| --languages Comma-separated flores_plus language codes |
| --layers Comma-separated layer indices to analyse |
| --n-texts-per-lang Number of FLORES+ texts per language (default: 100, -1 = all) |
| --split FLORES+ split: dev | devtest (default: dev) |
| --out-dir Output directory for meta.json and v_scores.pt |
| --device cuda | cpu (default: auto-detect) |
| ``` |
|
|
| ## Language Codes |
|
|
| Languages are specified as FLORES+ codes (`lang_Script`). Examples: |
|
|
| | Code | Language | |
| |---|---| |
| | `eng_Latn` | English | |
| | `fra_Latn` | French | |
| | `spa_Latn` | Spanish | |
| | `por_Latn` | Portuguese | |
| | `jpn_Jpan` | Japanese | |
| | `cmn_Hans` | Chinese (Simplified) | |
| | `kor_Hang` | Korean | |
| | `tha_Thai` | Thai | |
| | `vie_Latn` | Vietnamese | |
| | `kas_Arab` | Kashmiri (Arabic script) | |
| | `wuu_Hans` | Wu Chinese | |
| | `azb_Arab` | South Azerbaijani | |
| | `nus_Latn` | Nuer | |
| | `arg_Latn` | Aragonese | |
| | `glg_Latn` | Galician | |
| | `cat_Latn` | Catalan | |
|
|
| Full list: [FLORES+ dataset](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
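
When passing `--languages`, it can be worth validating the `lang_Script` shape up front so a typo does not surface only at runtime. A tiny illustrative check (it validates the naming pattern only, not membership in FLORES+):

```python
import re

FLORES_CODE = re.compile(r"^[a-z]{3}_[A-Z][a-z]{3}$")  # e.g. eng_Latn, cmn_Hans

def looks_like_flores_code(code: str) -> bool:
    """True if the string matches the lang_Script naming convention."""
    return bool(FLORES_CODE.match(code))

print(looks_like_flores_code("eng_Latn"))  # True
print(looks_like_flores_code("english"))   # False
```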
|
|
| ## Saved Run Format |
|
|
| Each run in `v_score_runs/` contains: |
|
|
- **`meta.json`**: run configuration (model, languages, layers, etc.)
- **`v_scores.pt`**: PyTorch file with structure:
| ```python |
| { |
| "layers": { |
| "0": {"top_index_per_lan": Tensor[K, F], "top_values_per_lan": Tensor[K, F]}, |
| "5": {...}, |
| ... |
| } |
| } |
| ``` |
where `K` is the number of languages and `F` the number of SAE features; each language's row is sorted by v-score in descending order.
| |
| Load a saved run programmatically: |
| ```python |
| from compute_v_scores import load_v_score_run |
| meta, layers = load_v_score_run("./v_score_runs/run_reprod_fig_1") |
| top_indices = layers[10]["top_index_per_lan"] # layer 10, shape [K, F] |
| ``` |
| |
| ## Experiments |
| |
| | Script | Description | Languages | Layers | |
| |---|---|---|---| |
| | `run_gemma_reprod.sh` | Reproduce Figure 1 | 10 diverse | 0,2,5,10,15,20 | |
| | `run_gemma_diverse_langs_all.sh` | Insight 1: all FLORES texts | 10 diverse | 0,2,5,10,15,20 | |
| | `run_gemma_diverse_langs_small.sh` | Insight 1: quick (25 texts) | 10 diverse | 0,2,5,10,15,20 | |
| | `run_gemma_similar_langs.sh` | Insight 2: similar languages | es/pt/gl/ca | 0,2,5,10,15,20 | |
| | `run_gemma_underrepresented_langs.sh` | Insight 4: low-resource languages | 4 rare | 0,2,5,10,15,20 | |
| | `run_qwen_reprod.sh` | Same setup on Qwen3-0.6B | 10 diverse | 0,2,5,10,15,20 | |
| |
## Related: Part 6 (Steering Vectors)
| |
| The v-score runs produced by this repo feed into a companion project that extends the analysis into **active language steering** using SAE-gated steering vectors: |
| |
**[siemovit/snlp](https://github.com/siemovit/snlp/tree/main)**: *Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders* (Part 6 experiments)
| |
| That repo implements three experiments on top of the v-scores: |
| |
| | Experiment | Entry point | What it does | |
| |---|---|---| |
| | Baseline steering | `part_6/baseline_experiment.py` | One-layer toy steering demo | |
| | Adversarial Language Identification (LID) | `part_6/lid_experiment.py` | Steers model to generate in a target language; measures first-token LID accuracy | |
| | Cross-Lingual Continuation (CLC) | `part_6/clc_experiment.py` | Prompts in one language, steers continuation into another | |
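
Conceptually, each of these experiments hinges on the same core step: adding a scaled direction to the residual stream at one layer during generation. A framework-agnostic toy sketch of that step (illustrative only; the companion repo's actual hook code may differ):

```python
import torch

def steer(resid: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add a unit-norm steering direction, scaled by alpha, at every position."""
    unit = direction / direction.norm()
    return resid + alpha * unit

# toy residual stream: [batch, seq_len, d_model]
resid = torch.zeros(1, 4, 8)
direction = torch.randn(8)
steered = steer(resid, direction, alpha=0.5)
```

Here `alpha` plays the same role as the `--alpha` flag in the commands below: it trades off steering strength against fluency of the generated text.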
| |
| ### Using v-score runs from this repo in the steering repo |
| |
| The steering repo can import saved v-score runs directly via its export utility: |
| |
| ```bash |
| cd /path/to/siemovit-snlp |
| uv run python -m part_6.export_v_scores \ |
| --run-dir ../MVA-SNLP/v_score_runs/run_reprod_fig_1 \ |
| --top-k 5 |
# → results/csv/v_scores_run_reprod_fig_1_top5.csv
| ``` |
| |
| ### Quick start for the steering experiments |
| |
| ```bash |
| git clone https://github.com/siemovit/snlp.git |
| cd snlp |
| uv sync |
| uv run python download.py --model-name gemma-2-2b |
| |
# Adversarial Language Identification (French → English, Gemma-2B)
| uv run python -m part_6.lid_experiment \ |
| --model-name gemma-2-2b \ |
| --source-lang fr \ |
| --target-lang en \ |
| --base-layer 20 \ |
| --alpha 0.5 |
| |
# Cross-Lingual Continuation (French → English, Qwen3)
| uv run python -m part_6.clc_experiment \ |
| --model-name qwen \ |
| --source-lang fr \ |
| --target-lang en \ |
| --base-layer 18 \ |
| --alpha 10.0 |
| ``` |
| |
> **Note:** The LID experiment is memory-heavy. On a Tesla V100, `--train-n 100` can cause OOM; start with the default `--train-n 20` and scale up carefully.
| |
## Related: Extended Analysis (Ablation, Clustering & Synergy)
| |
**[VSmague/NLP](https://github.com/VSmague/NLP)**: Extended experiments by Valentin Smague covering ablation studies, feature clustering, and cross-language synergy analysis, built on top of the v-scores from this repo.
| |
| That repo covers four additional directions: |
| |
| | Analysis | Script / Notebook | What it does | |
| |---|---|---| |
| | Feature ablation | `ablation.py`, `SNLP_ablation_clean.ipynb` | Ablates top language-specific SAE features and measures the effect on model behavior; produces per-language specificity plots | |
| | Language clustering | `compute_clusters.py`, `compute_matrix.py` | Clusters languages by their v-score feature overlap using MDS and similarity matrices | |
| | Cross-language synergy | `cross_language_synergy.py` | Measures how much top features for one language also activate on other languages (feature sharing / synergy) | |
| | Visualization | `visualisation.py`, `reprod.py` | Reproduces v-score bar charts (Figure 1 style) and generates additional plots | |
| |
| Key outputs stored in the repo: |
| |
- `v_scores.png`: reproduced v-score figure
- `ablation_fr.png`, `ablation_specificity.png`: ablation results for French
- `clustering_best.png`, `clustering_comparison.png`, `clustering_mds.png`: language clustering visualizations
- `plots/`, `plots_interaction/`, `plots_synergy/`: full plot collections
- `sae_features/`: saved SAE feature data
- `figures_section5/`: figures for section 5 of the report
| |