Papers
arxiv:2605.01229

Attention Sinks in Massively Multilingual Neural Machine Translation: Discovery, Analysis, and Mitigation

Published on May 2

Abstract

Non-content tokens in NMT cross-attention capture most of the attention mass, distorting similarity measurements and masking linguistic patterns; filtering them out and renormalizing recovers the true content-level alignments and the linguistic signals they hide.

AI-generated summary

Cross-attention patterns in neural machine translation (NMT) are widely used to study how multilingual models align linguistic structure. We report a systematic artifact in cross-attention analysis of NLLB-200 (600M): non-content tokens (primarily end-of-sequence tokens, language tags, and punctuation) capture 83 to 91 percent of total cross-attention mass. We term these "attention sinks," extending findings from LLMs [Xiao et al., 2023] to NMT cross-attention and identifying a causal mechanism rooted in vocabulary design rather than position bias. This artifact causes raw metrics to underestimate content-level similarity by nearly half (36.7 percent raw vs. 70.7 percent filtered), rendering uncorrected analyses unreliable. To address this, we validate a content-only filtering methodology that removes non-content tokens and renormalizes the attention distribution. Applying it to 1,000 parallel sentences across African languages (Swahili, Kikuyu, Somali, Luo) and non-African benchmarks (German, Turkish, Chinese, Hindi), we confirm the artifact is universal and recover masked linguistic signals: a 16.9 percentage-point gap between teacher-forcing and generation modes, clear language-family clustering in attention entropy, and a hidden Somali paradox linking SOV word order to monotonic alignment. We release our filtering toolkit and corrected datasets to support reproducible interpretability research on multilingual NMT.
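
The filtering methodology lends itself to a short sketch. The following hypothetical Python/NumPy snippet illustrates content-only filtering as the abstract describes it: mask attention columns belonging to non-content source tokens (end-of-sequence, language tags, punctuation), renormalize each target row, and compute per-row attention entropy on the result. The token heuristics here (content_mask, the NLLB-style '</s>' and eight-character language-tag checks) are illustrative assumptions, not the authors' released toolkit.

import numpy as np

PUNCT = set(".,;:!?\"'()[]-")

def content_mask(src_tokens):
    # True for content tokens, False for sinks. Assumes NLLB-style
    # specials: '</s>' for end-of-sequence and eight-character language
    # tags such as 'swh_Latn'. Real tokenizers may need a richer check.
    def is_content(tok):
        if tok == "</s>":
            return False
        if len(tok) == 8 and tok[3] == "_":  # crude language-tag heuristic
            return False
        return not all(ch in PUNCT for ch in tok)
    return np.array([is_content(t) for t in src_tokens])

def filter_cross_attention(attn, src_tokens, eps=1e-12):
    # attn: (tgt_len, src_len) cross-attention with rows summing to 1.
    # Returns the renormalized content-only distribution plus the
    # per-row mass that sat on sink tokens (the size of the artifact).
    mask = content_mask(src_tokens)
    sink_mass = attn[:, ~mask].sum(axis=1)
    filtered = attn[:, mask]
    filtered = filtered / (filtered.sum(axis=1, keepdims=True) + eps)
    return filtered, sink_mass

def attention_entropy(filtered):
    # Shannon entropy (nats) of each target row after filtering.
    p = np.clip(filtered, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# Toy usage: one target row putting most mass on the tag and '</s>'.
src = ["eng_Latn", "The", "cat", "sleeps", ".", "</s>"]
attn = np.array([[0.40, 0.05, 0.05, 0.05, 0.05, 0.40]])
filtered, sink_mass = filter_cross_attention(attn, src)
print(sink_mass)                    # ~0.85 on sinks, echoing the 83-91% range
print(filtered)                     # content-only row renormalized to sum to 1
print(attention_entropy(filtered))  # ~1.10 nats (uniform over 3 content tokens)

Renormalizing after masking keeps each row a proper probability distribution, so similarity and entropy comparisons across languages stay on the same scale.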

Get this paper in your agent:

hf papers read 2605.01229
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

