MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector

Paper: MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector (ICLR 2026)

Overview

Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. MILCO addresses this by mapping queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO supports both monolingual and cross-lingual retrieval and has been evaluated on 39+ languages, with the potential to generalize to additional languages supported by the underlying multilingual encoder.

MILCO is trained with a two-stage regime that combines Sparse Alignment Pretraining with contrastive learning, designed to provide representation transparency and effectiveness while mitigating semantic collapse.

We also introduce the LexEcho head, which addresses the observation that uncommon entities and rare terms (e.g., proper nouns, code-switched terms) are often lost when projected into English. The LexEcho head augments the English lexical representation with a source-language view, preserving the original multilingual token weights alongside the projected English terms.

Usage

Loading the model

from transformers import AutoModel
model = AutoModel.from_pretrained("omai-research/milco", trust_remote_code=True)

Encoding text (sparse tensor)

# Returns a sparse COO tensor of shape [N, vocab_size]
sparse_reps = model.encode_text(["Baltimore: The Greatest City in America", "巴尔的摩:美国最伟大的城市", "Baltimore : La plus grande ville d'Amérique"])

Encoding text (token-weight dictionary)

# Returns a list of {token: weight} dicts, sorted by weight descending
results = model.encode_text(["Baltimore : La plus grande ville d'Amérique", "巴尔的摩:美国最伟大的城市"], return_dict=True)
print(results[0])
# {'e_baltimore': 1.8021222352981567, 'e_maryland': 1.2527629137039185, 'e_city': 1.202409029006958, 'e_largest': 0.9440274834632874, 'e_biggest': 0.9287132620811462, 'e_america': 0.890972912311554, 'e_usa': 0.8649974465370178, 'e_urban': 0.7201237678527832, 'e_geography': 0.6369074583053589, 'e_cities': 0.5816917419433594, 'e_us': 0.42353010177612305, 'e_is': 0.35060185194015503 ..}
print(results[1])
# {'e_baltimore': 1.6522579193115234, 'e_city': 1.3217558860778809, 'e_greatest': 1.1192315816879272, 'e_usa': 0.9287132620811462, 'e_maryland': 0.8281423449516296, 'e_best': 0.817850649356842, 'e_biggest': 0.8074519634246826, 'e_us': 0.6793810129165649, 'e_great': 0.630689799785614, 'e_america': 0.5255602598190308 ...}

LexEcho: Dual-view encoding (pivot + source)

When source_view=True, the LexEcho head augments the pivot representation with source-language token weights. This preserves entities and terms that may not have English equivalents:

results = model.encode_text(["巴尔的摩:美国最伟大的城市"], return_dict=True, source_view=True)
# Shape: [N, en_vocab_size + m_vocab_size]
# Columns [0, en_vocab_size) are pivot (English) terms
# Columns [en_vocab_size, en_vocab_size + m_vocab_size) are source (multilingual) terms
print(results[0])
# {'m_</s>': 2.09375, 'e_baltimore': 1.6522579193115234, 'e_city': 1.3217558860778809, 'e_greatest': 1.12432062625885, 'e_usa': 0.9225212931632996, 'e_maryland': 0.8247235417366028, 'e_best': 0.8143964409828186, 'e_biggest': 0.8039615750312805, 'm_伟大的': 0.77734375, 'm_:': 0.6875, 'm_美国': 0.68359375, 'e_us': 0.6734226942062378, 'e_great': 0.6265231370925903}

Query and document encoding

encode_query and encode_document are aliases for encode_text:

import torch

q_reps = model.encode_query(queries)
d_reps = model.encode_document(documents)
scores = torch.sparse.mm(q_reps, d_reps.t())
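Because the representations are lexical, relevance scores can also be computed directly from the token-weight dictionaries returned by return_dict=True, which makes each match transparent. A minimal sketch (the helper name sparse_dot is ours, not part of the model API):

```python
def sparse_dot(q_terms, d_terms):
    """Dot product of two {token: weight} dicts over their shared terms.

    Each overlapping term contributes q_weight * d_weight, so the
    per-term products show exactly which tokens drive the score.
    """
    # Iterate over the smaller dict for efficiency.
    if len(q_terms) > len(d_terms):
        q_terms, d_terms = d_terms, q_terms
    return sum(w * d_terms[t] for t, w in q_terms.items() if t in d_terms)

score = sparse_dot(
    {"e_baltimore": 1.80, "e_city": 1.20, "e_usa": 0.86},
    {"e_baltimore": 1.65, "e_city": 1.32, "e_maryland": 0.83},
)
```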

Getting the vocabulary mapping

id2term = model.get_vocab()
# {0: "e_[PAD]", 1: "e_[UNK]", ..., 30522: "m_[PAD]", ...}
# "e_" prefix = English (pivot) vocabulary
# "m_" prefix = multilingual (source) vocabulary

Results

MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines including BGE-M3 and Qwen3-Embed on standard multilingual benchmarks.

With mass-based pruning that reduces document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly sized Qwen3-Embed 0.6B (1024 dimensions) while achieving 3x lower retrieval latency and a 10x smaller index.
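The idea behind mass-based pruning is to keep only the highest-weight terms whose cumulative weight covers a target fraction of the representation's total mass. A sketch of this principle (the paper's exact procedure may differ in detail):

```python
def mass_prune(term_weights, mass=0.9):
    """Keep the top-weighted terms until they cover `mass` of the total weight.

    term_weights: {token: weight} dict, as returned with return_dict=True.
    Returns a smaller dict covering at least `mass` of the summed weight.
    """
    items = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(w for _, w in items)
    kept, acc = {}, 0.0
    for term, w in items:
        if acc >= mass * total:
            break
        kept[term] = w
        acc += w
    return kept

# Example: with mass=0.8, only the two heaviest terms are retained.
pruned = mass_prune({"e_city": 5.0, "e_usa": 3.0, "e_is": 1.0, "e_the": 1.0}, mass=0.8)
# → {"e_city": 5.0, "e_usa": 3.0}
```

Pruning document representations this way trades a small amount of weight mass for far fewer posting-list entries, which is what drives the smaller index and lower latency.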

Architecture

MILCO consists of:

  • Multilingual encoder: A pretrained multilingual transformer that produces contextual embeddings for input text in any language.
  • Multilingual connector: A linear layer that maps multilingual hidden states into the English encoder's hidden dimension.
  • LexEcho head:
    • LM head: Borrowed from a pretrained English masked language model, produces logits over the English vocabulary. Used to generate the pivot (English) sparse representation.
    • Echo token: A special [ECHO] token added to the MLM head to score each input token, producing the source-language view that preserves uncommon entities that may be lost during projection into the English vocabulary.
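The components above compose as follows for the pivot (English) view: multilingual hidden states pass through the connector, then through the English LM head, followed by SPLADE-style sparsification. A minimal sketch with hypothetical module names and dimensions (not the released implementation):

```python
import torch
import torch.nn as nn

class MilcoPivotSketch(nn.Module):
    """Illustrative sketch of MILCO's pivot path; names and dims are assumptions."""

    def __init__(self, m_hidden=1024, e_hidden=768, e_vocab=30522):
        super().__init__()
        # Multilingual connector: linear map from the multilingual encoder's
        # hidden size into the English encoder's hidden dimension.
        self.connector = nn.Linear(m_hidden, e_hidden)
        # LM head over the English vocabulary (in MILCO, borrowed from a
        # pretrained English masked language model).
        self.lm_head = nn.Linear(e_hidden, e_vocab)

    def forward(self, m_hidden_states, attention_mask):
        # m_hidden_states: [batch, seq, m_hidden] from the multilingual encoder
        h = self.connector(m_hidden_states)          # [batch, seq, e_hidden]
        logits = self.lm_head(h)                     # [batch, seq, e_vocab]
        # SPLADE-style sparsification: log-saturated ReLU, zero out padding,
        # then max-pool over token positions to get one vector per text.
        weights = torch.log1p(torch.relu(logits))
        weights = weights.masked_fill(attention_mask.unsqueeze(-1) == 0, 0.0)
        return weights.max(dim=1).values             # [batch, e_vocab]
```

The source-language view from the echo token would be concatenated after this pivot vector, giving the [N, en_vocab_size + m_vocab_size] layout shown earlier.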

Configuration

Parameter                          Value
lsr_encoder_checkpoint             naver/splade-v3
multilingual_encoder_checkpoint    BAAI/bge-m3-unsupervised

Citation

@inproceedings{nguyen2026milco,
  title={MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector},
  author={Nguyen, Thong and Lei, Yibin and Ju, Jia-Huei and Yang, Eugene and Yates, Andrew},
  booktitle={International Conference on Learning Representations},
  year={2026}
}