---
tags:
- mteb
- sentence-transformers
- transformers
- embedding
- bidirectional
- multilingual
pipeline_tag: sentence-similarity
license: apache-2.0
base_model: BidirLM/BidirLM-1B-Base
language:
- multilingual
- af
- am
- ar
- az
- be
- bg
- bn
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fa
- fi
- fr
- ga
- gl
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kk
- kn
- ko
- ky
- lt
- lv
- mg
- mk
- ml
- mr
- ms
- mt
- my
- nb
- ne
- nl
- nso
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- sd
- si
- sk
- sl
- sn
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- uk
- ur
- vi
- wo
- xh
- yo
- zh
- zu
---

# BidirLM-1B

BidirLM is a family of five frontier bidirectional encoders, including an omnimodal variant at 2.5B parameters, adapted from causal decoder LLMs. Unlike contrastive-only models, BidirLM adds a prior masked next-token prediction (MNTP) phase that enables state-of-the-art results on task-specific fine-tuning (NER, classification, NLI) while achieving frontier performance among open-source alternatives on embedding benchmarks (MTEB).

![image/png](media/benchmark_scatter.png)

| Model | Base LLM | Parameters | Embedding Dim | Max Tokens | MTEB Multi. V2 (Mean Task) |
|---|---|---|---|---|---|
| BidirLM-270M | Gemma3-270M | 268M | 640 | 512 | 55.5 |
| BidirLM-0.6B | Qwen3-0.6B | 596M | 1024 | 512 | 59.6 |
| **BidirLM-1B** | **Gemma3-1B** | **1001M** | **1152** | **512** (\*) | **62.1** |
| BidirLM-1.7B | Qwen3-1.7B | 1721M | 2048 | 512 | 62.9 |
| BidirLM-Omni-2.5B | Qwen3-1.7B | 2.5B | 2048 | 512 | 63.1 |
(\*) While evaluated on MTEB with a maximum length of 512 tokens, the underlying architecture (Gemma3) supports a context length of up to 32,768 tokens. Longer sequences can be used by adjusting `model.max_seq_length` in Sentence Transformers or `max_length` in the tokenizer.
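As a sketch of the adjustment described above (the 2048-token limit below is an illustrative value, not a recommendation):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-1B", trust_remote_code=True)
# Raise the 512-token evaluation default; the Gemma3 backbone supports up to 32,768 tokens.
model.max_seq_length = 2048
```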
## Supported Tasks

**General embeddings** (via Sentence Transformers): retrieval, semantic similarity (STS), clustering, classification, pair classification, reranking, bitext mining, multilabel classification

**Downstream fine-tuning** (via Transformers): sequence classification (e.g. MNLI, XNLI, PAWS-X, MathShepherd), token classification (e.g. PAN-X, POS), information retrieval (e.g. MIRACL, CodeSearchNet), sequence regression (e.g. Seahorse)
## Usage

### Sentence Transformers

Use Sentence Transformers to compute embeddings for any text representation task.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BidirLM/BidirLM-1B", trust_remote_code=True)

queries = [
    "What is the capital of France?",
    "How does photosynthesis work?",
]
documents = [
    "Paris is the capital and largest city of France, situated on the river Seine.",
    "Photosynthesis is the process by which plants convert sunlight, water, and CO2 into glucose and oxygen.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```
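For context, `model.similarity` in Sentence Transformers defaults to cosine similarity. A minimal NumPy sketch of that computation, using toy 3-dimensional vectors rather than real embeddings:

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy vectors standing in for embeddings (real BidirLM-1B embeddings are 1152-dimensional)
queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
docs = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])

print(cosine_similarity_matrix(queries, docs))
```

With real embeddings from `model.encode`, entry `[i][j]` of the resulting matrix scores query `i` against document `j`.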

### Fine-tuning for Downstream Tasks

BidirLM can be directly fine-tuned for downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("BidirLM/BidirLM-1B", trust_remote_code=True)

# Sequence classification (e.g., NLI: entailment, neutral, contradiction)
seq_model = AutoModelForSequenceClassification.from_pretrained(
    "BidirLM/BidirLM-1B",
    trust_remote_code=True,
    num_labels=3,
)

# Token classification (e.g., NER)
tok_model = AutoModelForTokenClassification.from_pretrained(
    "BidirLM/BidirLM-1B",
    trust_remote_code=True,
    num_labels=7,
)

# Fine-tune with the Hugging Face Trainer as usual
```

## Evaluation

Please refer to the [mteb repository](https://github.com/embeddings-benchmark/mteb) for instructions on reproducing our scores. The evaluation prompts used for each task are also available in [mteb_v2_eval_prompts.json](mteb_v2_eval_prompts.json).

## Supported Languages

The model supports over 140 languages, inherited from the Gemma3 base model and reinforced through contrastive training on 87 languages.

## Requirements

This model requires `trust_remote_code=True` as it uses a custom bidirectional architecture.

```
transformers>=4.57.6,<5.0.0
sentence-transformers>=5.0.0
```

## FAQ

### 1. What pooling strategy does this model use?

The model uses **mean pooling**. This is handled automatically when using Sentence Transformers.
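For reference, mean pooling averages the token embeddings of each sequence, using the attention mask to exclude padding positions. A minimal NumPy sketch with toy values (not actual model outputs):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence, ignoring padding.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), a_min=1e-9, a_max=None)
    return summed / counts

# Toy example: batch of 1, seq_len 3, where the last position is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # padding position excluded from the average
```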

### 2. Do I need `trust_remote_code=True`?

Yes. BidirLM uses a custom bidirectional architecture (`BidirLMModel`) that requires loading custom code from the repository.

### 3. Why are my reproduced results slightly different from those reported in the model card?

Different versions of `transformers` and PyTorch may cause negligible but non-zero differences in results. This model was trained and evaluated with `transformers==4.57.6` and `torch==2.6.0`.

### 4. What is the relationship between BidirLM-1B and BidirLM-1B-Base?

[BidirLM/BidirLM-1B-Base](https://huggingface.co/BidirLM/BidirLM-1B-Base) is the intermediate MNTP-adapted checkpoint (bidirectional pretraining stage). BidirLM-1B is the final contrastive fine-tuned version, optimized for both sentence embeddings and downstream fine-tuning.

### 5. How is BidirLM different from other embedding models?

Most embedding models (BGE-M3, KaLM, EmbedGemma, Qwen3-Embedding) rely on contrastive-only training, which optimizes embeddings but sacrifices fine-tuning ability. BidirLM restores a prior MNTP phase, advancing the Pareto frontier on MTEB and XTREME simultaneously.

## Citation

```bibtex
@misc{boizard2026bidirlmtextomnimodalbidirectional,
      title={BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs},
      author={Nicolas Boizard and Théo Deschamps-Berger and Hippolyte Gisserot-Boukhlef and Céline Hudelot and Pierre Colombo},
      year={2026},
      eprint={2604.02045},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.02045},
}
```