---
library_name: transformers
license: mit
datasets:
- BeIR/msmarco
language:
- en
base_model:
- chandar-lab/NeoBERT
---

# Model Card for NeoBERT-RetroMAE-pretrain

This model is equivalent to [Shitao/RetroMAE_MSMARCO](https://huggingface.co/Shitao/RetroMAE_MSMARCO) but trained on the NeoBERT architecture. The training objective was LexMAE-style training that also incorporates the bag-of-words loss from DupMAE, making it a LexMAE/DupMAE hybrid. The model only underwent masked language modeling pretraining and has not seen any contrastive training, so it will need to be trained on a downstream task to achieve useful performance. It was trained with SPLADE training in mind, but may also be appropriate for dense embedding or ColBERT downstream training.
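The DupMAE-style bag-of-words component can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the actual training code: the function name `bow_loss` and its signature are hypothetical, and the idea is that a vocabulary-sized projection of the [CLS] hidden state is penalized for assigning low probability to the tokens appearing in the input.

```python
import torch
import torch.nn.functional as F

def bow_loss(cls_vocab_logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Bag-of-words loss sketch: push the [CLS] vocabulary distribution
    toward the tokens of the input passage.

    cls_vocab_logits: (batch, vocab_size) projection of the [CLS] hidden state
    input_ids:        (batch, seq_len) token ids of the input passage
    """
    log_probs = F.log_softmax(cls_vocab_logits, dim=-1)  # (batch, vocab_size)
    # Negative log-likelihood of each input token under the [CLS] distribution,
    # averaged over positions (duplicates are counted once per occurrence here).
    return -log_probs.gather(-1, input_ids).mean()
```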

## How to Get Started with the Model

Please reference the original NeoBERT model card as well: [chandar-lab/NeoBERT](https://huggingface.co/chandar-lab/NeoBERT)

Ensure you have the following dependencies installed:

```bash
pip install transformers torch xformers==0.0.28.post3
```

If you would like to use sequence packing (un-padding), you will also need to install flash-attention:

```bash
pip install transformers torch xformers==0.0.28.post3 flash_attn
```

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("drexalt/NeoBERT-RetroMAE-pretrain", trust_remote_code=True)
```

## Evaluation

NeoBERT-RetroMAE-pretrain and Shitao/RetroMAE_MSMARCO were both trained on the MS MARCO collection, so the NanoBEIR MS MARCO results can be seen as in-distribution and the other subsets as out-of-distribution.

Both models were evaluated with a maximum sequence length of 512 tokens, and their outputs were passed through a SPLADE activation before selecting the top 512 tokens. Results are similar across top_k values. This method of evaluation was mentioned in the LexMAE paper for choosing a pretraining checkpoint, but it is not present in the LexMAE codebase. A different manner of evaluation may be more appropriate for dense downstream models.
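The SPLADE activation used here can be sketched as follows. This is a minimal illustration, not the exact evaluation code; `splade_activation` and its signature are assumptions. MLM logits are ReLU'd, log-saturated, max-pooled over the sequence, and truncated to the top-k vocabulary terms:

```python
import torch

def splade_activation(mlm_logits: torch.Tensor,
                      attention_mask: torch.Tensor,
                      top_k: int = 512) -> torch.Tensor:
    """Turn MLM logits into a sparse lexical vector (SPLADE-style).

    mlm_logits:     (batch, seq_len, vocab_size)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # SPLADE activation: log(1 + ReLU(logits)), then max-pool over the sequence
    scores = torch.log1p(torch.relu(mlm_logits))
    scores = scores * attention_mask.unsqueeze(-1)  # zero out padding positions
    rep = scores.max(dim=1).values                  # (batch, vocab_size)
    # Keep only the top_k highest-weight vocabulary terms
    top = rep.topk(min(top_k, rep.size(-1)), dim=-1)
    return torch.zeros_like(rep).scatter_(-1, top.indices, top.values)
```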

### Evaluation on NanoBEIR

| Model | MS MARCO nDCG@10 | MS MARCO MRR@10 | SciFact nDCG@10 | ClimateFEVER nDCG@10 | NFCorpus nDCG@10 |
|:--------------------------|:-----------------|:----------------|:----------------|:---------------------|:-----------------|
| NeoBERT-RetroMAE-pretrain | **0.3210** | **0.2415** | **0.3648** | **0.1378** | 0.0747 |
| Shitao/RetroMAE_MSMARCO | 0.1980 | 0.1374 | 0.3236 | 0.0942 | **0.1150** |