	---
	language: en
	tags:
	- longformer
	- cdlm
	license: apache-2.0
	inference: false

	---


	# Cross-Document Language Modeling

	CDLM: Cross-Document Language Modeling.
	Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E Peters, Arie Cattan and Ido Dagan. In EMNLP Findings, 2021. [PDF](https://arxiv.org/pdf/2101.00406.pdf)


	Note that pretraining used explicit document and sentence separators, so you may want to add them to your own data as well. The document separators are `<doc-s>` and `</doc-s>` (the last two tokens in the vocabulary); the sentence separators are `<s>` and `</s>`.


	```python
	from transformers import AutoTokenizer, AutoModel

	# Load the pretrained CDLM tokenizer and model from the Hugging Face Hub
	tokenizer = AutoTokenizer.from_pretrained('biu-nlp/cdlm')
	model = AutoModel.from_pretrained('biu-nlp/cdlm')
	```
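
As a rough illustration of the separators described above, the sketch below wraps pre-split documents in `<doc-s>`/`</doc-s>` and each sentence in `<s>`/`</s>`. The helper name `wrap_documents`, the whitespace between tokens, and the choice to wrap every sentence individually are assumptions made for this example; the original repository is the reference for the exact format used in pretraining.

```python
from typing import List

# Separator tokens named in this model card.
DOC_START, DOC_END = "<doc-s>", "</doc-s>"
SENT_START, SENT_END = "<s>", "</s>"


def wrap_documents(documents: List[List[str]]) -> str:
    """Join documents (each a list of sentences) into one separator-marked string.

    Hypothetical helper: the spacing and per-sentence wrapping are assumptions,
    not necessarily the layout used by the authors.
    """
    wrapped_docs = []
    for sentences in documents:
        # Wrap each sentence in the sentence separators.
        body = " ".join(
            f"{SENT_START} {sentence.strip()} {SENT_END}" for sentence in sentences
        )
        # Wrap the whole document in the document separators.
        wrapped_docs.append(f"{DOC_START} {body} {DOC_END}")
    return " ".join(wrapped_docs)


# Illustrative input: two short documents, already split into sentences.
docs = [
    ["The storm hit the coast on Monday.", "Thousands lost power."],
    ["A storm struck coastal towns early this week."],
]
text = wrap_documents(docs)
print(text)
```

The resulting string can then be passed to the tokenizer loaded above. Since the tokenizer may add its own `<s>` and `</s>` around the input by default, it may be worth passing `add_special_tokens=False` and checking with `tokenizer.convert_tokens_to_ids("<doc-s>")` that the document separators map to single vocabulary entries.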

	The original code is available in the [CDLM GitHub repository](https://github.com/aviclu/CDLM).

	If you find our work useful, please cite the paper as:

	```bibtex
	@article{caciularu2021cross,
	title={Cross-Document Language Modeling},
	author={Caciularu, Avi and Cohan, Arman and Beltagy, Iz and Peters, Matthew E and Cattan, Arie and Dagan, Ido},
	journal={Findings of the Association for Computational Linguistics: EMNLP 2021},
	year={2021}
	}
	```