---
license: cc-by-4.0
language:
- he
---

# OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature

OtoBERT is a new Hebrew language model designed specifically for identifying suffixed verbal forms in Modern Hebrew, introduced in the paper available [here](https://aclanthology.org/2024.tsar-1.2/).

This is the base model, pretrained with the masked-language-modeling objective.

This model was trained with a special tokenizer that treats a verb together with its following object pronoun as a single unit (e.g., `ראיתי אותו` becomes the one unit `ראיתי_אותו`), and the model was trained to predict these units during masked-token prediction as well. For more details, please see the paper linked on this page.

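
To make the idea of a merged unit concrete, here is a minimal sketch in plain Python. It is not the tokenizer shipped with the model, and the card does not describe how the merge is implemented. The function name `merge_object_pronouns`, the `OBJECT_PRONOUNS` set, and the naive rule of joining a pronoun to whatever word precedes it (without checking that the word is a verb) are all assumptions made for illustration. The underscore-joined form follows the token printed by the sample usage on this page.

```python
from typing import List

# Inflected forms of the Hebrew direct-object pronoun (assumed list, for illustration).
OBJECT_PRONOUNS = {
    "אותי", "אותך", "אותו", "אותה", "אותנו",
    "אותכם", "אותכן", "אותם", "אותן",
}


def merge_object_pronouns(words: List[str]) -> List[str]:
    """Join each object pronoun to the word before it with an underscore.

    Hypothetical helper: a real pipeline would also need to verify that
    the preceding word is actually a verb.
    """
    merged: List[str] = []
    for word in words:
        if word in OBJECT_PRONOUNS and merged:
            # Attach the pronoun to the previous word, forming one unit.
            merged[-1] = f"{merged[-1]}_{word}"
        else:
            merged.append(word)
    return merged


units = merge_object_pronouns("ראיתי אותו אתמול".split())
print(len(units))  # 2
```

Here the three whitespace-separated words collapse into two units, because the verb and its object pronoun are joined.
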
Sample usage:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/otobert')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/otobert')

model.eval()

sentence = 'אני לא יכול להגיד לך מתי [MASK] לאחרונה.' # Supposed to be ראיתי אותו

# inference only, so gradient tracking is not needed
with torch.no_grad():
    output = model(tokenizer.encode(sentence, return_tensors='pt'))

# the [MASK] is at index 7 (0-based, with [CLS] at index 0)
top_2 = torch.topk(output.logits[0, 7, :], 2)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2))) # should print נפגשנו / ראיתי_אותו
```

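
The sample reads the logits at a fixed position, which only holds for that exact sentence. As a hedged sketch, the position can instead be looked up from the token ids. The helper names `find_mask_positions` and `top_k_indices` and the toy ids below are hypothetical and chosen only for illustration; the helpers work on plain Python sequences so they run without downloading the model.

```python
from typing import List, Sequence


def find_mask_positions(input_ids: Sequence[int], mask_token_id: int) -> List[int]:
    """Return every index at which the mask token id occurs."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_token_id]


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy values for illustration only; real ids and scores come from the
# tokenizer and the model.
toy_mask_id = 4
toy_input_ids = [2, 10, 11, 4, 12, 3]
toy_scores = [0.1, 2.5, -1.0, 1.7]

print(find_mask_positions(toy_input_ids, toy_mask_id))  # [3]
print(top_k_indices(toy_scores, 2))  # [1, 3]
```

With the objects from the sample, one would presumably pass `tokenizer.encode(sentence)` as `input_ids` and `tokenizer.mask_token_id` as `mask_token_id`, then feed `output.logits[0, position].tolist()` to `top_k_indices` and map the resulting ids back with `tokenizer.convert_ids_to_tokens`.
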

## Citation

If you use OtoBERT in your research, please cite ```OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature```

**BibTeX:**

```bibtex
@inproceedings{shmidman-shmidman-2024-otobert,
    title = "{O}to{BERT}: Identifying Suffixed Verbal Forms in {M}odern {H}ebrew Literature",
    author = "Shmidman, Avi  and
      Shmidman, Shaltiel",
    editor = "Shardlow, Matthew  and
      Saggion, Horacio  and
      Alva-Manchego, Fernando  and
      Zampieri, Marcos  and
      North, Kai  and
      {\v{S}}tajner, Sanja  and
      Stodden, Regina",
    booktitle = "Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.tsar-1.2",
    doi = "10.18653/v1/2024.tsar-1.2",
    pages = "12--19",
    abstract = "We provide a solution for a specific morphological obstacle which often makes Hebrew literature difficult to parse for the younger generation. The morphologically-rich nature of the Hebrew language allows pronominal direct objects to be realized as bound morphemes, suffixed to the verb. Although such suffixes are often utilized in Biblical Hebrew, their use has all but disappeared in modern Hebrew. Nevertheless, authors of modern Hebrew literature, in their search for literary flair, do make use of such forms. These unusual forms are notorious for alienating young readers from Hebrew literature, especially because these rare suffixed forms are often orthographically identical to common Hebrew words with different meanings. Upon encountering such words, readers naturally select the usual analysis of the word; yet, upon completing the sentence, they find themselves confounded. Young readers end up feeling {``}tricked{''}, and this in turn contributes to their alienation from the text. In order to address this challenge, we pretrained a new BERT model specifically geared to identify such forms, so that they may be automatically simplified and/or flagged. We release this new BERT model to the public for unrestricted use.",
}
```


## License

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg