niobures commited on
Commit
1735ad5
·
verified ·
1 Parent(s): 5553459

DictaBERT

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .gitattributes +20 -0
  2. dictabert-ce/.gitattributes +36 -0
  3. dictabert-ce/README.md +94 -0
  4. dictabert-ce/config.json +33 -0
  5. dictabert-ce/model.safetensors +3 -0
  6. dictabert-ce/source.txt +1 -0
  7. dictabert-ce/special_tokens_map.json +37 -0
  8. dictabert-ce/tokenizer.json +0 -0
  9. dictabert-ce/tokenizer_config.json +70 -0
  10. dictabert-ce/vocab.txt +3 -0
  11. dictabert-char-spacefix/.gitattributes +35 -0
  12. dictabert-char-spacefix/README.md +78 -0
  13. dictabert-char-spacefix/config.json +26 -0
  14. dictabert-char-spacefix/model.safetensors +3 -0
  15. dictabert-char-spacefix/source.txt +1 -0
  16. dictabert-char-spacefix/special_tokens_map.json +37 -0
  17. dictabert-char-spacefix/tokenizer.json +1022 -0
  18. dictabert-char-spacefix/tokenizer_config.json +64 -0
  19. dictabert-char-spacefix/vocab.txt +0 -0
  20. dictabert-heq/.gitattributes +36 -0
  21. dictabert-heq/LICENSE +395 -0
  22. dictabert-heq/README.md +71 -0
  23. dictabert-heq/config.json +26 -0
  24. dictabert-heq/issues.txt +20 -0
  25. dictabert-heq/pytorch_model.bin +3 -0
  26. dictabert-heq/source.txt +1 -0
  27. dictabert-heq/speed.ipynb +220 -0
  28. dictabert-heq/tokenizer.json +0 -0
  29. dictabert-heq/tokenizer_config.json +13 -0
  30. dictabert-heq/vocab.txt +3 -0
  31. dictabert-joint/.gitattributes +36 -0
  32. dictabert-joint/BertForJointParsing.py +534 -0
  33. dictabert-joint/BertForMorphTagging.py +215 -0
  34. dictabert-joint/BertForPrefixMarking.py +266 -0
  35. dictabert-joint/BertForSyntaxParsing.py +315 -0
  36. dictabert-joint/README.md +521 -0
  37. dictabert-joint/config.json +93 -0
  38. dictabert-joint/model.safetensors +3 -0
  39. dictabert-joint/pytorch_model.bin +3 -0
  40. dictabert-joint/source.txt +1 -0
  41. dictabert-joint/special_tokens_map.json +37 -0
  42. dictabert-joint/tokenizer.json +0 -0
  43. dictabert-joint/tokenizer_config.json +63 -0
  44. dictabert-joint/vocab.txt +3 -0
  45. dictabert-large-char-menaked/.gitattributes +35 -0
  46. dictabert-large-char-menaked/BertForDiacritization.py +190 -0
  47. dictabert-large-char-menaked/README.md +69 -0
  48. dictabert-large-char-menaked/config.json +63 -0
  49. dictabert-large-char-menaked/issues.txt +35 -0
  50. dictabert-large-char-menaked/model.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,23 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ dictabert-ce/vocab.txt filter=lfs diff=lfs merge=lfs -text
37
+ dictabert-heq/vocab.txt filter=lfs diff=lfs merge=lfs -text
38
+ dictabert-joint/vocab.txt filter=lfs diff=lfs merge=lfs -text
39
+ dictabert-large-heq/vocab.txt filter=lfs diff=lfs merge=lfs -text
40
+ dictabert-large-ner/vocab.txt filter=lfs diff=lfs merge=lfs -text
41
+ dictabert-large-parse/vocab.txt filter=lfs diff=lfs merge=lfs -text
42
+ dictabert-large/vocab.txt filter=lfs diff=lfs merge=lfs -text
43
+ dictabert-lex/vocab.txt filter=lfs diff=lfs merge=lfs -text
44
+ dictabert-morph/vocab.txt filter=lfs diff=lfs merge=lfs -text
45
+ dictabert-ner-handler/vocab.txt filter=lfs diff=lfs merge=lfs -text
46
+ dictabert-ner-ONNX/vocab.txt filter=lfs diff=lfs merge=lfs -text
47
+ dictabert-ner/vocab.txt filter=lfs diff=lfs merge=lfs -text
48
+ dictabert-parse/vocab.txt filter=lfs diff=lfs merge=lfs -text
49
+ dictabert-seg/vocab.txt filter=lfs diff=lfs merge=lfs -text
50
+ dictabert-sentiment/vocab.txt filter=lfs diff=lfs merge=lfs -text
51
+ dictabert-syntax/vocab.txt filter=lfs diff=lfs merge=lfs -text
52
+ dictabert-tiny-joint/vocab.txt filter=lfs diff=lfs merge=lfs -text
53
+ dictabert-tiny-parse/vocab.txt filter=lfs diff=lfs merge=lfs -text
54
+ dictabert-tiny/vocab.txt filter=lfs diff=lfs merge=lfs -text
55
+ dictabert/vocab.txt filter=lfs diff=lfs merge=lfs -text
dictabert-ce/.gitattributes ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ vocab.txt filter=lfs diff=lfs merge=lfs -text
dictabert-ce/README.md ADDED
@@ -0,0 +1,94 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ library_name: transformers
3
+ language:
4
+ - he
5
+ ---
6
+
7
+
8
+ ## Model Details
9
+
10
+ ### Model Description
11
+
12
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub.
13
+
14
+ - **Model type:** CrossEncoder
15
+ - **Language(s) (NLP):** Hebrew
16
+ - **License:** [More Information Needed]
17
+ - **Finetuned from model [optional]:** [DictaBERT](https://huggingface.co/dicta-il/dictabert)
18
+
19
+
20
+ ## Uses
21
+
22
+ Model was trained for ranking task as a part of a Hebrew semantic search engine.
23
+
24
+ ## How to Get Started with the Model
25
+
26
+ Use the code below to get started with the model.
27
+
28
+ ```python
29
+ from sentence_transformers import CrossEncoder
30
+
31
+
32
+ query = "על מה לא הסכים דוד בן גוריון לוותר?"
33
+ doc1 = """
34
+ מלחמת סיני הסתיימה בתבוסה של הכוחות המצריים, אך ברית המועצות וארצות הברית הפעילו לחץ כבד על ישראל לסגת מחצי האי סיני.
35
+ ראש ממשלת ישראל, דוד בן-גוריון, הסכים, בעקבות הלחץ של שתי המעצמות,
36
+ לפנות את חצי האי סיני ורצועת עזה בתהליך שהסתיים במרץ 1957,
37
+ אך הודיע שסגירה של מצרי טיראן לשיט ישראלי תהווה עילה למלחמה.
38
+ ארצות הברית התחייבה לדאוג להבטחת חופש המעבר של ישראל במצרי טיראן.
39
+ כוח חירום בינלאומי של האו"ם הוצב בצד המצרי של הגבול עם ישראל ובשארם א-שייח' וכתוצאה מכך נשאר נתיב השיט במפרץ אילת פתוח לשיט הישראלי.
40
+ """
41
+ doc2 = """
42
+ ים סוף מהווה מוקד חשוב לתיירות מרחבי העולם.
43
+ מזג האוויר הנוח בעונת החורף, החופים היפים, הים הצלול ואתרי הצלילה המרהיבים לחופי סיני,
44
+ מצרים, וסודאן הופכים את חופי ים סוף ליעד תיירות מבוקש.
45
+ ראס מוחמד והחור הכחול בסיני, ידועים כאתרי צלילה מהמרהיבים בעולם.
46
+ מאז הסכם השלום בין ישראל למצרים פיתחה מצרים מאוד את התיירות לאורך חופי ים סוף,
47
+ ובמיוחד בסיני, ובנתה עשרות אתרי תיירות ומאות מלונות וכפרי נופש.
48
+ תיירות זו נפגעה קשות מאז המהפכה של 2011 במצרים,
49
+ עם עלייה חדה בתקריות טרור מצד ארגונים אסלאמיים קיצוניים בסיני.
50
+ """
51
+
52
+ model = CrossEncoder("haguy77/dictabert-ce")
53
+
54
+ scores = model.predict([[query, doc1], [query, doc2]]) # Note: query should ALWAYS be the first of each pair
55
+ # array([0.02000629, 0.00031683], dtype=float32)
56
+
57
+ results = model.rank(query, [doc2, doc1])
58
+ # [{'corpus_id': 1, 'score': 0.020006292}, {'corpus_id': 0, 'score': 0.00031683326}]
59
+ ```
60
+
61
+ ### Training Data
62
+
63
+ [Hebrew Question Answering Dataset (HeQ)](https://github.com/NNLP-IL/Hebrew-Question-Answering-Dataset)
64
+
65
+ ## Citation
66
+
67
+ **BibTeX:**
68
+
69
+ ```bibtex
70
+ @misc{shmidman2023dictabert,
71
+ title={DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew},
72
+ author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
73
+ year={2023},
74
+ eprint={2308.16687},
75
+ archivePrefix={arXiv},
76
+ primaryClass={cs.CL}
77
+ }
78
+ ```
79
+ ```bibtex
80
+ @inproceedings{cohen2023heq,
81
+ title={Heq: a large and diverse hebrew reading comprehension benchmark},
82
+ author={Cohen, Amir and Merhav-Fine, Hilla and Goldberg, Yoav and Tsarfaty, Reut},
83
+ booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
84
+ pages={13693--13705},
85
+ year={2023}
86
+ }
87
+ ```
88
+
89
+ **APA:**
90
+ ```apa
91
+ Shmidman, S., Shmidman, A., & Koppel, M. (2023). DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew. arXiv preprint arXiv:2308.16687.
92
+
93
+ Cohen, A., Merhav-Fine, H., Goldberg, Y., & Tsarfaty, R. (2023, December). Heq: a large and diverse hebrew reading comprehension benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 13693-13705).
94
+ ```
dictabert-ce/config.json ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "dictabert-ce-heq-wiki",
3
+ "architectures": [
4
+ "BertForSequenceClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "id2label": {
13
+ "0": "LABEL_0"
14
+ },
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 3072,
17
+ "label2id": {
18
+ "LABEL_0": 0
19
+ },
20
+ "layer_norm_eps": 1e-12,
21
+ "max_position_embeddings": 512,
22
+ "model_type": "bert",
23
+ "newmodern": true,
24
+ "num_attention_heads": 12,
25
+ "num_hidden_layers": 12,
26
+ "pad_token_id": 0,
27
+ "position_embedding_type": "absolute",
28
+ "torch_dtype": "float32",
29
+ "transformers_version": "4.38.2",
30
+ "type_vocab_size": 2,
31
+ "use_cache": true,
32
+ "vocab_size": 128000
33
+ }
dictabert-ce/model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3386271f395a8888d16161407a4d05acebd63e2a83f4dd02e7298c89bf336aa2
3
+ size 737407996
dictabert-ce/source.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ https://huggingface.co/haguy77/dictabert-ce
dictabert-ce/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
dictabert-ce/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
dictabert-ce/tokenizer_config.json ADDED
@@ -0,0 +1,70 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[UNK]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[CLS]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[SEP]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[PAD]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "5": {
44
+ "content": "[BLANK]",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ }
51
+ },
52
+ "clean_up_tokenization_spaces": true,
53
+ "cls_token": "[CLS]",
54
+ "do_lower_case": true,
55
+ "mask_token": "[MASK]",
56
+ "max_length": 512,
57
+ "model_max_length": 512,
58
+ "pad_to_multiple_of": null,
59
+ "pad_token": "[PAD]",
60
+ "pad_token_type_id": 0,
61
+ "padding_side": "right",
62
+ "sep_token": "[SEP]",
63
+ "stride": 0,
64
+ "strip_accents": null,
65
+ "tokenize_chinese_chars": true,
66
+ "tokenizer_class": "BertTokenizer",
67
+ "truncation_side": "right",
68
+ "truncation_strategy": "longest_first",
69
+ "unk_token": "[UNK]"
70
+ }
dictabert-ce/vocab.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
3
+ size 1500244
dictabert-char-spacefix/.gitattributes ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
dictabert-char-spacefix/README.md ADDED
@@ -0,0 +1,78 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: cc-by-4.0
3
+ language:
4
+ - he
5
+ base_model:
6
+ - dicta-il/dictabert-char
7
+ ---
8
+ # DictaBERT-char-spacefix: A finetuned BERT model for restoring missing spaces in Hebrew texts.
9
+
10
+ DictaBERT-char-spacefix is a finetuned BERT model based on [dicta-il/dictabert-char](https://huggingface.co/dicta-il/dictabert-char), for the task of restoring missing spaces in Hebrew text.
11
+
12
+ This model is released to the public in this 2025 W-NUT paper: Avi Shmidman and Shaltiel Shmidman, "Restoring Missing Spaces in Scraped Hebrew Social Media", The 10th Workshop on Noisy and User-generated Text (W-NUT), 2025
13
+
14
+ Sample usage:
15
+
16
+ ```python
17
+ from transformers import pipeline
18
+
19
+ oracle = pipeline('token-classification', model='dicta-il/dictabert-char-spacefix')
20
+
21
+ text = 'בשנת 1948 השליםאפרים קישון אתלימודיובפיסולמתכת ובתולדותהאמנות והחל לפרסםמאמרים הומוריסטיים'
22
+
23
+ raw_output = oracle(text)
24
+
25
+ # Classifier returns LABEL_1 if there should be a space before the character
26
+ text_output = ''.join((' ' if o['entity'] == 'LABEL_1' else '') + o['word'] for o in raw_output)
27
+ print(text_output)
28
+ ```
29
+
30
+ Output:
31
+
32
+ ```text
33
+ בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים
34
+ ```
35
+
36
+ ## Citation
37
+
38
+ If you use DictaBERT-char-spacefix
39
+ in your research, please cite ```Restoring Missing Spaces in Scraped Hebrew Social Media```
40
+
41
+ **BibTeX:**
42
+
43
+ ```bibtex
44
+ @inproceedings{shmidman-shmidman-2025-restoring,
45
+ title = "Restoring Missing Spaces in Scraped {H}ebrew Social Media",
46
+ author = "Shmidman, Avi and
47
+ Shmidman, Shaltiel",
48
+ editor = "Bak, JinYeong and
49
+ Goot, Rob van der and
50
+ Jang, Hyeju and
51
+ Buaphet, Weerayut and
52
+ Ramponi, Alan and
53
+ Xu, Wei and
54
+ Ritter, Alan",
55
+ booktitle = "Proceedings of the Tenth Workshop on Noisy and User-generated Text",
56
+ month = may,
57
+ year = "2025",
58
+ address = "Albuquerque, New Mexico, USA",
59
+ publisher = "Association for Computational Linguistics",
60
+ url = "https://aclanthology.org/2025.wnut-1.3/",
61
+ pages = "16--25",
62
+ ISBN = "979-8-89176-232-9",
63
+ abstract = "A formidable challenge regarding scraped corpora of social media is the omission of whitespaces, causing pairs of words to be conflated together as one. In order for the text to be properly parsed and analyzed, these missing spaces must be detected and restored. However, it is particularly hard to restore whitespace in languages such as Hebrew which are written without vowels, because a conflated form can often be split into multiple different pairs of valid words. Thus, a simple dictionary lookup is not feasible. In this paper, we present and evaluate a series of neural approaches to restore missing spaces in scraped Hebrew social media. Our best all-around method involved pretraining a new character-based BERT model for Hebrew, and then fine-tuning a space restoration model on top of this new BERT model. This method is blazing fast, high-performing, and open for unrestricted use, providing a practical solution to process huge Hebrew social media corpora with a consumer-grade GPU. We release the new BERT model and the fine-tuned space-restoration model to the NLP community."
64
+ }
65
+ ```
66
+
67
+ ## License
68
+
69
+ Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
70
+
71
+ This work is licensed under a
72
+ [Creative Commons Attribution 4.0 International License][cc-by].
73
+
74
+ [![CC BY 4.0][cc-by-image]][cc-by]
75
+
76
+ [cc-by]: http://creativecommons.org/licenses/by/4.0/
77
+ [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
78
+ [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
dictabert-char-spacefix/config.json ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "../TavFullParagraphBigModern/ckpt_31600/",
3
+ "architectures": [
4
+ "BertForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 3072,
14
+ "layer_norm_eps": 1e-12,
15
+ "max_position_embeddings": 2048,
16
+ "model_type": "bert",
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.47.0",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 1024
26
+ }
dictabert-char-spacefix/model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7f564cf441cb623312a04cc69f013c5ecbd46b8db5d868aa8a77b159ee87d376
3
+ size 349696712
dictabert-char-spacefix/source.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ https://huggingface.co/dicta-il/dictabert-char-spacefix
dictabert-char-spacefix/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
dictabert-char-spacefix/tokenizer.json ADDED
@@ -0,0 +1,1022 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "version": "1.0",
3
+ "truncation": null,
4
+ "padding": null,
5
+ "added_tokens": [
6
+ {
7
+ "id": 0,
8
+ "content": "[UNK]",
9
+ "single_word": false,
10
+ "lstrip": false,
11
+ "rstrip": false,
12
+ "normalized": false,
13
+ "special": true
14
+ },
15
+ {
16
+ "id": 1,
17
+ "content": "[CLS]",
18
+ "single_word": false,
19
+ "lstrip": false,
20
+ "rstrip": false,
21
+ "normalized": false,
22
+ "special": true
23
+ },
24
+ {
25
+ "id": 2,
26
+ "content": "[SEP]",
27
+ "single_word": false,
28
+ "lstrip": false,
29
+ "rstrip": false,
30
+ "normalized": false,
31
+ "special": true
32
+ },
33
+ {
34
+ "id": 3,
35
+ "content": "[PAD]",
36
+ "single_word": false,
37
+ "lstrip": false,
38
+ "rstrip": false,
39
+ "normalized": false,
40
+ "special": true
41
+ },
42
+ {
43
+ "id": 4,
44
+ "content": "[MASK]",
45
+ "single_word": false,
46
+ "lstrip": false,
47
+ "rstrip": false,
48
+ "normalized": false,
49
+ "special": true
50
+ },
51
+ {
52
+ "id": 5,
53
+ "content": "[BLANK]",
54
+ "single_word": false,
55
+ "lstrip": false,
56
+ "rstrip": false,
57
+ "normalized": false,
58
+ "special": true
59
+ }
60
+ ],
61
+ "normalizer": {
62
+ "type": "Sequence",
63
+ "normalizers": [
64
+ {
65
+ "type": "NFKC"
66
+ },
67
+ {
68
+ "type": "Lowercase"
69
+ },
70
+ {
71
+ "type": "StripAccents"
72
+ },
73
+ {
74
+ "type": "Replace",
75
+ "pattern": {
76
+ "String": "<foreign>"
77
+ },
78
+ "content": "[UNK]"
79
+ },
80
+ {
81
+ "type": "Replace",
82
+ "pattern": {
83
+ "Regex": "[^֐-׿\u0000-‌-‿₠-₿∀-⋿⅐-↋ff-ﭏ]+"
84
+ },
85
+ "content": "[UNK]"
86
+ }
87
+ ]
88
+ },
89
+ "pre_tokenizer": {
90
+ "type": "Split",
91
+ "pattern": {
92
+ "Regex": "(\\[UNK\\]|[\\s\\S])"
93
+ },
94
+ "behavior": "Removed",
95
+ "invert": true
96
+ },
97
+ "post_processor": {
98
+ "type": "TemplateProcessing",
99
+ "single": [
100
+ {
101
+ "SpecialToken": {
102
+ "id": "[CLS]",
103
+ "type_id": 0
104
+ }
105
+ },
106
+ {
107
+ "Sequence": {
108
+ "id": "A",
109
+ "type_id": 0
110
+ }
111
+ },
112
+ {
113
+ "SpecialToken": {
114
+ "id": "[SEP]",
115
+ "type_id": 0
116
+ }
117
+ }
118
+ ],
119
+ "pair": [
120
+ {
121
+ "SpecialToken": {
122
+ "id": "[CLS]",
123
+ "type_id": 0
124
+ }
125
+ },
126
+ {
127
+ "Sequence": {
128
+ "id": "A",
129
+ "type_id": 0
130
+ }
131
+ },
132
+ {
133
+ "SpecialToken": {
134
+ "id": "[SEP]",
135
+ "type_id": 0
136
+ }
137
+ },
138
+ {
139
+ "Sequence": {
140
+ "id": "B",
141
+ "type_id": 1
142
+ }
143
+ },
144
+ {
145
+ "SpecialToken": {
146
+ "id": "[SEP]",
147
+ "type_id": 1
148
+ }
149
+ }
150
+ ],
151
+ "special_tokens": {
152
+ "[CLS]": {
153
+ "id": "[CLS]",
154
+ "ids": [
155
+ 1
156
+ ],
157
+ "tokens": [
158
+ "[CLS]"
159
+ ]
160
+ },
161
+ "[SEP]": {
162
+ "id": "[SEP]",
163
+ "ids": [
164
+ 2
165
+ ],
166
+ "tokens": [
167
+ "[SEP]"
168
+ ]
169
+ }
170
+ }
171
+ },
172
+ "decoder": null,
173
+ "model": {
174
+ "type": "WordPiece",
175
+ "unk_token": "[UNK]",
176
+ "continuing_subword_prefix": "##",
177
+ "max_input_chars_per_word": 100,
178
+ "vocab": {
179
+ "[UNK]": 0,
180
+ "[CLS]": 1,
181
+ "[SEP]": 2,
182
+ "[PAD]": 3,
183
+ "[MASK]": 4,
184
+ "[BLANK]": 5,
185
+ "\u0000": 6,
186
+ "\u0001": 7,
187
+ "\u0002": 8,
188
+ "\u0003": 9,
189
+ "\u0004": 10,
190
+ "\u0005": 11,
191
+ "\u0006": 12,
192
+ "\u0007": 13,
193
+ "\b": 14,
194
+ "\t": 15,
195
+ "\n": 16,
196
+ "\u000b": 17,
197
+ "\u000e": 18,
198
+ "\u000f": 19,
199
+ "\u0010": 20,
200
+ "\u0011": 21,
201
+ "\u0012": 22,
202
+ "\u0013": 23,
203
+ "\u0014": 24,
204
+ "\u0015": 25,
205
+ "\u0016": 26,
206
+ "\u0017": 27,
207
+ "\u0018": 28,
208
+ "\u0019": 29,
209
+ "\u001a": 30,
210
+ "\u001b": 31,
211
+ "\u001c": 32,
212
+ "\u001d": 33,
213
+ "\u001e": 34,
214
+ "\u001f": 35,
215
+ " ": 36,
216
+ "!": 37,
217
+ "\"": 38,
218
+ "#": 39,
219
+ "$": 40,
220
+ "%": 41,
221
+ "&": 42,
222
+ "'": 43,
223
+ "(": 44,
224
+ ")": 45,
225
+ "*": 46,
226
+ "+": 47,
227
+ ",": 48,
228
+ "-": 49,
229
+ ".": 50,
230
+ "/": 51,
231
+ "0": 52,
232
+ "1": 53,
233
+ "2": 54,
234
+ "3": 55,
235
+ "4": 56,
236
+ "5": 57,
237
+ "6": 58,
238
+ "7": 59,
239
+ "8": 60,
240
+ "9": 61,
241
+ ":": 62,
242
+ ";": 63,
243
+ "<": 64,
244
+ "=": 65,
245
+ ">": 66,
246
+ "?": 67,
247
+ "@": 68,
248
+ "K": 69,
249
+ "N": 70,
250
+ "U": 71,
251
+ "[": 72,
252
+ "\\": 73,
253
+ "]": 74,
254
+ "^": 75,
255
+ "_": 76,
256
+ "`": 77,
257
+ "a": 78,
258
+ "b": 79,
259
+ "c": 80,
260
+ "d": 81,
261
+ "e": 82,
262
+ "f": 83,
263
+ "g": 84,
264
+ "h": 85,
265
+ "i": 86,
266
+ "j": 87,
267
+ "k": 88,
268
+ "l": 89,
269
+ "m": 90,
270
+ "n": 91,
271
+ "o": 92,
272
+ "p": 93,
273
+ "q": 94,
274
+ "r": 95,
275
+ "s": 96,
276
+ "t": 97,
277
+ "u": 98,
278
+ "v": 99,
279
+ "w": 100,
280
+ "x": 101,
281
+ "y": 102,
282
+ "z": 103,
283
+ "{": 104,
284
+ "|": 105,
285
+ "}": 106,
286
+ "~": 107,
287
+ "": 108,
288
+ "€": 109,
289
+ "": 110,
290
+ "‚": 111,
291
+ "ƒ": 112,
292
+ "„": 113,
293
+ "†": 114,
294
+ "ˆ": 115,
295
+ "‰": 116,
296
+ "Œ": 117,
297
+ "": 118,
298
+ "Ž": 119,
299
+ "": 120,
300
+ "": 121,
301
+ "‘": 122,
302
+ "’": 123,
303
+ "“": 124,
304
+ "”": 125,
305
+ "•": 126,
306
+ "–": 127,
307
+ "—": 128,
308
+ "˜": 129,
309
+ "™": 130,
310
+ "š": 131,
311
+ "›": 132,
312
+ "œ": 133,
313
+ "": 134,
314
+ "ž": 135,
315
+ "Ÿ": 136,
316
+ "¡": 137,
317
+ "¢": 138,
318
+ "£": 139,
319
+ "¤": 140,
320
+ "¥": 141,
321
+ "¦": 142,
322
+ "§": 143,
323
+ "©": 144,
324
+ "«": 145,
325
+ "¬": 146,
326
+ "­": 147,
327
+ "®": 148,
328
+ "°": 149,
329
+ "±": 150,
330
+ "¶": 151,
331
+ "·": 152,
332
+ "»": 153,
333
+ "¿": 154,
334
+ "×": 155,
335
+ "ß": 156,
336
+ "à": 157,
337
+ "á": 158,
338
+ "â": 159,
339
+ "ã": 160,
340
+ "ä": 161,
341
+ "å": 162,
342
+ "æ": 163,
343
+ "ç": 164,
344
+ "è": 165,
345
+ "é": 166,
346
+ "ê": 167,
347
+ "ë": 168,
348
+ "ì": 169,
349
+ "í": 170,
350
+ "î": 171,
351
+ "ï": 172,
352
+ "ð": 173,
353
+ "ñ": 174,
354
+ "ò": 175,
355
+ "ó": 176,
356
+ "ô": 177,
357
+ "õ": 178,
358
+ "ö": 179,
359
+ "÷": 180,
360
+ "ø": 181,
361
+ "ù": 182,
362
+ "ú": 183,
363
+ "û": 184,
364
+ "ü": 185,
365
+ "ý": 186,
366
+ "þ": 187,
367
+ "ÿ": 188,
368
+ "ȼ": 189,
369
+ "˖": 190,
370
+ "˗": 191,
371
+ "ͱ": 192,
372
+ "ͳ": 193,
373
+ "͵": 194,
374
+ "ӏ": 195,
375
+ "ԝ": 196,
376
+ "֎": 197,
377
+ "־": 198,
378
+ "׀": 199,
379
+ "׃": 200,
380
+ "׆": 201,
381
+ "׈": 202,
382
+ "׉": 203,
383
+ "׊": 204,
384
+ "׋": 205,
385
+ "׍": 206,
386
+ "׎": 207,
387
+ "׏": 208,
388
+ "א": 209,
389
+ "ב": 210,
390
+ "ג": 211,
391
+ "ד": 212,
392
+ "ה": 213,
393
+ "ו": 214,
394
+ "ז": 215,
395
+ "ח": 216,
396
+ "ט": 217,
397
+ "י": 218,
398
+ "ך": 219,
399
+ "כ": 220,
400
+ "ל": 221,
401
+ "ם": 222,
402
+ "מ": 223,
403
+ "ן": 224,
404
+ "נ": 225,
405
+ "ס": 226,
406
+ "ע": 227,
407
+ "ף": 228,
408
+ "פ": 229,
409
+ "ץ": 230,
410
+ "צ": 231,
411
+ "ק": 232,
412
+ "ר": 233,
413
+ "ש": 234,
414
+ "ת": 235,
415
+ "׫": 236,
416
+ "װ": 237,
417
+ "ױ": 238,
418
+ "ײ": 239,
419
+ "׳": 240,
420
+ "״": 241,
421
+ "׸": 242,
422
+ "׹": 243,
423
+ "׺": 244,
424
+ "׿": 245,
425
+ "،": 246,
426
+ "؛": 247,
427
+ "؟": 248,
428
+ "٪": 249,
429
+ "٭": 250,
430
+ "۔": 251,
431
+ "۝": 252,
432
+ "۞": 253,
433
+ "۩": 254,
434
+ "ߋ": 255,
435
+ "ߐ": 256,
436
+ "ߕ": 257,
437
+ "ߗ": 258,
438
+ "ߜ": 259,
439
+ "ߝ": 260,
440
+ "ߞ": 261,
441
+ "ߟ": 262,
442
+ "ߠ": 263,
443
+ "ߡ": 264,
444
+ "ߢ": 265,
445
+ "ߨ": 266,
446
+ "ߩ": 267,
447
+ "ߪ": 268,
448
+ "।": 269,
449
+ "฿": 270,
450
+ "๏": 271,
451
+ "፡": 272,
452
+ "ᤞ": 273,
453
+ "᧐": 274,
454
+ "ᨁ": 275,
455
+ "ᨅ": 276,
456
+ "ᨔ": 277,
457
+ "ᨕ": 278,
458
+ "‌": 279,
459
+ "‍": 280,
460
+ "‎": 281,
461
+ "‏": 282,
462
+ "‐": 283,
463
+ "‒": 284,
464
+ "–": 285,
465
+ "—": 286,
466
+ "―": 287,
467
+ "‖": 288,
468
+ "‘": 289,
469
+ "’": 290,
470
+ "‚": 291,
471
+ "‛": 292,
472
+ "“": 293,
473
+ "”": 294,
474
+ "„": 295,
475
+ "‟": 296,
476
+ "†": 297,
477
+ "‡": 298,
478
+ "•": 299,
479
+ "‣": 300,
480
+ "‧": 301,
481
+ "
": 302,
482
+ "
": 303,
483
+ "‪": 304,
484
+ "‫": 305,
485
+ "‬": 306,
486
+ "‭": 307,
487
+ "‮": 308,
488
+ "‰": 309,
489
+ "′": 310,
490
+ "‹": 311,
491
+ "›": 312,
492
+ "※": 313,
493
+ "‽": 314,
494
+ "‿": 315,
495
+ "⁃": 316,
496
+ "⁄": 317,
497
+ "⁎": 318,
498
+ "⁠": 319,
499
+ "⁣": 320,
500
+ "⁦": 321,
501
+ "⁧": 322,
502
+ "⁨": 323,
503
+ "⁩": 324,
504
+ "₡": 325,
505
+ "₣": 326,
506
+ "₤": 327,
507
+ "₦": 328,
508
+ "₩": 329,
509
+ "₪": 330,
510
+ "₫": 331,
511
+ "€": 332,
512
+ "₭": 333,
513
+ "₮": 334,
514
+ "₱": 335,
515
+ "₴": 336,
516
+ "₵": 337,
517
+ "₸": 338,
518
+ "₹": 339,
519
+ "₺": 340,
520
+ "₼": 341,
521
+ "₽": 342,
522
+ "₾": 343,
523
+ "₿": 344,
524
+ "ↄ": 345,
525
+ "←": 346,
526
+ "↑": 347,
527
+ "→": 348,
528
+ "↓": 349,
529
+ "↔": 350,
530
+ "↗": 351,
531
+ "↘": 352,
532
+ "↙": 353,
533
+ "↩": 354,
534
+ "↳": 355,
535
+ "↵": 356,
536
+ "⇌": 357,
537
+ "⇐": 358,
538
+ "⇒": 359,
539
+ "⇓": 360,
540
+ "⇔": 361,
541
+ "⇦": 362,
542
+ "⇧": 363,
543
+ "⇨": 364,
544
+ "⇱": 365,
545
+ "∀": 366,
546
+ "∂": 367,
547
+ "∃": 368,
548
+ "∅": 369,
549
+ "∆": 370,
550
+ "∇": 371,
551
+ "∈": 372,
552
+ "∉": 373,
553
+ "∍": 374,
554
+ "∎": 375,
555
+ "∏": 376,
556
+ "∐": 377,
557
+ "∑": 378,
558
+ "−": 379,
559
+ "∕": 380,
560
+ "∗": 381,
561
+ "∘": 382,
562
+ "∙": 383,
563
+ "√": 384,
564
+ "∛": 385,
565
+ "∝": 386,
566
+ "∞": 387,
567
+ "∟": 388,
568
+ "∠": 389,
569
+ "∢": 390,
570
+ "∧": 391,
571
+ "∨": 392,
572
+ "∩": 393,
573
+ "∪": 394,
574
+ "∫": 395,
575
+ "∴": 396,
576
+ "∼": 397,
577
+ "≅": 398,
578
+ "≈": 399,
579
+ "≋": 400,
580
+ "≟": 401,
581
+ "≠": 402,
582
+ "≡": 403,
583
+ "≤": 404,
584
+ "≥": 405,
585
+ "≦": 406,
586
+ "≧": 407,
587
+ "≪": 408,
588
+ "≫": 409,
589
+ "⊂": 410,
590
+ "⊃": 411,
591
+ "⊆": 412,
592
+ "⊇": 413,
593
+ "⊕": 414,
594
+ "⊗": 415,
595
+ "⊙": 416,
596
+ "⊞": 417,
597
+ "⊠": 418,
598
+ "⊢": 419,
599
+ "⊤": 420,
600
+ "⊦": 421,
601
+ "⋃": 422,
602
+ "⋄": 423,
603
+ "⋅": 424,
604
+ "⋆": 425,
605
+ "⋇": 426,
606
+ "⋧": 427,
607
+ "⋮": 428,
608
+ "⋯": 429,
609
+ "⌀": 430,
610
+ "⌂": 431,
611
+ "⌘": 432,
612
+ "⌚": 433,
613
+ "⌛": 434,
614
+ "⌥": 435,
615
+ "⎙": 436,
616
+ "⏎": 437,
617
+ "⏪": 438,
618
+ "⏮": 439,
619
+ "⏰": 440,
620
+ "⏱": 441,
621
+ "⏳": 442,
622
+ "⏺": 443,
623
+ "─": 444,
624
+ "│": 445,
625
+ "┐": 446,
626
+ "└": 447,
627
+ "┴": 448,
628
+ "╋": 449,
629
+ "║": 450,
630
+ "╬": 451,
631
+ "█": 452,
632
+ "▌": 453,
633
+ "░": 454,
634
+ "■": 455,
635
+ "□": 456,
636
+ "▪": 457,
637
+ "▫": 458,
638
+ "▲": 459,
639
+ "△": 460,
640
+ "▶": 461,
641
+ "▷": 462,
642
+ "▸": 463,
643
+ "►": 464,
644
+ "▼": 465,
645
+ "▽": 466,
646
+ "▾": 467,
647
+ "◀": 468,
648
+ "◁": 469,
649
+ "◂": 470,
650
+ "◃": 471,
651
+ "◄": 472,
652
+ "◆": 473,
653
+ "◇": 474,
654
+ "◈": 475,
655
+ "◉": 476,
656
+ "◊": 477,
657
+ "○": 478,
658
+ "◌": 479,
659
+ "◎": 480,
660
+ "●": 481,
661
+ "◕": 482,
662
+ "◘": 483,
663
+ "◙": 484,
664
+ "◡": 485,
665
+ "◥": 486,
666
+ "◦": 487,
667
+ "◴": 488,
668
+ "◻": 489,
669
+ "◼": 490,
670
+ "◽": 491,
671
+ "◾": 492,
672
+ "☀": 493,
673
+ "☁": 494,
674
+ "☂": 495,
675
+ "☃": 496,
676
+ "☄": 497,
677
+ "★": 498,
678
+ "☆": 499,
679
+ "☉": 500,
680
+ "☎": 501,
681
+ "☏": 502,
682
+ "☐": 503,
683
+ "☑": 504,
684
+ "☒": 505,
685
+ "☔": 506,
686
+ "☕": 507,
687
+ "☘": 508,
688
+ "☚": 509,
689
+ "☜": 510,
690
+ "☝": 511,
691
+ "☠": 512,
692
+ "☢": 513,
693
+ "☯": 514,
694
+ "☰": 515,
695
+ "☹": 516,
696
+ "☺": 517,
697
+ "☻": 518,
698
+ "☼": 519,
699
+ "♀": 520,
700
+ "♂": 521,
701
+ "♔": 522,
702
+ "♕": 523,
703
+ "♚": 524,
704
+ "♛": 525,
705
+ "♟": 526,
706
+ "♠": 527,
707
+ "♡": 528,
708
+ "♢": 529,
709
+ "♣": 530,
710
+ "♥": 531,
711
+ "♦": 532,
712
+ "♧": 533,
713
+ "♨": 534,
714
+ "♪": 535,
715
+ "♫": 536,
716
+ "♬": 537,
717
+ "♭": 538,
718
+ "♯": 539,
719
+ "♰": 540,
720
+ "♻": 541,
721
+ "♿": 542,
722
+ "⚇": 543,
723
+ "⚒": 544,
724
+ "⚓": 545,
725
+ "⚔": 546,
726
+ "⚖": 547,
727
+ "⚘": 548,
728
+ "⚛": 549,
729
+ "⚜": 550,
730
+ "⚠": 551,
731
+ "⚡": 552,
732
+ "⚧": 553,
733
+ "⚪": 554,
734
+ "⚫": 555,
735
+ "⚽": 556,
736
+ "⛔": 557,
737
+ "⛰": 558,
738
+ "✂": 559,
739
+ "✅": 560,
740
+ "✆": 561,
741
+ "✈": 562,
742
+ "✉": 563,
743
+ "✊": 564,
744
+ "✋": 565,
745
+ "✌": 566,
746
+ "✍": 567,
747
+ "✎": 568,
748
+ "✏": 569,
749
+ "✓": 570,
750
+ "✔": 571,
751
+ "✖": 572,
752
+ "✗": 573,
753
+ "✙": 574,
754
+ "✛": 575,
755
+ "✡": 576,
756
+ "✦": 577,
757
+ "✧": 578,
758
+ "✨": 579,
759
+ "✩": 580,
760
+ "✪": 581,
761
+ "✫": 582,
762
+ "✭": 583,
763
+ "✮": 584,
764
+ "✯": 585,
765
+ "✰": 586,
766
+ "✱": 587,
767
+ "✲": 588,
768
+ "✳": 589,
769
+ "✴": 590,
770
+ "✶": 591,
771
+ "✸": 592,
772
+ "✺": 593,
773
+ "✻": 594,
774
+ "✽": 595,
775
+ "✾": 596,
776
+ "✿": 597,
777
+ "❀": 598,
778
+ "❁": 599,
779
+ "❂": 600,
780
+ "❃": 601,
781
+ "❄": 602,
782
+ "❇": 603,
783
+ "❈": 604,
784
+ "❋": 605,
785
+ "❌": 606,
786
+ "❎": 607,
787
+ "❏": 608,
788
+ "❑": 609,
789
+ "❒": 610,
790
+ "❓": 611,
791
+ "❔": 612,
792
+ "❕": 613,
793
+ "❖": 614,
794
+ "❗": 615,
795
+ "❝": 616,
796
+ "❞": 617,
797
+ "❣": 618,
798
+ "❤": 619,
799
+ "❥": 620,
800
+ "❦": 621,
801
+ "❭": 622,
802
+ "❯": 623,
803
+ "❶": 624,
804
+ "❷": 625,
805
+ "❸": 626,
806
+ "➊": 627,
807
+ "➋": 628,
808
+ "➌": 629,
809
+ "➍": 630,
810
+ "➎": 631,
811
+ "➔": 632,
812
+ "➕": 633,
813
+ "➖": 634,
814
+ "➡": 635,
815
+ "➢": 636,
816
+ "➤": 637,
817
+ "➦": 638,
818
+ "⟨": 639,
819
+ "⟩": 640,
820
+ "⠀": 641,
821
+ "⤵": 642,
822
+ "⤶": 643,
823
+ "⦁": 644,
824
+ "⦿": 645,
825
+ "⧼": 646,
826
+ "⧽": 647,
827
+ "⬅": 648,
828
+ "⬆": 649,
829
+ "⬇": 650,
830
+ "⬛": 651,
831
+ "⬜": 652,
832
+ "⭐": 653,
833
+ "⭕": 654,
834
+ "ⰲ": 655,
835
+ "ⰽ": 656,
836
+ "ⰾ": 657,
837
+ "ⱀ": 658,
838
+ "ⱁ": 659,
839
+ "ⱄ": 660,
840
+ "ⱏ": 661,
841
+ "ⱐ": 662,
842
+ "ⱑ": 663,
843
+ "ⱥ": 664,
844
+ "ⲟ": 665,
845
+ "ⴰ": 666,
846
+ "ⴻ": 667,
847
+ "ⵍ": 668,
848
+ "ⵏ": 669,
849
+ "ⵔ": 670,
850
+ "ⵢ": 671,
851
+ "ⵣ": 672,
852
+ "、": 673,
853
+ "。": 674,
854
+ "〈": 675,
855
+ "〉": 676,
856
+ "《": 677,
857
+ "》": 678,
858
+ "「": 679,
859
+ "」": 680,
860
+ "【": 681,
861
+ "】": 682,
862
+ "ꙭ": 683,
863
+ "": 684,
864
+ "": 685,
865
+ "": 686,
866
+ "": 687,
867
+ "": 688,
868
+ "": 689,
869
+ "": 690,
870
+ "": 691,
871
+ "": 692,
872
+ "": 693,
873
+ "": 694,
874
+ "": 695,
875
+ "": 696,
876
+ "": 697,
877
+ "": 698,
878
+ "": 699,
879
+ "": 700,
880
+ "": 701,
881
+ "": 702,
882
+ "": 703,
883
+ "": 704,
884
+ "": 705,
885
+ "": 706,
886
+ "": 707,
887
+ "": 708,
888
+ "": 709,
889
+ "": 710,
890
+ "": 711,
891
+ "": 712,
892
+ "": 713,
893
+ "": 714,
894
+ "": 715,
895
+ "": 716,
896
+ "": 717,
897
+ "": 718,
898
+ "": 719,
899
+ "": 720,
900
+ "": 721,
901
+ "": 722,
902
+ "": 723,
903
+ "": 724,
904
+ "": 725,
905
+ "": 726,
906
+ "": 727,
907
+ "": 728,
908
+ "": 729,
909
+ "": 730,
910
+ "": 731,
911
+ "": 732,
912
+ "": 733,
913
+ "": 734,
914
+ "": 735,
915
+ "": 736,
916
+ "": 737,
917
+ "": 738,
918
+ "": 739,
919
+ "": 740,
920
+ "": 741,
921
+ "": 742,
922
+ "": 743,
923
+ "": 744,
924
+ "": 745,
925
+ "": 746,
926
+ "": 747,
927
+ "": 748,
928
+ "": 749,
929
+ "": 750,
930
+ "": 751,
931
+ "": 752,
932
+ "": 753,
933
+ "": 754,
934
+ "": 755,
935
+ "": 756,
936
+ "": 757,
937
+ "": 758,
938
+ "": 759,
939
+ "": 760,
940
+ "": 761,
941
+ "": 762,
942
+ "": 763,
943
+ "": 764,
944
+ "": 765,
945
+ "": 766,
946
+ "": 767,
947
+ "": 768,
948
+ "": 769,
949
+ "": 770,
950
+ "": 771,
951
+ "": 772,
952
+ "": 773,
953
+ "": 774,
954
+ "": 775,
955
+ "": 776,
956
+ "": 777,
957
+ "": 778,
958
+ "": 779,
959
+ "": 780,
960
+ "": 781,
961
+ "": 782,
962
+ "": 783,
963
+ "": 784,
964
+ "": 785,
965
+ "": 786,
966
+ "": 787,
967
+ "": 788,
968
+ "": 789,
969
+ "": 790,
970
+ "": 791,
971
+ "": 792,
972
+ "": 793,
973
+ "": 794,
974
+ "": 795,
975
+ "": 796,
976
+ "": 797,
977
+ "": 798,
978
+ "": 799,
979
+ "": 800,
980
+ "": 801,
981
+ "": 802,
982
+ "": 803,
983
+ "": 804,
984
+ "": 805,
985
+ "": 806,
986
+ "": 807,
987
+ "": 808,
988
+ "": 809,
989
+ "": 810,
990
+ "": 811,
991
+ "": 812,
992
+ "": 813,
993
+ "": 814,
994
+ "": 815,
995
+ "": 816,
996
+ "": 817,
997
+ "": 818,
998
+ "": 819,
999
+ "": 820,
1000
+ "": 821,
1001
+ "": 822,
1002
+ "": 823,
1003
+ "": 824,
1004
+ "": 825,
1005
+ "": 826,
1006
+ "": 827,
1007
+ "": 828,
1008
+ "": 829,
1009
+ "": 830,
1010
+ "": 831,
1011
+ "": 832,
1012
+ "": 833,
1013
+ "": 834,
1014
+ "": 835,
1015
+ "": 836,
1016
+ "": 837,
1017
+ "": 838,
1018
+ "": 839,
1019
+ "": 840
1020
+ }
1021
+ }
1022
+ }
dictabert-char-spacefix/tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[UNK]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[CLS]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[SEP]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[PAD]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "5": {
44
+ "content": "[BLANK]",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ }
51
+ },
52
+ "clean_up_tokenization_spaces": true,
53
+ "cls_token": "[CLS]",
54
+ "do_lower_case": true,
55
+ "extra_special_tokens": {},
56
+ "mask_token": "[MASK]",
57
+ "model_max_length": 2048,
58
+ "pad_token": "[PAD]",
59
+ "sep_token": "[SEP]",
60
+ "strip_accents": null,
61
+ "tokenize_chinese_chars": true,
62
+ "tokenizer_class": "BertTokenizer",
63
+ "unk_token": "[UNK]"
64
+ }
dictabert-char-spacefix/vocab.txt ADDED
Binary file (3.01 kB). View file
 
dictabert-heq/.gitattributes ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ vocab.txt filter=lfs diff=lfs merge=lfs -text
dictabert-heq/LICENSE ADDED
@@ -0,0 +1,395 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Attribution 4.0 International
2
+
3
+ =======================================================================
4
+
5
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
6
+ does not provide legal services or legal advice. Distribution of
7
+ Creative Commons public licenses does not create a lawyer-client or
8
+ other relationship. Creative Commons makes its licenses and related
9
+ information available on an "as-is" basis. Creative Commons gives no
10
+ warranties regarding its licenses, any material licensed under their
11
+ terms and conditions, or any related information. Creative Commons
12
+ disclaims all liability for damages resulting from their use to the
13
+ fullest extent possible.
14
+
15
+ Using Creative Commons Public Licenses
16
+
17
+ Creative Commons public licenses provide a standard set of terms and
18
+ conditions that creators and other rights holders may use to share
19
+ original works of authorship and other material subject to copyright
20
+ and certain other rights specified in the public license below. The
21
+ following considerations are for informational purposes only, are not
22
+ exhaustive, and do not form part of our licenses.
23
+
24
+ Considerations for licensors: Our public licenses are
25
+ intended for use by those authorized to give the public
26
+ permission to use material in ways otherwise restricted by
27
+ copyright and certain other rights. Our licenses are
28
+ irrevocable. Licensors should read and understand the terms
29
+ and conditions of the license they choose before applying it.
30
+ Licensors should also secure all rights necessary before
31
+ applying our licenses so that the public can reuse the
32
+ material as expected. Licensors should clearly mark any
33
+ material not subject to the license. This includes other CC-
34
+ licensed material, or material used under an exception or
35
+ limitation to copyright. More considerations for licensors:
36
+ wiki.creativecommons.org/Considerations_for_licensors
37
+
38
+ Considerations for the public: By using one of our public
39
+ licenses, a licensor grants the public permission to use the
40
+ licensed material under specified terms and conditions. If
41
+ the licensor's permission is not necessary for any reason--for
42
+ example, because of any applicable exception or limitation to
43
+ copyright--then that use is not regulated by the license. Our
44
+ licenses grant only permissions under copyright and certain
45
+ other rights that a licensor has authority to grant. Use of
46
+ the licensed material may still be restricted for other
47
+ reasons, including because others have copyright or other
48
+ rights in the material. A licensor may make special requests,
49
+ such as asking that all changes be marked or described.
50
+ Although not required by our licenses, you are encouraged to
51
+ respect those requests where reasonable. More_considerations
52
+ for the public:
53
+ wiki.creativecommons.org/Considerations_for_licensees
54
+
55
+ =======================================================================
56
+
57
+ Creative Commons Attribution 4.0 International Public License
58
+
59
+ By exercising the Licensed Rights (defined below), You accept and agree
60
+ to be bound by the terms and conditions of this Creative Commons
61
+ Attribution 4.0 International Public License ("Public License"). To the
62
+ extent this Public License may be interpreted as a contract, You are
63
+ granted the Licensed Rights in consideration of Your acceptance of
64
+ these terms and conditions, and the Licensor grants You such rights in
65
+ consideration of benefits the Licensor receives from making the
66
+ Licensed Material available under these terms and conditions.
67
+
68
+
69
+ Section 1 -- Definitions.
70
+
71
+ a. Adapted Material means material subject to Copyright and Similar
72
+ Rights that is derived from or based upon the Licensed Material
73
+ and in which the Licensed Material is translated, altered,
74
+ arranged, transformed, or otherwise modified in a manner requiring
75
+ permission under the Copyright and Similar Rights held by the
76
+ Licensor. For purposes of this Public License, where the Licensed
77
+ Material is a musical work, performance, or sound recording,
78
+ Adapted Material is always produced where the Licensed Material is
79
+ synched in timed relation with a moving image.
80
+
81
+ b. Adapter's License means the license You apply to Your Copyright
82
+ and Similar Rights in Your contributions to Adapted Material in
83
+ accordance with the terms and conditions of this Public License.
84
+
85
+ c. Copyright and Similar Rights means copyright and/or similar rights
86
+ closely related to copyright including, without limitation,
87
+ performance, broadcast, sound recording, and Sui Generis Database
88
+ Rights, without regard to how the rights are labeled or
89
+ categorized. For purposes of this Public License, the rights
90
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
91
+ Rights.
92
+
93
+ d. Effective Technological Measures means those measures that, in the
94
+ absence of proper authority, may not be circumvented under laws
95
+ fulfilling obligations under Article 11 of the WIPO Copyright
96
+ Treaty adopted on December 20, 1996, and/or similar international
97
+ agreements.
98
+
99
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
100
+ any other exception or limitation to Copyright and Similar Rights
101
+ that applies to Your use of the Licensed Material.
102
+
103
+ f. Licensed Material means the artistic or literary work, database,
104
+ or other material to which the Licensor applied this Public
105
+ License.
106
+
107
+ g. Licensed Rights means the rights granted to You subject to the
108
+ terms and conditions of this Public License, which are limited to
109
+ all Copyright and Similar Rights that apply to Your use of the
110
+ Licensed Material and that the Licensor has authority to license.
111
+
112
+ h. Licensor means the individual(s) or entity(ies) granting rights
113
+ under this Public License.
114
+
115
+ i. Share means to provide material to the public by any means or
116
+ process that requires permission under the Licensed Rights, such
117
+ as reproduction, public display, public performance, distribution,
118
+ dissemination, communication, or importation, and to make material
119
+ available to the public including in ways that members of the
120
+ public may access the material from a place and at a time
121
+ individually chosen by them.
122
+
123
+ j. Sui Generis Database Rights means rights other than copyright
124
+ resulting from Directive 96/9/EC of the European Parliament and of
125
+ the Council of 11 March 1996 on the legal protection of databases,
126
+ as amended and/or succeeded, as well as other essentially
127
+ equivalent rights anywhere in the world.
128
+
129
+ k. You means the individual or entity exercising the Licensed Rights
130
+ under this Public License. Your has a corresponding meaning.
131
+
132
+
133
+ Section 2 -- Scope.
134
+
135
+ a. License grant.
136
+
137
+ 1. Subject to the terms and conditions of this Public License,
138
+ the Licensor hereby grants You a worldwide, royalty-free,
139
+ non-sublicensable, non-exclusive, irrevocable license to
140
+ exercise the Licensed Rights in the Licensed Material to:
141
+
142
+ a. reproduce and Share the Licensed Material, in whole or
143
+ in part; and
144
+
145
+ b. produce, reproduce, and Share Adapted Material.
146
+
147
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
148
+ Exceptions and Limitations apply to Your use, this Public
149
+ License does not apply, and You do not need to comply with
150
+ its terms and conditions.
151
+
152
+ 3. Term. The term of this Public License is specified in Section
153
+ 6(a).
154
+
155
+ 4. Media and formats; technical modifications allowed. The
156
+ Licensor authorizes You to exercise the Licensed Rights in
157
+ all media and formats whether now known or hereafter created,
158
+ and to make technical modifications necessary to do so. The
159
+ Licensor waives and/or agrees not to assert any right or
160
+ authority to forbid You from making technical modifications
161
+ necessary to exercise the Licensed Rights, including
162
+ technical modifications necessary to circumvent Effective
163
+ Technological Measures. For purposes of this Public License,
164
+ simply making modifications authorized by this Section 2(a)
165
+ (4) never produces Adapted Material.
166
+
167
+ 5. Downstream recipients.
168
+
169
+ a. Offer from the Licensor -- Licensed Material. Every
170
+ recipient of the Licensed Material automatically
171
+ receives an offer from the Licensor to exercise the
172
+ Licensed Rights under the terms and conditions of this
173
+ Public License.
174
+
175
+ b. No downstream restrictions. You may not offer or impose
176
+ any additional or different terms or conditions on, or
177
+ apply any Effective Technological Measures to, the
178
+ Licensed Material if doing so restricts exercise of the
179
+ Licensed Rights by any recipient of the Licensed
180
+ Material.
181
+
182
+ 6. No endorsement. Nothing in this Public License constitutes or
183
+ may be construed as permission to assert or imply that You
184
+ are, or that Your use of the Licensed Material is, connected
185
+ with, or sponsored, endorsed, or granted official status by,
186
+ the Licensor or others designated to receive attribution as
187
+ provided in Section 3(a)(1)(A)(i).
188
+
189
+ b. Other rights.
190
+
191
+ 1. Moral rights, such as the right of integrity, are not
192
+ licensed under this Public License, nor are publicity,
193
+ privacy, and/or other similar personality rights; however, to
194
+ the extent possible, the Licensor waives and/or agrees not to
195
+ assert any such rights held by the Licensor to the limited
196
+ extent necessary to allow You to exercise the Licensed
197
+ Rights, but not otherwise.
198
+
199
+ 2. Patent and trademark rights are not licensed under this
200
+ Public License.
201
+
202
+ 3. To the extent possible, the Licensor waives any right to
203
+ collect royalties from You for the exercise of the Licensed
204
+ Rights, whether directly or through a collecting society
205
+ under any voluntary or waivable statutory or compulsory
206
+ licensing scheme. In all other cases the Licensor expressly
207
+ reserves any right to collect such royalties.
208
+
209
+
210
+ Section 3 -- License Conditions.
211
+
212
+ Your exercise of the Licensed Rights is expressly made subject to the
213
+ following conditions.
214
+
215
+ a. Attribution.
216
+
217
+ 1. If You Share the Licensed Material (including in modified
218
+ form), You must:
219
+
220
+ a. retain the following if it is supplied by the Licensor
221
+ with the Licensed Material:
222
+
223
+ i. identification of the creator(s) of the Licensed
224
+ Material and any others designated to receive
225
+ attribution, in any reasonable manner requested by
226
+ the Licensor (including by pseudonym if
227
+ designated);
228
+
229
+ ii. a copyright notice;
230
+
231
+ iii. a notice that refers to this Public License;
232
+
233
+ iv. a notice that refers to the disclaimer of
234
+ warranties;
235
+
236
+ v. a URI or hyperlink to the Licensed Material to the
237
+ extent reasonably practicable;
238
+
239
+ b. indicate if You modified the Licensed Material and
240
+ retain an indication of any previous modifications; and
241
+
242
+ c. indicate the Licensed Material is licensed under this
243
+ Public License, and include the text of, or the URI or
244
+ hyperlink to, this Public License.
245
+
246
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
247
+ reasonable manner based on the medium, means, and context in
248
+ which You Share the Licensed Material. For example, it may be
249
+ reasonable to satisfy the conditions by providing a URI or
250
+ hyperlink to a resource that includes the required
251
+ information.
252
+
253
+ 3. If requested by the Licensor, You must remove any of the
254
+ information required by Section 3(a)(1)(A) to the extent
255
+ reasonably practicable.
256
+
257
+ 4. If You Share Adapted Material You produce, the Adapter's
258
+ License You apply must not prevent recipients of the Adapted
259
+ Material from complying with this Public License.
260
+
261
+
262
+ Section 4 -- Sui Generis Database Rights.
263
+
264
+ Where the Licensed Rights include Sui Generis Database Rights that
265
+ apply to Your use of the Licensed Material:
266
+
267
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
268
+ to extract, reuse, reproduce, and Share all or a substantial
269
+ portion of the contents of the database;
270
+
271
+ b. if You include all or a substantial portion of the database
272
+ contents in a database in which You have Sui Generis Database
273
+ Rights, then the database in which You have Sui Generis Database
274
+ Rights (but not its individual contents) is Adapted Material; and
275
+
276
+ c. You must comply with the conditions in Section 3(a) if You Share
277
+ all or a substantial portion of the contents of the database.
278
+
279
+ For the avoidance of doubt, this Section 4 supplements and does not
280
+ replace Your obligations under this Public License where the Licensed
281
+ Rights include other Copyright and Similar Rights.
282
+
283
+
284
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
285
+
286
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
287
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
288
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
289
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
290
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
291
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
292
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
293
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
294
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
295
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
296
+
297
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
298
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
299
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
300
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
301
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
302
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
303
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
304
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
305
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
306
+
307
+ c. The disclaimer of warranties and limitation of liability provided
308
+ above shall be interpreted in a manner that, to the extent
309
+ possible, most closely approximates an absolute disclaimer and
310
+ waiver of all liability.
311
+
312
+
313
+ Section 6 -- Term and Termination.
314
+
315
+ a. This Public License applies for the term of the Copyright and
316
+ Similar Rights licensed here. However, if You fail to comply with
317
+ this Public License, then Your rights under this Public License
318
+ terminate automatically.
319
+
320
+ b. Where Your right to use the Licensed Material has terminated under
321
+ Section 6(a), it reinstates:
322
+
323
+ 1. automatically as of the date the violation is cured, provided
324
+ it is cured within 30 days of Your discovery of the
325
+ violation; or
326
+
327
+ 2. upon express reinstatement by the Licensor.
328
+
329
+ For the avoidance of doubt, this Section 6(b) does not affect any
330
+ right the Licensor may have to seek remedies for Your violations
331
+ of this Public License.
332
+
333
+ c. For the avoidance of doubt, the Licensor may also offer the
334
+ Licensed Material under separate terms or conditions or stop
335
+ distributing the Licensed Material at any time; however, doing so
336
+ will not terminate this Public License.
337
+
338
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
339
+ License.
340
+
341
+
342
+ Section 7 -- Other Terms and Conditions.
343
+
344
+ a. The Licensor shall not be bound by any additional or different
345
+ terms or conditions communicated by You unless expressly agreed.
346
+
347
+ b. Any arrangements, understandings, or agreements regarding the
348
+ Licensed Material not stated herein are separate from and
349
+ independent of the terms and conditions of this Public License.
350
+
351
+
352
+ Section 8 -- Interpretation.
353
+
354
+ a. For the avoidance of doubt, this Public License does not, and
355
+ shall not be interpreted to, reduce, limit, restrict, or impose
356
+ conditions on any use of the Licensed Material that could lawfully
357
+ be made without permission under this Public License.
358
+
359
+ b. To the extent possible, if any provision of this Public License is
360
+ deemed unenforceable, it shall be automatically reformed to the
361
+ minimum extent necessary to make it enforceable. If the provision
362
+ cannot be reformed, it shall be severed from this Public License
363
+ without affecting the enforceability of the remaining terms and
364
+ conditions.
365
+
366
+ c. No term or condition of this Public License will be waived and no
367
+ failure to comply consented to unless expressly agreed to by the
368
+ Licensor.
369
+
370
+ d. Nothing in this Public License constitutes or may be interpreted
371
+ as a limitation upon, or waiver of, any privileges and immunities
372
+ that apply to the Licensor or You, including from the legal
373
+ processes of any jurisdiction or authority.
374
+
375
+
376
+ =======================================================================
377
+
378
+ Creative Commons is not a party to its public
379
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
380
+ its public licenses to material it publishes and in those instances
381
+ will be considered the “Licensor.” The text of the Creative Commons
382
+ public licenses is dedicated to the public domain under the CC0 Public
383
+ Domain Dedication. Except for the limited purpose of indicating that
384
+ material is shared under a Creative Commons public license or as
385
+ otherwise permitted by the Creative Commons policies published at
386
+ creativecommons.org/policies, Creative Commons does not authorize the
387
+ use of the trademark "Creative Commons" or any other trademark or logo
388
+ of Creative Commons without its prior written consent including,
389
+ without limitation, in connection with any unauthorized modifications
390
+ to any of its public licenses or any other arrangements,
391
+ understandings, or agreements concerning use of licensed material. For
392
+ the avoidance of doubt, this paragraph does not form part of the
393
+ public licenses.
394
+
395
+ Creative Commons may be contacted at creativecommons.org.
dictabert-heq/README.md ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: cc-by-4.0
3
+ language:
4
+ - he
5
+ inference: false
6
+ ---
7
+ # DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew
8
+
9
+ State-of-the-art language model for Hebrew, released [here](https://arxiv.org/abs/2308.16687).
10
+
11
+ This is the fine-tuned model for the question-answering task using the [HeQ](https://u.cs.biu.ac.il/~yogo/heq.pdf) dataset.
12
+
13
+ For the bert-base models for other tasks, see [here](https://huggingface.co/collections/dicta-il/dictabert-6588e7cc08f83845fc42a18b).
14
+
15
+ Sample usage:
16
+
17
+ ```python
18
+ from transformers import pipeline
19
+
20
+ oracle = pipeline('question-answering', model='dicta-il/dictabert-heq')
21
+
22
+
23
+ context = 'בניית פרופילים של משתמשים נחשבת על ידי רבים כאיום פוטנציאלי על הפרטיות. מסיבה זו הגבילו חלק מהמדינות באמצעות חקיקה את המידע שניתן להשיג באמצעות עוגיות ואת אופן השימוש בעוגיות. ארצות הברית, למשל, קבעה חוקים נוקשים בכל הנוגע ליצירת עוגיות חדשות. חוקים אלו, אשר נקבעו בשנת 2000, נקבעו לאחר שנחשף כי המשרד ליישום המדיניות של הממשל האמריקאי נגד השימוש בסמים (ONDCP) בבית הלבן השתמש בעוגיות כדי לעקוב אחרי משתמשים שצפו בפרסומות נגד השימוש בסמים במטרה לבדוק האם משתמשים אלו נכנסו לאתרים התומכים בשימוש בסמים. דניאל בראנט, פעיל הדוגל בפרטיות המשתמשים באינטרנט, חשף כי ה-CIA שלח עוגיות קבועות למחשבי אזרחים במשך עשר שנים. ב-25 בדצמבר 2005 גילה בראנט כי הסוכנות לביטחון לאומי (ה-NSA) השאירה שתי עוגיות קבועות במחשבי מבקרים בגלל שדרוג תוכנה. לאחר שהנושא פורסם, הם ביטלו מיד את השימוש בהן.'
24
+ question = 'כיצד הוגבל המידע שניתן להשיג באמצעות העוגיות?'
25
+
26
+ oracle(question=question, context=context)
27
+ ```
28
+
29
+ Output:
30
+ ```json
31
+ {
32
+ "score": 0.998887836933136,
33
+ "start": 101,
34
+ "end": 114,
35
+ "answer": "באמצעות חקיקה"
36
+ }
37
+ ```
38
+
39
+ ## Citation
40
+
41
+ If you use DictaBERT in your research, please cite ```DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew```
42
+
43
+ **BibTeX:**
44
+
45
+ ```bibtex
46
+ @misc{shmidman2023dictabert,
47
+ title={DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew},
48
+ author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel},
49
+ year={2023},
50
+ eprint={2308.16687},
51
+ archivePrefix={arXiv},
52
+ primaryClass={cs.CL}
53
+ }
54
+ ```
55
+
56
+ ## License
57
+
58
+ Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
59
+
60
+ This work is licensed under a
61
+ [Creative Commons Attribution 4.0 International License][cc-by].
62
+
63
+ [![CC BY 4.0][cc-by-image]][cc-by]
64
+
65
+ [cc-by]: http://creativecommons.org/licenses/by/4.0/
66
+ [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
67
+ [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
68
+
69
+
70
+
71
+
dictabert-heq/config.json ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "BertForQuestionAnswering"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "classifier_dropout": null,
7
+ "gradient_checkpointing": false,
8
+ "hidden_act": "gelu",
9
+ "hidden_dropout_prob": 0.1,
10
+ "hidden_size": 768,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 3072,
13
+ "layer_norm_eps": 1e-12,
14
+ "max_position_embeddings": 512,
15
+ "model_type": "bert",
16
+ "newmodern": true,
17
+ "num_attention_heads": 12,
18
+ "num_hidden_layers": 12,
19
+ "pad_token_id": 0,
20
+ "position_embedding_type": "absolute",
21
+ "torch_dtype": "float32",
22
+ "transformers_version": "4.31.0",
23
+ "type_vocab_size": 2,
24
+ "use_cache": true,
25
+ "vocab_size": 128000
26
+ }
dictabert-heq/issues.txt ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ------------------------------------------------------------------------------
2
+ #1 Adding `safetensors` variant of this model
3
+ ------------------------------------------------------------------------------
4
+
5
+ [SFconvertbot] Oct 27, 2024
6
+
7
+ This is an automated PR created with https://huggingface.co/spaces/safetensors/convert
8
+
9
+ This new file is equivalent to pytorch_model.bin but safe in the sense that
10
+ no arbitrary code can be put into it.
11
+
12
+ These files also happen to load much faster than their pytorch counterpart:
13
+ https://colab.research.google.com/github/huggingface/notebooks/blob/main/safetensors_doc/en/speed.ipynb
14
+
15
+ The widgets on your model page will run using this model even if this is not merged
16
+ making sure the file actually works.
17
+
18
+ If you find any issues: please report here: https://huggingface.co/spaces/safetensors/convert/discussions
19
+
20
+ Feel free to ignore this PR.
dictabert-heq/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3d14a6e3dabe859df3862034077116fa796cfdf963c39ad05e99c9d2b375681
3
+ size 735092905
dictabert-heq/source.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ https://huggingface.co/dicta-il/dictabert-heq
dictabert-heq/speed.ipynb ADDED
@@ -0,0 +1,220 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "Sm51_Do2Uh_y"
7
+ },
8
+ "source": [
9
+ "<!-- DISABLE-FRONTMATTER-SECTIONS -->"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "markdown",
14
+ "metadata": {
15
+ "id": "hiKkSKRdUh_0"
16
+ },
17
+ "source": [
18
+ "# Speed Comparison"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "markdown",
23
+ "metadata": {
24
+ "id": "63f59hZcUh_0"
25
+ },
26
+ "source": [
27
+ "`Safetensors` is really fast. Let's compare it against `PyTorch` by loading [gpt2](https://huggingface.co/gpt2) weights. To run the [GPU benchmark](#gpu-benchmark), make sure your machine has GPU or you have selected `GPU runtime` if you are using Google Colab.\n",
28
+ "\n",
29
+ "Before you begin, make sure you have all the necessary libraries installed:"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "code",
34
+ "execution_count": null,
35
+ "metadata": {
36
+ "id": "FVUx03_SUh_0"
37
+ },
38
+ "outputs": [],
39
+ "source": [
40
+ "!pip install safetensors huggingface_hub torch"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "markdown",
45
+ "metadata": {
46
+ "id": "lKRAmkNBUh_1"
47
+ },
48
+ "source": [
49
+ "Let's start by importing all the packages that will be used:"
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {
56
+ "id": "aj8sFZhlUh_1"
57
+ },
58
+ "outputs": [],
59
+ "source": [
60
+ "import os\n",
61
+ "import datetime\n",
62
+ "from huggingface_hub import hf_hub_download\n",
63
+ "from safetensors.torch import load_file\n",
64
+ "import torch"
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "markdown",
69
+ "metadata": {
70
+ "id": "Ddd_qKtzUh_1"
71
+ },
72
+ "source": [
73
+ "Download safetensors & torch weights for gpt2:"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "code",
78
+ "execution_count": null,
79
+ "metadata": {
80
+ "id": "-7ESrRyDUh_2"
81
+ },
82
+ "outputs": [],
83
+ "source": [
84
+ "sf_filename = hf_hub_download(\"gpt2\", filename=\"model.safetensors\")\n",
85
+ "pt_filename = hf_hub_download(\"gpt2\", filename=\"pytorch_model.bin\")"
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "markdown",
90
+ "metadata": {
91
+ "id": "jeriUWJxUh_2"
92
+ },
93
+ "source": [
94
+ "### CPU benchmark"
95
+ ]
96
+ },
97
+ {
98
+ "cell_type": "code",
99
+ "execution_count": null,
100
+ "metadata": {
101
+ "id": "jclEP0Z8Uh_2",
102
+ "outputId": "3b057f7e-d98a-458f-ab19-df77a3e38b55"
103
+ },
104
+ "outputs": [
105
+ {
106
+ "data": {
107
+ "text/plain": [
108
+ "Loaded safetensors 0:00:00.004015\n",
109
+ "Loaded pytorch 0:00:00.307460\n",
110
+ "on CPU, safetensors is faster than pytorch by: 76.6 X"
111
+ ]
112
+ },
113
+ "execution_count": null,
114
+ "metadata": {},
115
+ "output_type": "execute_result"
116
+ }
117
+ ],
118
+ "source": [
119
+ "start_st = datetime.datetime.now()\n",
120
+ "weights = load_file(sf_filename, device=\"cpu\")\n",
121
+ "load_time_st = datetime.datetime.now() - start_st\n",
122
+ "print(f\"Loaded safetensors {load_time_st}\")\n",
123
+ "\n",
124
+ "start_pt = datetime.datetime.now()\n",
125
+ "weights = torch.load(pt_filename, map_location=\"cpu\")\n",
126
+ "load_time_pt = datetime.datetime.now() - start_pt\n",
127
+ "print(f\"Loaded pytorch {load_time_pt}\")\n",
128
+ "\n",
129
+ "print(f\"on CPU, safetensors is faster than pytorch by: {load_time_pt/load_time_st:.1f} X\")"
130
+ ]
131
+ },
132
+ {
133
+ "cell_type": "markdown",
134
+ "metadata": {
135
+ "id": "bJ0AvxqSUh_2"
136
+ },
137
+ "source": [
138
+ "This speedup is due to the fact that this library avoids unnecessary copies by mapping the file directly. It is actually possible to do on [pure pytorch](https://gist.github.com/Narsil/3edeec2669a5e94e4707aa0f901d2282).\n",
139
+ "The currently shown speedup was gotten on:\n",
140
+ "* OS: Ubuntu 18.04.6 LTS\n",
141
+ "* CPU: Intel(R) Xeon(R) CPU @ 2.00GHz"
142
+ ]
143
+ },
144
+ {
145
+ "cell_type": "markdown",
146
+ "metadata": {
147
+ "id": "THYSS8dZUh_3"
148
+ },
149
+ "source": [
150
+ "### GPU benchmark"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": null,
156
+ "metadata": {
157
+ "id": "B0uQ1T3oUh_3",
158
+ "outputId": "98bd8eb2-b82a-4200-f3d4-6c2399a28647"
159
+ },
160
+ "outputs": [
161
+ {
162
+ "data": {
163
+ "text/plain": [
164
+ "Loaded safetensors 0:00:00.165206\n",
165
+ "Loaded pytorch 0:00:00.353889\n",
166
+ "on GPU, safetensors is faster than pytorch by: 2.1 X"
167
+ ]
168
+ },
169
+ "execution_count": null,
170
+ "metadata": {},
171
+ "output_type": "execute_result"
172
+ }
173
+ ],
174
+ "source": [
175
+ "# This is required because this feature hasn't been fully verified yet, but\n",
176
+ "# it's been tested on many different environments\n",
177
+ "os.environ[\"SAFETENSORS_FAST_GPU\"] = \"1\"\n",
178
+ "\n",
179
+ "# CUDA startup out of the measurement\n",
180
+ "torch.zeros((2, 2)).cuda()\n",
181
+ "\n",
182
+ "start_st = datetime.datetime.now()\n",
183
+ "weights = load_file(sf_filename, device=\"cuda:0\")\n",
184
+ "load_time_st = datetime.datetime.now() - start_st\n",
185
+ "print(f\"Loaded safetensors {load_time_st}\")\n",
186
+ "\n",
187
+ "start_pt = datetime.datetime.now()\n",
188
+ "weights = torch.load(pt_filename, map_location=\"cuda:0\")\n",
189
+ "load_time_pt = datetime.datetime.now() - start_pt\n",
190
+ "print(f\"Loaded pytorch {load_time_pt}\")\n",
191
+ "\n",
192
+ "print(f\"on GPU, safetensors is faster than pytorch by: {load_time_pt/load_time_st:.1f} X\")"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "markdown",
197
+ "metadata": {
198
+ "id": "0zjAzX_FUh_3"
199
+ },
200
+ "source": [
201
+ "The speedup works because this library is able to skip unnecessary CPU allocations. It is unfortunately not replicable in pure pytorch as far as we know. The library works by memory mapping the file, creating the tensor empty with pytorch and calling `cudaMemcpy` directly to move the tensor onto the GPU.\n",
202
+ "The currently shown speedup was gotten on:\n",
203
+ "* OS: Ubuntu 18.04.6 LTS.\n",
204
+ "* GPU: Tesla T4\n",
205
+ "* Driver Version: 460.32.03\n",
206
+ "* CUDA Version: 11.2"
207
+ ]
208
+ }
209
+ ],
210
+ "metadata": {
211
+ "language_info": {
212
+ "name": "python"
213
+ },
214
+ "colab": {
215
+ "provenance": []
216
+ }
217
+ },
218
+ "nbformat": 4,
219
+ "nbformat_minor": 0
220
+ }
dictabert-heq/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
dictabert-heq/tokenizer_config.json ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "clean_up_tokenization_spaces": true,
3
+ "cls_token": "[CLS]",
4
+ "do_lower_case": true,
5
+ "mask_token": "[MASK]",
6
+ "model_max_length": 512,
7
+ "pad_token": "[PAD]",
8
+ "sep_token": "[SEP]",
9
+ "strip_accents": null,
10
+ "tokenize_chinese_chars": true,
11
+ "tokenizer_class": "BertTokenizer",
12
+ "unk_token": "[UNK]"
13
+ }
dictabert-heq/vocab.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
3
+ size 1500244
dictabert-joint/.gitattributes ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ vocab.txt filter=lfs diff=lfs merge=lfs -text
dictabert-joint/BertForJointParsing.py ADDED
@@ -0,0 +1,534 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from dataclasses import dataclass
2
+ import re
3
+ from operator import itemgetter
4
+ import torch
5
+ from torch import nn
6
+ from typing import Any, Dict, List, Literal, Optional, Tuple, Union
7
+ from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
8
+ from transformers.models.bert.modeling_bert import BertOnlyMLMHead
9
+ from transformers.utils import ModelOutput
10
+ from .BertForSyntaxParsing import BertSyntaxParsingHead, SyntaxLabels, SyntaxLogitsOutput, parse_logits as syntax_parse_logits
11
+ from .BertForPrefixMarking import BertPrefixMarkingHead, parse_logits as prefix_parse_logits, encode_sentences_for_bert_for_prefix_marking, get_prefixes_from_str
12
+ from .BertForMorphTagging import BertMorphTaggingHead, MorphLogitsOutput, MorphLabels, parse_logits as morph_parse_logits
13
+
14
+ import warnings
15
+
16
@dataclass
class JointParsingOutput(ModelOutput):
    """Aggregated output of BertForJointParsing.forward.

    When training, `loss`/`logits` hold the loss and predictions of the single
    head selected by `labels_type`; the per-head `*_logits` fields are filled
    for every head that ran. When no labels are given, `logits` is left None
    and only the per-head fields are populated.
    """
    loss: Optional[torch.FloatTensor] = None
    # logits will contain the optional predictions for the given labels
    logits: Optional[Union[SyntaxLogitsOutput, None]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    # if no labels are given, we will always include the syntax logits separately
    syntax_logits: Optional[SyntaxLogitsOutput] = None
    ner_logits: Optional[torch.FloatTensor] = None
    prefix_logits: Optional[torch.FloatTensor] = None
    lex_logits: Optional[torch.FloatTensor] = None
    morph_logits: Optional[MorphLogitsOutput] = None
29
+
30
class ModuleRef:
    """Unregistered handle around a torch.nn.Module.

    Storing the same module under several linked attributes of a parent model
    would register its parameters more than once; keeping it behind this plain
    wrapper avoids that while still letting callers invoke it transparently.
    """

    def __init__(self, module: torch.nn.Module):
        self.module = module

    def forward(self, *args, **kwargs):
        # Delegate explicitly to the wrapped module's forward().
        return self.module.forward(*args, **kwargs)

    def __call__(self, *args, **kwargs):
        # Delegate the call protocol to the wrapped module itself.
        return self.module(*args, **kwargs)
41
+
42
+ class BertForJointParsing(BertPreTrainedModel):
43
+ _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"]
44
+
45
+ def __init__(self, config, do_syntax=None, do_ner=None, do_prefix=None, do_lex=None, do_morph=None, syntax_head_size=64):
46
+ super().__init__(config)
47
+
48
+ self.bert = BertModel(config, add_pooling_layer=False)
49
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
50
+ # create all the heads as None, and then populate them as defined
51
+ self.syntax, self.ner, self.prefix, self.lex, self.morph = (None,)*5
52
+
53
+ if do_syntax is not None:
54
+ config.do_syntax = do_syntax
55
+ config.syntax_head_size = syntax_head_size
56
+ if do_ner is not None: config.do_ner = do_ner
57
+ if do_prefix is not None: config.do_prefix = do_prefix
58
+ if do_lex is not None: config.do_lex = do_lex
59
+ if do_morph is not None: config.do_morph = do_morph
60
+
61
+ # add all the individual heads
62
+ if config.do_syntax:
63
+ self.syntax = BertSyntaxParsingHead(config)
64
+ if config.do_ner:
65
+ self.num_labels = config.num_labels
66
+ self.classifier = nn.Linear(config.hidden_size, config.num_labels) # name it same as in BertForTokenClassification
67
+ self.ner = ModuleRef(self.classifier)
68
+ if config.do_prefix:
69
+ self.prefix = BertPrefixMarkingHead(config)
70
+ if config.do_lex:
71
+ self.cls = BertOnlyMLMHead(config) # name it the same as in BertForMaskedLM
72
+ self.lex = ModuleRef(self.cls)
73
+ if config.do_morph:
74
+ self.morph = BertMorphTaggingHead(config)
75
+
76
+ # Initialize weights and apply final processing
77
+ self.post_init()
78
+
79
+ def get_output_embeddings(self):
80
+ return self.cls.predictions.decoder if self.lex is not None else None
81
+
82
+ def set_output_embeddings(self, new_embeddings):
83
+ if self.lex is not None:
84
+
85
+ self.cls.predictions.decoder = new_embeddings
86
+
87
+ def forward(
88
+ self,
89
+ input_ids: Optional[torch.Tensor] = None,
90
+ attention_mask: Optional[torch.Tensor] = None,
91
+ token_type_ids: Optional[torch.Tensor] = None,
92
+ position_ids: Optional[torch.Tensor] = None,
93
+ prefix_class_id_options: Optional[torch.Tensor] = None,
94
+ labels: Optional[Union[SyntaxLabels, MorphLabels, torch.Tensor]] = None,
95
+ labels_type: Optional[Literal['syntax', 'ner', 'prefix', 'lex', 'morph']] = None,
96
+ head_mask: Optional[torch.Tensor] = None,
97
+ inputs_embeds: Optional[torch.Tensor] = None,
98
+ output_attentions: Optional[bool] = None,
99
+ output_hidden_states: Optional[bool] = None,
100
+ return_dict: Optional[bool] = None,
101
+ compute_syntax_mst: Optional[bool] = None
102
+ ):
103
+ if return_dict is False:
104
+ warnings.warn("Specified `return_dict=False` but the flag is ignored and treated as always True in this model.")
105
+
106
+ if labels is not None and labels_type is None:
107
+ raise ValueError("Cannot specify labels without labels_type")
108
+
109
+ if labels_type == 'seg' and prefix_class_id_options is None:
110
+ raise ValueError('Cannot calculate prefix logits without prefix_class_id_options')
111
+
112
+ if compute_syntax_mst is not None and self.syntax is None:
113
+ raise ValueError("Cannot compute syntax MST when the syntax head isn't loaded")
114
+
115
+
116
+ bert_outputs = self.bert(
117
+ input_ids,
118
+ attention_mask=attention_mask,
119
+ token_type_ids=token_type_ids,
120
+ position_ids=position_ids,
121
+ head_mask=head_mask,
122
+ inputs_embeds=inputs_embeds,
123
+ output_attentions=output_attentions,
124
+ output_hidden_states=output_hidden_states,
125
+ return_dict=True,
126
+ )
127
+
128
+ # calculate the extended attention mask for any child that might need it
129
+ extended_attention_mask = None
130
+ if attention_mask is not None:
131
+ extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_ids.size())
132
+
133
+ # extract the hidden states, and apply the dropout
134
+ hidden_states = self.dropout(bert_outputs[0])
135
+
136
+ logits = None
137
+ syntax_logits = None
138
+ ner_logits = None
139
+ prefix_logits = None
140
+ lex_logits = None
141
+ morph_logits = None
142
+
143
+ # Calculate the syntax
144
+ if self.syntax is not None and (labels is None or labels_type == 'syntax'):
145
+ # apply the syntax head
146
+ loss, syntax_logits = self.syntax(hidden_states, extended_attention_mask, labels, compute_syntax_mst)
147
+ logits = syntax_logits
148
+
149
+ # Calculate the NER
150
+ if self.ner is not None and (labels is None or labels_type == 'ner'):
151
+ ner_logits = self.ner(hidden_states)
152
+ logits = ner_logits
153
+ if labels is not None:
154
+ loss_fct = nn.CrossEntropyLoss()
155
+ loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
156
+
157
+ # Calculate the segmentation
158
+ if self.prefix is not None and (labels is None or labels_type == 'prefix'):
159
+ loss, prefix_logits = self.prefix(hidden_states, prefix_class_id_options, labels)
160
+ logits = prefix_logits
161
+
162
+ # Calculate the lexeme
163
+ if self.lex is not None and (labels is None or labels_type == 'lex'):
164
+ lex_logits = self.lex(hidden_states)
165
+ logits = lex_logits
166
+ if labels is not None:
167
+ loss_fct = nn.CrossEntropyLoss() # -100 index = padding token
168
+ loss = loss_fct(lex_logits.view(-1, self.config.vocab_size), labels.view(-1))
169
+
170
+ if self.morph is not None and (labels is None or labels_type == 'morph'):
171
+ loss, morph_logits = self.morph(hidden_states, labels)
172
+ logits = morph_logits
173
+
174
+ # no labels => logits = None
175
+ if labels is None: logits = None
176
+
177
+ return JointParsingOutput(
178
+ loss,
179
+ logits,
180
+ hidden_states=bert_outputs.hidden_states,
181
+ attentions=bert_outputs.attentions,
182
+ # all the predicted logits section
183
+ syntax_logits=syntax_logits,
184
+ ner_logits=ner_logits,
185
+ prefix_logits=prefix_logits,
186
+ lex_logits=lex_logits,
187
+ morph_logits=morph_logits
188
+ )
189
+
190
+ def predict(self, sentences: Union[str, List[str]], tokenizer: BertTokenizerFast, padding='longest', truncation=True, compute_syntax_mst=True, per_token_ner=False, output_style: Literal['json', 'ud', 'iahlt_ud'] = 'json'):
191
+ is_single_sentence = isinstance(sentences, str)
192
+ if is_single_sentence:
193
+ sentences = [sentences]
194
+
195
+ if output_style not in ['json', 'ud', 'iahlt_ud']:
196
+ raise ValueError('output_style must be in json/ud/iahlt_ud')
197
+ if output_style in ['ud', 'iahlt_ud'] and (self.prefix is None or self.morph is None or self.syntax is None or self.lex is None):
198
+ raise ValueError("Cannot output UD format when any of the prefix,morph,syntax, and lex heads aren't loaded.")
199
+
200
+ # predict the logits for the sentence
201
+ if self.prefix is not None:
202
+ inputs = encode_sentences_for_bert_for_prefix_marking(tokenizer, self.config.prefix_cfg, sentences, padding)
203
+ else:
204
+ inputs = tokenizer(sentences, padding=padding, truncation=truncation, return_offsets_mapping=True, return_tensors='pt')
205
+
206
+ offset_mapping = inputs.pop('offset_mapping')
207
+ # Copy the tensors to the right device, and parse!
208
+ inputs = {k:v.to(self.device) for k,v in inputs.items()}
209
+ output = self.forward(**inputs, return_dict=True, compute_syntax_mst=compute_syntax_mst)
210
+
211
+ input_ids = inputs['input_ids'].tolist() # convert once
212
+ final_output = [dict(text=sentence, tokens=combine_token_wordpieces(ids, offsets, tokenizer)) for sentence, ids, offsets in zip(sentences, input_ids, offset_mapping)]
213
+ # Syntax logits: each sentence gets a dict(tree: List[dict(word,dep_head,dep_head_idx,dep_func)], root_idx: int)
214
+ if output.syntax_logits is not None:
215
+ for sent_idx,parsed in enumerate(syntax_parse_logits(input_ids, sentences, tokenizer, output.syntax_logits)):
216
+ merge_token_list(final_output[sent_idx]['tokens'], parsed['tree'], 'syntax')
217
+ final_output[sent_idx]['root_idx'] = parsed['root_idx']
218
+
219
+ # Prefix logits: each sentence gets a list([prefix_segment, word_without_prefix]) - **WITH CLS & SEP**
220
+ if output.prefix_logits is not None:
221
+ for sent_idx,parsed in enumerate(prefix_parse_logits(input_ids, sentences, tokenizer, output.prefix_logits, self.config.prefix_cfg)):
222
+ merge_token_list(final_output[sent_idx]['tokens'], map(tuple, parsed[1:-1]), 'seg')
223
+
224
+ # Lex logits each sentence gets a list(tuple(word, lexeme))
225
+ if output.lex_logits is not None:
226
+ for sent_idx, parsed in enumerate(lex_parse_logits(input_ids, sentences, tokenizer, output.lex_logits)):
227
+ merge_token_list(final_output[sent_idx]['tokens'], map(itemgetter(1), parsed), 'lex')
228
+
229
+ # morph logits each sentences get a dict(text=str, tokens=list(dict(token, pos, feats, prefixes, suffix, suffix_feats?)))
230
+ if output.morph_logits is not None:
231
+ for sent_idx,parsed in enumerate(morph_parse_logits(input_ids, sentences, tokenizer, output.morph_logits)):
232
+ merge_token_list(final_output[sent_idx]['tokens'], parsed['tokens'], 'morph')
233
+
234
+ # NER logits each sentence gets a list(tuple(word, ner))
235
+ if output.ner_logits is not None:
236
+ for sent_idx,parsed in enumerate(ner_parse_logits(input_ids, sentences, tokenizer, output.ner_logits, self.config.id2label)):
237
+ if per_token_ner:
238
+ merge_token_list(final_output[sent_idx]['tokens'], map(itemgetter(1), parsed), 'ner')
239
+ final_output[sent_idx]['ner_entities'] = aggregate_ner_tokens(final_output[sent_idx], parsed)
240
+
241
+ if output_style in ['ud', 'iahlt_ud']:
242
+ final_output = convert_output_to_ud(final_output, self.config, style='htb' if output_style == 'ud' else 'iahlt')
243
+
244
+ if is_single_sentence:
245
+ final_output = final_output[0]
246
+ return final_output
247
+
248
+
249
+
250
def aggregate_ner_tokens(final_output, parsed):
    """Group per-token BIO predictions into entity spans.

    `parsed` is a list of (word, tag) pairs aligned with
    `final_output['tokens']` (each token dict provides character offsets).
    Returns one dict per entity: phrase (words joined by spaces), label,
    character start/end, and token_start/token_end indices.
    """
    spans = []
    open_label = None
    for idx, (tok, (word, tag)) in enumerate(zip(final_output['tokens'], parsed)):
        if tag == 'O':
            # Outside any entity: close whatever span was open.
            open_label = None
        elif tag.startswith('B-') or tag[2:] != open_label:
            # Explicit begin, or an I- tag whose type doesn't continue the open span.
            open_label = tag[2:]
            meta = dict(
                label=open_label,
                start=tok['offsets']['start'],
                end=tok['offsets']['end'],
                token_start=idx,
                token_end=idx,
            )
            spans.append(([word], meta))
        else:
            # Continuation: extend the open span through this token.
            words, meta = spans[-1]
            words.append(word)
            meta['end'] = tok['offsets']['end']
            meta['token_end'] = idx

    return [dict(phrase=' '.join(words), **meta) for words, meta in spans]
266
+
267
def merge_token_list(src, update, key):
    """Attach each value from `update` to the positionally-paired dict in `src`
    under `key`. Mutates `src` in place; pairing stops at the shorter sequence.
    """
    for target, value in zip(src, update):
        target[key] = value
270
+
271
def combine_token_wordpieces(input_ids: List[int], offset_mapping: torch.Tensor, tokenizer: BertTokenizerFast):
    """Merge '##' wordpieces back into whole tokens with character offsets.

    Returns a list of dicts: {'token': str, 'offsets': {'start': int, 'end': int}}.
    Special tokens are dropped, except [UNK] and [MASK], which are deliberately
    removed from the skip list and therefore kept in the output.
    """
    skip_toks = tokenizer.all_special_tokens
    skip_toks.remove(tokenizer.unk_token)
    skip_toks.remove(tokenizer.mask_token)

    ret = []
    for token, (start, end) in zip(tokenizer.convert_ids_to_tokens(input_ids), offset_mapping.tolist()):
        if token in skip_toks:
            continue
        if token.startswith('##'):
            # Continuation piece: glue onto the previous token and extend its span.
            prev = ret[-1]
            prev['token'] += token[2:]
            prev['offsets']['end'] = end
        else:
            ret.append(dict(token=token, offsets=dict(start=start, end=end)))
    return ret
285
+
286
def ner_parse_logits(input_ids: List[List[int]], sentences: List[str], tokenizer: BertTokenizerFast, logits: torch.Tensor, id2label: Dict[int, str]):
    """Convert token-classification logits to per-word (word, label) pairs.

    A word's label is the argmax prediction at its FIRST wordpiece; trailing
    '##' pieces are dropped entirely. Special tokens are skipped, except
    [UNK] and [MASK], which are deliberately kept. Returns one list of
    (token, label) pairs per sentence.
    """
    pred_ids = torch.argmax(logits, dim=-1).tolist()

    skip_toks = tokenizer.all_special_tokens
    skip_toks.remove(tokenizer.unk_token)
    skip_toks.remove(tokenizer.mask_token)

    batch_ret = []
    for batch_idx in range(len(sentences)):
        sent_tokens = tokenizer.convert_ids_to_tokens(input_ids[batch_idx])
        sent_preds = pred_ids[batch_idx]
        batch_ret.append([
            (token, id2label[pred])
            for token, pred in zip(sent_tokens, sent_preds)
            if token not in skip_toks and not token.startswith('##')
        ])
    return batch_ret
313
+
314
def lex_parse_logits(input_ids: List[List[int]], sentences: List[str], tokenizer: BertTokenizerFast, logits: torch.Tensor):
    """Pick a lexeme for every word from the model's top-3 vocabulary predictions.

    Wordpieces ('##...') are first merged back into their head token; the
    candidate lexemes for a merged word are those predicted at its first piece.
    A candidate is accepted only if it overlaps the word on enough characters
    outside 'אהוי' (at least 2, capped by what the word/candidate can offer);
    otherwise the word falls back to '[BLANK]'.
    Returns one list of (word, lexeme) pairs per sentence.
    """
    top3 = torch.argsort(logits, dim=-1, descending=True)[..., :3].tolist()

    skip_toks = tokenizer.all_special_tokens
    skip_toks.remove(tokenizer.unk_token)
    skip_toks.remove(tokenizer.mask_token)

    batch_ret = []
    for batch_idx in range(len(sentences)):
        # Pass 1: merge wordpieces, keeping the candidates of the first piece.
        merged = []
        sent_toks = tokenizer.convert_ids_to_tokens(input_ids[batch_idx])
        for tok_idx, token in enumerate(sent_toks):
            if token in skip_toks:
                continue
            if token.startswith('##'):
                head, cands = merged[-1]
                merged[-1] = (head + token[2:], cands)
            else:
                merged.append((token, tokenizer.convert_ids_to_tokens(top3[batch_idx][tok_idx])))

        # Pass 2: accept the first candidate with sufficient character overlap.
        sent_ret = []
        for word, candidates in merged:
            anchor_lets = set(c for c in word if c not in 'אהוי')
            chosen = '[BLANK]'
            for cand in candidates:
                needed = min([2, len(anchor_lets), len([c for c in cand if c not in 'אהוי'])])
                if sum(c in anchor_lets for c in cand) >= needed:
                    chosen = cand
                    break
            sent_ret.append((word, chosen))
        batch_ret.append(sent_ret)

    return batch_ret
349
+
350
# Maps each recognized Hebrew prefix segment to its candidate UD POS tags,
# ordered by preference (the first valid one is chosen downstream).
ud_prefixes_to_pos = {
    'ש': ['SCONJ'],
    'מש': ['SCONJ'],
    'כש': ['SCONJ'],
    'לכש': ['SCONJ'],
    'בש': ['SCONJ'],
    'לש': ['SCONJ'],
    'ו': ['CCONJ'],
    'ל': ['ADP'],
    'ה': ['DET', 'SCONJ'],
    'מ': ['ADP', 'SCONJ'],
    'ב': ['ADP'],
    'כ': ['ADP', 'ADV'],
}
# Maps a UD suffix feature bundle to the HTB-style pronominal suffix string.
# NOTE: the original literal repeated the keys 'Gender=Masc|Number=Sing|Person=3'
# and 'Gender=Masc|Number=Plur|Person=3' with identical values; Python silently
# keeps only the last occurrence, so the duplicates are removed here with no
# behavior change.
ud_suffix_to_htb_str = {
    'Gender=Masc|Number=Sing|Person=3': '_הוא',
    'Gender=Masc|Number=Plur|Person=3': '_הם',
    'Gender=Fem|Number=Sing|Person=3': '_היא',
    'Gender=Fem|Number=Plur|Person=3': '_הן',
    'Gender=Fem,Masc|Number=Plur|Person=1': '_אנחנו',
    'Gender=Fem,Masc|Number=Sing|Person=1': '_אני',
    'Gender=Masc|Number=Plur|Person=2': '_אתם',
    'Gender=Masc|Number=Sing|Person=2': '_אתה',
    'Gender=Fem|Number=Sing|Person=2': '_את',
}
377
def convert_output_to_ud(output_sentences, model_cfg, style: Literal['htb', 'iahlt']):
    """Convert the joint model's per-sentence output into CoNLL-U lines.

    Args:
        output_sentences: list of dicts with 'text' and 'tokens'; each token
            carries 'token', 'seg', 'lex', 'morph' and 'syntax' entries.
        model_cfg: model config providing `prefix_cfg` for prefix segmentation.
        style: 'htb' or 'iahlt' - the two UD annotation conventions supported.

    Returns:
        A list (one per sentence) of lists of CoNLL-U formatted strings.
    """
    if style not in ['htb', 'iahlt']:
        raise ValueError('style must be htb/iahlt')

    final_output = []
    for sent_idx, sentence in enumerate(output_sentences):
        # go through each word and insert it in the UD format, stored in a
        # temporary structure for the post process
        intermediate_output = []
        ranges = []
        # mapping between each word index and the actual line it appears in
        idx_to_key = {-1: 0}
        for word_idx, word in enumerate(sentence['tokens']):
            try:
                # handle blank lexemes by falling back to the segmented surface form
                if word['lex'] == '[BLANK]':
                    word['lex'] = word['seg'][-1]
            except KeyError:
                # Fix: previously this dumped the sentence and called exit(0),
                # terminating the whole process with a SUCCESS status on
                # malformed input. Keep the diagnostic dump but re-raise.
                import json
                print(json.dumps(sentence, ensure_ascii=False, indent=2))
                raise

            start = len(intermediate_output)
            # Add in all the prefixes
            if len(word['seg']) > 1:
                for pre in get_prefixes_from_str(word['seg'][0], model_cfg.prefix_cfg, greedy=True):
                    # pos - just take the first valid pos that appears in the predicted prefixes list.
                    pos = next((pos for pos in ud_prefixes_to_pos[pre] if pos in word['morph']['prefixes']), ud_prefixes_to_pos[pre][0])
                    dep, func = ud_get_prefix_dep(pre, word, word_idx)
                    intermediate_output.append(dict(word=pre, lex=pre, pos=pos, dep=dep, func=func, feats='_'))

                    # if there was an implicit heh, add it in dependent on the method
                    if not 'ה' in pre and intermediate_output[-1]['pos'] == 'ADP' and 'DET' in word['morph']['prefixes']:
                        if style == 'htb':
                            intermediate_output.append(dict(word='ה_', lex='ה', pos='DET', dep=word_idx, func='det', feats='_'))
                        elif style == 'iahlt':
                            intermediate_output[-1]['feats'] = 'Definite=Def|PronType=Art'

            idx_to_key[word_idx] = len(intermediate_output) + 1
            # add the main word in!
            intermediate_output.append(dict(
                word=word['seg'][-1], lex=word['lex'], pos=word['morph']['pos'],
                dep=word['syntax']['dep_head_idx'], func=word['syntax']['dep_func'],
                feats='|'.join(f'{k}={v}' for k, v in word['morph']['feats'].items())))

            # if we have suffixes, this changes things
            if word['morph']['suffix']:
                # first determine the dependency info:
                # For adp, num, det - the main word points to here, and the suffix points to the dependency
                entry_to_assign_suf_dep = None
                if word['morph']['pos'] in ['ADP', 'NUM', 'DET']:
                    entry_to_assign_suf_dep = intermediate_output[-1]
                    intermediate_output[-1]['func'] = 'case'
                    dep = word['syntax']['dep_head_idx']
                    func = word['syntax']['dep_func']
                else:
                    # if pos is verb -> obj, num -> dep, default to -> nmod:poss
                    dep = word_idx
                    func = {'VERB': 'obj', 'NUM': 'dep'}.get(word['morph']['pos'], 'nmod:poss')

                s_word, s_lex = word['seg'][-1], word['lex']
                # update the word of the string and extract the string of the suffix!
                if style == 'iahlt':
                    # we need to shorten the main word and extract the suffix
                    # if it is longer than the lexeme - just take off the lexeme.
                    if len(s_word) > len(s_lex):
                        idx = len(s_lex)
                    # Otherwise, try to find the last letter of the lexeme, and failing that just take the last letter
                    else:
                        # take either len-1, or the last occurrence (which can be -1 === len-1)
                        idx = min([len(s_word) - 1, s_word.rfind(s_lex[-1])])
                    # extract the suffix and update the main word
                    suf = s_word[idx:]
                    intermediate_output[-1]['word'] = s_word[:idx]
                elif style == 'htb':
                    # main word becomes the lexeme, the suffix is based on the features
                    intermediate_output[-1]['word'] = (s_lex if s_lex != s_word else s_word[:-1]) + '_'
                    suf_feats = word['morph']['suffix_feats']
                    suf = ud_suffix_to_htb_str.get(f"Gender={suf_feats.get('Gender', 'Fem,Masc')}|Number={suf_feats.get('Number', 'Sing')}|Person={suf_feats.get('Person', '3')}", "_הוא")
                    # for HTB, if the function is poss, then add a shel pointing to the next word
                    if func == 'nmod:poss' and s_lex != 'של':
                        intermediate_output.append(dict(word='_של_', lex='של', pos='ADP', dep=len(intermediate_output) + 2, func='case', feats='_', absolute_dep=True))
                    # if the function is obj, then add a את pointing to the next word
                    elif func == 'obj' and s_lex != 'את':
                        intermediate_output.append(dict(word='_את_', lex='את', pos='ADP', dep=len(intermediate_output) + 2, func='case', feats='_', absolute_dep=True))

                # add the main suffix in
                intermediate_output.append(dict(word=suf, lex='הוא', pos='PRON', dep=dep, func=func, feats='|'.join(f'{k}={v}' for k, v in word['morph']['suffix_feats'].items())))
                if entry_to_assign_suf_dep:
                    entry_to_assign_suf_dep['dep'] = len(intermediate_output)
                    entry_to_assign_suf_dep['absolute_dep'] = True

            end = len(intermediate_output)
            ranges.append((start, end, word['token']))

        # now that we have the intermediate output, combine it to the final output
        cur_output = []
        final_output.append(cur_output)
        # first, add the headers
        cur_output.append(f'# sent_id = {sent_idx + 1}')
        cur_output.append(f'# text = {sentence["text"]}')

        # add in all the actual entries
        for start, end, token in ranges:
            # multi-word token line for words that split into several UD rows
            if end - start > 1:
                cur_output.append(f'{start + 1}-{end}\t{token}\t_\t_\t_\t_\t_\t_\t_\t_')
            for idx, output in enumerate(intermediate_output[start:end], start + 1):
                # compute the actual dependency location
                dep = output['dep'] if output.get('absolute_dep', False) else idx_to_key[output['dep']]
                func = normalize_dep_rel(output['func'], style)
                # and add the full ud string in
                cur_output.append('\t'.join([
                    str(idx),
                    output['word'],
                    output['lex'],
                    output['pos'],
                    output['pos'],
                    output['feats'],
                    str(dep),
                    func,
                    '_', '_'
                ]))
    return final_output
502
+
503
def normalize_dep_rel(dep, style: Literal['htb', 'iahlt']):
    """Collapse HTB-specific dependency-relation subtypes to their plain UD
    relation when emitting the 'iahlt' style; 'htb' keeps relations unchanged."""
    if style != 'iahlt':
        return dep
    collapsed = {
        'compound:smixut': 'compound',
        'nsubj:cop': 'nsubj',
        'mark:q': 'mark',
        'case:gen': 'case',
        'case:acc': 'case',
    }
    return collapsed.get(dep, dep)
510
+
511
+
512
def ud_get_prefix_dep(pre, word, word_idx):
    """Decide the UD head and relation for a prefix split off of `word`.

    Returns (head, func): head is the word's own syntactic head index when the
    prefix "follows the main word", otherwise word_idx itself.
    """
    syntax = word['syntax']
    main_pos = word['morph']['pos']

    if pre.endswith('ש'):
        # shin attaches to the main word's head, except on verbs / backward heads
        attach_to_head = main_pos != 'VERB' and syntax['dep_head_idx'] > word_idx
        rel = 'mark'
    elif pre == 'ו':
        # conjunctive vav attaches to the head unless the word opens its own clause
        # NOTE(review): 'flatccomp' below looks like 'flat'/'ccomp' fused into one
        # label; kept verbatim to preserve behavior - confirm against training labels.
        attach_to_head = syntax['dep_func'] not in ["conj", "acl:recl", "parataxis", "root", "acl", "amod", "list", "appos", "dep", "flatccomp"]
        rel = 'cc'
    else:
        # for adj, noun, propn, pron, verb - prefixes stay on the main word
        if main_pos in ["ADJ", "NOUN", "PROPN", "PRON", "VERB"]:
            attach_to_head = False
        # otherwise - the prefix follows the word when its function is in the list
        else:
            attach_to_head = syntax['dep_func'] in ["compound:affix", "det", "aux", "nummod", "advmod", "dep", "cop", "mark", "fixed"]

        rel = 'case'
        if pre == 'ה':
            rel = 'det' if 'DET' in word['morph']['prefixes'] else 'mark'

    head = syntax['dep_head_idx'] if attach_to_head else word_idx
    return head, rel
dictabert-joint/BertForMorphTagging.py ADDED
@@ -0,0 +1,215 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from collections import OrderedDict
2
+ from operator import itemgetter
3
+ from transformers.utils import ModelOutput
4
+ import torch
5
+ from torch import nn
6
+ from typing import Dict, List, Tuple, Optional
7
+ from dataclasses import dataclass
8
+ from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
9
+
10
# UD part-of-speech inventory predicted for the main word of each token
ALL_POS = ['DET', 'NOUN', 'VERB', 'CCONJ', 'ADP', 'PRON', 'PUNCT', 'ADJ', 'ADV', 'SCONJ', 'NUM', 'PROPN', 'AUX', 'X', 'INTJ', 'SYM']
# POS classes a prefix may take (multi-label: a word can carry several prefixes)
ALL_PREFIX_POS = ['SCONJ', 'DET', 'ADV', 'CCONJ', 'ADP', 'NUM']
# suffix classes; 'none' marks a word without a pronominal suffix
ALL_SUFFIX_POS = ['none', 'ADP_PRON', 'PRON']
# morphological features and their value inventories ('none' = feature absent;
# parse_logits drops 'none' entries from the returned feats dict)
ALL_FEATURES = [
    ('Gender', ['none', 'Masc', 'Fem', 'Fem,Masc']),
    ('Number', ['none', 'Sing', 'Plur', 'Plur,Sing', 'Dual', 'Dual,Plur']),
    ('Person', ['none', '1', '2', '3', '1,2,3']),
    ('Tense', ['none', 'Past', 'Fut', 'Pres', 'Imp'])
]
19
+
20
@dataclass
class MorphLogitsOutput(ModelOutput):
    """Raw logits from the morph-tagging head, one field per sub-task."""
    prefix_logits: torch.FloatTensor = None
    pos_logits: torch.FloatTensor = None
    features_logits: List[torch.FloatTensor] = None
    suffix_logits: torch.FloatTensor = None
    suffix_features_logits: List[torch.FloatTensor] = None

    def detach(self):
        """Return a copy with every tensor detached from the autograd graph.

        Fix: the feature-list tensors previously called `.deatch()` (a typo),
        which raised AttributeError whenever detach() was invoked.
        """
        return MorphLogitsOutput(
            self.prefix_logits.detach(),
            self.pos_logits.detach(),
            [logits.detach() for logits in self.features_logits],
            self.suffix_logits.detach(),
            [logits.detach() for logits in self.suffix_features_logits],
        )
30
+
31
+
32
@dataclass
class MorphTaggingOutput(ModelOutput):
    """HF-style output container for BertForMorphTagging: optional training
    loss, the per-sub-task logits, and the usual BERT extras."""
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[MorphLogitsOutput] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
38
+
39
@dataclass
class MorphLabels(ModelOutput):
    """Gold labels for every morph sub-task, mirroring MorphLogitsOutput."""
    prefix_labels: Optional[torch.FloatTensor] = None
    pos_labels: Optional[torch.FloatTensor] = None
    features_labels: Optional[List[torch.FloatTensor]] = None
    suffix_labels: Optional[torch.FloatTensor] = None
    suffix_features_labels: Optional[List[torch.FloatTensor]] = None

    def detach(self):
        # detach every tensor, including each tensor inside the feature lists
        return MorphLabels(self.prefix_labels.detach(), self.pos_labels.detach(), [labels.detach() for labels in self.features_labels], self.suffix_labels.detach(), [labels.detach() for labels in self.suffix_features_labels])

    def to(self, device):
        # move every tensor to `device`, returning a new container
        return MorphLabels(self.prefix_labels.to(device), self.pos_labels.to(device), [feat.to(device) for feat in self.features_labels], self.suffix_labels.to(device), [feat.to(device) for feat in self.suffix_features_labels])
52
+
53
class BertMorphTaggingHead(nn.Module):
    """Multi-task classification head for Hebrew morphological tagging.

    For every token position it predicts: the set of prefix POS classes
    (multi-label), the main POS, four morphological features
    (Gender/Number/Person/Tense), a suffix class, and the same four features
    for the suffix.
    """
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.num_prefix_classes = len(ALL_PREFIX_POS)
        self.num_pos_classes = len(ALL_POS)
        self.num_suffix_classes = len(ALL_SUFFIX_POS)
        self.num_features_classes = list(map(len, map(itemgetter(1), ALL_FEATURES)))
        # we need a classifier for prefix cls and POS cls
        # the prefix will use BCEWithLogits for multiple labels cls
        self.prefix_cls = nn.Linear(config.hidden_size, self.num_prefix_classes)
        # and pos + feats will use good old cross entropy for single label
        self.pos_cls = nn.Linear(config.hidden_size, self.num_pos_classes)
        self.features_cls = nn.ModuleList([nn.Linear(config.hidden_size, len(features)) for _, features in ALL_FEATURES])
        # and suffix + feats will also be cross entropy
        self.suffix_cls = nn.Linear(config.hidden_size, self.num_suffix_classes)
        self.suffix_features_cls = nn.ModuleList([nn.Linear(config.hidden_size, len(features)) for _, features in ALL_FEATURES])

    def forward(
            self,
            hidden_states: torch.Tensor,
            labels: Optional[MorphLabels] = None):
        """Run all sub-task classifiers; returns (loss, MorphLogitsOutput).
        loss is None when no labels are provided; otherwise it is the sum of
        the per-sub-task losses."""
        # run each of the classifiers on the transformed output
        prefix_logits = self.prefix_cls(hidden_states)
        pos_logits = self.pos_cls(hidden_states)
        suffix_logits = self.suffix_cls(hidden_states)
        features_logits = [cls(hidden_states) for cls in self.features_cls]
        suffix_features_logits = [cls(hidden_states) for cls in self.suffix_features_cls]

        loss = None
        if labels is not None:
            # step 1: prefix labels loss
            # positions labeled -100 are masked out by giving them zero BCE weight
            loss_fct = nn.BCEWithLogitsLoss(weight=(labels.prefix_labels != -100).float())
            loss = loss_fct(prefix_logits, labels.prefix_labels)
            # step 2: pos labels loss (CrossEntropyLoss ignores -100 targets by default)
            loss_fct = nn.CrossEntropyLoss()
            loss += loss_fct(pos_logits.view(-1, self.num_pos_classes), labels.pos_labels.view(-1))
            # step 2b: features
            for feat_logits,feat_labels,num_features in zip(features_logits, labels.features_labels, self.num_features_classes):
                loss += loss_fct(feat_logits.view(-1, num_features), feat_labels.view(-1))
            # step 3: suffix logits loss
            loss += loss_fct(suffix_logits.view(-1, self.num_suffix_classes), labels.suffix_labels.view(-1))
            # step 3b: suffix features
            for feat_logits,feat_labels,num_features in zip(suffix_features_logits, labels.suffix_features_labels, self.num_features_classes):
                loss += loss_fct(feat_logits.view(-1, num_features), feat_labels.view(-1))

        return loss, MorphLogitsOutput(prefix_logits, pos_logits, features_logits, suffix_logits, suffix_features_logits)
101
+
102
class BertForMorphTagging(BertPreTrainedModel):
    """BERT encoder with a BertMorphTaggingHead on top, for Hebrew
    morphological tagging."""

    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.morph = BertMorphTaggingHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
            self,
            input_ids: Optional[torch.Tensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            token_type_ids: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.Tensor] = None,
            labels: Optional[MorphLabels] = None,
            head_mask: Optional[torch.Tensor] = None,
            inputs_embeds: Optional[torch.Tensor] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
        ):
        """Standard HF forward; when `labels` is given the summed multi-task
        loss is computed by the morph head. Returns MorphTaggingOutput (or a
        tuple when return_dict is False)."""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        bert_outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = bert_outputs[0]
        hidden_states = self.dropout(hidden_states)

        loss, logits = self.morph(hidden_states, labels)

        if not return_dict:
            return (loss,logits) + bert_outputs[2:]

        return MorphTaggingOutput(
            loss=loss,
            logits=logits,
            hidden_states=bert_outputs.hidden_states,
            attentions=bert_outputs.attentions,
        )

    def predict(self, sentences: List[str], tokenizer: BertTokenizerFast, padding='longest'):
        """Tokenize `sentences`, run the model on this model's device, and
        decode the logits with the module-level parse_logits."""
        # tokenize the inputs and convert them to relevant device
        inputs = tokenizer(sentences, padding=padding, truncation=True, return_tensors='pt')
        inputs = {k:v.to(self.device) for k,v in inputs.items()}
        # calculate the logits
        logits = self.forward(**inputs, return_dict=True).logits
        return parse_logits(inputs['input_ids'].tolist(), sentences, tokenizer, logits)
163
+
164
def parse_logits(input_ids: List[List[int]], sentences: List[str], tokenizer: BertTokenizerFast, logits: MorphLogitsOutput):
    """Decode morph-tagging logits into per-sentence dicts.

    Returns one dict per sentence: { text, tokens }, where each token entry is
    { token, pos, feats, prefixes, suffix, [suffix_feats] }.
    """
    prefix_logits, pos_logits, feats_logits, suffix_logits, suffix_feats_logits = \
        logits.prefix_logits, logits.pos_logits, logits.features_logits, logits.suffix_logits, logits.suffix_features_logits

    # NOTE(review): these are raw logits from a BCEWithLogitsLoss-trained head,
    # so `> 0.5` corresponds to a probability of ~0.62, not 0.5 (which would be
    # `> 0`) - confirm this cutoff is intentional before changing it.
    prefix_predictions = (prefix_logits > 0.5).int().tolist() # Threshold at 0.5 for multi-label classification
    pos_predictions = pos_logits.argmax(axis=-1).tolist()
    suffix_predictions = suffix_logits.argmax(axis=-1).tolist()
    feats_predictions = [logits.argmax(axis=-1).tolist() for logits in feats_logits]
    suffix_feats_predictions = [logits.argmax(axis=-1).tolist() for logits in suffix_feats_logits]

    # create the return dictionary
    # for each sentence, return a dict object with the following files { text, tokens }
    # Where tokens is a list of dicts, where each dict is:
    # { pos: str, feats: dict, prefixes: List[str], suffix: str | bool, suffix_feats: dict | None}
    # skip special tokens, except [UNK]/[MASK] which may stand in for real words
    special_toks = tokenizer.all_special_tokens
    special_toks.remove(tokenizer.unk_token)
    special_toks.remove(tokenizer.mask_token)

    ret = []
    for sent_idx,sentence in enumerate(sentences):
        input_id_strs = tokenizer.convert_ids_to_tokens(input_ids[sent_idx])
        # iterate through each token in the sentence, ignoring special tokens
        tokens = []
        for token_idx,token_str in enumerate(input_id_strs):
            if token_str in special_toks: continue
            # continuation wordpieces extend the previous word's surface form;
            # predictions are taken from the word's first piece only
            if token_str.startswith('##'):
                tokens[-1]['token'] += token_str[2:]
                continue
            tokens.append(dict(
                token=token_str,
                pos=ALL_POS[pos_predictions[sent_idx][token_idx]],
                feats=get_features_dict_from_predictions(feats_predictions, (sent_idx, token_idx)),
                prefixes=[ALL_PREFIX_POS[idx] for idx,i in enumerate(prefix_predictions[sent_idx][token_idx]) if i > 0],
                suffix=get_suffix_or_false(ALL_SUFFIX_POS[suffix_predictions[sent_idx][token_idx]]),
            ))
            # only attach suffix features when a suffix was actually predicted
            if tokens[-1]['suffix']:
                tokens[-1]['suffix_feats'] = get_features_dict_from_predictions(suffix_feats_predictions, (sent_idx, token_idx))
        ret.append(dict(text=sentence, tokens=tokens))
    return ret
203
+
204
def get_suffix_or_false(suffix):
    """Map the sentinel class 'none' to False; pass any real suffix class through."""
    if suffix == 'none':
        return False
    return suffix
206
+
207
def get_features_dict_from_predictions(predictions, idx):
    """Assemble a {feature-name: value} dict for one token, dropping 'none'.

    `predictions` is a list parallel to ALL_FEATURES, each element indexed by
    [sentence][token]; `idx` is the (sentence, token) pair to read.
    """
    sent_i, tok_i = idx
    chosen = (
        (feat_name, feat_values[predictions[feat_pos][sent_i][tok_i]])
        for feat_pos, (feat_name, feat_values) in enumerate(ALL_FEATURES)
    )
    return {name: value for name, value in chosen if value != 'none'}
214
+
215
+
dictabert-joint/BertForPrefixMarking.py ADDED
@@ -0,0 +1,266 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from transformers.utils import ModelOutput
2
+ import torch
3
+ from torch import nn
4
+ from typing import Dict, List, Tuple, Optional
5
+ from dataclasses import dataclass
6
+ from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
7
+
8
# define the classes, and the possible prefixes for each class
POSSIBLE_PREFIX_CLASSES = [ ['לכש', 'כש', 'מש', 'בש', 'לש'], ['מ'], ['ש'], ['ה'], ['ו'], ['כ'], ['ל'], ['ב'] ]
# alternative inventory with additional prefix forms for rabbinic-era texts;
# not referenced by DEFAULT_PREFIX_CONFIG below
POSSIBLE_RABBINIC_PREFIX_CLASSES = [ ['לכש', 'כש', 'מש', 'בש', 'לש', 'לד', 'בד', 'מד', 'כד', 'לכד'], ['מ'], ['ש', 'ד'], ['ה'], ['ו'], ['כ'], ['ל'], ['ב'], ['א'], ['ק'] ]
11
+
12
class PrefixConfig(dict):
    """Inventory of prefix classes plus derived lookup tables.

    Subclasses dict so the raw class list round-trips through dict-based
    config serialization; the derived tables live as plain attributes.
    """

    def __init__(self, possible_classes, **kwargs):  # kwargs tolerated: older configs stored every field as a dict value
        super().__init__()
        self.possible_classes = possible_classes
        self.total_classes = len(possible_classes)
        # map every prefix string to the index of the class it belongs to
        mapping = {}
        for class_idx, prefixes in enumerate(possible_classes):
            for prefix in prefixes:
                mapping[prefix] = class_idx
        self.prefix_c2i = mapping
        # longest-first, so greedy matching prefers multi-letter prefixes
        self.all_prefix_items = sorted(mapping, key=len, reverse=True)

    @property
    def possible_classes(self) -> List[List[str]]:
        return self.get('possible_classes')

    @possible_classes.setter
    def possible_classes(self, value: List[List[str]]):
        self['possible_classes'] = value
27
+
28
# module-level default, used by BertPrefixMarkingHead when a model config carries no prefix inventory
DEFAULT_PREFIX_CONFIG = PrefixConfig(POSSIBLE_PREFIX_CLASSES)
29
+
30
def get_prefixes_from_str(s, cfg: PrefixConfig, greedy=False):
    """Yield the prefixes that can be peeled, left to right, off of `s`.

    At each step the longest matching prefix is taken. When `greedy` is False
    and that match is multi-letter, its single first letter is also yielded as
    an alternative segmentation - but scanning still advances by the full
    match, since if the next letters themselves form a prefix they must be
    part of the longest one.
    """
    remainder = s
    while remainder and remainder[0] in cfg.prefix_c2i:
        # all_prefix_items is sorted longest-first, so the first hit is the longest
        match = next((p for p in cfg.all_prefix_items if remainder.startswith(p)), None)
        if match is None:
            return
        yield match
        if not greedy and len(match) > 1:
            yield match[0]
        remainder = remainder[len(match):]
45
+
46
def get_prefix_classes_from_str(s, cfg: PrefixConfig, greedy=False):
    """Same as get_prefixes_from_str, but yields class indices instead of strings."""
    yield from (cfg.prefix_c2i[prefix] for prefix in get_prefixes_from_str(s, cfg, greedy))
49
+
50
@dataclass
class PrefixesClassifiersOutput(ModelOutput):
    """HF-style output container for BertForPrefixMarking; `logits` has shape
    batch x seq_len x total_classes x 2 (yes/no per prefix class)."""
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
56
+
57
class BertPrefixMarkingHead(nn.Module):
    """Head that, for each token, predicts yes/no for every prefix class,
    conditioned on which classes are orthographically possible for that token."""
    def __init__(self, config) -> None:
        super().__init__()
        self.config = config

        # fall back to the module default when the checkpoint config has no inventory,
        # and re-wrap plain dicts (deserialized configs) back into PrefixConfig
        if not hasattr(config, 'prefix_cfg') or config.prefix_cfg is None:
            setattr(config, 'prefix_cfg', DEFAULT_PREFIX_CONFIG)
        if isinstance(config.prefix_cfg, dict):
            config.prefix_cfg = PrefixConfig(config.prefix_cfg['possible_classes'])

        # an embedding table containing an embedding for each prefix class + 1 for NONE
        # we will concatenate either the embedding/NONE for each class - and we want the concatenated
        # size to be the hidden_size
        prefix_class_embed = config.hidden_size // config.prefix_cfg.total_classes
        self.prefix_class_embeddings = nn.Embedding(config.prefix_cfg.total_classes + 1, prefix_class_embed)

        # one layer for transformation, apply an activation, then another N classifiers for each prefix class
        self.transform = nn.Linear(config.hidden_size + prefix_class_embed * config.prefix_cfg.total_classes, config.hidden_size)
        self.activation = nn.Tanh()
        self.classifiers = nn.ModuleList([nn.Linear(config.hidden_size, 2) for _ in range(config.prefix_cfg.total_classes)])

    def forward(
            self,
            hidden_states: torch.Tensor,
            prefix_class_id_options: torch.Tensor,
            labels: Optional[torch.Tensor] = None) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
        """Returns (loss, logits); loss is None without labels, and logits has
        shape batch x seq_len x total_classes x 2."""
        # encode the prefix_class_id_options
        # If input_ids is batch x seq_len
        # Then sequence_output is batch x seq_len x hidden_dim
        # So prefix_class_id_options is batch x seq_len x total_classes
        # Looking up the embeddings should give us batch x seq_len x total_classes x hidden_dim / N
        possible_class_embed = self.prefix_class_embeddings(prefix_class_id_options)
        # then flatten the final dimension - now we have batch x seq_len x hidden_dim_2
        possible_class_embed = possible_class_embed.reshape(possible_class_embed.shape[:-2] + (-1,))

        # concatenate the new class embed into the sequence output before the transform
        pre_transform_output = torch.cat((hidden_states, possible_class_embed), dim=-1) # batch x seq_len x (hidden_dim + hidden_dim_2)
        pre_logits_output = self.activation(self.transform(pre_transform_output))# batch x seq_len x hidden_dim

        # run each of the classifiers on the transformed output
        logits = torch.cat([cls(pre_logits_output).unsqueeze(-2) for cls in self.classifiers], dim=-2)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, 2), labels.view(-1))

        return (loss, logits)
+ return (loss, logits)
106
+
107
+
108
+
109
class BertForPrefixMarking(BertPreTrainedModel):
    """BERT encoder with a BertPrefixMarkingHead on top: predicts, per token,
    which Hebrew prefixes should be split off of it."""

    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.prefix = BertPrefixMarkingHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
            self,
            input_ids: Optional[torch.Tensor] = None,
            attention_mask: Optional[torch.Tensor] = None,
            token_type_ids: Optional[torch.Tensor] = None,
            prefix_class_id_options: Optional[torch.Tensor] = None,
            position_ids: Optional[torch.Tensor] = None,
            labels: Optional[torch.Tensor] = None,
            head_mask: Optional[torch.Tensor] = None,
            inputs_embeds: Optional[torch.Tensor] = None,
            output_attentions: Optional[bool] = None,
            output_hidden_states: Optional[bool] = None,
            return_dict: Optional[bool] = None,
        ):
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        bert_outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = bert_outputs[0]
        hidden_states = self.dropout(hidden_states)

        loss, logits = self.prefix.forward(hidden_states, prefix_class_id_options, labels)
        if not return_dict:
            return (loss,logits,) + bert_outputs[2:]

        return PrefixesClassifiersOutput(
            loss=loss,
            logits=logits,
            hidden_states=bert_outputs.hidden_states,
            attentions=bert_outputs.attentions,
        )

    def predict(self, sentences: List[str], tokenizer: BertTokenizerFast, padding='longest'):
        """Encode `sentences`, run the model, and return per-token prefix
        splits (see the module-level parse_logits)."""
        # step 1: encode the sentences through using the tokenizer, and get the input tensors + prefix id tensors
        inputs = encode_sentences_for_bert_for_prefix_marking(tokenizer, self.config.prefix_cfg, sentences, padding)
        # offset_mapping is only needed by other callers of the encoder; drop it before forward()
        inputs.pop('offset_mapping')
        inputs = {k:v.to(self.device) for k,v in inputs.items()}

        # run through bert
        logits = self.forward(**inputs, return_dict=True).logits
        return parse_logits(inputs['input_ids'].tolist(), sentences, tokenizer, logits, self.config.prefix_cfg)
+ return parse_logits(inputs['input_ids'].tolist(), sentences, tokenizer, logits, self.config.prefix_cfg)
176
+
177
def parse_logits(input_ids: List[List[int]], sentences: List[str], tokenizer: BertTokenizerFast, logits: torch.FloatTensor, config: PrefixConfig):
    """Decode prefix-marking logits into, per sentence, one entry per word:
    [token] when no prefix is predicted, or [prefix_str, remainder] when one is.

    NOTE(review): only the pad token is skipped here; other special tokens
    ([CLS]/[SEP]) fall through and appear in the output as plain tokens -
    presumably the caller aligns or strips them; confirm.
    """
    # extract the predictions by argmaxing the final dimension (batch x sequence x prefixes x prediction)
    logit_preds = torch.argmax(logits, axis=3).tolist()

    ret = []

    for sent_idx,sent_ids in enumerate(input_ids):
        tokens = tokenizer.convert_ids_to_tokens(sent_ids)
        ret.append([])
        for tok_idx,token in enumerate(tokens):
            # padding carries no prediction - skip it (`continue` rather than
            # `break`, so non-trailing pads are tolerated too)
            if token == tokenizer.pad_token: continue
            if token.startswith('##'): continue

            # combine the next tokens in? only if it's a breakup
            next_tok_idx = tok_idx + 1
            while next_tok_idx < len(tokens) and tokens[next_tok_idx].startswith('##'):
                token += tokens[next_tok_idx][2:]
                next_tok_idx += 1

            # predictions are read from the word's first wordpiece position
            prefix_len = get_predicted_prefix_len_from_logits(token, logit_preds[sent_idx][tok_idx], config)

            if not prefix_len:
                ret[-1].append([token])
            else:
                ret[-1].append([token[:prefix_len], token[prefix_len:]])
    return ret
+ return ret
204
+
205
def encode_sentences_for_bert_for_prefix_marking(tokenizer: BertTokenizerFast, config: PrefixConfig, sentences: List[str], padding='longest', truncation=True):
    """Tokenize `sentences` and attach a `prefix_class_id_options` tensor
    marking, per token, which prefix classes are orthographically possible.

    Cells default to `config.total_classes` - the extra "NONE" row of the
    head's embedding table; a possible class is set to its own index so the
    embedding lookup selects that class's vector.
    """
    inputs = tokenizer(sentences, padding=padding, truncation=truncation, return_offsets_mapping=True, return_tensors='pt')
    # create our prefix_id_options array which will be like the input ids shape but with an addtional
    # dimension containing for each prefix whether it can be for that word
    prefix_id_options = torch.full(inputs['input_ids'].shape + (config.total_classes,), config.total_classes, dtype=torch.long)

    # go through each token, and fill in the vector accordingly
    for sent_idx, sent_ids in enumerate(inputs['input_ids']):
        tokens = tokenizer.convert_ids_to_tokens(sent_ids)
        for tok_idx, token in enumerate(tokens):
            # if the first letter isn't a valid prefix letter, nothing to talk about
            if len(token) < 2 or not token[0] in config.prefix_c2i: continue

            # combine the next tokens in? only if it's a breakup
            next_tok_idx = tok_idx + 1
            while next_tok_idx < len(tokens) and tokens[next_tok_idx].startswith('##'):
                token += tokens[next_tok_idx][2:]
                next_tok_idx += 1

            # find all the possible prefixes - and mark each possible class with its own id for embed lookup
            for pre_class in get_prefix_classes_from_str(token, config):
                prefix_id_options[sent_idx, tok_idx, pre_class] = pre_class

    inputs['prefix_class_id_options'] = prefix_id_options
    return inputs
+ return inputs
230
+
231
def get_predicted_prefix_len_from_logits(token, token_logits, config: PrefixConfig):
    """Given one word and its per-prefix-class yes/no predictions, return how
    many leading characters of `token` are predicted to be prefixes (0 = none).
    """
    # Go through each possible prefix, and check if the prefix is yes - and if
    # so increase the counter of the matched length, otherwise break out. That will solve cases
    # of predicting prefix combinations that don't exist on the word.
    # For example, if we have the word ושכשהלכתי and the model predict ו & כש, then we will only
    # take the vuv because in order to get the כש we need the ש as well.
    # Two extra items:
    # 1] Don't allow the same prefix multiple times
    # 2] Always check that the word starts with that prefix - otherwise it's bad
    #     (except for the case of multi-letter prefix, where we force the next to be last)
    cur_len, skip_next, last_check, seen_prefixes = 0, False, False, set()
    for prefix in get_prefixes_from_str(token, config):
        # Are we skipping this prefix? This will be the case where we matched כש, don't allow ש
        if skip_next:
            skip_next = False
            continue
        # check for duplicate prefixes, we don't allow two of the same prefix
        # if it predicted two of the same, then we will break out
        if prefix in seen_prefixes: break
        seen_prefixes.add(prefix)

        # check if we predicted this prefix
        if token_logits[config.prefix_c2i[prefix]]:
            cur_len += len(prefix)
            if last_check: break
            skip_next = len(prefix) > 1
        # Otherwise, we predicted no. If we didn't, then this is the end of the prefix
        # and time to break out. *Except* if it's a multi letter prefix, then we allow
        # just the next letter - e.g., if כש doesn't match, then we allow כ, but then we know
        # the word continues with a ש, and if it's not כש, then it's not כ-ש- (invalid)
        elif len(prefix) > 1:
            last_check = True
        else:
            break

    return cur_len
dictabert-joint/BertForSyntaxParsing.py ADDED
@@ -0,0 +1,315 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import math
2
+ from transformers.utils import ModelOutput
3
+ import torch
4
+ from torch import nn
5
+ from typing import Dict, List, Tuple, Optional, Union
6
+ from dataclasses import dataclass
7
+ from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
8
+
9
+ ALL_FUNCTION_LABELS = ["nsubj", "nsubj:cop", "punct", "mark", "mark:q", "case", "case:gen", "case:acc", "fixed", "obl", "det", "amod", "acl:relcl", "nmod", "cc", "conj", "root", "compound:smixut", "cop", "compound:affix", "advmod", "nummod", "appos", "nsubj:pass", "nmod:poss", "xcomp", "obj", "aux", "parataxis", "advcl", "ccomp", "csubj", "acl", "obl:tmod", "csubj:pass", "dep", "dislocated", "nmod:tmod", "nmod:npmod", "flat", "obl:npmod", "goeswith", "reparandum", "orphan", "list", "discourse", "iobj", "vocative", "expl", "flat:name"]
10
+
11
+ @dataclass
12
+ class SyntaxLogitsOutput(ModelOutput):
13
+ dependency_logits: torch.FloatTensor = None
14
+ function_logits: torch.FloatTensor = None
15
+ dependency_head_indices: torch.LongTensor = None
16
+
17
+ def detach(self):
18
+ return SyntaxTaggingOutput(self.dependency_logits.detach(), self.function_logits.detach(), self.dependency_head_indices.detach())
19
+
20
+ @dataclass
21
+ class SyntaxTaggingOutput(ModelOutput):
22
+ loss: Optional[torch.FloatTensor] = None
23
+ logits: Optional[SyntaxLogitsOutput] = None
24
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
25
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
26
+
27
+ @dataclass
28
+ class SyntaxLabels(ModelOutput):
29
+ dependency_labels: Optional[torch.LongTensor] = None
30
+ function_labels: Optional[torch.LongTensor] = None
31
+
32
+ def detach(self):
33
+ return SyntaxLabels(self.dependency_labels.detach(), self.function_labels.detach())
34
+
35
+ def to(self, device):
36
+ return SyntaxLabels(self.dependency_labels.to(device), self.function_labels.to(device))
37
+
38
+ class BertSyntaxParsingHead(nn.Module):
39
+ def __init__(self, config):
40
+ super().__init__()
41
+ self.config = config
42
+
43
+ # the attention query & key values
44
+ self.head_size = config.syntax_head_size# int(config.hidden_size / config.num_attention_heads * 2)
45
+ self.query = nn.Linear(config.hidden_size, self.head_size)
46
+ self.key = nn.Linear(config.hidden_size, self.head_size)
47
+ # the function classifier gets two encoding values and predicts the labels
48
+ self.num_function_classes = len(ALL_FUNCTION_LABELS)
49
+ self.cls = nn.Linear(config.hidden_size * 2, self.num_function_classes)
50
+
51
+ def forward(
52
+ self,
53
+ hidden_states: torch.Tensor,
54
+ extended_attention_mask: Optional[torch.Tensor],
55
+ labels: Optional[SyntaxLabels] = None,
56
+ compute_mst: bool = False) -> Tuple[torch.Tensor, SyntaxLogitsOutput]:
57
+
58
+ # Take the dot product between "query" and "key" to get the raw attention scores.
59
+ query_layer = self.query(hidden_states)
60
+ key_layer = self.key(hidden_states)
61
+ attention_scores = torch.bmm(query_layer, key_layer.transpose(-1, -2)) / math.sqrt(self.head_size)
62
+
63
+ # add in the attention mask
64
+ if extended_attention_mask is not None:
65
+ if extended_attention_mask.ndim == 4:
66
+ extended_attention_mask = extended_attention_mask.squeeze(1)
67
+ attention_scores += extended_attention_mask# batch x seq x seq
68
+
69
+ # At this point take the hidden_state of the word and of the dependency word, and predict the function
70
+ # If labels are provided, use the labels.
71
+ if self.training and labels is not None:
72
+ # Note that the labels can have -100, so just set those to zero with a max
73
+ dep_indices = labels.dependency_labels.clamp_min(0)
74
+ # Otherwise - check if he wants the MST or just the argmax
75
+ elif compute_mst:
76
+ dep_indices = compute_mst_tree(attention_scores, extended_attention_mask)
77
+ else:
78
+ dep_indices = torch.argmax(attention_scores, dim=-1)
79
+
80
+ # After we retrieved the dependency indicies, create a tensor of teh batch indices, and and retrieve the vectors of the heads to calculate the function
81
+ batch_indices = torch.arange(dep_indices.size(0)).view(-1, 1).expand(-1, dep_indices.size(1)).to(dep_indices.device)
82
+ dep_vectors = hidden_states[batch_indices, dep_indices, :] # batch x seq x dim
83
+
84
+ # concatenate that with the last hidden states, and send to the classifier output
85
+ cls_inputs = torch.cat((hidden_states, dep_vectors), dim=-1)
86
+ function_logits = self.cls(cls_inputs)
87
+
88
+ loss = None
89
+ if labels is not None:
90
+ loss_fct = nn.CrossEntropyLoss()
91
+ # step 1: dependency scores loss - this is applied to the attention scores
92
+ loss = loss_fct(attention_scores.view(-1, hidden_states.size(-2)), labels.dependency_labels.view(-1))
93
+ # step 2: function loss
94
+ loss += loss_fct(function_logits.view(-1, self.num_function_classes), labels.function_labels.view(-1))
95
+
96
+ return (loss, SyntaxLogitsOutput(attention_scores, function_logits, dep_indices))
97
+
98
+
99
+ class BertForSyntaxParsing(BertPreTrainedModel):
100
+
101
+ def __init__(self, config):
102
+ super().__init__(config)
103
+
104
+ self.bert = BertModel(config, add_pooling_layer=False)
105
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
106
+ self.syntax = BertSyntaxParsingHead(config)
107
+
108
+ # Initialize weights and apply final processing
109
+ self.post_init()
110
+
111
+ def forward(
112
+ self,
113
+ input_ids: Optional[torch.Tensor] = None,
114
+ attention_mask: Optional[torch.Tensor] = None,
115
+ token_type_ids: Optional[torch.Tensor] = None,
116
+ position_ids: Optional[torch.Tensor] = None,
117
+ labels: Optional[SyntaxLabels] = None,
118
+ head_mask: Optional[torch.Tensor] = None,
119
+ inputs_embeds: Optional[torch.Tensor] = None,
120
+ output_attentions: Optional[bool] = None,
121
+ output_hidden_states: Optional[bool] = None,
122
+ return_dict: Optional[bool] = None,
123
+ compute_syntax_mst: Optional[bool] = None,
124
+ ):
125
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
126
+
127
+ bert_outputs = self.bert(
128
+ input_ids,
129
+ attention_mask=attention_mask,
130
+ token_type_ids=token_type_ids,
131
+ position_ids=position_ids,
132
+ head_mask=head_mask,
133
+ inputs_embeds=inputs_embeds,
134
+ output_attentions=output_attentions,
135
+ output_hidden_states=output_hidden_states,
136
+ return_dict=return_dict,
137
+ )
138
+
139
+ extended_attention_mask = None
140
+ if attention_mask is not None:
141
+ extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_ids.size())
142
+ # apply the syntax head
143
+ loss, logits = self.syntax(self.dropout(bert_outputs[0]), extended_attention_mask, labels, compute_syntax_mst)
144
+
145
+ if not return_dict:
146
+ return (loss,(logits.dependency_logits, logits.function_logits)) + bert_outputs[2:]
147
+
148
+ return SyntaxTaggingOutput(
149
+ loss=loss,
150
+ logits=logits,
151
+ hidden_states=bert_outputs.hidden_states,
152
+ attentions=bert_outputs.attentions,
153
+ )
154
+
155
+ def predict(self, sentences: Union[str, List[str]], tokenizer: BertTokenizerFast, compute_mst=True):
156
+ if isinstance(sentences, str):
157
+ sentences = [sentences]
158
+
159
+ # predict the logits for the sentence
160
+ inputs = tokenizer(sentences, padding='longest', truncation=True, return_tensors='pt')
161
+ inputs = {k:v.to(self.device) for k,v in inputs.items()}
162
+ logits = self.forward(**inputs, return_dict=True, compute_syntax_mst=compute_mst).logits
163
+ return parse_logits(inputs['input_ids'].tolist(), sentences, tokenizer, logits)
164
+
165
+ def parse_logits(input_ids: List[List[int]], sentences: List[str], tokenizer: BertTokenizerFast, logits: SyntaxLogitsOutput):
166
+ outputs = []
167
+
168
+ special_toks = tokenizer.all_special_tokens
169
+ special_toks.remove(tokenizer.unk_token)
170
+ special_toks.remove(tokenizer.mask_token)
171
+
172
+ for i in range(len(sentences)):
173
+ deps = logits.dependency_head_indices[i].tolist()
174
+ funcs = logits.function_logits.argmax(-1)[i].tolist()
175
+ toks = [tok for tok in tokenizer.convert_ids_to_tokens(input_ids[i]) if tok not in special_toks]
176
+
177
+ # first, go through the tokens and create a mapping between each dependency index and the index without wordpieces
178
+ # wordpieces. At the same time, append the wordpieces in
179
+ idx_mapping = {-1:-1} # default root
180
+ real_idx = -1
181
+ for i in range(len(toks)):
182
+ if not toks[i].startswith('##'):
183
+ real_idx += 1
184
+ idx_mapping[i] = real_idx
185
+
186
+ # build our tree, keeping tracking of the root idx
187
+ tree = []
188
+ root_idx = 0
189
+ for i in range(len(toks)):
190
+ if toks[i].startswith('##'):
191
+ tree[-1]['word'] += toks[i][2:]
192
+ continue
193
+
194
+ dep_idx = deps[i + 1] - 1 # increase 1 for cls, decrease 1 for cls
195
+ if dep_idx == len(toks): dep_idx = i - 1 # if he predicts sep, then just point to the previous word
196
+
197
+ dep_head = 'root' if dep_idx == -1 else toks[dep_idx]
198
+ dep_func = ALL_FUNCTION_LABELS[funcs[i + 1]]
199
+
200
+ if dep_head == 'root': root_idx = len(tree)
201
+ tree.append(dict(word=toks[i], dep_head_idx=idx_mapping[dep_idx], dep_func=dep_func))
202
+ # append the head word
203
+ for d in tree:
204
+ d['dep_head'] = tree[d['dep_head_idx']]['word']
205
+
206
+ outputs.append(dict(tree=tree, root_idx=root_idx))
207
+ return outputs
208
+
209
+
210
+ def compute_mst_tree(attention_scores: torch.Tensor, extended_attention_mask: torch.LongTensor):
211
+ # attention scores should be 3 dimensions - batch x seq x seq (if it is 2 - just unsqueeze)
212
+ if attention_scores.ndim == 2: attention_scores = attention_scores.unsqueeze(0)
213
+ if attention_scores.ndim != 3 or attention_scores.shape[1] != attention_scores.shape[2]:
214
+ raise ValueError(f'Expected attention scores to be of shape batch x seq x seq, instead got {attention_scores.shape}')
215
+
216
+ batch_size, seq_len, _ = attention_scores.shape
217
+ # start by softmaxing so the scores are comparable
218
+ attention_scores = attention_scores.softmax(dim=-1)
219
+
220
+ batch_indices = torch.arange(batch_size, device=attention_scores.device)
221
+ seq_indices = torch.arange(seq_len, device=attention_scores.device)
222
+
223
+ seq_lens = torch.full((batch_size,), seq_len)
224
+
225
+ if extended_attention_mask is not None:
226
+ seq_lens = torch.argmax((extended_attention_mask != 0).int(), dim=2).squeeze(1)
227
+ # zero out any padding
228
+ attention_scores[extended_attention_mask.squeeze(1) != 0] = 0
229
+
230
+ # set the values for the CLS and sep to all by very low, so they never get chosen as a replacement arc
231
+ attention_scores[:, 0, :] = 0
232
+ attention_scores[batch_indices, seq_lens - 1, :] = 0
233
+ attention_scores[batch_indices, :, seq_lens - 1] = 0 # can never predict sep
234
+ # set the values for each token pointing to itself be 0
235
+ attention_scores[:, seq_indices, seq_indices] = 0
236
+
237
+ # find the root, and make him super high so we never have a conflict
238
+ root_cands = torch.argsort(attention_scores[:, :, 0], dim=-1)
239
+ attention_scores[batch_indices.unsqueeze(1), root_cands, 0] = 0
240
+ attention_scores[batch_indices, root_cands[:, -1], 0] = 1.0
241
+
242
+ # we start by getting the argmax for each score, and then computing the cycles and contracting them
243
+ sorted_indices = torch.argsort(attention_scores, dim=-1, descending=True)
244
+ indices = sorted_indices[:, :, 0].clone() # take the argmax
245
+
246
+ attention_scores = attention_scores.tolist()
247
+ seq_lens = seq_lens.tolist()
248
+ sorted_indices = [[sub_l[:slen] for sub_l in l[:slen]] for l,slen in zip(sorted_indices.tolist(), seq_lens)]
249
+
250
+
251
+ # go through each batch item and make sure our tree works
252
+ for batch_idx in range(batch_size):
253
+ # We have one root - detect the cycles and contract them. A cycle can never contain the root so really
254
+ # for every cycle, we look at all the nodes, and find the highest arc out of the cycle for any values. Replace that and tada
255
+ has_cycle, cycle_nodes = detect_cycle(indices[batch_idx], seq_lens[batch_idx])
256
+ contracted_arcs = set()
257
+ while has_cycle:
258
+ base_idx, head_idx = choose_contracting_arc(indices[batch_idx], sorted_indices[batch_idx], cycle_nodes, contracted_arcs, seq_lens[batch_idx], attention_scores[batch_idx])
259
+ indices[batch_idx, base_idx] = head_idx
260
+ contracted_arcs.add(base_idx)
261
+ # find the next cycle
262
+ has_cycle, cycle_nodes = detect_cycle(indices[batch_idx], seq_lens[batch_idx])
263
+
264
+ return indices
265
+
266
+ def detect_cycle(indices: torch.LongTensor, seq_len: int):
267
+ # Simple cycle detection algorithm
268
+ # Returns a boolean indicating if a cycle is detected and the nodes involved in the cycle
269
+ visited = set()
270
+ for node in range(1, seq_len - 1): # ignore the CLS/SEP tokens
271
+ if node in visited:
272
+ continue
273
+ current_path = set()
274
+ while node not in visited:
275
+ visited.add(node)
276
+ current_path.add(node)
277
+ node = indices[node].item()
278
+ if node == 0: break # roots never point to anything
279
+ if node in current_path:
280
+ return True, current_path # Cycle detected
281
+ return False, None
282
+
283
+ def choose_contracting_arc(indices: torch.LongTensor, sorted_indices: List[List[int]], cycle_nodes: set, contracted_arcs: set, seq_len: int, scores: List[List[float]]):
284
+ # Chooses the highest-scoring, non-cycling arc from a graph. Iterates through 'cycle_nodes' to find
285
+ # the best arc based on 'scores', avoiding cycles and zero node connections.
286
+ # For each node, we only look at the next highest scoring non-cycling arc
287
+ best_base_idx, best_head_idx = -1, -1
288
+ score = 0
289
+
290
+ # convert the indices to a list once, to avoid multiple conversions (saves a few seconds)
291
+ currents = indices.tolist()
292
+ for base_node in cycle_nodes:
293
+ if base_node in contracted_arcs: continue
294
+ # we don't want to take anything that has a higher score than the current value - we can end up in an endless loop
295
+ # Since the indices are sorted, as soon as we find our current item, we can move on to the next.
296
+ current = currents[base_node]
297
+ found_current = False
298
+
299
+ for head_node in sorted_indices[base_node]:
300
+ if head_node == current:
301
+ found_current = True
302
+ continue
303
+ if head_node in contracted_arcs: continue
304
+ if not found_current or head_node in cycle_nodes or head_node == 0:
305
+ continue
306
+
307
+ current_score = scores[base_node][head_node]
308
+ if current_score > score:
309
+ best_base_idx, best_head_idx, score = base_node, head_node, current_score
310
+ break
311
+
312
+ if best_base_idx == -1:
313
+ raise ValueError('Stuck in endless loop trying to compute syntax mst. Please try again setting compute_syntax_mst=False')
314
+
315
+ return best_base_idx, best_head_idx
dictabert-joint/README.md ADDED
@@ -0,0 +1,521 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: cc-by-4.0
3
+ language:
4
+ - he
5
+ inference: false
6
+ ---
7
+ # DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew
8
+
9
+ State-of-the-art language model for Hebrew, released [here](https://arxiv.org/abs/2403.06970).
10
+
11
+ This is the fine-tuned model for the joint parsing of the following tasks:
12
+
13
+ - Prefix Segmentation
14
+ - Morphological Disambiguation
15
+ - Lexicographical Analysis (Lemmatization)
16
+ - Syntactical Parsing (Dependency-Tree)
17
+ - Named-Entity Recognition
18
+
19
+ A live demo of the model with instant visualization of the syntax tree can be found [here](https://huggingface.co/spaces/dicta-il/joint-demo).
20
+
21
+ For a faster model, you can use the equivalent bert-tiny model for this task [here](https://huggingface.co/dicta-il/dictabert-tiny-joint).
22
+
23
+ For the bert-base models for other tasks, see [here](https://huggingface.co/collections/dicta-il/dictabert-6588e7cc08f83845fc42a18b).
24
+
25
+ ---
26
+
27
+ The model currently supports 3 types of output:
28
+
29
+ 1. **JSON**: The model returns a JSON object for each sentence in the input, where for each sentence we have the sentence text, the NER entities, and the list of tokens. For each token we include the output from each of the tasks.
30
+ ```python
31
+ model.predict(..., output_style='json')
32
+ ```
33
+
34
+ 1. **UD**: The model returns the full UD output for each sentence, according to the style of the Hebrew UD Treebank.
35
+ ```python
36
+ model.predict(..., output_style='ud')
37
+ ```
38
+
39
+ 1. **UD, in the style of IAHLT**: This model returns the full UD output, with slight modifications to match the style of IAHLT. These differences are mostly granularity of some dependency relations, how the suffix of a word is broken up, and implicit definite articles. The actual tagging behavior doesn't change.
40
+ ```python
41
+ model.predict(..., output_style='iahlt_ud')
42
+ ```
43
+
44
+ ---
45
+
46
+ If you only need the output for one of the tasks, you can tell the model to not initialize some of the heads, for example:
47
+ ```python
48
+ model = AutoModel.from_pretrained('dicta-il/dictabert-joint', trust_remote_code=True, do_lex=False)
49
+ ```
50
+
51
+ The list of options are: `do_lex`, `do_syntax`, `do_ner`, `do_prefix`, `do_morph`.
52
+
53
+ ---
54
+
55
+ Sample usage:
56
+
57
+ ```python
58
+ from transformers import AutoModel, AutoTokenizer
59
+
60
+ tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-joint')
61
+ model = AutoModel.from_pretrained('dicta-il/dictabert-joint', trust_remote_code=True)
62
+
63
+ model.eval()
64
+
65
+ sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
66
+ print(model.predict([sentence], tokenizer, output_style='json')) # see below for other return formats
67
+ ```
68
+
69
+ Output:
70
+ ```json
71
+ [
72
+ {
73
+ "text": "בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים",
74
+ "tokens": [
75
+ {
76
+ "token": "בשנת",
77
+ "syntax": {
78
+ "word": "בשנת",
79
+ "dep_head_idx": 2,
80
+ "dep_func": "obl",
81
+ "dep_head": "השלים"
82
+ },
83
+ "seg": [
84
+ "ב",
85
+ "שנת"
86
+ ],
87
+ "lex": "שנה",
88
+ "morph": {
89
+ "token": "בשנת",
90
+ "pos": "NOUN",
91
+ "feats": {
92
+ "Gender": "Fem",
93
+ "Number": "Sing"
94
+ },
95
+ "prefixes": [
96
+ "ADP"
97
+ ],
98
+ "suffix": false
99
+ }
100
+ },
101
+ {
102
+ "token": "1948",
103
+ "syntax": {
104
+ "word": "1948",
105
+ "dep_head_idx": 0,
106
+ "dep_func": "compound",
107
+ "dep_head": "בשנת"
108
+ },
109
+ "seg": [
110
+ "1948"
111
+ ],
112
+ "lex": "1948",
113
+ "morph": {
114
+ "token": "1948",
115
+ "pos": "NUM",
116
+ "feats": {},
117
+ "prefixes": [],
118
+ "suffix": false
119
+ }
120
+ },
121
+ {
122
+ "token": "השלים",
123
+ "syntax": {
124
+ "word": "השלים",
125
+ "dep_head_idx": -1,
126
+ "dep_func": "root",
127
+ "dep_head": "הומוריסטיים"
128
+ },
129
+ "seg": [
130
+ "השלים"
131
+ ],
132
+ "lex": "השלים",
133
+ "morph": {
134
+ "token": "השלים",
135
+ "pos": "VERB",
136
+ "feats": {
137
+ "Gender": "Masc",
138
+ "Number": "Sing",
139
+ "Person": "3",
140
+ "Tense": "Past"
141
+ },
142
+ "prefixes": [],
143
+ "suffix": false
144
+ }
145
+ },
146
+ {
147
+ "token": "אפרים",
148
+ "syntax": {
149
+ "word": "אפרים",
150
+ "dep_head_idx": 2,
151
+ "dep_func": "nsubj",
152
+ "dep_head": "השלים"
153
+ },
154
+ "seg": [
155
+ "אפרים"
156
+ ],
157
+ "lex": "אפרים",
158
+ "morph": {
159
+ "token": "אפרים",
160
+ "pos": "PROPN",
161
+ "feats": {},
162
+ "prefixes": [],
163
+ "suffix": false
164
+ }
165
+ },
166
+ {
167
+ "token": "קיש��ן",
168
+ "syntax": {
169
+ "word": "קישון",
170
+ "dep_head_idx": 3,
171
+ "dep_func": "flat",
172
+ "dep_head": "אפרים"
173
+ },
174
+ "seg": [
175
+ "קישון"
176
+ ],
177
+ "lex": "קישון",
178
+ "morph": {
179
+ "token": "קישון",
180
+ "pos": "PROPN",
181
+ "feats": {},
182
+ "prefixes": [],
183
+ "suffix": false
184
+ }
185
+ },
186
+ {
187
+ "token": "את",
188
+ "syntax": {
189
+ "word": "את",
190
+ "dep_head_idx": 6,
191
+ "dep_func": "case",
192
+ "dep_head": "לימודיו"
193
+ },
194
+ "seg": [
195
+ "את"
196
+ ],
197
+ "lex": "את",
198
+ "morph": {
199
+ "token": "את",
200
+ "pos": "ADP",
201
+ "feats": {},
202
+ "prefixes": [],
203
+ "suffix": false
204
+ }
205
+ },
206
+ {
207
+ "token": "לימודיו",
208
+ "syntax": {
209
+ "word": "לימודיו",
210
+ "dep_head_idx": 2,
211
+ "dep_func": "obj",
212
+ "dep_head": "השלים"
213
+ },
214
+ "seg": [
215
+ "לימודיו"
216
+ ],
217
+ "lex": "לימוד",
218
+ "morph": {
219
+ "token": "לימודיו",
220
+ "pos": "NOUN",
221
+ "feats": {
222
+ "Gender": "Masc",
223
+ "Number": "Plur"
224
+ },
225
+ "prefixes": [],
226
+ "suffix": "PRON",
227
+ "suffix_feats": {
228
+ "Gender": "Masc",
229
+ "Number": "Sing",
230
+ "Person": "3"
231
+ }
232
+ }
233
+ },
234
+ {
235
+ "token": "בפיסול",
236
+ "syntax": {
237
+ "word": "בפיסול",
238
+ "dep_head_idx": 6,
239
+ "dep_func": "nmod",
240
+ "dep_head": "לימודיו"
241
+ },
242
+ "seg": [
243
+ "ב",
244
+ "פיסול"
245
+ ],
246
+ "lex": "פיסול",
247
+ "morph": {
248
+ "token": "בפיסול",
249
+ "pos": "NOUN",
250
+ "feats": {
251
+ "Gender": "Masc",
252
+ "Number": "Sing"
253
+ },
254
+ "prefixes": [
255
+ "ADP"
256
+ ],
257
+ "suffix": false
258
+ }
259
+ },
260
+ {
261
+ "token": "מתכת",
262
+ "syntax": {
263
+ "word": "מתכת",
264
+ "dep_head_idx": 7,
265
+ "dep_func": "compound",
266
+ "dep_head": "בפיסול"
267
+ },
268
+ "seg": [
269
+ "מתכת"
270
+ ],
271
+ "lex": "מתכת",
272
+ "morph": {
273
+ "token": "מתכת",
274
+ "pos": "NOUN",
275
+ "feats": {
276
+ "Gender": "Fem",
277
+ "Number": "Sing"
278
+ },
279
+ "prefixes": [],
280
+ "suffix": false
281
+ }
282
+ },
283
+ {
284
+ "token": "ובתולדות",
285
+ "syntax": {
286
+ "word": "ובתולדות",
287
+ "dep_head_idx": 7,
288
+ "dep_func": "conj",
289
+ "dep_head": "בפיסול"
290
+ },
291
+ "seg": [
292
+ "וב",
293
+ "תולדות"
294
+ ],
295
+ "lex": "תולדה",
296
+ "morph": {
297
+ "token": "ובתולדות",
298
+ "pos": "NOUN",
299
+ "feats": {
300
+ "Gender": "Fem",
301
+ "Number": "Plur"
302
+ },
303
+ "prefixes": [
304
+ "CCONJ",
305
+ "ADP"
306
+ ],
307
+ "suffix": false
308
+ }
309
+ },
310
+ {
311
+ "token": "האמנות",
312
+ "syntax": {
313
+ "word": "האמנות",
314
+ "dep_head_idx": 9,
315
+ "dep_func": "compound",
316
+ "dep_head": "ובתולדות"
317
+ },
318
+ "seg": [
319
+ "ה",
320
+ "אמנות"
321
+ ],
322
+ "lex": "אומנות",
323
+ "morph": {
324
+ "token": "האמנות",
325
+ "pos": "NOUN",
326
+ "feats": {
327
+ "Gender": "Fem",
328
+ "Number": "Sing"
329
+ },
330
+ "prefixes": [
331
+ "DET"
332
+ ],
333
+ "suffix": false
334
+ }
335
+ },
336
+ {
337
+ "token": "והחל",
338
+ "syntax": {
339
+ "word": "והחל",
340
+ "dep_head_idx": 2,
341
+ "dep_func": "conj",
342
+ "dep_head": "השלים"
343
+ },
344
+ "seg": [
345
+ "ו",
346
+ "החל"
347
+ ],
348
+ "lex": "החל",
349
+ "morph": {
350
+ "token": "והחל",
351
+ "pos": "VERB",
352
+ "feats": {
353
+ "Gender": "Masc",
354
+ "Number": "Sing",
355
+ "Person": "3",
356
+ "Tense": "Past"
357
+ },
358
+ "prefixes": [
359
+ "CCONJ"
360
+ ],
361
+ "suffix": false
362
+ }
363
+ },
364
+ {
365
+ "token": "לפרסם",
366
+ "syntax": {
367
+ "word": "לפרסם",
368
+ "dep_head_idx": 11,
369
+ "dep_func": "xcomp",
370
+ "dep_head": "והחל"
371
+ },
372
+ "seg": [
373
+ "לפרסם"
374
+ ],
375
+ "lex": "פרסם",
376
+ "morph": {
377
+ "token": "לפרסם",
378
+ "pos": "VERB",
379
+ "feats": {},
380
+ "prefixes": [],
381
+ "suffix": false
382
+ }
383
+ },
384
+ {
385
+ "token": "מאמרים",
386
+ "syntax": {
387
+ "word": "מא��רים",
388
+ "dep_head_idx": 12,
389
+ "dep_func": "obj",
390
+ "dep_head": "לפרסם"
391
+ },
392
+ "seg": [
393
+ "מאמרים"
394
+ ],
395
+ "lex": "מאמר",
396
+ "morph": {
397
+ "token": "מאמרים",
398
+ "pos": "NOUN",
399
+ "feats": {
400
+ "Gender": "Masc",
401
+ "Number": "Plur"
402
+ },
403
+ "prefixes": [],
404
+ "suffix": false
405
+ }
406
+ },
407
+ {
408
+ "token": "הומוריסטיים",
409
+ "syntax": {
410
+ "word": "הומוריסטיים",
411
+ "dep_head_idx": 13,
412
+ "dep_func": "amod",
413
+ "dep_head": "מאמרים"
414
+ },
415
+ "seg": [
416
+ "הומוריסטיים"
417
+ ],
418
+ "lex": "הומוריסטי",
419
+ "morph": {
420
+ "token": "הומוריסטיים",
421
+ "pos": "ADJ",
422
+ "feats": {
423
+ "Gender": "Masc",
424
+ "Number": "Plur"
425
+ },
426
+ "prefixes": [],
427
+ "suffix": false
428
+ }
429
+ }
430
+ ],
431
+ "root_idx": 2,
432
+ "ner_entities": [
433
+ {
434
+ "phrase": "1948",
435
+ "label": "TIMEX"
436
+ },
437
+ {
438
+ "phrase": "אפרים קישון",
439
+ "label": "PER"
440
+ }
441
+ ]
442
+ }
443
+ ]
444
+ ```
445
+
446
+ You can also choose to get your response in UD format:
447
+
448
+ ```python
449
+ sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
450
+ print(model.predict([sentence], tokenizer, output_style='ud'))
451
+ ```
452
+
453
+ Results:
454
+ ```json
455
+ [
456
+ [
457
+ "# sent_id = 1",
458
+ "# text = בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים",
459
+ "1-2\tבשנת\t_\t_\t_\t_\t_\t_\t_\t_",
460
+ "1\tב\tב\tADP\tADP\t_\t2\tcase\t_\t_",
461
+ "2\tשנת\tשנה\tNOUN\tNOUN\tGender=Fem|Number=Sing\t4\tobl\t_\t_",
462
+ "3\t1948\t1948\tNUM\tNUM\t\t2\tcompound:smixut\t_\t_",
463
+ "4\tהשלים\tהשלים\tVERB\tVERB\tGender=Masc|Number=Sing|Person=3|Tense=Past\t0\troot\t_\t_",
464
+ "5\tאפרים\tאפרים\tPROPN\tPROPN\t\t4\tnsubj\t_\t_",
465
+ "6\tקישון\tקישון\tPROPN\tPROPN\t\t5\tflat\t_\t_",
466
+ "7\tאת\tאת\tADP\tADP\t\t8\tcase:acc\t_\t_",
467
+ "8-10\tלימודיו\t_\t_\t_\t_\t_\t_\t_\t_",
468
+ "8\tלימוד_\tלימוד\tNOUN\tNOUN\tGender=Masc|Number=Plur\t4\tobj\t_\t_",
469
+ "9\t_של_\tשל\tADP\tADP\t_\t10\tcase\t_\t_",
470
+ "10\t_הוא\tהוא\tPRON\tPRON\tGender=Masc|Number=Sing|Person=3\t8\tnmod:poss\t_\t_",
471
+ "11-12\tבפיסול\t_\t_\t_\t_\t_\t_\t_\t_",
472
+ "11\tב\tב\tADP\tADP\t_\t12\tcase\t_\t_",
473
+ "12\tפיסול\tפיסול\tNOUN\tNOUN\tGender=Masc|Number=Sing\t8\tnmod\t_\t_",
474
+ "13\tמתכת\tמתכת\tNOUN\tNOUN\tGender=Fem|Number=Sing\t12\tcompound:smixut\t_\t_",
475
+ "14-16\tובתולדות\t_\t_\t_\t_\t_\t_\t_\t_",
476
+ "14\tו\tו\tCCONJ\tCCONJ\t_\t16\tcc\t_\t_",
477
+ "15\tב\tב\tADP\tADP\t_\t16\tcase\t_\t_",
478
+ "16\tתולדות\tתולדה\tNOUN\tNOUN\tGender=Fem|Number=Plur\t12\tconj\t_\t_",
479
+ "17-18\tהאמנות\t_\t_\t_\t_\t_\t_\t_\t_",
480
+ "17\tה\tה\tDET\tDET\t_\t18\tdet\t_\t_",
481
+ "18\tאמנות\tאומנות\tNOUN\tNOUN\tGender=Fem|Number=Sing\t16\tcompound:smixut\t_\t_",
482
+ "19-20\tוהחל\t_\t_\t_\t_\t_\t_\t_\t_",
483
+ "19\tו\tו\tCCONJ\tCCONJ\t_\t20\tcc\t_\t_",
484
+ "20\tהחל\tהחל\tVERB\tVERB\tGender=Masc|Number=Sing|Person=3|Tense=Past\t4\tconj\t_\t_",
485
+ "21\tלפרסם\tפרסם\tVERB\tVERB\t\t20\txcomp\t_\t_",
486
+ "22\tמאמרים\tמאמר\tNOUN\tNOUN\tGender=Masc|Number=Plur\t21\tobj\t_\t_",
487
+ "23\tהומוריסטיים\tהומוריסטי\tADJ\tADJ\tGender=Masc|Number=Plur\t22\tamod\t_\t_"
488
+ ]
489
+ ]
490
+ ```
491
+
492
+
493
+ ## Citation
494
+
495
+ If you use DictaBERT-joint in your research, please cite ```MRL Parsing without Tears: The Case of Hebrew```
496
+
497
+ **BibTeX:**
498
+
499
+ ```bibtex
500
+ @misc{shmidman2024mrl,
501
+ title={MRL Parsing Without Tears: The Case of Hebrew},
502
+ author={Shaltiel Shmidman and Avi Shmidman and Moshe Koppel and Reut Tsarfaty},
503
+ year={2024},
504
+ eprint={2403.06970},
505
+ archivePrefix={arXiv},
506
+ primaryClass={cs.CL}
507
+ }
508
+ ```
509
+
510
+ ## License
511
+
512
+ Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
513
+
514
+ This work is licensed under a
515
+ [Creative Commons Attribution 4.0 International License][cc-by].
516
+
517
+ [![CC BY 4.0][cc-by-image]][cc-by]
518
+
519
+ [cc-by]: http://creativecommons.org/licenses/by/4.0/
520
+ [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
521
+ [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
dictabert-joint/config.json ADDED
@@ -0,0 +1,93 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "BertForJointParsing"
4
+ ],
5
+ "auto_map": {
6
+ "AutoModel": "BertForJointParsing.BertForJointParsing"
7
+ },
8
+ "attention_probs_dropout_prob": 0.1,
9
+ "classifier_dropout": null,
10
+ "do_lex": true,
11
+ "do_morph": true,
12
+ "do_ner": true,
13
+ "do_prefix": true,
14
+ "do_syntax": true,
15
+ "gradient_checkpointing": false,
16
+ "hidden_act": "gelu",
17
+ "hidden_dropout_prob": 0.1,
18
+ "hidden_size": 768,
19
+ "id2label": {
20
+ "0": "B-ANG",
21
+ "1": "B-DUC",
22
+ "2": "B-EVE",
23
+ "3": "B-FAC",
24
+ "4": "B-GPE",
25
+ "5": "B-LOC",
26
+ "6": "B-ORG",
27
+ "7": "B-PER",
28
+ "8": "B-WOA",
29
+ "9": "B-INFORMAL",
30
+ "10": "B-MISC",
31
+ "11": "B-TIMEX",
32
+ "12": "B-TTL",
33
+ "13": "I-DUC",
34
+ "14": "I-EVE",
35
+ "15": "I-FAC",
36
+ "16": "I-GPE",
37
+ "17": "I-LOC",
38
+ "18": "I-ORG",
39
+ "19": "I-PER",
40
+ "20": "I-WOA",
41
+ "21": "I-ANG",
42
+ "22": "I-INFORMAL",
43
+ "23": "I-MISC",
44
+ "24": "I-TIMEX",
45
+ "25": "I-TTL",
46
+ "26": "O"
47
+ },
48
+ "initializer_range": 0.02,
49
+ "intermediate_size": 3072,
50
+ "label2id": {
51
+ "B-ANG": 0,
52
+ "B-DUC": 1,
53
+ "B-EVE": 2,
54
+ "B-FAC": 3,
55
+ "B-GPE": 4,
56
+ "B-INFORMAL": 9,
57
+ "B-LOC": 5,
58
+ "B-MISC": 10,
59
+ "B-ORG": 6,
60
+ "B-PER": 7,
61
+ "B-TIMEX": 11,
62
+ "B-TTL": 12,
63
+ "B-WOA": 8,
64
+ "I-ANG": 21,
65
+ "I-DUC": 13,
66
+ "I-EVE": 14,
67
+ "I-FAC": 15,
68
+ "I-GPE": 16,
69
+ "I-INFORMAL": 22,
70
+ "I-LOC": 17,
71
+ "I-MISC": 23,
72
+ "I-ORG": 18,
73
+ "I-PER": 19,
74
+ "I-TIMEX": 24,
75
+ "I-TTL": 25,
76
+ "I-WOA": 20,
77
+ "O": 26
78
+ },
79
+ "layer_norm_eps": 1e-12,
80
+ "max_position_embeddings": 512,
81
+ "model_type": "bert",
82
+ "newmodern": true,
83
+ "num_attention_heads": 12,
84
+ "num_hidden_layers": 12,
85
+ "pad_token_id": 0,
86
+ "position_embedding_type": "absolute",
87
+ "syntax_head_size": 128,
88
+ "torch_dtype": "float32",
89
+ "transformers_version": "4.36.2",
90
+ "type_vocab_size": 2,
91
+ "use_cache": true,
92
+ "vocab_size": 128000
93
+ }
dictabert-joint/model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c59db02c002b69fd3c072c310e3ba578d036386477248d1c8576d896cc06aa1c
3
+ size 744080096
dictabert-joint/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26ce128baa792901b22cecfdd6a7dc783307e60d4b766dcd3aa4d1eaeb3a36d2
3
+ size 744148153
dictabert-joint/source.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ https://huggingface.co/dicta-il/dictabert-joint
dictabert-joint/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
dictabert-joint/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
dictabert-joint/tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[UNK]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[CLS]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[SEP]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[PAD]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "5": {
44
+ "content": "[BLANK]",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ }
51
+ },
52
+ "clean_up_tokenization_spaces": true,
53
+ "cls_token": "[CLS]",
54
+ "do_lower_case": true,
55
+ "mask_token": "[MASK]",
56
+ "model_max_length": 512,
57
+ "pad_token": "[PAD]",
58
+ "sep_token": "[SEP]",
59
+ "strip_accents": null,
60
+ "tokenize_chinese_chars": true,
61
+ "tokenizer_class": "BertTokenizer",
62
+ "unk_token": "[UNK]"
63
+ }
dictabert-joint/vocab.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0fb90bfa35244d26f0065d1fcd0b5becc3da3d44d616a7e2aacaf6320b9fa2d0
3
+ size 1500244
dictabert-large-char-menaked/.gitattributes ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
dictabert-large-char-menaked/BertForDiacritization.py ADDED
@@ -0,0 +1,190 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from dataclasses import dataclass
2
+ from typing import List, Optional, Tuple, Union
3
+ import torch
4
+ from torch import nn
5
+ from transformers.utils import ModelOutput
6
+ from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
7
+
8
+ # MAT_LECT => Matres Lectionis, known in Hebrew as Em Kriaa.
9
+ MAT_LECT_TOKEN = '<MAT_LECT>'
10
+ NIKUD_CLASSES = ['', MAT_LECT_TOKEN, '\u05BC', '\u05B0', '\u05B1', '\u05B2', '\u05B3', '\u05B4', '\u05B5', '\u05B6', '\u05B7', '\u05B8', '\u05B9', '\u05BA', '\u05BB', '\u05BC\u05B0', '\u05BC\u05B1', '\u05BC\u05B2', '\u05BC\u05B3', '\u05BC\u05B4', '\u05BC\u05B5', '\u05BC\u05B6', '\u05BC\u05B7', '\u05BC\u05B8', '\u05BC\u05B9', '\u05BC\u05BA', '\u05BC\u05BB', '\u05C7', '\u05BC\u05C7']
11
+ SHIN_CLASSES = ['\u05C1', '\u05C2'] # shin, sin
12
+
13
+ @dataclass
14
+ class MenakedLogitsOutput(ModelOutput):
15
+ nikud_logits: torch.FloatTensor = None
16
+ shin_logits: torch.FloatTensor = None
17
+
18
+ def detach(self):
19
+ return MenakedLogitsOutput(self.nikud_logits.detach(), self.shin_logits.detach())
20
+
21
+ @dataclass
22
+ class MenakedOutput(ModelOutput):
23
+ loss: Optional[torch.FloatTensor] = None
24
+ logits: Optional[MenakedLogitsOutput] = None
25
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
26
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
27
+
28
+ @dataclass
29
+ class MenakedLabels(ModelOutput):
30
+ nikud_labels: Optional[torch.FloatTensor] = None
31
+ shin_labels: Optional[torch.FloatTensor] = None
32
+
33
+ def detach(self):
34
+ return MenakedLabels(self.nikud_labels.detach(), self.shin_labels.detach())
35
+
36
+ def to(self, device):
37
+ return MenakedLabels(self.nikud_labels.to(device), self.shin_labels.to(device))
38
+
39
+ class BertMenakedHead(nn.Module):
40
+ def __init__(self, config):
41
+ super().__init__()
42
+ self.config = config
43
+
44
+ if not hasattr(config, 'nikud_classes'):
45
+ config.nikud_classes = NIKUD_CLASSES
46
+ config.shin_classes = SHIN_CLASSES
47
+ config.mat_lect_token = MAT_LECT_TOKEN
48
+
49
+ self.num_nikud_classes = len(config.nikud_classes)
50
+ self.num_shin_classes = len(config.shin_classes)
51
+
52
+ # create our classifiers
53
+ self.nikud_cls = nn.Linear(config.hidden_size, self.num_nikud_classes)
54
+ self.shin_cls = nn.Linear(config.hidden_size, self.num_shin_classes)
55
+
56
+ def forward(
57
+ self,
58
+ hidden_states: torch.Tensor,
59
+ labels: Optional[MenakedLabels] = None):
60
+
61
+ # run each of the classifiers on the transformed output
62
+ nikud_logits = self.nikud_cls(hidden_states)
63
+ shin_logits = self.shin_cls(hidden_states)
64
+
65
+ loss = None
66
+ if labels is not None:
67
+ loss_fct = nn.CrossEntropyLoss()
68
+ loss = loss_fct(nikud_logits.view(-1, self.num_nikud_classes), labels.nikud_labels.view(-1))
69
+ loss += loss_fct(shin_logits.view(-1, self.num_shin_classes), labels.shin_labels.view(-1))
70
+
71
+ return loss, MenakedLogitsOutput(nikud_logits, shin_logits)
72
+
73
+ class BertForDiacritization(BertPreTrainedModel):
74
+ def __init__(self, config):
75
+ super().__init__(config)
76
+ self.config = config
77
+ self.bert = BertModel(config, add_pooling_layer=False)
78
+
79
+ classifier_dropout = config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
80
+ self.dropout = nn.Dropout(classifier_dropout)
81
+
82
+ self.menaked = BertMenakedHead(config)
83
+
84
+ # Initialize weights and apply final processing
85
+ self.post_init()
86
+
87
+ def forward(
88
+ self,
89
+ input_ids: Optional[torch.Tensor] = None,
90
+ attention_mask: Optional[torch.Tensor] = None,
91
+ token_type_ids: Optional[torch.Tensor] = None,
92
+ position_ids: Optional[torch.Tensor] = None,
93
+ head_mask: Optional[torch.Tensor] = None,
94
+ inputs_embeds: Optional[torch.Tensor] = None,
95
+ labels: Optional[torch.Tensor] = None,
96
+ output_attentions: Optional[bool] = None,
97
+ output_hidden_states: Optional[bool] = None,
98
+ return_dict: Optional[bool] = None,
99
+ ) -> Union[Tuple[torch.Tensor], MenakedOutput]:
100
+
101
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
102
+
103
+ bert_outputs = self.bert(
104
+ input_ids,
105
+ attention_mask=attention_mask,
106
+ token_type_ids=token_type_ids,
107
+ position_ids=position_ids,
108
+ head_mask=head_mask,
109
+ inputs_embeds=inputs_embeds,
110
+ output_attentions=output_attentions,
111
+ output_hidden_states=output_hidden_states,
112
+ return_dict=return_dict,
113
+ )
114
+
115
+ hidden_states = bert_outputs[0]
116
+ hidden_states = self.dropout(hidden_states)
117
+
118
+ loss, logits = self.menaked(hidden_states, labels)
119
+
120
+ if not return_dict:
121
+ return (loss,logits) + bert_outputs[2:]
122
+
123
+ return MenakedOutput(
124
+ loss=loss,
125
+ logits=logits,
126
+ hidden_states=bert_outputs.hidden_states,
127
+ attentions=bert_outputs.attentions,
128
+ )
129
+
130
+ def predict(self, sentences: List[str], tokenizer: BertTokenizerFast, mark_matres_lectionis: str = None, padding='longest'):
131
+ sentences = [remove_nikkud(sentence) for sentence in sentences]
132
+ # assert the lengths aren't out of range
133
+ assert all(len(sentence) + 2 <= tokenizer.model_max_length for sentence in sentences), f'All sentences must be <= {tokenizer.model_max_length}, please segment and try again'
134
+
135
+ # tokenize the inputs and convert them to relevant device
136
+ inputs = tokenizer(sentences, padding=padding, truncation=True, return_tensors='pt', return_offsets_mapping=True)
137
+ offset_mapping = inputs.pop('offset_mapping')
138
+ inputs = {k:v.to(self.device) for k,v in inputs.items()}
139
+
140
+ # calculate the predictions
141
+ logits = self.forward(**inputs, return_dict=True).logits
142
+ nikud_predictions = logits.nikud_logits.argmax(dim=-1).tolist()
143
+ shin_predictions = logits.shin_logits.argmax(dim=-1).tolist()
144
+
145
+ ret = []
146
+ for sent_idx,(sentence,sent_offsets) in enumerate(zip(sentences, offset_mapping)):
147
+ # assign the nikud to each letter!
148
+ output = []
149
+ prev_index = 0
150
+ for idx,offsets in enumerate(sent_offsets):
151
+ # add in anything we missed
152
+ if offsets[0] > prev_index:
153
+ output.append(sentence[prev_index:offsets[0]])
154
+ if offsets[1] - offsets[0] != 1: continue
155
+
156
+ # get our next char
157
+ char = sentence[offsets[0]:offsets[1]]
158
+ prev_index = offsets[1]
159
+ if not is_hebrew_letter(char):
160
+ output.append(char)
161
+ continue
162
+
163
+ nikud = self.config.nikud_classes[nikud_predictions[sent_idx][idx]]
164
+ shin = '' if char != 'ש' else self.config.shin_classes[shin_predictions[sent_idx][idx]]
165
+
166
+ # check for matres lectionis
167
+ if nikud == self.config.mat_lect_token:
168
+ if not is_matres_letter(char): nikud = '' # don't allow matres on irrelevant letters
169
+ elif mark_matres_lectionis is not None: nikud = mark_matres_lectionis
170
+ else: continue
171
+
172
+ output.append(char + shin + nikud)
173
+ output.append(sentence[prev_index:])
174
+ ret.append(''.join(output))
175
+
176
+ return ret
177
+
178
+ ALEF_ORD = ord('א')
179
+ TAF_ORD = ord('ת')
180
+ def is_hebrew_letter(char):
181
+ return ALEF_ORD <= ord(char) <= TAF_ORD
182
+
183
+ MATRES_LETTERS = list('אוי')
184
+ def is_matres_letter(char):
185
+ return char in MATRES_LETTERS
186
+
187
+ import re
188
+ nikud_pattern = re.compile(r'[\u05B0-\u05BD\u05C1\u05C2\u05C7]')
189
+ def remove_nikkud(text):
190
+ return nikud_pattern.sub('', text)
dictabert-large-char-menaked/README.md ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: cc-by-4.0
3
+ language:
4
+ - he
5
+ inference: false
6
+ ---
7
+ # DictaBERT-large-char-menaked: An open-source BERT-based model for adding diacritiziation marks ("nikud") to Hebrew texts
8
+
9
+ This model is a fine-tuned version of [DictaBERT-large-char](https://huggingface.co/dicta-il/dictabert-large-char), dedicated to the task of adding nikud (diacritics) to Hebrew text.
10
+
11
+ The model was trained on a corpus of modern Hebrew texts manually diacritized by linguistic experts.
12
+ As of 2025-03, this model provides SOTA performance on all modern Hebrew vocalization benchmarks as compared to all other open-source alternatives, as well as when compared with commercial generative LLMs.
13
+
14
+ Note: this model is trained to handle a wide variety of genres of modern Hebrew prose. However, it is not intended for earlier layers of Hebrew (e.g. Biblical, Rabbinic, Premodern), nor for poetic texts.
15
+
16
+ Sample usage:
17
+
18
+ ```python
19
+ from transformers import AutoModel, AutoTokenizer
20
+
21
+ tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-large-char-menaked')
22
+ model = AutoModel.from_pretrained('dicta-il/dictabert-large-char-menaked', trust_remote_code=True)
23
+
24
+ model.eval()
25
+
26
+ sentence = 'בשנת 1948 השלים אפרים קישון את לימודיו בפיסול מתכת ובתולדות האמנות והחל לפרסם מאמרים הומוריסטיים'
27
+ print(model.predict([sentence], tokenizer))
28
+ ```
29
+
30
+ Output:
31
+ ```json
32
+ ['בִּשְׁנַת 1948 הִשְׁלִים אֶפְרַיִם קִישׁוֹן אֶת לִמּוּדָיו בְּפִסּוּל מַתֶּכֶת וּבְתוֹלְדוֹת הָאׇמָּנוּת וְהֵחֵל לְפַרְסֵם מַאֲמָרִים הוּמוֹרִיסְטִיִּים']
33
+ ```
34
+
35
+ ### Matres Lectionis (אימות קריאה)
36
+
37
+ As can be seen, the predict method automatically removed all the matres-lectionis (אימות קריאה). If you wish to keep them in, you can specify that to the predict function:
38
+
39
+ ```python
40
+ print(model.predict([sentence], tokenizer, mark_matres_lectionis = '*'))
41
+ ```
42
+
43
+ Output:
44
+
45
+ ```json
46
+ ['בִּשְׁנַת 1948 הִשְׁלִים אֶפְרַיִם קִישׁוֹן אֶת לִי*מּוּדָיו בְּפִי*סּוּל מַתֶּכֶת וּבְתוֹלְדוֹת הָאׇמָּנוּת וְהֵחֵל לְפַרְסֵם מַאֲמָרִים הוּמוֹרִיסְטִיִּים']
47
+ ```
48
+
49
+ ### Community Project
50
+
51
+ A third-party project, [dicta-onnx](https://github.com/thewh1teagle/dicta-onnx), offers a lightweight ONNX-based tool built on top of our model for adding Hebrew diacritics. We're not affiliated, but it's a cool and practical application you might find useful.
52
+
53
+ ## License
54
+
55
+ Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
56
+
57
+ This work is licensed under a
58
+ [Creative Commons Attribution 4.0 International License][cc-by].
59
+
60
+ [![CC BY 4.0][cc-by-image]][cc-by]
61
+
62
+ [cc-by]: http://creativecommons.org/licenses/by/4.0/
63
+ [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
64
+ [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
65
+
66
+
67
+
68
+
69
+
dictabert-large-char-menaked/config.json ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "BertForDiacritization"
4
+ ],
5
+ "auto_map": {
6
+ "AutoModel": "BertForDiacritization.BertForDiacritization"
7
+ },
8
+ "attention_probs_dropout_prob": 0.1,
9
+ "classifier_dropout": null,
10
+ "hidden_act": "gelu",
11
+ "hidden_dropout_prob": 0.1,
12
+ "hidden_size": 1024,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 4096,
15
+ "layer_norm_eps": 1e-12,
16
+ "mat_lect_token": "<MAT_LECT>",
17
+ "max_position_embeddings": 2048,
18
+ "model_type": "bert",
19
+ "nikud_classes": [
20
+ "",
21
+ "<MAT_LECT>",
22
+ "\u05bc",
23
+ "\u05b0",
24
+ "\u05b1",
25
+ "\u05b2",
26
+ "\u05b3",
27
+ "\u05b4",
28
+ "\u05b5",
29
+ "\u05b6",
30
+ "\u05b7",
31
+ "\u05b8",
32
+ "\u05b9",
33
+ "\u05ba",
34
+ "\u05bb",
35
+ "\u05bc\u05b0",
36
+ "\u05bc\u05b1",
37
+ "\u05bc\u05b2",
38
+ "\u05bc\u05b3",
39
+ "\u05bc\u05b4",
40
+ "\u05bc\u05b5",
41
+ "\u05bc\u05b6",
42
+ "\u05bc\u05b7",
43
+ "\u05bc\u05b8",
44
+ "\u05bc\u05b9",
45
+ "\u05bc\u05ba",
46
+ "\u05bc\u05bb",
47
+ "\u05c7",
48
+ "\u05bc\u05c7"
49
+ ],
50
+ "num_attention_heads": 16,
51
+ "num_hidden_layers": 24,
52
+ "pad_token_id": 0,
53
+ "position_embedding_type": "absolute",
54
+ "shin_classes": [
55
+ "\u05c1",
56
+ "\u05c2"
57
+ ],
58
+ "torch_dtype": "float32",
59
+ "transformers_version": "4.42.4",
60
+ "type_vocab_size": 2,
61
+ "use_cache": true,
62
+ "vocab_size": 1024
63
+ }
dictabert-large-char-menaked/issues.txt ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ------------------------------------------------------------------------
2
+ #3 Different variations of diacritics
3
+ ------------------------------------------------------------------------
4
+
5
+ [thewh1teagle] Apr 3, 2025
6
+
7
+ I would like to get multiple variations of diacritics for sentence
8
+
9
+ For instance with 'Shalom Olam'
10
+ שלום עולם
11
+ The diacritics are 'Shlom Olam'
12
+ שְׁלוֹם עוֹלָם
13
+
14
+ I tried to implement beam search but couldn't get different variations
15
+ Thanks
16
+
17
+ thewh1teagle changed discussion title from Beam search example to Different variations of diacritics Apr 3, 2025
18
+
19
+ [johnlockejrr] Apr 3, 2025
20
+
21
+ Seems the vocalization is for peace of world not Hello world! :)
22
+
23
+ [Shaltiel, DICTA: The Israel Center for Text Analysis.org] Apr 8, 2025
24
+
25
+ Indeed, the current architecture does not allow retrieving multiple variations of diacritics for each word/the sentence. We are looking into training a model with a different architecture, but that is currently only in research.
26
+
27
+ [thewh1teagle] Apr 17, 2025
28
+
29
+ I noticed some differences in the nikud from Dicta website in terms of modernity
30
+ For instance when I hit שלום עולם in Dicta website it's Shalom Olam but in the model it's Shlom Olam, it's like the model nikud is a bit 'less modern' than Dicta website. That's why I asked for a way to get variations.
31
+
32
+ If you plan another one, I wish that you can include more modern nikud, and Shva Na and Atama'a! ; )
33
+ Thank you very much for the model. very appreciated.
34
+
35
+
dictabert-large-char-menaked/model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c2927643c61f408c7d5ff1652b605b322ed896fa07ded344bd508a02b76bf50e
3
+ size 1222010788