Sentence Similarity
sentence-transformers
Safetensors
Hebrew
bert
biblical-hebrew
digital-humanities
inner-biblical-parallels
Eval Results (legacy)
text-embeddings-inference
Upload 10 files
- README.md +194 -0
- config.json +25 -0
- config_sentence_transformers.json +14 -0
- model.safetensors +3 -0
- modules.json +14 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +37 -0
- tokenizer.json +0 -0
- tokenizer_config.json +66 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -1,3 +1,197 @@
---
language:
- he
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- biblical-hebrew
- digital-humanities
- inner-biblical-parallels
base_model: imvladikon/sentence-transformers-alephbert
datasets:
- davidmsmiley/tomim
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: MiqraBERT
  results:
  - task:
      type: sentence-similarity
      name: Semantic Similarity
    dataset:
      name: "T'OMIM"
      type: davidmsmiley/tomim
    metrics:
    - type: f1
      value: 0.980
      name: F1 (threshold=0.53)
    - type: recall_at_10
      value: 0.728
      name: Recall@10 (all pairs)
    - type: recall_at_10
      value: 0.871
      name: Recall@10 (narrative)
widget:
- source_sentence: "וַיַּעַשׂ הַיָּשָׁר בְּעֵינֵי יְהוָה כְּכֹל אֲשֶׁר־עָשָׂה עֻזִּיָּהוּ אָבִיו רַק לֹא־בָא אֶל־הֵיכַל יְהוָה וְעוֺד הָעָם מַשְׁחִיתִים"
  sentences:
  - "וַיַּעַשׂ הַיָּשָׁר בְּעֵינֵי יְהוָה כְּכֹל אֲשֶׁר־עָשָׂה עֻזִיָּהוּ אָבִיו עָשָׂה"
  - "וְהִנֵּה שֶׁבַע שִׁבֳּלִים צְנֻמוֺת דַּקּוֺת שְׁדֻפוֺת קָדִים צֹמְחוֺת אַחֲרֵיהֶם"
  - "יִשָּׁעֵן עַל־בֵּיתוֺ וְלֹא יַעֲמֹד יַחֲזִיק בּוֺ וְלֹא יָקוּם"
---

# MiqraBERT

A [sentence-transformers](https://www.sbert.net) model finetuned from [AlephBERT](https://huggingface.co/imvladikon/sentence-transformers-alephbert) for detecting parallel passages in the Hebrew Bible. It maps Biblical Hebrew verses to 768-dimensional embeddings where cosine similarity reflects textual parallelism: high scores indicate genuine synoptic parallels, low scores indicate unrelated text.

*MiqraBERT* derives from Hebrew מִקְרָא (*miqra*, "scripture").

## Model Details

- **Developed by:** David M. Smiley, University of Notre Dame
- **Model type:** Sentence Transformer (BERT encoder + mean pooling)
- **Language:** Biblical Hebrew (vocalized, with niqqud)
- **Base model:** [imvladikon/sentence-transformers-alephbert](https://huggingface.co/imvladikon/sentence-transformers-alephbert) (AlephBERT)
- **Finetuned on:** [T'OMIM](https://huggingface.co/datasets/davidmsmiley/tomim), 1,650 Biblical Hebrew verse pairs ([Zenodo](https://doi.org/10.5281/zenodo.19135731))
- **Output:** 768 dimensions, cosine similarity
- **Max sequence length:** 512 tokens
- **License:** Apache 2.0
- **Paper:** [arXiv:2506.24117](https://arxiv.org/abs/2506.24117)

## Usage

### Sentence Transformers

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("davidmsmiley/miqrabert")

# 2 Kgs 18:13 and its synoptic parallel Isa 36:1
parallel_a = "וּבְאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
parallel_b = "וַיְהִי בְּאַרְבַּע עֶשְׂרֵה שָׁנָה לַמֶּלֶךְ חִזְקִיָּהוּ עָלָה סַנְחֵרִיב מֶלֶךְ־אַשּׁוּר עַל־כָּל־עָרֵי יְהוּדָה הַבְּצֻרוֺת וַיִּתְפְּשֵׂם"
unrelated = "וְהִנֵּה שֶׁבַע שִׁבֳּלִים צְנֻמוֺת דַּקּוֺת שְׁדֻפוֺת קָדִים צֹמְחוֺת אַחֲרֵיהֶם"

embeddings = model.encode([parallel_a, parallel_b, unrelated])
similarities = model.similarity(embeddings, embeddings)
# parallel_a ↔ parallel_b: ~0.98 (near-verbatim parallel)
# parallel_a ↔ unrelated: ~0.05 (no relationship)
```

### Using Transformers Directly

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("davidmsmiley/miqrabert")
model = AutoModel.from_pretrained("davidmsmiley/miqrabert")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**inputs)
    # Mean-pool the token embeddings, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
    # L2-normalize so that dot products equal cosine similarities
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

emb = encode(["וַיַּעַשׂ הַיָּשָׁר בְּעֵינֵי יְהוָה", "וַיַּעַשׂ הָרַע בְּעֵינֵי יְהוָה"])
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
```

## Intended Uses

**Use for:** measuring semantic similarity between Biblical Hebrew verse pairs; identifying candidate parallel passages across the Hebrew Bible; supporting computational research on inner-biblical allusion and textual reuse.

**Not designed for:** Modern Hebrew, Rabbinic Hebrew, or Aramaic text. Not optimized for poetic parallelism (see Limitations). The model outputs continuous similarity scores; it is not a binary classifier.
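
Because the model returns continuous scores, any yes/no decision is left to the user. The following is a minimal sketch of one such decision rule, assuming the 0.53 threshold reported alongside the F1 score in this card also suits your data (it may not). The helper names `cosine` and `is_parallel` are hypothetical, and the toy vectors stand in for real 768-dimensional embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_parallel(u, v, threshold=0.53):
    # 0.53 is the threshold quoted with the F1 score in this card;
    # using it as a general-purpose cut-off is an assumption.
    return cosine(u, v) >= threshold

# Toy vectors standing in for verse embeddings
query = [0.9, 0.1, 0.0]
near_copy = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

print(is_parallel(query, near_copy))
print(is_parallel(query, unrelated))
```

In practice the threshold would be re-tuned on held-out pairs from the target corpus rather than reused as-is.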

## Training

### Data

[T'OMIM](https://huggingface.co/datasets/davidmsmiley/tomim) contains 825 parallel and 825 non-parallel Biblical Hebrew verse pairs. Parallels include 556 narrative pairs from Chronicles // Samuel-Kings and 269 poetic pairs from published parallelism studies (Berlin, Fokkelman, Kugel, Tsumura). Negatives are random pairs sampled from the full Hebrew Bible.

### Procedure

Training uses cosine similarity regression via [CosineSimilarityLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosinesimilarityloss) (mean squared error). Both verses pass through a shared encoder, are mean-pooled to 768-dim embeddings, and are compared via cosine similarity against target labels (1.0 = parallel, 0.0 = non-parallel). This checkpoint uses a 70/15/15 train-validation-test split (1,155 / 247 / 248 pairs), selected from seven configurations (50%–90%) as the best balance of separation quality and test set size. Stability was validated across 10 random seeds (70 models total).
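
The objective described above can be written out by hand for a toy batch. This sketch is not the sentence-transformers implementation; it only illustrates the quantity being minimized, namely the mean squared error between each pair's cosine similarity and its 0/1 label. The function names and the two-dimensional vectors are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_mse_loss(pairs, labels):
    """Mean squared error between cosine(u, v) and the target label."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0.0
]

loss_matching = cosine_mse_loss(pairs, [1.0, 0.0])     # labels agree with the geometry
loss_mismatched = cosine_mse_loss(pairs, [0.0, 1.0])   # labels contradict it
print(loss_matching, loss_mismatched)
```

Finetuning moves the encoder so that parallel pairs drift toward cosine 1.0 and random pairs toward 0.0, which drives this loss down.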

### Hyperparameters

- **Epochs:** 2
- **Batch size:** 16
- **Learning rate:** 5e-05 (linear schedule)
- **Optimizer:** AdamW
- **Seed:** 42
- **Hardware:** NVIDIA T4 GPU (~36 seconds)
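
For orientation, the hyperparameters listed above imply a very short run. The sketch below works out the number of optimizer steps for the 1,155-pair training split and a linear decay from the peak learning rate. It assumes no warmup, no gradient accumulation, and that the final partial batch is kept, none of which is stated in this card.

```python
import math

TRAIN_PAIRS = 1155   # 70% training split reported under Procedure
BATCH_SIZE = 16
EPOCHS = 2
PEAK_LR = 5e-05

steps_per_epoch = math.ceil(TRAIN_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def linear_lr(step):
    """Learning rate after `step` updates under linear decay to zero (assumed: no warmup)."""
    return PEAK_LR * max(0.0, 1.0 - step / total_steps)

print(steps_per_epoch, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```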

### Framework Versions

- Sentence Transformers 5.2.0 / Transformers 4.57.3 / PyTorch 2.9.0+cu126

## Evaluation

### Test Set Performance

| Metric | Score |
|:-------|:------|
| Wasserstein Distance | 0.772 [0.735, 0.809] |
| Overlap Coefficient | 0.046 |
| F1 (threshold = 0.53) | 0.980 |
| Precision / Recall | 0.984 / 0.976 |
| Mean cosine sim (parallel) | 0.880 |
| Mean cosine sim (non-parallel) | 0.108 |

Wasserstein Distance (WD) measures distributional separation between parallel and non-parallel similarity scores; higher is better. Overlap Coefficient (OVL) measures the proportion of ambiguous space where distributions intersect; lower is better. The unfinetuned AlephBERT baseline achieves WD = 0.276 and OVL = 0.240.
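
To make the two separation metrics concrete, here is a small self-contained sketch of how they could be computed from two lists of similarity scores. The estimators are assumptions: the 1-D Wasserstein distance is taken as the area between the empirical CDFs, and the overlap coefficient as the shared area of two normalized histograms with 20 bins over [-1, 1]. The card does not say which estimator or binning produced the reported values, so results from this sketch need not match them.

```python
from bisect import bisect_right

def wasserstein_1d(a, b):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total

def overlap_coefficient(a, b, bins=20, lo=-1.0, hi=1.0):
    """Shared area of two normalized histograms over [lo, hi] (binning is an assumption)."""
    width = (hi - lo) / bins

    def histogram(xs):
        counts = [0] * bins
        for x in xs:
            index = min(max(int((x - lo) / width), 0), bins - 1)
            counts[index] += 1
        return [c / len(xs) for c in counts]

    return sum(min(p, q) for p, q in zip(histogram(a), histogram(b)))

# Toy similarity scores, not taken from the model
parallel = [0.95, 0.90, 0.85, 0.80]
non_parallel = [0.20, 0.10, 0.05, 0.00]

wd = wasserstein_1d(parallel, non_parallel)
ovl = overlap_coefficient(parallel, non_parallel)
print(wd, ovl)
```

Well-separated score distributions give a large WD and an OVL near zero, which is the pattern the table reports for the finetuned model relative to the baseline.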

### Retrieval (Recall@k)

Each query verse is searched against all 68,125 verse and half-verse vectors in the Hebrew Bible ([BHSA](https://etcbc.github.io/bhsa/) corpus). Recall@k measures how often the true parallel appears in the top-k results.

| Model | Recall@10 (all) | Recall@10 (narrative) | Recall@10 (poetic) |
|:------|:---------------:|:---------------------:|:------------------:|
| **MiqraBERT-70p** | **0.728** | **0.871** | 0.089 |
| BEREL-70p | 0.704 | 0.831 | 0.137 |
| DictaLM-70p | 0.751 | 0.914 | 0.024 |

MiqraBERT is selected as the primary model for its balance across metrics: strong narrative recall, stable training, and the smallest parameter footprint (~110M vs. 7.25B for DictaLM). Full model comparison in the [paper](https://arxiv.org/abs/2506.24117).
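
The Recall@k measure used in the retrieval evaluation can be sketched with a brute-force search over a toy corpus. This is an illustration under assumptions: `recall_at_k` and its arguments are hypothetical names, the corpus here has four vectors rather than 68,125, and details such as whether the query verse itself is excluded from the candidates are not specified in this card.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, corpus, gold, k=10):
    """Fraction of queries whose gold corpus index is among the k most similar corpus vectors."""
    hits = 0
    for query, target in zip(queries, gold):
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(query, corpus[i]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy 2-D "embeddings"; gold[i] is the corpus index of query i's true parallel
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
queries = [[1.0, 0.05], [0.1, 1.0]]
gold = [1, 3]

recall_2 = recall_at_k(queries, corpus, gold, k=2)
recall_4 = recall_at_k(queries, corpus, gold, k=4)
print(recall_2, recall_4)
```

At full corpus scale a vectorized or approximate nearest-neighbour search would replace the Python loop, but the metric itself is unchanged.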

## Limitations

- **Narrative focus:** Trained primarily on Chronicles // Samuel-Kings synoptic parallels. Recall@10 for poetic parallelism is only 8.9%, a structural limitation of mean-pooled embeddings for texts with little lexical overlap.
- **Biblical Hebrew only:** Not evaluated on Modern Hebrew, Rabbinic Hebrew, unvocalized text, or other Semitic languages.
- **Training scope:** May underperform on intertextual relationships not represented in training (allusions, type-scenes, formulaic speech).

## Citation

```bibtex
@article{smiley2025intertextual,
  title = {Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark},
  author = {Smiley, David M.},
  journal = {arXiv preprint arXiv:2506.24117},
  year = {2025},
  url = {https://arxiv.org/abs/2506.24117}
}
```

### Upstream Models

```bibtex
@inproceedings{reimers2019sentencebert,
  title = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author = {Reimers, Nils and Gurevych, Iryna},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  year = {2019}
}

@article{seker2021alephbert,
  title = {AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off Your Hebrew NLP Application With},
  author = {Seker, Amit and Bandel, Elron and Bareket, Dan and Brusilovsky, Idan and Greenfeld, Refael Shaked and Tsarfaty, Reut},
  journal = {arXiv preprint arXiv:2104.04052},
  year = {2021}
}
```
config.json
ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
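
As a rough cross-check on model size, the parameter count can be derived from the fields in this config. The sketch assumes the standard BERT layout (word, position and token-type embeddings with LayerNorm, twelve encoder layers each holding an attention block and a feed-forward block with their own LayerNorm, plus a pooler head). That layout is not spelled out in this diff, and `bert_param_count` is an illustrative name.

```python
cfg = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 12,
    "vocab_size": 52000,
    "max_position_embeddings": 512,
    "type_vocab_size": 1,
}

def bert_param_count(cfg, with_pooler=True):
    """Count weights and biases under the assumed standard BERT layout."""
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    # Three embedding tables plus the embedding LayerNorm (scale and bias)
    embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]) * h + 2 * h
    # Query, key, value and output projections, plus LayerNorm
    attention = 4 * (h * h + h) + 2 * h
    # Two dense layers (h -> ff -> h), plus LayerNorm
    feed_forward = (h * ff + ff) + (ff * h + h) + 2 * h
    total = embeddings + cfg["num_hidden_layers"] * (attention + feed_forward)
    if with_pooler:
        total += h * h + h
    return total

params = bert_param_count(cfg)
size_bytes = params * 4  # float32 stores 4 bytes per parameter
print(params, size_bytes)
```

Under these assumptions the float32 weights come to about 504 MB, in line with the `size 503928680` recorded in the `model.safetensors` LFS pointer in this commit; the small remainder would be consistent with file header metadata.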
config_sentence_transformers.json
ADDED
@@ -0,0 +1,14 @@
{
  "__version__": {
    "sentence_transformers": "5.2.0",
    "transformers": "4.57.3",
    "pytorch": "2.9.0+cu126"
  },
  "model_type": "SentenceTransformer",
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:84794ec60752ba964055437944a8345d51a29a0d9e73699b62d87cc2d305e3bb
size 503928680
modules.json
ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 512,
  "do_lower_case": false
}
special_tokens_map.json
ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,66 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_len": 512,
  "max_length": 512,
  "model_max_length": 512,
  "never_split": null,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt
ADDED
The diff for this file is too large to render.