nmmursit committed on Jan 16

Commit

d15d46e

verified ·

1 Parent(s): 7874463

Initial model upload - clean repository

Files changed (17)

.gitattributes +7 -0
README.md +324 -0
base_mlm_vs_downstream.png +0 -0
config.json +43 -0
model.onnx +3 -0
model.safetensors +3 -0
model_performance_2d.png +3 -0
plots/eval_loss.png +3 -0
plots/eval_masked_accuracy.png +3 -0
plots/grad_L2_norm.png +3 -0
plots/lr_schedule.png +3 -0
plots/train_loss.png +3 -0
plots/train_masked_accuracypng.png +3 -0
pytorch_model.bin +3 -0
special_tokens_map.json +51 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+plots/eval_loss.png filter=lfs diff=lfs merge=lfs -text
+plots/eval_masked_accuracy.png filter=lfs diff=lfs merge=lfs -text
+plots/grad_L2_norm.png filter=lfs diff=lfs merge=lfs -text
+plots/lr_schedule.png filter=lfs diff=lfs merge=lfs -text
+plots/train_loss.png filter=lfs diff=lfs merge=lfs -text
+plots/train_masked_accuracypng.png filter=lfs diff=lfs merge=lfs -text
+model_performance_2d.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED Viewed

	@@ -0,0 +1,324 @@

+---
+language:
+- tr
+- en
+license: apache-2.0
+tags:
+- fill-mask
+- turkish
+- legal
+- turkish-legal
+- mecellem
+- modernbert
+- TRUBA
+- MN5
+base_model: ModernBERT-base
+pipeline_tag: fill-mask
+---
+# Mursit-Base
+[![GitHub](https://img.shields.io/badge/GitHub-NewMindAI-black?logo=github)](https://github.com/newmindai/mecellem-models) [![HuggingFace Space](https://img.shields.io/badge/HF%20Space-Mizan-blue?logo=huggingface)](https://huggingface.co/spaces/newmindai/Mizan) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+## Model Description
+Mursit-Base is a Turkish masked language model pre-trained entirely from scratch on Turkish-dominant corpora. It uses the ModernBERT-base architecture (155M parameters) and serves as a foundation model for downstream tasks such as text classification, named entity recognition, and feature extraction. Unlike domain-adaptive approaches that continue training from an existing checkpoint, this model is randomly initialized and trained on a curated dataset that combines Turkish legal text with general web data.
+**Key Features:**
+- Pre-trained from scratch on approximately 112.7 billion tokens of Turkish-dominant corpus
+- Achieves 57.62% MLM accuracy on Turkish datasets (80-10-10 masking strategy, evaluated at 15% masking rate)
+- Serves as foundation for downstream embedding tasks (Mursit-Base-TR-Retrieval)
+- Custom tokenizer optimized for Turkish morphological structure
+- Pre-trained with 30% masking rate (ModernBERT/RoBERTa approach) but evaluated at 15% masking rate for fair comparison
+**Model Type:** Masked Language Model (MLM)
+**Parameters:** 155M
+**Base Architecture:** ModernBERT-base
+**Hidden Size:** 768
+**Max Sequence Length:** 1,024 tokens
+### Architecture Details
+- **Layers:** 22 transformer layers
+- **Hidden Size:** 768
+- **FFN Size:** 1,152
+- **Attention Heads:** 12 heads with 64 dimensions each
+- **Activation:** GeGLU (Gated Linear Units with GELU)
+- **Normalization:** RMSNorm
+- **Position Embeddings:** Rotary positional embeddings (RoPE) with θ=10,000
+- **Window Size:** 128 (for sliding window attention in local layers)
+- **Vocabulary Size:** 59,008 tokens
+### Training Details
+**Pre-training:**
+- **Dataset:** Turkish-dominant corpus totaling approximately 112.7 billion tokens
+  - **Legal Sources:**
+    - Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens
+    - Council of State (Danıştay): 151K sequences, ~0.11B tokens
+    - Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing)
+  - **General Turkish Sources:**
+    - FineWeb2: General Turkish web data
+    - CulturaX: Multilingual corpus (Turkish subset)
+    - Total general Turkish: 212M sequences, ~96.17B tokens
+  - **Data Processing:** SemHash-based semantic deduplication, FineWeb quality filtering, URL-based filtering, page-packing for YÖKTEZ documents
+- **Training Method:** Masked Language Modeling (MLM) with a 30% masking rate during pre-training (evaluation uses a 15% masking rate for fair comparison)
+- **Masking Strategy:** Of the selected positions, 80% are replaced with the mask token (`<mask>`), 10% with a random token, and 10% are left unchanged (80-10-10 strategy)
+- **Framework:** MosaicML Composer with Decoupled StableAdamW optimizer
+- **Learning Rate:** 5×10⁻⁴ with warmup_stable_decay schedule
+- **Precision:** BF16 mixed precision
+- **Hardware Infrastructure:**
+  - **System:** MareNostrum 5 ACC partition at Barcelona Supercomputing Center (BSC)
+  - **Compute Nodes:** 16 nodes
+  - **GPUs:** 64× NVIDIA Hopper H100 64GB GPUs (4 GPUs per node)
+  - **Node Configuration:** Each node equipped with 4× H100 GPUs, 80 CPU cores, 512GB DDR5 memory
+  - **Interconnect:** 800 Gb/s InfiniBand for distributed training
+  - **GPU Interconnect:** NVLink for intra-node GPU communication (4 GPUs per node connected via NVLink)
+  - **Distributed Training:** Multi-node distributed training across 16 nodes with InfiniBand interconnect
+**MLM Accuracy:** 64.05% (evaluated on Turkish datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal)
+### MLM Accuracy Scores (80-10-10 Strategy) on Turkish Datasets
+The following table presents MLM accuracy scores (averaged across the 80-10-10 strategy) for our pre-trained models and baseline MLM models evaluated on Turkish datasets. *This model's results are highlighted in italics.*
+| Model | MLM Avg (%) |
+|-------|-------------|
+| boun-tabilab/TabiBERT | **69.57** |
+| newmindai/Mursit-Large | 67.25 |
+| ytu-ce-cosmos/turkish-large-bert-cased | 65.03 |
+| dbmdz/bert-base-turkish-cased | 64.98 |
+| *newmindai/Mursit-Base* | *64.05* |
+| KocLab-Bilkent/BERTurk-Legal | 54.10 |
+| ytu-ce-cosmos/turkish-base-bert-uncased | 52.69 |
+*MLM accuracy averaged across the 80-10-10 masking strategy. turkish-base-bert-uncased was evaluated only on uncased datasets. Evaluation datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal. All experiments are reproducible (see Section A.2 in the paper).*
+## Performance on MTEB-Turkish Benchmark
+The following visualization shows the model's performance compared to other Turkish language models:
+![Model Performance Comparison](model_performance_2d.png)
+*Model Performance Comparison: Legal Score vs. MTEB Score. MLM models (blue circles) form a distinct cluster. Mursit-Base achieves competitive performance among Turkish MLM models.*
+This model was evaluated on the comprehensive MTEB-Turkish benchmark for embedding tasks using mean pooling over token representations followed by L2 normalization.
+### Comprehensive Benchmark Results
+The following table presents comprehensive evaluation results across all models evaluated on the MTEB-Turkish benchmark. *This model's results are highlighted in italics.*
+| Model | MTEB | Legal | Cls. | Clus. | Pair | Ret. | STS | Cont. | Reg. | Case | Params | Type |
+|-------|------|-------|------|-------|------|------|-----|-------|------|------|--------|------|
+| embeddinggemma-300m | **65.42** | 50.63 | **77.74** | **45.05** | **80.02** | **55.06** | 69.22 | 83.97 | **39.56** | 28.38 | 307M | Emb. |
+| bge-m3 | 62.87 | **51.16** | 75.35 | 35.86 | 78.88 | 54.42 | **69.83** | **86.08** | 38.09 | **29.3** | 567M | Emb. |
+| Mursit-Embed-Qwen3-1.7B-TR | 56.84 | 34.76 | 68.46 | 42.22 | 59.67 | 50.1 | 63.77 | 70.22 | 17.94 | 16.11 | 1.7B | CLM-E. |
+| Mursit-Large-TR-Retrieval | 56.87 | 46.56 | 67.72 | 41.15 | 59.78 | 51.69 | 64.01 | 81.78 | 32.67 | 25.24 | 403M | Emb. |
+| Mursit-Base-TR-Retrieval | 55.86 | 47.52 | 66.25 | 39.75 | 61.31 | 50.07 | 61.9 | 80.4 | 34.1 | 28.07 | 155M | Emb. |
+| Mursit-Embed-Qwen3-4B-TR | 53.65 | 37.0 | 67.29 | 36.68 | 58.36 | 51.12 | 54.77 | 69.25 | 24.21 | 17.56 | 4B | CLM-E. |
+|-------|------|-------|------|------|------|------|-----|-------|------|------|--------|------|
+| bert-base-turkish-uncased | 46.23 | 24.94 | 68.05 | 33.81 | 60.44 | 32.01 | 36.85 | 52.47 | 12.05 | 10.29 | 110M | MLM |
+| turkish-large-bert-cased | 45.3 | 19.12 | 67.43 | 34.24 | 60.11 | 28.68 | 36.04 | 47.57 | 5.93 | 3.85 | 337M | MLM |
+| bert-base-turkish-cased | 45.17 | 24.41 | 66.39 | 35.28 | 60.05 | 30.52 | 33.62 | 54.03 | 10.13 | 9.07 | 110M | MLM |
+| BERTurk-Legal | 42.02 | 32.63 | 60.61 | 26.24 | 59.51 | 25.8 | 37.94 | 61.4 | 15.51 | 20.99 | 184M | MLM |
+| Mursit-Large | 41.75 | 23.71 | 62.95 | 25.34 | 58.04 | 27.4 | 35.01 | 42.74 | 11.29 | 17.1 | 403M | MLM |
+| turkish-base-bert-uncased | 44.68 | 27.58 | 66.22 | 30.23 | 58.84 | 31.4 | 36.74 | 56.6 | 13.39 | 12.74 | 110M | MLM |
+| *Mursit-Base* | 40.23 | 17.93 | 59.78 | 25.48 | 58.65 | 20.82 | 36.45 | 36.0 | 7.4 | 10.4 | 155M | MLM |
+| mmBERT-base | 39.65 | 12.15 | 61.84 | 26.77 | 59.25 | 15.83 | 34.56 | 34.45 | 1.33 | 0.68 | 306M | MLM |
+| TabiBERT | 37.77 | 11.5 | 59.63 | 25.75 | 58.19 | 14.96 | 30.32 | 32.02 | 1.86 | 0.63 | 148M | MLM |
+| ModernBERT-base | 23.8 | 2.99 | 39.06 | 2.01 | 53.95 | 2.1 | 21.91 | 7.92 | 0.62 | 0.43 | 149M | MLM |
+| ModernBERT-large | 23.74 | 2.44 | 39.44 | 3.9 | 53.73 | 1.8 | 19.85 | 6.12 | 0.62 | 0.59 | 394M | MLM |
+**Column abbreviations:** MTEB = mean performance across task types; Legal = weighted average of Cont., Reg., and Case; Cls. = Classification (accuracy on Turkish classification tasks); Clus. = Clustering (V-measure); Pair = Pair Classification (average precision on tasks such as NLI); Ret. = Retrieval (nDCG@10 on information retrieval tasks); STS = Semantic Textual Similarity (Spearman correlation); Cont. = Contracts (nDCG@10 on legal contract retrieval); Reg. = Regulation (nDCG@10 on regulatory text retrieval); Case = Caselaw (nDCG@10 on case law retrieval); Params = number of model parameters; Type = model type (Emb. = Embedding, CLM-E. = CLM-Embedding, MLM = Masked Language Model). **Bold values** indicate the highest score in each column.
+**Key Findings:**
+- The model shows substantial improvement over ModernBERT baselines (which are monolingual English models), validating the effectiveness of Turkish-specific pre-training
+- Pre-training alone without embedding-specific fine-tuning yields limited utility for retrieval tasks
+- Language-specific pre-training is critical, as monolingual English models show limited cross-lingual transfer to Turkish
+- The model demonstrates that improvements in MLM accuracy do not always directly translate to better downstream task performance
+### MLM vs Downstream Performance Analysis
+The following visualization shows the relationship between MLM validation loss and downstream retrieval performance:
+![MLM vs Downstream Performance](base_mlm_vs_downstream.png)
+*Relationship between MLM validation loss and downstream retrieval performance across ModernBERT-base versions v1-v6. This analysis demonstrates how improvements in MLM accuracy correlate with downstream task performance.*
+**Note:** This model is primarily designed for Masked Language Modeling tasks. Embedding performance is provided for reference using standard mean pooling. For optimal retrieval performance, consider using the post-trained retrieval variants (Mursit-Base-TR-Retrieval or Mursit-Large-TR-Retrieval).
+## Reproducibility
+To reproduce the MLM benchmark results for this model, please refer to:
+- **MLM Benchmark Results:** [github.com/newmindai/mecellem-models/benchmark/mlm](https://github.com/newmindai/mecellem-models/tree/main/benchmark/mlm) - Contains code and evaluation configurations for reproducing MLM accuracy scores on Turkish datasets using the 80-10-10 masking strategy.
+## Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Masked Language Modeling
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+import torch
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
+model = AutoModelForMaskedLM.from_pretrained("newmindai/Mursit-Base")
+# Example text; use the tokenizer's own mask token ("<mask>" for this model, not "[MASK]")
+text = f"Türkiye Cumhuriyeti'nin başkenti {tokenizer.mask_token}'dir."
+inputs = tokenizer(text, return_tensors="pt")
+# Locate the masked position in the single input sequence
+mask_token_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
+# Predict a distribution over the vocabulary at the masked position
+with torch.no_grad():
+    outputs = model(**inputs)
+predictions = torch.softmax(outputs.logits[0, mask_token_index], dim=-1)
+# Print the top-k candidate tokens with their probabilities
+top_k = 5
+top = torch.topk(predictions[0], top_k)
+for score, idx in zip(top.values, top.indices):
+    token = tokenizer.decode([idx.item()])
+    print(f"{token}: {score.item():.4f}")
+```
+### Feature Extraction
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
+model = AutoModel.from_pretrained("newmindai/Mursit-Base")
+text = "Türk hukuk sistemi medeni hukuk geleneğine dayanır"
+inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
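+with torch.no_grad():
+    outputs = model(**inputs)
+# Mean pooling over real tokens only (padding is masked out), then L2 normalization
+mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
+embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
+embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
+print(embeddings.shape)  # (batch_size, 768)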
+```
+### ONNX Inference (Masked Language Modeling)
+This section shows how to run the ONNX model from the Hugging Face Hub for masked language modeling.
+#### Exporting the Model to ONNX
+To export the model to ONNX format for MLM, use the `optimum-cli` command:
+```bash
+optimum-cli export onnx \
+  -m newmindai/Mursit-Base \
+  --task fill-mask \
+  onnx/MursitBase
+```
+This will create the `model.onnx` file in the specified output directory.
+#### Installation
+```bash
+pip install onnxruntime-gpu transformers huggingface_hub numpy
+```
+#### Usage
+```python
+import numpy as np
+import onnxruntime as ort
+from transformers import AutoTokenizer
+from huggingface_hub import hf_hub_download
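+repo_id = "newmindai/Mursit-Base"
+# Download the ONNX graph and load the matching tokenizer from the Hub
+onnx_path = hf_hub_download(repo_id, "model.onnx")
+tokenizer = AutoTokenizer.from_pretrained(repo_id)
+# Prefer CUDA when it is available; otherwise the CPU provider is used
+sess = ort.InferenceSession(
+    onnx_path,
+    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
+)
+text = f"Bu bir {tokenizer.mask_token} cümledir."
+inputs = tokenizer(text, return_tensors="np")
+# Feed only the tensors that the graph declares as inputs
+input_names = {i.name for i in sess.get_inputs()}
+outputs = sess.run(None, {k: v for k, v in inputs.items() if k in input_names})
+logits = outputs[0]
+# Position of the mask token in the single input sequence
+mask_pos = np.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0][0]
+mask_logits = logits[0, mask_pos]
+# Indices of the top-k logits, highest first
+top_k = 5
+top_k_ids = np.argsort(mask_logits)[-top_k:][::-1]
+predictions = tokenizer.convert_ids_to_tokens(top_k_ids.tolist())
+print("MASK predictions:")
+for p in predictions:
+    print(p)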
+```
+#### Features
+- **Automatic GPU/CPU selection**: Uses CUDA if available, otherwise falls back to CPU
+- **Hugging Face integration**: Downloads model files directly from Hugging Face Hub
+- **Masked token prediction**: Predicts the most likely tokens for masked positions
+- **Top-K predictions**: Returns the top K most probable token predictions
+## Use Cases
+- Turkish language understanding tasks
+- Text classification
+- Named entity recognition
+- Question answering
+- Text generation (with fine-tuning)
+- Feature extraction for downstream tasks
+- Domain adaptation for Turkish legal texts
+## Acknowledgments
+This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.
+The numerical calculations reported in this work were partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge the know-how provided by the MINERVA Support for expert guidance and collaboration opportunities in HPC-AI integration.
+## References
+If you use this model, please cite our paper:
+```bibtex
+@article{mecellem2026,
+  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
+  author={Uğur, Özgür and Göksu, Mahmut and Şavirdi, Esra and Çimen, Mahmut and Yılmaz, Musa and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
+  journal={Procedia Computer Science},
+  year={2026},
+  publisher={Elsevier}
+}
+```
+### Base Model References
+```bibtex
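+@misc{modernbert2024,
+  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
+  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and Weller, Orion and others},
+  year={2024},
+  eprint={2412.13663},
+  archivePrefix={arXiv}
+}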
+```
+<!-- Updated: 2026-01-15 09:38:13 -->
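
Review note on README.md: the 80-10-10 masking strategy listed under Training Details can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code used for this model (the README names MosaicML Composer for that). The function name `mask_tokens`, the `-100` ignore label, and the toy token IDs are assumptions made for the example; the mask ID `4`, the special-token IDs `0` to `4`, and the vocabulary size come from the tokenizer and config files in this commit.

```python
import random

IGNORE_INDEX = -100  # assumption: label value that the loss skips


def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mask_prob, rng):
    """Apply 80-10-10 MLM corruption and return (corrupted_ids, labels)."""
    corrupted = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected with mask_prob.
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


# Toy sequence: <s> (id 1), twenty ordinary tokens, </s> (id 2).
ids = [1] + list(range(100, 120)) + [2]
corrupted, labels = mask_tokens(
    ids,
    mask_id=4,
    vocab_size=59008,
    special_ids={0, 1, 2, 3, 4},
    mask_prob=0.30,  # pre-training rate stated in the README; evaluation uses 0.15
    rng=random.Random(0),
)
print(corrupted)
print(labels)
```

Positions whose label stays at `IGNORE_INDEX` are left untouched, so accuracy is measured only on the selected positions, including the 10% that keep their original token.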

base_mlm_vs_downstream.png ADDED Viewed

config.json ADDED Viewed

	@@ -0,0 +1,43 @@

+{
+  "architectures": [
+    "ModernBertForMaskedLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "classifier_activation": "silu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 1,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
+  "embedding_dropout": 0.0,
+  "eos_token_id": 2,
+  "global_attn_every_n_layers": 3,
+  "global_rope_theta": 10000.0,
+  "gradient_checkpointing": false,
+  "hidden_activation": "gelu",
+  "hidden_size": 768,
+  "initializer_cutoff_factor": 2.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 1152,
+  "layer_norm_eps": 1e-05,
+  "local_attention": 128,
+  "local_rope_theta": 10000.0,
+  "max_position_embeddings": 1024,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
+  "num_attention_heads": 12,
+  "num_hidden_layers": 22,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "sep_token_id": 2,
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.48.0",
+  "vocab_size": 59008
+}
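
Review note on config.json: the attention settings above can be sanity-checked against the README's architecture list. The sketch below derives the per-head dimension and the split between global and local (sliding-window) layers. It assumes the convention used by the Hugging Face ModernBERT implementation, in which layer `i` uses global attention when `i % global_attn_every_n_layers == 0`; the commit itself does not state this rule.

```python
# Values copied from config.json in this commit.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 22,
    "global_attn_every_n_layers": 3,
    "local_attention": 128,
}

# Per-head dimension; the README lists 12 heads with 64 dimensions each.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Assumed convention: layer i is global when i % n == 0, otherwise local.
n = config["global_attn_every_n_layers"]
layers = range(config["num_hidden_layers"])
global_layers = [i for i in layers if i % n == 0]
local_layers = [i for i in layers if i % n != 0]

print("head_dim:", head_dim)
print("global layers:", global_layers)
print("local layers:", len(local_layers), "with window", config["local_attention"])
```

Under that assumption, roughly one layer in three attends over the full 1,024-token context, while the remaining layers are restricted to the 128-token window.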

model.onnx ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d7cec3828bf4e48aec81927a5d2eee7dc2af0d4264b186a3c0c08d3c01885d41
+size 625765807

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:900af6e23d5b02bde978e043239c1fec759a31249946079f24fc31dcdee2ab57
+size 625211640
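
Review note on model.safetensors: a Git LFS pointer records only the spec version, the object hash, and the byte size, but the size is enough for a rough parameter estimate. The sketch below parses the pointer shown above and divides the size by an assumed 4 bytes per parameter (float32). That storage width is an inference: it makes the estimate agree with the README's stated 155M parameters, whereas `torch_dtype` in config.json reads `bfloat16`. The small file header is ignored.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])
    return fields


# Pointer contents copied from this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:900af6e23d5b02bde978e043239c1fec759a31249946079f24fc31dcdee2ab57\n"
    "size 625211640\n"
)

BYTES_PER_PARAM = 4  # assumption: weights stored as float32
approx_params = pointer["size"] / BYTES_PER_PARAM
print(f"approx. parameters: {approx_params / 1e6:.1f}M")
```

If the weights were stored at 2 bytes per parameter, the same file size would imply about twice as many parameters, which does not match the model card.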

model_performance_2d.png ADDED Viewed

Git LFS Details

SHA256: 35d74a68a424786e7eca5fe891553cb3a1cd162e9a972f3e0ab3a125e9280137
Pointer size: 131 Bytes
Size of remote file: 156 kB

plots/eval_loss.png ADDED Viewed

Git LFS Details

SHA256: 220a421ad69765bc9e774bf0bc8228743ba62b0861cf060ec6c087d8977524ec
Pointer size: 131 Bytes
Size of remote file: 302 kB

plots/eval_masked_accuracy.png ADDED Viewed

Git LFS Details

SHA256: 73c6bd3cbf82b9c6f66c98f84da517dd40e104c06201aec5166d95ba1a86a8d7
Pointer size: 131 Bytes
Size of remote file: 321 kB

plots/grad_L2_norm.png ADDED Viewed

Git LFS Details

SHA256: 90ec075b45da8250f8fb0e5d43b3f53ae1d50866c5c82d5cef45ff9e5ae6e2cd
Pointer size: 131 Bytes
Size of remote file: 323 kB

plots/lr_schedule.png ADDED Viewed

Git LFS Details

SHA256: 5e2e499fa39aa4eb9e0b9cd748a607eb222fc2a4193eabd11f36e73b8f8224fd
Pointer size: 131 Bytes
Size of remote file: 353 kB

plots/train_loss.png ADDED Viewed

Git LFS Details

SHA256: 31230e7305f463feacd006520f404f0749db47df41821c789a4cd0429fa836a7
Pointer size: 131 Bytes
Size of remote file: 300 kB

plots/train_masked_accuracypng.png ADDED Viewed

Git LFS Details

SHA256: a47e0adab68c148f72cfe43ea6c83861c526b6acb0f468c101fdab97893c094c
Pointer size: 131 Bytes
Size of remote file: 336 kB

pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2b0e5315d7a508cada6a90ebb1704ba8aca3e8cf3863656fd3789744ac2d8ccc
+size 625240686

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,58 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": "<mask>",
+  "model_max_length": 1024,
+  "pad_token": "[PAD]",
+  "sep_token": "</s>",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<unk>",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ]
+}
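
Review note on tokenizer_config.json: the mask token for this tokenizer is `<mask>` (ID 4), not the BERT-style `[MASK]`. The sketch below builds a token-to-ID map from a trimmed copy of `added_tokens_decoder` and formats a prompt with the configured mask token. The trimmed JSON keeps only the `content` fields, and the `{mask}` placeholder is a convention made up for this example.

```python
import json

# Trimmed excerpt of tokenizer_config.json from this commit (only "content" kept).
tokenizer_config = json.loads("""
{
  "added_tokens_decoder": {
    "0": {"content": "[PAD]"},
    "1": {"content": "<s>"},
    "2": {"content": "</s>"},
    "3": {"content": "<unk>"},
    "4": {"content": "<mask>"}
  },
  "mask_token": "<mask>"
}
""")

# Invert the ID -> token table into a token -> ID lookup.
token_to_id = {
    entry["content"]: int(idx)
    for idx, entry in tokenizer_config["added_tokens_decoder"].items()
}

mask_token = tokenizer_config["mask_token"]
mask_id = token_to_id[mask_token]

# Build the prompt from the configured mask token instead of hard-coding "[MASK]".
text = "Türkiye Cumhuriyeti'nin başkenti {mask}'dir.".replace("{mask}", mask_token)

print(mask_token, mask_id)
print(text)
print("[MASK]" in token_to_id)
```

With a loaded tokenizer, reading `tokenizer.mask_token` serves the same purpose and avoids depending on the literal string.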