---
language:
- tr
- en
license: apache-2.0
tags:
- fill-mask
- turkish
- legal
- turkish-legal
- mecellem
- modernbert
- TRUBA
- MN5
base_model: ModernBERT-base
pipeline_tag: fill-mask
---
|
|
|
|
|
# Mursit-Base |
|
|
|
|
|
[GitHub](https://github.com/newmindai/mecellem-models) | [Mizan Space](https://huggingface.co/spaces/newmindai/Mizan) | [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
|
|
|
|
|
## Model Description |
|
|
|
|
|
Mursit-Base is a Turkish masked language model pre-trained from scratch on Turkish-dominant corpora. It uses the ModernBERT-base architecture (155M parameters) and serves as a foundation model for downstream tasks such as text classification, named entity recognition, and feature extraction. Unlike domain-adaptive approaches that continue training from an existing checkpoint, the model is randomly initialized and trained on a curated dataset that combines Turkish legal text with general web data.
|
|
|
|
|
**Key Features:** |
|
|
- Pre-trained from scratch on approximately 112.7 billion tokens of Turkish-dominant corpus |
|
|
- Achieves 64.05% MLM accuracy on Turkish datasets (80-10-10 masking strategy, evaluated at 15% masking rate)
|
|
- Serves as foundation for downstream embedding tasks (Mursit-Base-TR-Retrieval) |
|
|
- Custom tokenizer optimized for Turkish morphological structure |
|
|
- Pre-trained with 30% masking rate (ModernBERT/RoBERTa approach) but evaluated at 15% masking rate for fair comparison |
|
|
|
|
|
**Model Type:** Masked Language Model (MLM) |
|
|
**Parameters:** 155M |
|
|
**Base Architecture:** ModernBERT-base |
|
|
**Hidden Size:** 768 |
|
|
**Max Sequence Length:** 1,024 tokens |
|
|
|
|
|
### Architecture Details |
|
|
|
|
|
- **Layers:** 22 transformer layers |
|
|
- **Hidden Size:** 768 |
|
|
- **FFN Size:** 1,152 |
|
|
- **Attention Heads:** 12 heads with 64 dimensions each |
|
|
- **Activation:** GeGLU (Gated Linear Units with GELU) |
|
|
- **Normalization:** RMSNorm |
|
|
- **Position Embeddings:** Rotary positional embeddings (RoPE) with θ=10,000 |
|
|
- **Window Size:** 128 (for sliding window attention in local layers) |
|
|
- **Vocabulary Size:** 59,008 tokens |
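
As a rough consistency check, the figures listed above can be combined into a parameter estimate. The sketch below assumes ModernBERT-style conventions that this card does not spell out: bias-free linear layers, a fused QKV projection, a GeGLU input projection that is twice the FFN width, and output weights tied to the input embeddings. Normalization weights and the MLM head are ignored.

```python
# Values taken from the architecture list above
vocab_size, hidden, ffn, layers = 59_008, 768, 1_152, 22
heads, head_dim = 12, 64

# 12 heads x 64 dimensions must cover the hidden size
assert heads * head_dim == hidden

# Assumed layout (not confirmed by the card): bias-free linear layers
embedding = vocab_size * hidden                     # token embedding matrix
attention = hidden * 3 * hidden + hidden * hidden   # fused QKV + output projection
mlp = hidden * 2 * ffn + ffn * hidden               # GeGLU gate/value halves + down-projection

total = embedding + layers * (attention + mlp)
print(f"{total / 1e6:.1f}M")  # prints 155.6M
```

Under these assumptions the estimate lands close to the stated 155M parameters, with about 45M of them in the embedding matrix.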
|
|
|
|
|
### Training Details |
|
|
|
|
|
**Pre-training:** |
|
|
- **Dataset:** Turkish-dominant corpus totaling approximately 112.7 billion tokens |
|
|
- **Legal Sources:** |
|
|
- Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens |
|
|
- Council of State (Danıştay): 151K sequences, ~0.11B tokens |
|
|
- Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing) |
|
|
- **General Turkish Sources:** |
|
|
- FineWeb2: General Turkish web data |
|
|
- CulturaX: Multilingual corpus (Turkish subset) |
|
|
- Total general Turkish: 212M sequences, ~96.17B tokens |
|
|
- **Data Processing:** SemHash-based semantic deduplication, FineWeb quality filtering, URL-based filtering, page-packing for YÖKTEZ documents |
|
|
- **Training Method:** Masked Language Modeling (MLM) with 30% masking probability during pre-training (evaluation uses 15%)
|
|
- **Masking Strategy:** 80% [MASK], 10% random token, 10% unchanged (80-10-10 strategy) |
|
|
- **Framework:** MosaicML Composer with Decoupled StableAdamW optimizer |
|
|
- **Learning Rate:** 5×10⁻⁴ with warmup_stable_decay schedule |
|
|
- **Precision:** BF16 mixed precision |
|
|
- **Hardware Infrastructure:** |
|
|
- **System:** MareNostrum 5 ACC partition at Barcelona Supercomputing Center (BSC) |
|
|
- **Compute Nodes:** 16 nodes |
|
|
- **GPUs:** 64× NVIDIA Hopper H100 64GB GPUs (4 GPUs per node) |
|
|
- **Node Configuration:** Each node equipped with 4× H100 GPUs, 80 CPU cores, 512GB DDR5 memory |
|
|
- **Interconnect:** 800 Gb/s InfiniBand for distributed training |
|
|
- **GPU Interconnect:** NVLink for intra-node GPU communication (4 GPUs per node connected via NVLink) |
|
|
- **Distributed Training:** Multi-node distributed training across 16 nodes with InfiniBand interconnect |
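
The 80-10-10 masking strategy listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training code: the function name `mask_tokens`, the mask id `4`, and the fixed seed are hypothetical, special tokens are not excluded, and the masking rate is left as a parameter because this card cites both 30% (pre-training) and 15% (evaluation).

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob, seed=0):
    """Apply 80-10-10 MLM corruption and return (corrupted_ids, labels)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions that are not scored
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


# Toy input: 1,000 made-up token ids; mask_id=4 is a placeholder value
ids = list(range(100, 1100))
corrupted, labels = mask_tokens(ids, mask_id=4, vocab_size=59_008, mask_prob=0.15)
selected = sum(label != -100 for label in labels)
print(f"selected {selected} of {len(ids)} positions")
```

Only the selected positions contribute to the loss and to MLM accuracy; unselected positions pass through unchanged.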
|
|
|
|
|
**MLM Accuracy:** 64.05% (evaluated on Turkish datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal) |
|
|
|
|
|
### MLM Accuracy Scores (80-10-10 Strategy) on Turkish Datasets |
|
|
|
|
|
The following table presents MLM accuracy scores (averaged across the 80-10-10 strategy) for our pre-trained models and baseline MLM models evaluated on Turkish datasets. *This model's results are highlighted in italics.* |
|
|
|
|
|
| Model | MLM Avg (%) |
|-------|-------------|
| boun-tabilab/TabiBERT | **69.57** |
| newmindai/Mursit-Large | 67.25 |
| ytu-ce-cosmos/turkish-large-bert-cased | 65.03 |
| dbmdz/bert-base-turkish-cased | 64.98 |
| *newmindai/Mursit-Base* | *64.05* |
| KocLab-Bilkent/BERTurk-Legal | 54.10 |
| ytu-ce-cosmos/turkish-base-bert-uncased | 52.69 |
|
|
|
|
|
*MLM accuracy averaged across the 80-10-10 masking strategy. turkish-base-bert-uncased was evaluated only on uncased datasets. Evaluation datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal. All experiments are reproducible (see Section A.2 in the paper).* |
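
For readers unfamiliar with the metric, MLM accuracy is commonly computed as the share of masked positions where the top-1 prediction matches the original token. The sketch below illustrates that definition on toy data; it is an assumption about the general metric, not the benchmark code from the repository, and the function name `mlm_accuracy` and the `-100` ignore index are illustrative conventions.

```python
def mlm_accuracy(predicted_ids, labels, ignore_index=-100):
    """Fraction of scored (masked) positions where prediction equals label."""
    pairs = [(p, y) for p, y in zip(predicted_ids, labels) if y != ignore_index]
    if not pairs:
        raise ValueError("no masked positions to score")
    return sum(p == y for p, y in pairs) / len(pairs)


# Toy example: three masked positions, two predicted correctly
predicted = [5, 9, 7, 3, 2]
labels = [-100, 9, 8, -100, 2]
accuracy = mlm_accuracy(predicted, labels)
print(f"{accuracy:.2%}")  # prints 66.67%
```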
|
|
|
|
|
## Performance on MTEB-Turkish Benchmark |
|
|
|
|
|
The following visualization shows the model's performance compared to other Turkish language models: |
|
|
|
|
|
 |
|
|
|
|
|
*Model Performance Comparison: Legal Score vs. MTEB Score. MLM models (blue circles) form a distinct cluster. Mursit-Base achieves competitive performance among Turkish MLM models.* |
|
|
|
|
|
This model was evaluated on the comprehensive MTEB-Turkish benchmark for embedding tasks using mean pooling over token representations followed by L2 normalization. |
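
The pooling step described above can be sketched with NumPy on dummy data. This is a minimal illustration, assuming that padded positions are excluded from the mean (the card does not say whether the evaluation did this); the function name `mean_pool_l2` and the random inputs are hypothetical.

```python
import numpy as np


def mean_pool_l2(hidden_states, attention_mask):
    """Mask-aware mean pooling over tokens, then L2 normalization per row."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)      # sum of real-token vectors
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # number of real tokens
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# Dummy batch: 2 sequences, 5 token slots, hidden size 768
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 768)).astype(np.float32)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

embeddings = mean_pool_l2(hidden, mask)
print(embeddings.shape)  # (2, 768)
```

With unit-length embeddings, cosine similarity reduces to a dot product, which is convenient for retrieval-style comparisons.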
|
|
|
|
|
### Comprehensive Benchmark Results |
|
|
|
|
|
The following table presents comprehensive evaluation results across all models evaluated on the MTEB-Turkish benchmark. *This model's results are highlighted in italics.* |
|
|
|
|
|
| Model | MTEB | Legal | Cls. | Clus. | Pair | Ret. | STS | Cont. | Reg. | Case | Params | Type |
|-------|------|-------|------|-------|------|------|-----|-------|------|------|--------|------|
| embeddinggemma-300m | **65.42** | 50.63 | **77.74** | **45.05** | **80.02** | **55.06** | 69.22 | 83.97 | **39.56** | 28.38 | 307M | Emb. |
| bge-m3 | 62.87 | **51.16** | 75.35 | 35.86 | 78.88 | 54.42 | **69.83** | **86.08** | 38.09 | **29.3** | 567M | Emb. |
| Mursit-Embed-Qwen3-1.7B-TR | 56.84 | 34.76 | 68.46 | 42.22 | 59.67 | 50.1 | 63.77 | 70.22 | 17.94 | 16.11 | 1.7B | CLM-E. |
| Mursit-Large-TR-Retrieval | 56.87 | 46.56 | 67.72 | 41.15 | 59.78 | 51.69 | 64.01 | 81.78 | 32.67 | 25.24 | 403M | Emb. |
| Mursit-Base-TR-Retrieval | 55.86 | 47.52 | 66.25 | 39.75 | 61.31 | 50.07 | 61.9 | 80.4 | 34.1 | 28.07 | 155M | Emb. |
| Mursit-Embed-Qwen3-4B-TR | 53.65 | 37.0 | 67.29 | 36.68 | 58.36 | 51.12 | 54.77 | 69.25 | 24.21 | 17.56 | 4B | CLM-E. |
| bert-base-turkish-uncased | 46.23 | 24.94 | 68.05 | 33.81 | 60.44 | 32.01 | 36.85 | 52.47 | 12.05 | 10.29 | 110M | MLM |
| turkish-large-bert-cased | 45.3 | 19.12 | 67.43 | 34.24 | 60.11 | 28.68 | 36.04 | 47.57 | 5.93 | 3.85 | 337M | MLM |
| bert-base-turkish-cased | 45.17 | 24.41 | 66.39 | 35.28 | 60.05 | 30.52 | 33.62 | 54.03 | 10.13 | 9.07 | 110M | MLM |
| BERTurk-Legal | 42.02 | 32.63 | 60.61 | 26.24 | 59.51 | 25.8 | 37.94 | 61.4 | 15.51 | 20.99 | 184M | MLM |
| Mursit-Large | 41.75 | 23.71 | 62.95 | 25.34 | 58.04 | 27.4 | 35.01 | 42.74 | 11.29 | 17.1 | 403M | MLM |
| turkish-base-bert-uncased | 44.68 | 27.58 | 66.22 | 30.23 | 58.84 | 31.4 | 36.74 | 56.6 | 13.39 | 12.74 | 110M | MLM |
| *Mursit-Base* | 40.23 | 17.93 | 59.78 | 25.48 | 58.65 | 20.82 | 36.45 | 36.0 | 7.4 | 10.4 | 155M | MLM |
| mmBERT-base | 39.65 | 12.15 | 61.84 | 26.77 | 59.25 | 15.83 | 34.56 | 34.45 | 1.33 | 0.68 | 306M | MLM |
| TabiBERT | 37.77 | 11.5 | 59.63 | 25.75 | 58.19 | 14.96 | 30.32 | 32.02 | 1.86 | 0.63 | 148M | MLM |
| ModernBERT-base | 23.8 | 2.99 | 39.06 | 2.01 | 53.95 | 2.1 | 21.91 | 7.92 | 0.62 | 0.43 | 149M | MLM |
| ModernBERT-large | 23.74 | 2.44 | 39.44 | 3.9 | 53.73 | 1.8 | 19.85 | 6.12 | 0.62 | 0.59 | 394M | MLM |
|
|
|
|
|
**Column abbreviations:** MTEB = mean performance across task types; Legal = weighted average of Contracts, Regulation, and Caselaw; Cls. = Classification (accuracy on Turkish classification tasks); Clus. = Clustering (V-measure); Pair = Pair Classification (average precision on tasks such as NLI); Ret. = Retrieval (nDCG@10 on information retrieval tasks); STS = Semantic Textual Similarity (Spearman correlation); Cont. = Contracts (nDCG@10 on legal contract retrieval); Reg. = Regulation (nDCG@10 on regulatory text retrieval); Case = Caselaw (nDCG@10 on case law retrieval); Params = number of model parameters; Type = model type (Emb. = embedding model, CLM-E. = CLM-based embedding model, MLM = masked language model). **Bold values** indicate the highest score in each column.
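
Several columns report nDCG@10. As a reminder of what that metric measures, here is a minimal sketch using linear gain and a log2 rank discount. It illustrates the standard definition only; the benchmark's own implementation may differ in details such as tie handling, and the function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the returned ranking divided by the ideal DCG."""
    ideal = dcg(sorted(all_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0


# Toy query: two relevant documents, retrieved at ranks 2 and 4
score = ndcg_at_k([0, 1, 0, 1], all_relevances=[1, 1])
print(f"{score:.4f}")
```

A ranking that places every relevant document first scores 1.0, and the score falls as relevant documents move down the list.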
|
|
|
|
|
**Key Findings:** |
|
|
- The model shows substantial improvement over ModernBERT baselines (which are monolingual English models), validating the effectiveness of Turkish-specific pre-training |
|
|
- Pre-training alone without embedding-specific fine-tuning yields limited utility for retrieval tasks |
|
|
- Language-specific pre-training is critical, as monolingual English models show limited cross-lingual transfer to Turkish |
|
|
- The model demonstrates that improvements in MLM accuracy do not always directly translate to better downstream task performance |
|
|
|
|
|
### MLM vs Downstream Performance Analysis |
|
|
|
|
|
The following visualization shows the relationship between MLM validation loss and downstream retrieval performance: |
|
|
|
|
|
 |
|
|
|
|
|
*Relationship between MLM validation loss and downstream retrieval performance across ModernBERT-base versions v1-v6. The analysis shows how changes in MLM validation loss relate to downstream retrieval performance.*
|
|
|
|
|
**Note:** This model is primarily designed for Masked Language Modeling tasks. Embedding performance is provided for reference using standard mean pooling. For optimal retrieval performance, consider using the post-trained retrieval variants (Mursit-Base-TR-Retrieval or Mursit-Large-TR-Retrieval). |
|
|
|
|
|
## Reproducibility |
|
|
|
|
|
To reproduce the MLM benchmark results for this model, please refer to: |
|
|
|
|
|
- **MLM Benchmark Results:** [github.com/newmindai/mecellem-models/benchmark/mlm](https://github.com/newmindai/mecellem-models/tree/main/benchmark/mlm) - Contains code and evaluation configurations for reproducing MLM accuracy scores on Turkish datasets using the 80-10-10 masking strategy. |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash
pip install transformers torch
```
|
|
|
|
|
### Masked Language Modeling |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
model = AutoModelForMaskedLM.from_pretrained("newmindai/Mursit-Base")

# Example text with mask
text = "Türkiye Cumhuriyeti'nin başkenti [MASK]'dir."
inputs = tokenizer(text, return_tensors="pt")

# Predict masked token
with torch.no_grad():
    outputs = model(**inputs)

# Locate the mask position and convert its logits to probabilities
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predictions = torch.nn.functional.softmax(outputs.logits[0, mask_token_index], dim=-1)

# Get top predictions
top_k = 5
top_indices = torch.topk(predictions[0], top_k).indices
for idx in top_indices.tolist():
    token = tokenizer.decode([idx])
    score = predictions[0][idx].item()
    print(f"{token}: {score:.4f}")
```
|
|
|
|
|
### Feature Extraction |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
model = AutoModel.from_pretrained("newmindai/Mursit-Base")

text = "Türk hukuk sistemi medeni hukuk geleneğine dayanır"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over the token dimension (single, unpadded input)
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # (batch_size, 768)
```
|
|
## ONNX Model Inference (Masked Language Modeling)

This section shows how to run the ONNX export of the model from the Hugging Face Hub for masked language modeling.
|
|
|
|
|
### Exporting the Model to ONNX
|
|
|
|
|
To export the model to ONNX format for MLM, use the `optimum-cli` command: |
|
|
|
|
|
```bash
optimum-cli export onnx \
  -m newmindai/Mursit-Base \
  --task fill-mask \
  onnx/MursitBase
```
|
|
|
|
|
This will create the `model.onnx` file in the specified output directory. |
|
|
|
|
|
### ONNX Installation
|
|
|
|
|
```bash
pip install onnxruntime-gpu transformers huggingface_hub numpy
```
|
|
|
|
|
### ONNX Usage
|
|
|
|
|
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

repo_id = "newmindai/Mursit-Base"

# Download the ONNX graph and load the matching tokenizer
onnx_path = hf_hub_download(repo_id, "model.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Prefer CUDA when available; otherwise fall back to CPU
sess = ort.InferenceSession(
    onnx_path,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

text = f"Bu bir {tokenizer.mask_token} cümledir."
inputs = tokenizer(text, return_tensors="np")

# Run the model; the first output holds the logits
outputs = sess.run(None, dict(inputs))
logits = outputs[0]

# Locate the masked position and read its logits
mask_pos = np.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0][0]
mask_logits = logits[0, mask_pos]

# Take the k highest-scoring token ids, best first
top_k = 5
top_k_ids = np.argsort(mask_logits)[-top_k:][::-1]
predictions = tokenizer.convert_ids_to_tokens(top_k_ids.tolist())

print("MASK predictions:")
for p in predictions:
    print(p)
```
|
|
|
|
|
### ONNX Features
|
|
|
|
|
- **Automatic GPU/CPU selection**: Uses CUDA if available, otherwise falls back to CPU |
|
|
- **Hugging Face integration**: Downloads model files directly from Hugging Face Hub |
|
|
- **Masked token prediction**: Predicts the most likely tokens for masked positions |
|
|
- **Top-K predictions**: Returns the top K most probable token predictions |
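
The ONNX example above prints token strings without scores. If probabilities are wanted, the raw logits at the masked position can be passed through a softmax first. The sketch below does this with NumPy on a made-up logits vector; the helper name `top_k_probs` and the demo values are illustrative, not part of the model or of ONNX Runtime.

```python
import numpy as np


def top_k_probs(logits, k=5):
    """Return the k most probable (token_id, probability) pairs, best first."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum()
    top_ids = np.argsort(probs)[-k:][::-1]
    return [(int(i), float(probs[i])) for i in top_ids]


# Made-up logits for a five-token vocabulary
demo_logits = np.array([2.0, 0.5, -1.0, 3.0, 0.0])
results = top_k_probs(demo_logits, k=3)
for token_id, prob in results:
    print(token_id, f"{prob:.4f}")
```

In the ONNX example, `mask_logits` would take the place of `demo_logits`, and the returned ids can be mapped to strings with `tokenizer.convert_ids_to_tokens`.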
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- Turkish language understanding tasks |
|
|
- Text classification |
|
|
- Named entity recognition |
|
|
- Question answering |
|
|
- Text generation (with fine-tuning) |
|
|
- Feature extraction for downstream tasks |
|
|
- Domain adaptation for Turkish legal texts |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project. |
|
|
|
|
|
The numerical calculations reported in this work were partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge the know-how provided by MINERVA Support, including expert guidance and collaboration opportunities in HPC-AI integration.
|
|
|
|
|
## References |
|
|
|
|
|
If you use this model, please cite our paper: |
|
|
|
|
|
```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
|
|
### Base Model References |
|
|
|
|
|
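```bibtex
@misc{modernbert2024,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavié, Benjamin and Weller, Orion and Hallström, Oskar and Taghadouini, Said and Gallagher, Alexis and Biswas, Raja and Ladhak, Faisal and Aarsen, Tom and Cooper, Nathan and Adams, Griffin and Howard, Jeremy and Poli, Iacopo},
  year={2024},
  eprint={2412.13663},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.13663}
}
```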