---
base_model: ModernBERT-base
language:
- tr
- en
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
tags:
- fill-mask
- turkish
- legal
- turkish-legal
- mecellem
- modernbert
- TRUBA
- MN5
---

# Mursit-Base

[Code](https://github.com/newmindai/mecellem-models) [Demo](https://huggingface.co/spaces/newmindai/Mizan) [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

Mursit-Base is a Turkish Masked Language Model pre-trained entirely from scratch on Turkish-dominant corpora, introduced in the paper [Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain](https://huggingface.co/papers/2601.16018).
|
## Model Description

Mursit-Base is a Turkish Masked Language Model pre-trained entirely from scratch on Turkish-dominant corpora. It is based on the ModernBERT-base architecture (155M parameters) and serves as a foundation model for downstream tasks including text classification, named entity recognition, and feature extraction. Unlike domain-adaptive approaches that continue training from existing checkpoints, this model is initialized randomly and trained on a carefully curated dataset combining Turkish legal text with general web data.
|
**Key Features:**
- Pre-trained from scratch on approximately 112.7 billion tokens of Turkish-dominant corpus
- Achieves 64.05% average MLM accuracy on Turkish datasets (80-10-10 masking strategy, evaluated at a 15% masking rate)
- Serves as the foundation for downstream embedding models (Mursit-Base-TR-Retrieval)
- Custom tokenizer optimized for Turkish morphological structure
- Pre-trained with a 30% masking rate (the ModernBERT/RoBERTa approach) but evaluated at a 15% masking rate for fair comparison
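The 80-10-10 corruption scheme referenced above can be sketched in plain Python. This is a minimal illustration, not the actual training code; `mask_id`, the vocabulary size, and the selection rate below are placeholders:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style corruption: each token is selected with `mask_prob`;
    of the selected tokens, 80% become [MASK], 10% become a random
    token, and 10% are left unchanged (the 80-10-10 strategy)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id    # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```

At pre-training time the selection rate was 30% (`mask_prob=0.30`); the reported accuracies use 15%.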
|
**Model Type:** Masked Language Model (MLM)
**Parameters:** 155M
**Base Architecture:** ModernBERT-base
**Hidden Size:** 768
**Max Sequence Length:** 1,024 tokens
|
### Architecture Details

- **Layers:** 22 transformer layers
- **Hidden Size:** 768
- **FFN Size:** 1,152
- **Attention Heads:** 12 heads with 64 dimensions each
- **Activation:** GeGLU (Gated Linear Units with GELU)
- **Normalization:** RMSNorm
- **Position Embeddings:** Rotary positional embeddings (RoPE) with θ=10,000
- **Window Size:** 128 (for sliding window attention in local layers)
- **Vocabulary Size:** 59,008 tokens
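As a rough sanity check on the 155M figure, the weight count implied by these hyperparameters can be tallied. This is a back-of-the-envelope sketch that counts only the embedding, attention, and GeGLU matrices, ignoring normalization parameters and assuming a weight-tied LM head:

```python
vocab, d, layers, ffn = 59_008, 768, 22, 1_152

embeddings = vocab * d                  # token embeddings (tied with the LM head)
attention = 4 * d * d                   # Q, K, V, and output projections per layer
geglu = d * (2 * ffn) + ffn * d         # gated input projection + output projection

total = embeddings + layers * (attention + geglu)
print(f"{total / 1e6:.1f}M parameters")  # ≈ 155.6M
```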
|
### Training Details

**Pre-training:**
- **Dataset:** Turkish-dominant corpus totaling approximately 112.7 billion tokens
- **Legal Sources:**
  - Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens
  - Council of State (Danıştay): 151K sequences, ~0.11B tokens
  - Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing)
- **General Turkish Sources:**
  - FineWeb2: general Turkish web data
  - CulturaX: multilingual corpus (Turkish subset)
  - Total general Turkish: 212M sequences, ~96.17B tokens
- **Data Processing:** SemHash-based semantic deduplication, FineWeb quality filtering, URL-based filtering, page-packing for YÖKTEZ documents
- **Training Method:** Masked Language Modeling (MLM) with a 30% masking probability
- **Masking Strategy:** 80% [MASK], 10% random token, 10% unchanged (the 80-10-10 strategy)
- **Framework:** MosaicML Composer with the Decoupled StableAdamW optimizer
- **Learning Rate:** 5×10⁻⁴ with a warmup_stable_decay schedule
- **Precision:** BF16 mixed precision
- **Hardware Infrastructure:**
  - **System:** MareNostrum 5 ACC partition at the Barcelona Supercomputing Center (BSC)
  - **Compute Nodes:** 16 nodes, each equipped with 4× NVIDIA Hopper H100 64GB GPUs (64 GPUs total), 80 CPU cores, and 512GB DDR5 memory
  - **Interconnect:** 800 Gb/s InfiniBand between nodes for multi-node distributed training; NVLink for intra-node GPU communication
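The warmup_stable_decay schedule ramps linearly to the peak rate, holds it constant for most of training, then decays linearly to zero. A minimal sketch (the warmup and decay fractions here are illustrative assumptions, not the values used in training):

```python
def wsd_lr(step, total_steps, peak_lr=5e-4, warmup_frac=0.05, decay_frac=0.1):
    """Warmup-stable-decay (trapezoidal) learning-rate schedule."""
    warmup = max(1, int(total_steps * warmup_frac))
    decay_start = total_steps - int(total_steps * decay_frac)
    if step < warmup:                 # linear warmup to the peak rate
        return peak_lr * step / warmup
    if step < decay_start:            # stable plateau at the peak rate
        return peak_lr
    # linear decay to zero over the final fraction of training
    return peak_lr * (total_steps - step) / max(1, total_steps - decay_start)
```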
|
**MLM Accuracy:** 64.05% (evaluated on Turkish datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal)

### MLM Accuracy Scores (80-10-10 Strategy) on Turkish Datasets

The following table presents MLM accuracy scores (averaged across the 80-10-10 strategy) for our pre-trained models and baseline MLM models evaluated on Turkish datasets. *This model's results are highlighted in italics.*
| Model | MLM Avg (%) |
|-------|-------------|
| boun-tabilab/TabiBERT | **69.57** |
| newmindai/Mursit-Large | 67.25 |
| ytu-ce-cosmos/turkish-large-bert-cased | 65.03 |
| dbmdz/bert-base-turkish-cased | 64.98 |
| *newmindai/Mursit-Base* | *64.05* |
| KocLab-Bilkent/BERTurk-Legal | 54.10 |
| ytu-ce-cosmos/turkish-base-bert-uncased | 52.69 |

*MLM accuracy averaged across the 80-10-10 masking strategy. turkish-base-bert-uncased was evaluated only on uncased datasets. Evaluation datasets: blackerx/turkish_v2, fthbrmnby/turkish_product_reviews, hazal/Turkish-Biomedical-corpus-trM, newmindai/EuroHPC-Legal. All experiments are reproducible (see Section A.2 in the paper).*
|
## Performance on MTEB-Turkish Benchmark

The following visualization compares the model's performance with other Turkish language models:



*Model Performance Comparison: Legal Score vs. MTEB Score. MLM models (blue circles) form a distinct cluster. Mursit-Base achieves competitive performance among Turkish MLM models.*

This model was evaluated on the comprehensive MTEB-Turkish benchmark for embedding tasks, using mean pooling over token representations followed by L2 normalization.
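This pooling amounts to a masked mean over token vectors followed by L2 normalization, which can be sketched in NumPy (the random array below stands in for real model outputs):

```python
import numpy as np

def pool_and_normalize(hidden_states, attention_mask):
    """Mean-pool token representations (ignoring padding) and L2-normalize.
    hidden_states: (seq_len, dim); attention_mask: (seq_len,), 1 = real token."""
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    mean = (hidden_states * mask).sum(axis=0) / mask.sum()
    return mean / np.linalg.norm(mean)

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 768))          # stand-in for last_hidden_state[0]
mask = np.array([1, 1, 1, 1, 1, 1, 0, 0])   # last two positions are padding
embedding = pool_and_normalize(hidden, mask)
print(embedding.shape, np.linalg.norm(embedding))  # (768,) 1.0
```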
|
### Comprehensive Benchmark Results

The following table presents comprehensive evaluation results across all models evaluated on the MTEB-Turkish benchmark. *This model's results are highlighted in italics.*
|
| Model | MTEB | Legal | Cls. | Clus. | Pair | Ret. | STS | Cont. | Reg. | Case | Params | Type |
|-------|------|-------|------|-------|------|------|-----|-------|------|------|--------|------|
| embeddinggemma-300m | **65.42** | 50.63 | **77.74** | **45.05** | **80.02** | **55.06** | 69.22 | 83.97 | **39.56** | 28.38 | 307M | Emb. |
| bge-m3 | 62.87 | **51.16** | 75.35 | 35.86 | 78.88 | 54.42 | **69.83** | **86.08** | 38.09 | **29.3** | 567M | Emb. |
| Mursit-Large-TR-Retrieval | 56.87 | 46.56 | 67.72 | 41.15 | 59.78 | 51.69 | 64.01 | 81.78 | 32.67 | 25.24 | 403M | Emb. |
| Mursit-Embed-Qwen3-1.7B-TR | 56.84 | 34.76 | 68.46 | 42.22 | 59.67 | 50.1 | 63.77 | 70.22 | 17.94 | 16.11 | 1.7B | CLM-E. |
| Mursit-Base-TR-Retrieval | 55.86 | 47.52 | 66.25 | 39.75 | 61.31 | 50.07 | 61.9 | 80.4 | 34.1 | 28.07 | 155M | Emb. |
| Mursit-Embed-Qwen3-4B-TR | 53.65 | 37.0 | 67.29 | 36.68 | 58.36 | 51.12 | 54.77 | 69.25 | 24.21 | 17.56 | 4B | CLM-E. |
| bert-base-turkish-uncased | 46.23 | 24.94 | 68.05 | 33.81 | 60.44 | 32.01 | 36.85 | 52.47 | 12.05 | 10.29 | 110M | MLM |
| turkish-large-bert-cased | 45.3 | 19.12 | 67.43 | 34.24 | 60.11 | 28.68 | 36.04 | 47.57 | 5.93 | 3.85 | 337M | MLM |
| bert-base-turkish-cased | 45.17 | 24.41 | 66.39 | 35.28 | 60.05 | 30.52 | 33.62 | 54.03 | 10.13 | 9.07 | 110M | MLM |
| turkish-base-bert-uncased | 44.68 | 27.58 | 66.22 | 30.23 | 58.84 | 31.4 | 36.74 | 56.6 | 13.39 | 12.74 | 110M | MLM |
| BERTurk-Legal | 42.02 | 32.63 | 60.61 | 26.24 | 59.51 | 25.8 | 37.94 | 61.4 | 15.51 | 20.99 | 184M | MLM |
| Mursit-Large | 41.75 | 23.71 | 62.95 | 25.34 | 58.04 | 27.4 | 35.01 | 42.74 | 11.29 | 17.1 | 403M | MLM |
| *Mursit-Base* | *40.23* | *17.93* | *59.78* | *25.48* | *58.65* | *20.82* | *36.45* | *36.0* | *7.4* | *10.4* | *155M* | *MLM* |
| mmBERT-base | 39.65 | 12.15 | 61.84 | 26.77 | 59.25 | 15.83 | 34.56 | 34.45 | 1.33 | 0.68 | 306M | MLM |
| TabiBERT | 37.77 | 11.5 | 59.63 | 25.75 | 58.19 | 14.96 | 30.32 | 32.02 | 1.86 | 0.63 | 148M | MLM |
| ModernBERT-base | 23.8 | 2.99 | 39.06 | 2.01 | 53.95 | 2.1 | 21.91 | 7.92 | 0.62 | 0.43 | 149M | MLM |
| ModernBERT-large | 23.74 | 2.44 | 39.44 | 3.9 | 53.73 | 1.8 | 19.85 | 6.12 | 0.62 | 0.59 | 394M | MLM |
|
**Column abbreviations:** MTEB = mean performance across task types; Legal = weighted average of the Contracts, Regulation, and Caselaw scores; Cls. = accuracy on Turkish classification tasks; Clus. = V-measure on clustering tasks; Pair = average precision on pair-classification tasks such as NLI; Ret. = nDCG@10 on information-retrieval tasks; STS = Spearman correlation on semantic textual similarity; Cont. = nDCG@10 on legal contract retrieval; Reg. = nDCG@10 on regulatory text retrieval; Case = nDCG@10 on case-law retrieval; Params = number of model parameters; Type = model type (Embedding, CLM-Embedding, or Masked Language Model). **Bold values** indicate the highest score in each column.
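Several columns report nDCG@10. For reference, a minimal NumPy implementation of the metric over a system-ranked list of relevance grades:

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a ranked list of relevance grades (in system order)."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))   # discount for ranks 1..k
    top = rel[:k]
    dcg = (top * discounts[:len(top)]).sum()         # gain of the system ranking
    ideal = np.sort(rel)[::-1][:k]                   # best possible ordering
    idcg = (ideal * discounts[:len(ideal)]).sum()
    return dcg / idcg if idcg > 0 else 0.0
```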
|
**Key Findings:**
- The model shows substantial improvement over the ModernBERT baselines (monolingual English models), validating the effectiveness of Turkish-specific pre-training
- Pre-training alone, without embedding-specific fine-tuning, yields limited utility for retrieval tasks
- Language-specific pre-training is critical, as monolingual English models show limited cross-lingual transfer to Turkish
- Improvements in MLM accuracy do not always translate directly into better downstream task performance
|
### MLM vs Downstream Performance Analysis

The following visualization shows the relationship between MLM validation loss and downstream retrieval performance:



*Relationship between MLM validation loss and downstream retrieval performance across ModernBERT-base versions v1-v6, illustrating how improvements in MLM accuracy relate to downstream task performance.*

**Note:** This model is primarily designed for Masked Language Modeling tasks; embedding performance is reported for reference using standard mean pooling. For optimal retrieval performance, use the post-trained retrieval variants (Mursit-Base-TR-Retrieval or Mursit-Large-TR-Retrieval).
|
## Reproducibility

To reproduce the MLM benchmark results for this model, please refer to:

- **MLM Benchmark Results:** [github.com/newmindai/mecellem-models/benchmark/mlm](https://github.com/newmindai/mecellem-models/tree/main/benchmark/mlm) - contains code and evaluation configurations for reproducing MLM accuracy scores on Turkish datasets with the 80-10-10 masking strategy.
|
## Usage

### Installation

```bash
pip install transformers torch
```
|
### Masked Language Modeling

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
model = AutoModelForMaskedLM.from_pretrained("newmindai/Mursit-Base")

# Example text with a masked token
text = "Türkiye Cumhuriyeti'nin başkenti [MASK]'dir."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    outputs = model(**inputs)
    mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    predictions = torch.nn.functional.softmax(outputs.logits[0, mask_token_index], dim=-1)

# Show the top predictions
top_k = 5
top_indices = torch.topk(predictions[0], top_k).indices
for idx in top_indices:
    token = tokenizer.decode([idx])
    score = predictions[0][idx].item()
    print(f"{token}: {score:.4f}")
```
|
### Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("newmindai/Mursit-Base")
model = AutoModel.from_pretrained("newmindai/Mursit-Base")

text = "Türk hukuk sistemi medeni hukuk geleneğine dayanır"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Mean pooling over token representations
    embeddings = outputs.last_hidden_state.mean(dim=1)
    print(embeddings.shape)  # (batch_size, 768)
```

## ONNX Model Inference (Masked Language Modeling)

This section demonstrates how to export the model to ONNX and run masked language modeling with ONNX Runtime.
|
### Exporting the Model to ONNX

To export the model to ONNX format for fill-mask inference, use the `optimum-cli` command:

```bash
optimum-cli export onnx \
  -m newmindai/Mursit-Base \
  --task fill-mask \
  onnx/MursitBase
```

This creates the `model.onnx` file in the specified output directory.
|
### Installation

```bash
pip install onnxruntime-gpu transformers huggingface_hub numpy
```
|
### Running Inference

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

repo_id = "newmindai/Mursit-Base"

# Download the exported ONNX graph from the Hub
onnx_path = hf_hub_download(repo_id, "model.onnx")

tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Prefer CUDA when available, fall back to CPU
sess = ort.InferenceSession(
    onnx_path,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

text = f"Bu bir {tokenizer.mask_token} cümledir."
inputs = tokenizer(text, return_tensors="np")

outputs = sess.run(None, dict(inputs))
logits = outputs[0]

# Locate the [MASK] position and take its logits
mask_pos = np.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0][0]
mask_logits = logits[0, mask_pos]

# Top-K most likely tokens
top_k = 5
top_k_ids = np.argsort(mask_logits)[-top_k:][::-1]
predictions = tokenizer.convert_ids_to_tokens(top_k_ids)

print("MASK predictions:")
for p in predictions:
    print(p)
```
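The snippet above ranks raw logits; to report probabilities alongside the top-K tokens, a numerically stable softmax can be applied to `mask_logits` first (a small optional addition, not part of the original script):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()   # shift avoids overflow for large logits
    exps = np.exp(shifted)
    return exps / exps.sum()

# e.g. probs = softmax(mask_logits); probs[top_k_ids] are the top-K scores
scores = softmax(np.array([2.0, 1.0, 0.0]))
print(scores.round(3))
```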
|
### Features

- **Automatic GPU/CPU selection:** uses CUDA if available, otherwise falls back to CPU
- **Hugging Face integration:** downloads model files directly from the Hugging Face Hub
- **Masked token prediction:** predicts the most likely tokens for masked positions
- **Top-K predictions:** returns the top K most probable token predictions
|
## Use Cases

- Turkish language understanding tasks
- Text classification
- Named entity recognition
- Question answering
- Text generation (with fine-tuning)
- Feature extraction for downstream tasks
- Domain adaptation for Turkish legal texts
|
## Acknowledgments

This work was supported by the EuroHPC Joint Undertaking through project etur46, with access to the MareNostrum 5 supercomputer hosted by the Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.

Part of the numerical calculations reported in this work were performed at the TÜBİTAK ULAKBİM High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge MINERVA Support for expert guidance and collaboration opportunities in HPC-AI integration.
|
## References

If you use this model, please cite our paper:

```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Base Model References

```bibtex
@misc{modernbert2024,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavié, Benjamin and Weller, Orion and Hallström, Oskar and Taghadouini, Said and Gallagher, Alexis and Biswas, Raja and Ladhak, Faisal and Aarsen, Tom and Cooper, Nathan and Adams, Griffin and Howard, Jeremy and Poli, Iacopo},
  year={2024},
  eprint={2412.13663},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```