|
|
--- |
|
|
language: |
|
|
- tr |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- reward-model |
|
|
- turkish |
|
|
- legal |
|
|
- turkish-legal |
|
|
- mecellem |
|
|
- armo |
|
|
- reward |
|
|
- evaluation |
|
|
- TRUBA |
|
|
- MN5 |
|
|
base_model: Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 |
|
|
pipeline_tag: text-classification |
|
|
datasets: |
|
|
- newmindai/armo-ultrafeedback-dataset |
|
|
- newmindai/armo-pair-dataset |
|
|
- newmindai/armo-dataset |
|
|
--- |
|
|
|
|
|
# Muhakim (ArmoRM-Turkish-Legal) |
|
|
|
|
|
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
|
|
|
|
|
## Model Description |
|
|
|
|
|
Muhakim (ArmoRM-Turkish-Legal) is a domain-specific multi-objective reward model trained for Turkish legal text assessment. Built upon the Skywork-Reward-Llama-3.1-8B-v0.2 backbone (8B parameters) and augmented with a mixture-of-experts gating mechanism, the model produces fine-grained quality scores across five legally grounded dimensions. The training pipeline consists of three components: (i) multi-objective supervision that enables independent learning of five legal quality dimensions, (ii) preference-based training of a mixture-of-experts gating network to capture context-dependent importance of these dimensions, and (iii) a debiasing stage designed to mitigate length-related reward artifacts.
|
|
|
|
|
**Key Features:** |
|
|
- Multi-objective reward model with five legal quality dimensions |
|
|
- Context-aware evaluation through mixture-of-experts gating mechanism |
|
|
- Trained for benchmarking decoder-only language models in Turkish legal tasks |
|
|
- Evaluates quality across: statute reference, legal accuracy, case law reference, linguistic coherence, and depth coverage |
|
|
|
|
|
**Model Type:** Reward Model |
|
|
**Parameters:** 8B |
|
|
**Base Model:** Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 |
|
|
**Architecture:** Llama-3.1-based reward backbone with MoE gating |
|
|
|
|
|
### Architecture Details |
|
|
|
|
|
The Muhakim reward model employs a multi-objective framework distinguishing input-dependent and output-dependent components: |
|
|
|
|
|
**1. Gating Mechanism (Input-Dependent):** |
|
|
- Operates in a prompt-conditioned manner |
|
|
- Dynamically adjusts evaluation priorities based on legal domain or question type |
|
|
- Mixture-of-experts (MoE) layer outputs non-negative coefficients summing to 1 |
|
|
- Determines how much weight each reward objective should receive |
|
|
|
|
|
**2. Reward Prediction (Output-Dependent):** |
|
|
- Multi-objective reward predictions from ArmoRM's regression layer |
|
|
- Represents model performance on each objective |
|
|
- Assesses the quality of the generated response |
|
|
|
|
|
**3. Final Score:** |
|
|
- Score = Σ(gating[i] × transformed_rewards[i]) |
|
|
- Context-aware evaluation that adapts importance weights based on the legal question |
|
|
- Assesses response quality across multiple dimensions |
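The weighted combination above can be sketched numerically. The reward values and gating logits below are illustrative placeholders, not outputs of the actual model:

```python
import numpy as np

# Hypothetical per-objective rewards for one response (illustrative values):
# statute reference, legal accuracy, case law reference, coherence, depth
rewards = np.array([0.62, 0.78, 0.41, 0.85, 0.55])

# Gating logits from the MoE layer; a softmax makes the coefficients
# non-negative and sum to 1, as described above
gating_logits = np.array([1.2, 2.0, 0.3, 0.8, 1.0])
gating = np.exp(gating_logits) / np.exp(gating_logits).sum()

# Final score is the gating-weighted sum of per-objective rewards
score = float(np.dot(gating, rewards))
print(f"gating weights: {np.round(gating, 3)}")
print(f"score: {score:.4f}")
```

Because the gating weights form a convex combination, the final score always lies between the smallest and largest per-objective reward.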
|
|
|
|
|
### Training Pipeline |
|
|
|
|
|
<table width="100%"> |
|
|
<tr> |
|
|
<td align="center" width="100%"> |
|
|
<img |
|
|
src="https://huggingface.co/newmindai/Muhakim/resolve/main/muhakim_avatar.png" |
|
|
width="100%"> |
|
|
<br> |
|
|
<em>Muhakim</em>
|
|
</td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
The following visualization shows the Muhakim model training pipeline: |
|
|
|
|
|
 |
|
|
|
|
|
*Muhakim model training pipeline: (i) multi-objective supervision, (ii) preference-based training of the mixture-of-experts gating network, and (iii) a length-debiasing stage.*
|
|
|
|
|
### Quality Dimensions |
|
|
|
|
|
The model evaluates five legal quality dimensions: |
|
|
|
|
|
1. **Statute Reference:** Accuracy of legal statute citations |
|
|
2. **Legal Accuracy:** Correctness of legal information |
|
|
3. **Case Law Reference:** Proper citation of legal precedents |
|
|
4. **Linguistic Coherence:** Language quality and fluency |
|
|
5. **Depth Coverage:** Comprehensiveness of the response |
|
|
|
|
|
### Training Components
|
|
|
|
|
The training pipeline consists of three components: |
|
|
|
|
|
1. **Multi-objective Supervision:** Enables independent learning of five legal quality dimensions |
|
|
2. **Preference-based Training:** Trains a mixture-of-experts gating network to capture context-dependent importance of these dimensions |
|
|
3. **Debiasing Stage:** Designed to mitigate length-related reward artifacts |
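The debiasing stage is described only at a high level. One common approach to length debiasing, shown here as an assumption rather than the documented method, is to regress reward on response length and subtract the length-explained component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rewards with an artificial length bias (illustrative only)
lengths = rng.integers(50, 500, size=200).astype(float)
true_quality = rng.normal(0.0, 1.0, size=200)
rewards = true_quality + 0.004 * lengths  # longer answers spuriously score higher

# Fit a least-squares line reward ~ a * length + b, then remove the length term
a, b = np.polyfit(lengths, rewards, deg=1)
debiased = rewards - a * lengths

# Correlation with length drops to (numerically) zero after debiasing
print(np.corrcoef(lengths, rewards)[0, 1], np.corrcoef(lengths, debiased)[0, 1])
```

Since ordinary least-squares residuals are uncorrelated with the regressor, the debiased rewards carry no linear length signal while preserving the underlying quality differences.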
|
|
|
|
|
This training design allows the model to produce stable, interpretable, and context-aware reward signals, making it suitable for benchmarking decoder-only language models on Turkish legal tasks.
|
|
|
|
|
### Benchmark Evaluation |
|
|
|
|
|
The model is used to evaluate decoder-only language models under varying contextual conditions in legal text generation. The benchmark uses the newmindai/EuroHPC-Legal dataset, which consists of 116 high-quality question-answer pairs. From each reference text, the first 5, 10, 20, 50, and 100 tokens are extracted to construct five distinct context-length settings.
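The five context-length settings can be constructed by simple token-level truncation. The sketch below uses whitespace tokenization as a stand-in; the actual benchmark would use the evaluated model's tokenizer:

```python
def make_context_settings(text: str, budgets=(5, 10, 20, 50, 100)):
    """Truncate a reference text to each token budget.

    Whitespace splitting is a stand-in here; the real setup would
    tokenize with the evaluated model's tokenizer.
    """
    tokens = text.split()
    return {n: " ".join(tokens[:n]) for n in budgets}

# Dummy reference text with 120 "tokens"
reference = " ".join(f"tok{i}" for i in range(120))
settings = make_context_settings(reference)
print({n: len(v.split()) for n, v in settings.items()})
```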
|
|
|
|
|
Models evaluated include: |
|
|
- Qwen3-1.7B-Base |
|
|
- Qwen3-4B-Base |
|
|
- Mecellem-Qwen3-1.7B-TR |
|
|
- Mecellem-Qwen3-4B-TR |
|
|
|
|
|
For each evaluation instance, the reward model produces: |
|
|
- **Overall quality score (Score)** |
|
|
- **Vector of per-objective reward values (Rewards)** |
|
|
- **Set of gating outputs (Gating)** reflecting the context-dependent weighting of quality dimensions |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch |
|
|
``` |
|
|
|
|
|
### Reward Scoring |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer (the custom multi-objective head may require
# trust_remote_code=True, as with the ArmoRM reference implementation)
tokenizer = AutoTokenizer.from_pretrained("newmindai/Muhakim")
model = AutoModelForSequenceClassification.from_pretrained(
    "newmindai/Muhakim", trust_remote_code=True
)
model.eval()

# Legal question (with context) and the assistant response to be scored
messages = [
    {"role": "user", "content": "Sözleşme feshi nasıl yapılır? [Legal context here]"},
    {"role": "assistant", "content": "Sözleşme feshi yazılı bildirimle yapılabilir..."},
]

# Format the conversation with the model's chat template
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", truncation=True, max_length=2048
)

# Get the overall reward score
with torch.no_grad():
    outputs = model(input_ids)
    reward_score = outputs.logits[0].item()

print(f"Reward Score: {reward_score:.4f}")
```
|
|
|
|
|
### Multi-Objective Evaluation |
|
|
|
|
|
The model can provide detailed scores for each quality dimension: |
|
|
|
|
|
```python |
|
|
# The model outputs include: |
|
|
# - Overall score (weighted combination) |
|
|
# - Per-objective rewards (statute, accuracy, case law, coherence, depth) |
|
|
# - Gating weights (context-dependent importance) |
|
|
``` |
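Assuming the model follows an ArmoRM-style output interface with a scalar overall score, a per-objective reward vector, and gating weights, the outputs could be unpacked as below. The field names, dimension ordering, and all numeric values are illustrative assumptions, not confirmed for this checkpoint:

```python
# Mock of the assumed ArmoRM-style output fields; in practice these would
# come from `model(input_ids)` on a model loaded with trust_remote_code=True.
output = {
    "score": 0.71,                              # gating-weighted overall score
    "rewards": [0.62, 0.78, 0.41, 0.85, 0.55],  # per-objective rewards
    "gating": [0.20, 0.43, 0.08, 0.13, 0.16],   # context-dependent weights
}

# Dimension names follow the model card's ordering (assumed)
dimensions = ["statute_reference", "legal_accuracy", "case_law_reference",
              "linguistic_coherence", "depth_coverage"]
per_objective = dict(zip(dimensions, output["rewards"]))
for name, value in per_objective.items():
    print(f"{name}: {value:.2f}")
```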
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- Benchmarking decoder-only language models in legal tasks |
|
|
- Evaluating legal text generation quality |
|
|
- Context-aware assessment of legal responses |
|
|
- Multi-objective evaluation of legal text quality |
|
|
- Training legal language models with reward signals |
|
|
- Quality assessment for legal RAG systems |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
The model has been used to evaluate Turkish legal language models across different context lengths. Results show that Mecellem-Qwen3 models consistently outperform base Qwen3 models across all five legal quality objectives, with particularly pronounced gains for depth of coverage, statute reference usage, and legal accuracy. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project. |
|
|
|
|
|
The numerical calculations reported in this work were fully/partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge the know-how provided by the MINERVA Support for expert guidance and collaboration opportunities in HPC-AI integration. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{mecellem2026, |
|
|
title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain}, |
|
|
author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can}, |
|
|
journal={arXiv preprint arXiv:2601.16018}, |
|
|
year={2026}, |
|
|
month={January}, |
|
|
url={https://arxiv.org/abs/2601.16018}, |
|
|
doi={10.48550/arXiv.2601.16018}, |
|
|
eprint={2601.16018}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Base Model References |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{ArmoRM, |
|
|
title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts}, |
|
|
author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong}, |
|
|
booktitle={EMNLP}, |
|
|
year={2024} |
|
|
} |
|
|
``` |
|
|
```bibtex |
|
|
@inproceedings{wang2024arithmetic, |
|
|
title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards}, |
|
|
author={Wang, Haoxiang and Lin, Yong and Xiong, Wei and Yang, Rui and Diao, Shizhe and Qiu, Shuang and Zhao, Han and Zhang, Tong}, |
|
|
year={2024}, |
|
|
booktitle={ACL} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 License.
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions: [info@newmind.ai](mailto:info@newmind.ai) |
|
|
|