---
language:
- tr
- en
license: apache-2.0
tags:
- reward-model
- turkish
- legal
- turkish-legal
- mecellem
- armo
- reward
- evaluation
- TRUBA
- MN5
base_model: Skywork/Skywork-Reward-Llama-3.1-8B-v0.2
pipeline_tag: text-classification
datasets:
- newmindai/armo-ultrafeedback-dataset
- newmindai/armo-pair-dataset
- newmindai/armo-dataset
---
# Muhakim (ArmoRM-Turkish-Legal)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
## Model Description
Muhakim (ArmoRM-Turkish-Legal) is a domain-specific, multi-objective reward model trained for Turkish legal text assessment. Built on the Skywork-Reward-Llama-3.1-8B-v0.2 backbone (8B parameters) and augmented with a mixture-of-experts gating mechanism, the model produces fine-grained quality scores across five legally grounded dimensions. The training pipeline consists of three components: (i) multi-objective supervision that enables independent learning of five legal quality dimensions, (ii) preference-based training of a mixture-of-experts gating network to capture the context-dependent importance of these dimensions, and (iii) a debiasing stage that mitigates length-related reward artifacts.
**Key Features:**
- Multi-objective reward model with five legal quality dimensions
- Context-aware evaluation through mixture-of-experts gating mechanism
- Trained for benchmarking decoder-only language models in Turkish legal tasks
- Evaluates quality across: statute reference, legal accuracy, case law reference, linguistic coherence, and depth coverage
**Model Type:** Reward Model
**Parameters:** 8B
**Base Model:** Skywork/Skywork-Reward-Llama-3.1-8B-v0.2
**Architecture:** Llama-3.1-based reward backbone with MoE gating
### Architecture Details
The Muhakim reward model employs a multi-objective framework distinguishing input-dependent and output-dependent components:
**1. Gating Mechanism (Input-Dependent):**
- Operates in a prompt-conditioned manner
- Dynamically adjusts evaluation priorities based on legal domain or question type
- Mixture-of-experts (MoE) layer outputs non-negative coefficients summing to 1
- Determines how much weight each reward objective should receive
**2. Reward Prediction (Output-Dependent):**
- Multi-objective reward predictions from ArmoRM's regression layer
- Represents model performance on each objective
- Assesses the quality of the generated response
**3. Final Score:**
- Score = Σ(gating[i] × transformed_rewards[i])
- Context-aware evaluation that adapts importance weights based on the legal question
- Assesses response quality across multiple dimensions
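The weighted combination above can be sketched as follows. All reward and gating values here are invented for illustration; softmax is used as a standard way to obtain the non-negative, sum-to-one coefficients the gating layer is described as producing (the actual transform applied to the rewards may differ):

```python
import torch

# Hypothetical per-objective rewards for one response (output-dependent):
# statute reference, legal accuracy, case law reference, linguistic coherence, depth coverage
rewards = torch.tensor([0.62, 0.71, 0.40, 0.85, 0.55])

# Hypothetical gating logits computed from the prompt (input-dependent)
gating_logits = torch.tensor([1.2, 2.0, 0.1, 0.5, 0.8])

# Softmax yields non-negative coefficients that sum to 1
gating = torch.softmax(gating_logits, dim=-1)

# Final score: gating-weighted sum of the per-objective rewards
score = (gating * rewards).sum()
print(f"gating weights: {[round(g, 3) for g in gating.tolist()]}")
print(f"final score: {score.item():.4f}")
```

Because the weights sum to 1, the final score always lies between the smallest and largest per-objective reward, which keeps the combined signal on the same scale as the individual objectives.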
### Training Pipeline
<table width="100%">
<tr>
<td align="center" width="100%">
<img
src="https://huggingface.co/newmindai/Muhakim/resolve/main/muhakim_avatar.png"
width="100%">
<br>
<em>Muhakim Model Training Pipeline</em>
</td>
</tr>
</table>

![Muhakim Training Pipeline](muhakim-pipeline.png)
*Muhakim model training pipeline: multi-objective supervision, preference-based gating training, and length debiasing.*

The training pipeline consists of three components:
1. **Multi-objective Supervision:** Enables independent learning of the five legal quality dimensions
2. **Preference-based Training:** Trains a mixture-of-experts gating network to capture the context-dependent importance of these dimensions
3. **Debiasing Stage:** Mitigates length-related reward artifacts

This design allows the model to produce stable, interpretable, and context-aware reward signals, making it suitable for benchmarking decoder-only language models in Turkish legal tasks.
### Quality Dimensions
The model evaluates five legal quality dimensions:
1. **Statute Reference:** Accuracy of legal statute citations
2. **Legal Accuracy:** Correctness of legal information
3. **Case Law Reference:** Proper citation of legal precedents
4. **Linguistic Coherence:** Language quality and fluency
5. **Depth Coverage:** Comprehensiveness of the response
### Benchmark Evaluation
The model is used to evaluate decoder-only language models under varying contextual conditions in legal text generation. The benchmark uses the newmindai/EuroHPC-Legal dataset, which consists of 116 high-quality question-answer pairs. From each reference text, the first 5, 10, 20, 50, and 100 tokens are extracted to construct five distinct context-length settings.
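The prefix construction can be sketched as below; whitespace tokenization is used here purely for illustration (the benchmark presumably uses a model tokenizer, and the sample text is invented):

```python
def build_contexts(reference_text: str, lengths=(5, 10, 20, 50, 100)):
    """Return the first-N-token prefix of a reference text for each context-length setting.

    Whitespace tokens are an illustrative stand-in for real tokenizer output.
    """
    tokens = reference_text.split()
    return {n: " ".join(tokens[:n]) for n in lengths}

# Hypothetical reference text
reference = "Türk Borçlar Kanunu uyarınca sözleşmenin feshi yazılı bildirim ile yapılır"
for n, prefix in build_contexts(reference).items():
    print(f"{n:>3} tokens: {prefix}")
```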
Models evaluated include:
- Qwen3-1.7B-Base
- Qwen3-4B-Base
- Mecellem-Qwen3-1.7B-TR
- Mecellem-Qwen3-4B-TR
For each evaluation instance, the reward model produces:
- **Overall quality score (Score)**
- **Vector of per-objective reward values (Rewards)**
- **Set of gating outputs (Gating)** reflecting the context-dependent weighting of quality dimensions
## Usage
### Installation
```bash
pip install transformers torch
```
### Reward Scoring
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("newmindai/Muhakim")
model = AutoModelForSequenceClassification.from_pretrained("newmindai/Muhakim")
model.eval()

# Example: user message (legal question + context) and assistant response
user_message = "Sözleşme feshi nasıl yapılır? [Legal context here]"  # "How is a contract terminated?"
assistant_response = "Sözleşme feshi yazılı bildirimle yapılabilir..."  # "Termination can be made by written notice..."

# Format for the reward model (conversational format)
text = f"User: {user_message}\nAssistant: {assistant_response}"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)

# Compute the scalar reward score
with torch.no_grad():
    outputs = model(**inputs)
    reward_score = outputs.logits.item()

print(f"Reward Score: {reward_score:.4f}")
```
### Multi-Objective Evaluation
The model can provide detailed scores for each quality dimension:
```python
# The model outputs include:
# - Overall score (weighted combination)
# - Per-objective rewards (statute, accuracy, case law, coherence, depth)
# - Gating weights (context-dependent importance)
```
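As an illustration of how these outputs fit together, the sketch below pairs per-objective rewards and gating weights with the five dimension names. All values and the `label_outputs` helper are invented for this example; consult the files in the model repository for the actual output structure:

```python
DIMENSIONS = ["statute_reference", "legal_accuracy", "case_law_reference",
              "linguistic_coherence", "depth_coverage"]

def label_outputs(rewards, gating):
    """Pair raw per-objective rewards and gating weights with their dimension names."""
    return {
        name: {"reward": r, "gating": g}
        for name, r, g in zip(DIMENSIONS, rewards, gating)
    }

# Hypothetical values for one evaluation instance
report = label_outputs(
    rewards=[0.62, 0.71, 0.40, 0.85, 0.55],
    gating=[0.25, 0.35, 0.10, 0.12, 0.18],
)

# Overall score is the gating-weighted sum of per-objective rewards
overall = sum(v["reward"] * v["gating"] for v in report.values())
print(report)
print(f"overall score: {overall:.4f}")
```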
## Use Cases
- Benchmarking decoder-only language models in legal tasks
- Evaluating legal text generation quality
- Context-aware assessment of legal responses
- Multi-objective evaluation of legal text quality
- Training legal language models with reward signals
- Quality assessment for legal RAG systems
## Evaluation Results
The model has been used to evaluate Turkish legal language models across different context lengths. Results show that Mecellem-Qwen3 models consistently outperform base Qwen3 models across all five legal quality objectives, with particularly pronounced gains for depth of coverage, statute reference usage, and legal accuracy.
## Acknowledgments
This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.
The numerical calculations reported in this work were fully/partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors also gratefully acknowledge the MINERVA support team for expert guidance and collaboration opportunities in HPC-AI integration.
## Citation
```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
### Base Model References
```bibtex
@inproceedings{ArmoRM,
  title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
  author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
  booktitle={EMNLP},
  year={2024}
}
```
```bibtex
@inproceedings{wang2024arithmetic,
  title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards},
  author={Wang, Haoxiang and Lin, Yong and Xiong, Wei and Yang, Rui and Diao, Shizhe and Qiu, Shuang and Zhao, Han and Zhang, Tong},
  booktitle={ACL},
  year={2024}
}
```
## License
This model is released under the Apache 2.0 License.
## Contact
For questions: [info@newmind.ai](mailto:info@newmind.ai)