	---
	title: README
	emoji: 🐠
	colorFrom: purple
	colorTo: purple
	sdk: static
	pinned: true
	license: apache-2.0
	---

	<div align="center">

	# Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

	[![arXiv](https://img.shields.io/badge/arXiv-2605.00754-b31b1b.svg)](https://arxiv.org/abs/2605.00754)
	[![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-Themis--RM-yellow)](https://huggingface.co/collections/project-themis/themis-reward-model-collection)
	[![Datasets & Benchmarks](https://img.shields.io/badge/%F0%9F%A4%97%20Datasets%20%26%20Benchmarks-Themis-blue)](https://huggingface.co/collections/project-themis/themis-preference-datasets-and-benchmarks)
	[![GitHub](https://img.shields.io/badge/GitHub-Themis-181717?logo=github)](https://github.com/iNeil77/Themis)
	[![Docker](https://img.shields.io/badge/Docker-ineil77%2Fthemis-2496ED?logo=docker)](https://hub.docker.com/repository/docker/ineil77/themis/general)

	</div>

	> Abstract:
	>
	> Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
	>

	Themis reward models are trained with a Bradley-Terry preference objective on data from a multi-stage pipeline that mines, filters, scores, and assembles code preference pairs from open-source repositories. The models are evaluated on Themis-CodeRewardBench (CRB), a benchmark of 8,866 preference pairs spanning 5 quality aspects and 8 programming languages.
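As a rough illustration of the Bradley-Terry objective mentioned above, the sketch below computes the pairwise loss from scalar scores for chosen and rejected responses. The function names are hypothetical, the scores stand in for reward model outputs, and the extra terms used in Themis training (LM regularisation, magnitude penalty) are deliberately left out.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_probability(chosen_score, rejected_score):
    """Bradley-Terry probability that the chosen response wins: sigmoid(margin)."""
    return math.exp(-softplus(-(chosen_score - rejected_score)))


def bradley_terry_loss(chosen_scores, rejected_scores):
    """Mean negative log-likelihood of the observed preferences.

    -log(sigmoid(r_chosen - r_rejected)) equals softplus(r_rejected - r_chosen).
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("each chosen score needs a matching rejected score")
    losses = [softplus(r - c) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)
```

With equal scores the loss is `log(2)`, and it shrinks as the chosen response is scored further above the rejected one, for example `bradley_terry_loss([2.0], [0.0]) < bradley_terry_loss([0.0], [0.0])`.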

	## Pipeline Overview

	The end-to-end pipeline has three phases: dataset construction, model training, and evaluation.

	```
	DATASET CONSTRUCTION
	────────────────────
	BigQuery (github_repos)
	           │
	           ▼
	┌─────────────────────┐   ┌───────────────────┐   ┌──────────────────┐
	│  1. Commit Mining   │──▶│ 2. Repo Filtering │──▶│ 3. Ext Filtering │
	│        (SQL)        │   │   (allowlists)    │   │   (lang → ext)   │
	└─────────────────────┘   └───────────────────┘   └──────────────────┘
	                                                           │
	           ┌───────────────────────────────────────────────┘
	           ▼
	┌──────────────────────┐   ┌──────────────────┐   ┌──────────────────┐
	│ 4. Content Retrieval │──▶│ 5. Deduplication │──▶│ 6. Aspect Filter │
	│     (git fetch)      │   │  (MinHash LSH)   │   │   (ModernBERT)   │
	└──────────────────────┘   └──────────────────┘   └──────────────────┘
	                                                           │
	           ┌───────────────────────────────────────────────┘
	           ▼
	┌──────────────────────┐   ┌──────────────────┐   ┌──────────────────┐
	│ 7. LLM Scoring &     │──▶│ 8. LLM-as-a-Judge│──▶│ 9. Training Data │
	│    Instruction Synth │   │   (A/B voting)   │   │     Assembly     │
	└──────────────────────┘   └──────────────────┘   └──────────────────┘
	                                                           │
	MODEL TRAINING                                             │
	──────────────                                             │
	                                  ┌────────────────────────┘
	                                  ▼
	┌────────────────────────────────────────────────────────────────────┐
	│ Bradley-Terry preference training with FSDP2 on multi-node GPUs    │
	│ (BT loss + LM regularisation + magnitude penalty, Liger kernels)   │
	└─────────────────────────────────┬──────────────────────────────────┘
	                                  │
	EVALUATION                        │
	──────────                        │
	                                  ▼
	┌────────────────────────────────────────────────────────────────────┐
	│ Code RewardBench: 8,866 pairs × 5 aspects × 8 languages            │
	│ Evaluated across scalar, MoE, and generative RM architectures      │
	└────────────────────────────────────────────────────────────────────┘
	```
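To make step 5 of the diagram (MinHash LSH deduplication) more concrete, here is a minimal standard-library sketch of the general technique. It is not the project's implementation: the shingle size, number of hash functions, band count, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of overlapping k-token windows from whitespace-split text."""
    tokens = text.split()
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash value of the set under each seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Bucket documents by signature band; documents sharing a bucket are candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

In a deduplication pass, candidate pairs returned by `lsh_candidate_pairs` would typically be confirmed with `estimated_jaccard` against a similarity threshold before one copy is dropped. The threshold is another assumption, since the README does not state one.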

	## Results

	Themis-RM models achieve best-in-class accuracy on [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench), a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; bold marks the best in each group.

	| Model | [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench) | [RewardBench V1](https://huggingface.co/datasets/allenai/reward-bench) | [RewardBench V2](https://huggingface.co/datasets/allenai/reward-bench-v2) | [JudgeBench](https://huggingface.co/datasets/ScalerLab/JudgeBench) |
	|---|---|---|---|---|
	| | | | | |
	| 32B – 72B Class | | | | |
	| [WorldPM-72B](https://huggingface.co/Qwen/WorldPM-72B-RLHFLow) | 76.96 | 90.88 | 67.92 | 55.21 |
	| [Athene-RM-70B](https://huggingface.co/Nexusflow/Athene-RM-70B) | 78.39 | 91.22 | 68.76 | 63.45 |
	| [Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.3-Nemotron-70B-Reward) | 81.19 | 93.88 | 70.49 | **73.47** |
	| [Themis-RM-32B](https://huggingface.co/project-themis/Themis-RM-32B) | **91.82** | **94.89** | **72.34** | 71.65 |
	| [AceCodeRM-32B](https://huggingface.co/TIGER-Lab/AceCodeRM-32B) | 62.95 | 23.58 | 67.98 | 66.77 |
	| | | | | |
	| 7B – 14B Class | | | | |
	| [Themis-RM-14B](https://huggingface.co/project-themis/Themis-RM-14B) | **91.19** | 94.11 | 71.44 | **70.85** |
	| [Themis-RM-8B](https://huggingface.co/project-themis/Themis-RM-8B) | 89.78 | 93.69 | 65.87 | 69.97 |
	| [Athene-RM-8B](https://huggingface.co/Nexusflow/Athene-RM-8B) | 76.58 | 87.48 | 62.96 | 61.12 |
	| [CodeScaler-8B](https://huggingface.co/LARK-Lab/CodeScaler-8B) | 79.12 | 94.66 | 76.51 | 70.05 |
	| [Skywork-Reward-V2-8B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-8B) | 79.97 | **94.76** | **76.93** | 67.90 |
	| [AceCodeRM-7B](https://huggingface.co/TIGER-Lab/AceCodeRM-7B) | 71.11 | 22.74 | 63.16 | 61.09 |
	| | | | | |
	| 0.6B – 4B Class | | | | |
	| [Themis-RM-4B](https://huggingface.co/project-themis/Themis-RM-4B) | **88.39** | 92.46 | 63.81 | 68.02 |
	| [CodeScaler-4B](https://huggingface.co/LARK-Lab/CodeScaler-4B) | 77.97 | **94.32** | **75.13** | **68.44** |
	| [Skywork-Reward-V2-4B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-4B) | 79.27 | 94.06 | 74.26 | 65.43 |
	| [Themis-RM-1.7B](https://huggingface.co/project-themis/Themis-RM-1.7B) | 83.04 | 89.17 | 56.22 | 63.29 |
	| [CodeScaler-1.7B](https://huggingface.co/LARK-Lab/CodeScaler-1.7B) | 73.75 | 91.13 | 68.44 | 66.17 |
	| [Skywork-Reward-V2-1.7B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-1.7B) | 75.60 | 91.64 | 67.71 | 66.48 |
	| [Themis-RM-0.6B](https://huggingface.co/project-themis/Themis-RM-0.6B) | 79.26 | 83.41 | 49.61 | 63.84 |
	| [Skywork-Reward-V2-0.6B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-0.6B) | 72.77 | 86.32 | 60.83 | 63.65 |
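For readers who want to compute a comparable number for their own reward model, the sketch below shows one common way to score preference pairs: the percentage of pairs in which the chosen response receives the higher score. This is an assumption about the metric, since the README does not spell out tie handling or how subsets are averaged. The record field names (`chosen_score`, `rejected_score`, `language`) are hypothetical.

```python
from collections import defaultdict


def pairwise_accuracy(score_pairs):
    """Percentage of (chosen, rejected) score pairs ranked correctly.

    Ties are counted as errors here, which is an assumption.
    """
    score_pairs = list(score_pairs)
    if not score_pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return 100.0 * correct / len(score_pairs)


def accuracy_by_group(records, key):
    """Break accuracy down by a metadata field such as language or aspect."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(
            (record["chosen_score"], record["rejected_score"])
        )
    return {group: pairwise_accuracy(pairs) for group, pairs in grouped.items()}
```

For example, `pairwise_accuracy([(1.0, 0.0), (0.2, 0.5), (0.3, 0.3), (2.0, 1.0)])` ranks two of the four pairs correctly.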

	## Datasets

	All datasets are available on Hugging Face:

	| Dataset | Description | Samples |
	|---|---|---|
	| [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench) | Code RM evaluation benchmark: 5 quality dimensions, 8 languages, 19 source subsets | 8,866 |
	| [Themis-CodePreference](https://huggingface.co/datasets/project-themis/Themis-CodePreference) | Training data for the PM stage: code preferences across 5 criteria and 8 languages | 354,010 |
	| [Themis-GeneralPreference](https://huggingface.co/datasets/project-themis/Themis-GeneralPreference) | Training data for the PT stage: general-domain and code retrieval preferences | 110,598 |
	| [Themis-Git-Commits-Merged](https://huggingface.co/datasets/project-themis/git-commits-merged) | Single-file commits from merged PRs across 24 languages (intermediate, pre-classification) | ~8M |
	| [Themis-Git-Commits](https://huggingface.co/datasets/project-themis/git-commits) | Raw mined single-file commits from permissively licensed repos (full unfiltered pool) | ~28M |
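A minimal sketch of inspecting these datasets with the Hugging Face `datasets` library follows. The repository IDs are taken from the links in the table. The helper name `peek` and the short keys in `REPO_IDS` are hypothetical, and the sketch assumes each repository loads without an explicit configuration name. Split and column names are not documented here, so the code only reports what it finds.

```python
REPO_IDS = {
    "benchmark": "project-themis/Themis-CodeRewardBench",
    "code_preferences": "project-themis/Themis-CodePreference",
    "general_preferences": "project-themis/Themis-GeneralPreference",
}


def peek(name):
    """Return {split_name: sorted column names of the first record}.

    Streaming avoids downloading a full dataset just to look at its schema.
    """
    # Imported lazily so the module can be loaded without the dependency;
    # requires `pip install datasets` and network access when called.
    from datasets import load_dataset

    splits = load_dataset(REPO_IDS[name], streaming=True)
    summary = {}
    for split_name, split in splits.items():
        first_record = next(iter(split))
        summary[split_name] = sorted(first_record.keys())
    return summary
```

Calling `peek("benchmark")` would list the available splits and their fields. If a repository defines several configurations, `load_dataset` will ask for one by name.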

	## Related Work

	[Distributed Training Tutorial](https://github.com/iNeil77/AWS_DistTraining_Tutorial): our companion tutorial on multi-node distributed training of scalar reward models on cloud GPU clusters. It covers cluster provisioning, high-speed networking, container management, and FSDP-based training, and follows a simplified recipe that uses the Axolotl framework to train reward models with the Bradley-Terry loss. It works as a standalone guide for anyone reproducing the Themis training setup or adapting it to their own reward modelling workloads.

	## Citation

	```bibtex
	@article{themis2025,
	  title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring},
	  author={Paul, Indraneil and Gurevych, Iryna and Glava\v{s}, Goran},
	  journal={arXiv preprint arXiv:2605.00754},
	  year={2026}
	}
	```