--- title: README emoji: 🐠 colorFrom: purple colorTo: purple sdk: static pinned: true license: apache-2.0 ---
# Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring [![arXiv](https://img.shields.io/badge/arXiv-2605.00754-b31b1b.svg)](https://arxiv.org/abs/2605.00754) [![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-Themis--RM-yellow)](https://huggingface.co/collections/project-themis/themis-reward-model-collection) [![Datasets & Benchmarks](https://img.shields.io/badge/%F0%9F%A4%97%20Datasets%20%26%20Benchmarks-Themis-blue)](https://huggingface.co/collections/project-themis/themis-preference-datasets-and-benchmarks) [![GitHub](https://img.shields.io/badge/GitHub-Themis-181717?logo=github)](https://github.com/iNeil77/Themis) [![Docker](https://img.shields.io/badge/Docker-ineil77%2Fthemis-2496ED?logo=docker)](https://hub.docker.com/repository/docker/ineil77/themis/general)
> **Abstract:** > > Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling. > Themis reward models are trained using the Bradley-Terry preference framework with a multi-stage data pipeline that mines, filters, scores, and assembles high-quality code preference pairs from open-source repositories. The models are evaluated on Code RewardBench (CRB), a benchmark of 8,866 preference pairs spanning 5 quality aspects and 8 programming languages. ## Pipeline Overview The end-to-end pipeline has three phases: dataset construction, model training, and evaluation. ``` DATASET CONSTRUCTION ──────────────────── BigQuery (github_repos) β”‚ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ 1. Commit Mining │──▢│ 2. Repo Filtering │──▢│ 3. Ext Filtering β”‚ β”‚ (SQL) β”‚ β”‚ (allowlists) β”‚ β”‚ (lang β†’ ext) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ 4. Content Retrieval │──▢│ 5. Deduplication │──▢│ 6. Aspect Filter β”‚ β”‚ (git fetch) β”‚ β”‚ (MinHash LSH) β”‚ β”‚ (ModernBERT) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ 7. LLM Scoring & β”‚-─▢│ 8. LLM-as-a-Judge│──▢│ 9. Training Data β”‚ β”‚ Instruction Synth β”‚ β”‚ (A/B voting) β”‚ β”‚ Assembly β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ MODEL TRAINING β”‚ ────────────── β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Bradley-Terry preference training with FSDP2 on multi-node GPUs β”‚ β”‚ (BT loss + LM regularisation + magnitude penalty, Liger kernels) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ EVALUATION β”‚ ────────── β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–Ό β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Code RewardBench: 8,866 pairs Γ— 5 aspects Γ— 8 languages β”‚ β”‚ Evaluated across scalar, MoE, and generative RM architectures β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` ## Results Themis-RM models achieve best-in-class accuracy on [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench), a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; **bold** marks the best in each group. | Model | [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench) | [RewardBench V1](https://huggingface.co/datasets/allenai/reward-bench) | [RewardBench V2](https://huggingface.co/datasets/allenai/reward-bench-v2) | [JudgeBench](https://huggingface.co/datasets/ScalerLab/JudgeBench) | |---|---|---|---|---| | | | | | | | **32B - 72B Class** | | | | | | [WorldPM-72B](https://huggingface.co/Qwen/WorldPM-72B-RLHFLow) | 76.96 | 90.88 | 67.92 | 55.21 | | [Athene-RM-70B](https://huggingface.co/Nexusflow/Athene-RM-70B) | 78.39 | 91.22 | 68.76 | 63.45 | | [Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.3-Nemotron-70B-Reward) | 81.19 | 93.88 | 70.49 | **73.47** | | **[Themis-RM-32B](https://huggingface.co/project-themis/Themis-RM-32B)** | **91.82** | **94.89** | **72.34** | 71.65 | | [AceCodeRM-32B](https://huggingface.co/TIGER-Lab/AceCodeRM-32B) | 62.95 | 23.58 | 67.98 | 66.77 | | | | | | | | **7B – 14B Class** | | | | | | **[Themis-RM-14B](https://huggingface.co/project-themis/Themis-RM-14B)** | **91.19** | 94.11 | 71.44 | **70.85** | | **[Themis-RM-8B](https://huggingface.co/project-themis/Themis-RM-8B)** | 89.78 | 93.69 | 65.87 | 69.97 | | [Athene-RM-8B](https://huggingface.co/Nexusflow/Athene-RM-8B) | 76.58 | 87.48 | 62.96 | 61.12 | | [CodeScaler-8B](https://huggingface.co/LARK-Lab/CodeScaler-8B) | 79.12 | 94.66 | 76.51 | 70.05 | | [Skywork-Reward-V2-8B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-8B) | 79.97 | **94.76** | **76.93** | 67.90 | | [AceCodeRM-7B](https://huggingface.co/TIGER-Lab/AceCodeRM-7B) | 71.11 | 22.74 | 63.16 | 61.09 | | | | | | | | **0.6B - 4B Class** | | | | | | **[Themis-RM-4B](https://huggingface.co/project-themis/Themis-RM-4B)** | **88.39** | 92.46 | 63.81 | 68.02 | | [CodeScaler-4B](https://huggingface.co/LARK-Lab/CodeScaler-4B) | 77.97 | **94.32** | **75.13** | **68.44** | | [Skywork-Reward-V2-4B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-4B) | 79.27 | 94.06 | 74.26 | 65.43 | | **[Themis-RM-1.7B](https://huggingface.co/project-themis/Themis-RM-1.7B)** | 83.04 | 89.17 | 56.22 | 63.29 | | [CodeScaler-1.7B](https://huggingface.co/LARK-Lab/CodeScaler-1.7B) | 73.75 | 91.13 | 68.44 | 66.17 | | [Skywork-Reward-V2-1.7B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-1.7B) | 75.60 | 91.64 | 67.71 | 66.48 | | **[Themis-RM-0.6B](https://huggingface.co/project-themis/Themis-RM-0.6B)** | 79.26 | 83.41 | 49.61 | 63.84 | | [Skywork-Reward-V2-0.6B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-0.6B) | 72.77 | 86.32 | 60.83 | 63.65 | ## Datasets All datasets are available on HuggingFace: | Dataset | Description | Samples | |---|---|---| | [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench) | Code RM evaluation benchmark: 5 quality dimensions, 8 languages, 19 source subsets | 8,866 | | [Themis-CodePreference](https://huggingface.co/datasets/project-themis/Themis-CodePreference) | Training data for the PM stage: code preferences across 5 criteria and 8 languages | 354,010 | | [Themis-GeneralPreference](https://huggingface.co/datasets/project-themis/Themis-GeneralPreference) | Training data for the PT stage: general-domain and code retrieval preferences | 110,598 | | [Themis-Git-Commits-Merged](https://huggingface.co/datasets/project-themis/git-commits-merged) | Single-file commits from merged PRs across 24 languages (intermediate, pre-classification) | ~8M | | [Themis-Git-Commits](https://huggingface.co/datasets/project-themis/git-commits) | Raw mined single-file commits from permissively licensed repos (full unfiltered pool) | ~28M | ## Related Work **[Distributed Training Tutorial](https://github.com/iNeil77/AWS_DistTraining_Tutorial)** β€” A companion tutorial by us that walks through multi-node distributed training of scalar reward models on cloud GPU clusters. Covers cluster provisioning, high-speed networking, container management, and FSDP-based training. Useful as a standalone guide for anyone looking to reproduce the Themis training setup or adapt it to their own reward modelling workloads. Follows a simplified recipe that leverages the Axolotl framework for training reward models with the Bradley-Terry loss. ## Citation ```bibtex @article{themis2025, title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring}, author={Paul, Indraneil and Gurevych, Iryna and Glava\v{s}, Goran}, journal={arXiv preprint arXiv:2605.00754}, year={2025} } ```