File size: 12,010 Bytes
55945a0
 
 
 
3dbd93d
55945a0
3dbd93d
6132cc5
55945a0
 
6132cc5
 
 
 
7a7ebb5
6132cc5
 
40c7b0c
3dbd93d
6132cc5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40c7b0c
 
6132cc5
 
 
 
 
 
 
 
 
 
 
ba953b5
6132cc5
 
40c7b0c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
---
title: README
emoji: 🐠
colorFrom: purple
colorTo: purple
sdk: static
pinned: true
license: apache-2.0
---

<div align="center">

# Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

[![arXiv](https://img.shields.io/badge/arXiv-2605.00754-b31b1b.svg)](https://arxiv.org/abs/2605.00754)
[![Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-Themis--RM-yellow)](https://huggingface.co/collections/project-themis/themis-reward-model-collection)
[![Datasets & Benchmarks](https://img.shields.io/badge/%F0%9F%A4%97%20Datasets%20%26%20Benchmarks-Themis-blue)](https://huggingface.co/collections/project-themis/themis-preference-datasets-and-benchmarks)
[![GitHub](https://img.shields.io/badge/GitHub-Themis-181717?logo=github)](https://github.com/iNeil77/Themis)
[![Docker](https://img.shields.io/badge/Docker-ineil77%2Fthemis-2496ED?logo=docker)](https://hub.docker.com/repository/docker/ineil77/themis/general)

</div>

> **Abstract:**
>
> Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
>

Themis reward models are trained using the Bradley-Terry preference framework with a multi-stage data pipeline that mines, filters, scores, and assembles high-quality code preference pairs from open-source repositories. The models are evaluated on Code RewardBench (CRB), a benchmark of 8,866 preference pairs spanning 5 quality aspects and 8 programming languages.

## Pipeline Overview

The end-to-end pipeline has three phases: dataset construction, model training, and evaluation.

```
                          DATASET CONSTRUCTION
                          ────────────────────
  BigQuery (github_repos)
      β”‚
      β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 1. Commit Mining    │──▢│ 2. Repo Filtering │──▢│ 3. Ext Filtering β”‚
  β”‚    (SQL)            β”‚   β”‚    (allowlists)   β”‚   β”‚    (lang β†’ ext)  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                            β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 4. Content Retrieval │──▢│ 5. Deduplication │──▢│ 6. Aspect Filter β”‚
  β”‚    (git fetch)       β”‚   β”‚    (MinHash LSH) β”‚   β”‚    (ModernBERT)  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                            β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ 7. LLM Scoring &     β”‚-─▢│ 8. LLM-as-a-Judge│──▢│ 9. Training Data β”‚
  β”‚    Instruction Synth β”‚   β”‚    (A/B voting)  β”‚   β”‚    Assembly      β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                            β”‚
                          MODEL TRAINING                    β”‚
                          ──────────────                    β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Bradley-Terry preference training with FSDP2 on multi-node GPUs   β”‚
  β”‚ (BT loss + LM regularisation + magnitude penalty, Liger kernels)  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                      β”‚
                          EVALUATION  β”‚
                          ──────────  β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
      β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Code RewardBench: 8,866 pairs Γ— 5 aspects Γ— 8 languages           β”‚
  β”‚ Evaluated across scalar, MoE, and generative RM architectures     β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Results

Themis-RM models achieve best-in-class accuracy on [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench), a code-specific reward model benchmark, while also matching or exceeding much larger models on established general-domain benchmarks (RewardBench V1, RewardBench V2, JudgeBench). Models are grouped by parameter class; **bold** marks the best in each group.

| Model | [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench) | [RewardBench V1](https://huggingface.co/datasets/allenai/reward-bench) | [RewardBench V2](https://huggingface.co/datasets/allenai/reward-bench-v2) | [JudgeBench](https://huggingface.co/datasets/ScalerLab/JudgeBench) |
|---|---|---|---|---|
| | | | | |
| **32B - 72B Class** | | | | |
| [WorldPM-72B](https://huggingface.co/Qwen/WorldPM-72B-RLHFLow) | 76.96 | 90.88 | 67.92 | 55.21 |
| [Athene-RM-70B](https://huggingface.co/Nexusflow/Athene-RM-70B) | 78.39 | 91.22 | 68.76 | 63.45 |
| [Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.3-Nemotron-70B-Reward) | 81.19 | 93.88 | 70.49 | **73.47** |
| **[Themis-RM-32B](https://huggingface.co/project-themis/Themis-RM-32B)** | **91.82** | **94.89** | **72.34** | 71.65 |
| [AceCodeRM-32B](https://huggingface.co/TIGER-Lab/AceCodeRM-32B) | 62.95 | 23.58 | 67.98 | 66.77 |
| | | | | |
| **7B – 14B Class** | | | | |
| **[Themis-RM-14B](https://huggingface.co/project-themis/Themis-RM-14B)** | **91.19** | 94.11 | 71.44 | **70.85** |
| **[Themis-RM-8B](https://huggingface.co/project-themis/Themis-RM-8B)** | 89.78 | 93.69 | 65.87 | 69.97 |
| [Athene-RM-8B](https://huggingface.co/Nexusflow/Athene-RM-8B) | 76.58 | 87.48 | 62.96 | 61.12 |
| [CodeScaler-8B](https://huggingface.co/LARK-Lab/CodeScaler-8B) | 79.12 | 94.66 | 76.51 | 70.05 |
| [Skywork-Reward-V2-8B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-8B) | 79.97 | **94.76** | **76.93** | 67.90 |
| [AceCodeRM-7B](https://huggingface.co/TIGER-Lab/AceCodeRM-7B) | 71.11 | 22.74 | 63.16 | 61.09 |
| | | | | |
| **0.6B - 4B Class** | | | | |
| **[Themis-RM-4B](https://huggingface.co/project-themis/Themis-RM-4B)** | **88.39** | 92.46 | 63.81 | 68.02 |
| [CodeScaler-4B](https://huggingface.co/LARK-Lab/CodeScaler-4B) | 77.97 | **94.32** | **75.13** | **68.44** |
| [Skywork-Reward-V2-4B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-4B) | 79.27 | 94.06 | 74.26 | 65.43 |
| **[Themis-RM-1.7B](https://huggingface.co/project-themis/Themis-RM-1.7B)** | 83.04 | 89.17 | 56.22 | 63.29 |
| [CodeScaler-1.7B](https://huggingface.co/LARK-Lab/CodeScaler-1.7B) | 73.75 | 91.13 | 68.44 | 66.17 |
| [Skywork-Reward-V2-1.7B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-1.7B) | 75.60 | 91.64 | 67.71 | 66.48 |
| **[Themis-RM-0.6B](https://huggingface.co/project-themis/Themis-RM-0.6B)** | 79.26 | 83.41 | 49.61 | 63.84 |
| [Skywork-Reward-V2-0.6B](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-0.6B) | 72.77 | 86.32 | 60.83 | 63.65 |

## Datasets

All datasets are available on HuggingFace:

| Dataset | Description | Samples |
|---|---|---|
| [Themis-CodeRewardBench](https://huggingface.co/datasets/project-themis/Themis-CodeRewardBench) | Code RM evaluation benchmark: 5 quality dimensions, 8 languages, 19 source subsets | 8,866 |
| [Themis-CodePreference](https://huggingface.co/datasets/project-themis/Themis-CodePreference) | Training data for the PM stage: code preferences across 5 criteria and 8 languages | 354,010 |
| [Themis-GeneralPreference](https://huggingface.co/datasets/project-themis/Themis-GeneralPreference) | Training data for the PT stage: general-domain and code retrieval preferences | 110,598 |
| [Themis-Git-Commits-Merged](https://huggingface.co/datasets/project-themis/git-commits-merged) | Single-file commits from merged PRs across 24 languages (intermediate, pre-classification) | ~8M |
| [Themis-Git-Commits](https://huggingface.co/datasets/project-themis/git-commits) | Raw mined single-file commits from permissively licensed repos (full unfiltered pool) | ~28M |

## Related Work

**[Distributed Training Tutorial](https://github.com/iNeil77/AWS_DistTraining_Tutorial)** β€” A companion tutorial by us that walks through multi-node distributed training of scalar reward models on cloud GPU clusters. Covers cluster provisioning, high-speed networking, container management, and FSDP-based training. Useful as a standalone guide for anyone looking to reproduce the Themis training setup or adapt it to their own reward modelling workloads. Follows a simplified recipe that leverages the Axolotl framework for training reward models with the Bradley-Terry loss.

## Citation

```bibtex
@article{themis2025,
  title={Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring},
  author={Paul, Indraneil and Gurevych, Iryna and Glava\v{s}, Goran},
  journal={arXiv preprint arXiv:2605.00754},
  year={2025}
}
```