zhangshuyi.0109 committed
Commit 56cebd7 · 1 Parent(s): 6e02de0
update citation & evaluation

Files changed:
- README.md (+40 −13)
- modeling_sarm_gemma2.py (+0 −475)
README.md
CHANGED (+40 −13); the updated file follows, with regions unchanged in this commit elided as in the diff.
---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- reward-model
- rlhf
# ... (unchanged metadata elided; it includes pipeline_tag: reinforcement-learning)
---

# SARM: Interpretable Reward Model via Sparse Autoencoder

This repository contains the model weights of the AAAI 2026 Oral Paper "*Interpretable Reward Model via Sparse Autoencoder*".

## 🔥 News

- [2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
- [2025/12/11] Llama-SARM-4B is ranked 18th on the [Reward Bench 2](https://huggingface.co/spaces/allenai/reward-bench) leaderboard, above GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉

## 🔗 Links

**Authors**

Shuyi Zhang\*, Wei Shi\*, Sihang Li\*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†

**Paper**: [Interpretable Reward Model via Sparse Autoencoder](https://arxiv.org/abs/2508.08746)

**Code Repository:** [https://github.com/schrieffer-z/sarm](https://github.com/schrieffer-z/sarm)

**Demo:** [Try the SARM Demo in a Hugging Face Space](https://huggingface.co/spaces/Schrieffer/SARM-Demo)

## 📊 Evaluation

Llama-SARM-4B shows competitive performance even at a much smaller parameter count.

### Reward Bench 2

| Rank | Model | Model Type | Score | Factuality | Precise IF | Math | Safety | Focus | Ties |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 18 | [**Schrieffer/Llama-SARM-4B**](https://huggingface.co/Schrieffer/Llama-SARM-4B) | Seq. Classifier | 73.79 | 68.74 | 42.81 | 64.48 | 91.78 | 95.56 | 79.39 |
| 22 | [openai/gpt-4.1-2025-04-14](https://huggingface.co/openai/gpt-4.1-2025-04-14) | Generative | 72.32 | 82.89 | 39.74 | 65.21 | 87.26 | 73.38 | 85.42 |
| 24 | [Skywork/Skywork-Reward-Llama-3.1-8B-v0.2](https://huggingface.co/Skywork/Skywork-Reward-Llama-3.1-8B-v0.2) | Seq. Classifier | 71.75 | 69.68 | 40.63 | 60.11 | 94.22 | 94.14 | 71.69 |
| 25 | [anthropic/claude-sonnet-4-20250514](https://huggingface.co/anthropic/claude-sonnet-4-20250514) | Generative | 71.17 | 76.12 | 35.94 | 70.49 | 89.09 | 75.96 | 79.39 |

## SARM Inference Demo

```python
import torch

# ... (lines 46-82 are unchanged in this commit and elided from the diff;
# they presumably load the model and tokenizer, define get_reward_score,
# and build the `examples` list used below)

for example in examples:
    print("Question:\n" + example[0])
    print("Answer:\n" + example[1])
    print("Score:", get_reward_score(model, example[0], example[1]))
```
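Since the diff hides the demo's setup code, here is a minimal sketch of how a sequence-classifier reward model like this one can be loaded and queried. The checkpoint name comes from this repository, but the loading flags, the chat-template call, and the body of `get_reward_score` are illustrative assumptions, not the repository's exact demo code:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical loading sketch; the official demo may differ.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("Schrieffer/Llama-SARM-4B")
model = AutoModelForSequenceClassification.from_pretrained(
    "Schrieffer/Llama-SARM-4B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # SARM ships custom modeling code
).to(device)

def get_reward_score(model, question: str, answer: str) -> float:
    # Format the (question, answer) pair with the chat template and read the
    # scalar logit of the classification head as the reward.
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
    with torch.no_grad():
        return model(input_ids=input_ids).logits[0][0].item()
```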
## 📧 Contact

If you have any questions, please feel free to reach us at `shuyizhang@mail.ustc.edu.cn`.

## 📚 Citation

If you find our work useful, please cite it as follows.

```bibtex
@article{zhang2025interpretable,
  title={Interpretable Reward Model via Sparse Autoencoder},
  author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
  journal={arXiv preprint arXiv:2508.08746},
  year={2025}
}
```
modeling_sarm_gemma2.py
DELETED (475 lines removed); the deleted file is reproduced below.
import torch

from torch import nn
from typing import List, Optional, Union, Tuple
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.models.gemma2.modeling_gemma2 import (
    Gemma2PreTrainedModel,
    Gemma2DecoderLayer,
    Gemma2RMSNorm
)
from transformers.modeling_outputs import (
    BaseModelOutputWithPast,
    SequenceClassifierOutputWithPast
)
from transformers.models.gemma2.configuration_gemma2 import Gemma2Config
from transformers.cache_utils import Cache
from transformers.utils import logging

# Local (these SAE helpers are also defined below with identical semantics)
from sae import TopkSAE, pre_process, Normalized_MSE_loss, Masked_Normalized_MSE_loss

logger = logging.get_logger(__name__)

# ========================================================================================
# ========================================================================================

def get_last_assistant_masks(input_ids):
    # Find the start of the last assistant turn, i.e. the Llama-3.1 header
    # <|start_header_id|>assistant<|end_header_id|>"\n\n" = [128006, 78191, 128007, 271].
    pos = None
    i = len(input_ids) - 4
    while i >= 0:
        if input_ids[i:i+4] == [128006, 78191, 128007, 271]:
            pos = i + 4
            break
        i -= 1
    assert pos is not None, "no assistant header found in input_ids"

    # 1 for tokens of the last assistant reply, 0 for everything before it.
    assistant_masks = []
    for i in range(len(input_ids)):
        if i < pos:
            assistant_masks.append(0)
        else:
            assistant_masks.append(1)

    # The sequence must end with <|eot_id|>.
    assert input_ids[-1] == 128009
    return assistant_masks
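
# Illustrative usage (editor's sketch, not part of the original file): a toy token
# list with one prompt token, the assistant header, a two-token reply, and the
# closing <|eot_id|>. The mask covers the reply tokens and the <|eot_id|>.
def _example_get_last_assistant_masks():
    toy_ids = [9906, 128006, 78191, 128007, 271, 15339, 1917, 128009]
    assert get_last_assistant_masks(toy_ids) == [0, 0, 0, 0, 0, 1, 1, 1]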

def Normalized_MSE_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # Reconstruction error, normalized by the energy of the target activations.
    return (((x_hat - x) ** 2).mean(dim=-1) / (x ** 2).mean(dim=-1)).mean()

def Masked_Normalized_MSE_loss(x: torch.Tensor, x_hat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Same normalized error, averaged only over positions where mask == 1.
    mask = mask.to(torch.bfloat16)
    loss = ((x_hat - x) ** 2).mean(dim=-1) / (x ** 2).mean(dim=-1)
    assert loss.shape == mask.shape
    seq_loss = (mask * loss).sum(-1) / mask.sum(-1)
    return seq_loss.mean()

def pre_process(hidden_stats: torch.Tensor, eps: float = 1e-6) -> tuple:
    '''
    :param hidden_stats: Hidden states (shape: [batch, max_length, hidden_size]).
    :param eps: Epsilon value for numerical stability.
    '''
    # Standardize each position's activation vector before feeding it to the SAE.
    mean = hidden_stats.mean(dim=-1, keepdim=True)
    std = hidden_stats.std(dim=-1, keepdim=True)
    x = (hidden_stats - mean) / (std + eps)
    return x, mean, std
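
# Quick sanity check (editor's sketch, not part of the original file): the
# normalized loss is 0 for a perfect reconstruction and 1 for an all-zero one.
def _example_normalized_mse():
    x = torch.randn(2, 5, 16)
    assert Normalized_MSE_loss(x, x).item() == 0.0
    assert abs(Normalized_MSE_loss(x, torch.zeros_like(x)).item() - 1.0) < 1e-5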

class TopkSAE(nn.Module):
    '''
    TopK Sparse Autoencoder. Implements:
        z = TopK(encoder(x - pre_bias) + latent_bias)
        x_hat = decoder(z) + pre_bias
    '''
    def __init__(
        self, hidden_size: int, latent_size: int, k: int
    ) -> None:
        '''
        :param hidden_size: Dimensionality of the input residual stream activation.
        :param latent_size: Number of latent units.
        :param k: Number of activated latents.
        '''
        # Parameters: 'sae_pre_bias', 'sae_latent_bias', 'sae_encoder.weight', 'sae_decoder.weight'
        assert k <= latent_size, f'k should be less than or equal to {latent_size}'
        super(TopkSAE, self).__init__()
        self.pre_bias = nn.Parameter(torch.zeros(hidden_size))
        self.latent_bias = nn.Parameter(torch.zeros(latent_size))
        self.encoder = nn.Linear(hidden_size, latent_size, bias=False)
        self.decoder = nn.Linear(latent_size, hidden_size, bias=False)

        self.k = k
        self.latent_size = latent_size
        self.hidden_size = hidden_size

        # "tied" init
        # self.decoder.weight.data = self.encoder.weight.data.T.clone()

    def pre_acts(self, x: torch.Tensor) -> torch.Tensor:
        x = x - self.pre_bias
        return self.encoder(x) + self.latent_bias

    def get_latents(self, pre_acts: torch.Tensor) -> torch.Tensor:
        # Keep only the k largest pre-activations per position; zero the rest.
        topk = torch.topk(pre_acts, self.k, dim=-1)
        latents = torch.zeros_like(pre_acts)
        latents.scatter_(-1, topk.indices, topk.values)
        return latents

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = self.pre_acts(x)
        latents = self.get_latents(pre_acts)
        return latents

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents) + self.pre_bias

    def forward(self, x: torch.Tensor) -> tuple:
        '''
        :param x: Input residual stream activation (shape: [batch_size, max_length, hidden_size]).
        :return: latents (shape: [batch_size, max_length, latent_size]),
                 x_hat (shape: [batch_size, max_length, hidden_size]).
        '''
        latents = self.encode(x)
        x_hat = self.decode(latents)
        return latents, x_hat
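
# Round-trip sketch (editor's example with made-up sizes, not part of the original
# file): encode/decode preserves shapes and activates at most k latents per position.
def _example_topk_sae_roundtrip():
    sae = TopkSAE(hidden_size=16, latent_size=64, k=8)
    x = torch.randn(2, 5, 16)                 # [batch, seq, hidden]
    latents, x_hat = sae(x)
    assert latents.shape == (2, 5, 64) and x_hat.shape == (2, 5, 16)
    assert int((latents != 0).sum(dim=-1).max()) <= 8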

# ========================================================================================
# ========================================================================================

class MyGemma2Model(Gemma2PreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Gemma2DecoderLayer`].

    Args:
        config: Gemma2Config
    """

    def __init__(
        self,
        config: Gemma2Config,
    ):
        # Only the layers up to the SAE source layer are instantiated; SARM reads
        # hidden states there instead of running the full depth. The int() cast
        # guards against the float produced by the division default.
        sae_source_layer = int(config.sarm_param.get("sae_source_layer", config.num_hidden_layers // 2))

        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [Gemma2DecoderLayer(config, layer_idx) for layer_idx in range(sae_source_layer)]
        )
        self.norm = Gemma2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.gradient_checkpointing = False

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
            )

        if self.gradient_checkpointing and self.training and use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
            )
            use_cache = False

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)

        if cache_position is None:
            cache_position = torch.arange(0, inputs_embeds.shape[1], device=inputs_embeds.device)

        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)

        causal_mask = self._update_causal_mask(
            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
        )

        # embed positions
        hidden_states = inputs_embeds

        # normalized
        # Gemma2 downcasts the below to float16, causing sqrt(3072)=55.4256 to become 55.5
        # See https://github.com/huggingface/transformers/pull/29402
        normalizer = torch.tensor(self.config.hidden_size**0.5, dtype=hidden_states.dtype)
        hidden_states = hidden_states * normalizer

        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None

        for decoder_layer in self.layers:
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    causal_mask,
                    position_ids,
                    past_key_values,
                    output_attentions,
                    use_cache,
                    cache_position,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=causal_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                    cache_position=cache_position,
                )

            hidden_states = layer_outputs[0]

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        # hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = past_key_values if use_cache else None

        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )

    def _update_causal_mask(
        self,
        attention_mask: torch.Tensor,
        input_tensor: torch.Tensor,
        cache_position: torch.Tensor,
        past_key_values: Cache,
        output_attentions: bool,
    ):
        if self.config._attn_implementation == "flash_attention_2":
            if attention_mask is not None and 0.0 in attention_mask:
                return attention_mask
            return None

        dtype, device = input_tensor.dtype, input_tensor.device
        min_dtype = torch.finfo(dtype).min
        sequence_length = input_tensor.shape[1]
        if past_key_values is not None:
            target_length = past_key_values.get_max_length()
        else:
            target_length = attention_mask.shape[-1] if attention_mask is not None else input_tensor.shape[1]

        if attention_mask is not None and attention_mask.dim() == 4:
            # in this case we assume that the mask comes already in inverted form and requires no inversion or slicing
            if attention_mask.max() != 0:
                raise ValueError("Custom 4D attention mask should be passed in inverted form with max==0")
            causal_mask = attention_mask
        else:
            causal_mask = torch.full(
                (sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device
            )
            if sequence_length != 1:
                causal_mask = torch.triu(causal_mask, diagonal=1)
            causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
            causal_mask = causal_mask[None, None, :, :].expand(input_tensor.shape[0], 1, -1, -1)
            if attention_mask is not None:
                causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
                mask_length = attention_mask.shape[-1]
                padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
                padding_mask = padding_mask == 0
                causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
                    padding_mask, min_dtype
                )
        return causal_mask


# ========================================================================================
# ========================================================================================

class Gemma2SARM(Gemma2PreTrainedModel):
    def __init__(
        self, config, sae_hidden_state_source_layer, sae_latent_size, sae_k,
        sae_use_sequence_level=False,
        sarm_use_topk=False,
        sarm_train_mode=1
    ):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = MyGemma2Model(config)

        # The reward head is linear over SAE latents rather than raw hidden states,
        # which is what makes the reward decomposable into interpretable features.
        self.score = nn.Linear(config.sarm_param['sae_latent_size'], self.num_labels, bias=False)
        self.sae = TopkSAE(hidden_size=self.model.config.hidden_size, latent_size=config.sarm_param['sae_latent_size'], k=config.sarm_param['sae_k'])

        self.sae_use_sequence_level = config.sarm_param['sae_use_sequence_level']
        self.sarm_use_topk = config.sarm_param['sarm_use_topk']
        self.sarm_train_mode = config.sarm_param['sarm_train_mode']

        if self.sarm_train_mode == 1:
            # Mode 1: the SAE is frozen; only the backbone and score head train.
            for p in self.sae.parameters():
                p.requires_grad_(False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        assistant_masks: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Union[Cache, List[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]

        # Standardize hidden states, then map them into the SAE latent space.
        h, _, _ = pre_process(hidden_states)
        sae_features = self.sae.pre_acts(h)
        if self.sarm_use_topk:
            sae_features = self.sae.get_latents(sae_features)

        logits = self.score(sae_features)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        # ensure the scored (last non-padding) token is <|eot_id|>
        assert ((input_ids[torch.arange(batch_size, device=logits.device), sequence_lengths] != torch.ones(batch_size, device=logits.device) * 128009).sum() == 0).item()

        # joint training
        rec_loss = None
        if self.sarm_train_mode == 2:
            if not self.sarm_use_topk:
                sae_features_t = self.sae.get_latents(sae_features)
            else:
                sae_features_t = sae_features  # top-k was already applied above
            h_hat = self.sae.decode(sae_features_t)
            rec_loss = Masked_Normalized_MSE_loss(h, h_hat, assistant_masks)
        elif self.sarm_train_mode == 3 and not self.sae_use_sequence_level:
            h_d = h.detach()
            _, h_hat = self.sae(h_d)
            rec_loss = Masked_Normalized_MSE_loss(h_d, h_hat, assistant_masks)
        elif self.sarm_train_mode == 3 and self.sae_use_sequence_level:
            h_d = h.detach()
            sequence_lengths_t = sequence_lengths.view(-1, 1, 1)
            last_token_mask = torch.zeros([h_d.shape[0], 1, h_d.shape[1]], device=h_d.device)
            last_token_mask.scatter_(-1, sequence_lengths_t, torch.ones_like(sequence_lengths_t, dtype=last_token_mask.dtype))

            # h_d -> (bs, seq_len, d), last_token_mask -> (bs, 1, seq_len)
            h_d = torch.matmul(last_token_mask.to(h_d.dtype), h_d)

            _, h_hat = self.sae(h_d)
            rec_loss = Normalized_MSE_loss(h_d, h_hat)

        # Pool the logit at each sequence's last (non-padding) token.
        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)

        # In the SAE-training modes the reconstruction loss replaces the label loss.
        if rec_loss is not None:
            loss = rec_loss

        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
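
# How the reward becomes interpretable (editor's sketch, not part of the original
# file): because the score head is linear over SAE latents, each latent's
# contribution to the reward can be read off directly. Assumes a loaded Gemma2SARM
# `model` and `input_ids` of shape [1, seq_len] whose last token is <|eot_id|>.
def _example_inspect_reward_features(model, input_ids):
    with torch.no_grad():
        hidden = model.model(input_ids, return_dict=True).last_hidden_state
        h, _, _ = pre_process(hidden)
        latents = model.sae.encode(h)                     # [1, seq_len, latent_size]
        contrib = latents[0, -1] * model.score.weight[0]  # per-latent reward contribution
        top = torch.topk(contrib, k=5)                    # five most reward-driving latents
    return top.indices.tolist(), top.values.tolist()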