# dfc-D8k-excl5-k45
A DFC sparse crosscoder trained to compare layer-13 activations between:

- **chengq9/ToolRL-Qwen2.5-3B**: fine-tuned with tool-use reinforcement learning
- **Qwen/Qwen2.5-3B**: vanilla base model

This model learns a sparse dictionary of features from the internal representations of the two models. By comparing which features activate for which model, we can identify features exclusive to the tool-use model, features exclusive to the base model, and features shared by both.
## Architecture

Dedicated Feature CrossCoder (DFC) with a partitioned dictionary (0.05/0.05 A/B exclusive fractions).
| Parameter | Value |
|---|---|
| Dictionary size | 8192 |
| Top-k active features | 45 |
| Layer | 13 (middle layer of Qwen2.5-3B) |
| Activation dimension | 2048 |
| A-exclusive features | 409 (0.05) |
| B-exclusive features | 409 (0.05) |
| Shared features | 7374 |
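The partition sizes in the table follow directly from the exclusive fractions. A minimal sketch of the bookkeeping, assuming exclusive counts are floored (which matches the reported 409/409/7374 split; the repo's `config.json` is the authoritative source):

```python
def partition_sizes(dict_size: int, frac_a: float, frac_b: float):
    """Split a dictionary into A-exclusive, B-exclusive, and shared counts.

    Assumption: exclusive counts are floor(frac * dict_size); shared features
    take the remainder.
    """
    n_a = int(dict_size * frac_a)
    n_b = int(dict_size * frac_b)
    n_shared = dict_size - n_a - n_b
    return n_a, n_b, n_shared

print(partition_sizes(8192, 0.05, 0.05))  # -> (409, 409, 7374)
```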
The encoder takes stacked activations of shape (batch, 2, 2048) from both models, applies per-model encoder weights, sums across models, and selects the top-45 features via ReLU + top-k.

## Training

| Parameter | Value |
|---|---|
| Loss function | MSE + L1 sparsity (shared: 1e-3, exclusive: 1e-3) |
| Training steps | 9000 |
| Learning rate | 1e-4 |
| Batch size | 1024 |
| Sparsity coefficient (shared) | 1e-3 |
| Exclusive sparsity coefficient | 1e-3 |
| Optimizer | Adam (grad clip 1.0) |
| W&B project | dfc-crosscoder-sweep |
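The encoder pass described in the architecture section (per-model encoder weights, summed across models, then ReLU + top-k) can be sketched in plain PyTorch. This is an illustration only, not the repo's `dfc.py`: the weight shapes and names (`W_enc`, `b_enc`) are assumptions.

```python
import torch

def encode_topk(x, W_enc, b_enc, k=45):
    """Sketch of the paired-activation encoder.

    x:     (batch, 2, d_model) activations from both models
    W_enc: (2, d_model, dict_size), one encoder matrix per model
    Per-model projections are summed across the model axis, then ReLU + top-k
    keeps the k largest feature activations and zeroes the rest.
    """
    # Sum over the model axis and d_model: (b, m, d) x (m, d, f) -> (b, f)
    pre = torch.einsum("bmd,mdf->bf", x, W_enc) + b_enc
    acts = torch.relu(pre)
    topk = torch.topk(acts, k, dim=-1)
    out = torch.zeros_like(acts)
    out.scatter_(-1, topk.indices, topk.values)
    return out

x = torch.randn(4, 2, 2048)
W = torch.randn(2, 2048, 8192) * 0.01
b = torch.zeros(8192)
feats = encode_topk(x, W, b, k=45)
print((feats > 0).sum(dim=-1))  # at most 45 active features per sample
```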
Activations were collected from HuggingFaceFW/fineweb (sample-10BT) and emrecanacikgoz/ToolRL (cycled).

## Usage

```python
import sys

import torch
from huggingface_hub import hf_hub_download

# Download model files
repo_id = "antebe1/dfc-D8k-excl5-k45"
for fname in ["model.pt", "config.json", "dfc.py"]:
    hf_hub_download(repo_id=repo_id, filename=fname, local_dir="./model")

# Load the crosscoder
sys.path.insert(0, "./model")
from dfc import DFCCrossCoder

dfc = DFCCrossCoder.load("./model", device="cuda")
print(f"Loaded: dict_size={dfc.dict_size}, k={dfc.k}")
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load both models
model_a = AutoModelForCausalLM.from_pretrained("chengq9/ToolRL-Qwen2.5-3B", device_map="cuda:0")
model_b = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B", device_map="cuda:1")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

# Get activations from layer 13
# NOTE: hidden_states[0] = embeddings, hidden_states[i] = output of layer i-1,
# so layer 13 activations are at index 13 + 1
text = "Use the search tool to find recent papers on RLHF"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out_a = model_a(**inputs.to("cuda:0"), output_hidden_states=True)
    out_b = model_b(**inputs.to("cuda:1"), output_hidden_states=True)
act_a = out_a.hidden_states[13 + 1][:, -1, :]  # last token, layer 13
act_b = out_b.hidden_states[13 + 1][:, -1, :]

# Stack and encode
activations = torch.stack([act_a.cpu(), act_b.cpu()], dim=1)  # (1, 2, 2048)
features = dfc.encode(activations.to(dfc.W_enc.device))
print(f"Active features: {(features > 0).sum().item()} / {dfc.dict_size}")

stats = dfc.feature_stats(features)
print(f"L0 total: {stats['l0_total']:.1f}")
print(f"L0 A-excl: {stats['l0_a_excl']:.1f}")
print(f"L0 B-excl: {stats['l0_b_excl']:.1f}")
print(f"L0 shared: {stats['l0_shared']:.1f}")

# Check reconstruction quality
recon, feats = dfc(activations.to(dfc.W_enc.device))
mse = torch.nn.functional.mse_loss(recon.cpu(), activations)
print(f"Reconstruction MSE: {mse.item():.6f}")
```
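If you want to attribute an individual active feature index to its partition, the counts from the architecture table (409 A-exclusive, 409 B-exclusive, 7374 shared) suggest a helper like the one below. The contiguous `[A-exclusive | B-exclusive | shared]` index layout is an assumption, not confirmed by the repo; consult the partition masks in `dfc.py` for the real mapping.

```python
def feature_partition(idx: int, n_a: int = 409, n_b: int = 409,
                      dict_size: int = 8192) -> str:
    """Hypothetical index layout: [0, n_a) A-exclusive, [n_a, n_a + n_b)
    B-exclusive, the rest shared. Verify against the model's partition masks."""
    if not 0 <= idx < dict_size:
        raise IndexError(idx)
    if idx < n_a:
        return "A-exclusive (ToolRL)"
    if idx < n_a + n_b:
        return "B-exclusive (base)"
    return "shared"

print(feature_partition(0), feature_partition(500), feature_partition(1000))
```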
## Files

| File | Description |
|---|---|
| `model.pt` | PyTorch state dict (encoder/decoder weights + partition masks) |
| `config.json` | Architecture config: dict_size, k, partition sizes (n_a, n_b) |
| `hparams.json` | Full training hyperparameters including loss, lr, steps, etc. |
| `dfc.py` | `DFCCrossCoder` class definition, required to load `model.pt` |
| `demo.py` | Feature extraction demo (works with the downloaded model) |
| `requirements.txt` | Python dependencies |
## Hyperparameter sweep

This model is one of 48 models in a hyperparameter sweep; see the qwen-toolrl-crosscoder collection for the full set. The sweep axes:
| Axis | Values |
|---|---|
| k (top-k) | 45, 90, 160 |
| dict_size | 8,192 / 16,384 |
| Architecture | DFC (partitioned) / CrossCoder (all shared) |
| Exclusive % (DFC) | 3%, 5%, 10% |
| Exclusive sparsity | 1e-3 (penalized) / 0 (free) |
| CrossCoder L1 | with / without |
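The axes above multiply out to the 48 runs: the DFC-specific axes (exclusive % and exclusive sparsity) apply only to DFC runs, and the L1 toggle only to CrossCoder runs. A quick sanity-check enumeration (the tuples are shorthand, not the sweep's actual config keys):

```python
from itertools import product

ks = [45, 90, 160]
dict_sizes = [8192, 16384]

# DFC runs: k x dict_size x exclusive-% x exclusive-sparsity
dfc_runs = list(product(ks, dict_sizes, [0.03, 0.05, 0.10], [1e-3, 0.0]))

# CrossCoder runs: k x dict_size x (L1 on/off)
cc_runs = list(product(ks, dict_sizes, [True, False]))

print(len(dfc_runs), len(cc_runs), len(dfc_runs) + len(cc_runs))  # 36 12 48
```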
## Citation

```bibtex
@misc{dfc-D8k-excl5-k45,
  title={DFC CrossCoder: ToolRL vs Base Qwen2.5-3B},
  author={Andre Shportko},
  year={2026},
  url={https://huggingface.co/antebe1/dfc-D8k-excl5-k45}
}
```