---
license: mit
---
This repository contains the weight diffs and DIT adapters used in the paper *Learning to Interpret Weight Differences in Language Models* (Goel et al., 2025). The paper introduces Diff Interpretation Tuning (DIT), a method that trains a LoRA adapter that can be applied to a model to get it to describe its own finetuning-induced modifications.

To play around with the weight diffs and DIT adapters from the paper, please check out our Google Colab demo notebook. The code used to train and evaluate the weight diffs and DIT adapters can be found at github.com/Aviously/diff-interpretation-tuning.
You can cite our work using the following BibTeX:
```bibtex
@misc{goel2025learninginterpretweightdifferences,
  title={Learning to Interpret Weight Differences in Language Models},
  author={Avichal Goel and Yoon Kim and Nir Shavit and Tony T. Wang},
  year={2025},
  eprint={2510.05092},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2510.05092},
}
```