Enhancing Training Data Attribution with Representational Optimization
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("sunweiwei/AirRep-Flan-Small", dtype="auto")

This repository contains the AirRep model presented in Enhancing Training Data Attribution with Representational Optimization.
AirRep is an embedding model designed for computing training data influence on test examples.
Code: https://github.com/sunnweiwei/airrep
This model is based on the gte-small configuration, with an additional projection layer.
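To make the idea of "an encoder plus a projection layer" concrete, here is a minimal pure-Python sketch. It is not the AirRep implementation: the mean-pooling step, the bias-free projection, and the toy dimensions (4 hidden dimensions mapped to 2) are all assumptions chosen for illustration, and the names `mean_pool` and `project` are hypothetical.

```python
def mean_pool(token_vecs, mask):
    """Average the token vectors whose mask entry is 1 (assumed pooling strategy)."""
    dim = len(token_vecs[0])
    kept = [v for v, m in zip(token_vecs, mask) if m]
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def project(vec, weight):
    """Apply a bias-free linear map; weight has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


# Toy encoder outputs for three tokens (made-up numbers, not real model states).
token_states = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]  # the third token is padding and is ignored

# Hypothetical projection matrix: keeps the first and last hidden dimensions.
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

pooled = mean_pool(token_states, attention_mask)
embedding = project(pooled, projection)
```

In the real model the projection weights are learned and the dimensions differ; the sketch only shows where such a layer would sit relative to the pooled encoder output.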
You can use the FLAN-trained model to encode training and test data and compute similarity scores.
from airrep import AirRep
model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")
train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]
# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
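For intuition about what an embedding-based, influence-like score can look like, the following is a self-contained sketch that does not call the `airrep` package. It assumes dot-product similarity followed by a softmax over the training examples; that is only one plausible reading of `softmax=True`, and the actual behavior of `model.similarity` may differ. The helper names and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_scores(query_vecs, train_vecs, use_softmax=True):
    """Return one row per query and one column per training example."""
    rows = []
    for q in query_vecs:
        row = [dot(q, t) for t in train_vecs]
        rows.append(softmax(row) if use_softmax else row)
    return rows


# Toy 3-dimensional embeddings (made-up numbers, not real model outputs).
toy_train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_query = [[0.9, 0.1, 0.0]]

toy_scores = similarity_scores(toy_query, toy_train)

# Rank training examples for the first query, highest score first.
ranking = sorted(range(len(toy_train)), key=lambda i: toy_scores[0][i], reverse=True)
print("Ranking for query 0:", ranking)
```

With softmax applied, each row sums to 1, so the scores can be read as a relative weighting of training examples for a given query rather than as absolute influence values.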
This model was trained on the FLAN dataset with data influence optimization.
If you use this model, please cite:
@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="sunweiwei/AirRep-Flan-Small")