AirRep-Flan

This repository contains the AirRep model presented in the paper "Enhancing Training Data Attribution with Representational Optimization".

AirRep is an embedding model designed for computing training data influence on test examples.
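To make the idea of embedding-based influence concrete, here is a minimal sketch in plain Python. It assumes the influence of a training example on a test example is approximated by the inner product of their embeddings, optionally normalized with a softmax over the training examples. The function names `dot` and `influence_scores` and the `normalize` parameter are hypothetical and are not part of the `airrep` package; the model's actual scoring may differ.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def influence_scores(
    query_emb: Sequence[float],
    train_embs: Sequence[Sequence[float]],
    normalize: bool = False,
) -> List[float]:
    """Score every training embedding against one query embedding.

    Assumption (not confirmed by the model card): the raw score is an
    inner product, and normalization is a softmax over training examples.
    """
    scores = [dot(query_emb, t) for t in train_embs]
    if not normalize:
        return scores
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With the softmax applied, the scores for one query sum to 1, so they can be read as a relative weighting of the training examples rather than as absolute influence values.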

Code: https://github.com/sunnweiwei/airrep

Model Description

This model is based on the gte-small configuration, with an additional projection layer.

Sample Usage

You can use the FLAN-trained model to encode training and test data and compute similarity scores.

from airrep import AirRep

model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")

train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]

# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
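A common next step is to rank training examples by their score for each query. The sketch below assumes the scores are available as a nested list of shape (number of queries, number of training examples); the return type of `model.similarity` is not documented here, so converting it to nested lists is left to the reader. The helper name `rank_training_examples` is hypothetical and not part of the `airrep` package.

```python
from typing import List, Sequence, Tuple


def rank_training_examples(
    scores: Sequence[Sequence[float]], k: int = 1
) -> List[List[Tuple[int, float]]]:
    """Return, for each query, the k highest-scoring (train_index, score) pairs.

    Assumes scores[i][j] is the score of training example j for query i.
    """
    ranked = []
    for row in scores:
        # Sort training indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, float(row[j])) for j in order[:k]])
    return ranked


# Toy scores for two queries over two training examples.
example = rank_training_examples([[0.2, 0.8], [0.5, 0.1]], k=1)
print(example)
```

The returned indices refer to positions in the training list, so they can be used to look up the corresponding entries in `train_texts`.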

Training Data

This model was trained on the FLAN dataset with data influence optimization.

Citation

If you use this model, please cite:

@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}

Model size: 33.4M params (Safetensors). Tensor types: F32, BF16.

Paper: Enhancing Training Data Attribution with Representational Optimization (arXiv:2505.18513, published May 24, 2025)