metadata
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction

AirRep-Flan

This repository contains the AirRep model presented in the paper "Enhancing Training Data Attribution with Representational Optimization" (https://arxiv.org/abs/2505.18513).

AirRep is an embedding model designed for computing training data influence on test examples.
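To make the idea of embedding-based influence concrete, here is a minimal sketch that scores training examples against a query by the inner product of their embeddings. The inner-product scoring rule, the helper names `dot` and `influence_scores`, and the toy 3-dimensional vectors are all assumptions made for illustration; this is not taken from the AirRep code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def influence_scores(query_vec, train_vecs):
    """Score each training embedding against one query embedding."""
    return [dot(query_vec, t) for t in train_vecs]


# Hypothetical 3-dimensional embeddings, made up for illustration only.
toy_train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
toy_query = [1.0, 0.0, 0.0]

# A higher score means the training example is treated as more
# influential for the query under this assumed scoring rule.
toy_scores = influence_scores(toy_query, toy_train)
print(toy_scores)
```

In practice the embeddings would come from the model rather than being written by hand.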

Code: https://github.com/sunnweiwei/airrep

Model Description

This model is based on the gte-small configuration, with an additional projection layer.

Sample Usage

You can use the FLAN-trained model to encode training and test data and compute similarity scores.

from airrep import AirRep

model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")

train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]

# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
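Once scores are available, a typical next step is to rank training examples by their score for a given query. The sketch below assumes the scores for one query have already been converted to a flat Python list with one float per training text; the return type and shape of `model.similarity` are not documented here, so that conversion is left to the reader. The helper `rank_training_examples` and the example values are hypothetical.

```python
def rank_training_examples(scores_row, texts, k=None):
    """Return (index, score, text) tuples sorted from highest to lowest score."""
    if len(scores_row) != len(texts):
        raise ValueError("need exactly one score per training text")
    order = sorted(range(len(texts)), key=lambda i: scores_row[i], reverse=True)
    if k is not None:
        order = order[:k]  # keep only the k highest-scoring examples
    return [(i, scores_row[i], texts[i]) for i in order]


# Made-up scores for one query over three training texts (not real model output).
example_scores = [0.2, 0.7, 0.1]
example_texts = ["train example A", "train example B", "train example C"]

ranked = rank_training_examples(example_scores, example_texts, k=2)
for idx, s, text in ranked:
    print(f"{idx}\t{s:.2f}\t{text}")
```

Returning the original index alongside each text makes it easy to map the top results back to rows in the training set.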

Training Data

This model was trained on the FLAN dataset with data influence optimization.

Citation

If you use this model, please cite:

@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}