---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
---

# AirRep-Flan

This repository contains the AirRep model presented in [Enhancing Training Data Attribution with Representational Optimization](https://huggingface.co/papers/2505.18513).

AirRep is an embedding model designed for computing training data influence on test examples.

Code: https://github.com/sunnweiwei/airrep

## Model Description

This model uses the gte-small configuration with an additional projection layer.

## Sample Usage

You can use the FLAN-trained model to encode training and test data and compute similarity scores.

```python
from airrep import AirRep

model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")

train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]

# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
```
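
To give a rough intuition for what a softmax-normalized similarity score can look like, here is a minimal pure-Python sketch that ranks training vectors against a query vector. It is not the AirRep implementation: the use of a dot product, the softmax taken over training examples, the helper names (`dot`, `softmax`, `rank_training_examples`), and the toy vectors are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rank_training_examples(query_vec, train_vecs, k=1):
    """Return (index, weight) pairs for the k highest-weighted training vectors.

    Hypothetical helper: scores are dot products, normalized with a softmax
    over the training examples. This is an assumed scoring rule, not the
    confirmed behavior of `model.similarity`.
    """
    raw = [dot(query_vec, t) for t in train_vecs]
    weights = softmax(raw)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in order[:k]]


# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
toy_train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [0.9, 0.1, 0.0]

top = rank_training_examples(toy_query, toy_train, k=2)
for index, weight in top:
    print(f"train example {index}: weight {weight:.3f}")
```

With real embeddings, the same ranking idea could be applied to surface the training examples that score highest for a given query.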

## Training Data

This model was trained on the FLAN dataset with data influence optimization.

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}
```