---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
---
# AirRep-Flan
This repository contains the AirRep model presented in [Enhancing Training Data Attribution with Representational Optimization](https://huggingface.co/papers/2505.18513).
AirRep is an embedding model designed for computing training data influence on test examples.
Code: https://github.com/sunnweiwei/airrep
## Model Description
This model uses the gte-small configuration with an additional projection layer.
## Sample Usage
You can use the FLAN-trained model to encode training and test data and compute similarity scores.
```python
from airrep import AirRep

model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")

# Each example is a question and its answer separated by a newline
train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]

# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
```
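To rank individual training examples for a query, one simple option is to score each training embedding against the query embedding and sort the results. The following is a minimal sketch in plain Python, using small hand-written vectors in place of real `model.encode` output. The helper names `dot` and `rank_by_similarity`, the `top_k` parameter, and the use of a plain inner product are assumptions made for illustration; they are not part of the `airrep` package and may not match what `model.similarity` computes internally.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_emb, train_emb, top_k=2):
    """Hypothetical helper: for each query vector, return the top_k
    (train_index, score) pairs, highest inner product first."""
    ranked = []
    for q in query_emb:
        scores = [dot(q, t) for t in train_emb]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        ranked.append([(i, scores[i]) for i in order[:top_k]])
    return ranked


# Stand-in embeddings (assumed shapes: one row per text)
train_emb = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query_emb = [[0.0, 1.0, 0.0]]

ranked = rank_by_similarity(query_emb, train_emb, top_k=2)
print(ranked)  # [[(1, 1.0), (2, 0.8)]]
```

With real embeddings, the rows of `train_emb` and `query_emb` would come from `model.encode`, converted to nested lists or arrays as needed.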
## Training Data
This model was trained on the FLAN dataset with data influence optimization.
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{Sun2025AirRep,
  title = {Enhancing Training Data Attribution with Representational Optimization},
  author = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.18513}
}
```