---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
---

# AirRep-Flan

This repository contains the AirRep model presented in [Enhancing Training Data Attribution with Representational Optimization](https://huggingface.co/papers/2505.18513).

AirRep is an embedding model designed for computing the influence of training data on test examples.

Code: https://github.com/sunnweiwei/airrep

## Model Description

This model is based on the gte-small config, with an additional projection layer.

## Sample Usage

You can use the FLAN-trained model to encode training and test data and compute similarity scores.

```python
from airrep import AirRep

model = AirRep.from_pretrained("sunweiwei/AirRep-Flan-Small")

# Training examples whose influence we want to estimate
train_texts = [
    "Question: Classify the sentiment of 'The movie was wonderful and heartwarming.'\nAnswer: positive",
    "Question: Does the hypothesis entail the premise? Premise: 'A man is playing a guitar on stage.' Hypothesis: 'Someone is performing music.'\nAnswer: entailment",
]

# Test (query) examples to attribute back to the training data
query_texts = [
    "Question: Classify the sentiment of 'The service was awful and I won't return.'\nAnswer: negative"
]

# Embeddings and influence-like similarity score
train_emb = model.encode(train_texts, batch_size=128)
query_emb = model.encode(query_texts)
score = model.similarity(query_emb, train_emb, softmax=True)
print("Similarity score:", score)
```

## Training Data

This model was trained on the FLAN dataset with data influence optimization.

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{Sun2025AirRep,
  title     = {Enhancing Training Data Attribution with Representational Optimization},
  author    = {Weiwei Sun and Haokun Liu and Nikhil Kandpal and Colin Raffel and Yiming Yang},
  booktitle = {NeurIPS},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.18513}
}
```
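
## Ranking Training Examples by Score

The sample usage computes a similarity score between query and training embeddings. As a minimal sketch of how such scores could be used to rank training examples, the snippet below works on plain NumPy arrays. It assumes the embeddings can be converted to NumPy arrays, and it uses cosine similarity followed by a softmax over the training examples. That choice is an illustrative assumption and not necessarily what `model.similarity(..., softmax=True)` does internally. The helper names `softmax_similarity` and `top_k_training_examples` are hypothetical and not part of the `airrep` package.

```python
import numpy as np


def softmax_similarity(query_emb, train_emb):
    """Cosine similarity between each query and each training embedding,
    normalized with a softmax over the training axis.

    This is an illustrative assumption, not AirRep's actual scoring code.
    """
    q = np.asarray(query_emb, dtype=np.float64)
    t = np.asarray(train_emb, dtype=np.float64)
    # L2-normalize rows so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = q @ t.T  # shape: (num_queries, num_train)
    # Subtract the row maximum before exponentiating for numerical stability.
    sims = sims - sims.max(axis=1, keepdims=True)
    weights = np.exp(sims)
    return weights / weights.sum(axis=1, keepdims=True)


def top_k_training_examples(scores, k=1):
    """Indices of the k highest-scoring training examples for each query."""
    return np.argsort(-scores, axis=1)[:, :k]


# Toy embeddings standing in for the output of model.encode(...).
train_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_emb = np.array([[0.9, 0.1]])

scores = softmax_similarity(query_emb, train_emb)
top_indices = top_k_training_examples(scores, k=2)
print("Scores:", scores)
print("Top training indices per query:", top_indices)
```

Each row of `scores` sums to 1, so a row can be read as a relative weighting of the training examples for that query. To rank real data, replace the toy arrays with the embeddings returned by the model.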