---
library_name: transformers
tags: []
pipeline_tag: feature-extraction
license: cc-by-4.0
language:
- en
metrics:
- ndcg@10
base_model:
- BAAI/bge-base-en-v1.5
---
# Model Card

This model is a feature-extraction model for information retrieval, as described in the paper [Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval](https://arxiv.org/abs/2410.14745).
## Model Details

- Developed by: Junyu Luo et al.
- Model type: BertModel
- Language(s) (NLP): English
- License: cc-by-4.0
- Finetuned from model: BAAI/bge-base-en-v1.5
### Model Sources

- Repository: This repository.
- Paper: https://arxiv.org/abs/2410.14745
- Code: https://github.com/luojunyu/rlhn
## Uses

### Direct Use

This model is intended to be used as a feature extractor for information retrieval.

### Downstream Use

This model can be fine-tuned for a specific task or plugged into a larger ecosystem or application.

### Out-of-Scope Use

Misuse and malicious use are out of scope.
## Bias, Risks, and Limitations

This model has not been extensively examined for bias, risks, and limitations.

### Recommendations

Users should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model.
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "models/e5-base-unsupervised-bge-retrieval-gpt4o-7-datasets-680K-removed"  # Replace with the actual model name

# Load model and tokenizer
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
sentences = ["This is an example sentence.", "Each sentence is converted into embeddings."]

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Generate embeddings by mean pooling over non-padding tokens only
with torch.no_grad():
    model_output = model(**encoded_input)
mask = encoded_input["attention_mask"].unsqueeze(-1)
embeddings = (model_output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings)
```
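Once embeddings are produced, retrieval ranks passages by cosine similarity to the query embedding. A minimal sketch of the scoring step, using NumPy with made-up toy vectors standing in for actual model outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    # Normalize rows to unit length, then take dot products.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Toy embeddings (illustrative values only, not from this model)
query = np.array([[1.0, 0.0, 1.0]])
passages = np.array([
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
])

scores = cosine_similarity(query, passages)  # shape (1, 2)
ranking = np.argsort(-scores[0])             # best passage first
print(ranking)
```

In practice the query and passage embeddings would come from the model above, and the same score can be computed in PyTorch with `torch.nn.functional.cosine_similarity`.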
## Training Details

### Training Data

The model was trained on a modified version of the BGE collection, with hard negatives relabeled using a cascading LLM approach.
### Training Procedure

This model was trained using supervised fine-tuning.
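The exact training objective is not spelled out in this card; a common choice for supervised retriever fine-tuning is an InfoNCE-style contrastive loss over a query, its positive passage, and hard negatives, which is where relabeling noisy negatives matters. The sketch below is illustrative only (NumPy, toy vectors) and is not the authors' training code:

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.05):
    """Contrastive loss for one query: positive passage vs. hard negatives."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    q = normalize(query)
    candidates = normalize(np.vstack([positive, negatives]))  # positive at index 0
    logits = (candidates @ q) / temperature
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax over candidates
    return float(-log_probs[0])  # cross-entropy with the positive as the target

q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
negs = np.array([[0.0, 1.0], [0.1, 0.9]])
print(info_nce(q, pos, negs))
```

A "hard negative" that is actually relevant (a false negative) pushes this loss in the wrong direction, which is the motivation for relabeling such examples before training.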
## Evaluation
### Testing Data, Factors & Metrics

#### Testing Data

Testing was performed on the BEIR benchmark and zero-shot AIR-Bench evaluation.

#### Metrics

This model was evaluated using nDCG@10.
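For reference, nDCG@10 rewards placing highly relevant documents near the top of the ranked list, discounting gains logarithmically by rank and normalizing by the best possible ordering. A minimal sketch of the linear-gain formulation (some implementations instead use the exponential gain 2^rel - 1):

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: gain at rank r is divided by log2(r + 1).
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float((rel / discounts).sum())

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Graded relevance labels of the top-ranked results for one query
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))
```

Benchmark evaluations such as BEIR typically compute this with `pytrec_eval` and average the per-query scores.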
#### Results

Results show significant improvements over the base model on both BEIR and zero-shot AIR-Bench, as reported in the paper.
## Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
## Technical Specifications

### Model Architecture and Objective

The model is based on the BERT architecture.

### Compute Infrastructure

#### Hardware

NVIDIA A100 GPU

#### Software

PyTorch
## Citation

```bibtex
@misc{luo2024semievol,
      title={Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval},
      author={Junyu Luo and Xiao Luo and Xiusi Chen and Zhiping Xiao and Wei Ju and Ming Zhang},
      year={2024},
      eprint={2410.14745},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2410.14745},
}
```