---
library_name: transformers
license: cc-by-4.0
language:
- en
pipeline_tag: feature-extraction
---
# Model Card for Model ID

This model is a BERT-based retriever (e5-base) fine-tuned for robust information retrieval on training data whose hard negatives were relabeled using cascading LLM prompts, as described in the paper [Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval](https://huggingface.co/papers/2505.16967).
## Model Details
- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: BERT
- Language(s) (NLP): English
- License: cc-by-4.0
- Finetuned from model [optional]: e5-base
### Model Sources [optional]
- Repository: https://github.com/luojunyu/rlhn
- Paper [optional]: https://huggingface.co/papers/2505.16967
- Demo [optional]: [More Information Needed]
## Uses

### Direct Use

This model can be used for feature extraction to obtain text embeddings for information retrieval tasks such as passage ranking.

### Downstream Use [optional]

This model can be fine-tuned further on specific information retrieval tasks to improve in-domain performance.

### Out-of-Scope Use

This model should not be used for malicious purposes or in any way that violates ethical guidelines.
## Bias, Risks, and Limitations

The model may exhibit biases present in its training data.

### Recommendations

Users should be aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## How to Get Started with the Model

Use the code below to get started with the model.
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "models/e5-base-unsupervised-bge-retrieval-gpt4o-7-datasets-400K-replaced"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def get_embeddings(texts):
    # Tokenize with padding/truncation so texts of different lengths batch together.
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Use the [CLS] token representation as the sentence embedding.
    embeddings = model_output.last_hidden_state[:, 0]
    return embeddings

texts = ["This is a sample sentence.", "Here is another one."]
embeddings = get_embeddings(texts)
print(embeddings.shape)
```
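Once embeddings are computed, passages can be ranked against a query by cosine similarity. A minimal sketch, where the toy vectors stand in for `get_embeddings` output (note: E5-family checkpoints often expect `"query: "`/`"passage: "` input prefixes, so check the repository for the exact input format):

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_emb: torch.Tensor, passage_embs: torch.Tensor):
    """Rank passages by cosine similarity to a single query embedding."""
    q = F.normalize(query_emb, dim=-1)      # (d,)
    p = F.normalize(passage_embs, dim=-1)   # (n, d)
    scores = p @ q                          # (n,) cosine similarities
    order = torch.argsort(scores, descending=True)
    return order.tolist(), scores

# Toy embeddings stand in for real get_embeddings(...) output.
query = torch.tensor([1.0, 0.0])
passages = torch.tensor([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
order, scores = rank_by_cosine(query, passages)
print(order)  # most similar passage first
```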
## Training Details

### Training Data

The model was trained on a subset of the BGE training collection.

### Training Procedure

#### Preprocessing [optional]

The data was preprocessed by identifying false negatives among the mined hard negatives and relabeling them using cascading LLM prompts.
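The cascading idea can be sketched as follows. This is a hypothetical illustration, not the paper's exact prompts or decision rules: `cheap_judge` and `strong_judge` are stand-ins for calls to a weaker and a stronger LLM, and only cases the cheap judge is unsure about escalate to the stronger one.

```python
def cheap_judge(query, passage):
    # Stand-in for a cheap LLM call; returns "relevant" | "not_relevant" | "unsure".
    return "unsure" if "maybe" in passage else "not_relevant"

def strong_judge(query, passage):
    # Stand-in for a stronger (more expensive) LLM call.
    return "relevant" if "answer" in passage else "not_relevant"

def relabel_hard_negatives(query, hard_negatives):
    kept, relabeled = [], []
    for passage in hard_negatives:
        verdict = cheap_judge(query, passage)
        if verdict == "unsure":        # escalate only the uncertain cases
            verdict = strong_judge(query, passage)
        if verdict == "relevant":      # false negative: remove/relabel it
            relabeled.append(passage)
        else:
            kept.append(passage)       # genuine hard negative: keep it
    return kept, relabeled

kept, relabeled = relabel_hard_negatives(
    "q", ["irrelevant text", "maybe the answer", "maybe not"])
print(kept, relabeled)
```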
#### Training Hyperparameters

- Training regime: bf16 mixed precision
#### Speeds, Sizes, Times [optional]

[More Information Needed]
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on the BEIR benchmark and on AIR-Bench in a zero-shot setting.

#### Factors

[More Information Needed]

#### Metrics

nDCG@10
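nDCG@10 rewards rankings that place relevant documents near the top, using a log-discounted gain normalized by the ideal ranking. A minimal reference implementation, using binary relevance labels for simplicity:

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a ranked list of graded relevance labels."""
    def dcg(rels):
        # Gain discounted by log2 of (rank + 1), ranks starting at 1.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# One relevant document ranked second: discounted by log2(3).
print(ndcg_at_k([0, 1, 0, 0]))
```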
### Results

The model shows improvements over previous state-of-the-art models on various information retrieval benchmarks, as measured by nDCG@10.

#### Summary
## Model Examination [optional]

[More Information Needed]
## Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]
## Citation [optional]

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]