---
library_name: transformers
license: cc-by-4.0
language:
- en
pipeline_tag: feature-extraction
---
# Model Card for Model ID

This model is a BERT-based retriever (e5-base) fine-tuned for robust information retrieval on training data whose hard negatives were relabeled using cascading LLM prompts, as described in the paper [Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval](https://huggingface.co/papers/2505.16967).
## Model Details
- Developed by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: BERT
- Language(s) (NLP): English
- License: cc-by-4.0
- Finetuned from model [optional]: e5-base
### Model Sources [optional]
- Repository: https://github.com/luojunyu/rlhn
- Paper [optional]: https://huggingface.co/papers/2505.16967
- Demo [optional]: [More Information Needed]
## Uses

### Direct Use

This model can be used for feature extraction to obtain text embeddings for information retrieval tasks such as passage ranking.

### Downstream Use [optional]

This model can be fine-tuned further on specific information retrieval tasks to improve in-domain performance.

### Out-of-Scope Use

This model should not be used for malicious purposes or in any way that violates ethical guidelines.
## Bias, Risks, and Limitations

The model may exhibit biases present in its training data.

### Recommendations

Users should be aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## How to Get Started with the Model

Use the code below to get started with the model.
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "models/e5-base-unsupervised-bge-retrieval-gpt4o-7-datasets-400K-replaced"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def get_embeddings(texts):
    # Tokenize with padding/truncation so texts of different lengths batch together.
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Use the [CLS] token representation as the sentence embedding.
    embeddings = model_output.last_hidden_state[:, 0]
    return embeddings

texts = ["This is a sample sentence.", "Here is another one."]
embeddings = get_embeddings(texts)
print(embeddings.shape)
```
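Once embeddings are computed, passages can be ranked against a query by cosine similarity. A minimal sketch, where the toy vectors stand in for `get_embeddings` output (note: E5-family checkpoints often expect `"query: "`/`"passage: "` input prefixes, so check the repository for the exact input format):

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_emb: torch.Tensor, passage_embs: torch.Tensor):
    """Rank passages by cosine similarity to a single query embedding."""
    q = F.normalize(query_emb, dim=-1)      # (d,)
    p = F.normalize(passage_embs, dim=-1)   # (n, d)
    scores = p @ q                          # (n,) cosine similarities
    order = torch.argsort(scores, descending=True)
    return order.tolist(), scores

# Toy embeddings stand in for real get_embeddings(...) output.
query = torch.tensor([1.0, 0.0])
passages = torch.tensor([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
order, scores = rank_by_cosine(query, passages)
print(order)  # most similar passage first
```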
## Training Details

### Training Data

The model was trained on a subset of the BGE training collection.

### Training Procedure

#### Preprocessing [optional]

The data was preprocessed by identifying false negatives among the mined hard negatives and relabeling them using cascading LLM prompts.
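The cascading idea can be sketched as follows. This is a hypothetical illustration, not the paper's exact prompts or decision rules: `cheap_judge` and `strong_judge` are stand-ins for calls to a weaker and a stronger LLM, and only cases the cheap judge is unsure about escalate to the stronger one.

```python
def cheap_judge(query, passage):
    # Stand-in for a cheap LLM call; returns "relevant" | "not_relevant" | "unsure".
    return "unsure" if "maybe" in passage else "not_relevant"

def strong_judge(query, passage):
    # Stand-in for a stronger (more expensive) LLM call.
    return "relevant" if "answer" in passage else "not_relevant"

def relabel_hard_negatives(query, hard_negatives):
    kept, relabeled = [], []
    for passage in hard_negatives:
        verdict = cheap_judge(query, passage)
        if verdict == "unsure":        # escalate only the uncertain cases
            verdict = strong_judge(query, passage)
        if verdict == "relevant":      # false negative: remove/relabel it
            relabeled.append(passage)
        else:
            kept.append(passage)       # genuine hard negative: keep it
    return kept, relabeled

kept, relabeled = relabel_hard_negatives(
    "q", ["irrelevant text", "maybe the answer", "maybe not"])
print(kept, relabeled)
```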
#### Training Hyperparameters

- Training regime: bf16 mixed precision
#### Speeds, Sizes, Times [optional]

[More Information Needed]
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

The model was evaluated on the BEIR benchmark and on AIR-Bench in a zero-shot setting.

#### Factors

[More Information Needed]

#### Metrics

nDCG@10
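nDCG@10 rewards rankings that place relevant documents near the top, using a log-discounted gain normalized by the ideal ranking. A minimal reference implementation, using binary relevance labels for simplicity:

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for a ranked list of graded relevance labels."""
    def dcg(rels):
        # Gain discounted by log2 of (rank + 1), ranks starting at 1.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# One relevant document ranked second: discounted by log2(3).
print(ndcg_at_k([0, 1, 0, 0]))
```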
### Results

The model shows improvements over previous state-of-the-art models on various information retrieval benchmarks, as measured by nDCG@10.

#### Summary
## Model Examination [optional]

[More Information Needed]
## Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]
## Citation [optional]

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]