---
library_name: transformers
pipeline_tag: feature-extraction
license: cc-by-4.0
language:
  - en
metrics:
  - ndcg@10
base_model:
  - BAAI/bge-base-en-v1.5
---

# Model Card for e5-base-remove-680K

This is a feature-extraction model for information retrieval, as described in the paper [Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval](https://arxiv.org/abs/2410.14745).

## Model Details

- **Developed by:** Junyu Luo et al.
- **Model type:** BertModel
- **Language(s) (NLP):** English
- **License:** cc-by-4.0
- **Finetuned from model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5)

## Model Sources

- **Paper:** https://arxiv.org/abs/2410.14745

## Uses

### Direct Use

This model is intended to be used as a feature extractor (text encoder) for information retrieval.

### Downstream Use

This model can be fine-tuned for a specific retrieval task or integrated into a larger ecosystem or application.

### Out-of-Scope Use

Misuse and malicious use are out of scope.

## Bias, Risks, and Limitations

This model has not been extensively examined for bias, risks, and limitations.

### Recommendations

Users should be made aware of the risks, biases, and limitations of the model.

## How to Get Started with the Model

Use the code below to get started with the model. Note that the mean pooling masks out padding tokens, so padded positions do not distort the sentence embeddings.

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "models/e5-base-unsupervised-bge-retrieval-gpt4o-7-datasets-680K-removed"  # Replace with the actual model name

# Load model and tokenizer
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
sentences = ["This is an example sentence.", "Each sentence is converted into embeddings."]

# Tokenize the sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Generate embeddings via mean pooling over non-padding tokens
with torch.no_grad():
    model_output = model(**encoded_input)
    mask = encoded_input["attention_mask"].unsqueeze(-1).float()
    embeddings = (model_output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings)
```
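For retrieval, query and document embeddings are typically compared with cosine similarity. A minimal sketch of the ranking step, using hand-made toy vectors in place of real model embeddings (the `rank_documents` helper is illustrative, not part of the model's API):

```python
import torch
import torch.nn.functional as F

def rank_documents(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> list[int]:
    """Return document indices sorted by cosine similarity to the query."""
    # L2-normalize so the dot product equals cosine similarity
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    scores = d @ q
    return torch.argsort(scores, descending=True).tolist()

# Toy 3-d "embeddings" standing in for model output
query = torch.tensor([1.0, 0.0, 0.0])
docs = torch.tensor([
    [0.0, 1.0, 0.0],  # orthogonal to the query -> lowest score
    [0.9, 0.1, 0.0],  # nearly parallel -> highest score
    [0.5, 0.5, 0.0],
])
print(rank_documents(query, docs))  # -> [1, 2, 0]
```

In practice, `query` and `docs` would be the pooled embeddings produced by the snippet above.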

## Training Details

### Training Data

The model was trained on a modified version of the BGE training collection, in which hard negatives were relabeled using a cascading LLM approach.
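The cascading idea can be sketched as follows: a cheap judge scores each hard negative first, and only uncertain cases are escalated to a stronger (costlier) judge; negatives the cascade confidently judges relevant are treated as false negatives and dropped. The judge functions, thresholds, and names below are hypothetical stand-ins, not the paper's actual prompts, models, or procedure:

```python
from typing import Callable

# A judge maps (query, passage) to an estimated probability of relevance
Judge = Callable[[str, str], float]

def relabel_hard_negatives(query: str, negatives: list[str],
                           cheap_judge: Judge, strong_judge: Judge,
                           low: float = 0.2, high: float = 0.8) -> list[str]:
    """Keep only negatives the cascade judges as truly non-relevant.

    Scores in (low, high) from the cheap judge are escalated to the strong
    judge; a final score >= high marks a false negative, which is dropped.
    """
    kept = []
    for passage in negatives:
        score = cheap_judge(query, passage)
        if low < score < high:                    # cheap judge is unsure
            score = strong_judge(query, passage)  # escalate to stronger model
        if score < high:                          # judged non-relevant: keep
            kept.append(passage)
    return kept

# Toy judges: the cheap one is unsure about anything sharing a query term
cheap = lambda q, p: 0.5 if q.split()[0] in p else 0.05
strong = lambda q, p: 0.95 if q in p else 0.1

print(relabel_hard_negatives(
    "solar panels",
    ["solar panels at home", "solar eclipse", "gardening tips"],
    cheap, strong))  # -> ['solar eclipse', 'gardening tips']
```

Here "solar panels at home" is escalated, judged relevant, and removed as a false negative; the other two passages remain usable hard negatives.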

### Training Procedure

The model was trained with supervised fine-tuning.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Testing was performed on the BEIR benchmark and on zero-shot AIR-Bench.

#### Metrics

The model was evaluated using nDCG@10.
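For reference, nDCG@10 compares the discounted gains of the system's top-10 ranking against the ideal ordering. A minimal sketch using the linear-gain formulation (benchmarks such as BEIR typically delegate to an evaluation library, so this is illustrative rather than the exact implementation used in the paper):

```python
import math

def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain of the top-k ranked relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A perfect ranking scores 1.0; swapping the top two items lowers the score
print(ndcg_at_k([3, 2, 1, 0]))  # -> 1.0
print(ndcg_at_k([2, 3, 1, 0]))
```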

### Results

The model shows significant improvements over the base model on both BEIR and zero-shot AIR-Bench, as reported in the paper.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in Lacoste et al. (2019).

## Technical Specifications

### Model Architecture and Objective

The model is based on the BERT architecture.

### Compute Infrastructure

#### Hardware

NVIDIA A100 GPU

#### Software

PyTorch

## Citation

```bibtex
@misc{luo2024fixing,
    title={Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval},
    author={Junyu Luo and Xiao Luo and Xiusi Chen and Zhiping Xiao and Wei Ju and Ming Zhang},
    year={2024},
    eprint={2410.14745},
    archivePrefix={arXiv},
    primaryClass={cs.IR},
    url={https://arxiv.org/abs/2410.14745},
}
```