# Sparse Encoder Evaluation

This directory contains examples demonstrating how to evaluate Sparse Encoder models using various metrics and evaluator classes.

To run any of these evaluation scripts, simply execute the Python script. Each script will:

1. Load a pretrained sparse encoder model.
2. Prepare the evaluation dataset.
3. Configure the appropriate evaluator.
4. Run the evaluation.
5. Report the results.

```{eval-rst}
============================================================================================== ==========================================================================
Evaluator                                                                                      Evaluation Script
============================================================================================== ==========================================================================
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseInformationRetrievalEvaluator`  `sparse_retrieval_evaluator.py <sparse_retrieval_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseNanoBEIREvaluator`              `sparse_nanobeir_evaluator.py <sparse_nanobeir_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseEmbeddingSimilarityEvaluator`   `sparse_similarity_evaluator.py <sparse_similarity_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseBinaryClassificationEvaluator`  `sparse_classification_evaluator.py <sparse_classification_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseTripletEvaluator`               `sparse_triplet_evaluator.py <sparse_triplet_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseRerankingEvaluator`             `sparse_reranking_evaluator.py <sparse_reranking_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseTranslationEvaluator`           `sparse_translation_evaluator.py <sparse_translation_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseMSEEvaluator`                   `sparse_mse_evaluator.py <sparse_mse_evaluator.py>`_
============================================================================================== ==========================================================================
```
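All of these evaluators follow the same pattern: construct the evaluator with its data (or dataset names), then call it with a model to get a dictionary of metrics. As a minimal sketch, the `SparseNanoBEIREvaluator` requires no dataset preparation at all, since it downloads the small NanoBEIR datasets itself; the `dataset_names` values below are illustrative assumptions, not part of the example scripts in this directory:

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator

# Load a pretrained sparse encoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Evaluate on a couple of the small NanoBEIR datasets; they are downloaded automatically.
# The dataset names here are assumptions chosen for illustration.
evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus"], batch_size=16)
results = evaluator(model)

# The primary metric is typically the mean NDCG@10 over the selected datasets
print(f"Primary metric: {evaluator.primary_metric}")
print(f"Primary metric value: {results[evaluator.primary_metric]:.4f}")
```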
## Example with Retrieval Evaluation:

This script demonstrates how to evaluate a sparse encoder on an information retrieval task ([`sparse_retrieval_evaluator.py`](sparse_retrieval_evaluator.py)):

```python
import logging
import random

from datasets import load_dataset

from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseInformationRetrievalEvaluator

logging.basicConfig(format="%(message)s", level=logging.INFO)

# Load a model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Load the NFcorpus IR dataset (https://huggingface.co/datasets/BeIR/nfcorpus, https://huggingface.co/datasets/BeIR/nfcorpus-qrels)
corpus = load_dataset("BeIR/nfcorpus", "corpus", split="corpus")
queries = load_dataset("BeIR/nfcorpus", "queries", split="queries")
relevant_docs_data = load_dataset("BeIR/nfcorpus-qrels", split="test")

# For this dataset, we want to concatenate the title and texts for the corpus
corpus = corpus.map(lambda x: {"text": x["title"] + " " + x["text"]}, remove_columns=["title"])

# Shrink the corpus size heavily to only the relevant documents + 1,000 random documents
required_corpus_ids = set(map(str, relevant_docs_data["corpus-id"]))
required_corpus_ids |= set(random.sample(corpus["_id"], k=1000))
corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids)

# Convert the datasets to dictionaries
corpus = dict(zip(corpus["_id"], corpus["text"]))  # Our corpus (cid => document)
queries = dict(zip(queries["_id"], queries["text"]))  # Our queries (qid => question)
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]):
    qid = str(qid)
    corpus_ids = str(corpus_ids)
    if qid not in relevant_docs:
        relevant_docs[qid] = set()
    relevant_docs[qid].add(corpus_ids)

# Given queries, a corpus and a mapping with relevant documents, the SparseInformationRetrievalEvaluator computes different IR metrics.
ir_evaluator = SparseInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="BeIR-nfcorpus-subset-test",
    show_progress_bar=True,
    batch_size=16,
)

# Run evaluation
results = ir_evaluator(model)
"""
Queries: 323
Corpus: 3269

Score-Function: dot
Accuracy@1: 50.77%
Accuracy@3: 64.40%
Accuracy@5: 66.87%
Accuracy@10: 71.83%
Precision@1: 50.77%
Precision@3: 40.45%
Precision@5: 34.06%
Precision@10: 25.98%
Recall@1: 6.27%
Recall@3: 11.69%
Recall@5: 13.74%
Recall@10: 17.23%
MRR@10: 0.5814
NDCG@10: 0.3621
MAP@100: 0.1838
Model Query Sparsity: Active Dimensions: 40.0, Sparsity Ratio: 0.9987
Model Corpus Sparsity: Active Dimensions: 206.2, Sparsity Ratio: 0.9932
"""
# Print the results
print(f"Primary metric: {ir_evaluator.primary_metric}")
# => Primary metric: BeIR-nfcorpus-subset-test_dot_ndcg@10
print(f"Primary metric value: {results[ir_evaluator.primary_metric]:.4f}")
# => Primary metric value: 0.3621
```
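The returned `results` dictionary is usually all you need for logging, but the evaluator call also accepts an `output_path` argument. Assuming `SparseInformationRetrievalEvaluator` keeps the CSV-writing behavior of its dense counterpart, a hypothetical follow-up to the script above would be:

```python
import os

# Hypothetical follow-up to the script above: also write the metrics to a CSV file
# in this directory (the directory name is an arbitrary choice for illustration).
os.makedirs("output/evaluation", exist_ok=True)
results = ir_evaluator(model, output_path="output/evaluation")
```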