# Sparse Encoder Evaluation

This directory contains examples demonstrating how to evaluate Sparse Encoder models using various metrics and evaluator classes.

To run any of these evaluations, simply execute the corresponding Python script. Each script follows the same pattern (a minimal sketch is shown after this list):

1. Load a pretrained sparse encoder model.
2. Prepare the evaluation dataset.
3. Configure the appropriate evaluator.
4. Run the evaluation.
5. Report the results.
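
For orientation, here is a minimal sketch of that five-step pattern using the `SparseNanoBEIREvaluator`, which downloads its evaluation data automatically, so step 2 reduces to picking the NanoBEIR subsets. The model checkpoint and the chosen subsets below are only illustrative; any sparse encoder and any supported subsets work.

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseNanoBEIREvaluator

# 1. Load a pretrained sparse encoder model (example checkpoint)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# 2. + 3. Prepare the data and configure the evaluator; the NanoBEIR subsets
#         below are example choices and are downloaded automatically
evaluator = SparseNanoBEIREvaluator(dataset_names=["msmarco", "nfcorpus"], batch_size=16)

# 4. Run the evaluation
results = evaluator(model)

# 5. Report the results, e.g. the evaluator's primary metric
print(evaluator.primary_metric, results[evaluator.primary_metric])
```
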
```{eval-rst}
============================================================================================= =========================================================================================================================================================================
Evaluator                                                                                     Evaluation Script
============================================================================================= =========================================================================================================================================================================
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseInformationRetrievalEvaluator` `sparse_retrieval_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_retrieval_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseNanoBEIREvaluator`             `sparse_nanobeir_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_nanobeir_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseEmbeddingSimilarityEvaluator`  `sparse_similarity_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_similarity_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseBinaryClassificationEvaluator` `sparse_classification_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_classification_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseTripletEvaluator`              `sparse_triplet_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_triplet_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseRerankingEvaluator`            `sparse_reranking_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_reranking_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseTranslationEvaluator`          `sparse_translation_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_translation_evaluator.py>`_
:class:`~sentence_transformers.sparse_encoder.evaluation.SparseMSEEvaluator`                  `sparse_mse_evaluator.py <https://github.com/UKPLab/sentence-transformers/blob/master/examples/sparse_encoder/evaluation/sparse_mse_evaluator.py>`_
============================================================================================= =========================================================================================================================================================================
```
## Example with Retrieval Evaluation

This script demonstrates how to evaluate a sparse encoder on an information retrieval task ([`sparse_retrieval_evaluator.py`](sparse_retrieval_evaluator.py)):
```python
import logging
import random

from datasets import load_dataset

from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseInformationRetrievalEvaluator

logging.basicConfig(format="%(message)s", level=logging.INFO)

# Load a model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Load the NFcorpus IR dataset (https://huggingface.co/datasets/BeIR/nfcorpus, https://huggingface.co/datasets/BeIR/nfcorpus-qrels)
corpus = load_dataset("BeIR/nfcorpus", "corpus", split="corpus")
queries = load_dataset("BeIR/nfcorpus", "queries", split="queries")
relevant_docs_data = load_dataset("BeIR/nfcorpus-qrels", split="test")

# For this dataset, we want to concatenate the title and texts for the corpus
corpus = corpus.map(lambda x: {"text": x["title"] + " " + x["text"]}, remove_columns=["title"])

# Shrink the corpus size heavily to only the relevant documents + 1,000 random documents
required_corpus_ids = set(map(str, relevant_docs_data["corpus-id"]))
required_corpus_ids |= set(random.sample(corpus["_id"], k=1000))
corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids)

# Convert the datasets to dictionaries
corpus = dict(zip(corpus["_id"], corpus["text"]))  # Our corpus (cid => document)
queries = dict(zip(queries["_id"], queries["text"]))  # Our queries (qid => question)
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids]))
for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]):
    qid = str(qid)
    corpus_ids = str(corpus_ids)
    if qid not in relevant_docs:
        relevant_docs[qid] = set()
    relevant_docs[qid].add(corpus_ids)

# Given queries, a corpus and a mapping with relevant documents, the SparseInformationRetrievalEvaluator computes different IR metrics.
ir_evaluator = SparseInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="BeIR-nfcorpus-subset-test",
    show_progress_bar=True,
    batch_size=16,
)

# Run evaluation
results = ir_evaluator(model)
"""
Queries: 323
Corpus: 3269
Score-Function: dot
Accuracy@1: 50.77%
Accuracy@3: 64.40%
Accuracy@5: 66.87%
Accuracy@10: 71.83%
Precision@1: 50.77%
Precision@3: 40.45%
Precision@5: 34.06%
Precision@10: 25.98%
Recall@1: 6.27%
Recall@3: 11.69%
Recall@5: 13.74%
Recall@10: 17.23%
MRR@10: 0.5814
NDCG@10: 0.3621
MAP@100: 0.1838
Model Query Sparsity: Active Dimensions: 40.0, Sparsity Ratio: 0.9987
Model Corpus Sparsity: Active Dimensions: 206.2, Sparsity Ratio: 0.9932
"""
# Print the results
print(f"Primary metric: {ir_evaluator.primary_metric}")
# => Primary metric: BeIR-nfcorpus-subset-test_dot_ndcg@10
print(f"Primary metric value: {results[ir_evaluator.primary_metric]:.4f}")
# => Primary metric value: 0.3621
```
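
The returned dictionary contains all computed metrics, keyed by the evaluator name, score function, and metric name, as the `primary_metric` output above shows. If you also want the metrics written to disk, here is a hedged variation: assuming the standard `output_path`/`write_csv` behavior shared by Sentence Transformers evaluators, passing an existing output directory makes the evaluator append its results to a CSV file there; the directory name below is just an example.

```python
import os

# Hedged variation: assuming the standard `output_path` / `write_csv` behavior
# of Sentence Transformers evaluators, this also appends the metrics to a CSV
# file in the given directory ("eval_results" is just an example path).
os.makedirs("eval_results", exist_ok=True)
results = ir_evaluator(model, output_path="eval_results")
```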