---
language: en
tags:
- bm25
- bm25s
- retrieval
- search
- lexical
---

# BM25S Index

This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.0.1dev0`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

[BM25S GitHub Repository](https://github.com/xhluca/bm25s)

## Installation

You can install the `bm25s` library with `pip`:

```bash
pip install "bm25s==0.1.3"

# Include extra dependencies such as the stemmer
pip install "bm25s[full]==0.1.3"

# For Hugging Face Hub usage
pip install huggingface_hub
```

## Loading a `bm25s` index

You can use this index for information retrieval tasks. Here is an example:

```python
import bm25s
from bm25s.hf import BM25HF

# Load the index from the Hugging Face Hub
retriever = BM25HF.load_from_hub("xhluca/bm25s-fiqa-index", revision="main")

# Tokenize the query, then retrieve the top 3 results
query = "a cat is a feline"
results = retriever.retrieve(bm25s.tokenize(query), k=3)
```

## Saving a `bm25s` index

You can save a `bm25s` index to the Hugging Face Hub. Here is an example:

```python
import bm25s
from bm25s.hf import BM25HF

# Create a BM25 index and add documents
retriever = BM25HF()
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
corpus_tokens = bm25s.tokenize(corpus)
retriever.index(corpus_tokens)

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("xhluca/bm25s-fiqa-index", token=token)
```

## Stats

This index was built from a corpus with the following statistics:

| Statistic | Value |
| --- | --- |
| Number of documents | 57638 |
| Number of tokens | 3626761 |
| Average tokens per document | 62.923088934383564 |
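
The average is simply the token count divided by the document count. As a quick check of the table, using only the two values reported above (the variable names are illustrative):

```python
# Values copied from the Stats table
n_documents = 57638
n_tokens = 3626761

# Average document length in tokens
avg_tokens = n_tokens / n_documents
print(f"{avg_tokens:.2f}")  # prints 62.92
```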

## Parameters

The index was created with the following parameters:

| Parameter | Value |
| --- | --- |
| k1 | `1.5` |
| b | `0.75` |
| delta | `0.5` |
| method | `lucene` |
| idf method | `lucene` |
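
To make the roles of `k1` and `b` concrete, here is a minimal pure-Python sketch of the Lucene-style BM25 score for a single query term. It is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the formula is the commonly documented Lucene variant, and `tf`, `doc_len`, and `doc_freq` in the usage lines are made-up values. As far as I understand, `delta` only affects the BM25L and BM25+ variants, so it does not appear here.

```python
import math


def lucene_idf(n_docs: int, doc_freq: int) -> float:
    """Lucene-style IDF; always positive, even for very common terms."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def lucene_bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Score contribution of one query term for one document."""
    # b controls how strongly the document length is normalized;
    # k1 controls how quickly repeated occurrences saturate.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return lucene_idf(n_docs, doc_freq) * tf / (tf + length_norm)


# Corpus-level values taken from the Stats table
N_DOCS = 57638
AVG_LEN = 3626761 / N_DOCS

# Hypothetical term: occurs twice in a 63-token document, found in 100 documents
score = lucene_bm25_term_score(
    tf=2, doc_len=63, avg_doc_len=AVG_LEN, n_docs=N_DOCS, doc_freq=100
)
print(round(score, 3))
```

A document's score for a full query is the sum of these per-term contributions over the query terms.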