|
|
| ## Search Engine |
|
|
In this document, we provide examples of how to launch different retrievers, including local sparse retrievers (e.g., BM25), local dense retrievers (e.g., e5), and online search engines.
For local retrievers, we use the [wiki-18](https://huggingface.co/datasets/PeterJinGo/wiki-18-corpus) corpus as an example; prebuilt corpus indices can be found at [bm25](https://huggingface.co/datasets/PeterJinGo/wiki-18-bm25-index), [e5-flat](https://huggingface.co/datasets/PeterJinGo/wiki-18-e5-index), and [e5-HNSW64](https://huggingface.co/datasets/PeterJinGo/wiki-18-e5-index-HNSW64).
|
|
| ### How to choose the retriever? |
|
|
| - If you have a private or domain-specific corpus, choose **local retriever**. |
|
|
- If there is no high-quality embedding-based retriever (dense retriever) in your domain, choose a **sparse local retriever** (e.g., BM25).
|
|
- Otherwise, choose a **dense local retriever**.
| |
  - If you do not have sufficient GPUs to conduct exact dense embedding matching, choose **ANN indexing** on CPUs.
| |
  - If you have sufficient GPUs, choose **flat indexing** on GPUs.
|
|
|
|
- If you want to train a general LLM search agent and have enough funding, choose an **online search engine** (e.g., [SerpAPI](https://serpapi.com/)).
|
|
|
|
- If you have a domain-specific online search engine (e.g., PubMed search), you can refer to [link](https://github.com/PeterGriffinJin/Search-R1/blob/main/search_r1/search/serp_search_server.py) to integrate it into Search-R1 yourself.
|
|
| Search engine launching scripts can be found at [link](https://github.com/PeterGriffinJin/Search-R1/tree/main/example/retriever). |
|
|
| ### Local Sparse Retriever |
|
|
Sparse retrieval (e.g., BM25) is a traditional lexical method. The retrieval process is very efficient and requires no GPUs. However, it may not be as accurate as dense retrievers in some specific domains.
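To see why no GPU is needed: BM25 only combines term frequencies and inverse document frequencies. Below is a minimal self-contained sketch of the classic BM25 scoring formula (a toy illustration with naive whitespace tokenization, not the implementation the retrieval server uses):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the classic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    N = len(tokenized)

    def idf(term):
        # Inverse document frequency with the standard +0.5 smoothing.
        n = sum(1 for d in tokenized if term in d)
        return math.log((N - n + 0.5) / (n + 0.5) + 1)

    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        s = 0.0
        for term in query.lower().split():
            f = tf[term]
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf(term) * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = ["BM25 is a sparse lexical retriever",
        "Dense retrievers embed text into vectors"]
print(bm25_scores("sparse retriever", docs))
```

Since scoring is just counting and arithmetic over an inverted index, it runs comfortably on CPUs.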
|
|
(1) Download the index.
| ```bash |
| save_path=/your/path/to/save |
| huggingface-cli download PeterJinGo/wiki-18-bm25-index --repo-type dataset --local-dir $save_path |
| ``` |
|
|
| (2) Launch a local BM25 retriever server. |
| ```bash |
| conda activate retriever |
| |
| index_file=$save_path/bm25 |
| corpus_file=$save_path/wiki-18.jsonl |
| retriever_name=bm25 |
| |
| python search_r1/search/retrieval_server.py --index_path $index_file --corpus_path $corpus_file --topk 3 --retriever_name $retriever_name |
| ``` |
|
|
|
|
| ### Local Dense Retriever |
|
|
You can also adopt off-the-shelf dense retrievers, e.g., e5. These models are much stronger than sparse retrievers in some specific domains.
If you have sufficient GPUs, we recommend the flat indexing variant below; otherwise, you can adopt the ANN variant.
|
|
| #### Flat indexing |
|
|
Flat indexing conducts exact embedding matching, which is slow but very accurate. To make it efficient enough to support online RL, we recommend enabling **GPU** usage via ```--faiss_gpu```.
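Conceptually, flat search is a brute-force similarity scan over every corpus vector, which is why it benefits so much from GPU parallelism. A toy pure-Python sketch of what a flat index computes (the 4-d vectors are made up; real e5 embeddings are 768-d, and normalized embeddings make inner product equal cosine similarity):

```python
# Toy corpus of 4-dimensional "embeddings" (illustrative values only).
corpus = [
    [0.1, 0.9, 0.0, 0.2],
    [0.8, 0.1, 0.3, 0.1],
    [0.2, 0.2, 0.9, 0.1],
]

def flat_search(query_vec, corpus, topk=2):
    """Exact (flat) search: score every corpus vector, then sort by score."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:topk], [scores[i] for i in ranked[:topk]]

ids, scores = flat_search([0.7, 0.2, 0.3, 0.0], corpus)
print(ids)  # indices of the two closest corpus vectors by inner product
```

Because every query touches every vector, cost grows linearly with corpus size; for a corpus like wiki-18 this is only practical at RL-training throughput with GPU acceleration.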
|
|
(1) Download the index and corpus.
| ```bash |
| save_path=/the/path/to/save |
| python scripts/download.py --save_path $save_path |
| cat $save_path/part_* > $save_path/e5_Flat.index |
| gzip -d $save_path/wiki-18.jsonl.gz |
| ``` |
|
|
| (2) Launch a local flat e5 retriever server. |
|
|
| ```bash |
| conda activate retriever |
| |
| index_file=$save_path/e5_Flat.index |
| corpus_file=$save_path/wiki-18.jsonl |
| retriever_name=e5 |
| retriever_path=intfloat/e5-base-v2 |
| |
python search_r1/search/retrieval_server.py --index_path $index_file --corpus_path $corpus_file --topk 3 --retriever_name $retriever_name --retriever_model $retriever_path --faiss_gpu
```
|
|
|
|
| #### ANN indexing (HNSW64) |
|
|
To improve search efficiency with only **CPUs**, you can adopt approximate nearest neighbor (ANN) indexing, e.g., HNSW64.
It is very efficient, but may not be as accurate as flat indexing, especially when the number of retrieved passages is small.
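The accuracy trade-off can be illustrated with a deliberately crude approximation that scans only a candidate subset instead of the whole corpus (HNSW navigates a graph to pick candidates far more cleverly; the vectors and candidate set below are contrived purely to show why recall can drop):

```python
corpus = [
    [0.9, 0.1],  # the true nearest neighbor of the query
    [0.5, 0.5],
    [0.4, 0.6],
]
query = [1.0, 0.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Exact (flat) search scans everything and finds doc 0.
exact = max(range(len(corpus)), key=lambda i: dot(query, corpus[i]))

# An "approximate" search that only visits a candidate subset
# (here docs 1 and 2) can miss the true neighbor entirely.
candidates = [1, 2]
approx = max(candidates, key=lambda i: dot(query, corpus[i]))

print(exact, approx)
```

With a larger retrieval depth (topk), the true neighbor is more likely to appear somewhere in the approximate result list, which is why the accuracy gap is most visible at small topk.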
|
|
(1) Download the index.
| ```bash |
| save_path=/the/path/to/save |
| huggingface-cli download PeterJinGo/wiki-18-e5-index-HNSW64 --repo-type dataset --local-dir $save_path |
| cat $save_path/part_* > $save_path/e5_HNSW64.index |
| ``` |
|
|
|
|
| (2) Launch a local ANN dense retriever server. |
| ```bash |
| conda activate retriever |
| |
| index_file=$save_path/e5_HNSW64.index |
| corpus_file=$save_path/wiki-18.jsonl |
| retriever_name=e5 |
| retriever_path=intfloat/e5-base-v2 |
| |
| python search_r1/search/retrieval_server.py --index_path $index_file --corpus_path $corpus_file --topk 3 --retriever_name $retriever_name --retriever_model $retriever_path |
| ``` |
|
|
|
|
| ### Online Search Engine |
|
|
We support both the [Google Search API](https://developers.google.com/custom-search/v1/overview) and [SerpAPI](https://serpapi.com/). We recommend [SerpAPI](https://serpapi.com/) since it integrates multiple online search engine APIs (including Google, Bing, and Baidu) and has no monthly quota limitation; the [Google Search API](https://developers.google.com/custom-search/v1/overview) has a hard 10k-query monthly quota, which is not sufficient for online LLM RL training.
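To see why a 10k monthly quota is tight, consider a rough back-of-the-envelope estimate. The numbers below are illustrative assumptions, not measurements from Search-R1:

```python
# Illustrative training-scale assumptions (not measured values).
train_steps = 500          # RL training steps in one run
batch_size = 256           # rollouts generated per step
searches_per_rollout = 2   # search calls the agent issues per trajectory

total_queries = train_steps * batch_size * searches_per_rollout
print(total_queries)  # 256000 queries -- far beyond a 10k monthly quota
```

Even with much smaller batches or fewer steps, a single training run typically burns through tens of thousands of search calls, so a hard 10k cap is exhausted almost immediately.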
|
|
#### SerpAPI online search server
|
|
| ```bash |
| search_url=https://serpapi.com/search |
serp_api_key="" # put your SerpAPI key here (https://serpapi.com/)
| |
| python search_r1/search/serp_search_server.py --search_url $search_url --topk 3 --serp_api_key $serp_api_key |
| ``` |
|
|
| #### Google online search server |
|
|
| ```bash |
api_key="" # put your Google Custom Search API key here (https://developers.google.com/custom-search/v1/overview)
cse_id="" # put your Google CSE ID here (https://developers.google.com/custom-search/v1/overview)
| |
| python search_r1/search/google_search_server.py --api_key $api_key --topk 5 --cse_id $cse_id --snippet_only |
| ``` |
|
|
|
|