# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)
Quick start for downloading all of English Wikipedia and loading it into a RAG index. Given a query, the index returns the relevant Wikipedia article directly. Retrieval is entirely offline, so it saves requests to Wikipedia.

Note: the index is built from the first 3 paragraphs of each Wikipedia page. To get the full page afterwards, you can read it from a local copy of Wikipedia or make an API call for that page (a sketch of the API option is shown below).

Similar tools exist, but nothing quite like this: most alternatives (for example, https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=) require many HTTP requests to Wikipedia.
Wikipedia was downloaded on April 10, 2025; articles were ranked using page views from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.
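Since the index only stores the first 3 paragraphs, here is a minimal sketch of the API-call option for fetching the full article text. It assumes the `requests` package is installed and that you already have the article title from the retrieval metadata:

```python
import requests

def fetch_full_article(title: str) -> str:
    """Fetch the full plain-text extract of an English Wikipedia page by title."""
    # MediaWiki Action API; "explaintext" returns plain text instead of HTML.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "format": "json",
            "titles": title,
        },
        timeout=30,
    )
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    # The response is keyed by page ID; take the single entry.
    return next(iter(pages.values())).get("extract", "")

print(fetch_full_article("Alan Turing")[:500])
```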
## 🛠️ Usage
```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download the index and metadata (replace the placeholders with the actual
# filenames in the repo, e.g. inside wiki_index__top_100000__2025-04-11/).
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load the FAISS index
index = faiss.read_index(index_path)

# Load the metadata (assumed to be a pickle aligned with the index rows)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```
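To retrieve articles, embed your query with the same model that was used to build the index and search the FAISS index. The snippet below is only a sketch: it assumes `metadata` is a list aligned with the index rows, and the `sentence-transformers` model name is a placeholder — substitute whichever embedding model the index was actually built with.

```python
from sentence_transformers import SentenceTransformer

# Placeholder model name -- use the model the index was actually built with.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Who invented the World Wide Web?"
query_vec = model.encode([query]).astype("float32")

# Retrieve the 5 nearest abstracts and their metadata.
distances, ids = index.search(query_vec, 5)
for rank, idx in enumerate(ids[0]):
    print(rank, float(distances[0][rank]), metadata[idx])
```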
To download a single index folder from the repo, create a script `download_folder.sh`:

```bash
#!/bin/bash
REPO_URL=https://huggingface.co/royrin/wiki-rag
TARGET_DIR=KLOM-models  # name it what you wish
FOLDER=$1               # e.g., "wiki_index__top_100000__2025-04-11"

mkdir -p "$TARGET_DIR"
git clone --filter=blob:none --no-checkout "$REPO_URL" "$TARGET_DIR"
cd "$TARGET_DIR"
git sparse-checkout init --cone
git sparse-checkout set "$FOLDER"
git checkout main
```
Example of how to run the script:

```bash
bash download_folder.sh wiki_index__top_100000__2025-04-11
```
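If you would rather stay in Python, `huggingface_hub.snapshot_download` can also fetch just one folder from the repo; a small sketch, using the same example folder name as above:

```python
from huggingface_hub import snapshot_download

# Download only the files matching one index folder from the repo.
local_dir = snapshot_download(
    repo_id="royrin/wiki-rag",
    allow_patterns=["wiki_index__top_100000__2025-04-11/*"],
    local_dir="KLOM-models",
)
print(local_dir)
```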
## Do It Yourself, from Scratch
1. Download the full English Wikipedia dump (~22 GB; about 2 hours with `wget`, ~30 minutes with `aria2c`):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract the dump into machine-readable JSON with WikiExtractor:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles by page views from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/` (see the first sketch after this list).
4. Load the abstracts into the RAG index (see the second sketch after this list).
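For step 3, each hourly page-view dump is a gzipped text file whose lines look like `project page_title view_count bytes`. Below is a sketch that ranks English-Wikipedia titles across the downloaded files; the `pageviews/` directory name and file glob are just assumptions about where you saved them:

```python
import gzip
from collections import Counter
from pathlib import Path

def top_titles(pageview_files, n=100_000):
    """Rank English-Wikipedia titles by total view count across hourly dumps."""
    counts = Counter()
    for path in pageview_files:
        with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) < 3:
                    continue
                project, title, views = parts[0], parts[1], parts[2]
                if project == "en" and views.isdigit():  # "en" = English Wikipedia desktop
                    counts[title] += int(views)
    return [title for title, _ in counts.most_common(n)]

files = sorted(Path("pageviews").glob("pageviews-202412*.gz"))
top = top_titles(files, n=100_000)
```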
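For step 4, one possible pipeline (a sketch, not necessarily how the published index was built): read the JSON lines written by WikiExtractor, keep the first 3 paragraphs of each selected article (reusing `top` from the previous sketch), embed them with an embedding model of your choice (the `sentence-transformers` model below is only a placeholder), and write out a FAISS index plus a pickled metadata list:

```python
import json
import pickle
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

# Page-view titles use underscores, WikiExtractor titles use spaces.
keep = {t.replace("_", " ") for t in top}
titles, abstracts = [], []

# WikiExtractor --json writes one JSON object per line, with "title" and "text".
for path in Path("extracted").rglob("wiki_*"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if doc["title"] not in keep:
                continue
            paragraphs = [p for p in doc["text"].split("\n") if p.strip()]
            titles.append(doc["title"])
            abstracts.append("\n".join(paragraphs[:3]))  # first 3 paragraphs

# Placeholder embedding model -- use the same model you will embed queries with.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(abstracts, show_progress_bar=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

faiss.write_index(index, "wiki.index")
with open("metadata.pkl", "wb") as f:
    pickle.dump(titles, f)
```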
## Helpful Links

1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`