# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)

This is a quick start for downloading all of English Wikipedia and loading it into a RAG index. The code returns the relevant Wikipedia article for a query directly, and it runs entirely offline, so it saves on requests to Wikipedia.

Note: the index is built from the first 3 paragraphs of each Wikipedia page. To then get the full page, you can read it from a local copy of Wikipedia or make an API call for that page, as in the sketch below.
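
For the API route, here is a minimal sketch using the MediaWiki `extracts` API; the article title is only an example, and the single dependency is `requests`.

```python
# Sketch: fetch the full plain text of one article via the MediaWiki API
# (only needed when the 3-paragraph abstract in the index is not enough).
import requests

def fetch_full_page(title: str) -> str:
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "titles": title,
            "format": "json",
            "formatversion": 2,
        },
        timeout=30,
    )
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return pages[0].get("extract", "")

print(fetch_full_page("Alan Turing")[:500])  # example title
```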
Similar tools exist, but nothing quite like this: other options (such as https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=) require many HTTP requests to Wikipedia.

Date of the Wikipedia download: April 10, 2025, from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.

## 🛠️ Usage
```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download the index and metadata from the Hub
# (replace PATH_TO_INDEX / PATH_TO_METADATA with the filenames in this repo)
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load the FAISS index
index = faiss.read_index(index_path)

# Load the metadata that maps index rows back to Wikipedia articles
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```
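
The snippet above only loads the index. Below is a minimal sketch of querying it; the embedding model name and the structure of `metadata` are assumptions (they are not stated here), and the model must match whatever was used to build the index.

```python
from sentence_transformers import SentenceTransformer

# Placeholder model: must be the same embedding model the index was built with.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Who developed the theory of general relativity?"
query_vec = model.encode([query]).astype("float32")  # FAISS expects float32 vectors

# Retrieve the 5 nearest abstracts and look up their articles in the metadata
distances, indices = index.search(query_vec, 5)
for i in indices[0]:
    print(metadata[i])  # assumes metadata is indexable by row id
```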
## Do It Yourself, from Scratch

1. Download the full Wikipedia dump (~22 GB; roughly 2 hours over wget, ~30 minutes with aria2c):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract the articles into machine-readable JSON:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles by page views from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/` (see the ranking sketch after this list).
4. Load the abstracts into the RAG index (see the index-building sketch after this list).
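
A rough sketch of step 3: rank article titles by page views from one hourly pageviews file. The file name and the `en` domain-code filter are assumptions about the dump layout; adjust them to the files you actually download.

```python
import gzip
from collections import Counter

views = Counter()
# Assumed line format: "domain_code page_title count_views total_response_size";
# "en" is the domain code for English Wikipedia (desktop).
with gzip.open("pageviews-20241201-000000.gz", "rt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        parts = line.split(" ")
        if len(parts) >= 3 and parts[0] == "en":
            try:
                views[parts[1]] += int(parts[2])
            except ValueError:
                continue

top_articles = [title for title, _ in views.most_common(100_000)]
```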
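
And a hedged sketch of step 4: embed each selected abstract and write a FAISS index plus a pickled metadata file, mirroring the artifacts hosted in this repo. The embedding model, the WikiExtractor output layout, and the output file names are illustrative assumptions, not the exact recipe behind the published index.

```python
import glob
import json
import pickle

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; any sentence embedder works

top_articles_set = set(top_articles)  # from the ranking sketch above

titles, abstracts = [], []
# WikiExtractor --json writes JSON-lines files under extracted/ (one article per line,
# with "title" and "text" fields); keep roughly the first 3 paragraphs of each article.
for path in glob.glob("extracted/**/wiki_*", recursive=True):
    with open(path, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            # pageviews titles use underscores, WikiExtractor titles use spaces
            if article["title"].replace(" ", "_") in top_articles_set:
                titles.append(article["title"])
                abstracts.append("\n".join(article["text"].split("\n")[:3]))

embeddings = model.encode(abstracts, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 search over the embeddings
index.add(embeddings)

faiss.write_index(index, "wiki.index")
with open("wiki_metadata.pkl", "wb") as f:
    pickle.dump(titles, f)  # row i of the index corresponds to titles[i]
```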
## Helpful Links

1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`