royrin committed · Commit 24f2dd5 · verified · 1 Parent(s): f696701

Upload README.md with huggingface_hub

Files changed (1): README.md (+52 −0, new file)

# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)

Quick start for downloading all of English Wikipedia and loading it into a RAG pipeline: given a query, the index returns the relevant Wikipedia article directly. It runs entirely offline, which saves on requests to Wikipedia.

Note: the index is built from the first three paragraphs of each Wikipedia page. To then get the full page, you can read it from a local copy of Wikipedia or make an API call for that page, as sketched below.
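
A minimal sketch of the API-call option, using the MediaWiki Action API's TextExtracts extension to fetch a page's plain text (the page title here is just an example, not something fixed by this repo):

```python
import requests

# Fetch the full plain-text extract of one page from the public Action API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": 1,
        "titles": "Printing press",  # example title
    },
)
pages = resp.json()["query"]["pages"]
full_text = next(iter(pages.values()))["extract"]
print(full_text[:500])
```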

Similar tools exist, but somehow nothing quite like this: other approaches require many HTTP requests to Wikipedia (for example, the LlamaIndex Wikipedia reader at https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=).

Wikipedia was downloaded on April 10, 2025; page-view statistics come from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.

## 🛠️ Usage

```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download the index and metadata from the Hub
# (PATH_TO_INDEX and PATH_TO_METADATA are the filenames in this repo)
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load the FAISS index
index = faiss.read_index(index_path)

# Load the metadata (assumed pickled, matching the pickle import)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```
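
Continuing from the snippet above, a minimal query sketch, assuming the vectors were produced by a sentence-transformers model and that `metadata` maps index rows to article info (the model name below is a placeholder, not the one this repo actually used):

```python
from sentence_transformers import SentenceTransformer

# Placeholder model: substitute the embedding model the index was built with.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "history of the printing press"
query_vec = model.encode([query]).astype("float32")

# Retrieve the 5 nearest abstracts; FAISS returns distances and row indices.
distances, indices = index.search(query_vec, 5)
for rank, idx in enumerate(indices[0]):
    print(rank, metadata[idx])  # assumes row-aligned metadata
```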

## Do It Yourself, from Scratch
1. Download the full English Wikipedia dump (~22 GB; about 2 hours over wget, ~30 minutes with aria2c):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract Wikipedia into machine-readable JSON:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles by page views from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`
4. Load the abstracts into the RAG index (a sketch follows this list).
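
A possible sketch of step 4, assuming WikiExtractor's JSON output (one `{"title": ..., "text": ...}` object per line) and a sentence-transformers embedding model; the model name and output filenames here are illustrative assumptions, not fixed by this repo:

```python
import glob
import json
import pickle

import faiss
from sentence_transformers import SentenceTransformer

# Placeholder embedding model; use whichever model you standardize on.
model = SentenceTransformer("all-MiniLM-L6-v2")

titles, abstracts = [], []
# WikiExtractor writes JSON-lines files (wiki_00, wiki_01, ...) under extracted/
for path in glob.glob("extracted/**/wiki_*", recursive=True):
    with open(path) as f:
        for line in f:
            article = json.loads(line)
            # Keep roughly the first three paragraphs as the abstract
            abstract = "\n".join(article["text"].split("\n")[:3])
            titles.append(article["title"])
            abstracts.append(abstract)

# Embed the abstracts and build a flat L2 FAISS index
embeddings = model.encode(abstracts, show_progress_bar=True).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Persist the index and the row-aligned titles
faiss.write_index(index, "wiki.index")
with open("wiki_meta.pkl", "wb") as f:
    pickle.dump(titles, f)
```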

## Helpful Links
1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`