# Wiki-RAG

This repository hosts a prebuilt FAISS index and metadata for Retrieval-Augmented Generation (RAG) over English Wikipedia.

📍 Hugging Face Hub: [royrin/wiki-rag](https://huggingface.co/royrin/wiki-rag)
Quick start for downloading all of English Wikipedia and loading it into a RAG index. Given a query, the index returns the relevant Wikipedia article directly. Retrieval is entirely offline, so it saves requests to Wikipedia.

Note: the index is built from the first 3 paragraphs of each Wikipedia page. To get the full page afterwards, you can read it from a local copy of Wikipedia or make an API call for that page (a sketch of the API option is shown below).

Similar tools exist, but nothing quite like this: most alternatives (for example, https://llamahub.ai/l/readers/llama-index-readers-wikipedia?from=) require many HTTP requests to Wikipedia.
Wikipedia was downloaded on April 10, 2025; articles were ranked using page views from `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`.
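Since the index only stores the first 3 paragraphs, here is a minimal sketch of the API-call option for fetching the full article text. It assumes the `requests` package is installed and that you already have the article title from the retrieval metadata:

```python
import requests

def fetch_full_article(title: str) -> str:
    """Fetch the full plain-text extract of an English Wikipedia page by title."""
    # MediaWiki Action API; "explaintext" returns plain text instead of HTML.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,
            "format": "json",
            "titles": title,
        },
        timeout=30,
    )
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    # The response is keyed by page ID; take the single entry.
    return next(iter(pages.values())).get("extract", "")

print(fetch_full_article("Alan Turing")[:500])
```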
## 🛠️ Usage
```python
from huggingface_hub import hf_hub_download
import faiss
import pickle

# Download the index and metadata (replace the placeholders with the actual
# filenames in the repo, e.g. inside wiki_index__top_100000__2025-04-11/).
index_path = hf_hub_download("royrin/wiki-rag", PATH_TO_INDEX)
meta_path = hf_hub_download("royrin/wiki-rag", PATH_TO_METADATA)

# Load the FAISS index
index = faiss.read_index(index_path)

# Load the metadata (assumed to be a pickle aligned with the index rows)
with open(meta_path, "rb") as f:
    metadata = pickle.load(f)
```
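To retrieve articles, embed your query with the same model that was used to build the index and search the FAISS index. The snippet below is only a sketch: it assumes `metadata` is a list aligned with the index rows, and the `sentence-transformers` model name is a placeholder — substitute whichever embedding model the index was actually built with.

```python
from sentence_transformers import SentenceTransformer

# Placeholder model name -- use the model the index was actually built with.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Who invented the World Wide Web?"
query_vec = model.encode([query]).astype("float32")

# Retrieve the 5 nearest abstracts and their metadata.
distances, ids = index.search(query_vec, 5)
for rank, idx in enumerate(ids[0]):
    print(rank, float(distances[0][rank]), metadata[idx])
```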
To download a single index folder from the repo, create a script `download_folder.sh`:

```bash
#!/bin/bash
REPO_URL=https://huggingface.co/royrin/wiki-rag
TARGET_DIR=KLOM-models  # name it what you wish
FOLDER=$1               # e.g., "wiki_index__top_100000__2025-04-11"

mkdir -p "$TARGET_DIR"
git clone --filter=blob:none --no-checkout "$REPO_URL" "$TARGET_DIR"
cd "$TARGET_DIR"
git sparse-checkout init --cone
git sparse-checkout set "$FOLDER"
git checkout main
```
Example of how to run the script:

```bash
bash download_folder.sh wiki_index__top_100000__2025-04-11
```
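If you would rather stay in Python, `huggingface_hub.snapshot_download` can also fetch just one folder from the repo; a small sketch, using the same example folder name as above:

```python
from huggingface_hub import snapshot_download

# Download only the files matching one index folder from the repo.
local_dir = snapshot_download(
    repo_id="royrin/wiki-rag",
    allow_patterns=["wiki_index__top_100000__2025-04-11/*"],
    local_dir="KLOM-models",
)
print(local_dir)
```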
## Do It Yourself, from Scratch
1. Download the full English Wikipedia dump (~22 GB; about 2 hours with `wget`, ~30 minutes with `aria2c`):
   `aria2c -x 16 -s 16 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2`
2. Extract the dump into machine-readable JSON with WikiExtractor:
   `python3 WikiExtractor.py ../enwiki-latest-pages-articles.xml.bz2 -o extracted --json`
3. Get the list of the top 100k or 1M articles by page views from
   `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/` (see the first sketch after this list).
4. Load the abstracts into the RAG index (see the second sketch after this list).
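For step 3, each hourly page-view dump is a gzipped text file whose lines look like `project page_title view_count bytes`. Below is a sketch that ranks English-Wikipedia titles across the downloaded files; the `pageviews/` directory name and file glob are just assumptions about where you saved them:

```python
import gzip
from collections import Counter
from pathlib import Path

def top_titles(pageview_files, n=100_000):
    """Rank English-Wikipedia titles by total view count across hourly dumps."""
    counts = Counter()
    for path in pageview_files:
        with gzip.open(path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) < 3:
                    continue
                project, title, views = parts[0], parts[1], parts[2]
                if project == "en" and views.isdigit():  # "en" = English Wikipedia desktop
                    counts[title] += int(views)
    return [title for title, _ in counts.most_common(n)]

files = sorted(Path("pageviews").glob("pageviews-202412*.gz"))
top = top_titles(files, n=100_000)
```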
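For step 4, one possible pipeline (a sketch, not necessarily how the published index was built): read the JSON lines written by WikiExtractor, keep the first 3 paragraphs of each selected article (reusing `top` from the previous sketch), embed them with an embedding model of your choice (the `sentence-transformers` model below is only a placeholder), and write out a FAISS index plus a pickled metadata list:

```python
import json
import pickle
from pathlib import Path

import faiss
from sentence_transformers import SentenceTransformer

# Page-view titles use underscores, WikiExtractor titles use spaces.
keep = {t.replace("_", " ") for t in top}
titles, abstracts = [], []

# WikiExtractor --json writes one JSON object per line, with "title" and "text".
for path in Path("extracted").rglob("wiki_*"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if doc["title"] not in keep:
                continue
            paragraphs = [p for p in doc["text"].split("\n") if p.strip()]
            titles.append(doc["title"])
            abstracts.append("\n".join(paragraphs[:3]))  # first 3 paragraphs

# Placeholder embedding model -- use the same model you will embed queries with.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(abstracts, show_progress_bar=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

faiss.write_index(index, "wiki.index")
with open("metadata.pkl", "wb") as f:
    pickle.dump(titles, f)
```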
## Helpful Links

1. Wikipedia downloads: `https://dumps.wikimedia.org/enwiki/latest/`
2. Wikipedia page views: `https://dumps.wikimedia.org/other/pageviews/2024/2024-12/`