Build Dataset Guide
This guide explains how to use build_dataset.py to create the pre-embedded ArXiv corpus required by Agentic-RAG's HF Spaces deployment.
Overview
The retriever tool loads a HuggingFace Dataset of pre-computed 384-dim embeddings at startup and builds a FAISS index in memory. build_dataset.py is the one-time script that creates this dataset.
jamescalam/ai-arxiv2-chunks → embed locally → ./dataset_cache → push to HF Hub
(~241k raw chunks) (SentenceTransformer) (Arrow format) (your repo)
Prerequisites
Install the required packages into any Python 3.10+ environment:
pip install datasets sentence-transformers huggingface-hub numpy tqdm torch
GPU users: install the CUDA-enabled torch build for your driver version. See pytorch.org/get-started. CPU-only torch also works; it just runs slower.
A HuggingFace account with a write-access token is required for the push step. Generate one at huggingface.co/settings/tokens.
Recommended: Two-Step Workflow
Separating the embed step from the push step means a failed or interrupted upload does not require re-embedding everything.
Step 1: Embed and save locally
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
This streams jamescalam/ai-arxiv2-chunks from HF Hub, embeds each chunk with all-MiniLM-L6-v2, and saves an Arrow dataset to ./dataset_cache/.
| Flag | Description |
|---|---|
| --limit N | Process only the first N chunks. Omit for the full ~241k corpus. |
| --output-dir PATH | Where to save the Arrow dataset (default: ./dataset_cache). |
| --device auto\|cpu\|cuda\|mps | Compute device (default: auto). See the GPU Acceleration section below. |
Expected output:
INFO: GPU detected (CUDA): NVIDIA GeForce RTX 3080
INFO: Loading source dataset: jamescalam/ai-arxiv2-chunks
INFO: Rows to embed: 5000 (of 241874 available)
INFO: Loading embedding model: sentence-transformers/all-MiniLM-L6-v2 [device=cuda]
INFO: Embedding 5000 chunks [batch_size=256, estimated time ~10s] …
...
INFO: Embedding complete in 9.3s (537 chunks/s)
INFO: Saved 5000 rows (8.2 MB) → ./dataset_cache
Step 2: Push to HuggingFace Hub
python build_dataset.py --push-only \
--output-dir ./dataset_cache \
--repo-id myuser/agentic-rag-chunks \
--hf-token hf_xxxxxxxxxxxxxxxxxxxx
Or set the token in your environment to avoid passing it on the command line:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
python build_dataset.py --push-only \
--output-dir ./dataset_cache \
--repo-id myuser/agentic-rag-chunks
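The token lookup described above might be sketched as follows. The precedence shown here (an explicit --hf-token value wins over HF_TOKEN) and the function name resolve_token are assumptions for illustration; the script's actual logic may differ.

```python
import os
from typing import Mapping, Optional


def resolve_token(
    cli_token: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the token to use for the push step, or None if none is set.

    Assumption: a value passed via --hf-token takes precedence over HF_TOKEN.
    """
    if cli_token:
        return cli_token
    # Treat an empty HF_TOKEN the same as an unset one.
    return env.get("HF_TOKEN") or None
```

Passing the environment as a parameter keeps the helper easy to check without touching the real process environment.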
The repository named by --repo-id is created automatically if it does not exist. To create a private dataset:
python build_dataset.py --push-only --output-dir ./dataset_cache \
--repo-id myuser/agentic-rag-chunks --private
One-Shot Workflow
Build and push in a single command:
python build_dataset.py \
--limit 5000 \
--repo-id myuser/agentic-rag-chunks \
--hf-token hf_xxxxxxxxxxxxxxxxxxxx
GPU Acceleration
The script detects the best available device automatically (--device auto):
| Priority | Device | Typical throughput | Time for 5k chunks |
|---|---|---|---|
| 1 | CUDA (NVIDIA GPU) | ~500 chunks/s | ~10s |
| 2 | MPS (Apple Silicon) | ~300 chunks/s | ~17s |
| 3 | CPU | ~40 chunks/s | ~2 min |
GPU batch size is automatically set to 256; CPU batch size is 64.
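The priority order and batch sizes above can be expressed as a small pure function. This is a sketch, not the script's code: the name pick_device and the constant BATCH_SIZE_CPU are hypothetical, and treating MPS as a "GPU" for batch-size purposes is an assumption.

```python
from typing import Tuple

# Batch sizes quoted in this guide. BATCH_SIZE_CPU is an illustrative name.
BATCH_SIZE_GPU = 256
BATCH_SIZE_CPU = 64


def pick_device(
    requested: str, cuda_available: bool, mps_available: bool
) -> Tuple[str, int]:
    """Return (device, batch_size) following the CUDA > MPS > CPU priority."""
    if requested != "auto":
        device = requested  # an explicit --device value is used as given
    elif cuda_available:
        device = "cuda"
    elif mps_available:
        device = "mps"
    else:
        device = "cpu"
    # Assumption: both CUDA and MPS use the GPU batch size.
    batch_size = BATCH_SIZE_CPU if device == "cpu" else BATCH_SIZE_GPU
    return device, batch_size
```

With torch installed, the two availability flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().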
Override the device:
# Force CPU (e.g. to benchmark or avoid VRAM issues)
python build_dataset.py --build-only --limit 5000 --device cpu
# Force a specific CUDA device (multi-GPU machine)
CUDA_VISIBLE_DEVICES=1 python build_dataset.py --build-only --limit 5000
Verify that the GPU is being used:
# In another terminal while the script runs
watch -n 1 nvidia-smi
Corpus Size Reference
| --limit | Chunks | Disk size | CPU time | GPU time |
|---|---|---|---|---|
| 5000 | 5 000 | ~8 MB | ~2 min | ~10s |
| 50000 | 50 000 | ~80 MB | ~20 min | ~2 min |
| (omit) | ~241 000 | ~370 MB | ~2–4 hrs | ~8 min |
For HF Spaces demos, --limit 5000 gives good coverage of the corpus at minimal build time.
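For other --limit values, a rough time estimate follows from dividing the chunk count by the typical throughput figures in the GPU Acceleration table. The sketch below does only that arithmetic; the function name estimate_seconds is hypothetical, and real runs also include model loading and dataset streaming, so treat the result as a lower bound.

```python
# Typical throughputs (chunks per second) from the GPU Acceleration table.
THROUGHPUT = {"cuda": 500, "mps": 300, "cpu": 40}


def estimate_seconds(n_chunks: int, device: str) -> float:
    """Estimate embedding time as chunk count divided by typical throughput."""
    return n_chunks / THROUGHPUT[device]


if __name__ == "__main__":
    for limit in (5_000, 50_000):
        for device in ("cuda", "cpu"):
            minutes = estimate_seconds(limit, device) / 60
            print(f"--limit {limit} on {device}: about {minutes:.1f} min")
```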
Overwriting an Existing Dataset
By default the push step refuses to overwrite an existing HF Hub dataset as a safety guard:
ERROR: Dataset 'myuser/agentic-rag-chunks' already exists on HF Hub.
Pass --force to overwrite it.
To overwrite intentionally:
python build_dataset.py --push-only --output-dir ./dataset_cache \
--repo-id myuser/agentic-rag-chunks --force
Similarly, --build-only refuses to write into an existing --output-dir. Delete it first:
rm -rf ./dataset_cache
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
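The local guard can be pictured as a simple existence check before any embedding work starts. This is a minimal sketch under assumptions: the function name check_output_dir and the choice of FileExistsError are illustrative, and only the message text is taken from the troubleshooting table.

```python
from pathlib import Path


def check_output_dir(output_dir: str) -> Path:
    """Raise if output_dir already exists; otherwise return it as a Path."""
    path = Path(output_dir)
    if path.exists():
        # Fail before embedding so an earlier build is never clobbered.
        raise FileExistsError(f"Output directory already exists: {path}")
    return path
```

Checking up front means a mistaken re-run fails in milliseconds rather than after the embedding pass.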
After Pushing
Set HF_DATASET_REPO as a secret in your HF Space:
HF_DATASET_REPO=myuser/agentic-rag-chunks
If the dataset is private, also set:
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
The dataset is loaded at container startup and a FAISS index is built in memory (~60s for 5k chunks, ~5 min for 241k).
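This guide does not state which FAISS index type the retriever builds. Assuming a flat inner-product index, the search it performs is equivalent to the exhaustive scan below: because the embeddings are L2-normalised, the inner product equals cosine similarity. This pure-Python stand-in only illustrates the computation; it is not FAISS and is far slower.

```python
from typing import List, Sequence, Tuple


def top_k(
    query: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k (row_index, score) pairs with the highest inner product."""
    scores = [
        (i, sum(q * e for q, e in zip(query, emb)))
        for i, emb in enumerate(embeddings)
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

The query vector must be normalised the same way as the stored embeddings for the scores to be comparable.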
Dataset Schema
| Column | Type | Description |
|---|---|---|
| chunk_id | string | Original chunk-id from the source dataset |
| content | string | Raw chunk text |
| title | string | Paper title |
| source_url | string | https://arxiv.org/abs/<doi> |
| full_citation | string | <title> arXiv:<doi> |
| embedding | float32[384] | L2-normalised embedding from all-MiniLM-L6-v2 |
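Before pushing, it can be useful to spot-check a few rows against this schema. The sketch below checks that all six columns are present, that the embedding has 384 dimensions, and that its L2 norm is close to 1. The function name validate_row and the tolerance are assumptions for illustration.

```python
import math
from typing import Any, List, Mapping

EXPECTED_COLUMNS = (
    "chunk_id", "content", "title", "source_url", "full_citation", "embedding",
)
EMBEDDING_DIM = 384


def validate_row(row: Mapping[str, Any], tol: float = 1e-3) -> List[str]:
    """Return a list of problems found in one row (empty if it looks valid)."""
    problems = [f"missing column: {c}" for c in EXPECTED_COLUMNS if c not in row]
    emb = row.get("embedding")
    if emb is not None:
        if len(emb) != EMBEDDING_DIM:
            problems.append(
                f"embedding has {len(emb)} dims, expected {EMBEDDING_DIM}"
            )
        # L2 norm of a normalised embedding should be 1 within tolerance.
        norm = math.sqrt(sum(x * x for x in emb))
        if abs(norm - 1.0) > tol:
            problems.append(f"embedding norm is {norm:.4f}, expected 1.0")
    return problems
```

A row loaded from the saved Arrow dataset behaves like a dict, so it can be passed to this helper directly.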
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| 403 Forbidden: You don't have the rights to create a dataset | Token missing or read-only | Generate a write token at huggingface.co/settings/tokens |
| Output directory already exists | Previous --build-only run left data | Delete ./dataset_cache and re-run, or use --push-only |
| CUDA out of memory | Batch too large for VRAM | Add --device cpu or reduce BATCH_SIZE_GPU in the script |
| RepositoryNotFoundError on push | --repo-id typo or wrong namespace | Check the exact username at huggingface.co |
| Dataset loads but FAISS search returns no results | Embeddings not L2-normalised | Rebuild with the current script (normalize_embeddings=True is set) |