
Build Dataset Guide

This guide explains how to use build_dataset.py to create the pre-embedded ArXiv corpus required by Agentic-RAG's HF Spaces deployment.


Overview

The retriever tool loads a HuggingFace Dataset of pre-computed 384-dim embeddings at startup and builds a FAISS index in memory. build_dataset.py is the one-time script that creates this dataset.

jamescalam/ai-arxiv2-chunks  →  embed locally  →  ./dataset_cache  →  push to HF Hub
(~241k raw chunks)              (SentenceTransformer)  (Arrow format)    (your repo)
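
To make the startup step more concrete, here is a minimal pure-Python sketch of the kind of search a flat inner-product index performs over L2-normalised vectors. It is a stand-in, not the retriever's code: this guide does not say which FAISS index type is used, and `l2_normalise`, `top_k`, and the toy 3-dim vectors are hypothetical names and data chosen for illustration.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, corpus, k=2):
    """Return (index, score) pairs for the k rows with the highest inner product.

    For unit-length vectors the inner product equals cosine similarity,
    which is why storing L2-normalised embeddings is convenient.
    """
    q = l2_normalise(query)
    scores = [(i, sum(a * b for a, b in zip(q, row))) for i, row in enumerate(corpus)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dim corpus standing in for the real 384-dim embeddings.
corpus = [l2_normalise(v) for v in ([1, 0, 0], [0.9, 0.1, 0], [0, 1, 0])]
hits = top_k([1, 0, 0], corpus, k=2)
```

A real deployment would hand the embedding matrix to FAISS instead of looping in Python; the ranking logic is the part this sketch is meant to show.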

Prerequisites

Install the required packages into any Python 3.10+ environment:

pip install datasets sentence-transformers huggingface-hub numpy tqdm torch

GPU users: install the CUDA-enabled torch build for your driver version. See pytorch.org/get-started. CPU-only torch also works; it just runs slower.

A HuggingFace account with a write-access token is required for the push step. Generate one at huggingface.co/settings/tokens.


Recommended: Two-Step Workflow

Separating the embed step from the push step means a failed or interrupted upload does not require re-embedding everything.

Step 1: Embed and save locally

python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache

This streams jamescalam/ai-arxiv2-chunks from HF Hub, embeds each chunk with all-MiniLM-L6-v2, and saves an Arrow dataset to ./dataset_cache/.

| Flag | Description |
|---|---|
| --limit N | Process only the first N chunks. Omit for the full ~241k corpus. |
| --output-dir PATH | Where to save the Arrow dataset (default: ./dataset_cache). |
| --device auto\|cpu\|cuda\|mps | Compute device (default: auto). See GPU Acceleration below. |

Expected output:

INFO: GPU detected (CUDA): NVIDIA GeForce RTX 3080
INFO: Loading source dataset: jamescalam/ai-arxiv2-chunks
INFO: Rows to embed: 5000 (of 241874 available)
INFO: Loading embedding model: sentence-transformers/all-MiniLM-L6-v2  [device=cuda]
INFO: Embedding 5000 chunks  [batch_size=256, estimated time ~10s] …
...
INFO: Embedding complete in 9.3s (537 chunks/s)
INFO: Saved 5000 rows (8.2 MB) → ./dataset_cache
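
The embed step can be pictured as a batched loop that also measures throughput, as in the log above. The sketch below is an illustration under assumptions, not the script's actual code: `batched`, `embed_chunks`, and `fake_encode` are hypothetical names, and the encoder is passed in as a plain callable so the example runs without downloading a model.

```python
import time


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_chunks(texts, encode, batch_size=256):
    """Embed `texts` batch by batch and report elapsed time and throughput.

    `encode` is any callable mapping a list of strings to a list of vectors.
    With sentence-transformers it could be, for example:
        lambda batch: model.encode(batch, normalize_embeddings=True).tolist()
    """
    start = time.perf_counter()
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(batch))
    elapsed = time.perf_counter() - start
    rate = len(texts) / elapsed if elapsed > 0 else float("inf")
    return vectors, elapsed, rate


def fake_encode(batch):
    """Stand-in encoder (assumption): one 1-dim vector per text."""
    return [[float(len(text))] for text in batch]


vectors, elapsed, rate = embed_chunks(["a", "bb", "ccc"], fake_encode, batch_size=2)
```

Passing the encoder in as a callable keeps the batching and timing logic testable on its own; swapping in a real model only changes the `encode` argument.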

Step 2: Push to HuggingFace Hub

python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx

Or set the token in your environment to avoid passing it on the command line:

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks

The repository named by --repo-id is created automatically if it does not exist. To create it as a private dataset:

python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --private
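
For orientation, the flags used throughout this guide could be wired up roughly as follows. This is a reconstruction from the documented flags only; the real parser in build_dataset.py may differ. In particular, treating --build-only and --push-only as mutually exclusive is an assumption, and `build_parser` and `resolve_token` are hypothetical names.

```python
import argparse
import os


def build_parser():
    """Argument parser covering the flags documented in this guide."""
    parser = argparse.ArgumentParser(prog="build_dataset.py")
    mode = parser.add_mutually_exclusive_group()  # assumption: modes cannot be combined
    mode.add_argument("--build-only", action="store_true", help="embed and save locally, skip the push")
    mode.add_argument("--push-only", action="store_true", help="push an existing --output-dir, skip embedding")
    parser.add_argument("--limit", type=int, default=None, help="process only the first N chunks")
    parser.add_argument("--output-dir", default="./dataset_cache")
    parser.add_argument("--device", choices=["auto", "cpu", "cuda", "mps"], default="auto")
    parser.add_argument("--repo-id", help="target dataset repo, e.g. myuser/agentic-rag-chunks")
    parser.add_argument("--hf-token", help="write-access token; falls back to HF_TOKEN")
    parser.add_argument("--private", action="store_true")
    parser.add_argument("--force", action="store_true")
    return parser


def resolve_token(args, env=os.environ):
    """Prefer --hf-token; otherwise fall back to the HF_TOKEN environment variable."""
    return args.hf_token or env.get("HF_TOKEN")


args = build_parser().parse_args(["--push-only", "--repo-id", "myuser/agentic-rag-chunks"])
token = resolve_token(args, env={"HF_TOKEN": "hf_example"})
```

Reading the token from the environment keeps it out of shell history and process listings, which is the reason the guide suggests exporting HF_TOKEN.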

One-Shot Workflow

Build and push in a single command:

python build_dataset.py \
    --limit 5000 \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx

GPU Acceleration

The script detects the best available device automatically (--device auto):

| Priority | Device | Typical throughput | Time for 5k chunks |
|---|---|---|---|
| 1 | CUDA (NVIDIA GPU) | ~500 chunks/s | ~10s |
| 2 | MPS (Apple Silicon) | ~300 chunks/s | ~17s |
| 3 | CPU | ~40 chunks/s | ~2 min |

GPU batch size is automatically set to 256; CPU batch size is 64.
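
The priority order and batch sizes above can be sketched as a small function. This is an illustration, not the script's code: `pick_device` is a hypothetical name, the availability flags are passed in so the logic runs without torch installed, and applying the GPU batch size of 256 to MPS as well as CUDA is an assumption.

```python
def pick_device(requested="auto", cuda_available=False, mps_available=False):
    """Resolve a --device value to a concrete device and batch size.

    In practice the two flags would typically come from
    torch.cuda.is_available() and torch.backends.mps.is_available().
    """
    if requested == "auto":
        if cuda_available:
            device = "cuda"   # priority 1
        elif mps_available:
            device = "mps"    # priority 2
        else:
            device = "cpu"    # priority 3
    else:
        device = requested    # explicit override, e.g. --device cpu
    # Per the guide: GPU batch size 256, CPU batch size 64.
    batch_size = 64 if device == "cpu" else 256
    return device, batch_size
```

For example, `pick_device("auto", cuda_available=True)` resolves to CUDA with the larger batch, while `pick_device("cpu", cuda_available=True)` honours the override even on a GPU machine.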

Override the device:

# Force CPU (e.g. to benchmark or avoid VRAM issues)
python build_dataset.py --build-only --limit 5000 --device cpu

# Force a specific CUDA device (multi-GPU machine)
CUDA_VISIBLE_DEVICES=1 python build_dataset.py --build-only --limit 5000

Verify GPU is being used:

# In another terminal while the script runs
watch -n 1 nvidia-smi

Corpus Size Reference

| --limit | Chunks | Disk size | CPU time | GPU time |
|---|---|---|---|---|
| 5000 | 5 000 | ~8 MB | ~2 min | ~10s |
| 50000 | 50 000 | ~80 MB | ~20 min | ~2 min |
| (omit) | ~241 000 | ~370 MB | ~2–4 hrs | ~8 min |

For HF Spaces demos, --limit 5000 is usually enough: it covers only about 2% of the full corpus, but it builds in seconds on a GPU or in about 2 minutes on a CPU.
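
The reference figures can be roughly reproduced from two quantities stated in this guide: the 384-dim float32 embedding (4 bytes per value) and the typical throughput per device. The sketch below is a back-of-the-envelope estimate, with hypothetical helper names; it counts embedding bytes only, so treat the size as a lower bound, and treat the times as optimistic since they ignore model loading and data streaming.

```python
EMBEDDING_DIM = 384      # all-MiniLM-L6-v2 output size, per this guide
BYTES_PER_FLOAT32 = 4


def embedding_megabytes(chunks):
    """Size of the embedding column alone, in MB (text columns add more)."""
    return chunks * EMBEDDING_DIM * BYTES_PER_FLOAT32 / 1e6


def embed_seconds(chunks, chunks_per_second):
    """Pure embedding time at a steady throughput."""
    return chunks / chunks_per_second


small_mb = embedding_megabytes(5000)       # embedding bytes for --limit 5000
full_mb = embedding_megabytes(241874)      # embedding bytes for the full corpus
gpu_small_s = embed_seconds(5000, 500)     # CUDA at ~500 chunks/s
cpu_small_s = embed_seconds(5000, 40)      # CPU at ~40 chunks/s
```

At 1,536 bytes per embedding, 5 000 chunks come to about 7.7 MB and the full corpus to about 372 MB, in line with the disk sizes in the table.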


Overwriting an Existing Dataset

By default the push step refuses to overwrite an existing HF Hub dataset as a safety guard:

ERROR: Dataset 'myuser/agentic-rag-chunks' already exists on HF Hub.
  Pass --force to overwrite it.

To overwrite intentionally:

python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --force

Similarly, --build-only refuses to write into an existing --output-dir. Delete it first:

rm -rf ./dataset_cache
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
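
A guard of this kind is straightforward to express. The following is a minimal sketch of the idea, assuming the check is a simple existence test; `ensure_fresh_output_dir` is a hypothetical name and the script's real error message may differ.

```python
from pathlib import Path


def ensure_fresh_output_dir(path):
    """Refuse to reuse an existing directory instead of silently overwriting it."""
    out = Path(path)
    if out.exists():
        raise FileExistsError(
            f"Output directory already exists: {out}. "
            "Delete it first, or use --push-only to upload what is there."
        )
    return out
```

Failing early like this avoids mixing rows from two different builds in one Arrow dataset.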

After Pushing

Set HF_DATASET_REPO as a secret in your HF Space:

HF_DATASET_REPO=myuser/agentic-rag-chunks

If the dataset is private, also set:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

The dataset is loaded at container startup and a FAISS index is built in memory (~60s for 5k chunks, ~5 min for 241k).
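
On the Space side, reading these two secrets might look like the sketch below. It is an assumption about how the settings could be consumed, not the retriever's actual code; `load_space_settings` is a hypothetical helper, and the `split="train"` argument in the trailing comment is likewise assumed.

```python
import os


def load_space_settings(env=os.environ):
    """Read the Space secrets described in this guide.

    HF_DATASET_REPO is required; HF_TOKEN is only needed for private datasets.
    """
    repo = env.get("HF_DATASET_REPO", "").strip()
    if not repo:
        raise RuntimeError("HF_DATASET_REPO is not set; add it as a Space secret.")
    token = env.get("HF_TOKEN") or None
    return repo, token


repo, token = load_space_settings({"HF_DATASET_REPO": "myuser/agentic-rag-chunks"})
# With the `datasets` library installed, the values could then be used as:
#   from datasets import load_dataset
#   ds = load_dataset(repo, split="train", token=token)
```

Raising a clear error when HF_DATASET_REPO is missing makes a misconfigured Space fail at startup rather than at the first query.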


Dataset Schema

| Column | Type | Description |
|---|---|---|
| chunk_id | string | Original chunk-id from the source dataset |
| content | string | Raw chunk text |
| title | string | Paper title |
| source_url | string | `https://arxiv.org/abs/<doi>` |
| full_citation | string | `<title> arXiv:<doi>` |
| embedding | float32[384] | L2-normalised embedding from all-MiniLM-L6-v2 |
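
As an illustration of the schema, the sketch below maps one source record and its raw embedding to a row with these six columns. The helper name `make_row` and the source keys `"chunk-id"`, `"chunk"`, `"title"`, and `"doi"` are assumptions about the upstream dataset, and the sample record is made up; only the output column names and formats come from this guide.

```python
import math


def make_row(source, embedding):
    """Build one output row; source keys are assumed, not confirmed by the guide."""
    norm = math.sqrt(sum(x * x for x in embedding))
    unit = [x / norm for x in embedding]  # L2-normalise so the vector has length 1
    return {
        "chunk_id": source["chunk-id"],
        "content": source["chunk"],
        "title": source["title"],
        "source_url": f"https://arxiv.org/abs/{source['doi']}",
        "full_citation": f"{source['title']} arXiv:{source['doi']}",
        "embedding": unit,
    }


# Hypothetical record with a 2-dim embedding in place of the real 384 dims.
row = make_row(
    {"chunk-id": "0000.00000#0", "chunk": "Example text.", "title": "Example Paper", "doi": "0000.00000"},
    [3.0, 4.0],
)
```

With sentence-transformers, passing normalize_embeddings=True to the encoder makes the manual normalisation step unnecessary.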

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| 403 Forbidden: You don't have the rights to create a dataset | Token missing or read-only | Generate a write token at huggingface.co/settings/tokens |
| Output directory already exists | Previous --build-only run left data | Delete ./dataset_cache and re-run, or use --push-only |
| CUDA out of memory | Batch too large for VRAM | Add --device cpu or reduce BATCH_SIZE_GPU in the script |
| RepositoryNotFoundError on push | --repo-id typo or wrong namespace | Check the exact username at huggingface.co |
| Dataset loads but FAISS search returns no results | Embeddings not L2-normalised | Rebuild with the current script (normalize_embeddings=True is set) |