vksepm

feat: HF Spaces deployment — FAISS retriever, single-container Docker build

635dd02 4 months ago

Build Dataset Guide

This guide explains how to use build_dataset.py to create the pre-embedded arXiv corpus required by Agentic-RAG's HF Spaces deployment.

Overview

The retriever tool loads a HuggingFace Dataset of pre-computed 384-dim embeddings at startup and builds a FAISS index in memory. build_dataset.py is the one-time script that creates this dataset.

jamescalam/ai-arxiv2-chunks  →  embed locally  →  ./dataset_cache  →  push to HF Hub
(~241k raw chunks)              (SentenceTransformer)  (Arrow format)    (your repo)
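
The startup step described above can be pictured with a small pure-Python sketch. It is not the retriever's actual code and does not call the FAISS API: a brute-force inner-product scan stands in for the index. The 3-dimensional vectors, the `normalize` and `search` helper names, and the chunk ids are all made up for illustration. The sketch relies on the fact that, for L2-normalised vectors, the inner product equals cosine similarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def search(index, query, k=2):
    """Return the k (chunk_id, score) pairs with the highest inner product."""
    q = normalize(query)
    scored = [(chunk_id, sum(a * b for a, b in zip(vec, q))) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy corpus: hypothetical ids with hand-picked 3-dim vectors (the real ones are 384-dim).
index = [
    ("chunk-a", normalize([1.0, 0.0, 0.0])),
    ("chunk-b", normalize([0.0, 1.0, 0.0])),
    ("chunk-c", normalize([1.0, 1.0, 0.0])),
]

results = search(index, [2.0, 0.1, 0.0], k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

Because every stored vector and the query have unit length, all scores fall between -1 and 1, and the ranking depends only on direction, not on vector magnitude.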

Prerequisites

Install the required packages into any Python 3.10+ environment:

pip install datasets sentence-transformers huggingface-hub numpy tqdm torch

GPU users: install the CUDA-enabled torch build for your driver version. See pytorch.org/get-started. CPU-only torch also works — it just runs slower.

A HuggingFace account with a write-access token is required for the push step. Generate one at huggingface.co/settings/tokens.

Recommended: Two-Step Workflow

Separating the embed step from the push step means a failed or interrupted upload does not require re-embedding everything.

Step 1 — Embed and save locally

python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache

This streams jamescalam/ai-arxiv2-chunks from HF Hub, embeds each chunk with all-MiniLM-L6-v2, and saves an Arrow dataset to ./dataset_cache/.

| Flag | Description |
|---|---|
| `--limit N` | Process only the first N chunks. Omit for the full ~241k corpus. |
| `--output-dir PATH` | Where to save the Arrow dataset (default: `./dataset_cache`). |
| `--device auto\|cpu\|cuda\|mps` | Compute device (default: `auto`). See GPU section below. |

Expected output:

INFO: GPU detected (CUDA): NVIDIA GeForce RTX 3080
INFO: Loading source dataset: jamescalam/ai-arxiv2-chunks
INFO: Rows to embed: 5000 (of 241874 available)
INFO: Loading embedding model: sentence-transformers/all-MiniLM-L6-v2  [device=cuda]
INFO: Embedding 5000 chunks  [batch_size=256, estimated time ~10s] …
...
INFO: Embedding complete in 9.3s (537 chunks/s)
INFO: Saved 5000 rows (8.2 MB) → ./dataset_cache

Step 2 — Push to HuggingFace Hub

python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx

Or set the token in your environment to avoid passing it on the command line:

export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks
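
A minimal sketch of how the two ways of supplying the token could be reconciled. The order of precedence (explicit flag first, then `HF_TOKEN`) is an assumption, and `resolve_token` is a hypothetical helper, not a function taken from build_dataset.py.

```python
import os

def resolve_token(cli_token=None, environ=None):
    """Prefer an explicit --hf-token value, then fall back to HF_TOKEN.

    The precedence order is an assumption made for this sketch.
    """
    environ = os.environ if environ is None else environ
    token = cli_token or environ.get("HF_TOKEN")
    if not token:
        raise SystemExit("No token: pass --hf-token or export HF_TOKEN")
    return token

# Placeholder values only; never hard-code a real token.
from_flag = resolve_token("hf_from_flag", {"HF_TOKEN": "hf_from_env"})
from_env = resolve_token(None, {"HF_TOKEN": "hf_from_env"})
```

Keeping the token in the environment avoids leaving it in shell history or in the process list, which is the reason the second form above is usually preferable.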

The repository named by --repo-id is created automatically if it does not exist. To create it as a private dataset, add --private:

python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --private

One-Shot Workflow

Build and push in a single command:

python build_dataset.py \
    --limit 5000 \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx
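
The relationship between the two-step and one-shot workflows can be sketched with `argparse`. This is a hypothetical reconstruction built only from the flags shown in this guide. The real script's parser may differ, and both the mutual exclusion of `--build-only` and `--push-only` and the requirement that `--repo-id` be present when pushing are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="build_dataset.py")
    mode = parser.add_mutually_exclusive_group()  # assumption: the modes exclude each other
    mode.add_argument("--build-only", action="store_true", help="embed and save locally")
    mode.add_argument("--push-only", action="store_true", help="upload an existing output dir")
    parser.add_argument("--limit", type=int, default=None, help="first N chunks; omit for all")
    parser.add_argument("--output-dir", default="./dataset_cache")
    parser.add_argument("--device", choices=["auto", "cpu", "cuda", "mps"], default="auto")
    parser.add_argument("--repo-id", default=None)
    parser.add_argument("--hf-token", default=None)
    parser.add_argument("--private", action="store_true")
    return parser

def plan_steps(args):
    """Translate the mode flags into the list of steps to run."""
    steps = []
    if not args.push_only:
        steps.append("build")
    if not args.build_only:
        if not args.repo_id:  # assumption: pushing needs a target repo
            raise SystemExit("--repo-id is required when pushing")
        steps.append("push")
    return steps

parser = build_parser()
step_one = plan_steps(parser.parse_args(["--build-only", "--limit", "5000"]))
step_two = plan_steps(parser.parse_args(["--push-only", "--repo-id", "myuser/agentic-rag-chunks"]))
one_shot = plan_steps(parser.parse_args(["--limit", "5000", "--repo-id", "myuser/agentic-rag-chunks"]))
```

With neither mode flag set, both steps run in sequence, which is the one-shot behaviour.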

GPU Acceleration

The script detects the best available device automatically (--device auto):

| Priority | Device | Typical throughput | Time for 5k chunks |
|---|---|---|---|
| 1 | CUDA (NVIDIA GPU) | ~500 chunks/s | ~10s |
| 2 | MPS (Apple Silicon) | ~300 chunks/s | ~17s |
| 3 | CPU | ~40 chunks/s | ~2 min |

GPU batch size is automatically set to 256; CPU batch size is 64.
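
The priority order and batch sizes above can be expressed as a short sketch. `pick_device` and `pick_batch_size` are hypothetical names, and treating MPS as a "GPU" for batch-size purposes is an assumption. The availability flags are passed in as booleans so the example runs without torch; in a real script they would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(requested="auto", cuda_available=False, mps_available=False):
    """Resolve --device to a concrete device using the priority CUDA > MPS > CPU."""
    if requested != "auto":
        return requested  # an explicit choice overrides detection
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

def pick_batch_size(device):
    """256 on GPU devices, 64 on CPU (counting MPS as a GPU is an assumption)."""
    return 64 if device == "cpu" else 256

device = pick_device("auto", cuda_available=False, mps_available=True)
batch_size = pick_batch_size(device)
print(device, batch_size)
```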

Override the device:

# Force CPU (e.g. to benchmark or avoid VRAM issues)
python build_dataset.py --build-only --limit 5000 --device cpu

# Force a specific CUDA device (multi-GPU machine)
CUDA_VISIBLE_DEVICES=1 python build_dataset.py --build-only --limit 5000

Verify that the GPU is being used (NVIDIA only):

# In another terminal while the script runs
watch -n 1 nvidia-smi

Corpus Size Reference

| `--limit` | Chunks | Disk size | CPU time | GPU time |
|---|---|---|---|---|
| `5000` | 5 000 | ~8 MB | ~2 min | ~10s |
| `50000` | 50 000 | ~80 MB | ~20 min | ~2 min |
| (omit) | ~241 000 | ~370 MB | ~2–4 hrs | ~8 min |

For HF Spaces demos, --limit 5000 keeps build time and disk size minimal. Note that it embeds only the first 5,000 of the ~241k chunks (about 2% of the corpus).
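
The figures in the table can be sanity-checked with simple arithmetic: a 384-dim float32 embedding occupies 384 × 4 = 1,536 bytes, and embedding time is roughly the chunk count divided by throughput. The sketch below uses the throughput values quoted in the GPU section (500 chunks/s for CUDA, 40 chunks/s for CPU). The helper names are hypothetical, and the size covers the embedding column only, not the text columns.

```python
EMBEDDING_DIM = 384
BYTES_PER_FLOAT32 = 4

def embedding_megabytes(n_chunks):
    """Size of the embedding column alone, in MB (10**6 bytes)."""
    return n_chunks * EMBEDDING_DIM * BYTES_PER_FLOAT32 / 1e6

def embed_seconds(n_chunks, chunks_per_second):
    """Straight-line time estimate that ignores model loading and I/O."""
    return n_chunks / chunks_per_second

for n in (5_000, 50_000, 241_874):
    print(
        f"{n:>7} chunks: {embedding_megabytes(n):6.1f} MB of embeddings, "
        f"GPU ~{embed_seconds(n, 500) / 60:.1f} min, "
        f"CPU ~{embed_seconds(n, 40) / 60:.0f} min"
    )
```

The straight-line CPU estimate for the full corpus comes out near the low end of the table's ~2–4 hrs range, so the throughput numbers are best read as rough guides rather than guarantees.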

Overwriting an Existing Dataset

By default the push step refuses to overwrite an existing HF Hub dataset as a safety guard:

ERROR: Dataset 'myuser/agentic-rag-chunks' already exists on HF Hub.
  Pass --force to overwrite it.

To overwrite intentionally:

python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --force

Similarly, --build-only refuses to write into an existing --output-dir. Delete it first:

rm -rf ./dataset_cache
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
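
A minimal sketch of the local guard described above, assuming it is a plain existence check. `ensure_fresh_output_dir` is a hypothetical helper and the exception type is an assumption; only the "Output directory already exists" wording comes from this guide.

```python
import tempfile
from pathlib import Path

def ensure_fresh_output_dir(path):
    """Refuse to reuse an existing directory; create it otherwise."""
    target = Path(path)
    if target.exists():
        raise FileExistsError(f"Output directory already exists: {target}")
    target.mkdir(parents=True)
    return target

with tempfile.TemporaryDirectory() as tmp:
    out = ensure_fresh_output_dir(Path(tmp) / "dataset_cache")  # first run succeeds
    created = out.is_dir()
    try:
        ensure_fresh_output_dir(out)  # second run is refused
        refused = False
    except FileExistsError:
        refused = True
```

Failing early, before any embedding work starts, means a stale directory costs seconds rather than a full re-embed.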

After Pushing

Set HF_DATASET_REPO as a secret in your HF Space:

HF_DATASET_REPO=myuser/agentic-rag-chunks

If the dataset is private, also set:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

The dataset is loaded at container startup and a FAISS index is built in memory (~60s for 5k chunks, ~5 min for 241k).
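
A sketch of how the Space might read these two settings at startup. `RetrieverSettings` and `load_settings` are hypothetical names, and the format check on the repo id is an added convenience, not something this guide says the application does.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetrieverSettings:
    dataset_repo: str
    hf_token: Optional[str]  # only needed when the dataset is private

def load_settings(environ=None):
    """Read HF_DATASET_REPO (required) and HF_TOKEN (optional) from the environment."""
    environ = os.environ if environ is None else environ
    repo = environ.get("HF_DATASET_REPO", "").strip()
    # A user-owned Hub dataset id has the form "<namespace>/<name>".
    if repo.count("/") != 1 or not all(repo.split("/")):
        raise ValueError(f"HF_DATASET_REPO must look like 'user/dataset', got {repo!r}")
    return RetrieverSettings(dataset_repo=repo, hf_token=environ.get("HF_TOKEN") or None)

settings = load_settings({"HF_DATASET_REPO": "myuser/agentic-rag-chunks"})
print(settings.dataset_repo, settings.hf_token is None)
```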

Dataset Schema

| Column | Type | Description |
|---|---|---|
| `chunk_id` | string | Original `chunk-id` from the source dataset |
| `content` | string | Raw chunk text |
| `title` | string | Paper title |
| `source_url` | string | `https://arxiv.org/abs/<doi>` |
| `full_citation` | string | `<title> arXiv:<doi>` |
| `embedding` | float32[384] | L2-normalised embedding from `all-MiniLM-L6-v2` |
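
To make the column layout concrete, here is a sketch that assembles one record following the templates in the table. `make_row` is a hypothetical helper, the sample values (including the `0000.00000` identifier) are placeholders, and a 2-dim vector stands in for the real 384-dim embedding.

```python
import math

def make_row(chunk_id, content, title, doi, raw_embedding):
    """Assemble one record in the column layout listed above."""
    norm = math.sqrt(sum(x * x for x in raw_embedding))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero embedding")
    return {
        "chunk_id": chunk_id,
        "content": content,
        "title": title,
        "source_url": f"https://arxiv.org/abs/{doi}",
        "full_citation": f"{title} arXiv:{doi}",
        "embedding": [x / norm for x in raw_embedding],  # L2-normalised
    }

row = make_row(
    chunk_id="example-chunk-0",
    content="Example chunk text.",
    title="An Example Paper",
    doi="0000.00000",          # placeholder identifier
    raw_embedding=[3.0, 4.0],  # toy vector; real embeddings have 384 values
)
print(row["source_url"])
print(row["full_citation"])
```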

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `403 Forbidden: You don't have the rights to create a dataset` | Token missing or read-only | Generate a write token at huggingface.co/settings/tokens |
| `Output directory already exists` | Previous `--build-only` run left data | Delete `./dataset_cache` and re-run, or use `--push-only` |
| `CUDA out of memory` | Batch too large for VRAM | Add `--device cpu` or reduce `BATCH_SIZE_GPU` in the script |
| `RepositoryNotFoundError` on push | `--repo-id` typo or wrong namespace | Check the exact username at huggingface.co |
| Dataset loads but FAISS search returns no results | Embeddings not L2-normalised | Rebuild with the current script (`normalize_embeddings=True` is set) |