# Build Dataset Guide

This guide explains how to use `build_dataset.py` to create the pre-embedded ArXiv corpus required by Agentic-RAG's HF Spaces deployment.

---

## Overview

The retriever tool loads a HuggingFace Dataset of pre-computed 384-dim embeddings at startup and builds a FAISS index in memory. `build_dataset.py` is the one-time script that creates this dataset.

```
jamescalam/ai-arxiv2-chunks  →  embed locally          →  ./dataset_cache  →  push to HF Hub
(~241k raw chunks)              (SentenceTransformer)     (Arrow format)      (your repo)
```

---

## Prerequisites

Install the required packages into any Python 3.10+ environment:

```bash
pip install datasets sentence-transformers huggingface-hub numpy tqdm torch
```

> **GPU users:** install the CUDA-enabled torch build for your driver version.
> See [pytorch.org/get-started](https://pytorch.org/get-started/locally/).
> CPU-only torch also works; it just runs slower.

A HuggingFace account with a **write-access token** is required for the push step. Generate one at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

---

## Recommended: Two-Step Workflow

Separating the embed step from the push step means a failed or interrupted upload does not require re-embedding everything.

### Step 1: Embed and save locally

```bash
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
```

This streams `jamescalam/ai-arxiv2-chunks` from HF Hub, embeds each chunk with `all-MiniLM-L6-v2`, and saves an Arrow dataset to `./dataset_cache/`.

| Flag | Description |
|---|---|
| `--limit N` | Process only the first N chunks. Omit for the full ~241k corpus. |
| `--output-dir PATH` | Where to save the Arrow dataset (default: `./dataset_cache`). |
| `--device auto\|cpu\|cuda\|mps` | Compute device (default: `auto`). See [GPU section](#gpu-acceleration) below. |

**Expected output:**

```
INFO: GPU detected (CUDA): NVIDIA GeForce RTX 3080
INFO: Loading source dataset: jamescalam/ai-arxiv2-chunks
INFO: Rows to embed: 5000 (of 241874 available)
INFO: Loading embedding model: sentence-transformers/all-MiniLM-L6-v2 [device=cuda]
INFO: Embedding 5000 chunks [batch_size=256, estimated time ~10s] …
...
INFO: Embedding complete in 9.3s (537 chunks/s)
INFO: Saved 5000 rows (8.2 MB) → ./dataset_cache
```

### Step 2: Push to HuggingFace Hub

```bash
python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx
```

Or set the token in your environment to avoid passing it on the command line:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks
```

The repository named by `--repo-id` is created automatically if it does not exist. To create a **private** dataset:

```bash
python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --private
```

---

## One-Shot Workflow

Build and push in a single command:

```bash
python build_dataset.py \
    --limit 5000 \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx
```

---

## GPU Acceleration

The script detects the best available device automatically (`--device auto`):

| Priority | Device | Typical throughput | Time for 5k chunks |
|---|---|---|---|
| 1 | CUDA (NVIDIA GPU) | ~500 chunks/s | ~10s |
| 2 | MPS (Apple Silicon) | ~300 chunks/s | ~17s |
| 3 | CPU | ~40 chunks/s | ~2 min |

GPU batch size is automatically set to 256; CPU batch size is 64.

**Override the device:**

```bash
# Force CPU (e.g. to benchmark or avoid VRAM issues)
python build_dataset.py --build-only --limit 5000 --device cpu

# Force a specific CUDA device (multi-GPU machine)
CUDA_VISIBLE_DEVICES=1 python build_dataset.py --build-only --limit 5000
```

**Verify GPU is being used:**

```bash
# In another terminal while the script runs
watch -n 1 nvidia-smi
```

---

## Corpus Size Reference

| `--limit` | Chunks | Disk size | CPU time | GPU time |
|---|---|---|---|---|
| `5000` | 5 000 | ~8 MB | ~2 min | ~10s |
| `50000` | 50 000 | ~80 MB | ~20 min | ~2 min |
| *(omit)* | ~241 000 | ~370 MB | ~2–4 hrs | ~8 min |

For HF Spaces demos, `--limit 5000` gives good coverage of the corpus at minimal build time.

---

## Overwriting an Existing Dataset

By default, the push step refuses to overwrite an existing HF Hub dataset as a safety guard:

```
ERROR: Dataset 'myuser/agentic-rag-chunks' already exists on HF Hub. Pass --force to overwrite it.
```

To overwrite intentionally:

```bash
python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --force
```

Similarly, `--build-only` refuses to write into an existing `--output-dir`. Delete it first:

```bash
rm -rf ./dataset_cache
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
```

---

## After Pushing

Set `HF_DATASET_REPO` as a secret in your HF Space:

```
HF_DATASET_REPO=myuser/agentic-rag-chunks
```

If the dataset is private, also set:

```
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```

The dataset is loaded at container startup and a FAISS index is built in memory (~60s for 5k chunks, ~5 min for 241k).
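
As an illustration of how these two settings might be consumed, here is a minimal sketch of startup code that reads them from the environment. Only the variable names `HF_DATASET_REPO` and `HF_TOKEN` come from this guide; the function name `read_dataset_settings` and its error message are hypothetical, and the actual application code may differ.

```python
import os
from typing import Mapping, Optional, Tuple


def read_dataset_settings(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (repo_id, token); token is None when HF_TOKEN is unset or empty.

    Hypothetical helper: the real Space may read its settings differently.
    """
    env = os.environ if env is None else env

    repo_id = env.get("HF_DATASET_REPO", "").strip()
    if not repo_id:
        # Fail early with a clear message instead of a later load error.
        raise RuntimeError("HF_DATASET_REPO is not set; add it as a Space secret.")

    # A missing or empty token is treated as "public dataset, no auth needed".
    token = env.get("HF_TOKEN", "").strip() or None
    return repo_id, token


if __name__ == "__main__":
    demo_env = {"HF_DATASET_REPO": "myuser/agentic-rag-chunks"}
    print(read_dataset_settings(demo_env))
```

Passing the environment in as a mapping keeps the helper easy to test. The returned pair could then be handed to whatever loader the app uses, for example `datasets.load_dataset(repo_id, token=token)` in recent versions of `datasets`.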

---

## Dataset Schema

| Column | Type | Description |
|---|---|---|
| `chunk_id` | string | Original `chunk-id` from the source dataset |
| `content` | string | Raw chunk text |
| `title` | string | Paper title |
| `source_url` | string | `https://arxiv.org/abs/` |
| `full_citation` | string | ` arXiv:<doi>` |
| `embedding` | float32[384] | L2-normalised embedding from `all-MiniLM-L6-v2` |

---

## Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `403 Forbidden: You don't have the rights to create a dataset` | Token missing or read-only | Generate a **write** token at huggingface.co/settings/tokens |
| `Output directory already exists` | Previous `--build-only` run left data | Delete `./dataset_cache` and re-run, or use `--push-only` |
| `CUDA out of memory` | Batch too large for VRAM | Add `--device cpu` or reduce `BATCH_SIZE_GPU` in the script |
| `RepositoryNotFoundError` on push | `--repo-id` typo or wrong namespace | Check the exact username at huggingface.co |
| Dataset loads but FAISS search returns no results | Embeddings not L2-normalised | Rebuild with the current script (`normalize_embeddings=True` is set) |
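
The last troubleshooting row depends on the embeddings being unit length. The pure-Python sketch below shows what L2 normalisation does and why it matters if the retriever scores by inner product: for unit vectors, the inner product equals cosine similarity. The inner-product assumption is mine, since this guide does not state the index type, and the helper names (`l2_normalise`, `dot`, `is_unit_length`) are hypothetical rather than taken from `build_dataset.py`.

```python
import math
from typing import List, Sequence


def l2_normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def is_unit_length(vec: Sequence[float], tol: float = 1e-3) -> bool:
    """Check whether a stored embedding is (approximately) L2-normalised."""
    return abs(math.sqrt(dot(vec, vec)) - 1.0) <= tol


if __name__ == "__main__":
    query = l2_normalise([3.0, 4.0])
    same_direction = l2_normalise([6.0, 8.0])   # parallel to the query
    orthogonal = l2_normalise([4.0, -3.0])      # perpendicular to the query

    # After normalisation the inner product is the cosine of the angle.
    print("parallel:", dot(query, same_direction))
    print("orthogonal:", dot(query, orthogonal))
    print("raw vector is unit length?", is_unit_length([3.0, 4.0]))
```

Running `is_unit_length` on a few rows of the `embedding` column is a quick way to confirm whether a dataset was built with normalisation before rebuilding it.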