# Build Dataset Guide

This guide explains how to use `build_dataset.py` to create the pre-embedded ArXiv corpus required by Agentic-RAG's HF Spaces deployment.

---

## Overview

The retriever tool loads a HuggingFace Dataset of pre-computed 384-dim embeddings at startup and builds a FAISS index in memory. `build_dataset.py` is the one-time script that creates this dataset.

```
jamescalam/ai-arxiv2-chunks  →  embed locally          →  ./dataset_cache  →  push to HF Hub
(~241k raw chunks)              (SentenceTransformer)     (Arrow format)      (your repo)
```
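
To make the retrieval step concrete, here is a minimal pure-Python sketch of exact inner-product search over L2-normalised vectors, which is the operation a flat inner-product FAISS index performs. It is an illustration only: the helper names `l2_normalise` and `top_k` and the toy 3-dim vectors are hypothetical, and this guide does not state which FAISS index type the retriever actually uses.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, corpus, k=2):
    """Return (index, score) pairs for the k corpus vectors closest to the query.

    Hypothetical helper: an exhaustive scan standing in for a FAISS index.
    """
    scores = [
        (i, sum(q * c for q, c in zip(query, vec)))
        for i, vec in enumerate(corpus)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dim vectors; the real dataset stores 384-dim embeddings.
corpus = [l2_normalise(v) for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0])]
query = l2_normalise([0.9, 0.1, 0.0])
hits = top_k(query, corpus, k=2)
print(hits)
```

Because every vector has unit length, the scores are cosine similarities, so the query ranks the first vector highest and the diagonal vector second.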

---

## Prerequisites

Install the required packages into any Python 3.10+ environment:

```bash
pip install datasets sentence-transformers huggingface-hub numpy tqdm torch
```

> **GPU users:** install the CUDA-enabled torch build for your driver version.
> See [pytorch.org/get-started](https://pytorch.org/get-started/locally/).
> CPU-only torch also works — it just runs slower.

A HuggingFace account with a **write-access token** is required for the push step.
Generate one at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

---

## Recommended: Two-Step Workflow

Separating the embed step from the push step means a failed or interrupted upload does not require re-embedding everything.

### Step 1 — Embed and save locally

```bash
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
```

This streams `jamescalam/ai-arxiv2-chunks` from HF Hub, embeds each chunk with `all-MiniLM-L6-v2`, and saves an Arrow dataset to `./dataset_cache/`.

| Flag | Description |
|---|---|
| `--limit N` | Process only the first N chunks. Omit for the full ~241k corpus. |
| `--output-dir PATH` | Where to save the Arrow dataset (default: `./dataset_cache`). |
| `--device auto\|cpu\|cuda\|mps` | Compute device (default: `auto`). See [GPU section](#gpu-acceleration) below. |

**Expected output:**
```
INFO: GPU detected (CUDA): NVIDIA GeForce RTX 3080
INFO: Loading source dataset: jamescalam/ai-arxiv2-chunks
INFO: Rows to embed: 5000 (of 241874 available)
INFO: Loading embedding model: sentence-transformers/all-MiniLM-L6-v2  [device=cuda]
INFO: Embedding 5000 chunks  [batch_size=256, estimated time ~10s] …
...
INFO: Embedding complete in 9.3s (537 chunks/s)
INFO: Saved 5000 rows (8.2 MB) → ./dataset_cache
```
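
The "take the first N chunks, embed in batches" pattern behind `--limit` can be sketched as follows. This is not the script's actual code: `batched`, `embed_stream`, and `fake_encode` are hypothetical names, and the stand-in encoder replaces the real model call, which would be something like `SentenceTransformer.encode(batch, normalize_embeddings=True)`.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def embed_stream(chunks, encode, limit=None, batch_size=256):
    """Embed the first `limit` chunks batch by batch (all chunks if limit is None)."""
    selected = islice(chunks, limit)  # islice(it, None) passes everything through
    embeddings = []
    for batch in batched(selected, batch_size):
        embeddings.extend(encode(batch))
    return embeddings


def fake_encode(texts):
    """Stand-in encoder: returns one 1-dim 'embedding' per text (its length)."""
    return [[float(len(t))] for t in texts]


stream = (f"chunk {i}" for i in range(10))  # generator: consumed lazily, like a stream
vectors = embed_stream(stream, fake_encode, limit=5, batch_size=2)
print(len(vectors))
```

Only the first five items are pulled from the generator, so a small `--limit` never has to read the rest of the source dataset.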

### Step 2 — Push to HuggingFace Hub

```bash
python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx
```

Or set the token in your environment to avoid passing it on the command line:

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

python build_dataset.py --push-only \
    --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks
```
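
One plausible way a script can support both options is to prefer the flag and fall back to the environment. The sketch below assumes that precedence (the guide does not state it), and `resolve_token` is a hypothetical helper, not a function from `build_dataset.py`.

```python
import argparse
import os


def resolve_token(cli_token, env=None):
    """Return the CLI token if given, else HF_TOKEN from the environment, else None."""
    env = os.environ if env is None else env
    return cli_token or env.get("HF_TOKEN")


parser = argparse.ArgumentParser()
parser.add_argument("--hf-token", default=None)

# No --hf-token flag on the command line, so the environment value is used.
args = parser.parse_args([])
token_from_env = resolve_token(args.hf_token, env={"HF_TOKEN": "hf_from_env"})

# Assumption: an explicit flag wins over the environment.
args = parser.parse_args(["--hf-token", "hf_from_cli"])
token_from_cli = resolve_token(args.hf_token, env={"HF_TOKEN": "hf_from_env"})
print(token_from_env, token_from_cli)
```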

The dataset repository named by `--repo-id` is created automatically if it does not exist. To create a **private** dataset:

```bash
python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --private
```

---

## One-Shot Workflow

Build and push in a single command:

```bash
python build_dataset.py \
    --limit 5000 \
    --repo-id myuser/agentic-rag-chunks \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxx
```

---

## GPU Acceleration

The script detects the best available device automatically (`--device auto`):

| Priority | Device | Typical throughput | Time for 5k chunks |
|---|---|---|---|
| 1 | CUDA (NVIDIA GPU) | ~500 chunks/s | ~10s |
| 2 | MPS (Apple Silicon) | ~300 chunks/s | ~17s |
| 3 | CPU | ~40 chunks/s | ~2 min |

GPU batch size is automatically set to 256; CPU batch size is 64.
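
The priority order and batch sizes above can be sketched as a small function. `pick_device` is a hypothetical name, and treating MPS as a GPU for batch-size purposes is an assumption. The availability flags are passed in so the sketch runs without torch; in practice they would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(requested, cuda_available, mps_available):
    """Resolve a --device value to (device, batch_size)."""
    if requested == "auto":
        if cuda_available:
            device = "cuda"
        elif mps_available:
            device = "mps"
        else:
            device = "cpu"
    else:
        device = requested
    # Batch sizes from this guide: 256 on GPU, 64 on CPU.
    # Assumption: MPS counts as a GPU here.
    batch_size = 64 if device == "cpu" else 256
    return device, batch_size


print(pick_device("auto", cuda_available=False, mps_available=True))
print(pick_device("auto", cuda_available=False, mps_available=False))
print(pick_device("cpu", cuda_available=True, mps_available=False))
```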

**Override the device:**
```bash
# Force CPU (e.g. to benchmark or avoid VRAM issues)
python build_dataset.py --build-only --limit 5000 --device cpu

# Force a specific CUDA device (multi-GPU machine)
CUDA_VISIBLE_DEVICES=1 python build_dataset.py --build-only --limit 5000
```

**Verify GPU is being used:**
```bash
# In another terminal while the script runs
watch -n 1 nvidia-smi
```

---

## Corpus Size Reference

| `--limit` | Chunks | Disk size | CPU time | GPU time |
|---|---|---|---|---|
| `5000` | 5 000 | ~8 MB | ~2 min | ~10s |
| `50000` | 50 000 | ~80 MB | ~20 min | ~2 min |
| *(omit)* | ~241 000 | ~370 MB | ~2–4 hrs | ~8 min |

For HF Spaces demos, `--limit 5000` (about 2% of the full corpus) keeps build time minimal while still providing enough material to exercise retrieval.
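
The figures in the table can be roughly reproduced with back-of-the-envelope arithmetic: each row carries 384 float32 values (4 bytes each), and embedding time is row count divided by throughput. The sketch below uses the nominal throughputs from the GPU section (500 and 40 chunks/s); the function names are illustrative, and the size covers only the embedding column, so it is a lower bound on disk usage.

```python
EMBEDDING_DIM = 384
BYTES_PER_FLOAT32 = 4


def embedding_megabytes(rows):
    """Size of the embedding column alone, in MB (10**6 bytes)."""
    return rows * EMBEDDING_DIM * BYTES_PER_FLOAT32 / 1e6


def embed_minutes(rows, chunks_per_second):
    """Embedding time in minutes at a steady throughput."""
    return rows / chunks_per_second / 60


for rows in (5_000, 50_000, 241_874):
    print(
        f"{rows:>7} rows: "
        f"{embedding_megabytes(rows):6.1f} MB, "
        f"GPU {embed_minutes(rows, 500):5.1f} min, "
        f"CPU {embed_minutes(rows, 40):6.1f} min"
    )
```

For the full corpus this gives about 371.5 MB of embeddings and about 101 minutes on CPU at a steady 40 chunks/s. The table's ~2–4 hrs range presumably leaves margin for slower CPUs.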

---

## Overwriting an Existing Dataset

By default, the push step refuses to overwrite an existing HF Hub dataset, as a safety guard:

```
ERROR: Dataset 'myuser/agentic-rag-chunks' already exists on HF Hub.
  Pass --force to overwrite it.
```

To overwrite intentionally:

```bash
python build_dataset.py --push-only --output-dir ./dataset_cache \
    --repo-id myuser/agentic-rag-chunks --force
```

Similarly, `--build-only` refuses to write into an existing `--output-dir`. Delete the directory first:

```bash
rm -rf ./dataset_cache
python build_dataset.py --build-only --limit 5000 --output-dir ./dataset_cache
```
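
Both guards follow the same "refuse unless explicitly told otherwise" pattern, sketched below under stated assumptions. `check_output_dir` and `check_push` are hypothetical helpers, not the script's own functions. The existence check for the Hub is injected as a callable so the example runs offline; one candidate for a real check is `huggingface_hub.HfApi().repo_exists(repo_id, repo_type="dataset")`.

```python
import tempfile
from pathlib import Path


def check_output_dir(path):
    """Raise if the build target already exists, mirroring the --build-only guard."""
    if Path(path).exists():
        raise FileExistsError(f"Output directory already exists: {path}")


def check_push(repo_id, repo_exists, force=False):
    """Raise if the Hub dataset exists and --force was not passed.

    `repo_exists` is an injected callable so the sketch runs offline.
    """
    if repo_exists(repo_id) and not force:
        raise RuntimeError(
            f"Dataset '{repo_id}' already exists on HF Hub. Pass --force to overwrite it."
        )


with tempfile.TemporaryDirectory() as tmp:
    fresh = Path(tmp) / "dataset_cache"
    check_output_dir(fresh)  # passes: nothing there yet
    fresh.mkdir()
    try:
        check_output_dir(fresh)  # now the directory exists
        build_blocked = False
    except FileExistsError:
        build_blocked = True

# With force=True the push guard lets an existing repo through.
check_push("myuser/agentic-rag-chunks", repo_exists=lambda _: True, force=True)
try:
    check_push("myuser/agentic-rag-chunks", repo_exists=lambda _: True)
    push_blocked = False
except RuntimeError:
    push_blocked = True
print(build_blocked, push_blocked)
```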

---

## After Pushing

Set `HF_DATASET_REPO` as a secret in your HF Space:

```
HF_DATASET_REPO=myuser/agentic-rag-chunks
```

If the dataset is private, also set:

```
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```

The dataset is loaded at container startup and a FAISS index is built in memory (~60s for 5k chunks, ~5 min for 241k).
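
A startup routine might read these two settings as sketched below. `read_space_config` is a hypothetical helper, and the `split="train"` value in the trailing comment is an assumption about how the dataset was pushed.

```python
import os


def read_space_config(env=None):
    """Return (repo_id, token); token is None for public datasets."""
    env = os.environ if env is None else env
    repo_id = env.get("HF_DATASET_REPO")
    if not repo_id:
        raise RuntimeError("HF_DATASET_REPO is not set")
    return repo_id, env.get("HF_TOKEN") or None


repo_id, token = read_space_config({"HF_DATASET_REPO": "myuser/agentic-rag-chunks"})
print(repo_id, token)

# With the `datasets` package installed, the values could then be passed on:
#   from datasets import load_dataset
#   ds = load_dataset(repo_id, split="train", token=token)
```

Failing fast on a missing `HF_DATASET_REPO` surfaces a misconfigured Space in the startup logs instead of at the first query.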

---

## Dataset Schema

| Column | Type | Description |
|---|---|---|
| `chunk_id` | string | Original `chunk-id` from the source dataset |
| `content` | string | Raw chunk text |
| `title` | string | Paper title |
| `source_url` | string | `https://arxiv.org/abs/<doi>` |
| `full_citation` | string | `<title> arXiv:<doi>` |
| `embedding` | float32[384] | L2-normalised embedding from `all-MiniLM-L6-v2` |
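
As an illustration of the schema, the sketch below maps one source record to an output row. `build_row` is a hypothetical helper, the source column names `chunk` and `doi` are assumptions (only `chunk-id` is named in this guide), and the record is made up.

```python
import math


def build_row(record, embedding):
    """Map one source record plus its embedding to the output schema."""
    norm = math.sqrt(sum(x * x for x in embedding))
    return {
        "chunk_id": record["chunk-id"],
        "content": record["chunk"],  # assumed source column name
        "title": record["title"],
        "source_url": f"https://arxiv.org/abs/{record['doi']}",  # assumed `doi` column
        "full_citation": f"{record['title']} arXiv:{record['doi']}",
        "embedding": [x / norm for x in embedding],  # L2-normalised
    }


# Made-up record for illustration; not taken from the real corpus.
record = {"chunk-id": "0", "chunk": "Example text.", "title": "Example Paper", "doi": "0000.00000"}
row = build_row(record, [3.0, 4.0])  # 2-dim toy vector instead of 384-dim
print(row["source_url"], row["full_citation"], row["embedding"])
```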

---

## Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `403 Forbidden: You don't have the rights to create a dataset` | Token missing or read-only | Generate a **write** token at huggingface.co/settings/tokens |
| `Output directory already exists` | Previous `--build-only` run left data | Delete `./dataset_cache` and re-run, or use `--push-only` |
| `CUDA out of memory` | Batch too large for VRAM | Add `--device cpu` or reduce `BATCH_SIZE_GPU` in the script |
| `RepositoryNotFoundError` on push | `--repo-id` typo or wrong namespace | Check the exact username at huggingface.co |
| Dataset loads but FAISS search returns no results | Embeddings not L2-normalised | Rebuild with the current script (`normalize_embeddings=True` is set) |
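
To check the last row of the table before rebuilding, a quick scan can confirm whether stored embeddings have unit length. This is a sketch with a hypothetical helper name and an arbitrary tolerance; it works on any list of vectors, for example the `embedding` column of a few rows.

```python
import math


def non_unit_rows(embeddings, tol=1e-3):
    """Return indices of vectors whose L2 norm differs from 1 by more than tol."""
    bad = []
    for i, vec in enumerate(embeddings):
        norm = math.sqrt(sum(x * x for x in vec))
        if abs(norm - 1.0) > tol:
            bad.append(i)
    return bad


sample = [[0.6, 0.8], [1.0, 0.0], [3.0, 4.0]]  # last vector has norm 5
print(non_unit_rows(sample))
```

An empty result means every vector checked is normalised and the cause lies elsewhere.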