---
title: EnggSS RAG ChatBot
emoji: ⚡
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.0.0"
app_file: app.py
pinned: false
license: other
---

# EnggSS RAG ChatBot

**Serving-only** HuggingFace Space: reads a pre-built private dataset, no PDF processing at runtime.

Build the dataset locally with `preprocessing/create_dataset.py`, then deploy this Space to answer questions.

## How it works

```
Local machine (once)
  PDFs → create_dataset.py → BAAI/bge-large-en-v1.5 embeddings
                                │
                                ▼
                   Private HuggingFace Dataset
                                │
          ┌─────────────────────┘
          ▼ (Space startup)
  Load dataset → NumPy float32 matrix (L2-normalised)
          │
          ▼ (each query, ~20 ms)
  Embed query → cosine scores → MMR top-3
          │
          ▼
  Qwen2.5-7B-Instruct (HF Inference API) → answer
          │
          ▼
  Gradio UI
```

## Tabs

| Tab | Purpose |
|-----|---------|
| 💬 Q&A | Ask questions; see top-3 retrieved contexts + generated answer |
| 📊 Analytics | Total chunks, documents processed, per-file breakdown |

## Required Space Secrets

Set in **Settings → Variables and Secrets**:

| Secret | Description |
|--------|-------------|
| `HF_TOKEN` | HuggingFace token; needs **read** access to the dataset repo |
| `HF_DATASET_REPO` | e.g. `your-org/enggss-rag-dataset` (created by preprocessing script) |

## Setup order

1. **Run preprocessing locally** (once, or when you add new PDFs):

   ```bash
   cd preprocessing
   pip install -r requirements.txt
   python create_dataset.py ./pdfs --repo your-org/enggss-rag-dataset
   ```

2. **Deploy this Space**: upload `app.py` + `requirements.txt` + `README.md`
3. **Set the two secrets** above in Space Settings → Secrets
4. Space restarts, loads the dataset, and is ready to answer questions

To add new PDFs later without rebuilding everything:

```bash
python create_dataset.py ./pdfs --repo your-org/enggss-rag-dataset --update
```

## Local development

```bash
git clone https://huggingface.co/spaces/your-org/enggss-rag-chatbot
cd enggss-rag-chatbot
pip install -r requirements.txt
# create .env with HF_TOKEN and HF_DATASET_REPO
python app.py
```
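
When experimenting locally, it can help to see the query-time step from the "How it works" diagram (cosine scores → MMR top-3) in isolation. The following is a minimal sketch of that step, not the code in `app.py`: the function name `mmr_top_k`, the helper `l2_normalise`, the `lambda_mult=0.7` trade-off weight, and the toy 3-dimensional vectors are all assumptions chosen for illustration. In the real Space the rows would be BAAI/bge-large-en-v1.5 embeddings loaded from the dataset.

```python
import numpy as np


def l2_normalise(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def mmr_top_k(query_vec, doc_matrix, k=3, lambda_mult=0.7):
    """Select up to k row indices of doc_matrix by Maximal Marginal Relevance.

    Assumes query_vec and every row of doc_matrix are already L2-normalised.
    lambda_mult (hypothetical default) weights relevance against redundancy.
    """
    relevance = doc_matrix @ query_vec  # cosine score of each chunk vs. the query
    selected = []
    candidates = list(range(doc_matrix.shape[0]))
    while candidates and len(selected) < k:
        if selected:
            # Highest similarity of each candidate to any already-selected chunk.
            sims = doc_matrix[candidates] @ doc_matrix[selected].T
            redundancy = sims.max(axis=1)
        else:
            redundancy = np.zeros(len(candidates), dtype=doc_matrix.dtype)
        mmr = lambda_mult * relevance[candidates] - (1.0 - lambda_mult) * redundancy
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy stand-in for the embedding matrix: four "chunks" in 3 dimensions.
chunks = l2_normalise(np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.1, 0.0],  # near-duplicate of chunk 0
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],   # unrelated to the query
], dtype=np.float32))
query = l2_normalise(np.array([1.0, 0.3, 0.0], dtype=np.float32))

cosine_order = np.argsort(-(chunks @ query))[:3].tolist()
top = mmr_top_k(query, chunks, k=3)
print(cosine_order)  # prints [1, 0, 2]
print(top)           # prints [1, 2, 0]
```

In this toy run, plain cosine ranking places the near-duplicate chunk 0 second, while MMR demotes it to third because it overlaps heavily with the already-selected chunk 1. Raising `lambda_mult` towards 1.0 makes the selection behave more like plain cosine ranking.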