File size: 2,481 Bytes
086f690
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
---
title: EnggSS RAG ChatBot
emoji: 
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: "5.0.0"
app_file: app.py
pinned: false
license: other
---

# EnggSS RAG ChatBot

**Serving-only** HuggingFace Space — reads a pre-built private dataset, no PDF
processing at runtime.  Build the dataset locally with
`preprocessing/create_dataset.py`, then deploy this Space to answer questions.

## How it works

```
Local machine (once)
  PDFs  →  create_dataset.py  →  BAAI/bge-large-en-v1.5 embeddings


                            Private HuggingFace Dataset

                  ┌─────────────────────┘
                  ▼  (Space startup)
         Load dataset → NumPy float32 matrix (L2-normalised)

                  ▼  (each query, ~20 ms)
         Embed query → cosine scores → MMR top-3


         Qwen2.5-7B-Instruct (HF Inference API) → answer


              Gradio UI
```

## Tabs

| Tab | Purpose |
|-----|---------|
| 💬 Q&A | Ask questions; see top-3 retrieved contexts + generated answer |
| 📊 Analytics | Total chunks, documents processed, per-file breakdown |

## Required Space Secrets

Set in **Settings → Variables and Secrets**:

| Secret | Description |
|--------|-------------|
| `HF_TOKEN` | HuggingFace token — needs **read** access to the dataset repo |
| `HF_DATASET_REPO` | e.g. `your-org/enggss-rag-dataset` (created by preprocessing script) |

## Setup order

1. **Run preprocessing locally** (once, or when you add new PDFs):
   ```bash
   cd preprocessing
   pip install -r requirements.txt
   python create_dataset.py ./pdfs --repo your-org/enggss-rag-dataset
   ```
2. **Deploy this Space** — upload `app.py` + `requirements.txt` + `README.md`
3. **Set the two secrets** above in Space Settings → Secrets
4. Space restarts, loads the dataset, and is ready to answer questions

To add new PDFs later without rebuilding everything:
```bash
python create_dataset.py ./pdfs --repo your-org/enggss-rag-dataset --update
```

## Local development

```bash
git clone https://huggingface.co/spaces/your-org/enggss-rag-chatbot
cd enggss-rag-chatbot
pip install -r requirements.txt
# create .env with HF_TOKEN and HF_DATASET_REPO
python app.py
```