Spaces:
Running
Running
docs: overhaul README with full project documentation, RAG pipeline diagram, API reference, and GSSOC contributor section
Browse files
README.md
CHANGED
|
@@ -10,57 +10,489 @@ license: mit
|
|
| 10 |
short_description: Enterprise Agentic RAG β upload PDFs and chat with AI
|
| 11 |
---
|
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
|
| 19 |
-
|
| 20 |
-
- **Semantic Search** β Two-stage retrieval with cross-encoder reranking
|
| 21 |
-
- **Streaming Chat** β Real-time AI responses with inline source citations
|
| 22 |
-
- **Data Isolation** β Per-user vector collections for complete privacy
|
| 23 |
-
- **Open-Source LLMs** β Powered by Mistral-7B and HuggingFace ecosystem
|
| 24 |
|
| 25 |
-
|
|
|
|
|
|
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
-
2. **Upload** a PDF document
|
| 41 |
-
3. **Wait** for processing (chunking + embedding)
|
| 42 |
-
4. **Ask** questions and get cited answers!
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
```bash
|
| 47 |
-
# Backend
|
| 48 |
-
cd backend
|
|
|
|
| 49 |
pip install -r requirements.txt
|
| 50 |
-
uvicorn app.main:app --port
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 51 |
|
| 52 |
-
|
| 53 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 54 |
```
|
| 55 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
## π¦ Environment Variables
|
| 57 |
|
| 58 |
-
| Variable | Required | Description |
|
| 59 |
-
|---|---|---|
|
| 60 |
-
| `HF_TOKEN` | β
| HuggingFace API token for LLM inference |
|
| 61 |
-
| `SECRET_KEY` | β
| JWT signing secret |
|
| 62 |
-
| `DATABASE_URL` | β |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
|
| 64 |
-
|
| 65 |
|
| 66 |
-
|
|
|
|
| 10 |
short_description: Enterprise Agentic RAG β upload PDFs and chat with AI
|
| 11 |
---
|
| 12 |
|
| 13 |
+
<div align="center">
|
| 14 |
|
| 15 |
+
<br/>
|
| 16 |
|
| 17 |
+
```
|
| 18 |
+
βββββββ βββββββ ββββββββ ββββββ ββββββββββββββββββββββββββββββββββββ ββββββ ββββ ββββββββββββ
|
| 19 |
+
ββββββββββββββββββββββββ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ ββββββββββββ
|
| 20 |
+
βββββββββββ βββββββββ βββββββββββββββββββββββββββββββββββ βββ ββββββββββββββ βββ βββ
|
| 21 |
+
βββββββ βββ βββββββββ βββββββββββββββββββββββββββββββββββ βββ ββββββββββββββββββ βββ
|
| 22 |
+
βββ βββββββββββ βββ ββββββββββββββββββββββββββββββ βββ βββ ββββββ ββββββ βββ
|
| 23 |
+
βββ βββββββ βββ βββ ββββββββββββββββββββββββββββββ βββ βββ ββββββ βββββ βββ
|
| 24 |
+
|
| 25 |
+
βββββββ ββββββ βββββββ
|
| 26 |
+
ββββββββββββββββββββββββ
|
| 27 |
+
βββββββββββββββββββ ββββ
|
| 28 |
+
βββββββββββββββββββ βββ
|
| 29 |
+
βββ ββββββ ββββββββββββ
|
| 30 |
+
βββ ββββββ βββ βββββββ
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
### Enterprise Agentic Retrieval-Augmented Generation System
|
| 34 |
+
|
| 35 |
+
<br/>
|
| 36 |
+
|
| 37 |
+
[](https://fastapi.tiangolo.com/)
|
| 38 |
+
[](https://nextjs.org/)
|
| 39 |
+
[](https://python.org/)
|
| 40 |
+
[](https://langchain.com/)
|
| 41 |
+
[](https://trychroma.com/)
|
| 42 |
+
[](https://huggingface.co/)
|
| 43 |
+
[](https://docker.com/)
|
| 44 |
+
[](LICENSE)
|
| 45 |
+
|
| 46 |
+
<br/>
|
| 47 |
+
|
| 48 |
+
> **Upload Β· Embed Β· Retrieve Β· Chat** β A production-grade AI document assistant built end-to-end with an agentic RAG pipeline, streaming responses, and per-user data isolation.
|
| 49 |
+
|
| 50 |
+
<br/>
|
| 51 |
+
|
| 52 |
+
[Features](#-key-features) Β· [Tech Stack](#-tech-stack) Β· [Getting Started](#-getting-started) Β· [Architecture](#-architecture) Β· [RAG Pipeline](#-rag-pipeline) Β· [API Reference](#-api-reference) Β· [Deployment](#-deployment) Β· [Contributing](#-contributing)
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
</div>
|
| 57 |
+
|
| 58 |
+
## π€ Contributors
|
| 59 |
+
|
| 60 |
+
Thanks to all the amazing people who have contributed to **PDF-Assistant-RAG**! π
|
| 61 |
+
|
| 62 |
+
<br/>
|
| 63 |
+
|
| 64 |
+
<div align="center">
|
| 65 |
+
<a href="https://github.com/param20h/PDF-Assistant-RAG/graphs/contributors">
|
| 66 |
+
<img src="https://contrib.rocks/image?repo=param20h/PDF-Assistant-RAG" alt="Contributors" />
|
| 67 |
+
</a>
|
| 68 |
+
</div>
|
| 69 |
+
|
| 70 |
+
<br/>
|
| 71 |
+
|
| 72 |
+
> π **GSSOC Contributors** β This project is open for [GirlScript Summer of Code](https://gssoc.girlscript.tech/). Check out our [CONTRIBUTING.md](CONTRIBUTING.md) to get started and browse [open issues](https://github.com/param20h/PDF-Assistant-RAG/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) tagged `good first issue`.
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
<br/>
|
| 77 |
+
|
| 78 |
+
## π Overview
|
| 79 |
+
|
| 80 |
+
**PDF-Assistant-RAG** is a complete, production-ready AI document assistant that lets users upload complex PDFs, financial reports, legal contracts, and research papers β then chat with an AI that provides **accurate, cited answers** powered by a multi-stage Retrieval-Augmented Generation pipeline.
|
| 81 |
+
|
| 82 |
+
The system uses **semantic search + cross-encoder reranking** to find the most relevant document chunks, streams AI-generated answers token-by-token, and highlights exact source citations with page numbers β all inside a sleek Next.js UI with JWT-secured per-user data isolation.
|
| 83 |
+
|
| 84 |
+
<br/>
|
| 85 |
+
|
| 86 |
+
## π Tech Stack
|
| 87 |
+
|
| 88 |
+
<div align="center">
|
| 89 |
+
|
| 90 |
+
### Backend
|
| 91 |
+
|
| 92 |
+
| | Technology | Purpose |
|
| 93 |
+
|---|---|---|
|
| 94 |
+
| <img src="https://skillicons.dev/icons?i=fastapi" width="30"/> | **FastAPI 0.115+** | Async REST API framework |
|
| 95 |
+
| <img src="https://skillicons.dev/icons?i=python" width="30"/> | **Python 3.11** | Runtime environment |
|
| 96 |
+
| <img src="https://skillicons.dev/icons?i=sqlite" width="30"/> | **SQLite + SQLAlchemy** | User & document metadata storage |
|
| 97 |
+
| <img src="https://img.shields.io/badge/JWT-000000?style=flat&logo=jsonwebtokens&logoColor=white" height="24"/> | **JWT + Passlib** | Authentication & authorization |
|
| 98 |
+
| <img src="https://img.shields.io/badge/LangChain-1C3C3C?style=flat&logo=langchain&logoColor=white" height="24"/> | **LangChain** | RAG orchestration |
|
| 99 |
+
| <img src="https://img.shields.io/badge/ChromaDB-FF6B35?style=flat" height="24"/> | **ChromaDB** | Persistent vector store (per-user) |
|
| 100 |
+
| <img src="https://img.shields.io/badge/HuggingFace-FFD21E?style=flat&logo=huggingface&logoColor=black" height="24"/> | **HuggingFace Hub** | LLM inference API |
|
| 101 |
+
|
| 102 |
+
### Frontend
|
| 103 |
+
|
| 104 |
+
| | Technology | Purpose |
|
| 105 |
+
|---|---|---|
|
| 106 |
+
| <img src="https://skillicons.dev/icons?i=nextjs" width="30"/> | **Next.js 16** | React framework (App Router) |
|
| 107 |
+
| <img src="https://skillicons.dev/icons?i=tailwind" width="30"/> | **Tailwind CSS v4** | Utility-first styling |
|
| 108 |
+
| <img src="https://img.shields.io/badge/shadcn/ui-000000?style=flat&logo=shadcnui&logoColor=white" height="24"/> | **shadcn/ui** | Accessible component library |
|
| 109 |
+
| <img src="https://skillicons.dev/icons?i=ts" width="30"/> | **TypeScript** | Type-safe frontend |
|
| 110 |
+
| <img src="https://img.shields.io/badge/react--pdf-FF0000?style=flat" height="24"/> | **react-pdf** | In-browser PDF viewer |
|
| 111 |
+
| <img src="https://img.shields.io/badge/react--markdown-000000?style=flat" height="24"/> | **react-markdown + GFM** | Markdown-rendered AI responses |
|
| 112 |
+
|
| 113 |
+
### AI / ML Pipeline
|
| 114 |
+
|
| 115 |
+
| | Technology | Purpose |
|
| 116 |
+
|---|---|---|
|
| 117 |
+
| <img src="https://img.shields.io/badge/MiniLM-L6-v2-FFD21E?style=flat" height="24"/> | **all-MiniLM-L6-v2** | Local sentence embeddings |
|
| 118 |
+
| <img src="https://img.shields.io/badge/ms--marco--MiniLM-1C3C3C?style=flat" height="24"/> | **ms-marco-MiniLM-L-6-v2** | Cross-encoder reranker |
|
| 119 |
+
| <img src="https://img.shields.io/badge/Qwen2.5-72B-Instruct-626BDF?style=flat" height="24"/> | **Qwen2.5-72B-Instruct** | LLM (HuggingFace Inference API) |
|
| 120 |
+
| <img src="https://img.shields.io/badge/PyMuPDF-FF0000?style=flat" height="24"/> | **PyMuPDF + python-docx** | Document parsing |
|
| 121 |
+
|
| 122 |
+
### DevOps & Tooling
|
| 123 |
+
|
| 124 |
+
| | Technology | Purpose |
|
| 125 |
+
|---|---|---|
|
| 126 |
+
| <img src="https://skillicons.dev/icons?i=docker" width="30"/> | **Docker Multi-Stage** | Containerized deployment |
|
| 127 |
+
| <img src="https://skillicons.dev/icons?i=githubactions" width="30"/> | **GitHub Actions** | CI pipeline (dev branch) |
|
| 128 |
+
| <img src="https://skillicons.dev/icons?i=git" width="30"/> | **Git LFS** | Binary asset management |
|
| 129 |
+
| <img src="https://img.shields.io/badge/HuggingFace_Spaces-FFD21E?style=flat&logo=huggingface&logoColor=black" height="24"/> | **HuggingFace Spaces** | Production deployment |
|
| 130 |
+
|
| 131 |
+
</div>
|
| 132 |
+
|
| 133 |
+
<br/>
|
| 134 |
+
|
| 135 |
+
## β¨ Key Features
|
| 136 |
+
|
| 137 |
+
<table>
|
| 138 |
+
<tr>
|
| 139 |
+
<td width="33%" valign="top">
|
| 140 |
+
|
| 141 |
+
### π€ Users
|
| 142 |
+
- π JWT-secured register & login
|
| 143 |
+
- π Upload **PDF** and **DOCX** documents
|
| 144 |
+
- π¬ Ask questions in natural language
|
| 145 |
+
- π **Streaming AI responses** token-by-token
|
| 146 |
+
- π Inline **source citations** with page numbers
|
| 147 |
+
- ποΈ Per-user complete data isolation
|
| 148 |
+
|
| 149 |
+
</td>
|
| 150 |
+
<td width="33%" valign="top">
|
| 151 |
+
|
| 152 |
+
### π€ RAG Pipeline
|
| 153 |
+
- πͺ Smart **recursive text chunking** (configurable size & overlap)
|
| 154 |
+
- π§ **Local embeddings** β no data leaves your machine
|
| 155 |
+
- π **Two-stage retrieval** β semantic search β cross-encoder rerank
|
| 156 |
+
- βοΈ Top-K filtering for precision answers
|
| 157 |
+
- π Custom **system prompts** with citation instructions
|
| 158 |
+
- π§Ύ Source scoring with confidence levels
|
| 159 |
+
|
| 160 |
+
</td>
|
| 161 |
+
<td width="33%" valign="top">
|
| 162 |
+
|
| 163 |
+
### βοΈ Engineering
|
| 164 |
+
- π **Async FastAPI** with Server-Sent Events streaming
|
| 165 |
+
- ποΈ **ChromaDB** with persistent per-user collections
|
| 166 |
+
- π³ **Multi-stage Docker** build (Node β Python)
|
| 167 |
+
- π **GitHub Actions CI** on `dev` branch
|
| 168 |
+
- π‘οΈ CORS, file validation, JWT expiry
|
| 169 |
+
- π Chat **history persistence** per document
|
| 170 |
+
|
| 171 |
+
</td>
|
| 172 |
+
</tr>
|
| 173 |
+
</table>
|
| 174 |
+
|
| 175 |
+
<br/>
|
| 176 |
+
|
| 177 |
+
## π Project Structure
|
| 178 |
+
|
| 179 |
+
```
|
| 180 |
+
PDF-Assistant-RAG/
|
| 181 |
+
β
|
| 182 |
+
βββ backend/ # FastAPI + RAG server
|
| 183 |
+
β βββ app/
|
| 184 |
+
β β βββ main.py # App entrypoint, middleware, static files
|
| 185 |
+
β β βββ config.py # Pydantic settings (env vars)
|
| 186 |
+
β β βββ database.py # SQLAlchemy async engine
|
| 187 |
+
β β βββ models.py # ORM models (User, Document, Message)
|
| 188 |
+
β β βββ schemas.py # Pydantic request/response schemas
|
| 189 |
+
β β βββ auth.py # JWT creation & verification
|
| 190 |
+
β β β
|
| 191 |
+
β β βββ routes/
|
| 192 |
+
β β β βββ auth.py # POST /register, /login, /me
|
| 193 |
+
β β β βββ documents.py # Upload, list, delete, retrieve
|
| 194 |
+
β β β βββ chat.py # Streaming chat + history
|
| 195 |
+
β β β
|
| 196 |
+
β β βββ rag/
|
| 197 |
+
β β βββ agent.py # Main RAG orchestrator
|
| 198 |
+
β β βββ chunker.py # Recursive text splitter
|
| 199 |
+
β β βββ embeddings.py # SentenceTransformer wrapper
|
| 200 |
+
β β βββ vectorstore.py # ChromaDB collection manager
|
| 201 |
+
β β βββ retriever.py # Semantic search + reranking
|
| 202 |
+
β β βββ prompts.py # System & user prompt templates
|
| 203 |
+
β β
|
| 204 |
+
β βββ requirements.txt
|
| 205 |
+
β βββ .env # Local env (never committed)
|
| 206 |
+
β
|
| 207 |
+
βββ frontend/ # Next.js 16 App Router
|
| 208 |
+
β βββ src/
|
| 209 |
+
β βββ app/
|
| 210 |
+
β β βββ layout.tsx # Root layout + fonts
|
| 211 |
+
β β βββ page.tsx # Landing / redirect
|
| 212 |
+
β β βββ login/ # Auth pages
|
| 213 |
+
β β βββ register/
|
| 214 |
+
β β βββ dashboard/ # Main app page
|
| 215 |
+
β β
|
| 216 |
+
β βββ components/
|
| 217 |
+
β β βββ chat/
|
| 218 |
+
β β β βββ ChatPanel.tsx # Chat UI + SSE streaming
|
| 219 |
+
β β β βββ MessageBubble.tsx # User / assistant message
|
| 220 |
+
β β β βββ SourceCard.tsx # Citation cards
|
| 221 |
+
β β βββ document/ # Upload + sidebar components
|
| 222 |
+
β β βββ layout/ # Navbar, sidebar shell
|
| 223 |
+
β β
|
| 224 |
+
β βββ lib/
|
| 225 |
+
β βββ api.ts # Typed API client + SSE stream helper
|
| 226 |
+
β
|
| 227 |
+
βββ .github/
|
| 228 |
+
β βββ workflows/
|
| 229 |
+
β β βββ ci.yml # CI β runs on dev branch only
|
| 230 |
+
β β βββ deploy.yml # Docker build β main branch only
|
| 231 |
+
β β βββ devsecops.yml # Security scans β main branch only
|
| 232 |
+
β βββ ISSUE_TEMPLATE/ # Bug report & feature request forms
|
| 233 |
+
β βββ pull_request_template.md # PR checklist
|
| 234 |
+
β βββ CODEOWNERS # Auto-review assignment
|
| 235 |
+
β
|
| 236 |
+
βββ Dockerfile # Multi-stage: Node build β Python serve
|
| 237 |
+
βββ docker-compose.yml # Local Docker stack
|
| 238 |
+
βββ CONTRIBUTING.md # GSSOC contributor guide
|
| 239 |
+
βββ .env.example # Template for environment variables
|
| 240 |
+
```
|
| 241 |
+
|
| 242 |
+
<br/>
|
| 243 |
+
|
| 244 |
+
## π Getting Started
|
| 245 |
+
|
| 246 |
+
### Prerequisites
|
| 247 |
+
|
| 248 |
+
-  **Python 3.11+**
|
| 249 |
+
-  **Node.js 20+**
|
| 250 |
+
-  **HuggingFace account** (free) for LLM inference
|
| 251 |
+
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
### 1. Clone the Repository
|
| 255 |
+
|
| 256 |
+
```bash
|
| 257 |
+
git clone https://github.com/param20h/PDF-Assistant-RAG.git
|
| 258 |
+
cd PDF-Assistant-RAG
|
| 259 |
+
```
|
| 260 |
|
| 261 |
+
### 2. Configure Environment
|
|
|
|
|
|
|
|
|
|
|
|
|
| 262 |
|
| 263 |
+
```bash
|
| 264 |
+
cp .env.example backend/.env
|
| 265 |
+
```
|
| 266 |
|
| 267 |
+
Edit `backend/.env`:
|
| 268 |
+
|
| 269 |
+
```env
|
| 270 |
+
SECRET_KEY=your-strong-random-secret
|
| 271 |
+
DATABASE_URL=sqlite:///./data/app.db
|
| 272 |
+
HF_TOKEN=hf_your_huggingface_token_here
|
| 273 |
+
UPLOAD_DIR=./data/uploads
|
| 274 |
+
CHROMA_PERSIST_DIR=./data/chroma_db
|
| 275 |
+
```
|
| 276 |
|
| 277 |
+
> Get your free HuggingFace token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
|
| 278 |
|
| 279 |
+
### 3. Run Locally
|
|
|
|
|
|
|
|
|
|
| 280 |
|
| 281 |
+
Open **two terminals**:
|
| 282 |
|
| 283 |
```bash
|
| 284 |
+
# Terminal A β Backend
|
| 285 |
+
cd backend
|
| 286 |
+
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
|
| 287 |
pip install -r requirements.txt
|
| 288 |
+
uvicorn app.main:app --reload --port 8000
|
| 289 |
+
# β API running at http://localhost:8000
|
| 290 |
+
# β Swagger docs at http://localhost:8000/docs
|
| 291 |
+
```
|
| 292 |
+
|
| 293 |
+
```bash
|
| 294 |
+
# Terminal B β Frontend
|
| 295 |
+
cd frontend
|
| 296 |
+
npm install
|
| 297 |
+
npm run dev
|
| 298 |
+
# β App running at http://localhost:3000
|
| 299 |
+
```
|
| 300 |
+
|
| 301 |
+
### 4. Run with Docker
|
| 302 |
+
|
| 303 |
+
```bash
|
| 304 |
+
docker compose up --build
|
| 305 |
+
# β Full stack at http://localhost:7860
|
| 306 |
+
```
|
| 307 |
+
|
| 308 |
+
<br/>
|
| 309 |
+
|
| 310 |
+
## π§ RAG Pipeline
|
| 311 |
+
|
| 312 |
+
```
|
| 313 |
+
βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 314 |
+
β PDF / DOCX Upload β
|
| 315 |
+
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
|
| 316 |
+
β
|
| 317 |
+
βΌ
|
| 318 |
+
βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 319 |
+
β PyMuPDF / python-docx Parser β
|
| 320 |
+
β (text extraction per page) β
|
| 321 |
+
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
|
| 322 |
+
β
|
| 323 |
+
βΌ
|
| 324 |
+
βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 325 |
+
β Recursive Character Text Splitter β
|
| 326 |
+
β chunk_size=1000 | overlap=200 β
|
| 327 |
+
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
|
| 328 |
+
β
|
| 329 |
+
βΌ
|
| 330 |
+
βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 331 |
+
β all-MiniLM-L6-v2 (local embeddings) β
|
| 332 |
+
β 384-dim dense vectors β
|
| 333 |
+
βββββββββββββββββββββ¬ββββββββββββββββββββββββββ
|
| 334 |
+
β
|
| 335 |
+
βΌ
|
| 336 |
+
βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 337 |
+
β ChromaDB β per-user persistent collection β
|
| 338 |
+
βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 339 |
|
| 340 |
+
ββ At Query Time ββ
|
| 341 |
+
|
| 342 |
+
User Question βββΆ Embed βββΆ Semantic Search (Top-K=10)
|
| 343 |
+
β
|
| 344 |
+
βΌ
|
| 345 |
+
Cross-Encoder Reranker (Top-K=5)
|
| 346 |
+
ms-marco-MiniLM-L-6-v2
|
| 347 |
+
β
|
| 348 |
+
βΌ
|
| 349 |
+
Prompt Assembly (system + context + question)
|
| 350 |
+
β
|
| 351 |
+
βΌ
|
| 352 |
+
Qwen2.5-72B-Instruct (HF Inference API)
|
| 353 |
+
β
|
| 354 |
+
βΌ
|
| 355 |
+
Streamed SSE tokens βββΆ Frontend ChatPanel
|
| 356 |
```
|
| 357 |
|
| 358 |
+
<br/>
|
| 359 |
+
|
| 360 |
+
## π‘ API Reference
|
| 361 |
+
|
| 362 |
+
| Method | Endpoint | Auth | Description |
|
| 363 |
+
|--------|----------|------|-------------|
|
| 364 |
+
| `POST` | `/api/v1/auth/register` | β | Create a new user account |
|
| 365 |
+
| `POST` | `/api/v1/auth/login` | β | Login and receive JWT token |
|
| 366 |
+
| `GET` | `/api/v1/auth/me` | β
| Get current user profile |
|
| 367 |
+
| `POST` | `/api/v1/documents/upload` | β
| Upload PDF/DOCX and trigger indexing |
|
| 368 |
+
| `GET` | `/api/v1/documents` | β
| List all documents for current user |
|
| 369 |
+
| `DELETE` | `/api/v1/documents/{id}` | β
| Delete a document and its vector data |
|
| 370 |
+
| `POST` | `/api/v1/chat/ask/stream` | β
| Ask a question (SSE streaming response) |
|
| 371 |
+
| `GET` | `/api/v1/chat/history/{doc_id}` | β
| Get chat history for a document |
|
| 372 |
+
| `DELETE` | `/api/v1/chat/history/{doc_id}` | β
| Clear chat history for a document |
|
| 373 |
+
| `GET` | `/health` | β | Health check (db + chroma status) |
|
| 374 |
+
|
| 375 |
+
> Full interactive docs available at `/docs` (Swagger UI) when running locally.
|
| 376 |
+
|
| 377 |
+
<br/>
|
| 378 |
+
|
| 379 |
## π¦ Environment Variables
|
| 380 |
|
| 381 |
+
| Variable | Required | Default | Description |
|
| 382 |
+
|---|---|---|---|
|
| 383 |
+
| `HF_TOKEN` | β
| β | HuggingFace API token for LLM inference |
|
| 384 |
+
| `SECRET_KEY` | β
| β | JWT signing secret (use a strong random string) |
|
| 385 |
+
| `DATABASE_URL` | β | `sqlite:///./data/app.db` | SQLAlchemy database URL |
|
| 386 |
+
| `UPLOAD_DIR` | β | `./data/uploads` | Directory for uploaded files |
|
| 387 |
+
| `CHROMA_PERSIST_DIR` | β | `./data/chroma_db` | ChromaDB persistence path |
|
| 388 |
+
| `LLM_MODEL` | β | `Qwen/Qwen2.5-72B-Instruct` | HuggingFace model ID |
|
| 389 |
+
| `LLM_TEMPERATURE` | β | `0.3` | LLM sampling temperature |
|
| 390 |
+
| `LLM_MAX_NEW_TOKENS` | β | `1024` | Max tokens per response |
|
| 391 |
+
| `EMBEDDING_MODEL` | β | `all-MiniLM-L6-v2` | SentenceTransformer model |
|
| 392 |
+
| `CHUNK_SIZE` | β | `1000` | Document chunk size (characters) |
|
| 393 |
+
| `CHUNK_OVERLAP` | β | `200` | Overlap between chunks |
|
| 394 |
+
| `TOP_K_RETRIEVAL` | β | `10` | Candidates retrieved from vector store |
|
| 395 |
+
| `TOP_K_RERANK` | β | `5` | Final chunks passed to LLM after reranking |
|
| 396 |
+
| `MAX_FILE_SIZE_MB` | β | `50` | Maximum upload file size |
|
| 397 |
+
|
| 398 |
+
<br/>
|
| 399 |
+
|
| 400 |
+
## π Scripts
|
| 401 |
+
|
| 402 |
+
### Backend (`backend/`)
|
| 403 |
+
|
| 404 |
+
| Command | Description |
|
| 405 |
+
|---------|-------------|
|
| 406 |
+
| `uvicorn app.main:app --reload` | Start FastAPI with hot reload |
|
| 407 |
+
| `uvicorn app.main:app --port 8000` | Start FastAPI on port 8000 |
|
| 408 |
+
|
| 409 |
+
### Frontend (`frontend/`)
|
| 410 |
+
|
| 411 |
+
| Command | Description |
|
| 412 |
+
|---------|-------------|
|
| 413 |
+
| `npm run dev` | Start **Next.js** dev server |
|
| 414 |
+
| `npm run build` | Production build β `out/` (static export) |
|
| 415 |
+
| `npm run lint` | Run ESLint |
|
| 416 |
+
|
| 417 |
+
### Docker
|
| 418 |
+
|
| 419 |
+
| Command | Description |
|
| 420 |
+
|---------|-------------|
|
| 421 |
+
| `docker compose up --build` | Build and start the full stack |
|
| 422 |
+
| `docker compose down` | Stop all containers |
|
| 423 |
+
|
| 424 |
+
<br/>
|
| 425 |
+
|
| 426 |
+
## π Deployment
|
| 427 |
+
|
| 428 |
+
This project is deployed on **HuggingFace Spaces** using Docker.
|
| 429 |
+
|
| 430 |
+
### HuggingFace Spaces
|
| 431 |
+
|
| 432 |
+
1. Fork this repo and create a new Space at [huggingface.co/new-space](https://huggingface.co/new-space) (SDK: Docker)
|
| 433 |
+
2. Set the following Space secrets:
|
| 434 |
+
- `HF_TOKEN` β your HuggingFace API token
|
| 435 |
+
- `SECRET_KEY` β a strong random string
|
| 436 |
+
3. Push to the `hf` remote β the Space will auto-build
|
| 437 |
+
|
| 438 |
+
```bash
|
| 439 |
+
git remote add hf https://<username>:<HF_TOKEN>@huggingface.co/spaces/<username>/<space-name>
|
| 440 |
+
git push hf main
|
| 441 |
+
```
|
| 442 |
+
|
| 443 |
+
### Self-Hosted / VPS
|
| 444 |
+
|
| 445 |
+
```bash
|
| 446 |
+
docker compose up -d --build
|
| 447 |
+
# App available at http://your-server:7860
|
| 448 |
+
```
|
| 449 |
+
|
| 450 |
+
<br/>
|
| 451 |
+
|
| 452 |
+
## π€ Contributing β GSSOC
|
| 453 |
+
|
| 454 |
+
This project is participating in **GirlScript Summer of Code**! We welcome contributors of all skill levels.
|
| 455 |
+
|
| 456 |
+
**Branch Strategy:**
|
| 457 |
+
|
| 458 |
+
| Branch | Purpose |
|
| 459 |
+
|--------|---------|
|
| 460 |
+
| `main` | Production β HuggingFace deployed (admin only) |
|
| 461 |
+
| `dev` | All contributor PRs target here |
|
| 462 |
+
| `feature/*` / `fix/*` / `docs/*` | Your working branches |
|
| 463 |
+
|
| 464 |
+
```bash
|
| 465 |
+
# Always branch from dev
|
| 466 |
+
git checkout -b feature/my-feature upstream/dev
|
| 467 |
+
```
|
| 468 |
+
|
| 469 |
+
**Quick links:**
|
| 470 |
+
- π [Good First Issues](https://github.com/param20h/PDF-Assistant-RAG/issues?q=label%3A%22good+first+issue%22)
|
| 471 |
+
- π [Contributing Guide](CONTRIBUTING.md)
|
| 472 |
+
- π¬ [Discussions](https://github.com/param20h/PDF-Assistant-RAG/discussions)
|
| 473 |
+
|
| 474 |
+
<br/>
|
| 475 |
+
|
| 476 |
+
## π License
|
| 477 |
+
|
| 478 |
+
Distributed under the **MIT License**. See [`LICENSE`](license) for more information.
|
| 479 |
+
|
| 480 |
+
---
|
| 481 |
+
|
| 482 |
+
<div align="center">
|
| 483 |
+
|
| 484 |
+
<br/>
|
| 485 |
+
|
| 486 |
+
**Built with π as a flagship AI engineering project**
|
| 487 |
+
|
| 488 |
+
*If you found this project helpful, please give it a β β it helps GSSOC contributors discover it!*
|
| 489 |
+
|
| 490 |
+
<br/>
|
| 491 |
+
|
| 492 |
+
[](https://skillicons.dev)
|
| 493 |
+
|
| 494 |
+
<br/>
|
| 495 |
|
| 496 |
+
**[β¬ Back to top](#)**
|
| 497 |
|
| 498 |
+
</div>
|