Param20h committed
Commit bf52102 · unverified · 1 Parent(s): c6e1b35

docs: overhaul README with full project documentation, RAG pipeline diagram, API reference, and GSSOC contributor section

Files changed (1): README.md +468 -36
@@ -10,57 +10,489 @@ license: mit
  short_description: Enterprise Agentic RAG - upload PDFs and chat with AI
  ---

- # 🧠 Document AI Analyst - Enterprise Agentic RAG System

- Upload complex PDFs, financial reports, legal contracts, or research papers and chat with an AI agent that provides **accurate, cited insights** powered by Retrieval-Augmented Generation.

- ## ✨ Features

- - **Multi-Format Upload** - PDF, DOCX, TXT, Markdown with smart chunking
- - **Semantic Search** - Two-stage retrieval with cross-encoder reranking
- - **Streaming Chat** - Real-time AI responses with inline source citations
- - **Data Isolation** - Per-user vector collections for complete privacy
- - **Open-Source LLMs** - Powered by Mistral-7B and HuggingFace ecosystem

- ## 🏗️ Architecture

- | Layer | Technology |
- |---|---|
- | **Frontend** | Next.js 16, Tailwind CSS v4, Shadcn UI v2 |
- | **Backend** | FastAPI, SQLAlchemy, JWT Auth |
- | **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 (local) |
- | **Vector Store** | ChromaDB (persistent, per-user collections) |
- | **Reranker** | cross-encoder/ms-marco-MiniLM-L-6-v2 |
- | **LLM** | Mistral-7B-Instruct via HuggingFace Inference API |
- | **Deployment** | Docker multi-stage build on HuggingFace Spaces |

- ## 🚀 Quick Start

- 1. **Register** an account
- 2. **Upload** a PDF document
- 3. **Wait** for processing (chunking + embedding)
- 4. **Ask** questions and get cited answers!

- ## 🔧 Local Development

  ```bash
- # Backend
- cd backend && python3 -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
- uvicorn app.main:app --port 7860

- # Frontend
- cd frontend && npm install && npm run dev
  ```

  ## 📦 Environment Variables

- | Variable | Required | Description |
- |---|---|---|
- | `HF_TOKEN` | ✅ | HuggingFace API token for LLM inference |
- | `SECRET_KEY` | ✅ | JWT signing secret |
- | `DATABASE_URL` | ❌ | SQLite path (default: `sqlite:///./data/app.db`) |

- ## 🛠️ Tech Stack

- Built with: **FastAPI** • **LangChain** • **ChromaDB** • **HuggingFace** • **Next.js 16** • **Tailwind CSS** • **Shadcn UI**

+ <div align="center">

+ <br/>

+ ```
+ [ASCII-art banner: PDF ASSISTANT / RAG]
+ ```
+
+ ### Enterprise Agentic Retrieval-Augmented Generation System
+
+ <br/>
+
+ [![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-009688?style=for-the-badge&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
+ [![Next.js](https://img.shields.io/badge/Next.js-16-000000?style=for-the-badge&logo=next.js&logoColor=white)](https://nextjs.org/)
+ [![Python](https://img.shields.io/badge/Python-3.11-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org/)
+ [![LangChain](https://img.shields.io/badge/LangChain-RAG-1C3C3C?style=for-the-badge&logo=langchain&logoColor=white)](https://langchain.com/)
+ [![ChromaDB](https://img.shields.io/badge/ChromaDB-VectorStore-FF6B35?style=for-the-badge)](https://trychroma.com/)
+ [![HuggingFace](https://img.shields.io/badge/HuggingFace-Inference-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/)
+ [![Docker](https://img.shields.io/badge/Docker-Multi--Stage-2496ED?style=for-the-badge&logo=docker&logoColor=white)](https://docker.com/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-F59E0B?style=for-the-badge)](LICENSE)
+
+ <br/>
+
+ > **Upload · Embed · Retrieve · Chat**: a production-grade AI document assistant built end to end with an agentic RAG pipeline, streaming responses, and per-user data isolation.
+
+ <br/>
+
+ [Features](#-key-features) · [Tech Stack](#-tech-stack) · [Getting Started](#-getting-started) · [Architecture](#-architecture) · [RAG Pipeline](#-rag-pipeline) · [API Reference](#-api-reference) · [Deployment](#-deployment) · [Contributing](#-contributing)
+
+ ---
+
+ </div>
+
+ ## 🤝 Contributors
+
+ Thanks to all the amazing people who have contributed to **PDF-Assistant-RAG**! 🎉
+
+ <br/>
+
+ <div align="center">
+ <a href="https://github.com/param20h/PDF-Assistant-RAG/graphs/contributors">
+ <img src="https://contrib.rocks/image?repo=param20h/PDF-Assistant-RAG" alt="Contributors" />
+ </a>
+ </div>
+
+ <br/>
+
+ > 🌟 **GSSOC Contributors**: This project is open for [GirlScript Summer of Code](https://gssoc.girlscript.tech/). Check out our [CONTRIBUTING.md](CONTRIBUTING.md) to get started and browse [open issues](https://github.com/param20h/PDF-Assistant-RAG/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) tagged `good first issue`.
+
+ ---
+
+ <br/>
+
+ ## 🌟 Overview
+
+ **PDF-Assistant-RAG** is a complete, production-ready AI document assistant. Users upload complex PDFs, financial reports, legal contracts, and research papers, then chat with an AI that provides **accurate, cited answers** powered by a multi-stage Retrieval-Augmented Generation pipeline.
+
+ The system uses **semantic search + cross-encoder reranking** to find the most relevant document chunks, streams AI-generated answers token by token, and highlights exact source citations with page numbers. Everything runs inside a Next.js UI with JWT-secured per-user data isolation.
+
+ <br/>
+
+ ## 🛠 Tech Stack
+
+ <div align="center">
+
+ ### Backend
+
+ | | Technology | Purpose |
+ |---|---|---|
+ | <img src="https://skillicons.dev/icons?i=fastapi" width="30"/> | **FastAPI 0.115+** | Async REST API framework |
+ | <img src="https://skillicons.dev/icons?i=python" width="30"/> | **Python 3.11** | Runtime environment |
+ | <img src="https://skillicons.dev/icons?i=sqlite" width="30"/> | **SQLite + SQLAlchemy** | User & document metadata storage |
+ | <img src="https://img.shields.io/badge/JWT-000000?style=flat&logo=jsonwebtokens&logoColor=white" height="24"/> | **JWT + Passlib** | Authentication & authorization |
+ | <img src="https://img.shields.io/badge/LangChain-1C3C3C?style=flat&logo=langchain&logoColor=white" height="24"/> | **LangChain** | RAG orchestration |
+ | <img src="https://img.shields.io/badge/ChromaDB-FF6B35?style=flat" height="24"/> | **ChromaDB** | Persistent vector store (per-user) |
+ | <img src="https://img.shields.io/badge/HuggingFace-FFD21E?style=flat&logo=huggingface&logoColor=black" height="24"/> | **HuggingFace Hub** | LLM inference API |
+
+ ### Frontend
+
+ | | Technology | Purpose |
+ |---|---|---|
+ | <img src="https://skillicons.dev/icons?i=nextjs" width="30"/> | **Next.js 16** | React framework (App Router) |
+ | <img src="https://skillicons.dev/icons?i=tailwind" width="30"/> | **Tailwind CSS v4** | Utility-first styling |
+ | <img src="https://img.shields.io/badge/shadcn/ui-000000?style=flat&logo=shadcnui&logoColor=white" height="24"/> | **shadcn/ui** | Accessible component library |
+ | <img src="https://skillicons.dev/icons?i=ts" width="30"/> | **TypeScript** | Type-safe frontend |
+ | <img src="https://img.shields.io/badge/react--pdf-FF0000?style=flat" height="24"/> | **react-pdf** | In-browser PDF viewer |
+ | <img src="https://img.shields.io/badge/react--markdown-000000?style=flat" height="24"/> | **react-markdown + GFM** | Markdown-rendered AI responses |
+
+ ### AI / ML Pipeline
+
+ | | Technology | Purpose |
+ |---|---|---|
+ | <img src="https://img.shields.io/badge/MiniLM-L6-v2-FFD21E?style=flat" height="24"/> | **all-MiniLM-L6-v2** | Local sentence embeddings |
+ | <img src="https://img.shields.io/badge/ms--marco--MiniLM-1C3C3C?style=flat" height="24"/> | **ms-marco-MiniLM-L-6-v2** | Cross-encoder reranker |
+ | <img src="https://img.shields.io/badge/Qwen2.5-72B-Instruct-626BDF?style=flat" height="24"/> | **Qwen2.5-72B-Instruct** | LLM (HuggingFace Inference API) |
+ | <img src="https://img.shields.io/badge/PyMuPDF-FF0000?style=flat" height="24"/> | **PyMuPDF + python-docx** | Document parsing |
+
+ ### DevOps & Tooling
+
+ | | Technology | Purpose |
+ |---|---|---|
+ | <img src="https://skillicons.dev/icons?i=docker" width="30"/> | **Docker Multi-Stage** | Containerized deployment |
+ | <img src="https://skillicons.dev/icons?i=githubactions" width="30"/> | **GitHub Actions** | CI pipeline (dev branch) |
+ | <img src="https://skillicons.dev/icons?i=git" width="30"/> | **Git LFS** | Binary asset management |
+ | <img src="https://img.shields.io/badge/HuggingFace_Spaces-FFD21E?style=flat&logo=huggingface&logoColor=black" height="24"/> | **HuggingFace Spaces** | Production deployment |
+
+ </div>
+
+ <br/>
+
135
+ ## ✨ Key Features
136
+
137
+ <table>
138
+ <tr>
139
+ <td width="33%" valign="top">
140
+
141
+ ### πŸ‘€ Users
142
+ - πŸ” JWT-secured register & login
143
+ - πŸ“„ Upload **PDF** and **DOCX** documents
144
+ - πŸ’¬ Ask questions in natural language
145
+ - 🌊 **Streaming AI responses** token-by-token
146
+ - πŸ“š Inline **source citations** with page numbers
147
+ - πŸ—‚οΈ Per-user complete data isolation
148
+
149
+ </td>
150
+ <td width="33%" valign="top">
151
+
152
+ ### πŸ€– RAG Pipeline
153
+ - πŸ”ͺ Smart **recursive text chunking** (configurable size & overlap)
154
+ - 🧠 **Local embeddings** β€” no data leaves your machine
155
+ - πŸ” **Two-stage retrieval** β€” semantic search β†’ cross-encoder rerank
156
+ - βœ‚οΈ Top-K filtering for precision answers
157
+ - πŸ“ Custom **system prompts** with citation instructions
158
+ - 🧾 Source scoring with confidence levels
159
+
160
+ </td>
161
+ <td width="33%" valign="top">
162
+
163
+ ### βš™οΈ Engineering
164
+ - πŸš€ **Async FastAPI** with Server-Sent Events streaming
165
+ - πŸ—„οΈ **ChromaDB** with persistent per-user collections
166
+ - 🐳 **Multi-stage Docker** build (Node β†’ Python)
167
+ - πŸ”„ **GitHub Actions CI** on `dev` branch
168
+ - πŸ›‘οΈ CORS, file validation, JWT expiry
169
+ - πŸ“Š Chat **history persistence** per document
170
+
171
+ </td>
172
+ </tr>
173
+ </table>
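
The two-stage retrieval listed in the feature table above (semantic search, then cross-encoder rerank, then Top-K filtering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's `retriever.py`: the function names, the `(text, vector)` chunk layout, and the pluggable `rerank_score` callable standing in for the cross-encoder are all hypothetical. The default cut-offs mirror the `TOP_K_RETRIEVAL=10` and `TOP_K_RERANK=5` defaults documented in this README.

```python
import math
from typing import Callable, List, Sequence, Tuple

Chunk = Tuple[str, Sequence[float]]  # (chunk text, embedding vector); hypothetical layout


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_retrieve(
    query_vec: Sequence[float],
    query_text: str,
    chunks: List[Chunk],
    rerank_score: Callable[[str, str], float],
    top_k_retrieval: int = 10,
    top_k_rerank: int = 5,
) -> List[str]:
    """Return the texts of the best chunks after retrieval and reranking."""
    # Stage 1: cheap vector similarity narrows the pool to top_k_retrieval.
    candidates = sorted(
        chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )[:top_k_retrieval]
    # Stage 2: a slower, more precise scorer (a cross-encoder in the real
    # pipeline; any callable here) reorders only those candidates.
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query_text, c[0]), reverse=True
    )
    return [text for text, _ in reranked[:top_k_rerank]]


def word_overlap(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))
```

The point of the split is cost: the vector search touches every chunk but is cheap per chunk, while the reranker is expensive per pair and therefore only sees the short candidate list.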
174
+
175
+ <br/>
176
+
177
+ ## πŸ“ Project Structure
178
+
179
+ ```
180
+ PDF-Assistant-RAG/
181
+ β”‚
182
+ β”œβ”€β”€ backend/ # FastAPI + RAG server
183
+ β”‚ β”œβ”€β”€ app/
184
+ β”‚ β”‚ β”œβ”€β”€ main.py # App entrypoint, middleware, static files
185
+ β”‚ β”‚ β”œβ”€β”€ config.py # Pydantic settings (env vars)
186
+ β”‚ β”‚ β”œβ”€β”€ database.py # SQLAlchemy async engine
187
+ β”‚ β”‚ β”œβ”€β”€ models.py # ORM models (User, Document, Message)
188
+ β”‚ β”‚ β”œβ”€β”€ schemas.py # Pydantic request/response schemas
189
+ β”‚ β”‚ β”œβ”€β”€ auth.py # JWT creation & verification
190
+ β”‚ β”‚ β”‚
191
+ β”‚ β”‚ β”œβ”€β”€ routes/
192
+ β”‚ β”‚ β”‚ β”œβ”€β”€ auth.py # POST /register, /login, /me
193
+ β”‚ β”‚ β”‚ β”œβ”€β”€ documents.py # Upload, list, delete, retrieve
194
+ β”‚ β”‚ β”‚ └── chat.py # Streaming chat + history
195
+ β”‚ β”‚ β”‚
196
+ β”‚ β”‚ └── rag/
197
+ β”‚ β”‚ β”œβ”€β”€ agent.py # Main RAG orchestrator
198
+ β”‚ β”‚ β”œβ”€β”€ chunker.py # Recursive text splitter
199
+ β”‚ β”‚ β”œβ”€β”€ embeddings.py # SentenceTransformer wrapper
200
+ β”‚ β”‚ β”œβ”€β”€ vectorstore.py # ChromaDB collection manager
201
+ β”‚ β”‚ β”œβ”€β”€ retriever.py # Semantic search + reranking
202
+ β”‚ β”‚ └── prompts.py # System & user prompt templates
203
+ β”‚ β”‚
204
+ β”‚ β”œβ”€β”€ requirements.txt
205
+ β”‚ └── .env # Local env (never committed)
206
+ β”‚
207
+ β”œβ”€β”€ frontend/ # Next.js 16 App Router
208
+ β”‚ └── src/
209
+ β”‚ β”œβ”€β”€ app/
210
+ β”‚ β”‚ β”œβ”€β”€ layout.tsx # Root layout + fonts
211
+ β”‚ β”‚ β”œβ”€β”€ page.tsx # Landing / redirect
212
+ β”‚ β”‚ β”œβ”€β”€ login/ # Auth pages
213
+ β”‚ β”‚ β”œβ”€β”€ register/
214
+ β”‚ β”‚ └── dashboard/ # Main app page
215
+ β”‚ β”‚
216
+ β”‚ β”œβ”€β”€ components/
217
+ β”‚ β”‚ β”œβ”€β”€ chat/
218
+ β”‚ β”‚ β”‚ β”œβ”€β”€ ChatPanel.tsx # Chat UI + SSE streaming
219
+ β”‚ β”‚ β”‚ β”œβ”€β”€ MessageBubble.tsx # User / assistant message
220
+ β”‚ β”‚ β”‚ └── SourceCard.tsx # Citation cards
221
+ β”‚ β”‚ β”œβ”€β”€ document/ # Upload + sidebar components
222
+ β”‚ β”‚ └── layout/ # Navbar, sidebar shell
223
+ β”‚ β”‚
224
+ β”‚ └── lib/
225
+ β”‚ └── api.ts # Typed API client + SSE stream helper
226
+ β”‚
227
+ β”œβ”€β”€ .github/
228
+ β”‚ β”œβ”€β”€ workflows/
229
+ β”‚ β”‚ β”œβ”€β”€ ci.yml # CI β€” runs on dev branch only
230
+ β”‚ β”‚ β”œβ”€β”€ deploy.yml # Docker build β€” main branch only
231
+ β”‚ β”‚ └── devsecops.yml # Security scans β€” main branch only
232
+ β”‚ β”œβ”€β”€ ISSUE_TEMPLATE/ # Bug report & feature request forms
233
+ β”‚ β”œβ”€β”€ pull_request_template.md # PR checklist
234
+ β”‚ └── CODEOWNERS # Auto-review assignment
235
+ β”‚
236
+ β”œβ”€β”€ Dockerfile # Multi-stage: Node build β†’ Python serve
237
+ β”œβ”€β”€ docker-compose.yml # Local Docker stack
238
+ β”œβ”€β”€ CONTRIBUTING.md # GSSOC contributor guide
239
+ └── .env.example # Template for environment variables
240
+ ```
241
+
242
+ <br/>
243
+
+ ## 🚀 Getting Started
+
+ ### Prerequisites
+
+ - ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?style=flat&logo=python&logoColor=white) **Python 3.11+**
+ - ![Node.js](https://img.shields.io/badge/Node.js-20+-339933?style=flat&logo=node.js&logoColor=white) **Node.js 20+**
+ - ![HuggingFace](https://img.shields.io/badge/HuggingFace-Token-FFD21E?style=flat&logo=huggingface&logoColor=black) **HuggingFace account** (free) for LLM inference
+
+ ---
+
+ ### 1. Clone the Repository
+
+ ```bash
+ git clone https://github.com/param20h/PDF-Assistant-RAG.git
+ cd PDF-Assistant-RAG
+ ```

+ ### 2. Configure Environment

+ ```bash
+ cp .env.example backend/.env
+ ```

+ Edit `backend/.env`:
+
+ ```env
+ SECRET_KEY=your-strong-random-secret
+ DATABASE_URL=sqlite:///./data/app.db
+ HF_TOKEN=hf_your_huggingface_token_here
+ UPLOAD_DIR=./data/uploads
+ CHROMA_PERSIST_DIR=./data/chroma_db
+ ```

+ > Get your free HuggingFace token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

+ ### 3. Run Locally

+ Open **two terminals**:

  ```bash
+ # Terminal A - Backend
+ cd backend
+ python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
  pip install -r requirements.txt
+ uvicorn app.main:app --reload --port 8000
+ # → API running at http://localhost:8000
+ # → Swagger docs at http://localhost:8000/docs
+ ```
+
+ ```bash
+ # Terminal B - Frontend
+ cd frontend
+ npm install
+ npm run dev
+ # → App running at http://localhost:3000
+ ```
+
+ ### 4. Run with Docker
+
+ ```bash
+ docker compose up --build
+ # → Full stack at http://localhost:7860
+ ```
+
+ <br/>
+
+ ## 🧠 RAG Pipeline
+
+ ```
+ ┌─────────────────────────────────────────────┐
+ │              PDF / DOCX Upload              │
+ └───────────────────┬─────────────────────────┘
+                     │
+                     ▼
+ ┌─────────────────────────────────────────────┐
+ │        PyMuPDF / python-docx Parser         │
+ │         (text extraction per page)          │
+ └───────────────────┬─────────────────────────┘
+                     │
+                     ▼
+ ┌─────────────────────────────────────────────┐
+ │      Recursive Character Text Splitter      │
+ │        chunk_size=1000 | overlap=200        │
+ └───────────────────┬─────────────────────────┘
+                     │
+                     ▼
+ ┌─────────────────────────────────────────────┐
+ │     all-MiniLM-L6-v2 (local embeddings)     │
+ │            384-dim dense vectors            │
+ └───────────────────┬─────────────────────────┘
+                     │
+                     ▼
+ ┌─────────────────────────────────────────────┐
+ │  ChromaDB - per-user persistent collection  │
+ └─────────────────────────────────────────────┘

+ ── At Query Time ──
+
+ User Question ──▶ Embed ──▶ Semantic Search (Top-K=10)
+                                          │
+                                          ▼
+                          Cross-Encoder Reranker (Top-K=5)
+                               ms-marco-MiniLM-L-6-v2
+                                          │
+                                          ▼
+                    Prompt Assembly (system + context + question)
+                                          │
+                                          ▼
+                       Qwen2.5-72B-Instruct (HF Inference API)
+                                          │
+                                          ▼
+                     Streamed SSE tokens ──▶ Frontend ChatPanel
  ```
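
The chunking step in the diagram (`chunk_size=1000`, `overlap=200`) can be illustrated with a simplified fixed-window splitter. This is a sketch only: the README names a recursive character text splitter, which additionally prefers to break on paragraph and sentence separators, whereas the hypothetical `chunk_text` helper below just slides a window over the characters to show how size and overlap interact.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk consists purely of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each chunk repeats the last 200 characters of its predecessor, so a sentence cut at a chunk boundary still appears whole in at least one chunk.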

+ <br/>
+
+ ## 📡 API Reference
+
+ | Method | Endpoint | Auth | Description |
+ |--------|----------|------|-------------|
+ | `POST` | `/api/v1/auth/register` | ❌ | Create a new user account |
+ | `POST` | `/api/v1/auth/login` | ❌ | Login and receive JWT token |
+ | `GET` | `/api/v1/auth/me` | ✅ | Get current user profile |
+ | `POST` | `/api/v1/documents/upload` | ✅ | Upload PDF/DOCX and trigger indexing |
+ | `GET` | `/api/v1/documents` | ✅ | List all documents for current user |
+ | `DELETE` | `/api/v1/documents/{id}` | ✅ | Delete a document and its vector data |
+ | `POST` | `/api/v1/chat/ask/stream` | ✅ | Ask a question (SSE streaming response) |
+ | `GET` | `/api/v1/chat/history/{doc_id}` | ✅ | Get chat history for a document |
+ | `DELETE` | `/api/v1/chat/history/{doc_id}` | ✅ | Clear chat history for a document |
+ | `GET` | `/health` | ❌ | Health check (db + chroma status) |
+
+ > Full interactive docs available at `/docs` (Swagger UI) when running locally.
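
A client of the streaming endpoint in the table above has to decode Server-Sent Events. The README does not document the payload of `/api/v1/chat/ask/stream`, so the sketch below only assumes standard `text/event-stream` framing (`data:` fields, blank line between events, `:` comment lines); the helper name `iter_sse_data` is hypothetical and the real event contents may differ.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE line stream."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event without its terminating blank line is incomplete and is dropped.
```

Any iterable of decoded lines works as input, for example the line iterator of an HTTP response body.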
+
+ <br/>
+
  ## 📦 Environment Variables

+ | Variable | Required | Default | Description |
+ |---|---|---|---|
+ | `HF_TOKEN` | ✅ | - | HuggingFace API token for LLM inference |
+ | `SECRET_KEY` | ✅ | - | JWT signing secret (use a strong random string) |
+ | `DATABASE_URL` | ❌ | `sqlite:///./data/app.db` | SQLAlchemy database URL |
+ | `UPLOAD_DIR` | ❌ | `./data/uploads` | Directory for uploaded files |
+ | `CHROMA_PERSIST_DIR` | ❌ | `./data/chroma_db` | ChromaDB persistence path |
+ | `LLM_MODEL` | ❌ | `Qwen/Qwen2.5-72B-Instruct` | HuggingFace model ID |
+ | `LLM_TEMPERATURE` | ❌ | `0.3` | LLM sampling temperature |
+ | `LLM_MAX_NEW_TOKENS` | ❌ | `1024` | Max tokens per response |
+ | `EMBEDDING_MODEL` | ❌ | `all-MiniLM-L6-v2` | SentenceTransformer model |
+ | `CHUNK_SIZE` | ❌ | `1000` | Document chunk size (characters) |
+ | `CHUNK_OVERLAP` | ❌ | `200` | Overlap between consecutive chunks (characters) |
+ | `TOP_K_RETRIEVAL` | ❌ | `10` | Candidates retrieved from vector store |
+ | `TOP_K_RERANK` | ❌ | `5` | Final chunks passed to LLM after reranking |
+ | `MAX_FILE_SIZE_MB` | ❌ | `50` | Maximum upload file size (MB) |
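
The table above can be read as a settings schema: two required secrets plus typed optional values with defaults. The project structure lists `config.py` as using Pydantic settings; the sketch below deliberately uses only the standard library to stay self-contained, so the `Settings` dataclass and `load_settings` function are hypothetical names and not the project's implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: str
    secret_key: str
    database_url: str
    upload_dir: str
    chroma_persist_dir: str
    llm_model: str
    llm_temperature: float
    llm_max_new_tokens: int
    embedding_model: str
    chunk_size: int
    chunk_overlap: int
    top_k_retrieval: int
    top_k_rerank: int
    max_file_size_mb: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (os.environ by default), applying README defaults."""
    env = os.environ if env is None else env
    missing = [name for name in ("HF_TOKEN", "SECRET_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return Settings(
        hf_token=env["HF_TOKEN"],
        secret_key=env["SECRET_KEY"],
        database_url=env.get("DATABASE_URL", "sqlite:///./data/app.db"),
        upload_dir=env.get("UPLOAD_DIR", "./data/uploads"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./data/chroma_db"),
        llm_model=env.get("LLM_MODEL", "Qwen/Qwen2.5-72B-Instruct"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.3")),
        llm_max_new_tokens=int(env.get("LLM_MAX_NEW_TOKENS", "1024")),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        top_k_retrieval=int(env.get("TOP_K_RETRIEVAL", "10")),
        top_k_rerank=int(env.get("TOP_K_RERANK", "5")),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "50")),
    )
```

Failing fast on the two required variables at startup is a design choice of this sketch: a missing `HF_TOKEN` or `SECRET_KEY` would otherwise only surface on the first chat request or login.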
+
+ <br/>
+
+ ## 📜 Scripts
+
+ ### Backend (`backend/`)
+
+ | Command | Description |
+ |---------|-------------|
+ | `uvicorn app.main:app --reload` | Start FastAPI with hot reload |
+ | `uvicorn app.main:app --port 8000` | Start FastAPI on port 8000 |
+
+ ### Frontend (`frontend/`)
+
+ | Command | Description |
+ |---------|-------------|
+ | `npm run dev` | Start **Next.js** dev server |
+ | `npm run build` | Production build → `out/` (static export) |
+ | `npm run lint` | Run ESLint |
+
+ ### Docker
+
+ | Command | Description |
+ |---------|-------------|
+ | `docker compose up --build` | Build and start the full stack |
+ | `docker compose down` | Stop all containers |
+
+ <br/>
+
+ ## 🌐 Deployment
+
+ This project is deployed on **HuggingFace Spaces** using Docker.
+
+ ### HuggingFace Spaces
+
+ 1. Fork this repo and create a new Space at [huggingface.co/new-space](https://huggingface.co/new-space) (SDK: Docker)
+ 2. Set the following Space secrets:
+    - `HF_TOKEN`: your HuggingFace API token
+    - `SECRET_KEY`: a strong random string
+ 3. Push to the `hf` remote; the Space will auto-build
+
+ ```bash
+ git remote add hf https://<username>:<HF_TOKEN>@huggingface.co/spaces/<username>/<space-name>
+ git push hf main
+ ```
+
+ ### Self-Hosted / VPS
+
+ ```bash
+ docker compose up -d --build
+ # App available at http://your-server:7860
+ ```
+
+ <br/>
+
+ ## 🤝 Contributing - GSSOC
+
+ This project is participating in **GirlScript Summer of Code**! We welcome contributors of all skill levels.
+
+ **Branch Strategy:**
+
+ | Branch | Purpose |
+ |--------|---------|
+ | `main` | Production, deployed to HuggingFace (admin only) |
+ | `dev` | All contributor PRs target here |
+ | `feature/*` / `fix/*` / `docs/*` | Your working branches |
+
+ ```bash
+ # Always branch from dev
+ git checkout -b feature/my-feature upstream/dev
+ ```
+
+ **Quick links:**
+ - 📋 [Good First Issues](https://github.com/param20h/PDF-Assistant-RAG/issues?q=label%3A%22good+first+issue%22)
+ - 📖 [Contributing Guide](CONTRIBUTING.md)
+ - 💬 [Discussions](https://github.com/param20h/PDF-Assistant-RAG/discussions)
+
+ <br/>
+
+ ## 📄 License
+
+ Distributed under the **MIT License**. See [`LICENSE`](LICENSE) for more information.
+
+ ---
+
+ <div align="center">
+
+ <br/>
+
+ **Built with 💙 as a flagship AI engineering project**
+
+ *If you found this project helpful, please give it a ⭐. It helps GSSOC contributors discover it!*
+
+ <br/>
+
+ [![FastAPI](https://skillicons.dev/icons?i=fastapi,python,nextjs,ts,tailwind,docker)](https://skillicons.dev)
+
+ <br/>

+ **[⬆ Back to top](#)**

+ </div>