---
title: DocMind-Agentic-Research
colorFrom: blue
colorTo: indigo
sdk: docker
---

<div align="center">

<h1>🧠 DocMind: Agentic Research Platform</h1>
<img src="https://readme-typing-svg.demolab.com?font=Fira+Code&size=22&duration=3000&pause=1000&color=4f8ef7&center=true&vCenter=true&width=700&lines=LangGraph+%C2%B7+5+Agents+%C2%B7+Hybrid+RAG;Qwen+2.5-7B+%C2%B7+3+LLM+Calls+per+Query;Deployed+Free+on+HuggingFace+Spaces" alt="Typing SVG"/>

<br/>

[![Python](https://img.shields.io/badge/Python-3.10+-3b82f6?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/)
[![LangGraph](https://img.shields.io/badge/LangGraph-0.2-06b6d4?style=for-the-badge)](https://github.com/langchain-ai/langgraph)
[![LangChain](https://img.shields.io/badge/LangChain-0.3-4f46e5?style=for-the-badge)](https://langchain.com/)
[![Flask](https://img.shields.io/badge/Flask-3.1-3b82f6?style=for-the-badge&logo=flask&logoColor=white)](https://flask.palletsprojects.com/)
[![Docker](https://img.shields.io/badge/Docker-Ready-3b82f6?style=for-the-badge&logo=docker&logoColor=white)](https://www.docker.com/)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Spaces-ffcc00?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/mnoorchenar/spaces)
[![Status](https://img.shields.io/badge/Status-Active-22c55e?style=for-the-badge)](#)

<br/>

**🧠 DocMind** is a clean, minimal agentic document research platform. Five specialized LangGraph agents plan, retrieve, grade, generate, and critique answers from uploaded PDFs and web pages using hybrid search and Qwen 2.5-7B, all running free on HuggingFace Spaces.

<br/>

---

</div>

## Table of Contents
- [Features](#-features)
- [Architecture](#️-architecture)
- [Getting Started](#-getting-started)
- [Docker Deployment](#-docker-deployment)
- [App Modules](#-app-modules)
- [ML Models](#-ml-models)
- [Project Structure](#-project-structure)
- [Author](#-author)
- [Contributing](#-contributing)
- [Disclaimer](#disclaimer)
- [License](#-license)

---

## ✨ Features

<table>
  <tr><td>🧠 <b>LangGraph State Machine</b></td><td>Five agents wired into a linear StateGraph: Planner → Retriever → Grader → Generator → Critic.</td></tr>
  <tr><td>🔍 <b>Hybrid RAG (FAISS + BM25)</b></td><td>Semantic vector search combined with BM25 keyword search, fused via Reciprocal Rank Fusion for more precise retrieval.</td></tr>
  <tr><td>🤖 <b>Multi-Agent Orchestration</b></td><td>Planner, Retriever, Grader, Generator, and Critic agents, each with a specialized role; only 3 LLM calls per query.</td></tr>
  <tr><td>⚡ <b>Score-Based Grading</b></td><td>Grader combines hybrid search scores with keyword overlap. No LLM call is needed, so relevance scoring is instant and deterministic.</td></tr>
  <tr><td>📄 <b>PDF &amp; URL Ingestion</b></td><td>Upload PDF files up to 10 MB or paste any public URL; both are chunked, embedded, and indexed automatically.</td></tr>
  <tr><td>🔒 <b>Secure by Design</b></td><td>Stateless REST backend, no user data persisted, HF token kept server-side only.</td></tr>
  <tr><td>🐳 <b>Containerized Deployment</b></td><td>Docker-first with Gunicorn, embedding model pre-downloaded at build time for fast cold starts.</td></tr>
</table>
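
The score-based grading row can be illustrated with a minimal sketch. It assumes the 0.7 / 0.3 weighting listed under ML Models, hybrid scores already normalized to [0, 1], and a hypothetical `threshold` cutoff; the function names are illustrative and not taken from `grader.py`.

```python
import re


def keyword_overlap(query: str, text: str) -> float:
    """Fraction of distinct query terms that also appear in the chunk."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    chunk_terms = set(re.findall(r"\w+", text.lower()))
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0


def grade(query, chunks, threshold=0.3):
    """Score (text, hybrid_score) pairs without any LLM call.

    `threshold` is a hypothetical cutoff; hybrid_score is assumed to be in [0, 1].
    """
    graded = []
    for text, hybrid_score in chunks:
        score = 0.7 * hybrid_score + 0.3 * keyword_overlap(query, text)
        if score >= threshold:
            graded.append((text, round(score, 3)))
    # Highest-scoring chunks first
    return sorted(graded, key=lambda item: item[1], reverse=True)


results = grade(
    "hybrid search",
    [("Hybrid search fuses rankings", 0.8), ("Unrelated text", 0.1)],
)
```

Because the score is a fixed arithmetic combination, the same query and chunks always produce the same grades.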

---

## 🏗️ Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                   DocMind - LangGraph Flow                   │
│                                                              │
│  PDF / URL ──▶ Ingestor ──▶ FAISS+BM25 Hybrid Vector Store   │
│                                    │                         │
│  User Query ──▶ [PLANNER Agent]    │   (Qwen 2.5-7B, 0.3)    │
│                      │             │                         │
│                 [RETRIEVER] ◀──────┘  (FAISS+BM25+RRF)       │
│                      │                                       │
│                 [GRADER]  (score-based, no LLM call)         │
│                      │                                       │
│                 [GENERATOR]         (Qwen 2.5-7B, 0.4)       │
│                      │                                       │
│                  [CRITIC]           (Qwen 2.5-7B, 0.1)       │
│                      │                                       │
│                  [OUTPUT]  Flask API + Single-Page UI        │
└──────────────────────────────────────────────────────────────┘
```
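
The linear flow in the diagram can be sketched in plain Python. This is a stand-in for the LangGraph `StateGraph`, not the LangGraph API: every function, the `fake_llm` stub, and the two-sentence corpus are hypothetical. It only shows how a shared state passes through five nodes with three model calls.

```python
def fake_llm(prompt: str) -> str:
    """Stand-in for a Qwen 2.5-7B call; returns a canned string."""
    return f"<model output for: {prompt}>"


def planner(state):
    # LLM call 1: break the query into a plan
    return {"plan": fake_llm(f"Plan for: {state['query']}"),
            "llm_calls": state["llm_calls"] + 1}


def retriever(state):
    # Hypothetical corpus; the real retriever queries the hybrid index
    return {"chunks": ["RRF merges ranked lists.", "Flask serves the UI."]}


def grader(state):
    # Score-based filter: keep chunks sharing a term with the query (no LLM call)
    terms = set(state["query"].lower().split())
    kept = [c for c in state["chunks"]
            if terms & set(c.lower().rstrip(".").split())]
    return {"graded": kept}


def generator(state):
    # LLM call 2: answer from the graded chunks
    return {"answer": fake_llm(f"Answer using: {state['graded']}"),
            "llm_calls": state["llm_calls"] + 1}


def critic(state):
    # LLM call 3: check the answer against the chunks
    return {"critique": fake_llm(f"Check: {state['answer']}"),
            "llm_calls": state["llm_calls"] + 1}


NODES = [planner, retriever, grader, generator, critic]


def run(query: str) -> dict:
    state = {"query": query, "llm_calls": 0}
    for node in NODES:  # linear pipeline: each node merges its update into the state
        state.update(node(state))
    return state


final_state = run("rrf")
```

Only the planner, generator, and critic touch the model, which matches the three LLM calls per query described above.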

---

## 🚀 Getting Started

### Prerequisites
- Python 3.10+ · Docker · Git · Free HuggingFace account

### Local Installation

```bash
git clone https://github.com/mnoorchenar/docmind.git
cd docmind

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt

cp .env.example .env
# Edit .env: set HF_TOKEN to your free HuggingFace Read token

python app.py
```

Open `http://localhost:7860` 🎉

### Getting your free HuggingFace token
1. Create a free account at [huggingface.co](https://huggingface.co)
2. Go to Settings → Access Tokens → New Token → Role: **Read**
3. Copy the token and set it as `HF_TOKEN` in your `.env` file or Space secrets
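
As a rough sketch of how the token might be consumed server-side, the helper below reads `HF_TOKEN` from the environment and fails early if it is missing. The function name `llm_settings` is hypothetical, and the router URL is an assumption about the OpenAI-compatible HuggingFace endpoint, not something taken from `llm_factory.py`.

```python
import os


def llm_settings(env=None):
    """Build client settings from the environment (illustrative helper)."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or the Space secrets")
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "base_url": "https://router.huggingface.co/v1",  # assumed router endpoint
        "api_key": token,  # stays on the server, never sent to the browser
    }


settings = llm_settings({"HF_TOKEN": "hf_example"})
```

Failing at startup when the token is absent gives a clearer error than a rejected request on the first query.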

---

## 🐳 Docker Deployment

```bash
docker build -t docmind .
docker run -p 7860:7860 -e HF_TOKEN=hf_your_token_here docmind
```

---

## 📊 App Modules

| Module | Description | Status |
|--------|-------------|--------|
| 📤 Upload & Index | PDF / URL ingest, chunk, embed (local BAAI model), FAISS+BM25 index | ✅ Live |
| 🔍 Research Query | LangGraph 5-agent pipeline with real-time trace log | ✅ Live |

---

## 🧠 ML Models

```python
stack = {
    # ── LLM (LangChain LCEL chains) ──────────────────────────────────────────
    "llm":             "Qwen/Qwen2.5-7B-Instruct",         # via HF Router
    "lcel_chain":      "ChatPromptTemplate | ChatOpenAI | StrOutputParser",
    "retry":           "ChatOpenAI.with_retry(stop_after_attempt=2)",

    # ── RAG (LangChain + custom hybrid) ──────────────────────────────────────
    "splitter":        "RecursiveCharacterTextSplitter (langchain-text-splitters)",
    "documents":       "langchain_core.documents.Document",
    "embeddings":      "HuggingFaceEmbeddings (BAAI/bge-small-en-v1.5, local)",
    "vector_index":    "FAISS IndexFlatIP (inner product; cosine on normalized vectors)",
    "keyword_index":   "BM25Okapi (rank-bm25)",
    "fusion":          "Reciprocal Rank Fusion (RRF k=60)",
    "grader":          "score-based (hybrid score × 0.7 + keyword overlap × 0.3)",

    # ── Orchestration (LangGraph) ─────────────────────────────────────────────
    "graph":           "LangGraph 0.2 StateGraph: 5 nodes, linear pipeline",
}
```
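
The `fusion` entry can be made concrete with a minimal Reciprocal Rank Fusion sketch using k=60. The chunk IDs and the two input rankings are made up for illustration; the code in `vector_store.py` may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: each list adds 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


faiss_ranking = ["c2", "c1", "c3"]  # hypothetical vector-search order
bm25_ranking = ["c1", "c4", "c2"]   # hypothetical keyword-search order
fused = reciprocal_rank_fusion([faiss_ranking, bm25_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF uses only rank positions, so FAISS similarity scores and BM25 scores never need to be put on a common scale before merging.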

---

## 📁 Project Structure

```
docmind/
├── 📄 app.py                     # Flask entry point, 5 REST routes
├── 📄 requirements.txt
├── 📄 Dockerfile                 # Port 7860, embedding model pre-downloaded
├── 📄 .env.example
├── 📂 agents/
│   ├── 📄 llm_factory.py         # get_llm() → LangChain ChatOpenAI (HF Router)
│   ├── 📄 planner.py             # LCEL: ChatPromptTemplate | ChatOpenAI | StrOutputParser
│   ├── 📄 retriever.py           # Hybrid FAISS+BM25 search wrapper
│   ├── 📄 grader.py              # Score-based relevance grading (no LLM call)
│   ├── 📄 generator.py           # LCEL chain: cited answer generation
│   └── 📄 critic.py              # LCEL chain: hallucination detection
├── 📂 graph/
│   └── 📄 research_graph.py      # LangGraph StateGraph (5 nodes, linear pipeline)
├── 📂 rag/
│   ├── 📄 ingestor.py            # RecursiveCharacterTextSplitter + Document objects
│   ├── 📄 vector_store.py        # FAISS + BM25 + RRF, accepts Document or dict
│   └── 📄 embeddings.py          # LangChain HuggingFaceEmbeddings (bge-small-en-v1.5)
├── 📂 tracing/
│   └── 📄 tracer.py              # Thread-safe in-memory trace store
├── 📂 templates/
│   └── 📄 index.html             # Dark-mode single-page UI
└── 📂 docs/
    └── 📄 project-template.html  # Portfolio showcase page
```

---

## 👨‍💻 Author

<div align="center">
<table><tr><td align="center" width="100%">
<img src="https://avatars.githubusercontent.com/mnoorchenar" width="120" style="border-radius:50%;border:3px solid #4f46e5" alt="Mohammad Noorchenarboo"/>
<h3>Mohammad Noorchenarboo</h3>
<code>Data Scientist</code> &nbsp;|&nbsp; <code>AI Researcher</code> &nbsp;|&nbsp; <code>Biostatistician</code>
📍 Ontario, Canada &nbsp;&nbsp; 📧 mohammadnoorchenarboo@gmail.com

[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/mnoorchenar)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-ffcc00?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/mnoorchenar/spaces)
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/mnoorchenar)
</td></tr></table>
</div>

---

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit: `git commit -m 'Add amazing feature'`
4. Push: `git push origin feature/amazing-feature`
5. Open a Pull Request

---

## Disclaimer

<span style="color:red">This project is developed strictly for educational and research purposes. All LLM outputs are AI-generated and may contain inaccuracies. No real user data is stored. Provided "as is" without warranty of any kind.</span>

---

## 📜 License

Distributed under the **MIT License**.

<div align="center">
<img src="https://capsule-render.vercel.app/api?type=waving&color=0:3b82f6,100:4f46e5&height=120&section=footer&text=Made%20with%20%E2%9D%A4%EF%B8%8F%20by%20Mohammad%20Noorchenarboo&fontColor=ffffff&fontSize=18&fontAlignY=80" width="100%"/>
</div>