---
title: DocMind-Agentic-Research
colorFrom: blue
colorTo: indigo
sdk: docker
---
<div align="center">
<h1>🧠 DocMind: Agentic Research Platform</h1>
<img src="https://readme-typing-svg.demolab.com?font=Fira+Code&size=22&duration=3000&pause=1000&color=4f8ef7&center=true&vCenter=true&width=700&lines=LangGraph+%C2%B7+5+Agents+%C2%B7+Hybrid+RAG;Qwen+2.5-7B+%C2%B7+3+LLM+Calls+per+Query;Deployed+Free+on+HuggingFace+Spaces" alt="Typing SVG"/>
<br/>
[Python](https://www.python.org/)
[LangGraph](https://github.com/langchain-ai/langgraph)
[LangChain](https://langchain.com/)
[Flask](https://flask.palletsprojects.com/)
[Docker](https://www.docker.com/)
[HuggingFace Spaces](https://huggingface.co/mnoorchenar/spaces)
<br/>
**🧠 DocMind** is a clean, minimal agentic document research platform. Five specialized LangGraph agents plan, retrieve, grade, generate, and critique answers from uploaded PDFs and web pages using hybrid search and Qwen 2.5-7B, all running free on HuggingFace Spaces.
<br/>
---
</div>
## Table of Contents
- [Features](#-features)
- [Architecture](#️-architecture)
- [Getting Started](#-getting-started)
- [Docker Deployment](#-docker-deployment)
- [App Modules](#-app-modules)
- [ML Models](#-ml-models)
- [Project Structure](#-project-structure)
- [Author](#-author)
- [Contributing](#-contributing)
- [Disclaimer](#disclaimer)
- [License](#-license)
---
## ✨ Features
<table>
<tr><td>🧠 <b>LangGraph State Machine</b></td><td>Five agents wired into a linear StateGraph: Planner → Retriever → Grader → Generator → Critic.</td></tr>
<tr><td>🔍 <b>Hybrid RAG (FAISS + BM25)</b></td><td>Semantic vector search combined with BM25 keyword search, fused via Reciprocal Rank Fusion for precise retrieval.</td></tr>
<tr><td>🤖 <b>Multi-Agent Orchestration</b></td><td>Planner, Retriever, Grader, Generator, and Critic agents, each with a specialized role; only 3 LLM calls per query.</td></tr>
<tr><td>⚡ <b>Score-Based Grading</b></td><td>The Grader combines hybrid search scores with keyword overlap, so relevance scoring is instant and deterministic and needs no LLM call.</td></tr>
<tr><td>📄 <b>PDF & URL Ingestion</b></td><td>Upload PDF files up to 10 MB or paste any public URL; both are chunked, embedded, and indexed automatically.</td></tr>
<tr><td>🔒 <b>Secure by Design</b></td><td>Stateless REST backend, no user data persisted, HF token kept server-side only.</td></tr>
<tr><td>🐳 <b>Containerized Deployment</b></td><td>Docker-first with Gunicorn; the embedding model is pre-downloaded at build time for fast cold starts.</td></tr>
</table>
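
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not taken from the DocMind source: the function name `reciprocal_rank_fusion` and the sample chunk ids are hypothetical, and `k=60` is the constant listed in this README's model stack.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings: ids ordered best-first by each retriever.
semantic_hits = ["c2", "c1", "c4"]  # e.g. from the vector index
keyword_hits = ["c1", "c3", "c2"]   # e.g. from BM25

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
top_ids = [doc_id for doc_id, _ in fused]
```

A chunk that ranks well in both lists (`c1`, `c2`) ends up ahead of one that appears in only a single list, which is the main reason to fuse ranks rather than raw scores from two differently scaled retrievers.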
---
## 🏗️ Architecture
```
┌──────────────────────────────────────────────────────────────┐
│                   DocMind - LangGraph Flow                   │
│                                                              │
│  PDF / URL ──▶ Ingestor ──▶ FAISS+BM25 Hybrid Vector Store   │
│                                        │                     │
│  User Query ──▶ [PLANNER Agent]        │  (Qwen 2.5-7B, 0.3) │
│                        │               │                     │
│                   [RETRIEVER] ◀────────┘  (FAISS+BM25+RRF)   │
│                        │                                     │
│                    [GRADER]  (score-based, no LLM call)      │
│                        │                                     │
│                   [GENERATOR]  (Qwen 2.5-7B, 0.4)            │
│                        │                                     │
│                    [CRITIC]  (Qwen 2.5-7B, 0.1)              │
│                        │                                     │
│                    [OUTPUT]  Flask API + Single-Page UI      │
└──────────────────────────────────────────────────────────────┘
```
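
The flow above can be sketched without any framework as five functions applied in a fixed order to a shared state dictionary. This is an illustration only, not the project's LangGraph code: `StubLLM`, `build_pipeline`, `run`, and the sample corpus are hypothetical names, the retriever is a placeholder that returns every chunk, and the grader uses a simple shared-word check in place of the real scoring.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


class StubLLM:
    """Stand-in for the real model; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"[stub reply to: {prompt[:30]}]"


def build_pipeline(llm: StubLLM, corpus: List[str]) -> List[Callable[[State], State]]:
    def planner(state: State) -> State:
        return {**state, "plan": llm(f"Plan a search for: {state['query']}")}

    def retriever(state: State) -> State:
        # Placeholder for hybrid search: return every chunk.
        return {**state, "chunks": list(corpus)}

    def grader(state: State) -> State:
        # No LLM call: keep chunks that share at least one word with the query.
        words = set(str(state["query"]).lower().split())
        kept = [c for c in state["chunks"] if words & set(c.lower().split())]
        return {**state, "chunks": kept}

    def generator(state: State) -> State:
        return {**state, "answer": llm(f"Answer using {len(state['chunks'])} chunks")}

    def critic(state: State) -> State:
        return {**state, "critique": llm(f"Check: {state['answer']}")}

    return [planner, retriever, grader, generator, critic]


def run(query: str, llm: StubLLM, corpus: List[str]) -> State:
    state: State = {"query": query}
    for node in build_pipeline(llm, corpus):  # fixed linear order
        state = node(state)
    return state


llm = StubLLM()
result = run("what is faiss", llm, ["FAISS is a vector index", "Flask serves the API"])
```

After the run, `llm.calls` is 3: only the planner, generator, and critic touch the model, which matches the "3 LLM calls per query" figure quoted in this README.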
---
## 🚀 Getting Started
### Prerequisites
- Python 3.10+ · Docker · Git · Free HuggingFace account
### Local Installation
```bash
git clone https://github.com/mnoorchenar/docmind.git
cd docmind
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env: set HF_TOKEN to your free HuggingFace Read token
python app.py
```
Open `http://localhost:7860` in your browser.
### Getting your free HuggingFace token
1. Create a free account at [huggingface.co](https://huggingface.co)
2. Go to Settings → Access Tokens → New Token → Role: **Read**
3. Copy the token and set it as `HF_TOKEN` in your `.env` file or Space secrets
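
As a rough sketch of how a server might read that token at startup and fail early if it is missing: the helper names `load_hf_token` and `mask` are hypothetical and not part of DocMind, and the sketch assumes the variable is already present in the process environment (for example exported by the shell, Docker, or Space secrets).

```python
import os


def load_hf_token(env=None):
    """Return the HF token from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to .env or Space secrets")
    return token


def mask(token):
    """Shorten a token for log output so the secret never appears in full."""
    return token[:3] + "..." if len(token) > 3 else "..."
```

Passing a plain dictionary as `env` makes the helper easy to exercise in tests without touching the real environment.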
---
## 🐳 Docker Deployment
```bash
docker build -t docmind .
docker run -p 7860:7860 -e HF_TOKEN=hf_your_token_here docmind
```
---
## 📊 App Modules
| Module | Description | Status |
|--------|-------------|--------|
| 📤 Upload & Index | PDF / URL ingest, chunk, embed (local BAAI model), FAISS+BM25 index | ✅ Live |
| 🔍 Research Query | LangGraph 5-agent pipeline with real-time trace log | ✅ Live |
---
## 🧠 ML Models
```python
stack = {
    # ── LLM (LangChain LCEL chains) ──────────────────────────────────────
    "llm": "Qwen/Qwen2.5-7B-Instruct",  # via HF Router
    "lcel_chain": "ChatPromptTemplate | ChatOpenAI | StrOutputParser",
    "retry": "ChatOpenAI.with_retry(stop_after_attempt=2)",
    # ── RAG (LangChain + custom hybrid) ──────────────────────────────────
    "splitter": "RecursiveCharacterTextSplitter (langchain-text-splitters)",
    "documents": "langchain_core.documents.Document",
    "embeddings": "HuggingFaceEmbeddings (BAAI/bge-small-en-v1.5, local)",
    "vector_index": "FAISS IndexFlatIP (cosine)",
    "keyword_index": "BM25Okapi (rank-bm25)",
    "fusion": "Reciprocal Rank Fusion (RRF k=60)",
    "grader": "score-based (hybrid score × 0.7 + keyword overlap × 0.3)",
    # ── Orchestration (LangGraph) ────────────────────────────────────────
    "graph": "LangGraph 0.2 StateGraph, 5 nodes, linear pipeline",
}
```
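
The `grader` entry describes a weighted sum, which can be sketched as follows. Only the 0.7 and 0.3 weights come from this README. The function names are hypothetical, the definition of keyword overlap (share of distinct query words found in the chunk) is an assumption, and so is the premise that the hybrid score is already scaled to the range 0 to 1.

```python
def keyword_overlap(query, text):
    """Fraction of distinct query words that also occur in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def grade(query, text, hybrid_score, w_hybrid=0.7, w_overlap=0.3):
    """Weighted relevance score; hybrid_score is assumed to lie in [0, 1]."""
    return w_hybrid * hybrid_score + w_overlap * keyword_overlap(query, text)


# One of the two query words appears in the chunk, so the overlap is 0.5.
score = grade("hybrid search", "hybrid retrieval with BM25", hybrid_score=0.8)
```

Because every input is a plain number computed from the retrieval results, the same query and index always produce the same grade, which is what makes this step deterministic.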
---
## 📁 Project Structure
```
docmind/
├── 📄 app.py                      # Flask entry point, 5 REST routes
├── 📄 requirements.txt
├── 📄 Dockerfile                  # Port 7860, embedding model pre-downloaded
├── 📄 .env.example
├── 📁 agents/
│   ├── 📄 llm_factory.py          # get_llm() → LangChain ChatOpenAI (HF Router)
│   ├── 📄 planner.py              # LCEL: ChatPromptTemplate | ChatOpenAI | StrOutputParser
│   ├── 📄 retriever.py            # Hybrid FAISS+BM25 search wrapper
│   ├── 📄 grader.py               # Score-based relevance grading (no LLM call)
│   ├── 📄 generator.py            # LCEL chain → cited answer generation
│   └── 📄 critic.py               # LCEL chain → hallucination detection
├── 📁 graph/
│   └── 📄 research_graph.py       # LangGraph StateGraph (5 nodes, linear pipeline)
├── 📁 rag/
│   ├── 📄 ingestor.py             # RecursiveCharacterTextSplitter + Document objects
│   ├── 📄 vector_store.py         # FAISS + BM25 + RRF, accepts Document or dict
│   └── 📄 embeddings.py           # LangChain HuggingFaceEmbeddings (bge-small-en-v1.5)
├── 📁 tracing/
│   └── 📄 tracer.py               # Thread-safe in-memory trace store
├── 📁 templates/
│   └── 📄 index.html              # Dark-mode single-page UI
└── 📁 docs/
    └── 📄 project-template.html   # Portfolio showcase page
```
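
For readers curious what a "thread-safe in-memory trace store" can look like, here is a minimal sketch built on `threading.Lock`. It is an assumption about the general shape of such a component, not the contents of `tracing/tracer.py`; the class name `TraceStore`, its methods, and the event fields are all hypothetical.

```python
import threading
import time
from collections import defaultdict


class TraceStore:
    """In-memory trace log keyed by run id, guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = defaultdict(list)

    def add(self, run_id, agent, message):
        event = {"agent": agent, "message": message, "ts": time.time()}
        with self._lock:
            self._events[run_id].append(event)

    def get(self, run_id):
        with self._lock:
            # Return a copy so callers cannot mutate the stored list.
            return list(self._events.get(run_id, []))


# Five threads record one event each for the same (hypothetical) run.
store = TraceStore()
workers = [
    threading.Thread(target=store.add, args=("run-1", f"agent-{i}", "done"))
    for i in range(5)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Holding the lock for both writes and reads keeps each request's trace consistent even when several Gunicorn threads append events at the same time. Note that an in-memory store like this is per process, so it would not be shared across multiple worker processes.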
---
## 👨‍💻 Author
<div align="center">
<table><tr><td align="center" width="100%">
<img src="https://avatars.githubusercontent.com/mnoorchenar" width="120" style="border-radius:50%;border:3px solid #4f46e5" alt="Mohammad Noorchenarboo"/>
<h3>Mohammad Noorchenarboo</h3>
<code>Data Scientist</code> | <code>AI Researcher</code> | <code>Biostatistician</code>
📍 Ontario, Canada 📧 mohammadnoorchenarboo@gmail.com
[LinkedIn](https://www.linkedin.com/in/mnoorchenar)
[HuggingFace](https://huggingface.co/mnoorchenar/spaces)
[GitHub](https://github.com/mnoorchenar)
</td></tr></table>
</div>
---
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit: `git commit -m 'Add amazing feature'`
4. Push: `git push origin feature/amazing-feature`
5. Open a Pull Request
---
## Disclaimer
<span style="color:red">This project is developed strictly for educational and research purposes. All LLM outputs are AI-generated and may contain inaccuracies. No real user data is stored. Provided "as is" without warranty of any kind.</span>
---
## 📝 License
Distributed under the **MIT License**.
<div align="center">
<img src="https://capsule-render.vercel.app/api?type=waving&color=0:3b82f6,100:4f46e5&height=120&section=footer&text=Made%20with%20%E2%9D%A4%EF%B8%8F%20by%20Mohammad%20Noorchenarboo&fontColor=ffffff&fontSize=18&fontAlignY=80" width="100%"/>
</div>