metadata
title: DocMind-Agentic-Research
colorFrom: blue
colorTo: indigo
sdk: docker
DocMind: Agentic Research Platform
A clean, minimal agentic document research platform. Five specialized LangGraph agents plan, retrieve, grade, generate, and critique answers from uploaded PDFs and web pages using hybrid search and Qwen 2.5-7B, all running free on HuggingFace Spaces.
Table of Contents
- Features
- Architecture
- Getting Started
- Docker Deployment
- App Modules
- ML Models
- Project Structure
- Author
- Contributing
- Disclaimer
- License
Features
| Feature | Description |
|---|---|
| LangGraph State Machine | Five agents wired into a linear StateGraph: Planner → Retriever → Grader → Generator → Critic. |
| Hybrid RAG (FAISS + BM25) | Semantic vector search combined with BM25 keyword search, fused via Reciprocal Rank Fusion for precision retrieval. |
| Multi-Agent Orchestration | Planner, Retriever, Grader, Generator, and Critic agents, each with specialized roles; only 3 LLM calls per query. |
| Score-Based Grading | Grader uses hybrid search scores + keyword overlap; no LLM call needed, instant and deterministic relevance scoring. |
| PDF & URL Ingestion | Upload PDF files up to 10 MB or paste any public URL; both are chunked, embedded, and indexed automatically. |
| Secure by Design | Stateless REST backend, no user data persisted, HF token kept server-side only. |
| Containerized Deployment | Docker-first with Gunicorn, embedding model pre-downloaded at build time for fast cold starts. |
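The Reciprocal Rank Fusion step named above can be illustrated with a short, library-free sketch. It assumes the usual RRF formula, where each document scores the sum of 1 / (k + rank) across the ranked lists it appears in, with k = 60. The function name, chunk IDs, and input lists are hypothetical and are not taken from the project's `vector_store.py`.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrievers (best match first).
faiss_ranking = ["chunk_a", "chunk_b", "chunk_c"]  # semantic search
bm25_ranking = ["chunk_b", "chunk_d", "chunk_a"]   # keyword search

fused = reciprocal_rank_fusion([faiss_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever still gets a score and is not dropped.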
Architecture
┌──────────────────────────────────────────────────────────────┐
│                   DocMind: LangGraph Flow                    │
│                                                              │
│  PDF / URL ──▶ Ingestor ──▶ FAISS+BM25 Hybrid Vector Store   │
│                                       │                      │
│  User Query ──▶ [PLANNER Agent]       │  (Qwen 2.5-7B, 0.3)  │
│                        │              │                      │
│                   [RETRIEVER] ◀───────┘  (FAISS+BM25+RRF)    │
│                        │                                     │
│                    [GRADER]     (score-based, no LLM call)   │
│                        │                                     │
│                   [GENERATOR]   (Qwen 2.5-7B, 0.4)           │
│                        │                                     │
│                    [CRITIC]     (Qwen 2.5-7B, 0.1)           │
│                        │                                     │
│                    [OUTPUT]     Flask API + Single-Page UI   │
└──────────────────────────────────────────────────────────────┘
Getting Started
Prerequisites
- Python 3.10+ · Docker · Git · Free HuggingFace account
Local Installation
git clone https://github.com/mnoorchenar/docmind.git
cd docmind
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set HF_TOKEN to your free HuggingFace Read token
python app.py
Open http://localhost:7860
Getting your free HuggingFace token
- Create a free account at huggingface.co
- Go to Settings → Access Tokens → New Token → Role: Read
- Copy the token and set it as `HF_TOKEN` in your `.env` file or Space secrets
Docker Deployment
docker build -t docmind .
docker run -p 7860:7860 -e HF_TOKEN=hf_your_token_here docmind
App Modules
| Module | Description | Status |
|---|---|---|
| Upload & Index | PDF / URL ingest, chunk, embed (local BAAI model), FAISS+BM25 index | Live |
| Research Query | LangGraph 5-agent pipeline with real-time trace log | Live |
ML Models
stack = {
    # ── LLM (LangChain LCEL chains) ──────────────────────────────────────────
    "llm": "Qwen/Qwen2.5-7B-Instruct",  # via HF Router
    "lcel_chain": "ChatPromptTemplate | ChatOpenAI | StrOutputParser",
    "retry": "ChatOpenAI.with_retry(stop_after_attempt=2)",
    # ── RAG (LangChain + custom hybrid) ──────────────────────────────────────
    "splitter": "RecursiveCharacterTextSplitter (langchain-text-splitters)",
    "documents": "langchain_core.documents.Document",
    "embeddings": "HuggingFaceEmbeddings (BAAI/bge-small-en-v1.5, local)",
    "vector_index": "FAISS IndexFlatIP (cosine)",
    "keyword_index": "BM25Okapi (rank-bm25)",
    "fusion": "Reciprocal Rank Fusion (RRF k=60)",
    "grader": "score-based (hybrid score × 0.7 + keyword overlap × 0.3)",
    # ── Orchestration (LangGraph) ────────────────────────────────────────────
    "graph": "LangGraph 0.2 StateGraph: 5 nodes, linear pipeline",
}
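The grader formula listed in the stack (hybrid score × 0.7 + keyword overlap × 0.3) can be sketched as follows. This is an illustration under stated assumptions, not the code in `agents/grader.py`: it assumes the hybrid score is already normalized to [0, 1], defines keyword overlap as the fraction of distinct query terms found in the chunk, and uses a hypothetical relevance threshold of 0.5. All function names are invented for the example.

```python
import re


def keyword_overlap(query, chunk_text):
    """Fraction of distinct query terms that also appear in the chunk."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    chunk_terms = set(re.findall(r"\w+", chunk_text.lower()))
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)


def grade_chunk(query, chunk_text, hybrid_score, threshold=0.5):
    """Blend a hybrid score in [0, 1] with keyword overlap (0.7 / 0.3 weights).

    The threshold is a hypothetical cut-off chosen for this sketch.
    """
    score = 0.7 * hybrid_score + 0.3 * keyword_overlap(query, chunk_text)
    return {"score": score, "relevant": score >= threshold}


result = grade_chunk(
    query="hybrid search fusion",
    chunk_text="Reciprocal rank fusion merges hybrid results.",
    hybrid_score=0.6,  # assumed to be normalized already
)
print(result)
```

Because the score is plain arithmetic over values the retriever already produced, the same inputs always yield the same grade and no model call is involved.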
Project Structure
docmind/
├── app.py                        # Flask entry point, 5 REST routes
├── requirements.txt
├── Dockerfile                    # Port 7860, embedding model pre-downloaded
├── .env.example
├── agents/
│   ├── llm_factory.py            # get_llm() → LangChain ChatOpenAI (HF Router)
│   ├── planner.py                # LCEL: ChatPromptTemplate | ChatOpenAI | StrOutputParser
│   ├── retriever.py              # Hybrid FAISS+BM25 search wrapper
│   ├── grader.py                 # Score-based relevance grading (no LLM call)
│   ├── generator.py              # LCEL chain: cited answer generation
│   └── critic.py                 # LCEL chain: hallucination detection
├── graph/
│   └── research_graph.py         # LangGraph StateGraph (5 nodes, linear pipeline)
├── rag/
│   ├── ingestor.py               # RecursiveCharacterTextSplitter + Document objects
│   ├── vector_store.py           # FAISS + BM25 + RRF, accepts Document or dict
│   └── embeddings.py             # LangChain HuggingFaceEmbeddings (bge-small-en-v1.5)
├── tracing/
│   └── tracer.py                 # Thread-safe in-memory trace store
├── templates/
│   └── index.html                # Dark-mode single-page UI
└── docs/
    └── project-template.html     # Portfolio showcase page
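To make the "thread-safe in-memory trace store" entry concrete, here is one way such a store could look, using only the standard library. The class name, method names, and event fields are hypothetical; the actual `tracing/tracer.py` may be organized quite differently.

```python
import threading
import time
from collections import defaultdict


class TraceStore:
    """Hypothetical in-memory store of per-run agent events, guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = defaultdict(list)

    def add(self, run_id, agent, message):
        event = {"agent": agent, "message": message, "ts": time.time()}
        with self._lock:
            self._events[run_id].append(event)

    def get(self, run_id):
        # Return a copy so callers cannot mutate the stored list.
        with self._lock:
            return list(self._events.get(run_id, []))


store = TraceStore()
workers = [
    threading.Thread(target=store.add, args=("run-1", "planner", f"step {i}"))
    for i in range(20)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

events = store.get("run-1")
print(len(events))
```

Holding the lock for both writes and reads keeps concurrent request threads from interleaving updates, and returning a copy lets the UI poll the trace without touching shared state.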
Author
Mohammad Noorchenarboo
Data Scientist | AI Researcher | Biostatistician
Ontario, Canada · mohammadnoorchenarboo@gmail.com
Contributing
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit: `git commit -m 'Add amazing feature'`
- Push: `git push origin feature/amazing-feature`
- Open a Pull Request
Disclaimer
This project is developed strictly for educational and research purposes. All LLM outputs are AI-generated and may contain inaccuracies. No real user data is stored. Provided "as is" without warranty of any kind.
License
Distributed under the MIT License.