Spaces:
Sleeping
Sleeping
| title: GraphResearcher | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| license: mit | |
| # GraphResearcher | |
|  | |
| Citation-grounded document intelligence platform combining RAG with graph-based retrieval. | |
| **Live demo:** [yugbirla-graphresearcher.hf.space](https://yugbirla-graphresearcher.hf.space) Β· [App UI](https://yugbirla-graphresearcher.hf.space/app) Β· [Admin](https://yugbirla-graphresearcher.hf.space/admin/secure) | |
| --- | |
| ## Problem | |
| PDF chatbots typically generate answers without showing where the information came from. Users cannot verify whether the output is grounded in the document. GraphResearcher addresses this by combining chunk-based retrieval, a document knowledge graph, source-level citation, and a verification UI that lets users trace every claim back to a specific page and chunk. | |
| --- | |
| ## Architecture | |
| ```text | |
| User Browser | |
| β Upload / Ask / Compare / Feedback | |
| βΌ | |
| FastAPI Backend | |
| βββ Ingestion: PDF parsing β chunking β metadata preservation | |
| βββ Retrieval: hybrid (BM25 + vector) β cross-encoder reranking β graph fusion | |
| βββ GraphRAG: entity extraction β relation extraction β graph storage β graph-guided retrieval | |
| βββ Generation: evidence extraction β grounded prompt β LLM answer β citation attachment | |
| βββ Product: app UI, source viewer, document comparison, feedback, admin monitoring | |
| βββ Storage: runtime filesystem, SQLite, optional HF Dataset backup | |
| ``` | |
| --- | |
| ## GraphRAG Algorithm | |
| Entities are extracted from processed document chunks using rule-based pattern matching: capitalized multi-word phrases and uppercase acronyms are identified, normalized by stripping punctuation and deduplicating via lowercased entity IDs, and classified as CONCEPT, ACRONYM, ORGANIZATION, or TECHNICAL_TERM. Noisy candidates (stopwords, overly short/long strings) are filtered using a quality module. Relations are constructed by co-occurrence: when two or more entities appear in the same sentence, an edge is created between each pair, with the relation type inferred from verb phrases (e.g., "uses" β USES, "reduces" β REDUCES) or defaulting to RELATED_TO. Edge weights increment with repeated co-occurrence across chunks. | |
| During answering, query terms are tokenized and matched against graph entity names and IDs using exact and substring scoring. The top-k matched entities and their neighboring relations are retrieved, and every chunk ID linked to those entities/relations receives a graph score based on entity mention count and relation weight. These graph-scored chunks are fused with normal hybrid retrieval results: chunks appearing in both lists get a score boost; graph-only chunks are appended. The fused set is re-sorted by score and truncated to the final top-k. The LLM prompt then includes both the standard evidence context and a structured graph context block listing matched entities (with types, mention counts, and pages) and relations (with types and weights). | |
| --- | |
| ## Evaluation | |
| The project includes an ablation evaluation framework comparing **RAG only** vs. **RAG + Graph** retrieval. | |
| **Metrics computed:** | |
| - **Recall@K** (K=3,5,10): fraction of manually labeled gold chunk IDs retrieved in top K | |
| - **Estimated faithfulness**: automatic heuristic checking whether answer sentences are supported by retrieved source text (not human judgment) | |
| - **Answer completeness**: fraction of expected gold terms present in the answer | |
| - **Latency**: end-to-end request time | |
| <!-- EVAL_RESULTS_START --> | |
| ### 15-Question Starter Evaluation (Vectorless RAG Master Guide) | |
| Generated: 2026-06-18 | |
| QA file: `eval/qa_15_starter.jsonl` | |
| | Mode | Recall@3 | Recall@5 | Recall@10 | Faithfulness | Completeness | Avg Latency (ms) | Avg Answer Words | Errors | | |
| |---|---:|---:|---:|---:|---:|---:|---:|---:| | |
| | RAG | 0.2944 | 0.4722 | 0.6389 | 0.9333 | 0.7167 | 7388.7 | 238.5 | 0 | | |
| | RAG + Graph | 0.2944 | 0.4722 | 0.6389 | 0.9333 | 0.7167 | 7114.3 | 238.5 | 0 | | |
| **Conclusion:** On this specific 15-question dataset, the rule-based GraphRAG implementation had **no significant effect** on retrieval recall or estimated faithfulness compared to standard hybrid retrieval with reranking. Latency was similar. | |
| <!-- EVAL_RESULTS_END --> | |
| --- | |
| ## Setup | |
| ```bash | |
| git clone https://github.com/yug-birla/Graph-Researcher.git | |
| cd Graph-Researcher/backend | |
| python -m venv venv | |
| venv\Scripts\activate # Windows | |
| # source venv/bin/activate # Linux/Mac | |
| pip install -r requirements.txt | |
| cp .env.example .env # edit with your keys | |
| uvicorn app.main:app --reload | |
| ``` | |
| Open: `http://127.0.0.1:8000/app` | |
| --- | |
| ## Environment Variables | |
| See [`.env.example`](.env.example) for the full list. Key variables: | |
| | Variable | Purpose | | |
| |---|---| | |
| | `LLM_PROVIDER` | `huggingface`, `local`, or `disabled` | | |
| | `HF_API_TOKEN` | Hugging Face API token for inference | | |
| | `HF_INFERENCE_MODEL` | Model name (default: `Qwen/Qwen3-4B-Instruct-2507`) | | |
| | `ADMIN_EMAILS` | Comma-separated admin email allowlist | | |
| | `ADMIN_DASHBOARD_KEY` | Password for `/admin/secure` | | |
| | `SESSION_SECRET_KEY` | Session middleware secret | | |
| | `HF_FEEDBACK_DATASET` | Optional HF dataset for permanent feedback backup | | |
| | `HF_FEEDBACK_TOKEN` | HF write token for feedback backup | | |
| --- | |
| ## Limitations | |
| - **Temporary storage on HF Spaces.** Uploaded documents may disappear after runtime restart unless persistent storage is configured. | |
| - **Rule-based entity extraction.** The graph layer uses regex patterns, not a trained NER model. It may miss entities that don't follow capitalization conventions and may extract noisy candidates. | |
| - **Heuristic faithfulness.** The automatic faithfulness metric is a bag-of-words overlap check, not a semantic entailment model. It can overestimate faithfulness for topically relevant but factually incorrect answers. | |
| - **No proven GraphRAG improvement yet.** The ablation framework exists but has not been run with verified gold labels. Whether graph-guided retrieval improves results on real documents is an open question until evaluation is completed. | |
| - **OCR.** Scanned PDFs require stronger OCR support than currently provided. | |
| - **Single-user storage.** Production-level per-user persistent document workspaces are future work. | |
| --- | |
| ## License | |
| MIT | |