--- title: GraphResearcher emoji: ๐Ÿ“š colorFrom: blue colorTo: indigo sdk: docker app_port: 7860 pinned: false license: mit --- # GraphResearcher ![Python syntax check](https://github.com/yug-birla/Graph-Researcher/actions/workflows/python-syntax-check.yml/badge.svg) Citation-grounded document intelligence platform combining RAG with graph-based retrieval. **Live demo:** [yugbirla-graphresearcher.hf.space](https://yugbirla-graphresearcher.hf.space) ยท [App UI](https://yugbirla-graphresearcher.hf.space/app) ยท [Admin](https://yugbirla-graphresearcher.hf.space/admin/secure) --- ## Problem PDF chatbots typically generate answers without showing where the information came from. Users cannot verify whether the output is grounded in the document. GraphResearcher addresses this by combining chunk-based retrieval, a document knowledge graph, source-level citation, and a verification UI that lets users trace every claim back to a specific page and chunk. --- ## Architecture ```text User Browser โ”‚ Upload / Ask / Compare / Feedback โ–ผ FastAPI Backend โ”œโ”€โ”€ Ingestion: PDF parsing โ†’ chunking โ†’ metadata preservation โ”œโ”€โ”€ Retrieval: hybrid (BM25 + vector) โ†’ cross-encoder reranking โ†’ graph fusion โ”œโ”€โ”€ GraphRAG: entity extraction โ†’ relation extraction โ†’ graph storage โ†’ graph-guided retrieval โ”œโ”€โ”€ Generation: evidence extraction โ†’ grounded prompt โ†’ LLM answer โ†’ citation attachment โ”œโ”€โ”€ Product: app UI, source viewer, document comparison, feedback, admin monitoring โ””โ”€โ”€ Storage: runtime filesystem, SQLite, optional HF Dataset backup ``` --- ## GraphRAG Algorithm Entities are extracted from processed document chunks using rule-based pattern matching: capitalized multi-word phrases and uppercase acronyms are identified, normalized by stripping punctuation and deduplicating via lowercased entity IDs, and classified as CONCEPT, ACRONYM, ORGANIZATION, or TECHNICAL_TERM. Noisy candidates (stopwords, overly short/long strings) are filtered using a quality module. Relations are constructed by co-occurrence: when two or more entities appear in the same sentence, an edge is created between each pair, with the relation type inferred from verb phrases (e.g., "uses" โ†’ USES, "reduces" โ†’ REDUCES) or defaulting to RELATED_TO. Edge weights increment with repeated co-occurrence across chunks. During answering, query terms are tokenized and matched against graph entity names and IDs using exact and substring scoring. The top-k matched entities and their neighboring relations are retrieved, and every chunk ID linked to those entities/relations receives a graph score based on entity mention count and relation weight. These graph-scored chunks are fused with normal hybrid retrieval results: chunks appearing in both lists get a score boost; graph-only chunks are appended. The fused set is re-sorted by score and truncated to the final top-k. The LLM prompt then includes both the standard evidence context and a structured graph context block listing matched entities (with types, mention counts, and pages) and relations (with types and weights). --- ## Evaluation The project includes an ablation evaluation framework comparing **RAG only** vs. **RAG + Graph** retrieval. **Metrics computed:** - **Recall@K** (K=3,5,10): fraction of manually labeled gold chunk IDs retrieved in top K - **Estimated faithfulness**: automatic heuristic checking whether answer sentences are supported by retrieved source text (not human judgment) - **Answer completeness**: fraction of expected gold terms present in the answer - **Latency**: end-to-end request time ### 15-Question Starter Evaluation (Vectorless RAG Master Guide) Generated: 2026-06-18 QA file: `eval/qa_15_starter.jsonl` | Mode | Recall@3 | Recall@5 | Recall@10 | Faithfulness | Completeness | Avg Latency (ms) | Avg Answer Words | Errors | |---|---:|---:|---:|---:|---:|---:|---:|---:| | RAG | 0.2944 | 0.4722 | 0.6389 | 0.9333 | 0.7167 | 7388.7 | 238.5 | 0 | | RAG + Graph | 0.2944 | 0.4722 | 0.6389 | 0.9333 | 0.7167 | 7114.3 | 238.5 | 0 | **Conclusion:** On this specific 15-question dataset, the rule-based GraphRAG implementation had **no significant effect** on retrieval recall or estimated faithfulness compared to standard hybrid retrieval with reranking. Latency was similar. --- ## Setup ```bash git clone https://github.com/yug-birla/Graph-Researcher.git cd Graph-Researcher/backend python -m venv venv venv\Scripts\activate # Windows # source venv/bin/activate # Linux/Mac pip install -r requirements.txt cp .env.example .env # edit with your keys uvicorn app.main:app --reload ``` Open: `http://127.0.0.1:8000/app` --- ## Environment Variables See [`.env.example`](.env.example) for the full list. Key variables: | Variable | Purpose | |---|---| | `LLM_PROVIDER` | `huggingface`, `local`, or `disabled` | | `HF_API_TOKEN` | Hugging Face API token for inference | | `HF_INFERENCE_MODEL` | Model name (default: `Qwen/Qwen3-4B-Instruct-2507`) | | `ADMIN_EMAILS` | Comma-separated admin email allowlist | | `ADMIN_DASHBOARD_KEY` | Password for `/admin/secure` | | `SESSION_SECRET_KEY` | Session middleware secret | | `HF_FEEDBACK_DATASET` | Optional HF dataset for permanent feedback backup | | `HF_FEEDBACK_TOKEN` | HF write token for feedback backup | --- ## Limitations - **Temporary storage on HF Spaces.** Uploaded documents may disappear after runtime restart unless persistent storage is configured. - **Rule-based entity extraction.** The graph layer uses regex patterns, not a trained NER model. It may miss entities that don't follow capitalization conventions and may extract noisy candidates. - **Heuristic faithfulness.** The automatic faithfulness metric is a bag-of-words overlap check, not a semantic entailment model. It can overestimate faithfulness for topically relevant but factually incorrect answers. - **No proven GraphRAG improvement yet.** The ablation framework exists but has not been run with verified gold labels. Whether graph-guided retrieval improves results on real documents is an open question until evaluation is completed. - **OCR.** Scanned PDFs require stronger OCR support than currently provided. - **Single-user storage.** Production-level per-user persistent document workspaces are future work. --- ## License MIT