---
title: Classics RAG QA
emoji: 📚
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "4.0.0"
app_file: app.py
pinned: false
---
# Classics RAG QA – Grounded Q&A for The Iliad and Dorian Gray
**Ask a question → get a concise answer with verbatim quotes and citations.**
No hallucinations: every claim is backed by text pulled from the book.
---
## ✨ What it does
- Retrieves the most relevant passages from **public-domain editions** (Project Gutenberg).
- Composes a short answer that references **[1][2][3]**-style citations.
- Shows the exact **quoted lines** and where they came from (book, chapter/paragraph).
---
## 🛠️ How it works (under the hood)
1. **Chunk** the cleaned book into overlapping segments with chapter/paragraph metadata.
2. **Embed** chunks using `sentence-transformers/all-MiniLM-L6-v2`.
3. **Index** embeddings in FAISS for fast top-k retrieval.
4. **Compose** answers with a deterministic heuristic:
- rank candidate sentences by lexical coverage and similarity;
   - select 1–3 diverse quotes;
   - synthesize a 2–4 sentence answer that explicitly references the quotes.
No large language model is required; an optional rewrite step can be added but is **off by default** to preserve groundedness.
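Steps 1–3 above can be sketched as follows. This is a dependency-free illustration, not the app's code: the real pipeline embeds with `sentence-transformers/all-MiniLM-L6-v2` and indexes in FAISS, while here a simple bag-of-words vector and brute-force cosine search stand in, so the control flow matches but the components are toy versions.

```python
# Hypothetical sketch of chunk -> embed -> retrieve. Helper names and
# default sizes are illustrative; MiniLM + FAISS are replaced by a
# bag-of-words counter and brute-force cosine search.
import math
import re
from collections import Counter

def chunk_text(text, size=400, overlap=100):
    """Step 1: overlapping character windows (sizes are illustrative)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text):
    """Step 2 stand-in: lowercase word counts instead of MiniLM vectors."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, top_k=3):
    """Step 3 stand-in: brute-force top-k instead of a FAISS index."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]
```

Swapping `embed` for a sentence-transformer and `retrieve` for a FAISS top-k search changes the components but not the shape of the pipeline.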
---
## 🧪 Evaluation (lightweight)
- **Retrieval Recall@k:** proportion of questions whose gold-support chunk appears in the top-k.
- **Groundedness:** % of answers with ≥1 quote; **Attribution** = fraction of answer sentences that share ≥2 content words with some quote.
- On a tiny hand-built QA set (10–20 items), target Recall@5 ≥ 0.8 and Groundedness ≥ 0.95.
> Note: numbers vary by edition and chunking parameters.
**Current Results:**
- ✅ **Recall@5: 100%** (10/10 questions)
- ✅ **Groundedness: 100%** quote presence
- ✅ **Attribution Score: 0.75** (target: ≥0.7)
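The Recall@k and attribution definitions above can be sketched as below. The tokenizer and stopword list are illustrative assumptions, not the app's exact implementation.

```python
# Hedged sketch of the two evaluation metrics; tokenization details
# are illustrative choices.
import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}

def content_words(sentence):
    """Lowercased word set minus stopwords (illustrative tokenizer)."""
    return set(re.findall(r"[a-z']+", sentence.lower())) - STOPWORDS

def attribution_score(answer_sentences, quotes):
    """Fraction of answer sentences sharing >= 2 content words with some quote."""
    quote_words = [content_words(q) for q in quotes]
    supported = sum(
        1 for s in answer_sentences
        if any(len(content_words(s) & qw) >= 2 for qw in quote_words)
    )
    return supported / len(answer_sentences) if answer_sentences else 0.0

def recall_at_k(retrieved_ids_per_question, gold_ids):
    """Proportion of questions whose gold-support chunk id is in the top-k."""
    hits = sum(1 for ids, gold in zip(retrieved_ids_per_question, gold_ids)
               if gold in ids)
    return hits / len(gold_ids)
```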
---
## 🚀 Try it
- Pick a book (Iliad or Dorian Gray).
- Ask focused questions like:
- "How does Homer portray Achilles' anger in Book 1?"
- "What does Lord Henry claim about influence on the young?"
- "Where does the poem describe the shield of Achilles?"
- Read the answer; expand **Evidence** to inspect quotes and locations.
---
## ⚙️ Configuration
Key parameters (adjusted in `configs/app.yaml`):
- `chunk_size` / `chunk_overlap`: retrieval granularity and recall.
- `embedding_model`: default `all-MiniLM-L6-v2` (speed/quality trade-off).
- `top_k`: number of retrieved chunks shown to the composer.
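A `configs/app.yaml` along these lines would cover the parameters above; the values shown are illustrative defaults, not the shipped configuration.

```yaml
# Illustrative values only; the actual configs/app.yaml may differ.
chunk_size: 500        # characters per chunk
chunk_overlap: 100     # characters shared between adjacent chunks
embedding_model: sentence-transformers/all-MiniLM-L6-v2
top_k: 5               # retrieved chunks passed to the composer
```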
---
## 📚 Data & Licensing
- Texts are sourced from **Project Gutenberg** (public domain).
- Only derived chunks and indices are stored for retrieval; full source editions are not redistributed.
---
## 🚧 Limitations
- Coreference and pronouns may require nearby context; very long-range references can be missed.
- Different translations/editions may shift phrasing and chapter boundaries.
- The system is conservative by design; if quotes are weak, the answer stays cautious.
- Negative questions (e.g., "Was X ugly?") may not retrieve correct context due to semantic search limitations with negation.
---
## 🧩 Roadmap
- Named-entity & character graph for richer answers.
- Optional LLM paraphrase pass that **never changes quotes** (off by default).
- Multi-book corpus with per-source filtering and cross-references.
---
## 🧾 Citation
If you reference this project, please cite:
> Classics RAG QA – Grounded Literary Question Answering with Verbatim Citations (2025).
> https://huggingface.co/spaces/Tuminha/classics-rag-qa