---
title: Classics RAG QA
emoji: 📚
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "4.0.0"
app_file: app.py
pinned: false
---

# Classics RAG QA: Grounded Q&A for The Iliad and Dorian Gray

**Ask a question → get a concise answer with verbatim quotes and citations.**

Answers are designed to stay grounded: each one is backed by verbatim text pulled from the book.

---

## ✨ What it does

- Retrieves the most relevant passages from **public-domain editions** (Project Gutenberg).

- Composes a short answer that references **[1][2][3]**-style citations.

- Shows the exact **quoted lines** and where they came from (book, chapter/paragraph).

---

## 🛠️ How it works (under the hood)

1. **Chunk** the cleaned book into overlapping segments with chapter/paragraph metadata.  

2. **Embed** chunks using `sentence-transformers/all-MiniLM-L6-v2`.  

3. **Index** embeddings in FAISS for fast top-k retrieval.  

4. **Compose** answers with a deterministic heuristic:

   - rank candidate sentences by lexical coverage and similarity;

   - select 1–3 diverse quotes;

   - synthesize a 2–4 sentence answer that explicitly references the quotes.

No large language model is required; an optional rewrite step can be added but is **off by default** to preserve groundedness.
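
As a rough illustration of the compose step, the sketch below ranks candidate sentences by lexical coverage of the question and then greedily keeps up to three quotes that do not overlap too much with each other. It is a minimal sketch under stated assumptions, not the app's actual code: the function names, the small stopword list, and the `max_overlap` threshold are all hypothetical, and the embedding-similarity term mentioned above is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "by",
             "does", "how", "what", "his", "her", "on", "with", "for"}

def content_words(text):
    """Return the set of lowercase alphabetic tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def lexical_coverage(question, sentence):
    """Fraction of the question's content words that also occur in the sentence."""
    q = content_words(question)
    if not q:
        return 0.0
    return len(q & content_words(sentence)) / len(q)

def jaccard(a, b):
    """Jaccard overlap between the content words of two sentences."""
    wa, wb = content_words(a), content_words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def select_quotes(question, sentences, max_quotes=3, max_overlap=0.6):
    """Greedily pick up to max_quotes high-coverage, mutually diverse sentences."""
    # sorted() is stable, so ties keep their original order (deterministic output).
    ranked = sorted(sentences, key=lambda s: lexical_coverage(question, s), reverse=True)
    chosen = []
    for s in ranked:
        if lexical_coverage(question, s) == 0.0:
            break  # remaining sentences share no content words with the question
        if all(jaccard(s, c) <= max_overlap for c in chosen):
            chosen.append(s)
        if len(chosen) == max_quotes:
            break
    return chosen
```

Because the selection is a pure function of the question and the retrieved sentences, the same input always yields the same quotes, which is what makes a heuristic of this kind deterministic.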

---

## 🧪 Evaluation (lightweight)

- **Retrieval Recall@k:** proportion of questions whose gold-support chunk appears in the top-k.

- **Groundedness:** percentage of answers with ≥1 quote.

- **Attribution:** fraction of answer sentences that share ≥2 content words with some quote.

- On a tiny hand-built QA set (10–20 items), target Recall@5 ≥ 0.8 and Groundedness ≥ 0.95.

> Note: numbers vary by edition and chunking parameters.
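
The three metrics can be written down in a few lines. The following is a minimal sketch that follows the definitions above, assuming chunk ids are plain strings and that "content words" means lowercase tokens minus a small stopword list; the function names, the stopword list, and the data layout are hypothetical and not taken from the project's evaluation code.

```python
import re

# Hypothetical stopword list; the project's real definition of "content word" may differ.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "by", "on", "with", "for"}

def content_words(text):
    """Return the set of lowercase alphabetic tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of questions whose gold chunk id is among the top-k retrieved ids.

    retrieved_ids: one ranked list of chunk ids per question.
    gold_ids: one gold-support chunk id per question, in the same order.
    """
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(retrieved_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

def groundedness(quotes_per_answer):
    """Fraction of answers that come with at least one quote."""
    if not quotes_per_answer:
        return 0.0
    return sum(1 for quotes in quotes_per_answer if quotes) / len(quotes_per_answer)

def attribution(answer_sentences, quotes, min_shared=2):
    """Fraction of answer sentences sharing >= min_shared content words with some quote."""
    if not answer_sentences:
        return 0.0
    quote_words = [content_words(q) for q in quotes]
    supported = sum(
        1 for s in answer_sentences
        if any(len(content_words(s) & qw) >= min_shared for qw in quote_words)
    )
    return supported / len(answer_sentences)
```

For a whole QA set, one plausible aggregation is to average `attribution` over all answers; the source does not say how the reported score is aggregated.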

**Current Results:**
- ✅ **Recall@5: 100%** (10/10 questions)
- ✅ **Groundedness: 100%** quote presence
- ✅ **Attribution Score: 0.75** (target: ≥0.7)

---

## 🚀 Try it

- Pick a book (Iliad or Dorian Gray).

- Ask focused questions like:

  - "How does Homer portray Achilles' anger in Book 1?"

  - "What does Lord Henry claim about influence on the young?"

  - "Where does the poem describe the shield of Achilles?"

- Read the answer; expand **Evidence** to inspect quotes and locations.

---

## ⚙️ Configuration

Key parameters (adjusted in `configs/app.yaml`):

- `chunk_size` / `chunk_overlap`: retrieval granularity and recall.

- `embedding_model`: default `all-MiniLM-L6-v2` (speed/quality trade-off).

- `top_k`: number of retrieved chunks shown to the composer.
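
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It assumes both parameters are measured in whitespace-separated words (the real app may count characters or tokens instead), and the function name and default values are hypothetical.

```python
def chunk_words(text, chunk_size=200, chunk_overlap=50):
    """Split text into overlapping word windows.

    Assumption: sizes are counted in whitespace-separated words.
    Each chunk records the index of its first word as simple metadata.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append({"start": start, "text": " ".join(words[start:start + chunk_size])})
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

A larger overlap makes it less likely that a relevant sentence is split across a chunk boundary, at the cost of more chunks to embed and index.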

---

## 📚 Data & Licensing

- Texts are sourced from **Project Gutenberg** (public domain).  

- Only derived chunks and indices are stored for retrieval; we do not redistribute copyrighted editions.

---

## 🔎 Limitations

- Coreference and pronouns may require nearby context; very long-range references can be missed.

- Different translations/editions may shift phrasing and chapter boundaries.

- The system is conservative by design; if quotes are weak, the answer stays cautious.

- Questions involving negation or negative attributes (e.g., "Was X ugly?") may retrieve the wrong context, because embedding-based semantic search handles negation poorly.

---

## 🧩 Roadmap

- Named-entity & character graph for richer answers.

- Optional LLM paraphrase pass that **never changes quotes** (off by default).

- Multi-book corpus with per-source filtering and cross-references.

---

## 🧾 Citation

If you reference this project, please cite:

> Classics RAG QA: Grounded Literary Question Answering with Verbatim Citations (2025).  

> https://huggingface.co/spaces/Tuminha/classics-rag-qa