adarshninja committed on
Commit f3fc53e · verified · 1 Parent(s): 5660c85

Update README.md

Files changed (1)
  1. README.md +158 -145
README.md CHANGED
@@ -1,146 +1,159 @@
1
- 📄 AI Research Paper Explainer
2
-
3
- Hugging Face · RAG · CPU-only · No Paid APIs
4
-
5
- An end-to-end Retrieval-Augmented Generation (RAG) system that reads long research papers and explains them in simple, grounded terms.
6
-
7
- Built to deeply understand how embeddings, retrieval, and generation work together — not just to make another chatbot.
8
-
9
- 🚀 What This Project Does
10
-
11
- Given a text-based research paper (PDF), the system:
12
-
13
- Extracts and processes long documents safely
14
-
15
- Chunks content into semantically meaningful pieces
16
-
17
- Builds embeddings using Sentence-Transformers
18
-
19
- Retrieves the most relevant sections via cosine similarity
20
-
21
- Generates grounded explanations using a Hugging Face LLM
22
-
23
- Supported Analysis Modes
24
-
25
- Summary – Explain the paper in simple terms
26
-
27
- Main Contribution – What is new or novel
28
-
29
- Why Better – Improvements over prior work
30
-
31
- Limitations – Weaknesses or constraints
32
-
33
- Prerequisites – Concepts to learn before reading
34
-
35
- 🧠 Architecture (High-Level)
36
- PDF (text-based)
37
-
38
- Text Extraction
39
-
40
- Chunking (long-document safe)
41
-
42
- Embeddings (Sentence-Transformers)
43
-
44
- Semantic Similarity (Top-K Retrieval)
45
-
46
- Context Builder
47
-
48
- Hugging Face LLM (FLAN-T5)
49
-
50
- Grounded Explanation
51
-
52
- ⚠️ Scanned PDFs are intentionally excluded to ensure fast, reliable, free-tier-safe processing.
53
-
56
- 📂 Project Structure
57
- Research/
58
-
59
- ├── app.py # CLI pipeline runner
60
- ├── streamlit_app.py # Streamlit UI
61
-
62
- ├── extractor/
63
- │ └── pdf_parser.py
64
-
65
- ├── chunker/
66
- │ └── text_chunker.py
67
-
68
- ├── embeddings/
69
- │ ├── embedder.py
70
- │ └── similarity.py
71
-
72
- ├── llm/
73
- │ ├── explainer.py
74
- │ └── question_bank.py
75
-
76
- ├── utils/
77
- │ ├── validators.py
78
- │ └── context_builder.py
79
-
80
- ├── requirements.txt
81
- └── README.md
82
-
83
- ⚙️ Setup & Installation
84
- 1️⃣ Create virtual environment
85
- python -m venv venv
86
- source venv/bin/activate # Linux/Mac
87
- venv\Scripts\activate # Windows
88
- 2️⃣ Install dependencies
89
- pip install -r requirements.txt
90
-
91
- 3️⃣ Run Streamlit app
92
- streamlit run streamlit_app.py
93
-
94
-
95
- Upload a text-based research paper (PDF) and select an analysis mode.
96
-
97
- 🧪 Example Output
98
-
99
- Mode: Summary
100
-
101
- This paper introduces the Transformer architecture, which replaces recurrent networks with self-attention mechanisms. This allows parallel processing of sequences, leading to faster training and better performance on sequence-to-sequence tasks.
102
-
103
- 🎯 Why This Project Matters
104
-
105
- Most GenAI demos:
106
-
107
- send full documents to an LLM
108
-
109
- ignore retrieval
110
-
111
- hallucinate confidently
112
-
113
- This project:
114
-
115
- controls context explicitly
116
-
117
- retrieves before generating
118
-
119
- mirrors how real RAG systems are built in production
120
-
121
- 📌 Design Decisions
122
-
123
- ❌ No OCR → avoids heavy dependencies and free-tier crashes
124
-
125
- ✅ Text-based PDFs only (arXiv-style)
126
-
127
- ✅ Chunked processing for long documents
128
-
129
- ✅ Top-K retrieval to reduce hallucination
130
-
131
- These choices prioritize robustness, clarity, and deployability.
132
-
133
- 🔮 Future Improvements
134
-
135
- Section-aware chunking (Abstract / Method / Results)
136
-
137
- Hugging Face Spaces deployment
138
-
139
- Evaluation metrics for retrieval quality
140
-
141
- Lightweight vector store (FAISS)
142
-
143
- 🙌 Author
144
-
145
- Built by Adarsh
146
  Learning GenAI by building systems, not just prompts.
 
1
+ ---
2
+ title: AI Research Paper Explainer
3
+ emoji: 📄
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: streamlit
7
+ sdk_version: "1.31.0"
8
+ python_version: "3.10"
9
+ app_file: streamlit_app.py
10
+ pinned: false
11
+ ---
12
+
13
+
14
+ 📄 AI Research Paper Explainer
15
+
16
+ Hugging Face · RAG · CPU-only · No Paid APIs
17
+
18
+ An end-to-end Retrieval-Augmented Generation (RAG) system that reads long research papers and explains them in simple, grounded terms.
19
+
20
+ Built to deeply understand how embeddings, retrieval, and generation work together — not just to make another chatbot.
21
+
22
+ 🚀 What This Project Does
23
+
24
+ Given a text-based research paper (PDF), the system:
25
+
26
+ Extracts and processes long documents safely
27
+
28
+ Chunks content into semantically meaningful pieces
29
+
30
+ Builds embeddings using Sentence-Transformers
31
+
32
+ Retrieves the most relevant sections via cosine similarity
33
+
34
+ Generates grounded explanations using a Hugging Face LLM
35
+
36
+ Supported Analysis Modes
37
+
38
+ Summary – Explain the paper in simple terms
39
+
40
+ Main Contribution – What is new or novel
41
+
42
+ Why Better – Improvements over prior work
43
+
44
+ Limitations – Weaknesses or constraints
45
+
46
+ Prerequisites – Concepts to learn before reading
47
+
48
+ 🧠 Architecture (High-Level)
49
+ PDF (text-based)
50
+
51
+ Text Extraction
52
+
53
+ Chunking (long-document safe)
54
+
55
+ Embeddings (Sentence-Transformers)
56
+
57
+ Semantic Similarity (Top-K Retrieval)
58
+
59
+ Context Builder
60
+
61
+ Hugging Face LLM (FLAN-T5)
62
+
63
+ Grounded Explanation
64
+
65
+ ⚠️ Scanned PDFs are intentionally excluded to ensure fast, reliable, free-tier-safe processing.
66
+
69
+ 📂 Project Structure
70
+ Research/
71
+
72
+ ├── app.py # CLI pipeline runner
73
+ ├── streamlit_app.py # Streamlit UI
74
+
75
+ ├── extractor/
76
+ │ └── pdf_parser.py
77
+
78
+ ├── chunker/
79
+ │ └── text_chunker.py
80
+
81
+ ├── embeddings/
82
+ │ ├── embedder.py
83
+ │ └── similarity.py
84
+
85
+ ├── llm/
86
+ │ ├── explainer.py
87
+ │ └── question_bank.py
88
+
89
+ ├── utils/
90
+ │ ├── validators.py
91
+ │ └── context_builder.py
92
+
93
+ ├── requirements.txt
94
+ └── README.md
95
+
96
+ ⚙️ Setup & Installation
97
+ 1️⃣ Create virtual environment
98
+ python -m venv venv
99
+ source venv/bin/activate # Linux/Mac
100
+ venv\Scripts\activate # Windows
101
+ 2️⃣ Install dependencies
102
+ pip install -r requirements.txt
103
+
104
+ 3️⃣ Run Streamlit app
105
+ streamlit run streamlit_app.py
106
+
107
+
108
+ Upload a text-based research paper (PDF) and select an analysis mode.
109
+
110
+ 🧪 Example Output
111
+
112
+ Mode: Summary
113
+
114
+ This paper introduces the Transformer architecture, which replaces recurrent networks with self-attention mechanisms. This allows parallel processing of sequences, leading to faster training and better performance on sequence-to-sequence tasks.
115
+
116
+ 🎯 Why This Project Matters
117
+
118
+ Most GenAI demos:
119
+
120
+ send full documents to an LLM
121
+
122
+ ignore retrieval
123
+
124
+ hallucinate confidently
125
+
126
+ This project:
127
+
128
+ controls context explicitly
129
+
130
+ retrieves before generating
131
+
132
+ mirrors how real RAG systems are built in production
133
+
134
+ 📌 Design Decisions
135
+
136
+ ❌ No OCR → avoids heavy dependencies and free-tier crashes
137
+
138
+ ✅ Text-based PDFs only (arXiv-style)
139
+
140
+ ✅ Chunked processing for long documents
141
+
142
+ ✅ Top-K retrieval to reduce hallucination
143
+
144
+ These choices prioritize robustness, clarity, and deployability.
145
+
146
+ 🔮 Future Improvements
147
+
148
+ Section-aware chunking (Abstract / Method / Results)
149
+
150
+ Hugging Face Spaces deployment
151
+
152
+ Evaluation metrics for retrieval quality
153
+
154
+ Lightweight vector store (FAISS)
155
+
156
+ 🙌 Author
157
+
158
+ Built by Adarsh
159
  Learning GenAI by building systems, not just prompts.
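
Note on the pipeline described in this README: the chunking, embedding, and top-K cosine-similarity retrieval steps can be illustrated with a small sketch on a toy string. This is an illustration under stated assumptions, not the repository's code. The function names `chunk_words`, `embed`, `cosine`, and `retrieve` are hypothetical, the fixed-size overlapping word window is only one possible chunking rule, and the word-count `embed` is a standard-library stand-in for a Sentence-Transformers model so the example runs without any downloads.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into overlapping word windows (assumed chunking rule)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (NOT a Sentence-Transformers model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the k best (score, chunk) pairs."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy input standing in for text extracted from a paper.
paper_text = (
    "self attention replaces recurrence in the transformer model which allows "
    "parallel training of long sequences and better translation quality"
)
chunks = chunk_words(paper_text, size=8, overlap=2)
best_score, best_chunk = retrieve("parallel training", chunks, k=1)[0]
print(best_chunk)
```

In a pipeline like the one the README outlines, the retrieved chunks would then be joined into a context string and passed to the language model together with the question for the selected analysis mode. Swapping `embed` for a real sentence-embedding model changes the vectors but not the ranking logic.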