Space: adarshninja/ai-research-paper-explainer

adarshninja committed on Feb 9

Commit f3fc53e (verified) · 1 Parent(s): 5660c85

Update README.md

Files changed (1)

README.md +158 -145

@@ -1,146 +1,159 @@
-📄 AI Research Paper Explainer
-Hugging Face · RAG · CPU-only · No Paid APIs
-An end-to-end Retrieval-Augmented Generation (RAG) system that reads long research papers and explains them in simple, grounded terms.
-Built to deeply understand how embeddings, retrieval, and generation work together — not just to make another chatbot.
-🚀 What This Project Does
-Given a text-based research paper (PDF), the system:
-Extracts and processes long documents safely
-Chunks content into semantically meaningful pieces
-Builds embeddings using Sentence-Transformers
-Retrieves the most relevant sections via cosine similarity
-Generates grounded explanations using a Hugging Face LLM
-Supported Analysis Modes
-Summary – Explain the paper in simple terms
-Main Contribution – What is new or novel
-Why Better – Improvements over prior work
-Limitations – Weaknesses or constraints
-Prerequisites – Concepts to learn before reading
-🧠 Architecture (High-Level)
-PDF (text-based)
-   ↓
-Text Extraction
-   ↓
-Chunking (long-document safe)
-   ↓
-Embeddings (Sentence-Transformers)
-   ↓
-Semantic Similarity (Top-K Retrieval)
-   ↓
-Context Builder
-   ↓
-Hugging Face LLM (FLAN-T5)
-   ↓
-Grounded Explanation
-⚠️ Scanned PDFs are intentionally excluded to ensure fast, reliable, free-tier-safe processing.
-⚠️ Scanned PDFs are intentionally excluded to ensure fast, reliable, free-tier-safe processing.
-📂 Project Structure
-Research/
-│
-├── app.py                     # CLI pipeline runner
-├── streamlit_app.py           # Streamlit UI
-│
-├── extractor/
-│   └── pdf_parser.py
-│
-├── chunker/
-│   └── text_chunker.py
-│
-├── embeddings/
-│   ├── embedder.py
-│   └── similarity.py
-│
-├── llm/
-│   ├── explainer.py
-│   └── question_bank.py
-│
-├── utils/
-│   ├── validators.py
-│   └── context_builder.py
-│
-├── requirements.txt
-└── README.md
-⚙️ Setup & Installation
-1️⃣ Create virtual environment
-python -m venv venv
-source venv/bin/activate   # Linux/Mac
-venv\Scripts\activate      # Windows
-2️⃣ Install dependencies
-pip install -r requirements.txt
-3️⃣ Run Streamlit app
-streamlit run streamlit_app.py
-Upload a text-based research paper (PDF) and select an analysis mode.
-🧪 Example Output
-Mode: Summary
-This paper introduces the Transformer architecture, which replaces recurrent networks with self-attention mechanisms. This allows parallel processing of sequences, leading to faster training and better performance on sequence-to-sequence tasks.
-🎯 Why This Project Matters
-Most GenAI demos:
-send full documents to an LLM
-ignore retrieval
-hallucinate confidently
-This project:
-controls context explicitly
-retrieves before generating
-mirrors how real RAG systems are built in production
-📌 Design Decisions
-❌ No OCR → avoids heavy dependencies and free-tier crashes
-✅ Text-based PDFs only (arXiv-style)
-✅ Chunked processing for long documents
-✅ Top-K retrieval to reduce hallucination
-These choices prioritize robustness, clarity, and deployability.
-🔮 Future Improvements
-Section-aware chunking (Abstract / Method / Results)
-Hugging Face Spaces deployment
-Evaluation metrics for retrieval quality
-Lightweight vector store (FAISS)
-🙌 Author
-Built by Adarsh
 Learning GenAI by building systems, not just prompts.

+---
+title: AI Research Paper Explainer
+emoji: 📄
+colorFrom: blue
+colorTo: purple
+sdk: streamlit
+sdk_version: "1.31.0"
+python_version: "3.10"
+app_file: streamlit_app.py
+pinned: false
+---
+📄 AI Research Paper Explainer
+Hugging Face · RAG · CPU-only · No Paid APIs
+An end-to-end Retrieval-Augmented Generation (RAG) system that reads long research papers and explains them in simple, grounded terms.
+Built to deeply understand how embeddings, retrieval, and generation work together — not just to make another chatbot.
+🚀 What This Project Does
+Given a text-based research paper (PDF), the system:
+Extracts and processes long documents safely
+Chunks content into semantically meaningful pieces
+Builds embeddings using Sentence-Transformers
+Retrieves the most relevant sections via cosine similarity
+Generates grounded explanations using a Hugging Face LLM
+Supported Analysis Modes
+Summary – Explain the paper in simple terms
+Main Contribution – What is new or novel
+Why Better – Improvements over prior work
+Limitations – Weaknesses or constraints
+Prerequisites – Concepts to learn before reading
+🧠 Architecture (High-Level)
+PDF (text-based)
+   ↓
+Text Extraction
+   ↓
+Chunking (long-document safe)
+   ↓
+Embeddings (Sentence-Transformers)
+   ↓
+Semantic Similarity (Top-K Retrieval)
+   ↓
+Context Builder
+   ↓
+Hugging Face LLM (FLAN-T5)
+   ↓
+Grounded Explanation
+⚠️ Scanned PDFs are intentionally excluded to ensure fast, reliable, free-tier-safe processing.
+⚠️ Scanned PDFs are intentionally excluded to ensure fast, reliable, free-tier-safe processing.
+📂 Project Structure
+Research/
+│
+├── app.py                     # CLI pipeline runner
+├── streamlit_app.py           # Streamlit UI
+│
+├── extractor/
+│   └── pdf_parser.py
+│
+├── chunker/
+│   └── text_chunker.py
+│
+├── embeddings/
+│   ├── embedder.py
+│   └── similarity.py
+│
+├── llm/
+│   ├── explainer.py
+│   └── question_bank.py
+│
+├── utils/
+│   ├── validators.py
+│   └── context_builder.py
+│
+├── requirements.txt
+└── README.md
+⚙️ Setup & Installation
+1️⃣ Create virtual environment
+python -m venv venv
+source venv/bin/activate   # Linux/Mac
+venv\Scripts\activate      # Windows
+2️⃣ Install dependencies
+pip install -r requirements.txt
+3️⃣ Run Streamlit app
+streamlit run streamlit_app.py
+Upload a text-based research paper (PDF) and select an analysis mode.
+🧪 Example Output
+Mode: Summary
+This paper introduces the Transformer architecture, which replaces recurrent networks with self-attention mechanisms. This allows parallel processing of sequences, leading to faster training and better performance on sequence-to-sequence tasks.
+🎯 Why This Project Matters
+Most GenAI demos:
+send full documents to an LLM
+ignore retrieval
+hallucinate confidently
+This project:
+controls context explicitly
+retrieves before generating
+mirrors how real RAG systems are built in production
+📌 Design Decisions
+❌ No OCR → avoids heavy dependencies and free-tier crashes
+✅ Text-based PDFs only (arXiv-style)
+✅ Chunked processing for long documents
+✅ Top-K retrieval to reduce hallucination
+These choices prioritize robustness, clarity, and deployability.
+🔮 Future Improvements
+Section-aware chunking (Abstract / Method / Results)
+Hugging Face Spaces deployment
+Evaluation metrics for retrieval quality
+Lightweight vector store (FAISS)
+🙌 Author
+Built by Adarsh
 Learning GenAI by building systems, not just prompts.
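
Note on the pipeline this README describes: the chunking, embedding, cosine-similarity top-K retrieval, and context-building steps can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the project's code. The function names `chunk_words`, `embed`, `cosine`, `top_k`, and `build_context` are hypothetical, the chunk sizes are arbitrary, and the bag-of-words `embed` is a dependency-free stand-in for the Sentence-Transformers embeddings the README names.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into overlapping word windows (hypothetical chunker)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector. ASSUMPTION: stands in for a real
    Sentence-Transformers embedding so the sketch has no dependencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the question."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_context(question, chunks, k=3):
    """Assemble a prompt from only the retrieved chunks, not the full paper."""
    selected = [c for _, c in top_k(question, chunks, k)]
    return "Context:\n" + "\n---\n".join(selected) + f"\n\nQuestion: {question}"


if __name__ == "__main__":
    paper = (
        "The Transformer replaces recurrent networks with self attention. "
        "Self attention allows parallel processing of sequences. "
        "Training is faster than with recurrent models."
    )
    chunks = chunk_words(paper, size=10, overlap=2)
    print(build_context("How does self attention help training?", chunks, k=2))
```

In a real run, `embed` would presumably be replaced by a Sentence-Transformers model and the string returned by `build_context` would be passed to the FLAN-T5 generator; the retrieve-before-generate shape stays the same.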