# Scientific RAG - Implementation Tasks

> **Project**: Scientific Advanced RAG System
> **Dataset**: [armanc/scientific_papers](https://huggingface.co/datasets/armanc/scientific_papers) (ArXiv + PubMed)
> **Deadline**: December 16, 2025
> **Reference Architecture**: LLM-Engineers-Handbook (Domain-Driven Design)

---

## Overview

Build a Retrieval-Augmented Generation (RAG) system for answering questions about scientific papers. The system will use the `armanc/scientific_papers` dataset containing ~320K papers from ArXiv and PubMed with articles, abstracts, and section names.

### Dataset Structure

```python
{
    "abstract": "Summary of the paper...",
    "article": "Full body of the paper, paragraphs separated by \\n...",
    "section_names": "[sec:introduction]introduction\\n[sec:methods]methods\\n...",
}
```

- **arxiv**: 203,037 train / 6,436 val / 6,440 test
- **pubmed**: 119,924 train / 6,633 val / 6,658 test

---

## Project Structure (Target)

```
scientific-rag/
├── pyproject.toml              # Project configuration
├── Makefile                    # Development commands
├── docker-compose.yaml         # Qdrant infrastructure
├── .env.dist                   # Environment template
├── README.md                   # Documentation
├── tasks.md                    # This file
├── docs/
│   └── assignment.md           # Assignment requirements
├── configs/
│   └── rag_config.yaml         # RAG pipeline configuration
├── data/
│   ├── raw/                    # Downloaded dataset cache
│   └── processed/              # Processed chunks
├── scientific_rag/             # Main package
│   ├── __init__.py
│   ├── settings.py             # Configuration management
│   ├── domain/                 # Core entities
│   │   ├── __init__.py
│   │   ├── documents.py        # Document models (Paper, Chunk)
│   │   ├── queries.py          # Query models
│   │   └── types.py            # Enums and type definitions
│   ├── application/            # Business logic
│   │   ├── __init__.py
│   │   ├── data_loader.py      # HuggingFace dataset loading
│   │   ├── chunking/           # Chunking strategies
│   │   │   ├── __init__.py
│   │   │   ├── base.py         # Abstract chunker
│   │   │   └── scientific_chunker.py
│   │   ├── embeddings/         # Embedding models
│   │   │   ├── __init__.py
│   │   │   └── encoder.py      # Sentence-transformers wrapper
│   │   ├── query_processing/   # Query enhancement
│   │   │   ├── __init__.py
│   │   │   ├── query_expansion.py  # Multi-query generation
│   │   │   └── self_query.py       # Metadata extraction
│   │   ├── retrieval/          # Retrieval logic
│   │   │   ├── __init__.py
│   │   │   ├── bm25_retriever.py
│   │   │   ├── dense_retriever.py
│   │   │   └── hybrid_retriever.py
│   │   ├── reranking/          # Reranker
│   │   │   ├── __init__.py
│   │   │   └── cross_encoder.py
│   │   └── rag/                # RAG pipeline
│   │       ├── __init__.py
│   │       ├── pipeline.py     # Main RAG orchestration
│   │       ├── prompt_templates.py
│   │       └── llm_client.py   # LiteLLM wrapper
│   └── infrastructure/         # External integrations
│       ├── __init__.py
│       └── qdrant.py           # Qdrant vector database client
├── demo/                       # Gradio/Streamlit UI
│   ├── __init__.py
│   └── main.py                 # Web interface
└── tests/
    ├── __init__.py
    ├── unit/
    │   ├── test_chunking.py
    │   ├── test_retrieval.py
    │   └── test_reranking.py
    └── integration/
        └── test_rag_pipeline.py
```

---

## Implementation Tasks

### Phase 1: Project Setup & Data Loading

- [✅] **1.1** Update `pyproject.toml` with project dependencies
  - `datasets` - HuggingFace datasets
  - `sentence-transformers` - Embeddings and cross-encoders
  - `rank-bm25` - BM25 retrieval
  - `qdrant-client` - Vector database client
  - `litellm` - LLM abstraction layer
  - `gradio` or `streamlit` - UI framework
  - `pydantic` - Data validation
  - `pydantic-settings` - Configuration management
  - `loguru` - Logging
  - `numpy`, `scipy` - Numerical operations
  - `tqdm` - Progress bars
- [✅] **1.2** Create `docker-compose.yaml` for local infrastructure
  - Qdrant vector database service
  - Example:
    ```yaml
    services:
      qdrant:
        image: qdrant/qdrant:latest
        ports:
          - "6333:6333"
          - "6334:6334"
        volumes:
          - qdrant_storage:/qdrant/storage
    volumes:
      qdrant_storage:
    ```
  - Add `make qdrant-up` and `make qdrant-down` commands
- [✅] **1.3** Create `scientific_rag/settings.py`
  - Environment variable management
  - Model IDs configuration
  - API keys handling (OpenAI, Groq, OpenRouter)
  - Qdrant connection settings (host, port, API key for cloud)
  - Default chunking parameters
- [✅] **1.4** Create `scientific_rag/domain/` entities
  - `types.py`: Enums for DataSource (ARXIV, PUBMED), SectionType
  - `documents.py`: `ScientificPaper`, `PaperChunk` Pydantic models with metadata
  - `queries.py`: `Query`, `EmbeddedQuery`, `QueryFilters` models
- [✅] **1.5** Implement `scientific_rag/application/data_loader.py`
  - Load `armanc/scientific_papers` from HuggingFace
  - Support both `arxiv` and `pubmed` subsets
  - Configurable sample size for development
  - Progress tracking with tqdm

### Phase 2: Chunking Strategy

- [✅] **2.1** Implement `scientific_rag/application/chunking/scientific_chunker.py`
  - **Section-aware chunking**: Parse `section_names` to identify sections
  - **Paragraph-based splitting**: Split on `\n` boundaries
  - **Overlap strategy**: Add overlap between chunks for context
  - Configurable `chunk_size` and `chunk_overlap`
  - **Metadata preservation**: Store source (arxiv/pubmed), normalized section name, paper_id, position
  - Normalize section names to enum values (introduction, methods, results, conclusion, other)
- [✅] **2.2** Create processing script to generate chunks
  - Batch processing with progress tracking
  - Save chunks to disk (JSON/Parquet) for reuse
  - Generate unique chunk IDs (hash-based)

### Phase 3: Retrieval Implementation

- [✅] **3.1** Create `scientific_rag/application/embeddings/encoder.py`
  - Singleton pattern for embedding model
  - Use `intfloat/e5-small-v2`
  - Batch embedding support
  - GPU/CPU device configuration
- [✅] **3.2** Implement `scientific_rag/infrastructure/qdrant.py`
  - Qdrant client wrapper (local Docker or Qdrant Cloud)
  - Collection creation with proper schema
  - `upsert_chunks(chunks)` - batch insert with embeddings
  - `search(query_vector, filters, k)` - filtered vector search
  - Support for Qdrant filter syntax
- [✅] **3.3** Implement `scientific_rag/application/retrieval/bm25_retriever.py`
  - Use `rank_bm25` library
  - Tokenization with proper preprocessing
  - `search(query, k) -> List[Chunk]` interface
  - Score normalization
- [✅] **3.4** Implement `scientific_rag/application/retrieval/dense_retriever.py`
  - Semantic search using Qdrant
  - Integrate with `QdrantClient` from infrastructure
  - Apply metadata filters from self-query
  - `search(query, filters, k) -> List[Chunk]` interface
- [✅] **3.5** Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
  - Combine BM25 and dense retrieval
  - Pass metadata filters to both retrievers
  - Configurable weights for each method
  - Toggle switches: `use_bm25`, `use_dense`
  - Reciprocal Rank Fusion (RRF) or weighted combination
  - Deduplication of results

### Phase 4: Query Processing & Metadata Filtering

- [✅] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
  - Extract metadata filters from natural language queries using **rule-based matching**
  - Detect source preferences: "arxiv papers about..." → filter to arxiv
  - Detect section preferences: "in the methods section..." → filter to methods
  - Use regex/keyword matching
  - No LLM needed - metadata is already structured in chunks from the dataset
  - Return structured `QueryFilters` object
  - Filters are passed to Qdrant for efficient pre-filtering before vector search
- [✅] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
  - Generate multiple query variations to improve recall
  - Use LLM to create semantically similar queries
  - Configurable `expand_to_n` parameter (default: 3)
  - Example prompt:
    ```
    Generate {n} different versions of this question to search a scientific papers database.
    Each version should capture the same intent but use different wording.
    Separate versions with "###"

    Original: {query}
    ```
  - Search with all expanded queries, merge results
  - Deduplicate before reranking
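The Reciprocal Rank Fusion option mentioned in 3.5 can be sketched in a few lines. Chunk IDs stand in for full `Chunk` objects here, and `k=60` is the conventional RRF smoothing constant:

```python
# Minimal sketch of RRF for the hybrid retriever in 3.5.
# Each ranking is an ordered list of chunk IDs from one retriever.
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists into one, deduplicated by construction."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Items ranked highly in any list accumulate larger reciprocal scores
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c5", "c3"]
fused = rrf_merge([bm25_hits, dense_hits])
# "c1" and "c3" appear in both lists, so they rise to the top
```

A weighted variant simply multiplies each retriever's contribution by its configured weight before summing.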
- [✅] **4.3** Update `scientific_rag/domain/queries.py`
  - Add `QueryFilters` model for self-query results
  - Add `ExpandedQuery` model to hold query variations
  - Example:
    ```python
    class QueryFilters(BaseModel):
        source: Literal["arxiv", "pubmed", "any"] = "any"
        section: Literal["introduction", "methods", "results", "conclusion", "any"] = "any"

    class ExpandedQuery(BaseModel):
        original: str
        variations: list[str]
        filters: QueryFilters | None = None
    ```

### Phase 5: Reranking

- [✅] **5.1** Implement `scientific_rag/application/reranking/cross_encoder.py`
  - Use `cross-encoder/ms-marco-MiniLM-L6-v2` (or similar)
  - `rerank(query, chunks, top_k) -> List[Chunk]` interface
  - Batch processing for efficiency
  - Score-based sorting

### Phase 6: LLM Integration

- [✅] **6.1** Implement `scientific_rag/application/rag/llm_client.py`
  - LiteLLM wrapper for provider abstraction
  - Support for Groq, OpenRouter, OpenAI
  - Configurable model selection
  - Error handling and retries
  - Response streaming (optional)
- [✅] **6.2** Create `scientific_rag/application/rag/prompt_templates.py`
  - RAG prompt template with context injection
  - Citation-aware prompting (instruct model to cite sources)
  - System prompt for scientific Q&A
  - Example:
    ```
    You are a scientific research assistant. Answer the question based on the provided context.
    Always cite your sources using [1], [2], etc.

    Context:
    [1] {chunk_1}
    [2] {chunk_2}
    ...

    Question: {query}

    Answer with citations:
    ```
- [✅] **6.3** Implement `scientific_rag/application/rag/pipeline.py`
  - Main `RAGPipeline` class
  - Orchestrate: Query → Self-Query → Query Expansion → Retrieve (with filters) → Rerank → Generate
  - Full pipeline flow:
    ```
    1. Self-Query: Extract filters (source, section) for Qdrant
    2. Query Expansion: Generate N query variations
    3. Retrieve: Search with all queries (BM25 + Qdrant with filters)
    4. Merge & Deduplicate: Combine results from all queries
    5. Rerank: Cross-encoder scoring
    6. Generate: LLM with citations
    ```
  - Configurable retrieval parameters
  - Toggle for each component: `use_self_query`, `use_query_expansion`, `use_bm25`, `use_dense`, `use_reranking`
  - Citation tracking and formatting
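The orchestration in 6.3 can be sketched as a dependency-injected skeleton: each stage is a callable, and toggles short-circuit optional stages. The names and signatures here are assumptions, not the project's actual interfaces:

```python
# Hypothetical skeleton for RAGPipeline (task 6.3); stages are injected callables.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RAGPipeline:
    retrieve: Callable[[str], list[str]]           # query -> candidate chunks
    rerank: Callable[[str, list[str]], list[str]]  # cross-encoder ordering
    generate: Callable[[str, list[str]], str]      # LLM call with context
    expand: Callable[[str], list[str]] = lambda q: [q]  # query expansion (identity default)
    use_query_expansion: bool = False
    use_reranking: bool = True
    top_k: int = 3

    def answer(self, query: str) -> str:
        queries = self.expand(query) if self.use_query_expansion else [query]
        # Retrieve with every query variation, then deduplicate preserving order
        candidates = list(dict.fromkeys(c for q in queries for c in self.retrieve(q)))
        if self.use_reranking:
            candidates = self.rerank(query, candidates)
        return self.generate(query, candidates[: self.top_k])


# Stub components make the flow easy to unit-test without models or a database
pipe = RAGPipeline(
    retrieve=lambda q: ["chunk-a", "chunk-b"],
    rerank=lambda q, cs: sorted(cs),
    generate=lambda q, cs: f"{len(cs)} chunks used",
)
result = pipe.answer("test question")
```

Injecting stages as callables keeps the unit tests in `tests/unit/` free of model downloads and Qdrant dependencies.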
### Phase 7: User Interface

- [✅] **7.1** Create `demo/main.py` with Gradio
  - Text input for questions
  - API key input field (not stored in code)
  - Dropdown for LLM provider/model selection
  - Dropdown for metadata filters (optional manual override):
    - Source: Any / ArXiv / PubMed
    - Section: Any / Introduction / Methods / Results / Conclusion
  - Checkboxes for pipeline components:
    - [✅] Enable Self-Query (metadata extraction)
    - [✅] Enable Query Expansion
    - [✅] Enable BM25
    - [✅] Enable Dense Retrieval (Qdrant)
    - [✅] Enable Reranking
  - Slider for top-k parameter
  - Slider for query expansion count (1-5)
  - Output: Answer with citations
  - Expandable section showing retrieved chunks with metadata
- [✅] **7.2** Add service description
  - Brief explanation of the RAG system
  - Dataset information
  - Usage instructions
- [✅] **7.3** Style and UX improvements
  - Clear layout
  - Loading indicators
  - Error messages for invalid inputs

### Phase 8: Deployment

- [ ] **8.1** Create `requirements.txt` for HuggingFace Spaces
  - Pin versions for reproducibility
  - Note: HF Spaces may need Qdrant Cloud instead of local
- [ ] **8.2** Create HuggingFace Space configuration
  - `README.md` with YAML frontmatter for Gradio SDK
  - Resource requirements (CPU/memory)
  - Configure Qdrant Cloud connection for deployment
- [ ] **8.3** Deploy to HuggingFace Spaces
  - Test with sample queries
  - Verify API key handling
  - Verify Qdrant Cloud connectivity

### Phase 9: Evaluation & Documentation

- [ ] **9.1** Find queries where BM25 outperforms dense retrieval
  - Queries with specific terminology, rare words, or exact phrases
  - Examples:
    - "papers mentioning @xmath0 decay channel"
    - "CLEO detector measurements"
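To probe 9.1-style queries before wiring up the full retrievers, a self-contained toy BM25 scorer (Okapi formula, hand-rolled here instead of `rank_bm25` so it runs standalone) shows why rare exact terms dominate. The three mini-documents are invented for illustration:

```python
# Toy Okapi BM25 scorer to illustrate 9.1: rare exact terms win.
# Corpus and query are invented; real code would use the rank_bm25 library.
import math


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            df = sum(term in d for d in docs)  # document frequency of the term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms get high idf
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


docs = [
    "we report cleo detector measurements of charm decays".split(),
    "neural networks for language understanding".split(),
    "detector calibration methods for particle physics".split(),
]
scores = bm25_scores("cleo detector measurements".split(), docs)
best = scores.index(max(scores))  # doc 0 wins on the rare exact terms
```

The rare token "cleo" carries most of the score, which is exactly the behavior a dense retriever tends to miss for out-of-vocabulary terminology like `@xmath0`.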
- [ ] **9.2** Find queries where dense retrieval outperforms BM25
  - Semantic similarity queries
  - Paraphrased questions
  - Examples:
    - "How do researchers measure particle lifetimes?"
    - "What methods are used for blood clot prevention?"
- [ ] **9.3** Demonstrate metadata filtering effectiveness
  - Show queries where filtering by source improves results
  - Show queries where filtering by section improves results
  - Examples:
    - "arxiv papers about quantum computing" → filter to arxiv
    - "methodology for clinical trials" → filter to methods section
- [ ] **9.4** Document the system in README.md
  - Architecture overview
  - Installation instructions (including Docker/Qdrant setup)
  - Usage examples
  - Component descriptions
  - Retrieval comparison findings
  - Metadata filtering examples
- [ ] **9.5** Prepare submission materials
  - Source code link
  - Deployed service link
  - Component checklist (per assignment requirements)

---

## Optional Enhancements (Bonus Points)

### Citation Enhancement

- [ ] **B.1** Improve citation formatting
  - Parse and display chunk source information
  - Show paper abstract or section name
  - Link citations to source documents

### Performance Optimization

- [ ] **B.2** Add caching layer
  - Cache embeddings
  - Cache LLM responses for identical queries
- [ ] **B.3** Optimize for larger dataset
  - FAISS index for fast similarity search
  - Batch processing improvements

---

## Dependencies Summary

```toml
[project]
name = "scientific-rag"
version = "0.1.0"
description = "Scientific Papers RAG System"
requires-python = ">=3.11"
dependencies = [
    # Data
    "datasets>=3.0.0",
    "huggingface-hub>=0.20.0",
    # ML/Embeddings
    "sentence-transformers>=3.0.0",
    "torch>=2.0.0",
    "numpy>=1.26.0",
    "scipy>=1.11.0",
    # Retrieval
    "rank-bm25>=0.2.2",
    "qdrant-client>=1.8.0",
    # LLM
    "litellm>=1.0.0",
    # Configuration
    "pydantic>=2.0.0",
    "pydantic-settings>=2.0.0",
    # UI
    "gradio>=4.0.0",
    # Utilities
    "loguru>=0.7.0",
    "tqdm>=4.65.0",
    "python-dotenv>=1.0.0",
]

[dependency-groups]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.4.0",
    "mypy>=1.10.0",
    "pre-commit>=3.0.0",
    "ipykernel>=6.0.0",
]
```
---

## Quick Start Commands

```bash
# Setup
make install

# Run locally
make run-app

# Run tests
make test

# Lint
make lint

# Format
make format
```

---

## Key Implementation Notes

### Chunking Strategy

For scientific papers, consider:

1. **Section-based chunking**: Split by sections first, then by size
2. **Preserve context**: Include section title in each chunk
3. **Handle LaTeX**: Papers contain `@xmath` tokens for math expressions

### Retrieval Comparison

Document specific queries that demonstrate:

- BM25 strength: Exact term matching, rare terminology
- Dense strength: Semantic understanding, paraphrased queries

### LLM Configuration

Recommended free options:

- **Groq**: Fast, free tier with `llama-3.1-8b-instant`
- **OpenRouter**: Multiple model options, some free

### Citation Format

```
Answer: The decay channel measurement shows... [1]. Further analysis using the CLEO detector... [2].

Sources:
[1] "we have studied the leptonic decay..." (arxiv, section: introduction)
[2] "data collected with the CLEO detector..." (arxiv, section: methods)
```

---

## Timeline Suggestion

| Week               | Focus Area                               |
| ------------------ | ---------------------------------------- |
| Week 1 (Dec 9-11)  | Phase 1-2: Setup, Data Loading, Chunking |
| Week 2 (Dec 12-14) | Phase 3-6: Retrieval, Reranking, LLM     |
| Week 3 (Dec 15-16) | Phase 7-9: UI, Deployment, Documentation |

---

## References

- [Assignment Document](./docs/assignment.md)
- [LLM-Engineers-Handbook](https://github.com/PacktPublishing/LLM-Engineers-Handbook) - Reference architecture
- [Scientific Papers Dataset](https://huggingface.co/datasets/armanc/scientific_papers)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [Sentence-Transformers](https://www.sbert.net/)
- [Gradio Documentation](https://www.gradio.app/docs)