Daryna Vasylashko
# Scientific RAG - Implementation Tasks
> **Project**: Scientific Advanced RAG System
> **Dataset**: [armanc/scientific_papers](https://huggingface.co/datasets/armanc/scientific_papers) (ArXiv + PubMed)
> **Deadline**: December 16, 2025
> **Reference Architecture**: LLM-Engineers-Handbook (Domain-Driven Design)
---
## Overview
Build a Retrieval-Augmented Generation (RAG) system for answering questions about scientific papers. The system uses the `armanc/scientific_papers` dataset, which contains ~349K papers from ArXiv and PubMed across all splits (~323K in the training splits), each with an article body, an abstract, and section names.
### Dataset Structure
```python
{
    "abstract": "Summary of the paper...",
    "article": "Full body of the paper, paragraphs separated by \\n...",
    "section_names": "[sec:introduction]introduction\\n[sec:methods]methods\\n..."
}
```
- **arxiv**: 203,037 train / 6,436 val / 6,440 test
- **pubmed**: 119,924 train / 6,633 val / 6,658 test
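The `section_names` field packs every section title into one string. A minimal sketch of splitting it and mapping titles onto coarse section types follows. The `[sec:...]` label pattern comes from the example record above; the keyword table and the names `parse_section_names` and `normalize_section` are illustrative assumptions, not part of the dataset or the project code.

```python
import re

# Hypothetical keyword table; extend it after inspecting real section titles.
SECTION_KEYWORDS = {
    "introduction": ("introduction", "background"),
    "methods": ("method", "materials"),
    "results": ("result",),
    "conclusion": ("conclusion", "discussion", "summary"),
}

_LABEL = re.compile(r"^\[[^\]]*\]")  # leading "[sec:...]" label


def parse_section_names(raw: str) -> list[str]:
    """Split the raw field into clean, lower-cased section titles."""
    # Accept both a real newline and a literal backslash-n separator.
    lines = raw.replace("\\n", "\n").split("\n")
    titles = (_LABEL.sub("", line).strip().lower() for line in lines)
    return [title for title in titles if title]


def normalize_section(title: str) -> str:
    """Map a free-form title to a coarse section type, or "other"."""
    for section, keywords in SECTION_KEYWORDS.items():
        if any(keyword in title for keyword in keywords):
            return section
    return "other"
```

Substring matching is deliberately loose, so a title such as "materials and methods" still maps to `methods`; titles that match nothing fall back to `other`.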
---
## Project Structure (Target)
```
scientific-rag/
├── pyproject.toml # Project configuration
├── Makefile # Development commands
├── docker-compose.yaml # Qdrant infrastructure
├── .env.dist # Environment template
├── README.md # Documentation
├── tasks.md # This file
├── docs/
│   └── assignment.md # Assignment requirements
├── configs/
│   └── rag_config.yaml # RAG pipeline configuration
├── data/
│   ├── raw/ # Downloaded dataset cache
│   └── processed/ # Processed chunks
├── scientific_rag/ # Main package
│   ├── __init__.py
│   ├── settings.py # Configuration management
│   ├── domain/ # Core entities
│   │   ├── __init__.py
│   │   ├── documents.py # Document models (Paper, Chunk)
│   │   ├── queries.py # Query models
│   │   └── types.py # Enums and type definitions
│   ├── application/ # Business logic
│   │   ├── __init__.py
│   │   ├── data_loader.py # HuggingFace dataset loading
│   │   ├── chunking/ # Chunking strategies
│   │   │   ├── __init__.py
│   │   │   ├── base.py # Abstract chunker
│   │   │   └── scientific_chunker.py
│   │   ├── embeddings/ # Embedding models
│   │   │   ├── __init__.py
│   │   │   └── encoder.py # Sentence-transformers wrapper
│   │   ├── query_processing/ # Query enhancement
│   │   │   ├── __init__.py
│   │   │   ├── query_expansion.py # Multi-query generation
│   │   │   └── self_query.py # Metadata extraction
│   │   ├── retrieval/ # Retrieval logic
│   │   │   ├── __init__.py
│   │   │   ├── bm25_retriever.py
│   │   │   ├── dense_retriever.py
│   │   │   └── hybrid_retriever.py
│   │   ├── reranking/ # Reranker
│   │   │   ├── __init__.py
│   │   │   └── cross_encoder.py
│   │   └── rag/ # RAG pipeline
│   │       ├── __init__.py
│   │       ├── pipeline.py # Main RAG orchestration
│   │       ├── prompt_templates.py
│   │       └── llm_client.py # LiteLLM wrapper
│   └── infrastructure/ # External integrations
│       ├── __init__.py
│       └── qdrant.py # Qdrant vector database client
├── demo/ # Gradio/Streamlit UI
│   ├── __init__.py
│   └── main.py # Web interface
└── tests/
    ├── __init__.py
    ├── unit/
    │   ├── test_chunking.py
    │   ├── test_retrieval.py
    │   └── test_reranking.py
    └── integration/
        └── test_rag_pipeline.py
```
---
## Implementation Tasks
### Phase 1: Project Setup & Data Loading
- [✅] **1.1** Update `pyproject.toml` with project dependencies
- `datasets` - HuggingFace datasets
- `sentence-transformers` - Embeddings and cross-encoders
- `rank-bm25` - BM25 retrieval
- `qdrant-client` - Vector database client
- `litellm` - LLM abstraction layer
- `gradio` or `streamlit` - UI framework
- `pydantic` - Data validation
- `pydantic-settings` - Configuration management
- `loguru` - Logging
- `numpy`, `scipy` - Numerical operations
- `tqdm` - Progress bars
- [✅] **1.2** Create `docker-compose.yaml` for local infrastructure
- Qdrant vector database service
- Example:
```yaml
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_storage:/qdrant/storage
volumes:
  qdrant_storage:
```
- Add `make qdrant-up` and `make qdrant-down` commands
- [✅] **1.3** Create `scientific_rag/settings.py`
- Environment variable management
- Model IDs configuration
- API keys handling (OpenAI, Groq, OpenRouter)
- Qdrant connection settings (host, port, API key for cloud)
- Default chunking parameters
- [✅] **1.4** Create `scientific_rag/domain/` entities
- `types.py`: Enums for DataSource (ARXIV, PUBMED), SectionType
- `documents.py`: `ScientificPaper`, `PaperChunk` Pydantic models with metadata
- `queries.py`: `Query`, `EmbeddedQuery`, `QueryFilters` models
- [✅] **1.5** Implement `scientific_rag/application/data_loader.py`
- Load `armanc/scientific_papers` from HuggingFace
- Support both `arxiv` and `pubmed` subsets
- Configurable sample size for development
- Progress tracking with tqdm
### Phase 2: Chunking Strategy
- [✅] **2.1** Implement `scientific_rag/application/chunking/scientific_chunker.py`
- **Section-aware chunking**: Parse `section_names` to identify sections
- **Paragraph-based splitting**: Split on `\n` boundaries
- **Overlap strategy**: Add overlap between chunks for context
- Configurable `chunk_size` and `chunk_overlap`
- **Metadata preservation**: Store source (arxiv/pubmed), normalized section name, paper_id, position
- Normalize section names to enum values (introduction, methods, results, conclusion, other)
- [✅] **2.2** Create processing script to generate chunks
- Batch processing with progress tracking
- Save chunks to disk (JSON/Parquet) for reuse
- Generate unique chunk IDs (hash-based)
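The chunking steps above (paragraph splitting, overlap, metadata, hash-based IDs) could be sketched roughly as follows. This is a minimal illustration, not the project's `scientific_chunker.py`: the `Chunk` dataclass, the function name, measuring `chunk_size` in characters, and measuring `chunk_overlap` in paragraphs are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    paper_id: str
    source: str
    section: str
    position: int
    text: str


def chunk_paragraphs(
    paragraphs: list[str],
    paper_id: str,
    source: str,
    section: str = "other",
    chunk_size: int = 1000,
    chunk_overlap: int = 1,
) -> list[Chunk]:
    """Pack paragraphs into chunks of roughly `chunk_size` characters.

    `chunk_overlap` is the number of trailing paragraphs repeated at the
    start of the next chunk. A single paragraph longer than `chunk_size`
    is kept whole rather than split.
    """
    chunks: list[Chunk] = []
    current: list[str] = []

    def flush() -> None:
        text = "\n".join(current)
        position = len(chunks)
        # Hash of paper, position, and text gives a stable, unique ID.
        digest = hashlib.sha256(f"{paper_id}:{position}:{text}".encode()).hexdigest()
        chunks.append(Chunk(digest[:16], paper_id, source, section, position, text))

    for para in (p.strip() for p in paragraphs):
        if not para:
            continue
        size = sum(len(p) for p in current) + len(para)
        if current and size > chunk_size:
            flush()
            # Carry the last paragraphs over to give the next chunk context.
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
        current.append(para)
    if current:
        flush()
    return chunks
```

Calling this once per section (after splitting the article on `\n`) keeps every chunk inside a single section, so the `section` metadata stays unambiguous.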
### Phase 3: Retrieval Implementation
- [✅] **3.1** Create `scientific_rag/application/embeddings/encoder.py`
- Singleton pattern for embedding model
- Use `intfloat/e5-small-v2`
- Batch embedding support
- GPU/CPU device configuration
- [✅] **3.2** Implement `scientific_rag/infrastructure/qdrant.py`
- Qdrant client wrapper (local Docker or Qdrant Cloud)
- Collection creation with proper schema
- `upsert_chunks(chunks)` - batch insert with embeddings
- `search(query_vector, filters, k)` - filtered vector search
- Support for Qdrant filter syntax
- [✅] **3.3** Implement `scientific_rag/application/retrieval/bm25_retriever.py`
- Use `rank_bm25` library
- Tokenization with proper preprocessing
- `search(query, k) -> List[Chunk]` interface
- Score normalization
- [✅] **3.4** Implement `scientific_rag/application/retrieval/dense_retriever.py`
- Semantic search using Qdrant
- Integrate with `QdrantClient` from infrastructure
- Apply metadata filters from self-query
- `search(query, filters, k) -> List[Chunk]` interface
- [✅] **3.5** Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
- Combine BM25 and dense retrieval
- Pass metadata filters to both retrievers
- Configurable weights for each method
- Toggle switches: `use_bm25`, `use_dense`
- Reciprocal Rank Fusion (RRF) or weighted combination
- Deduplication of results
### Phase 4: Query Processing & Metadata Filtering
- [✅] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
- Extract metadata filters from natural language queries using **rule-based matching**
- Detect source preferences: "arxiv papers about..." → filter to arxiv
- Detect section preferences: "in the methods section..." → filter to methods
- Use regex/keyword matching
- No LLM needed - metadata is already structured in chunks from dataset
- Return structured `QueryFilters` object
- Filters are passed to Qdrant for efficient pre-filtering before vector search
- [✅] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
- Generate multiple query variations to improve recall
- Use LLM to create semantically similar queries
- Configurable `expand_to_n` parameter (default: 3)
- Example prompt:
```
Generate {n} different versions of this question to search a scientific papers database.
Each version should capture the same intent but use different wording.
Separate versions with "###"
Original: {query}
```
- Search with all expanded queries, merge results
- Deduplicate before reranking
- [✅] **4.3** Update `scientific_rag/domain/queries.py`
- Add `QueryFilters` model for self-query results
- Add `ExpandedQuery` model to hold query variations
- Example:
```python
from typing import Literal

from pydantic import BaseModel


class QueryFilters(BaseModel):
    source: Literal["arxiv", "pubmed", "any"] = "any"
    section: Literal["introduction", "methods", "results", "conclusion", "any"] = "any"


class ExpandedQuery(BaseModel):
    original: str
    variations: list[str]
    filters: QueryFilters | None = None
```
### Phase 5: Reranking
- [✅] **5.1** Implement `scientific_rag/application/reranking/cross_encoder.py`
- Use `cross-encoder/ms-marco-MiniLM-L6-v2` (or similar)
- `rerank(query, chunks, top_k) -> List[Chunk]` interface
- Batch processing for efficiency
- Score-based sorting
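A minimal sketch of the `rerank` interface follows. The scoring function is injected so the example runs without downloading a model; `ScoreFn` and the return type (text plus score) are assumptions made for this illustration rather than the project's actual signature.

```python
from typing import Callable, Sequence

# Takes (query, passage) pairs and returns one relevance score per pair.
ScoreFn = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    chunks: list[str],
    score_fn: ScoreFn,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and keep the top_k best chunks."""
    if not chunks:
        return []
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_fn(pairs)  # single batched call
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [(chunk, float(score)) for chunk, score in ranked[:top_k]]


# With sentence-transformers installed, the scorer could be (not run here):
#   from sentence_transformers import CrossEncoder
#   score_fn = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2").predict
```

Injecting the scorer also makes the unit test in `tests/unit/test_reranking.py` cheap, since a stub scorer can stand in for the cross-encoder.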
### Phase 6: LLM Integration
- [✅] **6.1** Implement `scientific_rag/application/rag/llm_client.py`
- LiteLLM wrapper for provider abstraction
- Support for Groq, OpenRouter, OpenAI
- Configurable model selection
- Error handling and retries
- Response streaming (optional)
- [✅] **6.2** Create `scientific_rag/application/rag/prompt_templates.py`
- RAG prompt template with context injection
- Citation-aware prompting (instruct model to cite sources)
- System prompt for scientific Q&A
- Example:
```
You are a scientific research assistant. Answer the question based on the provided context.
Always cite your sources using [1], [2], etc.
Context:
[1] {chunk_1}
[2] {chunk_2}
...
Question: {query}
Answer with citations:
```
- [✅] **6.3** Implement `scientific_rag/application/rag/pipeline.py`
- Main `RAGPipeline` class
- Orchestrate: Query → Self-Query → Query Expansion → Retrieve (with filters) → Rerank → Generate
- Full pipeline flow:
```
1. Self-Query: Extract filters (source, section) for Qdrant
2. Query Expansion: Generate N query variations
3. Retrieve: Search with all queries (BM25 + Qdrant with filters)
4. Merge & Deduplicate: Combine results from all queries
5. Rerank: Cross-encoder scoring
6. Generate: LLM with citations
```
- Configurable retrieval parameters
- Toggle for each component: `use_self_query`, `use_query_expansion`, `use_bm25`, `use_dense`, `use_reranking`
- Citation tracking and formatting
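The six-step flow and its component toggles can be sketched as one function with injected callables. This is an illustrative outline under assumed signatures, not the actual `RAGPipeline` class; `PipelineConfig` and `run_pipeline` are hypothetical names, and the `use_bm25` / `use_dense` toggles are assumed to live inside the `retrieve` callable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineConfig:
    use_self_query: bool = True
    use_query_expansion: bool = True
    use_reranking: bool = True
    top_k: int = 5


def run_pipeline(
    query: str,
    config: PipelineConfig,
    extract_filters: Callable[[str], dict],
    expand: Callable[[str], list[str]],
    retrieve: Callable[[str, dict], list[str]],
    rerank: Callable[[str, list[str], int], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    # 1. Self-query: derive metadata filters from the question.
    filters = extract_filters(query) if config.use_self_query else {}
    # 2. Query expansion: always keep the original query first.
    queries = [query]
    if config.use_query_expansion:
        queries = list(dict.fromkeys([query, *expand(query)]))
    # 3-4. Retrieve for every query, then merge and deduplicate
    #      while preserving first-seen order.
    merged = list(dict.fromkeys(c for q in queries for c in retrieve(q, filters)))
    # 5. Rerank against the original query, or just truncate.
    if config.use_reranking:
        top = rerank(query, merged, config.top_k)
    else:
        top = merged[: config.top_k]
    # 6. Generate the cited answer from the selected chunks.
    return generate(query, top)
```

Passing the components in as callables keeps each toggle a one-line branch and lets the integration test replace the LLM and Qdrant with stubs.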
### Phase 7: User Interface
- [✅] **7.1** Create `demo/main.py` with Gradio
- Text input for questions
- API key input field (not stored in code)
- Dropdown for LLM provider/model selection
- Dropdown for metadata filters (optional manual override):
- Source: Any / ArXiv / PubMed
- Section: Any / Introduction / Methods / Results / Conclusion
- Checkboxes for pipeline components:
- [✅] Enable Self-Query (metadata extraction)
- [✅] Enable Query Expansion
- [✅] Enable BM25
- [✅] Enable Dense Retrieval (Qdrant)
- [✅] Enable Reranking
- Slider for top-k parameter
- Slider for query expansion count (1-5)
- Output: Answer with citations
- Expandable section showing retrieved chunks with metadata
- [✅] **7.2** Add service description
- Brief explanation of the RAG system
- Dataset information
- Usage instructions
- [✅] **7.3** Style and UX improvements
- Clear layout
- Loading indicators
- Error messages for invalid inputs
### Phase 8: Deployment
- [ ] **8.1** Create `requirements.txt` for HuggingFace Spaces
- Pin versions for reproducibility
- Note: HF Spaces may need Qdrant Cloud instead of local
- [ ] **8.2** Create HuggingFace Space configuration
- `README.md` with YAML frontmatter for Gradio SDK
- Resource requirements (CPU/memory)
- Configure Qdrant Cloud connection for deployment
- [ ] **8.3** Deploy to HuggingFace Spaces
- Test with sample queries
- Verify API key handling
- Verify Qdrant Cloud connectivity
### Phase 9: Evaluation & Documentation
- [ ] **9.1** Find queries where BM25 outperforms dense retrieval
- Queries with specific terminology, rare words, or exact phrases
- Examples:
- "papers mentioning @xmath0 decay channel"
- "CLEO detector measurements"
- [ ] **9.2** Find queries where dense retrieval outperforms BM25
- Semantic similarity queries
- Paraphrased questions
- Examples:
- "How do researchers measure particle lifetimes?"
- "What methods are used for blood clot prevention?"
- [ ] **9.3** Demonstrate metadata filtering effectiveness
- Show queries where filtering by source improves results
- Show queries where filtering by section improves results
- Examples:
- "arxiv papers about quantum computing" → filter to arxiv
- "methodology for clinical trials" → filter to methods section
- [ ] **9.4** Document the system in README.md
- Architecture overview
- Installation instructions (including Docker/Qdrant setup)
- Usage examples
- Component descriptions
- Retrieval comparison findings
- Metadata filtering examples
- [ ] **9.5** Prepare submission materials
- Source code link
- Deployed service link
- Component checklist (per assignment requirements)
---
## Optional Enhancements (Bonus Points)
### Citation Enhancement
- [ ] **B.1** Improve citation formatting
- Parse and display chunk source information
- Show paper abstract or section name
- Link citations to source documents
### Performance Optimization
- [ ] **B.2** Add caching layer
- Cache embeddings
- Cache LLM responses for identical queries
- [ ] **B.3** Optimize for larger dataset
- FAISS index for fast similarity search
- Batch processing improvements
---
## Dependencies Summary
```toml
[project]
name = "scientific-rag"
version = "0.1.0"
description = "Scientific Papers RAG System"
requires-python = ">=3.11"
dependencies = [
# Data
"datasets>=3.0.0",
"huggingface-hub>=0.20.0",
# ML/Embeddings
"sentence-transformers>=3.0.0",
"torch>=2.0.0",
"numpy>=1.26.0",
"scipy>=1.11.0",
# Retrieval
"rank-bm25>=0.2.2",
"qdrant-client>=1.8.0",
# LLM
"litellm>=1.0.0",
# Configuration
"pydantic>=2.0.0",
"pydantic-settings>=2.0.0",
# UI
"gradio>=4.0.0",
# Utilities
"loguru>=0.7.0",
"tqdm>=4.65.0",
"python-dotenv>=1.0.0",
]
[dependency-groups]
dev = [
"pytest>=8.0.0",
"ruff>=0.4.0",
"mypy>=1.10.0",
"pre-commit>=3.0.0",
"ipykernel>=6.0.0",
]
```
---
## Quick Start Commands
```bash
# Setup
make install
# Run locally
make run-app
# Run tests
make test
# Lint
make lint
# Format
make format
```
---
## Key Implementation Notes
### Chunking Strategy
For scientific papers, consider:
1. **Section-based chunking**: Split by sections first, then by size
2. **Preserve context**: Include section title in each chunk
3. **Handle LaTeX**: Papers contain `@xmath` tokens for math expressions
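One way to handle the `@xmath` placeholders during BM25 tokenization is to treat each as a single token instead of letting the `@` be stripped. The sketch below is an assumption about how this could be done; the regex and the `keep_math` flag are illustrative and not taken from the project code.

```python
import re

# Match "@xmath<N>" placeholders first, then plain alphanumeric words.
_TOKEN = re.compile(r"@xmath\d+|[a-z0-9]+")


def tokenize(text: str, keep_math: bool = True) -> list[str]:
    """Lower-case and tokenize text, keeping or dropping math placeholders."""
    tokens = _TOKEN.findall(text.lower())
    if not keep_math:
        tokens = [t for t in tokens if not t.startswith("@xmath")]
    return tokens
```

The same function should be applied to both the corpus and the query, for example `BM25Okapi([tokenize(c) for c in corpus])` followed by `get_scores(tokenize(query))` with `rank_bm25`, so that exact-match queries on placeholders behave consistently.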
### Retrieval Comparison
Document specific queries that demonstrate:
- BM25 strength: Exact term matching, rare terminology
- Dense strength: Semantic understanding, paraphrased queries
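To document such queries systematically, each retriever's ranking can be compared against a small hand-labelled set of relevant chunk IDs. The sketch below uses reciprocal rank as the comparison metric; the metric choice, the function names, and the existence of manual relevance labels are all assumptions for illustration.

```python
def reciprocal_rank(ranking: list[str], relevant: set[str]) -> float:
    """Return 1 / rank of the first relevant chunk, or 0.0 if none is found."""
    for rank, chunk_id in enumerate(ranking, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def compare_retrievers(
    bm25_ranking: list[str],
    dense_ranking: list[str],
    relevant: set[str],
) -> str:
    """Report which retriever ranks a relevant chunk higher for one query."""
    bm25_rr = reciprocal_rank(bm25_ranking, relevant)
    dense_rr = reciprocal_rank(dense_ranking, relevant)
    if bm25_rr > dense_rr:
        return "bm25"
    if dense_rr > bm25_rr:
        return "dense"
    return "tie"
```

Running this over the candidate queries gives a short table of wins per retriever that can go straight into the README's retrieval comparison.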
### LLM Configuration
Recommended free options:
- **Groq**: Fast, free tier with `llama-3.1-8b-instant`
- **OpenRouter**: Multiple model options, some free
### Citation Format
```
Answer: The decay channel measurement shows... [1]. Further analysis using the CLEO detector... [2].
Sources:
[1] "we have studied the leptonic decay..." (arxiv, section: introduction)
[2] "data collected with the CLEO detector..." (arxiv, section: methods)
```
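A small helper could render an answer in the format shown above. This is a sketch under assumptions: each source is a dictionary with `text`, `source`, and `section` keys, and the function name and snippet length are hypothetical.

```python
def format_answer(answer: str, sources: list[dict], snippet_len: int = 60) -> str:
    """Render the answer followed by a numbered list of source snippets."""
    lines = [f"Answer: {answer}", "Sources:"]
    for i, src in enumerate(sources, start=1):
        text = src["text"]
        snippet = text[:snippet_len].rstrip()
        if len(text) > snippet_len:
            snippet += "..."  # mark truncated snippets
        lines.append(f'[{i}] "{snippet}" ({src["source"]}, section: {src["section"]})')
    return "\n".join(lines)
```

The numbering follows the order of `sources`, so it matches the `[1]`, `[2]` markers only if the same ordered list was used to build the prompt context.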
---
## Timeline Suggestion
| Week               | Focus Area                                             |
| ------------------ | ------------------------------------------------------ |
| Week 1 (Dec 9-11)  | Phase 1-2: Setup, Data Loading, Chunking               |
| Week 2 (Dec 12-14) | Phase 3-6: Retrieval, Query Processing, Reranking, LLM |
| Week 3 (Dec 15-16) | Phase 7-9: UI, Deployment, Evaluation & Documentation  |
---
## References
- [Assignment Document](./docs/assignment.md)
- [LLM-Engineers-Handbook](https://github.com/PacktPublishing/LLM-Engineers-Handbook) - Reference architecture
- [Scientific Papers Dataset](https://huggingface.co/datasets/armanc/scientific_papers)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [Sentence-Transformers](https://www.sbert.net/)
- [Gradio Documentation](https://www.gradio.app/docs)