
Daryna Vasylashko

Update task list to reflect completed implementations

79eba02 unverified 2 months ago


	# Scientific RAG - Implementation Tasks

	> Project: Scientific Advanced RAG System
	> Dataset: [armanc/scientific_papers](https://huggingface.co/datasets/armanc/scientific_papers) (ArXiv + PubMed)
	> Deadline: December 16, 2025
	> Reference Architecture: LLM-Engineers-Handbook (Domain-Driven Design)

	---

	## Overview

	Build a Retrieval-Augmented Generation (RAG) system for answering questions about scientific papers. The system will use the `armanc/scientific_papers` dataset containing ~320K papers from ArXiv and PubMed with articles, abstracts, and section names.

	### Dataset Structure

	```python
	{
	    "abstract": "Summary of the paper...",
	    "article": "Full body of the paper, paragraphs separated by \\n...",
	    "section_names": "[sec:introduction]introduction\\n[sec:methods]methods\\n..."
	}
	```

	- arxiv: 203,037 train / 6,436 val / 6,440 test
	- pubmed: 119,924 train / 6,633 val / 6,658 test

	---

	## Project Structure (Target)

	```
	scientific-rag/
	├── pyproject.toml  # Project configuration
	├── Makefile  # Development commands
	├── docker-compose.yaml  # Qdrant infrastructure
	├── .env.dist  # Environment template
	├── README.md  # Documentation
	├── tasks.md  # This file
	├── docs/
	│   └── assignment.md  # Assignment requirements
	├── configs/
	│   └── rag_config.yaml  # RAG pipeline configuration
	├── data/
	│   ├── raw/  # Downloaded dataset cache
	│   └── processed/  # Processed chunks
	├── scientific_rag/  # Main package
	│   ├── __init__.py
	│   ├── settings.py  # Configuration management
	│   ├── domain/  # Core entities
	│   │   ├── __init__.py
	│   │   ├── documents.py  # Document models (Paper, Chunk)
	│   │   ├── queries.py  # Query models
	│   │   └── types.py  # Enums and type definitions
	│   ├── application/  # Business logic
	│   │   ├── __init__.py
	│   │   ├── data_loader.py  # HuggingFace dataset loading
	│   │   ├── chunking/  # Chunking strategies
	│   │   │   ├── __init__.py
	│   │   │   ├── base.py  # Abstract chunker
	│   │   │   └── scientific_chunker.py
	│   │   ├── embeddings/  # Embedding models
	│   │   │   ├── __init__.py
	│   │   │   └── encoder.py  # Sentence-transformers wrapper
	│   │   ├── query_processing/  # Query enhancement
	│   │   │   ├── __init__.py
	│   │   │   ├── query_expansion.py  # Multi-query generation
	│   │   │   └── self_query.py  # Metadata extraction
	│   │   ├── retrieval/  # Retrieval logic
	│   │   │   ├── __init__.py
	│   │   │   ├── bm25_retriever.py
	│   │   │   ├── dense_retriever.py
	│   │   │   └── hybrid_retriever.py
	│   │   ├── reranking/  # Reranker
	│   │   │   ├── __init__.py
	│   │   │   └── cross_encoder.py
	│   │   └── rag/  # RAG pipeline
	│   │       ├── __init__.py
	│   │       ├── pipeline.py  # Main RAG orchestration
	│   │       ├── prompt_templates.py
	│   │       └── llm_client.py  # LiteLLM wrapper
	│   └── infrastructure/  # External integrations
	│       ├── __init__.py
	│       └── qdrant.py  # Qdrant vector database client
	├── demo/  # Gradio/Streamlit UI
	│   ├── __init__.py
	│   └── main.py  # Web interface
	└── tests/
	    ├── __init__.py
	    ├── unit/
	    │   ├── test_chunking.py
	    │   ├── test_retrieval.py
	    │   └── test_reranking.py
	    └── integration/
	        └── test_rag_pipeline.py
	```

	---

	## Implementation Tasks

	### Phase 1: Project Setup & Data Loading

	- [✅] 1.1 Update `pyproject.toml` with project dependencies

	- `datasets` - HuggingFace datasets
	- `sentence-transformers` - Embeddings and cross-encoders
	- `rank-bm25` - BM25 retrieval
	- `qdrant-client` - Vector database client
	- `litellm` - LLM abstraction layer
	- `gradio` or `streamlit` - UI framework
	- `pydantic` - Data validation
	- `pydantic-settings` - Configuration management
	- `loguru` - Logging
	- `numpy`, `scipy` - Numerical operations
	- `tqdm` - Progress bars

	- [✅] 1.2 Create `docker-compose.yaml` for local infrastructure

	- Qdrant vector database service
	- Example:
	```yaml
	services:
	  qdrant:
	    image: qdrant/qdrant:latest
	    ports:
	      - "6333:6333"
	      - "6334:6334"
	    volumes:
	      - qdrant_storage:/qdrant/storage
	volumes:
	  qdrant_storage:
	```
	- Add `make qdrant-up` and `make qdrant-down` commands

	- [✅] 1.3 Create `scientific_rag/settings.py`

	- Environment variable management
	- Model IDs configuration
	- API keys handling (OpenAI, Groq, OpenRouter)
	- Qdrant connection settings (host, port, API key for cloud)
	- Default chunking parameters

	- [✅] 1.4 Create `scientific_rag/domain/` entities

	- `types.py`: Enums for DataSource (ARXIV, PUBMED), SectionType
	- `documents.py`: `ScientificPaper`, `PaperChunk` Pydantic models with metadata
	- `queries.py`: `Query`, `EmbeddedQuery`, `QueryFilters` models
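
A minimal sketch of the enums and the chunk model listed above. It uses a stdlib `dataclass` as a stand-in for the Pydantic models the task names, and the fields on `PaperChunk` are assumptions taken from the metadata listed in task 2.1 (source, section, paper_id, position).

```python
from dataclasses import dataclass
from enum import Enum


class DataSource(str, Enum):
    ARXIV = "arxiv"
    PUBMED = "pubmed"


class SectionType(str, Enum):
    INTRODUCTION = "introduction"
    METHODS = "methods"
    RESULTS = "results"
    CONCLUSION = "conclusion"
    OTHER = "other"


@dataclass(frozen=True)
class PaperChunk:
    """Stand-in for the Pydantic model; field names are assumptions."""

    chunk_id: str
    paper_id: str
    text: str
    source: DataSource
    section: SectionType
    position: int  # index of the chunk within its paper
```

Mixing in `str` keeps the enum values directly comparable to the plain strings stored as chunk metadata.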

	- [✅] 1.5 Implement `scientific_rag/application/data_loader.py`
	- Load `armanc/scientific_papers` from HuggingFace
	- Support both `arxiv` and `pubmed` subsets
	- Configurable sample size for development
	- Progress tracking with tqdm
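
One possible shape for the loader, as a sketch only. `load_subset` and `take_sample` are hypothetical names; `load_dataset` is imported lazily because it needs network access, and depending on the installed `datasets` version a script-based dataset may require extra arguments.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Optional

SUBSETS = {"arxiv", "pubmed"}


def load_subset(subset: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Stream one subset of armanc/scientific_papers from the Hub."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    # Lazy import: only needed when data is actually fetched.
    from datasets import load_dataset

    # Assumption: streaming avoids downloading the full dataset up front.
    # Some `datasets` versions may need e.g. trust_remote_code=True here.
    return load_dataset("armanc/scientific_papers", subset, split=split, streaming=True)


def take_sample(
    records: Iterable[Dict[str, Any]], sample_size: Optional[int]
) -> Iterator[Dict[str, Any]]:
    """Yield at most sample_size records, or all of them if sample_size is None."""
    return islice(records, sample_size)
```

For progress tracking, the returned iterator can be wrapped as `tqdm(take_sample(records, n), total=n)`.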

	### Phase 2: Chunking Strategy

	- [✅] 2.1 Implement `scientific_rag/application/chunking/scientific_chunker.py`

	- Section-aware chunking: Parse `section_names` to identify sections
	- Paragraph-based splitting: Split on `\n` boundaries
	- Overlap strategy: Add overlap between chunks for context
	- Configurable `chunk_size` and `chunk_overlap`
	- Metadata preservation: Store source (arxiv/pubmed), normalized section name, paper_id, position
	- Normalize section names to enum values (introduction, methods, results, conclusion, other)
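
The normalization and paragraph-packing steps above could look roughly like this. The keyword table, the function names, and the choice to measure `chunk_size` in characters and `chunk_overlap` in paragraphs are all assumptions for illustration.

```python
import re
from typing import List

# Hypothetical keyword table; extend it after inspecting real section titles.
SECTION_KEYWORDS = {
    "introduction": ("introduction", "background"),
    "methods": ("method", "materials"),
    "results": ("result", "findings"),
    "conclusion": ("conclusion", "discussion", "summary"),
}


def normalize_section(raw_name: str) -> str:
    """Map a raw section title to one of the five normalized values."""
    # Drop a leading "[sec:...]" label if present, then match on keywords.
    name = re.sub(r"^\[[^\]]*\]", "", raw_name).strip().lower()
    for section, keywords in SECTION_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return section
    return "other"


def chunk_paragraphs(text: str, chunk_size: int = 1000, chunk_overlap: int = 1) -> List[str]:
    """Greedily pack paragraphs into chunks of roughly chunk_size characters.

    The last chunk_overlap paragraphs of a chunk are repeated at the start
    of the next one. A chunk may exceed chunk_size when a single paragraph
    (plus its overlap) is already longer than the limit.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in paragraphs:
        if current and len("\n".join(current + [paragraph])) > chunk_size:
            chunks.append("\n".join(current))
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
        current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Overlapping by whole paragraphs keeps chunk boundaries aligned with the `\n` separators in the dataset; a token-based overlap would be an equally valid choice.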

	- [✅] 2.2 Create processing script to generate chunks
	- Batch processing with progress tracking
	- Save chunks to disk (JSON/Parquet) for reuse
	- Generate unique chunk IDs (hash-based)
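
A sketch of hash-based chunk IDs. It uses `uuid.uuid5` (a SHA-1 based, deterministic UUID) because Qdrant point IDs must be unsigned integers or UUIDs; the namespace value and the `make_chunk_id` name are assumptions.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "scientific-rag.example")


def make_chunk_id(paper_id: str, position: int, text: str) -> str:
    """Return a deterministic UUID string derived from the chunk's identity."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{paper_id}:{position}:{text}"))
```

Because the ID depends only on the inputs, re-running the processing script produces the same IDs, so repeated upserts overwrite points instead of duplicating them.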

	### Phase 3: Retrieval Implementation

	- [✅] 3.1 Create `scientific_rag/application/embeddings/encoder.py`

	- Singleton pattern for embedding model
	- Use `intfloat/e5-small-v2`
	- Batch embedding support
	- GPU/CPU device configuration
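
A possible encoder sketch. E5 models are trained with `query: ` and `passage: ` prefixes on their inputs, which the helper below adds; the module-level cache stands in for the singleton, and the function names are assumptions.

```python
from functools import lru_cache
from typing import List

MODEL_ID = "intfloat/e5-small-v2"


@lru_cache(maxsize=1)
def get_model(device: str = "cpu"):
    """Load the model once per process (a simple stand-in for a singleton)."""
    # Lazy import so the module can be imported without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(MODEL_ID, device=device)


def with_prefix(texts: List[str], is_query: bool) -> List[str]:
    """Prepend the E5 prefix expected for queries or passages."""
    prefix = "query: " if is_query else "passage: "
    return [prefix + text for text in texts]


def embed(texts: List[str], is_query: bool = False, batch_size: int = 32, device: str = "cpu"):
    """Embed texts in batches and return L2-normalized vectors."""
    model = get_model(device)
    return model.encode(
        with_prefix(texts, is_query),
        batch_size=batch_size,
        normalize_embeddings=True,
    )
```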

	- [✅] 3.2 Implement `scientific_rag/infrastructure/qdrant.py`

	- Qdrant client wrapper (local Docker or Qdrant Cloud)
	- Collection creation with proper schema
	- `upsert_chunks(chunks)` - batch insert with embeddings
	- `search(query_vector, filters, k)` - filtered vector search
	- Support for Qdrant filter syntax
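
As an illustration of the filter syntax, the sketch below builds a filter in the JSON form used by Qdrant's REST API. The payload keys `source` and `section` are assumptions about how chunk metadata is stored; the Python client offers equivalent `Filter`, `FieldCondition`, and `MatchValue` models.

```python
from typing import Any, Dict, List, Optional


def build_qdrant_filter(source: str = "any", section: str = "any") -> Optional[Dict[str, Any]]:
    """Return a Qdrant-style filter dict, or None when nothing is restricted."""
    must: List[Dict[str, Any]] = []
    if source != "any":
        must.append({"key": "source", "match": {"value": source}})
    if section != "any":
        must.append({"key": "section", "match": {"value": section}})
    # All conditions under "must" have to hold for a point to match.
    return {"must": must} if must else None
```

Returning `None` for the unrestricted case lets the caller skip the filter argument entirely.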


	- [✅] 3.3 Implement `scientific_rag/application/retrieval/bm25_retriever.py`

	- Use `rank_bm25` library
	- Tokenization with proper preprocessing
	- `search(query, k) -> List[Chunk]` interface
	- Score normalization
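
A sketch of the preprocessing and normalization steps around BM25. The regex keeps dataset placeholders such as `@xmath0` as single tokens; min-max scaling is one of several reasonable normalization choices and is an assumption here. The raw scores are expected to come from something like `BM25Okapi(tokenized_corpus).get_scores(tokenize(query))` in `rank_bm25`.

```python
import re
from typing import List, Sequence

# Words plus an optional leading "@" so tokens like "@xmath0" survive intact.
TOKEN_RE = re.compile(r"@?\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0] * len(scores)
    return [(score - low) / (high - low) for score in scores]
```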

	- [✅] 3.4 Implement `scientific_rag/application/retrieval/dense_retriever.py`

	- Semantic search using Qdrant
	- Integrate with `QdrantClient` from infrastructure
	- Apply metadata filters from self-query
	- `search(query, filters, k) -> List[Chunk]` interface

	- [✅] 3.5 Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
	- Combine BM25 and dense retrieval
	- Pass metadata filters to both retrievers
	- Configurable weights for each method
	- Toggle switches: `use_bm25`, `use_dense`
	- Reciprocal Rank Fusion (RRF) or weighted combination
	- Deduplication of results
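
Reciprocal Rank Fusion can be sketched as follows: each retriever contributes `weight / (k + rank)` for every chunk it returns, and the summed scores define the merged order. The value `k = 60` is a commonly used default; the function name and the use of chunk IDs as strings are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    weights: Optional[Sequence[float]] = None,
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of chunk IDs into one list sorted by RRF score."""
    if weights is None:
        weights = [1.0] * len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        seen = set()
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id in seen:  # ignore repeats within one ranking
                continue
            seen.add(chunk_id)
            scores[chunk_id] += weight / (k + rank)
    # Keying by chunk ID deduplicates results across retrievers.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

RRF works on ranks only, so BM25 and dense scores do not need to be on the same scale; a weighted score combination would need normalized scores instead.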

	### Phase 4: Query Processing & Metadata Filtering

	- [✅] 4.1 Implement `scientific_rag/application/query_processing/self_query.py`

	- Extract metadata filters from natural language queries using rule-based matching
	- Detect source preferences: "arxiv papers about..." → filter to arxiv
	- Detect section preferences: "in the methods section..." → filter to methods
	- Use regex/keyword matching
	- No LLM needed - metadata is already structured in chunks from dataset
	- Return structured `QueryFilters` object
	- Filters are passed to Qdrant for efficient pre-filtering before vector search
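
A rule-based extractor along these lines might look like the sketch below. The patterns are illustrative assumptions, and the dataclass is a stand-in for the Pydantic `QueryFilters` model.

```python
import re
from dataclasses import dataclass


@dataclass
class QueryFilters:
    source: str = "any"
    section: str = "any"


SOURCE_PATTERNS = {
    "arxiv": re.compile(r"\barxiv\b", re.IGNORECASE),
    "pubmed": re.compile(r"\bpubmed\b", re.IGNORECASE),
}
SECTION_PATTERNS = {
    "introduction": re.compile(r"\bintroduction\b", re.IGNORECASE),
    "methods": re.compile(r"\b(methods?|methodology)\b", re.IGNORECASE),
    "results": re.compile(r"\b(results?|findings)\b", re.IGNORECASE),
    "conclusion": re.compile(r"\bconclusions?\b", re.IGNORECASE),
}


def extract_filters(query: str) -> QueryFilters:
    """Set a filter only when exactly one value matches; otherwise keep "any"."""
    filters = QueryFilters()
    sources = [name for name, pattern in SOURCE_PATTERNS.items() if pattern.search(query)]
    if len(sources) == 1:
        filters.source = sources[0]
    sections = [name for name, pattern in SECTION_PATTERNS.items() if pattern.search(query)]
    if len(sections) == 1:
        filters.section = sections[0]
    return filters
```

Plain keyword matching can over-trigger: a question such as "What methods are used for ...?" would be filtered to the methods section, so stricter patterns (for example requiring the word "section") may be preferable.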

	- [✅] 4.2 Implement `scientific_rag/application/query_processing/query_expansion.py`

	- Generate multiple query variations to improve recall
	- Use LLM to create semantically similar queries
	- Configurable `expand_to_n` parameter (default: 3)
	- Example prompt:

	```
	Generate {n} different versions of this question to search a scientific papers database.
	Each version should capture the same intent but use different wording.
	Separate versions with "###"

	Original: {query}
	```

	- Search with all expanded queries, merge results
	- Deduplicate before reranking
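
The prompt above asks the model to separate versions with `###`, so the response can be parsed with a plain split. This sketch shows one way to do that; the function names are assumptions, and the LLM call itself is left out.

```python
from typing import List

EXPANSION_PROMPT = (
    "Generate {n} different versions of this question to search a scientific papers database.\n"
    "Each version should capture the same intent but use different wording.\n"
    'Separate versions with "###"\n\n'
    "Original: {query}"
)


def build_expansion_prompt(query: str, n: int = 3) -> str:
    """Fill the example prompt with the query and the number of variations."""
    return EXPANSION_PROMPT.format(n=n, query=query)


def parse_variations(response: str, original: str, expand_to_n: int = 3) -> List[str]:
    """Split on "###", drop blanks, case-insensitive duplicates, and the original."""
    seen = {original.strip().lower()}
    variations: List[str] = []
    for part in response.split("###"):
        candidate = part.strip()
        if candidate and candidate.lower() not in seen:
            seen.add(candidate.lower())
            variations.append(candidate)
    return variations[:expand_to_n]
```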

	- [✅] 4.3 Update `scientific_rag/domain/queries.py`

	- Add `QueryFilters` model for self-query results
	- Add `ExpandedQuery` model to hold query variations
	- Example:

	```python
	from typing import Literal

	from pydantic import BaseModel


	class QueryFilters(BaseModel):
	    source: Literal["arxiv", "pubmed", "any"] = "any"
	    section: Literal["introduction", "methods", "results", "conclusion", "any"] = "any"


	class ExpandedQuery(BaseModel):
	    original: str
	    variations: list[str]
	    filters: QueryFilters | None = None
	```

	### Phase 5: Reranking

	- [✅] 5.1 Implement `scientific_rag/application/reranking/cross_encoder.py`
	- Use `cross-encoder/ms-marco-MiniLM-L6-v2` (or similar)
	- `rerank(query, chunks, top_k) -> List[Chunk]` interface
	- Batch processing for efficiency
	- Score-based sorting
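
A sketch of the `rerank` interface with the scoring model injected as a callable. In practice `score_pairs` could be the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, passage) pairs; here chunks are plain strings for brevity, which is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    chunks: List[str],
    score_pairs: ScoreFn,
    top_k: int = 5,
    batch_size: int = 32,
) -> List[Tuple[str, float]]:
    """Score (query, chunk) pairs in batches and return the top_k by score."""
    scores: List[float] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start : start + batch_size]
        scores.extend(float(s) for s in score_pairs([(query, chunk) for chunk in batch]))
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Injecting the scorer keeps the unit test in `test_reranking.py` independent of model downloads.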

	### Phase 6: LLM Integration

	- [✅] 6.1 Implement `scientific_rag/application/rag/llm_client.py`

	- LiteLLM wrapper for provider abstraction
	- Support for Groq, OpenRouter, OpenAI
	- Configurable model selection
	- Error handling and retries
	- Response streaming (optional)
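
Error handling and retries could be sketched as a small wrapper with exponential backoff. The `call` argument is any zero-argument function, for example a lambda wrapping `litellm.completion(model=..., messages=[...])`; the function name and default values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def complete_with_retries(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on failure wait base_delay * 2**attempt and try again."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2**attempt)
            attempt += 1
```

A production wrapper would likely retry only on rate-limit and timeout errors instead of every exception.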

	- [✅] 6.2 Create `scientific_rag/application/rag/prompt_templates.py`

	- RAG prompt template with context injection
	- Citation-aware prompting (instruct model to cite sources)
	- System prompt for scientific Q&A
	- Example:

	```
	You are a scientific research assistant. Answer the question based on the provided context.
	Always cite your sources using [1], [2], etc.

	Context:
	[1] {chunk_1}
	[2] {chunk_2}
	...

	Question: {query}

	Answer with citations:
	```
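
The template above can be filled with a small helper such as the following sketch, which numbers the chunks from 1 so that the citation markers in the answer map back to retrieved chunks. The function and constant names are assumptions.

```python
from typing import List

SYSTEM_PROMPT = (
    "You are a scientific research assistant. "
    "Answer the question based on the provided context.\n"
    "Always cite your sources using [1], [2], etc."
)


def build_rag_prompt(query: str, chunks: List[str]) -> str:
    """Inject numbered context chunks and the question into the template."""
    context = "\n".join(f"[{index}] {chunk}" for index, chunk in enumerate(chunks, start=1))
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer with citations:"
    )
```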

	- [✅] 6.3 Implement `scientific_rag/application/rag/pipeline.py`
	- Main `RAGPipeline` class
	- Orchestrate: Query → Self-Query → Query Expansion → Retrieve (with filters) → Rerank → Generate
	- Full pipeline flow:
	```
	1. Self-Query: Extract filters (source, section) for Qdrant
	2. Query Expansion: Generate N query variations
	3. Retrieve: Search with all queries (BM25 + Qdrant with filters)
	4. Merge & Deduplicate: Combine results from all queries
	5. Rerank: Cross-encoder scoring
	6. Generate: LLM with citations
	```
	- Configurable retrieval parameters
	- Toggle for each component: `use_self_query`, `use_query_expansion`, `use_bm25`, `use_dense`, `use_reranking`
	- Citation tracking and formatting
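
The orchestration flow could be sketched with each stage passed in as a callable. All names here are assumptions; the `use_bm25` and `use_dense` toggles are assumed to live inside the `retrieve` callable (the hybrid retriever), so they do not appear in this config.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Chunk = Tuple[str, str]  # (chunk_id, text)


@dataclass
class PipelineConfig:
    use_self_query: bool = True
    use_query_expansion: bool = True
    use_reranking: bool = True
    top_k: int = 5


def run_pipeline(
    query: str,
    config: PipelineConfig,
    *,
    extract_filters: Callable[[str], Any],
    expand: Callable[[str], List[str]],
    retrieve: Callable[[str, Optional[Any]], List[Chunk]],
    rerank: Callable[[str, List[Chunk]], List[Chunk]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Self-query, expand, retrieve, merge, rerank, then generate."""
    filters = extract_filters(query) if config.use_self_query else None
    queries = [query] + (expand(query) if config.use_query_expansion else [])
    merged: Dict[str, str] = {}
    for current_query in queries:
        for chunk_id, text in retrieve(current_query, filters):
            merged.setdefault(chunk_id, text)  # deduplicate by chunk ID
    candidates = list(merged.items())
    if config.use_reranking:
        candidates = rerank(query, candidates)
    top = candidates[: config.top_k]
    return generate(query, [text for _, text in top])
```

Reranking uses the original query, not the expansions, so the final order reflects what the user actually asked.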

	### Phase 7: User Interface

	- [✅] 7.1 Create `demo/main.py` with Gradio

	- Text input for questions
	- API key input field (not stored in code)
	- Dropdown for LLM provider/model selection
	- Dropdown for metadata filters (optional manual override):
	  - Source: Any / ArXiv / PubMed
	  - Section: Any / Introduction / Methods / Results / Conclusion
	- Checkboxes for pipeline components:
	  - [✅] Enable Self-Query (metadata extraction)
	  - [✅] Enable Query Expansion
	  - [✅] Enable BM25
	  - [✅] Enable Dense Retrieval (Qdrant)
	  - [✅] Enable Reranking
	- Slider for top-k parameter
	- Slider for query expansion count (1-5)
	- Output: Answer with citations
	- Expandable section showing retrieved chunks with metadata

	- [✅] 7.2 Add service description

	- Brief explanation of the RAG system
	- Dataset information
	- Usage instructions

	- [✅] 7.3 Style and UX improvements
	- Clear layout
	- Loading indicators
	- Error messages for invalid inputs

	### Phase 8: Deployment

	- [ ] 8.1 Create `requirements.txt` for HuggingFace Spaces

	- Pin versions for reproducibility
	- Note: HF Spaces may need Qdrant Cloud instead of local

	- [ ] 8.2 Create HuggingFace Space configuration

	- `README.md` with YAML frontmatter for Gradio SDK
	- Resource requirements (CPU/memory)
	- Configure Qdrant Cloud connection for deployment

	- [ ] 8.3 Deploy to HuggingFace Spaces
	- Test with sample queries
	- Verify API key handling
	- Verify Qdrant Cloud connectivity

	### Phase 9: Evaluation & Documentation

	- [ ] 9.1 Find queries where BM25 outperforms dense retrieval

	- Queries with specific terminology, rare words, or exact phrases
	- Examples:
	  - "papers mentioning @xmath0 decay channel"
	  - "CLEO detector measurements"

	- [ ] 9.2 Find queries where dense retrieval outperforms BM25

	- Semantic similarity queries
	- Paraphrased questions
	- Examples:
	  - "How do researchers measure particle lifetimes?"
	  - "What methods are used for blood clot prevention?"

	- [ ] 9.3 Demonstrate metadata filtering effectiveness

	- Show queries where filtering by source improves results
	- Show queries where filtering by section improves results
	- Examples:
	  - "arxiv papers about quantum computing" → filter to arxiv
	  - "methodology for clinical trials" → filter to methods section

	- [ ] 9.4 Document the system in README.md

	- Architecture overview
	- Installation instructions (including Docker/Qdrant setup)
	- Usage examples
	- Component descriptions
	- Retrieval comparison findings
	- Metadata filtering examples

	- [ ] 9.5 Prepare submission materials
	- Source code link
	- Deployed service link
	- Component checklist (per assignment requirements)

	---

	## Optional Enhancements (Bonus Points)

	### Citation Enhancement

	- [ ] B.1 Improve citation formatting
	- Parse and display chunk source information
	- Show paper abstract or section name
	- Link citations to source documents

	### Performance Optimization

	- [ ] B.2 Add caching layer

	- Cache embeddings
	- Cache LLM responses for identical queries

	- [ ] B.3 Optimize for larger dataset
	- FAISS index for fast similarity search
	- Batch processing improvements

	---

	## Dependencies Summary

	```toml
	[project]
	name = "scientific-rag"
	version = "0.1.0"
	description = "Scientific Papers RAG System"
	requires-python = ">=3.11"

	dependencies = [
	# Data
	"datasets>=3.0.0",
	"huggingface-hub>=0.20.0",

	# ML/Embeddings
	"sentence-transformers>=3.0.0",
	"torch>=2.0.0",
	"numpy>=1.26.0",
	"scipy>=1.11.0",

	# Retrieval
	"rank-bm25>=0.2.2",
	"qdrant-client>=1.8.0",

	# LLM
	"litellm>=1.0.0",

	# Configuration
	"pydantic>=2.0.0",
	"pydantic-settings>=2.0.0",

	# UI
	"gradio>=4.0.0",

	# Utilities
	"loguru>=0.7.0",
	"tqdm>=4.65.0",
	"python-dotenv>=1.0.0",
	]

	[dependency-groups]
	dev = [
	"pytest>=8.0.0",
	"ruff>=0.4.0",
	"mypy>=1.10.0",
	"pre-commit>=3.0.0",
	"ipykernel>=6.0.0",
	]
	```

	---

	## Quick Start Commands

	```bash
	# Setup
	make install

	# Run locally
	make run-app

	# Run tests
	make test

	# Lint
	make lint

	# Format
	make format
	```

	---

	## Key Implementation Notes

	### Chunking Strategy

	For scientific papers, consider:

	1. Section-based chunking: Split by sections first, then by size
	2. Preserve context: Include section title in each chunk
	3. Handle LaTeX: Math expressions appear in the papers as placeholder tokens such as `@xmath0`

	### Retrieval Comparison

	Document specific queries that demonstrate:

	- BM25 strength: Exact term matching, rare terminology
	- Dense strength: Semantic understanding, paraphrased queries

	### LLM Configuration

	Recommended free options:

	- Groq: Fast, free tier with `llama-3.1-8b-instant`
	- OpenRouter: Multiple model options, some free

	### Citation Format

	```
	Answer: The decay channel measurement shows... [1]. Further analysis using the CLEO detector... [2].

	Sources:
	[1] "we have studied the leptonic decay..." (arxiv, section: introduction)
	[2] "data collected with the CLEO detector..." (arxiv, section: methods)
	```
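
The `Sources:` block in the format above could be produced by a helper like this sketch. The dictionary keys (`text`, `source`, `section`) and the snippet length are assumptions.

```python
from typing import Dict, List


def format_sources(chunks: List[Dict[str, str]], max_chars: int = 60) -> str:
    """Render numbered source lines with a quoted snippet and chunk metadata."""
    lines = ["Sources:"]
    for index, chunk in enumerate(chunks, start=1):
        text = chunk["text"]
        snippet = text[:max_chars].rstrip()
        suffix = "..." if len(text) > max_chars else ""
        lines.append(
            f'[{index}] "{snippet}{suffix}" ({chunk["source"]}, section: {chunk["section"]})'
        )
    return "\n".join(lines)
```

The numbering must follow the same chunk order used when building the prompt, otherwise the `[n]` markers in the answer point at the wrong source.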

	---

	## Timeline Suggestion

	| Week               | Focus Area                                             |
	| ------------------ | ------------------------------------------------------ |
	| Week 1 (Dec 9-11)  | Phase 1-2: Setup, Data Loading, Chunking               |
	| Week 2 (Dec 12-14) | Phase 3-6: Retrieval, Query Processing, Reranking, LLM |
	| Week 3 (Dec 15-16) | Phase 7-9: UI, Deployment, Evaluation & Documentation  |

	---

	## References

	- [Assignment Document](./docs/assignment.md)
	- [LLM-Engineers-Handbook](https://github.com/PacktPublishing/LLM-Engineers-Handbook) - Reference architecture
	- [Scientific Papers Dataset](https://huggingface.co/datasets/armanc/scientific_papers)
	- [LiteLLM Documentation](https://docs.litellm.ai/)
	- [Sentence-Transformers](https://www.sbert.net/)
	- [Gradio Documentation](https://www.gradio.app/docs)