# Scientific RAG - Implementation Tasks

**Project:** Scientific Advanced RAG System
**Dataset:** `armanc/scientific_papers` (ArXiv + PubMed)
**Deadline:** December 16, 2025
**Reference Architecture:** LLM-Engineers-Handbook (Domain-Driven Design)
## Overview
Build a Retrieval-Augmented Generation (RAG) system for answering questions about scientific papers. The system will use the armanc/scientific_papers dataset containing ~320K papers from ArXiv and PubMed with articles, abstracts, and section names.
## Dataset Structure

```json
{
  "abstract": "Summary of the paper...",
  "article": "Full body of the paper, paragraphs separated by \\n...",
  "section_names": "[sec:introduction]introduction\\n[sec:methods]methods\\n..."
}
```

- `arxiv`: 203,037 train / 6,436 val / 6,440 test
- `pubmed`: 119,924 train / 6,633 val / 6,658 test
## Project Structure (Target)

```
scientific-rag/
├── pyproject.toml            # Project configuration
├── Makefile                  # Development commands
├── docker-compose.yaml       # Qdrant infrastructure
├── .env.dist                 # Environment template
├── README.md                 # Documentation
├── tasks.md                  # This file
├── docs/
│   └── assignment.md         # Assignment requirements
├── configs/
│   └── rag_config.yaml       # RAG pipeline configuration
├── data/
│   ├── raw/                  # Downloaded dataset cache
│   └── processed/            # Processed chunks
├── scientific_rag/           # Main package
│   ├── __init__.py
│   ├── settings.py           # Configuration management
│   ├── domain/               # Core entities
│   │   ├── __init__.py
│   │   ├── documents.py      # Document models (Paper, Chunk)
│   │   ├── queries.py        # Query models
│   │   └── types.py          # Enums and type definitions
│   ├── application/          # Business logic
│   │   ├── __init__.py
│   │   ├── data_loader.py    # HuggingFace dataset loading
│   │   ├── chunking/         # Chunking strategies
│   │   │   ├── __init__.py
│   │   │   ├── base.py       # Abstract chunker
│   │   │   └── scientific_chunker.py
│   │   ├── embeddings/       # Embedding models
│   │   │   ├── __init__.py
│   │   │   └── encoder.py    # Sentence-transformers wrapper
│   │   ├── query_processing/ # Query enhancement
│   │   │   ├── __init__.py
│   │   │   ├── query_expansion.py  # Multi-query generation
│   │   │   └── self_query.py       # Metadata extraction
│   │   ├── retrieval/        # Retrieval logic
│   │   │   ├── __init__.py
│   │   │   ├── bm25_retriever.py
│   │   │   ├── dense_retriever.py
│   │   │   └── hybrid_retriever.py
│   │   ├── reranking/        # Reranker
│   │   │   ├── __init__.py
│   │   │   └── cross_encoder.py
│   │   └── rag/              # RAG pipeline
│   │       ├── __init__.py
│   │       ├── pipeline.py   # Main RAG orchestration
│   │       ├── prompt_templates.py
│   │       └── llm_client.py # LiteLLM wrapper
│   └── infrastructure/       # External integrations
│       ├── __init__.py
│       └── qdrant.py         # Qdrant vector database client
├── demo/                     # Gradio/Streamlit UI
│   ├── __init__.py
│   └── main.py               # Web interface
└── tests/
    ├── __init__.py
    ├── unit/
    │   ├── test_chunking.py
    │   ├── test_retrieval.py
    │   └── test_reranking.py
    └── integration/
        └── test_rag_pipeline.py
```
## Implementation Tasks
### Phase 1: Project Setup & Data Loading
- [ ] 1.1 Update `pyproject.toml` with project dependencies
  - `datasets` - HuggingFace datasets
  - `sentence-transformers` - Embeddings and cross-encoders
  - `rank-bm25` - BM25 retrieval
  - `qdrant-client` - Vector database client
  - `litellm` - LLM abstraction layer
  - `gradio` or `streamlit` - UI framework
  - `pydantic` - Data validation
  - `pydantic-settings` - Configuration management
  - `loguru` - Logging
  - `numpy`, `scipy` - Numerical operations
  - `tqdm` - Progress bars
- [ ] 1.2 Create `docker-compose.yaml` for local infrastructure
  - Qdrant vector database service
  - Example:
    ```yaml
    services:
      qdrant:
        image: qdrant/qdrant:latest
        ports:
          - "6333:6333"
          - "6334:6334"
        volumes:
          - qdrant_storage:/qdrant/storage
    volumes:
      qdrant_storage:
    ```
  - Add `make qdrant-up` and `make qdrant-down` commands
- [ ] 1.3 Create `scientific_rag/settings.py`
  - Environment variable management
  - Model ID configuration
  - API key handling (OpenAI, Groq, OpenRouter)
  - Qdrant connection settings (host, port, API key for cloud)
  - Default chunking parameters
- [ ] 1.4 Create `scientific_rag/domain/` entities
  - `types.py`: enums for `DataSource` (ARXIV, PUBMED) and `SectionType`
  - `documents.py`: `ScientificPaper` and `PaperChunk` Pydantic models with metadata
  - `queries.py`: `Query`, `EmbeddedQuery`, and `QueryFilters` models
- [ ] 1.5 Implement `scientific_rag/application/data_loader.py`
  - Load `armanc/scientific_papers` from HuggingFace
  - Support both `arxiv` and `pubmed` subsets
  - Configurable sample size for development
  - Progress tracking with tqdm
### Phase 2: Chunking Strategy
- [ ] 2.1 Implement `scientific_rag/application/chunking/scientific_chunker.py`
  - Section-aware chunking: parse `section_names` to identify sections
  - Paragraph-based splitting: split on `\n` boundaries
  - Overlap strategy: add overlap between chunks for context
  - Configurable `chunk_size` and `chunk_overlap`
  - Metadata preservation: store source (arxiv/pubmed), normalized section name, paper_id, position
  - Normalize section names to enum values (introduction, methods, results, conclusion, other)
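The section parsing, normalization, and overlap logic above can be sketched in plain Python. Function names are illustrative, and character-based `chunk_size` is a simplifying assumption (token-based sizing works the same way):

```python
import re

# Target enum values for normalized section names.
KNOWN_SECTIONS = {"introduction", "methods", "results", "conclusion"}

def parse_section_names(section_names: str) -> list[str]:
    """Extract section titles from the dataset's section_names string,
    e.g. "[sec:introduction]introduction\\n[sec:methods]methods"."""
    titles = []
    for line in section_names.split("\n"):
        # Drop the optional "[sec:...]" label prefix, keep the title.
        titles.append(re.sub(r"^\[sec:[^\]]*\]", "", line).strip())
    return titles

def normalize_section(title: str) -> str:
    """Map a raw section title onto the enum values used for filtering."""
    lowered = title.lower()
    for known in KNOWN_SECTIONS:
        if known in lowered:
            return known
    return "other"

def chunk_paragraphs(paragraphs: list[str], chunk_size: int = 1000,
                     overlap: int = 1) -> list[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters,
    carrying `overlap` trailing paragraphs into the next chunk for context."""
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for para in paragraphs:
        if current and length + len(para) > chunk_size:
            chunks.append("\n".join(current))
            current = current[-overlap:] if overlap else []
            length = sum(len(p) for p in current)
        current.append(para)
        length += len(para)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

With `overlap=1`, each chunk repeats the last paragraph of its predecessor, so a sentence split across a chunk boundary is still retrievable from at least one chunk.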
- [ ] 2.2 Create a processing script to generate chunks
  - Batch processing with progress tracking
  - Save chunks to disk (JSON/Parquet) for reuse
  - Generate unique chunk IDs (hash-based)
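One way to produce the hash-based chunk IDs is to hash the chunk's stable coordinates (source, paper ID, position) rather than its text, so re-running the pipeline after minor text cleaning yields the same IDs. A minimal sketch, with an illustrative function name:

```python
import hashlib

def chunk_id(paper_id: str, source: str, position: int) -> str:
    """Deterministic chunk ID: the same paper/source/position always hashes
    to the same ID, making ingestion idempotent."""
    key = f"{source}:{paper_id}:{position}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]
```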
### Phase 3: Retrieval Implementation
- [ ] 3.1 Create `scientific_rag/application/embeddings/encoder.py`
  - Singleton pattern for the embedding model
  - Use `intfloat/e5-small-v2`
  - Batch embedding support
  - GPU/CPU device configuration
- [ ] 3.2 Implement `scientific_rag/infrastructure/qdrant.py`
  - Qdrant client wrapper (local Docker or Qdrant Cloud)
  - Collection creation with a proper schema
  - `upsert_chunks(chunks)` - batch insert with embeddings
  - `search(query_vector, filters, k)` - filtered vector search
  - Support for the Qdrant filter syntax
- [ ] 3.3 Implement `scientific_rag/application/retrieval/bm25_retriever.py`
  - Use the `rank_bm25` library
  - Tokenization with proper preprocessing
  - `search(query, k) -> List[Chunk]` interface
  - Score normalization
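The score-normalization step might look like this min-max sketch. BM25 scores are unbounded, so rescaling to [0, 1] is one way to make them comparable with cosine similarities from the dense retriever:

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale raw BM25 scores to [0, 1] so they can be combined with
    dense-retrieval similarities on a comparable scale."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal: avoid division by zero
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```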
- [ ] 3.4 Implement `scientific_rag/application/retrieval/dense_retriever.py`
  - Semantic search using Qdrant
  - Integrate with the `QdrantClient` wrapper from infrastructure
  - Apply metadata filters from self-query
  - `search(query, filters, k) -> List[Chunk]` interface
- [ ] 3.5 Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
  - Combine BM25 and dense retrieval
  - Pass metadata filters to both retrievers
  - Configurable weights for each method
  - Toggle switches: `use_bm25`, `use_dense`
  - Reciprocal Rank Fusion (RRF) or a weighted combination
  - Deduplication of results
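Reciprocal Rank Fusion is straightforward to implement over ranked lists of chunk IDs; a minimal sketch (`k=60` is the conventional default from the original RRF formulation):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs. Each appearance at rank r
    contributes 1 / (k + r); higher totals rank first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    # Sorting the dict keys also deduplicates: each ID appears once.
    return sorted(scores, key=scores.get, reverse=True)
```

Because each ID is keyed once in `scores`, fusion and deduplication happen in a single pass, and chunks found by both retrievers naturally rise to the top.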
### Phase 4: Query Processing & Metadata Filtering
- [ ] 4.1 Implement `scientific_rag/application/query_processing/self_query.py`
  - Extract metadata filters from natural-language queries using rule-based matching
  - Detect source preferences: "arxiv papers about..." → filter to arxiv
  - Detect section preferences: "in the methods section..." → filter to methods
  - Use regex/keyword matching; no LLM needed, since the metadata is already structured in the chunks
  - Return a structured `QueryFilters` object
  - Pass filters to Qdrant for efficient pre-filtering before vector search
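A rule-based extractor along these lines can be as simple as a few compiled patterns; the exact keyword lists below are illustrative and would grow with observed queries:

```python
import re

SOURCE_PATTERNS = {
    "arxiv": re.compile(r"\barxiv\b", re.IGNORECASE),
    "pubmed": re.compile(r"\bpubmed\b", re.IGNORECASE),
}
SECTION_PATTERNS = {
    "methods": re.compile(r"\b(methods?|methodology)\b", re.IGNORECASE),
    "results": re.compile(r"\bresults?\b", re.IGNORECASE),
    "introduction": re.compile(r"\b(introduction|intro)\b", re.IGNORECASE),
    "conclusion": re.compile(r"\bconclusions?\b", re.IGNORECASE),
}

def extract_filters(query: str) -> dict[str, str]:
    """Rule-based self-query: map keywords in the question to metadata
    filters, defaulting to "any" when no pattern matches."""
    filters = {"source": "any", "section": "any"}
    for source, pattern in SOURCE_PATTERNS.items():
        if pattern.search(query):
            filters["source"] = source
            break
    for section, pattern in SECTION_PATTERNS.items():
        if pattern.search(query):
            filters["section"] = section
            break
    return filters
```

The returned dict maps directly onto the `QueryFilters` model and onto Qdrant payload filters.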
- [ ] 4.2 Implement `scientific_rag/application/query_processing/query_expansion.py`
  - Generate multiple query variations to improve recall
  - Use an LLM to create semantically similar queries
  - Configurable `expand_to_n` parameter (default: 3)
  - Example prompt:
    ```text
    Generate {n} different versions of this question to search a scientific
    papers database. Each version should capture the same intent but use
    different wording. Separate versions with "###"

    Original: {query}
    ```
  - Search with all expanded queries, merge results
  - Deduplicate before reranking
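Parsing the LLM's "###"-separated output back into a deduplicated query list might look like the sketch below; the function name and the convention that the original query stays first are assumptions:

```python
def parse_expansions(llm_output: str, original: str, n: int) -> list[str]:
    """Split the LLM's '###'-separated output into at most n variations,
    keeping the original query first and dropping duplicates and blanks."""
    queries = [original]
    for candidate in llm_output.split("###"):
        cleaned = candidate.strip()
        if cleaned and cleaned.lower() != original.lower() and cleaned not in queries:
            queries.append(cleaned)
        if len(queries) >= n + 1:  # original + n variations
            break
    return queries
```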
- [ ] 4.3 Update `scientific_rag/domain/queries.py`
  - Add a `QueryFilters` model for self-query results
  - Add an `ExpandedQuery` model to hold query variations
  - Example:
    ```python
    class QueryFilters(BaseModel):
        source: Literal["arxiv", "pubmed", "any"] = "any"
        section: Literal["introduction", "methods", "results", "conclusion", "any"] = "any"

    class ExpandedQuery(BaseModel):
        original: str
        variations: list[str]
        filters: QueryFilters | None = None
    ```
### Phase 5: Reranking

- [ ] 5.1 Implement `scientific_rag/application/reranking/cross_encoder.py`
  - Use `cross-encoder/ms-marco-MiniLM-L6-v2` (or similar)
  - `rerank(query, chunks, top_k) -> List[Chunk]` interface
  - Batch processing for efficiency
  - Score-based sorting
### Phase 6: LLM Integration
- [ ] 6.1 Implement `scientific_rag/application/rag/llm_client.py`
  - LiteLLM wrapper for provider abstraction
  - Support for Groq, OpenRouter, OpenAI
  - Configurable model selection
  - Error handling and retries
  - Response streaming (optional)
- [ ] 6.2 Create `scientific_rag/application/rag/prompt_templates.py`
  - RAG prompt template with context injection
  - Citation-aware prompting (instruct the model to cite sources)
  - System prompt for scientific Q&A
  - Example:
    ```text
    You are a scientific research assistant. Answer the question based on
    the provided context. Always cite your sources using [1], [2], etc.

    Context:
    [1] {chunk_1}
    [2] {chunk_2}
    ...

    Question: {query}

    Answer with citations:
    ```
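Context injection with numbered citations can be sketched as a small template function (names are illustrative):

```python
SYSTEM_PROMPT = (
    "You are a scientific research assistant. Answer the question based on "
    "the provided context. Always cite your sources using [1], [2], etc."
)

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Inject numbered chunks into the user prompt so the model can
    refer back to them as [1], [2], ... in its answer."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer with citations:"
```

Keeping the numbering in one place means the same index order can be reused when rendering the Sources list under the answer.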
- [ ] 6.3 Implement `scientific_rag/application/rag/pipeline.py`
  - Main `RAGPipeline` class
  - Orchestrate: Query → Self-Query → Query Expansion → Retrieve (with filters) → Rerank → Generate
  - Full pipeline flow:
    1. Self-Query: extract filters (source, section) for Qdrant
    2. Query Expansion: generate N query variations
    3. Retrieve: search with all queries (BM25 + Qdrant with filters)
    4. Merge & Deduplicate: combine results from all queries
    5. Rerank: cross-encoder scoring
    6. Generate: LLM with citations
  - Configurable retrieval parameters
  - Toggle for each component: `use_self_query`, `use_query_expansion`, `use_bm25`, `use_dense`, `use_reranking`
  - Citation tracking and formatting
### Phase 7: User Interface
- [ ] 7.1 Create `demo/main.py` with Gradio
  - Text input for questions
  - API key input field (not stored in code)
  - Dropdown for LLM provider/model selection
  - Dropdowns for metadata filters (optional manual override):
    - Source: Any / ArXiv / PubMed
    - Section: Any / Introduction / Methods / Results / Conclusion
  - Checkboxes for pipeline components:
    - Enable Self-Query (metadata extraction)
    - Enable Query Expansion
    - Enable BM25
    - Enable Dense Retrieval (Qdrant)
    - Enable Reranking
  - Slider for the top-k parameter
  - Slider for the query expansion count (1-5)
  - Output: answer with citations
  - Expandable section showing retrieved chunks with metadata
- [ ] 7.2 Add a service description
  - Brief explanation of the RAG system
  - Dataset information
  - Usage instructions
- [ ] 7.3 Style and UX improvements
  - Clear layout
  - Loading indicators
  - Error messages for invalid inputs
### Phase 8: Deployment
- [ ] 8.1 Create `requirements.txt` for HuggingFace Spaces
  - Pin versions for reproducibility
  - Note: HF Spaces may need Qdrant Cloud instead of a local instance
- [ ] 8.2 Create the HuggingFace Space configuration
  - `README.md` with YAML frontmatter for the Gradio SDK
  - Resource requirements (CPU/memory)
  - Configure the Qdrant Cloud connection for deployment
- [ ] 8.3 Deploy to HuggingFace Spaces
  - Test with sample queries
  - Verify API key handling
  - Verify Qdrant Cloud connectivity
### Phase 9: Evaluation & Documentation
- [ ] 9.1 Find queries where BM25 outperforms dense retrieval
  - Queries with specific terminology, rare words, or exact phrases
  - Examples:
    - "papers mentioning @xmath0 decay channel"
    - "CLEO detector measurements"
- [ ] 9.2 Find queries where dense retrieval outperforms BM25
  - Semantic similarity queries
  - Paraphrased questions
  - Examples:
    - "How do researchers measure particle lifetimes?"
    - "What methods are used for blood clot prevention?"
- [ ] 9.3 Demonstrate metadata-filtering effectiveness
  - Show queries where filtering by source improves results
  - Show queries where filtering by section improves results
  - Examples:
    - "arxiv papers about quantum computing" → filter to arxiv
    - "methodology for clinical trials" → filter to the methods section
- [ ] 9.4 Document the system in README.md
  - Architecture overview
  - Installation instructions (including Docker/Qdrant setup)
  - Usage examples
  - Component descriptions
  - Retrieval comparison findings
  - Metadata filtering examples
- [ ] 9.5 Prepare submission materials
  - Source code link
  - Deployed service link
  - Component checklist (per assignment requirements)
## Optional Enhancements (Bonus Points)
### Citation Enhancement

- [ ] B.1 Improve citation formatting
  - Parse and display chunk source information
  - Show the paper abstract or section name
  - Link citations to source documents
### Performance Optimization

- [ ] B.2 Add a caching layer
  - Cache embeddings
  - Cache LLM responses for identical queries
- [ ] B.3 Optimize for a larger dataset
  - FAISS index for fast similarity search
  - Batch-processing improvements
## Dependencies Summary

```toml
[project]
name = "scientific-rag"
version = "0.1.0"
description = "Scientific Papers RAG System"
requires-python = ">=3.11"
dependencies = [
    # Data
    "datasets>=3.0.0",
    "huggingface-hub>=0.20.0",
    # ML/Embeddings
    "sentence-transformers>=3.0.0",
    "torch>=2.0.0",
    "numpy>=1.26.0",
    "scipy>=1.11.0",
    # Retrieval
    "rank-bm25>=0.2.2",
    "qdrant-client>=1.8.0",
    # LLM
    "litellm>=1.0.0",
    # Configuration
    "pydantic>=2.0.0",
    "pydantic-settings>=2.0.0",
    # UI
    "gradio>=4.0.0",
    # Utilities
    "loguru>=0.7.0",
    "tqdm>=4.65.0",
    "python-dotenv>=1.0.0",
]

[dependency-groups]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.4.0",
    "mypy>=1.10.0",
    "pre-commit>=3.0.0",
    "ipykernel>=6.0.0",
]
```
## Quick Start Commands

```shell
# Setup
make install

# Run locally
make run-app

# Run tests
make test

# Lint
make lint

# Format
make format
```
## Key Implementation Notes

### Chunking Strategy

For scientific papers, consider:
- Section-based chunking: split by sections first, then by size
- Preserve context: include the section title in each chunk
- Handle LaTeX: papers contain `@xmath` tokens standing in for math expressions
### Retrieval Comparison
Document specific queries that demonstrate:
- BM25 strength: Exact term matching, rare terminology
- Dense strength: Semantic understanding, paraphrased queries
### LLM Configuration

Recommended free options:
- Groq: fast, free tier with `llama-3.1-8b-instant`
- OpenRouter: multiple model options, some free
### Citation Format

```text
Answer: The decay channel measurement shows... [1]. Further analysis using the CLEO detector... [2].

Sources:
[1] "we have studied the leptonic decay..." (arxiv, section: introduction)
[2] "data collected with the CLEO detector..." (arxiv, section: methods)
```
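Producing this format from retrieved chunks is a small formatting step. A sketch, assuming each chunk dict carries `text`, `source`, and `section` keys (the key names are assumptions matching the chunk metadata described above):

```python
def format_citations(answer: str, chunks: list[dict]) -> str:
    """Append a numbered Sources list (text snippet plus metadata) to the
    LLM answer, matching the [n] markers the model was instructed to use."""
    lines = [answer, "", "Sources:"]
    for i, chunk in enumerate(chunks, start=1):
        snippet = chunk["text"][:60]
        lines.append(
            f'[{i}] "{snippet}..." ({chunk["source"]}, section: {chunk["section"]})'
        )
    return "\n".join(lines)
```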
## Timeline Suggestion
| Week | Focus Area |
|---|---|
| Week 1 (Dec 9-11) | Phase 1-2: Setup, Data Loading, Chunking |
| Week 2 (Dec 12-14) | Phase 3-5: Retrieval, Reranking, LLM |
| Week 3 (Dec 15-16) | Phase 6-8: UI, Deployment, Documentation |