Daryna Vasylashko

Scientific RAG - Implementation Tasks

Project: Scientific Advanced RAG System
Dataset: armanc/scientific_papers (ArXiv + PubMed)
Deadline: December 16, 2025
Reference Architecture: LLM-Engineers-Handbook (Domain-Driven Design)


Overview

Build a Retrieval-Augmented Generation (RAG) system for answering questions about scientific papers. The system will use the armanc/scientific_papers dataset, which contains roughly 320K training papers from ArXiv and PubMed (about 349K across all splits). Each record holds the article body, its abstract, and its section names.

Dataset Structure

{
    "abstract": "Summary of the paper...",
    "article": "Full body of the paper, paragraphs separated by \\n...",
    "section_names": "[sec:introduction]introduction\\n[sec:methods]methods\\n..."
}
  • arxiv: 203,037 train / 6,436 val / 6,440 test
  • pubmed: 119,924 train / 6,633 val / 6,658 test

Project Structure (Target)

scientific-rag/
├── pyproject.toml              # Project configuration
├── Makefile                    # Development commands
├── docker-compose.yaml         # Qdrant infrastructure
├── .env.dist                   # Environment template
├── README.md                   # Documentation
├── tasks.md                    # This file
├── docs/
│   └── assignment.md           # Assignment requirements
├── configs/
│   └── rag_config.yaml         # RAG pipeline configuration
├── data/
│   ├── raw/                    # Downloaded dataset cache
│   └── processed/              # Processed chunks
├── scientific_rag/             # Main package
│   ├── __init__.py
│   ├── settings.py             # Configuration management
│   ├── domain/                 # Core entities
│   │   ├── __init__.py
│   │   ├── documents.py        # Document models (Paper, Chunk)
│   │   ├── queries.py          # Query models
│   │   └── types.py            # Enums and type definitions
│   ├── application/            # Business logic
│   │   ├── __init__.py
│   │   ├── data_loader.py      # HuggingFace dataset loading
│   │   ├── chunking/           # Chunking strategies
│   │   │   ├── __init__.py
│   │   │   ├── base.py         # Abstract chunker
│   │   │   └── scientific_chunker.py
│   │   ├── embeddings/         # Embedding models
│   │   │   ├── __init__.py
│   │   │   └── encoder.py      # Sentence-transformers wrapper
│   │   ├── query_processing/   # Query enhancement
│   │   │   ├── __init__.py
│   │   │   ├── query_expansion.py   # Multi-query generation
│   │   │   └── self_query.py        # Metadata extraction
│   │   ├── retrieval/          # Retrieval logic
│   │   │   ├── __init__.py
│   │   │   ├── bm25_retriever.py
│   │   │   ├── dense_retriever.py
│   │   │   └── hybrid_retriever.py
│   │   ├── reranking/          # Reranker
│   │   │   ├── __init__.py
│   │   │   └── cross_encoder.py
│   │   └── rag/                # RAG pipeline
│   │       ├── __init__.py
│   │       ├── pipeline.py     # Main RAG orchestration
│   │       ├── prompt_templates.py
│   │       └── llm_client.py   # LiteLLM wrapper
│   └── infrastructure/         # External integrations
│       ├── __init__.py
│       └── qdrant.py           # Qdrant vector database client
├── demo/                       # Gradio/Streamlit UI
│   ├── __init__.py
│   └── main.py                 # Web interface
└── tests/
    ├── __init__.py
    ├── unit/
    │   ├── test_chunking.py
    │   ├── test_retrieval.py
    │   └── test_reranking.py
    └── integration/
        └── test_rag_pipeline.py

Implementation Tasks

Phase 1: Project Setup & Data Loading

  • [✅] 1.1 Update pyproject.toml with project dependencies

    • datasets - HuggingFace datasets
    • sentence-transformers - Embeddings and cross-encoders
    • rank-bm25 - BM25 retrieval
    • qdrant-client - Vector database client
    • litellm - LLM abstraction layer
    • gradio or streamlit - UI framework
    • pydantic - Data validation
    • pydantic-settings - Configuration management
    • loguru - Logging
    • numpy, scipy - Numerical operations
    • tqdm - Progress bars
  • [✅] 1.2 Create docker-compose.yaml for local infrastructure

    • Qdrant vector database service
    • Example:
      services:
        qdrant:
          image: qdrant/qdrant:latest
          ports:
            - "6333:6333"
            - "6334:6334"
          volumes:
            - qdrant_storage:/qdrant/storage
      volumes:
        qdrant_storage:
      
    • Add make qdrant-up and make qdrant-down commands
  • [✅] 1.3 Create scientific_rag/settings.py

    • Environment variable management
    • Model IDs configuration
    • API keys handling (OpenAI, Groq, OpenRouter)
    • Qdrant connection settings (host, port, API key for cloud)
    • Default chunking parameters
  • [✅] 1.4 Create scientific_rag/domain/ entities

    • types.py: Enums for DataSource (ARXIV, PUBMED), SectionType
    • documents.py: ScientificPaper, PaperChunk Pydantic models with metadata
    • queries.py: Query, EmbeddedQuery, QueryFilters models
  • [✅] 1.5 Implement scientific_rag/application/data_loader.py

    • Load armanc/scientific_papers from HuggingFace
    • Support both arxiv and pubmed subsets
    • Configurable sample size for development
    • Progress tracking with tqdm

Phase 2: Chunking Strategy

  • [✅] 2.1 Implement scientific_rag/application/chunking/scientific_chunker.py

    • Section-aware chunking: Parse section_names to identify sections
    • Paragraph-based splitting: Split on \n boundaries
    • Overlap strategy: Add overlap between chunks for context
    • Configurable chunk_size and chunk_overlap
    • Metadata preservation: Store source (arxiv/pubmed), normalized section name, paper_id, position
    • Normalize section names to enum values (introduction, methods, results, conclusion, other)
  • [✅] 2.2 Create processing script to generate chunks

    • Batch processing with progress tracking
    • Save chunks to disk (JSON/Parquet) for reuse
    • Generate unique chunk IDs (hash-based)
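
The chunking tasks above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual chunker: the keyword table, the function names, and the choice to measure chunk_size in characters and chunk_overlap in paragraphs are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical keyword table for mapping raw section titles to enum values.
SECTION_KEYWORDS = {
    "introduction": ("introduction", "background"),
    "methods": ("method", "materials"),
    "results": ("result", "findings"),
    "conclusion": ("conclusion", "concluding"),
}


def normalize_section(raw_name: str) -> str:
    """Map a raw section title to introduction/methods/results/conclusion/other."""
    # Drop a leading "[sec:...]" label if present, as in the dataset example.
    name = re.sub(r"^\[[^\]]*\]", "", raw_name).strip().lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return label
    return "other"


def chunk_paragraphs(text: str, chunk_size: int = 1000, chunk_overlap: int = 1) -> list:
    """Pack newline-separated paragraphs into chunks of about chunk_size characters.

    chunk_overlap is the number of trailing paragraphs repeated at the start
    of the next chunk (an assumption; overlap could also be token-based).
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks = []
    current = []
    for paragraph in paragraphs:
        candidate_len = sum(len(p) for p in current) + len(paragraph)
        if current and candidate_len > chunk_size:
            chunks.append("\n".join(current))
            # Carry the last paragraphs over so neighbouring chunks share context.
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
        current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks


def make_chunk_id(paper_id: str, position: int, text: str) -> str:
    """Derive a stable, hash-based ID from the paper, position, and content."""
    payload = f"{paper_id}:{position}:{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

Because the ID depends only on the paper, position, and text, re-running the processing script on unchanged input yields the same IDs, which keeps repeated upserts idempotent.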

Phase 3: Retrieval Implementation

  • [✅] 3.1 Create scientific_rag/application/embeddings/encoder.py

    • Singleton pattern for embedding model
    • Use intfloat/e5-small-v2
    • Batch embedding support
    • GPU/CPU device configuration
  • [✅] 3.2 Implement scientific_rag/infrastructure/qdrant.py

    • Qdrant client wrapper (local Docker or Qdrant Cloud)
    • Collection creation with proper schema
    • upsert_chunks(chunks) - batch insert with embeddings
    • search(query_vector, filters, k) - filtered vector search
    • Support for Qdrant filter syntax
  • [✅] 3.3 Implement scientific_rag/application/retrieval/bm25_retriever.py

    • Use rank_bm25 library
    • Tokenization with proper preprocessing
    • search(query, k) -> List[Chunk] interface
    • Score normalization
  • [✅] 3.4 Implement scientific_rag/application/retrieval/dense_retriever.py

    • Semantic search using Qdrant
    • Integrate with QdrantClient from infrastructure
    • Apply metadata filters from self-query
    • search(query, filters, k) -> List[Chunk] interface
  • [✅] 3.5 Implement scientific_rag/application/retrieval/hybrid_retriever.py

    • Combine BM25 and dense retrieval
    • Pass metadata filters to both retrievers
    • Configurable weights for each method
    • Toggle switches: use_bm25, use_dense
    • Reciprocal Rank Fusion (RRF) or weighted combination
    • Deduplication of results
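
One way to combine the two retrievers is Reciprocal Rank Fusion, sketched below on plain chunk IDs. The function name, the per-retriever weights argument, and the constant rrf_k = 60 (a commonly used default) are assumptions for illustration; the project's hybrid retriever may use a different interface or a weighted score combination instead.

```python
from collections import defaultdict
from typing import Optional


def reciprocal_rank_fusion(
    rankings: dict,
    weights: Optional[dict] = None,
    rrf_k: int = 60,
    top_k: int = 10,
) -> list:
    """Fuse ranked lists of chunk IDs into one list of (chunk_id, score) pairs.

    rankings maps a retriever name (e.g. "bm25", "dense") to its ranked IDs.
    Each ID appears once in the output, so fusion also deduplicates.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        seen = set()
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in seen:  # ignore repeats within a single list
                continue
            seen.add(chunk_id)
            scores[chunk_id] += weight / (rrf_k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]
```

With this shape, the use_bm25 and use_dense toggles reduce to leaving a retriever's list out of rankings. RRF uses only ranks, so BM25 scores and cosine similarities never need to share a scale.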

Phase 4: Query Processing & Metadata Filtering

  • [✅] 4.1 Implement scientific_rag/application/query_processing/self_query.py

    • Extract metadata filters from natural language queries using rule-based matching
    • Detect source preferences: "arxiv papers about..." → filter to arxiv
    • Detect section preferences: "in the methods section..." → filter to methods
    • Use regex/keyword matching
    • No LLM needed - metadata is already structured in chunks from dataset
    • Return structured QueryFilters object
    • Filters are passed to Qdrant for efficient pre-filtering before vector search
  • [✅] 4.2 Implement scientific_rag/application/query_processing/query_expansion.py

    • Generate multiple query variations to improve recall

    • Use LLM to create semantically similar queries

    • Configurable expand_to_n parameter (default: 3)

    • Example prompt:

      Generate {n} different versions of this question to search a scientific papers database.
      Each version should capture the same intent but use different wording.
      Separate versions with "###"
      
      Original: {query}
      
    • Search with all expanded queries, merge results

    • Deduplicate before reranking

  • [✅] 4.3 Update scientific_rag/domain/queries.py

    • Add QueryFilters model for self-query results

    • Add ExpandedQuery model to hold query variations

    • Example:

      class QueryFilters(BaseModel):
          source: Literal["arxiv", "pubmed", "any"] = "any"
          section: Literal["introduction", "methods", "results", "conclusion", "any"] = "any"
      
      class ExpandedQuery(BaseModel):
          original: str
          variations: list[str]
          filters: QueryFilters | None = None
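
The rule-based self-query step and the parsing of the "###"-separated expansion reply can be sketched as follows. This is only an illustration: the regex tables, the Filters dataclass (a standard-library stand-in for the QueryFilters model), and the assumption that the original query counts toward expand_to_n are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword rules; the real rule set may be broader.
SOURCE_PATTERNS = {
    "arxiv": re.compile(r"\barxiv\b", re.IGNORECASE),
    "pubmed": re.compile(r"\bpubmed\b", re.IGNORECASE),
}
SECTION_PATTERNS = {
    "introduction": re.compile(r"\bintroductions?\b", re.IGNORECASE),
    "methods": re.compile(r"\b(methods?|methodology)\b", re.IGNORECASE),
    "results": re.compile(r"\bresults?\b", re.IGNORECASE),
    "conclusion": re.compile(r"\bconclusions?\b", re.IGNORECASE),
}


@dataclass
class Filters:
    """Standard-library stand-in for the QueryFilters model."""
    source: str = "any"
    section: str = "any"


def extract_filters(query: str) -> Filters:
    """Return the first matching source and section, or "any" for each."""
    filters = Filters()
    for source, pattern in SOURCE_PATTERNS.items():
        if pattern.search(query):
            filters.source = source
            break
    for section, pattern in SECTION_PATTERNS.items():
        if pattern.search(query):
            filters.section = section
            break
    return filters


def parse_variations(llm_output: str, original: str, expand_to_n: int = 3) -> list:
    """Split an LLM reply on "###" and keep up to expand_to_n unique queries.

    The original query is placed first and counts toward the limit
    (an assumption made for this sketch).
    """
    candidates = [original] + [part.strip() for part in llm_output.split("###")]
    unique = []
    seen = set()
    for candidate in candidates:
        key = candidate.lower()
        if candidate and key not in seen:
            seen.add(key)
            unique.append(candidate)
    return unique[:expand_to_n]
```

Bare keyword rules can over-trigger: a question such as "What methods are used for..." would be filtered to the methods section. That is why the UI's manual filter override and the use_self_query toggle are useful.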
      

Phase 5: Reranking

  • [✅] 5.1 Implement scientific_rag/application/reranking/cross_encoder.py
    • Use cross-encoder/ms-marco-MiniLM-L6-v2 (or similar)
    • rerank(query, chunks, top_k) -> List[Chunk] interface
    • Batch processing for efficiency
    • Score-based sorting
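
The reranking interface can be sketched independently of any model by injecting the scoring function. Here score_pairs is a hypothetical callable that takes a list of (query, chunk) pairs and returns one score per pair; the predict method of a sentence-transformers CrossEncoder has that shape and could be passed in, but nothing in this sketch depends on it.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[list], Sequence[float]],
    top_k: int = 5,
    batch_size: int = 32,
) -> list:
    """Score (query, chunk) pairs in batches and return the top_k by score."""
    scores = []
    for start in range(0, len(chunks), batch_size):
        batch = [(query, chunk) for chunk in chunks[start:start + batch_size]]
        scores.extend(float(score) for score in score_pairs(batch))
    # Higher score means more relevant; sorted() is stable, so ties keep input order.
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Injecting the scorer keeps the unit tests in tests/unit/test_reranking.py fast, since they can use a trivial scoring function instead of loading model weights.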

Phase 6: LLM Integration

  • [✅] 6.1 Implement scientific_rag/application/rag/llm_client.py

    • LiteLLM wrapper for provider abstraction
    • Support for Groq, OpenRouter, OpenAI
    • Configurable model selection
    • Error handling and retries
    • Response streaming (optional)
  • [✅] 6.2 Create scientific_rag/application/rag/prompt_templates.py

    • RAG prompt template with context injection

    • Citation-aware prompting (instruct model to cite sources)

    • System prompt for scientific Q&A

    • Example:

      You are a scientific research assistant. Answer the question based on the provided context.
      Always cite your sources using [1], [2], etc.
      
      Context:
      [1] {chunk_1}
      [2] {chunk_2}
      ...
      
      Question: {query}
      
      Answer with citations:
      
  • [✅] 6.3 Implement scientific_rag/application/rag/pipeline.py

    • Main RAGPipeline class
    • Orchestrate: Query → Self-Query → Query Expansion → Retrieve (with filters) → Rerank → Generate
    • Full pipeline flow:
      1. Self-Query: Extract filters (source, section) for Qdrant
      2. Query Expansion: Generate N query variations
      3. Retrieve: Search with all queries (BM25 + Qdrant with filters)
      4. Merge & Deduplicate: Combine results from all queries
      5. Rerank: Cross-encoder scoring
      6. Generate: LLM with citations
      
    • Configurable retrieval parameters
    • Toggle for each component: use_self_query, use_query_expansion, use_bm25, use_dense, use_reranking
    • Citation tracking and formatting
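
The six-step flow and its toggles can be sketched as a single function over injected callables. Every name and signature below (PipelineConfig, PipelineSteps, run_pipeline) is hypothetical and chosen for the example; the real RAGPipeline class may be organised differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineConfig:
    use_self_query: bool = True
    use_query_expansion: bool = True
    use_bm25: bool = True
    use_dense: bool = True
    use_reranking: bool = True
    top_k: int = 5


@dataclass
class PipelineSteps:
    """Injected components; all signatures here are assumptions."""
    extract_filters: Callable[[str], dict]
    expand_query: Callable[[str], list]
    bm25_search: Callable[[str, dict], list]
    dense_search: Callable[[str, dict], list]
    rerank: Callable[[str, list], list]
    generate: Callable[[str, list], str]


def run_pipeline(query: str, steps: PipelineSteps, config: Optional[PipelineConfig] = None) -> str:
    config = config or PipelineConfig()
    # 1. Self-Query: extract metadata filters, or search unfiltered.
    filters = steps.extract_filters(query) if config.use_self_query else {}
    # 2. Query Expansion: fall back to the original query alone.
    queries = steps.expand_query(query) if config.use_query_expansion else [query]
    # 3-4. Retrieve with every query, then merge and deduplicate.
    # A dict keeps first-seen order while dropping repeated chunk IDs.
    merged = {}
    for variant in queries:
        if config.use_bm25:
            merged.update(dict.fromkeys(steps.bm25_search(variant, filters)))
        if config.use_dense:
            merged.update(dict.fromkeys(steps.dense_search(variant, filters)))
    candidates = list(merged)
    # 5. Rerank against the original query, not the variations.
    if config.use_reranking:
        candidates = steps.rerank(query, candidates)
    # 6. Generate an answer from the top_k chunks.
    return steps.generate(query, candidates[: config.top_k])
```

Keeping each stage behind a callable makes it cheap to switch a component off in the UI and to stub components in the integration test.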

Phase 7: User Interface

  • [✅] 7.1 Create demo/main.py with Gradio

    • Text input for questions
    • API key input field (not stored in code)
    • Dropdown for LLM provider/model selection
    • Dropdown for metadata filters (optional manual override):
      • Source: Any / ArXiv / PubMed
      • Section: Any / Introduction / Methods / Results / Conclusion
    • Checkboxes for pipeline components:
      • [✅] Enable Self-Query (metadata extraction)
      • [✅] Enable Query Expansion
      • [✅] Enable BM25
      • [✅] Enable Dense Retrieval (Qdrant)
      • [✅] Enable Reranking
    • Slider for top-k parameter
    • Slider for query expansion count (1-5)
    • Output: Answer with citations
    • Expandable section showing retrieved chunks with metadata
  • [✅] 7.2 Add service description

    • Brief explanation of the RAG system
    • Dataset information
    • Usage instructions
  • [✅] 7.3 Style and UX improvements

    • Clear layout
    • Loading indicators
    • Error messages for invalid inputs

Phase 8: Deployment

  • 8.1 Create requirements.txt for HuggingFace Spaces

    • Pin versions for reproducibility
    • Note: HF Spaces may need Qdrant Cloud instead of local
  • 8.2 Create HuggingFace Space configuration

    • README.md with YAML frontmatter for Gradio SDK
    • Resource requirements (CPU/memory)
    • Configure Qdrant Cloud connection for deployment
  • 8.3 Deploy to HuggingFace Spaces

    • Test with sample queries
    • Verify API key handling
    • Verify Qdrant Cloud connectivity

Phase 9: Evaluation & Documentation

  • 9.1 Find queries where BM25 outperforms dense retrieval

    • Queries with specific terminology, rare words, or exact phrases
    • Examples:
      • "papers mentioning @xmath0 decay channel"
      • "CLEO detector measurements"
  • 9.2 Find queries where dense retrieval outperforms BM25

    • Semantic similarity queries
    • Paraphrased questions
    • Examples:
      • "How do researchers measure particle lifetimes?"
      • "What methods are used for blood clot prevention?"
  • 9.3 Demonstrate metadata filtering effectiveness

    • Show queries where filtering by source improves results
    • Show queries where filtering by section improves results
    • Examples:
      • "arxiv papers about quantum computing" → filter to arxiv
      • "methodology for clinical trials" → filter to methods section
  • 9.4 Document the system in README.md

    • Architecture overview
    • Installation instructions (including Docker/Qdrant setup)
    • Usage examples
    • Component descriptions
    • Retrieval comparison findings
    • Metadata filtering examples
  • 9.5 Prepare submission materials

    • Source code link
    • Deployed service link
    • Component checklist (per assignment requirements)
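
For tasks 9.1 and 9.2, one simple way to decide which retriever "wins" a query is to compare the reciprocal rank of the first relevant chunk in each result list. The sketch below assumes hand-labelled relevant chunk IDs per query, which the task list does not specify; the function names are hypothetical.

```python
def reciprocal_rank(ranked_ids: list, relevant_ids: set) -> float:
    """Return 1/rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def compare_retrievers(results: dict, relevant: dict) -> dict:
    """Name the best retriever per query, or "tie".

    results maps query -> {retriever name -> ranked chunk IDs};
    relevant maps query -> set of chunk IDs judged relevant.
    """
    winners = {}
    for query, per_retriever in results.items():
        scores = {
            name: reciprocal_rank(ranked_ids, relevant[query])
            for name, ranked_ids in per_retriever.items()
        }
        best = max(scores.values())
        leaders = sorted(name for name, score in scores.items() if score == best)
        winners[query] = leaders[0] if len(leaders) == 1 else "tie"
    return winners
```

Queries where the winner is "bm25" are candidates for the 9.1 examples, and queries won by "dense" are candidates for 9.2.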

Optional Enhancements (Bonus Points)

Citation Enhancement

  • B.1 Improve citation formatting
    • Parse and display chunk source information
    • Show paper abstract or section name
    • Link citations to source documents

Performance Optimization

  • B.2 Add caching layer

    • Cache embeddings
    • Cache LLM responses for identical queries
  • B.3 Optimize for larger dataset

    • FAISS index for fast similarity search
    • Batch processing improvements

Dependencies Summary

[project]
name = "scientific-rag"
version = "0.1.0"
description = "Scientific Papers RAG System"
requires-python = ">=3.11"

dependencies = [
    # Data
    "datasets>=3.0.0",
    "huggingface-hub>=0.20.0",

    # ML/Embeddings
    "sentence-transformers>=3.0.0",
    "torch>=2.0.0",
    "numpy>=1.26.0",
    "scipy>=1.11.0",

    # Retrieval
    "rank-bm25>=0.2.2",
    "qdrant-client>=1.8.0",

    # LLM
    "litellm>=1.0.0",

    # Configuration
    "pydantic>=2.0.0",
    "pydantic-settings>=2.0.0",

    # UI
    "gradio>=4.0.0",

    # Utilities
    "loguru>=0.7.0",
    "tqdm>=4.65.0",
    "python-dotenv>=1.0.0",
]

[dependency-groups]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.4.0",
    "mypy>=1.10.0",
    "pre-commit>=3.0.0",
    "ipykernel>=6.0.0",
]

Quick Start Commands

# Setup
make install

# Run locally
make run-app

# Run tests
make test

# Lint
make lint

# Format
make format

Key Implementation Notes

Chunking Strategy

For scientific papers, consider:

  1. Section-based chunking: Split by sections first, then by size
  2. Preserve context: Include section title in each chunk
  3. Handle LaTeX: Papers contain @xmath tokens for math expressions

Retrieval Comparison

Document specific queries that demonstrate:

  • BM25 strength: Exact term matching, rare terminology
  • Dense strength: Semantic understanding, paraphrased queries

LLM Configuration

Recommended free options:

  • Groq: Fast, free tier with llama-3.1-8b-instant
  • OpenRouter: Multiple model options, some free

Citation Format

Answer: The decay channel measurement shows... [1]. Further analysis using the CLEO detector... [2].

Sources:
[1] "we have studied the leptonic decay..." (arxiv, section: introduction)
[2] "data collected with the CLEO detector..." (arxiv, section: methods)
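
The numbered context for the prompt and the "Sources:" footer shown above can both be produced from the same ordered list of chunks, so citation numbers in the answer line up with the listed sources. The sketch below uses a hypothetical CitedChunk dataclass as a stand-in for the project's chunk model; the preview length is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass
class CitedChunk:
    """Hypothetical stand-in for the chunk model with its metadata."""
    text: str
    source: str
    section: str


def build_context(chunks: list) -> str:
    """Number chunks as [1], [2], ... for injection into the RAG prompt."""
    return "\n".join(f"[{index}] {chunk.text}" for index, chunk in enumerate(chunks, start=1))


def format_sources(chunks: list, preview_chars: int = 60) -> str:
    """Render the source list using the same numbering as build_context."""
    lines = ["Sources:"]
    for index, chunk in enumerate(chunks, start=1):
        preview = chunk.text.strip().replace("\n", " ")
        if len(preview) > preview_chars:
            preview = preview[:preview_chars].rstrip() + "..."
        lines.append(f'[{index}] "{preview}" ({chunk.source}, section: {chunk.section})')
    return "\n".join(lines)
```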

Timeline Suggestion

Week                 Focus Area
Week 1 (Dec 9-11)    Phases 1-2: Setup, Data Loading, Chunking
Week 2 (Dec 12-14)   Phases 3-6: Retrieval, Query Processing, Reranking, LLM
Week 3 (Dec 15-16)   Phases 7-9: UI, Deployment, Evaluation & Documentation

References