# Scientific RAG - Implementation Tasks

> **Project**: Scientific Advanced RAG System
> **Dataset**: [armanc/scientific_papers](https://huggingface.co/datasets/armanc/scientific_papers) (ArXiv + PubMed)
> **Deadline**: December 16, 2025
> **Reference Architecture**: LLM-Engineers-Handbook (Domain-Driven Design)

---

## Overview

Build a Retrieval-Augmented Generation (RAG) system for answering questions about scientific papers. The system will use the `armanc/scientific_papers` dataset containing ~320K papers from ArXiv and PubMed with articles, abstracts, and section names.

### Dataset Structure

```python
{
    "abstract": "Summary of the paper...",
    "article": "Full body of the paper, paragraphs separated by \\n...",
    "section_names": "[sec:introduction]introduction\\n[sec:methods]methods\\n...",
}
```

- **arxiv**: 203,037 train / 6,436 val / 6,440 test
- **pubmed**: 119,924 train / 6,633 val / 6,658 test

---

## Project Structure (Target)

```
scientific-rag/
├── pyproject.toml              # Project configuration
├── Makefile                    # Development commands
├── docker-compose.yaml         # Qdrant infrastructure
├── .env.dist                   # Environment template
├── README.md                   # Documentation
├── tasks.md                    # This file
├── docs/
│   └── assignment.md           # Assignment requirements
├── configs/
│   └── rag_config.yaml         # RAG pipeline configuration
├── data/
│   ├── raw/                    # Downloaded dataset cache
│   └── processed/              # Processed chunks
├── scientific_rag/             # Main package
│   ├── __init__.py
│   ├── settings.py             # Configuration management
│   ├── domain/                 # Core entities
│   │   ├── __init__.py
│   │   ├── documents.py        # Document models (Paper, Chunk)
│   │   ├── queries.py          # Query models
│   │   └── types.py            # Enums and type definitions
│   ├── application/            # Business logic
│   │   ├── __init__.py
│   │   ├── data_loader.py      # HuggingFace dataset loading
│   │   ├── chunking/           # Chunking strategies
│   │   │   ├── __init__.py
│   │   │   ├── base.py         # Abstract chunker
│   │   │   └── scientific_chunker.py
│   │   ├── embeddings/         # Embedding models
│   │   │   ├── __init__.py
│   │   │   └── encoder.py      # Sentence-transformers wrapper
│   │   ├── query_processing/   # Query enhancement
│   │   │   ├── __init__.py
│   │   │   ├── query_expansion.py  # Multi-query generation
│   │   │   └── self_query.py       # Metadata extraction
│   │   ├── retrieval/          # Retrieval logic
│   │   │   ├── __init__.py
│   │   │   ├── bm25_retriever.py
│   │   │   ├── dense_retriever.py
│   │   │   └── hybrid_retriever.py
│   │   ├── reranking/          # Reranker
│   │   │   ├── __init__.py
│   │   │   └── cross_encoder.py
│   │   └── rag/                # RAG pipeline
│   │       ├── __init__.py
│   │       ├── pipeline.py     # Main RAG orchestration
│   │       ├── prompt_templates.py
│   │       └── llm_client.py   # LiteLLM wrapper
│   └── infrastructure/         # External integrations
│       ├── __init__.py
│       └── qdrant.py           # Qdrant vector database client
├── demo/                       # Gradio/Streamlit UI
│   ├── __init__.py
│   └── main.py                 # Web interface
└── tests/
    ├── __init__.py
    ├── unit/
    │   ├── test_chunking.py
    │   ├── test_retrieval.py
    │   └── test_reranking.py
    └── integration/
        └── test_rag_pipeline.py
```

---

## Implementation Tasks

### Phase 1: Project Setup & Data Loading

- [✅] **1.1** Update `pyproject.toml` with project dependencies
  - `datasets` - HuggingFace datasets
  - `sentence-transformers` - Embeddings and cross-encoders
  - `rank-bm25` - BM25 retrieval
  - `qdrant-client` - Vector database client
  - `litellm` - LLM abstraction layer
  - `gradio` or `streamlit` - UI framework
  - `pydantic` - Data validation
  - `pydantic-settings` - Configuration management
  - `loguru` - Logging
  - `numpy`, `scipy` - Numerical operations
  - `tqdm` - Progress bars
- [✅] **1.2** Create `docker-compose.yaml` for local infrastructure
  - Qdrant vector database service
  - Example:
    ```yaml
    services:
      qdrant:
        image: qdrant/qdrant:latest
        ports:
          - "6333:6333"
          - "6334:6334"
        volumes:
          - qdrant_storage:/qdrant/storage
    volumes:
      qdrant_storage:
    ```
  - Add `make qdrant-up` and `make qdrant-down` commands
- [✅] **1.3** Create `scientific_rag/settings.py`
  - Environment variable management
  - Model IDs configuration
  - API keys handling (OpenAI, Groq, OpenRouter)
  - Qdrant connection settings (host, port, API key for cloud)
  - Default chunking parameters
- [✅] **1.4** Create `scientific_rag/domain/` entities
  - `types.py`: Enums for DataSource (ARXIV, PUBMED), SectionType
  - `documents.py`: `ScientificPaper`, `PaperChunk` Pydantic models with metadata
  - `queries.py`: `Query`, `EmbeddedQuery`, `QueryFilters` models
- [✅] **1.5** Implement `scientific_rag/application/data_loader.py`
  - Load `armanc/scientific_papers` from HuggingFace
  - Support both `arxiv` and `pubmed` subsets
  - Configurable sample size for development
  - Progress tracking with tqdm

### Phase 2: Chunking Strategy

- [✅] **2.1** Implement `scientific_rag/application/chunking/scientific_chunker.py`
  - **Section-aware chunking**: Parse `section_names` to identify sections
  - **Paragraph-based splitting**: Split on `\n` boundaries
  - **Overlap strategy**: Add overlap between chunks for context
  - Configurable `chunk_size` and `chunk_overlap`
  - **Metadata preservation**: Store source (arxiv/pubmed), normalized section name, paper_id, position
  - Normalize section names to enum values (introduction, methods, results, conclusion, other)
- [✅] **2.2** Create processing script to generate chunks
  - Batch processing with progress tracking
  - Save chunks to disk (JSON/Parquet) for reuse
  - Generate unique chunk IDs (hash-based)

### Phase 3: Retrieval Implementation

- [✅] **3.1** Create `scientific_rag/application/embeddings/encoder.py`
  - Singleton pattern for embedding model
  - Use `intfloat/e5-small-v2`
  - Batch embedding support
  - GPU/CPU device configuration
- [✅] **3.2** Implement `scientific_rag/infrastructure/qdrant.py`
  - Qdrant client wrapper (local Docker or Qdrant Cloud)
  - Collection creation with proper schema
  - `upsert_chunks(chunks)` - batch insert with embeddings
  - `search(query_vector, filters, k)` - filtered vector search
  - Support for Qdrant filter syntax
- [✅] **3.3** Implement `scientific_rag/application/retrieval/bm25_retriever.py`
  - Use `rank_bm25` library
  - Tokenization with proper preprocessing
  - `search(query, k) -> List[Chunk]` interface
  - Score normalization
- [✅] **3.4** Implement `scientific_rag/application/retrieval/dense_retriever.py`
  - Semantic search using Qdrant
  - Integrate with `QdrantClient` from infrastructure
  - Apply metadata filters from self-query
  - `search(query, filters, k) -> List[Chunk]` interface
- [✅] **3.5** Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
  - Combine BM25 and dense retrieval
  - Pass metadata filters to both retrievers
  - Configurable weights for each method
  - Toggle switches: `use_bm25`, `use_dense`
  - Reciprocal Rank Fusion (RRF) or weighted combination
  - Deduplication of results

### Phase 4: Query Processing & Metadata Filtering

- [✅] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
  - Extract metadata filters from natural language queries using **rule-based matching**
  - Detect source preferences: "arxiv papers about..." → filter to arxiv
  - Detect section preferences: "in the methods section..." → filter to methods
  - Use regex/keyword matching
  - No LLM needed - metadata is already structured in chunks from the dataset
  - Return structured `QueryFilters` object
  - Filters are passed to Qdrant for efficient pre-filtering before vector search
- [✅] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
  - Generate multiple query variations to improve recall
  - Use LLM to create semantically similar queries
  - Configurable `expand_to_n` parameter (default: 3)
  - Example prompt:
    ```
    Generate {n} different versions of this question to search a scientific papers database.
    Each version should capture the same intent but use different wording.
    Separate versions with "###"

    Original: {query}
    ```
  - Search with all expanded queries, merge results
  - Deduplicate before reranking
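The Reciprocal Rank Fusion option mentioned in 3.5 can be sketched in a few lines. Chunk IDs stand in for full `Chunk` objects here, and `k=60` is the conventional RRF smoothing constant:

```python
# Minimal sketch of RRF for the hybrid retriever in 3.5.
# Each ranking is an ordered list of chunk IDs from one retriever.
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists into one, deduplicated by construction."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            # Items ranked highly in any list accumulate larger reciprocal scores
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c5", "c3"]
fused = rrf_merge([bm25_hits, dense_hits])
# "c1" and "c3" appear in both lists, so they rise to the top
```

A weighted variant simply multiplies each retriever's contribution by its configured weight before summing.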
- [✅] **4.3** Update `scientific_rag/domain/queries.py`
  - Add `QueryFilters` model for self-query results
  - Add `ExpandedQuery` model to hold query variations
  - Example:
    ```python
    class QueryFilters(BaseModel):
        source: Literal["arxiv", "pubmed", "any"] = "any"
        section: Literal["introduction", "methods", "results", "conclusion", "any"] = "any"

    class ExpandedQuery(BaseModel):
        original: str
        variations: list[str]
        filters: QueryFilters | None = None
    ```

### Phase 5: Reranking

- [✅] **5.1** Implement `scientific_rag/application/reranking/cross_encoder.py`
  - Use `cross-encoder/ms-marco-MiniLM-L6-v2` (or similar)
  - `rerank(query, chunks, top_k) -> List[Chunk]` interface
  - Batch processing for efficiency
  - Score-based sorting

### Phase 6: LLM Integration

- [✅] **6.1** Implement `scientific_rag/application/rag/llm_client.py`
  - LiteLLM wrapper for provider abstraction
  - Support for Groq, OpenRouter, OpenAI
  - Configurable model selection
  - Error handling and retries
  - Response streaming (optional)
- [✅] **6.2** Create `scientific_rag/application/rag/prompt_templates.py`
  - RAG prompt template with context injection
  - Citation-aware prompting (instruct model to cite sources)
  - System prompt for scientific Q&A
  - Example:
    ```
    You are a scientific research assistant. Answer the question based on the provided context.
    Always cite your sources using [1], [2], etc.

    Context:
    [1] {chunk_1}
    [2] {chunk_2}
    ...

    Question: {query}

    Answer with citations:
    ```
- [✅] **6.3** Implement `scientific_rag/application/rag/pipeline.py`
  - Main `RAGPipeline` class
  - Orchestrate: Query → Self-Query → Query Expansion → Retrieve (with filters) → Rerank → Generate
  - Full pipeline flow:
    ```
    1. Self-Query: Extract filters (source, section) for Qdrant
    2. Query Expansion: Generate N query variations
    3. Retrieve: Search with all queries (BM25 + Qdrant with filters)
    4. Merge & Deduplicate: Combine results from all queries
    5. Rerank: Cross-encoder scoring
    6. Generate: LLM with citations
    ```
  - Configurable retrieval parameters
  - Toggle for each component: `use_self_query`, `use_query_expansion`, `use_bm25`, `use_dense`, `use_reranking`
  - Citation tracking and formatting
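The orchestration in 6.3 can be sketched as a dependency-injected skeleton: each stage is a callable, and toggles short-circuit optional stages. The names and signatures here are assumptions, not the project's actual interfaces:

```python
# Hypothetical skeleton for RAGPipeline (task 6.3); stages are injected callables.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RAGPipeline:
    retrieve: Callable[[str], list[str]]           # query -> candidate chunks
    rerank: Callable[[str, list[str]], list[str]]  # cross-encoder ordering
    generate: Callable[[str, list[str]], str]      # LLM call with context
    expand: Callable[[str], list[str]] = lambda q: [q]  # query expansion (identity default)
    use_query_expansion: bool = False
    use_reranking: bool = True
    top_k: int = 3

    def answer(self, query: str) -> str:
        queries = self.expand(query) if self.use_query_expansion else [query]
        # Retrieve with every query variation, then deduplicate preserving order
        candidates = list(dict.fromkeys(c for q in queries for c in self.retrieve(q)))
        if self.use_reranking:
            candidates = self.rerank(query, candidates)
        return self.generate(query, candidates[: self.top_k])


# Stub components make the flow easy to unit-test without models or a database
pipe = RAGPipeline(
    retrieve=lambda q: ["chunk-a", "chunk-b"],
    rerank=lambda q, cs: sorted(cs),
    generate=lambda q, cs: f"{len(cs)} chunks used",
)
result = pipe.answer("test question")
```

Injecting stages as callables keeps the unit tests in `tests/unit/` free of model downloads and Qdrant dependencies.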
### Phase 7: User Interface

- [✅] **7.1** Create `demo/main.py` with Gradio
  - Text input for questions
  - API key input field (not stored in code)
  - Dropdown for LLM provider/model selection
  - Dropdown for metadata filters (optional manual override):
    - Source: Any / ArXiv / PubMed
    - Section: Any / Introduction / Methods / Results / Conclusion
  - Checkboxes for pipeline components:
    - [✅] Enable Self-Query (metadata extraction)
    - [✅] Enable Query Expansion
    - [✅] Enable BM25
    - [✅] Enable Dense Retrieval (Qdrant)
    - [✅] Enable Reranking
  - Slider for top-k parameter
  - Slider for query expansion count (1-5)
  - Output: Answer with citations
  - Expandable section showing retrieved chunks with metadata
- [✅] **7.2** Add service description
  - Brief explanation of the RAG system
  - Dataset information
  - Usage instructions
- [✅] **7.3** Style and UX improvements
  - Clear layout
  - Loading indicators
  - Error messages for invalid inputs

### Phase 8: Deployment

- [ ] **8.1** Create `requirements.txt` for HuggingFace Spaces
  - Pin versions for reproducibility
  - Note: HF Spaces may need Qdrant Cloud instead of local
- [ ] **8.2** Create HuggingFace Space configuration
  - `README.md` with YAML frontmatter for Gradio SDK
  - Resource requirements (CPU/memory)
  - Configure Qdrant Cloud connection for deployment
- [ ] **8.3** Deploy to HuggingFace Spaces
  - Test with sample queries
  - Verify API key handling
  - Verify Qdrant Cloud connectivity

### Phase 9: Evaluation & Documentation

- [ ] **9.1** Find queries where BM25 outperforms dense retrieval
  - Queries with specific terminology, rare words, or exact phrases
  - Examples:
    - "papers mentioning @xmath0 decay channel"
    - "CLEO detector measurements"
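To probe 9.1-style queries before wiring up the full retrievers, a self-contained toy BM25 scorer (Okapi formula, hand-rolled here instead of `rank_bm25` so it runs standalone) shows why rare exact terms dominate. The three mini-documents are invented for illustration:

```python
# Toy Okapi BM25 scorer to illustrate 9.1: rare exact terms win.
# Corpus and query are invented; real code would use the rank_bm25 library.
import math


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            df = sum(term in d for d in docs)  # document frequency of the term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms get high idf
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


docs = [
    "we report cleo detector measurements of charm decays".split(),
    "neural networks for language understanding".split(),
    "detector calibration methods for particle physics".split(),
]
scores = bm25_scores("cleo detector measurements".split(), docs)
best = scores.index(max(scores))  # doc 0 wins on the rare exact terms
```

The rare token "cleo" carries most of the score, which is exactly the behavior a dense retriever tends to miss for out-of-vocabulary terminology like `@xmath0`.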
- [ ] **9.2** Find queries where dense retrieval outperforms BM25
  - Semantic similarity queries
  - Paraphrased questions
  - Examples:
    - "How do researchers measure particle lifetimes?"
    - "What methods are used for blood clot prevention?"
- [ ] **9.3** Demonstrate metadata filtering effectiveness
  - Show queries where filtering by source improves results
  - Show queries where filtering by section improves results
  - Examples:
    - "arxiv papers about quantum computing" → filter to arxiv
    - "methodology for clinical trials" → filter to methods section
- [ ] **9.4** Document the system in README.md
  - Architecture overview
  - Installation instructions (including Docker/Qdrant setup)
  - Usage examples
  - Component descriptions
  - Retrieval comparison findings
  - Metadata filtering examples
- [ ] **9.5** Prepare submission materials
  - Source code link
  - Deployed service link
  - Component checklist (per assignment requirements)

---

## Optional Enhancements (Bonus Points)

### Citation Enhancement

- [ ] **B.1** Improve citation formatting
  - Parse and display chunk source information
  - Show paper abstract or section name
  - Link citations to source documents

### Performance Optimization

- [ ] **B.2** Add caching layer
  - Cache embeddings
  - Cache LLM responses for identical queries
- [ ] **B.3** Optimize for larger dataset
  - FAISS index for fast similarity search
  - Batch processing improvements

---

## Dependencies Summary

```toml
[project]
name = "scientific-rag"
version = "0.1.0"
description = "Scientific Papers RAG System"
requires-python = ">=3.11"
dependencies = [
    # Data
    "datasets>=3.0.0",
    "huggingface-hub>=0.20.0",
    # ML/Embeddings
    "sentence-transformers>=3.0.0",
    "torch>=2.0.0",
    "numpy>=1.26.0",
    "scipy>=1.11.0",
    # Retrieval
    "rank-bm25>=0.2.2",
    "qdrant-client>=1.8.0",
    # LLM
    "litellm>=1.0.0",
    # Configuration
    "pydantic>=2.0.0",
    "pydantic-settings>=2.0.0",
    # UI
    "gradio>=4.0.0",
    # Utilities
    "loguru>=0.7.0",
    "tqdm>=4.65.0",
    "python-dotenv>=1.0.0",
]

[dependency-groups]
dev = [
    "pytest>=8.0.0",
    "ruff>=0.4.0",
    "mypy>=1.10.0",
    "pre-commit>=3.0.0",
    "ipykernel>=6.0.0",
]
```
---

## Quick Start Commands

```bash
# Setup
make install

# Run locally
make run-app

# Run tests
make test

# Lint
make lint

# Format
make format
```

---

## Key Implementation Notes

### Chunking Strategy

For scientific papers, consider:

1. **Section-based chunking**: Split by sections first, then by size
2. **Preserve context**: Include section title in each chunk
3. **Handle LaTeX**: Papers contain `@xmath` tokens for math expressions

### Retrieval Comparison

Document specific queries that demonstrate:

- BM25 strength: Exact term matching, rare terminology
- Dense strength: Semantic understanding, paraphrased queries

### LLM Configuration

Recommended free options:

- **Groq**: Fast, free tier with `llama-3.1-8b-instant`
- **OpenRouter**: Multiple model options, some free

### Citation Format

```
Answer: The decay channel measurement shows... [1]. Further analysis using the CLEO detector... [2].

Sources:
[1] "we have studied the leptonic decay..." (arxiv, section: introduction)
[2] "data collected with the CLEO detector..." (arxiv, section: methods)
```

---

## Timeline Suggestion

| Week               | Focus Area                               |
| ------------------ | ---------------------------------------- |
| Week 1 (Dec 9-11)  | Phase 1-2: Setup, Data Loading, Chunking |
| Week 2 (Dec 12-14) | Phase 3-6: Retrieval, Reranking, LLM     |
| Week 3 (Dec 15-16) | Phase 7-9: UI, Deployment, Documentation |

---

## References

- [Assignment Document](./docs/assignment.md)
- [LLM-Engineers-Handbook](https://github.com/PacktPublishing/LLM-Engineers-Handbook) - Reference architecture
- [Scientific Papers Dataset](https://huggingface.co/datasets/armanc/scientific_papers)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [Sentence-Transformers](https://www.sbert.net/)
- [Gradio Documentation](https://www.gradio.app/docs)