Spaces:
Sleeping
Sleeping
Commit Β·
7fed86f
1
Parent(s): 01a77b0
docs: add team_roles.md
Browse files- docs/team_roles.md +266 -0
docs/team_roles.md
ADDED
|
@@ -0,0 +1,266 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Team Roles & Task Distribution
|
| 2 |
+
|
| 3 |
+
> **Project**: Scientific Advanced RAG System
|
| 4 |
+
> **Team Size**: 3 members
|
| 5 |
+
> **Timeline**: December 12-16, 2025
|
| 6 |
+
> **Strategy**: Parallel development with clear ownership and minimal dependencies
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## π₯ Team Structure
|
| 11 |
+
|
| 12 |
+
### Member 1: Data Pipeline Lead
|
| 13 |
+
**Focus**: Data processing, embeddings, vector database infrastructure
|
| 14 |
+
|
| 15 |
+
### Member 2: Retrieval Engineer
|
| 16 |
+
**Focus**: Search algorithms, BM25, dense retrieval, reranking
|
| 17 |
+
|
| 18 |
+
### Member 3: LLM & Integration Lead
|
| 19 |
+
**Focus**: Query processing, LLM integration, RAG pipeline, UI
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## π Detailed Task Assignments
|
| 24 |
+
|
| 25 |
+
### πΉ Member 1: Data Pipeline Lead
|
| 26 |
+
|
| 27 |
+
#### Phase 2: Chunking Strategy (Priority: HIGH)
|
| 28 |
+
- **2.1** Create `scientific_rag/application/chunking/base.py`
|
| 29 |
+
- Abstract `BaseChunker` class
|
| 30 |
+
- Define interface: `chunk(document) -> List[Chunk]`
|
| 31 |
+
|
| 32 |
+
- **2.2** Implement `scientific_rag/application/chunking/scientific_chunker.py`
|
| 33 |
+
- Section-aware chunking with metadata preservation
|
| 34 |
+
- Normalize section names to enum values
|
| 35 |
+
- Handle LaTeX tokens (@xmath)
|
| 36 |
+
|
| 37 |
+
- **2.3** Create processing script to generate chunks
|
| 38 |
+
- Batch processing with progress tracking
|
| 39 |
+
- Save to `data/processed/` as JSON/Parquet
|
| 40 |
+
- Generate hash-based chunk IDs
|
| 41 |
+
|
| 42 |
+
#### Phase 3: Embeddings & Vector Database (Priority: HIGH)
|
| 43 |
+
- **3.1** Create `scientific_rag/application/embeddings/encoder.py`
|
| 44 |
+
- Singleton pattern for `intfloat/e5-small-v2`
|
| 45 |
+
- Batch embedding support (batch_size=32)
|
| 46 |
+
- CPU/GPU device configuration
|
| 47 |
+
|
| 48 |
+
- **3.3** Implement `scientific_rag/infrastructure/qdrant.py`
|
| 49 |
+
- Qdrant client wrapper (local Docker + cloud support)
|
| 50 |
+
- Collection creation with schema (384-d vectors)
|
| 51 |
+
- Metadata payload: source, section, paper_id, position
|
| 52 |
+
- `upsert_chunks(chunks)` with embeddings
|
| 53 |
+
- `search(query_vector, filters, k)` with filtering
|
| 54 |
+
|
| 55 |
+
#### Deliverables
|
| 56 |
+
- Working chunking pipeline that processes papers β chunks with metadata
|
| 57 |
+
- Qdrant collection populated with embedded chunks
|
| 58 |
+
- Script to run: `python scripts/process_and_index.py`
|
| 59 |
+
|
| 60 |
+
**Estimated Time**: 2-3 days
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
### πΉ Member 2: Retrieval Engineer
|
| 65 |
+
|
| 66 |
+
#### Phase 3: Retrieval Implementation (Priority: HIGH)
|
| 67 |
+
- **3.2** Implement `scientific_rag/application/retrieval/bm25_retriever.py`
|
| 68 |
+
- Use `rank_bm25` library
|
| 69 |
+
- Tokenization with preprocessing
|
| 70 |
+
- `search(query, k) -> List[Chunk]` interface
|
| 71 |
+
- Score normalization
|
| 72 |
+
|
| 73 |
+
- **3.4** Implement `scientific_rag/application/retrieval/dense_retriever.py`
|
| 74 |
+
- Semantic search using Qdrant (depends on Member 1's 3.3)
|
| 75 |
+
- Apply metadata filters from `QueryFilters`
|
| 76 |
+
- `search(query, filters, k) -> List[Chunk]`
|
| 77 |
+
|
| 78 |
+
- **3.5** Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
|
| 79 |
+
- Combine BM25 + dense retrieval
|
| 80 |
+
- Reciprocal Rank Fusion (RRF) or weighted combination
|
| 81 |
+
- Configurable weights: `bm25_weight`, `dense_weight`
|
| 82 |
+
- Toggle switches: `use_bm25`, `use_dense`
|
| 83 |
+
- Deduplication logic
|
| 84 |
+
|
| 85 |
+
#### Phase 5: Reranking (Priority: MEDIUM)
|
| 86 |
+
- **5.1** Implement `scientific_rag/application/reranking/cross_encoder.py`
|
| 87 |
+
- Use `cross-encoder/ms-marco-MiniLM-L6-v2`
|
| 88 |
+
- `rerank(query, chunks, top_k) -> List[Chunk]`
|
| 89 |
+
- Batch processing for efficiency
|
| 90 |
+
- Score-based sorting
|
| 91 |
+
|
| 92 |
+
#### Phase 9: Evaluation Support (Priority: LOW)
|
| 93 |
+
- **9.1** Find BM25-best queries
|
| 94 |
+
- Document specific terminology queries
|
| 95 |
+
- Exact phrase matching examples
|
| 96 |
+
|
| 97 |
+
- **9.2** Find dense-best queries
|
| 98 |
+
- Semantic similarity queries
|
| 99 |
+
- Paraphrased questions
|
| 100 |
+
|
| 101 |
+
#### Deliverables
|
| 102 |
+
- BM25 retriever (can test standalone with chunks)
|
| 103 |
+
- Dense retriever (integrates with Member 1's Qdrant)
|
| 104 |
+
- Hybrid retriever combining both
|
| 105 |
+
- Reranker module
|
| 106 |
+
- Comparison analysis for BM25 vs Dense
|
| 107 |
+
|
| 108 |
+
**Estimated Time**: 2-3 days
|
| 109 |
+
|
| 110 |
+
---
|
| 111 |
+
|
| 112 |
+
### πΉ Member 3: LLM & Integration Lead
|
| 113 |
+
|
| 114 |
+
#### Phase 4: Query Processing (Priority: HIGH)
|
| 115 |
+
- **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
|
| 116 |
+
- Rule-based metadata filter extraction
|
| 117 |
+
- Regex/keyword matching for source (arxiv/pubmed)
|
| 118 |
+
- Pattern matching for section (introduction/methods/results/conclusion)
|
| 119 |
+
- Return `QueryFilters` object
|
| 120 |
+
|
| 121 |
+
- **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
|
| 122 |
+
- LLM-based query variation generation
|
| 123 |
+
- Configurable `expand_to_n` parameter (default: 3)
|
| 124 |
+
- Deduplicate expanded queries
|
| 125 |
+
|
| 126 |
+
- **4.3** Update `scientific_rag/domain/queries.py`
|
| 127 |
+
- Already done, verify completeness
|
| 128 |
+
|
| 129 |
+
#### Phase 6: LLM Integration (Priority: HIGH)
|
| 130 |
+
- **6.1** Implement `scientific_rag/application/rag/llm_client.py`
|
| 131 |
+
- LiteLLM wrapper for OpenRouter
|
| 132 |
+
- Support `openai/gpt-oss-120b:free`
|
| 133 |
+
- Error handling and retries
|
| 134 |
+
- Optional: response streaming
|
| 135 |
+
|
| 136 |
+
- **6.2** Create `scientific_rag/application/rag/prompt_templates.py`
|
| 137 |
+
- RAG prompt with context injection
|
| 138 |
+
- Citation-aware prompting ([1], [2] format)
|
| 139 |
+
- System prompt for scientific Q&A
|
| 140 |
+
|
| 141 |
+
- **6.3** Implement `scientific_rag/application/rag/pipeline.py`
|
| 142 |
+
- Main `RAGPipeline` orchestration class
|
| 143 |
+
- Full flow: Self-Query β Query Expansion β Retrieve β Rerank β Generate
|
| 144 |
+
- Toggle switches for each component
|
| 145 |
+
- Citation tracking
|
| 146 |
+
|
| 147 |
+
#### Phase 7: User Interface (Priority: MEDIUM)
|
| 148 |
+
- **7.1** Create `demo/main.py` with Gradio
|
| 149 |
+
- Text input for questions
|
| 150 |
+
- API key input field
|
| 151 |
+
- Dropdown for model selection
|
| 152 |
+
- Metadata filter dropdowns (source, section)
|
| 153 |
+
- Component toggle checkboxes
|
| 154 |
+
- Top-k slider, expansion count slider
|
| 155 |
+
- Output: Answer with citations + retrieved chunks
|
| 156 |
+
|
| 157 |
+
- **7.2** Add service description
|
| 158 |
+
- RAG system explanation
|
| 159 |
+
- Dataset info (320K papers)
|
| 160 |
+
|
| 161 |
+
- **7.3** Style and UX improvements
|
| 162 |
+
- Clean layout with loading indicators
|
| 163 |
+
- Error messages
|
| 164 |
+
|
| 165 |
+
#### Phase 8: Deployment (Priority: LOW)
|
| 166 |
+
- **8.1** Create `requirements.txt` for HuggingFace Spaces
|
| 167 |
+
- **8.2** HuggingFace Space configuration (`README.md` with YAML)
|
| 168 |
+
- **8.3** Deploy and test
|
| 169 |
+
|
| 170 |
+
#### Phase 9: Documentation (Priority: LOW)
|
| 171 |
+
- **9.3** Demonstrate metadata filtering effectiveness
|
| 172 |
+
- **9.4** Document system in `README.md`
|
| 173 |
+
- **9.5** Prepare submission materials
|
| 174 |
+
|
| 175 |
+
#### Deliverables
|
| 176 |
+
- Query processing modules (self-query, expansion)
|
| 177 |
+
- LLM client with prompt templates
|
| 178 |
+
- Complete RAG pipeline
|
| 179 |
+
- Gradio UI demo
|
| 180 |
+
- Documentation and deployment
|
| 181 |
+
|
| 182 |
+
**Estimated Time**: 3-4 days
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
+
## π Integration Points & Dependencies
|
| 187 |
+
|
| 188 |
+
### Critical Path
|
| 189 |
+
```
|
| 190 |
+
Day 1-2:
|
| 191 |
+
Member 1: Chunking (2.1, 2.2, 2.3) β Embeddings (3.1)
|
| 192 |
+
Member 2: BM25 (3.2) [can start immediately]
|
| 193 |
+
Member 3: Self-query (4.1), LLM client (6.1), Prompts (6.2)
|
| 194 |
+
|
| 195 |
+
Day 2-3:
|
| 196 |
+
Member 1: Qdrant client (3.3) + Index chunks [BLOCKER for Member 2's 3.4]
|
| 197 |
+
Member 2: Dense retriever (3.4) [WAIT for 3.3] β Hybrid (3.5)
|
| 198 |
+
Member 3: Query expansion (4.2), Pipeline stub (6.3)
|
| 199 |
+
|
| 200 |
+
Day 3-4:
|
| 201 |
+
Member 1: Support/testing, optimize indexing
|
| 202 |
+
Member 2: Reranking (5.1) β Integration testing
|
| 203 |
+
Member 3: Complete Pipeline (6.3) β Gradio UI (7.1)
|
| 204 |
+
|
| 205 |
+
Day 4-5:
|
| 206 |
+
All: Integration testing, bug fixes
|
| 207 |
+
Member 3: UI polish (7.2, 7.3), Deployment (8.1, 8.2, 8.3)
|
| 208 |
+
Member 1 & 2: Evaluation (9.1, 9.2, 9.3)
|
| 209 |
+
Member 3: Documentation (9.4, 9.5)
|
| 210 |
+
```
|
| 211 |
+
|
| 212 |
+
### Key Handoffs
|
| 213 |
+
1. **Member 1 β Member 2**: Qdrant client ready (Day 2)
|
| 214 |
+
2. **Member 1 & 2 β Member 3**: Retrievers ready for pipeline (Day 3)
|
| 215 |
+
3. **Member 3 β All**: Pipeline ready for testing (Day 3-4)
|
| 216 |
+
|
| 217 |
+
---
|
| 218 |
+
|
| 219 |
+
## β
Success Criteria
|
| 220 |
+
|
| 221 |
+
### By December 14 (Mid-checkpoint)
|
| 222 |
+
- [ ] Chunks generated and saved to disk (Member 1)
|
| 223 |
+
- [ ] Qdrant collection created and indexed (Member 1)
|
| 224 |
+
- [ ] BM25 retriever working (Member 2)
|
| 225 |
+
- [ ] Dense retriever working (Member 2)
|
| 226 |
+
- [ ] LLM client + prompts ready (Member 3)
|
| 227 |
+
|
| 228 |
+
### By December 16 (Final Deadline)
|
| 229 |
+
- [ ] Complete RAG pipeline functional
|
| 230 |
+
- [ ] Gradio UI deployed locally
|
| 231 |
+
- [ ] Evaluation examples documented
|
| 232 |
+
- [ ] README.md with usage instructions
|
| 233 |
+
- [ ] Ready for HuggingFace Spaces deployment
|
| 234 |
+
|
| 235 |
+
---
|
| 236 |
+
|
| 237 |
+
## π¨ Risk Mitigation
|
| 238 |
+
|
| 239 |
+
### Risk: Qdrant indexing takes longer than expected
|
| 240 |
+
**Mitigation**: Member 1 starts with small sample (1K papers), scales up gradually
|
| 241 |
+
|
| 242 |
+
### Risk: Dense retriever blocked by Qdrant
|
| 243 |
+
**Mitigation**: Member 2 prioritizes BM25 + Reranking first (no dependencies)
|
| 244 |
+
|
| 245 |
+
### Risk: LLM API rate limits
|
| 246 |
+
**Mitigation**: Member 3 implements retry logic + fallback prompts, tests with small queries
|
| 247 |
+
|
| 248 |
+
### Risk: Integration issues at Day 3
|
| 249 |
+
**Mitigation**: Daily integration checkpoints, mock interfaces early
|
| 250 |
+
|
| 251 |
+
---
|
| 252 |
+
|
| 253 |
+
## π Quick Reference
|
| 254 |
+
|
| 255 |
+
### Useful Make Commands
|
| 256 |
+
```bash
|
| 257 |
+
make install # Install dependencies
|
| 258 |
+
make qdrant-up # Start Qdrant
|
| 259 |
+
make qdrant-down # Stop Qdrant
|
| 260 |
+
make format # Format code
|
| 261 |
+
make lint # Check code quality
|
| 262 |
+
```
|
| 263 |
+
|
| 264 |
+
---
|
| 265 |
+
|
| 266 |
+
**Good luck team! π**
|