# Team Roles & Task Distribution

> **Project**: Scientific Advanced RAG System
> **Team Size**: 3 members
> **Timeline**: December 12-16, 2025
> **Strategy**: Parallel development with clear ownership and minimal dependencies

---

## 👥 Team Structure

### Member 1: Data Pipeline Lead
**Focus**: Data processing, embeddings, vector database infrastructure

### Member 2: Retrieval Engineer
**Focus**: Search algorithms, BM25, dense retrieval, reranking

### Member 3: LLM & Integration Lead
**Focus**: Query processing, LLM integration, RAG pipeline, UI

---

## 📋 Detailed Task Assignments

### 🔹 Member 1: Data Pipeline Lead

#### Phase 2: Chunking Strategy (Priority: HIGH)
- **2.1** Create `scientific_rag/application/chunking/base.py`
  - Abstract `BaseChunker` class
  - Define interface: `chunk(document) -> List[Chunk]`

- **2.2** Implement `scientific_rag/application/chunking/scientific_chunker.py`
  - Section-aware chunking with metadata preservation
  - Normalize section names to enum values
  - Handle LaTeX tokens (`@xmath`)

- **2.3** Create a processing script to generate chunks
  - Batch processing with progress tracking
  - Save to `data/processed/` as JSON/Parquet
  - Generate hash-based chunk IDs
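The 2.1 interface and the hash-based chunk IDs from 2.3 could be sketched roughly as below. This is a minimal illustration, not the project's actual code: the `Chunk` dataclass and `make_chunk_id` helper are hypothetical stand-ins for whatever lives in `scientific_rag/domain`.

```python
import hashlib
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Illustrative chunk record; the real type lives in scientific_rag/domain."""
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


class BaseChunker(ABC):
    """Abstract interface every chunker implements (task 2.1)."""

    @abstractmethod
    def chunk(self, document: dict) -> list[Chunk]:
        """Split one document into chunks with metadata."""


def make_chunk_id(paper_id: str, position: int, text: str) -> str:
    """Deterministic hash-based chunk ID (task 2.3): stable across reruns,
    so re-indexing the same corpus upserts instead of duplicating."""
    digest = hashlib.sha256(f"{paper_id}:{position}:{text}".encode()).hexdigest()
    return digest[:16]
```

Deterministic IDs are the design point here: the same (paper, position, text) triple always maps to the same ID, which makes indexing idempotent.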
#### Phase 3: Embeddings & Vector Database (Priority: HIGH)
- **3.1** Create `scientific_rag/application/embeddings/encoder.py`
  - Singleton pattern for `intfloat/e5-small-v2`
  - Batch embedding support (`batch_size=32`)
  - CPU/GPU device configuration

- **3.3** Implement `scientific_rag/infrastructure/qdrant.py`
  - Qdrant client wrapper (local Docker + cloud support)
  - Collection creation with schema (384-d vectors)
  - Metadata payload: source, section, paper_id, position
  - `upsert_chunks(chunks)` with embeddings
  - `search(query_vector, filters, k)` with filtering

#### Deliverables
- Working chunking pipeline that processes papers → chunks with metadata
- Qdrant collection populated with embedded chunks
- Script to run: `python scripts/process_and_index.py`

**Estimated Time**: 2-3 days

---

### 🔹 Member 2: Retrieval Engineer

#### Phase 3: Retrieval Implementation (Priority: HIGH)
- **3.2** Implement `scientific_rag/application/retrieval/bm25_retriever.py`
  - Use the `rank_bm25` library
  - Tokenization with preprocessing
  - `search(query, k) -> List[Chunk]` interface
  - Score normalization

- **3.4** Implement `scientific_rag/application/retrieval/dense_retriever.py`
  - Semantic search via Qdrant (depends on Member 1's 3.3)
  - Apply metadata filters from `QueryFilters`
  - `search(query, filters, k) -> List[Chunk]`

- **3.5** Implement `scientific_rag/application/retrieval/hybrid_retriever.py`
  - Combine BM25 + dense retrieval
  - Reciprocal Rank Fusion (RRF) or weighted combination
  - Configurable weights: `bm25_weight`, `dense_weight`
  - Toggle switches: `use_bm25`, `use_dense`
  - Deduplication logic
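The RRF combination in 3.5 is small enough to sketch in full. The function name and the weight handling are assumptions (the spec leaves the fusion formula open); the `k=60` smoothing constant is the value commonly used with RRF. Note that summing scores per document ID also handles the deduplication bullet: a chunk returned by both retrievers appears once, with a boosted score.

```python
def rrf_fuse(bm25_ids: list[str], dense_ids: list[str],
             bm25_weight: float = 1.0, dense_weight: float = 1.0,
             k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of w / (k + rank).

    Takes two ranked lists of chunk IDs and returns one fused, deduplicated
    ranking. Ranks are 1-based; k dampens the influence of top ranks.
    """
    scores: dict[str, float] = {}
    for weight, ranked in ((bm25_weight, bm25_ids), (dense_weight, dense_ids)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Each ID is a dict key, so the output contains no duplicates.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `rrf_fuse(["a", "b", "c"], ["b", "d"])` ranks `"b"` first, since it appears in both lists. RRF is attractive here because it needs no score normalization across the two retrievers, only ranks.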
#### Phase 5: Reranking (Priority: MEDIUM)
- **5.1** Implement `scientific_rag/application/reranking/cross_encoder.py`
  - Use `cross-encoder/ms-marco-MiniLM-L6-v2`
  - `rerank(query, chunks, top_k) -> List[Chunk]`
  - Batch processing for efficiency
  - Score-based sorting

#### Phase 9: Evaluation Support (Priority: LOW)
- **9.1** Find queries where BM25 wins
  - Queries with domain-specific terminology
  - Exact-phrase matching examples

- **9.2** Find queries where dense retrieval wins
  - Semantic similarity queries
  - Paraphrased questions

#### Deliverables
- BM25 retriever (can be tested standalone against the chunks)
- Dense retriever (integrates with Member 1's Qdrant)
- Hybrid retriever combining both
- Reranker module
- Comparison analysis: BM25 vs. dense retrieval

**Estimated Time**: 2-3 days

---

### 🔹 Member 3: LLM & Integration Lead

#### Phase 4: Query Processing (Priority: HIGH)
- **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
  - Rule-based metadata filter extraction
  - Regex/keyword matching for source (arxiv/pubmed)
  - Pattern matching for section (introduction/methods/results/conclusion)
  - Return a `QueryFilters` object

- **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
  - LLM-based query variation generation
  - Configurable `expand_to_n` parameter (default: 3)
  - Deduplicate expanded queries

- **4.3** Update `scientific_rag/domain/queries.py`
  - Already done; verify completeness
#### Phase 6: LLM Integration (Priority: HIGH)
- **6.1** Implement `scientific_rag/application/rag/llm_client.py`
  - LiteLLM wrapper for OpenRouter
  - Support `openai/gpt-oss-120b:free`
  - Error handling and retries
  - Optional: response streaming
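The retry logic in 6.1 can be kept independent of the LLM library, so it is testable without network access. A minimal exponential-backoff sketch (function and parameter names are assumptions, not the project's API):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 1.0) -> T:
    """Retry a flaky API call with exponential backoff (1s, 2s, 4s, ...).

    Re-raises the last exception once attempts are exhausted, so callers
    still see the underlying error (e.g. a rate-limit response).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The actual client would wrap its completion call in a zero-argument closure, e.g. `with_retries(lambda: complete(prompt))`; catching a narrower exception type than bare `Exception` would be better once the real error classes are known.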
- **6.2** Create `scientific_rag/application/rag/prompt_templates.py`
  - RAG prompt with context injection
  - Citation-aware prompting ([1], [2] format)
  - System prompt for scientific Q&A
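The citation-aware prompting in 6.2 might look like the following. The template wording, constant names, and `build_prompt` signature are all illustrative, not the project's actual prompts; the one structural idea from the spec is numbering the injected chunks so the model can answer with matching [n] citations.

```python
SYSTEM_PROMPT = (
    "You are a scientific assistant. Answer only from the provided context "
    "and cite sources as [1], [2], ... after each claim."
)

RAG_TEMPLATE = """Context:
{context}

Question: {question}

Answer (with citations):"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject numbered chunks so [n] citations map back to retrieved text."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return RAG_TEMPLATE.format(context=context, question=question)
```

Because each chunk keeps its index, the pipeline's citation-tracking step (6.3) can map a `[2]` in the answer straight back to the second retrieved chunk.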
- **6.3** Implement `scientific_rag/application/rag/pipeline.py`
  - Main `RAGPipeline` orchestration class
  - Full flow: Self-Query → Query Expansion → Retrieve → Rerank → Generate
  - Toggle switches for each component
  - Citation tracking

#### Phase 7: User Interface (Priority: MEDIUM)
- **7.1** Create `demo/main.py` with Gradio
  - Text input for questions
  - API key input field
  - Dropdown for model selection
  - Metadata filter dropdowns (source, section)
  - Component toggle checkboxes
  - Top-k slider, expansion-count slider
  - Output: answer with citations + retrieved chunks

- **7.2** Add a service description
  - RAG system explanation
  - Dataset info (320K papers)

- **7.3** Style and UX improvements
  - Clean layout with loading indicators
  - Error messages

#### Phase 8: Deployment (Priority: LOW)
- **8.1** Create `requirements.txt` for Hugging Face Spaces
- **8.2** Hugging Face Space configuration (`README.md` with YAML front matter)
- **8.3** Deploy and test

#### Phase 9: Documentation (Priority: LOW)
- **9.3** Demonstrate metadata filtering effectiveness
- **9.4** Document the system in `README.md`
- **9.5** Prepare submission materials

#### Deliverables
- Query processing modules (self-query, expansion)
- LLM client with prompt templates
- Complete RAG pipeline
- Gradio UI demo
- Documentation and deployment

**Estimated Time**: 3-4 days

---

## 🔄 Integration Points & Dependencies

### Critical Path
```
Day 1-2:
  Member 1: Chunking (2.1, 2.2, 2.3) → Embeddings (3.1)
  Member 2: BM25 (3.2) [can start immediately]
  Member 3: Self-query (4.1), LLM client (6.1), Prompts (6.2)

Day 2-3:
  Member 1: Qdrant client (3.3) + index chunks [BLOCKER for Member 2's 3.4]
  Member 2: Dense retriever (3.4) [WAIT for 3.3] → Hybrid (3.5)
  Member 3: Query expansion (4.2), Pipeline stub (6.3)

Day 3-4:
  Member 1: Support/testing, optimize indexing
  Member 2: Reranking (5.1) → Integration testing
  Member 3: Complete pipeline (6.3) → Gradio UI (7.1)

Day 4-5:
  All: Integration testing, bug fixes
  Member 3: UI polish (7.2, 7.3), Deployment (8.1, 8.2, 8.3)
  Member 1 & 2: Evaluation (9.1, 9.2, 9.3)
  Member 3: Documentation (9.4, 9.5)
```
### Key Handoffs
1. **Member 1 → Member 2**: Qdrant client ready (Day 2)
2. **Members 1 & 2 → Member 3**: Retrievers ready for the pipeline (Day 3)
3. **Member 3 → All**: Pipeline ready for testing (Day 3-4)

---

## ✅ Success Criteria

### By December 14 (Mid-checkpoint)
- [ ] Chunks generated and saved to disk (Member 1)
- [ ] Qdrant collection created and indexed (Member 1)
- [ ] BM25 retriever working (Member 2)
- [ ] Dense retriever working (Member 2)
- [ ] LLM client + prompts ready (Member 3)

### By December 16 (Final Deadline)
- [ ] Complete RAG pipeline functional
- [ ] Gradio UI running locally
- [ ] Evaluation examples documented
- [ ] `README.md` with usage instructions
- [ ] Ready for Hugging Face Spaces deployment

---

## 🚨 Risk Mitigation

### Risk: Qdrant indexing takes longer than expected
**Mitigation**: Member 1 starts with a small sample (1K papers) and scales up gradually.

### Risk: Dense retriever blocked on Qdrant
**Mitigation**: Member 2 prioritizes BM25 + reranking first (no dependencies).

### Risk: LLM API rate limits
**Mitigation**: Member 3 implements retry logic + fallback prompts and tests with small queries.

### Risk: Integration issues around Day 3
**Mitigation**: Daily integration checkpoints; mock interfaces early.

---

## 📚 Quick Reference

### Useful Make Commands
```bash
make install      # Install dependencies
make qdrant-up    # Start Qdrant
make qdrant-down  # Stop Qdrant
make format       # Format code
make lint         # Check code quality
```

---

**Good luck, team! 🚀**