DenysKovalML committed
Commit 01a77b0 · 1 Parent(s): 4f0dc81

docs: update metadata part

Files changed (1)
  1. docs/tasks.md +6 -14
docs/tasks.md CHANGED
@@ -170,7 +170,8 @@ scientific-rag/
  - **Paragraph-based splitting**: Split on `\n` boundaries
  - **Overlap strategy**: Add overlap between chunks for context
  - Configurable `chunk_size` (default: 512 tokens) and `chunk_overlap` (default: 50 tokens)
- - **Metadata preservation**: Store source paper index, section name, position
 
  - [ ] **2.3** Create processing script to generate chunks
  - Batch processing with progress tracking
@@ -224,22 +225,13 @@ scientific-rag/
 
  - [ ] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
 
- - Extract metadata filters from natural language queries
  - Detect source preferences: "arxiv papers about..." → filter to arxiv
  - Detect section preferences: "in the methods section..." → filter to methods
- - Use LLM to parse query intent
- - Example prompt:
-
- ```
- Extract search filters from this query. Return JSON with:
- - source: "arxiv", "pubmed", or "any"
- - section: "introduction", "methods", "results", "conclusion", or "any"
-
- Query: {query}
- ```
-
  - Return structured `QueryFilters` object
- - Filters are passed to Qdrant for efficient pre-filtering
 
  - [ ] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
 
 
  - **Paragraph-based splitting**: Split on `\n` boundaries
  - **Overlap strategy**: Add overlap between chunks for context
  - Configurable `chunk_size` (default: 512 tokens) and `chunk_overlap` (default: 50 tokens)
+ - **Metadata preservation**: Store source (arxiv/pubmed), normalized section name, paper_id, position
+ - Normalize section names to enum values (introduction, methods, results, conclusion, other)
 
  - [ ] **2.3** Create processing script to generate chunks
  - Batch processing with progress tracking
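
Review note: the section normalization added in this hunk could look roughly like the sketch below. It is only an illustration under stated assumptions. `SectionType`, `ChunkMetadata`, `normalize_section`, and the keyword table are hypothetical names that do not appear in the commit, and mapping headings such as "Discussion" to `conclusion` is a guess about how the dataset's section titles might be grouped.

```python
from dataclasses import dataclass
from enum import Enum


class SectionType(str, Enum):
    """Enum values listed in the task description."""
    INTRODUCTION = "introduction"
    METHODS = "methods"
    RESULTS = "results"
    CONCLUSION = "conclusion"
    OTHER = "other"


# Hypothetical keyword table (an assumption, not taken from the commit).
# Order matters: the first section whose keyword occurs in the title wins.
_SECTION_KEYWORDS = [
    (SectionType.INTRODUCTION, ("introduction", "background")),
    (SectionType.METHODS, ("method", "materials")),
    (SectionType.RESULTS, ("result", "findings")),
    (SectionType.CONCLUSION, ("conclusion", "discussion", "summary")),
]


def normalize_section(raw_name: str) -> SectionType:
    """Map a free-form section heading to one of the enum values."""
    cleaned = raw_name.strip().lower()
    for section, keywords in _SECTION_KEYWORDS:
        if any(keyword in cleaned for keyword in keywords):
            return section
    return SectionType.OTHER


@dataclass(frozen=True)
class ChunkMetadata:
    """Fields named in the updated 'Metadata preservation' bullet."""
    source: str           # "arxiv" or "pubmed"
    section: SectionType  # normalized section name
    paper_id: str
    position: int         # index of the chunk within its section


meta = ChunkMetadata(
    source="arxiv",
    section=normalize_section("2. Materials and Methods"),
    paper_id="paper-0",  # placeholder identifier for illustration
    position=0,
)
print(meta.section.value)
```

Substring matching keeps the sketch short, since numbered headings such as "IV. RESULTS" need no special handling. A real implementation would probably want a stricter keyword list tuned to the dataset's actual headings.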
 
 
  - [ ] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
 
+ - Extract metadata filters from natural language queries using **rule-based matching**
  - Detect source preferences: "arxiv papers about..." → filter to arxiv
  - Detect section preferences: "in the methods section..." → filter to methods
+ - Use regex/keyword matching
+ - No LLM needed - metadata is already structured in chunks from dataset
  - Return structured `QueryFilters` object
+ - Filters are passed to Qdrant for efficient pre-filtering before vector search
 
  - [ ] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
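
Review note: a minimal sketch of the rule-based self-query that task 4.1 now describes, assuming plain `re` patterns are enough. Only the `QueryFilters` name comes from the task list. Its fields, the `extract_filters` function, and the exact patterns are hypothetical. The sketch treats `None` as "no filter", and it only sets a section filter when the query names a section explicitly ("in the methods section"), so a question that merely mentions "methods" is not filtered.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryFilters:
    """Structured filters; None means 'do not filter on this field'."""
    source: Optional[str] = None   # "arxiv" or "pubmed"
    section: Optional[str] = None  # "introduction", "methods", "results", "conclusion"


# Hypothetical patterns (assumptions for illustration only).
_SOURCE_RE = re.compile(r"\b(arxiv|pubmed)\b", re.IGNORECASE)
_SECTION_RE = re.compile(
    r"\b(introduction|methods?|results?|conclusions?)\s+section\b",
    re.IGNORECASE,
)
_SECTION_ALIASES = {
    "introduction": "introduction",
    "method": "methods",
    "methods": "methods",
    "result": "results",
    "results": "results",
    "conclusion": "conclusion",
    "conclusions": "conclusion",
}


def extract_filters(query: str) -> QueryFilters:
    """Derive metadata filters from a natural language query."""
    # Filter on source only when exactly one source is mentioned.
    sources = {name.lower() for name in _SOURCE_RE.findall(query)}
    source = sources.pop() if len(sources) == 1 else None

    # Filter on section only for explicit "<name> section" phrasing.
    match = _SECTION_RE.search(query)
    section = _SECTION_ALIASES[match.group(1).lower()] if match else None

    return QueryFilters(source=source, section=section)


print(extract_filters("arxiv papers about graph neural networks"))
print(extract_filters("dosage reported in the methods section of PubMed studies"))
```

The resulting object would then need to be translated into whatever filter structure the vector store expects before the search runs. That step is not shown here.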