Commit ·
01a77b0
1
Parent(s): 4f0dc81
docs: update metadata part
- docs/tasks.md +6 -14
docs/tasks.md
CHANGED
|
@@ -170,7 +170,8 @@ scientific-rag/
|
|
| 170 |
- **Paragraph-based splitting**: Split on `\n` boundaries
|
| 171 |
- **Overlap strategy**: Add overlap between chunks for context
|
| 172 |
- Configurable `chunk_size` (default: 512 tokens) and `chunk_overlap` (default: 50 tokens)
|
| 173 |
-
- **Metadata preservation**: Store source
|
|
|
|
| 174 |
|
| 175 |
- [ ] **2.3** Create processing script to generate chunks
|
| 176 |
- Batch processing with progress tracking
|
|
@@ -224,22 +225,13 @@ scientific-rag/
|
|
| 224 |
|
| 225 |
- [ ] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
|
| 226 |
|
| 227 |
-
- Extract metadata filters from natural language queries
|
| 228 |
- Detect source preferences: "arxiv papers about..." → filter to arxiv
|
| 229 |
- Detect section preferences: "in the methods section..." → filter to methods
|
| 230 |
-
- Use
|
| 231 |
-
-
|
| 232 |
-
|
| 233 |
-
```
|
| 234 |
-
Extract search filters from this query. Return JSON with:
|
| 235 |
-
- source: "arxiv", "pubmed", or "any"
|
| 236 |
-
- section: "introduction", "methods", "results", "conclusion", or "any"
|
| 237 |
-
|
| 238 |
-
Query: {query}
|
| 239 |
-
```
|
| 240 |
-
|
| 241 |
- Return structured `QueryFilters` object
|
| 242 |
-
- Filters are passed to Qdrant for efficient pre-filtering
|
| 243 |
|
| 244 |
- [ ] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
|
| 245 |
|
|
|
|
| 170 |
- **Paragraph-based splitting**: Split on `\n` boundaries
|
| 171 |
- **Overlap strategy**: Add overlap between chunks for context
|
| 172 |
- Configurable `chunk_size` (default: 512 tokens) and `chunk_overlap` (default: 50 tokens)
|
| 173 |
+
- **Metadata preservation**: Store source (arxiv/pubmed), normalized section name, paper_id, position
|
| 174 |
+
- Normalize section names to enum values (introduction, methods, results, conclusion, other)
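The normalization step added in this hunk could be sketched as follows. This is a minimal illustration, not the project's implementation: the `Section` enum values come from the task list, but `ChunkMetadata`, `normalize_section`, and the keyword table are hypothetical names and heuristics chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Section(str, Enum):
    """Normalized section names listed in the task description."""
    INTRODUCTION = "introduction"
    METHODS = "methods"
    RESULTS = "results"
    CONCLUSION = "conclusion"
    OTHER = "other"


# Hypothetical keyword table (an assumption for this sketch); a real dataset
# will likely need a longer list tuned to its section headings.
_SECTION_KEYWORDS = {
    Section.INTRODUCTION: ("introduction", "background"),
    Section.METHODS: ("method", "materials"),
    Section.RESULTS: ("result", "finding"),
    Section.CONCLUSION: ("conclusion", "summary"),
}


def normalize_section(raw_name: str) -> Section:
    """Map a raw heading such as '2. Methods' to a Section value."""
    cleaned = raw_name.strip().lower()
    # The first keyword group found as a substring wins; anything else is OTHER.
    for section, keywords in _SECTION_KEYWORDS.items():
        if any(keyword in cleaned for keyword in keywords):
            return section
    return Section.OTHER


@dataclass(frozen=True)
class ChunkMetadata:
    """Metadata fields named in the task list; the class name is hypothetical."""
    source: str        # "arxiv" or "pubmed"
    section: Section   # normalized section name
    paper_id: str
    position: int      # index of the chunk within its paper


meta = ChunkMetadata(
    source="arxiv",
    section=normalize_section("2. Methods"),
    paper_id="example-paper",  # placeholder id, not a real paper
    position=0,
)
```

Substring matching keeps the sketch short, but a heading that mentions two sections (for example "Results and Conclusions") resolves to whichever group appears first in the table.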
|
| 175 |
|
| 176 |
- [ ] **2.3** Create processing script to generate chunks
|
| 177 |
- Batch processing with progress tracking
|
|
|
|
| 225 |
|
| 226 |
- [ ] **4.1** Implement `scientific_rag/application/query_processing/self_query.py`
|
| 227 |
|
| 228 |
+
- Extract metadata filters from natural language queries using **rule-based matching**
|
| 229 |
- Detect source preferences: "arxiv papers about..." → filter to arxiv
|
| 230 |
- Detect section preferences: "in the methods section..." → filter to methods
|
| 231 |
+
- Use regex/keyword matching
|
| 232 |
+
- No LLM needed - metadata is already structured in chunks from dataset
|
| 233 |
- Return structured `QueryFilters` object
|
| 234 |
+
- Filters are passed to Qdrant for efficient pre-filtering before vector search
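A rule-based extractor along the lines described in task 4.1 might look like the sketch below. Only the `QueryFilters` name comes from the task list; its fields, the regex patterns, the `extract_filters` and `to_qdrant_filter` helpers, and the payload keys `source` and `section` are assumptions made for illustration. The filter is built as a plain dictionary in Qdrant's JSON filter form (`must` with `key`/`match`/`value` conditions), so the example runs without the Qdrant client installed.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns; the task list only says "regex/keyword matching".
_SOURCE_PATTERNS = {
    "arxiv": re.compile(r"\barxiv\b", re.IGNORECASE),
    "pubmed": re.compile(r"\bpubmed\b", re.IGNORECASE),
}
# Require the word "section" so that "what method is used" is not treated
# as a section preference.
_SECTION_PATTERN = re.compile(
    r"\b(introduction|methods?|results?|conclusions?)\s+sections?\b",
    re.IGNORECASE,
)
_SECTION_CANONICAL = {
    "introduction": "introduction",
    "method": "methods",
    "methods": "methods",
    "result": "results",
    "results": "results",
    "conclusion": "conclusion",
    "conclusions": "conclusion",
}


@dataclass(frozen=True)
class QueryFilters:
    """Structured filters; None means no preference was detected."""
    source: Optional[str] = None
    section: Optional[str] = None


def extract_filters(query: str) -> QueryFilters:
    # Apply a source filter only when exactly one source is mentioned.
    sources = [name for name, pat in _SOURCE_PATTERNS.items() if pat.search(query)]
    source = sources[0] if len(sources) == 1 else None

    match = _SECTION_PATTERN.search(query)
    section = _SECTION_CANONICAL[match.group(1).lower()] if match else None
    return QueryFilters(source=source, section=section)


def to_qdrant_filter(filters: QueryFilters) -> Optional[dict]:
    """Build a Qdrant-style JSON filter, or None when nothing was detected."""
    pairs = (("source", filters.source), ("section", filters.section))
    conditions = [
        {"key": key, "match": {"value": value}}
        for key, value in pairs
        if value is not None
    ]
    return {"must": conditions} if conditions else None


filters = extract_filters("arxiv papers about graph neural networks")
qdrant_filter = to_qdrant_filter(filters)
```

Returning `None` when no preference is detected lets the caller skip filtering entirely and fall back to an unfiltered vector search.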
|
| 235 |
|
| 236 |
- [ ] **4.2** Implement `scientific_rag/application/query_processing/query_expansion.py`
|
| 237 |
|