# Markdown Splitting Fix - Summary
## Problem
The markdown files with `---` section delimiters were being split at every `#` header, creating many small chunks with insufficient context.
### Example Issue:
```
# Faculty of the Information & Communication Technology Department
```
This header alone was becoming a separate chunk because the default markdown splitter splits on headers.
## Solution Implemented
### 1. Created New Splitter Method: `for_markdown_with_sections()`
**Location:** `app/services/text_splitter.py`
**Custom Separators Priority:**
1. `\n---\n` - Section delimiters (HIGHEST PRIORITY)
2. `\n\n\n` - Triple newlines
3. `\n\n` - Paragraphs
4. `\n` - Single newlines
5. `. ` - Sentences
6. ` ` - Words
7. `""` - Characters (last resort)
Because `\n---\n` is tried first, a section stays in one chunk whenever it fits within `chunk_size`, and headers are no longer split off on their own.
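To illustrate how a separator priority list like the one above behaves, here is a minimal, self-contained sketch of a recursive splitter. It is not the code in `app/services/text_splitter.py`: the function name `split_by_priority` and the constant `SECTION_SEPARATORS` are hypothetical, and chunk overlap is left out to keep the example short.

```python
from typing import List, Sequence

# Separator priority as listed above; "" means "fall back to fixed-size slices".
SECTION_SEPARATORS = ("\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", "")


def _hard_slices(text: str, chunk_size: int) -> List[str]:
    """Last resort: cut the text into fixed-size character slices."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_by_priority(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = SECTION_SEPARATORS,
) -> List[str]:
    """Recursively split text, trying the highest-priority separator first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the first separator that actually occurs in the text.
    for i, sep in enumerate(separators):
        if sep == "":
            return _hard_slices(text, chunk_size)
        if sep in text:
            remaining = separators[i + 1:]
            break
    else:
        return _hard_slices(text, chunk_size)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces so headers stay with their content
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the lower-priority separators.
            chunks.extend(split_by_priority(piece, chunk_size, remaining))
    if current.strip():
        chunks.append(current)
    return chunks


doc = (
    "# Faculty of the ICT Department\n\nDr. A teaches networks."
    "\n---\n"
    "# Labs\n\nLab 1 has 30 machines."
)
chunks = split_by_priority(doc, chunk_size=80)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch each header ends up in the same chunk as the text that follows it, because the document is cut at the `---` delimiter before any lower-priority separator is considered.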
### 2. Updated RAG Service
**Location:** `app/services/rag_service.py` (lines 77-82)
**Changed from:**
```python
markdown_splitter = self.text_splitter.for_markdown(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
```
**Changed to:**
```python
markdown_splitter = TextSplitter.for_markdown_with_sections(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
```
### 3. Updated Document Helpers
**Location:** `app/utils/document_helpers.py` (lines 161-167)
Added auto-detection for markdown with sections:
```python
# Use section-aware splitter if text contains markdown section delimiters
if "\n---\n" in text or text.startswith("---\n"):
    splitter = TextSplitter.for_markdown_with_sections()
else:
    splitter = TextSplitter()
```
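The detection condition can be checked in isolation. The sketch below wraps the same expression in a helper named `has_section_delimiters` (a hypothetical name, not part of the codebase) and shows which inputs it matches. Two things follow directly from the expression: text that begins with YAML front matter (`---\n`) also counts as "having sections", and files with Windows line endings (`\r\n`) are not detected.

```python
def has_section_delimiters(text: str) -> bool:
    """Return True if the text contains a `---` line followed by a newline."""
    return "\n---\n" in text or text.startswith("---\n")


samples = {
    "delimiter between sections": "intro\n---\nnext section",
    "starts with a delimiter": "---\ntitle: x\n---\nbody",
    "inline dashes only": "a --- b",
    "CRLF line endings": "intro\r\n---\r\nnext section",
}
for label, text in samples.items():
    print(f"{label}: {has_section_delimiters(text)}")
```

If CRLF input is expected, normalizing line endings before the check (for example with `text.replace("\r\n", "\n")`) is one possible way to handle it.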
## Expected Results
### Before (with `for_markdown()`):
- **Many small chunks** - Headers split separately
- Example: "# Faculty..." becomes its own chunk containing nothing but the header line
- Poor context for RAG retrieval
### After (with `for_markdown_with_sections()`):
- **Fewer, more meaningful chunks** - Sections kept together
- Headers stay with their content
- Better context for RAG retrieval
- Reduced number of chunks overall
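The difference can be reproduced on a toy document. In the sketch below, the regular expression is only a stand-in for header-based splitting, not the project's `for_markdown()` implementation, and the sample text is made up for illustration.

```python
import re

doc = (
    "# Faculty of the ICT Department\n"
    "## Professors\n"
    "Dr. A teaches networks.\n"
    "---\n"
    "# Labs\n"
    "## Hardware\n"
    "Lab 1 has 30 machines.\n"
)

# Header-based splitting: start a new chunk at every line that begins with '#'.
# Splitting on a zero-width match requires Python 3.7 or newer.
header_chunks = [c for c in re.split(r"(?m)^(?=#)", doc) if c.strip()]

# Section-based splitting: cut only at '---' delimiter lines.
section_chunks = [c for c in doc.split("\n---\n") if c.strip()]

print(len(header_chunks), "chunks when splitting on headers")
print(len(section_chunks), "chunks when splitting on section delimiters")
print(repr(header_chunks[0]))
```

With header-based splitting the first chunk is the top-level header by itself, while section-based splitting keeps that header together with the text beneath it.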
## How to Use
### For File Upload (Already Applied):
When you upload a `.md` file via the POST endpoint, it will automatically:
1. Detect it's a markdown file
2. Use `for_markdown_with_sections()` splitter
3. Keep sections together
### For Raw Text Upload:
When posting raw text with `---` delimiters:
1. The system auto-detects section delimiters
2. Applies the section-aware splitter
3. Preserves semantic structure
## Configuration
You can still adjust chunk size in `app/core/config.py`:
```python
chunk_size: int = 768 # Adjust as needed
chunk_overlap: int = 200 # Adjust overlap
```
## Next Steps
Try uploading your markdown file again. You should see:
- ✅ Fewer total chunks
- ✅ Each chunk contains header + related content
- ✅ Better semantic coherence
- ✅ Improved RAG retrieval quality