Markdown Splitting Fix - Summary
Problem
Markdown files that use --- section delimiters were being split at every # header, which produced many small chunks with too little context.
Example Issue:
# Faculty of the Information & Communication Technology Department
This header alone was becoming a separate chunk because the default markdown splitter splits on headers.
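The failure mode can be reproduced with a minimal sketch. It assumes a naive splitter that starts a new chunk before every header line; `split_on_headers` and the sample text are hypothetical stand-ins, not the project's actual default splitter or data.

```python
import re


def split_on_headers(text: str) -> list[str]:
    """Hypothetical naive splitter: start a new chunk before every markdown header line."""
    # Zero-width split at the start of any line that begins with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]


# Made-up sample document: a top-level header followed directly by a sub-header.
doc = (
    "# Faculty of the Information & Communication Technology Department\n"
    "## Professors\n"
    "Staff list goes here.\n"
)

chunks = split_on_headers(doc)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk holds only the top-level header line, so a retriever that returns it has no body text to work with.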
Solution Implemented
1. Created New Splitter Method: for_markdown_with_sections()
Location: app/services/text_splitter.py
Custom Separators Priority:
- `\n---\n` - Section delimiters (HIGHEST PRIORITY)
- `\n\n\n` - Triple newlines
- `\n\n` - Paragraphs
- `\n` - Single newlines
- `. ` - Sentences
- ` ` - Words
- `` - Characters (last resort)
This ensures sections stay together and headers aren't split separately.
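To make the priority order concrete, here is a minimal, dependency-free sketch of priority-ordered splitting. It is not the code in app/services/text_splitter.py: `split_recursive` is a hypothetical helper, the sample text is made up, and chunk overlap is left out to keep the example short.

```python
SEPARATORS = ["\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", ""]


def split_recursive(text: str, chunk_size: int, separators=SEPARATORS) -> list[str]:
    """Split on the highest-priority separator present; recurse only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # First non-empty separator (in priority order) that occurs in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    rest = separators[separators.index(sep) + 1:]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging neighbours
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too large on its own: fall back to lower-priority separators.
            chunks.extend(split_recursive(piece, chunk_size, rest))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


# Made-up sample: three short sections separated by '---' lines.
doc = (
    "# Faculty\nStaff list goes here."
    "\n---\n"
    "# Courses\nCourse list goes here."
    "\n---\n"
    "# Contact\nOffice hours go here."
)
chunks = split_recursive(doc, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

With a 40-character limit, each section fits on its own, so every chunk starts with its header and carries the body that follows it. Lower-priority separators are only tried when a single section exceeds the limit.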
2. Updated RAG Service
Location: app/services/rag_service.py (lines 77-82)
Changed from:
markdown_splitter = self.text_splitter.for_markdown(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
Changed to:
markdown_splitter = TextSplitter.for_markdown_with_sections(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
3. Updated Document Helpers
Location: app/utils/document_helpers.py (lines 161-167)
Added auto-detection for markdown with sections:
# Use section-aware splitter if text contains markdown section delimiters
if "\n---\n" in text or text.startswith("---\n"):
    splitter = TextSplitter.for_markdown_with_sections()
else:
    splitter = TextSplitter()
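The detection condition can be checked in isolation. The sketch below wraps the same expression in a hypothetical helper, `has_section_delimiters`, and runs it on a few made-up inputs to show which ones count as sectioned markdown.

```python
def has_section_delimiters(text: str) -> bool:
    """Same check as in the snippet above: a '---' line inside the text or at its very start."""
    return "\n---\n" in text or text.startswith("---\n")


# Made-up inputs covering the common cases.
samples = {
    "delimiter between sections": "# A\nbody\n---\n# B\nbody",
    "delimiter on first line": "---\n# A\nbody",
    "no delimiter": "# A\nbody\n\n# B\nbody",
    "Windows line endings": "# A\r\nbody\r\n---\r\n# B\r\nbody",
}

results = {name: has_section_delimiters(text) for name, text in samples.items()}
for name, detected in results.items():
    print(f"{name}: {detected}")
```

The last case is not detected because `\r\n---\r\n` does not contain `\n---\n`. If uploads may carry Windows line endings, normalizing with `text.replace("\r\n", "\n")` before the check is one possible option.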
Expected Results
Before (with for_markdown()):
- Many small chunks
- Headers split separately
- Example: "# Faculty..." becomes its own 50-character chunk
- Poor context for RAG retrieval
After (with for_markdown_with_sections()):
- Fewer, more meaningful chunks
- Sections kept together
- Headers stay with their content
- Better context for RAG retrieval
- Reduced number of chunks overall
How to Use
For File Upload (Already Applied):
When you upload a .md file via the POST endpoint, the system will automatically:
- Detect that it is a markdown file
- Use the for_markdown_with_sections() splitter
- Keep sections together
For Raw Text Upload:
When posting raw text with --- delimiters:
- The system auto-detects section delimiters
- Applies the section-aware splitter
- Preserves semantic structure
Configuration
You can still adjust chunk size in app/core/config.py:
chunk_size: int = 768 # Adjust as needed
chunk_overlap: int = 200 # Adjust overlap
Next Steps
Try uploading your markdown file again. You should see:
- ✅ Fewer total chunks
- ✅ Each chunk contains header + related content
- ✅ Better semantic coherence
- ✅ Improved RAG retrieval quality