# Markdown Splitting Fix - Summary

## Problem

Markdown files that use `---` section delimiters were being split at every `#` header, creating many small chunks with insufficient context.

### Example Issue:

```
# Faculty of the Information & Communication Technology Department
```

This header alone was becoming a separate chunk because the default markdown splitter splits on headers.

## Solution Implemented

### 1. Created New Splitter Method: `for_markdown_with_sections()`

**Location:** `app/services/text_splitter.py`

**Custom Separators Priority:**

1. `\n---\n` - Section delimiters (HIGHEST PRIORITY)
2. `\n\n\n` - Triple newlines
3. `\n\n` - Paragraphs
4. `\n` - Single newlines
5. `. ` - Sentences
6. ` ` - Words
7. `""` (empty string) - Characters (last resort)

Because section delimiters are tried first, sections stay together and headers are no longer split off from their content.

### 2. Updated RAG Service

**Location:** `app/services/rag_service.py` (lines 77-82)

**Changed from:**

```python
markdown_splitter = self.text_splitter.for_markdown(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
```

**Changed to:**

```python
markdown_splitter = TextSplitter.for_markdown_with_sections(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
```

### 3. Updated Document Helpers

**Location:** `app/utils/document_helpers.py` (lines 161-167)

Added auto-detection for markdown with sections:

```python
# Use section-aware splitter if text contains markdown section delimiters
if "\n---\n" in text or text.startswith("---\n"):
    splitter = TextSplitter.for_markdown_with_sections()
else:
    splitter = TextSplitter()
```

## Expected Results

### Before (with `for_markdown()`):

- **Many small chunks**
- Headers split separately
- Example: "# Faculty..." becomes its own 50-character chunk
- Poor context for RAG retrieval

### After (with `for_markdown_with_sections()`):

- **Fewer, more meaningful chunks**
- Sections kept together
- Headers stay with their content
- Better context for RAG retrieval
- Reduced number of chunks overall

## How to Use

### For File Upload (Already Applied):

When you upload a `.md` file via the POST endpoint, the system automatically:

1. Detects that it is a markdown file
2. Uses the `for_markdown_with_sections()` splitter
3. Keeps sections together

### For Raw Text Upload:

When you post raw text with `---` delimiters, the system:

1. Auto-detects the section delimiters
2. Applies the section-aware splitter
3. Preserves semantic structure

## Configuration

You can still adjust the chunk size and overlap in `app/core/config.py`:

```python
chunk_size: int = 768  # Adjust as needed
chunk_overlap: int = 200  # Adjust overlap
```

## Next Steps

Try uploading your markdown file again. You should see:

- ✅ Fewer total chunks
- ✅ Each chunk contains header + related content
- ✅ Better semantic coherence
- ✅ Improved RAG retrieval quality
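
To make the separator priority and the auto-detection described under "Solution Implemented" more concrete, here is a minimal, self-contained sketch. It is not the actual `for_markdown_with_sections()` implementation: the names `SECTION_SEPARATORS`, `has_section_delimiters`, and `split_with_sections` are hypothetical, chunk overlap is left out for brevity, and the real splitter may merge pieces differently.

```python
from typing import List

# Separator priority as listed in this summary, highest priority first.
# The names below are hypothetical and exist only for this sketch.
SECTION_SEPARATORS = ["\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", ""]


def has_section_delimiters(text: str) -> bool:
    """Same check as the auto-detection snippet in document_helpers.py."""
    return "\n---\n" in text or text.startswith("---\n")


def split_with_sections(
    text: str,
    chunk_size: int = 768,
    separators: List[str] = SECTION_SEPARATORS,
) -> List[str]:
    """Recursively split text, trying separators in priority order.

    Assumption: no chunk overlap is applied in this simplified version.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the highest-priority non-empty separator that occurs in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # Last resort: cut into fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    remaining = separators[separators.index(sep) + 1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Merge neighbouring pieces while they still fit in one chunk.
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is too large: fall back to lower-priority separators.
            chunks.extend(split_with_sections(piece, chunk_size, remaining))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = (
        "# Faculty\n\nAlice teaches networks."
        "\n---\n"
        "# Courses\n\nBob teaches databases."
    )
    if has_section_delimiters(sample):
        for chunk in split_with_sections(sample, chunk_size=40):
            print(repr(chunk))
```

Because `\n---\n` is tried before the paragraph and line separators, a header is only separated from its body when the whole section exceeds `chunk_size`; in that case the sketch falls back to the lower-priority separators for that section alone.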