| # Markdown Splitting Fix - Summary | |
| ## Problem | |
| Markdown files that use `---` as a section delimiter were being split at every `#` header, which produced many small chunks with too little context. | |
| ### Example Issue: | |
| ``` | |
| # Faculty of the Information & Communication Technology Department | |
| ``` | |
| This header alone was becoming a separate chunk because the default markdown splitter splits on headers. | |
| ## Solution Implemented | |
| ### 1. Created New Splitter Method: `for_markdown_with_sections()` | |
| **Location:** `app/services/text_splitter.py` | |
| **Custom Separators Priority:** | |
| 1. `\n---\n` - Section delimiters (HIGHEST PRIORITY) | |
| 2. `\n\n\n` - Triple newlines | |
| 3. `\n\n` - Paragraphs | |
| 4. `\n` - Single newlines | |
| 5. `. ` - Sentences | |
| 6. ` ` - Words | |
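| 7. `""` - Empty string, i.e. individual characters (last resort) | |
| Because `\n---\n` is tried first, a section stays together whenever it fits within `chunk_size`, and a header is no longer split off from its content. | |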
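To make the priority order concrete, here is a minimal, self-contained sketch of a separator-priority splitter. It is not the code in `app/services/text_splitter.py`: the function name `split_by_priority` is hypothetical, overlap is ignored, and separators are dropped from the output to keep the example short.

```python
# Separators from the list above, highest priority first.
SECTION_SEPARATORS = ["\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", ""]


def split_by_priority(text, chunk_size, separators=SECTION_SEPARATORS):
    """Split text into chunks of at most chunk_size characters.

    The first separator is tried first; lower-priority separators are
    only used for pieces that are still too long. Hypothetical helper,
    assumes chunk_size > 0.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to merge the piece into the chunk being built.
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: fall back to the next separator.
            chunks.extend(split_by_priority(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


doc = "# Faculty\n\nAlice teaches networks.\n---\n# Staff\n\nBob runs the lab."
for chunk in split_by_priority(doc, chunk_size=40):
    print(repr(chunk))
```

With a `chunk_size` of 40 the sample document is cut only at the `---` line, so each header stays attached to the text below it. A section longer than `chunk_size` would still be split further, using the lower-priority separators.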
| ### 2. Updated RAG Service | |
| **Location:** `app/services/rag_service.py` (lines 77-82) | |
| **Changed from:** | |
| ```python | |
| markdown_splitter = self.text_splitter.for_markdown( | |
|     chunk_size=chunk_size, | |
|     chunk_overlap=chunk_overlap | |
| ) | |
| ``` | |
| **Changed to:** | |
| ```python | |
| markdown_splitter = TextSplitter.for_markdown_with_sections( | |
|     chunk_size=chunk_size, | |
|     chunk_overlap=chunk_overlap | |
| ) | |
| ``` | |
| ### 3. Updated Document Helpers | |
| **Location:** `app/utils/document_helpers.py` (lines 161-167) | |
| Added auto-detection for markdown with sections: | |
| ```python | |
| # Use section-aware splitter if text contains markdown section delimiters | |
| if "\n---\n" in text or text.startswith("---\n"): | |
|     splitter = TextSplitter.for_markdown_with_sections() | |
| else: | |
|     splitter = TextSplitter() | |
| ``` | |
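The same check can be isolated into a small function that is easy to test on its own. This is only a sketch: the name `has_section_delimiters` is hypothetical, and normalising Windows line endings before the check is an added assumption, not something the original helper is known to do.

```python
def has_section_delimiters(text: str) -> bool:
    """Return True if text contains a '---' line acting as a section delimiter."""
    # Assumption: treat "\r\n" like "\n" so files saved on Windows are detected too.
    normalized = text.replace("\r\n", "\n")
    return "\n---\n" in normalized or normalized.startswith("---\n")


print(has_section_delimiters("# A\ntext\n---\n# B\ntext"))
print(has_section_delimiters("# A\ntext with a --- dash run"))
```

Only a `---` that occupies a whole line counts; a dash run in the middle of a sentence does not trigger the section-aware splitter.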
| ## Expected Results | |
| ### Before (with `for_markdown()`): | |
| - **Many small chunks** - Headers split separately | |
| - Example: "# Faculty..." becomes its own chunk of fewer than 70 characters | |
| - Poor context for RAG retrieval | |
| ### After (with `for_markdown_with_sections()`): | |
| - **Fewer, more meaningful chunks** - Sections kept together | |
| - Headers stay with their content | |
| - Better context for RAG retrieval | |
| - Reduced number of chunks overall | |
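The difference can be illustrated on a small made-up document. The sketch below does not call the real `for_markdown()` or `for_markdown_with_sections()` methods; `split_on_headers` and `split_on_sections` are hypothetical stand-ins that mimic the two strategies without chunk-size merging, and the sample text is invented.

```python
import re

SAMPLE = (
    "# Faculty of the Information & Communication Technology Department\n"
    "\n"
    "## Professors\n"
    "\n"
    "Hypothetical staff listing goes here.\n"
    "\n"
    "---\n"
    "\n"
    "# Admissions\n"
    "\n"
    "## Requirements\n"
    "\n"
    "Hypothetical admission rules go here.\n"
)


def split_on_headers(text):
    """Start a new chunk at every line that begins with '#'."""
    # Zero-width split points require Python 3.7 or newer.
    parts = re.split(r"^(?=#)", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


def split_on_sections(text):
    """Start a new chunk only at a '---' delimiter line."""
    return [p.strip() for p in text.split("\n---\n") if p.strip()]


header_chunks = split_on_headers(SAMPLE)
section_chunks = split_on_sections(SAMPLE)
print(len(header_chunks), "chunks when splitting on headers")
print(len(section_chunks), "chunks when splitting on sections")
```

In this sample the header-based split leaves the two top-level headers as chunks with no body text, while the section-based split returns one chunk per section with its headers and content together.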
| ## How to Use | |
| ### For File Upload (Already Applied): | |
| When you upload a `.md` file via the POST endpoint, it will automatically: | |
| 1. Detect it's a markdown file | |
| 2. Use the `for_markdown_with_sections()` splitter | |
| 3. Keep sections together | |
| ### For Raw Text Upload: | |
| When posting raw text with `---` delimiters: | |
| 1. The system auto-detects section delimiters | |
| 2. Applies the section-aware splitter | |
| 3. Preserves semantic structure | |
| ## Configuration | |
| You can still adjust the chunk size and overlap in `app/core/config.py`: | |
| ```python | |
| chunk_size: int = 768 # Adjust as needed | |
| chunk_overlap: int = 200 # Adjust overlap | |
| ``` | |
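When tuning these two values, the overlap has to stay below the chunk size, otherwise consecutive chunks would never advance through the text. A minimal sketch of that check follows; the `ChunkSettings` class is hypothetical and is not the settings object in `app/core/config.py`, only the two field names and defaults are taken from the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    """Hypothetical container for the two chunking values shown above."""

    chunk_size: int = 768
    chunk_overlap: int = 200

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    @property
    def step(self):
        """Characters of new text per chunk when chunks are full-size windows."""
        return self.chunk_size - self.chunk_overlap


settings = ChunkSettings()
print(settings.step)
```

With the defaults, each full-size chunk repeats 200 characters of the previous one and contributes 568 new characters, so raising `chunk_overlap` increases the total number of chunks for the same document.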
| ## Next Steps | |
| Try uploading your markdown file again. You should see: | |
| - ✅ Fewer total chunks | |
| - ✅ Each chunk contains header + related content | |
| - ✅ Better semantic coherence | |
| - ✅ Improved RAG retrieval quality | |