
Markdown Splitting Fix - Summary

Problem

Markdown files that use --- as a section delimiter were being split at every # header, producing many small chunks with too little context.

Example Issue:

# Faculty of the Information & Communication Technology Department

This header alone was becoming a separate chunk because the default markdown splitter splits on headers.

Solution Implemented

1. Created New Splitter Method: for_markdown_with_sections()

Location: app/services/text_splitter.py

Custom Separators Priority:

  1. "\n---\n" - Section delimiters (HIGHEST PRIORITY)
  2. "\n\n\n" - Triple newlines
  3. "\n\n" - Paragraphs
  4. "\n" - Single newlines
  5. ". " - Sentences
  6. " " - Words
  7. "" - Characters (last resort)

Because the section delimiter is tried first, a section that fits within chunk_size stays in one chunk, and headers are no longer split off from their content.
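
The priority order above can be illustrated with a small recursive splitter. This is a minimal sketch, not the code in app/services/text_splitter.py: the name split_by_priority is hypothetical, sizes are assumed to be measured in characters, and chunk overlap is left out to keep the example short.

```python
SECTION_SEPARATORS = ["\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", ""]


def split_by_priority(text, chunk_size, separators=SECTION_SEPARATORS):
    """Recursively split text, trying higher-priority separators first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Pick the first separator that occurs in the text ("" means hard cut).
    sep, remaining = "", []
    for index, candidate_sep in enumerate(separators):
        if candidate_sep == "" or candidate_sep in text:
            sep, remaining = candidate_sep, separators[index + 1:]
            break

    if sep == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: fall back to lower-priority separators.
            chunks.extend(split_by_priority(piece, chunk_size, remaining))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = "# Faculty\n\nAlice teaches networks.\n---\n# Labs\n\nLab one is open."
sections = split_by_priority(sample, chunk_size=40)
```

With this made-up sample, each header stays in the same chunk as the paragraph that follows it, because the split happens at the --- line before any lower-priority separator is considered.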

2. Updated RAG Service

Location: app/services/rag_service.py (lines 77-82)

Changed from:

markdown_splitter = self.text_splitter.for_markdown(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Changed to:

markdown_splitter = TextSplitter.for_markdown_with_sections(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

3. Updated Document Helpers

Location: app/utils/document_helpers.py (lines 161-167)

Added auto-detection for markdown with sections:

# Use section-aware splitter if text contains markdown section delimiters
if "\n---\n" in text or text.startswith("---\n"):
    splitter = TextSplitter.for_markdown_with_sections()
else:
    splitter = TextSplitter()
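
The detection condition can be tried on its own. The sketch below wraps it in a helper with one assumed addition that is not in the snippet above: line endings are normalized first, so text with Windows line endings is also detected. The name has_section_delimiters is hypothetical.

```python
def has_section_delimiters(text: str) -> bool:
    """Return True if the text uses '---' lines as section delimiters."""
    # Assumption: normalize "\r\n" and "\r" so "\r\n---\r\n" is detected too.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n---\n" in normalized or normalized.startswith("---\n")


print(has_section_delimiters("# A\n\ntext\n---\n# B"))   # delimiter in the middle
print(has_section_delimiters("plain text -- no rule"))   # no delimiter line
```

Note that a file beginning with YAML front matter also starts with "---\n", so it would take the section-aware path as well.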

Expected Results

Before (with for_markdown()):

  • Many small chunks - Headers split separately
  • Example: "# Faculty..." becomes its own chunk of well under 100 characters
  • Poor context for RAG retrieval

After (with for_markdown_with_sections()):

  • Fewer, more meaningful chunks - Sections kept together
  • Headers stay with their content
  • Better context for RAG retrieval
  • Reduced number of chunks overall
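
The difference can be reproduced with a toy comparison. This sketch only simulates the two strategies with plain string operations (a split before every header versus a split at --- lines); it does not call the project's splitters, and the document text is made up for illustration.

```python
import re

HEADER = "# Faculty of the Information & Communication Technology Department"

doc = (
    HEADER + "\n\n"
    "## Professors\n\n"
    "Dr. A teaches networks.\n\n"
    "---\n\n"
    "# Laboratories\n\n"
    "## Networking Lab\n\n"
    "Open on weekdays.\n"
)

# Simulated "before": start a new chunk at every markdown header line.
by_header = [p.strip() for p in re.split(r"(?m)^(?=#{1,6} )", doc) if p.strip()]

# Simulated "after": start a new chunk only at section delimiters.
by_section = [p.strip() for p in doc.split("\n---\n") if p.strip()]

print(len(by_header), "chunks when splitting at headers")
print(len(by_section), "chunks when splitting at section delimiters")
print("header is its own chunk (before):", HEADER in by_header)
print("header is its own chunk (after):", HEADER in by_section)
```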

How to Use

For File Upload (Already Applied):

When you upload a .md file via the POST endpoint, the system will automatically:

  1. Detect that it is a markdown file
  2. Use the for_markdown_with_sections() splitter
  3. Keep sections together

For Raw Text Upload:

When posting raw text with --- delimiters:

  1. The system auto-detects section delimiters
  2. Applies the section-aware splitter
  3. Preserves semantic structure

Configuration

You can still adjust chunk size in app/core/config.py:

chunk_size: int = 768  # Adjust as needed
chunk_overlap: int = 200  # Adjust overlap
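
To get a feel for how these two values interact, the sketch below estimates the number of chunks for a plain sliding window. It assumes sizes are measured in characters and ignores separators, so it is only a rough guide: a separator-aware splitter usually produces chunks shorter than chunk_size and therefore a somewhat different count. The function name estimate_chunk_count is hypothetical.

```python
import math


def estimate_chunk_count(text_length, chunk_size=768, chunk_overlap=200):
    """Rough chunk count for a fixed window that advances by size - overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    stride = chunk_size - chunk_overlap  # 568 new characters per extra chunk
    return 1 + math.ceil((text_length - chunk_size) / stride)


print(estimate_chunk_count(10_000))
```

With the defaults above, each additional chunk contributes 568 new characters, so raising chunk_overlap without raising chunk_size increases the chunk count.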

Next Steps

Try uploading your markdown file again. You should see:

  • ✅ Fewer total chunks
  • ✅ Each chunk contains header + related content
  • ✅ Better semantic coherence
  • ✅ Improved RAG retrieval quality
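
One way to check the first two points after re-uploading is to summarize the resulting chunks. This is a minimal sketch that assumes the chunks are available as a list of strings; how they are retrieved from the service is not covered here, and summarize_chunks is a hypothetical helper.

```python
def summarize_chunks(chunks):
    """Report chunk count, shortest length, and chunks made only of headers."""

    def is_header_only(chunk):
        lines = [line for line in chunk.splitlines() if line.strip()]
        return bool(lines) and all(line.lstrip().startswith("#") for line in lines)

    return {
        "total": len(chunks),
        "shortest": min((len(c) for c in chunks), default=0),
        "header_only": [c for c in chunks if is_header_only(c)],
    }


report = summarize_chunks(["# Faculty", "# Labs\n\nOpen on weekdays."])
print(report)
```

An empty header_only list suggests that every header travelled with at least some of its content.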