
Markdown Splitting Fix - Summary

Problem

Markdown files that use `---` section delimiters were being split at every `#` header, producing many small chunks with too little context.

Example Issue:

# Faculty of the Information & Communication Technology Department

This header alone was becoming a separate chunk because the default markdown splitter splits on headers.

Solution Implemented

1. Created New Splitter Method: `for_markdown_with_sections()`

Location: app/services/text_splitter.py

Custom Separators Priority:

`\n---\n` - Section delimiters (highest priority)
`\n\n\n` - Triple newlines
`\n\n` - Paragraphs
`\n` - Single newlines
`. ` - Sentences
` ` - Words
`""` - Characters (last resort)

This ensures sections stay together and headers aren't split separately.
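To make the separator priority concrete, here is a minimal, dependency-free sketch of recursive splitting with that list. It is not the code in `app/services/text_splitter.py`: the function name `split_with_sections` is hypothetical, and chunk overlap is left out to keep the example short.

```python
from typing import List, Sequence

# Separators in priority order, mirroring the list above.
SECTION_SEPARATORS = ("\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", "")


def split_with_sections(
    text: str,
    chunk_size: int = 768,
    separators: Sequence[str] = SECTION_SEPARATORS,
) -> List[str]:
    """Split text, always trying the highest-priority separator first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return split_with_sections(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: fall back to lower-priority separators.
            chunks.extend(split_with_sections(piece, chunk_size, rest))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


doc = "# Faculty\n\nAlice teaches ML.\n---\n# Labs\n\nLab one has GPUs."
chunks = split_with_sections(doc, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Because `\n---\n` is tried first, each header in the sample stays in the same chunk as the paragraph that follows it. Lower-priority separators are used only when a single section exceeds `chunk_size`.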

2. Updated RAG Service

Location: app/services/rag_service.py (lines 77-82)

Changed from:

markdown_splitter = self.text_splitter.for_markdown(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Changed to:

markdown_splitter = TextSplitter.for_markdown_with_sections(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

3. Updated Document Helpers

Location: app/utils/document_helpers.py (lines 161-167)

Added auto-detection for markdown with sections:

# Use section-aware splitter if text contains markdown section delimiters
if "\n---\n" in text or text.startswith("---\n"):
    splitter = TextSplitter.for_markdown_with_sections()
else:
    splitter = TextSplitter()
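The detection condition can be pulled out and exercised on its own. The sketch below wraps it in a hypothetical helper, `has_section_delimiters`, and adds an assumed `normalize_newlines` step. Neither name comes from the project; they only illustrate which inputs the check does and does not match.

```python
def has_section_delimiters(text: str) -> bool:
    """Same condition as the auto-detection check above."""
    return "\n---\n" in text or text.startswith("---\n")


def normalize_newlines(text: str) -> str:
    """Hypothetical pre-processing step: convert Windows line endings to \\n."""
    return text.replace("\r\n", "\n")


unix_doc = "Section one\n---\nSection two"
windows_doc = "Section one\r\n---\r\nSection two"
plain_doc = "# Title\n\nJust a paragraph."

print(has_section_delimiters(unix_doc))
print(has_section_delimiters(windows_doc))
print(has_section_delimiters(normalize_newlines(windows_doc)))
print(has_section_delimiters(plain_doc))
```

Two behaviours are worth knowing. Text with `\r\n` line endings does not match unless it is normalized first, and a file that opens with YAML front matter also starts with `---\n`, so it would be routed to the section-aware splitter.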

Expected Results

Before (with `for_markdown()`):

Many small chunks - Headers split separately
Example: "# Faculty..." becomes its own chunk of only a few dozen characters
Poor context for RAG retrieval

After (with `for_markdown_with_sections()`):

Fewer, more meaningful chunks - Sections kept together
Headers stay with their content
Better context for RAG retrieval
Reduced number of chunks overall
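The before/after difference can be reproduced on a toy document. In the sketch below, the header-first split is a simplified regex stand-in for header-based splitting, not the project's actual `for_markdown()` implementation, and the sample text is invented for illustration.

```python
import re

doc = (
    "# Faculty of the ICT Department\n\n"
    "## Professors\n\nDr. A teaches networks.\n"
    "\n---\n"
    "# Labs\n\n## Hardware\n\nTwo GPU servers.\n"
)

# Header-first: cut before every line that starts with a markdown header.
header_chunks = [
    part.strip()
    for part in re.split(r"(?m)^(?=#{1,6} )", doc)
    if part.strip()
]

# Section-first: cut only at the --- delimiter lines.
section_chunks = [part.strip() for part in doc.split("\n---\n") if part.strip()]

print(len(header_chunks), "chunks when splitting on headers")
print(len(section_chunks), "chunks when splitting on sections")
print(repr(header_chunks[0]))
```

With header-first splitting, the first chunk is the bare title line with no body text. With section-first splitting, each chunk carries its headers together with the content beneath them.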

How to Use

For File Upload (Already Applied):

When you upload a .md file via the POST endpoint, it will automatically:

Detect it's a markdown file
Use for_markdown_with_sections() splitter
Keep sections together

For Raw Text Upload:

When posting raw text with --- delimiters:

The system auto-detects section delimiters
Applies the section-aware splitter
Preserves semantic structure

Configuration

You can still adjust the chunk size and overlap in app/core/config.py:

chunk_size: int = 768  # Adjust as needed
chunk_overlap: int = 200  # Adjust overlap
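When changing these values, keep the overlap smaller than the chunk size, since a sliding-window splitter cannot advance otherwise. A minimal sketch of guarding that constraint follows. `ChunkSettings` is a hypothetical class for illustration, and the real settings object in `app/core/config.py` may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    """Hypothetical container for the two values shown above."""

    chunk_size: int = 768
    chunk_overlap: int = 200

    def __post_init__(self) -> None:
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")


defaults = ChunkSettings()
print(defaults.chunk_size, defaults.chunk_overlap)

try:
    ChunkSettings(chunk_size=100, chunk_overlap=100)
except ValueError as err:
    print("rejected:", err)
```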

Next Steps

Try uploading your markdown file again. You should see:

✅ Fewer total chunks
✅ Each chunk contains header + related content
✅ Better semantic coherence
✅ Improved RAG retrieval quality