	# Markdown Splitting Fix - Summary

	## Problem

	Markdown files that use `---` section delimiters were being split at every `#` header, producing many small chunks with too little context.

	### Example Issue:

	```
	# Faculty of the Information & Communication Technology Department
	```

	This header alone was becoming a separate chunk because the default markdown splitter splits on headers.

	## Solution Implemented

	### 1. Created New Splitter Method: `for_markdown_with_sections()`

	Location: `app/services/text_splitter.py`

	Custom Separators Priority:

	1. `\n---\n` - Section delimiters (HIGHEST PRIORITY)
	2. `\n\n\n` - Triple newlines
	3. `\n\n` - Paragraphs
	4. `\n` - Single newlines
	5. `. ` - Sentences
	6. ` ` - Words
	7. `""` - Characters (empty separator, last resort)

	Because `\n---\n` is tried first, a section stays in one chunk whenever it fits within `chunk_size`, and headers are no longer split off on their own.
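The priority order above can be illustrated with a minimal, self-contained sketch. This is not the code in `app/services/text_splitter.py`; `split_by_priority` is a hypothetical helper written only to show how trying `\n---\n` first keeps a header with its section, and it ignores `chunk_overlap` for brevity.

```python
from typing import List, Sequence

# Separator priority copied from the list above; "" means "cut by characters".
SECTION_SEPARATORS = ("\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", "")


def split_by_priority(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = SECTION_SEPARATORS,
) -> List[str]:
    """Split text, always trying the highest-priority separator first."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return split_by_priority(text, chunk_size, rest)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue  # skip empty pieces left by adjacent separators
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Piece is too large on its own: fall back to lower-priority separators.
            chunks.extend(split_by_priority(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Invented sample text: two sections separated by a "---" line.
doc = "# Faculty\n\nAlice teaches networks.\n---\n# Labs\n\nLab A has 30 machines."
chunks = split_by_priority(doc, chunk_size=40)
```

With this sample, each of the two resulting chunks begins with its own header, because the split happens at the `---` line before any lower-priority separator is considered.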

	### 2. Updated RAG Service

	Location: `app/services/rag_service.py` (lines 77-82)

	Changed from:

	```python
	markdown_splitter = self.text_splitter.for_markdown(
	    chunk_size=chunk_size,
	    chunk_overlap=chunk_overlap,
	)
	```

	Changed to:

	```python
	markdown_splitter = TextSplitter.for_markdown_with_sections(
	    chunk_size=chunk_size,
	    chunk_overlap=chunk_overlap,
	)
	```

	### 3. Updated Document Helpers

	Location: `app/utils/document_helpers.py` (lines 161-167)

	Added auto-detection for markdown with sections:

	```python
	# Use section-aware splitter if text contains markdown section delimiters
	if "\n---\n" in text or text.startswith("---\n"):
	    splitter = TextSplitter.for_markdown_with_sections()
	else:
	    splitter = TextSplitter()
	```
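To see which inputs the detection condition accepts, it can be isolated in a small function. The name `has_section_delimiters` is hypothetical and is not part of `document_helpers.py`; the condition itself is copied from the snippet above.

```python
def has_section_delimiters(text: str) -> bool:
    """Same check as above: a '---' line inside the text, or at its very start."""
    return "\n---\n" in text or text.startswith("---\n")


# A '---' line between two sections is detected.
inside = has_section_delimiters("# A\ntext\n---\n# B\nmore")

# YAML front matter also starts with '---\n', so it triggers the check too.
front_matter = has_section_delimiters("---\ntitle: x\n---\nbody")

# Inline dashes and longer rules such as '----' do not match.
inline = has_section_delimiters("a --- b")
longer_rule = has_section_delimiters("a\n----\nb")

# Windows line endings ('\r\n') do not match unless they are normalized first.
crlf = has_section_delimiters("a\r\n---\r\nb")
```

Two edge cases follow from the condition as written: text with front matter is routed to the section-aware splitter, and text with `\r\n` line endings is not, so normalizing line endings before the check may be worth considering.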

	## Expected Results

	### Before (with `for_markdown()`):

	- Many small chunks, with headers split off on their own
	- Example: "# Faculty..." becomes its own chunk of fewer than 100 characters
	- Poor context for RAG retrieval

	### After (with `for_markdown_with_sections()`):

	- Fewer, more meaningful chunks, with sections kept together
	- Headers stay with their content
	- Better context for RAG retrieval
	- Reduced number of chunks overall

	## How to Use

	### For File Upload (Already Applied):

	When you upload a `.md` file via the POST endpoint, it will automatically:

	1. Detect that it is a markdown file
	2. Use the `for_markdown_with_sections()` splitter
	3. Keep sections together

	### For Raw Text Upload:

	When posting raw text with `---` delimiters:

	1. The system auto-detects section delimiters
	2. Applies the section-aware splitter
	3. Preserves semantic structure

	## Configuration

	You can still adjust chunk size in `app/core/config.py`:

	```python
	chunk_size: int = 768  # Adjust as needed
	chunk_overlap: int = 200  # Adjust overlap
	```
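As a rough guide to how these two values interact, the sketch below estimates the chunk count for a plain sliding window. It assumes fixed-size windows measured in the same unit as `chunk_size`; a separator-based splitter cuts at section and paragraph boundaries, so real counts will differ. `estimate_chunk_count` is a hypothetical helper, not part of the project.

```python
import math


def estimate_chunk_count(text_length: int, chunk_size: int = 768, chunk_overlap: int = 200) -> int:
    """Estimate chunks for a sliding window that advances by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new material covered by each extra chunk
    return math.ceil((text_length - chunk_overlap) / stride)


# With the defaults above, each additional chunk adds 568 units of new text.
estimate = estimate_chunk_count(10_000)
```

A larger `chunk_overlap` shrinks the stride, so the same document produces more chunks and more duplicated text in the index.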

	## Next Steps

	Try uploading your markdown file again. You should see:

	- ✅ Fewer total chunks
	- ✅ Each chunk contains header + related content
	- ✅ Better semantic coherence
	- ✅ Improved RAG retrieval quality
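
One way to check the first two outcomes is to count chunks and look for chunks that contain nothing but a header. The helpers below are a hypothetical sketch that works on any list of chunk strings; how the chunks are obtained from the service is left to the reader, and the sample body text is invented.

```python
from statistics import mean
from typing import Dict, List


def is_header_only(chunk: str) -> bool:
    """True when every non-blank line in the chunk is a markdown header."""
    lines = [line.strip() for line in chunk.splitlines() if line.strip()]
    return bool(lines) and all(line.startswith("#") for line in lines)


def summarize_chunks(chunks: List[str]) -> Dict[str, float]:
    """Report total chunks, header-only chunks, and mean chunk length."""
    return {
        "total": len(chunks),
        "header_only": sum(is_header_only(c) for c in chunks),
        "mean_length": mean(len(c) for c in chunks) if chunks else 0.0,
    }


header = "# Faculty of the Information & Communication Technology Department"
body = "Dr. Example teaches computer networks."  # invented sample content

before = summarize_chunks([header, body])          # header split off on its own
after = summarize_chunks([header + "\n\n" + body])  # header kept with its content
```

A `header_only` count of zero after re-uploading suggests that headers are staying with their content.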