# Markdown Splitting Fix - Summary

## Problem

Markdown files that use `---` as a section delimiter were being split at every `#` header, which produced many small chunks with too little context.

### Example Issue:

```
# Faculty of the Information & Communication Technology Department
```

This header became a chunk on its own, with none of the content beneath it, because the default markdown splitter splits on headers.

## Solution Implemented

### 1. Created New Splitter Method: `for_markdown_with_sections()`

**Location:** `app/services/text_splitter.py`

**Custom Separators Priority:**

1. `\n---\n` - Section delimiters (HIGHEST PRIORITY)
2. `\n\n\n` - Triple newlines
3. `\n\n` - Paragraphs
4. `\n` - Single newlines
5. `. ` - Sentences
6. ` ` - Words
7. `""` - Empty string, i.e. individual characters (last resort)

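Because `\n---\n` is tried first, text is cut at section boundaries before any finer separator is considered, so a header stays in the same chunk as the content below it whenever the section fits within the chunk size.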
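The separator priority above can be illustrated with a minimal, dependency-free sketch. This is not the code in `app/services/text_splitter.py`. The function name `split_with_sections` and the sample text are assumptions made for illustration, and chunk overlap is left out to keep the example short.

```python
from typing import List, Sequence

# Separator priority from the list above; "" means "split into characters".
SECTION_SEPARATORS = ("\n---\n", "\n\n\n", "\n\n", "\n", ". ", " ", "")


def split_with_sections(
    text: str,
    chunk_size: int = 768,
    separators: Sequence[str] = SECTION_SEPARATORS,
) -> List[str]:
    """Hypothetical sketch: split on the highest-priority separator present."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # First separator that occurs in the text ("" always applies).
    sep = next((s for s in separators if s == "" or s in text), None)
    if sep is None:
        # Nothing applies: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    remaining = separators[separators.index(sep) + 1:]

    pieces = list(text) if sep == "" else text.split(sep)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with lower-priority separators.
            chunks.extend(split_with_sections(piece, chunk_size, remaining))
    if current.strip():
        chunks.append(current)
    return chunks


# Made-up sample document with one `---` delimiter.
sample = (
    "# Faculty of ICT\n\nDr. A teaches networks."
    "\n---\n"
    "# Admissions\n\nApply by June."
)
chunks = split_with_sections(sample, chunk_size=50)
for chunk in chunks:
    print(repr(chunk))
```

With a chunk size of 50, the sample is cut only at the `---` line, so each header remains attached to its paragraph.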

### 2. Updated RAG Service

**Location:** `app/services/rag_service.py` (lines 77-82)

**Changed from:**

```python
markdown_splitter = self.text_splitter.for_markdown(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
```

**Changed to:**

```python
markdown_splitter = TextSplitter.for_markdown_with_sections(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
```

### 3. Updated Document Helpers

**Location:** `app/utils/document_helpers.py` (lines 161-167)

Added auto-detection for markdown with sections:

```python
# Use section-aware splitter if text contains markdown section delimiters
if "\n---\n" in text or text.startswith("---\n"):
    splitter = TextSplitter.for_markdown_with_sections()
else:
    splitter = TextSplitter()
```
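One case the check above does not match is text with Windows line endings, because `"\n---\n"` is not a substring of `"\r\n---\r\n"`. If such input is possible, the detection could normalize line endings first. The helper below is a hedged sketch; `has_section_delimiters` is a hypothetical name and is not part of `document_helpers.py`.

```python
def has_section_delimiters(text: str) -> bool:
    """Hypothetical helper: True if a `---` line acts as a section delimiter."""
    # Normalize CRLF and bare CR so "\r\n---\r\n" is detected as well.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n---\n" in normalized or normalized.startswith("---\n")


print(has_section_delimiters("intro\n---\nbody"))      # Unix line endings
print(has_section_delimiters("intro\r\n---\r\nbody"))  # Windows line endings
print(has_section_delimiters("a --- b"))               # dashes inside a line
```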

## Expected Results

### Before (with `for_markdown()`):

- **Many small chunks** - Headers split separately
- Example: "# Faculty..." becomes its own chunk of about 66 characters
- Poor context for RAG retrieval

### After (with `for_markdown_with_sections()`):

- **Fewer, more meaningful chunks** - Sections kept together
- Headers stay with their content
- Better context for RAG retrieval
- Reduced number of chunks overall
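The difference can be seen on a small, made-up document. The sketch below is a simplified stand-in for both splitters: it ignores chunk size and overlap and only contrasts cutting at headers with cutting at `---` lines. The sample text and variable names are assumptions for illustration.

```python
import re

# Made-up sample document with one `---` section delimiter.
doc = (
    "# Faculty of the Information & Communication Technology Department\n"
    "\n"
    "## Professors\n"
    "\n"
    "Dr. Example teaches networking.\n"
    "---\n"
    "# Admissions\n"
    "\n"
    "Applications open in June.\n"
)

# "Before": cut in front of every header line, as a header-based splitter would.
header_parts = re.split(r"^(?=#{1,6} )", doc, flags=re.MULTILINE)
header_chunks = [part.strip() for part in header_parts if part.strip()]

# "After": cut only at the `---` section delimiters.
section_chunks = [part.strip() for part in doc.split("\n---\n") if part.strip()]

print(len(header_chunks), "chunks when splitting on headers")
print(len(section_chunks), "chunks when splitting on sections")
print(repr(header_chunks[0]))  # the top-level header on its own
```

In this sample, header splitting leaves the top-level header alone in its chunk, while section splitting keeps it together with the `## Professors` subsection.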

## How to Use

### For File Upload (Already Applied):

When you upload a `.md` file via the POST endpoint, the system automatically:

1. Detects that it is a markdown file
2. Uses the `for_markdown_with_sections()` splitter
3. Keeps sections together

### For Raw Text Upload:

When posting raw text with `---` delimiters:

1. The system auto-detects section delimiters
2. Applies the section-aware splitter
3. Preserves semantic structure

## Configuration

You can still adjust the chunk size and overlap in `app/core/config.py`:

```python
chunk_size: int = 768  # Adjust as needed
chunk_overlap: int = 200  # Adjust overlap
```

## Next Steps

Try uploading your markdown file again. You should see:

- ✅ Fewer total chunks
- ✅ Each chunk contains header + related content
- ✅ Better semantic coherence
- ✅ Improved RAG retrieval quality