Chunking Strategies Implementation Guide

Overview

Two new chunking strategies have been added to the RAG Capstone Project:

Row-Based Chunking - For structured data (tables, CSV, rows)
Entity-Based Chunking - For semantic units (paragraphs, sections, headers)

Available Chunking Strategies

1. Dense Chunking ✅ (Existing)

Use Case: General-purpose documents, fixed-size chunking
Method: Fixed-size chunks with overlap, breaks at sentence boundaries
Best For: Linear documents, books, articles
Chunk Size: 512 (recommended)
Overlap: 50 (recommended)

2. Sparse Chunking ✅ (Existing)

Use Case: Semantic-aware chunking
Method: Splits by paragraphs/double newlines, keeps semantic units
Best For: Articles with clear paragraphs
Chunk Size: 512 (recommended)
Overlap: 50 (recommended)

3. Hybrid Chunking ✅ (Existing)

Use Case: Combines dense and sparse approaches
Method: First splits by semantics, then applies fixed-size chunking
Best For: Mixed document types
Chunk Size: 512 (recommended)
Overlap: 50 (recommended)

4. Re-Ranking Chunking ✅ (Existing)

Use Case: Query-aware chunk ranking
Method: Hybrid chunking + BM25 re-ranking
Best For: Finding most relevant chunks for specific queries
Chunk Size: 512 (recommended)
Overlap: 50 (recommended)

5. Row-Based Chunking 🆕 (NEW)

Use Case: Structured data with row delimiters
Method: Splits by rows/lines (newline delimiters)
Best For:
- CSV files
- Tables
- Structured database outputs
- Line-oriented data (logs, scripts)
- Tab-separated values (TSV)
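Chunk Size: 512 characters (recommended)
Overlap: 50 characters (recommended; trailing rows are repeated in the next chunk only while their combined length fits within the overlap)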
Advantages:
- Preserves row structure
- No row is split across chunks
- Maintains semantic integrity of each row
- Perfect for tabular data

Algorithm:

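1. Split text by newlines into rows
2. Accumulate rows until adding the next row would exceed chunk_size
3. Save the accumulated rows as a chunk
4. Apply overlap by starting the next chunk with the trailing rows that fit within overlap characters
5. Continue until all rows are processed, then save any remaining rows as the final chunk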

Example:

Input (CSV):
id,name,age
1,John,25
2,Jane,30
3,Bob,35

Output Chunks (chunk_size=35, overlap=10):
Chunk 1: "id,name,age\n1,John,25\n2,Jane,30"
Chunk 2: "2,Jane,30\n3,Bob,35"
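
The row-based steps above can be sketched as a standalone function. This is a minimal sketch rather than the project's class: the name `chunk_rows` is hypothetical, sizes are counted in characters with one extra character per row for the newline, and `chunk_size=35` with `overlap=10` is chosen so that the four sample rows split into two chunks.

```python
from typing import List


def chunk_rows(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Group whole rows into chunks, targeting chunk_size characters per chunk."""
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0

    for row in text.split("\n"):
        row_size = len(row) + 1  # +1 for the newline
        if current and current_size + row_size > chunk_size:
            chunks.append("\n".join(current))
            # Carry over trailing rows whose combined size fits in `overlap`.
            carried: List[str] = []
            carried_size = 0
            for prev in reversed(current):
                if carried_size + len(prev) + 1 > overlap:
                    break
                carried.insert(0, prev)
                carried_size += len(prev) + 1
            current, current_size = carried, carried_size
        current.append(row)
        current_size += row_size

    if current:
        chunks.append("\n".join(current))
    return chunks


csv_text = "id,name,age\n1,John,25\n2,Jane,30\n3,Bob,35"
for i, chunk in enumerate(chunk_rows(csv_text, chunk_size=35, overlap=10), start=1):
    print(f"Chunk {i}: {chunk!r}")
```

With these parameters the sketch yields two chunks, and the row `2,Jane,30` appears in both because its 10 characters (including the newline) fit exactly within the overlap. A row longer than chunk_size is never cut; it simply becomes an oversized chunk.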

6. Entity-Based Chunking 🆕 (NEW)

Use Case: Semantic entity-based chunking
Method: Splits by entities (paragraphs, sections, headers)
Best For:
- Documents with clear semantic units
- Markdown files with headers
- Technical documentation
- Sections with key-value pairs
- Posts, articles with clear structure
Chunk Size: 512 characters (recommended)
Overlap: 50 characters (recommended; the last entity is repeated in the next chunk only if it is no longer than the overlap)
Advantages:
- Respects semantic boundaries
- No entity is split across chunks
- Perfect for header-based docs
- Preserves context between sections

Algorithm:

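1. Identify entities using patterns:
   - Paragraph breaks (double newlines)
   - Section headers (lines starting with one to six # characters followed by whitespace)
2. Accumulate entities until adding the next entity would exceed chunk_size
3. Save the accumulated entities as a chunk, joined by blank lines
4. Apply overlap by starting the next chunk with the last entity, provided its length does not exceed overlap
5. Continue until all entities are processed, then save any remaining entities as the final chunk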

Example:

Input (Markdown):
# Section 1
This is paragraph 1.

This is paragraph 2.

# Section 2
This is paragraph 3.

Output Chunks (chunk_size=60, overlap=20):
Chunk 1: "# Section 1\nThis is paragraph 1.\n\nThis is paragraph 2."
Chunk 2: "This is paragraph 2.\n\n# Section 2\nThis is paragraph 3."
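
Step 1 of the entity-based algorithm, identifying entities, can be tried in isolation. The sketch below reuses the regular expression from the implementation in this guide; the names `ENTITY_PATTERN` and `split_entities` are hypothetical, and splitting on the zero-width header lookahead assumes Python 3.7 or later.

```python
import re
from typing import List

# Split at blank lines, or immediately before a Markdown header (# through ######).
ENTITY_PATTERN = re.compile(r"(?:\n\n+|(?=^#{1,6}\s))", flags=re.MULTILINE)


def split_entities(text: str) -> List[str]:
    """Return the non-empty paragraph/header blocks found in text."""
    return [part.strip() for part in ENTITY_PATTERN.split(text) if part.strip()]


markdown_text = (
    "# Section 1\nThis is paragraph 1.\n\n"
    "This is paragraph 2.\n\n"
    "# Section 2\nThis is paragraph 3."
)
for entity in split_entities(markdown_text):
    print(len(entity), repr(entity))
```

The sample produces three entities of 32, 20, and 32 characters. A header and the paragraph directly beneath it (separated by a single newline) stay together as one entity, which is why the second paragraph of Section 1 is the unit that gets repeated as overlap.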

When to Use Which Strategy

| Data Type | Strategy | Reason |
|---|---|---|
| CSV/Tables | Row-Based | Preserves row integrity |
| Logs/Line data | Row-Based | Each line is atomic |
| Markdown docs | Entity-Based | Respects header structure |
| Technical docs | Entity-Based | Sections are semantic units |
| Blog posts | Entity-Based | Paragraphs are natural units |
| Books/Articles | Dense | Long continuous text |
| Mixed documents | Hybrid | Combines approaches |
| Query-specific | Re-Ranking | Uses BM25 scoring |
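
If the data type is known from the file name, the table can be turned into a small lookup. This is only an illustrative sketch: `EXTENSION_TO_STRATEGY` and `suggest_strategy` are hypothetical names that are not part of the project, the extension list is an assumption, and falling back to "hybrid" is one reading of the "Mixed documents" row.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a strategy name from the table.
EXTENSION_TO_STRATEGY = {
    ".csv": "row-based",
    ".tsv": "row-based",
    ".log": "row-based",
    ".md": "entity-based",
}


def suggest_strategy(filename: str, default: str = "hybrid") -> str:
    """Suggest a chunking strategy name from the file extension."""
    return EXTENSION_TO_STRATEGY.get(Path(filename).suffix.lower(), default)


print(suggest_strategy("patients.csv"))  # row-based
print(suggest_strategy("README.md"))     # entity-based
print(suggest_strategy("report.pdf"))    # hybrid (fallback)
```

A lookup like this is only a starting point; the evaluation workflow described later in this guide is the way to confirm which strategy actually performs best on a given dataset.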

Implementation Details

Row-Based Chunking Code:

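from typing import List

# ChunkingStrategy is the project's base class and is assumed to be in scope.
class RowBasedChunking(ChunkingStrategy):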
    """Row-based chunking strategy - splits by rows/lines/delimiters.
    
    Best for: Tables, CSV data, structured documents with clear row separators
    """
    
    def chunk_text(self, text: str, chunk_size: int = 512, 
                   overlap: int = 50) -> List[str]:
        """Create chunks based on row/line delimiters."""
        lines = text.split('\n')
        chunks = []
        current_chunk = []
        current_size = 0
        
        for line in lines:
            line_size = len(line) + 1  # +1 for newline
            
            # If adding this line exceeds chunk_size, save current chunk
            if current_size + line_size > chunk_size and current_chunk:
                chunks.append('\n'.join(current_chunk))
                
                # Add overlap by including last few lines in next chunk
                if overlap > 0:
                    overlap_size = 0
                    overlap_lines = []
                    for prev_line in reversed(current_chunk):
                        overlap_size += len(prev_line) + 1
                        if overlap_size <= overlap:
                            overlap_lines.insert(0, prev_line)
                        else:
                            break
                    current_chunk = overlap_lines
                    current_size = sum(len(row) + 1 for row in overlap_lines)
                else:
                    current_chunk = []
                    current_size = 0
            
            current_chunk.append(line)
            current_size += line_size
        
        # Add remaining chunk
        if current_chunk:
            chunks.append('\n'.join(current_chunk))
        
        return chunks

Entity-Based Chunking Code:

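import re
from typing import List

# ChunkingStrategy is the project's base class and is assumed to be in scope.
class EntityBasedChunking(ChunkingStrategy):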
    """Entity-based chunking strategy - splits by entities and semantic units.
    
    Best for: Documents with clear semantic units, structured data with headers
    """
    
    def chunk_text(self, text: str, chunk_size: int = 512, 
                   overlap: int = 50) -> List[str]:
        """Create chunks based on entities and semantic units."""
        # Split by double newlines (paragraphs) or just before headers (lines starting with #).
        # Splitting on the zero-width lookahead requires Python 3.7+.
        entity_pattern = r'(?:\n\n+|(?=^#{1,6}\s))'
        entities = re.split(entity_pattern, text, flags=re.MULTILINE)
        
        # Remove empty entities
        entities = [e.strip() for e in entities if e.strip()]
        
        chunks = []
        current_chunk = []
        current_size = 0
        
        for entity in entities:
            entity_size = len(entity)
            
            # If adding this entity exceeds chunk_size, save current chunk
            if current_size + entity_size > chunk_size and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                
                # Add overlap by including last entity in next chunk
                if overlap > 0 and current_chunk:
                    last_entity = current_chunk[-1]
                    if len(last_entity) <= overlap:
                        current_chunk = [last_entity]
                        current_size = len(last_entity)
                    else:
                        current_chunk = []
                        current_size = 0
                else:
                    current_chunk = []
                    current_size = 0
            
            current_chunk.append(entity)
            current_size += entity_size + 2  # +2 for separator
        
        # Add remaining chunk
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))
        
        return chunks

Configuration Changes

config.py Update:

# Before:
chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking"]

# After:
chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking", "row-based", "entity-based"]

Using New Strategies in Streamlit UI

Step 1: Create Collection

1. Open the Streamlit app
2. In the sidebar, find "2. Chunking Strategy"
3. Select "row-based" or "entity-based" from the dropdown
4. Set the chunk size (default 512)
5. Set the overlap (default 50)
6. Click "🚀 Load Data & Create Collection"

Step 2: Verify Collection

The collection name includes the strategy, e.g., covidqa_row_based_mpnet
Use the collection for chat queries or evaluation

Step 3: Evaluation

Run evaluation on collection with new chunking strategy
Compare metrics against other strategies
Download results for analysis

Evaluation Comparison

To compare which strategy works best for your dataset:

Create multiple collections with different chunking strategies
Run TRACE evaluation on each
Compare metrics:
- Utilization: How much retrieved context is used
- Relevance: How relevant retrieved docs are
- Adherence: How well response follows context
- Completeness: How complete the response is
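
Once each collection has its four TRACE scores, the comparison can be reduced to a ranking. The sketch below is one possible approach and not part of the project: `rank_strategies` is a hypothetical helper, the input is assumed to be a dict keyed by strategy name with one float per metric, and the unweighted mean is an assumption (a project may well weight adherence or relevance more heavily).

```python
from typing import Dict, List, Tuple

TRACE_METRICS = ("utilization", "relevance", "adherence", "completeness")


def rank_strategies(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank strategies by the unweighted mean of their four TRACE metrics."""
    means = {
        strategy: sum(scores[metric] for metric in TRACE_METRICS) / len(TRACE_METRICS)
        for strategy, scores in results.items()
    }
    # Highest mean first.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_strategies` with the scores downloaded from each evaluation run returns `(strategy, mean_score)` pairs, best first. A missing metric raises `KeyError`, which is deliberate here so that incomplete runs are not silently ranked.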

Files Modified

| File | Change | Lines |
|---|---|---|
| `chunking_strategies.py` | Added RowBasedChunking class | 28-75 |
| `chunking_strategies.py` | Added EntityBasedChunking class | 78-145 |
| `chunking_strategies.py` | Updated ChunkingFactory.STRATEGIES dict | 184-191 |
| `config.py` | Updated chunking_strategies list | 40 |
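
The `ChunkingFactory.STRATEGIES` update is not reproduced in this guide, but a name-to-class registry of this kind usually looks roughly like the sketch below. Everything here is an assumption for illustration: the stand-in base class, the placeholder `RowBasedChunking` body, and the error handling are not taken from the project; only the names `ChunkingFactory`, `STRATEGIES`, `create_chunker`, and `chunk_text` come from this document.

```python
from typing import Dict, List, Type


class ChunkingStrategy:
    """Stand-in base class (assumed interface)."""

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        raise NotImplementedError


class RowBasedChunking(ChunkingStrategy):
    """Placeholder body: one chunk per non-empty line, ignoring the size parameters."""

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        return [line for line in text.split("\n") if line]


class ChunkingFactory:
    # Maps the strategy names used in config.py to strategy classes.
    STRATEGIES: Dict[str, Type[ChunkingStrategy]] = {
        "row-based": RowBasedChunking,
    }

    @classmethod
    def create_chunker(cls, name: str) -> ChunkingStrategy:
        try:
            return cls.STRATEGIES[name]()
        except KeyError:
            known = ", ".join(sorted(cls.STRATEGIES))
            raise ValueError(
                f"Unknown chunking strategy {name!r}; expected one of: {known}"
            ) from None
```

With a registry like this, adding a strategy means adding one dictionary entry plus the matching name in `config.py`, which is consistent with the small set of files listed above.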

Testing the New Strategies

Test Row-Based with CSV:

from chunking_strategies import ChunkingFactory

csv_data = """id,name,age,city
1,John Doe,25,New York
2,Jane Smith,30,Los Angeles
3,Bob Johnson,35,Chicago"""

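chunker = ChunkingFactory.create_chunker("row-based")
# The sample is 92 characters, so with chunk_size=100 it comes back as a single chunk;
# lower chunk_size (e.g. 60) to see the rows split across chunks.
chunks = chunker.chunk_text(csv_data, chunk_size=100, overlap=20)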

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")
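
Beyond printing the chunks, the "no row is split across chunks" property can be checked directly. The helper below is a hypothetical sketch, not part of the project; it assumes rows are separated by single newlines and that chunks are plain strings.

```python
from typing import List


def rows_intact(source: str, chunks: List[str]) -> bool:
    """True if every chunk line is a complete source row and no source row is missing."""
    allowed = set(source.split("\n"))
    seen = set()
    for chunk in chunks:
        for line in chunk.split("\n"):
            if line not in allowed:
                return False  # a row was cut or altered
            seen.add(line)
    return seen == allowed
```

After the test above, `assert rows_intact(csv_data, chunks)` would confirm that each row survived whole. The check compares line contents only, so a truncated row that happens to equal another complete row would go unnoticed.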

Test Entity-Based with Markdown:

from chunking_strategies import ChunkingFactory

markdown_data = """# Introduction
This is the intro paragraph.

## Background
More details here about the background.

## Methodology
Explain the method used."""

chunker = ChunkingFactory.create_chunker("entity-based")
chunks = chunker.chunk_text(markdown_data, chunk_size=100, overlap=20)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")

Summary

✅ Row-Based Chunking - Implemented for structured, row-oriented data
✅ Entity-Based Chunking - Implemented for semantic unit-based data
✅ Config Updated - Both strategies available in UI
✅ Backward Compatible - All existing strategies still work
✅ Ready to Use - Available in Streamlit dropdown immediately

Total Strategies Available: 6

dense
sparse
hybrid
re-ranking
row-based (NEW)
entity-based (NEW)

Next Steps

Create collections with new strategies
Test with your data - See which works best
Run evaluation - Compare TRACE metrics
Select optimal strategy - Use best performing one
Document findings - Save results for reproducibility