Chunking Strategies Implementation Guide

Overview

This guide documents two chunking strategies newly added to the RAG Capstone Project:

  1. Row-Based Chunking - For structured data (tables, CSV, rows)
  2. Entity-Based Chunking - For semantic units (paragraphs, sections, headers)

Available Chunking Strategies

1. Dense Chunking ✅ (Existing)

  • Use Case: General-purpose documents, fixed-size chunking
  • Method: Fixed-size chunks with overlap, breaks at sentence boundaries
  • Best For: Linear documents, books, articles
  • Chunk Size: 512 (recommended)
  • Overlap: 50 (recommended)

2. Sparse Chunking ✅ (Existing)

  • Use Case: Semantic-aware chunking
  • Method: Splits by paragraphs/double newlines, keeps semantic units
  • Best For: Articles with clear paragraphs
  • Chunk Size: 512 (recommended)
  • Overlap: 50 (recommended)

3. Hybrid Chunking ✅ (Existing)

  • Use Case: Combines dense and sparse approaches
  • Method: First splits by semantics, then applies fixed-size chunking
  • Best For: Mixed document types
  • Chunk Size: 512 (recommended)
  • Overlap: 50 (recommended)

4. Re-Ranking Chunking ✅ (Existing)

  • Use Case: Query-aware chunk ranking
  • Method: Hybrid chunking + BM25 re-ranking
  • Best For: Finding most relevant chunks for specific queries
  • Chunk Size: 512 (recommended)
  • Overlap: 50 (recommended)

5. Row-Based Chunking 🆕 (NEW)

  • Use Case: Structured data with row delimiters
  • Method: Splits by rows/lines (newline delimiters)
  • Best For:
    • CSV files
    • Tables
    • Structured database outputs
    • Line-oriented data (logs, scripts)
    • Tab-separated values (TSV)
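  • Chunk Size: 512 characters (recommended)
  • Overlap: 50 characters (recommended; trailing rows are repeated in the next chunk as long as their combined length fits within the overlap)
  • Advantages:
    • Preserves row structure
    • No row is split across chunks (a row longer than the chunk size is kept whole, so its chunk exceeds the limit)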
    • Maintains semantic integrity of each row
    • Perfect for tabular data

Algorithm:

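1. Split text by newlines into rows
2. Accumulate rows until adding the next row would exceed chunk_size (each row counts as its length plus one for the newline)
3. Save the accumulated chunk
4. Apply overlap by carrying the trailing rows that fit within overlap characters into the next chunk
5. Continue until all rows are processed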

Example:

Input (CSV):
id,name,age
1,John,25
2,Jane,30
3,Bob,35

Output Chunks (chunk_size=40, overlap=10):
Chunk 1: "id,name,age\n1,John,25\n2,Jane,30"
Chunk 2: "2,Jane,30\n3,Bob,35"
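
To see where a row-based boundary falls, it helps to tally the size the chunker charges for each row: its length plus one for the newline, as in the implementation shown later in this guide. The snippet below is a minimal sketch; `row_sizes` is an illustrative helper and not part of the project.

```python
from itertools import accumulate
from typing import List


def row_sizes(text: str) -> List[int]:
    """Size charged per row: its length plus one for the newline."""
    return [len(row) + 1 for row in text.split("\n")]


csv_text = "id,name,age\n1,John,25\n2,Jane,30\n3,Bob,35"

sizes = row_sizes(csv_text)
running = list(accumulate(sizes))  # cumulative size after each row

print(sizes)    # [12, 10, 10, 9]
print(running)  # [12, 22, 32, 41]
```

A row starts a new chunk only when the running total would exceed chunk_size. For this input, the four rows stay in a single chunk for any chunk_size of 41 or more, and a split after the third row requires a chunk_size between 32 and 40.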

6. Entity-Based Chunking 🆕 (NEW)

  • Use Case: Semantic entity-based chunking
  • Method: Splits by entities (paragraphs, sections, headers)
  • Best For:
    • Documents with clear semantic units
    • Markdown files with headers
    • Technical documentation
    • Sections with key-value pairs
    • Posts, articles with clear structure
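  • Chunk Size: 512 characters (recommended)
  • Overlap: 50 characters (recommended; the last entity is repeated in the next chunk only if it is no longer than the overlap)
  • Advantages:
    • Respects semantic boundaries
    • No entity is split across chunks (an entity longer than the chunk size is kept whole)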
    • Perfect for header-based docs
    • Preserves context between sections

Algorithm:

1. Identify entities using patterns:
   - Paragraph breaks (double newlines)
   - Section headers (lines starting with one to six # characters followed by whitespace)
2. Accumulate entities until adding the next entity would exceed chunk_size
3. Save the accumulated chunk
4. Apply overlap by carrying the last entity into the next chunk if it is no longer than overlap characters
5. Continue until all entities are processed

Example:

Input (Markdown):
# Section 1
This is paragraph 1.

This is paragraph 2.

# Section 2
This is paragraph 3.

Output Chunks (chunk_size=80, overlap=20):
Chunk 1: "# Section 1\nThis is paragraph 1.\n\nThis is paragraph 2."
Chunk 2: "This is paragraph 2.\n\n# Section 2\nThis is paragraph 3."
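
The entity boundaries for this input can be inspected directly. The sketch below reuses the regular expression from the entity-based implementation in this guide; `split_entities` is an illustrative helper name, and the zero-width header match assumes Python 3.7 or later.

```python
import re
from typing import List

# Same pattern as the entity-based strategy: runs of blank lines, or a
# zero-width position just before a Markdown header line.
ENTITY_PATTERN = r'(?:\n\n+|(?=^#{1,6}\s))'


def split_entities(text: str) -> List[str]:
    """Split text into non-empty, stripped entities."""
    parts = re.split(ENTITY_PATTERN, text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


markdown = (
    "# Section 1\nThis is paragraph 1.\n\n"
    "This is paragraph 2.\n\n"
    "# Section 2\nThis is paragraph 3."
)

entities = split_entities(markdown)
for entity in entities:
    print(len(entity), repr(entity))
# 32 '# Section 1\nThis is paragraph 1.'
# 20 'This is paragraph 2.'
# 32 '# Section 2\nThis is paragraph 3.'
```

The implementation charges each entity its length plus 2 for the separator, so the running size is 34 after the first entity and 56 after the second. The third entity (32 characters) therefore opens a new chunk only when chunk_size is between 54 and 87. At 88 or above, all three entities land in one chunk.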

When to Use Which Strategy

| Data Type | Strategy | Reason |
|---|---|---|
| CSV/Tables | Row-Based | Preserves row integrity |
| Logs/Line data | Row-Based | Each line is atomic |
| Markdown docs | Entity-Based | Respects header structure |
| Technical docs | Entity-Based | Sections are semantic units |
| Blog posts | Entity-Based | Paragraphs are natural units |
| Books/Articles | Dense | Long continuous text |
| Mixed documents | Hybrid | Combines approaches |
| Query-specific | Re-Ranking | Uses BM25 scoring |
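
As a rough illustration of the table above, a strategy could be suggested from a file's extension. This is a hypothetical helper, not part of the project: the name `suggest_strategy`, the extension list, and the "hybrid" fallback are all assumptions, and a file extension says nothing about how well a strategy performs on the actual content.

```python
from pathlib import Path

# Hypothetical extension-to-strategy mapping derived from the table above.
_BY_EXTENSION = {
    ".csv": "row-based",
    ".tsv": "row-based",
    ".log": "row-based",
    ".md": "entity-based",
    ".markdown": "entity-based",
}


def suggest_strategy(filename: str, default: str = "hybrid") -> str:
    """Return a suggested strategy name for a file, by extension only."""
    return _BY_EXTENSION.get(Path(filename).suffix.lower(), default)


print(suggest_strategy("patients.csv"))  # row-based
print(suggest_strategy("README.md"))     # entity-based
print(suggest_strategy("report.pdf"))    # hybrid
```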

Implementation Details

Row-Based Chunking Code:

from typing import List

class RowBasedChunking(ChunkingStrategy):
    """Row-based chunking strategy - splits by rows/lines/delimiters.
    
    Best for: Tables, CSV data, structured documents with clear row separators
    """
    
    def chunk_text(self, text: str, chunk_size: int = 512, 
                   overlap: int = 50) -> List[str]:
        """Create chunks based on row/line delimiters."""
        lines = text.split('\n')
        chunks = []
        current_chunk = []
        current_size = 0
        
        for line in lines:
            line_size = len(line) + 1  # +1 for newline
            
            # If adding this line exceeds chunk_size, save current chunk
            if current_size + line_size > chunk_size and current_chunk:
                chunks.append('\n'.join(current_chunk))
                
                # Add overlap by including last few lines in next chunk
                if overlap > 0:
                    overlap_size = 0
                    overlap_lines = []
                    for prev_line in reversed(current_chunk):
                        overlap_size += len(prev_line) + 1
                        if overlap_size <= overlap:
                            overlap_lines.insert(0, prev_line)
                        else:
                            break
                    current_chunk = overlap_lines
                    current_size = sum(len(kept) + 1 for kept in overlap_lines)
                else:
                    current_chunk = []
                    current_size = 0
            
            current_chunk.append(line)
            current_size += line_size
        
        # Add remaining chunk
        if current_chunk:
            chunks.append('\n'.join(current_chunk))
        
        return chunks

Entity-Based Chunking Code:

import re
from typing import List

class EntityBasedChunking(ChunkingStrategy):
    """Entity-based chunking strategy - splits by entities and semantic units.
    
    Best for: Documents with clear semantic units, structured data with headers
    """
    
    def chunk_text(self, text: str, chunk_size: int = 512, 
                   overlap: int = 50) -> List[str]:
        """Create chunks based on entities and semantic units."""
        # Split by double newlines (paragraphs) or headers (lines starting with #).
        # The header alternative is a zero-width match, which re.split only
        # honors on Python 3.7+.
        entity_pattern = r'(?:\n\n+|(?=^#{1,6}\s))'
        entities = re.split(entity_pattern, text, flags=re.MULTILINE)
        
        # Remove empty entities
        entities = [e.strip() for e in entities if e.strip()]
        
        chunks = []
        current_chunk = []
        current_size = 0
        
        for entity in entities:
            entity_size = len(entity)
            
            # If adding this entity exceeds chunk_size, save current chunk
            if current_size + entity_size > chunk_size and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                
                # Add overlap by including last entity in next chunk
                if overlap > 0 and current_chunk:
                    last_entity = current_chunk[-1]
                    if len(last_entity) <= overlap:
                        current_chunk = [last_entity]
                        current_size = len(last_entity) + 2  # +2 for separator, as below
                    else:
                        current_chunk = []
                        current_size = 0
                else:
                    current_chunk = []
                    current_size = 0
            
            current_chunk.append(entity)
            current_size += entity_size + 2  # +2 for separator
        
        # Add remaining chunk
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))
        
        return chunks

Configuration Changes

config.py Update:

# Before:
chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking"]

# After:
chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking", "row-based", "entity-based"]
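
Every name in this list needs a matching entry in the factory, or selecting it would fail at runtime. The project's actual factory code is not reproduced in this guide, so the following is only a sketch of how such a registry and a consistency check might look. The stand-in classes, the placeholder `chunk_text` body, the error message, and the `missing_strategies` helper are all assumptions.

```python
from typing import Dict, List, Type


class ChunkingStrategy:
    """Stand-in base class; the project's real base class is not shown here."""

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        raise NotImplementedError


class RowBasedChunking(ChunkingStrategy):
    """Placeholder body for the sketch: one chunk per non-empty line."""

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        return [line for line in text.split("\n") if line]


class ChunkingFactory:
    # Assumed shape: strategy name -> strategy class.
    STRATEGIES: Dict[str, Type[ChunkingStrategy]] = {
        "row-based": RowBasedChunking,
    }

    @classmethod
    def create_chunker(cls, name: str) -> ChunkingStrategy:
        try:
            return cls.STRATEGIES[name]()
        except KeyError:
            raise ValueError(f"Unknown chunking strategy: {name!r}") from None


def missing_strategies(configured: List[str]) -> List[str]:
    """Return configured names that have no factory entry."""
    return [name for name in configured if name not in ChunkingFactory.STRATEGIES]


# Only "row-based" is registered in this sketch, so "entity-based" is reported.
print(missing_strategies(["row-based", "entity-based"]))  # ['entity-based']
```

A check like this could run at startup so that a name added to config.py without a factory entry is caught before a user selects it in the UI.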

Using New Strategies in Streamlit UI

Step 1: Create Collection

  1. Open the Streamlit app
  2. In the sidebar, go to "2. Chunking Strategy"
  3. Select "row-based" or "entity-based" from the dropdown
  4. Set the chunk size (default 512)
  5. Set the overlap (default 50)
  6. Click "🚀 Load Data & Create Collection"

Step 2: Verify Collection

  1. The collection name includes the strategy, e.g., covidqa_row_based_mpnet
  2. Use the collection for chat queries or evaluation
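
The example name suggests a pattern of dataset, strategy (with hyphens replaced by underscores), and embedding model, joined by underscores. The helper below is a guess at that convention based on this single example. `collection_name` is a hypothetical function, and the app's actual naming logic may differ.

```python
def collection_name(dataset: str, strategy: str, embedding: str) -> str:
    """Join the parts with underscores, replacing hyphens in the strategy."""
    return "_".join([dataset, strategy.replace("-", "_"), embedding])


print(collection_name("covidqa", "row-based", "mpnet"))  # covidqa_row_based_mpnet
```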

Step 3: Evaluation

  1. Run evaluation on collection with new chunking strategy
  2. Compare metrics against other strategies
  3. Download results for analysis

Evaluation Comparison

To compare which strategy works best for your dataset:

  1. Create multiple collections with different chunking strategies
  2. Run TRACE evaluation on each
  3. Compare metrics:
    • Utilization: How much retrieved context is used
    • Relevance: How relevant retrieved docs are
    • Adherence: How well response follows context
    • Completeness: How complete the response is
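
Once each collection has been evaluated, the four metrics can be combined into a simple ranking. The sketch below assumes that every metric is a number where higher is better and that all four carry equal weight. The lowercase metric keys and the `rank_strategies` helper are illustrative and not part of the project's evaluation code.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Assumed metric keys; adjust to match the evaluation output actually produced.
TRACE_METRICS = ("utilization", "relevance", "adherence", "completeness")


def rank_strategies(
    results: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Rank strategies by the unweighted mean of their TRACE metrics.

    `results` maps a strategy name to its metric values, for example the
    numbers taken from each collection's downloaded evaluation results.
    """
    scored = [
        (name, mean(metrics[m] for m in TRACE_METRICS))
        for name, metrics in results.items()
    ]
    # Highest mean first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

An unweighted mean is only one possible choice. If one metric matters more for a given use case, for example adherence for a factual question-answering application, a weighted score or a per-metric comparison may be more appropriate.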

Files Modified

| File | Change | Lines |
|---|---|---|
| chunking_strategies.py | Added RowBasedChunking class | 28-75 |
| chunking_strategies.py | Added EntityBasedChunking class | 78-145 |
| chunking_strategies.py | Updated ChunkingFactory.STRATEGIES dict | 184-191 |
| config.py | Updated chunking_strategies list | 40 |

Testing the New Strategies

Test Row-Based with CSV:

from chunking_strategies import ChunkingFactory

csv_data = """id,name,age,city
1,John Doe,25,New York
2,Jane Smith,30,Los Angeles
3,Bob Johnson,35,Chicago"""

chunker = ChunkingFactory.create_chunker("row-based")
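# The sample is under 100 characters, so a smaller chunk_size is needed
# to produce more than one chunk.
chunks = chunker.chunk_text(csv_data, chunk_size=60, overlap=25)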

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")

Test Entity-Based with Markdown:

from chunking_strategies import ChunkingFactory

markdown_data = """# Introduction
This is the intro paragraph.

## Background
More details here about the background.

## Methodology
Explain the method used."""

chunker = ChunkingFactory.create_chunker("entity-based")
chunks = chunker.chunk_text(markdown_data, chunk_size=100, overlap=20)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")

Summary

✅ Row-Based Chunking - Implemented for structured, row-oriented data
✅ Entity-Based Chunking - Implemented for semantic unit-based data
✅ Config Updated - Both strategies available in UI
✅ Backward Compatible - All existing strategies still work
✅ Ready to Use - Available in Streamlit dropdown immediately

Total Strategies Available: 6

  • dense
  • sparse
  • hybrid
  • re-ranking
  • row-based (NEW)
  • entity-based (NEW)

Next Steps

  1. Create collections with new strategies
  2. Test with your data - See which works best
  3. Run evaluation - Compare TRACE metrics
  4. Select optimal strategy - Use best performing one
  5. Document findings - Save results for reproducibility