# Chunking Strategies Implementation Guide

## Overview
Added two new chunking strategies to the RAG Capstone Project:
- Row-Based Chunking - For structured data (tables, CSV, rows)
- Entity-Based Chunking - For semantic units (paragraphs, sections, headers)
## Available Chunking Strategies

### 1. Dense Chunking (Existing)
- Use Case: General-purpose documents, fixed-size chunking
- Method: Fixed-size chunks with overlap, breaks at sentence boundaries
- Best For: Linear documents, books, articles
- Chunk Size: 512 (recommended)
- Overlap: 50 (recommended)
### 2. Sparse Chunking (Existing)
- Use Case: Semantic-aware chunking
- Method: Splits by paragraphs/double newlines, keeps semantic units
- Best For: Articles with clear paragraphs
- Chunk Size: 512 (recommended)
- Overlap: 50 (recommended)
### 3. Hybrid Chunking (Existing)
- Use Case: Combines dense and sparse approaches
- Method: First splits by semantics, then applies fixed-size chunking
- Best For: Mixed document types
- Chunk Size: 512 (recommended)
- Overlap: 50 (recommended)
### 4. Re-Ranking Chunking (Existing)
- Use Case: Query-aware chunk ranking
- Method: Hybrid chunking + BM25 re-ranking
- Best For: Finding most relevant chunks for specific queries
- Chunk Size: 512 (recommended)
- Overlap: 50 (recommended)
### 5. Row-Based Chunking (NEW)
- Use Case: Structured data with row delimiters
- Method: Splits by rows/lines (newline delimiters)
- Best For:
  - CSV files
  - Tables
  - Structured database outputs
  - Line-oriented data (logs, scripts)
  - Tab-separated values (TSV)
- Chunk Size: 512 (recommended)
- Overlap: 50 (recommended; carries trailing rows totalling at most this many characters into the next chunk)
- Advantages:
  - Preserves row structure
  - No row is split across chunks
  - Maintains semantic integrity of each row
  - Perfect for tabular data
Algorithm:
1. Split the text by newlines into rows
2. Accumulate rows until `chunk_size` would be exceeded
3. Save the accumulated chunk
4. Apply overlap by carrying trailing rows (at most `overlap` characters' worth) into the next chunk
5. Continue until all rows are processed
Example:

Input (CSV):

```
id,name,age
1,John,25
2,Jane,30
3,Bob,35
```

Output chunks (chunk_size=35, overlap=10; small values are used so this short input actually splits, since at the recommended 512 it would fit in a single chunk):

```
Chunk 1: "id,name,age\n1,John,25\n2,Jane,30"
Chunk 2: "2,Jane,30\n3,Bob,35"
```
### 6. Entity-Based Chunking (NEW)
- Use Case: Semantic entity-based chunking
- Method: Splits by entities (paragraphs, sections, headers)
- Best For:
  - Documents with clear semantic units
  - Markdown files with headers
  - Technical documentation
  - Sections with key-value pairs
  - Posts and articles with clear structure
- Chunk Size: 512 (recommended)
- Overlap: 50 (recommended; carries the last entity into the next chunk when it fits within this many characters)
- Advantages:
  - Respects semantic boundaries
  - No entity is split across chunks
  - Perfect for header-based docs
  - Preserves context between sections
Algorithm:
1. Identify entities using patterns:
   - Paragraph breaks (double newlines)
   - Section headers (lines starting with #)
2. Accumulate entities until `chunk_size` would be exceeded
3. Save the accumulated chunk
4. Apply overlap by carrying the last entity into the next chunk (only if it fits within `overlap` characters)
5. Continue until all entities are processed
Example:

Input (Markdown):

```
# Section 1
This is paragraph 1.

This is paragraph 2.

# Section 2
This is paragraph 3.
```

Output chunks (chunk_size=60, overlap=20; again sized so the input actually splits):

```
Chunk 1: "# Section 1\nThis is paragraph 1.\n\nThis is paragraph 2."
Chunk 2: "This is paragraph 2.\n\n# Section 2\nThis is paragraph 3."
```
## When to Use Which Strategy
| Data Type | Strategy | Reason |
|---|---|---|
| CSV/Tables | Row-Based | Preserves row integrity |
| Logs/Line data | Row-Based | Each line is atomic |
| Markdown docs | Entity-Based | Respects header structure |
| Technical docs | Entity-Based | Sections are semantic units |
| Blog posts | Entity-Based | Paragraphs are natural units |
| Books/Articles | Dense | Long continuous text |
| Mixed documents | Hybrid | Combines approaches |
| Query-specific | Re-Ranking | Uses BM25 scoring |
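If you want to automate that choice, the table translates naturally into a small lookup used with the factory (the mapping keys below are illustrative labels, not part of the project code):

```python
from chunking_strategies import ChunkingFactory

# Illustrative mapping derived from the table above.
STRATEGY_BY_DATA_TYPE = {
    "csv": "row-based",
    "logs": "row-based",
    "markdown": "entity-based",
    "technical-docs": "entity-based",
    "blog": "entity-based",
    "book": "dense",
    "mixed": "hybrid",
}

chunker = ChunkingFactory.create_chunker(STRATEGY_BY_DATA_TYPE["markdown"])
```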
## Implementation Details
Row-Based Chunking Code:

```python
from typing import List


class RowBasedChunking(ChunkingStrategy):
    """Row-based chunking strategy - splits by rows/lines/delimiters.

    Best for: tables, CSV data, structured documents with clear row separators.
    """

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Create chunks based on row/line delimiters."""
        lines = text.split('\n')
        chunks = []
        current_chunk = []
        current_size = 0

        for line in lines:
            line_size = len(line) + 1  # +1 for the newline
            # If adding this line would exceed chunk_size, save the current chunk
            if current_size + line_size > chunk_size and current_chunk:
                chunks.append('\n'.join(current_chunk))
                # Carry trailing lines (totalling at most `overlap` chars)
                # into the next chunk
                if overlap > 0:
                    overlap_size = 0
                    overlap_lines = []
                    for prev_line in reversed(current_chunk):
                        overlap_size += len(prev_line) + 1
                        if overlap_size <= overlap:
                            overlap_lines.insert(0, prev_line)
                        else:
                            break
                    current_chunk = overlap_lines
                    current_size = sum(len(l) + 1 for l in overlap_lines)
                else:
                    current_chunk = []
                    current_size = 0
            current_chunk.append(line)
            current_size += line_size

        # Flush the final partial chunk
        if current_chunk:
            chunks.append('\n'.join(current_chunk))
        return chunks
```
Entity-Based Chunking Code:

```python
import re
from typing import List


class EntityBasedChunking(ChunkingStrategy):
    """Entity-based chunking strategy - splits by entities and semantic units.

    Best for: documents with clear semantic units, structured data with headers.
    """

    def chunk_text(self, text: str, chunk_size: int = 512,
                   overlap: int = 50) -> List[str]:
        """Create chunks based on entities and semantic units."""
        # Split on paragraph breaks (double newlines) or just before Markdown
        # headers (lines starting with 1-6 '#' characters). The zero-width
        # lookahead relies on re.split's empty-match support (Python 3.7+).
        entity_pattern = r'(?:\n\n+|(?=^#{1,6}\s))'
        entities = re.split(entity_pattern, text, flags=re.MULTILINE)
        # Drop empty entities left by adjacent delimiters
        entities = [e.strip() for e in entities if e.strip()]

        chunks = []
        current_chunk = []
        current_size = 0

        for entity in entities:
            entity_size = len(entity)
            # If adding this entity would exceed chunk_size, save the current chunk
            if current_size + entity_size > chunk_size and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                # Carry the last entity into the next chunk, but only if it
                # fits within the overlap budget
                if overlap > 0 and current_chunk:
                    last_entity = current_chunk[-1]
                    if len(last_entity) <= overlap:
                        current_chunk = [last_entity]
                        current_size = len(last_entity)
                    else:
                        current_chunk = []
                        current_size = 0
                else:
                    current_chunk = []
                    current_size = 0
            current_chunk.append(entity)
            current_size += entity_size + 2  # +2 for the '\n\n' separator

        # Flush the final partial chunk
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))
        return chunks
```
## Configuration Changes

config.py update:

```python
# Before:
chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking"]

# After:
chunking_strategies: list = ["dense", "sparse", "hybrid", "re-ranking", "row-based", "entity-based"]
```
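For context, here is a minimal sketch of how the Streamlit sidebar might consume this list; the import path, settings object, and widget labels are assumptions for illustration, not the app's actual code:

```python
import streamlit as st

from config import settings  # hypothetical: wherever chunking_strategies lives

strategy = st.sidebar.selectbox("2. Chunking Strategy", settings.chunking_strategies)
chunk_size = st.sidebar.number_input("Chunk size", value=512, min_value=1)
overlap = st.sidebar.number_input("Overlap", value=50, min_value=0)
```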
## Using New Strategies in the Streamlit UI

### Step 1: Create Collection

- Open the Streamlit app
- In the sidebar, under "2. Chunking Strategy", select "row-based" or "entity-based" from the dropdown
- Set the chunk size (default 512)
- Set the overlap (default 50)
- Click "Load Data & Create Collection"
### Step 2: Verify Collection

- The collection name will include the strategy, e.g. covidqa_row_based_mpnet
- Use the collection for chat queries or evaluation
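The exact naming rule lives in the app code; a hypothetical reconstruction consistent with the example name above:

```python
def collection_name(dataset: str, strategy: str, model: str) -> str:
    """Hypothetical naming scheme implied by 'covidqa_row_based_mpnet';
    check the app code for the real rule."""
    return f"{dataset}_{strategy.replace('-', '_')}_{model}"

assert collection_name("covidqa", "row-based", "mpnet") == "covidqa_row_based_mpnet"
```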
### Step 3: Evaluation

- Run evaluation on a collection built with a new chunking strategy
- Compare its metrics against the other strategies
- Download the results for analysis
## Evaluation Comparison
To compare which strategy works best for your dataset:
- Create multiple collections with different chunking strategies
- Run TRACE evaluation on each
- Compare metrics:
  - Utilization: How much retrieved context is used
  - Relevance: How relevant retrieved docs are
  - Adherence: How well the response follows the context
  - Completeness: How complete the response is
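Before committing to full TRACE runs, a cheap first pass is to compare raw chunk statistics for the same document across strategies. The helper below is illustrative rather than part of the project code, and it omits re-ranking since that strategy is query-aware:

```python
from chunking_strategies import ChunkingFactory

def compare_strategies(text, strategies=None, chunk_size=512, overlap=50):
    """Print chunk count and average chunk length for each strategy."""
    strategies = strategies or ["dense", "sparse", "hybrid", "row-based", "entity-based"]
    for name in strategies:
        chunks = ChunkingFactory.create_chunker(name).chunk_text(
            text, chunk_size=chunk_size, overlap=overlap
        )
        avg = sum(len(c) for c in chunks) / max(len(chunks), 1)
        print(f"{name:>13}: {len(chunks):3d} chunks, avg {avg:.0f} chars")
```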
## Files Modified

| File | Change | Lines |
|---|---|---|
| chunking_strategies.py | Added RowBasedChunking class | 28-75 |
| chunking_strategies.py | Added EntityBasedChunking class | 78-145 |
| chunking_strategies.py | Updated ChunkingFactory.STRATEGIES dict | 184-191 |
| config.py | Updated chunking_strategies list | 40 |
## Testing the New Strategies

Test Row-Based with CSV:

```python
from chunking_strategies import ChunkingFactory

csv_data = """id,name,age,city
1,John Doe,25,New York
2,Jane Smith,30,Los Angeles
3,Bob Johnson,35,Chicago"""

chunker = ChunkingFactory.create_chunker("row-based")
# Note: this small sample (~88 chars) fits in one chunk at chunk_size=100;
# lower chunk_size (e.g. 40) to see the rows split across chunks.
chunks = chunker.chunk_text(csv_data, chunk_size=100, overlap=20)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")
```
Test Entity-Based with Markdown:

```python
from chunking_strategies import ChunkingFactory

markdown_data = """# Introduction
This is the intro paragraph.

## Background
More details here about the background.

## Methodology
Explain the method used."""

chunker = ChunkingFactory.create_chunker("entity-based")
# With chunk_size=100 this yields two chunks: Introduction + Background
# together, then Methodology on its own.
chunks = chunker.chunk_text(markdown_data, chunk_size=100, overlap=20)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---")
```
## Summary

- Row-Based Chunking - Implemented for structured, row-oriented data
- Entity-Based Chunking - Implemented for semantic unit-based data
- Config Updated - Both strategies available in the UI
- Backward Compatible - All existing strategies still work
- Ready to Use - Available in the Streamlit dropdown immediately
Total Strategies Available: 6
- dense
- sparse
- hybrid
- re-ranking
- row-based (NEW)
- entity-based (NEW)
## Next Steps
- Create collections with new strategies
- Test with your data - See which works best
- Run evaluation - Compare TRACE metrics
- Select optimal strategy - Use best performing one
- Document findings - Save results for reproducibility