# 🔧 Training Fix - Variable-Length Sequence Padding

## ❌ Problem Identified

```python
RuntimeError: stack expects each tensor to be equal size,
but got [7] at entry 0 and [41] at entry 1
```

**Root Cause:** The DataLoader's default collate function tries to stack tensors of different lengths into a batch, which fails because different clauses tokenize to different sequence lengths (e.g., 7 tokens vs. 41 tokens).

**Location:** Training loop → DataLoader → `torch.stack()`
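For reference, the underlying failure is easy to reproduce outside the project code; a minimal standalone sketch:

```python
import torch

a = torch.zeros(7, dtype=torch.long)   # stand-in for a 7-token clause
b = torch.zeros(41, dtype=torch.long)  # stand-in for a 41-token clause
torch.stack([a, b])
# RuntimeError: stack expects each tensor to be equal size,
# but got [7] at entry 0 and [41] at entry 1
```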
---

## 🎯 Understanding the Problem

### **What Happens:**

1. **Tokenization** produces variable-length sequences:
```python
Clause 1: "The party shall..."                     → [101, 2023, ..., 102]  (length 7)
Clause 2: "This agreement shall be governed by..." → [101, ..., 102]        (length 41)
```
2. **Default collate** tries to stack them:
```python
torch.stack([tensor([...]),   # length 7
             tensor([...])])  # length 41
→ ERROR: Can't stack different sizes!
```
3. **Training fails** before the first batch completes.

### **Why It Matters:**
- BERT/Transformers require **fixed-size batches**
- Each sequence in a batch must have the **same length**
- Solution: **pad shorter sequences** to match the longest in the batch
---

## ✅ Solution Applied

### **Custom Collate Function**

Added a `collate_batch()` function in `trainer.py`:

```python
import torch

def collate_batch(batch):
    """
    Custom collate function to handle variable-length sequences.
    Pads all sequences to the maximum length in the batch.
    """
    # Find max length in this batch
    max_len = max(item['input_ids'].size(0) for item in batch)

    # Prepare batched tensors
    input_ids_batch = []
    attention_mask_batch = []
    risk_labels_batch = []
    severity_scores_batch = []
    importance_scores_batch = []

    for item in batch:
        input_ids = item['input_ids']
        attention_mask = item['attention_mask']
        current_len = input_ids.size(0)

        # Pad if needed
        if current_len < max_len:
            padding_len = max_len - current_len
            # Pad input_ids with 0 (BERT's PAD token ID)
            input_ids = torch.cat([
                input_ids,
                torch.zeros(padding_len, dtype=torch.long)
            ])
            # Pad attention_mask with 0 (don't attend to padding)
            attention_mask = torch.cat([
                attention_mask,
                torch.zeros(padding_len, dtype=torch.long)
            ])

        input_ids_batch.append(input_ids)
        attention_mask_batch.append(attention_mask)
        risk_labels_batch.append(item['risk_label'])
        severity_scores_batch.append(item['severity_score'])
        importance_scores_batch.append(item['importance_score'])

    # Stack into batched tensors
    return {
        'input_ids': torch.stack(input_ids_batch),
        'attention_mask': torch.stack(attention_mask_batch),
        'risk_label': torch.stack(risk_labels_batch),
        'severity_score': torch.stack(severity_scores_batch),
        'importance_score': torch.stack(importance_scores_batch)
    }
```
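As an aside, PyTorch's built-in `torch.nn.utils.rnn.pad_sequence` performs the same per-batch padding; a minimal sketch of an equivalent collate function (shown for comparison only, not what `trainer.py` uses):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_batch_alt(batch):
    """Equivalent collate using pad_sequence instead of manual torch.cat."""
    return {
        'input_ids': pad_sequence([item['input_ids'] for item in batch],
                                  batch_first=True, padding_value=0),
        'attention_mask': pad_sequence([item['attention_mask'] for item in batch],
                                       batch_first=True, padding_value=0),
        'risk_label': torch.stack([item['risk_label'] for item in batch]),
        'severity_score': torch.stack([item['severity_score'] for item in batch]),
        'importance_score': torch.stack([item['importance_score'] for item in batch]),
    }
```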
### **Updated DataLoader Creation**

```python
dataloader = DataLoader(
    dataset,
    batch_size=self.config.batch_size,
    shuffle=shuffle,
    num_workers=0,
    collate_fn=collate_batch  # ← custom collate function
)
```
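A quick sanity check after wiring this up (illustrative; assumes the dataset and dataloader above):

```python
# Every yielded batch should now be rectangular
for batch in dataloader:
    print(batch['input_ids'].shape)       # e.g. torch.Size([16, 235])
    print(batch['attention_mask'].shape)  # same shape as input_ids
    break
```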
---

## 🎯 How It Works

### **Example Batch:**

**Input:**
```
Sample 0: [101, 2023, 2003, 102]              length=4
Sample 1: [101, 1996, 2172, 3325, 2003, 102]  length=6
Sample 2: [101, 102]                          length=2
```

**After Padding (max_len=6):**
```
Sample 0: [101, 2023, 2003, 102, 0, 0]        length=6 ✅
Sample 1: [101, 1996, 2172, 3325, 2003, 102]  length=6 ✅
Sample 2: [101, 102, 0, 0, 0, 0]              length=6 ✅
```

**Attention Masks:**
```
Sample 0: [1, 1, 1, 1, 0, 0]  # Don't attend to padding
Sample 1: [1, 1, 1, 1, 1, 1]  # Attend to all
Sample 2: [1, 1, 0, 0, 0, 0]  # Don't attend to padding
```

**Batched Tensors:**
```python
input_ids.shape        = (3, 6)  # batch_size=3, max_len=6
attention_mask.shape   = (3, 6)
risk_label.shape       = (3,)
severity_score.shape   = (3,)
importance_score.shape = (3,)
```
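The example above can be verified directly by feeding three mock samples through `collate_batch()` (the label values are placeholders):

```python
import torch

mock_batch = [
    {'input_ids': torch.tensor([101, 2023, 2003, 102]),
     'attention_mask': torch.ones(4, dtype=torch.long),
     'risk_label': torch.tensor(0), 'severity_score': torch.tensor(0.2),
     'importance_score': torch.tensor(0.5)},
    {'input_ids': torch.tensor([101, 1996, 2172, 3325, 2003, 102]),
     'attention_mask': torch.ones(6, dtype=torch.long),
     'risk_label': torch.tensor(1), 'severity_score': torch.tensor(0.9),
     'importance_score': torch.tensor(0.8)},
    {'input_ids': torch.tensor([101, 102]),
     'attention_mask': torch.ones(2, dtype=torch.long),
     'risk_label': torch.tensor(0), 'severity_score': torch.tensor(0.1),
     'importance_score': torch.tensor(0.3)},
]

out = collate_batch(mock_batch)
print(out['input_ids'].shape)  # torch.Size([3, 6])
print(out['input_ids'][2])     # tensor([101, 102,   0,   0,   0,   0])
```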
---

## 🔍 Key Features

### **1. Dynamic Padding**
- Pads to the **max length in the current batch** (not a global max)
- More efficient: a batch of short sequences → less padding
- Example:
  - Batch 1: lengths [10, 12, 11] → pad to 12
  - Batch 2: lengths [50, 48, 52] → pad to 52
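The savings are easy to quantify; a rough comparison for the first batch above, assuming a fixed global max of 512:

```python
lengths = [10, 12, 11]
dynamic_slots = len(lengths) * max(lengths)  # 3 * 12  = 36 token slots
fixed_slots   = len(lengths) * 512           # 3 * 512 = 1536 token slots
# Dynamic padding uses ~2% of what fixed-length padding would here
```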
### **2. Attention Mask Handling**
- Original tokens: `attention_mask = 1` (attend)
- Padding tokens: `attention_mask = 0` (ignore)
- BERT won't process padding tokens
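For context, a sketch of how the mask reaches the model, assuming a Hugging Face `BertModel` (the project's actual model wiring may differ):

```python
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'])
# Positions where attention_mask == 0 are excluded from self-attention
```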
### **3. Preserves All Data**
- Risk labels, severity scores, and importance scores all pass through unchanged
- No data loss, just shape adjustment

---
## 🧪 Verification

### **Test Script: `test_collate_batch.py`**

Tests 7 scenarios:
1. ✅ Import the `collate_batch` function
2. ✅ Create a mock batch with variable lengths (4, 6, 2)
3. ✅ Collate into a uniform batch
4. ✅ Verify output shapes (3, 6)
5. ✅ Verify padding tokens are 0
6. ✅ Verify labels are preserved
7. ✅ Test with an actual DataLoader

**Run the test:**
```bash
python3 test_collate_batch.py
```

**Expected output:**
```
✅ Step 1: Imports successful
✅ Step 2: Creating mock batch with variable lengths...
   Sample 0: length 4
   Sample 1: length 6
   Sample 2: length 2
✅ Step 3: Testing collate_batch()...
   📊 Output shapes:
      input_ids: torch.Size([3, 6])
      attention_mask: torch.Size([3, 6])
✅ Step 4: Verifying shapes...
✅ Step 5: Verifying padding...
✅ Step 6: Verifying labels preserved...
✅ Step 7: Testing with actual DataLoader...
🎉 ALL TESTS PASSED!
```
---

## 📊 Performance Impact

### **Memory Usage:**

**Before (would crash):**
- Can't create batches ❌

**After:**
- Batch 1 (short clauses): `batch_size × 20 tokens` = minimal
- Batch 2 (long clauses): `batch_size × 512 tokens` = max
- **Adaptive:** Uses only what's needed per batch

### **Training Speed:**
- **No slowdown:** Padding is fast (`torch.cat`)
- **Actually faster:** Proper batching enables GPU parallelism
- **Efficient:** Only pads to the batch max, not a global max
---

## 🔍 Technical Details

### **Why the Attention Mask Matters:**
```python
# Without attention mask (wrong):
#   BERT attends to padding → learns meaningless patterns ❌
# With attention mask (correct):
#   BERT ignores padding → learns only from real tokens ✅
```

### **PyTorch DataLoader Flow:**
```
Dataset.__getitem__(idx)
  → Returns a single sample with variable length
        ↓
collate_fn(batch)
  → Pads all samples to the same length
        ↓
Batched tensors
  → All the same size, ready for the model
```

### **Padding Token ID:**
- **0** is the standard PAD token ID in BERT (`[PAD]`)
- Doesn't conflict with the vocabulary (reserved ID)
- The attention mask ensures it's ignored anyway
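This can be confirmed from the tokenizer itself (assuming `bert-base-uncased`):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.pad_token)     # '[PAD]'
print(tokenizer.pad_token_id)  # 0
```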
---

## 🚀 Ready to Train

Now training will proceed normally:

```bash
python3 train.py
```

**Expected flow:**
```
1. Load data ✅
2. Discover risk patterns (LDA) ✅
3. Create datasets ✅
4. Create dataloaders with collate_fn ✅
5. Start training...
   📊 Epoch 1/5
   Batch 0: input_ids shape = [16, 235] ✅
   Batch 1: input_ids shape = [16, 178] ✅
   Batch 2: input_ids shape = [16, 312] ✅
   ...
```
---

## 🔧 Alternative Solutions (Not Used)

### **1. Pre-pad to a Fixed Length**
```python
# Pad ALL sequences to a fixed max_length at tokenization time, e.g.:
#   tokenizer(text, padding='max_length', max_length=512, truncation=True)
# Wasteful: short clauses pad to 512 unnecessarily
```
**Why not:** Wastes memory and computation on padding that dynamic batching avoids.

### **2. Filter by Length**
```python
# Only keep clauses within a certain length range
# Loses data outside the range
```
**Why not:** Discards valuable training data.

### **3. Bucket Batching**
```python
# Group similar-length sequences into the same batch
# More complex to implement
```
**Why not:** The custom collate function is simpler and works well; a sketch of bucketing follows for reference.
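A minimal sketch of what bucketing could look like (hypothetical, not implemented here):

```python
# Sort sample indices by sequence length, then slice into batches so each
# batch holds similar lengths and needs minimal padding. Re-shuffling
# between epochs would need extra care.
batch_size = 16
indices = sorted(range(len(dataset)),
                 key=lambda i: dataset[i]['input_ids'].size(0))
buckets = [indices[i:i + batch_size]
           for i in range(0, len(indices), batch_size)]
```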
---

## 📝 Summary

### **Problem:**
- Variable-length sequences can't be stacked into batches
- Training crashes on the first batch

### **Solution:**
- Custom `collate_batch()` function
- Dynamically pads to the max length in each batch
- Preserves attention masks and all labels

### **Result:**
- ✅ Training proceeds normally
- ✅ Efficient memory usage
- ✅ No data loss
- ✅ Proper attention mask handling
---

## 📁 Files Modified

1. **`trainer.py`** - Added the `collate_batch()` function (45 lines)
   - Lines 18-61: `collate_batch()` function
   - Line 202: Added `collate_fn=collate_batch`
2. **`test_collate_batch.py`** - Verification test (180 lines)
3. **`doc/TRAINING_FIX_PADDING.md`** - This documentation
---

## ✅ Status

- [x] Identified root cause (variable-length tensors)
- [x] Implemented custom collate function
- [x] Added to DataLoader creation
- [x] Created comprehensive tests
- [x] Documented solution
- [x] **READY TO TRAIN** 🚀

---

**Next:** Run `python3 train.py` - training should now progress through epochs! 🎉