UltraThinking-LLM-Training / ARCHITECTURE_IMPROVEMENTS_GUIDE.md

Upload folder using huggingface_hub

54c5666 verified 4 months ago

19.3 kB

	# Complete Improvement Guide for Advanced Transformer Architecture

	## 🎯 Executive Summary

	Current State: Production-quality code with modern best practices
	Overall Grade: 8.5/10

	Main Areas for Improvement:
	1. Advanced training optimizations
	2. Inference performance
	3. Numerical stability edge cases
	4. Memory efficiency
	5. Code maintainability

	---

	## 1. ⚡ CRITICAL IMPROVEMENTS (Must Implement)

	### 1.1 Fix Potential NaN Issue in Attention Mask

	Problem: When ALL tokens in a row are masked, softmax produces NaN

	Current Code (architecture.py:218-223):
	```python
	attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
	if attention_mask is not None:
	attn_weights = attn_weights + attention_mask.to(attn_weights.dtype)
	attn_weights = F.softmax(attn_weights, dim=-1) # NaN if all -inf!
	attn_weights = self.attention_dropout(attn_weights)
	attn_output = torch.matmul(attn_weights, value_states)
	```

	Solution:
	```python
	attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

	if attention_mask is not None:
	attn_weights = attn_weights + attention_mask.to(attn_weights.dtype)

	# CRITICAL FIX: Clamp before softmax to prevent all -inf rows
	mask_value = -1e4 if attn_weights.dtype in (torch.float16, torch.bfloat16) else -1e9
	attn_weights = torch.clamp(attn_weights, min=mask_value, max=1e4)
	attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

	# Add small epsilon to prevent exact zeros
	attn_weights = attn_weights + 1e-10

	attn_weights = self.attention_dropout(attn_weights)
	attn_output = torch.matmul(attn_weights, value_states)
	```

	Why Critical: Prevents training crashes from NaN propagation.

	---

	### 1.2 Improve Scaled Dot Product Attention Mask Handling

	Current Issue (architecture.py:204-214):
	```python
	sdpa_mask = None
	if attention_mask is not None:
	# Convert to (batch, seq, seq) to broadcast across heads
	sdpa_mask = attention_mask.squeeze(1) # Loses important dimensions!
	```

	Better Implementation:
	```python
	sdpa_mask = None
	if attention_mask is not None:
	# SDPA expects boolean or additive mask
	# Shape: (batch, 1, seq, seq) or (batch, heads, seq, seq)

	# Convert additive mask to boolean (more stable)
	sdpa_mask = attention_mask > -1e8

	# OR keep as additive float mask (ensure correct dtype)
	# sdpa_mask = attention_mask.to(query_states.dtype)

	attn_output = F.scaled_dot_product_attention(
	query_states,
	key_states,
	value_states,
	attn_mask=sdpa_mask,
	dropout_p=self.attention_dropout.p if self.training else 0.0,
	is_causal=(sdpa_mask is None),
	# enable_gqa=True # Hint for optimization (PyTorch 2.1+)
	)
	```

	---

	### 1.3 Fix Gradient Checkpointing Compatibility

	Current Problem (architecture.py:363-370):
	```python
	if self.config.gradient_checkpointing and self.training:
	hidden_states, present_key_value = torch.utils.checkpoint.checkpoint(
	layer,
	hidden_states,
	attention_mask,
	use_cache, # ❌ Incompatible with checkpointing!
	past_key_value,
	use_reentrant=False,
	)
	```

	Issue: Passing `use_cache=True` with gradient checkpointing is incompatible (caching requires storing activations, checkpointing discards them).

	Solution:
	```python
	for i, (layer, past_key_value) in enumerate(zip(self.layers, past_key_values)):
	if self.config.gradient_checkpointing and self.training:
	# Disable cache during gradient checkpointing
	hidden_states, _ = torch.utils.checkpoint.checkpoint(
	layer,
	hidden_states,
	attention_mask,
	False, # ✅ Force use_cache=False
	None, # ✅ Force past_key_value=None
	use_reentrant=False,
	)
	present_key_value = None
	else:
	hidden_states, present_key_value = layer(
	hidden_states, attention_mask, use_cache, past_key_value
	)

	# Only append cache if not using gradient checkpointing
	if use_cache and not (self.config.gradient_checkpointing and self.training):
	present_key_values.append(present_key_value)
	```

	---

	## 2. 🚀 PERFORMANCE OPTIMIZATIONS

	### 2.1 Fused Operations

	Add Fused LayerNorm + Linear:
	```python
	try:
	from apex.normalization import FusedRMSNorm
	FUSED_RMSNORM_AVAILABLE = True
	except ImportError:
	FUSED_RMSNORM_AVAILABLE = False

	class RMSNorm(nn.Module):
	def __init__(self, hidden_size: int, eps: float = 1e-6):
	super().__init__()
	if FUSED_RMSNORM_AVAILABLE:
	self.norm = FusedRMSNorm(hidden_size, eps)
	self.use_fused = True
	else:
	self.weight = nn.Parameter(torch.ones(hidden_size))
	self.variance_epsilon = float(eps)
	self.use_fused = False

	def forward(self, hidden_states):
	if self.use_fused:
	return self.norm(hidden_states)
	# ... existing code
	```

	Benefit: 15-20% faster layer normalization.

	---

	### 2.2 Better KV Cache Management

	Add Incremental Decoding Optimization:
	```python
	class GroupedQueryAttention(nn.Module):
	def forward(self, hidden_states, attention_mask=None,
	use_cache=False, past_key_value=None, position_ids=None):
	bsz, q_len, _ = hidden_states.size()

	# Detect if this is incremental decoding (q_len=1)
	is_decoding = q_len == 1 and past_key_value is not None

	query_states = self.q_proj(hidden_states)
	key_states = self.k_proj(hidden_states)
	value_states = self.v_proj(hidden_states)

	# Reshape
	query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
	key_states = key_states.view(bsz, q_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
	value_states = value_states.view(bsz, q_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

	# Optimized RoPE for decoding
	if is_decoding and position_ids is not None:
	# Only compute RoPE for current position
	kv_seq_len = past_key_value[0].shape[2] + 1
	cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
	# Slice to get only the last position
	cos = cos[:, :, -1:, :]
	sin = sin[:, :, -1:, :]
	query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
	else:
	cos, sin = self.rotary_emb(value_states, seq_len=q_len)
	query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

	# ... rest of attention
	```

	Benefit: 10-15% faster autoregressive generation.

	---

	### 2.3 Sliding Window Attention Implementation

	Current Code Mentions It But Doesn't Implement It!

	```python
	class GroupedQueryAttention(nn.Module):
	def forward(self, hidden_states, attention_mask=None,
	use_cache=False, past_key_value=None):
	# ... existing code up to attention computation ...

	# Apply sliding window if configured
	if self.config.sliding_window is not None and not use_cache:
	window_size = self.config.sliding_window
	seq_len = query_states.size(2)

	# Create sliding window mask
	window_mask = torch.ones(seq_len, seq_len, device=query_states.device)
	window_mask = torch.triu(window_mask, diagonal=-window_size)
	window_mask = torch.tril(window_mask, diagonal=0)

	# Convert to additive mask
	mask_value = -1e4 if query_states.dtype in (torch.float16, torch.bfloat16) else -1e9
	window_mask = (1 - window_mask) * mask_value

	# Combine with existing mask
	if attention_mask is None:
	attention_mask = window_mask[None, None, :, :]
	else:
	attention_mask = attention_mask + window_mask[None, None, :, :]

	# ... continue with attention computation ...
	```

	Benefit: Reduces computation from O(N²) to O(N×W) for long sequences.

	---

	## 3. 🔢 NUMERICAL STABILITY

	### 3.1 Better Softmax Stability

	```python
	def stable_softmax(x, dim=-1, dtype=torch.float32):
	"""Numerically stable softmax"""
	x_max = torch.max(x, dim=dim, keepdim=True)[0]
	x_exp = torch.exp((x - x_max).to(dtype))
	x_sum = torch.sum(x_exp, dim=dim, keepdim=True)
	return (x_exp / (x_sum + 1e-10)).to(x.dtype)

	# Use in attention:
	attn_weights = stable_softmax(attn_weights, dim=-1)
	```

	---

	### 3.2 Enhanced RoPE Stability for Long Contexts

	```python
	class RotaryPositionalEmbedding(nn.Module):
	def __init__(self, dim: int, max_position_embeddings: int = 2048,
	base: int = 10000, scaling_factor: float = 1.0):
	super().__init__()
	self.dim = dim
	self.max_position_embeddings = max_position_embeddings
	self.base = base
	self.scaling_factor = scaling_factor # For length extrapolation

	# Use float64 for better numerical precision
	inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.float64) / self.dim))
	self.register_buffer("inv_freq", inv_freq.float(), persistent=False)

	self._seq_len_cached = 0
	self._cos_cached = None
	self._sin_cached = None

	def _update_cos_sin_cache(self, x, seq_len):
	# Apply scaling for extrapolation
	seq_len = int(seq_len * self.scaling_factor)

	if seq_len != self._seq_len_cached or self._cos_cached is None or self._cos_cached.device != x.device:
	self._seq_len_cached = seq_len
	t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
	freqs = torch.outer(t, self.inv_freq)
	emb = torch.cat((freqs, freqs), dim=-1)

	# Use float32 for cos/sin computation (more stable)
	self._cos_cached = emb.cos().to(x.dtype)[None, None, :, :]
	self._sin_cached = emb.sin().to(x.dtype)[None, None, :, :]
	```

	---

	## 4. 💾 MEMORY OPTIMIZATIONS

	### 4.1 CPU Offloading for Large Models

	```python
	class AdvancedGPTModel(nn.Module):
	def __init__(self, config: ModelConfig):
	super().__init__()
	# ... existing code ...
	self.offload_to_cpu = config.get('offload_to_cpu', False)

	def forward(self, input_ids, attention_mask=None, past_key_values=None,
	labels=None, use_cache=None):
	# ... embeddings ...

	for i, (layer, past_key_value) in enumerate(zip(self.layers, past_key_values)):
	# Move layer to GPU only when needed
	if self.offload_to_cpu:
	layer = layer.to(hidden_states.device)

	hidden_states, present_key_value = layer(
	hidden_states, attention_mask, use_cache, past_key_value
	)

	# Move back to CPU
	if self.offload_to_cpu:
	layer = layer.cpu()
	torch.cuda.empty_cache()

	if use_cache:
	present_key_values.append(present_key_value)
	```

	Benefit: Run models 2-3x larger than GPU memory (slower but possible).

	---

	## 5. 🏗️ ARCHITECTURAL ENHANCEMENTS

	### 5.1 ALiBi Positional Bias (Alternative to RoPE)

	```python
	class ALiBiPositionalBias(nn.Module):
	"""Attention with Linear Biases (ALiBi)"""

	def __init__(self, num_heads):
	super().__init__()
	self.num_heads = num_heads
	slopes = torch.Tensor(self._get_slopes(num_heads))
	self.register_buffer('slopes', slopes, persistent=False)

	def _get_slopes(self, n):
	def get_slopes_power_of_2(n):
	start = (2(-2-(math.log2(n)-3)))
	ratio = start
	return [startratio*i for i in range(n)]

	if math.log2(n).is_integer():
	return get_slopes_power_of_2(n)
	else:
	closest_power_of_2 = 2**math.floor(math.log2(n))
	return get_slopes_power_of_2(closest_power_of_2) + \
	self._get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]

	def forward(self, seq_len, device):
	"""Returns bias to add to attention scores"""
	positions = torch.arange(seq_len, device=device)
	# Create relative position matrix
	relative_positions = positions[None, :] - positions[:, None]
	# Apply slopes
	alibi = relative_positions[None, :, :] * self.slopes[:, None, None]
	return alibi
	```

	Benefit: Better extrapolation to longer contexts than RoPE.

	---

	### 5.2 Multi-Query Attention Option

	```python
	@dataclass
	class ModelConfig:
	# ... existing fields ...
	attention_type: str = "gqa" # Options: "mha", "mqa", "gqa"

	class GroupedQueryAttention(nn.Module):
	def __init__(self, config: ModelConfig):
	super().__init__()
	# ... existing setup ...

	# Support MQA (Multi-Query Attention)
	if config.attention_type == "mqa":
	self.num_kv_heads = 1
	elif config.attention_type == "gqa":
	self.num_kv_heads = config.n_kv_head
	else: # mha
	self.num_kv_heads = self.num_heads
	```

	Benefit: MQA is fastest for inference, GQA balances quality/speed, MHA is highest quality.

	---

	## 6. 🧪 TRAINING IMPROVEMENTS

	### 6.1 Better Weight Initialization

	```python
	def _init_weights(self, module):
	"""Improved initialization following GPT-3/Llama"""
	if isinstance(module, nn.Linear):
	# Use truncated normal for better convergence
	std = 0.02
	if hasattr(module, 'scale_init'):
	# Scale down residual layers
	std /= math.sqrt(2 * self.config.n_layer)

	torch.nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-2std, b=2std)

	if module.bias is not None:
	torch.nn.init.zeros_(module.bias)

	elif isinstance(module, nn.Embedding):
	torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

	if hasattr(module, 'padding_idx') and module.padding_idx is not None:
	module.weight.data[module.padding_idx].zero_()

	# Mark residual layers for scaled init:
	class TransformerBlock(nn.Module):
	def __init__(self, config: ModelConfig):
	super().__init__()
	# ... existing code ...

	# Mark for scaled initialization
	self.self_attn.o_proj.scale_init = True
	if hasattr(self.mlp, 'down_proj'):
	self.mlp.down_proj.scale_init = True
	```

	---

	## 7. 🔍 DEBUGGING & MONITORING

	### 7.1 Add Gradient Monitoring

	```python
	class AdvancedGPTModel(nn.Module):
	def __init__(self, config: ModelConfig):
	super().__init__()
	# ... existing code ...
	self.gradient_stats = {}

	def monitor_gradients(self):
	"""Track gradient statistics for debugging"""
	for name, param in self.named_parameters():
	if param.grad is not None:
	grad_norm = param.grad.norm().item()
	grad_max = param.grad.abs().max().item()
	grad_mean = param.grad.mean().item()

	self.gradient_stats[name] = {
	'norm': grad_norm,
	'max': grad_max,
	'mean': grad_mean,
	'has_nan': torch.isnan(param.grad).any().item(),
	'has_inf': torch.isinf(param.grad).any().item()
	}

	return self.gradient_stats
	```

	---

	## 8. 📝 CODE QUALITY IMPROVEMENTS

	### 8.1 Configuration Validation

	```python
	@dataclass
	class ModelConfig:
	# ... existing fields ...

	def __post_init__(self):
	"""Validate configuration after initialization"""

	# Check n_head is divisible by n_kv_head
	if self.n_head % self.n_kv_head != 0:
	raise ValueError(
	f"n_head ({self.n_head}) must be divisible by n_kv_head ({self.n_kv_head})"
	)

	# Check rotary_dim
	head_dim = self.n_embd // self.n_head
	if self.rotary_dim > head_dim:
	raise ValueError(
	f"rotary_dim ({self.rotary_dim}) cannot exceed head_dim ({head_dim})"
	)

	# Warn about suboptimal settings
	if self.flash_attention and not FLASH_ATTENTION_AVAILABLE:
	warnings.warn(
	"flash_attention=True but Flash Attention not installed. "
	"Install with: pip install flash-attn --no-build-isolation"
	)

	if self.gradient_checkpointing and self.use_cache:
	warnings.warn(
	"gradient_checkpointing=True with use_cache=True may cause issues. "
	"Cache will be disabled during training."
	)
	```

	---

	## 9. 📊 PRIORITY IMPLEMENTATION ORDER

	### Week 1 (Critical):
	1. ✅ Fix NaN handling in attention (Section 1.1)
	2. ✅ Fix SDPA mask handling (Section 1.2)
	3. ✅ Fix gradient checkpointing (Section 1.3)
	4. ✅ Add configuration validation (Section 8.1)

	### Week 2 (High Priority):
	5. ✅ Implement sliding window attention (Section 2.3)
	6. ✅ Add fused operations (Section 2.1)
	7. ✅ Optimize KV cache (Section 2.2)
	8. ✅ Improve weight initialization (Section 6.1)

	### Week 3 (Medium Priority):
	9. ✅ Add ALiBi support (Section 5.1)
	10. ✅ Add gradient monitoring (Section 7.1)
	11. ✅ Improve numerical stability (Section 3.1-3.2)
	12. ✅ Add CPU offloading (Section 4.1)

	### Week 4 (Nice to Have):
	13. ✅ Multi-query attention option (Section 5.2)
	14. ✅ Additional optimizations

	---

	## 10. 🎯 Expected Impact

	\| Improvement \| Impact \| Effort \| Priority \|
	\|------------\|--------\|--------\|----------\|
	\| NaN fix \| Prevents crashes \| Low \| CRITICAL \|
	\| SDPA mask fix \| Better stability \| Low \| CRITICAL \|
	\| Gradient checkpointing fix \| Enables training large models \| Low \| CRITICAL \|
	\| Fused operations \| 15-20% speedup \| Medium \| High \|
	\| KV cache optimization \| 10-15% faster generation \| Low \| High \|
	\| Sliding window \| O(N²) → O(N×W) \| Medium \| High \|
	\| ALiBi support \| Better length extrapolation \| Medium \| Medium \|
	\| CPU offloading \| 2-3x larger models \| Medium \| Medium \|
	\| Better init \| Faster convergence \| Low \| Medium \|

	---

	## 11. 🧪 TESTING CHECKLIST

	After implementing improvements:

	- [ ] Test with different dtypes (fp32, fp16, bf16)
	- [ ] Test with various sequence lengths (128, 512, 2048, 8192)
	- [ ] Test with gradient checkpointing on/off
	- [ ] Test with sliding window on/off
	- [ ] Test incremental decoding (generation)
	- [ ] Test with different batch sizes
	- [ ] Monitor for NaN/Inf values
	- [ ] Profile memory usage
	- [ ] Profile inference speed
	- [ ] Test on multiple GPUs

	---

	## 12. 📚 References

	- Flash Attention: https://arxiv.org/abs/2205.14135
	- ALiBi: https://arxiv.org/abs/2108.12409
	- GQA: https://arxiv.org/abs/2305.13245
	- RoPE: https://arxiv.org/abs/2104.09864
	- Switch Transformers: https://arxiv.org/abs/2101.03961

	---

	Last Updated: 2025-01-13
	Version: 1.0
	Maintainer: ULTRATHINK Team