UltraThinking-LLM-Training / ARCHITECTURE_IMPROVEMENTS_GUIDE.md

Vedisasi

Upload folder using huggingface_hub

54c5666 verified 4 months ago

preview code

raw

history blame contribute delete

19.3 kB

Complete Improvement Guide for Advanced Transformer Architecture

🎯 Executive Summary

Current State: Production-quality code with modern best practices
Overall Grade: 8.5/10

Main Areas for Improvement:

Advanced training optimizations
Inference performance
Numerical stability edge cases
Memory efficiency
Code maintainability

1. ⚡ CRITICAL IMPROVEMENTS (Must Implement)

1.1 Fix Potential NaN Issue in Attention Mask

Problem: When ALL tokens in a row are masked, softmax produces NaN

Current Code (architecture.py:218-223):

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
if attention_mask is not None:
    attn_weights = attn_weights + attention_mask.to(attn_weights.dtype)
attn_weights = F.softmax(attn_weights, dim=-1)  # NaN if all -inf!
attn_weights = self.attention_dropout(attn_weights)
attn_output = torch.matmul(attn_weights, value_states)

Solution:

attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

if attention_mask is not None:
    attn_weights = attn_weights + attention_mask.to(attn_weights.dtype)
    
# CRITICAL FIX: Clamp before softmax to prevent all -inf rows
mask_value = -1e4 if attn_weights.dtype in (torch.float16, torch.bfloat16) else -1e9
attn_weights = torch.clamp(attn_weights, min=mask_value, max=1e4)
attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)

# Add small epsilon to prevent exact zeros
attn_weights = attn_weights + 1e-10

attn_weights = self.attention_dropout(attn_weights)
attn_output = torch.matmul(attn_weights, value_states)

Why Critical: Prevents training crashes from NaN propagation.

1.2 Improve Scaled Dot Product Attention Mask Handling

Current Issue (architecture.py:204-214):

sdpa_mask = None
if attention_mask is not None:
    # Convert to (batch, seq, seq) to broadcast across heads
    sdpa_mask = attention_mask.squeeze(1)  # Loses important dimensions!

Better Implementation:

sdpa_mask = None
if attention_mask is not None:
    # SDPA expects boolean or additive mask
    # Shape: (batch, 1, seq, seq) or (batch, heads, seq, seq)
    
    # Convert additive mask to boolean (more stable)
    sdpa_mask = attention_mask > -1e8
    
    # OR keep as additive float mask (ensure correct dtype)
    # sdpa_mask = attention_mask.to(query_states.dtype)
    
attn_output = F.scaled_dot_product_attention(
    query_states,
    key_states,
    value_states,
    attn_mask=sdpa_mask,
    dropout_p=self.attention_dropout.p if self.training else 0.0,
    is_causal=(sdpa_mask is None),
    # enable_gqa=True  # Hint for optimization (PyTorch 2.1+)
)

1.3 Fix Gradient Checkpointing Compatibility

Current Problem (architecture.py:363-370):

if self.config.gradient_checkpointing and self.training:
    hidden_states, present_key_value = torch.utils.checkpoint.checkpoint(
        layer,
        hidden_states,
        attention_mask,
        use_cache,  # ❌ Incompatible with checkpointing!
        past_key_value,
        use_reentrant=False,
    )

Issue: Passing use_cache=True with gradient checkpointing is incompatible (caching requires storing activations, checkpointing discards them).

Solution:

for i, (layer, past_key_value) in enumerate(zip(self.layers, past_key_values)):
    if self.config.gradient_checkpointing and self.training:
        # Disable cache during gradient checkpointing
        hidden_states, _ = torch.utils.checkpoint.checkpoint(
            layer,
            hidden_states,
            attention_mask,
            False,  # ✅ Force use_cache=False
            None,   # ✅ Force past_key_value=None
            use_reentrant=False,
        )
        present_key_value = None
    else:
        hidden_states, present_key_value = layer(
            hidden_states, attention_mask, use_cache, past_key_value
        )
    
    # Only append cache if not using gradient checkpointing
    if use_cache and not (self.config.gradient_checkpointing and self.training):
        present_key_values.append(present_key_value)

2. 🚀 PERFORMANCE OPTIMIZATIONS

2.1 Fused Operations

Add Fused LayerNorm + Linear:

try:
    from apex.normalization import FusedRMSNorm
    FUSED_RMSNORM_AVAILABLE = True
except ImportError:
    FUSED_RMSNORM_AVAILABLE = False

class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        if FUSED_RMSNORM_AVAILABLE:
            self.norm = FusedRMSNorm(hidden_size, eps)
            self.use_fused = True
        else:
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.variance_epsilon = float(eps)
            self.use_fused = False
    
    def forward(self, hidden_states):
        if self.use_fused:
            return self.norm(hidden_states)
        # ... existing code

Benefit: 15-20% faster layer normalization.

2.2 Better KV Cache Management

Add Incremental Decoding Optimization:

class GroupedQueryAttention(nn.Module):
    def forward(self, hidden_states, attention_mask=None, 
                use_cache=False, past_key_value=None, position_ids=None):
        bsz, q_len, _ = hidden_states.size()
        
        # Detect if this is incremental decoding (q_len=1)
        is_decoding = q_len == 1 and past_key_value is not None

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        # Reshape
        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Optimized RoPE for decoding
        if is_decoding and position_ids is not None:
            # Only compute RoPE for current position
            kv_seq_len = past_key_value[0].shape[2] + 1
            cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
            # Slice to get only the last position
            cos = cos[:, :, -1:, :]
            sin = sin[:, :, -1:, :]
            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
        else:
            cos, sin = self.rotary_emb(value_states, seq_len=q_len)
            query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        # ... rest of attention

Benefit: 10-15% faster autoregressive generation.

2.3 Sliding Window Attention Implementation

Current Code Mentions It But Doesn't Implement It!

class GroupedQueryAttention(nn.Module):
    def forward(self, hidden_states, attention_mask=None, 
                use_cache=False, past_key_value=None):
        # ... existing code up to attention computation ...
        
        # Apply sliding window if configured
        if self.config.sliding_window is not None and not use_cache:
            window_size = self.config.sliding_window
            seq_len = query_states.size(2)
            
            # Create sliding window mask
            window_mask = torch.ones(seq_len, seq_len, device=query_states.device)
            window_mask = torch.triu(window_mask, diagonal=-window_size)
            window_mask = torch.tril(window_mask, diagonal=0)
            
            # Convert to additive mask
            mask_value = -1e4 if query_states.dtype in (torch.float16, torch.bfloat16) else -1e9
            window_mask = (1 - window_mask) * mask_value
            
            # Combine with existing mask
            if attention_mask is None:
                attention_mask = window_mask[None, None, :, :]
            else:
                attention_mask = attention_mask + window_mask[None, None, :, :]
        
        # ... continue with attention computation ...

Benefit: Reduces computation from O(N²) to O(N×W) for long sequences.

3. 🔢 NUMERICAL STABILITY

3.1 Better Softmax Stability

def stable_softmax(x, dim=-1, dtype=torch.float32):
    """Numerically stable softmax"""
    x_max = torch.max(x, dim=dim, keepdim=True)[0]
    x_exp = torch.exp((x - x_max).to(dtype))
    x_sum = torch.sum(x_exp, dim=dim, keepdim=True)
    return (x_exp / (x_sum + 1e-10)).to(x.dtype)

# Use in attention:
attn_weights = stable_softmax(attn_weights, dim=-1)

3.2 Enhanced RoPE Stability for Long Contexts

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, dim: int, max_position_embeddings: int = 2048, 
                 base: int = 10000, scaling_factor: float = 1.0):
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.scaling_factor = scaling_factor  # For length extrapolation
        
        # Use float64 for better numerical precision
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.float64) / self.dim))
        self.register_buffer("inv_freq", inv_freq.float(), persistent=False)
        
        self._seq_len_cached = 0
        self._cos_cached = None
        self._sin_cached = None

    def _update_cos_sin_cache(self, x, seq_len):
        # Apply scaling for extrapolation
        seq_len = int(seq_len * self.scaling_factor)
        
        if seq_len != self._seq_len_cached or self._cos_cached is None or self._cos_cached.device != x.device:
            self._seq_len_cached = seq_len
            t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1)
            
            # Use float32 for cos/sin computation (more stable)
            self._cos_cached = emb.cos().to(x.dtype)[None, None, :, :]
            self._sin_cached = emb.sin().to(x.dtype)[None, None, :, :]

4. 💾 MEMORY OPTIMIZATIONS

4.1 CPU Offloading for Large Models

class AdvancedGPTModel(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        # ... existing code ...
        self.offload_to_cpu = config.get('offload_to_cpu', False)
        
    def forward(self, input_ids, attention_mask=None, past_key_values=None, 
                labels=None, use_cache=None):
        # ... embeddings ...
        
        for i, (layer, past_key_value) in enumerate(zip(self.layers, past_key_values)):
            # Move layer to GPU only when needed
            if self.offload_to_cpu:
                layer = layer.to(hidden_states.device)
            
            hidden_states, present_key_value = layer(
                hidden_states, attention_mask, use_cache, past_key_value
            )
            
            # Move back to CPU
            if self.offload_to_cpu:
                layer = layer.cpu()
                torch.cuda.empty_cache()
            
            if use_cache:
                present_key_values.append(present_key_value)

Benefit: Run models 2-3x larger than GPU memory (slower but possible).

5. 🏗️ ARCHITECTURAL ENHANCEMENTS

5.1 ALiBi Positional Bias (Alternative to RoPE)

class ALiBiPositionalBias(nn.Module):
    """Attention with Linear Biases (ALiBi)"""
    
    def __init__(self, num_heads):
        super().__init__()
        self.num_heads = num_heads
        slopes = torch.Tensor(self._get_slopes(num_heads))
        self.register_buffer('slopes', slopes, persistent=False)
    
    def _get_slopes(self, n):
        def get_slopes_power_of_2(n):
            start = (2**(-2**-(math.log2(n)-3)))
            ratio = start
            return [start*ratio**i for i in range(n)]
        
        if math.log2(n).is_integer():
            return get_slopes_power_of_2(n)
        else:
            closest_power_of_2 = 2**math.floor(math.log2(n))
            return get_slopes_power_of_2(closest_power_of_2) + \
                   self._get_slopes(2*closest_power_of_2)[0::2][:n-closest_power_of_2]
    
    def forward(self, seq_len, device):
        """Returns bias to add to attention scores"""
        positions = torch.arange(seq_len, device=device)
        # Create relative position matrix
        relative_positions = positions[None, :] - positions[:, None]
        # Apply slopes
        alibi = relative_positions[None, :, :] * self.slopes[:, None, None]
        return alibi

Benefit: Better extrapolation to longer contexts than RoPE.

5.2 Multi-Query Attention Option

@dataclass
class ModelConfig:
    # ... existing fields ...
    attention_type: str = "gqa"  # Options: "mha", "mqa", "gqa"
    
class GroupedQueryAttention(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        # ... existing setup ...
        
        # Support MQA (Multi-Query Attention)
        if config.attention_type == "mqa":
            self.num_kv_heads = 1
        elif config.attention_type == "gqa":
            self.num_kv_heads = config.n_kv_head
        else:  # mha
            self.num_kv_heads = self.num_heads

Benefit: MQA is fastest for inference, GQA balances quality/speed, MHA is highest quality.

6. 🧪 TRAINING IMPROVEMENTS

6.1 Better Weight Initialization

def _init_weights(self, module):
    """Improved initialization following GPT-3/Llama"""
    if isinstance(module, nn.Linear):
        # Use truncated normal for better convergence
        std = 0.02
        if hasattr(module, 'scale_init'):
            # Scale down residual layers
            std /= math.sqrt(2 * self.config.n_layer)
        
        torch.nn.init.trunc_normal_(module.weight, mean=0.0, std=std, a=-2*std, b=2*std)
        
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        
        if hasattr(module, 'padding_idx') and module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()

# Mark residual layers for scaled init:
class TransformerBlock(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        # ... existing code ...
        
        # Mark for scaled initialization
        self.self_attn.o_proj.scale_init = True
        if hasattr(self.mlp, 'down_proj'):
            self.mlp.down_proj.scale_init = True

7. 🔍 DEBUGGING & MONITORING

7.1 Add Gradient Monitoring

class AdvancedGPTModel(nn.Module):
    def __init__(self, config: ModelConfig):
        super().__init__()
        # ... existing code ...
        self.gradient_stats = {}
    
    def monitor_gradients(self):
        """Track gradient statistics for debugging"""
        for name, param in self.named_parameters():
            if param.grad is not None:
                grad_norm = param.grad.norm().item()
                grad_max = param.grad.abs().max().item()
                grad_mean = param.grad.mean().item()
                
                self.gradient_stats[name] = {
                    'norm': grad_norm,
                    'max': grad_max,
                    'mean': grad_mean,
                    'has_nan': torch.isnan(param.grad).any().item(),
                    'has_inf': torch.isinf(param.grad).any().item()
                }
        
        return self.gradient_stats

8. 📝 CODE QUALITY IMPROVEMENTS

8.1 Configuration Validation

@dataclass
class ModelConfig:
    # ... existing fields ...
    
    def __post_init__(self):
        """Validate configuration after initialization"""
        
        # Check n_head is divisible by n_kv_head
        if self.n_head % self.n_kv_head != 0:
            raise ValueError(
                f"n_head ({self.n_head}) must be divisible by n_kv_head ({self.n_kv_head})"
            )
        
        # Check rotary_dim
        head_dim = self.n_embd // self.n_head
        if self.rotary_dim > head_dim:
            raise ValueError(
                f"rotary_dim ({self.rotary_dim}) cannot exceed head_dim ({head_dim})"
            )
        
        # Warn about suboptimal settings
        if self.flash_attention and not FLASH_ATTENTION_AVAILABLE:
            warnings.warn(
                "flash_attention=True but Flash Attention not installed. "
                "Install with: pip install flash-attn --no-build-isolation"
            )
        
        if self.gradient_checkpointing and self.use_cache:
            warnings.warn(
                "gradient_checkpointing=True with use_cache=True may cause issues. "
                "Cache will be disabled during training."
            )

9. 📊 PRIORITY IMPLEMENTATION ORDER

Week 1 (Critical):

✅ Fix NaN handling in attention (Section 1.1)
✅ Fix SDPA mask handling (Section 1.2)
✅ Fix gradient checkpointing (Section 1.3)
✅ Add configuration validation (Section 8.1)

Week 2 (High Priority):

✅ Implement sliding window attention (Section 2.3)
✅ Add fused operations (Section 2.1)
✅ Optimize KV cache (Section 2.2)
✅ Improve weight initialization (Section 6.1)

Week 3 (Medium Priority):

✅ Add ALiBi support (Section 5.1)
✅ Add gradient monitoring (Section 7.1)
✅ Improve numerical stability (Section 3.1-3.2)
✅ Add CPU offloading (Section 4.1)

Week 4 (Nice to Have):

✅ Multi-query attention option (Section 5.2)
✅ Additional optimizations

10. 🎯 Expected Impact

Improvement	Impact	Effort	Priority
NaN fix	Prevents crashes	Low	CRITICAL
SDPA mask fix	Better stability	Low	CRITICAL
Gradient checkpointing fix	Enables training large models	Low	CRITICAL
Fused operations	15-20% speedup	Medium	High
KV cache optimization	10-15% faster generation	Low	High
Sliding window	O(N²) → O(N×W)	Medium	High
ALiBi support	Better length extrapolation	Medium	Medium
CPU offloading	2-3x larger models	Medium	Medium
Better init	Faster convergence	Low	Medium

11. 🧪 TESTING CHECKLIST

After implementing improvements:

Test with different dtypes (fp32, fp16, bf16)
Test with various sequence lengths (128, 512, 2048, 8192)
Test with gradient checkpointing on/off
Test with sliding window on/off
Test incremental decoding (generation)
Test with different batch sizes
Monitor for NaN/Inf values
Profile memory usage
Profile inference speed
Test on multiple GPUs

12. 📚 References

Flash Attention: https://arxiv.org/abs/2205.14135
ALiBi: https://arxiv.org/abs/2108.12409
GQA: https://arxiv.org/abs/2305.13245
RoPE: https://arxiv.org/abs/2104.09864
Switch Transformers: https://arxiv.org/abs/2101.03961

Last Updated: 2025-01-13
Version: 1.0
Maintainer: ULTRATHINK Team