LIMITATIONS = """ ### Key Assumptions: - Standard transformer architecture with homogeneous layers - Adam optimizer with mixed precision training (master weights copy) - Tensor parallelism includes sequence parallelism - Pipeline parallelism maintains consistent activation memory ### Not Currently Supported: - Non-standard architectures (alternating dense/sparse layers, custom attention) - Multi-modal models with vision layers - Mixed dtype training (e.g., MXFP4) - Kernel/framework overhead and intermediate memory For advanced configurations, results should be validated against profiling. """