| LIMITATIONS = """ | |
| ### Key Assumptions: | |
| - Standard transformer architecture with homogeneous layers | |
| - Adam optimizer with mixed precision training (master weights copy) | |
| - Tensor parallelism includes sequence parallelism | |
| - Pipeline parallelism maintains consistent activation memory | |
| ### Not Currently Supported: | |
| - Non-standard architectures (alternating dense/sparse layers, custom attention) | |
| - Multi-modal models with vision layers | |
| - Mixed dtype training (e.g., MXFP4) | |
| - Kernel/framework overhead and intermediate memory | |
| For advanced configurations, results should be validated against profiling. | |
| """ |