LIMITATIONS = """
### Key Assumptions:
- Standard transformer architecture with homogeneous layers
- Adam optimizer with mixed precision training (fp32 master weights copy)
- Tensor parallelism includes sequence parallelism
- Pipeline parallelism maintains consistent activation memory

### Not Currently Supported:
- Non-standard architectures (alternating dense/sparse layers, custom attention)
- Multi-modal models with vision layers
- Mixed dtype training (e.g., MXFP4)
- Kernel/framework overhead and intermediate memory

For advanced configurations, validate results against actual memory profiling.
"""