LIMITATIONS = """This calculator has many limitations and assumptions.

### Assumptions:

- Your implementation of tensor parallelism also incorporates sequence parallelism
- You are doing selective recomputation with flash attention if not doing gradient checkpointing
- You keep a master copy of the model weights for mixed precision
  - May not be true for some implementations, which cast on the fly
- You're using the Adam optimizer
- If using PP, you're using a schedule that keeps the number of live activations roughly the same
- EP is the number of PPxTP units that share each expert
- SwiGLU activation function
- Rotary embeddings

### Limitations:

- Does not support non-homogeneous layers
  - e.g. Llama 4 Maverick with alternating dense and sparse layers, iRoPE
- Does not include memory for kernel or framework overhead
- Does not include memory for intermediates
- Does not include vision layers for multi-modal models
- Models shared experts as another routed expert per token
- Does not support different dtypes for different parts of the model
  - e.g. MXFP4 for GPT-OSS 20B and 120B
- EP/FSDP interaction has not been validated
- Doesn't model biases on a per-model basis

Note this is not an exhaustive list, just some of the main ones.
"""
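The mixed-precision and Adam assumptions above imply a fixed amount of persistent state per parameter. A minimal sketch of that accounting, assuming bf16 compute with an fp32 master copy and two fp32 Adam moments (the function name and defaults here are illustrative, not part of the calculator):

```python
def training_state_bytes_per_param(param_bytes: int = 2, grad_bytes: int = 2) -> int:
    """Persistent bytes per parameter under mixed-precision Adam.

    param_bytes / grad_bytes: size of the compute-precision copies
    (2 bytes each for bf16). On top of those, we assume a 4-byte fp32
    master weight and two 4-byte fp32 Adam moments (m and v), per the
    assumptions listed above.
    """
    master_bytes = 4           # fp32 master copy kept for the optimizer update
    adam_moment_bytes = 4 + 4  # fp32 first and second moments
    return param_bytes + grad_bytes + master_bytes + adam_moment_bytes

# e.g. bf16 training: 2 + 2 + 4 + 4 + 4 = 16 bytes per parameter
```

Implementations that cast on the fly instead of keeping a master copy would drop the 4-byte master term, which is exactly why that assumption is called out above.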