LIMITATIONS = """
This calculator makes a number of simplifying assumptions and has known limitations.

### Assumptions:
- Your tensor-parallel implementation also incorporates sequence parallelism
- You are doing selective recomputation with FlashAttention if not doing full gradient checkpointing
- You keep a master copy of the model weights for mixed precision
  - May not be true for some implementations, which cast on the fly
- You're using the Adam optimizer
- If using PP, you're using a schedule that keeps the number of live activations roughly constant
- EP is the number of PPxTP units that share each expert
- SwiGLU activation function
- Rotary embeddings

### Limitations:
- Does not support non-homogeneous layers
  - e.g. Llama 4 Maverick with alternating dense and sparse layers, iRoPE
- Does not include memory for kernel or framework overhead
- Does not include memory for intermediates
- Does not include vision layers for multi-modal models
- Models shared experts as another routed expert per token
- Does not support different dtypes for different parts of the model
  - e.g. MXFP4 for GPT-OSS 20B and 120B
- EP/FSDP interaction has not been validated
- Doesn't model biases on a per-model basis

Note this is not an exhaustive list, just some of the main ones.
"""