LIMITATIONS = """
This calculator makes many assumptions and has many limitations.
### Assumptions:
- Your implementation of tensor parallel also incorporates sequence parallel
- If you are not doing gradient checkpointing, you are doing selective recomputation with flash attention
- You keep a master copy of the model weights for mixed precision
  - This may not hold for implementations that cast on the fly
- You're using Adam optimizer
- If using PP, you're using a schedule that keeps the number of activations roughly the same
- EP is the number of PPxTP units that share each expert
- SwiGLU activation function
- Rotary embeddings
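
As a rough sketch of what the master-copy and Adam assumptions above imply for static memory: the function names and default byte widths below are illustrative assumptions, not taken from this calculator's code. In particular, gradients are assumed to be stored at the same 16-bit width as the weights.

```python
def static_bytes_per_param(weight_bytes=2, grad_bytes=2, master_bytes=4, adam_state_bytes=4):
    # Adam keeps two states per parameter: the first and second moment estimates.
    return weight_bytes + grad_bytes + master_bytes + 2 * adam_state_bytes


def static_memory_gib(num_params, **byte_widths):
    # Total bytes for weights, gradients, master weights and optimizer states, in GiB.
    return num_params * static_bytes_per_param(**byte_widths) / 2**30


print(static_bytes_per_param())                # 16 bytes per parameter
print(static_bytes_per_param(master_bytes=0))  # 12 without a separate master copy
```

This covers only per-parameter state before any sharding across parallel dimensions; activations are not included.
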
### Limitations:
- Does not support non-homogeneous layers
  - e.g. Llama 4 Maverick with alternating dense and sparse layers, iRoPE
- Does not include memory for kernel or framework overhead
- Does not include memory for intermediates
- Does not include vision layers for multi-modal models
- Models shared experts as another routed expert per token
- Does not support different dtypes for different parts of the model
  - e.g. MXFP4 for GPT-OSS 20B and 120B
- The EP/FSDP interaction has not been validated
- Doesn't model biases on a per-model basis

Note: this is not an exhaustive list, just some of the main items.
"""