LIMITATIONS = """
This calculator makes many assumptions and has many limitations.
### Assumptions:
- Your implementation of tensor parallel also incorporates sequence parallel
- If you are not using gradient checkpointing, you are doing selective recomputation with flash attention
- You keep a master copy of the model weights for mixed precision
    - May not be true for some implementations which cast on the fly
- You're using Adam optimizer
- If using PP, you're using a schedule that keeps the number of activations roughly the same
- EP is the number of PPxTP units that share each expert
- SwiGLU activation function
- Rotary embeddings
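
As a rough illustration of the mixed-precision and Adam assumptions above, the sketch below counts per-parameter bytes for weights, gradients, and optimizer state. The function names, the dtype sizes (bf16 weights and gradients, fp32 master copy and Adam moments), and the absence of any sharding are assumptions made for illustration; this is not necessarily how the calculator computes it.

```python
def model_state_bytes_per_param(
    weight_bytes=2,        # assumed bf16 working weights
    grad_bytes=2,          # assumed bf16 gradients
    master_bytes=4,        # assumed fp32 master copy of the weights
    adam_moment_bytes=4,   # assumed fp32 for each Adam moment
):
    # Adam keeps two moments per parameter (first and second).
    optimizer_bytes = 2 * adam_moment_bytes
    return weight_bytes + grad_bytes + master_bytes + optimizer_bytes


def model_state_gib(num_params, **dtype_sizes):
    # Total unsharded model-state memory in GiB (hypothetical helper).
    return num_params * model_state_bytes_per_param(**dtype_sizes) / 2**30
```

Under these assumed defaults the total is 16 bytes per parameter. An implementation that casts on the fly, without a master copy, could be approximated by passing `master_bytes=0`.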

### Limitations:
- Does not support non-homogeneous layers
    - e.g. Llama4 Maverick with alternating dense and sparse layers, iRoPE
- Does not include memory for kernel or framework overhead
- Does not include memory for intermediates 
- Does not include vision layers for multi-modal models
- Models shared experts as another routed expert per token
- Does not support different dtypes for different parts of the model
    - e.g. MXFP4 for GPT-OSS 20B and 120B
- Have not validated EP/FSDP interaction
- Doesn't model biases on a per-model basis

Note that this is not an exhaustive list, just some of the main items.
"""