LIMITATIONS = """
This calculator makes a number of simplifying assumptions and has known limitations.

### Assumptions:
- Your tensor-parallel implementation also incorporates sequence parallelism
- You are doing selective recomputation with FlashAttention if not doing full gradient checkpointing
- You keep a master copy of the model weights for mixed precision
  - May not be true for some implementations, which cast on the fly
- You're using the Adam optimizer
- If using PP, you're using a schedule that keeps the number of live activations roughly constant
- EP is the number of PPxTP units that share each expert
- SwiGLU activation function
- Rotary embeddings

### Limitations:
- Does not support non-homogeneous layers
  - e.g. Llama 4 Maverick with alternating dense and sparse layers, iRoPE
- Does not include memory for kernel or framework overhead
- Does not include memory for intermediates
- Does not include vision layers for multi-modal models
- Models shared experts as another routed expert per token
- Does not support different dtypes for different parts of the model
  - e.g. MXFP4 for GPT-OSS 20B and 120B
- EP/FSDP interaction has not been validated
- Doesn't model biases on a per-model basis

Note this is not an exhaustive list, just some of the main ones.
"""