Buckets:

hf-doc-build
/

doc

Files

xet

hf-doc-build/doc / bitsandbytes /main /en /optimizers.md

HuggingFaceDocBuilder

4 days ago

preview code

download

raw

4.75 kB

	# 8-bit optimizers

	With 8-bit optimizers, large models can be finetuned with 75% less GPU memory without losing any accuracy compared to training with standard 32-bit optimizers. The reduced memory requirements means 8-bit optimizers are 4x faster than a standard optimizer, and no hyperparameter tuning is required.

	This guide will show you how to use 8-bit optimizers.

	> [!WARNING]
	> 8-bit optimizers reduce memory usage and accelerate optimization on a wide range of tasks. However, since 8-bit optimizers only reduce memory proportional to the number of parameters, models that use large amounts of activation memory, such as convolutional networks, don't really benefit from 8-bit optimizers. 8-bit optimizers are most beneficial for training or finetuning models with many parameters on highly memory-constrained GPUs.

	8-bit optimizers are a drop-in replacement for regular optimizers which means they also accept the same arguments as a regular optimizer. For NLP models, it is recommended to use the [StableEmbedding](/docs/bitsandbytes/main/en/reference/nn/embeddings#bitsandbytes.nn.StableEmbedding) class to improve stability and results.

	```diff
	import bitsandbytes as bnb

	- adam = torch.optim.Adam(...)
	+ adam = bnb.optim.Adam8bit(...)

	# recommended for NLP models
	- before: torch.nn.Embedding(...)
	+ bnb.nn.StableEmbedding(...)
	```

	By default, all parameter tensors with less than 4096 elements are kept at 32-bits even if you initialize those parameters with 8-bit optimizers. This is done because small tensors do not save much memory and often contain highly variable parameters (biases) or parameters that require high precision (batch norm, layer norm).

	You can change this value with the `min_8bit_size` parameter. For example, if you want to optimize parameters to 8-bits only if the minimum size is 16384 values (it is recommended to use multiples of 4096):

	```py
	import bitsandbytes as bnb

	adam = bnb.optim.Adam8bit(model.parameters(), min_8bit_size=16384)
	```

	Other parameters you can configure include the learning rate (`lr`), the decay rates (`betas`), and the number of bits of the optimizer state (`optim_bits`). For example, to initialize a 32-bit [Adam](/docs/bitsandbytes/main/en/reference/optim/adam#bitsandbytes.optim.Adam) optimizer:

	```py
	import bitsandbytes as bnb

	adam = bnb.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.995), optim_bits=32)
	```

	## Optimize unstable parameters

	To optimize some unstable parameters with 32-bit Adam and others with 8-bit Adam, use the [GlobalOptimManager](/docs/bitsandbytes/main/en/reference/optim/optim_overview#bitsandbytes.optim.GlobalOptimManager) class to override the specific hyperparameters for a particular layer. You'll need to:

	1. Register the parameters while they're on the CPU.

	```py
	import torch
	import bitsandbytes as bnb

	mng = bnb.optim.GlobalOptimManager.get_instance()

	model = MyModel()
	mng.register_parameters(model.parameters())
	```

	2. Override the config with the new desired hyperparameters. For example, let's override the `model.fc1.weight` layer to use 32-bit Adam.

	> [!TIP]
	> Check the optimizer API documentation for more information about other hyperparameters you can override.

	```py
	model = model.cuda()
	# use 8-bit optimizer states for all parameters
	adam = bnb.optim.Adam(model.parameters(), lr=0.001, optim_bits=8)

	# override the parameter model.fc1.weight now uses 32-bit Adam
	mng.override_config(model.fc1.weight, "optim_bits", 32)
	```

	You can also override multiple layers at once by passing them as a list and the new hyperparameters as a dictionary. For example, let's override the `model.special.weight` and `model.also_special.weight` layers to use sparse optimization and a lower learning and decay rate.

	```py
	mng.override_config([model.special.weight, model.also_special.weight],
	key_value_dict ={'is_sparse': True, 'lr': 1e-5, 'betas'=(0.9, 0.98)})
	```

	For a specific layer, we recommend overriding locally in each module. Pass the module, the parameter, and its attribute name to the [GlobalOptimManager](/docs/bitsandbytes/main/en/reference/optim/optim_overview#bitsandbytes.optim.GlobalOptimManager):

	```py
	class MyModule(torch.nn.Module):
	def __init__(d_in, d_out):
	super(MyModule, self).__init__()
	self.linear = torch.nn.Linear(d_in, d_out)
	# optimization will happen in 32-bit and
	# learning rate will be set to 0.0001 independent of the main learning rate
	config = {'optim_bits': 32, 'lr' : 0.0001}
	GlobalOptimManager.get_instance().register_module_override(self, 'weight', config)

	```

	## Next steps

	For more conceptual details and explanation about 8-bit optimizers, take a look at the [8-bit optimizers](./explanations/optimizers) guide.

Xet Storage Details

Size:: 4.75 kB
Xet hash:: 350e9373512b14adfceeffd4e7cc26d43089a8bea397a51a395652d69446d604

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.