AdamW
AdamW is a variant of the Adam optimizer that decouples weight decay from the gradient update, based on the observation that weight decay behaves differently when applied to SGD than to Adam.
bitsandbytes also supports paged optimizers, which take advantage of CUDA's unified memory to transfer optimizer state from the GPU to the CPU when GPU memory is exhausted.
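The bitsandbytes optimizers are drop-in replacements for their torch.optim counterparts. Below is a minimal sketch of swapping in bitsandbytes.optim.AdamW, assuming a CUDA-capable GPU; the toy linear model and random batch are placeholders, not part of the library.

```python
import torch
import bitsandbytes as bnb

# Placeholder model; any torch.nn.Module works the same way.
model = torch.nn.Linear(4096, 4096).cuda()

# Drop-in replacement for torch.optim.AdamW, using the defaults
# documented below (lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2).
optimizer = bnb.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# One standard PyTorch training step.
batch = torch.randn(8, 4096, device="cuda")
loss = model(batch).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The 8-bit, 32-bit, and paged variants documented below all share this call pattern; only the class name changes.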
AdamW
class bitsandbytes.optim.AdamW
__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, optim_bits=32, args=None, min_8bit_size=4096, is_paged=False)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py#L10
Base AdamW optimizer.
Parameters:
params (torch.Tensor) : The input parameters to optimize.
lr (float, defaults to 1e-3) : The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) : The beta values are the decay rates of the first and second-order moment of the optimizer.
eps (float, defaults to 1e-8) : The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 1e-2) : The weight decay value for the optimizer.
amsgrad (bool, defaults to False) : Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
optim_bits (int, defaults to 32) : The number of bits of the optimizer state.
args (object, defaults to None) : An object with additional arguments.
min_8bit_size (int, defaults to 4096) : The minimum number of elements of the parameter tensors for 8-bit optimization.
is_paged (bool, defaults to False) : Whether the optimizer is a paged optimizer or not.
AdamW8bit
class bitsandbytes.optim.AdamW8bit
__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, optim_bits=32, args=None, min_8bit_size=4096, is_paged=False)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py#L63
8-bit AdamW optimizer.
Parameters:
params (torch.Tensor) : The input parameters to optimize.
lr (float, defaults to 1e-3) : The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) : The beta values are the decay rates of the first and second-order moment of the optimizer.
eps (float, defaults to 1e-8) : The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 1e-2) : The weight decay value for the optimizer.
amsgrad (bool, defaults to False) : Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead. Note: This parameter is not supported in AdamW8bit and must be False.
optim_bits (int, defaults to 32) : The number of bits of the optimizer state. Note: This parameter is not used in AdamW8bit as it always uses 8-bit optimization.
args (object, defaults to None) : An object with additional arguments.
min_8bit_size (int, defaults to 4096) : The minimum number of elements of the parameter tensors for 8-bit optimization.
is_paged (bool, defaults to False) : Whether the optimizer is a paged optimizer or not.
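As a sketch of the 8-bit variant: the constructor signature matches AdamW, but the optimizer state is quantized to 8 bits. Per the min_8bit_size parameter above, tensors with fewer than 4096 elements should keep 32-bit state, so small parameters such as biases are left unquantized. The model here is again a placeholder.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# Same signature as AdamW, but the first/second-moment state is
# stored in 8 bits. amsgrad must stay False for this variant.
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,
    min_8bit_size=4096,  # smaller tensors keep 32-bit state
)
```

Compared to 32-bit Adam state (8 bytes per parameter for the two moments), 8-bit state needs roughly 2 bytes per parameter, cutting optimizer memory by about 75%.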
AdamW32bit
class bitsandbytes.optim.AdamW32bit
__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, optim_bits=32, args=None, min_8bit_size=4096, is_paged=False)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py#L127
32-bit AdamW optimizer.
Parameters:
params (torch.Tensor) : The input parameters to optimize.
lr (float, defaults to 1e-3) : The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) : The beta values are the decay rates of the first and second-order moment of the optimizer.
eps (float, defaults to 1e-8) : The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 1e-2) : The weight decay value for the optimizer.
amsgrad (bool, defaults to False) : Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
optim_bits (int, defaults to 32) : The number of bits of the optimizer state.
args (object, defaults to None) : An object with additional arguments.
min_8bit_size (int, defaults to 4096) : The minimum number of elements of the parameter tensors for 8-bit optimization.
is_paged (bool, defaults to False) : Whether the optimizer is a paged optimizer or not.
PagedAdamW
class bitsandbytes.optim.PagedAdamW
__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, optim_bits=32, args=None, min_8bit_size=4096)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py#L180
Paged AdamW optimizer.
Parameters:
params (torch.Tensor) : The input parameters to optimize.
lr (float, defaults to 1e-3) : The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) : The beta values are the decay rates of the first and second-order moment of the optimizer.
eps (float, defaults to 1e-8) : The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 1e-2) : The weight decay value for the optimizer.
amsgrad (bool, defaults to False) : Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
optim_bits (int, defaults to 32) : The number of bits of the optimizer state.
args (object, defaults to None) : An object with additional arguments.
min_8bit_size (int, defaults to 4096) : The minimum number of elements of the parameter tensors for 8-bit optimization.
PagedAdamW8bit
class bitsandbytes.optim.PagedAdamW8bit
__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, optim_bits=32, args=None, min_8bit_size=4096)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py#L230
Paged 8-bit AdamW optimizer.
Parameters:
params (torch.Tensor) : The input parameters to optimize.
lr (float, defaults to 1e-3) : The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) : The beta values are the decay rates of the first and second-order moment of the optimizer.
eps (float, defaults to 1e-8) : The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 1e-2) : The weight decay value for the optimizer.
amsgrad (bool, defaults to False) : Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead. Note: This parameter is not supported in PagedAdamW8bit and must be False.
optim_bits (int, defaults to 32) : The number of bits of the optimizer state. Note: This parameter is not used in PagedAdamW8bit as it always uses 8-bit optimization.
args (object, defaults to None) : An object with additional arguments.
min_8bit_size (int, defaults to 4096) : The minimum number of elements of the parameter tensors for 8-bit optimization.
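A minimal sketch of the paged 8-bit variant, which combines both mechanisms described at the top of this page: 8-bit optimizer state plus CUDA unified-memory paging to CPU RAM when GPU memory is exhausted. Construction is identical to the non-paged class, and the paging itself needs no configuration here; the model is a placeholder.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()

# Same signature as AdamW8bit minus is_paged; optimizer state pages
# out to CPU memory automatically under GPU memory pressure.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-3)
```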
PagedAdamW32bit
class bitsandbytes.optim.PagedAdamW32bit
__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, optim_bits=32, args=None, min_8bit_size=4096)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/optim/adamw.py#L291
Paged 32-bit AdamW optimizer.
Parameters:
params (torch.Tensor) : The input parameters to optimize.
lr (float, defaults to 1e-3) : The learning rate.
betas (tuple(float, float), defaults to (0.9, 0.999)) : The beta values are the decay rates of the first and second-order moment of the optimizer.
eps (float, defaults to 1e-8) : The epsilon value prevents division by zero in the optimizer.
weight_decay (float, defaults to 1e-2) : The weight decay value for the optimizer.
amsgrad (bool, defaults to False) : Whether to use the AMSGrad variant of Adam that uses the maximum of past squared gradients instead.
optim_bits (int, defaults to 32) : The number of bits of the optimizer state.
args (object, defaults to None) : An object with additional arguments.
min_8bit_size (int, defaults to 4096) : The minimum number of elements of the parameter tensors for 8-bit optimization.