LLM.int8()
LLM.int8() is a quantization method that aims to make large language model inference more accessible without significant performance degradation. Unlike naive 8-bit quantization, which can lose critical information and hurt accuracy, LLM.int8() dynamically adapts so that the sensitive parts of the computation keep higher precision where needed. The key idea is to extract the outlier feature dimensions from the inputs and the corresponding weights and multiply them in 16-bit. All other values are multiplied in 8-bit and then dequantized back to 16-bit. The outputs of the 16-bit and 8-bit multiplications are combined to produce the final result.
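To make the decomposition concrete, here is a rough sketch in plain PyTorch. It is illustrative only: the function name and the 6.0 threshold are chosen for demonstration, and ordinary floating-point arithmetic stands in for the fused int8 CUDA kernels that bitsandbytes actually uses.
import torch

def llm_int8_matmul_sketch(x, w, threshold=6.0):
    # Illustrative emulation of LLM.int8() mixed-precision decomposition.
    # x: (tokens, in_features), w: (out_features, in_features)
    x, w = x.float(), w.float()

    # 1. Outlier feature columns: any activation magnitude above the threshold.
    outliers = (x.abs() > threshold).any(dim=0)

    # 2. Outlier columns are multiplied in higher precision.
    out_hi = x[:, outliers] @ w[:, outliers].T

    # 3. Remaining columns: absmax-quantize activations (per token row) and
    #    weights (per output row) to int8, multiply, then dequantize.
    xs, ws = x[:, ~outliers], w[:, ~outliers]
    sx = xs.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    sw = ws.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    xq = (xs / sx).round().clamp(-127, 127)
    wq = (ws / sw).round().clamp(-127, 127)
    out_lo = (xq @ wq.T) * (sx * sw.T)

    # 4. Combine the 16-bit and 8-bit partial results into the final output.
    return (out_hi + out_lo).to(torch.float16)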
Linear8bitLt
class bitsandbytes.nn.Linear8bitLt
This class is the base module for the LLM.int8() algorithm. To read more about it, have a look at the paper, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022).
To quantize a linear layer, first load the original fp16 / bf16 weights into the Linear8bitLt module, then call int8_module.to("cuda") (or .to(0), as below) to quantize the weights to int8.
Example:
import torch
import torch.nn as nn

import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt

# Reference model holding the original fp16-capable linear layers
fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64)
)

# Equivalent model built with Linear8bitLt; has_fp16_weights=False keeps the
# quantized int8 weights fixed for inference
int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False),
    Linear8bitLt(64, 64, has_fp16_weights=False)
)

# Copy the fp16 weights into the int8 model, then move it to the GPU
int8_model.load_state_dict(fp16_model.state_dict())
int8_model = int8_model.to(0)  # Quantization happens here
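Once the weights are quantized and on the GPU, the model can be used like any other PyTorch module. The input below is made up for illustration; inputs should be in fp16:
# Run inference with the quantized model
input_ = torch.randn(8, 64, dtype=torch.float16, device="cuda")
with torch.no_grad():
    output = int8_model(input_)
print(output.dtype)  # typically torch.float16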
__init__
bitsandbytes.nn.Linear8bitLt.__init__(input_features: int, output_features: int, bias=True, has_fp16_weights=True, threshold=0.0, index=None, device=None)
Source: https://github.com/bitsandbytes-foundation/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L1050
Initialize Linear8bitLt class.
Parameters:
input_features (int) : Number of input features of the linear layer.
output_features (int) : Number of output features of the linear layer.
bias (bool, defaults to True) : Whether the linear class uses the bias term as well.
has_fp16_weights (bool, defaults to True) : If False, weights are quantized to int8 on .to(device). If True, weights remain in fp16 and are quantized on-the-fly during each forward pass.
threshold (float, defaults to 0.0) : Outlier threshold for mixed-precision decomposition (LLM.int8()). During the forward pass, activation columns where any value exceeds this threshold are computed in fp16, while the remaining columns use int8. This operates on activations (inputs), not on weight values. Set to 0.0 to disable mixed-precision decomposition and quantize all columns to int8.
index : Indices for weight reordering (used internally).
device : Device to initialize the layer on.
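As a concrete illustration of these parameters, the sketch below configures a single layer for int8 inference with mixed-precision decomposition enabled. The layer sizes are arbitrary, and threshold=6.0 is the value recommended in the paper.
import torch.nn as nn
from bitsandbytes.nn import Linear8bitLt

# fp16 layer to convert; sizes are made up for the example
fp16_layer = nn.Linear(1024, 4096).half()

# has_fp16_weights=False freezes the int8 weights for inference;
# threshold=6.0 routes activation outlier columns through fp16
int8_layer = Linear8bitLt(1024, 4096, bias=True, has_fp16_weights=False, threshold=6.0)
int8_layer.load_state_dict(fp16_layer.state_dict())
int8_layer = int8_layer.to("cuda")  # weights are quantized to int8 here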
Int8Params
class bitsandbytes.nn.Int8Params
__init__(*args, **kwargs)
Initialize self. See help(type(self)) for accurate signature.
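Int8Params is the Parameter subclass that Linear8bitLt uses to store its weight. A minimal sketch of manual use follows; the keyword names requires_grad and has_fp16_weights mirror the current bitsandbytes source and may differ between versions, so treat this as an assumption rather than a stable API.
import torch
from bitsandbytes.nn import Int8Params, Linear8bitLt

layer = Linear8bitLt(64, 64, has_fp16_weights=False)
fp16_weight = torch.randn(64, 64, dtype=torch.float16)

# Wrap existing fp16 weights; the data is quantized to int8 when the
# parameter is moved to a CUDA device
layer.weight = Int8Params(fp16_weight, requires_grad=False, has_fp16_weights=False)
layer = layer.to("cuda")  # quantization happens here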