# Llama 3.2 1B Instruct - W8A8 Quantized
This is a W8A8 (8-bit weights, 8-bit activations) quantized version of meta-llama/Llama-3.2-1B-Instruct.
## Model Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Model Type: Llama 3.2
- Quantization Method: W8A8 GPTQ
- Precision: INT8 weights and activations
- Quantization Framework: llmcompressor
- Model Size: ~1.9 GB (vs ~4.9 GB original FP16)
- Compression Ratio: ~61% size reduction
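The size figures above can be sanity-checked with simple arithmetic. A minimal sketch (the GB figures are the ones reported above; real checkpoints also carry quantization scales, metadata, and layers kept in higher precision):

```python
# Rough consistency check for the reported checkpoint sizes.
# Figures are illustrative, taken from the model card above.
fp16_gb = 4.9   # reported size of the original FP16 checkpoint
int8_gb = 1.9   # reported size of this W8A8 checkpoint

reduction = 1 - int8_gb / fp16_gb
print(f"size reduction: {reduction:.0%}")  # ~61%, matching the figure above
```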
## License
This model is a quantized derivative of Llama 3.2 1B Instruct and inherits the same license:
- License: Llama 3.2 Community License
- Acceptable Use Policy: https://www.llama.com/llama3_2/use-policy/
**Important:** You must comply with Meta's Llama 3.2 license terms when using this model.
## Usage

### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "lsm0729/llama-1b-w8a8-quantized"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    dtype="auto"
)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of South Korea?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.6,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
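Note the slice in the decode step: `generate` returns the prompt token IDs followed by the newly generated ones, so indexing from `input_ids.shape[-1]` keeps only the completion. A minimal sketch of the same idea with plain lists (the token IDs are made up):

```python
# generate() output = prompt ids + newly generated ids
prompt_ids = [128000, 9906, 1917]          # hypothetical prompt tokens
generated = prompt_ids + [791, 6864, 374]  # hypothetical full output

# same idea as outputs[0][input_ids.shape[-1]:] above
completion = generated[len(prompt_ids):]
print(completion)  # [791, 6864, 374]
```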
### With Custom QLinear and Fused QKV (Optimized)
For better performance with custom INT8 kernels:
```python
import os
os.environ["TORCHAO_AUTOTUNER_ENABLE"] = "1"

from transformers import AutoTokenizer, AutoModelForCausalLM
from utils import replace_CompressedLinear_with_QLinear, replace_attention_with_fused_qkv

model_id = "lsm0729/llama-1b-w8a8-quantized"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", dtype="auto")

# Replace with optimized layers
replace_CompressedLinear_with_QLinear(model)  # INT8 Triton kernels
replace_attention_with_fused_qkv(model)       # Fused QKV projection

# Now use the model for generation exactly as in the basic example above
```
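The fused-QKV replacement stacks the query, key, and value projection weights so the attention input passes through one matmul instead of three, reducing kernel launches. A toy sketch of the idea in plain Python (illustrative only; the real kernels operate on INT8 tensors on the GPU):

```python
# Toy fused QKV: one matmul over the stacked weights equals
# three separate matmuls over w_q, w_k, w_v.
def matmul(x, w):  # x: (d,), w: (out, d) -> (out,)
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]

d = 4
w_q = [[1] * d] * d
w_k = [[2] * d] * d
w_v = [[3] * d] * d
w_qkv = w_q + w_k + w_v  # stack along the output dimension

x = [1.0, 2.0, 3.0, 4.0]
fused = matmul(x, w_qkv)              # single projection
q, k, v = fused[:d], fused[d:2*d], fused[2*d:]  # split back into Q, K, V

assert (q, k, v) == (matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
```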
## Performance
Tested on NVIDIA RTX 4060 Ti (16GB):
| Metric | Value |
|---|---|
| Memory Usage | ~2.5 GB |
| Inference Speed | ~180 tokens/sec |
| Latency (first token) | ~50 ms |
| Batch Size 1 | Supported |
## Quantization Details
- Method: GPTQ (accurate post-training quantization for generative pre-trained transformers)
- Calibration: Default calibration dataset
- Weight Quantization: Per-channel INT8
- Activation Quantization: Per-token INT8
- Recipe: W8A8 symmetric quantization
### Quantization Configuration
```yaml
quant_stage:
  quant_modifiers:
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.01
      block_size: 128
      scheme:
        input_activations:
          num_bits: 8
          symmetric: true
          strategy: token
        weights:
          num_bits: 8
          symmetric: true
          strategy: channel
```
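In this recipe, `strategy: channel` for weights means one scale per output channel (weight row), while `strategy: token` for activations means one scale per token row, both with a symmetric range ([-127, 127] for INT8). A minimal pure-Python sketch of the quantize/dequantize roundtrip this scheme implies (illustrative, not the library's implementation):

```python
def sym_quant(row, num_bits=8):
    """Symmetric quantization of one vector with a single shared scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [round(v / scale) for v in row]       # integers in [-127, 127]
    return q, scale

# strategy: channel -> one scale per weight output channel (row)
weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -2.0]]
w_q = [sym_quant(row) for row in weights]

# strategy: token -> one scale per activation row (token)
activations = [[0.3, -0.6, 0.9]]
a_q = [sym_quant(row) for row in activations]

# dequantize: q * scale recovers the original value up to one quantization step
q0, s0 = w_q[0]
dequant = [qi * s0 for qi in q0]
```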
## Limitations
- This is a quantized model, so there may be slight accuracy degradation compared to the original FP16 model
- Requires libraries that support INT8 inference (e.g., llmcompressor, compressed-tensors)
- Best performance with custom INT8 kernels (Triton)
## Evaluation

Accuracy is expected to stay close to the FP16 baseline, as is typical for W8A8 GPTQ quantization; no dedicated benchmarks have been run on this checkpoint. For detailed numbers, please refer to the base model's evaluation results.
## Citation
If you use this model, please cite both the original Llama 3.2 model and the quantization framework:
```bibtex
@article{llama32,
  title={Llama 3.2: Revolutionizing edge AI and vision with open, customizable models},
  author={Meta AI},
  year={2024},
  url={https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}

@software{llmcompressor,
  title={LLM Compressor},
  author={Neural Magic, Inc.},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}
```
## Acknowledgements
- Original Model: Meta Llama 3.2 1B Instruct by Meta AI
- Quantization: llmcompressor by Neural Magic
- Optimization: Custom Triton kernels for INT8 inference
## Contact
For questions or issues specific to this quantized version, please open an issue on the model repository.
For questions about the base model, please refer to Meta's Llama repository.
**Disclaimer:** This is an independently quantized version and is not officially affiliated with or endorsed by Meta AI.