Llama 3.2 1B Instruct - W8A8 Quantized

This is a W8A8 (8-bit weights, 8-bit activations) quantized version of meta-llama/Llama-3.2-1B-Instruct.

Model Details

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Model Type: Llama 3.2
  • Quantization Method: W8A8 GPTQ
  • Precision: INT8 weights and activations
  • Quantization Framework: llmcompressor
  • Model Size: ~1.9 GB (vs ~4.9 GB original FP16)
  • Compression Ratio: ~61% size reduction
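As a quick sanity check, the reported sizes are consistent with the stated reduction (the GB figures below are the approximate values from this card, not freshly measured):

```python
# Approximate sizes reported in this card (GB); ballpark figures, not exact.
quantized_gb = 1.9
fp16_gb = 4.9

# Relative size reduction of the quantized checkpoint vs. the FP16 original.
reduction_pct = (1 - quantized_gb / fp16_gb) * 100
print(f"~{reduction_pct:.0f}% size reduction")  # ~61% size reduction
```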

License

This model is a quantized derivative of Llama 3.2 1B Instruct and is distributed under the base model's Llama 3.2 Community License.

Important: You must comply with Meta's Llama 3.2 license terms when using this model.

Usage

Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "lsm0729/llama-1b-w8a8-quantized"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    dtype="auto"
)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of South Korea?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.6,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

With Custom QLinear and Fused QKV (Optimized)

For better performance with the custom INT8 Triton kernels, use the helper functions from this repository's `utils` module:

import os
os.environ["TORCHAO_AUTOTUNER_ENABLE"] = "1"

from transformers import AutoTokenizer, AutoModelForCausalLM
from utils import replace_CompressedLinear_with_QLinear, replace_attention_with_fused_qkv

model_id = "lsm0729/llama-1b-w8a8-quantized"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", dtype="auto")

# Replace with optimized layers
replace_CompressedLinear_with_QLinear(model)  # INT8 Triton kernels
replace_attention_with_fused_qkv(model)       # Fused QKV projection

# Now use the model (same as above)

Performance

Tested on NVIDIA RTX 4060 Ti (16GB):

| Metric                | Value           |
|-----------------------|-----------------|
| Memory usage          | ~2.5 GB         |
| Inference speed       | ~180 tokens/sec |
| First-token latency   | ~50 ms          |
| Batch size 1          | Supported       |
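The throughput figure above can be reproduced with a simple wall-clock measurement around `generate`. The helper below is an illustrative sketch, not part of any library:

```python
import time

def tokens_per_second(num_new_tokens: int, elapsed_s: float) -> float:
    """Throughput as generated tokens divided by wall-clock seconds."""
    return num_new_tokens / elapsed_s

# Usage around model.generate (model and input_ids from the snippets above):
# start = time.perf_counter()
# outputs = model.generate(input_ids, max_new_tokens=256)
# elapsed = time.perf_counter() - start
# print(tokens_per_second(outputs.shape[-1] - input_ids.shape[-1], elapsed))
```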

Quantization Details

  • Method: GPTQ (accurate post-training quantization for generative pre-trained transformers)
  • Calibration: Default calibration dataset
  • Weight Quantization: Per-channel INT8
  • Activation Quantization: Per-token INT8
  • Recipe: W8A8 symmetric quantization
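The weight side of this scheme (symmetric, per-channel INT8) can be sketched in a few lines of NumPy. This is a toy illustration of the quantize/dequantize round trip, not the llmcompressor implementation:

```python
import numpy as np

# Toy weight matrix (out_features x in_features); values are illustrative.
W = np.array([[0.5, -1.2, 0.3],
              [2.0,  0.1, -0.7]], dtype=np.float32)

# Symmetric per-channel INT8: one scale per output channel (row),
# scale = max(|w|) / 127 so each row maps onto [-127, 127].
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Dequantize to inspect the rounding error introduced by quantization.
W_dq = W_q.astype(np.float32) * scales
max_err = np.abs(W - W_dq).max()
```

Per-token activation quantization follows the same recipe, except the scale is computed along the token dimension at inference time rather than being fixed per channel.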

Quantization Configuration

quant_stage:
  quant_modifiers:
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.01
      block_size: 128
      scheme:
        input_activations:
          num_bits: 8
          symmetric: true
          strategy: token
        weights:
          num_bits: 8
          symmetric: true
          strategy: channel

Limitations

  • This is a quantized model, so there may be slight accuracy degradation compared to the original FP16 model
  • Requires libraries that support INT8 inference (e.g., llmcompressor, compressed-tensors)
  • Best performance with custom INT8 kernels (Triton)

Evaluation

Accuracy is expected to stay close to the FP16 baseline, though no dedicated benchmarks are reported for this quantized checkpoint; for detailed evaluation results, refer to the base model's model card.

Citation

If you use this model, please cite both the original Llama 3.2 model and the quantization framework:

@misc{llama32,
  title={Llama 3.2: Revolutionizing edge AI and vision with open, customizable models},
  author={Meta AI},
  year={2024},
  url={https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}

@software{llmcompressor,
  title={LLM Compressor},
  author={Neural Magic, Inc.},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}

Acknowledgements

  • Original Model: Meta Llama 3.2 1B Instruct by Meta AI
  • Quantization: llmcompressor by Neural Magic
  • Optimization: Custom Triton kernels for INT8 inference

Contact

For questions or issues specific to this quantized version, please open an issue on the model repository.

For questions about the base model, please refer to Meta's Llama repository.


Disclaimer: This is an independently quantized version and is not officially affiliated with or endorsed by Meta AI.
