Llama 3.2 1B Instruct - W8A8 Quantized

This is a W8A8 (8-bit weights, 8-bit activations) quantized version of meta-llama/Llama-3.2-1B-Instruct.

Model Details

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Model Type: Llama 3.2
  • Quantization Method: W8A8 GPTQ
  • Precision: INT8 weights and activations
  • Quantization Framework: llmcompressor
  • Model Size: ~1.9 GB (vs ~4.9 GB original FP16)
  • Compression Ratio: ~61% size reduction
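As a quick sanity check, the reported sizes are consistent with the stated reduction (the GB figures below are the approximate values from this card, not freshly measured):

```python
# Approximate sizes reported in this card (GB); ballpark figures, not exact.
quantized_gb = 1.9
fp16_gb = 4.9

# Relative size reduction of the quantized checkpoint vs. the FP16 original.
reduction_pct = (1 - quantized_gb / fp16_gb) * 100
print(f"~{reduction_pct:.0f}% size reduction")  # ~61% size reduction
```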

License

This model is a quantized derivative of Llama 3.2 1B Instruct and is distributed under the base model's Llama 3.2 Community License.

Important: You must comply with Meta's Llama 3.2 license terms when using this model.

Usage

Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "lsm0729/llama-1b-w8a8-quantized"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    dtype="auto"
)

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of South Korea?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.6,
    top_p=0.9,
    do_sample=True
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)

With Custom QLinear and Fused QKV (Optimized)

For better performance with the custom INT8 Triton kernels, use the helper functions from this repository's `utils` module:

import os
os.environ["TORCHAO_AUTOTUNER_ENABLE"] = "1"

from transformers import AutoTokenizer, AutoModelForCausalLM
from utils import replace_CompressedLinear_with_QLinear, replace_attention_with_fused_qkv

model_id = "lsm0729/llama-1b-w8a8-quantized"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", dtype="auto")

# Replace with optimized layers
replace_CompressedLinear_with_QLinear(model)  # INT8 Triton kernels
replace_attention_with_fused_qkv(model)       # Fused QKV projection

# Now use the model (same as above)

Performance

Tested on NVIDIA RTX 4060 Ti (16GB):

| Metric                | Value           |
|-----------------------|-----------------|
| Memory usage          | ~2.5 GB         |
| Inference speed       | ~180 tokens/sec |
| First-token latency   | ~50 ms          |
| Batch size 1          | Supported       |
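The throughput figure above can be reproduced with a simple wall-clock measurement around `generate`. The helper below is an illustrative sketch, not part of any library:

```python
import time

def tokens_per_second(num_new_tokens: int, elapsed_s: float) -> float:
    """Throughput as generated tokens divided by wall-clock seconds."""
    return num_new_tokens / elapsed_s

# Usage around model.generate (model and input_ids from the snippets above):
# start = time.perf_counter()
# outputs = model.generate(input_ids, max_new_tokens=256)
# elapsed = time.perf_counter() - start
# print(tokens_per_second(outputs.shape[-1] - input_ids.shape[-1], elapsed))
```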

Quantization Details

  • Method: GPTQ (accurate post-training quantization for generative pre-trained transformers)
  • Calibration: Default calibration dataset
  • Weight Quantization: Per-channel INT8
  • Activation Quantization: Per-token INT8
  • Recipe: W8A8 symmetric quantization
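The weight side of this scheme (symmetric, per-channel INT8) can be sketched in a few lines of NumPy. This is a toy illustration of the quantize/dequantize round trip, not the llmcompressor implementation:

```python
import numpy as np

# Toy weight matrix (out_features x in_features); values are illustrative.
W = np.array([[0.5, -1.2, 0.3],
              [2.0,  0.1, -0.7]], dtype=np.float32)

# Symmetric per-channel INT8: one scale per output channel (row),
# scale = max(|w|) / 127 so each row maps onto [-127, 127].
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Dequantize to inspect the rounding error introduced by quantization.
W_dq = W_q.astype(np.float32) * scales
max_err = np.abs(W - W_dq).max()
```

Per-token activation quantization follows the same recipe, except the scale is computed along the token dimension at inference time rather than being fixed per channel.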

Quantization Configuration

quant_stage:
  quant_modifiers:
    GPTQModifier:
      sequential_update: true
      dampening_frac: 0.01
      block_size: 128
      scheme:
        input_activations:
          num_bits: 8
          symmetric: true
          strategy: token
        weights:
          num_bits: 8
          symmetric: true
          strategy: channel

Limitations

  • This is a quantized model, so there may be slight accuracy degradation compared to the original FP16 model
  • Requires libraries that support INT8 inference (e.g., llmcompressor, compressed-tensors)
  • Best performance with custom INT8 kernels (Triton)

Evaluation

Accuracy is expected to stay close to the FP16 baseline, though no dedicated benchmarks are reported for this quantized checkpoint; for detailed evaluation results, refer to the base model's model card.

Citation

If you use this model, please cite both the original Llama 3.2 model and the quantization framework:

@misc{llama32,
  title={Llama 3.2: Revolutionizing edge AI and vision with open, customizable models},
  author={Meta AI},
  year={2024},
  url={https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/}
}

@software{llmcompressor,
  title={LLM Compressor},
  author={Neural Magic, Inc.},
  year={2024},
  url={https://github.com/vllm-project/llm-compressor}
}

Acknowledgements

  • Original Model: Meta Llama 3.2 1B Instruct by Meta AI
  • Quantization: llmcompressor by Neural Magic
  • Optimization: Custom Triton kernels for INT8 inference

Contact

For questions or issues specific to this quantized version, please open an issue on the model repository.

For questions about the base model, please refer to Meta's Llama repository.


Disclaimer: This is an independently quantized version and is not officially affiliated with or endorsed by Meta AI.
