Llama-3-Adaptive-Quant

This repository contains an implementation of the Adaptive Quantization Framework, integrated directly into the Hugging Face Transformers ecosystem.

Key Features

  • Dynamic Routing: Switches precision at runtime between FP32, INT8, and INT4, based on an Input Complexity Scorer that adds less than 1% overhead.
  • Zero-Calibration: Requires no offline calibration dataset.
  • On-Device Optimization: Targets edge environments such as Xiaomi HyperOS and Snapdragon HTP architectures.
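
To make the Dynamic Routing and Zero-Calibration ideas above more concrete, here is a minimal pure-Python sketch. It is not the repository's implementation: the scorer used by this model is not documented here, so normalized token entropy stands in for it, and the names `complexity_score`, `choose_precision`, `quantize_absmax`, and both thresholds are hypothetical. The quantizer derives its scale from the values it is given, which is one common way to avoid an offline calibration pass.

```python
import math
from collections import Counter

# Hypothetical cutoffs chosen for illustration only.
INT4_MAX_SCORE = 0.5
INT8_MAX_SCORE = 0.8


def complexity_score(token_ids):
    """Stand-in scorer: Shannon entropy of the tokens, normalized to [0, 1]."""
    n = len(token_ids)
    if n < 2:
        return 0.0
    counts = Counter(token_ids)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Entropy is at most log2(n), reached when every token is distinct.
    return entropy / math.log2(n)


def choose_precision(token_ids):
    """Route low-complexity inputs to cheaper precisions."""
    score = complexity_score(token_ids)
    if score <= INT4_MAX_SCORE:
        return "INT4"
    if score <= INT8_MAX_SCORE:
        return "INT8"
    return "FP32"


def quantize_absmax(values, bits):
    """Symmetric quantization; the scale comes from the data itself."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / qmax if absmax > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate floating-point values."""
    return [q * scale for q in quantized]
```

In this sketch a highly repetitive prompt scores near 0 and is routed to INT4, while a prompt in which every token is distinct scores 1 and stays in FP32. Rounding to the nearest code bounds the per-value round-trip error by half of `scale`.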

How to Use

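from transformers import AutoModelForCausalLM

# Replace YOUR_HF_USERNAME with the account that hosts the model.
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_HF_USERNAME/Llama-3-Adaptive-Quant",
    trust_remote_code=True,  # executes custom code from the repo; review it first
    device_map="auto",       # requires the accelerate package
)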
