Llama-3-Adaptive-Quant
This repository features an implementation of the Adaptive Quantization Framework integrated directly into the Hugging Face Transformers ecosystem.
Key Features
- Dynamic Routing: Switches precision at runtime between FP32, INT8, and INT4, based on an Input Complexity Scorer that adds <1% overhead.
- Zero-Calibration: Operates without any offline calibration dataset.
- On-Device Optimization: Tailored for ultra-efficient execution on edge environments like Xiaomi HyperOS and Snapdragon HTP architectures.
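To make the Dynamic Routing idea concrete, here is a minimal sketch of score-based precision selection. It is not the repository's scorer: the normalized-entropy heuristic, the function names `complexity_score` and `choose_precision`, and both thresholds are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

# Hypothetical cut-offs; the framework's real thresholds are not documented here.
INT4_MAX_SCORE = 0.5
INT8_MAX_SCORE = 0.8


def complexity_score(token_ids):
    """Stand-in scorer: Shannon entropy of the token IDs, normalized to [0, 1]."""
    n = len(token_ids)
    if n < 2:
        return 0.0
    counts = Counter(token_ids)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # log2(n) is the maximum entropy for n tokens (all distinct).
    return entropy / math.log2(n)


def choose_precision(token_ids):
    """Map the score to a precision tier: low complexity gets the cheapest format."""
    score = complexity_score(token_ids)
    if score <= INT4_MAX_SCORE:
        return "int4"
    if score <= INT8_MAX_SCORE:
        return "int8"
    return "fp32"
```

Under these assumed thresholds, a highly repetitive input routes to INT4 and an input of all-distinct tokens routes to FP32. A production scorer would likely use cheaper or model-aware signals to stay within the stated overhead budget.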
How to Use
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "YOUR_HF_USERNAME/Llama-3-Adaptive-Quant",
    trust_remote_code=True,
    device_map="auto"
)
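Once loaded, the model should be usable like any other causal LM in Transformers. The sketch below assumes the repository also ships tokenizer files loadable with `AutoTokenizer`, which is not stated above; the helper name `generate_text` and the prompt are illustrative only.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=64):
    """Tokenize a prompt, run generation, and decode the first returned sequence."""
    # Move the encoded inputs to whichever device the model was placed on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # For causal LMs the returned sequence includes the prompt tokens.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def main():
    # Imported here so the helper above can be reused without loading Transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "YOUR_HF_USERNAME/Llama-3-Adaptive-Quant"  # placeholder, replace with the real repo ID
    # Assumption: the repository includes a tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        trust_remote_code=True,
        device_map="auto",
    )
    print(generate_text(model, tokenizer, "Explain quantization in one sentence."))
```

Call `main()` to run the example end to end. Because `trust_remote_code=True` executes Python code from the model repository, review that code before loading.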