| | --- |
| | license: llama3.1 |
| | base_model: meta-llama/Llama-3.1-8B-Instruct |
| | tags: |
| | - llama |
| | - quantized |
| | - nvidia-modeloptimizer |
| | - FP8 |
| | - QAT |
| | library_name: nvidia-modeloptimizer |
| | --- |
| | |
| | # Llama-3.1-8B-Instruct Quantized (ModelOpt FP8) through QAT |
| |
|
| | This is a quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) using [modelopt](https://github.com/NVIDIA/Model-Optimizer) with NVFP4 weight quantization. |
| |
|
| | ## Model Details |
| |
|
| | - **Base Model:** meta-llama/Llama-3.1-8B-Instruct |
| | - **Quantization Method:** modelopt FP8 Quantization Aware Training (QAT) |
| | - **Weight Precision:** FP8 |
| | - **Original Size:** ~16 GB (bfloat16) |
| | - **Quantized Size:** ~6 GB (fp8) |
| |
|
| | ## Usage |
| |
|
| | ```python |
| | import torch |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | |
| | # Load base model structure |
| | model = AutoModelForCausalLM.from_pretrained( |
| | "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8", |
| | torch_dtype=torch.bfloat16, |
| | low_cpu_mem_usage=True |
| | ) |
| | |
| | # Load tokenizer and generate |
| | tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-NVFP4-QAT") |
| | |
| | inputs = tokenizer("Hello, my name is", return_tensors="pt") |
| | outputs = model.generate(**inputs, max_new_tokens=10) |
| | print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| | ``` |
| |
|
| | ## License |
| |
|
| | This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/). |
| |
|