---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama
- quantized
- nvidia-modeloptimizer
- fp8
library_name: nvidia-modeloptimizer
---

# Llama-3.1-8B-Instruct Quantized (ModelOpt FP8)

This is an FP8-quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), produced with [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer) post-training weight quantization.

## Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** ModelOpt FP8 post-training quantization (PTQ); a sketch of the workflow follows this list
- **Weight Precision:** FP8
- **Original Size:** ~16 GB (bfloat16)
- **Quantized Size:** ~9 GB
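
For reference, below is a minimal sketch of how a checkpoint like this can be produced with ModelOpt's PTQ API. This is a reconstruction, not the exact recipe used for this repository: it assumes the `mtq.quantize` entry point with the stock `FP8_DEFAULT_CFG` recipe, and the calibration prompts are illustrative placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

# Load the full-precision base model in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Placeholder calibration data; a real run would use a few hundred
# representative samples so ModelOpt can collect activation statistics.
calib_prompts = [
    "Explain FP8 quantization in one sentence.",
    "Write a haiku about GPUs.",
]

def forward_loop(model):
    # Forward passes only; PTQ needs no labels and no backward pass.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Quantize in place using ModelOpt's default FP8 configuration.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```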

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Load the tokenizer and generate
tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
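
Since the base model is instruction-tuned, chat-style prompting via the tokenizer's chat template will generally give better results than raw text completion. A short example continuing from the snippet above (the prompt is illustrative):

```python
messages = [{"role": "user", "content": "Give me one fun fact about GPUs."}]

# Render the conversation with the Llama 3.1 chat template and generate.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```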

## License

This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).