---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama
- quantized
- nvidia-modelopt
- fp8
library_name: nvidia-modelopt
---
# Llama-3.1-8B-Instruct Quantized (ModelOpt FP8)
This is a quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), produced with [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/Model-Optimizer) using FP8 post-training weight quantization.
## Model Details
- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** ModelOpt FP8 post-training quantization (PTQ)
- **Weight Precision:** FP8
- **Original Size:** ~16 GB (bfloat16)
- **Quantized Size:** ~9 GB
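For reference, below is a minimal sketch of how a checkpoint like this is typically produced with ModelOpt's PTQ API (`modelopt.torch.quantization`). This is not the exact script used for this upload; the calibration prompts and the `forward_loop` helper are illustrative placeholders.

```python
# Hedged reproduction sketch -- assumes the nvidia-modelopt package is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

base = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# ModelOpt calibrates quantization scales by running sample data through the model.
def forward_loop(model):
    # Placeholder prompts; a real PTQ run uses a few hundred calibration samples.
    for prompt in ["Hello, my name is", "The capital of France is"]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# FP8_DEFAULT_CFG is ModelOpt's stock FP8 (E4M3) recipe.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The quantized model can then be written out as a Hugging Face-style checkpoint with ModelOpt's export utilities (e.g. `modelopt.torch.export.export_hf_checkpoint`), which is presumably how the files in this repository were generated.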
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

# Load the tokenizer and run a quick generation
tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8")
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
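Since the base model is instruction-tuned, chat-formatted prompts generally give better results than raw text completion. A small sketch using the standard `transformers` chat-template API, reusing `model` and `tokenizer` from above (the prompt is illustrative):

```python
# Chat-style prompting via the model's built-in chat template
messages = [{"role": "user", "content": "Give me three facts about GPUs."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```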
## License
This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).