---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama
- quantized
- nvidia-modeloptimizer
- fp8
library_name: nvidia-modeloptimizer
---
# Llama-3.1-8B-Instruct Quantized (ModelOpt FP8)
This model is an FP8-quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), produced with NVIDIA [modelopt](https://github.com/NVIDIA/Model-Optimizer) post-training quantization.
## Model Details
- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** modelopt FP8 Post-Training Quantization (PTQ); see the sketch after this list
- **Weight Precision:** FP8
- **Original Size:** ~16 GB (bfloat16)
- **Quantized Size:** ~9 GB
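For reference, below is a minimal sketch of how an FP8 checkpoint like this one is typically produced with modelopt's PTQ API (`mtq.quantize` with `mtq.FP8_DEFAULT_CFG`). The calibration prompts are illustrative placeholders, not the recipe actually used for this model; real recipes usually calibrate on a few hundred samples from a dataset.

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative calibration prompts (placeholders, not the actual recipe).
calib_prompts = [
    "Explain post-training quantization in one paragraph.",
    "Write a short poem about the ocean.",
]

def forward_loop(model):
    # modelopt collects activation statistics during these forward passes
    # to derive the FP8 scaling factors.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Apply FP8 post-training quantization and calibrate in one step.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```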
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FP8-quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8")

# Quick generation to verify the model loads and runs
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
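Since the base model is instruction-tuned, chat-style prompts generally give better results through the tokenizer's chat template. Reusing the `model` and `tokenizer` loaded above:

```python
messages = [{"role": "user", "content": "Give me three facts about llamas."}]

# Format the conversation with the Llama 3.1 chat template
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```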
## License
This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).