---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - llama
  - quantized
  - nvidia-modeloptimizer
  - fp8
library_name: nvidia-modeloptimizer
---

# Llama-3.1-8B-Instruct Quantized (ModelOpt FP8)

This is an FP8 weight-quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), produced with NVIDIA's [modelopt](https://github.com/NVIDIA/Model-Optimizer) library.

## Model Details

- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Quantization Method:** modelopt FP8 Post-Training Quantization (PTQ)
- **Weight Precision:** FP8
- **Original Size:** ~16 GB (bfloat16)
- **Quantized Size:** ~9 GB
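
The size reduction follows from FP8 storing one byte per weight instead of bfloat16's two, so the ~8B parameters shrink from ~16 GB toward ~8 GB, with scaling factors and higher-precision embeddings and norms accounting for the remainder. For reference, a checkpoint like this is typically produced with modelopt's PTQ API. The following is a minimal sketch, assuming the `FP8_DEFAULT_CFG` preset and a small, purely illustrative calibration set (real PTQ uses a few hundred representative samples):

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical calibration data; replace with representative samples.
calib_texts = ["The quick brown fox jumps over the lazy dog."]

def forward_loop(model):
    # modelopt runs this to collect activation statistics for scale calibration.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        model(**inputs)

# Apply FP8 post-training quantization in place.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```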

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FP8-quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
)

# Load tokenizer and generate
tokenizer = AutoTokenizer.from_pretrained("tokenlabsdotrun/Llama-3.1-8B-ModelOpt-FP8")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
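
Since the base model is instruction-tuned, prompts generally go through the chat template rather than raw strings. A minimal sketch using the standard transformers chat-template API (the message content is illustrative):

```python
messages = [
    {"role": "user", "content": "Explain FP8 quantization in one sentence."},
]

# apply_chat_template wraps the messages in Llama 3.1's chat markup
# and appends the assistant header so generation starts a reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```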

## License

This model inherits the [Llama 3.1 Community License](https://llama.meta.com/llama3_1/license/).