Llama-3.1-8B-Instruct-MR-GPTQ-nvfp
Model Overview
This model was obtained by quantizing the weights of Llama-3.1-8B-Instruct to the NVFP4 data type. Quantization lowers the number of bits per parameter from 16 to 4.5, which cuts disk size and GPU memory requirements by approximately 72%.
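The 4.5-bit and 72% figures can be reproduced with a few lines of arithmetic. The sketch below assumes the usual NVFP4 block layout (4-bit elements sharing one 8-bit scale per block of 16 weights); the model card itself does not spell out this layout, and the small per-tensor scale is ignored.

```python
# Storage cost per weight under the assumed NVFP4 block layout.
ELEMENT_BITS = 4     # one FP4 value per weight
SCALE_BITS = 8       # one FP8 scale per block (assumption, not stated in the card)
BLOCK_SIZE = 16      # weights sharing one scale (assumption, not stated in the card)
BASELINE_BITS = 16   # 16-bit baseline weights

# Each weight pays for its own 4 bits plus its share of the block scale.
bits_per_param = ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE

# Relative saving compared with the 16-bit baseline.
reduction = 1 - bits_per_param / BASELINE_BITS

print(f"{bits_per_param} bits per parameter")
print(f"{reduction:.1%} smaller than the 16-bit baseline")
```

Under these assumptions the result is 4.5 bits per parameter and a reduction of about 72%, matching the numbers above.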
Usage
MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations:
- transformers, with these features:
  - Available in main (Documentation).
  - RTN on-the-fly quantization.
  - Pseudo-quantization QAT.
- vLLM, with these features:
  - Available in this PR.
  - Compatible with real quantization models from FP-Quant and the transformers integration.
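As a rough illustration of the vLLM route, the sketch below builds a request for an OpenAI-compatible completions endpoint using only the Python standard library. It assumes a vLLM build that includes the integration mentioned above is already serving this model at `http://localhost:8000`; the helper names `build_completion_request` and `complete` are hypothetical and exist only for this example.

```python
import json
import urllib.request

MODEL_ID = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=64):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` would return the generated continuation. The base URL and port are assumptions and should match however the server was actually started.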
Evaluation
This model was evaluated on a subset of OpenLLM v1 benchmarks and Platinum bench. Model outputs were generated with the vLLM engine.
OpenLLM v1 results
| Model | MMLU‑CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta‑llama/Llama‑3.1‑8B‑Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| ISTA‑DASLab/Llama‑3.1‑8B‑Instruct‑MR‑GPTQ‑nvfp | 0.6917 | 0.8089 | 0.7850 | 0.7545 | 0.7600 | 96.29 |
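The Average and Recovery columns can be recomputed from the per-task scores. The sketch below assumes that recovery is the quantized average divided by the baseline average, expressed as a percentage; the card does not define the metric explicitly, but this reading reproduces the reported value.

```python
# Per-task scores copied from the OpenLLM v1 table.
baseline = {"MMLU-CoT": 0.7276, "GSM8k": 0.8506, "Hellaswag": 0.8001, "Winogrande": 0.7790}
quantized = {"MMLU-CoT": 0.6917, "GSM8k": 0.8089, "Hellaswag": 0.7850, "Winogrande": 0.7545}


def mean(scores):
    """Unweighted mean over the benchmark tasks."""
    return sum(scores.values()) / len(scores)


baseline_avg = mean(baseline)
quantized_avg = mean(quantized)

# Assumed definition: recovery = quantized average / baseline average.
recovery_pct = 100 * quantized_avg / baseline_avg

print(f"baseline average:  {baseline_avg:.4f}")
print(f"quantized average: {quantized_avg:.4f}")
print(f"recovery:          {recovery_pct:.2f}%")
```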
Platinum bench results
Below we report recoveries on individual tasks as well as the average recovery.
Recovery by Task
| Task | Recovery (%) |
|---|---|
| SingleOp | 100.00 |
| SingleQ | 98.99 |
| MultiArith | 99.41 |
| SVAMP | 97.54 |
| GSM8K | 96.64 |
| MMLU‑Math | 92.43 |
| BBH‑LogicalDeduction‑3Obj | 87.34 |
| BBH‑ObjectCounting | 98.80 |
| BBH‑Navigate | 92.00 |
| TabFact | 86.92 |
| HotpotQA | 103.18 |
| SQuAD | 101.54 |
| DROP | 103.77 |
| Winograd‑WSC | 89.47 |
| Average | 96.29 |
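The Average row is consistent with an unweighted mean of the per-task recoveries. The sketch below checks this and also lists the tasks that recover less than 90%; the 90% cut-off is an arbitrary choice made for illustration.

```python
from statistics import fmean

# Per-task recoveries (%) copied from the table above.
task_recovery = {
    "SingleOp": 100.00, "SingleQ": 98.99, "MultiArith": 99.41, "SVAMP": 97.54,
    "GSM8K": 96.64, "MMLU-Math": 92.43, "BBH-LogicalDeduction-3Obj": 87.34,
    "BBH-ObjectCounting": 98.80, "BBH-Navigate": 92.00, "TabFact": 86.92,
    "HotpotQA": 103.18, "SQuAD": 101.54, "DROP": 103.77, "Winograd-WSC": 89.47,
}

# Unweighted mean across the 14 tasks.
average_recovery = fmean(task_recovery.values())

# Tasks below an illustrative 90% threshold, worst first.
weakest = sorted(
    (name for name, value in task_recovery.items() if value < 90.0),
    key=task_recovery.get,
)

print(f"average recovery: {average_recovery:.2f}%")
print("below 90%:", ", ".join(weakest))
```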