Instructions for using ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with libraries and local apps.

Libraries

How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ")
model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ")

Local Apps

vLLM

How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
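
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_completion_request` and `complete` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint.

    The default base_url is an assumption matching the curl example.
    """
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text.

    Requires a running server; raises urllib.error.URLError otherwise.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the generated continuation. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.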

Use Docker

docker model run hf.co/ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ

SGLang

How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with Docker Model Runner:
```
docker model run hf.co/ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ
```

soroushtabesh committed 15 days ago

Commit 3b84589 (verified) · 1 parent: 768d9f7

Add model card with GSQ paper citation (arXiv:2604.18556)

Files changed (1)

README.md +62 -9

@@ -1,16 +1,69 @@
 ---
 license: llama3.1
-base_model: meta-llama/Llama-3.1-70B-Instruct
-base_model_relation: quantized
 library_name: transformers
 pipeline_tag: text-generation
+base_model: meta-llama/Llama-3.1-70B-Instruct
+base_model_relation: quantized
 tags:
+  - gsq
+  - gumbel-softmax
+  - quantization
+  - ptq
   - llama
   - llama-3.1
-  - llama-3.1-70b
-  - instruct
-  - quantized
-  - 2bit
-  - GSQ
-  - ISTA-DASLab
----
+  - vllm
+  - compressed-tensors
+  - arxiv:2604.18556
+---
+# Llama-3.1-70B-Instruct — 2-bit GSQ
+2-bit quantization of [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
+produced with **GSQ** (Gumbel-Softmax Quantization) at **≈2.13 bpp**.
+GSQ is the strongest *scalar* PTQ method we measured at this scale and lands
+within ≈1.7 points of vector-quantized methods (QTIP, PV-Tuning) on the
+standard zero-shot suite (ARC-C/E, HellaSwag, PIQA, Winogrande):
+| Method          | 70B Avg |
+|-----------------|:-------:|
+| FP16            | 78.99   |
+| GPTQ            | 57.38   |
+| QuIP            | 61.57   |
+| EfficientQAT    | 71.43   |
+| QTIP (VQ)       | 77.25   |
+| PV-Tuning (VQ)  | 76.27   |
+| **GSQ (ours)**  | **75.57** |
+- Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
+- Paper page on HF: <https://huggingface.co/papers/2604.18556>
+- Code: <https://github.com/IST-DASLab/GSQ>
+- Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>
+## Quantization details
+- **Base model:** [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
+- **Bits / weight (effective):** ≈2.13 bpp
+- **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
+- **Group size:** 128
+- **Format:** `compressed-tensors` (auto-detected by vLLM)
+- **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
+## Serving with vLLM
+```bash
+vllm serve ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ \
+  --tensor-parallel-size 2
+```
+## Citation
+```bibtex
+@article{gsq2026,
+  title  = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
+  author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
+  journal= {arXiv preprint arXiv:2604.18556},
+  year   = {2026},
+  url    = {https://arxiv.org/abs/2604.18556}
+}
+```
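
The quantization details in the new model card (2-bit codes, group size 128, codebook `{-2, -1, 0, +1} × scale`, ≈2.13 bpp) can be illustrated with a short sketch. It rests on two assumptions that the card does not state: that each group of 128 weights stores one 16-bit scale, and that code index `i` maps to the `i`-th codebook level. The actual `compressed-tensors` layout may pack codes and metadata differently, and the function names are hypothetical.

```python
GROUP_SIZE = 128         # group size from the model card
CODE_BITS = 2            # bits per quantized weight
SCALE_BITS = 16          # ASSUMPTION: one 16-bit scale per group
LEVELS = (-2, -1, 0, 1)  # codebook levels from the model card


def effective_bits_per_weight(code_bits=CODE_BITS, scale_bits=SCALE_BITS,
                              group_size=GROUP_SIZE):
    """Bits per weight once the per-group scale is amortized over the group."""
    return code_bits + scale_bits / group_size


def dequantize_group(codes, scale):
    """Map 2-bit code indices (0..3) to real values: LEVELS[code] * scale.

    ASSUMPTION: index i selects LEVELS[i]; the real packing may differ.
    """
    return [LEVELS[code] * scale for code in codes]


bpp = effective_bits_per_weight()             # 2 + 16/128 = 2.125
values = dequantize_group([0, 1, 2, 3], 0.5)  # [-1.0, -0.5, 0.0, 0.5]
```

Under these assumptions the overhead is 16/128 = 0.125 bits per weight, giving 2.125 bpp, which is in line with the ≈2.13 bpp stated in the card.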