Instructions for using ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with libraries and local apps.

Libraries

How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ")
model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ")

Local Apps

vLLM

How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ

SGLang

How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ with Docker Model Runner:
```
docker model run hf.co/ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ
```



	---
	license: llama3.1
	library_name: transformers
	pipeline_tag: text-generation
	base_model: meta-llama/Llama-3.1-70B-Instruct
	base_model_relation: quantized
	tags:
	- gsq
	- gumbel-softmax
	- quantization
	- ptq
	- llama
	- llama-3.1
	- vllm
	- humming
	- arxiv:2604.18556
	---

	# Llama-3.1-70B-Instruct — 2-bit GSQ

	2-bit quantization of [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
	produced with GSQ (Gumbel-Softmax Quantization) at ≈2.13 bpp.

	GSQ is the strongest scalar PTQ method we measured at the 70B scale. On the
	standard zero-shot suite (ARC-C/E, HellaSwag, PIQA, Winogrande) it averages
	75.57, ≈1.7 points behind QTIP and ≈0.7 points behind PV-Tuning, both of which
	are vector-quantized (VQ) methods:

	| Method         | 70B Avg |
	|----------------|:-------:|
	| FP16           |  78.99  |
	| GPTQ           |  57.38  |
	| QuIP           |  61.57  |
	| EfficientQAT   |  71.43  |
	| QTIP (VQ)      |  77.25  |
	| PV-Tuning (VQ) |  76.27  |
	| GSQ (ours)     |  75.57  |

	- Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
	- Paper page on HF: <https://huggingface.co/papers/2604.18556>
	- Code: <https://github.com/IST-DASLab/GSQ>
	- Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>

	## Quantization details

	- Base model: [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
	- Bits / weight (effective): ≈2.13 bpp
	- Codebook: 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
	- Group size: 128
	- Format: [Humming](https://github.com/inclusionAI/humming) (`quant_method: "humming"`, `b_dtype: "uint2"`)
	- Pipeline: GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
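
As a rough illustration of how the codebook and group size listed above combine at dequantization time, the sketch below reconstructs one weight row from signed 2-bit levels and per-group scales. The function name `dequantize_row` is hypothetical, and the levels are assumed to be already decoded to `{-2, -1, 0, +1}`; how the Humming kernels map stored `uint2` codes to these levels is not covered here.

```python
from typing import List, Sequence

CODEBOOK = (-2, -1, 0, 1)  # signed levels from the "Codebook" entry above


def dequantize_row(levels: Sequence[int], scales: Sequence[float],
                   group_size: int = 128) -> List[float]:
    """Reconstruct weights as level * scale, one scale per group along the input dim.

    Hypothetical helper for illustration; not part of GSQ or Humming.
    """
    if len(levels) != len(scales) * group_size:
        raise ValueError("expected len(levels) == len(scales) * group_size")
    if any(level not in CODEBOOK for level in levels):
        raise ValueError("levels must come from the 2-bit codebook")
    return [level * scales[i // group_size] for i, level in enumerate(levels)]


# Toy example with group_size=4 instead of 128 to keep it readable.
row = dequantize_row([-2, -1, 0, 1, 1, 1, 0, -2], scales=[0.5, 0.25], group_size=4)
print(row)  # [-1.0, -0.5, 0.0, 0.5, 0.25, 0.25, 0.0, -0.5]
```

Each group of `group_size` consecutive input-dimension weights shares one scale, which is why the scale tensor is `group_size` times narrower than the original weight.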

	### Storage layout (why the HF UI shows I32 + BF16)

	The Hugging Face "Tensor types" widget reports the container dtype of each
	safetensor on disk, not the effective precision of the underlying weights.
	This checkpoint uses the Humming on-disk layout (exact-width packing — no
	sub-byte values are padded into a wider container). For every quantized
	`Linear` layer with original weight shape `[out_features, in_features]`, the
	following tensors are stored:

	| Tensor | Dtype | Shape on disk | Meaning |
	|--------|-------|---------------|---------|
	| `<layer>.weight` | I32 | `[out_features, in_features × 2 / 32]` = `[out_features, in_features / 16]` | 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. |
	| `<layer>.weight_scale` | BF16 | `[out_features, in_features / 128]` | One symmetric scale per group of `group_size = 128` weights along the input dim. |
	| Attention / norms / embed / LM-head | BF16 | unchanged | Not quantized; copied from the base checkpoint. |

	Example (`model.layers.0.mlp.gate_proj`, original `[28672, 8192]`):
	`weight` = `[28672, 512]` I32 (since `8192 × 2 / 32 = 512`),
	`weight_scale` = `[28672, 64]` BF16 (since `8192 / 128 = 64`).
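
The shape arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: `packed_shapes` is a hypothetical helper, not part of Humming or vLLM, and it only encodes the two formulas from the table above.

```python
from typing import Tuple

Shape = Tuple[int, int]


def packed_shapes(out_features: int, in_features: int, bits: int = 2,
                  group_size: int = 128, word_bits: int = 32) -> Tuple[Shape, Shape]:
    """Return (weight_shape, weight_scale_shape) for one quantized Linear layer."""
    if (in_features * bits) % word_bits or in_features % group_size:
        raise ValueError("in_features must fill whole words and whole groups")
    weight_shape = (out_features, in_features * bits // word_bits)  # packed I32 words
    scale_shape = (out_features, in_features // group_size)         # one BF16 scale per group
    return weight_shape, scale_shape


# model.layers.0.mlp.gate_proj, original shape [28672, 8192]
weight_shape, scale_shape = packed_shapes(28672, 8192)
print(weight_shape, scale_shape)  # (28672, 512) (28672, 64)
```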

	So although the UI says "I32 + BF16", the effective storage per quantized
	weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
	`quantization_config` block in `config.json` is:

	```json
	{
	  "quant_method": "humming",
	  "b_dtype": "uint2",
	  "weight_scale_group_size": 128,
	  "weight_scale_type": "group",
	  "has_zero_point": false,
	  "ignore": ["lm_head", "embed_tokens"]
	}
	```
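
The ≈2.13 bpp figure can be re-derived from this block. The sketch below is illustrative only: `effective_bpp` is a hypothetical helper, reading the bit width from the digits in `b_dtype` is an assumption about the naming scheme, and the 16 scale bits come from the BF16 `weight_scale` tensors described above.

```python
import json
import re

CONFIG_TEXT = """
{
  "quant_method": "humming",
  "b_dtype": "uint2",
  "weight_scale_group_size": 128,
  "weight_scale_type": "group",
  "has_zero_point": false,
  "ignore": ["lm_head", "embed_tokens"]
}
"""


def effective_bpp(quant_config: dict, scale_bits: int = 16) -> float:
    """Bits per quantized weight: packed code bits plus the amortized group scale."""
    # Assumption: the trailing digits of b_dtype give the code width in bits.
    match = re.fullmatch(r"u?int(\d+)", quant_config["b_dtype"])
    if match is None:
        raise ValueError(f"unrecognized b_dtype: {quant_config['b_dtype']!r}")
    weight_bits = int(match.group(1))
    return weight_bits + scale_bits / quant_config["weight_scale_group_size"]


config = json.loads(CONFIG_TEXT)
print(effective_bpp(config))  # 2.125
```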

	Loading this checkpoint requires vLLM plus the
	[`humming`](https://github.com/inclusionAI/humming) kernels
	(`pip install humming-kernels`). See "Serving with vLLM" below.

	> Note: GSQ training first writes shards in the `compressed-tensors`
	> `pack-quantized` format, where a sub-4-bit codebook is padded to 4 bits per
	> value inside an INT32 container. The published checkpoint has been re-packed
	> via `convert_to_humming.py` into exact-width 2-bit Humming storage, hence the
	> `2 / 32` shape factor in the table above.
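
To make the difference between padded and exact-width packing concrete, here is a small pure-Python sketch. It assumes unsigned codes in `0..3`, plain LSB-first slot order, and 4-bit slots for the padded case; the helper names are hypothetical, and the real `convert_to_humming.py` and kernel layouts may order or interleave values differently.

```python
from typing import List, Sequence

WORD_BITS = 32


def pack_lsb_first(codes: Sequence[int], slot_bits: int, value_bits: int = 2) -> List[int]:
    """Pack unsigned codes into 32-bit words, lowest slot first (illustrative only)."""
    per_word = WORD_BITS // slot_bits
    if len(codes) % per_word:
        raise ValueError("number of codes must fill whole words")
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for slot, code in enumerate(codes[start:start + per_word]):
            if not 0 <= code < (1 << value_bits):
                raise ValueError("code out of range")
            word |= code << (slot * slot_bits)
        words.append(word)  # kept as an unsigned Python int, not a signed I32
    return words


def unpack_lsb_first(words: Sequence[int], slot_bits: int, value_bits: int = 2) -> List[int]:
    """Inverse of pack_lsb_first: read one value from each slot of each word."""
    per_word = WORD_BITS // slot_bits
    mask = (1 << value_bits) - 1
    return [(word >> (slot * slot_bits)) & mask
            for word in words for slot in range(per_word)]


codes = [i % 4 for i in range(32)]
exact = pack_lsb_first(codes, slot_bits=2)   # exact-width: 16 codes per word
padded = pack_lsb_first(codes, slot_bits=4)  # padded: 8 codes per word
print(len(exact), len(padded))  # 2 4
```

Under these assumptions the exact-width layout needs half as many 32-bit words as the padded one for the same 2-bit codes.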

	## Serving with vLLM

	Install the Humming kernels (required for vLLM to load this checkpoint):

	```bash
	pip install humming-kernels
	```

	Then start the server, here sharded across 2 GPUs with tensor parallelism:

	```bash
	vllm serve ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ \
	    --tensor-parallel-size 2
	```
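
Once the server is up, it can be queried through its OpenAI-compatible completions endpoint. The following is a minimal standard-library sketch, assuming the server listens on vLLM's default `http://localhost:8000`; the helper names `build_completion_request` and `complete` are hypothetical, and the sampling parameters are example values.

```python
import json
import urllib.request

MODEL_ID = "ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ"


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000",
                             model: str = MODEL_ID,
                             max_tokens: int = 512,
                             temperature: float = 0.5) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage, once `vllm serve` is running:
#     print(complete("Once upon a time,"))
```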

	## Citation

	```bibtex
	@article{gsq2026,
	  title   = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
	  author  = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
	  journal = {arXiv preprint arXiv:2604.18556},
	  year    = {2026},
	  url     = {https://arxiv.org/abs/2604.18556}
	}
	```