| license: llama3.1 | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| base_model: meta-llama/Llama-3.1-70B-Instruct | |
| base_model_relation: quantized | |
| tags: | |
| - gsq | |
| - gumbel-softmax | |
| - quantization | |
| - ptq | |
| - llama | |
| - llama-3.1 | |
| - vllm | |
| - humming | |
| - arxiv:2604.18556 | |
| # Llama-3.1-70B-Instruct: 2-bit GSQ | |
| 2-bit quantization of [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | |
| produced with **GSQ** (Gumbel-Softmax Quantization) at **≈2.13 bpp**. | |
| GSQ is the strongest *scalar* PTQ method we measured at this scale and lands | |
| within ≈1.7 points of vector-quantized methods (QTIP, PV-Tuning) on the | |
| standard zero-shot suite (ARC-C/E, HellaSwag, PIQA, Winogrande): | |
| | Method | 70B Avg | | |
| |-----------------|:-------:| | |
| | FP16 | 78.99 | | |
| | GPTQ | 57.38 | | |
| | QuIP | 61.57 | | |
| | EfficientQAT | 71.43 | | |
| | QTIP (VQ) | 77.25 | | |
| | PV-Tuning (VQ) | 76.27 | | |
| | **GSQ (ours)** | **75.57** | | |
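The "within ≈1.7 points" figure can be checked directly from the table. The short sketch below only re-does that arithmetic; the dictionary and the helper name `gap_to_gsq` are illustrative and not part of the GSQ code base.

```python
# Average zero-shot accuracy (70B) copied from the table above.
scores = {
    "FP16": 78.99,
    "GPTQ": 57.38,
    "QuIP": 61.57,
    "EfficientQAT": 71.43,
    "QTIP (VQ)": 77.25,
    "PV-Tuning (VQ)": 76.27,
    "GSQ": 75.57,
}


def gap_to_gsq(method: str) -> float:
    """Points by which `method` leads GSQ (negative if GSQ is ahead)."""
    return round(scores[method] - scores["GSQ"], 2)


# Gap of every other method relative to GSQ.
gaps = {m: gap_to_gsq(m) for m in scores if m != "GSQ"}
for method, gap in gaps.items():
    print(f"{method:>15}: {gap:+.2f}")
```

The largest gap to a vector-quantized method is the one to QTIP, which is where the ≈1.7 figure comes from; the scalar baselines (GPTQ, QuIP, EfficientQAT) all come out negative, meaning GSQ is ahead of them.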
| - Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556) | |
| - Paper page on HF: <https://huggingface.co/papers/2604.18556> | |
| - Code: <https://github.com/IST-DASLab/GSQ> | |
| - Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq> | |
| ## Quantization details | |
| - **Base model:** [`meta-llama/Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | |
| - **Bits / weight (effective):** ≈2.13 bpp | |
| - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale` | |
| - **Group size:** 128 | |
| - **Format:** [Humming](https://github.com/inclusionAI/humming) (`quant_method: "humming"`, `b_dtype: "uint2"`) | |
| - **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer) | |
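To make the codebook and group size concrete, here is a minimal pure-Python sketch of how one row of weights could be reconstructed from 2-bit codes and per-group scales. It assumes that a stored code `c` in `{0, 1, 2, 3}` maps to the level `c - 2`; that mapping, and the helper name `dequantize_row`, are assumptions made for illustration and are not taken from the Humming kernels.

```python
GROUP_SIZE = 128
# Assumption: stored code c in {0, 1, 2, 3} corresponds to level c - 2.
LEVELS = (-2, -1, 0, 1)


def dequantize_row(codes, scales, group_size=GROUP_SIZE):
    """Reconstruct one weight row as level * scale of the group it belongs to.

    codes  : list of ints in 0..3, length in_features
    scales : list of floats, length in_features // group_size
    """
    if len(codes) != len(scales) * group_size:
        raise ValueError("need exactly one scale per group of weights")
    return [LEVELS[c] * scales[i // group_size] for i, c in enumerate(codes)]


# Two groups of 128 weights, each group with its own scale.
row = dequantize_row([0, 1, 2, 3] * 64, [0.5, 0.25])
print(row[:4], row[128:132])
```

Every weight in a group shares one scale, so only the 2-bit code varies per weight; this is what keeps the per-weight overhead of the scales small.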
| ### Storage layout (why the HF UI shows I32 + BF16) | |
| The Hugging Face "Tensor types" widget reports the **container dtype** of each | |
| safetensor on disk, not the effective precision of the underlying weights. | |
| This checkpoint uses the **Humming** on-disk layout (exact-width packing: no | |
| sub-byte values are padded into a wider container). For every quantized | |
| `Linear` layer with original weight shape `[out_features, in_features]`, the | |
| following tensors are stored: | |
| | Tensor | Dtype | Shape on disk | Meaning | | |
| |------------------------------|-------|-------------------------------------|-------------------------------------------------------------------------------| | |
| | `<layer>.weight` | I32 | `[out_features, in_features × 2 / 32]` = `[out_features, in_features / 16]` | 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. | |
| | `<layer>.weight_scale` | BF16 | `[out_features, in_features / 128]` | One symmetric scale per group of `group_size = 128` weights along the input dim. | | |
| | Attention / norms / embed / LM-head | BF16 | unchanged | Not quantized; copied from the base checkpoint. | | |
| **Example** (`model.layers.0.mlp.gate_proj`, original `[28672, 8192]`): | |
| `weight` = `[28672, 512]` I32 (since `8192 × 2 / 32 = 512`), | |
| `weight_scale` = `[28672, 64]` BF16 (since `8192 / 128 = 64`). | |
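The LSB-first packing and the resulting shapes can be sketched in a few lines of pure Python. This follows only the description in the table (16 two-bit values per 32-bit word, lowest bits first); the function names are hypothetical, words are handled as unsigned integers, and any further reordering the Humming kernels may apply is not modelled.

```python
BITS = 2
PER_WORD = 32 // BITS  # 16 two-bit codes per 32-bit word


def pack_row(codes):
    """Pack 2-bit codes (0..3) into 32-bit words, first code in the lowest bits."""
    if len(codes) % PER_WORD:
        raise ValueError("row length must be a multiple of 16")
    words = []
    for start in range(0, len(codes), PER_WORD):
        word = 0
        for j, code in enumerate(codes[start:start + PER_WORD]):
            if not 0 <= code < 4:
                raise ValueError("codes must fit in 2 bits")
            word |= code << (BITS * j)
        words.append(word)
    return words


def unpack_row(words):
    """Inverse of pack_row: read 16 codes from each word, lowest bits first."""
    return [(w >> (BITS * j)) & 0b11 for w in words for j in range(PER_WORD)]


def disk_shapes(out_features, in_features, group_size=128):
    """Shapes of (weight, weight_scale) as described in the table above."""
    return ((out_features, in_features * BITS // 32),
            (out_features, in_features // group_size))


codes = [0, 1, 2, 3] * 4
assert unpack_row(pack_row(codes)) == codes
print(disk_shapes(28672, 8192))
```

When reading real I32 tensors, negative values would first need to be reinterpreted as unsigned (for example with `w & 0xFFFFFFFF`) before shifting and masking.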
| So although the UI says "I32 + BF16", the **effective storage** per quantized | |
| weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The | |
| `quantization_config` block in `config.json` is: | |
| ```json | |
| { | |
| "quant_method": "humming", | |
| "b_dtype": "uint2", | |
| "weight_scale_group_size": 128, | |
| "weight_scale_type": "group", | |
| "has_zero_point": false, | |
| "ignore": ["lm_head", "embed_tokens"] | |
| } | |
| ``` | |
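The effective bits-per-weight figure follows from this config. The sketch below recomputes it, assuming 16-bit (BF16) scales as stated in the storage table; the `BITS_BY_DTYPE` mapping and the `effective_bpp` helper are illustrative and not part of Humming or vLLM.

```python
import json

config_text = """
{
  "quant_method": "humming",
  "b_dtype": "uint2",
  "weight_scale_group_size": 128,
  "weight_scale_type": "group",
  "has_zero_point": false,
  "ignore": ["lm_head", "embed_tokens"]
}
"""
cfg = json.loads(config_text)

BITS_BY_DTYPE = {"uint2": 2}  # hypothetical lookup, only the dtype used here
SCALE_BITS = 16               # BF16 scales, per the storage-layout table


def effective_bpp(cfg):
    """Bits per quantized weight: packed code plus amortized group scale."""
    if cfg["has_zero_point"]:
        # Zero-point width is not described in the card, so do not guess it.
        raise NotImplementedError("zero-point overhead is not modelled")
    code_bits = BITS_BY_DTYPE[cfg["b_dtype"]]
    return code_bits + SCALE_BITS / cfg["weight_scale_group_size"]


bpp = effective_bpp(cfg)
print(bpp)
```

The result is 2.125, which the card rounds to ≈2.13 bpp. It covers quantized layers only; tensors kept in BF16 are not included in this figure.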
| Loading this checkpoint requires vLLM plus the | |
| [`humming`](https://github.com/inclusionAI/humming) kernels (`pip install | |
| humming-kernels`). See **Serving with vLLM** below. | |
| > Note: GSQ training first writes shards in `compressed-tensors` | |
| > `pack-quantized` format, where a sub-4-bit codebook is padded to 4 bits per | |
| > value inside an INT32 container. The published checkpoint has been re-packed | |
| > via `convert_to_humming.py` into exact-width 2-bit Humming storage, hence the | |
| > `2 / 32` shape factor in the storage-layout table above. | |
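As a rough illustration of what the re-packing saves, the sketch below compares the packed weight size of one layer at 4 bits per value (the padded case, as this note is read here) against exact-width 2 bits per value. The helper names are hypothetical and the scale tensors are left out.

```python
def packed_words_per_row(in_features, bits_per_value):
    """Number of 32-bit words needed to store one row at the given width."""
    total_bits = in_features * bits_per_value
    if total_bits % 32:
        raise ValueError("row does not fill a whole number of 32-bit words")
    return total_bits // 32


def weight_bytes(out_features, in_features, bits_per_value):
    """Bytes used by the packed weight tensor (4 bytes per 32-bit word)."""
    return out_features * packed_words_per_row(in_features, bits_per_value) * 4


# gate_proj of layer 0, original shape [28672, 8192].
padded = weight_bytes(28672, 8192, 4)  # 4 bits per value in the container
exact = weight_bytes(28672, 8192, 2)   # exact-width 2-bit storage
print(padded, exact, padded / exact)
```

Under these assumptions the exact-width layout halves the packed weight size for this layer, which matches the `2 / 32` versus `4 / 32` shape factors.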
| ## Serving with vLLM | |
| Install the Humming kernels (required for vLLM to load this checkpoint): | |
| ```bash | |
| pip install humming-kernels | |
| ``` | |
| Then start the server, here sharded across two GPUs with tensor parallelism: | |
| ```bash | |
| vllm serve ISTA-DASLab/Llama-3.1-70B-Instruct-2Bit-GSQ \ | |
| --tensor-parallel-size 2 | |
| ``` | |
| ## Citation | |
| ```bibtex | |
| @article{gsq2026, | |
| title = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling}, | |
| author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan}, | |
| journal= {arXiv preprint arXiv:2604.18556}, | |
| year = {2026}, | |
| url = {https://arxiv.org/abs/2604.18556} | |
| } | |
| ``` | |