# Gemma-2-2B quantized in NormalFloat4

## Description

This repository contains a 4-bit (NormalFloat4) quantized version of the Gemma-2-2B model, designed to reduce memory consumption and speed up inference.
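As a hedged sketch of how such a quantization can be produced (the exact recipe used for this repository is not documented here, so the settings below are assumptions), the bitsandbytes integration in transformers supports NF4 out of the box:

```python
# Sketch: producing an NF4-quantized Gemma-2-2B with bitsandbytes.
# The compute dtype below is an assumption, not necessarily the
# setting used for this repository.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    quantization_config=quant_config,
    device_map="auto",
)
```

The quantized weights can then be uploaded with `model.push_to_hub(...)`.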

## Benchmark results

| Gemma-2-2b  | Wiki   | C4     | PIQA   | ARC-E  | ARC-C   | HellaSwag | Wino   | Avg.  |
|-------------|--------|--------|--------|--------|---------|-----------|--------|-------|
| Shots       | 0-shot | 0-shot | 0-shot | 0-shot | 25-shot | 0-shot    | 0-shot |       |
| Unquantized | 8.76   | 12.54  | 78.40  | 80.18  | 50.58   | 54.98     | 68.90  | 66.66 |
| NF4         | 9.15   | 13.05  | 78.02  | 78.96  | 49.15   | 53.53     | 69.14  | 65.76 |

Benchmark scores are computed with lm-evaluation-harness. Wiki and C4 report perplexity (lower is better); the remaining columns report accuracy (higher is better).
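As a minimal sketch of how to reproduce the accuracy numbers (assuming the v0.4+ Python API of lm-evaluation-harness; task names may differ across harness versions):

```python
# Sketch: re-running the accuracy benchmarks with lm-evaluation-harness.
import lm_eval

# 0-shot tasks from the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=StoyanGanchev/gemma-2-2b-nf4",
    tasks=["piqa", "arc_easy", "hellaswag", "winogrande"],
    num_fewshot=0,
)
print(results["results"])

# ARC-C in the table above used 25 shots, so it is a separate run.
arc_c = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=StoyanGanchev/gemma-2-2b-nf4",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(arc_c["results"])
```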

## How to use

```python
# Load the quantized model directly from the Hub
# (bitsandbytes must be installed to deserialize the 4-bit weights).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("StoyanGanchev/gemma-2-2b-nf4")
model = AutoModelForCausalLM.from_pretrained("StoyanGanchev/gemma-2-2b-nf4")
```
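Once loaded, the model behaves like any other transformers causal LM. A short, illustrative generation example (the prompt is arbitrary):

```python
# Illustrative generation with the quantized model.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```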

All model files can be accessed in this repository.
