
Param2-17B-A2.4B-Thinking (4-bit)

This repository contains a 4-bit quantized version of the original model:

bharatgenai/Param2-17B-A2.4B-Thinking

The model has been quantized using bitsandbytes NF4 quantization to reduce memory usage while maintaining good inference quality.
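To illustrate what NF4 quantization does, here is a pure-Python sketch of one quantization block: weights are scaled by the block's absolute maximum, then snapped to the nearest of 16 fixed NormalFloat4 codebook levels. The codebook values below are the NF4 levels from the QLoRA paper as used by bitsandbytes; the block handling here is simplified for illustration and is not the actual CUDA kernel.

```python
# Illustrative sketch of NF4 (NormalFloat4) block quantization.
# Codebook: the 16 NF4 levels (QLoRA / bitsandbytes); block scaling
# uses absmax, as in the real scheme, but everything else is simplified.

NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def quantize_block(block):
    """Quantize one block of floats to 4-bit indices plus an absmax scale."""
    scale = max(abs(x) for x in block) or 1.0
    indices = [
        min(range(16), key=lambda i: abs(NF4_CODEBOOK[i] - x / scale))
        for x in block
    ]
    return indices, scale

def dequantize_block(indices, scale):
    """Map 4-bit indices back to floats via the codebook and scale."""
    return [NF4_CODEBOOK[i] * scale for i in indices]

weights = [0.12, -0.45, 0.98, -0.07, 0.0, 0.63, -0.81, 0.3]
idx, s = quantize_block(weights)
restored = dequantize_block(idx, s)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.3f}")
```

Each weight is stored as a 4-bit index (16 bytes of weights per 32-weight block in practice), so memory drops roughly 4x versus float16 at the cost of the small reconstruction error seen above.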

Model Details

  • Base model: bharatgenai/Param2-17B-A2.4B-Thinking
  • Quantization: 4-bit (NF4)
  • Compute dtype: float16
  • Double quantization: enabled
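The settings above correspond to a bitsandbytes configuration like the following. This is a reconstruction from the listed details; the exact configuration used to produce this repository's weights has not been published.

```python
import torch
from transformers import BitsAndBytesConfig

# Quantization config matching the Model Details above (an assumption,
# reconstructed from the listed settings).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)
```

Since this repository already ships quantized weights, you do not need to pass such a config at load time; it is only needed when quantizing the original full-precision model yourself.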

Benefits

  • Significantly reduced VRAM usage
  • Faster inference
  • Suitable for consumer GPUs
  • Maintains strong reasoning ability

Approximate VRAM Requirements

4-bit inference typically requires around 10–12 GB of VRAM.
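The 10–12 GB figure follows from the parameter count: at 4 bits (0.5 bytes) per parameter, the weights alone occupy about 8.5 GB, and activations, KV cache, and framework overhead account for the rest.

```python
# Back-of-the-envelope VRAM estimate for 4-bit weights.
params = 17e9            # 17B parameters
bytes_per_param = 0.5    # 4 bits = 0.5 bytes (ignoring quantization constants)
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")
# Activations, KV cache, and runtime overhead push the total to ~10-12 GB.
```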

Installation

```bash
pip install transformers accelerate bitsandbytes torch
```

Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralnets/modelname-4bit"

# Load the tokenizer and the pre-quantized 4-bit weights.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # place layers across available GPUs/CPU
    trust_remote_code=True,
)

# Tokenize a prompt, generate, and decode the result.
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Notes

This repository contains only the quantized weights.
For the original full-precision model, visit:

https://huggingface.co/bharatgenai/Param2-17B-A2.4B-Thinking

License

Follows the license of the original model.
