EXL2 4.5bpw Quantization of calme-3.2-instruct-78b

This repository hosts a 4.5 bits per weight (bpw) quantization of the calme-3.2-instruct-78b model in the ExLlamaV2 (EXL2) format, intended for efficient GPU inference at long context lengths. The base model is a Qwen 2.5 finetune.

Quantization Details

Format: ExLlamaV2 4.5bpw
Version: ExLlamaV2 0.2.6
Model Size: 78 billion parameters
VRAM Usage: Approx. 44GB (32,000 context)
Calibration:
- Rows: 115
- Length: 2048
- Dataset: (default)

Quantization reduces memory usage and inference latency compared with the unquantized model while largely preserving quality on generative text tasks.
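As a rough sanity check on the VRAM figure above, the weight footprint can be estimated from the parameter count and the average bit width. This is a back-of-the-envelope sketch, not a measurement: it assumes exactly 78 billion parameters at a uniform 4.5 bits each, and it ignores the KV cache, activations, and any tensors that ExLlamaV2 stores at a different precision. The function name is illustrative only.

```python
def weight_footprint_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given average bit width."""
    return n_params * bits_per_weight / 8


N_PARAMS = 78e9  # "78 billion parameters" from the details above (assumed exact)
BPW = 4.5        # average bits per weight of this quantization

quantized = weight_footprint_bytes(N_PARAMS, BPW)
fp16 = weight_footprint_bytes(N_PARAMS, 16)  # 16-bit baseline for comparison

print(f"4.5 bpw weights: {quantized / 1e9:.1f} GB ({quantized / 2**30:.1f} GiB)")
print(f"16-bit weights:  {fp16 / 1e9:.1f} GB ({fp16 / 2**30:.1f} GiB)")
```

Under these assumptions the weights alone come to about 43.9 GB (roughly 40.9 GiB), which is in the same range as the VRAM figures quoted in this card; the remainder depends on context length and runtime overhead.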

Prompt Template

This model uses the ChatML prompt template for interaction:

<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
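For clients that build the prompt string by hand, the template can be rendered with a small helper. This is a minimal sketch: `format_chatml` is a hypothetical name, and it places `<|im_end|>` directly after each message's content, following the usual ChatML convention rather than the line layout shown above. The chat template shipped with the model's tokenizer remains the authoritative version.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|>
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is EXL2 quantization?"},
])
print(prompt)
```

Generation should stop when the model emits `<|im_end|>`, so that token is normally configured as a stop condition in the inference backend.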

Model Usage

Example: Inference with ExLlamaV2

To use this quantized model, ensure you have the ExLlamaV2 library installed:

pip install exllamav2

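from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# ExLlamaV2 loads from a local directory; point this at the downloaded model files
model_dir = "./local-folder"

# Load model and tokenizer, splitting the layers across the available GPUs
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32000, lazy=True)
model.load_autosplit(cache, progress=True)
tokenizer = ExLlamaV2Tokenizer(config)

# Create generator
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Generate text from a ChatML-formatted prompt
prompt = "<|im_start|>user\nWhat is EXL2 quantization?<|im_end|>\n<|im_start|>assistant\n"
response = generator.generate(prompt=prompt, max_new_tokens=200, encode_special_tokens=True, stop_conditions=[tokenizer.eos_token_id])
print(response)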

Features

EXL2 format requires NVIDIA hardware but typically runs faster and uses less memory than GGUF.
Requires approximately 44 GB of VRAM for a 32,000-token context window.
Requires at least 40 GB of VRAM for a 1,024-token context window.
Highly optimized for inference, making it ideal for resource-constrained environments.
Compatible with ChatML-based prompting systems.
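The two VRAM figures above can be turned into a rough planning aid for context lengths in between. The sketch below simply interpolates linearly between the two quoted points (40 GB at 1,024 tokens, 44 GB at 32,000 tokens). Both figures are approximate, so the result is an estimate only; the function names are illustrative, and the cap at 32,000 tokens reflects the largest context quoted in this card rather than a limit of the model.

```python
# Two reference points quoted in this card (both approximate).
LOW_CTX, LOW_GB = 1_024, 40.0
HIGH_CTX, HIGH_GB = 32_000, 44.0

# Approximate VRAM cost of each additional token of context.
GB_PER_TOKEN = (HIGH_GB - LOW_GB) / (HIGH_CTX - LOW_CTX)


def estimate_vram_gb(context_tokens: int) -> float:
    """Estimated VRAM (GB) for a given context length, by linear interpolation."""
    return LOW_GB + (context_tokens - LOW_CTX) * GB_PER_TOKEN


def estimate_max_context(vram_gb: float) -> int:
    """Estimated largest context that fits in vram_gb, capped at HIGH_CTX."""
    if vram_gb < LOW_GB:
        return 0  # below the quoted minimum; the model is not expected to fit
    tokens = LOW_CTX + (vram_gb - LOW_GB) / GB_PER_TOKEN
    return min(int(tokens), HIGH_CTX)


for budget in (40.0, 42.0, 48.0):
    print(f"{budget:.0f} GB -> about {estimate_max_context(budget)} tokens")
```

Treat the output as a starting value for the cache size and confirm it against actual memory usage on the target GPUs.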

Acknowledgments

Original Model Creator: MaziyarPanahi
Quantization by: DavidCatalano
Quantization Tool: ExLlamaV2 0.2.6

Download Instructions

To download the model files:

pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw --include "*" --local-dir ./local-folder

Evaluation results

Source: Open LLM Leaderboard

- IFEval (0-Shot), strict accuracy: 80.630
- BBH (3-Shot), normalized accuracy: 62.610
- MATH Lvl 5 (4-Shot), exact match: 39.950
- GPQA (0-shot), acc_norm: 20.360
- MuSR (0-shot), acc_norm: 38.530
- MMLU-PRO (5-shot, test set), accuracy: 70.030