configuration files for custom training?

by wardgenaicto - opened Apr 24, 2025

Discussion

wardgenaicto

Apr 24, 2025

Hi,

I'm working on custom training with the bitnet-b1.58-2B-4T-bf16 model and would like to retain 1-bit quantization compatibility for CPU inference using tools like llama.cpp.

However, the current repository appears to be missing the configuration_bitnet.py and modeling_bitnet.py files typically required to enable trust_remote_code=True with Transformers. Are there official versions of these files available, or recommended alternatives that preserve compatibility with the quantization pipeline (e.g. i2_s / GGUF for CPU use)?
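For anyone hitting the same problem, a rough way to see which custom-code files a local model directory still lacks is to read the `auto_map` section of its `config.json`, which Transformers consults when `trust_remote_code=True` is set. The sketch below assumes the model has been downloaded to a local folder and that its `config.json` declares an `auto_map`. The helper name `missing_remote_code_files` is made up for illustration, not part of any library.

```python
import json
from pathlib import Path


def missing_remote_code_files(model_dir):
    """List module files referenced by config.json's auto_map that are absent.

    Hypothetical helper for illustration only.
    """
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())
    missing = set()
    for target in config.get("auto_map", {}).values():
        # An entry is a string like "modeling_x.SomeClass", or a list of
        # such strings (which may contain None for unavailable variants).
        refs = target if isinstance(target, (list, tuple)) else [target]
        for ref in refs:
            if not ref:
                continue
            # Drop an optional "repo_id--" prefix, then the class name.
            module = ref.split("--")[-1].rsplit(".", 1)[0]
            filename = module + ".py"
            if not (model_dir / filename).is_file():
                missing.add(filename)
    return sorted(missing)
```

Calling `missing_remote_code_files("path/to/local/model")` returns an empty list when every referenced module file is present, and the names of the missing `.py` files otherwise.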

Any guidance or references would be much appreciated.
Thanks!

liuzhihhxx

Apr 25, 2025

You can find alternative files here: https://huggingface.co/1bitLLM/bitnet_b1_58-3B/tree/main
Just copy them into your local model directory.
Then install transformers==4.52.0.dev0 from the fork: pip install git+https://github.com/shumingma/transformers.git
Hope it works for you as well.
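The file-copy step from this reply could be scripted roughly as follows. This is only a sketch: it assumes `huggingface_hub` is installed, and it assumes the files to fetch are the two named in the question (`configuration_bitnet.py` and `modeling_bitnet.py`). Check the repository listing, since the modeling file may import further helper modules that would need copying too. The function names are hypothetical.

```python
import shutil
from pathlib import Path

SOURCE_REPO = "1bitLLM/bitnet_b1_58-3B"
# Assumed file names, taken from the question above; verify against the repo.
REMOTE_CODE_FILES = ("configuration_bitnet.py", "modeling_bitnet.py")


def copy_remote_code(src_dir, model_dir, filenames=REMOTE_CODE_FILES):
    """Copy the named files from src_dir into model_dir; return the new paths."""
    src_dir, model_dir = Path(src_dir), Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        shutil.copy(src_dir / name, model_dir / name)
        copied.append(model_dir / name)
    return copied


def fetch_remote_code(model_dir, repo_id=SOURCE_REPO, filenames=REMOTE_CODE_FILES):
    """Download each file from the Hub and place it in model_dir (needs network)."""
    # Imported here so the copy helper above works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        cached = hf_hub_download(repo_id=repo_id, filename=name)
        shutil.copy(cached, model_dir / name)
        copied.append(model_dir / name)
    return copied
```

With the model already downloaded to a local folder, `fetch_remote_code("path/to/local/model")` would place the files next to `config.json`. The forked Transformers build mentioned above still has to be installed separately with pip.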

wardgenaicto

Apr 25, 2025

I just wanted to say thanks: the config files worked great and I'm training! I'm stuck on CPU for now due to the Mac MPS 512 limit, and I'm still figuring out the local GPU path so I can avoid needing a cloud resource for my use case.

Really impressed with this model's local performance and efficiency. It's a fantastic start!

Hopeful that larger context windows and bigger models might be possibilities down the line.
Thanks again for sharing!
