Configuration files for custom training?
Hi,
I'm working on custom training with the bitnet-b1.58-2B-4T-bf16 model and would like to retain 1-bit quantization compatibility for CPU inference using tools like llama.cpp.
However, the current repository appears to be missing the configuration_bitnet.py and modeling_bitnet.py files typically required to enable trust_remote_code=True with Transformers. Are there official versions of these files available, or recommended alternatives that preserve compatibility with the quantization pipeline (e.g. i2_s / GGUF for CPU use)?
Any guidance or references would be much appreciated.
Thanks!
You can find alternative files here: https://huggingface.co/1bitLLM/bitnet_b1_58-3B/tree/main. Just put them into the model path.
Then install transformers==4.52.0.dev0 by running: pip install git+https://github.com/shumingma/transformers.git
Hope it works for you as well.
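For anyone following along, here is a minimal sketch of the "put them into the model path" step described above. The two file names come from the original question; the function names, the assumption that the donor repository exposes files under exactly those names, and the local directory layout are all hypothetical. The donor repository may also contain extra helper modules that its modeling file imports, in which case those would need to be added to the list.

```python
import shutil
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

# File names taken from the question above. Assumption: the donor repo
# provides files with exactly these names; extend the tuple if the
# modeling file imports further helper modules.
REMOTE_CODE_FILES = ("configuration_bitnet.py", "modeling_bitnet.py")


def fetch_remote_code(repo_id, dest_dir, filenames=REMOTE_CODE_FILES):
    """Download the listed files from a Hub repo into dest_dir (needs network)."""
    # Imported lazily so the rest of this sketch works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=dest_dir)
        for name in filenames
    ]


def install_remote_code(src_dir, model_dir, filenames=REMOTE_CODE_FILES):
    """Copy the listed files from src_dir into a local model directory."""
    src, dst = Path(src_dir), Path(model_dir)
    missing = [name for name in filenames if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {missing}")
    dst.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src / name, dst / name)) for name in filenames]


def installed_version(package="transformers"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

A possible usage, with "./donor" and "./bitnet-b1.58-2B-4T-bf16" as placeholder paths: call fetch_remote_code("1bitLLM/bitnet_b1_58-3B", "./donor"), then install_remote_code("./donor", "./bitnet-b1.58-2B-4T-bf16"), and check that installed_version() reports the 4.52.0.dev0 build mentioned above. Depending on the model's config.json, an auto_map entry pointing at the copied classes may also be needed before trust_remote_code=True picks them up; that part is not covered in this thread.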
I just wanted to say thanks: the config files worked great and I'm training! I'm stuck on CPU for now due to the Mac MPS 512 limit, and I'm figuring out the local GPU path so I can avoid needing a cloud resource for my use case.
Really impressed with this model's local performance and efficiency. It's a fantastic start!
I'm hopeful that larger context windows and bigger models might be possible down the line.
Thanks again for sharing!