Introduction

Jamba2 3B is an ultra-compact, open-source model designed to bring enterprise-grade reliability to on-device deployments. At just 3B parameters, it runs efficiently on consumer devices (iPhones, Androids, Macs, and PCs) while maintaining the grounding and instruction-following capabilities required for production use.

Released under Apache 2.0 License with a 256K context window, Jamba2 3B enables developers to build reliable AI applications for edge environments. For more details, read the full release blog post.

Key Advantages

  • On-device deployment: Runs efficiently on iPhones, Androids, Macs, and PCs
  • Ultra-compact footprint: 3B parameters enabling edge deployments with minimal resources
  • Benchmark leadership: Excels on IFBench, IFEval, Collie, and FACTS
  • 256K context window: Processes long documents and knowledge bases
  • Apache 2.0 License: Fully open source for commercial use
  • SSM-Transformer architecture: Memory-efficient design for resource-constrained environments

Evaluation Results

Jamba2 3B achieves category-leading performance on instruction following and grounding benchmarks despite its compact size. The model delivers consistent, context-faithful outputs across diverse enterprise tasks including RAG workflows and technical document processing.

Training and Evaluation Details

Jamba2 models were developed using a comprehensive post-training pipeline starting from Jamba 1.5 pre-training. The models underwent mid-training on 500B carefully curated tokens with increased representation of math, code, high-quality web data, and long documents. A state passing phase optimized the Mamba layers for effective context length generalization. Training continued with cold start supervised fine-tuning to establish instruction-following and reasoning capabilities, followed by DPO optimization.

The final training stages involved multiple on-policy reinforcement learning phases, progressively moving from short-context verifiable rewards to longer context training with mixed verifiable and model-based rewards. Evaluation focused on two key enterprise reliability signals: instruction-following benchmarks measuring steerability, and grounding benchmarks testing context faithfulness. Human evaluators assessed performance on real-world enterprise tasks using blind, counterbalanced side-by-side comparisons, rating outputs on factuality, style, constraint-adherence, instruction-following, and helpfulness.
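The DPO stage mentioned above trains the policy to prefer chosen over rejected responses relative to a frozen reference model. As a simplified illustration only (not AI21's actual training code), the per-pair DPO loss for scalar sequence log-probabilities can be sketched as follows; the function name and the `beta` value are assumptions for this example:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO loss for one preference pair:
    #   -log sigmoid(beta * [(pi_c - pi_r) - (ref_c - ref_r)])
    # The policy is rewarded for widening its chosen-vs-rejected
    # log-prob margin beyond the reference model's margin.
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree (margin 0), loss is log(2);
# a positive margin drives the loss below that baseline.
print(dpo_loss(-1.0, -3.0, -2.0, -2.5))
```

In practice the log-probabilities are summed over response tokens and the loss is averaged over a batch; the scalar version above only shows the shape of the objective.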

Quickstart

Run with vLLM

Best results require vLLM version 0.10.2 or higher.

vllm serve "ai21labs/AI21-Jamba2-3B" --mamba-ssm-cache-dtype float32 --enable-auto-tool-choice --tool-call-parser hermes --enable-prefix-caching
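Once the server is running, it exposes vLLM's OpenAI-compatible `/v1/chat/completions` endpoint (default port 8000). A minimal client sketch, assuming that default port and the `requests` package for the actual HTTP call:

```python
def build_chat_request(question, model="ai21labs/AI21-Jamba2-3B",
                       temperature=0.6):
    # Payload in the OpenAI chat-completions format that the
    # vLLM server accepts.
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
    }

payload = build_chat_request("Summarize our PTO policy in one sentence.")

# With the server running locally (requires the `requests` package):
# import requests
# resp = requests.post("http://localhost:8000/v1/chat/completions",
#                      json=payload)
# print(resp.json()["choices"][0]["message"]["content"])
```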

Run with Transformers

pip install "transformers>=4.54.0"
pip install flash-attn --no-build-isolation
pip install "causal-conv1d>=1.2.0"
pip install mamba-ssm

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/AI21-Jamba2-3B",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba2-3B")

messages = [
    {"role": "system",
     "content": ("You are an HR Policy Assistant. "
                 "Answer employee questions using only the provided policy documents. "
                 "If the answer isn't in the documents, say so clearly. "
                 "Be concise and cite the specific policy section when possible.")},
    {"role": "user",
     "content": ("Context documents: {retrieved_chunks}.\n"
                 "Employee question: {user_question}.\n"
                 "Answer:")},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

inputs = tokenizer(prompts, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, do_sample=True, temperature=0.6, max_new_tokens=512)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
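The `{retrieved_chunks}` and `{user_question}` placeholders in the user message are meant to be filled by your retrieval pipeline before building the chat messages. A hypothetical illustration (the policy chunk texts below are invented for this example):

```python
# Invented example chunks standing in for real retriever output.
retrieved_chunks = "\n\n".join([
    "[Section 4.2] Employees accrue 1.5 PTO days per month.",
    "[Section 4.3] Unused PTO rolls over up to 10 days per year.",
])
user_question = "How many PTO days roll over each year?"

# Fill the user-message template from the example above.
user_content = ("Context documents: {retrieved_chunks}.\n"
                "Employee question: {user_question}.\n"
                "Answer:").format(retrieved_chunks=retrieved_chunks,
                                  user_question=user_question)
print(user_content)
```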