Running on 3x24 GB RAM?

by Marcophono - opened Apr 9, 2024

Discussion

Marcophono

Apr 9, 2024

Hello!
I would like to know whether this model can run on a server with 3x RTX 4090. Of course, for that a model must either be "ready" to split off parts of the computation that do not depend on results computed simultaneously on the other GPUs, or its layers must be divided into three parts so that the intermediate result from cuda:0 is sent to cuda:1 for further computation, and so on. As I wasn't able to find any information about this, I assume it is not possible at the moment. Are there plans to offer this? I know there is a branch that lets the model run on a single 24 GB graphics card, but I think this will cost some output performance.

Best regards
Marc

P.S.: Very impressed by this work!!
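
The second option in the post, dividing the layers into contiguous groups so that each GPU passes its intermediate result to the next, can be illustrated with a small placement sketch. This is only an illustration in plain Python: `split_layers` is a hypothetical helper, and the layer count of 64 is an assumption about this model rather than something stated in the thread. Real loaders usually balance by memory instead of by layer count.

```python
def split_layers(num_layers: int, devices: list[str]) -> dict[int, str]:
    """Assign contiguous blocks of layers to devices as evenly as possible."""
    base, extra = divmod(num_layers, len(devices))
    placement: dict[int, str] = {}
    layer = 0
    for rank, device in enumerate(devices):
        # The first `extra` devices take one additional layer each.
        count = base + (1 if rank < extra else 0)
        for _ in range(count):
            placement[layer] = device
            layer += 1
    return placement


# Assumed layer count (64) and the three GPUs from the question.
placement = split_layers(64, ["cuda:0", "cuda:1", "cuda:2"])

# Activations would cross a device boundary wherever consecutive layers differ.
boundaries = [i for i in range(1, 64) if placement[i] != placement[i - 1]]
print(boundaries)
```

With such a placement, only one GPU is busy at a time for a single request, so the gain is capacity rather than speed.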

BrunoSE

Apr 11, 2024

I tried with 4x L4 GPUs (i.e. 96 GB VRAM) and it didn't work. I got an out-of-memory error with this 4-bit version.
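
A rough back-of-the-envelope check helps put the numbers in this thread in context. The sketch below assumes a parameter count of about 104 billion for this model (an assumption not stated in the thread) and counts only the quantized weights. Parts that typically stay unquantized, plus activations and the KV cache, come on top, so the real footprint is higher.

```python
def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 2**30


ASSUMED_PARAMS = 104e9  # assumption: roughly 104B parameters

weights = weight_gib(ASSUMED_PARAMS, 4)  # about 48.4 GiB at 4 bits per weight
for label, total_gib in [("3 x 24 GiB", 72), ("4 x 24 GiB", 96)]:
    print(f"{label}: {total_gib - weights:.1f} GiB left beyond the quantized weights")
```

By this estimate the weights alone fit on paper in both setups, which suggests that out-of-memory errors depend on how the weights are distributed and on what else needs memory at run time.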

Marcophono

Apr 11, 2024

@BrunoSE From my research so far, this might be a working solution:
https://huggingface.co/pmysl/c4ai-command-r-plus-GGUF in combination with https://github.com/ggerganov/llama.cpp

Another option seems to be https://github.com/ollama/ollama/releases/tag/v0.1.32-rc1

Running an LLM on local (consumer) hardware this way is new to me, so I was hoping to get some input here (like you, I think ;)

Best regards
Marc

danabo

Apr 20, 2024

> At least I tried with 4xL4 GPUs (i.e. 96GB VRAM) and it didn't work. Got out of memory error with this 4bit version

Strange that it didn't work for you. I was able to get the 4-bit version working on four A10G cards totaling 96 GiB of VRAM. I didn't do anything special; I just loaded the model with AutoModelForCausalLM.from_pretrained(). Note that passing device_map='auto' is important so that all the GPUs are used. However, I am getting OOM errors at only moderately long context lengths of around 4k tokens.
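
The loading approach described above might look roughly like the sketch below. `build_load_kwargs` and `load_model` are hypothetical helper names, and the 4 GiB of headroom per GPU is an arbitrary assumption. `device_map` and `max_memory` are existing arguments of `from_pretrained`; capping `max_memory` below the physical capacity leaves room for activations, though it is not guaranteed to avoid the OOM errors at longer contexts.

```python
def build_load_kwargs(num_gpus: int, per_gpu_gib: int, headroom_gib: int = 4) -> dict:
    """Build keyword arguments that spread the model over all GPUs."""
    usable = per_gpu_gib - headroom_gib
    return {
        # Let the loader place layers across every visible GPU.
        "device_map": "auto",
        # Cap per-GPU usage so some memory stays free for activations.
        "max_memory": {gpu: f"{usable}GiB" for gpu in range(num_gpus)},
    }


def load_model(model_id: str, **kwargs):
    """Load the model; transformers is imported lazily so the helper above works without it."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)


kwargs = build_load_kwargs(num_gpus=4, per_gpu_gib=24)
print(kwargs)
```

The model would then be loaded with `load_model("CohereLabs/c4ai-command-r-plus-4bit", **kwargs)`, which needs the GPUs and a full download of the weights.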

alexrs changed discussion status to closed Jun 12, 2025
