zsh: killed on macbookpro M2 with 24GB

#14

by aleksandrvin - opened Mar 12, 2024

Discussion

aleksandrvin

Mar 12, 2024

I consider myself a newbie, but it looks like I'm running out of memory. Can I run it on a 24GB system, or do I need to move to a server with more RAM?

...
model-00006-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [02:48<00:00, 29.3MB/s]
model-00007-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [02:38<00:00, 31.1MB/s]
model-00008-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [02:35<00:00, 31.8MB/s]
model-00009-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [02:40<00:00, 30.7MB/s]
model-00010-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [02:48<00:00, 29.3MB/s]
model-00011-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [03:18<00:00, 24.8MB/s]
model-00012-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [03:37<00:00, 22.7MB/s]
model-00013-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [03:22<00:00, 24.3MB/s]
model-00014-of-00015.safetensors: 100%|██████████| 4.93G/4.93G [02:24<00:00, 34.2MB/s]
model-00015-of-00015.safetensors: 100%|██████████| 1.11G/1.11G [00:28<00:00, 38.5MB/s]
Downloading shards: 100%|██████████| 15/15 [37:37<00:00, 150.50s/it]
Loading checkpoint shards:  33%|███▎      | 5/15 [02:32<05:36, 33.69s/it]
zsh: killed     python3 commandR.py
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

saurabhdash

Cohere Labs org Mar 12, 2024

The FP16 model requires ~70GB of memory. There is going to be a quantized model soon that should be ~18GB.
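
For readers wondering where these figures come from, here is a rough back-of-the-envelope sketch. It assumes the model has about 35 billion parameters (a figure taken from the model card, not from this thread) and counts only the weights, so activations, the KV cache, and framework overhead come on top. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: ~35 billion parameters (not stated in this thread).
N_PARAMS = 35e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under that assumption the weights alone exceed 24GB in both FP16 and 8-bit, which is consistent with the process being killed while loading checkpoint shards.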

blevlabs

Mar 12, 2024

@saurabhdash Awesome! Will it support NVIDIA GPU inference in GPTQ or AWQ formats?

WaveCut

Mar 12, 2024

I see no reason why it would not. It should be achievable with slight code updates (to AWQ / GPTQ, I mean).

saurabhdash

Cohere Labs org Mar 13, 2024

@blevlabs Yes, it should work right out of the box, like other models.

Avaruuskettu

Mar 13, 2024

> The FP16 model requires ~70GB of memory. There is going to be a quantized model soon that should be ~18GB.

What quantizations are you planning to release? 6-bit or 8-bit would probably be optimal for my purposes.

alexrs

Cohere Labs org Sep 5, 2024

The quantized model can be found here: https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit

You can also load the model in 4 or 8 bits using bitsandbytes!
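
A minimal sketch of what that might look like with Transformers, not an official recipe. It assumes `transformers`, `accelerate`, and `bitsandbytes` are installed, and uses the `CohereLabs/c4ai-command-r-v01` repository id. The helper names `bnb_kwargs` and `load_quantized` are hypothetical. bitsandbytes has primarily targeted NVIDIA CUDA GPUs, so whether this runs on Apple Silicon depends on the installed version.

```python
MODEL_ID = "CohereLabs/c4ai-command-r-v01"


def bnb_kwargs(bits: int) -> dict:
    """Map a bit width to the matching BitsAndBytesConfig keyword argument."""
    if bits == 4:
        return {"load_in_4bit": True}
    if bits == 8:
        return {"load_in_8bit": True}
    raise ValueError("bitsandbytes loading supports 4 or 8 bits only")


def load_quantized(bits: int = 4):
    """Load the tokenizer and a quantized model (downloads the full checkpoint)."""
    # Imported lazily so the helper above works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    config = BitsAndBytesConfig(**bnb_kwargs(bits))
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=config,
        device_map="auto",  # assumption: lets accelerate place layers on available devices
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_quantized(bits=4)` would then quantize the weights as they load. The full-precision shards still have to be downloaded first, so the disk footprint does not shrink.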

alexrs changed discussion status to closed Sep 5, 2024
