Instructions to use CohereLabs/c4ai-command-r-v01 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use CohereLabs/c4ai-command-r-v01 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="CohereLabs/c4ai-command-r-v01")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")
model = AutoModelForCausalLM.from_pretrained("CohereLabs/c4ai-command-r-v01")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use CohereLabs/c4ai-command-r-v01 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "CohereLabs/c4ai-command-r-v01"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CohereLabs/c4ai-command-r-v01",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/CohereLabs/c4ai-command-r-v01
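Since the vLLM server started above exposes an OpenAI-compatible API, you can also call it from Python rather than curl. A minimal sketch using the official openai client (an assumption on our part, not part of the snippet above; requires pip install openai, and the api_key value is an arbitrary placeholder because the local server does not authenticate):

# Query the local vLLM server with the OpenAI Python client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",  # placeholder; the local server does not check the key
)

response = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r-v01",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)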
- SGLang
How to use CohereLabs/c4ai-command-r-v01 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CohereLabs/c4ai-command-r-v01" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CohereLabs/c4ai-command-r-v01",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "CohereLabs/c4ai-command-r-v01" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "CohereLabs/c4ai-command-r-v01",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use CohereLabs/c4ai-command-r-v01 with Docker Model Runner:
docker model run hf.co/CohereLabs/c4ai-command-r-v01
Update Readme with tool-use/RAG prompting guide
If/when merged, I will open a PR to copy over the changes to the quantized model card
Thank you so very much!
Bullish on Cohere
This is great @patrick-s-h-lewis! One suggestion: right now the documentation talks through how to use the tokenizer (https://docs.cohere.com/docs/prompting-command-r, screenshot attached) but only refers to it as the Hugging Face tokenizer. Now that our PR has been merged and our code is more accessible in the transformers library, I'd suggest you explicitly link to the relevant code section: https://github.com/huggingface/transformers/blob/838b87abe231fd70be5132088d0dee72a7bb8d62/src/transformers/models/cohere/tokenization_cohere_fast.py#L420.
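For reference, a minimal sketch of how the tool-use helper in that linked file is invoked (based on the apply_tool_use_template method defined there; this helper was later deprecated in newer transformers releases in favor of passing tools through apply_chat_template, so treat the exact signature as version-dependent):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")

# One tool definition in the schema the Cohere tokenizer expects.
tools = [
    {
        "name": "internet_search",
        "description": "Returns a list of relevant document snippets for a textual query retrieved from the internet",
        "parameter_definitions": {
            "query": {
                "description": "Query to search the internet with",
                "type": "str",
                "required": True,
            }
        },
    }
]
conversation = [{"role": "user", "content": "What is the biggest penguin in the world?"}]

# Renders the full tool-use prompt: preamble, tool schemas, and chat history.
prompt = tokenizer.apply_tool_use_template(
    conversation,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)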
Merging now.
According to the prompt structure described in the instructions, the tool outputs are not included directly in the chat history itself, but rather in a separate dedicated section.
Specifically, the tool outputs are inserted into the {TOOL_OUTPUTS} placeholder, which comes after the {CHAT_HISTORY} section in the augmented generation prompt template:
augmented_gen_prompt_template = """
...
<|END_OF_TURN_TOKEN|>{CHAT_HISTORY}<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{TOOL_OUTPUTS}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{INSTRUCTIONS}<|END_OF_TURN_TOKEN|>"""
The {TOOL_OUTPUTS} section contains the results from all the tool calls, with each result prefixed by a Document: {n} identifier.
For example:
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>
Document: 0
[Tool 1 output]
Document: 1
[Tool 2 output]
<|END_OF_TURN_TOKEN|>
So even if multiple tools are used at different points in the conversation, their outputs are aggregated together in this single {TOOL_OUTPUTS} section, separate from the actual back-and-forth utterances captured in {CHAT_HISTORY}. This allows the model to reference the tool results when formulating its response, without muddling the conversation history.
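In code, that aggregation step might look like the following sketch (render_tool_outputs is a hypothetical helper I'm introducing for illustration, not part of any library; it just mirrors the layout shown above):

def render_tool_outputs(results):
    """Hypothetical helper: fold all tool results, whenever they were
    produced during the conversation, into the single {TOOL_OUTPUTS}
    system turn."""
    lines = ["<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>"]
    for n, output in enumerate(results):
        lines.append(f"Document: {n}")
        lines.append(output)
    lines.append("<|END_OF_TURN_TOKEN|>")
    return "\n".join(lines)

# Two tool calls made at different points still land in one block:
print(render_tool_outputs(["[Tool 1 output]", "[Tool 2 output]"]))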
Is this accurate? Thank you
