Running on 3x24 GB RAM?
Hello!
I would like to know whether this model can run on a server with 3x RTX 4090. As I understand it, either the model must be able to split its work into parts that do not depend on results computed simultaneously on the other GPUs, or the model's layers must be divided into three parts so that the intermediate result from cuda:0 is sent to cuda:1 for further computation, and so on. Since I wasn't able to find any information about this, I assume it is not possible at the moment. Are there plans to offer this? I know there is a branch that lets the model run on a single 24 GB graphics card, but I think this will cost some output performance.
Best regards
Marc
P.S.: Very impressed by this work!!
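The second approach described in the question (dividing the layers into contiguous parts, one part per GPU) can be sketched in a few lines. This is only an illustration of the idea, not how any particular library implements it: the function name `split_layers` and the layer count of 8 are hypothetical, chosen to keep the example small.

```python
def split_layers(num_layers, devices):
    """Assign contiguous blocks of layers to devices.

    Earlier devices receive one extra layer when the count does not
    divide evenly. Hypothetical helper, for illustration only.
    """
    base, extra = divmod(num_layers, len(devices))
    mapping = {}
    layer = 0
    for rank, device in enumerate(devices):
        count = base + (1 if rank < extra else 0)
        for _ in range(count):
            mapping[layer] = device
            layer += 1
    return mapping


# Toy example: 8 layers over three GPUs (the real model has far more layers).
layer_to_device = split_layers(8, ["cuda:0", "cuda:1", "cuda:2"])
for index, device in layer_to_device.items():
    print(f"layer {index} -> {device}")
```

With a placement like this, the activations leaving the last layer on cuda:0 are copied to cuda:1 and so on, so the GPUs take turns rather than computing in parallel. It mainly buys memory capacity, not speed.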
For what it's worth, I tried with 4x L4 GPUs (i.e. 96 GB VRAM) and it didn't work. I got an out-of-memory error with this 4-bit version.
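A back-of-the-envelope estimate helps put these numbers in context. The sketch below computes the memory needed for the quantized weights alone. The parameter count of 104 billion is an assumption not stated in this thread, and the helper `weight_gib` is hypothetical. The result is a lower bound: quantization metadata, any layers kept in higher precision, activations, and the KV cache all come on top.

```python
def weight_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB (lower bound)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Assumption: roughly 104 billion parameters, stored at 4 bits each.
ASSUMED_PARAMS = 104e9
needed = weight_gib(ASSUMED_PARAMS, 4)

for label, total_gb in [("3x 24 GB", 72), ("4x 24 GB", 96)]:
    print(f"{label}: about {needed:.1f} GiB of weights vs {total_gb} GB total VRAM")
```

Under that assumption the weights by themselves fit in either configuration, which suggests that an out-of-memory error is more likely caused by how the model is placed across the GPUs or by runtime overhead than by the raw weight size.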
@BrunoSE From my research so far, this might be a working solution:
https://huggingface.co/pmysl/c4ai-command-r-plus-GGUF in combination with https://github.com/ggerganov/llama.cpp
Another option seems to be https://github.com/ollama/ollama/releases/tag/v0.1.32-rc1
Running an LLM on local (consumer) hardware is new to me, so I hoped to get some input here (like you, I think ;)
Best regards
Marc
> At least I tried with 4xL4 GPUs (i.e. 96GB VRAM) and it didnt work. Got out of memory error with this 4bit version
Strange that it didn't work for you. I was able to get the 4-bit version working on four A10G cards totaling 96 GiB of VRAM. I didn't do anything special; I just loaded the model with AutoModelForCausalLM.from_pretrained(). Note that passing device_map='auto' is important so that all the GPUs are utilized. However, I am getting OOM errors at only moderately long context lengths of around 4k tokens.
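A minimal sketch of the loading approach described above, assuming `transformers` and `accelerate` are installed. The helper names `build_load_kwargs` and `load_model` and the 4 GiB headroom are hypothetical choices. The `max_memory` cap is an addition not mentioned in the post: it asks the loader to leave some room on each GPU for activations and the KV cache, which may or may not be enough to avoid the OOM at longer contexts.

```python
def build_load_kwargs(num_gpus, per_gpu_gib, headroom_gib=4):
    """Build from_pretrained keyword arguments that spread the model over GPUs.

    headroom_gib is memory deliberately left free on each GPU for
    activations and the KV cache (an illustrative value, not a tuned one).
    """
    usable = per_gpu_gib - headroom_gib
    if usable <= 0:
        raise ValueError("headroom must be smaller than per-GPU memory")
    return {
        "device_map": "auto",  # let the loader place layers across all GPUs
        "max_memory": {i: f"{usable}GiB" for i in range(num_gpus)},
    }


def load_model(model_id="CohereLabs/c4ai-command-r-plus-4bit", **load_kwargs):
    """Load tokenizer and model; imported lazily so the helper above runs anywhere."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    return tokenizer, model


# Example: four 24 GiB cards, as in the setup described above.
kwargs = build_load_kwargs(num_gpus=4, per_gpu_gib=24)
print(kwargs)
# tokenizer, model = load_model(**kwargs)  # uncomment on a machine with the GPUs
```

Lowering the per-GPU cap trades weight capacity for runtime headroom, so the value needs adjusting to the hardware and context length at hand.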