Any chance of a 13B-20B version?

by smcleod - opened Aug 29, 2023

Discussion

smcleod

Aug 29, 2023

Is there any chance there will be a slightly smaller version, somewhere between 13B and ~20B, that's likely to run on more common GPUs with 16GB of VRAM?

A lot of the decent coding models coming out seem to be focused on folks with 24GB+ cards.

corey4005

Aug 29, 2023

How do you know that it is only for a 24 GB card? So if we use the code they generated to show how to use it, we have to have a certain set of specs?

smcleod

Aug 29, 2023

> How do you know that it is only for a 24 GB card? So if we use the code they generated to show how to use it, we have to have a certain set of specs?

Because a 34B model won’t fit on a 16GB GPU. Quantised to 4-bit, however, it should just fit on a 24GB GPU.
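
The reasoning here can be sketched as a back-of-the-envelope calculation: parameter count times bits per weight, divided by 8, gives the bytes needed for the weights. This is only a rough sketch under stated assumptions. It uses the nominal 34 billion parameters from the model name, counts the weights alone, and ignores the KV cache, activations, and framework overhead, all of which need additional memory. The function name `weight_memory_gib` is made up for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: nominal parameter count taken from the "34B" in the model name.
N_PARAMS = 34e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights come to just under 16 GiB on their own, which leaves essentially no headroom on a 16GB card but does leave some on a 24GB one. Real quantised files can come out larger than this nominal figure, so treat it as a lower bound.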

corey4005

Aug 29, 2023

edited Aug 29, 2023

> > How do you know that it is only for a 24 GB card? So if we use the code they generated to show how to use it, we have to have a certain set of specs?
>
> Because a 34B model won’t fit on a 16GB GPU. Quantised to 4-bit, however, it should just fit on a 24GB GPU.

Thanks for the response. Is there anything I can read that will help me understand the math better? In other words, how do you know what fits and what does not? I appreciate any information you can pass along. I am assuming that when I build my next PC, I need to get a GPU that will be able to handle these models, like an RTX 4090?

smcleod

Aug 31, 2023

@corey4005 - so I was able to get v2 of this model (phind-codellama-34b-v2.Q4_K_M.gguf) running on my little Tesla P100 (16GB), but it's very slow (2.5-3 tokens/s).

Output generated in 40.89 seconds (2.69 tokens/s, 110 tokens, context 454, seed 403749230)
MEM[|||||||||||||||||15.560Gi/16.000Gi]
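
As a quick sanity check on the log line above, the reported rate can be recomputed from the token count and the elapsed time. This is a minimal sketch: the regular expression is written only for this one log format, and `parse_generation_log` and the 500-token answer length are hypothetical, chosen for illustration.

```python
import re

LOG_LINE = (
    "Output generated in 40.89 seconds "
    "(2.69 tokens/s, 110 tokens, context 454, seed 403749230)"
)


def parse_generation_log(line: str) -> dict:
    """Extract elapsed seconds, reported rate, and token counts from the log line."""
    pattern = (
        r"Output generated in ([\d.]+) seconds "
        r"\(([\d.]+) tokens/s, (\d+) tokens, context (\d+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("unrecognised log line")
    seconds, rate, tokens, context = match.groups()
    return {
        "seconds": float(seconds),
        "reported_rate": float(rate),
        "tokens": int(tokens),
        "context": int(context),
    }


stats = parse_generation_log(LOG_LINE)
computed_rate = stats["tokens"] / stats["seconds"]
print(f"computed rate: {computed_rate:.2f} tokens/s")

# Hypothetical: how long a 500-token answer would take at this rate.
ANSWER_TOKENS = 500
print(f"~{ANSWER_TOKENS / computed_rate:.0f} s for {ANSWER_TOKENS} tokens")
```

The computed rate matches the reported 2.69 tokens/s, so a longer answer at this speed would take a few minutes.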

V2 GGUF - https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF/blob/main/phind-codellama-34b-v2.Q4_K_M.gguf

Settings: