Any chance of a 13B-20B version?
Is there any chance there will be a slightly smaller version, somewhere between 13B and 20B, that's likely to run on more common GPUs with 16GB of VRAM?
A lot of the decent coding models coming out seem to be focused on folks with 24GB+ cards.
How do you know that it is only for a 24 GB card? So if we use the code they generated to show how to use it, we have to have a certain set of specs?
Because a 34B model won't fit on a 16GB GPU. Quantised to 4-bit, however, it should just fit on a 24GB GPU.
Thanks for the response. Is there anything I can read that will help me understand the math better? In other words, how do you know what fits and what does not? I appreciate any information you can pass along. I am assuming that when I build my next PC, I need to get a GPU that will be able to handle these models, like an RTX 4090?
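A rough way to reason about "what fits" is that the weights alone take about (number of parameters) x (bits per weight) / 8 bytes. The sketch below applies that rule of thumb to a nominal 34B-parameter model. The function name is made up for illustration, and the result is a lower bound only: real 4-bit formats store somewhat more than 4 bits per weight, and the KV cache, activations, and runtime overhead come on top.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "34B" is treated as exactly 34e9 parameters.
N_PARAMS = 34e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{label:>5}: ~{gb:.0f} GB for weights")
```

This gives roughly 68 GB at fp16, 34 GB at 8-bit, and 17 GB at 4-bit, which is why a 4-bit 34B model is already over 16GB before any overhead but can just squeeze into 24GB.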
@corey4005 - so I was able to get the v2 (phind-codellama-34b-v2.Q4_K_M.gguf) of this model running on my little Tesla P100 (16GB), but it's very slow (2.5-3 tokens/s).
Output generated in 40.89 seconds (2.69 tokens/s, 110 tokens, context 454, seed 403749230)
MEM[|||||||||||||||||15.560Gi/16.000Gi]
Settings:
- llamacpp_hf
- gpu layers 33
- tokens 1024
- batch 512
Install it with cuBLAS. It massively bumps up speed on the GPU.
Does that let you split between CPU and GPU memory though, @johnwick123forevr?
Yes, you still split between CPU and GPU memory. More GPU layers means more GPU memory used.
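To get a first guess for the "gpu layers" setting, one option is to assume the model file is spread evenly across its layers and see how many fit in VRAM after leaving some headroom. This is only a sketch: the function is hypothetical, and the 20 GB file size, 48-layer count, and 1.5 GB reserve below are assumptions for illustration, not measured values. Layers that don't fit stay in CPU memory.

```python
def estimate_gpu_layers(model_file_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumes every layer takes an equal share of the model file and that
    reserve_gb of VRAM is kept free for the KV cache and other overhead.
    """
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never offload more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


# Assumed values: ~20 GB quantised file, 48 layers, 16 GB card.
print(estimate_gpu_layers(model_file_gb=20.0, n_layers=48, vram_gb=16.0))
```

Treat the result as a starting point and adjust it up or down while watching actual GPU memory usage, since context length and batch size change how much headroom is needed.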