torch.bfloat16 is not supported for quantization method awq
Hey, I tried the vLLM example in the model card (just copied and pasted it) and I'm running into this error:
ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]
Is there a fix to be able to use the AWQ model with vLLM instead of AutoAWQ?
What version of vLLM are you using? I had thought that the latest release supported bfloat16 with AWQ. Version 0.2.0, the first with AWQ support, definitely did not, but I thought support came later.
Either way, you should specify dtype="auto" in either Python code or as a command line parameter. That will load it in bfloat16 if it can, otherwise float16.
This README hasn't been updated in a while - my newer README template includes the dtype="auto" parameter in the examples.
All my AWQ READMEs are going to be updated later today anyway when I update for Transformers AWQ support, so that will get changed then.
I'm using version 0.2.1.post1; I did a reinstall of it too just in case something got messed up during installation and the issue with bfloat16 still persisted.
I'll definitely specify the dtype in my Python code! :)
Thank you so much for your help, you're a legend. <3
Hi, you can apply the following workaround: edit config.json and change
"torch_dtype": "bfloat16" --> "torch_dtype": "float16",
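For anyone who prefers to script that edit, here is a minimal sketch using only the Python standard library. The function name `force_float16` and the path you pass to it are hypothetical. It assumes you have a local, writable copy of the model's config.json (for example inside your Hugging Face cache or a cloned repo).

```python
import json
from pathlib import Path


def force_float16(config_path):
    """Rewrite "torch_dtype" in a local config.json to "float16".

    Returns the previous value so the change can be checked or undone.
    All other keys in the file are preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("torch_dtype")
    config["torch_dtype"] = "float16"
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous


# Example call (the path is a placeholder, adjust it to your local copy):
# force_float16("/path/to/Phind-CodeLlama-34B-v2-AWQ/config.json")
```

Note that this modifies the file in place, so a later re-download of the model may restore the original value.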
Yeah but it's easier just to pass --dtype auto or dtype="auto"
For me, specifying auto didn't work; I still got the same error. But specifying dtype="float16" did work.
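To make the float16 fix concrete, here is a rough sketch of what the Python side might look like with vLLM's offline `LLM` class. It is untested against 0.2.1.post1 specifically and assumes a GPU with enough memory for the 34B AWQ weights. The `ENGINE_KWARGS` dict and the `generate` helper are names made up for this example, and the sampling values are arbitrary.

```python
MODEL_ID = "TheBloke/Phind-CodeLlama-34B-v2-AWQ"

# Engine arguments. dtype is pinned to float16 because the error in this
# thread lists torch.float16 as the only dtype supported for AWQ.
ENGINE_KWARGS = {
    "model": MODEL_ID,
    "quantization": "awq",
    "dtype": "float16",
}


def generate(prompt, max_tokens=256):
    """Load the model with vLLM and return one completion for `prompt`."""
    # Imported here so the settings above can be inspected without vLLM
    # or a GPU being available.
    from vllm import LLM, SamplingParams

    llm = LLM(**ENGINE_KWARGS)
    params = SamplingParams(temperature=0.5, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


# Example call (downloads the model and needs a suitable GPU):
# print(generate("Write a Python function that reverses a string."))
```

If you run the server instead, the equivalent should be passing --dtype float16 rather than --dtype auto on the command line.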