torch.bfloat16 is not supported for quantization method awq

by Pizzarino - opened Nov 2, 2023

Discussion

Pizzarino

Nov 2, 2023

Hey, I tried the vLLM example in the model card (just copied and pasted it) and I'm running into this error:

ValueError: torch.bfloat16 is not supported for quantization method awq. Supported dtypes: [torch.float16]

Is there a fix to be able to use the AWQ model with vLLM instead of AutoAWQ?

TheBloke

Owner Nov 2, 2023

What version of vLLM are you using? I had thought that the latest version supported bfloat16 with AWQ. Version 0.2.0, the first with AWQ support, definitely did not, but I thought bfloat16 support came later.

Either way, you should specify dtype="auto" in either Python code or as a command line parameter. That will load it in bfloat16 if it can, otherwise float16.
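
The fallback described above can also be made explicit in user code. The sketch below is a hypothetical helper, not vLLM's own logic: it reads the `torch_dtype` field from a local `config.json` and returns `float16` whenever the configured dtype is not in the supported list. The function name `choose_dtype` and the constant `AWQ_SUPPORTED_DTYPES` are illustrative assumptions; the supported list is taken from the error message quoted in this thread.

```python
import json
from pathlib import Path

# Assumption: copied from the error message in this thread, which lists
# torch.float16 as the only dtype supported for the awq quantization method.
AWQ_SUPPORTED_DTYPES = ("float16",)


def choose_dtype(config_path, supported=AWQ_SUPPORTED_DTYPES, fallback="float16"):
    """Return the checkpoint's torch_dtype if it is supported, else the fallback."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    configured = config.get("torch_dtype")
    if configured in supported:
        return configured
    return fallback
```

The returned string could then be passed as the `dtype` argument instead of relying on `"auto"`.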

This README hasn't been updated in a while - my newer README template includes the dtype="auto" parameter in the examples.

All my AWQ READMEs are going to be updated later today anyway when I update for Transformers AWQ support, so that will get changed then.

Pizzarino

Nov 3, 2023

I'm using version 0.2.1.post1. I also reinstalled it in case something got messed up during installation, but the bfloat16 issue persisted.

I'll definitely specify the dtype in my Python code! :)

Thank you so much for your help, you're a legend. <3

ikaro79

Nov 10, 2023

Hi, you can apply the following workaround: edit config.json and change
"torch_dtype": "bfloat16" --> "torch_dtype": "float16"

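For a locally downloaded copy of the model, the config.json edit suggested above could be scripted roughly as follows. This is a minimal sketch using only the standard library; the function name `force_float16` and the path you pass in are assumptions, and the script only touches the `torch_dtype` field.

```python
import json
from pathlib import Path


def force_float16(config_path):
    """Rewrite torch_dtype from bfloat16 to float16; return True if the file changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("torch_dtype") != "bfloat16":
        return False  # already float16 (or unset), nothing to do
    config["torch_dtype"] = "float16"
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

Calling `force_float16("path/to/model/config.json")` (a placeholder path) returns `True` when it changed the file and `False` when no change was needed.
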
TheBloke

Owner Nov 10, 2023

Yeah, but it's easier just to pass --dtype auto or dtype="auto".

romant319

Nov 28, 2023

For me, specifying auto didn't work; I still got the same error. But specifying dtype="float16" did work.
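
As an illustration of the explicit setting reported to work here, the following sketch passes `dtype="float16"` to vLLM's Python `LLM` class. The helper names `awq_engine_kwargs` and `generate` are hypothetical, the sampling values are arbitrary, and actually running `generate` assumes vLLM is installed on a machine with a suitable GPU.

```python
MODEL_ID = "TheBloke/Phind-CodeLlama-34B-v2-AWQ"


def awq_engine_kwargs(dtype="float16"):
    """Keyword arguments for vllm.LLM with an explicit dtype instead of "auto"."""
    return {"model": MODEL_ID, "quantization": "awq", "dtype": dtype}


def generate(prompt, max_tokens=128):
    """Load the AWQ model in float16 and return one completion for the prompt."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**awq_engine_kwargs())
    params = SamplingParams(temperature=0.5, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

On the command line, the corresponding setting would be `--dtype float16`.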
