Instructions to use latimar/Phind-Codellama-34B-v2-exl2 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use latimar/Phind-Codellama-34B-v2-exl2 with Transformers:
```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="latimar/Phind-Codellama-34B-v2-exl2")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("latimar/Phind-Codellama-34B-v2-exl2")
model = AutoModelForCausalLM.from_pretrained("latimar/Phind-Codellama-34B-v2-exl2")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use latimar/Phind-Codellama-34B-v2-exl2 with vLLM:
Install from pip and serve the model:
```
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "latimar/Phind-Codellama-34B-v2-exl2"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "latimar/Phind-Codellama-34B-v2-exl2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```
docker model run hf.co/latimar/Phind-Codellama-34B-v2-exl2
```
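Because the server exposes an OpenAI-compatible API, it can also be called from Python with the official openai client instead of curl. A minimal sketch, assuming the vLLM server above is listening on localhost:8000 (for the SGLang server below, only the base URL changes to port 30000); the prompt and sampling values are illustrative:
```
# Minimal sketch of calling the local OpenAI-compatible server (illustrative values).
from openai import OpenAI

# A local server does not check the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="latimar/Phind-Codellama-34B-v2-exl2",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```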
- SGLang
How to use latimar/Phind-Codellama-34B-v2-exl2 with SGLang:
Install from pip and serve the model:
```
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "latimar/Phind-Codellama-34B-v2-exl2" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "latimar/Phind-Codellama-34B-v2-exl2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "latimar/Phind-Codellama-34B-v2-exl2" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "latimar/Phind-Codellama-34B-v2-exl2",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use latimar/Phind-Codellama-34B-v2-exl2 with Docker Model Runner:
```
docker model run hf.co/latimar/Phind-Codellama-34B-v2-exl2
```
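The repository itself holds ExLlamaV2 (exl2) quants, so another common local option is the exllamav2 library, which is how the quant variants discussed below are typically run. The sketch follows the library's example scripts and should be read as an assumption rather than the exact supported path: the local directory (a downloaded quant branch), the prompt, and the sampling settings are illustrative, and API details may differ between exllamav2 versions.
```
# Sketch: load an exl2 quant of this model with exllamav2 (API per the library's examples;
# may vary by version). model_dir is a hypothetical local copy of one quant branch.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "./Phind-Codellama-34B-v2-exl2-5_0-bpw-h8"  # hypothetical local path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocate the KV cache as weights are loaded
model.load_autosplit(cache)               # split layers automatically across available VRAM

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.5

prompt = "Write a Python function that checks whether a number is prime."
print(generator.generate_simple(prompt, settings, 512))
```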
Thanks!
This is the first time I have been able to load a 34B model on my budget 3060! With 12 GB of VRAM, the 2.55-bit variant mostly loads on my GPU, with a little spilling over to the CPU at 2048 context.
Indeed, this is the best AI model I've used so far that also fits on a single 3090. I'm using the 5_0-bpw-h8-evol-ins variant. Thanks from me too.
From what I've seen, I think the quality of a 2.55-bit 34B model exceeds that of comparable 6-bit or 8-bit 13B models, but that's just my own subjective opinion. 34B models like this one are usable at 2 bpw, but the replies take a while, so it's probably not the sweet spot for 12 GB of VRAM. It's fun to use on occasion, though, because of the higher-quality responses.
For the most part, I'm using 4 bpw 13B models for 4k context, 4.65 bpw 13B models for 3k context, and 3 bpw 20B models for ~2k context.
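As a rough sanity check on why the 2.55 bpw quant just about fits in 12 GB: the weights of a ~34B-parameter model at 2.55 bits per weight come to roughly 10 GiB before the KV cache and other overhead, which matches the small spillover to CPU reported above. A back-of-the-envelope sketch (the parameter count is approximate):
```
# Back-of-the-envelope VRAM estimate for the weights alone (ignores KV cache and overhead).
params = 34e9   # ~34B parameters (approximate)
bpw = 2.55      # bits per weight of the quant
weight_gib = params * bpw / 8 / 1024**3
print(f"~{weight_gib:.1f} GiB of weights at {bpw} bpw")  # ~10.1 GiB
```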
@latimar, can you or someone else explain why the perplexity scores are worse on the "5_0-bpw-h8-evol-ins" model versus the "5_0-bpw-h8" model?
I would assume fine-tuning the model would improve the scores?
Also, in my personal, non-scientific test, I gave both LLMs a coding challenge, and the "5_0-bpw-h8-evol-ins" model gave a better response than the "5_0-bpw-h8" model. So anecdotally, "5_0-bpw-h8-evol-ins" is the better-performing model for me, despite the worse PPL score.
@Hisma 5_0-bpw-h8-evol-ins was converted using a different calibration dataset: evol-instruct instead of wikitext. It has a worse ppl score on wikitext, yes, but its coding abilities are actually better than those of 5_0-bpw-h8. A better metric for comparing different quants would be the HumanEval score, or at least the ppl score on the evol-instruct dataset.
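For anyone curious how the calibration dataset enters the picture: exl2 quants are produced with ExLlamaV2's convert.py, which takes the calibration data as an argument during measurement and quantization. A rough sketch is below; the flag names follow the converter's interface from memory, and the paths, bitrate, and calibration file are illustrative assumptions, not the exact settings used for these quants:
```
# Sketch of an ExLlamaV2 quantization run with a custom calibration set.
# Paths, bitrate, and the calibration file are illustrative, not the actual ones used here.
python convert.py \
  -i ./Phind-Codellama-34B-v2 \
  -o ./work_dir \
  -cf ./Phind-Codellama-34B-v2-5.0bpw-h8 \
  -b 5.0 \
  -hb 8 \
  -c ./evol-instruct-calibration.parquet
```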
Got it, thank you. It would have been useful to include the HumanEval scores with these models too, like you did with your supercoder models. But regardless, I can definitely confirm noticeably better coding performance with 5_0-bpw-h8-evol-ins, so based on what you're saying this all makes sense. Thank you for explaining!