Input_Id's issue
I'm getting an odd error and I'm not sure why. It may have more to do with the model and how I'm using device_map than with the input_ids themselves. The output also says that the attention mask and the pad token id aren't set, but the example of how to run the script doesn't mention either of them, and the console message doesn't say where the problem comes from, so there isn't much to go on. This is the full output:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1535: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cuda') before running .generate().
warnings.warn(
Traceback (most recent call last):
File "/usr/local/llamaengineer.py", line 498, in <module>
generated_text = generate(prompt)
File "/usr/local/llamaengineer.py", line 488, in generate
generate_ids = model.generate(inputs.input_ids.to("cpu"), max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1648, in generate
return self.sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2730, in sample
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 333, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 184, in apply_rotary_pos_emb
cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
That was the error I got when I ran it like this:
generate_ids = model.generate(inputs.input_ids.to("cpu"), max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)
I only tried that because an almost identical issue occurred when I ran it with input_ids.to("cuda"). The difference is that, instead of the warning about input_ids being on a different device than the model, I just got this message:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Any help would be greatly appreciated. For reference, here is the important part of the script I'm running:
model_path = "Phind/Phind-CodeLlama-34B-v2"
model = LlamaForCausalLM.from_pretrained(model_path, quantization_config=bnb_config, device_map=device_map)
tokenizer = AutoTokenizer.from_pretrained(model_path)
def generate(prompt: str):
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
    # Generate
    generate_ids = model.generate(inputs.input_ids.to("cuda"), max_new_tokens=384, do_sample=True, top_p=0.75, top_k=40, temperature=0.1)
    completion = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    completion = completion.replace(prompt, "").split("\n\n\n")[0]
    # Print the completion to the console
    print("Generated Completion:")
    print(completion)
    return completion
prompt = "Please write a small script that prints the numbers 1-10 in the console"
generated_text = generate(prompt)
It turns out I was able to solve my problem, and the model is now working. I'm very excited to see it in action. If anyone's interested, I put the script up on my GitHub, mainly because it uses a technique that lets me run this model on my limited GPU, which I think is pretty cool: Shikamaru5/LlamaEngineer
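The post above doesn't say what the actual fix was, so the following is only a sketch of one common way to address both messages, not necessarily what ended up in the linked repository. It assumes `model` and `tokenizer` are loaded as in the script above and that `model.device` reports the device holding the first layers, which may not be true for every `device_map`. The function name `generate_completion` is made up for this example. The idea is to move the whole tokenizer output to the model's device and to pass `attention_mask` and `pad_token_id` to `generate()` explicitly.

```python
def generate_completion(model, tokenizer, prompt, max_new_tokens=384):
    """Generate text while keeping inputs on the model's device.

    Hypothetical helper: `model` and `tokenizer` are assumed to be a
    Transformers causal LM and its tokenizer, loaded elsewhere.
    """
    # Llama-style tokenizers often have no pad token; reuse EOS if so.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)

    # Move input_ids and attention_mask together, instead of hard-coding
    # "cpu" or "cuda" for input_ids alone.
    inputs = inputs.to(model.device)

    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # addresses the attention-mask warning
        pad_token_id=tokenizer.pad_token_id,      # addresses the pad-token notice
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.75,
        top_k=40,
        temperature=0.1,
    )

    # Drop the prompt tokens by position rather than by string replacement.
    prompt_len = len(inputs["input_ids"][0])
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)


# Usage, with model and tokenizer loaded as in the original script:
# print(generate_completion(model, tokenizer, "Please write a small script that prints the numbers 1-10 in the console"))
```

Slicing off the prompt by token count avoids the case where `completion.replace(prompt, "")` fails to match because decoding changed the whitespace of the prompt.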