Instructions for using Qwen/Qwen2.5-Coder-32B-Instruct with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use Qwen/Qwen2.5-Coder-32B-Instruct with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-32B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Qwen/Qwen2.5-Coder-32B-Instruct with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Qwen/Qwen2.5-Coder-32B-Instruct"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
```shell
docker model run hf.co/Qwen/Qwen2.5-Coder-32B-Instruct
```
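The OpenAI-compatible endpoint started above can also be called from Python instead of curl. This is a minimal stdlib-only sketch that mirrors the curl request (it assumes the server is running on localhost:8000 as in the commands above; `build_chat_request` is a hypothetical helper, not part of any library):

```python
# Build (and optionally send) a request to the OpenAI-compatible
# /v1/chat/completions endpoint, mirroring the curl example above.
import json
import urllib.request

def build_chat_request(base_url, model, user_content):
    """Construct a POST request with the same JSON body as the curl call."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8000",
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    "What is the capital of France?",
)

# With the server running, send it and read the reply:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

The same sketch works for the SGLang server below by changing the port to 30000.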
- SGLang
How to use Qwen/Qwen2.5-Coder-32B-Instruct with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Qwen/Qwen2.5-Coder-32B-Instruct" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen2.5-Coder-32B-Instruct" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use Qwen/Qwen2.5-Coder-32B-Instruct with Docker Model Runner:
```shell
docker model run hf.co/Qwen/Qwen2.5-Coder-32B-Instruct
```
VSCODE + Cline + Ollama + Qwen2.5-Coder-32B-Instruct.Q8_0
Although the setup generally works smoothly, I am running into an issue that appears to be specifically model-related.
As a file is edited, Qwen deletes working sections of code that it has NO business deleting. I did not ask for anything that would require such deletions.
For example, I asked for the code to be modified so that the page or a list element is refreshed after a form submission. The model then deleted the handlers for the "delete" and "view" file buttons, leaving them inactive.
Is there anything I can put in the prompt to prevent code deletions that were not requested?
Thanks.
Hello @BigDeeper, I think you need to set Ollama's max context length higher (at least 8k imo if the repo is small, 16k and above if it is a huge repo). This happened to me before.
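One way to raise Ollama's context length, as suggested above, is a custom Modelfile. A sketch, assuming the Q8_0 quant was pulled under the tag `qwen2.5-coder:32b-instruct-q8_0` (substitute whatever tag you actually pulled; `num_ctx` is Ollama's context-length parameter):

```shell
# Write a Modelfile that derives from the pulled model and raises the
# context window from Ollama's default to 16384 tokens.
cat > Modelfile.qwen-16k <<'EOF'
FROM qwen2.5-coder:32b-instruct-q8_0
PARAMETER num_ctx 16384
EOF

# Then build and run the variant (requires ollama to be installed):
# ollama create qwen2.5-coder-16k -f Modelfile.qwen-16k
# ollama run qwen2.5-coder-16k
```

Note that a larger `num_ctx` increases KV-cache memory, which interacts with the GPU-layer trade-off discussed below.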
My context size is 19456. It is a trade-off between being able to put all 65 layers on the GPUs and the context size. It is NOT simply truncating files; it just fails to copy existing code sections into the new versions.
Maybe it is possible to modify the prompt to help the model deal with this, but nothing obvious comes to mind.
All my files are pretty small; the largest is 450 lines, and most are fewer than 200 lines.
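For scale, here is a rough back-of-the-envelope sketch of how much of a 19456-token window a 450-line file consumes. Both figures used are assumptions: ~10 tokens per line of code (varies by language and style), and a hypothetical 8000-token budget for the assistant's system prompt and tool schema:

```python
# Rough estimate of context headroom when editing the largest file.
TOKENS_PER_LINE = 10          # assumption: average tokens per code line
context_window = 19456        # context size mentioned above
system_prompt_budget = 8000   # hypothetical budget for system prompt + tools
largest_file_lines = 450

largest_file_tokens = largest_file_lines * TOKENS_PER_LINE
remaining = context_window - system_prompt_budget - largest_file_tokens
print(largest_file_tokens, remaining)  # 4500 6956
```

Even with small files, conversation history and multiple attached files can eat the remaining budget quickly, which is consistent with the context-length explanation above.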