Request Fork with Modifications for Python GenAI App Development on Microsoft OS
Thank you for releasing these models. However, at 1.5B or 3B parameters, the coder models should be more specialized and have a smaller token vocabulary. "vocab_size": 151936 is far too large for a 1.5B or 3B model.
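To put a rough number on that concern, here is a back-of-the-envelope sketch of how many parameters the token-embedding table takes up. Only the vocab_size comes from the config quoted above; the hidden size of 1536, the tied input/output embeddings, and the nominal 1.5e9 total are my assumptions and should be checked against the model's config.json.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token-embedding table (doubled if the output head is untied)."""
    tables = 1 if tied else 2
    return vocab_size * hidden_size * tables


VOCAB_SIZE = 151_936    # "vocab_size" as quoted from the model config
HIDDEN_SIZE = 1_536     # assumption: hidden width of the 1.5B model
NOMINAL_PARAMS = 1.5e9  # assumption: nominal size taken from the model name

emb = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
share = emb / NOMINAL_PARAMS

print(f"embedding parameters: {emb:,}")
print(f"share of nominal parameter count: {share:.1%}")
```

Under these assumptions the embedding table alone accounts for a noticeable fraction of the parameter budget, which is the cost a smaller vocabulary would reduce.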
I propose and request that you develop and train specialized 1.5B and 3B Qwen coder models that are experts at coding ONLY (Python and its libraries, plus HTML, JS, CSS, Bash, PowerShell, and Microsoft OS-related languages and features), and in the English language only. These limitations are there so that even a tiny 1.5B or 3B model has a fair chance to provide reliable service for local Python/GenAI/Windows development. By unnecessarily adding support and tokens for extraneous coding languages or extraneous foreign languages that are not needed for US GenAI app development on MS Windows machines, you are CRIPPLING the model's potential for compute-efficient edge operation on local machines. Until the smaller models are optimized for specialization, they will be mere toys with no real coding potential. A token vocabulary of over 150,000 entries is excessive and inappropriate for the specialist 1.5B and 3B models. Extraneous tokens not needed to support the prescribed specialization should be excised from the specialized models (to reduce the compute needed in training and inference, and to make more effective use of the limited 1.5B parameters).
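As an illustration of what "excising" tokens could look like mechanically, here is a minimal sketch that keeps only the token ids seen in a target corpus, renumbers them, and drops the unused embedding rows. The function names and the toy data are hypothetical, and this is not a procedure the Qwen team has described. A real BPE tokenizer would also need its merge rules rebuilt, and the pruned model would most likely need further fine-tuning.

```python
def build_pruned_vocab(token_ids_in_corpus, always_keep=()):
    """Return (old_to_new, kept_old_ids) covering only the ids actually used.

    always_keep is for ids that must survive even if unseen (e.g. special tokens).
    """
    used = set(token_ids_in_corpus) | set(always_keep)
    kept_old_ids = sorted(used)
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}
    return old_to_new, kept_old_ids


def prune_embedding_rows(embedding, kept_old_ids):
    """Keep only the embedding rows for the surviving token ids, in new-id order."""
    return [embedding[i] for i in kept_old_ids]


# Toy example: a 6-token vocabulary with 2-dimensional embeddings.
embedding = [[float(i), float(i) * 10] for i in range(6)]
corpus_ids = [0, 2, 2, 5]  # ids observed when tokenizing the target corpus

old_to_new, kept = build_pruned_vocab(corpus_ids, always_keep=(1,))
pruned = prune_embedding_rows(embedding, kept)
remapped_corpus = [old_to_new[i] for i in corpus_ids]

print(kept)             # surviving old ids
print(remapped_corpus)  # the same corpus expressed in the new, smaller id space
print(len(pruned))      # rows left in the embedding table
```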
Additionally, please provide a slimmed-down Python script to operate the Qwen models locally (training and inference), with or without GPUs: that is, a script that does not invoke the cumbersome Hugging Face "transformers" library, which is very verbose and has a high memory overhead. Also, the tokenizer script should be standalone, not based on calls to the inflexible "AutoTokenizer" from Hugging Face.
This line should not be required to train these models locally or run inference on them: "from transformers import AutoModelForCausalLM, AutoTokenizer"
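As a small step in that direction, the sketch below reads a tokenizer's vocabulary with the Python standard library only. It assumes a tokenizer.json file in the Hugging Face `tokenizers` layout (a "model" object with a "vocab" mapping, plus an optional "added_tokens" list); the function name and the toy file are hypothetical. It only loads the vocabulary and is not a full tokenizer, since pre-tokenization and BPE merging are not implemented.

```python
import json
import tempfile
from pathlib import Path


def load_bpe_vocab(tokenizer_json_path):
    """Load token -> id from a tokenizer.json file without importing transformers."""
    data = json.loads(Path(tokenizer_json_path).read_text(encoding="utf-8"))
    vocab = dict(data["model"]["vocab"])
    # Special tokens are stored separately from the base BPE vocabulary.
    for entry in data.get("added_tokens", []):
        vocab[entry["content"]] = entry["id"]
    return vocab


# Toy file standing in for a real tokenizer.json (assumed layout, made-up contents).
toy = {
    "model": {"type": "BPE", "vocab": {"a": 0, "b": 1, "ab": 2}, "merges": ["a b"]},
    "added_tokens": [{"id": 3, "content": "<|endoftext|>"}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer.json"
    path.write_text(json.dumps(toy), encoding="utf-8")
    vocab = load_bpe_vocab(path)

print(len(vocab))
```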
These changes will promote further public deployment, development, and fine-tuning of Qwen models.