Optimizing Qwen Coder Models (1.5B & 3B) for Python and Edge Deployment
Subject: Optimizing Qwen Coder Models (1.5B & 3B) for Specialized Python Development and Edge Deployment
Dear Qwen Team,
First, thank you for your contributions to the open-source AI community with the Qwen models. The release of the 1.5B Coder model is a significant step. However, I believe there's a substantial opportunity to enhance its practical utility, particularly for specialized Python development and edge deployment scenarios, through focused optimization.
Key Concerns and Recommendations:
Vocabulary Size and Specialization: The current vocabulary size of 151,936 tokens is disproportionately large for a 1.5B or even a 3B parameter model. This expansive vocabulary, encompassing numerous languages and potentially non-essential coding constructs, dilutes the model's capacity for focused learning and efficient inference. I strongly advocate for developing specialized 1.5B and 3B Qwen Coder models with a significantly reduced vocabulary, concentrating primarily on:
Python and its most widely used libraries (e.g., NumPy, pandas, scikit-learn, TensorFlow, PyTorch).
Web development languages: HTML, CSS, JavaScript.
Essential scripting languages: Bash, PowerShell.
Windows-specific languages and APIs (relevant for Windows-centric Python development).
English language text.
By limiting the scope to these essential areas for US-based GenAI application development on Windows machines, we could let the model devote more of its limited parameter budget to learning nuanced patterns in these domains, which should make its code generation and assistance noticeably more reliable.
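The size argument can be made concrete with a back-of-the-envelope calculation: an embedding matrix holds vocab_size * hidden_size weights. The sketch below assumes hidden sizes of 1536 and 2048 and total parameter counts of about 1.54B and 3.09B for the 1.5B and 3B Coder models, and it assumes their input and output embeddings are tied (so the matrix is counted once). These figures should be double-checked against each model's config.json and model card.

```python
# Estimate how much of each model's parameter budget the token embedding uses.
# Assumptions (verify against each model's config.json / model card):
#   - hidden sizes of 1536 (1.5B) and 2048 (3B)
#   - approximate total parameter counts of 1.54B and 3.09B
#   - tied input/output embeddings, so the matrix is counted only once
VOCAB_SIZE = 151_936

MODELS = {
    "Qwen2.5-Coder-1.5B": (1536, 1.54e9),  # (hidden_size, total params)
    "Qwen2.5-Coder-3B": (2048, 3.09e9),
}


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def embedding_share(vocab_size: int, hidden_size: int, total_params: float) -> float:
    """Fraction of the model's total parameters held by the embedding matrix."""
    return embedding_params(vocab_size, hidden_size) / total_params


if __name__ == "__main__":
    for name, (hidden, total) in MODELS.items():
        emb = embedding_params(VOCAB_SIZE, hidden)
        share = embedding_share(VOCAB_SIZE, hidden, total)
        print(f"{name}: {emb / 1e6:.0f}M embedding parameters ({share:.0%} of total)")
```

Under these assumptions the script reports about 233M embedding parameters (roughly 15% of the total) for the 1.5B model and about 311M (roughly 10%) for the 3B model. For comparison, a hypothetical 32,000-token vocabulary at a hidden size of 1536 would need about 49M.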
Computational Efficiency and Edge Deployment: An excessively large vocabulary adds cost during both training and inference: at every generation step the output layer computes a score for each of the 151,936 tokens, and the embedding matrix occupies a substantial share of memory. For resource-constrained edge deployments (e.g., local machines without powerful GPUs), this overhead is a major barrier. Specialization and vocabulary reduction are crucial for truly compute-efficient local operation. A smaller, focused model is not a toy; it's a practical tool for developers working in specific domains.
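One way vocabulary reduction could be approached for an existing checkpoint is to tokenize a domain corpus (for example Python, HTML/CSS/JavaScript and shell scripts), keep the most frequently used token ids plus any mandatory ones such as special tokens, and slice the matching rows out of the embedding matrix. The sketch below is a hypothetical, framework-free illustration of that idea: `select_vocab`, `prune_embedding` and `remap_ids` are invented names, the embedding is a plain list of rows, and only the model side is shown. A matching tokenizer would also need its merge rules pruned so that it can never emit a dropped id, which is not covered here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def select_vocab(
    corpus_ids: Iterable[Sequence[int]],
    keep_ids: Iterable[int],
    max_size: int,
) -> List[int]:
    """Choose token ids to keep: mandatory ids first, then the most frequent ones.

    If keep_ids alone exceeds max_size, all mandatory ids are still kept.
    """
    counts = Counter(tok for seq in corpus_ids for tok in seq)
    kept = list(dict.fromkeys(keep_ids))  # deduplicate, preserve order
    kept_set = set(kept)
    for tok, _ in counts.most_common():  # highest frequency first
        if len(kept) >= max_size:
            break
        if tok not in kept_set:
            kept.append(tok)
            kept_set.add(tok)
    return sorted(kept)


def prune_embedding(
    embedding: Sequence[List[float]], kept_ids: Sequence[int]
) -> Tuple[List[List[float]], Dict[int, int]]:
    """Keep only the embedding rows for kept_ids and map old ids to new ids."""
    old_to_new = {old: new for new, old in enumerate(kept_ids)}
    pruned = [list(embedding[old]) for old in kept_ids]
    return pruned, old_to_new


def remap_ids(ids: Sequence[int], old_to_new: Dict[int, int]) -> List[int]:
    """Translate a sequence of old ids; raises KeyError if an id was pruned."""
    return [old_to_new[i] for i in ids]


if __name__ == "__main__":
    # Toy 10-token vocabulary with hidden size 4; row i is filled with float(i).
    embedding = [[float(i)] * 4 for i in range(10)]
    corpus = [[3, 5, 5, 7], [5, 3, 9]]  # pretend these are tokenized Python files
    kept = select_vocab(corpus, keep_ids=[0, 1], max_size=4)  # 0, 1: "special" ids
    pruned, old_to_new = prune_embedding(embedding, kept)
    print(kept)                              # [0, 1, 3, 5]
    print(remap_ids([5, 3, 5], old_to_new))  # [3, 2, 3]
```

Sorting the kept ids preserves the relative order of the surviving rows, which keeps the old-to-new mapping easy to inspect; any other fixed order would work as long as the tokenizer uses the same mapping.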
Simplified Inference and Tokenization Scripts: The current reliance on the Hugging Face transformers library introduces significant overhead (memory footprint, dependency complexity) that hinders lightweight deployment. I urge the development and release of streamlined, standalone Python scripts for inference and tokenization that:
Eliminate dependency on transformers.AutoModelForCausalLM and transformers.AutoTokenizer.
Offer a lightweight alternative for both CPU and GPU-based inference.
Provide a direct, customizable tokenizer implementation instead of relying on transformers.AutoTokenizer.
These changes would significantly lower the barrier to entry for developers who want to experiment with, fine-tune, and deploy Qwen Coder models in local development environments or edge applications.
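To illustrate how small such a standalone path can be, here is a dependency-free sketch of the two pieces requested above: a customizable tokenizer and a generation loop. Everything in it is hypothetical: `MiniBPE`, `greedy_generate` and the toy vocabulary are invented for illustration, the tokenizer works on characters rather than the byte-level BPE Qwen actually uses, and `next_token_logits` stands in for a real model forward pass (CPU or GPU), which is not shown.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class MiniBPE:
    """Toy character-level BPE tokenizer with a user-supplied merge table.

    Merges are applied greedily, lowest rank (earliest in the list) first.
    Real Qwen tokenization is byte-level BPE with regex pre-tokenization,
    which this sketch does not reproduce.
    """

    def __init__(self, vocab: Dict[str, int], merges: Sequence[Tuple[str, str]]):
        self.vocab = vocab
        self.id_to_token = {i: s for s, i in vocab.items()}
        self.ranks = {pair: rank for rank, pair in enumerate(merges)}

    def encode(self, text: str) -> List[int]:
        parts = list(text)  # start from single characters
        while len(parts) > 1:
            # Rank every adjacent pair; pairs without a merge rule get infinity.
            candidates = [
                (self.ranks.get((a, b), float("inf")), i)
                for i, (a, b) in enumerate(zip(parts, parts[1:]))
            ]
            rank, i = min(candidates)  # best rank, leftmost on ties
            if rank == float("inf"):
                break  # no applicable merge left
            parts[i : i + 2] = [parts[i] + parts[i + 1]]
        return [self.vocab[p] for p in parts]  # KeyError for out-of-vocab pieces

    def decode(self, ids: Sequence[int]) -> str:
        return "".join(self.id_to_token[i] for i in ids)


def greedy_generate(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Greedy decoding: append the argmax token until EOS or the budget runs out.

    Returns only the newly generated ids, including EOS if it was produced.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # hypothetical model forward pass
        next_id = max(range(len(logits)), key=lambda j: logits[j])
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]


if __name__ == "__main__":
    vocab = {"d": 0, "e": 1, "f": 2, "de": 3, "def": 4, "<eos>": 5}
    tok = MiniBPE(vocab, merges=[("d", "e"), ("de", "f")])

    # Stand-in "model": always predicts the id after the last one, modulo 6.
    def toy_logits(ids: List[int]) -> List[float]:
        target = (ids[-1] + 1) % len(vocab)
        return [1.0 if j == target else 0.0 for j in range(len(vocab))]

    prompt = tok.encode("de")
    new_ids = greedy_generate(toy_logits, prompt, max_new_tokens=8, eos_id=5)
    print(prompt, new_ids)  # [3] [4, 5]
```

Keeping the tokenizer and the decoding loop as plain functions means either piece can be replaced independently, for example by loading a model's own vocabulary and merge list, or by plugging a real forward pass into `next_token_logits`.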
Motivation and Impact:
By providing optimized, specialized models and simplified deployment tools, Qwen can significantly expand its user base and foster greater community involvement. Developers will be empowered to leverage Qwen models for practical, resource-efficient Python development on local machines, driving innovation and accelerating the adoption of on-device AI.
I believe these recommendations align with the growing demand for efficient and specialized AI models for real-world applications. I'm eager to see Qwen evolve in this direction and contribute to its success.
Thank you for considering these suggestions.
Sincerely,