Instructions to use Qwen/Qwen2.5-Coder-1.5B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Qwen/Qwen2.5-Coder-1.5B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-1.5B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Qwen/Qwen2.5-Coder-1.5B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen2.5-Coder-1.5B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen2.5-Coder-1.5B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
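
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on `http://localhost:8000` and returns an OpenAI-style response body. The helper names `build_chat_request` and `send_chat_request` are hypothetical and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Qwen/Qwen2.5-Coder-1.5B",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first choice.

    Assumes an OpenAI-compatible response shape:
    {"choices": [{"message": {"content": ...}}]}.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `send_chat_request(build_chat_request("What is the capital of France?"))` would return the model's reply as a string.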

Use Docker

docker model run hf.co/Qwen/Qwen2.5-Coder-1.5B

SGLang

How to use Qwen/Qwen2.5-Coder-1.5B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen2.5-Coder-1.5B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen2.5-Coder-1.5B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen2.5-Coder-1.5B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen2.5-Coder-1.5B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen2.5-Coder-1.5B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen2.5-Coder-1.5B
```

Optimizing Qwen Coder Models (1.5B & 3B) for Python and Edge Deployment

by MartialTerran - opened Nov 22, 2024

Subject: Optimizing Qwen Coder Models (1.5B & 3B) for Specialized Python Development and Edge Deployment

Dear Qwen Team,

First, thank you for your contributions to the open-source AI community with the Qwen models. The release of the 1.5B Coder model is a significant step. However, I believe there's a substantial opportunity to enhance its practical utility, particularly for specialized Python development and edge deployment scenarios, through focused optimization.

Key Concerns and Recommendations:

Vocabulary Size and Specialization: The current vocabulary size of 151,936 tokens is disproportionately large for a 1.5B or even a 3B parameter model. This expansive vocabulary, encompassing numerous languages and potentially non-essential coding constructs, dilutes the model's capacity for focused learning and efficient inference. I strongly advocate for developing specialized 1.5B and 3B Qwen Coder models with a significantly reduced vocabulary, concentrating primarily on:

Python and its core libraries (e.g., NumPy, Pandas, Scikit-learn, TensorFlow, PyTorch).

Web development languages: HTML, CSS, JavaScript.

Essential scripting languages: Bash, PowerShell.

Microsoft OS-specific languages and APIs (relevant for Windows-centric Python development).

English language text.

By limiting the scope to these essential areas for US-based GenAI application development on Windows machines, we can dramatically improve the model's ability to learn nuanced patterns and provide reliable code generation and assistance within its parameter budget.
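
To put a rough number on the vocabulary concern, the sketch below estimates how many parameters a single token-embedding matrix takes at the stated vocabulary size. Only the 151,936 figure comes from this post. The hidden size (1536), the total parameter count (about 1.54 billion), the assumption that input and output embeddings share one matrix, and the 32,000-token reduced vocabulary are all assumptions for illustration and should be checked against the model's published configuration.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one token-embedding matrix (vocab_size x hidden_size)."""
    return vocab_size * hidden_size


VOCAB_SIZE = 151_936      # from the post
HIDDEN_SIZE = 1536        # assumption: hidden size of the 1.5B model
TOTAL_PARAMS = 1.54e9     # assumption: approximate total parameter count
REDUCED_VOCAB = 32_000    # hypothetical smaller vocabulary

current = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
reduced = embedding_params(REDUCED_VOCAB, HIDDEN_SIZE)

print(f"embedding parameters now:     {current:,}")
print(f"share of total (approx.):     {current / TOTAL_PARAMS:.1%}")
print(f"embedding parameters reduced: {reduced:,}")
print(f"parameters freed:             {current - reduced:,}")
```

Under these assumptions the freed parameters could either be dropped, shrinking the model, or reallocated to the transformer layers at the same overall size.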

Computational Efficiency and Edge Deployment: An excessively large vocabulary directly reduces computational efficiency during both training and inference, because the embedding table and the output projection both scale with vocabulary size. For resource-constrained edge deployments (e.g., local machines without powerful GPUs), this inefficiency is a major barrier. Specialization and vocabulary reduction are crucial for enabling truly compute-efficient local operation. A smaller, focused model is not a toy; it's a practical tool for developers working in specific domains.
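
The compute side of this argument can be estimated in the same back-of-the-envelope way. The sketch below approximates the per-token cost of the output projection (the matrix multiply that produces one logit per vocabulary entry) and compares it to the rest of the forward pass using the common "2 FLOPs per parameter per token" rule of thumb. The hidden size (1536), the non-embedding parameter count (about 1.31 billion), and the 32,000-token reduced vocabulary are assumptions, and the estimate ignores attention cost over long contexts.

```python
def lm_head_flops_per_token(vocab_size: int, hidden_size: int) -> int:
    """Approximate FLOPs for the output projection: one multiply and one add per weight."""
    return 2 * vocab_size * hidden_size


def lm_head_share(vocab_size: int, hidden_size: int, non_embedding_params: float) -> float:
    """Rough fraction of per-token forward FLOPs spent in the output projection.

    The rest of the network is approximated as 2 FLOPs per non-embedding parameter.
    """
    head = lm_head_flops_per_token(vocab_size, hidden_size)
    body = 2 * non_embedding_params
    return head / (head + body)


HIDDEN_SIZE = 1536            # assumption
NON_EMBEDDING_PARAMS = 1.31e9  # assumption

for label, vocab in [("current", 151_936), ("hypothetical reduced", 32_000)]:
    share = lm_head_share(vocab, HIDDEN_SIZE, NON_EMBEDDING_PARAMS)
    print(f"{label:>22}: vocab={vocab:,}, output projection share = {share:.1%}")
```

This is only an order-of-magnitude view: the actual speedup on a given machine also depends on memory bandwidth, quantization, and batch size.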

Simplified Inference and Tokenization Scripts: The current reliance on the Hugging Face transformers library introduces significant overhead (memory footprint, dependency complexity) that hinders lightweight deployment. I urge the development and release of streamlined, standalone Python scripts for inference and tokenization that:

Eliminate dependency on transformers.AutoModelForCausalLM and transformers.AutoTokenizer.

Offer a lightweight alternative for both CPU and GPU-based inference.

Provide a direct, customizable tokenizer implementation instead of relying on transformers.AutoTokenizer.

These changes would significantly lower the barrier to entry for developers who want to experiment with, fine-tune, and deploy Qwen Coder models in local development environments or edge applications.
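
As an illustration of what a direct, customizable tokenizer could look like, here is a toy byte-pair-encoding (BPE) merge loop in plain Python with no third-party dependencies. It is a sketch only: the merge table is made up for the example, and it does not reproduce Qwen's actual tokenizer, which additionally involves byte-level encoding, pre-tokenization rules, special tokens, and a large merge list shipped with the model files.

```python
def bpe_encode(word: str, merge_ranks: dict[tuple[str, str], int]) -> list[str]:
    """Split `word` into BPE tokens by repeatedly applying the best-ranked merge.

    `merge_ranks` maps a pair of adjacent symbols to its priority
    (lower rank is merged first).
    """
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank; unknown pairs rank last.
        best = min(pairs, key=lambda pair: merge_ranks.get(pair, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merge remains
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, for illustration only.
TOY_MERGES = {("d", "e"): 0, ("de", "f"): 1, ("i", "n"): 2}

print(bpe_encode("def", TOY_MERGES))
print(bpe_encode("defin", TOY_MERGES))
```

A standalone script along these lines would load the real merge table and vocabulary from the model's tokenizer files and map each resulting token to its integer ID before inference.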

Motivation and Impact:

By providing optimized, specialized models and simplified deployment tools, Qwen can significantly expand its user base and foster greater community involvement. Developers will be empowered to leverage Qwen models for practical, resource-efficient Python development on local machines, driving innovation and accelerating the adoption of on-device AI.

I believe these recommendations align with the growing demand for efficient and specialized AI models for real-world applications. I'm eager to see Qwen evolve in this direction and to contribute to its success.

Thank you for considering these suggestions.

Sincerely,
