Instructions to use Qwen/Qwen2.5-Coder-1.5B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen2.5-Coder-1.5B with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-1.5B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use Qwen/Qwen2.5-Coder-1.5B with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen2.5-Coder-1.5B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen2.5-Coder-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
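The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style chat completion JSON. The helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (only works while the server above is running):
# print(chat("http://localhost:8000", "Qwen/Qwen2.5-Coder-1.5B",
#            "What is the capital of France?"))
```

For the SGLang server described further down, the only change under these assumptions would be the port in `base_url`.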


SGLang

How to use Qwen/Qwen2.5-Coder-1.5B with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen2.5-Coder-1.5B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen2.5-Coder-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen2.5-Coder-1.5B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen2.5-Coder-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use Qwen/Qwen2.5-Coder-1.5B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen2.5-Coder-1.5B
```

Request Fork with Modifications for Python GenAI App Development on Microsoft OS

by MartialTerran - opened Nov 13, 2024

Discussion

MartialTerran, Nov 13, 2024 (edited Nov 13, 2024)

Thank you for releasing these models. However, at 1.5B or 3B parameters, the coder models should be more specialized and have a smaller token vocabulary. "vocab_size": 151936 is far too large for a 1.5B or 3B model.
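For a rough sense of scale, here is a back-of-the-envelope sketch of how many parameters the token embedding table takes. The `vocab_size` is the value quoted above. The hidden size of 1536 and the total of roughly 1.54B parameters are assumptions that are not stated in this post and should be checked against the model's `config.json` and model card.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one embedding table: one hidden_size vector per token."""
    return vocab_size * hidden_size


VOCAB_SIZE = 151_936          # quoted from the model config above
HIDDEN_SIZE = 1_536           # assumption: verify in config.json
TOTAL_PARAMS = 1_540_000_000  # assumption: approximate total parameter count

emb = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
share = emb / TOTAL_PARAMS

print(f"embedding parameters: {emb:,}")
print(f"share of all parameters: {share:.1%}")
```

This counts the table once, which assumes the input and output embeddings share weights. If they do not, the vocabulary-dependent share would be roughly double.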

I propose and request that you develop and train specialized 1.5B and 3B Qwen coder models that are experts at coding ONLY (Python and its libraries, plus HTML, JS, CSS, Bash, PowerShell, and Microsoft OS related languages and features), and that support only the English language. These limitations are so that even a tiny 1.5B or 3B model has a fair chance to provide reliable service for local Python/GenAI/Windows development. By unnecessarily adding support and tokens for extraneous coding languages or foreign languages that are not needed for US GenAI app development on MS Windows machines, you are CRIPPLING the model's potential for compute-efficient edge operation on local machines. Until the smaller models are optimized for specialization, they will be mere toys with no real coding potential. A token vocabulary of over 150,000 entries is excessive and inappropriate for the specialist 1.5B and 3B models. Tokens not needed to support the prescribed specialization should be excised from the specialized models (to reduce the compute needed in training and in inference, and to make more effective use of the limited 1.5B parameters).
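The excision described above could be sketched roughly as follows: count which token IDs actually occur in a tokenized target corpus, keep only those (plus any IDs that must be preserved, such as special tokens), and map the kept IDs onto a compact new range. This is only an illustration in plain Python with made-up IDs. `build_pruned_vocab` and `remap` are hypothetical helper names, and a real pruning job would also have to slice the model's embedding matrix and deal with the tokenizer's merge rules, which this does not attempt.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def build_pruned_vocab(
    corpus_ids: Iterable[List[int]],
    always_keep: Set[int],
    min_count: int = 1,
) -> Dict[int, int]:
    """Map each kept old token ID to a new, densely packed ID."""
    counts = Counter(tok for seq in corpus_ids for tok in seq)
    kept = {tok for tok, n in counts.items() if n >= min_count} | always_keep
    # Sorting keeps the relative order of the surviving IDs stable.
    return {old: new for new, old in enumerate(sorted(kept))}


def remap(seq: List[int], old_to_new: Dict[int, int], unk_id: int) -> List[int]:
    """Rewrite a sequence with new IDs; dropped tokens fall back to unk_id.

    unk_id is expressed in the new ID space.
    """
    return [old_to_new.get(tok, unk_id) for tok in seq]


# Toy "tokenized corpus" (made-up IDs, not real Qwen token IDs).
corpus = [[5, 9, 9, 120], [5, 77, 9]]
old_to_new = build_pruned_vocab(corpus, always_keep={0})

print(len(old_to_new), "tokens kept")
print(remap([5, 9, 4000], old_to_new, unk_id=old_to_new[0]))
```

The rows of the embedding table would then be selected in the same sorted order so that new ID `i` points at the vector of the old ID it replaced.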
Additionally, please provide a slimmed-down Python script to operate the Qwen models locally (training and inference), with or without GPUs: a script that does not invoke the cumbersome Hugging Face "transformers" library, which is very verbose and has a high memory overhead. The tokenizer script should also be standalone, and not based on calls to the inflexible "AutoTokenizer" from Hugging Face.

This line should not be required to train and run inference on these models locally: "from transformers import AutoModelForCausalLM, AutoTokenizer"
These changes will promote further public deployment, development, and fine-tuning of Qwen models.
