Instructions to use microsoft/Phi-3-mini-4k-instruct with libraries, inference providers, notebooks, and local apps.

Libraries

How to use microsoft/Phi-3-mini-4k-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use microsoft/Phi-3-mini-4k-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-3-mini-4k-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-mini-4k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
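
The same request can also be issued from Python. The sketch below is not part of the vLLM instructions; it only mirrors the curl call above using the standard library. The helper names `build_chat_request` and `send` are made up for illustration, and it assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build the same POST request as the curl call (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the parsed JSON body (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request(
    "http://localhost:8000",
    "microsoft/Phi-3-mini-4k-instruct",
    "What is the capital of France?",
)
# reply = send(request)  # uncomment once the server is up
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.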

Use Docker

docker model run hf.co/microsoft/Phi-3-mini-4k-instruct

SGLang

How to use microsoft/Phi-3-mini-4k-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-3-mini-4k-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-mini-4k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-3-mini-4k-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-mini-4k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use microsoft/Phi-3-mini-4k-instruct with Docker Model Runner:
```
docker model run hf.co/microsoft/Phi-3-mini-4k-instruct
```

Model doesn't seem to tokenize new lines in chat template?

#84

by bartowski - opened Jul 4, 2024

Discussion

bartowski

Jul 4, 2024

Noticed when using transformers to extract a chat template that it prints out without any newlines, replacing them with spaces instead. Any idea why?

This only seems to apply when using the built-in tags, so it seems to tokenize <|system|>\n as just '<|system|>'.

If I typo it and make it <|systemA|>\n, it tokenizes as '<|systemA|>\n' properly.

bartowski changed discussion title from Model doesn't seem to tokenize new lines in system prompt? to Model doesn't seem to tokenize new lines in chat template? Jul 4, 2024

bartowski

Jul 4, 2024

Actually, I just noticed why: it's because <|user|>, <|end|>, <|system|>, and <|assistant|> all have rstrip = true in tokenizer_config.json. If I take that out, it properly puts in the new lines. Which is correct? The chat template seems to imply there should be new lines, as do the examples on your card.
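
To make the effect described above concrete, here is a toy model of what an `rstrip` flag on a special token does: whitespace directly to the right of the token is swallowed when the token is matched. This is only a pure-Python sketch and not the real tokenizer code; the function `split_special` and the flag tables are hypothetical and set by hand for the demo, not read from the actual tokenizer_config.json.

```python
import re

# The four tags named in the post above, with rstrip switched on or off by hand.
RSTRIP_ON = {"<|system|>": True, "<|user|>": True, "<|assistant|>": True, "<|end|>": True}
RSTRIP_OFF = {tag: False for tag in RSTRIP_ON}


def split_special(text, rstrip_flags):
    """Split text into special tokens and plain-text pieces (toy model).

    If a matched token has rstrip=True, any whitespace immediately after it
    is skipped, so it never shows up in the output pieces.
    """
    token_re = re.compile("|".join(re.escape(tag) for tag in rstrip_flags))
    pieces, pos = [], 0
    while pos < len(text):
        match = token_re.search(text, pos)
        if match is None:
            pieces.append(text[pos:])  # no more special tokens: keep the rest as is
            break
        if match.start() > pos:
            pieces.append(text[pos:match.start()])  # plain text before the token
        pieces.append(match.group())
        pos = match.end()
        if rstrip_flags[match.group()]:
            while pos < len(text) and text[pos].isspace():
                pos += 1  # swallow whitespace to the right of the token
    return pieces


prompt = "<|system|>\nHi<|end|>\n"
print("".join(split_special(prompt, RSTRIP_ON)))   # newlines after the tags are gone
print("".join(split_special(prompt, RSTRIP_OFF)))  # newlines survive
print("".join(split_special("<|systemA|>\nHi", RSTRIP_ON)))  # not a known tag: untouched
```

The last call mirrors the `<|systemA|>` typo: the string is not a registered special token, so it is treated as ordinary text and its newline is kept.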

bartowski

Jul 4, 2024

@nguyenbh @gargamit

hanori

Microsoft org Jul 8, 2024

I agree that rstrip=true causes many odd issues with tokenization, and directly conflicts with the chat_template/examples. Would love to see one or the other changed for consistency!
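
For anyone who wants to check or change the flag locally, the sketch below lists which entries in a tokenizer config have `rstrip` enabled and produces a patched copy with it turned off. It assumes the usual `added_tokens_decoder` layout of tokenizer_config.json; the sample config, its token IDs, and the helper names are hand-written for illustration and are not copied from the model repository. Which side (config or template) should change is not settled in this thread.

```python
import copy
import json

# Hand-written excerpt in the shape of tokenizer_config.json (IDs are made up).
SAMPLE_CONFIG = json.loads("""
{
  "added_tokens_decoder": {
    "1": {"content": "<|system|>", "rstrip": true, "special": true},
    "2": {"content": "<|end|>", "rstrip": true, "special": true},
    "3": {"content": "<unk>", "rstrip": false, "special": true}
  }
}
""")


def tokens_with_rstrip(config):
    """Return the sorted contents of all added tokens that have rstrip enabled."""
    entries = config.get("added_tokens_decoder", {}).values()
    return sorted(entry["content"] for entry in entries if entry.get("rstrip"))


def without_rstrip(config, names):
    """Return a deep copy of config with rstrip disabled for the given tokens."""
    patched = copy.deepcopy(config)
    for entry in patched.get("added_tokens_decoder", {}).values():
        if entry["content"] in names:
            entry["rstrip"] = False
    return patched


print(tokens_with_rstrip(SAMPLE_CONFIG))
patched = without_rstrip(SAMPLE_CONFIG, {"<|system|>", "<|end|>"})
print(tokens_with_rstrip(patched))
```

To apply this to a real file, load it with `json.load`, pass the result through `without_rstrip`, and write it back with `json.dump`; the original dictionary is left unchanged so the two can be compared.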

bartowski

Jul 8, 2024

Is there any way it can be escalated, @hanori?

UserDAN

Jul 16, 2024

I have a similar issue.

When I feed a text block that contains new lines into the Phi-3 tokenizer, the new lines are removed after decoding. Here is an example of the text I am working with.
Input text to the tokenizer:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>

After tokenizer.decode I got this:

<|system|>
You are a helpful assistant.<|end|><|user|>
How to explain Internet for a medieval knight?<|end|><|assistant|>

Can you help me with this issue? Does it affect the performance of the model if I proceed with this?
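
A quick way to see exactly which newlines went missing is to compare the two strings quoted above. The sketch below uses only the standard library; the strings are copied from this post (whether the input ended with a trailing newline is not shown, so none is assumed), and the helper name `tags_followed_by_newline` is made up for illustration.

```python
import re
from collections import Counter

# Input text and decoded text as posted above.
original = (
    "<|system|>\n"
    "You are a helpful assistant.<|end|>\n"
    "<|user|>\n"
    "How to explain Internet for a medieval knight?<|end|>\n"
    "<|assistant|>"
)
decoded = (
    "<|system|>\n"
    "You are a helpful assistant.<|end|><|user|>\n"
    "How to explain Internet for a medieval knight?<|end|><|assistant|>"
)

# Matches a tag such as <|end|> that is immediately followed by a newline.
TAG_THEN_NEWLINE = re.compile(r"(<\|[a-z]+\|>)\n")


def tags_followed_by_newline(text):
    """Count, per tag, how often it is directly followed by a newline."""
    return Counter(TAG_THEN_NEWLINE.findall(text))


# Counter subtraction keeps only tags that lost newlines in the round trip.
lost = tags_followed_by_newline(original) - tags_followed_by_newline(decoded)
print(dict(lost))  # {'<|end|>': 2}
```

For the strings as posted, the only newlines missing after the round trip are the two that followed `<|end|>`.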

UserDAN

Jul 17, 2024

edited Jul 17, 2024

Can you help us with this, please? @hanori @gugarosa
