Instructions to use microsoft/Phi-3.5-mini-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use microsoft/Phi-3.5-mini-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3.5-mini-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use microsoft/Phi-3.5-mini-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-3.5-mini-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3.5-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/microsoft/Phi-3.5-mini-instruct

SGLang

How to use microsoft/Phi-3.5-mini-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-3.5-mini-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3.5-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-3.5-mini-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3.5-mini-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Determinism Challenges with Microsoft Phi-3.5-mini-instruct Across Different GPU Architectures

#28

by HumzaAli - opened Dec 21, 2024

Discussion

HumzaAli

Dec 21, 2024

I've been trying to generate deterministic outputs with Microsoft's Phi-3.5-mini-instruct model on two different GPU setups: a single A6000 and dual A5000s. Despite taking every precaution I could think of (matching library versions, disabling randomness, forcing strict precision settings, and using greedy decoding), the outputs still diverge after a certain number of tokens.

I've ensured that all configurations, including PyTorch and CUDA versions and model parameters, are identical across the two setups. Greedy decoding does remove sampling randomness, but for many prompts the content and wording of the responses still start to differ significantly. Based on my research, the underlying issue seems to stem from floating-point arithmetic, which is not guaranteed to produce bit-identical results across GPU architectures. Differences in precision, rounding, or the order of operations appear to cause these discrepancies.
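To make the "order of operations" point concrete, here is a minimal sketch in plain Python (no GPU involved). It assumes only that a reduction may add the same numbers in a different order on different hardware or with a different number of devices; the helper names `to_f32` and `sum_f32` are made up for this illustration and are not part of any library.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Sum left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

# Same three numbers, two summation orders, float32 arithmetic.
order_a = sum_f32([1e8, 1.0, -1e8])  # the 1.0 is absorbed by 1e8 and lost
order_b = sum_f32([1e8, -1e8, 1.0])  # the large terms cancel first
print(order_a, order_b)  # prints 0.0 1.0

# The same effect in float64, on a much smaller scale.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # prints False
```

In a real model the differences are far smaller than in this contrived case, but with greedy decoding a tiny change in a logit is enough to flip the argmax when two candidate tokens are nearly tied, and every later token is then conditioned on a different prefix.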

This makes perfect determinism seem impossible unless the same GPU model (e.g., A6000), or one with an identical architecture, is used. In theory it might be possible to trace the inference process at a low level (e.g., inspecting logits and floating-point operations), but this would be computationally intensive and impractical for large models like Phi-3.5-mini-instruct.
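A cheaper alternative to tracing every floating-point operation is to locate only the first token where the two runs disagree and check how close the top two logits were at that step. The sketch below assumes the generated token ids (and, optionally, the logits at the diverging step) have already been saved from each machine; the functions `first_divergence` and `top2_margin` and all of the sample values are hypothetical, chosen only to illustrate the idea.

```python
from itertools import zip_longest

def first_divergence(tokens_a, tokens_b):
    """Return the first index where two token-id sequences differ, or None if they are identical."""
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None

def top2_margin(logits):
    """Gap between the largest and second-largest logit at one decoding step."""
    best, second = sorted(logits, reverse=True)[:2]
    return best - second

# Hypothetical token ids saved from the two machines.
run_a = [1, 15, 27, 42, 7]
run_b = [1, 15, 27, 43, 9]
step = first_divergence(run_a, run_b)
print(step)  # prints 3

# Hypothetical logits recorded on one machine at the diverging step.
logits_at_step = [2.0, 5.25, 5.125]
print(top2_margin(logits_at_step))  # prints 0.125
```

A very small margin at the first diverging step would be consistent with the near-tie explanation: a rounding-level difference between the two setups is enough to change which token greedy decoding picks.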

Has anyone encountered similar challenges with this or other LLMs? If so, have you found any strategies to mitigate these discrepancies across different GPU architectures? Your insights and experiences would be greatly appreciated!

carlquillen

Dec 21, 2024

edited Dec 21, 2024

It's probably not possible to get completely consistent results across different hardware, no. CUDA documents ULP (units in the last place) error bounds far bigger than 1 for a number of operations, and even IEEE floating point does not guarantee an error of <= 0.5 ULP for non-basic operations.
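For readers unfamiliar with the unit, the following sketch shows one way to count how many ULPs apart two float32 values are, by comparing their bit patterns. It assumes both inputs are positive and finite; `f32_bits` and `ulp_distance_f32` are names invented for this example.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x, rounded to float32, as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def ulp_distance_f32(a: float, b: float) -> int:
    """Distance between a and b in float32 ULPs (valid for positive, finite inputs)."""
    return abs(f32_bits(a) - f32_bits(b))

one = 1.0
next_up = 1.0 + 2.0 ** -23       # the next float32 value above 1.0
three_up = 1.0 + 3 * 2.0 ** -23  # three float32 steps above 1.0

print(ulp_distance_f32(one, next_up))   # prints 1
print(ulp_distance_f32(one, three_up))  # prints 3
```

Two implementations of the same function that each stay within a bound of a few ULPs can therefore legitimately return results that differ in their last bits.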
