Model not working with accelerate for inference.

#25

by Satandon1999 - opened Jul 9, 2024

Discussion

Satandon1999

Jul 9, 2024

I am trying to run multi-GPU inference with the accelerate library, following the instructions here: https://medium.com/@geronimo7/llms-multi-gpu-inference-with-accelerate-5a8333e4c5db.
The process works fine for the mini models, but with this small model I get the following error:

e90707fca55744aebfc511579dbd663c00000C:353:385 [1] NCCL INFO [Service thread] Connection closed by localRank 1
sh: 1: cannot create 0.1/compile-ptx-log-7f1750: Directory nonexistent
SystemLog: Traceback (most recent call last):
SystemLog:   File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 141, in main
SystemLog:     output = model.generate(input_ids=input_ids,
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
SystemLog:     return func(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
SystemLog:     result = self._sample(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
SystemLog:     outputs = self(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 956, in forward
SystemLog:     outputs = self.model(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 859, in forward
SystemLog:     layer_outputs = decoder_layer(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 671, in forward
SystemLog:     hidden_states, self_attn_weights, present_key_values = self.self_attn(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 616, in forward
SystemLog:     attn_function_output = self._apply_blocksparse_attention(
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
SystemLog:     context_layer = self._blocksparse_layer(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_blocksparse_attention_layer.py", line 165, in forward
SystemLog:     return blocksparse_flash_attn_padded_fwd(
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_flash_blocksparse_attn.py", line 996, in blocksparse_flash_attn_padded_fwd
SystemLog:     _fwd_kernel_batch_inference[grid](
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
SystemLog:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
SystemLog:     return self.fn.run(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
SystemLog:     self.cache[device][key] = compile(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
SystemLog:     next_module = compile_ir(module, metadata)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 201, in <lambda>
SystemLog:     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 194, in make_cubin
SystemLog:     return compile_ptx_to_cubin(src, ptxas, capability, opt.enable_fp_fusion)
SystemLog: RuntimeError: `ptxas` failed with error code 2: 
SystemLog: 
SystemLog:ERROR:__main__:An error occurred during execution
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 201, in <module>
    main()
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 411, in wrapper
    print_prefixed_stack_trace_and_raise(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 366, in print_prefixed_stack_trace_and_raise
    raise scrubbed_err  # type: ignore
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 406, in wrapper
    return function(*func_args, **func_kwargs)
  File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 141, in main
    output = model.generate(input_ids=input_ids,
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 956, in forward
    outputs = self.model(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 859, in forward
    layer_outputs = decoder_layer(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 671, in forward
    hidden_states, self_attn_weights, present_key_values = self.self_attn(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 616, in forward
    attn_function_output = self._apply_blocksparse_attention(
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
    context_layer = self._blocksparse_layer(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_blocksparse_attention_layer.py", line 165, in forward
    return blocksparse_flash_attn_padded_fwd(
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_flash_blocksparse_attn.py", line 996, in blocksparse_flash_attn_padded_fwd
    _fwd_kernel_batch_inference[grid](
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
    return self.fn.run(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
    next_module = compile_ir(module, metadata)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 201, in <lambda>
    stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 194, in make_cubin
    return compile_ptx_to_cubin(src, ptxas, capability, opt.enable_fp_fusion)
RuntimeError: `ptxas` failed with error code 2:

Satandon1999

Jul 9, 2024

Resolved by creating a directory named "0.1", the one the shell could not find in this line of the error output: sh: 1: cannot create 0.1/compile-ptx-log-7f1750: Directory nonexistent.

Credit: https://github.com/vllm-project/vllm/issues/3926
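
For anyone hitting the same error, the workaround above can be sketched in Python as follows. This is a minimal sketch, not an official fix: the helper name ensure_ptx_log_dir is hypothetical, and it assumes the relative path 0.1/ in the shell error resolves against the working directory of the process that calls model.generate(). Under accelerate, every rank would need to see the directory, so the call belongs before the first generate() in each process.

```python
from pathlib import Path
from typing import Optional


def ensure_ptx_log_dir(dirname: str = "0.1", base: Optional[str] = None) -> Path:
    """Create the directory named in the 'cannot create 0.1/compile-ptx-log-...' error.

    dirname: directory name taken from the error line ("0.1" in this thread).
    base:    where to create it; defaults to the current working directory,
             which is an assumption about where the relative path resolves.
    """
    root = Path(base) if base is not None else Path.cwd()
    target = root / dirname
    # exist_ok=True makes this safe when several ranks call it at once.
    target.mkdir(parents=True, exist_ok=True)
    return target


# Usage (hypothetical placement in run.py, before model.generate is called):
# ensure_ptx_log_dir()
```

If the error names a different directory on your system, pass that name instead of "0.1".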

Satandon1999 changed discussion status to closed Jul 9, 2024
