Model not working with accelerate for inference.
#25
by Satandon1999 - opened
I am trying to run multi-GPU inference with the accelerate library, following the instructions here: https://medium.com/@geronimo7/llms-multi-gpu-inference-with-accelerate-5a8333e4c5db
The same process works perfectly fine for the mini models, but with this small model I get the following error:
e90707fca55744aebfc511579dbd663c00000C:353:385 [1] NCCL INFO [Service thread] Connection closed by localRank 1
sh: 1: cannot create 0.1/compile-ptx-log-7f1750: Directory nonexistent
SystemLog: Traceback (most recent call last):
SystemLog: File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 141, in main
SystemLog: output = model.generate(input_ids=input_ids,
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
SystemLog: return func(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
SystemLog: result = self._sample(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
SystemLog: outputs = self(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog: return self._call_impl(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog: return forward_call(*args, **kwargs)
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 956, in forward
SystemLog: outputs = self.model(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog: return self._call_impl(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog: return forward_call(*args, **kwargs)
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 859, in forward
SystemLog: layer_outputs = decoder_layer(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog: return self._call_impl(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog: return forward_call(*args, **kwargs)
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 671, in forward
SystemLog: hidden_states, self_attn_weights, present_key_values = self.self_attn(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog: return self._call_impl(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog: return forward_call(*args, **kwargs)
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 616, in forward
SystemLog: attn_function_output = self._apply_blocksparse_attention(
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
SystemLog: context_layer = self._blocksparse_layer(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog: return self._call_impl(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog: return forward_call(*args, **kwargs)
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_blocksparse_attention_layer.py", line 165, in forward
SystemLog: return blocksparse_flash_attn_padded_fwd(
SystemLog: File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_flash_blocksparse_attn.py", line 996, in blocksparse_flash_attn_padded_fwd
SystemLog: _fwd_kernel_batch_inference[grid](
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
SystemLog: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
SystemLog: return self.fn.run(*args, **kwargs)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
SystemLog: self.cache[device][key] = compile(
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
SystemLog: next_module = compile_ir(module, metadata)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 201, in <lambda>
SystemLog: stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
SystemLog: File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 194, in make_cubin
SystemLog: return compile_ptx_to_cubin(src, ptxas, capability, opt.enable_fp_fusion)
SystemLog: RuntimeError: `ptxas` failed with error code 2:
SystemLog:
SystemLog:ERROR:__main__:An error occurred during execution
Traceback (most recent call last):
File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 201, in <module>
main()
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 411, in wrapper
print_prefixed_stack_trace_and_raise(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 366, in print_prefixed_stack_trace_and_raise
raise scrubbed_err # type: ignore
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 406, in wrapper
return function(*func_args, **func_kwargs)
File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 141, in main
output = model.generate(input_ids=input_ids,
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
result = self._sample(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
outputs = self(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 956, in forward
outputs = self.model(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 859, in forward
layer_outputs = decoder_layer(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 671, in forward
hidden_states, self_attn_weights, present_key_values = self.self_attn(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 616, in forward
attn_function_output = self._apply_blocksparse_attention(
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
context_layer = self._blocksparse_layer(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_blocksparse_attention_layer.py", line 165, in forward
return blocksparse_flash_attn_padded_fwd(
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_flash_blocksparse_attn.py", line 996, in blocksparse_flash_attn_padded_fwd
_fwd_kernel_batch_inference[grid](
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
return self.fn.run(*args, **kwargs)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
self.cache[device][key] = compile(
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
next_module = compile_ir(module, metadata)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 201, in <lambda>
stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 194, in make_cubin
return compile_ptx_to_cubin(src, ptxas, capability, opt.enable_fp_fusion)
RuntimeError: `ptxas` failed with error code 2:
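Resolved by creating the directory named "0.1", as indicated by this line of the error output: sh: 1: cannot create 0.1/compile-ptx-log-7f1750: Directory nonexistent. The shell could not create the ptxas log file in that directory, which appears to be what made `ptxas` fail with error code 2.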
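For anyone hitting the same thing, here is a minimal sketch of that workaround in Python. The directory name "0.1" is taken from the error line above. The helper name `ensure_ptxas_log_dir` is hypothetical. That the relative path resolves against the process's current working directory is my assumption, not something confirmed in this thread.

```python
from pathlib import Path
from typing import Optional


def ensure_ptxas_log_dir(dirname: str = "0.1", base: Optional[str] = None) -> Path:
    """Create the directory named in the 'cannot create 0.1/compile-ptx-log-...' line.

    Hypothetical helper. `dirname` defaults to the name seen in the error output.
    `base` defaults to the current working directory (an assumption, since the
    path in the error is relative).
    """
    base_path = Path(base) if base is not None else Path.cwd()
    target = base_path / dirname
    # exist_ok=True keeps this safe when several accelerate processes
    # start at once and all try to create the same directory.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling `ensure_ptxas_log_dir()` once near the top of the script, before the first `model.generate(...)` call, should be enough. If the directory name differs in your error output, pass that name instead.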
Satandon1999 changed discussion status to closed