Raises an error when `use_cache=True`
transformers version: 4.33.2
AutoModelForCausalLM.from_pretrained("microsoft/phi-1", trust_remote_code=True, torch_dtype="auto", use_cache=True)
raises the following error:
File /usr/local/lib/python3.9/dist-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
556 else:
557 cls.register(config.__class__, model_class, exist_ok=True)
--> 558 return model_class.from_pretrained(
559 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
560 )
561 elif type(config) in cls._model_mapping.keys():
562 model_class = _get_model_class(config, cls._model_mapping)
File /usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py:2966, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
2963 init_contexts.append(init_empty_weights())
2965 with ContextManagers(init_contexts):
-> 2966 model = cls(config, *model_args, **model_kwargs)
2968 # Check first if we are `from_pt`
2969 if use_keep_in_fp32_modules:
TypeError: __init__() got an unexpected keyword argument 'use_cache'
Hey @wjfwzzc , thanks for your issue!
It seems there is an issue with the propagation of unused kwargs when using remote code, cc @ArthurZ .
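To illustrate the kind of kwargs propagation described here, below is a toy sketch in plain Python. It is not the transformers code path: `ToyConfig`, `ToyRemoteModel`, and `toy_from_pretrained` are hypothetical names, and the only detail taken from the traceback is that leftover keyword arguments are forwarded to the model constructor via `cls(config, *model_args, **model_kwargs)`.

```python
# Toy sketch (not the transformers source): leftover keyword arguments are
# forwarded to a constructor that does not accept them.


class ToyConfig:
    """Hypothetical stand-in for a model config."""

    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size


class ToyRemoteModel:
    """Hypothetical stand-in for a remote-code model whose __init__ only takes a config."""

    def __init__(self, config):
        self.config = config


def toy_from_pretrained(model_class, config, **kwargs):
    # Mirrors the `cls(config, *model_args, **model_kwargs)` call in the traceback:
    # anything still left in kwargs reaches __init__ unchanged.
    return model_class(config, **kwargs)


error_message = None
try:
    toy_from_pretrained(ToyRemoteModel, ToyConfig(), use_cache=True)
except TypeError as exc:
    # Same class of failure as in the report above.
    error_message = str(exc)
    print(error_message)
```

Under this assumption, the fix is either for the loader to consume `use_cache` before calling the constructor, or for the caller to pass it somewhere that does accept it.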
To do what you're trying to do, you could define a GenerationConfig locally with use_cache set to True:
from transformers import GenerationConfig
generation_config = GenerationConfig(use_cache=True)
You can then pass this to the generate method:
>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
>>> inputs = tokenizer('''```python
... def print_prime(n):
... """
... Print all primes between 1 and n
... """''', return_tensors="pt", return_attention_mask=False)
>>> model.generate(**inputs, max_length=200, generation_config=generation_config)
Please let me know if that works for you!
Hi @lysandre, thanks for your help, it works for me!
Nevertheless, I'm still confused about the attention_mask. It seems that return_attention_mask=True raises:
ValueError: The following `model_kwargs` are not used by the model: ['attention_mask'] (note: typos in the generate arguments will also show up in this list)
But how can I do batch inference with padding without an attention mask?
Hey @wjfwzzc , Phi is being contributed to transformers in this PR: https://github.com/huggingface/transformers/pull/26170
This should enable leveraging the attention mask to perform batch inference.
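For readers wondering why padding and the attention mask go together, here is a minimal pure-Python sketch of what the mask encodes. It is an illustration only, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and left padding is assumed because it is the usual choice for batched generation with decoder-only models.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length and build matching attention masks.

    A mask value of 1 marks a real token, 0 marks padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads = [pad_id] * n_pad
        zeros = [0] * n_pad
        ones = [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


# Hypothetical token ids, for illustration only.
batch = [[11, 12, 13], [21]]
ids, mask = pad_batch(batch, pad_id=0)
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

Without the mask, the model cannot tell the padding positions from real tokens, which is why batch inference with padding depends on the model accepting `attention_mask`.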