Instructions to use rhysjones/phi-2-orange with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use rhysjones/phi-2-orange with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="rhysjones/phi-2-orange", trust_remote_code=True)

# Load the model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("rhysjones/phi-2-orange", trust_remote_code=True, dtype="auto")
```
- Notebooks
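From the discussion thread below, phi-2-orange appears to expect ChatML-style prompts (`<|im_start|>`/`<|im_end|>` tags). A minimal prompt builder for that format is sketched here; the tag layout is taken from the thread, not from an official prompt-format spec, so treat it as an assumption:

```python
# Sketch of a ChatML-style prompt builder (tag format assumed from the
# discussion below, not from the official model card).

def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a ChatML prompt ending with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant that gives short answers.",
    "Give me the first 3 prime numbers.",
)
# prompt can then be passed to the pipeline above:
# pipe(prompt, max_new_tokens=200)
```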
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use rhysjones/phi-2-orange with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "rhysjones/phi-2-orange"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rhysjones/phi-2-orange",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
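The same OpenAI-compatible endpoint can be called from Python using only the standard library. A sketch, assuming a vLLM server is already running locally on port 8000 as in the curl example; the helper names are illustrative:

```python
import json
import urllib.request

def build_completion_request(model, prompt, max_tokens=512, temperature=0.5):
    """Build the JSON body for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(url, payload):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_completion_request("rhysjones/phi-2-orange", "Once upon a time,")
# Requires a running server:
# complete("http://localhost:8000/v1/completions", payload)
```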
```shell
docker model run hf.co/rhysjones/phi-2-orange
```
- SGLang
How to use rhysjones/phi-2-orange with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "rhysjones/phi-2-orange" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rhysjones/phi-2-orange",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "rhysjones/phi-2-orange" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "rhysjones/phi-2-orange",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use rhysjones/phi-2-orange with Docker Model Runner:
```shell
docker model run hf.co/rhysjones/phi-2-orange
```
Can't get it to generate the EOS token, and beam search is not supported
This is how I'm using the model:
```python
import torch
import transformers
import time

model = transformers.AutoModelForCausalLM.from_pretrained(
    "rhysjones/phi-2-orange",
    trust_remote_code=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained("rhysjones/phi-2-orange")

SYSTEM_PROMPT = "You are an AI assistant. You will be given a task. You must generate a short answer."

input_text = f"""<|im_start|>system
You are a helpful assistant that gives short answers.<|im_end|>
<|im_start|>user
Give me the first 3 prime numbers.<|im_end|>
<|im_start|>assistant
"""

t1 = time.time()
with torch.no_grad():
    outputs = model.generate(
        tokenizer(input_text, return_tensors="pt")["input_ids"],
        max_new_tokens=200,
        num_beams=1
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t1)
```
I think the beam search part is relatively easy to change: you just update modeling_phi.py, right? The EOS token seems harder.
Hey - there have been a few recent updates to Phi-2 that bring it in line with HF's transformers changes post-dating this model's Phi-2 version. One of those updates appears to fix beam search support. I'll look to update this model with it in the future, and also to allow it to run without needing trust_remote_code.
In the meantime, another recent Phi-2 update makes the EOS token explicit. If you add this to your code, it should reliably finish on the EOS token:
```python
import torch
import transformers
import time

model = transformers.AutoModelForCausalLM.from_pretrained(
    "rhysjones/phi-2-orange",
    trust_remote_code=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained("rhysjones/phi-2-orange")

SYSTEM_PROMPT = "You are an AI assistant. You will be given a task. You must generate a short answer."

input_text = f"""<|im_start|>system
You are a helpful assistant that gives short answers.<|im_end|>
<|im_start|>user
Give me the first 3 prime numbers.<|im_end|>
<|im_start|>assistant
"""

# Make the EOS token explicit so generation stops on it
generation_config = transformers.GenerationConfig(
    eos_token_id=50256
)

t1 = time.time()
with torch.no_grad():
    outputs = model.generate(
        tokenizer(input_text, return_tensors="pt")["input_ids"],
        max_new_tokens=200,
        generation_config=generation_config,
        num_beams=1
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t1)
```
The newer version also never stops for me; it is better to add a stopping criterion, for example on the <|im_end|> token.
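The core of such a stopping criterion is just a suffix check on the generated ids; with transformers you would wrap it in a `StoppingCriteria` subclass and pass a `StoppingCriteriaList` to `model.generate`. A minimal, framework-free sketch of that check (the helper name is illustrative, not part of the model's code):

```python
def ends_with_stop_sequence(token_ids, stop_ids):
    """True once the generated ids end with the stop token sequence,
    e.g. the ids the tokenizer produces for "<|im_end|>"."""
    if not stop_ids or len(token_ids) < len(stop_ids):
        return False
    return token_ids[-len(stop_ids):] == stop_ids

# Inside a transformers StoppingCriteria.__call__, you would apply this to
# input_ids[0].tolist() against tokenizer.encode("<|im_end|>") and return
# True to halt generation.
```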