RuntimeError: FlashAttention only support fp16 and bf16 data type during fine tuning.

#11

by faizsameerahmed96 - opened May 26, 2024

Discussion

faizsameerahmed96

May 26, 2024

These are the hyperparameters I am using:

training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 0.00001,
    "lr_scheduler_type": "cosine",
    "log_level": "info",
    "logging_steps": 30,
    "logging_strategy": "steps",
    "num_train_epochs": 5,
    "max_steps": -1,
    "output_dir": "./workspace/checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs":{"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
}

I am loading the model like this:

checkpoint_path = "microsoft/Phi-3-small-8k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype="auto",
    device_map=None,
)
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)

Then I start training:

trainer = SFTTrainer(
    model=model,
    args=train_conf,
    train_dataset=processed_dataset,
    max_seq_length=8192,
    dataset_text_field="text",
    tokenizer=tokenizer,
    packing=True,
)
train_result = trainer.train()

I am getting the following error:

    107 # if out.isnan().any() or softmax_lse.isnan().any():
    108 #     breakpoint()
    109 return out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state

RuntimeError: FlashAttention only support fp16 and bf16 data type

I used exactly the same config when fine-tuning phi-3-mini-128k without any issues. Is anyone else facing the same issue?

bapatra

Microsoft org May 29, 2024

Hi!
Flash attention and the block-sparse attention kernel both require the model to be trained in fp16 or bf16. Is there a reason why bfloat16 might not work for your use case?
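
One way to check the point above is to look at which dtype the loaded parameters actually have before training starts. The helper below is only a sketch: `audit_dtypes` and `HALF_DTYPES` are hypothetical names, not part of Transformers or flash-attn, and the helper compares dtype names as strings so that it works on any object exposing `named_parameters()`.

```python
from collections import Counter

# dtype names FlashAttention accepts, as printed by str(tensor.dtype) in PyTorch
HALF_DTYPES = {"torch.float16", "torch.bfloat16"}


def audit_dtypes(model):
    """Return (count of parameters per dtype, names of parameters not in fp16/bf16).

    `model` is anything exposing named_parameters(), e.g. a torch.nn.Module.
    """
    counts = Counter()
    offenders = []
    for name, param in model.named_parameters():
        dtype_name = str(param.dtype)
        counts[dtype_name] += 1
        if dtype_name not in HALF_DTYPES:
            offenders.append(name)
    return counts, offenders
```

Calling `counts, offenders = audit_dtypes(model)` right after `from_pretrained` shows whether `torch_dtype="auto"` resolved to a half-precision dtype. The check only covers parameters, so it would not catch an activation that is upcast to fp32 during the forward pass.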

Alexandre-Numind

May 30, 2024

Hi,

I am getting the same error, even with bf16=True in the training arguments.

ecocytus11

May 31, 2024

I'm not sure this is the correct fix, but here is how I made it work. In
https://huggingface.co/microsoft/Phi-3-small-8k-instruct/blob/f5527db8a43fc9a4bf17c5b754251e1efe1d4ad3/positional_embedding.py#L269
cast q and k back to the input dtype after the rotary mapping:

return (
    apply_rotary_pos_emb(
        q, cos_cached[seqlen_offset:seq_len], sin_cached[seqlen_offset:seq_len], seq_dimension=seq_dimension
    ).to(q.dtype),
    apply_rotary_pos_emb(
        k, cos_cached[seqlen_offset:seq_len], sin_cached[seqlen_offset:seq_len], seq_dimension=seq_dimension
    ).to(q.dtype),
)
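
A possible explanation for why the cast helps is PyTorch type promotion: multiplying a bf16 tensor by an fp32 tensor yields fp32, which FlashAttention then rejects. The sketch below reproduces that on CPU. It assumes the cached cos/sin tables are fp32, which this thread does not confirm, and `rotate_half` is a simplified stand-in, not the repository's `apply_rotary_pos_emb`.

```python
import torch

# Query in bf16, as it would be inside a bf16 model
q = torch.randn(1, 4, 8, dtype=torch.bfloat16)

# Assumption: the cached rotary tables are kept in fp32
cos = torch.ones(8, dtype=torch.float32)
sin = torch.zeros(8, dtype=torch.float32)


def rotate_half(x):
    """Simplified stand-in: swap the two halves of the last dim, negating one."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


# bf16 * fp32 promotes the result to fp32
rotated = q * cos + rotate_half(q) * sin

# Casting back restores the dtype FlashAttention expects
fixed = rotated.to(q.dtype)

print(rotated.dtype, fixed.dtype)
```

If the tables in the real module are stored in the model dtype instead, the promotion would not occur there and the cause would lie elsewhere.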

Alexandre-Numind

May 31, 2024

Thanks for the answer.
The error happens when device_map = "auto" (or anything other than None). It might be a problem related to flash attention and multi-GPU training. If you have a fix, please share it.