ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models
Paper • 2503.00564 • Published • 2
How to use HOLILAB/td-llama-op with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="HOLILAB/td-llama-op")
messages = [
{"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("HOLILAB/td-llama-op", dtype="auto")
How to use HOLILAB/td-llama-op with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HOLILAB/td-llama-op"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "HOLILAB/td-llama-op",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
# Or run the model with Docker Model Runner:
docker model run hf.co/HOLILAB/td-llama-op
How to use HOLILAB/td-llama-op with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "HOLILAB/td-llama-op" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "HOLILAB/td-llama-op",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
# Or start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "HOLILAB/td-llama-op" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "HOLILAB/td-llama-op",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
How to use HOLILAB/td-llama-op with Docker Model Runner:
docker model run hf.co/HOLILAB/td-llama-op
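The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical and not part of vLLM or SGLang. It assumes a server is already running on the ports shown above (8000 for vLLM, 30000 for SGLang).

```python
import json
import urllib.request

MODEL_ID = "HOLILAB/td-llama-op"


def build_chat_request(user_message, base_url="http://localhost:8000", model=MODEL_ID):
    """Build a POST request for the OpenAI-compatible chat completions endpoint.

    Hypothetical helper: it mirrors the JSON body of the curl examples.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the parsed JSON response.

    Requires a running server, so it is not called in this sketch.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Port 8000 matches the vLLM example; port 30000 matches the SGLang example.
vllm_request = build_chat_request("What is the capital of France?")
sglang_request = build_chat_request(
    "What is the capital of France?", base_url="http://localhost:30000"
)
```

With a server running, `send_chat_request(vllm_request)` should return an OpenAI-style response in which the generated text is expected under `choices[0]["message"]["content"]`.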
TD (ToolDial)-Llama-OP (Overall Performance) is the model used for the Overall Performance task in the ToolDial paper. We encourage you to use this model to reproduce the results. Please refer to the Experiments section of our GitHub page to see how our evaluation proceeded.
[Model Summary]
[How to load the model]
import torch  # needed for torch.bfloat16 in the quantization config below
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
device = "cuda:0"
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
## 1. Load the base model (we use llama3-8b-inst) with the given quantization config.
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
quantization_config=quant_config,
device_map={"": device},
)
tokenizer = AutoTokenizer.from_pretrained("HOLILAB/td-llama-op")
tokenizer.pad_token_id = tokenizer.eos_token_id
## 2. Load the lora adapter with PeftModel
model = PeftModel.from_pretrained(base_model, "HOLILAB/td-llama-op")
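Once the adapter is loaded, a reply can be generated as in the sketch below. This is a generic chat-template example, not the evaluation code from the paper: the helper name `generate_reply` and the `max_new_tokens` default are assumptions, and the prompt format used in the ToolDial experiments is documented on the GitHub page rather than here.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=256):
    """Generate one assistant reply for a list of chat messages.

    Hypothetical helper: `model` and `tokenizer` are the objects created in the
    loading snippet above.
    """
    # Render the messages with the tokenizer's chat template and tokenize them.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    # Greedy decoding keeps the output deterministic for reproduction runs.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

    # Drop the prompt tokens so only the newly generated text is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)


# Example call (requires the model and tokenizer loaded above):
# reply = generate_reply(model, tokenizer, [{"role": "user", "content": "Who are you?"}])
```

Greedy decoding is chosen here only to make repeated runs comparable; the decoding settings used for the reported results may differ.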