language:
- en
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
- tool-calling
- function-calling
- vllm
- inference
- guide
- hermes
- llama
- qwen
- mistral
- fp8
- blackwell
- rtx-6000
- open-source
- multi-step-workflow
- prompt-engineering
- quantization
- deployment
base_model:
- NousResearch/Hermes-3-Llama-3.1-70B-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- RedHatAI/Qwen2-72B-Instruct-FP8
- RedHatAI/Mistral-Nemo-Instruct-2407-FP8
VLLM Tool Calling Guide
A battle-tested guide to getting tool calling working reliably with open source models on VLLM.
This is not a model. This is a collection of production-tested configurations, prompt templates, Python examples, and hard-won lessons from building multi-step tool calling systems with open source LLMs on NVIDIA Blackwell GPUs.
Everything here was discovered through real deployment, not theory.
Quick Start
Launch VLLM with tool calling (Hermes-3 70B):
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
--dtype auto \
--quantization compressed-tensors \
--max-model-len 131072 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--gpu-memory-utilization 0.90 \
--max-num-seqs 4
Test it works:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"tool_choice": "auto"
}'
If you see "tool_calls" in the response, you're good. Read on for the details.
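The same check can be done from Python. The sketch below assumes the response follows the OpenAI chat completions shape, where each tool call carries its arguments as a JSON-encoded string. The helper name `extract_tool_calls` and the sample response are hypothetical and hand-written for illustration, not captured server output.

```python
import json

def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from a chat completion response dict."""
    message = response["choices"][0]["message"]
    calls = []
    # "tool_calls" may be missing or None when the model answered with plain text.
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls

# Hand-written sample in the OpenAI response shape (assumption, for illustration).
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                },
            }],
        }
    }]
}

print(extract_tool_calls(sample))
```

An empty list from this helper means the model replied with text only, which is the case covered under Troubleshooting.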
What This Repository Contains
| Directory | Contents |
|---|---|
| configs/ | Production VLLM launch scripts for 4 models with inline documentation |
| examples/ | Working Python code: basic tool calls, multi-step orchestration, JSON extraction |
| prompts/ | System prompt templates for tool calling (Hermes-specific and model-agnostic) |
| chat_templates/ | Jinja2 chat templates for Hermes-3 tool calling |
| guides/ | Deep-dive guides on specific topics (context length, prompt engineering, troubleshooting) |
Model Comparison
All models tested on NVIDIA RTX 6000 Pro Blackwell (96GB VRAM), single GPU.
| Model | Size | Quant | VLLM Parser | Speed | Memory | Context | Tool Quality | Open WebUI |
|---|---|---|---|---|---|---|---|---|
| Hermes-3-Llama-3.1-70B | 70B | FP8 | hermes | 25-35 tok/s | ~40GB | 128K | Excellent | No |
| Llama-3.3-70B-Instruct | 70B | FP8 | llama3_json | 60-90 tok/s | ~40GB | 128K | Excellent | Yes |
| Qwen2-72B-Instruct | 72B | FP8 | hermes | 60-90 tok/s | ~45GB | 128K | Very Good | Yes |
| Mistral-Nemo-Instruct | 12B | FP8 | mistral | 100-150 tok/s | ~15GB | 128K | Good | Yes |
Recommendations:
- Best overall tool calling: Hermes-3-Llama-3.1-70B (purpose-built for function calling)
- Best for Open WebUI: Llama-3.3-70B-Instruct (works out of the box)
- Best speed/quality ratio: Mistral-Nemo-12B (fast iterations, good enough for most tasks)
- Best multilingual: Qwen2-72B (strong across languages)
See guides/MODEL_COMPARISON.md for the full breakdown.
The Critical Context Length Fix
This is the #1 issue people hit with VLLM tool calling.
Many VLLM launch configs cap the context window well below what the model supports. Tool calling needs much more:
System prompt: 3-5K tokens
Tool definitions: 2-4K tokens per tool
Conversation history: 2-10K tokens
Tool responses: 5-20K tokens
---------------------------------
Total needed: 20-40K+ tokens
If your context window is 16K (the default for many configs), tool calls get silently truncated mid-generation.
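To see how quickly the budget grows with the number of tools, here is a rough estimator built only from the per-component ranges listed above. The function name and its defaults are hypothetical; measure real token counts with your model's tokenizer before relying on the numbers.

```python
def estimate_context_tokens(
    num_tools: int,
    system=(3_000, 5_000),
    per_tool=(2_000, 4_000),
    history=(2_000, 10_000),
    tool_responses=(5_000, 20_000),
) -> tuple[int, int]:
    """Return a (low, high) token estimate for one tool calling conversation."""
    low = system[0] + num_tools * per_tool[0] + history[0] + tool_responses[0]
    high = system[1] + num_tools * per_tool[1] + history[1] + tool_responses[1]
    return low, high

low, high = estimate_context_tokens(num_tools=5)
print(f"5 tools: {low}-{high} tokens")
print("fits in 16K:", high <= 16_384)
```

With five tools, even the low end of the estimate exceeds a 16K window, which matches the truncation symptom described above.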
The fix:
# BEFORE (broken): Default or small context
--max-model-len 16384
# AFTER (working): Full context support
--max-model-len 131072 # 128K tokens
--max-num-seqs 4 # Reduce concurrency to fit KV cache
--max-num-batched-tokens 132000 # Match context length
--gpu-memory-utilization 0.90 # Leave headroom
Memory math for 96GB GPU:
- Model weights (FP8 70B): ~40GB
- KV cache for 128K context: ~45-50GB
- Total: fits with batch size 4
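The KV cache figure can be sanity-checked with a short calculation. This is a sketch under stated assumptions: the layer count (80), KV head count (8), and head dimension (128) are taken from the published Llama 3.1 70B configuration, and KV values are assumed to be stored in 16 bits. The function name is hypothetical.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int = 80,          # assumption: Llama 3.1 70B
    kv_heads: int = 8,         # assumption: grouped-query attention, 8 KV heads
    head_dim: int = 128,       # assumption: 8192 hidden size / 64 attention heads
    bytes_per_value: int = 2,  # assumption: 16-bit KV cache
) -> int:
    """Bytes of KV cache needed to hold `tokens` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

GIB = 1024 ** 3
print(kv_cache_bytes(131_072) / GIB)                     # one full 128K sequence, 16-bit KV
print(kv_cache_bytes(131_072, bytes_per_value=1) / GIB)  # same sequence with an 8-bit KV cache
```

Under these assumptions a single full-length 128K sequence accounts for most of the KV budget on its own, so `--max-num-seqs 4` should be read as four concurrent requests sharing that token pool, not four simultaneous 128K contexts.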
See guides/CONTEXT_LENGTH_FIX.md for the full analysis.
Tool Call Formats
VLLM supports multiple tool call formats. Which one you use depends on your model:
Hermes Format (ChatML + XML tags)
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
<|im_end|>
Parser flag: --tool-call-parser hermes
Models: Hermes-3, Hermes-2-Pro, Qwen2
Llama 3 JSON Format
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
Parser flag: --tool-call-parser llama3_json
Models: Llama-3.1, Llama-3.3
Mistral Format
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
Parser flag: --tool-call-parser mistral
Models: Mistral-Nemo, Mistral-7B
All formats are converted to OpenAI-compatible JSON by VLLM. Your application code always receives the same standardized format regardless of which parser is used.
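To make the normalization concrete, here is a simplified sketch that turns each of the three raw formats shown above into the same OpenAI-style structure. It is an illustration of the idea only, not VLLM's actual parser code; the function name `normalize_tool_calls` is hypothetical, and real parsers also handle streaming and malformed output.

```python
import json
import re

def normalize_tool_calls(raw: str, parser: str) -> list[dict]:
    """Convert raw model text into OpenAI-style tool call dicts (illustrative only)."""
    if parser == "hermes":
        # One JSON object per <tool_call>...</tool_call> block.
        bodies = re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", raw, re.DOTALL)
        calls = [json.loads(body) for body in bodies]
    elif parser == "llama3_json":
        # The whole output is a single JSON object.
        calls = [json.loads(raw)]
    elif parser == "mistral":
        # A [TOOL_CALLS] marker followed by a JSON array of calls.
        calls = json.loads(raw.replace("[TOOL_CALLS]", "", 1).strip())
    else:
        raise ValueError(f"unknown parser: {parser}")
    return [
        {
            "type": "function",
            "function": {
                "name": call["name"],
                # Llama 3 uses "parameters"; the other two formats use "arguments".
                "arguments": json.dumps(call.get("arguments", call.get("parameters", {}))),
            },
        }
        for call in calls
    ]

hermes_raw = (
    '<|im_start|>assistant\n<tool_call>\n'
    '{"name": "get_weather", "arguments": {"location": "San Francisco"}}\n'
    '</tool_call>\n<|im_end|>'
)
llama_raw = '{"name": "get_weather", "parameters": {"location": "San Francisco"}}'
mistral_raw = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]'

print(normalize_tool_calls(hermes_raw, "hermes"))
```

All three inputs produce an identical list, which is why application code does not need to know which parser the server was launched with.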
See guides/TOOL_CALL_FORMATS.md for detailed comparison.
7 Prompt Engineering Lessons for Tool Calling
These lessons were learned through production debugging. Each one cost hours to diagnose.
1. LLMs Learn from Your Examples
Problem: LLM wraps all JSON responses in markdown code blocks (```json ... ```).
Root cause: Your prompt examples showed JSON inside markdown code blocks. The LLM learned to replicate the formatting.
Fix: Show raw JSON in all examples. Add explicit instruction: "Do NOT wrap your response in markdown code blocks."
2. Jinja2 Escaping Leaks into Output
Problem: LLM outputs {{ instead of { in JSON.
Root cause: Your Jinja2 chat template examples used {{ for escaping. The LLM learned to double braces.
Fix: Use single braces in all prompt examples. Handle template escaping separately from content.
3. Explicitly Limit Tool Call Blocks
Problem: LLM creates multiple <tool_call> blocks or nests them 5 levels deep.
Root cause: No instruction telling it not to.
Fix: Add: "Use ONLY ONE <tool_call> block per response. Do NOT create multiple blocks or nest them."
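A prompt instruction helps, but it is worth checking the raw output as well. The following sketch counts and depth-tracks `<tool_call>` tags so that multiple or nested blocks can be rejected or retried. The function name and message wording are hypothetical.

```python
import re

def check_tool_call_blocks(text: str) -> list[str]:
    """Return a list of problems with <tool_call> usage; empty means OK."""
    depth = max_depth = blocks = 0
    for tag in re.findall(r"</?tool_call>", text):
        if tag == "<tool_call>":
            depth += 1
            blocks += 1
            max_depth = max(max_depth, depth)
        else:
            depth -= 1
    problems = []
    if blocks > 1:
        problems.append(f"found {blocks} <tool_call> blocks, expected at most 1")
    if max_depth > 1:
        problems.append(f"blocks nested {max_depth} levels deep")
    if depth != 0:
        problems.append("unbalanced <tool_call> tags")
    return problems

good = '<tool_call>\n{"name": "search", "arguments": {}}\n</tool_call>'
nested = "<tool_call><tool_call>{}</tool_call></tool_call>"
print(check_tool_call_blocks(good))
print(check_tool_call_blocks(nested))
```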
4. Track Validation Results, Not Just Calls
Problem: System checks if validation tools were called but not if they passed. LLM returns "success" with invalid output.
Fix:
# BAD: Only tracks if called
tracking = {'validate_called': False}
# GOOD: Tracks if called AND passed
tracking = {
    'validate_called': False,
    'validate_passed': False,  # Did it return valid: true?
    'validation_errors': []    # What went wrong?
}
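One way to keep this tracking dict honest is to update it from the validation tool's result, and to gate the final "success" on it. The sketch below assumes the validation tool returns a dict with a boolean `valid` field and an optional `errors` list; that result shape and both function names are assumptions for illustration.

```python
def record_validation(tracking: dict, tool_result: dict) -> dict:
    """Update the tracking dict from a validation tool result."""
    tracking["validate_called"] = True
    # Only an explicit True counts as a pass; a missing field is a failure.
    tracking["validate_passed"] = tool_result.get("valid") is True
    tracking["validation_errors"] = list(tool_result.get("errors", []))
    return tracking

def can_report_success(tracking: dict) -> bool:
    """Accept the LLM's 'success' only if validation ran and passed."""
    return tracking["validate_called"] and tracking["validate_passed"]

tracking = {"validate_called": False, "validate_passed": False, "validation_errors": []}
record_validation(tracking, {"valid": False, "errors": [{"property": "url", "message": "required"}]})
print(can_report_success(tracking))
```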
5. Feed Errors Back with Structure
Problem: Validation fails but the LLM doesn't know what failed or how to fix it.
Fix: Format errors with property names, error types, and suggested fixes:
errors_formatted = "\n\nValidation Errors Found:\n"
for i, error in enumerate(errors, 1):
    errors_formatted += f"\n{i}. "
    if 'property' in error:
        errors_formatted += f"Property: {error['property']}\n"
    if 'message' in error:
        errors_formatted += f"   Message: {error['message']}\n"
    if 'fix' in error:
        errors_formatted += f"   Fix: {error['fix']}\n"
6. Use raw_decode for Robust JSON Extraction
Problem: LLM adds conversational text before/after the JSON: "Here is the result: {...} Let me know if you need anything else!"
Fix: Three-layer extraction:
import json
import re
from json import JSONDecoder

def extract_json(text: str):
    """Extract the first JSON object or array from an LLM response."""
    # Layer 1: Strip markdown code blocks
    if "```" in text:
        match = re.search(r'```(?:json)?\s*\n(.*?)\n```', text, re.DOTALL)
        if match:
            text = match.group(1).strip()
    # Layer 2: Skip preamble by jumping to whichever of { or [ appears first
    if not text.startswith(('{', '[')):
        starts = [i for i in (text.find('{'), text.find('[')) if i != -1]
        if starts:
            text = text[min(starts):]
    # Layer 3: raw_decode stops at end of valid JSON (skip postamble)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Still raises json.JSONDecodeError if no valid JSON starts here
        decoder = JSONDecoder()
        data, _ = decoder.raw_decode(text)
        return data
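The key step is the difference between `json.loads` and `JSONDecoder.raw_decode`, shown in isolation below. `json.loads` rejects any trailing text, while `raw_decode` returns the parsed value together with the index where the JSON ended. The sample string is made up for illustration.

```python
import json
from json import JSONDecoder

text = '{"status": "ok", "items": [1, 2]} Let me know if you need anything else!'

# json.loads refuses input with trailing text after the JSON value.
try:
    json.loads(text)
    loads_failed = False
except json.JSONDecodeError:
    loads_failed = True

# raw_decode parses one JSON value and reports where it stopped.
data, end = JSONDecoder().raw_decode(text)

print(loads_failed)
print(data)
print(repr(text[end:]))  # the postamble that was ignored
```

Note that `raw_decode` expects the JSON value to begin at the index it is given, which is why the preamble has to be skipped first.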
7. Budget Enough Iterations for Multi-Step Workflows
Problem: Multi-step tool calling runs out of iterations before completing.
Root cause: Each step needs multiple LLM turns:
- Get information (tool call)
- Process results (tool call)
- Validate output (tool call)
- Fix errors if needed (tool call)
- Return final response
Recommended iteration budgets:
| Workflow Complexity | Max Iterations | Step Retry Limit |
|---|---|---|
| Simple (1-2 tools) | 5 | 2 |
| Medium (3-5 tools) | 10 | 3 |
| Complex (6+ tools) | 15 | 3 |
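A budget like this is usually enforced in the loop that drives the conversation. Below is a minimal sketch of such a loop; `call_llm` and `run_tool` are hypothetical callables you would supply (for example, a wrapper around the chat completions endpoint), and `fake_llm` is a stand-in used only to demonstrate the control flow.

```python
from typing import Callable

def run_with_budget(
    call_llm: Callable[[list], dict],
    run_tool: Callable[[dict], str],
    messages: list,
    max_iterations: int = 10,
) -> dict:
    """Drive a tool calling loop until the model stops calling tools or the budget runs out."""
    for iteration in range(1, max_iterations + 1):
        reply = call_llm(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool calls means the model produced its final answer.
            return {"status": "done", "iterations": iteration, "content": reply.get("content")}
        for call in tool_calls:
            messages.append({"role": "tool", "tool_call_id": call.get("id"), "content": run_tool(call)})
    return {"status": "budget_exhausted", "iterations": max_iterations, "content": None}

def fake_llm(messages: list) -> dict:
    """Stand-in model: makes three tool calls, then answers."""
    tool_turns = sum(1 for m in messages if m.get("role") == "tool")
    if tool_turns < 3:
        call = {"id": f"call_{tool_turns}", "function": {"name": "step", "arguments": "{}"}}
        return {"role": "assistant", "content": None, "tool_calls": [call]}
    return {"role": "assistant", "content": "finished"}

print(run_with_budget(fake_llm, lambda call: "ok", [], max_iterations=5))
print(run_with_budget(fake_llm, lambda call: "ok", [], max_iterations=2))
```

Returning an explicit `budget_exhausted` status makes it easy to tell a workflow that ran out of turns from one that finished with a poor answer.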
See guides/PROMPT_ENGINEERING_LESSONS.md for code examples for each lesson.
Multi-Step Workflow Architecture
For complex tasks, single-prompt tool calling is unreliable. Break it into steps with isolated tool sets:
Step 1: Discovery             Step 2: Configuration
+-----------------+           +---------------------+
| Tools:          |           | Tools:              |
| - search        |    -->    | - get_details       |
| - list          |           | - validate_minimal  |
| - get_info      |           | - validate_full     |
|                 |           |                     |
| Output: What    |           | Output: How         |
| components to   |           | to configure them   |
| use             |           |                     |
+-----------------+           +---------------------+
Key patterns:
- Isolated tool sets per step: each step only sees relevant tools, reducing confusion
- Pydantic schema validation: validate LLM responses structurally, not just syntactically
- Retry with error feedback: when validation fails, feed structured errors back to the LLM
- Result tracking: track whether validations passed, not just whether they were called
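The first pattern, isolated tool sets, can be sketched in a few lines. The step and tool names come from the diagram above; the `STEP_TOOLS` mapping, the `tools_for_step` helper, and the `make_tool` stub are hypothetical names used for illustration, and the repository's own orchestrator may be structured differently.

```python
# Which tool names each step is allowed to see (names from the diagram above).
STEP_TOOLS = {
    "discovery": ["search", "list", "get_info"],
    "configuration": ["get_details", "validate_minimal", "validate_full"],
}

def tools_for_step(step: str, all_tools: list[dict]) -> list[dict]:
    """Return only the tool definitions that the given step may use."""
    allowed = set(STEP_TOOLS[step])
    return [tool for tool in all_tools if tool["function"]["name"] in allowed]

def make_tool(name: str) -> dict:
    """Build a stub OpenAI-style tool definition for the demo."""
    return {
        "type": "function",
        "function": {"name": name, "description": f"{name} tool", "parameters": {"type": "object", "properties": {}}},
    }

all_tools = [make_tool(n) for names in STEP_TOOLS.values() for n in names]

# Pass only this subset as "tools" in the request for the discovery step.
discovery_tools = tools_for_step("discovery", all_tools)
print([t["function"]["name"] for t in discovery_tools])
```

Because the model never sees the configuration tools during discovery, it cannot call them prematurely.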
See guides/MULTI_STEP_WORKFLOWS.md for the full architecture. See examples/multi_step_orchestrator.py for working code.
Blackwell GPU Notes
If you're running on NVIDIA RTX 6000 Pro Blackwell (or similar Blackwell architecture):
FlashInfer Bug (SM120)
FlashInfer has known issues with Blackwell's SM120 compute architecture. Symptoms: crashes, hangs, or incorrect output.
# Workaround: Disable FlashInfer, use FlashAttention-2 instead
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
FP8 Quantization Types
Not all FP8 models use the same quantization method:
| Model | Quantization Flag | Notes |
|---|---|---|
| Hermes-3-Llama-3.1-70B-FP8 | --quantization compressed-tensors | Uses compressed-tensors format |
| Llama-3.3-70B-Instruct-FP8 | --quantization fp8_e4m3 | Native FP8, faster on Blackwell |
| Qwen2-72B-Instruct-FP8 | --quantization fp8 | Standard FP8 |
| Mistral-Nemo-FP8 | --quantization fp8 | Standard FP8 |
In our testing, using the wrong flag did not crash, but it cost performance. compressed-tensors doesn't leverage Blackwell's native FP8 acceleration.
Troubleshooting
Tool calls get cut off mid-generation
Cause: Context window too small.
Fix: Increase --max-model-len to 131072 (128K). See "The Critical Context Length Fix" above.
Model responds with text instead of tool calls
Cause: Missing --enable-auto-tool-choice flag, or system prompt doesn't instruct tool use.
Fix:
- Add --enable-auto-tool-choice to VLLM launch
- Add --tool-call-parser hermes (or appropriate parser)
- Ensure tools are passed in the API request
Very slow generation (2-3 tok/s on 70B)
Cause: Wrong quantization method or FlashInfer issues on Blackwell.
Fix:
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
Also verify you're using the correct --quantization flag for your model.
Model hallucinates tool/function names
Cause: Tool definitions are too vague, or the model is guessing from training data.
Fix:
- Include includeExamples: true in tool definitions to show real configurations
- Add existence validation after tool calls (verify the tool response is valid before proceeding)
- Use specific, descriptive tool names
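One inexpensive guard against hallucinated names is to check each tool call against the definitions you sent before executing anything. The sketch below assumes OpenAI-style tool definitions and tool calls; the function name `validate_tool_call` and the error wording are hypothetical. The returned messages can be fed back to the model as structured errors.

```python
import json

def validate_tool_call(call: dict, tools: list[dict]) -> list[str]:
    """Return problems with a tool call; an empty list means it is safe to execute."""
    known = {tool["function"]["name"]: tool["function"] for tool in tools}
    name = call["function"]["name"]
    if name not in known:
        return [f"unknown tool '{name}'; available: {sorted(known)}"]
    args = json.loads(call["function"]["arguments"])
    required = known[name].get("parameters", {}).get("required", [])
    return [f"missing required argument '{r}'" for r in required if r not in args]

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

hallucinated = {"function": {"name": "get_wether", "arguments": "{}"}}
print(validate_tool_call(hallucinated, tools))
```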
Hermes-3 tool calls don't work in Open WebUI
Cause: Open WebUI expects OpenAI-format tool calls. Hermes-3's native format (ChatML + XML) isn't compatible.
Fix: Switch to Llama-3.3-70B-Instruct which works out of the box with Open WebUI. See guides/OPEN_WEBUI_COMPATIBILITY.md.
FlashInfer crashes on Blackwell GPU
Cause: FlashInfer has known bugs with SM120 (Blackwell) compute architecture.
Fix:
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
Open WebUI Compatibility
| Model | Tool Calling via API | Tool Calling in Open WebUI |
|---|---|---|
| Hermes-3-Llama-3.1-70B | Yes | No (format incompatible) |
| Llama-3.3-70B-Instruct | Yes | Yes |
| Qwen2-72B-Instruct | Yes | Yes |
| Mistral-Nemo-12B | Yes | Yes |
If you need Open WebUI support, use Llama 3.3 or Qwen2. If you're building a custom application that talks directly to the VLLM API, all models work.
See guides/OPEN_WEBUI_COMPATIBILITY.md for details.
Verified FP8 Models
All models listed below have been verified to exist on Hugging Face and work with VLLM for tool calling:
70B+ Models (High Performance):
- NousResearch/Hermes-3-Llama-3.1-70B-FP8: Best tool calling
- nvidia/Llama-3.3-70B-Instruct-FP8: Best Open WebUI support
- RedHatAI/Qwen2-72B-Instruct-FP8: Best multilingual
12B Models (Fast Iteration):
- RedHatAI/Mistral-Nemo-Instruct-2407-FP8: 100-150 tok/s
Memory Requirements (single GPU):
- 70B FP8: ~40-50GB
- 12B FP8: ~12-15GB
Citation
If you find this guide useful, please star the repository and share it.
@misc{odmark2025vllmtoolcalling,
title={VLLM Tool Calling Guide: Open Source Models on Blackwell GPUs},
author={Joshua Eric Odmark},
year={2025},
url={https://huggingface.co/joshuaeric/vllm-tool-calling-guide}
}
Acknowledgments
- NousResearch for Hermes-3 and pioneering open source tool calling
- vLLM Project for the inference engine
- NVIDIA and Red Hat AI / NeuralMagic for FP8 quantized models
License
Apache 2.0: use freely, attribution appreciated.