---
language:
- en
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
- tool-calling
- function-calling
- vllm
- inference
- guide
- hermes
- llama
- qwen
- mistral
- fp8
- blackwell
- rtx-6000
- open-source
- multi-step-workflow
- prompt-engineering
- quantization
- deployment
base_model:
- NousResearch/Hermes-3-Llama-3.1-70B-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- RedHatAI/Qwen2-72B-Instruct-FP8
- RedHatAI/Mistral-Nemo-Instruct-2407-FP8
---
# VLLM Tool Calling Guide
**A battle-tested guide to getting tool calling working reliably with open source models on VLLM.**
This is not a model. This is a collection of production-tested configurations, prompt templates, Python examples, and hard-won lessons from building multi-step tool calling systems with open source LLMs on NVIDIA Blackwell GPUs.
Everything here was discovered through real deployment — not theory.
---
## Quick Start
**Launch VLLM with tool calling (Hermes-3 70B):**
```bash
python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
  --dtype auto \
  --quantization compressed-tensors \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 4
```
**Test it works:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```
If you see `"tool_calls"` in the response, you're good. Read on for the details.
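The same smoke test from Python, using only the standard library (a sketch assuming the server above is reachable on `localhost:8000`; the network call is left commented so you can run it when ready):

```python
# Python version of the curl smoke test above, standard library only.
import json
import urllib.request

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"],
        },
    },
}]

def build_payload(prompt: str) -> dict:
    """Build the same request body the curl example sends."""
    return {
        "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",
    }

def smoke_test(base_url: str = "http://localhost:8000/v1"):
    """POST a tool-call request; return the tool_calls list (or None)."""
    payload = build_payload("What is the weather in San Francisco?")
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"].get("tool_calls")

# tool_calls = smoke_test()  # uncomment with a live server
```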
---
## What This Repository Contains
| Directory | Contents |
|-----------|----------|
| [`configs/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/configs) | Production VLLM launch scripts for 4 models with inline documentation |
| [`examples/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/examples) | Working Python code: basic tool calls, multi-step orchestration, JSON extraction |
| [`prompts/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/prompts) | System prompt templates for tool calling (Hermes-specific and model-agnostic) |
| [`chat_templates/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/chat_templates) | Jinja2 chat templates for Hermes-3 tool calling |
| [`guides/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/guides) | Deep-dive guides on specific topics (context length, prompt engineering, troubleshooting) |
---
## Model Comparison
All models tested on NVIDIA RTX 6000 Pro Blackwell (96GB VRAM), single GPU.
| Model | Size | Quant | VLLM Parser | Speed | Memory | Context | Tool Quality | Open WebUI |
|-------|------|-------|-------------|-------|--------|---------|-------------|------------|
| **Hermes-3-Llama-3.1-70B** | 70B | FP8 | `hermes` | 25-35 tok/s | ~40GB | 128K | Excellent | No |
| **Llama-3.3-70B-Instruct** | 70B | FP8 | `llama3_json` | 60-90 tok/s | ~40GB | 128K | Excellent | Yes |
| **Qwen2-72B-Instruct** | 72B | FP8 | `hermes` | 60-90 tok/s | ~45GB | 128K | Very Good | Yes |
| **Mistral-Nemo-Instruct** | 12B | FP8 | `mistral` | 100-150 tok/s | ~15GB | 128K | Good | Yes |
**Recommendations:**
- **Best overall tool calling:** Hermes-3-Llama-3.1-70B (purpose-built for function calling)
- **Best for Open WebUI:** Llama-3.3-70B-Instruct (works out of the box)
- **Best speed/quality ratio:** Mistral-Nemo-12B (fast iterations, good enough for most tasks)
- **Best multilingual:** Qwen2-72B (strong across languages)
> See [guides/MODEL_COMPARISON.md](guides/MODEL_COMPARISON.md) for the full breakdown.
---
## The Critical Context Length Fix
**This is the #1 issue people hit with VLLM tool calling.**
VLLM defaults to short context windows. Tool calling needs much more:
```
System prompt:           3-5K tokens
Tool definitions:        2-4K tokens per tool
Conversation history:    2-10K tokens
Tool responses:          5-20K tokens
─────────────────────────────────────
Total needed:            20-40K+ tokens
```
**If your context window is 16K (the default for many configs), tool calls get silently truncated mid-generation.**
The fix:
```bash
# BEFORE (broken): Default or small context
--max-model-len 16384
# AFTER (working): Full context support
--max-model-len 131072 # 128K tokens
--max-num-seqs 4 # Reduce concurrency to fit KV cache
--max-num-batched-tokens 132000 # Match context length
--gpu-memory-utilization 0.90 # Leave headroom
```
**Memory math for 96GB GPU:**
- Model weights (FP8 70B): ~40GB
- KV cache for 128K context: ~45-50GB
- Total: fits with batch size 4
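The KV cache figure can be sanity-checked with back-of-envelope arithmetic. The architecture values below (80 layers, 8 KV heads via GQA, head dim 128) are the published Llama-70B figures; the KV cache is assumed to be stored in fp16 (2 bytes per element):

```python
# Back-of-envelope KV cache sizing for a Llama-70B-class model at 128K context.
layers, kv_heads, head_dim = 80, 8, 128
seq_len, dtype_bytes = 131072, 2                                  # 128K tokens, fp16 K/V
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # factor 2 = K and V
kv_cache_gb = bytes_per_token * seq_len / 1024**3

print(f"KV cache per full 128K sequence: ~{kv_cache_gb:.0f} GB")
```

That lands right at ~40 GB for a single full-length sequence, which is why `--max-num-seqs 4` only fits because real requests rarely use the full window simultaneously.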
> See [guides/CONTEXT_LENGTH_FIX.md](guides/CONTEXT_LENGTH_FIX.md) for the full analysis.
---
## Tool Call Formats
VLLM supports multiple tool call formats. Which one you use depends on your model:
### Hermes Format (ChatML + XML tags)
```
<|im_start|>assistant
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
<|im_end|>
```
**Parser flag:** `--tool-call-parser hermes`
**Models:** Hermes-3, Hermes-2-Pro, Qwen2
### Llama 3 JSON Format
```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```
**Parser flag:** `--tool-call-parser llama3_json`
**Models:** Llama-3.1, Llama-3.3
### Mistral Format
```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```
**Parser flag:** `--tool-call-parser mistral`
**Models:** Mistral-Nemo, Mistral-7B
**All formats are converted to OpenAI-compatible JSON by VLLM.** Your application code always receives the same standardized format regardless of which parser is used.
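Because the output is always the standard OpenAI chat-completions shape, the dispatch side of your application can be written once. A minimal sketch (`get_weather` is a stand-in tool implementation):

```python
import json

def get_weather(location: str) -> dict:
    # Stand-in tool implementation for the example.
    return {"location": location, "temp_c": 18, "conditions": "fog"}

TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch_tool_calls(message: dict) -> list:
    """Run each tool call; return tool-role messages for the next turn."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = TOOL_REGISTRY[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results

# The standardized shape VLLM returns, regardless of parser:
assistant_msg = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": '{"location": "San Francisco"}'},
    }],
}
tool_msgs = dispatch_tool_calls(assistant_msg)
```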
> See [guides/TOOL_CALL_FORMATS.md](guides/TOOL_CALL_FORMATS.md) for detailed comparison.
---
## 7 Prompt Engineering Lessons for Tool Calling
These lessons were learned through production debugging. Each one cost hours to diagnose.
### 1. LLMs Learn from Your Examples
**Problem:** LLM wraps all JSON responses in markdown code blocks (` ```json ... ``` `).
**Root cause:** Your prompt examples showed JSON inside markdown code blocks. The LLM learned to replicate the formatting.
**Fix:** Show raw JSON in all examples. Add explicit instruction: "Do NOT wrap your response in markdown code blocks."
### 2. Jinja2 Escaping Leaks into Output
**Problem:** LLM outputs `{{` instead of `{` in JSON.
**Root cause:** Your Jinja2 chat template examples used `{{` for escaping. The LLM learned to double braces.
**Fix:** Use single braces in all prompt examples. Handle template escaping separately from content.
### 3. Explicitly Limit Tool Call Blocks
**Problem:** LLM creates multiple `<tool_call>` blocks or nests them 5 levels deep.
**Root cause:** No instruction telling it not to.
**Fix:** Add: "Use ONLY ONE `<tool_call>` block per response. Do NOT create multiple blocks or nest them."
### 4. Track Validation Results, Not Just Calls
**Problem:** System checks if validation tools were *called* but not if they *passed*. LLM returns "success" with invalid output.
**Fix:**
```python
# BAD: Only tracks if called
tracking = {'validate_called': False}

# GOOD: Tracks if called AND passed
tracking = {
    'validate_called': False,
    'validate_passed': False,   # Did it return valid: true?
    'validation_errors': []     # What went wrong?
}
```
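One way to wire this up is a small helper that updates the tracking dict from the raw tool response. This is a sketch; the validator response shape (`{"valid": ..., "errors": [...]}`) is an assumption about your tools:

```python
import json

def record_validation(tracking: dict, tool_response: str) -> None:
    """Update tracking from a validation tool's raw JSON response."""
    tracking['validate_called'] = True
    try:
        result = json.loads(tool_response)
    except json.JSONDecodeError:
        # The validator itself produced garbage; record that as a failure.
        tracking['validation_errors'].append({'message': 'unparseable validator output'})
        return
    tracking['validate_passed'] = bool(result.get('valid'))
    tracking['validation_errors'].extend(result.get('errors', []))

tracking = {'validate_called': False, 'validate_passed': False, 'validation_errors': []}
record_validation(tracking, '{"valid": false, "errors": [{"message": "missing field"}]}')
```

With this in place, "did validation run" and "did validation pass" are separate signals, which is exactly the distinction the lesson is about.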
### 5. Feed Errors Back with Structure
**Problem:** Validation fails but the LLM doesn't know *what* failed or *how* to fix it.
**Fix:** Format errors with property names, error types, and suggested fixes:
```python
errors_formatted = "\n\nValidation Errors Found:\n"
for i, error in enumerate(errors, 1):
    errors_formatted += f"\n{i}. "
    if 'property' in error:
        errors_formatted += f"Property: {error['property']}\n"
    if 'message' in error:
        errors_formatted += f"   Message: {error['message']}\n"
    if 'fix' in error:
        errors_formatted += f"   Fix: {error['fix']}\n"
```
### 6. Use `raw_decode` for Robust JSON Extraction
**Problem:** LLM adds conversational text before/after the JSON: "Here is the result: {...} Let me know if you need anything else!"
**Fix:** Three-layer extraction:
```python
import json
import re
from json import JSONDecoder

def extract_json(text: str):
    """Extract the first JSON object or array from messy LLM output."""
    # Layer 1: Strip markdown code blocks
    if "```" in text:
        match = re.search(r'```(?:json)?\s*\n(.*?)\n```', text, re.DOTALL)
        if match:
            text = match.group(1).strip()
    # Layer 2: Find the first { or [, whichever comes first (skip preamble)
    if not text.startswith(('{', '[')):
        starts = [i for i in (text.find('{'), text.find('[')) if i != -1]
        if starts:
            text = text[min(starts):]
    # Layer 3: raw_decode stops at the end of valid JSON (skip postamble)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        decoder = JSONDecoder()
        data, _ = decoder.raw_decode(text)  # raises JSONDecodeError if no JSON at all
        return data
```
### 7. Budget Enough Iterations for Multi-Step Workflows
**Problem:** Multi-step tool calling runs out of iterations before completing.
**Root cause:** Each step needs multiple LLM turns:
1. Get information (tool call)
2. Process results (tool call)
3. Validate output (tool call)
4. Fix errors if needed (tool call)
5. Return final response
**Recommended iteration budgets:**
| Workflow Complexity | Max Iterations | Step Retry Limit |
|-------------------|---------------|-----------------|
| Simple (1-2 tools) | 5 | 2 |
| Medium (3-5 tools) | 10 | 3 |
| Complex (6+ tools) | 15 | 3 |
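The budgets above can be enforced in a small orchestration loop. A hedged sketch: `call_llm` (runs one chat completion) and `run_tools` (dispatches the tool calls and reports whether the step failed) are placeholders for your own code, not a real API:

```python
# Iteration-budgeted multi-step loop with a per-step retry limit.
def run_workflow(call_llm, run_tools, messages,
                 max_iterations=10, step_retry_limit=3):
    retries = 0
    for _ in range(max_iterations):
        message = call_llm(messages)
        if not message.get("tool_calls"):      # final text answer, we're done
            return message
        tool_msgs, step_failed = run_tools(message)
        messages = messages + [message] + tool_msgs
        if step_failed:
            retries += 1
            if retries > step_retry_limit:
                raise RuntimeError("step retry limit exceeded")
        else:
            retries = 0                        # reset on forward progress
    raise RuntimeError("iteration budget exhausted")

# Demo with stubs: one successful tool step, then a final answer.
script = iter([
    {"role": "assistant", "tool_calls": [{"id": "1"}]},
    {"role": "assistant", "content": "done"},
])
final = run_workflow(
    call_llm=lambda msgs: next(script),
    run_tools=lambda msg: ([{"role": "tool", "content": "ok"}], False),
    messages=[{"role": "user", "content": "go"}],
)
```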
> See [guides/PROMPT_ENGINEERING_LESSONS.md](guides/PROMPT_ENGINEERING_LESSONS.md) for code examples for each lesson.
---
## Multi-Step Workflow Architecture
For complex tasks, single-prompt tool calling is unreliable. Break it into steps with isolated tool sets:
```
Step 1: Discovery          Step 2: Configuration
┌─────────────────┐        ┌─────────────────────┐
│ Tools:          │        │ Tools:              │
│ - search        │  ──>   │ - get_details       │
│ - list          │        │ - validate_minimal  │
│ - get_info      │        │ - validate_full     │
│                 │        │                     │
│ Output: What    │        │ Output: How         │
│ components to   │        │ to configure them   │
│ use             │        │                     │
└─────────────────┘        └─────────────────────┘
```
**Key patterns:**
- **Isolated tool sets per step** — each step only sees relevant tools, reducing confusion
- **Pydantic schema validation** — validate LLM responses structurally, not just syntactically
- **Retry with error feedback** — when validation fails, feed structured errors back to the LLM
- **Result tracking** — track whether validations *passed*, not just whether they were *called*
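The validation-and-retry pattern can be sketched with the standard library alone; in real code the manual `validate` below would be a Pydantic model's `model_validate`. The schema (`name`, `replicas`) is purely illustrative:

```python
import json

REQUIRED_FIELDS = {"name": str, "replicas": int}  # illustrative schema

def validate(payload: dict) -> list:
    """Return a list of structured error strings (empty means valid)."""
    errors = []
    for field, typ in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"Property: {field} is missing")
        elif not isinstance(payload[field], typ):
            errors.append(f"Property: {field} must be {typ.__name__}")
    return errors

def ask_until_valid(call_llm, prompt, max_retries=3):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        reply = call_llm(messages)
        try:
            payload = json.loads(reply)
            errors = validate(payload)
        except json.JSONDecodeError:
            errors = ["Response was not valid JSON"]
        if not errors:
            return payload
        # Feed structured errors back instead of a bare "try again"
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": "Validation Errors Found:\n" + "\n".join(errors)})
    raise RuntimeError("validation retries exhausted")

# Demo with a stub LLM that fails once, then succeeds.
script = iter(['{"name": "web"}', '{"name": "web", "replicas": 2}'])
config = ask_until_valid(lambda msgs: next(script), "Generate a config")
```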
> See [guides/MULTI_STEP_WORKFLOWS.md](guides/MULTI_STEP_WORKFLOWS.md) for the full architecture.
> See [examples/multi_step_orchestrator.py](examples/multi_step_orchestrator.py) for working code.
---
## Blackwell GPU Notes
If you're running on NVIDIA RTX 6000 Pro Blackwell (or similar Blackwell architecture):
### FlashInfer Bug (SM120)
FlashInfer has known issues with Blackwell's SM120 compute architecture. Symptoms: crashes, hangs, or incorrect output.
```bash
# Workaround: Disable FlashInfer, use FlashAttention-2 instead
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
### FP8 Quantization Types
Not all FP8 models use the same quantization method:
| Model | Quantization Flag | Notes |
|-------|------------------|-------|
| Hermes-3-Llama-3.1-70B-FP8 | `--quantization compressed-tensors` | Uses compressed-tensors format |
| Llama-3.3-70B-Instruct-FP8 | `--quantization fp8_e4m3` | Native FP8, faster on Blackwell |
| Qwen2-72B-Instruct-FP8 | `--quantization fp8` | Standard FP8 |
| Mistral-Nemo-FP8 | `--quantization fp8` | Standard FP8 |
Using the wrong flag won't crash — but you'll lose performance. `compressed-tensors` doesn't leverage Blackwell's native FP8 acceleration.
---
## Troubleshooting
### Tool calls get cut off mid-generation
**Cause:** Context window too small.
**Fix:** Increase `--max-model-len` to 131072 (128K). See [Context Length Fix](#the-critical-context-length-fix).
### Model responds with text instead of tool calls
**Cause:** Missing `--enable-auto-tool-choice` flag, or system prompt doesn't instruct tool use.
**Fix:**
1. Add `--enable-auto-tool-choice` to VLLM launch
2. Add `--tool-call-parser hermes` (or appropriate parser)
3. Ensure tools are passed in the API request
### Very slow generation (2-3 tok/s on 70B)
**Cause:** Wrong quantization method or FlashInfer issues on Blackwell.
**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
Also verify you're using the correct `--quantization` flag for your model.
### Model hallucinates tool/function names
**Cause:** Tool definitions are too vague, or the model is guessing from training data.
**Fix:**
1. Include `includeExamples: true` in tool definitions to show real configurations
2. Add existence validation after tool calls (verify the tool response is valid before proceeding)
3. Use specific, descriptive tool names
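A cheap guard against hallucinated names is to check every requested tool against the registry before dispatching, and feed an actionable error back to the model when it misses. Standard library only; the tool names here are illustrative:

```python
import difflib

KNOWN_TOOLS = {"search_docs", "get_weather", "validate_config"}

def check_tool_name(name: str):
    """Return an error string to feed back to the LLM, or None if valid."""
    if name in KNOWN_TOOLS:
        return None
    # Suggest the closest real tool so the correction is one step, not a guess.
    close = difflib.get_close_matches(name, KNOWN_TOOLS, n=1)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    return f"Tool '{name}' does not exist.{hint} Available: {sorted(KNOWN_TOOLS)}"
```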
### Hermes-3 tool calls don't work in Open WebUI
**Cause:** Open WebUI expects OpenAI-format tool calls. Hermes-3's native format (ChatML + XML) isn't compatible.
**Fix:** Switch to Llama-3.3-70B-Instruct which works out of the box with Open WebUI. See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md).
### FlashInfer crashes on Blackwell GPU
**Cause:** FlashInfer has known bugs with SM120 (Blackwell) compute architecture.
**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
---
## Open WebUI Compatibility
| Model | Tool Calling via API | Tool Calling in Open WebUI |
|-------|---------------------|---------------------------|
| Hermes-3-Llama-3.1-70B | Yes | **No** (format incompatible) |
| Llama-3.3-70B-Instruct | Yes | Yes |
| Qwen2-72B-Instruct | Yes | Yes |
| Mistral-Nemo-12B | Yes | Yes |
If you need Open WebUI support, use Llama 3.3 or Qwen2. If you're building a custom application that talks directly to the VLLM API, all models work.
> See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md) for details.
---
## Verified FP8 Models
All models listed below have been verified to exist on Hugging Face and work with VLLM for tool calling:
**70B+ Models (High Performance):**
- [NousResearch/Hermes-3-Llama-3.1-70B-FP8](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-FP8) — Best tool calling
- [nvidia/Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8) — Best Open WebUI support
- [RedHatAI/Qwen2-72B-Instruct-FP8](https://huggingface.co/RedHatAI/Qwen2-72B-Instruct-FP8) — Best multilingual
**12B Models (Fast Iteration):**
- [RedHatAI/Mistral-Nemo-Instruct-2407-FP8](https://huggingface.co/RedHatAI/Mistral-Nemo-Instruct-2407-FP8) — 100-150 tok/s
**Memory Requirements (single GPU):**
- 70B FP8: ~40-50GB
- 12B FP8: ~12-15GB
---
## Citation
If you find this guide useful, please star the repository and share it.
```bibtex
@misc{odmark2025vllmtoolcalling,
  title={VLLM Tool Calling Guide: Open Source Models on Blackwell GPUs},
  author={Joshua Eric Odmark},
  year={2025},
  url={https://huggingface.co/joshuaeric/vllm-tool-calling-guide}
}
```
## Acknowledgments
- [NousResearch](https://huggingface.co/NousResearch) for Hermes-3 and pioneering open source tool calling
- [vLLM Project](https://github.com/vllm-project/vllm) for the inference engine
- [NVIDIA](https://huggingface.co/nvidia) and [Red Hat AI / NeuralMagic](https://huggingface.co/RedHatAI) for FP8 quantized models
## License
Apache 2.0 — use freely, attribution appreciated.