---
language:
- en
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
- tool-calling
- function-calling
- vllm
- inference
- guide
- hermes
- llama
- qwen
- mistral
- fp8
- blackwell
- rtx-6000
- open-source
- multi-step-workflow
- prompt-engineering
- quantization
- deployment
base_model:
- NousResearch/Hermes-3-Llama-3.1-70B-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- RedHatAI/Qwen2-72B-Instruct-FP8
- RedHatAI/Mistral-Nemo-Instruct-2407-FP8
---

# vLLM Tool Calling Guide

**A battle-tested guide to getting tool calling working reliably with open-source models on vLLM.**

This is not a model. It is a collection of production-tested configurations, prompt templates, Python examples, and hard-won lessons from building multi-step tool-calling systems with open-source LLMs on NVIDIA Blackwell GPUs.

Everything here was discovered through real deployment, not theory.

---

## Quick Start
**Launch vLLM with tool calling (Hermes-3 70B):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --dtype auto \
    --quantization compressed-tensors \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 4
```

**Test it works:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

If you see `"tool_calls"` in the response, you're good. Read on for the details.

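The same success check can be done in Python against the response payload. A minimal sketch of pulling tool calls out of the OpenAI-format response that vLLM returns (the `sample` payload below is illustrative, trimmed to the relevant fields):

```python
import json

def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (name, arguments) pairs out of an OpenAI-format chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # vLLM returns arguments as a JSON string, matching the OpenAI API
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls

# Shaped like the response to the curl test above (trimmed)
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                },
            }],
        },
    }],
}

print(extract_tool_calls(sample))  # [('get_weather', {'location': 'San Francisco'})]
```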
---

## What This Repository Contains

| Directory | Contents |
|-----------|----------|
| [`configs/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/configs) | Production vLLM launch scripts for 4 models with inline documentation |
| [`examples/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/examples) | Working Python code: basic tool calls, multi-step orchestration, JSON extraction |
| [`prompts/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/prompts) | System prompt templates for tool calling (Hermes-specific and model-agnostic) |
| [`chat_templates/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/chat_templates) | Jinja2 chat templates for Hermes-3 tool calling |
| [`guides/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/guides) | Deep-dive guides on specific topics (context length, prompt engineering, troubleshooting) |

---
## Model Comparison

All models tested on an NVIDIA RTX 6000 Pro Blackwell (96GB VRAM), single GPU.

| Model | Size | Quant | vLLM Parser | Speed | Memory | Context | Tool Quality | Open WebUI |
|-------|------|-------|-------------|-------|--------|---------|--------------|------------|
| **Hermes-3-Llama-3.1-70B** | 70B | FP8 | `hermes` | 25-35 tok/s | ~40GB | 128K | Excellent | No |
| **Llama-3.3-70B-Instruct** | 70B | FP8 | `llama3_json` | 60-90 tok/s | ~40GB | 128K | Excellent | Yes |
| **Qwen2-72B-Instruct** | 72B | FP8 | `hermes` | 60-90 tok/s | ~45GB | 128K | Very Good | Yes |
| **Mistral-Nemo-Instruct** | 12B | FP8 | `mistral` | 100-150 tok/s | ~15GB | 128K | Good | Yes |

**Recommendations:**
- **Best overall tool calling:** Hermes-3-Llama-3.1-70B (purpose-built for function calling)
- **Best for Open WebUI:** Llama-3.3-70B-Instruct (works out of the box)
- **Best speed/quality ratio:** Mistral-Nemo-12B (fast iterations, good enough for most tasks)
- **Best multilingual:** Qwen2-72B (strong across languages)

> See [guides/MODEL_COMPARISON.md](guides/MODEL_COMPARISON.md) for the full breakdown.

---
## The Critical Context Length Fix

**This is the #1 issue people hit with vLLM tool calling.**

Many vLLM configurations default to short context windows. Tool calling needs much more:

```
System prompt:          3-5K tokens
Tool definitions:       2-4K tokens per tool
Conversation history:   2-10K tokens
Tool responses:         5-20K tokens
─────────────────────────────────────
Total needed:           20-40K+ tokens
```

**If your context window is 16K (the default in many configs), tool calls get silently truncated mid-generation.**

The fix:

```bash
# BEFORE (broken): default or small context
--max-model-len 16384

# AFTER (working): full context support
--max-model-len 131072           # 128K tokens
--max-num-seqs 4                 # Reduce concurrency to fit the KV cache
--max-num-batched-tokens 132000  # Match the context length
--gpu-memory-utilization 0.90    # Leave headroom
```

**Memory math for a 96GB GPU:**
- Model weights (FP8 70B): ~40GB
- KV cache for 128K context: ~45-50GB
- Total: fits with batch size 4

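The budget arithmetic above can also be turned into a pre-flight check before sending a request. A minimal sketch (the token counts are rough estimates from the table, and `generation_reserve` is an assumed safety margin for the model's output, not a vLLM setting):

```python
def fits_in_context(max_model_len: int, system_tokens: int, tool_tokens: int,
                    history_tokens: int, tool_response_tokens: int,
                    generation_reserve: int = 4096) -> bool:
    """Rough pre-flight check that a tool-calling request fits the context window."""
    needed = (system_tokens + tool_tokens + history_tokens
              + tool_response_tokens + generation_reserve)
    return needed <= max_model_len

# Mid-range numbers from the budget above: 4K system prompt, 3 tools at 3K each,
# 6K conversation history, 12K of tool responses
assert not fits_in_context(16_384, 4_000, 3 * 3_000, 6_000, 12_000)  # 16K truncates
assert fits_in_context(131_072, 4_000, 3 * 3_000, 6_000, 12_000)     # 128K fits
```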
> See [guides/CONTEXT_LENGTH_FIX.md](guides/CONTEXT_LENGTH_FIX.md) for the full analysis.

---
## Tool Call Formats

vLLM supports multiple tool call formats. Which one you use depends on your model:

### Hermes Format (ChatML + XML tags)

```
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
<|im_end|>
```

**Parser flag:** `--tool-call-parser hermes`
**Models:** Hermes-3, Hermes-2-Pro, Qwen2

### Llama 3 JSON Format

```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```

**Parser flag:** `--tool-call-parser llama3_json`
**Models:** Llama-3.1, Llama-3.3

### Mistral Format

```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```

**Parser flag:** `--tool-call-parser mistral`
**Models:** Mistral-Nemo, Mistral-7B

**vLLM converts all of these formats to OpenAI-compatible JSON.** Your application code always receives the same standardized format regardless of which parser is used.

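To make that normalization concrete, here is a toy sketch of the kind of conversion the `hermes` parser performs. You never write this yourself (vLLM's built-in parser handles it server-side, and the real implementation is far more robust); it is shown only to illustrate the mapping:

```python
import json
import re

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Toy normalization: <tool_call>{...}</tool_call> blocks
    become OpenAI-style tool_call entries."""
    calls = []
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    for i, block in enumerate(re.findall(pattern, text, re.DOTALL)):
        payload = json.loads(block)
        calls.append({
            "id": f"call_{i}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect arguments as a JSON *string*
                "arguments": json.dumps(payload["arguments"]),
            },
        })
    return calls

raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "San Francisco"}}\n</tool_call>'
print(parse_hermes_tool_calls(raw)[0]["function"]["name"])  # get_weather
```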
> See [guides/TOOL_CALL_FORMATS.md](guides/TOOL_CALL_FORMATS.md) for a detailed comparison.

---
## 7 Prompt Engineering Lessons for Tool Calling

These lessons were learned through production debugging. Each one cost hours to diagnose.

### 1. LLMs Learn from Your Examples

**Problem:** The LLM wraps all JSON responses in markdown code blocks (` ```json ... ``` `).

**Root cause:** Your prompt examples showed JSON inside markdown code blocks. The LLM learned to replicate the formatting.

**Fix:** Show raw JSON in all examples. Add an explicit instruction: "Do NOT wrap your response in markdown code blocks."

### 2. Jinja2 Escaping Leaks into Output

**Problem:** The LLM outputs `{{` instead of `{` in JSON.

**Root cause:** Your Jinja2 chat template examples used `{{` for escaping. The LLM learned to double the braces.

**Fix:** Use single braces in all prompt examples. Handle template escaping separately from content.

### 3. Explicitly Limit Tool Call Blocks

**Problem:** The LLM creates multiple `<tool_call>` blocks or nests them 5 levels deep.

**Root cause:** No instruction telling it not to.

**Fix:** Add: "Use ONLY ONE `<tool_call>` block per response. Do NOT create multiple blocks or nest them."

### 4. Track Validation Results, Not Just Calls

**Problem:** The system checks whether validation tools were *called* but not whether they *passed*. The LLM returns "success" with invalid output.

**Fix:**

```python
# BAD: Only tracks if called
tracking = {'validate_called': False}

# GOOD: Tracks if called AND passed
tracking = {
    'validate_called': False,
    'validate_passed': False,  # Did it return valid: true?
    'validation_errors': []    # What went wrong?
}
```

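In practice, the tracking dict is updated from the validation tool's response, not from the fact that the tool was invoked. A minimal sketch (`validate_output` and the `valid`/`errors` response fields are hypothetical names for illustration):

```python
def record_validation(tracking: dict, tool_name: str, tool_response: dict) -> None:
    """Record whether a validation tool was called AND whether it passed."""
    if tool_name != "validate_output":  # hypothetical validation tool name
        return
    tracking["validate_called"] = True
    tracking["validate_passed"] = tool_response.get("valid") is True
    tracking["validation_errors"] = list(tool_response.get("errors", []))

tracking = {"validate_called": False, "validate_passed": False, "validation_errors": []}
record_validation(tracking, "validate_output",
                  {"valid": False, "errors": ["location: expected string, got null"]})
print(tracking["validate_passed"])  # False: "called" is not the same as "passed"
```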
### 5. Feed Errors Back with Structure

**Problem:** Validation fails, but the LLM doesn't know *what* failed or *how* to fix it.

**Fix:** Format errors with property names, error types, and suggested fixes:

```python
errors_formatted = "\n\nValidation Errors Found:\n"
for i, error in enumerate(errors, 1):
    errors_formatted += f"\n{i}. "
    if 'property' in error:
        errors_formatted += f"Property: {error['property']}\n"
    if 'message' in error:
        errors_formatted += f"   Message: {error['message']}\n"
    if 'fix' in error:
        errors_formatted += f"   Fix: {error['fix']}\n"
```

### 6. Use `raw_decode` for Robust JSON Extraction

**Problem:** The LLM adds conversational text before/after the JSON: "Here is the result: {...} Let me know if you need anything else!"

**Fix:** Three-layer extraction:

```python
import json
import re
from json import JSONDecoder

def extract_json(text: str):
    # Layer 1: Strip markdown code blocks
    if "```" in text:
        match = re.search(r'```(?:json)?\s*\n(.*?)\n```', text, re.DOTALL)
        if match:
            text = match.group(1).strip()

    # Layer 2: Skip preamble text before the first { or [
    # (take whichever opener appears first, so a top-level array
    # isn't mis-sliced at a { nested inside it)
    if not text.startswith(('{', '[')):
        starts = [i for i in (text.find('{'), text.find('[')) if i != -1]
        if starts:
            text = text[min(starts):]

    # Layer 3: raw_decode stops at the end of valid JSON (skips postamble)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        decoder = JSONDecoder()
        data, _ = decoder.raw_decode(text)
        return data
```

### 7. Budget Enough Iterations for Multi-Step Workflows

**Problem:** Multi-step tool calling runs out of iterations before completing.

**Root cause:** Each step needs multiple LLM turns:
1. Get information (tool call)
2. Process results (tool call)
3. Validate output (tool call)
4. Fix errors if needed (tool call)
5. Return the final response

**Recommended iteration budgets:**

| Workflow Complexity | Max Iterations | Step Retry Limit |
|---------------------|----------------|------------------|
| Simple (1-2 tools) | 5 | 2 |
| Medium (3-5 tools) | 10 | 3 |
| Complex (6+ tools) | 15 | 3 |
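These budgets can be enforced with a simple loop wrapper around the tool-calling turns. A sketch under assumed conventions (`llm_step` is a stand-in for one LLM turn, returning `("tool", ...)`, `("retry", ...)`, or `("final", answer)`):

```python
def run_workflow(llm_step, max_iterations: int = 10, step_retry_limit: int = 3):
    """Run LLM turns until a final answer, within iteration and retry budgets."""
    retries = 0
    for _ in range(max_iterations):
        kind, payload = llm_step()
        if kind == "final":
            return payload
        if kind == "retry":  # a validation failure sent back for a fix
            retries += 1
            if retries > step_retry_limit:
                raise RuntimeError("step retry limit exceeded")
        # "tool" turns just consume an iteration
    raise RuntimeError("iteration budget exhausted before a final answer")

# Scripted stand-in for a medium workflow: two tool turns, one retry, then done
turns = iter([("tool", None), ("tool", None), ("retry", None), ("final", "done")])
print(run_workflow(lambda: next(turns)))  # done
```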
> See [guides/PROMPT_ENGINEERING_LESSONS.md](guides/PROMPT_ENGINEERING_LESSONS.md) for code examples for each lesson.

---
## Multi-Step Workflow Architecture

For complex tasks, single-prompt tool calling is unreliable. Break the task into steps with isolated tool sets:

```
Step 1: Discovery           Step 2: Configuration
┌─────────────────┐         ┌─────────────────────┐
│ Tools:          │         │ Tools:              │
│ - search        │   ──>   │ - get_details       │
│ - list          │         │ - validate_minimal  │
│ - get_info      │         │ - validate_full     │
│                 │         │                     │
│ Output: What    │         │ Output: How         │
│ components to   │         │ to configure them   │
│ use             │         │                     │
└─────────────────┘         └─────────────────────┘
```

**Key patterns:**
- **Isolated tool sets per step:** each step only sees relevant tools, reducing confusion
- **Pydantic schema validation:** validate LLM responses structurally, not just syntactically
- **Retry with error feedback:** when validation fails, feed structured errors back to the LLM
- **Result tracking:** track whether validations *passed*, not just whether they were *called*
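The first pattern reduces to a registry keyed by step. A minimal sketch (the step and tool names mirror the diagram above; the abbreviated schemas are placeholders, not real definitions):

```python
# Step -> tool names; each step's prompt only includes its own tool schemas.
STEP_TOOLS = {
    "discovery":     ["search", "list", "get_info"],
    "configuration": ["get_details", "validate_minimal", "validate_full"],
}

# name -> OpenAI-format tool definition (abbreviated placeholders)
ALL_TOOLS = {
    name: {"type": "function", "function": {"name": name}}
    for names in STEP_TOOLS.values()
    for name in names
}

def tools_for_step(step: str) -> list[dict]:
    """Return only the tool definitions this step is allowed to see."""
    return [ALL_TOOLS[name] for name in STEP_TOOLS[step]]

print([t["function"]["name"] for t in tools_for_step("discovery")])
# ['search', 'list', 'get_info']
```

Passing `tools_for_step(step)` as the `tools` array for each request keeps irrelevant tools out of the model's view entirely, rather than asking it to ignore them.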
> See [guides/MULTI_STEP_WORKFLOWS.md](guides/MULTI_STEP_WORKFLOWS.md) for the full architecture.
> See [examples/multi_step_orchestrator.py](examples/multi_step_orchestrator.py) for working code.

---
## Blackwell GPU Notes

If you're running on an NVIDIA RTX 6000 Pro Blackwell (or a similar Blackwell-architecture GPU):

### FlashInfer Bug (SM120)

FlashInfer has known issues with Blackwell's SM120 compute architecture. Symptoms: crashes, hangs, or incorrect output.

```bash
# Workaround: disable FlashInfer and use FlashAttention-2 instead
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```

### FP8 Quantization Types

Not all FP8 models use the same quantization method:

| Model | Quantization Flag | Notes |
|-------|-------------------|-------|
| Hermes-3-Llama-3.1-70B-FP8 | `--quantization compressed-tensors` | Uses the compressed-tensors format |
| Llama-3.3-70B-Instruct-FP8 | `--quantization fp8_e4m3` | Native FP8, faster on Blackwell |
| Qwen2-72B-Instruct-FP8 | `--quantization fp8` | Standard FP8 |
| Mistral-Nemo-FP8 | `--quantization fp8` | Standard FP8 |

Using the wrong flag won't crash, but you'll lose performance: `compressed-tensors` doesn't leverage Blackwell's native FP8 acceleration.

---
## Troubleshooting

<details>
<summary><b>Tool calls get cut off mid-generation</b></summary>

**Cause:** Context window too small.

**Fix:** Increase `--max-model-len` to 131072 (128K). See [Context Length Fix](#the-critical-context-length-fix).

</details>

<details>
<summary><b>Model responds with text instead of tool calls</b></summary>

**Cause:** Missing `--enable-auto-tool-choice` flag, or a system prompt that doesn't instruct tool use.

**Fix:**
1. Add `--enable-auto-tool-choice` to the vLLM launch command
2. Add `--tool-call-parser hermes` (or the appropriate parser)
3. Ensure tools are passed in the API request

</details>

<details>
<summary><b>Very slow generation (2-3 tok/s on 70B)</b></summary>

**Cause:** Wrong quantization method or FlashInfer issues on Blackwell.

**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
Also verify you're using the correct `--quantization` flag for your model.

</details>

<details>
<summary><b>Model hallucinates tool/function names</b></summary>

**Cause:** Tool definitions are too vague, or the model is guessing from training data.

**Fix:**
1. Include `includeExamples: true` in tool definitions to show real configurations
2. Add existence validation after tool calls (verify the tool response is valid before proceeding)
3. Use specific, descriptive tool names

</details>

<details>
<summary><b>Hermes-3 tool calls don't work in Open WebUI</b></summary>

**Cause:** Open WebUI expects OpenAI-format tool calls. Hermes-3's native format (ChatML + XML) isn't compatible.

**Fix:** Switch to Llama-3.3-70B-Instruct, which works out of the box with Open WebUI. See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md).

</details>

<details>
<summary><b>FlashInfer crashes on a Blackwell GPU</b></summary>

**Cause:** FlashInfer has known bugs with the SM120 (Blackwell) compute architecture.

**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```

</details>

---
## Open WebUI Compatibility

| Model | Tool Calling via API | Tool Calling in Open WebUI |
|-------|----------------------|----------------------------|
| Hermes-3-Llama-3.1-70B | Yes | **No** (format incompatible) |
| Llama-3.3-70B-Instruct | Yes | Yes |
| Qwen2-72B-Instruct | Yes | Yes |
| Mistral-Nemo-12B | Yes | Yes |

If you need Open WebUI support, use Llama 3.3 or Qwen2. If you're building a custom application that talks directly to the vLLM API, all four models work.

> See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md) for details.

---
## Verified FP8 Models

All models listed below have been verified to exist on Hugging Face and to work with vLLM for tool calling:

**70B+ Models (High Performance):**
- [NousResearch/Hermes-3-Llama-3.1-70B-FP8](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-FP8) (best tool calling)
- [nvidia/Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8) (best Open WebUI support)
- [RedHatAI/Qwen2-72B-Instruct-FP8](https://huggingface.co/RedHatAI/Qwen2-72B-Instruct-FP8) (best multilingual)

**12B Models (Fast Iteration):**
- [RedHatAI/Mistral-Nemo-Instruct-2407-FP8](https://huggingface.co/RedHatAI/Mistral-Nemo-Instruct-2407-FP8) (100-150 tok/s)

**Memory Requirements (single GPU):**
- 70B FP8: ~40-50GB
- 12B FP8: ~12-15GB

---
## Citation

If you find this guide useful, please star the repository and share it.

```bibtex
@misc{odmark2025vllmtoolcalling,
  title={VLLM Tool Calling Guide: Open Source Models on Blackwell GPUs},
  author={Joshua Eric Odmark},
  year={2025},
  url={https://huggingface.co/joshuaeric/vllm-tool-calling-guide}
}
```

## Acknowledgments

- [NousResearch](https://huggingface.co/NousResearch) for Hermes-3 and for pioneering open-source tool calling
- [vLLM Project](https://github.com/vllm-project/vllm) for the inference engine
- [NVIDIA](https://huggingface.co/nvidia) and [Red Hat AI / NeuralMagic](https://huggingface.co/RedHatAI) for the FP8-quantized models

## License

Apache 2.0. Use freely; attribution appreciated.