--- language: - en license: apache-2.0 library_name: vllm pipeline_tag: text-generation tags: - tool-calling - function-calling - vllm - inference - guide - hermes - llama - qwen - mistral - fp8 - blackwell - rtx-6000 - open-source - multi-step-workflow - prompt-engineering - quantization - deployment base_model: - NousResearch/Hermes-3-Llama-3.1-70B-FP8 - nvidia/Llama-3.3-70B-Instruct-FP8 - RedHatAI/Qwen2-72B-Instruct-FP8 - RedHatAI/Mistral-Nemo-Instruct-2407-FP8 --- # VLLM Tool Calling Guide **A battle-tested guide to getting tool calling working reliably with open source models on VLLM.** This is not a model. This is a collection of production-tested configurations, prompt templates, Python examples, and hard-won lessons from building multi-step tool calling systems with open source LLMs on NVIDIA Blackwell GPUs. Everything here was discovered through real deployment — not theory. --- ## Quick Start **Launch VLLM with tool calling (Hermes-3 70B):** ```bash python -m vllm.entrypoints.openai.api_server \ --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \ --dtype auto \ --quantization compressed-tensors \ --max-model-len 131072 \ --enable-auto-tool-choice \ --tool-call-parser hermes \ --gpu-memory-utilization 0.90 \ --max-num-seqs 4 ``` **Test it works:** ```bash curl http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8", "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}], "tools": [{ "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string", "description": "City name"} }, "required": ["location"] } } }], "tool_choice": "auto" }' ``` If you see `"tool_calls"` in the response, you're good. Read on for the details. 
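The same smoke test from Python, using only the standard library. This is a sketch: the URL and model name are assumptions and must match your launch command.

```python
import json
import urllib.request

# Assumed endpoint — change to wherever your VLLM server is listening.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_request(user_message: str) -> dict:
    """Build the same OpenAI-compatible request as the curl example above."""
    return {
        "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string", "description": "City name"}
                    },
                    "required": ["location"],
                },
            },
        }],
        "tool_choice": "auto",
    }

def send(payload: dict) -> dict:
    """POST the request to the VLLM server (requires a running server)."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Check the parsed response for a `tool_calls` entry, exactly as with the curl test.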
--- ## What This Repository Contains | Directory | Contents | |-----------|----------| | [`configs/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/configs) | Production VLLM launch scripts for 4 models with inline documentation | | [`examples/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/examples) | Working Python code: basic tool calls, multi-step orchestration, JSON extraction | | [`prompts/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/prompts) | System prompt templates for tool calling (Hermes-specific and model-agnostic) | | [`chat_templates/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/chat_templates) | Jinja2 chat templates for Hermes-3 tool calling | | [`guides/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/guides) | Deep-dive guides on specific topics (context length, prompt engineering, troubleshooting) | --- ## Model Comparison All models tested on NVIDIA RTX 6000 Pro Blackwell (96GB VRAM), single GPU. 
| Model | Size | Quant | VLLM Parser | Speed | Memory | Context | Tool Quality | Open WebUI | |-------|------|-------|-------------|-------|--------|---------|-------------|------------| | **Hermes-3-Llama-3.1-70B** | 70B | FP8 | `hermes` | 25-35 tok/s | ~40GB | 128K | Excellent | No | | **Llama-3.3-70B-Instruct** | 70B | FP8 | `llama3_json` | 60-90 tok/s | ~40GB | 128K | Excellent | Yes | | **Qwen2-72B-Instruct** | 72B | FP8 | `hermes` | 60-90 tok/s | ~45GB | 128K | Very Good | Yes | | **Mistral-Nemo-Instruct** | 12B | FP8 | `mistral` | 100-150 tok/s | ~15GB | 128K | Good | Yes | **Recommendations:** - **Best overall tool calling:** Hermes-3-Llama-3.1-70B (purpose-built for function calling) - **Best for Open WebUI:** Llama-3.3-70B-Instruct (works out of the box) - **Best speed/quality ratio:** Mistral-Nemo-12B (fast iterations, good enough for most tasks) - **Best multilingual:** Qwen2-72B (strong across languages) > See [guides/MODEL_COMPARISON.md](guides/MODEL_COMPARISON.md) for the full breakdown. --- ## The Critical Context Length Fix **This is the #1 issue people hit with VLLM tool calling.** VLLM defaults to short context windows. 
Tool calling needs much more:

```
System prompt:          3-5K tokens
Tool definitions:       2-4K tokens per tool
Conversation history:   2-10K tokens
Tool responses:         5-20K tokens
─────────────────────────────────
Total needed:           20-40K+ tokens
```

**If your context window is 16K (the default for many configs), tool calls get silently truncated mid-generation.**

The fix:

```bash
# BEFORE (broken): Default or small context
--max-model-len 16384

# AFTER (working): Full context support
--max-model-len 131072              # 128K tokens
--max-num-seqs 4                    # Reduce concurrency to fit KV cache
--max-num-batched-tokens 132000     # Match context length
--gpu-memory-utilization 0.90      # Leave headroom
```

**Memory math for 96GB GPU:**
- Model weights (FP8 70B): ~40GB
- KV cache for 128K context: ~45-50GB
- Total: fits with batch size 4

> See [guides/CONTEXT_LENGTH_FIX.md](guides/CONTEXT_LENGTH_FIX.md) for the full analysis.

---

## Tool Call Formats

VLLM supports multiple tool call formats. Which one you use depends on your model:

### Hermes Format (ChatML + XML tags)

```
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
<|im_end|>
```

**Parser flag:** `--tool-call-parser hermes`
**Models:** Hermes-3, Hermes-2-Pro, Qwen2

### Llama 3 JSON Format

```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```

**Parser flag:** `--tool-call-parser llama3_json`
**Models:** Llama-3.1, Llama-3.3

### Mistral Format

```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```

**Parser flag:** `--tool-call-parser mistral`
**Models:** Mistral-Nemo, Mistral-7B

**All formats are converted to OpenAI-compatible JSON by VLLM.** Your application code always receives the same standardized format regardless of which parser is used.

> See [guides/TOOL_CALL_FORMATS.md](guides/TOOL_CALL_FORMATS.md) for a detailed comparison.

---

## 7 Prompt Engineering Lessons for Tool Calling

These lessons were learned through production debugging.
Each one cost hours to diagnose.

### 1. LLMs Learn from Your Examples

**Problem:** LLM wraps all JSON responses in markdown code blocks (` ```json ... ``` `).

**Root cause:** Your prompt examples showed JSON inside markdown code blocks. The LLM learned to replicate the formatting.

**Fix:** Show raw JSON in all examples. Add an explicit instruction: "Do NOT wrap your response in markdown code blocks."

### 2. Jinja2 Escaping Leaks into Output

**Problem:** LLM outputs `{{` instead of `{` in JSON.

**Root cause:** Your Jinja2 chat template examples used `{{` for escaping. The LLM learned to double braces.

**Fix:** Use single braces in all prompt examples. Handle template escaping separately from content.

### 3. Explicitly Limit Tool Call Blocks

**Problem:** LLM creates multiple `<tool_call>` blocks or nests them 5 levels deep.

**Root cause:** No instruction telling it not to.

**Fix:** Add: "Use ONLY ONE `<tool_call>` block per response. Do NOT create multiple blocks or nest them."

### 4. Track Validation Results, Not Just Calls

**Problem:** System checks if validation tools were *called* but not if they *passed*. LLM returns "success" with invalid output.

**Fix:**

```python
# BAD: Only tracks if called
tracking = {'validate_called': False}

# GOOD: Tracks if called AND passed
tracking = {
    'validate_called': False,
    'validate_passed': False,   # Did it return valid: true?
    'validation_errors': []     # What went wrong?
}
```

### 5. Feed Errors Back with Structure

**Problem:** Validation fails but the LLM doesn't know *what* failed or *how* to fix it.

**Fix:** Format errors with property names, error types, and suggested fixes:

```python
errors_formatted = "\n\nValidation Errors Found:\n"
for i, error in enumerate(errors, 1):
    errors_formatted += f"\n{i}. "
    if 'property' in error:
        errors_formatted += f"Property: {error['property']}\n"
    if 'message' in error:
        errors_formatted += f"   Message: {error['message']}\n"
    if 'fix' in error:
        errors_formatted += f"   Fix: {error['fix']}\n"
```

### 6.
Use `raw_decode` for Robust JSON Extraction

**Problem:** LLM adds conversational text before/after the JSON: "Here is the result: {...} Let me know if you need anything else!"

**Fix:** Three-layer extraction:

```python
import json
import re
from json import JSONDecoder

def extract_json(text: str):
    # Layer 1: Strip markdown code blocks
    if "```" in text:
        match = re.search(r'```(?:json)?\s*\n(.*?)\n```', text, re.DOTALL)
        if match:
            text = match.group(1).strip()

    # Layer 2: Jump to the earliest { or [ (skip preamble)
    text = text.strip()
    if not text.startswith(('{', '[')):
        indices = [i for i in (text.find('{'), text.find('[')) if i != -1]
        if indices:
            text = text[min(indices):]

    # Layer 3: raw_decode stops at the end of valid JSON (skip postamble)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        decoder = JSONDecoder()
        data, _ = decoder.raw_decode(text)
        return data
```

### 7. Budget Enough Iterations for Multi-Step Workflows

**Problem:** Multi-step tool calling runs out of iterations before completing.

**Root cause:** Each step needs multiple LLM turns:

1. Get information (tool call)
2. Process results (tool call)
3. Validate output (tool call)
4. Fix errors if needed (tool call)
5. Return final response

**Recommended iteration budgets:**

| Workflow Complexity | Max Iterations | Step Retry Limit |
|-------------------|---------------|-----------------|
| Simple (1-2 tools) | 5 | 2 |
| Medium (3-5 tools) | 10 | 3 |
| Complex (6+ tools) | 15 | 3 |

> See [guides/PROMPT_ENGINEERING_LESSONS.md](guides/PROMPT_ENGINEERING_LESSONS.md) for code examples for each lesson.

---

## Multi-Step Workflow Architecture

For complex tasks, single-prompt tool calling is unreliable.
Break it into steps with isolated tool sets: ``` Step 1: Discovery Step 2: Configuration ┌─────────────────┐ ┌─────────────────────┐ │ Tools: │ │ Tools: │ │ - search │ ──> │ - get_details │ │ - list │ │ - validate_minimal │ │ - get_info │ │ - validate_full │ │ │ │ │ │ Output: What │ │ Output: How │ │ components to │ │ to configure them │ │ use │ │ │ └─────────────────┘ └─────────────────────┘ ``` **Key patterns:** - **Isolated tool sets per step** — each step only sees relevant tools, reducing confusion - **Pydantic schema validation** — validate LLM responses structurally, not just syntactically - **Retry with error feedback** — when validation fails, feed structured errors back to the LLM - **Result tracking** — track whether validations *passed*, not just whether they were *called* > See [guides/MULTI_STEP_WORKFLOWS.md](guides/MULTI_STEP_WORKFLOWS.md) for the full architecture. > See [examples/multi_step_orchestrator.py](examples/multi_step_orchestrator.py) for working code. --- ## Blackwell GPU Notes If you're running on NVIDIA RTX 6000 Pro Blackwell (or similar Blackwell architecture): ### FlashInfer Bug (SM120) FlashInfer has known issues with Blackwell's SM120 compute architecture. Symptoms: crashes, hangs, or incorrect output. ```bash # Workaround: Disable FlashInfer, use FlashAttention-2 instead export VLLM_ATTENTION_BACKEND=FLASH_ATTN export VLLM_USE_FLASHINFER=0 ``` ### FP8 Quantization Types Not all FP8 models use the same quantization method: | Model | Quantization Flag | Notes | |-------|------------------|-------| | Hermes-3-Llama-3.1-70B-FP8 | `--quantization compressed-tensors` | Uses compressed-tensors format | | Llama-3.3-70B-Instruct-FP8 | `--quantization fp8_e4m3` | Native FP8, faster on Blackwell | | Qwen2-72B-Instruct-FP8 | `--quantization fp8` | Standard FP8 | | Mistral-Nemo-FP8 | `--quantization fp8` | Standard FP8 | Using the wrong flag won't crash — but you'll lose performance. 
`compressed-tensors` doesn't leverage Blackwell's native FP8 acceleration. --- ## Troubleshooting
### Tool calls get cut off mid-generation

**Cause:** Context window too small.

**Fix:** Increase `--max-model-len` to 131072 (128K). See [Context Length Fix](#the-critical-context-length-fix).
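As a quick sanity check, you can sum the rough per-component estimates from the context-length section above. The default values (and the 4K generation headroom) are illustrative assumptions, not measurements:

```python
def context_budget(n_tools: int,
                   system_prompt: int = 5_000,     # upper end of the 3-5K estimate
                   per_tool: int = 4_000,          # upper end of 2-4K per tool
                   history: int = 10_000,          # upper end of 2-10K
                   tool_responses: int = 20_000,   # upper end of 5-20K
                   generation: int = 4_000) -> int:  # assumed headroom for output
    """Worst-case token budget for a tool-calling request."""
    return system_prompt + n_tools * per_tool + history + tool_responses + generation
```

If `context_budget(n_tools)` exceeds your `--max-model-len`, truncation is likely.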
### Model responds with text instead of tool calls

**Cause:** Missing `--enable-auto-tool-choice` flag, or system prompt doesn't instruct tool use.

**Fix:**
1. Add `--enable-auto-tool-choice` to the VLLM launch command
2. Add `--tool-call-parser hermes` (or the appropriate parser)
3. Ensure tools are passed in the API request
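A defensive client-side check for this failure mode. The message dict here follows the OpenAI chat completion shape that VLLM returns:

```python
def get_tool_calls(message: dict) -> list:
    """Return the tool calls from an assistant message, or [] if there are none."""
    return message.get("tool_calls") or []

def responded_with_text(message: dict) -> bool:
    """True if the model answered in plain text instead of calling a tool."""
    return not get_tool_calls(message) and bool(message.get("content"))
```

If `responded_with_text` keeps firing, recheck the launch flags and the request's `tools` field before blaming the model.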
### Very slow generation (2-3 tok/s on 70B)

**Cause:** Wrong quantization method or FlashInfer issues on Blackwell.

**Fix:**

```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```

Also verify you're using the correct `--quantization` flag for your model.
### Model hallucinates tool/function names

**Cause:** Tool definitions are too vague, or the model is guessing from training data.

**Fix:**
1. Include `includeExamples: true` in tool definitions to show real configurations
2. Add existence validation after tool calls (verify the tool response is valid before proceeding)
3. Use specific, descriptive tool names
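A minimal sketch of that existence validation: check each returned tool call against your registered tool names before dispatching, and feed any errors back to the model as structured feedback. The tool names and message wording here are hypothetical examples:

```python
import json

def validate_tool_call(call: dict, allowed: set) -> list:
    """Return error strings for one OpenAI-format tool call; [] means safe to dispatch."""
    errors = []
    fn = call.get("function", {})
    name = fn.get("name")
    if name not in allowed:
        errors.append(f"Unknown tool '{name}' — the model may have invented it")
    try:
        # OpenAI-format tool calls carry arguments as a JSON string
        json.loads(fn.get("arguments", "") or "{}")
    except json.JSONDecodeError as exc:
        errors.append(f"Arguments are not valid JSON: {exc}")
    return errors
```

Any non-empty result should go back to the LLM in the structured error format from lesson 5, not be silently dropped.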
### Hermes-3 tool calls don't work in Open WebUI

**Cause:** Open WebUI expects OpenAI-format tool calls. Hermes-3's native format (ChatML + XML) isn't compatible.

**Fix:** Switch to Llama-3.3-70B-Instruct, which works out of the box with Open WebUI. See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md).
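To see why the raw output needs server-side conversion at all, here is a simplified sketch of the normalization VLLM's `hermes` parser performs (the real parser also handles streaming and malformed output):

```python
import json
import re

# Hermes-3 emits tool calls as JSON wrapped in <tool_call> XML tags
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list:
    """Extract <tool_call> blocks from raw ChatML output, OpenAI-normalized."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1))
        calls.append({
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI format carries arguments as a JSON *string*
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls
```

A frontend that only understands the OpenAI shape cannot consume the raw `<tool_call>` text, which is exactly the mismatch Open WebUI hits.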
### FlashInfer crashes on Blackwell GPU

**Cause:** FlashInfer has known bugs with the SM120 (Blackwell) compute architecture.

**Fix:**

```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
--- ## Open WebUI Compatibility | Model | Tool Calling via API | Tool Calling in Open WebUI | |-------|---------------------|---------------------------| | Hermes-3-Llama-3.1-70B | Yes | **No** (format incompatible) | | Llama-3.3-70B-Instruct | Yes | Yes | | Qwen2-72B-Instruct | Yes | Yes | | Mistral-Nemo-12B | Yes | Yes | If you need Open WebUI support, use Llama 3.3 or Qwen2. If you're building a custom application that talks directly to the VLLM API, all models work. > See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md) for details. --- ## Verified FP8 Models All models listed below have been verified to exist on Hugging Face and work with VLLM for tool calling: **70B+ Models (High Performance):** - [NousResearch/Hermes-3-Llama-3.1-70B-FP8](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-FP8) — Best tool calling - [nvidia/Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8) — Best Open WebUI support - [RedHatAI/Qwen2-72B-Instruct-FP8](https://huggingface.co/RedHatAI/Qwen2-72B-Instruct-FP8) — Best multilingual **12B Models (Fast Iteration):** - [RedHatAI/Mistral-Nemo-Instruct-2407-FP8](https://huggingface.co/RedHatAI/Mistral-Nemo-Instruct-2407-FP8) — 100-150 tok/s **Memory Requirements (single GPU):** - 70B FP8: ~40-50GB - 12B FP8: ~12-15GB --- ## Citation If you find this guide useful, please star the repository and share it. 
```bibtex @misc{odmark2025vllmtoolcalling, title={VLLM Tool Calling Guide: Open Source Models on Blackwell GPUs}, author={Joshua Eric Odmark}, year={2025}, url={https://huggingface.co/joshuaeric/vllm-tool-calling-guide} } ``` ## Acknowledgments - [NousResearch](https://huggingface.co/NousResearch) for Hermes-3 and pioneering open source tool calling - [vLLM Project](https://github.com/vllm-project/vllm) for the inference engine - [NVIDIA](https://huggingface.co/nvidia) and [Red Hat AI / NeuralMagic](https://huggingface.co/RedHatAI) for FP8 quantized models ## License Apache 2.0 — use freely, attribution appreciated.