---
language:
- en
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
- tool-calling
- function-calling
- vllm
- inference
- guide
- hermes
- llama
- qwen
- mistral
- fp8
- blackwell
- rtx-6000
- open-source
- multi-step-workflow
- prompt-engineering
- quantization
- deployment
base_model:
- NousResearch/Hermes-3-Llama-3.1-70B-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- RedHatAI/Qwen2-72B-Instruct-FP8
- RedHatAI/Mistral-Nemo-Instruct-2407-FP8
---

# vLLM Tool Calling Guide

**A battle-tested guide to getting tool calling working reliably with open-source models on vLLM.**

This is not a model. It is a collection of production-tested configurations, prompt templates, Python examples, and hard-won lessons from building multi-step tool-calling systems with open-source LLMs on NVIDIA Blackwell GPUs.

Everything here was discovered through real deployment, not theory.

---

## Quick Start
**Launch vLLM with tool calling (Hermes-3 70B):**

```bash
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --dtype auto \
    --quantization compressed-tensors \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 4
```

**Test it works:**

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```

If you see `"tool_calls"` in the response, you're good. Read on for the details.

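The same success check can be done in Python against the response payload. A minimal sketch of pulling tool calls out of the OpenAI-format response that vLLM returns (the `sample` payload below is illustrative, trimmed to the relevant fields):

```python
import json

def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (name, arguments) pairs out of an OpenAI-format chat response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # vLLM returns arguments as a JSON string, matching the OpenAI API
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls

# Shaped like the response to the curl test above (trimmed)
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                },
            }],
        },
    }],
}

print(extract_tool_calls(sample))  # [('get_weather', {'location': 'San Francisco'})]
```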
---

## What This Repository Contains

| Directory | Contents |
|-----------|----------|
| [`configs/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/configs) | Production vLLM launch scripts for 4 models with inline documentation |
| [`examples/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/examples) | Working Python code: basic tool calls, multi-step orchestration, JSON extraction |
| [`prompts/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/prompts) | System prompt templates for tool calling (Hermes-specific and model-agnostic) |
| [`chat_templates/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/chat_templates) | Jinja2 chat templates for Hermes-3 tool calling |
| [`guides/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/guides) | Deep-dive guides on specific topics (context length, prompt engineering, troubleshooting) |

---
## Model Comparison

All models tested on an NVIDIA RTX 6000 Pro Blackwell (96GB VRAM), single GPU.

| Model | Size | Quant | vLLM Parser | Speed | Memory | Context | Tool Quality | Open WebUI |
|-------|------|-------|-------------|-------|--------|---------|--------------|------------|
| **Hermes-3-Llama-3.1-70B** | 70B | FP8 | `hermes` | 25-35 tok/s | ~40GB | 128K | Excellent | No |
| **Llama-3.3-70B-Instruct** | 70B | FP8 | `llama3_json` | 60-90 tok/s | ~40GB | 128K | Excellent | Yes |
| **Qwen2-72B-Instruct** | 72B | FP8 | `hermes` | 60-90 tok/s | ~45GB | 128K | Very Good | Yes |
| **Mistral-Nemo-Instruct** | 12B | FP8 | `mistral` | 100-150 tok/s | ~15GB | 128K | Good | Yes |

**Recommendations:**
- **Best overall tool calling:** Hermes-3-Llama-3.1-70B (purpose-built for function calling)
- **Best for Open WebUI:** Llama-3.3-70B-Instruct (works out of the box)
- **Best speed/quality ratio:** Mistral-Nemo-12B (fast iterations, good enough for most tasks)
- **Best multilingual:** Qwen2-72B (strong across languages)

> See [guides/MODEL_COMPARISON.md](guides/MODEL_COMPARISON.md) for the full breakdown.

---
## The Critical Context Length Fix

**This is the #1 issue people hit with vLLM tool calling.**

Many vLLM configurations default to short context windows. Tool calling needs much more:

```
System prompt:          3-5K tokens
Tool definitions:       2-4K tokens per tool
Conversation history:   2-10K tokens
Tool responses:         5-20K tokens
─────────────────────────────────────
Total needed:           20-40K+ tokens
```

**If your context window is 16K (the default in many configs), tool calls get silently truncated mid-generation.**

The fix:

```bash
# BEFORE (broken): default or small context
--max-model-len 16384

# AFTER (working): full context support
--max-model-len 131072           # 128K tokens
--max-num-seqs 4                 # Reduce concurrency to fit the KV cache
--max-num-batched-tokens 132000  # Match the context length
--gpu-memory-utilization 0.90    # Leave headroom
```

**Memory math for a 96GB GPU:**
- Model weights (FP8 70B): ~40GB
- KV cache for 128K context: ~45-50GB
- Total: fits with batch size 4

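The budget arithmetic above can also be turned into a pre-flight check before sending a request. A minimal sketch (the token counts are rough estimates from the table, and `generation_reserve` is an assumed safety margin for the model's output, not a vLLM setting):

```python
def fits_in_context(max_model_len: int, system_tokens: int, tool_tokens: int,
                    history_tokens: int, tool_response_tokens: int,
                    generation_reserve: int = 4096) -> bool:
    """Rough pre-flight check that a tool-calling request fits the context window."""
    needed = (system_tokens + tool_tokens + history_tokens
              + tool_response_tokens + generation_reserve)
    return needed <= max_model_len

# Mid-range numbers from the budget above: 4K system prompt, 3 tools at 3K each,
# 6K conversation history, 12K of tool responses
assert not fits_in_context(16_384, 4_000, 3 * 3_000, 6_000, 12_000)  # 16K truncates
assert fits_in_context(131_072, 4_000, 3 * 3_000, 6_000, 12_000)     # 128K fits
```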
> See [guides/CONTEXT_LENGTH_FIX.md](guides/CONTEXT_LENGTH_FIX.md) for the full analysis.

---
## Tool Call Formats

vLLM supports multiple tool call formats. Which one you use depends on your model:

### Hermes Format (ChatML + XML tags)

```
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
<|im_end|>
```

**Parser flag:** `--tool-call-parser hermes`
**Models:** Hermes-3, Hermes-2-Pro, Qwen2

### Llama 3 JSON Format

```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```

**Parser flag:** `--tool-call-parser llama3_json`
**Models:** Llama-3.1, Llama-3.3

### Mistral Format

```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```

**Parser flag:** `--tool-call-parser mistral`
**Models:** Mistral-Nemo, Mistral-7B

**vLLM converts all of these formats to OpenAI-compatible JSON.** Your application code always receives the same standardized format regardless of which parser is used.

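To make that normalization concrete, here is a toy sketch of the kind of conversion the `hermes` parser performs. You never write this yourself (vLLM's built-in parser handles it server-side, and the real implementation is far more robust); it is shown only to illustrate the mapping:

```python
import json
import re

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Toy normalization: <tool_call>{...}</tool_call> blocks
    become OpenAI-style tool_call entries."""
    calls = []
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    for i, block in enumerate(re.findall(pattern, text, re.DOTALL)):
        payload = json.loads(block)
        calls.append({
            "id": f"call_{i}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect arguments as a JSON *string*
                "arguments": json.dumps(payload["arguments"]),
            },
        })
    return calls

raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "San Francisco"}}\n</tool_call>'
print(parse_hermes_tool_calls(raw)[0]["function"]["name"])  # get_weather
```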
> See [guides/TOOL_CALL_FORMATS.md](guides/TOOL_CALL_FORMATS.md) for a detailed comparison.

---
## 7 Prompt Engineering Lessons for Tool Calling

These lessons were learned through production debugging. Each one cost hours to diagnose.

### 1. LLMs Learn from Your Examples

**Problem:** The LLM wraps all JSON responses in markdown code blocks (` ```json ... ``` `).

**Root cause:** Your prompt examples showed JSON inside markdown code blocks. The LLM learned to replicate the formatting.

**Fix:** Show raw JSON in all examples. Add an explicit instruction: "Do NOT wrap your response in markdown code blocks."

### 2. Jinja2 Escaping Leaks into Output

**Problem:** The LLM outputs `{{` instead of `{` in JSON.

**Root cause:** Your Jinja2 chat template examples used `{{` for escaping. The LLM learned to double the braces.

**Fix:** Use single braces in all prompt examples. Handle template escaping separately from content.

### 3. Explicitly Limit Tool Call Blocks

**Problem:** The LLM creates multiple `<tool_call>` blocks or nests them 5 levels deep.

**Root cause:** No instruction telling it not to.

**Fix:** Add: "Use ONLY ONE `<tool_call>` block per response. Do NOT create multiple blocks or nest them."

### 4. Track Validation Results, Not Just Calls

**Problem:** The system checks whether validation tools were *called* but not whether they *passed*. The LLM returns "success" with invalid output.

**Fix:**

```python
# BAD: Only tracks if called
tracking = {'validate_called': False}

# GOOD: Tracks if called AND passed
tracking = {
    'validate_called': False,
    'validate_passed': False,  # Did it return valid: true?
    'validation_errors': []    # What went wrong?
}
```

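In practice, the tracking dict is updated from the validation tool's response, not from the fact that the tool was invoked. A minimal sketch (`validate_output` and the `valid`/`errors` response fields are hypothetical names for illustration):

```python
def record_validation(tracking: dict, tool_name: str, tool_response: dict) -> None:
    """Record whether a validation tool was called AND whether it passed."""
    if tool_name != "validate_output":  # hypothetical validation tool name
        return
    tracking["validate_called"] = True
    tracking["validate_passed"] = tool_response.get("valid") is True
    tracking["validation_errors"] = list(tool_response.get("errors", []))

tracking = {"validate_called": False, "validate_passed": False, "validation_errors": []}
record_validation(tracking, "validate_output",
                  {"valid": False, "errors": ["location: expected string, got null"]})
print(tracking["validate_passed"])  # False: "called" is not the same as "passed"
```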
### 5. Feed Errors Back with Structure

**Problem:** Validation fails, but the LLM doesn't know *what* failed or *how* to fix it.

**Fix:** Format errors with property names, error types, and suggested fixes:

```python
errors_formatted = "\n\nValidation Errors Found:\n"
for i, error in enumerate(errors, 1):
    errors_formatted += f"\n{i}. "
    if 'property' in error:
        errors_formatted += f"Property: {error['property']}\n"
    if 'message' in error:
        errors_formatted += f"   Message: {error['message']}\n"
    if 'fix' in error:
        errors_formatted += f"   Fix: {error['fix']}\n"
```

### 6. Use `raw_decode` for Robust JSON Extraction

**Problem:** The LLM adds conversational text before/after the JSON: "Here is the result: {...} Let me know if you need anything else!"

**Fix:** Three-layer extraction:

```python
import json
import re
from json import JSONDecoder

def extract_json(text: str):
    # Layer 1: Strip markdown code blocks
    if "```" in text:
        match = re.search(r'```(?:json)?\s*\n(.*?)\n```', text, re.DOTALL)
        if match:
            text = match.group(1).strip()

    # Layer 2: Skip preamble text before the first { or [
    # (take whichever opener appears first, so a top-level array
    # isn't mis-sliced at a { nested inside it)
    if not text.startswith(('{', '[')):
        starts = [i for i in (text.find('{'), text.find('[')) if i != -1]
        if starts:
            text = text[min(starts):]

    # Layer 3: raw_decode stops at the end of valid JSON (skips postamble)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        decoder = JSONDecoder()
        data, _ = decoder.raw_decode(text)
        return data
```

### 7. Budget Enough Iterations for Multi-Step Workflows

**Problem:** Multi-step tool calling runs out of iterations before completing.

**Root cause:** Each step needs multiple LLM turns:
1. Get information (tool call)
2. Process results (tool call)
3. Validate output (tool call)
4. Fix errors if needed (tool call)
5. Return the final response

**Recommended iteration budgets:**

| Workflow Complexity | Max Iterations | Step Retry Limit |
|---------------------|----------------|------------------|
| Simple (1-2 tools) | 5 | 2 |
| Medium (3-5 tools) | 10 | 3 |
| Complex (6+ tools) | 15 | 3 |
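These budgets can be enforced with a simple loop wrapper around the tool-calling turns. A sketch under assumed conventions (`llm_step` is a stand-in for one LLM turn, returning `("tool", ...)`, `("retry", ...)`, or `("final", answer)`):

```python
def run_workflow(llm_step, max_iterations: int = 10, step_retry_limit: int = 3):
    """Run LLM turns until a final answer, within iteration and retry budgets."""
    retries = 0
    for _ in range(max_iterations):
        kind, payload = llm_step()
        if kind == "final":
            return payload
        if kind == "retry":  # a validation failure sent back for a fix
            retries += 1
            if retries > step_retry_limit:
                raise RuntimeError("step retry limit exceeded")
        # "tool" turns just consume an iteration
    raise RuntimeError("iteration budget exhausted before a final answer")

# Scripted stand-in for a medium workflow: two tool turns, one retry, then done
turns = iter([("tool", None), ("tool", None), ("retry", None), ("final", "done")])
print(run_workflow(lambda: next(turns)))  # done
```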
> See [guides/PROMPT_ENGINEERING_LESSONS.md](guides/PROMPT_ENGINEERING_LESSONS.md) for code examples for each lesson.

---
## Multi-Step Workflow Architecture

For complex tasks, single-prompt tool calling is unreliable. Break the task into steps with isolated tool sets:

```
Step 1: Discovery           Step 2: Configuration
┌─────────────────┐         ┌─────────────────────┐
│ Tools:          │         │ Tools:              │
│ - search        │   ──>   │ - get_details       │
│ - list          │         │ - validate_minimal  │
│ - get_info      │         │ - validate_full     │
│                 │         │                     │
│ Output: What    │         │ Output: How         │
│ components to   │         │ to configure them   │
│ use             │         │                     │
└─────────────────┘         └─────────────────────┘
```

**Key patterns:**
- **Isolated tool sets per step:** each step only sees relevant tools, reducing confusion
- **Pydantic schema validation:** validate LLM responses structurally, not just syntactically
- **Retry with error feedback:** when validation fails, feed structured errors back to the LLM
- **Result tracking:** track whether validations *passed*, not just whether they were *called*
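The first pattern reduces to a registry keyed by step. A minimal sketch (the step and tool names mirror the diagram above; the abbreviated schemas are placeholders, not real definitions):

```python
# Step -> tool names; each step's prompt only includes its own tool schemas.
STEP_TOOLS = {
    "discovery":     ["search", "list", "get_info"],
    "configuration": ["get_details", "validate_minimal", "validate_full"],
}

# name -> OpenAI-format tool definition (abbreviated placeholders)
ALL_TOOLS = {
    name: {"type": "function", "function": {"name": name}}
    for names in STEP_TOOLS.values()
    for name in names
}

def tools_for_step(step: str) -> list[dict]:
    """Return only the tool definitions this step is allowed to see."""
    return [ALL_TOOLS[name] for name in STEP_TOOLS[step]]

print([t["function"]["name"] for t in tools_for_step("discovery")])
# ['search', 'list', 'get_info']
```

Passing `tools_for_step(step)` as the `tools` array for each request keeps irrelevant tools out of the model's view entirely, rather than asking it to ignore them.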
> See [guides/MULTI_STEP_WORKFLOWS.md](guides/MULTI_STEP_WORKFLOWS.md) for the full architecture.
> See [examples/multi_step_orchestrator.py](examples/multi_step_orchestrator.py) for working code.

---
## Blackwell GPU Notes

If you're running on an NVIDIA RTX 6000 Pro Blackwell (or a similar Blackwell-architecture GPU):

### FlashInfer Bug (SM120)

FlashInfer has known issues with Blackwell's SM120 compute architecture. Symptoms: crashes, hangs, or incorrect output.

```bash
# Workaround: disable FlashInfer and use FlashAttention-2 instead
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```

### FP8 Quantization Types

Not all FP8 models use the same quantization method:

| Model | Quantization Flag | Notes |
|-------|-------------------|-------|
| Hermes-3-Llama-3.1-70B-FP8 | `--quantization compressed-tensors` | Uses the compressed-tensors format |
| Llama-3.3-70B-Instruct-FP8 | `--quantization fp8_e4m3` | Native FP8, faster on Blackwell |
| Qwen2-72B-Instruct-FP8 | `--quantization fp8` | Standard FP8 |
| Mistral-Nemo-FP8 | `--quantization fp8` | Standard FP8 |

Using the wrong flag won't crash, but you'll lose performance: `compressed-tensors` doesn't leverage Blackwell's native FP8 acceleration.

---
## Troubleshooting

<details>
<summary><b>Tool calls get cut off mid-generation</b></summary>

**Cause:** Context window too small.

**Fix:** Increase `--max-model-len` to 131072 (128K). See [Context Length Fix](#the-critical-context-length-fix).

</details>

<details>
<summary><b>Model responds with text instead of tool calls</b></summary>

**Cause:** Missing `--enable-auto-tool-choice` flag, or a system prompt that doesn't instruct tool use.

**Fix:**
1. Add `--enable-auto-tool-choice` to the vLLM launch command
2. Add `--tool-call-parser hermes` (or the appropriate parser)
3. Ensure tools are passed in the API request

</details>

<details>
<summary><b>Very slow generation (2-3 tok/s on 70B)</b></summary>

**Cause:** Wrong quantization method or FlashInfer issues on Blackwell.

**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
Also verify you're using the correct `--quantization` flag for your model.

</details>

<details>
<summary><b>Model hallucinates tool/function names</b></summary>

**Cause:** Tool definitions are too vague, or the model is guessing from training data.

**Fix:**
1. Include `includeExamples: true` in tool definitions to show real configurations
2. Add existence validation after tool calls (verify the tool response is valid before proceeding)
3. Use specific, descriptive tool names

</details>

<details>
<summary><b>Hermes-3 tool calls don't work in Open WebUI</b></summary>

**Cause:** Open WebUI expects OpenAI-format tool calls. Hermes-3's native format (ChatML + XML) isn't compatible.

**Fix:** Switch to Llama-3.3-70B-Instruct, which works out of the box with Open WebUI. See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md).

</details>

<details>
<summary><b>FlashInfer crashes on a Blackwell GPU</b></summary>

**Cause:** FlashInfer has known bugs with the SM120 (Blackwell) compute architecture.

**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```

</details>

---
## Open WebUI Compatibility

| Model | Tool Calling via API | Tool Calling in Open WebUI |
|-------|----------------------|----------------------------|
| Hermes-3-Llama-3.1-70B | Yes | **No** (format incompatible) |
| Llama-3.3-70B-Instruct | Yes | Yes |
| Qwen2-72B-Instruct | Yes | Yes |
| Mistral-Nemo-12B | Yes | Yes |

If you need Open WebUI support, use Llama 3.3 or Qwen2. If you're building a custom application that talks directly to the vLLM API, all four models work.

> See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md) for details.

---
## Verified FP8 Models

All models listed below have been verified to exist on Hugging Face and to work with vLLM for tool calling:

**70B+ Models (High Performance):**
- [NousResearch/Hermes-3-Llama-3.1-70B-FP8](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-FP8) (best tool calling)
- [nvidia/Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8) (best Open WebUI support)
- [RedHatAI/Qwen2-72B-Instruct-FP8](https://huggingface.co/RedHatAI/Qwen2-72B-Instruct-FP8) (best multilingual)

**12B Models (Fast Iteration):**
- [RedHatAI/Mistral-Nemo-Instruct-2407-FP8](https://huggingface.co/RedHatAI/Mistral-Nemo-Instruct-2407-FP8) (100-150 tok/s)

**Memory Requirements (single GPU):**
- 70B FP8: ~40-50GB
- 12B FP8: ~12-15GB

---
## Citation

If you find this guide useful, please star the repository and share it.

```bibtex
@misc{odmark2025vllmtoolcalling,
  title={VLLM Tool Calling Guide: Open Source Models on Blackwell GPUs},
  author={Joshua Eric Odmark},
  year={2025},
  url={https://huggingface.co/joshuaeric/vllm-tool-calling-guide}
}
```

## Acknowledgments

- [NousResearch](https://huggingface.co/NousResearch) for Hermes-3 and for pioneering open-source tool calling
- [vLLM Project](https://github.com/vllm-project/vllm) for the inference engine
- [NVIDIA](https://huggingface.co/nvidia) and [Red Hat AI / NeuralMagic](https://huggingface.co/RedHatAI) for the FP8-quantized models

## License

Apache 2.0. Use freely; attribution appreciated.