---
language:
- en
license: apache-2.0
library_name: vllm
pipeline_tag: text-generation
tags:
- tool-calling
- function-calling
- vllm
- inference
- guide
- hermes
- llama
- qwen
- mistral
- fp8
- blackwell
- rtx-6000
- open-source
- multi-step-workflow
- prompt-engineering
- quantization
- deployment
base_model:
- NousResearch/Hermes-3-Llama-3.1-70B-FP8
- nvidia/Llama-3.3-70B-Instruct-FP8
- RedHatAI/Qwen2-72B-Instruct-FP8
- RedHatAI/Mistral-Nemo-Instruct-2407-FP8
---
# VLLM Tool Calling Guide
**A battle-tested guide to getting tool calling working reliably with open source models on VLLM.**
This is not a model. This is a collection of production-tested configurations, prompt templates, Python examples, and hard-won lessons from building multi-step tool calling systems with open source LLMs on NVIDIA Blackwell GPUs.
Everything here was discovered through real deployment, not theory.
---
## Quick Start
**Launch VLLM with tool calling (Hermes-3 70B):**
```bash
python -m vllm.entrypoints.openai.api_server \
    --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
    --dtype auto \
    --quantization compressed-tensors \
    --max-model-len 131072 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --gpu-memory-utilization 0.90 \
    --max-num-seqs 4
```
**Test it works:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"tool_choice": "auto"
}'
```
If you see `"tool_calls"` in the response, you're good. Read on for the details.
---
## What This Repository Contains
| Directory | Contents |
|-----------|----------|
| [`configs/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/configs) | Production VLLM launch scripts for 4 models with inline documentation |
| [`examples/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/examples) | Working Python code: basic tool calls, multi-step orchestration, JSON extraction |
| [`prompts/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/prompts) | System prompt templates for tool calling (Hermes-specific and model-agnostic) |
| [`chat_templates/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/chat_templates) | Jinja2 chat templates for Hermes-3 tool calling |
| [`guides/`](https://huggingface.co/joshuaeric/vllm-tool-calling-guide/tree/main/guides) | Deep-dive guides on specific topics (context length, prompt engineering, troubleshooting) |
---
## Model Comparison
All models tested on NVIDIA RTX 6000 Pro Blackwell (96GB VRAM), single GPU.
| Model | Size | Quant | VLLM Parser | Speed | Memory | Context | Tool Quality | Open WebUI |
|-------|------|-------|-------------|-------|--------|---------|-------------|------------|
| **Hermes-3-Llama-3.1-70B** | 70B | FP8 | `hermes` | 25-35 tok/s | ~40GB | 128K | Excellent | No |
| **Llama-3.3-70B-Instruct** | 70B | FP8 | `llama3_json` | 60-90 tok/s | ~40GB | 128K | Excellent | Yes |
| **Qwen2-72B-Instruct** | 72B | FP8 | `hermes` | 60-90 tok/s | ~45GB | 128K | Very Good | Yes |
| **Mistral-Nemo-Instruct** | 12B | FP8 | `mistral` | 100-150 tok/s | ~15GB | 128K | Good | Yes |
**Recommendations:**
- **Best overall tool calling:** Hermes-3-Llama-3.1-70B (purpose-built for function calling)
- **Best for Open WebUI:** Llama-3.3-70B-Instruct (works out of the box)
- **Best speed/quality ratio:** Mistral-Nemo-12B (fast iterations, good enough for most tasks)
- **Best multilingual:** Qwen2-72B (strong across languages)
> See [guides/MODEL_COMPARISON.md](guides/MODEL_COMPARISON.md) for the full breakdown.
---
## The Critical Context Length Fix
**This is the #1 issue people hit with VLLM tool calling.**
Many VLLM configs default to short context windows. Tool calling needs much more:
```
System prompt:          3-5K tokens
Tool definitions:       2-4K tokens per tool
Conversation history:   2-10K tokens
Tool responses:         5-20K tokens
────────────────────────────────────────
Total needed:           20-40K+ tokens
```
**If your context window is 16K (the default for many configs), tool calls get silently truncated mid-generation.**
The fix:
```bash
# BEFORE (broken): Default or small context
--max-model-len 16384
# AFTER (working): Full context support
--max-model-len 131072 # 128K tokens
--max-num-seqs 4 # Reduce concurrency to fit KV cache
--max-num-batched-tokens 132000 # Match context length
--gpu-memory-utilization 0.90 # Leave headroom
```
**Memory math for 96GB GPU:**
- Model weights (FP8 70B): ~40GB
- KV cache for 128K context: ~45-50GB
- Total: fits with batch size 4
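A back-of-the-envelope version of that KV-cache number, as a Python sketch (assuming Llama-70B dimensions: 80 layers, 8 KV heads, head dim 128, FP16 KV cache; real usage is higher because VLLM preallocates the cache pool and serves multiple sequences):

```python
# Rough KV-cache estimate under assumed Llama-70B dimensions; not vLLM's exact accounting.
num_layers = 80          # Llama-3.1-70B transformer layers
num_kv_heads = 8         # grouped-query attention KV heads
head_dim = 128
bytes_per_elem = 2       # FP16 KV cache (FP8 weights do not imply an FP8 KV cache)
context_len = 131_072    # --max-model-len

# K and V, per token, across all layers
bytes_per_token = 2 * num_kv_heads * head_dim * bytes_per_elem * num_layers
kv_cache_gb = bytes_per_token * context_len / 1024**3

print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{kv_cache_gb:.0f} GB for one full {context_len // 1024}K-token sequence")
# -> 320 KiB per token, ~40 GB for one full 128K-token sequence
```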
> See [guides/CONTEXT_LENGTH_FIX.md](guides/CONTEXT_LENGTH_FIX.md) for the full analysis.
---
## Tool Call Formats
VLLM supports multiple tool call formats. Which one you use depends on your model:
### Hermes Format (ChatML + XML tags)
```
<|im_start|>assistant
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
<|im_end|>
```
**Parser flag:** `--tool-call-parser hermes`
**Models:** Hermes-3, Hermes-2-Pro, Qwen2
### Llama 3 JSON Format
```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```
**Parser flag:** `--tool-call-parser llama3_json`
**Models:** Llama-3.1, Llama-3.3
### Mistral Format
```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```
**Parser flag:** `--tool-call-parser mistral`
**Models:** Mistral-Nemo, Mistral-7B
**All formats are converted to OpenAI-compatible JSON by VLLM.** Your application code always receives the same standardized format regardless of which parser is used.
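Concretely, whichever parser is active, your client sees the same OpenAI-shaped response (field values below are illustrative):

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "chatcmpl-tool-abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"San Francisco\"}"
        }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```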
> See [guides/TOOL_CALL_FORMATS.md](guides/TOOL_CALL_FORMATS.md) for detailed comparison.
---
## 7 Prompt Engineering Lessons for Tool Calling
These lessons were learned through production debugging. Each one cost hours to diagnose.
### 1. LLMs Learn from Your Examples
**Problem:** LLM wraps all JSON responses in markdown code blocks (` ```json ... ``` `).
**Root cause:** Your prompt examples showed JSON inside markdown code blocks. The LLM learned to replicate the formatting.
**Fix:** Show raw JSON in all examples. Add explicit instruction: "Do NOT wrap your response in markdown code blocks."
### 2. Jinja2 Escaping Leaks into Output
**Problem:** LLM outputs `{{` instead of `{` in JSON.
**Root cause:** Your Jinja2 chat template examples used `{{` for escaping. The LLM learned to double braces.
**Fix:** Use single braces in all prompt examples. Handle template escaping separately from content.
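One way to keep the two concerns separate in a Jinja2 chat template is a `{% raw %}` block: the literal JSON example renders with single braces, while real template variables keep normal Jinja syntax (a sketch; the variable name is whatever your template already uses):

```jinja
{# Literal JSON example for the model: rendered exactly as written, single braces #}
{% raw %}
Example tool call:
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
{% endraw %}

{# Actual template variable: normal Jinja interpolation #}
{{ message['content'] }}
```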
### 3. Explicitly Limit Tool Call Blocks
**Problem:** LLM creates multiple `<tool_call>` blocks or nests them 5 levels deep.
**Root cause:** No instruction telling it not to.
**Fix:** Add: "Use ONLY ONE `<tool_call>` block per response. Do NOT create multiple blocks or nest them."
### 4. Track Validation Results, Not Just Calls
**Problem:** System checks if validation tools were *called* but not if they *passed*. LLM returns "success" with invalid output.
**Fix:**
```python
# BAD: Only tracks if called
tracking = {'validate_called': False}
# GOOD: Tracks if called AND passed
tracking = {
    'validate_called': False,
    'validate_passed': False,   # Did it return valid: true?
    'validation_errors': []     # What went wrong?
}
```
### 5. Feed Errors Back with Structure
**Problem:** Validation fails but the LLM doesn't know *what* failed or *how* to fix it.
**Fix:** Format errors with property names, error types, and suggested fixes:
```python
errors_formatted = "\n\nValidation Errors Found:\n"
for i, error in enumerate(errors, 1):
    errors_formatted += f"\n{i}. "
    if 'property' in error:
        errors_formatted += f"Property: {error['property']}\n"
    if 'message' in error:
        errors_formatted += f" Message: {error['message']}\n"
    if 'fix' in error:
        errors_formatted += f" Fix: {error['fix']}\n"
```
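The formatted errors then go back into the conversation so the next turn can correct them (a sketch; `messages` is the running chat history from your orchestration loop):

```python
# Sketch: hand the structured errors back to the model as the next user turn.
messages.append({
    "role": "user",
    "content": (
        "Your previous response failed validation. Fix ONLY the issues below "
        "and return the corrected JSON." + errors_formatted
    ),
})
```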
### 6. Use `raw_decode` for Robust JSON Extraction
**Problem:** LLM adds conversational text before/after the JSON: "Here is the result: {...} Let me know if you need anything else!"
**Fix:** Three-layer extraction:
```python
import json
from json import JSONDecoder
import re

def extract_json(text: str):
    # Layer 1: Strip markdown code blocks
    if "```" in text:
        match = re.search(r'```(?:json)?\s*\n(.*?)\n```', text, re.DOTALL)
        if match:
            text = match.group(1).strip()
    # Layer 2: Find first { or [ (skip preamble)
    if not text.startswith(('{', '[')):
        for char in ['{', '[']:
            idx = text.find(char)
            if idx != -1:
                text = text[idx:]
                break
    # Layer 3: raw_decode stops at end of valid JSON (skip postamble)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        decoder = JSONDecoder()
        data, _ = decoder.raw_decode(text)
        return data
```
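For example, a response with a preamble, a markdown fence, and a postamble still yields clean JSON:

```python
messy = 'Sure! Here is the result:\n```json\n{"status": "ok", "count": 3}\n```\nLet me know if you need anything else!'
print(extract_json(messy))  # {'status': 'ok', 'count': 3}
```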
### 7. Budget Enough Iterations for Multi-Step Workflows
**Problem:** Multi-step tool calling runs out of iterations before completing.
**Root cause:** Each step needs multiple LLM turns:
1. Get information (tool call)
2. Process results (tool call)
3. Validate output (tool call)
4. Fix errors if needed (tool call)
5. Return final response
**Recommended iteration budgets:**
| Workflow Complexity | Max Iterations | Step Retry Limit |
|-------------------|---------------|-----------------|
| Simple (1-2 tools) | 5 | 2 |
| Medium (3-5 tools) | 10 | 3 |
| Complex (6+ tools) | 15 | 3 |
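A minimal loop that enforces these budgets might look like the sketch below (this is not the code in `examples/multi_step_orchestrator.py`; `run_llm_turn` and `execute_tool_call` are placeholders for your own chat-completion call and tool dispatch):

```python
MAX_ITERATIONS = 10  # medium workflow (3-5 tools); see the table above

def run_step(messages, tools):
    """Drive one workflow step until the model stops calling tools or the budget runs out."""
    for _ in range(MAX_ITERATIONS):
        reply = run_llm_turn(messages, tools)   # placeholder: one chat completion call
        if not reply.tool_calls:
            return reply.content                # model produced its final answer for this step
        messages.append({"role": "assistant", "tool_calls": reply.tool_calls})
        for call in reply.tool_calls:
            result = execute_tool_call(call)    # placeholder: look up and run the tool
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    raise RuntimeError(f"Step exceeded {MAX_ITERATIONS} iterations")
```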
> See [guides/PROMPT_ENGINEERING_LESSONS.md](guides/PROMPT_ENGINEERING_LESSONS.md) for code examples for each lesson.
---
## Multi-Step Workflow Architecture
For complex tasks, single-prompt tool calling is unreliable. Break it into steps with isolated tool sets:
```
Step 1: Discovery          Step 2: Configuration
┌───────────────────┐      ┌───────────────────────┐
│ Tools:            │      │ Tools:                │
│ - search          │ ──>  │ - get_details         │
│ - list            │      │ - validate_minimal    │
│ - get_info        │      │ - validate_full       │
│                   │      │                       │
│ Output: What      │      │ Output: How           │
│ components to     │      │ to configure them     │
│ use               │      │                       │
└───────────────────┘      └───────────────────────┘
```
**Key patterns:**
- **Isolated tool sets per step:** each step only sees relevant tools, reducing confusion
- **Pydantic schema validation:** validate LLM responses structurally, not just syntactically
- **Retry with error feedback:** when validation fails, feed structured errors back to the LLM
- **Result tracking:** track whether validations *passed*, not just whether they were *called*
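A minimal sketch of the first two patterns; the step names, field names, and the `ConfigurationOutput` model are illustrative:

```python
from pydantic import BaseModel, ValidationError

# Isolated tool sets: each step only exposes the tools it actually needs.
STEP_TOOLS = {
    "discovery": ["search", "list", "get_info"],
    "configuration": ["get_details", "validate_minimal", "validate_full"],
}

class ConfigurationOutput(BaseModel):
    """Expected shape of the configuration step's final answer (illustrative fields)."""
    component: str
    settings: dict

def validate_step_output(raw_json: dict):
    """Structural validation; on failure, return errors to feed back to the LLM."""
    try:
        return ConfigurationOutput(**raw_json), []
    except ValidationError as exc:
        errors = [
            {"property": ".".join(str(p) for p in e["loc"]), "message": e["msg"]}
            for e in exc.errors()
        ]
        return None, errors
```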
> See [guides/MULTI_STEP_WORKFLOWS.md](guides/MULTI_STEP_WORKFLOWS.md) for the full architecture.
> See [examples/multi_step_orchestrator.py](examples/multi_step_orchestrator.py) for working code.
---
## Blackwell GPU Notes
If you're running on NVIDIA RTX 6000 Pro Blackwell (or similar Blackwell architecture):
### FlashInfer Bug (SM120)
FlashInfer has known issues with Blackwell's SM120 compute architecture. Symptoms: crashes, hangs, or incorrect output.
```bash
# Workaround: Disable FlashInfer, use FlashAttention-2 instead
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
### FP8 Quantization Types
Not all FP8 models use the same quantization method:
| Model | Quantization Flag | Notes |
|-------|------------------|-------|
| Hermes-3-Llama-3.1-70B-FP8 | `--quantization compressed-tensors` | Uses compressed-tensors format |
| Llama-3.3-70B-Instruct-FP8 | `--quantization fp8_e4m3` | Native FP8, faster on Blackwell |
| Qwen2-72B-Instruct-FP8 | `--quantization fp8` | Standard FP8 |
| Mistral-Nemo-FP8 | `--quantization fp8` | Standard FP8 |
Using the wrong flag won't crash, but you'll lose performance. `compressed-tensors` doesn't leverage Blackwell's native FP8 acceleration.
---
## Troubleshooting
<details>
<summary><b>Tool calls get cut off mid-generation</b></summary>
**Cause:** Context window too small.
**Fix:** Increase `--max-model-len` to 131072 (128K). See [Context Length Fix](#the-critical-context-length-fix).
</details>
<details>
<summary><b>Model responds with text instead of tool calls</b></summary>
**Cause:** Missing `--enable-auto-tool-choice` flag, or system prompt doesn't instruct tool use.
**Fix:**
1. Add `--enable-auto-tool-choice` to VLLM launch
2. Add `--tool-call-parser hermes` (or appropriate parser)
3. Ensure tools are passed in the API request
</details>
<details>
<summary><b>Very slow generation (2-3 tok/s on 70B)</b></summary>
**Cause:** Wrong quantization method or FlashInfer issues on Blackwell.
**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
Also verify you're using the correct `--quantization` flag for your model.
</details>
<details>
<summary><b>Model hallucinates tool/function names</b></summary>
**Cause:** Tool definitions are too vague, or the model is guessing from training data.
**Fix:**
1. Include `includeExamples: true` in tool definitions to show real configurations
2. Add existence validation after tool calls (verify the tool response is valid before proceeding)
3. Use specific, descriptive tool names
</details>
<details>
<summary><b>Hermes-3 tool calls don't work in Open WebUI</b></summary>
**Cause:** Open WebUI expects OpenAI-format tool calls. Hermes-3's native format (ChatML + XML) isn't compatible.
**Fix:** Switch to Llama-3.3-70B-Instruct which works out of the box with Open WebUI. See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md).
</details>
<details>
<summary><b>FlashInfer crashes on Blackwell GPU</b></summary>
**Cause:** FlashInfer has known bugs with SM120 (Blackwell) compute architecture.
**Fix:**
```bash
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_USE_FLASHINFER=0
```
</details>
---
## Open WebUI Compatibility
| Model | Tool Calling via API | Tool Calling in Open WebUI |
|-------|---------------------|---------------------------|
| Hermes-3-Llama-3.1-70B | Yes | **No** (format incompatible) |
| Llama-3.3-70B-Instruct | Yes | Yes |
| Qwen2-72B-Instruct | Yes | Yes |
| Mistral-Nemo-12B | Yes | Yes |
If you need Open WebUI support, use Llama 3.3 or Qwen2. If you're building a custom application that talks directly to the VLLM API, all models work.
> See [guides/OPEN_WEBUI_COMPATIBILITY.md](guides/OPEN_WEBUI_COMPATIBILITY.md) for details.
---
## Verified FP8 Models
All models listed below have been verified to exist on Hugging Face and work with VLLM for tool calling:
**70B+ Models (High Performance):**
- [NousResearch/Hermes-3-Llama-3.1-70B-FP8](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-FP8) – Best tool calling
- [nvidia/Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8) – Best Open WebUI support
- [RedHatAI/Qwen2-72B-Instruct-FP8](https://huggingface.co/RedHatAI/Qwen2-72B-Instruct-FP8) – Best multilingual
**12B Models (Fast Iteration):**
- [RedHatAI/Mistral-Nemo-Instruct-2407-FP8](https://huggingface.co/RedHatAI/Mistral-Nemo-Instruct-2407-FP8) – 100-150 tok/s
**Memory Requirements (single GPU):**
- 70B FP8: ~40-50GB
- 12B FP8: ~12-15GB
---
## Citation
If you find this guide useful, please star the repository and share it.
```bibtex
@misc{odmark2025vllmtoolcalling,
title={VLLM Tool Calling Guide: Open Source Models on Blackwell GPUs},
author={Joshua Eric Odmark},
year={2025},
url={https://huggingface.co/joshuaeric/vllm-tool-calling-guide}
}
```
## Acknowledgments
- [NousResearch](https://huggingface.co/NousResearch) for Hermes-3 and pioneering open source tool calling
- [vLLM Project](https://github.com/vllm-project/vllm) for the inference engine
- [NVIDIA](https://huggingface.co/nvidia) and [Red Hat AI / NeuralMagic](https://huggingface.co/RedHatAI) for FP8 quantized models
## License
Apache 2.0 – use freely, attribution appreciated.