Joshua Odmark committed on
Commit · 2e4500a
Parent(s): 717a3b2
NeuralMagic acquired by Red Hat, updating references
Browse files
- README.md +5 -5
- configs/mistral_nemo_12b_fp8.sh +3 -3
- configs/qwen2_72b_fp8.sh +3 -3
- guides/MODEL_COMPARISON.md +1 -1
README.md
CHANGED
@@ -25,8 +25,8 @@ tags:
 base_model:
 - NousResearch/Hermes-3-Llama-3.1-70B-FP8
 - nvidia/Llama-3.3-70B-Instruct-FP8
-- 
-- 
+- RedHatAI/Qwen2-72B-Instruct-FP8
+- RedHatAI/Mistral-Nemo-Instruct-2407-FP8
 ---
 
 # VLLM Tool Calling Guide
@@ -473,10 +473,10 @@ All models listed below have been verified to exist on Hugging Face and work wit
 **70B+ Models (High Performance):**
 - [NousResearch/Hermes-3-Llama-3.1-70B-FP8](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-70B-FP8) — Best tool calling
 - [nvidia/Llama-3.3-70B-Instruct-FP8](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP8) — Best Open WebUI support
-- [
+- [RedHatAI/Qwen2-72B-Instruct-FP8](https://huggingface.co/RedHatAI/Qwen2-72B-Instruct-FP8) — Best multilingual
 
 **12B Models (Fast Iteration):**
-- [
+- [RedHatAI/Mistral-Nemo-Instruct-2407-FP8](https://huggingface.co/RedHatAI/Mistral-Nemo-Instruct-2407-FP8) — 100-150 tok/s
 
 **Memory Requirements (single GPU):**
 - 70B FP8: ~40-50GB
@@ -501,7 +501,7 @@ If you find this guide useful, please star the repository and share it.
 
 - [NousResearch](https://huggingface.co/NousResearch) for Hermes-3 and pioneering open source tool calling
 - [vLLM Project](https://github.com/vllm-project/vllm) for the inference engine
-- [NVIDIA](https://huggingface.co/nvidia) and [NeuralMagic](https://huggingface.co/
+- [NVIDIA](https://huggingface.co/nvidia) and [Red Hat AI / NeuralMagic](https://huggingface.co/RedHatAI) for FP8 quantized models
 
 ## License
configs/mistral_nemo_12b_fp8.sh
CHANGED
@@ -3,7 +3,7 @@
 # VLLM Launch Config: Mistral-Nemo-Instruct-2407-FP8
 # ============================================================================
 #
-# Model:
+# Model: RedHatAI/Mistral-Nemo-Instruct-2407-FP8
 # Purpose: Fast tool calling for rapid iteration and testing
 # Parser: mistral (native Mistral tool call format)
 # Memory: ~15GB (leaves tons of VRAM for other tasks)
@@ -21,7 +21,7 @@ echo "=========================================="
 echo "Starting VLLM: Mistral-Nemo-12B-FP8"
 echo "=========================================="
 echo ""
-echo "Model:
+echo "Model: RedHatAI/Mistral-Nemo-Instruct-2407-FP8"
 echo "Context: 128K tokens"
 echo "Parser: mistral"
 echo "Quantization: FP8"
@@ -35,7 +35,7 @@ export VLLM_ATTENTION_BACKEND=FLASH_ATTN
 export VLLM_USE_FLASHINFER=0
 
 python -m vllm.entrypoints.openai.api_server \
-    --model
+    --model RedHatAI/Mistral-Nemo-Instruct-2407-FP8 \
     --host 0.0.0.0 \
     --port 8000 \
     --dtype auto \
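The launch command above exposes an OpenAI-compatible endpoint, so a tool-calling request is just a JSON POST body. A minimal sketch of such a body, assuming the server from this config at `http://localhost:8000/v1/chat/completions`; the `get_weather` tool and its schema are hypothetical, purely for illustration:

```python
import json

# Sketch only: a tool-calling request body for the OpenAI-compatible
# endpoint this script starts (assumed http://localhost:8000/v1/chat/completions).
# The "get_weather" tool and its schema are hypothetical, for illustration.
payload = {
    "model": "RedHatAI/Mistral-Nemo-Instruct-2407-FP8",
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin right now?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        },
    ],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

# Serialize for POSTing with any HTTP client (curl, requests, httpx, ...)
body = json.dumps(payload)
```

The same body works against either config; only the `model` field changes to match the served checkpoint.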
configs/qwen2_72b_fp8.sh
CHANGED
@@ -3,7 +3,7 @@
 # VLLM Launch Config: Qwen2-72B-Instruct-FP8
 # ============================================================================
 #
-# Model:
+# Model: RedHatAI/Qwen2-72B-Instruct-FP8
 # Purpose: Strong multilingual tool calling with excellent reasoning
 # Parser: hermes (Qwen2 uses ChatML-compatible format)
 # Memory: ~45GB model + KV cache
@@ -15,7 +15,7 @@ echo "=========================================="
 echo "Starting VLLM: Qwen2-72B-Instruct-FP8"
 echo "=========================================="
 echo ""
-echo "Model:
+echo "Model: RedHatAI/Qwen2-72B-Instruct-FP8"
 echo "Context: 128K tokens"
 echo "Parser: hermes"
 echo "Quantization: FP8"
@@ -29,7 +29,7 @@ export VLLM_ATTENTION_BACKEND=FLASH_ATTN
 export VLLM_USE_FLASHINFER=0
 
 python -m vllm.entrypoints.openai.api_server \
-    --model
+    --model RedHatAI/Qwen2-72B-Instruct-FP8 \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto \
guides/MODEL_COMPARISON.md
CHANGED
@@ -6,7 +6,7 @@ Detailed comparison of open source models tested for tool calling with VLLM on N
 
 | | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B |
 |---|---|---|---|---|
-| **Model ID** | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `
+| **Model ID** | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` |
 | **Size** | 70B | 70B | 72B | 12B |
 | **Quantization** | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8 |
 | **VLLM Parser** | `hermes` | `llama3_json` | `hermes` | `mistral` |