---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.3
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- neuralmagic
- redhat
- speculators
- eagle3
---

# Llama-3.3-70B-Instruct-speculator.eagle3

## Model Overview
- **Verifier:** meta-llama/Llama-3.3-70B-Instruct
- **Speculative Decoding Algorithm:** EAGLE-3
- **Model Architecture:** Eagle3Speculator
- **Release Date:** 09/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

This is a speculator model designed for use with [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm.
It was trained with the [speculators](https://github.com/vllm-project/speculators) library on a combination of the [Aeala/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) dataset and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
This model should be used with the [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) chat template, specifically through the `/chat/completions` endpoint.

## Use with vLLM

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
-tp 4 \
--speculative-config '{
    "model": "RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3",
    "num_speculative_tokens": 3,
    "method": "eagle3"
}'
```
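Once the server is up, requests go through the OpenAI-compatible `/chat/completions` route, which applies the model's chat template server-side. Below is a minimal sketch using only the Python standard library; the host, port, and `max_tokens` value are illustrative assumptions matching the default `vllm serve` settings, not fixed requirements.

```python
import json
import urllib.request


def build_request(prompt: str) -> dict:
    # OpenAI-compatible chat completion payload. The verifier model is named
    # in "model"; the speculator is applied transparently by vLLM.
    return {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": 256,  # illustrative cap, adjust as needed
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server above to be running):
# print(chat("Write a haiku about speculative decoding."))
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at the same base URL) works equally well here.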

## Evaluations

<h3>Use cases</h3>
<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Dataset</th>
      <th>Number of Samples</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Coding</td>
      <td>HumanEval</td>
      <td>168</td>
    </tr>
    <tr>
      <td>Math Reasoning</td>
      <td>gsm8k</td>
      <td>80</td>
    </tr>
    <tr>
      <td>Text Summarization</td>
      <td>CNN/Daily Mail</td>
      <td>80</td>
    </tr>
  </tbody>
</table>

<h3>Acceptance lengths</h3>
<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>k=1</th>
      <th>k=2</th>
      <th>k=3</th>
      <th>k=4</th>
      <th>k=5</th>
      <th>k=6</th>
      <th>k=7</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Coding</td>
      <td>1.84</td>
      <td>2.53</td>
      <td>3.07</td>
      <td>3.42</td>
      <td>3.71</td>
      <td>3.89</td>
      <td>4.00</td>
    </tr>
    <tr>
      <td>Math Reasoning</td>
      <td>1.81</td>
      <td>2.43</td>
      <td>2.88</td>
      <td>3.17</td>
      <td>3.30</td>
      <td>3.42</td>
      <td>3.53</td>
    </tr>
    <tr>
      <td>Text Summarization</td>
      <td>1.71</td>
      <td>2.21</td>
      <td>2.52</td>
      <td>2.74</td>
      <td>2.83</td>
      <td>2.87</td>
      <td>2.89</td>
    </tr>
  </tbody>
</table>
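The acceptance length is the average number of tokens emitted per verifier forward pass at `k` speculative tokens, so it bounds the achievable speedup over plain autoregressive decoding. As a rough back-of-the-envelope sketch (the 5% draft-cost ratio is an illustrative assumption, not a measured value for this model pair):

```python
def estimated_speedup(tau: float, k: int, draft_cost_ratio: float = 0.05) -> float:
    # Each verifier step emits `tau` tokens instead of 1, but also pays for
    # `k` draft-model passes, each costing `draft_cost_ratio` of a verifier
    # pass. The ideal speedup is tau discounted by that overhead.
    step_time = 1.0 + k * draft_cost_ratio
    return tau / step_time


# Acceptance lengths at k=3 from the table above:
for use_case, tau in [
    ("Coding", 3.07),
    ("Math Reasoning", 2.88),
    ("Text Summarization", 2.52),
]:
    print(f"{use_case}: ~{estimated_speedup(tau, k=3):.2f}x")
```

This simple model ignores batching effects and verification overhead; the measured end-to-end numbers in the benchmarking plots below are the authoritative figures.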

<h3>Performance benchmarking (4xA100)</h3>
<div style="display: flex; justify-content: center; gap: 20px;">

<figure style="text-align: center;">
  <img src="assets/Llama-3.3-70B-Instruct-HumanEval.png" alt="Coding" width="100%">
</figure>

<figure style="text-align: center;">
  <img src="assets/Llama-3.3-70B-Instruct-math_reasoning.png" alt="Math Reasoning" width="100%">
</figure>

<figure style="text-align: center;">
  <img src="assets/Llama-3.3-70B-Instruct-summarization.png" alt="Text Summarization" width="100%">
</figure>
</div>

<details> <summary>Details</summary>
<strong>Configuration</strong>

- temperature: 0
- repetitions: 5
- time per experiment: 4min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

<strong>Command</strong>
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
--target "http://localhost:8000/v1" \
--data "RedHatAI/speculator_benchmarks" \
--data-args '{"data_files": "HumanEval.jsonl"}' \
--rate-type sweep \
--max-seconds 240 \
--output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
--backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```
The GuideLLM command-line interface has since changed; for compatibility with the latest release (v0.6.0), use the following equivalent command instead (sampling and timing settings match the configuration above):
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
--target "http://localhost:8000/v1" \
--data "RedHatAI/speculator_benchmarks" \
--data-args '{"data_files": "HumanEval.jsonl"}' \
--profile sweep \
--max-seconds 240 \
--output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
--backend-args '{"extras": {"body": {"temperature":0.0}}}'
```
</details>