---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.3
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- neuralmagic
- redhat
- speculators
- eagle3
---

# Llama-3.3-70B-Instruct-speculator.eagle3

## Model Overview
- **Verifier:** meta-llama/Llama-3.3-70B-Instruct
- **Speculative Decoding Algorithm:** EAGLE-3
- **Model Architecture:** Eagle3Speculator
- **Release Date:** 09/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

This is a speculator model designed for use with [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), based on the [EAGLE-3](https://arxiv.org/abs/2503.01840) speculative decoding algorithm.
It was trained with the [speculators](https://github.com/vllm-project/speculators) library on a combination of the [Aeala/ShareGPT_Vicuna_unfiltered](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) dataset and the `train_sft` split of [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).
This model should be used with the [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) chat template, specifically through the `/chat/completions` endpoint.

## Use with vLLM

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
-tp 4 \
--speculative-config '{
    "model": "RedHatAI/Llama-3.3-70B-Instruct-speculator.eagle3",
    "num_speculative_tokens": 3,
    "method": "eagle3"
}'
```
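Once the server is up, requests go through the OpenAI-compatible `/chat/completions` route, which applies the model's chat template server-side. Below is a minimal sketch using only the Python standard library; the host, port, and `max_tokens` value are illustrative assumptions matching the default `vllm serve` settings, not fixed requirements.

```python
import json
import urllib.request


def build_request(prompt: str) -> dict:
    # OpenAI-compatible chat completion payload. The verifier model is named
    # in "model"; the speculator is applied transparently by vLLM.
    return {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": 256,  # illustrative cap, adjust as needed
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    payload = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires the server above to be running):
# print(chat("Write a haiku about speculative decoding."))
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at the same base URL) works equally well here.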

## Evaluations

<h3>Use cases</h3>
<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>Dataset</th>
      <th>Number of Samples</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Coding</td>
      <td>HumanEval</td>
      <td>168</td>
    </tr>
    <tr>
      <td>Math Reasoning</td>
      <td>gsm8k</td>
      <td>80</td>
    </tr>
    <tr>
      <td>Text Summarization</td>
      <td>CNN/Daily Mail</td>
      <td>80</td>
    </tr>
  </tbody>
</table>

<h3>Acceptance lengths</h3>
<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>k=1</th>
      <th>k=2</th>
      <th>k=3</th>
      <th>k=4</th>
      <th>k=5</th>
      <th>k=6</th>
      <th>k=7</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Coding</td>
      <td>1.84</td>
      <td>2.53</td>
      <td>3.07</td>
      <td>3.42</td>
      <td>3.71</td>
      <td>3.89</td>
      <td>4.00</td>
    </tr>
    <tr>
      <td>Math Reasoning</td>
      <td>1.81</td>
      <td>2.43</td>
      <td>2.88</td>
      <td>3.17</td>
      <td>3.30</td>
      <td>3.42</td>
      <td>3.53</td>
    </tr>
    <tr>
      <td>Text Summarization</td>
      <td>1.71</td>
      <td>2.21</td>
      <td>2.52</td>
      <td>2.74</td>
      <td>2.83</td>
      <td>2.87</td>
      <td>2.89</td>
    </tr>
  </tbody>
</table>
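The acceptance length is the average number of tokens emitted per verifier forward pass at `k` speculative tokens, so it bounds the achievable speedup over plain autoregressive decoding. As a rough back-of-the-envelope sketch (the 5% draft-cost ratio is an illustrative assumption, not a measured value for this model pair):

```python
def estimated_speedup(tau: float, k: int, draft_cost_ratio: float = 0.05) -> float:
    # Each verifier step emits `tau` tokens instead of 1, but also pays for
    # `k` draft-model passes, each costing `draft_cost_ratio` of a verifier
    # pass. The ideal speedup is tau discounted by that overhead.
    step_time = 1.0 + k * draft_cost_ratio
    return tau / step_time


# Acceptance lengths at k=3 from the table above:
for use_case, tau in [
    ("Coding", 3.07),
    ("Math Reasoning", 2.88),
    ("Text Summarization", 2.52),
]:
    print(f"{use_case}: ~{estimated_speedup(tau, k=3):.2f}x")
```

This simple model ignores batching effects and verification overhead; the measured end-to-end numbers in the benchmarking plots below are the authoritative figures.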

<h3>Performance benchmarking (4xA100)</h3>
<div style="display: flex; justify-content: center; gap: 20px;">

<figure style="text-align: center;">
  <img src="assets/Llama-3.3-70B-Instruct-HumanEval.png" alt="Coding" width="100%">
</figure>

<figure style="text-align: center;">
  <img src="assets/Llama-3.3-70B-Instruct-math_reasoning.png" alt="Math Reasoning" width="100%">
</figure>

<figure style="text-align: center;">
  <img src="assets/Llama-3.3-70B-Instruct-summarization.png" alt="Text Summarization" width="100%">
</figure>
</div>

<details> <summary>Details</summary>
<strong>Configuration</strong>

- temperature: 0
- repetitions: 5
- time per experiment: 4min
- hardware: 4xA100
- vLLM version: 0.11.0
- GuideLLM version: 0.3.0

<strong>Command</strong>
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
--target "http://localhost:8000/v1" \
--data "RedHatAI/speculator_benchmarks" \
--data-args '{"data_files": "HumanEval.jsonl"}' \
--rate-type sweep \
--max-seconds 240 \
--output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
--backend-args '{"extra_body": {"chat_completions": {"temperature":0.0}}}'
```
The GuideLLM command-line interface has since changed; for compatibility with the latest release (v0.6.0), use the following equivalent command instead (sampling and timing settings match the configuration above):
```bash
GUIDELLM__PREFERRED_ROUTE="chat_completions" \
guidellm benchmark \
--target "http://localhost:8000/v1" \
--data "RedHatAI/speculator_benchmarks" \
--data-args '{"data_files": "HumanEval.jsonl"}' \
--profile sweep \
--max-seconds 240 \
--output-path "Llama-3.3-70B-Instruct-HumanEval.json" \
--backend-args '{"extras": {"body": {"temperature":0.0}}}'
```
</details>