---
tags:
- diffusion-language-model
---

# LLaMA3.1-8B-Instruct-DFlash-UltraChat

[**Paper (Coming Soon)**](#) | [**GitHub**](https://github.com/z-lab/dflash) | [**Blog**](https://z-lab.ai/projects/dflash/)

**DFlash** is a novel speculative decoding method that uses a lightweight **block diffusion** model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.

This model is the **drafter** component. It must be used in conjunction with the target model `meta-llama/Llama-3.1-8B-Instruct`.
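
Conceptually, the drafter proposes a block of tokens in parallel and the target model verifies them in a single forward pass, keeping the longest prefix it agrees with. The toy sketch below shows the greedy-verification special case only; `target_next_token` is a stand-in for a target-model call, not the DFlash API:

```python
# Toy sketch of one speculative-decoding step with greedy verification.
# Illustrative only -- not the DFlash implementation.
def speculative_step(target_next_token, draft_tokens, context):
    """Accept the longest prefix of draft_tokens that the target model
    would itself have produced greedily from `context`."""
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(context + accepted)
        if tok != expected:
            # First mismatch: take the target's own token and stop.
            accepted.append(expected)
            break
        accepted.append(tok)
    else:
        # All drafts accepted: the target contributes one bonus token.
        accepted.append(target_next_token(context + accepted))
    return accepted
```

Because verification is a single batched pass over the whole drafted block, each accepted token costs far less than one full target-model decode step.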

## 📊 Training Data

**LLaMA3.1-8B-Instruct-DFlash-UltraChat** is trained on the **UltraChat-200K** and **ShareGPT** datasets, aiming to align with the EAGLE-3 training data. The assistant responses in these datasets are regenerated by `meta-llama/Llama-3.1-8B-Instruct`, so the drafter learns to imitate the target model's own outputs.
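
The regeneration step can be sketched as below; `generate_reply` stands in for a call to `meta-llama/Llama-3.1-8B-Instruct`, and this is an illustration, not the project's actual data pipeline:

```python
# Illustrative sketch: replace every assistant turn in a chat-format
# record with a reply produced by the target model.
def regenerate_conversation(conversation, generate_reply):
    """conversation: list of {"role": ..., "content": ...} dicts.
    generate_reply: fn(history) -> str, e.g. a target-model call (stubbed here)."""
    history, rebuilt = [], []
    for turn in conversation:
        if turn["role"] == "assistant":
            # Rebind rather than mutate, so the input record is untouched.
            turn = {"role": "assistant",
                    "content": generate_reply(history)}
        history.append(turn)
        rebuilt.append(turn)
    return rebuilt
```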

## 🚀 Quick Start

Install SGLang with DFlash support from the pull request branch:

```shell
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/16818/head"
```

Then launch the server with this model as the speculative draft model:

```shell
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat \
  --tp-size 1 \
  --dtype bfloat16 \
  --attention-backend fa3 \
```
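
Once the server is up, it can be queried like any SGLang deployment through its OpenAI-compatible API; speculative decoding is transparent to clients. Port 30000 is the SGLang default, and the small helper below is an illustrative sketch, not part of DFlash:

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    # OpenAI-style chat request; the client addresses the target model,
    # and the drafter is applied server-side.
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, base_url: str = "http://localhost:30000") -> str:
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```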

Alternatively, install the dependencies and load the drafter directly with Transformers:

```shell
pip install transformers==4.57.3 torch==2.9.0 accelerate
```

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

model = AutoModel.from_pretrained(
    "z-lab/LLaMA3.1-8B-Instruct-DFlash-UltraChat",
    trust_remote_code=True,
    dtype="auto",
    device_map="cuda:0"
)
```