Instructions to use husj576/GRIFFIN-llama2-chat-13B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use husj576/GRIFFIN-llama2-chat-13B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="husj576/GRIFFIN-llama2-chat-13B")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("husj576/GRIFFIN-llama2-chat-13B")
model = AutoModelForMultimodalLM.from_pretrained("husj576/GRIFFIN-llama2-chat-13B")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use husj576/GRIFFIN-llama2-chat-13B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "husj576/GRIFFIN-llama2-chat-13B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "husj576/GRIFFIN-llama2-chat-13B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/husj576/GRIFFIN-llama2-chat-13B

SGLang

How to use husj576/GRIFFIN-llama2-chat-13B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "husj576/GRIFFIN-llama2-chat-13B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "husj576/GRIFFIN-llama2-chat-13B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "husj576/GRIFFIN-llama2-chat-13B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "husj576/GRIFFIN-llama2-chat-13B",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use husj576/GRIFFIN-llama2-chat-13B with Docker Model Runner:
```
docker model run hf.co/husj576/GRIFFIN-llama2-chat-13B
```

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

This repository contains the draft model for GRIFFIN, a novel framework designed to accelerate inference in large language models (LLMs) by addressing token misalignment in speculative decoding. GRIFFIN incorporates a token-alignable training strategy and a token-alignable draft model to mitigate this issue, demonstrating significant speedup ratios over existing state-of-the-art methods.

For more details, refer to the paper: GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

The official code and further details can be found on the project's GitHub repository: https://github.com/hsj576/GRIFFIN

Overview

GRIFFIN is a novel framework designed to address token misalignment in speculative decoding. This repository provides the implementation of GRIFFIN, including its token-alignable training strategy and token-alignable draft model.

GRIFFIN is:
- 4.2x faster than vanilla decoding.
- 1.3x faster than EAGLE-2.

benchmark

Speed up ratios of GRIFFIN when temperature = 0.

benchmark

Speed up ratios of GRIFFIN when temperature = 1.

Acceleration demo of GRIFFIN for llama3-8B in a 4090GPU

demogif

Sample Usage

You can use the provided eagenerate function for accelerated generation, similar to using the generate method from Hugging Face. Here is an example:

import torch
from model.ea_model_griffin import EaModel
from fastchat.model import get_conversation_template

# Replace with the actual path to your base model and GRIFFIN weight
base_model_path = "meta-llama/Llama-2-13b-chat-hf" # Example base model
GRIFFIN_model_path = "husj576/GRIFFIN-llama2-chat-13B" # Example GRIFFIN model

model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=GRIFFIN_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    total_token=-1
)
model.eval()

your_message="Hello"
conv = get_conversation_template("llama2") # Use appropriate conversation template for your base model
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids=model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).cuda()
output_ids=model.eagenerate(input_ids,temperature=0.5,max_new_tokens=512)
output=model.tokenizer.decode(output_ids[0])
print(output)

Note: Vicuna, LLaMA2-Chat, and LLaMA3-Instruct are both chat models. You need to use the correct chat template, otherwise it will cause abnormal output from the model and affect the performance of GRIFFIN.

Citation

If you find our work helpful or inspiring, please feel free to cite it.

@misc{hu2025griffineffectivetokenalignment,
      title={GRIFFIN: Effective Token Alignment for Faster Speculative Decoding},
      author={Shijing Hu and Jingyang Li and Xingyu Xie and Zhihui Lu and Kim-Chuan Toh and Pan Zhou},
      year={2025},
      eprint={2502.11018},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.11018},
}

Downloads last month: 2

Safetensors

Model size

0.8B params

Tensor type

F32

Paper for husj576/GRIFFIN-llama2-chat-13B

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Paper • 2502.11018 • Published Feb 16, 2025