---
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---
# PosS: Position Specialist Generates Better Draft for Speculative Decoding

This repository contains the PosS-2 model described in the paper *PosS: Position Specialist Generates Better Draft for Speculative Decoding*.
PosS introduces several Position Specialists, each responsible for drafting tokens at certain positions. They are trained to generate high-quality draft tokens even when their inputs include deviated features from earlier draft positions. At inference time, these Position Specialists mitigate feature deviation and remain accurate even at large draft positions.
PosS achieves a higher position-wise acceptance rate (the acceptance rate at a position given that all previous positions have been accepted) than previous methods.
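To make the metric concrete, here is a small illustrative sketch of how a position-wise acceptance rate can be computed from per-round accept/reject records. The function name and sample data are hypothetical, not taken from the PosS codebase:

```python
def position_wise_acceptance(rounds, position):
    """Acceptance rate at `position`, counted only over rounds in which
    all earlier positions were accepted."""
    eligible = [r for r in rounds if len(r) > position and all(r[:position])]
    if not eligible:
        return 0.0
    return sum(r[position] for r in eligible) / len(eligible)

# Each round records accept (True) / reject (False) per draft position.
rounds = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
]
print(position_wise_acceptance(rounds, 1))  # 2 of 3 eligible rounds accept position 1
```

Conditioning on earlier positions being accepted is what makes the metric position-wise: it isolates how good the draft is at deep positions rather than averaging over rounds that already failed earlier.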
## PosS Weights
We also provide our trained parameters on Hugging Face:
| Base Model | PosS-1 Weights | PosS-2 Weights | PosS-3 Weights |
|---|---|---|---|
| Llama3-8B-Instruct | [HINT-lab/PosS1-Llama3-8B-Instruct](https://huggingface.co/HINT-lab/PosS1-Llama3-8B-Instruct) | [HINT-lab/PosS2-Llama3-8B-Instruct](https://huggingface.co/HINT-lab/PosS2-Llama3-8B-Instruct) | [HINT-lab/PosS3-Llama3-8B-Instruct](https://huggingface.co/HINT-lab/PosS3-Llama3-8B-Instruct) |
| Llama2-13B-Chat | [HINT-lab/PosS1-Llama2-13B-Chat](https://huggingface.co/HINT-lab/PosS1-Llama2-13B-Chat) | [HINT-lab/PosS2-Llama2-13B-Chat](https://huggingface.co/HINT-lab/PosS2-Llama2-13B-Chat) | [HINT-lab/PosS3-Llama2-13B-Chat](https://huggingface.co/HINT-lab/PosS3-Llama2-13B-Chat) |
## Simplified Inference Example

This example uses the `transformers` library. Make sure to install it first (`pip install transformers`).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "HINT-lab/PosS2-Llama3-8B-Instruct"  # Or choose another PosS model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "The capital of France is"
# Use model.device so the inputs land on the device chosen by device_map="auto".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Pass the attention mask along with the input ids; adjust max_new_tokens as needed.
outputs = model.generate(**inputs, max_new_tokens=10)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(generated_text)
```
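The example above runs plain autoregressive generation; the speedup in speculative decoding comes from a verification step in which the target model checks a block of draft tokens at once. As a model-free illustration (the function below is a hypothetical sketch of greedy verification, not the PosS inference code), drafts are accepted up to the first disagreement with the target model's own greedy choices:

```python
def verify_greedy(draft_tokens, target_argmax):
    """Greedy speculative-decoding verification (illustrative sketch).

    draft_tokens:  token ids proposed by the draft model
    target_argmax: the target model's greedy choice at each draft position,
                   plus one bonus position (len == len(draft_tokens) + 1)
    """
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)   # draft agrees with target: accept and continue
        else:
            accepted.append(t)   # first disagreement: take the target token, stop
            return accepted
    # All draft tokens accepted: the target's extra prediction comes for free.
    accepted.append(target_argmax[len(draft_tokens)])
    return accepted

print(verify_greedy([5, 9, 2], [5, 9, 7, 1]))  # -> [5, 9, 7]
print(verify_greedy([5, 9, 2], [5, 9, 2, 1]))  # -> [5, 9, 2, 1]
```

The longer the accepted prefix per round, the fewer target-model forward passes are needed, which is why a higher position-wise acceptance rate at deep positions translates into faster decoding.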