---
library_name: transformers
pipeline_tag: text-generation
license: mit
tags:
  - long-sequence-generation
  - llm-acceleration
---

# TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

TokenSwift is a framework that substantially accelerates the generation of ultra-long sequences (up to 100K tokens) while preserving the target model's inherent output quality. It achieves over 3× speedup across models of varying scales and architectures, translating to significant time savings for ultra-long sequence generation. TokenSwift is described in the paper [From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens](https://arxiv.org/abs/2502.18890).
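Lossless acceleration of this kind rests on the general draft-and-verify principle: a cheap drafter proposes several tokens, and the target model only keeps proposals it would have produced itself. The toy sketch below illustrates that principle only, not TokenSwift's actual algorithm; `target_next` and `draft_next` are hypothetical single-token predictors, and real implementations verify all drafts in one batched forward pass rather than a Python loop. The `gamma` parameter plays the same role as the `--gamma` flag in the inference command below.

```python
def generate_lossless(target_next, draft_next, prompt, gen_len, gamma=4):
    """Draft-and-verify decoding sketch.

    target_next(seq) -> next token the (expensive) target model would emit.
    draft_next(seq)  -> next token a (cheap) drafter proposes.
    Output is identical to plain greedy decoding with target_next alone,
    because a drafted token is kept only if the target agrees with it.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < gen_len:
        # Drafter speculates gamma tokens ahead.
        drafts, ctx = [], list(seq)
        for _ in range(gamma):
            t = draft_next(ctx)
            drafts.append(t)
            ctx.append(t)
        # Target verifies drafts left to right; first mismatch is
        # replaced by the target's own token and the rest are discarded.
        for t in drafts:
            if target_next(seq) == t:
                seq.append(t)
            else:
                seq.append(target_next(seq))
                break
    return seq[len(prompt):][:gen_len]
```

Because rejected drafts fall back to the target's own choice, the output matches ordinary decoding exactly; the speedup comes from how many drafts the target can verify per forward pass.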

## Getting Started

### Models Download

| Model Name | Download Link |
|------------|---------------|
| TokenSwift-Yarn-Llama-2-7b-128k | HuggingFace |
| TokenSwift-Llama-3.1-8B | HuggingFace |
| TokenSwift-Qwen2.5-1.5B | HuggingFace |
| TokenSwift-Qwen2.5-7B | HuggingFace |
| TokenSwift-Qwen2.5-14B | HuggingFace |
| TokenSwift-DeepSeek-R1-Distill-Qwen-32B | HuggingFace |

### Inference (LLaMA 3.1-8B Example)

```shell
torchrun --master-port 1111 --nproc_per_node=1 main.py \
    --model_type llama3_1 \
    --ckpt_path your_checkpoint_path \
    --prefill_len 4096 \
    --retrival_max_budget 4096 \
    --gen_len 102400 \
    --gamma 4 \
    --min_p 0.1 \
    --temperature 1.0 \
    --tree_decoding \
    --ngram_topk 20 \
    --penalty 1.2 \
    --penalty_length 1024 \
    --prompt_id 0
```

> **Note:** modify the data and model paths (e.g. `--ckpt_path`) to match your environment.
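The `--min_p` and `--penalty` flags correspond to standard sampling techniques: min-p filtering discards tokens whose probability falls below a fraction of the most likely token's probability, and a CTRL-style repetition penalty down-weights recently generated tokens. The sketch below assumes those standard definitions; it is an illustration, not TokenSwift's actual sampling code.

```python
import numpy as np

def min_p_filter(logits, min_p=0.1, temperature=1.0):
    """Min-p sampling: keep only tokens whose probability is at least
    min_p times the probability of the most likely token."""
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    mask = probs >= min_p * probs.max()
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()  # renormalized distribution

def apply_repetition_penalty(logits, recent_token_ids, penalty=1.2):
    """CTRL-style penalty: divide positive logits (multiply negative ones)
    for tokens seen in the recent window, making repeats less likely."""
    logits = logits.copy()
    for t in set(recent_token_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits
```

With `--min_p 0.1` and `--temperature 1.0`, any token at least 10% as likely as the top token survives filtering; `--penalty_length 1024` would bound the size of the recent window passed as `recent_token_ids`.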

Code and further instructions are available at: https://github.com/bigai-nlco/TokenSwift

## Citation

```bibtex
@misc{tokenswift,
      title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens},
      author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
      year={2025},
      eprint={2502.18890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18890},
}
```