---
library_name: transformers
pipeline_tag: text-generation
license: mit
tags:
  - long-sequence-generation
  - llm-acceleration
---

# TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

TokenSwift is a framework that substantially accelerates the generation of ultra-long sequences (up to 100K tokens) while preserving the target model's inherent output quality. It achieves over 3× speedup across models of varying scales and architectures, translating to significant time savings for ultra-long sequence generation. TokenSwift is described in the paper [From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens](https://arxiv.org/abs/2502.18890).
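Lossless acceleration of this kind rests on the general draft-and-verify principle: a cheap drafter proposes several tokens, and the target model only keeps proposals it would have produced itself. The toy sketch below illustrates that principle only, not TokenSwift's actual algorithm; `target_next` and `draft_next` are hypothetical single-token predictors, and real implementations verify all drafts in one batched forward pass rather than a Python loop. The `gamma` parameter plays the same role as the `--gamma` flag in the inference command below.

```python
def generate_lossless(target_next, draft_next, prompt, gen_len, gamma=4):
    """Draft-and-verify decoding sketch.

    target_next(seq) -> next token the (expensive) target model would emit.
    draft_next(seq)  -> next token a (cheap) drafter proposes.
    Output is identical to plain greedy decoding with target_next alone,
    because a drafted token is kept only if the target agrees with it.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < gen_len:
        # Drafter speculates gamma tokens ahead.
        drafts, ctx = [], list(seq)
        for _ in range(gamma):
            t = draft_next(ctx)
            drafts.append(t)
            ctx.append(t)
        # Target verifies drafts left to right; first mismatch is
        # replaced by the target's own token and the rest are discarded.
        for t in drafts:
            if target_next(seq) == t:
                seq.append(t)
            else:
                seq.append(target_next(seq))
                break
    return seq[len(prompt):][:gen_len]
```

Because rejected drafts fall back to the target's own choice, the output matches ordinary decoding exactly; the speedup comes from how many drafts the target can verify per forward pass.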

## Getting Started

### Models Download

| Model Name | Download Link |
|------------|---------------|
| TokenSwift-Yarn-Llama-2-7b-128k | HuggingFace |
| TokenSwift-Llama-3.1-8B | HuggingFace |
| TokenSwift-Qwen2.5-1.5B | HuggingFace |
| TokenSwift-Qwen2.5-7B | HuggingFace |
| TokenSwift-Qwen2.5-14B | HuggingFace |
| TokenSwift-DeepSeek-R1-Distill-Qwen-32B | HuggingFace |

### Inference (LLaMA 3.1-8B Example)

```shell
torchrun --master-port 1111 --nproc_per_node=1 main.py \
    --model_type llama3_1 \
    --ckpt_path your_checkpoint_path \
    --prefill_len 4096 \
    --retrival_max_budget 4096 \
    --gen_len 102400 \
    --gamma 4 \
    --min_p 0.1 \
    --temperature 1.0 \
    --tree_decoding \
    --ngram_topk 20 \
    --penalty 1.2 \
    --penalty_length 1024 \
    --prompt_id 0
```

> **Note:** modify the data and model paths (e.g. `--ckpt_path`) to match your environment.
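The `--min_p` and `--penalty` flags correspond to standard sampling techniques: min-p filtering discards tokens whose probability falls below a fraction of the most likely token's probability, and a CTRL-style repetition penalty down-weights recently generated tokens. The sketch below assumes those standard definitions; it is an illustration, not TokenSwift's actual sampling code.

```python
import numpy as np

def min_p_filter(logits, min_p=0.1, temperature=1.0):
    """Min-p sampling: keep only tokens whose probability is at least
    min_p times the probability of the most likely token."""
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    mask = probs >= min_p * probs.max()
    filtered = np.where(mask, probs, 0.0)
    return filtered / filtered.sum()  # renormalized distribution

def apply_repetition_penalty(logits, recent_token_ids, penalty=1.2):
    """CTRL-style penalty: divide positive logits (multiply negative ones)
    for tokens seen in the recent window, making repeats less likely."""
    logits = logits.copy()
    for t in set(recent_token_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits
```

With `--min_p 0.1` and `--temperature 1.0`, any token at least 10% as likely as the top token survives filtering; `--penalty_length 1024` would bound the size of the recent window passed as `recent_token_ids`.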

Code and further instructions are available at: https://github.com/bigai-nlco/TokenSwift

## Citation

```bibtex
@misc{tokenswift,
      title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens},
      author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
      year={2025},
      eprint={2502.18890},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18890},
}
```