---
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---
# BlockFFN-3B-SFT-EAGLE
This is the 3B BlockFFN model used for the acceleration tests in the paper *BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity*.

BlockFFN is a Mixture-of-Experts (MoE) architecture designed for efficient inference, particularly on end-side devices. It achieves high activation sparsity at both the token level and the chunk level (runs of consecutive tokens), which makes it acceleration-friendly and compatible with techniques such as speculative decoding.

For the full codebase and more details, visit the official GitHub repository.
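To make the token-level vs. chunk-level distinction concrete, here is a small toy sketch (illustrative only, with made-up random activations — not BlockFFN's actual routing code). An expert is "inactive for a chunk" only if it is inactive for every token in that chunk, which is the quantity that matters when expert weights must be loaded once per chunk:

```python
# Toy illustration of token-level vs. chunk-level activation sparsity.
# Illustrative only -- the activations are simulated, not from BlockFFN.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, chunk_size = 32, 8, 4

# Simulated ReLU-like gating scores; the shift makes most entries zero.
scores = rng.normal(size=(num_tokens, num_experts))
activations = np.maximum(scores - 1.0, 0.0)

active = activations > 0  # (tokens, experts) boolean activation mask

# Token-level sparsity: fraction of (token, expert) pairs that are inactive.
token_sparsity = 1.0 - active.mean()

# Chunk-level sparsity: an expert counts as inactive for a chunk only if it
# is inactive for EVERY token in the chunk, since its weights must be loaded
# whenever any token in the chunk needs it.
chunks = active.reshape(num_tokens // chunk_size, chunk_size, num_experts)
chunk_active = chunks.any(axis=1)  # (chunks, experts)
chunk_sparsity = 1.0 - chunk_active.mean()

print(f"token-level sparsity: {token_sparsity:.2f}")
print(f"chunk-level sparsity: {chunk_sparsity:.2f}")
```

Chunk-level sparsity is always at or below token-level sparsity, which is why training for high chunk-level sparsity (as the paper does) is the harder, acceleration-relevant target.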
## Usage
You can load and use this model with the Hugging Face `transformers` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_name = "SparseLLM/BlockFFN-3B-SFT-EAGLE"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # or torch.float16 if bfloat16 is not supported
    device_map="auto",
    trust_remote_code=True,
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Generate text
prompt = "The quick brown fox jumps over the lazy"
output = pipe(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```
## Citation
If you find our work useful for your research, please cite our paper:
```bibtex
@article{song2025blockffn,
  title={{BlockFFN}: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity},
  author={Chenyang Song and Weilin Zhao and Xu Han and Chaojun Xiao and Yingfa Chen and Yuxuan Li and Zhiyuan Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2507.08771},
  year={2025},
  url={https://arxiv.org/pdf/2507.08771},
}
```