---
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
  - moe
---

# BlockFFN-3B-SFT

This is the original 3B BlockFFN checkpoint used for the acceleration tests in the paper *BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity*.

BlockFFN introduces a novel Mixture-of-Experts (MoE) architecture designed to reduce the computational burden of large language models (LLMs) by promoting both token-level sparsity (TLS) and chunk-level sparsity (CLS). It features a new router that combines ReLU activation with RMSNorm for differentiable and flexible routing, and CLS-aware training objectives that make the model more acceleration-friendly, particularly in low-resource settings such as end-side devices. BlockFFN also ships efficient acceleration kernels that, for the first time, combine activation sparsity with speculative decoding. Experiments show that BlockFFN achieves high TLS and CLS and delivers significant speedups over dense models on real end-side devices.

Links: [[Paper](https://arxiv.org/pdf/2507.08771)] [[Code](https://github.com/thunlp/BlockFFN)] [[Models Collection](https://huggingface.co/SparseLLM)]
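
The routing mechanism described above can be illustrated with a small, self-contained sketch. This is **not** the released implementation: the names (`BlockFFNRouter`, `hidden_size`, `num_experts`, `chunk_size`) and the exact way TLS/CLS are computed here are illustrative assumptions. The snippet only shows the general pattern of a ReLU + RMSNorm router and how token-level and chunk-level sparsity could be measured from its non-negative routing weights (`nn.RMSNorm` requires a recent PyTorch release):

```python
# Illustrative sketch only; names and details are assumptions, not the paper's code.
import torch
import torch.nn as nn


class BlockFFNRouter(nn.Module):
    """Hypothetical ReLU + RMSNorm router in the spirit of BlockFFN."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_experts, bias=False)
        self.norm = nn.RMSNorm(num_experts, eps=1e-6)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # ReLU keeps routing weights non-negative and naturally sparse;
        # RMSNorm rescales them so the expert mixture stays well-conditioned.
        scores = torch.relu(self.proj(hidden_states))
        return self.norm(scores)


def sparsity_stats(router_weights: torch.Tensor, chunk_size: int = 8):
    """router_weights: (seq_len, num_experts) non-negative routing weights."""
    active = router_weights > 0  # boolean activation mask per token and expert
    # TLS: average fraction of experts inactive per token.
    tls = 1.0 - active.float().mean().item()
    # CLS: fraction of (chunk, expert) pairs where the expert is inactive for
    # every token in the chunk, i.e. can be skipped for the whole chunk.
    seq_len = active.shape[0]
    usable = seq_len // chunk_size * chunk_size
    chunk_active = active[:usable].view(-1, chunk_size, active.shape[1]).any(dim=1)
    cls_ = 1.0 - chunk_active.float().mean().item()
    return tls, cls_


router = BlockFFNRouter(hidden_size=64, num_experts=16)
weights = router(torch.randn(32, 64))
print(sparsity_stats(weights))
```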

### How to use

You can load and run this model with `AutoTokenizer` and `AutoModelForCausalLM` from the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "SparseLLM/BlockFFN-3B-SFT"

# trust_remote_code=True is required because the BlockFFN architecture ships
# its modeling code with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Tokenize a prompt and move it to the same device as the model.
text = "Hello, my name is"
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# Sample a short continuation with nucleus sampling.
outputs = model.generate(input_ids, max_new_tokens=20, do_sample=True, top_p=0.8, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
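
Because this is an SFT checkpoint, chat-style prompting may work better than raw completion. The snippet below is a sketch that assumes the checkpoint bundles a chat template; it reuses the `tokenizer` and `model` objects from the example above:

```python
# Sketch: chat-style prompting via the tokenizer's chat template (assuming the
# checkpoint ships one); reuses `tokenizer` and `model` from the snippet above.
messages = [{"role": "user", "content": "Briefly explain mixture-of-experts models."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, do_sample=True, top_p=0.8, temperature=0.8)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```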

### Citation

If you find our work useful for your research, please cite our paper as follows:

```
@article{song2025blockffn,
  title={{BlockFFN}: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity},
  author={Chenyang Song and Weilin Zhao and Xu Han and Chaojun Xiao and Yingfa Chen and Yuxuan Li and Zhiyuan Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2507.08771},
  year={2025},
  url={https://arxiv.org/pdf/2507.08771},
}
```