	---
	language:
	- en
	- zh
	license: apache-2.0
	pipeline_tag: text-generation
	library_name: transformers
	---

	# BlockFFN-XLarge

	This is the original 1.2B BlockFFN checkpoint used for the acceleration tests in the paper *BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity*.

	Links: [[Paper](https://arxiv.org/pdf/2507.08771)] [[Code](https://github.com/thunlp/BlockFFN)]

	### Introduction

	BlockFFN is a Mixture-of-Experts (MoE) architecture designed to increase activation sparsity at both the token level and the chunk level, which makes LLMs easier to accelerate, especially on end-side devices. It introduces a new router for differentiable and flexible routing and is trained with objectives that are aware of chunk-level sparsity (CLS). The paper reports superior performance over other MoE baselines and a significant speedup on end-side devices.
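
To make the distinction between token-level and chunk-level sparsity concrete, here is a minimal sketch in plain Python. It assumes a binary activation mask of shape `[num_tokens][num_experts]`, takes token-level sparsity to be the fraction of inactive (token, expert) pairs, and takes chunk-level sparsity to be the fraction of experts that no token in a chunk activates, averaged over non-overlapping chunks. The function names, the toy mask, and the chunking scheme are illustrative assumptions, not code or definitions taken from the BlockFFN repository.

```python
def token_level_sparsity(mask):
    """Fraction of (token, expert) pairs that are inactive."""
    total = sum(len(row) for row in mask)
    active = sum(sum(row) for row in mask)
    return 1.0 - active / total


def chunk_level_sparsity(mask, chunk_len):
    """Average fraction of experts that no token in a chunk activates.

    Assumption: chunks are non-overlapping and a trailing partial chunk is dropped.
    """
    num_experts = len(mask[0])
    ratios = []
    for start in range(0, len(mask) - chunk_len + 1, chunk_len):
        chunk = mask[start:start + chunk_len]
        # An expert counts as used if any token in the chunk activates it.
        used = [any(token[e] for token in chunk) for e in range(num_experts)]
        ratios.append(1.0 - sum(used) / num_experts)
    return sum(ratios) / len(ratios)


# Hypothetical toy mask: 4 tokens x 4 experts, 1 = expert activated.
mask = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]

tls = token_level_sparsity(mask)
cls2 = chunk_level_sparsity(mask, chunk_len=2)
print(f"token-level sparsity: {tls}, 2-token chunk-level sparsity: {cls2}")
```

In this toy mask the first two tokens share an expert, so their chunk stays as sparse as a single token, while the last two tokens use different experts and their chunk is less sparse. Chunk-level sparsity can never exceed token-level sparsity, which is why it needs to be encouraged separately during training.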

	### How to Use

	The core implementation of BlockFFN is available in the [GitHub repository](https://github.com/thunlp/BlockFFN). The model loads with the standard `AutoTokenizer` and `AutoModelForCausalLM` classes from `transformers`.

	#### Text Generation

	```python
	import torch
	from transformers import AutoTokenizer, pipeline

	model_name = "SparseLLM/BlockFFN-XLarge"  # Or other BlockFFN models like SparseLLM/BlockFFN-XLarge-sft

	# trust_remote_code=True lets transformers run the custom modeling code shipped with the checkpoint.
	pipe = pipeline(
	    "text-generation",
	    model=model_name,
	    tokenizer=AutoTokenizer.from_pretrained(model_name),
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)
	print(pipe("The key to life is", max_new_tokens=20, do_sample=True)[0]["generated_text"])
	```

	#### Get Expert Routing Probabilities

	Expert routing probabilities support mechanistic interpretability by showing which sparse features are activated by which token. Following the standard MoE approach, you can obtain the routing probabilities for all layers by setting `output_router_probs=True`. The example below computes and inspects the expert activation patterns:

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained(
	    "SparseLLM/BlockFFN-XLarge",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    trust_remote_code=True,
	)
	tokenizer = AutoTokenizer.from_pretrained("SparseLLM/BlockFFN-XLarge")

	inputs = tokenizer("City and County of San Francisco", return_tensors="pt")
	# Inference only, so skip gradient tracking.
	with torch.no_grad():
	    outputs = model(**inputs.to(model.device), output_router_probs=True)

	# Get full expert routing probabilities: [batch_size, seq_len, moe_heads, moe_experts**2]
	# Note: The output format for router_probs might vary based on the specific BlockFFN implementation details.
	# This example assumes a common structure for illustration.
	if getattr(outputs, "router_probs", None) is not None:
	    for layer_idx, layer_router_probs in enumerate(outputs.router_probs):
	        print(f"Layer {layer_idx} Router Probs Shape: {layer_router_probs.shape}")
	        # Example: analyze the first token's expert activation in each layer
	        if layer_router_probs.shape[1] > 0:  # Check if there are tokens
	            first_token_probs = layer_router_probs[0, 0]  # batch_idx, token_idx
	            # Assuming first_token_probs is [num_heads, num_experts],
	            # sum across heads to get overall expert importance
	            expert_activations = first_token_probs.sum(dim=0)
	            activated_experts = (expert_activations > 1e-2).nonzero(as_tuple=True)[0]
	            decoded_token = tokenizer.decode(inputs.input_ids[0, 0])
	            print(f"Token: '{decoded_token}' (Layer {layer_idx}) Activated Experts Count: {len(activated_experts)}")
	            # print(f"Activated Expert Indices: {activated_experts.tolist()}")
	else:
	    print("Model output does not contain 'router_probs'.")
	```

	### Citation

	If you find our work useful for your research, please cite our paper as follows:

	```
	@article{song2025blockffn,
	  title={{BlockFFN}: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity},
	  author={Chenyang Song and Weilin Zhao and Xu Han and Chaojun Xiao and Yingfa Chen and Yuxuan Li and Zhiyuan Liu and Maosong Sun},
	  journal={arXiv preprint arXiv:2507.08771},
	  year={2025},
	  url={https://arxiv.org/pdf/2507.08771},
	}
	```