# Structural FFN Decomposition Guides Cross-Model Compression and Quantization
Artifacts for the paper by Yeonseong Cynn (River Lab, May 2026).
## Summary
The paper decomposes transformer FFN layers into structural (format-preserving) and classification-relevant components across BERT and GPT-2.
Key findings:
- Early-layer FFNs are 90-200x more structural than classification-relevant; late layers approach a 1:1 ratio.
- Structural pruning: attention-head and FFN-neuron removal with layer-wise retraining achieves 19.1% parameter reduction on BERT (SST-2) and 9.1% on GPT-2 with no accuracy loss.
- Neuron pruning: removing the 8% of FFN neurons that rarely activate improves BERT accuracy by 0.3% (sketched below).
- Mixed-precision quantization: INT4 on the structurally dominant layers (L1-L3) with STE retraining recovers accuracy to -2.1% relative to FP32 (sketched below).
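
The rarely-active-neuron pruning finding can be illustrated with a short sketch: count how often each FFN neuron's post-GELU activation is positive over a calibration batch, then drop the least active 8%. Everything beyond that idea is an assumption here: the layer index, the toy calibration sentences, and the activity criterion are illustrative, not the paper's protocol.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2"
)
tok = BertTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
layer = model.bert.encoder.layer[7]  # illustrative layer choice

# Count how often each of the 3072 FFN neurons fires (post-GELU > 0).
counts = torch.zeros(layer.intermediate.dense.out_features)

def hook(_module, _inputs, output):
    counts.add_((output > 0).float().sum(dim=(0, 1)))

handle = layer.intermediate.register_forward_hook(hook)
batch = tok(["a toy calibration sentence", "another example"],
            return_tensors="pt", padding=True)
with torch.no_grad():
    model(**batch)
handle.remove()

# Keep the 92% most frequently active neurons, drop the rest:
# row i of intermediate.dense and column i of output.dense go together.
k = int(0.92 * counts.numel())
keep = counts.topk(k).indices.sort().values
inter, out = layer.intermediate.dense, layer.output.dense
inter.weight = torch.nn.Parameter(inter.weight[keep].detach().clone())
inter.bias = torch.nn.Parameter(inter.bias[keep].detach().clone())
inter.out_features = k
out.weight = torch.nn.Parameter(out.weight[:, keep].detach().clone())
out.in_features = k
```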
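
The INT4 + STE finding, likewise, can be sketched as fake quantization in the forward pass with gradients passed straight through in the backward pass. The symmetric per-tensor quantizer below is an assumption; the paper's exact scheme may differ.

```python
import torch

class FakeQuantINT4(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        # Symmetric INT4: integer levels in [-8, 7].
        scale = w.abs().max() / 7
        q = torch.clamp(torch.round(w / scale), -8, 7)
        return q * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend quantization is identity.
        return grad_output

# During retraining, quantize a weight in the forward pass:
w = torch.randn(3072, 768, requires_grad=True)
w_q = FakeQuantINT4.apply(w)
loss = w_q.sum()
loss.backward()  # w.grad flows through the rounding via the STE
```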
## Files
### Weights
- `bert_sst2_int4_ste.pt`: BERT SST-2 with L1-L3 INT4 quantization + STE retraining. Standard BERT state_dict, loadable directly (minimal loading sketch below). Accuracy: 90.1% (original FP32: 92.4%).
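
Since the checkpoint is a standard state_dict, loading should reduce to the usual two steps (a minimal sketch; the file path is relative to this repo):

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2"
)
state_dict = torch.load("bert_sst2_int4_ste.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```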
### Results: BERT (`results/bert/`)
- `bert_structural_prune.json`: per-layer structural pruning results (head/FFN reduction, accuracy)
- `bert_sst2_all_prune.json`: all-layer simultaneous FFN pruning results
- `bert_l8_prune_results.json`: L8 FFN correction + pruning (multi-seed)
- `bert_quantize_results.json`: INT4/INT8 post-training quantization results
- `bert_quantize_retrain.json`: INT4 STE retraining results
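
The result files are plain JSON and can be inspected directly (their internal field names are not documented here, so the snippet assumes nothing about them):

```python
import json

with open("results/bert/bert_structural_prune.json") as f:
    results = json.load(f)
print(results)  # per-layer head/FFN reduction and accuracy entries
```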
### Results: GPT-2 (`results/gpt2/`)
- `gpt2_structural_prune.json`: per-layer structural pruning (head + FFN)
- `gpt2_each_layer_prune.json`: individual layer compression results
- `gpt2_prune_validate.json`: pruning validation (PPL, accuracy)
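
For the PPL numbers in `gpt2_prune_validate.json`, here is a minimal sketch of how GPT-2 perplexity is typically computed with `transformers` (the actual evaluation data and protocol are the paper's, not shown here):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

text = "An example sentence for perplexity evaluation."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean token cross-entropy
print(f"PPL = {loss.exp().item():.1f}")
```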
### Figures
- `figures/fig1_ratio.png`: FFN dual-role ratio, BERT vs. GPT-2 (log scale)
- `figures/fig2_compression.png`: per-layer compression rate comparison
- `figures/fig3_pruning.png`: BERT SST-2 FFN neuron pruning curve
- `figures/fig4_quantization.png`: INT4 quantization results (PTQ vs. STE)
## Base Models
- BERT: `textattack/bert-base-uncased-SST-2`
- GPT-2: `gpt2` (124M parameters, pre-trained)
## License
MIT