Commit 25346d8 (verified) by littleworth · 1 parent: fbc5b61

docs: cite bioRxiv preprint (doi:10.64898/2026.02.17.706304)

Files changed (1): README.md (+12 −5)
README.md CHANGED

@@ -20,7 +20,7 @@ base_model: nferruz/ProtGPT2
 
 A compact protein language model distilled from [ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) using **complementary-regularizer distillation**---a method that combines uncertainty-aware position weighting with calibration-aware label smoothing to achieve 87% better perplexity than standard knowledge distillation at 20x compression.
 
-> **Paper**: *Distilling Protein Language Models with Complementary Regularizers* (Wijaya, 2026)
+> **Preprint**: *Distilling Protein Language Models with Complementary Regularizers* (Wijaya, 2026) — [bioRxiv](https://www.biorxiv.org/content/10.64898/2026.02.17.706304)
 > **Code**: [github.com/ewijaya/protein-lm-distill](https://github.com/ewijaya/protein-lm-distill)
 
 ## Model Summary

@@ -181,10 +181,17 @@ Recommended fine-tuning hyperparameters for this model:
 ## Citation
 
 ```bibtex
-@article{wijaya2026distilling,
- title={Distilling Protein Language Models with Complementary Regularizers},
- author={Wijaya, Edward},
- year={2026}
+@article {Wijaya2026.02.17.706304,
+ author = {Wijaya, Edward},
+ title = {Distilling Protein Language Models with Complementary Regularizers},
+ elocation-id = {2026.02.17.706304},
+ year = {2026},
+ doi = {10.64898/2026.02.17.706304},
+ publisher = {Cold Spring Harbor Laboratory},
+ abstract = {Large autoregressive protein language models generate novel sequences de novo, but their size limits throughput and precludes rapid domain adaptation on scarce proprietary data. We distill a 738M-parameter protein language model into compact students using two protein-specific enhancements, uncertainty-aware position weighting and calibration-aware label smoothing, that individually degrade quality yet combine for substantial improvement. We trace this complementary-regularizer effect to information theory: smoothing denoises teacher distributions while weighting amplifies the cleaned signal at biologically variable positions. Students achieve up to 5x inference speedup, preserve natural amino acid distributions, and require as little as 170 MB of GPU memory, enabling deployment on consumer-grade hardware. When fine-tuned on protein families with as few as 50 sequences, students generate more family-matching sequences than the teacher, achieving higher sample efficiency and Pfam hit rates despite their smaller capacity. These results establish distilled protein language models as superior starting points for domain adaptation on scarce data.Competing Interest StatementThe authors have declared no competing interest.},
+ URL = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304},
+ eprint = {https://www.biorxiv.org/content/early/2026/02/25/2026.02.17.706304.full.pdf},
+ journal = {bioRxiv}
 }
 ```
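For context on the method the README and abstract describe, here is a minimal, hypothetical sketch of a distillation loss combining the two regularizers named in the preprint: calibration-aware label smoothing of the teacher distribution, and uncertainty-aware per-position weighting of the divergence term. The function name, signature, and hyperparameters are illustrative assumptions, not the API of the linked repository.

```python
# Hypothetical sketch (not the repository's actual implementation) of a
# complementary-regularizer distillation loss: label-smooth the teacher's
# softened distribution, then weight each sequence position by the smoothed
# teacher's entropy so variable positions contribute more.
import torch
import torch.nn.functional as F


def distill_loss(student_logits, teacher_logits, alpha=0.1, temperature=2.0):
    """student_logits, teacher_logits: tensors of shape (batch, seq_len, vocab)."""
    vocab = teacher_logits.size(-1)
    # Calibration-aware label smoothing: mix the temperature-softened teacher
    # distribution with a uniform distribution to denoise overconfident targets.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    smoothed = (1.0 - alpha) * teacher_probs + alpha / vocab
    # Uncertainty-aware position weighting: per-position entropy of the
    # smoothed teacher, normalized to mean 1 across the batch.
    log_smoothed = smoothed.clamp_min(1e-9).log()
    entropy = -(smoothed * log_smoothed).sum(dim=-1)
    weights = (entropy / entropy.mean()).detach()
    # Position-wise KL(smoothed teacher || student), scaled by T^2 as in
    # standard temperature-based distillation.
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (smoothed * (log_smoothed - log_student)).sum(dim=-1)
    return (weights * kl).mean() * temperature**2
```

The entropy weighting follows the abstract's claim that the method "amplifies the cleaned signal at biologically variable positions"; how the actual paper computes the weights may differ.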