Natural language lossless compressor using SmolLM2-135M

by robtacconelli - opened

Hi everyone! πŸ‘‹
Just wanted to share that SmolLM2-135M is the core model behind Nacrith, a lossless compression system that achieves the best compression results we've measured on natural language text β€” outperforming every classical and neural compressor tested, including CMIX, ts_zip, and FineZip (which uses an 8B model).
Some highlights with your 135M model at the center:

0.918 bpb on alice29.txt (βˆ’44% vs CMIX, βˆ’20% vs ts_zip)
0.9389 bpb on enwik8 100 MB (βˆ’8% vs FineZip's 8B fine-tuned model)
0.723 bpb on a document published after the training cutoff β€” confirming it's not memorization
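For anyone unfamiliar with the metric: under an ideal entropy coder, bits per byte (bpb) is just the model's total negative log2-likelihood over the text divided by the byte length of the input. A minimal sketch (the function and the toy numbers are illustrative, not Nacrith's code):

```python
import math

def bits_per_byte(token_probs, n_bytes):
    """Ideal compressed size is the sum of -log2 p(token) over the
    sequence; dividing by the input length in bytes gives bpb."""
    total_bits = sum(-math.log2(p) for p in token_probs)
    return total_bits / n_bytes

# Toy example: 4 tokens covering a 10-byte text, each predicted with
# probability 0.5 -> 1 bit each, 4 bits total -> 0.4 bpb.
print(bits_per_byte([0.5, 0.5, 0.5, 0.5], 10))  # 0.4
```

So a better language model (sharper next-token probabilities) directly translates into a lower bpb.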

The whole system runs on a GTX 1050 Ti with ~500 MB of GGUF weights. SmolLM2-135M hits a remarkable sweet spot: its predictions are strong enough to beat models 60× its size, yet it's small enough to fit on consumer hardware.
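For readers curious how an LLM becomes a lossless compressor at all: the standard recipe is arithmetic coding driven by the model's next-token distribution. Each symbol narrows a numeric interval in proportion to its predicted probability, so well-predicted symbols cost very few bits, and the decoder can invert the process exactly because it runs the same model. A toy sketch with exact rationals and a fixed three-symbol distribution standing in for SmolLM2's predictions (everything below is illustrative, not Nacrith's actual codec):

```python
from fractions import Fraction

# Hypothetical static model standing in for the LLM; a real system
# would query the model for a fresh distribution at every step.
MODEL = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
SYMBOLS = list(MODEL)

def cum_range(sym):
    """Cumulative [low, high) probability interval for a symbol."""
    lo = Fraction(0)
    for s in SYMBOLS:
        if s == sym:
            return lo, lo + MODEL[s]
        lo += MODEL[s]
    raise KeyError(sym)

def encode(text):
    """Narrow [0, 1) by each symbol's probability interval."""
    lo, hi = Fraction(0), Fraction(1)
    for ch in text:
        s_lo, s_hi = cum_range(ch)
        width = hi - lo
        lo, hi = lo + width * s_lo, lo + width * s_hi
    return (lo + hi) / 2  # any number inside the final interval

def decode(code, n):
    """Replay the same interval narrowing to recover n symbols."""
    out = []
    lo, hi = Fraction(0), Fraction(1)
    for _ in range(n):
        width = hi - lo
        target = (code - lo) / width
        for s in SYMBOLS:
            s_lo, s_hi = cum_range(s)
            if s_lo <= target < s_hi:
                out.append(s)
                lo, hi = lo + width * s_lo, lo + width * s_hi
                break
    return "".join(out)

msg = "abacab"
assert decode(encode(msg), len(msg)) == msg  # round-trips losslessly
```

A production coder emits bits incrementally with fixed-precision integers rather than unbounded fractions, but the principle is the same: compression quality reduces to prediction quality.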

πŸ’» Code: https://github.com/robtacconelli/Nacrith-GPU
⭐ Space: https://huggingface.co/spaces/robtacconelli/Nacrith-GPU
πŸ“„ Paper: https://arxiv.org/abs/2602.19626

Would love to hear your thoughts β€” and thank you for making SmolLM2 open! ❀️
