Natural language lossless compressor using SmolLM2-135M
Hi everyone! 👋
Just wanted to share that SmolLM2-135M is the core model behind Nacrith, a lossless compression system that achieves the best compression results we've measured on natural language text, outperforming every classical and neural compressor we tested, including CMIX, ts_zip, and FineZip (which uses an 8B model).
Some highlights with your 135M model at the center:
0.918 bpb on alice29.txt (−44% vs CMIX, −20% vs ts_zip)
0.9389 bpb on the 100 MB enwik8 (−8% vs FineZip's fine-tuned 8B model)
0.723 bpb on a document published after the model's training cutoff, confirming the result is not memorization
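In case the metric is unfamiliar: bits per byte (bpb) is just compressed output size in bits divided by input size in bytes. A minimal sketch (the alice29.txt size is the standard Canterbury-corpus figure; the compressed size here is back-computed from the 0.918 bpb number, not taken from Nacrith's actual output):

```python
def bits_per_byte(original_bytes: int, compressed_bytes: int) -> float:
    """Compression rate: bits of output per byte of input."""
    return 8 * compressed_bytes / original_bytes

# alice29.txt is 152,089 bytes; at 0.918 bpb the output is ~17.45 KB.
print(bits_per_byte(152089, 17452))
```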
The whole system runs on a GTX 1050 Ti with ~500 MB of GGUF weights. SmolLM2-135M hits a remarkable sweet spot: predictions strong enough to beat models 60× its size, yet small enough to fit on consumer hardware.
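For anyone curious how a language model becomes a compressor: an entropy coder spends about −log2 p(symbol) bits per symbol, so better next-token predictions directly translate into smaller output. Here is a toy illustration of that principle with an adaptive character bigram model standing in for the LLM (this is not Nacrith's actual pipeline, just the underlying idea):

```python
import math
from collections import Counter, defaultdict

def lm_code_length_bits(text: str) -> float:
    """Shannon code length (bits) if each character were entropy-coded
    under an adaptive order-1 (previous-character) model.
    The stronger the predictor, the fewer bits a real arithmetic
    coder would need to losslessly encode the text."""
    counts = defaultdict(Counter)  # context char -> next-char frequencies
    total_bits = 0.0
    ctx = ""
    for ch in text:
        seen = counts[ctx]
        # Laplace smoothing over 256 possible byte values
        p = (seen[ch] + 1) / (sum(seen.values()) + 256)
        total_bits += -math.log2(p)  # ideal cost of coding ch given ctx
        seen[ch] += 1                # update the model after coding
        ctx = ch
    return total_bits

text = "the quick brown fox jumps over the lazy dog " * 20
# The adaptive model improves as it sees repeats, so the rate falls below 8 bits/char.
print(lm_code_length_bits(text) / len(text))
```

Replacing the bigram model with an LLM's next-token distribution is what lets a 135M-parameter predictor drive compression rates under 1 bpb.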
💻 Code: https://github.com/robtacconelli/Nacrith-GPU
⚡ Space: https://huggingface.co/spaces/robtacconelli/Nacrith-GPU
📄 Paper: https://arxiv.org/abs/2602.19626
Would love to hear your thoughts, and thank you for making SmolLM2 open! ❤️