robtacconelli
posted an update about 22 hours ago
πŸ† Nacrith: a 135M model that out-compresses everything on natural language

What if a tiny LM could compress English text better than _every_ compressor out there, classical or neural, small or large?

Nacrith pairs SmolLM2-135M with an ensemble of online predictors and high-precision arithmetic coding.

What's inside

The standard LLM+arithmetic coding approach wastes ~75% of CDF precision on large vocabularies. Our CDF-24 fix alone recovers 0.5 bpb. On top of that: a token N-gram that skips the GPU on predictable tokens, an adaptive bias head, a llama.cpp backend (7× faster than PyTorch), multi-GPU parallel compression, and a binary file format (NC06), the first LLM-based binary compressor we know of.
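The core of the CDF-precision idea (quantizing the model's predicted distribution into high-precision integer cumulative frequencies so that large vocabularies don't eat the coder's range) can be sketched in a few lines. This is a minimal, illustrative sketch, not Nacrith's actual API; the function name and the tie-breaking rule are assumptions.

```python
# Sketch: quantize a predicted token distribution into 24-bit integer
# cumulative frequencies for an arithmetic coder. Illustrative only.

CDF_BITS = 24
TOTAL = 1 << CDF_BITS  # 16,777,216 integer slots to divide among tokens


def quantize_cdf(probs):
    """Map float probabilities to integer frequencies summing to TOTAL.

    Every token keeps a frequency of at least 1 so the coder can always
    encode it; the rounding residue is absorbed by the most likely token.
    """
    freqs = [max(1, int(p * TOTAL)) for p in probs]
    # Fix the rounding error so the frequencies sum exactly to TOTAL.
    excess = sum(freqs) - TOTAL
    freqs[freqs.index(max(freqs))] -= excess
    # Cumulative table: [cdf[i], cdf[i+1]) is token i's coding interval.
    cdf = [0]
    for f in freqs:
        cdf.append(cdf[-1] + f)
    return cdf


cdf = quantize_cdf([0.7, 0.2, 0.05, 0.05])
assert cdf[-1] == TOTAL  # the full integer range is used, none wasted
```

With a 16-bit table and a ~50k-token vocabulary, most tokens would round to the minimum frequency and the model's probabilities would be badly distorted; 24 bits leaves enough resolution for every token.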

Runs on a GTX 1050 Ti. ~500 MB weights, ~1.2 GB VRAM per worker.

💻 Code: https://github.com/robtacconelli/Nacrith-GPU
⭐ Space: robtacconelli/Nacrith-GPU
📄 Paper: Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding (2602.19626)

Try it, break it, share your results; all feedback welcome. A ⭐ on the repo is appreciated!

Results across all systems we tested:
- alice29.txt → 0.918 bpb (−44% vs CMIX, −20% vs ts_zip), below the 2nd-order Shannon entropy bound
- enwik8 (100 MB) → 0.9389 bpb (−8% vs FineZip/LLMZip's 8B model, −15% vs ts_zip)
- Unseen text → 0.723 bpb on a doc published after the training cutoff, so no memorization; 26% better than FineZip/LLMZip on the same model
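For anyone reproducing these numbers: bits per byte (bpb) is just the compressed size in bits divided by the original size in bytes. A quick sanity check against the enwik8 figure above (the exact compressed size here is back-derived for illustration):

```python
def bpb(compressed_bytes: int, original_bytes: int) -> float:
    """Bits of compressed output per byte of original input."""
    return compressed_bytes * 8 / original_bytes


# enwik8 is exactly 100,000,000 bytes; compressing it to ~11.74 MB
# corresponds to the ~0.939 bpb quoted above.
print(round(bpb(11_736_250, 100_000_000), 4))  # -> 0.9389
```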

SmolLM2-135M by HuggingFaceTB