Nacrith: How a 135M Language Model Became the Best Text Compressor We've Tested

Community Article Published February 24, 2026

TL;DR: Nacrith combines SmolLM2-135M with ensemble online predictors and high-precision arithmetic coding to achieve the best lossless compression results on natural language text across every system we evaluated, classical or neural, small or large. It runs on old GPUs (e.g., a GTX 1050 Ti), it's open-source, and you can try it right now.

📄 Paper: arxiv.org/abs/2602.19626
💻 Code: github.com/robtacconelli/Nacrith-GPU
⭐ Space: huggingface.co/spaces/robtacconelli/Nacrith-GPU


The Idea Behind Nacrith

Shannon proved in 1948 that compression is prediction: the better you predict what comes next, the fewer bits you need to encode it. Every compressor since, from gzip to CMIX, is fundamentally a prediction engine paired with an entropy coder.

Language models are arguably the best next-token predictors ever built. So a natural question arises: what happens when you plug a modern transformer into an arithmetic coder and use it as a compressor?

The answer, it turns out, is really good compression, but only if you solve several engineering problems that previous work either ignored or handled poorly. That's what Nacrith does.

The Results

Let's start with the numbers, because they're the reason you're reading this.

Standard Benchmarks

Benchmark             Nacrith      gzip   bzip2  CMIX v21  ts_zip  FineZip
enwik8 (100 MB)       0.9389 bpb   2.916  2.321  1.17      ~1.11   1.024
alice29.txt (152 KB)  0.918 bpb    2.85   2.27   1.63      1.14    n/a

On alice29.txt, Nacrith compresses to 11.5% of original size, outperforming CMIX by 44% and ts_zip by 20%. On enwik8, it beats FineZip by 8% despite FineZip using a fine-tuned LLaMA-3-8B (60× more parameters).

[Figure: enwik8 compression comparison]

[Figure: alice29.txt compression comparison]

But is it just memorization?

Both alice29.txt and enwik8 are almost certainly in SmolLM2's training data. Fair objection. So we ran an out-of-distribution test on a UK government report published in October 2025, a full year after SmolLM2's release.

Compressor              Size      bpb
gzip -9                 91,348 B  2.189
CMIX v21                47,897 B  1.148
ts_zip                  40,237 B  0.964
FineZip (SmolLM2-135M)  40,747 B  0.977
Nacrith                 30,171 B  0.723

Nacrith achieves 0.723 bpb on text it has provably never seen, beating ts_zip by 25% and CMIX by 37%. The comparison with FineZip is especially telling: both use the exact same SmolLM2-135M model, but Nacrith compresses 26% smaller, isolating the gains from architecture alone.

How It Works

Nacrith's compression pipeline is conceptually simple: tokenize the input, predict each token's probability using an ensemble of models, feed those probabilities into an arithmetic coder. Decompression is the mirror image. Because everything is deterministic, reconstruction is perfectly lossless.
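To see why better predictions mean smaller files, here is a minimal sketch of the Shannon bound that an arithmetic coder approaches. The helper name is hypothetical, not Nacrith's actual coder; it just sums the ideal code length for each token.

```python
import math

def ideal_code_length_bits(probs):
    """Shannon-optimal code length for a token sequence: sum of -log2 p(token).

    `probs` holds the model's predicted probability for each token that
    actually occurred. A deterministic arithmetic coder approaches this
    bound to within a couple of bits over the whole message.
    """
    return sum(-math.log2(p) for p in probs)

# A confident model (p = 0.9 per token) vs. a uniform guess over
# SmolLM2's 49,152-token vocabulary, for a 100-token message:
confident = ideal_code_length_bits([0.9] * 100)       # ~15.2 bits total
uniform = ideal_code_length_bits([1 / 49152] * 100)   # ~1,558 bits total
```

The gap between those two numbers is the entire value proposition of an LLM-backed compressor.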

The devil is in the details. Here's what makes Nacrith different from a naive "LLM + arithmetic coding" approach.

1. CDF-24: Fixing the Quantization Bottleneck

This is the single most impactful contribution.

Arithmetic coding requires converting a probability distribution into an integer CDF. The standard approach uses a 16-bit CDF total (65,536). But SmolLM2 has a vocabulary of 49,152 tokens, and every token needs at least 1 count to avoid zero-width intervals. Do the math:

Floor allocation: 49,152 / 65,536 = 75%

Three quarters of the CDF range is consumed by minimum-probability floors before any real probability information is encoded. This introduces roughly 2 extra bits per token, a massive waste that compounds over millions of tokens.

Nacrith upgrades to a 24-bit CDF total (16,777,216):

Floor allocation: 49,152 / 16,777,216 = 0.29%

The overhead drops from 75% to essentially nothing. In the ablation study, CDF-24 alone accounts for a 0.5 bpb improvement, the largest single gain from any component.

This is safe with a 32-bit arithmetic coder because the minimum symbol width after range narrowing is 2³¹ / 2²⁴ = 128, well above the representability threshold.
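The arithmetic above can be verified directly. This is a tiny sketch with a hypothetical helper, not code from the repository:

```python
VOCAB = 49_152  # SmolLM2 vocabulary size

def floor_overhead(cdf_total):
    """Fraction of the integer CDF range consumed by the mandatory
    1-count-per-token floor that prevents zero-width intervals."""
    return VOCAB / cdf_total

print(f"16-bit CDF: {floor_overhead(2**16):.1%}")  # 75.0%
print(f"24-bit CDF: {floor_overhead(2**24):.2%}")  # 0.29%

# Safety check for a 32-bit arithmetic coder: after range narrowing, the
# smallest representable symbol interval is range_min / cdf_total wide.
assert 2**31 // 2**24 == 128  # comfortably above the representability limit
```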

2. Token-Level N-gram Model

Nacrith maintains an interpolated 1-to-4-gram model over tokens (not bytes). It's updated online as compression proceeds, adapting to the specific document. Context keys use 64-bit rolling hashes instead of Python tuples, and each context is capped at 64 continuations β€” keeping memory at ~128 MB per worker instead of the ~3.6 GB that naive Python dicts would require.
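A simplified sketch of the hashed-context store, showing a single order with an FNV-1a-style 64-bit hash and the 64-continuation cap. The names and hash choice are illustrative assumptions, not Nacrith's exact implementation:

```python
from collections import defaultdict

MASK64 = (1 << 64) - 1
CAP = 64  # maximum continuations stored per context

class TokenNGram:
    """Online token n-gram counts keyed by a 64-bit rolling hash of the
    context, instead of memory-hungry Python tuples."""

    def __init__(self):
        self.table = defaultdict(dict)  # context hash -> {token: count}

    @staticmethod
    def context_key(tokens):
        h = 1469598103934665603  # FNV-1a offset basis
        for t in tokens:
            h = ((h ^ t) * 1099511628211) & MASK64  # FNV-1a prime
        return h

    def update(self, context, token):
        counts = self.table[self.context_key(context)]
        # Cap continuations: known tokens keep counting, new ones are
        # admitted only while the context is below CAP entries.
        if token in counts or len(counts) < CAP:
            counts[token] = counts.get(token, 0) + 1

    def predict(self, context, token):
        counts = self.table.get(self.context_key(context), {})
        total = sum(counts.values())
        return counts.get(token, 0) / total if total else 0.0
```

The full model interpolates four such orders (1- through 4-grams); this shows only the per-order bookkeeping.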

3. Confidence-Based LLM Skip

This is where it gets interesting. The N-gram's main contribution isn't through the ensemble mixer; it's through skipping the LLM entirely.

When the N-gram's Shannon entropy drops below 1.5 bits (empirically calibrated), the token is so predictable that the N-gram alone provides near-optimal coding. Nacrith bypasses the GPU forward pass and uses the N-gram prediction directly. On highly compressible text, the skip rate reaches 30–70%, dramatically reducing GPU load while simultaneously improving compression.
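The skip rule is a few lines over the N-gram's predicted distribution. The 1.5-bit threshold is from the paper; the function names are illustrative:

```python
import math

SKIP_THRESHOLD_BITS = 1.5  # empirically calibrated threshold

def shannon_entropy_bits(dist):
    """Shannon entropy H = -sum p * log2 p over a {token: prob} dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def should_skip_llm(ngram_dist):
    """Bypass the GPU forward pass when the n-gram is already near-certain."""
    return shannon_entropy_bits(ngram_dist) < SKIP_THRESHOLD_BITS

# A near-certain continuation skips the LLM (H ~ 0.29 bits) ...
assert should_skip_llm({"fox": 0.95, "dog": 0.05})
# ... while a genuinely uncertain one does not (H = 2.0 bits).
assert not should_skip_llm({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25})
```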

The ablation study confirms this: the N-gram + skip combination accounts for a 0.39 bpb improvement (30% relative), making it the second largest contributor after CDF-24.

4. Adaptive Log-Space Bias Head

A per-token bias vector (initialized to zero) applies an additive correction to the LLM's log-probabilities, i.e., a multiplicative correction in probability space:

adjusted_p(t) = softmax(log p_llm + bias)

After each observed token, the bias is updated via one gradient descent step on cross-entropy loss (learning rate 0.001). Over time, it learns to suppress tokens the LLM systematically over-predicts for this document and boost under-predicted ones. The improvement is small but consistent: ~0.015 bpb.
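Because the gradient of cross-entropy with respect to the bias is simply the adjusted distribution minus the one-hot target, one SGD step is cheap. This is a sketch under that derivation, not Nacrith's exact code:

```python
import math

LR = 0.001  # learning rate from the article

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def update_bias(log_p_llm, bias, target):
    """One gradient-descent step on cross-entropy for the bias vector.

    d(-log softmax(log_p + bias)[t]) / d(bias_i) = p_adj_i - [i == t],
    so the observed token's bias is nudged up and all others down.
    """
    p_adj = softmax([lp + b for lp, b in zip(log_p_llm, bias)])
    return [b - LR * (p - (1.0 if i == target else 0.0))
            for i, (b, p) in enumerate(zip(bias, p_adj))]
```

Over a document, tokens the LLM systematically over-predicts accumulate negative bias and under-predicted ones accumulate positive bias, which is exactly the correction described above.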

5. llama.cpp Backend

Nacrith uses llama.cpp instead of PyTorch for inference. The difference is dramatic: ~7× faster single-token decode on the same hardware, because all GPU computation happens in C/C++ with a single Python→C boundary crossing instead of PyTorch's per-call dispatch overhead.

The model loads in GGUF format (FP32, ~500 MB). A dual-tokenizer architecture uses llama.cpp for inference but the HuggingFace tokenizer for text encode/decode, working around 47 whitespace tokens that llama.cpp's detokenizer silently drops.

6. Native KV Cache Sliding Window

When the 2,048-token context fills up, naive implementations reset the entire KV cache and re-evaluate 1,536 tokens from scratch (about 693 ms on a GTX 1050 Ti). Nacrith instead uses llama.cpp's native cache manipulation: remove the old positions, shift the remaining ones down, and re-evaluate only the final token (about 19 ms). That's a 37× speedup per slide, making the overhead effectively zero.
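A minimal sketch of the position bookkeeping, with a plain Python list standing in for the cache's position indices (the real code calls llama.cpp's KV-cache sequence removal and shift primitives; the constants match the numbers above):

```python
CONTEXT_LEN = 2048
DROP = 512  # oldest positions evicted per slide (2048 - 1536 survivors)

def slide_window(positions):
    """Illustrative native-slide bookkeeping: evict the oldest DROP
    positions and shift the survivors down by DROP, so only the newest
    token needs a fresh forward pass."""
    assert len(positions) == CONTEXT_LEN
    kept = positions[DROP:]           # step 1: remove old cache entries
    return [p - DROP for p in kept]   # step 2: shift remaining positions down

cache = list(range(CONTEXT_LEN))
cache = slide_window(cache)
# The cache now holds positions 0..1535; the next token is evaluated at 1536.
```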

7. Parallel Multi-GPU Compression

Text is split into N chunks (at newline boundaries) and compressed concurrently. Each worker owns an independent model instance, N-gram, mixer, and adaptive head: zero shared state, zero synchronization. Worker count auto-scales based on available VRAM. Threading works despite Python's GIL because llama-cpp-python releases the GIL during C-level GPU inference.
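A sketch of the chunking and fan-out. `compress_chunk` is a placeholder for one worker's full pipeline; the splitting heuristic here is an illustrative simplification:

```python
from concurrent.futures import ThreadPoolExecutor

def split_at_newlines(text, n_chunks):
    """Split text into roughly equal chunks, cutting only at newline
    boundaries so no line straddles two workers."""
    lines = text.splitlines(keepends=True)
    target = max(1, len(text) // n_chunks)
    chunks, cur, size = [], [], 0
    for line in lines:
        cur.append(line)
        size += len(line)
        if size >= target and len(chunks) < n_chunks - 1:
            chunks.append("".join(cur))
            cur, size = [], 0
    chunks.append("".join(cur))
    return chunks

def compress_parallel(text, n_workers, compress_chunk):
    # Each worker holds its own model / n-gram / mixer state, so there are
    # no locks; plain threads suffice because llama-cpp-python releases
    # the GIL during C-level GPU inference.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(compress_chunk, split_at_newlines(text, n_workers)))
```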

8. Binary File Support (NC06)

To our knowledge, Nacrith is the first LLM-based compressor that handles arbitrary binary files. The NC06 hybrid format segments input into text-like and binary regions: text chunks go through the full neural pipeline, binary chunks get LZMA or gzip compression. This extends the applicability of neural compression beyond the pure-text domain of all prior work.
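One plausible way to segment input, sketched below. The printable-ASCII heuristic and threshold are hypothetical; NC06's actual classifier may differ:

```python
def looks_like_text(chunk: bytes, threshold: float = 0.95) -> bool:
    """Hypothetical NC06-style region classifier: a chunk is 'text-like'
    when nearly all bytes are printable ASCII or common whitespace."""
    if not chunk:
        return False
    ok = sum(1 for b in chunk if 32 <= b < 127 or b in (9, 10, 13))
    return ok / len(chunk) >= threshold

def route(chunk: bytes) -> str:
    # Text-like regions go through the full neural pipeline; binary
    # regions fall back to a classical codec (LZMA or gzip in Nacrith).
    return "neural" if looks_like_text(chunk) else "lzma"

assert route(b"Hello, world!\n") == "neural"
assert route(bytes(range(256))) == "lzma"
```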

System Requirements

The design philosophy is accessibility. Everything was developed and tested on a GTX 1050 Ti (4 GB VRAM):

  • Model weights: ~500 MB (GGUF F32)
  • VRAM per worker: ~1.2 GB
  • Workers on 4 GB GPU: up to 3 concurrent
  • Workers on 8+ GB GPU: up to 8 concurrent

No fine-tuning. No massive downloads. No cloud GPU required.

Try It Yourself

# Clone the repository
git clone https://github.com/st4ck/Nacrith-GPU.git
cd Nacrith-GPU

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies

# For CUDA 11.8 GPUs (e.g., an NVIDIA GTX 1050 Ti), use the cu118 wheel instead:
# pip install torch --index-url https://download.pytorch.org/whl/cu118

pip install torch

pip install transformers accelerate numpy

# Install llama-cpp-python with CUDA support (required for GPU acceleration)
# Without CMAKE_ARGS, it will compile for CPU only
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# If the above fails (e.g. CUDA toolkit not found), install the CPU-only fallback:
# pip install llama-cpp-python

# Install test dependencies
pip install pytest

# Compress a text file
python nacrith.py compress --input myfile.txt --output myfile.nc

# Decompress
python nacrith.py decompress --input myfile.nc --output myfile_restored.txt

What's Next

There's clear headroom for improvement:

  • Larger models: SmolLM2-360M or 1.7B with longer context windows should push compression further
  • Quantization: INT8/INT4 weights would reduce VRAM with minimal probability degradation
  • ANS coding: Replacing arithmetic coding with Asymmetric Numeral Systems would improve encoding speed

Conclusion

Nacrith demonstrates that you don't need a 70B model or a cluster of GPUs to achieve state-of-the-art neural text compression. A 135M transformer with the right engineering (high-precision CDF quantization, an ensemble of lightweight online predictors, and efficient C/C++ inference) is enough to outperform every compressor we tested.

The full paper with methodology, ablations, and reproducible benchmarks is available on arXiv. The code is open-source on GitHub.

We'd love for you to try it on your own data and share what you find. Open an issue, report your results, or just give it a spin β€” all feedback welcome.
