--- title: Midicoth - Micro-Diffusion Compression emoji: 🗜️ colorFrom: blue colorTo: indigo sdk: docker app_port: 7860 pinned: false license: apache-2.0 short_description: Lossless data compression via Binary Tree Tweedie Denoising --- # Midicoth — Micro-Diffusion Compression **Lossless data compression via Binary Tree Tweedie Denoising** [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE) Midicoth is a lossless text compressor that introduces *micro-diffusion* — a multi-step score-based reverse diffusion process implementing Tweedie's empirical Bayes formula — into a cascaded statistical modeling pipeline. It treats Jeffreys-prior smoothing as a shrinkage operator toward uniform and reverses it through binary tree denoising with variance-aware James-Stein shrinkage, achieving compression ratios that outperform xz, zstd, Brotli, and bzip2 on all tested inputs. **No neural network. No training data. No GPU. ~2,000 lines of C.** --- ## Results ### alice29.txt — 152 KB (Canterbury Corpus) ![alice29.txt compression comparison](charts/chart_alice29.png) ### enwik8 — 100 MB (Large Text Compression Benchmark) ![enwik8 compression comparison](charts/chart_enwik8.png) | Benchmark | Midicoth | xz -9 | Improvement | |-----------|----------|--------|-------------| | **alice29.txt** (152 KB) | **2.119 bpb** | 2.551 bpb | **16.9%** | | **enwik8** (100 MB) | **1.753 bpb** | 1.989 bpb | **11.9%** | Midicoth outperforms all dictionary-based compressors (gzip, zstd, xz, Brotli, bzip2) on every tested input, narrowing the gap to heavyweight context-mixing systems like PAQ and CMIX. --- ## How It Works Midicoth processes input one byte at a time through a five-layer cascade: ``` Input byte | v [1] PPM Model (orders 0-4, PPMC exclusion, Jeffreys prior) | produces 256-way distribution + confidence + order v [2] Match Model (context lengths 4-16) | blends in long-range repetition predictions v [3] Word Model (trie + bigram) | blends in word completion predictions v [4] High-Order Context (orders 5-8) | blends in extended context predictions v [5] Micro-Diffusion (binary tree Tweedie, K=3 steps) | post-blend denoising with James-Stein shrinkage v Arithmetic Coder -> compressed output ``` ### The Key Idea: Post-Blend Tweedie Denoising PPM models smooth their predictions with a Jeffreys prior (0.5 per symbol, 128 total pseudo-counts). When few observations are available, this prior dominates — pulling the distribution toward uniform and wasting bits. After blending with match, word, and high-order models, additional systematic biases are introduced. We frame this as a **denoising problem**: - **Shrinkage operator**: Jeffreys smoothing pulls the empirical distribution toward uniform: `p = (1-gamma)*q + gamma*u`, where `gamma = 128/(C+128)` - **Reverse diffusion**: Calibration tables estimate the additive Tweedie correction `delta = E[theta|p] - E[p]`, approximating `sigma^2 * s(p)` - **Binary tree**: Each 256-way prediction is decomposed into 8 binary decisions (MSB to LSB), enabling data-efficient calibration - **Multi-step**: K=3 denoising steps with independent score tables, each correcting residual noise from the previous step - **Variance-aware shrinkage**: Each correction is attenuated based on SNR = delta^2 * N / var(error). When SNR < 4, the correction is linearly shrunk toward zero, preventing noisy estimates from hurting compression ### Ablation: Component Contributions ![Ablation study](charts/chart_ablation.png) With PPMC exclusion providing a strong base, the post-PPM layers collectively contribute **5.6-11.1%** additional improvement. The Tweedie denoiser is the most consistent contributor (+2.6-2.8% across both datasets). ### Score Magnitude vs. Noise Level ![Delta vs noise level](charts/chart_delta_vs_noise.png) On alice29 (small file), corrections decrease monotonically with confidence — the James-Stein shrinkage suppresses noisy corrections in sparse bins. On enwik8_3M (larger file), the pattern inverts at high confidence: enough data accumulates for the shrinkage to permit large corrections where genuine signal exists. Successive steps show rapidly decreasing corrections, confirming multi-step reverse diffusion behavior. --- ## Quick Start ### Build ```bash make # Linux: builds mdc and ablation make win # Windows cross-compile: builds mdc.exe and ablation.exe ``` Requirements: GCC (or any C99 compiler) and `libm`. No other dependencies. For Windows cross-compilation: `x86_64-w64-mingw32-gcc`. ### Compress ```bash ./mdc compress input.txt output.mdc ``` ### Decompress ```bash ./mdc decompress output.mdc restored.txt ``` ### Verify round-trip ```bash ./mdc compress alice29.txt alice29.mdc ./mdc decompress alice29.mdc alice29.restored diff alice29.txt alice29.restored # should produce no output ``` ### Run ablation study ```bash ./ablation alice29.txt ``` This runs five configurations (Base PPM, +Match, +Match+Word, +M+W+HCtx, +M+W+H+Tweedie) and reports the marginal contribution of each component with round-trip verification. ### Reproduce delta-vs-noise table ```bash gcc -O3 -march=native -o measure_delta measure_delta.c -lm ./measure_delta alice29.txt ``` --- ## Architecture ### File Structure | File | Description | |------|-------------| | `mdc.c` | Main compressor/decompressor driver | | `ablation.c` | Ablation study driver | | `measure_delta.c` | Delta-vs-noise measurement tool | | `ppm.h` | Adaptive PPM model (orders 0-4) with PPMC exclusion and Jeffreys prior | | `tweedie.h` | Binary tree Tweedie denoiser with James-Stein shrinkage (K=3 steps, 155K entries) | | `match.h` | Extended match model (context lengths 4, 6, 8, 12, 16) | | `word.h` | Trie-based word model with bigram prediction | | `highctx.h` | High-order context model (orders 5-8) | | `arith.h` | 32-bit arithmetic coder (E1/E2/E3 renormalization) | | `fastmath.h` | Fast math utilities (log, exp approximations) | | `Makefile` | Build system (Linux + Windows cross-compile) | ### Design Principles - **Header-only modules**: Each component is a self-contained `.h` file with `static inline` functions - **Zero external dependencies**: Only `libm` is required - **Fully online**: All models are adaptive — no pre-training or offline parameter estimation - **Deterministic**: Bit-exact encoder-decoder symmetry - **Count-based**: All learnable components use simple count accumulation, avoiding overfitting from gradient-based learners ### Calibration Table Structure The Tweedie denoiser maintains a 6-dimensional calibration table: ``` table[step][bit_context][order_group][shape][confidence][prob_bin] 3 x 27 x 3 x 4 x 8 x 20 = 155,520 entries (~4.7 MB) ``` Each entry tracks four sufficient statistics: - `sum_pred`: sum of predicted P(right) values - `hits`: count of times the true symbol went right - `total`: total observations - `sum_sq_err`: sum of squared prediction errors (for James-Stein SNR) The Tweedie correction is: `delta = hits/total - sum_pred/total`, then shrunk by `min(1, SNR/4)` where `SNR = delta^2 * total / (sum_sq_err / total)`. --- ## Compressed Format MDC files use the `.mdc` extension: | Offset | Size | Content | |--------|------|---------| | 0 | 4 bytes | Magic: `MDC7` | | 4 | 8 bytes | Original file size (uint64, little-endian) | | 12 | variable | Arithmetic-coded bitstream | The format is self-contained — no external dictionaries or model files needed. --- ## Performance | File | Size | Ratio | Speed | bpb | |------|------|-------|-------|-----| | alice29.txt | 152 KB | 26.5% | ~42 KB/s | 2.119 | | enwik8_3M | 3.0 MB | 25.0% | ~40 KB/s | 2.003 | | enwik8 | 100 MB | 21.9% | ~42 KB/s | 1.753 | All measurements on a single CPU core (x86-64). ### Comparison with Other Approaches | Category | Example | enwik8 bpb | Requires | |----------|---------|------------|----------| | Dictionary-based | xz -9 | 1.989 | CPU | | **Midicoth (this work)** | **-** | **1.753** | **CPU** | | Context mixing | PAQ8px | ~1.27 | CPU (hours) | | Context mixing | CMIX v21 | ~1.17 | CPU (16-64 GB RAM) | | LLM-based | Nacrith | 0.939 | GPU + pre-trained model | Midicoth occupies a unique position: better than all dictionary compressors, simpler and faster than context mixers, and fully online without any pre-trained knowledge. --- ## Paper The full technical details are described in: > **Micro-Diffusion Compression: Binary Tree Tweedie Denoising for Online Probability Estimation** > Roberto Tacconelli, 2026 > > [arXiv:2603.08771](https://arxiv.org/abs/2603.08771) Key references: - Efron, B. (2011). *Tweedie's formula and selection bias.* JASA, 106(496):1602-1614. - James, W. and Stein, C. (1961). *Estimation with quadratic loss.* Proc. 4th Berkeley Symposium. - Ho, J., Jain, A., Abbeel, P. (2020). *Denoising diffusion probabilistic models.* NeurIPS 2020. - Cleary, J.G. and Witten, I.H. (1984). *Data compression using adaptive coding and partial string matching.* IEEE Trans. Comm., 32(4):396-402. --- ## Test Data - **alice29.txt** (152,089 bytes): Canterbury Corpus - Lewis Carroll's *Alice's Adventures in Wonderland*. - **enwik8** (100,000,000 bytes): First 100 MB of English Wikipedia. Download from the [Large Text Compression Benchmark](http://mattmahoney.net/dc/text.html). --- ## Building from Source ### Prerequisites - A C99 compiler (GCC, Clang, or MSVC) - `make` (optional) ### Using Make ```bash make # builds mdc and ablation (Linux) make win # builds mdc.exe and ablation.exe (Windows cross-compile) make clean # removes all binaries ``` ### Manual compilation ```bash # Linux gcc -O3 -march=native -o mdc mdc.c -lm gcc -O3 -march=native -o ablation ablation.c -lm gcc -O3 -march=native -o measure_delta measure_delta.c -lm # Windows cross-compile x86_64-w64-mingw32-gcc -O3 -o mdc.exe mdc.c -lm -lpthread x86_64-w64-mingw32-gcc -O3 -o ablation.exe ablation.c -lm -lpthread ``` --- ## License Apache License 2.0 - see [LICENSE](LICENSE). ``` Copyright 2026 Roberto Tacconelli ``` --- ## Citation ```bibtex @article{tacconelli2026midicoth, title={Micro-Diffusion Compression: Binary Tree Tweedie Denoising for Online Probability Estimation}, author={Tacconelli, Roberto}, year={2026}, eprint={2603.08771}, archivePrefix={arXiv}, primaryClass={cs.IT} } ``` ## Related Work - [Nacrith](https://github.com/robtacconelli/Nacrith-GPU) - LLM-based lossless compression (0.939 bpb on enwik8) - [PAQ8px](https://github.com/hxim/paq8px) - Context-mixing compressor - [CMIX](http://www.byronknoll.com/cmix.html) - Context-mixing with LSTM - [Large Text Compression Benchmark](http://mattmahoney.net/dc/text.html) - enwik8 benchmark