---
title: Midicoth - Micro-Diffusion Compression
emoji: 🗜️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: Lossless data compression via Binary Tree Tweedie Denoising
---

# Midicoth — Micro-Diffusion Compression

**Lossless data compression via Binary Tree Tweedie Denoising**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

Midicoth is a lossless text compressor that introduces *micro-diffusion* — a multi-step score-based reverse diffusion process implementing Tweedie's empirical Bayes formula — into a cascaded statistical modeling pipeline. It treats Jeffreys-prior smoothing as a shrinkage operator toward uniform and reverses it through binary tree denoising with variance-aware James-Stein shrinkage, achieving compression ratios that outperform xz, zstd, Brotli, and bzip2 on all tested inputs.

**No neural network. No training data. No GPU. ~2,000 lines of C.**

---

## Results

### alice29.txt — 152 KB (Canterbury Corpus)

![alice29.txt compression comparison](charts/chart_alice29.png)

### enwik8 — 100 MB (Large Text Compression Benchmark)

![enwik8 compression comparison](charts/chart_enwik8.png)

| Benchmark | Midicoth | xz -9 | Improvement |
|-----------|----------|--------|-------------|
| **alice29.txt** (152 KB) | **2.119 bpb** | 2.551 bpb | **16.9%** |
| **enwik8** (100 MB) | **1.753 bpb** | 1.989 bpb | **11.9%** |

Midicoth outperforms all dictionary-based compressors (gzip, zstd, xz, Brotli, bzip2) on every tested input, narrowing the gap to heavyweight context-mixing systems like PAQ and CMIX.

---

## How It Works

Midicoth processes input one byte at a time through a five-layer cascade:

```
Input byte
    |
    v
[1] PPM Model (orders 0-4, PPMC exclusion, Jeffreys prior)
    |  produces 256-way distribution + confidence + order
    v
[2] Match Model (context lengths 4-16)
    |  blends in long-range repetition predictions
    v
[3] Word Model (trie + bigram)
    |  blends in word completion predictions
    v
[4] High-Order Context (orders 5-8)
    |  blends in extended context predictions
    v
[5] Micro-Diffusion (binary tree Tweedie, K=3 steps)
    |  post-blend denoising with James-Stein shrinkage
    v
Arithmetic Coder -> compressed output
```

### The Key Idea: Post-Blend Tweedie Denoising

PPM models smooth their predictions with a Jeffreys prior (0.5 per symbol, 128 total pseudo-counts). When few observations are available, this prior dominates — pulling the distribution toward uniform and wasting bits. After blending with match, word, and high-order models, additional systematic biases are introduced. We frame this as a **denoising problem**:

- **Shrinkage operator**: Jeffreys smoothing pulls the empirical distribution toward uniform: `p = (1-gamma)*q + gamma*u`, where `gamma = 128/(C+128)`
- **Reverse diffusion**: Calibration tables estimate the additive Tweedie correction `delta = E[theta|p] - E[p]`, approximating `sigma^2 * s(p)`
- **Binary tree**: Each 256-way prediction is decomposed into 8 binary decisions (MSB to LSB), enabling data-efficient calibration
- **Multi-step**: K=3 denoising steps with independent score tables, each correcting residual noise from the previous step
- **Variance-aware shrinkage**: Each correction is attenuated based on SNR = delta^2 * N / var(error). When SNR < 4, the correction is linearly shrunk toward zero, preventing noisy estimates from hurting compression

### Ablation: Component Contributions

![Ablation study](charts/chart_ablation.png)

With PPMC exclusion providing a strong base, the post-PPM layers collectively contribute **5.6-11.1%** additional improvement. The Tweedie denoiser is the most consistent contributor (+2.6-2.8% across both datasets).

### Score Magnitude vs. Noise Level

![Delta vs noise level](charts/chart_delta_vs_noise.png)

On alice29 (small file), corrections decrease monotonically with confidence — the James-Stein shrinkage suppresses noisy corrections in sparse bins. On enwik8_3M (larger file), the pattern inverts at high confidence: enough data accumulates for the shrinkage to permit large corrections where genuine signal exists. Successive steps show rapidly decreasing corrections, confirming multi-step reverse diffusion behavior.

---

## Quick Start

### Build

```bash
make          # Linux: builds mdc and ablation
make win      # Windows cross-compile: builds mdc.exe and ablation.exe
```

Requirements: GCC (or any C99 compiler) and `libm`. No other dependencies.
For Windows cross-compilation: `x86_64-w64-mingw32-gcc`.

### Compress

```bash
./mdc compress input.txt output.mdc
```

### Decompress

```bash
./mdc decompress output.mdc restored.txt
```

### Verify round-trip

```bash
./mdc compress alice29.txt alice29.mdc
./mdc decompress alice29.mdc alice29.restored
diff alice29.txt alice29.restored  # should produce no output
```

### Run ablation study

```bash
./ablation alice29.txt
```

This runs five configurations (Base PPM, +Match, +Match+Word, +M+W+HCtx, +M+W+H+Tweedie) and reports the marginal contribution of each component with round-trip verification.

### Reproduce delta-vs-noise table

```bash
gcc -O3 -march=native -o measure_delta measure_delta.c -lm
./measure_delta alice29.txt
```

---

## Architecture

### File Structure

| File | Description |
|------|-------------|
| `mdc.c` | Main compressor/decompressor driver |
| `ablation.c` | Ablation study driver |
| `measure_delta.c` | Delta-vs-noise measurement tool |
| `ppm.h` | Adaptive PPM model (orders 0-4) with PPMC exclusion and Jeffreys prior |
| `tweedie.h` | Binary tree Tweedie denoiser with James-Stein shrinkage (K=3 steps, 155K entries) |
| `match.h` | Extended match model (context lengths 4, 6, 8, 12, 16) |
| `word.h` | Trie-based word model with bigram prediction |
| `highctx.h` | High-order context model (orders 5-8) |
| `arith.h` | 32-bit arithmetic coder (E1/E2/E3 renormalization) |
| `fastmath.h` | Fast math utilities (log, exp approximations) |
| `Makefile` | Build system (Linux + Windows cross-compile) |

### Design Principles

- **Header-only modules**: Each component is a self-contained `.h` file with `static inline` functions
- **Zero external dependencies**: Only `libm` is required
- **Fully online**: All models are adaptive — no pre-training or offline parameter estimation
- **Deterministic**: Bit-exact encoder-decoder symmetry
- **Count-based**: All learnable components use simple count accumulation, avoiding overfitting from gradient-based learners

### Calibration Table Structure

The Tweedie denoiser maintains a 6-dimensional calibration table:

```
table[step][bit_context][order_group][shape][confidence][prob_bin]
       3   x    27     x     3      x  4   x     8     x   20
                    = 155,520 entries (~4.7 MB)
```

Each entry tracks four sufficient statistics:
- `sum_pred`: sum of predicted P(right) values
- `hits`: count of times the true symbol went right
- `total`: total observations
- `sum_sq_err`: sum of squared prediction errors (for James-Stein SNR)

The Tweedie correction is: `delta = hits/total - sum_pred/total`, then shrunk by `min(1, SNR/4)` where `SNR = delta^2 * total / (sum_sq_err / total)`.

---

## Compressed Format

MDC files use the `.mdc` extension:

| Offset | Size | Content |
|--------|------|---------|
| 0 | 4 bytes | Magic: `MDC7` |
| 4 | 8 bytes | Original file size (uint64, little-endian) |
| 12 | variable | Arithmetic-coded bitstream |

The format is self-contained — no external dictionaries or model files needed.

---

## Performance

| File | Size | Ratio | Speed | bpb |
|------|------|-------|-------|-----|
| alice29.txt | 152 KB | 26.5% | ~42 KB/s | 2.119 |
| enwik8_3M | 3.0 MB | 25.0% | ~40 KB/s | 2.003 |
| enwik8 | 100 MB | 21.9% | ~42 KB/s | 1.753 |

All measurements on a single CPU core (x86-64).

### Comparison with Other Approaches

| Category | Example | enwik8 bpb | Requires |
|----------|---------|------------|----------|
| Dictionary-based | xz -9 | 1.989 | CPU |
| **Midicoth (this work)** | **-** | **1.753** | **CPU** |
| Context mixing | PAQ8px | ~1.27 | CPU (hours) |
| Context mixing | CMIX v21 | ~1.17 | CPU (16-64 GB RAM) |
| LLM-based | Nacrith | 0.939 | GPU + pre-trained model |

Midicoth occupies a unique position: better than all dictionary compressors, simpler and faster than context mixers, and fully online without any pre-trained knowledge.

---

## Paper

The full technical details are described in:

> **Micro-Diffusion Compression: Binary Tree Tweedie Denoising for Online Probability Estimation**
> Roberto Tacconelli, 2026
>
> [arXiv:2603.08771](https://arxiv.org/abs/2603.08771)

Key references:
- Efron, B. (2011). *Tweedie's formula and selection bias.* JASA, 106(496):1602-1614.
- James, W. and Stein, C. (1961). *Estimation with quadratic loss.* Proc. 4th Berkeley Symposium.
- Ho, J., Jain, A., Abbeel, P. (2020). *Denoising diffusion probabilistic models.* NeurIPS 2020.
- Cleary, J.G. and Witten, I.H. (1984). *Data compression using adaptive coding and partial string matching.* IEEE Trans. Comm., 32(4):396-402.

---

## Test Data

- **alice29.txt** (152,089 bytes): Canterbury Corpus - Lewis Carroll's *Alice's Adventures in Wonderland*.
- **enwik8** (100,000,000 bytes): First 100 MB of English Wikipedia. Download from the [Large Text Compression Benchmark](http://mattmahoney.net/dc/text.html).

---

## Building from Source

### Prerequisites

- A C99 compiler (GCC, Clang, or MSVC)
- `make` (optional)

### Using Make

```bash
make          # builds mdc and ablation (Linux)
make win      # builds mdc.exe and ablation.exe (Windows cross-compile)
make clean    # removes all binaries
```

### Manual compilation

```bash
# Linux
gcc -O3 -march=native -o mdc mdc.c -lm
gcc -O3 -march=native -o ablation ablation.c -lm
gcc -O3 -march=native -o measure_delta measure_delta.c -lm

# Windows cross-compile
x86_64-w64-mingw32-gcc -O3 -o mdc.exe mdc.c -lm -lpthread
x86_64-w64-mingw32-gcc -O3 -o ablation.exe ablation.c -lm -lpthread
```

---

## License

Apache License 2.0 - see [LICENSE](LICENSE).

```
Copyright 2026 Roberto Tacconelli
```

---

## Citation

```bibtex
@article{tacconelli2026midicoth,
  title={Micro-Diffusion Compression: Binary Tree Tweedie Denoising
         for Online Probability Estimation},
  author={Tacconelli, Roberto},
  year={2026},
  eprint={2603.08771},
  archivePrefix={arXiv},
  primaryClass={cs.IT}
}
```

## Related Work

- [Nacrith](https://github.com/robtacconelli/Nacrith-GPU) - LLM-based lossless compression (0.939 bpb on enwik8)
- [PAQ8px](https://github.com/hxim/paq8px) - Context-mixing compressor
- [CMIX](http://www.byronknoll.com/cmix.html) - Context-mixing with LSTM
- [Large Text Compression Benchmark](http://mattmahoney.net/dc/text.html) - enwik8 benchmark