Entropy Coding with Equiprobable Partitioning

Implementation of the enumerative entropy coding algorithm with equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

Overview

This project implements two compression algorithms:

  1. Equiprobable Partitioning (EP) - The main algorithm from the paper
  2. Huffman Coding - Classical entropy coding for comparison

Algorithm Description

Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an enumerative entropy coding method that works in three steps:

  1. Encode alphabet size M using exp-Golomb codes
  2. Encode symbol counts N(s₁), N(sβ‚‚), ..., N(s_{M-1}) using exp-Golomb codes (last count is implied)
  3. Encode sequence position among all permutations with the same symbol counts using combinatorial enumeration

How It Works

  • Step 1: Use exp-Golomb to encode how many distinct symbols appear
  • Step 2: Use exp-Golomb to encode how many times each symbol appears
  • Step 3: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it's a form of combinatorial compression that leverages the mathematical structure of permutations.
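Step 3 is the heart of the method. As a minimal illustrative sketch (not the repository's `_rank` helper, whose convention may differ), the combinatorial number system maps a sorted set of symbol positions to a single integer rank:

```python
import math

def rank_combination(positions: list[int]) -> int:
    """Colexicographic rank of a sorted k-subset of positions, via the
    combinatorial number system: rank = C(p1, 1) + C(p2, 2) + ... + C(pk, k).
    The rank lies in [0, C(n, k)), so it identifies one specific placement
    of a symbol among all placements with the same count."""
    return sum(math.comb(p, i + 1) for i, p in enumerate(sorted(positions)))
```

For example, the six 2-element subsets of {0, 1, 2, 3} receive the ranks 0 through 5, so log₂ C(4, 2) β‰ˆ 2.58 bits suffice to pin down the placement.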

Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

  1. Cached Binomial Coefficients: Uses math.comb() with caching to avoid recomputation
  2. Binary Search: O(log n) position reconstruction instead of linear search
  3. Complement Encoding: For frequent symbols (>50%), encode positions of other symbols instead
  4. Arbitrary Precision: Avoids integer overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.
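Optimization 2 matters mostly on the decode side, where positions are recovered from a rank one digit at a time. A hedged sketch of unranking in the combinatorial number system, using binary search for each position (function names are illustrative, not the repository's):

```python
import math

def unrank_combination(rank: int, k: int, n: int) -> list[int]:
    """Invert the combinatorial-number-system rank: recover the k sorted
    positions (in 0..n-1) from their colex rank. Each digit is found by a
    binary search for the largest c with C(c, i) <= remaining rank."""
    positions = []
    for i in range(k, 0, -1):
        lo, hi = i - 1, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if math.comb(mid, i) <= rank:
                lo = mid
            else:
                hi = mid - 1
        positions.append(lo)
        rank -= math.comb(lo, i)
        n = lo  # the next (smaller) position must lie strictly below this one
    return positions[::-1]
```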

Why Enumerative Coding?

The algorithm aims to achieve compression by:

  • Separating structure (symbol counts) from content (permutation)
  • Using optimal exp-Golomb codes for integer encoding
  • Leveraging combinatorial mathematics for exact permutation indexing
  • Achieving theoretical compression bounds for certain distributions
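The integer fields above (alphabet size, symbol counts, ranks) are carried by order-0 exponential-Golomb codes. A minimal self-contained sketch of standard exp-Golomb encoding and decoding over bit strings (illustrative, not the repository's ExpGolombCoder):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer: write (x + 1)
    in binary, prefixed by (bit-length - 1) zeros."""
    b = bin(x + 1)[2:]
    return "0" * (len(b) - 1) + b

def exp_golomb_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one codeword starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end
```

Small values get short codewords (0 → "1", 1 → "010"), which suits the typically small alphabet sizes and counts.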

Installation

# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync

Usage

Quick Start

# Run all available commands
just

# Run quick tests
just test

# Run paper examples  
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot

Visualization

The plotting functionality generates comprehensive analysis:

  1. Compression Comparison: Side-by-side comparison of Huffman vs Enumerative methods
  2. Compression Time Analysis: Performance timing comparison between algorithms
  3. Distribution Analysis: Performance across uniform, Zipf, and geometric data
  4. Efficiency Analysis: How close each method gets to theoretical limits
  5. Enumerative Timeout Analysis: Computational complexity limitations and scaling behavior

Plots are saved to the plots/ directory as high-resolution PNG files.

Results

Compression Performance Comparison

Compression Comparison

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

Compression Time Analysis

Compression Time Comparison

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.

Distribution Analysis

Distribution Comparison

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

Computational Complexity Analysis

Enumerative Timeout Analysis

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

Command Reference

  • just - List available commands
  • just setup - Install dependencies
  • just test - Quick test with small datasets + paper examples
  • just test-paper - Test examples from the paper
  • just run - Full compression benchmark
  • just analyze - Run full benchmark and generate plots
  • just plot - Generate comparison plots
  • just clean - Remove generated files
  • just check - Run code quality checks
  • just format - Format code

Test Datasets

The benchmark includes:

I.I.D. Datasets

  • Small (1K symbols): Quick testing
  • Medium (10K symbols): Moderate datasets
  • Large (100K symbols): Performance at scale

Distributions

  • Uniform: All symbols equally likely
  • Zipf: Power-law distribution (realistic for text)
  • Geometric: Exponentially decreasing probabilities
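These distributions can be sampled with the standard library alone; a sketch under the assumption of simple weighted sampling (the benchmark's actual generators may differ):

```python
import random

def zipf_weights(vocab: int, s: float = 1.0) -> list[float]:
    """Power-law weights p(k) proportional to 1 / k^s for ranks 1..vocab."""
    return [1.0 / (k ** s) for k in range(1, vocab + 1)]

def geometric_weights(vocab: int, p: float = 0.5) -> list[float]:
    """Exponentially decreasing weights proportional to (1 - p)^k."""
    return [(1 - p) ** k for k in range(vocab)]

random.seed(0)
uniform_data = [random.randrange(64) for _ in range(1000)]
zipf_data = random.choices(range(64), weights=zipf_weights(64), k=1000)
geo_data = random.choices(range(64), weights=geometric_weights(64), k=1000)
```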

Vocabulary Sizes

  • 10 symbols: Small alphabet
  • 64 symbols: Medium alphabet
  • 256 symbols: Full byte range

Real Data

  • English text: Downloaded from WikiText-2 via Hugging Face

Results Analysis

Performance Patterns

  1. Uniform Distributions: EP performs poorly because there's no probability imbalance to exploit
  2. Skewed Distributions: EP performs better but still trails Huffman
  3. Large Vocabularies: EP overhead becomes significant with many symbols

Computational Complexity

The optimized enumerative entropy coding implementation achieves polynomial time complexity through careful algorithmic design:

Time Complexity Analysis

  • Encoding: O(M Γ— n) where M = alphabet size, n = sequence length
    • Symbol position finding: O(n) per symbol
    • Combinatorial indexing: O(k) per symbol with memoization
  • Decoding: O(M Γ— k Γ— log n) where k = average symbol count
    • Binary search for position reconstruction: O(log n) per position
    • Memoized binomial lookups: O(1) amortized

Space Complexity

  • Memory Usage: O(unique_binomial_lookups) for coefficient cache
  • Typical Cache Size: < 1000 entries for most realistic datasets
  • No Upfront Cost: Zero initialization time, grows only as needed

Performance Characteristics

  • Small Datasets (< 5000 symbols): 0.045s - 1.7s encoding time
  • Medium Datasets (5000-10000 symbols): 0.3s - 15s encoding time
  • Large Datasets (> 100000 symbols): May timeout (> 30s)
  • Performance vs Huffman: ~259x slower on average

Timeout Mechanism

  • Timeout Duration: 30 seconds by default for enumerative coding
  • Graceful Handling: Timeouts are logged and marked as "TIMEOUT" in results
  • When Timeouts Occur: Very large sequences (> 100k symbols) with high vocabulary diversity
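A hedged sketch of such a timeout wrapper using concurrent.futures (names are illustrative, not the repository's API; note that a Python thread cannot be force-killed, so a timed-out worker still runs to completion in the background):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def encode_with_timeout(encode_fn, data, timeout_s: float = 30.0):
    """Run encode_fn(data); report "TIMEOUT" if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(encode_fn, data).result(timeout=timeout_s)
    except FuturesTimeout:
        return "TIMEOUT"  # logged and marked in the benchmark results
    finally:
        pool.shutdown(wait=False)  # don't block on the abandoned worker
```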

The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.

Performance Results

From the benchmark results comparing Huffman vs Enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|---|---|---|---|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

Why Enumerative Underperforms

  1. Computational Complexity: Combinatorial calculations become expensive for large datasets
  2. Fixed Algorithm Structure: Cannot adapt to data characteristics like Huffman's variable-length codes
  3. Overhead: Algorithm encodes structure information (alphabet, counts, positions) separately
  4. Scaling Issues: Encoding time grows rapidly with sequence length and vocabulary size, even with the polynomial-time optimizations

File Structure

entropy-coding-equiprobable/
β”œβ”€β”€ enumerative_coding.py      # Core enumerative entropy coding implementation
β”œβ”€β”€ entropy_coding.py          # Legacy compatibility and Huffman implementation
β”œβ”€β”€ test_compression.py        # Main benchmark script with timing analysis
β”œβ”€β”€ test_paper_examples.py     # Paper example verification
β”œβ”€β”€ test_enumerative.py        # Basic functionality tests
β”œβ”€β”€ plot_results.py           # Comprehensive visualization and analysis
β”œβ”€β”€ quick_test.py             # Quick functionality test
β”œβ”€β”€ justfile                  # Command runner
β”œβ”€β”€ pyproject.toml           # Python dependencies
β”œβ”€β”€ CLAUDE.md               # Project-specific AI instructions
└── README.md              # This file

Implementation Details

Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm; encode() performs four steps (the three described above, preceded by an explicit sequence-length field):

def encode(self, data: List[int]) -> bytes:
    bits = ""

    # Step 1: Encode sequence length
    n = len(data)
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and the symbols themselves)
    sorted_symbols = sorted(set(data))
    K = len(sorted_symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 counts; the last is implied)
    symbol_counts = Counter(data)  # collections.Counter
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    remaining_data = data
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        remaining_data = [s for s in remaining_data if s != symbol]

    return bits_to_bytes(bits)  # pack the bit string into bytes (helper elided)

Key Optimizations

# Complement encoding: if a symbol fills more than half the remaining
# sequence, encode the positions of the OTHER symbols instead
use_complement = k > current_n / 2
if use_complement:
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
import math

class OptimizedBinomialTable:
    def __init__(self) -> None:
        self._cache: dict[tuple[int, int], int] = {}

    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # arbitrary precision, no overflow
        self._cache[(n, k)] = result
        return result

Theoretical Analysis

Compression Bounds

  • Shannon Entropy: H(X) = -Ξ£ p(x) logβ‚‚ p(x), the theoretical minimum bits per symbol
  • Huffman: Achieves H(X) ≀ L_Huffman < H(X) + 1 (typically ~99% efficiency)
  • Enumerative: L_Enum β‰₯ H(X) + overhead (typically ~49% efficiency due to structural encoding)
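The efficiency figures are measured against the empirical Shannon entropy of each dataset. A minimal sketch of that computation (helper names are illustrative):

```python
import math
from collections import Counter

def empirical_entropy_bits_per_symbol(data: list[int]) -> float:
    """Shannon entropy H(X) = -sum p(x) * log2 p(x) of the empirical
    symbol distribution, in bits per symbol."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def efficiency(data: list[int], encoded_bits: int) -> float:
    """Ratio of the entropy lower bound to the achieved bits per symbol."""
    return empirical_entropy_bits_per_symbol(data) / (encoded_bits / len(data))
```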

When Enumerative Coding Works

  1. Research/theoretical applications: When exact mathematical properties are needed
  2. Educational purposes: Understanding combinatorial compression principles
  3. Small datasets: Where computational cost is not a concern

When Enumerative Struggles

  1. All practical applications: 259x slower than Huffman with worse compression
  2. Large datasets: Steep polynomial scaling and very large intermediate integers make it computationally prohibitive
  3. Real-time systems: Unpredictable and potentially very long encoding times

Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:

1. Just-In-Time (JIT) Compilation

  • Target: Critical loops in combinatorial indexing and position reconstruction
  • Options:
    • Numba (requires Python 3.11 due to llvmlite compatibility issues)
    • JAX (better Python 3.12 support, NumPy-compatible)
    • PyPy (alternative Python interpreter with JIT)
  • Expected Benefit: 10-100x speedup for computational bottlenecks

2. Algorithmic Improvements

  • Incremental Encoding: Reuse computations when processing similar sequences
  • Approximate Methods: Trade slight accuracy for major performance gains on very large datasets
  • Parallel Processing: Distribute symbol processing across multiple cores

3. Specialized Data Structures

  • Sparse Binomial Tables: Only compute coefficients actually needed
  • Compressed Position Indices: More efficient representation for position lists
  • Fast Integer Arithmetic: Specialized libraries for large integer operations

4. Memory Hierarchy Optimizations

  • Cache-Friendly Algorithms: Reorganize computations to minimize cache misses
  • Memory Pooling: Reduce allocation overhead for temporary arrays
  • Streaming Encoding: Process very large datasets without loading entirely into memory

5. Domain-Specific Optimizations

  • Text-Specific: Leverage byte patterns and common character frequencies
  • Statistical Precomputation: Pre-build tables for common distributions (Zipf, geometric)
  • Adaptive Thresholds: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

References

  • Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
  • Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
  • Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"

Contributing

This is a research implementation. To contribute:

  1. Fork the repository
  2. Make changes following the existing code style
  3. Run just check to verify code quality
  4. Submit a pull request

License

This project is for educational and research purposes.
