Entropy Coding with Equiprobable Partitioning

Implementation of the enumerative entropy coding algorithm with equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

Overview

This project implements two compression algorithms:

  1. Equiprobable Partitioning (EP) - The main algorithm from the paper
  2. Huffman Coding - Classical entropy coding for comparison

Algorithm Description

Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an enumerative entropy coding method that works in three steps:

  1. Encode alphabet size M using exp-Golomb codes
  2. Encode symbol counts N(s₁), N(sβ‚‚), ..., N(s_{M-1}) using exp-Golomb codes (last count is implied)
  3. Encode sequence position among all permutations with the same symbol counts using combinatorial enumeration

How It Works

  • Step 1: Use exp-Golomb to encode how many distinct symbols appear
  • Step 2: Use exp-Golomb to encode how many times each symbol appears
  • Step 3: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it's a form of combinatorial compression that leverages the mathematical structure of permutations.
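Step 3 is the heart of the method. As a minimal illustrative sketch (not the repository's `_rank` helper, whose convention may differ), the combinatorial number system maps a sorted set of symbol positions to a single integer rank:

```python
import math

def rank_combination(positions: list[int]) -> int:
    """Colexicographic rank of a sorted k-subset of positions, via the
    combinatorial number system: rank = C(p1, 1) + C(p2, 2) + ... + C(pk, k).
    The rank lies in [0, C(n, k)), so it identifies one specific placement
    of a symbol among all placements with the same count."""
    return sum(math.comb(p, i + 1) for i, p in enumerate(sorted(positions)))
```

For example, the six 2-element subsets of {0, 1, 2, 3} receive the ranks 0 through 5, so log₂ C(4, 2) β‰ˆ 2.58 bits suffice to pin down the placement.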

Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

  1. Cached Binomial Coefficients: Uses math.comb() with caching to avoid recomputation
  2. Binary Search: O(log n) position reconstruction instead of linear search
  3. Complement Encoding: For frequent symbols (>50%), encode positions of other symbols instead
  4. Arbitrary Precision: Avoids integer overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.
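Optimization 2 matters mostly on the decode side, where positions are recovered from a rank one digit at a time. A hedged sketch of unranking in the combinatorial number system, using binary search for each position (function names are illustrative, not the repository's):

```python
import math

def unrank_combination(rank: int, k: int, n: int) -> list[int]:
    """Invert the combinatorial-number-system rank: recover the k sorted
    positions (in 0..n-1) from their colex rank. Each digit is found by a
    binary search for the largest c with C(c, i) <= remaining rank."""
    positions = []
    for i in range(k, 0, -1):
        lo, hi = i - 1, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if math.comb(mid, i) <= rank:
                lo = mid
            else:
                hi = mid - 1
        positions.append(lo)
        rank -= math.comb(lo, i)
        n = lo  # the next (smaller) position must lie strictly below this one
    return positions[::-1]
```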

Why Enumerative Coding?

The algorithm aims to achieve compression by:

  • Separating structure (symbol counts) from content (permutation)
  • Using optimal exp-Golomb codes for integer encoding
  • Leveraging combinatorial mathematics for exact permutation indexing
  • Achieving theoretical compression bounds for certain distributions
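The integer fields above (alphabet size, symbol counts, ranks) are carried by order-0 exponential-Golomb codes. A minimal self-contained sketch of standard exp-Golomb encoding and decoding over bit strings (illustrative, not the repository's ExpGolombCoder):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer: write (x + 1)
    in binary, prefixed by (bit-length - 1) zeros."""
    b = bin(x + 1)[2:]
    return "0" * (len(b) - 1) + b

def exp_golomb_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one codeword starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end
```

Small values get short codewords (0 → "1", 1 → "010"), which suits the typically small alphabet sizes and counts.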

Installation

# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync

Usage

Quick Start

# Run all available commands
just

# Run quick tests
just test

# Run paper examples  
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot

Visualization

The plotting functionality generates comprehensive analysis:

  1. Compression Comparison: Side-by-side comparison of Huffman vs Enumerative methods
  2. Compression Time Analysis: Performance timing comparison between algorithms
  3. Distribution Analysis: Performance across uniform, Zipf, and geometric data
  4. Efficiency Analysis: How close each method gets to theoretical limits
  5. Enumerative Timeout Analysis: Computational complexity limitations and scaling behavior

Plots are saved to the plots/ directory as high-resolution PNG files.

Results

Compression Performance Comparison

Compression Comparison

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

Compression Time Analysis

Compression Time Comparison

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.

Distribution Analysis

Distribution Comparison

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

Computational Complexity Analysis

Enumerative Timeout Analysis

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

Command Reference

  • just - List available commands
  • just setup - Install dependencies
  • just test - Quick test with small datasets + paper examples
  • just test-paper - Test examples from the paper
  • just run - Full compression benchmark
  • just analyze - Run full benchmark and generate plots
  • just plot - Generate comparison plots
  • just clean - Remove generated files
  • just check - Run code quality checks
  • just format - Format code

Test Datasets

The benchmark includes:

I.I.D. Datasets

  • Small (1K symbols): Quick testing
  • Medium (10K symbols): Moderate datasets
  • Large (100K symbols): Performance at scale

Distributions

  • Uniform: All symbols equally likely
  • Zipf: Power-law distribution (realistic for text)
  • Geometric: Exponentially decreasing probabilities
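These distributions can be sampled with the standard library alone; a sketch under the assumption of simple weighted sampling (the benchmark's actual generators may differ):

```python
import random

def zipf_weights(vocab: int, s: float = 1.0) -> list[float]:
    """Power-law weights p(k) proportional to 1 / k^s for ranks 1..vocab."""
    return [1.0 / (k ** s) for k in range(1, vocab + 1)]

def geometric_weights(vocab: int, p: float = 0.5) -> list[float]:
    """Exponentially decreasing weights proportional to (1 - p)^k."""
    return [(1 - p) ** k for k in range(vocab)]

random.seed(0)
uniform_data = [random.randrange(64) for _ in range(1000)]
zipf_data = random.choices(range(64), weights=zipf_weights(64), k=1000)
geo_data = random.choices(range(64), weights=geometric_weights(64), k=1000)
```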

Vocabulary Sizes

  • 10 symbols: Small alphabet
  • 64 symbols: Medium alphabet
  • 256 symbols: Full byte range

Real Data

  • English text: Downloaded from WikiText-2 via Hugging Face

Results Analysis

Performance Patterns

  1. Uniform Distributions: EP performs poorly because there's no probability imbalance to exploit
  2. Skewed Distributions: EP performs better but still trails Huffman
  3. Large Vocabularies: EP overhead becomes significant with many symbols

Computational Complexity

The optimized enumerative entropy coding implementation achieves polynomial time complexity through careful algorithmic design:

Time Complexity Analysis

  • Encoding: O(M Γ— n) where M = alphabet size, n = sequence length
    • Symbol position finding: O(n) per symbol
    • Combinatorial indexing: O(k) per symbol with memoization
  • Decoding: O(M Γ— k Γ— log n) where k = average symbol count
    • Binary search for position reconstruction: O(log n) per position
    • Memoized binomial lookups: O(1) amortized

Space Complexity

  • Memory Usage: O(unique_binomial_lookups) for coefficient cache
  • Typical Cache Size: < 1000 entries for most realistic datasets
  • No Upfront Cost: Zero initialization time, grows only as needed

Performance Characteristics

  • Small Datasets (< 5000 symbols): 0.045s - 1.7s encoding time
  • Medium Datasets (5000-10000 symbols): 0.3s - 15s encoding time
  • Large Datasets (> 100000 symbols): May timeout (> 30s)
  • Performance vs Huffman: ~259x slower on average

Timeout Mechanism

  • Timeout Duration: 30 seconds by default for enumerative coding
  • Graceful Handling: Timeouts are logged and marked as "TIMEOUT" in results
  • When Timeouts Occur: Very large sequences (> 100k symbols) with high vocabulary diversity
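A hedged sketch of such a timeout wrapper using concurrent.futures (names are illustrative, not the repository's API; note that a Python thread cannot be force-killed, so a timed-out worker still runs to completion in the background):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def encode_with_timeout(encode_fn, data, timeout_s: float = 30.0):
    """Run encode_fn(data); report "TIMEOUT" if it exceeds timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(encode_fn, data).result(timeout=timeout_s)
    except FuturesTimeout:
        return "TIMEOUT"  # logged and marked in the benchmark results
    finally:
        pool.shutdown(wait=False)  # don't block on the abandoned worker
```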

The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.

Performance Results

From the benchmark results comparing Huffman vs Enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|---|---|---|---|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

Why Enumerative Underperforms

  1. Computational Complexity: Combinatorial calculations become expensive for large datasets
  2. Fixed Algorithm Structure: Cannot adapt to data characteristics like Huffman's variable-length codes
  3. Overhead: Algorithm encodes structure information (alphabet, counts, positions) separately
  4. Scaling Issues: Encoding time grows rapidly with sequence length and vocabulary size, even with the polynomial-time optimizations

File Structure

entropy-coding-equiprobable/
β”œβ”€β”€ enumerative_coding.py      # Core enumerative entropy coding implementation
β”œβ”€β”€ entropy_coding.py          # Legacy compatibility and Huffman implementation
β”œβ”€β”€ test_compression.py        # Main benchmark script with timing analysis
β”œβ”€β”€ test_paper_examples.py     # Paper example verification
β”œβ”€β”€ test_enumerative.py        # Basic functionality tests
β”œβ”€β”€ plot_results.py           # Comprehensive visualization and analysis
β”œβ”€β”€ quick_test.py             # Quick functionality test
β”œβ”€β”€ justfile                  # Command runner
β”œβ”€β”€ pyproject.toml           # Python dependencies
β”œβ”€β”€ CLAUDE.md               # Project-specific AI instructions
└── README.md              # This file

Implementation Details

Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm; encode() performs four steps (the three described above, preceded by an explicit sequence-length field):

def encode(self, data: List[int]) -> bytes:
    bits = ""

    # Step 1: Encode sequence length
    n = len(data)
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and the symbols themselves)
    sorted_symbols = sorted(set(data))
    K = len(sorted_symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 counts; the last is implied)
    symbol_counts = Counter(data)  # collections.Counter
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    remaining_data = data
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        remaining_data = [s for s in remaining_data if s != symbol]

    return bits_to_bytes(bits)  # pack the bit string into bytes (helper elided)

Key Optimizations

# Complement encoding: if a symbol fills more than half the remaining
# sequence, encode the positions of the OTHER symbols instead
use_complement = k > current_n / 2
if use_complement:
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
import math

class OptimizedBinomialTable:
    def __init__(self) -> None:
        self._cache: dict[tuple[int, int], int] = {}

    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # arbitrary precision, no overflow
        self._cache[(n, k)] = result
        return result

Theoretical Analysis

Compression Bounds

  • Shannon Entropy: H(X) = -Ξ£ p(x) logβ‚‚ p(x), the theoretical minimum bits per symbol
  • Huffman: Achieves H(X) ≀ L_Huffman < H(X) + 1 (typically ~99% efficiency)
  • Enumerative: L_Enum β‰₯ H(X) + overhead (typically ~49% efficiency due to structural encoding)
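The efficiency figures are measured against the empirical Shannon entropy of each dataset. A minimal sketch of that computation (helper names are illustrative):

```python
import math
from collections import Counter

def empirical_entropy_bits_per_symbol(data: list[int]) -> float:
    """Shannon entropy H(X) = -sum p(x) * log2 p(x) of the empirical
    symbol distribution, in bits per symbol."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def efficiency(data: list[int], encoded_bits: int) -> float:
    """Ratio of the entropy lower bound to the achieved bits per symbol."""
    return empirical_entropy_bits_per_symbol(data) / (encoded_bits / len(data))
```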

When Enumerative Coding Works

  1. Research/theoretical applications: When exact mathematical properties are needed
  2. Educational purposes: Understanding combinatorial compression principles
  3. Small datasets: Where computational cost is not a concern

When Enumerative Struggles

  1. All practical applications: 259x slower than Huffman with worse compression
  2. Large datasets: Steep polynomial scaling and very large intermediate integers make it computationally prohibitive
  3. Real-time systems: Unpredictable and potentially very long encoding times

Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:

1. Just-In-Time (JIT) Compilation

  • Target: Critical loops in combinatorial indexing and position reconstruction
  • Options:
    • Numba (requires Python 3.11 due to llvmlite compatibility issues)
    • JAX (better Python 3.12 support, NumPy-compatible)
    • PyPy (alternative Python interpreter with JIT)
  • Expected Benefit: 10-100x speedup for computational bottlenecks

2. Algorithmic Improvements

  • Incremental Encoding: Reuse computations when processing similar sequences
  • Approximate Methods: Trade slight accuracy for major performance gains on very large datasets
  • Parallel Processing: Distribute symbol processing across multiple cores

3. Specialized Data Structures

  • Sparse Binomial Tables: Only compute coefficients actually needed
  • Compressed Position Indices: More efficient representation for position lists
  • Fast Integer Arithmetic: Specialized libraries for large integer operations

4. Memory Hierarchy Optimizations

  • Cache-Friendly Algorithms: Reorganize computations to minimize cache misses
  • Memory Pooling: Reduce allocation overhead for temporary arrays
  • Streaming Encoding: Process very large datasets without loading entirely into memory

5. Domain-Specific Optimizations

  • Text-Specific: Leverage byte patterns and common character frequencies
  • Statistical Precomputation: Pre-build tables for common distributions (Zipf, geometric)
  • Adaptive Thresholds: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

References

  • Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
  • Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
  • Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"

Contributing

This is a research implementation. To contribute:

  1. Fork the repository
  2. Make changes following the existing code style
  3. Run just check to verify code quality
  4. Submit a pull request

License

This project is for educational and research purposes.
