Entropy Coding with Equiprobable Partitioning
Implementation of the entropy coding algorithm using equiprobable partitioning from Han et al. (2008), compared against Huffman coding and theoretical limits.
Overview
This project implements two compression algorithms:
- Equiprobable Partitioning (EP) - The main algorithm from the paper
- Huffman Coding - Classical entropy coding for comparison
Algorithm Description
Enumerative Entropy Coding
The algorithm from Han et al. (2008) is actually an enumerative entropy coding method that works in three steps:
- Encode alphabet size M using exp-Golomb codes
- Encode symbol counts N(s_0), N(s_1), ..., N(s_{M-1}) using exp-Golomb codes (last count is implied)
- Encode sequence position among all permutations with the same symbol counts using combinatorial enumeration
How It Works
- Step 1: Use exp-Golomb to encode how many distinct symbols appear
- Step 2: Use exp-Golomb to encode how many times each symbol appears
- Step 3: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram
This is fundamentally different from simple partitioning - it's a form of combinatorial compression that leverages the mathematical structure of permutations.
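Steps 1 and 2 rely on exp-Golomb integer codes. As a minimal sketch, an order-0 exp-Golomb encoder/decoder pair looks like the following (the repo's ExpGolombCoder may use a different order or bit packing):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for x >= 0: (b-1) zero bits,
    then the b-bit binary form of x + 1."""
    y = x + 1
    b = y.bit_length()
    return "0" * (b - 1) + format(y, "b")


def exp_golomb_decode(bits: str, pos: int = 0):
    """Decode one codeword starting at bit offset pos; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end
```

Small values get short codes (0 encodes as "1", 3 as "00100"), which is why the scheme suits alphabet sizes and symbol counts that are usually small.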
Performance Optimizations
Key optimizations enable practical performance for datasets up to ~10,000 symbols:
- Cached Binomial Coefficients: Uses math.comb() with caching to avoid recomputation
- Binary Search: O(log n) position reconstruction instead of linear search
- Complement Encoding: For frequent symbols (>50%), encode positions of other symbols instead
- Arbitrary Precision: Avoids integer overflow for large combinatorial values
These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.
Why Enumerative Coding?
The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions
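The "exact permutation indexing" above amounts to ranking the set of positions a symbol occupies. A sketch using the combinatorial number system (colexicographic order; the project's _rank may order combinations differently):

```python
from math import comb


def rank_positions(positions) -> int:
    """Colex index of a k-subset of positions among all k-subsets."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))


def unrank_positions(index: int, k: int) -> list:
    """Invert rank_positions: recover the sorted position set from its index."""
    positions = []
    for i in range(k, 0, -1):
        c = i - 1                       # comb(i-1, i) == 0, always <= index
        while comb(c + 1, i) <= index:  # greedy: largest c with comb(c, i) <= index
            c += 1
        index -= comb(c, i)
        positions.append(c)
    return sorted(positions)
```

Because the rank is an exact bijection onto 0..C(n, k)-1, no code space is wasted on the permutation part.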
Installation
# Clone or navigate to the repository
cd entropy-coding-equiprobable
# Install dependencies
just setup
# or manually: uv sync
Usage
Quick Start
# Run all available commands
just
# Run quick tests
just test
# Run paper examples
just test-paper
# Run full compression benchmark
just run
# Run benchmark and generate plots (recommended)
just analyze
# Generate plots and analysis
just plot
Visualization
The plotting functionality generates comprehensive analysis:
- Compression Comparison: Side-by-side comparison of Huffman vs Enumerative methods
- Compression Time Analysis: Performance timing comparison between algorithms
- Distribution Analysis: Performance across uniform, Zipf, and geometric data
- Efficiency Analysis: How close each method gets to theoretical limits
- Enumerative Timeout Analysis: Computational complexity limitations and scaling behavior
Plots are saved to the plots/ directory as high-resolution PNG files.
Results
Compression Performance Comparison
Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.
Compression Time Analysis
Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.
Distribution Analysis
Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.
Computational Complexity Analysis
Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.
Command Reference
- just - List available commands
- just setup - Install dependencies
- just test - Quick test with small datasets + paper examples
- just test-paper - Test examples from the paper
- just run - Full compression benchmark
- just analyze - Run full benchmark and generate plots
- just plot - Generate comparison plots
- just clean - Remove generated files
- just check - Run code quality checks
- just format - Format code
Test Datasets
The benchmark includes:
I.I.D. Datasets
- Small (1K symbols): Quick testing
- Medium (10K symbols): Moderate datasets
- Large (100K symbols): Performance at scale
Distributions
- Uniform: All symbols equally likely
- Zipf: Power-law distribution (realistic for text)
- Geometric: Exponentially decreasing probabilities
Vocabulary Sizes
- 10 symbols: Small alphabet
- 64 symbols: Medium alphabet
- 256 symbols: Full byte range
Real Data
- English text: Downloaded from WikiText-2 via Hugging Face
Results Analysis
Performance Patterns
- Uniform Distributions: EP performs poorly because there's no probability imbalance to exploit
- Skewed Distributions: EP performs better but still trails Huffman
- Large Vocabularies: EP overhead becomes significant with many symbols
Computational Complexity
The optimized enumerative entropy coding implementation achieves polynomial time complexity through careful algorithmic design:
Time Complexity Analysis
- Encoding: O(M × n) where M = alphabet size, n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- Decoding: O(M × k × log n) where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
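The O(log n) position reconstruction exploits the fact that comb(c, i) is nondecreasing in c, so the greedy search for the largest valid position can be bisected. A sketch (names are illustrative, not taken from the codebase):

```python
from math import comb


def largest_c(index: int, i: int, hi: int) -> int:
    """Largest c in [i-1, hi] with comb(c, i) <= index, found by bisection.
    Valid because comb(c, i) is nondecreasing in c for fixed i."""
    lo = i - 1                          # comb(i-1, i) == 0 <= index always holds
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if comb(mid, i) <= index:
            lo = mid                    # mid is still feasible; search higher
        else:
            hi = mid - 1                # mid overshoots; search lower
    return lo
```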
Space Complexity
- Memory Usage: O(unique_binomial_lookups) for coefficient cache
- Typical Cache Size: < 1000 entries for most realistic datasets
- No Upfront Cost: Zero initialization time, grows only as needed
Performance Characteristics
- Small Datasets (< 5000 symbols): 0.045s - 1.7s encoding time
- Medium Datasets (5000-10000 symbols): 0.3s - 15s encoding time
- Large Datasets (> 100000 symbols): May timeout (> 30s)
- Performance vs Huffman: ~259x slower on average
Timeout Mechanism
- Timeout Duration: 30 seconds by default for enumerative coding
- Graceful Handling: Timeouts are logged and marked as "TIMEOUT" in results
- When Timeouts Occur: Very large sequences (> 100k symbols) with high vocabulary diversity
The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.
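The cost of the enumerative step itself can be sanity-checked directly: the permutation index needs roughly log2 of the multinomial coefficient n! / ∏ N(s)! bits, which approaches n·H(X) for long sequences. A small helper for that bound (illustrative, not part of the codebase):

```python
from math import comb, log2


def permutation_index_bits(counts) -> float:
    """log2 of the multinomial coefficient n! / prod(N(s)!), i.e. the bits
    needed to index one sequence among all with this symbol histogram."""
    n = sum(counts)
    multinomial = 1
    remaining = n
    for c in counts:
        multinomial *= comb(remaining, c)  # choose slots for each symbol in turn
        remaining -= c
    return log2(multinomial)
```

Any gap between measured output size and this bound is overhead from the exp-Golomb headers (length, alphabet, counts).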
Performance Results
From the benchmark results comparing Huffman vs Enumerative coding:
| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|---|---|---|---|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |
Why Enumerative Underperforms
- Computational Complexity: Combinatorial calculations become expensive for large datasets
- Fixed Algorithm Structure: Cannot adapt to data characteristics like Huffman's variable-length codes
- Overhead: Algorithm encodes structure information (alphabet, counts, positions) separately
- Scaling Issues: Performance degrades steeply with dataset size and vocabulary complexity, despite the polynomial bound
File Structure
entropy-coding-equiprobable/
├── enumerative_coding.py   # Core enumerative entropy coding implementation
├── entropy_coding.py       # Legacy compatibility and Huffman implementation
├── test_compression.py     # Main benchmark script with timing analysis
├── test_paper_examples.py  # Paper example verification
├── test_enumerative.py     # Basic functionality tests
├── plot_results.py         # Comprehensive visualization and analysis
├── quick_test.py           # Quick functionality test
├── justfile                # Command runner
├── pyproject.toml          # Python dependencies
├── CLAUDE.md               # Project-specific AI instructions
└── README.md               # This file
Implementation Details
Enumerative Entropy Coding
The implementation follows the Han et al. (2008) algorithm with four main steps:
def encode(self, data: List[int]) -> bytes:
    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K-1, last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
Key Optimizations
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Uses arbitrary precision
        self._cache[(n, k)] = result
        return result
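An equivalent, stdlib-only way to get the same memoization is the cache decorator; this is a design alternative, not the repo's actual code:

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def cached_comb(n: int, k: int) -> int:
    """Memoized binomial coefficient: repeated (n, k) lookups hit the cache."""
    return comb(n, k)
```

The explicit table used above makes cache inspection and clearing straightforward, while the decorator keeps the call sites identical to plain math.comb.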
Theoretical Analysis
Compression Bounds
- Shannon Entropy: H(X) = -Σ p(x) log2 p(x) - the theoretical minimum
- Huffman: Achieves H(X) ≤ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- Enumerative: L_Enum ≥ H(X) + overhead (typically ~49% efficiency due to structural encoding)
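For reference, the empirical Shannon entropy used as the baseline in the efficiency figures can be computed as follows (illustrative helper, not from the codebase):

```python
from collections import Counter
from math import log2


def shannon_entropy(data) -> float:
    """Empirical Shannon entropy H(X) in bits per symbol:
    -sum over symbols of p(x) * log2(p(x))."""
    n = len(data)
    return -sum(c / n * log2(c / n) for c in Counter(data).values())
```

Multiplying this by the sequence length gives the theoretical minimum size any entropy coder is measured against.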
When Enumerative Coding Works
- Research/theoretical applications: When exact mathematical properties are needed
- Educational purposes: Understanding combinatorial compression principles
- Small datasets: Where computational cost is not a concern
When Enumerative Struggles
- All practical applications: 259x slower than Huffman with worse compression
- Large datasets: Steep scaling in sequence length and vocabulary makes it computationally prohibitive
- Real-time systems: Unpredictable and potentially very long encoding times
Future Optimization Opportunities
While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:
1. Just-In-Time (JIT) Compilation
- Target: Critical loops in combinatorial indexing and position reconstruction
- Options:
- Numba (requires Python 3.11 due to llvmlite compatibility issues)
- JAX (better Python 3.12 support, NumPy-compatible)
- PyPy (alternative Python interpreter with JIT)
- Expected Benefit: 10-100x speedup for computational bottlenecks
2. Algorithmic Improvements
- Incremental Encoding: Reuse computations when processing similar sequences
- Approximate Methods: Trade slight accuracy for major performance gains on very large datasets
- Parallel Processing: Distribute symbol processing across multiple cores
3. Specialized Data Structures
- Sparse Binomial Tables: Only compute coefficients actually needed
- Compressed Position Indices: More efficient representation for position lists
- Fast Integer Arithmetic: Specialized libraries for large integer operations
4. Memory Hierarchy Optimizations
- Cache-Friendly Algorithms: Reorganize computations to minimize cache misses
- Memory Pooling: Reduce allocation overhead for temporary arrays
- Streaming Encoding: Process very large datasets without loading entirely into memory
5. Domain-Specific Optimizations
- Text-Specific: Leverage byte patterns and common character frequencies
- Statistical Precomputation: Pre-build tables for common distributions (Zipf, geometric)
- Adaptive Thresholds: Dynamically adjust complement encoding and timeout parameters
The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.
References
- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"
Contributing
This is a research implementation. To contribute:
- Fork the repository
- Make changes following the existing code style
- Run just check to verify code quality
- Submit a pull request
License
This project is for educational and research purposes.



