---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation and comparison of the entropy coding algorithm using equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s₁), N(s₂), ..., N(s_{M-1}) using exp-Golomb codes (the last count is implied)
3. **Encode the sequence's position** among all permutations with the same symbol counts, using combinatorial enumeration

#### How It Works

- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it is a form of **combinatorial compression** that exploits the mathematical structure of permutations.

### Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search
3. **Complement Encoding**: For frequent symbols (>50%), encode the positions of the other symbols instead
4. **Arbitrary Precision**: Avoids integer overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.

### Why Enumerative Coding?
The algorithm aims to achieve compression by:

- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually:
uv sync
```

## Usage

### Quick Start

```bash
# List all available commands
just

# Run quick tests
just test

# Run paper examples
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting functionality generates a comprehensive analysis:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs. Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison

![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

### Compression Time Analysis

![Compression Time Comparison](plots/compression_time_comparison.png)

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.
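The efficiency figures reported throughout these results measure each coder's output against the Shannon-entropy lower bound. A minimal sketch of that baseline computation (illustrative helper names, not the benchmark's actual code):

```python
from collections import Counter
from math import log2

def shannon_entropy(data):
    """Empirical Shannon entropy H(X) = -sum p(x) log2 p(x), in bits per symbol."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def efficiency(compressed_bits, data):
    """How close an encoder gets to the theoretical minimum:
    n * H(X) bits is the entropy bound for the whole sequence."""
    return len(data) * shannon_entropy(data) / compressed_bits

# A skewed 8-symbol sequence: p = {0: 1/2, 1: 1/4, 2: 1/8, 3: 1/8}
seq = [0, 0, 0, 0, 1, 1, 2, 3]
print(shannon_entropy(seq))  # 1.75 bits/symbol
```

An encoder that spent, say, 28 bits on this 8-symbol sequence would score 8 × 1.75 / 28 = 50% efficiency, which is roughly where the enumerative coder lands in the benchmarks below.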
### Distribution Analysis

![Distribution Comparison](plots/distribution_comparison.png)

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

### Computational Complexity Analysis

![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets

- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale

### Distributions

- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities

### Vocabulary Sizes

- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data

- **English text**: Downloaded from WikiText-2 via Hugging Face

## Results Analysis

### Performance Patterns

1. **Uniform Distributions**: EP performs poorly because there is no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis

- **Encoding**: O(M × n), where M = alphabet size and n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n), where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized

#### Space Complexity

- **Memory Usage**: O(unique binomial lookups) for the coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time; the cache grows only as needed

#### Performance Characteristics

- **Small Datasets** (< 5,000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5,000-10,000 symbols): 0.3s - 15s encoding time
- **Large Datasets** (> 100,000 symbols): May time out (> 30s)
- **Performance vs. Huffman**: ~259x slower on average

#### Timeout Mechanism

- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in the results
- **When Timeouts Occur**: Very large sequences (> 100K symbols) with high vocabulary diversity

The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.
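The combinatorial indexing and memoized binomial lookups behind these cost figures can be illustrated with a self-contained sketch. This ranks a binary sequence (the one-positions of a single symbol) among all C(n, k) sequences with the same count, in colexicographic order; it is simplified relative to the project's implementation, and the decoder below uses a linear scan where the optimized version uses binary search:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def binom(n, k):
    """Memoized binomial coefficient (stand-in for the cached lookup table)."""
    return comb(n, k)

def rank(positions):
    """Colex rank of a k-combination: the index of a binary sequence
    among all C(n, k) sequences with the same number of ones,
    given the strictly increasing 0-based positions of the ones."""
    return sum(binom(p, i + 1) for i, p in enumerate(positions))

def unrank(r, n, k):
    """Invert rank(): recover the one-positions from index r.
    The linear scan for p could be replaced with binary search."""
    positions = []
    for i in range(k, 0, -1):
        p = n - 1
        while binom(p, i) > r:  # largest p with C(p, i) <= r
            p -= 1
        positions.append(p)
        r -= binom(p, i)
        n = p  # remaining positions must lie below p
    return sorted(positions)

# Sequence 010110 has ones at positions [1, 3, 4]
idx = rank((1, 3, 4))
print(idx)                # 8, an index among C(6, 3) = 20 sequences
print(unrank(idx, 6, 3))  # [1, 3, 4]
```

Since the index fits in ceil(log2 C(n, k)) bits, encoding the permutation this way approaches the entropy of the histogram-conditioned sequence; the cost is the big-integer arithmetic on C(n, k), which is what drives the timeouts above.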
### Performance Results

From the benchmark results comparing Huffman vs. Enumerative coding:

| Dataset Type   | Huffman Efficiency         | Enumerative Efficiency     | Speed Ratio      |
|----------------|----------------------------|----------------------------|------------------|
| Uniform data   | ~99.8% of theoretical      | ~48.9% of theoretical      | 259x slower      |
| Zipf data      | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text   | ~99.1% of theoretical      | ~48.1% of theoretical      | 23x slower       |

### Why Enumerative Underperforms

1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics the way Huffman's variable-length codes do
3. **Overhead**: The algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Encoding time grows rapidly with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
├── enumerative_coding.py    # Core enumerative entropy coding implementation
├── entropy_coding.py        # Legacy compatibility and Huffman implementation
├── test_compression.py      # Main benchmark script with timing analysis
├── test_paper_examples.py   # Paper example verification
├── test_enumerative.py      # Basic functionality tests
├── plot_results.py          # Comprehensive visualization and analysis
├── quick_test.py            # Quick functionality test
├── justfile                 # Command runner
├── pyproject.toml           # Python dependencies
├── CLAUDE.md                # Project-specific AI instructions
└── README.md                # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps (simplified excerpt):

```python
def encode(self, data: List[int]) -> bytes:
    n = len(data)
    sorted_symbols = sorted(set(data))
    K = len(sorted_symbols)
    symbol_counts = Counter(data)
    remaining_data = list(data)
    bits = ""

    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K-1; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        remaining_data = [s for s in remaining_data if s != symbol]
```

### Key Optimizations

```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of the OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def __init__(self):
        self._cache = {}

    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Python ints give arbitrary precision
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon Entropy**: H(X) = -Σ p(x) log₂ p(x) is the theoretical minimum
- **Huffman**: Achieves H(X) ≤ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency due to structural encoding)

### When Enumerative Coding Works

1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern

### When Enumerative Struggles

1. **Most practical applications**: ~259x slower than Huffman with worse compression
2. **Large datasets**: Steep scaling makes it computationally prohibitive
3. **Real-time systems**: Unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several strategies could further improve performance:

### 1. Just-In-Time (JIT) Compilation

- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**:
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements

- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores

### 3. Specialized Data Structures

- **Sparse Binomial Tables**: Only compute the coefficients actually needed
- **Compressed Position Indices**: More efficient representation of position lists
- **Fast Integer Arithmetic**: Specialized libraries for large-integer operations

### 4. Memory Hierarchy Optimizations

- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading them entirely into memory

### 5. Domain-Specific Optimizations

- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning."
- Cover, T. M., & Thomas, J. A. (2006). *Elements of Information Theory* (2nd ed.). Wiley.
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes." *Proceedings of the IRE*, 40(9), 1098-1101.

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

MIT (see the metadata header above). This project is intended for educational and research purposes.