---
license: mit
---
# Entropy Coding with Equiprobable Partitioning
Implementation of the enumerative entropy coding algorithm with equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.
## Overview
This project implements two compression algorithms:
1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison
## Algorithm Description
### Enumerative Entropy Coding
The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:
1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s₁), N(s₂), ..., N(s_{M-1}) using exp-Golomb codes (the last count is implied)
3. **Encode sequence position** among all permutations with the same symbol counts using combinatorial enumeration
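Steps 1 and 2 rely on exp-Golomb codes, a universal code for non-negative integers. A minimal order-0 sketch for illustration (not the project's `ExpGolombCoder`, whose interface may differ):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code: (bit-length of x+1) - 1 leading zeros,
    followed by the binary representation of x + 1."""
    b = bin(x + 1)[2:]               # binary of x + 1, without the '0b' prefix
    return "0" * (len(b) - 1) + b

def exp_golomb_decode(bits: str) -> tuple[int, str]:
    """Decode one codeword from the front; return (value, remaining bits)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    value = int(bits[zeros:2 * zeros + 1], 2) - 1
    return value, bits[2 * zeros + 1:]

print(exp_golomb_encode(0))  # "1"
print(exp_golomb_encode(4))  # "00101"
```

Small integers get short codes, which suits the alphabet-size and count headers above.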
#### How It Works
- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram
This is fundamentally different from simple partitioning - it's a form of **combinatorial compression** that leverages the mathematical structure of permutations.
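The lexicographic indexing in Step 3 can be illustrated for a single symbol: given the k positions where the symbol occurs in a window of length n, compute the subset's rank among all C(n, k) possibilities. The helper below is an illustrative sketch, not the project's `_rank`:

```python
from math import comb

def rank_positions(n: int, positions: list[int]) -> int:
    """Lexicographic rank of a sorted k-subset of {0..n-1} among C(n, k) subsets.

    For each chosen position, count the subsets that would precede this one
    by placing that element at an earlier, unused index.
    """
    k = len(positions)
    rank = 0
    prev = -1
    for i, p in enumerate(positions):
        for q in range(prev + 1, p):
            # Subsets whose i-th element is q come earlier lexicographically.
            rank += comb(n - q - 1, k - i - 1)
        prev = p
    return rank

print(rank_positions(5, [0, 1]))  # 0  (first subset in lexicographic order)
print(rank_positions(5, [3, 4]))  # 9  (last of the C(5, 2) = 10 subsets)
```

The rank fits in ceil(log2 C(n, k)) bits, which is exactly the combinatorial share of the code length.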
### Performance Optimizations
Key optimizations enable practical performance for datasets up to ~10,000 symbols:
1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search
3. **Complement Encoding**: For frequent symbols (>50%), encode positions of other symbols instead
4. **Arbitrary Precision**: Avoids integer overflow for large combinatorial values
These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.
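Decoding inverts the ranking step: it recovers the positions from a rank. The sketch below uses a linear scan over the monotone cumulative counts for clarity; the binary-search optimization mentioned above replaces that scan with an O(log n) search over the same quantities (illustrative helper, not the project's code):

```python
from math import comb

def unrank_positions(n: int, k: int, rank: int) -> list[int]:
    """Inverse of lexicographic subset ranking: recover k positions in {0..n-1}.

    Linear scan shown for clarity; because the skipped-subset counts are
    monotone, a binary search can find each position in O(log n).
    """
    positions = []
    p = 0
    for i in range(k):
        # Advance p while every subset with i-th element at p precedes `rank`.
        while comb(n - p - 1, k - i - 1) <= rank:
            rank -= comb(n - p - 1, k - i - 1)
            p += 1
        positions.append(p)
        p += 1
    return positions
```

Memoizing the `comb` calls (as in the cached binomial table) makes repeated lookups O(1) amortized.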
### Why Enumerative Coding?
The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions
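The structure/content split means the "content" part costs about log2 of the multinomial coefficient of the symbol histogram. A back-of-the-envelope estimate (hypothetical helper; header bits for the alphabet and counts are ignored):

```python
from collections import Counter
from math import comb, log2

def permutation_bits(seq) -> float:
    """Bits to index `seq` among all arrangements with the same histogram:
    log2( n! / prod(n_s!) ), computed as a telescoping product of binomials."""
    counts = Counter(seq)
    bits = 0.0
    remaining = len(seq)
    for c in counts.values():
        bits += log2(comb(remaining, c))  # choose slots for this symbol
        remaining -= c
    return bits

print(round(permutation_bits("abracadabra"), 2))  # ~16.34 bits for the permutation part
```

For i.i.d. data this quantity approaches n·H(X), which is why the method can, in principle, reach the entropy bound.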
## Installation
```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable
# Install dependencies
just setup
# or manually: uv sync
```
## Usage
### Quick Start
```bash
# Run all available commands
just
# Run quick tests
just test
# Run paper examples
just test-paper
# Run full compression benchmark
just run
# Run benchmark and generate plots (recommended)
just analyze
# Generate plots and analysis
just plot
```
### Visualization
The plotting functionality generates comprehensive analysis:
1. **Compression Comparison**: Side-by-side comparison of Huffman vs Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior
Plots are saved to the `plots/` directory as high-resolution PNG files.
## Results
### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)
Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.
### Compression Time Analysis
![Compression Time Comparison](plots/compression_time_comparison.png)
Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.
### Distribution Analysis
![Distribution Comparison](plots/distribution_comparison.png)
Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.
### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)
Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.
### Command Reference
- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code
## Test Datasets
The benchmark includes:
### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale
### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities
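These distributions can be sampled with NumPy; a sketch of how such test data might be generated (the parameter values and variable names here are illustrative, not the benchmark's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n, vocab = 10_000, 64

# Uniform: every symbol in {0..vocab-1} equally likely
uniform = rng.integers(0, vocab, size=n)

# Zipf: power-law tail, clipped into the vocabulary range
zipf = np.clip(rng.zipf(a=1.5, size=n), 1, vocab) - 1

# Geometric: exponentially decreasing symbol probabilities
geometric = np.clip(rng.geometric(p=0.1, size=n), 1, vocab) - 1
```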
### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range
### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face
## Results Analysis
### Performance Patterns
1. **Uniform Distributions**: EP performs poorly because there's no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols
### Computational Complexity
The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:
#### Time Complexity Analysis
- **Encoding**: O(M × n), where M = alphabet size and n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n), where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
#### Space Complexity
- **Memory Usage**: O(unique_binomial_lookups) for coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time, grows only as needed
#### Performance Characteristics
- **Small Datasets** (< 5000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5000-10000 symbols): 0.3s - 15s encoding time
- **Large Datasets** (> 100000 symbols): May timeout (> 30s)
- **Performance vs Huffman**: ~259x slower on average
#### Timeout Mechanism
- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in results
- **When Timeouts Occur**: Very large sequences (> 100k symbols) with high vocabulary diversity
The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.
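A wall-clock limit like this can be implemented in several ways; one common Unix-only approach uses SIGALRM (a sketch of the general technique, not necessarily how this project implements it):

```python
import signal

class EncodingTimeout(Exception):
    """Raised when encoding exceeds the wall-clock budget."""

def encode_with_timeout(encode_fn, data, seconds=30):
    """Run `encode_fn(data)` under a SIGALRM deadline (Unix, main thread only).

    Returns the encoder's result, or the string "TIMEOUT" on expiry,
    mirroring how timed-out runs are marked in the benchmark results.
    """
    def handler(signum, frame):
        raise EncodingTimeout(f"encoding exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        return encode_fn(data)
    except EncodingTimeout:
        return "TIMEOUT"
    finally:
        signal.alarm(0)                        # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

On platforms without SIGALRM, periodic elapsed-time checks inside the encoding loop achieve the same effect.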
### Performance Results
From the benchmark results comparing Huffman vs Enumerative coding:
| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |
### Why Enumerative Underperforms
1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics like Huffman's variable-length codes
3. **Overhead**: Algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Runtime grows steeply with dataset size and vocabulary diversity, even at polynomial complexity
## File Structure
```
entropy-coding-equiprobable/
├── enumerative_coding.py   # Core enumerative entropy coding implementation
├── entropy_coding.py       # Legacy compatibility and Huffman implementation
├── test_compression.py     # Main benchmark script with timing analysis
├── test_paper_examples.py  # Paper example verification
├── test_enumerative.py     # Basic functionality tests
├── plot_results.py         # Comprehensive visualization and analysis
├── quick_test.py           # Quick functionality test
├── justfile                # Command runner
├── pyproject.toml          # Python dependencies
├── CLAUDE.md               # Project-specific AI instructions
└── README.md               # This file
```
## Implementation Details
### Enumerative Entropy Coding
The implementation follows the Han et al. (2008) algorithm with four main steps:
```python
def encode(self, data: List[int]) -> bytes:
    bits = ""  # accumulated bitstring (excerpt; helper definitions omitted)
    n = len(data)
    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)
    # Step 2: Encode alphabet (size K and the symbols themselves)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)
    # Step 3: Encode symbol frequencies (K - 1 counts; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])
    # Step 4: Encode symbol positions using combinatorial indexing
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
```
### Key Optimizations
```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of the OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def __init__(self):
        self._cache: dict = {}

    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Python ints give arbitrary precision
        self._cache[(n, k)] = result
        return result
```
## Theoretical Analysis
### Compression Bounds
- **Shannon Entropy**: H(X) = -Σ p(x) log2 p(x), the theoretical minimum
- **Huffman**: achieves H(X) ≤ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency due to structural encoding)
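These bounds are easy to verify numerically for any distribution (a standalone sketch, unrelated to the project's Huffman implementation):

```python
import heapq
from math import log2

def shannon_entropy(probs: list[float]) -> float:
    """H(X) = -sum p log2 p, the theoretical minimum bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_expected_length(probs: list[float]) -> float:
    """Expected bits per symbol of an optimal prefix code.

    Uses the identity that expected code length equals the sum of the
    probabilities of all internal (merged) nodes in the Huffman tree.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

dist = [0.5, 0.25, 0.125, 0.125]
h = shannon_entropy(dist)          # 1.75 (dyadic probabilities, so Huffman is exact)
l = huffman_expected_length(dist)  # 1.75
assert h <= l < h + 1
```

For non-dyadic distributions L sits strictly between H(X) and H(X) + 1.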
### When Enumerative Coding Works
1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern
### When Enumerative Struggles
1. **All practical applications**: 259x slower than Huffman with worse compression
2. **Large datasets**: Rapid growth of the combinatorial computations makes encoding prohibitively slow
3. **Real-time systems**: Unpredictable and potentially very long encoding times
## Future Optimization Opportunities
While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:
### 1. Just-In-Time (JIT) Compilation
- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**:
- **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
- **JAX** (better Python 3.12 support, NumPy-compatible)
- **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks
### 2. Algorithmic Improvements
- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores
### 3. Specialized Data Structures
- **Sparse Binomial Tables**: Only compute coefficients actually needed
- **Compressed Position Indices**: More efficient representation for position lists
- **Fast Integer Arithmetic**: Specialized libraries for large integer operations
### 4. Memory Hierarchy Optimizations
- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading entirely into memory
### 5. Domain-Specific Optimizations
- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters
The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.
## References
- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"
## Contributing
This is a research implementation. To contribute:
1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request
## License
This project is for educational and research purposes.