---
license: mit
---
# Entropy Coding with Equiprobable Partitioning
An implementation of the entropy coding algorithm using equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical entropy limits.
## Overview
This project implements two compression algorithms:
1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison
## Algorithm Description
### Enumerative Entropy Coding
The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:
1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s_0), N(s_1), ..., N(s_{M-2}) using exp-Golomb codes (the last count is implied by the sequence length)
3. **Encode sequence position** among all permutations with the same symbol counts using combinatorial enumeration
#### How It Works
- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram
This is fundamentally different from simple partitioning - it's a form of **combinatorial compression** that leverages the mathematical structure of permutations.
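For concreteness, here is a minimal order-0 exp-Golomb encoder sketch (the function name and bit-string return type are illustrative; the project's `ExpGolombCoder` may differ):

```python
def exp_golomb_encode(n: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer.

    Write (n + 1) in binary, preceded by one fewer zero bits than the
    binary length: 0 -> "1", 1 -> "010", 2 -> "011", 3 -> "00100".
    """
    b = bin(n + 1)[2:]  # binary of n+1 without the "0b" prefix
    return "0" * (len(b) - 1) + b
```

Small integers get short codes, which is why exp-Golomb suits the alphabet-size and count fields.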
### Performance Optimizations
Key optimizations enable practical performance for datasets up to ~10,000 symbols:
1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search
3. **Complement Encoding**: For frequent symbols (>50%), encode positions of other symbols instead
4. **Arbitrary Precision**: Avoids integer overflow for large combinatorial values
These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.
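The complement-encoding optimization rests on the symmetry C(n, k) = C(n, n − k): encoding where a frequent symbol is *absent* iterates over fewer positions without changing the rank's value range. A hedged sketch (helper name is hypothetical):

```python
from math import comb

def complement_positions(n: int, positions: list[int]) -> list[int]:
    # When a symbol fills more than half the slots, listing where it is
    # NOT touches fewer indices; since C(n, k) == C(n, n - k), the rank
    # drawn from these positions spans the same range of values.
    occupied = set(positions)
    return [i for i in range(n) if i not in occupied]
```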
### Why Enumerative Coding?
The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions
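The permutation indexing in the last point can be sketched as lexicographic ranking of a k-subset of positions (a standalone illustration; the implementation wraps this in a class with cached binomials):

```python
from math import comb

def rank_positions(n: int, positions: list[int]) -> int:
    """Lexicographic rank of a k-subset of {0, ..., n-1} among all
    C(n, k) subsets -- the 'which arrangement is this' index for one
    symbol's occurrence pattern."""
    k = len(positions)
    rank = 0
    prev = -1
    for i, p in enumerate(positions):
        # Count every subset whose i-th element falls before position p.
        for j in range(prev + 1, p):
            rank += comb(n - j - 1, k - i - 1)
        prev = p
    return rank
```

For n = 4, k = 2 the six subsets (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) rank 0 through 5 in order.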
## Installation
```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable
# Install dependencies
just setup
# or manually: uv sync
```
## Usage
### Quick Start
```bash
# Run all available commands
just
# Run quick tests
just test
# Run paper examples
just test-paper
# Run full compression benchmark
just run
# Run benchmark and generate plots (recommended)
just analyze
# Generate plots and analysis
just plot
```
### Visualization
The plotting functionality generates comprehensive analysis:
1. **Compression Comparison**: Side-by-side comparison of Huffman vs Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior
Plots are saved to the `plots/` directory as high-resolution PNG files.
## Results
### Compression Performance Comparison
Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.
### Compression Time Analysis
Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.
### Distribution Analysis
Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.
### Computational Complexity Analysis
Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.
### Command Reference
- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code
## Test Datasets
The benchmark includes:
### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale
### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities
### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range
### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face
## Results Analysis
### Performance Patterns
1. **Uniform Distributions**: EP performs poorly because there's no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols
### Computational Complexity
The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:
#### Time Complexity Analysis
- **Encoding**: O(M × n), where M = alphabet size, n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n), where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
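Decoding inverts the lexicographic ranking. A simple linear-scan unranking sketch conveys the idea (the implementation reportedly accelerates this with binary search; names here are illustrative):

```python
from math import comb

def unrank_positions(n: int, k: int, rank: int) -> list[int]:
    # Recover the k positions from their lexicographic rank by testing,
    # at each slot j, how many subsets would place an element there.
    positions = []
    j = 0
    while k > 0:
        c = comb(n - j - 1, k - 1)  # subsets with an element at slot j
        if rank < c:
            positions.append(j)
            k -= 1
        else:
            rank -= c
        j += 1
    return positions
```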
#### Space Complexity
- **Memory Usage**: O(unique_binomial_lookups) for coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time, grows only as needed
#### Performance Characteristics
- **Small Datasets** (< 5000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5000-10000 symbols): 0.3s - 15s encoding time
- **Large Datasets** (> 100000 symbols): May timeout (> 30s)
- **Performance vs Huffman**: ~259x slower on average
#### Timeout Mechanism
- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in results
- **When Timeouts Occur**: Very large sequences (> 100k symbols) with high vocabulary diversity
The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.
### Performance Results
From the benchmark results comparing Huffman vs Enumerative coding:
| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |
### Why Enumerative Underperforms
1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics like Huffman's variable-length codes
3. **Overhead**: Algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Encoding time grows rapidly with dataset size and vocabulary complexity, even with the polynomial-time optimizations
## File Structure
```
entropy-coding-equiprobable/
├── enumerative_coding.py   # Core enumerative entropy coding implementation
├── entropy_coding.py       # Legacy compatibility and Huffman implementation
├── test_compression.py     # Main benchmark script with timing analysis
├── test_paper_examples.py  # Paper example verification
├── test_enumerative.py     # Basic functionality tests
├── plot_results.py         # Comprehensive visualization and analysis
├── quick_test.py           # Quick functionality test
├── justfile                # Command runner
├── pyproject.toml          # Python dependencies
├── CLAUDE.md               # Project-specific AI instructions
└── README.md               # This file
```
## Implementation Details
### Enumerative Entropy Coding
The implementation follows the Han et al. (2008) algorithm with four main steps:
```python
def encode(self, data: List[int]) -> bytes:
    bits = ""
    n = len(data)
    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    sorted_symbols = sorted(set(data))
    K = len(sorted_symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 counts; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
```
### Key Optimizations
```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Uses arbitrary precision
        self._cache[(n, k)] = result
        return result
```
## Theoretical Analysis
### Compression Bounds
- **Shannon Entropy**: H(X) = -Σ p(x) log2 p(x), the theoretical minimum
- **Huffman**: Achieves H(X) ≤ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency due to structural encoding)
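The entropy baseline used for these efficiency figures can be estimated empirically from symbol frequencies (a small sketch, not part of the benchmark code):

```python
import math
from collections import Counter

def shannon_entropy(data) -> float:
    # Empirical Shannon entropy in bits per symbol:
    # H = -sum over symbols of p(x) * log2 p(x).
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Multiplying this by the sequence length gives the theoretical minimum bit count a coder can approach.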
### When Enumerative Coding Works
1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern
### When Enumerative Struggles
1. **All practical applications**: 259x slower than Huffman with worse compression
2. **Large datasets**: Combinatorial computation grows quickly with sequence length and vocabulary, making encoding prohibitively slow
3. **Real-time systems**: Unpredictable and potentially very long encoding times
## Future Optimization Opportunities
While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:
### 1. Just-In-Time (JIT) Compilation
- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**:
- **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
- **JAX** (better Python 3.12 support, NumPy-compatible)
- **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks
### 2. Algorithmic Improvements
- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores
### 3. Specialized Data Structures
- **Sparse Binomial Tables**: Only compute coefficients actually needed
- **Compressed Position Indices**: More efficient representation for position lists
- **Fast Integer Arithmetic**: Specialized libraries for large integer operations
### 4. Memory Hierarchy Optimizations
- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading entirely into memory
### 5. Domain-Specific Optimizations
- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters
The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.
## References
- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"
## Contributing
This is a research implementation. To contribute:
1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request
## License
This project is for educational and research purposes. |