---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation of the entropy coding algorithm using equiprobable partitioning from Han et al. (2008), compared against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s_1), N(s_2), ..., N(s_{M-1}) using exp-Golomb codes (the last count is implied by the sequence length)
3. **Encode sequence position** among all permutations with the same symbol counts using combinatorial enumeration

#### How It Works

- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears  
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it's a form of **combinatorial compression** that leverages the mathematical structure of permutations.
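Steps 1 and 2 rely on exp-Golomb codes for the integer headers. As a minimal sketch (order-0 exp-Golomb, the standard construction; function names here are illustrative, not the repo's `ExpGolombCoder` API):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer:
    (bitlen(x+1) - 1) zero bits, then the binary form of x+1."""
    b = bin(x + 1)[2:]
    return "0" * (len(b) - 1) + b

def exp_golomb_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Read one codeword starting at `pos`; return (value, next_pos)."""
    zeros = 0
    while bits[pos] == "0":
        zeros += 1
        pos += 1
    value = int(bits[pos:pos + zeros + 1], 2) - 1
    return value, pos + zeros + 1
```

For example, `exp_golomb_encode(0)` yields `"1"` and `exp_golomb_encode(4)` yields `"00101"`; small integers get short codewords, which is why the scheme suits header fields like alphabet sizes and counts.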

### Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search  
3. **Complement Encoding**: For frequent symbols (>50%), encode positions of other symbols instead
4. **Arbitrary Precision**: Avoids integer overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.

### Why Enumerative Coding?

The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions
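The permutation-indexing idea can be sketched with the combinatorial number system: a set of k positions among n maps bijectively to an integer in [0, C(n, k)), so transmitting the rank (plus the counts) identifies the sequence exactly. This is a generic illustration, not the repo's `_rank` implementation:

```python
from math import comb

def rank_positions(positions: list[int]) -> int:
    """Colexicographic rank of a sorted position set: a bijection
    from k-subsets of {0, ..., n-1} onto [0, C(n, k))."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

def unrank_positions(rank: int, k: int) -> list[int]:
    """Invert rank_positions: recover the position set from its rank."""
    positions = []
    for i in range(k, 0, -1):
        # Find the largest p with comb(p, i) <= rank
        p = i - 1
        while comb(p + 1, i) <= rank:
            p += 1
        positions.append(p)
        rank -= comb(p, i)
    return sorted(positions)
```

Because the mapping is a bijection, the rank costs exactly ceil(log2 C(n, k)) bits in the best case, which is the information-theoretic cost of the permutation given the histogram.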

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync
```

## Usage

### Quick Start

```bash
# Run all available commands
just

# Run quick tests
just test

# Run paper examples  
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting functionality generates comprehensive analysis:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data  
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

### Compression Time Analysis  
![Compression Time Comparison](plots/compression_time_comparison.png)

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.

### Distribution Analysis
![Distribution Comparison](plots/distribution_comparison.png)

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets  
- **Large** (100K symbols): Performance at scale

### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities

### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face
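For the synthetic distributions, a small stdlib-only sampler suffices; this is an illustrative sketch (the benchmark's actual generators may differ in parameterization):

```python
import random

def zipf_sample(n: int, vocab: int, s: float = 1.0, seed: int = 0) -> list[int]:
    """Draw n symbols from a truncated Zipf(s) law over `vocab` symbols:
    symbol k has weight proportional to 1 / (k + 1)^s."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** s for k in range(vocab)]
    return rng.choices(range(vocab), weights=weights, k=n)
```

Setting `s=0` recovers the uniform case, and replacing the weights with `p * (1 - p) ** k` would give the geometric case.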

## Results Analysis

### Performance Patterns

1. **Uniform Distributions**: EP performs poorly because there's no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis
- **Encoding**: O(M Γ— n) where M = alphabet size, n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol, where k is that symbol's count, with memoization
- **Decoding**: O(M Γ— k Γ— log n) where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized

#### Space Complexity
- **Memory Usage**: O(unique_binomial_lookups) for coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time, grows only as needed

#### Performance Characteristics
- **Small Datasets** (< 5000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5000-10000 symbols): 0.3s - 15s encoding time  
- **Large Datasets** (> 100000 symbols): May timeout (> 30s)
- **Performance vs Huffman**: ~259x slower on average

#### Timeout Mechanism
- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in results
- **When Timeouts Occur**: Very large sequences (> 100k symbols) with high vocabulary diversity
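A cooperative deadline check between per-symbol steps is one portable way to implement such a timeout; the sketch below is illustrative (names like `encode_with_deadline` are hypothetical, not this repo's mechanism):

```python
import time

class DeadlineExceeded(Exception):
    """Raised when encoding exceeds its time budget."""

def encode_with_deadline(items, step, timeout_s: float = 30.0) -> list:
    """Apply `step` to each item, checking the monotonic clock
    between steps and aborting once the deadline has passed."""
    deadline = time.monotonic() + timeout_s
    out = []
    for item in items:
        if time.monotonic() > deadline:
            raise DeadlineExceeded("enumerative encoding timed out")
        out.append(step(item))
    return out
```

Checking between symbols (rather than using signals) keeps the guard portable and lets the caller log a clean "TIMEOUT" result.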

The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.

### Performance Results

From the benchmark results comparing Huffman vs Enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

### Why Enumerative Underperforms

1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics like Huffman's variable-length codes
3. **Overhead**: Algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Although the optimized algorithm is polynomial, performance still degrades steeply with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
β”œβ”€β”€ enumerative_coding.py      # Core enumerative entropy coding implementation
β”œβ”€β”€ entropy_coding.py          # Legacy compatibility and Huffman implementation
β”œβ”€β”€ test_compression.py        # Main benchmark script with timing analysis
β”œβ”€β”€ test_paper_examples.py     # Paper example verification
β”œβ”€β”€ test_enumerative.py        # Basic functionality tests
β”œβ”€β”€ plot_results.py           # Comprehensive visualization and analysis
β”œβ”€β”€ quick_test.py             # Quick functionality test
β”œβ”€β”€ justfile                  # Command runner
β”œβ”€β”€ pyproject.toml           # Python dependencies
β”œβ”€β”€ CLAUDE.md               # Project-specific AI instructions
└── README.md              # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps:

```python
def encode(self, data: List[int]) -> bytes:
    bits = ""  # accumulated exp-Golomb bitstring

    # Step 1: Encode sequence length
    n = len(data)
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and the symbols themselves)
    sorted_symbols = sorted(set(data))
    K = len(sorted_symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 of them; the last
    # count is implied by n)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing,
    # removing each symbol from the remaining data after its pass
    remaining_data = data
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        remaining_data = [s for s in remaining_data if s != symbol]
```

### Key Optimizations

```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Uses arbitrary precision
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon Entropy**: H(X) = -Ξ£ p(x) log2 p(x) - theoretical minimum
- **Huffman**: Achieves H(X) ≀ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum β‰₯ H(X) + overhead (typically ~49% efficiency due to structural encoding)
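The Shannon bound used as the baseline for the efficiency figures is straightforward to compute from a symbol histogram (helper name here is illustrative):

```python
import math
from collections import Counter

def shannon_entropy_bits_per_symbol(data) -> float:
    """Empirical Shannon entropy H(X) = -sum p(x) log2 p(x),
    the theoretical minimum average code length in bits/symbol."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A coder's efficiency is then `H(X) / actual_bits_per_symbol`; e.g. four equally likely symbols give H(X) = 2 bits/symbol, so a coder spending 2.02 bits/symbol runs at ~99% efficiency.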

### When Enumerative Coding Works

1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern

### When Enumerative Struggles

1. **All practical applications**: 259x slower than Huffman with worse compression
2. **Large datasets**: Steep (though polynomial) scaling makes it computationally prohibitive
3. **Real-time systems**: Unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:

### 1. Just-In-Time (JIT) Compilation
- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**: 
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements
- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores

### 3. Specialized Data Structures
- **Sparse Binomial Tables**: Only compute coefficients actually needed
- **Compressed Position Indices**: More efficient representation for position lists
- **Fast Integer Arithmetic**: Specialized libraries for large integer operations

### 4. Memory Hierarchy Optimizations
- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading entirely into memory

### 5. Domain-Specific Optimizations
- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

This project is for educational and research purposes.