---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation of the entropy coding algorithm using equiprobable partitioning from Han et al. (2008), compared against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode alphabet size M** using exp-Golomb codes
2. **Encode symbol counts** N(s_1), N(s_2), ..., N(s_{M-1}) using exp-Golomb codes (the last count is implied by the sequence length)
3. **Encode sequence position** among all permutations with the same symbol counts using combinatorial enumeration

#### How It Works

- **Step 1**: Use exp-Golomb to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb to encode how many times each symbol appears  
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it's a form of **combinatorial compression** that leverages the mathematical structure of permutations.
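Steps 1 and 2 rely on exp-Golomb codes for the integer headers. As a minimal sketch (order-0 exp-Golomb, the standard construction; function names here are illustrative, not the repo's `ExpGolombCoder` API):

```python
def exp_golomb_encode(x: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer:
    (bitlen(x+1) - 1) zero bits, then the binary form of x+1."""
    b = bin(x + 1)[2:]
    return "0" * (len(b) - 1) + b

def exp_golomb_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Read one codeword starting at `pos`; return (value, next_pos)."""
    zeros = 0
    while bits[pos] == "0":
        zeros += 1
        pos += 1
    value = int(bits[pos:pos + zeros + 1], 2) - 1
    return value, pos + zeros + 1
```

For example, `exp_golomb_encode(0)` yields `"1"` and `exp_golomb_encode(4)` yields `"00101"`; small integers get short codewords, which is why the scheme suits header fields like alphabet sizes and counts.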

### Performance Optimizations

Key optimizations enable practical performance for datasets up to ~10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of linear search  
3. **Complement Encoding**: For frequent symbols (>50%), encode positions of other symbols instead
4. **Arbitrary Precision**: Avoids integer overflow for large combinatorial values

These optimizations achieve polynomial time complexity, making the algorithm practical for research and educational use.

### Why Enumerative Coding?

The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (permutation)
- Using optimal exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Achieving theoretical compression bounds for certain distributions
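The permutation-indexing idea can be sketched with the combinatorial number system: a set of k positions among n maps bijectively to an integer in [0, C(n, k)), so transmitting the rank (plus the counts) identifies the sequence exactly. This is a generic illustration, not the repo's `_rank` implementation:

```python
from math import comb

def rank_positions(positions: list[int]) -> int:
    """Colexicographic rank of a sorted position set: a bijection
    from k-subsets of {0, ..., n-1} onto [0, C(n, k))."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

def unrank_positions(rank: int, k: int) -> list[int]:
    """Invert rank_positions: recover the position set from its rank."""
    positions = []
    for i in range(k, 0, -1):
        # Find the largest p with comb(p, i) <= rank
        p = i - 1
        while comb(p + 1, i) <= rank:
            p += 1
        positions.append(p)
        rank -= comb(p, i)
    return sorted(positions)
```

Because the mapping is a bijection, the rank costs exactly ceil(log2 C(n, k)) bits in the best case, which is the information-theoretic cost of the permutation given the histogram.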

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync
```

## Usage

### Quick Start

```bash
# Run all available commands
just

# Run quick tests
just test

# Run paper examples  
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting functionality generates comprehensive analysis:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs Enumerative methods
2. **Compression Time Analysis**: Performance timing comparison between algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data  
4. **Efficiency Analysis**: How close each method gets to theoretical limits
5. **Enumerative Timeout Analysis**: Computational complexity limitations and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and Enumerative coding across different datasets.

### Compression Time Analysis  
![Compression Time Comparison](plots/compression_time_comparison.png)

Performance timing analysis showing encoding times, speed ratios, and scalability characteristics. Huffman coding is consistently 100-1000x faster.

### Distribution Analysis
![Distribution Comparison](plots/distribution_comparison.png)

Performance breakdown by data distribution type (Uniform, Zipf, Geometric, English Text) showing compression ratios and efficiency metrics.

### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding performance showing computation times, timeout patterns, and scaling limitations by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets  
- **Large** (100K symbols): Performance at scale

### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities

### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face
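For the synthetic distributions, a small stdlib-only sampler suffices; this is an illustrative sketch (the benchmark's actual generators may differ in parameterization):

```python
import random

def zipf_sample(n: int, vocab: int, s: float = 1.0, seed: int = 0) -> list[int]:
    """Draw n symbols from a truncated Zipf(s) law over `vocab` symbols:
    symbol k has weight proportional to 1 / (k + 1)^s."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** s for k in range(vocab)]
    return rng.choices(range(vocab), weights=weights, k=n)
```

Setting `s=0` recovers the uniform case, and replacing the weights with `p * (1 - p) ** k` would give the geometric case.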

## Results Analysis

### Performance Patterns

1. **Uniform Distributions**: EP performs poorly because there's no probability imbalance to exploit
2. **Skewed Distributions**: EP performs better but still trails Huffman
3. **Large Vocabularies**: EP overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative entropy coding implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis
- **Encoding**: O(M Γ— n) where M = alphabet size, n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol, where k is that symbol's count, with memoization
- **Decoding**: O(M Γ— k Γ— log n) where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized

#### Space Complexity
- **Memory Usage**: O(unique_binomial_lookups) for coefficient cache
- **Typical Cache Size**: < 1000 entries for most realistic datasets
- **No Upfront Cost**: Zero initialization time, grows only as needed

#### Performance Characteristics
- **Small Datasets** (< 5000 symbols): 0.045s - 1.7s encoding time
- **Medium Datasets** (5000-10000 symbols): 0.3s - 15s encoding time  
- **Large Datasets** (> 100000 symbols): May timeout (> 30s)
- **Performance vs Huffman**: ~259x slower on average

#### Timeout Mechanism
- **Timeout Duration**: 30 seconds by default for enumerative coding
- **Graceful Handling**: Timeouts are logged and marked as "TIMEOUT" in results
- **When Timeouts Occur**: Very large sequences (> 100k symbols) with high vocabulary diversity
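A cooperative deadline check between per-symbol steps is one portable way to implement such a timeout; the sketch below is illustrative (names like `encode_with_deadline` are hypothetical, not this repo's mechanism):

```python
import time

class DeadlineExceeded(Exception):
    """Raised when encoding exceeds its time budget."""

def encode_with_deadline(items, step, timeout_s: float = 30.0) -> list:
    """Apply `step` to each item, checking the monotonic clock
    between steps and aborting once the deadline has passed."""
    deadline = time.monotonic() + timeout_s
    out = []
    for item in items:
        if time.monotonic() > deadline:
            raise DeadlineExceeded("enumerative encoding timed out")
        out.append(step(item))
    return out
```

Checking between symbols (rather than using signals) keeps the guard portable and lets the caller log a clean "TIMEOUT" result.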

The optimizations successfully transform the algorithm from exponential (naive multinomial) to polynomial complexity, making it practical for realistic data sizes.

### Performance Results

From the benchmark results comparing Huffman vs Enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

### Why Enumerative Underperforms

1. **Computational Complexity**: Combinatorial calculations become expensive for large datasets
2. **Fixed Algorithm Structure**: Cannot adapt to data characteristics like Huffman's variable-length codes
3. **Overhead**: Algorithm encodes structure information (alphabet, counts, positions) separately
4. **Scaling Issues**: Although the optimized algorithm is polynomial, performance still degrades steeply with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
β”œβ”€β”€ enumerative_coding.py      # Core enumerative entropy coding implementation
β”œβ”€β”€ entropy_coding.py          # Legacy compatibility and Huffman implementation
β”œβ”€β”€ test_compression.py        # Main benchmark script with timing analysis
β”œβ”€β”€ test_paper_examples.py     # Paper example verification
β”œβ”€β”€ test_enumerative.py        # Basic functionality tests
β”œβ”€β”€ plot_results.py           # Comprehensive visualization and analysis
β”œβ”€β”€ quick_test.py             # Quick functionality test
β”œβ”€β”€ justfile                  # Command runner
β”œβ”€β”€ pyproject.toml           # Python dependencies
β”œβ”€β”€ CLAUDE.md               # Project-specific AI instructions
└── README.md              # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps:

```python
def encode(self, data: List[int]) -> bytes:
    bits = ""  # accumulated exp-Golomb bitstring

    # Step 1: Encode sequence length
    n = len(data)
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and the symbols themselves)
    sorted_symbols = sorted(set(data))
    K = len(sorted_symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K - 1 of them; the last
    # count is implied by n)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing,
    # removing each symbol from the remaining data after its pass
    remaining_data = data
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
        remaining_data = [s for s in remaining_data if s != symbol]
```

### Key Optimizations

```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # Uses arbitrary precision
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon Entropy**: H(X) = -Ξ£ p(x) log2 p(x) - theoretical minimum
- **Huffman**: Achieves H(X) ≀ L_Huffman < H(X) + 1 (typically ~99% efficiency)
- **Enumerative**: L_Enum β‰₯ H(X) + overhead (typically ~49% efficiency due to structural encoding)
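The Shannon bound used as the baseline for the efficiency figures is straightforward to compute from a symbol histogram (helper name here is illustrative):

```python
import math
from collections import Counter

def shannon_entropy_bits_per_symbol(data) -> float:
    """Empirical Shannon entropy H(X) = -sum p(x) log2 p(x),
    the theoretical minimum average code length in bits/symbol."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A coder's efficiency is then `H(X) / actual_bits_per_symbol`; e.g. four equally likely symbols give H(X) = 2 bits/symbol, so a coder spending 2.02 bits/symbol runs at ~99% efficiency.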

### When Enumerative Coding Works

1. **Research/theoretical applications**: When exact mathematical properties are needed
2. **Educational purposes**: Understanding combinatorial compression principles
3. **Small datasets**: Where computational cost is not a concern

### When Enumerative Struggles

1. **All practical applications**: 259x slower than Huffman with worse compression
2. **Large datasets**: Steep (though polynomial) scaling makes it computationally prohibitive
3. **Real-time systems**: Unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several optimization strategies could further improve performance:

### 1. Just-In-Time (JIT) Compilation
- **Target**: Critical loops in combinatorial indexing and position reconstruction
- **Options**: 
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with JIT)
- **Expected Benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements
- **Incremental Encoding**: Reuse computations when processing similar sequences
- **Approximate Methods**: Trade slight accuracy for major performance gains on very large datasets
- **Parallel Processing**: Distribute symbol processing across multiple cores

### 3. Specialized Data Structures
- **Sparse Binomial Tables**: Only compute coefficients actually needed
- **Compressed Position Indices**: More efficient representation for position lists
- **Fast Integer Arithmetic**: Specialized libraries for large integer operations

### 4. Memory Hierarchy Optimizations
- **Cache-Friendly Algorithms**: Reorganize computations to minimize cache misses
- **Memory Pooling**: Reduce allocation overhead for temporary arrays
- **Streaming Encoding**: Process very large datasets without loading entirely into memory

### 5. Domain-Specific Optimizations
- **Text-Specific**: Leverage byte patterns and common character frequencies
- **Statistical Precomputation**: Pre-build tables for common distributions (Zipf, geometric)
- **Adaptive Thresholds**: Dynamically adjust complement encoding and timeout parameters

The current implementation provides a solid foundation for exploring these advanced optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A method for the construction of minimum-redundancy codes"

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

This project is for educational and research purposes.