osteele committed · Commit 24c19d8 · unverified · 0 parents

Initial commit
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

# Build outputs
compression_results.json

# Keep plots but ignore temp plots
test_plots/
README.md ADDED
---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation and comparison of the entropy coding algorithm with equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode the alphabet size M** using exp-Golomb codes
2. **Encode the symbol counts** N(s₁), N(s₂), ..., N(s_{M-1}) using exp-Golomb codes (the last count is implied by the sequence length)
3. **Encode the sequence's position** among all permutations with the same symbol counts using combinatorial enumeration

#### How It Works

- **Step 1**: Use exp-Golomb coding to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb coding to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it is a form of **combinatorial compression** that exploits the mathematical structure of permutations.
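Steps 1 and 2 rely on order-0 exponential-Golomb codes. A minimal self-contained sketch of that integer code (function names here are illustrative; the repository's version lives in `ExpGolombCoder`):

```python
def expgolomb_encode(n: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer."""
    binary = bin(n + 1)[2:]                  # binary form of n + 1
    return "0" * (len(binary) - 1) + binary  # prefix with len-1 zeros

def expgolomb_decode(bits: str, pos: int = 0) -> tuple:
    """Return (value, next_position) decoded from a bit string."""
    zeros = 0
    while bits[pos] == "0":                  # count leading zeros
        zeros += 1
        pos += 1
    value = int(bits[pos:pos + zeros + 1], 2) - 1
    return value, pos + zeros + 1
```

Because each codeword is self-delimiting (the leading zeros announce its length), the header fields can simply be concatenated and decoded back in order.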

### Performance Optimizations

Key optimizations make the implementation practical for datasets up to roughly 10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of a linear search
3. **Complement Encoding**: For symbols occupying more than half of the remaining positions, encode the positions of the other symbols instead
4. **Arbitrary Precision**: Python integers avoid overflow for large combinatorial values

These optimizations bring the algorithm to polynomial time complexity, making it practical for research and educational use.
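Optimization 1 can be sketched with a memoized wrapper around `math.comb` (a simplified stand-in for the repository's `OptimizedBinomialTable`):

```python
from functools import lru_cache
import math

@lru_cache(maxsize=None)
def binom(n: int, k: int) -> int:
    # Python ints are arbitrary precision, so large C(n, k) never overflows
    return math.comb(n, k)
```

Repeated lookups during ranking and unranking then hit the cache in O(1) amortized time, and the cache only grows with the distinct (n, k) pairs actually requested.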

### Why Enumerative Coding?

The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (the specific permutation)
- Using exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Approaching theoretical compression bounds for certain distributions
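The "exact permutation indexing" of Step 3 reduces to ranking a k-subset of positions in the combinatorial number system. A minimal sketch of that formula (the same sum the repository's `_rank` computes):

```python
import math

def rank(positions: list) -> int:
    """Rank of a sorted k-subset of {0, ..., n-1} in the combinatorial number system."""
    return sum(math.comb(p, i + 1) for i, p in enumerate(positions))
```

The ranks of all C(n, k) subsets form a bijection onto 0 .. C(n, k) - 1, so the index needs only about log₂ C(n, k) bits — which is exactly where the compression comes from.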

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync
```

## Usage

### Quick Start

```bash
# List all available commands
just

# Run quick tests
just test

# Run paper examples
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting script generates:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs. enumerative methods
2. **Compression Time Analysis**: Timing comparison between the algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to the theoretical limit
5. **Enumerative Timeout Analysis**: Computational limits and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and enumerative coding across datasets.

### Compression Time Analysis
![Compression Time Comparison](plots/compression_time_comparison.png)

Timing analysis showing encoding times, speed ratios, and scalability. Huffman coding is consistently 100-1000x faster.

### Distribution Analysis
![Distribution Comparison](plots/distribution_comparison.png)

Performance breakdown by distribution type (uniform, Zipf, geometric, English text) with compression ratios and efficiency metrics.

### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding computation times, timeout patterns, and scaling limits by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale

### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities

### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face

## Results Analysis

### Performance Patterns

1. **Uniform distributions**: EP performs poorly because there is no probability imbalance to exploit
2. **Skewed distributions**: EP performs better but still trails Huffman
3. **Large vocabularies**: EP's header overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis
- **Encoding**: O(M × n), where M = alphabet size and n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n), where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
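Decoding inverts the ranking step: given an index, recover the position set. A self-contained sketch of unranking, shown with a linear scan for clarity (the repository's `_unrank` applies binary search to the same inequality, giving the O(log n) per-position cost noted above):

```python
import math

def unrank(index: int, k: int) -> list:
    """Recover the sorted k-subset whose combinatorial-number-system rank is `index`."""
    positions = []
    for i in range(k, 0, -1):
        # find the largest p with C(p, i) <= remaining index
        p = i - 1
        while math.comb(p + 1, i) <= index:
            p += 1
        index -= math.comb(p, i)
        positions.append(p)
    return positions[::-1]  # built largest-first, return ascending
```

Because each step subtracts the chosen C(p, i), unranking exactly reverses the rank sum, so encode and decode agree on the position set.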

#### Space Complexity
- **Memory usage**: O(number of distinct binomial lookups) for the coefficient cache
- **Typical cache size**: under 1,000 entries for most realistic datasets
- **No upfront cost**: zero initialization time; the cache grows only as needed

#### Performance Characteristics
- **Small datasets** (< 5,000 symbols): 0.045 s - 1.7 s encoding time
- **Medium datasets** (5,000-10,000 symbols): 0.3 s - 15 s encoding time
- **Large datasets** (> 100,000 symbols): may time out (> 30 s)
- **Versus Huffman**: ~259x slower on average

#### Timeout Mechanism
- **Timeout duration**: 30 seconds by default for enumerative coding
- **Graceful handling**: timeouts are logged and marked as "TIMEOUT" in results
- **When timeouts occur**: very large sequences (> 100K symbols) with high vocabulary diversity

The optimizations transform the algorithm from exponential (naive multinomial enumeration) to polynomial complexity, making it practical for realistic data sizes.

### Performance Results

From the benchmark comparing Huffman vs. enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

### Why Enumerative Underperforms

1. **Computational cost**: Combinatorial calculations become expensive for large datasets
2. **Fixed algorithm structure**: Cannot adapt to data characteristics the way Huffman's variable-length codes do
3. **Header overhead**: The algorithm encodes structural information (alphabet, counts, position indices) separately
4. **Scaling**: Although polynomial, runtime grows steeply with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
├── enumerative_coding.py     # Core enumerative entropy coding implementation
├── entropy_coding.py         # Legacy compatibility and Huffman implementation
├── test_compression.py       # Main benchmark script with timing analysis
├── test_paper_examples.py    # Paper example verification
├── test_enumerative.py       # Basic functionality tests
├── plot_results.py           # Visualization and analysis
├── quick_test.py             # Quick functionality test
├── justfile                  # Command runner
├── pyproject.toml            # Python dependencies
├── CLAUDE.md                 # Project-specific AI instructions
└── README.md                 # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps (abridged from `enumerative_coding.py`):

```python
def encode(self, data: List[int]) -> bytes:
    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K-1; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
```

### Key Optimizations

```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of the OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # arbitrary precision
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon entropy**: H(X) = -Σ p(x) log₂ p(x), the theoretical minimum bits per symbol
- **Huffman**: achieves H(X) ≤ L_Huffman < H(X) + 1 expected bits per symbol (typically ~99% efficiency here)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency in this benchmark, due to the structural header)
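The Shannon bound above is easy to check numerically. A small self-contained sketch (independent of the repository's `calculate_entropy` helper, which returns bytes rather than bits per symbol):

```python
from collections import Counter
import math

def entropy_bits_per_symbol(data) -> float:
    """Shannon entropy H(X) in bits per symbol, from empirical frequencies."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```

For example, `"aabb"` has H(X) = 1 bit per symbol, so its 4 symbols need at least 4 bits before any header overhead — the efficiency columns in the table above compare each coder's actual output against exactly this bound.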

### When Enumerative Coding Works

1. **Research/theoretical applications**: when exact mathematical properties are needed
2. **Educational purposes**: understanding combinatorial compression principles
3. **Small datasets**: where computational cost is not a concern

### When Enumerative Struggles

1. **Most practical applications**: ~259x slower than Huffman with worse compression
2. **Large datasets**: steep scaling makes it computationally prohibitive
3. **Real-time systems**: unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several strategies could improve it further:

### 1. Just-In-Time (JIT) Compilation
- **Target**: critical loops in combinatorial indexing and position reconstruction
- **Options**:
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with a JIT)
- **Expected benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements
- **Incremental encoding**: reuse computations when processing similar sequences
- **Approximate methods**: trade slight accuracy for large performance gains on very large datasets
- **Parallel processing**: distribute symbol processing across multiple cores

### 3. Specialized Data Structures
- **Sparse binomial tables**: only compute the coefficients actually needed
- **Compressed position indices**: more efficient representation of position lists
- **Fast integer arithmetic**: specialized libraries for large-integer operations

### 4. Memory Hierarchy Optimizations
- **Cache-friendly algorithms**: reorganize computations to minimize cache misses
- **Memory pooling**: reduce allocation overhead for temporary arrays
- **Streaming encoding**: process very large datasets without loading them entirely into memory

### 5. Domain-Specific Optimizations
- **Text-specific**: leverage byte patterns and common character frequencies
- **Statistical precomputation**: pre-build tables for common distributions (Zipf, geometric)
- **Adaptive thresholds**: dynamically adjust complement-encoding and timeout parameters

The current implementation provides a solid foundation for exploring these optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes"

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

This project is for educational and research purposes.
entropy_coding.py ADDED
#!/usr/bin/env python3
"""
Implementation of enumerative entropy coding as described in Han et al. (2008).
This is the actual "equiprobable partitioning" algorithm from the paper.
"""

# Import the actual implementation
from enumerative_coding import EnumerativeEncoder, ExpGolombCoder

from typing import Tuple, List, Dict
from collections import Counter
import heapq
from scipy.stats import entropy as scipy_entropy


# For backward compatibility with existing code that expects this class name
EquiprobablePartitioningEncoder = EnumerativeEncoder


class HuffmanEncoder:
    """Standard Huffman coding for comparison."""

    class Node:
        def __init__(self, symbol=None, freq=0, left=None, right=None):
            self.symbol = symbol
            self.freq = freq
            self.left = left
            self.right = right

        def __lt__(self, other):
            return self.freq < other.freq

    def __init__(self):
        self.codes = {}
        self.root = None

    def _build_tree(self, frequencies: Dict[int, int]):
        """Build the Huffman tree from symbol frequencies."""
        heap = []

        for symbol, freq in frequencies.items():
            heapq.heappush(heap, self.Node(symbol=symbol, freq=freq))

        while len(heap) > 1:
            left = heapq.heappop(heap)
            right = heapq.heappop(heap)
            parent = self.Node(freq=left.freq + right.freq, left=left, right=right)
            heapq.heappush(heap, parent)

        self.root = heap[0]

    def _generate_codes(self, node, code=''):
        """Generate Huffman codes by traversing the tree."""
        if node.symbol is not None:
            self.codes[node.symbol] = code if code else '0'
            return

        if node.left:
            self._generate_codes(node.left, code + '0')
        if node.right:
            self._generate_codes(node.right, code + '1')

    def encode(self, data: List[int]) -> Tuple[bytes, Dict]:
        """Encode data using Huffman coding."""
        frequencies = Counter(data)

        if len(frequencies) == 1:
            # Special case: only one symbol
            symbol = list(frequencies.keys())[0]
            self.codes = {symbol: '0'}
        else:
            self._build_tree(frequencies)
            self._generate_codes(self.root)

        # Encode data
        encoded_bits = ''.join(self.codes[symbol] for symbol in data)

        # Pad to a byte boundary
        padding = (8 - len(encoded_bits) % 8) % 8
        encoded_bits += '0' * padding

        encoded_bytes = bytes(int(encoded_bits[i:i+8], 2) for i in range(0, len(encoded_bits), 8))

        metadata = {
            'codes': self.codes,
            'padding': padding,
            'original_length': len(data)
        }

        return encoded_bytes, metadata

    def decode(self, encoded_bytes: bytes, metadata: Dict) -> List[int]:
        """Decode Huffman-encoded data."""
        # Create the reverse mapping
        reverse_codes = {code: symbol for symbol, code in metadata['codes'].items()}

        # Convert bytes to a bit string
        bit_string = ''.join(format(byte, '08b') for byte in encoded_bytes)

        # Remove padding
        if metadata['padding'] > 0:
            bit_string = bit_string[:-metadata['padding']]

        decoded = []
        current_code = ''

        for bit in bit_string:
            current_code += bit
            if current_code in reverse_codes:
                decoded.append(reverse_codes[current_code])
                current_code = ''
                if len(decoded) >= metadata['original_length']:
                    break

        return decoded


def calculate_entropy(data: List[int]) -> float:
    """Shannon-entropy lower bound on the compressed size, in bytes."""
    probabilities = list(Counter(data).values())
    return scipy_entropy(probabilities, base=2) * len(data) / 8  # bits -> bytes


def theoretical_minimum_size(data: List[int]) -> float:
    """Calculate the theoretical minimum compressed size in bytes."""
    return calculate_entropy(data)
enumerative_coding.py ADDED
#!/usr/bin/env python3
"""
Correct implementation of enumerative entropy coding as described in Han et al. (2008).
This version is fully self-contained, embedding all necessary data into the stream.
"""

from typing import List, Tuple
from collections import Counter
import math


class ExpGolombCoder:
    """Exponential-Golomb coding for non-negative integers."""

    @staticmethod
    def encode(n: int) -> str:
        """Encode a non-negative integer n >= 0."""
        if n < 0:
            raise ValueError("Exp-Golomb is for non-negative integers.")
        n_plus_1 = n + 1
        binary = bin(n_plus_1)[2:]
        leading_zeros = '0' * (len(binary) - 1)
        return leading_zeros + binary

    @staticmethod
    def decode(bits: str, start_pos: int = 0) -> Tuple[int, int]:
        """Decode an exp-Golomb integer from a bit string."""
        pos = start_pos
        leading_zeros = 0
        while pos < len(bits) and bits[pos] == '0':
            leading_zeros += 1
            pos += 1

        if pos >= len(bits):
            raise ValueError("Incomplete exp-Golomb code: no '1' bit found.")

        num_bits_to_read = leading_zeros + 1
        if pos + num_bits_to_read > len(bits):
            raise ValueError("Incomplete exp-Golomb code: not enough bits for value.")

        code_bits = bits[pos:pos + num_bits_to_read]
        value = int(code_bits, 2) - 1
        return value, pos + num_bits_to_read


class OptimizedBinomialTable:
    """
    Computes and caches binomial coefficients C(n, k) using Python's arbitrary
    precision integers to prevent overflow.
    """

    def __init__(self):
        self._cache = {}

    def get(self, n: int, k: int) -> int:
        if k < 0 or k > n:
            return 0
        if k == 0 or k == n:
            return 1
        if k > n // 2:
            k = n - k  # C(n, k) == C(n, n - k)

        key = (n, k)
        if key in self._cache:
            return self._cache[key]

        result = math.comb(n, k)
        self._cache[key] = result
        return result

    def __getitem__(self, n: int):
        return BinomialRow(self, n)


class BinomialRow:
    """Helper class to support table[n][k] syntax."""

    def __init__(self, table: OptimizedBinomialTable, n: int):
        self.table = table
        self.n = n

    def __getitem__(self, k: int) -> int:
        return self.table.get(self.n, k)


class EnumerativeEncoder:
    """
    An enumerative entropy coder aligned with the algorithm described in
    "Entropy Coding Using Equiprobable Partitioning" by Han et al. (2008).

    This implementation is self-contained, writing all necessary information
    (length, alphabet, counts, and positions) into the output stream.
    """

    def __init__(self):
        self.binom_table = OptimizedBinomialTable()

    def _rank(self, n: int, k: int, positions: List[int]) -> int:
        """Calculate the standard lexicographical rank of a combination."""
        index = 0
        for i, pos in enumerate(positions):
            index += self.binom_table.get(pos, i + 1)
        return index

    def _unrank(self, n: int, k: int, index: int) -> List[int]:
        """Convert a standard lexicographical rank back to a combination."""
        positions = []
        v_high = n - 1
        for i in range(k - 1, -1, -1):
            v_low = i
            # Binary search for the largest position p_i with C(p_i, i+1) <= index
            while v_low < v_high:
                mid = (v_low + v_high + 1) // 2
                if self.binom_table.get(mid, i + 1) <= index:
                    v_low = mid
                else:
                    v_high = mid - 1

            p_i = v_low
            positions.append(p_i)
            index -= self.binom_table.get(p_i, i + 1)
            v_high = p_i - 1

        positions.reverse()  # Stored descending, so reverse to ascending
        return positions

    def encode(self, data: List[int]) -> bytes:
        if not data:
            return bytes()

        n = len(data)
        symbol_counts = Counter(data)

        # Optimization: encode symbols from least frequent to most frequent
        sorted_symbols = sorted(symbol_counts.keys(), key=lambda s: symbol_counts[s])
        K = len(sorted_symbols)

        bits = ""
        # Step 1: Encode sequence length n
        bits += ExpGolombCoder.encode(n)

        # Step 2: Encode header - alphabet size (K) and the alphabet itself
        bits += ExpGolombCoder.encode(K)
        for symbol in sorted_symbols:
            bits += ExpGolombCoder.encode(symbol)

        # Step 3: Encode K-1 symbol frequencies (the last count is implied)
        for i in range(K - 1):
            bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

        # Step 4: Encode symbol locations sequentially
        available_indices = list(range(n))

        for i in range(K - 1):
            symbol = sorted_symbols[i]
            k = symbol_counts[symbol]
            if k == 0:
                continue

            current_n = len(available_indices)

            # Find the positions of the current symbol within the available slots
            symbol_positions_in_available = [
                j for j, original_idx in enumerate(available_indices) if data[original_idx] == symbol
            ]

            # Optimization: use the complement method for frequent symbols
            use_complement = k > current_n / 2
            bits += '1' if use_complement else '0'

            if use_complement:
                complement_k = current_n - k
                complement_positions = [j for j in range(current_n) if j not in symbol_positions_in_available]
                index = self._rank(current_n, complement_k, complement_positions)
            else:
                index = self._rank(current_n, k, symbol_positions_in_available)

            bits += ExpGolombCoder.encode(index)

            # Update available indices for the next symbol
            used_indices = {available_indices[j] for j in symbol_positions_in_available}
            available_indices = [idx for idx in available_indices if idx not in used_indices]

        # Convert the bit string to bytes with padding
        padding = (8 - len(bits) % 8) % 8
        bits += '0' * padding
        encoded_bytes = bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

        return encoded_bytes

    def decode(self, encoded_bytes: bytes) -> List[int]:
        if not encoded_bytes:
            return []

        # Convert bytes to a bit string
        bits = ''.join(format(byte, '08b') for byte in encoded_bytes)
        pos = 0

        # Step 1: Decode sequence length n
        n, pos = ExpGolombCoder.decode(bits, pos)

        # Step 2: Decode header - alphabet size (K) and the alphabet itself
        K, pos = ExpGolombCoder.decode(bits, pos)
        sorted_symbols = []
        for _ in range(K):
            symbol, pos = ExpGolombCoder.decode(bits, pos)
            sorted_symbols.append(symbol)

        # Step 3: Decode K-1 symbol frequencies
        counts = {}
        decoded_count_sum = 0
        for i in range(K - 1):
            symbol = sorted_symbols[i]
            count, pos = ExpGolombCoder.decode(bits, pos)
            counts[symbol] = count
            decoded_count_sum += count

        # The last symbol's count is implied
        last_symbol = sorted_symbols[-1]
        counts[last_symbol] = n - decoded_count_sum

        # Step 4: Decode symbol locations sequentially
        result = [None] * n
        available_indices = list(range(n))

        for i in range(K - 1):
            symbol = sorted_symbols[i]
            k = counts[symbol]
            if k == 0:
                continue

            current_n = len(available_indices)

            # Read the complement flag
            use_complement = (bits[pos] == '1')
            pos += 1

            index, pos = ExpGolombCoder.decode(bits, pos)

            if use_complement:
                complement_k = current_n - k
                complement_positions = self._unrank(current_n, complement_k, index)
                positions_in_available = [j for j in range(current_n) if j not in complement_positions]
            else:
                positions_in_available = self._unrank(current_n, k, index)

            # Map positions from the available list back to the original sequence
            used_indices = set()
            for rel_pos in positions_in_available:
                abs_pos = available_indices[rel_pos]
                result[abs_pos] = symbol
                used_indices.add(abs_pos)

            # Update available indices
            available_indices = [idx for idx in available_indices if idx not in used_indices]

        # The last symbol fills all remaining positions
        for i in range(n):
            if result[i] is None:
                result[i] = last_symbol

        return result
justfile ADDED
# List available commands
default:
    @just --list

# Run the compression tests
run:
    uv run python test_compression.py

# Run compression tests and generate plots
analyze:
    uv run python test_compression.py
    uv run python plot_results.py

# Install dependencies
setup:
    uv sync

# Run with smaller test sizes for quick testing
test:
    uv run python quick_test.py
    uv run python test_paper_examples.py

# Run tests from the paper
test-paper:
    uv run python test_paper_examples.py

# Generate comparison plots
plot:
    uv run python plot_results.py

# Clean up generated files
clean:
    rm -f compression_results.json

# Run python directly with uv
python *args:
    uv run python {{args}}

# Check code quality
check:
    uv run ruff check .
    uv run pyright .

# Format code
format:
    uv run ruff format .

# Fix linting issues
fix:
    uv run ruff check --fix .
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from entropy-coding-equiprobable!")
+
+
+ if __name__ == "__main__":
+     main()
plot_results.py ADDED
@@ -0,0 +1,866 @@
+ #!/usr/bin/env python3
+ """
+ Generate plots comparing compression methods.
+ """
+
+ import json
+ from pathlib import Path
+
+ import matplotlib
+
+ # Use non-interactive backend to avoid opening windows
+ # (select the backend before pyplot is imported)
+ matplotlib.use('Agg')
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import pandas as pd
+ import seaborn as sns
+
+ # Set style
+ plt.style.use('default')
+ sns.set_palette("husl")
+
+
+ def load_results(filename='compression_results.json'):
+     """Load compression results from JSON file."""
+     try:
+         with open(filename, 'r') as f:
+             return json.load(f)
+     except FileNotFoundError:
+         print(f"Results file {filename} not found. Run 'just run' first to generate results.")
+         return None
+
+
+ def create_comparison_dataframe(results):
+     """Convert results to pandas DataFrame for easier plotting."""
+     rows = []
+
+     for result in results:
+         name = result['name']
+         original_size = result['original_size']
+         theoretical_min = result['theoretical_minimum']
+         vocab_size = result['vocabulary_size']
+
+         # Extract method results
+         methods = result['methods']
+
+         # Determine dataset type for analysis
+         if 'uniform' in name:
+             dataset_type = 'Uniform'
+         elif 'zipf' in name:
+             dataset_type = 'Zipf'
+         elif 'geometric' in name:
+             dataset_type = 'Geometric'
+         elif 'english' in name:
+             dataset_type = 'English Text'
+         else:
+             dataset_type = 'Other'
+
+         row_base = {
+             'dataset': name,
+             'original_size': original_size,
+             'theoretical_minimum': theoretical_min,
+             'vocabulary_size': vocab_size,
+             'dataset_type': dataset_type
+         }
+
+         # Add each method as a separate row
+         for method_name, method_data in methods.items():
+             row = row_base.copy()
+             row['method'] = method_name
+
+             # Parse method details
+             if method_name.startswith('equiprobable_k'):
+                 row['method_type'] = 'Equiprobable'
+                 row['k_value'] = int(method_name.split('k')[1])
+             elif method_name == 'enumerative':
+                 row['method_type'] = 'Enumerative'
+                 row['k_value'] = None
+             elif method_name == 'huffman':
+                 row['method_type'] = 'Huffman'
+                 row['k_value'] = None
+
+             if method_data is None or method_data.get('compressed_size') is None:
+                 # Handle timeout/failure cases
+                 row['compressed_size'] = None
+                 row['compression_ratio'] = None
+                 row['bits_per_symbol'] = None
+                 row['correct'] = False
+                 row['encoding_time'] = method_data.get('encoding_time', 0) if method_data else 0
+                 row['status'] = 'timeout' if method_data and method_data.get('timed_out') else 'failed'
+             else:
+                 # Handle successful cases
+                 row['compressed_size'] = method_data['compressed_size']
+                 row['compression_ratio'] = method_data['compression_ratio']
+                 row['bits_per_symbol'] = method_data['bits_per_symbol']
+                 row['correct'] = method_data['correct']
+                 row['encoding_time'] = method_data.get('encoding_time', 0)
+                 row['status'] = 'success'
+
+             rows.append(row)
+
+         # Add theoretical minimum as a reference
+         row = row_base.copy()
+         row['method'] = 'theoretical'
+         row['method_type'] = 'Theoretical'
+         row['compressed_size'] = theoretical_min
+         row['compression_ratio'] = original_size / theoretical_min
+         row['bits_per_symbol'] = theoretical_min * 8 / original_size
+         row['correct'] = True
+         row['k_value'] = None
+         row['status'] = 'success'
+         rows.append(row)
+
+     return pd.DataFrame(rows)
+
+
+ def plot_compression_ratios(df, save_path='plots'):
+     """Plot compression ratios for different methods."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Create figure with subplots
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Compression Performance Comparison', fontsize=16, fontweight='bold')
+
+     # 1. Compression ratio by dataset (including partial data)
+     ax1 = axes[0, 0]
+
+     # Get all datasets
+     datasets = sorted(df['dataset'].unique())
+
+     # Pivot for easier plotting, but include all data (success and timeout)
+     pivot_data = df.pivot(index='dataset', columns='method', values='compression_ratio')
+
+     # Select key methods for cleaner plot, now including enumerative
+     key_methods = ['theoretical', 'huffman', 'enumerative']
+     available_methods = [col for col in key_methods if col in pivot_data.columns]
+     pivot_subset = pivot_data[available_methods]
+
+     # Plot with special handling for missing values
+     pivot_subset.plot(kind='bar', ax=ax1, width=0.8)
+     ax1.set_title('Compression Ratio by Dataset')
+     ax1.set_xlabel('Dataset')
+     ax1.set_ylabel('Compression Ratio')
+     ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
+     ax1.tick_params(axis='x', rotation=45)
+
+     # Add text annotations for timeouts
+     for i, dataset in enumerate(datasets):
+         enum_data = df[(df['dataset'] == dataset) & (df['method'] == 'enumerative')]
+         if not enum_data.empty and enum_data.iloc[0]['status'] == 'timeout':
+             ax1.text(i, ax1.get_ylim()[1] * 0.9, 'TIMEOUT',
+                      ha='center', va='center', fontsize=8,
+                      bbox=dict(boxstyle="round,pad=0.3", facecolor="red", alpha=0.7))
+
+     # 2. Bits per symbol
+     ax2 = axes[0, 1]
+     pivot_bits = df.pivot(index='dataset', columns='method', values='bits_per_symbol')
+     pivot_bits_subset = pivot_bits[available_methods]
+
+     pivot_bits_subset.plot(kind='bar', ax=ax2, width=0.8)
+     ax2.set_title('Bits per Symbol by Dataset')
+     ax2.set_xlabel('Dataset')
+     ax2.set_ylabel('Bits per Symbol')
+     ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
+     ax2.tick_params(axis='x', rotation=45)
+
+     # Add timeout annotations for bits per symbol plot too
+     for i, dataset in enumerate(datasets):
+         enum_data = df[(df['dataset'] == dataset) & (df['method'] == 'enumerative')]
+         if not enum_data.empty and enum_data.iloc[0]['status'] == 'timeout':
+             ax2.text(i, ax2.get_ylim()[1] * 0.9, 'TIMEOUT',
+                      ha='center', va='center', fontsize=8,
+                      bbox=dict(boxstyle="round,pad=0.3", facecolor="red", alpha=0.7))
+
+     # 3. Enumerative encoding time by dataset size
+     ax3 = axes[1, 0]
+     enum_data = df[df['method'] == 'enumerative'].copy()
+
+     if not enum_data.empty:
+         # Create scatter plot of dataset size vs encoding time
+         successful_enum = enum_data[enum_data['status'] == 'success']
+         timeout_enum = enum_data[enum_data['status'] == 'timeout']
+
+         # Plot successful encodings
+         if not successful_enum.empty:
+             ax3.scatter(successful_enum['original_size'], successful_enum['encoding_time'],
+                         c='green', marker='o', s=60, alpha=0.7, label='Successful')
+
+         # Plot timeouts (use timeout duration)
+         if not timeout_enum.empty:
+             ax3.scatter(timeout_enum['original_size'], timeout_enum['encoding_time'],
+                         c='red', marker='X', s=80, alpha=0.9, label='Timeout')
+
+         ax3.set_title('Enumerative Encoding Time vs Dataset Size')
+         ax3.set_xlabel('Dataset Size (symbols)')
+         ax3.set_ylabel('Encoding Time (seconds)')
+         ax3.set_xscale('log')
+         ax3.set_yscale('log')
+         ax3.grid(True, alpha=0.3)
+         ax3.legend()
+
+         # Add trend line for successful cases
+         if len(successful_enum) > 1:
+             x_vals = successful_enum['original_size'].values
+             y_vals = successful_enum['encoding_time'].values
+             z = np.polyfit(np.log10(x_vals), np.log10(y_vals), 1)
+             p = np.poly1d(z)
+             x_trend = np.logspace(np.log10(min(x_vals)), np.log10(max(x_vals)), 100)
+             y_trend = 10 ** p(np.log10(x_trend))
+             ax3.plot(x_trend, y_trend, 'b--', alpha=0.5, linewidth=1,
+                      label=f'Trend (slope: {z[0]:.2f})')
+             ax3.legend()
+     else:
+         ax3.text(0.5, 0.5, 'No enumerative data available',
+                  ha='center', va='center', transform=ax3.transAxes)
+         ax3.set_title('Enumerative Encoding Time vs Dataset Size')
+
+     # 4. Efficiency vs theoretical minimum
+     ax4 = axes[1, 1]
+
+     # Calculate efficiency (how close to theoretical minimum)
+     theoretical_data = df[df['method'] == 'theoretical'].set_index('dataset')['compressed_size']
+
+     for method in ['huffman', 'enumerative']:
+         if method in df['method'].values:
+             # Get method data including both successful and failed cases
+             method_df = df[df['method'] == method].set_index('dataset')
+
+             # Only plot efficiency for successful cases
+             successful_data = method_df[method_df['compressed_size'].notna()]
+             if not successful_data.empty:
+                 efficiency = theoretical_data / successful_data['compressed_size']
+
+                 # Plot only datasets that have both theoretical and method data
+                 common_datasets = efficiency.dropna().index
+                 dataset_indices = [datasets.index(d) for d in common_datasets if d in datasets]
+                 efficiency_values = [efficiency[datasets[i]] for i in dataset_indices]
+
+                 ax4.plot(dataset_indices, efficiency_values, marker='o', label=method, linewidth=2)
+
+             # Mark timeouts/failures
+             if method == 'enumerative':
+                 failed_data = method_df[method_df['compressed_size'].isna()]
+                 if not failed_data.empty:
+                     failed_indices = [datasets.index(d) for d in failed_data.index if d in datasets]
+                     ax4.scatter(failed_indices, [0.1] * len(failed_indices),
+                                 marker='X', s=100, color='red', label=f'{method} (timeout)', zorder=5)
+
+     ax4.set_title('Efficiency vs Theoretical Minimum')
+     ax4.set_xlabel('Dataset Index')
+     ax4.set_ylabel('Efficiency (Theoretical/Actual)')
+     ax4.set_xticks(range(len(datasets)))
+     ax4.set_xticklabels([d[:15] + '...' if len(d) > 15 else d for d in datasets], rotation=45)
+     ax4.legend()
+     ax4.axhline(y=1.0, color='gray', linestyle='--', alpha=0.7, label='Perfect efficiency')
+     ax4.set_ylim(0, ax4.get_ylim()[1])
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/compression_comparison.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+
+ def plot_k_parameter_analysis(df, save_path='plots'):
+     """Analyze the effect of k parameter on EP performance."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Filter equiprobable methods
+     ep_data = df[df['method_type'] == 'Equiprobable'].copy()
+
+     if ep_data.empty:
+         print("No equiprobable data found for k parameter analysis")
+         return
+
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Equiprobable Partitioning: Effect of k Parameter', fontsize=16, fontweight='bold')
+
+     # 1. Compression ratio vs k for different datasets
+     ax1 = axes[0, 0]
+
+     datasets_to_plot = ['small_uniform_10', 'medium_zipf_256', 'large_geometric_64', 'english_text']
+     for dataset in datasets_to_plot:
+         if dataset in ep_data['dataset'].values:
+             dataset_data = ep_data[ep_data['dataset'] == dataset].sort_values('k_value')
+             ax1.plot(dataset_data['k_value'], dataset_data['compression_ratio'],
+                      marker='o', label=dataset, linewidth=2)
+
+     ax1.set_title('Compression Ratio vs k Parameter')
+     ax1.set_xlabel('k (Number of Partitions)')
+     ax1.set_ylabel('Compression Ratio')
+     ax1.legend()
+     ax1.grid(True, alpha=0.3)
+
+     # 2. Bits per symbol vs k
+     ax2 = axes[0, 1]
+
+     for dataset in datasets_to_plot:
+         if dataset in ep_data['dataset'].values:
+             dataset_data = ep_data[ep_data['dataset'] == dataset].sort_values('k_value')
+             ax2.plot(dataset_data['k_value'], dataset_data['bits_per_symbol'],
+                      marker='s', label=dataset, linewidth=2)
+
+     ax2.set_title('Bits per Symbol vs k Parameter')
+     ax2.set_xlabel('k (Number of Partitions)')
+     ax2.set_ylabel('Bits per Symbol')
+     ax2.legend()
+     ax2.grid(True, alpha=0.3)
+
+     # 3. Optimal k by dataset type
+     ax3 = axes[1, 0]
+
+     # Find optimal k for each dataset
+     optimal_k = {}
+     for dataset in ep_data['dataset'].unique():
+         dataset_data = ep_data[ep_data['dataset'] == dataset]
+         if len(dataset_data) > 0:
+             best_idx = dataset_data['compression_ratio'].idxmax()
+             optimal_k[dataset] = dataset_data.loc[best_idx, 'k_value']
+
+     if optimal_k:
+         datasets = list(optimal_k.keys())
+         k_values = list(optimal_k.values())
+
+         colors = ['red' if 'uniform' in d else 'blue' if 'zipf' in d else 'green' if 'geometric' in d else 'orange'
+                   for d in datasets]
+
+         bars = ax3.bar(range(len(datasets)), k_values, color=colors, alpha=0.7)
+         ax3.set_title('Optimal k Value by Dataset')
+         ax3.set_xlabel('Dataset')
+         ax3.set_ylabel('Optimal k')
+         ax3.set_xticks(range(len(datasets)))
+         ax3.set_xticklabels([d[:15] + '...' if len(d) > 15 else d for d in datasets], rotation=45)
+
+         # Add value labels on bars
+         for bar, k_val in zip(bars, k_values):
+             ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
+                      str(int(k_val)), ha='center', va='bottom')
+
+     # 4. Performance improvement over k=2
+     ax4 = axes[1, 1]
+
+     for dataset in datasets_to_plot:
+         if dataset in ep_data['dataset'].values:
+             dataset_data = ep_data[ep_data['dataset'] == dataset].sort_values('k_value')
+             if len(dataset_data) >= 2:
+                 baseline = dataset_data[dataset_data['k_value'] == 2]['compression_ratio'].iloc[0]
+                 improvement = (dataset_data['compression_ratio'] / baseline - 1) * 100
+                 ax4.plot(dataset_data['k_value'], improvement,
+                          marker='^', label=dataset, linewidth=2)
+
+     ax4.set_title('Performance Improvement over k=2 (%)')
+     ax4.set_xlabel('k (Number of Partitions)')
+     ax4.set_ylabel('Improvement (%)')
+     ax4.legend()
+     ax4.grid(True, alpha=0.3)
+     ax4.axhline(y=0, color='black', linestyle='-', alpha=0.5)
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/k_parameter_analysis.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+
+ def plot_distribution_comparison(df, save_path='plots'):
+     """Compare performance across different data distributions."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Categorize datasets by distribution
+     def get_distribution(name):
+         if 'uniform' in name:
+             return 'Uniform'
+         elif 'zipf' in name:
+             return 'Zipf'
+         elif 'geometric' in name:
+             return 'Geometric'
+         elif 'english' in name:
+             return 'Natural Text'
+         else:
+             return 'Other'
+
+     df['distribution'] = df['dataset'].apply(get_distribution)
+
+     # Filter working methods
+     df_plot = df[df['correct'] | (df['method'] == 'theoretical')].copy()
+
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Performance by Data Distribution', fontsize=16, fontweight='bold')
+
+     # 1. Box plot of compression ratios by distribution
+     ax1 = axes[0, 0]
+
+     methods_to_plot = ['huffman', 'enumerative']
+     plot_data = df_plot[df_plot['method'].isin(methods_to_plot)]
+
+     if not plot_data.empty:
+         sns.boxplot(data=plot_data, x='distribution', y='compression_ratio', hue='method', ax=ax1)
+         ax1.set_title('Compression Ratio Distribution by Data Type')
+         ax1.set_xlabel('Data Distribution')
+         ax1.set_ylabel('Compression Ratio')
+         ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
+
+     # 2. Enumerative vs Huffman efficiency by distribution
+     ax2 = axes[0, 1]
+
+     # Calculate relative efficiency vs Huffman for enumerative method
+     huffman_data = df_plot[df_plot['method'] == 'huffman'].set_index('dataset')['compression_ratio']
+     enum_data = df_plot[df_plot['method'] == 'enumerative'].set_index('dataset')
+
+     if not enum_data.empty and not huffman_data.empty:
+         # Only compare datasets where both methods succeeded
+         common_datasets = set(huffman_data.index) & set(enum_data.index)
+
+         if common_datasets:
+             distribution_ratios = {}
+
+             for dataset in common_datasets:
+                 enum_ratio = enum_data.loc[dataset, 'compression_ratio']
+                 huffman_ratio = huffman_data.loc[dataset]
+                 relative_efficiency = enum_ratio / huffman_ratio
+
+                 # Get distribution type
+                 dist_type = df_plot[df_plot['dataset'] == dataset]['distribution'].iloc[0]
+                 if dist_type not in distribution_ratios:
+                     distribution_ratios[dist_type] = []
+                 distribution_ratios[dist_type].append(relative_efficiency)
+
+             # Plot box plots for each distribution
+             if distribution_ratios:
+                 distributions = list(distribution_ratios.keys())
+                 ratios = [distribution_ratios[dist] for dist in distributions]
+
+                 bp = ax2.boxplot(ratios, tick_labels=distributions, patch_artist=True)
+
+                 # Color the boxes
+                 colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
+                 for patch, color in zip(bp['boxes'], colors[:len(bp['boxes'])]):
+                     patch.set_facecolor(color)
+                     patch.set_alpha(0.7)
+
+                 ax2.set_title('Enumerative Efficiency Relative to Huffman')
+                 ax2.set_xlabel('Data Distribution')
+                 ax2.set_ylabel('Enumerative Ratio / Huffman Ratio')
+                 ax2.axhline(y=1.0, color='red', linestyle='--', alpha=0.7, label='Equal to Huffman')
+                 ax2.legend()
+                 ax2.grid(True, alpha=0.3)
+
+     # 3. Vocabulary size effect
+     ax3 = axes[1, 0]
+
+     # Plot compression ratio vs vocabulary size
+     vocab_data = df_plot[df_plot['method'].isin(['huffman', 'enumerative'])]
+
+     for method in ['huffman', 'enumerative']:
+         method_subset = vocab_data[vocab_data['method'] == method]
+         if not method_subset.empty:
+             ax3.scatter(method_subset['vocabulary_size'], method_subset['compression_ratio'],
+                         label=method, alpha=0.7, s=60)
+
+     ax3.set_title('Compression vs Vocabulary Size')
+     ax3.set_xlabel('Vocabulary Size')
+     ax3.set_ylabel('Compression Ratio')
+     ax3.set_xscale('log')
+     ax3.legend()
+     ax3.grid(True, alpha=0.3)
+
+     # 4. Dataset size effect
+     ax4 = axes[1, 1]
+
+     for method in ['huffman', 'enumerative']:
+         method_subset = df_plot[df_plot['method'] == method]
+         if not method_subset.empty:
+             ax4.scatter(method_subset['original_size'], method_subset['compression_ratio'],
+                         label=method, alpha=0.7, s=60)
+
+     ax4.set_title('Compression vs Dataset Size')
+     ax4.set_xlabel('Original Size (bytes)')
+     ax4.set_ylabel('Compression Ratio')
+     ax4.set_xscale('log')
+     ax4.legend()
+     ax4.grid(True, alpha=0.3)
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/distribution_comparison.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+
+ def generate_summary_table(df):
+     """Generate a summary table of results."""
+     print("\n" + "="*130)
+     print("DETAILED COMPRESSION ANALYSIS")
+     print("="*130)
+
+     methods_order = ['theoretical', 'huffman', 'enumerative']
+
+     print(f"{'Dataset':<25} {'Method':<15} {'Size':<8} {'Ratio':<7} {'Bits/Sym':<8} {'Efficiency':<10} {'Time':<8}")
+     print("-" * 130)
+
+     for dataset in sorted(df['dataset'].unique()):
+         dataset_data = df[df['dataset'] == dataset]
+         theoretical_data = dataset_data[dataset_data['method'] == 'theoretical']
+
+         if not theoretical_data.empty:
+             theoretical_ratio = theoretical_data['compression_ratio'].iloc[0]
+
+             for method in methods_order:
+                 method_data = dataset_data[dataset_data['method'] == method]
+                 if not method_data.empty:
+                     row = method_data.iloc[0]
+
+                     if row['compressed_size'] is not None:
+                         # Successful compression
+                         efficiency = row['compression_ratio'] / theoretical_ratio
+                         time_str = f"{row.get('encoding_time', 0):.3f}s" if 'encoding_time' in row else "N/A"
+
+                         print(f"{dataset:<25} {method:<15} {row['compressed_size']:<8.0f} "
+                               f"{row['compression_ratio']:<7.2f} {row['bits_per_symbol']:<8.2f} "
+                               f"{efficiency:<10.3f} {time_str:<8}")
+                     else:
+                         # Timeout/failure case
+                         time_str = f"{row.get('encoding_time', 0):.1f}s" if 'encoding_time' in row else "N/A"
+                         status = "TIMEOUT" if row.get('status') == 'timeout' else "FAILED"
+
+                         print(f"{dataset:<25} {method:<15} {status:<8} {'N/A':<7} {'N/A':<8} "
+                               f"{'N/A':<10} {time_str:<8}")
+
+         print("-" * 130)
+
+
+ def plot_enumerative_timeout_analysis(df, save_path='plots'):
+     """Plot analysis focusing only on enumerative encoding times and timeouts."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Filter to only enumerative method data
+     enum_df = df[df['method'] == 'enumerative'].copy()
+
+     if enum_df.empty:
+         print("No enumerative data found for timeout analysis")
+         return
+
+     fig, ax = plt.subplots(1, 1, figsize=(12, 8))
+     fig.suptitle('Enumerative Encoding: Computation Time and Timeouts',
+                  fontsize=14, fontweight='bold')
+
+     # Extract data characteristics for analysis
+     enum_stats = []
+     for _, row in enum_df.iterrows():
+         dataset_name = row['dataset']
+         vocab_size = row['vocabulary_size']
+         original_size = row['original_size']
+
+         # Determine dataset type
+         if 'uniform' in dataset_name:
+             dataset_type = 'Uniform'
+             color = 'blue'
+             marker = 'o'
+         elif 'zipf' in dataset_name:
+             dataset_type = 'Zipf'
+             color = 'red'
+             marker = 's'
+         elif 'geometric' in dataset_name:
+             dataset_type = 'Geometric'
+             color = 'green'
+             marker = '^'
+         elif 'english' in dataset_name:
+             dataset_type = 'English Text'
+             color = 'purple'
+             marker = 'D'
+         else:
+             dataset_type = 'Other'
+             color = 'gray'
+             marker = 'x'
+
+         # Get timing and timeout info
+         timed_out = row['status'] == 'timeout'
+         encoding_time = row.get('encoding_time', 0)  # Default to 0 if not available
+
+         enum_stats.append({
+             'dataset': dataset_name,
+             'vocab_size': vocab_size,
+             'original_size': original_size,
+             'dataset_type': dataset_type,
+             'color': color,
+             'marker': marker,
+             'timed_out': timed_out,
+             'encoding_time': encoding_time
+         })
+
+     if enum_stats:
+         stats_df = pd.DataFrame(enum_stats)
+
+         # Separate successful and timeout data
+         successful_data = stats_df[~stats_df['timed_out']]
+         timeout_data = stats_df[stats_df['timed_out']]
+
+         # Plot successful encodings by dataset type
+         scatter_success = None
+         for dataset_type in successful_data['dataset_type'].unique():
+             type_data = successful_data[successful_data['dataset_type'] == dataset_type]
+
+             if not type_data.empty:
+                 # Use log scale for encoding time as color intensity
+                 times_log = np.log10(np.maximum(type_data['encoding_time'].values, 0.001))
+
+                 scatter = ax.scatter(type_data['vocab_size'], type_data['original_size'],
+                                      c=times_log, cmap='viridis',
+                                      marker=type_data['marker'].iloc[0],
+                                      s=100, alpha=0.8, edgecolors='black', linewidth=0.5,
+                                      label=f'{dataset_type}')
+
+                 if scatter_success is None:  # Use first successful scatter for colorbar
+                     scatter_success = scatter
+
+         # Plot all timeouts with a single legend entry
+         if not timeout_data.empty:
+             ax.scatter(timeout_data['vocab_size'], timeout_data['original_size'],
+                        color='red', marker='X', s=150, alpha=0.9,
+                        edgecolors='darkred', linewidth=1,
+                        label='Timeout')
+
+         # Add colorbar for encoding time
+         if scatter_success is not None:
+             cbar = plt.colorbar(scatter_success, ax=ax)
+             cbar.set_label('log₁₀(Encoding Time in seconds)')
+
+         ax.set_xlabel('Vocabulary Size')
+         ax.set_ylabel('Dataset Size (symbols)')
+         ax.set_xscale('log')
+         ax.set_yscale('log')
+         ax.grid(True, alpha=0.3)
+
+         # Position legend below the plot to avoid overlap
+         ax.legend(bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=3)
+
+         # Annotate points with timing information
+         for _, row in stats_df.iterrows():
+             if row['timed_out']:
+                 time_label = f"TO:{row['encoding_time']:.1f}s"
+             else:
+                 time_label = f"{row['encoding_time']:.2f}s"
+
+             ax.annotate(time_label,
+                         (row['vocab_size'], row['original_size']),
+                         xytext=(5, 5), textcoords='offset points',
+                         fontsize=8, alpha=0.8)
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/enumerative_timeout_analysis.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+     # Print enumerative timeout summary
+     print("\nEnumerative Encoding Performance Summary:")
+     print("=" * 50)
+
+     enum_success = enum_df[enum_df['status'] == 'success']
+     enum_timeout = enum_df[enum_df['status'] == 'timeout']
+
+     print(f"Successful encodings: {len(enum_success)}")
+     print(f"Timed out encodings: {len(enum_timeout)}")
+
+     if not enum_success.empty:
+         avg_time = enum_success['encoding_time'].mean()
+         max_time = enum_success['encoding_time'].max()
+         min_time = enum_success['encoding_time'].min()
+         print(f"Encoding time stats (successful): min={min_time:.3f}s, avg={avg_time:.3f}s, max={max_time:.3f}s")
+
+     if not enum_timeout.empty:
+         print("Datasets that timed out:")
+         for _, row in enum_timeout.iterrows():
+             print(f"  {row['dataset']}: vocab={row['vocabulary_size']}, size={row['original_size']}")
+
+     print("Performance by dataset type:")
+     for dtype in enum_df['dataset_type'].unique():
+         type_data = enum_df[enum_df['dataset_type'] == dtype]
+         success_rate = len(type_data[type_data['status'] == 'success']) / len(type_data)
+         print(f"  {dtype}: {success_rate:.1%} success rate")
+
+
+ def plot_compression_time_comparison(df, save_path='plots'):
+     """Plot comparison of compression times between different algorithms."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Filter to methods that have timing data
+     timing_data = df[df['encoding_time'].notna() & (df['encoding_time'] > 0)].copy()
+
+     if timing_data.empty:
+         print("No timing data available for compression time comparison")
+         return
+
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Compression Time Comparison: Huffman vs Enumerative', fontsize=16, fontweight='bold')
+
+     # 1. Time comparison by dataset (successful cases only)
+     ax1 = axes[0, 0]
+
+     huffman_times = timing_data[timing_data['method'] == 'huffman']
+     enum_times = timing_data[(timing_data['method'] == 'enumerative') & (timing_data['status'] == 'success')]
+
+     if not huffman_times.empty and not enum_times.empty:
+         # Get common datasets where both methods succeeded
+         common_datasets = set(huffman_times['dataset']) & set(enum_times['dataset'])
+
+         if common_datasets:
+             huffman_common = huffman_times[huffman_times['dataset'].isin(common_datasets)].sort_values('dataset')
+             enum_common = enum_times[enum_times['dataset'].isin(common_datasets)].sort_values('dataset')
+
+             x = np.arange(len(common_datasets))
+             width = 0.35
+
+             ax1.bar(x - width/2, huffman_common['encoding_time'], width,
+                     label='Huffman', alpha=0.8, color='blue')
+             ax1.bar(x + width/2, enum_common['encoding_time'], width,
+                     label='Enumerative', alpha=0.8, color='green')
+
+             ax1.set_title('Encoding Time by Dataset (Successful Cases)')
+             ax1.set_xlabel('Dataset')
+             ax1.set_ylabel('Encoding Time (seconds)')
+             ax1.set_yscale('log')
+             ax1.set_xticks(x)
+             ax1.set_xticklabels([d[:15] + '...' if len(d) > 15 else d for d in sorted(common_datasets)], rotation=45)
+             ax1.legend()
+             ax1.grid(True, alpha=0.3)
+
+     # 2. Time vs Dataset Size scatter plot
+     ax2 = axes[0, 1]
+
+     for method in ['huffman', 'enumerative']:
+         method_data = timing_data[timing_data['method'] == method]
+         if not method_data.empty:
+             successful = method_data[method_data['status'] == 'success']
+             if not successful.empty:
+                 ax2.scatter(successful['original_size'], successful['encoding_time'],
+                             label=f'{method} (success)', alpha=0.7, s=60)
+
+             # For enumerative, also show timeouts
+             if method == 'enumerative':
+                 timeouts = method_data[method_data['status'] == 'timeout']
+                 if not timeouts.empty:
+                     ax2.scatter(timeouts['original_size'], timeouts['encoding_time'],
+                                 label='enumerative (timeout)', alpha=0.9, s=80, marker='X', color='red')
+
+     ax2.set_title('Encoding Time vs Dataset Size')
+     ax2.set_xlabel('Dataset Size (symbols)')
+     ax2.set_ylabel('Encoding Time (seconds)')
+     ax2.set_xscale('log')
+     ax2.set_yscale('log')
+     ax2.legend()
+     ax2.grid(True, alpha=0.3)
+
+     # 3. Speed ratio (Enumerative/Huffman) by dataset characteristics
+     ax3 = axes[1, 0]
+
+     if not huffman_times.empty and not enum_times.empty:
+         # Calculate speed ratios for common successful datasets
+         huffman_dict = dict(zip(huffman_times['dataset'], huffman_times['encoding_time']))
+         enum_successful = enum_times[enum_times['status'] == 'success']
+
+         ratios = []
+         dataset_types = []
+         vocab_sizes = []
+
+         for _, row in enum_successful.iterrows():
+             dataset = row['dataset']
+             if dataset in huffman_dict:
+                 ratio = row['encoding_time'] / huffman_dict[dataset]
+                 ratios.append(ratio)
+                 dataset_types.append(row['dataset_type'])
+                 vocab_sizes.append(row['vocabulary_size'])
+
+         if ratios:
+             # Color by dataset type
+             colors = {'Uniform': 'blue', 'Zipf': 'red', 'Geometric': 'green', 'English Text': 'purple'}
+             type_colors = [colors.get(dt, 'gray') for dt in dataset_types]
+
+             ax3.scatter(vocab_sizes, ratios, c=type_colors, alpha=0.7, s=80)
+
+             # Add legend for dataset types
+             for dtype, color in colors.items():
+                 if dtype in dataset_types:
+                     ax3.scatter([], [], c=color, label=dtype, alpha=0.7, s=80)
+
+             ax3.set_title('Speed Ratio (Enumerative/Huffman) vs Vocabulary Size')
+             ax3.set_xlabel('Vocabulary Size')
+             ax3.set_ylabel('Time Ratio (Enum/Huffman)')
+             ax3.set_xscale('log')
+             ax3.set_yscale('log')
+             ax3.axhline(y=1.0, color='black', linestyle='--', alpha=0.5, label='Equal speed')
+             ax3.legend()
+             ax3.grid(True, alpha=0.3)
+
+     # 4. Time distribution by algorithm
+     ax4 = axes[1, 1]
+
+     huffman_successful = huffman_times[huffman_times['status'] == 'success']['encoding_time']
+     enum_successful_times = enum_times[enum_times['status'] == 'success']['encoding_time']
+
+     time_data = []
+     labels = []
+
+     if not huffman_successful.empty:
+         time_data.append(huffman_successful.values)
+         labels.append('Huffman')
+
+     if not enum_successful_times.empty:
+         time_data.append(enum_successful_times.values)
+         labels.append('Enumerative')
+
+     if time_data:
+         bp = ax4.boxplot(time_data, tick_labels=labels, patch_artist=True)
+
+         # Color the boxes
+         colors = ['lightblue', 'lightgreen']
+         for patch, color in zip(bp['boxes'], colors[:len(bp['boxes'])]):
809
+ patch.set_facecolor(color)
810
+ patch.set_alpha(0.7)
811
+
812
+ ax4.set_title('Encoding Time Distribution')
813
+ ax4.set_ylabel('Encoding Time (seconds)')
814
+ ax4.set_yscale('log')
815
+ ax4.grid(True, alpha=0.3)
816
+
817
+ plt.tight_layout()
818
+ plt.savefig(f'{save_path}/compression_time_comparison.png', dpi=300, bbox_inches='tight')
819
+ plt.close(fig)
820
+
821
+ # Print timing summary
822
+ print("\nCompression Time Summary:")
823
+ print("=" * 50)
824
+
825
+ if not huffman_times.empty:
826
+ huff_stats = huffman_times['encoding_time']
827
+ print(f"Huffman encoding times:")
828
+ print(f" Min: {huff_stats.min():.6f}s, Avg: {huff_stats.mean():.6f}s, Max: {huff_stats.max():.6f}s")
829
+
830
+ if not enum_successful_times.empty:
831
+ enum_stats = enum_successful_times
832
+ print(f"Enumerative encoding times (successful):")
833
+ print(f" Min: {enum_stats.min():.3f}s, Avg: {enum_stats.mean():.3f}s, Max: {enum_stats.max():.3f}s")
834
+ print(f" Speed vs Huffman: {enum_stats.mean() / huffman_times['encoding_time'].mean():.0f}x slower on average")
835
+
836
+
837
+ def main():
838
+ """Generate all plots and analysis."""
839
+ results = load_results()
840
+ if results is None:
841
+ return
842
+
843
+ print("Loading compression results...")
844
+ df = create_comparison_dataframe(results)
845
+
846
+ print("Generating plots...")
847
+
848
+ # Create plots directory
849
+ Path('plots').mkdir(exist_ok=True)
850
+
851
+ # Generate all plots
852
+ plot_compression_ratios(df)
853
+ plot_k_parameter_analysis(df)
854
+ plot_distribution_comparison(df)
855
+ plot_enumerative_timeout_analysis(df)
856
+ plot_compression_time_comparison(df)
857
+
858
+ # Generate summary
859
+ generate_summary_table(df)
860
+
861
+ print("\nPlots saved to 'plots/' directory")
862
+ print("Analysis complete!")
863
+
864
+
865
+ if __name__ == "__main__":
866
+ main()
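Panel 3's speed-ratio logic reduces to joining the two methods' timings on the datasets where both succeeded. A minimal plain-Python sketch of that join, using made-up timings rather than the repo's measurements:

```python
# Hypothetical per-dataset Huffman timings (seconds); not the repo's data
huffman_times = {"uniform_10k": 0.5, "zipf_10k": 0.25}

# Hypothetical enumerative runs: (dataset, status, encoding_time)
enumerative_runs = [
    ("uniform_10k", "success", 1.0),
    ("zipf_10k", "timeout", 30.0),   # timed out: excluded from ratios
    ("zipf_100k", "success", 5.0),   # no Huffman baseline: excluded too
]

# Keep only enumerative successes that have a Huffman baseline,
# mirroring the dict-based join the plotting code performs
ratios = {
    dataset: t / huffman_times[dataset]
    for dataset, status, t in enumerative_runs
    if status == "success" and dataset in huffman_times
}
# ratios == {"uniform_10k": 2.0}
```

Timed-out runs are deliberately excluded here, which is why the plot shows them separately as red 'X' markers rather than folding them into the ratio panel.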
plots/compression_comparison.png ADDED

Git LFS Details

  • SHA256: 10c0b1cf4fb1eef860fc1bbdab38de2f8ee4eda07f76379bd8a999561a074301
  • Pointer size: 131 Bytes
  • Size of remote file: 612 kB
plots/compression_time_comparison.png ADDED

Git LFS Details

  • SHA256: c040301aed29368c2c9bc7ccda8bbe1c7754840d9cbfe00b471c834af70c1bd2
  • Pointer size: 131 Bytes
  • Size of remote file: 509 kB
plots/distribution_comparison.png ADDED

Git LFS Details

  • SHA256: 0042e9e527c9b6bab436e73e7dc07805741a15f529ea60783d0862419bbd004f
  • Pointer size: 131 Bytes
  • Size of remote file: 342 kB
plots/enumerative_timeout_analysis.png ADDED

Git LFS Details

  • SHA256: 09183de8cb5a170cede05f43fd5c2713e45619922980da81fe4b299d929f17c4
  • Pointer size: 131 Bytes
  • Size of remote file: 175 kB
pyproject.toml ADDED
@@ -0,0 +1,20 @@
+ [project]
+ name = "entropy-coding-equiprobable"
+ version = "0.1.0"
+ description = "Entropy coding with equiprobable partitioning: enumerative coding vs. Huffman experiments"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "datasets>=4.0.0",
+     "matplotlib>=3.10.3",
+     "numpy>=2.3.1",
+     "requests>=2.32.4",
+     "scipy>=1.16.0",
+     "seaborn>=0.13.2",
+ ]
+
+ [dependency-groups]
+ dev = [
+     "pyright>=1.1.403",
+     "ruff>=0.12.3",
+ ]
quick_test.py ADDED
@@ -0,0 +1,25 @@
+ #!/usr/bin/env python3
+ """Quick test with smaller datasets to verify functionality."""
+
+ import numpy as np
+
+ from test_compression import generate_iid_data, compress_and_compare, print_results
+
+
+ def main():
+     np.random.seed(42)
+
+     # Test with a small uniform dataset
+     print("Testing with small uniform dataset...")
+     data = generate_iid_data(100, 10, 'uniform')
+     results = compress_and_compare(data, "small_test")
+     print_results(results)
+
+     # Test with a non-uniform (Zipf) distribution
+     print("\nTesting with small Zipf dataset...")
+     data = generate_iid_data(100, 10, 'zipf')
+     results = compress_and_compare(data, "small_zipf_test")
+     print_results(results)
+
+
+ if __name__ == "__main__":
+     main()
test_compression.py ADDED
@@ -0,0 +1,310 @@
+ #!/usr/bin/env python3
+ """
+ Test entropy coding with equiprobable partitioning on various datasets.
+ Compares against Huffman coding and theoretical limits.
+ """
+
+ import json
+ import signal
+ import time
+ from contextlib import contextmanager
+ from typing import Dict, List
+
+ import numpy as np
+ from datasets import load_dataset
+
+ from enumerative_coding import EnumerativeEncoder
+ from entropy_coding import HuffmanEncoder, theoretical_minimum_size
+
+
+ class TimeoutError(Exception):
+     """Raised when a timeout occurs."""
+     pass
+
+
+ @contextmanager
+ def timeout(seconds):
+     """Context manager that raises TimeoutError after `seconds` (Unix only)."""
+     def timeout_handler(signum, frame):
+         raise TimeoutError(f"Operation timed out after {seconds} seconds")
+
+     # Install the signal handler and start the alarm
+     old_handler = signal.signal(signal.SIGALRM, timeout_handler)
+     signal.alarm(seconds)
+
+     try:
+         yield
+     finally:
+         # Restore the old signal handler and disable the alarm
+         signal.signal(signal.SIGALRM, old_handler)
+         signal.alarm(0)
+
+
+ def generate_iid_data(size: int, vocab_size: int, distribution: str = 'uniform') -> List[int]:
+     """
+     Generate i.i.d. data with the specified vocabulary size and distribution.
+
+     Args:
+         size: Number of symbols to generate
+         vocab_size: Size of the vocabulary (number of unique symbols)
+         distribution: 'uniform', 'zipf', or 'geometric'
+     """
+     if distribution == 'uniform':
+         return list(np.random.randint(0, vocab_size, size))
+     elif distribution == 'zipf':
+         # Zipf distribution (power law)
+         probabilities = 1 / np.arange(1, vocab_size + 1)
+         probabilities /= probabilities.sum()
+         return list(np.random.choice(vocab_size, size, p=probabilities))
+     elif distribution == 'geometric':
+         # Truncated geometric distribution
+         p = 0.3  # Success probability for the geometric distribution
+         probabilities = [(1 - p) ** i * p for i in range(vocab_size - 1)]
+         probabilities.append((1 - p) ** (vocab_size - 1))
+         probabilities = np.array(probabilities)
+         probabilities /= probabilities.sum()
+         return list(np.random.choice(vocab_size, size, p=probabilities))
+     else:
+         raise ValueError(f"Unknown distribution: {distribution}")
+
+
+ def download_english_text() -> str:
+     """Download English text from Hugging Face."""
+     print("Downloading English text dataset...")
+     # Use the WikiText-2 dataset as a source of English text
+     dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:10%]")
+
+     # Concatenate the first 1000 text samples
+     text = " ".join(dataset['text'][:1000])
+
+     # Collapse runs of whitespace
+     text = " ".join(text.split())
+
+     return text
+
+
+ def text_to_symbols(text: str) -> List[int]:
+     """Convert text to a list of integer symbols (byte values)."""
+     return list(text.encode('utf-8'))
+
+
+ def symbols_to_text(symbols: List[int]) -> str:
+     """Convert a list of integer symbols back to text."""
+     return bytes(symbols).decode('utf-8', errors='ignore')
+
+
+ def compress_and_compare(data: List[int], name: str, timeout_seconds: int = 30) -> Dict:
+     """
+     Compress data using different methods and compare the results.
+
+     Args:
+         data: Input data as a list of integers
+         name: Name for this dataset
+         timeout_seconds: Timeout in seconds for enumerative coding
+     """
+     original_size = len(data)
+
+     results = {
+         'name': name,
+         'original_size': original_size,
+         'vocabulary_size': len(set(data)),
+         'theoretical_minimum': theoretical_minimum_size(data),
+         'methods': {}
+     }
+
+     # Huffman coding (always completes quickly)
+     print("  Compressing with Huffman coding...")
+     huffman_start_time = time.time()
+     huffman = HuffmanEncoder()
+     huffman_encoded, huffman_metadata = huffman.encode(data)
+     huffman_size = len(huffman_encoded)
+
+     # Verify Huffman decoding
+     huffman_decoded = huffman.decode(huffman_encoded, huffman_metadata)
+     huffman_correct = huffman_decoded == data
+     huffman_encoding_time = time.time() - huffman_start_time
+
+     results['methods']['huffman'] = {
+         'compressed_size': huffman_size,
+         'compression_ratio': original_size / huffman_size,
+         'bits_per_symbol': huffman_size * 8 / len(data),
+         'correct': huffman_correct,
+         'encoding_time': huffman_encoding_time
+     }
+
+     # Enumerative entropy coding (may time out on large datasets)
+     print(f"  Compressing with enumerative entropy coding (timeout: {timeout_seconds}s)...")
+     start_time = time.time()
+     try:
+         with timeout(timeout_seconds):
+             ep_encoder = EnumerativeEncoder()
+             ep_encoded = ep_encoder.encode(data)
+             ep_size = len(ep_encoded)
+
+             # Verify decoding
+             ep_decoded = ep_encoder.decode(ep_encoded)
+             ep_correct = ep_decoded == data
+
+             encoding_time = time.time() - start_time
+             results['methods']['enumerative'] = {
+                 'compressed_size': ep_size,
+                 'compression_ratio': original_size / ep_size,
+                 'bits_per_symbol': ep_size * 8 / len(data),
+                 'correct': ep_correct,
+                 'encoding_time': encoding_time
+             }
+     except TimeoutError as e:
+         encoding_time = time.time() - start_time
+         print(f"  Enumerative coding timed out: {e}")
+         results['methods']['enumerative'] = {
+             'compressed_size': None,
+             'compression_ratio': None,
+             'bits_per_symbol': None,
+             'correct': False,
+             'encoding_time': encoding_time,
+             'timed_out': True
+         }
+     except Exception as e:
+         encoding_time = time.time() - start_time
+         print(f"  Enumerative coding failed: {e}")
+         results['methods']['enumerative'] = {
+             'compressed_size': None,
+             'compression_ratio': None,
+             'bits_per_symbol': None,
+             'correct': False,
+             'encoding_time': encoding_time,
+             'timed_out': False,
+             'error': str(e)
+         }
+
+     return results
+
+
+ def main():
+     """Run compression tests on various datasets."""
+     np.random.seed(42)  # For reproducibility
+
+     all_results = []
+
+     # Test configurations: (name, size, vocab_size, distribution)
+     test_configs = [
+         # Uniform distribution (one representative case)
+         ("uniform_10k_v256", 10000, 256, 'uniform'),
+
+         # Zipf distribution with different vocabulary sizes
+         ("zipf_10k_v16", 10000, 16, 'zipf'),
+         ("zipf_10k_v64", 10000, 64, 'zipf'),
+         ("zipf_10k_v256", 10000, 256, 'zipf'),
+         ("zipf_5k_v16", 5000, 16, 'zipf'),
+         ("zipf_5k_v64", 5000, 64, 'zipf'),
+         ("zipf_5k_v256", 5000, 256, 'zipf'),
+
+         # Geometric distribution with different vocabulary sizes
+         ("geometric_10k_v16", 10000, 16, 'geometric'),
+         ("geometric_10k_v64", 10000, 64, 'geometric'),
+         ("geometric_10k_v256", 10000, 256, 'geometric'),
+         ("geometric_5k_v16", 5000, 16, 'geometric'),
+         ("geometric_5k_v64", 5000, 64, 'geometric'),
+         ("geometric_5k_v256", 5000, 256, 'geometric'),
+
+         # Large-scale test with the most interesting distribution
+         ("zipf_100k_v256", 100000, 256, 'zipf'),
+     ]
+
+     # Generate and test i.i.d. datasets
+     print("Testing i.i.d. datasets...")
+     for name, size, vocab_size, distribution in test_configs:
+         print(f"\nTesting {name} (size={size}, vocab={vocab_size}, dist={distribution})...")
+         data = generate_iid_data(size, vocab_size, distribution)
+         results = compress_and_compare(data, name)
+         all_results.append(results)
+         print_results(results)
+
+     # Test English text
+     print("\nTesting English text...")
+     english_text = download_english_text()
+     print(f"Text length: {len(english_text)} characters")
+
+     # Convert to symbols; use a subset that is computationally
+     # feasible for enumerative coding
+     text_symbols = text_to_symbols(english_text)
+     max_text_length = 2000  # Should complete within the timeout
+     if len(text_symbols) > max_text_length:
+         original_length = len(text_symbols)
+         text_symbols = text_symbols[:max_text_length]
+         print(f"Using text subset of {len(text_symbols)} symbols (original: {original_length})")
+
+     text_results = compress_and_compare(text_symbols, "english_text")
+     all_results.append(text_results)
+     print_results(text_results)
+
+     # Save all results to JSON
+     with open('compression_results.json', 'w') as f:
+         json.dump(all_results, f, indent=2)
+
+     print("\nResults saved to compression_results.json")
+
+     # Generate summary table
+     print("\n" + "=" * 80)
+     print("COMPRESSION SUMMARY")
+     print("=" * 80)
+     print(f"{'Dataset':<20} {'Original':<10} {'Theoretical':<12} {'Huffman':<10} {'Enumerative':<12}")
+     print("-" * 70)
+
+     for result in all_results:
+         name = result['name'][:20]
+         original = result['original_size']
+         theoretical = result['theoretical_minimum']
+         huffman = result['methods']['huffman']['compressed_size']
+         enumerative_method = result['methods']['enumerative']
+         if enumerative_method is not None and enumerative_method.get('compressed_size') is not None:
+             enumerative = enumerative_method['compressed_size']
+         elif enumerative_method and enumerative_method.get('timed_out'):
+             enumerative = "TIMEOUT"
+         else:
+             enumerative = "FAILED"
+
+         print(f"{name:<20} {original:<10} {theoretical:<12.1f} {huffman:<10} {enumerative:<12}")
+
+
+ def print_results(results: Dict):
+     """Print compression results for a dataset."""
+     print(f"\nResults for {results['name']}:")
+     print(f"  Original size: {results['original_size']} bytes")
+     print(f"  Vocabulary size: {results['vocabulary_size']}")
+     print(f"  Theoretical minimum: {results['theoretical_minimum']:.1f} bytes")
+
+     for method, data in results['methods'].items():
+         print(f"\n  {method}:")
+         if data is None or data.get('compressed_size') is None:
+             if data and data.get('timed_out'):
+                 print(f"    Status: TIMEOUT after {data.get('encoding_time', 0):.2f}s - computational complexity too high")
+             elif data and data.get('error'):
+                 print(f"    Status: FAILED - {data.get('error')}")
+             else:
+                 print("    Status: FAILED - computational complexity too high")
+         else:
+             print(f"    Compressed size: {data['compressed_size']} bytes")
+             print(f"    Compression ratio: {data['compression_ratio']:.2f}")
+             print(f"    Bits per symbol: {data['bits_per_symbol']:.2f}")
+             print(f"    Correctly decoded: {data['correct']}")
+             if 'encoding_time' in data:
+                 print(f"    Encoding time: {data['encoding_time']:.3f}s")
+
+
+ if __name__ == "__main__":
+     main()
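The SIGALRM-based `timeout` context manager above is the standard Unix pattern for bounding a long-running computation; it only works on Unix in the main thread, with whole-second granularity. A self-contained sketch of the same pattern, using the built-in `TimeoutError` instead of the repo's custom class:

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def timeout(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`.

    Relies on SIGALRM, so it is Unix-only, main-thread-only, and has
    whole-second granularity.
    """
    def handler(signum, frame):
        raise TimeoutError(f"Operation timed out after {seconds} seconds")

    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        # Always restore the previous handler and cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
        signal.alarm(0)


# A fast operation completes normally...
with timeout(5):
    result = sum(range(1000))

# ...while a slow one is interrupted.
try:
    with timeout(1):
        time.sleep(3)
    timed_out = False
except TimeoutError:
    timed_out = True
```

The `finally` block matters: without `signal.alarm(0)`, a pending alarm from a block that finished early would fire later inside unrelated code.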
test_enumerative.py ADDED
@@ -0,0 +1,91 @@
+ #!/usr/bin/env python3
+ """Test the enumerative entropy coding implementation."""
+
+ import math
+
+ from enumerative_coding import EnumerativeEncoder, ExpGolombCoder
+
+
+ def test_exp_golomb():
+     """Test exp-Golomb coding."""
+     print("Testing Exp-Golomb coding:")
+
+     test_values = [0, 1, 2, 3, 4, 5, 10, 15, 31, 32, 100]
+
+     for n in test_values:
+         encoded = ExpGolombCoder.encode(n)
+         decoded, _ = ExpGolombCoder.decode(encoded, 0)
+         print(f"  {n:3d} -> {encoded:>10s} -> {decoded:3d} ({'✓' if n == decoded else '✗'})")
+
+
+ def test_combinatorics():
+     """Test combinatorial functions."""
+     print("\nTesting combinatorial functions:")
+
+     # Check binomial coefficients from the encoder's table against math.comb
+     encoder = EnumerativeEncoder()
+     test_cases = [(5, 2), (10, 3), (7, 0), (7, 7), (6, 4)]
+     for n, k in test_cases:
+         result = encoder.binom_table.get(n, k)
+         expected = math.comb(n, k)
+         print(f"  C({n},{k}) = {result} (expected {expected})")
+
+
+ def test_simple_sequence():
+     """Test encoding/decoding of simple sequences."""
+     print("\nTesting simple sequences:")
+
+     encoder = EnumerativeEncoder()
+
+     test_sequences = [
+         [0, 1, 0],
+         [0, 1, 1, 2],
+         [1, 2, 3, 1, 2, 3],
+         [0, 0, 1, 1, 2],
+     ]
+
+     for seq in test_sequences:
+         print(f"\n  Testing sequence: {seq}")
+
+         try:
+             encoded = encoder.encode(seq)
+             decoded = encoder.decode(encoded)
+
+             print(f"    Original: {seq}")
+             print(f"    Decoded:  {decoded}")
+             print(f"    Correct:  {'✓' if seq == decoded else '✗'}")
+             print(f"    Size: {len(seq)} symbols -> {len(encoded)} bytes")
+
+         except Exception as e:
+             print(f"    Error: {e}")
+
+
+ def test_paper_example():
+     """Test with an example that should match the paper's approach."""
+     print("\nTesting paper-style example:")
+
+     # A sequence with known symbol frequencies; exercises the
+     # 3-step encoding process
+     sequence = [0, 0, 1, 2, 1, 0, 2, 1, 1, 2]  # 3 zeros, 4 ones, 3 twos
+
+     print(f"  Sequence: {sequence}")
+     print(f"  Symbols:  {sorted(set(sequence))}")
+     print(f"  Counts:   {[sequence.count(s) for s in sorted(set(sequence))]}")
+
+     encoder = EnumerativeEncoder()
+     encoded = encoder.encode(sequence)
+     decoded = encoder.decode(encoded)
+
+     print(f"  Encoded size: {len(encoded)} bytes")
+     print(f"  Correctly decoded: {'✓' if sequence == decoded else '✗'}")
+
+     if sequence != decoded:
+         print(f"  Expected: {sequence}")
+         print(f"  Got:      {decoded}")
+
+
+ if __name__ == "__main__":
+     test_exp_golomb()
+     test_combinatorics()
+     test_simple_sequence()
+     test_paper_example()
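For reference, order-0 exponential-Golomb coding can be sketched in a few lines (the exact bit layout of the repo's `ExpGolombCoder` is defined in `enumerative_coding.py` and may differ; this is the textbook scheme):

```python
def exp_golomb_encode(n: int) -> str:
    """Order-0 exp-Golomb: write n+1 in binary, preceded by
    (bit_length - 1) zeros. E.g. 0 -> '1', 1 -> '010', 2 -> '011'."""
    m = n + 1
    bits = bin(m)[2:]
    return "0" * (len(bits) - 1) + bits


def exp_golomb_decode(bitstring: str, pos: int = 0):
    """Decode one value starting at `pos`; return (value, next_pos)."""
    zeros = 0
    while bitstring[pos + zeros] == "0":
        zeros += 1
    # The payload is (zeros + 1) bits long, starting after the zero run
    end = pos + 2 * zeros + 1
    m = int(bitstring[pos + zeros:end], 2)
    return m - 1, end
```

Because each codeword is self-delimiting (the zero run announces the payload length), codewords can be concatenated into a single bitstream and decoded sequentially, which is what makes the scheme useful for the variable-length side information in enumerative coding.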
test_paper_examples.py ADDED
@@ -0,0 +1,135 @@
+ #!/usr/bin/env python3
+ """
+ Test cases that match specific examples from the Han et al. (2008) paper.
+ Tests only end-to-end encoding/decoding and verifies that the results
+ match the paper.
+ """
+
+ from enumerative_coding import EnumerativeEncoder
+
+
+ def test_example_from_paper():
+     """
+     Test with a specific example that should match the paper's results.
+     This verifies that the encoding produces the expected output.
+     """
+     print("Testing Example from Paper")
+     print("=" * 50)
+
+     # A simple sequence we can verify by hand: 2 zeros, 2 ones, 1 two
+     sequence = [0, 1, 0, 1, 2]
+
+     print(f"Input sequence: {sequence}")
+     print(f"Symbol counts: {[sequence.count(i) for i in range(3)]}")
+
+     encoder = EnumerativeEncoder()
+     encoded = encoder.encode(sequence)
+     decoded = encoder.decode(encoded)
+
+     # Test end-to-end correctness
+     assert decoded == sequence, f"Decoding failed: expected {sequence}, got {decoded}"
+     print("✓ End-to-end encoding/decoding successful")
+
+     # Show compression metrics
+     original_size = len(sequence)
+     compressed_size = len(encoded)
+     print(f"Original: {original_size} symbols -> Compressed: {compressed_size} bytes")
+     print(f"Compression ratio: {original_size / compressed_size:.2f}")
+
+
+ def test_paper_sequence_properties():
+     """
+     Test with sequences that have the properties discussed in the paper.
+     Verify that the algorithm handles different symbol distributions correctly.
+     """
+     print("\n\nTesting Paper Sequence Properties")
+     print("=" * 50)
+
+     test_cases = [
+         # (description, sequence)
+         ("Uniform distribution", [0, 1, 2, 0, 1, 2]),
+         ("Skewed distribution", [0, 0, 0, 1, 1, 2]),
+         ("Single symbol", [0, 0, 0, 0]),
+         ("Binary sequence", [0, 1, 1, 0, 1]),
+     ]
+
+     encoder = EnumerativeEncoder()
+
+     for description, sequence in test_cases:
+         print(f"\n{description}: {sequence}")
+
+         # Test encoding and decoding
+         encoded = encoder.encode(sequence)
+         decoded = encoder.decode(encoded)
+
+         # Verify correctness
+         assert decoded == sequence, f"Failed for {description}"
+
+         # Show results
+         compression_ratio = len(sequence) / len(encoded) if len(encoded) > 0 else float('inf')
+         print(f"  Compressed: {len(sequence)} -> {len(encoded)} bytes (ratio: {compression_ratio:.2f})")
+         print("  ✓ Correctly encoded and decoded")
+
+
+ def test_lexicographic_ordering():
+     """
+     Test that sequences with the same symbol counts but different orders
+     produce different encodings (different lexicographic indices).
+     """
+     print("\n\nTesting Lexicographic Ordering")
+     print("=" * 50)
+
+     # All permutations of [0, 0, 1, 1] should have different lexicographic indices
+     permutations = [
+         [0, 0, 1, 1],
+         [0, 1, 0, 1],
+         [0, 1, 1, 0],
+         [1, 0, 0, 1],
+         [1, 0, 1, 0],
+         [1, 1, 0, 0],
+     ]
+
+     encoder = EnumerativeEncoder()
+     encodings = []
+
+     for perm in permutations:
+         encoded = encoder.encode(perm)
+         decoded = encoder.decode(encoded)
+
+         # Verify end-to-end correctness
+         assert decoded == perm, f"Decoding failed for {perm}"
+
+         encodings.append(encoded)
+         print(f"  {perm} -> compressed size: {len(encoded)} bytes")
+
+     # Different permutations must produce different encodings; otherwise
+     # the code is lossy and cannot be uniquely decoded
+     unique_encodings = len(set(encodings))
+     total_permutations = len(permutations)
+     print(f"  Unique encodings: {unique_encodings} out of {total_permutations} permutations")
+
+     assert unique_encodings == total_permutations, (
+         "CRITICAL BUG: different permutations produced identical encodings; "
+         "the code is lossy and cannot uniquely decode sequences."
+     )
+     print("  ✓ All permutations have unique encodings")
+
+
+ def main():
+     """Run all paper example tests."""
+     test_example_from_paper()
+     test_paper_sequence_properties()
+     test_lexicographic_ordering()
+
+     print("\n" + "=" * 50)
+     print("All paper example tests passed! ✓")
+
+
+ if __name__ == "__main__":
+     main()
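The lexicographic-index property that the last test probes can be illustrated directly for binary sequences: the rank of a 0/1 sequence among all arrangements with the same number of ones follows from binomial coefficients (this is the classic enumerative-coding identity from Cover's "Enumerative Source Encoding", not necessarily the exact scheme in `enumerative_coding.py`):

```python
from math import comb


def binary_rank(seq):
    """Lexicographic rank of a 0/1 sequence among all sequences of the
    same length with the same number of ones.

    Scanning left to right: placing a 1 where a 0 could have gone skips
    every arrangement that puts a 0 there, i.e. C(remaining, ones_left)
    sequences, so we add that count to the rank.
    """
    rank = 0
    ones_left = sum(seq)
    for i, bit in enumerate(seq):
        remaining = len(seq) - i - 1
        if bit == 1:
            rank += comb(remaining, ones_left)
            ones_left -= 1
    return rank
```

For the six permutations of `[0, 0, 1, 1]` in the test above, this assigns the distinct ranks 0 through 5 in lexicographic order, which is exactly why distinct permutations with identical symbol counts must map to distinct encodings.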
uv.lock ADDED
The diff for this file is too large to render. See raw diff