osteele committed · Commit 24c19d8 · unverified · 0 parents

Initial commit
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

# Build outputs
compression_results.json

# Keep plots but ignore temp plots
test_plots/
README.md ADDED
---
license: mit
---

# Entropy Coding with Equiprobable Partitioning

Implementation and comparison of the entropy coding algorithm with equiprobable partitioning from Han et al. (2008), benchmarked against Huffman coding and theoretical limits.

## Overview

This project implements two compression algorithms:

1. **Equiprobable Partitioning (EP)** - The main algorithm from the paper
2. **Huffman Coding** - Classical entropy coding for comparison

## Algorithm Description

### Enumerative Entropy Coding

The algorithm from Han et al. (2008) is actually an **enumerative entropy coding** method that works in three steps:

1. **Encode the alphabet size M** using exp-Golomb codes
2. **Encode the symbol counts** N(s₁), N(s₂), ..., N(s_{M-1}) using exp-Golomb codes (the last count is implied by the sequence length)
3. **Encode the sequence's position** among all permutations with the same symbol counts using combinatorial enumeration

#### How It Works

- **Step 1**: Use exp-Golomb coding to encode how many distinct symbols appear
- **Step 2**: Use exp-Golomb coding to encode how many times each symbol appears
- **Step 3**: Use lexicographic indexing to identify which specific permutation this sequence represents among all sequences with the same symbol histogram

This is fundamentally different from simple partitioning - it is a form of **combinatorial compression** that exploits the mathematical structure of permutations.
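Steps 1 and 2 rely on order-0 exponential-Golomb codes. A minimal self-contained sketch of that integer code (function names here are illustrative; the repository's version lives in `ExpGolombCoder`):

```python
def expgolomb_encode(n: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer."""
    binary = bin(n + 1)[2:]                  # binary form of n + 1
    return "0" * (len(binary) - 1) + binary  # prefix with len-1 zeros

def expgolomb_decode(bits: str, pos: int = 0) -> tuple:
    """Return (value, next_position) decoded from a bit string."""
    zeros = 0
    while bits[pos] == "0":                  # count leading zeros
        zeros += 1
        pos += 1
    value = int(bits[pos:pos + zeros + 1], 2) - 1
    return value, pos + zeros + 1
```

Because each codeword is self-delimiting (the leading zeros announce its length), the header fields can simply be concatenated and decoded back in order.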

### Performance Optimizations

Key optimizations make the implementation practical for datasets up to roughly 10,000 symbols:

1. **Cached Binomial Coefficients**: Uses `math.comb()` with caching to avoid recomputation
2. **Binary Search**: O(log n) position reconstruction instead of a linear search
3. **Complement Encoding**: For symbols occupying more than half of the remaining positions, encode the positions of the other symbols instead
4. **Arbitrary Precision**: Python integers avoid overflow for large combinatorial values

These optimizations bring the algorithm to polynomial time complexity, making it practical for research and educational use.
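Optimization 1 can be sketched with a memoized wrapper around `math.comb` (a simplified stand-in for the repository's `OptimizedBinomialTable`):

```python
from functools import lru_cache
import math

@lru_cache(maxsize=None)
def binom(n: int, k: int) -> int:
    # Python ints are arbitrary precision, so large C(n, k) never overflows
    return math.comb(n, k)
```

Repeated lookups during ranking and unranking then hit the cache in O(1) amortized time, and the cache only grows with the distinct (n, k) pairs actually requested.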

### Why Enumerative Coding?

The algorithm aims to achieve compression by:
- Separating structure (symbol counts) from content (the specific permutation)
- Using exp-Golomb codes for integer encoding
- Leveraging combinatorial mathematics for exact permutation indexing
- Approaching theoretical compression bounds for certain distributions
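The "exact permutation indexing" of Step 3 reduces to ranking a k-subset of positions in the combinatorial number system. A minimal sketch of that formula (the same sum the repository's `_rank` computes):

```python
import math

def rank(positions: list) -> int:
    """Rank of a sorted k-subset of {0, ..., n-1} in the combinatorial number system."""
    return sum(math.comb(p, i + 1) for i, p in enumerate(positions))
```

The ranks of all C(n, k) subsets form a bijection onto 0 .. C(n, k) - 1, so the index needs only about log₂ C(n, k) bits — which is exactly where the compression comes from.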

## Installation

```bash
# Clone or navigate to the repository
cd entropy-coding-equiprobable

# Install dependencies
just setup
# or manually: uv sync
```

## Usage

### Quick Start

```bash
# List all available commands
just

# Run quick tests
just test

# Run paper examples
just test-paper

# Run full compression benchmark
just run

# Run benchmark and generate plots (recommended)
just analyze

# Generate plots and analysis
just plot
```

### Visualization

The plotting script generates:

1. **Compression Comparison**: Side-by-side comparison of Huffman vs. enumerative methods
2. **Compression Time Analysis**: Timing comparison between the algorithms
3. **Distribution Analysis**: Performance across uniform, Zipf, and geometric data
4. **Efficiency Analysis**: How close each method gets to the theoretical limit
5. **Enumerative Timeout Analysis**: Computational limits and scaling behavior

Plots are saved to the `plots/` directory as high-resolution PNG files.

## Results

### Compression Performance Comparison
![Compression Comparison](plots/compression_comparison.png)

Comparison of compression ratios, bits per symbol, and efficiency between Huffman and enumerative coding across datasets.

### Compression Time Analysis
![Compression Time Comparison](plots/compression_time_comparison.png)

Timing analysis showing encoding times, speed ratios, and scalability. Huffman coding is consistently 100-1000x faster.

### Distribution Analysis
![Distribution Comparison](plots/distribution_comparison.png)

Performance breakdown by distribution type (uniform, Zipf, geometric, English text) with compression ratios and efficiency metrics.

### Computational Complexity Analysis
![Enumerative Timeout Analysis](plots/enumerative_timeout_analysis.png)

Enumerative encoding computation times, timeout patterns, and scaling limits by dataset size and vocabulary.

### Command Reference

- `just` - List available commands
- `just setup` - Install dependencies
- `just test` - Quick test with small datasets + paper examples
- `just test-paper` - Test examples from the paper
- `just run` - Full compression benchmark
- `just analyze` - Run full benchmark and generate plots
- `just plot` - Generate comparison plots
- `just clean` - Remove generated files
- `just check` - Run code quality checks
- `just format` - Format code

## Test Datasets

The benchmark includes:

### I.I.D. Datasets
- **Small** (1K symbols): Quick testing
- **Medium** (10K symbols): Moderate datasets
- **Large** (100K symbols): Performance at scale

### Distributions
- **Uniform**: All symbols equally likely
- **Zipf**: Power-law distribution (realistic for text)
- **Geometric**: Exponentially decreasing probabilities

### Vocabulary Sizes
- **10 symbols**: Small alphabet
- **64 symbols**: Medium alphabet
- **256 symbols**: Full byte range

### Real Data
- **English text**: Downloaded from WikiText-2 via Hugging Face

## Results Analysis

### Performance Patterns

1. **Uniform distributions**: EP performs poorly because there is no probability imbalance to exploit
2. **Skewed distributions**: EP performs better but still trails Huffman
3. **Large vocabularies**: EP's header overhead becomes significant with many symbols

### Computational Complexity

The optimized enumerative implementation achieves **polynomial time complexity** through careful algorithmic design:

#### Time Complexity Analysis
- **Encoding**: O(M × n), where M = alphabet size and n = sequence length
  - Symbol position finding: O(n) per symbol
  - Combinatorial indexing: O(k) per symbol with memoization
- **Decoding**: O(M × k × log n), where k = average symbol count
  - Binary search for position reconstruction: O(log n) per position
  - Memoized binomial lookups: O(1) amortized
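Decoding inverts the ranking step: given an index, recover the position set. A self-contained sketch of unranking, shown with a linear scan for clarity (the repository's `_unrank` applies binary search to the same inequality, giving the O(log n) per-position cost noted above):

```python
import math

def unrank(index: int, k: int) -> list:
    """Recover the sorted k-subset whose combinatorial-number-system rank is `index`."""
    positions = []
    for i in range(k, 0, -1):
        # find the largest p with C(p, i) <= remaining index
        p = i - 1
        while math.comb(p + 1, i) <= index:
            p += 1
        index -= math.comb(p, i)
        positions.append(p)
    return positions[::-1]  # built largest-first, return ascending
```

Because each step subtracts the chosen C(p, i), unranking exactly reverses the rank sum, so encode and decode agree on the position set.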

#### Space Complexity
- **Memory usage**: O(number of distinct binomial lookups) for the coefficient cache
- **Typical cache size**: under 1,000 entries for most realistic datasets
- **No upfront cost**: zero initialization time; the cache grows only as needed

#### Performance Characteristics
- **Small datasets** (< 5,000 symbols): 0.045 s - 1.7 s encoding time
- **Medium datasets** (5,000-10,000 symbols): 0.3 s - 15 s encoding time
- **Large datasets** (> 100,000 symbols): may time out (> 30 s)
- **Versus Huffman**: ~259x slower on average

#### Timeout Mechanism
- **Timeout duration**: 30 seconds by default for enumerative coding
- **Graceful handling**: timeouts are logged and marked as "TIMEOUT" in results
- **When timeouts occur**: very large sequences (> 100K symbols) with high vocabulary diversity

The optimizations transform the algorithm from exponential (naive multinomial enumeration) to polynomial complexity, making it practical for realistic data sizes.

### Performance Results

From the benchmark comparing Huffman vs. enumerative coding:

| Dataset Type | Huffman Efficiency | Enumerative Efficiency | Speed Ratio |
|--------------|-------------------|------------------------|-------------|
| Uniform data | ~99.8% of theoretical | ~48.9% of theoretical | 259x slower |
| Zipf data | ~99.0-99.4% of theoretical | ~47.7-49.9% of theoretical | 100-1000x slower |
| Geometric data | ~98.9-99.3% of theoretical | ~49.6-49.9% of theoretical | 400-2000x slower |
| English text | ~99.1% of theoretical | ~48.1% of theoretical | 23x slower |

### Why Enumerative Underperforms

1. **Computational cost**: Combinatorial calculations become expensive for large datasets
2. **Fixed algorithm structure**: Cannot adapt to data characteristics the way Huffman's variable-length codes do
3. **Header overhead**: The algorithm encodes structural information (alphabet, counts, position indices) separately
4. **Scaling**: Although polynomial, runtime grows steeply with dataset size and vocabulary complexity

## File Structure

```
entropy-coding-equiprobable/
├── enumerative_coding.py     # Core enumerative entropy coding implementation
├── entropy_coding.py         # Legacy compatibility and Huffman implementation
├── test_compression.py       # Main benchmark script with timing analysis
├── test_paper_examples.py    # Paper example verification
├── test_enumerative.py       # Basic functionality tests
├── plot_results.py           # Visualization and analysis
├── quick_test.py             # Quick functionality test
├── justfile                  # Command runner
├── pyproject.toml            # Python dependencies
├── CLAUDE.md                 # Project-specific AI instructions
└── README.md                 # This file
```

## Implementation Details

### Enumerative Entropy Coding

The implementation follows the Han et al. (2008) algorithm with four main steps (abridged from `enumerative_coding.py`):

```python
def encode(self, data: List[int]) -> bytes:
    # Step 1: Encode sequence length
    bits += ExpGolombCoder.encode(n)

    # Step 2: Encode alphabet (size K and symbols)
    bits += ExpGolombCoder.encode(K)
    for symbol in sorted_symbols:
        bits += ExpGolombCoder.encode(symbol)

    # Step 3: Encode symbol frequencies (K-1; the last is implied)
    for i in range(K - 1):
        bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

    # Step 4: Encode symbol positions using combinatorial indexing
    for symbol in sorted_symbols[:-1]:
        positions = find_symbol_positions(symbol, remaining_data)
        rank = self._rank(len(remaining_data), len(positions), positions)
        bits += ExpGolombCoder.encode(rank)
```

### Key Optimizations

```python
# Complement encoding for frequent symbols
use_complement = k > current_n / 2
if use_complement:
    # Encode positions of the OTHER symbols instead
    complement_positions = find_complement_positions()
    rank = self._rank(current_n, current_n - k, complement_positions)

# Fast binomial coefficient computation with caching
class OptimizedBinomialTable:
    def get(self, n: int, k: int) -> int:
        if (n, k) in self._cache:
            return self._cache[(n, k)]
        result = math.comb(n, k)  # arbitrary precision
        self._cache[(n, k)] = result
        return result
```

## Theoretical Analysis

### Compression Bounds

- **Shannon entropy**: H(X) = -Σ p(x) log₂ p(x), the theoretical minimum bits per symbol
- **Huffman**: achieves H(X) ≤ L_Huffman < H(X) + 1 expected bits per symbol (typically ~99% efficiency here)
- **Enumerative**: L_Enum ≥ H(X) + overhead (typically ~49% efficiency in this benchmark, due to the structural header)
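The Shannon bound above is easy to check numerically. A small self-contained sketch (independent of the repository's `calculate_entropy` helper, which returns bytes rather than bits per symbol):

```python
from collections import Counter
import math

def entropy_bits_per_symbol(data) -> float:
    """Shannon entropy H(X) in bits per symbol, from empirical frequencies."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```

For example, `"aabb"` has H(X) = 1 bit per symbol, so its 4 symbols need at least 4 bits before any header overhead — the efficiency columns in the table above compare each coder's actual output against exactly this bound.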

### When Enumerative Coding Works

1. **Research/theoretical applications**: when exact mathematical properties are needed
2. **Educational purposes**: understanding combinatorial compression principles
3. **Small datasets**: where computational cost is not a concern

### When Enumerative Struggles

1. **Most practical applications**: ~259x slower than Huffman with worse compression
2. **Large datasets**: steep scaling makes it computationally prohibitive
3. **Real-time systems**: unpredictable and potentially very long encoding times

## Future Optimization Opportunities

While the current implementation achieves practical performance for datasets up to ~10,000 symbols, several strategies could improve it further:

### 1. Just-In-Time (JIT) Compilation
- **Target**: critical loops in combinatorial indexing and position reconstruction
- **Options**:
  - **Numba** (requires Python 3.11 due to llvmlite compatibility issues)
  - **JAX** (better Python 3.12 support, NumPy-compatible)
  - **PyPy** (alternative Python interpreter with a JIT)
- **Expected benefit**: 10-100x speedup for computational bottlenecks

### 2. Algorithmic Improvements
- **Incremental encoding**: reuse computations when processing similar sequences
- **Approximate methods**: trade slight accuracy for large performance gains on very large datasets
- **Parallel processing**: distribute symbol processing across multiple cores

### 3. Specialized Data Structures
- **Sparse binomial tables**: only compute the coefficients actually needed
- **Compressed position indices**: more efficient representation of position lists
- **Fast integer arithmetic**: specialized libraries for large-integer operations

### 4. Memory Hierarchy Optimizations
- **Cache-friendly algorithms**: reorganize computations to minimize cache misses
- **Memory pooling**: reduce allocation overhead for temporary arrays
- **Streaming encoding**: process very large datasets without loading them entirely into memory

### 5. Domain-Specific Optimizations
- **Text-specific**: leverage byte patterns and common character frequencies
- **Statistical precomputation**: pre-build tables for common distributions (Zipf, geometric)
- **Adaptive thresholds**: dynamically adjust complement-encoding and timeout parameters

The current implementation provides a solid foundation for exploring these optimizations while maintaining correctness and robustness.

## References

- Han, Y., et al. (2008). "Entropy coding using equiprobable partitioning"
- Cover, T. M., & Thomas, J. A. (2006). "Elements of Information Theory"
- Huffman, D. A. (1952). "A Method for the Construction of Minimum-Redundancy Codes"

## Contributing

This is a research implementation. To contribute:

1. Fork the repository
2. Make changes following the existing code style
3. Run `just check` to verify code quality
4. Submit a pull request

## License

This project is for educational and research purposes.
entropy_coding.py ADDED
#!/usr/bin/env python3
"""
Implementation of enumerative entropy coding as described in Han et al. (2008).
This is the actual "equiprobable partitioning" algorithm from the paper.
"""

# Import the actual implementation
from enumerative_coding import EnumerativeEncoder, ExpGolombCoder

from typing import Tuple, List, Dict
from collections import Counter
import heapq
from scipy.stats import entropy as scipy_entropy


# For backward compatibility with existing code that expects this class name
EquiprobablePartitioningEncoder = EnumerativeEncoder


class HuffmanEncoder:
    """Standard Huffman coding for comparison."""

    class Node:
        def __init__(self, symbol=None, freq=0, left=None, right=None):
            self.symbol = symbol
            self.freq = freq
            self.left = left
            self.right = right

        def __lt__(self, other):
            return self.freq < other.freq

    def __init__(self):
        self.codes = {}
        self.root = None

    def _build_tree(self, frequencies: Dict[int, int]):
        """Build the Huffman tree from symbol frequencies."""
        heap = []

        for symbol, freq in frequencies.items():
            heapq.heappush(heap, self.Node(symbol=symbol, freq=freq))

        while len(heap) > 1:
            left = heapq.heappop(heap)
            right = heapq.heappop(heap)
            parent = self.Node(freq=left.freq + right.freq, left=left, right=right)
            heapq.heappush(heap, parent)

        self.root = heap[0]

    def _generate_codes(self, node, code=''):
        """Generate Huffman codes by traversing the tree."""
        if node.symbol is not None:
            self.codes[node.symbol] = code if code else '0'
            return

        if node.left:
            self._generate_codes(node.left, code + '0')
        if node.right:
            self._generate_codes(node.right, code + '1')

    def encode(self, data: List[int]) -> Tuple[bytes, Dict]:
        """Encode data using Huffman coding."""
        frequencies = Counter(data)

        if len(frequencies) == 1:
            # Special case: only one symbol
            symbol = list(frequencies.keys())[0]
            self.codes = {symbol: '0'}
        else:
            self._build_tree(frequencies)
            self._generate_codes(self.root)

        # Encode data
        encoded_bits = ''.join(self.codes[symbol] for symbol in data)

        # Pad to a byte boundary
        padding = (8 - len(encoded_bits) % 8) % 8
        encoded_bits += '0' * padding

        encoded_bytes = bytes(int(encoded_bits[i:i+8], 2) for i in range(0, len(encoded_bits), 8))

        metadata = {
            'codes': self.codes,
            'padding': padding,
            'original_length': len(data)
        }

        return encoded_bytes, metadata

    def decode(self, encoded_bytes: bytes, metadata: Dict) -> List[int]:
        """Decode Huffman-encoded data."""
        # Create the reverse mapping
        reverse_codes = {code: symbol for symbol, code in metadata['codes'].items()}

        # Convert bytes to a bit string
        bit_string = ''.join(format(byte, '08b') for byte in encoded_bytes)

        # Remove padding
        if metadata['padding'] > 0:
            bit_string = bit_string[:-metadata['padding']]

        decoded = []
        current_code = ''

        for bit in bit_string:
            current_code += bit
            if current_code in reverse_codes:
                decoded.append(reverse_codes[current_code])
                current_code = ''
                if len(decoded) >= metadata['original_length']:
                    break

        return decoded


def calculate_entropy(data: List[int]) -> float:
    """Shannon-entropy lower bound on the compressed size, in bytes."""
    probabilities = list(Counter(data).values())
    return scipy_entropy(probabilities, base=2) * len(data) / 8  # bits -> bytes


def theoretical_minimum_size(data: List[int]) -> float:
    """Calculate the theoretical minimum compressed size in bytes."""
    return calculate_entropy(data)
enumerative_coding.py ADDED
#!/usr/bin/env python3
"""
Correct implementation of enumerative entropy coding as described in Han et al. (2008).
This version is fully self-contained, embedding all necessary data into the stream.
"""

from typing import List, Tuple
from collections import Counter
import math


class ExpGolombCoder:
    """Exponential-Golomb coding for non-negative integers."""

    @staticmethod
    def encode(n: int) -> str:
        """Encode a non-negative integer n >= 0."""
        if n < 0:
            raise ValueError("Exp-Golomb is for non-negative integers.")
        n_plus_1 = n + 1
        binary = bin(n_plus_1)[2:]
        leading_zeros = '0' * (len(binary) - 1)
        return leading_zeros + binary

    @staticmethod
    def decode(bits: str, start_pos: int = 0) -> Tuple[int, int]:
        """Decode an exp-Golomb integer from a bit string."""
        pos = start_pos
        leading_zeros = 0
        while pos < len(bits) and bits[pos] == '0':
            leading_zeros += 1
            pos += 1

        if pos >= len(bits):
            raise ValueError("Incomplete exp-Golomb code: no '1' bit found.")

        num_bits_to_read = leading_zeros + 1
        if pos + num_bits_to_read > len(bits):
            raise ValueError("Incomplete exp-Golomb code: not enough bits for value.")

        code_bits = bits[pos:pos + num_bits_to_read]
        value = int(code_bits, 2) - 1
        return value, pos + num_bits_to_read


class OptimizedBinomialTable:
    """
    Computes and caches binomial coefficients C(n, k) using Python's arbitrary
    precision integers to prevent overflow.
    """

    def __init__(self):
        self._cache = {}

    def get(self, n: int, k: int) -> int:
        if k < 0 or k > n:
            return 0
        if k == 0 or k == n:
            return 1
        if k > n // 2:
            k = n - k  # C(n, k) == C(n, n - k)

        key = (n, k)
        if key in self._cache:
            return self._cache[key]

        result = math.comb(n, k)
        self._cache[key] = result
        return result

    def __getitem__(self, n: int):
        return BinomialRow(self, n)


class BinomialRow:
    """Helper class to support table[n][k] syntax."""

    def __init__(self, table: OptimizedBinomialTable, n: int):
        self.table = table
        self.n = n

    def __getitem__(self, k: int) -> int:
        return self.table.get(self.n, k)


class EnumerativeEncoder:
    """
    An enumerative entropy coder aligned with the algorithm described in
    "Entropy Coding Using Equiprobable Partitioning" by Han et al. (2008).

    This implementation is self-contained, writing all necessary information
    (length, alphabet, counts, and positions) into the output stream.
    """

    def __init__(self):
        self.binom_table = OptimizedBinomialTable()

    def _rank(self, n: int, k: int, positions: List[int]) -> int:
        """Calculate the standard lexicographical rank of a combination."""
        index = 0
        for i, pos in enumerate(positions):
            index += self.binom_table.get(pos, i + 1)
        return index

    def _unrank(self, n: int, k: int, index: int) -> List[int]:
        """Convert a standard lexicographical rank back to a combination."""
        positions = []
        v_high = n - 1
        for i in range(k - 1, -1, -1):
            v_low = i
            # Binary search for the largest position p_i with C(p_i, i+1) <= index
            while v_low < v_high:
                mid = (v_low + v_high + 1) // 2
                if self.binom_table.get(mid, i + 1) <= index:
                    v_low = mid
                else:
                    v_high = mid - 1

            p_i = v_low
            positions.append(p_i)
            index -= self.binom_table.get(p_i, i + 1)
            v_high = p_i - 1

        positions.reverse()  # Stored descending, so reverse to ascending
        return positions

    def encode(self, data: List[int]) -> bytes:
        if not data:
            return bytes()

        n = len(data)
        symbol_counts = Counter(data)

        # Optimization: encode symbols from least frequent to most frequent
        sorted_symbols = sorted(symbol_counts.keys(), key=lambda s: symbol_counts[s])
        K = len(sorted_symbols)

        bits = ""
        # Step 1: Encode sequence length n
        bits += ExpGolombCoder.encode(n)

        # Step 2: Encode header - alphabet size (K) and the alphabet itself
        bits += ExpGolombCoder.encode(K)
        for symbol in sorted_symbols:
            bits += ExpGolombCoder.encode(symbol)

        # Step 3: Encode K-1 symbol frequencies (the last count is implied)
        for i in range(K - 1):
            bits += ExpGolombCoder.encode(symbol_counts[sorted_symbols[i]])

        # Step 4: Encode symbol locations sequentially
        available_indices = list(range(n))

        for i in range(K - 1):
            symbol = sorted_symbols[i]
            k = symbol_counts[symbol]
            if k == 0:
                continue

            current_n = len(available_indices)

            # Find the positions of the current symbol within the available slots
            symbol_positions_in_available = [
                j for j, original_idx in enumerate(available_indices) if data[original_idx] == symbol
            ]

            # Optimization: use the complement method for frequent symbols
            use_complement = k > current_n / 2
            bits += '1' if use_complement else '0'

            if use_complement:
                complement_k = current_n - k
                complement_positions = [j for j in range(current_n) if j not in symbol_positions_in_available]
                index = self._rank(current_n, complement_k, complement_positions)
            else:
                index = self._rank(current_n, k, symbol_positions_in_available)

            bits += ExpGolombCoder.encode(index)

            # Update available indices for the next symbol
            used_indices = {available_indices[j] for j in symbol_positions_in_available}
            available_indices = [idx for idx in available_indices if idx not in used_indices]

        # Convert the bit string to bytes with padding
        padding = (8 - len(bits) % 8) % 8
        bits += '0' * padding
        encoded_bytes = bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

        return encoded_bytes

    def decode(self, encoded_bytes: bytes) -> List[int]:
        if not encoded_bytes:
            return []

        # Convert bytes to a bit string
        bits = ''.join(format(byte, '08b') for byte in encoded_bytes)
        pos = 0

        # Step 1: Decode sequence length n
        n, pos = ExpGolombCoder.decode(bits, pos)

        # Step 2: Decode header - alphabet size (K) and the alphabet itself
        K, pos = ExpGolombCoder.decode(bits, pos)
        sorted_symbols = []
        for _ in range(K):
            symbol, pos = ExpGolombCoder.decode(bits, pos)
            sorted_symbols.append(symbol)

        # Step 3: Decode K-1 symbol frequencies
        counts = {}
        decoded_count_sum = 0
        for i in range(K - 1):
            symbol = sorted_symbols[i]
            count, pos = ExpGolombCoder.decode(bits, pos)
            counts[symbol] = count
            decoded_count_sum += count

        # The last symbol's count is implied
        last_symbol = sorted_symbols[-1]
        counts[last_symbol] = n - decoded_count_sum

        # Step 4: Decode symbol locations sequentially
        result = [None] * n
        available_indices = list(range(n))

        for i in range(K - 1):
            symbol = sorted_symbols[i]
            k = counts[symbol]
            if k == 0:
                continue

            current_n = len(available_indices)

            # Read the complement flag
            use_complement = (bits[pos] == '1')
            pos += 1

            index, pos = ExpGolombCoder.decode(bits, pos)

            if use_complement:
                complement_k = current_n - k
                complement_positions = self._unrank(current_n, complement_k, index)
                positions_in_available = [j for j in range(current_n) if j not in complement_positions]
            else:
                positions_in_available = self._unrank(current_n, k, index)

            # Map positions from the available list back to the original sequence
            used_indices = set()
            for rel_pos in positions_in_available:
                abs_pos = available_indices[rel_pos]
                result[abs_pos] = symbol
                used_indices.add(abs_pos)

            # Update available indices
            available_indices = [idx for idx in available_indices if idx not in used_indices]

        # The last symbol fills all remaining positions
        for i in range(n):
            if result[i] is None:
                result[i] = last_symbol

        return result
justfile ADDED
# List available commands
default:
    @just --list

# Run the compression tests
run:
    uv run python test_compression.py

# Run compression tests and generate plots
analyze:
    uv run python test_compression.py
    uv run python plot_results.py

# Install dependencies
setup:
    uv sync

# Run with smaller test sizes for quick testing
test:
    uv run python quick_test.py
    uv run python test_paper_examples.py

# Run tests from the paper
test-paper:
    uv run python test_paper_examples.py

# Generate comparison plots
plot:
    uv run python plot_results.py

# Clean up generated files
clean:
    rm -f compression_results.json

# Run python directly with uv
python *args:
    uv run python {{args}}

# Check code quality
check:
    uv run ruff check .
    uv run pyright .

# Format code
format:
    uv run ruff format .

# Fix linting issues
fix:
    uv run ruff check --fix .
@@ -0,0 +1,6 @@
+ def main():
+     print("Hello from entropy-coding-equiprobable!")
+
+
+ if __name__ == "__main__":
+     main()
plot_results.py ADDED
@@ -0,0 +1,866 @@
+ #!/usr/bin/env python3
+ """
+ Generate plots comparing compression methods.
+ """
+
+ import json
+ from pathlib import Path
+
+ import matplotlib
+
+ # Use non-interactive backend to avoid opening windows
+ # (select the backend before pyplot is imported)
+ matplotlib.use('Agg')
+
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import pandas as pd
+ import seaborn as sns
+
+ # Set style
+ plt.style.use('default')
+ sns.set_palette("husl")
+
+
+ def load_results(filename='compression_results.json'):
+     """Load compression results from JSON file."""
+     try:
+         with open(filename, 'r') as f:
+             return json.load(f)
+     except FileNotFoundError:
+         print(f"Results file {filename} not found. Run 'just run' first to generate results.")
+         return None
+
+
+ def create_comparison_dataframe(results):
+     """Convert results to pandas DataFrame for easier plotting."""
+     rows = []
+
+     for result in results:
+         name = result['name']
+         original_size = result['original_size']
+         theoretical_min = result['theoretical_minimum']
+         vocab_size = result['vocabulary_size']
+
+         # Extract method results
+         methods = result['methods']
+
+         # Determine dataset type for analysis
+         if 'uniform' in name:
+             dataset_type = 'Uniform'
+         elif 'zipf' in name:
+             dataset_type = 'Zipf'
+         elif 'geometric' in name:
+             dataset_type = 'Geometric'
+         elif 'english' in name:
+             dataset_type = 'English Text'
+         else:
+             dataset_type = 'Other'
+
+         row_base = {
+             'dataset': name,
+             'original_size': original_size,
+             'theoretical_minimum': theoretical_min,
+             'vocabulary_size': vocab_size,
+             'dataset_type': dataset_type
+         }
+
+         # Add each method as a separate row
+         for method_name, method_data in methods.items():
+             row = row_base.copy()
+             row['method'] = method_name
+
+             # Parse method details
+             if method_name.startswith('equiprobable_k'):
+                 row['method_type'] = 'Equiprobable'
+                 row['k_value'] = int(method_name.split('k')[1])
+             elif method_name == 'enumerative':
+                 row['method_type'] = 'Enumerative'
+                 row['k_value'] = None
+             elif method_name == 'huffman':
+                 row['method_type'] = 'Huffman'
+                 row['k_value'] = None
+
+             if method_data is None or method_data.get('compressed_size') is None:
+                 # Handle timeout/failure cases
+                 row['compressed_size'] = None
+                 row['compression_ratio'] = None
+                 row['bits_per_symbol'] = None
+                 row['correct'] = False
+                 row['encoding_time'] = method_data.get('encoding_time', 0) if method_data else 0
+                 row['status'] = 'timeout' if method_data and method_data.get('timed_out') else 'failed'
+             else:
+                 # Handle successful cases
+                 row['compressed_size'] = method_data['compressed_size']
+                 row['compression_ratio'] = method_data['compression_ratio']
+                 row['bits_per_symbol'] = method_data['bits_per_symbol']
+                 row['correct'] = method_data['correct']
+                 row['encoding_time'] = method_data.get('encoding_time', 0)
+                 row['status'] = 'success'
+
+             rows.append(row)
+
+         # Add theoretical minimum as a reference
+         row = row_base.copy()
+         row['method'] = 'theoretical'
+         row['method_type'] = 'Theoretical'
+         row['compressed_size'] = theoretical_min
+         row['compression_ratio'] = original_size / theoretical_min
+         row['bits_per_symbol'] = theoretical_min * 8 / original_size
+         row['correct'] = True
+         row['k_value'] = None
+         row['status'] = 'success'
+         rows.append(row)
+
+     return pd.DataFrame(rows)
+
+
+ def plot_compression_ratios(df, save_path='plots'):
+     """Plot compression ratios for different methods."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Create figure with subplots
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Compression Performance Comparison', fontsize=16, fontweight='bold')
+
+     # 1. Compression ratio by dataset (including partial data)
+     ax1 = axes[0, 0]
+
+     # Get all datasets
+     datasets = sorted(df['dataset'].unique())
+
+     # Pivot for easier plotting, but include all data (success and timeout)
+     pivot_data = df.pivot(index='dataset', columns='method', values='compression_ratio')
+
+     # Select key methods for cleaner plot, now including enumerative
+     key_methods = ['theoretical', 'huffman', 'enumerative']
+     available_methods = [col for col in key_methods if col in pivot_data.columns]
+     pivot_subset = pivot_data[available_methods]
+
+     # Plot with special handling for missing values
+     pivot_subset.plot(kind='bar', ax=ax1, width=0.8)
+     ax1.set_title('Compression Ratio by Dataset')
+     ax1.set_xlabel('Dataset')
+     ax1.set_ylabel('Compression Ratio')
+     ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
+     ax1.tick_params(axis='x', rotation=45)
+
+     # Add text annotations for timeouts
+     for i, dataset in enumerate(datasets):
+         enum_data = df[(df['dataset'] == dataset) & (df['method'] == 'enumerative')]
+         if not enum_data.empty and enum_data.iloc[0]['status'] == 'timeout':
+             ax1.text(i, ax1.get_ylim()[1] * 0.9, 'TIMEOUT',
+                      ha='center', va='center', fontsize=8,
+                      bbox=dict(boxstyle="round,pad=0.3", facecolor="red", alpha=0.7))
+
+     # 2. Bits per symbol
+     ax2 = axes[0, 1]
+     pivot_bits = df.pivot(index='dataset', columns='method', values='bits_per_symbol')
+     pivot_bits_subset = pivot_bits[available_methods]
+
+     pivot_bits_subset.plot(kind='bar', ax=ax2, width=0.8)
+     ax2.set_title('Bits per Symbol by Dataset')
+     ax2.set_xlabel('Dataset')
+     ax2.set_ylabel('Bits per Symbol')
+     ax2.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
+     ax2.tick_params(axis='x', rotation=45)
+
+     # Add timeout annotations for bits per symbol plot too
+     for i, dataset in enumerate(datasets):
+         enum_data = df[(df['dataset'] == dataset) & (df['method'] == 'enumerative')]
+         if not enum_data.empty and enum_data.iloc[0]['status'] == 'timeout':
+             ax2.text(i, ax2.get_ylim()[1] * 0.9, 'TIMEOUT',
+                      ha='center', va='center', fontsize=8,
+                      bbox=dict(boxstyle="round,pad=0.3", facecolor="red", alpha=0.7))
+
+     # 3. Enumerative encoding time by dataset size
+     ax3 = axes[1, 0]
+     enum_data = df[df['method'] == 'enumerative'].copy()
+
+     if not enum_data.empty:
+         # Create scatter plot of dataset size vs encoding time
+         successful_enum = enum_data[enum_data['status'] == 'success']
+         timeout_enum = enum_data[enum_data['status'] == 'timeout']
+
+         # Plot successful encodings
+         if not successful_enum.empty:
+             ax3.scatter(successful_enum['original_size'], successful_enum['encoding_time'],
+                         c='green', marker='o', s=60, alpha=0.7, label='Successful')
+
+         # Plot timeouts (use timeout duration)
+         if not timeout_enum.empty:
+             ax3.scatter(timeout_enum['original_size'], timeout_enum['encoding_time'],
+                         c='red', marker='X', s=80, alpha=0.9, label='Timeout')
+
+         ax3.set_title('Enumerative Encoding Time vs Dataset Size')
+         ax3.set_xlabel('Dataset Size (symbols)')
+         ax3.set_ylabel('Encoding Time (seconds)')
+         ax3.set_xscale('log')
+         ax3.set_yscale('log')
+         ax3.grid(True, alpha=0.3)
+         ax3.legend()
+
+         # Add trend line for successful cases
+         if len(successful_enum) > 1:
+             x_vals = successful_enum['original_size'].values
+             y_vals = successful_enum['encoding_time'].values
+             z = np.polyfit(np.log10(x_vals), np.log10(y_vals), 1)
+             p = np.poly1d(z)
+             x_trend = np.logspace(np.log10(min(x_vals)), np.log10(max(x_vals)), 100)
+             y_trend = 10 ** p(np.log10(x_trend))
+             ax3.plot(x_trend, y_trend, 'b--', alpha=0.5, linewidth=1,
+                      label=f'Trend (slope: {z[0]:.2f})')
+             ax3.legend()
+     else:
+         ax3.text(0.5, 0.5, 'No enumerative data available',
+                  ha='center', va='center', transform=ax3.transAxes)
+         ax3.set_title('Enumerative Encoding Time vs Dataset Size')
+
+     # 4. Efficiency vs theoretical minimum
+     ax4 = axes[1, 1]
+
+     # Calculate efficiency (how close to theoretical minimum)
+     theoretical_data = df[df['method'] == 'theoretical'].set_index('dataset')['compressed_size']
+
+     for method in ['huffman', 'enumerative']:
+         if method in df['method'].values:
+             # Get method data including both successful and failed cases
+             method_df = df[df['method'] == method].set_index('dataset')
+
+             # Only plot efficiency for successful cases
+             successful_data = method_df[method_df['compressed_size'].notna()]
+             if not successful_data.empty:
+                 efficiency = theoretical_data / successful_data['compressed_size']
+
+                 # Plot only datasets that have both theoretical and method data
+                 common_datasets = efficiency.dropna().index
+                 dataset_indices = [datasets.index(d) for d in common_datasets if d in datasets]
+                 efficiency_values = [efficiency[datasets[i]] for i in dataset_indices]
+
+                 ax4.plot(dataset_indices, efficiency_values, marker='o', label=method, linewidth=2)
+
+             # Mark timeouts/failures
+             if method == 'enumerative':
+                 failed_data = method_df[method_df['compressed_size'].isna()]
+                 if not failed_data.empty:
+                     failed_indices = [datasets.index(d) for d in failed_data.index if d in datasets]
+                     ax4.scatter(failed_indices, [0.1] * len(failed_indices),
+                                 marker='X', s=100, color='red', label=f'{method} (timeout)', zorder=5)
+
+     ax4.set_title('Efficiency vs Theoretical Minimum')
+     ax4.set_xlabel('Dataset Index')
+     ax4.set_ylabel('Efficiency (Theoretical/Actual)')
+     ax4.set_xticks(range(len(datasets)))
+     ax4.set_xticklabels([d[:15] + '...' if len(d) > 15 else d for d in datasets], rotation=45)
+     ax4.legend()
+     ax4.axhline(y=1.0, color='gray', linestyle='--', alpha=0.7, label='Perfect efficiency')
+     ax4.set_ylim(0, ax4.get_ylim()[1])
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/compression_comparison.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+
+ def plot_k_parameter_analysis(df, save_path='plots'):
+     """Analyze the effect of k parameter on EP performance."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Filter equiprobable methods
+     ep_data = df[df['method_type'] == 'Equiprobable'].copy()
+
+     if ep_data.empty:
+         print("No equiprobable data found for k parameter analysis")
+         return
+
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Equiprobable Partitioning: Effect of k Parameter', fontsize=16, fontweight='bold')
+
+     # 1. Compression ratio vs k for different datasets
+     ax1 = axes[0, 0]
+
+     datasets_to_plot = ['small_uniform_10', 'medium_zipf_256', 'large_geometric_64', 'english_text']
+     for dataset in datasets_to_plot:
+         if dataset in ep_data['dataset'].values:
+             dataset_data = ep_data[ep_data['dataset'] == dataset].sort_values('k_value')
+             ax1.plot(dataset_data['k_value'], dataset_data['compression_ratio'],
+                      marker='o', label=dataset, linewidth=2)
+
+     ax1.set_title('Compression Ratio vs k Parameter')
+     ax1.set_xlabel('k (Number of Partitions)')
+     ax1.set_ylabel('Compression Ratio')
+     ax1.legend()
+     ax1.grid(True, alpha=0.3)
+
+     # 2. Bits per symbol vs k
+     ax2 = axes[0, 1]
+
+     for dataset in datasets_to_plot:
+         if dataset in ep_data['dataset'].values:
+             dataset_data = ep_data[ep_data['dataset'] == dataset].sort_values('k_value')
+             ax2.plot(dataset_data['k_value'], dataset_data['bits_per_symbol'],
+                      marker='s', label=dataset, linewidth=2)
+
+     ax2.set_title('Bits per Symbol vs k Parameter')
+     ax2.set_xlabel('k (Number of Partitions)')
+     ax2.set_ylabel('Bits per Symbol')
+     ax2.legend()
+     ax2.grid(True, alpha=0.3)
+
+     # 3. Optimal k by dataset type
+     ax3 = axes[1, 0]
+
+     # Find optimal k for each dataset
+     optimal_k = {}
+     for dataset in ep_data['dataset'].unique():
+         dataset_data = ep_data[ep_data['dataset'] == dataset]
+         if len(dataset_data) > 0:
+             best_idx = dataset_data['compression_ratio'].idxmax()
+             optimal_k[dataset] = dataset_data.loc[best_idx, 'k_value']
+
+     if optimal_k:
+         datasets = list(optimal_k.keys())
+         k_values = list(optimal_k.values())
+
+         colors = ['red' if 'uniform' in d else 'blue' if 'zipf' in d else 'green' if 'geometric' in d else 'orange'
+                   for d in datasets]
+
+         bars = ax3.bar(range(len(datasets)), k_values, color=colors, alpha=0.7)
+         ax3.set_title('Optimal k Value by Dataset')
+         ax3.set_xlabel('Dataset')
+         ax3.set_ylabel('Optimal k')
+         ax3.set_xticks(range(len(datasets)))
+         ax3.set_xticklabels([d[:15] + '...' if len(d) > 15 else d for d in datasets], rotation=45)
+
+         # Add value labels on bars
+         for bar, k_val in zip(bars, k_values):
+             ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
+                      str(int(k_val)), ha='center', va='bottom')
+
+     # 4. Performance improvement over k=2
+     ax4 = axes[1, 1]
+
+     for dataset in datasets_to_plot:
+         if dataset in ep_data['dataset'].values:
+             dataset_data = ep_data[ep_data['dataset'] == dataset].sort_values('k_value')
+             if len(dataset_data) >= 2:
+                 baseline = dataset_data[dataset_data['k_value'] == 2]['compression_ratio'].iloc[0]
+                 improvement = (dataset_data['compression_ratio'] / baseline - 1) * 100
+                 ax4.plot(dataset_data['k_value'], improvement,
+                          marker='^', label=dataset, linewidth=2)
+
+     ax4.set_title('Performance Improvement over k=2 (%)')
+     ax4.set_xlabel('k (Number of Partitions)')
+     ax4.set_ylabel('Improvement (%)')
+     ax4.legend()
+     ax4.grid(True, alpha=0.3)
+     ax4.axhline(y=0, color='black', linestyle='-', alpha=0.5)
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/k_parameter_analysis.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+
+ def plot_distribution_comparison(df, save_path='plots'):
+     """Compare performance across different data distributions."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Categorize datasets by distribution
+     def get_distribution(name):
+         if 'uniform' in name:
+             return 'Uniform'
+         elif 'zipf' in name:
+             return 'Zipf'
+         elif 'geometric' in name:
+             return 'Geometric'
+         elif 'english' in name:
+             return 'Natural Text'
+         else:
+             return 'Other'
+
+     df['distribution'] = df['dataset'].apply(get_distribution)
+
+     # Filter working methods
+     df_plot = df[df['correct'] | (df['method'] == 'theoretical')].copy()
+
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Performance by Data Distribution', fontsize=16, fontweight='bold')
+
+     # 1. Box plot of compression ratios by distribution
+     ax1 = axes[0, 0]
+
+     methods_to_plot = ['huffman', 'enumerative']
+     plot_data = df_plot[df_plot['method'].isin(methods_to_plot)]
+
+     if not plot_data.empty:
+         sns.boxplot(data=plot_data, x='distribution', y='compression_ratio', hue='method', ax=ax1)
+         ax1.set_title('Compression Ratio Distribution by Data Type')
+         ax1.set_xlabel('Data Distribution')
+         ax1.set_ylabel('Compression Ratio')
+         ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
+
+     # 2. Enumerative vs Huffman efficiency by distribution
+     ax2 = axes[0, 1]
+
+     # Calculate relative efficiency vs Huffman for enumerative method
+     huffman_data = df_plot[df_plot['method'] == 'huffman'].set_index('dataset')['compression_ratio']
+     enum_data = df_plot[df_plot['method'] == 'enumerative'].set_index('dataset')
+
+     if not enum_data.empty and not huffman_data.empty:
+         # Only compare datasets where both methods succeeded
+         common_datasets = set(huffman_data.index) & set(enum_data.index)
+
+         if common_datasets:
+             distribution_ratios = {}
+
+             for dataset in common_datasets:
+                 enum_ratio = enum_data.loc[dataset, 'compression_ratio']
+                 huffman_ratio = huffman_data.loc[dataset]
+                 relative_efficiency = enum_ratio / huffman_ratio
+
+                 # Get distribution type
+                 dist_type = df_plot[df_plot['dataset'] == dataset]['distribution'].iloc[0]
+                 if dist_type not in distribution_ratios:
+                     distribution_ratios[dist_type] = []
+                 distribution_ratios[dist_type].append(relative_efficiency)
+
+             # Plot box plots for each distribution
+             if distribution_ratios:
+                 distributions = list(distribution_ratios.keys())
+                 ratios = [distribution_ratios[dist] for dist in distributions]
+
+                 bp = ax2.boxplot(ratios, tick_labels=distributions, patch_artist=True)
+
+                 # Color the boxes
+                 colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
+                 for patch, color in zip(bp['boxes'], colors[:len(bp['boxes'])]):
+                     patch.set_facecolor(color)
+                     patch.set_alpha(0.7)
+
+                 ax2.set_title('Enumerative Efficiency Relative to Huffman')
+                 ax2.set_xlabel('Data Distribution')
+                 ax2.set_ylabel('Enumerative Ratio / Huffman Ratio')
+                 ax2.axhline(y=1.0, color='red', linestyle='--', alpha=0.7, label='Equal to Huffman')
+                 ax2.legend()
+                 ax2.grid(True, alpha=0.3)
+
+     # 3. Vocabulary size effect
+     ax3 = axes[1, 0]
+
+     # Plot compression ratio vs vocabulary size
+     vocab_data = df_plot[df_plot['method'].isin(['huffman', 'enumerative'])]
+
+     for method in ['huffman', 'enumerative']:
+         method_subset = vocab_data[vocab_data['method'] == method]
+         if not method_subset.empty:
+             ax3.scatter(method_subset['vocabulary_size'], method_subset['compression_ratio'],
+                         label=method, alpha=0.7, s=60)
+
+     ax3.set_title('Compression vs Vocabulary Size')
+     ax3.set_xlabel('Vocabulary Size')
+     ax3.set_ylabel('Compression Ratio')
+     ax3.set_xscale('log')
+     ax3.legend()
+     ax3.grid(True, alpha=0.3)
+
+     # 4. Dataset size effect
+     ax4 = axes[1, 1]
+
+     for method in ['huffman', 'enumerative']:
+         method_subset = df_plot[df_plot['method'] == method]
+         if not method_subset.empty:
+             ax4.scatter(method_subset['original_size'], method_subset['compression_ratio'],
+                         label=method, alpha=0.7, s=60)
+
+     ax4.set_title('Compression vs Dataset Size')
+     ax4.set_xlabel('Original Size (bytes)')
+     ax4.set_ylabel('Compression Ratio')
+     ax4.set_xscale('log')
+     ax4.legend()
+     ax4.grid(True, alpha=0.3)
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/distribution_comparison.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+
+ def generate_summary_table(df):
+     """Generate a summary table of results."""
+     print("\n" + "="*130)
+     print("DETAILED COMPRESSION ANALYSIS")
+     print("="*130)
+
+     methods_order = ['theoretical', 'huffman', 'enumerative']
+
+     print(f"{'Dataset':<25} {'Method':<15} {'Size':<8} {'Ratio':<7} {'Bits/Sym':<8} {'Efficiency':<10} {'Time':<8}")
+     print("-" * 130)
+
+     for dataset in sorted(df['dataset'].unique()):
+         dataset_data = df[df['dataset'] == dataset]
+         theoretical_data = dataset_data[dataset_data['method'] == 'theoretical']
+
+         if not theoretical_data.empty:
+             theoretical_ratio = theoretical_data['compression_ratio'].iloc[0]
+
+             for method in methods_order:
+                 method_data = dataset_data[dataset_data['method'] == method]
+                 if not method_data.empty:
+                     row = method_data.iloc[0]
+
+                     if row['compressed_size'] is not None:
+                         # Successful compression
+                         efficiency = row['compression_ratio'] / theoretical_ratio
+                         time_str = f"{row.get('encoding_time', 0):.3f}s" if 'encoding_time' in row else "N/A"
+
+                         print(f"{dataset:<25} {method:<15} {row['compressed_size']:<8.0f} "
+                               f"{row['compression_ratio']:<7.2f} {row['bits_per_symbol']:<8.2f} "
+                               f"{efficiency:<10.3f} {time_str:<8}")
+                     else:
+                         # Timeout/failure case
+                         time_str = f"{row.get('encoding_time', 0):.1f}s" if 'encoding_time' in row else "N/A"
+                         status = "TIMEOUT" if row.get('status') == 'timeout' else "FAILED"
+
+                         print(f"{dataset:<25} {method:<15} {status:<8} {'N/A':<7} {'N/A':<8} "
+                               f"{'N/A':<10} {time_str:<8}")
+
+         print("-" * 130)
+
+
+ def plot_enumerative_timeout_analysis(df, save_path='plots'):
+     """Plot analysis focusing only on enumerative encoding times and timeouts."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Filter to only enumerative method data
+     enum_df = df[df['method'] == 'enumerative'].copy()
+
+     if enum_df.empty:
+         print("No enumerative data found for timeout analysis")
+         return
+
+     fig, ax = plt.subplots(1, 1, figsize=(12, 8))
+     fig.suptitle('Enumerative Encoding: Computation Time and Timeouts',
+                  fontsize=14, fontweight='bold')
+
+     # Extract data characteristics for analysis
+     enum_stats = []
+     for _, row in enum_df.iterrows():
+         dataset_name = row['dataset']
+         vocab_size = row['vocabulary_size']
+         original_size = row['original_size']
+
+         # Determine dataset type
+         if 'uniform' in dataset_name:
+             dataset_type = 'Uniform'
+             color = 'blue'
+             marker = 'o'
+         elif 'zipf' in dataset_name:
+             dataset_type = 'Zipf'
+             color = 'red'
+             marker = 's'
+         elif 'geometric' in dataset_name:
+             dataset_type = 'Geometric'
+             color = 'green'
+             marker = '^'
+         elif 'english' in dataset_name:
+             dataset_type = 'English Text'
+             color = 'purple'
+             marker = 'D'
+         else:
+             dataset_type = 'Other'
+             color = 'gray'
+             marker = 'x'
+
+         # Get timing and timeout info
+         timed_out = row['status'] == 'timeout'
+         encoding_time = row.get('encoding_time', 0)  # Default to 0 if not available
+
+         enum_stats.append({
+             'dataset': dataset_name,
+             'vocab_size': vocab_size,
+             'original_size': original_size,
+             'dataset_type': dataset_type,
+             'color': color,
+             'marker': marker,
+             'timed_out': timed_out,
+             'encoding_time': encoding_time
+         })
+
+     if enum_stats:
+         stats_df = pd.DataFrame(enum_stats)
+
+         # Separate successful and timeout data
+         successful_data = stats_df[~stats_df['timed_out']]
+         timeout_data = stats_df[stats_df['timed_out']]
+
+         # Plot successful encodings by dataset type
+         scatter_success = None
+         for dataset_type in successful_data['dataset_type'].unique():
+             type_data = successful_data[successful_data['dataset_type'] == dataset_type]
+
+             if not type_data.empty:
+                 # Use log scale for encoding time as color intensity
+                 times_log = np.log10(np.maximum(type_data['encoding_time'].values, 0.001))
+
+                 scatter = ax.scatter(type_data['vocab_size'], type_data['original_size'],
+                                      c=times_log, cmap='viridis',
+                                      marker=type_data['marker'].iloc[0],
+                                      s=100, alpha=0.8, edgecolors='black', linewidth=0.5,
+                                      label=f'{dataset_type}')
+
+                 if scatter_success is None:  # Use first successful scatter for colorbar
+                     scatter_success = scatter
+
+         # Plot all timeouts with a single legend entry
+         if not timeout_data.empty:
+             ax.scatter(timeout_data['vocab_size'], timeout_data['original_size'],
+                        color='red', marker='X', s=150, alpha=0.9,
+                        edgecolors='darkred', linewidth=1,
+                        label='Timeout')
+
+         # Add colorbar for encoding time
+         if scatter_success is not None:
+             cbar = plt.colorbar(scatter_success, ax=ax)
+             cbar.set_label('log₁₀(Encoding Time in seconds)')
+
+         ax.set_xlabel('Vocabulary Size')
+         ax.set_ylabel('Dataset Size (symbols)')
+         ax.set_xscale('log')
+         ax.set_yscale('log')
+         ax.grid(True, alpha=0.3)
+
+         # Position legend below the plot to avoid overlap
+         ax.legend(bbox_to_anchor=(0.5, -0.15), loc='upper center', ncol=3)
+
+         # Annotate points with timing information
+         for _, row in stats_df.iterrows():
+             if row['timed_out']:
+                 time_label = f"TO:{row['encoding_time']:.1f}s"
+             else:
+                 time_label = f"{row['encoding_time']:.2f}s"
+
+             ax.annotate(time_label,
+                         (row['vocab_size'], row['original_size']),
+                         xytext=(5, 5), textcoords='offset points',
+                         fontsize=8, alpha=0.8)
+
+     plt.tight_layout()
+     plt.savefig(f'{save_path}/enumerative_timeout_analysis.png', dpi=300, bbox_inches='tight')
+     plt.close(fig)
+
+     # Print enumerative timeout summary
+     print("\nEnumerative Encoding Performance Summary:")
+     print("=" * 50)
+
+     enum_success = enum_df[enum_df['status'] == 'success']
+     enum_timeout = enum_df[enum_df['status'] == 'timeout']
+
+     print(f"Successful encodings: {len(enum_success)}")
+     print(f"Timed out encodings: {len(enum_timeout)}")
+
+     if not enum_success.empty:
+         avg_time = enum_success['encoding_time'].mean()
+         max_time = enum_success['encoding_time'].max()
+         min_time = enum_success['encoding_time'].min()
+         print(f"Encoding time stats (successful): min={min_time:.3f}s, avg={avg_time:.3f}s, max={max_time:.3f}s")
+
+     if not enum_timeout.empty:
+         print("Datasets that timed out:")
+         for _, row in enum_timeout.iterrows():
+             print(f"  {row['dataset']}: vocab={row['vocabulary_size']}, size={row['original_size']}")
+
+     print("Performance by dataset type:")
+     for dtype in enum_df['dataset_type'].unique():
+         type_data = enum_df[enum_df['dataset_type'] == dtype]
+         success_rate = len(type_data[type_data['status'] == 'success']) / len(type_data)
+         print(f"  {dtype}: {success_rate:.1%} success rate")
+
+
+ def plot_compression_time_comparison(df, save_path='plots'):
+     """Plot comparison of compression times between different algorithms."""
+     Path(save_path).mkdir(exist_ok=True)
+
+     # Filter to methods that have timing data
+     timing_data = df[df['encoding_time'].notna() & (df['encoding_time'] > 0)].copy()
+
+     if timing_data.empty:
+         print("No timing data available for compression time comparison")
+         return
+
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+     fig.suptitle('Compression Time Comparison: Huffman vs Enumerative', fontsize=16, fontweight='bold')
+
+     # 1. Time comparison by dataset (successful cases only)
+     ax1 = axes[0, 0]
+
+     huffman_times = timing_data[timing_data['method'] == 'huffman']
+     enum_times = timing_data[(timing_data['method'] == 'enumerative') & (timing_data['status'] == 'success')]
+
+     if not huffman_times.empty and not enum_times.empty:
+         # Get common datasets where both methods succeeded
+         common_datasets = set(huffman_times['dataset']) & set(enum_times['dataset'])
+
+         if common_datasets:
+             huffman_common = huffman_times[huffman_times['dataset'].isin(common_datasets)].sort_values('dataset')
+             enum_common = enum_times[enum_times['dataset'].isin(common_datasets)].sort_values('dataset')
+
+             x = np.arange(len(common_datasets))
+             width = 0.35
+
+             ax1.bar(x - width/2, huffman_common['encoding_time'], width,
+                     label='Huffman', alpha=0.8, color='blue')
+             ax1.bar(x + width/2, enum_common['encoding_time'], width,
+                     label='Enumerative', alpha=0.8, color='green')
+
+             ax1.set_title('Encoding Time by Dataset (Successful Cases)')
+             ax1.set_xlabel('Dataset')
+             ax1.set_ylabel('Encoding Time (seconds)')
+             ax1.set_yscale('log')
+             ax1.set_xticks(x)
+             ax1.set_xticklabels([d[:15] + '...' if len(d) > 15 else d for d in sorted(common_datasets)], rotation=45)
+             ax1.legend()
+             ax1.grid(True, alpha=0.3)
+
+     # 2. Time vs Dataset Size scatter plot
+     ax2 = axes[0, 1]
+
+     for method in ['huffman', 'enumerative']:
+         method_data = timing_data[timing_data['method'] == method]
+         if not method_data.empty:
+             successful = method_data[method_data['status'] == 'success']
+             if not successful.empty:
+                 ax2.scatter(successful['original_size'], successful['encoding_time'],
+                             label=f'{method} (success)', alpha=0.7, s=60)
+
+             # For enumerative, also show timeouts
+             if method == 'enumerative':
+                 timeouts = method_data[method_data['status'] == 'timeout']
+                 if not timeouts.empty:
+                     ax2.scatter(timeouts['original_size'], timeouts['encoding_time'],
+                                 label='enumerative (timeout)', alpha=0.9, s=80, marker='X', color='red')
+
+     ax2.set_title('Encoding Time vs Dataset Size')
+     ax2.set_xlabel('Dataset Size (symbols)')
+     ax2.set_ylabel('Encoding Time (seconds)')
+     ax2.set_xscale('log')
+     ax2.set_yscale('log')
+     ax2.legend()
+     ax2.grid(True, alpha=0.3)
+
+     # 3. Speed ratio (Enumerative/Huffman) by dataset characteristics
+     ax3 = axes[1, 0]
+
+     if not huffman_times.empty and not enum_times.empty:
+         # Calculate speed ratios for common successful datasets
+         huffman_dict = dict(zip(huffman_times['dataset'], huffman_times['encoding_time']))
+         enum_successful = enum_times[enum_times['status'] == 'success']
+
+         ratios = []
+         dataset_types = []
+         vocab_sizes = []
+
+         for _, row in enum_successful.iterrows():
+             dataset = row['dataset']
+             if dataset in huffman_dict:
+                 ratio = row['encoding_time'] / huffman_dict[dataset]
+                 ratios.append(ratio)
+                 dataset_types.append(row['dataset_type'])
+                 vocab_sizes.append(row['vocabulary_size'])
+
+         if ratios:
+             # Color by dataset type
+             colors = {'Uniform': 'blue', 'Zipf': 'red', 'Geometric': 'green', 'English Text': 'purple'}
+             type_colors = [colors.get(dt, 'gray') for dt in dataset_types]
+
+             ax3.scatter(vocab_sizes, ratios, c=type_colors, alpha=0.7, s=80)
+
+             # Add legend for dataset types
+             for dtype, color in colors.items():
+                 if dtype in dataset_types:
+                     ax3.scatter([], [], c=color, label=dtype, alpha=0.7, s=80)
+
+             ax3.set_title('Speed Ratio (Enumerative/Huffman) vs Vocabulary Size')
+             ax3.set_xlabel('Vocabulary Size')
+             ax3.set_ylabel('Time Ratio (Enum/Huffman)')
+             ax3.set_xscale('log')
+             ax3.set_yscale('log')
+             ax3.axhline(y=1.0, color='black', linestyle='--', alpha=0.5, label='Equal speed')
+             ax3.legend()
+             ax3.grid(True, alpha=0.3)
+
+     # 4. Time distribution by algorithm
+     ax4 = axes[1, 1]
+
+     huffman_successful = huffman_times[huffman_times['status'] == 'success']['encoding_time']
+     enum_successful_times = enum_times[enum_times['status'] == 'success']['encoding_time']
+
+     time_data = []
+     labels = []
+
+     if not huffman_successful.empty:
+         time_data.append(huffman_successful.values)
+         labels.append('Huffman')
+
+     if not enum_successful_times.empty:
+         time_data.append(enum_successful_times.values)
+         labels.append('Enumerative')
+
+     if time_data:
+         bp = ax4.boxplot(time_data, tick_labels=labels, patch_artist=True)
+
+         # Color the boxes
+         colors = ['lightblue', 'lightgreen']
+         for patch, color in zip(bp['boxes'], colors[:len(bp['boxes'])]):
809
+ patch.set_facecolor(color)
810
+ patch.set_alpha(0.7)
811
+
812
+ ax4.set_title('Encoding Time Distribution')
813
+ ax4.set_ylabel('Encoding Time (seconds)')
814
+ ax4.set_yscale('log')
815
+ ax4.grid(True, alpha=0.3)
816
+
817
+ plt.tight_layout()
818
+ plt.savefig(f'{save_path}/compression_time_comparison.png', dpi=300, bbox_inches='tight')
819
+ plt.close(fig)
820
+
821
+ # Print timing summary
822
+ print("\nCompression Time Summary:")
823
+ print("=" * 50)
824
+
825
+ if not huffman_times.empty:
826
+ huff_stats = huffman_times['encoding_time']
827
+ print(f"Huffman encoding times:")
828
+ print(f" Min: {huff_stats.min():.6f}s, Avg: {huff_stats.mean():.6f}s, Max: {huff_stats.max():.6f}s")
829
+
830
+ if not enum_successful_times.empty:
831
+ enum_stats = enum_successful_times
832
+ print(f"Enumerative encoding times (successful):")
833
+ print(f" Min: {enum_stats.min():.3f}s, Avg: {enum_stats.mean():.3f}s, Max: {enum_stats.max():.3f}s")
834
+ print(f" Speed vs Huffman: {enum_stats.mean() / huffman_times['encoding_time'].mean():.0f}x slower on average")
835
+
836
+
837
+ def main():
838
+ """Generate all plots and analysis."""
839
+ results = load_results()
840
+ if results is None:
841
+ return
842
+
843
+ print("Loading compression results...")
844
+ df = create_comparison_dataframe(results)
845
+
846
+ print("Generating plots...")
847
+
848
+ # Create plots directory
849
+ Path('plots').mkdir(exist_ok=True)
850
+
851
+ # Generate all plots
852
+ plot_compression_ratios(df)
853
+ plot_k_parameter_analysis(df)
854
+ plot_distribution_comparison(df)
855
+ plot_enumerative_timeout_analysis(df)
856
+ plot_compression_time_comparison(df)
857
+
858
+ # Generate summary
859
+ generate_summary_table(df)
860
+
861
+ print("\nPlots saved to 'plots/' directory")
862
+ print("Analysis complete!")
863
+
864
+
865
+ if __name__ == "__main__":
866
+ main()
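Panel 3's speed-ratio logic reduces to joining the two methods' timings on the datasets where both succeeded. A minimal plain-Python sketch of that join, using made-up timings rather than the repo's measurements:

```python
# Hypothetical per-dataset Huffman timings (seconds); not the repo's data
huffman_times = {"uniform_10k": 0.5, "zipf_10k": 0.25}

# Hypothetical enumerative runs: (dataset, status, encoding_time)
enumerative_runs = [
    ("uniform_10k", "success", 1.0),
    ("zipf_10k", "timeout", 30.0),   # timed out: excluded from ratios
    ("zipf_100k", "success", 5.0),   # no Huffman baseline: excluded too
]

# Keep only enumerative successes that have a Huffman baseline,
# mirroring the dict-based join the plotting code performs
ratios = {
    dataset: t / huffman_times[dataset]
    for dataset, status, t in enumerative_runs
    if status == "success" and dataset in huffman_times
}
# ratios == {"uniform_10k": 2.0}
```

Timed-out runs are deliberately excluded here, which is why the plot shows them separately as red 'X' markers rather than folding them into the ratio panel.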
plots/compression_comparison.png ADDED

Git LFS Details

  • SHA256: 10c0b1cf4fb1eef860fc1bbdab38de2f8ee4eda07f76379bd8a999561a074301
  • Pointer size: 131 Bytes
  • Size of remote file: 612 kB
plots/compression_time_comparison.png ADDED

Git LFS Details

  • SHA256: c040301aed29368c2c9bc7ccda8bbe1c7754840d9cbfe00b471c834af70c1bd2
  • Pointer size: 131 Bytes
  • Size of remote file: 509 kB
plots/distribution_comparison.png ADDED

Git LFS Details

  • SHA256: 0042e9e527c9b6bab436e73e7dc07805741a15f529ea60783d0862419bbd004f
  • Pointer size: 131 Bytes
  • Size of remote file: 342 kB
plots/enumerative_timeout_analysis.png ADDED

Git LFS Details

  • SHA256: 09183de8cb5a170cede05f43fd5c2713e45619922980da81fe4b299d929f17c4
  • Pointer size: 131 Bytes
  • Size of remote file: 175 kB
pyproject.toml ADDED
@@ -0,0 +1,20 @@
+ [project]
+ name = "entropy-coding-equiprobable"
+ version = "0.1.0"
+ description = "Entropy coding with equiprobable partitioning: enumerative coding vs. Huffman experiments"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "datasets>=4.0.0",
+     "matplotlib>=3.10.3",
+     "numpy>=2.3.1",
+     "requests>=2.32.4",
+     "scipy>=1.16.0",
+     "seaborn>=0.13.2",
+ ]
+
+ [dependency-groups]
+ dev = [
+     "pyright>=1.1.403",
+     "ruff>=0.12.3",
+ ]
quick_test.py ADDED
@@ -0,0 +1,25 @@
+ #!/usr/bin/env python3
+ """Quick test with smaller datasets to verify functionality."""
+
+ import numpy as np
+
+ from test_compression import generate_iid_data, compress_and_compare, print_results
+
+
+ def main():
+     np.random.seed(42)
+
+     # Test with a small uniform dataset
+     print("Testing with small uniform dataset...")
+     data = generate_iid_data(100, 10, 'uniform')
+     results = compress_and_compare(data, "small_test")
+     print_results(results)
+
+     # Test with a non-uniform (Zipf) distribution
+     print("\nTesting with small Zipf dataset...")
+     data = generate_iid_data(100, 10, 'zipf')
+     results = compress_and_compare(data, "small_zipf_test")
+     print_results(results)
+
+
+ if __name__ == "__main__":
+     main()
test_compression.py ADDED
@@ -0,0 +1,310 @@
+ #!/usr/bin/env python3
+ """
+ Test entropy coding with equiprobable partitioning on various datasets.
+ Compares against Huffman coding and theoretical limits.
+ """
+
+ import json
+ import signal
+ import time
+ from contextlib import contextmanager
+ from typing import Dict, List
+
+ import numpy as np
+ from datasets import load_dataset
+
+ from enumerative_coding import EnumerativeEncoder
+ from entropy_coding import HuffmanEncoder, theoretical_minimum_size
+
+
+ class TimeoutError(Exception):
+     """Raised when a timeout occurs."""
+     pass
+
+
+ @contextmanager
+ def timeout(seconds):
+     """Context manager that raises TimeoutError after `seconds` (Unix only)."""
+     def timeout_handler(signum, frame):
+         raise TimeoutError(f"Operation timed out after {seconds} seconds")
+
+     # Install the signal handler and start the alarm
+     old_handler = signal.signal(signal.SIGALRM, timeout_handler)
+     signal.alarm(seconds)
+
+     try:
+         yield
+     finally:
+         # Restore the old signal handler and disable the alarm
+         signal.signal(signal.SIGALRM, old_handler)
+         signal.alarm(0)
+
+
+ def generate_iid_data(size: int, vocab_size: int, distribution: str = 'uniform') -> List[int]:
+     """
+     Generate i.i.d. data with the specified vocabulary size and distribution.
+
+     Args:
+         size: Number of symbols to generate
+         vocab_size: Size of the vocabulary (number of unique symbols)
+         distribution: 'uniform', 'zipf', or 'geometric'
+     """
+     if distribution == 'uniform':
+         return list(np.random.randint(0, vocab_size, size))
+     elif distribution == 'zipf':
+         # Zipf distribution (power law)
+         probabilities = 1 / np.arange(1, vocab_size + 1)
+         probabilities /= probabilities.sum()
+         return list(np.random.choice(vocab_size, size, p=probabilities))
+     elif distribution == 'geometric':
+         # Truncated geometric distribution
+         p = 0.3  # Success probability for the geometric distribution
+         probabilities = [(1 - p) ** i * p for i in range(vocab_size - 1)]
+         probabilities.append((1 - p) ** (vocab_size - 1))
+         probabilities = np.array(probabilities)
+         probabilities /= probabilities.sum()
+         return list(np.random.choice(vocab_size, size, p=probabilities))
+     else:
+         raise ValueError(f"Unknown distribution: {distribution}")
+
+
+ def download_english_text() -> str:
+     """Download English text from Hugging Face."""
+     print("Downloading English text dataset...")
+     # Use the WikiText-2 dataset as a source of English text
+     dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:10%]")
+
+     # Concatenate the first 1000 text samples
+     text = " ".join(dataset['text'][:1000])
+
+     # Collapse runs of whitespace
+     text = " ".join(text.split())
+
+     return text
+
+
+ def text_to_symbols(text: str) -> List[int]:
+     """Convert text to a list of integer symbols (byte values)."""
+     return list(text.encode('utf-8'))
+
+
+ def symbols_to_text(symbols: List[int]) -> str:
+     """Convert a list of integer symbols back to text."""
+     return bytes(symbols).decode('utf-8', errors='ignore')
+
+
+ def compress_and_compare(data: List[int], name: str, timeout_seconds: int = 30) -> Dict:
+     """
+     Compress data using different methods and compare the results.
+
+     Args:
+         data: Input data as a list of integers
+         name: Name for this dataset
+         timeout_seconds: Timeout in seconds for enumerative coding
+     """
+     original_size = len(data)
+
+     results = {
+         'name': name,
+         'original_size': original_size,
+         'vocabulary_size': len(set(data)),
+         'theoretical_minimum': theoretical_minimum_size(data),
+         'methods': {}
+     }
+
+     # Huffman coding (always completes quickly)
+     print("  Compressing with Huffman coding...")
+     huffman_start_time = time.time()
+     huffman = HuffmanEncoder()
+     huffman_encoded, huffman_metadata = huffman.encode(data)
+     huffman_size = len(huffman_encoded)
+
+     # Verify Huffman decoding
+     huffman_decoded = huffman.decode(huffman_encoded, huffman_metadata)
+     huffman_correct = huffman_decoded == data
+     huffman_encoding_time = time.time() - huffman_start_time
+
+     results['methods']['huffman'] = {
+         'compressed_size': huffman_size,
+         'compression_ratio': original_size / huffman_size,
+         'bits_per_symbol': huffman_size * 8 / len(data),
+         'correct': huffman_correct,
+         'encoding_time': huffman_encoding_time
+     }
+
+     # Enumerative entropy coding (may time out on large datasets)
+     print(f"  Compressing with enumerative entropy coding (timeout: {timeout_seconds}s)...")
+     start_time = time.time()
+     try:
+         with timeout(timeout_seconds):
+             ep_encoder = EnumerativeEncoder()
+             ep_encoded = ep_encoder.encode(data)
+             ep_size = len(ep_encoded)
+
+             # Verify decoding
+             ep_decoded = ep_encoder.decode(ep_encoded)
+             ep_correct = ep_decoded == data
+
+             encoding_time = time.time() - start_time
+             results['methods']['enumerative'] = {
+                 'compressed_size': ep_size,
+                 'compression_ratio': original_size / ep_size,
+                 'bits_per_symbol': ep_size * 8 / len(data),
+                 'correct': ep_correct,
+                 'encoding_time': encoding_time
+             }
+     except TimeoutError as e:
+         encoding_time = time.time() - start_time
+         print(f"  Enumerative coding timed out: {e}")
+         results['methods']['enumerative'] = {
+             'compressed_size': None,
+             'compression_ratio': None,
+             'bits_per_symbol': None,
+             'correct': False,
+             'encoding_time': encoding_time,
+             'timed_out': True
+         }
+     except Exception as e:
+         encoding_time = time.time() - start_time
+         print(f"  Enumerative coding failed: {e}")
+         results['methods']['enumerative'] = {
+             'compressed_size': None,
+             'compression_ratio': None,
+             'bits_per_symbol': None,
+             'correct': False,
+             'encoding_time': encoding_time,
+             'timed_out': False,
+             'error': str(e)
+         }
+
+     return results
+
+
+ def main():
+     """Run compression tests on various datasets."""
+     np.random.seed(42)  # For reproducibility
+
+     all_results = []
+
+     # Test configurations: (name, size, vocab_size, distribution)
+     test_configs = [
+         # Uniform distribution (one representative case)
+         ("uniform_10k_v256", 10000, 256, 'uniform'),
+
+         # Zipf distribution with different vocabulary sizes
+         ("zipf_10k_v16", 10000, 16, 'zipf'),
+         ("zipf_10k_v64", 10000, 64, 'zipf'),
+         ("zipf_10k_v256", 10000, 256, 'zipf'),
+         ("zipf_5k_v16", 5000, 16, 'zipf'),
+         ("zipf_5k_v64", 5000, 64, 'zipf'),
+         ("zipf_5k_v256", 5000, 256, 'zipf'),
+
+         # Geometric distribution with different vocabulary sizes
+         ("geometric_10k_v16", 10000, 16, 'geometric'),
+         ("geometric_10k_v64", 10000, 64, 'geometric'),
+         ("geometric_10k_v256", 10000, 256, 'geometric'),
+         ("geometric_5k_v16", 5000, 16, 'geometric'),
+         ("geometric_5k_v64", 5000, 64, 'geometric'),
+         ("geometric_5k_v256", 5000, 256, 'geometric'),
+
+         # Large-scale test with the most interesting distribution
+         ("zipf_100k_v256", 100000, 256, 'zipf'),
+     ]
+
+     # Generate and test i.i.d. datasets
+     print("Testing i.i.d. datasets...")
+     for name, size, vocab_size, distribution in test_configs:
+         print(f"\nTesting {name} (size={size}, vocab={vocab_size}, dist={distribution})...")
+         data = generate_iid_data(size, vocab_size, distribution)
+         results = compress_and_compare(data, name)
+         all_results.append(results)
+         print_results(results)
+
+     # Test English text
+     print("\nTesting English text...")
+     english_text = download_english_text()
+     print(f"Text length: {len(english_text)} characters")
+
+     # Convert to symbols; use a subset that is computationally
+     # feasible for enumerative coding
+     text_symbols = text_to_symbols(english_text)
+     max_text_length = 2000  # Should complete within the timeout
+     if len(text_symbols) > max_text_length:
+         original_length = len(text_symbols)
+         text_symbols = text_symbols[:max_text_length]
+         print(f"Using text subset of {len(text_symbols)} symbols (original: {original_length})")
+
+     text_results = compress_and_compare(text_symbols, "english_text")
+     all_results.append(text_results)
+     print_results(text_results)
+
+     # Save all results to JSON
+     with open('compression_results.json', 'w') as f:
+         json.dump(all_results, f, indent=2)
+
+     print("\nResults saved to compression_results.json")
+
+     # Generate summary table
+     print("\n" + "=" * 80)
+     print("COMPRESSION SUMMARY")
+     print("=" * 80)
+     print(f"{'Dataset':<20} {'Original':<10} {'Theoretical':<12} {'Huffman':<10} {'Enumerative':<12}")
+     print("-" * 70)
+
+     for result in all_results:
+         name = result['name'][:20]
+         original = result['original_size']
+         theoretical = result['theoretical_minimum']
+         huffman = result['methods']['huffman']['compressed_size']
+         enumerative_method = result['methods']['enumerative']
+         if enumerative_method is not None and enumerative_method.get('compressed_size') is not None:
+             enumerative = enumerative_method['compressed_size']
+         elif enumerative_method and enumerative_method.get('timed_out'):
+             enumerative = "TIMEOUT"
+         else:
+             enumerative = "FAILED"
+
+         print(f"{name:<20} {original:<10} {theoretical:<12.1f} {huffman:<10} {enumerative:<12}")
+
+
+ def print_results(results: Dict):
+     """Print compression results for a dataset."""
+     print(f"\nResults for {results['name']}:")
+     print(f"  Original size: {results['original_size']} bytes")
+     print(f"  Vocabulary size: {results['vocabulary_size']}")
+     print(f"  Theoretical minimum: {results['theoretical_minimum']:.1f} bytes")
+
+     for method, data in results['methods'].items():
+         print(f"\n  {method}:")
+         if data is None or data.get('compressed_size') is None:
+             if data and data.get('timed_out'):
+                 print(f"    Status: TIMEOUT after {data.get('encoding_time', 0):.2f}s - computational complexity too high")
+             elif data and data.get('error'):
+                 print(f"    Status: FAILED - {data.get('error')}")
+             else:
+                 print("    Status: FAILED - computational complexity too high")
+         else:
+             print(f"    Compressed size: {data['compressed_size']} bytes")
+             print(f"    Compression ratio: {data['compression_ratio']:.2f}")
+             print(f"    Bits per symbol: {data['bits_per_symbol']:.2f}")
+             print(f"    Correctly decoded: {data['correct']}")
+             if 'encoding_time' in data:
+                 print(f"    Encoding time: {data['encoding_time']:.3f}s")
+
+
+ if __name__ == "__main__":
+     main()
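The SIGALRM-based `timeout` context manager above is the standard Unix pattern for bounding a long-running computation; it only works on Unix in the main thread, with whole-second granularity. A self-contained sketch of the same pattern, using the built-in `TimeoutError` instead of the repo's custom class:

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def timeout(seconds):
    """Raise TimeoutError if the enclosed block runs longer than `seconds`.

    Relies on SIGALRM, so it is Unix-only, main-thread-only, and has
    whole-second granularity.
    """
    def handler(signum, frame):
        raise TimeoutError(f"Operation timed out after {seconds} seconds")

    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        # Always restore the previous handler and cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
        signal.alarm(0)


# A fast operation completes normally...
with timeout(5):
    result = sum(range(1000))

# ...while a slow one is interrupted.
try:
    with timeout(1):
        time.sleep(3)
    timed_out = False
except TimeoutError:
    timed_out = True
```

The `finally` block matters: without `signal.alarm(0)`, a pending alarm from a block that finished early would fire later inside unrelated code.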
test_enumerative.py ADDED
@@ -0,0 +1,91 @@
+ #!/usr/bin/env python3
+ """Test the enumerative entropy coding implementation."""
+
+ import math
+
+ from enumerative_coding import EnumerativeEncoder, ExpGolombCoder
+
+
+ def test_exp_golomb():
+     """Test exp-Golomb coding."""
+     print("Testing Exp-Golomb coding:")
+
+     test_values = [0, 1, 2, 3, 4, 5, 10, 15, 31, 32, 100]
+
+     for n in test_values:
+         encoded = ExpGolombCoder.encode(n)
+         decoded, _ = ExpGolombCoder.decode(encoded, 0)
+         print(f"  {n:3d} -> {encoded:>10s} -> {decoded:3d} ({'✓' if n == decoded else '✗'})")
+
+
+ def test_combinatorics():
+     """Test combinatorial functions."""
+     print("\nTesting combinatorial functions:")
+
+     # Check binomial coefficients from the encoder's table against math.comb
+     encoder = EnumerativeEncoder()
+     test_cases = [(5, 2), (10, 3), (7, 0), (7, 7), (6, 4)]
+     for n, k in test_cases:
+         result = encoder.binom_table.get(n, k)
+         expected = math.comb(n, k)
+         print(f"  C({n},{k}) = {result} (expected {expected})")
+
+
+ def test_simple_sequence():
+     """Test encoding/decoding of simple sequences."""
+     print("\nTesting simple sequences:")
+
+     encoder = EnumerativeEncoder()
+
+     test_sequences = [
+         [0, 1, 0],
+         [0, 1, 1, 2],
+         [1, 2, 3, 1, 2, 3],
+         [0, 0, 1, 1, 2],
+     ]
+
+     for seq in test_sequences:
+         print(f"\n  Testing sequence: {seq}")
+
+         try:
+             encoded = encoder.encode(seq)
+             decoded = encoder.decode(encoded)
+
+             print(f"    Original: {seq}")
+             print(f"    Decoded:  {decoded}")
+             print(f"    Correct:  {'✓' if seq == decoded else '✗'}")
+             print(f"    Size: {len(seq)} symbols -> {len(encoded)} bytes")
+
+         except Exception as e:
+             print(f"    Error: {e}")
+
+
+ def test_paper_example():
+     """Test with an example that should match the paper's approach."""
+     print("\nTesting paper-style example:")
+
+     # A sequence with known symbol frequencies; exercises the
+     # 3-step encoding process
+     sequence = [0, 0, 1, 2, 1, 0, 2, 1, 1, 2]  # 3 zeros, 4 ones, 3 twos
+
+     print(f"  Sequence: {sequence}")
+     print(f"  Symbols:  {sorted(set(sequence))}")
+     print(f"  Counts:   {[sequence.count(s) for s in sorted(set(sequence))]}")
+
+     encoder = EnumerativeEncoder()
+     encoded = encoder.encode(sequence)
+     decoded = encoder.decode(encoded)
+
+     print(f"  Encoded size: {len(encoded)} bytes")
+     print(f"  Correctly decoded: {'✓' if sequence == decoded else '✗'}")
+
+     if sequence != decoded:
+         print(f"  Expected: {sequence}")
+         print(f"  Got:      {decoded}")
+
+
+ if __name__ == "__main__":
+     test_exp_golomb()
+     test_combinatorics()
+     test_simple_sequence()
+     test_paper_example()
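For reference, order-0 exponential-Golomb coding can be sketched in a few lines (the exact bit layout of the repo's `ExpGolombCoder` is defined in `enumerative_coding.py` and may differ; this is the textbook scheme):

```python
def exp_golomb_encode(n: int) -> str:
    """Order-0 exp-Golomb: write n+1 in binary, preceded by
    (bit_length - 1) zeros. E.g. 0 -> '1', 1 -> '010', 2 -> '011'."""
    m = n + 1
    bits = bin(m)[2:]
    return "0" * (len(bits) - 1) + bits


def exp_golomb_decode(bitstring: str, pos: int = 0):
    """Decode one value starting at `pos`; return (value, next_pos)."""
    zeros = 0
    while bitstring[pos + zeros] == "0":
        zeros += 1
    # The payload is (zeros + 1) bits long, starting after the zero run
    end = pos + 2 * zeros + 1
    m = int(bitstring[pos + zeros:end], 2)
    return m - 1, end
```

Because each codeword is self-delimiting (the zero run announces the payload length), codewords can be concatenated into a single bitstream and decoded sequentially, which is what makes the scheme useful for the variable-length side information in enumerative coding.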
test_paper_examples.py ADDED
@@ -0,0 +1,135 @@
+ #!/usr/bin/env python3
+ """
+ Test cases that match specific examples from the Han et al. (2008) paper.
+ Tests only end-to-end encoding/decoding and verifies that the results
+ match the paper.
+ """
+
+ from enumerative_coding import EnumerativeEncoder
+
+
+ def test_example_from_paper():
+     """
+     Test with a specific example that should match the paper's results.
+     This verifies that the encoding produces the expected output.
+     """
+     print("Testing Example from Paper")
+     print("=" * 50)
+
+     # A simple sequence we can verify by hand: 2 zeros, 2 ones, 1 two
+     sequence = [0, 1, 0, 1, 2]
+
+     print(f"Input sequence: {sequence}")
+     print(f"Symbol counts: {[sequence.count(i) for i in range(3)]}")
+
+     encoder = EnumerativeEncoder()
+     encoded = encoder.encode(sequence)
+     decoded = encoder.decode(encoded)
+
+     # Test end-to-end correctness
+     assert decoded == sequence, f"Decoding failed: expected {sequence}, got {decoded}"
+     print("✓ End-to-end encoding/decoding successful")
+
+     # Show compression metrics
+     original_size = len(sequence)
+     compressed_size = len(encoded)
+     print(f"Original: {original_size} symbols -> Compressed: {compressed_size} bytes")
+     print(f"Compression ratio: {original_size / compressed_size:.2f}")
+
+
+ def test_paper_sequence_properties():
+     """
+     Test with sequences that have the properties discussed in the paper.
+     Verify that the algorithm handles different symbol distributions correctly.
+     """
+     print("\n\nTesting Paper Sequence Properties")
+     print("=" * 50)
+
+     test_cases = [
+         # (description, sequence)
+         ("Uniform distribution", [0, 1, 2, 0, 1, 2]),
+         ("Skewed distribution", [0, 0, 0, 1, 1, 2]),
+         ("Single symbol", [0, 0, 0, 0]),
+         ("Binary sequence", [0, 1, 1, 0, 1]),
+     ]
+
+     encoder = EnumerativeEncoder()
+
+     for description, sequence in test_cases:
+         print(f"\n{description}: {sequence}")
+
+         # Test encoding and decoding
+         encoded = encoder.encode(sequence)
+         decoded = encoder.decode(encoded)
+
+         # Verify correctness
+         assert decoded == sequence, f"Failed for {description}"
+
+         # Show results
+         compression_ratio = len(sequence) / len(encoded) if len(encoded) > 0 else float('inf')
+         print(f"  Compressed: {len(sequence)} -> {len(encoded)} bytes (ratio: {compression_ratio:.2f})")
+         print("  ✓ Correctly encoded and decoded")
+
+
+ def test_lexicographic_ordering():
+     """
+     Test that sequences with the same symbol counts but different orders
+     produce different encodings (different lexicographic indices).
+     """
+     print("\n\nTesting Lexicographic Ordering")
+     print("=" * 50)
+
+     # All permutations of [0, 0, 1, 1] should have different lexicographic indices
+     permutations = [
+         [0, 0, 1, 1],
+         [0, 1, 0, 1],
+         [0, 1, 1, 0],
+         [1, 0, 0, 1],
+         [1, 0, 1, 0],
+         [1, 1, 0, 0],
+     ]
+
+     encoder = EnumerativeEncoder()
+     encodings = []
+
+     for perm in permutations:
+         encoded = encoder.encode(perm)
+         decoded = encoder.decode(encoded)
+
+         # Verify end-to-end correctness
+         assert decoded == perm, f"Decoding failed for {perm}"
+
+         encodings.append(encoded)
+         print(f"  {perm} -> compressed size: {len(encoded)} bytes")
+
+     # Different permutations must produce different encodings; otherwise
+     # the code is lossy and cannot be uniquely decoded
+     unique_encodings = len(set(encodings))
+     total_permutations = len(permutations)
+     print(f"  Unique encodings: {unique_encodings} out of {total_permutations} permutations")
+
+     assert unique_encodings == total_permutations, (
+         "CRITICAL BUG: different permutations produced identical encodings; "
+         "the code is lossy and cannot uniquely decode sequences."
+     )
+     print("  ✓ All permutations have unique encodings")
+
+
+ def main():
+     """Run all paper example tests."""
+     test_example_from_paper()
+     test_paper_sequence_properties()
+     test_lexicographic_ordering()
+
+     print("\n" + "=" * 50)
+     print("All paper example tests passed! ✓")
+
+
+ if __name__ == "__main__":
+     main()
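The lexicographic-index property that the last test probes can be illustrated directly for binary sequences: the rank of a 0/1 sequence among all arrangements with the same number of ones follows from binomial coefficients (this is the classic enumerative-coding identity from Cover's "Enumerative Source Encoding", not necessarily the exact scheme in `enumerative_coding.py`):

```python
from math import comb


def binary_rank(seq):
    """Lexicographic rank of a 0/1 sequence among all sequences of the
    same length with the same number of ones.

    Scanning left to right: placing a 1 where a 0 could have gone skips
    every arrangement that puts a 0 there, i.e. C(remaining, ones_left)
    sequences, so we add that count to the rank.
    """
    rank = 0
    ones_left = sum(seq)
    for i, bit in enumerate(seq):
        remaining = len(seq) - i - 1
        if bit == 1:
            rank += comb(remaining, ones_left)
            ones_left -= 1
    return rank
```

For the six permutations of `[0, 0, 1, 1]` in the test above, this assigns the distinct ranks 0 through 5 in lexicographic order, which is exactly why distinct permutations with identical symbol counts must map to distinct encodings.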
uv.lock ADDED
The diff for this file is too large to render. See raw diff