# Changelog

All notable changes to XERV Crayon will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.0.0] - 2026-01-23

### Added
- **Double-Array Trie (DAT) Engine**: Complete rewrite of the tokenization engine using memory-mapped DAT for O(1) lookups
- **AVX2/SIMD Optimizations**: Native C++ engine with AVX2 intrinsics achieving >16M tokens/second
- **Pre-built Vocabulary Profiles**: 5 production-ready profiles (lite, code, science, multilingual, arts_commerce)
- **CLI Tool**: `crayon-benchmark` command for easy performance testing
- **Zero-Copy Memory Mapping**: Memory-mapped DAT files for instant loading
- **Cross-Platform Support**: Windows (MSVC), Linux (GCC), macOS (Clang/Apple Silicon)

### Changed
- Version bump from 1.1.0 to 2.0.0
- Minimum Python version updated to 3.10
- Package structure reorganized for better modularity

### Performance
- Tokenization: 16M+ tokens/second (up from 2M in v1.x)
- Memory usage: 50% reduction via mmap
- Load time: <10ms for vocabulary profiles

## [1.1.0] - 2026-01-16

### Added
- Initial C-Trie implementation
- SIMD-accelerated text processing
- Basic vocabulary management

### Fixed
- Memory leaks in trie traversal
- Unicode handling edge cases

## [1.0.0] - 2026-01-11

### Added
- Initial release
- Pure Python tokenizer
- Basic vocabulary training
- Entropy-guided vocabulary construction