# Changelog
All notable changes to XERV Crayon will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [2.0.0] - 2026-01-23
### Added
- **Double-Array Trie (DAT) Engine**: Complete rewrite of the tokenization engine on a memory-mapped DAT, where each state transition is an O(1) array lookup
- **AVX2/SIMD Optimizations**: Native C++ engine with AVX2 intrinsics achieving >16M tokens/second
- **Pre-built Vocabulary Profiles**: 5 production-ready profiles (lite, code, science, multilingual, arts_commerce)
- **CLI Tool**: `crayon-benchmark` command for easy performance testing
- **Zero-Copy Memory Mapping**: Memory-mapped DAT files for instant loading
- **Cross-Platform Support**: Windows (MSVC), Linux (GCC), macOS (Clang/Apple Silicon)
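
For readers unfamiliar with double-array tries, the following is a minimal Python sketch of the general technique, not Crayon's C++ engine. The names `build_dat` and `lookup`, the first-fit placement strategy, and the separate `term` array for token ids are all assumptions made for illustration. Each transition is a single array index plus one comparison, which is where the constant-time step comes from.

```python
from collections import deque


class _Node:
    """Plain trie node used only while building the arrays (illustrative)."""
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}   # symbol code -> _Node
        self.token_id = -1   # -1 means "no token ends here"


def build_dat(vocab):
    """Build (base, check, term) arrays from a {token: id} mapping.

    Uses naive first-fit placement; a production builder would be smarter.
    """
    root = _Node()
    for token, token_id in vocab.items():
        node = root
        for byte in token.encode("utf-8"):
            # Shift codes by 1 so every code is >= 1.
            node = node.children.setdefault(byte + 1, _Node())
        node.token_id = token_id

    # Slot 0 is the root state; check[0] = 0 marks it as occupied.
    base, check, term = [0], [0], [-1]

    def ensure(n):
        while len(base) <= n:
            base.append(0)
            check.append(-1)   # -1 marks a free slot
            term.append(-1)

    queue = deque([(root, 0)])
    while queue:
        node, state = queue.popleft()
        term[state] = node.token_id
        if not node.children:
            continue
        codes = sorted(node.children)
        b = 1
        while True:  # find the first offset where all child slots are free
            ensure(b + codes[-1])
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[state] = b
        for c in codes:
            check[b + c] = state
            queue.append((node.children[c], b + c))
    return base, check, term


def lookup(dat, token):
    """Return the token id, or -1 if the token is not in the vocabulary."""
    base, check, term = dat
    state = 0
    for byte in token.encode("utf-8"):
        t = base[state] + byte + 1          # one addition ...
        if t >= len(check) or check[t] != state:   # ... and one comparison
            return -1
        state = t
    return term[state]
```

With a toy vocabulary such as `{"cat": 0, "car": 1, "card": 2}`, `lookup(build_dat(vocab), "card")` walks four transitions and reads the id from `term`; a prefix like `"ca"` reaches a valid state but has no id stored there.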
### Changed
- Version bump from 1.1.0 to 2.0.0
- Minimum Python version updated to 3.10
- Package structure reorganized for better modularity
### Performance
- Tokenization: 16M+ tokens/second (up from 2M in v1.x)
- Memory usage: 50% reduction via mmap
- Load time: <10ms for vocabulary profiles
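
As a rough illustration of why memory mapping helps load time and memory usage, the sketch below maps a file of 32-bit integers and views it as an integer array without parsing or copying it. This is only an assumed layout: the helper names `save_int32` and `open_int32` and the flat one-array-per-file format are hypothetical and do not describe Crayon's actual DAT file format.

```python
import mmap
from array import array


def save_int32(path, values):
    """Write values as raw native-endian C ints (hypothetical file layout)."""
    with open(path, "wb") as f:
        array("i", values).tofile(f)


def open_int32(path):
    """Map the file read-only and expose it as a typed view, with no copy.

    Pages are loaded lazily by the OS, so opening costs little regardless
    of file size. The caller must release the view before closing the map.
    """
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast("i")   # reinterpret bytes as C ints
    return mm, view
```

A caller would index `view[i]` exactly like a list of integers, then call `view.release()` followed by `mm.close()` when done.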
## [1.1.0] - 2026-01-16
### Added
- Initial C-Trie implementation
- SIMD-accelerated text processing
- Basic vocabulary management
### Fixed
- Memory leaks in trie traversal
- Unicode handling edge cases
## [1.0.0] - 2026-01-11
### Added
- Initial release
- Pure Python tokenizer
- Basic vocabulary training
- Entropy-guided vocabulary construction