Changelog

All notable changes to XERV Crayon will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[2.0.0] - 2026-01-23

Added

  • Double-Array Trie (DAT) Engine: Complete rewrite of the tokenization engine using a memory-mapped DAT with O(1) per-symbol transitions, so lookup cost does not depend on vocabulary size
  • AVX2/SIMD Optimizations: Native C++ engine with AVX2 intrinsics achieving >16M tokens/second
  • Pre-built Vocabulary Profiles: 5 production-ready profiles (lite, code, science, multilingual, arts_commerce)
  • CLI Tool: crayon-benchmark command for easy performance testing
  • Zero-Copy Memory Mapping: Memory-mapped DAT files for instant loading
  • Cross-Platform Support: Windows (MSVC), Linux (GCC), macOS (Clang/Apple Silicon)
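
For readers unfamiliar with the double-array trie named above, the following is a minimal pure-Python sketch of the general technique, not the Crayon engine itself. The function names (`build_dat`, `longest_match`, `tokenize`), the separate `token_id` array, the greedy longest-match strategy, and the toy vocabulary are all assumptions made for illustration. Each transition is a single array read and comparison: from state `s` on byte `c`, the next state is `t = base[s] + c`, valid only if `check[t] == s`.

```python
from collections import deque

def build_dat(vocab):
    """Build base/check/token_id arrays from a {token_string: id} mapping."""
    # Step 1: an ordinary nested-dict trie over UTF-8 bytes.
    root = {"children": {}, "id": -1}
    for token, tid in vocab.items():
        node = root
        for byte in token.encode("utf-8"):
            node = node["children"].setdefault(byte, {"children": {}, "id": -1})
        node["id"] = tid

    # Step 2: flatten into arrays. State 0 is the root; check == -1 means free.
    base, check, token_id = [0], [-2], [root["id"]]

    def grow(n):
        while len(check) < n:
            base.append(0)
            check.append(-1)
            token_id.append(-1)

    queue = deque([(root, 0)])
    while queue:
        node, s = queue.popleft()
        codes = sorted(node["children"])
        if not codes:
            continue
        # Find the smallest offset b where every child slot b + c is free.
        b = 1
        while True:
            grow(b + codes[-1] + 1)
            if all(check[b + c] == -1 for c in codes):
                break
            b += 1
        base[s] = b
        for c in codes:
            t = b + c
            check[t] = s  # slot t now belongs to parent state s
            child = node["children"][c]
            token_id[t] = child["id"]
            queue.append((child, t))
    return base, check, token_id

def longest_match(data, start, base, check, token_id):
    """Return (token id, end offset) of the longest vocab entry at data[start:]."""
    s, best_id, best_end = 0, -1, start
    for i in range(start, len(data)):
        t = base[s] + data[i]
        if t >= len(check) or check[t] != s:
            break  # no transition on this byte
        s = t
        if token_id[s] != -1:
            best_id, best_end = token_id[s], i + 1
    return best_id, best_end

def tokenize(text, base, check, token_id, unk_id=-1):
    """Greedy longest-match tokenization; unknown bytes map to unk_id."""
    data = text.encode("utf-8")
    out, i = [], 0
    while i < len(data):
        tid, end = longest_match(data, i, base, check, token_id)
        if tid == -1:
            out.append(unk_id)
            i += 1
        else:
            out.append(tid)
            i = end
    return out

# Toy vocabulary, purely illustrative.
toy_vocab = {"a": 0, "ab": 1, "abc": 2, "b": 3, "c": 4, " ": 5}
base, check, token_id = build_dat(toy_vocab)
ids = tokenize("abc ab c", base, check, token_id)
```

Because the structure is just flat integer arrays, it can be written to disk and read back without rebuilding, which is what makes it a natural fit for memory mapping.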

Changed

  • Version bump from 1.1.0 to 2.0.0
  • Minimum Python version updated to 3.10
  • Package structure reorganized for better modularity

Performance

  • Tokenization: 16M+ tokens/second (up from 2M in v1.x)
  • Memory usage: 50% reduction via mmap
  • Load time: <10ms for vocabulary profiles
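
The memory and load-time figures above are attributed to memory mapping. As a rough sketch of why that helps, the example below stores an integer array in a file and reads it back through `mmap` without copying it into a Python list first. The file layout (raw native-endian C ints, no header), the file name `demo.bin`, and the helper names `save_int32` and `open_int32` are assumptions for illustration only and do not describe Crayon's actual DAT file format.

```python
import mmap
import os
import tempfile
from array import array

def save_int32(path, values):
    """Write values as raw native-endian C ints (no header)."""
    with open(path, "wb") as f:
        array("i", values).tofile(f)

def open_int32(path):
    """Map the file read-only and expose it as an int view without copying."""
    f = open(path, "rb")
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast("i")  # reinterpret the bytes as C ints
    return f, mm, view

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    save_int32(path, [1, 0, 3, 3])
    f, mm, view = open_int32(path)
    loaded = list(view)   # copied here only to inspect the contents
    third = view[2]       # direct indexed read from the mapped pages
    # The view must be released before the mapping can be closed.
    view.release()
    mm.close()
    f.close()
```

With this approach the operating system pages data in on demand, so opening a large file costs little up front and several processes mapping the same file can share the same physical pages.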

[1.1.0] - 2026-01-16

Added

  • Initial C-Trie implementation
  • SIMD-accelerated text processing
  • Basic vocabulary management

Fixed

  • Memory leaks in trie traversal
  • Unicode handling edge cases

[1.0.0] - 2026-01-11

Added

  • Initial release
  • Pure Python tokenizer
  • Basic vocabulary training
  • Entropy-guided vocabulary construction