Changelog

All notable changes to XERV Crayon will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[2.0.0] - 2026-01-23

Added

Double-Array Trie (DAT) Engine: Complete rewrite of the tokenization engine using a memory-mapped DAT with O(1) state transitions per input symbol (lookup cost scales with token length, not vocabulary size)
AVX2/SIMD Optimizations: Native C++ engine with AVX2 intrinsics achieving >16M tokens/second
Pre-built Vocabulary Profiles: 5 production-ready profiles (lite, code, science, multilingual, arts_commerce)
CLI Tool: crayon-benchmark command for easy performance testing
Zero-Copy Memory Mapping: Memory-mapped DAT files for near-instant loading (<10 ms per vocabulary profile)
Cross-Platform Support: Windows (MSVC), Linux (GCC), macOS (Clang/Apple Silicon)

Changed

Version bump from 1.1.0 to 2.0.0
Minimum Python version updated to 3.10
Package structure reorganized for better modularity

Performance

Tokenization: 16M+ tokens/second (up from 2M tokens/second in v1.x)
Memory usage: 50% reduction via mmap
Load time: <10 ms for vocabulary profiles
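
The memory and load-time figures follow from how memory mapping works: the operating system pages the file in on demand instead of copying it into a Python object. The sketch below illustrates the idea with the standard library only. The helper names `save_table` and `open_table`, the file name, and the native-endian 32-bit integer layout are hypothetical and do not describe CRAYON's actual DAT file format.

```python
import mmap
import os
import tempfile
from array import array
from contextlib import contextmanager


def save_table(path: str, values: list[int]) -> None:
    """Write integers as native-endian C ints (layout is an assumption)."""
    with open(path, "wb") as f:
        array("i", values).tofile(f)


@contextmanager
def open_table(path: str):
    """Yield a read-only integer view over the file without copying it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = memoryview(mm)
            view = raw.cast("i")  # reinterpret bytes as C ints, zero-copy
            try:
                yield view
            finally:
                # Views must be released before the mapping can close.
                view.release()
                raw.release()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.bin")
    save_table(path, [1, -2, 3, 40])
    with open_table(path) as table:
        first, last, count = table[0], table[-1], len(table)

print(first, last, count)  # -> 1 40 4
```

Opening the table costs roughly the same regardless of file size, since no data is read until an index is touched; several processes mapping the same file also share the same physical pages.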

[1.1.0] - 2026-01-16

Added

Initial C-Trie implementation
SIMD-accelerated text processing
Basic vocabulary management

Fixed

Memory leaks in trie traversal
Unicode handling edge cases

[1.0.0] - 2026-01-11

Added

Initial release
Pure Python tokenizer
Basic vocabulary training
Entropy-guided vocabulary construction