	# Changelog

	All notable changes to XERV Crayon will be documented in this file.

	The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
	and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

	## [2.0.0] - 2026-01-23

	### Added
	- Double-Array Trie (DAT) Engine: Complete rewrite of the tokenization engine using a memory-mapped DAT with O(1) state transitions per input symbol (lookup cost scales with key length, not vocabulary size)
	- AVX2/SIMD Optimizations: Native C++ engine with AVX2 intrinsics achieving >16M tokens/second
	- Pre-built Vocabulary Profiles: 5 production-ready profiles (lite, code, science, multilingual, arts_commerce)
	- CLI Tool: `crayon-benchmark` command for easy performance testing
	- Zero-Copy Memory Mapping: Memory-mapped DAT files for near-instant loading (<10 ms)
	- Cross-Platform Support: Windows (MSVC), Linux (GCC), macOS (Clang/Apple Silicon)
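
For readers unfamiliar with the data structure named above, the following is a minimal pure-Python sketch of a double-array trie. It is not the CRAYON engine (which is native C++), and the function names `build_double_array`, `contains`, and `longest_prefix` are hypothetical. It only illustrates why each transition is a constant-time array access: from state `s` on symbol code `c`, the next state is `t = base[s] + c`, and the move is valid only if `check[t] == s`.

```python
from collections import deque

FREE = -1  # marks an unused slot in the check array


def build_double_array(words):
    """Build (base, check, terminal) for the UTF-8 bytes of the given words."""
    # Step 1: build an ordinary trie; trie[i] maps a symbol code to a child index.
    trie = [{}]
    terminal_nodes = set()
    for word in words:
        node = 0
        for byte in word.encode("utf-8"):
            code = byte + 1  # shift by one so that code 0 is never used
            if code not in trie[node]:
                trie.append({})
                trie[node][code] = len(trie) - 1
            node = trie[node][code]
        terminal_nodes.add(node)

    # Step 2: place the trie nodes into the two parallel arrays, breadth first.
    base, check = [0], [0]  # state 0 is the root; check[0] = 0 marks it occupied
    terminal = set()
    queue = deque([(0, 0)])  # pairs of (trie node, double-array state)
    while queue:
        node, state = queue.popleft()
        if node in terminal_nodes:
            terminal.add(state)
        children = trie[node]
        if not children:
            continue
        codes = sorted(children)
        b = 1
        while True:  # find the smallest base value whose child slots are all free
            top = b + codes[-1]
            if top >= len(check):
                grow = top + 1 - len(check)
                base.extend([0] * grow)
                check.extend([FREE] * grow)
            if all(check[b + c] == FREE for c in codes):
                break
            b += 1
        base[state] = b
        for c in codes:
            check[b + c] = state  # record the parent state of each child slot
            queue.append((children[c], b + c))
    return base, check, terminal


def contains(base, check, terminal, key):
    """Exact-match lookup: one array access per input byte."""
    state = 0
    for byte in key.encode("utf-8"):
        nxt = base[state] + byte + 1
        if nxt >= len(check) or check[nxt] != state:
            return False
        state = nxt
    return state in terminal


def longest_prefix(base, check, terminal, data, start=0):
    """Length in bytes of the longest stored key that prefixes data[start:], or 0."""
    state, best = 0, 0
    for offset in range(start, len(data)):
        nxt = base[state] + data[offset] + 1
        if nxt >= len(check) or check[nxt] != state:
            break
        state = nxt
        if state in terminal:
            best = offset - start + 1
    return best


dat_base, dat_check, dat_terminal = build_double_array(
    ["he", "hello", "help", "her", "world"]
)
print(contains(dat_base, dat_check, dat_terminal, "hello"))
print(longest_prefix(dat_base, dat_check, dat_terminal, b"helping"))
```

A greedy tokenizer can call `longest_prefix` repeatedly to carve a byte string into vocabulary entries. Because `base` and `check` are flat integer arrays, they can also be written to disk and memory-mapped without a parsing step.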

	### Changed
	- Version bump from 1.1.0 to 2.0.0
	- Minimum Python version updated to 3.10
	- Package structure reorganized for better modularity

	### Performance
	- Tokenization throughput: >16M tokens/second (up from 2M tokens/second in v1.x)
	- Memory usage: 50% reduction via mmap
	- Load time: <10ms for vocabulary profiles
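
The memory and load-time figures above come from memory mapping. As a rough illustration of the idea, the sketch below writes two integer arrays to a file and then maps them back as zero-copy views using Python's standard `mmap` module. The file layout (a 4-byte state count followed by the `base` and `check` arrays) and the names `save_arrays` and `open_arrays` are assumptions made for this example, not CRAYON's actual on-disk format or API.

```python
import mmap
import os
import struct
import tempfile
import time
from array import array

# Hypothetical header: number of states as a little-endian uint32.
HEADER = struct.Struct("<I")


def save_arrays(path, base, check):
    """Write the header, then base and check as native C ints."""
    if len(base) != len(check):
        raise ValueError("base and check must have the same length")
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(base)))
        array("i", base).tofile(f)
        array("i", check).tofile(f)


def open_arrays(path):
    """Map the file read-only and return (mapping, base_view, check_view)."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (n,) = HEADER.unpack_from(mm, 0)
    # Reinterpret the mapped bytes as C ints; no data is copied here.
    ints = memoryview(mm)[HEADER.size:].cast("i")
    return mm, ints[:n], ints[n:2 * n]


sample_base = [1, 0, 5, 0, 0, 0]
sample_check = [0, -1, 0, -1, -1, 2]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profile.dat")
    save_arrays(path, sample_base, sample_check)

    start = time.perf_counter()
    mm, base_view, check_view = open_arrays(path)
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Copy out only for the comparison below; real lookups would index the views.
    loaded_base, loaded_check = list(base_view), list(check_view)

    # Views must be released before the mapping can be closed.
    base_view.release()
    check_view.release()
    mm.close()

print(f"mapped {len(loaded_base)} states in {elapsed_ms:.3f} ms")
```

Opening the mapping costs roughly the same regardless of file size, because pages are read from disk only when first touched. Read-only mappings of the same file can also share physical pages across processes.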

	## [1.1.0] - 2026-01-16

	### Added
	- Initial C-Trie implementation
	- SIMD-accelerated text processing
	- Basic vocabulary management

	### Fixed
	- Memory leaks in trie traversal
	- Unicode handling edge cases

	## [1.0.0] - 2026-01-11

	### Added
	- Initial release
	- Pure Python tokenizer
	- Basic vocabulary training
	- Entropy-guided vocabulary construction