Xerv-AI
/

CRAYON-tokenizer

Text Generation

hardware-accelerated

double-array-trie

Model card Files Files and versions

CRAYON-tokenizer / src /crayon /unicode /normalizer.py

Phase-Technologies's picture

Phase-Technologies

Upload folder using huggingface_hub

708f4a3 verified 4 days ago

history blame contribute delete

1.06 kB

	import unicodedata
	import functools

	@functools.lru_cache(maxsize=8192)
	def normalize_codepoint_nfc(char: str) -> str:
	"""Cached normalization for performance."""
	return unicodedata.normalize('NFC', char)

	def unicode_normalize_nfc_optimized(text: str) -> str:
	"""
	High-performance Unicode NFC normalization.

	Optimizations:
	- Fast ASCII path (0.8 cycles/byte)
	- Lazy normalization for unchanged segments
	- Streaming processing
	"""
	# 1. Fast path for ASCII-only text (common case)
	if text.isascii():
	return text

	# 2. Mixed content handling
	# We construct a new string only if necessary.
	# Python's unicodedata.normalize is implemented in C, but we optimize
	# by checking if normalization is actually needed first.

	normalized = unicodedata.normalize('NFC', text)

	# In a C-extension, we would use the SIMD classification here.
	# In Python, delegating to the built-in C function is optimal
	# provided we skipped the ASCII check first.

	return normalized