# DGX_AI/codeforge/kb/tokenizer.py
# Commit acf77ab (vasiuuu): Initial commit for CodeForge GRPO training
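from __future__ import annotations

import re

# Matches one or more consecutive non-word characters (anything other than
# Unicode letters, digits, and the underscore).
_SPLIT_RE = re.compile(r"[^\w]+", re.UNICODE)


def tokenize(text: str) -> list[str]:
    """Lowercase ``text`` and split it on runs of non-word characters, dropping empty tokens."""
    return [t for t in _SPLIT_RE.split(text.lower()) if t]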
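
A minimal usage sketch of the tokenizer above. The module is repeated here only so the snippet runs on its own, and the sample strings are illustrative, not taken from the repository. It shows three consequences of splitting on `[^\w]+`: punctuation and whitespace are discarded, underscores stay inside a token because `_` counts as a word character, and accented letters are kept because `\w` is Unicode-aware for `str` patterns.

```python
from __future__ import annotations

import re

# Same pattern as in tokenizer.py: one or more non-word characters.
_SPLIT_RE = re.compile(r"[^\w]+", re.UNICODE)


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split on non-word runs, and drop empty tokens."""
    return [t for t in _SPLIT_RE.split(text.lower()) if t]


if __name__ == "__main__":
    # Punctuation and spaces act as separators; case is folded to lowercase.
    print(tokenize("Hello, World!"))        # ['hello', 'world']

    # Underscores are word characters, so snake_case identifiers stay whole.
    print(tokenize("call foo_bar(42)"))     # ['call', 'foo_bar', '42']

    # Hyphens split tokens; accented letters are preserved.
    print(tokenize("Caf\u00e9-au-lait"))    # ['café', 'au', 'lait']

    # Input with no word characters yields an empty list.
    print(tokenize(" ... "))                # []
```

Because identifiers such as `foo_bar` remain a single token, a query for `foo` alone would not match it by exact token comparison; whether that is desirable depends on how the tokens are consumed downstream, which this file does not show.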