Fix dead split parameter in PackedStreamingDataset._load_dataset 0cd5689 Vjeong Claude Sonnet 4.6 commited on 1 day ago
Add Code CPT pipeline for injecting Python code capability a424729 Vjeong Claude Opus 4.6 commited on 7 days ago
Remove unused tokenizer training code (train_bpe, load_sentencepiece, load_trained_hf) 33ba3d1 Vjeong Claude Opus 4.6 commited on 14 days ago
Use LLaMA 2 pretrained tokenizer and remove tokenizer_mode option a5ca4e4 Vjeong Claude Opus 4.6 commited on 18 days ago
Fix BPE tokenizer ByteLevel decoder and update evaluation notebook 8626149 Vjeong Claude Sonnet 4.6 commited on 20 days ago
docs: translate all Korean comments and docstrings to English 858e8b2 Vjeong Claude Sonnet 4.6 commited on Feb 27
refactor(data): replace per-worker seed strategy with full sharding in IterableDataset 8a39fec Vjeong Claude Sonnet 4.6 commited on Feb 21