Commit History

Fix dead split parameter in PackedStreamingDataset._load_dataset
0cd5689

Vjeong Claude Sonnet 4.6 committed on

Add Code CPT pipeline for injecting Python code capability
a424729

Vjeong Claude Opus 4.6 committed on

Remove unused tokenizer training code (train_bpe, load_sentencepiece, load_trained_hf)
33ba3d1

Vjeong Claude Opus 4.6 committed on

Use LLaMA 2 pretrained tokenizer and remove tokenizer_mode option
a5ca4e4

Vjeong Claude Opus 4.6 committed on

Fix BPE tokenizer ByteLevel decoder and update evaluation notebook
8626149

Vjeong Claude Sonnet 4.6 committed on

docs: translate all Korean comments and docstrings to English
858e8b2

Vjeong Claude Sonnet 4.6 committed on

refactor(data): replace per-worker seed strategy with full sharding in IterableDataset
8a39fec

Vjeong Claude Sonnet 4.6 committed on

Initial commit: LLM-1B-Lab project setup
8a58ffe

Vjeong Claude Opus 4.6 committed on