---
language:
- en
---
V1 of an English/code tokenizer: byte-level BPE, 64k vocabulary, with digits split into individual tokens (the difference with v1). Trained on an equal mix of the following sources.

On the NL side:
- Books
- C4
- v1 of our Common Crawl data (filtered with the helen quality classifier)
- enwiki
- Gutenberg

On the code side:
- Jupyter notebooks (downweighted to 0.5, as the dataset was small)
- GitHub issues
- Stackexchange
- The cleaned Python Stack

This gives a total of 1/3 code data (although Stackexchange and GitHub issues contain a lot of English).
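As a minimal sketch of the setup described above, the following trains a byte-level BPE tokenizer with digit splitting. The card does not say which tool was used; this assumes the Hugging Face `tokenizers` library, and the tiny inline corpus is only a stand-in for the real NL/code mix.

```python
# Sketch: byte-level BPE with a 64k vocab and split digits,
# using the Hugging Face `tokenizers` library (an assumption --
# the card does not name the training tool).
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
# Digits are split first so BPE can never merge multi-digit numbers;
# byte-level pre-tokenization then covers arbitrary bytes.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,  # matches the 64k vocab described above
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder corpus; the real mix is the NL/code sources listed above.
corpus = [
    "def add(a, b): return a + b",
    "The year 2024 had 366 days.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("2024")
# Each digit lands in its own token, regardless of training data.
```

Splitting digits before byte-level pre-tokenization means merges cannot cross digit boundaries, so numbers always decompose into single-digit tokens.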