HuggingFaceGECLM/mix_tok_v2


teven committed on Apr 13, 2023

Commit a505bc0 · 1 Parent(s): f0817b9

Create README.md

Files changed (1)

README.md +21 -0

README.md ADDED

	@@ -0,0 +1,21 @@

+---
+language:
+- en
+---
+V2 of an English/code tokenizer: byte-level BPE, 64k vocab, with digits split (the difference from v1). Trained on an equal mix of the following sources.
+On the NL side:
+- Books
+- C4
+- v1 of our CC (helen quality classifier)
+- enwiki
+- Gutenberg
+- Reddit
+On the code side:
+- Jupyter notebooks (0.5 weight, because the dataset was small)
+- GH issues
+- Stackexchange
+- The cleaned Python Stack
+For a total of about 1/3 code data (although Stackexchange and GH issues also contain a lot of English).
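
The code share quoted above can be checked with a little arithmetic. The sketch below assumes that "equal mix" means a relative sampling weight of 1 for every source, with Jupyter notebooks at 0.5 as stated in the list. The dictionary names and the `code_fraction` helper are hypothetical, introduced only for this illustration, and the README does not say whether weights apply to documents, bytes, or tokens.

```python
from fractions import Fraction

# Assumed relative weights: 1 per source, per the "equal mix" wording.
NL_WEIGHTS = {
    "Books": Fraction(1),
    "C4": Fraction(1),
    "CC v1": Fraction(1),
    "enwiki": Fraction(1),
    "Gutenberg": Fraction(1),
    "Reddit": Fraction(1),
}

# Jupyter notebooks are down-weighted to 0.5, as stated in the list.
CODE_WEIGHTS = {
    "Jupyter notebooks": Fraction(1, 2),
    "GH issues": Fraction(1),
    "Stackexchange": Fraction(1),
    "Python Stack": Fraction(1),
}


def code_fraction(nl_weights, code_weights):
    """Return the share of the total mix weight that comes from code sources."""
    code_total = sum(code_weights.values())
    nl_total = sum(nl_weights.values())
    return code_total / (code_total + nl_total)


share = code_fraction(NL_WEIGHTS, CODE_WEIGHTS)
print(f"code share of the mix: {share} (~{float(share):.3f})")
```

Under these assumptions the code side carries 3.5 of 9.5 total weight, that is 7/19 or about 0.368, which is consistent with the rough one-third figure.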