# Tokenizer

Our tokenizer was trained from scratch on 500,000 samples from the Openwebtext dataset.
For variation, we also included 500,000 samples from our [GitHub-CC0](https://huggingface.co/KoalaAI/GitHub-CC0) dataset, in the hope that code would be tokenized reasonably well despite our small vocab_size.
Like Mistral, we use LlamaTokenizerFast as our tokenizer class, in legacy mode.
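The from-scratch training described above can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration only: the corpus, vocabulary size, and special tokens below are stand-ins, not the actual Openwebtext/GitHub-CC0 setup or hyperparameters.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Illustrative in-memory corpus mixing prose and code, standing in for
# the real Openwebtext + GitHub-CC0 samples.
corpus = [
    "Our tokenizer was trained from scratch on text and code samples.",
    "def add(a, b):\n    return a + b",
] * 50

# Train a small BPE tokenizer from scratch; vocab_size here is tiny and
# purely illustrative.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Check that code-like input splits into reasonable subword tokens.
encoded = tokenizer.encode("def add(a, b): return a + b")
print(encoded.tokens)
```

A tokenizer trained this way can then be wrapped in `LlamaTokenizerFast` for use with `transformers`, as the section above notes.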

## Tokenization Analysis