Update README.md
README.md CHANGED
@@ -2,6 +2,7 @@
 license: openrail
 datasets:
 - Skylion007/openwebtext
+- ArtifactAI/arxiv-math-instruct-50k
 ---
 
 # Bamboo 400M
@@ -59,4 +60,4 @@ The histogram was created using 10,000 samples of tokenization.
 
 **Further Analysis:** While the tokenizer performs well overall, further analysis of the peak at 7 tokens and the lower frequency of very short sequences (1-3 tokens) could provide additional insights into its behavior and potential areas for refinement.
 
-
+**Theory On 7-token peak:** Since English sentences have an average of 15-20 words, our tokenizer may be splitting them into 7-9 tokens, which would possibly explain this peak.
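The token-length histogram discussed in the hunk above can be rebuilt with a few lines of Python. This is a minimal sketch, not the repository's actual analysis script: `str.split` stands in for the real Bamboo tokenizer (which would normally be loaded via the `transformers` library), and the sample sentences are placeholders.

```python
from collections import Counter

def token_length_histogram(texts, tokenize):
    """Map each token-sequence length to how many samples have that length."""
    return Counter(len(tokenize(t)) for t in texts)

# Stand-in tokenizer: whitespace split. The real analysis would tokenize
# 10,000 samples with the Bamboo tokenizer instead.
samples = [
    "the cat sat",
    "a short example sentence here",
    "one two three",
]
hist = token_length_histogram(samples, str.split)
# hist counts lengths: two 3-token samples, one 5-token sample
```

Plotting `hist` (e.g. with `matplotlib`) would reproduce the kind of histogram described, and inspecting the sentences that land in the 7-token bucket is one way to test the theory above.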