Update README.md
README.md CHANGED
@@ -2,6 +2,7 @@
 license: openrail
 datasets:
 - Skylion007/openwebtext
+- ArtifactAI/arxiv-math-instruct-50k
 ---
 
 # Bamboo 400M
@@ -59,4 +60,4 @@ The histogram was created using 10,000 samples of tokenization.
 
 **Further Analysis:** While the tokenizer performs well overall, further analysis of the peak at 7 tokens and the lower frequency of very short sequences (1-3 tokens) could provide additional insights into its behavior and potential areas for refinement.
 
-
+**Theory On 7-token peak:** Since English sentences have an average of 15-20 words, our tokenizer may be splitting them into 7-9 tokens, which would possibly explain this peak.
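The token-length histogram discussed in the hunk above can be rebuilt with a few lines of Python. This is a minimal sketch, not the repository's actual analysis script: `str.split` stands in for the real Bamboo tokenizer (which would normally be loaded via the `transformers` library), and the sample sentences are placeholders.

```python
from collections import Counter

def token_length_histogram(texts, tokenize):
    """Map each token-sequence length to how many samples have that length."""
    return Counter(len(tokenize(t)) for t in texts)

# Stand-in tokenizer: whitespace split. The real analysis would tokenize
# 10,000 samples with the Bamboo tokenizer instead.
samples = [
    "the cat sat",
    "a short example sentence here",
    "one two three",
]
hist = token_length_histogram(samples, str.split)
# hist counts lengths: two 3-token samples, one 5-token sample
```

Plotting `hist` (e.g. with `matplotlib`) would reproduce the kind of histogram described, and inspecting the sentences that land in the 7-token bucket is one way to test the theory above.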