DarwinAnim8or committed (verified) · Commit f49b5ff · 1 Parent(s): cd30026

Update README.md

Files changed (1): README.md (+2 −1)
README.md CHANGED

```diff
@@ -2,6 +2,7 @@
 license: openrail
 datasets:
 - Skylion007/openwebtext
+- ArtifactAI/arxiv-math-instruct-50k
 ---
 
 # Bamboo 400M
@@ -59,4 +60,4 @@ The histogram was created using 10,000 samples of tokenization.
 
 **Further Analysis:** While the tokenizer performs well overall, further analysis of the peak at 7 tokens and the lower frequency of very short sequences (1-3 tokens) could provide additional insights into its behavior and potential areas for refinement.
 
-**Theory On 7-token peak:** Since English sentences have an average of 15-20 words, our tokenizer may be splitting them into 7-9 tokens, which would possibly explain this peak.
+**Theory On 7-token peak:** Since English sentences have an average of 15-20 words, our tokenizer may be splitting them into 7-9 tokens, which would possibly explain this peak.
```
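The README's histogram of sequence lengths over 10,000 samples could be produced with a sketch along these lines (a minimal sketch; `tokenize` here is a whitespace stand-in for the model's actual tokenizer, and the sample texts are illustrative — the real analysis would use the model's own tokenizer, e.g. loaded via `transformers`):

```python
from collections import Counter

def tokenize(text):
    # Stand-in tokenizer: splits on whitespace. The README's analysis
    # would use the Bamboo 400M tokenizer instead.
    return text.split()

def token_length_histogram(samples):
    # Map each sample to its token count and tally how often each
    # count occurs -- this is the data behind the histogram.
    return Counter(len(tokenize(s)) for s in samples)

samples = ["a short sample", "another slightly longer sample text"]
hist = token_length_histogram(samples)
print(sorted(hist.items()))  # [(3, 1), (5, 1)]
```

A peak in this histogram at 7 tokens is what the "Theory On 7-token peak" paragraph above is trying to explain.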