language:
- en
library_name: transformers
---

# Bamboo Nano

This is a WIP foundational (aka base) model trained only on public domain (CC0) datasets, primarily in English.

The primary goal of this model is to see how a limited tokenizer influences model training speed and coherency.

Further training is planned and ongoing. No multilingual datasets are currently in use or planned, though this may change in the future, and the current datasets *can* contain languages other than English.

## License

Though the training data of this model is CC0, the model itself is not. The model is released under the OpenRAIL license, as tagged.

## Planned updates

As mentioned, a few updates are planned:

* Further training on more CC0 data; this model's weights will be updated as we pretrain on more of the listed datasets.
* Extending the context length to 32k tokens using YaRN.
* Fine-tuning the resulting model for instruct, code, and storywriting; these fine-tunes will then be combined using MergeKit to create a MoE model.
* Releasing a GGUF version and an extended-context version of the base model.
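The YaRN item above can be expressed as a `rope_scaling` entry in a Transformers-style model config. The sketch below is illustrative only; the scale factor and native context length are assumptions, not the model's actual configuration:

```python
# Hypothetical sketch only: how YaRN context extension is typically
# expressed in a Hugging Face-style model config. The values below are
# illustrative assumptions, not this model's released configuration.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 16.0,                            # assumed scale factor toward the 32k target
    "original_max_position_embeddings": 2048,  # assumed native context length
}

# The extended context is the native context times the scale factor.
target_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(target_context)  # 32768
```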

## Other model versions

* [Bamboo-400M](https://huggingface.co/KoalaAI/Bamboo-400M)

# Model Performance Tracking

This table tracks the performance of our model on various tasks over time. The metric used is accuracy ('acc').

| Date (YYYY-MM-DD) | arc_easy | hellaswag | sglue_rte | truthfulqa | Avg |
|-------------------|----------|-----------|-----------|------------|-----|

## Legend

- Date: the date of the model checkpoint the evaluation was run on. Pretraining is ongoing, and tests are re-run with that date's model.
- Metric: the evaluation metric used (acc = accuracy).
- Task columns: results for each task, in the format "percentage ± standard error".
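As a sketch of how the cells described above can be produced from raw evaluation outputs (the helper name and rounding are our own illustration, not part of any eval harness):

```python
def format_cell(acc: float, stderr: float) -> str:
    """Render an accuracy and its standard error as 'XX.XX% ± Y.YY%'.

    Both inputs are fractions (e.g. 0.4123) and are converted to
    percentages for consistency with the table.
    """
    return f"{acc * 100:.2f}% ± {stderr * 100:.2f}%"

print(format_cell(0.4123, 0.0101))  # 41.23% ± 1.01%
```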

## Notes

- All accuracy values are presented as percentages.
- Empty cells indicate that the task was not evaluated on that date or for that metric.
- Standard errors are also converted to percentages for consistency.

# Tokenizer

Our tokenizer was trained from scratch on 500,000 samples from the OpenWebText dataset.

For variation, we also included 500,000 samples from our [GitHub-CC0](https://huggingface.co/datasets/KoalaAI/GitHub-CC0) dataset, in the hope that code would be tokenized properly despite our small vocab_size.

Like Mistral, we use LlamaTokenizerFast as our tokenizer class, in legacy mode.

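One common way to quantify how a small-vocabulary tokenizer behaves is fertility: the average number of tokens produced per word. A minimal sketch that works with any callable returning a token list (the character-level tokenizer below is a stand-in for illustration, not our tokenizer):

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words

# Stand-in tokenizer for illustration only: one token per character.
char_tokenize = list

print(fertility(char_tokenize, ["hello world"]))  # 5.5 (11 chars / 2 words)
```

Lower fertility means fewer tokens per document, which is one axis along which a limited tokenizer can affect training speed and coherency.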
## Tokenization Analysis
|