language:
- en
library_name: transformers
---

# Bamboo Nano

This is a WIP foundational (aka base) model trained only on public domain (CC0) datasets, primarily in English.

The primary goal of this model is to see how a limited tokenizer influences model training speed and coherency.

Further training is planned and ongoing. No multilingual datasets are currently in use or planned, though this may change in the future, and the current datasets *can* contain languages other than English.

## License

Though the training data of this model is CC0, the model itself is not. The model is released under the OpenRAIL license, as tagged.

## Planned updates

As mentioned, a few updates are planned:

* Further training on more CC0 data; this model's weights will be updated as we pretrain on more of the listed datasets.
* Extending the context length to 32k tokens using YaRN.
* Fine-tuning the resulting model for instruct, code, and storywriting; these fine-tunes will then be combined using MergeKit to create a MoE model.
* Releasing a GGUF version and an extended-context version of the base model.
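The YaRN item above can be expressed as a `rope_scaling` entry in a Transformers-style model config. The sketch below is illustrative only; the scale factor and native context length are assumptions, not the model's actual configuration:

```python
# Hypothetical sketch only: how YaRN context extension is typically
# expressed in a Hugging Face-style model config. The values below are
# illustrative assumptions, not this model's released configuration.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 16.0,                            # assumed scale factor toward the 32k target
    "original_max_position_embeddings": 2048,  # assumed native context length
}

# The extended context is the native context times the scale factor.
target_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(target_context)  # 32768
```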

## Other model versions

* [Bamboo-400M](https://huggingface.co/KoalaAI/Bamboo-400M)

# Model Performance Tracking

This table tracks the performance of our model on various tasks over time. The metric used is accuracy ('acc').

| Date (YYYY-MM-DD) | arc_easy | hellaswag | sglue_rte | truthfulqa | Avg |
|-------------------|----------|-----------|-----------|------------|-----|

## Legend

- Date: the date of the model checkpoint the evaluation was run on. Pretraining is ongoing, and tests are re-run with that date's model.
- Metric: the evaluation metric used (acc = accuracy).
- Task columns: results for each task, in the format "percentage ± standard error".
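As a sketch of how the cells described above can be produced from raw evaluation outputs (the helper name and rounding are our own illustration, not part of any eval harness):

```python
def format_cell(acc: float, stderr: float) -> str:
    """Render an accuracy and its standard error as 'XX.XX% ± Y.YY%'.

    Both inputs are fractions (e.g. 0.4123) and are converted to
    percentages for consistency with the table.
    """
    return f"{acc * 100:.2f}% ± {stderr * 100:.2f}%"

print(format_cell(0.4123, 0.0101))  # 41.23% ± 1.01%
```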

## Notes

- All accuracy values are presented as percentages.
- Empty cells indicate that the task was not evaluated on that date or for that metric.
- Standard errors are also converted to percentages for consistency.

# Tokenizer

Our tokenizer was trained from scratch on 500,000 samples from the OpenWebText dataset.

For variation, we also included 500,000 samples from our [GitHub-CC0](https://huggingface.co/datasets/KoalaAI/GitHub-CC0) dataset, in the hope that code would be tokenized properly despite our small vocab_size.

Like Mistral, we use LlamaTokenizerFast as our tokenizer class, in legacy mode.

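One common way to quantify how a small-vocabulary tokenizer behaves is fertility: the average number of tokens produced per word. A minimal sketch that works with any callable returning a token list (the character-level tokenizer below is a stand-in for illustration, not our tokenizer):

```python
def fertility(tokenize, texts):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words

# Stand-in tokenizer for illustration only: one token per character.
char_tokenize = list

print(fertility(char_tokenize, ["hello world"]))  # 5.5 (11 chars / 2 words)
```

Lower fertility means fewer tokens per document, which is one axis along which a limited tokenizer can affect training speed and coherency.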
## Tokenization Analysis
|