Text Generation
Transformers
Safetensors
English
mistral
text-generation-inference
Commit 2c7047e by DarwinAnim8or · verified · 1 Parent(s): 747604b

Update README.md

  1. README.md +44 -1
README.md CHANGED
@@ -7,4 +7,47 @@ datasets:
  language:
  - en
  library_name: transformers
- ---
+ ---
+
+ # Bamboo Nano
+ This is a WIP foundational (aka base) model trained only on public domain (CC0) datasets, primarily in the English language.
+ The primary goal of this model is to see how a limited tokenizer influences model training speed and coherence.
+
+ Further training is planned and ongoing. No multi-language datasets are currently in use or planned, though this may change in the future, and the current datasets *can* contain languages other than English.
+
+ ## License
+ Though the training data of this model is CC0, the model itself is not. The model is released under the OpenRAIL license, as tagged.
+
+ ## Planned updates
+ As mentioned, a few updates are planned:
+ * Further training on more CC0 data; this model's weights will be updated as we pretrain on more of the listed datasets.
+ * Experiment with extending the context length to 32k tokens using YaRN.
+ * Fine-tune the resulting model for instruct, code, and storywriting, then combine these fine-tunes using MergeKit to create a MoE model.
+ * Release a GGUF version and an extended-context version of the base model.
+
+ ## Other model versions
+ * [Bamboo-400M](https://huggingface.co/KoalaAI/Bamboo-400M)
+
+ # Model Performance Tracking
+
+ This table tracks the performance of our model on various tasks over time. The metric used is 'acc'.
+
+ | Date (YYYY-MM-DD) | arc_easy | hellaswag | sglue_rte | truthfulqa | Avg |
+ |-------------------|----------|-----------|-----------|------------|-----|
+
+ ## Legend
+ - Date: The date of the model version that was evaluated. Pretraining is ongoing, so the tests are re-run on each date's model.
+ - Metric: The evaluation metric used (acc = accuracy).
+ - Task columns: Results for each task in the format "Percentage ± Standard Error".
+
+ ## Notes
+ - All accuracy values are presented as percentages.
+ - Empty cells indicate that the task was not evaluated on that date or for that metric.
+ - Standard errors are also converted to percentages for consistency.
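
The cell format described in the Legend and Notes can be sketched as a small helper. This is a minimal illustration, not the script used to build the table: it assumes the evaluation tool reports accuracy and standard error as fractions between 0 and 1, and it assumes `Avg` is the plain mean of the tasks evaluated on that date (the README does not say how `Avg` is computed). The function names, the two-decimal rounding, and the sample values are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Task columns, in the order used by the table header.
TASKS = ["arc_easy", "hellaswag", "sglue_rte", "truthfulqa"]


def format_cell(acc: Optional[float], stderr: Optional[float]) -> str:
    """Render one task cell as 'Percentage ± Standard Error'.

    Returns an empty string when the task was not evaluated.
    """
    if acc is None or stderr is None:
        return ""
    # Both values are fractions; convert them to percentages.
    return f"{acc * 100:.2f} ± {stderr * 100:.2f}"


def format_row(date: str, results: Dict[str, Tuple[float, float]]) -> str:
    """Build one Markdown table row from {task: (acc, stderr)} results."""
    cells = []
    accs = []
    for task in TASKS:
        acc, stderr = results.get(task, (None, None))
        cells.append(format_cell(acc, stderr))
        if acc is not None:
            accs.append(acc)
    # Assumption: Avg is the mean accuracy over the evaluated tasks only.
    avg = f"{sum(accs) / len(accs) * 100:.2f}" if accs else ""
    return "| " + " | ".join([date, *cells, avg]) + " |"


if __name__ == "__main__":
    # Hypothetical values, for illustration only (not real results).
    print(format_row("2024-01-01", {"arc_easy": (0.25, 0.01), "hellaswag": (0.5, 0.02)}))
```

Tasks missing from the `results` dictionary produce empty cells, matching the note above about tasks that were not evaluated on a given date.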
+
+ # Tokenizer
+ Our tokenizer was trained from scratch on 500,000 samples from the OpenWebText dataset.
+ For variation, we also included 500,000 samples from our [GitHub-CC0](KoalaAI/GitHub-CC0) dataset, in the hope that code would be tokenized properly despite our small `vocab_size`.
+ Like Mistral, we use `LlamaTokenizerFast` as our tokenizer class, in legacy mode.
+
+ ## Tokenization Analysis