Update test results
README.md

As mentioned, a few updates are planned:

* Fine-tuning the resulting model for instruct, code, and storywriting. These will then be combined using MergeKit to create a MoE model.
* Release a GGUF version and an extended-context version of the base model
## Model Performance Tracking
This table tracks the performance of our model on various tasks over time.
| Date (YYYY-MM-DD) | Metric   | arc_easy       | hellaswag      | sglue_rte      | truthfulqa     | Avg    |
|-------------------|----------|----------------|----------------|----------------|----------------|--------|
| 2024-07-27        | acc      | 27.40% ± 0.92% | 25.52% ± 0.44% | 52.71% ± 3.01% | 39.52% ± 1.11% | 36.29% |
|                   | acc_norm | 27.95% ± 0.92% | 25.03% ± 0.43% | -              | -              | -      |

### Legend
- Date: The date of each evaluation run
- Metric: The evaluation metric used (acc = accuracy, acc_norm = normalized accuracy)
- Task columns: Results for each task in the format "Percentage ± Standard Error"

### Notes
- All accuracy values are presented as percentages
- Empty cells indicate that the task was not evaluated on that date or for that metric
- Standard errors are also converted to percentages for consistency
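
The Avg column and the reported standard errors follow from the per-task accuracies; here is a minimal sketch of both calculations. The standard-error function uses the usual binomial formula, and the sample size in its example is made up for illustration — this is not necessarily the exact computation the evaluation harness performs.

```python
import math

# Per-task accuracies from the 2024-07-27 "acc" row, as fractions.
scores = {
    "arc_easy": 0.2740,
    "hellaswag": 0.2552,
    "sglue_rte": 0.5271,
    "truthfulqa": 0.3952,
}

# The Avg column is the unweighted mean of the four task accuracies:
# (27.40 + 25.52 + 52.71 + 39.52) / 4 = 36.2875, reported as 36.29%.
avg = 100 * sum(scores.values()) / len(scores)

def binomial_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p estimated from n examples."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative only (hypothetical n): 50% accuracy on 100 examples.
se = 100 * binomial_stderr(0.5, 100)
```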
# Tokenizer
Our tokenizer was trained from scratch on 500,000 samples from the OpenWebText dataset. Like Mistral, we use `LlamaTokenizerFast` as our tokenizer class, running in legacy mode.
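
For intuition, the core of training a subword tokenizer from scratch is byte-pair encoding: repeatedly merge the most frequent adjacent pair of symbols. The stdlib-only toy below, with a made-up four-word corpus, illustrates the idea; it is a sketch of the algorithm, not this project's actual training pipeline.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule everywhere it occurs.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Tiny hypothetical corpus; the real tokenizer saw 500,000 samples.
merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
```

With this corpus the first two merges fuse `l`+`o` and then `lo`+`w`, so the frequent word "low" quickly becomes a single token — the same pressure that shapes a production vocabulary.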