fix typos
README.md
CHANGED
@@ -98,10 +98,10 @@ This code outputs the following:
### Training Data / Preprocessing
-
The data used comes from Google DeepMind and the 🤗 hub. The model card can be found [here](https://huggingface.co/datasets/deepmind/mathematics).
Each of the underlying datasets contains a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg *1.x* family of models, we are interested specifically in the solving of one-dimensional linear equations, so we use the *algebra__linear_1d* split.
-
The training and evaluation splits of the 1D linear algebra dataset split
All inputs and labels are then tokenized. We subsequently evaluate the length of each *input_ids* vector and each *labels* vector to ensure there are no outliers and no inputs that need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets
to the 🤗 hub at the following locations: [MarioBarbeque/DeepMind-LinAlg-1D-train](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-train) and [MarioBarbeque/DeepMind-LinAlg-1D-eval](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-eval).
### Training Data / Preprocessing
+
The data used comes from Google DeepMind and the 🤗 hub. The dataset card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). The DeepMind Mathematics DatasetDict object comprises a wide variety of underlying mathematics datasets.
Each of the underlying datasets contains a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg *1.x* family of models, we are interested specifically in the solving of one-dimensional linear equations, so we use the *algebra__linear_1d* split.
+
The training and evaluation splits of the 1D linear algebra dataset are preprocessed as follows: we convert the raw problems and their solutions, which have the form `"b'Solve 65*l - 361 + 881 = 0 for l.\\n'"` and `"b'-8\\n'"`, into the much cleaner `"Solve 65*l - 361 + 881 = 0 for l."` and `"-8"`.
All inputs and labels are then tokenized. We subsequently evaluate the length of each *input_ids* vector and each *labels* vector to ensure there are no outliers and no inputs that need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets
to the 🤗 hub at the following locations: [MarioBarbeque/DeepMind-LinAlg-1D-train](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-train) and [MarioBarbeque/DeepMind-LinAlg-1D-eval](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-eval).
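
As a rough illustration of the cleanup step described above, the sketch below strips the `b'...'` wrapper and the trailing literal `\n` from a raw string. The helper name `clean_example` is hypothetical, and this is not necessarily how the preprocessing was actually implemented; it assumes only that raw examples look like the two shown above.

```python
def clean_example(raw: str) -> str:
    """Strip the b'...' wrapper and trailing literal backslash-n from a raw example.

    Hypothetical helper: assumes raw strings look like "b'...\\n'", i.e. the
    text repr of a bytes object, as in the examples quoted in this section.
    """
    text = raw
    # Drop the leading b' and the closing quote, if both are present.
    if text.startswith("b'") and text.endswith("'"):
        text = text[2:-1]
    # Drop the trailing two-character sequence backslash + n (not a real newline).
    if text.endswith("\\n"):
        text = text[:-2]
    return text.strip()


raw_question = "b'Solve 65*l - 361 + 881 = 0 for l.\\n'"
raw_answer = "b'-8\\n'"

question = clean_example(raw_question)
answer = clean_example(raw_answer)
```

Strings that do not carry the wrapper pass through unchanged apart from whitespace trimming, so applying the helper twice to the same example is harmless.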