MarioBarbeque committed on
Commit
01d49ea
·
verified ·
1 Parent(s): 1909f9a

added quite a few updates

Files changed (1)
  1. README.md +45 -24
README.md CHANGED
@@ -29,8 +29,8 @@ The model weights of *CyberSolve LinAlg 1.2* are a further downstream checkpoint
29
  To construct **CyberSolve LinAlg 1.2**, the *FLAN-T5 large* model is fine-tuned using a custom PyTorch training loop optimized for multiple GPUs. We supervise training of *FLAN-T5 large* on the *algebra__linear_1d* split of the Google DeepMind mathematics dataset, an open source
30
  dataset from Google DeepMind available through the πŸ€— hub [deepmind/math_dataset](https://huggingface.co/datasets/deepmind/math_dataset). This large dataset consists of programmatically generated mathematical problems and their solutions, covering a variety of tasks across unique mathematical disciplines.
31
 
32
- In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve **LinAlg 1.x** family of models is trained on a set of 2M simpler, one-dimensional linear equations. We preprocessed the data and simulated the training process on a smaller,
33
- downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular was trained for 2 additional epochs, limited only by funds, beyond the original *CyberSolve LinAlg 1.1* checkpoint.
34
 
35
  Version 1.2 is the most capable version of CyberSolve LinAlg, achieving an exact match score of **90.75** on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split, a non-trivial improvement over the **86.56** attained by *CyberSolve LinAlg 1.1*.
36
 
@@ -53,16 +53,21 @@ Version 1.2 is the most capable version of CyberSolve LinAlg, scoring a **90.75*
53
 
54
  ### Direct Use
55
 
56
- In order to effectively query the model's ability to solve linear equations, a string of the format `Solve <any one-dimensional linear equation>.` should be tokenized and passed to the model's `generate` method. An example input string is `input_text = "Solve 24 = 1601*c - 1605*c for c."`.
57
- The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
58
 
59
  ## How to Use and Query the Model
60
 
61
- Use the code below to get started with the model. Reference the Nvidia `apex` package for optimized inference. Users pass a `text` string detailing a sentence with a `[MASK]` token. The model will provide options
62
- to fill the mask based on the sentence context and its background of knowledge. Note - the DistilBERT base model was trained on a very large general corpus of text.
63
- In our training, we have fine-tuned the model on the large IMDB movie review dataset. That is, the model is now accustomed to filling `[MASK]` tokens with words related to
64
- the domain of movies/tv/films. To see the model's affinity for cinematic lingo, it is best to be considerate in one's prompt engineering. Specifically, to most likely generate movie-related text,
65
- one should ideally pass a masked `text` string that could reasonably be found in someone's review of a movie. See the example below:
66
 
67
  ``` python
68
 
@@ -93,34 +98,38 @@ This code outputs the following:
93
 
94
  ### Training Data / Preprocessing
95
 
96
- The data used comes from Google DeepMind and the πŸ€— hub. The model card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). This dataset is preprocessed in the
97
- following way: The train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
98
- applies a custom random masking function when batching. We mask 15% of the tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
99
- and passed to a `DataCollator` with the default collating function.
100
 
101
  ### Training Procedure
102
 
103
- The model was trained locally on a single node with multiple Nvidia A100 GPUs using πŸ€— Transformers, πŸ€— Tokenizers, and a custom PyTorch training loop that made use of πŸ€— Accelerate.
104
 
105
 
106
  #### Training Hyperparameters
107
 
108
  - **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
109
- - **Optimizer:** `apex.optimizers.FusedAdam`, a fused kernel version of the AdamW optimizer from Nvidia `apex`
110
- - **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
111
- - **Batch Size:** 32
112
- - **Number of Training Steps**: 2877 steps over the course of 3 epochs, followed by
113
 
114
 
115
  ## Evaluation / Metrics
116
 
117
- We evaluate our masked language model's performance using the `perplexity` metric, which has a few mathematical definitions. We define the perplexity as the exponential of the cross-entropy.
118
- To remove randomness in our metrics, we premask our evaluation dataset with a single masking function. This ensures we are evaluating with respect to the same set of labels each epoch.
119
- See the Wikipedia links for perplexity and cross-entropy below for a more detailed discussion and various other definitions.
120
-
121
- Cross-entropy: [https://en.wikipedia.org/wiki/Cross-entropy](https://en.wikipedia.org/wiki/Cross-entropy)
122
 
123
- Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity)
124
 
125
 
126
  ### Testing Data, Factors & Metrics
@@ -140,6 +149,18 @@ We find the following perplexity metrics over 3 training epochs:
140
  |1 | 16.28 |
141
  |2 | 15.78 |
142
 
143
  #### Summary
144
 
145
  We train this model for the purpose of attempting a local, multi-GPU training of a text-to-text language model using both the πŸ€— ecosystem and a custom PyTorch training and evaluation loop.
 
29
  To construct **CyberSolve LinAlg 1.2**, the *FLAN-T5 large* model is fine-tuned using a custom PyTorch training loop optimized for multiple GPUs. We supervise training of *FLAN-T5 large* on the *algebra__linear_1d* split of the Google DeepMind mathematics dataset, an open source
30
  dataset from Google DeepMind available through the πŸ€— hub [deepmind/math_dataset](https://huggingface.co/datasets/deepmind/math_dataset). This large dataset consists of programmatically generated mathematical problems and their solutions, covering a variety of tasks across unique mathematical disciplines.
31
 
32
+ In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve **LinAlg 1.x** family of models is trained on a set of 2M simpler, one-dimensional linear equations.
33
+ We preprocessed the data and simulated the training on a smaller, downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular was trained for 2 additional epochs, limited only by funds, beyond the original *CyberSolve LinAlg 1.1* checkpoint.
34
 
35
  Version 1.2 is the most capable version of CyberSolve LinAlg, achieving an exact match score of **90.75** on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split, a non-trivial improvement over the **86.56** attained by *CyberSolve LinAlg 1.1*.
36
 
 
53
 
54
  ### Direct Use
55
 
56
+ In order to effectively query the model's ability to solve linear equations, a string of the format `"Solve <any one-dimensional linear equation of variable x> for x."` should be tokenized and passed to the model's `generate` method.
57
+ An example input string is `input_text = "Solve 24 = 1601*c - 1605*c for c."`. The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
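Because the model's output is a bare number, a prediction can be sanity-checked by substituting it back into the equation. A minimal plain-Python sketch (the helper name is ours, not from the model's API; `-6` is the true solution of the example equation, since `1601*c - 1605*c = -4*c`):

``` python
def check_solution(equation: str, prediction: str, var: str) -> bool:
    # substitute the predicted value for the variable and compare both sides
    lhs, rhs = equation.split("=")
    value = f"({float(prediction)})"
    return abs(eval(lhs.replace(var, value)) - eval(rhs.replace(var, value))) < 1e-9

# 24 = 1601*c - 1605*c simplifies to 24 = -4*c, so c = -6
print(check_solution("24 = 1601*c - 1605*c", "-6", "c"))  # True
```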
58
 
59
  ## How to Use and Query the Model
60
 
61
+ Use the code below to get started with the model. Users pass an `input_text` string (again, of the form `input_text = "Solve 24 = 1601*c - 1605*c for c."`), which prompts the model to solve a one-dimensional linear equation.
62
+ Model prediction is significantly faster on a GPU, so it is best practice to use the `.to('cuda')` commands to ensure both the model and all input ids are on the GPU.
63
+
64
+ Furthermore, the FLAN-T5 model architecture makes use
65
+ of many normalization layers, as is common in the transformer architecture. By default, CyberSolve uses the T5 model's `T5LayerNorm` Python class; it is highly recommended that users install the Nvidia `Apex` package for Nvidia GPUs
66
+ or the ROCm `Apex` package for AMD GPUs. Once installed, the model will default to using the `apex.normalization.FusedRMSNorm` class when computing the normalization layers. The `FusedRMSNorm` class from `apex` makes use of an optimized fused kernel
67
+ that is much faster than the standard `T5LayerNorm` class, thereby significantly speeding up both inference and training.
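For reference, `T5LayerNorm` and `apex.normalization.FusedRMSNorm` both compute RMS normalization: activations are scaled by their reciprocal root mean square and multiplied by a learned weight, with no mean subtraction and no bias. A pure-Python sketch of the computation (the fused kernel computes the same quantity, only faster):

``` python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by 1/sqrt(mean(x_i^2) + eps), then apply the learned weight;
    # unlike standard LayerNorm there is no mean subtraction and no bias term
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]
```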
68
+
69
+ The base FLAN-T5 model is capable of answering a variety of prompts, but the domain-adapted CyberSolve LinAlg model is designed specifically for solving linear equations. As such, users must be considerate in their prompt
70
+ engineering to issue a coherent, relevant query as outlined above and below.
71
 
72
  ``` python
73
 
 
98
 
99
  ### Training Data / Preprocessing
100
 
101
+ The data used comes from Google DeepMind and the πŸ€— hub. The dataset card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). The DeepMind Mathematics `DatasetDict` object is composed of a vast variety of underlying mathematics datasets.
102
+ Each of the underlying datasets contains a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg *1.x* family of models, we are interested specifically in the solving of one-dimensional linear equations, so we use the *algebra__linear_1d* split.
103
+
104
+ The training and evaluation splits of the 1D linear algebra dataset are preprocessed in the following way: we reformat the raw problems and their solutions of the form `"b'Solve 65*l - 361 + 881 = 0 for l.\\n'"` and `"b'-8\\n'"` into the much cleaner `"Solve 65*l - 361 + 881 = 0 for l."` and `"-8"`.
105
+ All inputs and labels are then tokenized. We subsequently evaluate the length of each *input_ids* vector and each *labels* vector to ensure there are no outliers and no inputs that need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets
106
+ to the πŸ€— hub at the following locations: [MarioBarbeque/DeepMind-LinAlg-1D-train](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-train) and [MarioBarbeque/DeepMind-LinAlg-1D-eval](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-eval).
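The reformatting step above amounts to stripping the bytes-literal wrapper and the escaped newline from each raw string. A minimal sketch of that cleaning, assuming the raw examples arrive as the quoted strings shown above (the helper name is ours, not from the training code):

``` python
def clean_example(raw: str) -> str:
    # strip the b'...' bytes-literal wrapper and the trailing \n escape
    if raw.startswith("b'") and raw.endswith("'"):
        raw = raw[2:-1]
    return raw.replace("\\n", "").strip()

print(clean_example("b'Solve 65*l - 361 + 881 = 0 for l.\\n'"))  # Solve 65*l - 361 + 881 = 0 for l.
print(clean_example("b'-8\\n'"))  # -8
```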
107
+
108
 
109
  ### Training Procedure
110
 
111
+ The model was trained locally on a single node with multiple Nvidia A100 GPUs using πŸ€— Transformers, πŸ€— Tokenizers, and a custom PyTorch training loop that made use of both Nvidia Apex and πŸ€— Accelerate.
112
 
113
 
114
  #### Training Hyperparameters
115
 
116
  - **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
117
+ - **Optimizer:** `apex.optimizers.FusedAdam`, a fused kernel version of the AdamW optimizer from Nvidia Apex
118
+ - **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 1e-4 to further adjust the CyberSolve LinAlg **1.1** weights
119
+ - **Batch Size:** 64
120
+ - **Number of Training Steps:** 1918 training steps over 2 additional epochs (CyberSolve LinAlg **1.2**), beyond the original 2877 total steps over 3 epochs (CyberSolve LinAlg **1.1**)
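For intuition, the linear schedule decays the learning rate from its initial value toward zero over the total number of training steps. A minimal sketch of the decay rule, assuming no warmup (our reading of the setup above):

``` python
def linear_lr(step: int, total_steps: int = 1918, base_lr: float = 1e-4) -> float:
    # linearly decay from base_lr at step 0 to 0 at total_steps
    return base_lr * max(0.0, 1.0 - step / total_steps)

print(linear_lr(0))     # 0.0001
print(linear_lr(959))   # halfway through training -> 5e-05
print(linear_lr(1918))  # 0.0
```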
121
 
122
 
123
  ## Evaluation / Metrics
124
 
125
+ We evaluate our text-to-text linear equation solver using the `exact_match` metric, comparing the model's decoded predictions with their numeric labels. *CyberSolve LinAlg 1.2* achieves an exact match score of **90.75**
126
+ on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split. This is a non-trivial improvement over the exact match score of **86.56** attained by *CyberSolve LinAlg 1.1*.
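The `exact_match` computation itself is simple: the percentage of decoded prediction strings identical to their label strings. A minimal sketch on a 0–100 scale to match the scores above (the πŸ€— `evaluate` library's `exact_match` metric reports the same quantity as a 0–1 fraction):

``` python
def exact_match(predictions, references):
    # percentage of decoded predictions identical to their reference labels
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

print(exact_match(["-8", "3", "17", "-6"], ["-8", "3", "12", "-6"]))  # 75.0
```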
127
 
128
+ Additionally, we construct a partial correctness dataset available at the following model card: [MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark](https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark).
129
+ This dataset was created with the goal of analyzing both the token-to-token and decoded-sequence-to-decoded-sequence partial correctness of CyberSolve's predictions in detail, beyond simply whether its answers are right or wrong. Similar partial correctness benchmark datasets were created for the
130
+ initial [FLAN-T5 model](https://huggingface.co/datasets/MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark), the preliminary [zeroth-generation downsampled training](https://huggingface.co/datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2) of CyberSolve, and
131
+ the [1.1 version](https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.1-correctness-benchmark) of the model. *We have yet to complete partial correctness analysis of the various model versions and their predictions, but we look forward to better understanding the mathematical
132
+ reasoning capabilities of these models and publishing our results when complete!*
133
 
134
 
135
  ### Testing Data, Factors & Metrics
 
149
  |1 | 16.28 |
150
  |2 | 15.78 |
151
 
152
+ We find the following exact match benchmark scores for each of our neural models:
153
+
154
+ | model version | exact_match score |
155
+ |-----------------------|-------------------|
156
+ | CyberSolve LinAlg 1.1 | 86.56 |
157
+ | CyberSolve LinAlg 1.2 | 90.75 |
160
+
161
+
162
+
163
+
164
  #### Summary
165
 
166
  We train this model for the purpose of attempting a local, multi-GPU training of a text-to-text language model using both the πŸ€— ecosystem and a custom PyTorch training and evaluation loop.