MarioBarbeque committed on
Commit
01d49ea
·
verified ·
1 Parent(s): 1909f9a

added quite a few updates

Files changed (1)
  1. README.md +45 -24
README.md CHANGED
@@ -29,8 +29,8 @@ The model weights of *CyberSolve LinAlg 1.2* are a further downstream checkpoint
29
  To construct **CyberSolve LinAlg 1.2**, the *FLAN-T5 large* model is fine-tuned using a custom PyTorch training loop optimized for multiple GPUs. We supervise training of *FLAN-T5 large* on the *algebra__linear_1d* split of the Google DeepMind mathematics dataset, an open source
30
  dataset from Google DeepMind available through the πŸ€— hub [deepmind/math_dataset](https://huggingface.co/datasets/deepmind/math_dataset). This large dataset consists of programmatically generated mathematical problems and their solutions, covering a variety of tasks across unique mathematical disciplines.
31
 
32
- In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve **LinAlg 1.x** family of models is trained on a set of 2M simpler, one-dimensional linear equations. We preprocessed the data and simulated the training process on a smaller,
33
- downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular was trained for 2 additional epochs, limited only by funds, beyond the original *CyberSolve LinAlg 1.1* checkpoint.
34
 
35
  Version 1.2 is the most capable version of CyberSolve LinAlg, achieving an exact match score of **90.75** on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split, a non-trivial improvement over the **86.56** attained by *CyberSolve LinAlg 1.1*.
36
 
@@ -53,16 +53,21 @@ Version 1.2 is the most capable version of CyberSolve LinAlg, scoring a **90.75*
53
 
54
  ### Direct Use
55
 
56
- In order to effectively query the model's ability to solve linear equations, a string of the format `Solve <any one-dimensional linear equation>.` should be tokenized and passed to the model's `generate` method. An example input string is `input_text = "Solve 24 = 1601*c - 1605*c for c."`.
57
- The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
58
 
59
  ## How to Use and Query the Model
60
 
61
- Use the code below to get started with the model. Reference the Nvidia `apex` package for optimized inference. Users pass a `text` string detailing a sentence with a `[MASK]` token. The model will provide options
62
- to fill the mask based on the sentence context and its background of knowledge. Note - the DistilBERT base model was trained on a very large general corpus of text.
63
- In our training, we have fine-tuned the model on the large IMDB movie review dataset. That is, the model is now accustomed to filling `[MASK]` tokens with words related to
64
- the domain of movies/tv/films. To see the model's affinity for cinematic lingo, it is best to be considerate in one's prompt engineering. Specifically, to most likely generate movie-related text,
65
- one should ideally pass a masked `text` string that could reasonably be found in someone's review of a movie. See the example below:
66
 
67
  ``` python
68
 
@@ -93,34 +98,38 @@ This code outputs the following:
93
 
94
  ### Training Data / Preprocessing
95
 
96
- The data used comes from Google DeepMind and the πŸ€— hub. The model card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). This dataset is preprocessed in the
97
- following way: The train and test splits are tokenized, concatenated, and chunked into chunks of 256 tokens. We subsequently load the training data into a `DataCollator` that
98
- applies a custom random masking function when batching. We mask 15% of the tokens in each chunk. The evaluation data is masked in its entirety, to remove randomness when evaluating,
99
- and passed to a `DataCollator` with the default collating function.
100
 
101
  ### Training Procedure
102
 
103
- The model was trained locally on a single node with multiple Nvidia A100 GPUs using πŸ€— Transformers, πŸ€— Tokenizers, and a custom PyTorch training loop that made use of πŸ€— Accelerate.
104
 
105
 
106
  #### Training Hyperparameters
107
 
108
  - **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
109
- - **Optimizer:** `apex.optimizers.FusedAdam`, a fused kernel version of the AdamW optimizer from Nvidia `apex`
110
- - **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 5e-5
111
- - **Batch Size:** 32
112
- - **Number of Training Steps**: 2877 steps over the course of 3 epochs, followed by
113
 
114
 
115
  ## Evaluation / Metrics
116
 
117
- We evaluate our masked language model's performance using the `perplexity` metric, which has a few mathematical definitions. We define the perplexity as the exponential of the cross-entropy.
118
- To remove randomness in our metrics, we premask our evaluation dataset with a single masking function. This ensures we are evaluating with respect to the same set of labels each epoch.
119
- See the Wikipedia links for perplexity and cross-entropy below for a more detailed discussion and various other definitions.
120
-
121
- Cross-entropy: [https://en.wikipedia.org/wiki/Cross-entropy](https://en.wikipedia.org/wiki/Cross-entropy)
122
 
123
- Perplexity: [https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity)
124
 
125
 
126
  ### Testing Data, Factors & Metrics
@@ -140,6 +149,18 @@ We find the following perplexity metrics over 3 training epochs:
140
  |1 | 16.28 |
141
  |2 | 15.78 |
142
 
143
  #### Summary
144
 
145
  We train this model for the purpose of attempting a local, multi-GPU training of a text-to-text language model using both the πŸ€— ecosystem and a custom PyTorch training and evaluation loop.
 
29
  To construct **CyberSolve LinAlg 1.2**, the *FLAN-T5 large* model is fine-tuned using a custom PyTorch training loop optimized for multiple GPUs. We supervise training of *FLAN-T5 large* on the *algebra__linear_1d* split of the Google DeepMind mathematics dataset, an open source
30
  dataset from Google DeepMind available through the πŸ€— hub [deepmind/math_dataset](https://huggingface.co/datasets/deepmind/math_dataset). This large dataset consists of programmatically generated mathematical problems and their solutions, covering a variety of tasks across unique mathematical disciplines.
31
 
32
+ In this preliminary family of CyberSolve models, we are specifically interested in understanding the ability of neural models to solve non-trivial mathematical tasks. As such, the CyberSolve **LinAlg 1.x** family of models is trained on a set of 2M simpler, one-dimensional linear equations.
33
+ We preprocessed the data and simulated the training on a smaller, downsampled subset of the dataset before training for multiple epochs over the dataset's entirety. This model in particular was trained for 2 additional epochs, limited only by funds, beyond the original *CyberSolve LinAlg 1.1* checkpoint.
34
 
35
  Version 1.2 is the most capable version of CyberSolve LinAlg, achieving an exact match score of **90.75** on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split, a non-trivial improvement over the **86.56** attained by *CyberSolve LinAlg 1.1*.
36
 
 
53
 
54
  ### Direct Use
55
 
56
+ In order to effectively query the model's ability to solve linear equations, a string of the format `"Solve <any one-dimensional linear equation of variable x> for x."` should be tokenized and passed to the model's `generate` method.
57
+ An example input string is `input_text = "Solve 24 = 1601*c - 1605*c for c."`. The model will attempt to solve the equation, outputting its prediction in a simple numeric format. See the example below.
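Because the model's output is a bare number, a prediction can be sanity-checked by substituting it back into the equation. A minimal plain-Python sketch (the helper name is ours, not from the model's API; `-6` is the true solution of the example equation, since `1601*c - 1605*c = -4*c`):

``` python
def check_solution(equation: str, prediction: str, var: str) -> bool:
    # substitute the predicted value for the variable and compare both sides
    lhs, rhs = equation.split("=")
    value = f"({float(prediction)})"
    return abs(eval(lhs.replace(var, value)) - eval(rhs.replace(var, value))) < 1e-9

# 24 = 1601*c - 1605*c simplifies to 24 = -4*c, so c = -6
print(check_solution("24 = 1601*c - 1605*c", "-6", "c"))  # True
```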
58
 
59
  ## How to Use and Query the Model
60
 
61
+ Use the code below to get started with the model. Users pass an `input_text` string (again, of the form `input_text = "Solve 24 = 1601*c - 1605*c for c."`), which prompts the model to solve a one-dimensional linear equation.
62
+ Model prediction is significantly faster on a GPU, so it is best practice to use the `.to('cuda')` commands to ensure both the model and all input ids are on the GPU.
63
+
64
+ Furthermore, the FLAN-T5 model architecture makes use
65
+ of many normalization layers, as is common in the transformer architecture. By default, CyberSolve uses the T5 model's `T5LayerNorm` Python class; it is highly recommended that users install the Nvidia `Apex` package for Nvidia GPUs
66
+ or the ROCm `Apex` package for AMD GPUs. Once installed, the model will default to using the `apex.normalization.FusedRMSNorm` class when computing the normalization layers. The `FusedRMSNorm` class from `apex` makes use of an optimized fused kernel
67
+ that is much faster than the standard `T5LayerNorm` class, thereby significantly speeding up both inference and training.
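For reference, `T5LayerNorm` and `apex.normalization.FusedRMSNorm` both compute RMS normalization: activations are scaled by their reciprocal root mean square and multiplied by a learned weight, with no mean subtraction and no bias. A pure-Python sketch of the computation (the fused kernel computes the same quantity, only faster):

``` python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by 1/sqrt(mean(x_i^2) + eps), then apply the learned weight;
    # unlike standard LayerNorm there is no mean subtraction and no bias term
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]
```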
68
+
69
+ The base FLAN-T5 model is capable of answering a variety of prompts, but the domain-adapted CyberSolve LinAlg model is designed specifically for solving linear equations. As such, users must be considerate in their prompt
70
+ engineering to issue a coherent, relevant query as outlined above and below.
71
 
72
  ``` python
73
 
 
98
 
99
  ### Training Data / Preprocessing
100
 
101
+ The data used comes from Google DeepMind and the πŸ€— hub. The dataset card can be found [here](https://huggingface.co/datasets/deepmind/mathematics). The DeepMind Mathematics `DatasetDict` object is composed of a vast variety of underlying mathematics datasets.
102
+ Each of the underlying datasets contains a specific class of mathematical problems and their solutions. For the CyberSolve LinAlg *1.x* family of models, we are interested specifically in the solving of one-dimensional linear equations, so we use the *algebra__linear_1d* split.
103
+
104
+ The training and evaluation splits of the 1D linear algebra dataset are preprocessed in the following way: we reformat the raw problems and their solutions of the form `"b'Solve 65*l - 361 + 881 = 0 for l.\\n'"` and `"b'-8\\n'"` into the much cleaner `"Solve 65*l - 361 + 881 = 0 for l."` and `"-8"`.
105
+ All inputs and labels are then tokenized. We subsequently evaluate the length of each *input_ids* vector and each *labels* vector to ensure there are no outliers and no inputs that need to be truncated. For later ease of loading, we upload these preprocessed and tokenized training and evaluation datasets
106
+ to the πŸ€— hub at the following locations: [MarioBarbeque/DeepMind-LinAlg-1D-train](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-train) and [MarioBarbeque/DeepMind-LinAlg-1D-eval](https://huggingface.co/datasets/MarioBarbeque/DeepMind-LinAlg-1D-eval).
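The reformatting step above amounts to stripping the bytes-literal wrapper and the escaped newline from each raw string. A minimal sketch of that cleaning, assuming the raw examples arrive as the quoted strings shown above (the helper name is ours, not from the training code):

``` python
def clean_example(raw: str) -> str:
    # strip the b'...' bytes-literal wrapper and the trailing \n escape
    if raw.startswith("b'") and raw.endswith("'"):
        raw = raw[2:-1]
    return raw.replace("\\n", "").strip()

print(clean_example("b'Solve 65*l - 361 + 881 = 0 for l.\\n'"))  # Solve 65*l - 361 + 881 = 0 for l.
print(clean_example("b'-8\\n'"))  # -8
```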
107
+
108
 
109
  ### Training Procedure
110
 
111
+ The model was trained locally on a single node with multiple Nvidia A100 GPUs using πŸ€— Transformers, πŸ€— Tokenizers, and a custom PyTorch training loop that made use of both Nvidia Apex and πŸ€— Accelerate.
112
 
113
 
114
  #### Training Hyperparameters
115
 
116
  - **Precision:** We use FP32 precision, the same precision as the base "google/flan-t5-large" model.
117
+ - **Optimizer:** `apex.optimizers.FusedAdam`, a fused kernel version of the AdamW optimizer from Nvidia Apex
118
+ - **Learning Rate:** We use a linear learning rate scheduler with an initial learning rate of 1e-4 to further adjust the CyberSolve LinAlg **1.1** weights
119
+ - **Batch Size:** 64
120
+ - **Number of Training Steps:** 1918 training steps over 2 additional epochs (CyberSolve LinAlg **1.2**), beyond the original 2877 total steps over 3 epochs (CyberSolve LinAlg **1.1**)
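For intuition, the linear schedule decays the learning rate from its initial value toward zero over the total number of training steps. A minimal sketch of the decay rule, assuming no warmup (our reading of the setup above):

``` python
def linear_lr(step: int, total_steps: int = 1918, base_lr: float = 1e-4) -> float:
    # linearly decay from base_lr at step 0 to 0 at total_steps
    return base_lr * max(0.0, 1.0 - step / total_steps)

print(linear_lr(0))     # 0.0001
print(linear_lr(959))   # halfway through training -> 5e-05
print(linear_lr(1918))  # 0.0
```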
121
 
122
 
123
  ## Evaluation / Metrics
124
 
125
+ We evaluate our text-to-text linear equation solver using the `exact_match` metric, comparing the model's decoded predictions with their numeric labels. *CyberSolve LinAlg 1.2* achieves an exact match score of **90.75**
126
+ on the evaluation set of 10k linear equations from the DeepMind *algebra__linear_1d* split. This is a non-trivial improvement over the exact match score of **86.56** attained by *CyberSolve LinAlg 1.1*.
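The `exact_match` computation itself is simple: the percentage of decoded prediction strings identical to their label strings. A minimal sketch on a 0–100 scale to match the scores above (the πŸ€— `evaluate` library's `exact_match` metric reports the same quantity as a 0–1 fraction):

``` python
def exact_match(predictions, references):
    # percentage of decoded predictions identical to their reference labels
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

print(exact_match(["-8", "3", "17", "-6"], ["-8", "3", "12", "-6"]))  # 75.0
```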
127
 
128
+ Additionally, we construct a partial correctness dataset available at the following model card: [MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark](https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.2-correctness-benchmark).
129
+ This dataset was created with the goal of analyzing both the token-to-token and decoded-sequence-to-decoded-sequence partial correctness of CyberSolve's predictions in detail, beyond simply whether its answers are right or wrong. Similar partial correctness benchmark datasets were created for the
130
+ initial [FLAN-T5 model](https://huggingface.co/datasets/MarioBarbeque/FLAN-T5-DeepMind-LinAlg-1D-benchmark), the preliminary [zeroth-generation downsampled training](https://huggingface.co/datasets/MarioBarbeque/CyberSolve-DeepMind-LinAlg-1D-downsample-benchmark-v2) of CyberSolve, and
131
+ the [1.1 version](https://huggingface.co/datasets/MarioBarbeque/CyberSolve-LinAlg-1.1-correctness-benchmark) of the model. *We have yet to complete partial correctness analysis of the various model versions and their predictions, but we look forward to better understanding the mathematical
132
+ reasoning capabilities of these models and publishing our results when complete!*
133
 
134
 
135
  ### Testing Data, Factors & Metrics
 
149
  |1 | 16.28 |
150
  |2 | 15.78 |
151
 
152
+ We find the following exact match benchmark scores for each of our neural models:
153
+
154
+ | model version | exact_match score |
155
+ |-----------------------|-------------------|
156
+ | CyberSolve LinAlg 1.1 | 86.56 |
157
+ | CyberSolve LinAlg 1.2 | 90.75 |
160
+
161
+
162
+
163
+
164
  #### Summary
165
 
166
  We train this model for the purpose of attempting a local, multi-GPU training of a text-to-text language model using both the πŸ€— ecosystem and a custom PyTorch training and evaluation loop.