diff --git "a/amazon-review-prediction.rmp" "b/amazon-review-prediction.rmp" new file mode 100644--- /dev/null +++ "b/amazon-review-prediction.rmp" @@ -0,0 +1,1521 @@ + + + + + + + + + + + + + + + + + Load data. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Change role to 'regular' for all columns. + + + + + + + + + + + + + + + + Define the target column for the predictive model. + + + + + + + + + Should define a target column? + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Discretize by binning (same range per bin). + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Discretize by frequency (same count per bin). + + + + + + + + + Should discretize numerical target column? + + + + + + + + + + + + + + + + + + + + + + + + + + + + Map some nominal target values to new values. + + + + + + + + + Should map nominal values? + + + + + + + + + + + + + + + + + + + + + + + + + + + + Make sure that target is binary for positive class mapping. + + + + + + + + + + + + + + + + + Potentially define which one should be the positive class. + + + + + + + + + + Should define positive class? + + + + + + + + + + + + + + + + + + + + + + + + + + Potentially remove columns. + + + + + + + + + Should remove columns? + + + + + + + + + + + + + + + + + + No date processing is desired here, so simply remove the date columns completely. + + + + + + + + + + + + + + + + + + + + + + + Check if there actually are any date columns in the data. + + + + + + + + + + + + + + Adds an additional column with the date today. This can be useful for calculations of ages etc. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Select the other way around here and store in the macro if that column already exists. + + + + + + + + Store if the other way round exists. + + + + + + + + + + + + + Generate the difference for the two date columns in milliseconds. + + + + + + + + + + + + + + + + + Both date columns are the same or the other way round already was created - do nothing here! + + Only calculate the differences between the two date columns if the columns are not equal and if the other way around has not been calculated yet. + + + + + + + + + + + Loop over all combinations of date attributes and calculate their differences (which includes the new today column generated previously). + + + + + + + + + Loop over all combinations of date attributes and calculate their differences (which includes the new today column generated previously). + + + + + + + + + + + + + + + Remove the generated today column again. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + <- Extract Day of Month + <- Extract Month of Year + <- Extract Year + <- Extract Quarter of Year + <- Extract Half of the Year + <- Extract Day of Week + <- Extract Month of Quarter + <- Extract Time Features + <- Remove original date column + <- Remove all constant columns (e.g. because all in the same month) + <- Transform all remaining date / time features into binary numerical ones + <- Rename all generated date attributes with a color instead of an underscore. + + Loop over the date columns. We needed the check and branch before since otherwise this loop will fail. + + + + + + + + + + + + + + + + + + + + Do nothing if there are no date columns in the data table. + + If there are any date columns in the data, work on them inside of this Branch operator. + + + + + + + + + + + Should handle dates? + + + + + + + + + + + + + + + + + + + Remove all unused values so that they are not shown by models and do not change calculations based on the number of values in the data. Also order the value mappings alphabetically. + + + + + + + + + + + + + + + Transform all nominal columns to text so that we make sure that all will have polynominal type after the next transformation. + + + + + + + + + + + + + + + Transform all text columns into polynominal columns. + + + + + + + + + + + + + + + Turn all numerical columns (not integers though) into real columns. + + + + + + + + + + + + + + + + Define the value type of all columns which have been identified as text. + + + + + + + + + + + + + Unify all value types + + + + + + + + + + + + + + + + All general preprocessing steps happen inside this operator + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + Should handle text columns? + + + + + + + + + + + + + + + + + + + Removes columns with too many nominal values. + + + + + + + Remove useless columns like constants. + + + + + + + + Order columns alphabetically. + + + + + + + + Define a result name. + + + + + + + Define a result name. + + + + + + + + + + + + + + + + + +