Summary:

tags:
- regression
- sales-prediction
library: scikit-learn

Data Analysis Key Findings

The dataset contains 39,435 entries with 'input' (sales data description) and 'output' (total sales value) columns.
The 'output' column containing sales values was successfully converted from an object type to a numeric (float) type, and rows with parsing errors were removed.
The 'product_name' was successfully extracted from the 'input' column.
The 38,652 rows remaining after cleaning were split into training (80%) and testing (20%) sets, giving 30,921 training samples and 7,731 testing samples.
A Pipeline that vectorizes the 'product_name' feature with TfidfVectorizer and fits a RandomForestRegressor was trained on the training set.
For efficient querying, a separate DataFrame df_query was created with 'product_name' set as the index and 'sales_value' as a numeric type.
A function query_sales_data was implemented to filter the sales data DataFrame based on a provided query string using dataframe.eval().
The trained prediction model showed a Mean Absolute Error (MAE) of 2224.90, a Root Mean Squared Error (RMSE) of 31031.07, and an R-squared (R2) Score of 0.05 on the test set.
A function predict_and_query was successfully developed to integrate the prediction model and querying functionality, allowing users to predict sales for a product and retrieve its actual sales data.
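
The steps listed above can be sketched end to end on a toy table. This is a minimal illustration, not the code used for this model: the sample rows, the hyperparameters (n_estimators, random_state), and the function bodies and signatures are assumptions. Only the column names 'product_name' and 'sales_value' and the result keys 'predicted_sales' and 'actual_sales' come from this card.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up rows standing in for the real dataset (illustrative only).
df = pd.DataFrame({
    "product_name": [
        "PHONE A 128GB", "PHONE A 256GB", "PHONE B 128GB", "LAPTOP X 16GB",
        "LAPTOP X 32GB", "LAPTOP Y 16GB", "TABLET M 64GB", "TABLET M 128GB",
        "WATCH S 41MM", "WATCH S 45MM", "CABLE USB C",
    ],
    "sales_value": [
        "1200.50", "1500.00", "900.00", "2100.00", "2600.00", "1800.00",
        "450.00", "520.00", "300.00", "n/a", "15.00",
    ],
})

# Convert the object column to float; unparseable values become NaN and are dropped.
df["sales_value"] = pd.to_numeric(df["sales_value"], errors="coerce")
df = df.dropna(subset=["sales_value"]).reset_index(drop=True)

# 80/20 split on the cleaned rows.
X_train, X_test, y_train, y_test = train_test_split(
    df["product_name"], df["sales_value"], test_size=0.2, random_state=42
)

# TF-IDF on the product name feeding a random forest regressor.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

# Evaluate on the held-out rows.
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)

# Lookup table indexed by product name.
df_query = df.set_index("product_name")

def query_sales_data(frame, query):
    """Filter rows with a boolean expression string, e.g. 'sales_value > 2000'."""
    return frame[frame.eval(query)]

def predict_and_query(product_name):
    """Hypothetical version: predict sales and look up actual sales if the name is known."""
    predicted = float(model.predict([product_name])[0])
    actual = None
    if product_name in df_query.index:
        actual = df_query.loc[[product_name], "sales_value"].tolist()
    return {"predicted_sales": predicted, "actual_sales": actual}

print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
print(query_sales_data(df_query, "sales_value > 2000"))
```

With so few rows the metrics here are meaningless; the sketch only shows how the pieces connect.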

Insights or Next Steps

The low R-squared score (0.05) means the current model, which uses only TF-IDF features from the product name, explains about 5% of the variance in sales on the test set. The RMSE is roughly 14 times the MAE, which suggests that a small number of very large errors dominate the squared-error metrics. Adding features such as time-based information or product categories could improve model performance.
The implemented query_sales_data function provides a basic querying capability. For a large dataset or more complex querying needs, consider implementing a more robust data storage and querying solution, such as a database.
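
As one possible direction for the database suggestion above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, index name, helper functions, and sample rows are all hypothetical and are not part of this model's code.

```python
import sqlite3

# Made-up rows for illustration only.
rows = [("WIDGET A", 1200.5), ("WIDGET B", 89.0), ("GADGET C", 15000.0)]

# In-memory database; a file path would be used for persistent storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_name TEXT, sales_value REAL)")
# An index on product_name keeps exact-name lookups fast as the table grows.
conn.execute("CREATE INDEX idx_sales_product ON sales (product_name)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

def sales_for_product(connection, product_name):
    """Return all recorded sales values for an exact product name."""
    cur = connection.execute(
        "SELECT sales_value FROM sales WHERE product_name = ?",
        (product_name,),
    )
    return [value for (value,) in cur.fetchall()]

def products_above(connection, threshold):
    """Return product names with sales above a threshold, highest first."""
    cur = connection.execute(
        "SELECT product_name FROM sales WHERE sales_value > ? "
        "ORDER BY sales_value DESC",
        (threshold,),
    )
    return [name for (name,) in cur.fetchall()]

print(sales_for_product(conn, "WIDGET B"))
print(products_above(conn, 1000))
```

Parameterized queries pass user input as data rather than as part of an evaluated expression string, which is one reason to prefer this approach when queries come from untrusted users.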

# Example of how to use the predict_and_query function

# You can replace this with any product name you want to predict for.
# If the product name exists in the original data, its actual sales are also retrieved.
sample_product_name_for_prediction = 'APPLE IPHONE 16 128GB SS "MP01213171"'

result = predict_and_query(sample_product_name_for_prediction)

print(f"Prediction and Query Result for '{sample_product_name_for_prediction}':")
print(f"Predicted Sales: {result['predicted_sales']:.2f}")
print(f"Actual Sales: {result['actual_sales']}")

# You can try a product name that might not be in the training data to see only the prediction.
sample_product_name_new = "A completely new product not in the dataset"

result_new = predict_and_query(sample_product_name_new)

print(f"\nPrediction and Query Result for '{sample_product_name_new}':")
print(f"Predicted Sales: {result_new['predicted_sales']:.2f}")
print(f"Actual Sales: {result_new['actual_sales']}")
