| ## Summary: | |
| --- | |
| tags: | |
| - regression | |
| - sales-prediction | |
| library: scikit-learn | |
| --- | |
| ### Data Analysis Key Findings | |
| * The dataset contains 39,435 entries with 'input' (sales data description) and 'output' (total sales value) columns. | |
| * The 'output' column containing sales values was converted from object dtype to a numeric (float) dtype, and rows whose values could not be parsed were removed. | |
| * A 'product_name' feature was extracted from the 'input' column. | |
| * The 38,652 rows remaining after cleaning were split into training (80%) and testing (20%) sets, resulting in 30,921 training samples and 7,731 testing samples. | |
| * A `Pipeline` combining a `TfidfVectorizer` on the 'product_name' feature with a `RandomForestRegressor` was trained on the training set. | |
| * For efficient querying, a separate DataFrame `df_query` was created with 'product_name' set as the index and 'sales_value' as a numeric type. | |
| * A function `query_sales_data` was implemented to filter the sales DataFrame with a user-supplied query string, evaluated through `DataFrame.eval()`. | |
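| * On the test set, the trained model reached a Mean Absolute Error (MAE) of 2224.90, a Root Mean Squared Error (RMSE) of 31031.07, and an R-squared (R²) score of 0.05. The RMSE is roughly 14 times the MAE, which suggests that a small number of very large errors dominate the squared-error metrics. | |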
| * A function `predict_and_query` combines the prediction model with the querying functionality: given a product name, it returns the predicted sales and, when the product exists in the data, its actual sales. | |
| ### Insights or Next Steps | |
| * The low R-squared score (0.05) means the current model, which uses only TF-IDF features of the product name, explains about 5% of the variance in sales and has limited predictive power. Adding further features, such as time-based information or product categories, may improve model performance. | |
| * The implemented `query_sales_data` function provides a basic querying capability. For a large dataset or more complex querying needs, consider implementing a more robust data storage and querying solution, such as a database. | |
| # Example of how to use the predict_and_query function | |
| # You can replace this with any product name you want to predict for | |
| # If the product name exists in the original data, it will also retrieve actual sales | |
| sample_product_name_for_prediction = "APPLE IPHONE 16 128GB SS \"MP01213171\"" | |
| result = predict_and_query(sample_product_name_for_prediction) | |
| print(f"Prediction and Query Result for '{sample_product_name_for_prediction}':") | |
| print(f"Predicted Sales: {result['predicted_sales']:.2f}") | |
| print(f"Actual Sales: {result['actual_sales']}") | |
| # You can try with a product name that might not be in the training data to see only the prediction | |
| # sample_product_name_new = "A completely new product not in the dataset" | |
| # result_new = predict_and_query(sample_product_name_new) | |
| # print(f"\nPrediction and Query Result for '{sample_product_name_new}':") | |
| # print(f"Predicted Sales: {result_new['predicted_sales']:.2f}") | |
| # print(f"Actual Sales: {result_new['actual_sales']}") |
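
The gap between the MAE and RMSE reported in the key findings is easier to interpret with the metric definitions written out. Below is a minimal pure-Python sketch of the three metrics. The helper name `regression_metrics` and the toy values are illustrative assumptions and are not taken from the project's code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error, every miss counts linearly.
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the average squared error, large misses dominate.
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 minus residual variance relative to the variance around the mean.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, rmse, r2


# Made-up toy values (not from the dataset): nine small sales and one very large one.
y_true = [100.0] * 9 + [100_000.0]
# A hypothetical model that predicts 100 for everything misses only the large sale.
y_pred = [100.0] * 10

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```

In this toy case a single large miss pushes the RMSE to several times the MAE and drives R² below zero, the same qualitative pattern as a high RMSE-to-MAE ratio paired with an R² near zero.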