---
tags:
- regression
- sales-prediction
library: scikit-learn
---

## Summary

### Data Analysis Key Findings

* The dataset contains 39,435 entries with two columns: 'input' (a text description of the sales data) and 'output' (the total sales value).
* The 'output' column was converted from object type to numeric (float) type, and rows that failed to parse were removed.
* The 'product_name' was extracted from the 'input' column.
* The data was split into training (80%) and testing (20%) sets, giving 30,921 training samples and 7,731 testing samples (38,652 rows in total after cleaning).
* A `RandomForestRegressor` model was trained inside a `Pipeline` that uses `TfidfVectorizer` on the 'product_name' feature.
* For efficient querying, a separate DataFrame `df_query` was created with 'product_name' as the index and 'sales_value' as a numeric column.
* A function `query_sales_data` filters the sales DataFrame with a query string evaluated through `dataframe.eval()`.
* On the test set, the trained model reached a Mean Absolute Error (MAE) of 2224.90, a Root Mean Squared Error (RMSE) of 31031.07, and an R-squared (R2) score of 0.05. The RMSE is roughly 14 times the MAE, which suggests that a small number of very large errors dominate the squared-error metrics.
* A function `predict_and_query` combines the prediction model with the querying functionality, so that users can predict sales for a product and retrieve its actual sales data when the product exists in the dataset.

### Insights or Next Steps

* The low R-squared score (0.05) indicates that the current model, which uses only TF-IDF features derived from the product name, has limited predictive power. Adding further features, such as time-based information or product categories, could improve model performance.
* The `query_sales_data` function provides only basic querying. For a larger dataset or more complex queries, consider a more robust storage and querying solution, such as a database.
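
The cleaning, splitting, training, and evaluation steps listed in the key findings can be sketched roughly as follows. This is a minimal illustration rather than the original notebook code: the toy DataFrame, its values, and the hyperparameters (`n_estimators=50`, `random_state=42`) are assumptions made for the example, and the 'product_name' column is assumed to have been extracted already.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for the real data; all names and values are made up.
raw = pd.DataFrame({
    "product_name": [
        "PHONE A 128GB", "PHONE A 256GB", "LAPTOP B 15IN", "LAPTOP B 17IN",
        "CABLE C USB", "CABLE C HDMI", "TABLET D 64GB", "TABLET D 128GB",
        "MOUSE E", "BROKEN ROW",
    ],
    "output": ["1200.5", "1500", "3000", "3400.75", "15", "20",
               "800", "950", "25", "n/a"],
})

# Convert the target to float; unparseable values become NaN and are dropped.
raw["sales_value"] = pd.to_numeric(raw["output"], errors="coerce")
clean = raw.dropna(subset=["sales_value"]).reset_index(drop=True)

# 80/20 train/test split on the cleaned rows.
X_train, X_test, y_train, y_test = train_test_split(
    clean["product_name"], clean["sales_value"], test_size=0.2, random_state=42
)

# TF-IDF on the product name feeding a random forest regressor.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

# Evaluate with the same three metrics reported in the findings.
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
```

RMSE is never smaller than MAE on the same predictions, so comparing the two gives a quick sense of how much a few large errors contribute to the total.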

### Example Usage

```python
# Example of how to use the predict_and_query function
# You can replace this with any product name you want to predict for
# If the product name exists in the original data, it will also retrieve actual sales
sample_product_name_for_prediction = "APPLE IPHONE 16 128GB SS \"MP01213171\""
result = predict_and_query(sample_product_name_for_prediction)

print(f"Prediction and Query Result for '{sample_product_name_for_prediction}':")
print(f"Predicted Sales: {result['predicted_sales']:.2f}")
print(f"Actual Sales: {result['actual_sales']}")

# You can try with a product name that might not be in the training data to see only the prediction
# sample_product_name_new = "A completely new product not in the dataset"
# result_new = predict_and_query(sample_product_name_new)
# print(f"\nPrediction and Query Result for '{sample_product_name_new}':")
# print(f"Predicted Sales: {result_new['predicted_sales']:.2f}")
# print(f"Actual Sales: {result_new['actual_sales']}")
```
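
The usage example above calls `predict_and_query`, whose definition is not shown here. The sketch below shows one way it and `query_sales_data` might be put together. It is an assumption-laden illustration, not the original implementation: the toy data, the function signatures, and the choice to return `None` for unknown products are all hypothetical. Only the returned keys `predicted_sales` and `actual_sales`, the `df_query` index on 'product_name', and the use of `DataFrame.eval()` come from the description above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy sales data; names and values are made up for illustration.
sales = pd.DataFrame({
    "product_name": ["PHONE A 128GB", "PHONE A 256GB", "LAPTOP B 15IN", "CABLE C USB"],
    "sales_value": [1200.5, 1500.0, 3000.0, 15.0],
})

# Small stand-in for the trained pipeline.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
]).fit(sales["product_name"], sales["sales_value"])

# Query frame indexed by product name, as described in the findings.
df_query = sales.set_index("product_name")


def query_sales_data(dataframe, query):
    """Return the rows for which the boolean expression `query` holds.

    The expression is evaluated with DataFrame.eval, so it should only
    ever receive trusted input.
    """
    mask = dataframe.eval(query)
    return dataframe[mask]


def predict_and_query(product_name):
    """Predict sales for a product and look up its recorded sales, if any."""
    predicted = float(model.predict([product_name])[0])
    if product_name in df_query.index:
        # A list key keeps the result a Series even for a single match.
        actual = df_query.loc[[product_name], "sales_value"].tolist()
    else:
        actual = None  # hypothetical convention for unknown products
    return {"predicted_sales": predicted, "actual_sales": actual}


print(predict_and_query("PHONE A 128GB"))
print(query_sales_data(df_query, "sales_value > 1000"))
```

In this sketch the exact-name lookup goes through the index rather than through `eval()`, because product names that contain quotes or spaces are awkward to embed safely in an expression string; `eval()` is kept for numeric filters such as `sales_value > 1000`.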