---
tags:
- regression
- sales-prediction
library: scikit-learn
---

## Summary

### Data Analysis Key Findings

* The dataset contains 39,435 entries with two columns: 'input' (a text description of the sale) and 'output' (the total sales value).
* The 'output' column was converted from an object type to a numeric (float) type, and rows whose values could not be parsed were removed.
* A 'product_name' feature was extracted from the 'input' column.
* After cleaning, the data was split into training (80%) and testing (20%) sets: 30,921 training samples and 7,731 testing samples (38,652 rows in total).
* A `Pipeline` combining a `TfidfVectorizer` on the 'product_name' feature with a `RandomForestRegressor` was trained on the training set.
* On the test set, the model reached a Mean Absolute Error (MAE) of 2224.90, a Root Mean Squared Error (RMSE) of 31031.07, and an R-squared (R2) score of 0.05.
* For efficient querying, a separate DataFrame `df_query` was created with 'product_name' as the index and 'sales_value' as a numeric column.
* A function `query_sales_data` filters the sales DataFrame using a query string passed to `dataframe.eval()`.
* A function `predict_and_query` combines the prediction model with the querying functionality, so a user can predict sales for a product and, when the product exists in the data, retrieve its actual sales.
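
As a rough illustration of the cleaning, training, and evaluation steps listed above, the sketch below runs the same kind of pipeline on a tiny made-up DataFrame. The toy rows, the `sales_value` column name for the cleaned target, and the hyperparameters (`n_estimators=50`, `random_state=42`) are assumptions for the example; the settings actually used are not stated in this summary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up rows standing in for the real data (illustrative values only).
df = pd.DataFrame({
    "product_name": [
        "APPLE IPHONE 16 128GB", "SAMSUNG GALAXY S24", "APPLE IPHONE 16 256GB",
        "USB-C CABLE 1M", "SAMSUNG GALAXY A15", "USB-C CABLE 2M",
        "APPLE WATCH SE", "HDMI CABLE 2M", "SAMSUNG TV 55", "BROKEN ROW",
    ],
    "output": ["1200.50", "950", "1400", "12.99", "300",
               "15.5", "410", "9.99", "780", "n/a"],
})

# Convert the target to float; unparseable values become NaN and are dropped.
df["sales_value"] = pd.to_numeric(df["output"], errors="coerce")
df = df.dropna(subset=["sales_value"]).reset_index(drop=True)

# 80/20 split on the cleaned rows.
X_train, X_test, y_train, y_test = train_test_split(
    df["product_name"], df["sales_value"], test_size=0.2, random_state=42
)

# TF-IDF features from the product name feed a random forest regressor.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

# Evaluate on the held-out rows.
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
```

The metric values printed by this toy run are not meaningful; only the structure mirrors the steps described above.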

### Insights or Next Steps

* The R-squared score of 0.05 means the model explains only about 5% of the variance in sales on the test set, so product-name features processed by TF-IDF have limited predictive power on their own. The RMSE (31031.07) is also far larger than the MAE (2224.90), which suggests that a small number of very large errors dominate. Adding features such as time-based information or product categories could improve model performance.
* The `query_sales_data` function provides only basic querying. For a larger dataset or more complex queries, consider a more robust storage and querying solution, such as a database.
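
One possible shape for the database option mentioned above is sketched here with Python's built-in `sqlite3` module. The table name `sales`, its columns, the helper `query_sales`, and the sample rows are all assumptions made for illustration; nothing like this exists in the current project.

```python
import sqlite3

# In-memory database for illustration; a real setup would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_name TEXT, sales_value REAL)")
# An index on product_name keeps exact-name lookups fast as the table grows.
conn.execute("CREATE INDEX idx_sales_product ON sales (product_name)")

# Made-up rows (illustrative values only).
rows = [
    ("APPLE IPHONE 16 128GB", 1200.5),
    ("APPLE IPHONE 16 128GB", 1150.0),
    ("USB-C CABLE 1M", 12.99),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()


def query_sales(connection, product_name):
    """Return all recorded sales values for an exact product name."""
    cursor = connection.execute(
        "SELECT sales_value FROM sales WHERE product_name = ? ORDER BY sales_value",
        (product_name,),
    )
    return [value for (value,) in cursor.fetchall()]


iphone_sales = query_sales(conn, "APPLE IPHONE 16 128GB")
missing_sales = query_sales(conn, "A completely new product not in the dataset")
print(iphone_sales, missing_sales)
```

A parameterized query like this passes the product name as data rather than as an expression to evaluate, which is one reason a database may be preferable to evaluating free-form query strings when the input comes from users.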


### Example Usage

```python
# Example of how to use the predict_and_query function.
# Assumes predict_and_query has already been defined in the current session.

# Replace this with any product name you want to predict for.
# If the product name exists in the original data, its actual sales are also retrieved.
sample_product_name_for_prediction = 'APPLE IPHONE 16 128GB SS "MP01213171"'

result = predict_and_query(sample_product_name_for_prediction)

print(f"Prediction and Query Result for '{sample_product_name_for_prediction}':")
print(f"Predicted Sales: {result['predicted_sales']:.2f}")
print(f"Actual Sales: {result['actual_sales']}")

# Try a product name that is not in the training data to see only the prediction:
# sample_product_name_new = "A completely new product not in the dataset"
# result_new = predict_and_query(sample_product_name_new)
# print(f"\nPrediction and Query Result for '{sample_product_name_new}':")
# print(f"Predicted Sales: {result_new['predicted_sales']:.2f}")
# print(f"Actual Sales: {result_new['actual_sales']}")
```
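
The definitions of `query_sales_data` and `predict_and_query` are not included in this summary. The sketch below shows one way they might be structured, based only on the descriptions above. It is not the original implementation: the extra `model` and `dataframe` parameters, the `ConstantModel` stand-in for the trained pipeline, the choice to return `None` for unknown products, and the sample rows are all assumptions.

```python
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for the trained pipeline: always predicts one value."""

    def __init__(self, value):
        self.value = value

    def predict(self, names):
        return [self.value for _ in names]


# Made-up query frame: 'product_name' is the index, 'sales_value' is numeric.
df_query = pd.DataFrame(
    {"sales_value": [1200.5, 1150.0, 12.99]},
    index=pd.Index(
        ["APPLE IPHONE 16 128GB", "APPLE IPHONE 16 128GB", "USB-C CABLE 1M"],
        name="product_name",
    ),
)


def query_sales_data(dataframe, query):
    """Keep the rows for which the boolean expression in `query` is true."""
    mask = dataframe.eval(query)  # e.g. "sales_value > 1000" yields a boolean Series
    return dataframe.loc[mask.to_numpy()]


def predict_and_query(product_name, model, dataframe):
    """Predict sales for a product and look up its recorded sales, if any."""
    predicted = float(model.predict([product_name])[0])
    if product_name in dataframe.index:
        # A product may appear in several rows, so return every recorded value.
        actual = dataframe.loc[[product_name], "sales_value"].tolist()
    else:
        actual = None
    return {"predicted_sales": predicted, "actual_sales": actual}


known = predict_and_query("APPLE IPHONE 16 128GB", ConstantModel(1000.0), df_query)
unknown = predict_and_query("A completely new product", ConstantModel(1000.0), df_query)
large_sales = query_sales_data(df_query, "sales_value > 1000")
print(known, unknown, len(large_sales))
```

Because `DataFrame.eval` evaluates the query string as an expression, a function like `query_sales_data` is best limited to trusted input.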