{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \ud83c\udf77 **Data Collection, Creation, Storage, and Processing**\n\n**Business Problem:** How can a wine retailer optimize its pricing and inventory strategy using customer review sentiment (qualitative) and sales data (quantitative)?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1.** \ud83d\udce6 Install required packages" ] }, { "cell_type": "code", "metadata": {}, "source": [ "!pip install pandas numpy matplotlib seaborn vaderSentiment beautifulsoup4 requests textblob faker" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2.** \ud83c\udf10 Load the real-world wine dataset\n\nThe data comes from Kaggle's Wine Reviews dataset (https://www.kaggle.com/zynicide/wine-reviews), which contains wine titles, countries, varieties, ratings (points), and prices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *a. Initial setup*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "import pandas as pd\nimport numpy as np\nimport random\nfrom datetime import datetime\nimport warnings\n\nwarnings.filterwarnings(\"ignore\")\nrandom.seed(2025)\nnp.random.seed(2025)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *b. Load the wine dataset*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "df_wines = pd.read_csv(\"wine_data.csv\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *c. View the first few rows*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "df_wines.head()" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *d. 
Basic info*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "print(f\"Shape: {df_wines.shape}\")\nprint(f\"\\nColumns: {list(df_wines.columns)}\")\nprint(f\"\\nMissing values:\\n{df_wines.isnull().sum()}\")\nprint(f\"\\nPoints distribution:\\n{df_wines['points'].describe()}\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **3.** \ud83e\udde9 Create a meaningful connection between real & synthetic datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *a. Generate popularity scores based on points (with some randomness)*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "def generate_popularity_score(points):\n    if points >= 95:\n        base = 5\n    elif points >= 90:\n        base = 4\n    elif points >= 85:\n        base = 3\n    elif points >= 82:\n        base = 2\n    else:\n        base = 1\n    trend_factor = random.choices([-1, 0, 1], weights=[1, 3, 2])[0]\n    return int(np.clip(base + trend_factor, 1, 5))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *b. Run the function to create a \"popularity_score\" column*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "df_wines[\"popularity_score\"] = df_wines[\"points\"].apply(generate_popularity_score)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *c. Create sentiment labels from popularity scores*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "def get_sentiment(popularity_score):\n    if popularity_score <= 2:\n        return \"negative\"\n    elif popularity_score == 3:\n        return \"neutral\"\n    else:\n        return \"positive\"\n\ndf_wines[\"sentiment_label\"] = df_wines[\"popularity_score\"].apply(get_sentiment)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **4.** \ud83d\udcc8 Generate synthetic wine sales data (18 months)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *a. 
Create a generate_sales_profile function*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "def generate_sales_profile(sentiment, price):\n    # normalize=True drops the time of day, so every wine gets the same month-end dates\n    # and the later groupby on \"month\" aggregates across wines\n    months = pd.date_range(end=datetime.today(), periods=18, freq=\"M\", normalize=True)\n    price_factor = max(0.5, 1 - (price / 100))  # cheaper wines sell more\n\n    if sentiment == \"positive\":\n        base = int(random.randint(150, 250) * price_factor)\n        trend = np.linspace(base, base + random.randint(20, 60), len(months))\n    elif sentiment == \"negative\":\n        base = int(random.randint(20, 60) * price_factor)\n        trend = np.linspace(base, base - random.randint(5, 20), len(months))\n    else:\n        base = int(random.randint(70, 120) * price_factor)\n        trend = np.linspace(base, base + random.randint(-10, 10), len(months))\n\n    # Add Gaussian noise and keep at least one unit sold per month\n    noise = np.random.normal(0, base * 0.1, len(months))\n    units = np.maximum(1, trend + noise).astype(int)\n    return list(zip(months, units))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *b. Build the sales dataset*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "sales_data = []\nfor _, row in df_wines.iterrows():\n    records = generate_sales_profile(row[\"sentiment_label\"], row[\"price\"])\n    for month, units in records:\n        sales_data.append({\n            \"title\": row[\"title\"],\n            \"month\": month,\n            \"units_sold\": units,\n            \"sentiment_label\": row[\"sentiment_label\"],\n            \"price\": row[\"price\"],\n            \"popularity_score\": row[\"popularity_score\"],\n        })\n\ndf_sales = pd.DataFrame(sales_data)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *c. 
Save and view*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "df_sales.to_csv(\"synthetic_wine_sales.csv\", index=False)\nprint(f\"Sales dataset shape: {df_sales.shape}\")\nprint(df_sales.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **5.** \ud83c\udfaf Generate synthetic customer reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *a. Create review pools per sentiment category (generated with ChatGPT)*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "synthetic_reviews_by_sentiment = {\n \"positive\": [\n \"Exceptional wine with rich, complex flavours that linger beautifully on the palate.\",\n \"One of the finest bottles I have tasted this year, absolutely worth the price.\",\n \"Beautifully balanced with notes of dark fruit and a silky smooth finish.\",\n \"A stunning wine that pairs perfectly with any occasion. Highly recommended.\",\n \"Elegant and refined, this wine exceeded all my expectations.\",\n \"The aroma alone is captivating \u2014 cherry, oak, and a hint of vanilla.\",\n \"A true masterpiece from this winery. Will definitely be buying more.\",\n \"Smooth, sophisticated, and incredibly drinkable. A real crowd-pleaser.\",\n \"Outstanding quality for the price. This is a hidden gem.\",\n \"Deep colour, wonderful bouquet, and a long satisfying finish.\",\n \"This wine is a revelation. Every sip brings something new to discover.\",\n \"Perfect balance of acidity and tannins. A joy to drink.\",\n \"An absolute delight \u2014 fruity, fresh, and full of character.\",\n \"The craftsmanship behind this bottle is evident in every glass.\",\n \"Rich and velvety with a complexity that keeps you coming back.\",\n \"A gorgeous wine that shows what this region can produce at its best.\",\n \"Delightfully aromatic with layers of flavour that unfold gradually.\",\n \"Simply superb. 
This wine belongs on every serious collector's shelf.\",\n \"An impressive vintage that delivers on every front.\",\n \"A real treat for the senses. Would pair wonderfully with aged cheese.\",\n \"Bright and lively with a finish that goes on and on.\",\n \"Wonderfully expressive wine with great depth and personality.\",\n \"Perfectly aged and showing beautifully right now.\",\n \"A showstopper at dinner parties. Everyone asked about this bottle.\",\n \"Luxurious texture and impeccable balance. Worth every penny.\",\n \"This wine tells a story of its terroir in the most beautiful way.\",\n \"Vibrant fruit flavours with just the right amount of oak influence.\",\n \"A benchmark wine for the variety. Consistently excellent.\",\n \"So good I ordered a full case. Cannot recommend enough.\",\n \"The kind of wine that makes you stop and appreciate the moment.\",\n \"Floral nose, silky palate, and an incredibly clean finish.\",\n \"A wine that proves quality does not always mean expensive.\",\n \"Absolutely divine. This winery never disappoints.\",\n \"Complex, layered, and endlessly interesting. A true fine wine.\",\n \"This vintage is drinking perfectly right now. Do not miss it.\",\n \"Generous fruit, gentle spice, and a touch of minerality. Lovely.\",\n \"A wine that manages to be both powerful and graceful.\",\n \"One of the best values in wine today. Extraordinary quality.\",\n \"The finish alone makes this wine memorable. Pure elegance.\",\n \"A joyful wine that brings warmth and happiness with every glass.\",\n \"Incredible structure and aging potential. A serious wine.\",\n \"Harmonious from start to finish. Textbook winemaking.\",\n \"This wine has soul. You can taste the passion behind it.\",\n \"Brilliant colour and an intoxicating bouquet. Truly special.\",\n \"A wine lover's dream. 
Complex, balanced, and utterly delicious.\",\n \"Everything I look for in a great wine \u2014 and then some.\",\n \"Refined and polished with a beautiful aromatic profile.\",\n \"A stunning example of what modern winemaking can achieve.\",\n \"Pure pleasure in a glass. An unforgettable experience.\",\n \"This is the wine I will be recommending to everyone this year.\"\n ],\n \"neutral\": [\n \"A decent wine that does its job without being particularly memorable.\",\n \"Perfectly acceptable for everyday drinking, nothing more nothing less.\",\n \"A straightforward wine with simple fruit flavours.\",\n \"Reliable and consistent, though it lacks real excitement.\",\n \"Good enough for a casual dinner but would not seek it out again.\",\n \"Middle of the road \u2014 not bad but not impressive either.\",\n \"A solid table wine at a fair price point.\",\n \"Pleasant enough on the palate but forgettable overall.\",\n \"Does what you expect for the price range. No surprises.\",\n \"A basic wine that fills a gap but does not inspire.\",\n \"Drinkable and inoffensive, suitable for large gatherings.\",\n \"It is fine. Nothing wrong with it but nothing exciting either.\",\n \"An average wine that will satisfy casual drinkers.\",\n \"Somewhat one-dimensional but clean and well-made.\",\n \"A reasonable choice if you are not looking for anything special.\",\n \"Not bad for the price but I have had better in this range.\",\n \"A competent wine without much personality.\",\n \"Simple and clean with modest fruit character.\",\n \"Okay for the price but I would not buy it again.\",\n \"An unremarkable but serviceable everyday wine.\",\n \"Neither impressive nor disappointing. Just average.\",\n \"A safe choice for when you do not know what to pick.\",\n \"Lacks complexity but is well-balanced for what it is.\",\n \"A standard offering from this region. 
No complaints.\",\n \"Meets expectations without exceeding them.\",\n \"The kind of wine you drink without really thinking about it.\",\n \"Perfectly fine but would not win any awards.\",\n \"A middle-ground wine suitable for any casual occasion.\",\n \"Uncomplicated and easy to drink. Nothing fancy.\",\n \"Acceptable quality at an acceptable price.\",\n \"It does its job. You could do worse for the money.\",\n \"A fair wine that represents its price point accurately.\",\n \"Light and simple \u2014 good for a hot afternoon, perhaps.\",\n \"Not a standout but not a disappointment either.\",\n \"A run-of-the-mill wine with standard characteristics.\",\n \"Drinkable but I would happily trade up for a few more euros.\",\n \"A no-frills wine for no-frills occasions.\",\n \"Average in every respect. The definition of mediocre.\",\n \"A passable wine that does not offend or excite.\",\n \"Consistent but uninspiring. A forgettable bottle.\",\n \"Neither here nor there. An okay wine.\",\n \"Functional wine for functional purposes.\",\n \"This wine is the equivalent of a shrug. It exists.\",\n \"An everyday wine that blends into the background.\",\n \"Does not stand out in a lineup but holds its own.\",\n \"A modest wine with modest ambitions.\",\n \"Satisfactory but leaves you wanting something more.\",\n \"A standard-issue wine with no real flaws or virtues.\",\n \"Unremarkable but perfectly drinkable.\",\n \"A wine that is easy to forget. Not terrible, not great.\"\n ],\n \"negative\": [\n \"Disappointing for the price. Thin and lacking any real character.\",\n \"Harsh tannins and an unpleasant bitter finish. Would not buy again.\",\n \"Tastes cheap and overly acidic. Not enjoyable at all.\",\n \"This wine fell flat \u2014 watery, bland, and overpriced.\",\n \"The flavour profile is muddled and confusing. Poor winemaking.\",\n \"Way too much oak. It tastes like you are drinking a barrel.\",\n \"Unbalanced and rough around the edges. 
Needs serious improvement.\",\n \"I could not finish the bottle. The aftertaste was terrible.\",\n \"Massively overpriced for the quality. A real letdown.\",\n \"Astringent and harsh with no redeeming qualities.\",\n \"The aroma was off-putting \u2014 smelled like vinegar.\",\n \"A wine that tries too hard and achieves too little.\",\n \"Flat and lifeless. No fruit, no complexity, no interest.\",\n \"Would not serve this to guests. Embarrassingly poor quality.\",\n \"Overly sweet and cloying. Lacks any sophistication.\",\n \"This wine is a cautionary tale about judging by the label.\",\n \"Thin, watery, and devoid of any character whatsoever.\",\n \"An unpleasant drinking experience from start to finish.\",\n \"Tastes like it was made in a hurry with no care.\",\n \"Rough, tannic, and completely out of balance.\",\n \"The worst wine I have had in this price range.\",\n \"Needs several more years \u2014 or maybe a different winemaker.\",\n \"Acidic and sharp with a finish that makes you wince.\",\n \"Not worth the glass it was poured into.\",\n \"A forgettable wine for all the wrong reasons.\",\n \"This bottle went straight down the sink after one glass.\",\n \"Dull, flat, and uninspiring. Avoid this one.\",\n \"Tastes oxidised and past its prime. Questionable storage.\",\n \"An aggressive wine that attacks rather than caresses the palate.\",\n \"I expected much more from this winery. Very disappointing.\",\n \"The tannins are so harsh they strip your mouth completely.\",\n \"No complexity, no depth, no reason to buy this wine.\",\n \"Cheap-tasting wine at a not-so-cheap price.\",\n \"A mess of a wine. Every element seems to fight the others.\",\n \"Sour and unripe. 
Tastes like it was harvested too early.\",\n \"One of the most underwhelming wines I have ever tasted.\",\n \"Thin body and a finish that disappears instantly.\",\n \"Off-putting nose and equally unpleasant on the palate.\",\n \"This wine does not live up to its reputation at all.\",\n \"Clumsy and awkward. Lacks finesse entirely.\",\n \"I wanted to like this wine but it gave me nothing to work with.\",\n \"Overworked and over-oaked. Lost all its fruit character.\",\n \"A wine that is hard to recommend to anyone.\",\n \"Surprisingly bad for a winery of this calibre.\",\n \"Flat, stale, and lacking any vibrancy.\",\n \"The kind of wine that makes you question the ratings.\",\n \"No redeeming features. A complete waste of money.\",\n \"Tastes manufactured rather than crafted. Very artificial.\",\n \"A harsh and unforgiving wine with zero charm.\",\n \"Bottom of the barrel quality. Stay away.\"\n ]\n}" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *b. Generate 10 reviews per wine*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "review_rows = []\nfor _, row in df_wines.iterrows():\n    title = row['title']\n    sentiment_label = row['sentiment_label']\n    review_pool = synthetic_reviews_by_sentiment[sentiment_label]\n    sampled_reviews = random.sample(review_pool, 10)\n    for review_text in sampled_reviews:\n        review_rows.append({\n            \"title\": title,\n            \"review_text\": review_text,\n            \"sentiment_label\": sentiment_label,\n            \"points\": row[\"points\"],\n            \"price\": row[\"price\"],\n            \"country\": row[\"country\"],\n            \"variety\": row[\"variety\"],\n            \"popularity_score\": row[\"popularity_score\"],\n        })" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *c. 
Create and save df_reviews*" ] }, { "cell_type": "code", "metadata": {}, "source": [ "df_reviews = pd.DataFrame(review_rows)\ndf_reviews.to_csv(\"synthetic_wine_reviews.csv\", index=False)\nprint(f\"Reviews dataset shape: {df_reviews.shape}\")\nprint(df_reviews.head())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **6.** \u2705 Create derived inputs for analysis" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Title-level features for analysis\ntitle_features = df_wines[[\"title\", \"country\", \"variety\", \"points\", \"price\", \"popularity_score\", \"sentiment_label\"]]\ntitle_features.to_csv(\"synthetic_title_level_features.csv\", index=False)\n\n# Monthly revenue series\ndf_sales[\"revenue\"] = df_sales[\"units_sold\"] * df_sales[\"price\"]\nmonthly_rev = df_sales.groupby(\"month\", as_index=False).agg(\n total_units_sold=(\"units_sold\", \"sum\"),\n total_revenue=(\"revenue\", \"sum\"),\n)\nmonthly_rev.to_csv(\"synthetic_monthly_revenue_series.csv\", index=False)\nprint(\"Derived files saved.\")\nprint(monthly_rev.head())" ], "outputs": [], "execution_count": null } ] }