{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" }, "colab": { "provenance": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "vVzjbyMTUxct" }, "source": [ "# 🎬 Streaming Platform — Content Performance & Cancellation Risk Predictor\n", "### Notebook 1: Data Collection, Creation & Processing\n", "\n", "We're trying to answer one core business question: **how should a streaming platform decide which shows to renew, cancel, or invest more in?**\n", "\n", "To do that we need data — and lots of it. In this notebook we pull real show data straight from IMDb's official public datasets, engineer some features from it, and then build out the synthetic side of our pipeline: viewership time series, financial KPIs, and audience reviews. Everything synthetic is anchored to real IMDb signals so the story holds together end to end.\n", "\n", "By the end of this notebook we have four clean CSV files ready to feed into Notebook 2 for the actual analysis and modelling.\n" ], "id": "vVzjbyMTUxct" }, { "cell_type": "markdown", "metadata": { "id": "A1xRpTkzUxcw" }, "source": [ "---\n", "## Setup" ], "id": "A1xRpTkzUxcw" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dx-1FV1vUxcx", "outputId": "a7b4a3e5-5c66-45ac-9b7c-dffa21ee4bb9" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)\n", "Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)\n", "Requirement already satisfied: matplotlib in /usr/local/lib/python3.12/dist-packages (3.10.0)\n", "Requirement already satisfied: seaborn in /usr/local/lib/python3.12/dist-packages (0.13.2)\n", "Requirement already satisfied: requests in 
/usr/local/lib/python3.12/dist-packages (2.32.4)\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)\n", "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)\n", "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.3)\n", "Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.3.3)\n", "Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (0.12.1)\n", "Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (4.62.1)\n", "Requirement already satisfied: kiwisolver>=1.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (1.5.0)\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (26.0)\n", "Requirement already satisfied: pillow>=8 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (11.3.0)\n", "Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.12/dist-packages (from matplotlib) (3.3.2)\n", "Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests) (3.4.6)\n", "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests) (3.11)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests) (2.5.0)\n", "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests) (2026.2.25)\n", "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n" ] } ], "source": [ "!pip install pandas numpy matplotlib seaborn requests" ], "id": "dx-1FV1vUxcx" }, { 
"cell_type": "code", "execution_count": null, "metadata": { "id": "WepyuijgUxcy" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import random\n", "import requests\n", "import gzip\n", "import io\n", "import warnings\n", "from datetime import datetime\n", "from pathlib import Path\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "random.seed(2025)\n", "np.random.seed(2025)" ], "id": "WepyuijgUxcy" }, { "cell_type": "markdown", "metadata": { "id": "kltC0ESwUxcy" }, "source": [ "---\n", "## Real-World Data — IMDb Official Datasets\n", "\n", "Rather than scraping a website, we went straight to IMDb's own public data dumps, which are updated daily and free to download. We use three files and join them together:\n", "\n", "- **title.basics** gives us show titles, genres, start and end years. The end year is key: if a show has one, it has ended, and we treat that as our real-world cancellation signal. IMDb does not distinguish a planned finale from a cancellation, so this is a proxy rather than an exact label.\n", "- **title.ratings** gives us the IMDb average rating and vote count for each title.\n", "- **title.episode** lets us count how many seasons each show ran for.\n", "\n", "Joining these three gives us something genuinely useful: a dataset of real TV shows with real ratings, real season counts, and a real-world label for whether each show has ended, which serves as our proxy for cancellation. 
That's the foundation everything else is built on.\n" ], "id": "kltC0ESwUxcy" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 262 }, "id": "tT5a27_7Uxcz", "outputId": "bff8f0e2-0bea-4228-e14b-fc5a16191e04" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Downloading title.basics...\n", "Loaded 170,450 TV series from 2010 onwards\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " tconst titleType primaryTitle startYear endYear \\\n", "0 tt0113735 tvSeries The Magic 7 2026 NaN \n", "1 tt0158429 tvSeries Sauda 2018 NaN \n", "2 tt0166938 tvSeries Yo-TV 2020 NaN \n", "3 tt0179619 tvSeries Waka Huia 2014 NaN \n", "4 tt0188360 tvSeries Show of Hearts 2015 NaN \n", "\n", " genres was_cancelled primary_genre \n", "0 Adventure,Animation,Family 0 Adventure \n", "1 Unknown 0 Unknown \n", "2 Unknown 0 Unknown \n", "3 Documentary 0 Documentary \n", "4 Comedy 0 Comedy " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_basics" } }, "metadata": {}, "execution_count": 3 } ], "source": [ "# title.basics — show metadata and cancellation signal\n", "print(\"Downloading title.basics...\")\n", "r = requests.get(\"https://datasets.imdbws.com/title.basics.tsv.gz\", stream=True)\n", "with gzip.open(io.BytesIO(r.content)) as f:\n", " df_basics = pd.read_csv(f, sep=\"\\t\", low_memory=False,\n", " usecols=[\"tconst\",\"titleType\",\"primaryTitle\",\n", " \"startYear\",\"endYear\",\"genres\"])\n", "\n", "# keep only TV series from 2010 onwards (streaming era)\n", "df_basics = df_basics[df_basics[\"titleType\"] == \"tvSeries\"].copy()\n", "df_basics = df_basics[df_basics[\"startYear\"] != \"\\\\N\"].copy()\n", "df_basics[\"startYear\"] = pd.to_numeric(df_basics[\"startYear\"], errors=\"coerce\")\n", "df_basics = df_basics[df_basics[\"startYear\"] >= 2010].reset_index(drop=True)\n", "\n", "# was_cancelled = 1 if the show has an end year on IMDb, 0 if still running\n", "df_basics[\"was_cancelled\"] = (df_basics[\"endYear\"] != \"\\\\N\").astype(int)\n", "df_basics[\"endYear\"] = pd.to_numeric(df_basics[\"endYear\"].replace(\"\\\\N\", np.nan), errors=\"coerce\")\n", "df_basics[\"genres\"] = df_basics[\"genres\"].replace(\"\\\\N\", \"Unknown\")\n", "df_basics[\"primary_genre\"] = df_basics[\"genres\"].apply(lambda x: x.split(\",\")[0])\n", "\n", "print(f\"Loaded {len(df_basics):,} TV series from 2010 onwards\")\n", "df_basics.head()" ], "id": "tT5a27_7Uxcz" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 242 }, "id": "ELoOJVaUUxcz", "outputId": "376d03ab-1e50-4957-a0ec-214d3ddce90b" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Downloading title.ratings...\n", "Loaded ratings for 1,655,561 titles\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " tconst averageRating 
numVotes\n", "0 tt0000001 5.7 2201\n", "1 tt0000002 5.5 312\n", "2 tt0000003 6.4 2312\n", "3 tt0000004 5.1 197\n", "4 tt0000005 6.2 3038" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_ratings" } }, "metadata": {}, "execution_count": 4 } ], "source": [ "# title.ratings — IMDb scores and vote counts\n", "print(\"Downloading title.ratings...\")\n", "r = requests.get(\"https://datasets.imdbws.com/title.ratings.tsv.gz\", stream=True)\n", "with gzip.open(io.BytesIO(r.content)) as f:\n", " df_ratings = pd.read_csv(f, sep=\"\\t\", low_memory=False)\n", "\n", "print(f\"Loaded ratings for {len(df_ratings):,} titles\")\n", "df_ratings.head()" ], "id": "ELoOJVaUUxcz" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 242 }, "id": "gXi7rHWOUxc0", "outputId": "2f202800-bf59-4bfd-e386-872efd7dd66b" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Downloading title.episode...\n", "Season counts derived for 209,186 shows\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " tconst num_seasons\n", "0 tt0029778 1\n", "1 tt0035599 1\n", "2 tt0035803 1\n", "3 tt0038276 1\n", "4 tt0039120 1" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "season_counts" } }, "metadata": {}, "execution_count": 5 } ], "source": [ "# title.episode — season counts per show\n", "print(\"Downloading title.episode...\")\n", "r = requests.get(\"https://datasets.imdbws.com/title.episode.tsv.gz\", stream=True)\n", "with gzip.open(io.BytesIO(r.content)) as f:\n", " df_episodes = pd.read_csv(f, sep=\"\\t\", low_memory=False,\n", " usecols=[\"tconst\",\"parentTconst\",\"seasonNumber\"])\n", "\n", "df_episodes[\"seasonNumber\"] = pd.to_numeric(\n", " df_episodes[\"seasonNumber\"].replace(\"\\\\N\", np.nan), errors=\"coerce\"\n", ")\n", "\n", "season_counts = (\n", " df_episodes.dropna(subset=[\"seasonNumber\"])\n", " .groupby(\"parentTconst\")[\"seasonNumber\"]\n", " .nunique()\n", " .reset_index()\n", " .rename(columns={\"parentTconst\": \"tconst\", \"seasonNumber\": \"num_seasons\"})\n", ")\n", "\n", "print(f\"Season counts derived for {len(season_counts):,} shows\")\n", "season_counts.head()" ], "id": "gXi7rHWOUxc0" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 593 }, "id": "HJfJYF_DUxc1", "outputId": "f35a7764-1de6-4e47-8588-73a5c3eefb08" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Final merged dataset: 6,747 shows\n", " Still running: 1,894\n", " Ended / cancelled: 4,853\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " tconst title primary_genre startYear endYear \\\n", "0 tt0429087 Shantaram Action 2022 2022.0 \n", "1 tt0475784 Westworld Drama 2016 2022.0 \n", "2 tt0489974 Carnival Row Crime 2019 2023.0 \n", "3 tt0499400 The Nine Lives of Chloe King Action 2011 2011.0 \n", "4 tt0804484 Foundation Drama 2021 NaN \n", "5 tt0808224 The Death of Bunny Munro Drama 2025 2025.0 \n", "6 tt0808491 The Swarm Drama 2023 2023.0 \n", "7 tt0944947 Game of Thrones Action 2011 2019.0 \n", "8 tt0979432 Boardwalk 
Empire Crime 2010 2014.0 \n", "9 tt10009170 Blood of Zeus Action 2020 2025.0 \n", "\n", " imdb_rating vote_count num_seasons was_cancelled \n", "0 7.4 17824 1 1 \n", "1 8.4 556925 4 1 \n", "2 7.7 88336 2 1 \n", "3 7.0 10799 1 1 \n", "4 7.6 125394 4 0 \n", "5 6.7 1911 1 1 \n", "6 5.9 7815 1 1 \n", "7 9.2 2599705 8 1 \n", "8 8.6 217329 5 1 \n", "9 7.5 28471 3 1 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_shows", "summary": "{\n \"name\": \"df_shows\",\n \"rows\": 6747,\n \"fields\": [\n {\n \"column\": \"tconst\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6747,\n \"samples\": [\n \"tt21200366\",\n \"tt3906560\",\n \"tt33204276\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 6628,\n \"samples\": [\n \"I'm Her Most Dangerous Obsession\",\n \"She's Gotta Have It\",\n \"Bad Thoughts\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"primary_genre\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 25,\n \"samples\": [\n \"Reality-TV\",\n \"Thriller\",\n \"Action\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"startYear\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 4,\n \"min\": 2010,\n \"max\": 2026,\n \"num_unique_values\": 17,\n \"samples\": [\n 2022,\n 2016,\n 2025\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"endYear\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 3.8399196886811735,\n \"min\": 2010.0,\n \"max\": 2027.0,\n \"num_unique_values\": 18,\n \"samples\": [\n 2022.0,\n 2023.0,\n 2024.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb_rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0832167175656575,\n \"min\": 1.0,\n \"max\": 9.7,\n \"num_unique_values\": 86,\n \"samples\": [\n 3.1,\n 7.4,\n 4.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"vote_count\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 66534,\n \"min\": 1000,\n \"max\": 2599705,\n \"num_unique_values\": 4778,\n \"samples\": [\n 20415,\n 16092,\n 6048\n ],\n \"semantic_type\": 
\"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"num_seasons\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1,\n \"max\": 49,\n \"num_unique_values\": 33,\n \"samples\": [\n 39,\n 27,\n 29\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"was_cancelled\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 6 } ], "source": [ "# merge all three together and apply a minimum vote threshold\n", "# 1,000 votes filters out obscure titles nobody has actually seen\n", "df_shows = (\n", " df_basics\n", " .merge(df_ratings, on=\"tconst\", how=\"inner\")\n", " .merge(season_counts, on=\"tconst\", how=\"left\")\n", ")\n", "\n", "df_shows[\"num_seasons\"] = df_shows[\"num_seasons\"].fillna(1).astype(int)\n", "df_shows = df_shows[df_shows[\"numVotes\"] >= 1000].copy()\n", "\n", "df_shows = df_shows.rename(columns={\n", " \"primaryTitle\": \"title\",\n", " \"averageRating\": \"imdb_rating\",\n", " \"numVotes\": \"vote_count\"\n", "})[[\"tconst\",\"title\",\"primary_genre\",\"startYear\",\"endYear\",\n", " \"imdb_rating\",\"vote_count\",\"num_seasons\",\"was_cancelled\"]].reset_index(drop=True)\n", "\n", "print(f\"Final merged dataset: {len(df_shows):,} shows\")\n", "print(f\" Still running: {(df_shows['was_cancelled']==0).sum():,}\")\n", "print(f\" Ended / cancelled: {(df_shows['was_cancelled']==1).sum():,}\")\n", "df_shows.head(10)" ], "id": "HJfJYF_DUxc1" }, { "cell_type": "markdown", "metadata": { "id": "M3LEbV3NUxc2" }, "source": [ "---\n", "## Feature Engineering\n", "\n", "Before adding any synthetic data we want to squeeze more signal out of what we already have. We engineer three new columns:\n", "\n", "**Longevity score** combines how long a show survived (seasons) with how well it was rated. 
A show that ran for 8 seasons and had an 8.5 rating scores much higher than one that limped through 3 seasons with a 6.0. We normalise both dimensions and weight rating slightly more since it captures audience quality perception better than raw survival.\n", "\n", "**Popularity tier** maps the IMDb rating to a 1–5 scale. This is the anchor for everything synthetic we generate below — the higher the tier, the bigger the budgets, audiences, and revenues we simulate.\n", "\n", "**Audience sentiment** is derived from the tier and gives us the positive / neutral / negative label we use for review generation and the VADER analysis in Notebook 2.\n" ], "id": "M3LEbV3NUxc2" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 807 }, "id": "I4NzThtfUxc2", "outputId": "3c644c69-b6d3-4a08-a60e-d4d20bd9c8ff" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Popularity tier distribution:\n", "popularity_tier\n", "1 404\n", "2 907\n", "3 2649\n", "4 2332\n", "5 455\n", "Name: count, dtype: int64\n", "\n", "Audience sentiment distribution:\n", "audience_sentiment\n", "positive 2787\n", "neutral 2649\n", "negative 1311\n", "Name: count, dtype: int64\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " title imdb_rating num_seasons longevity_score \\\n", "0 Shantaram 7.4 1 0.4414 \n", "1 Westworld 8.4 4 0.5353 \n", "2 Carnival Row 7.7 2 0.4704 \n", "3 The Nine Lives of Chloe King 7.0 1 0.4138 \n", "4 Foundation 7.6 4 0.4802 \n", "5 The Death of Bunny Munro 6.7 1 0.3931 \n", "6 The Swarm 5.9 1 0.3379 \n", "7 Game of Thrones 9.2 8 0.6239 \n", "8 Boardwalk Empire 8.6 5 0.5575 \n", "9 Blood of Zeus 7.5 3 0.4649 \n", "\n", " popularity_tier audience_sentiment was_cancelled \n", "0 3 neutral 1 \n", "1 4 positive 1 \n", "2 4 positive 1 \n", "3 3 neutral 1 \n", "4 4 positive 0 \n", "5 3 neutral 1 \n", "6 2 negative 1 \n", "7 5 positive 1 \n", "8 5 positive 1 \n", "9 4 positive 1 
" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"df_shows[[\\\"title\\\",\\\"imdb_rating\\\",\\\"num_seasons\\\",\\\"longevity_score\\\",\\\"popularity_tier\\\",\\\"audience_sentiment\\\",\\\"was_cancelled\\\"]]\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Boardwalk Empire\",\n \"Westworld\",\n \"The Death of Bunny Munro\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb_rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.9614803401237302,\n \"min\": 5.9,\n \"max\": 9.2,\n \"num_unique_values\": 10,\n \"samples\": [\n 8.6,\n 8.4,\n 6.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"num_seasons\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1,\n \"max\": 8,\n \"num_unique_values\": 6,\n \"samples\": [\n 1,\n 4,\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"longevity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.08369410174359163,\n \"min\": 0.3379,\n \"max\": 0.6239,\n \"num_unique_values\": 10,\n \"samples\": [\n 0.5575,\n 0.5353,\n 0.3931\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"popularity_tier\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 2,\n \"max\": 5,\n \"num_unique_values\": 4,\n \"samples\": [\n 4,\n 5,\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"audience_sentiment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"neutral\",\n \"positive\",\n \"negative\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"was_cancelled\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 
1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 7 } ], "source": [ "# longevity score: weighted combination of normalised rating and season count\n", "df_shows[\"rating_norm\"] = (df_shows[\"imdb_rating\"] - df_shows[\"imdb_rating\"].min()) / (\n", " df_shows[\"imdb_rating\"].max() - df_shows[\"imdb_rating\"].min()\n", ")\n", "df_shows[\"seasons_norm\"] = (df_shows[\"num_seasons\"] - 1) / (\n", " df_shows[\"num_seasons\"].max() - 1 + 1e-6\n", ")\n", "df_shows[\"longevity_score\"] = (\n", " 0.6 * df_shows[\"rating_norm\"] + 0.4 * df_shows[\"seasons_norm\"]\n", ").round(4)\n", "\n", "# popularity tier (1-5) anchors all synthetic data generation\n", "def assign_popularity_tier(rating):\n", " if rating >= 8.5: return 5\n", " elif rating >= 7.5: return 4\n", " elif rating >= 6.5: return 3\n", " elif rating >= 5.5: return 2\n", " else: return 1\n", "\n", "df_shows[\"popularity_tier\"] = df_shows[\"imdb_rating\"].apply(assign_popularity_tier)\n", "\n", "# audience sentiment derived from tier\n", "df_shows[\"audience_sentiment\"] = df_shows[\"popularity_tier\"].apply(\n", " lambda t: \"positive\" if t >= 4 else (\"negative\" if t <= 2 else \"neutral\")\n", ")\n", "\n", "print(\"Popularity tier distribution:\")\n", "print(df_shows[\"popularity_tier\"].value_counts().sort_index())\n", "print()\n", "print(\"Audience sentiment distribution:\")\n", "print(df_shows[\"audience_sentiment\"].value_counts())\n", "\n", "df_shows.to_csv(\"imdb_shows_real.csv\", index=False)\n", "df_shows[[\"title\",\"imdb_rating\",\"num_seasons\",\"longevity_score\",\"popularity_tier\",\"audience_sentiment\",\"was_cancelled\"]].head(10)" ], "id": "I4NzThtfUxc2" }, { "cell_type": "markdown", "metadata": { "id": "b5B82D29Uxc2" }, "source": [ "---\n", "## Synthetic Streaming Financials\n", "\n", "IMDb gives us ratings and season counts but nothing about money. 
We simulate the financial KPIs a streaming platform would actually track for each show:\n", "\n", "- **content_cost_m** — total production and licensing cost in millions, scaled by how many seasons ran\n", "- **subscriber_impact_k** — how many new subscribers (in thousands) the show is estimated to have driven to the platform\n", "- **avg_completion_rate** — the share of viewers who actually finish a full season, which is a strong retention signal\n", "- **platform_roi_pct** — return on content investment, derived from a simple revenue model\n", "\n", "All values are drawn from ranges tied to each show's popularity tier so a tier 5 show gets Netflix-scale numbers and a tier 1 show gets the kind of numbers that explain why it got cancelled.\n" ], "id": "b5B82D29Uxc2" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 335 }, "id": "CwDkL7_vUxc3", "outputId": "b94f1717-69af-4c63-ab11-74091d4a04b9" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " title content_cost_m subscriber_impact_k \\\n", "0 Shantaram 21.16 107.0 \n", "1 Westworld 154.66 150.0 \n", "2 Carnival Row 116.97 173.0 \n", "3 The Nine Lives of Chloe King 10.87 122.0 \n", "4 Foundation 139.87 167.0 \n", "5 The Death of Bunny Munro 20.20 123.0 \n", "6 The Swarm 4.70 26.0 \n", "7 Game of Thrones 1918.81 1458.0 \n", "\n", " avg_completion_rate platform_roi_pct \n", "0 0.469 14.46 \n", "1 0.745 -45.29 \n", "2 0.630 -41.87 \n", "3 0.596 131.83 \n", "4 0.556 -50.71 \n", "5 0.399 29.55 \n", "6 0.395 19.36 \n", "7 0.930 -39.82 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \" \\\"avg_completion_rate\\\",\\\"platform_roi_pct\\\"]]\",\n \"rows\": 8,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 8,\n \"samples\": [\n \"Westworld\",\n \"The Death of Bunny Munro\",\n \"Shantaram\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"content_cost_m\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 657.6613120748399,\n \"min\": 4.7,\n \"max\": 1918.81,\n \"num_unique_values\": 8,\n \"samples\": [\n 154.66,\n 20.2,\n 21.16\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"subscriber_impact_k\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 473.8799576746355,\n \"min\": 26.0,\n \"max\": 1458.0,\n \"num_unique_values\": 8,\n \"samples\": [\n 150.0,\n 123.0,\n 107.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"avg_completion_rate\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.1818806516058578,\n \"min\": 0.395,\n \"max\": 0.93,\n \"num_unique_values\": 8,\n \"samples\": [\n 0.745,\n 0.399,\n 0.469\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"platform_roi_pct\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 61.82850138834495,\n \"min\": -50.71,\n \"max\": 131.83,\n \"num_unique_values\": 8,\n \"samples\": [\n -45.29,\n 29.55,\n 14.46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 8 } ], "source": [ "def generate_streaming_financials(row):\n", " tier = row[\"popularity_tier\"]\n", " seasons = row[\"num_seasons\"]\n", "\n", " if tier == 5:\n", " cost = round(random.uniform(80, 250) * seasons, 2)\n", " subs = int(random.uniform(500, 2000))\n", " completion = round(random.uniform(0.70, 0.95), 3)\n", " elif tier == 4:\n", " 
cost = round(random.uniform(30, 80) * seasons, 2)\n", " subs = int(random.uniform(150, 500))\n", " completion = round(random.uniform(0.55, 0.75), 3)\n", " elif tier == 3:\n", " cost = round(random.uniform(10, 30) * seasons, 2)\n", " subs = int(random.uniform(30, 150))\n", " completion = round(random.uniform(0.35, 0.60), 3)\n", " elif tier == 2:\n", " cost = round(random.uniform(3, 12) * seasons, 2)\n", " subs = int(random.uniform(5, 35))\n", " completion = round(random.uniform(0.15, 0.40), 3)\n", " else:\n", " cost = round(random.uniform(1, 5) * seasons, 2)\n", " subs = int(random.uniform(0, 10))\n", " completion = round(random.uniform(0.05, 0.20), 3)\n", "\n", " revenue = round((subs * 12 * 0.015) + (completion * cost * 0.5), 2)\n", " roi = round((revenue - cost) / cost * 100, 2)\n", "\n", " return pd.Series({\n", " \"content_cost_m\": cost,\n", " \"subscriber_impact_k\": subs,\n", " \"avg_completion_rate\": completion,\n", " \"platform_roi_pct\": roi\n", " })\n", "\n", "df_shows[[\"content_cost_m\",\"subscriber_impact_k\",\n", " \"avg_completion_rate\",\"platform_roi_pct\"]] = df_shows.apply(\n", " generate_streaming_financials, axis=1\n", ")\n", "\n", "df_shows[[\"title\",\"content_cost_m\",\"subscriber_impact_k\",\n", " \"avg_completion_rate\",\"platform_roi_pct\"]].head(8)" ], "id": "CwDkL7_vUxc3" }, { "cell_type": "markdown", "metadata": { "id": "yLu0iwAoUxc3" }, "source": [ "---\n", "## Synthetic Viewership Time Series\n", "\n", "We generate 18 months of monthly streaming numbers per show (in thousands of streams). 
The shape of each series is designed to reflect how streaming audiences actually behave, and is driven by each show's `audience_sentiment` label:\n", "\n", "- Positive-sentiment shows start from a high base (scaled by popularity tier) and grow by 10% to 50% over the 18 months\n", "- Negative-sentiment shows start low and decline to between 40% and 80% of their starting level, as the platform gradually deprioritises them\n", "- Neutral shows stay flat around their base level\n", "\n", "Every series also gets a premiere spike in month 1 (a new season dropping on the platform), plus a seasonality wave and random noise so that it looks realistic rather than perfectly smooth. These time series are what we feed into ARIMA in Notebook 2 to forecast future viewership and spot shows that are trending in the wrong direction.\n" ], "id": "yLu0iwAoUxc3" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 381 }, "id": "rFjajV3QUxc3", "outputId": "97709617-e1bf-4bf1-8d8a-8206a7e19c90" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Viewership time series: 121,446 rows across 6,747 shows\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " tconst title month monthly_streams_k audience_sentiment\n", "0 tt0429087 Shantaram 2024-10 252 neutral\n", "1 tt0429087 Shantaram 2024-11 206 neutral\n", "2 tt0429087 Shantaram 2024-12 195 neutral\n", "3 tt0429087 Shantaram 2025-01 203 neutral\n", "4 tt0429087 Shantaram 2025-02 204 neutral\n", "5 tt0429087 Shantaram 2025-03 215 neutral\n", "6 tt0429087 Shantaram 2025-04 200 neutral\n", "7 tt0429087 Shantaram 2025-05 181 neutral\n", "8 tt0429087 Shantaram 2025-06 184 neutral\n", "9 tt0429087 Shantaram 2025-07 179 neutral" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_viewership" } }, "metadata": {}, "execution_count": 9 } ], "source": [ "def generate_viewership_series(row):\n", "    sentiment = row[\"audience_sentiment\"]\n", "    tier = row[\"popularity_tier\"]\n", "    months = pd.date_range(end=datetime.today(), periods=18, freq=\"M\")\n", "\n", "    if sentiment == \"positive\":\n", "        base = random.randint(500, 2500) * tier\n", "        trend = np.linspace(base, base * random.uniform(1.1, 1.5), 18)\n", "    elif sentiment == \"negative\":\n", "        base = random.randint(20, 200)\n", "        trend = np.linspace(base, base * random.uniform(0.4, 0.8), 18)\n", "    else:\n", "        base = random.randint(150, 600)\n", "        trend = np.full(18, base + random.randint(-30, 30))\n", "\n", "    # premiere spike in month 1 — new season drop effect\n", "    premiere_spike = np.zeros(18)\n", "    premiere_spike[0] = base * random.uniform(0.3, 0.8)\n", "\n", "    seasonality = base * 0.08 * np.sin(np.linspace(0, 3 * np.pi, 18))\n", "    noise = np.random.normal(0, base * 0.04, 18)\n", "    viewers = np.clip(trend + premiere_spike + seasonality + noise, 0, None).astype(int)\n", "\n", "    return [\n", "        {\n", "            \"tconst\": row[\"tconst\"],\n", "            \"title\": row[\"title\"],\n", "            \"month\": m.strftime(\"%Y-%m\"),\n", "            \"monthly_streams_k\": v,\n", "            \"audience_sentiment\": sentiment\n", "        }\n", "        for m, v in zip(months, viewers)\n", "    ]\n", "\n", "viewership_rows = []\n", "for _, row in df_shows.iterrows():\n", "    viewership_rows.extend(generate_viewership_series(row))\n", "\n", "df_viewership = pd.DataFrame(viewership_rows)\n", "df_viewership.to_csv(\"synthetic_viewership_data.csv\", index=False)\n", "print(f\"Viewership time series: {len(df_viewership):,} rows across {df_shows['tconst'].nunique():,} shows\")\n", "df_viewership.head(10)" ], "id": "rFjajV3QUxc3" }, { "cell_type": "markdown", "metadata": { "id": "cIX-9IHFUxc4" }, "source": [ "---\n", "## Synthetic Audience Reviews\n", "\n", "We 
generate 10 synthetic viewer reviews per show by sampling from a pool of 50 pre-written reviews per sentiment label. The reviews are written to sound like genuine streaming platform comments rather than generic feedback — they reference things like binge-watching, season finales, subscription value, and completion.\n", "\n", "These reviews feed into the VADER sentiment analysis in Notebook 2, where we verify whether the computed sentiment scores actually align with the IMDb-derived labels we assigned here.\n" ], "id": "cIX-9IHFUxc4" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 311 }, "id": "HaR4tUZpUxc4", "outputId": "549be39f-744e-4808-97d1-873f8253b8e6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Generated 67,470 reviews across 6,747 shows\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " tconst title audience_sentiment \\\n", "0 tt0429087 Shantaram neutral \n", "1 tt0429087 Shantaram neutral \n", "2 tt0429087 Shantaram neutral \n", "3 tt0429087 Shantaram neutral \n", "4 tt0429087 Shantaram neutral \n", "\n", " review_text imdb_rating \\\n", "0 It held my attention without ever truly captur... 7.4 \n", "1 I finished it but felt no strong urge to discu... 7.4 \n", "2 Not a waste of time but not the best use of it... 7.4 \n", "3 I kept watching but never felt fully gripped t... 7.4 \n", "4 Serviceable television that ticked most of the... 7.4 \n", "\n", " popularity_tier \n", "0 3 \n", "1 3 \n", "2 3 \n", "3 3 \n", "4 3 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "df_reviews", "summary": "{\n \"name\": \"df_reviews\",\n \"rows\": 67470,\n \"fields\": [\n {\n \"column\": \"tconst\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 6747,\n \"samples\": [\n \"tt21200366\",\n \"tt3906560\",\n \"tt33204276\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 6628,\n \"samples\": [\n \"I'm Her Most Dangerous Obsession\",\n \"She's Gotta Have It\",\n \"Bad Thoughts\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"audience_sentiment\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 3,\n \"samples\": [\n \"neutral\",\n \"positive\",\n \"negative\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"review_text\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 150,\n \"samples\": [\n \"The writing felt lazy and the characters were impossible to care about.\",\n \"Bold, daring, and wildly entertaining throughout.\",\n \"I binged the entire season in one weekend. 
Completely addictive.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb_rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 1.0831444675280868,\n \"min\": 1.0,\n \"max\": 9.7,\n \"num_unique_values\": 86,\n \"samples\": [\n 3.1,\n 7.4,\n 4.4\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"popularity_tier\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 1,\n \"max\": 5,\n \"num_unique_values\": 5,\n \"samples\": [\n 4,\n 1,\n 2\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 10 } ], "source": [ "synthetic_reviews_by_sentiment = {\n", " \"positive\": [\n", " \"An absolute masterpiece of modern television. Every episode left me wanting more.\",\n", " \"The writing is sharp, the acting superb — this show deserves every award it gets.\",\n", " \"I binged the entire season in one weekend. Completely addictive.\",\n", " \"One of the best things I have ever watched. Emotional, gripping, and beautifully made.\",\n", " \"The character development across the season is simply outstanding.\",\n", " \"Stunning visuals and a storyline that genuinely surprised me at every turn.\",\n", " \"I cannot stop thinking about this show. It has stayed with me for weeks.\",\n", " \"Perfectly paced with a finale that left me speechless.\",\n", " \"A rare show that gets better with every single episode.\",\n", " \"The performances are extraordinary. Best show of the year without question.\",\n", " \"This series set a new standard for what streaming television can be.\",\n", " \"Every scene felt intentional and meaningful. Truly exceptional craft.\",\n", " \"I recommended this to everyone I know. 
An absolute must-watch.\",\n", " \"Refreshingly original in a sea of formulaic streaming content.\",\n", " \"The dialogue alone is worth the subscription fee.\",\n", " \"A beautifully layered narrative that rewards patient viewers.\",\n", " \"One of those rare shows where every episode feels like a mini-movie.\",\n", " \"Emotionally devastating in the best possible way.\",\n", " \"The ensemble cast delivers career-best performances across the board.\",\n", " \"I have not been this invested in a show since Breaking Bad.\",\n", " \"Cinematic quality storytelling at its absolute finest.\",\n", " \"The tension was unbearable in the best way — hooked from minute one.\",\n", " \"This show understands its audience and respects their intelligence.\",\n", " \"Genuinely moving and consistently entertaining throughout.\",\n", " \"A triumph of modern storytelling. Easily a ten out of ten.\",\n", " \"The world-building here is extraordinary and completely immersive.\",\n", " \"Every plot twist felt earned rather than cheap.\",\n", " \"Superb from start to finish. I did not want it to end.\",\n", " \"Flawlessly executed on every single level.\",\n", " \"This show changed the way I think about what television can be.\",\n", " \"Bold, daring, and wildly entertaining throughout.\",\n", " \"A creative achievement that will be discussed for years to come.\",\n", " \"Compelling and intelligent — exactly what prestige TV should be.\",\n", " \"The season finale had me in tears. 
That is perfect television.\",\n", " \"I watched it twice because once was simply not enough.\",\n", " \"Brilliantly cast with writing to match the talent on screen.\",\n", " \"An unforgettable viewing experience from episode one.\",\n", " \"Everything clicked — story, cast, direction, score.\",\n", " \"A show that demands your full attention and richly rewards it.\",\n", " \"The best new series I have seen in a very long time.\",\n", " \"Thought-provoking, deeply human, and completely gripping.\",\n", " \"Inventive, surprising, and impossible to stop watching.\",\n", " \"I have already started rewatching it from episode one.\",\n", " \"This exceeded all of my expectations by a considerable margin.\",\n", " \"Outstanding in every single department imaginable.\",\n", " \"A cultural event disguised as a streaming series.\",\n", " \"Dazzling performances backed by brilliant, disciplined writing.\",\n", " \"Pure gold from the very first scene to the very last.\",\n", " \"The kind of show that reminds you why you pay for streaming.\",\n", " \"A complete and utter triumph of the medium.\"\n", " ],\n", " \"neutral\": [\n", " \"A decent show with some strong moments but it did not fully win me over.\",\n", " \"Worth watching once but I am not sure I would revisit it.\",\n", " \"The premise is interesting but the execution felt a little uneven.\",\n", " \"Some episodes were brilliant, others felt like padding.\",\n", " \"A solid effort that just lacked that final spark to be truly special.\",\n", " \"I enjoyed parts of it but felt a bit disconnected from the main characters.\",\n", " \"Fine for what it is, but nothing that will stay with me long term.\",\n", " \"The performances are good but the writing lets them down occasionally.\",\n", " \"An average season that had genuine flashes of greatness buried within it.\",\n", " \"I kept watching but never felt fully gripped the way I hoped.\",\n", " \"Not bad, not great — somewhere comfortably in the middle.\",\n", " 
\"Watchable entertainment that does not overstay its welcome.\",\n", " \"Had strong competition and did not quite stand out enough.\",\n", " \"The concept had real potential that was only partially realised.\",\n", " \"A few storylines dragged but the better ones kept me engaged.\",\n", " \"It felt safe when it could have been bold and daring.\",\n", " \"Enjoyable enough for a weekend binge but forgettable by Monday.\",\n", " \"The show needed more time to properly develop its ideas.\",\n", " \"Perfectly adequate streaming content — no more, no less.\",\n", " \"Neither a disappointment nor a revelation. Just fine.\",\n", " \"I finished it but felt no strong urge to discuss it with anyone.\",\n", " \"Some great ideas that were not fully explored or committed to.\",\n", " \"The first half was noticeably stronger than the second.\",\n", " \"A mixed bag that still managed to entertain more often than not.\",\n", " \"Watchable, inoffensive, and ultimately unremarkable.\",\n", " \"Good enough to complete but not enough to recommend enthusiastically.\",\n", " \"Showed genuine promise that it could not quite fulfil.\",\n", " \"A three out of five — exactly what that score means.\",\n", " \"The tone was inconsistent which made it hard to fully invest emotionally.\",\n", " \"It did some things very well and others not so much.\",\n", " \"A pleasant enough distraction without being truly engaging.\",\n", " \"I liked it more in memory than I did while actively watching it.\",\n", " \"Competent but cautious storytelling that played it too safe.\",\n", " \"Had its moments without ever having a truly great one.\",\n", " \"Fine viewing for a quiet evening but nothing more than that.\",\n", " \"The cast did their best with scripts that were hit and miss.\",\n", " \"Decent entertainment that needed more creative risk-taking.\",\n", " \"I was entertained enough but expected considerably more.\",\n", " \"Not a waste of time but not the best use of it either.\",\n", " \"Middle of the 
road in almost every single regard.\",\n", " \"Serviceable television that ticked most of the basic boxes.\",\n", " \"Some intriguing ideas buried under too much filler content.\",\n", " \"I would not turn it off but I would not rush to turn it on.\",\n", " \"Adequately produced and adequately written — that sums it up.\",\n", " \"A show that exists comfortably without ever truly excelling.\",\n", " \"Good moments separated by stretches of mediocrity.\",\n", " \"Passable but not something I would passionately recommend.\",\n", " \"It held my attention without ever truly capturing my imagination.\",\n", " \"The kind of show you watch and then promptly forget about.\",\n", " \"Reasonable, reliable, and completely forgettable.\"\n", " ],\n", " \"negative\": [\n", " \"A complete disappointment. The potential was there but it went entirely to waste.\",\n", " \"I gave up halfway through the second episode. Life really is too short.\",\n", " \"Dull, predictable, and painfully slow — I genuinely struggled to finish it.\",\n", " \"The writing felt lazy and the characters were impossible to care about.\",\n", " \"An overproduced mess with no coherent story to tell anyone.\",\n", " \"I cannot understand how this got commissioned, let alone renewed for another season.\",\n", " \"Every single episode felt like an endurance test I had not signed up for.\",\n", " \"The plot went nowhere and took forever to get there.\",\n", " \"Wooden acting and a script that reads like an early first draft.\",\n", " \"I fell asleep twice attempting this. 
On the third try I just gave up entirely.\",\n", " \"A frustrating waste of what looked like a genuinely promising concept.\",\n", " \"The pacing was so slow I completely lost track of what was even happening.\",\n", " \"Cheap, hollow, and utterly forgettable in every possible way.\",\n", " \"I have never fast-forwarded through a show this aggressively in my life.\",\n", " \"The characters made decisions so baffling I started watching it ironically.\",\n", " \"A cynical cash-grab with absolutely no artistic merit whatsoever.\",\n", " \"Badly written, badly directed, and badly acted throughout.\",\n", " \"I have seen more engaging content produced for daytime television.\",\n", " \"The dialogue was so unnatural it became almost accidentally comedic.\",\n", " \"An absolute mess from the very first episode to the very last.\",\n", " \"Tedious, derivative, and wholly uninspiring in every department.\",\n", " \"This show clearly had no idea what it actually wanted to be.\",\n", " \"The finale was so poor it retroactively ruined what little I had enjoyed.\",\n", " \"I watched the whole thing hoping it would improve. It never did.\",\n", " \"A failure at even the most basic and fundamental levels of storytelling.\",\n", " \"Bloated and boring — this could have been told in half the episode count.\",\n", " \"Not a single character I found even remotely interesting or worth following.\",\n", " \"Clumsy attempts at depth that only highlighted how shallow it really was.\",\n", " \"An incoherent storyline that appeared to be inventing itself as it went along.\",\n", " \"Easily one of the worst shows I have watched this entire year.\",\n", " \"The kind of television that makes you genuinely question how it got made.\",\n", " \"Utterly charmless and deeply unpleasant to sit through.\",\n", " \"I kept checking how many episodes were left. 
That is never a good sign at all.\",\n", " \"The acting ranged from mediocre to genuinely quite bad throughout.\",\n", " \"Poorly constructed in virtually every single department.\",\n", " \"A slog from beginning to end with absolutely no payoff for your patience.\",\n", " \"Completely hollow beneath its expensive-looking exterior and production design.\",\n", " \"The plot twists were telegraphed miles in advance and still disappointing.\",\n", " \"Nothing about this worked the way it was so clearly intended to.\",\n", " \"An embarrassment for the talented people attached to this project.\",\n", " \"It tried to be edgy and ended up being offensive and incredibly dull.\",\n", " \"Do not be fooled by the trailer — the actual show is nothing like it.\",\n", " \"A tedious exercise in style over substance that achieves neither.\",\n", " \"I am still genuinely annoyed about the time I spent on this.\",\n", " \"Cancelled at the right time — and arguably a full season too late.\",\n", " \"The worst kind of streaming content — expensive, empty, and forgettable.\",\n", " \"More plot holes than actual coherent plot to speak of.\",\n", " \"I genuinely cannot think of a single redeeming quality anywhere.\",\n", " \"A complete and utter waste of an otherwise talented cast.\",\n", " \"The only consistent thing about this show was how consistently bad it was.\"\n", " ]\n", "}\n", "\n", "review_rows = []\n", "for _, row in df_shows.iterrows():\n", "    for review_text in random.sample(synthetic_reviews_by_sentiment[row[\"audience_sentiment\"]], 10):\n", "        review_rows.append({\n", "            \"tconst\": row[\"tconst\"],\n", "            \"title\": row[\"title\"],\n", "            \"audience_sentiment\": row[\"audience_sentiment\"],\n", "            \"review_text\": review_text,\n", "            \"imdb_rating\": row[\"imdb_rating\"],\n", "            \"popularity_tier\": row[\"popularity_tier\"]\n", "        })\n", "\n", "df_reviews = pd.DataFrame(review_rows)\n", "df_reviews.to_csv(\"synthetic_show_reviews.csv\", index=False)\n", "print(f\"Generated 
{len(df_reviews):,} reviews across {df_shows['tconst'].nunique():,} shows\")\n", "df_reviews.head()" ], "id": "HaR4tUZpUxc4" }, { "cell_type": "markdown", "metadata": { "id": "Nz9UnqWmUxc5" }, "source": [ "---\n", "## Renewal Decision Label\n", "\n", "This is the target variable for the Random Forest classifier in Notebook 2. We assign one of three labels to each show:\n", "\n", "- **Renew**: still running on IMDb (`was_cancelled == 0`), at least 800k average monthly streams, ROI above 10%, and a popularity tier of 4 or 5\n", "- **Cancel**: either flagged as ended or cancelled on IMDb (`was_cancelled == 1`), or average monthly streams of 200k or fewer combined with a negative ROI\n", "- **Invest More**: everything in between, i.e. shows that are not performing well enough to confidently renew but not badly enough to pull the plug on\n", "\n", "The fact that `was_cancelled` comes from real IMDb data makes this label much more meaningful than anything we could have made up purely from synthetic signals.\n" ], "id": "Nz9UnqWmUxc5" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wpa8t6swUxc5", "outputId": "89f6672f-a024-42a5-f747-1d6ca578b386" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Renewal decision distribution:\n", "renewal_decision\n", "Cancel 5258\n", "Invest More 1231\n", "Renew 258\n", "Name: count, dtype: int64\n" ] } ], "source": [ "# merge average viewership back into df_shows\n", "avg_views = (\n", "    df_viewership.groupby(\"tconst\")[\"monthly_streams_k\"]\n", "    .mean().reset_index()\n", "    .rename(columns={\"monthly_streams_k\": \"avg_monthly_streams_k\"})\n", ")\n", "df_shows = df_shows.merge(avg_views, on=\"tconst\", how=\"left\")\n", "\n", "def assign_renewal_decision(row):\n", "    if (row[\"was_cancelled\"] == 0 and row[\"avg_monthly_streams_k\"] >= 800\n", "            and row[\"platform_roi_pct\"] > 10 and row[\"popularity_tier\"] >= 4):\n", "        return \"Renew\"\n", "    elif row[\"was_cancelled\"] == 1 or (\n", "            row[\"avg_monthly_streams_k\"] <= 200 and row[\"platform_roi_pct\"] < 0):\n", "        return \"Cancel\"\n", "    else:\n", "        return \"Invest More\"\n", "\n", "df_shows[\"renewal_decision\"] = df_shows.apply(assign_renewal_decision, axis=1)\n", "\n", "print(\"Renewal decision distribution:\")\n", "print(df_shows[\"renewal_decision\"].value_counts())" ], "id": "wpa8t6swUxc5" }, { "cell_type": "markdown", "metadata": { "id": "nTCW3GApUxc5" }, "source": [ "---\n", "## Exporting Pipeline-Ready Files\n", "\n", "We save four clean CSVs into an `artifacts/` folder in the working directory. These are the only files Notebook 2 needs: it reads directly from here without having to re-run any of the data collection steps above.\n" ], "id": "nTCW3GApUxc5" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mb2z3taGUxc6", "outputId": "e7806dcb-bed9-4538-b5e2-83dbfc9bf5a6" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "shows_master.csv — 6,747 shows\n", "viewership_timeseries.csv — 121,446 rows\n", "show_reviews.csv — 67,470 rows\n", "monthly_platform_totals.csv — 18 months\n" ] } ], "source": [ "Path(\"artifacts\").mkdir(exist_ok=True)\n", "\n", "# full enriched show dataset — main input for the classifier\n", "df_shows.to_csv(\"artifacts/shows_master.csv\", index=False)\n", "\n", "# 18-month viewership time series — input for ARIMA forecasting\n", "df_viewership.to_csv(\"artifacts/viewership_timeseries.csv\", index=False)\n", "\n", "# audience reviews — input for VADER sentiment analysis\n", "df_reviews.to_csv(\"artifacts/show_reviews.csv\", index=False)\n", "\n", "# monthly platform totals — for the HF app dashboard overview chart\n", "monthly_totals = (\n", "    df_viewership\n", "    .assign(month=pd.to_datetime(df_viewership[\"month\"]))\n", "    .groupby(\"month\", as_index=False)[\"monthly_streams_k\"].sum()\n", "    .rename(columns={\"monthly_streams_k\": \"total_streams_k\"})\n", "    .sort_values(\"month\")\n", ")\n", "monthly_totals[\"month\"] = 
monthly_totals[\"month\"].dt.strftime(\"%Y-%m-%d\")\n", "monthly_totals.to_csv(\"artifacts/monthly_platform_totals.csv\", index=False)\n", "\n", "print(f\"shows_master.csv — {len(df_shows):,} shows\")\n", "print(f\"viewership_timeseries.csv — {len(df_viewership):,} rows\")\n", "print(f\"show_reviews.csv — {len(df_reviews):,} rows\")\n", "print(f\"monthly_platform_totals.csv — {len(monthly_totals):,} months\")" ], "id": "mb2z3taGUxc6" }, { "cell_type": "markdown", "metadata": { "id": "sRoai9e-Uxc6" }, "source": [ "---\n", "## Summary\n", "\n", "That's Notebook 1 done. Here's what we ended up with:\n" ], "id": "sRoai9e-Uxc6" }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 754 }, "id": "nTLL9HRvUxc6", "outputId": "22b4969b-6e52-408b-b801-de6b60c304ca" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Real shows from IMDb: 6,747\n", "Year range: 2010 — 2026\n", "Genres: 25 unique\n", "Still running: 1,894\n", "Ended / cancelled: 4,853\n", "\n", "Renewal decision split:\n", "renewal_decision\n", "Cancel 5258\n", "Invest More 1231\n", "Renew 258\n", "\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ " title imdb_rating num_seasons longevity_score \\\n", "0 Shantaram 7.4 1 0.4414 \n", "1 Westworld 8.4 4 0.5353 \n", "2 Carnival Row 7.7 2 0.4704 \n", "3 The Nine Lives of Chloe King 7.0 1 0.4138 \n", "4 Foundation 7.6 4 0.4802 \n", "5 The Death of Bunny Munro 6.7 1 0.3931 \n", "6 The Swarm 5.9 1 0.3379 \n", "7 Game of Thrones 9.2 8 0.6239 \n", "8 Boardwalk Empire 8.6 5 0.5575 \n", "9 Blood of Zeus 7.5 3 0.4649 \n", "\n", " was_cancelled renewal_decision content_cost_m platform_roi_pct \n", "0 1 Cancel 21.16 14.46 \n", "1 1 Cancel 154.66 -45.29 \n", "2 1 Cancel 116.97 -41.87 \n", "3 1 Cancel 10.87 131.83 \n", "4 0 Invest More 139.87 -50.71 \n", "5 1 Cancel 20.20 29.55 \n", "6 1 Cancel 4.70 19.36 \n", "7 1 Cancel 1918.81 -39.82 \n", "8 1 Cancel 701.30 -9.61 
\n", "9 1 Cancel 171.92 -46.81 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \" \\\"content_cost_m\\\",\\\"platform_roi_pct\\\"]]\",\n \"rows\": 10,\n \"fields\": [\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"Boardwalk Empire\",\n \"Westworld\",\n \"The Death of Bunny Munro\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb_rating\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.9614803401237302,\n \"min\": 5.9,\n \"max\": 9.2,\n \"num_unique_values\": 10,\n \"samples\": [\n 8.6,\n 8.4,\n 6.7\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"num_seasons\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 2,\n \"min\": 1,\n \"max\": 8,\n \"num_unique_values\": 6,\n \"samples\": [\n 1,\n 4,\n 3\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"longevity_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.08369410174359163,\n \"min\": 0.3379,\n \"max\": 0.6239,\n \"num_unique_values\": 10,\n \"samples\": [\n 0.5575,\n 0.5353,\n 0.3931\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"was_cancelled\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 0,\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"renewal_decision\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"Invest More\",\n \"Cancel\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"content_cost_m\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 596.1274469663904,\n \"min\": 4.7,\n \"max\": 1918.81,\n \"num_unique_values\": 10,\n \"samples\": [\n 701.3,\n 154.66\n ],\n \"semantic_type\": \"\",\n 
\"description\": \"\"\n }\n },\n {\n \"column\": \"platform_roi_pct\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 56.69587432569988,\n \"min\": -50.71,\n \"max\": 131.83,\n \"num_unique_values\": 10,\n \"samples\": [\n -9.61,\n -45.29\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 13 } ], "source": [ "print(f\"Real shows from IMDb: {len(df_shows):,}\")\n", "print(f\"Year range: {int(df_shows['startYear'].min())} — {int(df_shows['startYear'].max())}\")\n", "print(f\"Genres: {df_shows['primary_genre'].nunique()} unique\")\n", "print(f\"Still running: {(df_shows['was_cancelled']==0).sum():,}\")\n", "print(f\"Ended / cancelled: {(df_shows['was_cancelled']==1).sum():,}\")\n", "print()\n", "print(\"Renewal decision split:\")\n", "print(df_shows[\"renewal_decision\"].value_counts().to_string())\n", "print()\n", "df_shows[[\"title\",\"imdb_rating\",\"num_seasons\",\"longevity_score\",\n", " \"was_cancelled\",\"renewal_decision\",\n", " \"content_cost_m\",\"platform_roi_pct\"]].head(10)" ], "id": "nTLL9HRvUxc6" } ] }