a big --fix indeed
This view is limited to 50 files because it contains too many changes.
- _server/README.md +6 -1
- daft/01_what_makes_daft_special.py +30 -35
- daft/README.md +6 -1
- duckdb/008_loading_parquet.py +64 -64
- duckdb/009_loading_json.py +50 -56
- duckdb/011_working_with_apache_arrow.py +101 -96
- duckdb/01_getting_started.py +52 -60
- duckdb/DuckDB_Loading_CSVs.py +43 -37
- duckdb/README.md +5 -0
- functional_programming/05_functors.py +546 -602
- functional_programming/06_applicatives.py +687 -658
- functional_programming/CHANGELOG.md +6 -1
- functional_programming/README.md +12 -7
- optimization/01_least_squares.py +28 -34
- optimization/02_linear_program.py +31 -31
- optimization/03_minimum_fuel_optimal_control.py +38 -40
- optimization/04_quadratic_program.py +44 -46
- optimization/05_portfolio_optimization.py +50 -58
- optimization/06_convex_optimization.py +24 -26
- optimization/07_sdp.py +34 -36
- optimization/README.md +6 -1
- polars/01_why_polars.py +110 -132
- polars/02_dataframes.py +89 -95
- polars/03_loading_data.py +43 -83
- polars/04_basic_operations.py +118 -106
- polars/05_reactive_plots.py +98 -129
- polars/06_Dataframe_Transformer.py +42 -52
- polars/07-querying-with-sql.py +6 -6
- polars/08_working_with_columns.py +147 -165
- polars/09_data_types.py +70 -88
- polars/10_strings.py +199 -225
- polars/11_missing_data.py +48 -94
- polars/12_aggregations.py +67 -77
- polars/13_window_functions.py +59 -91
- polars/14_user_defined_functions.py +137 -159
- polars/16_lazy_execution.py +103 -113
- polars/README.md +6 -1
- probability/01_sets.py +60 -62
- probability/02_axioms.py +56 -66
- probability/03_probability_of_or.py +81 -103
- probability/04_conditional_probability.py +117 -133
- probability/05_independence.py +129 -163
- probability/06_probability_of_and.py +79 -89
- probability/07_law_of_total_probability.py +110 -122
- probability/08_bayes_theorem.py +128 -164
- probability/09_random_variables.py +184 -210
- probability/10_probability_mass_function.py +123 -191
- probability/11_expectation.py +115 -191
- probability/12_variance.py +151 -202
- probability/13_bernoulli_distribution.py +128 -142
_server/README.md
CHANGED

@@ -1,3 +1,8 @@
+---
+title: Readme
+marimo-version: 0.18.4
+---
+
 # marimo learn server
 
 This folder contains server code for hosting marimo apps.
@@ -21,4 +26,4 @@ docker build -t marimo-learn .
 
 ```bash
 docker run -p 7860:7860 marimo-learn
-```
+```
daft/01_what_makes_daft_special.py
CHANGED

@@ -8,28 +8,25 @@
 
 import marimo
 
-__generated_with = "0.
+__generated_with = "0.18.4"
 app = marimo.App(width="medium")
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     # What Makes Daft Special?
 
     > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
 
     Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst.
-    """
-    )
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🎯 Introducing Daft: A Unified Data Engine
 
     Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand.
@@ -37,8 +34,7 @@ def _(mo):
     The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster.
 
     Let's go ahead and `pip install daft` to see it in action!
-    """
-    )
+    """)
    return
 
 
@@ -86,8 +82,7 @@ def _(mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🦀 Built with Rust: Performance and Simplicity
 
     One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users:
@@ -97,8 +92,7 @@ def _(mo):
     * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning compared to JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies.
 
     Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance.
-    """
-    )
+    """)
    return
 
 
@@ -118,7 +112,9 @@ def _(mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!
+    """)
    return
 
 
@@ -135,7 +131,9 @@ def _(daft):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:
+    """)
    return
 
 
@@ -147,14 +145,15 @@ def _(mo, trillion_rows_df):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🌐 Scale Your Work: From Laptop to Cluster
 
     Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run:
@@ -163,15 +162,13 @@ def _(mo):
     * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines.
 
     This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter.
-    """
-    )
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🖼️ Handling More Than Just Tables: Multimodal Data Support
 
     Modern datasets often contain more than just numbers and text. They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON.
@@ -179,8 +176,7 @@ def _(mo):
     Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common.
 
     As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame:
-    """
-    )
+    """)
    return
 
 
@@ -217,20 +213,23 @@ def _(daft):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    > Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🧑‍💻 Designed for Developers: Python and SQL Interfaces
 
     Daft aims to be developer-friendly by offering flexible ways to interact with your data:
@@ -239,8 +238,7 @@ def _(mo):
     * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system.
 
     This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills.
-    """
-    )
+    """)
    return
 
 
@@ -285,8 +283,7 @@ def _(daft):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     ## 🟣 Daft's Value Proposition
 
     So, what makes Daft special? It's the combination of these design choices:
@@ -299,8 +296,7 @@ def _(mo):
     These elements combine to make Daft a versatile tool for tackling modern data challenges.
 
     And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows 🚀.
-    """
-    )
+    """)
    return
 
 
@@ -308,7 +304,6 @@ def _(mo):
 def _():
     import daft
     import marimo as mo
-
     return daft, mo
 
 
daft/README.md
CHANGED

@@ -1,3 +1,8 @@
+---
+title: Readme
+marimo-version: 0.18.4
+---
+
 # Learn Daft
 
 _🚧 This collection is a work in progress. Please help us add notebooks!_
@@ -23,4 +28,4 @@ You can also open notebooks in our online playground by appending marimo.app/ to
 
 **Thanks to all our notebook authors!**
 
-* [Péter Gyarmati](https://github.com/peter-gy)
+* [Péter Gyarmati](https://github.com/peter-gy)
duckdb/008_loading_parquet.py
CHANGED

@@ -11,39 +11,35 @@
 
 import marimo
 
-__generated_with = "0.
+__generated_with = "0.18.4"
 app = marimo.App(width="medium")
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     # Loading Parquet files with DuckDB
     *By [Thomas Liang](https://github.com/thliang01)*
     #
-    """
-    )
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-    <
-
-
-
-
-    )
+    mo.md(r"""
+    [Apache Parquet](https://parquet.apache.org/) is a popular columnar storage format, optimized for analytics. Its columnar nature allows query engines like DuckDB to read only the necessary columns, leading to significant performance gains, especially for wide tables.
+
+    DuckDB has excellent, built-in support for reading Parquet files, making it incredibly easy to query and analyze Parquet data directly without a separate loading step.
+
+    In this notebook, we'll explore how to load and analyze Airbnb's stock price data from a remote Parquet file:
+    <ul>
+        <li>Querying a remote Parquet file directly.</li>
+        <li>Using the `read_parquet` function for more control.</li>
+        <li>Creating a persistent table from a Parquet file.</li>
+        <li>Performing basic data analysis and visualization.</li>
+    </ul>
+    """)
    return
 
 
@@ -55,24 +51,24 @@ def _():
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    ## Using `FROM` to query Parquet files
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-    The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly.
+    mo.md(r"""
+    The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly.
 
-
-
-    )
+    Let's query a dataset of Airbnb's stock price from Hugging Face.
+    """)
    return
 
 
 @app.cell
-def _(AIRBNB_URL, mo, null):
+def _(AIRBNB_URL, mo):
    mo.sql(
        f"""
        SELECT *
@@ -85,24 +81,24 @@ def _(AIRBNB_URL, mo, null):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    ## Using `read_parquet`
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-    Some useful options for `read_parquet` include:
+    mo.md(r"""
+    For more control, you can use the `read_parquet` table function. This is useful when you need to specify options, for example, when dealing with multiple files or specific data types.
+    Some useful options for `read_parquet` include:
 
-
-
-
+    - `binary_as_string=True`: Reads `BINARY` columns as `VARCHAR`.
+    - `filename=True`: Adds a `filename` column with the path of the file for each row.
+    - `hive_partitioning=True`: Enables reading of Hive-partitioned datasets.
 
-
-
-    )
+    Here, we'll use `read_parquet` to select only a few relevant columns. This is much more efficient than `SELECT *` because DuckDB only needs to read the data for the columns we specify.
+    """)
    return
 
 
@@ -120,31 +116,29 @@ def _(AIRBNB_URL, mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-    You can also read multiple Parquet files at once using a glob pattern. For example, to read all Parquet files in a directory `data/`:
+    mo.md(r"""
+    You can also read multiple Parquet files at once using a glob pattern. For example, to read all Parquet files in a directory `data/`:
 
-
-
-
-
-    )
+    ```sql
+    SELECT * FROM read_parquet('data/*.parquet');
+    ```
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    ## Creating a table from a Parquet file
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-    """
-    )
+    mo.md(r"""
+    While querying Parquet files directly is powerful, sometimes it's useful to load the data into a persistent table within your DuckDB database. This can simplify subsequent queries and is a good practice if you'll be accessing the data frequently.
+    """)
    return
 
 
@@ -156,7 +150,7 @@ def _(AIRBNB_URL, mo):
        SELECT * FROM read_parquet('{AIRBNB_URL}');
        """
    )
-    return
+    return (stock_table,)
 
 
 @app.cell(hide_code=True)
@@ -172,7 +166,7 @@ def _(mo, stock_table):
 
 
 @app.cell
-def _(airbnb_stock, mo):
+def _(mo):
    mo.sql(
        f"""
        SELECT * FROM airbnb_stock LIMIT 5;
@@ -183,18 +177,22 @@ def _(airbnb_stock, mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    ## Analysis and Visualization
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    Let's perform a simple analysis: plotting the closing stock price over time.
+    """)
    return
 
 
 @app.cell
-def _(airbnb_stock, mo):
+def _(mo):
    stock_data = mo.sql(
        f"""
        SELECT
@@ -209,7 +207,9 @@ def _(airbnb_stock, mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    Now we can easily visualize this result using marimo's integration with plotting libraries like Plotly.
+    """)
    return
 
 
@@ -227,14 +227,15 @@ def _(px, stock_data):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    ## Conclusion
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
+    mo.md(r"""
     In this notebook, we've seen how easy it is to work with Parquet files in DuckDB. We learned how to:
     <ul>
         <li>Query Parquet files directly from a URL using a simple `FROM` clause.</li>
@@ -244,8 +245,7 @@ def _(mo):
     </ul>
 
     DuckDB's native Parquet support makes it a powerful tool for interactive data analysis on large datasets without complex ETL pipelines.
-    """
-    )
+    """)
    return
 
 
duckdb/009_loading_json.py
CHANGED
|
@@ -10,38 +10,34 @@
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
-
__generated_with = "0.
|
| 14 |
app = marimo.App(width="medium")
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
-
mo.md(
|
| 20 |
-
|
| 21 |
-
# Loading JSON
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
)
|
| 33 |
return
|
| 34 |
|
| 35 |
|
| 36 |
@app.cell(hide_code=True)
|
| 37 |
def _(mo):
|
| 38 |
-
mo.md(
|
| 39 |
-
|
| 40 |
-
## Using `FROM`
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
)
|
| 45 |
return
|
| 46 |
|
| 47 |
|
|
@@ -57,20 +53,18 @@ def _(mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Using `read_json`
-        …
-    )
     return

@@ -99,24 +93,24 @@ def _(mo):
     ;
     """
     )
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Using `COPY`
-        …
-    )
     return

@@ -137,11 +131,11 @@ def _(mo):
     );
     """
     )
-    return

 @app.cell
-def _(…
     _df = mo.sql(
         f"""
         COPY cars2 FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json' (FORMAT json, ARRAY true, DATEFORMAT '%Y-%m-%d');

@@ -153,7 +147,9 @@ def _(cars2, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -164,11 +160,11 @@ def _(Path):
     TMP_DIR = TemporaryDirectory()
     COPY_PATH = Path(TMP_DIR.name) / "cars_mpg.jsonl"
     print(COPY_PATH)
-    return COPY_PATH, TMP_DIR

 @app.cell
-def _(COPY_PATH, …
     _df = mo.sql(
         f"""
         COPY (

@@ -191,13 +187,11 @@ def _(COPY_PATH, Path):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Using `IMPORT DATABASE`
-        …
-    )
     return

@@ -226,7 +220,9 @@ def _(EXPORT_PATH, Path):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -250,14 +246,12 @@ def _(TMP_DIR):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Further Reading
-        …
-    )
     return
 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium")

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Loading JSON
+
+    DuckDB supports reading and writing JSON through the `json` extension, which is present in most distributions and autoloaded on first use. If it's not, you can [install and load](https://duckdb.org/docs/stable/data/json/installing_and_loading.html) it manually like any other extension.
+
+    In this tutorial we'll cover four different ways to transfer JSON data in and out of DuckDB:
+
+    - [`FROM`](https://duckdb.org/docs/stable/sql/query_syntax/from.html) statement.
+    - [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function.
+    - [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy#copy--from) statement.
+    - [`IMPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement.
+    """)
     return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Using `FROM`
+
+    Loading data with `FROM` is simple and direct: we put a path or URL to the file we want to load where we'd normally put a table name. DuckDB then attempts to infer the right way to read the file, including the correct format and column types. In most cases this is all we need to load data into DuckDB.
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Using `read_json`
+
+    For greater control over how the JSON is read, we can call the [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function directly. It supports a number of arguments; some common ones are:
+
+    - `format='array'` or `format='newline_delimited'`: the former tells DuckDB to read the rows from a top-level JSON array, while the latter reads one JSON object per line (JSONL/NDJSON).
+    - `ignore_errors=true`: skips lines with parse errors when reading newline-delimited JSON.
+    - `columns={columnName: type, ...}`: lets you set the types of individual columns manually.
+    - `dateformat` and `timestampformat`: control how DuckDB parses [Date](https://duckdb.org/docs/stable/sql/data_types/date) and [Timestamp](https://duckdb.org/docs/stable/sql/data_types/timestamp) values, using the format specifiers listed in the [docs](https://duckdb.org/docs/stable/sql/functions/dateformat.html#format-specifiers).
+
+    We could rewrite the previous query more explicitly as:
+    """)
     return
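The two `format` options correspond to two different file layouts. A minimal stdlib sketch (no DuckDB required, with made-up rows) of what each layout looks like:

```python
import json

rows = [
    {"Name": "amc rebel sst", "Miles_per_Gallon": 16.0},
    {"Name": "ford torino", "Miles_per_Gallon": 17.0},
]

# format='array': the whole file is one top-level JSON array
array_text = json.dumps(rows)

# format='newline_delimited': one JSON object per line (JSONL/NDJSON)
jsonl_text = "\n".join(json.dumps(r) for r in rows)

# Both layouts decode back to the same rows
assert json.loads(array_text) == rows
assert [json.loads(line) for line in jsonl_text.splitlines()] == rows
```

Newline-delimited files are appendable and streamable line by line, which is why they are common for logs and exports; the array layout is what most web APIs return.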
     ;
     """
     )
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Beyond single files, we can read [multiple files](https://duckdb.org/docs/stable/data/multiple_files/overview.html) at a time by passing either a list of files or a UNIX glob pattern.
+    """)
     return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Using `COPY`
+
+    `COPY` is useful both for importing and exporting data in a variety of formats, including JSON. For example, we can import data into an existing table from a JSON file.
+    """)
     return

 @app.cell
+def _(mo):
     _df = mo.sql(
         f"""
         COPY cars2 FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json' (FORMAT json, ARRAY true, DATEFORMAT '%Y-%m-%d');
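The `DATEFORMAT '%Y-%m-%d'` option in the `COPY` above uses strftime-style specifiers (DuckDB documents its own specifier table, but the common ones match Python's). A quick sketch of what that pattern matches, using the stdlib:

```python
from datetime import datetime

# '%Y-%m-%d': 4-digit year, 2-digit month, 2-digit day, dash-separated,
# the layout of the Year values in cars.json such as "1970-01-01"
d = datetime.strptime("1970-01-01", "%Y-%m-%d").date()
print(d.year, d.month, d.day)  # 1970 1 1
```

If the file's dates used a different layout (say `01/15/1970`), the format string would need to change accordingly (`'%m/%d/%Y'`), both here and in the `DATEFORMAT` option.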
@@ -153,7 +147,9 @@ def _(cars2, mo):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Similarly, we can write data from a table or a select statement to a JSON file. For example, let's create a new JSONL file with just the car names and miles per gallon. We first create a temporary directory to avoid cluttering our project directory.
+    """)
     return

@@ -164,11 +160,11 @@ def _(Path):
     TMP_DIR = TemporaryDirectory()
     COPY_PATH = Path(TMP_DIR.name) / "cars_mpg.jsonl"
     print(COPY_PATH)
+    return COPY_PATH, TMP_DIR

 @app.cell
+def _(COPY_PATH, mo):
     _df = mo.sql(
         f"""
         COPY (

@@ -191,13 +187,11 @@ def _(COPY_PATH, Path):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Using `IMPORT DATABASE`
+
+    The last method we can use to load JSON data is the `IMPORT DATABASE` statement. It works in conjunction with `EXPORT DATABASE` to save and load an entire database to and from a directory. For example, let's export our default in-memory database.
+    """)
     return
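As a rough sketch of what `EXPORT DATABASE 'some_dir' (FORMAT json)` leaves on disk (per DuckDB's export docs; the table name here is illustrative), the target directory contains the schema, a load script, and one data file per table:

```
some_dir/
├── schema.sql   -- CREATE statements for tables, views, sequences
├── load.sql     -- COPY statements that re-import the data files
└── cars.json    -- one data file per table, in the chosen format
```

`IMPORT DATABASE` simply runs `schema.sql` and then `load.sql`, which is why the pair round-trips an entire database.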
@@ -226,7 +220,9 @@ def _(EXPORT_PATH, Path):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    We can then load the database back into DuckDB.
+    """)
     return

@@ -250,14 +246,12 @@ def _(TMP_DIR):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Further Reading
+
+    - Complete information on the JSON support in DuckDB can be found in their [documentation](https://duckdb.org/docs/stable/data/json/overview.html).
+    - You can also learn more about using SQL in marimo from the [examples](https://github.com/marimo-team/marimo/tree/main/examples/sql).
+    """)
     return
duckdb/011_working_with_apache_arrow.py
CHANGED

@@ -14,41 +14,37 @@
 import marimo

-__generated_with = "0.…"
 app = marimo.App(width="medium")

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     # Working with Apache Arrow
     *By [Thomas Liang](https://github.com/thliang01)*
     #
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
-        …
-    )
     return

@@ -71,23 +67,21 @@ def _(mo):
     (5, 'Eve', 40, 'London');
     """
     )
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## 1. Creating an Arrow Table from a DuckDB Query
-        …
-    )
     return

 @app.cell
-def _(mo…
     users_arrow_table = mo.sql(  # type: ignore
         """
         SELECT * FROM users WHERE age > 30;

@@ -98,7 +92,9 @@ def _(mo, users):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -110,13 +106,11 @@ def _(users_arrow_table):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## 2. Loading an Arrow Table into DuckDB
-        …
-    )
     return

@@ -129,17 +123,19 @@ def _(pa):
         'age': [22, 45],
         'city': ['Berlin', 'Tokyo']
     })
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell
-def _(mo…
     mo.sql(
         f"""
         SELECT name, age, city

@@ -152,19 +148,19 @@ def _(mo, new_data):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## 3. Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
-        …
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -186,7 +182,9 @@ def _(users_arrow_table):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -199,17 +197,19 @@ def _(pl):
         "price": [1200.00, 25.50, 75.00]
     })
     polars_df
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell
-def _(mo…
     # Query the Polars DataFrame directly in DuckDB
     mo.sql(
         f"""

@@ -224,7 +224,9 @@ def _(mo, polars_df):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -238,11 +240,11 @@ def _(pd):
     "order_date": pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17'])
     })
     pandas_df
-    return

 @app.cell
-def _(mo…
     # Query the Pandas DataFrame in DuckDB
     mo.sql(
         f"""

@@ -257,18 +259,16 @@ def _(mo, pandas_df):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## 4. Advanced Example: Combining Multiple Data Sources
-        …
-    )
     return

 @app.cell
-def _(mo…
     # Join the DuckDB users table with the Polars products DataFrame and Pandas orders DataFrame
     result = mo.sql(
         f"""

@@ -291,27 +291,28 @@ def _(mo, pandas_df, polars_df, users):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## 5. Performance Benefits of Arrow Integration
-        …
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -352,18 +353,22 @@ def _(pd, pl):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell
-def _(duckdb, mo, pandas_data, …
     # Test query: group by category and calculate aggregations
     query = """
     SELECT

@@ -425,14 +430,16 @@ def _(duckdb, mo, pandas_data, polars_data, time):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell
 def _(approach1_time, approach2_time, approach3_time, mo, pl):
     import altair as alt
-
     # Create a bar chart showing the performance comparison
     performance_data = pl.DataFrame({
         "Approach": ["Traditional\n(Copy to DuckDB)", "Pandas\nGroupBy", "Arrow-based\n(Zero-copy)"],

@@ -450,27 +457,30 @@ def _(approach1_time, approach2_time, approach3_time, mo, pl):
         width=400,
         height=300
     )
-
     # Display using marimo's altair_chart UI element
     mo.ui.altair_chart(chart)
-    return
-

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell
-def _(mo, pl, …
     # Create additional datasets for join operations
     categories_df = pl.DataFrame({
         "category": [f"cat_{i}" for i in range(100)],

@@ -510,23 +520,21 @@ def _(mo, pl, polars_data, time):
     print(f"Complex query with joins and window functions completed in {complex_query_time:.3f} seconds")

     complex_result
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     ### Memory Efficiency During Operations

     Let's demonstrate how Arrow's zero-copy operations save memory during data transformations:
-        """
-    )
     return

 @app.cell
-def _(polars_data, time):
     import os
     import pyarrow.compute as pc  # Add this import

@@ -558,7 +566,7 @@ def _(polars_data, time):
     copy_ops_time = time.time() - latest_start_time
     memory_after_copy = process.memory_info().rss / 1024 / 1024  # MB
-
     print("Memory Usage Comparison:")
     print(f"Initial memory: {memory_before:.2f} MB")
     print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")

@@ -567,14 +575,12 @@ def _(polars_data, time):
     print(f"Arrow operations: {arrow_ops_time:.3f} seconds")
     print(f"Copy operations: {copy_ops_time:.3f} seconds")
     print(f"Speedup: {copy_ops_time/arrow_ops_time:.1f}x")
-    return
-

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     ## Summary

     In this notebook, we've explored:

@@ -590,8 +596,7 @@ def _(mo):
     - **Better scalability**: Can handle larger datasets within the same memory constraints

     The seamless integration between DuckDB and Arrow-compatible systems makes it easy to work with data across different tools while maintaining high performance and memory efficiency.
-        """
-    )
     return

@@ -604,7 +609,7 @@ def _():
     import duckdb
     import sqlglot
     import psutil
-    return duckdb, mo, pa, pd, pl

 if __name__ == "__main__":
@@ -14,41 +14,37 @@
 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium")

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     # Working with Apache Arrow
     *By [Thomas Liang](https://github.com/thliang01)*
     #
+    """)
     return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
+
+    A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This format has a rich type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
+
+    DuckDB has native support for Apache Arrow, which allows efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow).
+
+    In this notebook, we'll explore how to:
+
+    - Create an Arrow table from a DuckDB query.
+    - Load an Arrow table into DuckDB.
+    - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
+    - Combine data from multiple sources.
+    - Understand the performance benefits of Arrow integration.
+    """)
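Before diving in, a toy illustration of the row-versus-columnar distinction in plain Python (no Arrow required, invented data): columnar storage keeps each field in its own contiguous array, which is what lets an engine scan one column without touching the others.

```python
# Row-oriented: one dict per record (how JSON or ORM data usually arrives)
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 35},
]

# Column-oriented (Arrow-style): one array per field
columns = {
    "name": ["Alice", "Bob"],
    "age": [30, 35],
}

# An aggregate over one field only needs that field's array
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)  # 32.5
```

Arrow additionally fixes the byte layout of those per-field arrays, so two libraries that both speak Arrow can hand the buffers to each other without conversion.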
     return

@@ -71,23 +67,21 @@ def _(mo):
     (5, 'Eve', 40, 'London');
     """
     )
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 1. Creating an Arrow Table from a DuckDB Query
+
+    You can directly fetch the results of a DuckDB query as an Apache Arrow table by calling the `.arrow()` method on the query result.
+    """)
     return

 @app.cell
+def _(mo):
     users_arrow_table = mo.sql(  # type: ignore
         """
         SELECT * FROM users WHERE age > 30;

@@ -98,7 +92,9 @@ def _(mo, users):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    The `.arrow()` method returns a `pyarrow.Table` object. We can inspect its schema:
+    """)
     return

@@ -110,13 +106,11 @@ def _(users_arrow_table):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 2. Loading an Arrow Table into DuckDB
+
+    You can also register an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly with DuckDB. This lets you query the in-memory data without any copying, which is highly efficient.
+    """)
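"Without any copying" is the key point. A stdlib analogy using `memoryview` shows the idea: producer and consumer share the same bytes, so a "transfer" is just a pointer handoff rather than a duplication of the data.

```python
# A large buffer standing in for an Arrow column's backing data
buf = bytearray(b"\x01\x02\x03\x04" * 1_000_000)

# A memoryview slice shares buf's memory; slicing the bytearray would copy
view = memoryview(buf)[:4]

buf[0] = 0xFF     # mutate the underlying buffer...
print(view[0])    # ...and the view sees it: 255, because no copy was made

copy = bytes(buf[:4])  # an actual copy snapshots the current bytes
buf[0] = 0x01
print(copy[0])    # still 255: later mutations don't reach the copy
```

Arrow's zero-copy interchange works on the same principle, just with a standardized buffer layout that DuckDB, Polars, Pandas, and PyArrow all understand.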
     return

@@ -129,17 +123,19 @@ def _(pa):
         'age': [22, 45],
         'city': ['Berlin', 'Tokyo']
     })
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Now we can query this Arrow table `new_data` directly from SQL by embedding it in the query.
+    """)
     return

 @app.cell
+def _(mo):
     mo.sql(
         f"""
         SELECT name, age, city

@@ -152,19 +148,19 @@ def _(mo, new_data):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 3. Convert between DuckDB, Arrow, and Polars/Pandas DataFrames
+
+    The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast.
+    """)
     return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### From DuckDB to Polars/Pandas
+    """)
     return

@@ -186,7 +182,9 @@ def _(users_arrow_table):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### From Polars/Pandas to DuckDB
+    """)
     return

@@ -199,17 +197,19 @@ def _(pl):
         "price": [1200.00, 25.50, 75.00]
     })
     polars_df
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Now we can query this Polars DataFrame directly in DuckDB:
+    """)
     return

 @app.cell
+def _(mo):
     # Query the Polars DataFrame directly in DuckDB
     mo.sql(
         f"""

@@ -224,7 +224,9 @@ def _(mo, polars_df):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Similarly, we can query a Pandas DataFrame:
+    """)
     return

@@ -238,11 +240,11 @@ def _(pd):
     "order_date": pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17'])
     })
     pandas_df
+    return

 @app.cell
+def _(mo):
     # Query the Pandas DataFrame in DuckDB
     mo.sql(
         f"""

@@ -257,18 +259,16 @@ def _(mo, pandas_df):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 4. Advanced Example: Combining Multiple Data Sources
+
+    One of the most powerful features is the ability to join data from different sources (DuckDB tables, Arrow tables, Polars/Pandas DataFrames) in a single query:
+    """)
     return

 @app.cell
+def _(mo):
     # Join the DuckDB users table with the Polars products DataFrame and Pandas orders DataFrame
     result = mo.sql(
         f"""

@@ -291,27 +291,28 @@ def _(mo, pandas_df, polars_df, users):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 5. Performance Benefits of Arrow Integration
+
+    The zero-copy integration between DuckDB and Apache Arrow delivers significant performance and memory benefits. It enables:
+
+    ### Key Benefits:
+
+    - **Memory Efficiency**: Arrow's columnar format uses 20-40% less memory than traditional DataFrames through compact columnar representation and better compression ratios
+    - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage
+    - **Query Performance**: 2-10x faster queries compared to traditional approaches that require data copying
+    - **Larger-than-Memory Analysis**: Both DuckDB and Arrow-compatible libraries support streaming query results, allowing you to execute queries on data larger than available memory by processing it in batches
+    - **Advanced Query Optimization**: DuckDB's optimizer can push down filters and projections directly into Arrow scans, reading only relevant columns and partitions
+
+    Let's demonstrate these benefits with concrete examples:
+    """)
     return
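Before the full benchmarks below, the locality argument behind these numbers can be felt with a stdlib-only sketch (illustrative data; absolute timings depend on the machine): summing a contiguous numeric array versus pulling the same field out of a list of row dicts.

```python
import time
from array import array

n = 100_000
rows = [{"value": float(i), "pad": "x" * 8} for i in range(n)]  # row-oriented
col = array("d", (float(i) for i in range(n)))                  # column-oriented

t0 = time.perf_counter()
row_sum = sum(r["value"] for r in rows)  # touches every dict on the way
t1 = time.perf_counter()
col_sum = sum(col)                       # scans one contiguous buffer
t2 = time.perf_counter()

# Same answer either way; the column scan does far less pointer chasing
assert row_sum == col_sum
print(f"row scan: {t1 - t0:.4f}s, column scan: {t2 - t1:.4f}s")
```

Arrow-backed engines push this much further with vectorized, cache-friendly kernels over exactly this kind of contiguous column buffer.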
+
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### Memory Efficiency Demonstration
+    """)
     return

@@ -352,18 +353,22 @@ def _(pd, pl):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### Performance Comparison: Arrow vs Non-Arrow Approaches
+    """)
     return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Let's compare three approaches for the same analytical query:
+    """)
     return

 @app.cell
+def _(duckdb, mo, pandas_data, time):
     # Test query: group by category and calculate aggregations
     query = """
     SELECT

@@ -425,14 +430,16 @@ def _(duckdb, mo, pandas_data, polars_data, time):
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### Visualizing the Performance Difference
+    """)
     return

 @app.cell
 def _(approach1_time, approach2_time, approach3_time, mo, pl):
     import altair as alt
+
     # Create a bar chart showing the performance comparison
     performance_data = pl.DataFrame({
         "Approach": ["Traditional\n(Copy to DuckDB)", "Pandas\nGroupBy", "Arrow-based\n(Zero-copy)"],

@@ -450,27 +457,30 @@ def _(approach1_time, approach2_time, approach3_time, mo, pl):
         width=400,
         height=300
     )
+
     # Display using marimo's altair_chart UI element
     mo.ui.altair_chart(chart)
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### Complex Query Performance
+    """)
     return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Let's test a more complex query with joins and window functions:
+    """)
     return

 @app.cell
+def _(mo, pl, time):
     # Create additional datasets for join operations
     categories_df = pl.DataFrame({
         "category": [f"cat_{i}" for i in range(100)],

@@ -510,23 +520,21 @@ def _(mo, pl, polars_data, time):
     print(f"Complex query with joins and window functions completed in {complex_query_time:.3f} seconds")

     complex_result
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     ### Memory Efficiency During Operations

     Let's demonstrate how Arrow's zero-copy operations save memory during data transformations:
+    """)
     return

 @app.cell
+def _(polars_data, psutil, time):
     import os
     import pyarrow.compute as pc  # Add this import

@@ -558,7 +566,7 @@ def _(polars_data, time):
     copy_ops_time = time.time() - latest_start_time
     memory_after_copy = process.memory_info().rss / 1024 / 1024  # MB
+
     print("Memory Usage Comparison:")
     print(f"Initial memory: {memory_before:.2f} MB")
     print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")

@@ -567,14 +575,12 @@ def _(polars_data, time):
     print(f"Arrow operations: {arrow_ops_time:.3f} seconds")
     print(f"Copy operations: {copy_ops_time:.3f} seconds")
     print(f"Speedup: {copy_ops_time/arrow_ops_time:.1f}x")
+    return

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     ## Summary

     In this notebook, we've explored:

@@ -590,8 +596,7 @@ def _(mo):
     - **Better scalability**: Can handle larger datasets within the same memory constraints

     The seamless integration between DuckDB and Arrow-compatible systems makes it easy to work with data across different tools while maintaining high performance and memory efficiency.
+    """)
     return

@@ -604,7 +609,7 @@ def _():
     import duckdb
     import sqlglot
     import psutil
+    return duckdb, mo, pa, pd, pl, psutil

 if __name__ == "__main__":
|
duckdb/01_getting_started.py
CHANGED

@@ -15,26 +15,23 @@
 import marimo

-__generated_with = "0.…"
 app = marimo.App(width="medium")

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        rf"""
     <p align="center">
         <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
     </p>
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        rf"""
     # 🦆 **DuckDB**: An Embeddable Analytical Database System

     ## What is DuckDB?

@@ -83,15 +80,13 @@ def _(mo):
     /// attention | Note
     DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
     ///
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)

     DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.

@@ -105,8 +100,7 @@ def _(mo):
     | Performance | Faster for most operations | Slightly slower but provides persistence |
     | Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') |
     | Multiple Connection Access | Limited to single connection | Multiple connections can access the same database |
-        """
-    )
     return
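The persistence contrast in the table above is not DuckDB-specific. A quick sketch with the stdlib `sqlite3` module (used here only because it ships with Python; the analogous DuckDB calls are `duckdb.connect(':memory:')` and `duckdb.connect('filename.db')`):

```python
import os
import sqlite3
import tempfile

# In-memory: the table vanishes when the connection closes
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE t (x INTEGER)")
mem.execute("INSERT INTO t VALUES (1)")
mem.close()  # t is gone for good

# File-based: data survives a "restart" (a fresh connection)
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()
con.close()

con2 = sqlite3.connect(path)  # simulate an application restart
rows = con2.execute("SELECT x FROM t").fetchall()
print(rows)  # [(1,)]
con2.close()
```

The notebook's own cells below run the same experiment against real DuckDB connections.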
@@ -134,8 +128,7 @@ def _(mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
     ## Creating DuckDB Connections

     Let's create both types of DuckDB connections and explore their characteristics.

@@ -144,8 +137,7 @@ def _(mo):
     2. **File-based connection**: Data persists between sessions

     We'll then demonstrate the key differences between these connection types.
-        """
-    )
     return

@@ -176,28 +168,28 @@ def _(file_db, memory_db):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     ## Testing Connection Persistence

-    Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.

     1. First, we'll query our tables to confirm the data was properly inserted
     2. Then, we'll simulate an application restart by creating new connections
     3. Finally, we'll check which data persists after the "restart"
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

 @app.cell(hide_code=True)
-def _(…
     _df = mo.sql(
         f"""
         SELECT * FROM mem_test

@@ -208,7 +200,7 @@ def _(mem_test, memory_db, mo):
 @app.cell(hide_code=True)
-def _(file_db, …
     _df = mo.sql(
         f"""
         SELECT * FROM file_test

@@ -227,7 +219,9 @@ def _():
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(rf"""…""")
     return

@@ -311,8 +305,7 @@ def _(file_data, file_data_available, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)

     DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.

@@ -326,8 +319,7 @@ def _(mo):
     - **CREATE OR REPLACE** to recreate tables
     - **Primary keys** and other constraints
     - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
-        """
-    )
     return

@@ -406,8 +398,7 @@ def _(memory_schema, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)

     DuckDB supports multiple ways to insert data:

@@ -418,8 +409,7 @@ def _(mo):
     4. **Bulk inserts**: For efficient loading of multiple rows

     Let's demonstrate these different insertion methods:
-        """
-    )
     return

@@ -741,8 +731,7 @@ def _(file_results, memory_results, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
     # [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)

     There are multiple ways to leverage DuckDB's SQL capabilities in marimo:

@@ -752,8 +741,7 @@ def _(mo):
     3. **Interactive queries**: Combining UI elements with SQL execution

     Let's explore these approaches:
-        """
-    )
     return

@@ -808,7 +796,9 @@ def _(age_threshold, filtered_users, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -904,7 +894,9 @@ def _(complex_query_result, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -950,12 +942,10 @@ def _(new_memory_db):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        rf"""
     <!-- Display the join result -->
     ## Join Result (Users and Departments):
-        """
-    )
     return

@@ -967,12 +957,10 @@ def _(join_result, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        rf"""
     <!-- Demonstrate different types of joins -->
     ## Different Types of Joins
-        """
-    )
     return

@@ -1122,7 +1110,9 @@ def _(join_description, join_tabs, mo):
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""…""")
     return

@@ -1224,7 +1214,9 @@ def _(mo, window_result):
 @app.cell(hide_code=True)
def _(mo):
|
| 1227 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 1228 |
return
|
| 1229 |
|
| 1230 |
|
|
@@ -1342,7 +1334,9 @@ def _(mo, pandas_result):
|
|
| 1342 |
|
| 1343 |
@app.cell(hide_code=True)
|
| 1344 |
def _(mo):
|
| 1345 |
-
mo.md("""
|
|
|
|
|
|
|
| 1346 |
return
|
| 1347 |
|
| 1348 |
|
|
@@ -1498,8 +1492,7 @@ def _(age_groups, mo, new_memory_db, plotly_express):
|
|
| 1498 |
|
| 1499 |
@app.cell(hide_code=True)
|
| 1500 |
def _(mo):
|
| 1501 |
-
mo.md(
|
| 1502 |
-
r"""
|
| 1503 |
/// admonition |
|
| 1504 |
## Database Management Best Practices
|
| 1505 |
///
|
|
@@ -1538,14 +1531,15 @@ def _(mo):
|
|
| 1538 |
- Create indexes for frequently queried columns
|
| 1539 |
- For large datasets, consider partitioning
|
| 1540 |
- Use prepared statements for repeated queries
|
| 1541 |
-
"""
|
| 1542 |
-
)
|
| 1543 |
return
|
| 1544 |
|
| 1545 |
|
| 1546 |
@app.cell(hide_code=True)
|
| 1547 |
def _(mo):
|
| 1548 |
-
mo.md(rf"""
|
|
|
|
|
|
|
| 1549 |
return
|
| 1550 |
|
| 1551 |
|
|
@@ -1736,8 +1730,7 @@ def _(
|
|
| 1736 |
|
| 1737 |
@app.cell(hide_code=True)
|
| 1738 |
def _(mo):
|
| 1739 |
-
mo.md(
|
| 1740 |
-
rf"""
|
| 1741 |
# Summary and Key Takeaways
|
| 1742 |
|
| 1743 |
In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
|
|
@@ -1770,8 +1763,7 @@ def _(mo):
|
|
| 1770 |
- Experiment with more complex queries and window functions
|
| 1771 |
- Use DuckDB's COPY functionality to import/export data from/to files
|
| 1772 |
- Create more advanced interactive dashboards with marimo and Plotly
|
| 1773 |
-
"""
|
| 1774 |
-
)
|
| 1775 |
return
|
| 1776 |
|
| 1777 |
|
|
|
|
 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium")


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
     <p align="center">
         <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
     </p>
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
     # 🦆 **DuckDB**: An Embeddable Analytical Database System

     ## What is DuckDB?

     /// attention | Note
     DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
     ///
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)

     DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.

     | Performance | Faster for most operations | Slightly slower but provides persistence |
     | Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') |
     | Multiple Connection Access | Limited to single connection | Multiple connections can access the same database |
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md("""
     ## Creating DuckDB Connections

     Let's create both types of DuckDB connections and explore their characteristics.

     2. **File-based connection**: Data persists between sessions

     We'll then demonstrate the key differences between these connection types.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     ## Testing Connection Persistence

+    Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.

     1. First, we'll query our tables to confirm the data was properly inserted
     2. Then, we'll simulate an application restart by creating new connections
     3. Finally, we'll check which data persists after the "restart"
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Current Database Contents
+    """)
     return


 @app.cell(hide_code=True)
+def _(memory_db, mo):
     _df = mo.sql(
         f"""
         SELECT * FROM mem_test


 @app.cell(hide_code=True)
+def _(file_db, mo):
     _df = mo.sql(
         f"""
         SELECT * FROM file_test


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
+    ## 🔄 Simulating Application Restart...
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)

     DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.

     - **CREATE OR REPLACE** to recreate tables
     - **Primary keys** and other constraints
     - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)

     DuckDB supports multiple ways to insert data:

     4. **Bulk inserts**: For efficient loading of multiple rows

     Let's demonstrate these different insertion methods:
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     # [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)

     There are multiple ways to leverage DuckDB's SQL capabilities in marimo:

     3. **Interactive queries**: Combining UI elements with SQL execution

     Let's explore these approaches:
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
     <!-- Display the join result -->
     ## Join Result (Users and Departments):
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
     <!-- Demonstrate different types of joins -->
     ## Different Types of Joins
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md("""
+    # 9. Data Visualization with DuckDB and Plotly
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
     /// admonition |
     ## Database Management Best Practices
     ///

     - Create indexes for frequently queried columns
     - For large datasets, consider partitioning
     - Use prepared statements for repeated queries
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
+    ## 10. Interactive DuckDB Dashboard with marimo and Plotly
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(rf"""
     # Summary and Key Takeaways

     In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:

     - Experiment with more complex queries and window functions
     - Use DuckDB's COPY functionality to import/export data from/to files
     - Create more advanced interactive dashboards with marimo and Plotly
+    """)
     return
duckdb/DuckDB_Loading_CSVs.py
CHANGED
@@ -13,39 +13,41 @@

 import marimo

-__generated_with = "0.
 app = marimo.App(width="medium")


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        <
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return
@@ -67,7 +69,9 @@ def _(mo):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return
@@ -80,11 +84,11 @@ def _(mo):
         SELECT Year, Concept, publications FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv"
         """
     )
-    return

 @app.cell
-def _(
     Analysis = mo.sql(
         f"""
         SELECT *
@@ -93,11 +97,11 @@ def _(Domain_Analysis, mo):
         ORDER BY Year
         """
     )
-    return

 @app.cell
-def _(
     _df = mo.sql(
         f"""
         SELECT
@@ -111,7 +115,7 @@ def _(Domain_Analysis, mo):

 @app.cell
-def _(
     NLP_Analysis = mo.sql(
         f"""
         SELECT
@@ -137,21 +141,23 @@ def _(NLP_Analysis, px):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        <
-        """
-    )
     return
@@ -159,7 +165,7 @@ def _(mo):
 def _():
     import pyarrow
     import polars
-    return

 @app.cell


 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium")


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    #Loading CSVs with DuckDB
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    <p> I remember when I first learnt about DuckDB, it was a gamechanger — I used to load the data I wanted to work on to a database software like MS SQL Server, and then build a bridge to an IDE with the language I wanted to use like Python, or R; it was quite the hassle. DuckDB changed my whole world — now I could just import the data file into the IDE, or notebook, make a duckdb connection, and there we go! But then, I realized I didn't even need the step of first importing the file using python. I could just query the csv file directly using SQL through a DuckDB connection.</p>
+
+    ##Introduction
+    <p> I found this dataset on the evolution of AI research by discipline from <a href= "https://oecd.ai/en/data?selectedArea=ai-research&selectedVisualization=16731"> OECD</a>, and it piqued my interest. I feel like publications in natural language processing drastically jumped in the mid 2010s, and I'm excited to find out if that's the case. </p>
+
+    <p> In this notebook, we'll: </p>
+    <ul>
+    <li> Import the CSV file into the notebook</li>
+    <li> Create another table within the database based on the CSV</li>
+    <li> Dig into publications on natural language processing have evolved over the years</li>
+    </ul>
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ##Load the CSV
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ##Create Another Table
+    """)
     return

         SELECT Year, Concept, publications FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv"
         """
     )
+    return


 @app.cell
+def _(mo):
     Analysis = mo.sql(
         f"""
         SELECT *
         ORDER BY Year
         """
     )
+    return


 @app.cell
+def _(mo):
     _df = mo.sql(
         f"""
         SELECT


 @app.cell
+def _(mo):
     NLP_Analysis = mo.sql(
         f"""
         SELECT


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    <p> We can see there's a significant increase in NLP publications 2020 and onwards which definitely makes sense provided the rapid emergence of commercial large language models, and AI assistants. </p>
+    """)
+    return
+

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ##Conclusion
+    <p> In this notebook, we learned how to:</p>
+    <ul>
+    <li> Load a CSV into DuckDB </li>
+    <li> Create other tables using the imported CSV </li>
+    <li> Seamlessly analyze and visualize data between SQL, and Python cells</li>
+    </ul>
+    """)
     return


 def _():
     import pyarrow
     import polars
+    return


 @app.cell
duckdb/README.md
CHANGED
@@ -1,3 +1,8 @@
+---
+title: Readme
+marimo-version: 0.18.4
+---
+
 # Learn DuckDB

 _🚧 This collection is a work in progress. Please help us add notebooks!_
functional_programming/05_functors.py
CHANGED
@@ -7,102 +7,98 @@

 import marimo

-__generated_with = "0.
 app = marimo.App(app_title="Category Theory and Functors")


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # Category Theory and Functors
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # Functor as a Computational Context
-    )
     return


 @app.cell
-def _(B, Callable, Functor, dataclass):
     @dataclass
     class Wrapper[A](Functor):
         value: A
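The `Wrapper` cell in this diff relies on the notebook's own `Functor` base class and Python 3.12 generic syntax; a self-contained sketch of the same idea, using `typing.Generic` so it runs on older Pythons, looks like this (the class and method names mirror the notebook, the rest is illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")


@dataclass
class Wrapper(Generic[A]):
    """Minimal functor: one wrapped value plus an fmap."""

    value: A

    def fmap(self, f: Callable[[A], B]) -> "Wrapper[B]":
        # Apply f inside the context, preserving the Wrapper structure.
        return Wrapper(f(self.value))


w = Wrapper(1).fmap(lambda x: x + 1).fmap(str)
print(w)  # Wrapper(value='2')
```

Chaining `fmap` calls like this is exactly what the functor laws (identity and composition) later in the notebook constrain.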
@@ -115,26 +111,24 @@ def _(B, Callable, Functor, dataclass):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        /// attention
-    )
     return
@@ -149,46 +143,42 @@ def _(Wrapper, pp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        We can analyze the type signature of `fmap` for `Wrapper`:
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## The List Functor
-    )
     return

 @app.cell
-def _(B, Callable, Functor, dataclass):
     @dataclass
     class List[A](Functor):
         value: list[A]
@@ -201,7 +191,9 @@ def _(B, Callable, Functor, dataclass):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return
@@ -215,114 +207,106 @@ def _(List, pp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ### Extracting the Type of `fmap`
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Defining Functor
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # More Functor instances (optional)
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## The [Maybe](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Maybe.html#t:Maybe) Functor
-    )
     return

 @app.cell
-def _(B, Callable, Functor, dataclass):
     @dataclass
     class Maybe[A](Functor):
         value: None | A
@@ -345,24 +329,22 @@ def _(Maybe, pp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## The [Either](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Either.html#t:Either) Functor
-    )
     return

 @app.cell
-def _(B, Callable, Functor, Union, dataclass):
     @dataclass
     class Either[A](Functor):
         left: A = None
@@ -400,29 +382,27 @@ def _(Either):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## The [RoseTree](https://en.wikipedia.org/wiki/Rose_tree) Functor
-    )
     return

 @app.cell
-def _(B, Callable, Functor, dataclass):
     @dataclass
     class RoseTree[A](Functor):
         value: A  # The value stored in the node.
@@ -459,34 +439,32 @@ def _(RoseTree, pp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Generic Functions that can be Used with Any Functor
-    )
     return
@@ -506,55 +484,51 @@ def _(flist, inc, pp, rosetree, wrapper):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Functor laws
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ### Functor laws verification
-    )
     return
@@ -562,12 +536,14 @@ def _(mo):
 def _():
     id = lambda x: x
     compose = lambda f, g: lambda x: f(g(x))
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return
@@ -581,7 +557,9 @@ def _(id):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return
@@ -589,17 +567,19 @@ def _(mo):
 def _(check_functor_law, flist, pp, rosetree, wrapper):
     for functor in (wrapper, flist, rosetree):
         pp(check_functor_law(functor))
-    return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md("""
     return

 @app.cell
-def _(B, Callable, Functor, dataclass):
     @dataclass
     class EvilFunctor[A](Functor):
         value: list[A]
@@ -624,31 +604,29 @@ def _(EvilFunctor, check_functor_law, pp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return
@@ -676,13 +654,11 @@ def _(List, Maybe):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return
@@ -697,7 +673,9 @@ def _(List, RoseTree, flist, pp, rosetree):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md("""
     return
@@ -728,291 +706,275 @@ def _(ABC, B, Callable, abstractmethod, dataclass):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Limitations of Functor
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # Introduction to Categories
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # Functors, again
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Functors on the category of Python
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Functor laws, again
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return
@@ -1042,19 +1004,17 @@ def _(WrapperCategory, id, pp, wrapper):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Length as a Functor
-    )
     return
@@ -1078,24 +1038,20 @@ def _(A, dataclass):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ### Category of Integer Addition
-    )
     return
@@ -1117,28 +1073,24 @@ def _(dataclass):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-    )
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ### Defining the Length Functor
-    )
     return
@@ -1150,23 +1102,23 @@ def _(IntAddition):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md("""
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ### Verifying Functor Laws
-    )
     return
@@ -1178,19 +1130,19 @@ def _(IntAddition, ListConcatenation, length, pp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md("""
     return

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        **Composition Law**
-    )
     return
@@ -1202,36 +1154,36 @@ def _(IntAddition, ListConcatenation, length, pp):
         length(ListConcatenation.compose(lista, listb))
         == IntAddition.compose(length(lista), length(listb))
|
| 1204 |
)
|
| 1205 |
-
return
|
| 1206 |
|
| 1207 |
|
| 1208 |
@app.cell(hide_code=True)
|
| 1209 |
def _(mo):
|
| 1210 |
-
mo.md("""
|
|
|
|
|
|
|
| 1211 |
return
|
| 1212 |
|
| 1213 |
|
| 1214 |
@app.cell(hide_code=True)
|
| 1215 |
def _(mo):
|
| 1216 |
-
mo.md(
|
| 1217 |
-
|
| 1218 |
-
# Bifunctor
|
| 1219 |
|
| 1220 |
-
|
| 1221 |
|
| 1222 |
-
|
| 1223 |
|
| 1224 |
-
|
| 1225 |
|
| 1226 |
-
|
| 1227 |
|
| 1228 |
-
|
| 1229 |
-
|
| 1230 |
|
| 1231 |
|
| 1232 |
-
|
| 1233 |
-
|
| 1234 |
-
)
|
| 1235 |
return
|
| 1236 |
|
| 1237 |
|
|
@@ -1261,38 +1213,36 @@ def _(ABC, B, Callable, D, dataclass, f, id):
|
|
| 1261 |
|
| 1262 |
@app.cell(hide_code=True)
|
| 1263 |
def _(mo):
|
| 1264 |
-
mo.md(
|
| 1265 |
-
|
| 1266 |
-
|
| 1267 |
-
|
| 1268 |
-
|
| 1269 |
-
"""
|
| 1270 |
-
)
|
| 1271 |
return
|
| 1272 |
|
| 1273 |
|
| 1274 |
@app.cell(hide_code=True)
|
| 1275 |
def _(mo):
|
| 1276 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 1277 |
return
|
| 1278 |
|
| 1279 |
|
| 1280 |
@app.cell(hide_code=True)
|
| 1281 |
def _(mo):
|
| 1282 |
-
mo.md(
|
| 1283 |
-
|
| 1284 |
-
### The Either Bifunctor
|
| 1285 |
|
| 1286 |
-
|
| 1287 |
|
| 1288 |
-
|
| 1289 |
-
|
| 1290 |
-
)
|
| 1291 |
return
|
| 1292 |
|
| 1293 |
|
| 1294 |
@app.cell
|
| 1295 |
-
def _(B, Bifunctor, Callable, D, dataclass):
|
| 1296 |
@dataclass
|
| 1297 |
class BiEither[A, C](Bifunctor):
|
| 1298 |
left: A = None
|
|
@@ -1334,18 +1284,16 @@ def _(BiEither):
|
|
| 1334 |
|
| 1335 |
@app.cell(hide_code=True)
|
| 1336 |
def _(mo):
|
| 1337 |
-
mo.md(
|
| 1338 |
-
|
| 1339 |
-
### The 2d Tuple Bifunctor
|
| 1340 |
|
| 1341 |
-
|
| 1342 |
-
|
| 1343 |
-
)
|
| 1344 |
return
|
| 1345 |
|
| 1346 |
|
| 1347 |
@app.cell
|
| 1348 |
-
def _(B, Bifunctor, Callable, D, dataclass):
|
| 1349 |
@dataclass
|
| 1350 |
class BiTuple[A, C](Bifunctor):
|
| 1351 |
value: tuple[A, C]
|
|
@@ -1368,19 +1316,17 @@ def _(BiTuple):
|
|
| 1368 |
|
| 1369 |
@app.cell(hide_code=True)
|
| 1370 |
def _(mo):
|
| 1371 |
-
mo.md(
|
| 1372 |
-
|
| 1373 |
-
## Bifunctor laws
|
| 1374 |
|
| 1375 |
-
|
| 1376 |
|
| 1377 |
-
|
| 1378 |
-
|
| 1379 |
-
|
| 1380 |
|
| 1381 |
-
|
| 1382 |
-
|
| 1383 |
-
)
|
| 1384 |
return
|
| 1385 |
|
| 1386 |
|
|
@@ -1394,24 +1340,22 @@ def _(BiEither, BiTuple, id):
|
|
| 1394 |
|
| 1395 |
@app.cell(hide_code=True)
|
| 1396 |
def _(mo):
|
| 1397 |
-
mo.md(
|
| 1398 |
-
|
| 1399 |
-
|
| 1400 |
-
|
| 1401 |
-
|
| 1402 |
-
|
| 1403 |
-
|
| 1404 |
-
|
| 1405 |
-
|
| 1406 |
-
|
| 1407 |
-
|
| 1408 |
-
|
| 1409 |
-
|
| 1410 |
-
|
| 1411 |
-
|
| 1412 |
-
|
| 1413 |
-
"""
|
| 1414 |
-
)
|
| 1415 |
return
|
| 1416 |
|
| 1417 |
|
|
|
|
import marimo

+__generated_with = "0.18.4"
app = marimo.App(app_title="Category Theory and Functors")


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    # Category Theory and Functors

+    In this notebook, you will learn:

+    * Why `length` is a *functor* from the category of `list concatenation` to the category of `integer addition`
+    * How to *lift* an ordinary function into a specific *computational context*
+    * How to write an *adapter* between two categories

+    In short, a mathematical functor is a **mapping** between two categories in category theory. In practice, a functor represents a type that can be mapped over.

+    /// admonition | Intuitions

+    - A simple intuition is that a `Functor` represents a **container** of values, along with the ability to apply a function uniformly to every element in the container.
+    - Another intuition is that a `Functor` represents some sort of **computational context**.
+    - Mathematically, `Functors` generalize the idea of a container or a computational context.
+    ///

+    We will start with intuition, introduce the basics of category theory, and then examine functors from a categorical perspective.

+    /// details | Notebook metadata
+    type: info

+    version: 0.1.5 | last modified: 2025-04-11 | author: [métaboulie](https://github.com/metaboulie)<br/>
+    reviewer: [Haleshot](https://github.com/Haleshot)

+    ///
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    # Functor as a Computational Context

+    A [**Functor**](https://wiki.haskell.org/Functor) is an abstraction that represents a computational context with the ability to apply a function to every value inside it without altering the structure of the context itself. This enables transformations while preserving the shape of the data.

+    To understand this, let's look at a simple example.

+    ## [The One-Way Wrapper Design Pattern](http://blog.sigfpe.com/2007/04/trivial-monad.html)

+    Often, we need to wrap data in some kind of context. However, when performing operations on wrapped data, we typically have to:

+    1. Unwrap the data.
+    2. Modify the unwrapped data.
+    3. Rewrap the modified data.

+    This process is tedious and inefficient. Instead, we want to wrap data **once** and apply functions directly to the wrapped data without unwrapping it.

+    /// admonition | Rules for a One-Way Wrapper

+    1. We can wrap values, but we cannot unwrap them.
+    2. We should still be able to apply transformations to the wrapped data.
+    3. Any operation that depends on wrapped data should itself return a wrapped result.
+    ///

+    Let's define such a `Wrapper` class:

+    ```python
+    from dataclasses import dataclass
+    from typing import TypeVar

+    A = TypeVar("A")
+    B = TypeVar("B")

+    @dataclass
+    class Wrapper[A]:
+        value: A
+    ```

+    Now, we can create an instance of wrapped data:

+    ```python
+    wrapped = Wrapper(1)
+    ```

+    ### Mapping Functions Over Wrapped Data

+    To modify wrapped data while keeping it wrapped, we define an `fmap` method:
+    """)
    return
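The one-way-wrapper idea above can be sketched as a runnable, self-contained example. This is a minimal sketch, not the notebook's exact cell: it uses a plain (non-generic) dataclass and assumes `fmap` is a classmethod, matching the signatures discussed later in the diff.

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    """A minimal one-way wrapper: wrap once, transform only via fmap."""

    value: object

    @classmethod
    def fmap(cls, g, fa):
        # Apply g to the wrapped value and rewrap; the original stays intact.
        return cls(g(fa.value))


wrapped = Wrapper(1)
incremented = Wrapper.fmap(lambda x: x + 1, wrapped)
```

Note that `incremented` is a new `Wrapper(2)` while `wrapped` still holds `1`, which is exactly the "no unwrapping, no mutation" rule.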
@app.cell
+def _(A, B, Callable, Functor, dataclass):
    @dataclass
    class Wrapper[A](Functor):
        value: A


@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    /// attention

+    To distinguish between regular types and functors, we use the prefix `f` to indicate `Functor`.

+    For instance,

+    - `a: A` is a regular variable of type `A`
+    - `g: Callable[[A], B]` is a regular function from type `A` to `B`
+    - `fa: Functor[A]` is a *Functor* wrapping a value of type `A`
+    - `fg: Functor[Callable[[A], B]]` is a *Functor* wrapping a function from type `A` to `B`

+    and we will avoid using `f` to represent a function.

+    ///

+    > Try with Wrapper below
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    We can analyze the type signature of `fmap` for `Wrapper`:

+    * `g` is of type `Callable[[A], B]`
+    * `fa` is of type `Wrapper[A]`
+    * The return value is of type `Wrapper[B]`

+    Thus, in Python's type system, we can express the type signature of `fmap` as:

+    ```python
+    fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]
+    ```

+    Essentially, `fmap`:

+    1. Takes a function `Callable[[A], B]` and a `Wrapper[A]` instance as input.
+    2. Applies the function to the value inside the wrapper.
+    3. Returns a new `Wrapper[B]` instance with the transformed value, leaving the original wrapper and its internal data unmodified.

+    Now, let's examine `list` as a similar kind of wrapper.
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## The List Functor

+    We can define a `List` class to represent a wrapped list that supports `fmap`:
+    """)
    return


@app.cell
+def _(A, B, Callable, Functor, dataclass):
    @dataclass
    class List[A](Functor):
        value: list[A]


@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    > Try with List below
+    """)
    return
@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ### Extracting the Type of `fmap`

+    The type signature of `fmap` for `List` is:

+    ```python
+    fmap(g: Callable[[A], B], fa: List[A]) -> List[B]
+    ```

+    Similarly, for `Wrapper`:

+    ```python
+    fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]
+    ```

+    Both follow the same pattern, which we can generalize as:

+    ```python
+    fmap(g: Callable[[A], B], fa: Functor[A]) -> Functor[B]
+    ```

+    where `Functor` can be `Wrapper`, `List`, or any other wrapper type that follows the same structure.

+    ### Functors in Haskell (optional)

+    In Haskell, the type of `fmap` is:

+    ```haskell
+    fmap :: Functor f => (a -> b) -> f a -> f b
+    ```

+    or equivalently:

+    ```haskell
+    fmap :: Functor f => (a -> b) -> (f a -> f b)
+    ```

+    This means that `fmap` **lifts** an ordinary function into the **functor world**, allowing it to operate within a computational context.

+    Now, let's define an abstract class for `Functor`.
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## Defining Functor

+    Recall that a **Functor** is an abstraction that allows us to apply a function to values inside a computational context while preserving its structure.

+    To define `Functor` in Python, we use an abstract base class:

+    ```python
+    @dataclass
+    class Functor[A](ABC):
+        @classmethod
+        @abstractmethod
+        def fmap(cls, g: Callable[[A], B], fa: "Functor[A]") -> "Functor[B]":
+            raise NotImplementedError
+    ```

+    We can now extend custom wrappers, containers, or computation contexts with this `Functor` base class, implement the `fmap` method, and apply any function.
+    """)
    return
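The abstract-base-class pattern above can be exercised end to end. This sketch defines the ABC and one hypothetical instance (`Box` is an illustrative name, not from the notebook) to show that subclassing `Functor` plus implementing `fmap` is all that is required:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Functor(ABC):
    @classmethod
    @abstractmethod
    def fmap(cls, g, fa):
        raise NotImplementedError


@dataclass
class Box(Functor):  # hypothetical instance for illustration
    value: object

    @classmethod
    def fmap(cls, g, fa):
        # Apply g inside the context and rewrap.
        return cls(g(fa.value))


b = Box.fmap(str, Box(42))
```

Any attempt to instantiate `Functor` directly, or a subclass without `fmap`, fails at construction time thanks to `ABC`.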
@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    # More Functor instances (optional)

+    In this section, we will explore more *Functor* instances to help you build a better understanding.

+    The main reference is [Data.Functor](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Functor.html).
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    ## The [Maybe](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Maybe.html#t:Maybe) Functor

+    **`Maybe`** is a functor that can either hold a value (`Just(value)`) or be `Nothing` (equivalent to `None` in Python).

+    - If the value exists, `fmap` applies the function to this value inside the functor.
+    - If the value is `None`, `fmap` simply returns `None`.

+    /// admonition
+    By using `Maybe` as a functor, we gain the ability to apply transformations (`fmap`) to potentially absent values, without having to explicitly handle the `None` case every time.
+    ///

+    We can implement the `Maybe` functor as:
+    """)
    return


@app.cell
+def _(A, B, Callable, Functor, dataclass):
    @dataclass
    class Maybe[A](Functor):
        value: None | A
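The `Maybe` behavior described above (the cell body is truncated in this diff view) can be sketched as a plain standalone class, assuming `None` plays the role of `Nothing`:

```python
from dataclasses import dataclass


@dataclass
class Maybe:
    """Maybe functor: None is Nothing, any other value is Just(value)."""

    value: object = None

    @classmethod
    def fmap(cls, g, fa):
        # Nothing stays Nothing; Just(x) becomes Just(g(x)).
        return cls(None) if fa.value is None else cls(g(fa.value))


just = Maybe.fmap(lambda x: x + 1, Maybe(1))
nothing = Maybe.fmap(lambda x: x + 1, Maybe(None))
```

The absent case passes through untouched, so callers never branch on `None` themselves.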
@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    ## The [Either](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Either.html#t:Either) Functor

+    The `Either` type represents values with two possibilities: a value of type `Either a b` is either `Left a` or `Right b`.

+    The `Either` type is sometimes used to represent a value which is **either correct or an error**; by convention, the `left` attribute is used to hold an error value and the `right` attribute is used to hold a correct value.

+    `fmap` for `Either` ignores `Left` values, but applies the supplied function to values contained in the `Right`.

+    The implementation is:
+    """)
    return


@app.cell
+def _(A, B, Callable, Functor, Union, dataclass):
    @dataclass
    class Either[A](Functor):
        left: A = None
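Since the cell body is truncated here, the `Either` convention above (errors in `left` pass through, `fmap` acts only on `right`) can be sketched as:

```python
from dataclasses import dataclass


@dataclass
class Either:
    """Either functor: left holds an error, right holds a correct value."""

    left: object = None
    right: object = None

    @classmethod
    def fmap(cls, g, fa):
        # Errors short-circuit: a Left is returned unchanged.
        if fa.left is not None:
            return cls(left=fa.left)
        # Otherwise apply g to the Right value.
        return cls(right=g(fa.right))


ok = Either.fmap(lambda x: x * 10, Either(right=4))
err = Either.fmap(lambda x: x * 10, Either(left="bad input"))
```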
@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## The [RoseTree](https://en.wikipedia.org/wiki/Rose_tree) Functor

+    A **RoseTree** is a tree where:

+    - Each node holds a **value**.
+    - Each node has a **list of child nodes** (which are also RoseTrees).

+    This structure is useful for representing hierarchical data, such as:

+    - Abstract Syntax Trees (ASTs)
+    - File system directories
+    - Recursive computations

+    The implementation is:
+    """)
    return


@app.cell
+def _(A, B, Callable, Functor, dataclass):
    @dataclass
    class RoseTree[A](Functor):
        value: A  # The value stored in the node.
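The RoseTree `fmap` (whose body is collapsed in this diff view) is naturally recursive: transform the node value, then map over every child. A minimal standalone sketch:

```python
from dataclasses import dataclass, field


@dataclass
class RoseTree:
    value: object                       # the value stored in the node
    children: list = field(default_factory=list)  # child RoseTrees

    @classmethod
    def fmap(cls, g, fa):
        # Apply g at this node, then recurse into each child subtree.
        return cls(g(fa.value), [cls.fmap(g, c) for c in fa.children])


tree = RoseTree(1, [RoseTree(2), RoseTree(3, [RoseTree(4)])])
doubled = RoseTree.fmap(lambda x: x * 2, tree)
```

The tree's shape (branching and depth) is identical before and after, only the values change.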
@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## Generic Functions that can be Used with Any Functor

+    One of the powerful features of functors is that we can write **generic functions** that work with any functor.

+    Remember that in Haskell, the type of `fmap` can be written as:

+    ```haskell
+    fmap :: Functor f => (a -> b) -> (f a -> f b)
+    ```

+    Translating to Python, we get:

+    ```python
+    def fmap(g: Callable[[A], B]) -> Callable[[Functor[A]], Functor[B]]
+    ```

+    This means that `fmap`:

+    - Takes an **ordinary function** `Callable[[A], B]` as input.
+    - Outputs a function that:
+        - Takes a **functor** of type `Functor[A]` as input.
+        - Outputs a **functor** of type `Functor[B]`.

+    Inspired by this, we can implement an `inc` function which takes a functor, applies the function `lambda x: x + 1` to every value inside it, and returns a new functor with the updated values.
+    """)
    return
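The genericity claimed above can be demonstrated concretely: one `inc` works for every type that implements the classmethod-`fmap` convention. The `Wrapper` and `List` classes below are the same minimal stand-ins used earlier, not the notebook's exact cells:

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    value: object

    @classmethod
    def fmap(cls, g, fa):
        return cls(g(fa.value))


@dataclass
class List:
    value: list

    @classmethod
    def fmap(cls, g, fa):
        return cls([g(x) for x in fa.value])


def inc(functor):
    # Generic over any functor: dispatch to the instance's own fmap.
    return functor.fmap(lambda x: x + 1, functor)


a = inc(Wrapper(1))
b = inc(List([1, 2, 3]))
```

`inc` never inspects the container: the instance's `fmap` carries all the structure-specific logic.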
@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    /// admonition | exercise
+    Implement other generic functions and apply them to different *Functor* instances.
+    ///
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    # Functor laws and utility functions
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## Functor laws

+    In addition to providing a function `fmap` of the specified type, functors are also required to satisfy two equational laws:

+    ```haskell
+    fmap id = id                   -- fmap preserves identity
+    fmap (g . h) = fmap g . fmap h -- fmap distributes over composition
+    ```

+    1. `fmap` should preserve the **identity function**, in the sense that applying `fmap` to this function returns the same function as the result.
+    2. `fmap` should also preserve **function composition**. Applying `fmap` to the composition of two functions `g` and `h` should give the same result as composing `fmap g` with `fmap h`.

+    /// admonition |
+    - Any `Functor` instance satisfying the first law `(fmap id = id)` will [automatically satisfy the second law](https://github.com/quchen/articles/blob/master/second_functor_law.md) as well.
+    ///
+    """)
    return
@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    ### Functor laws verification

+    We can define `id` and `compose` in `Python` as:
+    """)
    return


@app.cell
def _():
    id = lambda x: x
    compose = lambda f, g: lambda x: f(g(x))
+    return (id,)


@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    We can add a helper function `check_functor_law` to verify that an instance satisfies the functor laws:
+    """)
    return
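The body of `check_functor_law` is collapsed in this diff view, but given the note above that the identity law implies the composition law, one plausible shape is a check of `fmap id = id` (assuming dataclass equality on the instances; the exact notebook implementation may differ):

```python
from dataclasses import dataclass

id = lambda x: x  # shadows the builtin, as in the notebook cell above


@dataclass
class Wrapper:
    value: object

    @classmethod
    def fmap(cls, g, fa):
        return cls(g(fa.value))


def check_functor_law(functor):
    # Identity law: fmap(id, fa) == fa.
    # Dataclass __eq__ compares wrapped values structurally.
    return functor.fmap(id, functor) == functor


result = check_functor_law(Wrapper(1))
```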
@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    We can verify the functors we've defined:
+    """)
    return


@app.cell
def _(check_functor_law, flist, pp, rosetree, wrapper):
    for functor in (wrapper, flist, rosetree):
        pp(check_functor_law(functor))
+    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    And here is an `EvilFunctor`. We can verify that it's not a valid `Functor`.
+    """)
    return


@app.cell
+def _(A, B, Callable, Functor, dataclass):
    @dataclass
    class EvilFunctor[A](Functor):
        value: list[A]


@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    ## Utility functions

+    ```python
+    @classmethod
+    def const(cls, fa: "Functor[A]", b: B) -> "Functor[B]":
+        return cls.fmap(lambda _: b, fa)

+    @classmethod
+    def void(cls, fa: "Functor[A]") -> "Functor[None]":
+        return cls.const(fa, None)

+    @classmethod
+    def unzip(
+        cls, fab: "Functor[tuple[A, B]]"
+    ) -> tuple["Functor[A]", "Functor[B]"]:
+        return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab)
+    ```

+    - `const` replaces all values inside a functor with a constant `b`
+    - `void` is equivalent to `const(fa, None)`, transforming all values in a functor into `None`
+    - `unzip` is a generalization of the regular *unzip* on a list of pairs
+    """)
    return
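The three utilities above are all one-liners on top of `fmap`, so they work for any functor. A standalone sketch on the minimal `Wrapper` used earlier:

```python
from dataclasses import dataclass


@dataclass
class Wrapper:
    value: object

    @classmethod
    def fmap(cls, g, fa):
        return cls(g(fa.value))

    @classmethod
    def const(cls, fa, b):
        # Replace the wrapped value with the constant b.
        return cls.fmap(lambda _: b, fa)

    @classmethod
    def void(cls, fa):
        # const with None: discard the value, keep the structure.
        return cls.const(fa, None)

    @classmethod
    def unzip(cls, fab):
        # Split a functor of pairs into a pair of functors.
        return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab)


w = Wrapper((1, "a"))
first, second = Wrapper.unzip(w)
```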
@app.cell(hide_code=True)
def _(mo):
+    mo.md(r"""
+    /// admonition
+    You can always override these utility functions with a more efficient implementation for specific instances.
+    ///
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    # Formal implementation of Functor
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## Limitations of Functor

+    Functors abstract the idea of mapping a function over each element of a structure. Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types:

+    ```haskell
+    fmap0 :: a -> f a
+    fmap1 :: (a -> b) -> f a -> f b
+    fmap2 :: (a -> b -> c) -> f a -> f b -> f c
+    fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d
+    ```

+    We would have to declare a special version of the functor class for each case.

+    We will learn how to resolve this problem in the next notebook on `Applicatives`.
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    # Introduction to Categories

+    A [category](https://en.wikibooks.org/wiki/Haskell/Category_theory#Introduction_to_categories) is, in essence, a simple collection. It has three components:

+    - A collection of **objects**.
+    - A collection of **morphisms**, each of which ties two objects (a _source object_ and a _target object_) together. If $f$ is a morphism with source object $C$ and target object $B$, we write $f : C → B$.
+    - A notion of **composition** of these morphisms. If $g : A → B$ and $f : B → C$ are two morphisms, they can be composed, resulting in a morphism $f ∘ g : A → C$.

+    ## Category laws

+    There are three laws that categories need to follow.

+    1. The composition of morphisms needs to be **associative**. Symbolically, $f ∘ (g ∘ h) = (f ∘ g) ∘ h$

+        - Morphisms are applied right to left, so with $f ∘ g$ first $g$ is applied, then $f$.

+    2. The category needs to be **closed** under the composition operation. So if $f : B → C$ and $g : A → B$, then there must be some morphism $h : A → C$ in the category such that $h = f ∘ g$.

+    3. Given a category $C$ there needs to be for every object $A$ an **identity** morphism, $id_A : A → A$ that is an identity of composition with other morphisms. Put precisely, for every morphism $g : A → B$: $g ∘ id_A = id_B ∘ g = g$

+    /// attention | The definition of a category does not define:

+    - what `∘` is,
+    - what `id` is, or
+    - what `f`, `g`, and `h` might be.

+    Instead, category theory leaves it up to us to discover what they might be.
+    ///
+    """)
    return
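The associativity and identity laws above can be spot-checked numerically for sample morphisms, using the same `id`/`compose` definitions the notebook introduces (this only tests the laws at sample inputs, it is not a proof):

```python
# Sample morphisms in Py: plain functions on integers.
id_ = lambda x: x
compose = lambda f, g: (lambda x: f(g(x)))

f = lambda x: x + 1
g = lambda x: x * 2
h = lambda x: x - 3

# Associativity: f ∘ (g ∘ h) == (f ∘ g) ∘ h, checked pointwise.
left = compose(f, compose(g, h))
right = compose(compose(f, g), h)

# Identity: f ∘ id == f == id ∘ f, checked pointwise.
checks = all(
    left(x) == right(x)
    and compose(f, id_)(x) == f(x) == compose(id_, f)(x)
    for x in range(10)
)
```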
@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## The Python category

+    The main category we'll be concerning ourselves with in this part is the Python category, or we can give it a shorter name: `Py`. `Py` treats Python types as objects and Python functions as morphisms. A function `def f(a: A) -> B` for types `A` and `B` is a morphism in `Py`.

+    Remember that we defined the `id` and `compose` functions above as:

+    ```Python
+    def id(x: A) -> A:
+        return x

+    def compose(f: Callable[[B], C], g: Callable[[A], B]) -> Callable[[A], C]:
+        return lambda x: f(g(x))
+    ```

+    We can check the second law easily.

+    For the first law, we have:

+    ```python
+    # compose(f, g) = lambda x: f(g(x))
+    f ∘ (g ∘ h)
+    = compose(f, compose(g, h))
+    = lambda x: f(compose(g, h)(x))
+    = lambda x: f((lambda y: g(h(y)))(x))
+    = lambda x: f(g(h(x)))

+    (f ∘ g) ∘ h
+    = compose(compose(f, g), h)
+    = lambda x: compose(f, g)(h(x))
+    = lambda x: (lambda y: f(g(y)))(h(x))
+    = lambda x: f(g(h(x)))
+    ```

+    For the third law, we have:

+    ```python
+    g ∘ id_A
+    = compose(g: Callable[[A], B], id: Callable[[A], A])  # -> Callable[[A], B]
+    = lambda x: g(id(x))
+    = lambda x: g(x)  # id(x) = x
+    = g
+    ```

+    A similar proof applies to $id_B ∘ g = g$.

+    Thus `Py` is a valid category.
+    """)
    return
@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    # Functors, again

+    A functor is essentially a transformation between categories, so given categories $C$ and $D$, a functor $F : C → D$:

+    - Maps any object $A$ in $C$ to $F(A)$ in $D$.
+    - Maps morphisms $f : A → B$ in $C$ to $F(f) : F(A) → F(B)$ in $D$.

+    /// admonition |

+    Endofunctors are functors from a category to itself.

+    ///
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## Functors on the category of Python

+    Remember that a functor has two parts: it maps objects in one category to objects in another and morphisms in the first category to morphisms in the second.

+    Functors in Python are from `Py` to `Func`, where `Func` is the subcategory of `Py` defined on just that functor's types. E.g. the RoseTree functor goes from `Py` to `RoseTree`, where `RoseTree` is the category containing only RoseTree types, that is, `RoseTree[T]` for any type `T`. The morphisms in `RoseTree` are functions defined on RoseTree types, that is, functions `Callable[[RoseTree[T]], RoseTree[U]]` for types `T`, `U`.

+    Recall the definition of `Functor`:

+    ```Python
+    @dataclass
+    class Functor[A](ABC)
+    ```

+    And RoseTree:

+    ```Python
+    @dataclass
+    class RoseTree[A](Functor)
+    ```

+    **Here's the key part:** the _type constructor_ `RoseTree` takes any type `T` to a new type, `RoseTree[T]`. Also, `fmap` restricted to `RoseTree` types takes a function `Callable[[A], B]` to a function `Callable[[RoseTree[A]], RoseTree[B]]`.

+    But that's it. We've defined two parts: something that takes objects in `Py` to objects in another category (that of `RoseTree` types and functions defined on `RoseTree` types), and something that takes morphisms in `Py` to morphisms in this category. So `RoseTree` is a functor.

+    To sum up:

+    - We work in the category **Py** and its subcategories.
+    - **Objects** are types (e.g., `int`, `str`, `list`).
+    - **Morphisms** are functions (`Callable[[A], B]`).
+    - **Things that take a type and return another type** are type constructors (`RoseTree[T]`).
+    - **Things that take a function and return another function** are higher-order functions (`Callable[[Callable[[A], B]], Callable[[C], D]]`).
+    - **Abstract base classes (ABC)** and duck typing provide a way to express polymorphism, capturing the idea that in category theory, structures are often defined over multiple objects at once.
+    """)
    return


@app.cell(hide_code=True)
def _(mo):
+    mo.md("""
+    ## Functor laws, again

+    Once again there are a few axioms that functors have to obey.

+    1. Given an identity morphism $id_A$ on an object $A$, $F(id_A)$ must be the identity morphism on $F(A)$:

+        $$F({id}_{A})={id}_{F(A)}$$

+    2. Functors must distribute over morphism composition:

+        $$F(f\circ g)=F(f)\circ F(g)$$
+    """)
    return
@app.cell(hide_code=True)
|
| 895 |
def _(mo):
|
| 896 |
+
mo.md("""
|
| 897 |
+
Remember that we defined the `id` and `compose` as
|
| 898 |
+
```python
|
| 899 |
+
id = lambda x: x
|
| 900 |
+
compose = lambda f, g: lambda x: f(g(x))
|
| 901 |
+
```
|
| 902 |
+
|
| 903 |
+
We can define `fmap` as:
|
| 904 |
+
|
| 905 |
+
```python
|
| 906 |
+
fmap = lambda g, functor: functor.fmap(g, functor)
|
| 907 |
+
```
|
| 908 |
+
|
| 909 |
+
Let's prove that `fmap` is a functor.
|
| 910 |
+
|
| 911 |
+
First, let's define a `Category` for a specific `Functor`. We choose to define the `Category` for the `Wrapper` as `WrapperCategory` here for simplicity, but remember that `Wrapper` can be any `Functor`(i.e. `List`, `RoseTree`, `Maybe` and more):
|
| 912 |
+
|
| 913 |
+
We define `WrapperCategory` as:
|
| 914 |
+
|
| 915 |
+
```python
|
| 916 |
+
@dataclass
|
| 917 |
+
class WrapperCategory:
|
| 918 |
+
@staticmethod
|
| 919 |
+
def id(wrapper: Wrapper[A]) -> Wrapper[A]:
|
| 920 |
+
return Wrapper(wrapper.value)
|
| 921 |
+
|
| 922 |
+
@staticmethod
|
| 923 |
+
def compose(
|
| 924 |
+
f: Callable[[Wrapper[B]], Wrapper[C]],
|
| 925 |
+
g: Callable[[Wrapper[A]], Wrapper[B]],
|
| 926 |
+
wrapper: Wrapper[A]
|
| 927 |
+
) -> Callable[[Wrapper[A]], Wrapper[C]]:
|
| 928 |
+
return f(g(Wrapper(wrapper.value)))
|
| 929 |
+
```
|
| 930 |
+
|
| 931 |
+
And `Wrapper` is:
|
| 932 |
+
|
| 933 |
+
```Python
|
| 934 |
+
@dataclass
|
| 935 |
+
class Wrapper[A](Functor):
|
| 936 |
+
value: A
|
| 937 |
+
|
| 938 |
+
@classmethod
|
| 939 |
+
def fmap(cls, g: Callable[[A], B], fa: "Wrapper[A]") -> "Wrapper[B]":
|
| 940 |
+
return Wrapper(g(fa.value))
|
| 941 |
+
```
|
| 942 |
+
""")
|
|
|
|
|
|
|
| 943 |
return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    We can prove that:

    ```python
    fmap(id, wrapper)
    = Wrapper.fmap(id, wrapper)
    = Wrapper(id(wrapper.value))
    = Wrapper(wrapper.value)
    = WrapperCategory.id(wrapper)
    ```

    and:

    ```python
    fmap(compose(f, g), wrapper)
    = Wrapper.fmap(compose(f, g), wrapper)
    = Wrapper(compose(f, g)(wrapper.value))
    = Wrapper(f(g(wrapper.value)))

    WrapperCategory.compose(fmap(f, wrapper), fmap(g, wrapper), wrapper)
    = fmap(f, wrapper)(fmap(g, wrapper)(wrapper))
    = fmap(f, wrapper)(Wrapper.fmap(g, wrapper))
    = fmap(f, wrapper)(Wrapper(g(wrapper.value)))
    = Wrapper.fmap(f, Wrapper(g(wrapper.value)))
    = Wrapper(f(Wrapper(g(wrapper.value)).value))
    = Wrapper(f(g(wrapper.value)))  # Wrapper(g(wrapper.value)).value == g(wrapper.value)
    ```

    So our `Wrapper` is a valid `Functor`.

    > Try validating functor laws for `Wrapper` below.
    """)
    return
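The laws proved above can also be checked mechanically. The sketch below is a standalone version of the `Wrapper` functor (it drops the `Functor` ABC and the `[A]` type parameter for brevity, and renames `id` to `identity` to avoid shadowing the builtin); only the names from the notebook are reused, everything else is this sketch's own scaffolding.

```python
from dataclasses import dataclass

identity = lambda x: x  # avoids shadowing the builtin `id`
compose = lambda f, g: lambda x: f(g(x))


@dataclass
class Wrapper:
    value: object

    @classmethod
    def fmap(cls, g, fa):
        # apply g inside the wrapper, preserving the structure
        return Wrapper(g(fa.value))


fmap = lambda g, functor: type(functor).fmap(g, functor)

w = Wrapper(10)
f = lambda x: x + 1
g = lambda x: x * 2

# Law 1: fmap(id) == id
assert fmap(identity, w) == w
# Law 2: fmap(f . g) == fmap(f) . fmap(g)
assert fmap(compose(f, g), w) == fmap(f, fmap(g, w))
print("functor laws hold for Wrapper")
```

Dataclass equality makes the law checks simple structural comparisons.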

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ## Length as a Functor

    Remember that a functor is a transformation between two categories. It is not limited to functors from `Py` to `Func`; it also includes transformations between other mathematical structures.

    Let's prove that **`length`** can be viewed as a functor. Specifically, we will demonstrate that `length` is a functor from the **category of list concatenation** to the **category of integer addition**.

    ### Category of List Concatenation

    First, let's define the category of list concatenation:
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    - **Identity**: The identity element is an empty list (`ListConcatenation([])`).
    - **Composition**: The composition of two lists is their concatenation (`this.value + other.value`).
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ### Category of Integer Addition

    Now, let's define the category of integer addition:
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    - **Identity**: The identity element is `IntAddition(0)` (the additive identity).
    - **Composition**: The composition of two integers is their sum (`this.value + other.value`).
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ### Defining the Length Functor

    We now define the `length` function as a functor, mapping from the category of list concatenation to the category of integer addition:

    ```python
    length = lambda l: IntAddition(len(l.value))
    ```
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    This function takes an instance of `ListConcatenation`, computes its length, and returns an `IntAddition` instance with the computed length.
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ### Verifying Functor Laws

    Now, let's verify that `length` satisfies the two functor laws.

    **Identity Law**

    The identity law states that applying the functor to the identity element of one category should give the identity element of the other category.
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    This ensures that the length of an empty list (the identity in the `ListConcatenation` category) is `0` (the identity in the `IntAddition` category).
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    **Composition Law**

    The composition law states that the functor should preserve composition: applying the functor to a composed element should be the same as composing the functor applied to the individual elements.
    """)
    return

        length(ListConcatenation.compose(lista, listb))
        == IntAddition.compose(length(lista), length(listb))
    )
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    This ensures that the length of the concatenation of two lists is the same as the sum of the lengths of the individual lists.
    """)
    return
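The two law checks above can be run end to end. The `ListConcatenation` and `IntAddition` class bodies are collapsed in this diff, so the dataclasses below are a minimal sketch reconstructed from the identity/composition bullets, not the repository's exact definitions.

```python
from dataclasses import dataclass


@dataclass
class ListConcatenation:
    value: list

    @staticmethod
    def id():
        return ListConcatenation([])  # identity: the empty list

    @staticmethod
    def compose(this, other):
        return ListConcatenation(this.value + other.value)  # concatenation


@dataclass
class IntAddition:
    value: int

    @staticmethod
    def id():
        return IntAddition(0)  # identity: the additive identity

    @staticmethod
    def compose(this, other):
        return IntAddition(this.value + other.value)  # sum


length = lambda l: IntAddition(len(l.value))

# Identity law: length of the empty list is the additive identity
assert length(ListConcatenation.id()) == IntAddition.id()

# Composition law: length of a concatenation is the sum of lengths
lista = ListConcatenation([1, 2])
listb = ListConcatenation([3, 4, 5])
assert length(ListConcatenation.compose(lista, listb)) == IntAddition.compose(
    length(lista), length(listb)
)
print("length is a functor")
```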

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    # Bifunctor

    A `Bifunctor` is a type constructor that takes two type arguments and **is a functor in both arguments.**

    For example, think about `Either`'s usual `Functor` instance. It only allows you to fmap over the second type parameter: `right` values get mapped, while `left` values stay as they are.

    However, its `Bifunctor` instance allows you to map over both halves of the sum.

    There are three core methods for `Bifunctor`:

    - `bimap` allows mapping over both type arguments at once.
    - `first` and `second` are also provided for mapping over only one type argument at a time.

    The abstraction of `Bifunctor` is:
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    /// admonition | minimal implementation requirement
    - `bimap`, or both `first` and `second`
    ///
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Instances of Bifunctor
    """)
    return

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### The Either Bifunctor

    For the `Either` `Bifunctor`, we also allow it to map a function over the `left` value.

    Notice that the `Either` `Bifunctor` still only contains either the `left` value or the `right` value.
    """)
    return

@app.cell
def _(A, B, Bifunctor, C, Callable, D, dataclass):
    @dataclass
    class BiEither[A, C](Bifunctor):
        left: A = None

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### The 2D Tuple Bifunctor

    For 2D tuples, we simply expect `bimap` to map the two functions over the two elements of the tuple respectively.
    """)
    return

@app.cell
def _(A, B, Bifunctor, C, Callable, D, dataclass):
    @dataclass
    class BiTuple[A, C](Bifunctor):
        value: tuple[A, C]
@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Bifunctor laws

    The only law we need to check is

    ```python
    bimap(id, id, fa) == id(fa)
    ```

    and the other laws then follow automatically.
    """)
    return
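The identity law can be checked directly. The class body of `BiTuple` is collapsed in this diff, so the version below is a hypothetical standalone sketch (no `Bifunctor` ABC, no type parameters): `bimap` applies one function to each slot of the pair.

```python
from dataclasses import dataclass


@dataclass
class BiTuple:
    value: tuple

    @classmethod
    def bimap(cls, f, g, fa):
        # map f over the first element and g over the second
        a, c = fa.value
        return BiTuple((f(a), g(c)))


identity = lambda x: x

pair = BiTuple((1, "x"))
# Bifunctor identity law: bimap(id, id, fa) == id(fa)
assert BiTuple.bimap(identity, identity, pair) == identity(pair)
# mapping both halves at once
assert BiTuple.bimap(lambda a: a + 1, str.upper, pair) == BiTuple((2, "X"))
print("bifunctor identity law holds for BiTuple")
```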

@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    # Further reading

    - [The Trivial Monad](http://blog.sigfpe.com/2007/04/trivial-monad.html)
    - [Haskellforall: The Category Design Pattern](https://www.haskellforall.com/2012/08/the-category-design-pattern.html)
    - [Haskellforall: The Functor Design Pattern](https://www.haskellforall.com/2012/09/the-functor-design-pattern.html)

    /// attention | ATTENTION
    The functor design pattern doesn't work at all if you aren't using categories in the first place. This is why you should structure your tools using the compositional category design pattern, so that you can take advantage of functors to easily mix your tools together.
    ///

    - [Haskellwiki: Functor](https://wiki.haskell.org/index.php?title=Functor)
    - [Haskellwiki: Typeclassopedia#Functor](https://wiki.haskell.org/index.php?title=Typeclassopedia#Functor)
    - [Haskellwiki: Typeclassopedia#Category](https://wiki.haskell.org/index.php?title=Typeclassopedia#Category)
    - [Haskellwiki: Category Theory](https://en.wikibooks.org/wiki/Haskell/Category_theory)
    """)
    return

functional_programming/06_applicatives.py
CHANGED

@@ -7,266 +7,261 @@

import marimo

__generated_with = "0.
app = marimo.App(app_title="Applicative programming with effects")


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        # Applicative programming with effects
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        # The intuition: [Multifunctor](https://arxiv.org/pdf/2401.14286)
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Defining Multifunctor

        /// admonition
        we use prefix `f` rather than `ap` to indicate *Applicative Functor*
        ///

        As a result, we may want to define a single `Multifunctor` such that:

        ```python
        # lift a regular 3-argument function `g`
        g: Callable[[A, B, C], D]
        # into the context of functors
        fg: Callable[[Functor[A], Functor[B], Functor[C]], Functor[D]]
        ```

        ```python
        fd
        ```
        """
    )
def _(mo) -> None:
    mo.md(
        r"""
        ## Pure, apply and lift

        1. `pure`: embeds an object (value or function) into the applicative functor

        ```python
        a: A
        # then we can have `fa` as
        fa: Applicative[A] = pure(a)
        # or if we have a regular function `g`
        g: Callable[[A], B]
        # then we can have `fg` as
        fg: Applicative[Callable[[A], B]] = pure(g)
        ```

        ```python
        # F (a -> b) -> F a -> F b
        apply: Callable[[Applicative[Callable[[A], B]], Applicative[A]], Applicative[B]]
        # and we can have
        fd = apply(apply(apply(fg, fa), fb), fc)
        ```

        ```python
        ```
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        /// admonition | How to use *Applicative* in the manner of *Multifunctor*
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Abstracting applicatives

        for arg in args:
            curr = cls.apply(curr, arg)
        return curr
@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Applicative instances
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ### The Wrapper Applicative


@app.cell
def _(Applicative, dataclass):
    @dataclass
    class Wrapper[A](Applicative):
        value: A

@@ -284,42 +279,45 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell
def _(Wrapper):
    Wrapper.lift(
        lambda a: lambda b: lambda c: a + b * c,
        Wrapper(1),
        Wrapper(2),
        Wrapper(3),
    )
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ### The List Applicative
@app.cell
def _(Applicative, dataclass, product):
    @dataclass
    class List[A](Applicative):
        value: list[A]

@@ -335,47 +333,51 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell
def _(List):
    List.apply(
        List([lambda a: a + 1, lambda a: a * 2]),
        List([1, 2]),
    )


@app.cell
def _(List):
    List.lift(lambda a: lambda b: a + b, List([1, 2]), List([3, 4, 5]))
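The `List.apply` and `List.lift` calls above behave like a cartesian product: every function is paired with every value. The class body is collapsed in this diff, so the version below is a hypothetical standalone sketch of that behavior, not the repository's exact implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class List:
    value: list

    @classmethod
    def pure(cls, a):
        return cls([a])  # a single-element list

    @classmethod
    def apply(cls, fg, fa):
        # pair every function with every value (cartesian product)
        return cls([g(a) for g, a in product(fg.value, fa.value)])

    @classmethod
    def lift(cls, f, fa, *args):
        # lift a curried n-ary function into the List context
        curr = cls.apply(cls.pure(f), fa)
        for arg in args:
            curr = cls.apply(curr, arg)
        return curr


assert List.apply(
    List([lambda a: a + 1, lambda a: a * 2]), List([1, 2])
).value == [2, 3, 2, 4]
assert List.lift(
    lambda a: lambda b: a + b, List([1, 2]), List([3, 4, 5])
).value == [4, 5, 6, 5, 6, 7]
print("List applicative behaves like a cartesian product")
```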
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ### The Maybe Applicative
@app.cell
def _(Applicative, dataclass):
    @dataclass
    class Maybe[A](Applicative):
        value: None | A

@@ -399,51 +401,55 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell
def _(Maybe):
    Maybe.lift(
        lambda a: lambda b: a + b,
        Maybe(1),
        Maybe(2),
    )


@app.cell
def _(Maybe):
    Maybe.lift(
        lambda a: lambda b: None,
        Maybe(1),
        Maybe(2),
    )
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ### The Either Applicative
@app.cell
def _(Applicative, B, Callable, Union, dataclass):
    @dataclass
    class Either[A](Applicative):
        left: A = None

@@ -486,171 +492,180 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell
def _(Either):
    Either.apply(Either(left=TypeError("Parse Error")), Either(right=2))


@app.cell
def _(Either):
    Either.apply(
        Either(right=lambda x: x + 1), Either(left=TypeError("Parse Error"))
    )


@app.cell
def _(Either):
    Either.apply(Either(right=lambda x: x + 1), Either(right=1))
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Collect the list of responses with sequenceL


@app.cell
def _(Wrapper):
    Wrapper.sequenceL([Wrapper(1), Wrapper(2), Wrapper(3)])
@app.cell(hide_code=True)
def _(mo):
    mo.md(
    )


@app.cell
def _(Maybe):
    Maybe.sequenceL([Maybe(1), Maybe(2), Maybe(None), Maybe(3)])


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell
def _(List):
    List.sequenceL([List([1, 2]), List([3]), List([5, 6, 7])])
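The `sequenceL` calls above collect a list of applicative values into one applicative holding a list, with the failing `Maybe(None)` collapsing the whole result. The `Maybe` class body is collapsed in this diff, so the sketch below is a hypothetical standalone reconstruction of that behavior.

```python
from dataclasses import dataclass


@dataclass
class Maybe:
    value: object = None

    @classmethod
    def pure(cls, a):
        return cls(a)

    @classmethod
    def apply(cls, fg, fa):
        # any None short-circuits the whole computation
        if fg.value is None or fa.value is None:
            return cls(None)
        return cls(fg.value(fa.value))

    @classmethod
    def sequenceL(cls, fas):
        # fold the list, appending each value inside the applicative
        result = cls.pure([])
        for fa in fas:
            result = cls.apply(
                cls.apply(cls.pure(lambda xs: lambda x: xs + [x]), result), fa
            )
        return result


assert Maybe.sequenceL([Maybe(1), Maybe(2), Maybe(3)]) == Maybe([1, 2, 3])
assert Maybe.sequenceL([Maybe(1), Maybe(None), Maybe(3)]) == Maybe(None)
print("sequenceL collects results and propagates failure")
```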
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ```python
        @classmethod
        def check_identity(cls, fa: "Applicative[A]"):
            if cls.lift(id, fa) != fa:
                raise ValueError("Instance violates identity law")
            return True

        @classmethod
        def check_homomorphism(cls, a: A, f: Callable[[A], B]):
            if cls.lift(f, cls.pure(a)) != cls.pure(f(a)):
                raise ValueError("Instance violates homomorphism law")
            return True

        @classmethod
        def check_interchange(cls, a: A, fg: "Applicative[Callable[[A], B]]"):
            if cls.apply(fg, cls.pure(a)) != cls.lift(lambda g: g(a), fg):
                raise ValueError("Instance violates interchange law")
            return True

        @classmethod
        def check_composition(
            cls,
            fg: "Applicative[Callable[[B], C]]",
            fh: "Applicative[Callable[[A], B]]",
            fa: "Applicative[A]",
        ):
            if cls.apply(fg, cls.apply(fh, fa)) != cls.lift(compose, fg, fh, fa):
                raise ValueError("Instance violates composition law")
            return True
        ```
| 656 |
@app.cell
|
|
@@ -662,7 +677,7 @@ def _():
|
|
| 662 |
|
| 663 |
|
| 664 |
@app.cell
|
| 665 |
-
def _(List, Wrapper)
|
| 666 |
print("Checking Wrapper")
|
| 667 |
print(Wrapper.check_identity(Wrapper.pure(1)))
|
| 668 |
print(Wrapper.check_homomorphism(1, lambda x: x + 1))
|
|
@@ -684,79 +699,77 @@ def _(List, Wrapper) -> None:
|
|
| 684 |
List.pure(lambda x: x * 2), List.pure(lambda x: x + 0.1), List.pure(1)
|
| 685 |
)
|
| 686 |
)
|
|
|
|
| 687 |
|
| 688 |
|
| 689 |
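The four `check_*` law checks quoted above can be exercised against a concrete instance. Below is a hypothetical standalone `Wrapper` applicative (no ABC hierarchy, curried `compose` to match `lift`'s curried style); it is a sketch, not the repository's exact class.

```python
from dataclasses import dataclass

compose = lambda f: lambda g: lambda x: f(g(x))  # curried, matching lift
identity = lambda x: x


@dataclass
class Wrapper:
    value: object

    @classmethod
    def pure(cls, a):
        return cls(a)

    @classmethod
    def apply(cls, fg, fa):
        return cls(fg.value(fa.value))

    @classmethod
    def lift(cls, f, fa, *args):
        curr = cls.apply(cls.pure(f), fa)
        for arg in args:
            curr = cls.apply(curr, arg)
        return curr


fa = Wrapper.pure(1)
f = lambda x: x + 1
g = lambda x: x * 2
fg, fh = Wrapper.pure(f), Wrapper.pure(g)

# identity: lift(id, fa) == fa
assert Wrapper.lift(identity, fa) == fa
# homomorphism: lift(f, pure(a)) == pure(f(a))
assert Wrapper.lift(f, Wrapper.pure(1)) == Wrapper.pure(f(1))
# interchange: apply(fg, pure(a)) == lift(lambda h: h(a), fg)
assert Wrapper.apply(fg, Wrapper.pure(1)) == Wrapper.lift(lambda h: h(1), fg)
# composition: apply(fg, apply(fh, fa)) == lift(compose, fg, fh, fa)
assert Wrapper.apply(fg, Wrapper.apply(fh, fa)) == Wrapper.lift(compose, fg, fh, fa)
print("applicative laws hold for Wrapper")
```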
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Utility functions

        ```python
        @classmethod
        def keep(
            cls, fa: "Applicative[A]", fb: "Applicative[B]"
        ) -> "Applicative[A]":
            '''
            Sequences the effects of two Applicative computations,
            but discards the result of the second.
            '''
            return cls.lift(const, fa, fb)

        @classmethod
        def revapp(
            cls, fa: "Applicative[A]", fg: "Applicative[Callable[[A], B]]"
        ) -> "Applicative[B]":
            '''
            The first computation produces values which are provided
            as input to the function(s) produced by the second computation.
            '''
            return cls.lift(lambda a: lambda f: f(a), fa, fg)
        ```
        """


@app.cell(hide_code=True)
def _(mo):
    mo.md(
    )
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        # Formal implementation of Applicative


@app.cell

@@ -887,40 +900,38 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(
        # Effectful programming


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## The IO Applicative
@app.cell

@@ -943,8 +954,11 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""

@@ -953,29 +967,31 @@

    return IO.sequenceL([
        IO.pure(input(f"input the {i}th str")) for i in range(1, n + 1)
    ])
    return


@app.cell
def _():
    # get_chars()()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Lax Monoidal Functor
| 981 |
@app.cell
|
|
@@ -997,97 +1013,92 @@ def _(ABC, Functor, abstractmethod, dataclass):
|
|
| 997 |
|
| 998 |
|
| 999 |
@app.cell(hide_code=True)
|
| 1000 |
-
def _(mo)
|
| 1001 |
-
mo.md(
|
| 1002 |
-
|
| 1003 |
-
Intuitively, this states that a *monoidal functor* is one which has some sort of "default shape" and which supports some sort of "combining" operation.
|
| 1004 |
|
| 1005 |
-
|
| 1006 |
-
|
| 1007 |
|
| 1008 |
-
|
| 1009 |
-
|
| 1010 |
-
|
| 1011 |
|
| 1012 |
|
| 1013 |
@app.cell(hide_code=True)
|
| 1014 |
-
def _(mo)
|
| 1015 |
-
mo.md(
|
| 1016 |
-
|
| 1017 |
-
Furthermore, to deserve the name "monoidal", instances of Monoidal ought to satisfy the following laws, which seem much more straightforward than the traditional Applicative laws:
|
| 1018 |
|
| 1019 |
-
|
| 1020 |
|
| 1021 |
-
|
| 1022 |
|
| 1023 |
-
|
| 1024 |
|
| 1025 |
-
|
| 1026 |
|
| 1027 |
-
|
| 1028 |
|
| 1029 |
-
|
| 1030 |
-
|
| 1031 |
-
|
| 1032 |
|
| 1033 |
|
| 1034 |
@app.cell(hide_code=True)
|
| 1035 |
-
def _(mo)
|
| 1036 |
-
mo.md(
|
| 1037 |
-
|
| 1038 |
-
/// admonition | ≅ indicates isomorphism
|
| 1039 |
|
| 1040 |
-
|
| 1041 |
|
| 1042 |
-
|
| 1043 |
|
| 1044 |
-
|
| 1045 |
-
|
| 1046 |
-
|
| 1047 |
|
| 1048 |
|
| 1049 |
@app.cell(hide_code=True)
|
| 1050 |
-
def _(mo)
|
| 1051 |
-
mo.md(
|
| 1052 |
-
|
| 1053 |
-
|
| 1054 |
-
|
| 1055 |
-
|
| 1056 |
-
|
| 1057 |
-
|
| 1058 |
-
|
| 1059 |
-
|
| 1060 |
-
|
| 1061 |
-
|
| 1062 |
-
|
| 1063 |
-
|
| 1064 |
-
|
| 1065 |
-
|
| 1066 |
-
|
| 1067 |
-
)
|
| 1068 |
|
| 1069 |
|
| 1070 |
@app.cell(hide_code=True)
|
| 1071 |
-
def _(mo)
|
| 1072 |
-
mo.md(
|
| 1073 |
-
|
| 1074 |
-
## Instance: ListMonoidal
|
| 1075 |
|
| 1076 |
-
|
| 1077 |
|
| 1078 |
-
|
| 1079 |
-
|
| 1080 |
-
|
| 1081 |
|
| 1082 |
-
|
| 1083 |
|
| 1084 |
-
|
| 1085 |
-
|
| 1086 |
-
|
| 1087 |
|
| 1088 |
|
| 1089 |
@app.cell
def _(B, Callable, Monoidal, dataclass, product):
    @dataclass
    class ListMonoidal[A](Monoidal):
        items: list[A]

@@ -1111,8 +1122,11 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell

@@ -1124,13 +1138,17 @@

@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""


@app.cell
def _(List, xs, ys):
    List.lift(lambda fa: lambda fb: (fa, fb), List(xs.items), List(ys.items))
@app.cell(hide_code=True)

@@ -1179,83 +1197,81 @@

    A = TypeVar("A")
    B = TypeVar("B")
    C = TypeVar("C")
    return A, B


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        # From Applicative to Alternative

        ## Abstracting Alternative

        In our studies so far, we saw that both `Maybe` and `List` can represent computations with a varying number of results.

        We use `Maybe` to indicate a computation can fail somehow, and `List` for computations that can have many possible results. In both of these cases, one useful operation is amalgamating all possible results from multiple computations into a single computation.

        - **Choice** (combination of results)
        - **Repetition** (multiple results)

        ```python
        @dataclass
        class Alternative[A](Applicative, ABC):
            @classmethod
            @abstractmethod
            def empty(cls) -> "Alternative[A]":
                '''Identity element for alternative computations'''

            @classmethod
            @abstractmethod
            def alt(
                cls, fa: "Alternative[A]", fb: "Alternative[A]"
            ) -> "Alternative[A]":
                '''Binary operation combining computations'''
        ```

        ```python
        # Left identity
        alt(empty, fa) == fa
        # Right identity
        alt(fa, empty) == fa
        # Associativity
        alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc)
        ```
        """
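The `empty`/`alt` laws above can be checked against a concrete instance. The sketch below is a hypothetical standalone Maybe-style `Alternative` (no `Applicative`/ABC hierarchy) where `alt` picks the first computation that succeeded.

```python
from dataclasses import dataclass


@dataclass
class AltMaybe:
    value: object = None

    @classmethod
    def empty(cls):
        return cls(None)  # the failing computation is the identity

    @classmethod
    def alt(cls, fa, fb):
        # choose the first computation that succeeded
        return fa if fa.value is not None else fb


fa, fb, fc = AltMaybe(1), AltMaybe(2), AltMaybe(None)
# Left identity
assert AltMaybe.alt(AltMaybe.empty(), fa) == fa
# Right identity
assert AltMaybe.alt(fa, AltMaybe.empty()) == fa
# Associativity
assert AltMaybe.alt(fa, AltMaybe.alt(fb, fc)) == AltMaybe.alt(
    AltMaybe.alt(fa, fb), fc
)
print("alternative laws hold for AltMaybe")
```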
@app.cell(hide_code=True)
def _(mo):
    mo.md(
        ## Instances of Alternative


@app.cell

@@ -1278,31 +1294,32 @@

@app.cell
def _(AltMaybe):
    print(AltMaybe.empty())
    print(AltMaybe.alt(AltMaybe(None), AltMaybe(1)))
    print(AltMaybe.alt(AltMaybe(None), AltMaybe(None)))
    print(AltMaybe.alt(AltMaybe(1), AltMaybe(None)))
    print(AltMaybe.alt(AltMaybe(1), AltMaybe(2)))


@app.cell
def _(AltMaybe):
    print(AltMaybe.check_left_identity(AltMaybe(1)))
    print(AltMaybe.check_right_identity(AltMaybe(1)))
    print(AltMaybe.check_associativity(AltMaybe(1), AltMaybe(2), AltMaybe(None)))


@app.cell(hide_code=True)
def _(mo):
    mo.md(
    )
@app.cell
|
|
@@ -1320,23 +1337,26 @@ def _(Alternative, List, dataclass):


 @app.cell
-def _(AltList):
     print(AltList.empty())
     print(AltList.alt(AltList([1, 2, 3]), AltList([4, 5])))


 @app.cell
-def _(AltList):
     AltList([1])


 @app.cell
-def _(AltList):
     AltList([1])


 @app.cell
-def _(AltList):
     print(AltList.check_left_identity(AltList([1, 2, 3])))
     print(AltList.check_right_identity(AltList([1, 2, 3])))
     print(

@@ -1344,77 +1364,88 @@ def _(AltList) -> None:

         AltList([1, 2]), AltList([3, 4, 5]), AltList([6])
     )
     )


 @app.cell(hide_code=True)
-def _(mo):
-    mo.md(
-    ## some and many
-


 @app.cell(hide_code=True)
-def _(mo):
-    mo.md(r"""


 @app.cell
-def _(AltMaybe):
     print(AltMaybe.some(AltMaybe.empty()))
     print(AltMaybe.many(AltMaybe.empty()))


 @app.cell(hide_code=True)
-def _(mo):
-    mo.md(r"""


 @app.cell
-def _(AltList):
     print(AltList.some(AltList.empty()))
     print(AltList.many(AltList.empty()))


 @app.cell(hide_code=True)
-def _(mo):
-    mo.md(r"""


 @app.cell
@@ -1472,42 +1503,40 @@ def _(ABC, Applicative, abstractmethod, dataclass):


 @app.cell(hide_code=True)
-def _(mo):
-    mo.md(
-    /// admonition
-


 @app.cell(hide_code=True)
-def _(mo):
-    mo.md(
-
-    )

 if __name__ == "__main__":

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(app_title="Applicative programming with effects")


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    # Applicative programming with effects

+    `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style.

+    Applicative is a functor with application, providing operations to

+    + embed pure expressions (`pure`), and
+    + sequence computations and combine their results (`apply`).

+    In this notebook, you will learn:

+    1. How to view `Applicative` as a multifunctor, intuitively.
+    2. How to use `lift` to simplify chained applications.
+    3. How to bring *effects* into the functionally pure world.
+    4. How to view `Applicative` as a lax monoidal functor.
+    5. How to use `Alternative` to amalgamate multiple computations into a single computation.

+    /// details | Notebook metadata
+    type: info

+    version: 0.1.3 | last modified: 2025-04-16 | author: [métaboulie](https://github.com/metaboulie)<br/>
+    reviewer: [Haleshot](https://github.com/Haleshot)

+    ///
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    # The intuition: [Multifunctor](https://arxiv.org/pdf/2401.14286)

+    ## Limitations of functor

+    Recall that functors abstract the idea of mapping a function over each element of a structure.

+    Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types:

+    ```haskell
+    fmap0 :: a -> f a
+    fmap1 :: (a -> b) -> f a -> f b
+    fmap2 :: (a -> b -> c) -> f a -> f b -> f c
+    fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d
+    ```

+    Otherwise, we would have to declare a special version of the functor class for each case.
+    """)
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Defining Multifunctor

+    /// admonition
+    we use the prefix `f` rather than `ap` to indicate *Applicative Functor*
+    ///

+    As a result, we may want to define a single `Multifunctor` that can:

+    1. Lift a regular n-argument function into the context of functors

+        ```python
+        # lift a regular 3-argument function `g`
+        g: Callable[[A, B, C], D]
+        # into the context of functors
+        fg: Callable[[Functor[A], Functor[B], Functor[C]], Functor[D]]
+        ```

+    2. Apply it to n functor-wrapped values

+        ```python
+        # fa: Functor[A], fb: Functor[B], fc: Functor[C]
+        fg(fa, fb, fc)
+        ```

+    3. Get a single functor-wrapped result

        ```python
+        fd: Functor[D]
        ```

+    We will define a function `lift` such that

+    ```python
+    fd = lift(g, fa, fb, fc)
+    ```
+    """)
+    return

+@app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Pure, apply and lift

+    Traditionally, applicative functors are presented through two core operations:

+    1. `pure`: embeds an object (value or function) into the applicative functor

+        ```python
+        # a -> F a
+        pure: Callable[[A], Applicative[A]]
+        # for example, if `a` is
+        a: A
+        # then we can have `fa` as
+        fa: Applicative[A] = pure(a)
+        # or if we have a regular function `g`
+        g: Callable[[A], B]
+        # then we can have `fg` as
+        fg: Applicative[Callable[[A], B]] = pure(g)
+        ```

+    2. `apply`: applies a function inside an applicative functor to a value inside an applicative functor

        ```python
+        # F (a -> b) -> F a -> F b
+        apply: Callable[[Applicative[Callable[[A], B]], Applicative[A]], Applicative[B]]
+        # and we can have
+        fd = apply(apply(apply(fg, fa), fb), fc)
        ```

+    As a result,

+    ```python
+    lift(g, fa, fb, fc) = apply(apply(apply(pure(g), fa), fb), fc)
+    ```
+    """)
+    return

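The identity `lift(g, fa, fb, fc) = apply(apply(apply(pure(g), fa), fb), fc)` can be checked with a tiny self-contained sketch; `Box` below is a hypothetical stand-in for the notebook's `Applicative` subclasses, not code from this diff:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Box:
    """A hypothetical minimal applicative, just pure/apply/lift."""

    value: object

    @classmethod
    def pure(cls, a) -> "Box":
        # embed a plain value (or function) into the context
        return cls(a)

    @classmethod
    def apply(cls, fg: "Box", fa: "Box") -> "Box":
        # apply a wrapped function to a wrapped value
        return cls(fg.value(fa.value))

    @classmethod
    def lift(cls, g: Callable, *fas: "Box") -> "Box":
        # lift(g, fa, fb, fc) == apply(apply(apply(pure(g), fa), fb), fc)
        curr = cls.pure(g)
        for fa in fas:
            curr = cls.apply(curr, fa)
        return curr


g = lambda a: lambda b: lambda c: a + b * c
chained = Box.apply(Box.apply(Box.apply(Box.pure(g), Box(1)), Box(2)), Box(3))
lifted = Box.lift(g, Box(1), Box(2), Box(3))
assert chained == lifted == Box(7)  # 1 + 2 * 3
```

Note that `g` is curried by hand (`lambda a: lambda b: lambda c: ...`), since `apply` feeds in one argument at a time.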
 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    /// admonition | How to use *Applicative* in the manner of *Multifunctor*

+    1. Define `pure` and `apply` for an `Applicative` subclass

+        - They are much easier to define than `lift` itself.

+    2. Use the `lift` method

+        - It is much more convenient than chaining `pure` and `apply` by hand.

+    ///

+    /// attention | You can collapse chained applications of `apply` and `pure` as:

+    ```python
+    apply(pure(g), fa) -> lift(g, fa)
+    apply(apply(pure(g), fa), fb) -> lift(g, fa, fb)
+    apply(apply(apply(pure(g), fa), fb), fc) -> lift(g, fa, fb, fc)
+    ```

+    ///
+    """)
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Abstracting applicatives

+    We can now provide an initial abstract definition of applicatives:

+    ```python
+    @dataclass
+    class Applicative[A](Functor, ABC):
+        @classmethod
+        @abstractmethod
+        def pure(cls, a: A) -> "Applicative[A]":
+            raise NotImplementedError("Subclasses must implement pure")

+        @classmethod
+        @abstractmethod
+        def apply(
+            cls, fg: "Applicative[Callable[[A], B]]", fa: "Applicative[A]"
+        ) -> "Applicative[B]":
+            raise NotImplementedError("Subclasses must implement apply")

+        @classmethod
+        def lift(cls, f: Callable, *args: "Applicative") -> "Applicative":
+            curr = cls.pure(f)
+            if not args:
                return curr
+            for arg in args:
+                curr = cls.apply(curr, arg)
+            return curr
+    ```

+    /// attention | minimal implementation requirement

+    - `pure`
+    - `apply`
+    ///
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    # Instances, laws and utility functions
+    """)
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Applicative instances

+    When we are actually implementing an *Applicative* instance, we can keep in mind that `pure` and `apply` fundamentally:

+    - embed an object (value or function) into the computational context
+    - apply a function inside the computational context to a value inside the computational context
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ### The Wrapper Applicative

+    - `pure` should simply *wrap* an object, in the sense that:

+        ```haskell
+        Wrapper.pure(1) => Wrapper(value=1)
+        ```

+    - `apply` should apply a *wrapped* function to a *wrapped* value

+    The implementation is:
+    """)
+    return


 @app.cell
+def _(A, Applicative, dataclass):
     @dataclass
     class Wrapper[A](Applicative):
         value: A


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    > try with Wrapper below
+    """)
+    return


 @app.cell
+def _(Wrapper):
     Wrapper.lift(
         lambda a: lambda b: lambda c: a + b * c,
         Wrapper(1),
         Wrapper(2),
         Wrapper(3),
     )
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ### The List Applicative

+    - `pure` should wrap the object in a list, in the sense that:

+        ```haskell
+        List.pure(1) => List(value=[1])
+        ```

+    - `apply` should apply a list of functions to a list of values
+        - you can think of this as a cartesian product, concatenating the results of applying every function to every value

+    The implementation is:
+    """)
+    return


 @app.cell
+def _(A, Applicative, dataclass, product):
     @dataclass
     class List[A](Applicative):
         value: list[A]


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    > try with List below
+    """)
+    return


 @app.cell
+def _(List):
     List.apply(
         List([lambda a: a + 1, lambda a: a * 2]),
         List([1, 2]),
     )
+    return


 @app.cell
+def _(List):
     List.lift(lambda a: lambda b: a + b, List([1, 2]), List([3, 4, 5]))
+    return

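The cartesian-product behaviour of the List instance's `apply` can be sketched in a few lines; `ListAp` is a hypothetical stand-in, not the notebook's `List` class:

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class ListAp:
    """A hypothetical minimal List applicative for this sketch."""

    value: list

    @classmethod
    def pure(cls, a) -> "ListAp":
        return cls([a])

    @classmethod
    def apply(cls, fg: "ListAp", fa: "ListAp") -> "ListAp":
        # every function applied to every value (cartesian product)
        return cls([g(a) for g, a in product(fg.value, fa.value)])


fg = ListAp([lambda a: a + 1, lambda a: a * 2])
fa = ListAp([1, 2])
# 2 functions x 2 values -> 4 results
assert ListAp.apply(fg, fa) == ListAp([2, 3, 2, 4])
```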
 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ### The Maybe Applicative

+    - `pure` should wrap the object in a Maybe, in the sense that:

+        ```haskell
+        Maybe.pure(1) => "Just 1"
+        Maybe.pure(None) => "Nothing"
+        ```

+    - `apply` should apply a function that may be absent to a value that may be absent
+        - if the function is `None` or the value is `None`, simply return `None`
+        - else apply the function to the value and wrap the result in `Just`

+    The implementation is:
+    """)
+    return


 @app.cell
+def _(A, Applicative, dataclass):
     @dataclass
     class Maybe[A](Applicative):
         value: None | A


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    > try with Maybe below
+    """)
+    return


 @app.cell
+def _(Maybe):
     Maybe.lift(
         lambda a: lambda b: a + b,
         Maybe(1),
         Maybe(2),
     )
+    return


 @app.cell
+def _(Maybe):
     Maybe.lift(
         lambda a: lambda b: None,
         Maybe(1),
         Maybe(2),
     )
+    return

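The short-circuiting described above (`None` on either side yields `None`) can be sketched independently of the notebook's classes; `MaybeAp` here is a hypothetical stand-in:

```python
from dataclasses import dataclass


@dataclass
class MaybeAp:
    """A hypothetical minimal Maybe applicative for this sketch."""

    value: object  # None represents Nothing

    @classmethod
    def pure(cls, a) -> "MaybeAp":
        return cls(a)

    @classmethod
    def apply(cls, fg: "MaybeAp", fa: "MaybeAp") -> "MaybeAp":
        # Nothing on either side short-circuits the whole computation
        if fg.value is None or fa.value is None:
            return cls(None)
        return cls(fg.value(fa.value))


inc = MaybeAp(lambda x: x + 1)
assert MaybeAp.apply(inc, MaybeAp(1)) == MaybeAp(2)
assert MaybeAp.apply(inc, MaybeAp(None)) == MaybeAp(None)
assert MaybeAp.apply(MaybeAp(None), MaybeAp(1)) == MaybeAp(None)
```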
 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ### The Either Applicative

+    - `pure` should wrap the object in `Right`, in the sense that:

+        ```haskell
+        Either.pure(1) => Right(1)
+        ```

+    - `apply` should apply a function that is either a `Left` or a `Right` to a value that is either a `Left` or a `Right`
+        - if the function is a `Left`, simply return that `Left`
+        - else `fmap` the function in the `Right` over the value

+    The implementation is:
+    """)
+    return


 @app.cell
+def _(A, Applicative, B, Callable, Union, dataclass):
     @dataclass
     class Either[A](Applicative):
         left: A = None


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    > try with `Either` below
+    """)
+    return


 @app.cell
+def _(Either):
     Either.apply(Either(left=TypeError("Parse Error")), Either(right=2))
+    return


 @app.cell
+def _(Either):
     Either.apply(
         Either(right=lambda x: x + 1), Either(left=TypeError("Parse Error"))
     )
+    return


 @app.cell
+def _(Either):
     Either.apply(Either(right=lambda x: x + 1), Either(right=1))
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Collect the list of responses with sequenceL

+    One often wants to execute a list of commands and collect the list of their responses, and we can define a function `sequenceL` for this.

+    /// admonition
+    In a further notebook about `Traversable`, we will have a more generic `sequence` that executes a **sequence** of commands and collects the **sequence** of their responses, which is not limited to `list`.
+    ///

+    ```python
+    @classmethod
+    def sequenceL(cls, fas: list["Applicative[A]"]) -> "Applicative[list[A]]":
+        if not fas:
+            return cls.pure([])

+        return cls.apply(
+            cls.fmap(lambda v: lambda vs: [v] + vs, fas[0]),
+            cls.sequenceL(fas[1:]),
+        )
+    ```

+    Let's try `sequenceL` with the instances.
+    """)
+    return


 @app.cell
+def _(Wrapper):
     Wrapper.sequenceL([Wrapper(1), Wrapper(2), Wrapper(3)])
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    /// attention
+    For the `Maybe` Applicative, the presence of any `Nothing` causes the entire computation to return `Nothing`.
+    ///
+    """)
+    return


 @app.cell
+def _(Maybe):
     Maybe.sequenceL([Maybe(1), Maybe(2), Maybe(None), Maybe(3)])
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    The result of `sequenceL` for the `List` Applicative is the Cartesian product of the input lists, yielding all possible ordered combinations of elements from each list.
+    """)
+    return


 @app.cell
+def _(List):
     List.sequenceL([List([1, 2]), List([3]), List([5, 6, 7])])
+    return

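The recursive `sequenceL` above can be exercised on a self-contained Maybe-like type; `MaybeSeq` is a hypothetical stand-in (it uses `pure`/`apply` directly where the notebook uses `fmap`), included only to show the fold and the Nothing-collapse behaviour:

```python
from dataclasses import dataclass


@dataclass
class MaybeSeq:
    """A hypothetical Maybe applicative, just enough to run sequenceL."""

    value: object  # None represents Nothing

    @classmethod
    def pure(cls, a) -> "MaybeSeq":
        return cls(a)

    @classmethod
    def apply(cls, fg, fa) -> "MaybeSeq":
        if fg.value is None or fa.value is None:
            return cls(None)
        return cls(fg.value(fa.value))

    @classmethod
    def sequenceL(cls, fas: list) -> "MaybeSeq":
        # fold the list right-to-left, consing each value onto the rest
        if not fas:
            return cls.pure([])
        cons = cls.pure(lambda v: lambda vs: [v] + vs)
        return cls.apply(cls.apply(cons, fas[0]), cls.sequenceL(fas[1:]))


assert MaybeSeq.sequenceL([MaybeSeq(1), MaybeSeq(2)]) == MaybeSeq([1, 2])
# one Nothing collapses the whole sequence
assert MaybeSeq.sequenceL([MaybeSeq(1), MaybeSeq(None)]) == MaybeSeq(None)
```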
 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Applicative laws

+    /// admonition | id and compose

+    Remember that

+    - `id = lambda x: x`
+    - `compose = lambda f: lambda g: lambda x: f(g(x))`

+    ///

+    Traditionally, there are four laws that `Applicative` instances should satisfy. In some sense, they are all concerned with making sure that `pure` deserves its name:

+    - The identity law:
+        ```python
+        # fa: Applicative[A]
+        apply(pure(id), fa) = fa
+        ```
+    - Homomorphism:
+        ```python
+        # a: A
+        # g: Callable[[A], B]
+        apply(pure(g), pure(a)) = pure(g(a))
+        ```
+        Intuitively, applying a non-effectful function to a non-effectful argument in an effectful context is the same as just applying the function to the argument and then injecting the result into the context with `pure`.
+    - Interchange:
+        ```python
+        # a: A
+        # fg: Applicative[Callable[[A], B]]
+        apply(fg, pure(a)) = apply(pure(lambda g: g(a)), fg)
+        ```
+        Intuitively, this says that when evaluating the application of an effectful function to a pure argument, the order in which we evaluate the function and its argument doesn't matter.
+    - Composition:
+        ```python
+        # fg: Applicative[Callable[[B], C]]
+        # fh: Applicative[Callable[[A], B]]
+        # fa: Applicative[A]
+        apply(fg, apply(fh, fa)) = lift(compose, fg, fh, fa)
+        ```
+        This one is the trickiest law to gain intuition for. In some sense it is expressing a sort of associativity property of `apply`.

+    We can add 4 helper functions to `Applicative` to check whether an instance respects the laws or not:

+    ```python
+    @dataclass
+    class Applicative[A](Functor, ABC):

+        @classmethod
+        def check_identity(cls, fa: "Applicative[A]"):
+            if cls.lift(id, fa) != fa:
+                raise ValueError("Instance violates identity law")
+            return True

+        @classmethod
+        def check_homomorphism(cls, a: A, f: Callable[[A], B]):
+            if cls.lift(f, cls.pure(a)) != cls.pure(f(a)):
+                raise ValueError("Instance violates homomorphism law")
+            return True

+        @classmethod
+        def check_interchange(cls, a: A, fg: "Applicative[Callable[[A], B]]"):
+            if cls.apply(fg, cls.pure(a)) != cls.lift(lambda g: g(a), fg):
+                raise ValueError("Instance violates interchange law")
+            return True

+        @classmethod
+        def check_composition(
+            cls,
+            fg: "Applicative[Callable[[B], C]]",
+            fh: "Applicative[Callable[[A], B]]",
+            fa: "Applicative[A]",
+        ):
+            if cls.apply(fg, cls.apply(fh, fa)) != cls.lift(compose, fg, fh, fa):
+                raise ValueError("Instance violates composition law")
+            return True
+    ```

+    > Try to validate the applicative laws below
+    """)
+    return

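Three of the four laws can be verified directly on a minimal wrapper; `W` below is a hypothetical stand-in for the notebook's `Wrapper`, used only to spell the laws out as plain assertions:

```python
from dataclasses import dataclass

identity = lambda x: x  # `id` in the text


@dataclass
class W:
    """A hypothetical Wrapper-like applicative to exercise the laws."""

    value: object

    @classmethod
    def pure(cls, a) -> "W":
        return cls(a)

    @classmethod
    def apply(cls, fg, fa) -> "W":
        return cls(fg.value(fa.value))


fa = W(3)
g = lambda x: x + 1
# identity: apply(pure(id), fa) == fa
assert W.apply(W.pure(identity), fa) == fa
# homomorphism: apply(pure(g), pure(a)) == pure(g(a))
assert W.apply(W.pure(g), W.pure(3)) == W.pure(g(3))
# interchange: apply(fg, pure(a)) == apply(pure(lambda h: h(a)), fg)
fg = W.pure(g)
assert W.apply(fg, W.pure(3)) == W.apply(W.pure(lambda h: h(3)), fg)
```

(The composition law cannot be checked this way for every instance, because the two sides may hold wrapped functions that compare unequal even when extensionally equivalent.)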
 @app.cell


 @app.cell
+def _(List, Wrapper):
     print("Checking Wrapper")
     print(Wrapper.check_identity(Wrapper.pure(1)))
     print(Wrapper.check_homomorphism(1, lambda x: x + 1))

         List.pure(lambda x: x * 2), List.pure(lambda x: x + 0.1), List.pure(1)
     )
     )
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Utility functions

+    /// attention | using `fmap`
+    `fmap` is defined automatically using `pure` and `apply`, so you can use `fmap` with any `Applicative`
+    ///

+    ```python
+    @dataclass
+    class Applicative[A](Functor, ABC):
+        @classmethod
+        def skip(
+            cls, fa: "Applicative[A]", fb: "Applicative[B]"
+        ) -> "Applicative[B]":
+            '''
+            Sequences the effects of two Applicative computations,
+            but discards the result of the first.
+            '''
+            return cls.apply(cls.const(fa, id), fb)

+        @classmethod
+        def keep(
+            cls, fa: "Applicative[A]", fb: "Applicative[B]"
+        ) -> "Applicative[B]":
+            '''
+            Sequences the effects of two Applicative computations,
+            but discards the result of the second.
+            '''
+            return cls.lift(const, fa, fb)

+        @classmethod
+        def revapp(
+            cls, fa: "Applicative[A]", fg: "Applicative[Callable[[A], B]]"
+        ) -> "Applicative[B]":
+            '''
+            The first computation produces values which are provided
+            as input to the function(s) produced by the second computation.
+            '''
+            return cls.lift(lambda a: lambda f: f(a), fa, fg)
+    ```

+    - `skip` sequences the effects of two Applicative computations, but **discards the result of the first**. For example, if `m1` and `m2` are instances of type `Maybe[Int]`, then `Maybe.skip(m1, m2)` is `Nothing` whenever either `m1` or `m2` is `Nothing`; but if not, it will have the same value as `m2`.
+    - Likewise, `keep` sequences the effects of two computations, but **keeps only the result of the first**.
+    - `revapp` is similar to `apply`, but the first computation produces the value(s) which are provided as input to the function(s) produced by the second computation.
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    /// admonition | Exercise
+    Try to use the utility functions with different instances
+    ///
+    """)
+    return

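The `skip`/`keep` behaviour described above can be sketched on a Maybe-like type; `M` is a hypothetical stand-in, and the two helpers are written directly in terms of `pure` and `apply` rather than the notebook's `const`:

```python
from dataclasses import dataclass


@dataclass
class M:
    """A hypothetical Maybe-like applicative to try skip/keep on."""

    value: object  # None represents Nothing

    @classmethod
    def pure(cls, a) -> "M":
        return cls(a)

    @classmethod
    def apply(cls, fg, fa) -> "M":
        if fg.value is None or fa.value is None:
            return cls(None)
        return cls(fg.value(fa.value))

    @classmethod
    def skip(cls, fa, fb) -> "M":
        # sequence both effects, keep only the second result
        return cls.apply(cls.apply(cls.pure(lambda _: lambda b: b), fa), fb)

    @classmethod
    def keep(cls, fa, fb) -> "M":
        # sequence both effects, keep only the first result
        return cls.apply(cls.apply(cls.pure(lambda a: lambda _: a), fa), fb)


assert M.skip(M(1), M(2)) == M(2)
assert M.keep(M(1), M(2)) == M(1)
# a Nothing on either side still fails the whole computation
assert M.skip(M(None), M(2)) == M(None)
assert M.keep(M(1), M(None)) == M(None)
```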
 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    # Formal implementation of Applicative

+    Now, we can give the formal implementation of `Applicative`
+    """)
+    return


 @app.cell


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    # Effectful programming

+    Our original motivation for applicatives was the desire to generalise the idea of mapping to functions with multiple arguments. This is a valid interpretation of the concept of applicatives, but from the three instances we have seen it becomes clear that there is also another, more abstract view.

+    The arguments are no longer just plain values but may also have effects, such as the possibility of failure, having many ways to succeed, or performing input/output actions. In this manner, applicative functors can also be viewed as abstracting the idea of **applying pure functions to effectful arguments**, with the precise form of effects that are permitted depending on the nature of the underlying functor.
+    """)
+    return

 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## The IO Applicative

+    We will try to define an `IO` applicative here.

+    As before, we first abstract how `pure` and `apply` should function.

+    - `pure` should wrap the object in an IO action, and make the object *callable* if it's not, because we want to perform the action later:

+        ```haskell
+        IO.pure(1) => IO(effect=lambda: 1)
+        IO.pure(f) => IO(effect=f)
+        ```

+    - `apply` should perform an action that produces a value, then apply the function to the value

+    The implementation is:
+    """)
+    return


 @app.cell


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    For example, a function that reads a given number of lines from the keyboard can be defined in applicative style as follows:
+    """)
+    return


 @app.cell
     return IO.sequenceL([
         IO.pure(input(f"input the {i}th str")) for i in range(1, n + 1)
     ])
+    return


 @app.cell
+def _():
     # get_chars()()
     return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    # From the perspective of category theory
+    """)
+    return

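The key point of an IO-style applicative — effects are wrapped as zero-argument thunks and run only on demand — can be sketched without any of the notebook's machinery; `IOAp` and its `run` method are hypothetical names, not the notebook's class:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IOAp:
    effect: Callable  # a zero-argument thunk, run only when requested

    @classmethod
    def pure(cls, a) -> "IOAp":
        # wrap a plain value as a deferred action
        return cls(lambda: a)

    @classmethod
    def apply(cls, fg: "IOAp", fa: "IOAp") -> "IOAp":
        # run the function-producing action, then the value-producing one
        return cls(lambda: fg.effect()(fa.effect()))

    def run(self):
        return self.effect()


log = []


def noisy(x):
    log.append(x)  # a visible side effect
    return x


action = IOAp.apply(IOAp.pure(lambda x: x * 2), IOAp(lambda: noisy(21)))
assert log == []  # nothing has run yet: building the action is pure
assert action.run() == 42  # the effect happens only now
assert log == [21]
```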
 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Lax Monoidal Functor

+    An alternative, equivalent formulation of `Applicative` is given by
+    """)
+    return


 @app.cell


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    Intuitively, this states that a *monoidal functor* is one which has some sort of "default shape" and which supports some sort of "combining" operation.

+    - `unit` provides the identity element
+    - `tensor` combines two contexts into a product context

+    More technically, the idea is that a *monoidal functor* preserves the "monoidal structure" given by the pairing constructor `(,)` and the unit type `()`.
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    Furthermore, to deserve the name "monoidal", instances of Monoidal ought to satisfy the following laws, which seem much more straightforward than the traditional Applicative laws:

+    - Left identity

+        `tensor(unit, v) ≅ v`

+    - Right identity

+        `tensor(u, unit) ≅ u`

+    - Associativity

+        `tensor(u, tensor(v, w)) ≅ tensor(tensor(u, v), w)`
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    /// admonition | ≅ indicates isomorphism

+    `≅` refers to *isomorphism* rather than equality.

+    In particular we consider `(x, ()) ≅ x ≅ ((), x)` and `((x, y), z) ≅ (x, (y, z))`

+    ///
+    """)
+    return


 @app.cell(hide_code=True)
+def _(mo):
+    mo.md(r"""
+    ## Mutual definability of Monoidal and Applicative

+    We can implement `pure` and `apply` in terms of `unit` and `tensor`, and vice versa.

+    ```python
+    pure(a) = fmap(lambda _: a, unit)
+    apply(fg, fa) = fmap(lambda pair: pair[0](pair[1]), tensor(fg, fa))
+    ```

+    ```python
+    unit() = pure(())
+    tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)
+    ```
+    """)
+    return

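The first half of this mutual definability — recovering `pure` and `apply` from `unit`, `tensor`, and `fmap` — can be checked on a small list-based monoidal functor; `LM` is a hypothetical stand-in for the `ListMonoidal` class defined next:

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class LM:
    """A hypothetical list-shaped monoidal functor for this sketch."""

    items: list

    @classmethod
    def unit(cls) -> "LM":
        # the "default shape": a single empty tuple
        return cls([()])

    @classmethod
    def tensor(cls, fa: "LM", fb: "LM") -> "LM":
        # combine two contexts into a product context
        return cls(list(product(fa.items, fb.items)))

    @classmethod
    def fmap(cls, f, fa: "LM") -> "LM":
        return cls([f(a) for a in fa.items])

    # pure / apply recovered from unit / tensor, as in the text
    @classmethod
    def pure(cls, a) -> "LM":
        return cls.fmap(lambda _: a, cls.unit())

    @classmethod
    def apply(cls, fg: "LM", fa: "LM") -> "LM":
        return cls.fmap(lambda pair: pair[0](pair[1]), cls.tensor(fg, fa))


assert LM.pure(1) == LM([1])
assert LM.apply(LM([lambda x: x + 1, lambda x: x * 2]), LM([1, 2])) == LM([2, 3, 2, 4])
```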
@app.cell(hide_code=True)
|
| 1083 |
+
def _(mo):
|
| 1084 |
+
mo.md(r"""
|
| 1085 |
+
## Instance: ListMonoidal
|
|
|
|
| 1086 |
|
| 1087 |
+
- `unit` should simply return a empty tuple wrapper in a list
|
| 1088 |
|
| 1089 |
+
```haskell
|
| 1090 |
+
ListMonoidal.unit() => [()]
|
| 1091 |
+
```
|
| 1092 |
|
| 1093 |
+
- `tensor` should return the *cartesian product* of the items of 2 ListMonoidal instances
|
| 1094 |
|
| 1095 |
+
The implementation is:
|
| 1096 |
+
""")
|
| 1097 |
+
return
|
| 1098 |
|
| 1099 |
|
| 1100 |
@app.cell
|
| 1101 |
+
def _(A, B, Callable, Monoidal, dataclass, product):
|
| 1102 |
@dataclass
|
| 1103 |
class ListMonoidal[A](Monoidal):
|
| 1104 |
items: list[A]
|
|
|
|
| 1122 |
|
| 1123 |
|
| 1124 |
@app.cell(hide_code=True)
|
| 1125 |
+
def _(mo):
|
| 1126 |
+
mo.md(r"""
|
| 1127 |
+
> try with `ListMonoidal` below
|
| 1128 |
+
""")
|
| 1129 |
+
return
|
| 1130 |
|
| 1131 |
|
| 1132 |
@app.cell
|
|
|
|
| 1138 |
|
| 1139 |
|
| 1140 |
@app.cell(hide_code=True)
|
| 1141 |
+
def _(mo):
|
| 1142 |
+
mo.md(r"""
|
| 1143 |
+
and we can verify that `tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)`:
|
| 1144 |
+
""")
|
| 1145 |
+
return
|
| 1146 |
|
| 1147 |
|
| 1148 |
@app.cell
|
| 1149 |
+
def _(List, xs, ys):
|
| 1150 |
List.lift(lambda fa: lambda fb: (fa, fb), List(xs.items), List(ys.items))
|
| 1151 |
+
return
|
| 1152 |
|
| 1153 |
|
| 1154 |
@app.cell(hide_code=True)
|
|
|
|
| 1197 |
A = TypeVar("A")
|
| 1198 |
B = TypeVar("B")
|
| 1199 |
C = TypeVar("C")
|
| 1200 |
+
return A, B
|
| 1201 |
|
| 1202 |
|
| 1203 |
@app.cell(hide_code=True)
|
| 1204 |
+
def _(mo):
|
| 1205 |
+
mo.md(r"""
|
| 1206 |
+
# From Applicative to Alternative
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1207 |
|
| 1208 |
+
## Abstracting Alternative
|
| 1209 |
|
| 1210 |
+
In our studies so far, we saw that both `Maybe` and `List` can represent computations with a varying number of results.
|
|
|
|
|
|
|
| 1211 |
|
| 1212 |
+
We use `Maybe` to indicate a computation can fail somehow and `List` for computations that can have many possible results. In both of these cases, one useful operation is amalgamating all possible results from multiple computations into a single computation.
|
| 1213 |
|
| 1214 |
+
`Alternative` formalizes computations that support:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1215 |
|
| 1216 |
+
- **Failure** (empty result)
|
| 1217 |
+
- **Choice** (combination of results)
|
| 1218 |
+
- **Repetition** (multiple results)
|
| 1219 |
|
| 1220 |
+
It extends `Applicative` with monoidal structure, where:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1221 |
|
| 1222 |
+
```python
|
| 1223 |
+
@dataclass
|
| 1224 |
+
class Alternative[A](Applicative, ABC):
|
| 1225 |
+
@classmethod
|
| 1226 |
+
@abstractmethod
|
| 1227 |
+
def empty(cls) -> "Alternative[A]":
|
| 1228 |
+
'''Identity element for alternative computations'''
|
| 1229 |
|
| 1230 |
+
@classmethod
|
| 1231 |
+
@abstractmethod
|
| 1232 |
+
def alt(
|
| 1233 |
+
cls, fa: "Alternative[A]", fb: "Alternative[A]"
|
| 1234 |
+
) -> "Alternative[A]":
|
| 1235 |
+
'''Binary operation combining computations'''
|
| 1236 |
+
```
|
| 1237 |
+
|
| 1238 |
+
- `empty` is the identity element (e.g., `Maybe(None)`, `List([])`)
|
| 1239 |
+
- `alt` is a combination operator (e.g., `Maybe` fallback, list concatenation)
|
| 1240 |
+
|
| 1241 |
+
`empty` and `alt` should satisfy the following **laws**:
|
| 1242 |
+
|
| 1243 |
+
```python
|
| 1244 |
+
# Left identity
|
| 1245 |
+
alt(empty, fa) == fa
|
| 1246 |
+
# Right identity
|
| 1247 |
+
alt(fa, empty) == fa
|
| 1248 |
+
# Associativity
|
| 1249 |
+
alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc)
|
| 1250 |
+
```
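For intuition, with plain Python lists (taking `empty` to be `[]` and `alt` to be concatenation) the three laws reduce to familiar list identities. This is an illustrative check, separate from the notebook's classes:

```python
# Checking the Alternative laws with plain lists,
# where empty is [] and alt is list concatenation.
empty = []


def alt(fa, fb):
    return fa + fb


fa, fb, fc = [1], [2, 3], []

assert alt(empty, fa) == fa                          # left identity
assert alt(fa, empty) == fa                          # right identity
assert alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc)  # associativity
print("all three laws hold")
```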
|
| 1251 |
+
|
| 1252 |
+
/// admonition
|
| 1253 |
+
In fact, `Alternative` is a *monoid* on applicative functors. We will talk about *monoids* and review these laws in the next notebook, about `Monads`.
|
| 1254 |
+
///
|
| 1255 |
+
|
| 1256 |
+
/// attention | minimal implementation requirement
|
| 1257 |
+
- `empty`
|
| 1258 |
+
- `alt`
|
| 1259 |
+
///
|
| 1260 |
+
""")
|
| 1261 |
+
return
|
| 1262 |
|
| 1263 |
|
| 1264 |
@app.cell(hide_code=True)
|
| 1265 |
+
def _(mo):
|
| 1266 |
+
mo.md(r"""
|
| 1267 |
+
## Instances of Alternative
|
|
|
|
| 1268 |
|
| 1269 |
+
### The Maybe Alternative
|
| 1270 |
|
| 1271 |
+
- `empty`: the identity element of `Maybe` is `Maybe(None)`
|
| 1272 |
+
- `alt`: return the first element if it's not `None`, else return the second element
|
| 1273 |
+
""")
|
| 1274 |
+
return
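A standalone sketch of these two bullets is shown below. The class name and details are illustrative; the notebook's actual `AltMaybe` subclasses the `Alternative` base class defined later.

```python
from dataclasses import dataclass
from typing import Any


# Illustrative sketch of a Maybe-like Alternative, mirroring the bullets above.
@dataclass
class MaybeSketch:
    value: Any = None

    @classmethod
    def empty(cls) -> "MaybeSketch":
        # identity element: the failed computation
        return cls(None)

    @classmethod
    def alt(cls, fa: "MaybeSketch", fb: "MaybeSketch") -> "MaybeSketch":
        # keep the first result unless it failed, then fall back to the second
        return fa if fa.value is not None else fb


print(MaybeSketch.alt(MaybeSketch(None), MaybeSketch(1)))  # MaybeSketch(value=1)
print(MaybeSketch.alt(MaybeSketch(1), MaybeSketch(2)))     # MaybeSketch(value=1)
```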
|
| 1275 |
|
| 1276 |
|
| 1277 |
@app.cell
|
|
|
|
| 1294 |
|
| 1295 |
|
| 1296 |
@app.cell
|
| 1297 |
+
def _(AltMaybe):
|
| 1298 |
print(AltMaybe.empty())
|
| 1299 |
print(AltMaybe.alt(AltMaybe(None), AltMaybe(1)))
|
| 1300 |
print(AltMaybe.alt(AltMaybe(None), AltMaybe(None)))
|
| 1301 |
print(AltMaybe.alt(AltMaybe(1), AltMaybe(None)))
|
| 1302 |
print(AltMaybe.alt(AltMaybe(1), AltMaybe(2)))
|
| 1303 |
+
return
|
| 1304 |
|
| 1305 |
|
| 1306 |
@app.cell
|
| 1307 |
+
def _(AltMaybe):
|
| 1308 |
print(AltMaybe.check_left_identity(AltMaybe(1)))
|
| 1309 |
print(AltMaybe.check_right_identity(AltMaybe(1)))
|
| 1310 |
print(AltMaybe.check_associativity(AltMaybe(1), AltMaybe(2), AltMaybe(None)))
|
| 1311 |
+
return
|
| 1312 |
|
| 1313 |
|
| 1314 |
@app.cell(hide_code=True)
|
| 1315 |
+
def _(mo):
|
| 1316 |
+
mo.md(r"""
|
| 1317 |
+
### The List Alternative
|
| 1318 |
+
|
| 1319 |
+
- `empty`: the identity element of `List` is `List([])`
|
| 1320 |
+
- `alt`: return the concatenation of 2 input lists
|
| 1321 |
+
""")
|
| 1322 |
+
return
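Again, a standalone sketch of these two bullets; the notebook's actual `AltList` subclasses the `Alternative` base class and may differ in details.

```python
from dataclasses import dataclass, field


# Illustrative sketch of a list-based Alternative, mirroring the bullets above.
@dataclass
class ListSketch:
    items: list = field(default_factory=list)

    @classmethod
    def empty(cls) -> "ListSketch":
        # identity element: no results
        return cls([])

    @classmethod
    def alt(cls, fa: "ListSketch", fb: "ListSketch") -> "ListSketch":
        # combine by concatenating the two result lists
        return cls(fa.items + fb.items)


print(ListSketch.alt(ListSketch([1, 2, 3]), ListSketch([4, 5])).items)  # [1, 2, 3, 4, 5]
```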
|
|
|
|
| 1323 |
|
| 1324 |
|
| 1325 |
@app.cell
|
|
|
|
| 1337 |
|
| 1338 |
|
| 1339 |
@app.cell
|
| 1340 |
+
def _(AltList):
|
| 1341 |
print(AltList.empty())
|
| 1342 |
print(AltList.alt(AltList([1, 2, 3]), AltList([4, 5])))
|
| 1343 |
+
return
|
| 1344 |
|
| 1345 |
|
| 1346 |
@app.cell
|
| 1347 |
+
def _(AltList):
|
| 1348 |
AltList([1])
|
| 1349 |
+
return
|
| 1350 |
|
| 1351 |
|
| 1352 |
@app.cell
|
| 1353 |
+
def _(AltList):
|
| 1354 |
AltList([1])
|
| 1355 |
+
return
|
| 1356 |
|
| 1357 |
|
| 1358 |
@app.cell
|
| 1359 |
+
def _(AltList):
|
| 1360 |
print(AltList.check_left_identity(AltList([1, 2, 3])))
|
| 1361 |
print(AltList.check_right_identity(AltList([1, 2, 3])))
|
| 1362 |
print(
|
|
|
|
| 1364 |
AltList([1, 2]), AltList([3, 4, 5]), AltList([6])
|
| 1365 |
)
|
| 1366 |
)
|
| 1367 |
+
return
|
| 1368 |
|
| 1369 |
|
| 1370 |
@app.cell(hide_code=True)
|
| 1371 |
+
def _(mo):
|
| 1372 |
+
mo.md(r"""
|
| 1373 |
+
## some and many
|
|
|
|
| 1374 |
|
| 1375 |
|
| 1376 |
+
/// admonition | This section mainly refers to
|
| 1377 |
|
| 1378 |
+
- https://stackoverflow.com/questions/7671009/some-and-many-functions-from-the-alternative-type-class/7681283#7681283
|
| 1379 |
|
| 1380 |
+
///
|
| 1381 |
|
| 1382 |
+
First let's have a look at the implementation of `some` and `many`:
|
| 1383 |
|
| 1384 |
+
```python
|
| 1385 |
+
@classmethod
|
| 1386 |
+
def some(cls, fa: "Alternative[A]") -> "Alternative[list[A]]":
|
| 1387 |
+
# Short-circuit if input is empty
|
| 1388 |
+
if fa == cls.empty():
|
| 1389 |
+
return cls.empty()
|
| 1390 |
|
| 1391 |
+
return cls.apply(
|
| 1392 |
+
cls.fmap(lambda a: lambda b: [a] + b, fa), cls.many(fa)
|
| 1393 |
+
)
|
| 1394 |
|
| 1395 |
+
@classmethod
|
| 1396 |
+
def many(cls, fa: "Alternative[A]") -> "Alternative[list[A]]":
|
| 1397 |
+
# Directly return empty list if input is empty
|
| 1398 |
+
if fa == cls.empty():
|
| 1399 |
+
return cls.pure([])
|
| 1400 |
|
| 1401 |
+
return cls.alt(cls.some(fa), cls.pure([]))
|
| 1402 |
+
```
|
| 1403 |
|
| 1404 |
+
So `some f` runs `f` once, then *many* times, and conses the results. `many f` runs `f` *some* times, or *alternatively* just returns the empty list.
|
| 1405 |
|
| 1406 |
+
The idea is that they both run `f` as often as possible until it **fails**, collecting the results in a list. The difference is that `some f` immediately fails if `f` fails, while `many f` will still succeed and *return* the empty list in such a case. But what all this exactly means depends on how `alt` is defined.
|
| 1407 |
|
| 1408 |
+
Let's see what it does for the instances `AltMaybe` and `AltList`.
|
| 1409 |
+
""")
|
| 1410 |
+
return
|
| 1411 |
|
| 1412 |
|
| 1413 |
@app.cell(hide_code=True)
|
| 1414 |
+
def _(mo):
|
| 1415 |
+
mo.md(r"""
|
| 1416 |
+
For `AltMaybe`, `None` means failure, so some `None` fails as well and evaluates to `None`, while many `None` succeeds and evaluates to `Just []`. Both `some (Just ())` and `many (Just ())` never return, because `Just ()` never fails.
|
| 1417 |
+
""")
|
| 1418 |
+
return
|
| 1419 |
|
| 1420 |
|
| 1421 |
@app.cell
|
| 1422 |
+
def _(AltMaybe):
|
| 1423 |
print(AltMaybe.some(AltMaybe.empty()))
|
| 1424 |
print(AltMaybe.many(AltMaybe.empty()))
|
| 1425 |
+
return
|
| 1426 |
|
| 1427 |
|
| 1428 |
@app.cell(hide_code=True)
|
| 1429 |
+
def _(mo):
|
| 1430 |
+
mo.md(r"""
|
| 1431 |
+
For `AltList`, `[]` means failure, so `some []` evaluates to `[]` (no answers), while `many []` evaluates to `[[]]` (there is one answer, and it is the empty list). Again, `some [()]` and `many [()]` don't return.
|
| 1432 |
+
""")
|
| 1433 |
+
return
|
| 1434 |
|
| 1435 |
|
| 1436 |
@app.cell
|
| 1437 |
+
def _(AltList):
|
| 1438 |
print(AltList.some(AltList.empty()))
|
| 1439 |
print(AltList.many(AltList.empty()))
|
| 1440 |
+
return
|
| 1441 |
|
| 1442 |
|
| 1443 |
@app.cell(hide_code=True)
|
| 1444 |
+
def _(mo):
|
| 1445 |
+
mo.md(r"""
|
| 1446 |
+
## Formal implementation of Alternative
|
| 1447 |
+
""")
|
| 1448 |
+
return
|
| 1449 |
|
| 1450 |
|
| 1451 |
@app.cell
|
|
|
|
| 1503 |
|
| 1504 |
|
| 1505 |
@app.cell(hide_code=True)
|
| 1506 |
+
def _(mo):
|
| 1507 |
+
mo.md(r"""
|
| 1508 |
+
/// admonition
|
|
|
|
| 1509 |
|
| 1510 |
+
We will explore more about `Alternative` in a future notebook about [Monadic Parsing](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/monadic-parsing-in-haskell/E557DFCCE00E0D4B6ED02F3FB0466093).
|
| 1511 |
|
| 1512 |
+
///
|
| 1513 |
+
""")
|
| 1514 |
+
return
|
| 1515 |
|
| 1516 |
|
| 1517 |
@app.cell(hide_code=True)
|
| 1518 |
+
def _(mo):
|
| 1519 |
+
mo.md(r"""
|
| 1520 |
+
# Further reading
|
| 1521 |
+
|
| 1522 |
+
Note that these readings are optional and non-trivial.
|
| 1523 |
+
|
| 1524 |
+
- [Applicative Programming with Effects](https://www.staff.city.ac.uk/~ross/papers/Applicative.html)
|
| 1525 |
+
- [Equivalence of Applicative Functors and Multifunctors](https://arxiv.org/pdf/2401.14286)
|
| 1527 |
+
- [Applicative functor](https://wiki.haskell.org/index.php?title=Applicative_functor)
|
| 1528 |
+
- [Control.Applicative](https://hackage.haskell.org/package/base-4.21.0.0/docs/Control-Applicative.html#t:Applicative)
|
| 1529 |
+
- [Typeclassopedia#Applicative](https://wiki.haskell.org/index.php?title=Typeclassopedia#Applicative)
|
| 1530 |
+
- [Notions of computation as monoids](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/notions-of-computation-as-monoids/70019FC0F2384270E9F41B9719042528)
|
| 1531 |
+
- [Free Applicative Functors](https://arxiv.org/abs/1403.0749)
|
| 1532 |
+
- [The basics of applicative functors, put to practical work](http://www.serpentine.com/blog/2008/02/06/the-basics-of-applicative-functors-put-to-practical-work/)
|
| 1533 |
+
- [Abstracting with Applicatives](http://comonad.com/reader/2012/abstracting-with-applicatives/)
|
| 1534 |
+
- [Static analysis with Applicatives](https://gergo.erdi.hu/blog/2012-12-01-static_analysis_with_applicatives/)
|
| 1535 |
+
- [Explaining Applicative functor in categorical terms - monoidal functors](https://cstheory.stackexchange.com/questions/12412/explaining-applicative-functor-in-categorical-terms-monoidal-functors)
|
| 1536 |
+
- [Applicative, A Strong Lax Monoidal Functor](https://beuke.org/applicative/)
|
| 1537 |
+
- [Applicative Functors](https://bartoszmilewski.com/2017/02/06/applicative-functors/)
|
| 1538 |
+
""")
|
| 1539 |
+
return
|
|
|
|
| 1540 |
|
| 1541 |
|
| 1542 |
if __name__ == "__main__":
|
functional_programming/CHANGELOG.md
CHANGED
|
@@ -1,3 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Changelog of the functional-programming course
|
| 2 |
|
| 3 |
## 2025-04-16
|
|
@@ -121,4 +126,4 @@ for reviewing
|
|
| 121 |
|
| 122 |
**functors.py**
|
| 123 |
|
| 124 |
-
- Demo version of notebook `05_functors.py`
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Changelog
|
| 3 |
+
marimo-version: 0.18.4
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
# Changelog of the functional-programming course
|
| 7 |
|
| 8 |
## 2025-04-16
|
|
|
|
| 126 |
|
| 127 |
**functors.py**
|
| 128 |
|
| 129 |
+
- Demo version of notebook `05_functors.py`
|
functional_programming/README.md
CHANGED
|
@@ -1,3 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Learn Functional Programming
|
| 2 |
|
| 3 |
_🚧 This collection is a [work in progress](https://github.com/marimo-team/learn/issues/51)._
|
|
@@ -24,13 +29,13 @@ Topics include:
|
|
| 24 |
|
| 25 |
To run a notebook locally, use
|
| 26 |
|
| 27 |
-
```bash
|
| 28 |
-
uvx marimo edit <URL>
|
| 29 |
```
|
| 30 |
|
| 31 |
For example, run the `Functor` tutorial with
|
| 32 |
|
| 33 |
-
```bash
|
| 34 |
uvx marimo edit https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py
|
| 35 |
```
|
| 36 |
|
|
@@ -52,11 +57,11 @@ on Discord (@eugene.hs).
|
|
| 52 |
## Description of notebooks
|
| 53 |
|
| 54 |
Check [here](https://github.com/marimo-team/learn/issues/51) for current series
|
| 55 |
-
structure.
|
| 56 |
|
| 57 |
| Notebook | Title | Key Concepts | Prerequisites |
|
| 58 |
-
|----------|-------|--------------|---------------|
|
| 59 |
-
| [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions |
|
| 60 |
| [06. Applicatives](https://github.com/marimo-team/learn/blob/main/functional_programming/06_applicatives.py) | Applicative programming with effects | Applicative Functor, pure, apply, Effectful programming, Alternative | Functors |
|
| 61 |
|
| 62 |
**Authors.**
|
|
@@ -69,4 +74,4 @@ Thanks to all our notebook authors!
|
|
| 69 |
|
| 70 |
Thanks to all our notebook reviews!
|
| 71 |
|
| 72 |
-
- [Haleshot](https://github.com/Haleshot)
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Readme
|
| 3 |
+
marimo-version: 0.18.4
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
# Learn Functional Programming
|
| 7 |
|
| 8 |
_🚧 This collection is a [work in progress](https://github.com/marimo-team/learn/issues/51)._
|
|
|
|
| 29 |
|
| 30 |
To run a notebook locally, use
|
| 31 |
|
| 32 |
+
```bash
|
| 33 |
+
uvx marimo edit <URL>
|
| 34 |
```
|
| 35 |
|
| 36 |
For example, run the `Functor` tutorial with
|
| 37 |
|
| 38 |
+
```bash
|
| 39 |
uvx marimo edit https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py
|
| 40 |
```
|
| 41 |
|
|
|
|
| 57 |
## Description of notebooks
|
| 58 |
|
| 59 |
Check [here](https://github.com/marimo-team/learn/issues/51) for current series
|
| 60 |
+
structure.
|
| 61 |
|
| 62 |
| Notebook | Title | Key Concepts | Prerequisites |
|
| 63 |
+
|----------|-------|--------------|---------------|
|
| 64 |
+
| [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions |
|
| 65 |
| [06. Applicatives](https://github.com/marimo-team/learn/blob/main/functional_programming/06_applicatives.py) | Applicative programming with effects | Applicative Functor, pure, apply, Effectful programming, Alternative | Functors |
|
| 66 |
|
| 67 |
**Authors.**
|
|
|
|
| 74 |
|
| 75 |
Thanks to all our notebook reviews!
|
| 76 |
|
| 77 |
+
- [Haleshot](https://github.com/Haleshot)
|
optimization/01_least_squares.py
CHANGED
|
@@ -9,7 +9,7 @@
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
-
__generated_with = "0.
|
| 13 |
app = marimo.App()
|
| 14 |
|
| 15 |
|
|
@@ -21,45 +21,41 @@ def _():
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
-
mo.md(
|
| 25 |
-
|
| 26 |
-
# Least squares
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
)
|
| 51 |
return
|
| 52 |
|
| 53 |
|
| 54 |
@app.cell(hide_code=True)
|
| 55 |
def _(mo):
|
| 56 |
-
mo.md(
|
| 57 |
-
|
| 58 |
-
## Example
|
| 59 |
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
)
|
| 63 |
return
|
| 64 |
|
| 65 |
|
|
@@ -91,7 +87,7 @@ def _(A, b, cp, n):
|
|
| 91 |
objective = cp.sum_squares(A @ x - b)
|
| 92 |
problem = cp.Problem(cp.Minimize(objective))
|
| 93 |
optimal_value = problem.solve()
|
| 94 |
-
return
|
| 95 |
|
| 96 |
|
| 97 |
@app.cell
|
|
@@ -108,14 +104,12 @@ def _(A, b, cp, mo, optimal_value, x):
|
|
| 108 |
|
| 109 |
@app.cell(hide_code=True)
|
| 110 |
def _(mo):
|
| 111 |
-
mo.md(
|
| 112 |
-
|
| 113 |
-
## Further reading
|
| 114 |
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
)
|
| 119 |
return
|
| 120 |
|
| 121 |
|
|
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
+
__generated_with = "0.18.4"
|
| 13 |
app = marimo.App()
|
| 14 |
|
| 15 |
|
|
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
+
mo.md(r"""
|
| 25 |
+
# Least squares
|
|
|
|
| 26 |
|
| 27 |
+
In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times
|
| 28 |
+
n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. We seek a vector
|
| 29 |
+
$x \in \mathcal{R}^{n}$ such that $Ax$ is close to $b$. The matrix $A$ and the vector $b$ are problem data or constants, and $x$ is the variable we are solving for.
|
| 30 |
|
| 31 |
+
Closeness is defined as the sum of the squared differences:
|
| 32 |
|
| 33 |
+
\[ \sum_{i=1}^m (a_i^Tx - b_i)^2, \]
|
| 34 |
|
| 35 |
+
also known as the $\ell_2$-norm squared, $\|Ax - b\|_2^2$.
|
| 36 |
|
| 37 |
+
For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $a_i^T$ of $A$ is the feature vector for user $i$, while the corresponding entry $b_i$ of $b$ is the measurement we want to predict from $a_i^T$, such as ad spending. The prediction for user $i$ is given by $a_i^Tx$.
|
| 38 |
|
| 39 |
+
We find the optimal value of $x$ by solving the optimization problem
|
| 40 |
|
| 41 |
+
\[
|
| 42 |
+
\begin{array}{ll}
|
| 43 |
+
\text{minimize} & \|Ax - b\|_2^2.
|
| 44 |
+
\end{array}
|
| 45 |
+
\]
|
| 46 |
|
| 47 |
+
Let $x^\star$ denote the optimal $x$. The quantity $r = Ax^\star - b$ is known as the residual. If $\|r\|_2 = 0$, we have a perfect fit.
|
| 48 |
+
""")
|
|
|
|
| 49 |
return
|
| 50 |
|
| 51 |
|
| 52 |
@app.cell(hide_code=True)
|
| 53 |
def _(mo):
|
| 54 |
+
mo.md(r"""
|
| 55 |
+
## Example
|
|
|
|
| 56 |
|
| 57 |
+
In this example, we use the Python library [CVXPY](https://github.com/cvxpy/cvxpy) to construct and solve a least-squares problem.
|
| 58 |
+
""")
|
|
|
|
| 59 |
return
|
| 60 |
|
| 61 |
|
|
|
|
| 87 |
objective = cp.sum_squares(A @ x - b)
|
| 88 |
problem = cp.Problem(cp.Minimize(objective))
|
| 89 |
optimal_value = problem.solve()
|
| 90 |
+
return optimal_value, x
|
| 91 |
|
| 92 |
|
| 93 |
@app.cell
|
|
|
|
| 104 |
|
| 105 |
@app.cell(hide_code=True)
|
| 106 |
def _(mo):
|
| 107 |
+
mo.md(r"""
|
| 108 |
+
## Further reading
|
|
|
|
| 109 |
|
| 110 |
+
For a primer on least squares, with many real-world examples, check out the free book
|
| 111 |
+
[Vectors, Matrices, and Least Squares](https://web.stanford.edu/~boyd/vmls/), which is used for undergraduate linear algebra education at Stanford.
|
| 112 |
+
""")
|
|
|
|
| 113 |
return
|
| 114 |
|
| 115 |
|
optimization/02_linear_program.py
CHANGED
|
@@ -11,7 +11,7 @@
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
-
__generated_with = "0.
|
| 15 |
app = marimo.App()
|
| 16 |
|
| 17 |
|
|
@@ -23,33 +23,31 @@ def _():
|
|
| 23 |
|
| 24 |
@app.cell(hide_code=True)
|
| 25 |
def _(mo):
|
| 26 |
-
mo.md(
|
| 27 |
-
|
| 28 |
-
# Linear program
|
| 29 |
|
| 30 |
-
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
)
|
| 53 |
return
|
| 54 |
|
| 55 |
|
|
@@ -66,13 +64,11 @@ def _(mo):
|
|
| 66 |
|
| 67 |
@app.cell(hide_code=True)
|
| 68 |
def _(mo):
|
| 69 |
-
mo.md(
|
| 70 |
-
|
| 71 |
-
## Example
|
| 72 |
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
)
|
| 76 |
return
|
| 77 |
|
| 78 |
|
|
@@ -119,7 +115,9 @@ def _(np):
|
|
| 119 |
|
| 120 |
@app.cell(hide_code=True)
|
| 121 |
def _(mo):
|
| 122 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 123 |
return
|
| 124 |
|
| 125 |
|
|
@@ -129,7 +127,7 @@ def _(mo, np):
|
|
| 129 |
|
| 130 |
c_widget = mo.ui.anywidget(Matrix(matrix=np.array([[0.1, -0.2]]), step=0.01))
|
| 131 |
c_widget
|
| 132 |
-
return
|
| 133 |
|
| 134 |
|
| 135 |
@app.cell
|
|
@@ -149,7 +147,9 @@ def _(A, b, c, cp):
|
|
| 149 |
|
| 150 |
@app.cell(hide_code=True)
|
| 151 |
def _(mo):
|
| 152 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 153 |
return
|
| 154 |
|
| 155 |
|
|
@@ -249,7 +249,7 @@ def _(np):
|
|
| 249 |
ax.set_xlim(np.min(x_vals), np.max(x_vals))
|
| 250 |
ax.set_ylim(np.min(y_vals), np.max(y_vals))
|
| 251 |
return ax
|
| 252 |
-
return make_plot,
|
| 253 |
|
| 254 |
|
| 255 |
@app.cell(hide_code=True)
|
|
@@ -257,7 +257,7 @@ def _(mo, prob, x):
|
|
| 257 |
mo.md(
|
| 258 |
f"""
|
| 259 |
The optimal value is {prob.value:.04f}.
|
| 260 |
-
|
| 261 |
A solution $x$ is {mo.as_html(list(x.value))}
|
| 262 |
A dual solution is {mo.as_html(list(prob.constraints[0].dual_value))}
|
| 263 |
"""
|
|
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
+
__generated_with = "0.18.4"
|
| 15 |
app = marimo.App()
|
| 16 |
|
| 17 |
|
|
|
|
| 23 |
|
| 24 |
@app.cell(hide_code=True)
|
| 25 |
def _(mo):
|
| 26 |
+
mo.md(r"""
|
| 27 |
+
# Linear program
|
|
|
|
| 28 |
|
| 29 |
+
A linear program is an optimization problem with a linear objective and affine
|
| 30 |
+
inequality constraints. A common standard form is the following:
|
| 31 |
|
| 32 |
+
\[
|
| 33 |
+
\begin{array}{ll}
|
| 34 |
+
\text{minimize} & c^Tx \\
|
| 35 |
+
\text{subject to} & Ax \leq b.
|
| 36 |
+
\end{array}
|
| 37 |
+
\]
|
| 38 |
|
| 39 |
+
Here $A \in \mathcal{R}^{m \times n}$, $b \in \mathcal{R}^m$, and $c \in \mathcal{R}^n$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Ax \leq b$ is elementwise.
|
| 40 |
|
| 41 |
+
For example, we might have $n$ different products, each constructed out of $m$ components. Each entry $A_{ij}$ is the amount of component $i$ required to build one unit of product $j$. Each entry $b_i$ is the total amount of component $i$ available. We lose $c_j$ for each unit of product $j$ ($c_j < 0$ indicates profit). Our goal then is to choose how many units of each product $j$ to make, $x_j$, in order to minimize loss without exceeding our budget for any component.
|
| 42 |
|
| 43 |
+
In addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$. A positive entry $\lambda^\star_i$ indicates that the constraint $a_i^Tx \leq b_i$ holds with equality for $x^\star$ and suggests that changing $b_i$ would change the optimal value.
|
| 44 |
|
| 45 |
+
**Why linear programming?** Linear programming is a way to achieve an optimal outcome, such as maximum utility or lowest cost, subject to a linear objective function and affine constraints. Developed in the 20th century, linear programming is widely used today to solve problems in resource allocation, scheduling, transportation, and more. The discovery of polynomial-time algorithms to solve linear programs was of tremendous worldwide importance and entered the public discourse, even making the front page of the New York Times.
|
| 46 |
|
| 47 |
+
In the late 20th and early 21st century, researchers generalized linear programming to a much wider class of problems called convex optimization problems. Nearly all convex optimization problems can be solved efficiently and reliably, and even more difficult problems are readily solved by a sequence of convex optimization problems. Today, convex optimization is used to fit machine learning models, land rockets in real-time at SpaceX, plan trajectories for self-driving cars at Waymo, execute many billions of dollars of financial trades a day, and much more.
|
| 48 |
|
| 49 |
+
This marimo learn course uses CVXPY, a modeling language for convex optimization problems developed originally at Stanford, to construct and solve convex programs.
|
| 50 |
+
""")
|
|
|
|
| 51 |
return
|
| 52 |
|
| 53 |
|
|
|
|
| 64 |
|
| 65 |
@app.cell(hide_code=True)
|
| 66 |
def _(mo):
|
| 67 |
+
mo.md(r"""
|
| 68 |
+
## Example
|
|
|
|
| 69 |
|
| 70 |
+
Here we use CVXPY to construct and solve a linear program.
|
| 71 |
+
""")
|
|
|
|
| 72 |
return
|
| 73 |
|
| 74 |
|
|
|
|
| 115 |
|
| 116 |
@app.cell(hide_code=True)
|
| 117 |
def _(mo):
|
| 118 |
+
mo.md(r"""
|
| 119 |
+
We've randomly generated problem data $A$ and $b$. The vector $c$ is shown below. Try playing with the value of $c$ by dragging the components, and see how the level curves change in the visualization below.
|
| 120 |
+
""")
|
| 121 |
return
|
| 122 |
|
| 123 |
|
|
|
|
| 127 |
|
| 128 |
c_widget = mo.ui.anywidget(Matrix(matrix=np.array([[0.1, -0.2]]), step=0.01))
|
| 129 |
c_widget
|
| 130 |
+
return (c_widget,)
|
| 131 |
|
| 132 |
|
| 133 |
@app.cell
|
|
|
|
| 147 |
|
| 148 |
@app.cell(hide_code=True)
|
| 149 |
def _(mo):
|
| 150 |
+
mo.md(r"""
|
| 151 |
+
Below, we plot the feasible region of the problem (the intersection of the inequalities) and the level curves of the objective function. The optimal point $x^\star$ is the point farthest in the feasible region in the direction $-c$.
|
| 152 |
+
""")
|
| 153 |
return
|
| 154 |
|
| 155 |
|
|
|
|
| 249 |
ax.set_xlim(np.min(x_vals), np.max(x_vals))
|
| 250 |
ax.set_ylim(np.min(y_vals), np.max(y_vals))
|
| 251 |
return ax
|
| 252 |
+
return (make_plot,)
|
| 253 |
|
| 254 |
|
| 255 |
@app.cell(hide_code=True)
|
|
|
|
| 257 |
mo.md(
|
| 258 |
f"""
|
| 259 |
The optimal value is {prob.value:.04f}.
|
| 260 |
+
|
| 261 |
A solution $x$ is {mo.as_html(list(x.value))}
|
| 262 |
A dual solution is {mo.as_html(list(prob.constraints[0].dual_value))}
|
| 263 |
"""
|
optimization/03_minimum_fuel_optimal_control.py
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
import marimo
|
| 2 |
|
| 3 |
-
__generated_with = "0.
|
| 4 |
app = marimo.App()
|
| 5 |
|
| 6 |
|
|
@@ -12,46 +12,44 @@ def _():
|
|
| 12 |
|
| 13 |
@app.cell(hide_code=True)
|
| 14 |
def _(mo):
|
| 15 |
-
mo.md(
|
| 16 |
-
|
| 17 |
-
# Minimal fuel optimal control
|
| 18 |
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
)
|
| 55 |
return
|
| 56 |
|
| 57 |
|
|
@@ -85,7 +83,7 @@ def _(mo, n, np):
|
|
| 85 |
rf"""
|
| 86 |
|
| 87 |
Choose a value for $x_0$ ...
|
| 88 |
-
|
| 89 |
{x0_widget}
|
| 90 |
"""
|
| 91 |
)
|
|
@@ -99,7 +97,7 @@ def _(mo, n, np):
|
|
| 99 |
)
|
| 100 |
|
| 101 |
mo.hstack([_a, _b], justify="space-around")
|
| 102 |
-
return
|
| 103 |
|
| 104 |
|
| 105 |
@app.cell
|
|
@@ -111,7 +109,9 @@ def _(x0_widget, xdes_widget):
|
|
| 111 |
|
| 112 |
@app.cell(hide_code=True)
|
| 113 |
def _(mo):
|
| 114 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 115 |
return
|
| 116 |
|
| 117 |
|
|
@@ -134,18 +134,16 @@ def _(A, T, b, cp, mo, n, x0, xdes):
|
|
| 134 |
|
| 135 |
fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve()
|
| 136 |
mo.md(f"Achieved a fuel usage of {fuel_used:.02f}. 🚀")
|
| 137 |
-
return
|
| 138 |
|
| 139 |
|
| 140 |
@app.cell(hide_code=True)
|
| 141 |
def _(mo):
|
| 142 |
-
mo.md(
|
| 143 |
-
|
| 144 |
-
Finally, we plot the chosen inputs over time.
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
)
|
| 149 |
return
|
| 150 |
|
| 151 |
|
|
|
|
| 1 |
import marimo
|
| 2 |
|
| 3 |
+
__generated_with = "0.18.4"
|
| 4 |
app = marimo.App()
|
| 5 |
|
| 6 |
|
|
|
|
| 12 |
|
| 13 |
@app.cell(hide_code=True)
|
| 14 |
def _(mo):
|
| 15 |
+
mo.md(r"""
|
| 16 |
+
# Minimal fuel optimal control
|
|
|
|
| 17 |
|
| 18 |
+
This notebook includes an application of linear programming to controlling a
|
| 19 |
+
physical system, adapted from [Convex
|
| 20 |
+
Optimization](https://web.stanford.edu/~boyd/cvxbook/) by Boyd and Vandenberghe.
|
| 21 |
|
| 22 |
+
We consider a linear dynamical system with state $x(t) \in \mathbf{R}^n$, for $t = 0, \ldots, T$. At each time step $t = 0, \ldots, T - 1$, an actuator or input signal $u(t)$ is applied, affecting the state. The dynamics
|
| 23 |
+
of the system is given by the linear recurrence
|
| 24 |
|
| 25 |
+
\[
|
| 26 |
+
x(t + 1) = Ax(t) + bu(t), \quad t = 0, \ldots, T - 1,
|
| 27 |
+
\]
|
| 28 |
|
| 29 |
+
where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$ are given and encode how the system evolves. The initial state $x(0)$ is also given.
|
| 30 |
|
| 31 |
+
The _minimum fuel optimal control problem_ is to choose the inputs $u(0), \ldots, u(T - 1)$ so as to achieve
|
| 32 |
+
a given desired state $x_\text{des} = x(T)$ while minimizing the total fuel consumed
|
| 33 |
|
| 34 |
+
\[
|
| 35 |
+
F = \sum_{t=0}^{T - 1} f(u(t)).
|
| 36 |
+
\]
|
| 37 |
|
| 38 |
+
The function $f : \mathbf{R} \to \mathbf{R}$ tells us how much fuel is consumed as a function of the input, and is given by
|
| 39 |
|
| 40 |
+
\[
|
| 41 |
+
f(a) = \begin{cases}
|
| 42 |
+
|a| & |a| \leq 1 \\
|
| 43 |
+
2|a| - 1 & |a| > 1.
|
| 44 |
+
\end{cases}
|
| 45 |
+
\]
|
| 46 |
|
| 47 |
+
This means the fuel use is proportional to the magnitude of the signal between $-1$ and $1$, but for larger signals the marginal fuel efficiency is half.
|
| 48 |
|
| 49 |
+
**This notebook.** In this notebook we use CVXPY to formulate the minimum fuel optimal control problem as a linear program. The notebook lets you play with the initial and target states, letting you see how they affect the planned trajectory of inputs $u$.
|
| 50 |
|
| 51 |
+
First, we create the **problem data**.
|
| 52 |
+
""")
|
|
|
|
| 53 |
return
|
| 54 |
|
| 55 |
|
|
|
|
| 83 |
rf"""
|
| 84 |
|
| 85 |
Choose a value for $x_0$ ...
|
| 86 |
+
|
| 87 |
{x0_widget}
|
| 88 |
"""
|
| 89 |
)
|
|
|
|
| 97 |
)
|
| 98 |
|
| 99 |
mo.hstack([_a, _b], justify="space-around")
|
| 100 |
+
return x0_widget, xdes_widget
|
| 101 |
|
| 102 |
|
| 103 |
@app.cell
|
|
|
|
| 109 |
|
| 110 |
@app.cell(hide_code=True)
|
| 111 |
def _(mo):
|
| 112 |
+
mo.md(r"""
|
| 113 |
+
**Next, we specify the problem as a linear program using CVXPY.** This problem is linear because the objective and constraints are affine. (In fact, the objective is piecewise affine, but CVXPY rewrites it to be affine for you.)
|
| 114 |
+
""")
|
| 115 |
return
|
| 116 |
|
| 117 |
|
|
|
|
| 134 |
|
| 135 |
fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve()
|
| 136 |
mo.md(f"Achieved a fuel usage of {fuel_used:.02f}. 🚀")
|
| 137 |
+
return (u,)
|
| 138 |
|
| 139 |
|
| 140 |
@app.cell(hide_code=True)
|
| 141 |
def _(mo):
|
| 142 |
+
mo.md("""
|
| 143 |
+
Finally, we plot the chosen inputs over time.
|
|
|
|
| 144 |
|
| 145 |
+
**🌊 Try it!** Change the initial and desired states; how do fuel usage and controls change? Can you explain what you see? You can also try experimenting with the value of $T$.
|
| 146 |
+
""")
|
|
|
|
| 147 |
return
|
| 148 |
|
| 149 |
|
optimization/04_quadratic_program.py
CHANGED
|
@@ -11,7 +11,7 @@
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
-
__generated_with = "0.
|
| 15 |
app = marimo.App()
|
| 16 |
|
| 17 |
|
|
@@ -23,53 +23,49 @@ def _():
|
|
| 23 |
|
| 24 |
@app.cell(hide_code=True)
|
| 25 |
def _(mo):
|
| 26 |
-
mo.md(
|
| 27 |
-
|
| 28 |
-
# Quadratic program
|
| 29 |
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
)
|
| 61 |
return
|
| 62 |
|
| 63 |
|
| 64 |
@app.cell(hide_code=True)
|
| 65 |
def _(mo):
|
| 66 |
-
mo.md(
|
| 67 |
-
|
| 68 |
-
## Example
|
| 69 |
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
)
|
| 73 |
return
|
| 74 |
|
| 75 |
|
|
@@ -82,7 +78,9 @@ def _():
|
|
| 82 |
|
| 83 |
@app.cell(hide_code=True)
|
| 84 |
def _(mo):
|
| 85 |
-
mo.md("""
|
|
|
|
|
|
|
| 86 |
return
|
| 87 |
|
| 88 |
|
|
@@ -95,7 +93,7 @@ def _(np):
|
|
| 95 |
q = np.random.randn(n)
|
| 96 |
G = np.random.randn(m, n)
|
| 97 |
h = G @ np.random.randn(n)
|
| 98 |
-
return G, h,
|
| 99 |
|
| 100 |
|
| 101 |
@app.cell(hide_code=True)
|
|
@@ -114,7 +112,7 @@ def _(mo, np):
|
|
| 114 |
{P_widget.center()}
|
| 115 |
"""
|
| 116 |
)
|
| 117 |
-
return P_widget,
|
| 118 |
|
| 119 |
|
| 120 |
@app.cell
|
|
@@ -125,7 +123,9 @@ def _(P_widget, np):
|
|
| 125 |
|
| 126 |
@app.cell(hide_code=True)
|
| 127 |
def _(mo):
|
| 128 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 129 |
return
|
| 130 |
|
| 131 |
|
|
@@ -162,14 +162,12 @@ def _(G, P, h, plot_contours, q, x):
|
|
| 162 |
|
| 163 |
@app.cell(hide_code=True)
|
| 164 |
def _(mo):
|
| 165 |
-
mo.md(
|
| 166 |
-
|
| 167 |
-
In this plot, the gray shaded region is the feasible region (points satisfying the inequality), and the ellipses are level curves of the quadratic form.
|
| 168 |
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
)
|
| 173 |
return
|
| 174 |
|
| 175 |
|
|
@@ -178,7 +176,7 @@ def _(P, mo):
|
|
| 178 |
mo.md(
|
| 179 |
rf"""
|
| 180 |
The above contour lines were generated with
|
| 181 |
-
|
| 182 |
\[
|
| 183 |
P= \begin{{bmatrix}}
|
| 184 |
{P[0, 0]:.01f} & {P[0, 1]:.01f} \\
|
|
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
+
__generated_with = "0.18.4"
|
| 15 |
app = marimo.App()
|
| 16 |
|
| 17 |
|
|
|
|
| 23 |
|
| 24 |
@app.cell(hide_code=True)
|
| 25 |
def _(mo):
|
| 26 |
+
mo.md(r"""
|
| 27 |
+
# Quadratic program
|
|
|
|
| 28 |
|
| 29 |
+
A quadratic program is an optimization problem with a quadratic objective and
|
| 30 |
+
affine equality and inequality constraints. A common standard form is the
|
| 31 |
+
following:
|
| 32 |
|
| 33 |
+
\[
|
| 34 |
+
\begin{array}{ll}
|
| 35 |
+
\text{minimize} & (1/2)x^TPx + q^Tx\\
|
| 36 |
+
\text{subject to} & Gx \leq h \\
|
| 37 |
+
& Ax = b.
|
| 38 |
+
\end{array}
|
| 39 |
+
\]
|
| 40 |
|
| 41 |
+
Here $P \in \mathcal{S}^{n}_+$, $q \in \mathcal{R}^n$, $G \in \mathcal{R}^{m \times n}$, $h \in \mathcal{R}^m$, $A \in \mathcal{R}^{p \times n}$, and $b \in \mathcal{R}^p$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Gx \leq h$ is elementwise.
|
| 42 |
|
| 43 |
+
**Why quadratic programming?** Quadratic programs are convex optimization problems that generalize both least-squares and linear programming. They can be solved efficiently and reliably, even in real-time.
|
| 44 |
|
| 45 |
+
**An example from finance.** A simple example of a quadratic program arises in finance. Suppose we have $n$ different stocks, an estimate $r \in \mathcal{R}^n$ of the expected return on each stock, and an estimate $\Sigma \in \mathcal{S}^{n}_+$ of the covariance of the returns. Then we solve the optimization problem
|
| 46 |
|
| 47 |
+
\[
|
| 48 |
+
\begin{array}{ll}
|
| 49 |
+
\text{minimize} & (1/2)x^T\Sigma x - r^Tx\\
|
| 50 |
+
\text{subject to} & x \geq 0 \\
|
| 51 |
+
& \mathbf{1}^Tx = 1,
|
| 52 |
+
\end{array}
|
| 53 |
+
\]
|
| 54 |
|
| 55 |
+
to find a nonnegative portfolio allocation $x \in \mathcal{R}^n_+$ that optimally balances expected return and variance of return.
|
| 56 |
|
| 57 |
+
When we solve a quadratic program, in addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$ corresponding to the inequality constraints. A positive entry $\lambda^\star_i$ indicates that the constraint $g_i^Tx \leq h_i$ holds with equality for $x^\star$ and suggests that changing $h_i$ would change the optimal value.
|
| 58 |
+
""")
|
|
|
|
| 59 |
return
|
| 60 |
|
| 61 |
|
| 62 |
@app.cell(hide_code=True)
|
| 63 |
def _(mo):
|
| 64 |
+
mo.md(r"""
|
| 65 |
+
## Example
|
|
|
|
| 66 |
|
| 67 |
+
In this example, we use CVXPY to construct and solve a quadratic program.
|
| 68 |
+
""")
|
|
|
|
| 69 |
return
|
| 70 |
|
| 71 |
|
|
|
|
| 78 |
|
| 79 |
@app.cell(hide_code=True)
|
| 80 |
def _(mo):
|
| 81 |
+
mo.md("""
|
| 82 |
+
First we generate synthetic data. In this problem, we include only inequality constraints, no equality constraints.
|
| 83 |
+
""")
|
| 84 |
return
|
| 85 |
|
| 86 |
|
|
|
|
| 93 |
q = np.random.randn(n)
|
| 94 |
G = np.random.randn(m, n)
|
| 95 |
h = G @ np.random.randn(n)
|
| 96 |
+
return G, h, n, q
|
| 97 |
|
| 98 |
|
| 99 |
@app.cell(hide_code=True)
|
|
|
|
| 112 |
{P_widget.center()}
|
| 113 |
"""
|
| 114 |
)
|
| 115 |
+
return (P_widget,)
|
| 116 |
|
| 117 |
|
| 118 |
@app.cell
|
|
|
|
| 123 |
|
| 124 |
@app.cell(hide_code=True)
|
| 125 |
def _(mo):
|
| 126 |
+
mo.md(r"""
|
| 127 |
+
Next, we specify the problem. Notice that we use the `quad_form` function from CVXPY to create the quadratic form $x^TPx$.
|
| 128 |
+
""")
|
| 129 |
return
|
| 130 |
|
| 131 |
|
|
|
|
| 162 |
|
| 163 |
@app.cell(hide_code=True)
|
| 164 |
def _(mo):
|
| 165 |
+
mo.md(r"""
|
| 166 |
+
In this plot, the gray shaded region is the feasible region (points satisfying the inequality), and the ellipses are level curves of the quadratic form.
|
|
|
|
| 167 |
|
| 168 |
+
**🌊 Try it!** Try changing the entries of $P$ above with your mouse. How do the
|
| 169 |
+
level curves and the optimal value of $x$ change? Can you explain what you see?
|
| 170 |
+
""")
|
|
|
|
| 171 |
return
|
| 172 |
|
| 173 |
|
|
|
|
| 176 |
mo.md(
|
| 177 |
rf"""
|
| 178 |
The above contour lines were generated with
|
| 179 |
+
|
| 180 |
\[
|
| 181 |
P= \begin{{bmatrix}}
|
| 182 |
{P[0, 0]:.01f} & {P[0, 1]:.01f} \\
|
optimization/05_portfolio_optimization.py
CHANGED
|
@@ -12,7 +12,7 @@
|
|
| 12 |
|
| 13 |
import marimo
|
| 14 |
|
| 15 |
-
__generated_with = "0.
|
| 16 |
app = marimo.App()
|
| 17 |
|
| 18 |
|
|
@@ -24,88 +24,78 @@ def _():
|
|
| 24 |
|
| 25 |
@app.cell(hide_code=True)
|
| 26 |
def _(mo):
|
| 27 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 28 |
return
|
| 29 |
|
| 30 |
|
| 31 |
@app.cell(hide_code=True)
|
| 32 |
def _(mo):
|
| 33 |
-
mo.md(
|
| 34 |
-
|
| 35 |
-
In this example we show how to use CVXPY to design a financial portfolio; this is called _portfolio optimization_.
|
| 36 |
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
)
|
| 41 |
return
|
| 42 |
|
| 43 |
|
| 44 |
@app.cell(hide_code=True)
|
| 45 |
def _(mo):
|
| 46 |
-
mo.md(
|
| 47 |
-
|
| 48 |
-
## Asset returns and risk
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
|
| 54 |
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
)
|
| 58 |
return
|
| 59 |
|
| 60 |
|
| 61 |
@app.cell(hide_code=True)
|
| 62 |
def _(mo):
|
| 63 |
-
mo.md(
|
| 64 |
-
|
| 65 |
-
## Classical (Markowitz) portfolio optimization
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
)
|
| 70 |
return
|
| 71 |
|
| 72 |
|
| 73 |
@app.cell(hide_code=True)
|
| 74 |
def _(mo):
|
| 75 |
-
mo.md(
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
"""
|
| 83 |
-
)
|
| 84 |
return
|
| 85 |
|
| 86 |
|
| 87 |
@app.cell(hide_code=True)
|
| 88 |
def _(mo):
|
| 89 |
-
mo.md(
|
| 90 |
-
|
| 91 |
-
where $w \in {\bf R}^n$ is the optimization variable and $\gamma >0$ is a constant called the *risk aversion parameter*. The constraint $\mathbf{1}^Tw = 1$ says the portfolio weight vector must sum to 1, and $w \geq 0$ says that we can't invest a negative amount into any asset.
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
)
|
| 97 |
return
|
| 98 |
|
| 99 |
|
| 100 |
@app.cell(hide_code=True)
|
| 101 |
def _(mo):
|
| 102 |
-
mo.md(
|
| 103 |
-
|
| 104 |
-
## Example
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
)
|
| 109 |
return
|
| 110 |
|
| 111 |
|
|
@@ -148,7 +138,7 @@ def _(mo, np):
|
|
| 148 |
_Try changing the entries of $\mu$ and see how the plots below change._
|
| 149 |
"""
|
| 150 |
)
|
| 151 |
-
return mu_widget,
|
| 152 |
|
| 153 |
|
| 154 |
@app.cell
|
|
@@ -163,7 +153,9 @@ def _(mu_widget, np):
|
|
| 163 |
|
| 164 |
@app.cell(hide_code=True)
|
| 165 |
def _(mo):
|
| 166 |
-
mo.md("""
|
|
|
|
|
|
|
| 167 |
return
|
| 168 |
|
| 169 |
|
|
@@ -176,7 +168,7 @@ def _(Sigma, mu, n):
|
|
| 176 |
ret = mu.T @ w
|
| 177 |
risk = cp.quad_form(w, Sigma)
|
| 178 |
prob = cp.Problem(cp.Maximize(ret - gamma * risk), [cp.sum(w) == 1, w >= 0])
|
| 179 |
-
return cp, gamma, prob, ret, risk
|
| 180 |
|
| 181 |
|
| 182 |
@app.cell
|
|
@@ -195,7 +187,9 @@ def _(cp, gamma, np, prob, ret, risk):
|
|
| 195 |
|
| 196 |
@app.cell(hide_code=True)
|
| 197 |
def _(mo):
|
| 198 |
-
mo.md("""
|
|
|
|
|
|
|
| 199 |
return
|
| 200 |
|
| 201 |
|
|
@@ -218,17 +212,15 @@ def _(Sigma, cp, gamma_vals, mu, n, ret_data, risk_data):
|
|
| 218 |
plt.xlabel("Standard deviation")
|
| 219 |
plt.ylabel("Return")
|
| 220 |
plt.show()
|
| 221 |
-
return
|
| 222 |
|
| 223 |
|
| 224 |
@app.cell(hide_code=True)
|
| 225 |
def _(mo):
|
| 226 |
-
mo.md(
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
"""
|
| 231 |
-
)
|
| 232 |
return
|
| 233 |
|
| 234 |
|
|
@@ -250,7 +242,7 @@ def _(gamma, gamma_vals, markers_on, np, plt, prob, ret, risk):
|
|
| 250 |
plt.ylabel("Density")
|
| 251 |
plt.legend(loc="upper right")
|
| 252 |
plt.show()
|
| 253 |
-
return
|
| 254 |
|
| 255 |
|
| 256 |
if __name__ == "__main__":
|
|
|
|
| 12 |
|
| 13 |
import marimo
|
| 14 |
|
| 15 |
+
__generated_with = "0.18.4"
|
| 16 |
app = marimo.App()
|
| 17 |
|
| 18 |
|
|
|
|
| 24 |
|
| 25 |
@app.cell(hide_code=True)
|
| 26 |
def _(mo):
|
| 27 |
+
mo.md(r"""
|
| 28 |
+
# Portfolio optimization
|
| 29 |
+
""")
|
| 30 |
return
|
| 31 |
|
| 32 |
|
| 33 |
@app.cell(hide_code=True)
|
| 34 |
def _(mo):
|
| 35 |
+
mo.md(r"""
|
| 36 |
+
In this example we show how to use CVXPY to design a financial portfolio; this is called _portfolio optimization_.
|
|
|
|
| 37 |
|
| 38 |
+
In portfolio optimization we have some amount of money to invest in any of $n$ different assets.
|
| 39 |
+
We choose what fraction $w_i$ of our money to invest in each asset $i$, $i=1, \ldots, n$. The goal is to maximize return of the portfolio while minimizing risk.
|
| 40 |
+
""")
|
|
|
|
| 41 |
return
|
| 42 |
|
| 43 |
|
| 44 |
@app.cell(hide_code=True)
|
| 45 |
def _(mo):
|
| 46 |
+
mo.md(r"""
|
| 47 |
+
## Asset returns and risk
|
|
|
|
| 48 |
|
| 49 |
+
We will only model investments held for one period. The initial prices are $p_i > 0$. The end of period prices are $p_i^+ >0$. The asset (fractional) returns are $r_i = (p_i^+-p_i)/p_i$. The portfolio (fractional) return is $R = r^Tw$.
|
| 50 |
|
| 51 |
+
A common model is that $r$ is a random variable with mean ${\bf E}r = \mu$ and covariance ${\bf E}(r-\mu)(r-\mu)^T = \Sigma$.
|
| 52 |
+
It follows that $R$ is a random variable with ${\bf E}R = \mu^T w$ and ${\bf var}(R) = w^T\Sigma w$. In real-world applications, $\mu$ and $\Sigma$ are estimated from data and models, and $w$ is chosen using a library like CVXPY.
|
| 53 |
|
| 54 |
+
${\bf E}R$ is the (mean) *return* of the portfolio. ${\bf var}(R)$ is the *risk* of the portfolio. Portfolio optimization has two competing objectives: high return and low risk.
|
| 55 |
+
""")
|
|
|
|
| 56 |
return
|
| 57 |
|
| 58 |
|
| 59 |
@app.cell(hide_code=True)
|
| 60 |
def _(mo):
|
| 61 |
+
mo.md(r"""
|
| 62 |
+
## Classical (Markowitz) portfolio optimization
|
|
|
|
| 63 |
|
| 64 |
+
Classical (Markowitz) portfolio optimization solves the optimization problem
|
| 65 |
+
""")
|
|
|
|
| 66 |
return
|
| 67 |
|
| 68 |
|
| 69 |
@app.cell(hide_code=True)
|
| 70 |
def _(mo):
|
| 71 |
+
mo.md(r"""
|
| 72 |
+
$$
|
| 73 |
+
\begin{array}{ll} \text{maximize} & \mu^T w - \gamma w^T\Sigma w\\
|
| 74 |
+
\text{subject to} & {\bf 1}^T w = 1, w \geq 0,
|
| 75 |
+
\end{array}
|
| 76 |
+
$$
|
| 77 |
+
""")
|
|
|
|
|
|
|
| 78 |
return
|
| 79 |
|
| 80 |
|
| 81 |
@app.cell(hide_code=True)
|
| 82 |
def _(mo):
|
| 83 |
+
mo.md(r"""
|
| 84 |
+
where $w \in {\bf R}^n$ is the optimization variable and $\gamma >0$ is a constant called the *risk aversion parameter*. The constraint $\mathbf{1}^Tw = 1$ says the portfolio weight vector must sum to 1, and $w \geq 0$ says that we can't invest a negative amount into any asset.
|
|
|
|
| 85 |
|
| 86 |
+
The objective $\mu^Tw - \gamma w^T\Sigma w$ is the *risk-adjusted return*. Varying $\gamma$ gives the optimal *risk-return trade-off*.
|
| 87 |
+
We can get the same risk-return trade-off by fixing return and minimizing risk.
|
| 88 |
+
""")
|
|
|
|
| 89 |
return
|
| 90 |
|
| 91 |
|
| 92 |
@app.cell(hide_code=True)
|
| 93 |
def _(mo):
|
| 94 |
+
mo.md(r"""
|
| 95 |
+
## Example
|
|
|
|
| 96 |
|
| 97 |
+
In the following code we compute and plot the optimal risk-return trade-off for $10$ assets. First we generate random problem data $\mu$ and $\Sigma$.
|
| 98 |
+
""")
|
|
|
|
| 99 |
return
|
| 100 |
|
| 101 |
|
|
|
|
| 138 |
_Try changing the entries of $\mu$ and see how the plots below change._
|
| 139 |
"""
|
| 140 |
)
|
| 141 |
+
return (mu_widget,)
|
| 142 |
|
| 143 |
|
| 144 |
@app.cell
|
|
|
|
| 153 |
|
| 154 |
@app.cell(hide_code=True)
|
| 155 |
def _(mo):
|
| 156 |
+
mo.md("""
|
| 157 |
+
Next, we solve the problem for 100 different values of $\gamma$.
|
| 158 |
+
""")
|
| 159 |
return
|
| 160 |
|
| 161 |
|
|
|
|
| 168 |
ret = mu.T @ w
|
| 169 |
risk = cp.quad_form(w, Sigma)
|
| 170 |
prob = cp.Problem(cp.Maximize(ret - gamma * risk), [cp.sum(w) == 1, w >= 0])
|
| 171 |
+
return cp, gamma, prob, ret, risk
|
| 172 |
|
| 173 |
|
| 174 |
@app.cell
|
|
|
|
| 187 |
|
| 188 |
@app.cell(hide_code=True)
|
| 189 |
def _(mo):
|
| 190 |
+
mo.md("""
|
| 191 |
+
Plotted below are the risk-return trade-offs for two values of $\gamma$ (blue squares), and the risk-return trade-offs for investing fully in each asset (red circles).
|
| 192 |
+
""")
|
| 193 |
return
|
| 194 |
|
| 195 |
|
|
|
|
| 212 |
plt.xlabel("Standard deviation")
|
| 213 |
plt.ylabel("Return")
|
| 214 |
plt.show()
|
| 215 |
+
return markers_on, plt
|
| 216 |
|
| 217 |
|
| 218 |
@app.cell(hide_code=True)
|
| 219 |
def _(mo):
|
| 220 |
+
mo.md(r"""
|
| 221 |
+
We plot below the return distributions for the two risk aversion values marked on the trade-off curve.
|
| 222 |
+
Notice that the probability of a loss is near 0 for the low-risk value and far above 0 for the high-risk value.
|
| 223 |
+
""")
|
|
|
|
|
|
|
| 224 |
return
|
| 225 |
|
| 226 |
|
|
|
|
| 242 |
plt.ylabel("Density")
|
| 243 |
plt.legend(loc="upper right")
|
| 244 |
plt.show()
|
| 245 |
+
return
|
| 246 |
|
| 247 |
|
| 248 |
if __name__ == "__main__":
|
optimization/06_convex_optimization.py
CHANGED
|
@@ -9,7 +9,7 @@
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
-
__generated_with = "0.
|
| 13 |
app = marimo.App()
|
| 14 |
|
| 15 |
|
|
@@ -21,41 +21,39 @@ def _():
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
-
mo.md(
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
"""
|
| 40 |
-
)
|
| 41 |
return
|
| 42 |
|
| 43 |
|
| 44 |
@app.cell(hide_code=True)
|
| 45 |
def _(mo):
|
| 46 |
-
mo.md(
|
| 47 |
-
|
| 48 |
-
**🛑 Stop!** Before proceeding, read the CVXPY docs to learn about atomic functions and the DCP ruleset:
|
| 49 |
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
)
|
| 53 |
return
|
| 54 |
|
| 55 |
|
| 56 |
@app.cell(hide_code=True)
|
| 57 |
def _(mo):
|
| 58 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 59 |
return
|
| 60 |
|
| 61 |
|
|
@@ -71,7 +69,7 @@ def _(mo):
|
|
| 71 |
constraints = [x >= 0, cp.sum(x) == 1]
|
| 72 |
problem = cp.Problem(cp.Maximize(objective), constraints)
|
| 73 |
mo.md(f"Is my problem DCP? `{problem.is_dcp()}`")
|
| 74 |
-
return
|
| 75 |
|
| 76 |
|
| 77 |
@app.cell
|
|
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
+
__generated_with = "0.18.4"
|
| 13 |
app = marimo.App()
|
| 14 |
|
| 15 |
|
|
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
+
mo.md(r"""
|
| 25 |
+
# Convex optimization
|
| 26 |
+
|
| 27 |
+
In the previous tutorials, we learned about least squares, linear programming,
|
| 28 |
+
and quadratic programming, and saw applications of each. We also learned that these problem
|
| 29 |
+
classes can be solved efficiently and reliably using CVXPY. That's because these problem classes are a special
|
| 30 |
+
case of a more general class of tractable problems, called **convex optimization problems.**
|
| 31 |
+
|
| 32 |
+
A convex optimization problem is an optimization problem that minimizes a convex
|
| 33 |
+
function, subject to affine equality constraints and convex inequality
|
| 34 |
+
constraints ($f_i(x)\leq 0$, where $f_i$ is a convex function).
|
| 35 |
+
|
| 36 |
+
**CVXPY.** CVXPY lets you specify and solve any convex optimization problem,
|
| 37 |
+
abstracting away the more specific problem classes. You start with CVXPY's **atomic functions**, like `cp.exp`, `cp.log`, and `cp.square`, and compose them to build more complex convex functions. As long as the functions are composed in the right way — as long as they are "DCP-compliant" — your resulting problem will be convex and solvable by CVXPY.
|
| 38 |
+
""")
|
|
|
|
|
|
|
| 39 |
return
|
| 40 |
|
| 41 |
|
| 42 |
@app.cell(hide_code=True)
|
| 43 |
def _(mo):
|
| 44 |
+
mo.md(r"""
|
| 45 |
+
**🛑 Stop!** Before proceeding, read the CVXPY docs to learn about atomic functions and the DCP ruleset:
|
|
|
|
| 46 |
|
| 47 |
+
https://www.cvxpy.org/tutorial/index.html
|
| 48 |
+
""")
|
|
|
|
| 49 |
return
|
| 50 |
|
| 51 |
|
| 52 |
@app.cell(hide_code=True)
|
| 53 |
def _(mo):
|
| 54 |
+
mo.md(r"""
|
| 55 |
+
**Is my problem DCP-compliant?** Below is a sample CVXPY problem. It is DCP-compliant. Try typing in other problems and seeing if they are DCP-compliant. If you know your problem is convex, there exists a way to express it in a DCP-compliant way.
|
| 56 |
+
""")
|
| 57 |
return
|
| 58 |
|
| 59 |
|
|
|
|
| 69 |
constraints = [x >= 0, cp.sum(x) == 1]
|
| 70 |
problem = cp.Problem(cp.Maximize(objective), constraints)
|
| 71 |
mo.md(f"Is my problem DCP? `{problem.is_dcp()}`")
|
| 72 |
+
return problem, x
|
| 73 |
|
| 74 |
|
| 75 |
@app.cell
|
optimization/07_sdp.py
CHANGED
|
@@ -10,7 +10,7 @@
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
-
__generated_with = "0.
|
| 14 |
app = marimo.App()
|
| 15 |
|
| 16 |
|
|
@@ -22,49 +22,47 @@ def _():
|
|
| 22 |
|
| 23 |
@app.cell(hide_code=True)
|
| 24 |
def _(mo):
|
| 25 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 26 |
return
|
| 27 |
|
| 28 |
|
| 29 |
@app.cell(hide_code=True)
|
| 30 |
def _(mo):
|
| 31 |
-
mo.md(
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
\
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
\
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
"""
|
| 55 |
-
)
|
| 56 |
return
|
| 57 |
|
| 58 |
|
| 59 |
@app.cell(hide_code=True)
|
| 60 |
def _(mo):
|
| 61 |
-
mo.md(
|
| 62 |
-
|
| 63 |
-
## Example
|
| 64 |
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
)
|
| 68 |
return
|
| 69 |
|
| 70 |
|
|
@@ -87,7 +85,7 @@ def _(np):
|
|
| 87 |
for i in range(p):
|
| 88 |
A.append(np.random.randn(n, n))
|
| 89 |
b.append(np.random.randn())
|
| 90 |
-
return A, C, b,
|
| 91 |
|
| 92 |
|
| 93 |
@app.cell
|
|
@@ -101,7 +99,7 @@ def _(A, C, b, cp, n, p):
|
|
| 101 |
constraints += [cp.trace(A[i] @ X) == b[i] for i in range(p)]
|
| 102 |
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
|
| 103 |
_ = prob.solve()
|
| 104 |
-
return X,
|
| 105 |
|
| 106 |
|
| 107 |
@app.cell
|
|
@@ -111,7 +109,7 @@ def _(X, mo, prob, wigglystuff):
|
|
| 111 |
The optimal value is {prob.value:0.4f}.
|
| 112 |
|
| 113 |
A solution for $X$ (rounded to the nearest decimal) is:
|
| 114 |
-
|
| 115 |
{mo.ui.anywidget(wigglystuff.Matrix(X.value)).center()}
|
| 116 |
"""
|
| 117 |
)
|
|
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
+
__generated_with = "0.18.4"
|
| 14 |
app = marimo.App()
|
| 15 |
|
| 16 |
|
|
|
|
| 22 |
|
| 23 |
@app.cell(hide_code=True)
|
| 24 |
def _(mo):
|
| 25 |
+
mo.md(r"""
|
| 26 |
+
# Semidefinite program
|
| 27 |
+
""")
|
| 28 |
return
|
| 29 |
|
| 30 |
|
| 31 |
@app.cell(hide_code=True)
|
| 32 |
def _(mo):
|
| 33 |
+
mo.md(r"""
|
| 34 |
+
_This notebook introduces an advanced topic._ A semidefinite program (SDP) is an optimization problem of the form
|
| 35 |
+
|
| 36 |
+
\[
|
| 37 |
+
\begin{array}{ll}
|
| 38 |
+
\text{minimize} & \mathbf{tr}(CX) \\
|
| 39 |
+
\text{subject to} & \mathbf{tr}(A_iX) = b_i, \quad i=1,\ldots,p \\
|
| 40 |
+
& X \succeq 0,
|
| 41 |
+
\end{array}
|
| 42 |
+
\]
|
| 43 |
+
|
| 44 |
+
where $\mathbf{tr}$ is the trace function, $X \in \mathcal{S}^{n}$ is the optimization variable, $C, A_1, \ldots, A_p \in \mathcal{S}^{n}$ and $b_1, \ldots, b_p \in \mathcal{R}$ are problem data, and $X \succeq 0$ is a matrix inequality. Here $\mathcal{S}^{n}$ denotes the set of $n$-by-$n$ symmetric matrices.
|
| 45 |
+
|
| 46 |
+
**Example.** An example of an SDP is to complete a covariance matrix $\tilde \Sigma \in \mathcal{S}^{n}_+$ with missing entries $M \subset \{1,\ldots,n\} \times \{1,\ldots,n\}$:
|
| 47 |
+
|
| 48 |
+
\[
|
| 49 |
+
\begin{array}{ll}
|
| 50 |
+
\text{minimize} & 0 \\
|
| 51 |
+
\text{subject to} & \Sigma_{ij} = \tilde \Sigma_{ij}, \quad (i,j) \notin M \\
|
| 52 |
+
& \Sigma \succeq 0,
|
| 53 |
+
\end{array}
|
| 54 |
+
\]
|
| 55 |
+
""")
|
|
|
|
|
|
|
| 56 |
return
|
| 57 |
|
| 58 |
|
| 59 |
@app.cell(hide_code=True)
|
| 60 |
def _(mo):
|
| 61 |
+
mo.md(r"""
|
| 62 |
+
## Example
|
|
|
|
| 63 |
|
| 64 |
+
In the following code, we show how to specify and solve an SDP with CVXPY.
|
| 65 |
+
""")
|
|
|
|
| 66 |
return
|
| 67 |
|
| 68 |
|
|
|
|
| 85 |
for i in range(p):
|
| 86 |
A.append(np.random.randn(n, n))
|
| 87 |
b.append(np.random.randn())
|
| 88 |
+
return A, C, b, n, p
|
| 89 |
|
| 90 |
|
| 91 |
@app.cell
|
|
|
|
| 99 |
constraints += [cp.trace(A[i] @ X) == b[i] for i in range(p)]
|
| 100 |
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
|
| 101 |
_ = prob.solve()
|
| 102 |
+
return X, prob
|
| 103 |
|
| 104 |
|
| 105 |
@app.cell
|
|
|
|
| 109 |
The optimal value is {prob.value:0.4f}.
|
| 110 |
|
| 111 |
A solution for $X$ (rounded to the nearest decimal) is:
|
| 112 |
+
|
| 113 |
{mo.ui.anywidget(wigglystuff.Matrix(X.value)).center()}
|
| 114 |
"""
|
| 115 |
)
|
optimization/README.md
CHANGED
|
@@ -1,3 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Learn optimization
|
| 2 |
|
| 3 |
This collection of marimo notebooks teaches you the basics of convex
|
|
@@ -30,4 +35,4 @@ to a notebook's URL: [marimo.app/github.com/marimo-team/learn/blob/main/optimiza
|
|
| 30 |
|
| 31 |
**Thanks to all our notebook authors!**
|
| 32 |
|
| 33 |
-
* [Akshay Agrawal](https://github.com/akshayka)
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Readme
|
| 3 |
+
marimo-version: 0.18.4
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
# Learn optimization
|
| 7 |
|
| 8 |
This collection of marimo notebooks teaches you the basics of convex
|
|
|
|
| 35 |
|
| 36 |
**Thanks to all our notebook authors!**
|
| 37 |
|
| 38 |
+
* [Akshay Agrawal](https://github.com/akshayka)
|
polars/01_why_polars.py
CHANGED
|
@@ -9,7 +9,7 @@
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
-
__generated_with = "0.
|
| 13 |
app = marimo.App(width="medium")
|
| 14 |
|
| 15 |
|
|
@@ -21,17 +21,15 @@ def _():
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
-
mo.md(
|
| 25 |
-
|
| 26 |
-
# An introduction to Polars
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
)
|
| 35 |
return
|
| 36 |
|
| 37 |
|
|
@@ -48,46 +46,40 @@ def _():
|
|
| 48 |
}
|
| 49 |
)
|
| 50 |
df_pl
|
| 51 |
-
return
|
| 52 |
|
| 53 |
|
| 54 |
@app.cell(hide_code=True)
|
| 55 |
def _(mo):
|
| 56 |
-
mo.md(
|
| 57 |
-
|
| 58 |
-
Unlike Python's earliest DataFrame library Pandas, Polars was designed with performance and usability in mind — Polars can scale to large datasets with ease while maintaining a simple and intuitive API.
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
)
|
| 65 |
return
|
| 66 |
|
| 67 |
|
| 68 |
@app.cell(hide_code=True)
|
| 69 |
def _(mo):
|
| 70 |
-
mo.md(
|
| 71 |
-
|
| 72 |
-
## Choosing Polars over Pandas
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
)
|
| 77 |
return
|
| 78 |
|
| 79 |
|
| 80 |
@app.cell(hide_code=True)
|
| 81 |
def _(mo):
|
| 82 |
-
mo.md(
|
| 83 |
-
|
| 84 |
-
### Intuitive syntax
|
| 85 |
|
| 86 |
-
|
| 87 |
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
)
|
| 91 |
return
|
| 92 |
|
| 93 |
|
|
@@ -112,12 +104,14 @@ def _():
|
|
| 112 |
# step-2: groupby and aggregation
|
| 113 |
result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean()
|
| 114 |
result_pd
|
| 115 |
-
return
|
| 116 |
|
| 117 |
|
| 118 |
@app.cell(hide_code=True)
|
| 119 |
def _(mo):
|
| 120 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 121 |
return
|
| 122 |
|
| 123 |
|
|
@@ -137,17 +131,15 @@ def _(pl):
|
|
| 137 |
# filter, groupby and aggregation using method chaining
|
| 138 |
result_pl = data_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
|
| 139 |
result_pl
|
| 140 |
-
return data_pl,
|
| 141 |
|
| 142 |
|
| 143 |
@app.cell(hide_code=True)
|
| 144 |
def _(mo):
|
| 145 |
-
mo.md(
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
"""
|
| 150 |
-
)
|
| 151 |
return
|
| 152 |
|
| 153 |
|
|
@@ -155,159 +147,145 @@ def _(mo):
|
|
| 155 |
def _(data_pl):
|
| 156 |
result = data_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
|
| 157 |
result
|
| 158 |
-
return
|
| 159 |
|
| 160 |
|
| 161 |
@app.cell(hide_code=True)
|
| 162 |
def _(mo):
|
| 163 |
-
mo.md(
|
| 164 |
-
|
| 165 |
-
### A large collection of built-in APIs
|
| 166 |
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
)
|
| 170 |
return
|
| 171 |
|
| 172 |
|
| 173 |
@app.cell(hide_code=True)
|
| 174 |
def _(mo):
|
| 175 |
-
mo.md(
|
| 176 |
-
|
| 177 |
-
### Query optimization 📈
|
| 178 |
|
| 179 |
-
|
| 180 |
|
| 181 |
-
|
| 182 |
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
)
|
| 189 |
-
```
|
| 190 |
-
|
| 191 |
-
If executed eagerly, the `groupby` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `groupby` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency.
|
| 192 |
-
"""
|
| 193 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 194 |
return
|
| 195 |
|
| 196 |
|
| 197 |
@app.cell(hide_code=True)
|
| 198 |
def _(mo):
|
| 199 |
-
mo.md(
|
| 200 |
-
|
| 201 |
-
### Scalability — handling large datasets in memory ⬆️
|
| 202 |
|
| 203 |
-
|
| 204 |
|
| 205 |
-
|
| 206 |
-
|
| 207 |
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
|
| 213 |
-
|
| 214 |
|
| 215 |
-
|
| 216 |
-
|
| 217 |
-
|
| 218 |
|
| 219 |
-
|
| 220 |
|
| 221 |
-
|
| 222 |
-
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
)
|
| 227 |
return
|
| 228 |
|
| 229 |
|
| 230 |
@app.cell(hide_code=True)
|
| 231 |
def _(mo):
|
| 232 |
-
mo.md(
|
| 233 |
-
|
| 234 |
-
### Compatibility with other machine learning libraries 🤝
|
| 235 |
|
| 236 |
-
|
| 237 |
|
| 238 |
-
|
| 239 |
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
|
| 254 |
-
|
| 255 |
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
|
| 264 |
-
)
|
| 265 |
return
|
| 266 |
|
| 267 |
|
| 268 |
@app.cell(hide_code=True)
|
| 269 |
def _(mo):
|
| 270 |
-
mo.md(
|
| 271 |
-
|
| 272 |
-
### Easy to use, with room for power users
|
| 273 |
|
| 274 |
-
|
| 275 |
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
| 280 |
|
| 281 |
-
|
| 282 |
-
|
| 283 |
-
)
|
| 284 |
return
|
| 285 |
|
| 286 |
|
| 287 |
@app.cell(hide_code=True)
|
| 288 |
def _(mo):
|
| 289 |
-
mo.md(
|
| 290 |
-
|
| 291 |
-
## Why not PySpark?
|
| 292 |
|
| 293 |
-
|
| 294 |
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
)
|
| 298 |
return
|
| 299 |
|
| 300 |
|
| 301 |
@app.cell(hide_code=True)
|
| 302 |
def _(mo):
|
| 303 |
-
mo.md(
|
| 304 |
-
|
| 305 |
-
## 🔖 References
|
| 306 |
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
|
| 310 |
-
)
|
| 311 |
return
|
| 312 |
|
| 313 |
|
|
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
+
__generated_with = "0.18.4"
|
| 13 |
app = marimo.App(width="medium")
|
| 14 |
|
| 15 |
|
|
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
+
mo.md("""
|
| 25 |
+
# An introduction to Polars
|
|
|
|
| 26 |
|
| 27 |
+
_By [Koushik Khan](https://github.com/koushikkhan)._
|
| 28 |
|
| 29 |
+
This notebook provides a bird's-eye overview of [Polars](https://pola.rs/), a fast and user-friendly data manipulation library for Python, and compares it to alternatives like Pandas and PySpark.
|
| 30 |
|
| 31 |
+
Like Pandas and PySpark, the central data structure in Polars is **the DataFrame**, a tabular data structure consisting of named columns. For example, the next cell constructs a DataFrame that records the gender, age, and height in centimeters for a number of individuals.
|
| 32 |
+
""")
|
|
|
|
| 33 |
return
|
| 34 |
|
| 35 |
|
|
|
|
| 46 |
}
|
| 47 |
)
|
| 48 |
df_pl
|
| 49 |
+
return (pl,)
|
| 50 |
|
| 51 |
|
| 52 |
@app.cell(hide_code=True)
|
| 53 |
def _(mo):
|
| 54 |
+
mo.md("""
|
| 55 |
+
Unlike Python's earliest DataFrame library Pandas, Polars was designed with performance and usability in mind — Polars can scale to large datasets with ease while maintaining a simple and intuitive API.
|
|
|
|
| 56 |
|
| 57 |
+
Polars' performance is due to a number of factors, including its implementation in Rust and its ability to perform operations in a parallelized and vectorized manner. It supports a wide range of data types, advanced query optimizations, and seamless integration with other Python libraries, making it a versatile tool for data scientists, engineers, and analysts. Additionally, Polars provides a lazy API for deferred execution, allowing users to optimize their workflows by chaining operations and executing them in a single pass.
|
| 58 |
|
| 59 |
+
With its focus on speed, scalability, and ease of use, Polars is quickly becoming a go-to choice for data professionals looking to streamline their data processing pipelines and tackle large-scale data challenges.
|
| 60 |
+
""")
|
|
|
|
| 61 |
return
|
| 62 |
|
| 63 |
|
| 64 |
@app.cell(hide_code=True)
|
| 65 |
def _(mo):
|
| 66 |
+
mo.md("""
|
| 67 |
+
## Choosing Polars over Pandas
|
|
|
|
| 68 |
|
| 69 |
+
In this section we'll give a few reasons why Polars is a better choice than Pandas, along with examples.
|
| 70 |
+
""")
|
|
|
|
| 71 |
return
|
| 72 |
|
| 73 |
|
| 74 |
@app.cell(hide_code=True)
|
| 75 |
def _(mo):
|
| 76 |
+
mo.md("""
|
| 77 |
+
### Intuitive syntax
|
|
|
|
| 78 |
|
| 79 |
+
Polars' syntax is similar to PySpark and intuitive like SQL, making heavy use of **method chaining**. This makes it easy for data professionals to transition to Polars, and leads to an API that is more concise and readable than Pandas.
|
| 80 |
|
| 81 |
+
**Example.** In the next few cells, we contrast the code to perform a basic filter and aggregation of data with Pandas to the code required to accomplish the same task with `Polars`.
|
| 82 |
+
""")
|
|
|
|
| 83 |
return
|
| 84 |
|
| 85 |
|
|
|
|
| 104 |
# step-2: groupby and aggregation
|
| 105 |
result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean()
|
| 106 |
result_pd
|
| 107 |
+
return
|
| 108 |
|
| 109 |
|
| 110 |
@app.cell(hide_code=True)
|
| 111 |
def _(mo):
|
| 112 |
+
mo.md(r"""
|
| 113 |
+
The same example can be worked out in Polars more concisely, using method chaining. Notice how the Polars code is essentially as readable as English.
|
| 114 |
+
""")
|
| 115 |
return
|
| 116 |
|
| 117 |
|
|
|
|
| 131 |
# filter, groupby and aggregation using method chaining
|
| 132 |
result_pl = data_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
|
| 133 |
result_pl
|
| 134 |
+
return (data_pl,)
|
| 135 |
|
| 136 |
|
| 137 |
@app.cell(hide_code=True)
|
| 138 |
def _(mo):
|
| 139 |
+
mo.md("""
|
| 140 |
+
Notice how Polars uses a *method-chaining* approach, similar to PySpark, which makes the code more readable and expressive while expressing the entire query in a *single statement*.
|
| 141 |
+
Additionally, Polars *natively* supports SQL-like operations, allowing you to write SQL queries directly against a Polars DataFrame:
|
| 142 |
+
""")
|
|
|
|
|
|
|
| 143 |
return
|
| 144 |
|
| 145 |
|
|
|
|
| 147 |
def _(data_pl):
|
| 148 |
result = data_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
|
| 149 |
result
|
| 150 |
+
return
|
| 151 |
|
| 152 |
|
| 153 |
@app.cell(hide_code=True)
|
| 154 |
def _(mo):
|
| 155 |
+
mo.md("""
|
| 156 |
+
### A large collection of built-in APIs
|
|
|
|
| 157 |
|
| 158 |
+
Polars has a comprehensive API that enables you to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly improves performance.
|
| 159 |
+
""")
|
|
|
|
| 160 |
return
|
| 161 |
|
| 162 |
|
| 163 |
@app.cell(hide_code=True)
|
| 164 |
def _(mo):
|
| 165 |
+
mo.md("""
|
| 166 |
+
### Query optimization 📈
|
|
|
|
| 167 |
|
| 168 |
+
A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more.
|
| 169 |
|
| 170 |
+
For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column:
|
| 171 |
|
| 172 |
+
```python
|
| 173 |
+
(
|
| 174 |
+
df
|
| 175 |
+
.group_by("Category").agg(pl.col("Number1").mean())
|
| 176 |
+
.filter(pl.col("Category").is_in(["A", "B"]))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 177 |
)
|
| 178 |
+
```
|
| 179 |
+
|
| 180 |
+
If executed eagerly, the `group_by` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `group_by` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency.
|
| 181 |
+
""")
|
| 182 |
return
|
| 183 |
|
| 184 |
|
| 185 |
@app.cell(hide_code=True)
|
| 186 |
def _(mo):
|
| 187 |
+
mo.md("""
|
| 188 |
+
### Scalability — handling large datasets in memory ⬆️
|
|
|
|
| 189 |
|
| 190 |
+
Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.
|
| 191 |
|
| 192 |
+
**Example: Processing a Large Dataset**
|
| 193 |
+
In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:
|
| 194 |
|
| 195 |
+
```python
|
| 196 |
+
# This may fail with large datasets
|
| 197 |
+
df = pd.read_csv("large_dataset.csv")
|
| 198 |
+
```
|
| 199 |
|
| 200 |
+
In Polars, the same eager read is typically much faster thanks to its multi-threaded reader:
|
| 201 |
|
| 202 |
+
```python
|
| 203 |
+
df = pl.read_csv("large_dataset.csv")
|
| 204 |
+
```
|
| 205 |
|
| 206 |
+
Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:
|
| 207 |
|
| 208 |
+
```python
|
| 209 |
+
df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame
|
| 210 |
+
result = df.filter(pl.col("A") > 1).group_by("A").agg(pl.sum("B")).collect() # Execute
|
| 211 |
+
```
|
| 212 |
+
""")
|
|
|
|
| 213 |
return
|
| 214 |
|
| 215 |
|
| 216 |
@app.cell(hide_code=True)
|
| 217 |
def _(mo):
|
| 218 |
+
mo.md("""
|
| 219 |
+
### Compatibility with other machine learning libraries 🤝
|
|
|
|
| 220 |
|
| 221 |
+
Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.
|
| 222 |
|
| 223 |
+
**Example: Preprocessing Data for Scikit-learn**
|
| 224 |
|
| 225 |
+
```python
|
| 226 |
+
import polars as pl
|
| 227 |
+
from sklearn.linear_model import LinearRegression
|
| 228 |
|
| 229 |
+
# Load and preprocess data
|
| 230 |
+
df = pl.read_csv("data.csv")
|
| 231 |
+
X = df.select(["feature1", "feature2"]).to_numpy()
|
| 232 |
+
y = df.select("target").to_numpy()
|
| 233 |
|
| 234 |
+
# Train a model
|
| 235 |
+
model = LinearRegression()
|
| 236 |
+
model.fit(X, y)
|
| 237 |
+
```
|
| 238 |
|
| 239 |
+
Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:
|
| 240 |
|
| 241 |
+
```python
|
| 242 |
+
# Convert to Pandas DataFrame
|
| 243 |
+
pandas_df = df.to_pandas()
|
| 244 |
|
| 245 |
+
# Convert to NumPy array
|
| 246 |
+
numpy_array = df.to_numpy()
|
| 247 |
+
```
|
| 248 |
+
""")
|
|
|
|
| 249 |
return
|
| 250 |
|
| 251 |
|
| 252 |
@app.cell(hide_code=True)
|
| 253 |
def _(mo):
|
| 254 |
+
mo.md("""
|
| 255 |
+
### Easy to use, with room for power users
|
|
|
|
| 256 |
|
| 257 |
+
Polars supports advanced operations like
|
| 258 |
|
| 259 |
+
- **date handling**
|
| 260 |
+
- **window functions**
|
| 261 |
+
- **joins**
|
| 262 |
+
- **nested data types**
|
| 263 |
|
| 264 |
+
making it a versatile tool for data manipulation.
|
| 265 |
+
""")
|
|
|
|
| 266 |
return
|
| 267 |
|
| 268 |
|
| 269 |
@app.cell(hide_code=True)
|
| 270 |
def _(mo):
|
| 271 |
+
mo.md("""
|
| 272 |
+
## Why not PySpark?
|
|
|
|
| 273 |
|
| 274 |
+
While **PySpark** is a versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels.
|
| 275 |
|
| 276 |
+
When deciding between the two, **PySpark** is the preferred choice for processing large datasets distributed across a **multi-node cluster**. However, for computations on a **single-node machine**, **Polars** is an excellent alternative. Remarkably, Polars is capable of handling datasets that exceed the size of the available RAM, making it a powerful tool for efficient data processing even on limited hardware.
|
| 277 |
+
""")
|
|
|
|
| 278 |
return
|
| 279 |
|
| 280 |
|
| 281 |
@app.cell(hide_code=True)
|
| 282 |
def _(mo):
|
| 283 |
+
mo.md("""
|
| 284 |
+
## 🔖 References
|
|
|
|
| 285 |
|
| 286 |
+
- [Polars official website](https://pola.rs/)
|
| 287 |
+
- [Polars vs. Pandas](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
|
| 288 |
+
""")
|
|
|
|
| 289 |
return
|
| 290 |
|
| 291 |
|
polars/02_dataframes.py
CHANGED
|
@@ -10,14 +10,13 @@
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
-
__generated_with = "0.
|
| 14 |
app = marimo.App()
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
-
mo.md(
|
| 20 |
-
r"""
|
| 21 |
# DataFrames
|
| 22 |
Author: [*Raine Hoang*](https://github.com/Jystine)
|
| 23 |
|
|
@@ -25,33 +24,31 @@ def _(mo):
|
|
| 25 |
|
| 26 |
/// Note
|
| 27 |
The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html).
|
| 28 |
-
"""
|
| 29 |
-
)
|
| 30 |
return
|
| 31 |
|
| 32 |
|
| 33 |
@app.cell(hide_code=True)
|
| 34 |
def _(mo):
|
| 35 |
-
mo.md(
|
| 36 |
-
"""
|
| 37 |
## Defining a DataFrame
|
| 38 |
|
| 39 |
At the most basic level, all that you need to do in order to create a DataFrame in Polars is to use the .DataFrame() method and pass in some data into the data parameter. However, there are restrictions as to what exactly you can pass into this method.
|
| 40 |
-
"""
|
| 41 |
-
)
|
| 42 |
return
|
| 43 |
|
| 44 |
|
| 45 |
@app.cell(hide_code=True)
|
| 46 |
def _(mo):
|
| 47 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 48 |
return
|
| 49 |
|
| 50 |
|
| 51 |
@app.cell(hide_code=True)
|
| 52 |
def _(mo):
|
| 53 |
-
mo.md(
|
| 54 |
-
r"""
|
| 55 |
There are [5 data types](https://github.com/pola-rs/polars/blob/py-1.29.0/py-polars/polars/dataframe/frame.py#L197) that can be converted into a DataFrame.
|
| 56 |
|
| 57 |
1. Dictionary
|
|
@@ -59,20 +56,17 @@ def _(mo):
|
|
| 59 |
3. NumPy Array
|
| 60 |
4. Series
|
| 61 |
5. Pandas DataFrame
|
| 62 |
-
"""
|
| 63 |
-
)
|
| 64 |
return
|
| 65 |
|
| 66 |
|
| 67 |
@app.cell(hide_code=True)
|
| 68 |
def _(mo):
|
| 69 |
-
mo.md(
|
| 70 |
-
r"""
|
| 71 |
#### Dictionary
|
| 72 |
|
| 73 |
Dictionaries are structures that store data as `key:value` pairs. Let's say we have the following dictionary:
|
| 74 |
-
"""
|
| 75 |
-
)
|
| 76 |
return
|
| 77 |
|
| 78 |
|
|
@@ -85,7 +79,9 @@ def _():
|
|
| 85 |
|
| 86 |
@app.cell(hide_code=True)
|
| 87 |
def _(mo):
|
| 88 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 89 |
return
|
| 90 |
|
| 91 |
|
|
@@ -98,25 +94,21 @@ def _(dct_data, pl):
|
|
| 98 |
|
| 99 |
@app.cell(hide_code=True)
|
| 100 |
def _(mo):
|
| 101 |
-
mo.md(
|
| 102 |
-
|
| 103 |
-
In this case, Polars turned each of the lists in the dictionary into a column in the DataFrame.
|
| 104 |
|
| 105 |
The other data structures will follow a similar pattern when converting them to DataFrames.
|
| 106 |
-
"""
|
| 107 |
-
)
|
| 108 |
return
|
| 109 |
|
| 110 |
|
| 111 |
@app.cell(hide_code=True)
|
| 112 |
def _(mo):
|
| 113 |
-
mo.md(
|
| 114 |
-
r"""
|
| 115 |
##### Sequence
|
| 116 |
|
| 117 |
Sequences are data structures that contain collections of items, which can be accessed using its index. Examples of sequences are lists, tuples, and strings. We will be using a list of lists in order to demonstrate how to convert a sequence in a DataFrame.
|
| 118 |
-
"""
|
| 119 |
-
)
|
| 120 |
return
|
| 121 |
|
| 122 |
|
|
@@ -136,19 +128,19 @@ def _(pl, seq_data):
|
|
| 136 |
|
| 137 |
@app.cell(hide_code=True)
|
| 138 |
def _(mo):
|
| 139 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 140 |
return
|
| 141 |
|
| 142 |
|
| 143 |
@app.cell(hide_code=True)
|
| 144 |
def _(mo):
|
| 145 |
-
mo.md(
|
| 146 |
-
r"""
|
| 147 |
##### NumPy Array
|
| 148 |
|
| 149 |
NumPy arrays are considered a sequence of items that can also be accessed using its index. An important thing to note is that all of the items in an array must have the same data type.
|
| 150 |
-
"""
|
| 151 |
-
)
|
| 152 |
return
|
| 153 |
|
| 154 |
|
|
@@ -168,19 +160,19 @@ def _(arr_data, pl):
|
|
| 168 |
|
| 169 |
@app.cell(hide_code=True)
|
| 170 |
def _(mo):
|
| 171 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 172 |
return
|
| 173 |
|
| 174 |
|
| 175 |
@app.cell(hide_code=True)
|
| 176 |
def _(mo):
|
| 177 |
-
mo.md(
|
| 178 |
-
r"""
|
| 179 |
##### Series
|
| 180 |
|
| 181 |
Series are a way to store a single column in a DataFrame and all entries in a series must have the same data type. You can combine these series together to form one DataFrame.
|
| 182 |
-
"""
|
| 183 |
-
)
|
| 184 |
return
|
| 185 |
|
| 186 |
|
|
@@ -200,13 +192,11 @@ def _(pl, pl_series):
|
|
| 200 |
|
| 201 |
@app.cell(hide_code=True)
|
| 202 |
def _(mo):
|
| 203 |
-
mo.md(
|
| 204 |
-
r"""
|
| 205 |
##### Pandas DataFrame
|
| 206 |
|
| 207 |
Another popular package that utilizes DataFrames is pandas. By passing in a pandas DataFrame into .DataFrame(), you can easily convert it into a Polars DataFrame.
|
| 208 |
-
"""
|
| 209 |
-
)
|
| 210 |
return
|
| 211 |
|
| 212 |
|
|
@@ -230,19 +220,19 @@ def _(pd_df, pl):
|
|
| 230 |
|
| 231 |
@app.cell(hide_code=True)
|
| 232 |
def _(mo):
|
| 233 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 234 |
return
|
| 235 |
|
| 236 |
|
| 237 |
@app.cell(hide_code=True)
|
| 238 |
def _(mo):
|
| 239 |
-
mo.md(
|
| 240 |
-
r"""
|
| 241 |
## DataFrame Structure
|
| 242 |
|
| 243 |
Let's recall one of the DataFrames we defined earlier.
|
| 244 |
-
"""
|
| 245 |
-
)
|
| 246 |
return
|
| 247 |
|
| 248 |
|
|
@@ -254,14 +244,15 @@ def _(dct_df):
|
|
| 254 |
|
| 255 |
@app.cell(hide_code=True)
|
| 256 |
def _(mo):
|
| 257 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 258 |
return
|
| 259 |
|
| 260 |
|
| 261 |
@app.cell(hide_code=True)
|
| 262 |
def _(mo):
|
| 263 |
-
mo.md(
|
| 264 |
-
r"""
|
| 265 |
## Parameters
|
| 266 |
|
| 267 |
On top of the "data" parameter, there are 6 additional parameters you can specify:
|
|
@@ -272,20 +263,17 @@ def _(mo):
|
|
| 272 |
4. orient
|
| 273 |
5. infer_schema_length
|
| 274 |
6. nan_to_null
|
| 275 |
-
"""
|
| 276 |
-
)
|
| 277 |
return
|
| 278 |
|
| 279 |
|
| 280 |
@app.cell(hide_code=True)
|
| 281 |
def _(mo):
|
| 282 |
-
mo.md(
|
| 283 |
-
r"""
|
| 284 |
#### Schema
|
| 285 |
|
| 286 |
Let's recall the DataFrame we created using a sequence.
|
| 287 |
-
"""
|
| 288 |
-
)
|
| 289 |
return
|
| 290 |
|
| 291 |
|
|
@@ -297,7 +285,9 @@ def _(seq_df):
|
|
| 297 |
|
| 298 |
@app.cell(hide_code=True)
|
| 299 |
def _(mo):
|
| 300 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 301 |
return
|
| 302 |
|
| 303 |
|
|
@@ -309,7 +299,9 @@ def _(pl, seq_data):
|
|
| 309 |
|
| 310 |
@app.cell(hide_code=True)
|
| 311 |
def _(mo):
|
| 312 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 313 |
return
|
| 314 |
|
| 315 |
|
|
@@ -321,7 +313,9 @@ def _(pl, seq_data):
|
|
| 321 |
|
| 322 |
@app.cell(hide_code=True)
|
| 323 |
def _(mo):
|
| 324 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 325 |
return
|
| 326 |
|
| 327 |
|
|
@@ -333,19 +327,19 @@ def _(pl, seq_data):
|
|
| 333 |
|
| 334 |
@app.cell(hide_code=True)
|
| 335 |
def _(mo):
|
| 336 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 337 |
return
|
| 338 |
|
| 339 |
|
| 340 |
@app.cell(hide_code=True)
|
| 341 |
def _(mo):
|
| 342 |
-
mo.md(
|
| 343 |
-
r"""
|
| 344 |
#### Schema_Overrides
|
| 345 |
|
| 346 |
If you only wanted to specify the data type of specific columns and let Polars infer the rest, you can use the schema_overrides parameter for that. This parameter requires that you pass in a dictionary where the key value pair is column name:data type. Unlike the schema parameter, the column name must match the name already present in the DataFrame as that is how Polars will identify which column you want to specify the data type. If you use a column name that doesn't already exist, Polars won't be able to change the data type.
|
| 347 |
-
"""
|
| 348 |
-
)
|
| 349 |
return
|
| 350 |
|
| 351 |
|
|
@@ -357,13 +351,11 @@ def _(pl, seq_data):
|
|
| 357 |
|
| 358 |
@app.cell(hide_code=True)
|
| 359 |
def _(mo):
|
| 360 |
-
mo.md(
|
| 361 |
-
r"""
|
| 362 |
Notice here that only the data type in the first column changed while Polars inferred the rest.
|
| 363 |
|
| 364 |
It is important to note that if you only use the schema_overrides parameter, you are limited to how much you can change the data type. In the example above, we were able to change the data type from int32 to int16 without any further parameters since the data type is still an integer. However, if we wanted to change the first column to be a string, we would get an error as Polars has already strictly set the schema to only take in integer values.
|
| 365 |
-
"""
|
| 366 |
-
)
|
| 367 |
return
|
| 368 |
|
| 369 |
|
|
@@ -378,25 +370,27 @@ def _(pl, seq_data):
|
|
| 378 |
|
| 379 |
@app.cell(hide_code=True)
|
| 380 |
def _(mo):
|
| 381 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 382 |
return
|
| 383 |
|
| 384 |
|
| 385 |
@app.cell(hide_code=True)
|
| 386 |
def _(mo):
|
| 387 |
-
mo.md(
|
| 388 |
-
r"""
|
| 389 |
#### Strict
|
| 390 |
|
| 391 |
The strict parameter allows you to specify if you want a column's data type to be enforced with flexibility or not. When set to `True`, Polars will raise an error if there is a data type that doesn't match the data type the column is expecting. It will not attempt to type cast it to the correct data type as Polars prioritizes that all the data can be converted without any loss or error. When set to `False`, Polars will attempt to type cast the data into the data type the column wants. If it is unable to successfully convert the data type, the value will be replaced with a null value.
|
| 392 |
-
"""
|
| 393 |
-
)
|
| 394 |
return
|
| 395 |
|
| 396 |
|
| 397 |
@app.cell(hide_code=True)
|
| 398 |
def _(mo):
|
| 399 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 400 |
return
|
| 401 |
|
| 402 |
|
|
@@ -413,7 +407,9 @@ def _(pl):
|
|
| 413 |
|
| 414 |
@app.cell(hide_code=True)
|
| 415 |
def _(mo):
|
| 416 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 417 |
return
|
| 418 |
|
| 419 |
|
|
@@ -425,19 +421,19 @@ def _(pl, seq_data):
|
|
| 425 |
|
| 426 |
@app.cell(hide_code=True)
|
| 427 |
def _(mo):
|
| 428 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 429 |
return
|
| 430 |
|
| 431 |
|
| 432 |
@app.cell(hide_code=True)
|
| 433 |
def _(mo):
|
| 434 |
-
mo.md(
|
| 435 |
-
"""
|
| 436 |
#### Orient
|
| 437 |
|
| 438 |
Let's recall the DataFrame we made by using an array and the data used to make it.
|
| 439 |
-
"""
|
| 440 |
-
)
|
| 441 |
return
|
| 442 |
|
| 443 |
|
|
@@ -455,7 +451,9 @@ def _(arr_df):
|
|
| 455 |
|
| 456 |
@app.cell(hide_code=True)
|
| 457 |
def _(mo):
|
| 458 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 459 |
return
|
| 460 |
|
| 461 |
|
|
@@ -467,7 +465,9 @@ def _(arr_data, pl):
|
|
| 467 |
|
| 468 |
@app.cell(hide_code=True)
|
| 469 |
def _(mo):
|
| 470 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 471 |
return
|
| 472 |
|
| 473 |
|
|
@@ -485,39 +485,33 @@ def _(pl, seq_data):
|
|
| 485 |
|
| 486 |
@app.cell(hide_code=True)
|
| 487 |
def _(mo):
|
| 488 |
-
mo.md(
|
| 489 |
-
r"""
|
| 490 |
#### Infer_Schema_Length
|
| 491 |
|
| 492 |
Without setting the schema ourselves, Polars uses the data provided to infer the data types of the columns. It does this by looking at each of the rows in the data provided. You can specify to Polars how many rows to look at by using the infer_schema_length parameter. For example, if you were to set this parameter to 5, then Polars would use the first 5 rows to infer the schema.
|
| 493 |
-
"""
|
| 494 |
-
)
|
| 495 |
return
|
| 496 |
|
| 497 |
|
| 498 |
@app.cell(hide_code=True)
|
| 499 |
def _(mo):
|
| 500 |
-
mo.md(
|
| 501 |
-
r"""
|
| 502 |
#### NaN_To_Null
|
| 503 |
|
| 504 |
If there are np.nan values in the data, you can convert them to null values by setting the nan_to_null parameter to `True`.
|
| 505 |
-
"""
|
| 506 |
-
)
|
| 507 |
return
|
| 508 |
|
| 509 |
|
| 510 |
@app.cell(hide_code=True)
|
| 511 |
def _(mo):
|
| 512 |
-
mo.md(
|
| 513 |
-
r"""
|
| 514 |
## Summary
|
| 515 |
|
| 516 |
-
DataFrames are a useful data structure that can be used to organize and perform additional analysis on your data. In this notebook, we have learned how to define DataFrames, what can be a DataFrame, the structure of it, and additional parameters you can set while creating it.
|
| 517 |
|
| 518 |
In order to create a DataFrame, you pass your data into the .DataFrame() method through the data parameter. The data you pass through must be either a dictionary, sequence, array, series, or pandas DataFrame. Once defined, the DataFrame will separate the data into different columns and the data within the column must have the same data type. There exists additional parameters besides data that allows you to further customize the ending DataFrame. Some examples of these are orient, strict, and infer_schema_length.
|
| 519 |
-
"""
|
| 520 |
-
)
|
| 521 |
return
|
| 522 |
|
| 523 |
|
|
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
+
__generated_with = "0.18.4"
|
| 14 |
app = marimo.App()
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
+
mo.md(r"""
|
|
|
|
| 20 |
# DataFrames
|
| 21 |
Author: [*Raine Hoang*](https://github.com/Jystine)
|
| 22 |
|
|
|
|
| 24 |
|
| 25 |
/// Note
|
| 26 |
The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html).
|
| 27 |
+
""")
|
|
|
|
| 28 |
return
|
| 29 |
|
| 30 |
|
| 31 |
@app.cell(hide_code=True)
|
| 32 |
def _(mo):
|
| 33 |
+
mo.md("""
|
|
|
|
| 34 |
## Defining a DataFrame
|
| 35 |
|
| 36 |
At the most basic level, all that you need to do in order to create a DataFrame in Polars is to use the .DataFrame() method and pass in some data into the data parameter. However, there are restrictions as to what exactly you can pass into this method.
|
| 37 |
+
""")
|
|
|
|
| 38 |
return
|
| 39 |
|
| 40 |
|
| 41 |
@app.cell(hide_code=True)
|
| 42 |
def _(mo):
|
| 43 |
+
mo.md(r"""
|
| 44 |
+
### What Can Be a DataFrame?
|
| 45 |
+
""")
|
| 46 |
return
|
| 47 |
|
| 48 |
|
| 49 |
@app.cell(hide_code=True)
|
| 50 |
def _(mo):
|
| 51 |
+
mo.md(r"""
|
|
|
|
| 52 |
There are [5 data types](https://github.com/pola-rs/polars/blob/py-1.29.0/py-polars/polars/dataframe/frame.py#L197) that can be converted into a DataFrame.
|
| 53 |
|
| 54 |
1. Dictionary
|
|
|
|
| 56 |
3. NumPy Array
|
| 57 |
4. Series
|
| 58 |
5. Pandas DataFrame
|
| 59 |
+
""")
|
|
|
|
| 60 |
return
|
| 61 |
|
| 62 |
|
| 63 |
@app.cell(hide_code=True)
|
| 64 |
def _(mo):
|
| 65 |
+
mo.md(r"""
|
|
|
|
| 66 |
#### Dictionary
|
| 67 |
|
| 68 |
Dictionaries are structures that store data as `key:value` pairs. Let's say we have the following dictionary:
|
| 69 |
+
""")
|
|
|
|
| 70 |
return
|
| 71 |
|
| 72 |
|
|
|
|
| 79 |
|
| 80 |
@app.cell(hide_code=True)
|
| 81 |
def _(mo):
|
| 82 |
+
mo.md(r"""
|
| 83 |
+
In order to convert this dictionary into a DataFrame, we simply need to pass it into the data parameter in the `.DataFrame()` method like so.
|
| 84 |
+
""")
|
| 85 |
return
|
| 86 |
|
| 87 |
|
|
|
|
| 94 |
|
| 95 |
@app.cell(hide_code=True)
|
| 96 |
def _(mo):
|
| 97 |
+
mo.md(r"""
|
| 98 |
+
In this case, Polars turned each of the lists in the dictionary into a column in the DataFrame.
|
|
|
|
| 99 |
|
| 100 |
The other data structures will follow a similar pattern when converting them to DataFrames.
|
| 101 |
+
""")
|
|
|
|
| 102 |
return
|
| 103 |
|
| 104 |
|
| 105 |
@app.cell(hide_code=True)
|
| 106 |
def _(mo):
|
| 107 |
+
mo.md(r"""
|
|
|
|
| 108 |
##### Sequence
|
| 109 |
|
| 110 |
Sequences are data structures that contain ordered collections of items, which can be accessed by index. Examples of sequences are lists, tuples, and strings. We will be using a list of lists in order to demonstrate how to convert a sequence into a DataFrame.
|
| 111 |
+
""")
|
|
|
|
| 112 |
return
|
| 113 |
|
| 114 |
|
|
|
|
| 128 |
|
| 129 |
@app.cell(hide_code=True)
|
| 130 |
def _(mo):
|
| 131 |
+
mo.md(r"""
|
| 132 |
+
Notice that since we didn't specify the column names, Polars automatically named them `column_0`, `column_1`, and `column_2`. Later, we will show you how to specify the names of the columns.
|
| 133 |
+
""")
|
| 134 |
return
|
| 135 |
|
| 136 |
|
| 137 |
@app.cell(hide_code=True)
|
| 138 |
def _(mo):
|
| 139 |
+
mo.md(r"""
|
|
|
|
| 140 |
##### NumPy Array
|
| 141 |
|
| 142 |
NumPy arrays are considered sequences of items that can also be accessed by index. An important thing to note is that all of the items in an array must have the same data type.
|
| 143 |
+
""")
|
|
|
|
| 144 |
return
|
| 145 |
|
| 146 |
|
|
|
|
| 160 |
|
| 161 |
@app.cell(hide_code=True)
|
| 162 |
def _(mo):
|
| 163 |
+
mo.md(r"""
|
| 164 |
+
Notice that each inner array is a row in the DataFrame, not a column like in the previous methods discussed. Later, we will go over how to tell Polars whether we want the information in the data structure to be presented as rows or columns.
|
| 165 |
+
""")
|
| 166 |
return
|
| 167 |
|
| 168 |
|
| 169 |
@app.cell(hide_code=True)
|
| 170 |
def _(mo):
|
| 171 |
+
mo.md(r"""
|
|
|
|
| 172 |
##### Series
|
| 173 |
|
| 174 |
Series are a way to store a single column in a DataFrame and all entries in a series must have the same data type. You can combine these series together to form one DataFrame.
|
| 175 |
+
""")
|
|
|
|
| 176 |
return
|
| 177 |
|
| 178 |
|
|
|
|
| 192 |
|
| 193 |
@app.cell(hide_code=True)
|
| 194 |
def _(mo):
|
| 195 |
+
mo.md(r"""
|
|
|
|
| 196 |
##### Pandas DataFrame
|
| 197 |
|
| 198 |
Another popular package that utilizes DataFrames is pandas. By passing a pandas DataFrame into .DataFrame(), you can easily convert it into a Polars DataFrame.
|
| 199 |
+
""")
|
|
|
|
| 200 |
return
|
| 201 |
|
| 202 |
|
|
|
|
| 220 |
|
| 221 |
@app.cell(hide_code=True)
|
| 222 |
def _(mo):
|
| 223 |
+
mo.md(r"""
|
| 224 |
+
Now that we've looked over what can be converted into a DataFrame and the basics of it, let's look at the structure of the DataFrame.
|
| 225 |
+
""")
|
| 226 |
return
|
| 227 |
|
| 228 |
|
| 229 |
@app.cell(hide_code=True)
|
| 230 |
def _(mo):
|
| 231 |
+
mo.md(r"""
|
|
|
|
| 232 |
## DataFrame Structure
|
| 233 |
|
| 234 |
Let's recall one of the DataFrames we defined earlier.
|
| 235 |
+
""")
|
|
|
|
| 236 |
return
|
| 237 |
|
| 238 |
|
|
|
|
| 244 |
|
| 245 |
@app.cell(hide_code=True)
|
| 246 |
def _(mo):
|
| 247 |
+
mo.md(r"""
|
| 248 |
+
We can see that this DataFrame has 4 rows and 3 columns as indicated by the text beneath the DataFrame. Each column has a name that can be used to access the data within that column. In this case, the names are: "col1", "col2", and "col3". Below the column name, there is text that indicates the data type stored within that column. "col1" has the text "i64" underneath its name, meaning that column stores integers. "col2" stores strings as seen by the "str" under the column name. Finally, "col3" stores floats as it has "f64" under the column name. Polars will automatically infer the data types stored in each column, but we will go over a way to specify them later in this tutorial. Each column can only hold one data type at a time, so you can't have a string and an integer in the same column.
|
| 249 |
+
""")
|
| 250 |
return
|
| 251 |
|
| 252 |
|
| 253 |
@app.cell(hide_code=True)
|
| 254 |
def _(mo):
|
| 255 |
+
mo.md(r"""
|
|
|
|
| 256 |
## Parameters
|
| 257 |
|
| 258 |
On top of the "data" parameter, there are 6 additional parameters you can specify:
|
|
|
|
| 263 |
4. orient
|
| 264 |
5. infer_schema_length
|
| 265 |
6. nan_to_null
|
| 266 |
+
""")
|
|
|
|
| 267 |
return
|
| 268 |
|
| 269 |
|
| 270 |
@app.cell(hide_code=True)
|
| 271 |
def _(mo):
|
| 272 |
+
mo.md(r"""
|
|
|
|
| 273 |
#### Schema
|
| 274 |
|
| 275 |
Let's recall the DataFrame we created using a sequence.
|
| 276 |
+
""")
|
|
|
|
| 277 |
return
|
| 278 |
|
| 279 |
|
|
|
|
| 285 |
|
| 286 |
@app.cell(hide_code=True)
|
| 287 |
def _(mo):
|
| 288 |
+
mo.md(r"""
|
| 289 |
+
We can see that the column names and data type were inferred by Polars. The schema parameter allows us to specify the column names and data type we want for each column. There are 3 ways you can use this parameter. The first way involves using a dictionary to define the following key value pair: column name:data type.
|
| 290 |
+
""")
|
| 291 |
return
|
| 292 |
|
| 293 |
|
|
|
|
| 299 |
|
| 300 |
@app.cell(hide_code=True)
|
| 301 |
def _(mo):
|
| 302 |
+
mo.md(r"""
|
| 303 |
+
You can also do this using a list of (column name, data type) pairs instead of a dictionary.
|
| 304 |
+
""")
|
| 305 |
return
|
| 306 |
|
| 307 |
|
|
|
|
| 313 |
|
| 314 |
@app.cell(hide_code=True)
|
| 315 |
def _(mo):
|
| 316 |
+
mo.md(r"""
|
| 317 |
+
Notice how both the column names and the data types (text underneath the column names) are different from the original `seq_df`. If you only wanted to specify the column names and let Polars infer the data types, you can do so using a list of column names.
|
| 318 |
+
""")
|
| 319 |
return
|
| 320 |
|
| 321 |
|
|
|
|
| 327 |
|
| 328 |
@app.cell(hide_code=True)
|
| 329 |
def _(mo):
|
| 330 |
+
mo.md(r"""
|
| 331 |
+
The text under the column names is different from the previous two DataFrames we created since we didn't explicitly tell Polars what data type we wanted in each column.
|
| 332 |
+
""")
|
| 333 |
return
|
| 334 |
|
| 335 |
|
| 336 |
@app.cell(hide_code=True)
|
| 337 |
def _(mo):
|
| 338 |
+
mo.md(r"""
|
|
|
|
| 339 |
#### Schema_Overrides
|
| 340 |
|
| 341 |
If you only wanted to specify the data type of specific columns and let Polars infer the rest, you can use the schema_overrides parameter for that. This parameter requires that you pass in a dictionary where the key value pair is column name:data type. Unlike the schema parameter, the column name must match the name already present in the DataFrame as that is how Polars will identify which column you want to specify the data type. If you use a column name that doesn't already exist, Polars won't be able to change the data type.
|
| 342 |
+
""")
|
|
|
|
| 343 |
return
|
| 344 |
|
| 345 |
|
|
|
|
| 351 |
|
| 352 |
@app.cell(hide_code=True)
|
| 353 |
def _(mo):
|
| 354 |
+
mo.md(r"""
|
|
|
|
| 355 |
Notice here that only the data type in the first column changed while Polars inferred the rest.
|
| 356 |
|
| 357 |
It is important to note that if you only use the schema_overrides parameter, you are limited to how much you can change the data type. In the example above, we were able to change the data type from int32 to int16 without any further parameters since the data type is still an integer. However, if we wanted to change the first column to be a string, we would get an error as Polars has already strictly set the schema to only take in integer values.
|
| 358 |
+
""")
|
|
|
|
| 359 |
return
|
| 360 |
|
| 361 |
|
|
|
|
| 370 |
|
| 371 |
@app.cell(hide_code=True)
|
| 372 |
def _(mo):
|
| 373 |
+
mo.md(r"""
|
| 374 |
+
If we wanted to use schema_overrides to completely change the data type of the column, we need an additional parameter: strict.
|
| 375 |
+
""")
|
| 376 |
return
|
| 377 |
|
| 378 |
|
| 379 |
@app.cell(hide_code=True)
|
| 380 |
def _(mo):
|
| 381 |
+
mo.md(r"""
|
|
|
|
| 382 |
#### Strict
|
| 383 |
|
| 384 |
The strict parameter controls whether a column's data type is enforced strictly or flexibly. When set to `True`, Polars will raise an error if there is a value that doesn't match the data type the column is expecting. It will not attempt to type cast it to the correct data type, as Polars prioritizes that all the data can be converted without any loss or error. When set to `False`, Polars will attempt to type cast the data into the data type the column wants. If it is unable to successfully convert a value, that value will be replaced with a null value.
|
| 385 |
+
""")
|
|
|
|
| 386 |
return
|
| 387 |
|
| 388 |
|
| 389 |
@app.cell(hide_code=True)
|
| 390 |
def _(mo):
|
| 391 |
+
mo.md(r"""
|
| 392 |
+
Let's see an example of what happens when strict is set to `True`. The cell below should show an error.
|
| 393 |
+
""")
|
| 394 |
return
|
| 395 |
|
| 396 |
|
|
|
|
| 407 |
|
| 408 |
@app.cell(hide_code=True)
|
| 409 |
def _(mo):
|
| 410 |
+
mo.md(r"""
|
| 411 |
+
Now let's try setting strict to `False`.
|
| 412 |
+
""")
|
| 413 |
return
|
| 414 |
|
| 415 |
|
|
|
|
| 421 |
|
| 422 |
@app.cell(hide_code=True)
|
| 423 |
def _(mo):
|
| 424 |
+
mo.md(r"""
|
| 425 |
+
Since we allowed for Polars to change the schema by setting strict to `False`, we were able to cast the first column to be strings.
|
| 426 |
+
""")
|
| 427 |
return
|
| 428 |
|
| 429 |
|
| 430 |
@app.cell(hide_code=True)
|
| 431 |
def _(mo):
|
| 432 |
+
mo.md("""
|
|
|
|
| 433 |
#### Orient
|
| 434 |
|
| 435 |
Let's recall the DataFrame we made by using an array and the data used to make it.
|
| 436 |
+
""")
|
|
|
|
| 437 |
return
|
| 438 |
|
| 439 |
|
|
|
|
| 451 |
|
| 452 |
@app.cell(hide_code=True)
|
| 453 |
def _(mo):
|
| 454 |
+
mo.md(r"""
|
| 455 |
+
Notice how Polars decided to make each inner array a row in the DataFrame. If we wanted to make it so that each inner array was a column instead of a row, all we would need to do is pass `"col"` into the orient parameter.
|
| 456 |
+
""")
|
| 457 |
return
|
| 458 |
|
| 459 |
|
|
|
|
| 465 |
|
| 466 |
@app.cell(hide_code=True)
|
| 467 |
def _(mo):
|
| 468 |
+
mo.md(r"""
|
| 469 |
+
If we wanted to do the opposite, then we pass `"row"` into the orient parameter.
|
| 470 |
+
""")
|
| 471 |
return
|
| 472 |
|
| 473 |
|
|
|
|
| 485 |
|
| 486 |
@app.cell(hide_code=True)
|
| 487 |
def _(mo):
|
| 488 |
+
mo.md(r"""
|
|
|
|
| 489 |
#### Infer_Schema_Length
|
| 490 |
|
| 491 |
Without setting the schema ourselves, Polars uses the data provided to infer the data types of the columns. It does this by looking at each of the rows in the data provided. You can specify to Polars how many rows to look at by using the infer_schema_length parameter. For example, if you were to set this parameter to 5, then Polars would use the first 5 rows to infer the schema.
|
| 492 |
+
""")
|
|
|
|
| 493 |
return
|
| 494 |
|
| 495 |
|
| 496 |
@app.cell(hide_code=True)
|
| 497 |
def _(mo):
|
| 498 |
+
mo.md(r"""
|
|
|
|
| 499 |
#### NaN_To_Null
|
| 500 |
|
| 501 |
If there are np.nan values in the data, you can convert them to null values by setting the nan_to_null parameter to `True`.
|
| 502 |
+
""")
|
|
|
|
| 503 |
return
|
| 504 |
|
| 505 |
|
| 506 |
@app.cell(hide_code=True)
|
| 507 |
def _(mo):
|
| 508 |
+
mo.md(r"""
|
|
|
|
| 509 |
## Summary
|
| 510 |
|
| 511 |
+
DataFrames are a useful data structure for organizing your data and performing further analysis on it. In this notebook, we learned how to define DataFrames, what kinds of data they can hold, how they are structured, and which additional parameters you can set while creating them.
|
| 512 |
|
| 513 |
To create a DataFrame, you pass your data into the `pl.DataFrame()` constructor through the data parameter. The data you pass must be a dictionary, sequence, array, series, or pandas DataFrame. Once defined, the DataFrame separates the data into columns, and the data within each column must share the same data type. There are additional parameters besides data that let you further customize the resulting DataFrame, such as orient, strict, and infer_schema_length.
|
| 514 |
+
""")
|
|
|
|
| 515 |
return
|
| 516 |
|
| 517 |
|
polars/03_loading_data.py
CHANGED
|
@@ -14,14 +14,13 @@
|
|
| 14 |
|
| 15 |
import marimo
|
| 16 |
|
| 17 |
-
__generated_with = "0.
|
| 18 |
app = marimo.App(width="medium")
|
| 19 |
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
-
mo.md(
|
| 24 |
-
r"""
|
| 25 |
# Loading Data
|
| 26 |
|
| 27 |
_By [etrotta](https://github.com/etrotta)._
|
|
@@ -29,8 +28,7 @@ def _(mo):
|
|
| 29 |
This tutorial covers how to load data of varying formats and from different sources using [polars](https://docs.pola.rs/).
|
| 30 |
|
| 31 |
It includes examples of how to load and write to a variety of formats, shows how to convert data from other libraries to support formats not supported directly by polars, includes relevant links for users that need to connect with external sources, and explains how to deal with custom formats via plugins.
|
| 32 |
-
"""
|
| 33 |
-
)
|
| 34 |
return
|
| 35 |
|
| 36 |
|
|
@@ -80,12 +78,10 @@ def _(mo, pl):
|
|
| 80 |
|
| 81 |
@app.cell(hide_code=True)
|
| 82 |
def _(mo):
|
| 83 |
-
mo.md(
|
| 84 |
-
r"""
|
| 85 |
## Parquet
|
| 86 |
Parquet is a popular format for storing tabular data based on the Arrow memory spec. It is a great default, and you'll find many datasets already using it on sites like Hugging Face
|
| 87 |
-
"""
|
| 88 |
-
)
|
| 89 |
return
|
| 90 |
|
| 91 |
|
|
@@ -100,14 +96,12 @@ def _(df, folder, pl):
|
|
| 100 |
|
| 101 |
@app.cell(hide_code=True)
|
| 102 |
def _(mo):
|
| 103 |
-
mo.md(
|
| 104 |
-
r"""
|
| 105 |
## CSV
|
| 106 |
A classic and common format that has been widely used for decades.
|
| 107 |
|
| 108 |
The API is almost identical to Parquet - you can just replace `parquet` with `csv` and it will work with the default settings, but polars also allows you to customize settings such as the delimiter and quoting rules.
|
| 109 |
-
"""
|
| 110 |
-
)
|
| 111 |
return
|
| 112 |
|
| 113 |
|
|
@@ -123,8 +117,7 @@ def _(df, folder, lz, pl):
|
|
| 123 |
|
| 124 |
@app.cell(hide_code=True)
|
| 125 |
def _(mo):
|
| 126 |
-
mo.md(
|
| 127 |
-
r"""
|
| 128 |
## JSON
|
| 129 |
|
| 130 |
JavaScript Object Notation is somewhat commonly used for storing unstructured data, and extremely commonly used for API responses.
|
|
@@ -138,8 +131,7 @@ def _(mo):
|
|
| 138 |
Polars supports Lists with variable length, Arrays with fixed length, and Structs with well defined fields, but not mappings with arbitrary keys.
|
| 139 |
|
| 140 |
You might want to transform data by unnesting structs and exploding lists after loading from complex JSON files.
|
| 141 |
-
"""
|
| 142 |
-
)
|
| 143 |
return
|
| 144 |
|
| 145 |
|
|
@@ -163,8 +155,7 @@ def _(df, folder, lz, pl):
|
|
| 163 |
|
| 164 |
@app.cell(hide_code=True)
|
| 165 |
def _(mo):
|
| 166 |
-
mo.md(
|
| 167 |
-
r"""
|
| 168 |
## Databases
|
| 169 |
|
| 170 |
Polars doesn't support any databases _directly_, but rather uses other libraries as engines. Reading and writing to databases using polars methods does not support lazy execution, but you may pass an SQL query for the database to pre-filter the data before it reaches polars. See the [User Guide](https://docs.pola.rs/user-guide/io/database) for more details.
|
|
@@ -172,8 +163,7 @@ def _(mo):
|
|
| 172 |
You can also use other libraries with [arrow support](#arrow-support) or [polars plugins](#plugin-support) to read from databases before loading into polars, some of which support lazy reading.
|
| 173 |
|
| 174 |
Using the Arrow Database Connectivity SQLite support as an example:
|
| 175 |
-
"""
|
| 176 |
-
)
|
| 177 |
return
|
| 178 |
|
| 179 |
|
|
@@ -190,43 +180,37 @@ def _(df, folder, pl):
|
|
| 190 |
|
| 191 |
@app.cell(hide_code=True)
|
| 192 |
def _(mo):
|
| 193 |
-
mo.md(
|
| 194 |
-
r"""
|
| 195 |
## Excel
|
| 196 |
|
| 197 |
From a performance perspective, we recommend using other formats if possible, such as Parquet or CSV files.
|
| 198 |
|
| 199 |
Similarly to databases, polars doesn't support it natively but rather uses other libraries as engines. See the [User Guide](https://docs.pola.rs/user-guide/io/excel) if you need to use it.
|
| 200 |
-
"""
|
| 201 |
-
)
|
| 202 |
return
|
| 203 |
|
| 204 |
|
| 205 |
@app.cell(hide_code=True)
|
| 206 |
def _(mo):
|
| 207 |
-
mo.md(
|
| 208 |
-
r"""
|
| 209 |
## Others natively supported
|
| 210 |
|
| 211 |
If you understood the above examples, then all other formats should feel familiar - the core API is the same for all formats, `read` and `write` for the Eager API or `scan` and `sink` for the lazy API.
|
| 212 |
|
| 213 |
See https://docs.pola.rs/api/python/stable/reference/io.html for the full list of formats natively supported by Polars
|
| 214 |
-
"""
|
| 215 |
-
)
|
| 216 |
return
|
| 217 |
|
| 218 |
|
| 219 |
@app.cell(hide_code=True)
|
| 220 |
def _(mo):
|
| 221 |
-
mo.md(
|
| 222 |
-
r"""
|
| 223 |
## Arrow Support
|
| 224 |
|
| 225 |
You can convert Arrow compatible data from other libraries such as `pandas`, `duckdb` or `pyarrow` to polars DataFrames and vice-versa, much of the time without even having to copy data.
|
| 226 |
|
| 227 |
This allows you to use other libraries to load data in formats not supported by polars, then convert the dataframe in-memory to polars.
|
| 228 |
-
"""
|
| 229 |
-
)
|
| 230 |
return
|
| 231 |
|
| 232 |
|
|
@@ -241,13 +225,11 @@ def _(df, folder, pd, pl):
|
|
| 241 |
|
| 242 |
@app.cell(hide_code=True)
|
| 243 |
def _(mo):
|
| 244 |
-
mo.md(
|
| 245 |
-
r"""
|
| 246 |
## Plugin Support
|
| 247 |
|
| 248 |
You can also write [IO Plugins](https://docs.pola.rs/user-guide/plugins/io_plugins/) for Polars in order to support any format you need, or use other libraries that support polars via their own plugins such as DuckDB.
|
| 249 |
-
"""
|
| 250 |
-
)
|
| 251 |
return
|
| 252 |
|
| 253 |
|
|
@@ -261,8 +243,7 @@ def _(duckdb, folder):
|
|
| 261 |
|
| 262 |
@app.cell(hide_code=True)
|
| 263 |
def _(mo):
|
| 264 |
-
mo.md(
|
| 265 |
-
r"""
|
| 266 |
### Creating your own Plugin
|
| 267 |
|
| 268 |
The simplest form of plugins are essentially generators that yield DataFrames.
|
|
@@ -273,12 +254,11 @@ def _(mo):
|
|
| 273 |
|
| 274 |
- You must use `register_io_source` for polars to create the LazyFrame which will consume the Generator
|
| 275 |
- You are expected to provide a Schema before the Generator starts
|
| 276 |
-
- - For many use cases the Plugin may be able to infer it, but you could also pass it explicitly to the plugin function
|
| 277 |
- Ideally you should parse some of the filters and column selectors to avoid unnecessary work, but it is possible to delegate that to polars after loading the data in order to keep it simpler (at the cost of efficiency)
|
| 278 |
|
| 279 |
Efficiently parsing the filter expressions is out of the scope for this notebook.
|
| 280 |
-
"""
|
| 281 |
-
)
|
| 282 |
return
|
| 283 |
|
| 284 |
|
|
@@ -351,8 +331,7 @@ def _(Iterator, get_positional_names, itertools, pl, register_io_source):
|
|
| 351 |
|
| 352 |
@app.cell(hide_code=True)
|
| 353 |
def _(mo):
|
| 354 |
-
mo.md(
|
| 355 |
-
r"""
|
| 356 |
### DuckDB
|
| 357 |
|
| 358 |
As demonstrated above, in addition to Arrow interoperability support, [DuckDB](https://duckdb.org/) also has added support for loading query results into a polars DataFrame or LazyFrame via a polars plugin.
|
|
@@ -363,8 +342,7 @@ def _(mo):
|
|
| 363 |
- https://duckdb.org/docs/stable/guides/python/polars.html
|
| 364 |
|
| 365 |
You can learn more about DuckDB in the marimo course about it as well, including marimo's SQL-related features
|
| 366 |
-
"""
|
| 367 |
-
)
|
| 368 |
return
|
| 369 |
|
| 370 |
|
|
@@ -398,16 +376,14 @@ def _(duckdb_conn, duckdb_query):
|
|
| 398 |
|
| 399 |
@app.cell(hide_code=True)
|
| 400 |
def _(mo):
|
| 401 |
-
mo.md(
|
| 402 |
-
r"""
|
| 403 |
## Hive Partitions
|
| 404 |
|
| 405 |
There is also support for [Hive](https://docs.pola.rs/user-guide/io/hive/) partitioned data, but parts of the API are still unstable (may change in future polars versions
|
| 406 |
).
|
| 407 |
|
| 408 |
Even without using partitions, many methods also support glob patterns to read multiple files in the same folder such as `scan_csv(folder / "*.csv")`
|
| 409 |
-
"""
|
| 410 |
-
)
|
| 411 |
return
|
| 412 |
|
| 413 |
|
|
@@ -422,28 +398,24 @@ def _(df, folder, pl):
|
|
| 422 |
|
| 423 |
@app.cell(hide_code=True)
|
| 424 |
def _(mo):
|
| 425 |
-
mo.md(
|
| 426 |
-
r"""
|
| 427 |
# Reading from the Cloud
|
| 428 |
|
| 429 |
-
Polars also has support for reading public and private datasets from multiple websites
|
| 430 |
and cloud storage solutions.
|
| 431 |
|
| 432 |
If you must (re)use the same file many times on the same machine, you may want to manually download it and load it from your local file system to avoid re-downloading, or download and write it to disk only if the file does not exist.
|
| 433 |
-
"""
|
| 434 |
-
)
|
| 435 |
return
|
| 436 |
|
| 437 |
|
| 438 |
@app.cell(hide_code=True)
|
| 439 |
def _(mo):
|
| 440 |
-
mo.md(
|
| 441 |
-
r"""
|
| 442 |
## Arbitrary web sites
|
| 443 |
|
| 444 |
You can load files from nearly any website just by using an HTTPS URL, as long as it is not locked behind authorization.
|
| 445 |
-
"""
|
| 446 |
-
)
|
| 447 |
return
|
| 448 |
|
| 449 |
|
|
@@ -455,15 +427,13 @@ def _():
|
|
| 455 |
|
| 456 |
@app.cell(hide_code=True)
|
| 457 |
def _(mo):
|
| 458 |
-
mo.md(
|
| 459 |
-
r"""
|
| 460 |
## Hugging Face & Kaggle Datasets
|
| 461 |
|
| 462 |
Look for polars inside of dropdowns such as "Use this dataset" in Hugging Face or "Code" in Kaggle, and oftentimes you'll get a snippet to load data directly into a dataframe you can use
|
| 463 |
|
| 464 |
Read more: [Hugging Face](https://docs.pola.rs/user-guide/io/hugging-face/), [Kaggle](https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpolars)
|
| 465 |
-
"""
|
| 466 |
-
)
|
| 467 |
return
|
| 468 |
|
| 469 |
|
|
@@ -475,15 +445,13 @@ def _():
|
|
| 475 |
|
| 476 |
@app.cell(hide_code=True)
|
| 477 |
def _(mo):
|
| 478 |
-
mo.md(
|
| 479 |
-
r"""
|
| 480 |
## Cloud Storage - AWS S3, Azure Blob Storage, Google Cloud Storage
|
| 481 |
|
| 482 |
The API is the same for all three storage providers; check the [User Guide](https://docs.pola.rs/user-guide/io/cloud-storage/) if you need any of them.
|
| 483 |
|
| 484 |
Runnable examples are not included in this Notebook as it would require setting up authentication, but the disabled cell below shows an example using Azure.
|
| 485 |
-
"""
|
| 486 |
-
)
|
| 487 |
return
|
| 488 |
|
| 489 |
|
|
@@ -510,13 +478,11 @@ def _(adlfs, df, os, pl):
|
|
| 510 |
|
| 511 |
@app.cell(hide_code=True)
|
| 512 |
def _(mo):
|
| 513 |
-
mo.md(
|
| 514 |
-
r"""
|
| 515 |
# Multiplexing
|
| 516 |
|
| 517 |
You can also split a query into multiple sinks via [multiplexing](https://docs.pola.rs/user-guide/lazy/multiplexing/), to avoid reading multiple times, repeating the same operations for each sink or collecting intermediary results into memory.
|
| 518 |
-
"""
|
| 519 |
-
)
|
| 520 |
return
|
| 521 |
|
| 522 |
|
|
@@ -540,13 +506,11 @@ def _(folder, lz, pl):
|
|
| 540 |
|
| 541 |
@app.cell(hide_code=True)
|
| 542 |
def _(mo):
|
| 543 |
-
mo.md(
|
| 544 |
-
r"""
|
| 545 |
# Async Execution
|
| 546 |
|
| 547 |
Polars also has experimental support for running lazy queries in `async` mode, letting you `await` operations inside of async functions.
|
| 548 |
-
"""
|
| 549 |
-
)
|
| 550 |
return
|
| 551 |
|
| 552 |
|
|
@@ -566,27 +530,23 @@ async def _(folder, lz, pl, sinks):
|
|
| 566 |
|
| 567 |
@app.cell(hide_code=True)
|
| 568 |
def _(mo):
|
| 569 |
-
mo.md(
|
| 570 |
-
r"""
|
| 571 |
## Conclusion
|
| 572 |
As you have seen, polars makes it easy to work with a variety of formats and different data sources.
|
| 573 |
|
| 574 |
From natively supported formats such as Parquet and CSV files, to using other libraries as an intermediary for XML or geospatial data, and plugins for newly emerging or proprietary formats, as long as your data can fit in a table then odds are you can turn it into a polars DataFrame.
|
| 575 |
|
| 576 |
Combined with loading directly from remote sources, including public data platforms such as Hugging Face and Kaggle as well as private data in your cloud, you can import datasets for almost anything you can imagine.
|
| 577 |
-
"""
|
| 578 |
-
)
|
| 579 |
return
|
| 580 |
|
| 581 |
|
| 582 |
@app.cell(hide_code=True)
|
| 583 |
def _(mo):
|
| 584 |
-
mo.md(
|
| 585 |
-
r"""
|
| 586 |
## Utilities
|
| 587 |
Imports, utility functions, and the like used throughout the Notebook
|
| 588 |
-
"""
|
| 589 |
-
)
|
| 590 |
return
|
| 591 |
|
| 592 |
|
|
|
|
| 14 |
|
| 15 |
import marimo
|
| 16 |
|
| 17 |
+
__generated_with = "0.18.4"
|
| 18 |
app = marimo.App(width="medium")
|
| 19 |
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
+
mo.md(r"""
|
|
|
|
| 24 |
# Loading Data
|
| 25 |
|
| 26 |
_By [etrotta](https://github.com/etrotta)._
|
|
|
|
| 28 |
This tutorial covers how to load data of varying formats and from different sources using [polars](https://docs.pola.rs/).
|
| 29 |
|
| 30 |
It includes examples of how to load and write to a variety of formats, shows how to convert data from other libraries to support formats not supported directly by polars, includes relevant links for users that need to connect with external sources, and explains how to deal with custom formats via plugins.
|
| 31 |
+
""")
|
|
|
|
| 32 |
return
|
| 33 |
|
| 34 |
|
|
|
|
| 78 |
|
| 79 |
@app.cell(hide_code=True)
|
| 80 |
def _(mo):
|
| 81 |
+
mo.md(r"""
|
|
|
|
| 82 |
## Parquet
|
| 83 |
Parquet is a popular format for storing tabular data based on the Arrow memory spec. It is a great default, and you'll find many datasets already using it on sites like Hugging Face
|
| 84 |
+
""")
|
|
|
|
| 85 |
return
|
| 86 |
|
| 87 |
|
|
|
|
| 96 |
|
| 97 |
@app.cell(hide_code=True)
|
| 98 |
def _(mo):
|
| 99 |
+
mo.md(r"""
|
|
|
|
| 100 |
## CSV
|
| 101 |
A classic and common format that has been widely used for decades.
|
| 102 |
|
| 103 |
The API is almost identical to Parquet - you can just replace `parquet` with `csv` and it will work with the default settings, but polars also allows you to customize settings such as the delimiter and quoting rules.
|
| 104 |
+
""")
|
|
|
|
| 105 |
return
|
| 106 |
|
| 107 |
|
|
|
|
| 117 |
|
| 118 |
@app.cell(hide_code=True)
|
| 119 |
def _(mo):
|
| 120 |
+
mo.md(r"""
|
|
|
|
| 121 |
## JSON
|
| 122 |
|
| 123 |
JavaScript Object Notation is somewhat commonly used for storing unstructured data, and extremely commonly used for API responses.
|
|
|
|
| 131 |
Polars supports Lists with variable length, Arrays with fixed length, and Structs with well defined fields, but not mappings with arbitrary keys.
|
| 132 |
|
| 133 |
You might want to transform data by unnesting structs and exploding lists after loading from complex JSON files.
|
| 134 |
+
""")
|
|
|
|
| 135 |
return
|
| 136 |
|
| 137 |
|
|
|
|
| 155 |
|
| 156 |
@app.cell(hide_code=True)
|
| 157 |
def _(mo):
|
| 158 |
+
mo.md(r"""
|
|
|
|
| 159 |
## Databases
|
| 160 |
|
| 161 |
Polars doesn't support any databases _directly_, but rather uses other libraries as engines. Reading and writing to databases using polars methods does not support lazy execution, but you may pass an SQL query for the database to pre-filter the data before it reaches polars. See the [User Guide](https://docs.pola.rs/user-guide/io/database) for more details.
|
|
|
|
| 163 |
You can also use other libraries with [arrow support](#arrow-support) or [polars plugins](#plugin-support) to read from databases before loading into polars, some of which support lazy reading.
|
| 164 |
|
| 165 |
Using the Arrow Database Connectivity SQLite support as an example:
|
| 166 |
+
""")
|
|
|
|
| 167 |
return
|
| 168 |
|
| 169 |
|
|
|
|
| 180 |
|
| 181 |
@app.cell(hide_code=True)
|
| 182 |
def _(mo):
|
| 183 |
+
mo.md(r"""
|
|
|
|
| 184 |
## Excel
|
| 185 |
|
| 186 |
From a performance perspective, we recommend using other formats if possible, such as Parquet or CSV files.
|
| 187 |
|
| 188 |
Similarly to databases, polars doesn't support it natively but rather uses other libraries as engines. See the [User Guide](https://docs.pola.rs/user-guide/io/excel) if you need to use it.
|
| 189 |
+
""")
|
|
|
|
| 190 |
return
|
| 191 |
|
| 192 |
|
| 193 |
@app.cell(hide_code=True)
|
| 194 |
def _(mo):
|
| 195 |
+
mo.md(r"""
|
|
|
|
| 196 |
## Others natively supported
|
| 197 |
|
| 198 |
If you understood the above examples, then all other formats should feel familiar - the core API is the same for all formats, `read` and `write` for the Eager API or `scan` and `sink` for the lazy API.
|
| 199 |
|
| 200 |
See https://docs.pola.rs/api/python/stable/reference/io.html for the full list of formats natively supported by Polars
|
| 201 |
+
""")
|
|
|
|
| 202 |
return
|
| 203 |
|
| 204 |
|
| 205 |
@app.cell(hide_code=True)
|
| 206 |
def _(mo):
|
| 207 |
+
mo.md(r"""
|
|
|
|
| 208 |
## Arrow Support
|
| 209 |
|
| 210 |
You can convert Arrow compatible data from other libraries such as `pandas`, `duckdb` or `pyarrow` to polars DataFrames and vice-versa, much of the time without even having to copy data.
|
| 211 |
|
| 212 |
This allows you to use other libraries to load data in formats not supported by polars, then convert the dataframe in-memory to polars.
|
| 213 |
+
""")
|
|
|
|
| 214 |
return
|
| 215 |
|
| 216 |
|
|
|
|
| 225 |
|
| 226 |
@app.cell(hide_code=True)
|
| 227 |
def _(mo):
|
| 228 |
+
mo.md(r"""
|
|
|
|
| 229 |
## Plugin Support
|
| 230 |
|
| 231 |
You can also write [IO Plugins](https://docs.pola.rs/user-guide/plugins/io_plugins/) for Polars in order to support any format you need, or use other libraries that support polars via their own plugins such as DuckDB.
|
| 232 |
+
""")
|
|
|
|
| 233 |
return
|
| 234 |
|
| 235 |
|
|
|
|
| 243 |
|
| 244 |
@app.cell(hide_code=True)
|
| 245 |
def _(mo):
|
| 246 |
+
mo.md(r"""
|
|
|
|
| 247 |
### Creating your own Plugin
|
| 248 |
|
| 249 |
The simplest form of plugins are essentially generators that yield DataFrames.
|
|
|
|
| 254 |
|
| 255 |
- You must use `register_io_source` for polars to create the LazyFrame which will consume the Generator
|
| 256 |
- You are expected to provide a Schema before the Generator starts
|
| 257 |
+
- - For many use cases the Plugin may be able to infer it, but you could also pass it explicitly to the plugin function
|
| 258 |
- Ideally you should parse some of the filters and column selectors to avoid unnecessary work, but it is possible to delegate that to polars after loading the data in order to keep it simpler (at the cost of efficiency)
|
| 259 |
|
| 260 |
Efficiently parsing the filter expressions is out of the scope for this notebook.
|
| 261 |
+
""")
|
|
|
|
| 262 |
return
|
| 263 |
|
| 264 |
|
|
|
|
| 331 |
|
| 332 |
@app.cell(hide_code=True)
|
| 333 |
def _(mo):
|
| 334 |
+
mo.md(r"""
|
|
|
|
| 335 |
### DuckDB
|
| 336 |
|
| 337 |
As demonstrated above, in addition to Arrow interoperability support, [DuckDB](https://duckdb.org/) also has added support for loading query results into a polars DataFrame or LazyFrame via a polars plugin.
|
|
|
|
| 342 |
- https://duckdb.org/docs/stable/guides/python/polars.html
|
| 343 |
|
| 344 |
You can learn more about DuckDB in the marimo course about it as well, including marimo's SQL-related features
|
| 345 |
+
""")
|
|
|
|
| 346 |
return
|
| 347 |
|
| 348 |
|
|
|
|
| 376 |
|
| 377 |
@app.cell(hide_code=True)
|
| 378 |
def _(mo):
|
| 379 |
+
mo.md(r"""
|
|
|
|
| 380 |
## Hive Partitions
|
| 381 |
|
| 382 |
There is also support for [Hive](https://docs.pola.rs/user-guide/io/hive/) partitioned data, but parts of the API are still unstable (may change in future polars versions
|
| 383 |
).
|
| 384 |
|
| 385 |
Even without using partitions, many methods also support glob patterns to read multiple files in the same folder such as `scan_csv(folder / "*.csv")`
|
| 386 |
+
""")
|
|
|
|
| 387 |
return
|
| 388 |
|
| 389 |
|
|
|
|
| 398 |
|
| 399 |
@app.cell(hide_code=True)
|
| 400 |
def _(mo):
|
| 401 |
+
mo.md(r"""
|
|
|
|
| 402 |
# Reading from the Cloud
|
| 403 |
|
| 404 |
+
Polars also has support for reading public and private datasets from multiple websites
|
| 405 |
and cloud storage solutions.
|
| 406 |
|
| 407 |
If you must (re)use the same file many times on the same machine, you may want to manually download it and load it from your local file system to avoid re-downloading, or download and write it to disk only if the file does not exist.
|
| 408 |
+
""")
|
|
|
|
| 409 |
return
|
| 410 |
|
| 411 |
|
| 412 |
@app.cell(hide_code=True)
|
| 413 |
def _(mo):
|
| 414 |
+
mo.md(r"""
|
|
|
|
| 415 |
## Arbitrary web sites
|
| 416 |
|
| 417 |
You can load files from nearly any website just by using an HTTPS URL, as long as it is not locked behind authorization.
|
| 418 |
+
""")
|
|
|
|
| 419 |
return
|
| 420 |
|
| 421 |
|
|
|
|
| 427 |
|
| 428 |
@app.cell(hide_code=True)
|
| 429 |
def _(mo):
|
| 430 |
+
mo.md(r"""
|
|
|
|
| 431 |
## Hugging Face & Kaggle Datasets
|
| 432 |
|
| 433 |
Look for polars inside of dropdowns such as "Use this dataset" in Hugging Face or "Code" in Kaggle, and oftentimes you'll get a snippet to load data directly into a dataframe you can use
|
| 434 |
|
| 435 |
Read more: [Hugging Face](https://docs.pola.rs/user-guide/io/hugging-face/), [Kaggle](https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpolars)
|
| 436 |
+
""")
|
|
|
|
| 437 |
return
|
| 438 |
|
| 439 |
|
|
|
|
| 445 |
|
| 446 |
@app.cell(hide_code=True)
|
| 447 |
def _(mo):
|
| 448 |
+
mo.md(r"""
|
|
|
|
| 449 |
## Cloud Storage - AWS S3, Azure Blob Storage, Google Cloud Storage
|
| 450 |
|
| 451 |
The API is the same for all three storage providers; check the [User Guide](https://docs.pola.rs/user-guide/io/cloud-storage/) if you need any of them.
|
| 452 |
|
| 453 |
Runnable examples are not included in this Notebook as it would require setting up authentication, but the disabled cell below shows an example using Azure.
|
| 454 |
+
""")
|
|
|
|
| 455 |
return
|
| 456 |
|
| 457 |
|
|
|
|
| 478 |
|
| 479 |
@app.cell(hide_code=True)
|
| 480 |
def _(mo):
|
| 481 |
+
mo.md(r"""
|
|
|
|
| 482 |
# Multiplexing
|
| 483 |
|
| 484 |
You can also split a query into multiple sinks via [multiplexing](https://docs.pola.rs/user-guide/lazy/multiplexing/), to avoid reading multiple times, repeating the same operations for each sink or collecting intermediary results into memory.
|
| 485 |
+
""")
|
|
|
|
| 486 |
return
|
| 487 |
|
| 488 |
|
|
|
|
| 506 |
|
| 507 |
@app.cell(hide_code=True)
|
| 508 |
def _(mo):
|
| 509 |
+
mo.md(r"""
|
|
|
|
| 510 |
# Async Execution
|
| 511 |
|
| 512 |
Polars also has experimental support for running lazy queries in `async` mode, letting you `await` operations inside of async functions.
|
| 513 |
+
""")
|
|
|
|
| 514 |
return
|
| 515 |
|
| 516 |
|
|
|
|
| 530 |
|
| 531 |
@app.cell(hide_code=True)
|
| 532 |
def _(mo):
|
| 533 |
+
mo.md(r"""
|
|
|
|
| 534 |
## Conclusion
|
| 535 |
As you have seen, polars makes it easy to work with a variety of formats and different data sources.
|
| 536 |
|
| 537 |
From natively supported formats such as Parquet and CSV files, to using other libraries as an intermediary for XML or geospatial data, and plugins for newly emerging or proprietary formats, as long as your data can fit in a table then odds are you can turn it into a polars DataFrame.
|
| 538 |
|
| 539 |
Combined with loading directly from remote sources, including public data platforms such as Hugging Face and Kaggle as well as private data in your cloud, you can import datasets for almost anything you can imagine.
|
| 540 |
+
""")
|
|
|
|
| 541 |
return
|
| 542 |
|
| 543 |
|
| 544 |
@app.cell(hide_code=True)
|
| 545 |
def _(mo):
|
| 546 |
+
mo.md(r"""
|
|
|
|
| 547 |
## Utilities
|
| 548 |
Imports, utility functions, and the like used throughout the Notebook
|
| 549 |
+
""")
|
|
|
|
| 550 |
return
|
| 551 |
|
| 552 |
|
polars/04_basic_operations.py
CHANGED
|
@@ -8,7 +8,7 @@
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
-
__generated_with = "0.
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
|
@@ -20,14 +20,12 @@ def _():
|
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
-
mo.md(
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
_By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
)
|
| 31 |
return
|
| 32 |
|
| 33 |
|
|
@@ -107,13 +105,11 @@ def _():
|
|
| 107 |
|
| 108 |
@app.cell(hide_code=True)
|
| 109 |
def _(mo):
|
| 110 |
-
mo.md(
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
"""
|
| 116 |
-
)
|
| 117 |
return
|
| 118 |
|
| 119 |
|
|
@@ -125,7 +121,9 @@ def _(df, pl):
|
|
| 125 |
|
| 126 |
@app.cell(hide_code=True)
|
| 127 |
def _(mo):
|
| 128 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 129 |
return
|
| 130 |
|
| 131 |
|
|
@@ -137,12 +135,10 @@ def _(df, pl):
|
|
| 137 |
|
| 138 |
@app.cell(hide_code=True)
|
| 139 |
def _(mo):
|
| 140 |
-
mo.md(
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
"""
|
| 145 |
-
)
|
| 146 |
return
|
| 147 |
|
| 148 |
|
|
@@ -154,7 +150,9 @@ def _(df, pl):
|
|
| 154 |
|
| 155 |
@app.cell(hide_code=True)
|
| 156 |
def _(mo):
|
| 157 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 158 |
return
|
| 159 |
|
| 160 |
|
|
@@ -166,12 +164,10 @@ def _(df, pl):
|
|
| 166 |
|
| 167 |
@app.cell(hide_code=True)
|
| 168 |
def _(mo):
|
| 169 |
-
mo.md(
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
"""
|
| 174 |
-
)
|
| 175 |
return
|
| 176 |
|
| 177 |
|
|
@@ -183,7 +179,9 @@ def _(df, pl):
|
|
| 183 |
|
| 184 |
@app.cell(hide_code=True)
|
| 185 |
def _(mo):
|
| 186 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 187 |
return
|
| 188 |
|
| 189 |
|
|
@@ -195,7 +193,9 @@ def _(df, pl):
|
|
| 195 |
|
| 196 |
@app.cell(hide_code=True)
|
| 197 |
def _(mo):
|
| 198 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 199 |
return
|
| 200 |
|
| 201 |
|
|
@@ -207,12 +207,10 @@ def _(df, pl):
|
|
| 207 |
|
| 208 |
@app.cell(hide_code=True)
|
| 209 |
def _(mo):
|
| 210 |
-
mo.md(
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
"""
|
| 215 |
-
)
|
| 216 |
return
|
| 217 |
|
| 218 |
|
|
@@ -224,7 +222,9 @@ def _(df, pl):
|
|
| 224 |
|
| 225 |
@app.cell(hide_code=True)
|
| 226 |
def _(mo):
|
| 227 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 228 |
return
|
| 229 |
|
| 230 |
|
|
@@ -236,7 +236,9 @@ def _(df, pl):
|
|
| 236 |
|
| 237 |
@app.cell(hide_code=True)
|
| 238 |
def _(mo):
|
| 239 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 240 |
return
|
| 241 |
|
| 242 |
|
|
@@ -248,7 +250,9 @@ def _(df, pl):
|
|
| 248 |
|
| 249 |
@app.cell(hide_code=True)
|
| 250 |
def _(mo):
|
| 251 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 252 |
return
|
| 253 |
|
| 254 |
|
|
@@ -260,16 +264,14 @@ def _(df, pl):
|
|
| 260 |
|
| 261 |
@app.cell(hide_code=True)
|
| 262 |
def _(mo):
|
| 263 |
-
mo.md(
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
Polars encourages you to perform your operations as a chain. This enables you to take advantage of the query optimizer. We'll build upon the above code as a chain.
|
| 267 |
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
)
|
| 273 |
return
|
| 274 |
|
| 275 |
|
|
@@ -284,7 +286,9 @@ def _(df, pl):
|
|
| 284 |
|
| 285 |
@app.cell(hide_code=True)
|
| 286 |
def _(mo):
|
| 287 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 288 |
return
|
| 289 |
|
| 290 |
|
|
@@ -300,13 +304,11 @@ def _(df, pl):
|
|
| 300 |
|
| 301 |
@app.cell(hide_code=True)
|
| 302 |
def _(mo):
|
| 303 |
-
mo.md(
|
| 304 |
-
|
| 305 |
-
We could also do this comparison in one line, if readability is not a concern
|
| 306 |
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
)
|
| 310 |
return
|
| 311 |
|
| 312 |
|
|
@@ -321,7 +323,9 @@ def _(df, pl):
|
|
| 321 |
|
| 322 |
@app.cell(hide_code=True)
|
| 323 |
def _(mo):
|
| 324 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 325 |
return
|
| 326 |
|
| 327 |
|
|
@@ -336,12 +340,10 @@ def _(df, pl):
|
|
| 336 |
|
| 337 |
@app.cell(hide_code=True)
|
| 338 |
def _(mo):
|
| 339 |
-
mo.md(
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
|
| 343 |
-
"""
|
| 344 |
-
)
|
| 345 |
return
|
| 346 |
|
| 347 |
|
|
@@ -356,7 +358,9 @@ def _(df, pl):
|
|
| 356 |
|
| 357 |
@app.cell(hide_code=True)
|
| 358 |
def _(mo):
|
| 359 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 360 |
return
|
| 361 |
|
| 362 |
|
|
@@ -371,7 +375,9 @@ def _(df, pl):
|
|
| 371 |
|
| 372 |
@app.cell(hide_code=True)
|
| 373 |
def _(mo):
|
| 374 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 375 |
return
|
| 376 |
|
| 377 |
|
|
@@ -386,12 +392,10 @@ def _(df, pl):
|
|
| 386 |
|
| 387 |
@app.cell(hide_code=True)
|
| 388 |
def _(mo):
|
| 389 |
-
mo.md(
|
| 390 |
-
|
| 391 |
-
|
| 392 |
-
|
| 393 |
-
"""
|
| 394 |
-
)
|
| 395 |
return
|
| 396 |
|
| 397 |
|
|
@@ -407,7 +411,9 @@ def _(df, pl):
|
|
| 407 |
|
| 408 |
@app.cell(hide_code=True)
|
| 409 |
def _(mo):
|
| 410 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 411 |
return
|
| 412 |
|
| 413 |
|
|
@@ -423,7 +429,9 @@ def _(df, pl):
|
|
| 423 |
|
| 424 |
@app.cell(hide_code=True)
|
| 425 |
def _(mo):
|
| 426 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 427 |
return
|
| 428 |
|
| 429 |
|
|
@@ -439,7 +447,9 @@ def _(df, pl):
|
|
| 439 |
|
| 440 |
@app.cell(hide_code=True)
|
| 441 |
def _(mo):
|
| 442 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 443 |
return
|
| 444 |
|
| 445 |
|
|
@@ -455,14 +465,12 @@ def _(df, pl):
|
|
| 455 |
|
| 456 |
@app.cell(hide_code=True)
|
| 457 |
def _(mo):
|
| 458 |
-
mo.md(
|
| 459 |
-
|
| 460 |
-
**Note**: For "less than", and "less or equal to" you can use the operators `<` or `<=`. Alternatively, you can use built-in functions `lt` or `le` respectively.
|
| 461 |
|
| 462 |
-
|
| 463 |
-
|
| 464 |
-
|
| 465 |
-
)
|
| 466 |
return
|
| 467 |
|
| 468 |
|
|
@@ -478,14 +486,12 @@ def _(df, pl):
|
|
| 478 |
|
| 479 |
@app.cell(hide_code=True)
|
| 480 |
def _(mo):
|
| 481 |
-
mo.md(
|
| 482 |
-
|
| 483 |
-
|
| 484 |
-
If we only want one of the conditions in the comparison to be met, we can use `|`, the `or` operator.
|
| 485 |
|
| 486 |
-
|
| 487 |
-
|
| 488 |
-
)
|
| 489 |
return
|
| 490 |
|
| 491 |
|
|
@@ -500,14 +506,12 @@ def _(df, pl):
|
|
| 500 |
|
| 501 |
@app.cell(hide_code=True)
|
| 502 |
def _(mo):
|
| 503 |
-
mo.md(
|
| 504 |
-
|
| 505 |
-
|
| 506 |
-
Polars also allows you to create new columns based on a condition. Let's create a column *status* that will indicate whether the software is "discontinued" or "in use".
|
| 507 |
|
| 508 |
-
|
| 509 |
-
|
| 510 |
-
)
|
| 511 |
return
|
| 512 |
|
| 513 |
|
|
@@ -519,7 +523,9 @@ def _():
|
|
| 519 |
|
| 520 |
@app.cell(hide_code=True)
|
| 521 |
def _(mo):
|
| 522 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 523 |
return
|
| 524 |
|
| 525 |
|
|
@@ -534,7 +540,9 @@ def _(df, discontinued_list, pl):
|
|
| 534 |
|
| 535 |
@app.cell(hide_code=True)
|
| 536 |
def _(mo):
|
| 537 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 538 |
return
|
| 539 |
|
| 540 |
|
|
@@ -553,12 +561,10 @@ def _(df, discontinued_list, pl):
|
|
| 553 |
|
| 554 |
@app.cell(hide_code=True)
|
| 555 |
def _(mo):
|
| 556 |
-
mo.md(
|
| 557 |
-
|
| 558 |
-
|
| 559 |
-
|
| 560 |
-
"""
|
| 561 |
-
)
|
| 562 |
return
|
| 563 |
|
| 564 |
|
|
@@ -578,7 +584,9 @@ def _(df, discontinued_list, pl):
|
|
| 578 |
|
| 579 |
@app.cell(hide_code=True)
|
| 580 |
def _(mo):
|
| 581 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 582 |
return
|
| 583 |
|
| 584 |
|
|
@@ -598,7 +606,9 @@ def _(df, discontinued_list, pl):
|
|
| 598 |
|
| 599 |
@app.cell(hide_code=True)
|
| 600 |
def _(mo):
|
| 601 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 602 |
return
|
| 603 |
|
| 604 |
|
|
@@ -618,7 +628,9 @@ def _(df, discontinued_list, pl):
|
|
| 618 |
|
| 619 |
@app.cell(hide_code=True)
|
| 620 |
def _(mo):
|
| 621 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 622 |
return
|
| 623 |
|
| 624 |
|
|
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
+
__generated_with = "0.18.4"
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
|
|
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
+
mo.md(r"""
|
| 24 |
+
# Basic operations on data
|
| 25 |
+
_By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
|
|
|
|
| 26 |
|
| 27 |
+
In this notebook, you'll learn how to perform arithmetic operations, comparisons, and conditionals on a Polars dataframe. We'll work with a DataFrame that tracks software usage by year, categorized as either Vintage (old) or Modern (new).
|
| 28 |
+
""")
|
|
|
|
| 29 |
return
|
| 30 |
|
| 31 |
|
|
|
|
| 105 |
|
| 106 |
@app.cell(hide_code=True)
|
| 107 |
def _(mo):
|
| 108 |
+
mo.md(r"""
|
| 109 |
+
## Arithmetic
|
| 110 |
+
### Addition
|
| 111 |
+
Let's add 42 users to each piece of software. This means adding 42 to each value under **users**.
|
| 112 |
+
""")
|
|
|
|
|
|
|
| 113 |
return
|
| 114 |
|
| 115 |
|
|
|
|
| 121 |
|
| 122 |
@app.cell(hide_code=True)
|
| 123 |
def _(mo):
|
| 124 |
+
mo.md(r"""
|
| 125 |
+
Another way to perform the above operation is to use the built-in `add` function.
|
| 126 |
+
""")
|
| 127 |
return
|
| 128 |
|
| 129 |
|
|
|
|
| 135 |
|
| 136 |
@app.cell(hide_code=True)
|
| 137 |
def _(mo):
|
| 138 |
+
mo.md(r"""
|
| 139 |
+
### Subtraction
|
| 140 |
+
Let's subtract 42 users from each piece of software.
|
| 141 |
+
""")
|
|
|
|
|
|
|
| 142 |
return
|
| 143 |
|
| 144 |
|
|
|
|
| 150 |
|
| 151 |
@app.cell(hide_code=True)
|
| 152 |
def _(mo):
|
| 153 |
+
mo.md(r"""
|
| 154 |
+
Alternatively, you could subtract like this:
|
| 155 |
+
""")
|
| 156 |
return
|
| 157 |
|
| 158 |
|
|
|
|
| 164 |
|
| 165 |
@app.cell(hide_code=True)
|
| 166 |
def _(mo):
|
| 167 |
+
mo.md(r"""
|
| 168 |
+
### Division
|
| 169 |
+
Suppose the **users** values are inflated; we can reduce them by dividing by 1000. Here's how to do it.
|
| 170 |
+
""")
|
|
|
|
|
|
|
| 171 |
return
|
| 172 |
|
| 173 |
|
|
|
|
| 179 |
|
| 180 |
@app.cell(hide_code=True)
|
| 181 |
def _(mo):
|
| 182 |
+
mo.md(r"""
|
| 183 |
+
Or we could do it with a built-in expression.
|
| 184 |
+
""")
|
| 185 |
return
|
| 186 |
|
| 187 |
|
|
|
|
| 193 |
|
| 194 |
@app.cell(hide_code=True)
|
| 195 |
def _(mo):
|
| 196 |
+
mo.md(r"""
|
| 197 |
+
If we didn't care about the remainder after division (i.e., dropping the digits after the decimal point), we could do it like this.
|
| 198 |
+
""")
|
| 199 |
return
|
| 200 |
|
| 201 |
|
|
|
|
| 207 |
|
| 208 |
@app.cell(hide_code=True)
|
| 209 |
def _(mo):
|
| 210 |
+
mo.md(r"""
|
| 211 |
+
### Multiplication
|
| 212 |
+
Let's pretend the *users* values are deflated and increase them by multiplying by 100.
|
| 213 |
+
""")
|
|
|
|
|
|
|
| 214 |
return
|
| 215 |
|
| 216 |
|
|
|
|
| 222 |
|
| 223 |
@app.cell(hide_code=True)
|
| 224 |
def _(mo):
|
| 225 |
+
mo.md(r"""
|
| 226 |
+
Polars also has a built-in function for multiplication.
|
| 227 |
+
""")
|
| 228 |
return
|
| 229 |
|
| 230 |
|
|
|
|
| 236 |
|
| 237 |
@app.cell(hide_code=True)
|
| 238 |
def _(mo):
|
| 239 |
+
mo.md(r"""
|
| 240 |
+
So far, we've only modified the values in an existing column. Let's create a column **decade** that will represent the years as decades. Thus 1985 will be 1980 and 2008 will be 2000.
|
| 241 |
+
""")
|
| 242 |
return
|
| 243 |
|
| 244 |
|
|
|
|
| 250 |
|
| 251 |
@app.cell(hide_code=True)
|
| 252 |
def _(mo):
|
| 253 |
+
mo.md(r"""
|
| 254 |
+
Another way to create a new column is as follows:
|
| 255 |
+
""")
|
| 256 |
return
|
| 257 |
|
| 258 |
|
|
|
|
| 264 |
|
| 265 |
@app.cell(hide_code=True)
|
| 266 |
def _(mo):
|
| 267 |
+
mo.md(r"""
|
| 268 |
+
**Tip**
|
| 269 |
+
Polars encourages you to perform your operations as a chain. This enables you to take advantage of the query optimizer. We'll build upon the above code as a chain.
|
|
|
|
| 270 |
|
| 271 |
+
## Comparison
|
| 272 |
+
### Equal
|
| 273 |
+
Let's get all the software categorized as Vintage.
|
| 274 |
+
""")
|
|
|
|
| 275 |
return
|
| 276 |
|
| 277 |
|
|
|
|
| 286 |
|
| 287 |
@app.cell(hide_code=True)
|
| 288 |
def _(mo):
|
| 289 |
+
mo.md(r"""
|
| 290 |
+
We could also do a double comparison. VisiCalc is the only software that's Vintage and from the 1970s decade. Let's perform this comparison operation.
|
| 291 |
+
""")
|
| 292 |
return
|
| 293 |
|
| 294 |
|
|
|
|
| 304 |
|
| 305 |
@app.cell(hide_code=True)
|
| 306 |
def _(mo):
|
| 307 |
+
mo.md(r"""
|
| 308 |
+
We could also do this comparison in one line, if readability is not a concern.
|
|
|
|
| 309 |
|
| 310 |
+
**Notice** that we must enclose the two expressions on either side of the `&` in parentheses.
|
| 311 |
+
""")
|
|
|
|
| 312 |
return
|
| 313 |
|
| 314 |
|
|
|
|
| 323 |
|
| 324 |
@app.cell(hide_code=True)
|
| 325 |
def _(mo):
|
| 326 |
+
mo.md(r"""
|
| 327 |
+
We can also use the built-in `eq` function for equality comparisons.
|
| 328 |
+
""")
|
| 329 |
return
|
| 330 |
|
| 331 |
|
|
|
|
| 340 |
|
| 341 |
@app.cell(hide_code=True)
|
| 342 |
def _(mo):
|
| 343 |
+
mo.md(r"""
|
| 344 |
+
### Not equal
|
| 345 |
+
We can also check whether something is `not` equal to something else. In this case, the category is not Vintage.
|
| 346 |
+
""")
|
|
|
|
|
|
|
| 347 |
return
|
| 348 |
|
| 349 |
|
|
|
|
| 358 |
|
| 359 |
@app.cell(hide_code=True)
|
| 360 |
def _(mo):
|
| 361 |
+
mo.md(r"""
|
| 362 |
+
Or with the built-in function.
|
| 363 |
+
""")
|
| 364 |
return
|
| 365 |
|
| 366 |
|
|
|
|
| 375 |
|
| 376 |
@app.cell(hide_code=True)
|
| 377 |
def _(mo):
|
| 378 |
+
mo.md(r"""
|
| 379 |
+
Or, if you want to be extra clever, you can use the negation operator `~` from logic.
|
| 380 |
+
""")
|
| 381 |
return
|
| 382 |
|
| 383 |
|
|
|
|
| 392 |
|
| 393 |
@app.cell(hide_code=True)
|
| 394 |
def _(mo):
|
| 395 |
+
mo.md(r"""
|
| 396 |
+
### Greater than
|
| 397 |
+
Let's get the software where the year is greater than 2008 from the above dataframe.
|
| 398 |
+
""")
|
|
|
|
|
|
|
| 399 |
return
|
| 400 |
|
| 401 |
|
|
|
|
| 411 |
|
| 412 |
@app.cell(hide_code=True)
|
| 413 |
def _(mo):
|
| 414 |
+
mo.md(r"""
|
| 415 |
+
Or, if we wanted the year 2008 to be included, we could use greater than or equal to.
|
| 416 |
+
""")
|
| 417 |
return
|
| 418 |
|
| 419 |
|
|
|
|
| 429 |
|
| 430 |
@app.cell(hide_code=True)
|
| 431 |
def _(mo):
|
| 432 |
+
mo.md(r"""
|
| 433 |
+
We could do the previous two operations with built-in functions. Here's how with greater than.
|
| 434 |
+
""")
|
| 435 |
return
|
| 436 |
|
| 437 |
|
|
|
|
| 447 |
|
| 448 |
@app.cell(hide_code=True)
|
| 449 |
def _(mo):
|
| 450 |
+
mo.md(r"""
|
| 451 |
+
And here's how with greater than or equal to.
|
| 452 |
+
""")
|
| 453 |
return
|
| 454 |
|
| 455 |
|
|
|
|
| 465 |
|
| 466 |
@app.cell(hide_code=True)
|
| 467 |
def _(mo):
|
| 468 |
+
mo.md(r"""
|
| 469 |
+
**Note**: For "less than" and "less than or equal to", you can use the operators `<` or `<=`. Alternatively, you can use the built-in functions `lt` or `le`, respectively.
|
|
|
|
| 470 |
|
| 471 |
+
### Is between
|
| 472 |
+
Polars also allows us to filter on a range of values. Let's get the modern software where the year is between 2013 and 2016. This is inclusive on both ends (i.e., both years are part of the result).
|
| 473 |
+
""")
|
|
|
|
| 474 |
return
|
| 475 |
|
| 476 |
|
|
|
|
| 486 |
|
| 487 |
@app.cell(hide_code=True)
|
| 488 |
def _(mo):
|
| 489 |
+
mo.md(r"""
|
| 490 |
+
### Or operator
|
| 491 |
+
If we only want either one of the conditions in the comparison to be met, we could use `|`, which is the `or` operator.
|
|
|
|
| 492 |
|
| 493 |
+
Let's get software that is either Modern or was used in the 1980s decade.
|
| 494 |
+
""")
|
|
|
|
| 495 |
return
|
| 496 |
|
| 497 |
|
|
|
|
| 506 |
|
| 507 |
@app.cell(hide_code=True)
|
| 508 |
def _(mo):
|
| 509 |
+
mo.md(r"""
|
| 510 |
+
## Conditionals
|
| 511 |
+
Polars also allows you to create new columns based on a condition. Let's create a column *status* that will indicate whether the software is "discontinued" or "in use".
|
|
|
|
| 512 |
|
| 513 |
+
Here's a list of products that are no longer in use.
|
| 514 |
+
""")
|
|
|
|
| 515 |
return
|
| 516 |
|
| 517 |
|
|
|
|
| 523 |
|
| 524 |
@app.cell(hide_code=True)
|
| 525 |
def _(mo):
|
| 526 |
+
mo.md(r"""
|
| 527 |
+
Here's how we can get a dataframe of the products that are discontinued.
|
| 528 |
+
""")
|
| 529 |
return
|
| 530 |
|
| 531 |
|
|
|
|
| 540 |
|
| 541 |
@app.cell(hide_code=True)
|
| 542 |
def _(mo):
|
| 543 |
+
mo.md(r"""
|
| 544 |
+
Now, let's create the **status** column.
|
| 545 |
+
""")
|
| 546 |
return
|
| 547 |
|
| 548 |
|
|
|
|
| 561 |
|
| 562 |
@app.cell(hide_code=True)
|
| 563 |
def _(mo):
|
| 564 |
+
mo.md(r"""
|
| 565 |
+
## Unique counts
|
| 566 |
+
Sometimes you may want to see only the unique values in a column. Let's check the unique decades we have in our DataFrame.
|
| 567 |
+
""")
|
|
|
|
|
|
|
| 568 |
return
|
| 569 |
|
| 570 |
|
|
|
|
| 584 |
|
| 585 |
@app.cell(hide_code=True)
|
| 586 |
def _(mo):
|
| 587 |
+
mo.md(r"""
|
| 588 |
+
Finally, let's find out the number of software titles used in each decade.
|
| 589 |
+
""")
|
| 590 |
return
|
| 591 |
|
| 592 |
|
|
|
|
| 606 |
|
| 607 |
@app.cell(hide_code=True)
|
| 608 |
def _(mo):
|
| 609 |
+
mo.md(r"""
|
| 610 |
+
We could also rewrite the above code as follows:
|
| 611 |
+
""")
|
| 612 |
return
|
| 613 |
|
| 614 |
|
|
|
|
| 628 |
|
| 629 |
@app.cell(hide_code=True)
|
| 630 |
def _(mo):
|
| 631 |
+
mo.md(r"""
|
| 632 |
+
Hopefully, we've piqued your interest enough to try out Polars the next time you analyze your data.
|
| 633 |
+
""")
|
| 634 |
return
|
| 635 |
|
| 636 |
|
polars/05_reactive_plots.py
CHANGED
|
@@ -11,26 +11,24 @@
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
-
__generated_with = "0.
|
| 15 |
app = marimo.App(width="medium")
|
| 16 |
|
| 17 |
|
| 18 |
@app.cell(hide_code=True)
|
| 19 |
def _(mo):
|
| 20 |
-
mo.md(
|
| 21 |
-
|
| 22 |
-
# Reactive Plots
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
)
|
| 34 |
return
|
| 35 |
|
| 36 |
|
|
@@ -47,20 +45,18 @@ def _(pl):
|
|
| 47 |
# Or save to a local file first if you want to avoid downloading it each time you run:
|
| 48 |
# file_path = "spotify-tracks.parquet"
|
| 49 |
# lz = pl.scan_parquet(file_path)
|
| 50 |
-
return
|
| 51 |
|
| 52 |
|
| 53 |
@app.cell(hide_code=True)
|
| 54 |
def _(mo):
|
| 55 |
-
mo.md(
|
| 56 |
-
|
| 57 |
-
You should always take a look at the data you are working on before actually doing any operations on it - for data coming from sources such as HuggingFace or Kaggle you can preview it via their websites, and optionally filter or do some transformations before downloading.
|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
)
|
| 64 |
return
|
| 65 |
|
| 66 |
|
|
@@ -87,18 +83,16 @@ def _(lz, pl):
|
|
| 87 |
|
| 88 |
@app.cell(hide_code=True)
|
| 89 |
def _(mo):
|
| 90 |
-
mo.md(
|
| 91 |
-
|
| 92 |
-
When you start exploring a dataset, some of the first things to do may include:
|
| 93 |
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
)
|
| 102 |
return
|
| 103 |
|
| 104 |
|
|
@@ -112,13 +106,11 @@ def _(df, pl):
|
|
| 112 |
|
| 113 |
@app.cell(hide_code=True)
|
| 114 |
def _(mo):
|
| 115 |
-
mo.md(
|
| 116 |
-
|
| 117 |
-
For this Notebook we will be using [plotly](https://plotly.com/python), but Marimo also [supports other plotting libraries](https://docs.marimo.io/guides/working_with_data/plotting/).
|
| 118 |
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
)
|
| 122 |
return
|
| 123 |
|
| 124 |
|
|
@@ -129,20 +121,18 @@ def _(df, mo, px):
|
|
| 129 |
fig.update_layout(selectdirection="h")
|
| 130 |
plot = mo.ui.plotly(fig)
|
| 131 |
plot
|
| 132 |
-
return
|
| 133 |
|
| 134 |
|
| 135 |
@app.cell(hide_code=True)
|
| 136 |
def _(mo):
|
| 137 |
-
mo.md(
|
| 138 |
-
|
| 139 |
-
Note how there are a few outliers with a very short duration (less than 2 minutes) and a few with a very long duration (more than 6 minutes).
|
| 140 |
|
| 141 |
-
|
| 142 |
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
)
|
| 146 |
return
|
| 147 |
|
| 148 |
|
|
@@ -154,7 +144,7 @@ def _(pl, plot):
|
|
| 154 |
|
| 155 |
|
| 156 |
@app.cell
|
| 157 |
-
def _(df,
|
| 158 |
# Now, we want to filter to only include tracks whose duration falls inside of our selection - we will need to first identify the extremes, then filter based on them
|
| 159 |
min_dur, max_dur = get_extremes(
|
| 160 |
plot.value, col="duration_seconds", defaults_if_missing=(120, 360)
|
|
@@ -168,27 +158,25 @@ def _(df, get_extremes, pl, plot):
|
|
| 168 |
# Actually apply the filter
|
| 169 |
filtered_duration = df.filter(duration_in_range)
|
| 170 |
filtered_duration
|
| 171 |
-
return
|
| 172 |
|
| 173 |
|
| 174 |
@app.cell(hide_code=True)
|
| 175 |
def _(mo):
|
| 176 |
-
mo.md(
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
"""
|
| 191 |
-
)
|
| 192 |
return
|
| 193 |
|
| 194 |
|
|
@@ -235,18 +223,16 @@ def _(filter_genre, filtered_duration, mo, pl):
|
|
| 235 |
),
|
| 236 |
],
|
| 237 |
)
|
| 238 |
-
return
|
| 239 |
|
| 240 |
|
| 241 |
@app.cell(hide_code=True)
|
| 242 |
def _(mo):
|
| 243 |
-
mo.md(
|
| 244 |
-
|
| 245 |
-
So far so good - but there's been a distinct lack of visualisations, so let's fix that.
|
| 246 |
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
)
|
| 250 |
return
|
| 251 |
|
| 252 |
|
|
@@ -263,22 +249,20 @@ def _(filtered_duration, pl, px):
|
|
| 263 |
x="popularity",
|
| 264 |
)
|
| 265 |
fig_dur_per_genre
|
| 266 |
-
return
|
| 267 |
|
| 268 |
|
| 269 |
@app.cell(hide_code=True)
|
| 270 |
def _(mo):
|
| 271 |
-
mo.md(
|
| 272 |
-
|
| 273 |
-
Now, why don't we play a bit with marimo's UI elements?
|
| 274 |
|
| 275 |
-
|
| 276 |
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
|
| 280 |
-
|
| 281 |
-
)
|
| 282 |
return
|
| 283 |
|
| 284 |
|
|
@@ -312,18 +296,16 @@ def _(
|
|
| 312 |
chart2 = mo.ui.plotly(fig2)
|
| 313 |
|
| 314 |
mo.vstack([mo.hstack([x_axis, y_axis, color, alpha, include_trendline, filter_genre2]), chart2])
|
| 315 |
-
return chart2,
|
| 316 |
|
| 317 |
|
| 318 |
@app.cell(hide_code=True)
|
| 319 |
def _(mo):
|
| 320 |
-
mo.md(
|
| 321 |
-
|
| 322 |
-
As we have seen before, we can also use the plot as an input to select a region and look at it in more detail.
|
| 323 |
|
| 324 |
-
|
| 325 |
-
|
| 326 |
-
)
|
| 327 |
return
|
| 328 |
|
| 329 |
|
|
@@ -340,47 +322,45 @@ def _(chart2, filtered_duration, mo, pl):
|
|
| 340 |
pl.col(column_order), pl.exclude(*column_order)
|
| 341 |
)
|
| 342 |
out
|
| 343 |
-
return
|
| 344 |
|
| 345 |
|
| 346 |
@app.cell(hide_code=True)
|
| 347 |
def _(mo):
|
| 348 |
-
mo.md(
|
| 349 |
-
|
| 350 |
-
In this notebook, we've focused on a few key aspects. First, it's essential to *understand* the data you're working with — this forms the foundation of any analysis.
|
| 351 |
|
| 352 |
-
|
| 353 |
|
| 354 |
-
|
| 355 |
-
|
| 356 |
|
| 357 |
-
|
| 358 |
-
|
| 359 |
-
)
|
| 360 |
return
|
| 361 |
|
| 362 |
|
| 363 |
@app.cell(hide_code=True)
|
| 364 |
def _(mo):
|
| 365 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 366 |
return
|
| 367 |
|
| 368 |
|
| 369 |
-
@app.
|
| 370 |
-
def get_extremes():
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
|
| 375 |
-
|
| 376 |
-
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
)
|
| 383 |
-
return (get_extremes,)
|
| 384 |
|
| 385 |
|
| 386 |
@app.cell
|
|
@@ -426,20 +406,14 @@ def _(filtered_duration, mo):
|
|
| 426 |
searchable=True,
|
| 427 |
label="Filter by Track Genre:",
|
| 428 |
)
|
| 429 |
-
return
|
| 430 |
-
alpha,
|
| 431 |
-
color,
|
| 432 |
-
filter_genre2,
|
| 433 |
-
include_trendline,
|
| 434 |
-
options,
|
| 435 |
-
x_axis,
|
| 436 |
-
y_axis,
|
| 437 |
-
)
|
| 438 |
|
| 439 |
|
| 440 |
@app.cell(hide_code=True)
|
| 441 |
def _(mo):
|
| 442 |
-
mo.md("""
|
|
|
|
|
|
|
| 443 |
return
|
| 444 |
|
| 445 |
|
|
@@ -461,12 +435,7 @@ def _(filtered_duration, mo, pl):
|
|
| 461 |
# So we just provide freeform text boxes and filter ourselves later
|
| 462 |
# (the "alternative_" in the name is just to avoid conflicts with the above cell,
|
| 463 |
# despite this being disabled marimo still requires global variables to be unique)
|
| 464 |
-
return
|
| 465 |
-
all_artists,
|
| 466 |
-
all_tracks,
|
| 467 |
-
alternative_filter_artist,
|
| 468 |
-
alternative_filter_track,
|
| 469 |
-
)
|
| 470 |
|
| 471 |
|
| 472 |
@app.cell
|
|
@@ -503,7 +472,7 @@ def _(filter_artist, filter_track, filtered_duration, mo, pl):
|
|
| 503 |
)
|
| 504 |
|
| 505 |
mo.vstack([mo.md("Filter a track based on its name or artist"), filter_artist, filter_track, filtered_artist_track])
|
| 506 |
-
return
|
| 507 |
|
| 508 |
|
| 509 |
@app.cell
|
|
@@ -532,7 +501,7 @@ def _(filter_genre2, filtered_duration, mo, pl):
|
|
| 532 |
],
|
| 533 |
align="center",
|
| 534 |
)
|
| 535 |
-
return
|
| 536 |
|
| 537 |
|
| 538 |
@app.cell
|
|
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
+
__generated_with = "0.18.4"
|
| 15 |
app = marimo.App(width="medium")
|
| 16 |
|
| 17 |
|
| 18 |
@app.cell(hide_code=True)
|
| 19 |
def _(mo):
|
| 20 |
+
mo.md("""
|
| 21 |
+
# Reactive Plots
|
|
|
|
| 22 |
|
| 23 |
+
_By [etrotta](https://github.com/etrotta)._
|
| 24 |
|
| 25 |
+
This tutorial covers Data Visualisation basics using marimo, [polars](https://docs.pola.rs/) and [plotly](https://plotly.com/python/plotly-express/).
|
| 26 |
+
It shows how to load data, explore and visualise it, then use User Interface elements (including the plots themselves) to filter and select data for more refined analysis.
|
| 27 |
|
| 28 |
+
We will be using a [Spotify Tracks dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). Before you write any code yourself, I recommend taking some time to understand the data you're working with, from which columns are available to what their possible values are, as well as more abstract details such as the scope, coverage and intended uses of the dataset.
|
| 29 |
|
| 30 |
+
Note that this dataset does not contain data about ***all*** tracks. You can try using a larger dataset such as [bigdata-pw/Spotify](https://huggingface.co/datasets/bigdata-pw/Spotify), but I'm sticking with the smaller one to keep the notebook size manageable for most users.
|
| 31 |
+
""")
|
|
|
|
| 32 |
return
|
| 33 |
|
| 34 |
|
|
|
|
| 45 |
# Or save to a local file first if you want to avoid downloading it each time you run:
|
| 46 |
# file_path = "spotify-tracks.parquet"
|
| 47 |
# lz = pl.scan_parquet(file_path)
|
| 48 |
+
return (lz,)
|
| 49 |
|
| 50 |
|
| 51 |
@app.cell(hide_code=True)
|
| 52 |
def _(mo):
|
| 53 |
+
mo.md("""
|
| 54 |
+
You should always take a look at the data you are working on before actually doing any operations on it - for data coming from sources such as HuggingFace or Kaggle you can preview it via their websites, and optionally filter or do some transformations before downloading.
|
|
|
|
| 55 |
|
| 56 |
+
The [Polars Lazy API](https://docs.pola.rs/user-guide/lazy/) allows you to define operations before loading the data; polars will then optimize the plan to avoid unnecessary operations and to skip loading data we do not care about.
|
| 57 |
|
| 58 |
+
Let's say that looking at the dataset's preview in the Data Viewer, we decided we do not want the Unnamed column (which appears to be the row index), nor do we care about the original ID, and we only want non-explicit tracks.
|
| 59 |
+
""")
|
|
|
|
| 60 |
return
|
| 61 |
|
| 62 |
|
|
|
|
| 83 |
|
| 84 |
@app.cell(hide_code=True)
|
| 85 |
def _(mo):
|
| 86 |
+
mo.md(r"""
|
| 87 |
+
When you start exploring a dataset, some of the first things to do may include:
|
|
|
|
| 88 |
|
| 89 |
+
- investigating any values that seem weird
|
| 90 |
+
- verifying if there could be issues in the data
|
| 91 |
+
- checking for potential bugs in our pipelines
|
| 92 |
+
- ensuring you understand the data correctly, including its relationships and edge cases
|
| 93 |
|
| 94 |
+
For example, the "min" value for the duration column is zero, and the max is over an hour. Why is that?
|
| 95 |
+
""")
|
|
|
|
| 96 |
return
|
| 97 |
|
| 98 |
|
|
|
|
| 106 |
|
| 107 |
@app.cell(hide_code=True)
|
| 108 |
def _(mo):
|
| 109 |
+
mo.md(r"""
|
| 110 |
+
For this Notebook we will be using [plotly](https://plotly.com/python), but Marimo also [supports other plotting libraries](https://docs.marimo.io/guides/working_with_data/plotting/).
|
|
|
|
| 111 |
|
| 112 |
+
Let's visualize it using a [bar chart](https://plotly.com/python/bar-charts/) and get a feel for which region makes sense to focus on for our analysis.
|
| 113 |
+
""")
|
|
|
|
| 114 |
return
|
| 115 |
|
| 116 |
|
|
|
|
| 121 |
fig.update_layout(selectdirection="h")
|
| 122 |
plot = mo.ui.plotly(fig)
|
| 123 |
plot
|
| 124 |
+
return (plot,)
|
| 125 |
|
| 126 |
|
| 127 |
@app.cell(hide_code=True)
|
| 128 |
def _(mo):
|
| 129 |
+
mo.md("""
|
| 130 |
+
Note how there are a few outliers with a very short duration (less than 2 minutes) and a few with a very long duration (more than 6 minutes).
|
|
|
|
| 131 |
|
| 132 |
+
You can select a region in the graph by clicking and dragging, which can later be used to filter or transform data. In this Notebook we set a default if there is no selection, but you should try selecting a region yourself.
|
| 133 |
|
| 134 |
+
We will focus on those within that middle ground from around 120 seconds to 360 seconds, but you can play around with it a bit and see how the results change if you move the Selection region. Perhaps you can even find some Classical songs?
|
| 135 |
+
""")
|
|
|
|
| 136 |
return
|
| 137 |
|
| 138 |
|
|
|
|
| 144 |
|
| 145 |
|
| 146 |
@app.cell
|
| 147 |
+
def _(df, pl, plot):
|
| 148 |
# Now, we want to filter to only include tracks whose duration falls inside of our selection - we will need to first identify the extremes, then filter based on them
|
| 149 |
min_dur, max_dur = get_extremes(
|
| 150 |
plot.value, col="duration_seconds", defaults_if_missing=(120, 360)
|
|
|
|
| 158 |
# Actually apply the filter
|
| 159 |
filtered_duration = df.filter(duration_in_range)
|
| 160 |
filtered_duration
|
| 161 |
+
return (filtered_duration,)
|
| 162 |
|
| 163 |
|
| 164 |
@app.cell(hide_code=True)
|
| 165 |
def _(mo):
|
| 166 |
+
mo.md(r"""
|
| 167 |
+
Now that our data is 'clean', let's start coming up with and answering some questions about it. Some examples:
|
| 168 |
+
|
| 169 |
+
- Which tracks or artists are the most popular? (Both globally as well as for each genre)
|
| 170 |
+
- Which genres are the most popular? The loudest?
|
| 171 |
+
- What are some common combinations of different artists?
|
| 172 |
+
- What can we infer from the track's title or artist name?
|
| 173 |
+
- How popular is some specific song you like?
|
| 174 |
+
- How much does the mode and key affect other attributes?
|
| 175 |
+
- Can you classify a song's genre based on its attributes?
|
| 176 |
+
|
| 177 |
+
For brevity, we will not explore all of them - feel free to try some of the others yourself, or go deeper into the ones explored here.
|
| 178 |
+
Make sure to come up with some questions of your own and explore them as well!
|
| 179 |
+
""")
|
|
|
|
|
|
|
| 180 |
return
|
| 181 |
|
| 182 |
|
|
|
|
| 223 |
),
|
| 224 |
],
|
| 225 |
)
|
| 226 |
+
return
|
| 227 |
|
| 228 |
|
| 229 |
@app.cell(hide_code=True)
|
| 230 |
def _(mo):
|
| 231 |
+
mo.md(r"""
|
| 232 |
+
So far so good - but there's been a distinct lack of visualisations, so let's fix that.
|
|
|
|
| 233 |
|
| 234 |
+
Let's start simple, just some metrics for each genre:
|
| 235 |
+
""")
|
|
|
|
| 236 |
return
|
| 237 |
|
| 238 |
|
|
|
|
| 249 |
x="popularity",
|
| 250 |
)
|
| 251 |
fig_dur_per_genre
|
| 252 |
+
return
|
| 253 |
|
| 254 |
|
| 255 |
@app.cell(hide_code=True)
|
| 256 |
def _(mo):
|
| 257 |
+
mo.md(r"""
|
| 258 |
+
Now, why don't we play a bit with marimo's UI elements?
|
|
|
|
| 259 |
|
| 260 |
+
We will use Dropdowns to allow the user to select any column to use for the visualisation, and throw in some extras:
|
| 261 |
|
| 262 |
+
- A slider for the transparency to help understand dense clusters
|
| 263 |
+
- Add a Trendline to the scatterplot (requires statsmodels)
|
| 264 |
+
- Filter by some specific Genre
|
| 265 |
+
""")
|
|
|
|
| 266 |
return
|
| 267 |
|
| 268 |
|
|
|
|
| 296 |
chart2 = mo.ui.plotly(fig2)
|
| 297 |
|
| 298 |
mo.vstack([mo.hstack([x_axis, y_axis, color, alpha, include_trendline, filter_genre2]), chart2])
|
| 299 |
+
return (chart2,)
|
| 300 |
|
| 301 |
|
| 302 |
@app.cell(hide_code=True)
|
| 303 |
def _(mo):
|
| 304 |
+
mo.md(r"""
|
| 305 |
+
As we have seen before, we can also use the plot as an input to select a region and look at it in more detail.
|
|
|
|
| 306 |
|
| 307 |
+
Try selecting a region then performing some explorations of your own with the data inside of it.
|
| 308 |
+
""")
|
|
|
|
| 309 |
return
|
| 310 |
|
| 311 |
|
|
|
|
| 322 |
pl.col(column_order), pl.exclude(*column_order)
|
| 323 |
)
|
| 324 |
out
|
| 325 |
+
return
|
| 326 |
|
| 327 |
|
| 328 |
@app.cell(hide_code=True)
|
| 329 |
def _(mo):
|
| 330 |
+
mo.md(r"""
|
| 331 |
+
In this notebook, we've focused on a few key aspects. First, it's essential to *understand* the data you're working with — this forms the foundation of any analysis.
|
|
|
|
| 332 |
|
| 333 |
+
Creating plots is a powerful way to identify patterns, outliers, and trends. These visualizations are not just for _presentation_; they are tools for deeper insight.
|
| 334 |
|
| 335 |
+
/// NOTE
|
| 336 |
+
With marimo's `interactive` UI elements, exploring different _facets_ of the data becomes seamless, allowing for dynamic analysis without altering the code.
|
| 337 |
|
| 338 |
+
Keep these points in mind as you continue to work with data.
|
| 339 |
+
""")
|
|
|
|
| 340 |
return
|
| 341 |
|
| 342 |
|
| 343 |
@app.cell(hide_code=True)
|
| 344 |
def _(mo):
|
| 345 |
+
mo.md(r"""
|
| 346 |
+
# Utility Functions and UI Elements
|
| 347 |
+
""")
|
| 348 |
return
|
| 349 |
|
| 350 |
|
| 351 |
+
@app.function
|
| 352 |
+
def get_extremes(selection, col, defaults_if_missing):
|
| 353 |
+
"Get the minimum and maximum values for a given column within the selection"
|
| 354 |
+
if selection is None or len(selection) == 0:
|
| 355 |
+
print(
|
| 356 |
+
f"Could not find a selected region. Using default values {defaults_if_missing} instead, try clicking and dragging in the plot to change them."
|
| 357 |
+
)
|
| 358 |
+
return defaults_if_missing
|
| 359 |
+
else:
|
| 360 |
+
return (
|
| 361 |
+
min(row[col] for row in selection),
|
| 362 |
+
max(row[col] for row in selection),
|
| 363 |
+
)
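A quick way to sanity-check the helper is to feed it a fake selection (each selected point arrives as a dict of column -> value); the copy below drops the fallback message for brevity:

```python
def get_extremes(selection, col, defaults_if_missing):
    # Trimmed copy of the helper above, without the fallback print
    if selection is None or len(selection) == 0:
        return defaults_if_missing
    return (
        min(row[col] for row in selection),
        max(row[col] for row in selection),
    )

picked = get_extremes(
    [{"duration_seconds": 150}, {"duration_seconds": 300}],
    col="duration_seconds",
    defaults_if_missing=(120, 360),
)
fallback = get_extremes(None, col="duration_seconds", defaults_if_missing=(120, 360))
```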
|
|
|
|
|
|
|
| 364 |
|
| 365 |
|
| 366 |
@app.cell
|
|
|
|
| 406 |
searchable=True,
|
| 407 |
label="Filter by Track Genre:",
|
| 408 |
)
|
| 409 |
+
return alpha, color, filter_genre2, include_trendline, x_axis, y_axis
|
| 410 |
|
| 411 |
|
| 412 |
@app.cell(hide_code=True)
|
| 413 |
def _(mo):
|
| 414 |
+
mo.md("""
|
| 415 |
+
# Appendix : Some other examples
|
| 416 |
+
""")
|
| 417 |
return
|
| 418 |
|
| 419 |
|
|
|
|
| 435 |
# So we just provide freeform text boxes and filter ourselves later
|
| 436 |
# (the "alternative_" in the name is just to avoid conflicts with the above cell,
|
| 437 |
# despite this being disabled marimo still requires global variables to be unique)
|
| 438 |
+
return
|
| 439 |
|
| 440 |
|
| 441 |
@app.cell
|
|
|
|
| 472 |
)
|
| 473 |
|
| 474 |
mo.vstack([mo.md("Filter a track based on its name or artist"), filter_artist, filter_track, filtered_artist_track])
|
| 475 |
+
return
|
| 476 |
|
| 477 |
|
| 478 |
@app.cell
|
|
|
|
| 501 |
],
|
| 502 |
align="center",
|
| 503 |
)
|
| 504 |
+
return
|
| 505 |
|
| 506 |
|
| 507 |
@app.cell
|
polars/06_Dataframe_Transformer.py
CHANGED
|
@@ -12,21 +12,19 @@
|
|
| 12 |
|
| 13 |
import marimo
|
| 14 |
|
| 15 |
-
__generated_with = "0.
|
| 16 |
app = marimo.App(width="medium")
|
| 17 |
|
| 18 |
|
| 19 |
@app.cell(hide_code=True)
|
| 20 |
def _(mo):
|
| 21 |
-
mo.md(
|
| 22 |
-
r"""
|
| 23 |
# Polars with Marimo's Dataframe Transformer
|
| 24 |
|
| 25 |
*By [jesshart](https://github.com/jesshart)*
|
| 26 |
|
| 27 |
The goal of this notebook is to explore Marimo's data exploration capabilities alongside the power of polars. Feel free to reference the latest documentation on these Marimo features here: https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
|
| 28 |
-
"""
|
| 29 |
-
)
|
| 30 |
return
|
| 31 |
|
| 32 |
|
|
@@ -40,14 +38,12 @@ def _(requests):
|
|
| 40 |
|
| 41 |
@app.cell(hide_code=True)
|
| 42 |
def _(mo):
|
| 43 |
-
mo.md(
|
| 44 |
-
r"""
|
| 45 |
# Loading Data
|
| 46 |
Let's start by loading our data and getting it into the `.lazy()` format so our transformations and queries are speedy.
|
| 47 |
|
| 48 |
Read more about `.lazy()` here: https://docs.pola.rs/user-guide/lazy/
|
| 49 |
-
"""
|
| 50 |
-
)
|
| 51 |
return
|
| 52 |
|
| 53 |
|
|
@@ -60,21 +56,18 @@ def _(json_data, pl):
|
|
| 60 |
|
| 61 |
@app.cell(hide_code=True)
|
| 62 |
def _(mo):
|
| 63 |
-
mo.md(
|
| 64 |
-
|
| 65 |
-
Above, you will notice that when you reference the object as a standalone, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from.
|
| 66 |
|
| 67 |
- 💡 Try out the `Table` view! You can click the `Preview data` button to get a quick view of your data.
|
| 68 |
- 💡 Take a look at the `Query plan`. Learn more about Polars' query plan here: https://docs.pola.rs/user-guide/lazy/query-plan/
|
| 69 |
-
"""
|
| 70 |
-
)
|
| 71 |
return
|
| 72 |
|
| 73 |
|
| 74 |
@app.cell(hide_code=True)
|
| 75 |
def _(mo):
|
| 76 |
-
mo.md(
|
| 77 |
-
r"""
|
| 78 |
## marimo's Native Dataframe UI
|
| 79 |
|
| 80 |
There are a few ways to leverage marimo's native dataframe UI. One is by doing what we saw above—by referencing a `pl.LazyFrame` directly. You can also try:
|
|
@@ -83,19 +76,16 @@ def _(mo):
|
|
| 83 |
- Referencing a `pl.DataFrame` and seeing how it differs from its corresponding lazy version
|
| 84 |
- Use `mo.ui.table`
|
| 85 |
- Use `mo.ui.dataframe`
|
| 86 |
-
"""
|
| 87 |
-
)
|
| 88 |
return
|
| 89 |
|
| 90 |
|
| 91 |
@app.cell(hide_code=True)
|
| 92 |
def _(mo):
|
| 93 |
-
mo.md(
|
| 94 |
-
r"""
|
| 95 |
## Reference a `pl.DataFrame`
|
| 96 |
Let's reference the same frame as before, but this time as a `pl.DataFrame` by calling `.collect()` on it.
|
| 97 |
-
"""
|
| 98 |
-
)
|
| 99 |
return
|
| 100 |
|
| 101 |
|
|
@@ -107,26 +97,22 @@ def _(demand: "pl.LazyFrame"):
|
|
| 107 |
|
| 108 |
@app.cell(hide_code=True)
|
| 109 |
def _(mo):
|
| 110 |
-
mo.md(
|
| 111 |
-
r"""
|
| 112 |
Note how much functionality we have right out-of-the-box. Click on column names to see rich features like sorting, freezing, filtering, searching, and more!
|
| 113 |
|
| 114 |
Notice how `order_quantity` has a green bar chart under it indicating the distribution of values for the field!
|
| 115 |
|
| 116 |
Don't miss the `Download` feature as well, which supports downloading in CSV, JSON, or Parquet format!
|
| 117 |
-
"""
|
| 118 |
-
)
|
| 119 |
return
|
| 120 |
|
| 121 |
|
| 122 |
@app.cell(hide_code=True)
|
| 123 |
def _(mo):
|
| 124 |
-
mo.md(
|
| 125 |
-
r"""
|
| 126 |
## Use `mo.ui.table`
|
| 127 |
`mo.ui.table` lets you select rows interactively and then use the selection as a filtered set of rows downstream.
|
| 128 |
-
"""
|
| 129 |
-
)
|
| 130 |
return
|
| 131 |
|
| 132 |
|
|
@@ -144,7 +130,9 @@ def _(demand_table):
|
|
| 144 |
|
| 145 |
@app.cell(hide_code=True)
|
| 146 |
def _(mo):
|
| 147 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 148 |
return
|
| 149 |
|
| 150 |
|
|
@@ -175,13 +163,11 @@ def _(summary_table):
|
|
| 175 |
|
| 176 |
@app.cell(hide_code=True)
|
| 177 |
def _(mo):
|
| 178 |
-
mo.md(
|
| 179 |
-
r"""
|
| 180 |
Now, instead of manually creating a filter for what I want to take a closer look at, I simply select from the UI and do a simple join to get that aggregated level with more detail.
|
| 181 |
|
| 182 |
The following cell uses the output of the `mo.ui.table` selection, selects its unique keys, and uses that to join for the selected subset of the original table.
|
| 183 |
-
"""
|
| 184 |
-
)
|
| 185 |
return
|
| 186 |
|
| 187 |
|
|
@@ -199,13 +185,17 @@ def _(demand: "pl.LazyFrame", pl, summary_table):
|
|
| 199 |
|
| 200 |
@app.cell(hide_code=True)
|
| 201 |
def _(mo):
|
| 202 |
-
mo.md("""
|
|
|
|
|
|
|
| 203 |
return
|
| 204 |
|
| 205 |
|
| 206 |
@app.cell(hide_code=True)
|
| 207 |
def _(mo):
|
| 208 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 209 |
return
|
| 210 |
|
| 211 |
|
|
@@ -218,7 +208,9 @@ def _(demand: "pl.LazyFrame", mo):
|
|
| 218 |
|
| 219 |
@app.cell(hide_code=True)
|
| 220 |
def _(mo):
|
| 221 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 222 |
return
|
| 223 |
|
| 224 |
|
|
@@ -230,7 +222,9 @@ def _(mo_dataframe):
|
|
| 230 |
|
| 231 |
@app.cell(hide_code=True)
|
| 232 |
def _(mo):
|
| 233 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 234 |
return
|
| 235 |
|
| 236 |
|
|
@@ -245,16 +239,14 @@ def _(demand_cached, pl):
|
|
| 245 |
|
| 246 |
@app.cell(hide_code=True)
|
| 247 |
def _(mo):
|
| 248 |
-
mo.md(
|
| 249 |
-
f"""
|
| 250 |
## Try Before You Buy
|
| 251 |
|
| 252 |
1. Now try to do the same summary using Marimo's `mo.ui.dataframe` object above. Also, note how your aggregated column is already renamed! Nice touch!
|
| 253 |
2. Try (1) again, but use select statements first. (This is better Polars practice anyway, since it reduces the frame as you move toward aggregation.)
|
| 254 |
|
| 255 |
*When you are ready, check the `Python Code` tab at the top of the table to compare your output to the answer below.*
|
| 256 |
-
"""
|
| 257 |
-
)
|
| 258 |
return
|
| 259 |
|
| 260 |
|
|
@@ -331,29 +323,27 @@ def _(demand_agg: "pl.DataFrame", mo, px):
|
|
| 331 |
|
| 332 |
@app.cell(hide_code=True)
|
| 333 |
def _(mo):
|
| 334 |
-
mo.md(
|
| 335 |
-
r"""
|
| 336 |
# About this Notebook
|
| 337 |
Polars and Marimo are both relatively new to the data wrangling space, but their power (and the thrill of their use) cannot be overstated—well, I suppose it could, but you get the meaning. In this notebook, you learn how to leverage basic Polars skills to load in and explore your data in concert with Marimo's powerful UI elements.
|
| 338 |
|
| 339 |
## 📚 Documentation References
|
| 340 |
|
| 341 |
-
- **Marimo: Dataframe Transformation Guide**
|
| 342 |
https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
|
| 343 |
|
| 344 |
-
- **Polars: Lazy API Overview**
|
| 345 |
https://docs.pola.rs/user-guide/lazy/
|
| 346 |
|
| 347 |
-
- **Polars: Query Plan Explained**
|
| 348 |
https://docs.pola.rs/user-guide/lazy/query-plan/
|
| 349 |
|
| 350 |
-
- **Marimo Notebook: Basic Polars Joins (by jesshart)**
|
| 351 |
https://marimo.io/p/@jesshart/basic-polars-joins
|
| 352 |
|
| 353 |
-
- **Marimo Learn: Interactive Graphs with Polars**
|
| 354 |
https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py
|
| 355 |
-
"""
|
| 356 |
-
)
|
| 357 |
return
|
| 358 |
|
| 359 |
|
|
|
|
| 12 |
|
| 13 |
import marimo
|
| 14 |
|
| 15 |
+
__generated_with = "0.18.4"
|
| 16 |
app = marimo.App(width="medium")
|
| 17 |
|
| 18 |
|
| 19 |
@app.cell(hide_code=True)
|
| 20 |
def _(mo):
|
| 21 |
+
mo.md(r"""
|
|
|
|
| 22 |
# Polars with Marimo's Dataframe Transformer
|
| 23 |
|
| 24 |
*By [jesshart](https://github.com/jesshart)*
|
| 25 |
|
| 26 |
The goal of this notebook is to explore Marimo's data exploration capabilities alongside the power of Polars. Feel free to reference the latest about these Marimo features here: https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
|
| 27 |
+
""")
|
|
|
|
| 28 |
return
|
| 29 |
|
| 30 |
|
|
|
|
| 38 |
|
| 39 |
@app.cell(hide_code=True)
|
| 40 |
def _(mo):
|
| 41 |
+
mo.md(r"""
|
|
|
|
| 42 |
# Loading Data
|
| 43 |
Let's start by loading our data and getting into the `.lazy()` format so our transformations and queries are speedy.
|
| 44 |
|
| 45 |
Read more about `.lazy()` here: https://docs.pola.rs/user-guide/lazy/
|
| 46 |
+
""")
|
|
|
|
| 47 |
return
|
| 48 |
|
| 49 |
|
|
|
|
| 56 |
|
| 57 |
@app.cell(hide_code=True)
|
| 58 |
def _(mo):
|
| 59 |
+
mo.md(r"""
|
| 60 |
+
Above, you will notice that when you reference the object as a standalone, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from.
|
|
|
|
| 61 |
|
| 62 |
- 💡 Try out the `Table` view! You can click the `Preview data` button to get a quick view of your data.
|
| 63 |
- 💡 Take a look at the `Query plan`. Learn more about Polars' query plan here: https://docs.pola.rs/user-guide/lazy/query-plan/
|
| 64 |
+
""")
|
|
|
|
| 65 |
return
|
| 66 |
|
| 67 |
|
| 68 |
@app.cell(hide_code=True)
|
| 69 |
def _(mo):
|
| 70 |
+
mo.md(r"""
|
|
|
|
| 71 |
## marimo's Native Dataframe UI
|
| 72 |
|
| 73 |
There are a few ways to leverage marimo's native dataframe UI. One is by doing what we saw above—by referencing a `pl.LazyFrame` directly. You can also try:
|
|
|
|
| 76 |
- Referencing a `pl.DataFrame` and seeing how it differs from its corresponding lazy version
|
| 77 |
- Use `mo.ui.table`
|
| 78 |
- Use `mo.ui.dataframe`
|
| 79 |
+
""")
|
|
|
|
| 80 |
return
|
| 81 |
|
| 82 |
|
| 83 |
@app.cell(hide_code=True)
|
| 84 |
def _(mo):
|
| 85 |
+
mo.md(r"""
|
|
|
|
| 86 |
## Reference a `pl.DataFrame`
|
| 87 |
Let's reference the same frame as before, but this time as a `pl.DataFrame` by calling `.collect()` on it.
|
| 88 |
+
""")
|
|
|
|
| 89 |
return
|
| 90 |
|
| 91 |
|
|
|
|
| 97 |
|
| 98 |
@app.cell(hide_code=True)
|
| 99 |
def _(mo):
|
| 100 |
+
mo.md(r"""
|
|
|
|
| 101 |
Note how much functionality we have right out-of-the-box. Click on column names to see rich features like sorting, freezing, filtering, searching, and more!
|
| 102 |
|
| 103 |
Notice how `order_quantity` has a green bar chart under it indicating the distribution of values for the field!
|
| 104 |
|
| 105 |
Don't miss the `Download` feature as well, which supports downloading in CSV, JSON, or Parquet format!
|
| 106 |
+
""")
|
|
|
|
| 107 |
return
|
| 108 |
|
| 109 |
|
| 110 |
@app.cell(hide_code=True)
|
| 111 |
def _(mo):
|
| 112 |
+
mo.md(r"""
|
|
|
|
| 113 |
## Use `mo.ui.table`
|
| 114 |
`mo.ui.table` lets you select rows interactively and then use the selection as a filtered set of rows downstream.
|
| 115 |
+
""")
|
|
|
|
| 116 |
return
|
| 117 |
|
| 118 |
|
|
|
|
| 130 |
|
| 131 |
@app.cell(hide_code=True)
|
| 132 |
def _(mo):
|
| 133 |
+
mo.md(r"""
|
| 134 |
+
I like to use this feature to select groupings based on summary statistics so I can quickly explore subsets of categories. Let me show you what I mean.
|
| 135 |
+
""")
|
| 136 |
return
|
| 137 |
|
| 138 |
|
|
|
|
| 163 |
|
| 164 |
@app.cell(hide_code=True)
|
| 165 |
def _(mo):
|
| 166 |
+
mo.md(r"""
|
|
|
|
| 167 |
Now, instead of manually creating a filter for what I want to take a closer look at, I simply select from the UI and do a simple join to get that aggregated level with more detail.
|
| 168 |
|
| 169 |
The following cell uses the output of the `mo.ui.table` selection, selects its unique keys, and uses that to join for the selected subset of the original table.
|
| 170 |
+
""")
|
|
|
|
| 171 |
return
|
| 172 |
|
| 173 |
|
|
|
|
| 185 |
|
| 186 |
@app.cell(hide_code=True)
|
| 187 |
def _(mo):
|
| 188 |
+
mo.md("""
|
| 189 |
+
You can learn more about joins in Polars by checking out my other interactive notebook here: https://marimo.io/p/@jesshart/basic-polars-joins
|
| 190 |
+
""")
|
| 191 |
return
|
| 192 |
|
| 193 |
|
| 194 |
@app.cell(hide_code=True)
|
| 195 |
def _(mo):
|
| 196 |
+
mo.md(r"""
|
| 197 |
+
## Use `mo.ui.dataframe`
|
| 198 |
+
""")
|
| 199 |
return
|
| 200 |
|
| 201 |
|
|
|
|
| 208 |
|
| 209 |
@app.cell(hide_code=True)
|
| 210 |
def _(mo):
|
| 211 |
+
mo.md(r"""
|
| 212 |
+
Below I simply call the object into view. We will play with it in the following cells.
|
| 213 |
+
""")
|
| 214 |
return
|
| 215 |
|
| 216 |
|
|
|
|
| 222 |
|
| 223 |
@app.cell(hide_code=True)
|
| 224 |
def _(mo):
|
| 225 |
+
mo.md(r"""
|
| 226 |
+
One way to group this data directly in Polars code would be to group by product family and take the mean. Here is how it is done in Polars:
|
| 227 |
+
""")
|
| 228 |
return
|
| 229 |
|
| 230 |
|
|
|
|
| 239 |
|
| 240 |
@app.cell(hide_code=True)
|
| 241 |
def _(mo):
|
| 242 |
+
mo.md(f"""
|
|
|
|
| 243 |
## Try Before You Buy
|
| 244 |
|
| 245 |
1. Now try to do the same summary using Marimo's `mo.ui.dataframe` object above. Also, note how your aggregated column is already renamed! Nice touch!
|
| 246 |
2. Try (1) again, but use select statements first. (This is better Polars practice anyway, since it reduces the frame as you move toward aggregation.)
|
| 247 |
|
| 248 |
*When you are ready, check the `Python Code` tab at the top of the table to compare your output to the answer below.*
|
| 249 |
+
""")
|
|
|
|
| 250 |
return
|
| 251 |
|
| 252 |
|
|
|
|
| 323 |
|
| 324 |
@app.cell(hide_code=True)
|
| 325 |
def _(mo):
|
| 326 |
+
mo.md(r"""
|
|
|
|
| 327 |
# About this Notebook
|
| 328 |
Polars and Marimo are both relatively new to the data wrangling space, but their power (and the thrill of their use) cannot be overstated—well, I suppose it could, but you get the meaning. In this notebook, you learn how to leverage basic Polars skills to load in and explore your data in concert with Marimo's powerful UI elements.
|
| 329 |
|
| 330 |
## 📚 Documentation References
|
| 331 |
|
| 332 |
+
- **Marimo: Dataframe Transformation Guide**
|
| 333 |
https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
|
| 334 |
|
| 335 |
+
- **Polars: Lazy API Overview**
|
| 336 |
https://docs.pola.rs/user-guide/lazy/
|
| 337 |
|
| 338 |
+
- **Polars: Query Plan Explained**
|
| 339 |
https://docs.pola.rs/user-guide/lazy/query-plan/
|
| 340 |
|
| 341 |
+
- **Marimo Notebook: Basic Polars Joins (by jesshart)**
|
| 342 |
https://marimo.io/p/@jesshart/basic-polars-joins
|
| 343 |
|
| 344 |
+
- **Marimo Learn: Interactive Graphs with Polars**
|
| 345 |
https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py
|
| 346 |
+
""")
|
|
|
|
| 347 |
return
|
| 348 |
|
| 349 |
|
polars/07-querying-with-sql.py
CHANGED
|
@@ -35,7 +35,7 @@ def _(mo):
|
|
| 35 |
|
| 36 |
|
| 37 |
@app.cell
|
| 38 |
-
def _(mo,
|
| 39 |
_df = mo.sql(
|
| 40 |
f"""
|
| 41 |
SELECT * FROM reviews LIMIT 100
|
|
@@ -91,7 +91,7 @@ def _(mo):
|
|
| 91 |
|
| 92 |
|
| 93 |
@app.cell
|
| 94 |
-
def _(
|
| 95 |
_df = mo.sql(
|
| 96 |
f"""
|
| 97 |
SELECT * FROM hotels LIMIT 10
|
|
@@ -112,7 +112,7 @@ def _(mo):
|
|
| 112 |
|
| 113 |
|
| 114 |
@app.cell
|
| 115 |
-
def _(mo,
|
| 116 |
polars_age_groups = mo.sql(
|
| 117 |
f"""
|
| 118 |
SELECT reviews.*, age_group FROM reviews JOIN users ON reviews.user_id = users.user_id LIMIT 1000
|
|
@@ -139,7 +139,7 @@ def _(mo):
|
|
| 139 |
|
| 140 |
|
| 141 |
@app.cell
|
| 142 |
-
def _(mo,
|
| 143 |
_df = mo.sql(
|
| 144 |
f"""
|
| 145 |
SELECT age_group, AVG(reviews.score_overall) FROM reviews JOIN users ON reviews.user_id = users.user_id GROUP BY age_group
|
|
@@ -158,7 +158,7 @@ def _(mo):
|
|
| 158 |
|
| 159 |
|
| 160 |
@app.cell
|
| 161 |
-
def _(mo
|
| 162 |
_df = mo.sql(
|
| 163 |
f"""
|
| 164 |
SELECT * FROM polars_age_groups LIMIT 10
|
|
@@ -261,7 +261,7 @@ def _(mo):
|
|
| 261 |
|
| 262 |
|
| 263 |
@app.cell
|
| 264 |
-
def _(duckdb
|
| 265 |
duckdb.sql("SELECT * FROM hotels").pl(lazy=True).sort("cleanliness_base", descending=True).limit(5).collect()
|
| 266 |
return
|
| 267 |
|
|
|
|
| 35 |
|
| 36 |
|
| 37 |
@app.cell
|
| 38 |
+
def _(mo, sqlite_engine):
|
| 39 |
_df = mo.sql(
|
| 40 |
f"""
|
| 41 |
SELECT * FROM reviews LIMIT 100
|
|
|
|
| 91 |
|
| 92 |
|
| 93 |
@app.cell
|
| 94 |
+
def _(mo, sqlite_engine):
|
| 95 |
_df = mo.sql(
|
| 96 |
f"""
|
| 97 |
SELECT * FROM hotels LIMIT 10
|
|
|
|
| 112 |
|
| 113 |
|
| 114 |
@app.cell
|
| 115 |
+
def _(mo, sqlite_engine):
|
| 116 |
polars_age_groups = mo.sql(
|
| 117 |
f"""
|
| 118 |
SELECT reviews.*, age_group FROM reviews JOIN users ON reviews.user_id = users.user_id LIMIT 1000
|
|
|
|
| 139 |
|
| 140 |
|
| 141 |
@app.cell
|
| 142 |
+
def _(mo, sqlite_engine):
|
| 143 |
_df = mo.sql(
|
| 144 |
f"""
|
| 145 |
SELECT age_group, AVG(reviews.score_overall) FROM reviews JOIN users ON reviews.user_id = users.user_id GROUP BY age_group
|
|
|
|
| 158 |
|
| 159 |
|
| 160 |
@app.cell
|
| 161 |
+
def _(mo):
|
| 162 |
_df = mo.sql(
|
| 163 |
f"""
|
| 164 |
SELECT * FROM polars_age_groups LIMIT 10
|
|
|
|
| 261 |
|
| 262 |
|
| 263 |
@app.cell
|
| 264 |
+
def _(duckdb):
|
| 265 |
duckdb.sql("SELECT * FROM hotels").pl(lazy=True).sort("cleanliness_base", descending=True).limit(5).collect()
|
| 266 |
return
|
| 267 |
|
polars/08_working_with_columns.py
CHANGED
|
@@ -8,37 +8,33 @@
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
-
__generated_with = "0.
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
| 15 |
@app.cell(hide_code=True)
|
| 16 |
def _(mo):
|
| 17 |
-
mo.md(
|
| 18 |
-
|
| 19 |
-
# Working with Columns
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
)
|
| 26 |
return
|
| 27 |
|
| 28 |
|
| 29 |
@app.cell(hide_code=True)
|
| 30 |
def _(mo):
|
| 31 |
-
mo.md(
|
| 32 |
-
|
| 33 |
-
## Expressions
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
)
|
| 42 |
return
|
| 43 |
|
| 44 |
|
|
@@ -46,24 +42,24 @@ def _(mo):
|
|
| 46 |
def _(pl):
|
| 47 |
speed_expr = pl.col("distance") / (pl.col("time"))
|
| 48 |
speed_expr
|
| 49 |
-
return
|
| 50 |
|
| 51 |
|
| 52 |
@app.cell(hide_code=True)
|
| 53 |
def _(mo):
|
| 54 |
-
mo.md(
|
| 55 |
-
|
| 56 |
-
## Expression expansion
|
| 57 |
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
)
|
| 61 |
return
|
| 62 |
|
| 63 |
|
| 64 |
@app.cell(hide_code=True)
|
| 65 |
def _(mo):
|
| 66 |
-
mo.md("""
|
|
|
|
|
|
|
| 67 |
return
|
| 68 |
|
| 69 |
|
|
@@ -80,32 +76,28 @@ def _(StringIO, pl):
|
|
| 80 |
|
| 81 |
data = pl.read_csv(StringIO(data_csv))
|
| 82 |
data
|
| 83 |
-
return data,
|
| 84 |
|
| 85 |
|
| 86 |
@app.cell(hide_code=True)
|
| 87 |
def _(mo):
|
| 88 |
-
mo.md(
|
| 89 |
-
|
| 90 |
-
## Function `col`
|
| 91 |
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
)
|
| 95 |
return
|
| 96 |
|
| 97 |
|
| 98 |
@app.cell(hide_code=True)
|
| 99 |
def _(mo):
|
| 100 |
-
mo.md(
|
| 101 |
-
|
| 102 |
-
### Explicit expansion by column name
|
| 103 |
|
| 104 |
-
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
)
|
| 109 |
return
|
| 110 |
|
| 111 |
|
|
@@ -118,12 +110,14 @@ def _(data, pl):
|
|
| 118 |
|
| 119 |
result = data.with_columns(exprs)
|
| 120 |
result
|
| 121 |
-
return
|
| 122 |
|
| 123 |
|
| 124 |
@app.cell(hide_code=True)
|
| 125 |
def _(mo):
|
| 126 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 127 |
return
|
| 128 |
|
| 129 |
|
|
@@ -139,28 +133,28 @@ def _(data, pl, result):
|
|
| 139 |
).round(2)
|
| 140 |
)
|
| 141 |
result_2.equals(result)
|
| 142 |
-
return
|
| 143 |
|
| 144 |
|
| 145 |
@app.cell(hide_code=True)
|
| 146 |
def _(mo):
|
| 147 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 148 |
return
|
| 149 |
|
| 150 |
|
| 151 |
@app.cell(hide_code=True)
|
| 152 |
def _(mo):
|
| 153 |
-
mo.md(
|
| 154 |
-
|
| 155 |
-
### Expansion by data type
|
| 156 |
|
| 157 |
-
|
| 158 |
|
| 159 |
-
|
| 160 |
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
)
|
| 164 |
return
|
| 165 |
|
| 166 |
|
|
@@ -168,18 +162,16 @@ def _(mo):
|
|
| 168 |
def _(data, pl, result):
|
| 169 |
result_3 = data.with_columns(((pl.col(pl.Float64) - 273.15) * 1.8 + 32).round(2))
|
| 170 |
result_3.equals(result)
|
| 171 |
-
return
|
| 172 |
|
| 173 |
|
| 174 |
@app.cell(hide_code=True)
|
| 175 |
def _(mo):
|
| 176 |
-
mo.md(
|
| 177 |
-
|
| 178 |
-
However, you should be careful to ensure that the transformation is only applied to the columns you want. To ensure this, it is important to know the schema of the data beforehand.
|
| 179 |
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
)
|
| 183 |
return
|
| 184 |
|
| 185 |
|
|
@@ -195,18 +187,16 @@ def _(data, pl, result):
|
|
| 195 |
).round(2)
|
| 196 |
)
|
| 197 |
result.equals(result_4)
|
| 198 |
-
return
|
| 199 |
|
| 200 |
|
| 201 |
@app.cell(hide_code=True)
|
| 202 |
def _(mo):
|
| 203 |
-
mo.md(
|
| 204 |
-
|
| 205 |
-
### Expansion by pattern matching
|
| 206 |
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
)
|
| 210 |
return
|
| 211 |
|
| 212 |
|
|
@@ -218,7 +208,9 @@ def _(data, pl):
|
|
| 218 |
|
| 219 |
@app.cell(hide_code=True)
|
| 220 |
def _(mo):
|
| 221 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 222 |
return
|
| 223 |
|
| 224 |
|
|
@@ -230,7 +222,9 @@ def _(data, pl):
|
|
| 230 |
|
| 231 |
@app.cell(hide_code=True)
|
| 232 |
def _(mo):
|
| 233 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 234 |
return
|
| 235 |
|
| 236 |
|
|
@@ -245,13 +239,11 @@ def _(data, pl):
|
|
| 245 |
|
| 246 |
@app.cell(hide_code=True)
|
| 247 |
def _(mo):
|
| 248 |
-
mo.md(
|
| 249 |
-
|
| 250 |
-
## Selecting all columns
|
| 251 |
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
)
|
| 255 |
return
|
| 256 |
|
| 257 |
|
|
@@ -259,18 +251,16 @@ def _(mo):
|
|
| 259 |
def _(data, pl):
|
| 260 |
result_6 = data.select(pl.all())
|
| 261 |
result_6.equals(data)
|
| 262 |
-
return
|
| 263 |
|
| 264 |
|
| 265 |
@app.cell(hide_code=True)
|
| 266 |
def _(mo):
|
| 267 |
-
mo.md(
|
| 268 |
-
|
| 269 |
-
## Excluding columns
|
| 270 |
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
)
|
| 274 |
return
|
| 275 |
|
| 276 |
|
|
@@ -282,7 +272,9 @@ def _(data, pl):
|
|
| 282 |
|
| 283 |
@app.cell(hide_code=True)
|
| 284 |
def _(mo):
|
| 285 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 286 |
return
|
| 287 |
|
| 288 |
|
|
@@ -294,13 +286,11 @@ def _(data, pl):
|
|
| 294 |
|
| 295 |
@app.cell(hide_code=True)
|
| 296 |
def _(mo):
|
| 297 |
-
mo.md(
|
| 298 |
-
|
| 299 |
-
## Column renaming
|
| 300 |
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
)
|
| 304 |
return
|
| 305 |
|
| 306 |
|
|
@@ -315,18 +305,16 @@ def _(data, pl):
|
|
| 315 |
)
|
| 316 |
except DuplicateError as err:
|
| 317 |
print("DuplicateError:", err)
|
| 318 |
-
return
|
| 319 |
|
| 320 |
|
| 321 |
@app.cell(hide_code=True)
|
| 322 |
def _(mo):
|
| 323 |
-
mo.md(
|
| 324 |
-
|
| 325 |
-
### Renaming a single column with `alias`
|
| 326 |
|
| 327 |
-
|
| 328 |
-
|
| 329 |
-
)
|
| 330 |
return
|
| 331 |
|
| 332 |
|
|
@@ -341,13 +329,11 @@ def _(data, pl):
|
|
| 341 |
|
| 342 |
@app.cell(hide_code=True)
|
| 343 |
def _(mo):
|
| 344 |
-
mo.md(
|
| 345 |
-
|
| 346 |
-
### Prefixing and suffixing column names
|
| 347 |
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
)
|
| 351 |
return
|
| 352 |
|
| 353 |
|
|
@@ -362,13 +348,11 @@ def _(data, pl):
|
|
| 362 |
|
| 363 |
@app.cell(hide_code=True)
|
| 364 |
def _(mo):
|
| 365 |
-
mo.md(
|
| 366 |
-
|
| 367 |
-
### Dynamic name replacement
|
| 368 |
|
| 369 |
-
|
| 370 |
-
|
| 371 |
-
)
|
| 372 |
return
|
| 373 |
|
| 374 |
|
|
@@ -381,13 +365,11 @@ def _(data, pl):
|
|
| 381 |
|
| 382 |
@app.cell(hide_code=True)
|
| 383 |
def _(mo):
|
| 384 |
-
mo.md(
|
| 385 |
-
|
| 386 |
-
## Programmatically generating expressions
|
| 387 |
|
| 388 |
-
|
| 389 |
-
|
| 390 |
-
)
|
| 391 |
return
|
| 392 |
|
| 393 |
|
|
@@ -402,13 +384,17 @@ def _(data, pl):
|
|
| 402 |
|
| 403 |
@app.cell(hide_code=True)
|
| 404 |
def _(mo):
|
| 405 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 406 |
return
|
| 407 |
|
| 408 |
|
| 409 |
@app.cell(hide_code=True)
|
| 410 |
def _(mo):
|
| 411 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 412 |
return
|
| 413 |
|
| 414 |
|
|
@@ -421,12 +407,14 @@ def _(ext_temp_data, pl):
|
|
| 421 |
.round(2).alias(f"Delta {col_name} temperature")
|
| 422 |
)
|
| 423 |
_result
|
| 424 |
-
return
|
| 425 |
|
| 426 |
|
| 427 |
@app.cell(hide_code=True)
|
| 428 |
def _(mo):
|
| 429 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 430 |
return
|
| 431 |
|
| 432 |
|
|
@@ -439,18 +427,16 @@ def _(ext_temp_data, pl):
|
|
| 439 |
|
| 440 |
|
| 441 |
ext_temp_data.with_columns(delta_expressions(["Air", "Process"]))
|
| 442 |
-
return
|
| 443 |
|
| 444 |
|
| 445 |
@app.cell(hide_code=True)
|
| 446 |
def _(mo):
|
| 447 |
-
mo.md(
|
| 448 |
-
|
| 449 |
-
## More flexible column selections
|
| 450 |
|
| 451 |
-
|
| 452 |
-
|
| 453 |
-
)
|
| 454 |
return
|
| 455 |
|
| 456 |
|
|
@@ -464,30 +450,30 @@ def _(data):
|
|
| 464 |
|
| 465 |
@app.cell(hide_code=True)
|
| 466 |
def _(mo):
|
| 467 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 468 |
return
|
| 469 |
|
| 470 |
|
| 471 |
@app.cell(hide_code=True)
|
| 472 |
def _(mo):
|
| 473 |
-
mo.md(
|
| 474 |
-
|
| 475 |
-
### Combining selectors with set operations
|
| 476 |
|
| 477 |
-
|
| 478 |
|
| 479 |
|
| 480 |
-
|
| 481 |
-
|
| 482 |
-
|
| 483 |
-
|
| 484 |
-
|
| 485 |
-
|
| 486 |
-
|
| 487 |
|
| 488 |
-
|
| 489 |
-
|
| 490 |
-
)
|
| 491 |
return
|
| 492 |
|
| 493 |
|
|
@@ -499,13 +485,11 @@ def _(cs, data):
|
|
| 499 |
|
| 500 |
@app.cell(hide_code=True)
|
| 501 |
def _(mo):
|
| 502 |
-
mo.md(
|
| 503 |
-
|
| 504 |
-
### Resolving operator ambiguity
|
| 505 |
|
| 506 |
-
|
| 507 |
-
|
| 508 |
-
)
|
| 509 |
return
|
| 510 |
|
| 511 |
|
|
@@ -518,13 +502,11 @@ def _(cs, data, pl):
|
|
| 518 |
|
| 519 |
@app.cell(hide_code=True)
|
| 520 |
def _(mo):
|
| 521 |
-
mo.md(
|
| 522 |
-
|
| 523 |
-
However, operators that perform set operations on column selectors operate on both selectors and on expressions. For example, the operator `~` on a selector represents the set operation “complement” and on an expression represents the Boolean operation of negation.
|
| 524 |
|
| 525 |
-
|
| 526 |
-
|
| 527 |
-
)
|
| 528 |
return
|
| 529 |
|
| 530 |
|
|
@@ -536,7 +518,9 @@ def _(cs, ext_failure_data):
|
|
| 536 |
|
| 537 |
@app.cell(hide_code=True)
|
| 538 |
def _(mo):
|
| 539 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 540 |
return
|
| 541 |
|
| 542 |
|
|
@@ -548,13 +532,11 @@ def _(cs, ext_failure_data):
|
|
| 548 |
|
| 549 |
@app.cell(hide_code=True)
|
| 550 |
def _(mo):
|
| 551 |
-
mo.md(
|
| 552 |
-
|
| 553 |
-
### Debugging selectors
|
| 554 |
|
| 555 |
-
|
| 556 |
-
|
| 557 |
-
)
|
| 558 |
return
|
| 559 |
|
| 560 |
|
|
@@ -566,7 +548,9 @@ def _(cs):
|
|
| 566 |
|
| 567 |
@app.cell(hide_code=True)
|
| 568 |
def _(mo):
|
| 569 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 570 |
return
|
| 571 |
|
| 572 |
|
|
@@ -581,14 +565,12 @@ def _(cs, ext_failure_data):
|
|
| 581 |
|
| 582 |
@app.cell(hide_code=True)
|
| 583 |
def _(mo):
|
| 584 |
-
mo.md(
|
| 585 |
-
|
| 586 |
-
### References
|
| 587 |
|
| 588 |
-
|
| 589 |
-
|
| 590 |
-
|
| 591 |
-
)
|
| 592 |
return
|
| 593 |
|
| 594 |
|
|
@@ -598,7 +580,7 @@ def _():
|
|
| 598 |
import marimo as mo
|
| 599 |
import polars as pl
|
| 600 |
from io import StringIO
|
| 601 |
-
return StringIO,
|
| 602 |
|
| 603 |
|
| 604 |
if __name__ == "__main__":
|
|
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
+
__generated_with = "0.18.4"
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
| 15 |
@app.cell(hide_code=True)
|
| 16 |
def _(mo):
|
| 17 |
+
mo.md(r"""
|
| 18 |
+
# Working with Columns
|
|
|
|
| 19 |
|
| 20 |
+
Author: [Deb Debnath](https://github.com/debajyotid2)
|
| 21 |
|
| 22 |
+
**Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/expressions/expression-expansion).
|
| 23 |
+
""")
|
|
|
|
| 24 |
return
|
| 25 |
|
| 26 |
|
| 27 |
@app.cell(hide_code=True)
|
| 28 |
def _(mo):
|
| 29 |
+
mo.md(r"""
|
| 30 |
+
## Expressions
|
|
|
|
| 31 |
|
| 32 |
+
Data transformations are sometimes complicated or involve massive, time-consuming computations. You could prototype on a small version of the dataset with the same schema, but Polars offers a better way.
|
| 33 |
|
| 34 |
+
A Polars expression is a lazy representation of a data transformation. "Lazy" means that the transformation is not eagerly (immediately) executed.
|
| 35 |
|
| 36 |
+
Expressions are modular and flexible. They can be composed to build more complex expressions. For example, to calculate speed from distance and time, you can have an expression as:
|
| 37 |
+
""")
|
|
|
|
| 38 |
return
|
| 39 |
|
| 40 |
|
|
|
|
| 42 |
def _(pl):
|
| 43 |
speed_expr = pl.col("distance") / (pl.col("time"))
|
| 44 |
speed_expr
|
| 45 |
+
return
|
| 46 |
|
| 47 |
|
| 48 |
@app.cell(hide_code=True)
|
| 49 |
def _(mo):
|
| 50 |
+
mo.md(r"""
|
| 51 |
+
## Expression expansion
|
|
|
|
| 52 |
|
| 53 |
+
Expression expansion lets you write a single expression that can expand to multiple different expressions. So rather than repeatedly defining separate expressions, you can avoid redundancy while adhering to clean code principles (Do not Repeat Yourself - [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)). Since expressions are reusable, they aid in writing concise code.
|
| 54 |
+
""")
|
|
|
|
| 55 |
return
|
| 56 |
|
| 57 |
|
| 58 |
@app.cell(hide_code=True)
|
| 59 |
def _(mo):
|
| 60 |
+
mo.md("""
|
| 61 |
+
For the examples in this notebook, we will use a sliver of the *AI4I 2020 Predictive Maintenance Dataset*. This dataset comprises measurements taken from sensors in industrial machinery undergoing preventive maintenance checks - basically being tested for failure conditions.
|
| 62 |
+
""")
|
| 63 |
return
|
| 64 |
|
| 65 |
|
|
|
|
| 76 |
|
| 77 |
data = pl.read_csv(StringIO(data_csv))
|
| 78 |
data
|
| 79 |
+
return (data,)
|
| 80 |
|
| 81 |
|
| 82 |
@app.cell(hide_code=True)
|
| 83 |
def _(mo):
|
| 84 |
+
mo.md(r"""
|
| 85 |
+
## Function `col`
|
|
|
|
| 86 |
|
| 87 |
+
The function `col` is used to refer to one column of a dataframe. It is one of the fundamental building blocks of expressions in Polars. `col` is also really handy in expression expansion.
|
| 88 |
+
""")
|
|
|
|
| 89 |
return
|
| 90 |
|
| 91 |
|
| 92 |
@app.cell(hide_code=True)
|
| 93 |
def _(mo):
|
| 94 |
+
mo.md(r"""
|
| 95 |
+
### Explicit expansion by column name
|
|
|
|
| 96 |
|
| 97 |
+
The simplest form of expression expansion happens when you provide multiple column names to the function `col`.
|
| 98 |
|
| 99 |
+
Say you wish to convert all temperature values in deg. Kelvin (K) to deg. Fahrenheit (F). One way to do this would be to define individual expressions for each column as follows:
|
| 100 |
+
""")
|
|
|
|
| 101 |
return
|
| 102 |
|
| 103 |
|
|
|
|
| 110 |
|
| 111 |
result = data.with_columns(exprs)
|
| 112 |
result
|
| 113 |
+
return (result,)
|
| 114 |
|
| 115 |
|
| 116 |
@app.cell(hide_code=True)
|
| 117 |
def _(mo):
|
| 118 |
+
mo.md(r"""
|
| 119 |
+
Expression expansion can reduce this verbosity when you list the column names you want the expression to expand to inside the `col` function. The result is the same as before.
|
| 120 |
+
""")
|
| 121 |
return
|
| 122 |
|
| 123 |
|
|
|
|
| 133 |
).round(2)
|
| 134 |
)
|
| 135 |
result_2.equals(result)
|
| 136 |
+
return
|
| 137 |
|
| 138 |
|
| 139 |
@app.cell(hide_code=True)
|
| 140 |
def _(mo):
|
| 141 |
+
mo.md(r"""
|
| 142 |
+
In this case, the expression that does the temperature conversion is expanded to a list of two expressions. The expansion of the expression is predictable and intuitive.
|
| 143 |
+
""")
|
| 144 |
return
|
| 145 |
|
| 146 |
|
| 147 |
@app.cell(hide_code=True)
|
| 148 |
def _(mo):
|
| 149 |
+
mo.md(r"""
|
| 150 |
+
### Expansion by data type
|
|
|
|
| 151 |
|
| 152 |
+
Can we do better than explicitly writing the names of every column we want transformed? Yes.
|
| 153 |
|
| 154 |
+
If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided.
|
| 155 |
|
| 156 |
+
The example below performs the exact same computation as before:
|
| 157 |
+
""")
|
|
|
|
| 158 |
return
|
| 159 |
|
| 160 |
|
|
|
|
| 162 |
def _(data, pl, result):
|
| 163 |
result_3 = data.with_columns(((pl.col(pl.Float64) - 273.15) * 1.8 + 32).round(2))
|
| 164 |
result_3.equals(result)
|
| 165 |
+
return
|
| 166 |
|
| 167 |
|
| 168 |
@app.cell(hide_code=True)
|
| 169 |
def _(mo):
|
| 170 |
+
mo.md(r"""
|
| 171 |
+
However, you should be careful to ensure that the transformation is only applied to the columns you want. To ensure this, it is important to know the schema of the data beforehand.
|
|
|
|
| 172 |
|
| 173 |
+
`col` accepts multiple data types in case the columns you need have more than one data type.
|
| 174 |
+
""")
|
|
|
|
| 175 |
return
|
| 176 |
|
| 177 |
|
|
|
|
| 187 |
).round(2)
|
| 188 |
)
|
| 189 |
result.equals(result_4)
|
| 190 |
+
return
|
| 191 |
|
| 192 |
|
| 193 |
@app.cell(hide_code=True)
|
| 194 |
def _(mo):
|
| 195 |
+
mo.md(r"""
|
| 196 |
+
### Expansion by pattern matching
|
|
|
|
| 197 |
|
| 198 |
+
`col` also accepts regular expressions for selecting columns by pattern matching. Regular expressions must start with `^` and end with `$`.
|
| 199 |
+
""")
|
|
|
|
| 200 |
return
|
| 201 |
|
| 202 |
|
|
|
|
| 208 |
|
| 209 |
@app.cell(hide_code=True)
|
| 210 |
def _(mo):
|
| 211 |
+
mo.md(r"""
|
| 212 |
+
Regular expressions can be combined with exact column names.
|
| 213 |
+
""")
|
| 214 |
return
|
| 215 |
|
| 216 |
|
|
|
|
| 222 |
|
| 223 |
@app.cell(hide_code=True)
|
| 224 |
def _(mo):
|
| 225 |
+
mo.md(r"""
|
| 226 |
+
**Note**: You _cannot_ mix strings (exact names, regular expressions) and data types in a `col` function.
|
| 227 |
+
""")
|
| 228 |
return
|
| 229 |
|
| 230 |
|
|
|
|
| 239 |
|
| 240 |
@app.cell(hide_code=True)
|
| 241 |
def _(mo):
|
| 242 |
+
mo.md(r"""
|
| 243 |
+
## Selecting all columns
|
|
|
|
| 244 |
|
| 245 |
+
To select all columns, you can use the `all` function.
|
| 246 |
+
""")
|
|
|
|
| 247 |
return
|
| 248 |
|
| 249 |
|
|
|
|
| 251 |
def _(data, pl):
|
| 252 |
result_6 = data.select(pl.all())
|
| 253 |
result_6.equals(data)
|
| 254 |
+
return
|
| 255 |
|
| 256 |
|
| 257 |
@app.cell(hide_code=True)
|
| 258 |
def _(mo):
|
| 259 |
+
mo.md(r"""
|
| 260 |
+
## Excluding columns
|
|
|
|
| 261 |
|
| 262 |
+
There are scenarios where we might want to exclude specific columns from the ones selected by functions like `col` or `all`. For this purpose, we use the function `exclude`, which accepts the same types of arguments as `col`:
|
| 263 |
+
""")
|
|
|
|
| 264 |
return
|
| 265 |
|
| 266 |
|
|
|
|
| 272 |
|
| 273 |
@app.cell(hide_code=True)
|
| 274 |
def _(mo):
|
| 275 |
+
mo.md(r"""
|
| 276 |
+
`exclude` can also be used after the function `col`:
|
| 277 |
+
""")
|
| 278 |
return
|
| 279 |
|
| 280 |
|
|
|
|
| 286 |
|
| 287 |
@app.cell(hide_code=True)
|
| 288 |
def _(mo):
|
| 289 |
+
mo.md(r"""
|
| 290 |
+
## Column renaming
|
|
|
|
| 291 |
|
| 292 |
+
When an expression transforms a column, the transformed data overwrites the column's original data. This is not always the intended outcome; often you want to store the transformed data in a new column instead. In fact, applying multiple transformations to the same column at once without renaming leads to errors.
|
| 293 |
+
""")
|
|
|
|
| 294 |
return
|
| 295 |
|
| 296 |
|
|
|
|
| 305 |
)
|
| 306 |
except DuplicateError as err:
|
| 307 |
print("DuplicateError:", err)
|
| 308 |
+
return
|
| 309 |
|
| 310 |
|
| 311 |
@app.cell(hide_code=True)
|
| 312 |
def _(mo):
|
| 313 |
+
mo.md(r"""
|
| 314 |
+
### Renaming a single column with `alias`
|
|
|
|
| 315 |
|
| 316 |
+
The function `alias` lets you rename a single column:
|
| 317 |
+
""")
|
|
|
|
| 318 |
return
|
| 319 |
|
| 320 |
|
|
|
|
| 329 |
|
| 330 |
@app.cell(hide_code=True)
|
| 331 |
def _(mo):
|
| 332 |
+
mo.md(r"""
|
| 333 |
+
### Prefixing and suffixing column names
|
|
|
|
| 334 |
|
| 335 |
+
As `alias` renames a single column at a time, it cannot be used during expression expansion. If it is sufficient to add a static prefix or suffix to the existing names, you can use the functions `name.prefix` and `name.suffix` with `col`:
|
| 336 |
+
""")
|
|
|
|
| 337 |
return
|
| 338 |
|
| 339 |
|
|
|
|
| 348 |
|
| 349 |
@app.cell(hide_code=True)
|
| 350 |
def _(mo):
|
| 351 |
+
mo.md(r"""
|
| 352 |
+
### Dynamic name replacement
|
|
|
|
| 353 |
|
| 354 |
+
If a static prefix/suffix is not enough, use `name.map`. `name.map` takes a function that transforms each column name into the desired new name. The transformation should produce unique names to avoid a `DuplicateError`.
|
| 355 |
+
""")
|
|
|
|
| 356 |
return
|
| 357 |
|
| 358 |
|
|
|
|
| 365 |
|
| 366 |
@app.cell(hide_code=True)
|
| 367 |
def _(mo):
|
| 368 |
+
mo.md(r"""
|
| 369 |
+
## Programmatically generating expressions
|
|
|
|
| 370 |
|
| 371 |
+
For this example, we will first create four additional columns with the rolling mean temperatures of the two temperature columns. Such transformations are sometimes used to create additional features for machine learning models or data analysis.
|
| 372 |
+
""")
|
|
|
|
| 373 |
return
|
| 374 |
|
| 375 |
|
|
|
|
| 384 |
|
| 385 |
@app.cell(hide_code=True)
|
| 386 |
def _(mo):
|
| 387 |
+
mo.md(r"""
|
| 388 |
+
Now, suppose we want to calculate the difference between the rolling mean and actual temperatures. We cannot use expression expansion here as we want differences between specific columns.
|
| 389 |
+
""")
|
| 390 |
return
|
| 391 |
|
| 392 |
|
| 393 |
@app.cell(hide_code=True)
|
| 394 |
def _(mo):
|
| 395 |
+
mo.md(r"""
|
| 396 |
+
At first, you may think about using a `for` loop:
|
| 397 |
+
""")
|
| 398 |
return
|
| 399 |
|
| 400 |
|
|
|
|
| 407 |
.round(2).alias(f"Delta {col_name} temperature")
|
| 408 |
)
|
| 409 |
_result
|
| 410 |
+
return
|
| 411 |
|
| 412 |
|
| 413 |
@app.cell(hide_code=True)
|
| 414 |
def _(mo):
|
| 415 |
+
mo.md(r"""
|
| 416 |
+
Using a `for` loop works, but it does not scale well: each expression is defined in its own iteration and executed serially. Instead, we can use a Python generator to produce all expressions at once. In conjunction with the `with_columns` context, this lets Polars execute the computations in parallel and apply query optimization.
|
| 417 |
+
""")
|
| 418 |
return
|
| 419 |
|
| 420 |
|
|
|
|
| 427 |
|
| 428 |
|
| 429 |
ext_temp_data.with_columns(delta_expressions(["Air", "Process"]))
|
| 430 |
+
return
|
| 431 |
|
| 432 |
|
| 433 |
@app.cell(hide_code=True)
|
| 434 |
def _(mo):
|
| 435 |
+
mo.md(r"""
|
| 436 |
+
## More flexible column selections
|
|
|
|
| 437 |
|
| 438 |
+
For more flexible column selections, you can use column selectors from `selectors`. Column selectors allow for more expressiveness in the way you specify selections. For example, column selectors can perform the familiar set operations of union, intersection, difference, etc. We can use the union operation with the functions `string` and `ends_with` to select all string columns and the columns whose names end with "`_high`":
|
| 439 |
+
""")
|
|
|
|
| 440 |
return
|
| 441 |
|
| 442 |
|
|
|
|
| 450 |
|
| 451 |
@app.cell(hide_code=True)
|
| 452 |
def _(mo):
|
| 453 |
+
mo.md(r"""
|
| 454 |
+
Likewise, you can pick columns based on the category of their data type, offering more flexibility than the `col` function. For example, `cs.numeric` selects all numeric data types (`pl.Float32`, `pl.Float64`, `pl.Int32`, etc.), and `cs.temporal` selects all dates, times, and similar temporal data types.
|
| 455 |
+
""")
|
| 456 |
return
|
| 457 |
|
| 458 |
|
| 459 |
@app.cell(hide_code=True)
|
| 460 |
def _(mo):
|
| 461 |
+
mo.md(r"""
|
| 462 |
+
### Combining selectors with set operations
|
|
|
|
| 463 |
|
| 464 |
+
Multiple selectors can be combined using set operations and the usual Python operators:
|
| 465 |
|
| 466 |
|
| 467 |
+
| Operator | Operation |
|
| 468 |
+
|:--------:|:--------------------:|
|
| 469 |
+
| `A | B` | Union |
|
| 470 |
+
| `A & B` | Intersection |
|
| 471 |
+
| `A - B` | Difference |
|
| 472 |
+
| `A ^ B` | Symmetric difference |
|
| 473 |
+
| `~A` | Complement |
|
| 474 |
|
| 475 |
+
For example, to select all failure indicator variables excluding the failure variables due to wear, we can perform a set difference between the column selectors.
|
| 476 |
+
""")
|
|
|
|
| 477 |
return
|
| 478 |
|
| 479 |
|
|
|
|
| 485 |
|
| 486 |
@app.cell(hide_code=True)
|
| 487 |
def _(mo):
|
| 488 |
+
mo.md(r"""
|
| 489 |
+
### Resolving operator ambiguity
|
|
|
|
| 490 |
|
| 491 |
+
Expression functions can be chained on top of selectors:
|
| 492 |
+
""")
|
|
|
|
| 493 |
return
|
| 494 |
|
| 495 |
|
|
|
|
| 502 |
|
| 503 |
@app.cell(hide_code=True)
|
| 504 |
def _(mo):
|
| 505 |
+
mo.md(r"""
|
| 506 |
+
However, the operators that perform set operations on column selectors also work on expressions, with different meanings. For example, the operator `~` on a selector represents the set operation "complement", whereas on an expression it represents Boolean negation.
|
|
|
|
| 507 |
|
| 508 |
+
For instance, suppose you want to negate the Boolean values in the columns "HDF", "OSF", and "RNF". At first you might reach for the `~` operator on the column selector that matches all failure variables containing "W". Because of the operator ambiguity, however, `~` is interpreted as the set complement of the selector, so it selects every other column in the dataframe instead of negating any Boolean values.
|
| 509 |
+
""")
|
|
|
|
| 510 |
return
|
| 511 |
|
| 512 |
|
|
|
|
| 518 |
|
| 519 |
@app.cell(hide_code=True)
|
| 520 |
def _(mo):
|
| 521 |
+
mo.md(r"""
|
| 522 |
+
To resolve the operator ambiguity, we use `as_expr`:
|
| 523 |
+
""")
|
| 524 |
return
|
| 525 |
|
| 526 |
|
|
|
|
| 532 |
|
| 533 |
@app.cell(hide_code=True)
|
| 534 |
def _(mo):
|
| 535 |
+
mo.md(r"""
|
| 536 |
+
### Debugging selectors
|
|
|
|
| 537 |
|
| 538 |
+
The function `cs.is_selector` helps check whether a complex chain of selectors and operators ultimately results in a selector. For example, to check whether the combination from the last example is still a selector, we can do:
|
| 539 |
+
""")
|
|
|
|
| 540 |
return
|
| 541 |
|
| 542 |
|
|
|
|
| 548 |
|
| 549 |
@app.cell(hide_code=True)
|
| 550 |
def _(mo):
|
| 551 |
+
mo.md(r"""
|
| 552 |
+
Additionally, we can use `expand_selector` to see which columns a selector expands to. Note that this function needs additional context in the form of the dataframe.
|
| 553 |
+
""")
|
| 554 |
return
|
| 555 |
|
| 556 |
|
|
|
|
| 565 |
|
| 566 |
@app.cell(hide_code=True)
|
| 567 |
def _(mo):
|
| 568 |
+
mo.md(r"""
|
| 569 |
+
### References
|
|
|
|
| 570 |
|
| 571 |
+
1. AI4I 2020 Predictive Maintenance Dataset [Dataset]. (2020). UCI Machine Learning Repository. ([link](https://doi.org/10.24432/C5HS5C)).
|
| 572 |
+
2. Polars documentation ([link](https://docs.pola.rs/user-guide/expressions/expression-expansion/#more-flexible-column-selections))
|
| 573 |
+
""")
|
|
|
|
| 574 |
return
|
| 575 |
|
| 576 |
|
|
|
|
| 580 |
import marimo as mo
|
| 581 |
import polars as pl
|
| 582 |
from io import StringIO
|
| 583 |
+
return StringIO, mo, pl
|
| 584 |
|
| 585 |
|
| 586 |
if __name__ == "__main__":
|
polars/09_data_types.py
CHANGED
|
@@ -8,52 +8,46 @@
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
-
__generated_with = "0.
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
| 15 |
@app.cell(hide_code=True)
|
| 16 |
def _(mo):
|
| 17 |
-
mo.md(
|
| 18 |
-
|
| 19 |
-
# Data Types
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
)
|
| 26 |
return
|
| 27 |
|
| 28 |
|
| 29 |
@app.cell(hide_code=True)
|
| 30 |
def _(mo):
|
| 31 |
-
mo.md(
|
| 32 |
-
|
| 33 |
-
Polars supports a variety of data types that fall broadly under the following categories:
|
| 34 |
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
)
|
| 45 |
return
|
| 46 |
|
| 47 |
|
| 48 |
@app.cell(hide_code=True)
|
| 49 |
def _(mo):
|
| 50 |
-
mo.md(
|
| 51 |
-
|
| 52 |
-
## Series
|
| 53 |
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
)
|
| 57 |
return
|
| 58 |
|
| 59 |
|
|
@@ -61,12 +55,14 @@ def _(mo):
|
|
| 61 |
def _(pl):
|
| 62 |
s = pl.Series("emojis", ["😀", "🤣", "🥶", "💀", "🤖"])
|
| 63 |
s
|
| 64 |
-
return
|
| 65 |
|
| 66 |
|
| 67 |
@app.cell(hide_code=True)
|
| 68 |
def _(mo):
|
| 69 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 70 |
return
|
| 71 |
|
| 72 |
|
|
@@ -75,20 +71,18 @@ def _(pl):
|
|
| 75 |
s1 = pl.Series("friends", ["Евгений", "अभिषेक", "秀良", "Federico", "Bob"])
|
| 76 |
s2 = pl.Series("uints", [0x00, 0x01, 0x10, 0x11], dtype=pl.UInt8)
|
| 77 |
s1.dtype, s2.dtype
|
| 78 |
-
return
|
| 79 |
|
| 80 |
|
| 81 |
@app.cell(hide_code=True)
|
| 82 |
def _(mo):
|
| 83 |
-
mo.md(
|
| 84 |
-
|
| 85 |
-
## Dataframe
|
| 86 |
|
| 87 |
-
|
| 88 |
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
)
|
| 92 |
return
|
| 93 |
|
| 94 |
|
|
@@ -108,28 +102,24 @@ def _(pl):
|
|
| 108 |
|
| 109 |
@app.cell(hide_code=True)
|
| 110 |
def _(mo):
|
| 111 |
-
mo.md(
|
| 112 |
-
|
| 113 |
-
### Inspecting a dataframe
|
| 114 |
|
| 115 |
-
|
| 116 |
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
)
|
| 121 |
return
|
| 122 |
|
| 123 |
|
| 124 |
@app.cell(hide_code=True)
|
| 125 |
def _(mo):
|
| 126 |
-
mo.md(
|
| 127 |
-
|
| 128 |
-
#### Head
|
| 129 |
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
)
|
| 133 |
return
|
| 134 |
|
| 135 |
|
|
@@ -141,13 +131,11 @@ def _(data):
|
|
| 141 |
|
| 142 |
@app.cell(hide_code=True)
|
| 143 |
def _(mo):
|
| 144 |
-
mo.md(
|
| 145 |
-
|
| 146 |
-
#### Glimpse
|
| 147 |
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
)
|
| 151 |
return
|
| 152 |
|
| 153 |
|
|
@@ -159,13 +147,11 @@ def _(data):
|
|
| 159 |
|
| 160 |
@app.cell(hide_code=True)
|
| 161 |
def _(mo):
|
| 162 |
-
mo.md(
|
| 163 |
-
|
| 164 |
-
#### Tail
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
)
|
| 169 |
return
|
| 170 |
|
| 171 |
|
|
@@ -177,13 +163,11 @@ def _(data):
|
|
| 177 |
|
| 178 |
@app.cell(hide_code=True)
|
| 179 |
def _(mo):
|
| 180 |
-
mo.md(
|
| 181 |
-
|
| 182 |
-
#### Sample
|
| 183 |
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
)
|
| 187 |
return
|
| 188 |
|
| 189 |
|
|
@@ -194,18 +178,16 @@ def _(data):
|
|
| 194 |
random.seed(42) # For reproducibility.
|
| 195 |
|
| 196 |
data.sample(3)
|
| 197 |
-
return
|
| 198 |
|
| 199 |
|
| 200 |
@app.cell(hide_code=True)
|
| 201 |
def _(mo):
|
| 202 |
-
mo.md(
|
| 203 |
-
|
| 204 |
-
#### Describe
|
| 205 |
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
)
|
| 209 |
return
|
| 210 |
|
| 211 |
|
|
@@ -217,13 +199,11 @@ def _(data):
|
|
| 217 |
|
| 218 |
@app.cell(hide_code=True)
|
| 219 |
def _(mo):
|
| 220 |
-
mo.md(
|
| 221 |
-
|
| 222 |
-
## Schema
|
| 223 |
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
)
|
| 227 |
return
|
| 228 |
|
| 229 |
|
|
@@ -235,7 +215,9 @@ def _(data):
|
|
| 235 |
|
| 236 |
@app.cell(hide_code=True)
|
| 237 |
def _(mo):
|
| 238 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 239 |
return
|
| 240 |
|
| 241 |
|
|
@@ -255,7 +237,9 @@ def _(pl):
|
|
| 255 |
|
| 256 |
@app.cell(hide_code=True)
|
| 257 |
def _(mo):
|
| 258 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 259 |
return
|
| 260 |
|
| 261 |
|
|
@@ -275,13 +259,11 @@ def _(pl):
|
|
| 275 |
|
| 276 |
@app.cell(hide_code=True)
|
| 277 |
def _(mo):
|
| 278 |
-
mo.md(
|
| 279 |
-
|
| 280 |
-
### References
|
| 281 |
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
)
|
| 285 |
return
|
| 286 |
|
| 287 |
|
|
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
+
__generated_with = "0.18.4"
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
| 15 |
@app.cell(hide_code=True)
|
| 16 |
def _(mo):
|
| 17 |
+
mo.md(r"""
|
| 18 |
+
# Data Types
|
|
|
|
| 19 |
|
| 20 |
+
Author: [Deb Debnath](https://github.com/debajyotid2)
|
| 21 |
|
| 22 |
+
**Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/).
|
| 23 |
+
""")
|
|
|
|
| 24 |
return
|
| 25 |
|
| 26 |
|
| 27 |
@app.cell(hide_code=True)
|
| 28 |
def _(mo):
|
| 29 |
+
mo.md(r"""
|
| 30 |
+
Polars supports a variety of data types that fall broadly under the following categories:
|
|
|
|
| 31 |
|
| 32 |
+
- Numeric data types: integers and floating point numbers.
|
| 33 |
+
- Nested data types: lists, structs, and arrays.
|
| 34 |
+
- Temporal: dates, datetimes, times, and time deltas.
|
| 35 |
+
- Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.
|
| 36 |
|
| 37 |
+
All types support missing values, represented by `null`, which is distinct from the `NaN` value used in floating-point data types. The numeric data types in Polars loosely follow Rust's type system, since the core of Polars is built in Rust.
|
| 38 |
|
| 39 |
+
[Here](https://docs.pola.rs/api/python/stable/reference/datatypes.html) is a full list of all data types Polars supports.
|
| 40 |
+
""")
|
|
|
|
| 41 |
return
|
| 42 |
|
| 43 |
|
| 44 |
@app.cell(hide_code=True)
|
| 45 |
def _(mo):
|
| 46 |
+
mo.md(r"""
|
| 47 |
+
## Series
|
|
|
|
| 48 |
|
| 49 |
+
A series is a 1-dimensional data structure that can hold only one data type.
|
| 50 |
+
""")
|
|
|
|
| 51 |
return
|
| 52 |
|
| 53 |
|
|
|
|
| 55 |
def _(pl):
|
| 56 |
s = pl.Series("emojis", ["😀", "🤣", "🥶", "💀", "🤖"])
|
| 57 |
s
|
| 58 |
+
return
|
| 59 |
|
| 60 |
|
| 61 |
@app.cell(hide_code=True)
|
| 62 |
def _(mo):
|
| 63 |
+
mo.md(r"""
|
| 64 |
+
Unless specified, Polars infers the datatype from the supplied values.
|
| 65 |
+
""")
|
| 66 |
return
|
| 67 |
|
| 68 |
|
|
|
|
| 71 |
s1 = pl.Series("friends", ["Евгений", "अभिषेक", "秀良", "Federico", "Bob"])
|
| 72 |
s2 = pl.Series("uints", [0x00, 0x01, 0x10, 0x11], dtype=pl.UInt8)
|
| 73 |
s1.dtype, s2.dtype
|
| 74 |
+
return
|
| 75 |
|
| 76 |
|
| 77 |
@app.cell(hide_code=True)
|
| 78 |
def _(mo):
|
| 79 |
+
mo.md(r"""
|
| 80 |
+
## Dataframe
|
|
|
|
| 81 |
|
| 82 |
+
A dataframe is a 2-dimensional data structure that holds uniquely named series, which may have different data types. The dataframe is the structure most commonly used for data manipulation with Polars.
|
| 83 |
|
| 84 |
+
The snippet below shows how to create a dataframe from a dictionary of lists:
|
| 85 |
+
""")
|
|
|
|
| 86 |
return
|
| 87 |
|
| 88 |
|
|
|
|
| 102 |
|
| 103 |
@app.cell(hide_code=True)
|
| 104 |
def _(mo):
|
| 105 |
+
mo.md(r"""
|
| 106 |
+
### Inspecting a dataframe
|
|
|
|
| 107 |
|
| 108 |
+
Polars has various functions to explore the data in a dataframe. We will use the dataframe `data` defined above in our examples. Alongside, we can also see a view of the dataframe rendered by `marimo` as the cells are executed.
|
| 109 |
|
| 110 |
+
///note
|
| 111 |
+
We can also use `marimo`'s built in data-inspection elements/features such as [`mo.ui.dataframe`](https://docs.marimo.io/api/inputs/dataframe/#marimo.ui.dataframe) & [`mo.ui.data_explorer`](https://docs.marimo.io/api/inputs/data_explorer/). For more check out our Polars tutorials at [`marimo learn`](https://marimo-team.github.io/learn/)!
|
| 112 |
+
""")
|
|
|
|
| 113 |
return
|
| 114 |
|
| 115 |
|
| 116 |
@app.cell(hide_code=True)
|
| 117 |
def _(mo):
|
| 118 |
+
mo.md("""
|
| 119 |
+
#### Head
|
|
|
|
| 120 |
|
| 121 |
+
The function `head` shows the first rows of a dataframe. Unless specified, it shows the first 5 rows.
|
| 122 |
+
""")
|
|
|
|
| 123 |
return
|
| 124 |
|
| 125 |
|
|
|
|
| 131 |
|
| 132 |
@app.cell(hide_code=True)
|
| 133 |
def _(mo):
|
| 134 |
+
mo.md(r"""
|
| 135 |
+
#### Glimpse
|
|
|
|
| 136 |
|
| 137 |
+
The function `glimpse` is an alternative to `head` for previewing the start of a dataframe, but it prints one line of output per column. That makes inspecting wider dataframes easier.
|
| 138 |
+
""")
|
|
|
|
| 139 |
return
|
| 140 |
|
| 141 |
|
|
|
|
| 147 |
|
| 148 |
@app.cell(hide_code=True)
|
| 149 |
def _(mo):
|
| 150 |
+
mo.md(r"""
|
| 151 |
+
#### Tail
|
|
|
|
| 152 |
|
| 153 |
+
The `tail` function, just like its name suggests, shows the last rows of a dataframe. Unless the number of rows is specified, it will show the last 5 rows.
|
| 154 |
+
""")
|
|
|
|
| 155 |
return
|
| 156 |
|
| 157 |
|
|
|
|
| 163 |
|
| 164 |
@app.cell(hide_code=True)
|
| 165 |
def _(mo):
|
| 166 |
+
mo.md(r"""
|
| 167 |
+
#### Sample
|
|
|
|
| 168 |
|
| 169 |
+
`sample` can be used to show a specified number of randomly selected rows from the dataframe. Unless the number of rows is specified, it will show a single row. `sample` does not preserve the order of the rows.
|
| 170 |
+
""")
|
|
|
|
| 171 |
return
|
| 172 |
|
| 173 |
|
|
|
|
| 178 |
random.seed(42) # For reproducibility.
|
| 179 |
|
| 180 |
data.sample(3)
|
| 181 |
+
return
|
| 182 |
|
| 183 |
|
| 184 |
@app.cell(hide_code=True)
|
| 185 |
def _(mo):
|
| 186 |
+
mo.md(r"""
|
| 187 |
+
#### Describe
|
|
|
|
| 188 |
|
| 189 |
+
The function `describe` describes the summary statistics for all columns of a dataframe.
|
| 190 |
+
""")
|
|
|
|
| 191 |
return
|
| 192 |
|
| 193 |
|
|
|
|
| 199 |
|
| 200 |
@app.cell(hide_code=True)
|
| 201 |
def _(mo):
|
| 202 |
+
mo.md(r"""
|
| 203 |
+
## Schema
|
|
|
|
| 204 |
|
| 205 |
+
A schema is a mapping showing the datatype corresponding to every column of a dataframe. The schema of a dataframe can be viewed using the attribute `schema`.
|
| 206 |
+
""")
|
|
|
|
| 207 |
return
|
| 208 |
|
| 209 |
|
|
|
|
| 215 |
|
| 216 |
@app.cell(hide_code=True)
|
| 217 |
def _(mo):
|
| 218 |
+
mo.md(r"""
|
| 219 |
+
Since a schema is a mapping, it can be specified as a Python dictionary, which can then be used to set the schema of a dataframe at creation. If the schema is not specified, or an entry is `None`, Polars infers the datatype from the contents of the column.
|
| 220 |
+
""")
|
| 221 |
return
|
| 222 |
|
| 223 |
|
|
|
|
| 237 |
|
| 238 |
@app.cell(hide_code=True)
|
| 239 |
def _(mo):
|
| 240 |
+
mo.md(r"""
|
| 241 |
+
Sometimes the automatically inferred schema is adequate for most columns, but we may wish to override the inference for just a few. We can specify the datatypes for those columns using `schema_overrides`.
|
| 242 |
+
""")
|
| 243 |
return
|
| 244 |
|
| 245 |
|
|
|
|
| 259 |
|
| 260 |
@app.cell(hide_code=True)
|
| 261 |
def _(mo):
|
| 262 |
+
mo.md(r"""
|
| 263 |
+
### References
|
|
|
|
| 264 |
|
| 265 |
+
1. Polars documentation ([link](https://docs.pola.rs/api/python/stable/reference/datatypes.html))
|
| 266 |
+
""")
|
|
|
|
| 267 |
return
|
| 268 |
|
| 269 |
|
polars/10_strings.py
CHANGED
|
@@ -10,36 +10,32 @@
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
-
__generated_with = "0.
|
| 14 |
app = marimo.App(width="medium")
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
-
mo.md(
|
| 20 |
-
|
| 21 |
-
# Strings
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
)
|
| 30 |
return
|
| 31 |
|
| 32 |
|
| 33 |
@app.cell(hide_code=True)
|
| 34 |
def _(mo):
|
| 35 |
-
mo.md(
|
| 36 |
-
|
| 37 |
-
## 🛠️ Parsing & Conversion
|
| 38 |
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
)
|
| 43 |
return
|
| 44 |
|
| 45 |
|
|
@@ -58,7 +54,9 @@ def _(pl):
|
|
| 58 |
|
| 59 |
@app.cell(hide_code=True)
|
| 60 |
def _(mo):
|
| 61 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 62 |
return
|
| 63 |
|
| 64 |
|
|
@@ -71,13 +69,17 @@ def _(pip_metadata_raw_df, pl):
|
|
| 71 |
|
| 72 |
@app.cell(hide_code=True)
|
| 73 |
def _(mo):
|
| 74 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 75 |
return
|
| 76 |
|
| 77 |
|
| 78 |
@app.cell(hide_code=True)
|
| 79 |
def _(mo):
|
| 80 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 81 |
return
|
| 82 |
|
| 83 |
|
|
@@ -93,25 +95,23 @@ def _(pip_metadata_df, pl):
|
|
| 93 |
|
| 94 |
@app.cell(hide_code=True)
|
| 95 |
def _(mo):
|
| 96 |
-
mo.md(
|
| 97 |
-
|
| 98 |
-
Moving on to the `released_at` attribute which indicates the exact time when a given Python package got released, we have a bit more options to consider. We can convert to `Date`, `DateTime`, and `Time` types based on the desired temporal granularity. The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion, all we need is to provide the desired format string.
|
| 99 |
|
| 100 |
-
|
| 101 |
|
| 102 |
-
|
| 103 |
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
)
|
| 115 |
return
|
| 116 |
|
| 117 |
|
|
@@ -129,7 +129,9 @@ def _(pip_metadata_df, pl):
|
|
| 129 |
|
| 130 |
@app.cell(hide_code=True)
|
| 131 |
def _(mo):
|
| 132 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 133 |
return
|
| 134 |
|
| 135 |
|
|
@@ -147,7 +149,9 @@ def _(pip_metadata_df, pl):
|
|
| 147 |
|
| 148 |
@app.cell(hide_code=True)
|
| 149 |
def _(mo):
|
| 150 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 151 |
return
|
| 152 |
|
| 153 |
|
|
@@ -165,17 +169,15 @@ def _(pip_metadata_raw_df, pl):
|
|
| 165 |
|
| 166 |
@app.cell(hide_code=True)
|
| 167 |
def _(mo):
|
| 168 |
-
mo.md(
|
| 169 |
-
|
| 170 |
-
## 📊 Dataset Overview
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
-
|
| 175 |
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
)
|
| 179 |
return
|
| 180 |
|
| 181 |
|
|
@@ -214,12 +216,14 @@ def _(pl):
|
|
| 214 |
|
| 215 |
expressions_df = pl.from_dicts(list_expr_meta(), infer_schema_length=None).sort('namespace', 'member')
|
| 216 |
expressions_df
|
| 217 |
-
return expressions_df,
|
| 218 |
|
| 219 |
|
| 220 |
@app.cell(hide_code=True)
|
| 221 |
def _(mo):
|
| 222 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 223 |
return
|
| 224 |
|
| 225 |
|
|
@@ -234,17 +238,15 @@ def _(alt, expressions_df):
|
|
| 234 |
|
| 235 |
@app.cell(hide_code=True)
|
| 236 |
def _(mo):
|
| 237 |
-
mo.md(
|
| 238 |
-
|
| 239 |
-
## 📏 Length Calculation
|
| 240 |
|
| 241 |
-
|
| 242 |
|
| 243 |
-
|
| 244 |
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
)
|
| 248 |
return
|
| 249 |
|
| 250 |
|
|
@@ -262,7 +264,9 @@ def _(expressions_df, pl):
|
|
| 262 |
|
| 263 |
@app.cell(hide_code=True)
|
| 264 |
def _(mo):
|
| 265 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 266 |
return
|
| 267 |
|
| 268 |
|
|
@@ -278,13 +282,11 @@ def _(alt, docstring_length_df):
|
|
| 278 |
|
| 279 |
@app.cell(hide_code=True)
|
| 280 |
def _(mo):
|
| 281 |
-
mo.md(
|
| 282 |
-
|
| 283 |
-
## 🔠 Case Conversion
|
| 284 |
|
| 285 |
-
|
| 286 |
-
|
| 287 |
-
)
|
| 288 |
return
|
| 289 |
|
| 290 |
|
|
@@ -300,15 +302,13 @@ def _(expressions_df, pl):
|
|
| 300 |
|
| 301 |
@app.cell(hide_code=True)
|
| 302 |
def _(mo):
|
| 303 |
-
mo.md(
|
| 304 |
-
|
| 305 |
-
## ➕ Padding
|
| 306 |
|
| 307 |
-
|
| 308 |
|
| 309 |
-
|
| 310 |
-
|
| 311 |
-
)
|
| 312 |
return
|
| 313 |
|
| 314 |
|
|
@@ -340,15 +340,13 @@ def _(mo, padded_df, padding):
|
|
| 340 |
|
| 341 |
@app.cell(hide_code=True)
|
| 342 |
def _(mo):
|
| 343 |
-
mo.md(
|
| 344 |
-
|
| 345 |
-
## 🔄 Replacing
|
| 346 |
|
| 347 |
-
|
| 348 |
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
)
|
| 352 |
return
|
| 353 |
|
| 354 |
|
|
@@ -364,13 +362,11 @@ def _(expressions_df, pl):
|
|
| 364 |
|
| 365 |
@app.cell(hide_code=True)
|
| 366 |
def _(mo):
|
| 367 |
-
mo.md(
|
| 368 |
-
|
| 369 |
-
A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance.
|
| 370 |
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
)
|
| 374 |
return
|
| 375 |
|
| 376 |
|
|
@@ -390,15 +386,13 @@ def _(expressions_df, pl):
|
|
| 390 |
|
| 391 |
@app.cell(hide_code=True)
|
| 392 |
def _(mo):
|
| 393 |
-
mo.md(
|
| 394 |
-
|
| 395 |
-
## 🔍 Searching & Matching
|
| 396 |
|
| 397 |
-
|
| 398 |
|
| 399 |
-
|
| 400 |
-
|
| 401 |
-
)
|
| 402 |
return
|
| 403 |
|
| 404 |
|
|
@@ -414,13 +408,11 @@ def _(expressions_df, pl):
|
|
| 414 |
|
| 415 |
@app.cell(hide_code=True)
|
| 416 |
def _(mo):
|
| 417 |
-
mo.md(
|
| 418 |
-
|
| 419 |
-
Throughout this course as you have gained familiarity with the expression API you might have noticed that some members end with an underscore such as `or_`, since their "body" is a reserved Python keyword.
|
| 420 |
|
| 421 |
-
|
| 422 |
-
|
| 423 |
-
)
|
| 424 |
return
|
| 425 |
|
| 426 |
|
|
@@ -436,13 +428,11 @@ def _(expressions_df, pl):
|
|
| 436 |
|
| 437 |
@app.cell(hide_code=True)
|
| 438 |
def _(mo):
|
| 439 |
-
mo.md(
|
| 440 |
-
|
| 441 |
-
Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members.
|
| 442 |
|
| 443 |
-
|
| 444 |
-
|
| 445 |
-
)
|
| 446 |
return
|
| 447 |
|
| 448 |
|
|
@@ -462,7 +452,9 @@ def _(expressions_df, pl):
|
|
| 462 |
|
| 463 |
@app.cell(hide_code=True)
|
| 464 |
def _(mo):
|
| 465 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 466 |
return
|
| 467 |
|
| 468 |
|
|
@@ -478,21 +470,19 @@ def _(expressions_df, pl):
|
|
| 478 |
|
| 479 |
@app.cell(hide_code=True)
|
| 480 |
def _(mo):
|
| 481 |
-
mo.md(
|
| 482 |
-
|
| 483 |
-
From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments are going on within each of these examples, right? That's not as simple as checking for a pre-defined literal string containment though, because variables can have arbitrary names - any valid Python identifier is allowed. While the `contains` function supports checking for regular expressions instead of literal strings too, it would not suffice for this exercise because it only tells us whether there is at least a single occurrence of the sought pattern rather than telling us the exact number of matches.
|
| 484 |
|
| 485 |
-
|
| 486 |
|
| 487 |
-
|
| 488 |
|
| 489 |
-
|
| 490 |
-
|
| 491 |
-
|
| 492 |
|
| 493 |
-
|
| 494 |
-
|
| 495 |
-
)
|
| 496 |
return
|
| 497 |
|
| 498 |
|
|
@@ -508,7 +498,9 @@ def _(expressions_df, pl):
|
|
| 508 |
|
| 509 |
@app.cell(hide_code=True)
|
| 510 |
def _(mo):
|
| 511 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 512 |
return
|
| 513 |
|
| 514 |
|
|
@@ -524,13 +516,11 @@ def _(expressions_df, pl):
|
|
| 524 |
|
| 525 |
@app.cell(hide_code=True)
|
| 526 |
def _(mo):
|
| 527 |
-
mo.md(
|
| 528 |
-
|
| 529 |
-
## ✂️ Slicing and Substrings
|
| 530 |
|
| 531 |
-
|
| 532 |
-
|
| 533 |
-
)
|
| 534 |
return
|
| 535 |
|
| 536 |
|
|
@@ -564,17 +554,15 @@ def _(mo, slice, sliced_df):
|
|
| 564 |
|
| 565 |
@app.cell(hide_code=True)
|
| 566 |
def _(mo):
|
| 567 |
-
mo.md(
|
| 568 |
-
|
| 569 |
-
## ➗ Splitting
|
| 570 |
|
| 571 |
-
|
| 572 |
|
| 573 |
-
|
| 574 |
|
| 575 |
-
|
| 576 |
-
|
| 577 |
-
)
|
| 578 |
return
|
| 579 |
|
| 580 |
|
|
@@ -591,7 +579,9 @@ def _(expressions_df, pl):
|
|
| 591 |
|
| 592 |
@app.cell(hide_code=True)
|
| 593 |
def _(mo):
|
| 594 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 595 |
return
|
| 596 |
|
| 597 |
|
|
@@ -640,20 +630,18 @@ def _(alt, expressions_df, pl, random, wordcloud_height, wordcloud_width):
|
|
| 640 |
size=alt.Size("len:Q", legend=None),
|
| 641 |
tooltip=["member", "len"],
|
| 642 |
).configure_view(strokeWidth=0)
|
| 643 |
-
return wordcloud,
|
| 644 |
|
| 645 |
|
| 646 |
@app.cell(hide_code=True)
|
| 647 |
def _(mo):
|
| 648 |
-
mo.md(
|
| 649 |
-
|
| 650 |
-
## 🔗 Concatenation & Joining
|
| 651 |
|
| 652 |
-
|
| 653 |
|
| 654 |
-
|
| 655 |
-
|
| 656 |
-
)
|
| 657 |
return
|
| 658 |
|
| 659 |
|
|
@@ -679,13 +667,11 @@ def _(expressions_df, pl):
|
|
| 679 |
|
| 680 |
@app.cell(hide_code=True)
|
| 681 |
def _(mo):
|
| 682 |
-
mo.md(
|
| 683 |
-
|
| 684 |
-
Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line.
|
| 685 |
|
| 686 |
-
|
| 687 |
-
|
| 688 |
-
)
|
| 689 |
return
|
| 690 |
|
| 691 |
|
|
@@ -708,17 +694,15 @@ def _(descriptions_df, mo, pl):
|
|
| 708 |
|
| 709 |
@app.cell(hide_code=True)
|
| 710 |
def _(mo):
|
| 711 |
-
mo.md(
|
| 712 |
-
|
| 713 |
-
## 🔍 Pattern-based Extraction
|
| 714 |
|
| 715 |
-
|
| 716 |
|
| 717 |
-
|
| 718 |
|
| 719 |
-
|
| 720 |
-
|
| 721 |
-
)
|
| 722 |
return
|
| 723 |
|
| 724 |
|
|
@@ -731,20 +715,18 @@ def _(expressions_df, pl):
|
|
| 731 |
url_match=pl.col('docstring').str.extract(url_pattern),
|
| 732 |
url_matches=pl.col('docstring').str.extract_all(url_pattern),
|
| 733 |
).filter(pl.col('url_match').is_not_null())
|
| 734 |
-
return
|
| 735 |
|
| 736 |
|
| 737 |
@app.cell(hide_code=True)
|
| 738 |
def _(mo):
|
| 739 |
-
mo.md(
|
| 740 |
-
|
| 741 |
-
Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way.
|
| 742 |
|
| 743 |
-
|
| 744 |
|
| 745 |
-
|
| 746 |
-
|
| 747 |
-
)
|
| 748 |
return
|
| 749 |
|
| 750 |
|
|
@@ -760,15 +742,13 @@ def _(expressions_df, pl):
|
|
| 760 |
|
| 761 |
@app.cell(hide_code=True)
|
| 762 |
def _(mo):
|
| 763 |
-
mo.md(
|
| 764 |
-
|
| 765 |
-
## 🧹 Stripping
|
| 766 |
|
| 767 |
-
|
| 768 |
|
| 769 |
-
|
| 770 |
-
|
| 771 |
-
)
|
| 772 |
return
|
| 773 |
|
| 774 |
|
|
@@ -785,15 +765,13 @@ def _(expressions_df, pl):
|
|
| 785 |
|
| 786 |
@app.cell(hide_code=True)
|
| 787 |
def _(mo):
|
| 788 |
-
mo.md(
|
| 789 |
-
|
| 790 |
-
Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set.
|
| 791 |
|
| 792 |
-
|
| 793 |
|
| 794 |
-
|
| 795 |
-
|
| 796 |
-
)
|
| 797 |
return
|
| 798 |
|
| 799 |
|
|
@@ -809,13 +787,11 @@ def _(expressions_df, pl):
|
|
| 809 |
|
| 810 |
@app.cell(hide_code=True)
|
| 811 |
def _(mo):
|
| 812 |
-
mo.md(
|
| 813 |
-
|
| 814 |
-
## 🔑 Encoding & Decoding
|
| 815 |
|
| 816 |
-
|
| 817 |
-
|
| 818 |
-
)
|
| 819 |
return
|
| 820 |
|
| 821 |
|
|
@@ -832,7 +808,9 @@ def _(expressions_df, pl):
|
|
| 832 |
|
| 833 |
@app.cell(hide_code=True)
|
| 834 |
def _(mo):
|
| 835 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 836 |
return
|
| 837 |
|
| 838 |
|
|
@@ -847,19 +825,17 @@ def _(encoded_df, pl):
|
|
| 847 |
|
| 848 |
@app.cell(hide_code=True)
|
| 849 |
def _(mo):
|
| 850 |
-
mo.md(
|
| 851 |
-
|
| 852 |
-
## 🚀 Application: Dynamic Execution of Polars Examples
|
| 853 |
|
| 854 |
-
|
| 855 |
|
| 856 |
-
|
| 857 |
|
| 858 |
-
|
| 859 |
|
| 860 |
-
|
| 861 |
-
|
| 862 |
-
)
|
| 863 |
return
|
| 864 |
|
| 865 |
|
|
@@ -894,7 +870,7 @@ def _(mo, selected_expression_record):
|
|
| 894 |
|
| 895 |
|
| 896 |
@app.cell(hide_code=True)
|
| 897 |
-
def _(example_editor
|
| 898 |
execution_result = execute_code(example_editor.value)
|
| 899 |
return (execution_result,)
|
| 900 |
|
|
@@ -943,50 +919,48 @@ def _(expressions_df, pl):
 return (code_df,)


-@app.…
-def …
-    import ast
-
-    # Create a new local namespace for execution
-    local_namespace = {}
-    …
-    return None
-    …
-    last_expr = ast.Expression(parsed_code.body[-1].value)
-    …
-    exec(
-        compile(parsed_code, "<string>", "exec"),
-        globals(),
-        local_namespace,
-    )
-    …
@app.cell(hide_code=True)
|
|
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
+
__generated_with = "0.18.4"
|
| 14 |
app = marimo.App(width="medium")
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
+
mo.md(r"""
|
| 20 |
+
# Strings
|
|
|
|
| 21 |
|
| 22 |
+
_By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
|
| 23 |
|
| 24 |
+
In this chapter we're going to dig into string manipulation. For a fun twist, we'll be mostly playing around with a dataset that every Polars user has bumped into without really thinking about it—the source code of the `polars` module itself. More precisely, we'll use a dataframe that pulls together all the Polars expressions and their docstrings, giving us a cool, hands-on way to explore the expression API in a truly data-driven manner.
|
| 25 |
|
| 26 |
+
We'll cover parsing, length calculation, case conversion, and much more, with practical examples and visualizations. Finally, we will combine various techniques you learned in prior chapters to build a fully interactive playground in which you can execute the official code examples of Polars expressions.
|
| 27 |
+
""")
|
|
|
|
| 28 |
return
|
| 29 |
|
| 30 |
|
| 31 |
@app.cell(hide_code=True)
|
| 32 |
def _(mo):
|
| 33 |
+
mo.md(r"""
|
| 34 |
+
## 🛠️ Parsing & Conversion
|
|
|
|
| 35 |
|
| 36 |
+
Let's warm up with one of the most frequent use cases: parsing raw strings into various formats.
|
| 37 |
+
We'll take a tiny dataframe with metadata about Python packages represented as raw JSON strings and we'll use Polars string expressions to parse the attributes into their true data types.
|
| 38 |
+
""")
|
|
|
|
| 39 |
return
|
| 40 |
|
| 41 |
|
|
|
|
| 54 |
|
| 55 |
@app.cell(hide_code=True)
|
| 56 |
def _(mo):
|
| 57 |
+
mo.md(r"""
|
| 58 |
+
We can use the [`json_decode`](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.json_decode.html) expression to parse the raw JSON strings into Polars-native structs and we can use the [unnest](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html) dataframe operation to have a dedicated column per parsed attribute.
|
| 59 |
+
""")
|
| 60 |
return
|
| 61 |
|
| 62 |
|
|
|
|
| 69 |
|
| 70 |
@app.cell(hide_code=True)
|
| 71 |
def _(mo):
|
| 72 |
+
mo.md(r"""
|
| 73 |
+
This is already a much friendlier representation of the data we started out with, but note that since the JSON entries had only string attributes, all values are strings, even the temporal `released_at` and numerical `size_mb` columns.
|
| 74 |
+
""")
|
| 75 |
return
|
| 76 |
|
| 77 |
|
| 78 |
@app.cell(hide_code=True)
|
| 79 |
def _(mo):
|
| 80 |
+
mo.md(r"""
|
| 81 |
+
As we know that the `size_mb` column should have a decimal representation, we go ahead and use [`to_decimal`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_decimal.html#polars.Expr.str.to_decimal) to perform the conversion.
|
| 82 |
+
""")
|
| 83 |
return
|
| 84 |
|
| 85 |
|
|
|
|
| 95 |
|
| 96 |
@app.cell(hide_code=True)
|
| 97 |
def _(mo):
|
| 98 |
+
mo.md(r"""
|
| 99 |
+
Moving on to the `released_at` attribute, which indicates the exact time when a given Python package got released, we have a few more options to consider. We can convert to `Date`, `DateTime`, and `Time` types based on the desired temporal granularity. The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion; all we need is to provide the desired format string.
|
|
|
|
| 100 |
|
| 101 |
+
Since Polars uses Rust under the hood to implement all its expressions, we need to consult the [`chrono::format`](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) reference to come up with appropriate format strings.
|
| 102 |
|
| 103 |
+
Here's a quick reference:
|
| 104 |
|
| 105 |
+
| Specifier | Meaning |
|
| 106 |
+
|-----------|--------------------|
|
| 107 |
+
| `%Y` | Year (e.g., 2025) |
|
| 108 |
+
| `%m` | Month (01-12) |
|
| 109 |
+
| `%d` | Day (01-31) |
|
| 110 |
+
| `%H` | Hour (00-23) |
|
| 111 |
+
| `%z` | UTC offset |
|
| 112 |
|
| 113 |
+
The raw strings we are working with look like `"2025-03-02T20:31:12+0000"`. We can match this using the `"%Y-%m-%dT%H:%M:%S%z"` format string.
|
| 114 |
+
""")
|
|
|
|
| 115 |
return
|
| 116 |
|
| 117 |
|
|
|
|
| 129 |
|
| 130 |
@app.cell(hide_code=True)
|
| 131 |
def _(mo):
|
| 132 |
+
mo.md(r"""
|
| 133 |
+
Alternatively, instead of using three different functions to perform the conversion to date, we can use a single one, [`strptime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strptime.html) which takes the desired temporal data type as its first parameter.
|
| 134 |
+
""")
|
| 135 |
return
|
| 136 |
|
| 137 |
|
|
|
|
| 149 |
|
| 150 |
@app.cell(hide_code=True)
|
| 151 |
def _(mo):
|
| 152 |
+
mo.md(r"""
|
| 153 |
+
And to wrap up this section on parsing and conversion, let's consider a final scenario. What if we don't want to parse the entire raw JSON string, because we only need a subset of its attributes? Well, in this case we can leverage the [`json_path_match`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_path_match.html) expression to extract only the desired attributes using standard [JSONPath](https://goessner.net/articles/JsonPath/) syntax.
|
| 154 |
+
""")
|
| 155 |
return
|
| 156 |
|
| 157 |
|
|
|
|
| 169 |
|
| 170 |
@app.cell(hide_code=True)
|
| 171 |
def _(mo):
|
| 172 |
+
mo.md(r"""
|
| 173 |
+
## 📊 Dataset Overview
|
|
|
|
| 174 |
|
| 175 |
+
Now that we got our hands dirty, let's consider a somewhat wilder dataset for the subsequent sections: a dataframe of metadata about every single expression in your current Polars module.
|
| 176 |
|
| 177 |
+
At the risk of stating the obvious, in the previous section, when we typed `pl.col('raw_json').str.json_decode()`, we accessed the `json_decode` member of the `str` expression namespace through the `pl.col('raw_json')` expression *instance*. Under the hood, deep inside the Polars source code, there is a corresponding `def json_decode(...)` method with a carefully authored docstring explaining the purpose and signature of the member.
|
| 178 |
|
| 179 |
+
Since Python makes module introspection simple, we can easily enumerate all Polars expressions and organize their metadata in `expressions_df`, to be used for all the upcoming string manipulation examples.
|
| 180 |
+
""")
|
|
|
|
| 181 |
return
|
| 182 |
|
| 183 |
|
|
|
|
| 216 |
|
| 217 |
expressions_df = pl.from_dicts(list_expr_meta(), infer_schema_length=None).sort('namespace', 'member')
|
| 218 |
expressions_df
|
| 219 |
+
return (expressions_df,)
|
| 220 |
|
| 221 |
|
| 222 |
@app.cell(hide_code=True)
|
| 223 |
def _(mo):
|
| 224 |
+
mo.md(r"""
|
| 225 |
+
As the following visualization shows, `str` is one of the richest Polars expression namespaces with multiple dozens of functions in it.
|
| 226 |
+
""")
|
| 227 |
return
|
| 228 |
|
| 229 |
|
|
|
|
| 238 |
|
| 239 |
@app.cell(hide_code=True)
|
| 240 |
def _(mo):
|
| 241 |
+
mo.md(r"""
|
| 242 |
+
## 📏 Length Calculation
|
|
|
|
| 243 |
|
| 244 |
+
A common use case is to compute the length of a string. Most people associate string length exclusively with the number of characters the string consists of; however, in certain scenarios it is also useful to know how much memory is required to store it, that is, how many bytes are needed to represent the textual data.
|
| 245 |
|
| 246 |
+
The expressions [`len_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html) and [`len_bytes`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html) are here to help us with these calculations.
|
| 247 |
|
| 248 |
+
Below, we compute `docstring_len_chars` and `docstring_len_bytes` columns to see how many characters and bytes the documentation of each expression is made up of.
|
| 249 |
+
""")
|
|
|
|
| 250 |
return
|
| 251 |
|
| 252 |
|
|
|
|
| 264 |
|
| 265 |
@app.cell(hide_code=True)
|
| 266 |
def _(mo):
|
| 267 |
+
mo.md(r"""
|
| 268 |
+
As the dataframe preview above and the scatterplot below show, the docstring length measured in bytes is almost always bigger than the length expressed in characters. This is due to the fact that the docstrings include characters which require more than a single byte to represent, such as "╞" for displaying dataframe header and body separators.
|
| 269 |
+
""")
|
| 270 |
return
|
| 271 |
|
| 272 |
|
|
|
|
| 282 |
|
| 283 |
@app.cell(hide_code=True)
|
| 284 |
def _(mo):
|
| 285 |
+
mo.md(r"""
|
| 286 |
+
## 🔠 Case Conversion
|
|
|
|
| 287 |
|
| 288 |
+
Another frequent string transformation is lowercasing, uppercasing, and titlecasing. We can use [`to_lowercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html), [`to_uppercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_uppercase.html), and [`to_titlecase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html) for doing so.
|
| 289 |
+
""")
|
|
|
|
| 290 |
return
|
| 291 |
|
| 292 |
|
|
|
|
| 302 |
|
| 303 |
@app.cell(hide_code=True)
|
| 304 |
def _(mo):
|
| 305 |
+
mo.md(r"""
|
| 306 |
+
## ➕ Padding
|
|
|
|
| 307 |
|
| 308 |
+
Sometimes we need to ensure that strings have a fixed-size character length. [`pad_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_start.html) and [`pad_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_end.html) can be used to fill the "front" or "back" of a string with a supplied character, while [`zfill`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.zfill.html) is a utility for padding the start of a string with `"0"` until it reaches a particular length. In other words, `zfill` is a more specific version of `pad_start`, where the `fill_char` parameter is explicitly set to `"0"`.
|
| 309 |
|
| 310 |
+
In the example below we take the unique Polars expression namespaces and pad them so that they have a uniform length which you can control via a slider.
|
| 311 |
+
""")
|
|
|
|
| 312 |
return
|
| 313 |
|
| 314 |
|
|
|
|
| 340 |
|
| 341 |
@app.cell(hide_code=True)
|
| 342 |
def _(mo):
|
| 343 |
+
mo.md(r"""
|
| 344 |
+
## 🔄 Replacing
|
|
|
|
| 345 |
|
| 346 |
+
Let's say we want to convert from `snake_case` API member names to `kebab-case`, that is, we need to replace the underscore character with a hyphen. For operations like that, we can use [`replace`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html) and [`replace_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html).
|
| 347 |
|
| 348 |
+
As the example below demonstrates, `replace` stops after the first occurrence of the to-be-replaced pattern, while `replace_all` goes all the way through and changes all underscores to hyphens resulting in the `kebab-case` representation we were looking for.
|
| 349 |
+
""")
|
|
|
|
| 350 |
return
|
| 351 |
|
| 352 |
|
|
|
|
| 362 |
|
| 363 |
@app.cell(hide_code=True)
|
| 364 |
def _(mo):
|
| 365 |
+
mo.md(r"""
|
| 366 |
+
A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance.
|
|
|
|
| 367 |
|
| 368 |
+
In the example below we replace all instances of `"min"` with `"minimum"` and `"max"` with `"maximum"` using a single expression.
|
| 369 |
+
""")
|
|
|
|
| 370 |
return
|
| 371 |
|
| 372 |
|
|
|
|
| 386 |
|
| 387 |
@app.cell(hide_code=True)
|
| 388 |
def _(mo):
|
| 389 |
+
mo.md(r"""
|
| 390 |
+
## 🔍 Searching & Matching
|
|
|
|
| 391 |
|
| 392 |
+
A common need when working with strings is to determine whether their content satisfies some condition: whether it starts or ends with a particular substring or contains a certain pattern.
|
| 393 |
|
| 394 |
+
Let's suppose we want to determine whether a member of the Polars expression API is a "converter", such as `to_decimal`, identified by its `"to_"` prefix. We can use [`starts_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html) to perform this check.
|
| 395 |
+
""")
|
|
|
|
| 396 |
return
|
| 397 |
|
| 398 |
|
|
|
|
| 408 |
|
| 409 |
@app.cell(hide_code=True)
|
| 410 |
def _(mo):
|
| 411 |
+
mo.md(r"""
|
| 412 |
+
Throughout this course, as you have gained familiarity with the expression API, you might have noticed that some members end with an underscore, such as `or_`, because their name would otherwise collide with a reserved Python keyword.
|
|
|
|
| 413 |
|
| 414 |
+
Let's use [`ends_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html) to find all the members which are named after such keywords.
|
| 415 |
+
""")
|
|
|
|
| 416 |
return
|
| 417 |
|
| 418 |
|
|
|
|
| 428 |
|
| 429 |
@app.cell(hide_code=True)
|
| 430 |
def _(mo):
|
| 431 |
+
mo.md(r"""
|
| 432 |
+
Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members.
|
|
|
|
| 433 |
|
| 434 |
+
As demonstrated below, we can compute all these boolean attributes using [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) to check whether the docstring includes a particular substring.
|
| 435 |
+
""")
|
|
|
|
| 436 |
return
|
| 437 |
|
| 438 |
|
|
|
|
| 452 |
|
| 453 |
@app.cell(hide_code=True)
|
| 454 |
def _(mo):
|
| 455 |
+
mo.md(r"""
|
| 456 |
+
For scenarios where we want to check for several patterns at once, we can combine them into a single regular expression with the alternation operator `|` and pass that to the [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) expression.
|
| 457 |
+
""")
|
| 458 |
return
|
| 459 |
|
| 460 |
|
|
|
|
| 470 |
|
| 471 |
@app.cell(hide_code=True)
|
| 472 |
def _(mo):
|
| 473 |
+
mo.md(r"""
|
| 474 |
+
From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments occur within each of these examples, right? That's not as simple as checking for containment of a predefined literal string, though, because variables can have arbitrary names: any valid Python identifier is allowed. While the `contains` function also supports regular expressions instead of literal strings, it would not suffice for this exercise, because it only tells us whether there is at least one occurrence of the sought pattern, rather than the exact number of matches.
|
|
|
|
| 475 |
|
| 476 |
+
Fortunately, we can take advantage of [`count_matches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.count_matches.html) to achieve exactly what we want. We specify the regular expression `r'[a-zA-Z_][a-zA-Z0-9_]* = '` according to the [`regex` Rust crate](https://docs.rs/regex/latest/regex/) to match Python identifiers and we leave the rest to Polars.
|
| 477 |
|
| 478 |
+
In `count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = ')`:
|
| 479 |
|
| 480 |
+
- `[a-zA-Z_]` matches a letter or underscore (start of a Python identifier).
|
| 481 |
+
- `[a-zA-Z0-9_]*` matches zero or more letters, digits, or underscores.
|
| 482 |
+
- ` = ` matches a space, equals sign, and space (indicating assignment).
|
| 483 |
|
| 484 |
+
This finds variable assignments like `x = ` or `df_result = ` in docstrings.
|
| 485 |
+
""")
|
|
|
|
| 486 |
return
|
| 487 |
|
| 488 |
|
|
|
|
| 498 |
|
| 499 |
@app.cell(hide_code=True)
|
| 500 |
def _(mo):
|
| 501 |
+
mo.md(r"""
|
| 502 |
+
A related application is to *find* the first index where a particular pattern is present, so that it can be used for downstream processing such as slicing. Below we use the [`find`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.find.html) expression to determine the index at which a code example starts in the docstring, identified by the Python prompt substring `">>>"`.
|
| 503 |
+
""")
|
| 504 |
return
|
| 505 |
|
| 506 |
|
|
|
|
| 516 |
|
| 517 |
@app.cell(hide_code=True)
|
| 518 |
def _(mo):
|
| 519 |
+
mo.md(r"""
|
| 520 |
+
## ✂️ Slicing and Substrings
|
|
|
|
| 521 |
|
| 522 |
+
Sometimes we are only interested in a particular substring. We can use [`head`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html), [`tail`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html) and [`slice`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html) to extract a substring from the start, end, or between arbitrary indices.
|
| 523 |
+
""")
|
|
|
|
| 524 |
return
|
| 525 |
|
| 526 |
|
|
|
|
| 554 |
|
| 555 |
@app.cell(hide_code=True)
|
| 556 |
def _(mo):
|
| 557 |
+
mo.md(r"""
|
| 558 |
+
## ➗ Splitting
|
|
|
|
| 559 |
|
| 560 |
+
Certain strings follow a well-defined structure and we might be only interested in some parts of them. For example, when dealing with `snake_cased_expression` member names we might be curious to get only the first, second, or $n^{\text{th}}$ word before an underscore. We would need to *split* the string at a particular pattern for downstream processing.
|
| 561 |
|
| 562 |
+
The [`split`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html), [`split_exact`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split_exact.html) and [`splitn`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.splitn.html) expressions enable us to achieve this.
|
| 563 |
|
| 564 |
+
The primary difference between these string-splitting utilities is that `split` produces a list whose length depends on the number of resulting segments, `splitn` returns a struct of at most `n` fields, while `split_exact` returns a struct of exactly `n + 1` fields (the string is split at most `n` times).
|
| 565 |
+
""")
|
|
|
|
| 566 |
return
|
| 567 |
|
| 568 |
|
|
|
|
| 579 |
|
| 580 |
@app.cell(hide_code=True)
|
| 581 |
def _(mo):
|
| 582 |
+
mo.md(r"""
|
| 583 |
+
As a more practical example, we can use the `split` expression with some aggregation to count the number of times a particular word occurs in member names across all namespaces. This enables us to create a word cloud of the API members' constituents!
|
| 584 |
+
""")
|
| 585 |
return
|
| 586 |
|
| 587 |
|
|
|
|
| 630 |
size=alt.Size("len:Q", legend=None),
|
| 631 |
tooltip=["member", "len"],
|
| 632 |
).configure_view(strokeWidth=0)
|
| 633 |
+
return (wordcloud,)
|
| 634 |
|
| 635 |
|
| 636 |
@app.cell(hide_code=True)
|
| 637 |
def _(mo):
|
| 638 |
+
mo.md(r"""
|
| 639 |
+
## 🔗 Concatenation & Joining
|
|
|
|
| 640 |
|
| 641 |
+
Often we would like to create longer strings from strings we already have. We might want to create a formatted, sentence-like string or join multiple existing strings in our dataframe into a single one.
|
| 642 |
|
| 643 |
+
The top-level [`concat_str`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html) expression enables us to combine strings *horizontally* in a dataframe. As the example below shows, we can take the `member` and `namespace` column of each row and construct a `description` column in which each row will correspond to the value ``f"- Expression `{member}` belongs to namespace `{namespace}`"``.
|
| 644 |
+
""")
|
|
|
|
| 645 |
return
|
| 646 |
|
| 647 |
|
|
|
|
| 667 |
|
| 668 |
@app.cell(hide_code=True)
|
| 669 |
def _(mo):
|
| 670 |
+
mo.md(r"""
|
| 671 |
+
Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line.
|
|
|
|
| 672 |
|
| 673 |
+
We will use the [`join`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.join.html) expression to do so.
|
| 674 |
+
""")
|
|
|
|
| 675 |
return
|
| 676 |
|
| 677 |
|
|
|
|
| 694 |
|
| 695 |
@app.cell(hide_code=True)
|
| 696 |
def _(mo):
|
| 697 |
+
mo.md(r"""
|
| 698 |
+
## 🔍 Pattern-based Extraction
|
|
|
|
| 699 |
|
| 700 |
+
In the vast majority of cases, when dealing with unstructured text data, all we really want is to extract something structured from it. A common use case is to extract URLs from text to get a better understanding of related content.
|
| 701 |
|
| 702 |
+
In the example below that's exactly what we do. We scan the `docstring` of each API member and extract URLs from them using [`extract`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html) and [`extract_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_all.html) using a simple regular expression to match http and https URLs.
|
| 703 |
|
| 704 |
+
Note that `extract` stops after the first match and returns a scalar result (or `null` if there was no match), while `extract_all` returns a (potentially empty) list of matches.
|
| 705 |
+
""")
|
|
|
|
| 706 |
return
|
| 707 |
|
| 708 |
|
|
|
|
| 715 |
url_match=pl.col('docstring').str.extract(url_pattern),
|
| 716 |
url_matches=pl.col('docstring').str.extract_all(url_pattern),
|
| 717 |
).filter(pl.col('url_match').is_not_null())
|
| 718 |
+
return
|
| 719 |
|
| 720 |
|
| 721 |
@app.cell(hide_code=True)
|
| 722 |
def _(mo):
|
| 723 |
+
mo.md(r"""
|
| 724 |
+
Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way.
|
|
|
|
| 725 |
|
| 726 |
+
[`extract_groups`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_groups.html) is a really powerful expression allowing us to achieve exactly that.
|
| 727 |
|
| 728 |
+
Below we define the regular expression `r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"` with two capture groups, named `height` and `width` and pass it as the parameter of `extract_groups`. After execution, for each `docstring`, we end up with fully structured data we can further process downstream!
|
| 729 |
+
""")
|
|
|
|
| 730 |
return
|
| 731 |
|
| 732 |
|
|
|
|
| 742 |
|
| 743 |
@app.cell(hide_code=True)
|
| 744 |
def _(mo):
|
| 745 |
+
mo.md(r"""
|
| 746 |
+
## 🧹 Stripping
|
|
|
|
| 747 |
|
| 748 |
+
Strings might require some cleaning before further processing, such as the removal of some characters from the beginning or end of the text. [`strip_chars_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html), [`strip_chars_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html) and [`strip_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html) are here to facilitate this.
|
| 749 |
|
| 750 |
+
All we need to do is to specify a set of characters we would like to get rid of and Polars handles the rest for us.
|
| 751 |
+
""")
|
|
|
|
| 752 |
return
|
| 753 |
|
| 754 |
|
|
|
|
| 765 |
|
| 766 |
@app.cell(hide_code=True)
|
| 767 |
def _(mo):
|
| 768 |
+
mo.md(r"""
|
| 769 |
+
Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set.
|
|
|
|
| 770 |
|
| 771 |
+
That's exactly the rationale behind [`strip_prefix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html) and [`strip_suffix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html).
|
| 772 |
|
| 773 |
+
Below we use these to remove the `"to_"` prefixes and `"_with"` suffixes from each member name.
|
| 774 |
+
""")
|
|
|
|
| 775 |
return
|
| 776 |
|
| 777 |
|
|
|
|
| 787 |
|
| 788 |
@app.cell(hide_code=True)
|
| 789 |
def _(mo):
|
| 790 |
+
mo.md(r"""
|
| 791 |
+
## 🔑 Encoding & Decoding
|
|
|
|
| 792 |
|
| 793 |
+
Should you find yourself in the need of encoding your strings into [base64](https://en.wikipedia.org/wiki/Base64) or [hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) format, then Polars has your back with its [`encode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.encode.html) expression.
|
| 794 |
+
""")
|
|
|
|
| 795 |
return
|
| 796 |
|
| 797 |
|
|
|
|
| 808 |
|
| 809 |
@app.cell(hide_code=True)
|
| 810 |
def _(mo):
|
| 811 |
+
mo.md(r"""
|
| 812 |
+
And of course, you can convert back into a human-readable representation using the [`decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.decode.html) expression.
|
| 813 |
+
""")
|
| 814 |
return
|
| 815 |
|
| 816 |
|
|
|
|
| 825 |
|
| 826 |
@app.cell(hide_code=True)
|
| 827 |
def _(mo):
|
| 828 |
+
mo.md(r"""
|
| 829 |
+
## 🚀 Application: Dynamic Execution of Polars Examples
|
|
|
|
| 830 |
|
| 831 |
+
Now that we are familiar with string expressions, we can combine them with other Polars operations to build a fully interactive playground where code examples of Polars expressions can be explored.
|
| 832 |
|
| 833 |
+
We make use of string expressions to extract the raw Python source code of examples from the docstrings and we leverage the interactive Marimo environment to enable the selection of expressions via a searchable dropdown and a fully functional code editor whose output is rendered with Marimo's rich display utilities.
|
| 834 |
|
| 835 |
+
In other words, we will use Polars to execute Polars. ❄️ How cool is that?
|
| 836 |
|
| 837 |
+
---
|
| 838 |
+
""")
|
|
|
|
| 839 |
return
|
| 840 |
|
| 841 |
|
|
|
|
| 870 |
|
| 871 |
|
| 872 |
@app.cell(hide_code=True)
|
| 873 |
+
def _(example_editor):
|
| 874 |
execution_result = execute_code(example_editor.value)
|
| 875 |
return (execution_result,)
|
| 876 |
|
|
|
|
| 919 |
return (code_df,)
|
| 920 |
|
| 921 |
|
| 922 |
+
@app.function(hide_code=True)
|
| 923 |
+
def execute_code(code: str):
|
| 924 |
+
import ast
|
|
|
|
|
|
|
|
|
|
|
|
|
| 925 |
|
| 926 |
+
# Create a new local namespace for execution
|
| 927 |
+
local_namespace = {}
|
| 928 |
|
| 929 |
+
# Parse the code into an AST to identify the last expression
|
| 930 |
+
parsed_code = ast.parse(code)
|
|
|
|
| 931 |
|
| 932 |
+
# Check if there's at least one statement
|
| 933 |
+
if not parsed_code.body:
|
| 934 |
+
return None
|
| 935 |
|
| 936 |
+
# If the last statement is an expression, we'll need to get its value
|
| 937 |
+
last_is_expr = isinstance(parsed_code.body[-1], ast.Expr)
|
|
|
|
| 938 |
|
| 939 |
+
if last_is_expr:
|
| 940 |
+
# Split the code: everything except the last statement, and the last statement
|
| 941 |
+
last_expr = ast.Expression(parsed_code.body[-1].value)
|
| 942 |
|
| 943 |
+
# Remove the last statement from the parsed code
|
| 944 |
+
parsed_code.body = parsed_code.body[:-1]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 945 |
|
| 946 |
+
# Execute everything except the last statement
|
| 947 |
+
if parsed_code.body:
|
| 948 |
+
exec(
|
| 949 |
+
compile(parsed_code, "<string>", "exec"),
|
| 950 |
+
globals(),
|
| 951 |
+
local_namespace,
|
| 952 |
)
|
| 953 |
+
|
| 954 |
+
# Execute the last statement and get its value
|
| 955 |
+
result = eval(
|
| 956 |
+
compile(last_expr, "<string>", "eval"), globals(), local_namespace
|
| 957 |
+
)
|
| 958 |
+
return result
|
| 959 |
+
else:
|
| 960 |
+
# If the last statement is not an expression (e.g., an assignment),
|
| 961 |
+
# execute the entire code and return None
|
| 962 |
+
exec(code, globals(), local_namespace)
|
| 963 |
+
return None
|
| 964 |
|
| 965 |
|
| 966 |
@app.cell(hide_code=True)
|
polars/11_missing_data.py
CHANGED
|
@@ -8,14 +8,13 @@
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
-
__generated_with = "0.
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
| 15 |
@app.cell(hide_code=True)
|
| 16 |
def _(mo):
|
| 17 |
-
mo.md(
|
| 18 |
-
r"""
|
| 19 |
# Dealing with Missing Data
|
| 20 |
|
| 21 |
_by [etrotta](https://github.com/etrotta) and [Felix Najera](https://github.com/folicks)_
|
|
@@ -24,20 +23,17 @@ def _(mo):
|
|
| 24 |
|
| 25 |
First we provide an overview of the methods available in polars, then we walk through a mini case study with real-world data showing how to use them, and finally we provide some additional information in the 'Bonus Content' section.
|
| 26 |
You can use the menu on the right side to jump to any section.
|
| 27 |
-
"""
|
| 28 |
-
)
|
| 29 |
return
|
| 30 |
|
| 31 |
|
| 32 |
@app.cell(hide_code=True)
|
| 33 |
def _(mo):
|
| 34 |
-
mo.md(
|
| 35 |
-
r"""
|
| 36 |
## Methods for working with Nulls
|
| 37 |
|
| 38 |
We'll be using the following DataFrame to show the most important methods:
|
| 39 |
-
"""
|
| 40 |
-
)
|
| 41 |
return
|
| 42 |
|
| 43 |
|
|
@@ -59,13 +55,11 @@ def _(pl):
|
|
| 59 |
|
| 60 |
@app.cell(hide_code=True)
|
| 61 |
def _(mo):
|
| 62 |
-
mo.md(
|
| 63 |
-
r"""
|
| 64 |
### Counting nulls
|
| 65 |
|
| 66 |
A simple yet convenient aggregation
|
| 67 |
-
"""
|
| 68 |
-
)
|
| 69 |
return
|
| 70 |
|
| 71 |
|
|
@@ -77,13 +71,11 @@ def _(df):
|
|
| 77 |
|
| 78 |
@app.cell(hide_code=True)
|
| 79 |
def _(mo):
|
| 80 |
-
mo.md(
|
| 81 |
-
r"""
|
| 82 |
### Dropping Nulls
|
| 83 |
|
| 84 |
The simplest way of dealing with null values is throwing them away, but that is not always a good idea.
|
| 85 |
-
"""
|
| 86 |
-
)
|
| 87 |
return
|
| 88 |
|
| 89 |
|
|
@@ -101,8 +93,7 @@ def _(df):
|
|
| 101 |
|
| 102 |
@app.cell(hide_code=True)
|
| 103 |
def _(mo):
|
| 104 |
-
mo.md(
|
| 105 |
-
r"""
|
| 106 |
### Filtering null values
|
| 107 |
|
| 108 |
To filter in polars, you'll typically use `df.filter(expression)` or `df.remove(expression)` methods.
|
|
@@ -112,8 +103,7 @@ def _(mo):
|
|
| 112 |
|
| 113 |
Remove will only remove rows in which the expression evaluates to True.
|
| 114 |
It will keep rows in which it evaluates to None.
|
| 115 |
-
"""
|
| 116 |
-
)
|
| 117 |
return
|
| 118 |
|
| 119 |
|
|
@@ -131,13 +121,11 @@ def _(df, pl):
|
|
| 131 |
|
| 132 |
@app.cell(hide_code=True)
|
| 133 |
def _(mo):
|
| 134 |
-
mo.md(
|
| 135 |
-
r"""
|
| 136 |
You may also be tempted to use `== None` or `!= None`, but operators in polars will generally propagate null values.
|
| 137 |
|
| 138 |
You can use `.eq_missing()` or `.ne_missing()` methods if you want to be strict about it, but there are also `.is_null()` and `.is_not_null()` methods you can use.
|
| 139 |
-
"""
|
| 140 |
-
)
|
| 141 |
return
|
| 142 |
|
| 143 |
|
|
@@ -156,8 +144,7 @@ def _(df, pl):
|
|
| 156 |
|
| 157 |
@app.cell(hide_code=True)
|
| 158 |
def _(mo):
|
| 159 |
-
mo.md(
|
| 160 |
-
r"""
|
| 161 |
### Filling Null values
|
| 162 |
|
| 163 |
You can also fill in the values with constants, calculations or by consulting external data sources.
|
|
@@ -165,8 +152,7 @@ def _(mo):
|
|
| 165 |
Be careful not to treat estimated or guessed values as if they a ground truth however, otherwise you may end up making conclusions about a reality that does not exists.
|
| 166 |
|
| 167 |
As an exercise, let's guess some values to fill in nulls, then try giving names to the animals with `null` by editing the cells
|
| 168 |
-
"""
|
| 169 |
-
)
|
| 170 |
return
|
| 171 |
|
| 172 |
|
|
@@ -192,8 +178,7 @@ def _(guesstimates):
|
|
| 192 |
|
| 193 |
@app.cell(hide_code=True)
|
| 194 |
def _(mo):
|
| 195 |
-
mo.md(
|
| 196 |
-
r"""
|
| 197 |
### TL;DR
|
| 198 |
|
| 199 |
Before we head into the mini case study, a brief review of what we have covered:
|
|
@@ -207,24 +192,21 @@ def _(mo):
|
|
| 207 |
You can also refer to the polars [User Guide](https://docs.pola.rs/user-guide/expressions/missing-data/) more more information.
|
| 208 |
|
| 209 |
Whichever approach you take, remember to document how you handled it!
|
| 210 |
-
"""
|
| 211 |
-
)
|
| 212 |
return
|
| 213 |
|
| 214 |
|
| 215 |
@app.cell(hide_code=True)
|
| 216 |
def _(mo):
|
| 217 |
-
mo.md(
|
| 218 |
-
r"""
|
| 219 |
# Mini Case Study
|
| 220 |
|
| 221 |
-
We will be using a dataset from `alertario` about the weather in Rio de Janeiro, originally available in Google Big Query under `datario.clima_pluviometro`. What you need to know about it:
|
| 222 |
|
| 223 |
- Contains multiple stations covering the Municipality of Rio de Janeiro
|
| 224 |
- Measures the precipitation as millimeters, with a granularity of 15 minutes
|
| 225 |
- We filtered to only include data about 2020, 2021 and 2022
|
| 226 |
-
"""
|
| 227 |
-
)
|
| 228 |
return
|
| 229 |
|
| 230 |
|
|
@@ -257,8 +239,7 @@ def _(pl, px, stations):
|
|
| 257 |
|
| 258 |
@app.cell(hide_code=True)
|
| 259 |
def _(mo):
|
| 260 |
-
mo.md(
|
| 261 |
-
r"""
|
| 262 |
# Stations
|
| 263 |
|
| 264 |
First, let's take a look at some of the stations. Notice how
|
|
@@ -267,8 +248,7 @@ def _(mo):
|
|
| 267 |
- There are some columns that do not even contain data at all!
|
| 268 |
|
| 269 |
We will remove the empty columns and remove rows without coordinates
|
| 270 |
-
"""
|
| 271 |
-
)
|
| 272 |
return
|
| 273 |
|
| 274 |
|
|
@@ -295,16 +275,14 @@ def _(dirty_stations, mo, pl):
|
|
| 295 |
|
| 296 |
@app.cell(hide_code=True)
|
| 297 |
def _(mo):
|
| 298 |
-
mo.md(
|
| 299 |
-
r"""
|
| 300 |
# Precipitation
|
| 301 |
Now, let's move on to the Precipitation data.
|
| 302 |
|
| 303 |
## Part 1 - Null Values
|
| 304 |
|
| 305 |
First of all, let's check for null values:
|
| 306 |
-
"""
|
| 307 |
-
)
|
| 308 |
return
|
| 309 |
|
| 310 |
|
|
@@ -328,8 +306,7 @@ def _(dirty_weather, mo, rain):
|
|
| 328 |
|
| 329 |
@app.cell(hide_code=True)
|
| 330 |
def _(mo):
|
| 331 |
-
mo.md(
|
| 332 |
-
r"""
|
| 333 |
### First option to fixing it: Dropping data.
|
| 334 |
|
| 335 |
We could just remove those rows like we did for the stations, which may be a passable solution for some problems, but is not always the best idea.
|
|
@@ -354,8 +331,7 @@ def _(mo):
|
|
| 354 |
|
| 355 |
Let's investigate a bit more before deciding on following with either approach.
|
| 356 |
For example, is our current data even complete, or are we already missing some rows beyond those with null values?
|
| 357 |
-
"""
|
| 358 |
-
)
|
| 359 |
return
|
| 360 |
|
| 361 |
|
|
@@ -387,8 +363,7 @@ def _(pl):
|
|
| 387 |
|
| 388 |
@app.cell(hide_code=True)
|
| 389 |
def _(mo):
|
| 390 |
-
mo.md(
|
| 391 |
-
r"""
|
| 392 |
## Part 2 - Missing Rows
|
| 393 |
|
| 394 |
We can see that we expected there to be 1096 rows for each hour for each station (from the start of 2020 to the end of 2022) , but in reality we see between 1077 and 1096 rows.
|
|
@@ -400,8 +375,7 @@ def _(mo):
|
|
| 400 |
Given that we are working with time series data, we will [upsample](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.upsample.html) the data, but you could also create a DataFrame containing all expected rows then use `join(how="...")`
|
| 401 |
|
| 402 |
However, that will give us _even more_ null values, so we will want to fill them in afterwards. For this case, we will just use a forward fill followed by a backwards fill.
|
| 403 |
-
"""
|
| 404 |
-
)
|
| 405 |
return
|
| 406 |
|
| 407 |
|
|
@@ -435,15 +409,13 @@ def _(dirty_weather, mo, pl, rain):
|
|
| 435 |
|
| 436 |
@app.cell(hide_code=True)
|
| 437 |
def _(mo):
|
| 438 |
-
mo.md(
|
| 439 |
-
r"""
|
| 440 |
Now that we finally have a clean dataset, let's play around with it a little.
|
| 441 |
|
| 442 |
### Example App
|
| 443 |
|
| 444 |
Let's display the amount of precipitation each station measured within a timeframe, aggregated to a lower granularity.
|
| 445 |
-
"""
|
| 446 |
-
)
|
| 447 |
return
|
| 448 |
|
| 449 |
|
|
@@ -534,13 +506,11 @@ def _(animation_data, pl, px):
|
|
| 534 |
|
| 535 |
@app.cell(hide_code=True)
|
| 536 |
def _(mo):
|
| 537 |
-
mo.md(
|
| 538 |
-
r"""
|
| 539 |
If we were missing some rows, we would have circles popping in and out of existence instead of a smooth animation!
|
| 540 |
|
| 541 |
In many scenarios, missing data can also lead to wrong results overall, for example if we were to estimate the total amount of rainfall during the observed period:
|
| 542 |
-
"""
|
| 543 |
-
)
|
| 544 |
return
|
| 545 |
|
| 546 |
|
|
@@ -556,20 +526,17 @@ def _(dirty_weather, mo, rain, weather):
|
|
| 556 |
|
| 557 |
@app.cell(hide_code=True)
|
| 558 |
def _(mo):
|
| 559 |
-
mo.md(
|
| 560 |
-
r"""
|
| 561 |
Which is still a relatively small difference, but every drop counts when you are dealing with the weather.
|
| 562 |
|
| 563 |
For datasets with a higher share of missing values, that difference can get much higher.
|
| 564 |
-
"""
|
| 565 |
-
)
|
| 566 |
return
|
| 567 |
|
| 568 |
|
| 569 |
@app.cell(hide_code=True)
|
| 570 |
def _(mo):
|
| 571 |
-
mo.md(
|
| 572 |
-
r"""
|
| 573 |
# Bonus Content
|
| 574 |
|
| 575 |
## Appendix A: Missing Time Zones
|
|
@@ -577,8 +544,7 @@ def _(mo):
|
|
| 577 |
The original dataset contained naive datetimes instead of timezone-aware, but we can infer whenever it refers to UTC time or local time (for this case, -03:00 UTC) based on the measurements.
|
| 578 |
|
| 579 |
For example, we can select one specific interval during which we know that rained a lot, or graph the average amount of precipitation for each hour of the day, then compare the data timestamps with a ground truth.
|
| 580 |
-
"""
|
| 581 |
-
)
|
| 582 |
return
|
| 583 |
|
| 584 |
|
|
@@ -635,13 +601,11 @@ def _(dirty_weather_naive, pl, rain, stations):
|
|
| 635 |
|
| 636 |
@app.cell(hide_code=True)
|
| 637 |
def _(mo):
|
| 638 |
-
mo.md(
|
| 639 |
-
r"""
|
| 640 |
By externally researching the expected distribution and looking up some of the extreme weather events, we can come to a conclusion about whenever it is aligned with the local time or with UTC.
|
| 641 |
|
| 642 |
In this case, the distribution matches the normal weather for this region and we can see that the hours with the most precipitation match those of historical events, so it is safe to say it is using local time (equivalent to the Americas/São Paulo time zone).
|
| 643 |
-
"""
|
| 644 |
-
)
|
| 645 |
return
|
| 646 |
|
| 647 |
|
|
@@ -655,8 +619,7 @@ def _(dirty_weather_naive, pl):
|
|
| 655 |
|
| 656 |
@app.cell(hide_code=True)
|
| 657 |
def _(mo):
|
| 658 |
-
mo.md(
|
| 659 |
-
r"""
|
| 660 |
## Appendix B: Not a Number
|
| 661 |
|
| 662 |
While some other tools without proper support for missing values may use `NaN` as a way to indicate a value is missing, in polars it is treated exclusively as a float value, much like `0.0`, `1.0` or `infinity`.
|
|
@@ -664,8 +627,7 @@ def _(mo):
|
|
| 664 |
You can use `.fill_null(float('nan'))` if you need to convert floats to a format such tools accept, or use `.fill_nan(None)` if you are importing data from them, assuming that there are no values which really are supposed to be the float NaN.
|
| 665 |
|
| 666 |
Remember that many calculations can result in NaN, for example dividing by zero:
|
| 667 |
-
"""
|
| 668 |
-
)
|
| 669 |
return
|
| 670 |
|
| 671 |
|
|
@@ -696,29 +658,25 @@ def _(day_perc, mo, perc_col):
|
|
| 696 |
|
| 697 |
@app.cell(hide_code=True)
|
| 698 |
def _(mo):
|
| 699 |
-
mo.md(
|
| 700 |
-
r"""
|
| 701 |
## Appendix C: Everything else
|
| 702 |
|
| 703 |
As long as this Notebook is, it cannot reasonably cover ***everything*** that may have to deal with missing values, as that is literally everything that may have to deal with data.
|
| 704 |
|
| 705 |
This section very briefly covers some other features not mentioned above
|
| 706 |
-
"""
|
| 707 |
-
)
|
| 708 |
return
|
| 709 |
|
| 710 |
|
| 711 |
@app.cell(hide_code=True)
|
| 712 |
def _(mo):
|
| 713 |
-
mo.md(
|
| 714 |
-
r"""
|
| 715 |
### Missing values in Aggregations
|
| 716 |
|
| 717 |
Many aggregations methods will ignore/skip missing values, while others take them into consideration.
|
| 718 |
|
| 719 |
Always check the documentation of the method you're using, much of the time docstrings will explain their behaviour.
|
| 720 |
-
"""
|
| 721 |
-
)
|
| 722 |
return
|
| 723 |
|
| 724 |
|
|
@@ -733,13 +691,11 @@ def _(df, pl):
|
|
| 733 |
|
| 734 |
@app.cell(hide_code=True)
|
| 735 |
def _(mo):
|
| 736 |
-
mo.md(
|
| 737 |
-
r"""
|
| 738 |
### Missing values in Joins
|
| 739 |
|
| 740 |
By default null values will never produce matches using [join](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html), but you can specify `nulls_equal=True` to join Null values with each other.
|
| 741 |
-
"""
|
| 742 |
-
)
|
| 743 |
return
|
| 744 |
|
| 745 |
|
|
@@ -772,13 +728,11 @@ def _(age_groups, df):
|
|
| 772 |
|
| 773 |
@app.cell(hide_code=True)
|
| 774 |
def _(mo):
|
| 775 |
-
mo.md(
|
| 776 |
-
r"""
|
| 777 |
## Utilities
|
| 778 |
|
| 779 |
Loading data and imports
|
| 780 |
-
"""
|
| 781 |
-
)
|
| 782 |
return
|
| 783 |
|
| 784 |
|
|
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
+
__generated_with = "0.18.4"
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
| 15 |
@app.cell(hide_code=True)
|
| 16 |
def _(mo):
|
| 17 |
+
mo.md(r"""
|
|
|
|
| 18 |
# Dealing with Missing Data
|
| 19 |
|
| 20 |
_by [etrotta](https://github.com/etrotta) and [Felix Najera](https://github.com/folicks)_
|
|
|
|
| 23 |
|
| 24 |
First we provide an overview of the methods available in polars, then we walk through a mini case study with real world data showing how to use it, and at last we provide some additional information in the 'Bonus Content' section.
|
| 25 |
You can navigate to skip around to each header using the menu on the right side
|
| 26 |
+
""")
|
|
|
|
| 27 |
return
|
| 28 |
|
| 29 |
|
| 30 |
@app.cell(hide_code=True)
|
| 31 |
def _(mo):
|
| 32 |
+
mo.md(r"""
|
|
|
|
| 33 |
## Methods for working with Nulls
|
| 34 |
|
| 35 |
We'll be using the following DataFrame to show the most important methods:
|
| 36 |
+
""")
|
|
|
|
| 37 |
return
|
| 38 |
|
| 39 |
|
|
|
|
| 55 |
|
| 56 |
@app.cell(hide_code=True)
|
| 57 |
def _(mo):
|
| 58 |
+
mo.md(r"""
|
|
|
|
| 59 |
### Counting nulls
|
| 60 |
|
| 61 |
A simple yet convenient aggregation
|
| 62 |
+
""")
|
|
|
|
| 63 |
return
|
| 64 |
|
| 65 |
|
|
|
|
| 71 |
|
| 72 |
@app.cell(hide_code=True)
|
| 73 |
def _(mo):
|
| 74 |
+
mo.md(r"""
|
|
|
|
| 75 |
### Dropping Nulls
|
| 76 |
|
| 77 |
The simplest way of dealing with null values is throwing them away, but that is not always a good idea.
|
| 78 |
+
""")
|
|
|
|
| 79 |
return
|
| 80 |
|
| 81 |
|
|
|
|
| 93 |
|
| 94 |
@app.cell(hide_code=True)
|
| 95 |
def _(mo):
|
| 96 |
+
mo.md(r"""
|
|
|
|
| 97 |
### Filtering null values
|
| 98 |
|
| 99 |
To filter in polars, you'll typically use `df.filter(expression)` or `df.remove(expression)` methods.
|
|
|
|
| 103 |
|
| 104 |
Remove will only remove rows in which the expression evaluates to True.
|
| 105 |
It will keep rows in which it evaluates to None.
|
| 106 |
+
""")
|
|
|
|
| 107 |
return
|
| 108 |
|
| 109 |
|
|
|
|
| 121 |
|
| 122 |
@app.cell(hide_code=True)
|
| 123 |
def _(mo):
|
| 124 |
+
mo.md(r"""
|
|
|
|
| 125 |
You may also be tempted to use `== None` or `!= None`, but operators in polars will generally propagate null values.
|
| 126 |
|
| 127 |
You can use `.eq_missing()` or `.ne_missing()` methods if you want to be strict about it, but there are also `.is_null()` and `.is_not_null()` methods you can use.
|
| 128 |
+
""")
|
|
|
|
| 129 |
return
|
| 130 |
|
| 131 |
|
|
|
|
| 144 |
|
| 145 |
@app.cell(hide_code=True)
|
| 146 |
def _(mo):
|
| 147 |
+
mo.md(r"""
|
|
|
|
| 148 |
### Filling Null values
|
| 149 |
|
| 150 |
You can also fill in the values with constants, calculations or by consulting external data sources.
|
|
|
|
| 152 |
Be careful not to treat estimated or guessed values as if they a ground truth however, otherwise you may end up making conclusions about a reality that does not exists.
|
| 153 |
|
| 154 |
As an exercise, let's guess some values to fill in nulls, then try giving names to the animals with `null` by editing the cells
|
| 155 |
+
""")
|
|
|
|
| 156 |
return
|
| 157 |
|
| 158 |
|
|
|
|
| 178 |
|
| 179 |
@app.cell(hide_code=True)
|
| 180 |
def _(mo):
|
| 181 |
+
mo.md(r"""
|
|
|
|
| 182 |
### TL;DR
|
| 183 |
|
| 184 |
Before we head into the mini case study, a brief review of what we have covered:
|
|
|
|
| 192 |
You can also refer to the polars [User Guide](https://docs.pola.rs/user-guide/expressions/missing-data/) more more information.
|
| 193 |
|
| 194 |
Whichever approach you take, remember to document how you handled it!
|
| 195 |
+
""")
|
|
|
|
| 196 |
return
|
| 197 |
|
| 198 |
|
| 199 |
@app.cell(hide_code=True)
|
| 200 |
def _(mo):
|
| 201 |
+
mo.md(r"""
|
|
|
|
| 202 |
# Mini Case Study
|
| 203 |
|
| 204 |
+
We will be using a dataset from `alertario` about the weather in Rio de Janeiro, originally available in Google Big Query under `datario.clima_pluviometro`. What you need to know about it:
|
| 205 |
|
| 206 |
- Contains multiple stations covering the Municipality of Rio de Janeiro
|
| 207 |
- Measures the precipitation as millimeters, with a granularity of 15 minutes
|
| 208 |
- We filtered to only include data about 2020, 2021 and 2022
|
| 209 |
+
""")
|
|
|
|
| 210 |
return
|
| 211 |
|
| 212 |
|
|
|
|
| 239 |
|
| 240 |
@app.cell(hide_code=True)
|
| 241 |
def _(mo):
|
| 242 |
+
mo.md(r"""
|
|
|
|
| 243 |
# Stations
|
| 244 |
|
| 245 |
First, let's take a look at some of the stations. Notice how
|
|
|
|
| 248 |
- There are some columns that do not even contain data at all!
|
| 249 |
|
| 250 |
We will remove the empty columns and remove rows without coordinates
|
| 251 |
+
""")
|
|
|
|
| 252 |
return
|
| 253 |
|
| 254 |
|
|
|
|
| 275 |
|
| 276 |
@app.cell(hide_code=True)
|
| 277 |
def _(mo):
|
| 278 |
+
mo.md(r"""
|
|
|
|
| 279 |
# Precipitation
|
| 280 |
Now, let's move on to the Precipitation data.
|
| 281 |
|
| 282 |
## Part 1 - Null Values
|
| 283 |
|
| 284 |
First of all, let's check for null values:
|
| 285 |
+
""")
|
|
|
|
| 286 |
return
|
| 287 |
|
| 288 |
|
|
|
|
| 306 |
|
| 307 |
@app.cell(hide_code=True)
|
| 308 |
def _(mo):
|
| 309 |
+
mo.md(r"""
|
|
|
|
| 310 |
### First option to fixing it: Dropping data.
|
| 311 |
|
| 312 |
We could just remove those rows like we did for the stations, which may be a passable solution for some problems, but is not always the best idea.
|
|
|
|
| 331 |
|
| 332 |
Let's investigate a bit more before deciding on following with either approach.
|
| 333 |
For example, is our current data even complete, or are we already missing some rows beyond those with null values?
|
| 334 |
+
""")
|
|
|
|
| 335 |
return
|
| 336 |
|
| 337 |
|
|
|
|
| 363 |
|
| 364 |
@app.cell(hide_code=True)
|
| 365 |
def _(mo):
|
| 366 |
+
mo.md(r"""
|
|
|
|
| 367 |
## Part 2 - Missing Rows
|
| 368 |
|
| 369 |
We can see that we expected there to be 1096 rows for each hour for each station (from the start of 2020 to the end of 2022) , but in reality we see between 1077 and 1096 rows.
|
|
|
|
| 375 |
Given that we are working with time series data, we will [upsample](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.upsample.html) the data, but you could also create a DataFrame containing all expected rows then use `join(how="...")`
|
| 376 |
|
| 377 |
However, that will give us _even more_ null values, so we will want to fill them in afterwards. For this case, we will just use a forward fill followed by a backwards fill.
|
| 378 |
+
""")
|
|
|
|
| 379 |
return
|
| 380 |
|
| 381 |
|
|
|
|
| 409 |
|
| 410 |
@app.cell(hide_code=True)
|
| 411 |
def _(mo):
|
| 412 |
+
mo.md(r"""
|
|
|
|
| 413 |
Now that we finally have a clean dataset, let's play around with it a little.
|
| 414 |
|
| 415 |
### Example App
|
| 416 |
|
| 417 |
Let's display the amount of precipitation each station measured within a timeframe, aggregated to a lower granularity.
|
| 418 |
+
""")
|
|
|
|
| 419 |
return
|
| 420 |
|
| 421 |
|
|
|
|
| 506 |
|
| 507 |
@app.cell(hide_code=True)
|
| 508 |
def _(mo):
|
| 509 |
+
mo.md(r"""
|
|
|
|
| 510 |
If we were missing some rows, we would have circles popping in and out of existence instead of a smooth animation!
|
| 511 |
|
| 512 |
In many scenarios, missing data can also lead to wrong results overall, for example if we were to estimate the total amount of rainfall during the observed period:
|
| 513 |
+
""")
|
|
|
|
| 514 |
return
|
| 515 |
|
| 516 |
|
|
|
|
| 526 |
|
| 527 |
@app.cell(hide_code=True)
|
| 528 |
def _(mo):
|
| 529 |
+
mo.md(r"""
|
|
|
|
| 530 |
Which is still a relatively small difference, but every drop counts when you are dealing with the weather.
|
| 531 |
|
| 532 |
For datasets with a higher share of missing values, that difference can get much higher.
|
| 533 |
+
""")
|
|
|
|
| 534 |
return
|
| 535 |
|
| 536 |
|
| 537 |
@app.cell(hide_code=True)
|
| 538 |
def _(mo):
|
| 539 |
+
mo.md(r"""
|
|
|
|
| 540 |
# Bonus Content
|
| 541 |
|
| 542 |
## Appendix A: Missing Time Zones
|
|
|
|
| 544 |
The original dataset contained naive datetimes instead of timezone-aware, but we can infer whenever it refers to UTC time or local time (for this case, -03:00 UTC) based on the measurements.
|
| 545 |
|
| 546 |
For example, we can select one specific interval during which we know that rained a lot, or graph the average amount of precipitation for each hour of the day, then compare the data timestamps with a ground truth.
|
| 547 |
+
""")
|
|
|
|
| 548 |
return
|
| 549 |
|
| 550 |
|
|
|
|
| 601 |
|
| 602 |
@app.cell(hide_code=True)
|
| 603 |
def _(mo):
|
| 604 |
+
mo.md(r"""
|
|
|
|
| 605 |
By externally researching the expected distribution and looking up some of the extreme weather events, we can come to a conclusion about whenever it is aligned with the local time or with UTC.
|
| 606 |
|
| 607 |
In this case, the distribution matches the normal weather for this region and we can see that the hours with the most precipitation match those of historical events, so it is safe to say it is using local time (equivalent to the Americas/São Paulo time zone).
|
| 608 |
+
""")
|
|
|
|
| 609 |
return
|
| 610 |
|
| 611 |
|
|
|
|
| 619 |
|
| 620 |
@app.cell(hide_code=True)
|
| 621 |
def _(mo):
|
| 622 |
+
mo.md(r"""
|
|
|
|
| 623 |
## Appendix B: Not a Number
|
| 624 |
|
| 625 |
While some other tools without proper support for missing values may use `NaN` as a way to indicate a value is missing, in polars it is treated exclusively as a float value, much like `0.0`, `1.0` or `infinity`.
|
|
|
|
| 627 |
You can use `.fill_null(float('nan'))` if you need to convert floats to a format such tools accept, or use `.fill_nan(None)` if you are importing data from them, assuming that there are no values which really are supposed to be the float NaN.
|
| 628 |
|
| 629 |
Remember that many calculations can result in NaN, for example dividing by zero:
|
| 630 |
+
""")
|
|
|
|
| 631 |
return
|
| 632 |
|
| 633 |
|
|
|
|
| 658 |
|
| 659 |
@app.cell(hide_code=True)
|
| 660 |
def _(mo):
|
| 661 |
+
mo.md(r"""
|
|
|
|
| 662 |
## Appendix C: Everything else
|
| 663 |
|
| 664 |
As long as this Notebook is, it cannot reasonably cover ***everything*** that may have to deal with missing values, as that is literally everything that may have to deal with data.
|
| 665 |
|
| 666 |
This section very briefly covers some other features not mentioned above
|
| 667 |
+
""")
|
|
|
|
| 668 |
return
|
| 669 |
|
| 670 |
|
| 671 |
@app.cell(hide_code=True)
|
| 672 |
def _(mo):
|
| 673 |
+
mo.md(r"""
|
|
|
|
| 674 |
### Missing values in Aggregations
|
| 675 |
|
| 676 |
Many aggregations methods will ignore/skip missing values, while others take them into consideration.
|
| 677 |
|
| 678 |
Always check the documentation of the method you're using, much of the time docstrings will explain their behaviour.
|
| 679 |
+
""")
|
|
|
|
| 680 |
return
|
| 681 |
|
| 682 |
|
|
|
|
| 691 |
|
| 692 |
@app.cell(hide_code=True)
|
| 693 |
def _(mo):
|
| 694 |
+
mo.md(r"""
|
|
|
|
| 695 |
### Missing values in Joins
|
| 696 |
|
| 697 |
By default null values will never produce matches using [join](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html), but you can specify `nulls_equal=True` to join Null values with each other.
|
| 698 |
+
""")
|
|
|
|
| 699 |
return
|
| 700 |
|
| 701 |
|
|
|
|
| 728 |
|
| 729 |
@app.cell(hide_code=True)
|
| 730 |
def _(mo):
|
| 731 |
+
mo.md(r"""
|
|
|
|
| 732 |
## Utilities
|
| 733 |
|
| 734 |
Loading data and imports
|
| 735 |
+
""")
|
|
|
|
| 736 |
return
|
| 737 |
|
| 738 |
|
polars/12_aggregations.py
CHANGED
|
@@ -8,7 +8,7 @@
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
-
__generated_with = "0.
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
|
@@ -20,14 +20,12 @@ def _():
|
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
-
mo.md(
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
_By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
)
|
| 31 |
return
|
| 32 |
|
| 33 |
|
|
@@ -44,13 +42,11 @@ def _():
|
|
| 44 |
|
| 45 |
@app.cell(hide_code=True)
|
| 46 |
def _(mo):
|
| 47 |
-
mo.md(
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
"""
|
| 53 |
-
)
|
| 54 |
return
|
| 55 |
|
| 56 |
|
|
@@ -65,13 +61,11 @@ def _(df, pl):
|
|
| 65 |
|
| 66 |
@app.cell(hide_code=True)
|
| 67 |
def _(mo):
|
| 68 |
-
mo.md(
|
| 69 |
-
|
| 70 |
-
It looks like we sold more sweaters. Maybe this was a winter season.
|
| 71 |
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
)
|
| 75 |
return
|
| 76 |
|
| 77 |
|
|
@@ -87,7 +81,9 @@ def _(df, pl):
|
|
| 87 |
|
| 88 |
@app.cell(hide_code=True)
|
| 89 |
def _(mo):
|
| 90 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 91 |
return
|
| 92 |
|
| 93 |
|
|
@@ -102,7 +98,9 @@ def _(df, pl):
|
|
| 102 |
|
| 103 |
@app.cell(hide_code=True)
|
| 104 |
def _(mo):
|
| 105 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 106 |
return
|
| 107 |
|
| 108 |
|
|
@@ -118,12 +116,10 @@ def _(df, pl):
|
|
| 118 |
|
| 119 |
@app.cell(hide_code=True)
|
| 120 |
def _(mo):
|
| 121 |
-
mo.md(
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
"""
|
| 126 |
-
)
|
| 127 |
return
|
| 128 |
|
| 129 |
|
|
@@ -138,13 +134,11 @@ def _(df, pl):
|
|
| 138 |
|
| 139 |
@app.cell(hide_code=True)
|
| 140 |
def _(mo):
|
| 141 |
-
mo.md(
|
| 142 |
-
|
| 143 |
-
Aggregations when grouping data are not limited to sums. You can also use functions like [`max`, `min`, `median`, `first`, and `last`](https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations).
|
| 144 |
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
)
|
| 148 |
return
|
| 149 |
|
| 150 |
|
|
@@ -159,13 +153,11 @@ def _(df, pl):
|
|
| 159 |
|
| 160 |
@app.cell(hide_code=True)
|
| 161 |
def _(mo):
|
| 162 |
-
mo.md(
|
| 163 |
-
|
| 164 |
-
Let's make the aggregation more interesting. We'll identify the first customer to purchase each item, along with the quantity they bought and the amount they spent.
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
)
|
| 169 |
return
|
| 170 |
|
| 171 |
|
|
@@ -181,14 +173,12 @@ def _(df, pl):
|
|
| 181 |
|
| 182 |
@app.cell(hide_code=True)
|
| 183 |
def _(mo):
|
| 184 |
-
mo.md(
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
Since `datetime` is a special data type in Polars, we can perform various group-by aggregations on it.
|
| 188 |
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
)
|
| 192 |
return
|
| 193 |
|
| 194 |
|
|
@@ -204,13 +194,11 @@ def _(df, pl):
|
|
| 204 |
|
| 205 |
@app.cell(hide_code=True)
|
| 206 |
def _(mo):
|
| 207 |
-
mo.md(
|
| 208 |
-
|
| 209 |
-
We had more sales in 2014.
|
| 210 |
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
)
|
| 214 |
return
|
| 215 |
|
| 216 |
|
|
@@ -226,13 +214,11 @@ def _(df, pl):
|
|
| 226 |
|
| 227 |
@app.cell(hide_code=True)
|
| 228 |
def _(mo):
|
| 229 |
-
mo.md(
|
| 230 |
-
|
| 231 |
-
The beauty of grouping with time is that it allows us to resample the data by selecting whatever time interval we want.
|
| 232 |
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
)
|
| 236 |
return
|
| 237 |
|
| 238 |
|
|
@@ -249,13 +235,11 @@ def _(df, pl):
|
|
| 249 |
|
| 250 |
@app.cell(hide_code=True)
|
| 251 |
def _(mo):
|
| 252 |
-
mo.md(
|
| 253 |
-
|
| 254 |
-
Here's an interesting question we can answer that takes advantage of grouping by time.
|
| 255 |
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
)
|
| 259 |
return
|
| 260 |
|
| 261 |
|
|
@@ -272,7 +256,9 @@ def _(df, pl):
|
|
| 272 |
|
| 273 |
@app.cell(hide_code=True)
|
| 274 |
def _(mo):
|
| 275 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 276 |
return
|
| 277 |
|
| 278 |
|
|
@@ -290,7 +276,9 @@ def _(df, pl):
|
|
| 290 |
|
| 291 |
@app.cell(hide_code=True)
|
| 292 |
def _(mo):
|
| 293 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 294 |
return
|
| 295 |
|
| 296 |
|
|
@@ -308,15 +296,13 @@ def _(df, pl):
|
|
| 308 |
|
| 309 |
@app.cell(hide_code=True)
|
| 310 |
def _(mo):
|
| 311 |
-
mo.md(
|
| 312 |
-
|
| 313 |
-
## Grouping with over
|
| 314 |
|
| 315 |
-
|
| 316 |
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
)
|
| 320 |
return
|
| 321 |
|
| 322 |
|
|
@@ -330,7 +316,9 @@ def _(df, pl):
|
|
| 330 |
|
| 331 |
@app.cell(hide_code=True)
|
| 332 |
def _(mo):
|
| 333 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 334 |
return
|
| 335 |
|
| 336 |
|
|
@@ -347,7 +335,9 @@ def _(df, pl):
|
|
| 347 |
|
| 348 |
@app.cell(hide_code=True)
|
| 349 |
def _(mo):
|
| 350 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 351 |
return
|
| 352 |
|
| 353 |
|
|
|
|
| 8 |
|
| 9 |
import marimo
|
| 10 |
|
| 11 |
+
__generated_with = "0.18.4"
|
| 12 |
app = marimo.App(width="medium")
|
| 13 |
|
| 14 |
|
|
|
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
+
mo.md(r"""
|
| 24 |
+
# Aggregations
|
| 25 |
+
_By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
|
|
|
|
| 26 |
|
| 27 |
+
In this notebook, you'll learn how to perform different types of aggregations in Polars, including grouping by categories and time. We'll analyze sales data from a clothing store, focusing on three product categories: hats, socks, and sweaters.
|
| 28 |
+
""")
|
|
|
|
| 29 |
return
|
| 30 |
|
| 31 |
|
|
|
|
| 42 |
|
| 43 |
@app.cell(hide_code=True)
|
| 44 |
def _(mo):
|
| 45 |
+
mo.md(r"""
|
| 46 |
+
## Grouping by category
|
| 47 |
+
### With single category
|
| 48 |
+
Let's find out how many of each product category we sold.
|
| 49 |
+
""")
|
|
|
|
|
|
|
| 50 |
return
|
| 51 |
|
| 52 |
|
|
|
|
| 61 |
|
| 62 |
@app.cell(hide_code=True)
|
| 63 |
def _(mo):
|
| 64 |
+
mo.md(r"""
|
| 65 |
+
It looks like we sold more sweaters. Maybe this was a winter season.
|
|
|
|
| 66 |
|
| 67 |
+
Let's add another aggregate to see how much was spent on the total units for each product.
|
| 68 |
+
""")
|
|
|
|
| 69 |
return
|
| 70 |
|
| 71 |
|
|
|
|
| 81 |
|
| 82 |
@app.cell(hide_code=True)
|
| 83 |
def _(mo):
|
| 84 |
+
mo.md(r"""
|
| 85 |
+
We could also write aggregate code for the two columns as a single line.
|
| 86 |
+
""")
|
| 87 |
return
|
| 88 |
|
| 89 |
|
|
|
|
| 98 |
|
| 99 |
@app.cell(hide_code=True)
|
| 100 |
def _(mo):
|
| 101 |
+
mo.md(r"""
|
| 102 |
+
Actually, the way we've been writing the aggregate lines is syntactic sugar. Here's a longer way of doing it as shown in the [Polars documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html).
|
| 103 |
+
""")
|
| 104 |
return
|
| 105 |
|
| 106 |
|
|
|
|
| 116 |
|
| 117 |
@app.cell(hide_code=True)
|
| 118 |
def _(mo):
|
| 119 |
+
mo.md(r"""
|
| 120 |
+
### With multiple categories
|
| 121 |
+
We can also group by multiple categories. Let's find out how many items we sold in each product category for each SKU. This more detailed aggregation will produce more rows than the previous DataFrame.
|
| 122 |
+
""")
|
|
|
|
|
|
|
| 123 |
return
|
| 124 |
|
| 125 |
|
|
|
|
| 134 |
|
| 135 |
@app.cell(hide_code=True)
|
| 136 |
def _(mo):
|
| 137 |
+
mo.md(r"""
|
| 138 |
+
Aggregations when grouping data are not limited to sums. You can also use functions like [`max`, `min`, `median`, `first`, and `last`](https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations).
|
|
|
|
| 139 |
|
| 140 |
+
Let's find the largest sale quantity for each product category.
|
| 141 |
+
""")
|
|
|
|
| 142 |
return
|
| 143 |
|
| 144 |
|
|
|
|
| 153 |
|
| 154 |
@app.cell(hide_code=True)
|
| 155 |
def _(mo):
|
| 156 |
+
mo.md(r"""
|
| 157 |
+
Let's make the aggregation more interesting. We'll identify the first customer to purchase each item, along with the quantity they bought and the amount they spent.
|
|
|
|
| 158 |
|
| 159 |
+
**Note:** To make this work, we'll have to sort the date from earliest to latest.
|
| 160 |
+
""")
|
|
|
|
| 161 |
return
|
| 162 |
|
| 163 |
|
|
|
|
| 173 |
|
| 174 |
@app.cell(hide_code=True)
|
| 175 |
def _(mo):
|
| 176 |
+
mo.md(r"""
|
| 177 |
+
## Grouping by time
|
| 178 |
+
Since `datetime` is a special data type in Polars, we can perform various group-by aggregations on it.
|
|
|
|
| 179 |
|
| 180 |
+
Our dataset spans a two-year period. Let's calculate the total dollar sales for each year. We'll do it the naive way first so you can appreciate grouping with time.
|
| 181 |
+
""")
|
|
|
|
| 182 |
return
|
| 183 |
|
| 184 |
|
|
|
|
| 194 |
|
| 195 |
@app.cell(hide_code=True)
|
| 196 |
def _(mo):
|
| 197 |
+
mo.md(r"""
|
| 198 |
+
We had more sales in 2014.
|
|
|
|
| 199 |
|
| 200 |
+
Now let's perform the above operation by grouping with time. This requires sorting the dataframe first.
|
| 201 |
+
""")
|
|
|
|
| 202 |
return
|
| 203 |
|
| 204 |
|
|
|
|
| 214 |
|
| 215 |
@app.cell(hide_code=True)
|
| 216 |
def _(mo):
|
| 217 |
+
mo.md(r"""
|
| 218 |
+
The beauty of grouping with time is that it allows us to resample the data by selecting whatever time interval we want.
|
|
|
|
| 219 |
|
| 220 |
+
Let's find out what the quarterly sales were for 2014
|
| 221 |
+
""")
|
|
|
|
| 222 |
return
|
| 223 |
|
| 224 |
|
|
|
|
| 235 |
|
| 236 |
@app.cell(hide_code=True)
|
| 237 |
def _(mo):
|
| 238 |
+
mo.md(r"""
|
| 239 |
+
Here's an interesting question we can answer that takes advantage of grouping by time.
|
|
|
|
| 240 |
|
| 241 |
+
Let's find the hour of the day where we had the most sales in dollars.
|
| 242 |
+
""")
|
|
|
|
| 243 |
return
|
| 244 |
|
| 245 |
|
|
|
|
| 256 |
|
| 257 |
@app.cell(hide_code=True)
|
| 258 |
def _(mo):
|
| 259 |
+
mo.md(r"""
|
| 260 |
+
Just for fun, let's find the median number of items sold in each SKU and the total dollar amount in each SKU every six days.
|
| 261 |
+
""")
|
| 262 |
return
|
| 263 |
|
| 264 |
|
|
|
|
| 276 |
|
| 277 |
@app.cell(hide_code=True)
|
| 278 |
def _(mo):
|
| 279 |
+
mo.md(r"""
|
| 280 |
+
Let's rename the columns to clearly indicate the type of aggregation performed. This will help us identify the aggregation method used on a column without needing to check the code.
|
| 281 |
+
""")
|
| 282 |
return
|
| 283 |
|
| 284 |
|
|
|
|
| 296 |
|
| 297 |
@app.cell(hide_code=True)
|
| 298 |
def _(mo):
|
| 299 |
+
mo.md(r"""
|
| 300 |
+
## Grouping with over
|
|
|
|
| 301 |
|
| 302 |
+
Sometimes, we may want to perform an aggregation but also keep all the columns and rows of the dataframe.
|
| 303 |
|
| 304 |
+
Let's assign a value to indicate the number of times each customer visited and bought something.
|
| 305 |
+
""")
|
|
|
|
| 306 |
return
|
| 307 |
|
| 308 |
|
|
|
|
| 316 |
|
| 317 |
@app.cell(hide_code=True)
|
| 318 |
def _(mo):
|
| 319 |
+
mo.md(r"""
|
| 320 |
+
Finally, let's determine which customers visited the store the most and bought something.
|
| 321 |
+
""")
|
| 322 |
return
|
| 323 |
|
| 324 |
|
|
|
|
| 335 |
|
| 336 |
@app.cell(hide_code=True)
|
| 337 |
def _(mo):
|
| 338 |
+
mo.md(r"""
|
| 339 |
+
There's more you can do with aggregations in Polars, such as [sorting with aggregations](https://docs.pola.rs/user-guide/expressions/aggregation/#sorting). We hope this notebook has armed you with the tools to get started.
|
| 340 |
+
""")
|
| 341 |
return
|
| 342 |
|
| 343 |
|
polars/13_window_functions.py
CHANGED
|
@@ -11,14 +11,13 @@
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
-
__generated_with = "0.
|
| 15 |
app = marimo.App(width="medium", app_title="Window Functions")
|
| 16 |
|
| 17 |
|
| 18 |
@app.cell(hide_code=True)
|
| 19 |
def _(mo):
|
| 20 |
-
mo.md(
|
| 21 |
-
r"""
|
| 22 |
# Window Functions
|
| 23 |
_By [Henry Harbeck](https://github.com/henryharbeck)._
|
| 24 |
|
|
@@ -26,8 +25,7 @@ def _(mo):
|
|
| 26 |
You'll work with partitions, ordering and Polars' available "mapping strategies".
|
| 27 |
|
| 28 |
We'll use a dataset with a few days of paid and organic digital revenue data.
|
| 29 |
-
"""
|
| 30 |
-
)
|
| 31 |
return
|
| 32 |
|
| 33 |
|
|
@@ -53,8 +51,7 @@ def _():
|
|
| 53 |
|
| 54 |
@app.cell(hide_code=True)
|
| 55 |
def _(mo):
|
| 56 |
-
mo.md(
|
| 57 |
-
r"""
|
| 58 |
## What is a window function?
|
| 59 |
|
| 60 |
A window function performs a calculation across a set of rows that are related to the current row.
|
|
@@ -64,32 +61,27 @@ def _(mo):
|
|
| 64 |
|
| 65 |
Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
|
| 66 |
method on an expression.
|
| 67 |
-
"""
|
| 68 |
-
)
|
| 69 |
return
|
| 70 |
|
| 71 |
|
| 72 |
@app.cell(hide_code=True)
|
| 73 |
def _(mo):
|
| 74 |
-
mo.md(
|
| 75 |
-
r"""
|
| 76 |
## Partitions
|
| 77 |
Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to
|
| 78 |
which the function will be applied.
|
| 79 |
-
"""
|
| 80 |
-
)
|
| 81 |
return
|
| 82 |
|
| 83 |
|
| 84 |
@app.cell(hide_code=True)
|
| 85 |
def _(mo):
|
| 86 |
-
mo.md(
|
| 87 |
-
r"""
|
| 88 |
### Partitioning by a single column
|
| 89 |
|
| 90 |
Let's get the total revenue per date...
|
| 91 |
-
"""
|
| 92 |
-
)
|
| 93 |
return
|
| 94 |
|
| 95 |
|
|
@@ -103,7 +95,9 @@ def _(df, pl):
|
|
| 103 |
|
| 104 |
@app.cell(hide_code=True)
|
| 105 |
def _(mo):
|
| 106 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 107 |
return
|
| 108 |
|
| 109 |
|
|
@@ -115,12 +109,10 @@ def _(daily_revenue, df, pl):
|
|
| 115 |
|
| 116 |
@app.cell(hide_code=True)
|
| 117 |
def _(mo):
|
| 118 |
-
mo.md(
|
| 119 |
-
r"""
|
| 120 |
Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change,
|
| 121 |
all partitioned (split) by channel.
|
| 122 |
-
"""
|
| 123 |
-
)
|
| 124 |
return
|
| 125 |
|
| 126 |
|
|
@@ -137,28 +129,24 @@ def _(df, pl):
|
|
| 137 |
|
| 138 |
@app.cell(hide_code=True)
|
| 139 |
def _(mo):
|
| 140 |
-
mo.md(
|
| 141 |
-
r"""
|
| 142 |
Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition
|
| 143 |
(group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but
|
| 144 |
still only consider rows within their partition.
|
| 145 |
-
"""
|
| 146 |
-
)
|
| 147 |
return
|
| 148 |
|
| 149 |
|
| 150 |
@app.cell(hide_code=True)
|
| 151 |
def _(mo):
|
| 152 |
-
mo.md(
|
| 153 |
-
r"""
|
| 154 |
### Partitioning by multiple columns
|
| 155 |
|
| 156 |
We can also partition by multiple columns.
|
| 157 |
|
| 158 |
Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
|
| 159 |
the channel.
|
| 160 |
-
"""
|
| 161 |
-
)
|
| 162 |
return
|
| 163 |
|
| 164 |
|
|
@@ -176,15 +164,13 @@ def _(df, pl):
|
|
| 176 |
|
| 177 |
@app.cell(hide_code=True)
|
| 178 |
def _(mo):
|
| 179 |
-
mo.md(
|
| 180 |
-
r"""
|
| 181 |
### Partitioning by expressions
|
| 182 |
|
| 183 |
Polars also lets you partition by expressions without needing to create them as columns first.
|
| 184 |
|
| 185 |
So, we could re-write the previous window function as...
|
| 186 |
-
"""
|
| 187 |
-
)
|
| 188 |
return
|
| 189 |
|
| 190 |
|
|
@@ -200,20 +186,17 @@ def _(df, pl):
|
|
| 200 |
|
| 201 |
@app.cell(hide_code=True)
|
| 202 |
def _(mo):
|
| 203 |
-
mo.md(
|
| 204 |
-
r"""
|
| 205 |
Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
|
| 206 |
so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
|
| 207 |
and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw).
|
| 208 |
-
"""
|
| 209 |
-
)
|
| 210 |
return
|
| 211 |
|
| 212 |
|
| 213 |
@app.cell(hide_code=True)
|
| 214 |
def _(mo):
|
| 215 |
-
mo.md(
|
| 216 |
-
r"""
|
| 217 |
## Ordering
|
| 218 |
|
| 219 |
The `order_by` parameter controls how to order the data within the window. The function is applied to the data in this
|
|
@@ -221,21 +204,18 @@ def _(mo):
|
|
| 221 |
|
| 222 |
Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the
|
| 223 |
DataFrame. There can be times when we would like the order of the calculation and the order of the output itself to differ.
|
| 224 |
-
"""
|
| 225 |
-
)
|
| 226 |
return
|
| 227 |
|
| 228 |
|
| 229 |
@app.cell(hide_code=True)
|
| 230 |
def _(mo):
|
| 231 |
-
mo.md(
|
| 232 |
-
"""
|
| 233 |
### Ordering in a window function
|
| 234 |
|
| 235 |
Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both
|
| 236 |
ordered by date and partitioned by channel...
|
| 237 |
-
"""
|
| 238 |
-
)
|
| 239 |
return
|
| 240 |
|
| 241 |
|
|
@@ -261,21 +241,19 @@ def _(df, pl):
|
|
| 261 |
|
| 262 |
@app.cell(hide_code=True)
|
| 263 |
def _(mo):
|
| 264 |
-
mo.md(
|
| 265 |
-
r"""
|
| 266 |
### Note about window function ordering compared to SQL
|
| 267 |
|
| 268 |
It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in
|
| 269 |
equivalent functions in Polars.
|
| 270 |
|
| 271 |
For example, an SQL `RANK()` expression like...
|
| 272 |
-
"""
|
| 273 |
-
)
|
| 274 |
return
|
| 275 |
|
| 276 |
|
| 277 |
@app.cell
|
| 278 |
-
def _(
|
| 279 |
_df = mo.sql(
|
| 280 |
f"""
|
| 281 |
SELECT
|
|
@@ -293,12 +271,10 @@ def _(df, mo):
|
|
| 293 |
|
| 294 |
@app.cell(hide_code=True)
|
| 295 |
def _(mo):
|
| 296 |
-
mo.md(
|
| 297 |
-
r"""
|
| 298 |
...does not require an `order_by` in Polars as the column and the function are already bound (including with the
|
| 299 |
`descending=True` argument).
|
| 300 |
-
"""
|
| 301 |
-
)
|
| 302 |
return
|
| 303 |
|
| 304 |
|
|
@@ -315,13 +291,11 @@ def _(df, pl):
|
|
| 315 |
|
| 316 |
@app.cell(hide_code=True)
|
| 317 |
def _(mo):
|
| 318 |
-
mo.md(
|
| 319 |
-
r"""
|
| 320 |
### Descending order
|
| 321 |
|
| 322 |
We can also order in descending order by passing `descending=True`...
|
| 323 |
-
"""
|
| 324 |
-
)
|
| 325 |
return
|
| 326 |
|
| 327 |
|
|
@@ -348,29 +322,25 @@ def _(df_sorted, pl):
|
|
| 348 |
|
| 349 |
@app.cell(hide_code=True)
|
| 350 |
def _(mo):
|
| 351 |
-
mo.md(
|
| 352 |
-
"""
|
| 353 |
## Mapping Strategies
|
| 354 |
|
| 355 |
Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame.
|
| 356 |
|
| 357 |
Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping
|
| 358 |
strategies, we will explore other possibilities.
|
| 359 |
-
"""
|
| 360 |
-
)
|
| 361 |
return
|
| 362 |
|
| 363 |
|
| 364 |
@app.cell(hide_code=True)
|
| 365 |
def _(mo):
|
| 366 |
-
mo.md(
|
| 367 |
-
"""
|
| 368 |
### Group to rows
|
| 369 |
|
| 370 |
"group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the
|
| 371 |
window.
|
| 372 |
-
"""
|
| 373 |
-
)
|
| 374 |
return
|
| 375 |
|
| 376 |
|
|
@@ -384,13 +354,11 @@ def _(df, pl):
|
|
| 384 |
|
| 385 |
@app.cell(hide_code=True)
|
| 386 |
def _(mo):
|
| 387 |
-
mo.md(
|
| 388 |
-
"""
|
| 389 |
### Join
|
| 390 |
|
| 391 |
The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group.
|
| 392 |
-
"""
|
| 393 |
-
)
|
| 394 |
return
|
| 395 |
|
| 396 |
|
|
@@ -404,8 +372,7 @@ def _(df, pl):
|
|
| 404 |
|
| 405 |
@app.cell(hide_code=True)
|
| 406 |
def _(mo):
|
| 407 |
-
mo.md(
|
| 408 |
-
r"""
|
| 409 |
### Explode
|
| 410 |
|
| 411 |
The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of
|
|
@@ -413,8 +380,7 @@ def _(mo):
|
|
| 413 |
It should also only be used in a `select` context and not `with_columns`.
|
| 414 |
|
| 415 |
The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`.
|
| 416 |
-
"""
|
| 417 |
-
)
|
| 418 |
return
|
| 419 |
|
| 420 |
|
|
@@ -431,26 +397,28 @@ def _(df, pl):
|
|
| 431 |
|
| 432 |
@app.cell(hide_code=True)
|
| 433 |
def _(mo):
|
| 434 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 435 |
return
|
| 436 |
|
| 437 |
|
| 438 |
@app.cell(hide_code=True)
|
| 439 |
def _(mo):
|
| 440 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 441 |
return
|
| 442 |
|
| 443 |
|
| 444 |
@app.cell(hide_code=True)
|
| 445 |
def _(mo):
|
| 446 |
-
mo.md(
|
| 447 |
-
r"""
|
| 448 |
### Reusing a window
|
| 449 |
|
| 450 |
In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions
|
| 451 |
without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`.
|
| 452 |
-
"""
|
| 453 |
-
)
|
| 454 |
return
|
| 455 |
|
| 456 |
|
|
@@ -472,8 +440,7 @@ def _(df_sorted, pl):
|
|
| 472 |
|
| 473 |
@app.cell(hide_code=True)
|
| 474 |
def _(mo):
|
| 475 |
-
mo.md(
|
| 476 |
-
r"""
|
| 477 |
### Rolling Windows
|
| 478 |
|
| 479 |
Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the rolling calculation
|
|
@@ -481,8 +448,7 @@ def _(mo):
|
|
| 481 |
|
| 482 |
Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row
|
| 483 |
max revenue split by channel...
|
| 484 |
-
"""
|
| 485 |
-
)
|
| 486 |
return
|
| 487 |
|
| 488 |
|
|
@@ -503,27 +469,29 @@ def _(date, df, pl):
|
|
| 503 |
|
| 504 |
@app.cell(hide_code=True)
|
| 505 |
def _(mo):
|
| 506 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 507 |
return
|
| 508 |
|
| 509 |
|
| 510 |
@app.cell(hide_code=True)
|
| 511 |
def _(mo):
|
| 512 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 513 |
return
|
| 514 |
|
| 515 |
|
| 516 |
@app.cell(hide_code=True)
|
| 517 |
def _(mo):
|
| 518 |
-
mo.md(
|
| 519 |
-
r"""
|
| 520 |
## Additional References
|
| 521 |
|
| 522 |
- [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
|
| 523 |
- [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
|
| 524 |
- [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
|
| 525 |
-
"""
|
| 526 |
-
)
|
| 527 |
return
|
| 528 |
|
| 529 |
|
|
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
+
__generated_with = "0.18.4"
|
| 15 |
app = marimo.App(width="medium", app_title="Window Functions")
|
| 16 |
|
| 17 |
|
| 18 |
@app.cell(hide_code=True)
|
| 19 |
def _(mo):
|
| 20 |
+
mo.md(r"""
|
|
|
|
| 21 |
# Window Functions
|
| 22 |
_By [Henry Harbeck](https://github.com/henryharbeck)._
|
| 23 |
|
|
|
|
| 25 |
You'll work with partitions, ordering and Polars' available "mapping strategies".
|
| 26 |
|
| 27 |
We'll use a dataset with a few days of paid and organic digital revenue data.
|
| 28 |
+
""")
|
|
|
|
| 29 |
return
|
| 30 |
|
| 31 |
|
|
|
|
| 51 |
|
| 52 |
@app.cell(hide_code=True)
|
| 53 |
def _(mo):
|
| 54 |
+
mo.md(r"""
|
|
|
|
| 55 |
## What is a window function?
|
| 56 |
|
| 57 |
A window function performs a calculation across a set of rows that are related to the current row.
|
|
|
|
| 61 |
|
| 62 |
Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
|
| 63 |
method on an expression.
|
| 64 |
+
""")
|
|
|
|
| 65 |
return
|
| 66 |
|
| 67 |
|
| 68 |
@app.cell(hide_code=True)
|
| 69 |
def _(mo):
|
| 70 |
+
mo.md(r"""
|
|
|
|
| 71 |
## Partitions
|
| 72 |
Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to
|
| 73 |
which the function will be applied.
|
| 74 |
+
""")
|
|
|
|
| 75 |
return
|
| 76 |
|
| 77 |
|
| 78 |
@app.cell(hide_code=True)
|
| 79 |
def _(mo):
|
| 80 |
+
mo.md(r"""
|
|
|
|
| 81 |
### Partitioning by a single column
|
| 82 |
|
| 83 |
Let's get the total revenue per date...
|
| 84 |
+
""")
|
|
|
|
| 85 |
return
|
| 86 |
|
| 87 |
|
|
|
|
| 95 |
|
| 96 |
@app.cell(hide_code=True)
|
| 97 |
def _(mo):
|
| 98 |
+
mo.md(r"""
|
| 99 |
+
And then see what percentage of the daily total was Paid and what percentage was Organic.
|
| 100 |
+
""")
|
| 101 |
return
|
| 102 |
|
| 103 |
|
|
|
|
| 109 |
|
| 110 |
@app.cell(hide_code=True)
|
| 111 |
def _(mo):
|
| 112 |
+
mo.md(r"""
|
|
|
|
| 113 |
Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change,
|
| 114 |
all partitioned (split) by channel.
|
| 115 |
+
""")
|
|
|
|
| 116 |
return
|
| 117 |
|
| 118 |
|
|
|
|
| 129 |
|
| 130 |
@app.cell(hide_code=True)
|
| 131 |
def _(mo):
|
| 132 |
+
mo.md(r"""
|
|
|
|
| 133 |
Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition
|
| 134 |
(group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but
|
| 135 |
still only consider rows within their partition.
|
| 136 |
+
""")
|
|
|
|
| 137 |
return
|
| 138 |
|
| 139 |
|
| 140 |
@app.cell(hide_code=True)
|
| 141 |
def _(mo):
|
| 142 |
+
mo.md(r"""
|
|
|
|
| 143 |
### Partitioning by multiple columns
|
| 144 |
|
| 145 |
We can also partition by multiple columns.
|
| 146 |
|
| 147 |
Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
|
| 148 |
the channel.
|
| 149 |
+
""")
|
|
|
|
| 150 |
return
|
| 151 |
|
| 152 |
|
|
|
|
| 164 |
|
| 165 |
@app.cell(hide_code=True)
|
| 166 |
def _(mo):
|
| 167 |
+
mo.md(r"""
|
|
|
|
| 168 |
### Partitioning by expressions
|
| 169 |
|
| 170 |
Polars also lets you partition by expressions without needing to create them as columns first.
|
| 171 |
|
| 172 |
So, we could re-write the previous window function as...
|
| 173 |
+
""")
|
|
|
|
| 174 |
return
|
| 175 |
|
| 176 |
|
|
|
|
| 186 |
|
| 187 |
@app.cell(hide_code=True)
|
| 188 |
def _(mo):
|
| 189 |
+
mo.md(r"""
|
|
|
|
| 190 |
Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
|
| 191 |
so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
|
| 192 |
and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw).
|
| 193 |
+
""")
|
|
|
|
| 194 |
return
|
| 195 |
|
| 196 |
|
| 197 |
@app.cell(hide_code=True)
|
| 198 |
def _(mo):
|
| 199 |
+
mo.md(r"""
|
|
|
|
| 200 |
## Ordering
|
| 201 |
|
| 202 |
The `order_by` parameter controls how to order the data within the window. The function is applied to the data in this
|
|
|
|
| 204 |
|
| 205 |
Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the
|
| 206 |
DataFrame. There can be times when we would like the order of the calculation and the order of the output itself to differ.
|
| 207 |
+
""")
|
|
|
|
| 208 |
return
|
| 209 |
|
| 210 |
|
| 211 |
@app.cell(hide_code=True)
|
| 212 |
def _(mo):
|
| 213 |
+
mo.md("""
|
|
|
|
| 214 |
### Ordering in a window function
|
| 215 |
|
| 216 |
Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both
|
| 217 |
ordered by date and partitioned by channel...
|
| 218 |
+
""")
|
|
|
|
| 219 |
return
|
| 220 |
|
| 221 |
|
|
|
|
| 241 |
|
| 242 |
@app.cell(hide_code=True)
|
| 243 |
def _(mo):
|
| 244 |
+
mo.md(r"""
|
|
|
|
| 245 |
### Note about window function ordering compared to SQL
|
| 246 |
|
| 247 |
It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in
|
| 248 |
equivalent functions in Polars.
|
| 249 |
|
| 250 |
For example, an SQL `RANK()` expression like...
|
| 251 |
+
""")
|
|
|
|
| 252 |
return
|
| 253 |
|
| 254 |
|
| 255 |
@app.cell
|
| 256 |
+
def _(mo):
|
| 257 |
_df = mo.sql(
|
| 258 |
f"""
|
| 259 |
SELECT
|
|
|
|
| 271 |
|
| 272 |
@app.cell(hide_code=True)
|
| 273 |
def _(mo):
|
| 274 |
+
mo.md(r"""
|
|
|
|
| 275 |
...does not require an `order_by` in Polars as the column and the function are already bound (including with the
|
| 276 |
`descending=True` argument).
|
| 277 |
+
""")
|
|
|
|
| 278 |
return
|
| 279 |
|
| 280 |
|
|
|
|
| 291 |
|
| 292 |
@app.cell(hide_code=True)
|
| 293 |
def _(mo):
|
| 294 |
+
mo.md(r"""
|
|
|
|
| 295 |
### Descending order
|
| 296 |
|
| 297 |
We can also order in descending order by passing `descending=True`...
|
| 298 |
+
""")
|
|
|
|
| 299 |
return
|
| 300 |
|
| 301 |
|
|
|
|
| 322 |
|
| 323 |
@app.cell(hide_code=True)
|
| 324 |
def _(mo):
|
| 325 |
+
mo.md("""
|
|
|
|
| 326 |
## Mapping Strategies
|
| 327 |
|
| 328 |
Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame.
|
| 329 |
|
| 330 |
Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping
|
| 331 |
strategies, we will explore other possibilities.
|
| 332 |
+
""")
|
|
|
|
| 333 |
return
|
| 334 |
|
| 335 |
|
| 336 |
@app.cell(hide_code=True)
|
| 337 |
def _(mo):
|
| 338 |
+
mo.md("""
|
|
|
|
| 339 |
### Group to rows
|
| 340 |
|
| 341 |
"group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the
|
| 342 |
window.
|
| 343 |
+
""")
|
|
|
|
| 344 |
return
|
| 345 |
|
| 346 |
|
|
|
|
| 354 |
|
| 355 |
@app.cell(hide_code=True)
|
| 356 |
def _(mo):
|
| 357 |
+
mo.md("""
|
|
|
|
| 358 |
### Join
|
| 359 |
|
| 360 |
The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group.
|
| 361 |
+
""")
|
|
|
|
| 362 |
return
|
| 363 |
|
| 364 |
|
|
|
|
| 372 |
|
| 373 |
@app.cell(hide_code=True)
|
| 374 |
def _(mo):
|
| 375 |
+
mo.md(r"""
|
|
|
|
| 376 |
### Explode
|
| 377 |
|
| 378 |
The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of
|
|
|
|
| 380 |
It should also only be used in a `select` context and not `with_columns`.
|
| 381 |
|
| 382 |
The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`.
|
| 383 |
+
""")
|
|
|
|
| 384 |
return
|
| 385 |
|
| 386 |
|
|
|
|
| 397 |
|
| 398 |
@app.cell(hide_code=True)
|
| 399 |
def _(mo):
|
| 400 |
+
mo.md(r"""
|
| 401 |
+
Note the modified order of the rows in the output (the data is the same)...
|
| 402 |
+
""")
|
| 403 |
return
|
| 404 |
|
| 405 |
|
| 406 |
@app.cell(hide_code=True)
|
| 407 |
def _(mo):
|
| 408 |
+
mo.md(r"""
|
| 409 |
+
## Other tips and tricks
|
| 410 |
+
""")
|
| 411 |
return
|
| 412 |
|
| 413 |
|
| 414 |
@app.cell(hide_code=True)
|
| 415 |
def _(mo):
|
| 416 |
+
mo.md(r"""
|
|
|
|
| 417 |
### Reusing a window
|
| 418 |
|
| 419 |
In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions
|
| 420 |
without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`.
|
| 421 |
+
""")
|
|
|
|
| 422 |
return
|
| 423 |
|
| 424 |
|
|
|
|
| 440 |
|
| 441 |
@app.cell(hide_code=True)
|
| 442 |
def _(mo):
|
| 443 |
+
mo.md(r"""
|
|
|
|
| 444 |
### Rolling Windows
|
| 445 |
|
| 446 |
Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the rolling calculation
|
|
|
|
| 448 |
|
| 449 |
Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row
|
| 450 |
max revenue split by channel...
|
| 451 |
+
""")
|
|
|
|
| 452 |
return
|
| 453 |
|
| 454 |
|
|
|
|
| 469 |
|
| 470 |
@app.cell(hide_code=True)
|
| 471 |
def _(mo):
|
| 472 |
+
mo.md(r"""
|
| 473 |
+
Notice the difference in the second-to-last row...
|
| 474 |
+
""")
|
| 475 |
return
|
| 476 |
|
| 477 |
|
| 478 |
@app.cell(hide_code=True)
|
| 479 |
def _(mo):
|
| 480 |
+
mo.md(r"""
|
| 481 |
+
We hope you enjoyed this notebook, demonstrating window functions in Polars!
|
| 482 |
+
""")
|
| 483 |
return
|
| 484 |
|
| 485 |
|
| 486 |
@app.cell(hide_code=True)
|
| 487 |
def _(mo):
|
| 488 |
+
mo.md(r"""
|
|
|
|
| 489 |
## Additional References
|
| 490 |
|
| 491 |
- [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
|
| 492 |
- [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
|
| 493 |
- [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
|
| 494 |
+
""")
|
|
|
|
| 495 |
return
|
| 496 |
|
| 497 |
|
polars/14_user_defined_functions.py
CHANGED
|
@@ -14,58 +14,52 @@
|
|
| 14 |
|
| 15 |
import marimo
|
| 16 |
|
| 17 |
-
__generated_with = "0.
|
| 18 |
app = marimo.App(width="medium")
|
| 19 |
|
| 20 |
|
| 21 |
@app.cell(hide_code=True)
|
| 22 |
def _(mo):
|
| 23 |
-
mo.md(
|
| 24 |
-
|
| 25 |
-
# User-Defined Functions
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
)
|
| 34 |
return
|
| 35 |
|
| 36 |
|
| 37 |
@app.cell(hide_code=True)
|
| 38 |
def _(mo):
|
| 39 |
-
mo.md(
|
| 40 |
-
|
| 41 |
-
## ⚖️ The Cost of UDFs
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
)
|
| 55 |
return
|
| 56 |
|
| 57 |
|
| 58 |
@app.cell(hide_code=True)
|
| 59 |
def _(mo):
|
| 60 |
-
mo.md(
|
| 61 |
-
|
| 62 |
-
## 📊 Project Overview
|
| 63 |
|
| 64 |
-
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
)
|
| 69 |
return
|
| 70 |
|
| 71 |
|
|
@@ -90,7 +84,9 @@ def _(mo):
|
|
| 90 |
|
| 91 |
@app.cell(hide_code=True)
|
| 92 |
def _(mo):
|
| 93 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 94 |
return
|
| 95 |
|
| 96 |
|
|
@@ -109,7 +105,9 @@ def _(mo):
|
|
| 109 |
|
| 110 |
@app.cell(hide_code=True)
|
| 111 |
def _(mo):
|
| 112 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 113 |
return
|
| 114 |
|
| 115 |
|
|
@@ -129,19 +127,17 @@ def _(pl):
|
|
| 129 |
|
| 130 |
@app.cell(hide_code=True)
|
| 131 |
def _(mo):
|
| 132 |
-
mo.md(
|
| 133 |
-
|
| 134 |
-
## 🔂 Element-Wise UDFs
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
-
|
| 139 |
|
| 140 |
-
|
| 141 |
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
)
|
| 145 |
return
|
| 146 |
|
| 147 |
|
|
@@ -159,13 +155,11 @@ def _(httpx, pl, url_df):
|
|
| 159 |
|
| 160 |
@app.cell(hide_code=True)
|
| 161 |
def _(mo):
|
| 162 |
-
mo.md(
|
| 163 |
-
|
| 164 |
-
Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this.
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
)
|
| 169 |
return
|
| 170 |
|
| 171 |
|
|
@@ -193,7 +187,9 @@ def _(extract_nextjs_data, html_df, pl):
|
|
| 193 |
|
| 194 |
@app.cell(hide_code=True)
|
| 195 |
def _(mo):
|
| 196 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 197 |
return
|
| 198 |
|
| 199 |
|
|
@@ -276,19 +272,17 @@ def _(parsed_html_df, pl):
|
|
| 276 |
|
| 277 |
@app.cell(hide_code=True)
|
| 278 |
def _(mo):
|
| 279 |
-
mo.md(
|
| 280 |
-
|
| 281 |
-
## 📦 Batch-Wise UDFs
|
| 282 |
|
| 283 |
-
|
| 284 |
|
| 285 |
-
|
| 286 |
|
| 287 |
-
|
| 288 |
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
)
|
| 292 |
return
|
| 293 |
|
| 294 |
|
|
@@ -372,19 +366,19 @@ def _(mo, notebook_stats_df):
|
|
| 372 |
return notebook_height, notebooks
|
| 373 |
|
| 374 |
|
| 375 |
-
@app.
|
| 376 |
-
def
|
| 377 |
-
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
return f'<iframe width="100%" height="{height}" frameborder="0" src="{embed_url}?cell=*"></iframe>'
|
| 382 |
-
return (nb_iframe,)
|
| 383 |
|
| 384 |
|
| 385 |
@app.cell(hide_code=True)
|
| 386 |
def _(mo):
|
| 387 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 388 |
return
|
| 389 |
|
| 390 |
|
|
@@ -395,7 +389,7 @@ def _(mo):
|
|
| 395 |
|
| 396 |
|
| 397 |
@app.cell(hide_code=True)
|
| 398 |
-
def _(category, mo,
|
| 399 |
notebook = notebooks.value.to_dicts()[0]
|
| 400 |
mo.vstack(
|
| 401 |
[
|
|
@@ -406,60 +400,56 @@ def _(category, mo, nb_iframe, notebook_height, notebooks):
|
|
| 406 |
mo.md(nb_iframe(notebook["notebook_url"], notebook_height.value)),
|
| 407 |
]
|
| 408 |
)
|
| 409 |
-
return
|
| 410 |
|
| 411 |
|
| 412 |
@app.cell(hide_code=True)
|
| 413 |
def _(mo):
|
| 414 |
-
mo.md(
|
| 415 |
-
|
| 416 |
-
## ⚙️ Row-Wise UDFs
|
| 417 |
|
| 418 |
-
|
| 419 |
|
| 420 |
-
|
| 421 |
|
| 422 |
-
|
| 423 |
-
|
| 424 |
-
)
|
| 425 |
return
|
| 426 |
|
| 427 |
|
| 428 |
-
@app.
|
| 429 |
-
def
|
| 430 |
-
|
| 431 |
-
|
| 432 |
-
|
| 433 |
-
|
| 434 |
-
|
| 435 |
-
|
| 436 |
-
|
| 437 |
-
|
| 438 |
-
|
| 439 |
-
|
| 440 |
-
|
| 441 |
-
|
| 442 |
-
|
| 443 |
-
|
| 444 |
-
|
| 445 |
-
|
| 446 |
-
<div
|
| 447 |
-
|
| 448 |
-
|
| 449 |
-
|
| 450 |
-
|
| 451 |
-
|
| 452 |
-
|
| 453 |
-
<
|
| 454 |
-
|
| 455 |
-
|
| 456 |
-
|
| 457 |
-
)
|
| 458 |
-
return (create_notebook_summary,)
|
| 459 |
|
| 460 |
|
| 461 |
@app.cell(hide_code=True)
|
| 462 |
-
def _(
|
| 463 |
notebook_summary_df = notebook_stats_df.map_rows(
|
| 464 |
create_notebook_summary,
|
| 465 |
return_dtype=pl.String,
|
|
@@ -487,37 +477,33 @@ def _(mo, notebook_summary_df):
|
|
| 487 |
|
| 488 |
@app.cell(hide_code=True)
|
| 489 |
def _(mo):
|
| 490 |
-
mo.md(
|
| 491 |
-
|
| 492 |
-
## 🚀 Higher-performance UDFs
|
| 493 |
|
| 494 |
-
|
| 495 |
|
| 496 |
-
|
| 497 |
|
| 498 |
-
|
| 499 |
-
|
| 500 |
-
)
|
| 501 |
return
|
| 502 |
|
| 503 |
|
| 504 |
@app.cell(hide_code=True)
|
| 505 |
def _(mo):
|
| 506 |
-
mo.md(
|
| 507 |
-
|
| 508 |
-
Let's create a custom popularity metric to rank notebooks, considering likes, forks, *and* comments (not just likes). We'll define `weighted_popularity_numba`, decorated with `@numba.guvectorize`. The decorator arguments specify that we're taking three integer vectors of length `n` and returning a float vector of length `n`.
|
| 509 |
|
| 510 |
-
|
| 511 |
|
| 512 |
-
|
| 513 |
-
|
| 514 |
-
|
| 515 |
-
|
| 516 |
-
|
| 517 |
|
| 518 |
-
|
| 519 |
-
|
| 520 |
-
)
|
| 521 |
return
|
| 522 |
|
| 523 |
|
|
@@ -606,12 +592,14 @@ def _(
|
|
| 606 |
+ w_f * (forks[i] ** nlf)
|
| 607 |
+ w_c * (comments[i] ** nlf)
|
| 608 |
)
|
| 609 |
-
return
|
| 610 |
|
| 611 |
|
| 612 |
@app.cell(hide_code=True)
|
| 613 |
def _(mo):
|
| 614 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 615 |
return
|
| 616 |
|
| 617 |
|
|
@@ -665,7 +653,9 @@ def _(
|
|
| 665 |
|
| 666 |
@app.cell(hide_code=True)
|
| 667 |
def _(mo):
|
| 668 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 669 |
return
|
| 670 |
|
| 671 |
|
|
@@ -700,27 +690,25 @@ def _(alt, notebook_popularity_df, pl):
|
|
| 700 |
fill="title:N",
|
| 701 |
)
|
| 702 |
(points + lines).properties(width=400)
|
| 703 |
-
return
|
| 704 |
|
| 705 |
|
| 706 |
@app.cell(hide_code=True)
|
| 707 |
def _(mo):
|
| 708 |
-
mo.md(
|
| 709 |
-
|
| 710 |
-
## ⏱️ Quantifying the Overhead
|
| 711 |
|
| 712 |
-
|
| 713 |
|
| 714 |
-
|
| 715 |
|
| 716 |
-
|
| 717 |
-
|
| 718 |
-
|
| 719 |
-
|
| 720 |
|
| 721 |
-
|
| 722 |
-
|
| 723 |
-
)
|
| 724 |
return
|
| 725 |
|
| 726 |
|
|
@@ -750,15 +738,13 @@ def _(benchmark_plot, mo, num_samples, num_trials):
|
|
| 750 |
|
| 751 |
@app.cell(hide_code=True)
|
| 752 |
def _(mo):
|
| 753 |
-
mo.md(
|
| 754 |
-
|
| 755 |
-
As anticipated, the `Batch-Wise UDF (Python)` and `Element-Wise UDF` exhibit significantly worse performance, essentially acting as pure-Python for-each loops.
|
| 756 |
|
| 757 |
-
|
| 758 |
|
| 759 |
-
|
| 760 |
-
|
| 761 |
-
)
|
| 762 |
return
|
| 763 |
|
| 764 |
|
|
@@ -789,7 +775,7 @@ def _(mo):
|
|
| 789 |
def _(np, num_samples, pl):
|
| 790 |
rng = np.random.default_rng(42)
|
| 791 |
sample_df = pl.from_dict({"x": rng.random(num_samples.value)})
|
| 792 |
-
return
|
| 793 |
|
| 794 |
|
| 795 |
@app.cell(hide_code=True)
|
|
@@ -861,14 +847,7 @@ def _(np, num_trials, numba, pl, sample_df, timeit):
|
|
| 861 |
def time_method(callable_name: str, number=num_trials.value) -> float:
|
| 862 |
fn = globals()[callable_name]
|
| 863 |
return timeit.timeit(fn, number=number)
|
| 864 |
-
return (
|
| 865 |
-
run_map_batches_numba,
|
| 866 |
-
run_map_batches_numpy,
|
| 867 |
-
run_map_batches_python,
|
| 868 |
-
run_map_elements,
|
| 869 |
-
run_native,
|
| 870 |
-
time_method,
|
| 871 |
-
)
|
| 872 |
|
| 873 |
|
| 874 |
@app.cell(hide_code=True)
|
|
@@ -906,7 +885,7 @@ def _(alt, pl, time_method):
|
|
| 906 |
x=alt.X("title:N", title="Method", sort="-y"),
|
| 907 |
y=alt.Y("time:Q", title="Execution Time (s)", axis=alt.Axis(format=".3f")),
|
| 908 |
).properties(width=400)
|
| 909 |
-
return
|
| 910 |
|
| 911 |
|
| 912 |
@app.cell(hide_code=True)
|
|
@@ -934,7 +913,6 @@ def _():
|
|
| 934 |
asyncio,
|
| 935 |
httpx,
|
| 936 |
mo,
|
| 937 |
-
nest_asyncio,
|
| 938 |
np,
|
| 939 |
numba,
|
| 940 |
pl,
|
|
|
|
| 14 |
| 15 | import marimo
| 16 |
| 17 | + __generated_with = "0.18.4"
| 18 | app = marimo.App(width="medium")
| 19 |
| 20 |
| 21 | @app.cell(hide_code=True)
| 22 | def _(mo):
| 23 | +     mo.md(r"""
| 24 | +     # User-Defined Functions
| 25 |
| 26 | +     _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
| 27 |
| 28 | +     Throughout the previous chapters, you've seen how Polars provides a comprehensive set of built-in expressions for flexible data transformation. But what happens when you need something *more*? Perhaps your project has unique requirements, or you need to integrate functionality from an external Python library. This is where User-Defined Functions (UDFs) come into play, allowing you to extend Polars with your own custom logic.
| 29 |
| 30 | +     In this chapter, we'll weigh the performance trade-offs of UDFs, pinpoint situations where they're truly beneficial, and explore different ways to incorporate them effectively into your Polars workflows. We'll walk through a complete, practical example.
| 31 | +     """)
| 32 |     return
| 33 |
| 34 |
| 35 | @app.cell(hide_code=True)
| 36 | def _(mo):
| 37 | +     mo.md(r"""
| 38 | +     ## ⚖️ The Cost of UDFs
| 39 |
| 40 | +     > Performance vs. Flexibility
| 41 |
| 42 | +     Polars' built-in expressions are highly optimized for speed and parallel processing. User-defined functions (UDFs), however, introduce a significant performance overhead because they rely on standard Python code, which often runs in a single thread and bypasses Polars' logical optimizations. Therefore, always prioritize native Polars operations *whenever possible*.
| 43 |
| 44 | +     That said, UDFs become necessary when you need to:
| 45 |
| 46 | +     - **Integrate external libraries:** Use functionality not directly available in Polars.
| 47 | +     - **Implement custom logic:** Handle complex transformations that can't be easily expressed with Polars' built-in functions.
| 48 |
| 49 | +     Let's dive into a real-world project where UDFs were the only way to get the job done, a scenario where native Polars expressions alone weren't sufficient.
| 50 | +     """)
| 51 |     return
| 52 |
| 53 |
| 54 | @app.cell(hide_code=True)
| 55 | def _(mo):
| 56 | +     mo.md(r"""
| 57 | +     ## 📊 Project Overview
| 58 |
| 59 | +     > Scraping and Analyzing Observable Notebook Statistics
| 60 |
| 61 | +     If you're into data visualization, you've probably seen [D3.js](https://d3js.org/) and [Observable Plot](https://observablehq.com/plot/). Both have extensive galleries showcasing amazing visualizations. Each gallery item is a standalone [Observable notebook](https://observablehq.com/documentation/notebooks/), with metrics like stars, comments, and forks that indicate its popularity. But getting and analyzing these statistics directly isn't straightforward. We'll need to scrape the web.
| 62 | +     """)
| 63 |     return
| 64 |
| 65 |

| 84 |
| 85 | @app.cell(hide_code=True)
| 86 | def _(mo):
| 87 | +     mo.md(r"""
| 88 | +     Our goal is to use Polars UDFs to fetch the HTML content of these gallery pages. Then, we'll use the `BeautifulSoup` Python library to parse the HTML and extract the relevant metadata. After some data wrangling with native Polars expressions, we'll have a DataFrame listing each visualization notebook. Next, we'll use another UDF to retrieve the number of likes, forks, and comments for each notebook. Finally, we will create our own high-performance UDF to implement a custom notebook ranking scheme. This will involve multiple steps, showcasing different UDF approaches.
| 89 | +     """)
| 90 |     return
| 91 |
| 92 |

| 105 |
| 106 | @app.cell(hide_code=True)
| 107 | def _(mo):
| 108 | +     mo.md(r"""
| 109 | +     Our starting point, `url_df`, is a simple DataFrame with a single `url` column containing the URLs of the D3 and Observable Plot gallery notebooks.
| 110 | +     """)
| 111 |     return
| 112 |
| 113 |

| 127 |
| 128 | @app.cell(hide_code=True)
| 129 | def _(mo):
| 130 | +     mo.md(r"""
| 131 | +     ## 🔂 Element-Wise UDFs
| 132 |
| 133 | +     > Processing Value by Value
| 134 |
| 135 | +     The most common way to use UDFs is to apply them element-wise. This means our custom function will execute for *each individual row* in a specified column. Our first task is to fetch the HTML content for each URL in `url_df`.
| 136 |
| 137 | +     We'll define a Python function that takes a `url` (a string) as input, uses the `httpx` library (an HTTP client) to fetch the content, and returns the HTML as a string. We then integrate this function into Polars using the [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) expression.
| 138 |
| 139 | +     You'll notice we have to explicitly specify the `return_dtype`. This is *crucial*. Polars doesn't automatically know what our custom function will return. We're responsible for defining the function's logic and, therefore, its output type. By providing the `return_dtype`, we help Polars maintain its internal representation of the DataFrame's schema, enabling query optimization. Think of it as giving Polars a "heads-up" about the data type it should expect.
| 140 | +     """)
| 141 |     return
| 142 |
| 143 |
|
|
|
| 155 |
| 156 | @app.cell(hide_code=True)
| 157 | def _(mo):
| 158 | +     mo.md(r"""
| 159 | +     Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this.
| 160 |
| 161 | +     These Observable pages are built with [Next.js](https://nextjs.org/), which helpfully serializes page properties as JSON within the HTML. This simplifies our UDF: we'll extract the raw JSON from the `<script id="__NEXT_DATA__" type="application/json">` tag. We'll use [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) again. For clarity, we'll define this UDF as a named function, `extract_nextjs_data`, since it's a bit more complex than a simple HTTP request.
| 162 | +     """)
| 163 |     return
| 164 |
| 165 |
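A sketch of what such an extractor might look like. The notebook parses the tag with BeautifulSoup; here a regex stand-in applied to an inline sample document (the JSON payload is made up) keeps the sketch dependency-free.

```python
import json
import re

SAMPLE_HTML = """<html><body>
<script id="__NEXT_DATA__" type="application/json">{"props": {"pageProps": {"title": "Gallery"}}}</script>
</body></html>"""


def extract_nextjs_data(html: str) -> str:
    # grab the raw JSON inside the __NEXT_DATA__ script tag; BeautifulSoup's
    # find("script", id="__NEXT_DATA__") does the same thing more robustly
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    return match.group(1) if match else "{}"


raw_json = extract_nextjs_data(SAMPLE_HTML)
page_props = json.loads(raw_json)["props"]["pageProps"]
```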
|
|
|
| 187 |
| 188 | @app.cell(hide_code=True)
| 189 | def _(mo):
| 190 | +     mo.md(r"""
| 191 | +     With some data wrangling of the raw JSON (using *native* Polars expressions!), we get `notebooks_df`, containing the metadata for each notebook.
| 192 | +     """)
| 193 |     return
| 194 |
| 195 |

| 272 |
| 273 | @app.cell(hide_code=True)
| 274 | def _(mo):
| 275 | +     mo.md(r"""
| 276 | +     ## 📦 Batch-Wise UDFs
| 277 |
| 278 | +     > Processing Entire Series
| 279 |
| 280 | +     `map_elements` calls the UDF for *each row*. That's fine for our tiny, two-row `url_df`, but `notebooks_df` has almost 400 rows! Individual HTTP requests for each would be painfully slow.
| 281 |
| 282 | +     We want stats for each notebook in `notebooks_df`. To avoid sequential requests, we'll use Polars' [`map_batches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_batches.html). This lets us process an *entire Series* (a column) at once.
| 283 |
| 284 | +     Our UDF, `fetch_html_batch`, will take a *Series* of URLs and use `asyncio` to make concurrent requests, a huge performance boost.
| 285 | +     """)
| 286 |     return
| 287 |
| 288 |
|
|
|
| 366 |     return notebook_height, notebooks
| 367 |
| 368 |
| 369 | + @app.function(hide_code=True)
| 370 | + def nb_iframe(notebook_url: str, height=825) -> str:
| 371 | +     embed_url = notebook_url.replace(
| 372 | +         "https://observablehq.com", "https://observablehq.com/embed"
| 373 | +     )
| 374 | +     return f'<iframe width="100%" height="{height}" frameborder="0" src="{embed_url}?cell=*"></iframe>'
| 375 |
| 376 |
| 377 | @app.cell(hide_code=True)
| 378 | def _(mo):
| 379 | +     mo.md(r"""
| 380 | +     Now that we have access to notebook-level statistics, we can rank the visualizations by the number of likes they received and display them interactively.
| 381 | +     """)
| 382 |     return
| 383 |
| 384 |

| 389 |
| 390 |
| 391 | @app.cell(hide_code=True)
| 392 | + def _(category, mo, notebook_height, notebooks):
| 393 |     notebook = notebooks.value.to_dicts()[0]
| 394 |     mo.vstack(
| 395 |         [

| 400 |             mo.md(nb_iframe(notebook["notebook_url"], notebook_height.value)),
| 401 |         ]
| 402 |     )
| 403 | +     return
| 404 |
| 405 |
| 406 | @app.cell(hide_code=True)
| 407 | def _(mo):
| 408 | +     mo.md(r"""
| 409 | +     ## ⚙️ Row-Wise UDFs
| 410 |
| 411 | +     > Accessing All Columns at Once
| 412 |
| 413 | +     Sometimes, you need to work with *all* columns of a row at once. This is where [`map_rows`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.map_rows.html) comes in. It operates directly on the DataFrame, passing each row to your UDF *as a tuple*.
| 414 |
| 415 | +     Below, `create_notebook_summary` takes a row from `notebook_stats_df` (as a tuple) and returns a formatted Markdown string summarizing the notebook's key stats. We're essentially reducing the DataFrame to a single column. While this *could* be done with native Polars expressions, it would be much more cumbersome. This example demonstrates a case where a row-wise UDF simplifies the code, even if the underlying operation isn't inherently complex.
| 416 | +     """)
| 417 |     return
| 418 |
| 419 |
| 420 | + @app.function(hide_code=True)
| 421 | + def create_notebook_summary(row: tuple) -> str:
| 422 | +     (
| 423 | +         thumbnail_src,
| 424 | +         category,
| 425 | +         title,
| 426 | +         likes,
| 427 | +         forks,
| 428 | +         comments,
| 429 | +         license,
| 430 | +         description,
| 431 | +         notebook_url,
| 432 | +     ) = row
| 433 | +     return (
| 434 | +         f"""
| 435 | + ### [{title}]({notebook_url})
| 436 | +
| 437 | + <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 12px; margin: 12px 0;">
| 438 | +     <div>⭐ <strong>Likes:</strong> {likes}</div>
| 439 | +     <div>↗️ <strong>Forks:</strong> {forks}</div>
| 440 | +     <div>💬 <strong>Comments:</strong> {comments}</div>
| 441 | +     <div>⚖️ <strong>License:</strong> {license}</div>
| 442 | + </div>
| 443 | +
| 444 | + <a href="{notebook_url}" target="_blank">
| 445 | +     <img src="{thumbnail_src}" style="height: 300px;" />
| 446 | + </a>
| 447 | + """.strip('\n')
| 448 | +     )
| 449 |
| 450 |
| 451 | @app.cell(hide_code=True)
| 452 | + def _(notebook_stats_df, pl):
| 453 |     notebook_summary_df = notebook_stats_df.map_rows(
| 454 |         create_notebook_summary,
| 455 |         return_dtype=pl.String,
|
|
|
| 477 |
| 478 | @app.cell(hide_code=True)
| 479 | def _(mo):
| 480 | +     mo.md(r"""
| 481 | +     ## 🚀 Higher-performance UDFs
| 482 |
| 483 | +     > Leveraging Numba to Make Python Fast
| 484 |
| 485 | +     Python code doesn't *always* mean slow code. While UDFs *often* introduce performance overhead, there are exceptions. NumPy's universal functions ([`ufuncs`](https://numpy.org/doc/stable/reference/ufuncs.html)) and generalized universal functions ([`gufuncs`](https://numpy.org/neps/nep-0005-generalized-ufuncs.html)) provide high-performance operations on NumPy arrays, thanks to low-level implementations.
| 486 |
| 487 | +     But NumPy's built-in functions are predefined. We can't easily use them for *custom* logic. Enter [`numba`](https://numba.pydata.org/). Numba is a just-in-time (JIT) compiler that translates Python functions into optimized machine code *at runtime*. It provides decorators like [`numba.guvectorize`](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) that let us create our *own* high-performance `gufuncs`, *without* writing low-level code!
| 488 | +     """)
| 489 |     return
| 490 |
| 491 |
| 492 | @app.cell(hide_code=True)
| 493 | def _(mo):
| 494 | +     mo.md(r"""
| 495 | +     Let's create a custom popularity metric to rank notebooks, considering likes, forks, *and* comments (not just likes). We'll define `weighted_popularity_numba`, decorated with `@numba.guvectorize`. The decorator arguments specify that we're taking three integer vectors of length `n` and returning a float vector of length `n`.
| 496 |
| 497 | +     The weighted popularity score for each notebook is calculated using the following formula:
| 498 |
| 499 | +     $$
| 500 | +     \begin{equation}
| 501 | +     \text{score}_i = w_l \cdot l_i^{f} + w_f \cdot f_i^{f} + w_c \cdot c_i^{f}
| 502 | +     \end{equation}
| 503 | +     $$
| 504 |
| 505 | +     with:
| 506 | +     """)
| 507 |     return
| 508 |
| 509 |
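The formula itself can be checked with a plain-NumPy sketch before involving the compiler. The weights `w_l`, `w_f`, `w_c` and the exponent `f` below are illustrative assumptions, not the notebook's actual values; the notebook compiles an equivalent elementwise loop with `@numba.guvectorize`.

```python
import numpy as np


def weighted_popularity(likes, forks, comments, w_l=0.5, w_f=0.3, w_c=0.2, f=0.5):
    # score_i = w_l * l_i**f + w_f * f_i**f + w_c * c_i**f
    # (weights and exponent are illustrative, not the notebook's values)
    likes, forks, comments = (
        np.asarray(a, dtype=np.float64) for a in (likes, forks, comments)
    )
    return w_l * likes**f + w_f * forks**f + w_c * comments**f


# a like-heavy notebook vs. a fork/comment-heavy one
scores = weighted_popularity([100, 4], [10, 40], [1, 30])
```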
|
|
|
| 592 | +             + w_f * (forks[i] ** nlf)
| 593 | +             + w_c * (comments[i] ** nlf)
| 594 |         )
| 595 | +     return (weighted_popularity_numba,)
| 596 |
| 597 |
| 598 | @app.cell(hide_code=True)
| 599 | def _(mo):
| 600 | +     mo.md(r"""
| 601 | +     We apply our JIT-compiled UDF using `map_batches`, as before. The key is that we're passing entire columns directly to `weighted_popularity_numba`. Polars and Numba handle the conversion to NumPy arrays behind the scenes. This direct integration is a major benefit of using `guvectorize`.
| 602 | +     """)
| 603 |     return
| 604 |
| 605 |

| 653 |
| 654 | @app.cell(hide_code=True)
| 655 | def _(mo):
| 656 | +     mo.md(r"""
| 657 | +     As the slope chart below demonstrates, this new ranking strategy significantly changes the notebook order, since it considers forks and comments, not just likes.
| 658 | +     """)
| 659 |     return
| 660 |
| 661 |

| 690 |         fill="title:N",
| 691 |     )
| 692 |     (points + lines).properties(width=400)
| 693 | +     return
| 694 |
| 695 |
| 696 | @app.cell(hide_code=True)
| 697 | def _(mo):
| 698 | +     mo.md(r"""
| 699 | +     ## ⏱️ Quantifying the Overhead
| 700 |
| 701 | +     > UDF Performance Comparison
| 702 |
| 703 | +     To truly understand the performance implications of using UDFs, let's conduct a benchmark. We'll create a DataFrame with random numbers and perform the same numerical operation using four different methods:
| 704 |
| 705 | +     1. **Native Polars:** Using Polars' built-in expressions.
| 706 | +     2. **`map_elements`:** Applying a Python function element-wise.
| 707 | +     3. **`map_batches`:** Applying a Python function to the entire Series.
| 708 | +     4. **`map_batches` with Numba:** Applying a JIT-compiled function to batches, similar to a generalized universal function.
| 709 |
| 710 | +     We'll use a simple but non-trivial calculation: `result = (x * 2.5 + 5) / (x + 1)`. This involves multiplication, addition, and division, giving us a realistic representation of a common numerical operation. We'll use the `timeit` module to accurately measure execution times over multiple trials.
| 711 | +     """)
| 712 |     return
| 713 |
| 714 |
|
|
|
| 738 |
| 739 | @app.cell(hide_code=True)
| 740 | def _(mo):
| 741 | +     mo.md(r"""
| 742 | +     As anticipated, the `Batch-Wise UDF (Python)` and `Element-Wise UDF` exhibit significantly worse performance, essentially acting as pure-Python for-each loops.
| 743 |
| 744 | +     However, when Python serves as an interface to lower-level, high-performance libraries, we observe substantial improvements. The `Batch-Wise UDF (NumPy)` lags behind both `Batch-Wise UDF (Numba)` and `Native Polars`, but it still represents a considerable improvement over pure-Python UDFs due to its vectorized computations.
| 745 |
| 746 | +     Numba's just-in-time (JIT) compilation delivers a dramatic performance boost, achieving speeds comparable to native Polars expressions. This demonstrates that UDFs, particularly when combined with tools like Numba, don't inevitably lead to bottlenecks in numerical computations.
| 747 | +     """)
| 748 |     return
| 749 |
| 750 |

| 775 | def _(np, num_samples, pl):
| 776 |     rng = np.random.default_rng(42)
| 777 |     sample_df = pl.from_dict({"x": rng.random(num_samples.value)})
| 778 | +     return (sample_df,)
| 779 |
| 780 |
| 781 | @app.cell(hide_code=True)

| 847 |     def time_method(callable_name: str, number=num_trials.value) -> float:
| 848 |         fn = globals()[callable_name]
| 849 |         return timeit.timeit(fn, number=number)
| 850 | +     return (time_method,)
| 851 |
| 852 |
| 853 | @app.cell(hide_code=True)

| 885 |         x=alt.X("title:N", title="Method", sort="-y"),
| 886 |         y=alt.Y("time:Q", title="Execution Time (s)", axis=alt.Axis(format=".3f")),
| 887 |     ).properties(width=400)
| 888 | +     return (benchmark_plot,)
| 889 |
| 890 |
| 891 | @app.cell(hide_code=True)

| 913 |         asyncio,
| 914 |         httpx,
| 915 |         mo,
| 916 |         np,
| 917 |         numba,
| 918 |         pl,
polars/16_lazy_execution.py CHANGED

@@ -15,19 +15,17 @@
| 15 |
| 16 | import marimo
| 17 |
| 18 | - __generated_with = "0.
| 19 | app = marimo.App(width="medium")
| 20 |
| 21 |
| 22 | @app.cell(hide_code=True)
| 23 | def _(mo):
| 24 | -     mo.md(
| 25 | -
| 26 | -     # Lazy Execution (a.k.a. the Lazy API)
| 27 |
| 28 | -
| 29 | -
| 30 | -     )
| 31 |     return
| 32 |
| 33 |

@@ -51,14 +49,9 @@ def _():
| 51 |     Generator,
| 52 |     datetime,
| 53 |     np,
| 54 | -   numba,
| 55 | -   pd,
| 56 |     pl,
| 57 | -   plt,
| 58 |     random,
| 59 |     re,
| 60 | -   spl,
| 61 | -   st,
| 62 |     time,
| 63 |     timedelta,
| 64 |     timezone,

@@ -67,47 +60,43 @@ def _():
| 67 |
| 68 | @app.cell(hide_code=True)
| 69 | def _(mo):
| 70 | -     mo.md(
| 71 | -
| 72 | -
| 73 | -
| 74 | -
| 75 | -
| 76 | -
| 77 | -     """
| 78 | -     )
| 79 |     return
| 80 |
| 81 |
| 82 | @app.cell(hide_code=True)
| 83 | def _(mo):
| 84 | -     mo.md(
| 85 | -
| 86 | -     ## Setup
| 87 |
| 88 | -
| 89 |
| 90 | -
| 91 | -
| 92 | -
| 93 |
| 94 | -
| 95 |
| 96 | -
| 97 | -
| 98 | -
| 99 | -
| 100 | -
| 101 | -
| 102 | -
| 103 |
| 104 | -
| 105 |
| 106 | -
| 107 | -
| 108 | -
| 109 | -
| 110 | -     )
| 111 |     return
| 112 |
| 113 |

@@ -179,7 +168,6 @@ def _(Faker, datetime, np, num_log_lines, time):
| 179 |     responses,
| 180 |     rng,
| 181 |     sleep,
| 182 | -   timestr,
| 183 |     tz,
| 184 |     user_agents,
| 185 |     verbs,

@@ -222,19 +210,17 @@ def _(
| 222 |         faker=faker, rng=rng, resources=resources,
| 223 |         user_agents=user_agents, responses=responses, verbs=verbs)
| 224 |         yield list(re.findall(pattern, log_line)[0])
| 225 | -   return generator,
| 226 |
| 227 |
| 228 | @app.cell(hide_code=True)
| 229 | def _(mo):
| 230 | -     mo.md(
| 231 | -
| 232 | -     Since we are generating data using a Python generator, we create a `pl.LazyFrame` directly, but we can start with either a file or an existing `DataFrame`. When using a file, the functions beginning with `pl.scan_` from the Polars API can be used, while in the case of an existing `pl.DataFrame`, we can simply call `.lazy()` to convert it to a `pl.LazyFrame`.
| 233 |
| 234 | -
| 235 | -
| 236 | -
| 237 | -     )
| 238 |     return
| 239 |
| 240 |

@@ -249,15 +235,13 @@ def _(generator, num_log_lines, pl):
| 249 |
| 250 | @app.cell(hide_code=True)
| 251 | def _(mo):
| 252 | -     mo.md(
| 253 | -
| 254 | -     ## Schema
| 255 |
| 256 | -
| 257 |
| 258 | -
| 259 | -
| 260 | -     )
| 261 |     return
| 262 |
| 263 |

@@ -269,26 +253,28 @@ def _(log_data):
| 269 |
| 270 | @app.cell(hide_code=True)
| 271 | def _(mo):
| 272 | -     mo.md(
| 273 | -
| 274 | -     Since our generator yields strings, Polars defaults to the `pl.String` datatype while reading in the data from the generator, unless specified. This, however, is not the most space or computation efficient form of data storage, so we would like to convert the datatypes of some of the columns in our LazyFrame.
| 275 |
| 276 | -
| 277 | -
| 278 | -
| 279 | -     )
| 280 |     return
| 281 |
| 282 |
| 283 | @app.cell(hide_code=True)
| 284 | def _(mo):
| 285 | -     mo.md(r"""
| 286 |     return
| 287 |
| 288 |
| 289 | @app.cell(hide_code=True)
| 290 | def _(mo):
| 291 | -     mo.md(r"""
| 292 |     return
| 293 |
| 294 |

@@ -311,13 +297,11 @@ def _(log_data_erroneous):
| 311 |
| 312 | @app.cell(hide_code=True)
| 313 | def _(mo):
| 314 | -     mo.md(
| 315 | -
| 316 | -     Polars uses a **query optimizer** to make sure that a query pipeline is executed with the least computational cost (more on this later). In order to be able to do the optimization, the optimizer must know the schema for each step of the pipeline (query plan). For example, if you have a `.pivot` operation somewhere in your pipeline, you are generating new columns based on the data. This is new information unknown to the query optimizer that it cannot work with, and so the lazy API does not support `.pivot` operations.
| 317 |
| 318 | -
| 319 | -
| 320 | -     )
| 321 |     return
| 322 |
| 323 |

@@ -334,13 +318,11 @@ def _(log_data, pl):
| 334 |
| 335 | @app.cell(hide_code=True)
| 336 | def _(mo):
| 337 | -     mo.md(
| 338 | -
| 339 | -     As a workaround, we can jump between "lazy mode" and "eager mode" by converting a LazyFrame to a DataFrame just before the unsupported operation (e.g. `.pivot`). We can do this by calling `.collect()` on the LazyFrame. Once done with the "eager mode" operations, we can jump back to "lazy mode" by calling ".lazy()" on the DataFrame!
| 340 |
| 341 | -
| 342 | -
| 343 | -     )
| 344 |     return
| 345 |
| 346 |

@@ -360,21 +342,21 @@ def _(log_data, pl):
| 360 |
| 361 | @app.cell(hide_code=True)
| 362 | def _(mo):
| 363 | -     mo.md(r"""
| 364 |     return
| 365 |
| 366 |
| 367 | @app.cell(hide_code=True)
| 368 | def _(mo):
| 369 | -     mo.md(
| 370 | -
| 371 | -     Polars has a query optimizer that works on a "query plan" to create a computationally efficient query pipeline. It builds the query plan/query graph from the user-specified lazy operations.
| 372 |
| 373 | -
| 374 |
| 375 | -
| 376 | -
| 377 | -     )
| 378 |     return
| 379 |
| 380 |

@@ -409,21 +391,21 @@ def _(a_query):
| 409 |
| 410 | @app.cell(hide_code=True)
| 411 | def _(mo):
| 412 | -     mo.md(r"""
| 413 |     return
| 414 |
| 415 |
| 416 | @app.cell(hide_code=True)
| 417 | def _(mo):
| 418 | -     mo.md(
| 419 | -
| 420 | -     As mentioned before, Polars builds a query graph by going lazy operation by operation and then optimizes it by running a query optimizer on the graph. This optimized graph is run by default.
| 421 |
| 422 | -
| 423 |
| 424 | -
| 425 | -
| 426 | -     )
| 427 |     return
| 428 |
| 429 |

@@ -448,7 +430,9 @@ def _(log_data, pl):
| 448 |
| 449 | @app.cell(hide_code=True)
| 450 | def _(mo):
| 451 | -     mo.md(r"""
| 452 |     return
| 453 |
| 454 |

@@ -460,13 +444,11 @@ def _(a_query):
| 460 |
| 461 | @app.cell(hide_code=True)
| 462 | def _(mo):
| 463 | -     mo.md(
| 464 | -
| 465 | -     ## Optimizations
| 466 |
| 467 | -
| 468 | -
| 469 | -     )
| 470 |     return
| 471 |
| 472 |

@@ -484,25 +466,33 @@ def _(a_query):
| 484 |
| 485 | @app.cell(hide_code=True)
| 486 | def _(mo):
| 487 | -     mo.md(r"""
| 488 |     return
| 489 |
| 490 |
| 491 | @app.cell(hide_code=True)
| 492 | def _(mo):
| 493 | -     mo.md(r"""
| 494 |     return
| 495 |
| 496 |
| 497 | @app.cell(hide_code=True)
| 498 | def _(mo):
| 499 | -     mo.md(r"""
| 500 |     return
| 501 |
| 502 |
| 503 | @app.cell(hide_code=True)
| 504 | def _(mo):
| 505 | -     mo.md(r"""
| 506 |     return
| 507 |
| 508 |

@@ -522,7 +512,9 @@ def _(a_query, pl):
| 522 |
| 523 | @app.cell(hide_code=True)
| 524 | def _(mo):
| 525 | -     mo.md(r"""
| 526 |     return
| 527 |
| 528 |

@@ -536,13 +528,11 @@ def _(a_query, pl):
| 536 |
| 537 | @app.cell(hide_code=True)
| 538 | def _(mo):
| 539 | -     mo.md(
| 540 | -
| 541 | -     ## References
| 542 |
| 543 | -
| 544 | -
| 545 | -     )
| 546 |     return
| 547 |
| 548 |
|
|
|
| 15 |
| 16 | import marimo
| 17 |
| 18 | + __generated_with = "0.18.4"
| 19 | app = marimo.App(width="medium")
| 20 |
| 21 |
| 22 | @app.cell(hide_code=True)
| 23 | def _(mo):
| 24 | +     mo.md(r"""
| 25 | +     # Lazy Execution (a.k.a. the Lazy API)
| 26 |
| 27 | +     Author: [Deb Debnath](https://github.com/debajyotid2)
| 28 | +     """)
| 29 |     return
| 30 |
| 31 |

| 49 |     Generator,
| 50 |     datetime,
| 51 |     np,
| 52 |     pl,
| 53 |     random,
| 54 |     re,
| 55 |     time,
| 56 |     timedelta,
| 57 |     timezone,

| 60 |
| 61 | @app.cell(hide_code=True)
| 62 | def _(mo):
| 63 | +     mo.md(r"""
| 64 | +     We saw the benefits of lazy evaluation when we learned about the Expressions API in Polars. Lazy execution extends this idea into a philosophy via the Lazy API. It offers significant performance enhancements over eager (immediate) execution of queries and is one of the reasons why Polars is faster than other libraries at working with large (GB-scale) datasets. Unlike eager execution, the Lazy API optimizes the full query pipeline instead of executing each individual query optimally. Advantages of the Lazy API over eager execution include:
| 65 |
| 66 | +     - automatic query optimization with the query optimizer.
| 67 | +     - the ability to process datasets larger than memory using streaming.
| 68 | +     - the ability to catch schema errors before data processing.
| 69 | +     """)
| 70 |     return
| 71 |
| 72 |
| 73 | @app.cell(hide_code=True)
| 74 | def _(mo):
| 75 | +     mo.md(r"""
| 76 | +     ## Setup
| 77 |
| 78 | +     For this notebook, we are going to work with logs from an Apache/Nginx web server; these logs contain useful information that can be used for performance optimization, security monitoring, etc. Such logs comprise entries that look something like this:
| 79 |
| 80 | +     ```
| 81 | +     10.23.97.15 - - [05/Jul/2024:11:35:05 +0000] "GET /index.html HTTP/1.1" 200 1342 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32" "-"
| 82 | +     ```
| 83 |
| 84 | +     Different parts of the entry mean different things:
| 85 |
| 86 | +     - `10.23.97.15` is the client IP address.
| 87 | +     - `- -` are the identity and username of the client, respectively, and are typically unused.
| 88 | +     - `05/Jul/2024:11:35:05 +0000` is the timestamp of the request.
| 89 | +     - `"GET /index.html HTTP/1.1"` holds the HTTP method, the requested resource, and the HTTP protocol version, respectively.
| 90 | +     - `200 1342` are the response status code and the size of the response in bytes, respectively.
| 91 | +     - `"https://www.example.com"` is the "referer", or the webpage URL that brought the client to the resource.
| 92 | +     - `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32"` is the "user agent", or the details of the client device making the request (including browser version, operating system, etc.).
| 93 |
| 94 | +     Normally, you would get your log files from a server that you have access to. In our case, we will generate fake data to simulate log records. We will simulate 7 days of server activity with 90,000 recorded lines.
| 95 |
| 96 | +     ///Note
| 97 | +     1. If you are interested in the process of generating fake log entries, unhide the code cells immediately below the next one.
| 98 | +     2. You can adjust the size of the dataset by resetting the `num_log_lines` variable to a size of your choice. This may help if the data takes a long time to generate.
| 99 | +     """)
| 100 |     return
| 101 |
| 102 |
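The sample entry above can be pulled apart with a regular expression along exactly these fields. The pattern below is a sketch and may differ from the `pattern` the notebook actually builds:

```python
import re

# one capture group per field of the Apache/Nginx "combined" log format
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) (\S+)" (\d+) (\d+) "(.*?)" "(.*?)"'
)

line = (
    '10.23.97.15 - - [05/Jul/2024:11:35:05 +0000] "GET /index.html HTTP/1.1" '
    '200 1342 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32" "-"'
)

(ip, ident, user, ts, verb, resource, proto,
 status, size, referer, agent) = LOG_PATTERN.findall(line)[0]
```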
|
|
|
| 168 |     responses,
| 169 |     rng,
| 170 |     sleep,
| 171 |     tz,
| 172 |     user_agents,
| 173 |     verbs,

| 210 |         faker=faker, rng=rng, resources=resources,
| 211 |         user_agents=user_agents, responses=responses, verbs=verbs)
| 212 |         yield list(re.findall(pattern, log_line)[0])
| 213 | +     return (generator,)
| 214 |
| 215 |
| 216 | @app.cell(hide_code=True)
| 217 | def _(mo):
| 218 | +     mo.md(r"""
| 219 | +     Since we are generating data using a Python generator, we create a `pl.LazyFrame` directly, but we can start with either a file or an existing `DataFrame`. When using a file, the functions beginning with `pl.scan_` from the Polars API can be used, while in the case of an existing `pl.DataFrame`, we can simply call `.lazy()` to convert it to a `pl.LazyFrame`.
| 220 |
| 221 | +     ///Note
| 222 | +     Depending on your machine, the following cell may take some time to execute.
| 223 | +     """)
| 224 |     return
| 225 |
| 226 |
|
|
|
| 235 |
| 236 | @app.cell(hide_code=True)
| 237 | def _(mo):
| 238 | +     mo.md(r"""
| 239 | +     ## Schema
| 240 |
| 241 | +     A schema denotes the names and respective datatypes of the columns in a DataFrame or LazyFrame. It can be specified when a DataFrame or LazyFrame is created (as you may have noticed in the cell creating the LazyFrame above).
| 242 |
| 243 | +     You can see the schema with the `.collect_schema()` method on a DataFrame or LazyFrame.
| 244 | +     """)
| 245 |     return
| 246 |
| 247 |
|
|
|
| 253 |
| 254 | @app.cell(hide_code=True)
| 255 | def _(mo):
| 256 | +     mo.md(r"""
| 257 | +     Since our generator yields strings, Polars defaults to the `pl.String` datatype while reading in the data from the generator, unless specified. This, however, is not the most space- or computation-efficient form of data storage, so we would like to convert the datatypes of some of the columns in our LazyFrame.
| 258 |
| 259 | +     ///Note
| 260 | +     The data type conversion can also be done by specifying it in the schema when creating the LazyFrame or DataFrame. We skip this here for demonstration purposes. For more details on specifying data types in LazyFrames, please refer to the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html).
| 261 | +     """)
| 262 |     return
| 263 |
| 264 |
| 265 | @app.cell(hide_code=True)
| 266 | def _(mo):
| 267 | +     mo.md(r"""
| 268 | +     The Lazy API validates a query pipeline end-to-end for schema consistency and correctness. The checks make sure that if there is a mistake in your query, you can correct it before the data gets processed.
| 269 | +     """)
| 270 |     return
| 271 |
| 272 |
| 273 | @app.cell(hide_code=True)
| 274 | def _(mo):
| 275 | +     mo.md(r"""
| 276 | +     The `log_data_erroneous` query below throws an `InvalidOperationError` because Polars finds inconsistencies between the timestamps we parsed from the logs and the specified timestamp format. It turns out that the timestamps in string form still have trailing whitespace, which leads to errors during conversion to `datetime[μs]` objects.
| 277 | +     """)
| 278 |     return
| 279 |
| 280 |

| 297 |
| 298 | @app.cell(hide_code=True)
| 299 | def _(mo):
| 300 | +     mo.md(r"""
| 301 | +     Polars uses a **query optimizer** to make sure that a query pipeline is executed with the least computational cost (more on this later). In order to do the optimization, the optimizer must know the schema for each step of the pipeline (query plan). For example, if you have a `.pivot` operation somewhere in your pipeline, you are generating new columns based on the data. This is new information the query optimizer cannot work with, and so the Lazy API does not support `.pivot` operations.
| 302 |
| 303 | +     For example, suppose you would like to know how many requests of each kind, other than "POST" requests, were received at a given time. For this we would want to create a pivot table as follows, except that it throws an error because the Lazy API does not support pivot operations.
| 304 | +     """)
| 305 |     return
| 306 |
| 307 |

| 318 |
| 319 | @app.cell(hide_code=True)
| 320 | def _(mo):
| 321 | +     mo.md(r"""
| 322 | +     As a workaround, we can jump between "lazy mode" and "eager mode" by converting a LazyFrame to a DataFrame just before the unsupported operation (e.g. `.pivot`). We can do this by calling `.collect()` on the LazyFrame. Once done with the "eager mode" operations, we can jump back to "lazy mode" by calling `.lazy()` on the DataFrame!
| 323 |
| 324 | +     As an example, see the fix to the query in the previous cell below:
| 325 | +     """)
| 326 |     return
| 327 |
| 328 |
|
|
|
| 342 |
|
| 343 |
@app.cell(hide_code=True)
|
| 344 |
def _(mo):
|
| 345 |
+
mo.md(r"""
|
| 346 |
+
## Query plan
|
| 347 |
+
""")
|
| 348 |
return
|
| 349 |
|
| 350 |
|
| 351 |
@app.cell(hide_code=True)
|
| 352 |
def _(mo):
|
| 353 |
+
mo.md(r"""
|
| 354 |
+
Polars has a query optimizer that works on a "query plan" to create a computationally efficient query pipeline. It builds the query plan/query graph from the user-specified lazy operations.
|
|
|
|
| 355 |
|
| 356 |
+
We can understand query graphs by visualizing them and by printing them as text.
|
| 357 |
|
| 358 |
+
Say we want to convert the data in our log dataset from `pl.String` to more space-efficient data types. We would also like to view all "GET" requests that resulted in client-side errors. We build our query first, and then we visualize the query graph using `.show_graph()` and print it as text using `.explain()`.
|
| 359 |
+
""")
|
|
|
|
| 360 |
return
|
| 361 |
|
| 362 |
|
|
|
|
| 391 |
|
| 392 |
@app.cell(hide_code=True)
|
| 393 |
def _(mo):
|
| 394 |
+
mo.md(r"""
|
| 395 |
+
## Execution
|
| 396 |
+
""")
|
| 397 |
return
|
| 398 |
|
| 399 |
|
| 400 |
@app.cell(hide_code=True)
|
| 401 |
def _(mo):
|
| 402 |
+
mo.md(r"""
|
| 403 |
+
As mentioned before, Polars builds a query graph from the lazy operations, one operation at a time, and then optimizes it by running a query optimizer on the graph. The optimized graph is what runs by default.
|
|
|
|
| 404 |
|
| 405 |
+
We can execute our query on the full dataset by calling the `.collect()` method on the query. But since this option processes all the data in one batch, it is not memory efficient, and it can crash if the dataset does not fit in the memory available to your query.
|
| 406 |
|
| 407 |
+
For fast iterative development, running `.collect` on the entire dataset is a bad idea due to slow runtimes. If your dataset is partitioned, you can use a few partitions for testing. Another option is to use `.head` to limit the number of records processed, and to call `.collect` as few times as possible and toward the end of your query, as shown below.
|
| 408 |
+
""")
|
|
|
|
| 409 |
return
|
| 410 |
|
| 411 |
|
|
|
|
| 430 |
|
| 431 |
@app.cell(hide_code=True)
|
| 432 |
def _(mo):
|
| 433 |
+
mo.md(r"""
|
| 434 |
+
For large datasets, Polars supports a streaming mode that processes data in batches. Streaming mode can be enabled by passing the keyword `engine="streaming"` to the `collect` method.
|
| 435 |
+
""")
|
| 436 |
return
|
| 437 |
|
| 438 |
|
|
|
|
| 444 |
|
| 445 |
@app.cell(hide_code=True)
|
| 446 |
def _(mo):
|
| 447 |
+
mo.md(r"""
|
| 448 |
+
## Optimizations
|
|
|
|
| 449 |
|
| 450 |
+
The lazy API runs a query optimizer on every Polars query. It first builds a non-optimized plan containing the steps in the order the user specified them. It then looks for optimization opportunities within the plan and reorders operations according to specific rules to create an optimized query plan. Some optimizations are applied up front, while others are determined just in time as the materialized data comes in. For the query we built earlier and whose graph we visualized, the unoptimized and optimized versions are shown below.
|
| 451 |
+
""")
|
|
|
|
| 452 |
return
|
| 453 |
|
| 454 |
|
|
|
|
| 466 |
|
| 467 |
@app.cell(hide_code=True)
|
| 468 |
def _(mo):
|
| 469 |
+
mo.md(r"""
|
| 470 |
+
One difference between the optimized and the unoptimized versions above is that all of the datatype casts, except for the conversion of the `"status"` column to `pl.Int16`, are performed together at the end. Also, the `filter()` operation is "pushed down" the graph, though it stays after the datatype cast for `"status"`. This is called **predicate pushdown**: the lazy API optimizes the query graph so that filters are performed as early as possible. Since the datatype coercion makes the filter operation more efficient, the graph keeps the cast before the filter.
|
| 471 |
+
""")
|
| 472 |
return
|
| 473 |
|
| 474 |
|
| 475 |
@app.cell(hide_code=True)
|
| 476 |
def _(mo):
|
| 477 |
+
mo.md(r"""
|
| 478 |
+
## Sources and Sinks
|
| 479 |
+
""")
|
| 480 |
return
|
| 481 |
|
| 482 |
|
| 483 |
@app.cell(hide_code=True)
|
| 484 |
def _(mo):
|
| 485 |
+
mo.md(r"""
|
| 486 |
+
For data sources such as Parquet and CSV files, the lazy API provides `scan_*` functions (`scan_parquet`, `scan_csv`, etc.) that lazily read the data into LazyFrames. If queries are chained onto a `scan_*` call, Polars runs the usual query optimizations and delays execution until the query is collected. An added benefit of chaining queries onto `scan_*` operations is that the "scanners" can skip reading columns and rows that aren't required. This is helpful when streaming large datasets as well, as rows are processed in batches before the entire file is read.
|
| 487 |
+
""")
|
| 488 |
return
|
| 489 |
|
| 490 |
|
| 491 |
@app.cell(hide_code=True)
|
| 492 |
def _(mo):
|
| 493 |
+
mo.md(r"""
|
| 494 |
+
The results of a query on a LazyFrame can be saved in streaming mode using the `sink_*` functions (e.g. `sink_parquet`). Sinks support saving data to disk or to the cloud, and are especially helpful with large datasets. The data being sunk can also be partitioned into multiple files if needed, by specifying a suitable partitioning strategy, as shown below.
|
| 495 |
+
""")
|
| 496 |
return
|
| 497 |
|
| 498 |
|
|
|
|
| 512 |
|
| 513 |
@app.cell(hide_code=True)
|
| 514 |
def _(mo):
|
| 515 |
+
mo.md(r"""
|
| 516 |
+
We can also write to multiple sinks at the same time. We just need to define two separate lazy sinks and execute them together by passing both to `pl.collect_all`.
|
| 517 |
+
""")
|
| 518 |
return
|
| 519 |
|
| 520 |
|
|
|
|
| 528 |
|
| 529 |
@app.cell(hide_code=True)
|
| 530 |
def _(mo):
|
| 531 |
+
mo.md(r"""
|
| 532 |
+
## References
|
|
|
|
| 533 |
|
| 534 |
+
1. Polars [documentation](https://docs.pola.rs/user-guide/lazy/)
|
| 535 |
+
""")
|
|
|
|
| 536 |
return
|
| 537 |
|
| 538 |
|
polars/README.md
CHANGED
|
@@ -1,3 +1,8 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Learn Polars
|
| 2 |
|
| 3 |
_🚧 This collection is a work in progress. Please help us add notebooks!_
|
|
@@ -24,4 +29,4 @@ You can also open notebooks in our online playground by appending marimo.app/ to
|
|
| 24 |
* [Péter Gyarmati](https://github.com/peter-gy)
|
| 25 |
* [Joram Mutenge](https://github.com/jorammutenge)
|
| 26 |
* [etrotta](https://github.com/etrotta)
|
| 27 |
-
* [Debajyoti Das](https://github.com/debajyotid2)
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Readme
|
| 3 |
+
marimo-version: 0.18.4
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
# Learn Polars
|
| 7 |
|
| 8 |
_🚧 This collection is a work in progress. Please help us add notebooks!_
|
|
|
|
| 29 |
* [Péter Gyarmati](https://github.com/peter-gy)
|
| 30 |
* [Joram Mutenge](https://github.com/jorammutenge)
|
| 31 |
* [etrotta](https://github.com/etrotta)
|
| 32 |
+
* [Debajyoti Das](https://github.com/debajyotid2)
|
probability/01_sets.py
CHANGED
|
@@ -7,45 +7,47 @@
|
|
| 7 |
|
| 8 |
import marimo
|
| 9 |
|
| 10 |
-
__generated_with = "0.
|
| 11 |
app = marimo.App()
|
| 12 |
|
| 13 |
|
| 14 |
@app.cell(hide_code=True)
|
| 15 |
def _(mo):
|
| 16 |
-
mo.md(
|
| 17 |
-
|
| 18 |
-
# Sets
|
| 19 |
|
| 20 |
-
|
| 21 |
-
|
| 22 |
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
)
|
| 37 |
return
|
| 38 |
|
| 39 |
|
| 40 |
@app.cell(hide_code=True)
|
| 41 |
def _(mo):
|
| 42 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 43 |
return
|
| 44 |
|
| 45 |
|
| 46 |
@app.cell(hide_code=True)
|
| 47 |
def _(mo):
|
| 48 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 49 |
return
|
| 50 |
|
| 51 |
|
|
@@ -65,15 +67,13 @@ def _():
|
|
| 65 |
|
| 66 |
@app.cell(hide_code=True)
|
| 67 |
def _(mo):
|
| 68 |
-
mo.md(
|
| 69 |
-
|
| 70 |
-
Below we explain common operations on sets.
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
)
|
| 77 |
return
|
| 78 |
|
| 79 |
|
|
@@ -85,7 +85,9 @@ def _(A, B):
|
|
| 85 |
|
| 86 |
@app.cell(hide_code=True)
|
| 87 |
def _(mo):
|
| 88 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 89 |
return
|
| 90 |
|
| 91 |
|
|
@@ -97,7 +99,9 @@ def _(A, B):
|
|
| 97 |
|
| 98 |
@app.cell(hide_code=True)
|
| 99 |
def _(mo):
|
| 100 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 101 |
return
|
| 102 |
|
| 103 |
|
|
@@ -109,13 +113,11 @@ def _(A, B):
|
|
| 109 |
|
| 110 |
@app.cell(hide_code=True)
|
| 111 |
def _(mo):
|
| 112 |
-
mo.md(
|
| 113 |
-
|
| 114 |
-
### 🎬 An interactive example
|
| 115 |
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
)
|
| 119 |
return
|
| 120 |
|
| 121 |
|
|
@@ -175,7 +177,7 @@ def _(mo, recommendations, viewer_type):
|
|
| 175 |
**Why these shows?**
|
| 176 |
{explanation[viewer_type.value]}
|
| 177 |
""")
|
| 178 |
-
return
|
| 179 |
|
| 180 |
|
| 181 |
@app.cell(hide_code=True)
|
|
@@ -214,58 +216,54 @@ def _(mo):
|
|
| 214 |
|
| 215 |
@app.cell(hide_code=True)
|
| 216 |
def _(mo):
|
| 217 |
-
mo.md(
|
| 218 |
-
|
| 219 |
-
## 🧮 Set properties
|
| 220 |
|
| 221 |
-
|
| 222 |
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
)
|
| 228 |
return
|
| 229 |
|
| 230 |
|
| 231 |
@app.cell(hide_code=True)
|
| 232 |
def _(mo):
|
| 233 |
-
mo.md(
|
| 234 |
-
|
| 235 |
-
## Set builder notation
|
| 236 |
|
| 237 |
-
|
| 238 |
|
| 239 |
-
|
| 240 |
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
|
| 245 |
-
|
| 246 |
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
)
|
| 250 |
return
|
| 251 |
|
| 252 |
|
| 253 |
-
@app.
|
| 254 |
-
def
|
| 255 |
-
|
| 256 |
-
return x > 0 and x < 10
|
| 257 |
-
return (predicate,)
|
| 258 |
|
| 259 |
|
| 260 |
@app.cell
|
| 261 |
-
def _(
|
| 262 |
set(x for x in range(100) if predicate(x))
|
| 263 |
return
|
| 264 |
|
| 265 |
|
| 266 |
@app.cell(hide_code=True)
|
| 267 |
def _(mo):
|
| 268 |
-
mo.md("""
|
|
|
|
|
|
|
| 269 |
return
|
| 270 |
|
| 271 |
|
|
|
|
| 7 |
|
| 8 |
import marimo
|
| 9 |
|
| 10 |
+
__generated_with = "0.18.4"
|
| 11 |
app = marimo.App()
|
| 12 |
|
| 13 |
|
| 14 |
@app.cell(hide_code=True)
|
| 15 |
def _(mo):
|
| 16 |
+
mo.md(r"""
|
| 17 |
+
# Sets
|
|
|
|
| 18 |
|
| 19 |
+
Probability is the study of "events", assigning numerical values to how likely
|
| 20 |
+
events are to occur. For example, probability lets us quantify how likely it is for it to rain or shine on a given day.
|
| 21 |
|
| 22 |
|
| 23 |
+
Typically we reason about _sets_ of events. In mathematics,
|
| 24 |
+
a set is a collection of elements, with no element included more than once.
|
| 25 |
+
Elements can be any kind of object.
|
| 26 |
|
| 27 |
+
For example:
|
| 28 |
|
| 29 |
+
- ☀️ Weather events: $\{\text{Rain}, \text{Overcast}, \text{Clear}\}$
|
| 30 |
+
- 🎲 Die rolls: $\{1, 2, 3, 4, 5, 6\}$
|
| 31 |
+
- 🪙 Pairs of coin flips: $\{ \text{(Heads, Heads)}, \text{(Heads, Tails)}, \text{(Tails, Tails)}, \text{(Tails, Heads)}\}$
|
| 32 |
|
| 33 |
+
Sets are the building blocks of probability, and will arise frequently in our study.
|
| 34 |
+
""")
|
|
|
|
| 35 |
return
|
| 36 |
|
| 37 |
|
| 38 |
@app.cell(hide_code=True)
|
| 39 |
def _(mo):
|
| 40 |
+
mo.md(r"""
|
| 41 |
+
## Set operations
|
| 42 |
+
""")
|
| 43 |
return
|
| 44 |
|
| 45 |
|
| 46 |
@app.cell(hide_code=True)
|
| 47 |
def _(mo):
|
| 48 |
+
mo.md(r"""
|
| 49 |
+
In Python, sets are made with the `set` function:
|
| 50 |
+
""")
|
| 51 |
return
|
| 52 |
|
| 53 |
|
|
|
|
| 67 |
|
| 68 |
@app.cell(hide_code=True)
|
| 69 |
def _(mo):
|
| 70 |
+
mo.md(r"""
|
| 71 |
+
Below we explain common operations on sets.
|
|
|
|
| 72 |
|
| 73 |
+
_**Try it!** Try modifying the definitions of `A` and `B` above, and see how the results change below._
|
| 74 |
|
| 75 |
+
The **union** $A \cup B$ of sets $A$ and $B$ is the set of elements in $A$, $B$, or both.
|
| 76 |
+
""")
|
|
|
|
| 77 |
return
|
| 78 |
|
| 79 |
|
|
|
|
| 85 |
|
| 86 |
@app.cell(hide_code=True)
|
| 87 |
def _(mo):
|
| 88 |
+
mo.md(r"""
|
| 89 |
+
The **intersection** $A \cap B$ is the set of elements in both $A$ and $B$
|
| 90 |
+
""")
|
| 91 |
return
|
| 92 |
|
| 93 |
|
|
|
|
| 99 |
|
| 100 |
@app.cell(hide_code=True)
|
| 101 |
def _(mo):
|
| 102 |
+
mo.md(r"""
|
| 103 |
+
The **difference** $A \setminus B$ is the set of elements in $A$ that are not in $B$.
|
| 104 |
+
""")
|
| 105 |
return
|
| 106 |
|
| 107 |
|
|
|
|
| 113 |
|
| 114 |
@app.cell(hide_code=True)
|
| 115 |
def _(mo):
|
| 116 |
+
mo.md("""
|
| 117 |
+
### 🎬 An interactive example
|
|
|
|
| 118 |
|
| 119 |
+
Here's a simple example that classifies TV shows into sets by genre, and uses these sets to recommend shows to a user based on their preferences.
|
| 120 |
+
""")
|
|
|
|
| 121 |
return
|
| 122 |
|
| 123 |
|
|
|
|
| 177 |
**Why these shows?**
|
| 178 |
{explanation[viewer_type.value]}
|
| 179 |
""")
|
| 180 |
+
return
|
| 181 |
|
| 182 |
|
| 183 |
@app.cell(hide_code=True)
|
|
|
|
| 216 |
|
| 217 |
@app.cell(hide_code=True)
|
| 218 |
def _(mo):
|
| 219 |
+
mo.md(r"""
|
| 220 |
+
## 🧮 Set properties
|
|
|
|
| 221 |
|
| 222 |
+
Here are some important properties of the set operations:
|
| 223 |
|
| 224 |
+
1. **Commutative**: $A \cup B = B \cup A$
|
| 225 |
+
2. **Associative**: $(A \cup B) \cup C = A \cup (B \cup C)$
|
| 226 |
+
3. **Distributive**: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
|
| 227 |
+
""")
|
|
|
|
| 228 |
return
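These identities are easy to check with Python sets (the example sets below are arbitrary illustrations):

```python
# Arbitrary example sets for a quick sanity check of the properties
A, B, C = {1, 2}, {2, 3}, {3, 4}

assert A | B == B | A                      # commutative
assert (A | B) | C == A | (B | C)          # associative
assert A | (B & C) == (A | B) & (A | C)    # distributive
print("all three properties hold")
```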
|
| 229 |
|
| 230 |
|
| 231 |
@app.cell(hide_code=True)
|
| 232 |
def _(mo):
|
| 233 |
+
mo.md(r"""
|
| 234 |
+
## Set builder notation
|
|
|
|
| 235 |
|
| 236 |
+
To compactly describe the elements in a set, we can use **set builder notation**, which specifies conditions that must be true for elements to be in the set.
|
| 237 |
|
| 238 |
+
For example, here is how to specify the set of positive numbers less than 10:
|
| 239 |
|
| 240 |
+
\[
|
| 241 |
+
\{x \mid 0 < x < 10 \}
|
| 242 |
+
\]
|
| 243 |
|
| 244 |
+
The predicate to the right of the vertical bar $\mid$ specifies conditions that must be true for an element to be in the set; the expression to the left of $\mid$ specifies the value being included.
|
| 245 |
|
| 246 |
+
In Python, set builder notation is called a "set comprehension."
|
| 247 |
+
""")
|
|
|
|
| 248 |
return
|
| 249 |
|
| 250 |
|
| 251 |
+
@app.function
|
| 252 |
+
def predicate(x):
|
| 253 |
+
return x > 0 and x < 10
|
|
|
|
|
|
|
| 254 |
|
| 255 |
|
| 256 |
@app.cell
|
| 257 |
+
def _():
|
| 258 |
set(x for x in range(100) if predicate(x))
|
| 259 |
return
|
| 260 |
|
| 261 |
|
| 262 |
@app.cell(hide_code=True)
|
| 263 |
def _(mo):
|
| 264 |
+
mo.md("""
|
| 265 |
+
**Try it!** Try modifying the `predicate` function above and see how the set changes.
|
| 266 |
+
""")
|
| 267 |
return
|
| 268 |
|
| 269 |
|
probability/02_axioms.py
CHANGED
|
@@ -9,7 +9,7 @@
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
-
__generated_with = "0.
|
| 13 |
app = marimo.App(width="medium")
|
| 14 |
|
| 15 |
|
|
@@ -21,49 +21,43 @@ def _():
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
-
mo.md(
|
| 25 |
-
|
| 26 |
-
# Axioms of Probability
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
)
|
| 34 |
return
|
| 35 |
|
| 36 |
|
| 37 |
@app.cell(hide_code=True)
|
| 38 |
def _(mo):
|
| 39 |
-
mo.md(
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
"""
|
| 53 |
-
)
|
| 54 |
return
|
| 55 |
|
| 56 |
|
| 57 |
@app.cell(hide_code=True)
|
| 58 |
def _(mo):
|
| 59 |
-
mo.md(
|
| 60 |
-
|
| 61 |
-
## Understanding Through Examples
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
)
|
| 67 |
return
|
| 68 |
|
| 69 |
|
|
@@ -144,62 +138,58 @@ def _(event, mo, np, plt):
|
|
| 144 |
""")
|
| 145 |
|
| 146 |
mo.hstack([plt.gcf(), explanation])
|
| 147 |
-
return
|
| 148 |
|
| 149 |
|
| 150 |
@app.cell(hide_code=True)
|
| 151 |
def _(mo):
|
| 152 |
-
mo.md(
|
| 153 |
-
|
| 154 |
-
## Why These Axioms Matter
|
| 155 |
|
| 156 |
-
|
| 157 |
|
| 158 |
-
|
| 159 |
-
|
| 160 |
|
| 161 |
-
|
| 162 |
|
| 163 |
-
|
| 164 |
-
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
)
|
| 170 |
return
|
| 171 |
|
| 172 |
|
| 173 |
@app.cell(hide_code=True)
|
| 174 |
def _(mo):
|
| 175 |
-
mo.md(
|
| 176 |
-
|
| 177 |
-
## 🤔 Test Your Understanding
|
| 178 |
|
| 179 |
-
|
| 180 |
|
| 181 |
-
|
| 182 |
-
|
| 183 |
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
|
| 188 |
-
|
| 189 |
-
|
| 190 |
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
|
| 195 |
-
|
| 196 |
-
|
| 197 |
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
)
|
| 203 |
return
|
| 204 |
|
| 205 |
|
|
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
+
__generated_with = "0.18.4"
|
| 13 |
app = marimo.App(width="medium")
|
| 14 |
|
| 15 |
|
|
|
|
| 21 |
|
| 22 |
@app.cell(hide_code=True)
|
| 23 |
def _(mo):
|
| 24 |
+
mo.md(r"""
|
| 25 |
+
# Axioms of Probability
|
|
|
|
| 26 |
|
| 27 |
+
Probability theory is built on three fundamental axioms, known as the [Kolmogorov axioms](https://en.wikipedia.org/wiki/Probability_axioms). These axioms form
|
| 28 |
+
the mathematical foundation for all of probability theory[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/probability).
|
| 29 |
|
| 30 |
+
Let's explore each axiom and understand why they make intuitive sense:
|
| 31 |
+
""")
|
|
|
|
| 32 |
return
|
| 33 |
|
| 34 |
|
| 35 |
@app.cell(hide_code=True)
|
| 36 |
def _(mo):
|
| 37 |
+
mo.md(r"""
|
| 38 |
+
## The Three Axioms
|
| 39 |
+
|
| 40 |
+
| Axiom | Mathematical Form | Meaning |
|
| 41 |
+
|-------|------------------|----------|
|
| 42 |
+
| **Axiom 1** | $0 \leq P(E) \leq 1$ | All probabilities are between 0 and 1 |
|
| 43 |
+
| **Axiom 2** | $P(S) = 1$ | The probability of the sample space is 1 |
|
| 44 |
+
| **Axiom 3** | $P(E \cup F) = P(E) + P(F)$ | For mutually exclusive events, probabilities add |
|
| 45 |
+
|
| 46 |
+
where the set $S$ is the sample space (all possible outcomes), and $E$ and $F$ are sets that represent events. The notation $P(E)$ denotes the probability of $E$, which you can interpret as the chance that something happens. $P(E) = 0$ means that the event cannot happen, while $P(E) = 1$ means the event will happen no matter what; $P(E) = 0.5$ means that $E$ has a 50% chance of happening.
|
| 47 |
+
|
| 48 |
+
For example, when rolling a fair six-sided die once, the sample space $S$ is the set of die faces $\{1, 2, 3, 4, 5, 6\}$, and there are many possible events; we'll see some examples below.
|
| 49 |
+
""")
|
|
|
|
|
|
|
| 50 |
return
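A toy sanity check of the three axioms for a fair die (the model below is our own illustration, not part of the lesson's code):

```python
from fractions import Fraction

# Model a fair die: each face has probability 1/6
sample_space = {1, 2, 3, 4, 5, 6}
P = {face: Fraction(1, 6) for face in sample_space}

def prob(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(P[outcome] for outcome in event)

assert all(0 <= p <= 1 for p in P.values())            # Axiom 1
assert prob(sample_space) == 1                         # Axiom 2
evens, odds = {2, 4, 6}, {1, 3, 5}                     # mutually exclusive
assert prob(evens | odds) == prob(evens) + prob(odds)  # Axiom 3
```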
|
| 51 |
|
| 52 |
|
| 53 |
@app.cell(hide_code=True)
|
| 54 |
def _(mo):
|
| 55 |
+
mo.md(r"""
|
| 56 |
+
## Understanding Through Examples
|
|
|
|
| 57 |
|
| 58 |
+
Let's explore these axioms using a simple experiment: rolling a fair six-sided die.
|
| 59 |
+
We'll use this to demonstrate why each axiom makes intuitive sense.
|
| 60 |
+
""")
|
|
|
|
| 61 |
return
|
| 62 |
|
| 63 |
|
|
|
|
| 138 |
""")
|
| 139 |
|
| 140 |
mo.hstack([plt.gcf(), explanation])
|
| 141 |
+
return
|
| 142 |
|
| 143 |
|
| 144 |
@app.cell(hide_code=True)
|
| 145 |
def _(mo):
|
| 146 |
+
mo.md(r"""
|
| 147 |
+
## Why These Axioms Matter
|
|
|
|
| 148 |
|
| 149 |
+
These axioms are more than just rules - they provide the foundation for all of probability theory:
|
| 150 |
|
| 151 |
+
1. **Non-negativity** (Axiom 1) makes intuitive sense: you can't have a negative number of occurrences
|
| 152 |
+
in any experiment.
|
| 153 |
|
| 154 |
+
2. **Normalization** (Axiom 2) ensures that something must happen - the total probability must be 1.
|
| 155 |
|
| 156 |
+
3. **Additivity** (Axiom 3) lets us build complex probabilities from simple ones, but only for events
|
| 157 |
+
that can't happen together (mutually exclusive events).
|
| 158 |
|
| 159 |
+
From these simple rules, we can derive all the powerful tools of probability theory that are used in
|
| 160 |
+
statistics, machine learning, and other fields.
|
| 161 |
+
""")
|
|
|
|
| 162 |
return
|
| 163 |
|
| 164 |
|
| 165 |
@app.cell(hide_code=True)
|
| 166 |
def _(mo):
|
| 167 |
+
mo.md(r"""
|
| 168 |
+
## 🤔 Test Your Understanding
|
|
|
|
| 169 |
|
| 170 |
+
Consider rolling two dice. Which of these statements follow from the axioms?
|
| 171 |
|
| 172 |
+
<details>
|
| 173 |
+
<summary>1. P(sum is 13) = 0</summary>
|
| 174 |
|
| 175 |
+
✅ Correct! This follows from Axiom 1. Since no combination of dice can sum to 13,
|
| 176 |
+
the event is impossible and its probability is 0, the smallest value the axiom allows.
|
| 177 |
+
</details>
|
| 178 |
|
| 179 |
+
<details>
|
| 180 |
+
<summary>2. P(sum is 7) + P(sum is not 7) = 1</summary>
|
| 181 |
|
| 182 |
+
✅ Correct! This follows from Axioms 2 and 3. These events are mutually exclusive and cover
|
| 183 |
+
the entire sample space.
|
| 184 |
+
</details>
|
| 185 |
|
| 186 |
+
<details>
|
| 187 |
+
<summary>3. P(first die is 6 or second die is 6) = P(first die is 6) + P(second die is 6)</summary>
|
| 188 |
|
| 189 |
+
❌ Incorrect! This doesn't follow from Axiom 3 because the events are not mutually exclusive -
|
| 190 |
+
you could roll (6,6).
|
| 191 |
+
</details>
|
| 192 |
+
""")
|
|
|
|
| 193 |
return
|
| 194 |
|
| 195 |
|
probability/03_probability_of_or.py
CHANGED
|
@@ -9,7 +9,7 @@
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
-
__generated_with = "0.
|
| 13 |
app = marimo.App(width="medium")
|
| 14 |
|
| 15 |
|
|
@@ -24,43 +24,39 @@ def _():
|
|
| 24 |
import matplotlib.pyplot as plt
|
| 25 |
from matplotlib_venn import venn2
|
| 26 |
import numpy as np
|
| 27 |
-
return
|
| 28 |
|
| 29 |
|
| 30 |
@app.cell(hide_code=True)
|
| 31 |
def _(mo):
|
| 32 |
-
mo.md(
|
| 33 |
-
|
| 34 |
-
# Probability of Or
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
)
|
| 41 |
return
|
| 42 |
|
| 43 |
|
| 44 |
@app.cell(hide_code=True)
|
| 45 |
def _(mo):
|
| 46 |
-
mo.md(
|
| 47 |
-
|
| 48 |
-
## Mutually Exclusive Events
|
| 49 |
|
| 50 |
-
|
| 51 |
-
|
| 52 |
|
| 53 |
-
|
| 54 |
|
| 55 |
-
|
| 56 |
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
)
|
| 64 |
return
|
| 65 |
|
| 66 |
|
|
@@ -90,21 +86,19 @@ def _(are_mutually_exclusive, even_numbers, prime_numbers):
|
|
| 90 |
|
| 91 |
@app.cell(hide_code=True)
|
| 92 |
def _(mo):
|
| 93 |
-
mo.md(
|
| 94 |
-
|
| 95 |
-
## Or with Mutually Exclusive Events
|
| 96 |
|
| 97 |
-
|
| 98 |
|
| 99 |
-
|
| 100 |
|
| 101 |
-
|
| 102 |
|
| 103 |
-
|
| 104 |
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
)
|
| 108 |
return
|
| 109 |
|
| 110 |
|
|
@@ -121,34 +115,28 @@ def _():
|
|
| 121 |
# P(prime) = P(2) + P(3) + P(5)
|
| 122 |
p_prime_mutually_exclusive = prob_union_mutually_exclusive([1/6, 1/6, 1/6])
|
| 123 |
print(f"P(rolling a prime number) = {p_prime_mutually_exclusive}")
|
| 124 |
-
return
|
| 125 |
-
p_even_mutually_exclusive,
|
| 126 |
-
p_prime_mutually_exclusive,
|
| 127 |
-
prob_union_mutually_exclusive,
|
| 128 |
-
)
|
| 129 |
|
| 130 |
|
| 131 |
@app.cell(hide_code=True)
|
| 132 |
def _(mo):
|
| 133 |
-
mo.md(
|
| 134 |
-
|
| 135 |
-
## Or with Non-Mutually Exclusive Events
|
| 136 |
|
| 137 |
-
|
| 138 |
|
| 139 |
-
|
| 140 |
|
| 141 |
-
|
| 142 |
|
| 143 |
-
|
| 144 |
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
)
|
| 152 |
return
|
| 153 |
|
| 154 |
|
|
@@ -166,40 +154,34 @@ def _():
|
|
| 166 |
|
| 167 |
result = prob_union_general(p_prime_general, p_even_general, p_intersection)
|
| 168 |
print(f"P(prime or even) = {p_prime_general} + {p_even_general} - {p_intersection} = {result}")
|
| 169 |
-
return
|
| 170 |
-
p_even_general,
|
| 171 |
-
p_intersection,
|
| 172 |
-
p_prime_general,
|
| 173 |
-
prob_union_general,
|
| 174 |
-
result,
|
| 175 |
-
)
|
| 176 |
|
| 177 |
|
| 178 |
@app.cell(hide_code=True)
|
| 179 |
def _(mo):
|
| 180 |
-
mo.md(
|
| 181 |
-
|
| 182 |
-
### Extension to Three Events
|
| 183 |
|
| 184 |
-
|
| 185 |
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
|
| 190 |
-
|
| 191 |
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
)
|
| 197 |
return
|
| 198 |
|
| 199 |
|
| 200 |
@app.cell(hide_code=True)
|
| 201 |
def _(mo):
|
| 202 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 203 |
return
|
| 204 |
|
| 205 |
|
|
@@ -298,57 +280,53 @@ def _(event_type, mo, plt, venn2):
|
|
| 298 |
plt.gcf(),
|
| 299 |
mo.md(data["explanation"])
|
| 300 |
])
|
| 301 |
-
return
|
| 302 |
|
| 303 |
|
| 304 |
@app.cell(hide_code=True)
|
| 305 |
def _(mo):
|
| 306 |
-
mo.md(
|
| 307 |
-
|
| 308 |
-
## 🤔 Test Your Understanding
|
| 309 |
|
| 310 |
-
|
| 311 |
|
| 312 |
-
|
| 313 |
-
|
| 314 |
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
|
| 318 |
|
| 319 |
-
|
| 320 |
-
|
| 321 |
|
| 322 |
-
|
| 323 |
-
|
| 324 |
|
| 325 |
-
|
| 326 |
-
|
| 327 |
|
| 328 |
-
|
| 329 |
-
|
| 330 |
-
|
| 331 |
-
)
|
| 332 |
return
|
| 333 |
|
| 334 |
|
| 335 |
@app.cell(hide_code=True)
|
| 336 |
def _(mo):
|
| 337 |
-
mo.md(
|
| 338 |
-
|
| 339 |
-
## Summary
|
| 340 |
|
| 341 |
-
|
| 342 |
|
| 343 |
-
|
| 344 |
-
|
| 345 |
-
|
| 346 |
-
|
| 347 |
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
)
|
| 352 |
return
|
| 353 |
|
| 354 |
|
|
|
|
| 9 |
|
| 10 |
import marimo
|
| 11 |
|
| 12 |
+
__generated_with = "0.18.4"
|
| 13 |
app = marimo.App(width="medium")
|
| 14 |
|
| 15 |
|
|
|
|
| 24 |
import matplotlib.pyplot as plt
|
| 25 |
from matplotlib_venn import venn2
|
| 26 |
import numpy as np
|
| 27 |
+
return plt, venn2
|
| 28 |
|
| 29 |
|
| 30 |
@app.cell(hide_code=True)
|
| 31 |
def _(mo):
|
| 32 |
+
mo.md(r"""
|
| 33 |
+
# Probability of Or
|
|
|
|
| 34 |
|
| 35 |
+
When calculating the probability of either one event _or_ another occurring, we need to be careful about how we combine probabilities. The method depends on whether the events can happen together[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/prob_or/).
|
| 36 |
|
| 37 |
+
Let's explore how to calculate $P(E \cup F)$, i.e. $P(E \text{ or } F)$, in different scenarios.
|
| 38 |
+
""")
|
|
|
|
| 39 |
return
|
| 40 |
|
| 41 |
|
| 42 |
@app.cell(hide_code=True)
|
| 43 |
def _(mo):
|
| 44 |
+
mo.md(r"""
|
| 45 |
+
## Mutually Exclusive Events
|
|
|
|
| 46 |
|
| 47 |
+
Two events $E$ and $F$ are **mutually exclusive** if they cannot occur simultaneously.
|
| 48 |
+
In set notation, this means:
|
| 49 |
|
| 50 |
+
$E \cap F = \emptyset$
|
| 51 |
|
| 52 |
+
For example:
|
| 53 |
|
| 54 |
+
- Rolling an even number (2,4,6) vs rolling an odd number (1,3,5)
|
| 55 |
+
- Drawing a heart vs drawing a spade from a deck
|
| 56 |
+
- Passing vs failing a test
|
| 57 |
|
| 58 |
+
Here's a Python function to check if two sets of outcomes are mutually exclusive:
|
| 59 |
+
""")
|
|
|
|
| 60 |
return
|
| 61 |
|
| 62 |
|
|
|
|
| 86 |
|
| 87 |
@app.cell(hide_code=True)
|
| 88 |
def _(mo):
|
| 89 |
+
mo.md(r"""
|
| 90 |
+
## Or with Mutually Exclusive Events
|
|
|
|
| 91 |
|
| 92 |
+
For mutually exclusive events, the probability of either event occurring is simply the sum of their individual probabilities:
|
| 93 |
|
| 94 |
+
$P(E \cup F) = P(E) + P(F)$
|
| 95 |
|
| 96 |
+
This extends to multiple events. For $n$ mutually exclusive events $E_1, E_2, \ldots, E_n$:
|
| 97 |
|
| 98 |
+
$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^n P(E_i)$
|
| 99 |
|
| 100 |
+
Let's implement this calculation:
|
| 101 |
+
""")
|
|
|
|
| 102 |
return
|
| 103 |
|
| 104 |
|
|
|
|
| 115 |
# P(prime) = P(2) + P(3) + P(5)
|
| 116 |
p_prime_mutually_exclusive = prob_union_mutually_exclusive([1/6, 1/6, 1/6])
|
| 117 |
print(f"P(rolling a prime number) = {p_prime_mutually_exclusive}")
|
| 118 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
| 119 |
|
| 120 |
|
| 121 |
@app.cell(hide_code=True)
|
| 122 |
def _(mo):
|
| 123 |
+
mo.md(r"""
|
| 124 |
+
## Or with Non-Mutually Exclusive Events
|
|
|
|
| 125 |
|
| 126 |
+
When events can occur together, we need to use the **inclusion-exclusion principle**:
|
| 127 |
|
| 128 |
+
$P(E \cup F) = P(E) + P(F) - P(E \cap F)$
|
| 129 |
|
| 130 |
+
Why subtract $P(E \cap F)$? Because when we add $P(E)$ and $P(F)$, we count the overlap twice!
|
| 131 |
|
| 132 |
+
For example, consider calculating $P(\text{prime or even})$ when rolling a die:
|
| 133 |
|
| 134 |
+
- Prime numbers: {2, 3, 5}
|
| 135 |
+
- Even numbers: {2, 4, 6}
|
| 136 |
+
- The number 2 is counted twice unless we subtract its probability
|
| 137 |
|
| 138 |
+
Here's how to implement this calculation:
|
| 139 |
+
""")
|
|
|
|
| 140 |
return
|
| 141 |
|
| 142 |
|
|
|
|
| 154 |
|
| 155 |
result = prob_union_general(p_prime_general, p_even_general, p_intersection)
|
| 156 |
print(f"P(prime or even) = {p_prime_general} + {p_even_general} - {p_intersection} = {result}")
|
| 157 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 158 |
|
| 159 |
|
| 160 |
@app.cell(hide_code=True)
|
| 161 |
def _(mo):
|
| 162 |
+
mo.md(r"""
|
| 163 |
+
### Extension to Three Events
|
|
|
|
| 164 |
|
| 165 |
+
For three events, the inclusion-exclusion principle becomes:
|
| 166 |
|
| 167 |
+
$P(E_1 \cup E_2 \cup E_3) = P(E_1) + P(E_2) + P(E_3)$
|
| 168 |
+
$- P(E_1 \cap E_2) - P(E_1 \cap E_3) - P(E_2 \cap E_3)$
|
| 169 |
+
$+ P(E_1 \cap E_2 \cap E_3)$
|
| 170 |
|
| 171 |
+
The pattern is:
|
| 172 |
|
| 173 |
+
1. Add individual probabilities
|
| 174 |
+
2. Subtract probabilities of pairs
|
| 175 |
+
3. Add probability of triple intersection
|
| 176 |
+
""")
|
|
|
|
| 177 |
return
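A sketch of this pattern for a single die roll (the helper function and events below are our own illustration):

```python
from fractions import Fraction

def prob_union_three(p1, p2, p3, p12, p13, p23, p123):
    """Inclusion-exclusion for three possibly overlapping events."""
    return p1 + p2 + p3 - p12 - p13 - p23 + p123

# E1 = even {2,4,6}, E2 = prime {2,3,5}, E3 = greater than 3 {4,5,6}
sixth = Fraction(1, 6)
p = prob_union_three(
    3 * sixth, 3 * sixth, 3 * sixth,  # P(E1), P(E2), P(E3)
    1 * sixth,  # P(E1 ∩ E2) = P({2})
    2 * sixth,  # P(E1 ∩ E3) = P({4, 6})
    1 * sixth,  # P(E2 ∩ E3) = P({5})
    0 * sixth,  # P(E1 ∩ E2 ∩ E3) = P({})
)
print(p)  # the union is {2, 3, 4, 5, 6}, so this is 5/6
```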
|
| 178 |
|
| 179 |
|
| 180 |
@app.cell(hide_code=True)
|
| 181 |
def _(mo):
|
| 182 |
+
mo.md(r"""
|
| 183 |
+
### Interactive example:
|
| 184 |
+
""")
|
| 185 |
return
|
| 186 |
|
| 187 |
|
|
|
|
| 280 |
plt.gcf(),
|
| 281 |
mo.md(data["explanation"])
|
| 282 |
])
|
| 283 |
+
return
|
| 284 |
|
| 285 |
|
| 286 |
@app.cell(hide_code=True)
|
| 287 |
def _(mo):
|
| 288 |
+
mo.md(r"""
|
| 289 |
+
## 🤔 Test Your Understanding
|
|
|
|
| 290 |
|
| 291 |
+
Consider rolling a six-sided die. Which of these statements are true?
|
| 292 |
|
| 293 |
+
<details>
|
| 294 |
+
<summary>1. P(even or less than 3) = P(even) + P(less than 3)</summary>
|
| 295 |
|
| 296 |
+
❌ Incorrect! These events are not mutually exclusive (2 is both even and less than 3).
|
| 297 |
+
We need to use the inclusion-exclusion principle.
|
| 298 |
+
</details>
|
| 299 |
|
| 300 |
+
<details>
|
| 301 |
+
<summary>2. P(even or greater than 4) = 4/6</summary>
|
| 302 |
|
| 303 |
+
✅ Correct! {2,4,6} ∪ {5,6} = {2,4,5,6}, so probability is 4/6.
|
| 304 |
+
</details>
|
| 305 |
|
| 306 |
+
<details>
|
| 307 |
+
<summary>3. P(prime or odd) = 5/6</summary>
|
| 308 |
|
| 309 |
+
✅ Correct! {2,3,5} ∪ {1,3,5} = {1,2,3,5}, so probability is 5/6.
|
| 310 |
+
</details>
|
| 311 |
+
""")
|
|
|
|
| 312 |
return
|
| 313 |
|
| 314 |
|
| 315 |
@app.cell(hide_code=True)
|
| 316 |
def _(mo):
|
| 317 |
+
mo.md("""
|
| 318 |
+
## Summary
|
|
|
|
| 319 |
|
| 320 |
+
You've learned:
|
| 321 |
|
| 322 |
+
- How to identify mutually exclusive events
|
| 323 |
+
- The addition rule for mutually exclusive events
|
| 324 |
+
- The inclusion-exclusion principle for overlapping events
|
| 325 |
+
- How to extend these concepts to multiple events
|
| 326 |
|
| 327 |
+
In the next lesson, we'll explore **conditional probability** - how the probability
|
| 328 |
+
of one event changes when we know another event has occurred.
|
| 329 |
+
""")
|
|
|
|
| 330 |
return
|
| 331 |
|
| 332 |
|
probability/04_conditional_probability.py
CHANGED

@@ -10,7 +10,7 @@
 import marimo
 
-__generated_with = "0.
+__generated_with = "0.18.4"
 app = marimo.App(width="medium", app_title="Conditional Probability")
 
@@ -22,42 +22,38 @@ def _():
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # Conditional Probability
-    )
+    mo.md(r"""
+    # Conditional Probability
+
+    _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/cond_prob/), by Stanford professor Chris Piech._
+
+    In probability theory, we often want to update our beliefs when we receive new information.
+    Conditional probability helps us formalize this process by calculating "_what is the chance of
+    event $E$ happening given that we have already observed some other event $F$?_"[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/cond_prob/)
+
+    When we condition on an event $F$:
+
+    - We enter the universe where $F$ has occurred
+    - Only outcomes consistent with $F$ are possible
+    - Our sample space reduces to $F$
+    """)
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Definition of Conditional Probability
-    )
+    mo.md(r"""
+    ## Definition of Conditional Probability
+
+    The probability of event $E$ given that event $F$ has occurred is denoted as $P(E \mid F)$ and is defined as:
+
+    $$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
+
+    This formula tells us that the conditional probability is the probability of both events occurring
+    divided by the probability of the conditioning event.
+
+    Let's start with a visual example.
+    """)
     return
 
@@ -66,7 +62,7 @@ def _():
     import matplotlib.pyplot as plt
     from matplotlib_venn import venn3
     import numpy as np
-    return
+    return plt, venn3
 
@@ -138,75 +134,73 @@ def _(mo, plt, venn3):
     """)
 
     mo.vstack([mo.center(plt.gcf()), explanation])
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-    )
+    mo.md(r"""
+    Next, here's a function that computes $P(E \mid F)$, given $P(E \cap F)$ and $P(F)$
+    """)
     return
 
 
-@app.
-def
-        raise ValueError("P(E∩F) cannot be greater than P(F)")
-    return (conditional_probability,)
+@app.function
+def conditional_probability(p_intersection, p_condition):
+    if p_condition == 0:
+        raise ValueError("Cannot condition on an impossible event")
+    if p_intersection > p_condition:
+        raise ValueError("P(E∩F) cannot be greater than P(F)")
+
+    return p_intersection / p_condition
 
 
 @app.cell
-def _(
+def _():
     # Example 1: Rolling a die
     # E: Rolling an even number (2,4,6)
     # F: Rolling a number greater than 3 (4,5,6)
     p_even_given_greater_than_3 = conditional_probability(2 / 6, 3 / 6)
     print("Example 1: Rolling a die")
     print(f"P(Even | >3) = {p_even_given_greater_than_3}")  # Should be 2/3
     return
 
 
 @app.cell
-def _(
+def _():
     # Example 2: Cards
     # E: Drawing a Heart
     # F: Drawing a Face card (J,Q,K)
     p_heart_given_face = conditional_probability(3 / 52, 12 / 52)
     print("\nExample 2: Drawing cards")
     print(f"P(Heart | Face card) = {p_heart_given_face}")  # Should be 1/4
     return
 
 
 @app.cell
-def _(
+def _():
     # Example 3: Student grades
     # E: Getting an A
     # F: Studying more than 3 hours
     p_a_given_study = conditional_probability(0.24, 0.40)
     print("\nExample 3: Student grades")
     print(f"P(A | Studied >3hrs) = {p_a_given_study}")  # Should be 0.6
     return
 
 
 @app.cell
-def _(
+def _():
     # Example 4: Weather
     # E: Raining
     # F: Cloudy
     p_rain_given_cloudy = conditional_probability(0.15, 0.30)
     print("\nExample 4: Weather")
     print(f"P(Rain | Cloudy) = {p_rain_given_cloudy}")  # Should be 0.5
     return
 
 
 @app.cell
-def _(
+def _():
     # Example 5: Error cases
     print("\nExample 5: Error cases")
     try:
 
@@ -225,72 +219,66 @@ def _(conditional_probability):
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## The Conditional Paradigm
-    )
+    mo.md(r"""
+    ## The Conditional Paradigm
+
+    When we condition on an event, we enter a new probability universe. In this universe:
+
+    1. All probability axioms still hold
+    2. We must consistently condition on the same event
+    3. Our sample space becomes the conditioning event
+
+    Here's how our familiar probability rules look when conditioned on event $G$:
+
+    | Rule | Original | Conditioned on $G$ |
+    |------|----------|-------------------|
+    | Axiom 1 | $0 \leq P(E) \leq 1$ | $0 \leq P(E \mid G) \leq 1$ |
+    | Axiom 2 | $P(S) = 1$ | $P(S \mid G) = 1$ |
+    | Axiom 3* | $P(E \cup F) = P(E) + P(F)$ | $P(E \cup F \mid G) = P(E \mid G) + P(F \mid G)$ |
+    | Complement | $P(E^C) = 1 - P(E)$ | $P(E^C \mid G) = 1 - P(E \mid G)$ |
+
+    *_For mutually exclusive events_
+    """)
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Multiple Conditions
-    )
+    mo.md(r"""
+    ## Multiple Conditions
+
+    We can condition on multiple events. The notation $P(E \mid F,G)$ means "_the probability of $E$
+    occurring, given that both $F$ and $G$ have occurred._"
+
+    The conditional probability formula still holds in the universe where $G$ has occurred:
+
+    $$P(E \mid F,G) = \frac{P(E \cap F \mid G)}{P(F \mid G)}$$
+
+    This is a powerful extension that allows us to update our probabilities as we receive
+    multiple pieces of information.
+    """)
     return
 
 
-@app.
-def
-        return p_intersection_all / p_intersection_conditions
-    return (multiple_conditional_probability,)
+@app.function
+def multiple_conditional_probability(
+    p_intersection_all, p_intersection_conditions, p_condition
+):
+    """Calculate P(E|F,G) = P(E∩F|G)/P(F|G) = P(E∩F∩G)/P(F∩G)"""
+    if p_condition == 0:
+        raise ValueError("Cannot condition on an impossible event")
+    if p_intersection_conditions == 0:
+        raise ValueError(
+            "Cannot condition on an impossible combination of events"
+        )
+    if p_intersection_all > p_intersection_conditions:
+        raise ValueError("P(E∩F∩G) cannot be greater than P(F∩G)")
+
+    return p_intersection_all / p_intersection_conditions
 
 
 @app.cell
-def _(
+def _():
     # Example: College admissions
     # E: Getting admitted
     # F: High GPA
 
@@ -310,58 +298,54 @@ def _(multiple_conditional_probability):
         multiple_conditional_probability(0.3, 0.2, 0.2)
     except ValueError as e:
         print(f"\nError case: {e}")
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-    """
-    )
+    mo.md(r"""
+    ## 🤔 Test Your Understanding
+
+    Which of these statements about conditional probability are true?
+
+    <details>
+    <summary>Knowing F occurred always decreases the probability of E</summary>
+    ❌ False! Conditioning on F can either increase or decrease P(E), depending on how E and F are related.
+    </details>
+
+    <details>
+    <summary>P(E|F) represents entering a new probability universe where F has occurred</summary>
+    ✅ True! We restrict ourselves to only the outcomes where F occurred, making F our new sample space.
+    </details>
+
+    <details>
+    <summary>If P(E|F) = P(E), then E and F must be the same event</summary>
+    ❌ False! This actually means E and F are independent - knowing one doesn't affect the other.
+    </details>
+
+    <details>
+    <summary>P(E|F) can be calculated by dividing P(E∩F) by P(F)</summary>
+    ✅ True! This is the fundamental definition of conditional probability.
+    </details>
+    """)
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Summary
-    )
+    mo.md(r"""
+    ## Summary
+
+    You've learned:
+
+    - How conditional probability updates our beliefs with new information
+    - The formula $P(E \mid F) = P(E \cap F)/P(F)$ and its intuition
+    - How probability rules work in conditional universes
+    - How to handle multiple conditions
+
+    In the next lesson, we'll explore **independence** - when knowing about one event
+    tells us nothing about another.
+    """)
     return
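The `conditional_probability` helper this diff introduces boils down to the defining formula plus two guard rails. As a quick standalone sketch (mirroring the notebook's helper, but runnable outside marimo):

```python
def conditional_probability(p_intersection, p_condition):
    """P(E|F) = P(E∩F) / P(F), with basic validity checks."""
    if p_condition == 0:
        # Conditioning on an event of probability 0 is undefined
        raise ValueError("Cannot condition on an impossible event")
    if p_intersection > p_condition:
        # E∩F is a subset of F, so its probability cannot exceed P(F)
        raise ValueError("P(E∩F) cannot be greater than P(F)")
    return p_intersection / p_condition


# Die roll: P(even | >3) = P({4,6}) / P({4,5,6}) = 2/3
print(conditional_probability(2 / 6, 3 / 6))
```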
probability/05_independence.py
CHANGED

@@ -7,7 +7,7 @@
 import marimo
 
-__generated_with = "0.
+__generated_with = "0.18.4"
 app = marimo.App()
 
@@ -19,88 +19,84 @@ def _():
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        # Independence in Probability Theory
-    )
+    mo.md(r"""
+    # Independence in Probability Theory
+
+    _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/independence/), by Stanford professor Chris Piech._
+
+    In probability theory, independence is a fundamental concept that helps us understand
+    when events don't influence each other. Two events are independent if knowing the
+    outcome of one event doesn't change our belief about the other event occurring.
+
+    ## Definition of Independence
+
+    Two events $E$ and $F$ are independent if:
+
+    $$P(E|F) = P(E)$$
+
+    This means that knowing $F$ occurred doesn't change the probability of $E$ occurring.
+
+    ### _Alternative Definition_
+
+    Using the chain rule, we can derive another equivalent definition:
+
+    $$P(E \cap F) = P(E) \cdot P(F)$$
+    """)
    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Independence is Symmetric
-    )
+    mo.md(r"""
+    ## Independence is Symmetric
+
+    This property is symmetric: if $E$ is independent of $F$, then $F$ is independent of $E$.
+    We can prove this using Bayes' Theorem:
+
+    \[P(E|F) = \frac{P(F|E)P(E)}{P(F)}\]
+
+    \[= \frac{P(F)P(E)}{P(F)}\]
+
+    \[= P(E)\]
+
+    ## Independence and Complements
+
+    Given independent events $A$ and $B$, we can prove that $A$ and $B^C$ are also independent:
+
+    \[P(AB^C) = P(A) - P(AB)\]
+
+    \[= P(A) - P(A)P(B)\]
+
+    \[= P(A)(1 - P(B))\]
+
+    \[= P(A)P(B^C)\]
+
+    ## Generalized Independence
+
+    Events $E_1, E_2, \ldots, E_n$ are independent if for every subset with $r$ elements (where $r \leq n$):
+
+    \[P(E_1, E_2, \ldots, E_r) = \prod_{i=1}^r P(E_i)\]
+
+    For example, consider getting 5 heads on 5 coin flips. Let $H_i$ be the event that the $i$th flip is heads:
+
+    \[P(H_1, H_2, H_3, H_4, H_5) = P(H_1)P(H_2)P(H_3)P(H_4)P(H_5)\]
+
+    \[= \prod_{i=1}^5 P(H_i)\]
+
+    \[= \left(\frac{1}{2}\right)^5 = 0.03125\]
+
+    ## Conditional Independence
+
+    Events $E_1, E_2, E_3$ are conditionally independent given event $F$ if:
+
+    \[P(E_1, E_2, E_3 | F) = P(E_1|F)P(E_2|F)P(E_3|F)\]
+
+    This can be written more succinctly using product notation:
+
+    \[P(E_1, E_2, E_3 | F) = \prod_{i=1}^3 P(E_i|F)\]
+    """)
     return
 
@@ -121,21 +117,19 @@ def _(mo):
         callout_text,
         kind="warn"
     )
     return
 
 
-@app.
-def
-    tolerance = 1e-5  # Stricter tolerance for comparison
-    return (check_independence,)
+@app.function
+def check_independence(p_e, p_f, p_intersection):
+    expected = p_e * p_f
+    tolerance = 1e-5  # Stricter tolerance for comparison
+
+    return abs(p_intersection - expected) < tolerance
 
 
 @app.cell
-def _(
+def _(mo):
     # Example 1: Rolling dice
     p_first_even = 0.5  # P(First die is even)
     p_second_six = 1/6  # P(Second die is 6)
 
@@ -157,11 +151,11 @@ def _(check_independence, mo):
     </details>
     """
     mo.md(example1)
     return
 
 
 @app.cell
-def _(
+def _(mo):
     # Example 2: Drawing cards (dependent events)
     p_first_heart = 13/52  # P(First card is heart)
     p_second_heart = 12/51  # P(Second card is heart | First was heart)
 
@@ -192,18 +186,11 @@ def _(check_independence, mo):
     </details>
     """
     mo.md(example2)
-    return (
-        cards_independent,
-        example2,
-        p_both_hearts,
-        p_first_heart,
-        p_second_heart,
-        theoretical_if_independent,
-    )
+    return
 
 
 @app.cell
-def _(
+def _(mo):
     # Example 3: Computer system
     p_hardware = 0.02  # P(Hardware failure)
     p_software = 0.03  # P(Software crash)
 
@@ -224,60 +211,54 @@ def _(check_independence, mo):
     </details>
     """
     mo.md(example3)
-    return (
-        example3,
-        p_both_failure,
-        p_hardware,
-        p_software,
-        system_independent,
-    )
+    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Establishing Independence
-    )
+    mo.md(r"""
+    ## Establishing Independence
+
+    In practice, we can establish independence through:
+
+    1. **Mathematical Verification**: Show that P(E∩F) = P(E)P(F)
+    2. **Empirical Testing**: Analyze data to check if events appear independent
+    3. **Domain Knowledge**: Use understanding of the system to justify independence
+
+    > **Note**: Perfect independence is rare in real data. We often make independence assumptions
+    when dependencies are negligible and the simplification is useful.
+
+    ## Backup Systems in Space Missions
+
+    Consider a space mission with two backup life support systems:
+
+    $$P(\text{Primary fails}) = p_1$$
+
+    $$P(\text{Secondary fails}) = p_2$$
+
+    If the systems are truly independent (different power sources, separate locations, distinct technologies):
+
+    $$P(\text{Life support fails}) = p_1p_2$$
+
+    For example:
+
+    - If $p_1 = 0.01$ and $p_2 = 0.02$ (99% and 98% reliable)
+    - Then $P(\text{Total failure}) = 0.0002$ (99.98% reliable)
+
+    However, if both systems share vulnerabilities (same radiation exposure, temperature extremes):
+
+    $$P(\text{Life support fails}) > p_1p_2$$
+
+    This example shows why space agencies invest heavily in ensuring true independence of backup systems.
+    """)
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
+    mo.md(r"""
+    ## Interactive Example
+    """)
     return
 
@@ -292,7 +273,7 @@ def _(mo):
     flip_button = mo.ui.run_button(label="Flip Coins!", kind="info")
     reset_button = mo.ui.run_button(label="Reset", kind="danger")
     stats_display = mo.md("*Click 'Flip Coins!' to start simulation*")
     return flip_button, reset_button
 
@@ -337,95 +318,80 @@ def _(flip_button, mo, np, reset_button):
 
     new_stats_display = mo.md(stats)
     new_stats_display
-    return (
-        coin1,
-        coin2,
-        new_stats_display,
-        p_both_h,
-        p_h1,
-        p_h2,
-        p_product,
-        stats,
-    )
+    return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Understanding the Simulation
-    )
+    mo.md("""
+    ## Understanding the Simulation
+
+    This simulation demonstrates independence using coin flips, where each coin's outcome is unaffected by the other.
+
+    ### Reading the Results
+
+    1. **Individual Probabilities:**
+
+        - P(H₁): 1 if heads, 0 if tails on first coin
+        - P(H₂): 1 if heads, 0 if tails on second coin
+
+    2. **Testing Independence:**
+
+        - P(Both Heads): 1 if both show heads, 0 otherwise
+        - P(H₁)P(H₂): Product of individual results
+
+    > **Note**: Each click performs a new independent trial. While a single flip shows binary outcomes (0 or 1),
+    the theoretical probability is 0.5 for each coin and 0.25 for both heads.
+    """)
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-    """
-    )
+    mo.md(r"""
+    ## 🤔 Test Your Understanding
+
+    Which of these statements about independence are true?
+
+    <details>
+    <summary>If P(E|F) = P(E), then E and F are independent</summary>
+    ✅ True! This is one definition of independence - knowing F occurred doesn't change the probability of E.
+    </details>
+
+    <details>
+    <summary>Independent events cannot occur simultaneously</summary>
+    ❌ False! Independent events can and do occur together - their joint probability is just the product of their individual probabilities.
+    </details>
+
+    <details>
+    <summary>If P(E∩F) = P(E)P(F), then E and F are independent</summary>
+    ✅ True! This is the multiplicative definition of independence.
+    </details>
+
+    <details>
+    <summary>Independence is symmetric: if E is independent of F, then F is independent of E</summary>
+    ✅ True! The definition P(E∩F) = P(E)P(F) is symmetric in E and F.
+    </details>
+
+    <details>
+    <summary>Three events being pairwise independent means they are mutually independent</summary>
+    ❌ False! Pairwise independence doesn't guarantee mutual independence - we need to check all combinations.
+    </details>
+    """)
     return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        ## Summary
-    )
+    mo.md("""
+    ## Summary
+
+    In this exploration of probability independence, we've discovered how to recognize when events truly don't influence each other. Through the lens of both mathematical definitions and interactive examples, we've seen how independence manifests in scenarios ranging from simple coin flips to critical system designs.
+
+    The power of independence lies in its simplicity: when events are independent, we can multiply their individual probabilities to understand their joint behavior. Yet, as our examples showed, true independence is often more nuanced than it first appears. What seems independent might harbor hidden dependencies, and what appears dependent might be independent under certain conditions.
+
+    _The art lies not just in calculating probabilities, but in developing the intuition to recognize independence in real-world scenarios - a skill essential for making informed decisions in uncertain situations._
+    """)
     return
 
@@ -433,7 +399,7 @@ def _(mo):
 def _():
     import numpy as np
     import pandas as pd
-    return np,
+    return (np,)
 
 
 if __name__ == "__main__":
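The `check_independence` helper this diff adds is a multiplicative-rule test: events are independent exactly when P(E∩F) = P(E)P(F). A standalone sketch of the same idea, runnable outside marimo:

```python
def check_independence(p_e, p_f, p_intersection, tolerance=1e-5):
    """Return True when P(E∩F) matches P(E)·P(F) within a numeric tolerance."""
    return abs(p_intersection - p_e * p_f) < tolerance


# Two dice: first die even, second die shows a six (independent)
print(check_independence(0.5, 1 / 6, 0.5 * (1 / 6)))  # True

# Two hearts drawn without replacement (dependent)
print(check_independence(13 / 52, 13 / 52, (13 / 52) * (12 / 51)))  # False
```

The tolerance matters with empirical frequencies, where sampling noise keeps the product from matching exactly.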
probability/06_probability_of_and.py
CHANGED
@@ -9,7 +9,7 @@

 import marimo

-__generated_with = "0.
 app = marimo.App(width="medium")


@@ -28,38 +28,34 @@ def _():

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## And with Independent Events
-
-
-
-
-
-
-    )
     return


@@ -73,7 +69,7 @@ def _():
     p_heads = 1/2  # P(getting heads)
     p_both = calc_independent_prob(p_six, p_heads)
     print(f"Example 1: P(rolling 6 AND getting heads) = {p_six:.3f} × {p_heads:.3f} = {p_both:.3f}")
-    return calc_independent_prob,


 @app.cell
@@ -83,30 +79,28 @@ def _(calc_independent_prob):
     p_disk_fail = 0.03  # P(disk failure)
     p_both_fail = calc_independent_prob(p_cpu_fail, p_disk_fail)
     print(f"Example 2: P(both CPU and disk failing) = {p_cpu_fail:.3f} × {p_disk_fail:.3f} = {p_both_fail:.3f}")
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## And with Dependent Events
-
-
-
-
-
-    )
     return


@@ -120,7 +114,7 @@ def _():
     p_second_heart = 12/51  # P(second heart | first heart)
     p_both_hearts = calc_dependent_prob(p_first_heart, p_second_heart)
     print(f"Example 1: P(two hearts) = {p_first_heart:.3f} × {p_second_heart:.3f} = {p_both_hearts:.3f}")
-    return calc_dependent_prob,


 @app.cell
@@ -130,32 +124,32 @@ def _(calc_dependent_prob):
     p_second_ace = 3/51  # P(second ace | first ace)
     p_both_aces = calc_dependent_prob(p_first_ace, p_second_ace)
     print(f"Example 2: P(two aces) = {p_first_ace:.3f} × {p_second_ace:.3f} = {p_both_aces:.3f}")
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Multiple Events
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return


@@ -250,65 +244,61 @@ def _(event_type, mo, plt, venn2):

     # Display explanation alongside visualization
     mo.hstack([plt.gcf(), mo.md(data["explanation"])])
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## 🤔 Test Your Understanding
-
-
-
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Summary
-
-
-
-
-    )
     return
@@ -9,7 +9,7 @@

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium")


@@ -28,38 +28,34 @@

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Probability of And
+
+    _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/prob_and/), by Stanford professor Chris Piech._
+
+    When calculating the probability of both events occurring together, we need to consider whether the events are independent or dependent.
+    Let's explore how to calculate $P(E \cap F)$, i.e. $P(E \text{ and } F)$, in different scenarios.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## And with Independent Events
+
+    Two events $E$ and $F$ are **independent** if knowing one event occurred doesn't affect the probability of the other.
+    For independent events:
+
+    $P(E \text{ and } F) = P(E) \cdot P(F)$
+
+    For example:
+
+    - Rolling a 6 on one die and getting heads on a coin flip
+    - Drawing a heart from a deck, replacing it, and drawing another heart
+    - Getting a computer error on Monday vs. Tuesday
+
+    Here's a Python function to calculate probability for independent events:
+    """)
     return
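The hunks above call `calc_independent_prob` without ever showing its definition. Consistent with the calls shown, a minimal sketch might be:

```python
def calc_independent_prob(p_a, p_b):
    # For independent events, P(A and B) = P(A) * P(B)
    return p_a * p_b

# Reproduces the notebook's first example: P(rolling 6 AND getting heads)
print(f"{calc_independent_prob(1/6, 1/2):.3f}")  # 0.083
```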
@@ -73,7 +69,7 @@

     p_heads = 1/2  # P(getting heads)
     p_both = calc_independent_prob(p_six, p_heads)
     print(f"Example 1: P(rolling 6 AND getting heads) = {p_six:.3f} × {p_heads:.3f} = {p_both:.3f}")
+    return (calc_independent_prob,)


 @app.cell
@@ -83,30 +79,28 @@

     p_disk_fail = 0.03  # P(disk failure)
     p_both_fail = calc_independent_prob(p_cpu_fail, p_disk_fail)
     print(f"Example 2: P(both CPU and disk failing) = {p_cpu_fail:.3f} × {p_disk_fail:.3f} = {p_both_fail:.3f}")
+    return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## And with Dependent Events
+
+    For dependent events, we use the **chain rule**:
+
+    $P(E \text{ and } F) = P(E) \cdot P(F|E)$
+
+    where $P(F|E)$ is the probability of $F$ occurring given that $E$ has occurred.
+
+    For example:
+
+    - Drawing two hearts without replacement
+    - Being dealt two aces in poker
+    - System failures in connected components
+
+    Let's implement this calculation:
+    """)
     return
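As with the independent case, the body of `calc_dependent_prob` lies outside the hunks shown; a minimal version matching the calls below applies the chain rule directly:

```python
def calc_dependent_prob(p_a, p_b_given_a):
    # Chain rule: P(A and B) = P(A) * P(B|A)
    return p_a * p_b_given_a

# Two hearts without replacement: 13/52 on the first draw, then 12/51
print(f"{calc_dependent_prob(13/52, 12/51):.3f}")  # 0.059
```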
@@ -120,7 +114,7 @@

     p_second_heart = 12/51  # P(second heart | first heart)
     p_both_hearts = calc_dependent_prob(p_first_heart, p_second_heart)
     print(f"Example 1: P(two hearts) = {p_first_heart:.3f} × {p_second_heart:.3f} = {p_both_hearts:.3f}")
+    return (calc_dependent_prob,)


 @app.cell
@@ -130,32 +124,32 @@

     p_second_ace = 3/51  # P(second ace | first ace)
     p_both_aces = calc_dependent_prob(p_first_ace, p_second_ace)
     print(f"Example 2: P(two aces) = {p_first_ace:.3f} × {p_second_ace:.3f} = {p_both_aces:.3f}")
+    return
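The card examples can be sanity-checked by simulation. This sketch is not part of the notebook; the deck encoding, trial count, and seed are arbitrary choices:

```python
import random

def simulate_two_hearts(trials=100_000, seed=0):
    rng = random.Random(seed)
    deck = ["heart"] * 13 + ["other"] * 39  # 52 cards, grouped by suit class
    hits = 0
    for _ in range(trials):
        first, second = rng.sample(deck, 2)  # draw two cards without replacement
        hits += first == "heart" and second == "heart"
    return hits / trials

print(simulate_two_hearts())  # close to 13/52 * 12/51 ≈ 0.0588
```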
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Multiple Events
+
+    For multiple independent events:
+
+    $P(E_1 \text{ and } E_2 \text{ and } \cdots \text{ and } E_n) = \prod_{i=1}^n P(E_i)$
+
+    For dependent events:
+
+    $P(E_1 \text{ and } E_2 \text{ and } \cdots \text{ and } E_n) = P(E_1) \cdot P(E_2|E_1) \cdot P(E_3|E_1,E_2) \cdots P(E_n|E_1,\ldots,E_{n-1})$
+
+    Let's visualize these probabilities:
+    """)
     return
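Both multi-event formulas above reduce to products, which `math.prod` expresses directly; the specific numbers here are illustrative, not taken from the notebook:

```python
from math import prod

# Independent: three sixes in a row on a fair die
p_three_sixes = prod([1/6, 1/6, 1/6])

# Dependent: three hearts without replacement, via the chain rule
p_three_hearts = prod([13/52, 12/51, 11/50])

print(f"{p_three_sixes:.5f}, {p_three_hearts:.5f}")  # 0.00463, 0.01294
```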
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ### Interactive example
+    """)
     return


@@ -250,65 +244,61 @@

     # Display explanation alongside visualization
     mo.hstack([plt.gcf(), mo.md(data["explanation"])])
+    return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 🤔 Test Your Understanding
+
+    Which of these statements about AND probability are true?
+
+    <details>
+    <summary>1. The probability of getting two sixes in a row with a fair die is 1/36</summary>
+
+    ✅ True! Since die rolls are independent events:
+    P(two sixes) = P(first six) × P(second six) = 1/6 × 1/6 = 1/36
+    </details>
+
+    <details>
+    <summary>2. When drawing cards without replacement, P(two kings) = 4/52 × 4/52</summary>
+
+    ❌ False! These are dependent events. The correct calculation is:
+    P(two kings) = P(first king) × P(second king | first king) = 4/52 × 3/51
+    </details>
+
+    <details>
+    <summary>3. If P(A) = 0.3 and P(B) = 0.4, then P(A and B) must be 0.12</summary>
+
+    ❌ False! P(A and B) = 0.12 only if A and B are independent events.
+    If they're dependent, we need P(B|A) to calculate P(A and B).
+    </details>
+
+    <details>
+    <summary>4. The probability of rolling a six AND getting tails is (1/6 × 1/2)</summary>
+
+    ✅ True! These are independent events, so we multiply their individual probabilities:
+    P(six and tails) = P(six) × P(tails) = 1/6 × 1/2 = 1/12
+    </details>
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md("""
+    ## Summary
+
+    You've learned:
+
+    - How to identify independent vs dependent events
+    - The multiplication rule for independent events
+    - The chain rule for dependent events
+    - How to extend these concepts to multiple events
+
+    In the next lesson, we'll explore the **law of total probability** in more detail, building on these concepts.
+    """)
     return
probability/07_law_of_total_probability.py
CHANGED
@@ -9,7 +9,7 @@

 import marimo

-__generated_with = "0.
 app = marimo.App(width="medium")


@@ -24,56 +24,52 @@ def _():
     import matplotlib.pyplot as plt
     from matplotlib_venn import venn2
     import numpy as np
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        # Law of Total Probability
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## The Core Concept
-
-
-
-
-
-
-
-
-
-    )
     return


@@ -98,7 +94,7 @@ def _():
     print("Odd/Even partition:", is_valid_partition(partition1, sample_space))
     print("Number pairs partition:", is_valid_partition(partition2, sample_space))
-    return is_valid_partition,


 @app.cell
@@ -111,7 +107,7 @@ def _(is_valid_partition):
     print("Student Grades Examples:")
     print("Pass/Fail partition:", is_valid_partition(passing_partition, grade_space))
     print("Individual grades partition:", is_valid_partition(letter_groups, grade_space))
-    return


 @app.cell
@@ -124,7 +120,7 @@ def _(is_valid_partition):
     print("\nPlaying Cards Examples:")
     print("Color-based partition:", is_valid_partition(color_partition, card_space))  # True
     print("Invalid partition:", is_valid_partition(invalid_partition, card_space))  # False
-    return


 @app.cell(hide_code=True)
@@ -151,75 +147,71 @@ def _(mo, plt, venn2):
     """)

     mo.hstack([plt.gca(), viz_explanation])
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return


-@app.
-def
-
-
-        raise ValueError("Must have same number of conditional and partition probabilities")
-
-
-    return (total_probability,)


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Example: System Reliability
-
-
-
-
-
-    )
     return


 @app.cell
-def _(mo
     # System states and probabilities
     states = ["Normal", "Degraded", "Critical"]
     state_probs = [0.7, 0.2, 0.1]  # System spends 70%, 20%, 10% of time in each state


@@ -252,12 +244,14 @@ def _(mo, total_probability):
     Total: {total_error:.3f} or {total_error:.1%} chance of error
     """)
     explanation
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return


@@ -311,24 +305,22 @@ def _(late_given_dry, late_given_rain, mo, plt, venn2, weather_prob):
     plt.title("Weather and Traffic Probability")

     mo.hstack([plt.gca(), explanation_example])
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Visual Intuition
-
-
-
-
-    )
     return


@@ -371,49 +363,45 @@ def _(plt):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Summary
-
-
-
-
-    )
     return
@@ -9,7 +9,7 @@

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium")


@@ -24,56 +24,52 @@

     import matplotlib.pyplot as plt
     from matplotlib_venn import venn2
     import numpy as np
+    return plt, venn2


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Law of Total Probability
+
+    _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/law_total/), by Stanford professor Chris Piech._
+
+    The Law of Total Probability is a fundamental rule that helps us calculate probabilities by breaking down complex events into simpler parts. It's particularly useful when we want to compute the probability of an event that can occur through multiple distinct scenarios.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## The Core Concept
+
+    The Law of Total Probability emerged from a simple but powerful observation: any event $E$ can be broken down into parts based on another event $F$ and its complement $F^c$.
+
+    ### From Simple Observation to Powerful Law
+
+    Consider an event $E$ that can occur in two ways:
+
+    1. When $F$ occurs ($E \cap F$)
+    2. When $F$ doesn't occur ($E \cap F^c$)
+
+    This leads to our first insight:
+
+    $P(E) = P(E \cap F) + P(E \cap F^c)$
+
+    Applying the chain rule to each term:
+
+    \begin{align}
+    P(E) &= P(E \cap F) + P(E \cap F^c) \\
+         &= P(E|F)P(F) + P(E|F^c)P(F^c)
+    \end{align}
+
+    This two-part version generalizes to any number of [mutually exclusive](https://marimo.app/https://github.com/marimo-team/learn/blob/main/probability/03_probability_of_or.py) events that cover the sample space:
+
+    $P(A) = \sum_{i=1}^n P(A|B_i)P(B_i)$
+
+    where $\{B_1, B_2, \ldots, B_n\}$ forms a partition of the sample space.
+    """)
     return
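The two-part identity above can be checked numerically; the probabilities here are made up for illustration:

```python
p_f = 0.3           # P(F)
p_e_given_f = 0.8   # P(E|F)
p_e_given_fc = 0.1  # P(E|F^c)

# P(E) = P(E|F)P(F) + P(E|F^c)P(F^c)
p_e = p_e_given_f * p_f + p_e_given_fc * (1 - p_f)
print(f"{p_e:.2f}")  # 0.31
```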
@@ -98,7 +94,7 @@

     print("Odd/Even partition:", is_valid_partition(partition1, sample_space))
     print("Number pairs partition:", is_valid_partition(partition2, sample_space))
+    return (is_valid_partition,)


 @app.cell
@@ -111,7 +107,7 @@

     print("Student Grades Examples:")
     print("Pass/Fail partition:", is_valid_partition(passing_partition, grade_space))
     print("Individual grades partition:", is_valid_partition(letter_groups, grade_space))
+    return


 @app.cell
@@ -124,7 +120,7 @@

     print("\nPlaying Cards Examples:")
     print("Color-based partition:", is_valid_partition(color_partition, card_space))  # True
     print("Invalid partition:", is_valid_partition(invalid_partition, card_space))  # False
+    return


 @app.cell(hide_code=True)
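`is_valid_partition` is called throughout these cells, but its body falls outside the hunks shown. A sketch consistent with its usage (a list of sets in, a boolean out) could be:

```python
def is_valid_partition(partition, sample_space):
    # A valid partition: pairwise disjoint blocks whose union is the sample space
    union = set()
    for block in partition:
        if union & block:  # any overlap means the blocks aren't mutually exclusive
            return False
        union |= block
    return union == set(sample_space)

print(is_valid_partition([{1, 3, 5}, {2, 4, 6}], {1, 2, 3, 4, 5, 6}))  # True
print(is_valid_partition([{1, 2}, {2, 3}], {1, 2, 3}))  # False, 2 appears twice
```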
@@ -151,75 +147,71 @@

     """)

     mo.hstack([plt.gca(), viz_explanation])
+    return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Computing Total Probability
+
+    To use the Law of Total Probability:
+
+    1. Identify a partition of the sample space
+    2. Calculate $P(B_i)$ for each part
+    3. Calculate $P(A|B_i)$ for each part
+    4. Sum the products $P(A|B_i)P(B_i)$
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Let's implement this calculation:
+    """)
     return
+@app.function
+def total_probability(conditional_probs, partition_probs):
+    """Calculate total probability using the Law of Total Probability.
+
+    conditional_probs: list of P(A|Bi)
+    partition_probs: list of P(Bi)
+    """
+    if len(conditional_probs) != len(partition_probs):
+        raise ValueError("Must have same number of conditional and partition probabilities")
+
+    if abs(sum(partition_probs) - 1) > 1e-10:
+        raise ValueError("Partition probabilities must sum to 1")
+
+    return sum(c * p for c, p in zip(conditional_probs, partition_probs))
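A quick standalone check of this function (the definition is repeated here so the snippet runs on its own; the weather numbers are invented):

```python
def total_probability(conditional_probs, partition_probs):
    if len(conditional_probs) != len(partition_probs):
        raise ValueError("Must have same number of conditional and partition probabilities")
    if abs(sum(partition_probs) - 1) > 1e-10:
        raise ValueError("Partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditional_probs, partition_probs))

# P(late) over a rain/dry partition of the day
p_late = total_probability([0.5, 0.2], [0.3, 0.7])  # P(late|rain), P(late|dry)
print(f"{p_late:.2f}")  # 0.29
```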
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Example: System Reliability
+
+    Consider a computer system that can be in three states:
+
+    - Normal (70% of time)
+    - Degraded (20% of time)
+    - Critical (10% of time)
+
+    The probability of errors in each state:
+
+    - P(Error | Normal) = 0.01 (1%)
+    - P(Error | Degraded) = 0.15 (15%)
+    - P(Error | Critical) = 0.45 (45%)
+
+    Let's calculate the overall probability of encountering an error:
+    """)
     return
 @app.cell
+def _(mo):
     # System states and probabilities
     states = ["Normal", "Degraded", "Critical"]
     state_probs = [0.7, 0.2, 0.1]  # System spends 70%, 20%, 10% of time in each state


@@ -252,12 +244,14 @@

     Total: {total_error:.3f} or {total_error:.1%} chance of error
     """)
     explanation
+    return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Interactive Example
+    """)
     return


@@ -311,24 +305,22 @@

     plt.title("Weather and Traffic Probability")

     mo.hstack([plt.gca(), explanation_example])
+    return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Visual Intuition
+
+    The Law of Total Probability works because:
+
+    1. The partition divides the sample space into non-overlapping regions
+    2. Every outcome belongs to exactly one region
+    3. We account for all possible ways an event can occur
+
+    Let's visualize this with a tree diagram:
+    """)
     return


@@ -371,49 +363,45 @@

 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## 🤔 Test Your Understanding
+
+    For a fair six-sided die with partitions:
+
+    - B₁: Numbers less than 3 {1,2}
+    - B₂: Numbers from 3 to 4 {3,4}
+    - B₃: Numbers greater than 4 {5,6}
+
+    **Question 1**: Which of these statements correctly describes the partition?
+
+    <details>
+    <summary>The sets overlap at number 3</summary>
+    ❌ Incorrect! The sets are clearly separated with no overlapping numbers.
+    </details>
+
+    <details>
+    <summary>Some numbers are missing from the partition</summary>
+    ❌ Incorrect! All numbers from 1 to 6 are included exactly once.
+    </details>
+
+    <details>
+    <summary>The sets form a valid partition of {1,2,3,4,5,6}</summary>
+    ✅ Correct! The sets are mutually exclusive and their union covers all outcomes.
+    </details>
+    """)
     return
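The correct answer to Question 1 can be verified mechanically with set operations:

```python
sample_space = {1, 2, 3, 4, 5, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]

union = set().union(*partition)
# Blocks are pairwise disjoint iff no element is counted twice across them
disjoint = sum(len(block) for block in partition) == len(union)
print(disjoint and union == sample_space)  # True
```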
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md("""
+    ## Summary
+
+    You've learned:
+
+    - How to identify valid partitions of a sample space
+    - The Law of Total Probability formula and its components
+    - How to break down complex probability calculations
+    - Applications to real-world scenarios
+
+    In the next lesson, we'll explore **Bayes' Theorem**, which builds on these concepts to solve even more sophisticated probability problems.
+    """)
     return
probability/08_bayes_theorem.py
CHANGED
@@ -9,7 +9,7 @@

 import marimo

-__generated_with = "0.
 app = marimo.App(width="medium", app_title="Bayes Theorem")


@@ -23,140 +23,128 @@ def _():
 def _():
     import matplotlib.pyplot as plt
     import numpy as np
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        # Bayes' Theorem
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## The Heart of Bayesian Reasoning
-
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## The Formula
-
-
-
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Real-World Examples
-
-
-
-
-
-
-
-    )
     return


-@app.
-def
-
-
-    p_e = likelihood * prior + false_positive_rate * (1 - prior)
-
-
-    return (calculate_posterior,)


 @app.cell
-def _(
     # Medical test example
     p_disease = 0.01  # Prior: 1% have the disease
     p_positive_given_disease = 0.95  # Likelihood: 95% test accuracy


@@ -167,13 +155,7 @@ def _(calculate_posterior):
     p_positive_given_disease,
     p_positive_given_healthy
 )
-    return (
-        medical_evidence,
-        medical_posterior,
-        p_disease,
-        p_positive_given_disease,
-        p_positive_given_healthy,
-    )


@@ -203,7 +185,7 @@ def _(medical_posterior, mo):

 @app.cell
-def _(
     # Student ability example
     p_high_ability = 0.30  # Prior: 30% of students have high ability
     p_good_grade_given_high = 0.90  # Likelihood: 90% of high ability students get good grades


@@ -214,13 +196,7 @@ def _(calculate_posterior):
     p_good_grade_given_high,
     p_good_grade_given_low
 )
-    return (
-        p_good_grade_given_high,
-        p_good_grade_given_low,
-        p_high_ability,
-        student_evidence,
-        student_posterior,
-    )


@@ -250,7 +226,7 @@ def _(mo, student_posterior):

 @app.cell
-def _(
     # Cell phone location example
     p_location_a = 0.25  # Prior probability of being in location A
     p_strong_signal_at_a = 0.85  # Likelihood of strong signal at A


@@ -261,13 +237,7 @@ def _(calculate_posterior):
     p_strong_signal_at_a,
     p_strong_signal_elsewhere
 )
-    return (
-        location_evidence,
-        location_posterior,
-        p_location_a,
-        p_strong_signal_at_a,
-        p_strong_signal_elsewhere,
-    )


@@ -298,7 +268,9 @@ def _(location_posterior, mo):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return


@@ -394,87 +366,79 @@ def _(

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Applications in Computer Science
-
-
-
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## 🤔 Test Your Understanding
-
-
-
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Summary
-
-
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-        """
-    )
     return
@@ -9,7 +9,7 @@

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium", app_title="Bayes Theorem")


@@ -23,140 +23,128 @@

 def _():
     import matplotlib.pyplot as plt
     import numpy as np
+    return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Bayes' Theorem
+
+    _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/bayes_theorem/), by Stanford professor Chris Piech._
+
+    In the 1740s, an English minister named Thomas Bayes discovered a profound mathematical relationship that would revolutionize how we reason about uncertainty. His theorem provides an elegant framework for calculating the probability of a hypothesis being true given observed evidence.
+
+    At its core, Bayes' Theorem connects two different types of probabilities: the probability of a hypothesis given evidence, $P(H|E)$, and its reverse, the probability of evidence given a hypothesis, $P(E|H)$. This relationship is particularly powerful because it allows us to compute difficult probabilities using ones that are easier to measure.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## The Heart of Bayesian Reasoning
+
+    The fundamental insight of Bayes' Theorem lies in its ability to relate what we want to know with what we can measure. When we observe evidence $E$, we often want to know the probability of a hypothesis $H$ being true. However, it's typically much easier to measure how likely we are to observe the evidence when we know the hypothesis is true.
+
+    This reversal of perspective, from $P(H|E)$ to $P(E|H)$, is powerful because it lets us:
+
+    1. Start with what we know (prior beliefs)
+    2. Use easily measurable relationships (likelihood)
+    3. Update our beliefs with new evidence
+
+    This approach mirrors both how humans naturally learn and the scientific method: we begin with prior beliefs, gather evidence, and update our understanding based on that evidence. This makes Bayes' Theorem not just a mathematical tool, but a framework for rational thinking.
+    """)
     return
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## The Formula
+
+    Bayes' Theorem states:
+
+    $P(H|E) = \frac{P(E|H)P(H)}{P(E)}$
+
+    Where:
+
+    - $P(H|E)$ is the **posterior probability**: the probability of hypothesis $H$ given evidence $E$
+    - $P(E|H)$ is the **likelihood**: the probability of evidence $E$ given hypothesis $H$
+    - $P(H)$ is the **prior probability**: the initial probability of hypothesis $H$
+    - $P(E)$ is the **evidence**: the total probability of observing evidence $E$
+
+    The denominator $P(E)$ can be expanded using the [Law of Total Probability](https://marimo.app/gh/marimo-team/learn/main?entrypoint=probability%2F07_law_of_total_probability.py):
+
+    $P(E) = P(E|H)P(H) + P(E|H^c)P(H^c)$
+    """)
     return
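Plugging the expanded denominator back into the theorem takes only a few lines; the numbers below are placeholders for illustration, not the notebook's examples:

```python
p_h = 0.01           # prior P(H)
p_e_given_h = 0.95   # likelihood P(E|H)
p_e_given_hc = 0.10  # P(E|H^c)

# Evidence via the Law of Total Probability
p_e = p_e_given_h * p_h + p_e_given_hc * (1 - p_h)

# Bayes' Theorem
posterior = p_e_given_h * p_h / p_e
print(f"{posterior:.3f}")  # 0.088
```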
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Understanding Each Component
+
+    ### 1. Prior Probability - $P(H)$
+    - Initial belief about the hypothesis before seeing evidence
+    - Based on previous knowledge or assumptions
+    - Example: probability of having a disease before any tests
+
+    ### 2. Likelihood - $P(E|H)$
+    - Probability of the evidence given the hypothesis is true
+    - Often known from data or scientific studies
+    - Example: probability of a positive test given the disease is present
+
+    ### 3. Evidence - $P(E)$
+    - Total probability of observing the evidence
+    - Acts as a normalizing constant
+    - Can be calculated using the Law of Total Probability
+
+    ### 4. Posterior - $P(H|E)$
+    - Updated probability after considering the evidence
+    - Combines prior knowledge with new evidence
+    - Becomes the new prior for future updates
+    """)
     return
| 111 |
@app.cell(hide_code=True)
|
| 112 |
def _(mo):
|
| 113 |
+
mo.md(r"""
|
| 114 |
+
## Real-World Examples
|
|
|
|
| 115 |
|
| 116 |
+
### 1. Medical Testing
|
| 117 |
+
- **Want to know**: $P(\text{Disease}|\text{Positive})$ - Probability of disease given positive test
|
| 118 |
+
- **Easy to know**: $P(\text{Positive}|\text{Disease})$ - Test accuracy for sick people
|
| 119 |
+
- **Causality**: Disease causes test results, not vice versa
|
| 120 |
|
| 121 |
+
### 2. Student Ability
|
| 122 |
+
- **Want to know**: $P(\text{High Ability}|\text{Good Grade})$ - Probability student is skilled given good grade
|
| 123 |
+
- **Easy to know**: $P(\text{Good Grade}|\text{High Ability})$ - Probability good students get good grades
|
| 124 |
+
- **Causality**: Ability influences grades, not vice versa
|
| 125 |
|
| 126 |
+
### 3. Cell Phone Location
|
| 127 |
+
- **Want to know**: $P(\text{Location}|\text{Signal Strength})$ - Probability of phone location given signal
|
| 128 |
+
- **Easy to know**: $P(\text{Signal Strength}|\text{Location})$ - Signal strength at known locations
|
| 129 |
+
- **Causality**: Location determines signal strength, not vice versa
|
| 130 |
|
| 131 |
+
These examples highlight a common pattern: what we want to know (posterior) is harder to measure directly than its reverse (likelihood).
|
| 132 |
+
""")
|
|
|
|
| 133 |
return


@app.function
def calculate_posterior(prior, likelihood, false_positive_rate):
    # Calculate P(E) using the Law of Total Probability
    p_e = likelihood * prior + false_positive_rate * (1 - prior)

    # Calculate posterior using Bayes' Theorem
    posterior = (likelihood * prior) / p_e
    return posterior, p_e


@app.cell
def _():
    # Medical test example
    p_disease = 0.01  # Prior: 1% have the disease
    p_positive_given_disease = 0.95  # Likelihood: 95% test accuracy
    ...
        p_positive_given_disease,
        p_positive_given_healthy
    )
    return (medical_posterior,)
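The medical example can be exercised end to end with `calculate_posterior`; a minimal standalone sketch is below. The helper is restated so the snippet runs on its own, and the 5% false-positive rate is an assumed value for illustration, not taken from the notebook.

```python
def calculate_posterior(prior, likelihood, false_positive_rate):
    # P(E) via the Law of Total Probability, then Bayes' Theorem
    p_e = likelihood * prior + false_positive_rate * (1 - prior)
    return (likelihood * prior) / p_e, p_e

# 1% prior and 95% sensitivity from the example; 5% false-positive rate is assumed
posterior, p_e = calculate_posterior(0.01, 0.95, 0.05)
# Despite the accurate test, the posterior is only about 16%
```

The counterintuitive result comes from the low prior: true positives (0.95 × 0.01) are swamped by false positives (0.05 × 0.99).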


@app.cell
...


@app.cell
def _():
    # Student ability example
    p_high_ability = 0.30  # Prior: 30% of students have high ability
    p_good_grade_given_high = 0.90  # Likelihood: 90% of high ability students get good grades
    ...
        p_good_grade_given_high,
        p_good_grade_given_low
    )
    return (student_posterior,)


@app.cell
...


@app.cell
def _():
    # Cell phone location example
    p_location_a = 0.25  # Prior probability of being in location A
    p_strong_signal_at_a = 0.85  # Likelihood of strong signal at A
    ...
        p_strong_signal_at_a,
        p_strong_signal_elsewhere
    )
    return (location_posterior,)


@app.cell
...


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""## Interactive example""")
    return

...


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Applications in Computer Science

    Bayes' Theorem is fundamental in many computing applications:

    1. **Spam Filtering**
        - $P(\text{Spam}|\text{Words})$ = probability an email is spam given its words
        - Updates as new emails are classified

    2. **Machine Learning**
        - Naive Bayes classifiers
        - Probabilistic graphical models
        - Bayesian neural networks

    3. **Computer Vision**
        - Object detection confidence
        - Face recognition systems
        - Image classification
    """)
    return
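The spam-filtering bullet can be sketched in a few lines of naive Bayes. The word likelihoods and prior below are made-up values for illustration; real filters estimate them from labeled emails.

```python
# Hypothetical per-word likelihoods and prior, for illustration only
p_word_given_spam = {"free": 0.30, "meeting": 0.01}
p_word_given_ham = {"free": 0.02, "meeting": 0.20}
p_spam = 0.40  # assumed prior fraction of spam

def spam_posterior(words):
    # Naive Bayes: multiply per-word likelihoods, then normalize
    like_spam, like_ham = p_spam, 1 - p_spam
    for w in words:
        like_spam *= p_word_given_spam.get(w, 1.0)  # unknown words are ignored
        like_ham *= p_word_given_ham.get(w, 1.0)
    return like_spam / (like_spam + like_ham)
```

The normalization step is exactly the denominator $P(E)$ of Bayes' Theorem, computed over the two hypotheses "spam" and "ham".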


@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ## 🤔 Test Your Understanding

    Pick which of these statements about Bayes' Theorem you think are correct:

    <details>
    <summary>The posterior probability will always be larger than the prior probability</summary>
    ❌ Incorrect! Evidence can either increase or decrease our belief in the hypothesis. For example, a negative medical test decreases the probability of having a disease.
    </details>

    <details>
    <summary>If the likelihood is 0.9 and the prior is 0.5, then the posterior must equal 0.9</summary>
    ❌ Incorrect! We also need the false positive rate to calculate the posterior probability. The likelihood alone doesn't determine the posterior.
    </details>

    <details>
    <summary>The denominator acts as a normalizing constant to ensure the posterior is a valid probability</summary>
    ✅ Correct! The denominator ensures the posterior probability is between 0 and 1 by considering all ways the evidence could occur.
    </details>
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ## Summary

    You've learned:

    - The components and intuition behind Bayes' Theorem
    - How to update probabilities when new evidence arrives
    - Why posterior probabilities can be counterintuitive
    - Real-world applications in computer science

    In the next lesson, we'll explore Random Variables, which help us work with numerical outcomes in probability.
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### Appendix
    The hidden cells below implement the interactive example above.
    """)
    return

probability/09_random_variables.py
CHANGED
import marimo

__generated_with = "0.18.4"
app = marimo.App(width="medium", app_title="Random Variables")


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    # Random Variables

    _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/rvs/), by Stanford professor Chris Piech._

    Random variables are functions that map outcomes from a probability space to numbers. This mathematical abstraction allows us to:

    - Work with numerical outcomes in probability
    - Calculate expected values and variances
    - Model real-world phenomena quantitatively
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Types of Random Variables

    ### Discrete Random Variables
    - Take on countable values (finite or infinite)
    - Described by a probability mass function (PMF)
    - Example: Number of heads in 3 coin flips

    ### Continuous Random Variables
    - Take on uncountable values in an interval
    - Described by a probability density function (PDF)
    - Example: Height of a randomly selected person
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Properties of Random Variables

    Each random variable has several key properties:

    | Property | Description | Example |
    |----------|-------------|---------|
    | Meaning | Semantic description | Number of successes in n trials |
    | Symbol | Notation used | $X$, $Y$, $Z$ |
    | Support/Range | Possible values | $\{0,1,2,...,n\}$ for binomial |
    | Distribution | PMF or PDF | $p_X(x)$ or $f_X(x)$ |
    | Expectation | Weighted average | $E[X]$ |
    | Variance | Measure of spread | $\text{Var}(X)$ |
    | Standard Deviation | Square root of variance | $\sigma_X$ |
    | Mode | Most likely value | $\arg\max_x p_X(x)$ |

    Additional properties include:

    - [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) (measure of uncertainty)
    - [Median](https://en.wikipedia.org/wiki/Median) (middle value)
    - [Skewness](https://en.wikipedia.org/wiki/Skewness) (asymmetry measure)
    - [Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) (tail heaviness measure)
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Probability Mass Functions (PMF)

    For discrete random variables, the PMF $p_X(x)$ gives the probability that $X$ equals $x$:

    $p_X(x) = P(X = x)$

    Properties of a PMF:

    1. $p_X(x) \geq 0$ for all $x$
    2. $\sum_x p_X(x) = 1$

    Let's implement a PMF for rolling a fair die:
    """)
    return
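A minimal standalone sketch of a fair-die PMF that satisfies both properties (not necessarily the notebook's own implementation, which is not shown here):

```python
def pmf_die(x):
    # Fair six-sided die: probability 1/6 on {1, ..., 6}, 0 elsewhere
    return 1 / 6 if x in range(1, 7) else 0.0

# Check both PMF properties: non-negativity and summing to 1
total = sum(pmf_die(x) for x in range(1, 7))
```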

...
    plt.ylabel("Probability")
    plt.grid(True, alpha=0.3)
    plt.gca()
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Probability Density Functions (PDF)

    For continuous random variables, we use a PDF $f_X(x)$. The probability of $X$ falling in an interval $[a,b]$ is:

    $P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$

    Properties of a PDF:

    1. $f_X(x) \geq 0$ for all $x$
    2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$

    Let's look at the normal distribution, a common continuous random variable:
    """)
    return
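The second PDF property can be checked numerically for the standard normal; a sketch using a plain trapezoid sum with numpy only (the interval and grid size are arbitrary choices):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Normal density, written out from the formula
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Trapezoid-rule approximation of the integral over a wide interval
x = np.linspace(-8.0, 8.0, 100_001)
y = normal_pdf(x)
dx = x[1] - x[0]
area = float(np.sum((y[:-1] + y[1:]) / 2) * dx)  # should be very close to 1
```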

...


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Expected Value

    The expected value $E[X]$ is the long-run average of a random variable.

    For discrete random variables:
    $E[X] = \sum_x x \cdot p_X(x)$

    For continuous random variables:
    $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$

    Properties:

    1. $E[aX + b] = aE[X] + b$
    2. $E[X + Y] = E[X] + E[Y]$
    """)
    return

...
    die_probs = np.ones(6) / 6

    E_X = expected_value_discrete(die_values, die_probs)
    return E_X, die_probs, die_values


@app.cell
...


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Variance

    The variance $\text{Var}(X)$ measures the spread of a random variable around its mean:

    $\text{Var}(X) = E[(X - E[X])^2]$

    This can be computed as:
    $\text{Var}(X) = E[X^2] - (E[X])^2$

    Properties:

    1. $\text{Var}(aX) = a^2\,\text{Var}(X)$
    2. $\text{Var}(X + b) = \text{Var}(X)$
    """)
    return
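The identity $\text{Var}(X) = E[X^2] - (E[X])^2$ can be verified numerically for a fair die (a minimal sketch, separate from the notebook's own `variance_discrete` helper):

```python
import numpy as np

values = np.arange(1, 7)  # fair die outcomes 1..6
probs = np.ones(6) / 6    # uniform PMF

mean = float(np.sum(values * probs))                      # E[X] = 3.5
var_def = float(np.sum((values - mean) ** 2 * probs))     # E[(X - E[X])^2]
var_alt = float(np.sum(values ** 2 * probs) - mean ** 2)  # E[X^2] - (E[X])^2
```

Both routes give the same number, 35/12 ≈ 2.92; the second form is often easier because it avoids centering every value.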

...
    coin_probs = [0.5, 0.5]
    coin_mean = sum(x * p for x, p in zip(coin_values, coin_probs))
    coin_var = variance_discrete(coin_values, coin_probs, coin_mean)
    return (coin_var,)


@app.cell
...
    normal_probs = normal_probs / sum(normal_probs)  # normalize
    normal_mean = 0
    normal_var = variance_discrete(normal_values, normal_probs, normal_mean)
    return (normal_var,)


@app.cell
...
    uniform_probs = np.ones_like(uniform_values) / len(uniform_values)
    uniform_mean = 0.5
    uniform_var = variance_discrete(uniform_values, uniform_probs, uniform_mean)
    return (uniform_var,)


@app.cell(hide_code=True)
...


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Common Distributions

    1. Bernoulli Distribution
        - Models a single success/failure experiment
        - $P(X = 1) = p$, $P(X = 0) = 1-p$
        - $E[X] = p$, $\text{Var}(X) = p(1-p)$

    2. Binomial Distribution
        - Models the number of successes in $n$ independent trials
        - $P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$
        - $E[X] = np$, $\text{Var}(X) = np(1-p)$

    3. Normal Distribution
        - Bell-shaped curve defined by mean $\mu$ and variance $\sigma^2$
        - PDF: $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
        - $E[X] = \mu$, $\text{Var}(X) = \sigma^2$
    """)
    return
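The binomial mean and variance formulas can be checked directly from the PMF; a sketch with assumed illustration parameters $n=10$, $p=0.3$:

```python
import math

n, p = 10, 0.3  # assumed illustration parameters

def binom_pmf(k):
    # Binomial PMF: C(n, k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

support = range(n + 1)
mean = sum(k * binom_pmf(k) for k in support)              # np = 3.0
var = sum(k**2 * binom_pmf(k) for k in support) - mean**2  # np(1-p) = 2.1
```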


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ### Example: Comparing Discrete and Continuous Distributions

    This example shows the relationship between a Binomial distribution (discrete) and its Normal approximation (continuous).
    The parameters control both distributions:

    - **Number of Trials**: Controls the range of possible values and the shape's width
    - **Success Probability**: Affects the distribution's center and skewness
    """)
    return

...
    plt.tight_layout()
    plt.gca()
    return


@app.cell(hide_code=True)
...


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Practice Problems

    ### Problem 1: Discrete Random Variable
    Let $X$ be the sum when rolling two fair dice. Find:

    1. The support of $X$
    2. The PMF $p_X(x)$
    3. $E[X]$ and $\text{Var}(X)$

    <details>
    <summary>Solution</summary>
    Let's solve this step by step:

    ```python
    def two_dice_pmf(x):
        outcomes = [(i,j) for i in range(1,7) for j in range(1,7)]
        favorable = [pair for pair in outcomes if sum(pair) == x]
        return len(favorable)/36

    # Support: {2,3,...,12}
    # E[X] = 7
    # Var(X) = 5.83
    ```
    </details>

    ### Problem 2: Continuous Random Variable
    For a uniform random variable on $[0,1]$, verify that:

    1. The PDF integrates to 1
    2. $E[X] = 1/2$
    3. $\text{Var}(X) = 1/12$

    Try solving this yourself first, then check the solution below.
    """)
    return

...


@app.cell(hide_code=True)
def _(mo):
    mktext = mo.md(r"""
    Let's solve each part:

    1. **PDF integrates to 1**:
       $\int_0^1 1 \, dx = [x]_0^1 = 1 - 0 = 1$

    2. **Expected Value**:
       $E[X] = \int_0^1 x \cdot 1 \, dx = \left[\frac{x^2}{2}\right]_0^1 = \frac{1}{2} - 0 = \frac{1}{2}$

    3. **Variance**:
       $\text{Var}(X) = E[X^2] - (E[X])^2$

       First calculate $E[X^2]$:
       $E[X^2] = \int_0^1 x^2 \cdot 1 \, dx = \left[\frac{x^3}{3}\right]_0^1 = \frac{1}{3}$

       Then:
       $\text{Var}(X) = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$
    """)
    return (mktext,)
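The closed-form answers for the uniform distribution can also be sanity-checked by simulation; a sketch where the sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
samples = rng.uniform(0, 1, 1_000_000)

# Monte Carlo estimates of E[X] and Var(X) for Uniform(0, 1)
m = samples.mean()  # should be close to 1/2
v = samples.var()   # should be close to 1/12
```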


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## 🤔 Test Your Understanding

    Pick which of these statements about random variables you think are correct:

    <details>
    <summary>The probability density function can be greater than 1</summary>
    ✅ Correct! Unlike PMFs, PDFs can exceed 1 as long as the total area under the curve equals 1.
    </details>

    <details>
    <summary>The expected value of a random variable must equal one of its possible values</summary>
    ❌ Incorrect! For example, the expected value of a fair die is 3.5, which is not a possible outcome.
    </details>

    <details>
    <summary>Adding a constant to a random variable changes its variance</summary>
    ❌ Incorrect! Adding a constant shifts the distribution but doesn't affect its spread.
    </details>
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md("""
    ## Summary

    You've learned:

    - The difference between discrete and continuous random variables
    - How PMFs and PDFs describe probability distributions
    - Methods for calculating expected values and variances
    - Properties of common probability distributions

    In the next lesson, we'll explore Probability Mass Functions in detail, focusing on their properties and applications.
    """)
    return

probability/10_probability_mass_function.py
CHANGED
|
@@ -10,57 +10,51 @@
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
-
__generated_with = "0.
|
| 14 |
app = marimo.App(width="medium", app_title="Probability Mass Functions")
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
-
mo.md(
|
| 20 |
-
|
| 21 |
-
# Probability Mass Functions
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
)
|
| 30 |
return
|
| 31 |
|
| 32 |
|
| 33 |
@app.cell(hide_code=True)
|
| 34 |
def _(mo):
|
| 35 |
-
mo.md(
|
| 36 |
-
|
| 37 |
-
## Properties of a PMF
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
-
|
| 43 |
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
)
|
| 47 |
return
|
| 48 |
|
| 49 |
|
| 50 |
@app.cell(hide_code=True)
|
| 51 |
def _(mo):
|
| 52 |
-
mo.md(
|
| 53 |
-
|
| 54 |
-
## PMFs as Graphs
|
| 55 |
|
| 56 |
-
|
| 57 |
|
| 58 |
-
|
| 59 |
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
)
|
| 64 |
return
|
| 65 |
|
| 66 |
|
|
@@ -102,53 +96,39 @@ def _(np, plt):
|
|
| 102 |
|
| 103 |
plt.tight_layout()
|
| 104 |
plt.gca()
|
| 105 |
-
return
|
| 106 |
-
dice_ax1,
|
| 107 |
-
dice_ax2,
|
| 108 |
-
dice_fig,
|
| 109 |
-
dice_prob,
|
| 110 |
-
dice_sum,
|
| 111 |
-
single_die_probs,
|
| 112 |
-
single_die_values,
|
| 113 |
-
two_dice_probs,
|
| 114 |
-
two_dice_values,
|
| 115 |
-
)
|
| 116 |
|
| 117 |
|
| 118 |
@app.cell(hide_code=True)
|
| 119 |
def _(mo):
|
| 120 |
-
mo.md(
|
| 121 |
-
|
| 122 |
-
These graphs really show us how likely each value is when we roll the dice.
|
| 123 |
|
| 124 |
-
|
| 125 |
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
)
|
| 129 |
return
|
| 130 |
|
| 131 |
|
| 132 |
@app.cell(hide_code=True)
|
| 133 |
def _(mo):
|
| 134 |
-
mo.md(
|
| 135 |
-
|
| 136 |
-
## PMFs as Equations
|
| 137 |
|
| 138 |
-
|
| 139 |
|
| 140 |
-
|
| 141 |
-
|
| 142 |
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
)
|
| 152 |
return
|
| 153 |
|
| 154 |
|
|
@@ -167,12 +147,14 @@ def _():
|
|
| 167 |
test_values = [1, 2, 7, 12, 13]
|
| 168 |
for test_y in test_values:
|
| 169 |
print(f"P(Y = {test_y}) = {pmf_sum_two_dice(test_y)}")
|
| 170 |
-
return pmf_sum_two_dice,
|
| 171 |
|
| 172 |
|
| 173 |
@app.cell(hide_code=True)
|
| 174 |
def _(mo):
|
| 175 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 176 |
return
|
| 177 |
|
| 178 |
|
|
@@ -183,7 +165,7 @@ def _(pmf_sum_two_dice):
|
|
| 183 |
# Round to 10 decimal places to handle floating-point precision
|
| 184 |
verify_total_prob_rounded = round(verify_total_prob, 10)
|
| 185 |
print(f"Sum of all probabilities: {verify_total_prob_rounded}")
|
| 186 |
-
return
|
| 187 |
|
| 188 |
|
| 189 |
@app.cell(hide_code=True)
|
|
@@ -205,18 +187,16 @@ def _(plt, pmf_sum_two_dice):
|
|
| 205 |
plt.text(verify_y_values[verify_i], verify_prob + 0.001, f'{verify_prob:.3f}', ha='center')
|
| 206 |
|
| 207 |
plt.gca() # Return the current axes to ensure proper display
|
| 208 |
-
return
|
| 209 |
|
| 210 |
|
| 211 |
@app.cell(hide_code=True)
|
| 212 |
def _(mo):
|
| 213 |
-
mo.md(
|
| 214 |
-
|
| 215 |
-
## Data to Histograms to Probability Mass Functions
|
| 216 |
|
| 217 |
-
|
| 218 |
-
|
| 219 |
-
)
|
| 220 |
return
|
| 221 |
|
| 222 |
|
|
@@ -236,7 +216,7 @@ def _(np):
|
|
| 236 |
# Display a small sample of the data
|
| 237 |
print(f"First 20 dice sums: {sim_dice_sums[:20]}")
|
| 238 |
print(f"Total number of trials: {sim_num_trials}")
|
| 239 |
-
return sim_dice_sums,
|
| 240 |
|
| 241 |
|
| 242 |
@app.cell(hide_code=True)
|
|
@@ -296,32 +276,16 @@ def _(collections, np, plt, sim_dice_sums):
|
|
| 296 |
plt.text(sim_sorted_values[sim_i], sim_count + 19, str(sim_count), ha='center')
|
| 297 |
|
| 298 |
plt.gca() # Return the current axes to ensure proper display
|
| 299 |
-
return (
|
| 300 |
-
sim_ax1,
|
| 301 |
-
sim_ax2,
|
| 302 |
-
sim_count,
|
| 303 |
-
sim_counter,
|
| 304 |
-
sim_counts,
|
| 305 |
-
sim_empirical_pmf,
|
| 306 |
-
sim_fig,
|
| 307 |
-
sim_i,
|
| 308 |
-
sim_prob,
|
| 309 |
-
sim_sorted_values,
|
| 310 |
-
sim_theoretical_pmf,
|
| 311 |
-
sim_theoretical_values,
|
| 312 |
-
sim_y,
|
| 313 |
-
)
|
| 314 |
|
| 315 |
|
| 316 |
@app.cell(hide_code=True)
|
| 317 |
def _(mo):
|
| 318 |
-
mo.md(
|
| 319 |
-
|
| 320 |
-
When we normalize a histogram (divide each count by total sample size), we get a pretty good approximation of the true PMF. it's a simple yet powerful idea - count how many times each value appears, then divide by the total number of trials.
|
| 321 |
|
| 322 |
-
|
| 323 |
-
|
| 324 |
-
)
|
| 325 |
return
|
| 326 |
|
| 327 |
|
|
@@ -338,20 +302,18 @@ def _(sim_counter, sim_dice_sums):
|
|
| 338 |
print(f"Empirical P(Y=3): {sim_count_of_3}/{len(sim_dice_sums)} = {sim_empirical_prob:.4f}")
|
| 339 |
print(f"Theoretical P(Y=3): 2/36 = {sim_theoretical_prob:.4f}")
|
| 340 |
print(f"Difference: {abs(sim_empirical_prob - sim_theoretical_prob):.4f}")
|
| 341 |
-
return
|
| 342 |
|
| 343 |
|
| 344 |
@app.cell(hide_code=True)
|
| 345 |
def _(mo):
|
| 346 |
-
mo.md(
|
| 347 |
-
|
| 348 |
-
As we can see, with a large number of trials, the empirical PMF becomes a very good approximation of the theoretical PMF. This is an example of the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) in action.
|
| 349 |
|
| 350 |
-
|
| 351 |
|
| 352 |
-
|
| 353 |
-
|
| 354 |
-
)
|
| 355 |
return
|
| 356 |
|
| 357 |
|
|
@@ -482,38 +444,20 @@ def _(dist_param1, dist_param2, dist_selection, np, plt, stats):
|
|
| 482 |
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
|
| 483 |
|
| 484 |
plt.gca() # Return the current axes to ensure proper display
|
| 485 |
-
return
|
| 486 |
-
dist_baseline,
|
| 487 |
-
dist_lam,
|
| 488 |
-
dist_markerline,
|
| 489 |
-
dist_max_x,
|
| 490 |
-
dist_mean,
|
| 491 |
-
dist_n,
|
| 492 |
-
dist_p,
|
| 493 |
-
dist_pmf_values,
|
| 494 |
-
dist_props_text,
|
| 495 |
-
dist_std_dev,
|
| 496 |
-
dist_stemlines,
|
| 497 |
-
dist_title,
|
| 498 |
-
dist_variance,
|
| 499 |
-
dist_x_label,
|
| 500 |
-
dist_x_values,
|
| 501 |
-
)
|
| 502 |
|
| 503 |
|
| 504 |
@app.cell(hide_code=True)
|
| 505 |
def _(mo):
|
| 506 |
-
mo.md(
|
| 507 |
-
|
| 508 |
-
## Expected Value from a PMF
|
| 509 |
|
| 510 |
-
|
| 511 |
|
| 512 |
-
|
| 513 |
|
| 514 |
-
|
| 515 |
-
|
| 516 |
-
)
|
| 517 |
return
|
| 518 |
|
| 519 |
|
|
@@ -527,24 +471,22 @@ def _(dist_pmf_values, dist_x_values):
|
|
| 527 |
ev_dist_mean = calc_expected_value(dist_x_values, dist_pmf_values)
|
| 528 |
|
| 529 |
print(f"Expected value: {ev_dist_mean:.4f}")
|
| 530 |
-
return
|
| 531 |
|
| 532 |
|
| 533 |
@app.cell(hide_code=True)
|
| 534 |
def _(mo):
|
| 535 |
-
mo.md(
|
| 536 |
-
|
| 537 |
-
## Variance from a PMF
|
| 538 |
|
| 539 |
-
|
| 540 |
|
| 541 |
-
|
| 542 |
|
| 543 |
-
|
| 544 |
|
| 545 |
-
|
| 546 |
-
|
| 547 |
-
)
|
| 548 |
return
|
| 549 |
|
| 550 |
|
|
@@ -560,22 +502,20 @@ def _(dist_pmf_values, dist_x_values, ev_dist_mean, np):
|
|
| 560 |
|
| 561 |
print(f"Variance: {var_dist_var:.4f}")
|
| 562 |
print(f"Standard deviation: {var_dist_std_dev:.4f}")
|
| 563 |
-
return
|
| 564 |
|
| 565 |
|
| 566 |
@app.cell(hide_code=True)
|
| 567 |
def _(mo):
|
| 568 |
-
mo.md(
|
| 569 |
-
|
| 570 |
-
## PMF vs. CDF
|
| 571 |
|
| 572 |
-
|
| 573 |
|
| 574 |
-
|
| 575 |
|
| 576 |
-
|
| 577 |
-
|
| 578 |
-
)
|
| 579 |
return
|
| 580 |
|
| 581 |
|
|
@@ -612,77 +552,69 @@ def _(dist_pmf_values, dist_x_values, np, plt):

     plt.tight_layout()
     plt.gca()  # Return the current axes to ensure proper display
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        The graphs above illustrate the key difference between PMF and CDF:
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-
-
-
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Practical Applications of PMFs
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Key Takeaways
-
-
-    )
     return

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium", app_title="Probability Mass Functions")


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Probability Mass Functions

+    _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/pmf/), by Stanford professor Chris Piech._

+    PMFs are central to discrete probability: they tell us how likely each possible outcome is for a discrete random variable.

+    What's interesting about PMFs is that they can be represented in multiple ways - equations, graphs, or even empirical data. The core idea is simple: they map each possible value to its probability.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Properties of a PMF

+    For a function $p_X(x)$ to be a valid PMF:

+    1. **Non-negativity**: probability can't be negative, so $p_X(x) \geq 0$ for all $x$
+    2. **Unit total probability**: all probabilities sum to 1, i.e., $\sum_x p_X(x) = 1$

+    The second property makes intuitive sense - a random variable must take some value, so the probabilities over all possibilities must total 100%.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## PMFs as Graphs

+    Let's start by looking at PMFs as graphs, where the $x$-axis shows the values the random variable can take on and the $y$-axis shows the probability of each value.

+    In the following example, we show two PMFs:

+    - On the left: PMF for the random variable $X$ = the value of a single six-sided die roll
+    - On the right: PMF for the random variable $Y$ = the sum of two dice rolls
+    """)
     return

     plt.tight_layout()
     plt.gca()
+    return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    These graphs show how likely each value is when we roll the dice.

+    Looking at the right graph: when we see "6" on the $x$-axis with probability $\frac{5}{36}$ on the $y$-axis, that tells us there is a $\frac{5}{36}$ chance of rolling a sum of 6 with two dice. More formally: $P(Y = 6) = \frac{5}{36}$.

+    Similarly, the value "2" has probability $\frac{1}{36}$, because there is only one way to get a sum of 2 (rolling 1 on both dice). You'll also notice there is no bar at "1": you can't get a sum of 1 with two dice, since the minimum possible sum is 2.
+    """)
     return
|
| 112 |
|
| 113 |
|
| 114 |
 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## PMFs as Equations

+    Here is the exact same information in equation form:

+    For a single die roll $X$:
+    $$P(X=x) = \frac{1}{6} \quad \text{ if } 1 \leq x \leq 6$$

+    For the sum of two dice $Y$:
+    $$P(Y=y) = \begin{cases}
+    \frac{(y-1)}{36} & \text{ if } 2 \leq y \leq 7\\
+    \frac{(13-y)}{36} & \text{ if } 8 \leq y \leq 12
+    \end{cases}$$

+    Let's implement the PMF for $Y$, the sum of two dice, in Python code:
+    """)
     return
|
| 133 |
|
| 134 |
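The piecewise equation above can be sketched as a plain Python function; this is our own sketch, not necessarily the notebook's committed implementation:

```python
def pmf_sum_two_dice(y):
    """PMF for Y = the sum of two fair dice (sketch of the piecewise formula)."""
    if 2 <= y <= 7:
        return (y - 1) / 36
    if 8 <= y <= 12:
        return (13 - y) / 36
    return 0  # impossible sums get probability zero

print(pmf_sum_two_dice(7))   # 6/36, about 0.1667
print(pmf_sum_two_dice(13))  # 0
```

Returning 0 outside the support keeps the function total, so it can be summed or plotted over any range without special cases.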
|
|
|
|
| 147 |
test_values = [1, 2, 7, 12, 13]
|
| 148 |
for test_y in test_values:
|
| 149 |
print(f"P(Y = {test_y}) = {pmf_sum_two_dice(test_y)}")
|
| 150 |
+
return (pmf_sum_two_dice,)
|
| 151 |
|
| 152 |
|
| 153 |
@app.cell(hide_code=True)
|
| 154 |
def _(mo):
|
| 155 |
+
mo.md(r"""
|
| 156 |
+
Now, let's verify that our PMF satisfies the property that the sum of all probabilities equals 1:
|
| 157 |
+
""")
|
| 158 |
return
|
| 159 |
|
| 160 |
|
|
|
|
| 165 |
# Round to 10 decimal places to handle floating-point precision
|
| 166 |
verify_total_prob_rounded = round(verify_total_prob, 10)
|
| 167 |
print(f"Sum of all probabilities: {verify_total_prob_rounded}")
|
| 168 |
+
return
|
| 169 |
|
| 170 |
|
| 171 |
@app.cell(hide_code=True)
|
|
|
|
| 187 |
plt.text(verify_y_values[verify_i], verify_prob + 0.001, f'{verify_prob:.3f}', ha='center')
|
| 188 |
|
| 189 |
plt.gca() # Return the current axes to ensure proper display
|
| 190 |
+
return
|
| 191 |
|
| 192 |
|
| 193 |
@app.cell(hide_code=True)
|
| 194 |
def _(mo):
|
| 195 |
+
mo.md(r"""
|
| 196 |
+
## Data to Histograms to Probability Mass Functions
|
|
|
|
| 197 |
|
| 198 |
+
Here's something interesting: instead of writing down a formula, we can approximate a PMF directly from raw data. Let's see this in action by simulating lots of dice rolls and building an empirical PMF:
|
| 199 |
+
""")
|
|
|
|
| 200 |
return
|
| 201 |
|
| 202 |
|
|
|
|
| 216 |
# Display a small sample of the data
|
| 217 |
print(f"First 20 dice sums: {sim_dice_sums[:20]}")
|
| 218 |
print(f"Total number of trials: {sim_num_trials}")
|
| 219 |
+
return (sim_dice_sums,)
|
| 220 |
|
| 221 |
|
| 222 |
@app.cell(hide_code=True)
|
|
|
|
| 276 |
plt.text(sim_sorted_values[sim_i], sim_count + 19, str(sim_count), ha='center')
|
| 277 |
|
| 278 |
plt.gca() # Return the current axes to ensure proper display
|
| 279 |
+
return (sim_counter,)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 280 |
|
| 281 |
|
| 282 |
@app.cell(hide_code=True)
|
| 283 |
def _(mo):
|
| 284 |
+
mo.md(r"""
|
| 285 |
+
When we normalize a histogram (divide each count by the total sample size), we get a good approximation of the true PMF. It's a simple yet powerful idea: count how many times each value appears, then divide by the total number of trials.
|
|
|
|
| 286 |
|
| 287 |
+
Let's make this concrete. Say we want to estimate $P(Y=3)$, the probability of rolling a sum of 3 with two dice. We just count how many 3's appear in our simulated rolls and divide by the total number of rolls:
|
| 288 |
+
""")
|
|
|
|
| 289 |
return
|
| 290 |
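The count-and-divide idea can be sketched with a fresh simulation; the variable names here are ours, not the notebook's `sim_` cells:

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the demo is reproducible
n_trials = 100_000
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(n_trials)]

# Empirical PMF: frequency of each sum divided by the number of trials
counts = Counter(sums)
empirical_pmf = {y: counts[y] / n_trials for y in sorted(counts)}

print(empirical_pmf[3])             # close to 2/36 ≈ 0.0556
print(sum(empirical_pmf.values()))  # ≈ 1, as a PMF must be
```

With 100,000 trials the sampling error for any single probability is well under 0.01, so the empirical and theoretical values agree to a couple of decimal places.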
|
| 291 |
|
|
|
|
| 302 |
print(f"Empirical P(Y=3): {sim_count_of_3}/{len(sim_dice_sums)} = {sim_empirical_prob:.4f}")
|
| 303 |
print(f"Theoretical P(Y=3): 2/36 = {sim_theoretical_prob:.4f}")
|
| 304 |
print(f"Difference: {abs(sim_empirical_prob - sim_theoretical_prob):.4f}")
|
| 305 |
+
return
|
| 306 |
|
| 307 |
|
| 308 |
@app.cell(hide_code=True)
|
| 309 |
def _(mo):
|
| 310 |
+
mo.md(r"""
|
| 311 |
+
As we can see, with a large number of trials, the empirical PMF becomes a very good approximation of the theoretical PMF. This is an example of the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) in action.
|
|
|
|
| 312 |
|
| 313 |
+
## Interactive Example: Exploring PMFs
|
| 314 |
|
| 315 |
+
Let's create an interactive tool to explore different PMFs:
|
| 316 |
+
""")
|
|
|
|
| 317 |
return
|
| 318 |
|
| 319 |
|
|
|
|
| 444 |
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
|
| 445 |
|
| 446 |
plt.gca() # Return the current axes to ensure proper display
|
| 447 |
+
return dist_pmf_values, dist_x_values
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 448 |
|
| 449 |
|
| 450 |
@app.cell(hide_code=True)
|
| 451 |
def _(mo):
|
| 452 |
+
mo.md(r"""
|
| 453 |
+
## Expected Value from a PMF
|
|
|
|
| 454 |
|
| 455 |
+
The expected value (or mean) of a discrete random variable is calculated using its PMF:
|
| 456 |
|
| 457 |
+
$$E[X] = \sum_x x \cdot p_X(x)$$
|
| 458 |
|
| 459 |
+
This represents the long-run average value of the random variable.
|
| 460 |
+
""")
|
|
|
|
| 461 |
return
|
| 462 |
|
| 463 |
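The formula $E[X] = \sum_x x \cdot p_X(x)$ maps directly to a few lines of Python. A minimal sketch (the helper name `expected_value` is ours, not from the notebook):

```python
def expected_value(values, probs):
    """Weighted average: E[X] = sum over x of x * p(x)."""
    return sum(x * p for x, p in zip(values, probs))

# PMF for a single fair six-sided die
die_values = [1, 2, 3, 4, 5, 6]
die_probs = [1 / 6] * 6

print(expected_value(die_values, die_probs))  # ≈ 3.5
```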
|
|
|
|
| 471 |
ev_dist_mean = calc_expected_value(dist_x_values, dist_pmf_values)
|
| 472 |
|
| 473 |
print(f"Expected value: {ev_dist_mean:.4f}")
|
| 474 |
+
return (ev_dist_mean,)
|
| 475 |
|
| 476 |
|
| 477 |
@app.cell(hide_code=True)
|
| 478 |
def _(mo):
|
| 479 |
+
mo.md(r"""
|
| 480 |
+
## Variance from a PMF
|
|
|
|
| 481 |
|
| 482 |
+
The variance measures the spread or dispersion of a random variable around its mean:
|
| 483 |
|
| 484 |
+
$$\text{Var}(X) = E[(X - E[X])^2] = \sum_x (x - E[X])^2 \cdot p_X(x)$$
|
| 485 |
|
| 486 |
+
An alternative formula is:
|
| 487 |
|
| 488 |
+
$$\text{Var}(X) = E[X^2] - (E[X])^2 = \sum_x x^2 \cdot p_X(x) - \left(\sum_x x \cdot p_X(x)\right)^2$$
|
| 489 |
+
""")
|
|
|
|
| 490 |
return
|
| 491 |
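Both variance formulas are easy to check numerically. A quick sketch (helper names ours) comparing the definitional form against the $E[X^2] - (E[X])^2$ shortcut, using a fair die:

```python
def expected_value(values, probs):
    return sum(x * p for x, p in zip(values, probs))

def variance_definitional(values, probs):
    # Var(X) = sum over x of (x - E[X])^2 * p(x)
    mu = expected_value(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

def variance_shortcut(values, probs):
    # Var(X) = E[X^2] - (E[X])^2
    e_x2 = sum(x ** 2 * p for x, p in zip(values, probs))
    return e_x2 - expected_value(values, probs) ** 2

die = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
print(variance_definitional(die, probs))  # ≈ 35/12 ≈ 2.9167
print(variance_shortcut(die, probs))      # same value, up to float rounding
```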
|
| 492 |
|
|
|
|
| 502 |
|
| 503 |
print(f"Variance: {var_dist_var:.4f}")
|
| 504 |
print(f"Standard deviation: {var_dist_std_dev:.4f}")
|
| 505 |
+
return
|
| 506 |
|
| 507 |
|
| 508 |
@app.cell(hide_code=True)
|
| 509 |
def _(mo):
|
| 510 |
+
mo.md(r"""
|
| 511 |
+
## PMF vs. CDF
|
|
|
|
| 512 |
|
| 513 |
+
The **Cumulative Distribution Function (CDF)** is related to the PMF but gives the probability that the random variable $X$ is less than or equal to a value $x$:
|
| 514 |
|
| 515 |
+
$$F_X(x) = P(X \leq x) = \sum_{k \leq x} p_X(k)$$
|
| 516 |
|
| 517 |
+
While the PMF gives the probability mass at each point, the CDF accumulates these probabilities.
|
| 518 |
+
""")
|
|
|
|
| 519 |
return
|
| 520 |
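Since the CDF is just a running sum of the PMF, it can be sketched in a couple of lines (function name ours):

```python
from itertools import accumulate

def cdf_from_pmf(values, probs):
    """F(x) = P(X <= x): cumulative sums of the PMF, keyed by value."""
    return dict(zip(values, accumulate(probs)))

die = [1, 2, 3, 4, 5, 6]
cdf = cdf_from_pmf(die, [1 / 6] * 6)
print(cdf[3])  # ≈ 0.5, i.e. P(X <= 3)
print(cdf[6])  # ≈ 1.0, the whole distribution
```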
|
| 521 |
|
|
|
|
| 552 |
|
| 553 |
plt.tight_layout()
|
| 554 |
plt.gca() # Return the current axes to ensure proper display
|
| 555 |
+
return
|
| 556 |
|
| 557 |
|
| 558 |
@app.cell(hide_code=True)
|
| 559 |
def _(mo):
|
| 560 |
+
mo.md(r"""
|
| 561 |
+
The graphs above illustrate the key difference between PMF and CDF:
|
|
|
|
| 562 |
|
| 563 |
+
- **PMF (left)**: Shows the probability of the random variable taking each specific value: P(X = x)
|
| 564 |
+
- **CDF (right)**: Shows the probability of the random variable being less than or equal to each value: P(X ≤ x)
|
| 565 |
|
| 566 |
+
The CDF at any point is the sum of all PMF values up to and including that point. This is why the CDF is always non-decreasing and eventually reaches 1. For discrete distributions like this one, the CDF forms a step function that jumps at each value in the support of the random variable.
|
| 567 |
+
""")
|
|
|
|
| 568 |
return
|
| 569 |
|
| 570 |
|
| 571 |
@app.cell(hide_code=True)
|
| 572 |
def _(mo):
|
| 573 |
+
mo.md(r"""
|
| 574 |
+
## Test Your Understanding
|
| 575 |
+
|
| 576 |
+
Choose what you believe are the correct options in the questions below:
|
| 577 |
+
|
| 578 |
+
<details>
|
| 579 |
+
<summary>If X is a discrete random variable with PMF p(x), then p(x) must always be less than 1</summary>
|
| 580 |
+
❌ False! While most values in a PMF are typically less than 1, a PMF can have p(x) = 1 for a specific value if the random variable always takes that value (with 100% probability).
|
| 581 |
+
</details>
|
| 582 |
+
|
| 583 |
+
<details>
|
| 584 |
+
<summary>The sum of all probabilities in a PMF must equal exactly 1</summary>
|
| 585 |
+
✅ True! This is a fundamental property of any valid PMF. The total probability across all possible values must be 1, as the random variable must take some value.
|
| 586 |
+
</details>
|
| 587 |
+
|
| 588 |
+
<details>
|
| 589 |
+
<summary>A PMF can be estimated from data by creating a normalized histogram</summary>
|
| 590 |
+
✅ True! Counting the frequency of each value and dividing by the total number of observations gives an empirical PMF.
|
| 591 |
+
</details>
|
| 592 |
+
|
| 593 |
+
<details>
|
| 594 |
+
<summary>The expected value of a discrete random variable is always one of the possible values of the variable</summary>
|
| 595 |
+
❌ False! The expected value is a weighted average and may not be a value the random variable can actually take. For example, the expected value of a fair die roll is 3.5, which is not a possible outcome.
|
| 596 |
+
</details>
|
| 597 |
+
""")
|
|
|
|
|
|
|
| 598 |
return
|
| 599 |
|
| 600 |
|
| 601 |
@app.cell(hide_code=True)
|
| 602 |
def _(mo):
|
| 603 |
+
mo.md(r"""
|
| 604 |
+
## Practical Applications of PMFs
|
|
|
|
| 605 |
|
| 606 |
+
PMFs pop up everywhere - network engineers use them to model traffic patterns, reliability teams predict equipment failures, and marketers analyze purchase behavior. In finance, they help price options; in gaming, they're behind every dice roll. Machine learning algorithms like Naive Bayes rely on them, and they're essential for modeling rare events like genetic mutations or system failures.
|
| 607 |
+
""")
|
|
|
|
| 608 |
return
|
| 609 |
|
| 610 |
|
| 611 |
@app.cell(hide_code=True)
|
| 612 |
def _(mo):
|
| 613 |
+
mo.md(r"""
|
| 614 |
+
## Key Takeaways
|
|
|
|
| 615 |
|
| 616 |
+
PMFs give us the probability picture for discrete random variables - they tell us how likely each value is, must be non-negative, and always sum to 1. We can write them as equations, draw them as graphs, or estimate them from data. They're the foundation for calculating expected values and variances, which we'll explore in our next notebook on Expectation, where we'll learn how to summarize random variables with a single, most "expected" value.
|
| 617 |
+
""")
|
|
|
|
| 618 |
return
|
| 619 |
|
| 620 |
|
probability/11_expectation.py
CHANGED
|
@@ -10,55 +10,49 @@

 import marimo

-__generated_with = "0.
 app = marimo.App(width="medium", app_title="Expectation")


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        # Expectation
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Definition of Expectation
-
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Intuition Behind Expectation
-
-
-
-    )
     return

@@ -91,12 +85,14 @@ def _(np, plt):
         arrowprops=dict(facecolor='black', shrink=0.05, width=1.5))

     plt.gca()
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return

@@ -145,25 +141,23 @@ def _(mo):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Calculating Expectation
-
-
-
-
-
-
-
-    )
     return

@@ -179,18 +173,16 @@ def _():

     exp_die_result = calc_expectation_die()
     print(f"Expected value of a fair die roll: {exp_die_result}")
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ### Example 2: Sum of Two Dice
-
-
-    )
     return

@@ -210,7 +202,7 @@ def _():
     exp_test_values = [2, 7, 12]
     for exp_test_y in exp_test_values:
         print(f"P(Y = {exp_test_y}) = {pmf_sum_two_dice(exp_test_y)}")
-    return

 @app.cell

@@ -239,24 +231,16 @@ def _(pmf_sum_two_dice):

     # Verify that this equals 7
     print(f"Is the expected value exactly 7? {abs(exp_sum_result - 7) < 1e-10}")
-    return (
-        calc_expectation_sum_two_dice,
-        exp_direct_calc,
-        exp_direct_calc_rounded,
-        exp_sum_result,
-        exp_sum_result_rounded,
-    )


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ### Visualizing Expectation
-
-
-    )
     return

@@ -283,18 +267,16 @@ def _(plt, pmf_sum_two_dice):

     plt.tight_layout()
     plt.gca()
-    return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Demonstrating the Properties of Expectation
-
-
-    )
     return

@@ -321,25 +303,16 @@ def _(exp_die_result):

     # Verify they match
     print(f"Do they match? {abs(prop_expected_using_property - prop_expected_direct) < 1e-10}")
-    return (
-        prop_a,
-        prop_b,
-        prop_expected_direct,
-        prop_expected_direct_rounded,
-        prop_expected_using_property,
-        prop_expected_using_property_rounded,
-    )


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ### Law of the Unconscious Statistician (LOTUS)
-
-
-    )
     return

@@ -358,38 +331,27 @@ def _():

     print(f"E[X^2] for a die roll = {lotus_expected_x_squared_rounded}")
     print(f"(E[X])^2 for a die roll = {expected_x_squared_rounded}")
-    return (
-        expected_x_squared,
-        expected_x_squared_rounded,
-        lotus_die_probs,
-        lotus_die_values,
-        lotus_expected_x_squared,
-        lotus_expected_x_squared_rounded,
-    )


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Interactive Example
-
-
-
-
-    )
     return

@@ -423,7 +385,9 @@ def _(dist_description):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md("""
     return

@@ -549,37 +513,16 @@ def _(

     plt.tight_layout()
     plt.gca()
-    return (
-        annotation_x,
-        annotation_y,
-        current_expected,
-        current_param,
-        dist_ax,
-        dist_fig,
-        dist_props,
-        expected_values,
-        formula,
-        lambda_max,
-        lambda_min,
-        max_y,
-        n,
-        p_max,
-        p_min,
-        param_values,
-        title,
-        x_label,
-    )


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Expectation vs. Mode
-
-
-    )
     return

@@ -633,94 +576,75 @@ def _(np, plt, stats):

     plt.tight_layout()
     plt.gca()
-    return (
-        max_x,
-        mid_x,
-        min_x,
-        skew_ax,
-        skew_expected,
-        skew_expected_rounded,
-        skew_fig,
-        skew_mode,
-        skew_n,
-        skew_p,
-        skew_pmf_values,
-        skew_x_values,
-    )


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        For the sum of two dice we calculated earlier, we found the expected value to be exactly 7. In that case, 7 also happens to be the mode (most likely outcome) of the distribution. However, this is just a coincidence for this particular example!
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-
-
-
-
-
-
-
-
-
-
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Practical Applications of Expectation
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-
-        ## Key Takeaways
-
-
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""
     return


@@ -736,7 +660,7 @@ def _():

     import numpy as np
     from scipy import stats
     import collections
-    return

 @app.cell(hide_code=True)

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App(width="medium", app_title="Expectation")


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Expectation

+    _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/expectation/), by Stanford professor Chris Piech._

+    Expectations are fascinating: they represent the "center of mass" of a probability distribution. While they're often called "expected values" or "averages," they don't always match our intuition about what's "expected" to happen.

+    For me, the most interesting part about expectations is how they quantify what happens "on average" in the long run, even if that average isn't a possible outcome (like expecting 3.5 on a standard die roll).
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Definition of Expectation

+    Expectation (written as $E[X]$) is the "average outcome" of a random variable, with a twist: we weight each possible value by how likely it is to occur. Think of it as the "center of gravity" of the probability.

+    $$E[X] = \sum_x x \cdot P(X=x)$$

+    People call this concept by different names: mean, weighted average, center of mass, or first moment. They're all calculated the same way: multiply each value by its probability, then add everything up.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Intuition Behind Expectation

+    The expected value represents the long-run average value of a random variable over many independent repetitions of an experiment.

+    For example, if you roll a fair six-sided die many times and calculate the average of all rolls, that average will approach the expected value of 3.5 as the number of rolls increases.

+    Let's visualize this concept:
+    """)
     return
|
| 57 |
|
| 58 |
|
|
|
|
| 85 |
arrowprops=dict(facecolor='black', shrink=0.05, width=1.5))
|
| 86 |
|
| 87 |
plt.gca()
|
| 88 |
+
return
|
| 89 |
|
| 90 |
|
| 91 |
@app.cell(hide_code=True)
|
| 92 |
def _(mo):
|
| 93 |
+
mo.md(r"""
|
| 94 |
+
## Properties of Expectation
|
| 95 |
+
""")
|
| 96 |
return
|
| 97 |
|
| 98 |
|
|
|
|
| 141 |
|
| 142 |
@app.cell(hide_code=True)
|
| 143 |
def _(mo):
|
| 144 |
+
mo.md(r"""
|
| 145 |
+
## Calculating Expectation
|
|
|
|
| 146 |
|
| 147 |
+
Let's calculate the expected value for some common examples:
|
| 148 |
|
| 149 |
+
### Example 1: Fair Die Roll
|
| 150 |
|
| 151 |
+
For a fair six-sided die, the PMF is:
|
| 152 |
|
| 153 |
+
$$P(X=x) = \frac{1}{6} \text{ for } x \in \{1, 2, 3, 4, 5, 6\}$$
|
| 154 |
|
| 155 |
+
The expected value is:
|
| 156 |
|
| 157 |
+
$$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6} = 3.5$$
|
| 158 |
|
| 159 |
+
Let's implement this calculation in Python:
|
| 160 |
+
""")
|
|
|
|
| 161 |
return
|
| 162 |
|
| 163 |
|
|
|
|
| 173 |
|
| 174 |
exp_die_result = calc_expectation_die()
|
| 175 |
print(f"Expected value of a fair die roll: {exp_die_result}")
|
| 176 |
+
return (exp_die_result,)
|
| 177 |
|
| 178 |
|
| 179 |
@app.cell(hide_code=True)
|
| 180 |
def _(mo):
|
| 181 |
+
mo.md(r"""
|
| 182 |
+
### Example 2: Sum of Two Dice
|
|
|
|
| 183 |
|
| 184 |
+
Now let's calculate the expected value for the sum of two fair dice. First, we need the PMF:
|
| 185 |
+
""")
|
|
|
|
| 186 |
return
|
| 187 |
|
| 188 |
|
|
|
|
| 202 |
exp_test_values = [2, 7, 12]
|
| 203 |
for exp_test_y in exp_test_values:
|
| 204 |
print(f"P(Y = {exp_test_y}) = {pmf_sum_two_dice(exp_test_y)}")
|
| 205 |
+
return (pmf_sum_two_dice,)
|
| 206 |
|
| 207 |
|
| 208 |
@app.cell
|
|
|
|
| 231 |
|
| 232 |
# Verify that this equals 7
|
| 233 |
print(f"Is the expected value exactly 7? {abs(exp_sum_result - 7) < 1e-10}")
|
| 234 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 235 |
|
| 236 |
|
| 237 |
@app.cell(hide_code=True)
|
| 238 |
def _(mo):
|
| 239 |
+
mo.md(r"""
|
| 240 |
+
### Visualizing Expectation
|
|
|
|
| 241 |
|
| 242 |
+
Let's visualize the expectation for the sum of two dice. The expected value is the "center of mass" of the PMF:
|
| 243 |
+
""")
|
|
|
|
| 244 |
return
|
| 245 |
|
| 246 |
|
|
|
|
| 267 |
|
| 268 |
plt.tight_layout()
|
| 269 |
plt.gca()
|
| 270 |
+
return
|
| 271 |
|
| 272 |
|
| 273 |
@app.cell(hide_code=True)
|
| 274 |
def _(mo):
|
| 275 |
+
mo.md(r"""
|
| 276 |
+
## Demonstrating the Properties of Expectation
|
|
|
|
| 277 |
|
| 278 |
+
Let's demonstrate some of these properties with examples:
|
| 279 |
+
""")
|
|
|
|
| 280 |
return
|
| 281 |
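One property worth a quick numeric check is linearity: $E[aX + b] = aE[X] + b$. A sketch with hypothetical constants $a$ and $b$ (helper name ours), using a fair die:

```python
def expected_value(values, probs):
    return sum(x * p for x, p in zip(values, probs))

die = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
a, b = 2, 10  # hypothetical constants for the check

lhs = expected_value([a * x + b for x in die], probs)  # E[aX + b], directly
rhs = a * expected_value(die, probs) + b               # aE[X] + b, via the property
print(lhs, rhs)  # both ≈ 17
```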
|
| 282 |
|
|
|
|
| 303 |
|
| 304 |
# Verify they match
|
| 305 |
print(f"Do they match? {abs(prop_expected_using_property - prop_expected_direct) < 1e-10}")
|
| 306 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 307 |
|
| 308 |
|
| 309 |
@app.cell(hide_code=True)
|
| 310 |
def _(mo):
|
| 311 |
+
mo.md(r"""
|
| 312 |
+
### Law of the Unconscious Statistician (LOTUS)
|
|
|
|
| 313 |
|
| 314 |
+
Let's use LOTUS to calculate $E[X^2]$ for a die roll, which will be useful when we study variance:
|
| 315 |
+
""")
|
|
|
|
| 316 |
return
|
| 317 |
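LOTUS says $E[g(X)] = \sum_x g(x) \cdot p_X(x)$: to get the expectation of a function of $X$, weight $g(x)$ by the original PMF, with no need to derive the distribution of $g(X)$. A generic sketch (function name ours):

```python
def expectation_of(g, values, probs):
    """LOTUS: E[g(X)] = sum over x of g(x) * p(x)."""
    return sum(g(x) * p for x, p in zip(values, probs))

die = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

e_x2 = expectation_of(lambda x: x ** 2, die, probs)    # E[X^2] = 91/6 ≈ 15.17
e_x_sq = expectation_of(lambda x: x, die, probs) ** 2  # (E[X])^2 = 12.25
print(e_x2, e_x_sq)  # note they differ
```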
|
| 318 |
|
|
|
|
| 331 |
|
| 332 |
print(f"E[X^2] for a die roll = {lotus_expected_x_squared_rounded}")
|
| 333 |
print(f"(E[X])^2 for a die roll = {expected_x_squared_rounded}")
|
| 334 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 335 |
|
| 336 |
|
| 337 |
@app.cell(hide_code=True)
|
| 338 |
def _(mo):
|
| 339 |
+
mo.md(r"""
|
| 340 |
+
/// Note
|
| 341 |
+
Note that $E[X^2] \neq (E[X])^2$ in general.
|
| 342 |
+
""")
|
|
|
|
|
|
|
| 343 |
return
|
| 344 |
|
| 345 |
|
| 346 |
@app.cell(hide_code=True)
|
| 347 |
def _(mo):
|
| 348 |
+
mo.md(r"""
|
| 349 |
+
## Interactive Example
|
|
|
|
| 350 |
|
| 351 |
+
Let's explore how the expected value changes as we adjust the parameters of common probability distributions. This interactive visualization focuses specifically on the relationship between distribution parameters and expected values.
|
| 352 |
|
| 353 |
+
Use the controls below to select a distribution and adjust its parameters. The graph will show how the expected value changes across a range of parameter values.
|
| 354 |
+
""")
|
|
|
|
| 355 |
return
|
| 356 |
|
| 357 |
|
|
|
|
| 385 |
|
| 386 |
@app.cell(hide_code=True)
|
| 387 |
def _(mo):
|
| 388 |
+
mo.md("""
|
| 389 |
+
### Adjust Parameters
|
| 390 |
+
""")
|
| 391 |
return
|
| 392 |
|
| 393 |
|
|
|
|
| 513 |
|
| 514 |
plt.tight_layout()
|
| 515 |
plt.gca()
|
| 516 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 517 |
|
| 518 |
|
| 519 |
@app.cell(hide_code=True)
|
| 520 |
def _(mo):
|
| 521 |
+
mo.md(r"""
|
| 522 |
+
## Expectation vs. Mode
|
|
|
|
| 523 |
|
| 524 |
+
The expected value (mean) of a random variable is not always the same as its most likely value (mode). Let's explore this with an example:
|
| 525 |
+
""")
|
|
|
|
| 526 |
return
|
| 527 |
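A small stdlib check of the mean/mode gap, assuming binomial parameters $n = 10$, $p = 0.25$ (consistent with the $E[X] = 2.50$ quoted in the text; the notebook's own plot may use different settings):

```python
from math import comb

n, p = 10, 0.25  # assumed parameters for this check

# Binomial PMF: P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}
mean = sum(k * pk for k, pk in pmf.items())  # E[X] = n*p
mode = max(pmf, key=pmf.get)                 # most likely single outcome

print(f"mean = {mean:.2f}, mode = {mode}")  # mean = 2.50, mode = 2
```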
|
| 528 |
|
|
|
|
| 576 |
|
| 577 |
plt.tight_layout()
|
| 578 |
plt.gca()
|
| 579 |
+
return
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 580 |
|
| 581 |
|
| 582 |
@app.cell(hide_code=True)
|
| 583 |
def _(mo):
|
| 584 |
+
mo.md(r"""
|
| 585 |
+
/// NOTE
|
| 586 |
+
For the sum of two dice we calculated earlier, we found the expected value to be exactly 7. In that case, 7 also happens to be the mode (most likely outcome) of the distribution. However, this is just a coincidence for this particular example!
|
|
|
|
| 587 |
|
| 588 |
+
As we can see from the binomial distribution above, the expected value (2.50) and the mode (2) are often different values (this is common in skewed distributions). The expected value represents the "center of mass" of the distribution, while the mode represents the most likely single outcome.
|
| 589 |
+
""")
|
|
|
|
| 590 |
return
|
| 591 |
|
| 592 |
|
| 593 |
@app.cell(hide_code=True)
|
| 594 |
def _(mo):
|
| 595 |
+
mo.md(r"""
|
| 596 |
+
## 🤔 Test Your Understanding
|
| 597 |
+
|
| 598 |
+
Choose what you believe are the correct options in the questions below:
|
| 599 |
+
|
| 600 |
+
<details>
|
| 601 |
+
<summary>The expected value of a random variable is always one of the possible values the random variable can take.</summary>
|
| 602 |
+
❌ False! The expected value is a weighted average and may not be a value the random variable can actually take. For example, the expected value of a fair die roll is 3.5, which is not a possible outcome.
|
| 603 |
+
</details>
|
| 604 |
+
|
| 605 |
+
<details>
|
| 606 |
+
<summary>If X and Y are independent random variables, then E[X·Y] = E[X]·E[Y].</summary>
|
| 607 |
+
✅ True! For independent random variables, the expectation of their product equals the product of their expectations.
|
| 608 |
+
</details>
|
| 609 |
+
|
| 610 |
+
<details>
|
| 611 |
+
<summary>The expected value of a constant random variable (one that always takes the same value) is that constant.</summary>
|
| 612 |
+
✅ True! If X = c with probability 1, then E[X] = c.
|
| 613 |
+
</details>
|
| 614 |
+
|
| 615 |
+
<details>
|
| 616 |
+
<summary>The expected value of the sum of two random variables is always the sum of their expected values, regardless of whether they are independent.</summary>
|
| 617 |
+
✅ True! This is the linearity of expectation property: E[X + Y] = E[X] + E[Y], which holds regardless of dependence.
|
| 618 |
+
</details>
|
| 619 |
+
""")
|
|
|
|
|
|
|
| 620 |
return
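Two of the claims above - linearity of expectation and the product rule for independent variables - can be verified by brute-force enumeration of two fair dice, as a quick sketch:

```python
import itertools

# Enumerate all 36 equally likely outcomes of two independent fair dice
outcomes = list(itertools.product(range(1, 7), repeat=2))

def E(f):
    """Expectation of f(x, y) under the uniform joint distribution."""
    return sum(f(x, y) for x, y in outcomes) / len(outcomes)

ex, ey = E(lambda x, y: x), E(lambda x, y: y)
print(E(lambda x, y: x + y), ex + ey)   # linearity: both are 7.0
print(E(lambda x, y: x * y), ex * ey)   # independence: both are 12.25
```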
|
| 621 |
|
| 622 |
|
| 623 |
@app.cell(hide_code=True)
|
| 624 |
def _(mo):
|
| 625 |
+
mo.md(r"""
|
| 626 |
+
## Practical Applications of Expectation
|
|
|
|
| 627 |
|
| 628 |
+
Expected values show up everywhere - from investment decisions and insurance pricing to machine learning algorithms and game design. Engineers use them to predict system reliability, data scientists to understand customer behavior, and economists to model market outcomes. They're essential for risk assessment in project management and for optimizing resource allocation in operations research.
|
| 629 |
+
""")
|
|
|
|
| 630 |
return
|
| 631 |
|
| 632 |
|
| 633 |
@app.cell(hide_code=True)
|
| 634 |
def _(mo):
|
| 635 |
+
mo.md(r"""
|
| 636 |
+
## Key Takeaways
|
|
|
|
| 637 |
|
| 638 |
+
Expectation gives us a single value that summarizes a random variable's central tendency - it's the weighted average of all possible outcomes, where the weights are probabilities. The linearity property makes expectations easy to work with, even for complex combinations of random variables. While a PMF gives the complete probability picture, expectation provides an essential summary that helps us make decisions under uncertainty. In our next notebook, we'll explore variance, which measures how spread out a random variable's values are around its expectation.
|
| 639 |
+
""")
|
|
|
|
| 640 |
return
|
| 641 |
|
| 642 |
|
| 643 |
@app.cell(hide_code=True)
|
| 644 |
def _(mo):
|
| 645 |
+
mo.md(r"""
|
| 646 |
+
#### Appendix (containing helper code)
|
| 647 |
+
""")
|
| 648 |
return
|
| 649 |
|
| 650 |
|
|
|
|
| 660 |
import numpy as np
|
| 661 |
from scipy import stats
|
| 662 |
import collections
|
| 663 |
+
return np, plt, stats
|
| 664 |
|
| 665 |
|
| 666 |
@app.cell(hide_code=True)
|
probability/12_variance.py
CHANGED
|
@@ -11,77 +11,69 @@
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
-
__generated_with = "0.
|
| 15 |
app = marimo.App(width="medium", app_title="Variance")
|
| 16 |
|
| 17 |
|
| 18 |
@app.cell(hide_code=True)
|
| 19 |
def _(mo):
|
| 20 |
-
mo.md(
|
| 21 |
-
|
| 22 |
-
# Variance
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
)
|
| 35 |
return
|
| 36 |
|
| 37 |
|
| 38 |
@app.cell(hide_code=True)
|
| 39 |
def _(mo):
|
| 40 |
-
mo.md(
|
| 41 |
-
|
| 42 |
-
## Definition of Variance
|
| 43 |
|
| 44 |
-
|
| 45 |
|
| 46 |
-
|
| 47 |
|
| 48 |
-
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
)
|
| 56 |
return
|
| 57 |
|
| 58 |
|
| 59 |
@app.cell(hide_code=True)
|
| 60 |
def _(mo):
|
| 61 |
-
mo.md(
|
| 62 |
-
|
| 63 |
-
## Intuition Through Example
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
)
|
| 70 |
return
|
| 71 |
|
| 72 |
|
| 73 |
@app.cell(hide_code=True)
|
| 74 |
def _(mo):
|
| 75 |
-
mo.md(
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
"""
|
| 84 |
-
)
|
| 85 |
return
|
| 86 |
|
| 87 |
|
|
@@ -165,50 +157,35 @@ def _(
|
|
| 165 |
|
| 166 |
plt.tight_layout()
|
| 167 |
plt.gca()
|
| 168 |
-
return
|
| 169 |
-
ax1,
|
| 170 |
-
ax2,
|
| 171 |
-
ax3,
|
| 172 |
-
grader_a,
|
| 173 |
-
grader_b,
|
| 174 |
-
grader_c,
|
| 175 |
-
grader_fig,
|
| 176 |
-
var_a,
|
| 177 |
-
var_b,
|
| 178 |
-
var_c,
|
| 179 |
-
)
|
| 180 |
|
| 181 |
|
| 182 |
@app.cell(hide_code=True)
|
| 183 |
def _(mo):
|
| 184 |
-
mo.md(
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
"""
|
| 195 |
-
)
|
| 196 |
return
|
| 197 |
|
| 198 |
|
| 199 |
@app.cell(hide_code=True)
|
| 200 |
def _(mo):
|
| 201 |
-
mo.md(
|
| 202 |
-
|
| 203 |
-
## Computing Variance
|
| 204 |
|
| 205 |
-
|
| 206 |
|
| 207 |
-
|
| 208 |
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
)
|
| 212 |
return
|
| 213 |
|
| 214 |
|
|
@@ -234,75 +211,62 @@ def _(np):
|
|
| 234 |
print(f"E[X^2] = {expected_square:.2f}")
|
| 235 |
print(f"Var(X) = {variance:.2f}")
|
| 236 |
print(f"Standard Deviation = {std_dev:.2f}")
|
| 237 |
-
return
|
| 238 |
-
die_probs,
|
| 239 |
-
die_values,
|
| 240 |
-
expected_square,
|
| 241 |
-
expected_value,
|
| 242 |
-
std_dev,
|
| 243 |
-
variance,
|
| 244 |
-
)
|
| 245 |
|
| 246 |
|
| 247 |
@app.cell(hide_code=True)
|
| 248 |
def _(mo):
|
| 249 |
-
mo.md(
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
"""
|
| 258 |
-
)
|
| 259 |
return
|
| 260 |
|
| 261 |
|
| 262 |
@app.cell(hide_code=True)
|
| 263 |
def _(mo):
|
| 264 |
-
mo.md(
|
| 265 |
-
|
| 266 |
-
## Properties of Variance
|
| 267 |
|
| 268 |
-
|
| 269 |
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
|
| 275 |
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
)
|
| 279 |
return
|
| 280 |
|
| 281 |
|
| 282 |
@app.cell(hide_code=True)
|
| 283 |
def _(mo):
|
| 284 |
-
mo.md(
|
| 285 |
-
|
| 286 |
-
|
| 287 |
-
|
| 288 |
-
|
| 289 |
-
|
| 290 |
-
|
| 291 |
-
|
| 292 |
-
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
|
| 301 |
-
|
| 302 |
-
|
| 303 |
-
|
| 304 |
-
"""
|
| 305 |
-
)
|
| 306 |
return
|
| 307 |
|
| 308 |
|
|
@@ -322,7 +286,7 @@ def _(die_probs, die_values, np):
|
|
| 322 |
print(f"Scaled Variance (a={a}): {scaled_var:.2f}")
|
| 323 |
print(f"a^2 * Original Variance: {a**2 * original_var:.2f}")
|
| 324 |
print(f"Property holds: {abs(scaled_var - a**2 * original_var) < 1e-10}")
|
| 325 |
-
return
|
| 326 |
|
| 327 |
|
| 328 |
@app.cell
|
|
@@ -333,23 +297,21 @@ def _():
|
|
| 333 |
|
| 334 |
@app.cell(hide_code=True)
|
| 335 |
def _(mo):
|
| 336 |
-
mo.md(
|
| 337 |
-
|
| 338 |
-
## Standard Deviation
|
| 339 |
|
| 340 |
-
|
| 341 |
|
| 342 |
-
|
| 343 |
|
| 344 |
-
|
| 345 |
|
| 346 |
-
|
| 347 |
-
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
)
|
| 353 |
return
|
| 354 |
|
| 355 |
|
|
@@ -452,93 +414,80 @@ def _(normal_mean, normal_std, np, plt, stats):
|
|
| 452 |
|
| 453 |
plt.tight_layout()
|
| 454 |
plt.gca()
|
| 455 |
-
return
|
| 456 |
-
normal_ax,
|
| 457 |
-
normal_fig,
|
| 458 |
-
one_sigma_left,
|
| 459 |
-
one_sigma_right,
|
| 460 |
-
three_sigma_left,
|
| 461 |
-
three_sigma_right,
|
| 462 |
-
two_sigma_left,
|
| 463 |
-
two_sigma_right,
|
| 464 |
-
)
|
| 465 |
|
| 466 |
|
| 467 |
@app.cell(hide_code=True)
|
| 468 |
def _(mo):
|
| 469 |
-
mo.md(
|
| 470 |
-
|
| 471 |
-
|
| 472 |
-
|
| 473 |
-
|
| 474 |
-
|
| 475 |
-
|
| 476 |
-
|
| 477 |
-
|
| 478 |
-
|
| 479 |
-
"""
|
| 480 |
-
)
|
| 481 |
return
|
| 482 |
|
| 483 |
|
| 484 |
@app.cell(hide_code=True)
|
| 485 |
def _(mo):
|
| 486 |
-
mo.md(
|
| 487 |
-
|
| 488 |
-
|
| 489 |
-
|
| 490 |
-
|
| 491 |
-
|
| 492 |
-
|
| 493 |
-
|
| 494 |
-
|
| 495 |
-
|
| 496 |
-
|
| 497 |
-
|
| 498 |
-
|
| 499 |
-
|
| 500 |
-
|
| 501 |
-
|
| 502 |
-
|
| 503 |
-
|
| 504 |
-
|
| 505 |
-
|
| 506 |
-
|
| 507 |
-
|
| 508 |
-
|
| 509 |
-
|
| 510 |
-
|
| 511 |
-
|
| 512 |
-
|
| 513 |
-
|
| 514 |
-
|
| 515 |
-
|
| 516 |
-
"""
|
| 517 |
-
)
|
| 518 |
return
|
| 519 |
|
| 520 |
|
| 521 |
@app.cell(hide_code=True)
|
| 522 |
def _(mo):
|
| 523 |
-
mo.md(
|
| 524 |
-
|
| 525 |
-
## Key Takeaways
|
| 526 |
|
| 527 |
-
|
| 528 |
|
| 529 |
-
|
| 530 |
|
| 531 |
-
|
| 532 |
|
| 533 |
-
|
| 534 |
-
|
| 535 |
-
)
|
| 536 |
return
|
| 537 |
|
| 538 |
|
| 539 |
@app.cell(hide_code=True)
|
| 540 |
def _(mo):
|
| 541 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 542 |
return
|
| 543 |
|
| 544 |
|
|
|
|
| 11 |
|
| 12 |
import marimo
|
| 13 |
|
| 14 |
+
__generated_with = "0.18.4"
|
| 15 |
app = marimo.App(width="medium", app_title="Variance")
|
| 16 |
|
| 17 |
|
| 18 |
@app.cell(hide_code=True)
|
| 19 |
def _(mo):
|
| 20 |
+
mo.md(r"""
|
| 21 |
+
# Variance
|
|
|
|
| 22 |
|
| 23 |
+
_This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/variance/), by Stanford professor Chris Piech._
|
| 24 |
|
| 25 |
+
In our previous exploration of random variables, we learned about expectation - a measure of central tendency. However, knowing the average value alone doesn't tell us everything about a distribution. Consider these questions:
|
| 26 |
|
| 27 |
+
- How spread out are the values around the mean?
|
| 28 |
+
- How reliable is the expectation as a predictor of individual outcomes?
|
| 29 |
+
- How much do individual samples typically deviate from the average?
|
| 30 |
|
| 31 |
+
This is where **variance** comes in - it measures the spread or dispersion of a random variable around its expected value.
|
| 32 |
+
""")
|
|
|
|
| 33 |
return
|
| 34 |
|
| 35 |
|
| 36 |
@app.cell(hide_code=True)
|
| 37 |
def _(mo):
|
| 38 |
+
mo.md(r"""
|
| 39 |
+
## Definition of Variance
|
|
|
|
| 40 |
|
| 41 |
+
The variance of a random variable $X$ with expected value $\mu = E[X]$ is defined as:
|
| 42 |
|
| 43 |
+
$$\text{Var}(X) = E[(X-\mu)^2]$$
|
| 44 |
|
| 45 |
+
This definition captures the average squared deviation from the mean. There's also an equivalent, often more convenient formula:
|
| 46 |
|
| 47 |
+
$$\text{Var}(X) = E[X^2] - (E[X])^2$$
|
| 48 |
|
| 49 |
+
/// tip
|
| 50 |
+
The second formula is usually easier to compute, as it only requires calculating $E[X^2]$ and $E[X]$, rather than working with deviations from the mean.
|
| 51 |
+
""")
|
|
|
|
| 52 |
return
|
| 53 |
|
| 54 |
|
| 55 |
@app.cell(hide_code=True)
|
| 56 |
def _(mo):
|
| 57 |
+
mo.md(r"""
|
| 58 |
+
## Intuition Through Example
|
|
|
|
| 59 |
|
| 60 |
+
Let's look at a real-world example that illustrates why variance is important. Consider three different groups of graders evaluating assignments in a massive online course. Each grader has their own "grading distribution" - their pattern of assigning scores to work that deserves a 70/100.
|
| 61 |
|
| 62 |
+
The visualization below shows the probability distributions for three types of graders. Try clicking and dragging the blue numbers to adjust the parameters and see how they affect the variance.
|
| 63 |
+
""")
|
|
|
|
| 64 |
return
|
| 65 |
|
| 66 |
|
| 67 |
@app.cell(hide_code=True)
|
| 68 |
def _(mo):
|
| 69 |
+
mo.md(r"""
|
| 70 |
+
/// TIP
|
| 71 |
+
Try adjusting the blue numbers above to see how:
|
| 72 |
+
|
| 73 |
+
- Increasing spread increases variance
|
| 74 |
+
- The mixture ratio affects how many outliers appear in Grader C's distribution
|
| 75 |
+
- Changing the true grade shifts all distributions but maintains their relative variances
|
| 76 |
+
""")
|
|
|
|
|
|
|
| 77 |
return
|
| 78 |
|
| 79 |
|
|
|
|
| 157 |
|
| 158 |
plt.tight_layout()
|
| 159 |
plt.gca()
|
| 160 |
+
return
|
| 161 |
|
| 162 |
|
| 163 |
@app.cell(hide_code=True)
|
| 164 |
def _(mo):
|
| 165 |
+
mo.md(r"""
|
| 166 |
+
/// note
|
| 167 |
+
All three distributions have the same expected value (the true grade), but they differ significantly in their spread:
|
| 168 |
+
|
| 169 |
+
- **Grader A** has high variance - grades vary widely from the true value
|
| 170 |
+
- **Grader B** has low variance - grades consistently stay close to the true value
|
| 171 |
+
- **Grader C** has a mixture distribution - mostly consistent but with occasional extreme values
|
| 172 |
+
|
| 173 |
+
This illustrates why variance is crucial: two distributions can have the same mean but behave very differently in practice.
|
| 174 |
+
""")
|
|
|
|
|
|
|
| 175 |
return
|
| 176 |
|
| 177 |
|
| 178 |
@app.cell(hide_code=True)
|
| 179 |
def _(mo):
|
| 180 |
+
mo.md(r"""
|
| 181 |
+
## Computing Variance
|
|
|
|
| 182 |
|
| 183 |
+
Let's work through some concrete examples to understand how to calculate variance.
|
| 184 |
|
| 185 |
+
### Example 1: Fair Die Roll
|
| 186 |
|
| 187 |
+
Consider rolling a fair six-sided die. We'll calculate its variance step by step:
|
| 188 |
+
""")
|
|
|
|
| 189 |
return
|
| 190 |
|
| 191 |
|
|
|
|
| 211 |
print(f"E[X^2] = {expected_square:.2f}")
|
| 212 |
print(f"Var(X) = {variance:.2f}")
|
| 213 |
print(f"Standard Deviation = {std_dev:.2f}")
|
| 214 |
+
return die_probs, die_values
|
| 215 |
|
| 216 |
|
| 217 |
@app.cell(hide_code=True)
|
| 218 |
def _(mo):
|
| 219 |
+
mo.md(r"""
|
| 220 |
+
/// NOTE
|
| 221 |
+
For a fair die:
|
| 222 |
+
|
| 223 |
+
- The expected value (3.50) tells us the average roll
|
| 224 |
+
- The variance (2.92) tells us how much typical rolls deviate from this average
|
| 225 |
+
- The standard deviation (1.71) gives us this spread in the original units
|
| 226 |
+
""")
|
|
|
|
|
|
|
| 227 |
return
|
| 228 |
|
| 229 |
|
| 230 |
@app.cell(hide_code=True)
|
| 231 |
def _(mo):
|
| 232 |
+
mo.md(r"""
|
| 233 |
+
## Properties of Variance
|
|
|
|
| 234 |
|
| 235 |
+
Variance has several important properties that make it useful for analyzing random variables:
|
| 236 |
|
| 237 |
+
1. **Non-negativity**: $\text{Var}(X) \geq 0$ for any random variable $X$
|
| 238 |
+
2. **Variance of a constant**: $\text{Var}(c) = 0$ for any constant $c$
|
| 239 |
+
3. **Scaling**: $\text{Var}(aX) = a^2\text{Var}(X)$ for any constant $a$
|
| 240 |
+
4. **Translation**: $\text{Var}(X + b) = \text{Var}(X)$ for any constant $b$
|
| 241 |
+
5. **Independence**: If $X$ and $Y$ are independent, then $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
|
| 242 |
|
| 243 |
+
Let's verify a property with an example.
|
| 244 |
+
""")
|
|
|
|
| 245 |
return
|
| 246 |
|
| 247 |
|
| 248 |
@app.cell(hide_code=True)
|
| 249 |
def _(mo):
|
| 250 |
+
mo.md(r"""
|
| 251 |
+
## Proof of Variance Formula
|
| 252 |
+
|
| 253 |
+
The equivalence of the two variance formulas is a fundamental result in probability theory. Here's the proof:
|
| 254 |
+
|
| 255 |
+
Starting with the definition $\text{Var}(X) = E[(X-\mu)^2]$ where $\mu = E[X]$:
|
| 256 |
+
|
| 257 |
+
\begin{align}
|
| 258 |
+
\text{Var}(X) &= E[(X-\mu)^2] \\
|
| 259 |
+
&= \sum_x(x-\mu)^2P(x) && \text{Definition of Expectation}\\
|
| 260 |
+
&= \sum_x (x^2 -2\mu x + \mu^2)P(x) && \text{Expanding the square}\\
|
| 261 |
+
&= \sum_x x^2P(x)- 2\mu \sum_x xP(x) + \mu^2 \sum_x P(x) && \text{Distributing the sum}\\
|
| 262 |
+
&= E[X^2]- 2\mu E[X] + \mu^2 && \text{Definition of expectation}\\
|
| 263 |
+
&= E[X^2]- 2(E[X])^2 + (E[X])^2 && \text{Since }\mu = E[X]\\
|
| 264 |
+
&= E[X^2]- (E[X])^2 && \text{Simplifying}
|
| 265 |
+
\end{align}
|
| 266 |
+
|
| 267 |
+
/// tip
|
| 268 |
+
This proof shows why the formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is so useful - it's much easier to compute $E[X^2]$ and $E[X]$ separately than to work with deviations directly.
|
| 269 |
+
""")
|
|
|
|
|
|
|
| 270 |
return
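The equivalence proved above can be confirmed numerically for the fair-die example from earlier - a minimal sketch computing both formulas side by side:

```python
import numpy as np

# Fair six-sided die
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mu = np.sum(values * probs)                      # E[X] = 3.5
var_def = np.sum((values - mu) ** 2 * probs)     # E[(X - mu)^2]
var_alt = np.sum(values**2 * probs) - mu**2      # E[X^2] - (E[X])^2

print(f"{var_def:.4f} == {var_alt:.4f}")  # both are 2.9167
```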
|
| 271 |
|
| 272 |
|
|
|
|
| 286 |
print(f"Scaled Variance (a={a}): {scaled_var:.2f}")
|
| 287 |
print(f"a^2 * Original Variance: {a**2 * original_var:.2f}")
|
| 288 |
print(f"Property holds: {abs(scaled_var - a**2 * original_var) < 1e-10}")
|
| 289 |
+
return
|
| 290 |
|
| 291 |
|
| 292 |
@app.cell
|
|
|
|
| 297 |
|
| 298 |
@app.cell(hide_code=True)
|
| 299 |
def _(mo):
|
| 300 |
+
mo.md(r"""
|
| 301 |
+
## Standard Deviation
|
|
|
|
| 302 |
|
| 303 |
+
While variance is mathematically convenient, it has one practical drawback: its units are squared. For example, if we're measuring grades (0-100), the variance is in "grade points squared." This makes it hard to interpret intuitively.
|
| 304 |
|
| 305 |
+
The **standard deviation**, denoted by $\sigma$ or $\text{SD}(X)$, is the square root of variance:
|
| 306 |
|
| 307 |
+
$$\sigma = \sqrt{\text{Var}(X)}$$
|
| 308 |
|
| 309 |
+
/// tip
|
| 310 |
+
Standard deviation is often more intuitive because it's in the same units as the original data. For a normal distribution, approximately:
|
| 311 |
+
- 68% of values fall within 1 standard deviation of the mean
|
| 312 |
+
- 95% of values fall within 2 standard deviations
|
| 313 |
+
- 99.7% of values fall within 3 standard deviations
|
| 314 |
+
""")
|
|
|
|
| 315 |
return
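The 68-95-99.7 figures quoted in the tip can be recovered directly from the standard normal CDF - a quick check:

```python
from scipy import stats

Z = stats.norm(0, 1)  # standard normal
for k in (1, 2, 3):
    coverage = Z.cdf(k) - Z.cdf(-k)  # P(mu - k*sigma <= X <= mu + k*sigma)
    print(f"P(|X - mu| <= {k} sigma) = {coverage:.4f}")
# 0.6827, 0.9545, 0.9973
```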
|
| 316 |
|
| 317 |
|
|
|
|
| 414 |
|
| 415 |
plt.tight_layout()
|
| 416 |
plt.gca()
|
| 417 |
+
return
|
| 418 |
|
| 419 |
|
| 420 |
@app.cell(hide_code=True)
|
| 421 |
def _(mo):
|
| 422 |
+
mo.md(r"""
|
| 423 |
+
/// tip
|
| 424 |
+
The interactive visualization above demonstrates how standard deviation (σ) affects the shape of a normal distribution:
|
| 425 |
+
|
| 426 |
+
- The **red region** covers μ ± 1σ, containing approximately 68% of the probability
|
| 427 |
+
- The **green region** covers μ ± 2σ, containing approximately 95% of the probability
|
| 428 |
+
- The **blue region** covers μ ± 3σ, containing approximately 99.7% of the probability
|
| 429 |
+
|
| 430 |
+
This is known as the "68-95-99.7 rule" or the "empirical rule" and is a useful heuristic for understanding the spread of data.
|
| 431 |
+
""")
|
|
|
|
|
|
|
| 432 |
return
|
| 433 |
|
| 434 |
|
| 435 |
@app.cell(hide_code=True)
|
| 436 |
def _(mo):
|
| 437 |
+
mo.md(r"""
|
| 438 |
+
## 🤔 Test Your Understanding
|
| 439 |
+
|
| 440 |
+
Choose what you believe are the correct options in the questions below:
|
| 441 |
+
|
| 442 |
+
<details>
|
| 443 |
+
<summary>The variance of a random variable can be negative.</summary>
|
| 444 |
+
❌ False! Variance is defined as an expected value of squared deviations, and squares are always non-negative.
|
| 445 |
+
</details>
|
| 446 |
+
|
| 447 |
+
<details>
|
| 448 |
+
<summary>If X and Y are independent random variables, then Var(X + Y) = Var(X) + Var(Y).</summary>
|
| 449 |
+
✅ True! This is one of the key properties of variance for independent random variables.
|
| 450 |
+
</details>
|
| 451 |
+
|
| 452 |
+
<details>
|
| 453 |
+
<summary>Multiplying a random variable by 2 multiplies its variance by 2.</summary>
|
| 454 |
+
❌ False! Multiplying a random variable by a constant a multiplies its variance by a². So multiplying by 2 multiplies variance by 4.
|
| 455 |
+
</details>
|
| 456 |
+
|
| 457 |
+
<details>
|
| 458 |
+
<summary>Standard deviation is always equal to the square root of variance.</summary>
|
| 459 |
+
✅ True! By definition, standard deviation σ = √Var(X).
|
| 460 |
+
</details>
|
| 461 |
+
|
| 462 |
+
<details>
|
| 463 |
+
<summary>If Var(X) = 0, then X must be a constant.</summary>
|
| 464 |
+
✅ True! Zero variance means there is no spread around the mean, so X can only take one value.
|
| 465 |
+
</details>
|
| 466 |
+
""")
|
|
|
|
|
|
|
| 467 |
return
|
| 468 |
|
| 469 |
|
| 470 |
@app.cell(hide_code=True)
|
| 471 |
def _(mo):
|
| 472 |
+
mo.md(r"""
|
| 473 |
+
## Key Takeaways
|
|
|
|
| 474 |
|
| 475 |
+
Variance gives us a way to measure how spread out a random variable is around its mean. It's like the "uncertainty" in our expectation - a high variance means individual outcomes can differ widely from what we expect on average.
|
| 476 |
|
| 477 |
+
Standard deviation brings this measure back to the original units, making it easier to interpret. For grades, a standard deviation of 10 points means typical grades fall within about 10 points of the average.
|
| 478 |
|
| 479 |
+
Variance pops up everywhere - from weather forecasts (how reliable is the predicted temperature?) to financial investments (how risky is this stock?) to quality control (how consistent is our manufacturing process?).
|
| 480 |
|
| 481 |
+
In our next notebook, we'll explore more properties of random variables and see how they combine to form more complex distributions.
|
| 482 |
+
""")
|
|
|
|
| 483 |
return
|
| 484 |
|
| 485 |
|
| 486 |
@app.cell(hide_code=True)
|
| 487 |
def _(mo):
|
| 488 |
+
mo.md(r"""
|
| 489 |
+
Appendix (containing helper code):
|
| 490 |
+
""")
|
| 491 |
return
|
| 492 |
|
| 493 |
|
probability/13_bernoulli_distribution.py
CHANGED
|
@@ -10,60 +10,54 @@
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
-
__generated_with = "0.
|
| 14 |
app = marimo.App(width="medium", app_title="Bernoulli Distribution")
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
-
mo.md(
|
| 20 |
-
|
| 21 |
-
# Bernoulli Distribution
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
)
|
| 34 |
return
|
| 35 |
|
| 36 |
|
| 37 |
@app.cell(hide_code=True)
|
| 38 |
def _(mo):
|
| 39 |
-
mo.md(
|
| 40 |
-
|
| 41 |
-
## Bernoulli Random Variables
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
)
|
| 55 |
return
|
| 56 |
|
| 57 |
|
| 58 |
@app.cell(hide_code=True)
|
| 59 |
def _(mo):
|
| 60 |
-
mo.md(
|
| 61 |
-
|
| 62 |
-
## Key Properties of a Bernoulli Random Variable
|
| 63 |
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
)
|
| 67 |
return
|
| 68 |
|
| 69 |
|
|
@@ -72,31 +66,29 @@ def _(stats):
|
|
| 72 |
# Define the Bernoulli distribution function
|
| 73 |
def Bern(p):
|
| 74 |
return stats.bernoulli(p)
|
| 75 |
-
return
|
| 76 |
|
| 77 |
|
| 78 |
@app.cell(hide_code=True)
|
| 79 |
def _(mo):
|
| 80 |
-
mo.md(
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
\
|
| 90 |
-
\
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
"""
|
| 99 |
-
)
|
| 100 |
return
|
| 101 |
|
| 102 |
|
|
@@ -158,62 +150,58 @@ def _(expected_value, p_slider, plt, probabilities, values, variance):
|
|
| 158 |
ax.legend()
|
| 159 |
plt.tight_layout()
|
| 160 |
plt.gca()
|
| 161 |
-
return
|
| 162 |
|
| 163 |
|
| 164 |
@app.cell(hide_code=True)
|
| 165 |
def _(mo):
|
| 166 |
-
mo.md(
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
"""
|
| 197 |
-
)
|
| 198 |
return
|
| 199 |
|
| 200 |
|
| 201 |
@app.cell(hide_code=True)
|
| 202 |
def _(mo):
|
| 203 |
-
mo.md(
|
| 204 |
-
|
| 205 |
-
## Indicator Random Variables
|
| 206 |
|
| 207 |
-
|
| 208 |
|
| 209 |
-
|
| 210 |
|
| 211 |
-
|
| 212 |
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
)
|
| 217 |
return
|
| 218 |
|
| 219 |
|
|
@@ -234,7 +222,9 @@ def _(mo):
|
|
| 234 |
|
| 235 |
@app.cell(hide_code=True)
|
| 236 |
def _(mo):
|
| 237 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 238 |
return
|
| 239 |
|
| 240 |
|
|
@@ -276,7 +266,7 @@ def _(np, num_trials_slider, p_sim_slider, plt):
|
|
| 276 |
|
| 277 |
plt.tight_layout()
|
| 278 |
plt.gca()
|
| 279 |
-
return
|
| 280 |
|
| 281 |
|
| 282 |
@app.cell(hide_code=True)
|
|
@@ -296,88 +286,84 @@ def _(mo, np, trials):
|
|
| 296 |
|
| 297 |
This demonstrates how the sample proportion approaches the true probability $p$ as the number of trials increases.
|
| 298 |
""")
|
| 299 |
-
return
|
| 300 |
|
| 301 |
|
| 302 |
@app.cell(hide_code=True)
|
| 303 |
def _(mo):
|
| 304 |
-
mo.md(
|
| 305 |
-
|
| 306 |
-
## 🤔 Test Your Understanding
|
| 307 |
|
| 308 |
-
|
| 309 |
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
|
| 313 |
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
| 317 |
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
|
| 321 |
|
| 322 |
-
|
| 323 |
-
|
| 324 |
-
|
| 325 |
-
|
| 326 |
-
)
|
| 327 |
return
|
| 328 |
|
| 329 |
|
| 330 |
@app.cell(hide_code=True)
|
| 331 |
def _(mo):
|
| 332 |
-
mo.md(
|
| 333 |
-
|
| 334 |
-
## Applications of Bernoulli Random Variables
|
| 335 |
|
| 336 |
-
|
| 337 |
|
| 338 |
-
|
| 339 |
|
| 340 |
-
|
| 341 |
|
| 342 |
-
|
| 343 |
|
| 344 |
-
|
| 345 |
|
| 346 |
-
|
| 347 |
|
| 348 |
-
|
| 349 |
-
|
| 350 |
-
)
|
| 351 |
return
|
| 352 |
|
| 353 |
|
| 354 |
@app.cell(hide_code=True)
|
| 355 |
def _(mo):
|
| 356 |
-
mo.md(
|
| 357 |
-
|
| 358 |
-
## Summary
|
| 359 |
|
| 360 |
-
|
| 361 |
|
| 362 |
-
|
| 363 |
|
| 364 |
-
|
| 365 |
|
| 366 |
-
|
| 367 |
-
|
| 368 |
|
| 369 |
-
|
| 370 |
-
|
| 371 |
|
| 372 |
-
|
| 373 |
-
|
| 374 |
-
)
|
| 375 |
return
|
| 376 |
|
| 377 |
|
| 378 |
@app.cell(hide_code=True)
|
| 379 |
def _(mo):
|
| 380 |
-
mo.md(r"""
|
|
|
|
|
|
|
| 381 |
return
|
| 382 |
|
| 383 |
|
|
@@ -390,7 +376,7 @@ def _():
|
|
| 390 |
@app.cell(hide_code=True)
|
| 391 |
def _():
|
| 392 |
from marimo import Html
|
| 393 |
-
return
|
| 394 |
|
| 395 |
|
| 396 |
@app.cell(hide_code=True)
|
|
@@ -407,7 +393,7 @@ def _():
|
|
| 407 |
|
| 408 |
# Set random seed for reproducibility
|
| 409 |
np.random.seed(42)
|
| 410 |
-
return
|
| 411 |
|
| 412 |
|
| 413 |
@app.cell(hide_code=True)
|
|
|
|
| 10 |
|
| 11 |
import marimo
|
| 12 |
|
| 13 |
+
__generated_with = "0.18.4"
|
| 14 |
app = marimo.App(width="medium", app_title="Bernoulli Distribution")
|
| 15 |
|
| 16 |
|
| 17 |
@app.cell(hide_code=True)
|
| 18 |
def _(mo):
|
| 19 |
+
mo.md(r"""
|
| 20 |
+
# Bernoulli Distribution
|
|
|
|
| 21 |
|
| 22 |
+
> _Note:_ This notebook builds on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
|
| 23 |
|
| 24 |
+
## Parametric Random Variables
|
| 25 |
|
| 26 |
+
Probability has a bunch of classic random variable patterns that show up over and over. Let's explore some of the most important parametric discrete distributions.
|
| 27 |
|
| 28 |
+
Bernoulli is honestly the simplest distribution you'll ever see, but it's ridiculously powerful in practice. What makes it fascinating to me is how it captures any yes/no scenario: success/failure, heads/tails, 1/0.
|
| 29 |
|
| 30 |
+
I think of these distributions as the atoms of probability — they're the fundamental building blocks that everything else is made from.
|
| 31 |
+
""")
|
|
|
|
| 32 |
return
|
| 33 |
|
| 34 |
|
| 35 |
@app.cell(hide_code=True)
|
| 36 |
def _(mo):
|
| 37 |
+
mo.md(r"""
|
| 38 |
+
## Bernoulli Random Variables
|
|
|
|
| 39 |
|
| 40 |
+
A Bernoulli random variable boils down to just two possible values: 1 (success) or 0 (failure). Dead simple, but incredibly useful.
|
| 41 |
|
| 42 |
+
Some everyday examples where I see these:
|
| 43 |
|
| 44 |
+
- Coin flip (heads=1, tails=0)
|
| 45 |
+
- Whether that sketchy email is spam
|
| 46 |
+
- If someone actually clicks my ad
|
| 47 |
+
- Whether my code compiles first try (almost always 0 for me)
|
| 48 |
|
| 49 |
+
All you need is a single parameter $p$: the probability of success.
|
| 50 |
+
""")
|
|
|
|
| 51 |
return
|
| 52 |
|
| 53 |
|
| 54 |
@app.cell(hide_code=True)
|
| 55 |
def _(mo):
|
| 56 |
+
mo.md(r"""
|
| 57 |
+
## Key Properties of a Bernoulli Random Variable
|
|
|
|
| 58 |
|
| 59 |
+
If $X$ is declared to be a Bernoulli random variable with parameter $p$, denoted $X \sim \text{Bern}(p)$, it has the following properties:
|
| 60 |
+
""")
|
|
|
|
| 61 |
return
|
| 62 |
|
| 63 |
|
|
|
|
| 66 |
# Define the Bernoulli distribution function
|
| 67 |
def Bern(p):
|
| 68 |
return stats.bernoulli(p)
|
| 69 |
+
return
|
| 70 |
|
| 71 |
|
| 72 |
@app.cell(hide_code=True)
|
| 73 |
def _(mo):
|
| 74 |
+
mo.md(r"""
|
| 75 |
+
## Bernoulli Distribution Properties
|
| 76 |
+
|
| 77 |
+
$\begin{array}{lll}
|
| 78 |
+
\text{Notation:} & X \sim \text{Bern}(p) \\
|
| 79 |
+
\text{Description:} & \text{A boolean variable that is 1 with probability } p \\
|
| 80 |
+
\text{Parameters:} & p, \text{ the probability that } X = 1 \\
|
| 81 |
+
\text{Support:} & x \text{ is either 0 or 1} \\
|
| 82 |
+
\text{PMF equation:} & P(X = x) =
|
| 83 |
+
\begin{cases}
|
| 84 |
+
p & \text{if }x = 1\\
|
| 85 |
+
1-p & \text{if }x = 0
|
| 86 |
+
\end{cases} \\
|
| 87 |
+
\text{PMF (smooth):} & P(X = x) = p^x(1-p)^{1-x} \\
|
| 88 |
+
\text{Expectation:} & E[X] = p \\
|
| 89 |
+
\text{Variance:} & \text{Var}(X) = p(1-p) \\
|
| 90 |
+
\end{array}$
|
| 91 |
+
""")
|
|
|
|
|
|
|
| 92 |
return
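The tabulated properties can be sanity-checked against `scipy.stats.bernoulli` (the same distribution the `Bern` helper above wraps) - a minimal sketch with an arbitrary $p$:

```python
from scipy import stats

p = 0.3
X = stats.bernoulli(p)

assert abs(X.pmf(1) - p) < 1e-12           # P(X = 1) = p
assert abs(X.pmf(0) - (1 - p)) < 1e-12     # P(X = 0) = 1 - p
assert abs(X.mean() - p) < 1e-12           # E[X] = p
assert abs(X.var() - p * (1 - p)) < 1e-12  # Var(X) = p(1 - p)
print("all Bernoulli properties check out")
```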
|
| 93 |
|
| 94 |
|
|
|
|
| 150 |
ax.legend()
|
| 151 |
plt.tight_layout()
|
| 152 |
plt.gca()
|
| 153 |
+
return
|
| 154 |
|
| 155 |
|
| 156 |
@app.cell(hide_code=True)
|
| 157 |
def _(mo):
|
| 158 |
+
mo.md(r"""
|
| 159 |
+
## Expectation and Variance of a Bernoulli
|
| 160 |
+
|
| 161 |
+
> _Note:_ The following derivations are included as reference material. The credit for these mathematical formulations belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
|
| 162 |
+
|
| 163 |
+
Let's work through why $E[X] = p$ for a Bernoulli:
|
| 164 |
+
|
| 165 |
+
\begin{align}
|
| 166 |
+
E[X] &= \sum_x x \cdot P(X=x) && \text{Definition of expectation} \\
|
| 167 |
+
&= 1 \cdot p + 0 \cdot (1-p) &&
|
| 168 |
+
X \text{ can take on values 0 and 1} \\
|
| 169 |
+
&= p && \text{Remove the 0 term}
|
| 170 |
+
\end{align}
|
| 171 |
+
|
| 172 |
+
And for variance, we first need $E[X^2]$:
|
| 173 |
+
|
| 174 |
+
\begin{align}
|
| 175 |
+
E[X^2]
|
| 176 |
+
&= \sum_x x^2 \cdot P(X=x) && \text{LOTUS}\\
|
| 177 |
+
&= 0^2 \cdot (1-p) + 1^2 \cdot p\\
|
| 178 |
+
&= p
|
| 179 |
+
\end{align}
|
| 180 |
+
|
| 181 |
+
\begin{align}
|
| 182 |
+
\text{Var}(X)
|
| 183 |
+
&= E[X^2] - (E[X])^2 && \text{Def of variance} \\
|
| 184 |
+
&= p - p^2 && \text{Substitute }E[X^2]=p, E[X] = p \\
|
| 185 |
+
&= p (1-p) && \text{Factor out }p
|
| 186 |
+
\end{align}
|
| 187 |
+
""")
|
|
|
|
|
|
|
| 188 |
return
|
| 189 |
|
| 190 |
|
| 191 |
@app.cell(hide_code=True)
|
| 192 |
def _(mo):
|
| 193 |
+
mo.md(r"""
|
| 194 |
+
## Indicator Random Variables
|
|
|
|
| 195 |
|
| 196 |
+
Indicator variables are a clever trick I like to use — they turn events into numbers. Instead of dealing with "did the event happen?" (yes/no), we get "1" if it happened and "0" if it didn't.
|
| 197 |
|
| 198 |
+
Formally: an indicator variable $I$ for event $A$ equals 1 when $A$ occurs and 0 otherwise. These are just Bernoulli variables where $p = P(A)$. People often use notation like $I_A$ to name them.
|
| 199 |
|
| 200 |
+
Two key properties that make them super useful:
|
| 201 |
|
| 202 |
+
- $P(I=1)=P(A)$ - probability of getting a 1 is just the probability of the event
|
| 203 |
+
- $E[I]=P(A)$ - the expected value equals the probability (this one's a game-changer!)
|
| 204 |
+
""")
|
|
|
|
| 205 |
return
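The key property $E[I] = P(A)$ is easy to see in simulation. A minimal sketch (the event "rolled a 6" and the seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)  # 100k fair die rolls

I = (rolls == 6).astype(int)  # indicator of the event A = "rolled a 6"
print(f"E[I] ~ {I.mean():.4f}, P(A) = {1/6:.4f}")  # the two should nearly match
```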
|
| 206 |
|
| 207 |
|
|
|
|
| 222 |
|
| 223 |
@app.cell(hide_code=True)
|
| 224 |
def _(mo):
|
| 225 |
+
mo.md(r"""
|
| 226 |
+
## Simulation
|
| 227 |
+
""")
|
| 228 |
return
|
| 229 |
|
| 230 |
|
|
|
|
| 266 |
|
| 267 |
plt.tight_layout()
|
| 268 |
plt.gca()
|
| 269 |
+
return (trials,)
|
| 270 |
|
| 271 |
|
| 272 |
@app.cell(hide_code=True)
|
|
|
|
| 286 |
|
| 287 |
This demonstrates how the sample proportion approaches the true probability $p$ as the number of trials increases.
|
| 288 |
""")
|
| 289 |
+
return
|
| 290 |
|
| 291 |
|
| 292 |
@app.cell(hide_code=True)
|
| 293 |
def _(mo):
|
| 294 |
+
mo.md(r"""
|
| 295 |
+
## 🤔 Test Your Understanding
|
|
|
|
| 296 |
|
| 297 |
+
Pick which of these statements about Bernoulli random variables you think are correct:
|
| 298 |
|
| 299 |
+
/// details | The variance of a Bernoulli random variable is always less than or equal to 0.25
✅ Correct! The variance $p(1-p)$ reaches its maximum value of 0.25 when $p = 0.5$.
///

/// details | The expected value of a Bernoulli random variable must be either 0 or 1
❌ Incorrect! The expected value is $p$, which can be any value between 0 and 1.
///

/// details | If $X \sim \text{Bern}(0.3)$ and $Y \sim \text{Bern}(0.7)$, then $X$ and $Y$ have the same variance
✅ Correct! $\text{Var}(X) = 0.3 \times 0.7 = 0.21$ and $\text{Var}(Y) = 0.7 \times 0.3 = 0.21$.
///

/// details | Two independent coin flips can be modeled as the sum of two Bernoulli random variables
✅ Correct! The sum would follow a Binomial distribution with $n=2$.
///
""")
return


@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Applications of Bernoulli Random Variables

Bernoulli random variables are used in many real-world scenarios:

1. **Quality Control**: Testing whether a manufactured item is defective (1) or not (0)
2. **A/B Testing**: Determining whether a user clicks (1) or doesn't click (0) a website button
3. **Medical Testing**: Checking whether a patient tests positive (1) or negative (0) for a disease
4. **Election Modeling**: Modeling whether a particular voter votes for candidate A (1) or not (0)
5. **Financial Markets**: Modeling whether a stock price goes up (1) or down (0) in a simplified model

Because the Bernoulli distribution is parametric, as soon as you declare a random variable to be Bernoulli, you automatically know all of its pre-derived properties!
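
For instance, with `scipy.stats` the PMF, mean, and variance come for free once you fix $p$ (the 0.04 click-through rate below is a made-up A/B-testing example):

```python
from scipy import stats

click = stats.bernoulli(0.04)  # hypothetical click-through rate

print(click.pmf(1), click.pmf(0))  # P(X=1) = p, P(X=0) = 1 - p
print(click.mean())                # E[X] = p
print(click.var())                 # Var(X) = p(1 - p)
```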
""")
return


@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
## Summary

And that's a wrap on Bernoulli distributions! We've learned the simplest of all probability distributions: the one with only two possible outcomes. Flip a coin, check whether an email is spam, see whether your blind date shows up; these are all Bernoulli trials with success probability $p$.

The beauty of the Bernoulli is its simplicity: set $p$ (the probability of success) and you're good to go! The PMF gives $P(X=1) = p$ and $P(X=0) = 1-p$, the expectation is simply $p$, and the variance is $p(1-p)$. And when you're tracking whether a specific event happens or not? That's an indicator random variable, just another Bernoulli in disguise!

Two key things to remember:

/// note
💡 **Maximum Variance**: A Bernoulli's variance $p(1-p)$ reaches its maximum at $p=0.5$, making a fair coin the most "unpredictable" Bernoulli random variable.

💡 **Instant Properties**: Once you identify a random variable as Bernoulli, you instantly know all its properties (expectation, variance, PMF) without additional calculations.
///

Next up: the Binomial distribution, where we'll see what happens when we let Bernoulli trials have a party and add themselves together!
""")
return


@app.cell(hide_code=True)
def _(mo):
mo.md(r"""
#### Appendix (containing helper code for the notebook)
""")
return


@app.cell(hide_code=True)
def _():
from marimo import Html
return


@app.cell(hide_code=True)
# Set random seed for reproducibility
np.random.seed(42)
return np, plt, stats


@app.cell(hide_code=True)
|