koaning committed on
Commit 3b7283c · 1 Parent(s): 8288699

a big --fix indeed

This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
Files changed (50)
  1. _server/README.md +6 -1
  2. daft/01_what_makes_daft_special.py +30 -35
  3. daft/README.md +6 -1
  4. duckdb/008_loading_parquet.py +64 -64
  5. duckdb/009_loading_json.py +50 -56
  6. duckdb/011_working_with_apache_arrow.py +101 -96
  7. duckdb/01_getting_started.py +52 -60
  8. duckdb/DuckDB_Loading_CSVs.py +43 -37
  9. duckdb/README.md +5 -0
  10. functional_programming/05_functors.py +546 -602
  11. functional_programming/06_applicatives.py +687 -658
  12. functional_programming/CHANGELOG.md +6 -1
  13. functional_programming/README.md +12 -7
  14. optimization/01_least_squares.py +28 -34
  15. optimization/02_linear_program.py +31 -31
  16. optimization/03_minimum_fuel_optimal_control.py +38 -40
  17. optimization/04_quadratic_program.py +44 -46
  18. optimization/05_portfolio_optimization.py +50 -58
  19. optimization/06_convex_optimization.py +24 -26
  20. optimization/07_sdp.py +34 -36
  21. optimization/README.md +6 -1
  22. polars/01_why_polars.py +110 -132
  23. polars/02_dataframes.py +89 -95
  24. polars/03_loading_data.py +43 -83
  25. polars/04_basic_operations.py +118 -106
  26. polars/05_reactive_plots.py +98 -129
  27. polars/06_Dataframe_Transformer.py +42 -52
  28. polars/07-querying-with-sql.py +6 -6
  29. polars/08_working_with_columns.py +147 -165
  30. polars/09_data_types.py +70 -88
  31. polars/10_strings.py +199 -225
  32. polars/11_missing_data.py +48 -94
  33. polars/12_aggregations.py +67 -77
  34. polars/13_window_functions.py +59 -91
  35. polars/14_user_defined_functions.py +137 -159
  36. polars/16_lazy_execution.py +103 -113
  37. polars/README.md +6 -1
  38. probability/01_sets.py +60 -62
  39. probability/02_axioms.py +56 -66
  40. probability/03_probability_of_or.py +81 -103
  41. probability/04_conditional_probability.py +117 -133
  42. probability/05_independence.py +129 -163
  43. probability/06_probability_of_and.py +79 -89
  44. probability/07_law_of_total_probability.py +110 -122
  45. probability/08_bayes_theorem.py +128 -164
  46. probability/09_random_variables.py +184 -210
  47. probability/10_probability_mass_function.py +123 -191
  48. probability/11_expectation.py +115 -191
  49. probability/12_variance.py +151 -202
  50. probability/13_bernoulli_distribution.py +128 -142
_server/README.md CHANGED
@@ -1,3 +1,8 @@
+---
+title: Readme
+marimo-version: 0.18.4
+---
+
 # marimo learn server
 
 This folder contains server code for hosting marimo apps.
@@ -21,4 +26,4 @@ docker build -t marimo-learn .
 
 ```bash
 docker run -p 7860:7860 marimo-learn
-```
+```
daft/01_what_makes_daft_special.py CHANGED
@@ -8,28 +8,25 @@
 
 import marimo
 
-__generated_with = "0.13.6"
+__generated_with = "0.18.4"
 app = marimo.App(width="medium")
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 # What Makes Daft Special?
 
 > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
 
 Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst.
-"""
-)
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 ## 🎯 Introducing Daft: A Unified Data Engine
 
 Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand.
@@ -37,8 +34,7 @@ def _(mo):
 The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster.
 
 Let's go ahead and `pip install daft` to see it in action!
-"""
-)
+""")
 return
 
 
@@ -86,8 +82,7 @@
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 ## 🦀 Built with Rust: Performance and Simplicity
 
 One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users:
@@ -97,8 +92,7 @@ def _(mo):
 * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning compared to JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies.
 
 Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance.
-"""
-)
+""")
 return
 
 
@@ -118,7 +112,9 @@ def _(mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!""")
+mo.md(r"""
+A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!
+""")
 return
 
 
@@ -135,7 +131,9 @@ def _(daft):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:""")
+mo.md(r"""
+With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:
+""")
 return
 
 
@@ -147,14 +145,15 @@ def _(mo, trillion_rows_df):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.""")
+mo.md(r"""
+This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 ## 🌐 Scale Your Work: From Laptop to Cluster
 
 Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run:
@@ -163,15 +162,13 @@ def _(mo):
 * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines.
 
 This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter.
-"""
-)
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 ## 🖼️ Handling More Than Just Tables: Multimodal Data Support
 
 Modern datasets often contain more than just numbers and text. They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON.
@@ -179,8 +176,7 @@ def _(mo):
 Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common.
 
 As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame:
-"""
-)
+""")
 return
 
 
@@ -217,20 +213,23 @@ def _(daft):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.""")
+mo.md(r"""
+> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.""")
+mo.md(r"""
+In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 ## 🧑‍💻 Designed for Developers: Python and SQL Interfaces
 
 Daft aims to be developer-friendly by offering flexible ways to interact with your data:
@@ -239,8 +238,7 @@ def _(mo):
 * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system.
 
 This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills.
-"""
-)
+""")
 return
 
 
@@ -285,8 +283,7 @@ def _(daft):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 ## 🟣 Daft's Value Proposition
 
 So, what makes Daft special? It's the combination of these design choices:
@@ -299,8 +296,7 @@ def _(mo):
 These elements combine to make Daft a versatile tool for tackling modern data challenges.
 
 And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows 🚀.
-"""
-)
+""")
 return
 
 
@@ -308,7 +304,6 @@ def _(mo):
 def _():
 import daft
 import marimo as mo
-
 return daft, mo
daft/README.md CHANGED
@@ -1,3 +1,8 @@
+---
+title: Readme
+marimo-version: 0.18.4
+---
+
 # Learn Daft
 
 _🚧 This collection is a work in progress. Please help us add notebooks!_
@@ -23,4 +28,4 @@ You can also open notebooks in our online playground by appending marimo.app/ to
 
 **Thanks to all our notebook authors!**
 
-* [Péter Gyarmati](https://github.com/peter-gy)
+* [Péter Gyarmati](https://github.com/peter-gy)
duckdb/008_loading_parquet.py CHANGED
@@ -11,39 +11,35 @@
 
 import marimo
 
-__generated_with = "0.14.10"
+__generated_with = "0.18.4"
 app = marimo.App(width="medium")
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 # Loading Parquet files with DuckDB
 *By [Thomas Liang](https://github.com/thliang01)*
 #
-"""
-)
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
-[Apache Parquet](https://parquet.apache.org/) is a popular columnar storage format, optimized for analytics. Its columnar nature allows query engines like DuckDB to read only the necessary columns, leading to significant performance gains, especially for wide tables.
-
-DuckDB has excellent, built-in support for reading Parquet files, making it incredibly easy to query and analyze Parquet data directly without a separate loading step.
-
-In this notebook, we'll explore how to load and analyze Airbnb's stock price data from a remote Parquet file:
-<ul>
-<li>Querying a remote Parquet file directly.</li>
-<li>Using the `read_parquet` function for more control.</li>
-<li>Creating a persistent table from a Parquet file.</li>
-<li>Performing basic data analysis and visualization.</li>
-</ul>
-"""
-)
+mo.md(r"""
+[Apache Parquet](https://parquet.apache.org/) is a popular columnar storage format, optimized for analytics. Its columnar nature allows query engines like DuckDB to read only the necessary columns, leading to significant performance gains, especially for wide tables.
+
+DuckDB has excellent, built-in support for reading Parquet files, making it incredibly easy to query and analyze Parquet data directly without a separate loading step.
+
+In this notebook, we'll explore how to load and analyze Airbnb's stock price data from a remote Parquet file:
+<ul>
+<li>Querying a remote Parquet file directly.</li>
+<li>Using the `read_parquet` function for more control.</li>
+<li>Creating a persistent table from a Parquet file.</li>
+<li>Performing basic data analysis and visualization.</li>
+</ul>
+""")
 return
 
 
@@ -55,24 +51,24 @@ def _():
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""## Using `FROM` to query Parquet files""")
+mo.md(r"""
+## Using `FROM` to query Parquet files
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
-The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly.
-
-Let's query a dataset of Airbnb's stock price from Hugging Face.
-"""
-)
+mo.md(r"""
+The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly.
+
+Let's query a dataset of Airbnb's stock price from Hugging Face.
+""")
 return
 
 
 @app.cell
-def _(AIRBNB_URL, mo, null):
+def _(AIRBNB_URL, mo):
 mo.sql(
 f"""
 SELECT *
@@ -85,24 +81,24 @@ def _(AIRBNB_URL, mo, null):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""## Using `read_parquet`""")
+mo.md(r"""
+## Using `read_parquet`
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
-For more control, you can use the `read_parquet` table function. This is useful when you need to specify options, for example, when dealing with multiple files or specific data types.
-Some useful options for `read_parquet` include:
-
-- `binary_as_string=True`: Reads `BINARY` columns as `VARCHAR`.
-- `filename=True`: Adds a `filename` column with the path of the file for each row.
-- `hive_partitioning=True`: Enables reading of Hive-partitioned datasets.
-
-Here, we'll use `read_parquet` to select only a few relevant columns. This is much more efficient than `SELECT *` because DuckDB only needs to read the data for the columns we specify.
-"""
-)
+mo.md(r"""
+For more control, you can use the `read_parquet` table function. This is useful when you need to specify options, for example, when dealing with multiple files or specific data types.
+Some useful options for `read_parquet` include:
+
+- `binary_as_string=True`: Reads `BINARY` columns as `VARCHAR`.
+- `filename=True`: Adds a `filename` column with the path of the file for each row.
+- `hive_partitioning=True`: Enables reading of Hive-partitioned datasets.
+
+Here, we'll use `read_parquet` to select only a few relevant columns. This is much more efficient than `SELECT *` because DuckDB only needs to read the data for the columns we specify.
+""")
 return
 
 
@@ -120,31 +116,29 @@ def _(AIRBNB_URL, mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
-You can also read multiple Parquet files at once using a glob pattern. For example, to read all Parquet files in a directory `data/`:
-
-```sql
-SELECT * FROM read_parquet('data/*.parquet');
-```
-"""
-)
+mo.md(r"""
+You can also read multiple Parquet files at once using a glob pattern. For example, to read all Parquet files in a directory `data/`:
+
+```sql
+SELECT * FROM read_parquet('data/*.parquet');
+```
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""## Creating a table from a Parquet file""")
+mo.md(r"""
+## Creating a table from a Parquet file
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
-While querying Parquet files directly is powerful, sometimes it's useful to load the data into a persistent table within your DuckDB database. This can simplify subsequent queries and is a good practice if you'll be accessing the data frequently.
-"""
-)
+mo.md(r"""
+While querying Parquet files directly is powerful, sometimes it's useful to load the data into a persistent table within your DuckDB database. This can simplify subsequent queries and is a good practice if you'll be accessing the data frequently.
+""")
 return
 
 
@@ -156,7 +150,7 @@ def _(AIRBNB_URL, mo):
 SELECT * FROM read_parquet('{AIRBNB_URL}');
 """
 )
-return airbnb_stock, stock_table
+return (stock_table,)
 
 
 @app.cell(hide_code=True)
@@ -172,7 +166,7 @@ def _(mo, stock_table):
 
 
 @app.cell
-def _(airbnb_stock, mo):
+def _(mo):
 mo.sql(
 f"""
 SELECT * FROM airbnb_stock LIMIT 5;
@@ -183,18 +177,22 @@ def _(airbnb_stock, mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""## Analysis and Visualization""")
+mo.md(r"""
+## Analysis and Visualization
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""Let's perform a simple analysis: plotting the closing stock price over time.""")
+mo.md(r"""
+Let's perform a simple analysis: plotting the closing stock price over time.
+""")
 return
 
 
 @app.cell
-def _(airbnb_stock, mo):
+def _(mo):
 stock_data = mo.sql(
 f"""
 SELECT
@@ -209,7 +207,9 @@ def _(airbnb_stock, mo):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""Now we can easily visualize this result using marimo's integration with plotting libraries like Plotly.""")
+mo.md(r"""
+Now we can easily visualize this result using marimo's integration with plotting libraries like Plotly.
+""")
 return
 
 
@@ -227,14 +227,15 @@ def _(px, stock_data):
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(r"""## Conclusion""")
+mo.md(r"""
+## Conclusion
+""")
 return
 
 
 @app.cell(hide_code=True)
 def _(mo):
-mo.md(
-r"""
+mo.md(r"""
 In this notebook, we've seen how easy it is to work with Parquet files in DuckDB. We learned how to:
 <ul>
 <li>Query Parquet files directly from a URL using a simple `FROM` clause.</li>
@@ -244,8 +245,7 @@ def _(mo):
 </ul>
 
 DuckDB's native Parquet support makes it a powerful tool for interactive data analysis on large datasets without complex ETL pipelines.
-"""
-)
+""")
 return
 
 
duckdb/009_loading_json.py CHANGED
@@ -10,38 +10,34 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.12.8"
14
  app = marimo.App(width="medium")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
- mo.md(
20
- r"""
21
- # Loading JSON
22
 
23
- DuckDB supports reading and writing JSON through the `json` extension that should be present in most distributions and is autoloaded on first-use. If it's not, you can [install and load](https://duckdb.org/docs/stable/data/json/installing_and_loading.html) it manually like any other extension.
24
 
25
- In this tutorial we'll cover 4 different ways we can transfer JSON data in and out of DuckDB:
26
 
27
- - [`FROM`](https://duckdb.org/docs/stable/sql/query_syntax/from.html) statement.
28
- - [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function.
29
- - [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy#copy--from) statement.
30
- - [`IMPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement.
31
- """
32
- )
33
  return
34
 
35
 
36
  @app.cell(hide_code=True)
37
  def _(mo):
38
- mo.md(
39
- r"""
40
- ## Using `FROM`
41
 
42
- Loading data using `FROM` is simple and straightforward. We use a path or URL to the file we want to load where we'd normally put a table name. When we do this, DuckDB attempts to infer the right way to read the file including the correct format and column types. In most cases this is all we need to load data into DuckDB.
43
- """
44
- )
45
  return
46
 
47
 
@@ -57,20 +53,18 @@ def _(mo):
57
 
58
  @app.cell(hide_code=True)
59
  def _(mo):
60
- mo.md(
61
- r"""
62
- ## Using `read_json`
63
 
64
- For greater control over how the JSON is read, we can directly call the [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function. It supports a few different arguments — some common ones are:
65
 
66
- - `format='array'` or `format='newline_delimited'` - the former tells DuckDB that the rows should be read from a top-level JSON array while the latter means the rows should be read from JSON objects separated by a newline (JSONL/NDJSON).
67
- - `ignore_errors=true` - skips lines with parse errors when reading newline delimited JSON.
68
- - `columns={columnName: type, ...}` - lets you set types for individual columns manually.
69
- - `dateformat` and `timestampformat` - controls how DuckDB attempts to parse [Date](https://duckdb.org/docs/stable/sql/data_types/date) and [Timestamp](https://duckdb.org/docs/stable/sql/data_types/timestamp) types. Use the format specifiers specified in the [docs](https://duckdb.org/docs/stable/sql/functions/dateformat.html#format-specifiers).
70
 
71
- We could rewrite the previous query more explicitly as:
72
- """
73
- )
74
  return
75
 
76
 
@@ -99,24 +93,24 @@ def _(mo):
99
  ;
100
  """
101
  )
102
- return (cars_df,)
103
 
104
 
105
  @app.cell(hide_code=True)
106
  def _(mo):
107
- mo.md(r"""Other than singular files we can read [multiple files](https://duckdb.org/docs/stable/data/multiple_files/overview.html) at a time by either passing a list of files or a UNIX glob pattern.""")
 
 
108
  return
109
 
110
 
111
  @app.cell(hide_code=True)
112
  def _(mo):
113
- mo.md(
114
- r"""
115
- ## Using `COPY`
116
 
117
- `COPY` is useful both for importing and exporting data in a variety of formats, including JSON. For example, we can import data into an existing table from a JSON file.
118
- """
119
- )
120
  return
121
 
122
 
@@ -137,11 +131,11 @@ def _(mo):
137
  );
138
  """
139
  )
140
- return (cars2,)
141
 
142
 
143
  @app.cell
144
- def _(cars2, mo):
145
  _df = mo.sql(
146
  f"""
147
  COPY cars2 FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json' (FORMAT json, ARRAY true, DATEFORMAT '%Y-%m-%d');
@@ -153,7 +147,9 @@ def _(cars2, mo):
153
 
154
  @app.cell(hide_code=True)
155
  def _(mo):
156
- mo.md(r"""Similarly, we can write data from a table or select statement to a JSON file. For example, we create a new JSONL file with just the car names and miles per gallon. We first create a temporary directory to avoid cluttering our project directory.""")
 
 
157
  return
158
 
159
 
@@ -164,11 +160,11 @@ def _(Path):
164
  TMP_DIR = TemporaryDirectory()
165
  COPY_PATH = Path(TMP_DIR.name) / "cars_mpg.jsonl"
166
  print(COPY_PATH)
167
- return COPY_PATH, TMP_DIR, TemporaryDirectory
168
 
169
 
170
  @app.cell
171
- def _(COPY_PATH, cars2, mo):
172
  _df = mo.sql(
173
  f"""
174
  COPY (
@@ -191,13 +187,11 @@ def _(COPY_PATH, Path):
191
 
192
  @app.cell(hide_code=True)
193
  def _(mo):
194
- mo.md(
195
- r"""
196
- ## Using `IMPORT DATABASE`
197
 
198
- The last method for loading JSON data is the `IMPORT DATABASE` statement. It works in conjunction with `EXPORT DATABASE` to save and load an entire database to and from a directory. For example, let's export our default in-memory database.
199
- """
200
- )
201
  return
202
 
203
 
@@ -226,7 +220,9 @@ def _(EXPORT_PATH, Path):
226
 
227
  @app.cell(hide_code=True)
228
  def _(mo):
229
- mo.md(r"""We can then load the database back into DuckDB.""")
 
 
230
  return
231
 
232
 
@@ -250,14 +246,12 @@ def _(TMP_DIR):
250
 
251
  @app.cell(hide_code=True)
252
  def _(mo):
253
- mo.md(
254
- r"""
255
- ## Further Reading
256
 
257
- - Complete information on the JSON support in DuckDB can be found in their [documentation](https://duckdb.org/docs/stable/data/json/overview.html).
258
- - You can also learn more about using SQL in marimo from the [examples](https://github.com/marimo-team/marimo/tree/main/examples/sql).
259
- """
260
- )
261
  return
262
 
263
 
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App(width="medium")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
+ mo.md(r"""
20
+ # Loading JSON
 
21
 
22
+ DuckDB supports reading and writing JSON through the `json` extension, which is bundled with most distributions and autoloaded on first use. If it's not, you can [install and load](https://duckdb.org/docs/stable/data/json/installing_and_loading.html) it manually like any other extension.
23
 
24
+ In this tutorial we'll cover 4 different ways we can transfer JSON data in and out of DuckDB:
25
 
26
+ - [`FROM`](https://duckdb.org/docs/stable/sql/query_syntax/from.html) statement.
27
+ - [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function.
28
+ - [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy#copy--from) statement.
29
+ - [`IMPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement.
30
+ """)
 
31
  return
32
 
33
 
34
  @app.cell(hide_code=True)
35
  def _(mo):
36
+ mo.md(r"""
37
+ ## Using `FROM`
 
38
 
39
+ Loading data using `FROM` is straightforward: we use a path or URL to the file we want to load where we'd normally put a table name. When we do this, DuckDB attempts to infer the right way to read the file, including the correct format and column types. In most cases this is all we need to load data into DuckDB.
40
+ """)
 
41
  return
42
 
43
 
 
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
+ mo.md(r"""
57
+ ## Using `read_json`
 
58
 
59
+ For greater control over how the JSON is read, we can directly call the [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function. It supports a few different arguments — some common ones are:
60
 
61
+ - `format='array'` or `format='newline_delimited'` - the former tells DuckDB that the rows should be read from a top-level JSON array while the latter means the rows should be read from JSON objects separated by a newline (JSONL/NDJSON).
62
+ - `ignore_errors=true` - skips lines with parse errors when reading newline delimited JSON.
63
+ - `columns={columnName: type, ...}` - lets you set types for individual columns manually.
64
+ - `dateformat` and `timestampformat` - controls how DuckDB attempts to parse [Date](https://duckdb.org/docs/stable/sql/data_types/date) and [Timestamp](https://duckdb.org/docs/stable/sql/data_types/timestamp) types. Use the format specifiers specified in the [docs](https://duckdb.org/docs/stable/sql/functions/dateformat.html#format-specifiers).
65
 
66
+ We could rewrite the previous query more explicitly as:
67
+ """)
 
68
  return
69
 
70
 
 
93
  ;
94
  """
95
  )
96
+ return
97
 
98
 
99
  @app.cell(hide_code=True)
100
  def _(mo):
101
+ mo.md(r"""
102
+ Other than singular files we can read [multiple files](https://duckdb.org/docs/stable/data/multiple_files/overview.html) at a time by either passing a list of files or a UNIX glob pattern.
103
+ """)
104
  return
105
 
106
 
107
  @app.cell(hide_code=True)
108
  def _(mo):
109
+ mo.md(r"""
110
+ ## Using `COPY`
 
111
 
112
+ `COPY` is useful both for importing and exporting data in a variety of formats, including JSON. For example, we can import data into an existing table from a JSON file.
113
+ """)
 
114
  return
115
 
116
 
 
131
  );
132
  """
133
  )
134
+ return
135
 
136
 
137
  @app.cell
138
+ def _(mo):
139
  _df = mo.sql(
140
  f"""
141
  COPY cars2 FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json' (FORMAT json, ARRAY true, DATEFORMAT '%Y-%m-%d');
 
147
 
148
  @app.cell(hide_code=True)
149
  def _(mo):
150
+ mo.md(r"""
151
+ Similarly, we can write data from a table or select statement to a JSON file. For example, we create a new JSONL file with just the car names and miles per gallon. We first create a temporary directory to avoid cluttering our project directory.
152
+ """)
153
  return
154
 
155
 
 
160
  TMP_DIR = TemporaryDirectory()
161
  COPY_PATH = Path(TMP_DIR.name) / "cars_mpg.jsonl"
162
  print(COPY_PATH)
163
+ return COPY_PATH, TMP_DIR
164
 
165
 
166
  @app.cell
167
+ def _(COPY_PATH, mo):
168
  _df = mo.sql(
169
  f"""
170
  COPY (
 
187
 
188
  @app.cell(hide_code=True)
189
  def _(mo):
190
+ mo.md(r"""
191
+ ## Using `IMPORT DATABASE`
 
192
 
193
+ The last method for loading JSON data is the `IMPORT DATABASE` statement. It works in conjunction with `EXPORT DATABASE` to save and load an entire database to and from a directory. For example, let's export our default in-memory database.
194
+ """)
 
195
  return
196
 
197
 
 
220
 
221
  @app.cell(hide_code=True)
222
  def _(mo):
223
+ mo.md(r"""
224
+ We can then load the database back into DuckDB.
225
+ """)
226
  return
227
 
228
 
 
246
 
247
  @app.cell(hide_code=True)
248
  def _(mo):
249
+ mo.md(r"""
250
+ ## Further Reading
 
251
 
252
+ - Complete information on the JSON support in DuckDB can be found in their [documentation](https://duckdb.org/docs/stable/data/json/overview.html).
253
+ - You can also learn more about using SQL in marimo from the [examples](https://github.com/marimo-team/marimo/tree/main/examples/sql).
254
+ """)
 
255
  return
256
 
257
 
duckdb/011_working_with_apache_arrow.py CHANGED
@@ -14,41 +14,37 @@
14
 
15
  import marimo
16
 
17
- __generated_with = "0.14.12"
18
  app = marimo.App(width="medium")
19
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
- mo.md(
24
- r"""
25
  # Working with Apache Arrow
26
  *By [Thomas Liang](https://github.com/thliang01)*
27
  #
28
- """
29
- )
30
  return
31
 
32
 
33
  @app.cell(hide_code=True)
34
  def _(mo):
35
- mo.md(
36
- r"""
37
- [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
38
 
39
- A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
40
 
41
- DuckDB has native support for Apache Arrow, which is an in-memory columnar data format. This allows for efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow).
42
 
43
- In this notebook, we'll explore how to:
44
 
45
- - Create an Arrow table from a DuckDB query.
46
- - Load an Arrow table into DuckDB.
47
- - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
48
- - Combine data from multiple sources.
49
- - Explore the performance benefits.
50
- """
51
- )
52
  return
53
 
54
 
@@ -71,23 +67,21 @@ def _(mo):
71
  (5, 'Eve', 40, 'London');
72
  """
73
  )
74
- return (users,)
75
 
76
 
77
  @app.cell(hide_code=True)
78
  def _(mo):
79
- mo.md(
80
- r"""
81
- ## 1. Creating an Arrow Table from a DuckDB Query
82
 
83
- You can directly fetch the results of a DuckDB query as an Apache Arrow table using the `.arrow()` method on the query result.
84
- """
85
- )
86
  return
87
 
88
 
89
  @app.cell
90
- def _(mo, users):
91
  users_arrow_table = mo.sql( # type: ignore
92
  """
93
  SELECT * FROM users WHERE age > 30;
@@ -98,7 +92,9 @@ def _(mo, users):
98
 
99
  @app.cell(hide_code=True)
100
  def _(mo):
101
- mo.md(r"""The `.arrow()` method returns a `pyarrow.Table` object. We can inspect its schema:""")
 
 
102
  return
103
 
104
 
@@ -110,13 +106,11 @@ def _(users_arrow_table):
110
 
111
  @app.cell(hide_code=True)
112
  def _(mo):
113
- mo.md(
114
- r"""
115
- ## 2. Loading an Arrow Table into DuckDB
116
 
117
- You can also register an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly with DuckDB. This allows you to query the in-memory data without any copying, which is highly efficient.
118
- """
119
- )
120
  return
121
 
122
 
@@ -129,17 +123,19 @@ def _(pa):
129
  'age': [22, 45],
130
  'city': ['Berlin', 'Tokyo']
131
  })
132
- return (new_data,)
133
 
134
 
135
  @app.cell(hide_code=True)
136
  def _(mo):
137
- mo.md(r"""Now, we can query this Arrow table `new_data` directly from SQL by embedding it in the query.""")
 
 
138
  return
139
 
140
 
141
  @app.cell
142
- def _(mo, new_data):
143
  mo.sql(
144
  f"""
145
  SELECT name, age, city
@@ -152,19 +148,19 @@ def _(mo, new_data):
152
 
153
  @app.cell(hide_code=True)
154
  def _(mo):
155
- mo.md(
156
- r"""
157
- ## 3. Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
158
 
159
- The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast.
160
- """
161
- )
162
  return
163
 
164
 
165
  @app.cell(hide_code=True)
166
  def _(mo):
167
- mo.md(r"""### From DuckDB to Polars/Pandas""")
 
 
168
  return
169
 
170
 
@@ -186,7 +182,9 @@ def _(users_arrow_table):
186
 
187
  @app.cell(hide_code=True)
188
  def _(mo):
189
- mo.md(r"""### From Polars/Pandas to DuckDB""")
 
 
190
  return
191
 
192
 
@@ -199,17 +197,19 @@ def _(pl):
199
  "price": [1200.00, 25.50, 75.00]
200
  })
201
  polars_df
202
- return (polars_df,)
203
 
204
 
205
  @app.cell(hide_code=True)
206
  def _(mo):
207
- mo.md(r"""Now we can query this Polars DataFrame directly in DuckDB:""")
 
 
208
  return
209
 
210
 
211
  @app.cell
212
- def _(mo, polars_df):
213
  # Query the Polars DataFrame directly in DuckDB
214
  mo.sql(
215
  f"""
@@ -224,7 +224,9 @@ def _(mo, polars_df):
224
 
225
  @app.cell(hide_code=True)
226
  def _(mo):
227
- mo.md(r"""Similarly, we can query a Pandas DataFrame:""")
 
 
228
  return
229
 
230
 
@@ -238,11 +240,11 @@ def _(pd):
238
  "order_date": pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17'])
239
  })
240
  pandas_df
241
- return (pandas_df,)
242
 
243
 
244
  @app.cell
245
- def _(mo, pandas_df):
246
  # Query the Pandas DataFrame in DuckDB
247
  mo.sql(
248
  f"""
@@ -257,18 +259,16 @@ def _(mo, pandas_df):
257
 
258
  @app.cell(hide_code=True)
259
  def _(mo):
260
- mo.md(
261
- r"""
262
- ## 4. Advanced Example: Combining Multiple Data Sources
263
 
264
- One of the most powerful features is the ability to join data from different sources (DuckDB tables, Arrow tables, Polars/Pandas DataFrames) in a single query:
265
- """
266
- )
267
  return
268
 
269
 
270
  @app.cell
271
- def _(mo, pandas_df, polars_df, users):
272
  # Join the DuckDB users table with the Polars products DataFrame and Pandas orders DataFrame
273
  result = mo.sql(
274
  f"""
@@ -291,27 +291,28 @@ def _(mo, pandas_df, polars_df, users):
291
 
292
  @app.cell(hide_code=True)
293
  def _(mo):
294
- mo.md(
295
- r"""
296
- ## 5. Performance Benefits of Arrow Integration
297
 
298
- The zero-copy integration between DuckDB and Apache Arrow delivers significant performance and memory benefits. This seamless integration enables:
299
 
300
- ### Key Benefits:
301
 
302
- - **Memory Efficiency**: Arrow's columnar format uses 20-40% less memory than traditional DataFrames through compact columnar representation and better compression ratios
303
- - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage
304
- - **Query Performance**: 2-10x faster queries compared to traditional approaches that require data copying
305
- - **Larger-than-Memory Analysis**: Both DuckDB and Arrow-compatible libraries support streaming query results, allowing you to execute queries on data larger than available memory by processing data in batches.
306
- - **Advanced Query Optimization**: DuckDB's optimizer can push down filters and projections directly into Arrow scans, reading only relevant columns and partitions
307
- Let's demonstrate these benefits with concrete examples:
308
- """
309
- )
310
  return
311
 
 
312
  @app.cell(hide_code=True)
313
  def _(mo):
314
- mo.md(r"""### Memory Efficiency Demonstration""")
 
 
315
  return
316
 
317
 
@@ -352,18 +353,22 @@ def _(pd, pl):
352
 
353
  @app.cell(hide_code=True)
354
  def _(mo):
355
- mo.md(r"""### Performance Comparison: Arrow vs Non-Arrow Approaches""")
 
 
356
  return
357
 
358
 
359
  @app.cell(hide_code=True)
360
  def _(mo):
361
- mo.md(r"""Let's compare three approaches for the same analytical query:""")
 
 
362
  return
363
 
364
 
365
  @app.cell
366
- def _(duckdb, mo, pandas_data, polars_data, time):
367
  # Test query: group by category and calculate aggregations
368
  query = """
369
  SELECT
@@ -425,14 +430,16 @@ def _(duckdb, mo, pandas_data, polars_data, time):
425
 
426
  @app.cell(hide_code=True)
427
  def _(mo):
428
- mo.md(r"""### Visualizing the Performance Difference""")
 
 
429
  return
430
 
431
 
432
  @app.cell
433
  def _(approach1_time, approach2_time, approach3_time, mo, pl):
434
  import altair as alt
435
-
436
  # Create a bar chart showing the performance comparison
437
  performance_data = pl.DataFrame({
438
  "Approach": ["Traditional\n(Copy to DuckDB)", "Pandas\nGroupBy", "Arrow-based\n(Zero-copy)"],
@@ -450,27 +457,30 @@ def _(approach1_time, approach2_time, approach3_time, mo, pl):
450
  width=400,
451
  height=300
452
  )
453
-
454
  # Display using marimo's altair_chart UI element
455
  mo.ui.altair_chart(chart)
456
- return alt, chart, performance_data
457
-
458
 
459
 
460
  @app.cell(hide_code=True)
461
  def _(mo):
462
- mo.md(r"""### Complex Query Performance""")
 
 
463
  return
464
 
465
 
466
  @app.cell(hide_code=True)
467
  def _(mo):
468
- mo.md(r"""Let's test a more complex query with joins and window functions:""")
 
 
469
  return
470
 
471
 
472
  @app.cell
473
- def _(mo, pl, polars_data, time):
474
  # Create additional datasets for join operations
475
  categories_df = pl.DataFrame({
476
  "category": [f"cat_{i}" for i in range(100)],
@@ -510,23 +520,21 @@ def _(mo, pl, polars_data, time):
510
  print(f"Complex query with joins and window functions completed in {complex_query_time:.3f} seconds")
511
 
512
  complex_result
513
- return (categories_df,)
514
 
515
 
516
  @app.cell(hide_code=True)
517
  def _(mo):
518
- mo.md(
519
- r"""
520
  ### Memory Efficiency During Operations
521
 
522
  Let's demonstrate how Arrow's zero-copy operations save memory during data transformations:
523
- """
524
- )
525
  return
526
 
527
 
528
  @app.cell
529
- def _(polars_data, time):
530
  import os
531
  import pyarrow.compute as pc # Add this import
532
 
@@ -558,7 +566,7 @@ def _(polars_data, time):
558
 
559
  copy_ops_time = time.time() - latest_start_time
560
  memory_after_copy = process.memory_info().rss / 1024 / 1024 # MB
561
-
562
  print("Memory Usage Comparison:")
563
  print(f"Initial memory: {memory_before:.2f} MB")
564
  print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")
@@ -567,14 +575,12 @@ def _(polars_data, time):
567
  print(f"Arrow operations: {arrow_ops_time:.3f} seconds")
568
  print(f"Copy operations: {copy_ops_time:.3f} seconds")
569
  print(f"Speedup: {copy_ops_time/arrow_ops_time:.1f}x")
570
- return pc
571
-
572
 
573
 
574
  @app.cell(hide_code=True)
575
  def _(mo):
576
- mo.md(
577
- r"""
578
  ## Summary
579
 
580
  In this notebook, we've explored:
@@ -590,8 +596,7 @@ def _(mo):
590
  - **Better scalability**: Can handle larger datasets within the same memory constraints
591
 
592
  The seamless integration between DuckDB and Arrow-compatible systems makes it easy to work with data across different tools while maintaining high performance and memory efficiency.
593
- """
594
- )
595
  return
596
 
597
 
@@ -604,7 +609,7 @@ def _():
604
  import duckdb
605
  import sqlglot
606
  import psutil
607
- return duckdb, mo, pa, pd, pl
608
 
609
 
610
  if __name__ == "__main__":
 
14
 
15
  import marimo
16
 
17
+ __generated_with = "0.18.4"
18
  app = marimo.App(width="medium")
19
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
+ mo.md(r"""
 
24
  # Working with Apache Arrow
25
  *By [Thomas Liang](https://github.com/thliang01)*
26
  #
27
+ """)
 
28
  return
29
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
+ mo.md(r"""
34
+ [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
 
35
 
36
+ A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
37
 
38
+ DuckDB has native support for Apache Arrow, which is an in-memory columnar data format. This allows for efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow).
39
 
40
+ In this notebook, we'll explore how to:
41
 
42
+ - Create an Arrow table from a DuckDB query.
43
+ - Load an Arrow table into DuckDB.
44
+ - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames.
45
+ - Combine data from multiple sources.
46
+ - Explore the performance benefits.
47
+ """)
 
48
  return
49
 
50
 
 
67
  (5, 'Eve', 40, 'London');
68
  """
69
  )
70
+ return
71
 
72
 
73
  @app.cell(hide_code=True)
74
  def _(mo):
75
+ mo.md(r"""
76
+ ## 1. Creating an Arrow Table from a DuckDB Query
 
77
 
78
+ You can directly fetch the results of a DuckDB query as an Apache Arrow table using the `.arrow()` method on the query result.
79
+ """)
 
80
  return
81
 
82
 
83
  @app.cell
84
+ def _(mo):
85
  users_arrow_table = mo.sql( # type: ignore
86
  """
87
  SELECT * FROM users WHERE age > 30;
 
92
 
93
  @app.cell(hide_code=True)
94
  def _(mo):
95
+ mo.md(r"""
96
+ The `.arrow()` method returns a `pyarrow.Table` object. We can inspect its schema:
97
+ """)
98
  return
99
 
100
 
 
106
 
107
  @app.cell(hide_code=True)
108
  def _(mo):
109
+ mo.md(r"""
110
+ ## 2. Loading an Arrow Table into DuckDB
 
111
 
112
+ You can also register an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly with DuckDB. This allows you to query the in-memory data without any copying, which is highly efficient.
113
+ """)
 
114
  return
115
 
116
 
 
123
  'age': [22, 45],
124
  'city': ['Berlin', 'Tokyo']
125
  })
126
+ return
127
 
128
 
129
  @app.cell(hide_code=True)
130
  def _(mo):
131
+ mo.md(r"""
132
+ Now, we can query this Arrow table `new_data` directly from SQL by embedding it in the query.
133
+ """)
134
  return
135
 
136
 
137
  @app.cell
138
+ def _(mo):
139
  mo.sql(
140
  f"""
141
  SELECT name, age, city
 
148
 
149
  @app.cell(hide_code=True)
150
  def _(mo):
151
+ mo.md(r"""
152
+ ## 3. Converting between DuckDB, Arrow, and Polars/Pandas DataFrames
 
153
 
154
+ The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast.
155
+ """)
 
156
  return
157
 
158
 
159
  @app.cell(hide_code=True)
160
  def _(mo):
161
+ mo.md(r"""
162
+ ### From DuckDB to Polars/Pandas
163
+ """)
164
  return
165
 
166
 
 
182
 
183
  @app.cell(hide_code=True)
184
  def _(mo):
185
+ mo.md(r"""
186
+ ### From Polars/Pandas to DuckDB
187
+ """)
188
  return
189
 
190
 
 
197
  "price": [1200.00, 25.50, 75.00]
198
  })
199
  polars_df
200
+ return
201
 
202
 
203
  @app.cell(hide_code=True)
204
  def _(mo):
205
+ mo.md(r"""
206
+ Now we can query this Polars DataFrame directly in DuckDB:
207
+ """)
208
  return
209
 
210
 
211
  @app.cell
212
+ def _(mo):
213
  # Query the Polars DataFrame directly in DuckDB
214
  mo.sql(
215
  f"""
 
224
 
225
  @app.cell(hide_code=True)
226
  def _(mo):
227
+ mo.md(r"""
228
+ Similarly, we can query a Pandas DataFrame:
229
+ """)
230
  return
231
 
232
 
 
240
  "order_date": pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17'])
241
  })
242
  pandas_df
243
+ return
244
 
245
 
246
  @app.cell
247
+ def _(mo):
248
  # Query the Pandas DataFrame in DuckDB
249
  mo.sql(
250
  f"""
 
259
 
260
  @app.cell(hide_code=True)
261
  def _(mo):
262
+ mo.md(r"""
263
+ ## 4. Advanced Example: Combining Multiple Data Sources
 
264
 
265
+ One of the most powerful features is the ability to join data from different sources (DuckDB tables, Arrow tables, Polars/Pandas DataFrames) in a single query:
266
+ """)
 
267
  return
268
 
269
 
270
  @app.cell
271
+ def _(mo):
272
  # Join the DuckDB users table with the Polars products DataFrame and Pandas orders DataFrame
273
  result = mo.sql(
274
  f"""
 
291
 
292
  @app.cell(hide_code=True)
293
  def _(mo):
294
+ mo.md(r"""
295
+ ## 5. Performance Benefits of Arrow Integration
 
296
 
297
+ The zero-copy integration between DuckDB and Apache Arrow delivers significant performance and memory benefits. This seamless integration enables:
298
 
299
+ ### Key Benefits:
300
 
301
+ - **Memory Efficiency**: Arrow's columnar format uses 20-40% less memory than traditional DataFrames through compact columnar representation and better compression ratios
302
+ - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage
303
+ - **Query Performance**: 2-10x faster queries compared to traditional approaches that require data copying
304
+ - **Larger-than-Memory Analysis**: Both DuckDB and Arrow-compatible libraries support streaming query results, allowing you to execute queries on data larger than available memory by processing data in batches.
305
+ - **Advanced Query Optimization**: DuckDB's optimizer can push down filters and projections directly into Arrow scans, reading only relevant columns and partitions
306
+
+ Let's demonstrate these benefits with concrete examples:
307
+ """)
 
308
  return
309
 
310
+
311
  @app.cell(hide_code=True)
312
  def _(mo):
313
+ mo.md(r"""
314
+ ### Memory Efficiency Demonstration
315
+ """)
316
  return
317
 
318
 
 
353
 
354
  @app.cell(hide_code=True)
355
  def _(mo):
356
+ mo.md(r"""
357
+ ### Performance Comparison: Arrow vs Non-Arrow Approaches
358
+ """)
359
  return
360
 
361
 
362
  @app.cell(hide_code=True)
363
  def _(mo):
364
+ mo.md(r"""
365
+ Let's compare three approaches for the same analytical query:
366
+ """)
367
  return
368
 
369
 
370
  @app.cell
371
+ def _(duckdb, mo, pandas_data, time):
372
  # Test query: group by category and calculate aggregations
373
  query = """
374
  SELECT
 
430
 
431
  @app.cell(hide_code=True)
432
  def _(mo):
433
+ mo.md(r"""
434
+ ### Visualizing the Performance Difference
435
+ """)
436
  return
437
 
438
 
439
  @app.cell
440
  def _(approach1_time, approach2_time, approach3_time, mo, pl):
441
  import altair as alt
442
+
443
  # Create a bar chart showing the performance comparison
444
  performance_data = pl.DataFrame({
445
  "Approach": ["Traditional\n(Copy to DuckDB)", "Pandas\nGroupBy", "Arrow-based\n(Zero-copy)"],
 
457
  width=400,
458
  height=300
459
  )
460
+
461
  # Display using marimo's altair_chart UI element
462
  mo.ui.altair_chart(chart)
463
+ return
 
464
 
465
 
466
  @app.cell(hide_code=True)
467
  def _(mo):
468
+ mo.md(r"""
469
+ ### Complex Query Performance
470
+ """)
471
  return
472
 
473
 
474
  @app.cell(hide_code=True)
475
  def _(mo):
476
+ mo.md(r"""
477
+ Let's test a more complex query with joins and window functions:
478
+ """)
479
  return
480
 
481
 
482
  @app.cell
483
+ def _(mo, pl, time):
484
  # Create additional datasets for join operations
485
  categories_df = pl.DataFrame({
486
  "category": [f"cat_{i}" for i in range(100)],
 
520
  print(f"Complex query with joins and window functions completed in {complex_query_time:.3f} seconds")
521
 
522
  complex_result
523
+ return
524
 
525
 
526
  @app.cell(hide_code=True)
527
  def _(mo):
528
+ mo.md(r"""
 
529
  ### Memory Efficiency During Operations
530
 
531
  Let's demonstrate how Arrow's zero-copy operations save memory during data transformations:
532
+ """)
 
533
  return
534
 
535
 
536
  @app.cell
537
+ def _(polars_data, psutil, time):
538
  import os
539
  import pyarrow.compute as pc # Add this import
540
 
 
566
 
567
  copy_ops_time = time.time() - latest_start_time
568
  memory_after_copy = process.memory_info().rss / 1024 / 1024 # MB
569
+
570
  print("Memory Usage Comparison:")
571
  print(f"Initial memory: {memory_before:.2f} MB")
572
  print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")
 
575
  print(f"Arrow operations: {arrow_ops_time:.3f} seconds")
576
  print(f"Copy operations: {copy_ops_time:.3f} seconds")
577
  print(f"Speedup: {copy_ops_time/arrow_ops_time:.1f}x")
578
+ return
 
579
 
580
 
581
  @app.cell(hide_code=True)
582
  def _(mo):
583
+ mo.md(r"""
 
584
  ## Summary
585
 
586
  In this notebook, we've explored:
 
596
  - **Better scalability**: Can handle larger datasets within the same memory constraints
597
 
598
  The seamless integration between DuckDB and Arrow-compatible systems makes it easy to work with data across different tools while maintaining high performance and memory efficiency.
599
+ """)
 
600
  return
601
 
602
 
 
609
  import duckdb
610
  import sqlglot
611
  import psutil
612
+ return duckdb, mo, pa, pd, pl, psutil
613
 
614
 
615
  if __name__ == "__main__":
duckdb/01_getting_started.py CHANGED
@@ -15,26 +15,23 @@
15
 
16
  import marimo
17
 
18
- __generated_with = "0.13.4"
19
  app = marimo.App(width="medium")
20
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
- mo.md(
25
- rf"""
26
  <p align="center">
27
  <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
28
  </p>
29
- """
30
- )
31
  return
32
 
33
 
34
  @app.cell(hide_code=True)
35
  def _(mo):
36
- mo.md(
37
- rf"""
38
  # 🦆 **DuckDB**: An Embeddable Analytical Database System
39
 
40
  ## What is DuckDB?
@@ -83,15 +80,13 @@ def _(mo):
83
  /// attention | Note
84
  DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
85
  ///
86
- """
87
- )
88
  return
89
 
90
 
91
  @app.cell(hide_code=True)
92
  def _(mo):
93
- mo.md(
94
- r"""
95
  # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)
96
 
97
  DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.
@@ -105,8 +100,7 @@ def _(mo):
105
  | Performance | Faster for most operations | Slightly slower but provides persistence |
106
  | Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') |
107
  | Multiple Connection Access | Limited to single connection | Multiple connections can access the same database |
108
- """
109
- )
110
  return
111
 
112
 
@@ -134,8 +128,7 @@ def _(mo):
134
 
135
  @app.cell(hide_code=True)
136
  def _(mo):
137
- mo.md(
138
- """
139
  ## Creating DuckDB Connections
140
 
141
  Let's create both types of DuckDB connections and explore their characteristics.
@@ -144,8 +137,7 @@ def _(mo):
144
  2. **File-based connection**: Data persists between sessions
145
 
146
  We'll then demonstrate the key differences between these connection types.
147
- """
148
- )
149
  return
150
 
151
 
@@ -176,28 +168,28 @@ def _(file_db, memory_db):
176
 
177
  @app.cell(hide_code=True)
178
  def _(mo):
179
- mo.md(
180
- r"""
181
  ## Testing Connection Persistence
182
 
183
- Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.
184
 
185
  1. First, we'll query our tables to confirm the data was properly inserted
186
  2. Then, we'll simulate an application restart by creating new connections
187
  3. Finally, we'll check which data persists after the "restart"
188
- """
189
- )
190
  return
191
 
192
 
193
  @app.cell(hide_code=True)
194
  def _(mo):
195
- mo.md(r"""## Current Database Contents""")
 
 
196
  return
197
 
198
 
199
  @app.cell(hide_code=True)
200
- def _(mem_test, memory_db, mo):
201
  _df = mo.sql(
202
  f"""
203
  SELECT * FROM mem_test
@@ -208,7 +200,7 @@ def _(mem_test, memory_db, mo):
208
 
209
 
210
  @app.cell(hide_code=True)
211
- def _(file_db, file_test, mo):
212
  _df = mo.sql(
213
  f"""
214
  SELECT * FROM file_test
@@ -227,7 +219,9 @@ def _():
227
 
228
  @app.cell(hide_code=True)
229
  def _(mo):
230
- mo.md(rf"""## 🔄 Simulating Application Restart...""")
 
 
231
  return
232
 
233
 
@@ -311,8 +305,7 @@ def _(file_data, file_data_available, mo):
311
 
312
  @app.cell(hide_code=True)
313
  def _(mo):
314
- mo.md(
315
- r"""
316
  # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)
317
 
318
  DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.
@@ -326,8 +319,7 @@ def _(mo):
326
  - **CREATE OR REPLACE** to recreate tables
327
  - **Primary keys** and other constraints
328
  - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
329
- """
330
- )
331
  return
332
 
333
 
@@ -406,8 +398,7 @@ def _(memory_schema, mo):
406
 
407
  @app.cell(hide_code=True)
408
  def _(mo):
409
- mo.md(
410
- r"""
411
  # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)
412
 
413
  DuckDB supports multiple ways to insert data:
@@ -418,8 +409,7 @@ def _(mo):
418
  4. **Bulk inserts**: For efficient loading of multiple rows
419
 
420
  Let's demonstrate these different insertion methods:
421
- """
422
- )
423
  return
424
 
425
 
@@ -741,8 +731,7 @@ def _(file_results, memory_results, mo):
741
 
742
  @app.cell(hide_code=True)
743
  def _(mo):
744
- mo.md(
745
- r"""
746
  # [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)
747
 
748
  There are multiple ways to leverage DuckDB's SQL capabilities in marimo:
@@ -752,8 +741,7 @@ def _(mo):
752
  3. **Interactive queries**: Combining UI elements with SQL execution
753
 
754
  Let's explore these approaches:
755
- """
756
- )
757
  return
758
 
759
 
@@ -808,7 +796,9 @@ def _(age_threshold, filtered_users, mo):
808
 
809
  @app.cell(hide_code=True)
810
  def _(mo):
811
- mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""")
 
 
812
  return
813
 
814
 
@@ -904,7 +894,9 @@ def _(complex_query_result, mo):
904
 
905
  @app.cell(hide_code=True)
906
  def _(mo):
907
- mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""")
 
 
908
  return
909
 
910
 
@@ -950,12 +942,10 @@ def _(new_memory_db):
950
 
951
  @app.cell(hide_code=True)
952
  def _(mo):
953
- mo.md(
954
- rf"""
955
  <!-- Display the join result -->
956
  ## Join Result (Users and Departments):
957
- """
958
- )
959
  return
960
 
961
 
@@ -967,12 +957,10 @@ def _(join_result, mo):
967
 
968
  @app.cell(hide_code=True)
969
  def _(mo):
970
- mo.md(
971
- rf"""
972
  <!-- Demonstrate different types of joins -->
973
  ## Different Types of Joins
974
- """
975
- )
976
  return
977
 
978
 
@@ -1122,7 +1110,9 @@ def _(join_description, join_tabs, mo):
1122
 
1123
  @app.cell(hide_code=True)
1124
  def _(mo):
1125
- mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""")
 
 
1126
  return
1127
 
1128
 
@@ -1224,7 +1214,9 @@ def _(mo, window_result):
1224
 
1225
  @app.cell(hide_code=True)
1226
  def _(mo):
1227
- mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""")
 
 
1228
  return
1229
 
1230
 
@@ -1342,7 +1334,9 @@ def _(mo, pandas_result):
1342
 
1343
  @app.cell(hide_code=True)
1344
  def _(mo):
1345
- mo.md("""# 9. Data Visualization with DuckDB and Plotly""")
 
 
1346
  return
1347
 
1348
 
@@ -1498,8 +1492,7 @@ def _(age_groups, mo, new_memory_db, plotly_express):
1498
 
1499
  @app.cell(hide_code=True)
1500
  def _(mo):
1501
- mo.md(
1502
- r"""
1503
  /// admonition |
1504
  ## Database Management Best Practices
1505
  ///
@@ -1538,14 +1531,15 @@ def _(mo):
1538
  - Create indexes for frequently queried columns
1539
  - For large datasets, consider partitioning
1540
  - Use prepared statements for repeated queries
1541
- """
1542
- )
1543
  return
1544
 
1545
 
1546
  @app.cell(hide_code=True)
1547
  def _(mo):
1548
- mo.md(rf"""## 10. Interactive DuckDB Dashboard with marimo and Plotly""")
 
 
1549
  return
1550
 
1551
 
@@ -1736,8 +1730,7 @@ def _(
1736
 
1737
  @app.cell(hide_code=True)
1738
  def _(mo):
1739
- mo.md(
1740
- rf"""
1741
  # Summary and Key Takeaways
1742
 
1743
  In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
@@ -1770,8 +1763,7 @@ def _(mo):
1770
  - Experiment with more complex queries and window functions
1771
  - Use DuckDB's COPY functionality to import/export data from/to files
1772
  - Create more advanced interactive dashboards with marimo and Plotly
1773
- """
1774
- )
1775
  return
1776
 
1777
 
 
15
 
16
  import marimo
17
 
18
+ __generated_with = "0.18.4"
19
  app = marimo.App(width="medium")
20
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
+ mo.md(rf"""
 
25
  <p align="center">
26
  <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxHAqB0W_61zuIGVMiU6sEeQyTaw-9xwiprw&s" alt="DuckDB Image"/>
27
  </p>
28
+ """)
 
29
  return
30
 
31
 
32
  @app.cell(hide_code=True)
33
  def _(mo):
34
+ mo.md(rf"""
 
35
  # 🦆 **DuckDB**: An Embeddable Analytical Database System
36
 
37
  ## What is DuckDB?
 
80
  /// attention | Note
81
  DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system.
82
  ///
83
+ """)
 
84
  return
85
 
86
 
87
  @app.cell(hide_code=True)
88
  def _(mo):
89
+ mo.md(r"""
 
90
  # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html)
91
 
92
  DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage.
 
100
  | Performance | Faster for most operations | Slightly slower but provides persistence |
101
  | Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') |
102
  | Multiple Connection Access | Limited to single connection | Multiple connections can access the same database |
103
+ """)
 
104
  return
105
 
106
 
 
128
 
129
  @app.cell(hide_code=True)
130
  def _(mo):
131
+ mo.md("""
 
132
  ## Creating DuckDB Connections
133
 
134
  Let's create both types of DuckDB connections and explore their characteristics.
 
137
  2. **File-based connection**: Data persists between sessions
138
 
139
  We'll then demonstrate the key differences between these connection types.
140
+ """)
 
141
  return
142
 
143
 
 
168
 
169
  @app.cell(hide_code=True)
170
  def _(mo):
171
+ mo.md(r"""
 
172
  ## Testing Connection Persistence
173
 
174
+ Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist.
175
 
176
  1. First, we'll query our tables to confirm the data was properly inserted
177
  2. Then, we'll simulate an application restart by creating new connections
178
  3. Finally, we'll check which data persists after the "restart"
179
+ """)
 
180
  return
181
 
182
 
183
  @app.cell(hide_code=True)
184
  def _(mo):
185
+ mo.md(r"""
186
+ ## Current Database Contents
187
+ """)
188
  return
189
 
190
 
191
  @app.cell(hide_code=True)
192
+ def _(memory_db, mo):
193
  _df = mo.sql(
194
  f"""
195
  SELECT * FROM mem_test
 
200
 
201
 
202
  @app.cell(hide_code=True)
203
+ def _(file_db, mo):
204
  _df = mo.sql(
205
  f"""
206
  SELECT * FROM file_test
 
219
 
220
  @app.cell(hide_code=True)
221
  def _(mo):
222
+ mo.md(rf"""
223
+ ## 🔄 Simulating Application Restart...
224
+ """)
225
  return
226
 
227
 
 
305
 
306
  @app.cell(hide_code=True)
307
  def _(mo):
308
+ mo.md(r"""
 
309
  # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html)
310
 
311
  DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints.
 
319
  - **CREATE OR REPLACE** to recreate tables
320
  - **Primary keys** and other constraints
321
  - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc.
322
+ """)
 
323
  return
324
 
325
 
 
398
 
399
  @app.cell(hide_code=True)
400
  def _(mo):
401
+ mo.md(r"""
 
402
  # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert)
403
 
404
  DuckDB supports multiple ways to insert data:
 
409
  4. **Bulk inserts**: For efficient loading of multiple rows
410
 
411
  Let's demonstrate these different insertion methods:
412
+ """)
 
413
  return
414
 
415
 
 
731
 
732
  @app.cell(hide_code=True)
733
  def _(mo):
734
+ mo.md(r"""
 
735
  # [4. Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select)
736
 
737
  There are multiple ways to leverage DuckDB's SQL capabilities in marimo:
 
741
  3. **Interactive queries**: Combining UI elements with SQL execution
742
 
743
  Let's explore these approaches:
744
+ """)
 
745
  return
746
 
747
 
 
796
 
797
  @app.cell(hide_code=True)
798
  def _(mo):
799
+ mo.md(r"""
800
+ # [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)
801
+ """)
802
  return
803
 
804
 
 
894
 
895
  @app.cell(hide_code=True)
896
  def _(mo):
897
+ mo.md(r"""
898
+ # [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)
899
+ """)
900
  return
901
 
902
 
 
942
 
943
  @app.cell(hide_code=True)
944
  def _(mo):
945
+ mo.md(rf"""
 
946
  <!-- Display the join result -->
947
  ## Join Result (Users and Departments):
948
+ """)
 
949
  return
950
 
951
 
 
957
 
958
  @app.cell(hide_code=True)
959
  def _(mo):
960
+ mo.md(rf"""
 
961
  <!-- Demonstrate different types of joins -->
962
  ## Different Types of Joins
963
+ """)
 
964
  return
965
 
966
 
 
1110
 
1111
  @app.cell(hide_code=True)
1112
  def _(mo):
1113
+ mo.md(r"""
1114
+ # [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)
1115
+ """)
1116
  return
1117
 
1118
 
 
1214
 
1215
  @app.cell(hide_code=True)
1216
  def _(mo):
1217
+ mo.md(r"""
1218
+ # [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)
1219
+ """)
1220
  return
1221
 
1222
 
 
1334
 
1335
  @app.cell(hide_code=True)
1336
  def _(mo):
1337
+ mo.md("""
1338
+ # 9. Data Visualization with DuckDB and Plotly
1339
+ """)
1340
  return
1341
 
1342
 
 
1492
 
1493
  @app.cell(hide_code=True)
1494
  def _(mo):
1495
+ mo.md(r"""
 
1496
  /// admonition |
1497
  ## Database Management Best Practices
1498
  ///
 
1531
  - Create indexes for frequently queried columns
1532
  - For large datasets, consider partitioning
1533
  - Use prepared statements for repeated queries
1534
+ """)
 
1535
  return
1536
 
1537
 
1538
  @app.cell(hide_code=True)
1539
  def _(mo):
1540
+ mo.md(rf"""
1541
+ ## 10. Interactive DuckDB Dashboard with marimo and Plotly
1542
+ """)
1543
  return
1544
 
1545
 
 
1730
 
1731
  @app.cell(hide_code=True)
1732
  def _(mo):
1733
+ mo.md(rf"""
 
1734
  # Summary and Key Takeaways
1735
 
1736
  In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered:
 
1763
  - Experiment with more complex queries and window functions
1764
  - Use DuckDB's COPY functionality to import/export data from/to files
1765
  - Create more advanced interactive dashboards with marimo and Plotly
1766
+ """)
 
1767
  return
1768
 
1769
 
duckdb/DuckDB_Loading_CSVs.py CHANGED
@@ -13,39 +13,41 @@
13
 
14
  import marimo
15
 
16
- __generated_with = "0.12.10"
17
  app = marimo.App(width="medium")
18
 
19
 
20
  @app.cell(hide_code=True)
21
  def _(mo):
22
- mo.md(r"""#Loading CSVs with DuckDB""")
 
 
23
  return
24
 
25
 
26
  @app.cell(hide_code=True)
27
  def _(mo):
28
- mo.md(
29
- r"""
30
- <p> I remember when I first learnt about DuckDB, it was a gamechanger — I used to load the data I wanted to work on to a database software like MS SQL Server, and then build a bridge to an IDE with the language I wanted to use like Python, or R; it was quite the hassle. DuckDB changed my whole world — now I could just import the data file into the IDE, or notebook, make a duckdb connection, and there we go! But then, I realized I didn't even need the step of first importing the file using python. I could just query the csv file directly using SQL through a DuckDB connection.</p>
31
-
32
- ##Introduction
33
- <p> I found this dataset on the evolution of AI research by discipline from <a href= "https://oecd.ai/en/data?selectedArea=ai-research&selectedVisualization=16731"> OECD</a>, and it piqued my interest. I feel like publications in natural language processing drastically jumped in the mid 2010s, and I'm excited to find out if that's the case. </p>
34
-
35
- <p> In this notebook, we'll: </p>
36
- <ul>
37
- <li> Import the CSV file into the notebook</li>
38
- <li> Create another table within the database based on the CSV</li>
39
- <li> Dig into publications on natural language processing have evolved over the years</li>
40
- </ul>
41
- """
42
- )
43
  return
44
 
45
 
46
  @app.cell(hide_code=True)
47
  def _(mo):
48
- mo.md(r"""##Load the CSV""")
 
 
49
  return
50
 
51
 
@@ -67,7 +69,9 @@ def _(mo):
67
 
68
  @app.cell(hide_code=True)
69
  def _(mo):
70
- mo.md(r"""##Create Another Table""")
 
 
71
  return
72
 
73
 
@@ -80,11 +84,11 @@ def _(mo):
80
  SELECT Year, Concept, publications FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv"
81
  """
82
  )
83
- return Discipline_Analysis, Domain_Analysis
84
 
85
 
86
  @app.cell
87
- def _(Domain_Analysis, mo):
88
  Analysis = mo.sql(
89
  f"""
90
  SELECT *
@@ -93,11 +97,11 @@ def _(Domain_Analysis, mo):
93
  ORDER BY Year
94
  """
95
  )
96
- return (Analysis,)
97
 
98
 
99
  @app.cell
100
- def _(Domain_Analysis, mo):
101
  _df = mo.sql(
102
  f"""
103
  SELECT
@@ -111,7 +115,7 @@ def _(Domain_Analysis, mo):
111
 
112
 
113
  @app.cell
114
- def _(Domain_Analysis, mo):
115
  NLP_Analysis = mo.sql(
116
  f"""
117
  SELECT
@@ -137,21 +141,23 @@ def _(NLP_Analysis, px):
137
 
138
  @app.cell(hide_code=True)
139
  def _(mo):
140
- mo.md(r"""<p> We can see there's a significant increase in NLP publications 2020 and onwards which definitely makes sense provided the rapid emergence of commercial large language models, and AI assistants. </p>""")
 
 
 
 
141
 
142
  @app.cell(hide_code=True)
143
  def _(mo):
144
- mo.md(
145
- r"""
146
- ##Conclusion
147
- <p> In this notebook, we learned how to:</p>
148
- <ul>
149
- <li> Load a CSV into DuckDB </li>
150
- <li> Create other tables using the imported CSV </li>
151
- <li> Seamlessly analyze and visualize data between SQL, and Python cells</li>
152
- </ul>
153
- """
154
- )
155
  return
156
 
157
 
@@ -159,7 +165,7 @@ def _(mo):
159
  def _():
160
  import pyarrow
161
  import polars
162
- return polars, pyarrow
163
 
164
 
165
  @app.cell
 
13
 
14
  import marimo
15
 
16
+ __generated_with = "0.18.4"
17
  app = marimo.App(width="medium")
18
 
19
 
20
  @app.cell(hide_code=True)
21
  def _(mo):
22
+ mo.md(r"""
23
+ # Loading CSVs with DuckDB
24
+ """)
25
  return
26
 
27
 
28
  @app.cell(hide_code=True)
29
  def _(mo):
30
+ mo.md(r"""
31
+ <p> I remember when I first learnt about DuckDB, it was a gamechanger — I used to load the data I wanted to work on into database software like MS SQL Server, and then build a bridge to an IDE with the language I wanted to use, like Python or R; it was quite the hassle. DuckDB changed my whole world — now I could just import the data file into the IDE, or notebook, make a DuckDB connection, and there we go! But then, I realized I didn't even need the step of first importing the file using Python. I could just query the CSV file directly using SQL through a DuckDB connection.</p>
32
+
33
+ ## Introduction
34
+ <p> I found this dataset on the evolution of AI research by discipline from <a href= "https://oecd.ai/en/data?selectedArea=ai-research&selectedVisualization=16731"> OECD</a>, and it piqued my interest. I feel like publications in natural language processing drastically jumped in the mid 2010s, and I'm excited to find out if that's the case. </p>
35
+
36
+ <p> In this notebook, we'll: </p>
37
+ <ul>
38
+ <li> Import the CSV file into the notebook</li>
39
+ <li> Create another table within the database based on the CSV</li>
40
+ <li> Dig into how publications on natural language processing have evolved over the years</li>
41
+ </ul>
42
+ """)
 
 
43
  return
44
 
45
 
46
  @app.cell(hide_code=True)
47
  def _(mo):
48
+ mo.md(r"""
49
+ ## Load the CSV
50
+ """)
51
  return
52
 
53
 
 
69
 
70
  @app.cell(hide_code=True)
71
  def _(mo):
72
+ mo.md(r"""
73
+ ## Create Another Table
74
+ """)
75
  return
76
 
77
 
 
84
  SELECT Year, Concept, publications FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv"
85
  """
86
  )
87
+ return
88
 
89
 
90
  @app.cell
91
+ def _(mo):
92
  Analysis = mo.sql(
93
  f"""
94
  SELECT *
 
97
  ORDER BY Year
98
  """
99
  )
100
+ return
101
 
102
 
103
  @app.cell
104
+ def _(mo):
105
  _df = mo.sql(
106
  f"""
107
  SELECT
 
115
 
116
 
117
  @app.cell
118
+ def _(mo):
119
  NLP_Analysis = mo.sql(
120
  f"""
121
  SELECT
 
141
 
142
  @app.cell(hide_code=True)
143
  def _(mo):
144
+ mo.md(r"""
145
+ <p> We can see a significant increase in NLP publications from 2020 onwards, which makes sense given the rapid emergence of commercial large language models and AI assistants. </p>
146
+ """)
147
+ return
148
+
149
 
150
  @app.cell(hide_code=True)
151
  def _(mo):
152
+ mo.md(r"""
153
+ ## Conclusion
154
+ <p> In this notebook, we learned how to:</p>
155
+ <ul>
156
+ <li> Load a CSV into DuckDB </li>
157
+ <li> Create other tables using the imported CSV </li>
158
+ <li> Seamlessly analyze and visualize data between SQL, and Python cells</li>
159
+ </ul>
160
+ """)
 
 
161
  return
162
 
163
 
 
165
  def _():
166
  import pyarrow
167
  import polars
168
+ return
169
 
170
 
171
  @app.cell
duckdb/README.md CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # Learn DuckDB
2
 
3
  _🚧 This collection is a work in progress. Please help us add notebooks!_
 
1
+ ---
2
+ title: Readme
3
+ marimo-version: 0.18.4
4
+ ---
5
+
6
  # Learn DuckDB
7
 
8
  _🚧 This collection is a work in progress. Please help us add notebooks!_
functional_programming/05_functors.py CHANGED
@@ -7,102 +7,98 @@
7
 
8
  import marimo
9
 
10
- __generated_with = "0.12.8"
11
  app = marimo.App(app_title="Category Theory and Functors")
12
 
13
 
14
  @app.cell(hide_code=True)
15
  def _(mo):
16
- mo.md(
17
- """
18
- # Category Theory and Functors
19
 
20
- In this notebook, you will learn:
21
 
22
- * Why `length` is a *functor* from the category of `list concatenation` to the category of `integer addition`
23
- * How to *lift* an ordinary function into a specific *computational context*
24
- * How to write an *adapter* between two categories
25
 
26
- In short, a mathematical functor is a **mapping** between two categories in category theory. In practice, a functor represents a type that can be mapped over.
27
 
28
- /// admonition | Intuitions
29
 
30
- - A simple intuition is that a `Functor` represents a **container** of values, along with the ability to apply a function uniformly to every element in the container.
31
- - Another intuition is that a `Functor` represents some sort of **computational context**.
32
- - Mathematically, `Functors` generalize the idea of a container or a computational context.
33
- ///
34
 
35
- We will start with intuition, introduce the basics of category theory, and then examine functors from a categorical perspective.
36
 
37
- /// details | Notebook metadata
38
- type: info
39
 
40
- version: 0.1.5 | last modified: 2025-04-11 | author: [métaboulie](https://github.com/metaboulie)<br/>
41
- reviewer: [Haleshot](https://github.com/Haleshot)
42
 
43
- ///
44
- """
45
- )
46
  return
47
 
48
 
49
  @app.cell(hide_code=True)
50
  def _(mo):
51
- mo.md(
52
- """
53
- # Functor as a Computational Context
54
 
55
- A [**Functor**](https://wiki.haskell.org/Functor) is an abstraction that represents a computational context with the ability to apply a function to every value inside it without altering the structure of the context itself. This enables transformations while preserving the shape of the data.
56
 
57
- To understand this, let's look at a simple example.
58
 
59
- ## [The One-Way Wrapper Design Pattern](http://blog.sigfpe.com/2007/04/trivial-monad.html)
60
 
61
- Often, we need to wrap data in some kind of context. However, when performing operations on wrapped data, we typically have to:
62
 
63
- 1. Unwrap the data.
64
- 2. Modify the unwrapped data.
65
- 3. Rewrap the modified data.
66
 
67
- This process is tedious and inefficient. Instead, we want to wrap data **once** and apply functions directly to the wrapped data without unwrapping it.
68
 
69
- /// admonition | Rules for a One-Way Wrapper
70
 
71
- 1. We can wrap values, but we cannot unwrap them.
72
- 2. We should still be able to apply transformations to the wrapped data.
73
- 3. Any operation that depends on wrapped data should itself return a wrapped result.
74
- ///
75
 
76
- Let's define such a `Wrapper` class:
77
 
78
- ```python
79
- from dataclasses import dataclass
80
- from typing import TypeVar
81
 
82
- A = TypeVar("A")
83
- B = TypeVar("B")
84
 
85
- @dataclass
86
- class Wrapper[A]:
87
- value: A
88
- ```
89
 
90
- Now, we can create an instance of wrapped data:
91
 
92
- ```python
93
- wrapped = Wrapper(1)
94
- ```
95
 
96
- ### Mapping Functions Over Wrapped Data
97
 
98
- To modify wrapped data while keeping it wrapped, we define an `fmap` method:
99
- """
100
- )
101
  return
102
 
103
 
104
  @app.cell
105
- def _(B, Callable, Functor, dataclass):
106
  @dataclass
107
  class Wrapper[A](Functor):
108
  value: A
@@ -115,26 +111,24 @@ def _(B, Callable, Functor, dataclass):
115
 
116
  @app.cell(hide_code=True)
117
  def _(mo):
118
- mo.md(
119
- r"""
120
- /// attention
121
 
122
- To distinguish between regular types and functors, we use the prefix `f` to indicate `Functor`.
123
 
124
- For instance,
125
 
126
- - `a: A` is a regular variable of type `A`
127
- - `g: Callable[[A], B]` is a regular function from type `A` to `B`
128
- - `fa: Functor[A]` is a *Functor* wrapping a value of type `A`
129
- - `fg: Functor[Callable[[A], B]]` is a *Functor* wrapping a function from type `A` to `B`
130
 
131
- and we will avoid using `f` to represent a function
132
 
133
- ///
134
 
135
- > Try with Wrapper below
136
- """
137
- )
138
  return
139
 
140
 
@@ -149,46 +143,42 @@ def _(Wrapper, pp):
149
 
150
  @app.cell(hide_code=True)
151
  def _(mo):
152
- mo.md(
153
- """
154
- We can analyze the type signature of `fmap` for `Wrapper`:
155
 
156
- * `g` is of type `Callable[[A], B]`
157
- * `fa` is of type `Wrapper[A]`
158
- * The return value is of type `Wrapper[B]`
159
 
160
- Thus, in Python's type system, we can express the type signature of `fmap` as:
161
 
162
- ```python
163
- fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]
164
- ```
165
 
166
- Essentially, `fmap`:
167
 
168
- 1. Takes a function `Callable[[A], B]` and a `Wrapper[A]` instance as input.
169
- 2. Applies the function to the value inside the wrapper.
170
- 3. Returns a new `Wrapper[B]` instance with the transformed value, leaving the original wrapper and its internal data unmodified.
171
 
172
- Now, let's examine `list` as a similar kind of wrapper.
173
- """
174
- )
175
  return
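The `fmap` signature just described can be sketched as a standalone free function, without the notebook's `Functor` base class (`TypeVar` generics are used here so the sketch runs on older Pythons):

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")

@dataclass
class Wrapper(Generic[A]):
    value: A

def fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]:
    # Apply g inside the wrapper; the original wrapper is left untouched.
    return Wrapper(g(fa.value))

wrapped = Wrapper(1)
result = fmap(lambda x: x + 1, wrapped)
print(result)  # Wrapper(value=2)
```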
176
 
177
 
178
  @app.cell(hide_code=True)
179
  def _(mo):
180
- mo.md(
181
- """
182
- ## The List Functor
183
 
184
- We can define a `List` class to represent a wrapped list that supports `fmap`:
185
- """
186
- )
187
  return
188
 
189
 
190
  @app.cell
191
- def _(B, Callable, Functor, dataclass):
192
  @dataclass
193
  class List[A](Functor):
194
  value: list[A]
@@ -201,7 +191,9 @@ def _(B, Callable, Functor, dataclass):
201
 
202
  @app.cell(hide_code=True)
203
  def _(mo):
204
- mo.md(r"""> Try with List below""")
 
 
205
  return
206
 
207
 
@@ -215,114 +207,106 @@ def _(List, pp):
215
 
216
  @app.cell(hide_code=True)
217
  def _(mo):
218
- mo.md(
219
- """
220
- ### Extracting the Type of `fmap`
221
 
222
- The type signature of `fmap` for `List` is:
223
 
224
- ```python
225
- fmap(g: Callable[[A], B], fa: List[A]) -> List[B]
226
- ```
227
 
228
- Similarly, for `Wrapper`:
229
 
230
- ```python
231
- fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]
232
- ```
233
 
234
- Both follow the same pattern, which we can generalize as:
235
 
236
- ```python
237
- fmap(g: Callable[[A], B], fa: Functor[A]) -> Functor[B]
238
- ```
239
 
240
- where `Functor` can be `Wrapper`, `List`, or any other wrapper type that follows the same structure.
241
 
242
- ### Functors in Haskell (optional)
243
 
244
- In Haskell, the type of `fmap` is:
245
 
246
- ```haskell
247
- fmap :: Functor f => (a -> b) -> f a -> f b
248
- ```
249
 
250
- or equivalently:
251
 
252
- ```haskell
253
- fmap :: Functor f => (a -> b) -> (f a -> f b)
254
- ```
255
 
256
- This means that `fmap` **lifts** an ordinary function into the **functor world**, allowing it to operate within a computational context.
257
 
258
- Now, let's define an abstract class for `Functor`.
259
- """
260
- )
261
  return
262
 
263
 
264
  @app.cell(hide_code=True)
265
  def _(mo):
266
- mo.md(
267
- """
268
- ## Defining Functor
269
 
270
- Recall that, a **Functor** is an abstraction that allows us to apply a function to values inside a computational context while preserving its structure.
271
 
272
- To define `Functor` in Python, we use an abstract base class:
273
 
274
- ```python
275
- @dataclass
276
- class Functor[A](ABC):
277
- @classmethod
278
- @abstractmethod
279
- def fmap(cls, g: Callable[[A], B], fa: "Functor[A]") -> "Functor[B]":
280
- raise NotImplementedError
281
- ```
282
 
283
- We can now extend custom wrappers, containers, or computation contexts with this `Functor` base class, implement the `fmap` method, and apply any function.
284
- """
285
- )
286
  return
287
 
288
 
289
  @app.cell(hide_code=True)
290
  def _(mo):
291
- mo.md(
292
- r"""
293
- # More Functor instances (optional)
294
 
295
- In this section, we will explore more *Functor* instances to help you build up a better comprehension.
296
 
297
- The main reference is [Data.Functor](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Functor.html)
298
- """
299
- )
300
  return
301
 
302
 
303
  @app.cell(hide_code=True)
304
  def _(mo):
305
- mo.md(
306
- r"""
307
- ## The [Maybe](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Maybe.html#t:Maybe) Functor
308
 
309
- **`Maybe`** is a functor that can either hold a value (`Just(value)`) or be `Nothing` (equivalent to `None` in Python).
310
 
311
- If the value exists, `fmap` applies the function to this value inside the functor.
312
- - If the value is `None`, `fmap` simply returns `None`.
313
 
314
- /// admonition
315
- By using `Maybe` as a functor, we gain the ability to apply transformations (`fmap`) to potentially absent values, without having to explicitly handle the `None` case every time.
316
- ///
317
 
318
- We can implement the `Maybe` functor as:
319
- """
320
- )
321
  return
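A self-contained sketch of the `Maybe` behavior described above (no base class, `None` plays the role of `Nothing`):

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")

@dataclass
class Maybe(Generic[A]):
    value: Optional[A]

    @classmethod
    def fmap(cls, g: Callable[[A], B], fa: "Maybe[A]") -> "Maybe[B]":
        # Nothing stays Nothing; Just(x) becomes Just(g(x)).
        if fa.value is None:
            return cls(None)
        return cls(g(fa.value))

print(Maybe.fmap(lambda x: x + 1, Maybe(1)))     # Maybe(value=2)
print(Maybe.fmap(lambda x: x + 1, Maybe(None)))  # Maybe(value=None)
```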
322
 
323
 
324
  @app.cell
325
- def _(B, Callable, Functor, dataclass):
326
  @dataclass
327
  class Maybe[A](Functor):
328
  value: None | A
@@ -345,24 +329,22 @@ def _(Maybe, pp):
345
 
346
  @app.cell(hide_code=True)
347
  def _(mo):
348
- mo.md(
349
- r"""
350
- ## The [Either](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Either.html#t:Either) Functor
351
 
352
- The `Either` type represents values with two possibilities: a value of type `Either a b` is either `Left a` or `Right b`.
353
 
354
- The `Either` type is sometimes used to represent a value which is **either correct or an error**; by convention, the `left` attribute is used to hold an error value and the `right` attribute is used to hold a correct value.
355
 
356
- `fmap` for `Either` will ignore Left values, but will apply the supplied function to values contained in the Right.
357
 
358
- The implementation is:
359
- """
360
- )
361
  return
362
 
363
 
364
  @app.cell
365
- def _(B, Callable, Functor, Union, dataclass):
366
  @dataclass
367
  class Either[A](Functor):
368
  left: A = None
@@ -400,29 +382,27 @@ def _(Either):
400
 
401
  @app.cell(hide_code=True)
402
  def _(mo):
403
- mo.md(
404
- """
405
- ## The [RoseTree](https://en.wikipedia.org/wiki/Rose_tree) Functor
406
 
407
- A **RoseTree** is a tree where:
408
 
409
- - Each node holds a **value**.
410
- - Each node has a **list of child nodes** (which are also RoseTrees).
411
 
412
- This structure is useful for representing hierarchical data, such as:
413
 
414
- - Abstract Syntax Trees (ASTs)
415
- - File system directories
416
- - Recursive computations
417
 
418
- The implementation is:
419
- """
420
- )
421
  return
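The recursive `fmap` for a rose tree can be sketched in a self-contained form; the node's value is transformed and then `fmap` recurses into every child:

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")

@dataclass
class RoseTree(Generic[A]):
    value: A
    children: "List[RoseTree[A]]" = field(default_factory=list)

    @classmethod
    def fmap(cls, g: Callable[[A], B], fa: "RoseTree[A]") -> "RoseTree[B]":
        # Transform this node, then map over every subtree.
        return cls(g(fa.value), [cls.fmap(g, child) for child in fa.children])

tree = RoseTree(1, [RoseTree(2), RoseTree(3, [RoseTree(4)])])
doubled = RoseTree.fmap(lambda x: x * 2, tree)
print(doubled.value, [c.value for c in doubled.children])  # 2 [4, 6]
```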
422
 
423
 
424
  @app.cell
425
- def _(B, Callable, Functor, dataclass):
426
  @dataclass
427
  class RoseTree[A](Functor):
428
  value: A # The value stored in the node.
@@ -459,34 +439,32 @@ def _(RoseTree, pp):
459
 
460
  @app.cell(hide_code=True)
461
  def _(mo):
462
- mo.md(
463
- """
464
- ## Generic Functions that can be Used with Any Functor
465
 
466
- One of the powerful features of functors is that we can write **generic functions** that can work with any functor.
467
 
468
- Remember that in Haskell, the type of `fmap` can be written as:
469
 
470
- ```haskell
471
- fmap :: Functor f => (a -> b) -> (f a -> f b)
472
- ```
473
 
474
- Translating to Python, we get:
475
 
476
- ```python
477
- def fmap(g: Callable[[A], B]) -> Callable[[Functor[A]], Functor[B]]
478
- ```
479
 
480
- This means that `fmap`:
481
 
482
- - Takes an **ordinary function** `Callable[[A], B]` as input.
483
- - Outputs a function that:
484
- - Takes a **functor** of type `Functor[A]` as input.
485
- - Outputs a **functor** of type `Functor[B]`.
486
 
487
- Inspired by this, we can implement an `inc` function which takes a functor, applies the function `lambda x: x + 1` to every value inside it, and returns a new functor with the updated values.
488
- """
489
- )
490
  return
491
 
492
 
@@ -506,55 +484,51 @@ def _(flist, inc, pp, rosetree, wrapper):
506
 
507
  @app.cell(hide_code=True)
508
  def _(mo):
509
- mo.md(
510
- r"""
511
- /// admonition | exercise
512
- Implement other generic functions and apply them to different *Functor* instances.
513
- ///
514
- """
515
- )
516
  return
517
 
518
 
519
  @app.cell(hide_code=True)
520
  def _(mo):
521
- mo.md(r"""# Functor laws and utility functions""")
 
 
522
  return
523
 
524
 
525
  @app.cell(hide_code=True)
526
  def _(mo):
527
- mo.md(
528
- """
529
- ## Functor laws
530
 
531
- In addition to providing a function `fmap` of the specified type, functors are also required to satisfy two equational laws:
532
 
533
- ```haskell
534
- fmap id = id -- fmap preserves identity
535
- fmap (g . h) = fmap g . fmap h -- fmap distributes over composition
536
- ```
537
 
538
- 1. `fmap` should preserve the **identity function**, in the sense that applying `fmap` to this function returns the same function as the result.
539
- 2. `fmap` should also preserve **function composition**. Applying two composed functions `g` and `h` to a functor via `fmap` should give the same result as first applying `fmap` to `g` and then applying `fmap` to `h`.
540
 
541
- /// admonition |
542
- - Any `Functor` instance satisfying the first law `(fmap id = id)` will [automatically satisfy the second law](https://github.com/quchen/articles/blob/master/second_functor_law.md) as well.
543
- ///
544
- """
545
- )
546
  return
547
 
548
 
549
  @app.cell(hide_code=True)
550
  def _(mo):
551
- mo.md(
552
- r"""
553
- ### Functor laws verification
554
 
555
- We can define `id` and `compose` in `Python` as:
556
- """
557
- )
558
  return
559
 
560
 
@@ -562,12 +536,14 @@ def _(mo):
562
  def _():
563
  id = lambda x: x
564
  compose = lambda f, g: lambda x: f(g(x))
565
- return compose, id
566
 
567
 
568
  @app.cell(hide_code=True)
569
  def _(mo):
570
- mo.md(r"""We can add a helper function `check_functor_law` to verify that an instance satisfies the functor laws:""")
 
 
571
  return
572
 
573
 
@@ -581,7 +557,9 @@ def _(id):
581
 
582
  @app.cell(hide_code=True)
583
  def _(mo):
584
- mo.md(r"""We can verify the functors we've defined:""")
 
 
585
  return
586
 
587
 
@@ -589,17 +567,19 @@ def _(mo):
589
  def _(check_functor_law, flist, pp, rosetree, wrapper):
590
  for functor in (wrapper, flist, rosetree):
591
  pp(check_functor_law(functor))
592
- return (functor,)
593
 
594
 
595
  @app.cell(hide_code=True)
596
  def _(mo):
597
- mo.md("""And here is an `EvilFunctor`. We can verify it's not a valid `Functor`.""")
 
 
598
  return
599
 
600
 
601
  @app.cell
602
- def _(B, Callable, Functor, dataclass):
603
  @dataclass
604
  class EvilFunctor[A](Functor):
605
  value: list[A]
@@ -624,31 +604,29 @@ def _(EvilFunctor, check_functor_law, pp):
624
 
625
  @app.cell(hide_code=True)
626
  def _(mo):
627
- mo.md(
628
- r"""
629
- ## Utility functions
630
-
631
- ```python
632
- @classmethod
633
- def const(cls, fa: "Functor[A]", b: B) -> "Functor[B]":
634
- return cls.fmap(lambda _: b, fa)
635
-
636
- @classmethod
637
- def void(cls, fa: "Functor[A]") -> "Functor[None]":
638
- return cls.const(fa, None)
639
-
640
- @classmethod
641
- def unzip(
642
- cls, fab: "Functor[tuple[A, B]]"
643
- ) -> tuple["Functor[A]", "Functor[B]"]:
644
- return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab)
645
- ```
646
-
647
- - `const` replaces all values inside a functor with a constant `b`
648
- - `void` is equivalent to `const(fa, None)`, transforming all values in a functor into `None`
649
- - `unzip` is a generalization of the regular *unzip* on a list of pairs
650
- """
651
- )
652
  return
653
 
654
 
@@ -676,13 +654,11 @@ def _(List, Maybe):
676
 
677
  @app.cell(hide_code=True)
678
  def _(mo):
679
- mo.md(
680
- r"""
681
- /// admonition
682
- You can always override these utility functions with a more efficient implementation for specific instances.
683
- ///
684
- """
685
- )
686
  return
687
 
688
 
@@ -697,7 +673,9 @@ def _(List, RoseTree, flist, pp, rosetree):
697
 
698
  @app.cell(hide_code=True)
699
  def _(mo):
700
- mo.md("""# Formal implementation of Functor""")
 
 
701
  return
702
 
703
 
@@ -728,291 +706,275 @@ def _(ABC, B, Callable, abstractmethod, dataclass):
728
 
729
  @app.cell(hide_code=True)
730
  def _(mo):
731
- mo.md(
732
- """
733
- ## Limitations of Functor
734
 
735
- Functors abstract the idea of mapping a function over each element of a structure. Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types:
736
 
737
- ```haskell
738
- fmap0 :: a -> f a
739
 
740
- fmap1 :: (a -> b) -> f a -> f b
741
 
742
- fmap2 :: (a -> b -> c) -> f a -> f b -> f c
743
 
744
- fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d
745
- ```
746
 
747
- To support this, we would have to declare a separate, special version of the functor class for each arity.
748
 
749
- We will learn how to resolve this problem in the next notebook on `Applicatives`.
750
- """
751
- )
752
  return
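The arity explosion can be made concrete in Python. The names `fmap1`/`fmap2`/`fmap3` below are hypothetical, introduced only to illustrate the pattern that does not scale:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Wrapper:
    value: Any

# Without Applicative, every arity needs its own bespoke mapping function:
def fmap1(g, fa):
    return Wrapper(g(fa.value))

def fmap2(g, fa, fb):
    return Wrapper(g(fa.value, fb.value))

def fmap3(g, fa, fb, fc):
    return Wrapper(g(fa.value, fb.value, fc.value))

print(fmap2(lambda a, b: a + b, Wrapper(1), Wrapper(2)))  # Wrapper(value=3)
```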
753
 
754
 
755
  @app.cell(hide_code=True)
756
  def _(mo):
757
- mo.md(
758
- """
759
- # Introduction to Categories
760
 
761
- A [category](https://en.wikibooks.org/wiki/Haskell/Category_theory#Introduction_to_categories) is, in essence, a simple collection. It has three components:
762
 
763
- - A collection of **objects**.
764
- - A collection of **morphisms**, each of which ties two objects (a _source object_ and a _target object_) together. If $f$ is a morphism with source object $C$ and target object $B$, we write $f : C → B$.
765
- - A notion of **composition** of these morphisms. If $g : A → B$ and $f : B → C$ are two morphisms, they can be composed, resulting in a morphism $f ∘ g : A → C$.
766
 
767
- ## Category laws
768
 
769
- There are three laws that categories need to follow.
770
 
771
- 1. The composition of morphisms needs to be **associative**. Symbolically, $f ∘ (g ∘ h) = (f ∘ g) ∘ h$
772
 
773
- - Morphisms are applied right to left, so with $f ∘ g$ first $g$ is applied, then $f$.
774
 
775
- 2. The category needs to be **closed** under the composition operation. So if $f : B → C$ and $g : A → B$, then there must be some morphism $h : A → C$ in the category such that $h = f ∘ g$.
776
 
777
- 3. Given a category $C$, for every object $A$ there needs to be an **identity** morphism $id_A : A → A$ that acts as an identity under composition. Put precisely, for every morphism $g : A → B$: $g ∘ id_A = id_B ∘ g = g$
778
 
779
- /// attention | The definition of a category does not define:
780
 
781
- - what `∘` is,
782
- - what `id` is, or
783
- - what `f`, `g`, and `h` might be.
784
 
785
- Instead, category theory leaves it up to us to discover what they might be.
786
- ///
787
- """
788
- )
789
  return
790
 
791
 
792
  @app.cell(hide_code=True)
793
  def _(mo):
794
- mo.md(
795
- """
796
- ## The Python category
797
-
798
- The main category we'll be concerning ourselves with in this part is the Python category, or we can give it a shorter name: `Py`. `Py` treats Python types as objects and Python functions as morphisms. A function `def f(a: A) -> B` for types A and B is a morphism in Python.
799
-
800
- Remember that we defined the `id` and `compose` function above as:
801
-
802
- ```Python
803
- def id(x: A) -> A:
804
- return x
805
-
806
- def compose(f: Callable[[B], C], g: Callable[[A], B]) -> Callable[[A], C]:
807
- return lambda x: f(g(x))
808
- ```
809
-
810
- We can check the second law (closure) easily: composing two Python functions with `compose` again yields a Python function.
811
-
812
- For the first law, we have:
813
-
814
- ```python
815
- # compose(f, g) = lambda x: f(g(x))
816
- f (g h)
817
- = compose(f, compose(g, h))
818
- = lambda x: f(compose(g, h)(x))
819
- = lambda x: f((lambda y: g(h(y)))(x))
820
- = lambda x: f(g(h(x)))
821
-
822
- (f g) h
823
- = compose(compose(f, g), h)
824
- = lambda x: compose(f, g)(h(x))
825
- = lambda x: (lambda y: f(g(y)))(h(x))
826
- = lambda x: f(g(h(x)))
827
- ```
828
-
829
- For the third law, we have:
830
-
831
- ```python
832
- g id_A
833
- = compose(g, id)  # compose: (Callable[[A], B], Callable[[A], A]) -> Callable[[A], B]
834
- = lambda x: g(id(x))
835
- = lambda x: g(x) # id(x) = x
836
- = g
837
- ```
838
- A similar proof applies to $id_B ∘ g = g$.
839
-
840
- Thus `Py` is a valid category.
841
- """
842
- )
843
  return
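The equational proofs above can also be spot-checked numerically. A small sketch using the `id` and `compose` defined earlier (renamed `id_` here to avoid shadowing Python's builtin):

```python
id_ = lambda x: x
compose = lambda f, g: lambda x: f(g(x))

f = lambda x: x + 1
g = lambda x: x * 2
h = lambda x: x - 3

for x in range(10):
    # law 1: composition is associative, f ∘ (g ∘ h) == (f ∘ g) ∘ h
    assert compose(f, compose(g, h))(x) == compose(compose(f, g), h)(x)
    # law 3: id is an identity for composition, g ∘ id == id ∘ g == g
    assert compose(g, id_)(x) == g(x) == compose(id_, g)(x)

print("category laws hold on the sampled inputs")
```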
844
 
845
 
846
  @app.cell(hide_code=True)
847
  def _(mo):
848
- mo.md(
849
- """
850
- # Functors, again
851
 
852
- A functor is essentially a transformation between categories, so given categories $C$ and $D$, a functor $F : C → D$:
853
 
854
- - Maps any object $A$ in $C$ to $F ( A )$, in $D$.
855
- - Maps morphisms $f : A → B$ in $C$ to $F ( f ) : F ( A ) → F ( B )$ in $D$.
856
 
857
- /// admonition |
858
 
859
- Endofunctors are functors from a category to itself.
860
 
861
- ///
862
- """
863
- )
864
  return
865
 
866
 
867
  @app.cell(hide_code=True)
868
  def _(mo):
869
- mo.md(
870
- """
871
- ## Functors on the category of Python
872
 
873
- Remember that a functor has two parts: it maps objects in one category to objects in another and morphisms in the first category to morphisms in the second.
874
 
875
- Functors in Python are from `Py` to `Func`, where `Func` is the subcategory of `Py` defined on just that functor's types. E.g. the RoseTree functor goes from `Py` to `RoseTree`, where `RoseTree` is the category containing only RoseTree types, that is, `RoseTree[T]` for any type `T`. The morphisms in `RoseTree` are functions defined on RoseTree types, that is, functions `Callable[[RoseTree[T]], RoseTree[U]]` for types `T`, `U`.
876
 
877
- Recall the definition of `Functor`:
878
 
879
- ```Python
880
- @dataclass
881
- class Functor[A](ABC)
882
- ```
883
 
884
- And RoseTree:
885
 
886
- ```Python
887
- @dataclass
888
- class RoseTree[A](Functor)
889
- ```
890
 
891
- **Here's the key part:** the _type constructor_ `RoseTree` takes any type `T` to a new type, `RoseTree[T]`. Also, `fmap` restricted to `RoseTree` types takes a function `Callable[[A], B]` to a function `Callable[[RoseTree[A]], RoseTree[B]]`.
892
 
893
- But that's it. We've defined two parts, something that takes objects in `Py` to objects in another category (that of `RoseTree` types and functions defined on `RoseTree` types), and something that takes morphisms in `Py` to morphisms in this category. So `RoseTree` is a functor.
894
 
895
- To sum up:
896
 
897
- - We work in the category **Py** and its subcategories.
898
- - **Objects** are types (e.g., `int`, `str`, `list`).
899
- - **Morphisms** are functions (`Callable[[A], B]`).
900
- - **Things that take a type and return another type** are type constructors (`RoseTree[T]`).
901
- - **Things that take a function and return another function** are higher-order functions (`Callable[[Callable[[A], B]], Callable[[C], D]]`).
902
- - **Abstract base classes (ABC)** and duck typing provide a way to express polymorphism, capturing the idea that in category theory, structures are often defined over multiple objects at once.
903
- """
904
- )
905
  return
906
 
907
 
908
  @app.cell(hide_code=True)
909
  def _(mo):
910
- mo.md(
911
- """
912
- ## Functor laws, again
913
 
914
- Once again there are a few axioms that functors have to obey.
915
 
916
- 1. Given an identity morphism $id_A$ on an object $A$, $F(id_A)$ must be the identity morphism on $F(A)$:
917
 
918
- $$F(id_A) = id_{F(A)}$$
919
 
920
- 2. Functors must distribute over morphism composition.
921
 
922
- $$F(f\circ g)=F(f)\circ F(g)$$
923
- """
924
- )
925
  return
926
 
927
 
928
  @app.cell(hide_code=True)
929
  def _(mo):
930
- mo.md(
931
- """
932
- Remember that we defined the `id` and `compose` as
933
- ```python
934
- id = lambda x: x
935
- compose = lambda f, g: lambda x: f(g(x))
936
- ```
937
-
938
- We can define `fmap` as:
939
-
940
- ```python
941
- fmap = lambda g, functor: functor.fmap(g, functor)
942
- ```
943
-
944
- Let's prove that `fmap` is a functor.
945
-
946
- First, let's define a `Category` for a specific `Functor`. We choose to define the `Category` for the `Wrapper` as `WrapperCategory` here for simplicity, but remember that `Wrapper` can be any `Functor` (e.g. `List`, `RoseTree`, `Maybe`, and more):
947
-
948
- We define `WrapperCategory` as:
949
-
950
- ```python
951
- @dataclass
952
- class WrapperCategory:
953
- @staticmethod
954
- def id(wrapper: Wrapper[A]) -> Wrapper[A]:
955
- return Wrapper(wrapper.value)
956
-
957
- @staticmethod
958
- def compose(
959
- f: Callable[[Wrapper[B]], Wrapper[C]],
960
- g: Callable[[Wrapper[A]], Wrapper[B]],
961
- wrapper: Wrapper[A]
962
- ) -> Wrapper[C]:
963
- return f(g(Wrapper(wrapper.value)))
964
- ```
965
-
966
- And `Wrapper` is:
967
-
968
- ```Python
969
- @dataclass
970
- class Wrapper[A](Functor):
971
- value: A
972
-
973
- @classmethod
974
- def fmap(cls, g: Callable[[A], B], fa: "Wrapper[A]") -> "Wrapper[B]":
975
- return Wrapper(g(fa.value))
976
- ```
977
- """
978
- )
979
  return
980
 
981
 
982
  @app.cell(hide_code=True)
983
  def _(mo):
984
- mo.md(
985
- """
986
- We can prove that:
987
-
988
- ```python
989
- fmap(id, wrapper)
990
- = Wrapper.fmap(id, wrapper)
991
- = Wrapper(id(wrapper.value))
992
- = Wrapper(wrapper.value)
993
- = WrapperCategory.id(wrapper)
994
- ```
995
- and:
996
- ```python
997
- fmap(compose(f, g), wrapper)
998
- = Wrapper.fmap(compose(f, g), wrapper)
999
- = Wrapper(compose(f, g)(wrapper.value))
1000
- = Wrapper(f(g(wrapper.value)))
1001
-
1002
- WrapperCategory.compose(lambda w: fmap(f, w), lambda w: fmap(g, w), wrapper)
- = (lambda w: fmap(f, w))((lambda w: fmap(g, w))(Wrapper(wrapper.value)))
- = fmap(f, fmap(g, Wrapper(wrapper.value)))
- = fmap(f, Wrapper.fmap(g, Wrapper(wrapper.value)))
- = fmap(f, Wrapper(g(wrapper.value)))
- = Wrapper.fmap(f, Wrapper(g(wrapper.value)))
- = Wrapper(f(g(wrapper.value)))
1009
- ```
1010
-
1011
- So our `Wrapper` is a valid `Functor`.
1012
-
1013
- > Try validating functor laws for `Wrapper` below.
1014
- """
1015
- )
1016
  return
1017
 
1018
 
@@ -1042,19 +1004,17 @@ def _(WrapperCategory, id, pp, wrapper):
1042
 
1043
  @app.cell(hide_code=True)
1044
  def _(mo):
1045
- mo.md(
1046
- """
1047
- ## Length as a Functor
1048
 
1049
- Remember that a functor is a transformation between two categories. It is not limited to functors from `Py` to `Func`; it also includes transformations between other mathematical structures.
1050
 
1051
- Let’s prove that **`length`** can be viewed as a functor. Specifically, we will demonstrate that `length` is a functor from the **category of list concatenation** to the **category of integer addition**.
1052
 
1053
- ### Category of List Concatenation
1054
 
1055
- First, let’s define the category of list concatenation:
1056
- """
1057
- )
1058
  return
1059
 
1060
 
@@ -1078,24 +1038,20 @@ def _(A, dataclass):
1078
 
1079
  @app.cell(hide_code=True)
1080
  def _(mo):
1081
- mo.md(
1082
- """
1083
- - **Identity**: The identity element is an empty list (`ListConcatenation([])`).
1084
- - **Composition**: The composition of two lists is their concatenation (`this.value + other.value`).
1085
- """
1086
- )
1087
  return
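A quick runnable check that the empty list really behaves as an identity. The `ListConcatenation` stand-in below mirrors the class as described, with `identity` and `compose` as assumed method names (the notebook's own definition lives in a hidden cell):

```python
from dataclasses import dataclass

@dataclass
class ListConcatenation:
    value: list

    @staticmethod
    def identity() -> "ListConcatenation":
        return ListConcatenation([])

    @staticmethod
    def compose(this: "ListConcatenation", other: "ListConcatenation") -> "ListConcatenation":
        return ListConcatenation(this.value + other.value)

xs = ListConcatenation([1, 2])
ident = ListConcatenation.identity()
# the empty list is a left and right identity for concatenation
assert ListConcatenation.compose(ident, xs) == xs
assert ListConcatenation.compose(xs, ident) == xs
print("identity law holds for list concatenation")
```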
1088
 
1089
 
1090
  @app.cell(hide_code=True)
1091
  def _(mo):
1092
- mo.md(
1093
- """
1094
- ### Category of Integer Addition
1095
 
1096
- Now, let's define the category of integer addition:
1097
- """
1098
- )
1099
  return
1100
 
1101
 
@@ -1117,28 +1073,24 @@ def _(dataclass):
1117
 
1118
  @app.cell(hide_code=True)
1119
  def _(mo):
1120
- mo.md(
1121
- """
1122
- - **Identity**: The identity element is `IntAddition(0)` (the additive identity).
1123
- - **Composition**: The composition of two integers is their sum (`this.value + other.value`).
1124
- """
1125
- )
1126
  return
1127
 
1128
 
1129
  @app.cell(hide_code=True)
1130
  def _(mo):
1131
- mo.md(
1132
- """
1133
- ### Defining the Length Functor
1134
 
1135
- We now define the `length` function as a functor, mapping from the category of list concatenation to the category of integer addition:
1136
 
1137
- ```python
1138
- length = lambda l: IntAddition(len(l.value))
1139
- ```
1140
- """
1141
- )
1142
  return
1143
 
1144
 
@@ -1150,23 +1102,23 @@ def _(IntAddition):
1150
 
1151
  @app.cell(hide_code=True)
1152
  def _(mo):
1153
- mo.md("""This function takes an instance of `ListConcatenation`, computes its length, and returns an `IntAddition` instance with the computed length.""")
 
 
1154
  return
1155
 
1156
 
1157
  @app.cell(hide_code=True)
1158
  def _(mo):
1159
- mo.md(
1160
- """
1161
- ### Verifying Functor Laws
1162
 
1163
- Now, let’s verify that `length` satisfies the two functor laws.
1164
 
1165
- **Identity Law**
1166
 
1167
- The identity law states that applying the functor to the identity element of one category should give the identity element of the other category.
1168
- """
1169
- )
1170
  return
1171
 
1172
 
@@ -1178,19 +1130,19 @@ def _(IntAddition, ListConcatenation, length, pp):
1178
 
1179
  @app.cell(hide_code=True)
1180
  def _(mo):
1181
- mo.md("""This ensures that the length of an empty list (identity in the `ListConcatenation` category) is `0` (identity in the `IntAddition` category).""")
 
 
1182
  return
1183
 
1184
 
1185
  @app.cell(hide_code=True)
1186
  def _(mo):
1187
- mo.md(
1188
- """
1189
- **Composition Law**
1190
 
1191
- The composition law states that the functor should preserve composition. Applying the functor to a composed element should be the same as composing the functor applied to the individual elements.
1192
- """
1193
- )
1194
  return
1195
 
1196
 
@@ -1202,36 +1154,36 @@ def _(IntAddition, ListConcatenation, length, pp):
1202
  length(ListConcatenation.compose(lista, listb))
1203
  == IntAddition.compose(length(lista), length(listb))
1204
  )
1205
- return lista, listb
1206
 
1207
 
1208
  @app.cell(hide_code=True)
1209
  def _(mo):
1210
- mo.md("""This ensures that the length of the concatenation of two lists is the same as the sum of the lengths of the individual lists.""")
 
 
1211
  return
1212
 
1213
 
1214
  @app.cell(hide_code=True)
1215
  def _(mo):
1216
- mo.md(
1217
- r"""
1218
- # Bifunctor
1219
 
1220
- A `Bifunctor` is a type constructor that takes two type arguments and **is a functor in both arguments.**
1221
 
1222
- For example, think about `Either`'s usual `Functor` instance. It only allows you to fmap over the second type parameter: `right` values get mapped, `left` values stay as they are.
1223
 
1224
- However, its `Bifunctor` instance allows you to map both halves of the sum.
1225
 
1226
- There are three core methods for `Bifunctor`:
1227
 
1228
- - `bimap` allows mapping over both type arguments at once.
1229
- - `first` and `second` are also provided for mapping over only one type argument at a time.
1230
 
1231
 
1232
- The abstraction of `Bifunctor` is:
1233
- """
1234
- )
1235
  return
1236
 
1237
 
@@ -1261,38 +1213,36 @@ def _(ABC, B, Callable, D, dataclass, f, id):
1261
 
1262
  @app.cell(hide_code=True)
1263
  def _(mo):
1264
- mo.md(
1265
- r"""
1266
- /// admonition | minimal implementation requirement
1267
- - `bimap` or both `first` and `second`
1268
- ///
1269
- """
1270
- )
1271
  return
1272
 
1273
 
1274
  @app.cell(hide_code=True)
1275
  def _(mo):
1276
- mo.md(r"""## Instances of Bifunctor""")
 
 
1277
  return
1278
 
1279
 
1280
  @app.cell(hide_code=True)
1281
  def _(mo):
1282
- mo.md(
1283
- r"""
1284
- ### The Either Bifunctor
1285
 
1286
- For the `Either Bifunctor`, we allow it to map a function over the `left` value as well.
1287
 
1288
- Notice that the `Either Bifunctor` still only contains the `left` value or the `right` value.
1289
- """
1290
- )
1291
  return
1292
 
1293
 
1294
  @app.cell
1295
- def _(B, Bifunctor, Callable, D, dataclass):
1296
  @dataclass
1297
  class BiEither[A, C](Bifunctor):
1298
  left: A = None
@@ -1334,18 +1284,16 @@ def _(BiEither):
1334
 
1335
  @app.cell(hide_code=True)
1336
  def _(mo):
1337
- mo.md(
1338
- r"""
1339
- ### The 2d Tuple Bifunctor
1340
 
1341
- For 2d tuples, we simply expect `bimap` to apply the two functions to the two elements of the tuple, respectively.
1342
- """
1343
- )
1344
  return
1345
 
1346
 
1347
  @app.cell
1348
- def _(B, Bifunctor, Callable, D, dataclass):
1349
  @dataclass
1350
  class BiTuple[A, C](Bifunctor):
1351
  value: tuple[A, C]
@@ -1368,19 +1316,17 @@ def _(BiTuple):
1368
 
1369
  @app.cell(hide_code=True)
1370
  def _(mo):
1371
- mo.md(
1372
- r"""
1373
- ## Bifunctor laws
1374
 
1375
- The only law we need to verify is
1376
 
1377
- ```python
1378
- bimap(id, id, fa) == id(fa)
1379
- ```
1380
 
1381
- and the other laws then follow automatically.
1382
- """
1383
- )
1384
  return
1385
 
1386
 
@@ -1394,24 +1340,22 @@ def _(BiEither, BiTuple, id):
1394
 
1395
  @app.cell(hide_code=True)
1396
  def _(mo):
1397
- mo.md(
1398
- """
1399
- # Further reading
1400
-
1401
- - [The Trivial Monad](http://blog.sigfpe.com/2007/04/trivial-monad.html)
1402
- - [Haskellforall: The Category Design Pattern](https://www.haskellforall.com/2012/08/the-category-design-pattern.html)
1403
- - [Haskellforall: The Functor Design Pattern](https://www.haskellforall.com/2012/09/the-functor-design-pattern.html)
1404
-
1405
- /// attention | ATTENTION
1406
- The functor design pattern doesn't work at all if you aren't using categories in the first place. This is why you should structure your tools using the compositional category design pattern so that you can take advantage of functors to easily mix your tools together.
1407
- ///
1408
-
1409
- - [Haskellwiki: Functor](https://wiki.haskell.org/index.php?title=Functor)
1410
- - [Haskellwiki: Typeclassopedia#Functor](https://wiki.haskell.org/index.php?title=Typeclassopedia#Functor)
1411
- - [Haskellwiki: Typeclassopedia#Category](https://wiki.haskell.org/index.php?title=Typeclassopedia#Category)
1412
- - [Haskellwiki: Category Theory](https://en.wikibooks.org/wiki/Haskell/Category_theory)
1413
- """
1414
- )
1415
  return
1416
 
1417
 
 
7
 
8
  import marimo
9
 
10
+ __generated_with = "0.18.4"
11
  app = marimo.App(app_title="Category Theory and Functors")
12
 
13
 
14
  @app.cell(hide_code=True)
15
  def _(mo):
16
+ mo.md("""
17
+ # Category Theory and Functors
 
18
 
19
+ In this notebook, you will learn:
20
 
21
+ * Why `length` is a *functor* from the category of `list concatenation` to the category of `integer addition`
22
+ * How to *lift* an ordinary function into a specific *computational context*
23
+ * How to write an *adapter* between two categories
24
 
25
+ In short, a mathematical functor is a **mapping** between two categories in category theory. In practice, a functor represents a type that can be mapped over.
26
 
27
+ /// admonition | Intuitions
28
 
29
+ - A simple intuition is that a `Functor` represents a **container** of values, along with the ability to apply a function uniformly to every element in the container.
30
+ - Another intuition is that a `Functor` represents some sort of **computational context**.
31
+ - Mathematically, `Functors` generalize the idea of a container or a computational context.
32
+ ///
33
 
34
+ We will start with intuition, introduce the basics of category theory, and then examine functors from a categorical perspective.
35
 
36
+ /// details | Notebook metadata
37
+ type: info
38
 
39
+ version: 0.1.5 | last modified: 2025-04-11 | author: [métaboulie](https://github.com/metaboulie)<br/>
40
+ reviewer: [Haleshot](https://github.com/Haleshot)
41
 
42
+ ///
43
+ """)
 
44
  return
45
 
46
 
47
  @app.cell(hide_code=True)
48
  def _(mo):
49
+ mo.md("""
50
+ # Functor as a Computational Context
 
51
 
52
+ A [**Functor**](https://wiki.haskell.org/Functor) is an abstraction that represents a computational context with the ability to apply a function to every value inside it without altering the structure of the context itself. This enables transformations while preserving the shape of the data.
53
 
54
+ To understand this, let's look at a simple example.
55
 
56
+ ## [The One-Way Wrapper Design Pattern](http://blog.sigfpe.com/2007/04/trivial-monad.html)
57
 
58
+ Often, we need to wrap data in some kind of context. However, when performing operations on wrapped data, we typically have to:
59
 
60
+ 1. Unwrap the data.
61
+ 2. Modify the unwrapped data.
62
+ 3. Rewrap the modified data.
63
 
64
+ This process is tedious and inefficient. Instead, we want to wrap data **once** and apply functions directly to the wrapped data without unwrapping it.
65
 
66
+ /// admonition | Rules for a One-Way Wrapper
67
 
68
+ 1. We can wrap values, but we cannot unwrap them.
69
+ 2. We should still be able to apply transformations to the wrapped data.
70
+ 3. Any operation that depends on wrapped data should itself return a wrapped result.
71
+ ///
72
 
73
+ Let's define such a `Wrapper` class:
74
 
75
+ ```python
76
+ from dataclasses import dataclass
77
+ from typing import TypeVar
78
 
79
+ A = TypeVar("A")
80
+ B = TypeVar("B")
81
 
82
+ @dataclass
83
+ class Wrapper[A]:
84
+ value: A
85
+ ```
86
 
87
+ Now, we can create an instance of wrapped data:
88
 
89
+ ```python
90
+ wrapped = Wrapper(1)
91
+ ```
92
 
93
+ ### Mapping Functions Over Wrapped Data
94
 
95
+ To modify wrapped data while keeping it wrapped, we define an `fmap` method:
96
+ """)
 
97
  return
98
 
99
 
100
  @app.cell
101
+ def _(A, B, Callable, Functor, dataclass):
102
  @dataclass
103
  class Wrapper[A](Functor):
104
  value: A
 
111
 
112
  @app.cell(hide_code=True)
113
  def _(mo):
114
+ mo.md(r"""
115
+ /// attention
 
116
 
117
+ To distinguish between regular types and functors, we use the prefix `f` to indicate `Functor`.
118
 
119
+ For instance,
120
 
121
+ - `a: A` is a regular variable of type `A`
122
+ - `g: Callable[[A], B]` is a regular function from type `A` to `B`
123
+ - `fa: Functor[A]` is a *Functor* wrapping a value of type `A`
124
+ - `fg: Functor[Callable[[A], B]]` is a *Functor* wrapping a function from type `A` to `B`
125
 
126
+ and we will avoid using `f` to represent a function.
127
 
128
+ ///
129
 
130
+ > Try with Wrapper below
131
+ """)
 
132
  return
133
 
134
 
 
143
 
144
  @app.cell(hide_code=True)
145
  def _(mo):
146
+ mo.md("""
147
+ We can analyze the type signature of `fmap` for `Wrapper`:
 
148
 
149
+ * `g` is of type `Callable[[A], B]`
150
+ * `fa` is of type `Wrapper[A]`
151
+ * The return value is of type `Wrapper[B]`
152
 
153
+ Thus, in Python's type system, we can express the type signature of `fmap` as:
154
 
155
+ ```python
156
+ fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]
157
+ ```
158
 
159
+ Essentially, `fmap`:
160
 
161
+ 1. Takes a function `Callable[[A], B]` and a `Wrapper[A]` instance as input.
162
+ 2. Applies the function to the value inside the wrapper.
163
+ 3. Returns a new `Wrapper[B]` instance with the transformed value, leaving the original wrapper and its internal data unmodified.
164
 
165
+ Now, let's examine `list` as a similar kind of wrapper.
166
+ """)
 
167
  return
168
 
169
 
170
  @app.cell(hide_code=True)
171
  def _(mo):
172
+ mo.md("""
173
+ ## The List Functor
 
174
 
175
+ We can define a `List` class to represent a wrapped list that supports `fmap`:
176
+ """)
 
177
  return
178
 
179
 
180
  @app.cell
181
+ def _(A, B, Callable, Functor, dataclass):
182
  @dataclass
183
  class List[A](Functor):
184
  value: list[A]
 
191
 
192
  @app.cell(hide_code=True)
193
  def _(mo):
194
+ mo.md(r"""
195
+ > Try with List below
196
+ """)
197
  return
198
 
199
 
 
207
 
208
  @app.cell(hide_code=True)
209
  def _(mo):
210
+ mo.md("""
211
+ ### Extracting the Type of `fmap`
 
212
 
213
+ The type signature of `fmap` for `List` is:
214
 
215
+ ```python
216
+ fmap(g: Callable[[A], B], fa: List[A]) -> List[B]
217
+ ```
218
 
219
+ Similarly, for `Wrapper`:
220
 
221
+ ```python
222
+ fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]
223
+ ```
224
 
225
+ Both follow the same pattern, which we can generalize as:
226
 
227
+ ```python
228
+ fmap(g: Callable[[A], B], fa: Functor[A]) -> Functor[B]
229
+ ```
230
 
231
+ where `Functor` can be `Wrapper`, `List`, or any other wrapper type that follows the same structure.
232
 
233
+ ### Functors in Haskell (optional)
234
 
235
+ In Haskell, the type of `fmap` is:
236
 
237
+ ```haskell
238
+ fmap :: Functor f => (a -> b) -> f a -> f b
239
+ ```
240
 
241
+ or equivalently:
242
 
243
+ ```haskell
244
+ fmap :: Functor f => (a -> b) -> (f a -> f b)
245
+ ```
246
 
247
+ This means that `fmap` **lifts** an ordinary function into the **functor world**, allowing it to operate within a computational context.
248
 
249
+ Now, let's define an abstract class for `Functor`.
250
+ """)
 
251
  return
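The "lifting" reading can be sketched in curried form with plain Python lists standing in as the functor (a sketch only; the notebook's `fmap` takes both arguments at once):

```python
from typing import Callable

def fmap(g: Callable) -> Callable:
    # lift the ordinary function g into "list world":
    # (a -> b) becomes (f a -> f b)
    return lambda xs: [g(x) for x in xs]

lifted = fmap(lambda x: x + 1)
print(lifted([1, 2, 3]))  # [2, 3, 4]
```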
252
 
253
 
254
  @app.cell(hide_code=True)
255
  def _(mo):
256
+ mo.md("""
257
+ ## Defining Functor
 
258
 
259
+ Recall that a **Functor** is an abstraction that allows us to apply a function to values inside a computational context while preserving its structure.
260
 
261
+ To define `Functor` in Python, we use an abstract base class:
262
 
263
+ ```python
264
+ @dataclass
265
+ class Functor[A](ABC):
266
+ @classmethod
267
+ @abstractmethod
268
+ def fmap(g: Callable[[A], B], fa: "Functor[A]") -> "Functor[B]":
269
+ raise NotImplementedError
270
+ ```
271
 
272
+ We can now extend custom wrappers, containers, or computation contexts with this `Functor` base class, implement the `fmap` method, and apply any function.
273
+ """)
 
274
  return
275
 
276
 
277
  @app.cell(hide_code=True)
278
  def _(mo):
279
+ mo.md(r"""
280
+ # More Functor instances (optional)
 
281
 
282
+ In this section, we will explore more *Functor* instances to help you build up a better comprehension.
283
 
284
+ The main reference is [Data.Functor](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Functor.html)
285
+ """)
 
286
  return
287
 
288
 
289
  @app.cell(hide_code=True)
290
  def _(mo):
291
+ mo.md(r"""
292
+ ## The [Maybe](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Maybe.html#t:Maybe) Functor
 
293
 
294
+ **`Maybe`** is a functor that can either hold a value (`Just(value)`) or be `Nothing` (equivalent to `None` in Python).
295
 
296
+ - If the value exists, `fmap` applies the function to this value inside the functor.
297
+ - If the value is `None`, `fmap` simply returns `None`.
298
 
299
+ /// admonition
300
+ By using `Maybe` as a functor, we gain the ability to apply transformations (`fmap`) to potentially absent values, without having to explicitly handle the `None` case every time.
301
+ ///
302
 
303
+ We can implement the `Maybe` functor as:
304
+ """)
 
305
  return
306
 
307
 
308
  @app.cell
309
+ def _(A, B, Callable, Functor, dataclass):
310
  @dataclass
311
  class Maybe[A](Functor):
312
  value: None | A
 
329
 
330
  @app.cell(hide_code=True)
331
  def _(mo):
332
+ mo.md(r"""
333
+ ## The [Either](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Either.html#t:Either) Functor
 
334
 
335
+ The `Either` type represents values with two possibilities: a value of type `Either a b` is either `Left a` or `Right b`.
336
 
337
+ The `Either` type is sometimes used to represent a value which is **either correct or an error**; by convention, the `left` attribute is used to hold an error value and the `right` attribute is used to hold a correct value.
338
 
339
+ `fmap` for `Either` ignores `left` values but applies the supplied function to values contained in the `right`.
340
 
341
+ The implementation is:
342
+ """)
 
343
  return
344
 
345
 
346
  @app.cell
347
+ def _(A, B, Callable, Functor, Union, dataclass):
348
  @dataclass
349
  class Either[A](Functor):
350
  left: A = None
 
382
 
383
  @app.cell(hide_code=True)
384
  def _(mo):
385
+ mo.md("""
386
+ ## The [RoseTree](https://en.wikipedia.org/wiki/Rose_tree) Functor
 
387
 
388
+ A **RoseTree** is a tree where:
389
 
390
+ - Each node holds a **value**.
391
+ - Each node has a **list of child nodes** (which are also RoseTrees).
392
 
393
+ This structure is useful for representing hierarchical data, such as:
394
 
395
+ - Abstract Syntax Trees (ASTs)
396
+ - File system directories
397
+ - Recursive computations
398
 
399
+ The implementation is:
400
+ """)
 
401
  return
402
 
403
 
404
  @app.cell
405
+ def _(A, B, Callable, Functor, dataclass):
406
  @dataclass
407
  class RoseTree[A](Functor):
408
  value: A # The value stored in the node.
 
439
 
440
  @app.cell(hide_code=True)
441
  def _(mo):
442
+ mo.md("""
443
+ ## Generic Functions that can be Used with Any Functor
 
444
 
445
+ One of the powerful features of functors is that we can write **generic functions** that can work with any functor.
446
 
447
+ Remember that in Haskell, the type of `fmap` can be written as:
448
 
449
+ ```haskell
450
+ fmap :: Functor f => (a -> b) -> (f a -> f b)
451
+ ```
452
 
453
+ Translating to Python, we get:
454
 
455
+ ```python
456
+ def fmap(g: Callable[[A], B]) -> Callable[[Functor[A]], Functor[B]]
457
+ ```
458
 
459
+ This means that `fmap`:
460
 
461
+ - Takes an **ordinary function** `Callable[[A], B]` as input.
462
+ - Outputs a function that:
463
+ - Takes a **functor** of type `Functor[A]` as input.
464
+ - Outputs a **functor** of type `Functor[B]`.
465
 
466
+ Inspired by this, we can implement an `inc` function which takes a functor, applies the function `lambda x: x + 1` to every value inside it, and returns a new functor with the updated values.
467
+ """)
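Translating that idea into a runnable sketch: under the assumption that instances expose a classmethod `fmap(g, fa)` as in this notebook, a curried `fmap` yields generic functions like `inc`. The `Box` functor here is a hypothetical stand-in, not part of the notebook's code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Box:
    # A minimal single-value functor used only for illustration.
    value: int

    @classmethod
    def fmap(cls, g: Callable, fa: "Box") -> "Box":
        return Box(g(fa.value))

def fmap(g: Callable) -> Callable:
    # Takes an ordinary function and returns a function on functors,
    # mirroring the curried Haskell type above.
    return lambda fa: fa.fmap(g, fa)

inc = fmap(lambda x: x + 1)
print(inc(Box(1)))  # Box(value=2)
```

Because `fmap` closes over `g`, the same `inc` works unchanged on any functor that provides a compatible `fmap` classmethod.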
 
468
  return
469
 
470
 
 
484
 
485
  @app.cell(hide_code=True)
486
  def _(mo):
487
+ mo.md(r"""
488
+ /// admonition | exercise
489
+ Implement other generic functions and apply them to different *Functor* instances.
490
+ ///
491
+ """)
 
 
492
  return
493
 
494
 
495
  @app.cell(hide_code=True)
496
  def _(mo):
497
+ mo.md(r"""
498
+ # Functor laws and utility functions
499
+ """)
500
  return
501
 
502
 
503
  @app.cell(hide_code=True)
504
  def _(mo):
505
+ mo.md("""
506
+ ## Functor laws
 
507
 
508
+ In addition to providing a function `fmap` of the specified type, functors are also required to satisfy two equational laws:
509
 
510
+ ```haskell
511
+ fmap id = id -- fmap preserves identity
512
+ fmap (g . h) = fmap g . fmap h -- fmap distributes over composition
513
+ ```
514
 
515
+ 1. `fmap` should preserve the **identity function**: mapping the identity function over a functor should leave the functor unchanged, i.e. `fmap id` is itself the identity.
516
+ 2. `fmap` should also preserve **function composition**: mapping the composed function `g . h` over a functor should give the same result as first mapping `h` and then mapping `g`.
517
 
518
+ /// admonition |
519
+ - Any `Functor` instance satisfying the first law `(fmap id = id)` will [automatically satisfy the second law](https://github.com/quchen/articles/blob/master/second_functor_law.md) as well.
520
+ ///
521
+ """)
 
522
  return
523
 
524
 
525
  @app.cell(hide_code=True)
526
  def _(mo):
527
+ mo.md(r"""
528
+ ### Functor laws verification
 
529
 
530
+ We can define `id` and `compose` in `Python` as:
531
+ """)
 
532
  return
533
 
534
 
 
536
  def _():
537
  id = lambda x: x
538
  compose = lambda f, g: lambda x: f(g(x))
539
+ return (id,)
540
 
541
 
542
  @app.cell(hide_code=True)
543
  def _(mo):
544
+ mo.md(r"""
545
+ We can add a helper function `check_functor_law` to verify that an instance satisfies the functor laws:
546
+ """)
547
  return
548
 
549
 
 
557
 
558
  @app.cell(hide_code=True)
559
  def _(mo):
560
+ mo.md(r"""
561
+ We can verify the functor we've defined:
562
+ """)
563
  return
564
 
565
 
 
567
  def _(check_functor_law, flist, pp, rosetree, wrapper):
568
  for functor in (wrapper, flist, rosetree):
569
  pp(check_functor_law(functor))
570
+ return
571
 
572
 
573
  @app.cell(hide_code=True)
574
  def _(mo):
575
+ mo.md("""
576
+ And here is an `EvilFunctor`. We can verify it's not a valid `Functor`.
577
+ """)
578
  return
579
 
580
 
581
  @app.cell
582
+ def _(A, B, Callable, Functor, dataclass):
583
  @dataclass
584
  class EvilFunctor[A](Functor):
585
  value: list[A]
 
604
 
605
  @app.cell(hide_code=True)
606
  def _(mo):
607
+ mo.md(r"""
608
+ ## Utility functions
609
+
610
+ ```python
611
+ @classmethod
612
+ def const(cls, fa: "Functor[A]", b: B) -> "Functor[B]":
613
+ return cls.fmap(lambda _: b, fa)
614
+
615
+ @classmethod
616
+ def void(cls, fa: "Functor[A]") -> "Functor[None]":
617
+ return cls.const(fa, None)
618
+
619
+ @classmethod
620
+ def unzip(
621
+ cls, fab: "Functor[tuple[A, B]]"
622
+ ) -> tuple["Functor[A]", "Functor[B]"]:
623
+ return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab)
624
+ ```
625
+
626
+ - `const` replaces all values inside a functor with a constant `b`
627
+ - `void` is equivalent to `const(fa, None)`, transforming all values in a functor into `None`
628
+ - `unzip` is a generalization of the regular *unzip* on a list of pairs
629
+ """)
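A quick sketch of how these utilities behave, using a hypothetical single-value `Box` functor (an assumption for illustration, not the notebook's code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Box:
    value: object

    @classmethod
    def fmap(cls, g: Callable, fa: "Box") -> "Box":
        return Box(g(fa.value))

    @classmethod
    def const(cls, fa: "Box", b) -> "Box":
        # Replace the contained value with the constant b.
        return cls.fmap(lambda _: b, fa)

    @classmethod
    def void(cls, fa: "Box") -> "Box":
        # Discard the contained value, keeping only the structure.
        return cls.const(fa, None)

    @classmethod
    def unzip(cls, fab: "Box") -> tuple:
        # Split a functor of pairs into a pair of functors.
        return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab)

print(Box.const(Box(3), 9))    # Box(value=9)
print(Box.void(Box(3)))        # Box(value=None)
print(Box.unzip(Box((1, 2))))  # (Box(value=1), Box(value=2))
```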
 
 
630
  return
631
 
632
 
 
654
 
655
  @app.cell(hide_code=True)
656
  def _(mo):
657
+ mo.md(r"""
658
+ /// admonition
659
+ You can always override these utility functions with a more efficient implementation for specific instances.
660
+ ///
661
+ """)
 
 
662
  return
663
 
664
 
 
673
 
674
  @app.cell(hide_code=True)
675
  def _(mo):
676
+ mo.md("""
677
+ # Formal implementation of Functor
678
+ """)
679
  return
680
 
681
 
 
706
 
707
  @app.cell(hide_code=True)
708
  def _(mo):
709
+ mo.md("""
710
+ ## Limitations of Functor
 
711
 
712
+ Functors abstract the idea of mapping a function over each element of a structure. Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types:
713
 
714
+ ```haskell
715
+ fmap0 :: a -> f a
716
 
717
+ fmap1 :: (a -> b) -> f a -> f b
718
 
719
+ fmap2 :: (a -> b -> c) -> f a -> f b -> f c
720
 
721
+ fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d
722
+ ```
723
 
724
+ And we have to declare a special version of the functor class for each case.
725
 
726
+ We will learn how to resolve this problem in the next notebook on `Applicatives`.
727
+ """)
 
728
  return
729
 
730
 
731
  @app.cell(hide_code=True)
732
  def _(mo):
733
+ mo.md("""
734
+ # Introduction to Categories
 
735
 
736
+ A [category](https://en.wikibooks.org/wiki/Haskell/Category_theory#Introduction_to_categories) is, in essence, a simple collection. It has three components:
737
 
738
+ - A collection of **objects**.
739
+ - A collection of **morphisms**, each of which ties two objects (a _source object_ and a _target object_) together. If $f$ is a morphism with source object $C$ and target object $B$, we write $f : C → B$.
740
+ - A notion of **composition** of these morphisms. If $g : A → B$ and $f : B → C$ are two morphisms, they can be composed, resulting in a morphism $f ∘ g : A → C$.
741
 
742
+ ## Category laws
743
 
744
+ There are three laws that categories need to follow.
745
 
746
+ 1. The composition of morphisms needs to be **associative**. Symbolically, $f ∘ (g ∘ h) = (f ∘ g) ∘ h$
747
 
748
+ - Morphisms are applied right to left, so with $f ∘ g$ first $g$ is applied, then $f$.
749
 
750
+ 2. The category needs to be **closed** under the composition operation. So if $f : B → C$ and $g : A → B$, then there must be some morphism $h : A → C$ in the category such that $h = f ∘ g$.
751
 
752
+ 3. Given a category $C$ there needs to be for every object $A$ an **identity** morphism, $id_A : A → A$ that is an identity of composition with other morphisms. Put precisely, for every morphism $g : A → B$: $g ∘ id_A = id_B ∘ g = g$
753
 
754
+ /// attention | The definition of a category does not define:
755
 
756
+ - what `∘` is,
757
+ - what `id` is, or
758
+ - what `f`, `g`, and `h` might be.
759
 
760
+ Instead, category theory leaves it up to us to discover what they might be.
761
+ ///
762
+ """)
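These laws can be spot-checked for ordinary Python functions under composition (a sketch on sampled inputs, not a proof):

```python
# `id_` avoids shadowing the builtin `id`.
id_ = lambda x: x
compose = lambda f, g: lambda x: f(g(x))

f = lambda x: x + 1
g = lambda x: x * 2
h = lambda x: x - 3

for x in range(5):
    # Law 1: composition is associative.
    assert compose(f, compose(g, h))(x) == compose(compose(f, g), h)(x)
    # Law 3: identity is neutral for composition.
    assert compose(g, id_)(x) == compose(id_, g)(x) == g(x)
print("category laws hold on sampled inputs")
```

Closure (law 2) is immediate here: `compose` of two functions is again a function.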
 
763
  return
764
 
765
 
766
  @app.cell(hide_code=True)
767
  def _(mo):
768
+ mo.md("""
769
+ ## The Python category
770
+
771
+ The main category we'll be concerning ourselves with in this part is the Python category, or, to give it a shorter name, `Py`. `Py` treats Python types as objects and Python functions as morphisms. A function `def f(a: A) -> B` for types `A` and `B` is a morphism in `Py`.
772
+
773
+ Remember that we defined the `id` and `compose` function above as:
774
+
775
+ ```Python
776
+ def id(x: A) -> A:
777
+ return x
778
+
779
+ def compose(f: Callable[[B], C], g: Callable[[A], B]) -> Callable[[A], C]:
780
+ return lambda x: f(g(x))
781
+ ```
782
+
783
+ We can check the second law easily: composing two Python functions always yields another Python function.
784
+
785
+ For the first law, we have:
786
+
787
+ ```python
788
+ # compose(f, g) = lambda x: f(g(x))
789
+ f ∘ (g ∘ h)
790
+ = compose(f, compose(g, h))
791
+ = lambda x: f(compose(g, h)(x))
792
+ = lambda x: f((lambda y: g(h(y)))(x))
793
+ = lambda x: f(g(h(x)))
794
+
795
+ (f ∘ g) ∘ h
796
+ = compose(compose(f, g), h)
797
+ = lambda x: compose(f, g)(h(x))
798
+ = lambda x: (lambda y: f(g(y)))(h(x))
799
+ = lambda x: f(g(h(x)))
800
+ ```
801
+
802
+ For the third law, we have:
803
+
804
+ ```python
805
+ g ∘ id_A
806
+ = compose(g, id)  # g: Callable[[A], B], id: Callable[[A], A]
807
+ = lambda x: g(id(x))
808
+ = lambda x: g(x) # id(x) = x
809
+ = g
810
+ ```
811
+ A similar proof applies to $id_B ∘ g = g$.
812
+
813
+ Thus `Py` is a valid category.
814
+ """)
 
 
815
  return
816
 
817
 
818
  @app.cell(hide_code=True)
819
  def _(mo):
820
+ mo.md("""
821
+ # Functors, again
 
822
 
823
+ A functor is essentially a transformation between categories, so given categories $C$ and $D$, a functor $F : C → D$:
824
 
825
+ - Maps any object $A$ in $C$ to $F ( A )$, in $D$.
826
+ - Maps morphisms $f : A → B$ in $C$ to $F ( f ) : F ( A ) → F ( B )$ in $D$.
827
 
828
+ /// admonition |
829
 
830
+ Endofunctors are functors from a category to itself.
831
 
832
+ ///
833
+ """)
 
834
  return
835
 
836
 
837
  @app.cell(hide_code=True)
838
  def _(mo):
839
+ mo.md("""
840
+ ## Functors on the category of Python
 
841
 
842
+ Remember that a functor has two parts: it maps objects in one category to objects in another and morphisms in the first category to morphisms in the second.
843
 
844
+ Functors in Python are from `Py` to `Func`, where `Func` is the subcategory of `Py` defined on just that functor's types. E.g. the RoseTree functor goes from `Py` to `RoseTree`, where `RoseTree` is the category containing only RoseTree types, that is, `RoseTree[T]` for any type `T`. The morphisms in `RoseTree` are functions defined on RoseTree types, that is, functions `Callable[[RoseTree[T]], RoseTree[U]]` for types `T`, `U`.
845
 
846
+ Recall the definition of `Functor`:
847
 
848
+ ```Python
849
+ @dataclass
850
+ class Functor[A](ABC)
851
+ ```
852
 
853
+ And RoseTree:
854
 
855
+ ```Python
856
+ @dataclass
857
+ class RoseTree[A](Functor)
858
+ ```
859
 
860
+ **Here's the key part:** the _type constructor_ `RoseTree` takes any type `T` to a new type, `RoseTree[T]`. Also, `fmap` restricted to `RoseTree` types takes a function `Callable[[A], B]` to a function `Callable[[RoseTree[A]], RoseTree[B]]`.
861
 
862
+ But that's it. We've defined two parts, something that takes objects in `Py` to objects in another category (that of `RoseTree` types and functions defined on `RoseTree` types), and something that takes morphisms in `Py` to morphisms in this category. So `RoseTree` is a functor.
863
 
864
+ To sum up:
865
 
866
+ - We work in the category **Py** and its subcategories.
867
+ - **Objects** are types (e.g., `int`, `str`, `list`).
868
+ - **Morphisms** are functions (`Callable[[A], B]`).
869
+ - **Things that take a type and return another type** are type constructors (`RoseTree[T]`).
870
+ - **Things that take a function and return another function** are higher-order functions (`Callable[[Callable[[A], B]], Callable[[C], D]]`).
871
+ - **Abstract base classes (ABC)** and duck typing provide a way to express polymorphism, capturing the idea that in category theory, structures are often defined over multiple objects at once.
872
+ """)
 
873
  return
874
 
875
 
876
  @app.cell(hide_code=True)
877
  def _(mo):
878
+ mo.md("""
879
+ ## Functor laws, again
 
880
 
881
+ Once again there are a few axioms that functors have to obey.
882
 
883
+ 1. Given an identity morphism $id_A$ on an object $A$, $F(id_A)$ must be the identity morphism on $F(A)$:
884
 
885
+ $$F(id_{A}) = id_{F(A)}$$
886
 
887
+ 2. Functors must distribute over morphism composition:
888
 
889
+ $$F(f\circ g)=F(f)\circ F(g)$$
890
+ """)
 
891
  return
892
 
893
 
894
  @app.cell(hide_code=True)
895
  def _(mo):
896
+ mo.md("""
897
+ Remember that we defined the `id` and `compose` as
898
+ ```python
899
+ id = lambda x: x
900
+ compose = lambda f, g: lambda x: f(g(x))
901
+ ```
902
+
903
+ We can define `fmap` as:
904
+
905
+ ```python
906
+ fmap = lambda g, functor: functor.fmap(g, functor)
907
+ ```
908
+
909
+ Let's prove that `fmap` is a functor.
910
+
911
+ First, let's define a `Category` for a specific `Functor`. We choose the `Wrapper` functor and call its category `WrapperCategory` here for simplicity, but remember that `Wrapper` could be any `Functor` (e.g. `List`, `RoseTree`, `Maybe`, and more):
912
+
913
+ We define `WrapperCategory` as:
914
+
915
+ ```python
916
+ @dataclass
917
+ class WrapperCategory:
918
+ @staticmethod
919
+ def id(wrapper: Wrapper[A]) -> Wrapper[A]:
920
+ return Wrapper(wrapper.value)
921
+
922
+ @staticmethod
923
+ def compose(
924
+ f: Callable[[Wrapper[B]], Wrapper[C]],
925
+ g: Callable[[Wrapper[A]], Wrapper[B]],
926
+ wrapper: Wrapper[A]
927
+ ) -> Wrapper[C]:
928
+ return f(g(Wrapper(wrapper.value)))
929
+ ```
930
+
931
+ And `Wrapper` is:
932
+
933
+ ```Python
934
+ @dataclass
935
+ class Wrapper[A](Functor):
936
+ value: A
937
+
938
+ @classmethod
939
+ def fmap(cls, g: Callable[[A], B], fa: "Wrapper[A]") -> "Wrapper[B]":
940
+ return Wrapper(g(fa.value))
941
+ ```
942
+ """)
 
 
943
  return
944
 
945
 
946
  @app.cell(hide_code=True)
947
  def _(mo):
948
+ mo.md("""
949
+ We can prove that:
950
+
951
+ ```python
952
+ fmap(id, wrapper)
953
+ = Wrapper.fmap(id, wrapper)
954
+ = Wrapper(id(wrapper.value))
955
+ = Wrapper(wrapper.value)
956
+ = WrapperCategory.id(wrapper)
957
+ ```
958
+ and:
959
+ ```python
960
+ fmap(compose(f, g), wrapper)
961
+ = Wrapper.fmap(compose(f, g), wrapper)
962
+ = Wrapper(compose(f, g)(wrapper.value))
963
+ = Wrapper(f(g(wrapper.value)))
964
+
965
+ WrapperCategory.compose(lambda w: fmap(f, w), lambda w: fmap(g, w), wrapper)
966
+ = fmap(f, fmap(g, wrapper))
967
+ = fmap(f, Wrapper.fmap(g, wrapper))
968
+ = fmap(f, Wrapper(g(wrapper.value)))
969
+ = Wrapper.fmap(f, Wrapper(g(wrapper.value)))
970
+ = Wrapper(f(Wrapper(g(wrapper.value)).value))
971
+ = Wrapper(f(g(wrapper.value))) # Wrapper(g(wrapper.value)).value = g(wrapper.value)
972
+ ```
973
+
974
+ So our `Wrapper` is a valid `Functor`.
975
+
976
+ > Try validating functor laws for `Wrapper` below.
977
+ """)
 
 
978
  return
979
 
980
 
 
1004
 
1005
  @app.cell(hide_code=True)
1006
  def _(mo):
1007
+ mo.md("""
1008
+ ## Length as a Functor
 
1009
 
1010
+ Remember that a functor is a transformation between two categories. It is not only limited to a functor from `Py` to `Func`, but also includes transformations between other mathematical structures.
1011
 
1012
+ Let’s prove that **`length`** can be viewed as a functor. Specifically, we will demonstrate that `length` is a functor from the **category of list concatenation** to the **category of integer addition**.
1013
 
1014
+ ### Category of List Concatenation
1015
 
1016
+ First, let’s define the category of list concatenation:
1017
+ """)
 
1018
  return
1019
 
1020
 
 
1038
 
1039
  @app.cell(hide_code=True)
1040
  def _(mo):
1041
+ mo.md("""
1042
+ - **Identity**: The identity element is an empty list (`ListConcatenation([])`).
1043
+ - **Composition**: The composition of two lists is their concatenation (`this.value + other.value`).
1044
+ """)
 
 
1045
  return
1046
 
1047
 
1048
  @app.cell(hide_code=True)
1049
  def _(mo):
1050
+ mo.md("""
1051
+ ### Category of Integer Addition
 
1052
 
1053
+ Now, let's define the category of integer addition:
1054
+ """)
 
1055
  return
1056
 
1057
 
 
1073
 
1074
  @app.cell(hide_code=True)
1075
  def _(mo):
1076
+ mo.md("""
1077
+ - **Identity**: The identity element is `IntAddition(0)` (the additive identity).
1078
+ - **Composition**: The composition of two integers is their sum (`this.value + other.value`).
1079
+ """)
 
 
1080
  return
1081
 
1082
 
1083
  @app.cell(hide_code=True)
1084
  def _(mo):
1085
+ mo.md("""
1086
+ ### Defining the Length Functor
 
1087
 
1088
+ We now define the `length` function as a functor, mapping from the category of list concatenation to the category of integer addition:
1089
 
1090
+ ```python
1091
+ length = lambda l: IntAddition(len(l.value))
1092
+ ```
1093
+ """)
 
1094
  return
1095
 
1096
 
 
1102
 
1103
  @app.cell(hide_code=True)
1104
  def _(mo):
1105
+ mo.md("""
1106
+ This function takes an instance of `ListConcatenation`, computes its length, and returns an `IntAddition` instance with the computed length.
1107
+ """)
1108
  return
1109
 
1110
 
1111
  @app.cell(hide_code=True)
1112
  def _(mo):
1113
+ mo.md("""
1114
+ ### Verifying Functor Laws
 
1115
 
1116
+ Now, let’s verify that `length` satisfies the two functor laws.
1117
 
1118
+ **Identity Law**
1119
 
1120
+ The identity law states that applying the functor to the identity element of one category should give the identity element of the other category.
1121
+ """)
 
1122
  return
1123
 
1124
 
 
1130
 
1131
  @app.cell(hide_code=True)
1132
  def _(mo):
1133
+ mo.md("""
1134
+ This ensures that the length of an empty list (identity in the `ListConcatenation` category) is `0` (identity in the `IntAddition` category).
1135
+ """)
1136
  return
1137
 
1138
 
1139
  @app.cell(hide_code=True)
1140
  def _(mo):
1141
+ mo.md("""
1142
+ **Composition Law**
 
1143
 
1144
+ The composition law states that the functor should preserve composition. Applying the functor to a composed element should be the same as composing the functor applied to the individual elements.
1145
+ """)
 
1146
  return
1147
 
1148
 
 
1154
  length(ListConcatenation.compose(lista, listb))
1155
  == IntAddition.compose(length(lista), length(listb))
1156
  )
1157
+ return
1158
 
1159
 
1160
  @app.cell(hide_code=True)
1161
  def _(mo):
1162
+ mo.md("""
1163
+ This ensures that the length of the concatenation of two lists is the same as the sum of the lengths of the individual lists.
1164
+ """)
1165
  return
1166
 
1167
 
1168
  @app.cell(hide_code=True)
1169
  def _(mo):
1170
+ mo.md(r"""
1171
+ # Bifunctor
 
1172
 
1173
+ A `Bifunctor` is a type constructor that takes two type arguments and **is a functor in both arguments.**
1174
 
1175
+ For example, think about `Either`'s usual `Functor` instance. It only allows you to fmap over the second type parameter: `right` values get mapped, `left` values stay as they are.
1176
 
1177
+ However, its `Bifunctor` instance allows you to map both halves of the sum.
1178
 
1179
+ There are three core methods for `Bifunctor`:
1180
 
1181
+ - `bimap` allows mapping over both type arguments at once.
1182
+ - `first` and `second` are also provided for mapping over only one type argument at a time.
1183
 
1184
 
1185
+ The abstraction of `Bifunctor` is:
1186
+ """)
 
1187
  return
1188
 
1189
 
 
1213
 
1214
  @app.cell(hide_code=True)
1215
  def _(mo):
1216
+ mo.md(r"""
1217
+ /// admonition | minimal implementation requirement
1218
+ - `bimap` or both `first` and `second`
1219
+ ///
1220
+ """)
 
 
1221
  return
1222
 
1223
 
1224
  @app.cell(hide_code=True)
1225
  def _(mo):
1226
+ mo.md(r"""
1227
+ ## Instances of Bifunctor
1228
+ """)
1229
  return
1230
 
1231
 
1232
  @app.cell(hide_code=True)
1233
  def _(mo):
1234
+ mo.md(r"""
1235
+ ### The Either Bifunctor
 
1236
 
1237
+ For the `Either Bifunctor`, we allow it to map a function over the `left` value as well.
1238
 
1239
+ Notice that the `Either` bifunctor still contains only a `left` value or a `right` value, never both.
1240
+ """)
 
1241
  return
1242
 
1243
 
1244
  @app.cell
1245
+ def _(A, B, Bifunctor, C, Callable, D, dataclass):
1246
  @dataclass
1247
  class BiEither[A, C](Bifunctor):
1248
  left: A = None
 
1284
 
1285
  @app.cell(hide_code=True)
1286
  def _(mo):
1287
+ mo.md(r"""
1288
+ ### The 2d Tuple Bifunctor
 
1289
 
1290
+ For 2-tuples, we simply expect `bimap` to apply the two functions to the two elements of the tuple, respectively.
1291
+ """)
 
1292
  return
1293
 
1294
 
1295
  @app.cell
1296
+ def _(A, B, Bifunctor, C, Callable, D, dataclass):
1297
  @dataclass
1298
  class BiTuple[A, C](Bifunctor):
1299
  value: tuple[A, C]
 
1316
 
1317
  @app.cell(hide_code=True)
1318
  def _(mo):
1319
+ mo.md(r"""
1320
+ ## Bifunctor laws
 
1321
 
1322
+ The only law we need to follow is
1323
 
1324
+ ```python
1325
+ bimap(id, id, fa) == id(fa)
1326
+ ```
1327
 
1328
+ and the other laws then follow automatically.
1329
+ """)
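A quick spot-check of this law with a minimal pair bifunctor (`Pair` is a hypothetical stand-in for the notebook's `BiTuple`, sketched here so the check is self-contained):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pair:
    value: tuple

    @classmethod
    def bimap(cls, f: Callable, g: Callable, fab: "Pair") -> "Pair":
        # Apply f to the first component and g to the second.
        a, c = fab.value
        return Pair((f(a), g(c)))

id_ = lambda x: x
fa = Pair((1, "x"))
# bimap(id, id, fa) == id(fa)
assert Pair.bimap(id_, id_, fa) == id_(fa)
print("bifunctor identity law holds for Pair")
```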
 
1330
  return
1331
 
1332
 
 
1340
 
1341
  @app.cell(hide_code=True)
1342
  def _(mo):
1343
+ mo.md("""
1344
+ # Further reading
1345
+
1346
+ - [The Trivial Monad](http://blog.sigfpe.com/2007/04/trivial-monad.html)
1347
+ - [Haskellforall: The Category Design Pattern](https://www.haskellforall.com/2012/08/the-category-design-pattern.html)
1348
+ - [Haskellforall: The Functor Design Pattern](https://www.haskellforall.com/2012/09/the-functor-design-pattern.html)
1349
+
1350
+ /// attention | ATTENTION
1351
+ The functor design pattern doesn't work at all if you aren't using categories in the first place. This is why you should structure your tools using the compositional category design pattern so that you can take advantage of functors to easily mix your tools together.
1352
+ ///
1353
+
1354
+ - [Haskellwiki: Functor](https://wiki.haskell.org/index.php?title=Functor)
1355
+ - [Haskellwiki: Typeclassopedia#Functor](https://wiki.haskell.org/index.php?title=Typeclassopedia#Functor)
1356
+ - [Haskellwiki: Typeclassopedia#Category](https://wiki.haskell.org/index.php?title=Typeclassopedia#Category)
1357
+ - [Haskellwiki: Category Theory](https://en.wikibooks.org/wiki/Haskell/Category_theory)
1358
+ """)
 
 
1359
  return
1360
 
1361
 
functional_programming/06_applicatives.py CHANGED
@@ -7,266 +7,261 @@
7
 
8
  import marimo
9
 
10
- __generated_with = "0.12.9"
11
  app = marimo.App(app_title="Applicative programming with effects")
12
 
13
 
14
  @app.cell(hide_code=True)
15
- def _(mo) -> None:
16
- mo.md(
17
- r"""
18
- # Applicative programming with effects
19
 
20
- `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style.
21
 
22
- Applicative is a functor with application, providing operations to
23
 
24
- + embed pure expressions (`pure`), and
25
- + sequence computations and combine their results (`apply`).
26
 
27
- In this notebook, you will learn:
28
 
29
- 1. How to view `Applicative` as multi-functor intuitively.
30
- 2. How to use `lift` to simplify chaining application.
31
- 3. How to bring *effects* to the functional pure world.
32
- 4. How to view `Applicative` as a lax monoidal functor.
33
- 5. How to use `Alternative` to amalgamate multiple computations into a single computation.
34
 
35
- /// details | Notebook metadata
36
- type: info
37
 
38
- version: 0.1.3 | last modified: 2025-04-16 | author: [métaboulie](https://github.com/metaboulie)<br/>
39
- reviewer: [Haleshot](https://github.com/Haleshot)
40
 
41
- ///
42
- """
43
- )
44
 
45
 
46
  @app.cell(hide_code=True)
47
- def _(mo) -> None:
48
- mo.md(
49
- r"""
50
- # The intuition: [Multifunctor](https://arxiv.org/pdf/2401.14286)
51
 
52
- ## Limitations of functor
53
 
54
- Recall that functors abstract the idea of mapping a function over each element of a structure.
55
 
56
- Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types:
57
 
58
- ```haskell
59
- fmap0 :: a -> f a
60
 
61
- fmap1 :: (a -> b) -> f a -> f b
62
 
63
- fmap2 :: (a -> b -> c) -> f a -> f b -> f c
64
 
65
- fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d
66
- ```
67
 
68
- And we have to declare a special version of the functor class for each case.
69
- """
70
- )
71
 
72
 
73
  @app.cell(hide_code=True)
74
- def _(mo) -> None:
75
- mo.md(
76
- r"""
77
- ## Defining Multifunctor
78
-
79
- /// admonition
80
- we use prefix `f` rather than `ap` to indicate *Applicative Functor*
81
- ///
82
-
83
- As a result, we may want to define a single `Multifunctor` such that:
84
 
85
- 1. Lift a regular n-argument function into the context of functors
 
 
86
 
87
- ```python
88
- # lift a regular 3-argument function `g`
89
- g: Callable[[A, B, C], D]
90
- # into the context of functors
91
- fg: Callable[[Functor[A], Functor[B], Functor[C]], Functor[D]]
92
- ```
93
 
94
- 3. Apply it to n functor-wrapped values
95
 
96
- ```python
97
- # fa: Functor[A], fb: Functor[B], fc: Functor[C]
98
- fg(fa, fb, fc)
99
- ```
 
 
100
 
101
- 5. Get a single functor-wrapped result
102
 
103
- ```python
104
- fd: Functor[D]
105
- ```
 
106
 
107
- We will define a function `lift` such that
108
 
109
  ```python
110
- fd = lift(g, fa, fb, fc)
111
  ```
112
- """
113
- )
114
-
115
 
116
- @app.cell(hide_code=True)
117
- def _(mo) -> None:
118
- mo.md(
119
- r"""
120
- ## Pure, apply and lift
121
 
122
- Traditionally, applicative functors are presented through two core operations:
 
 
 
 
123
 
124
- 1. `pure`: embeds an object (value or function) into the applicative functor
125
 
126
- ```python
127
- # a -> F a
128
- pure: Callable[[A], Applicative[A]]
129
- # for example, if `a` is
130
- a: A
131
- # then we can have `fa` as
132
- fa: Applicative[A] = pure(a)
133
- # or if we have a regular function `g`
134
- g: Callable[[A], B]
135
- # then we can have `fg` as
136
- fg: Applicative[Callable[[A], B]] = pure(g)
137
- ```
138
 
139
- 2. `apply`: applies a function inside an applicative functor to a value inside an applicative functor
140
 
141
- ```python
142
- # F (a -> b) -> F a -> F b
143
- apply: Callable[[Applicative[Callable[[A], B]], Applicative[A]], Applicative[B]]
144
- # and we can have
145
- fd = apply(apply(apply(fg, fa), fb), fc)
146
- ```
147
 
 
 
 
 
 
 
 
 
 
 
 
 
148
 
149
- As a result,
150
 
151
  ```python
152
- lift(g, fa, fb, fc) = apply(apply(apply(pure(g), fa), fb), fc)
 
 
 
153
  ```
154
- """
155
- )
 
 
 
 
 
 
 
156
 
157
 
158
  @app.cell(hide_code=True)
159
- def _(mo) -> None:
160
- mo.md(
161
- r"""
162
- /// admonition | How to use *Applicative* in the manner of *Multifunctor*
163
 
164
- 1. Define `pure` and `apply` for an `Applicative` subclass
165
 
166
- - We can define them much easier compared with `lift`.
167
 
168
- 2. Use the `lift` method
169
 
170
- - We can use it much more convenient compared with the combination of `pure` and `apply`.
171
 
172
 
173
- ///
174
 
175
- /// attention | You can suppress the chaining application of `apply` and `pure` as:
176
 
177
- ```python
178
- apply(pure(g), fa) -> lift(g, fa)
179
- apply(apply(pure(g), fa), fb) -> lift(g, fa, fb)
180
- apply(apply(apply(pure(g), fa), fb), fc) -> lift(g, fa, fb, fc)
181
- ```
182
 
183
- ///
184
- """
185
- )
186
 
187
 
188
  @app.cell(hide_code=True)
189
- def _(mo) -> None:
190
- mo.md(
191
- r"""
192
- ## Abstracting applicatives
193
 
194
- We can now provide an initial abstraction definition of applicatives:
195
 
196
- ```python
197
- @dataclass
198
- class Applicative[A](Functor, ABC):
199
- @classmethod
200
- @abstractmethod
201
- def pure(cls, a: A) -> "Applicative[A]":
202
- raise NotImplementedError("Subclasses must implement pure")
203
-
204
- @classmethod
205
- @abstractmethod
206
- def apply(
207
- cls, fg: "Applicative[Callable[[A], B]]", fa: "Applicative[A]"
208
- ) -> "Applicative[B]":
209
- raise NotImplementedError("Subclasses must implement apply")
210
-
211
- @classmethod
212
- def lift(cls, f: Callable, *args: "Applicative") -> "Applicative":
213
- curr = cls.pure(f)
214
- if not args:
215
- return curr
216
- for arg in args:
217
- curr = cls.apply(curr, arg)
218
  return curr
219
- ```
 
 
 
220
 
221
- /// attention | minimal implementation requirement
222
 
223
- - `pure`
224
- - `apply`
225
- ///
226
- """
227
- )
228
 
229
 
230
  @app.cell(hide_code=True)
231
- def _(mo) -> None:
232
- mo.md(r"""# Instances, laws and utility functions""")
 
 
 
233
 
234
 
235
  @app.cell(hide_code=True)
236
- def _(mo) -> None:
237
- mo.md(
238
- r"""
239
- ## Applicative instances
240
 
241
- When we are actually implementing an *Applicative* instance, we can keep in mind that `pure` and `apply` fundamentally:
242
 
243
- - embed an object (value or function) to the computational context
244
- - apply a function inside the computation context to a value inside the computational context
245
- """
246
- )
247
 
248
 
249
  @app.cell(hide_code=True)
250
- def _(mo) -> None:
251
- mo.md(
252
- r"""
253
- ### The Wrapper Applicative
254
 
255
- - `pure` should simply *wrap* an object, in the sense that:
256
 
257
- ```haskell
258
- Wrapper.pure(1) => Wrapper(value=1)
259
- ```
260
 
261
- - `apply` should apply a *wrapped* function to a *wrapped* value
262
 
263
- The implementation is:
264
- """
265
- )
266
 
267
 
268
  @app.cell
269
- def _(Applicative, dataclass):
270
  @dataclass
271
  class Wrapper[A](Applicative):
272
  value: A
@@ -284,42 +279,45 @@ def _(Applicative, dataclass):
284
 
285
 
286
  @app.cell(hide_code=True)
287
- def _(mo) -> None:
288
- mo.md(r"""> try with Wrapper below""")
 
 
 
289
 
290
 
291
  @app.cell
292
- def _(Wrapper) -> None:
293
  Wrapper.lift(
294
  lambda a: lambda b: lambda c: a + b * c,
295
  Wrapper(1),
296
  Wrapper(2),
297
  Wrapper(3),
298
  )
 
299
 
300
 
301
  @app.cell(hide_code=True)
302
- def _(mo) -> None:
303
- mo.md(
304
- r"""
305
- ### The List Applicative
306
 
307
- - `pure` should wrap the object in a list, in the sense that:
308
 
309
- ```haskell
310
- List.pure(1) => List(value=[1])
311
- ```
312
 
313
- - `apply` should apply a list of functions to a list of values
314
- - you can think of this as cartesian product, concatenating the result of applying every function to every value
315
 
316
- The implementation is:
317
- """
318
- )
319
 
320
 
321
  @app.cell
322
- def _(Applicative, dataclass, product):
323
  @dataclass
324
  class List[A](Applicative):
325
  value: list[A]
@@ -335,47 +333,51 @@ def _(Applicative, dataclass, product):
335
 
336
 
337
  @app.cell(hide_code=True)
338
- def _(mo) -> None:
339
- mo.md(r"""> try with List below""")
 
 
 
340
 
341
 
342
  @app.cell
343
- def _(List) -> None:
344
  List.apply(
345
  List([lambda a: a + 1, lambda a: a * 2]),
346
  List([1, 2]),
347
  )
 
348
 
349
 
350
  @app.cell
351
- def _(List) -> None:
352
  List.lift(lambda a: lambda b: a + b, List([1, 2]), List([3, 4, 5]))
 
353
 
354
 
355
  @app.cell(hide_code=True)
356
- def _(mo) -> None:
357
- mo.md(
358
- r"""
359
- ### The Maybe Applicative
360
 
361
- - `pure` should wrap the object in a Maybe, in the sense that:
362
 
363
- ```haskell
364
- Maybe.pure(1) => "Just 1"
365
- Maybe.pure(None) => "Nothing"
366
- ```
367
 
368
- - `apply` should apply a function maybe exist to a value maybe exist
369
- - if the function is `None` or the value is `None`, simply returns `None`
370
- - else apply the function to the value and wrap the result in `Just`
371
 
372
- The implementation is:
373
- """
374
- )
375
 
376
 
377
  @app.cell
378
- def _(Applicative, dataclass):
379
  @dataclass
380
  class Maybe[A](Applicative):
381
  value: None | A
@@ -399,51 +401,55 @@ def _(Applicative, dataclass):
399
 
400
 
401
  @app.cell(hide_code=True)
402
- def _(mo) -> None:
403
- mo.md(r"""> try with Maybe below""")
 
 
 
404
 
405
 
406
  @app.cell
407
- def _(Maybe) -> None:
408
  Maybe.lift(
409
  lambda a: lambda b: a + b,
410
  Maybe(1),
411
  Maybe(2),
412
  )
 
413
 
414
 
415
  @app.cell
416
- def _(Maybe) -> None:
417
  Maybe.lift(
418
  lambda a: lambda b: None,
419
  Maybe(1),
420
  Maybe(2),
421
  )
 
422
 
423
 
424
  @app.cell(hide_code=True)
425
- def _(mo) -> None:
426
- mo.md(
427
- r"""
428
- ### The Either Applicative
429
 
430
- - `pure` should wrap the object in `Right`, in the sense that:
431
 
432
- ```haskell
433
- Either.pure(1) => Right(1)
434
- ```
435
 
436
- - `apply` should apply a function that is either on Left or Right to a value that is either on Left or Right
437
- - if the function is `Left`, simply returns the `Left` of the function
438
- - else `fmap` the `Right` of the function to the value
439
 
440
- The implementation is:
441
- """
442
- )
443
 
444
 
445
  @app.cell
446
- def _(Applicative, B, Callable, Union, dataclass):
447
  @dataclass
448
  class Either[A](Applicative):
449
  left: A = None
@@ -486,171 +492,180 @@ def _(Applicative, B, Callable, Union, dataclass):
486
 
487
 
488
  @app.cell(hide_code=True)
489
- def _(mo) -> None:
490
- mo.md(r"""> try with `Either` below""")
 
 
 
491
 
492
 
493
  @app.cell
494
- def _(Either) -> None:
495
  Either.apply(Either(left=TypeError("Parse Error")), Either(right=2))
 
496
 
497
 
498
  @app.cell
499
- def _(Either) -> None:
500
  Either.apply(
501
  Either(right=lambda x: x + 1), Either(left=TypeError("Parse Error"))
502
  )
 
503
 
504
 
505
  @app.cell
506
- def _(Either) -> None:
507
  Either.apply(Either(right=lambda x: x + 1), Either(right=1))
 
508
 
509
 
510
  @app.cell(hide_code=True)
511
- def _(mo) -> None:
512
- mo.md(
513
- r"""
514
- ## Collect the list of responses with sequenceL
515
 
516
- One often wants to execute a list of commands and collect the list of their responses, and we can define a function `sequenceL` for this
517
 
518
- /// admonition
519
- In a further notebook about `Traversable`, we will have a more generic `sequence` that executes a **sequence** of commands and collects the **sequence** of their responses, which is not limited to `list`.
520
- ///
521
 
522
- ```python
523
- @classmethod
524
- def sequenceL(cls, fas: list["Applicative[A]"]) -> "Applicative[list[A]]":
525
- if not fas:
526
- return cls.pure([])
527
 
528
- return cls.apply(
529
- cls.fmap(lambda v: lambda vs: [v] + vs, fas[0]),
530
- cls.sequenceL(fas[1:]),
531
- )
532
- ```
533
 
534
- Let's try `sequenceL` with the instances.
535
- """
536
- )
537
 
538
 
539
  @app.cell
540
- def _(Wrapper) -> None:
541
  Wrapper.sequenceL([Wrapper(1), Wrapper(2), Wrapper(3)])
 
542
 
543
 
544
  @app.cell(hide_code=True)
545
- def _(mo) -> None:
546
- mo.md(
547
- r"""
548
- /// attention
549
- For the `Maybe` Applicative, the presence of any `Nothing` causes the entire computation to return `Nothing`.
550
- ///
551
- """
552
- )
553
 
554
 
555
  @app.cell
556
- def _(Maybe) -> None:
557
  Maybe.sequenceL([Maybe(1), Maybe(2), Maybe(None), Maybe(3)])
 
558
 
559
 
560
  @app.cell(hide_code=True)
561
- def _(mo) -> None:
562
- mo.md(r"""The result of `sequenceL` for `List Applicative` is the Cartesian product of the input lists, yielding all possible ordered combinations of elements from each list.""")
 
 
 
563
 
564
 
565
  @app.cell
566
- def _(List) -> None:
567
  List.sequenceL([List([1, 2]), List([3]), List([5, 6, 7])])
 
568
 
569
 
570
  @app.cell(hide_code=True)
571
- def _(mo) -> None:
572
- mo.md(
573
- r"""
574
- ## Applicative laws
575
-
576
- /// admonition | id and compose
577
-
578
- Remember that
579
-
580
- - `id = lambda x: x`
581
- - `compose = lambda f: lambda g: lambda x: f(g(x))`
582
-
583
- ///
584
-
585
- Traditionally, there are four laws that `Applicative` instances should satisfy. In some sense, they are all concerned with making sure that `pure` deserves its name:
586
-
587
- - The identity law:
588
- ```python
589
- # fa: Applicative[A]
590
- apply(pure(id), fa) = fa
591
- ```
592
- - Homomorphism:
593
- ```python
594
- # a: A
595
- # g: Callable[[A], B]
596
- apply(pure(g), pure(a)) = pure(g(a))
597
- ```
598
- Intuitively, applying a non-effectful function to a non-effectful argument in an effectful context is the same as just applying the function to the argument and then injecting the result into the context with pure.
599
- - Interchange:
600
- ```python
601
- # a: A
602
- # fg: Applicative[Callable[[A], B]]
603
- apply(fg, pure(a)) = apply(pure(lambda g: g(a)), fg)
604
- ```
605
- Intuitively, this says that when evaluating the application of an effectful function to a pure argument, the order in which we evaluate the function and its argument doesn't matter.
606
- - Composition:
607
- ```python
608
- # fg: Applicative[Callable[[B], C]]
609
- # fh: Applicative[Callable[[A], B]]
610
- # fa: Applicative[A]
611
- apply(fg, apply(fh, fa)) = lift(compose, fg, fh, fa)
612
- ```
613
- This one is the trickiest law to gain intuition for. In some sense it is expressing a sort of associativity property of `apply`.
614
-
615
- We can add four helper functions to `Applicative` to check whether an instance respects the laws:
 
 
 
616
 
617
- ```python
618
- @dataclass
619
- class Applicative[A](Functor, ABC):
620
-
621
- @classmethod
622
- def check_identity(cls, fa: "Applicative[A]"):
623
- if cls.lift(id, fa) != fa:
624
- raise ValueError("Instance violates identity law")
625
- return True
626
-
627
- @classmethod
628
- def check_homomorphism(cls, a: A, f: Callable[[A], B]):
629
- if cls.lift(f, cls.pure(a)) != cls.pure(f(a)):
630
- raise ValueError("Instance violates homomorphism law")
631
- return True
632
-
633
- @classmethod
634
- def check_interchange(cls, a: A, fg: "Applicative[Callable[[A], B]]"):
635
- if cls.apply(fg, cls.pure(a)) != cls.lift(lambda g: g(a), fg):
636
- raise ValueError("Instance violates interchange law")
637
- return True
638
-
639
- @classmethod
640
- def check_composition(
641
- cls,
642
- fg: "Applicative[Callable[[B], C]]",
643
- fh: "Applicative[Callable[[A], B]]",
644
- fa: "Applicative[A]",
645
- ):
646
- if cls.apply(fg, cls.apply(fh, fa)) != cls.lift(compose, fg, fh, fa):
647
- raise ValueError("Instance violates composition law")
648
- return True
649
- ```
650
 
651
- > Try to validate applicative laws below
652
- """
653
- )
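The law checks above can be exercised concretely. Below is a minimal standalone sketch using a hypothetical `Box` type (not the notebook's `Wrapper`) to verify the identity and homomorphism laws:

```python
from dataclasses import dataclass

identity = lambda x: x  # `id` from the admonition above

@dataclass
class Box:
    value: object

    @classmethod
    def pure(cls, a):
        return cls(a)

    @classmethod
    def apply(cls, fg, fa):
        # unwrap the function, unwrap the value, re-wrap the result
        return cls(fg.value(fa.value))

    @classmethod
    def lift(cls, f, *args):
        curr = cls.pure(f)
        for arg in args:
            curr = cls.apply(curr, arg)
        return curr

# identity law: apply(pure(id), fa) == fa
fa = Box(41)
assert Box.lift(identity, fa) == fa

# homomorphism law: apply(pure(g), pure(a)) == pure(g(a))
g = lambda x: x + 1
assert Box.lift(g, Box.pure(41)) == Box.pure(42)
```

Since `Box` is a `dataclass`, the generated `__eq__` makes the law checks simple value comparisons.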
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
654
 
655
 
656
  @app.cell
@@ -662,7 +677,7 @@ def _():
662
 
663
 
664
  @app.cell
665
- def _(List, Wrapper) -> None:
666
  print("Checking Wrapper")
667
  print(Wrapper.check_identity(Wrapper.pure(1)))
668
  print(Wrapper.check_homomorphism(1, lambda x: x + 1))
@@ -684,79 +699,77 @@ def _(List, Wrapper) -> None:
684
  List.pure(lambda x: x * 2), List.pure(lambda x: x + 0.1), List.pure(1)
685
  )
686
  )
 
687
 
688
 
689
  @app.cell(hide_code=True)
690
- def _(mo) -> None:
691
- mo.md(
692
- r"""
693
- ## Utility functions
694
 
695
- /// attention | using `fmap`
696
- `fmap` is defined automatically using `pure` and `apply`, so you can use `fmap` with any `Applicative`
697
- ///
698
 
699
- ```python
700
- @dataclass
701
- class Applicative[A](Functor, ABC):
702
- @classmethod
703
- def skip(
704
- cls, fa: "Applicative[A]", fb: "Applicative[B]"
705
- ) -> "Applicative[B]":
706
- '''
707
- Sequences the effects of two Applicative computations,
708
- but discards the result of the first.
709
- '''
710
- return cls.apply(cls.const(fa, id), fb)
711
-
712
- @classmethod
713
- def keep(
714
- cls, fa: "Applicative[A]", fb: "Applicative[B]"
715
- ) -> "Applicative[B]":
716
- '''
717
- Sequences the effects of two Applicative computations,
718
- but discards the result of the second.
719
- '''
720
- return cls.lift(const, fa, fb)
721
-
722
- @classmethod
723
- def revapp(
724
- cls, fa: "Applicative[A]", fg: "Applicative[Callable[[A], [B]]]"
725
- ) -> "Applicative[B]":
726
- '''
727
- The first computation produces values which are provided
728
- as input to the function(s) produced by the second computation.
729
- '''
730
- return cls.lift(lambda a: lambda f: f(a), fa, fg)
731
- ```
732
 
733
- - `skip` sequences the effects of two Applicative computations, but **discards the result of the first**. For example, if `m1` and `m2` are instances of type `Maybe[Int]`, then `Maybe.skip(m1, m2)` is `Nothing` whenever either `m1` or `m2` is `Nothing`; but if not, it will have the same value as `m2`.
734
- - Likewise, `keep` sequences the effects of two computations, but **keeps only the result of the first**.
735
- - `revapp` is similar to `apply`, but where the first computation produces value(s) which are provided as input to the function(s) produced by the second computation.
736
- """
737
- )
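The `skip`/`keep` semantics can be sketched on a self-contained `Maybe`. Note this sketch derives both from `lift` with curried discard functions rather than the notebook's `const` helper, so it is an equivalent formulation, not the notebook's exact code:

```python
from dataclasses import dataclass

@dataclass
class Maybe:
    value: object  # None models Nothing

    @classmethod
    def pure(cls, a):
        return cls(a)

    @classmethod
    def apply(cls, fg, fa):
        if fg.value is None or fa.value is None:
            return cls(None)  # failure propagates
        return cls(fg.value(fa.value))

    @classmethod
    def lift(cls, f, *args):
        curr = cls.pure(f)
        for arg in args:
            curr = cls.apply(curr, arg)
        return curr

    @classmethod
    def skip(cls, fa, fb):
        # sequence both effects, discard the result of the first
        return cls.lift(lambda _: lambda b: b, fa, fb)

    @classmethod
    def keep(cls, fa, fb):
        # sequence both effects, discard the result of the second
        return cls.lift(lambda a: lambda _: a, fa, fb)

assert Maybe.skip(Maybe(1), Maybe(2)) == Maybe(2)
assert Maybe.keep(Maybe(1), Maybe(2)) == Maybe(1)
assert Maybe.skip(Maybe(None), Maybe(2)) == Maybe(None)  # a Nothing poisons both
```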
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
738
 
739
 
740
  @app.cell(hide_code=True)
741
- def _(mo) -> None:
742
- mo.md(
743
- r"""
744
- /// admonition | Exercise
745
- Try to use utility functions with different instances
746
- ///
747
- """
748
- )
749
 
750
 
751
  @app.cell(hide_code=True)
752
- def _(mo) -> None:
753
- mo.md(
754
- r"""
755
- # Formal implementation of Applicative
756
 
757
- Now, we can give the formal implementation of `Applicative`
758
- """
759
- )
760
 
761
 
762
  @app.cell
@@ -887,40 +900,38 @@ def _(
887
 
888
 
889
  @app.cell(hide_code=True)
890
- def _(mo) -> None:
891
- mo.md(
892
- r"""
893
- # Effectful programming
894
 
895
- Our original motivation for applicatives was the desire to generalise the idea of mapping to functions with multiple arguments. This is a valid interpretation of the concept of applicatives, but from the three instances we have seen it becomes clear that there is also another, more abstract view.
896
 
897
- The arguments are no longer just plain values but may also have effects, such as the possibility of failure, having many ways to succeed, or performing input/output actions. In this manner, applicative functors can also be viewed as abstracting the idea of **applying pure functions to effectful arguments**, with the precise form of effects that are permitted depending on the nature of the underlying functor.
898
- """
899
- )
900
 
901
 
902
  @app.cell(hide_code=True)
903
- def _(mo) -> None:
904
- mo.md(
905
- r"""
906
- ## The IO Applicative
907
 
908
- We will try to define an `IO` applicative here.
909
 
910
- As before, we first abstract how `pure` and `apply` should function.
911
 
912
- - `pure` should wrap the object in an IO action, and make the object *callable* if it's not because we want to perform the action later:
913
 
914
- ```haskell
915
- IO.pure(1) => IO(effect=lambda: 1)
916
- IO.pure(f) => IO(effect=f)
917
- ```
918
 
919
- - `apply` should perform an action that produces a value, then apply the function with the value
920
 
921
- The implementation is:
922
- """
923
- )
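The deferred-effect idea can be sketched standalone. This simplified version assumes every effect is a zero-argument thunk (slightly simpler than the notebook's `pure`, which passes callables through unchanged):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class IO:
    effect: Callable  # a zero-argument thunk; calling it performs the action

    @classmethod
    def pure(cls, a):
        # defer: the value is only produced when the effect is run
        return cls(lambda: a)

    @classmethod
    def apply(cls, fg, fa):
        # build a new deferred action: run fg to obtain the function,
        # run fa to obtain the argument, then apply
        return cls(lambda: fg.effect()(fa.effect()))

greet = IO.pure(lambda name: f"hello, {name}")
action = IO.apply(greet, IO.pure("world"))
# nothing has run yet; the effect fires only when invoked
assert action.effect() == "hello, world"
```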
924
 
925
 
926
  @app.cell
@@ -943,8 +954,11 @@ def _(Applicative, Callable, dataclass):
943
 
944
 
945
  @app.cell(hide_code=True)
946
- def _(mo) -> None:
947
- mo.md(r"""For example, a function that reads a given number of lines from the keyboard can be defined in applicative style as follows:""")
 
 
 
948
 
949
 
950
  @app.cell
@@ -953,29 +967,31 @@ def _(IO):
953
  return IO.sequenceL([
954
  IO.pure(input(f"input the {i}th str")) for i in range(1, n + 1)
955
  ])
956
- return (get_chars,)
957
 
958
 
959
  @app.cell
960
- def _() -> None:
961
  # get_chars()()
962
  return
963
 
964
 
965
  @app.cell(hide_code=True)
966
- def _(mo) -> None:
967
- mo.md(r"""# From the perspective of category theory""")
 
 
 
968
 
969
 
970
  @app.cell(hide_code=True)
971
- def _(mo) -> None:
972
- mo.md(
973
- r"""
974
- ## Lax Monoidal Functor
975
 
976
- An alternative, equivalent formulation of `Applicative` is given by
977
- """
978
- )
979
 
980
 
981
  @app.cell
@@ -997,97 +1013,92 @@ def _(ABC, Functor, abstractmethod, dataclass):
997
 
998
 
999
  @app.cell(hide_code=True)
1000
- def _(mo) -> None:
1001
- mo.md(
1002
- r"""
1003
- Intuitively, this states that a *monoidal functor* is one which has some sort of "default shape" and which supports some sort of "combining" operation.
1004
 
1005
- - `unit` provides the identity element
1006
- - `tensor` combines two contexts into a product context
1007
 
1008
- More technically, the idea is that `monoidal functor` preserves the "monoidal structure" given by the pairing constructor `(,)` and unit type `()`.
1009
- """
1010
- )
1011
 
1012
 
1013
  @app.cell(hide_code=True)
1014
- def _(mo) -> None:
1015
- mo.md(
1016
- r"""
1017
- Furthermore, to deserve the name "monoidal", instances of Monoidal ought to satisfy the following laws, which seem much more straightforward than the traditional Applicative laws:
1018
 
1019
- - Left identity
1020
 
1021
- `tensor(unit, v) ≅ v`
1022
 
1023
- - Right identity
1024
 
1025
- `tensor(u, unit) ≅ u`
1026
 
1027
- - Associativity
1028
 
1029
- `tensor(u, tensor(v, w)) ≅ tensor(tensor(u, v), w)`
1030
- """
1031
- )
1032
 
1033
 
1034
  @app.cell(hide_code=True)
1035
- def _(mo) -> None:
1036
- mo.md(
1037
- r"""
1038
- /// admonition | ≅ indicates isomorphism
1039
 
1040
- `≅` refers to *isomorphism* rather than equality.
1041
 
1042
- In particular we consider `(x, ()) ≅ x ≅ ((), x)` and `((x, y), z) ≅ (x, (y, z))`
1043
 
1044
- ///
1045
- """
1046
- )
1047
 
1048
 
1049
  @app.cell(hide_code=True)
1050
- def _(mo) -> None:
1051
- mo.md(
1052
- r"""
1053
- ## Mutual definability of Monoidal and Applicative
1054
-
1055
- We can implement `pure` and `apply` in terms of `unit` and `tensor`, and vice versa.
1056
-
1057
- ```python
1058
- pure(a) = fmap((lambda _: a), unit)
1059
- apply(fg, fa) = fmap((lambda pair: pair[0](pair[1])), tensor(fg, fa))
1060
- ```
1061
-
1062
- ```python
1063
- unit() = pure(())
1064
- tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)
1065
- ```
1066
- """
1067
- )
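The two definability formulas above can be checked directly. The following sketch uses a hypothetical `ListAp` stand-in for the notebook's `List` applicative and derives `unit`/`tensor` from `pure`/`lift`:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class ListAp:
    value: list

    @classmethod
    def pure(cls, a):
        return cls([a])

    @classmethod
    def apply(cls, fg, fa):
        # cartesian application: every function paired with every value
        return cls([g(a) for g, a in product(fg.value, fa.value)])

    @classmethod
    def lift(cls, f, *args):
        curr = cls.pure(f)
        for arg in args:
            curr = cls.apply(curr, arg)
        return curr

def unit():
    # unit() = pure(())
    return ListAp.pure(())

def tensor(fa, fb):
    # tensor(fa, fb) = lift(lambda a: lambda b: (a, b), fa, fb)
    return ListAp.lift(lambda a: lambda b: (a, b), fa, fb)

assert unit().value == [()]
assert tensor(ListAp([1, 2]), ListAp(["a"])).value == [(1, "a"), (2, "a")]
```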
1068
 
1069
 
1070
  @app.cell(hide_code=True)
1071
- def _(mo) -> None:
1072
- mo.md(
1073
- r"""
1074
- ## Instance: ListMonoidal
1075
 
1076
- - `unit` should simply return an empty tuple wrapped in a list
1077
 
1078
- ```haskell
1079
- ListMonoidal.unit() => [()]
1080
- ```
1081
 
1082
- - `tensor` should return the *cartesian product* of the items of two `ListMonoidal` instances
1083
 
1084
- The implementation is:
1085
- """
1086
- )
1087
 
1088
 
1089
  @app.cell
1090
- def _(B, Callable, Monoidal, dataclass, product):
1091
  @dataclass
1092
  class ListMonoidal[A](Monoidal):
1093
  items: list[A]
@@ -1111,8 +1122,11 @@ def _(B, Callable, Monoidal, dataclass, product):
1111
 
1112
 
1113
  @app.cell(hide_code=True)
1114
- def _(mo) -> None:
1115
- mo.md(r"""> try with `ListMonoidal` below""")
 
 
 
1116
 
1117
 
1118
  @app.cell
@@ -1124,13 +1138,17 @@ def _(ListMonoidal):
1124
 
1125
 
1126
  @app.cell(hide_code=True)
1127
- def _(mo) -> None:
1128
- mo.md(r"""and we can prove that `tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)`:""")
 
 
 
1129
 
1130
 
1131
  @app.cell
1132
- def _(List, xs, ys) -> None:
1133
  List.lift(lambda fa: lambda fb: (fa, fb), List(xs.items), List(ys.items))
 
1134
 
1135
 
1136
  @app.cell(hide_code=True)
@@ -1179,83 +1197,81 @@ def _(TypeVar):
1179
  A = TypeVar("A")
1180
  B = TypeVar("B")
1181
  C = TypeVar("C")
1182
- return A, B, C
1183
 
1184
 
1185
  @app.cell(hide_code=True)
1186
- def _(mo) -> None:
1187
- mo.md(
1188
- r"""
1189
- # From Applicative to Alternative
1190
-
1191
- ## Abstracting Alternative
1192
-
1193
- In our studies so far, we saw that both `Maybe` and `List` can represent computations with a varying number of results.
1194
-
1195
- We use `Maybe` to indicate a computation can fail somehow and `List` for computations that can have many possible results. In both of these cases, one useful operation is amalgamating all possible results from multiple computations into a single computation.
1196
 
1197
- `Alternative` formalizes computations that support:
1198
 
1199
- - **Failure** (empty result)
1200
- - **Choice** (combination of results)
1201
- - **Repetition** (multiple results)
1202
 
1203
- It extends `Applicative` with monoidal structure, where:
1204
 
1205
- ```python
1206
- @dataclass
1207
- class Alternative[A](Applicative, ABC):
1208
- @classmethod
1209
- @abstractmethod
1210
- def empty(cls) -> "Alternative[A]":
1211
- '''Identity element for alternative computations'''
1212
-
1213
- @classmethod
1214
- @abstractmethod
1215
- def alt(
1216
- cls, fa: "Alternative[A]", fb: "Alternative[A]"
1217
- ) -> "Alternative[A]":
1218
- '''Binary operation combining computations'''
1219
- ```
1220
 
1221
- - `empty` is the identity element (e.g., `Maybe(None)`, `List([])`)
1222
- - `alt` is a combination operator (e.g., `Maybe` fallback, list concatenation)
 
1223
 
1224
- `empty` and `alt` should satisfy the following **laws**:
1225
-
1226
- ```python
1227
- # Left identity
1228
- alt(empty, fa) == fa
1229
- # Right identity
1230
- alt(fa, empty) == fa
1231
- # Associativity
1232
- alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc)
1233
- ```
1234
 
1235
- /// admonition
1236
- Actually, `Alternative` is a *monoid* on `Applicative Functors`. We will talk about *monoid* and review these laws in the next notebook about `Monads`.
1237
- ///
 
 
 
 
1238
 
1239
- /// attention | minimal implementation requirement
1240
- - `empty`
1241
- - `alt`
1242
- ///
1243
- """
1244
- )
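The `empty`/`alt` laws can be verified on a minimal instance. This sketch mirrors the `Maybe`-style fallback described later in the notebook (first success wins), but as a self-contained class:

```python
from dataclasses import dataclass

@dataclass
class AltMaybe:
    value: object  # None models failure

    @classmethod
    def empty(cls):
        return cls(None)

    @classmethod
    def alt(cls, fa, fb):
        # fallback semantics: keep the first non-failing computation
        return fa if fa.value is not None else fb

fa, fb, fc = AltMaybe(1), AltMaybe(2), AltMaybe(None)

# left identity: alt(empty, fa) == fa
assert AltMaybe.alt(AltMaybe.empty(), fa) == fa
# right identity: alt(fa, empty) == fa
assert AltMaybe.alt(fa, AltMaybe.empty()) == fa
# associativity
assert AltMaybe.alt(fa, AltMaybe.alt(fb, fc)) == AltMaybe.alt(AltMaybe.alt(fa, fb), fc)
```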
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1245
 
1246
 
1247
  @app.cell(hide_code=True)
1248
- def _(mo) -> None:
1249
- mo.md(
1250
- r"""
1251
- ## Instances of Alternative
1252
 
1253
- ### The Maybe Alternative
1254
 
1255
- - `empty`: the identity element of `Maybe` is `Maybe(None)`
1256
- - `alt`: return the first element if it's not `None`, else return the second element
1257
- """
1258
- )
1259
 
1260
 
1261
  @app.cell
@@ -1278,31 +1294,32 @@ def _(Alternative, Maybe, dataclass):
1278
 
1279
 
1280
  @app.cell
1281
- def _(AltMaybe) -> None:
1282
  print(AltMaybe.empty())
1283
  print(AltMaybe.alt(AltMaybe(None), AltMaybe(1)))
1284
  print(AltMaybe.alt(AltMaybe(None), AltMaybe(None)))
1285
  print(AltMaybe.alt(AltMaybe(1), AltMaybe(None)))
1286
  print(AltMaybe.alt(AltMaybe(1), AltMaybe(2)))
 
1287
 
1288
 
1289
  @app.cell
1290
- def _(AltMaybe) -> None:
1291
  print(AltMaybe.check_left_identity(AltMaybe(1)))
1292
  print(AltMaybe.check_right_identity(AltMaybe(1)))
1293
  print(AltMaybe.check_associativity(AltMaybe(1), AltMaybe(2), AltMaybe(None)))
 
1294
 
1295
 
1296
  @app.cell(hide_code=True)
1297
- def _(mo) -> None:
1298
- mo.md(
1299
- r"""
1300
- ### The List Alternative
1301
-
1302
- - `empty`: the identity element of `List` is `List([])`
1303
- - `alt`: return the concatenation of 2 input lists
1304
- """
1305
- )
1306
 
1307
 
1308
  @app.cell
@@ -1320,23 +1337,26 @@ def _(Alternative, List, dataclass):
1320
 
1321
 
1322
  @app.cell
1323
- def _(AltList) -> None:
1324
  print(AltList.empty())
1325
  print(AltList.alt(AltList([1, 2, 3]), AltList([4, 5])))
 
1326
 
1327
 
1328
  @app.cell
1329
- def _(AltList) -> None:
1330
  AltList([1])
 
1331
 
1332
 
1333
  @app.cell
1334
- def _(AltList) -> None:
1335
  AltList([1])
 
1336
 
1337
 
1338
  @app.cell
1339
- def _(AltList) -> None:
1340
  print(AltList.check_left_identity(AltList([1, 2, 3])))
1341
  print(AltList.check_right_identity(AltList([1, 2, 3])))
1342
  print(
@@ -1344,77 +1364,88 @@ def _(AltList) -> None:
1344
  AltList([1, 2]), AltList([3, 4, 5]), AltList([6])
1345
  )
1346
  )
 
1347
 
1348
 
1349
  @app.cell(hide_code=True)
1350
- def _(mo) -> None:
1351
- mo.md(
1352
- r"""
1353
- ## some and many
1354
 
1355
 
1356
- /// admonition | This section mainly refers to
1357
 
1358
- - https://stackoverflow.com/questions/7671009/some-and-many-functions-from-the-alternative-type-class/7681283#7681283
1359
 
1360
- ///
1361
 
1362
- First let's have a look at the implementation of `some` and `many`:
1363
 
1364
- ```python
1365
- @classmethod
1366
- def some(cls, fa: "Alternative[A]") -> "Alternative[list[A]]":
1367
- # Short-circuit if input is empty
1368
- if fa == cls.empty():
1369
- return cls.empty()
1370
 
1371
- return cls.apply(
1372
- cls.fmap(lambda a: lambda b: [a] + b, fa), cls.many(fa)
1373
- )
1374
 
1375
- @classmethod
1376
- def many(cls, fa: "Alternative[A]") -> "Alternative[list[A]]":
1377
- # Directly return empty list if input is empty
1378
- if fa == cls.empty():
1379
- return cls.pure([])
1380
 
1381
- return cls.alt(cls.some(fa), cls.pure([]))
1382
- ```
1383
 
1384
- So `some f` runs `f` once, then *many* times, and conses the results. `many f` runs `f` *some* times, or *alternatively* just returns the empty list.
1385
 
1386
- The idea is that they both run `f` as often as possible until it **fails**, collecting the results in a list. The difference is that `some f` immediately fails if `f` fails, while `many f` will still succeed and *return* the empty list in such a case. But what all this exactly means depends on how `alt` is defined.
1387
 
1388
- Let's see what it does for the instances `AltMaybe` and `AltList`.
1389
- """
1390
- )
1391
 
1392
 
1393
  @app.cell(hide_code=True)
1394
- def _(mo) -> None:
1395
- mo.md(r"""For `AltMaybe`, `None` means failure, so `some None` fails as well and evaluates to `None`, while `many None` succeeds and evaluates to `Just []`. Both `some (Just ())` and `many (Just ())` never return, because `Just ()` never fails.""")
 
 
 
1396
 
1397
 
1398
  @app.cell
1399
- def _(AltMaybe) -> None:
1400
  print(AltMaybe.some(AltMaybe.empty()))
1401
  print(AltMaybe.many(AltMaybe.empty()))
 
1402
 
1403
 
1404
  @app.cell(hide_code=True)
1405
- def _(mo) -> None:
1406
- mo.md(r"""For `AltList`, `[]` means failure, so `some []` evaluates to `[]` (no answers) while `many []` evaluates to `[[]]` (there's one answer and it is the empty list). Again `some [()]` and `many [()]` don't return.""")
 
 
 
1407
 
1408
 
1409
  @app.cell
1410
- def _(AltList) -> None:
1411
  print(AltList.some(AltList.empty()))
1412
  print(AltList.many(AltList.empty()))
 
1413
 
1414
 
1415
  @app.cell(hide_code=True)
1416
- def _(mo) -> None:
1417
- mo.md(r"""## Formal implementation of Alternative""")
 
 
 
1418
 
1419
 
1420
  @app.cell
@@ -1472,42 +1503,40 @@ def _(ABC, Applicative, abstractmethod, dataclass):
1472
 
1473
 
1474
  @app.cell(hide_code=True)
1475
- def _(mo) -> None:
1476
- mo.md(
1477
- r"""
1478
- /// admonition
1479
 
1480
- We will explore more about `Alternative` in a future notebook about [Monadic Parsing](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/monadic-parsing-in-haskell/E557DFCCE00E0D4B6ED02F3FB0466093)
1481
 
1482
- ///
1483
- """
1484
- )
1485
 
1486
 
1487
  @app.cell(hide_code=True)
1488
- def _(mo) -> None:
1489
- mo.md(
1490
- r"""
1491
- # Further reading
1492
-
1493
- Note that these readings are optional and non-trivial
1494
-
1495
- - [Applicative Programming with Effects](https://www.staff.city.ac.uk/~ross/papers/Applicative.html)
1496
- - [Equivalence of Applicative Functors and
1497
- Multifunctors](https://arxiv.org/pdf/2401.14286)
1498
- - [Applicative functor](https://wiki.haskell.org/index.php?title=Applicative_functor)
1499
- - [Control.Applicative](https://hackage.haskell.org/package/base-4.21.0.0/docs/Control-Applicative.html#t:Applicative)
1500
- - [Typeclassopedia#Applicative](https://wiki.haskell.org/index.php?title=Typeclassopedia#Applicative)
1501
- - [Notions of computation as monoids](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/notions-of-computation-as-monoids/70019FC0F2384270E9F41B9719042528)
1502
- - [Free Applicative Functors](https://arxiv.org/abs/1403.0749)
1503
- - [The basics of applicative functors, put to practical work](http://www.serpentine.com/blog/2008/02/06/the-basics-of-applicative-functors-put-to-practical-work/)
1504
- - [Abstracting with Applicatives](http://comonad.com/reader/2012/abstracting-with-applicatives/)
1505
- - [Static analysis with Applicatives](https://gergo.erdi.hu/blog/2012-12-01-static_analysis_with_applicatives/)
1506
- - [Explaining Applicative functor in categorical terms - monoidal functors](https://cstheory.stackexchange.com/questions/12412/explaining-applicative-functor-in-categorical-terms-monoidal-functors)
1507
- - [Applicative, A Strong Lax Monoidal Functor](https://beuke.org/applicative/)
1508
- - [Applicative Functors](https://bartoszmilewski.com/2017/02/06/applicative-functors/)
1509
- """
1510
- )
1511
 
1512
 
1513
  if __name__ == "__main__":
 
7
 
8
  import marimo
9
 
10
+ __generated_with = "0.18.4"
11
  app = marimo.App(app_title="Applicative programming with effects")
12
 
13
 
14
  @app.cell(hide_code=True)
15
+ def _(mo):
16
+ mo.md(r"""
17
+ # Applicative programming with effects
 
18
 
19
+ `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style.
20
 
21
+ Applicative is a functor with application, providing operations to
22
 
23
+ + embed pure expressions (`pure`), and
24
+ + sequence computations and combine their results (`apply`).
25
 
26
+ In this notebook, you will learn:
27
 
28
+ 1. How to view `Applicative` as multi-functor intuitively.
29
+ 2. How to use `lift` to simplify chaining application.
30
+ 3. How to bring *effects* to the functional pure world.
31
+ 4. How to view `Applicative` as a lax monoidal functor.
32
+ 5. How to use `Alternative` to amalgamate multiple computations into a single computation.
33
 
34
+ /// details | Notebook metadata
35
+ type: info
36
 
37
+ version: 0.1.3 | last modified: 2025-04-16 | author: [métaboulie](https://github.com/metaboulie)<br/>
38
+ reviewer: [Haleshot](https://github.com/Haleshot)
39
 
40
+ ///
41
+ """)
42
+ return
43
 
44
 
45
  @app.cell(hide_code=True)
46
+ def _(mo):
47
+ mo.md(r"""
48
+ # The intuition: [Multifunctor](https://arxiv.org/pdf/2401.14286)
 
49
 
50
+ ## Limitations of functor
51
 
52
+ Recall that functors abstract the idea of mapping a function over each element of a structure.
53
 
54
+ Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types:
55
 
56
+ ```haskell
57
+ fmap0 :: a -> f a
58
 
59
+ fmap1 :: (a -> b) -> f a -> f b
60
 
61
+ fmap2 :: (a -> b -> c) -> f a -> f b -> f c
62
 
63
+ fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d
64
+ ```
65
 
66
+ And we have to declare a special version of the functor class for each case.
67
+ """)
68
+ return
69
 
70
 
71
  @app.cell(hide_code=True)
72
+ def _(mo):
73
+ mo.md(r"""
74
+ ## Defining Multifunctor
 
 
 
 
 
 
 
75
 
76
+ /// admonition
77
+ We use the prefix `f` rather than `ap` to indicate *Applicative Functor*.
78
+ ///
79
 
80
+ As a result, we may want to define a single `Multifunctor` such that:
 
 
 
 
 
81
 
82
+ 1. Lift a regular n-argument function into the context of functors
83
 
84
+ ```python
85
+ # lift a regular 3-argument function `g`
86
+ g: Callable[[A, B, C], D]
87
+ # into the context of functors
88
+ fg: Callable[[Functor[A], Functor[B], Functor[C]], Functor[D]]
89
+ ```
90
 
91
+ 2. Apply it to n functor-wrapped values
92
 
93
+ ```python
94
+ # fa: Functor[A], fb: Functor[B], fc: Functor[C]
95
+ fg(fa, fb, fc)
96
+ ```
97
 
98
+ 3. Get a single functor-wrapped result
99
 
100
  ```python
101
+ fd: Functor[D]
102
  ```
 
 
 
103
 
104
+ We will define a function `lift` such that
 
 
 
 
105
 
106
+ ```python
107
+ fd = lift(g, fa, fb, fc)
108
+ ```
109
+ """)
110
+ return
111
 
 
112
 
113
+ @app.cell(hide_code=True)
114
+ def _(mo):
115
+ mo.md(r"""
116
+ ## Pure, apply and lift
 
 
 
 
 
 
 
 
117
 
118
+ Traditionally, applicative functors are presented through two core operations:
119
 
120
+ 1. `pure`: embeds an object (value or function) into the applicative functor
 
 
 
 
 
121
 
122
+ ```python
123
+ # a -> F a
124
+ pure: Callable[[A], Applicative[A]]
125
+ # for example, if `a` is
126
+ a: A
127
+ # then we can have `fa` as
128
+ fa: Applicative[A] = pure(a)
129
+ # or if we have a regular function `g`
130
+ g: Callable[[A], B]
131
+ # then we can have `fg` as
132
+ fg: Applicative[Callable[[A], B]] = pure(g)
133
+ ```
134
 
135
+ 2. `apply`: applies a function inside an applicative functor to a value inside an applicative functor
136
 
137
  ```python
138
+ # F (a -> b) -> F a -> F b
139
+ apply: Callable[[Applicative[Callable[[A], B]], Applicative[A]], Applicative[B]]
140
+ # and we can have
141
+ fd = apply(apply(apply(fg, fa), fb), fc)
142
  ```
143
+
144
+
145
+ As a result,
146
+
147
+ ```python
148
+ lift(g, fa, fb, fc) = apply(apply(apply(pure(g), fa), fb), fc)
149
+ ```
150
+ """)
151
+ return
152
 
153
 
154
  @app.cell(hide_code=True)
155
+ def _(mo):
156
+ mo.md(r"""
157
+ /// admonition | How to use *Applicative* in the manner of *Multifunctor*
 
158
 
159
+ 1. Define `pure` and `apply` for an `Applicative` subclass
160
 
161
+ - They are much easier to define than `lift`.
162
 
163
+ 2. Use the `lift` method
164
 
165
+ - It is much more convenient to use than chaining `pure` and `apply` by hand.
166
 
167
 
168
+ ///
169
 
170
+ /// attention | You can abbreviate the chained application of `apply` and `pure` as:
171
 
172
+ ```python
173
+ apply(pure(g), fa) -> lift(g, fa)
174
+ apply(apply(pure(g), fa), fb) -> lift(g, fa, fb)
175
+ apply(apply(apply(pure(g), fa), fb), fc) -> lift(g, fa, fb, fc)
176
+ ```
177
 
178
+ ///
179
+ """)
180
+ return
181
 
182
 
183
  @app.cell(hide_code=True)
184
+ def _(mo):
185
+ mo.md(r"""
186
+ ## Abstracting applicatives
 
187
 
188
+ We can now provide an initial abstraction definition of applicatives:
189
 
190
+ ```python
191
+ @dataclass
192
+ class Applicative[A](Functor, ABC):
193
+ @classmethod
194
+ @abstractmethod
195
+ def pure(cls, a: A) -> "Applicative[A]":
196
+ raise NotImplementedError("Subclasses must implement pure")
197
+
198
+ @classmethod
199
+ @abstractmethod
200
+ def apply(
201
+ cls, fg: "Applicative[Callable[[A], B]]", fa: "Applicative[A]"
202
+ ) -> "Applicative[B]":
203
+ raise NotImplementedError("Subclasses must implement apply")
204
+
205
+ @classmethod
206
+ def lift(cls, f: Callable, *args: "Applicative") -> "Applicative":
207
+ curr = cls.pure(f)
208
+ if not args:
 
 
 
209
  return curr
210
+ for arg in args:
211
+ curr = cls.apply(curr, arg)
212
+ return curr
213
+ ```
214
 
215
+ /// attention | minimal implementation requirement
216
 
217
+ - `pure`
218
+ - `apply`
219
+ ///
220
+ """)
221
+ return
222
 
223
 
224
  @app.cell(hide_code=True)
225
+ def _(mo):
226
+ mo.md(r"""
227
+ # Instances, laws and utility functions
228
+ """)
229
+ return
230
 
231
 
232
  @app.cell(hide_code=True)
233
+ def _(mo):
234
+ mo.md(r"""
235
+ ## Applicative instances
 
236
 
237
+ When we are actually implementing an *Applicative* instance, we can keep in mind that `pure` and `apply` fundamentally:
238
 
239
+ - embed an object (value or function) into the computational context
240
+ - apply a function inside the computational context to a value inside the computational context
241
+ """)
242
+ return
243
 
244
 
245
  @app.cell(hide_code=True)
246
+ def _(mo):
247
+ mo.md(r"""
248
+ ### The Wrapper Applicative
 
249
 
250
+ - `pure` should simply *wrap* an object, in the sense that:
251
 
252
+ ```haskell
253
+ Wrapper.pure(1) => Wrapper(value=1)
254
+ ```
255
 
256
+ - `apply` should apply a *wrapped* function to a *wrapped* value
257
 
258
+ The implementation is:
259
+ """)
260
+ return
261
 
262
 
263
  @app.cell
264
+ def _(A, Applicative, dataclass):
265
  @dataclass
266
  class Wrapper[A](Applicative):
267
  value: A
 
279
 
280
 
281
  @app.cell(hide_code=True)
282
+ def _(mo):
283
+ mo.md(r"""
284
+ > try with Wrapper below
285
+ """)
286
+ return
287
 
288
 
289
  @app.cell
290
+ def _(Wrapper):
291
  Wrapper.lift(
292
  lambda a: lambda b: lambda c: a + b * c,
293
  Wrapper(1),
294
  Wrapper(2),
295
  Wrapper(3),
296
  )
297
+ return
298
 
299
 
300
  @app.cell(hide_code=True)
301
+ def _(mo):
302
+ mo.md(r"""
303
+ ### The List Applicative
 
304
 
305
+ - `pure` should wrap the object in a list, in the sense that:
306
 
307
+ ```haskell
308
+ List.pure(1) => List(value=[1])
309
+ ```
310
 
311
+ - `apply` should apply a list of functions to a list of values
312
+ - you can think of this as a cartesian product: concatenate the results of applying every function to every value
313
 
314
+ The implementation is:
315
+ """)
316
+ return
317
 
318
 
319
  @app.cell
320
+ def _(A, Applicative, dataclass, product):
321
  @dataclass
322
  class List[A](Applicative):
323
  value: list[A]
 
333
 
334
 
335
  @app.cell(hide_code=True)
336
+ def _(mo):
337
+ mo.md(r"""
338
+ > try with List below
339
+ """)
340
+ return
341
 
342
 
343
  @app.cell
344
+ def _(List):
345
  List.apply(
346
  List([lambda a: a + 1, lambda a: a * 2]),
347
  List([1, 2]),
348
  )
349
+ return
350
 
351
 
352
  @app.cell
353
+ def _(List):
354
  List.lift(lambda a: lambda b: a + b, List([1, 2]), List([3, 4, 5]))
355
+ return
356
 
357
 
358
  @app.cell(hide_code=True)
359
+ def _(mo):
360
+ mo.md(r"""
361
+ ### The Maybe Applicative
 
362
 
363
+ - `pure` should wrap the object in a Maybe, in the sense that:
364
 
365
+ ```haskell
366
+ Maybe.pure(1) => "Just 1"
367
+ Maybe.pure(None) => "Nothing"
368
+ ```
369
 
370
+ - `apply` should apply a function that may not exist to a value that may not exist
371
+ - if the function is `None` or the value is `None`, simply return `None`
372
+ - else apply the function to the value and wrap the result in `Just`
373
 
374
+ The implementation is:
375
+ """)
376
+ return
377
 
378
 
379
  @app.cell
380
+ def _(A, Applicative, dataclass):
381
  @dataclass
382
  class Maybe[A](Applicative):
383
  value: None | A
 
401
 
402
 
403
  @app.cell(hide_code=True)
404
+ def _(mo):
405
+ mo.md(r"""
406
+ > try with Maybe below
407
+ """)
408
+ return
409
 
410
 
411
  @app.cell
412
+ def _(Maybe):
413
  Maybe.lift(
414
  lambda a: lambda b: a + b,
415
  Maybe(1),
416
  Maybe(2),
417
  )
418
+ return
419
 
420
 
421
  @app.cell
422
+ def _(Maybe):
423
  Maybe.lift(
424
  lambda a: lambda b: None,
425
  Maybe(1),
426
  Maybe(2),
427
  )
428
+ return
429
 
430
 
431
  @app.cell(hide_code=True)
432
+ def _(mo):
433
+ mo.md(r"""
434
+ ### The Either Applicative
 
435
 
436
+ - `pure` should wrap the object in `Right`, in the sense that:
437
 
438
+ ```haskell
439
+ Either.pure(1) => Right(1)
440
+ ```
441
 
442
+ - `apply` should apply a function that is either a `Left` or a `Right` to a value that is either a `Left` or a `Right`
443
+ - if the function is a `Left`, simply return that `Left`
444
+ - else `fmap` the function inside the `Right` over the value
445
 
446
+ The implementation is:
447
+ """)
448
+ return
449
 
450
 
451
  @app.cell
452
+ def _(A, Applicative, B, Callable, Union, dataclass):
453
  @dataclass
454
  class Either[A](Applicative):
455
  left: A = None
 
492
 
493
 
494
  @app.cell(hide_code=True)
495
+ def _(mo):
496
+ mo.md(r"""
497
+ > try with `Either` below
498
+ """)
499
+ return
500
 
501
 
502
  @app.cell
503
+ def _(Either):
504
  Either.apply(Either(left=TypeError("Parse Error")), Either(right=2))
505
+ return
506
 
507
 
508
  @app.cell
509
+ def _(Either):
510
  Either.apply(
511
  Either(right=lambda x: x + 1), Either(left=TypeError("Parse Error"))
512
  )
513
+ return
514
 
515
 
516
  @app.cell
517
+ def _(Either):
518
  Either.apply(Either(right=lambda x: x + 1), Either(right=1))
519
+ return
520
 
521
 
522
  @app.cell(hide_code=True)
523
+ def _(mo):
524
+ mo.md(r"""
525
+ ## Collect the list of response with sequenceL
 
526
 
527
+ One often wants to execute a list of commands and collect the list of their responses; we can define a function `sequenceL` for this.
528
 
529
+ /// admonition
530
+ In a further notebook about `Traversable`, we will have a more generic `sequence` that executes a **sequence** of commands and collects the **sequence** of their responses, not limited to `list`.
531
+ ///
532
 
533
+ ```python
534
+ @classmethod
535
+ def sequenceL(cls, fas: list["Applicative[A]"]) -> "Applicative[list[A]]":
536
+ if not fas:
537
+ return cls.pure([])
538
 
539
+ return cls.apply(
540
+ cls.fmap(lambda v: lambda vs: [v] + vs, fas[0]),
541
+ cls.sequenceL(fas[1:]),
542
+ )
543
+ ```
544
 
545
+ Let's try `sequenceL` with the instances.
546
+ """)
547
+ return
548
 
549
 
550
  @app.cell
551
+ def _(Wrapper):
552
  Wrapper.sequenceL([Wrapper(1), Wrapper(2), Wrapper(3)])
553
+ return
554
 
555
 
556
  @app.cell(hide_code=True)
557
+ def _(mo):
558
+ mo.md(r"""
559
+ /// attention
560
+ For the `Maybe` Applicative, the presence of any `Nothing` causes the entire computation to return `Nothing`.
561
+ ///
562
+ """)
563
+ return
 
564
 
565
 
566
  @app.cell
567
+ def _(Maybe):
568
  Maybe.sequenceL([Maybe(1), Maybe(2), Maybe(None), Maybe(3)])
569
+ return
570
 
571
 
572
  @app.cell(hide_code=True)
573
+ def _(mo):
574
+ mo.md(r"""
575
+ The result of `sequenceL` for `List Applicative` is the Cartesian product of the input lists, yielding all possible ordered combinations of elements from each list.
576
+ """)
577
+ return
578
 
579
 
580
  @app.cell
581
+ def _(List):
582
  List.sequenceL([List([1, 2]), List([3]), List([5, 6, 7])])
583
+ return
584
 
585
 
586
  @app.cell(hide_code=True)
587
+ def _(mo):
588
+ mo.md(r"""
589
+ ## Applicative laws
590
+
591
+ /// admonition | id and compose
592
+
593
+ Remember that
594
+
595
+ - `id = lambda x: x`
596
+ - `compose = lambda f: lambda g: lambda x: f(g(x))`
597
+
598
+ ///
599
+
600
+ Traditionally, there are four laws that `Applicative` instances should satisfy. In some sense, they are all concerned with making sure that `pure` deserves its name:
601
+
602
+ - The identity law:
603
+ ```python
604
+ # fa: Applicative[A]
605
+ apply(pure(id), fa) = fa
606
+ ```
607
+ - Homomorphism:
608
+ ```python
609
+ # a: A
610
+ # g: Callable[[A], B]
611
+ apply(pure(g), pure(a)) = pure(g(a))
612
+ ```
613
+ Intuitively, applying a non-effectful function to a non-effectful argument in an effectful context is the same as just applying the function to the argument and then injecting the result into the context with `pure`.
614
+ - Interchange:
615
+ ```python
616
+ # a: A
617
+ # fg: Applicative[Callable[[A], B]]
618
+ apply(fg, pure(a)) = apply(pure(lambda g: g(a)), fg)
619
+ ```
620
+ Intuitively, this says that when evaluating the application of an effectful function to a pure argument, the order in which we evaluate the function and its argument doesn't matter.
621
+ - Composition:
622
+ ```python
623
+ # fg: Applicative[Callable[[B], C]]
624
+ # fh: Applicative[Callable[[A], B]]
625
+ # fa: Applicative[A]
626
+ apply(fg, apply(fh, fa)) = lift(compose, fg, fh, fa)
627
+ ```
628
+ This one is the trickiest law to gain intuition for. In some sense it is expressing a sort of associativity property of `apply`.
629
+
630
+ We can add 4 helper functions to `Applicative` to check whether an instance respects the laws or not:
631
+
632
+ ```python
633
+ @dataclass
634
+ class Applicative[A](Functor, ABC):
635
 
636
+ @classmethod
637
+ def check_identity(cls, fa: "Applicative[A]"):
638
+ if cls.lift(id, fa) != fa:
639
+ raise ValueError("Instance violates identity law")
640
+ return True
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
641
 
642
+ @classmethod
643
+ def check_homomorphism(cls, a: A, f: Callable[[A], B]):
644
+ if cls.lift(f, cls.pure(a)) != cls.pure(f(a)):
645
+ raise ValueError("Instance violates homomorphism law")
646
+ return True
647
+
648
+ @classmethod
649
+ def check_interchange(cls, a: A, fg: "Applicative[Callable[[A], B]]"):
650
+ if cls.apply(fg, cls.pure(a)) != cls.lift(lambda g: g(a), fg):
651
+ raise ValueError("Instance violates interchange law")
652
+ return True
653
+
654
+ @classmethod
655
+ def check_composition(
656
+ cls,
657
+ fg: "Applicative[Callable[[B], C]]",
658
+ fh: "Applicative[Callable[[A], B]]",
659
+ fa: "Applicative[A]",
660
+ ):
661
+ if cls.apply(fg, cls.apply(fh, fa)) != cls.lift(compose, fg, fh, fa):
662
+ raise ValueError("Instance violates composition law")
663
+ return True
664
+ ```
665
+
666
+ > Try to validate the applicative laws below
667
+ """)
668
+ return
669
 
670
 
671
  @app.cell
 
677
 
678
 
679
  @app.cell
680
+ def _(List, Wrapper):
681
  print("Checking Wrapper")
682
  print(Wrapper.check_identity(Wrapper.pure(1)))
683
  print(Wrapper.check_homomorphism(1, lambda x: x + 1))
 
699
  List.pure(lambda x: x * 2), List.pure(lambda x: x + 0.1), List.pure(1)
700
  )
701
  )
702
+ return
703
 
704
 
705
  @app.cell(hide_code=True)
706
+ def _(mo):
707
+ mo.md(r"""
708
+ ## Utility functions
 
709
 
710
+ /// attention | using `fmap`
711
+ `fmap` is defined automatically using `pure` and `apply`, so you can use `fmap` with any `Applicative`
712
+ ///
713
 
714
+ ```python
715
+ @dataclass
716
+ class Applicative[A](Functor, ABC):
717
+ @classmethod
718
+ def skip(
719
+ cls, fa: "Applicative[A]", fb: "Applicative[B]"
720
+ ) -> "Applicative[B]":
721
+ '''
722
+ Sequences the effects of two Applicative computations,
723
+ but discards the result of the first.
724
+ '''
725
+ return cls.apply(cls.const(fa, id), fb)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
726
 
727
+ @classmethod
728
+ def keep(
729
+ cls, fa: "Applicative[A]", fb: "Applicative[B]"
730
+ ) -> "Applicative[B]":
731
+ '''
732
+ Sequences the effects of two Applicative computations,
733
+ but discards the result of the second.
734
+ '''
735
+ return cls.lift(const, fa, fb)
736
+
737
+ @classmethod
738
+ def revapp(
739
+ cls, fa: "Applicative[A]", fg: "Applicative[Callable[[A], B]]"
740
+ ) -> "Applicative[B]":
741
+ '''
742
+ The first computation produces values which are provided
743
+ as input to the function(s) produced by the second computation.
744
+ '''
745
+ return cls.lift(lambda a: lambda f: f(a), fa, fg)
746
+ ```
747
+
748
+ - `skip` sequences the effects of two Applicative computations, but **discards the result of the first**. For example, if `m1` and `m2` are instances of type `Maybe[Int]`, then `Maybe.skip(m1, m2)` is `Nothing` whenever either `m1` or `m2` is `Nothing`; but if not, it will have the same value as `m2`.
749
+ - Likewise, `keep` sequences the effects of two computations, but **keeps only the result of the first**.
750
+ - `revapp` is similar to `apply`, but where the first computation produces value(s) which are provided as input to the function(s) produced by the second computation.
751
+ """)
752
+ return
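To make these semantics concrete, here is a minimal standalone sketch on plain `Optional` values; the free functions `maybe_skip`, `maybe_keep`, and `maybe_revapp` are illustrative stand-ins, not part of the notebook's `Applicative` API:

```python
from typing import Callable, Optional, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def maybe_skip(fa: Optional[A], fb: Optional[B]) -> Optional[B]:
    # Both "effects" run; a None on either side fails the whole
    # computation, otherwise the second result is kept.
    return None if fa is None or fb is None else fb


def maybe_keep(fa: Optional[A], fb: Optional[B]) -> Optional[A]:
    # Same sequencing, but the first result is kept.
    return None if fa is None or fb is None else fa


def maybe_revapp(fa: Optional[A], fg: Optional[Callable[[A], B]]) -> Optional[B]:
    # Feed the value produced by the first computation to the
    # function produced by the second.
    return None if fa is None or fg is None else fg(fa)


print(maybe_skip(1, 2))                 # 2
print(maybe_keep(1, 2))                 # 1
print(maybe_skip(None, 2))              # None
print(maybe_revapp(3, lambda x: x * 2)) # 6
```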
753
 
754
 
755
  @app.cell(hide_code=True)
756
+ def _(mo):
757
+ mo.md(r"""
758
+ /// admonition | Exercise
759
+ Try to use utility functions with different instances
760
+ ///
761
+ """)
762
+ return
 
763
 
764
 
765
  @app.cell(hide_code=True)
766
+ def _(mo):
767
+ mo.md(r"""
768
+ # Formal implementation of Applicative
 
769
 
770
+ Now, we can give the formal implementation of `Applicative`
771
+ """)
772
+ return
773
 
774
 
775
  @app.cell
 
900
 
901
 
902
  @app.cell(hide_code=True)
903
+ def _(mo):
904
+ mo.md(r"""
905
+ # Effectful programming
 
906
 
907
+ Our original motivation for applicatives was the desire to generalise the idea of mapping to functions with multiple arguments. This is a valid interpretation of the concept of applicatives, but from the three instances we have seen it becomes clear that there is also another, more abstract view.
908
 
909
+ The arguments are no longer just plain values but may also have effects, such as the possibility of failure, having many ways to succeed, or performing input/output actions. In this manner, applicative functors can also be viewed as abstracting the idea of **applying pure functions to effectful arguments**, with the precise form of effects that are permitted depending on the nature of the underlying functor.
910
+ """)
911
+ return
912
 
913
 
914
  @app.cell(hide_code=True)
915
+ def _(mo):
916
+ mo.md(r"""
917
+ ## The IO Applicative
 
918
 
919
+ We will try to define an `IO` applicative here.
920
 
921
+ As before, we first abstract how `pure` and `apply` should function.
922
 
923
+ - `pure` should wrap the object in an IO action, and make the object *callable* if it isn't already, because we want to perform the action later:
924
 
925
+ ```haskell
926
+ IO.pure(1) => IO(effect=lambda: 1)
927
+ IO.pure(f) => IO(effect=f)
928
+ ```
929
 
930
+ - `apply` should perform an action that produces a value, then apply the function to the value
931
 
932
+ The implementation is:
933
+ """)
934
+ return
935
 
936
 
937
  @app.cell
 
954
 
955
 
956
  @app.cell(hide_code=True)
957
+ def _(mo):
958
+ mo.md(r"""
959
+ For example, a function that reads a given number of lines from the keyboard can be defined in applicative style as follows:
960
+ """)
961
+ return
962
 
963
 
964
  @app.cell
 
967
  return IO.sequenceL([
968
  IO.pure(input(f"input the {i}th str")) for i in range(1, n + 1)
969
  ])
970
+ return
971
 
972
 
973
  @app.cell
974
+ def _():
975
  # get_chars()()
976
  return
977
 
978
 
979
  @app.cell(hide_code=True)
980
+ def _(mo):
981
+ mo.md(r"""
982
+ # From the perspective of category theory
983
+ """)
984
+ return
985
 
986
 
987
  @app.cell(hide_code=True)
988
+ def _(mo):
989
+ mo.md(r"""
990
+ ## Lax Monoidal Functor
 
991
 
992
+ An alternative, equivalent formulation of `Applicative` is given by
993
+ """)
994
+ return
995
 
996
 
997
  @app.cell
 
1013
 
1014
 
1015
  @app.cell(hide_code=True)
1016
+ def _(mo):
1017
+ mo.md(r"""
1018
+ Intuitively, this states that a *monoidal functor* is one which has some sort of "default shape" and which supports some sort of "combining" operation.
 
1019
 
1020
+ - `unit` provides the identity element
1021
+ - `tensor` combines two contexts into a product context
1022
 
1023
+ More technically, the idea is that a *monoidal functor* preserves the "monoidal structure" given by the pairing constructor `(,)` and unit type `()`.
1024
+ """)
1025
+ return
1026
 
1027
 
1028
  @app.cell(hide_code=True)
1029
+ def _(mo):
1030
+ mo.md(r"""
1031
+ Furthermore, to deserve the name "monoidal", instances of Monoidal ought to satisfy the following laws, which seem much more straightforward than the traditional Applicative laws:
 
1032
 
1033
+ - Left identity
1034
 
1035
+ `tensor(unit, v) ≅ v`
1036
 
1037
+ - Right identity
1038
 
1039
+ `tensor(u, unit) ≅ u`
1040
 
1041
+ - Associativity
1042
 
1043
+ `tensor(u, tensor(v, w)) ≅ tensor(tensor(u, v), w)`
1044
+ """)
1045
+ return
1046
 
1047
 
1048
  @app.cell(hide_code=True)
1049
+ def _(mo):
1050
+ mo.md(r"""
1051
+ /// admonition | ≅ indicates isomorphism
 
1052
 
1053
+ `≅` refers to *isomorphism* rather than equality.
1054
 
1055
+ In particular we consider `(x, ()) ≅ x ≅ ((), x)` and `((x, y), z) ≅ (x, (y, z))`
1056
 
1057
+ ///
1058
+ """)
1059
+ return
1060
 
1061
 
1062
  @app.cell(hide_code=True)
1063
+ def _(mo):
1064
+ mo.md(r"""
1065
+ ## Mutual definability of Monoidal and Applicative
1066
+
1067
+ We can implement `pure` and `apply` in terms of `unit` and `tensor`, and vice versa.
1068
+
1069
+ ```python
1070
+ pure(a) = fmap(lambda _: a, unit())
1071
+ apply(fg, fa) = fmap(lambda pair: pair[0](pair[1]), tensor(fg, fa))
1072
+ ```
1073
+
1074
+ ```python
1075
+ unit() = pure(())
1076
+ tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)
1077
+ ```
1078
+ """)
1079
+ return
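To see the translation in action, here is a minimal standalone sketch using plain Python lists as the functor; the free functions `unit`, `tensor`, `fmap`, `pure`, and `apply` are illustrative stand-ins for the notebook's class methods:

```python
from itertools import product


# Monoidal side: a list models one nondeterministic computation.
def unit():
    return [()]


def tensor(fa, fb):
    # Combine two contexts into a product context (Cartesian product).
    return [(a, b) for a, b in product(fa, fb)]


def fmap(g, fa):
    return [g(a) for a in fa]


# Applicative side, recovered from the monoidal operations.
def pure(a):
    return fmap(lambda _: a, unit())


def apply(fg, fa):
    return fmap(lambda pair: pair[0](pair[1]), tensor(fg, fa))


print(pure(1))                                              # [1]
print(apply([lambda x: x + 1, lambda x: x * 2], [10, 20]))  # [11, 21, 20, 40]
```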
 
1080
 
1081
 
1082
  @app.cell(hide_code=True)
1083
+ def _(mo):
1084
+ mo.md(r"""
1085
+ ## Instance: ListMonoidal
 
1086
 
1087
+ - `unit` should simply return an empty tuple wrapped in a list
1088
 
1089
+ ```haskell
1090
+ ListMonoidal.unit() => [()]
1091
+ ```
1092
 
1093
+ - `tensor` should return the *Cartesian product* of the items of two `ListMonoidal` instances
1094
 
1095
+ The implementation is:
1096
+ """)
1097
+ return
1098
 
1099
 
1100
  @app.cell
1101
+ def _(A, B, Callable, Monoidal, dataclass, product):
1102
  @dataclass
1103
  class ListMonoidal[A](Monoidal):
1104
  items: list[A]
 
1122
 
1123
 
1124
  @app.cell(hide_code=True)
1125
+ def _(mo):
1126
+ mo.md(r"""
1127
+ > try with `ListMonoidal` below
1128
+ """)
1129
+ return
1130
 
1131
 
1132
  @app.cell
 
1138
 
1139
 
1140
  @app.cell(hide_code=True)
1141
+ def _(mo):
1142
+ mo.md(r"""
1143
+ and we can prove that `tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)`:
1144
+ """)
1145
+ return
1146
 
1147
 
1148
  @app.cell
1149
+ def _(List, xs, ys):
1150
  List.lift(lambda fa: lambda fb: (fa, fb), List(xs.items), List(ys.items))
1151
+ return
1152
 
1153
 
1154
  @app.cell(hide_code=True)
 
1197
  A = TypeVar("A")
1198
  B = TypeVar("B")
1199
  C = TypeVar("C")
1200
+ return A, B
1201
 
1202
 
1203
  @app.cell(hide_code=True)
1204
+ def _(mo):
1205
+ mo.md(r"""
1206
+ # From Applicative to Alternative
 
 
 
 
 
 
 
1207
 
1208
+ ## Abstracting Alternative
1209
 
1210
+ In our studies so far, we saw that both `Maybe` and `List` can represent computations with a varying number of results.
 
 
1211
 
1212
+ We use `Maybe` to indicate a computation can fail somehow and `List` for computations that can have many possible results. In both of these cases, one useful operation is amalgamating all possible results from multiple computations into a single computation.
1213
 
1214
+ `Alternative` formalizes computations that support:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1215
 
1216
+ - **Failure** (empty result)
1217
+ - **Choice** (combination of results)
1218
+ - **Repetition** (multiple results)
1219
 
1220
+ It extends `Applicative` with monoidal structure, where:
 
 
 
 
 
 
 
 
 
1221
 
1222
+ ```python
1223
+ @dataclass
1224
+ class Alternative[A](Applicative, ABC):
1225
+ @classmethod
1226
+ @abstractmethod
1227
+ def empty(cls) -> "Alternative[A]":
1228
+ '''Identity element for alternative computations'''
1229
 
1230
+ @classmethod
1231
+ @abstractmethod
1232
+ def alt(
1233
+ cls, fa: "Alternative[A]", fb: "Alternative[A]"
1234
+ ) -> "Alternative[A]":
1235
+ '''Binary operation combining computations'''
1236
+ ```
1237
+
1238
+ - `empty` is the identity element (e.g., `Maybe(None)`, `List([])`)
1239
+ - `alt` is a combination operator (e.g., `Maybe` fallback, list concatenation)
1240
+
1241
+ `empty` and `alt` should satisfy the following **laws**:
1242
+
1243
+ ```python
1244
+ # Left identity
1245
+ alt(empty, fa) == fa
1246
+ # Right identity
1247
+ alt(fa, empty) == fa
1248
+ # Associativity
1249
+ alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc)
1250
+ ```
1251
+
1252
+ /// admonition
1253
+ Actually, `Alternative` is a *monoid* on `Applicative Functors`. We will talk about *monoid* and review these laws in the next notebook about `Monads`.
1254
+ ///
1255
+
1256
+ /// attention | minimal implementation requirement
1257
+ - `empty`
1258
+ - `alt`
1259
+ ///
1260
+ """)
1261
+ return
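As a quick illustration of these laws, here is a minimal sketch on plain `Optional` values; the free functions `empty` and `alt` are illustrative stand-ins for the class-based instances:

```python
# Illustrative stand-ins for Alternative's empty/alt on plain Optional values.
def empty():
    return None


def alt(fa, fb):
    # First success wins; otherwise fall back to the second computation.
    return fa if fa is not None else fb


fa, fb, fc = 1, 2, None
assert alt(empty(), fa) == fa                        # left identity
assert alt(fa, empty()) == fa                        # right identity
assert alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc)  # associativity
print("laws hold")
```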
1262
 
1263
 
1264
  @app.cell(hide_code=True)
1265
+ def _(mo):
1266
+ mo.md(r"""
1267
+ ## Instances of Alternative
 
1268
 
1269
+ ### The Maybe Alternative
1270
 
1271
+ - `empty`: the identity element of `Maybe` is `Maybe(None)`
1272
+ - `alt`: return the first element if it's not `None`, else return the second element
1273
+ """)
1274
+ return
1275
 
1276
 
1277
  @app.cell
 
1294
 
1295
 
1296
  @app.cell
1297
+ def _(AltMaybe):
1298
  print(AltMaybe.empty())
1299
  print(AltMaybe.alt(AltMaybe(None), AltMaybe(1)))
1300
  print(AltMaybe.alt(AltMaybe(None), AltMaybe(None)))
1301
  print(AltMaybe.alt(AltMaybe(1), AltMaybe(None)))
1302
  print(AltMaybe.alt(AltMaybe(1), AltMaybe(2)))
1303
+ return
1304
 
1305
 
1306
  @app.cell
1307
+ def _(AltMaybe):
1308
  print(AltMaybe.check_left_identity(AltMaybe(1)))
1309
  print(AltMaybe.check_right_identity(AltMaybe(1)))
1310
  print(AltMaybe.check_associativity(AltMaybe(1), AltMaybe(2), AltMaybe(None)))
1311
+ return
1312
 
1313
 
1314
  @app.cell(hide_code=True)
1315
+ def _(mo):
1316
+ mo.md(r"""
1317
+ ### The List Alternative
1318
+
1319
+ - `empty`: the identity element of `List` is `List([])`
1320
+ - `alt`: return the concatenation of 2 input lists
1321
+ """)
1322
+ return
 
1323
 
1324
 
1325
  @app.cell
 
1337
 
1338
 
1339
  @app.cell
1340
+ def _(AltList):
1341
  print(AltList.empty())
1342
  print(AltList.alt(AltList([1, 2, 3]), AltList([4, 5])))
1343
+ return
1344
 
1345
 
1346
  @app.cell
1347
+ def _(AltList):
1348
  AltList([1])
1349
+ return
1350
 
1351
 
1352
  @app.cell
1353
+ def _(AltList):
1354
  AltList([1])
1355
+ return
1356
 
1357
 
1358
  @app.cell
1359
+ def _(AltList):
1360
  print(AltList.check_left_identity(AltList([1, 2, 3])))
1361
  print(AltList.check_right_identity(AltList([1, 2, 3])))
1362
  print(
 
1364
  AltList([1, 2]), AltList([3, 4, 5]), AltList([6])
1365
  )
1366
  )
1367
+ return
1368
 
1369
 
1370
  @app.cell(hide_code=True)
1371
+ def _(mo):
1372
+ mo.md(r"""
1373
+ ## some and many
 
1374
 
1375
 
1376
+ /// admonition | This section mainly refers to
1377
 
1378
+ - https://stackoverflow.com/questions/7671009/some-and-many-functions-from-the-alternative-type-class/7681283#7681283
1379
 
1380
+ ///
1381
 
1382
+ First let's have a look at the implementation of `some` and `many`:
1383
 
1384
+ ```python
1385
+ @classmethod
1386
+ def some(cls, fa: "Alternative[A]") -> "Alternative[list[A]]":
1387
+ # Short-circuit if input is empty
1388
+ if fa == cls.empty():
1389
+ return cls.empty()
1390
 
1391
+ return cls.apply(
1392
+ cls.fmap(lambda a: lambda b: [a] + b, fa), cls.many(fa)
1393
+ )
1394
 
1395
+ @classmethod
1396
+ def many(cls, fa: "Alternative[A]") -> "Alternative[list[A]]":
1397
+ # Directly return empty list if input is empty
1398
+ if fa == cls.empty():
1399
+ return cls.pure([])
1400
 
1401
+ return cls.alt(cls.some(fa), cls.pure([]))
1402
+ ```
1403
 
1404
+ So `some f` runs `f` once, then *many* times, and conses the results. `many f` runs `f` *some* times, or *alternatively* just returns the empty list.
1405
 
1406
+ The idea is that they both run `f` as often as possible until it **fails**, collecting the results in a list. The difference is that `some f` immediately fails if `f` fails, while `many f` will still succeed and *return* the empty list in such a case. What exactly this means depends on how `alt` is defined.
1407
 
1408
+ Let's see what it does for the instances `AltMaybe` and `AltList`.
1409
+ """)
1410
+ return
1411
 
1412
 
1413
  @app.cell(hide_code=True)
1414
+ def _(mo):
1415
+ mo.md(r"""
1416
+ For `AltMaybe`, `None` means failure, so `some None` fails as well and evaluates to `None`, while `many None` succeeds and evaluates to `Just []`. Neither `some (Just ())` nor `many (Just ())` ever returns, because `Just ()` never fails.
1417
+ """)
1418
+ return
1419
 
1420
 
1421
  @app.cell
1422
+ def _(AltMaybe):
1423
  print(AltMaybe.some(AltMaybe.empty()))
1424
  print(AltMaybe.many(AltMaybe.empty()))
1425
+ return
1426
 
1427
 
1428
  @app.cell(hide_code=True)
1429
+ def _(mo):
1430
+ mo.md(r"""
1431
+ For `AltList`, `[]` means failure, so `some []` evaluates to `[]` (no answers) while `many []` evaluates to `[[]]` (there's one answer and it is the empty list). Again `some [()]` and `many [()]` don't return.
1432
+ """)
1433
+ return
1434
 
1435
 
1436
  @app.cell
1437
+ def _(AltList):
1438
  print(AltList.some(AltList.empty()))
1439
  print(AltList.many(AltList.empty()))
1440
+ return
1441
 
1442
 
1443
  @app.cell(hide_code=True)
1444
+ def _(mo):
1445
+ mo.md(r"""
1446
+ ## Formal implementation of Alternative
1447
+ """)
1448
+ return
1449
 
1450
 
1451
  @app.cell
 
1503
 
1504
 
1505
  @app.cell(hide_code=True)
1506
+ def _(mo):
1507
+ mo.md(r"""
1508
+ /// admonition
 
1509
 
1510
+ We will explore more about `Alternative` in a future notebook about [Monadic Parsing](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/monadic-parsing-in-haskell/E557DFCCE00E0D4B6ED02F3FB0466093)
1511
 
1512
+ ///
1513
+ """)
1514
+ return
1515
 
1516
 
1517
  @app.cell(hide_code=True)
1518
+ def _(mo):
1519
+ mo.md(r"""
1520
+ # Further reading
1521
+
1522
+ Note that these readings are optional and non-trivial.
1523
+
1524
+ - [Applicative Programming with Effects](https://www.staff.city.ac.uk/~ross/papers/Applicative.html)
1525
+ - [Equivalence of Applicative Functors and
1526
+ Multifunctors](https://arxiv.org/pdf/2401.14286)
1527
+ - [Applicative functor](https://wiki.haskell.org/index.php?title=Applicative_functor)
1528
+ - [Control.Applicative](https://hackage.haskell.org/package/base-4.21.0.0/docs/Control-Applicative.html#t:Applicative)
1529
+ - [Typeclassopedia#Applicative](https://wiki.haskell.org/index.php?title=Typeclassopedia#Applicative)
1530
+ - [Notions of computation as monoids](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/notions-of-computation-as-monoids/70019FC0F2384270E9F41B9719042528)
1531
+ - [Free Applicative Functors](https://arxiv.org/abs/1403.0749)
1532
+ - [The basics of applicative functors, put to practical work](http://www.serpentine.com/blog/2008/02/06/the-basics-of-applicative-functors-put-to-practical-work/)
1533
+ - [Abstracting with Applicatives](http://comonad.com/reader/2012/abstracting-with-applicatives/)
1534
+ - [Static analysis with Applicatives](https://gergo.erdi.hu/blog/2012-12-01-static_analysis_with_applicatives/)
1535
+ - [Explaining Applicative functor in categorical terms - monoidal functors](https://cstheory.stackexchange.com/questions/12412/explaining-applicative-functor-in-categorical-terms-monoidal-functors)
1536
+ - [Applicative, A Strong Lax Monoidal Functor](https://beuke.org/applicative/)
1537
+ - [Applicative Functors](https://bartoszmilewski.com/2017/02/06/applicative-functors/)
1538
+ """)
1539
+ return
 
1540
 
1541
 
1542
  if __name__ == "__main__":
functional_programming/CHANGELOG.md CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # Changelog of the functional-programming course
2
 
3
  ## 2025-04-16
@@ -121,4 +126,4 @@ for reviewing
121
 
122
  **functors.py**
123
 
124
- - Demo version of notebook `05_functors.py`
 
1
+ ---
2
+ title: Changelog
3
+ marimo-version: 0.18.4
4
+ ---
5
+
6
  # Changelog of the functional-programming course
7
 
8
  ## 2025-04-16
 
126
 
127
  **functors.py**
128
 
129
+ - Demo version of notebook `05_functors.py`
functional_programming/README.md CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # Learn Functional Programming
2
 
3
  _🚧 This collection is a [work in progress](https://github.com/marimo-team/learn/issues/51)._
@@ -24,13 +29,13 @@ Topics include:
24
 
25
  To run a notebook locally, use
26
 
27
- ```bash
28
- uvx marimo edit <URL>
29
  ```
30
 
31
  For example, run the `Functor` tutorial with
32
 
33
- ```bash
34
  uvx marimo edit https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py
35
  ```
36
 
@@ -52,11 +57,11 @@ on Discord (@eugene.hs).
52
  ## Description of notebooks
53
 
54
  Check [here](https://github.com/marimo-team/learn/issues/51) for current series
55
- structure.
56
 
57
  | Notebook | Title | Key Concepts | Prerequisites |
58
- |----------|-------|--------------|---------------|
59
- | [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions |
60
  | [06. Applicatives](https://github.com/marimo-team/learn/blob/main/functional_programming/06_applicatives.py) | Applicative programming with effects | Applicative Functor, pure, apply, Effectful programming, Alternative | Functors |
61
 
62
  **Authors.**
@@ -69,4 +74,4 @@ Thanks to all our notebook authors!
69
 
70
  Thanks to all our notebook reviewers!
71
 
72
- - [Haleshot](https://github.com/Haleshot)
 
1
+ ---
2
+ title: Readme
3
+ marimo-version: 0.18.4
4
+ ---
5
+
6
  # Learn Functional Programming
7
 
8
  _🚧 This collection is a [work in progress](https://github.com/marimo-team/learn/issues/51)._
 
29
 
30
  To run a notebook locally, use
31
 
32
+ ```bash
33
+ uvx marimo edit <URL>
34
  ```
35
 
36
  For example, run the `Functor` tutorial with
37
 
38
+ ```bash
39
  uvx marimo edit https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py
40
  ```
41
 
 
57
  ## Description of notebooks
58
 
59
  Check [here](https://github.com/marimo-team/learn/issues/51) for current series
60
+ structure.
61
 
62
  | Notebook | Title | Key Concepts | Prerequisites |
63
+ |----------|-------|--------------|---------------|
64
+ | [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions |
65
  | [06. Applicatives](https://github.com/marimo-team/learn/blob/main/functional_programming/06_applicatives.py) | Applicative programming with effects | Applicative Functor, pure, apply, Effectful programming, Alternative | Functors |
66
 
67
  **Authors.**
 
74
 
75
  Thanks to all our notebook reviews!
76
 
77
+ - [Haleshot](https://github.com/Haleshot)
optimization/01_least_squares.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.0"
13
  app = marimo.App()
14
 
15
 
@@ -21,45 +21,41 @@ def _():
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
- mo.md(
25
- r"""
26
- # Least squares
27
 
28
- In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times
29
- n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. We seek a vector
30
- $x \in \mathcal{R}^{n}$ such that $Ax$ is close to $b$. The matrices $A$ and $b$ are problem data or constants, and $x$ is the variable we are solving for.
31
 
32
- Closeness is defined as the sum of the squared differences:
33
 
34
- \[ \sum_{i=1}^m (a_i^Tx - b_i)^2, \]
35
 
36
- also known as the $\ell_2$-norm squared, $\|Ax - b\|_2^2$.
37
 
38
- For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $a_i^T$ of $A$ is the feature vector for user $i$, while the corresponding entry $b_i$ of $b$ is the measurement we want to predict from $a_i^T$, such as ad spending. The prediction for user $i$ is given by $a_i^Tx$.
39
 
40
- We find the optimal value of $x$ by solving the optimization problem
41
 
42
- \[
43
- \begin{array}{ll}
44
- \text{minimize} & \|Ax - b\|_2^2.
45
- \end{array}
46
- \]
47
 
48
- Let $x^\star$ denote the optimal $x$. The quantity $r = Ax^\star - b$ is known as the residual. If $\|r\|_2 = 0$, we have a perfect fit.
49
- """
50
- )
51
  return
52
 
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
- mo.md(
57
- r"""
58
- ## Example
59
 
60
- In this example, we use the Python library [CVXPY](https://github.com/cvxpy/cvxpy) to construct and solve a least-squares problems.
61
- """
62
- )
63
  return
64
 
65
 
@@ -91,7 +87,7 @@ def _(A, b, cp, n):
91
  objective = cp.sum_squares(A @ x - b)
92
  problem = cp.Problem(cp.Minimize(objective))
93
  optimal_value = problem.solve()
94
- return objective, optimal_value, problem, x
95
 
96
 
97
  @app.cell
@@ -108,14 +104,12 @@ def _(A, b, cp, mo, optimal_value, x):
108
 
109
  @app.cell(hide_code=True)
110
  def _(mo):
111
- mo.md(
112
- r"""
113
- ## Further reading
114
 
115
- For a primer on least squares, with many real-world examples, check out the free book
116
- [Vectors, Matrices, and Least Squares](https://web.stanford.edu/~boyd/vmls/), which is used for undergraduate linear algebra education at Stanford.
117
- """
118
- )
119
  return
120
 
121
 
 
9
 
10
  import marimo
11
 
12
+ __generated_with = "0.18.4"
13
  app = marimo.App()
14
 
15
 
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
+ mo.md(r"""
25
+ # Least squares
 
26
 
27
+ In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times
28
+ n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. We seek a vector
29
+ $x \in \mathcal{R}^{n}$ such that $Ax$ is close to $b$. The matrix $A$ and vector $b$ are problem data or constants, and $x$ is the variable we are solving for.
30
 
31
+ Closeness is defined as the sum of the squared differences:
32
 
33
+ \[ \sum_{i=1}^m (a_i^Tx - b_i)^2, \]
34
 
35
+ also known as the $\ell_2$-norm squared, $\|Ax - b\|_2^2$.
36
 
37
+ For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $a_i^T$ of $A$ is the feature vector for user $i$, while the corresponding entry $b_i$ of $b$ is the measurement we want to predict from $a_i^T$, such as ad spending. The prediction for user $i$ is given by $a_i^Tx$.
38
 
39
+ We find the optimal value of $x$ by solving the optimization problem
40
 
41
+ \[
42
+ \begin{array}{ll}
43
+ \text{minimize} & \|Ax - b\|_2^2.
44
+ \end{array}
45
+ \]
46
 
47
+ Let $x^\star$ denote the optimal $x$. The quantity $r = Ax^\star - b$ is known as the residual. If $\|r\|_2 = 0$, we have a perfect fit.
48
+ """)
 
49
  return
50
 
51
 
52
  @app.cell(hide_code=True)
53
  def _(mo):
54
+ mo.md(r"""
55
+ ## Example
 
56
 
57
+ In this example, we use the Python library [CVXPY](https://github.com/cvxpy/cvxpy) to construct and solve a least-squares problems.
58
+ """)
 
59
  return
60
 
61
 
 
87
  objective = cp.sum_squares(A @ x - b)
88
  problem = cp.Problem(cp.Minimize(objective))
89
  optimal_value = problem.solve()
90
+ return optimal_value, x
91
 
92
 
93
  @app.cell
 
104
 
105
  @app.cell(hide_code=True)
106
  def _(mo):
107
+ mo.md(r"""
108
+ ## Further reading
 
109
 
110
+ For a primer on least squares, with many real-world examples, check out the free book
111
+ [Vectors, Matrices, and Least Squares](https://web.stanford.edu/~boyd/vmls/), which is used for undergraduate linear algebra education at Stanford.
112
+ """)
 
113
  return
114
 
115
 
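The least-squares problem this notebook formulates in CVXPY also has a closed-form solution. As a dependency-light sketch (with made-up data, not the notebook's), the same minimizer and residual can be computed with NumPy's `lstsq`:

```python
import numpy as np

# Synthetic problem data, for illustration only: m measurements, n unknowns.
rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# x_star minimizes ||Ax - b||_2^2; lstsq solves this stably via the SVD.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# The residual r = A x_star - b; its squared norm is the optimal value.
residual = A @ x_star - b
optimal_value = np.sum(residual**2)

# At the optimum, the residual is orthogonal to the columns of A.
print(x_star, optimal_value)
```

The orthogonality check `A.T @ residual ≈ 0` is the normal-equations optimality condition, the same condition CVXPY's solver satisfies numerically.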
optimization/02_linear_program.py CHANGED
@@ -11,7 +11,7 @@

 import marimo

-__generated_with = "0.11.0"
 app = marimo.App()


@@ -23,33 +23,31 @@ def _():

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
-        # Linear program
-
-        A linear program is an optimization problem with a linear objective and affine
-        inequality constraints. A common standard form is the following:
-
-        \[
-        \begin{array}{ll}
-        \text{minimize} & c^Tx \\
-        \text{subject to} & Ax \leq b.
-        \end{array}
-        \]
-
-        Here $A \in \mathcal{R}^{m \times n}$, $b \in \mathcal{R}^m$, and $c \in \mathcal{R}^n$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Ax \leq b$ is elementwise.
-
-        For example, we might have $n$ different products, each constructed out of $m$ components. Each entry $A_{ij}$ is the amount of component $i$ required to build one unit of product $j$. Each entry $b_i$ is the total amount of component $i$ available. We lose $c_j$ for each unit of product $j$ ($c_j < 0$ indicates profit). Our goal then is to choose how many units of each product $j$ to make, $x_j$, in order to minimize loss without exceeding our budget for any component.
-
-        In addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$. A positive entry $\lambda^\star_i$ indicates that the constraint $a_i^Tx \leq b_i$ holds with equality for $x^\star$ and suggests that changing $b_i$ would change the optimal value.
-
-        **Why linear programming?** Linear programming is a way to achieve an optimal outcome, such as maximum utility or lowest cost, subject to a linear objective function and affine constraints. Developed in the 20th century, linear programming is widely used today to solve problems in resource allocation, scheduling, transportation, and more. The discovery of polynomial-time algorithms to solve linear programs was of tremendous worldwide importance and entered the public discourse, even making the front page of the New York Times.
-
-        In the late 20th and early 21st century, researchers generalized linear programming to a much wider class of problems called convex optimization problems. Nearly all convex optimization problems can be solved efficiently and reliably, and even more difficult problems are readily solved by a sequence of convex optimization problems. Today, convex optimization is used to fit machine learning models, land rockets in real-time at SpaceX, plan trajectories for self-driving cars at Waymo, execute many billions of dollars of financial trades a day, and much more.
-
-        This marimo learn course uses CVXPY, a modeling language for convex optimization problems developed originally at Stanford, to construct and solve convex programs.
-        """
-    )
     return


@@ -66,13 +64,11 @@ def _(mo):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
-        ## Example
-
-        Here we use CVXPY to construct and solve a linear program.
-        """
-    )
     return


@@ -119,7 +115,9 @@ def _(np):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""We've randomly generated problem data $A$ and $B$. The vector for $c$ is shown below. Try playing with the value of $c$ by dragging the components, and see how the level curves change in the visualization below.""")
     return


@@ -129,7 +127,7 @@ def _(mo, np):

     c_widget = mo.ui.anywidget(Matrix(matrix=np.array([[0.1, -0.2]]), step=0.01))
     c_widget
-    return Matrix, c_widget


 @app.cell
@@ -149,7 +147,9 @@ def _(A, b, c, cp):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""Below, we plot the feasible region of the problem — the intersection of the inequalities — and the level curves of the objective function. The optimal value $x^\star$ is the point farthest in the feasible region in the direction $-c$.""")
     return


@@ -249,7 +249,7 @@ def _(np):
     ax.set_xlim(np.min(x_vals), np.max(x_vals))
     ax.set_ylim(np.min(y_vals), np.max(y_vals))
     return ax
-    return make_plot, plt


 @app.cell(hide_code=True)
@@ -257,7 +257,7 @@ def _(mo, prob, x):
     mo.md(
         f"""
         The optimal value is {prob.value:.04f}.
-
         A solution $x$ is {mo.as_html(list(x.value))}
         A dual solution is {mo.as_html(list(prob.constraints[0].dual_value))}
         """

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App()


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Linear program
+
+    A linear program is an optimization problem with a linear objective and affine
+    inequality constraints. A common standard form is the following:
+
+    \[
+    \begin{array}{ll}
+    \text{minimize} & c^Tx \\
+    \text{subject to} & Ax \leq b.
+    \end{array}
+    \]
+
+    Here $A \in \mathcal{R}^{m \times n}$, $b \in \mathcal{R}^m$, and $c \in \mathcal{R}^n$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Ax \leq b$ is elementwise.
+
+    For example, we might have $n$ different products, each constructed out of $m$ components. Each entry $A_{ij}$ is the amount of component $i$ required to build one unit of product $j$. Each entry $b_i$ is the total amount of component $i$ available. We lose $c_j$ for each unit of product $j$ ($c_j < 0$ indicates profit). Our goal then is to choose how many units of each product $j$ to make, $x_j$, in order to minimize loss without exceeding our budget for any component.
+
+    In addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$. A positive entry $\lambda^\star_i$ indicates that the constraint $a_i^Tx \leq b_i$ holds with equality for $x^\star$ and suggests that changing $b_i$ would change the optimal value.
+
+    **Why linear programming?** Linear programming is a way to achieve an optimal outcome, such as maximum utility or lowest cost, subject to a linear objective function and affine constraints. Developed in the 20th century, linear programming is widely used today to solve problems in resource allocation, scheduling, transportation, and more. The discovery of polynomial-time algorithms to solve linear programs was of tremendous worldwide importance and entered the public discourse, even making the front page of the New York Times.
+
+    In the late 20th and early 21st century, researchers generalized linear programming to a much wider class of problems called convex optimization problems. Nearly all convex optimization problems can be solved efficiently and reliably, and even more difficult problems are readily solved by a sequence of convex optimization problems. Today, convex optimization is used to fit machine learning models, land rockets in real-time at SpaceX, plan trajectories for self-driving cars at Waymo, execute many billions of dollars of financial trades a day, and much more.
+
+    This marimo learn course uses CVXPY, a modeling language for convex optimization problems developed originally at Stanford, to construct and solve convex programs.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Example
+
+    Here we use CVXPY to construct and solve a linear program.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    We've randomly generated problem data $A$ and $b$. The vector for $c$ is shown below. Try playing with the value of $c$ by dragging the components, and see how the level curves change in the visualization below.
+    """)
     return


     c_widget = mo.ui.anywidget(Matrix(matrix=np.array([[0.1, -0.2]]), step=0.01))
     c_widget
+    return (c_widget,)


 @app.cell


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Below, we plot the feasible region of the problem — the intersection of the inequalities — and the level curves of the objective function. The optimal value $x^\star$ is the point farthest in the feasible region in the direction $-c$.
+    """)
     return


     ax.set_xlim(np.min(x_vals), np.max(x_vals))
     ax.set_ylim(np.min(y_vals), np.max(y_vals))
     return ax
+    return (make_plot,)


 @app.cell(hide_code=True)
     mo.md(
         f"""
         The optimal value is {prob.value:.04f}.
+
         A solution $x$ is {mo.as_html(list(x.value))}
         A dual solution is {mo.as_html(list(prob.constraints[0].dual_value))}
         """
optimization/03_minimum_fuel_optimal_control.py CHANGED
@@ -1,6 +1,6 @@
 import marimo

-__generated_with = "0.11.0"
 app = marimo.App()


@@ -12,46 +12,44 @@ def _():

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
-        # Minimal fuel optimal control
-
-        This notebook includes an application of linear programming to controlling a
-        physical system, adapted from [Convex
-        Optimization](https://web.stanford.edu/~boyd/cvxbook/) by Boyd and Vandenberghe.
-
-        We consider a linear dynamical system with state $x(t) \in \mathbf{R}^n$, for $t = 0, \ldots, T$. At each time step $t = 0, \ldots, T - 1$, an actuator or input signal $u(t)$ is applied, affecting the state. The dynamics
-        of the system is given by the linear recurrence
-
-        \[
-        x(t + 1) = Ax(t) + bu(t), \quad t = 0, \ldots, T - 1,
-        \]
-
-        where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$ are given and encode how the system evolves. The initial state $x(0)$ is also given.
-
-        The _minimum fuel optimal control problem_ is to choose the inputs $u(0), \ldots, u(T - 1)$ so as to achieve
-        a given desired state $x_\text{des} = x(T)$ while minimizing the total fuel consumed
-
-        \[
-        F = \sum_{t=0}^{T - 1} f(u(t)).
-        \]
-
-        The function $f : \mathbf{R} \to \mathbf{R}$ tells us how much fuel is consumed as a function of the input, and is given by
-
-        \[
-        f(a) = \begin{cases}
-        |a| & |a| \leq 1 \\
-        2|a| - 1 & |a| > 1.
-        \end{cases}
-        \]
-
-        This means the fuel use is proportional to the magnitude of the signal between $-1$ and $1$, but for larger signals the marginal fuel efficiency is half.
-
-        **This notebook.** In this notebook we use CVXPY to formulate the minimum fuel optimal control problem as a linear program. The notebook lets you play with the initial and target states, letting you see how they affect the planned trajectory of inputs $u$.
-
-        First, we create the **problem data**.
-        """
-    )
     return


@@ -85,7 +83,7 @@ def _(mo, n, np):
         rf"""

         Choose a value for $x_0$ ...
-
         {x0_widget}
         """
     )
@@ -99,7 +97,7 @@ def _(mo, n, np):
     )

     mo.hstack([_a, _b], justify="space-around")
-    return wigglystuff, x0_widget, xdes_widget


 @app.cell
@@ -111,7 +109,9 @@ def _(x0_widget, xdes_widget):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""**Next, we specify the problem as a linear program using CVXPY.** This problem is linear because the objective and constraints are affine. (In fact, the objective is piecewise affine, but CVXPY rewrites it to be affine for you.)""")
     return


@@ -134,18 +134,16 @@ def _(A, T, b, cp, mo, n, x0, xdes):

     fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve()
     mo.md(f"Achieved a fuel usage of {fuel_used:.02f}. 🚀")
-    return X, constraints, fuel_used, objective, u


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        """
-        Finally, we plot the chosen inputs over time.
-
-        **🌊 Try it!** Change the initial and desired states; how do fuel usage and controls change? Can you explain what you see? You can also try experimenting with the value of $T$.
-        """
-    )
     return


 import marimo

+__generated_with = "0.18.4"
 app = marimo.App()


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Minimal fuel optimal control
+
+    This notebook includes an application of linear programming to controlling a
+    physical system, adapted from [Convex
+    Optimization](https://web.stanford.edu/~boyd/cvxbook/) by Boyd and Vandenberghe.
+
+    We consider a linear dynamical system with state $x(t) \in \mathbf{R}^n$, for $t = 0, \ldots, T$. At each time step $t = 0, \ldots, T - 1$, an actuator or input signal $u(t)$ is applied, affecting the state. The dynamics
+    of the system are given by the linear recurrence
+
+    \[
+    x(t + 1) = Ax(t) + bu(t), \quad t = 0, \ldots, T - 1,
+    \]
+
+    where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$ are given and encode how the system evolves. The initial state $x(0)$ is also given.
+
+    The _minimum fuel optimal control problem_ is to choose the inputs $u(0), \ldots, u(T - 1)$ so as to achieve
+    a given desired state $x_\text{des} = x(T)$ while minimizing the total fuel consumed
+
+    \[
+    F = \sum_{t=0}^{T - 1} f(u(t)).
+    \]
+
+    The function $f : \mathbf{R} \to \mathbf{R}$ tells us how much fuel is consumed as a function of the input, and is given by
+
+    \[
+    f(a) = \begin{cases}
+    |a| & |a| \leq 1 \\
+    2|a| - 1 & |a| > 1.
+    \end{cases}
+    \]
+
+    This means the fuel use is proportional to the magnitude of the signal between $-1$ and $1$, but for larger signals the marginal fuel efficiency is half.
+
+    **This notebook.** In this notebook we use CVXPY to formulate the minimum fuel optimal control problem as a linear program. The notebook lets you play with the initial and target states, letting you see how they affect the planned trajectory of inputs $u$.
+
+    First, we create the **problem data**.
+    """)
     return


         rf"""

         Choose a value for $x_0$ ...
+
         {x0_widget}
         """
     )

     )

     mo.hstack([_a, _b], justify="space-around")
+    return x0_widget, xdes_widget


 @app.cell


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    **Next, we specify the problem as a linear program using CVXPY.** This problem is linear because the objective and constraints are affine. (In fact, the objective is piecewise affine, but CVXPY rewrites it to be affine for you.)
+    """)
     return


     fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve()
     mo.md(f"Achieved a fuel usage of {fuel_used:.02f}. 🚀")
+    return (u,)


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md("""
+    Finally, we plot the chosen inputs over time.
+
+    **🌊 Try it!** Change the initial and desired states; how do fuel usage and controls change? Can you explain what you see? You can also try experimenting with the value of $T$.
+    """)
     return
 
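The two ingredients the notebook optimizes over, the piecewise-linear fuel function $f$ and the linear recurrence $x(t+1) = Ax(t) + bu(t)$, can be sketched directly in NumPy. This toy double-integrator system and input sequence are made up for illustration; they are not the notebook's data:

```python
import numpy as np

def fuel(a):
    """Fuel consumed by one input: |a| for |a| <= 1, else 2|a| - 1."""
    a = np.abs(a)
    return np.where(a <= 1, a, 2 * a - 1)

def simulate(A, b, x0, u):
    """Roll the linear dynamics x(t+1) = A x(t) + b u(t) forward."""
    x = np.asarray(x0, dtype=float)
    for ut in u:
        x = A @ x + b * ut
    return x

# Tiny illustrative system: a discrete-time double integrator.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
b = np.array([0.0, 1.0])
x0 = np.zeros(2)
u = np.array([0.5, -0.5, 1.5])

total_fuel = fuel(u).sum()  # 0.5 + 0.5 + (2*1.5 - 1) = 3.0
xT = simulate(A, b, x0, u)
print(total_fuel, xT)
```

Note that $f(a) = \max(|a|,\, 2|a| - 1)$, a maximum of affine functions, which is why CVXPY can rewrite the objective as a linear program via the epigraph trick.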
optimization/04_quadratic_program.py CHANGED
@@ -11,7 +11,7 @@

 import marimo

-__generated_with = "0.11.0"
 app = marimo.App()


@@ -23,53 +23,49 @@ def _():

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
-        # Quadratic program
-
-        A quadratic program is an optimization problem with a quadratic objective and
-        affine equality and inequality constraints. A common standard form is the
-        following:
-
-        \[
-        \begin{array}{ll}
-        \text{minimize} & (1/2)x^TPx + q^Tx\\
-        \text{subject to} & Gx \leq h \\
-        & Ax = b.
-        \end{array}
-        \]
-
-        Here $P \in \mathcal{S}^{n}_+$, $q \in \mathcal{R}^n$, $G \in \mathcal{R}^{m \times n}$, $h \in \mathcal{R}^m$, $A \in \mathcal{R}^{p \times n}$, and $b \in \mathcal{R}^p$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Gx \leq h$ is elementwise.
-
-        **Why quadratic programming?** Quadratic programs are convex optimization problems that generalize both least-squares and linear programming. They can be solved efficiently and reliably, even in real-time.
-
-        **An example from finance.** A simple example of a quadratic program arises in finance. Suppose we have $n$ different stocks, an estimate $r \in \mathcal{R}^n$ of the expected return on each stock, and an estimate $\Sigma \in \mathcal{S}^{n}_+$ of the covariance of the returns. Then we solve the optimization problem
-
-        \[
-        \begin{array}{ll}
-        \text{minimize} & (1/2)x^T\Sigma x - r^Tx\\
-        \text{subject to} & x \geq 0 \\
-        & \mathbf{1}^Tx = 1,
-        \end{array}
-        \]
-
-        to find a nonnegative portfolio allocation $x \in \mathcal{R}^n_+$ that optimally balances expected return and variance of return.
-
-        When we solve a quadratic program, in addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$ corresponding to the inequality constraints. A positive entry $\lambda^\star_i$ indicates that the constraint $g_i^Tx \leq h_i$ holds with equality for $x^\star$ and suggests that changing $h_i$ would change the optimal value.
-        """
-    )
     return


 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
-        ## Example
-
-        In this example, we use CVXPY to construct and solve a quadratic program.
-        """
-    )
     return


@@ -82,7 +78,9 @@ def _():

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md("""First we generate synthetic data. In this problem, we don't include equality constraints, only inequality.""")
     return


@@ -95,7 +93,7 @@ def _(np):
     q = np.random.randn(n)
     G = np.random.randn(m, n)
     h = G @ np.random.randn(n)
-    return G, h, m, n, q


 @app.cell(hide_code=True)
@@ -114,7 +112,7 @@ def _(mo, np):
     {P_widget.center()}
     """
     )
-    return P_widget, wigglystuff


 @app.cell
@@ -125,7 +123,9 @@ def _(P_widget, np):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(r"""Next, we specify the problem. Notice that we use the `quad_form` function from CVXPY to create the quadratic form $x^TPx$.""")
     return


@@ -162,14 +162,12 @@ def _(G, P, h, plot_contours, q, x):

 @app.cell(hide_code=True)
 def _(mo):
-    mo.md(
-        r"""
-        In this plot, the gray shaded region is the feasible region (points satisfying the inequality), and the ellipses are level curves of the quadratic form.
-
-        **🌊 Try it!** Try changing the entries of $P$ above with your mouse. How do the
-        level curves and the optimal value of $x$ change? Can you explain what you see?
-        """
-    )
     return


@@ -178,7 +176,7 @@ def _(P, mo):
     mo.md(
         rf"""
         The above contour lines were generated with
-
         \[
         P= \begin{{bmatrix}}
         {P[0, 0]:.01f} & {P[0, 1]:.01f} \\

 import marimo

+__generated_with = "0.18.4"
 app = marimo.App()


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    # Quadratic program
+
+    A quadratic program is an optimization problem with a quadratic objective and
+    affine equality and inequality constraints. A common standard form is the
+    following:
+
+    \[
+    \begin{array}{ll}
+    \text{minimize} & (1/2)x^TPx + q^Tx\\
+    \text{subject to} & Gx \leq h \\
+    & Ax = b.
+    \end{array}
+    \]
+
+    Here $P \in \mathcal{S}^{n}_+$, $q \in \mathcal{R}^n$, $G \in \mathcal{R}^{m \times n}$, $h \in \mathcal{R}^m$, $A \in \mathcal{R}^{p \times n}$, and $b \in \mathcal{R}^p$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Gx \leq h$ is elementwise.
+
+    **Why quadratic programming?** Quadratic programs are convex optimization problems that generalize both least-squares and linear programming. They can be solved efficiently and reliably, even in real-time.
+
+    **An example from finance.** A simple example of a quadratic program arises in finance. Suppose we have $n$ different stocks, an estimate $r \in \mathcal{R}^n$ of the expected return on each stock, and an estimate $\Sigma \in \mathcal{S}^{n}_+$ of the covariance of the returns. Then we solve the optimization problem
+
+    \[
+    \begin{array}{ll}
+    \text{minimize} & (1/2)x^T\Sigma x - r^Tx\\
+    \text{subject to} & x \geq 0 \\
+    & \mathbf{1}^Tx = 1,
+    \end{array}
+    \]
+
+    to find a nonnegative portfolio allocation $x \in \mathcal{R}^n_+$ that optimally balances expected return and variance of return.
+
+    When we solve a quadratic program, in addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$ corresponding to the inequality constraints. A positive entry $\lambda^\star_i$ indicates that the constraint $g_i^Tx \leq h_i$ holds with equality for $x^\star$ and suggests that changing $h_i$ would change the optimal value.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    ## Example
+
+    In this example, we use CVXPY to construct and solve a quadratic program.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md("""
+    First we generate synthetic data. In this problem, we don't include equality constraints, only inequality.
+    """)
     return


     q = np.random.randn(n)
     G = np.random.randn(m, n)
     h = G @ np.random.randn(n)
+    return G, h, n, q


 @app.cell(hide_code=True)

     {P_widget.center()}
     """
     )
+    return (P_widget,)


 @app.cell


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    Next, we specify the problem. Notice that we use the `quad_form` function from CVXPY to create the quadratic form $x^TPx$.
+    """)
     return


 @app.cell(hide_code=True)
 def _(mo):
+    mo.md(r"""
+    In this plot, the gray shaded region is the feasible region (points satisfying the inequality), and the ellipses are level curves of the quadratic form.
+
+    **🌊 Try it!** Try changing the entries of $P$ above with your mouse. How do the
+    level curves and the optimal value of $x$ change? Can you explain what you see?
+    """)
     return


     mo.md(
         rf"""
         The above contour lines were generated with
+
         \[
         P= \begin{{bmatrix}}
         {P[0, 0]:.01f} & {P[0, 1]:.01f} \\
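The notebook solves the inequality-constrained QP with CVXPY's `quad_form`. As a minimal sketch of the underlying calculus, with a made-up positive definite $P$ and no active constraints, the minimizer of $(1/2)x^TPx + q^Tx$ is where the gradient $Px + q$ vanishes, so it solves a linear system:

```python
import numpy as np

# Made-up 2x2 positive definite P and linear term q (not the notebook's data).
P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -1.0])

def objective(x):
    return 0.5 * x @ P @ x + q @ x

# With no active inequality constraints, the gradient P x + q is zero
# at the minimizer, so x_star solves P x = -q.
x_star = np.linalg.solve(P, -q)

print(x_star, objective(x_star))
```

When constraints such as $Gx \leq h$ are active at the optimum, this closed form no longer applies, which is exactly why the notebook hands the problem to a QP solver through CVXPY.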
optimization/05_portfolio_optimization.py CHANGED
@@ -12,7 +12,7 @@
12
 
13
  import marimo
14
 
15
- __generated_with = "0.11.2"
16
  app = marimo.App()
17
 
18
 
@@ -24,88 +24,78 @@ def _():
24
 
25
  @app.cell(hide_code=True)
26
  def _(mo):
27
- mo.md(r"""# Portfolio optimization""")
 
 
28
  return
29
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
- mo.md(
34
- r"""
35
- In this example we show how to use CVXPY to design a financial portfolio; this is called _portfolio optimization_.
36
 
37
- In portfolio optimization we have some amount of money to invest in any of $n$ different assets.
38
- We choose what fraction $w_i$ of our money to invest in each asset $i$, $i=1, \ldots, n$. The goal is to maximize return of the portfolio while minimizing risk.
39
- """
40
- )
41
  return
42
 
43
 
44
  @app.cell(hide_code=True)
45
  def _(mo):
46
- mo.md(
47
- r"""
48
- ## Asset returns and risk
49
 
50
- We will only model investments held for one period. The initial prices are $p_i > 0$. The end of period prices are $p_i^+ >0$. The asset (fractional) returns are $r_i = (p_i^+-p_i)/p_i$. The portfolio (fractional) return is $R = r^Tw$.
51
 
52
- A common model is that $r$ is a random variable with mean ${\bf E}r = \mu$ and covariance ${\bf E{(r-\mu)(r-\mu)^T}} = \Sigma$.
53
- It follows that $R$ is a random variable with ${\bf E}R = \mu^T w$ and ${\bf var}(R) = w^T\Sigma w$. In real-world applications, $\mu$ and $\Sigma$ are estimated from data and models, and $w$ is chosen using a library like CVXPY.
54
 
55
- ${\bf E}R$ is the (mean) *return* of the portfolio. ${\bf var}(R)$ is the *risk* of the portfolio. Portfolio optimization has two competing objectives: high return and low risk.
56
- """
57
- )
58
  return
59
 
60
 
61
  @app.cell(hide_code=True)
62
  def _(mo):
63
- mo.md(
64
- r"""
65
- ## Classical (Markowitz) portfolio optimization
66
 
67
- Classical (Markowitz) portfolio optimization solves the optimization problem
68
- """
69
- )
70
  return
71
 
72
 
73
  @app.cell(hide_code=True)
74
  def _(mo):
75
- mo.md(
76
- r"""
77
- $$
78
- \begin{array}{ll} \text{maximize} & \mu^T w - \gamma w^T\Sigma w\\
79
- \text{subject to} & {\bf 1}^T w = 1, w \geq 0,
80
- \end{array}
81
- $$
82
- """
83
- )
84
  return
85
 
86
 
87
  @app.cell(hide_code=True)
88
  def _(mo):
89
- mo.md(
90
- r"""
91
- where $w \in {\bf R}^n$ is the optimization variable and $\gamma >0$ is a constant called the *risk aversion parameter*. The constraint $\mathbf{1}^Tw = 1$ says the portfolio weight vector must sum to 1, and $w \geq 0$ says that we can't invest a negative amount into any asset.
92
 
93
- The objective $\mu^Tw - \gamma w^T\Sigma w$ is the *risk-adjusted return*. Varying $\gamma$ gives the optimal *risk-return trade-off*.
94
- We can get the same risk-return trade-off by fixing return and minimizing risk.
95
- """
96
- )
97
  return
98
 
99
 
100
  @app.cell(hide_code=True)
101
  def _(mo):
102
- mo.md(
103
- r"""
104
- ## Example
105
 
106
- In the following code we compute and plot the optimal risk-return trade-off for $10$ assets. First we generate random problem data $\mu$ and $\Sigma$.
107
- """
108
- )
109
  return
110
 
111
 
@@ -148,7 +138,7 @@ def _(mo, np):
148
  _Try changing the entries of $\mu$ and see how the plots below change._
149
  """
150
  )
151
- return mu_widget, wigglystuff
152
 
153
 
154
  @app.cell
@@ -163,7 +153,9 @@ def _(mu_widget, np):
163
 
164
  @app.cell(hide_code=True)
165
  def _(mo):
166
- mo.md("""Next, we solve the problem for 100 different values of $\gamma$""")
 
 
167
  return
168
 
169
 
@@ -176,7 +168,7 @@ def _(Sigma, mu, n):
176
  ret = mu.T @ w
177
  risk = cp.quad_form(w, Sigma)
178
  prob = cp.Problem(cp.Maximize(ret - gamma * risk), [cp.sum(w) == 1, w >= 0])
179
- return cp, gamma, prob, ret, risk, w
180
 
181
 
182
  @app.cell
@@ -195,7 +187,9 @@ def _(cp, gamma, np, prob, ret, risk):
195
 
196
  @app.cell(hide_code=True)
197
  def _(mo):
198
- mo.md("""Plotted below are the risk return tradeoffs for two values of $\gamma$ (blue squares), and the risk return tradeoffs for investing fully in each asset (red circles)""")
 
 
199
  return
200
 
201
 
@@ -218,17 +212,15 @@ def _(Sigma, cp, gamma_vals, mu, n, ret_data, risk_data):
218
  plt.xlabel("Standard deviation")
219
  plt.ylabel("Return")
220
  plt.show()
221
- return ax, fig, marker, markers_on, plt
222
 
223
 
224
  @app.cell(hide_code=True)
225
  def _(mo):
226
- mo.md(
227
- r"""
228
- We plot below the return distributions for the two risk aversion values marked on the trade-off curve.
229
- Notice that the probability of a loss is near 0 for the low risk value and far above 0 for the high risk value.
230
- """
231
- )
232
  return
233
 
234
 
@@ -250,7 +242,7 @@ def _(gamma, gamma_vals, markers_on, np, plt, prob, ret, risk):
250
  plt.ylabel("Density")
251
  plt.legend(loc="upper right")
252
  plt.show()
253
- return midx, spstats, x
254
 
255
 
256
  if __name__ == "__main__":
 
12
 
13
  import marimo
14
 
15
+ __generated_with = "0.18.4"
16
  app = marimo.App()
17
 
18
 
 
24
 
25
  @app.cell(hide_code=True)
26
  def _(mo):
27
+ mo.md(r"""
28
+ # Portfolio optimization
29
+ """)
30
  return
31
 
32
 
33
  @app.cell(hide_code=True)
34
  def _(mo):
35
+ mo.md(r"""
36
+ In this example we show how to use CVXPY to design a financial portfolio; this is called _portfolio optimization_.
 
37
 
38
+ In portfolio optimization we have some amount of money to invest in any of $n$ different assets.
39
+ We choose what fraction $w_i$ of our money to invest in each asset $i$, $i=1, \ldots, n$. The goal is to maximize return of the portfolio while minimizing risk.
40
+ """)
 
41
  return
42
 
43
 
44
  @app.cell(hide_code=True)
45
  def _(mo):
46
+ mo.md(r"""
47
+ ## Asset returns and risk
 
48
 
49
+ We will only model investments held for one period. The initial prices are $p_i > 0$. The end of period prices are $p_i^+ >0$. The asset (fractional) returns are $r_i = (p_i^+-p_i)/p_i$. The portfolio (fractional) return is $R = r^Tw$.
50
 
51
+ A common model is that $r$ is a random variable with mean ${\bf E}r = \mu$ and covariance ${\bf E{(r-\mu)(r-\mu)^T}} = \Sigma$.
52
+ It follows that $R$ is a random variable with ${\bf E}R = \mu^T w$ and ${\bf var}(R) = w^T\Sigma w$. In real-world applications, $\mu$ and $\Sigma$ are estimated from data and models, and $w$ is chosen using a library like CVXPY.
53
 
54
+ ${\bf E}R$ is the (mean) *return* of the portfolio. ${\bf var}(R)$ is the *risk* of the portfolio. Portfolio optimization has two competing objectives: high return and low risk.
55
+ """)
 
56
  return
57
 
58
 
59
  @app.cell(hide_code=True)
60
  def _(mo):
61
+ mo.md(r"""
62
+ ## Classical (Markowitz) portfolio optimization
 
63
 
64
+ Classical (Markowitz) portfolio optimization solves the optimization problem
65
+ """)
 
66
  return
67
 
68
 
69
  @app.cell(hide_code=True)
70
  def _(mo):
71
+ mo.md(r"""
72
+ $$
73
+ \begin{array}{ll} \text{maximize} & \mu^T w - \gamma w^T\Sigma w\\
74
+ \text{subject to} & {\bf 1}^T w = 1, w \geq 0,
75
+ \end{array}
76
+ $$
77
+ """)
 
 
78
  return
79
 
80
 
81
  @app.cell(hide_code=True)
82
  def _(mo):
83
+ mo.md(r"""
84
+ where $w \in {\bf R}^n$ is the optimization variable and $\gamma >0$ is a constant called the *risk aversion parameter*. The constraint $\mathbf{1}^Tw = 1$ says the portfolio weight vector must sum to 1, and $w \geq 0$ says that we can't invest a negative amount into any asset.
 
85
 
86
+ The objective $\mu^Tw - \gamma w^T\Sigma w$ is the *risk-adjusted return*. Varying $\gamma$ gives the optimal *risk-return trade-off*.
87
+ We can get the same risk-return trade-off by fixing return and minimizing risk.
88
+ """)
 
89
  return
90
 
91
 
92
  @app.cell(hide_code=True)
93
  def _(mo):
94
+ mo.md(r"""
95
+ ## Example
 
96
 
97
+ In the following code we compute and plot the optimal risk-return trade-off for $10$ assets. First we generate random problem data $\mu$ and $\Sigma$.
98
+ """)
 
99
  return
100
 
101
 
 
138
  _Try changing the entries of $\mu$ and see how the plots below change._
139
  """
140
  )
141
+ return (mu_widget,)
142
 
143
 
144
  @app.cell
 
153
 
154
  @app.cell(hide_code=True)
155
  def _(mo):
156
+ mo.md("""
157
+ Next, we solve the problem for 100 different values of $\gamma$.
158
+ """)
159
  return
160
 
161
 
 
168
  ret = mu.T @ w
169
  risk = cp.quad_form(w, Sigma)
170
  prob = cp.Problem(cp.Maximize(ret - gamma * risk), [cp.sum(w) == 1, w >= 0])
171
+ return cp, gamma, prob, ret, risk
172
 
173
 
174
  @app.cell
 
187
 
188
  @app.cell(hide_code=True)
189
  def _(mo):
190
+ mo.md("""
191
+ Plotted below are the risk-return trade-offs for two values of $\gamma$ (blue squares), and the risk-return trade-offs for investing fully in each asset (red circles).
192
+ """)
193
  return
194
 
195
 
 
212
  plt.xlabel("Standard deviation")
213
  plt.ylabel("Return")
214
  plt.show()
215
+ return markers_on, plt
216
 
217
 
218
  @app.cell(hide_code=True)
219
  def _(mo):
220
+ mo.md(r"""
221
+ We plot below the return distributions for the two risk-aversion values marked on the trade-off curve.
222
+ Notice that the probability of a loss is near 0 for the low-risk value and far above 0 for the high-risk value.
223
+ """)
 
 
224
  return
225
 
226
 
 
242
  plt.ylabel("Density")
243
  plt.legend(loc="upper right")
244
  plt.show()
245
+ return
246
 
247
 
248
  if __name__ == "__main__":
optimization/06_convex_optimization.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.2"
13
  app = marimo.App()
14
 
15
 
@@ -21,41 +21,39 @@ def _():
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
- mo.md(
25
- r"""
26
- # Convex optimization
27
-
28
- In the previous tutorials, we learned about least squares, linear programming,
29
- and quadratic programming, and saw applications of each. We also learned that these problem
30
- classes can be solved efficiently and reliably using CVXPY. That's because these problem classes are a special
31
- case of a more general class of tractable problems, called **convex optimization problems.**
32
-
33
- A convex optimization problem is an optimization problem that minimizes a convex
34
- function, subject to affine equality constraints and convex inequality
35
- constraints ($f_i(x)\leq 0$, where $f_i$ is a convex function).
36
-
37
- **CVXPY.** CVXPY lets you specify and solve any convex optimization problem,
38
- abstracting away the more specific problem classes. You start with CVXPY's **atomic functions**, like `cp.exp`, `cp.log`, and `cp.square`, and compose them to build more complex convex functions. As long as the functions are composed in the right way — as long as they are "DCP-compliant" — your resulting problem will be convex and solvable by CVXPY.
39
- """
40
- )
41
  return
42
 
43
 
44
  @app.cell(hide_code=True)
45
  def _(mo):
46
- mo.md(
47
- r"""
48
- **🛑 Stop!** Before proceeding, read the CVXPY docs to learn about atomic functions and the DCP ruleset:
49
 
50
- https://www.cvxpy.org/tutorial/index.html
51
- """
52
- )
53
  return
54
 
55
 
56
  @app.cell(hide_code=True)
57
  def _(mo):
58
- mo.md(r"""**Is my problem DCP-compliant?** Below is a sample CVXPY problem. It is DCP-compliant. Try typing in other problems and seeing if they are DCP-compliant. If you know your problem is convex, there exists a way to express it in a DCP-compliant way.""")
 
 
59
  return
60
 
61
 
@@ -71,7 +69,7 @@ def _(mo):
71
  constraints = [x >= 0, cp.sum(x) == 1]
72
  problem = cp.Problem(cp.Maximize(objective), constraints)
73
  mo.md(f"Is my problem DCP? `{problem.is_dcp()}`")
74
- return P_sqrt, constraints, cp, np, objective, problem, x
75
 
76
 
77
  @app.cell
 
9
 
10
  import marimo
11
 
12
+ __generated_with = "0.18.4"
13
  app = marimo.App()
14
 
15
 
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
+ mo.md(r"""
25
+ # Convex optimization
26
+
27
+ In the previous tutorials, we learned about least squares, linear programming,
28
+ and quadratic programming, and saw applications of each. We also learned that these problem
29
+ classes can be solved efficiently and reliably using CVXPY. That's because these problem classes are a special
30
+ case of a more general class of tractable problems, called **convex optimization problems.**
31
+
32
+ A convex optimization problem is an optimization problem that minimizes a convex
33
+ function, subject to affine equality constraints and convex inequality
34
+ constraints ($f_i(x)\leq 0$, where $f_i$ is a convex function).
35
+
36
+ **CVXPY.** CVXPY lets you specify and solve any convex optimization problem,
37
+ abstracting away the more specific problem classes. You start with CVXPY's **atomic functions**, like `cp.exp`, `cp.log`, and `cp.square`, and compose them to build more complex convex functions. As long as the functions are composed in the right way — as long as they are "DCP-compliant" — your resulting problem will be convex and solvable by CVXPY.
38
+ """)
 
 
39
  return
40
 
41
 
42
  @app.cell(hide_code=True)
43
  def _(mo):
44
+ mo.md(r"""
45
+ **🛑 Stop!** Before proceeding, read the CVXPY docs to learn about atomic functions and the DCP ruleset:
 
46
 
47
+ https://www.cvxpy.org/tutorial/index.html
48
+ """)
 
49
  return
50
 
51
 
52
  @app.cell(hide_code=True)
53
  def _(mo):
54
+ mo.md(r"""
55
+ **Is my problem DCP-compliant?** Below is a sample CVXPY problem. It is DCP-compliant. Try typing in other problems to see if they are DCP-compliant. If you know your problem is convex, there exists a way to express it in a DCP-compliant way.
56
+ """)
57
  return
58
 
59
 
 
69
  constraints = [x >= 0, cp.sum(x) == 1]
70
  problem = cp.Problem(cp.Maximize(objective), constraints)
71
  mo.md(f"Is my problem DCP? `{problem.is_dcp()}`")
72
+ return problem, x
73
 
74
 
75
  @app.cell
optimization/07_sdp.py CHANGED
@@ -10,7 +10,7 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.11.2"
14
  app = marimo.App()
15
 
16
 
@@ -22,49 +22,47 @@ def _():
22
 
23
  @app.cell(hide_code=True)
24
  def _(mo):
25
- mo.md(r"""# Semidefinite program""")
 
 
26
  return
27
 
28
 
29
  @app.cell(hide_code=True)
30
  def _(mo):
31
- mo.md(
32
- r"""
33
- _This notebook introduces an advanced topic._ A semidefinite program (SDP) is an optimization problem of the form
34
-
35
- \[
36
- \begin{array}{ll}
37
- \text{minimize} & \mathbf{tr}(CX) \\
38
- \text{subject to} & \mathbf{tr}(A_iX) = b_i, \quad i=1,\ldots,p \\
39
- & X \succeq 0,
40
- \end{array}
41
- \]
42
-
43
- where $\mathbf{tr}$ is the trace function, $X \in \mathcal{S}^{n}$ is the optimization variable and $C, A_1, \ldots, A_p \in \mathcal{S}^{n}$, and $b_1, \ldots, b_p \in \mathcal{R}$ are problem data, and $X \succeq 0$ is a matrix inequality. Here $\mathcal{S}^{n}$ denotes the set of $n$-by-$n$ symmetric matrices.
44
-
45
- **Example.** An example of an SDP is to complete a covariance matrix $\tilde \Sigma \in \mathcal{S}^{n}_+$ with missing entries $M \subset \{1,\ldots,n\} \times \{1,\ldots,n\}$:
46
-
47
- \[
48
- \begin{array}{ll}
49
- \text{minimize} & 0 \\
50
- \text{subject to} & \Sigma_{ij} = \tilde \Sigma_{ij}, \quad (i,j) \notin M \\
51
- & \Sigma \succeq 0,
52
- \end{array}
53
- \]
54
- """
55
- )
56
  return
57
 
58
 
59
  @app.cell(hide_code=True)
60
  def _(mo):
61
- mo.md(
62
- r"""
63
- ## Example
64
 
65
- In the following code, we show how to specify and solve an SDP with CVXPY.
66
- """
67
- )
68
  return
69
 
70
 
@@ -87,7 +85,7 @@ def _(np):
87
  for i in range(p):
88
  A.append(np.random.randn(n, n))
89
  b.append(np.random.randn())
90
- return A, C, b, i, n, p
91
 
92
 
93
  @app.cell
@@ -101,7 +99,7 @@ def _(A, C, b, cp, n, p):
101
  constraints += [cp.trace(A[i] @ X) == b[i] for i in range(p)]
102
  prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
103
  _ = prob.solve()
104
- return X, constraints, prob
105
 
106
 
107
  @app.cell
@@ -111,7 +109,7 @@ def _(X, mo, prob, wigglystuff):
111
  The optimal value is {prob.value:0.4f}.
112
 
113
  A solution for $X$ is (rounded to the nearest decimal) is:
114
-
115
  {mo.ui.anywidget(wigglystuff.Matrix(X.value)).center()}
116
  """
117
  )
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App()
15
 
16
 
 
22
 
23
  @app.cell(hide_code=True)
24
  def _(mo):
25
+ mo.md(r"""
26
+ # Semidefinite program
27
+ """)
28
  return
29
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
+ mo.md(r"""
34
+ _This notebook introduces an advanced topic._ A semidefinite program (SDP) is an optimization problem of the form
35
+
36
+ \[
37
+ \begin{array}{ll}
38
+ \text{minimize} & \mathbf{tr}(CX) \\
39
+ \text{subject to} & \mathbf{tr}(A_iX) = b_i, \quad i=1,\ldots,p \\
40
+ & X \succeq 0,
41
+ \end{array}
42
+ \]
43
+
44
+ where $\mathbf{tr}$ is the trace function, $X \in \mathcal{S}^{n}$ is the optimization variable, $C, A_1, \ldots, A_p \in \mathcal{S}^{n}$ and $b_1, \ldots, b_p \in {\bf R}$ are problem data, and $X \succeq 0$ is a matrix inequality. Here $\mathcal{S}^{n}$ denotes the set of $n$-by-$n$ symmetric matrices.
45
+
46
+ **Example.** An example of an SDP is to complete a covariance matrix $\tilde \Sigma \in \mathcal{S}^{n}_+$ with missing entries $M \subset \{1,\ldots,n\} \times \{1,\ldots,n\}$:
47
+
48
+ \[
49
+ \begin{array}{ll}
50
+ \text{minimize} & 0 \\
51
+ \text{subject to} & \Sigma_{ij} = \tilde \Sigma_{ij}, \quad (i,j) \notin M \\
52
+ & \Sigma \succeq 0,
53
+ \end{array}
54
+ \]
55
+ """)
 
 
56
  return
57
 
58
 
59
  @app.cell(hide_code=True)
60
  def _(mo):
61
+ mo.md(r"""
62
+ ## Example
 
63
 
64
+ In the following code, we show how to specify and solve an SDP with CVXPY.
65
+ """)
 
66
  return
67
 
68
 
 
85
  for i in range(p):
86
  A.append(np.random.randn(n, n))
87
  b.append(np.random.randn())
88
+ return A, C, b, n, p
89
 
90
 
91
  @app.cell
 
99
  constraints += [cp.trace(A[i] @ X) == b[i] for i in range(p)]
100
  prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
101
  _ = prob.solve()
102
+ return X, prob
103
 
104
 
105
  @app.cell
 
109
  The optimal value is {prob.value:0.4f}.
110
 
111
  A solution for $X$ (rounded to the nearest decimal) is:
112
+
113
  {mo.ui.anywidget(wigglystuff.Matrix(X.value)).center()}
114
  """
115
  )
optimization/README.md CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # Learn optimization
2
 
3
  This collection of marimo notebooks teaches you the basics of convex
@@ -30,4 +35,4 @@ to a notebook's URL: [marimo.app/github.com/marimo-team/learn/blob/main/optimiza
30
 
31
  **Thanks to all our notebook authors!**
32
 
33
- * [Akshay Agrawal](https://github.com/akshayka)
 
1
+ ---
2
+ title: Readme
3
+ marimo-version: 0.18.4
4
+ ---
5
+
6
  # Learn optimization
7
 
8
  This collection of marimo notebooks teaches you the basics of convex
 
35
 
36
  **Thanks to all our notebook authors!**
37
 
38
+ * [Akshay Agrawal](https://github.com/akshayka)
polars/01_why_polars.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.8"
13
  app = marimo.App(width="medium")
14
 
15
 
@@ -21,17 +21,15 @@ def _():
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
- mo.md(
25
- """
26
- # An introduction to Polars
27
 
28
- _By [Koushik Khan](https://github.com/koushikkhan)._
29
 
30
- This notebook provides a birds-eye overview of [Polars](https://pola.rs/), a fast and user-friendly data manipulation library for Python, and compares it to alternatives like Pandas and PySpark.
31
 
32
- Like Pandas and PySpark, the central data structure in Polars is **the DataFrame**, a tabular data structure consisting of named columns. For example, the next cell constructs a DataFrame that records the gender, age, and height in centimeters for a number of individuals.
33
- """
34
- )
35
  return
36
 
37
 
@@ -48,46 +46,40 @@ def _():
48
  }
49
  )
50
  df_pl
51
- return df_pl, pl
52
 
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
- mo.md(
57
- """
58
- Unlike Python's earliest DataFrame library Pandas, Polars was designed with performance and usability in mind — Polars can scale to large datasets with ease while maintaining a simple and intuitive API.
59
 
60
- Polars' performance is due to a number of factors, including its implementation in rust and its ability to perform operations in a parallelized and vectorized manner. It supports a wide range of data types, advanced query optimizations, and seamless integration with other Python libraries, making it a versatile tool for data scientists, engineers, and analysts. Additionally, Polars provides a lazy API for deferred execution, allowing users to optimize their workflows by chaining operations and executing them in a single pass.
61
 
62
- With its focus on speed, scalability, and ease of use, Polars is quickly becoming a go-to choice for data professionals looking to streamline their data processing pipelines and tackle large-scale data challenges.
63
- """
64
- )
65
  return
66
 
67
 
68
  @app.cell(hide_code=True)
69
  def _(mo):
70
- mo.md(
71
- """
72
- ## Choosing Polars over Pandas
73
 
74
- In this section we'll give a few reasons why Polars is a better choice than Pandas, along with examples.
75
- """
76
- )
77
  return
78
 
79
 
80
  @app.cell(hide_code=True)
81
  def _(mo):
82
- mo.md(
83
- """
84
- ### Intuitive syntax
85
 
86
- Polars' syntax is similar to PySpark and intuitive like SQL, making heavy use of **method chaining**. This makes it easy for data professionals to transition to Polars, and leads to an API that is more concise and readable than Pandas.
87
 
88
- **Example.** In the next few cells, we contrast the code to perform a basic filter and aggregation of data with Pandas to the code required to accomplish the same task with `Polars`.
89
- """
90
- )
91
  return
92
 
93
 
@@ -112,12 +104,14 @@ def _():
112
  # step-2: groupby and aggregation
113
  result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean()
114
  result_pd
115
- return df_pd, filtered_df_pd, pd, result_pd
116
 
117
 
118
  @app.cell(hide_code=True)
119
  def _(mo):
120
- mo.md(r"""The same example can be worked out in Polars more concisely, using method chaining. Notice how the Polars code is essentially as readable as English.""")
 
 
121
  return
122
 
123
 
@@ -137,17 +131,15 @@ def _(pl):
137
  # filter, groupby and aggregation using method chaining
138
  result_pl = data_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
139
  result_pl
140
- return data_pl, result_pl
141
 
142
 
143
  @app.cell(hide_code=True)
144
  def _(mo):
145
- mo.md(
146
- """
147
- Notice how Polars uses a *method-chaining* approach, similar to PySpark, which makes the code more readable and expressive while using a *single line* to design the query.
148
- Additionally, Polars supports SQL-like operations *natively*, that allows you to write SQL queries directly on polars dataframe:
149
- """
150
- )
151
  return
152
 
153
 
@@ -155,159 +147,145 @@ def _(mo):
155
  def _(data_pl):
156
  result = data_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
157
  result
158
- return (result,)
159
 
160
 
161
  @app.cell(hide_code=True)
162
  def _(mo):
163
- mo.md(
164
- """
165
- ### A large collection of built-in APIs
166
 
167
- Polars has a comprehensive API that enables to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly improves performance.
168
- """
169
- )
170
  return
171
 
172
 
173
  @app.cell(hide_code=True)
174
  def _(mo):
175
- mo.md(
176
- """
177
- ### Query optimization 📈
178
 
179
- A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more.
180
 
181
- For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column:
182
 
183
- ```python
184
- (
185
- df
186
- .groupby(by="Category").agg(pl.col("Number1").mean())
187
- .filter(pl.col("Category").is_in(["A", "B"]))
188
- )
189
- ```
190
-
191
- If executed eagerly, the `groupby` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `groupby` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency.
192
- """
193
  )
 
 
 
 
194
  return
195
 
196
 
197
  @app.cell(hide_code=True)
198
  def _(mo):
199
- mo.md(
200
- """
201
- ### Scalability — handling large datasets in memory ⬆️
202
 
203
- Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.
204
 
205
- **Example: Processing a Large Dataset**
206
- In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:
207
 
208
- ```python
209
- # This may fail with large datasets
210
- df = pd.read_csv("large_dataset.csv")
211
- ```
212
 
213
- In Polars, the same operation runs quickly, without memory pressure:
214
 
215
- ```python
216
- df = pl.read_csv("large_dataset.csv")
217
- ```
218
 
219
- Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:
220
 
221
- ```python
222
- df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame
223
- result = df.filter(pl.col("A") > 1).groupby("A").agg(pl.sum("B")).collect() # Execute
224
- ```
225
- """
226
- )
227
  return
228
 
229
 
230
  @app.cell(hide_code=True)
231
  def _(mo):
232
- mo.md(
233
- """
234
- ### Compatibility with other machine learning libraries 🤝
235
 
236
- Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.
237
 
238
- **Example: Preprocessing Data for Scikit-learn**
239
 
240
- ```python
241
- import polars as pl
242
- from sklearn.linear_model import LinearRegression
243
 
244
- # Load and preprocess data
245
- df = pl.read_csv("data.csv")
246
- X = df.select(["feature1", "feature2"]).to_numpy()
247
- y = df.select("target").to_numpy()
248
 
249
- # Train a model
250
- model = LinearRegression()
251
- model.fit(X, y)
252
- ```
253
 
254
- Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:
255
 
256
- ```python
257
- # Convert to Pandas DataFrame
258
- pandas_df = df.to_pandas()
259
 
260
- # Convert to NumPy array
261
- numpy_array = df.to_numpy()
262
- ```
263
- """
264
- )
265
  return
266
 
267
 
268
  @app.cell(hide_code=True)
269
  def _(mo):
270
- mo.md(
271
- """
272
- ### Easy to use, with room for power users
273
 
274
- Polars supports advanced operations like
275
 
276
- - **date handling**
277
- - **window functions**
278
- - **joins**
279
- - **nested data types**
280
 
281
- which is making it a versatile tool for data manipulation.
282
- """
283
- )
284
  return
285
 
286
 
287
  @app.cell(hide_code=True)
288
  def _(mo):
289
- mo.md(
290
- """
291
- ## Why not PySpark?
292
 
293
- While **PySpark** is versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels.
294
 
295
- When deciding between the two, **PySpark** is the preferred choice for processing large datasets distributed across a **multi-node cluster**. However, for computations on a **single-node machine**, **Polars** is an excellent alternative. Remarkably, Polars is capable of handling datasets that exceed the size of the available RAM, making it a powerful tool for efficient data processing even on limited hardware.
296
- """
297
- )
298
  return
299
 
300
 
301
  @app.cell(hide_code=True)
302
  def _(mo):
303
- mo.md(
304
- """
305
- ## 🔖 References
306
 
307
- - [Polars official website](https://pola.rs/)
308
- - [Polars vs. Pandas](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
309
- """
310
- )
311
  return
312
 
313
 
 
9
 
10
  import marimo
11
 
12
+ __generated_with = "0.18.4"
13
  app = marimo.App(width="medium")
14
 
15
 
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
+ mo.md("""
25
+ # An introduction to Polars
 
26
 
27
+ _By [Koushik Khan](https://github.com/koushikkhan)._
28
 
29
+ This notebook provides a bird's-eye overview of [Polars](https://pola.rs/), a fast and user-friendly data manipulation library for Python, and compares it to alternatives like Pandas and PySpark.
30
 
31
+ Like Pandas and PySpark, the central data structure in Polars is **the DataFrame**, a tabular data structure consisting of named columns. For example, the next cell constructs a DataFrame that records the gender, age, and height in centimeters for a number of individuals.
32
+ """)
 
33
  return
34
 
35
 
 
46
  }
47
  )
48
  df_pl
49
+ return (pl,)
50
 
51
 
52
  @app.cell(hide_code=True)
53
  def _(mo):
54
+ mo.md("""
55
+ Unlike Python's earliest DataFrame library Pandas, Polars was designed with performance and usability in mind — Polars can scale to large datasets with ease while maintaining a simple and intuitive API.
 
56
 
57
+ Polars' performance is due to a number of factors, including its implementation in Rust and its ability to perform operations in a parallelized and vectorized manner. It supports a wide range of data types, advanced query optimizations, and seamless integration with other Python libraries, making it a versatile tool for data scientists, engineers, and analysts. Additionally, Polars provides a lazy API for deferred execution, allowing users to optimize their workflows by chaining operations and executing them in a single pass.
58
 
59
+ With its focus on speed, scalability, and ease of use, Polars is quickly becoming a go-to choice for data professionals looking to streamline their data processing pipelines and tackle large-scale data challenges.
60
+ """)
 
61
  return
62
 
63
 
64
  @app.cell(hide_code=True)
65
  def _(mo):
66
+ mo.md("""
67
+ ## Choosing Polars over Pandas
 
68
 
69
+ In this section we'll give a few reasons why Polars is a better choice than Pandas, along with examples.
70
+ """)
 
71
  return
72
 
73
 
74
  @app.cell(hide_code=True)
75
  def _(mo):
76
+ mo.md("""
77
+ ### Intuitive syntax
 
78
 
79
+ Polars' syntax is similar to PySpark and intuitive like SQL, making heavy use of **method chaining**. This makes it easy for data professionals to transition to Polars, and leads to an API that is more concise and readable than Pandas.
80
 
81
+ **Example.** In the next few cells, we contrast the code to perform a basic filter and aggregation of data with Pandas to the code required to accomplish the same task with `Polars`.
82
+ """)
 
83
  return
84
 
85
 
 
104
  # step-2: groupby and aggregation
105
  result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean()
106
  result_pd
107
+ return
108
 
109
 
110
  @app.cell(hide_code=True)
111
  def _(mo):
112
+ mo.md(r"""
113
+ The same example can be worked out in Polars more concisely, using method chaining. Notice how the Polars code is essentially as readable as English.
114
+ """)
115
  return
116
 
117
 
 
131
  # filter, groupby and aggregation using method chaining
132
  result_pl = data_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM"))
133
  result_pl
134
+ return (data_pl,)
135
 
136
 
137
  @app.cell(hide_code=True)
138
  def _(mo):
139
+ mo.md("""
140
+ Notice how Polars uses a *method-chaining* approach, similar to PySpark, which keeps the code readable and expressive while fitting the whole query on a *single line*.
141
+ Additionally, Polars supports SQL-like operations *natively*, allowing you to write SQL queries directly on a Polars DataFrame:
142
+ """)
 
 
143
  return
144
 
145
 
 
147
  def _(data_pl):
148
  result = data_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender")
149
  result
150
+ return
151
 
152
 
153
  @app.cell(hide_code=True)
154
  def _(mo):
155
+ mo.md("""
156
+ ### A large collection of built-in APIs
 
157
 
158
+ Polars has a comprehensive API that enables you to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly improves performance.
159
+ """)
 
160
  return
161
 
162
 
163
  @app.cell(hide_code=True)
164
  def _(mo):
165
+ mo.md("""
166
+ ### Query optimization 📈
 
167
 
168
+ A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more.
169
 
170
+ For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column:
171
 
172
+ ```python
173
+ (
174
+ df
175
+ .group_by("Category").agg(pl.col("Number1").mean())
176
+ .filter(pl.col("Category").is_in(["A", "B"]))
 
 
 
 
 
177
  )
178
+ ```
179
+
180
+ If executed eagerly, the `group_by` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `group_by` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency.
181
+ """)
182
  return
183
 
184
 
185
  @app.cell(hide_code=True)
186
  def _(mo):
187
+ mo.md("""
188
+ ### Scalability — handling large datasets in memory ⬆️
 
189
 
190
+ Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger.
191
 
192
+ **Example: Processing a Large Dataset**
193
+ In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors:
194
 
195
+ ```python
196
+ # This may fail with large datasets
197
+ df = pd.read_csv("large_dataset.csv")
198
+ ```
199
 
200
+ In Polars, the same operation runs quickly, without memory pressure:
201
 
202
+ ```python
203
+ df = pl.read_csv("large_dataset.csv")
204
+ ```
205
 
206
+ Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets:
207
 
208
+ ```python
209
+ df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame
210
+ result = df.filter(pl.col("A") > 1).group_by("A").agg(pl.sum("B")).collect() # Execute
211
+ ```
212
+ """)
 
213
  return
214
 
215
 
216
  @app.cell(hide_code=True)
217
  def _(mo):
218
+ mo.md("""
219
+ ### Compatibility with other machine learning libraries 🤝
 
220
 
221
+ Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models.
222
 
223
+ **Example: Preprocessing Data for Scikit-learn**
224
 
225
+ ```python
226
+ import polars as pl
227
+ from sklearn.linear_model import LinearRegression
228
 
229
+ # Load and preprocess data
230
+ df = pl.read_csv("data.csv")
231
+ X = df.select(["feature1", "feature2"]).to_numpy()
232
+ y = df.select("target").to_numpy()
233
 
234
+ # Train a model
235
+ model = LinearRegression()
236
+ model.fit(X, y)
237
+ ```
238
 
239
+ Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library:
240
 
241
+ ```python
242
+ # Convert to Pandas DataFrame
243
+ pandas_df = df.to_pandas()
244
 
245
+ # Convert to NumPy array
246
+ numpy_array = df.to_numpy()
247
+ ```
248
+ """)
 
249
  return
250
 
251
 
252
  @app.cell(hide_code=True)
253
  def _(mo):
254
+ mo.md("""
255
+ ### Easy to use, with room for power users
 
256
 
257
+ Polars supports advanced operations like
258
 
259
+ - **date handling**
260
+ - **window functions**
261
+ - **joins**
262
+ - **nested data types**
263
 
264
+ making it a versatile tool for data manipulation.
265
+ """)
 
266
  return
267
 
268
 
269
  @app.cell(hide_code=True)
270
  def _(mo):
271
+ mo.md("""
272
+ ## Why not PySpark?
 
273
 
274
+ While **PySpark** is a versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels.
275
 
276
+ When deciding between the two, **PySpark** is the preferred choice for processing large datasets distributed across a **multi-node cluster**. However, for computations on a **single-node machine**, **Polars** is an excellent alternative. Remarkably, Polars is capable of handling datasets that exceed the size of the available RAM, making it a powerful tool for efficient data processing even on limited hardware.
277
+ """)
 
278
  return
279
 
280
 
281
  @app.cell(hide_code=True)
282
  def _(mo):
283
+ mo.md("""
284
+ ## 🔖 References
 
285
 
286
+ - [Polars official website](https://pola.rs/)
287
+ - [Polars vs. Pandas](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/)
288
+ """)
 
289
  return
290
 
291
 
polars/02_dataframes.py CHANGED
@@ -10,14 +10,13 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.13.10"
14
  app = marimo.App()
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
- mo.md(
20
- r"""
21
  # DataFrames
22
  Author: [*Raine Hoang*](https://github.com/Jystine)
23
 
@@ -25,33 +24,31 @@ def _(mo):
25
 
26
  /// Note
27
  The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html).
28
- """
29
- )
30
  return
31
 
32
 
33
  @app.cell(hide_code=True)
34
  def _(mo):
35
- mo.md(
36
- """
37
  ## Defining a DataFrame
38
 
39
At the most basic level, all you need to do to create a DataFrame in Polars is call the `pl.DataFrame()` constructor and pass some data into the `data` parameter. However, there are restrictions as to what exactly you can pass in.
40
- """
41
- )
42
  return
43
 
44
 
45
  @app.cell(hide_code=True)
46
  def _(mo):
47
- mo.md(r"""### What Can Be a DataFrame?""")
 
 
48
  return
49
 
50
 
51
  @app.cell(hide_code=True)
52
  def _(mo):
53
- mo.md(
54
- r"""
55
  There are [5 data types](https://github.com/pola-rs/polars/blob/py-1.29.0/py-polars/polars/dataframe/frame.py#L197) that can be converted into a DataFrame.
56
 
57
  1. Dictionary
@@ -59,20 +56,17 @@ def _(mo):
59
  3. NumPy Array
60
  4. Series
61
  5. Pandas DataFrame
62
- """
63
- )
64
  return
65
 
66
 
67
  @app.cell(hide_code=True)
68
  def _(mo):
69
- mo.md(
70
- r"""
71
  #### Dictionary
72
 
73
  Dictionaries are structures that store data as `key:value` pairs. Let's say we have the following dictionary:
74
- """
75
- )
76
  return
77
 
78
 
@@ -85,7 +79,9 @@ def _():
85
 
86
  @app.cell(hide_code=True)
87
  def _(mo):
88
- mo.md(r"""In order to convert this dictionary into a DataFrame, we simply need to pass it into the data parameter in the `.DataFrame()` method like so.""")
 
 
89
  return
90
 
91
 
@@ -98,25 +94,21 @@ def _(dct_data, pl):
98
 
99
  @app.cell(hide_code=True)
100
  def _(mo):
101
- mo.md(
102
- r"""
103
- In this case, Polars turned each of the lists in the dictionary into a column in the DataFrame.
104
 
105
  The other data structures will follow a similar pattern when converting them to DataFrames.
106
- """
107
- )
108
  return
109
 
110
 
111
  @app.cell(hide_code=True)
112
  def _(mo):
113
- mo.md(
114
- r"""
115
  ##### Sequence
116
 
117
Sequences are data structures that contain collections of items, which can be accessed by index. Examples of sequences are lists, tuples, and strings. We will be using a list of lists in order to demonstrate how to convert a sequence into a DataFrame.
118
- """
119
- )
120
  return
121
 
122
 
@@ -136,19 +128,19 @@ def _(pl, seq_data):
136
 
137
  @app.cell(hide_code=True)
138
  def _(mo):
139
- mo.md(r"""Notice that since we didn't specify the column names, Polars automatically named them `column_0`, `column_1`, and `column_2`. Later, we will show you how to specify the names of the columns.""")
 
 
140
  return
141
 
142
 
143
  @app.cell(hide_code=True)
144
  def _(mo):
145
- mo.md(
146
- r"""
147
  ##### NumPy Array
148
 
149
NumPy arrays are sequences of items that can also be accessed by index. An important thing to note is that all of the items in an array must have the same data type.
150
- """
151
- )
152
  return
153
 
154
 
@@ -168,19 +160,19 @@ def _(arr_data, pl):
168
 
169
  @app.cell(hide_code=True)
170
  def _(mo):
171
- mo.md(r"""Notice that each inner array is a row in the DataFrame, not a column like the previous methods discussed. Later, we will go over how to tell Polars if we the information in the data structure to be presented as rows or columns.""")
 
 
172
  return
173
 
174
 
175
  @app.cell(hide_code=True)
176
  def _(mo):
177
- mo.md(
178
- r"""
179
  ##### Series
180
 
181
  Series are a way to store a single column in a DataFrame and all entries in a series must have the same data type. You can combine these series together to form one DataFrame.
182
- """
183
- )
184
  return
185
 
186
 
@@ -200,13 +192,11 @@ def _(pl, pl_series):
200
 
201
  @app.cell(hide_code=True)
202
  def _(mo):
203
- mo.md(
204
- r"""
205
  ##### Pandas DataFrame
206
 
207
Another popular package that utilizes DataFrames is pandas. By passing a pandas DataFrame into `pl.DataFrame()`, you can easily convert it into a Polars DataFrame.
208
- """
209
- )
210
  return
211
 
212
 
@@ -230,19 +220,19 @@ def _(pd_df, pl):
230
 
231
  @app.cell(hide_code=True)
232
  def _(mo):
233
- mo.md(r"""Now that we've looked over what can be converted into a DataFrame and the basics of it, let's look at the structure of the DataFrame.""")
 
 
234
  return
235
 
236
 
237
  @app.cell(hide_code=True)
238
  def _(mo):
239
- mo.md(
240
- r"""
241
  ## DataFrame Structure
242
 
243
  Let's recall one of the DataFrames we defined earlier.
244
- """
245
- )
246
  return
247
 
248
 
@@ -254,14 +244,15 @@ def _(dct_df):
254
 
255
  @app.cell(hide_code=True)
256
  def _(mo):
257
- mo.md(r"""We can see that this DataFrame has 4 rows and 3 columns as indicated by the text beneath the DataFrame. Each column has a name that can be used to access the data within that column. In this case, the names are: "col1", "col2", and "col3". Below the column name, there is text that indicates the data type stored within that column. "col1" has the text "i64" underneath its name, meaning that that column stores integers. "col2" stores strings as seen by the "str" under the column name. Finally, "col3" stores floats as it has "f64" under the column name. Polars will automatically assume the data types stored in each column, but we will go over a way to specify it later in this tutorial. Each column can only hold one data type at a time, so you can't have a string and an integer in the same column.""")
 
 
258
  return
259
 
260
 
261
  @app.cell(hide_code=True)
262
  def _(mo):
263
- mo.md(
264
- r"""
265
  ## Parameters
266
 
267
  On top of the "data" parameter, there are 6 additional parameters you can specify:
@@ -272,20 +263,17 @@ def _(mo):
272
  4. orient
273
  5. infer_schema_length
274
  6. nan_to_null
275
- """
276
- )
277
  return
278
 
279
 
280
  @app.cell(hide_code=True)
281
  def _(mo):
282
- mo.md(
283
- r"""
284
  #### Schema
285
 
286
  Let's recall the DataFrame we created using a sequence.
287
- """
288
- )
289
  return
290
 
291
 
@@ -297,7 +285,9 @@ def _(seq_df):
297
 
298
  @app.cell(hide_code=True)
299
  def _(mo):
300
- mo.md(r"""We can see that the column names and data type were inferred by Polars. The schema parameter allows us to specify the column names and data type we want for each column. There are 3 ways you can use this parameter. The first way involves using a dictionary to define the following key value pair: column name:data type.""")
 
 
301
  return
302
 
303
 
@@ -309,7 +299,9 @@ def _(pl, seq_data):
309
 
310
  @app.cell(hide_code=True)
311
  def _(mo):
312
- mo.md(r"""You can also do this using a list of (column name, data type) pairs instead of a dictionary.""")
 
 
313
  return
314
 
315
 
@@ -321,7 +313,9 @@ def _(pl, seq_data):
321
 
322
  @app.cell(hide_code=True)
323
  def _(mo):
324
- mo.md(r"""Notice how both the column names and the data type (text underneath the column name) is different from the original `seq_df`. If you only wanted to specify the column names and let Polars assume the data type, you can do so using a list of column names.""")
 
 
325
  return
326
 
327
 
@@ -333,19 +327,19 @@ def _(pl, seq_data):
333
 
334
  @app.cell(hide_code=True)
335
  def _(mo):
336
- mo.md(r"""The text under the column names is different from the previous two DataFrames we created since we didn't explicitly tell Polars what data type we wanted in each column.""")
 
 
337
  return
338
 
339
 
340
  @app.cell(hide_code=True)
341
  def _(mo):
342
- mo.md(
343
- r"""
344
  #### Schema_Overrides
345
 
346
If you only wanted to specify the data type of specific columns and let Polars infer the rest, you can use the schema_overrides parameter for that. This parameter requires that you pass in a dictionary of `column name: data type` pairs. Unlike the schema parameter, the column name must match a name already present in the DataFrame, as that is how Polars identifies which column's data type you want to specify. If you use a column name that doesn't already exist, Polars won't be able to change the data type.
347
- """
348
- )
349
  return
350
 
351
 
@@ -357,13 +351,11 @@ def _(pl, seq_data):
357
 
358
  @app.cell(hide_code=True)
359
  def _(mo):
360
- mo.md(
361
- r"""
362
  Notice here that only the data type in the first column changed while Polars inferred the rest.
363
 
364
  It is important to note that if you only use the schema_overrides parameter, you are limited to how much you can change the data type. In the example above, we were able to change the data type from int32 to int16 without any further parameters since the data type is still an integer. However, if we wanted to change the first column to be a string, we would get an error as Polars has already strictly set the schema to only take in integer values.
365
- """
366
- )
367
  return
368
 
369
 
@@ -378,25 +370,27 @@ def _(pl, seq_data):
378
 
379
  @app.cell(hide_code=True)
380
  def _(mo):
381
- mo.md(r"""If we wanted to use schema_override to completely change the data type of the column, we need an additional parameter: strict.""")
 
 
382
  return
383
 
384
 
385
  @app.cell(hide_code=True)
386
  def _(mo):
387
- mo.md(
388
- r"""
389
  #### Strict
390
 
391
The strict parameter lets you specify whether a column's data type is enforced strictly or with some flexibility. When set to `True`, Polars will raise an error if a value doesn't match the data type the column expects. It will not attempt to cast it to the correct data type, as Polars prioritizes that all the data can be converted without any loss or error. When set to `False`, Polars will attempt to cast the data into the data type the column wants. If it is unable to convert a value, that value will be replaced with a null.
392
- """
393
- )
394
  return
395
 
396
 
397
  @app.cell(hide_code=True)
398
  def _(mo):
399
- mo.md(r"""Let's see an example of what happens when strict is set to `True`. The cell below should show an error.""")
 
 
400
  return
401
 
402
 
@@ -413,7 +407,9 @@ def _(pl):
413
 
414
  @app.cell(hide_code=True)
415
  def _(mo):
416
- mo.md(r"""Now let's try setting strict to `False`.""")
 
 
417
  return
418
 
419
 
@@ -425,19 +421,19 @@ def _(pl, seq_data):
425
 
426
  @app.cell(hide_code=True)
427
  def _(mo):
428
- mo.md(r"""Since we allowed for Polars to change the schema by setting strict to `False`, we were able to cast the first column to be strings.""")
 
 
429
  return
430
 
431
 
432
  @app.cell(hide_code=True)
433
  def _(mo):
434
- mo.md(
435
- """
436
  #### Orient
437
 
438
  Let's recall the DataFrame we made by using an array and the data used to make it.
439
- """
440
- )
441
  return
442
 
443
 
@@ -455,7 +451,9 @@ def _(arr_df):
455
 
456
  @app.cell(hide_code=True)
457
  def _(mo):
458
- mo.md(r"""Notice how Polars decided to make each inner array a row in the DataFrame. If we wanted to make it so that each inner array was a column instead of a row, all we would need to do is pass `"col"` into the orient parameter.""")
 
 
459
  return
460
 
461
 
@@ -467,7 +465,9 @@ def _(arr_data, pl):
467
 
468
  @app.cell(hide_code=True)
469
  def _(mo):
470
- mo.md(r"""If we wanted to do the opposite, then we pass `"row"` into the orient parameter.""")
 
 
471
  return
472
 
473
 
@@ -485,39 +485,33 @@ def _(pl, seq_data):
485
 
486
  @app.cell(hide_code=True)
487
  def _(mo):
488
- mo.md(
489
- r"""
490
  #### Infer_Schema_Length
491
 
492
  Without setting the schema ourselves, Polars uses the data provided to infer the data types of the columns. It does this by looking at each of the rows in the data provided. You can specify to Polars how many rows to look at by using the infer_schema_length parameter. For example, if you were to set this parameter to 5, then Polars would use the first 5 rows to infer the schema.
493
- """
494
- )
495
  return
496
 
497
 
498
  @app.cell(hide_code=True)
499
  def _(mo):
500
- mo.md(
501
- r"""
502
  #### NaN_To_Null
503
 
504
  If there are np.nan values in the data, you can convert them to null values by setting the nan_to_null parameter to `True`.
505
- """
506
- )
507
  return
508
 
509
 
510
  @app.cell(hide_code=True)
511
  def _(mo):
512
- mo.md(
513
- r"""
514
  ## Summary
515
 
516
- DataFrames are a useful data structure that can be used to organize and perform additional analysis on your data. In this notebook, we have learned how to define DataFrames, what can be a DataFrame, the structure of it, and additional parameters you can set while creating it.
517
 
518
In order to create a DataFrame, you pass your data into the `pl.DataFrame()` constructor through the `data` parameter. The data you pass in must be either a dictionary, sequence, array, series, or pandas DataFrame. Once defined, the DataFrame will separate the data into different columns, and the data within a column must all have the same data type. There exist additional parameters besides `data` that allow you to further customize the resulting DataFrame. Some examples of these are orient, strict, and infer_schema_length.
519
- """
520
- )
521
  return
522
 
523
 
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App()
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
+ mo.md(r"""
 
20
  # DataFrames
21
  Author: [*Raine Hoang*](https://github.com/Jystine)
22
 
 
24
 
25
  /// Note
26
  The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html).
27
+ """)
 
28
  return
29
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
+ mo.md("""
 
34
  ## Defining a DataFrame
35
 
36
At the most basic level, all you need to do to create a DataFrame in Polars is call the `pl.DataFrame()` constructor and pass some data into the `data` parameter. However, there are restrictions as to what exactly you can pass in.
37
+ """)
 
38
  return
39
 
40
 
41
  @app.cell(hide_code=True)
42
  def _(mo):
43
+ mo.md(r"""
44
+ ### What Can Be a DataFrame?
45
+ """)
46
  return
47
 
48
 
49
  @app.cell(hide_code=True)
50
  def _(mo):
51
+ mo.md(r"""
 
52
  There are [5 data types](https://github.com/pola-rs/polars/blob/py-1.29.0/py-polars/polars/dataframe/frame.py#L197) that can be converted into a DataFrame.
53
 
54
  1. Dictionary
 
56
  3. NumPy Array
57
  4. Series
58
  5. Pandas DataFrame
59
+ """)
 
60
  return
61
 
62
 
63
  @app.cell(hide_code=True)
64
  def _(mo):
65
+ mo.md(r"""
 
66
  #### Dictionary
67
 
68
  Dictionaries are structures that store data as `key:value` pairs. Let's say we have the following dictionary:
69
+ """)
 
70
  return
71
 
72
 
 
79
 
80
  @app.cell(hide_code=True)
81
  def _(mo):
82
+ mo.md(r"""
83
+ In order to convert this dictionary into a DataFrame, we simply need to pass it into the `data` parameter of the `pl.DataFrame()` constructor like so.
84
+ """)
85
  return
86
 
87
 
 
94
 
95
  @app.cell(hide_code=True)
96
  def _(mo):
97
+ mo.md(r"""
98
+ In this case, Polars turned each of the lists in the dictionary into a column in the DataFrame.
 
99
 
100
  The other data structures will follow a similar pattern when converting them to DataFrames.
101
+ """)
 
102
  return
103
 
104
 
105
  @app.cell(hide_code=True)
106
  def _(mo):
107
+ mo.md(r"""
 
108
  ##### Sequence
109
 
110
Sequences are data structures that contain collections of items, which can be accessed by index. Examples of sequences are lists, tuples, and strings. We will be using a list of lists in order to demonstrate how to convert a sequence into a DataFrame.
111
+ """)
 
112
  return
113
 
114
 
 
128
 
129
  @app.cell(hide_code=True)
130
  def _(mo):
131
+ mo.md(r"""
132
+ Notice that since we didn't specify the column names, Polars automatically named them `column_0`, `column_1`, and `column_2`. Later, we will show you how to specify the names of the columns.
133
+ """)
134
  return
135
 
136
 
137
  @app.cell(hide_code=True)
138
  def _(mo):
139
+ mo.md(r"""
 
140
  ##### NumPy Array
141
 
142
NumPy arrays are sequences of items that can also be accessed by index. An important thing to note is that all of the items in an array must have the same data type.
143
+ """)
 
144
  return
145
 
146
 
 
160
 
161
  @app.cell(hide_code=True)
162
  def _(mo):
163
+ mo.md(r"""
164
+ Notice that each inner array is a row in the DataFrame, not a column as with the previous methods discussed. Later, we will go over how to tell Polars whether we want the information in the data structure to be presented as rows or columns.
165
+ """)
166
  return
167
 
168
 
169
  @app.cell(hide_code=True)
170
  def _(mo):
171
+ mo.md(r"""
 
172
  ##### Series
173
 
174
  Series are a way to store a single column in a DataFrame and all entries in a series must have the same data type. You can combine these series together to form one DataFrame.
175
+ """)
 
176
  return
177
 
178
 
 
192
 
193
  @app.cell(hide_code=True)
194
  def _(mo):
195
+ mo.md(r"""
 
196
  ##### Pandas DataFrame
197
 
198
Another popular package that utilizes DataFrames is pandas. By passing a pandas DataFrame into `pl.DataFrame()`, you can easily convert it into a Polars DataFrame.
199
+ """)
 
200
  return
201
 
202
 
 
220
 
221
  @app.cell(hide_code=True)
222
  def _(mo):
223
+ mo.md(r"""
224
+ Now that we've looked over what can be converted into a DataFrame and the basics of it, let's look at the structure of the DataFrame.
225
+ """)
226
  return
227
 
228
 
229
  @app.cell(hide_code=True)
230
  def _(mo):
231
+ mo.md(r"""
 
232
  ## DataFrame Structure
233
 
234
  Let's recall one of the DataFrames we defined earlier.
235
+ """)
 
236
  return
237
 
238
 
 
244
 
245
  @app.cell(hide_code=True)
246
  def _(mo):
247
+ mo.md(r"""
248
+ We can see that this DataFrame has 4 rows and 3 columns, as indicated by the text beneath the DataFrame. Each column has a name that can be used to access the data within that column. In this case, the names are "col1", "col2", and "col3". Below each column name, there is text that indicates the data type stored within that column. "col1" has the text "i64" underneath its name, meaning that column stores integers. "col2" stores strings, as seen by the "str" under the column name. Finally, "col3" stores floats, as it has "f64" under the column name. Polars will automatically infer the data types stored in each column, but we will go over a way to specify them later in this tutorial. Each column can only hold one data type at a time, so you can't have a string and an integer in the same column.
249
+ """)
250
  return
251
 
252
 
253
  @app.cell(hide_code=True)
254
  def _(mo):
255
+ mo.md(r"""
 
256
  ## Parameters
257
 
258
  On top of the "data" parameter, there are 6 additional parameters you can specify:
 
263
  4. orient
264
  5. infer_schema_length
265
  6. nan_to_null
266
+ """)
 
267
  return
268
 
269
 
270
  @app.cell(hide_code=True)
271
  def _(mo):
272
+ mo.md(r"""
 
273
  #### Schema
274
 
275
  Let's recall the DataFrame we created using a sequence.
276
+ """)
 
277
  return
278
 
279
 
 
285
 
286
  @app.cell(hide_code=True)
287
  def _(mo):
288
+ mo.md(r"""
289
+ We can see that the column names and data type were inferred by Polars. The schema parameter allows us to specify the column names and data type we want for each column. There are 3 ways you can use this parameter. The first way involves using a dictionary of `column name: data type` pairs.
290
+ """)
291
  return
292
 
293
 
 
299
 
300
  @app.cell(hide_code=True)
301
  def _(mo):
302
+ mo.md(r"""
303
+ You can also do this using a list of (column name, data type) pairs instead of a dictionary.
304
+ """)
305
  return
306
 
307
 
 
313
 
314
  @app.cell(hide_code=True)
315
  def _(mo):
316
+ mo.md(r"""
317
+ Notice how both the column names and the data types (text underneath the column names) are different from the original `seq_df`. If you only wanted to specify the column names and let Polars assume the data type, you can do so using a list of column names.
318
+ """)
319
  return
320
 
321
 
 
327
 
328
  @app.cell(hide_code=True)
329
  def _(mo):
330
+ mo.md(r"""
331
+ The text under the column names is different from the previous two DataFrames we created since we didn't explicitly tell Polars what data type we wanted in each column.
332
+ """)
333
  return
334
 
335
 
336
  @app.cell(hide_code=True)
337
  def _(mo):
338
+ mo.md(r"""
 
339
  #### Schema_Overrides
340
 
341
If you only wanted to specify the data type of specific columns and let Polars infer the rest, you can use the schema_overrides parameter for that. This parameter requires that you pass in a dictionary of `column name: data type` pairs. Unlike the schema parameter, the column name must match a name already present in the DataFrame, as that is how Polars identifies which column's data type you want to specify. If you use a column name that doesn't already exist, Polars won't be able to change the data type.
342
+ """)
 
343
  return
344
 
345
 
 
351
 
352
  @app.cell(hide_code=True)
353
  def _(mo):
354
+ mo.md(r"""
 
355
  Notice here that only the data type in the first column changed while Polars inferred the rest.
356
 
357
  It is important to note that if you only use the schema_overrides parameter, you are limited to how much you can change the data type. In the example above, we were able to change the data type from int32 to int16 without any further parameters since the data type is still an integer. However, if we wanted to change the first column to be a string, we would get an error as Polars has already strictly set the schema to only take in integer values.
358
+ """)
 
359
  return
360
 
361
 
 
370
 
371
  @app.cell(hide_code=True)
372
  def _(mo):
373
+ mo.md(r"""
374
+ If we wanted to use schema_override to completely change the data type of the column, we need an additional parameter: strict.
375
+ """)
376
  return
377
 
378
 
379
  @app.cell(hide_code=True)
380
  def _(mo):
381
+ mo.md(r"""
 
382
  #### Strict
383
 
384
The strict parameter lets you specify whether a column's data type is enforced strictly or with some flexibility. When set to `True`, Polars will raise an error if a value doesn't match the data type the column expects. It will not attempt to cast it to the correct data type, as Polars prioritizes that all the data can be converted without any loss or error. When set to `False`, Polars will attempt to cast the data into the data type the column wants. If it is unable to convert a value, that value will be replaced with a null.
385
+ """)
 
386
  return
387
 
388
 
389
  @app.cell(hide_code=True)
390
  def _(mo):
391
+ mo.md(r"""
392
+ Let's see an example of what happens when strict is set to `True`. The cell below should show an error.
393
+ """)
394
  return
395
 
396
 
 
407
 
408
  @app.cell(hide_code=True)
409
  def _(mo):
410
+ mo.md(r"""
411
+ Now let's try setting strict to `False`.
412
+ """)
413
  return
414
 
415
 
 
421
 
422
  @app.cell(hide_code=True)
423
  def _(mo):
424
+ mo.md(r"""
425
+ Since we allowed for Polars to change the schema by setting strict to `False`, we were able to cast the first column to be strings.
426
+ """)
427
  return
428
 
429
 
430
  @app.cell(hide_code=True)
431
  def _(mo):
432
+ mo.md("""
 
433
  #### Orient
434
 
435
  Let's recall the DataFrame we made by using an array and the data used to make it.
436
+ """)
 
437
  return
438
 
439
 
 
451
 
452
  @app.cell(hide_code=True)
453
  def _(mo):
454
+ mo.md(r"""
455
+ Notice how Polars decided to make each inner array a row in the DataFrame. If we wanted to make it so that each inner array was a column instead of a row, all we would need to do is pass `"col"` into the orient parameter.
456
+ """)
457
  return
458
 
459
 
 
465
 
466
  @app.cell(hide_code=True)
467
  def _(mo):
468
+ mo.md(r"""
469
+ If we wanted to do the opposite, then we pass `"row"` into the orient parameter.
470
+ """)
471
  return
472
 
473
 
 
485
 
486
  @app.cell(hide_code=True)
487
  def _(mo):
488
+ mo.md(r"""
 
489
  #### Infer_Schema_Length
490
 
491
  Without setting the schema ourselves, Polars uses the data provided to infer the data types of the columns. It does this by looking at each of the rows in the data provided. You can specify to Polars how many rows to look at by using the infer_schema_length parameter. For example, if you were to set this parameter to 5, then Polars would use the first 5 rows to infer the schema.
492
+ """)
 
493
  return
494
 
495
 
496
  @app.cell(hide_code=True)
497
  def _(mo):
498
+ mo.md(r"""
 
499
  #### NaN_To_Null
500
 
501
  If there are np.nan values in the data, you can convert them to null values by setting the nan_to_null parameter to `True`.
502
+ """)
 
503
  return
504
 
505
 
506
  @app.cell(hide_code=True)
507
  def _(mo):
508
+ mo.md(r"""
 
509
  ## Summary
510
 
511
+ DataFrames are a useful data structure that can be used to organize and perform additional analysis on your data. In this notebook, we have learned how to define DataFrames, what can be a DataFrame, the structure of it, and additional parameters you can set while creating it.
512
 
513
  In order to create a DataFrame, you pass your data into the .DataFrame() method through the data parameter. The data you pass through must be either a dictionary, sequence, array, series, or pandas DataFrame. Once defined, the DataFrame will separate the data into different columns and the data within the column must have the same data type. There exists additional parameters besides data that allows you to further customize the ending DataFrame. Some examples of these are orient, strict, and infer_schema_length.
514
+ """)
 
515
  return
516
 
517
 
polars/03_loading_data.py CHANGED
@@ -14,14 +14,13 @@
14
 
15
  import marimo
16
 
17
- __generated_with = "0.15.2"
18
  app = marimo.App(width="medium")
19
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
- mo.md(
24
- r"""
25
  # Loading Data
26
 
27
  _By [etrotta](https://github.com/etrotta)._
@@ -29,8 +28,7 @@ def _(mo):
29
  This tutorial covers how to load data of varying formats and from different sources using [polars](https://docs.pola.rs/).
30
 
31
  It includes examples of how to load and write to a variety of formats, shows how to convert data from other libraries to support formats not supported directly by polars, includes relevant links for users that need to connect with external sources, and explains how to deal with custom formats via plugins.
32
- """
33
- )
34
  return
35
 
36
 
@@ -80,12 +78,10 @@ def _(mo, pl):
80
 
81
  @app.cell(hide_code=True)
82
  def _(mo):
83
- mo.md(
84
- r"""
85
  ## Parquet
86
Parquet is a popular format for storing tabular data based on the Arrow memory spec. It is a great default, and you'll find a lot of datasets already using it on sites like Hugging Face
87
- """
88
- )
89
  return
90
 
91
 
@@ -100,14 +96,12 @@ def _(df, folder, pl):
100
 
101
  @app.cell(hide_code=True)
102
  def _(mo):
103
- mo.md(
104
- r"""
105
  ## CSV
106
  A classic and common format that has been widely used for decades.
107
 
108
  The API is almost identical to Parquet - You can just replace `parquet` by `csv` and it will work with the default settings, but polars also allows for you to customize some settings such as the delimiter and quoting rules.
109
- """
110
- )
111
  return
112
 
113
 
@@ -123,8 +117,7 @@ def _(df, folder, lz, pl):
123
 
124
  @app.cell(hide_code=True)
125
  def _(mo):
126
- mo.md(
127
- r"""
128
  ## JSON
129
 
130
JavaScript Object Notation is somewhat commonly used for storing unstructured data, and extremely commonly used for API responses.
@@ -138,8 +131,7 @@ def _(mo):
138
  Polars supports Lists with variable length, Arrays with fixed length, and Structs with well defined fields, but not mappings with arbitrary keys.
139
 
140
  You might want to transform data by unnesting structs and exploding lists after loading from complex JSON files.
141
- """
142
- )
143
  return
144
 
145
 
@@ -163,8 +155,7 @@ def _(df, folder, lz, pl):
163
 
164
  @app.cell(hide_code=True)
165
  def _(mo):
166
- mo.md(
167
- r"""
168
  ## Databases
169
 
170
Polars doesn't support any databases _directly_, but rather uses other libraries as engines. Reading and writing to databases using polars methods does not support lazy execution, but you may pass an SQL query for the database to pre-filter the data before it reaches polars. See the [User Guide](https://docs.pola.rs/user-guide/io/database) for more details.
@@ -172,8 +163,7 @@ def _(mo):
172
  You can also use other libraries with [arrow support](#arrow-support) or [polars plugins](#plugin-support) to read from databases before loading into polars, some of which support lazy reading.
173
 
174
  Using the Arrow Database Connectivity SQLite support as an example:
175
- """
176
- )
177
  return
178
 
179
 
@@ -190,43 +180,37 @@ def _(df, folder, pl):
190
 
191
  @app.cell(hide_code=True)
192
  def _(mo):
193
- mo.md(
194
- r"""
195
  ## Excel
196
 
197
  From a performance perspective, we recommend using other formats if possible, such as Parquet or CSV files.
198
 
199
Similarly to databases, polars doesn't support it natively, but rather uses other libraries as engines. See the [User Guide](https://docs.pola.rs/user-guide/io/excel) if you need to use it.
200
- """
201
- )
202
  return
203
 
204
 
205
  @app.cell(hide_code=True)
206
  def _(mo):
207
- mo.md(
208
- r"""
209
  ## Others natively supported
210
 
211
  If you understood the above examples, then all other formats should feel familiar - the core API is the same for all formats, `read` and `write` for the Eager API or `scan` and `sink` for the lazy API.
212
 
213
  See https://docs.pola.rs/api/python/stable/reference/io.html for the full list of formats natively supported by Polars
214
- """
215
- )
216
  return
217
 
218
 
219
  @app.cell(hide_code=True)
220
  def _(mo):
221
- mo.md(
222
- r"""
223
  ## Arrow Support
224
 
225
  You can convert Arrow compatible data from other libraries such as `pandas`, `duckdb` or `pyarrow` to polars DataFrames and vice-versa, much of the time without even having to copy data.
226
 
227
This allows you to use other libraries to load data in formats not supported by polars, then convert the dataframe in memory to polars.
228
- """
229
- )
230
  return
231
 
232
 
@@ -241,13 +225,11 @@ def _(df, folder, pd, pl):
241
 
242
  @app.cell(hide_code=True)
243
  def _(mo):
244
- mo.md(
245
- r"""
246
  ## Plugin Support
247
 
248
  You can also write [IO Plugins](https://docs.pola.rs/user-guide/plugins/io_plugins/) for Polars in order to support any format you need, or use other libraries that support polars via their own plugins such as DuckDB.
249
- """
250
- )
251
  return
252
 
253
 
@@ -261,8 +243,7 @@ def _(duckdb, folder):
261
 
262
  @app.cell(hide_code=True)
263
  def _(mo):
264
- mo.md(
265
- r"""
266
  ### Creating your own Plugin
267
 
268
The simplest form of plugin is essentially a generator that yields DataFrames.
@@ -273,12 +254,11 @@ def _(mo):
273
 
274
  - You must use `register_io_source` for polars to create the LazyFrame which will consume the Generator
275
  - You are expected to provide a Schema before the Generator starts
276
- - - For many use cases the Plugin may be able to infer it, but you could also pass it explicitly to the plugin function
277
  - Ideally you should parse some of the filters and column selectors to avoid unnecessary work, but it is possible to delegate that to polars after loading the data in order to keep it simpler (at the cost of efficiency)
278
 
279
  Efficiently parsing the filter expressions is out of the scope for this notebook.
280
- """
281
- )
282
  return
283
 
284
 
@@ -351,8 +331,7 @@ def _(Iterator, get_positional_names, itertools, pl, register_io_source):
351
 
352
  @app.cell(hide_code=True)
353
  def _(mo):
354
- mo.md(
355
- r"""
356
  ### DuckDB
357
 
358
  As demonstrated above, in addition to Arrow interoperability support, [DuckDB](https://duckdb.org/) also has added support for loading query results into a polars DataFrame or LazyFrame via a polars plugin.
@@ -363,8 +342,7 @@ def _(mo):
363
  - https://duckdb.org/docs/stable/guides/python/polars.html
364
 
365
  You can also learn more about DuckDB in the marimo course about it, including marimo's SQL-related features.
366
- """
367
- )
368
  return
369
 
370
 
@@ -398,16 +376,14 @@ def _(duckdb_conn, duckdb_query):
398
 
399
  @app.cell(hide_code=True)
400
  def _(mo):
401
- mo.md(
402
- r"""
403
  ## Hive Partitions
404
 
405
  There is also support for [Hive](https://docs.pola.rs/user-guide/io/hive/) partitioned data, but parts of the API are still unstable (may change in future polars versions).
407
 
408
  Even without using partitions, many methods also support glob patterns to read multiple files in the same folder such as `scan_csv(folder / "*.csv")`
409
- """
410
- )
411
  return
412
 
413
 
@@ -422,28 +398,24 @@ def _(df, folder, pl):
422
 
423
  @app.cell(hide_code=True)
424
  def _(mo):
425
- mo.md(
426
- r"""
427
  # Reading from the Cloud
428
 
429
- Polars also has support for reading public and private datasets from multiple websites
430
  and cloud storage solutions.
431
 
432
  If you must (re)use the same file many times on the same machine, you may want to manually download it and then load it from your local file system to avoid re-downloading, or download and write it to disk only if the file does not exist.
433
- """
434
- )
435
  return
436
 
437
 
438
  @app.cell(hide_code=True)
439
  def _(mo):
440
- mo.md(
441
- r"""
442
  ## Arbitrary web sites
443
 
444
  You can load files from nearly any website just by using an HTTPS URL, as long as it is not locked behind authorization.
445
- """
446
- )
447
  return
448
 
449
 
@@ -455,15 +427,13 @@ def _():
455
 
456
  @app.cell(hide_code=True)
457
  def _(mo):
458
- mo.md(
459
- r"""
460
  ## Hugging Face & Kaggle Datasets
461
 
462
  Look for polars in dropdowns such as "Use this dataset" on Hugging Face or "Code" on Kaggle, and oftentimes you'll get a snippet to load data directly into a dataframe you can use.
463
 
464
  Read more: [Hugging Face](https://docs.pola.rs/user-guide/io/hugging-face/), [Kaggle](https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpolars)
465
- """
466
- )
467
  return
468
 
469
 
@@ -475,15 +445,13 @@ def _():
475
 
476
  @app.cell(hide_code=True)
477
  def _(mo):
478
- mo.md(
479
- r"""
480
  ## Cloud Storage - AWS S3, Azure Blob Storage, Google Cloud Storage
481
 
482
  The API is the same for all three storage providers, check the [User Guide](https://docs.pola.rs/user-guide/io/cloud-storage/) if you need any of them.
483
 
484
  Runnable examples are not included in this Notebook as it would require setting up authentication, but the disabled cell below shows an example using Azure.
485
- """
486
- )
487
  return
488
 
489
 
@@ -510,13 +478,11 @@ def _(adlfs, df, os, pl):
510
 
511
  @app.cell(hide_code=True)
512
  def _(mo):
513
- mo.md(
514
- r"""
515
  # Multiplexing
516
 
517
  You can also split a query into multiple sinks via [multiplexing](https://docs.pola.rs/user-guide/lazy/multiplexing/), to avoid reading the data multiple times, repeating the same operations for each sink, or collecting intermediate results into memory.
518
- """
519
- )
520
  return
521
 
522
 
@@ -540,13 +506,11 @@ def _(folder, lz, pl):
540
 
541
  @app.cell(hide_code=True)
542
  def _(mo):
543
- mo.md(
544
- r"""
545
  # Async Execution
546
 
547
  Polars also has experimental support for running lazy queries in `async` mode, letting you `await` operations inside of async functions.
548
- """
549
- )
550
  return
551
 
552
 
@@ -566,27 +530,23 @@ async def _(folder, lz, pl, sinks):
566
 
567
  @app.cell(hide_code=True)
568
  def _(mo):
569
- mo.md(
570
- r"""
571
  ## Conclusion
572
  As you have seen, polars makes it easy to work with a variety of formats and different data sources.
573
 
574
  From natively supported formats such as Parquet and CSV files, to using other libraries as an intermediary for XML or geospatial data, and plugins for newly emerging or proprietary formats, as long as your data can fit in a table then odds are you can turn it into a polars DataFrame.
575
 
576
  Combined with loading directly from remote sources, including public data platforms such as Hugging Face and Kaggle as well as private data in your cloud, you can import datasets for almost anything you can imagine.
577
- """
578
- )
579
  return
580
 
581
 
582
  @app.cell(hide_code=True)
583
  def _(mo):
584
- mo.md(
585
- r"""
586
  ## Utilities
587
  Imports, utility functions, and the like used throughout the notebook
588
- """
589
- )
590
  return
591
 
592
 
 
14
 
15
  import marimo
16
 
17
+ __generated_with = "0.18.4"
18
  app = marimo.App(width="medium")
19
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
+ mo.md(r"""
 
24
  # Loading Data
25
 
26
  _By [etrotta](https://github.com/etrotta)._
 
28
  This tutorial covers how to load data of varying formats and from different sources using [polars](https://docs.pola.rs/).
29
 
30
  It includes examples of how to load and write to a variety of formats, shows how to convert data from other libraries to support formats not supported directly by polars, includes relevant links for users that need to connect with external sources, and explains how to deal with custom formats via plugins.
31
+ """)
 
32
  return
33
 
34
 
 
78
 
79
  @app.cell(hide_code=True)
80
  def _(mo):
81
+ mo.md(r"""
 
82
  ## Parquet
83
  Parquet is a popular format for storing tabular data based on the Arrow memory spec. It is a great default, and you'll find a lot of datasets already using it on sites like Hugging Face.
84
+ """)
 
85
  return
86
 
87
 
 
96
 
97
  @app.cell(hide_code=True)
98
  def _(mo):
99
+ mo.md(r"""
 
100
  ## CSV
101
  A classic and common format that has been widely used for decades.
102
 
103
  The API is almost identical to Parquet - you can just replace `parquet` with `csv` and it will work with the default settings, but polars also allows you to customize settings such as the delimiter and quoting rules.
104
+ """)
 
105
  return
106
 
107
 
 
117
 
118
  @app.cell(hide_code=True)
119
  def _(mo):
120
+ mo.md(r"""
 
121
  ## JSON
122
 
123
  JavaScript Object Notation is somewhat commonly used for storing unstructured data, and extremely commonly used for API responses.
 
131
  Polars supports Lists with variable length, Arrays with fixed length, and Structs with well defined fields, but not mappings with arbitrary keys.
132
 
133
  You might want to transform data by unnesting structs and exploding lists after loading from complex JSON files.
134
+ """)
 
135
  return
136
 
137
 
 
155
 
156
  @app.cell(hide_code=True)
157
  def _(mo):
158
+ mo.md(r"""
 
159
  ## Databases
160
 
161
  Polars doesn't support any databases _directly_, but rather uses other libraries as engines. Reading and writing to databases using polars methods does not support lazy execution, but you may pass an SQL query for the database to pre-filter the data before it reaches polars. See the [User Guide](https://docs.pola.rs/user-guide/io/database) for more details.
 
163
  You can also use other libraries with [arrow support](#arrow-support) or [polars plugins](#plugin-support) to read from databases before loading into polars, some of which support lazy reading.
164
 
165
  Using the Arrow Database Connectivity SQLite support as an example:
166
+ """)
 
167
  return
168
 
169
 
 
180
 
181
  @app.cell(hide_code=True)
182
  def _(mo):
183
+ mo.md(r"""
 
184
  ## Excel
185
 
186
  From a performance perspective, we recommend using other formats if possible, such as Parquet or CSV files.
187
 
188
  Similarly to databases, polars doesn't support it natively but rather uses other libraries as engines. See the [User Guide](https://docs.pola.rs/user-guide/io/excel) if you need to use it.
189
+ """)
 
190
  return
191
 
192
 
193
  @app.cell(hide_code=True)
194
  def _(mo):
195
+ mo.md(r"""
 
196
  ## Others natively supported
197
 
198
  If you understood the above examples, then all other formats should feel familiar - the core API is the same for all formats, `read` and `write` for the Eager API or `scan` and `sink` for the lazy API.
199
 
200
  See https://docs.pola.rs/api/python/stable/reference/io.html for the full list of formats natively supported by Polars
201
+ """)
 
202
  return
203
 
204
 
205
  @app.cell(hide_code=True)
206
  def _(mo):
207
+ mo.md(r"""
 
208
  ## Arrow Support
209
 
210
  You can convert Arrow compatible data from other libraries such as `pandas`, `duckdb` or `pyarrow` to polars DataFrames and vice-versa, much of the time without even having to copy data.
211
 
212
  This allows you to use other libraries to load data in formats not supported by polars, then convert the dataframe in memory to polars.
213
+ """)
 
214
  return
215
 
216
 
 
225
 
226
  @app.cell(hide_code=True)
227
  def _(mo):
228
+ mo.md(r"""
 
229
  ## Plugin Support
230
 
231
  You can also write [IO Plugins](https://docs.pola.rs/user-guide/plugins/io_plugins/) for Polars in order to support any format you need, or use other libraries that support polars via their own plugins such as DuckDB.
232
+ """)
 
233
  return
234
 
235
 
 
243
 
244
  @app.cell(hide_code=True)
245
  def _(mo):
246
+ mo.md(r"""
 
247
  ### Creating your own Plugin
248
 
249
  The simplest form of plugin is essentially a generator that yields DataFrames.
 
254
 
255
  - You must use `register_io_source` for polars to create the LazyFrame which will consume the Generator
256
  - You are expected to provide a Schema before the Generator starts
257
+     - For many use cases the Plugin may be able to infer it, but you could also pass it explicitly to the plugin function
258
  - Ideally you should parse some of the filters and column selectors to avoid unnecessary work, but it is possible to delegate that to polars after loading the data in order to keep it simpler (at the cost of efficiency)
259
 
260
  Efficiently parsing the filter expressions is out of scope for this notebook.
261
+ """)
 
262
  return
263
 
264
 
 
331
 
332
  @app.cell(hide_code=True)
333
  def _(mo):
334
+ mo.md(r"""
 
335
  ### DuckDB
336
 
337
  As demonstrated above, in addition to Arrow interoperability support, [DuckDB](https://duckdb.org/) has also added support for loading query results into a polars DataFrame or LazyFrame via a polars plugin.
 
342
  - https://duckdb.org/docs/stable/guides/python/polars.html
343
 
344
  You can also learn more about DuckDB in the marimo course about it, including marimo's SQL-related features.
345
+ """)
 
346
  return
347
 
348
 
 
376
 
377
  @app.cell(hide_code=True)
378
  def _(mo):
379
+ mo.md(r"""
 
380
  ## Hive Partitions
381
 
382
  There is also support for [Hive](https://docs.pola.rs/user-guide/io/hive/) partitioned data, but parts of the API are still unstable (may change in future polars versions).
384
 
385
  Even without using partitions, many methods also support glob patterns to read multiple files in the same folder such as `scan_csv(folder / "*.csv")`
386
+ """)
 
387
  return
388
 
389
 
 
398
 
399
  @app.cell(hide_code=True)
400
  def _(mo):
401
+ mo.md(r"""
 
402
  # Reading from the Cloud
403
 
404
+ Polars also has support for reading public and private datasets from multiple websites
405
  and cloud storage solutions.
406
 
407
  If you must (re)use the same file many times on the same machine, you may want to manually download it and then load it from your local file system to avoid re-downloading, or download and write it to disk only if the file does not exist.
408
+ """)
 
409
  return
410
 
411
 
412
  @app.cell(hide_code=True)
413
  def _(mo):
414
+ mo.md(r"""
 
415
  ## Arbitrary web sites
416
 
417
  You can load files from nearly any website just by using an HTTPS URL, as long as it is not locked behind authorization.
418
+ """)
 
419
  return
420
 
421
 
 
427
 
428
  @app.cell(hide_code=True)
429
  def _(mo):
430
+ mo.md(r"""
 
431
  ## Hugging Face & Kaggle Datasets
432
 
433
  Look for polars in dropdowns such as "Use this dataset" on Hugging Face or "Code" on Kaggle, and oftentimes you'll get a snippet to load data directly into a dataframe you can use.
434
 
435
  Read more: [Hugging Face](https://docs.pola.rs/user-guide/io/hugging-face/), [Kaggle](https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpolars)
436
+ """)
 
437
  return
438
 
439
 
 
445
 
446
  @app.cell(hide_code=True)
447
  def _(mo):
448
+ mo.md(r"""
 
449
  ## Cloud Storage - AWS S3, Azure Blob Storage, Google Cloud Storage
450
 
451
  The API is the same for all three storage providers, check the [User Guide](https://docs.pola.rs/user-guide/io/cloud-storage/) if you need any of them.
452
 
453
  Runnable examples are not included in this Notebook as it would require setting up authentication, but the disabled cell below shows an example using Azure.
454
+ """)
 
455
  return
456
 
457
 
 
478
 
479
  @app.cell(hide_code=True)
480
  def _(mo):
481
+ mo.md(r"""
 
482
  # Multiplexing
483
 
484
  You can also split a query into multiple sinks via [multiplexing](https://docs.pola.rs/user-guide/lazy/multiplexing/), to avoid reading the data multiple times, repeating the same operations for each sink, or collecting intermediate results into memory.
485
+ """)
 
486
  return
487
 
488
 
 
506
 
507
  @app.cell(hide_code=True)
508
  def _(mo):
509
+ mo.md(r"""
 
510
  # Async Execution
511
 
512
  Polars also has experimental support for running lazy queries in `async` mode, letting you `await` operations inside of async functions.
513
+ """)
 
514
  return
515
 
516
 
 
530
 
531
  @app.cell(hide_code=True)
532
  def _(mo):
533
+ mo.md(r"""
 
534
  ## Conclusion
535
  As you have seen, polars makes it easy to work with a variety of formats and different data sources.
536
 
537
  From natively supported formats such as Parquet and CSV files, to using other libraries as an intermediary for XML or geospatial data, and plugins for newly emerging or proprietary formats, as long as your data can fit in a table then odds are you can turn it into a polars DataFrame.
538
 
539
  Combined with loading directly from remote sources, including public data platforms such as Hugging Face and Kaggle as well as private data in your cloud, you can import datasets for almost anything you can imagine.
540
+ """)
 
541
  return
542
 
543
 
544
  @app.cell(hide_code=True)
545
  def _(mo):
546
+ mo.md(r"""
 
547
  ## Utilities
548
  Imports, utility functions, and the like used throughout the notebook
549
+ """)
 
550
  return
551
 
552
 
polars/04_basic_operations.py CHANGED
@@ -8,7 +8,7 @@
8
 
9
  import marimo
10
 
11
- __generated_with = "0.11.13"
12
  app = marimo.App(width="medium")
13
 
14
 
@@ -20,14 +20,12 @@ def _():
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
- mo.md(
24
- r"""
25
- # Basic operations on data
26
- _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
27
 
28
- In this notebook, you'll learn how to perform arithmetic operations, comparisons, and conditionals on a Polars dataframe. We'll work with a DataFrame that tracks software usage by year, categorized as either Vintage (old) or Modern (new).
29
- """
30
- )
31
  return
32
 
33
 
@@ -107,13 +105,11 @@ def _():
107
 
108
  @app.cell(hide_code=True)
109
  def _(mo):
110
- mo.md(
111
- r"""
112
- ## Arithmetic
113
- ### Addition
114
- Let's add 42 users to each piece of software. This means adding 42 to each value under **users**.
115
- """
116
- )
117
  return
118
 
119
 
@@ -125,7 +121,9 @@ def _(df, pl):
125
 
126
  @app.cell(hide_code=True)
127
  def _(mo):
128
- mo.md(r"""Another way to perform the above operation is using the built-in function.""")
 
 
129
  return
130
 
131
 
@@ -137,12 +135,10 @@ def _(df, pl):
137
 
138
  @app.cell(hide_code=True)
139
  def _(mo):
140
- mo.md(
141
- r"""
142
- ### Subtraction
143
- Let's subtract 42 users to each piece of software.
144
- """
145
- )
146
  return
147
 
148
 
@@ -154,7 +150,9 @@ def _(df, pl):
154
 
155
  @app.cell(hide_code=True)
156
  def _(mo):
157
- mo.md(r"""Alternatively, you could subtract like this:""")
 
 
158
  return
159
 
160
 
@@ -166,12 +164,10 @@ def _(df, pl):
166
 
167
  @app.cell(hide_code=True)
168
  def _(mo):
169
- mo.md(
170
- r"""
171
- ### Division
172
- Suppose the **users** values are inflated, we can reduce them by dividing by 1000. Here's how to do it.
173
- """
174
- )
175
  return
176
 
177
 
@@ -183,7 +179,9 @@ def _(df, pl):
183
 
184
  @app.cell(hide_code=True)
185
  def _(mo):
186
- mo.md(r"""Or we could do it with a built-in expression.""")
 
 
187
  return
188
 
189
 
@@ -195,7 +193,9 @@ def _(df, pl):
195
 
196
  @app.cell(hide_code=True)
197
  def _(mo):
198
- mo.md(r"""If we didn't care about the remainder after division (i.e remove numbers after decimal point) we could do it like this.""")
 
 
199
  return
200
 
201
 
@@ -207,12 +207,10 @@ def _(df, pl):
207
 
208
  @app.cell(hide_code=True)
209
  def _(mo):
210
- mo.md(
211
- r"""
212
- ### Multiplication
213
- Let's pretend the *user* values are deflated and increase them by multiplying by 100.
214
- """
215
- )
216
  return
217
 
218
 
@@ -224,7 +222,9 @@ def _(df, pl):
224
 
225
  @app.cell(hide_code=True)
226
  def _(mo):
227
- mo.md(r"""Polars also has a built-in function for multiplication.""")
 
 
228
  return
229
 
230
 
@@ -236,7 +236,9 @@ def _(df, pl):
236
 
237
  @app.cell(hide_code=True)
238
  def _(mo):
239
- mo.md(r"""So far, we've only modified the values in an existing column. Let's create a column **decade** that will represent the years as decades. Thus 1985 will be 1980 and 2008 will be 2000.""")
 
 
240
  return
241
 
242
 
@@ -248,7 +250,9 @@ def _(df, pl):
248
 
249
  @app.cell(hide_code=True)
250
  def _(mo):
251
- mo.md(r"""We could create a new column another way as follows:""")
 
 
252
  return
253
 
254
 
@@ -260,16 +264,14 @@ def _(df, pl):
260
 
261
  @app.cell(hide_code=True)
262
  def _(mo):
263
- mo.md(
264
- r"""
265
- **Tip**
266
- Polars encounrages you to perform your operations as a chain. This enables you to take advantage of the query optimizer. We'll build upon the above code as a chain.
267
 
268
- ## Comparison
269
- ### Equal
270
- Let's get all the software categorized as Vintage.
271
- """
272
- )
273
  return
274
 
275
 
@@ -284,7 +286,9 @@ def _(df, pl):
284
 
285
  @app.cell(hide_code=True)
286
  def _(mo):
287
- mo.md(r"""We could also do a double comparison. VisiCal is the only software that's vintage and in the decade 1970s. Let's perform this comparison operation.""")
 
 
288
  return
289
 
290
 
@@ -300,13 +304,11 @@ def _(df, pl):
300
 
301
  @app.cell(hide_code=True)
302
  def _(mo):
303
- mo.md(
304
- r"""
305
- We could also do this comparison in one line, if readability is not a concern
306
 
307
- **Notice** that we must enclose the two expressions between the `&` with parenthesis.
308
- """
309
- )
310
  return
311
 
312
 
@@ -321,7 +323,9 @@ def _(df, pl):
321
 
322
  @app.cell(hide_code=True)
323
  def _(mo):
324
- mo.md(r"""We can also use the built-in function for equal to comparisons.""")
 
 
325
  return
326
 
327
 
@@ -336,12 +340,10 @@ def _(df, pl):
336
 
337
  @app.cell(hide_code=True)
338
  def _(mo):
339
- mo.md(
340
- r"""
341
- ### Not equal
342
- We can also compare if something is `not` equal to something. In this case, category is not vintage.
343
- """
344
- )
345
  return
346
 
347
 
@@ -356,7 +358,9 @@ def _(df, pl):
356
 
357
  @app.cell(hide_code=True)
358
  def _(mo):
359
- mo.md(r"""Or with the built-in function.""")
 
 
360
  return
361
 
362
 
@@ -371,7 +375,9 @@ def _(df, pl):
371
 
372
  @app.cell(hide_code=True)
373
  def _(mo):
374
- mo.md(r"""Or if you want to be extra clever, you can use the negation symbol `~` used in logic.""")
 
 
375
  return
376
 
377
 
@@ -386,12 +392,10 @@ def _(df, pl):
386
 
387
  @app.cell(hide_code=True)
388
  def _(mo):
389
- mo.md(
390
- r"""
391
- ### Greater than
392
- Let's get the software where the year is greater than 2008 from the above dataframe.
393
- """
394
- )
395
  return
396
 
397
 
@@ -407,7 +411,9 @@ def _(df, pl):
407
 
408
  @app.cell(hide_code=True)
409
  def _(mo):
410
- mo.md(r"""Or if we wanted the year 2008 to be included, we could use great or equal to.""")
 
 
411
  return
412
 
413
 
@@ -423,7 +429,9 @@ def _(df, pl):
423
 
424
  @app.cell(hide_code=True)
425
  def _(mo):
426
- mo.md(r"""We could do the previous two operations with built-in functions. Here's with greater than.""")
 
 
427
  return
428
 
429
 
@@ -439,7 +447,9 @@ def _(df, pl):
439
 
440
  @app.cell(hide_code=True)
441
  def _(mo):
442
- mo.md(r"""And here's with greater or equal to""")
 
 
443
  return
444
 
445
 
@@ -455,14 +465,12 @@ def _(df, pl):
455
 
456
  @app.cell(hide_code=True)
457
  def _(mo):
458
- mo.md(
459
- r"""
460
- **Note**: For "less than", and "less or equal to" you can use the operators `<` or `<=`. Alternatively, you can use built-in functions `lt` or `le` respectively.
461
 
462
- ### Is between
463
- Polars also allows us to filter between a range of values. Let's get the modern software were the year is between 2013 and 2016. This is inclusive on both ends (i.e. both years are part of the result).
464
- """
465
- )
466
  return
467
 
468
 
@@ -478,14 +486,12 @@ def _(df, pl):
478
 
479
  @app.cell(hide_code=True)
480
  def _(mo):
481
- mo.md(
482
- r"""
483
- ### Or operator
484
- If we only want either one of the conditions in the comparison to be met, we could use `|`, which is the `or` operator.
485
 
486
- Let's get software that is either modern or used in the decade 1980s.
487
- """
488
- )
489
  return
490
 
491
 
@@ -500,14 +506,12 @@ def _(df, pl):
500
 
501
  @app.cell(hide_code=True)
502
  def _(mo):
503
- mo.md(
504
- r"""
505
- ## Conditionals
506
- Polars also allows you create new columns based on a condition. Let's create a column *status* that will indicate if the software is "discontinued" or "in use".
507
 
508
- Here's a list of products that are no longer in use.
509
- """
510
- )
511
  return
512
 
513
 
@@ -519,7 +523,9 @@ def _():
519
 
520
  @app.cell(hide_code=True)
521
  def _(mo):
522
- mo.md(r"""Here's how we can get a dataframe of the products that are discontinued.""")
 
 
523
  return
524
 
525
 
@@ -534,7 +540,9 @@ def _(df, discontinued_list, pl):
534
 
535
  @app.cell(hide_code=True)
536
  def _(mo):
537
- mo.md(r"""Now, let's create the **status** column.""")
 
 
538
  return
539
 
540
 
@@ -553,12 +561,10 @@ def _(df, discontinued_list, pl):
553
 
554
  @app.cell(hide_code=True)
555
  def _(mo):
556
- mo.md(
557
- r"""
558
- ## Unique counts
559
- Sometimes you may want to see only the unique values in a column. Let's check the unique decades we have in our DataFrame.
560
- """
561
- )
562
  return
563
 
564
 
@@ -578,7 +584,9 @@ def _(df, discontinued_list, pl):
578
 
579
  @app.cell(hide_code=True)
580
  def _(mo):
581
- mo.md(r"""Finally, let's find out the number of software used in each decade.""")
 
 
582
  return
583
 
584
 
@@ -598,7 +606,9 @@ def _(df, discontinued_list, pl):
598
 
599
  @app.cell(hide_code=True)
600
  def _(mo):
601
- mo.md(r"""We could also rewrite the above code as follows:""")
 
 
602
  return
603
 
604
 
@@ -618,7 +628,9 @@ def _(df, discontinued_list, pl):
618
 
619
  @app.cell(hide_code=True)
620
  def _(mo):
621
- mo.md(r"""Hopefully, we've picked your interest to try out Polars the next time you analyze your data.""")
 
 
622
  return
623
 
624
 
 
8
 
9
  import marimo
10
 
11
+ __generated_with = "0.18.4"
12
  app = marimo.App(width="medium")
13
 
14
 
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
+ mo.md(r"""
24
+ # Basic operations on data
25
+ _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
 
26
 
27
+ In this notebook, you'll learn how to perform arithmetic operations, comparisons, and conditionals on a Polars dataframe. We'll work with a DataFrame that tracks software usage by year, categorized as either Vintage (old) or Modern (new).
28
+ """)
 
29
  return
30
 
31
 
 
105
 
106
  @app.cell(hide_code=True)
107
  def _(mo):
108
+ mo.md(r"""
109
+ ## Arithmetic
110
+ ### Addition
111
+ Let's add 42 users to each piece of software. This means adding 42 to each value under **users**.
112
+ """)
 
 
113
  return
114
 
115
 
 
121
 
122
  @app.cell(hide_code=True)
123
  def _(mo):
124
+ mo.md(r"""
125
+ Another way to perform the above operation is using the built-in function.
126
+ """)
127
  return
128
 
129
 
 
135
 
136
  @app.cell(hide_code=True)
137
  def _(mo):
138
+ mo.md(r"""
139
+ ### Subtraction
140
+ Let's subtract 42 users from each piece of software.
141
+ """)
 
 
142
  return
143
 
144
 
 
150
 
151
  @app.cell(hide_code=True)
152
  def _(mo):
153
+ mo.md(r"""
154
+ Alternatively, you could subtract like this:
155
+ """)
156
  return
157
 
158
 
 
164
 
165
  @app.cell(hide_code=True)
166
  def _(mo):
167
+ mo.md(r"""
168
+ ### Division
169
+ Suppose the **users** values are inflated; we can reduce them by dividing by 1000. Here's how to do it.
170
+ """)
 
 
171
  return
172
 
173
 
 
179
 
180
  @app.cell(hide_code=True)
181
  def _(mo):
182
+ mo.md(r"""
183
+ Or we could do it with a built-in expression.
184
+ """)
185
  return
186
 
187
 
 
193
 
194
  @app.cell(hide_code=True)
195
  def _(mo):
196
+ mo.md(r"""
197
+ If we didn't care about the remainder after division (i.e. remove the numbers after the decimal point) we could do it like this.
198
+ """)
199
  return
200
 
201
 
 
207
 
208
  @app.cell(hide_code=True)
209
  def _(mo):
210
+ mo.md(r"""
211
+ ### Multiplication
212
+ Let's pretend the *user* values are deflated and increase them by multiplying by 100.
213
+ """)
 
 
214
  return
215
 
216
 
 
222
 
223
  @app.cell(hide_code=True)
224
  def _(mo):
225
+ mo.md(r"""
226
+ Polars also has a built-in function for multiplication.
227
+ """)
228
  return
229
 
230
 
 
236
 
237
  @app.cell(hide_code=True)
238
  def _(mo):
239
+ mo.md(r"""
240
+ So far, we've only modified the values in an existing column. Let's create a column **decade** that will represent the years as decades. Thus 1985 will be 1980 and 2008 will be 2000.
241
+ """)
242
  return
243
 
244
 
 
250
 
251
  @app.cell(hide_code=True)
252
  def _(mo):
253
+ mo.md(r"""
254
+ We could create a new column another way as follows:
255
+ """)
256
  return
257
 
258
 
 
264
 
265
  @app.cell(hide_code=True)
266
  def _(mo):
267
+ mo.md(r"""
268
+ **Tip**
269
+ Polars encourages you to perform your operations as a chain. This enables you to take advantage of the query optimizer. We'll build upon the above code as a chain.
 
270
 
271
+ ## Comparison
272
+ ### Equal
273
+ Let's get all the software categorized as Vintage.
274
+ """)
 
275
  return
276
 
277
 
 
286
 
287
  @app.cell(hide_code=True)
288
  def _(mo):
289
+ mo.md(r"""
290
+ We could also do a double comparison. VisiCal is the only software that's vintage and in the decade 1970s. Let's perform this comparison operation.
291
+ """)
292
  return
293
 
294
 
 
304
 
305
  @app.cell(hide_code=True)
306
  def _(mo):
307
+ mo.md(r"""
308
+ We could also do this comparison in one line, if readability is not a concern.
 
309
 
310
+ **Notice** that we must enclose each of the two expressions around the `&` in parentheses.
311
+ """)
 
312
  return
313
 
314
 
 
323
 
324
  @app.cell(hide_code=True)
325
  def _(mo):
326
+ mo.md(r"""
327
+ We can also use the built-in function for equal to comparisons.
328
+ """)
329
  return
330
 
331
 
 
340
 
341
  @app.cell(hide_code=True)
342
  def _(mo):
343
+ mo.md(r"""
344
+ ### Not equal
345
+ We can also check whether something is `not` equal to something else. In this case, the category is not vintage.
346
+ """)
 
 
347
  return
348
 
349
 
 
358
 
359
  @app.cell(hide_code=True)
360
  def _(mo):
361
+ mo.md(r"""
362
+ Or with the built-in function.
363
+ """)
364
  return
365
 
366
 
 
375
 
376
  @app.cell(hide_code=True)
377
  def _(mo):
378
+ mo.md(r"""
379
+ Or if you want to be extra clever, you can use the negation symbol `~` used in logic.
380
+ """)
381
  return
382
 
383
 
 
392
 
393
  @app.cell(hide_code=True)
394
  def _(mo):
395
+ mo.md(r"""
396
+ ### Greater than
397
+ Let's get the software where the year is greater than 2008 from the above dataframe.
398
+ """)
 
 
399
  return
400
 
401
 
 
411
 
412
  @app.cell(hide_code=True)
413
  def _(mo):
414
+ mo.md(r"""
415
+ Or if we wanted the year 2008 to be included, we could use greater than or equal to.
416
+ """)
417
  return
418
 
419
 
 
429
 
430
  @app.cell(hide_code=True)
431
  def _(mo):
432
+ mo.md(r"""
433
+ We could do the previous two operations with built-in functions. Here's with greater than.
434
+ """)
435
  return
436
 
437
 
 
447
 
448
  @app.cell(hide_code=True)
449
  def _(mo):
450
+ mo.md(r"""
451
+ And here's with greater than or equal to.
452
+ """)
453
  return
454
 
455
 
 
465
 
466
  @app.cell(hide_code=True)
467
  def _(mo):
468
+ mo.md(r"""
469
+ **Note**: For "less than", and "less or equal to" you can use the operators `<` or `<=`. Alternatively, you can use built-in functions `lt` or `le` respectively.
 
470
 
471
+ ### Is between
472
+ Polars also allows us to filter between a range of values. Let's get the modern software where the year is between 2013 and 2016. This is inclusive on both ends (i.e. both years are part of the result).
473
+ """)
 
474
  return
475
 
476
 
 
486
 
487
  @app.cell(hide_code=True)
488
  def _(mo):
489
+ mo.md(r"""
490
+ ### Or operator
491
+ If we only want either one of the conditions in the comparison to be met, we could use `|`, which is the `or` operator.
 
492
 
493
+ Let's get software that is either modern or used in the decade 1980s.
494
+ """)
 
495
  return
496
 
497
 
 
506
 
507
  @app.cell(hide_code=True)
508
  def _(mo):
509
+ mo.md(r"""
510
+ ## Conditionals
511
+ Polars also allows you to create new columns based on a condition. Let's create a column *status* that will indicate if the software is "discontinued" or "in use".
 
512
 
513
+ Here's a list of products that are no longer in use.
514
+ """)
 
515
  return
516
 
517
 
 
523
 
524
  @app.cell(hide_code=True)
525
  def _(mo):
526
+ mo.md(r"""
527
+ Here's how we can get a dataframe of the products that are discontinued.
528
+ """)
529
  return
530
 
531
 
 
540
 
541
  @app.cell(hide_code=True)
542
  def _(mo):
543
+ mo.md(r"""
544
+ Now, let's create the **status** column.
545
+ """)
546
  return
547
 
548
 
 
561
 
562
  @app.cell(hide_code=True)
563
  def _(mo):
564
+ mo.md(r"""
565
+ ## Unique counts
566
+ Sometimes you may want to see only the unique values in a column. Let's check the unique decades we have in our DataFrame.
567
+ """)
 
 
568
  return
569
 
570
 
 
584
 
585
  @app.cell(hide_code=True)
586
  def _(mo):
587
+ mo.md(r"""
588
+ Finally, let's find out the number of software used in each decade.
589
+ """)
590
  return
591
 
592
 
 
606
 
607
  @app.cell(hide_code=True)
608
  def _(mo):
609
+ mo.md(r"""
610
+ We could also rewrite the above code as follows:
611
+ """)
612
  return
613
 
614
 
 
628
 
629
  @app.cell(hide_code=True)
630
  def _(mo):
631
+ mo.md(r"""
632
+ Hopefully, we've piqued your interest in trying out Polars the next time you analyze your data.
633
+ """)
634
  return
635
 
636
 
polars/05_reactive_plots.py CHANGED
@@ -11,26 +11,24 @@
11
 
12
  import marimo
13
 
14
- __generated_with = "0.12.10"
15
  app = marimo.App(width="medium")
16
 
17
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
- mo.md(
21
- """
22
- # Reactive Plots
23
 
24
- _By [etrotta](https://github.com/etrotta)._
25
 
26
- This tutorial covers Data Visualisation basics using marimo, [polars](https://docs.pola.rs/) and [plotly](https://plotly.com/python/plotly-express/).
27
- It shows how to load data, explore and visualise it, then use User Interface elements (including the plots themselves) to filter and select data for more refined analysis.
28
 
29
- We will be using a [Spotify Tracks dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). Before you write any code yourself, I recommend taking some time to understand the data you're working with, from which columns are available to what are their possible values, as well as more abstract details such as the scope, coverage and intended uses of the dataset.
30
 
31
- Note that this dataset does not contains data about ***all*** tracks, you can try using a larger dataset such as [bigdata-pw/Spotify](https://huggingface.co/datasets/bigdata-pw/Spotify), but I'm sticking with the smaller one to keep the notebook size manageable for most users.
32
- """
33
- )
34
  return
35
 
36
 
@@ -47,20 +45,18 @@ def _(pl):
47
  # Or save to a local file first if you want to avoid downloading it each time you run:
48
  # file_path = "spotify-tracks.parquet"
49
  # lz = pl.scan_parquet(file_path)
50
- return URL, branch, file_path, lz, repo_id
51
 
52
 
53
  @app.cell(hide_code=True)
54
  def _(mo):
55
- mo.md(
56
- """
57
- You should always take a look at the data you are working on before actually doing any operations on it - for data coming from sources such as HuggingFace or Kaggle you can preview it via their websites, and optionally filter or do some transformations before downloading.
58
 
59
- The [Polars Lazy API](https://docs.pola.rs/user-guide/lazy/) allows for you define operations before loading the data, and polars will optimize the plan in order to avoid doing unnecessary operations or loading data we do not care about.
60
 
61
- Let's say that looking at the dataset's preview in the Data Viewer, we decided we do not want the Unnamed column (which appears to be the row index), nor do we care about the original ID, and we only want non-explicit tracks.
62
- """
63
- )
64
  return
65
 
66
 
@@ -87,18 +83,16 @@ def _(lz, pl):
87
 
88
  @app.cell(hide_code=True)
89
  def _(mo):
90
- mo.md(
91
- r"""
92
- When you start exploring a dataset, some of the first things to do may include:
93
 
94
- - investigating any values that seem weird
95
- - verifying if there could be issues in the data
96
- - checking for potential bugs in our pipelines
97
- - ensuring you understand the data correctly, including its relationships and edge cases
98
 
99
- For example, the "min" value for the duration column is zero, and the max is over an hour. Why is that?
100
- """
101
- )
102
  return
103
 
104
 
@@ -112,13 +106,11 @@ def _(df, pl):
112
 
113
  @app.cell(hide_code=True)
114
  def _(mo):
115
- mo.md(
116
- r"""
117
- For this Notebook we will be using [plotly](https://plotly.com/python), but Marimo also [supports other plotting libraries](https://docs.marimo.io/guides/working_with_data/plotting/).
118
 
119
- Let's visualize it using a [bar chart](https://plotly.com/python/bar-charts/) and get a feel for which region makes sense to focus on for our analysis
120
- """
121
- )
122
  return
123
 
124
 
@@ -129,20 +121,18 @@ def _(df, mo, px):
129
  fig.update_layout(selectdirection="h")
130
  plot = mo.ui.plotly(fig)
131
  plot
132
- return duration_counts, fig, plot
133
 
134
 
135
  @app.cell(hide_code=True)
136
  def _(mo):
137
- mo.md(
138
- """
139
- Note how there are a few outliers with extremely little duration (less than 2 minutes) and a few with extremely long duration (more than 6 minutes)
140
 
141
- You can select a region in the graph by clicking and dragging, which can later be used to filter or transform data. In this Notebook we set a default if there is no selection, but you should try selecting a region yourself.
142
 
143
- We will focus on those within that middle ground from around 120 seconds to 360 seconds, but you can play around with it a bit and see how the results change if you move the Selection region. Perhaps you can even find some Classical songs?
144
- """
145
- )
146
  return
147
 
148
 
@@ -154,7 +144,7 @@ def _(pl, plot):
154
 
155
 
156
  @app.cell
157
- def _(df, get_extremes, pl, plot):
158
  # Now, we want to filter to only include tracks whose duration falls inside of our selection - we will need to first identify the extremes, then filter based on them
159
  min_dur, max_dur = get_extremes(
160
  plot.value, col="duration_seconds", defaults_if_missing=(120, 360)
@@ -168,27 +158,25 @@ def _(df, get_extremes, pl, plot):
168
  # Actually apply the filter
169
  filtered_duration = df.filter(duration_in_range)
170
  filtered_duration
171
- return duration_in_range, filtered_duration, max_dur, min_dur
172
 
173
 
174
  @app.cell(hide_code=True)
175
  def _(mo):
176
- mo.md(
177
- r"""
178
- Now that our data is 'clean', let's start coming up with and answering some questions about it. Some examples:
179
-
180
- - Which tracks or artists are the most popular? (Both globally as well as for each genre)
181
- - Which genres are the most popular? The loudest?
182
- - What are some common combinations of different artists?
183
- - What can we infer anything based on the track's title or artist name?
184
- - How popular is some specific song you like?
185
- - How much does the mode and key affect other attributes?
186
- - Can you classify a song's genre based on its attributes?
187
-
188
- For brevity, we will not explore all of them - feel free to try some of the others yourself, or go more in deep in the explored ones.
189
- Make sure to come up with some questions of your own and explore them as well!
190
- """
191
- )
192
  return
193
 
194
 
@@ -235,18 +223,16 @@ def _(filter_genre, filtered_duration, mo, pl):
235
  ),
236
  ],
237
  )
238
- return (most_popular_artists,)
239
 
240
 
241
  @app.cell(hide_code=True)
242
  def _(mo):
243
- mo.md(
244
- r"""
245
- So far so good - but there's been a distinct lack of visualations, so let's fix that.
246
 
247
- Let's start simple, just some metrics for each genre:
248
- """
249
- )
250
  return
251
 
252
 
@@ -263,22 +249,20 @@ def _(filtered_duration, pl, px):
263
  x="popularity",
264
  )
265
  fig_dur_per_genre
266
- return (fig_dur_per_genre,)
267
 
268
 
269
  @app.cell(hide_code=True)
270
  def _(mo):
271
- mo.md(
272
- r"""
273
- Now, why don't we play a bit with marimo's UI elements?
274
 
275
- We will use Dropdowns to allow for the user to select any column to use for the visualisation, and throw in some extras
276
 
277
- - A slider for the transparency to help understand dense clusters
278
- - Add a Trendline to the scatterplot (requires statsmodels)
279
- - Filter by some specific Genre
280
- """
281
- )
282
  return
283
 
284
 
@@ -312,18 +296,16 @@ def _(
312
  chart2 = mo.ui.plotly(fig2)
313
 
314
  mo.vstack([mo.hstack([x_axis, y_axis, color, alpha, include_trendline, filter_genre2]), chart2])
315
- return chart2, fig2
316
 
317
 
318
  @app.cell(hide_code=True)
319
  def _(mo):
320
- mo.md(
321
- r"""
322
- As we have seen before, we can also use the plot as an input to select a region and look at it in more detail.
323
 
324
- Try selecting a region then performing some explorations of your own with the data inside of it.
325
- """
326
- )
327
  return
328
 
329
 
@@ -340,47 +322,45 @@ def _(chart2, filtered_duration, mo, pl):
340
  pl.col(column_order), pl.exclude(*column_order)
341
  )
342
  out
343
- return active_columns, column_order, out
344
 
345
 
346
  @app.cell(hide_code=True)
347
  def _(mo):
348
- mo.md(
349
- r"""
350
- In this notebook, we've focused on a few key aspects. First, it's essential to *understand* the data you're working with — this forms the foundation of any analysis.
351
 
352
- Creating plots is a powerful way to identify patterns, outliers, and trends. These visualizations are not just for _presentation_; they are tools for deeper insight.
353
 
354
- /// NOTE
355
- With marimo's `interactive` UI elements, exploring different _facets_ of the data becomes seamless, allowing for dynamic analysis without altering the code.
356
 
357
- Keep these points in mind as you continue to work with data.
358
- """
359
- )
360
  return
361
 
362
 
363
  @app.cell(hide_code=True)
364
  def _(mo):
365
- mo.md(r"""# Utility Functions and UI Elements""")
 
 
366
  return
367
 
368
 
369
- @app.cell
370
- def get_extremes():
371
- def get_extremes(selection, col, defaults_if_missing):
372
- "Get the minimum and maximum values for a given column within the selection"
373
- if selection is None or len(selection) == 0:
374
- print(
375
- f"Could not find a selected region. Using default values {defaults_if_missing} instead, try clicking and dragging in the plot to change them."
376
- )
377
- return defaults_if_missing
378
- else:
379
- return (
380
- min(row[col] for row in selection),
381
- max(row[col] for row in selection),
382
- )
383
- return (get_extremes,)
384
 
385
 
386
  @app.cell
@@ -426,20 +406,14 @@ def _(filtered_duration, mo):
426
  searchable=True,
427
  label="Filter by Track Genre:",
428
  )
429
- return (
430
- alpha,
431
- color,
432
- filter_genre2,
433
- include_trendline,
434
- options,
435
- x_axis,
436
- y_axis,
437
- )
438
 
439
 
440
  @app.cell(hide_code=True)
441
  def _(mo):
442
- mo.md("""# Appendix : Some other examples""")
 
 
443
  return
444
 
445
 
@@ -461,12 +435,7 @@ def _(filtered_duration, mo, pl):
461
  # So we just provide freeform text boxes and filter ourselves later
462
  # (the "alternative_" in the name is just to avoid conflicts with the above cell,
463
  # despite this being disabled marimo still requires global variables to be unique)
464
- return (
465
- all_artists,
466
- all_tracks,
467
- alternative_filter_artist,
468
- alternative_filter_track,
469
- )
470
 
471
 
472
  @app.cell
@@ -503,7 +472,7 @@ def _(filter_artist, filter_track, filtered_duration, mo, pl):
503
  )
504
 
505
  mo.vstack([mo.md("Filter a track based on its name or artist"), filter_artist, filter_track, filtered_artist_track])
506
- return filtered_artist_track, score_match_text
507
 
508
 
509
  @app.cell
@@ -532,7 +501,7 @@ def _(filter_genre2, filtered_duration, mo, pl):
532
  ],
533
  align="center",
534
  )
535
- return (artist_combinations,)
536
 
537
 
538
  @app.cell
 
11
 
12
  import marimo
13
 
14
+ __generated_with = "0.18.4"
15
  app = marimo.App(width="medium")
16
 
17
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
+ mo.md("""
21
+ # Reactive Plots
 
22
 
23
+ _By [etrotta](https://github.com/etrotta)._
24
 
25
+ This tutorial covers Data Visualisation basics using marimo, [polars](https://docs.pola.rs/) and [plotly](https://plotly.com/python/plotly-express/).
26
+ It shows how to load data, explore and visualise it, then use User Interface elements (including the plots themselves) to filter and select data for more refined analysis.
27
 
28
+ We will be using a [Spotify Tracks dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). Before you write any code yourself, I recommend taking some time to understand the data you're working with, from which columns are available to what their possible values are, as well as more abstract details such as the scope, coverage and intended uses of the dataset.
29
 
30
+ Note that this dataset does not contain data about ***all*** tracks; you can try using a larger dataset such as [bigdata-pw/Spotify](https://huggingface.co/datasets/bigdata-pw/Spotify), but I'm sticking with the smaller one to keep the notebook size manageable for most users.
31
+ """)
 
32
  return
33
 
34
 
 
45
  # Or save to a local file first if you want to avoid downloading it each time you run:
46
  # file_path = "spotify-tracks.parquet"
47
  # lz = pl.scan_parquet(file_path)
48
+ return (lz,)
49
 
50
 
51
  @app.cell(hide_code=True)
52
  def _(mo):
53
+ mo.md("""
54
+ You should always take a look at the data you are working on before actually doing any operations on it - for data coming from sources such as HuggingFace or Kaggle you can preview it via their websites, and optionally filter or do some transformations before downloading.
 
55
 
56
+ The [Polars Lazy API](https://docs.pola.rs/user-guide/lazy/) allows you to define operations before loading the data, and polars will optimize the plan in order to avoid doing unnecessary operations or loading data we do not care about.
57
 
58
+ Let's say that looking at the dataset's preview in the Data Viewer, we decided we do not want the Unnamed column (which appears to be the row index), nor do we care about the original ID, and we only want non-explicit tracks.
59
+ """)
 
60
  return
61
 
62
 
 
83
 
84
  @app.cell(hide_code=True)
85
  def _(mo):
86
+ mo.md(r"""
87
+ When you start exploring a dataset, some of the first things to do may include:
 
88
 
89
+ - investigating any values that seem weird
90
+ - verifying if there could be issues in the data
91
+ - checking for potential bugs in our pipelines
92
+ - ensuring you understand the data correctly, including its relationships and edge cases
93
 
94
+ For example, the "min" value for the duration column is zero, and the max is over an hour. Why is that?
95
+ """)
 
96
  return
97
 
98
 
 
106
 
107
  @app.cell(hide_code=True)
108
  def _(mo):
109
+ mo.md(r"""
110
+ For this Notebook we will be using [plotly](https://plotly.com/python), but Marimo also [supports other plotting libraries](https://docs.marimo.io/guides/working_with_data/plotting/).
 
111
 
112
+ Let's visualize it using a [bar chart](https://plotly.com/python/bar-charts/) and get a feel for which region makes sense to focus on for our analysis
113
+ """)
 
114
  return
115
 
116
 
 
121
  fig.update_layout(selectdirection="h")
122
  plot = mo.ui.plotly(fig)
123
  plot
124
+ return (plot,)
125
 
126
 
127
  @app.cell(hide_code=True)
128
  def _(mo):
129
+ mo.md("""
130
+ Note how there are a few outliers with extremely short durations (less than 2 minutes) and a few with extremely long durations (more than 6 minutes).
 
131
 
132
+ You can select a region in the graph by clicking and dragging, which can later be used to filter or transform data. In this Notebook we set a default if there is no selection, but you should try selecting a region yourself.
133
 
134
+ We will focus on those within that middle ground from around 120 seconds to 360 seconds, but you can play around with it a bit and see how the results change if you move the Selection region. Perhaps you can even find some Classical songs?
135
+ """)
 
136
  return
137
 
138
 
 
144
 
145
 
146
  @app.cell
147
+ def _(df, pl, plot):
148
  # Now, we want to filter to only include tracks whose duration falls inside of our selection - we will need to first identify the extremes, then filter based on them
149
  min_dur, max_dur = get_extremes(
150
  plot.value, col="duration_seconds", defaults_if_missing=(120, 360)
 
158
  # Actually apply the filter
159
  filtered_duration = df.filter(duration_in_range)
160
  filtered_duration
161
+ return (filtered_duration,)
162
 
163
 
164
  @app.cell(hide_code=True)
165
  def _(mo):
166
+ mo.md(r"""
167
+ Now that our data is 'clean', let's start coming up with and answering some questions about it. Some examples:
168
+
169
+ - Which tracks or artists are the most popular? (Both globally as well as for each genre)
170
+ - Which genres are the most popular? The loudest?
171
+ - What are some common combinations of different artists?
172
+ - What can we infer based on the track's title or artist name?
173
+ - How popular is some specific song you like?
174
+ - How much does the mode and key affect other attributes?
175
+ - Can you classify a song's genre based on its attributes?
176
+
177
+ For brevity, we will not explore all of them - feel free to try some of the others yourself, or dig deeper into the ones we do explore.
178
+ Make sure to come up with some questions of your own and explore them as well!
179
+ """)
 
 
180
  return
181
 
182
 
 
223
  ),
224
  ],
225
  )
226
+ return
227
 
228
 
229
  @app.cell(hide_code=True)
230
  def _(mo):
231
+ mo.md(r"""
232
+ So far so good - but there's been a distinct lack of visualisations, so let's fix that.
 
233
 
234
+ Let's start simple, just some metrics for each genre:
235
+ """)
 
236
  return
237
 
238
 
 
249
  x="popularity",
250
  )
251
  fig_dur_per_genre
252
+ return
253
 
254
 
255
  @app.cell(hide_code=True)
256
  def _(mo):
257
+ mo.md(r"""
258
+ Now, why don't we play a bit with marimo's UI elements?
 
259
 
260
+ We will use Dropdowns to allow the user to select any column to use for the visualisation, and throw in some extras:
261
 
262
+ - A slider for the transparency to help understand dense clusters
263
+ - Add a Trendline to the scatterplot (requires statsmodels)
264
+ - Filter by some specific Genre
265
+ """)
 
266
  return
267
 
268
 
 
296
  chart2 = mo.ui.plotly(fig2)
297
 
298
  mo.vstack([mo.hstack([x_axis, y_axis, color, alpha, include_trendline, filter_genre2]), chart2])
299
+ return (chart2,)
300
 
301
 
302
  @app.cell(hide_code=True)
303
  def _(mo):
304
+ mo.md(r"""
305
+ As we have seen before, we can also use the plot as an input to select a region and look at it in more detail.
 
306
 
307
+ Try selecting a region then performing some explorations of your own with the data inside of it.
308
+ """)
 
309
  return
310
 
311
 
 
322
  pl.col(column_order), pl.exclude(*column_order)
323
  )
324
  out
325
+ return
326
 
327
 
328
  @app.cell(hide_code=True)
329
  def _(mo):
330
+ mo.md(r"""
331
+ In this notebook, we've focused on a few key aspects. First, it's essential to *understand* the data you're working with — this forms the foundation of any analysis.
 
332
 
333
+ Creating plots is a powerful way to identify patterns, outliers, and trends. These visualizations are not just for _presentation_; they are tools for deeper insight.
334
 
335
+ /// NOTE
336
+ With marimo's `interactive` UI elements, exploring different _facets_ of the data becomes seamless, allowing for dynamic analysis without altering the code.
337
 
338
+ Keep these points in mind as you continue to work with data.
339
+ """)
 
340
  return
341
 
342
 
343
  @app.cell(hide_code=True)
344
  def _(mo):
345
+ mo.md(r"""
346
+ # Utility Functions and UI Elements
347
+ """)
348
  return
349
 
350
 
351
+ @app.function
352
+ def get_extremes(selection, col, defaults_if_missing):
353
+ "Get the minimum and maximum values for a given column within the selection"
354
+ if selection is None or len(selection) == 0:
355
+ print(
356
+ f"Could not find a selected region. Using default values {defaults_if_missing} instead, try clicking and dragging in the plot to change them."
357
+ )
358
+ return defaults_if_missing
359
+ else:
360
+ return (
361
+ min(row[col] for row in selection),
362
+ max(row[col] for row in selection),
363
+ )
 
 
364
 
365
 
366
  @app.cell
 
406
  searchable=True,
407
  label="Filter by Track Genre:",
408
  )
409
+ return alpha, color, filter_genre2, include_trendline, x_axis, y_axis
 
 
 
 
 
 
 
 
410
 
411
 
412
  @app.cell(hide_code=True)
413
  def _(mo):
414
+ mo.md("""
415
+ # Appendix : Some other examples
416
+ """)
417
  return
418
 
419
 
 
435
  # So we just provide freeform text boxes and filter ourselves later
436
  # (the "alternative_" in the name is just to avoid conflicts with the above cell,
437
  # despite this being disabled marimo still requires global variables to be unique)
438
+ return
 
 
 
 
 
439
 
440
 
441
  @app.cell
 
472
  )
473
 
474
  mo.vstack([mo.md("Filter a track based on its name or artist"), filter_artist, filter_track, filtered_artist_track])
475
+ return
476
 
477
 
478
  @app.cell
 
501
  ],
502
  align="center",
503
  )
504
+ return
505
 
506
 
507
  @app.cell
polars/06_Dataframe_Transformer.py CHANGED
@@ -12,21 +12,19 @@
12
 
13
  import marimo
14
 
15
- __generated_with = "0.14.10"
16
  app = marimo.App(width="medium")
17
 
18
 
19
  @app.cell(hide_code=True)
20
  def _(mo):
21
- mo.md(
22
- r"""
23
  # Polars with Marimo's Dataframe Transformer
24
 
25
  *By [jesshart](https://github.com/jesshart)*
26
 
27
  The goal of this notebook is to explore Marimo's data exploration capabilities alongside the power of polars. Feel free to reference the latest about these Marimo features here: https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
28
- """
29
- )
30
  return
31
 
32
 
@@ -40,14 +38,12 @@ def _(requests):
40
 
41
  @app.cell(hide_code=True)
42
  def _(mo):
43
- mo.md(
44
- r"""
45
  # Loading Data
46
  Let's start by loading our data and getting it into the `.lazy()` format so our transformations and queries are speedy.
47
 
48
  Read more about `.lazy()` here: https://docs.pola.rs/user-guide/lazy/
49
- """
50
- )
51
  return
52
 
53
 
@@ -60,21 +56,18 @@ def _(json_data, pl):
60
 
61
  @app.cell(hide_code=True)
62
  def _(mo):
63
- mo.md(
64
- r"""
65
- Above, you will notice that when you reference the object as a standalone, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from.
66
 
67
  - 💡 Try out the `Table` view! You can click the `Preview data` button to get a quick view of your data.
68
  - 💡 Take a look at the `Query plan`. Learn more about Polars' query plan here: https://docs.pola.rs/user-guide/lazy/query-plan/
69
- """
70
- )
71
  return
72
 
73
 
74
  @app.cell(hide_code=True)
75
  def _(mo):
76
- mo.md(
77
- r"""
78
  ## marimo's Native Dataframe UI
79
 
80
  There are a few ways to leverage marimo's native dataframe UI. One is by doing what we saw above—by referencing a `pl.LazyFrame` directly. You can also try,
@@ -83,19 +76,16 @@ def _(mo):
83
  - Referencing a `pl.DataFrame` and seeing how it differs from its corresponding lazy version
84
  - Use `mo.ui.table`
85
  - Use `mo.ui.dataframe`
86
- """
87
- )
88
  return
89
 
90
 
91
  @app.cell(hide_code=True)
92
  def _(mo):
93
- mo.md(
94
- r"""
95
  ## Reference a `pl.DataFrame`
96
  Let's reference the same frame as before, but this time as a `pl.DataFrame` by calling `.collect()` on it.
97
- """
98
- )
99
  return
100
 
101
 
@@ -107,26 +97,22 @@ def _(demand: "pl.LazyFrame"):
107
 
108
  @app.cell(hide_code=True)
109
  def _(mo):
110
- mo.md(
111
- r"""
112
  Note how much functionality we have right out-of-the-box. Click on column names to see rich features like sorting, freezing, filtering, searching, and more!
113
 
114
  Notice how `order_quantity` has a green bar chart under it indicating the distribution of values for the field!
115
 
116
  Don't miss the `Download` feature as well which supports downloading in CSV, json, or parquet format!
117
- """
118
- )
119
  return
120
 
121
 
122
  @app.cell(hide_code=True)
123
  def _(mo):
124
- mo.md(
125
- r"""
126
  ## Use `mo.ui.table`
127
  The `mo.ui.table` allows you to select rows for use downstream. You can select the rows you want and then use them to filter data later.
128
- """
129
- )
130
  return
131
 
132
 
@@ -144,7 +130,9 @@ def _(demand_table):
144
 
145
  @app.cell(hide_code=True)
146
  def _(mo):
147
- mo.md(r"""I like to use this feature to select groupings based on summary statistics so I can quickly explore subsets of categories. Let me show you what I mean.""")
 
 
148
  return
149
 
150
 
@@ -175,13 +163,11 @@ def _(summary_table):
175
 
176
  @app.cell(hide_code=True)
177
  def _(mo):
178
- mo.md(
179
- r"""
180
  Now, instead of manually creating a filter for what I want to take a closer look at, I simply select from the ui and do a simple join to get that aggregated level with more detail.
181
 
182
  The following cell uses the output of the `mo.ui.table` selection, selects its unique keys, and uses that to join for the selected subset of the original table.
183
- """
184
- )
185
  return
186
 
187
 
@@ -199,13 +185,17 @@ def _(demand: "pl.LazyFrame", pl, summary_table):
199
 
200
  @app.cell(hide_code=True)
201
  def _(mo):
202
- mo.md("""You can learn more about joins in Polars by checking out my other interactive notebook here: https://marimo.io/p/@jesshart/basic-polars-joins""")
 
 
203
  return
204
 
205
 
206
  @app.cell(hide_code=True)
207
  def _(mo):
208
- mo.md(r"""## Use `mo.ui.dataframe`""")
 
 
209
  return
210
 
211
 
@@ -218,7 +208,9 @@ def _(demand: "pl.LazyFrame", mo):
218
 
219
  @app.cell(hide_code=True)
220
  def _(mo):
221
- mo.md(r"""Below I simply call the object into view. We will play with it in the following cells.""")
 
 
222
  return
223
 
224
 
@@ -230,7 +222,9 @@ def _(mo_dataframe):
230
 
231
  @app.cell(hide_code=True)
232
  def _(mo):
233
- mo.md(r"""One way to group this data in polars code directly would be to group by product family to get the mean. This is how it is done in polars:""")
 
 
234
  return
235
 
236
 
@@ -245,16 +239,14 @@ def _(demand_cached, pl):
245
 
246
  @app.cell(hide_code=True)
247
  def _(mo):
248
- mo.md(
249
- f"""
250
  ## Try Before You Buy
251
 
252
  1. Now try to do the same summary using Marimo's `mo.ui.dataframe` object above. Also, note how your aggregated column is already renamed! Nice touch!
253
  2. Try (1) again but use select statements first (This is actually better polars practice anyway since it reduces the frame as you move to aggregation.)
254
 
255
  *When you are ready, check the `Python Code` tab at the top of the table to compare your output to the answer below.*
256
- """
257
- )
258
  return
259
 
260
 
@@ -331,29 +323,27 @@ def _(demand_agg: "pl.DataFrame", mo, px):
331
 
332
  @app.cell(hide_code=True)
333
  def _(mo):
334
- mo.md(
335
- r"""
336
  # About this Notebook
337
  Polars and Marimo are both relatively new to the data wrangling space, but their power (and the thrill of their use) cannot be overstated—well, I suppose it could, but you get the meaning. In this notebook, you learn how to leverage basic Polars skills to load-in and explore your data in concert with Marimo's powerful UI elements.
338
 
339
  ## 📚 Documentation References
340
 
341
- - **Marimo: Dataframe Transformation Guide**
342
  https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
343
 
344
- - **Polars: Lazy API Overview**
345
  https://docs.pola.rs/user-guide/lazy/
346
 
347
- - **Polars: Query Plan Explained**
348
  https://docs.pola.rs/user-guide/lazy/query-plan/
349
 
350
- - **Marimo Notebook: Basic Polars Joins (by jesshart)**
351
  https://marimo.io/p/@jesshart/basic-polars-joins
352
 
353
- - **Marimo Learn: Interactive Graphs with Polars**
354
  https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py
355
- """
356
- )
357
  return
358
 
359
 
 
12
 
13
  import marimo
14
 
15
+ __generated_with = "0.18.4"
16
  app = marimo.App(width="medium")
17
 
18
 
19
  @app.cell(hide_code=True)
20
  def _(mo):
21
+ mo.md(r"""
 
22
  # Polars with Marimo's Dataframe Transformer
23
 
24
  *By [jesshart](https://github.com/jesshart)*
25
 
26
  The goal of this notebook is to explore Marimo's data exploration capabilities alongside the power of polars. Feel free to reference the latest about these Marimo features here: https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
27
+ """)
 
28
  return
29
 
30
 
 
38
 
39
  @app.cell(hide_code=True)
40
  def _(mo):
41
+ mo.md(r"""
 
42
  # Loading Data
43
  Let's start by loading our data and getting it into the `.lazy()` format so our transformations and queries are speedy.
44
 
45
  Read more about `.lazy()` here: https://docs.pola.rs/user-guide/lazy/
46
+ """)
 
47
  return
48
 
49
 
 
56
 
57
  @app.cell(hide_code=True)
58
  def _(mo):
59
+ mo.md(r"""
60
+ Above, you will notice that when you reference the object as a standalone, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from.
 
61
 
62
  - 💡 Try out the `Table` view! You can click the `Preview data` button to get a quick view of your data.
63
  - 💡 Take a look at the `Query plan`. Learn more about Polars' query plan here: https://docs.pola.rs/user-guide/lazy/query-plan/
64
+ """)
 
65
  return
66
 
67
 
68
  @app.cell(hide_code=True)
69
  def _(mo):
70
+ mo.md(r"""
 
71
  ## marimo's Native Dataframe UI
72
 
73
  There are a few ways to leverage marimo's native dataframe UI. One is by doing what we saw above—by referencing a `pl.LazyFrame` directly. You can also try,
 
76
  - Referencing a `pl.DataFrame` and seeing how it differs from its corresponding lazy version
77
  - Use `mo.ui.table`
78
  - Use `mo.ui.dataframe`
79
+ """)
 
80
  return
81
 
82
 
83
  @app.cell(hide_code=True)
84
  def _(mo):
85
+ mo.md(r"""
 
86
  ## Reference a `pl.DataFrame`
87
  Let's reference the same frame as before, but this time as a `pl.DataFrame` by calling `.collect()` on it.
88
+ """)
 
89
  return
90
 
91
 
 
97
 
98
  @app.cell(hide_code=True)
99
  def _(mo):
100
+ mo.md(r"""
 
101
  Note how much functionality we have right out-of-the-box. Click on column names to see rich features like sorting, freezing, filtering, searching, and more!
102
 
103
  Notice how `order_quantity` has a green bar chart under it indicating the distribution of values for the field!
104
 
105
  Don't miss the `Download` feature as well which supports downloading in CSV, json, or parquet format!
106
+ """)
 
107
  return
108
 
109
 
110
  @app.cell(hide_code=True)
111
  def _(mo):
112
+ mo.md(r"""
 
113
  ## Use `mo.ui.table`
114
  The `mo.ui.table` allows you to select rows for use downstream. You can select the rows you want and then use them to filter data later.
115
+ """)
 
116
  return
117
 
118
 
 
130
 
131
  @app.cell(hide_code=True)
132
  def _(mo):
133
+ mo.md(r"""
134
+ I like to use this feature to select groupings based on summary statistics so I can quickly explore subsets of categories. Let me show you what I mean.
135
+ """)
136
  return
137
 
138
 
 
163
 
164
  @app.cell(hide_code=True)
165
  def _(mo):
166
+ mo.md(r"""
 
167
  Now, instead of manually creating a filter for what I want to take a closer look at, I simply select from the ui and do a simple join to get that aggregated level with more detail.
168
 
169
  The following cell uses the output of the `mo.ui.table` selection, selects its unique keys, and uses that to join for the selected subset of the original table.
170
+ """)
 
171
  return
172
 
173
 
 
185
 
186
  @app.cell(hide_code=True)
187
  def _(mo):
188
+ mo.md("""
189
+ You can learn more about joins in Polars by checking out my other interactive notebook here: https://marimo.io/p/@jesshart/basic-polars-joins
190
+ """)
191
  return
192
 
193
 
194
  @app.cell(hide_code=True)
195
  def _(mo):
196
+ mo.md(r"""
197
+ ## Use `mo.ui.dataframe`
198
+ """)
199
  return
200
 
201
 
 
208
 
209
  @app.cell(hide_code=True)
210
  def _(mo):
211
+ mo.md(r"""
212
+ Below I simply call the object into view. We will play with it in the following cells.
213
+ """)
214
  return
215
 
216
 
 
222
 
223
  @app.cell(hide_code=True)
224
  def _(mo):
225
+ mo.md(r"""
226
+ One way to group this data in polars code directly would be to group by product family to get the mean. This is how it is done in polars:
227
+ """)
228
  return
229
 
230
 
 
239
 
240
  @app.cell(hide_code=True)
241
  def _(mo):
242
+ mo.md(f"""
 
243
  ## Try Before You Buy
244
 
245
  1. Now try to do the same summary using Marimo's `mo.ui.dataframe` object above. Also, note how your aggregated column is already renamed! Nice touch!
246
  2. Try (1) again but use select statements first (This is actually better polars practice anyway since it reduces the frame as you move to aggregation.)
247
 
248
  *When you are ready, check the `Python Code` tab at the top of the table to compare your output to the answer below.*
249
+ """)
 
250
  return
251
 
252
 
 
323
 
324
  @app.cell(hide_code=True)
325
  def _(mo):
326
+ mo.md(r"""
 
327
  # About this Notebook
328
  Polars and Marimo are both relatively new to the data wrangling space, but their power (and the thrill of their use) cannot be overstated—well, I suppose it could, but you get the meaning. In this notebook, you learn how to leverage basic Polars skills to load-in and explore your data in concert with Marimo's powerful UI elements.
329
 
330
  ## 📚 Documentation References
331
 
332
+ - **Marimo: Dataframe Transformation Guide**
333
  https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes
334
 
335
+ - **Polars: Lazy API Overview**
336
  https://docs.pola.rs/user-guide/lazy/
337
 
338
+ - **Polars: Query Plan Explained**
339
  https://docs.pola.rs/user-guide/lazy/query-plan/
340
 
341
+ - **Marimo Notebook: Basic Polars Joins (by jesshart)**
342
  https://marimo.io/p/@jesshart/basic-polars-joins
343
 
344
+ - **Marimo Learn: Interactive Graphs with Polars**
345
  https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py
346
+ """)
 
347
  return
348
 
349
 
polars/07-querying-with-sql.py CHANGED
@@ -35,7 +35,7 @@ def _(mo):
35
 
36
 
37
  @app.cell
38
- def _(mo, reviews, sqlite_engine):
39
  _df = mo.sql(
40
  f"""
41
  SELECT * FROM reviews LIMIT 100
@@ -91,7 +91,7 @@ def _(mo):
91
 
92
 
93
  @app.cell
94
- def _(hotels, mo, sqlite_engine):
95
  _df = mo.sql(
96
  f"""
97
  SELECT * FROM hotels LIMIT 10
@@ -112,7 +112,7 @@ def _(mo):
112
 
113
 
114
  @app.cell
115
- def _(mo, reviews, sqlite_engine, users):
116
  polars_age_groups = mo.sql(
117
  f"""
118
  SELECT reviews.*, age_group FROM reviews JOIN users ON reviews.user_id = users.user_id LIMIT 1000
@@ -139,7 +139,7 @@ def _(mo):
139
 
140
 
141
  @app.cell
142
- def _(mo, reviews, sqlite_engine, users):
143
  _df = mo.sql(
144
  f"""
145
  SELECT age_group, AVG(reviews.score_overall) FROM reviews JOIN users ON reviews.user_id = users.user_id GROUP BY age_group
@@ -158,7 +158,7 @@ def _(mo):
158
 
159
 
160
  @app.cell
161
- def _(mo, polars_age_groups):
162
  _df = mo.sql(
163
  f"""
164
  SELECT * FROM polars_age_groups LIMIT 10
@@ -261,7 +261,7 @@ def _(mo):
261
 
262
 
263
  @app.cell
264
- def _(duckdb, hotels):
265
  duckdb.sql("SELECT * FROM hotels").pl(lazy=True).sort("cleanliness_base", descending=True).limit(5).collect()
266
  return
267
 
 
35
 
36
 
37
  @app.cell
38
+ def _(mo, sqlite_engine):
39
  _df = mo.sql(
40
  f"""
41
  SELECT * FROM reviews LIMIT 100
 
91
 
92
 
93
  @app.cell
94
+ def _(mo, sqlite_engine):
95
  _df = mo.sql(
96
  f"""
97
  SELECT * FROM hotels LIMIT 10
 
112
 
113
 
114
  @app.cell
115
+ def _(mo, sqlite_engine):
116
  polars_age_groups = mo.sql(
117
  f"""
118
  SELECT reviews.*, age_group FROM reviews JOIN users ON reviews.user_id = users.user_id LIMIT 1000
 
139
 
140
 
141
  @app.cell
142
+ def _(mo, sqlite_engine):
143
  _df = mo.sql(
144
  f"""
145
  SELECT age_group, AVG(reviews.score_overall) FROM reviews JOIN users ON reviews.user_id = users.user_id GROUP BY age_group
 
158
 
159
 
160
  @app.cell
161
+ def _(mo):
162
  _df = mo.sql(
163
  f"""
164
  SELECT * FROM polars_age_groups LIMIT 10
 
261
 
262
 
263
  @app.cell
264
+ def _(duckdb):
265
  duckdb.sql("SELECT * FROM hotels").pl(lazy=True).sort("cleanliness_base", descending=True).limit(5).collect()
266
  return
267
 
polars/08_working_with_columns.py CHANGED
@@ -8,37 +8,33 @@
8
 
9
  import marimo
10
 
11
- __generated_with = "0.12.0"
12
  app = marimo.App(width="medium")
13
 
14
 
15
  @app.cell(hide_code=True)
16
  def _(mo):
17
- mo.md(
18
- r"""
19
- # Working with Columns
20
 
21
- Author: [Deb Debnath](https://github.com/debajyotid2)
22
 
23
- **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/expressions/expression-expansion).
24
- """
25
- )
26
  return
27
 
28
 
29
  @app.cell(hide_code=True)
30
  def _(mo):
31
- mo.md(
32
- r"""
33
- ## Expressions
34
 
35
- Data transformations are sometimes complicated, or involve massive computations which are time-consuming. You can make a small version of the dataset with the schema you are trying to work your transformation into. But there is a better way to do it in Polars.
36
 
37
- A Polars expression is a lazy representation of a data transformation. "Lazy" means that the transformation is not eagerly (immediately) executed.
38
 
39
- Expressions are modular and flexible. They can be composed to build more complex expressions. For example, to calculate speed from distance and time, you can have an expression as:
40
- """
41
- )
42
  return
43
 
44
 
@@ -46,24 +42,24 @@ def _(mo):
46
  def _(pl):
47
  speed_expr = pl.col("distance") / (pl.col("time"))
48
  speed_expr
49
- return (speed_expr,)
50
 
51
 
52
  @app.cell(hide_code=True)
53
  def _(mo):
54
- mo.md(
55
- r"""
56
- ## Expression expansion
57
 
58
- Expression expansion lets you write a single expression that can expand to multiple different expressions. So rather than repeatedly defining separate expressions, you can avoid redundancy while adhering to clean code principles (Do not Repeat Yourself - [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)). Since expressions are reusable, they aid in writing concise code.
59
- """
60
- )
61
  return
62
 
63
 
64
  @app.cell(hide_code=True)
65
  def _(mo):
66
- mo.md("""For the examples in this notebook, we will use a sliver of the *AI4I 2020 Predictive Maintenance Dataset*. This dataset comprises of measurements taken from sensors in industrial machinery undergoing preventive maintenance checks - basically being tested for failure conditions.""")
 
 
67
  return
68
 
69
 
@@ -80,32 +76,28 @@ def _(StringIO, pl):
80
 
81
  data = pl.read_csv(StringIO(data_csv))
82
  data
83
- return data, data_csv
84
 
85
 
86
  @app.cell(hide_code=True)
87
  def _(mo):
88
- mo.md(
89
- r"""
90
- ## Function `col`
91
 
92
- The function `col` is used to refer to one column of a dataframe. It is one of the fundamental building blocks of expressions in Polars. `col` is also really handy in expression expansion.
93
- """
94
- )
95
  return
96
 
97
 
98
  @app.cell(hide_code=True)
99
  def _(mo):
100
- mo.md(
101
- r"""
102
- ### Explicit expansion by column name
103
 
104
- The simplest form of expression expansion happens when you provide multiple column names to the function `col`.
105
 
106
- Say you wish to convert all temperature values in deg. Kelvin (K) to deg. Fahrenheit (F). One way to do this would be to define individual expressions for each column as follows:
107
- """
108
- )
109
  return
110
 
111
 
@@ -118,12 +110,14 @@ def _(data, pl):
118
 
119
  result = data.with_columns(exprs)
120
  result
121
- return exprs, result
122
 
123
 
124
  @app.cell(hide_code=True)
125
  def _(mo):
126
- mo.md(r"""Expression expansion can reduce this verbosity when you list the column names you want the expression to expand to inside the `col` function. The result is the same as before.""")
 
 
127
  return
128
 
129
 
@@ -139,28 +133,28 @@ def _(data, pl, result):
139
  ).round(2)
140
  )
141
  result_2.equals(result)
142
- return (result_2,)
143
 
144
 
145
  @app.cell(hide_code=True)
146
  def _(mo):
147
- mo.md(r"""In this case, the expression that does the temperature conversion is expanded to a list of two expressions. The expansion of the expression is predictable and intuitive.""")
 
 
148
  return
149
 
150
 
151
  @app.cell(hide_code=True)
152
  def _(mo):
153
- mo.md(
154
- r"""
155
- ### Expansion by data type
156
 
157
- Can we do better than explicitly writing the names of every columns we want transformed? Yes.
158
 
159
- If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided.
160
 
161
- The example below performs the exact same computation as before:
162
- """
163
- )
164
  return
165
 
166
 
@@ -168,18 +162,16 @@ def _(mo):
168
  def _(data, pl, result):
169
  result_3 = data.with_columns(((pl.col(pl.Float64) - 273.15) * 1.8 + 32).round(2))
170
  result_3.equals(result)
171
- return (result_3,)
172
 
173
 
174
  @app.cell(hide_code=True)
175
  def _(mo):
176
- mo.md(
177
- r"""
178
- However, you should be careful to ensure that the transformation is only applied to the columns you want. For ensuring this it is important to know the schema of the data beforehand.
179
 
180
- `col` accepts multiple data types in case the columns you need have more than one data type.
181
- """
182
- )
183
  return
184
 
185
 
@@ -195,18 +187,16 @@ def _(data, pl, result):
195
  ).round(2)
196
  )
197
  result.equals(result_4)
198
- return (result_4,)
199
 
200
 
201
  @app.cell(hide_code=True)
202
  def _(mo):
203
- mo.md(
204
- r"""
205
- ### Expansion by pattern matching
206
 
207
- `col` also accepts regular expressions for selecting columns by pattern matching. Regular expressions start and end with ^ and $, respectively.
208
- """
209
- )
210
  return
211
 
212
 
@@ -218,7 +208,9 @@ def _(data, pl):
218
 
219
  @app.cell(hide_code=True)
220
  def _(mo):
221
- mo.md(r"""Regular expressions can be combined with exact column names.""")
 
 
222
  return
223
 
224
 
@@ -230,7 +222,9 @@ def _(data, pl):
230
 
231
  @app.cell(hide_code=True)
232
  def _(mo):
233
- mo.md(r"""**Note**: You _cannot_ mix strings (exact names, regular expressions) and data types in a `col` function.""")
 
 
234
  return
235
 
236
 
@@ -245,13 +239,11 @@ def _(data, pl):
245
 
246
  @app.cell(hide_code=True)
247
  def _(mo):
248
- mo.md(
249
- r"""
250
- ## Selecting all columns
251
 
252
- To select all columns, you can use the `all` function.
253
- """
254
- )
255
  return
256
 
257
 
@@ -259,18 +251,16 @@ def _(mo):
259
  def _(data, pl):
260
  result_6 = data.select(pl.all())
261
  result_6.equals(data)
262
- return (result_6,)
263
 
264
 
265
  @app.cell(hide_code=True)
266
  def _(mo):
267
- mo.md(
268
- r"""
269
- ## Excluding columns
270
 
271
- There are scenarios where we might want to exclude specific columns from the ones selected by building expressions, e.g. by the `col` or `all` functions. For this purpose, we use the function `exclude`, which accepts exactly the same types of arguments as `col`:
272
- """
273
- )
274
  return
275
 
276
 
@@ -282,7 +272,9 @@ def _(data, pl):
282
 
283
  @app.cell(hide_code=True)
284
  def _(mo):
285
- mo.md(r"""`exclude` can also be used after the function `col`:""")
 
 
286
  return
287
 
288
 
@@ -294,13 +286,11 @@ def _(data, pl):
294
 
295
  @app.cell(hide_code=True)
296
  def _(mo):
297
- mo.md(
298
- r"""
299
- ## Column renaming
300
 
301
- When applying a transformation with an expression to a column, the data in the column gets overwritten with the transformed data. However, this might not be the intended outcome in all situations - ideally you would want to store transformed data in a new column. Applying multiple transformations to the same column at the same time without renaming leads to errors.
302
- """
303
- )
304
  return
305
 
306
 
@@ -315,18 +305,16 @@ def _(data, pl):
315
  )
316
  except DuplicateError as err:
317
  print("DuplicateError:", err)
318
- return (DuplicateError,)
319
 
320
 
321
  @app.cell(hide_code=True)
322
  def _(mo):
323
- mo.md(
324
- r"""
325
- ### Renaming a single column with `alias`
326
 
327
- The function `alias` lets you rename a single column:
328
- """
329
- )
330
  return
331
 
332
 
@@ -341,13 +329,11 @@ def _(data, pl):
341
 
342
  @app.cell(hide_code=True)
343
  def _(mo):
344
- mo.md(
345
- r"""
346
- ### Prefixing and suffixing column names
347
 
348
- As `alias` renames a single column at a time, it cannot be used during expression expansion. If it is sufficient add a static prefix or a static suffix to the existing names, you can use the functions `name.prefix` and `name.suffix` with `col`:
349
- """
350
- )
351
  return
352
 
353
 
@@ -362,13 +348,11 @@ def _(data, pl):
362
 
363
  @app.cell(hide_code=True)
364
  def _(mo):
365
- mo.md(
366
- r"""
367
- ### Dynamic name replacement
368
 
369
- If a static prefix/suffix is not enough, use `name.map`. `name.map` requires a function that transforms column names to the desired. The transformation should lead to unique names to avoid `DuplicateError`.
370
- """
371
- )
372
  return
373
 
374
 
@@ -381,13 +365,11 @@ def _(data, pl):
381
 
382
  @app.cell(hide_code=True)
383
  def _(mo):
384
- mo.md(
385
- r"""
386
- ## Programmatically generating expressions
387
 
388
- For this example, we will first create four additional columns with the rolling mean temperatures of the two temperature columns. Such transformations are sometimes used to create additional features for machine learning models or data analysis.
389
- """
390
- )
391
  return
392
 
393
 
@@ -402,13 +384,17 @@ def _(data, pl):
402
 
403
  @app.cell(hide_code=True)
404
  def _(mo):
405
- mo.md(r"""Now, suppose we want to calculate the difference between the rolling mean and actual temperatures. We cannot use expression expansion here as we want differences between specific columns.""")
 
 
406
  return
407
 
408
 
409
  @app.cell(hide_code=True)
410
  def _(mo):
411
- mo.md(r"""At first, you may think about using a `for` loop:""")
 
 
412
  return
413
 
414
 
@@ -421,12 +407,14 @@ def _(ext_temp_data, pl):
421
  .round(2).alias(f"Delta {col_name} temperature")
422
  )
423
  _result
424
- return (col_name,)
425
 
426
 
427
  @app.cell(hide_code=True)
428
  def _(mo):
429
- mo.md(r"""Using a `for` loop is functional, but not scalable, as each expression needs to be defined in an iteration and executed serially. Instead we can use a generator in Python to programmatically create all expressions at once. In conjunction with the `with_columns` context, we can take advantage of parallel execution of computations and query optimization from Polars.""")
 
 
430
  return
431
 
432
 
@@ -439,18 +427,16 @@ def _(ext_temp_data, pl):
439
 
440
 
441
  ext_temp_data.with_columns(delta_expressions(["Air", "Process"]))
442
- return (delta_expressions,)
443
 
444
 
445
  @app.cell(hide_code=True)
446
  def _(mo):
447
- mo.md(
448
- r"""
449
- ## More flexible column selections
450
 
451
- For more flexible column selections, you can use column selectors from `selectors`. Column selectors allow for more expressiveness in the way you specify selections. For example, column selectors can perform the familiar set operations of union, intersection, difference, etc. We can use the union operation with the functions `string` and `ends_with` to select all string columns and the columns whose names end with "`_high`":
452
- """
453
- )
454
  return
455
 
456
 
@@ -464,30 +450,30 @@ def _(data):
464
 
465
  @app.cell(hide_code=True)
466
  def _(mo):
467
- mo.md(r"""Likewise, you can pick columns based on the category of the type of data, offering more flexibility than the `col` function. As an example, `cs.numeric` selects numeric data types (including `pl.Float32`, `pl.Float64`, `pl.Int32`, etc.) or `cs.temporal` for all dates, times and similar data types.""")
 
 
468
  return
469
 
470
 
471
  @app.cell(hide_code=True)
472
  def _(mo):
473
- mo.md(
474
- r"""
475
- ### Combining selectors with set operations
476
 
477
- Multiple selectors can be combined using set operations and the usual Python operators:
478
 
479
 
480
- | Operator | Operation |
481
- |:--------:|:--------------------:|
482
- | `A | B` | Union |
483
- | `A & B` | Intersection |
484
- | `A - B` | Difference |
485
- | `A ^ B` | Symmetric difference |
486
- | `~A` | Complement |
487
 
488
- For example, to select all failure indicator variables excluding the failure variables due to wear, we can perform a set difference between the column selectors.
489
- """
490
- )
491
  return
492
 
493
 
@@ -499,13 +485,11 @@ def _(cs, data):
499
 
500
  @app.cell(hide_code=True)
501
  def _(mo):
502
- mo.md(
503
- r"""
504
- ### Resolving operator ambiguity
505
 
506
- Expression functions can be chained on top of selectors:
507
- """
508
- )
509
  return
510
 
511
 
@@ -518,13 +502,11 @@ def _(cs, data, pl):
518
 
519
  @app.cell(hide_code=True)
520
  def _(mo):
521
- mo.md(
522
- r"""
523
- However, operators that perform set operations on column selectors operate on both selectors and on expressions. For example, the operator `~` on a selector represents the set operation “complement” and on an expression represents the Boolean operation of negation.
524
 
525
- For instance, if you want to negate the Boolean values in the columns “HDF”, “OSF”, and “RNF”, at first you would think about using the `~` operator with the column selector to choose all failure variables containing "W". Because of the operator ambiguity here, the columns that are not of interest are selected here.
526
- """
527
- )
528
  return
529
 
530
 
@@ -536,7 +518,9 @@ def _(cs, ext_failure_data):
536
 
537
  @app.cell(hide_code=True)
538
  def _(mo):
539
- mo.md(r"""To resolve the operator ambiguity, we use `as_expr`:""")
 
 
540
  return
541
 
542
 
@@ -548,13 +532,11 @@ def _(cs, ext_failure_data):
548
 
549
  @app.cell(hide_code=True)
550
  def _(mo):
551
- mo.md(
552
- r"""
553
- ### Debugging selectors
554
 
555
- The function `cs.is_selector` helps check whether a complex chain of selectors and operators ultimately results in a selector. For example, to resolve any ambiguity with the selector in the last example, we can do:
556
- """
557
- )
558
  return
559
 
560
 
@@ -566,7 +548,9 @@ def _(cs):
566
 
567
  @app.cell(hide_code=True)
568
  def _(mo):
569
- mo.md(r"""Additionally we can use `expand_selector` to see what columns a selector expands into. Note that for this function we need to provide additional context in the form of the dataframe.""")
 
 
570
  return
571
 
572
 
@@ -581,14 +565,12 @@ def _(cs, ext_failure_data):
581
 
582
  @app.cell(hide_code=True)
583
  def _(mo):
584
- mo.md(
585
- r"""
586
- ### References
587
 
588
- 1. AI4I 2020 Predictive Maintenance Dataset [Dataset]. (2020). UCI Machine Learning Repository. ([link](https://doi.org/10.24432/C5HS5C)).
589
- 2. Polars documentation ([link](https://docs.pola.rs/user-guide/expressions/expression-expansion/#more-flexible-column-selections))
590
- """
591
- )
592
  return
593
 
594
 
@@ -598,7 +580,7 @@ def _():
598
  import marimo as mo
599
  import polars as pl
600
  from io import StringIO
601
- return StringIO, csv, mo, pl
602
 
603
 
604
  if __name__ == "__main__":
 
8
 
9
  import marimo
10
 
11
+ __generated_with = "0.18.4"
12
  app = marimo.App(width="medium")
13
 
14
 
15
  @app.cell(hide_code=True)
16
  def _(mo):
17
+ mo.md(r"""
18
+ # Working with Columns
 
19
 
20
+ Author: [Deb Debnath](https://github.com/debajyotid2)
21
 
22
+ **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/expressions/expression-expansion).
23
+ """)
 
24
  return
25
 
26
 
27
  @app.cell(hide_code=True)
28
  def _(mo):
29
+ mo.md(r"""
30
+ ## Expressions
 
31
 
32
+ Data transformations are sometimes complicated, or involve massive, time-consuming computations. You could prototype on a small version of the dataset with the same schema as your data, but Polars offers a better way.
33
 
34
+ A Polars expression is a lazy representation of a data transformation. "Lazy" means that the transformation is not eagerly (immediately) executed.
35
 
36
+ Expressions are modular and flexible. They can be composed to build more complex expressions. For example, to calculate speed from distance and time, you can have an expression as:
37
+ """)
 
38
  return
39
 
40
 
 
42
  def _(pl):
43
  speed_expr = pl.col("distance") / (pl.col("time"))
44
  speed_expr
45
+ return
46
 
47
 
48
  @app.cell(hide_code=True)
49
  def _(mo):
50
+ mo.md(r"""
51
+ ## Expression expansion
 
52
 
53
+ Expression expansion lets you write a single expression that can expand to multiple different expressions. So rather than repeatedly defining separate expressions, you can avoid redundancy while adhering to clean code principles (Do not Repeat Yourself - [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)). Since expressions are reusable, they aid in writing concise code.
54
+ """)
 
55
  return
56
 
57
 
58
  @app.cell(hide_code=True)
59
  def _(mo):
60
+ mo.md("""
61
+ For the examples in this notebook, we will use a sliver of the *AI4I 2020 Predictive Maintenance Dataset*. This dataset comprises measurements taken from sensors in industrial machinery undergoing preventive maintenance checks - basically being tested for failure conditions.
62
+ """)
63
  return
64
 
65
 
 
76
 
77
  data = pl.read_csv(StringIO(data_csv))
78
  data
79
+ return (data,)
80
 
81
 
82
  @app.cell(hide_code=True)
83
  def _(mo):
84
+ mo.md(r"""
85
+ ## Function `col`
 
86
 
87
+ The function `col` is used to refer to one column of a dataframe. It is one of the fundamental building blocks of expressions in Polars. `col` is also really handy in expression expansion.
88
+ """)
 
89
  return
90
 
91
 
92
  @app.cell(hide_code=True)
93
  def _(mo):
94
+ mo.md(r"""
95
+ ### Explicit expansion by column name
 
96
 
97
+ The simplest form of expression expansion happens when you provide multiple column names to the function `col`.
98
 
99
+ Say you wish to convert all temperature values in deg. Kelvin (K) to deg. Fahrenheit (F). One way to do this would be to define individual expressions for each column as follows:
100
+ """)
 
101
  return
102
 
103
 
 
110
 
111
  result = data.with_columns(exprs)
112
  result
113
+ return (result,)
114
 
115
 
116
  @app.cell(hide_code=True)
117
  def _(mo):
118
+ mo.md(r"""
119
+ Expression expansion can reduce this verbosity when you list the column names you want the expression to expand to inside the `col` function. The result is the same as before.
120
+ """)
121
  return
122
 
123
 
 
133
  ).round(2)
134
  )
135
  result_2.equals(result)
136
+ return
137
 
138
 
139
  @app.cell(hide_code=True)
140
  def _(mo):
141
+ mo.md(r"""
142
+ In this case, the expression that does the temperature conversion is expanded to a list of two expressions. The expansion of the expression is predictable and intuitive.
143
+ """)
144
  return
145
 
146
 
147
  @app.cell(hide_code=True)
148
  def _(mo):
149
+ mo.md(r"""
150
+ ### Expansion by data type
 
151
 
152
+ Can we do better than explicitly writing the name of every column we want transformed? Yes.
153
 
154
+ If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided.
155
 
156
+ The example below performs the exact same computation as before:
157
+ """)
 
158
  return
159
 
160
 
 
162
  def _(data, pl, result):
163
  result_3 = data.with_columns(((pl.col(pl.Float64) - 273.15) * 1.8 + 32).round(2))
164
  result_3.equals(result)
165
+ return
166
 
167
 
168
  @app.cell(hide_code=True)
169
  def _(mo):
170
+ mo.md(r"""
171
+ However, you should be careful to ensure that the transformation is only applied to the columns you want. To ensure this, it is important to know the schema of the data beforehand.
 
172
 
173
+ `col` accepts multiple data types in case the columns you need have more than one data type.
174
+ """)
 
175
  return
176
 
177
 
 
187
  ).round(2)
188
  )
189
  result.equals(result_4)
190
+ return
191
 
192
 
193
  @app.cell(hide_code=True)
194
  def _(mo):
195
+ mo.md(r"""
196
+ ### Expansion by pattern matching
 
197
 
198
+ `col` also accepts regular expressions for selecting columns by pattern matching. Patterns must start with `^` and end with `$`.
199
+ """)
 
200
  return
201
 
202
 
 
208
 
209
  @app.cell(hide_code=True)
210
  def _(mo):
211
+ mo.md(r"""
212
+ Regular expressions can be combined with exact column names.
213
+ """)
214
  return
215
 
216
 
 
222
 
223
  @app.cell(hide_code=True)
224
  def _(mo):
225
+ mo.md(r"""
226
+ **Note**: You _cannot_ mix strings (exact names, regular expressions) and data types in a `col` function.
227
+ """)
228
  return
229
 
230
 
 
239
 
240
  @app.cell(hide_code=True)
241
  def _(mo):
242
+ mo.md(r"""
243
+ ## Selecting all columns
 
244
 
245
+ To select all columns, you can use the `all` function.
246
+ """)
 
247
  return
248
 
249
 
 
251
  def _(data, pl):
252
  result_6 = data.select(pl.all())
253
  result_6.equals(data)
254
+ return
255
 
256
 
257
  @app.cell(hide_code=True)
258
  def _(mo):
259
+ mo.md(r"""
260
+ ## Excluding columns
 
261
 
262
+ There are scenarios where we might want to exclude specific columns from the ones selected by building expressions, e.g. by the `col` or `all` functions. For this purpose, we use the function `exclude`, which accepts exactly the same types of arguments as `col`:
263
+ """)
 
264
  return
265
 
266
 
 
272
 
273
  @app.cell(hide_code=True)
274
  def _(mo):
275
+ mo.md(r"""
276
+ `exclude` can also be used after the function `col`:
277
+ """)
278
  return
279
 
280
 
 
286
 
287
  @app.cell(hide_code=True)
288
  def _(mo):
289
+ mo.md(r"""
290
+ ## Column renaming
 
291
 
292
+ When applying a transformation with an expression to a column, the data in the column gets overwritten with the transformed data. However, this might not be the intended outcome in all situations - ideally you would want to store transformed data in a new column. Applying multiple transformations to the same column at the same time without renaming leads to errors.
293
+ """)
 
294
  return
295
 
296
 
 
305
  )
306
  except DuplicateError as err:
307
  print("DuplicateError:", err)
308
+ return
309
 
310
 
311
  @app.cell(hide_code=True)
312
  def _(mo):
313
+ mo.md(r"""
314
+ ### Renaming a single column with `alias`
 
315
 
316
+ The function `alias` lets you rename a single column:
317
+ """)
 
318
  return
319
 
320
 
 
329
 
330
  @app.cell(hide_code=True)
331
  def _(mo):
332
+ mo.md(r"""
333
+ ### Prefixing and suffixing column names
 
334
 
335
+ As `alias` renames a single column at a time, it cannot be used during expression expansion. If it is sufficient to add a static prefix or a static suffix to the existing names, you can use the functions `name.prefix` and `name.suffix` with `col`:
336
+ """)
 
337
  return
338
 
339
 
 
348
 
349
  @app.cell(hide_code=True)
350
  def _(mo):
351
+ mo.md(r"""
352
+ ### Dynamic name replacement
 
353
 
354
+ If a static prefix/suffix is not enough, use `name.map`. `name.map` requires a function that transforms column names into the desired names. The transformation should produce unique names to avoid a `DuplicateError`.
355
+ """)
 
356
  return
357
 
358
 
 
365
 
366
  @app.cell(hide_code=True)
367
  def _(mo):
368
+ mo.md(r"""
369
+ ## Programmatically generating expressions
 
370
 
371
+ For this example, we will first create four additional columns with the rolling mean temperatures of the two temperature columns. Such transformations are sometimes used to create additional features for machine learning models or data analysis.
372
+ """)
 
373
  return
374
 
375
 
 
384
 
385
  @app.cell(hide_code=True)
386
  def _(mo):
387
+ mo.md(r"""
388
+ Now, suppose we want to calculate the difference between the rolling mean and actual temperatures. We cannot use expression expansion here as we want differences between specific columns.
389
+ """)
390
  return
391
 
392
 
393
  @app.cell(hide_code=True)
394
  def _(mo):
395
+ mo.md(r"""
396
+ At first, you may think about using a `for` loop:
397
+ """)
398
  return
399
 
400
 
 
407
  .round(2).alias(f"Delta {col_name} temperature")
408
  )
409
  _result
410
+ return
411
 
412
 
413
  @app.cell(hide_code=True)
414
  def _(mo):
415
+ mo.md(r"""
416
+ Using a `for` loop is functional, but not scalable, as each expression needs to be defined in an iteration and executed serially. Instead we can use a generator in Python to programmatically create all expressions at once. In conjunction with the `with_columns` context, we can take advantage of parallel execution of computations and query optimization from Polars.
417
+ """)
418
  return
419
 
420
 
 
427
 
428
 
429
  ext_temp_data.with_columns(delta_expressions(["Air", "Process"]))
430
+ return
431
 
432
 
433
  @app.cell(hide_code=True)
434
  def _(mo):
435
+ mo.md(r"""
436
+ ## More flexible column selections
 
437
 
438
+ For more flexible column selections, you can use column selectors from `selectors`. Column selectors allow for more expressiveness in the way you specify selections. For example, column selectors can perform the familiar set operations of union, intersection, difference, etc. We can use the union operation with the functions `string` and `ends_with` to select all string columns and the columns whose names end with "`_high`":
439
+ """)
 
440
  return
441
 
442
 
 
450
 
451
  @app.cell(hide_code=True)
452
  def _(mo):
453
+ mo.md(r"""
454
+ Likewise, you can pick columns based on broad categories of data types, offering more flexibility than the `col` function. As an example, `cs.numeric` selects all numeric data types (including `pl.Float32`, `pl.Float64`, `pl.Int32`, etc.) and `cs.temporal` selects all dates, times, and similar data types.
455
+ """)
456
  return
457
 
458
 
459
  @app.cell(hide_code=True)
460
  def _(mo):
461
+ mo.md(r"""
462
+ ### Combining selectors with set operations
 
463
 
464
+ Multiple selectors can be combined using set operations and the usual Python operators:
465
 
466
 
467
+ | Operator | Operation |
468
+ |:--------:|:--------------------:|
469
+ | `A | B` | Union |
470
+ | `A & B` | Intersection |
471
+ | `A - B` | Difference |
472
+ | `A ^ B` | Symmetric difference |
473
+ | `~A` | Complement |
474
 
475
+ For example, to select all failure indicator variables excluding the failure variables due to wear, we can perform a set difference between the column selectors.
476
+ """)
 
477
  return
478
 
479
 
 
485
 
486
  @app.cell(hide_code=True)
487
  def _(mo):
488
+ mo.md(r"""
489
+ ### Resolving operator ambiguity
 
490
 
491
+ Expression functions can be chained on top of selectors:
492
+ """)
 
493
  return
494
 
495
 
 
502
 
503
  @app.cell(hide_code=True)
504
  def _(mo):
505
+ mo.md(r"""
506
+ However, the operators that perform set operations on column selectors are also defined on expressions, with different meanings. For example, the operator `~` on a selector represents the set operation “complement”, while on an expression it represents the Boolean operation of negation.
 
507
 
508
+ For instance, if you want to negate the Boolean values in the columns “HDF”, “OSF”, and “RNF”, you might at first think of combining the `~` operator with a column selector for the failure variables containing "W". Because of the operator ambiguity, columns that are not of interest end up selected.
509
+ """)
 
510
  return
511
 
512
 
 
518
 
519
  @app.cell(hide_code=True)
520
  def _(mo):
521
+ mo.md(r"""
522
+ To resolve the operator ambiguity, we use `as_expr`:
523
+ """)
524
  return
525
 
526
 
 
532
 
533
  @app.cell(hide_code=True)
534
  def _(mo):
535
+ mo.md(r"""
536
+ ### Debugging selectors
 
537
 
538
+ The function `cs.is_selector` helps check whether a complex chain of selectors and operators ultimately results in a selector. For example, to resolve any ambiguity with the selector in the last example, we can do:
539
+ """)
 
540
  return
541
 
542
 
 
548
 
549
  @app.cell(hide_code=True)
550
  def _(mo):
551
+ mo.md(r"""
552
+ Additionally, we can use `expand_selector` to see which columns a selector expands into. Note that this function needs additional context in the form of the dataframe.
553
+ """)
554
  return
555
 
556
 
 
565
 
566
  @app.cell(hide_code=True)
567
  def _(mo):
568
+ mo.md(r"""
569
+ ### References
 
570
 
571
+ 1. AI4I 2020 Predictive Maintenance Dataset [Dataset]. (2020). UCI Machine Learning Repository. ([link](https://doi.org/10.24432/C5HS5C)).
572
+ 2. Polars documentation ([link](https://docs.pola.rs/user-guide/expressions/expression-expansion/#more-flexible-column-selections))
573
+ """)
 
574
  return
575
 
576
 
 
580
  import marimo as mo
581
  import polars as pl
582
  from io import StringIO
583
+ return StringIO, mo, pl
584
 
585
 
586
  if __name__ == "__main__":
polars/09_data_types.py CHANGED
@@ -8,52 +8,46 @@
8
 
9
  import marimo
10
 
11
- __generated_with = "0.12.0"
12
  app = marimo.App(width="medium")
13
 
14
 
15
  @app.cell(hide_code=True)
16
  def _(mo):
17
- mo.md(
18
- r"""
19
- # Data Types
20
 
21
- Author: [Deb Debnath](https://github.com/debajyotid2)
22
 
23
- **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/).
24
- """
25
- )
26
  return
27
 
28
 
29
  @app.cell(hide_code=True)
30
  def _(mo):
31
- mo.md(
32
- r"""
33
- Polars supports a variety of data types that fall broadly under the following categories:
34
 
35
- - Numeric data types: integers and floating point numbers.
36
- - Nested data types: lists, structs, and arrays.
37
- - Temporal: dates, datetimes, times, and time deltas.
38
- - Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.
39
 
40
- All types support missing values represented by `null` which is different from `NaN` used in floating point data types. The numeric datatypes in Polars loosely follow the type system of the Rust language, since its core functionalities are built in Rust.
41
 
42
- [Here](https://docs.pola.rs/api/python/stable/reference/datatypes.html) is a full list of all data types Polars supports.
43
- """
44
- )
45
  return
46
 
47
 
48
  @app.cell(hide_code=True)
49
  def _(mo):
50
- mo.md(
51
- r"""
52
- ## Series
53
 
54
- A series is a 1-dimensional data structure that can hold only one data type.
55
- """
56
- )
57
  return
58
 
59
 
@@ -61,12 +55,14 @@ def _(mo):
61
  def _(pl):
62
  s = pl.Series("emojis", ["😀", "🤣", "🥶", "💀", "🤖"])
63
  s
64
- return (s,)
65
 
66
 
67
  @app.cell(hide_code=True)
68
  def _(mo):
69
- mo.md(r"""Unless specified, Polars infers the datatype from the supplied values.""")
 
 
70
  return
71
 
72
 
@@ -75,20 +71,18 @@ def _(pl):
75
  s1 = pl.Series("friends", ["Евгений", "अभिषेक", "秀良", "Federico", "Bob"])
76
  s2 = pl.Series("uints", [0x00, 0x01, 0x10, 0x11], dtype=pl.UInt8)
77
  s1.dtype, s2.dtype
78
- return s1, s2
79
 
80
 
81
  @app.cell(hide_code=True)
82
  def _(mo):
83
- mo.md(
84
- r"""
85
- ## Dataframe
86
 
87
- A dataframe is a 2-dimensional data structure that contains uniquely named series and can hold multiple data types. Dataframes are more commonly used for data manipulation using the functionality of Polars.
88
 
89
- The snippet below shows how to create a dataframe from a dictionary of lists:
90
- """
91
- )
92
  return
93
 
94
 
@@ -108,28 +102,24 @@ def _(pl):
108
 
109
  @app.cell(hide_code=True)
110
  def _(mo):
111
- mo.md(
112
- r"""
113
- ### Inspecting a dataframe
114
 
115
- Polars has various functions to explore the data in a dataframe. We will use the dataframe `data` defined above in our examples. Alongside we can also see a view of the dataframe rendered by `marimo` as the cells are executed.
116
 
117
- ///note
118
- We can also use `marimo`'s built in data-inspection elements/features such as [`mo.ui.dataframe`](https://docs.marimo.io/api/inputs/dataframe/#marimo.ui.dataframe) & [`mo.ui.data_explorer`](https://docs.marimo.io/api/inputs/data_explorer/). For more check out our Polars tutorials at [`marimo learn`](https://marimo-team.github.io/learn/)!
119
- """
120
- )
121
  return
122
 
123
 
124
  @app.cell(hide_code=True)
125
  def _(mo):
126
- mo.md(
127
- """
128
- #### Head
129
 
130
- The function `head` shows the first rows of a dataframe. Unless specified, it shows the first 5 rows.
131
- """
132
- )
133
  return
134
 
135
 
@@ -141,13 +131,11 @@ def _(data):
141
 
142
  @app.cell(hide_code=True)
143
  def _(mo):
144
- mo.md(
145
- r"""
146
- #### Glimpse
147
 
148
- The function `glimpse` is an alternative to `head` to view the first few columns, but displays each line of the output corresponding to a single column. That way, it makes inspecting wider dataframes easier.
149
- """
150
- )
151
  return
152
 
153
 
@@ -159,13 +147,11 @@ def _(data):
159
 
160
  @app.cell(hide_code=True)
161
  def _(mo):
162
- mo.md(
163
- r"""
164
- #### Tail
165
 
166
- The `tail` function, just like its name suggests, shows the last rows of a dataframe. Unless the number of rows is specified, it will show the last 5 rows.
167
- """
168
- )
169
  return
170
 
171
 
@@ -177,13 +163,11 @@ def _(data):
177
 
178
  @app.cell(hide_code=True)
179
  def _(mo):
180
- mo.md(
181
- r"""
182
- #### Sample
183
 
184
- `sample` can be used to show a specified number of randomly selected rows from the dataframe. Unless the number of rows is specified, it will show a single row. `sample` does not preserve order of the rows.
185
- """
186
- )
187
  return
188
 
189
 
@@ -194,18 +178,16 @@ def _(data):
194
  random.seed(42) # For reproducibility.
195
 
196
  data.sample(3)
197
- return (random,)
198
 
199
 
200
  @app.cell(hide_code=True)
201
  def _(mo):
202
- mo.md(
203
- r"""
204
- #### Describe
205
 
206
- The function `describe` describes the summary statistics for all columns of a dataframe.
207
- """
208
- )
209
  return
210
 
211
 
@@ -217,13 +199,11 @@ def _(data):
217
 
218
  @app.cell(hide_code=True)
219
  def _(mo):
220
- mo.md(
221
- r"""
222
- ## Schema
223
 
224
- A schema is a mapping showing the datatype corresponding to every column of a dataframe. The schema of a dataframe can be viewed using the attribute `schema`.
225
- """
226
- )
227
  return
228
 
229
 
@@ -235,7 +215,9 @@ def _(data):
235
 
236
  @app.cell(hide_code=True)
237
  def _(mo):
238
- mo.md(r"""Since a schema is a mapping, it can be specified in the form of a Python dictionary. Then this dictionary can be used to specify the schema of a dataframe on definition. If not specified or the entry is `None`, Polars infers the datatype from the contents of the column. Note that if the schema is not specified, it will be inferred automatically by default.""")
 
 
239
  return
240
 
241
 
@@ -255,7 +237,9 @@ def _(pl):
255
 
256
  @app.cell(hide_code=True)
257
  def _(mo):
258
- mo.md(r"""Sometimes the automatically inferred schema is enough for some columns, but we might wish to override the inference of only some columns. We can specify the schema for those columns using `schema_overrides`.""")
 
 
259
  return
260
 
261
 
@@ -275,13 +259,11 @@ def _(pl):
275
 
276
  @app.cell(hide_code=True)
277
  def _(mo):
278
- mo.md(
279
- r"""
280
- ### References
281
 
282
- 1. Polars documentation ([link](https://docs.pola.rs/api/python/stable/reference/datatypes.html))
283
- """
284
- )
285
  return
286
 
287
 
 
8
 
9
  import marimo
10
 
11
+ __generated_with = "0.18.4"
12
  app = marimo.App(width="medium")
13
 
14
 
15
  @app.cell(hide_code=True)
16
  def _(mo):
17
+ mo.md(r"""
18
+ # Data Types
 
19
 
20
+ Author: [Deb Debnath](https://github.com/debajyotid2)
21
 
22
+ **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/).
23
+ """)
 
24
  return
25
 
26
 
27
  @app.cell(hide_code=True)
28
  def _(mo):
29
+ mo.md(r"""
30
+ Polars supports a variety of data types that fall broadly under the following categories:
 
31
 
32
+ - Numeric data types: integers and floating point numbers.
33
+ - Nested data types: lists, structs, and arrays.
34
+ - Temporal: dates, datetimes, times, and time deltas.
35
+ - Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.
36
 
37
+ All types support missing values, represented by `null`, which is distinct from the `NaN` value used in floating-point data types. The numeric data types in Polars loosely follow the type system of the Rust language, since its core functionality is built in Rust.
38
 
39
+ [Here](https://docs.pola.rs/api/python/stable/reference/datatypes.html) is a full list of all data types Polars supports.
40
+ """)
 
41
  return
42
 
43
 
44
  @app.cell(hide_code=True)
45
  def _(mo):
46
+ mo.md(r"""
47
+ ## Series
 
48
 
49
+ A series is a 1-dimensional data structure that can hold only one data type.
50
+ """)
 
51
  return
52
 
53
 
 
55
  def _(pl):
56
  s = pl.Series("emojis", ["😀", "🤣", "🥶", "💀", "🤖"])
57
  s
58
+ return
59
 
60
 
61
  @app.cell(hide_code=True)
62
  def _(mo):
63
+ mo.md(r"""
64
+ Unless specified, Polars infers the datatype from the supplied values.
65
+ """)
66
  return
67
 
68
 
 
71
  s1 = pl.Series("friends", ["Евгений", "अभिषेक", "秀良", "Federico", "Bob"])
72
  s2 = pl.Series("uints", [0x00, 0x01, 0x10, 0x11], dtype=pl.UInt8)
73
  s1.dtype, s2.dtype
74
+ return
75
 
76
 
77
  @app.cell(hide_code=True)
78
  def _(mo):
79
+ mo.md(r"""
80
+ ## Dataframe
 
81
 
82
+ A dataframe is a 2-dimensional data structure that contains uniquely named series and can hold multiple data types. Dataframes, rather than individual series, are what most of Polars' data manipulation functionality operates on.
83
 
84
+ The snippet below shows how to create a dataframe from a dictionary of lists:
85
+ """)
 
86
  return
87
 
88
 
 
102
 
103
  @app.cell(hide_code=True)
104
  def _(mo):
105
+ mo.md(r"""
106
+ ### Inspecting a dataframe
 
107
 
108
+ Polars has various functions to explore the data in a dataframe. We will use the dataframe `data` defined above in our examples. Alongside each example, we can also see a view of the dataframe rendered by `marimo` as the cells are executed.
109
 
110
+ ///note
111
+ We can also use `marimo`'s built-in data-inspection elements such as [`mo.ui.dataframe`](https://docs.marimo.io/api/inputs/dataframe/#marimo.ui.dataframe) & [`mo.ui.data_explorer`](https://docs.marimo.io/api/inputs/data_explorer/). For more, check out our Polars tutorials at [`marimo learn`](https://marimo-team.github.io/learn/)!
112
+ """)
 
113
  return
114
 
115
 
116
  @app.cell(hide_code=True)
117
  def _(mo):
118
+ mo.md("""
119
+ #### Head
 
120
 
121
+ The function `head` shows the first rows of a dataframe. Unless specified, it shows the first 5 rows.
122
+ """)
 
123
  return
124
 
125
 
 
131
 
132
  @app.cell(hide_code=True)
133
  def _(mo):
134
+ mo.md(r"""
135
+ #### Glimpse
 
136
 
137
+ The function `glimpse` is an alternative to `head` for viewing the first few rows, but it dedicates one line of output to each column, which makes inspecting wider dataframes easier.
138
+ """)
 
139
  return
140
 
141
 
 
147
 
148
  @app.cell(hide_code=True)
149
  def _(mo):
150
+ mo.md(r"""
151
+ #### Tail
 
152
 
153
+ The `tail` function, just like its name suggests, shows the last rows of a dataframe. Unless the number of rows is specified, it will show the last 5 rows.
154
+ """)
 
155
  return
156
 
157
 
 
163
 
164
  @app.cell(hide_code=True)
165
  def _(mo):
166
+ mo.md(r"""
167
+ #### Sample
 
168
 
169
+ `sample` can be used to show a specified number of randomly selected rows from the dataframe. Unless the number of rows is specified, it will show a single row. `sample` does not preserve order of the rows.
170
+ """)
 
171
  return
172
 
173
 
 
178
  random.seed(42) # For reproducibility.
179
 
180
  data.sample(3)
181
+ return
182
 
183
 
184
  @app.cell(hide_code=True)
185
  def _(mo):
186
+ mo.md(r"""
187
+ #### Describe
 
188
 
189
+ The function `describe` computes summary statistics for all columns of a dataframe.
190
+ """)
 
191
  return
192
 
193
 
 
199
 
200
  @app.cell(hide_code=True)
201
  def _(mo):
202
+ mo.md(r"""
203
+ ## Schema
 
204
 
205
+ A schema is a mapping showing the datatype corresponding to every column of a dataframe. The schema of a dataframe can be viewed using the attribute `schema`.
206
+ """)
 
207
  return
208
 
209
 
 
215
 
216
  @app.cell(hide_code=True)
217
  def _(mo):
218
+ mo.md(r"""
219
+ Since a schema is a mapping, it can be specified as a Python dictionary, which can then be passed to set the schema of a dataframe on definition. If the schema is not specified, or an entry is `None`, Polars infers the datatype from the contents of the column.
220
+ """)
221
  return
222
 
223
 
 
237
 
238
  @app.cell(hide_code=True)
239
  def _(mo):
240
+ mo.md(r"""
241
+ Sometimes the automatically inferred schema is enough for some columns, but we might wish to override the inference of only some columns. We can specify the schema for those columns using `schema_overrides`.
242
+ """)
243
  return
244
 
245
 
 
259
 
260
  @app.cell(hide_code=True)
261
  def _(mo):
262
+ mo.md(r"""
263
+ ### References
 
264
 
265
+ 1. Polars documentation ([link](https://docs.pola.rs/api/python/stable/reference/datatypes.html))
266
+ """)
 
267
  return
268
 
269
 
polars/10_strings.py CHANGED
@@ -10,36 +10,32 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.11.17"
14
  app = marimo.App(width="medium")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
- mo.md(
20
- r"""
21
- # Strings
22
 
23
- _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
24
 
25
- In this chapter we're going to dig into string manipulation. For a fun twist, we'll be mostly playing around with a dataset that every Polars user has bumped into without really thinking about it—the source code of the `polars` module itself. More precisely, we'll use a dataframe that pulls together all the Polars expressions and their docstrings, giving us a cool, hands-on way to explore the expression API in a truly data-driven manner.
26
 
27
- We'll cover parsing, length calculation, case conversion, and much more, with practical examples and visualizations. Finally, we will combine various techniques you learned in prior chapters to build a fully interactive playground in which you can execute the official code examples of Polars expressions.
28
- """
29
- )
30
  return
31
 
32
 
33
  @app.cell(hide_code=True)
34
  def _(mo):
35
- mo.md(
36
- r"""
37
- ## 🛠️ Parsing & Conversion
38
 
39
- Let's warm up with one of the most frequent use cases: parsing raw strings into various formats.
40
- We'll take a tiny dataframe with metadata about Python packages represented as raw JSON strings and we'll use Polars string expressions to parse the attributes into their true data types.
41
- """
42
- )
43
  return
44
 
45
 
@@ -58,7 +54,9 @@ def _(pl):
58
 
59
  @app.cell(hide_code=True)
60
  def _(mo):
61
- mo.md(r"""We can use the [`json_decode`](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.json_decode.html) expression to parse the raw JSON strings into Polars-native structs and we can use the [unnest](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html) dataframe operation to have a dedicated column per parsed attribute.""")
 
 
62
  return
63
 
64
 
@@ -71,13 +69,17 @@ def _(pip_metadata_raw_df, pl):
71
 
72
  @app.cell(hide_code=True)
73
  def _(mo):
74
- mo.md(r"""This is already a much friendlier representation of the data we started out with, but note that since the JSON entries had only string attributes, all values are strings, even the temporal `released_at` and numerical `size_mb` columns.""")
 
 
75
  return
76
 
77
 
78
  @app.cell(hide_code=True)
79
  def _(mo):
80
- mo.md(r"""As we know that the `size_mb` column should have a decimal representation, we go ahead and use [`to_decimal`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_decimal.html#polars.Expr.str.to_decimal) to perform the conversion.""")
 
 
81
  return
82
 
83
 
@@ -93,25 +95,23 @@ def _(pip_metadata_df, pl):
93
 
94
  @app.cell(hide_code=True)
95
  def _(mo):
96
- mo.md(
97
- r"""
98
- Moving on to the `released_at` attribute which indicates the exact time when a given Python package got released, we have a bit more options to consider. We can convert to `Date`, `DateTime`, and `Time` types based on the desired temporal granularity. The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion, all we need is to provide the desired format string.
99
 
100
- Since Polars uses Rust under the hood to implement all its expressions, we need to consult the [`chrono::format`](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) reference to come up with appropriate format strings.
101
 
102
- Here's a quick reference:
103
 
104
- | Specifier | Meaning |
105
- |-----------|--------------------|
106
- | `%Y` | Year (e.g., 2025) |
107
- | `%m` | Month (01-12) |
108
- | `%d` | Day (01-31) |
109
- | `%H` | Hour (00-23) |
110
- | `%z` | UTC offset |
111
 
112
- The raw strings we are working with look like `"2025-03-02T20:31:12+0000"`. We can match this using the `"%Y-%m-%dT%H:%M:%S%z"` format string.
113
- """
114
- )
115
  return
116
 
117
 
@@ -129,7 +129,9 @@ def _(pip_metadata_df, pl):
129
 
130
  @app.cell(hide_code=True)
131
  def _(mo):
132
- mo.md(r"""Alternatively, instead of using three different functions to perform the conversion to date, we can use a single one, [`strptime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strptime.html) which takes the desired temporal data type as its first parameter.""")
 
 
133
  return
134
 
135
 
@@ -147,7 +149,9 @@ def _(pip_metadata_df, pl):
147
 
148
  @app.cell(hide_code=True)
149
  def _(mo):
150
- mo.md(r"""And to wrap up this section on parsing and conversion, let's consider a final scenario. What if we don't want to parse the entire raw JSON string, because we only need a subset of its attributes? Well, in this case we can leverage the [`json_path_match`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_path_match.html) expression to extract only the desired attributes using standard [JSONPath](https://goessner.net/articles/JsonPath/) syntax.""")
 
 
151
  return
152
 
153
 
@@ -165,17 +169,15 @@ def _(pip_metadata_raw_df, pl):
165
 
166
  @app.cell(hide_code=True)
167
  def _(mo):
168
- mo.md(
169
- r"""
170
- ## 📊 Dataset Overview
171
 
172
- Now that we got our hands dirty, let's consider a somewhat wilder dataset for the subsequent sections: a dataframe of metadata about every single expression in your current Polars module.
173
 
174
- At the risk of stating the obvious, in the previous section, when we typed `pl.col('raw_json').str.json_decode()`, we accessed the `json_decode` member of the `str` expression namespace through the `pl.col('raw_json')` expression *instance*. Under the hood, deep inside the Polars source code, there is a corresponding `def json_decode(...)` method with a carefully authored docstring explaining the purpose and signature of the member.
175
 
176
- Since Python makes module introspection simple, we can easily enumerate all Polars expressions and organize their metadata in `expressions_df`, to be used for all the upcoming string manipulation examples.
177
- """
178
- )
179
  return
180
 
181
 
@@ -214,12 +216,14 @@ def _(pl):
214
 
215
  expressions_df = pl.from_dicts(list_expr_meta(), infer_schema_length=None).sort('namespace', 'member')
216
  expressions_df
217
- return expressions_df, list_expr_meta, list_members
218
 
219
 
220
  @app.cell(hide_code=True)
221
  def _(mo):
222
- mo.md(r"""As the following visualization shows, `str` is one of the richest Polars expression namespaces with multiple dozens of functions in it.""")
 
 
223
  return
224
 
225
 
@@ -234,17 +238,15 @@ def _(alt, expressions_df):
234
 
235
  @app.cell(hide_code=True)
236
  def _(mo):
237
- mo.md(
238
- r"""
239
- ## 📏 Length Calculation
240
 
241
- A common use case is to compute the length of a string. Most people associate string length exclusively with the number of characters a string consists of; however, it is sometimes also useful to know how much memory is needed to store it, that is, how many bytes are required to represent the textual data.
242
 
243
- The expressions [`len_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html) and [`len_bytes`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html) are here to help us with these calculations.
244
 
245
- Below, we compute `docstring_len_chars` and `docstring_len_bytes` columns to see how many characters and bytes the documentation of each expression is made up of.
246
- """
247
- )
248
  return
249
 
250
 
@@ -262,7 +264,9 @@ def _(expressions_df, pl):
262
 
263
  @app.cell(hide_code=True)
264
  def _(mo):
265
- mo.md(r"""As the dataframe preview above and the scatterplot below show, the docstring length measured in bytes is almost always bigger than the length expressed in characters. This is due to the fact that the docstrings include characters which require more than a single byte to represent, such as "╞" for displaying dataframe header and body separators.""")
 
 
266
  return
267
 
268
 
@@ -278,13 +282,11 @@ def _(alt, docstring_length_df):
278
 
279
  @app.cell(hide_code=True)
280
  def _(mo):
281
- mo.md(
282
- r"""
283
- ## 🔠 Case Conversion
284
 
285
- Another frequent string transformation is lowercasing, uppercasing, and titlecasing. We can use [`to_lowercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html), [`to_uppercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_uppercase.html) and [`to_titlecase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html) for doing so.
286
- """
287
- )
288
  return
289
 
290
 
@@ -300,15 +302,13 @@ def _(expressions_df, pl):
300
 
301
  @app.cell(hide_code=True)
302
  def _(mo):
303
- mo.md(
304
- r"""
305
- ## ➕ Padding
306
 
307
- Sometimes we need to ensure that strings have a fixed-size character length. [`pad_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_start.html) and [`pad_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_end.html) can be used to fill the "front" or "back" of a string with a supplied character, while [`zfill`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.zfill.html) is a utility for padding the start of a string with `"0"` until it reaches a particular length. In other words, `zfill` is a more specific version of `pad_start`, where the `fill_char` parameter is explicitly set to `"0"`.
308
 
309
- In the example below we take the unique Polars expression namespaces and pad them so that they have a uniform length which you can control via a slider.
310
- """
311
- )
312
  return
313
 
314
 
@@ -340,15 +340,13 @@ def _(mo, padded_df, padding):
340
 
341
  @app.cell(hide_code=True)
342
  def _(mo):
343
- mo.md(
344
- r"""
345
- ## 🔄 Replacing
346
 
347
- Let's say we want to convert from `snake_case` API member names to `kebab-case`, that is, we need to replace the underscore character with a hyphen. For operations like that, we can use [`replace`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html) and [`replace_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html).
348
 
349
- As the example below demonstrates, `replace` stops after the first occurrence of the to-be-replaced pattern, while `replace_all` goes all the way through and changes all underscores to hyphens resulting in the `kebab-case` representation we were looking for.
350
- """
351
- )
352
  return
353
 
354
 
@@ -364,13 +362,11 @@ def _(expressions_df, pl):
364
 
365
  @app.cell(hide_code=True)
366
  def _(mo):
367
- mo.md(
368
- r"""
369
- A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance.
370
 
371
- In the example below we replace all instances of `"min"` with `"minimum"` and `"max"` with `"maximum"` using a single expression.
372
- """
373
- )
374
  return
375
 
376
 
@@ -390,15 +386,13 @@ def _(expressions_df, pl):
390
 
391
  @app.cell(hide_code=True)
392
  def _(mo):
393
- mo.md(
394
- r"""
395
- ## 🔍 Searching & Matching
396
 
397
- A common need when working with strings is to determine whether their content satisfies some condition: whether it starts or ends with a particular substring or contains a certain pattern.
398
 
399
- Let's suppose we want to determine whether a member of the Polars expression API is a "converter", such as `to_decimal`, identified by its `"to_"` prefix. We can use [`starts_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html) to perform this check.
400
- """
401
- )
402
  return
403
 
404
 
@@ -414,13 +408,11 @@ def _(expressions_df, pl):
414
 
415
  @app.cell(hide_code=True)
416
  def _(mo):
417
- mo.md(
418
- r"""
419
- Throughout this course, as you have gained familiarity with the expression API, you might have noticed that some members end with an underscore, such as `or_`; this is because the name without the underscore is a reserved Python keyword.
420
 
421
- Let's use [`ends_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html) to find all the members which are named after such keywords.
422
- """
423
- )
424
  return
425
 
426
 
@@ -436,13 +428,11 @@ def _(expressions_df, pl):
436
 
437
  @app.cell(hide_code=True)
438
  def _(mo):
439
- mo.md(
440
- r"""
441
- Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members.
442
 
443
- As demonstrated below, we can compute all these boolean attributes using [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) to check whether the docstring includes a particular substring.
444
- """
445
- )
446
  return
447
 
448
 
@@ -462,7 +452,9 @@ def _(expressions_df, pl):
462
 
463
  @app.cell(hide_code=True)
464
  def _(mo):
465
- mo.md(r"""For scenarios where we want to combine multiple substrings to check for, we can use the [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) expression to check for the presence of various patterns.""")
 
 
466
  return
467
 
468
 
@@ -478,21 +470,19 @@ def _(expressions_df, pl):
478
 
479
  @app.cell(hide_code=True)
480
  def _(mo):
481
- mo.md(
482
- r"""
483
- From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments occur within each of these examples, right? That's not as simple as checking for containment of a pre-defined literal string, though, because variables can have arbitrary names - any valid Python identifier is allowed. While the `contains` function also supports regular expressions instead of literal strings, it would not suffice for this exercise because it only tells us whether there is at least one occurrence of the sought pattern rather than the exact number of matches.
484
 
485
- Fortunately, we can take advantage of [`count_matches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.count_matches.html) to achieve exactly what we want. We specify the regular expression `r'[a-zA-Z_][a-zA-Z0-9_]* = '` according to the [`regex` Rust crate](https://docs.rs/regex/latest/regex/) to match Python identifiers and we leave the rest to Polars.
486
 
487
- In `count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = ')`:
488
 
489
- - `[a-zA-Z_]` matches a letter or underscore (start of a Python identifier).
490
- - `[a-zA-Z0-9_]*` matches zero or more letters, digits, or underscores.
491
- - ` = ` matches a space, equals sign, and space (indicating assignment).
492
 
493
- This finds variable assignments like `x = ` or `df_result = ` in docstrings.
494
- """
495
- )
496
  return
497
 
498
 
@@ -508,7 +498,9 @@ def _(expressions_df, pl):
508
 
509
  @app.cell(hide_code=True)
510
  def _(mo):
511
- mo.md(r"""A related application example is to *find* the first index where a particular pattern is present, so that it can be used for downstream processing such as slicing. Below we use the [`find`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.find.html) expression to determine the index at which a code example starts in the docstring - identified by the Python shell substring `">>>"`.""")
 
 
512
  return
513
 
514
 
@@ -524,13 +516,11 @@ def _(expressions_df, pl):
524
 
525
  @app.cell(hide_code=True)
526
  def _(mo):
527
- mo.md(
528
- r"""
529
- ## ✂️ Slicing and Substrings
530
 
531
- Sometimes we are only interested in a particular substring. We can use [`head`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html), [`tail`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html) and [`slice`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html) to extract a substring from the start, end, or between arbitrary indices.
532
- """
533
- )
534
  return
535
 
536
 
@@ -564,17 +554,15 @@ def _(mo, slice, sliced_df):
564
 
565
  @app.cell(hide_code=True)
566
  def _(mo):
567
- mo.md(
568
- r"""
569
- ## ➗ Splitting
570
 
571
- Certain strings follow a well-defined structure and we might be only interested in some parts of them. For example, when dealing with `snake_cased_expression` member names we might be curious to get only the first, second, or $n^{\text{th}}$ word before an underscore. We would need to *split* the string at a particular pattern for downstream processing.
572
 
573
- The [`split`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html), [`split_exact`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split_exact.html) and [`splitn`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.splitn.html) expressions enable us to achieve this.
574
 
575
- The primary difference between these string splitting utilities is that `split` produces a list of variadic length based on the number of resulting segments, `splitn` returns a struct with at least `0` and at most `n` fields while `split_exact` returns a struct of exactly `n` fields.
576
- """
577
- )
578
  return
579
 
580
 
@@ -591,7 +579,9 @@ def _(expressions_df, pl):
591
 
592
  @app.cell(hide_code=True)
593
  def _(mo):
594
- mo.md(r"""As a more practical example, we can use the `split` expression with some aggregation to count the number of times a particular word occurs in member names across all namespaces. This enables us to create a word cloud of the API members' constituents!""")
 
 
595
  return
596
 
597
 
@@ -640,20 +630,18 @@ def _(alt, expressions_df, pl, random, wordcloud_height, wordcloud_width):
640
  size=alt.Size("len:Q", legend=None),
  tooltip=["member", "len"],
  ).configure_view(strokeWidth=0)
- return wordcloud, wordcloud_df


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## 🔗 Concatenation & Joining

- Often we would like to create longer strings from strings we already have. We might want to create a formatted, sentence-like string or join multiple existing strings in our dataframe into a single one.

- The top-level [`concat_str`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html) expression enables us to combine strings *horizontally* in a dataframe. As the example below shows, we can take the `member` and `namespace` column of each row and construct a `description` column in which each row will correspond to the value ``f"- Expression `{member}` belongs to namespace `{namespace}`"``.
- """
- )
  return

@@ -679,13 +667,11 @@ def _(expressions_df, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line.

- We will use the [`join`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.join.html) expression to do so.
- """
- )
  return

@@ -708,17 +694,15 @@ def _(descriptions_df, mo, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## 🔍 Pattern-based Extraction

- In the vast majority of the cases, when dealing with unstructured text data, all we really want is to extract something structured from it. A common use case is to extract URLs from text to get a better understanding of related content.

- In the example below that's exactly what we do. We scan the `docstring` of each API member and extract URLs from them using [`extract`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html) and [`extract_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_all.html) using a simple regular expression to match http and https URLs.

- Note that `extract` stops after a first match and returns a scalar result (or `null` if there was no match) while `extract_all` returns a - potentially empty - list of matches.
- """
- )
  return

@@ -731,20 +715,18 @@ def _(expressions_df, pl):
  url_match=pl.col('docstring').str.extract(url_pattern),
  url_matches=pl.col('docstring').str.extract_all(url_pattern),
  ).filter(pl.col('url_match').is_not_null())
- return (url_pattern,)


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way.

- [`extract_groups`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_groups.html) is a really powerful expression allowing us to achieve exactly that.

- Below we define the regular expression `r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"` with two capture groups, named `height` and `width` and pass it as the parameter of `extract_groups`. After execution, for each `docstring`, we end up with fully structured data we can further process downstream!
- """
- )
  return

@@ -760,15 +742,13 @@ def _(expressions_df, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## 🧹 Stripping

- Strings might require some cleaning before further processing, such as the removal of some characters from the beginning or end of the text. [`strip_chars_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html), [`strip_chars_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html) and [`strip_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html) are here to facilitate this.

- All we need to do is to specify a set of characters we would like to get rid of and Polars handles the rest for us.
- """
- )
  return

@@ -785,15 +765,13 @@ def _(expressions_df, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set.

- That's exactly the rationale behind [`strip_prefix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html) and [`strip_suffix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html).

- Below we use these to remove the `"to_"` prefixes and `"_with"` suffixes from each member name.
- """
- )
  return

@@ -809,13 +787,11 @@ def _(expressions_df, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## 🔑 Encoding & Decoding

- Should you find yourself in the need of encoding your strings into [base64](https://en.wikipedia.org/wiki/Base64) or [hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) format, then Polars has your back with its [`encode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.encode.html) expression.
- """
- )
  return

@@ -832,7 +808,9 @@ def _(expressions_df, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(r"""And of course, you can convert back into a human-readable representation using the [`decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.decode.html) expression.""")
  return

@@ -847,19 +825,17 @@ def _(encoded_df, pl):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## 🚀 Application: Dynamic Execution of Polars Examples

- Now that we are familiar with string expressions, we can combine them with other Polars operations to build a fully interactive playground where code examples of Polars expressions can be explored.

- We make use of string expressions to extract the raw Python source code of examples from the docstrings and we leverage the interactive Marimo environment to enable the selection of expressions via a searchable dropdown and a fully functional code editor whose output is rendered with Marimo's rich display utilities.

- In other words, we will use Polars to execute Polars. ❄️ How cool is that?

- ---
- """
- )
  return

@@ -894,7 +870,7 @@ def _(mo, selected_expression_record):


  @app.cell(hide_code=True)
- def _(example_editor, execute_code):
  execution_result = execute_code(example_editor.value)
  return (execution_result,)

@@ -943,50 +919,48 @@ def _(expressions_df, pl):
  return (code_df,)


- @app.cell(hide_code=True)
- def _():
-     def execute_code(code: str):
-         import ast
-
-         # Create a new local namespace for execution
-         local_namespace = {}
-
-         # Parse the code into an AST to identify the last expression
-         parsed_code = ast.parse(code)
-
-         # Check if there's at least one statement
-         if not parsed_code.body:
-             return None
-
-         # If the last statement is an expression, we'll need to get its value
-         last_is_expr = isinstance(parsed_code.body[-1], ast.Expr)
-
-         if last_is_expr:
-             # Split the code: everything except the last statement, and the last statement
-             last_expr = ast.Expression(parsed_code.body[-1].value)
-
-             # Remove the last statement from the parsed code
-             parsed_code.body = parsed_code.body[:-1]
-
-             # Execute everything except the last statement
-             if parsed_code.body:
-                 exec(
-                     compile(parsed_code, "<string>", "exec"),
-                     globals(),
-                     local_namespace,
-                 )
-
-             # Execute the last statement and get its value
-             result = eval(
-                 compile(last_expr, "<string>", "eval"), globals(), local_namespace
-             )
-             return result
-         else:
-             # If the last statement is not an expression (e.g., an assignment),
-             # execute the entire code and return None
-             exec(code, globals(), local_namespace)
-             return None
-     return (execute_code,)

  @app.cell(hide_code=True)
 
  import marimo

+ __generated_with = "0.18.4"
  app = marimo.App(width="medium")


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ # Strings

+ _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.

+ In this chapter we're going to dig into string manipulation. For a fun twist, we'll be mostly playing around with a dataset that every Polars user has bumped into without really thinking about it: the source code of the `polars` module itself. More precisely, we'll use a dataframe that pulls together all the Polars expressions and their docstrings, giving us a cool, hands-on way to explore the expression API in a truly data-driven manner.

+ We'll cover parsing, length calculation, case conversion, and much more, with practical examples and visualizations. Finally, we will combine various techniques you learned in prior chapters to build a fully interactive playground in which you can execute the official code examples of Polars expressions.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🛠️ Parsing & Conversion

+ Let's warm up with one of the most frequent use cases: parsing raw strings into various formats.
+ We'll take a tiny dataframe with metadata about Python packages represented as raw JSON strings and we'll use Polars string expressions to parse the attributes into their true data types.
+ """)
  return

 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ We can use the [`json_decode`](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.json_decode.html) expression to parse the raw JSON strings into Polars-native structs and we can use the [`unnest`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html) dataframe operation to have a dedicated column per parsed attribute.
+ """)
  return

 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ This is already a much friendlier representation of the data we started out with, but note that since the JSON entries had only string attributes, all values are strings, even the temporal `released_at` and numerical `size_mb` columns.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ As we know that the `size_mb` column should have a decimal representation, we go ahead and use [`to_decimal`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_decimal.html#polars.Expr.str.to_decimal) to perform the conversion.
+ """)
  return

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Moving on to the `released_at` attribute which indicates the exact time when a given Python package got released, we have a bit more options to consider. We can convert to `Date`, `DateTime`, and `Time` types based on the desired temporal granularity. The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion, all we need is to provide the desired format string.

+ Since Polars uses Rust under the hood to implement all its expressions, we need to consult the [`chrono::format`](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) reference to come up with appropriate format strings.

+ Here's a quick reference:

+ | Specifier | Meaning |
+ |-----------|--------------------|
+ | `%Y` | Year (e.g., 2025) |
+ | `%m` | Month (01-12) |
+ | `%d` | Day (01-31) |
+ | `%H` | Hour (00-23) |
+ | `%M` | Minute (00-59) |
+ | `%S` | Second (00-60) |
+ | `%z` | UTC offset |

+ The raw strings we are working with look like `"2025-03-02T20:31:12+0000"`. We can match this using the `"%Y-%m-%dT%H:%M:%S%z"` format string.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Alternatively, instead of using three different functions to perform the conversion to date, we can use a single one, [`strptime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strptime.html), which takes the desired temporal data type as its first parameter.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ And to wrap up this section on parsing and conversion, let's consider a final scenario. What if we don't want to parse the entire raw JSON string, because we only need a subset of its attributes? Well, in this case we can leverage the [`json_path_match`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_path_match.html) expression to extract only the desired attributes using standard [JSONPath](https://goessner.net/articles/JsonPath/) syntax.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 📊 Dataset Overview

+ Now that we got our hands dirty, let's consider a somewhat wilder dataset for the subsequent sections: a dataframe of metadata about every single expression in your current Polars module.

+ At the risk of stating the obvious, in the previous section, when we typed `pl.col('raw_json').str.json_decode()`, we accessed the `json_decode` member of the `str` expression namespace through the `pl.col('raw_json')` expression *instance*. Under the hood, deep inside the Polars source code, there is a corresponding `def json_decode(...)` method with a carefully authored docstring explaining the purpose and signature of the member.

+ Since Python makes module introspection simple, we can easily enumerate all Polars expressions and organize their metadata in `expressions_df`, to be used for all the upcoming string manipulation examples.
+ """)
  return

  expressions_df = pl.from_dicts(list_expr_meta(), infer_schema_length=None).sort('namespace', 'member')
  expressions_df
+ return (expressions_df,)


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ As the following visualization shows, `str` is one of the richest Polars expression namespaces, with several dozen functions in it.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 📏 Length Calculation

+ A common use case is to compute the length of a string. Most people associate string length exclusively with the number of characters a string consists of; however, in certain scenarios it is also useful to know how much memory is required to store it, that is, how many bytes are needed to represent the textual data.

+ The expressions [`len_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html) and [`len_bytes`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html) are here to help us with these calculations.

+ Below, we compute `docstring_len_chars` and `docstring_len_bytes` columns to see how many characters and bytes the documentation of each expression is made up of.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ As the dataframe preview above and the scatterplot below show, the docstring length measured in bytes is almost always bigger than the length expressed in characters. This is due to the fact that the docstrings include characters which require more than a single byte to represent, such as "╞" for displaying dataframe header and body separators.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🔠 Case Conversion

+ Another frequent string transformation is lowercasing, uppercasing, and titlecasing. We can use [`to_lowercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html), [`to_uppercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_uppercase.html) and [`to_titlecase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html) for doing so.
+ """)
  return

 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## ➕ Padding

+ Sometimes we need to ensure that strings have a fixed-size character length. [`pad_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_start.html) and [`pad_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_end.html) can be used to fill the "front" or "back" of a string with a supplied character, while [`zfill`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.zfill.html) is a utility for padding the start of a string with `"0"` until it reaches a particular length. In other words, `zfill` is a more specific version of `pad_start`, where the `fill_char` parameter is explicitly set to `"0"`.

+ In the example below we take the unique Polars expression namespaces and pad them so that they have a uniform length which you can control via a slider.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🔄 Replacing

+ Let's say we want to convert from `snake_case` API member names to `kebab-case`, that is, we need to replace the underscore character with a hyphen. For operations like that, we can use [`replace`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html) and [`replace_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html).

+ As the example below demonstrates, `replace` stops after the first occurrence of the to-be-replaced pattern, while `replace_all` goes all the way through and changes all underscores to hyphens resulting in the `kebab-case` representation we were looking for.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance.

+ In the example below we replace all instances of `"min"` with `"minimum"` and `"max"` with `"maximum"` using a single expression.
+ """)
  return

 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🔍 Searching & Matching

+ A common need when working with strings is to determine whether their content satisfies some condition: whether it starts or ends with a particular substring or contains a certain pattern.

+ Let's suppose we want to determine whether a member of the Polars expression API is a "converter", such as `to_decimal`, identified by its `"to_"` prefix. We can use [`starts_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html) to perform this check.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Throughout this course, as you have gained familiarity with the expression API, you might have noticed that some members end with an underscore, such as `or_`, because their base name is a reserved Python keyword.

+ Let's use [`ends_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html) to find all the members which are named after such keywords.
+ """)
  return

418
 
 
428
 
429
  @app.cell(hide_code=True)
430
  def _(mo):
431
+ mo.md(r"""
432
+ Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members.
 
433
 
434
+ As demonstrated below, we can compute all these boolean attributes using [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) to check whether the docstring includes a particular substring.
435
+ """)
 
436
  return
437
 
438
 
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ For scenarios where we want to check for multiple substrings at once, we can use the [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) expression with a pattern that combines them.
+ """)
  return

 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments are going on within each of these examples, right? That's not as simple as checking for a pre-defined literal string containment though, because variables can have arbitrary names (any valid Python identifier is allowed). While the `contains` function supports checking for regular expressions instead of literal strings too, it would not suffice for this exercise because it only tells us whether there is at least a single occurrence of the sought pattern rather than telling us the exact number of matches.

+ Fortunately, we can take advantage of [`count_matches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.count_matches.html) to achieve exactly what we want. We specify the regular expression `r'[a-zA-Z_][a-zA-Z0-9_]* = '` according to the [`regex` Rust crate](https://docs.rs/regex/latest/regex/) to match Python identifiers and we leave the rest to Polars.

+ In `count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = ')`:

+ - `[a-zA-Z_]` matches a letter or underscore (start of a Python identifier).
+ - `[a-zA-Z0-9_]*` matches zero or more letters, digits, or underscores.
+ - ` = ` matches a space, equals sign, and space (indicating assignment).

+ This finds variable assignments like `x = ` or `df_result = ` in docstrings.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ A related application example is to *find* the first index where a particular pattern is present, so that it can be used for downstream processing such as slicing. Below we use the [`find`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.find.html) expression to determine the index at which a code example starts in the docstring, identified by the Python shell prompt `">>>"`.
+ """)
  return

 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## ✂️ Slicing and Substrings

+ Sometimes we are only interested in a particular substring. We can use [`head`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html), [`tail`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html) and [`slice`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html) to extract a substring from the start, end, or between arbitrary indices.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## ➗ Splitting

+ Certain strings follow a well-defined structure and we might be only interested in some parts of them. For example, when dealing with `snake_cased_expression` member names we might be curious to get only the first, second, or $n^{\text{th}}$ word before an underscore. We would need to *split* the string at a particular pattern for downstream processing.

+ The [`split`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html), [`split_exact`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split_exact.html) and [`splitn`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.splitn.html) expressions enable us to achieve this.

+ The primary difference between these string splitting utilities is that `split` produces a list of variadic length based on the number of resulting segments, `splitn` returns a struct with at least `0` and at most `n` fields, while `split_exact` returns a struct of exactly `n` fields.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ As a more practical example, we can use the `split` expression with some aggregation to count the number of times a particular word occurs in member names across all namespaces. This enables us to create a word cloud of the API members' constituents!
+ """)
  return

  size=alt.Size("len:Q", legend=None),
  tooltip=["member", "len"],
  ).configure_view(strokeWidth=0)
+ return (wordcloud,)


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🔗 Concatenation & Joining

+ Often we would like to create longer strings from strings we already have. We might want to create a formatted, sentence-like string or join multiple existing strings in our dataframe into a single one.

+ The top-level [`concat_str`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html) expression enables us to combine strings *horizontally* in a dataframe. As the example below shows, we can take the `member` and `namespace` column of each row and construct a `description` column in which each row will correspond to the value ``f"- Expression `{member}` belongs to namespace `{namespace}`"``.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line.

+ We will use the [`join`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.join.html) expression to do so.
+ """)
  return

 
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🔍 Pattern-based Extraction

+ In the vast majority of cases, when dealing with unstructured text data, all we really want is to extract something structured from it. A common use case is to extract URLs from text to get a better understanding of related content.

+ In the example below that's exactly what we do. We scan the `docstring` of each API member and extract URLs from them with [`extract`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html) and [`extract_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_all.html), using a simple regular expression to match http and https URLs.

+ Note that `extract` stops after the first match and returns a scalar result (or `null` if there was no match), while `extract_all` returns a (potentially empty) list of matches.
+ """)
  return
 

  url_match=pl.col('docstring').str.extract(url_pattern),
  url_matches=pl.col('docstring').str.extract_all(url_pattern),
  ).filter(pl.col('url_match').is_not_null())
+ return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way.

+ [`extract_groups`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_groups.html) is a really powerful expression allowing us to achieve exactly that.

+ Below we define the regular expression `r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"` with two capture groups, named `height` and `width`, and pass it as the parameter of `extract_groups`. After execution, for each `docstring`, we end up with fully structured data we can further process downstream!
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🧹 Stripping

+ Strings might require some cleaning before further processing, such as the removal of some characters from the beginning or end of the text. [`strip_chars_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html), [`strip_chars_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html) and [`strip_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html) are here to facilitate this.

+ All we need to do is to specify a set of characters we would like to get rid of and Polars handles the rest for us.
+ """)
  return

754
 
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set.

+ That's exactly the rationale behind [`strip_prefix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html) and [`strip_suffix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html).

+ Below we use these to remove the `"to_"` prefixes and `"_with"` suffixes from each member name.
+ """)
  return
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🔑 Encoding & Decoding

+ Should you find yourself in need of encoding your strings into [base64](https://en.wikipedia.org/wiki/Base64) or [hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) format, then Polars has your back with its [`encode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.encode.html) expression.
+ """)
  return

797
 
 

  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ And of course, you can convert back into a human-readable representation using the [`decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.decode.html) expression.
+ """)
  return
 
816
 
 
825
 
826
  @app.cell(hide_code=True)
827
  def _(mo):
828
+ mo.md(r"""
829
+ ## 🚀 Application: Dynamic Execution of Polars Examples
 
830
 
831
+ Now that we are familiar with string expressions, we can combine them with other Polars operations to build a fully interactive playground where code examples of Polars expressions can be explored.
832
 
833
+ We use string expressions to extract the raw Python source code of the examples from their docstrings, and we leverage the interactive Marimo environment to enable selecting expressions via a searchable dropdown and a fully functional code editor whose output is rendered with Marimo's rich display utilities.
834
 
835
+ In other words, we will use Polars to execute Polars. ❄️ How cool is that?
836
 
837
+ ---
838
+ """)
 
839
  return
840
 
841
 
 
870
 
871
 
872
  @app.cell(hide_code=True)
873
+ def _(example_editor):
874
  execution_result = execute_code(example_editor.value)
875
  return (execution_result,)
876
 
 
919
  return (code_df,)
920
 
921
 
922
+ @app.function(hide_code=True)
923
+ def execute_code(code: str):
924
+ import ast
 
 
 
 
925
 
926
+ # Create a new local namespace for execution
927
+ local_namespace = {}
928
 
929
+ # Parse the code into an AST to identify the last expression
930
+ parsed_code = ast.parse(code)
 
931
 
932
+ # Check if there's at least one statement
933
+ if not parsed_code.body:
934
+ return None
935
 
936
+ # If the last statement is an expression, we'll need to get its value
937
+ last_is_expr = isinstance(parsed_code.body[-1], ast.Expr)
 
938
 
939
+ if last_is_expr:
940
+ # Split the code: everything except the last statement, and the last statement
941
+ last_expr = ast.Expression(parsed_code.body[-1].value)
942
 
943
+ # Remove the last statement from the parsed code
944
+ parsed_code.body = parsed_code.body[:-1]
 
 
 
 
 
945
 
946
+ # Execute everything except the last statement
947
+ if parsed_code.body:
948
+ exec(
949
+ compile(parsed_code, "<string>", "exec"),
950
+ globals(),
951
+ local_namespace,
952
  )
953
+
954
+ # Execute the last statement and get its value
955
+ result = eval(
956
+ compile(last_expr, "<string>", "eval"), globals(), local_namespace
957
+ )
958
+ return result
959
+ else:
960
+ # If the last statement is not an expression (e.g., an assignment),
961
+ # execute the entire code and return None
962
+ exec(code, globals(), local_namespace)
963
+ return None
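The exec/eval split used above can be sanity-checked outside marimo with a self-contained sketch of the same pattern (the helper name here is hypothetical, not part of the notebook):

```python
import ast

def run_and_return_last(code: str):
    # Execute `code` and return the value of a trailing expression, if any
    tree = ast.parse(code)
    if not tree.body:
        return None
    namespace = {}
    if isinstance(tree.body[-1], ast.Expr):
        # Split off the last expression so it can be eval'd for its value
        last = ast.Expression(tree.body[-1].value)
        tree.body = tree.body[:-1]
        if tree.body:
            exec(compile(tree, "<string>", "exec"), namespace)
        return eval(compile(last, "<string>", "eval"), namespace)
    # Last statement is not an expression (e.g. an assignment): no value
    exec(code, namespace)
    return None

print(run_and_return_last("x = 2\nx * 21"))  # 42
```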
964
 
965
 
966
  @app.cell(hide_code=True)
polars/11_missing_data.py CHANGED
@@ -8,14 +8,13 @@
8
 
9
  import marimo
10
 
11
- __generated_with = "0.15.3"
12
  app = marimo.App(width="medium")
13
 
14
 
15
  @app.cell(hide_code=True)
16
  def _(mo):
17
- mo.md(
18
- r"""
19
  # Dealing with Missing Data
20
 
21
  _by [etrotta](https://github.com/etrotta) and [Felix Najera](https://github.com/folicks)_
@@ -24,20 +23,17 @@ def _(mo):
24
 
25
  First we provide an overview of the methods available in polars, then we walk through a mini case study with real-world data showing how to use them, and lastly we provide some additional information in the 'Bonus Content' section.
25
  You can skip around to each header using the menu on the right side.
27
- """
28
- )
29
  return
30
 
31
 
32
  @app.cell(hide_code=True)
33
  def _(mo):
34
- mo.md(
35
- r"""
36
  ## Methods for working with Nulls
37
 
38
  We'll be using the following DataFrame to show the most important methods:
39
- """
40
- )
41
  return
42
 
43
 
@@ -59,13 +55,11 @@ def _(pl):
59
 
60
  @app.cell(hide_code=True)
61
  def _(mo):
62
- mo.md(
63
- r"""
64
  ### Counting nulls
65
 
66
  A simple yet convenient aggregation
67
- """
68
- )
69
  return
70
 
71
 
@@ -77,13 +71,11 @@ def _(df):
77
 
78
  @app.cell(hide_code=True)
79
  def _(mo):
80
- mo.md(
81
- r"""
82
  ### Dropping Nulls
83
 
84
  The simplest way of dealing with null values is throwing them away, but that is not always a good idea.
85
- """
86
- )
87
  return
88
 
89
 
@@ -101,8 +93,7 @@ def _(df):
101
 
102
  @app.cell(hide_code=True)
103
  def _(mo):
104
- mo.md(
105
- r"""
106
  ### Filtering null values
107
 
108
  To filter in polars, you'll typically use `df.filter(expression)` or `df.remove(expression)` methods.
@@ -112,8 +103,7 @@ def _(mo):
112
 
113
  Remove will only remove rows in which the expression evaluates to True.
114
  It will keep rows in which it evaluates to None.
115
- """
116
- )
117
  return
118
 
119
 
@@ -131,13 +121,11 @@ def _(df, pl):
131
 
132
  @app.cell(hide_code=True)
133
  def _(mo):
134
- mo.md(
135
- r"""
136
  You may also be tempted to use `== None` or `!= None`, but operators in polars will generally propagate null values.
137
 
138
  You can use `.eq_missing()` or `.ne_missing()` methods if you want to be strict about it, but there are also `.is_null()` and `.is_not_null()` methods you can use.
139
- """
140
- )
141
  return
142
 
143
 
@@ -156,8 +144,7 @@ def _(df, pl):
156
 
157
  @app.cell(hide_code=True)
158
  def _(mo):
159
- mo.md(
160
- r"""
161
  ### Filling Null values
162
 
163
  You can also fill in the values with constants, calculations or by consulting external data sources.
@@ -165,8 +152,7 @@ def _(mo):
165
  Be careful not to treat estimated or guessed values as if they were ground truth, however, otherwise you may end up drawing conclusions about a reality that does not exist.
166
 
167
  As an exercise, let's guess some values to fill in nulls, then try giving names to the animals with `null` by editing the cells
168
- """
169
- )
170
  return
171
 
172
 
@@ -192,8 +178,7 @@ def _(guesstimates):
192
 
193
  @app.cell(hide_code=True)
194
  def _(mo):
195
- mo.md(
196
- r"""
197
  ### TL;DR
198
 
199
  Before we head into the mini case study, a brief review of what we have covered:
@@ -207,24 +192,21 @@ def _(mo):
207
  You can also refer to the polars [User Guide](https://docs.pola.rs/user-guide/expressions/missing-data/) for more information.
208
 
209
  Whichever approach you take, remember to document how you handled it!
210
- """
211
- )
212
  return
213
 
214
 
215
  @app.cell(hide_code=True)
216
  def _(mo):
217
- mo.md(
218
- r"""
219
  # Mini Case Study
220
 
221
- We will be using a dataset from `alertario` about the weather in Rio de Janeiro, originally available in Google BigQuery under `datario.clima_pluviometro`. What you need to know about it:
222
 
223
  - Contains multiple stations covering the Municipality of Rio de Janeiro
224
  - Measures the precipitation in millimeters, with a granularity of 15 minutes
225
  - We filtered to only include data from 2020, 2021 and 2022
226
- """
227
- )
228
  return
229
 
230
 
@@ -257,8 +239,7 @@ def _(pl, px, stations):
257
 
258
  @app.cell(hide_code=True)
259
  def _(mo):
260
- mo.md(
261
- r"""
262
  # Stations
263
 
264
  First, let's take a look at some of the stations. Notice how
@@ -267,8 +248,7 @@ def _(mo):
267
  - There are some columns that do not even contain data at all!
268
 
269
  We will remove the empty columns and remove rows without coordinates
270
- """
271
- )
272
  return
273
 
274
 
@@ -295,16 +275,14 @@ def _(dirty_stations, mo, pl):
295
 
296
  @app.cell(hide_code=True)
297
  def _(mo):
298
- mo.md(
299
- r"""
300
  # Precipitation
301
  Now, let's move on to the Precipitation data.
302
 
303
  ## Part 1 - Null Values
304
 
305
  First of all, let's check for null values:
306
- """
307
- )
308
  return
309
 
310
 
@@ -328,8 +306,7 @@ def _(dirty_weather, mo, rain):
328
 
329
  @app.cell(hide_code=True)
330
  def _(mo):
331
- mo.md(
332
- r"""
333
  ### First option for fixing it: Dropping data.
334
 
335
  We could just remove those rows like we did for the stations, which may be a passable solution for some problems, but is not always the best idea.
@@ -354,8 +331,7 @@ def _(mo):
354
 
355
  Let's investigate a bit more before deciding on following with either approach.
356
  For example, is our current data even complete, or are we already missing some rows beyond those with null values?
357
- """
358
- )
359
  return
360
 
361
 
@@ -387,8 +363,7 @@ def _(pl):
387
 
388
  @app.cell(hide_code=True)
389
  def _(mo):
390
- mo.md(
391
- r"""
392
  ## Part 2 - Missing Rows
393
 
394
  We can see that we expected there to be 1096 rows for each hour of the day for each station (one per day from the start of 2020 to the end of 2022), but in reality we see between 1077 and 1096 rows.
@@ -400,8 +375,7 @@ def _(mo):
400
  Given that we are working with time series data, we will [upsample](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.upsample.html) the data, but you could also create a DataFrame containing all expected rows then use `join(how="...")`
401
 
402
  However, that will give us _even more_ null values, so we will want to fill them in afterwards. For this case, we will just use a forward fill followed by a backwards fill.
403
- """
404
- )
405
  return
406
 
407
 
@@ -435,15 +409,13 @@ def _(dirty_weather, mo, pl, rain):
435
 
436
  @app.cell(hide_code=True)
437
  def _(mo):
438
- mo.md(
439
- r"""
440
  Now that we finally have a clean dataset, let's play around with it a little.
441
 
442
  ### Example App
443
 
444
  Let's display the amount of precipitation each station measured within a timeframe, aggregated to a lower granularity.
445
- """
446
- )
447
  return
448
 
449
 
@@ -534,13 +506,11 @@ def _(animation_data, pl, px):
534
 
535
  @app.cell(hide_code=True)
536
  def _(mo):
537
- mo.md(
538
- r"""
539
  If we were missing some rows, we would have circles popping in and out of existence instead of a smooth animation!
540
 
541
  In many scenarios, missing data can also lead to wrong results overall, for example if we were to estimate the total amount of rainfall during the observed period:
542
- """
543
- )
544
  return
545
 
546
 
@@ -556,20 +526,17 @@ def _(dirty_weather, mo, rain, weather):
556
 
557
  @app.cell(hide_code=True)
558
  def _(mo):
559
- mo.md(
560
- r"""
561
  Which is still a relatively small difference, but every drop counts when you are dealing with the weather.
562
 
563
  For datasets with a higher share of missing values, that difference can get much higher.
564
- """
565
- )
566
  return
567
 
568
 
569
  @app.cell(hide_code=True)
570
  def _(mo):
571
- mo.md(
572
- r"""
573
  # Bonus Content
574
 
575
  ## Appendix A: Missing Time Zones
@@ -577,8 +544,7 @@ def _(mo):
577
  The original dataset contained naive datetimes instead of timezone-aware ones, but we can infer whether it refers to UTC time or local time (in this case, UTC-03:00) based on the measurements.
578
 
579
  For example, we can select one specific interval during which we know it rained a lot, or graph the average amount of precipitation for each hour of the day, then compare the data timestamps with a ground truth.
580
- """
581
- )
582
  return
583
 
584
 
@@ -635,13 +601,11 @@ def _(dirty_weather_naive, pl, rain, stations):
635
 
636
  @app.cell(hide_code=True)
637
  def _(mo):
638
- mo.md(
639
- r"""
640
  By externally researching the expected distribution and looking up some of the extreme weather events, we can come to a conclusion about whether it is aligned with local time or with UTC.
641
 
642
  In this case, the distribution matches the normal weather for this region and we can see that the hours with the most precipitation match those of historical events, so it is safe to say it is using local time (equivalent to the America/Sao_Paulo time zone).
643
- """
644
- )
645
  return
646
 
647
 
@@ -655,8 +619,7 @@ def _(dirty_weather_naive, pl):
655
 
656
  @app.cell(hide_code=True)
657
  def _(mo):
658
- mo.md(
659
- r"""
660
  ## Appendix B: Not a Number
661
 
662
  While some other tools without proper support for missing values may use `NaN` as a way to indicate a value is missing, in polars it is treated exclusively as a float value, much like `0.0`, `1.0` or `infinity`.
@@ -664,8 +627,7 @@ def _(mo):
664
  You can use `.fill_null(float('nan'))` if you need to convert floats to a format such tools accept, or use `.fill_nan(None)` if you are importing data from them, assuming that there are no values which really are supposed to be the float NaN.
665
 
666
  Remember that many calculations can result in NaN, for example dividing by zero:
667
- """
668
- )
669
  return
670
 
671
 
@@ -696,29 +658,25 @@ def _(day_perc, mo, perc_col):
696
 
697
  @app.cell(hide_code=True)
698
  def _(mo):
699
- mo.md(
700
- r"""
701
  ## Appendix C: Everything else
702
 
703
  Long as this notebook is, it cannot reasonably cover ***everything*** that deals with missing values, as that would mean covering everything that deals with data.
704
 
705
  This section very briefly covers some other features not mentioned above.
706
- """
707
- )
708
  return
709
 
710
 
711
  @app.cell(hide_code=True)
712
  def _(mo):
713
- mo.md(
714
- r"""
715
  ### Missing values in Aggregations
716
 
717
  Many aggregation methods will ignore/skip missing values, while others take them into consideration.
718
 
719
  Always check the documentation of the method you're using; much of the time the docstring will explain its behaviour.
720
- """
721
- )
722
  return
723
 
724
 
@@ -733,13 +691,11 @@ def _(df, pl):
733
 
734
  @app.cell(hide_code=True)
735
  def _(mo):
736
- mo.md(
737
- r"""
738
  ### Missing values in Joins
739
 
740
  By default null values will never produce matches using [join](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html), but you can specify `nulls_equal=True` to join Null values with each other.
741
- """
742
- )
743
  return
744
 
745
 
@@ -772,13 +728,11 @@ def _(age_groups, df):
772
 
773
  @app.cell(hide_code=True)
774
  def _(mo):
775
- mo.md(
776
- r"""
777
  ## Utilities
778
 
779
  Loading data and imports
780
- """
781
- )
782
  return
783
 
784
 
 
8
 
9
  import marimo
10
 
11
+ __generated_with = "0.18.4"
12
  app = marimo.App(width="medium")
13
 
14
 
15
  @app.cell(hide_code=True)
16
  def _(mo):
17
+ mo.md(r"""
 
18
  # Dealing with Missing Data
19
 
20
  _by [etrotta](https://github.com/etrotta) and [Felix Najera](https://github.com/folicks)_
 
23
 
24
  First we provide an overview of the methods available in polars, then we walk through a mini case study with real-world data showing how to use them, and lastly we provide some additional information in the 'Bonus Content' section.
25
  You can skip around to each header using the menu on the right side.
26
+ """)
 
27
  return
28
 
29
 
30
  @app.cell(hide_code=True)
31
  def _(mo):
32
+ mo.md(r"""
 
33
  ## Methods for working with Nulls
34
 
35
  We'll be using the following DataFrame to show the most important methods:
36
+ """)
 
37
  return
38
 
39
 
 
55
 
56
  @app.cell(hide_code=True)
57
  def _(mo):
58
+ mo.md(r"""
 
59
  ### Counting nulls
60
 
61
  A simple yet convenient aggregation
62
+ """)
 
63
  return
64
 
65
 
 
71
 
72
  @app.cell(hide_code=True)
73
  def _(mo):
74
+ mo.md(r"""
 
75
  ### Dropping Nulls
76
 
77
  The simplest way of dealing with null values is throwing them away, but that is not always a good idea.
78
+ """)
 
79
  return
80
 
81
 
 
93
 
94
  @app.cell(hide_code=True)
95
  def _(mo):
96
+ mo.md(r"""
 
97
  ### Filtering null values
98
 
99
  To filter in polars, you'll typically use `df.filter(expression)` or `df.remove(expression)` methods.
 
103
 
104
  Remove will only remove rows in which the expression evaluates to True.
105
  It will keep rows in which it evaluates to None.
106
+ """)
 
107
  return
108
 
109
 
 
121
 
122
  @app.cell(hide_code=True)
123
  def _(mo):
124
+ mo.md(r"""
 
125
  You may also be tempted to use `== None` or `!= None`, but operators in polars will generally propagate null values.
126
 
127
  You can use `.eq_missing()` or `.ne_missing()` methods if you want to be strict about it, but there are also `.is_null()` and `.is_not_null()` methods you can use.
128
+ """)
 
129
  return
130
 
131
 
 
144
 
145
  @app.cell(hide_code=True)
146
  def _(mo):
147
+ mo.md(r"""
 
148
  ### Filling Null values
149
 
150
  You can also fill in the values with constants, calculations or by consulting external data sources.
 
152
  Be careful not to treat estimated or guessed values as if they were ground truth, however, otherwise you may end up drawing conclusions about a reality that does not exist.
153
 
154
  As an exercise, let's guess some values to fill in nulls, then try giving names to the animals with `null` by editing the cells
155
+ """)
 
156
  return
157
 
158
 
 
178
 
179
  @app.cell(hide_code=True)
180
  def _(mo):
181
+ mo.md(r"""
 
182
  ### TL;DR
183
 
184
  Before we head into the mini case study, a brief review of what we have covered:
 
192
  You can also refer to the polars [User Guide](https://docs.pola.rs/user-guide/expressions/missing-data/) for more information.
193
 
194
  Whichever approach you take, remember to document how you handled it!
195
+ """)
 
196
  return
197
 
198
 
199
  @app.cell(hide_code=True)
200
  def _(mo):
201
+ mo.md(r"""
 
202
  # Mini Case Study
203
 
204
+ We will be using a dataset from `alertario` about the weather in Rio de Janeiro, originally available in Google BigQuery under `datario.clima_pluviometro`. What you need to know about it:
205
 
206
  - Contains multiple stations covering the Municipality of Rio de Janeiro
207
  - Measures the precipitation in millimeters, with a granularity of 15 minutes
208
  - We filtered to only include data from 2020, 2021 and 2022
209
+ """)
 
210
  return
211
 
212
 
 
239
 
240
  @app.cell(hide_code=True)
241
  def _(mo):
242
+ mo.md(r"""
 
243
  # Stations
244
 
245
  First, let's take a look at some of the stations. Notice how
 
248
  - There are some columns that do not even contain data at all!
249
 
250
  We will remove the empty columns and remove rows without coordinates
251
+ """)
 
252
  return
253
 
254
 
 
275
 
276
  @app.cell(hide_code=True)
277
  def _(mo):
278
+ mo.md(r"""
 
279
  # Precipitation
280
  Now, let's move on to the Precipitation data.
281
 
282
  ## Part 1 - Null Values
283
 
284
  First of all, let's check for null values:
285
+ """)
 
286
  return
287
 
288
 
 
306
 
307
  @app.cell(hide_code=True)
308
  def _(mo):
309
+ mo.md(r"""
 
310
  ### First option for fixing it: Dropping data.
311
 
312
  We could just remove those rows like we did for the stations, which may be a passable solution for some problems, but is not always the best idea.
 
331
 
332
  Let's investigate a bit more before deciding on following with either approach.
333
  For example, is our current data even complete, or are we already missing some rows beyond those with null values?
334
+ """)
 
335
  return
336
 
337
 
 
363
 
364
  @app.cell(hide_code=True)
365
  def _(mo):
366
+ mo.md(r"""
 
367
  ## Part 2 - Missing Rows
368
 
369
  We can see that we expected there to be 1096 rows for each hour of the day for each station (one per day from the start of 2020 to the end of 2022), but in reality we see between 1077 and 1096 rows.
 
375
  Given that we are working with time series data, we will [upsample](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.upsample.html) the data, but you could also create a DataFrame containing all expected rows then use `join(how="...")`
376
 
377
  However, that will give us _even more_ null values, so we will want to fill them in afterwards. For this case, we will just use a forward fill followed by a backwards fill.
378
+ """)
 
379
  return
380
 
381
 
 
409
 
410
  @app.cell(hide_code=True)
411
  def _(mo):
412
+ mo.md(r"""
 
413
  Now that we finally have a clean dataset, let's play around with it a little.
414
 
415
  ### Example App
416
 
417
  Let's display the amount of precipitation each station measured within a timeframe, aggregated to a lower granularity.
418
+ """)
 
419
  return
420
 
421
 
 
506
 
507
  @app.cell(hide_code=True)
508
  def _(mo):
509
+ mo.md(r"""
 
510
  If we were missing some rows, we would have circles popping in and out of existence instead of a smooth animation!
511
 
512
  In many scenarios, missing data can also lead to wrong results overall, for example if we were to estimate the total amount of rainfall during the observed period:
513
+ """)
 
514
  return
515
 
516
 
 
526
 
527
  @app.cell(hide_code=True)
528
  def _(mo):
529
+ mo.md(r"""
 
530
  Which is still a relatively small difference, but every drop counts when you are dealing with the weather.
531
 
532
  For datasets with a higher share of missing values, that difference can get much higher.
533
+ """)
 
534
  return
535
 
536
 
537
  @app.cell(hide_code=True)
538
  def _(mo):
539
+ mo.md(r"""
 
540
  # Bonus Content
541
 
542
  ## Appendix A: Missing Time Zones
 
544
  The original dataset contained naive datetimes instead of timezone-aware ones, but we can infer whether it refers to UTC time or local time (in this case, UTC-03:00) based on the measurements.
545
 
546
  For example, we can select one specific interval during which we know it rained a lot, or graph the average amount of precipitation for each hour of the day, then compare the data timestamps with a ground truth.
547
+ """)
 
548
  return
549
 
550
 
 
601
 
602
  @app.cell(hide_code=True)
603
  def _(mo):
604
+ mo.md(r"""
 
605
  By externally researching the expected distribution and looking up some of the extreme weather events, we can come to a conclusion about whether it is aligned with local time or with UTC.
606
 
607
  In this case, the distribution matches the normal weather for this region and we can see that the hours with the most precipitation match those of historical events, so it is safe to say it is using local time (equivalent to the America/Sao_Paulo time zone).
608
+ """)
 
609
  return
610
 
611
 
 
619
 
620
  @app.cell(hide_code=True)
621
  def _(mo):
622
+ mo.md(r"""
 
623
  ## Appendix B: Not a Number
624
 
625
  While some other tools without proper support for missing values may use `NaN` as a way to indicate a value is missing, in polars it is treated exclusively as a float value, much like `0.0`, `1.0` or `infinity`.
 
627
  You can use `.fill_null(float('nan'))` if you need to convert floats to a format such tools accept, or use `.fill_nan(None)` if you are importing data from them, assuming that there are no values which really are supposed to be the float NaN.
628
 
629
  Remember that many calculations can result in NaN, for example dividing by zero:
630
+ """)
 
631
  return
632
 
633
 
 
658
 
659
  @app.cell(hide_code=True)
660
  def _(mo):
661
+ mo.md(r"""
 
662
  ## Appendix C: Everything else
663
 
664
  Long as this notebook is, it cannot reasonably cover ***everything*** that deals with missing values, as that would mean covering everything that deals with data.
665
 
666
  This section very briefly covers some other features not mentioned above.
667
+ """)
 
668
  return
669
 
670
 
671
  @app.cell(hide_code=True)
672
  def _(mo):
673
+ mo.md(r"""
 
674
  ### Missing values in Aggregations
675
 
676
  Many aggregation methods will ignore/skip missing values, while others take them into consideration.
677
 
678
  Always check the documentation of the method you're using; much of the time the docstring will explain its behaviour.
679
+ """)
 
680
  return
681
 
682
 
 
691
 
692
  @app.cell(hide_code=True)
693
  def _(mo):
694
+ mo.md(r"""
 
695
  ### Missing values in Joins
696
 
697
  By default null values will never produce matches using [join](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html), but you can specify `nulls_equal=True` to join Null values with each other.
698
+ """)
 
699
  return
700
 
701
 
 
728
 
729
  @app.cell(hide_code=True)
730
  def _(mo):
731
+ mo.md(r"""
 
732
  ## Utilities
733
 
734
  Loading data and imports
735
+ """)
 
736
  return
737
 
738
 
polars/12_aggregations.py CHANGED
@@ -8,7 +8,7 @@
8
 
9
  import marimo
10
 
11
- __generated_with = "0.12.9"
12
  app = marimo.App(width="medium")
13
 
14
 
@@ -20,14 +20,12 @@ def _():
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
- mo.md(
24
- r"""
25
- # Aggregations
26
- _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
27
 
28
- In this notebook, you'll learn how to perform different types of aggregations in Polars, including grouping by categories and time. We'll analyze sales data from a clothing store, focusing on three product categories: hats, socks, and sweaters.
29
- """
30
- )
31
  return
32
 
33
 
@@ -44,13 +42,11 @@ def _():
44
 
45
  @app.cell(hide_code=True)
46
  def _(mo):
47
- mo.md(
48
- r"""
49
- ## Grouping by category
50
- ### With single category
51
- Let's find out how many of each product category we sold.
52
- """
53
- )
54
  return
55
 
56
 
@@ -65,13 +61,11 @@ def _(df, pl):
65
 
66
  @app.cell(hide_code=True)
67
  def _(mo):
68
- mo.md(
69
- r"""
70
- It looks like we sold more sweaters. Maybe this was a winter season.
71
 
72
- Let's add another aggregate to see how much was spent on the total units for each product.
73
- """
74
- )
75
  return
76
 
77
 
@@ -87,7 +81,9 @@ def _(df, pl):
87
 
88
  @app.cell(hide_code=True)
89
  def _(mo):
90
- mo.md(r"""We could also write aggregate code for the two columns as a single line.""")
 
 
91
  return
92
 
93
 
@@ -102,7 +98,9 @@ def _(df, pl):
102
 
103
  @app.cell(hide_code=True)
104
  def _(mo):
105
- mo.md(r"""Actually, the way we've been writing the aggregate lines is syntactic sugar. Here's a longer way of doing it as shown in the [Polars documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html).""")
 
 
106
  return
107
 
108
 
@@ -118,12 +116,10 @@ def _(df, pl):
118
 
119
  @app.cell(hide_code=True)
120
  def _(mo):
121
- mo.md(
122
- r"""
123
- ### With multiple categories
124
- We can also group by multiple categories. Let's find out how many items we sold in each product category for each SKU. This more detailed aggregation will produce more rows than the previous DataFrame.
125
- """
126
- )
127
  return
128
 
129
 
@@ -138,13 +134,11 @@ def _(df, pl):
138
 
139
  @app.cell(hide_code=True)
140
  def _(mo):
141
- mo.md(
142
- r"""
143
- Aggregations when grouping data are not limited to sums. You can also use functions like [`max`, `min`, `median`, `first`, and `last`](https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations).
144
 
145
- Let's find the largest sale quantity for each product category.
146
- """
147
- )
148
  return
149
 
150
 
@@ -159,13 +153,11 @@ def _(df, pl):
159
 
160
  @app.cell(hide_code=True)
161
  def _(mo):
162
- mo.md(
163
- r"""
164
- Let's make the aggregation more interesting. We'll identify the first customer to purchase each item, along with the quantity they bought and the amount they spent.
165
 
166
- **Note:** To make this work, we'll have to sort the date from earliest to latest.
167
- """
168
- )
169
  return
170
 
171
 
@@ -181,14 +173,12 @@ def _(df, pl):
181
 
182
  @app.cell(hide_code=True)
183
  def _(mo):
184
- mo.md(
185
- r"""
186
- ## Grouping by time
187
- Since `datetime` is a special data type in Polars, we can perform various group-by aggregations on it.
188
 
189
- Our dataset spans a two-year period. Let's calculate the total dollar sales for each year. We'll do it the naive way first so you can appreciate grouping with time.
190
- """
191
- )
192
  return
193
 
194
 
@@ -204,13 +194,11 @@ def _(df, pl):
204
 
205
  @app.cell(hide_code=True)
206
  def _(mo):
207
- mo.md(
208
- r"""
209
- We had more sales in 2014.
210
 
211
- Now let's perform the above operation by grouping with time. This requires sorting the dataframe first.
212
- """
213
- )
214
  return
215
 
216
 
@@ -226,13 +214,11 @@ def _(df, pl):
226
 
227
  @app.cell(hide_code=True)
228
  def _(mo):
229
- mo.md(
230
- r"""
231
- The beauty of grouping with time is that it allows us to resample the data by selecting whatever time interval we want.
232
 
233
- Let's find out what the quarterly sales were for 2014.
234
- """
235
- )
236
  return
237
 
238
 
@@ -249,13 +235,11 @@ def _(df, pl):
249
 
250
  @app.cell(hide_code=True)
251
  def _(mo):
252
- mo.md(
253
- r"""
254
- Here's an interesting question we can answer that takes advantage of grouping by time.
255
 
256
- Let's find the hour of the day where we had the most sales in dollars.
257
- """
258
- )
259
  return
260
 
261
 
@@ -272,7 +256,9 @@ def _(df, pl):
272
 
273
  @app.cell(hide_code=True)
274
  def _(mo):
275
- mo.md(r"""Just for fun, let's find the median number of items sold in each SKU and the total dollar amount in each SKU every six days.""")
 
 
276
  return
277
 
278
 
@@ -290,7 +276,9 @@ def _(df, pl):
290
 
291
  @app.cell(hide_code=True)
292
  def _(mo):
293
- mo.md(r"""Let's rename the columns to clearly indicate the type of aggregation performed. This will help us identify the aggregation method used on a column without needing to check the code.""")
 
 
294
  return
295
 
296
 
@@ -308,15 +296,13 @@ def _(df, pl):
308
 
309
  @app.cell(hide_code=True)
310
  def _(mo):
311
- mo.md(
312
- r"""
313
- ## Grouping with over
314
 
315
- Sometimes, we may want to perform an aggregation but also keep all the columns and rows of the dataframe.
316
 
317
- Let's assign a value to indicate the number of times each customer visited and bought something.
318
- """
319
- )
320
  return
321
 
322
 
@@ -330,7 +316,9 @@ def _(df, pl):
330
 
331
  @app.cell(hide_code=True)
332
  def _(mo):
333
- mo.md(r"""Finally, let's determine which customers visited the store the most and bought something.""")
 
 
334
  return
335
 
336
 
@@ -347,7 +335,9 @@ def _(df, pl):
347
 
348
  @app.cell(hide_code=True)
349
  def _(mo):
350
- mo.md(r"""There's more you can do with aggregations in Polars such as [sorting with aggregations](https://docs.pola.rs/user-guide/expressions/aggregation/#sorting). We hope that in this notebook, we've armed you with the tools to get started.""")
 
 
351
  return
352
 
353
 
 
8
 
9
  import marimo
10
 
11
+ __generated_with = "0.18.4"
12
  app = marimo.App(width="medium")
13
 
14
 
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
+ mo.md(r"""
24
+ # Aggregations
25
+ _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._
 
26
 
27
+ In this notebook, you'll learn how to perform different types of aggregations in Polars, including grouping by categories and time. We'll analyze sales data from a clothing store, focusing on three product categories: hats, socks, and sweaters.
28
+ """)
 
29
  return
30
 
31
 
 
42
 
43
  @app.cell(hide_code=True)
44
  def _(mo):
45
+ mo.md(r"""
46
+ ## Grouping by category
47
+ ### With single category
48
+ Let's find out how many of each product category we sold.
49
+ """)
 
 
50
  return
51
 
52
 
 
61
 
62
  @app.cell(hide_code=True)
63
  def _(mo):
64
+ mo.md(r"""
65
+ It looks like we sold more sweaters. Maybe this was a winter season.
 
66
 
67
+ Let's add another aggregate to see how much was spent on the total units for each product.
68
+ """)
 
69
  return
70
 
71
 
 
81
 
82
  @app.cell(hide_code=True)
83
  def _(mo):
84
+ mo.md(r"""
85
+ We could also write aggregate code for the two columns as a single line.
86
+ """)
87
  return
88
 
89
 
 
98
 
99
  @app.cell(hide_code=True)
100
  def _(mo):
101
+ mo.md(r"""
102
+ Actually, the way we've been writing the aggregate lines is syntactic sugar. Here's a longer way of doing it as shown in the [Polars documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html).
103
+ """)
104
  return
105
 
106
 
 
116
 
117
  @app.cell(hide_code=True)
118
  def _(mo):
119
+ mo.md(r"""
120
+ ### With multiple categories
121
+ We can also group by multiple categories. Let's find out how many items we sold in each product category for each SKU. This more detailed aggregation will produce more rows than the previous DataFrame.
122
+ """)
 
 
123
  return
124
 
125
 
 
134
 
135
  @app.cell(hide_code=True)
136
  def _(mo):
137
+ mo.md(r"""
138
+ Aggregations when grouping data are not limited to sums. You can also use functions like [`max`, `min`, `median`, `first`, and `last`](https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations).
 
139
 
140
+ Let's find the largest sale quantity for each product category.
141
+ """)
 
142
  return
143
 
144
 
 
153
 
154
  @app.cell(hide_code=True)
155
  def _(mo):
156
+ mo.md(r"""
157
+ Let's make the aggregation more interesting. We'll identify the first customer to purchase each item, along with the quantity they bought and the amount they spent.
 
158
 
159
+ **Note:** To make this work, we'll have to sort the data by date from earliest to latest.
160
+ """)
 
161
  return
162
 
163
 
 
173
 
174
  @app.cell(hide_code=True)
175
  def _(mo):
176
+ mo.md(r"""
177
+ ## Grouping by time
178
+ Since `datetime` is a special data type in Polars, we can perform various group-by aggregations on it.
 
179
 
180
+ Our dataset spans a two-year period. Let's calculate the total dollar sales for each year. We'll do it the naive way first so you can appreciate grouping with time.
181
+ """)
 
182
  return
183
 
184
 
 
194
 
195
  @app.cell(hide_code=True)
196
  def _(mo):
197
+ mo.md(r"""
198
+ We had more sales in 2014.
 
199
 
200
+ Now let's perform the above operation by grouping with time. This requires sorting the dataframe first.
201
+ """)
 
202
  return
203
 
204
 
 
214
 
215
  @app.cell(hide_code=True)
216
  def _(mo):
217
+ mo.md(r"""
218
+ The beauty of grouping with time is that it allows us to resample the data by selecting whatever time interval we want.
 
219
 
220
+ Let's find out what the quarterly sales were for 2014.
221
+ """)
 
222
  return
223
 
224
 
 
235
 
236
  @app.cell(hide_code=True)
237
  def _(mo):
238
+ mo.md(r"""
239
+ Here's an interesting question we can answer by taking advantage of grouping by time.
 
240
 
241
+ Let's find the hour of the day when we had the most sales in dollars.
242
+ """)
 
243
  return
244
 
245
 
 
256
 
257
  @app.cell(hide_code=True)
258
  def _(mo):
259
+ mo.md(r"""
260
+ Just for fun, let's find the median number of items sold and the total dollar amount for each SKU every six days.
261
+ """)
262
  return
263
 
264
 
 
276
 
277
  @app.cell(hide_code=True)
278
  def _(mo):
279
+ mo.md(r"""
280
+ Let's rename the columns to clearly indicate the type of aggregation performed. This will help us identify the aggregation method used on a column without needing to check the code.
281
+ """)
282
  return
283
 
284
 
 
296
 
297
  @app.cell(hide_code=True)
298
  def _(mo):
299
+ mo.md(r"""
300
+ ## Grouping with over
 
301
 
302
+ Sometimes, we may want to perform an aggregation but also keep all the columns and rows of the dataframe.
303
 
304
+ Let's assign a value to indicate the number of times each customer visited and bought something.
305
+ """)
 
306
  return
307
 
308
 
 
316
 
317
  @app.cell(hide_code=True)
318
  def _(mo):
319
+ mo.md(r"""
320
+ Finally, let's determine which customers visited the store the most and bought something.
321
+ """)
322
  return
323
 
324
 
 
335
 
336
  @app.cell(hide_code=True)
337
  def _(mo):
338
+ mo.md(r"""
339
+ There's more you can do with aggregations in Polars such as [sorting with aggregations](https://docs.pola.rs/user-guide/expressions/aggregation/#sorting). We hope that in this notebook, we've armed you with the tools to get started.
340
+ """)
341
  return
342
 
343
 
polars/13_window_functions.py CHANGED
@@ -11,14 +11,13 @@
11
 
12
  import marimo
13
 
14
- __generated_with = "0.13.11"
15
  app = marimo.App(width="medium", app_title="Window Functions")
16
 
17
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
- mo.md(
21
- r"""
22
  # Window Functions
23
  _By [Henry Harbeck](https://github.com/henryharbeck)._
24
 
@@ -26,8 +25,7 @@ def _(mo):
26
  You'll work with partitions, ordering and Polars' available "mapping strategies".
27
 
28
  We'll use a dataset with a few days of paid and organic digital revenue data.
29
- """
30
- )
31
  return
32
 
33
 
@@ -53,8 +51,7 @@ def _():
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
- mo.md(
57
- r"""
58
  ## What is a window function?
59
 
60
  A window function performs a calculation across a set of rows that are related to the current row.
@@ -64,32 +61,27 @@ def _(mo):
64
 
65
  Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
66
  method on an expression.
67
- """
68
- )
69
  return
70
 
71
 
72
  @app.cell(hide_code=True)
73
  def _(mo):
74
- mo.md(
75
- r"""
76
  ## Partitions
77
  Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to
78
  which the function will be applied.
79
- """
80
- )
81
  return
82
 
83
 
84
  @app.cell(hide_code=True)
85
  def _(mo):
86
- mo.md(
87
- r"""
88
  ### Partitioning by a single column
89
 
90
  Let's get the total revenue per date...
91
- """
92
- )
93
  return
94
 
95
 
@@ -103,7 +95,9 @@ def _(df, pl):
103
 
104
  @app.cell(hide_code=True)
105
  def _(mo):
106
- mo.md(r"""And then see what percentage of the daily total was Paid and what percentage was Organic.""")
 
 
107
  return
108
 
109
 
@@ -115,12 +109,10 @@ def _(daily_revenue, df, pl):
115
 
116
  @app.cell(hide_code=True)
117
  def _(mo):
118
- mo.md(
119
- r"""
120
  Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change,
121
  all partitioned (split) by channel.
122
- """
123
- )
124
  return
125
 
126
 
@@ -137,28 +129,24 @@ def _(df, pl):
137
 
138
  @app.cell(hide_code=True)
139
  def _(mo):
140
- mo.md(
141
- r"""
142
  Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition
143
  (group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but
144
  still only consider rows within their partition.
145
- """
146
- )
147
  return
148
 
149
 
150
  @app.cell(hide_code=True)
151
  def _(mo):
152
- mo.md(
153
- r"""
154
  ### Partitioning by multiple columns
155
 
156
  We can also partition by multiple columns.
157
 
158
  Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
159
  the channel.
160
- """
161
- )
162
  return
163
 
164
 
@@ -176,15 +164,13 @@ def _(df, pl):
176
 
177
  @app.cell(hide_code=True)
178
  def _(mo):
179
- mo.md(
180
- r"""
181
  ### Partitioning by expressions
182
 
183
  Polars also lets you partition by expressions without needing to create them as columns first.
184
 
185
  So, we could re-write the previous window function as...
186
- """
187
- )
188
  return
189
 
190
 
@@ -200,20 +186,17 @@ def _(df, pl):
200
 
201
  @app.cell(hide_code=True)
202
  def _(mo):
203
- mo.md(
204
- r"""
205
  Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
206
  so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
207
  and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw).
208
- """
209
- )
210
  return
211
 
212
 
213
  @app.cell(hide_code=True)
214
  def _(mo):
215
- mo.md(
216
- r"""
217
  ## Ordering
218
 
219
  The `order_by` parameter controls how to order the data within the window. The function is applied to the data in this
@@ -221,21 +204,18 @@ def _(mo):
221
 
222
  Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the
223
  DataFrame. There can be times when we would like the order of the calculation and the order of the output to differ.
224
- """
225
- )
226
  return
227
 
228
 
229
  @app.cell(hide_code=True)
230
  def _(mo):
231
- mo.md(
232
- """
233
  ### Ordering in a window function
234
 
235
  Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both
236
  ordered by date and partitioned by channel...
237
- """
238
- )
239
  return
240
 
241
 
@@ -261,21 +241,19 @@ def _(df, pl):
261
 
262
  @app.cell(hide_code=True)
263
  def _(mo):
264
- mo.md(
265
- r"""
266
  ### Note about window function ordering compared to SQL
267
 
268
  It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in
269
  equivalent functions in Polars.
270
 
271
  For example, an SQL `RANK()` expression like...
272
- """
273
- )
274
  return
275
 
276
 
277
  @app.cell
278
- def _(df, mo):
279
  _df = mo.sql(
280
  f"""
281
  SELECT
@@ -293,12 +271,10 @@ def _(df, mo):
293
 
294
  @app.cell(hide_code=True)
295
  def _(mo):
296
- mo.md(
297
- r"""
298
  ...does not require an `order_by` in Polars as the column and the function are already bound (including with the
299
  `descending=True` argument).
300
- """
301
- )
302
  return
303
 
304
 
@@ -315,13 +291,11 @@ def _(df, pl):
315
 
316
  @app.cell(hide_code=True)
317
  def _(mo):
318
- mo.md(
319
- r"""
320
  ### Descending order
321
 
322
  We can also order in descending order by passing `descending=True`...
323
- """
324
- )
325
  return
326
 
327
 
@@ -348,29 +322,25 @@ def _(df_sorted, pl):
348
 
349
  @app.cell(hide_code=True)
350
  def _(mo):
351
- mo.md(
352
- """
353
  ## Mapping Strategies
354
 
355
  Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame.
356
 
357
  Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping
358
  strategies, we will explore other possibilities.
359
- """
360
- )
361
  return
362
 
363
 
364
  @app.cell(hide_code=True)
365
  def _(mo):
366
- mo.md(
367
- """
368
  ### Group to rows
369
 
370
  "group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the
371
  window.
372
- """
373
- )
374
  return
375
 
376
 
@@ -384,13 +354,11 @@ def _(df, pl):
384
 
385
  @app.cell(hide_code=True)
386
  def _(mo):
387
- mo.md(
388
- """
389
  ### Join
390
 
391
  The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group.
392
- """
393
- )
394
  return
395
 
396
 
@@ -404,8 +372,7 @@ def _(df, pl):
404
 
405
  @app.cell(hide_code=True)
406
  def _(mo):
407
- mo.md(
408
- r"""
409
  ### Explode
410
 
411
  The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of
@@ -413,8 +380,7 @@ def _(mo):
413
  It should also only be used in a `select` context and not `with_columns`.
414
 
415
  The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`.
416
- """
417
- )
418
  return
419
 
420
 
@@ -431,26 +397,28 @@ def _(df, pl):
431
 
432
  @app.cell(hide_code=True)
433
  def _(mo):
434
- mo.md(r"""Note the modified order of the rows in the output, (but data is the same)...""")
 
 
435
  return
436
 
437
 
438
  @app.cell(hide_code=True)
439
  def _(mo):
440
- mo.md(r"""## Other tips and tricks""")
 
 
441
  return
442
 
443
 
444
  @app.cell(hide_code=True)
445
  def _(mo):
446
- mo.md(
447
- r"""
448
  ### Reusing a window
449
 
450
  In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions
451
  without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`.
452
- """
453
- )
454
  return
455
 
456
 
@@ -472,8 +440,7 @@ def _(df_sorted, pl):
472
 
473
  @app.cell(hide_code=True)
474
  def _(mo):
475
- mo.md(
476
- r"""
477
  ### Rolling Windows
478
 
479
  Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the rolling calculation
@@ -481,8 +448,7 @@ def _(mo):
481
 
482
  Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row
483
  max revenue split by channel...
484
- """
485
- )
486
  return
487
 
488
 
@@ -503,27 +469,29 @@ def _(date, df, pl):
503
 
504
  @app.cell(hide_code=True)
505
  def _(mo):
506
- mo.md(r"""Notice the difference in the 2nd last row...""")
 
 
507
  return
508
 
509
 
510
  @app.cell(hide_code=True)
511
  def _(mo):
512
- mo.md(r"""We hope you enjoyed this notebook, demonstrating window functions in Polars!""")
 
 
513
  return
514
 
515
 
516
  @app.cell(hide_code=True)
517
  def _(mo):
518
- mo.md(
519
- r"""
520
  ## Additional References
521
 
522
  - [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
523
  - [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
524
  - [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
525
- """
526
- )
527
  return
528
 
529
 
 
11
 
12
  import marimo
13
 
14
+ __generated_with = "0.18.4"
15
  app = marimo.App(width="medium", app_title="Window Functions")
16
 
17
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
+ mo.md(r"""
 
21
  # Window Functions
22
  _By [Henry Harbeck](https://github.com/henryharbeck)._
23
 
 
25
  You'll work with partitions, ordering and Polars' available "mapping strategies".
26
 
27
  We'll use a dataset with a few days of paid and organic digital revenue data.
28
+ """)
 
29
  return
30
 
31
 
 
51
 
52
  @app.cell(hide_code=True)
53
  def _(mo):
54
+ mo.md(r"""
 
55
  ## What is a window function?
56
 
57
  A window function performs a calculation across a set of rows that are related to the current row.
 
61
 
62
  Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
63
  method on an expression.
64
+ """)
 
65
  return
66
 
67
 
68
  @app.cell(hide_code=True)
69
  def _(mo):
70
+ mo.md(r"""
 
71
  ## Partitions
72
  Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to
73
  which the function will be applied.
74
+ """)
 
75
  return
76
 
77
 
78
  @app.cell(hide_code=True)
79
  def _(mo):
80
+ mo.md(r"""
 
81
  ### Partitioning by a single column
82
 
83
  Let's get the total revenue per date...
84
+ """)
 
85
  return
86
 
87
 
 
95
 
96
  @app.cell(hide_code=True)
97
  def _(mo):
98
+ mo.md(r"""
99
+ And then see what percentage of the daily total was Paid and what percentage was Organic.
100
+ """)
101
  return
102
 
103
 
 
109
 
110
  @app.cell(hide_code=True)
111
  def _(mo):
112
+ mo.md(r"""
 
113
  Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change,
114
  all partitioned (split) by channel.
115
+ """)
 
116
  return
117
 
118
 
 
129
 
130
  @app.cell(hide_code=True)
131
  def _(mo):
132
+ mo.md(r"""
 
133
  Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition
134
  (group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but
135
  still only consider rows within their partition.
136
+ """)
 
137
  return
138
 
139
 
140
  @app.cell(hide_code=True)
141
  def _(mo):
142
+ mo.md(r"""
 
143
  ### Partitioning by multiple columns
144
 
145
  We can also partition by multiple columns.
146
 
147
  Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and
148
  the channel.
149
+ """)
 
150
  return
151
 
152
 
 
164
 
165
  @app.cell(hide_code=True)
166
  def _(mo):
167
+ mo.md(r"""
 
168
  ### Partitioning by expressions
169
 
170
  Polars also lets you partition by expressions without needing to create them as columns first.
171
 
172
  So, we could re-write the previous window function as...
173
+ """)
 
174
  return
175
 
176
 
 
186
 
187
  @app.cell(hide_code=True)
188
  def _(mo):
189
+ mo.md(r"""
 
190
  Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions),
191
  so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html)
192
  and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw).
193
+ """)
 
194
  return
195
 
196
 
197
  @app.cell(hide_code=True)
198
  def _(mo):
199
+ mo.md(r"""
 
200
  ## Ordering
201
 
202
  The `order_by` parameter controls how to order the data within the window. The function is applied to the data in this
 
204
 
205
  Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the
206
  DataFrame. There can be times when we would like the order of the calculation and the order of the output to differ.
207
+ """)
 
208
  return
209
 
210
 
211
  @app.cell(hide_code=True)
212
  def _(mo):
213
+ mo.md("""
 
214
  ### Ordering in a window function
215
 
216
  Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both
217
  ordered by date and partitioned by channel...
218
+ """)
 
219
  return
220
 
221
 
 
241
 
242
  @app.cell(hide_code=True)
243
  def _(mo):
244
+ mo.md(r"""
 
245
  ### Note about window function ordering compared to SQL
246
 
247
  It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in
248
  equivalent functions in Polars.
249
 
250
  For example, an SQL `RANK()` expression like...
251
+ """)
 
252
  return
253
 
254
 
255
  @app.cell
256
+ def _(mo):
257
  _df = mo.sql(
258
  f"""
259
  SELECT
 
271
 
272
  @app.cell(hide_code=True)
273
  def _(mo):
274
+ mo.md(r"""
 
275
  ...does not require an `order_by` in Polars as the column and the function are already bound (including with the
276
  `descending=True` argument).
277
+ """)
 
278
  return
279
 
280
 
 
291
 
292
  @app.cell(hide_code=True)
293
  def _(mo):
294
+ mo.md(r"""
 
295
  ### Descending order
296
 
297
  We can also order in descending order by passing `descending=True`...
298
+ """)
 
299
  return
300
 
301
 
 
322
 
323
  @app.cell(hide_code=True)
324
  def _(mo):
325
+ mo.md("""
 
326
  ## Mapping Strategies
327
 
328
  Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame.
329
 
330
  Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping
331
  strategies, we will explore other possibilities.
332
+ """)
 
333
  return
334
 
335
 
336
  @app.cell(hide_code=True)
337
  def _(mo):
338
+ mo.md("""
 
339
  ### Group to rows
340
 
341
  "group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the
342
  window.
343
+ """)
 
344
  return
345
 
346
 
 
354
 
355
  @app.cell(hide_code=True)
356
  def _(mo):
357
+ mo.md("""
 
358
  ### Join
359
 
360
  The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group.
361
+ """)
 
362
  return
363
 
364
 
 
372
 
373
  @app.cell(hide_code=True)
374
  def _(mo):
375
+ mo.md(r"""
 
376
  ### Explode
377
 
378
  The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of
 
380
  It should also only be used in a `select` context and not `with_columns`.
381
 
382
  The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`.
383
+ """)
 
384
  return
385
 
386
 
 
397
 
398
  @app.cell(hide_code=True)
399
  def _(mo):
400
+ mo.md(r"""
401
+ Note the modified order of the rows in the output (the data itself is the same)...
402
+ """)
403
  return
404
 
405
 
406
  @app.cell(hide_code=True)
407
  def _(mo):
408
+ mo.md(r"""
409
+ ## Other tips and tricks
410
+ """)
411
  return
412
 
413
 
414
  @app.cell(hide_code=True)
415
  def _(mo):
416
+ mo.md(r"""
 
417
  ### Reusing a window
418
 
419
  In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions
420
  without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`.
421
+ """)
 
422
  return
423
 
424
 
 
440
 
441
  @app.cell(hide_code=True)
442
  def _(mo):
443
+ mo.md(r"""
 
444
  ### Rolling Windows
445
 
446
  Much like in SQL, Polars also gives you the ability to do rolling window computations. In Polars, the rolling calculation
 
448
 
449
  Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row
450
  max revenue split by channel...
451
+ """)
 
452
  return
453
 
454
 
 
469
 
470
  @app.cell(hide_code=True)
471
  def _(mo):
472
+ mo.md(r"""
473
+ Notice the difference in the second-to-last row...
474
+ """)
475
  return
476
 
477
 
478
  @app.cell(hide_code=True)
479
  def _(mo):
480
+ mo.md(r"""
481
+ We hope you enjoyed this notebook, demonstrating window functions in Polars!
482
+ """)
483
  return
484
 
485
 
486
  @app.cell(hide_code=True)
487
  def _(mo):
488
+ mo.md(r"""
 
489
  ## Additional References
490
 
491
  - [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/)
492
  - [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html)
493
  - [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html)
494
+ """)
 
495
  return
496
 
497
 
polars/14_user_defined_functions.py CHANGED
@@ -14,58 +14,52 @@
14
 
15
  import marimo
16
 
17
- __generated_with = "0.11.17"
18
  app = marimo.App(width="medium")
19
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
- mo.md(
24
- r"""
25
- # User-Defined Functions
26
 
27
- _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
28
 
29
- Throughout the previous chapters, you've seen how Polars provides a comprehensive set of built-in expressions for flexible data transformation. But what happens when you need something *more*? Perhaps your project has unique requirements, or you need to integrate functionality from an external Python library. This is where User-Defined Functions (UDFs) come into play, allowing you to extend Polars with your own custom logic.
30
 
31
- In this chapter, we'll weigh the performance trade-offs of UDFs, pinpoint situations where they're truly beneficial, and explore different ways to effectively incorporate them into your Polars workflows. We'll walk through a complete, practical example.
32
- """
33
- )
34
  return
35
 
36
 
37
  @app.cell(hide_code=True)
38
  def _(mo):
39
- mo.md(
40
- r"""
41
- ## ⚖️ The Cost of UDFs
42
 
43
- > Performance vs. Flexibility
44
 
45
- Polars' built-in expressions are highly optimized for speed and parallel processing. User-defined functions (UDFs), however, introduce a significant performance overhead because they rely on standard Python code, which often runs in a single thread and bypasses Polars' logical optimizations. Therefore, always prioritize native Polars operations *whenever possible*.
46
 
47
- However, UDFs become inevitable when you need to:
48
 
49
- - **Integrate external libraries:** Use functionality not directly available in Polars.
50
- - **Implement custom logic:** Handle complex transformations that can't be easily expressed with Polars' built-in functions.
51
 
52
- Let's dive into a real-world project where UDFs were the only way to get the job done, demonstrating a scenario where native Polars expressions simply weren't sufficient.
53
- """
54
- )
55
  return
56
 
57
 
58
  @app.cell(hide_code=True)
59
  def _(mo):
60
- mo.md(
61
- r"""
62
- ## 📊 Project Overview
63
 
64
- > Scraping and Analyzing Observable Notebook Statistics
65
 
66
- If you're into data visualization, you've probably seen [D3.js](https://d3js.org/) and [Observable Plot](https://observablehq.com/plot/). Both have extensive galleries showcasing amazing visualizations. Each gallery item is a standalone [Observable notebook](https://observablehq.com/documentation/notebooks/), with metrics like stars, comments, and forks – indicators of popularity. But getting and analyzing these statistics directly isn't straightforward. We'll need to scrape the web.
67
- """
68
- )
69
  return
70
 
71
 
@@ -90,7 +84,9 @@ def _(mo):
90
 
91
  @app.cell(hide_code=True)
92
  def _(mo):
93
- mo.md(r"""Our goal is to use Polars UDFs to fetch the HTML content of these gallery pages. Then, we'll use the `BeautifulSoup` Python library to parse the HTML and extract the relevant metadata. After some data wrangling with native Polars expressions, we'll have a DataFrame listing each visualization notebook. Then, we'll use another UDF to retrieve the number of likes, forks, and comments for each notebook. Finally, we will create our own high-performance UDF to implement a custom notebook ranking scheme. This will involve multiple steps, showcasing different UDF approaches.""")
 
 
94
  return
95
 
96
 
@@ -109,7 +105,9 @@ def _(mo):
109
 
110
  @app.cell(hide_code=True)
111
  def _(mo):
112
- mo.md(r"""Our starting point, `url_df`, is a simple DataFrame with a single `url` column containing the URLs of the D3 and Observable Plot gallery notebooks.""")
 
 
113
  return
114
 
115
 
@@ -129,19 +127,17 @@ def _(pl):
129
 
130
  @app.cell(hide_code=True)
131
  def _(mo):
132
- mo.md(
133
- r"""
134
- ## 🔂 Element-Wise UDFs
135
 
136
- > Processing Value by Value
137
 
138
- The most common way to use UDFs is to apply them element-wise. This means our custom function will execute for *each individual row* in a specified column. Our first task is to fetch the HTML content for each URL in `url_df`.
139
 
140
- We'll define a Python function that takes a `url` (a string) as input, uses the `httpx` library (an HTTP client) to fetch the content, and returns the HTML as a string. We then integrate this function into Polars using the [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) expression.
141
 
142
- You'll notice we have to explicitly specify the `return_dtype`. This is *crucial*. Polars doesn't automatically know what our custom function will return. We're responsible for defining the function's logic and, therefore, its output type. By providing the `return_dtype`, we help Polars maintain its internal representation of the DataFrame's schema, enabling query optimization. Think of it as giving Polars a "heads-up" about the data type it should expect.
143
- """
144
- )
145
  return
146
 
147
 
@@ -159,13 +155,11 @@ def _(httpx, pl, url_df):
159
 
160
  @app.cell(hide_code=True)
161
  def _(mo):
162
- mo.md(
163
- r"""
164
- Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this.
165
 
166
- These Observable pages are built with [Next.js](https://nextjs.org/), which helpfully serializes page properties as JSON within the HTML. This simplifies our UDF: we'll extract the raw JSON from the `<script id="__NEXT_DATA__" type="application/json">` tag. We'll use [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) again. For clarity, we'll define this UDF as a named function, `extract_nextjs_data`, since it's a bit more complex than a simple HTTP request.
167
- """
168
- )
169
  return
170
 
171
 
@@ -193,7 +187,9 @@ def _(extract_nextjs_data, html_df, pl):
193
 
194
  @app.cell(hide_code=True)
195
  def _(mo):
196
- mo.md(r"""With some data wrangling of the raw JSON (using *native* Polars expressions!), we get `notebooks_df`, containing the metadata for each notebook.""")
 
 
197
  return
198
 
199
 
@@ -276,19 +272,17 @@ def _(parsed_html_df, pl):
276
 
277
  @app.cell(hide_code=True)
278
  def _(mo):
279
- mo.md(
280
- r"""
281
- ## 📦 Batch-Wise UDFs
282
 
283
- > Processing Entire Series
284
 
285
- `map_elements` calls the UDF for *each row*. Fine for our tiny, two-rows-tall `url_df`. But `notebooks_df` has almost 400 rows! Individual HTTP requests for each would be painfully slow.
286
 
287
- We want stats for each notebook in `notebooks_df`. To avoid sequential requests, we'll use Polars' [`map_batches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_batches.html). This lets us process an *entire Series* (a column) at once.
288
 
289
- Our UDF, `fetch_html_batch`, will take a *Series* of URLs and use `asyncio` to make concurrent requests – a huge performance boost.
290
- """
291
- )
292
  return
293
 
294
 
@@ -372,19 +366,19 @@ def _(mo, notebook_stats_df):
372
  return notebook_height, notebooks
373
 
374
 
375
- @app.cell(hide_code=True)
376
- def _():
377
- def nb_iframe(notebook_url: str, height=825) -> str:
378
- embed_url = notebook_url.replace(
379
- "https://observablehq.com", "https://observablehq.com/embed"
380
- )
381
- return f'<iframe width="100%" height="{height}" frameborder="0" src="{embed_url}?cell=*"></iframe>'
382
- return (nb_iframe,)
383
 
384
 
385
  @app.cell(hide_code=True)
386
  def _(mo):
387
- mo.md(r"""Now that we have access to notebook-level statistics, we can rank the visualizations by the number of likes they received & display them interactively.""")
 
 
388
  return
389
 
390
 
@@ -395,7 +389,7 @@ def _(mo):
395
 
396
 
397
  @app.cell(hide_code=True)
398
- def _(category, mo, nb_iframe, notebook_height, notebooks):
399
  notebook = notebooks.value.to_dicts()[0]
400
  mo.vstack(
401
  [
@@ -406,60 +400,56 @@ def _(category, mo, nb_iframe, notebook_height, notebooks):
406
  mo.md(nb_iframe(notebook["notebook_url"], notebook_height.value)),
407
  ]
408
  )
409
- return (notebook,)
410
 
411
 
412
  @app.cell(hide_code=True)
413
  def _(mo):
414
- mo.md(
415
- r"""
416
- ## ⚙️ Row-Wise UDFs
417
 
418
- > Accessing All Columns at Once
419
 
420
- Sometimes, you need to work with *all* columns of a row at once. This is where [`map_rows`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.map_rows.html) comes in. It operates directly on the DataFrame, passing each row to your UDF *as a tuple*.
421
 
422
- Below, `create_notebook_summary` takes a row from `notebook_stats_df` (as a tuple) and returns a formatted Markdown string summarizing the notebook's key stats. We're essentially reducing the DataFrame to a single column. While this *could* be done with native Polars expressions, it would be much more cumbersome. This example demonstrates a case where a row-wise UDF simplifies the code, even if the underlying operation isn't inherently complex.
423
- """
424
- )
425
  return
426
 
427
 
428
- @app.cell(hide_code=True)
429
- def _():
430
- def create_notebook_summary(row: tuple) -> str:
431
- (
432
- thumbnail_src,
433
- category,
434
- title,
435
- likes,
436
- forks,
437
- comments,
438
- license,
439
- description,
440
- notebook_url,
441
- ) = row
442
- return (
443
- f"""
444
- ### [{title}]({notebook_url})
445
-
446
- <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 12px; margin: 12px 0;">
447
- <div> <strong>Likes:</strong> {likes}</div>
448
- <div>↗️ <strong>Forks:</strong> {forks}</div>
449
- <div>💬 <strong>Comments:</strong> {comments}</div>
450
- <div>⚖️ <strong>License:</strong> {license}</div>
451
- </div>
452
-
453
- <a href="{notebook_url}" target="_blank">
454
- <img src="{thumbnail_src}" style="height: 300px;" />
455
- <a/>
456
- """.strip('\n')
457
- )
458
- return (create_notebook_summary,)
459
 
460
 
461
  @app.cell(hide_code=True)
462
- def _(create_notebook_summary, notebook_stats_df, pl):
463
  notebook_summary_df = notebook_stats_df.map_rows(
464
  create_notebook_summary,
465
  return_dtype=pl.String,
@@ -487,37 +477,33 @@ def _(mo, notebook_summary_df):
487
 
488
  @app.cell(hide_code=True)
489
  def _(mo):
490
- mo.md(
491
- r"""
492
- ## 🚀 Higher-performance UDFs
493
 
494
- > Leveraging Numba to Make Python Fast
495
 
496
- Python code doesn't *always* mean slow code. While UDFs *often* introduce performance overhead, there are exceptions. NumPy's universal functions ([`ufuncs`](https://numpy.org/doc/stable/reference/ufuncs.html)) and generalized universal functions ([`gufuncs`](https://numpy.org/neps/nep-0005-generalized-ufuncs.html)) provide high-performance operations on NumPy arrays, thanks to low-level implementations.
497
 
498
- But NumPy's built-in functions are predefined. We can't easily use them for *custom* logic. Enter [`numba`](https://numba.pydata.org/). Numba is a just-in-time (JIT) compiler that translates Python functions into optimized machine code *at runtime*. It provides decorators like [`numba.guvectorize`](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) that let us create our *own* high-performance `gufuncs` – *without* writing low-level code!
499
- """
500
- )
501
  return
502
 
503
 
504
  @app.cell(hide_code=True)
505
  def _(mo):
506
- mo.md(
507
- r"""
508
- Let's create a custom popularity metric to rank notebooks, considering likes, forks, *and* comments (not just likes). We'll define `weighted_popularity_numba`, decorated with `@numba.guvectorize`. The decorator arguments specify that we're taking three integer vectors of length `n` and returning a float vector of length `n`.
509
 
510
- The weighted popularity score for each notebook is calculated using the following formula:
511
 
512
- $$
513
- \begin{equation}
514
- \text{score}_i = w_l \cdot l_i^{f} + w_f \cdot f_i^{f} + w_c \cdot c_i^{f}
515
- \end{equation}
516
- $$
517
 
518
- with:
519
- """
520
- )
521
  return
522
 
523
 
@@ -606,12 +592,14 @@ def _(
606
  + w_f * (forks[i] ** nlf)
607
  + w_c * (comments[i] ** nlf)
608
  )
609
- return nlf, w_c, w_f, w_l, weighted_popularity_numba
610
 
611
 
612
  @app.cell(hide_code=True)
613
  def _(mo):
614
- mo.md(r"""We apply our JIT-compiled UDF using `map_batches`, as before. The key is that we're passing entire columns directly to `weighted_popularity_numba`. Polars and Numba handle the conversion to NumPy arrays behind the scenes. This direct integration is a major benefit of using `guvectorize`.""")
 
 
615
  return
616
 
617
 
@@ -665,7 +653,9 @@ def _(
665
 
666
  @app.cell(hide_code=True)
667
  def _(mo):
668
- mo.md(r"""As the slope chart below demonstrates, this new ranking strategy significantly changes the notebook order, as it considers forks and comments, not just likes.""")
 
 
669
  return
670
 
671
 
@@ -700,27 +690,25 @@ def _(alt, notebook_popularity_df, pl):
700
  fill="title:N",
701
  )
702
  (points + lines).properties(width=400)
703
- return lines, notebook_ranks_df, points
704
 
705
 
706
  @app.cell(hide_code=True)
707
  def _(mo):
708
- mo.md(
709
- r"""
710
- ## ⏱️ Quantifying the Overhead
711
 
712
- > UDF Performance Comparison
713
 
714
- To truly understand the performance implications of using UDFs, let's conduct a benchmark. We'll create a DataFrame with random numbers and perform the same numerical operation using four different methods:
715
 
716
- 1. **Native Polars:** Using Polars' built-in expressions.
717
- 2. **`map_elements`:** Applying a Python function element-wise.
718
- 3. **`map_batches`:** **Applying** a Python function to the entire Series.
719
- 4. **`map_batches` with Numba:** Applying a JIT-compiled function to batches, similar to a generalized universal function.
720
 
721
- We'll use a simple, but non-trivial, calculation: `result = (x * 2.5 + 5) / (x + 1)`. This involves multiplication, addition, and division, giving us a realistic representation of a common numerical operation. We'll use the `timeit` module, to accurately measure execution times over multiple trials.
722
- """
723
- )
724
  return
725
 
726
 
@@ -750,15 +738,13 @@ def _(benchmark_plot, mo, num_samples, num_trials):
750
 
751
  @app.cell(hide_code=True)
752
  def _(mo):
753
- mo.md(
754
- r"""
755
- As anticipated, the `Batch-Wise UDF (Python)` and `Element-Wise UDF` exhibit significantly worse performance, essentially acting as pure-Python for-each loops.
756
 
757
- However, when Python serves as an interface to lower-level, high-performance libraries, we observe substantial improvements. The `Batch-Wise UDF (NumPy)` lags behind both `Batch-Wise UDF (Numba)` and `Native Polars`, but it still represents a considerable improvement over pure-Python UDFs due to its vectorized computations.
758
 
759
- Numba's Just-In-Time (JIT) compilation delivers a dramatic performance boost, achieving speeds comparable to native Polars expressions. This demonstrates that UDFs, particularly when combined with tools like Numba, don't inevitably lead to bottlenecks in numerical computations.
760
- """
761
- )
762
  return
763
 
764
 
@@ -789,7 +775,7 @@ def _(mo):
789
  def _(np, num_samples, pl):
790
  rng = np.random.default_rng(42)
791
  sample_df = pl.from_dict({"x": rng.random(num_samples.value)})
792
- return rng, sample_df
793
 
794
 
795
  @app.cell(hide_code=True)
@@ -861,14 +847,7 @@ def _(np, num_trials, numba, pl, sample_df, timeit):
861
  def time_method(callable_name: str, number=num_trials.value) -> float:
862
  fn = globals()[callable_name]
863
  return timeit.timeit(fn, number=number)
864
- return (
865
- run_map_batches_numba,
866
- run_map_batches_numpy,
867
- run_map_batches_python,
868
- run_map_elements,
869
- run_native,
870
- time_method,
871
- )
872
 
873
 
874
  @app.cell(hide_code=True)
@@ -906,7 +885,7 @@ def _(alt, pl, time_method):
906
  x=alt.X("title:N", title="Method", sort="-y"),
907
  y=alt.Y("time:Q", title="Execution Time (s)", axis=alt.Axis(format=".3f")),
908
  ).properties(width=400)
909
- return benchmark_df, benchmark_plot
910
 
911
 
912
  @app.cell(hide_code=True)
@@ -934,7 +913,6 @@ def _():
934
  asyncio,
935
  httpx,
936
  mo,
937
- nest_asyncio,
938
  np,
939
  numba,
940
  pl,
 
14
 
15
  import marimo
16
 
17
+ __generated_with = "0.18.4"
18
  app = marimo.App(width="medium")
19
 
20
 
21
  @app.cell(hide_code=True)
22
  def _(mo):
23
+ mo.md(r"""
24
+ # User-Defined Functions
 
25
 
26
+ _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_.
27
 
28
+ Throughout the previous chapters, you've seen how Polars provides a comprehensive set of built-in expressions for flexible data transformation. But what happens when you need something *more*? Perhaps your project has unique requirements, or you need to integrate functionality from an external Python library. This is where User-Defined Functions (UDFs) come into play, allowing you to extend Polars with your own custom logic.
29
 
30
+ In this chapter, we'll weigh the performance trade-offs of UDFs, pinpoint situations where they're truly beneficial, and explore different ways to effectively incorporate them into your Polars workflows. We'll walk through a complete, practical example.
31
+ """)
 
32
  return
33
 
34
 
35
  @app.cell(hide_code=True)
36
  def _(mo):
37
+ mo.md(r"""
38
+ ## ⚖️ The Cost of UDFs
 
39
 
40
+ > Performance vs. Flexibility
41
 
42
+ Polars' built-in expressions are highly optimized for speed and parallel processing. User-defined functions (UDFs), however, introduce a significant performance overhead because they rely on standard Python code, which often runs in a single thread and bypasses Polars' logical optimizations. Therefore, always prioritize native Polars operations *whenever possible*.
43
 
44
+ However, UDFs become necessary when you need to:
45
 
46
+ - **Integrate external libraries:** Use functionality not directly available in Polars.
47
+ - **Implement custom logic:** Handle complex transformations that can't be easily expressed with Polars' built-in functions.
48
 
49
+ Let's dive into a real-world project where UDFs were the only way to get the job done, demonstrating a scenario where native Polars expressions simply weren't sufficient.
50
+ """)
 
51
  return
52
 
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
+ mo.md(r"""
57
+ ## 📊 Project Overview
 
58
 
59
+ > Scraping and Analyzing Observable Notebook Statistics
60
 
61
+ If you're into data visualization, you've probably seen [D3.js](https://d3js.org/) and [Observable Plot](https://observablehq.com/plot/). Both have extensive galleries showcasing amazing visualizations. Each gallery item is a standalone [Observable notebook](https://observablehq.com/documentation/notebooks/), with metrics like stars, comments, and forks – indicators of popularity. But getting and analyzing these statistics directly isn't straightforward. We'll need to scrape the web.
62
+ """)
 
63
  return
64
 
65
 
 
84
 
85
  @app.cell(hide_code=True)
86
  def _(mo):
87
+ mo.md(r"""
88
+ Our goal is to use Polars UDFs to fetch the HTML content of these gallery pages. Then, we'll use the `BeautifulSoup` Python library to parse the HTML and extract the relevant metadata. After some data wrangling with native Polars expressions, we'll have a DataFrame listing each visualization notebook. Then, we'll use another UDF to retrieve the number of likes, forks, and comments for each notebook. Finally, we will create our own high-performance UDF to implement a custom notebook ranking scheme. This will involve multiple steps, showcasing different UDF approaches.
89
+ """)
90
  return
91
 
92
 
 
105
 
106
  @app.cell(hide_code=True)
107
  def _(mo):
108
+ mo.md(r"""
109
+ Our starting point, `url_df`, is a simple DataFrame with a single `url` column containing the URLs of the D3 and Observable Plot gallery notebooks.
110
+ """)
111
  return
112
 
113
 
 
127
 
128
  @app.cell(hide_code=True)
129
  def _(mo):
130
+ mo.md(r"""
131
+ ## 🔂 Element-Wise UDFs
 
132
 
133
+ > Processing Value by Value
134
 
135
+ The most common way to use UDFs is to apply them element-wise. This means our custom function will execute for *each individual row* in a specified column. Our first task is to fetch the HTML content for each URL in `url_df`.
136
 
137
+ We'll define a Python function that takes a `url` (a string) as input, uses the `httpx` library (an HTTP client) to fetch the content, and returns the HTML as a string. We then integrate this function into Polars using the [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) expression.
138
 
139
+ You'll notice we have to explicitly specify the `return_dtype`. This is *crucial*. Polars doesn't automatically know what our custom function will return. We're responsible for defining the function's logic and, therefore, its output type. By providing the `return_dtype`, we help Polars maintain its internal representation of the DataFrame's schema, enabling query optimization. Think of it as giving Polars a "heads-up" about the data type it should expect.
140
+ """)
 
141
  return
142
 
143
 
 
155
 
156
  @app.cell(hide_code=True)
157
  def _(mo):
158
+ mo.md(r"""
159
+ Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this.
 
160
 
161
+ These Observable pages are built with [Next.js](https://nextjs.org/), which helpfully serializes page properties as JSON within the HTML. This simplifies our UDF: we'll extract the raw JSON from the `<script id="__NEXT_DATA__" type="application/json">` tag. We'll use [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) again. For clarity, we'll define this UDF as a named function, `extract_nextjs_data`, since it's a bit more complex than a simple HTTP request.
162
+ """)
 
163
  return
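The extraction step described above can be sketched without any third-party dependency. The chapter's `extract_nextjs_data` uses `beautifulsoup4`; this equivalent sketch uses the stdlib `html.parser` to pull the JSON payload out of the `__NEXT_DATA__` script tag (the sample HTML string is made up for illustration).

```python
import json
from html.parser import HTMLParser

class NextDataExtractor(HTMLParser):
    """Collects the text content of <script id="__NEXT_DATA__">."""
    def __init__(self):
        super().__init__()
        self._in_target = False
        self.payload = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("id", "__NEXT_DATA__") in attrs:
            self._in_target = True

    def handle_data(self, data):
        if self._in_target and self.payload is None:
            self.payload = data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_target = False

def extract_nextjs_data(html: str) -> str:
    parser = NextDataExtractor()
    parser.feed(html)
    return parser.payload

sample_html = '<script id="__NEXT_DATA__" type="application/json">{"props": {"likes": 42}}</script>'
data = json.loads(extract_nextjs_data(sample_html))
```

The real UDF would then be applied with `map_elements(extract_nextjs_data, return_dtype=pl.String)`, exactly as with the fetch step.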
164
 
165
 
 
187
 
188
  @app.cell(hide_code=True)
189
  def _(mo):
190
+ mo.md(r"""
191
+ With some data wrangling of the raw JSON (using *native* Polars expressions!), we get `notebooks_df`, containing the metadata for each notebook.
192
+ """)
193
  return
194
 
195
 
 
272
 
273
  @app.cell(hide_code=True)
274
  def _(mo):
275
+ mo.md(r"""
276
+ ## 📦 Batch-Wise UDFs
 
277
 
278
+ > Processing Entire Series
279
 
280
+ `map_elements` calls the UDF for *each row*. Fine for our tiny, two-rows-tall `url_df`. But `notebooks_df` has almost 400 rows! Individual HTTP requests for each would be painfully slow.
281
 
282
+ We want stats for each notebook in `notebooks_df`. To avoid sequential requests, we'll use Polars' [`map_batches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_batches.html). This lets us process an *entire Series* (a column) at once.
283
 
284
+ Our UDF, `fetch_html_batch`, will take a *Series* of URLs and use `asyncio` to make concurrent requests – a huge performance boost.
285
+ """)
 
286
  return
287
 
288
 
 
366
  return notebook_height, notebooks
367
 
368
 
369
+ @app.function(hide_code=True)
370
+ def nb_iframe(notebook_url: str, height=825) -> str:
371
+ embed_url = notebook_url.replace(
372
+ "https://observablehq.com", "https://observablehq.com/embed"
373
+ )
374
+ return f'<iframe width="100%" height="{height}" frameborder="0" src="{embed_url}?cell=*"></iframe>'
 
 
375
 
376
 
377
  @app.cell(hide_code=True)
378
  def _(mo):
379
+ mo.md(r"""
380
+ Now that we have access to notebook-level statistics, we can rank the visualizations by the number of likes they received and display them interactively.
381
+ """)
382
  return
383
 
384
 
 
389
 
390
 
391
  @app.cell(hide_code=True)
392
+ def _(category, mo, notebook_height, notebooks):
393
  notebook = notebooks.value.to_dicts()[0]
394
  mo.vstack(
395
  [
 
400
  mo.md(nb_iframe(notebook["notebook_url"], notebook_height.value)),
401
  ]
402
  )
403
+ return
404
 
405
 
406
  @app.cell(hide_code=True)
407
  def _(mo):
408
+ mo.md(r"""
409
+ ## ⚙️ Row-Wise UDFs
 
410
 
411
+ > Accessing All Columns at Once
412
 
413
+ Sometimes, you need to work with *all* columns of a row at once. This is where [`map_rows`](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.map_rows.html) comes in. It operates directly on the DataFrame, passing each row to your UDF *as a tuple*.
414
 
415
+ Below, `create_notebook_summary` takes a row from `notebook_stats_df` (as a tuple) and returns a formatted Markdown string summarizing the notebook's key stats. We're essentially reducing the DataFrame to a single column. While this *could* be done with native Polars expressions, it would be much more cumbersome. This example demonstrates a case where a row-wise UDF simplifies the code, even if the underlying operation isn't inherently complex.
416
+ """)
 
417
  return
418
 
419
 
420
+ @app.function(hide_code=True)
421
+ def create_notebook_summary(row: tuple) -> str:
422
+ (
423
+ thumbnail_src,
424
+ category,
425
+ title,
426
+ likes,
427
+ forks,
428
+ comments,
429
+ license,
430
+ description,
431
+ notebook_url,
432
+ ) = row
433
+ return (
434
+ f"""
435
+ ### [{title}]({notebook_url})
436
+
437
+ <div style="display: grid; grid-template-columns: 1fr 1fr; gap: 12px; margin: 12px 0;">
438
+ <div>⭐ <strong>Likes:</strong> {likes}</div>
439
+ <div>↗️ <strong>Forks:</strong> {forks}</div>
440
+ <div>💬 <strong>Comments:</strong> {comments}</div>
441
+ <div>⚖️ <strong>License:</strong> {license}</div>
442
+ </div>
443
+
444
+ <a href="{notebook_url}" target="_blank">
445
+ <img src="{thumbnail_src}" style="height: 300px;" />
446
+ </a>
447
+ """.strip('\n')
448
+ )
 
 
449
 
450
 
451
  @app.cell(hide_code=True)
452
+ def _(notebook_stats_df, pl):
453
  notebook_summary_df = notebook_stats_df.map_rows(
454
  create_notebook_summary,
455
  return_dtype=pl.String,
 
477
 
478
  @app.cell(hide_code=True)
479
  def _(mo):
480
+ mo.md(r"""
481
+ ## 🚀 Higher-Performance UDFs
 
482
 
483
+ > Leveraging Numba to Make Python Fast
484
 
485
+ Python code doesn't *always* mean slow code. While UDFs *often* introduce performance overhead, there are exceptions. NumPy's universal functions ([`ufuncs`](https://numpy.org/doc/stable/reference/ufuncs.html)) and generalized universal functions ([`gufuncs`](https://numpy.org/neps/nep-0005-generalized-ufuncs.html)) provide high-performance operations on NumPy arrays, thanks to low-level implementations.
486
 
487
+ But NumPy's built-in functions are predefined. We can't easily use them for *custom* logic. Enter [`numba`](https://numba.pydata.org/). Numba is a just-in-time (JIT) compiler that translates Python functions into optimized machine code *at runtime*. It provides decorators like [`numba.guvectorize`](https://numba.readthedocs.io/en/stable/user/vectorize.html#the-guvectorize-decorator) that let us create our *own* high-performance `gufuncs` – *without* writing low-level code!
488
+ """)
 
489
  return
490
 
491
 
492
  @app.cell(hide_code=True)
493
  def _(mo):
494
+ mo.md(r"""
495
+ Let's create a custom popularity metric to rank notebooks, considering likes, forks, *and* comments (not just likes). We'll define `weighted_popularity_numba`, decorated with `@numba.guvectorize`. The decorator arguments specify that we're taking three integer vectors of length `n` and returning a float vector of length `n`.
 
496
 
497
+ The weighted popularity score for each notebook is calculated using the following formula:
498
 
499
+ $$
500
+ \begin{equation}
501
+ \text{score}_i = w_l \cdot l_i^{f} + w_f \cdot f_i^{f} + w_c \cdot c_i^{f}
502
+ \end{equation}
503
+ $$
504
 
505
+ with:
506
+ """)
 
507
  return
508
 
509
 
 
592
  + w_f * (forks[i] ** nlf)
593
  + w_c * (comments[i] ** nlf)
594
  )
595
+ return (weighted_popularity_numba,)
596
 
597
 
598
  @app.cell(hide_code=True)
599
  def _(mo):
600
+ mo.md(r"""
601
+ We apply our JIT-compiled UDF using `map_batches`, as before. The key is that we're passing entire columns directly to `weighted_popularity_numba`. Polars and Numba handle the conversion to NumPy arrays behind the scenes. This direct integration is a major benefit of using `guvectorize`.
602
+ """)
603
  return
604
 
605
 
 
653
 
654
  @app.cell(hide_code=True)
655
  def _(mo):
656
+ mo.md(r"""
657
+ As the slope chart below demonstrates, this new ranking strategy significantly changes the notebook order, as it considers forks and comments, not just likes.
658
+ """)
659
  return
660
 
661
 
 
690
  fill="title:N",
691
  )
692
  (points + lines).properties(width=400)
693
+ return
694
 
695
 
696
  @app.cell(hide_code=True)
697
  def _(mo):
698
+ mo.md(r"""
699
+ ## ⏱️ Quantifying the Overhead
 
700
 
701
+ > UDF Performance Comparison
702
 
703
+ To truly understand the performance implications of using UDFs, let's conduct a benchmark. We'll create a DataFrame with random numbers and perform the same numerical operation using four different methods:
704
 
705
+ 1. **Native Polars:** Using Polars' built-in expressions.
706
+ 2. **`map_elements`:** Applying a Python function element-wise.
707
+ 3. **`map_batches`:** Applying a Python function to the entire Series.
708
+ 4. **`map_batches` with Numba:** Applying a JIT-compiled function to batches, similar to a generalized universal function.
709
 
710
+ We'll use a simple but non-trivial calculation: `result = (x * 2.5 + 5) / (x + 1)`. This involves multiplication, addition, and division, giving us a realistic representation of a common numerical operation. We'll use the `timeit` module to accurately measure execution times over multiple trials.
711
+ """)
 
712
  return
713
 
714
 
 
738
 
739
  @app.cell(hide_code=True)
740
  def _(mo):
741
+ mo.md(r"""
742
+ As anticipated, the `Batch-Wise UDF (Python)` and `Element-Wise UDF` exhibit significantly worse performance, essentially acting as pure-Python for-each loops.
 
743
 
744
+ However, when Python serves as an interface to lower-level, high-performance libraries, we observe substantial improvements. The `Batch-Wise UDF (NumPy)` lags behind both `Batch-Wise UDF (Numba)` and `Native Polars`, but it still represents a considerable improvement over pure-Python UDFs due to its vectorized computations.
745
 
746
+ Numba's Just-In-Time (JIT) compilation delivers a dramatic performance boost, achieving speeds comparable to native Polars expressions. This demonstrates that UDFs, particularly when combined with tools like Numba, don't inevitably lead to bottlenecks in numerical computations.
747
+ """)
 
748
  return
749
 
750
 
 
775
  def _(np, num_samples, pl):
776
  rng = np.random.default_rng(42)
777
  sample_df = pl.from_dict({"x": rng.random(num_samples.value)})
778
+ return (sample_df,)
779
 
780
 
781
  @app.cell(hide_code=True)
 
847
  def time_method(callable_name: str, number=num_trials.value) -> float:
848
  fn = globals()[callable_name]
849
  return timeit.timeit(fn, number=number)
850
+ return (time_method,)
 
 
 
 
 
 
 
851
 
852
 
853
  @app.cell(hide_code=True)
 
885
  x=alt.X("title:N", title="Method", sort="-y"),
886
  y=alt.Y("time:Q", title="Execution Time (s)", axis=alt.Axis(format=".3f")),
887
  ).properties(width=400)
888
+ return (benchmark_plot,)
889
 
890
 
891
  @app.cell(hide_code=True)
 
913
  asyncio,
914
  httpx,
915
  mo,
 
916
  np,
917
  numba,
918
  pl,
polars/16_lazy_execution.py CHANGED
@@ -15,19 +15,17 @@
15
 
16
  import marimo
17
 
18
- __generated_with = "0.12.6"
19
  app = marimo.App(width="medium")
20
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
- mo.md(
25
- r"""
26
- # Lazy Execution (a.k.a. the Lazy API)
27
 
28
- Author: [Deb Debnath](https://github.com/debajyotid2)
29
- """
30
- )
31
  return
32
 
33
 
@@ -51,14 +49,9 @@ def _():
51
  Generator,
52
  datetime,
53
  np,
54
- numba,
55
- pd,
56
  pl,
57
- plt,
58
  random,
59
  re,
60
- spl,
61
- st,
62
  time,
63
  timedelta,
64
  timezone,
@@ -67,47 +60,43 @@ def _():
67
 
68
  @app.cell(hide_code=True)
69
  def _(mo):
70
- mo.md(
71
- r"""
72
- We saw the benefits of lazy evaluation when we learned about the Expressions API in Polars. Lazy execution is further extended as a philosophy by the Lazy API. It offers significant performance enhancements over eager (immediate) execution of queries and is one of the reasons why Polars is faster at working with large (GB scale) datasets than other libraries. The lazy API optimizes the full query pipeline instead of executing individual queries optimally, unlike eager execution. Some of the advantages of using the Lazy API over eager execution include
73
-
74
- - automatic query optimization with the query optimizer.
75
- - ability to process datasets larger than memory using streaming.
76
- - ability to catch schema errors before data processing.
77
- """
78
- )
79
  return
80
 
81
 
82
  @app.cell(hide_code=True)
83
  def _(mo):
84
- mo.md(
85
- r"""
86
- ## Setup
87
 
88
- For this notebook, we are going to work with logs from an Apache/Nginx web server - these logs contain useful information that can be utilized for performance optimization, security monitoring, etc. Such logs comprise of entries that look something like this:
89
 
90
- ```
91
- 10.23.97.15 - - [05/Jul/2024:11:35:05 +0000] "GET /index.html HTTP/1.1" 200 1342 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32" "-"
92
- ```
93
 
94
- Different parts of the entry mean different things:
95
 
96
- - `10.23.97.15` is the client IP address.
97
- - `- -` represent identity and username of the client, respectively and are typically unused.
98
- - `05/Jul/2024:11:35:05 +0000` indicates the timestamp for the request.
99
- - `"GET /index.html HTTP/1.1"` represents the HTTP method, requested resource and the protocol version for HTTP, respectively.
100
- - `200 1342` mean the response status code and size of the response in bytes, respectively
101
- - `"https://www.example.com"` is the "referer", or the webpage URL that brought the client to the resource.
102
- - `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32"` is the "User agent" or the details of the client device making the request (including browser version, operating system, etc.)
103
 
104
- Normally, you would get your log files from a server that you have access to. In our case, we will generate fake data to simulate log records. We will simulate 7 days of server activity with 90,000 recorded lines.
105
 
106
- ///Note
107
- 1. If you are interested in the process of generating fake log entries, unhide the code cells immediately below the next one.
108
- 2. You can adjust the size of the dataset by resetting the `num_log_lines` variables to a size of your choice. It may be helpful if the data takes a long time to generate.
109
- """
110
- )
111
  return
112
 
113
 
@@ -179,7 +168,6 @@ def _(Faker, datetime, np, num_log_lines, time):
179
  responses,
180
  rng,
181
  sleep,
182
- timestr,
183
  tz,
184
  user_agents,
185
  verbs,
@@ -222,19 +210,17 @@ def _(
222
  faker=faker, rng=rng, resources=resources,
223
  user_agents=user_agents, responses=responses, verbs=verbs)
224
  yield list(re.findall(pattern, log_line)[0])
225
- return generator, pattern
226
 
227
 
228
  @app.cell(hide_code=True)
229
  def _(mo):
230
- mo.md(
231
- r"""
232
- Since we are generating data using a Python generator, we create a `pl.LazyFrame` directly, but we can start with either a file or an existing `DataFrame`. When using a file, the functions beginning with `pl.scan_` from the Polars API can be used, while in the case of an existing `pl.DataFrame`, we can simply call `.lazy()` to convert it to a `pl.LazyFrame`.
233
 
234
- ///Note
235
- Depending on your machine, the following cell may take some time to execute.
236
- """
237
- )
238
  return
239
 
240
 
@@ -249,15 +235,13 @@ def _(generator, num_log_lines, pl):
249
 
250
  @app.cell(hide_code=True)
251
  def _(mo):
252
- mo.md(
253
- r"""
254
- ## Schema
255
 
256
- A schema denotes the names and respective datatypes of columns in a DataFrame or LazyFrame. It can be specified when a DataFrame or LazyFrame is generated (as you may have noticed in the cell creating the LazyFrame above).
257
 
258
- You can see the schema with the .collect_schema method on a DataFrame or LazyFrame.
259
- """
260
- )
261
  return
262
 
263
 
@@ -269,26 +253,28 @@ def _(log_data):
269
 
270
  @app.cell(hide_code=True)
271
  def _(mo):
272
- mo.md(
273
- r"""
274
- Since our generator yields strings, Polars defaults to the `pl.String` datatype while reading in the data from the generator, unless specified. This, however, is not the most space or computation efficient form of data storage, so we would like to convert the datatypes of some of the columns in our LazyFrame.
275
 
276
- ///Note
277
- The data type conversion can also be done by specifying it in the schema when creating the LazyFrame or DataFrame. We are skipping doing this for demonstration. For more details on specifying data types in LazyFrames, please refer to the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html).
278
- """
279
- )
280
  return
281
 
282
 
283
  @app.cell(hide_code=True)
284
  def _(mo):
285
- mo.md(r"""The Lazy API validates a query pipeline end-to-end for schema consistency and correctness. The checks make sure that if there is a mistake in your query, you can correct it before the data gets processed.""")
 
 
286
  return
287
 
288
 
289
  @app.cell(hide_code=True)
290
  def _(mo):
291
- mo.md(r"""The `log_data_erroneous` query below throws an `InvalidOperationError` because Polars finds inconsistencies between the timestamps we parsed from the logs and the timestamp format specified. It turns out that the time stamps in string format still have trailing whitespace which leads to errors during conversion to `datetime[μs]` objects.""")
 
 
292
  return
293
 
294
 
@@ -311,13 +297,11 @@ def _(log_data_erroneous):
311
 
312
  @app.cell(hide_code=True)
313
  def _(mo):
314
- mo.md(
315
- r"""
316
- Polars uses a **query optimizer** to make sure that a query pipeline is executed with the least computational cost (more on this later). In order to be able to do the optimization, the optimizer must know the schema for each step of the pipeline (query plan). For example, if you have a `.pivot` operation somewhere in your pipeline, you are generating new columns based on the data. This is new information unknown to the query optimizer that it cannot work with, and so the lazy API does not support `.pivot` operations.
317
 
318
- For example, suppose you would like to know how many requests of each kind were received at a given time that were not "POST" requests. For this we would want to create a pivot table as follows, except that it throws an error as the lazy API does not support pivot operations.
319
- """
320
- )
321
  return
322
 
323
 
@@ -334,13 +318,11 @@ def _(log_data, pl):
334
 
335
  @app.cell(hide_code=True)
336
  def _(mo):
337
- mo.md(
338
- r"""
339
- As a workaround, we can jump between "lazy mode" and "eager mode" by converting a LazyFrame to a DataFrame just before the unsupported operation (e.g. `.pivot`). We can do this by calling `.collect()` on the LazyFrame. Once done with the "eager mode" operations, we can jump back to "lazy mode" by calling ".lazy()" on the DataFrame!
340
 
341
- As an example, see the fix to the query in the previous cell below:
342
- """
343
- )
344
  return
345
 
346
 
@@ -360,21 +342,21 @@ def _(log_data, pl):
360
 
361
  @app.cell(hide_code=True)
362
  def _(mo):
363
- mo.md(r"""## Query plan""")
 
 
364
  return
365
 
366
 
367
  @app.cell(hide_code=True)
368
  def _(mo):
369
- mo.md(
370
- r"""
371
- Polars has a query optimizer that works on a "query plan" to create a computationally efficient query pipeline. It builds the query plan/query graph from the user-specified lazy operations.
372
 
373
- We can understand query graphs with visualization and by printing them as text.
374
 
375
- Say we want to convert the data in our log dataset from `pl.String` more space efficient data types. We also would like to view all "GET" requests that resulted in errors (client side). We build our query first, and then we visualize the query graph using `.show_graph()` and print it using `.request_code()`.
376
- """
377
- )
378
  return
379
 
380
 
@@ -409,21 +391,21 @@ def _(a_query):
409
 
410
  @app.cell(hide_code=True)
411
  def _(mo):
412
- mo.md(r"""## Execution""")
 
 
413
  return
414
 
415
 
416
  @app.cell(hide_code=True)
417
  def _(mo):
418
- mo.md(
419
- r"""
420
- As mentioned before, Polars builds a query graph by going lazy operation by operation and then optimizes it by running a query optimizer on the graph. This optimized graph is run by default.
421
 
422
- We can execute our query on the full dataset by calling the .collect method on the query. But since this option processes all data in one batch, it is not memory efficient, and can crash if the size of the data exceeds the amount of memory your query can support.
423
 
424
- For fast iterative development running `.collect` on the entire dataset is not a good idea due to slow runtimes. If your dataset is partitioned, you can use a few of them for testing. Another option is to use `.head` to limit the number of records processed, and `.collect` as few times as possible and toward the end of your query, as shown below.
425
- """
426
- )
427
  return
428
 
429
 
@@ -448,7 +430,9 @@ def _(log_data, pl):
448
 
449
  @app.cell(hide_code=True)
450
  def _(mo):
451
- mo.md(r"""For large datasets Polars supports streaming mode by collecting data in batches. Streaming mode can be used by passing the keyword `engine="streaming"` into the `collect` method.""")
 
 
452
  return
453
 
454
 
@@ -460,13 +444,11 @@ def _(a_query):
460
 
461
  @app.cell(hide_code=True)
462
  def _(mo):
463
- mo.md(
464
- r"""
465
- ## Optimizations
466
 
467
- The lazy API runs a query optimizer on every Polars query. To do this, first it builds a non-optimized plan with the set of steps in the order they were specified by the user. Then it checks for optimization opportunities within the plan and reorders operations following specific rules to create an optimized query plan. Some of them are executed up front, others are determined just in time as the materialized data comes in. For the query that we built before and saw the query graph, we can view the unoptimized and optimized versions below.
468
- """
469
- )
470
  return
471
 
472
 
@@ -484,25 +466,33 @@ def _(a_query):
484
 
485
  @app.cell(hide_code=True)
486
  def _(mo):
487
- mo.md(r"""One difference between the optimized and the unoptimized versions above is that all of the datatype cast operations except for the conversion of the `"status"` column to `pl.Int16` are performed at the end together. Also, the `filter()` operation is "pushed down" the graph, but after the datatype cast operation for `"status"`. This is called **predicate pushdown**, and the lazy API optimizes the query graph for filters to be performed as early as possible. Since the datatype coercion makes the filter operation more efficient, the graph preserves its order to be before the filter.""")
 
 
488
  return
489
 
490
 
491
  @app.cell(hide_code=True)
492
  def _(mo):
493
- mo.md(r"""## Sources and Sinks""")
 
 
494
  return
495
 
496
 
497
  @app.cell(hide_code=True)
498
  def _(mo):
499
- mo.md(r"""For data sources like Parquets, CSVs, etc, the lazy API provides `scan_*` (`scan_parquet`, `scan_csv`, etc.) to lazily read in the data into LazyFrames. If queries are chained to the `scan_*` method, Polars will run the usual query optimizations and delay execution until the query is collected. An added benefit of chaining queries to `scan_*` operations is that the "scanners" can skip reading columns and rows that aren't required. This is helpful when streaming large datasets as well, as rows are processed in batches before the entire file is read.""")
 
 
500
  return
501
 
502
 
503
  @app.cell(hide_code=True)
504
  def _(mo):
505
- mo.md(r"""The results of a query from a lazyframe can be saved in streaming mode using `sink_*` (e.g. `sink_parquet`) functions. Sinks support saving data to disk or cloud, and are especially helpful with large datasets. The data being sunk can also be partitioned into multiple files if needed, after specifying a suitable partitioning strategy, as shown below.""")
 
 
506
  return
507
 
508
 
@@ -522,7 +512,9 @@ def _(a_query, pl):
522
 
523
  @app.cell(hide_code=True)
524
  def _(mo):
525
- mo.md(r"""We can also write to multiple sinks at the same time. We just need to specify two separate lazy sinks and combine them by calling `pl.collect_all` and mentioning both sinks.""")
 
 
526
  return
527
 
528
 
@@ -536,13 +528,11 @@ def _(a_query, pl):
536
 
537
  @app.cell(hide_code=True)
538
  def _(mo):
539
- mo.md(
540
- r"""
541
- ## References
542
 
543
- 1. Polars [documentation](https://docs.pola.rs/user-guide/lazy/)
544
- """
545
- )
546
  return
547
 
548
 
 
15
 
16
  import marimo
17
 
18
+ __generated_with = "0.18.4"
19
  app = marimo.App(width="medium")
20
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
+ mo.md(r"""
25
+ # Lazy Execution (a.k.a. the Lazy API)
 
26
 
27
+ Author: [Deb Debnath](https://github.com/debajyotid2)
28
+ """)
 
29
  return
30
 
31
 
 
49
  Generator,
50
  datetime,
51
  np,
 
 
52
  pl,
 
53
  random,
54
  re,
 
 
55
  time,
56
  timedelta,
57
  timezone,
 
60
 
61
  @app.cell(hide_code=True)
62
  def _(mo):
63
+ mo.md(r"""
64
+ We saw the benefits of lazy evaluation when we learned about the Expressions API in Polars. The Lazy API extends lazy execution into a full philosophy: rather than executing each query immediately and optimizing it in isolation, as eager execution does, it optimizes the query pipeline as a whole. This offers significant performance gains and is one of the reasons Polars is faster than other libraries at working with large (GB-scale) datasets. Some of the advantages of using the Lazy API over eager execution include
65
+
66
+ - automatic query optimization with the query optimizer.
67
+ - ability to process datasets larger than memory using streaming.
68
+ - ability to catch schema errors before data processing.
69
+ """)
 
 
70
  return
71
 
72
 
73
  @app.cell(hide_code=True)
74
  def _(mo):
75
+ mo.md(r"""
76
+ ## Setup
 
77
 
78
+ For this notebook, we are going to work with logs from an Apache/Nginx web server - these logs contain useful information that can be utilized for performance optimization, security monitoring, etc. Such logs consist of entries that look something like this:
79
 
80
+ ```
81
+ 10.23.97.15 - - [05/Jul/2024:11:35:05 +0000] "GET /index.html HTTP/1.1" 200 1342 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32" "-"
82
+ ```
83
 
84
+ Different parts of the entry mean different things:
85
 
86
+ - `10.23.97.15` is the client IP address.
87
+ - `- -` represent the identity and username of the client, respectively, and are typically unused.
88
+ - `05/Jul/2024:11:35:05 +0000` indicates the timestamp for the request.
89
+ - `"GET /index.html HTTP/1.1"` represents the HTTP method, requested resource and the protocol version for HTTP, respectively.
90
+ - `200 1342` are the response status code and the size of the response in bytes, respectively.
91
+ - `"https://www.example.com"` is the "referer", or the webpage URL that brought the client to the resource.
92
+ - `"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/528.32 (KHTML, like Gecko) Chrome/19.0.1220.985 Safari/528.32"` is the "User agent" or the details of the client device making the request (including browser version, operating system, etc.)
93
 
94
+ Normally, you would get your log files from a server that you have access to. In our case, we will generate fake data to simulate log records. We will simulate 7 days of server activity with 90,000 recorded lines.
95
 
96
+ ///Note
97
+ 1. If you are interested in the process of generating fake log entries, unhide the code cells immediately below the next one.
98
+ 2. You can adjust the size of the dataset by setting the `num_log_lines` variable to a size of your choice. This may help if the data takes a long time to generate.
99
+ """)
 
100
  return
101
 
102
 
 
168
  responses,
169
  rng,
170
  sleep,
 
171
  tz,
172
  user_agents,
173
  verbs,
 
210
  faker=faker, rng=rng, resources=resources,
211
  user_agents=user_agents, responses=responses, verbs=verbs)
212
  yield list(re.findall(pattern, log_line)[0])
213
+ return (generator,)
214
 
215
 
216
  @app.cell(hide_code=True)
217
  def _(mo):
218
+ mo.md(r"""
219
+ Since we are generating data using a Python generator, we create a `pl.LazyFrame` directly, but we can start with either a file or an existing `DataFrame`. When using a file, the functions beginning with `pl.scan_` from the Polars API can be used, while in the case of an existing `pl.DataFrame`, we can simply call `.lazy()` to convert it to a `pl.LazyFrame`.
 
220
 
221
+ ///Note
222
+ Depending on your machine, the following cell may take some time to execute.
223
+ """)
 
224
  return
225
 
226
 
 
235
 
236
  @app.cell(hide_code=True)
237
  def _(mo):
238
+ mo.md(r"""
239
+ ## Schema
 
240
 
241
+ A schema denotes the names and respective datatypes of columns in a DataFrame or LazyFrame. It can be specified when a DataFrame or LazyFrame is generated (as you may have noticed in the cell creating the LazyFrame above).
242
 
243
+ You can see the schema with the `.collect_schema()` method on a DataFrame or LazyFrame.
244
+ """)
 
245
  return
246
 
247
 
 
253
 
254
  @app.cell(hide_code=True)
255
  def _(mo):
256
+ mo.md(r"""
257
+ Since our generator yields strings, Polars defaults to the `pl.String` datatype when reading in the data, unless specified otherwise. This, however, is not the most space- or computation-efficient form of data storage, so we would like to convert the datatypes of some of the columns in our LazyFrame.
 
258
 
259
+ ///Note
260
+ The data type conversion can also be done by specifying it in the schema when creating the LazyFrame or DataFrame. We are skipping doing this for demonstration. For more details on specifying data types in LazyFrames, please refer to the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html).
261
+ """)
 
262
  return
263
 
264
 
265
  @app.cell(hide_code=True)
266
  def _(mo):
267
+ mo.md(r"""
268
+ The Lazy API validates a query pipeline end-to-end for schema consistency and correctness. The checks make sure that if there is a mistake in your query, you can correct it before the data gets processed.
269
+ """)
270
  return
271
 
272
 
273
  @app.cell(hide_code=True)
274
  def _(mo):
275
+ mo.md(r"""
276
+ The `log_data_erroneous` query below throws an `InvalidOperationError` because Polars finds inconsistencies between the timestamps we parsed from the logs and the specified timestamp format. It turns out that the timestamps in string format still have trailing whitespace, which leads to errors during conversion to `datetime[μs]` objects.
277
+ """)
278
  return
279
 
280
 
 
297
 
298
  @app.cell(hide_code=True)
299
  def _(mo):
300
+ mo.md(r"""
301
+ Polars uses a **query optimizer** to make sure that a query pipeline is executed with the least computational cost (more on this later). To perform this optimization, the optimizer must know the schema at each step of the pipeline (the query plan). A `.pivot` operation, however, generates new columns based on the data itself, so its output schema cannot be known ahead of time; the query optimizer cannot work with this, and so the lazy API does not support `.pivot` operations.
 
302
 
303
+ Suppose, for example, that you would like to count how many requests of each kind, other than "POST" requests, were received at each point in time. We would want to create a pivot table as follows, but the query throws an error because the lazy API does not support pivot operations.
304
+ """)
 
305
  return
306
 
307
 
 
318
 
319
  @app.cell(hide_code=True)
320
  def _(mo):
321
+ mo.md(r"""
322
+ As a workaround, we can jump between "lazy mode" and "eager mode" by converting a LazyFrame to a DataFrame just before the unsupported operation (e.g. `.pivot`). We can do this by calling `.collect()` on the LazyFrame. Once done with the "eager mode" operations, we can jump back to "lazy mode" by calling `.lazy()` on the DataFrame!
 
323
 
324
+ As an example, see the fix to the query in the previous cell below:
325
+ """)
 
326
  return
327
 
328
 
 
342
 
343
  @app.cell(hide_code=True)
344
  def _(mo):
345
+ mo.md(r"""
346
+ ## Query plan
347
+ """)
348
  return
349
 
350
 
351
  @app.cell(hide_code=True)
352
  def _(mo):
353
+ mo.md(r"""
354
+ Polars has a query optimizer that works on a "query plan" to create a computationally efficient query pipeline. It builds the query plan/query graph from the user-specified lazy operations.
 
355
 
356
+ We can understand query graphs with visualization and by printing them as text.
357
 
358
+ Say we want to convert the data in our log dataset from `pl.String` to more space-efficient data types. We would also like to view all "GET" requests that resulted in client-side errors. We build our query first, and then we visualize the query graph using `.show_graph()` and print it as text using `.explain()`.
359
+ """)
 
360
  return
361
 
362
 
 
391
 
392
  @app.cell(hide_code=True)
393
  def _(mo):
394
+ mo.md(r"""
395
+ ## Execution
396
+ """)
397
  return
398
 
399
 
400
  @app.cell(hide_code=True)
401
  def _(mo):
402
+ mo.md(r"""
403
+ As mentioned before, Polars builds a query graph by going lazy operation by operation and then optimizes it by running a query optimizer on the graph. This optimized graph is run by default.
 
404
 
405
+ We can execute our query on the full dataset by calling the `.collect()` method on the query. But since this processes all of the data in one batch, it is not memory efficient and can crash if the data does not fit in the memory available to your query.
406
 
407
+ For fast iterative development, running `.collect` on the entire dataset is not a good idea due to slow runtimes. If your dataset is partitioned, you can use just a few partitions for testing. Another option is to use `.head` to limit the number of records processed, and to call `.collect` as few times as possible and toward the end of your query, as shown below.
408
+ """)
 
409
  return
410
 
411
 
 
430
 
431
  @app.cell(hide_code=True)
432
  def _(mo):
433
+ mo.md(r"""
434
+ For large datasets, Polars supports streaming mode, which collects the data in batches. Streaming mode can be used by passing the keyword argument `engine="streaming"` to the `collect` method.
435
+ """)
436
  return
437
 
438
 
 
444
 
445
  @app.cell(hide_code=True)
446
  def _(mo):
447
+ mo.md(r"""
448
+ ## Optimizations
 
449
 
450
+ The lazy API runs a query optimizer on every Polars query. To do this, it first builds a non-optimized plan with the set of steps in the order they were specified by the user. It then checks for optimization opportunities within the plan and reorders operations following specific rules to create an optimized query plan. Some optimizations are applied up front; others are determined just in time, as the materialized data comes in. For the query that we built before, we can view the unoptimized and optimized plans below.
451
+ """)
 
452
  return
453
 
454
 
 
466
 
467
  @app.cell(hide_code=True)
468
  def _(mo):
469
+ mo.md(r"""
470
+ One difference between the optimized and the unoptimized versions above is that all of the datatype cast operations, except for the conversion of the `"status"` column to `pl.Int16`, are performed together at the end. Also, the `filter()` operation is "pushed down" the graph, though it stays after the datatype cast for `"status"`. This is called **predicate pushdown**: the lazy API optimizes the query graph so that filters are performed as early as possible. Since the datatype coercion makes the filter more efficient, the cast is kept ahead of the filter.
471
+ """)
472
  return
473
 
474
 
475
  @app.cell(hide_code=True)
476
  def _(mo):
477
+ mo.md(r"""
478
+ ## Sources and Sinks
479
+ """)
480
  return
481
 
482
 
483
  @app.cell(hide_code=True)
484
  def _(mo):
485
+ mo.md(r"""
486
+ For data sources like Parquet, CSV, etc., the lazy API provides `scan_*` functions (`scan_parquet`, `scan_csv`, etc.) to lazily read the data into LazyFrames. If queries are chained to a `scan_*` call, Polars will run the usual query optimizations and delay execution until the query is collected. An added benefit of chaining queries to `scan_*` operations is that the "scanners" can skip reading columns and rows that aren't required. This is helpful when streaming large datasets as well, as rows are processed in batches before the entire file is read.
487
+ """)
488
  return
489
 
490
 
491
  @app.cell(hide_code=True)
492
  def _(mo):
493
+ mo.md(r"""
494
+ The results of a query on a LazyFrame can be saved in streaming mode using the `sink_*` functions (e.g. `sink_parquet`). Sinks support saving data to disk or cloud storage, and are especially helpful with large datasets. The data being sunk can also be partitioned into multiple files if needed, after specifying a suitable partitioning strategy, as shown below.
495
+ """)
496
  return
497
 
498
 
 
512
 
513
  @app.cell(hide_code=True)
514
  def _(mo):
515
+ mo.md(r"""
516
+ We can also write to multiple sinks at the same time. We just need to specify two separate lazy sinks and then combine them by calling `pl.collect_all`, passing it both sinks.
517
+ """)
518
  return
519
 
520
 
 
528
 
529
  @app.cell(hide_code=True)
530
  def _(mo):
531
+ mo.md(r"""
532
+ ## References
 
533
 
534
+ 1. Polars [documentation](https://docs.pola.rs/user-guide/lazy/)
535
+ """)
 
536
  return
537
 
538
 
polars/README.md CHANGED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # Learn Polars
2
 
3
  _🚧 This collection is a work in progress. Please help us add notebooks!_
@@ -24,4 +29,4 @@ You can also open notebooks in our online playground by appending marimo.app/ to
24
  * [Péter Gyarmati](https://github.com/peter-gy)
25
  * [Joram Mutenge](https://github.com/jorammutenge)
26
  * [etrotta](https://github.com/etrotta)
27
- * [Debajyoti Das](https://github.com/debajyotid2)
 
1
+ ---
2
+ title: Readme
3
+ marimo-version: 0.18.4
4
+ ---
5
+
6
  # Learn Polars
7
 
8
  _🚧 This collection is a work in progress. Please help us add notebooks!_
 
29
  * [Péter Gyarmati](https://github.com/peter-gy)
30
  * [Joram Mutenge](https://github.com/jorammutenge)
31
  * [etrotta](https://github.com/etrotta)
32
+ * [Debajyoti Das](https://github.com/debajyotid2)
probability/01_sets.py CHANGED
@@ -7,45 +7,47 @@
7
 
8
  import marimo
9
 
10
- __generated_with = "0.11.0"
11
  app = marimo.App()
12
 
13
 
14
  @app.cell(hide_code=True)
15
  def _(mo):
16
- mo.md(
17
- r"""
18
- # Sets
19
 
20
- Probability is the study of "events", assigning numerical values to how likely
21
- events are to occur. For example, probability lets us quantify how likely it is for it to rain or shine on a given day.
22
 
23
 
24
- Typically we reason about _sets_ of events. In mathematics,
25
- a set is a collection of elements, with no element included more than once.
26
- Elements can be any kind of object.
27
 
28
- For example:
29
 
30
- - ☀️ Weather events: $\{\text{Rain}, \text{Overcast}, \text{Clear}\}$
31
- - 🎲 Die rolls: $\{1, 2, 3, 4, 5, 6\}$
32
- - 🪙 Pairs of coin flips = $\{ \text{(Heads, Heads)}, \text{(Heads, Tails)}, \text{(Tails, Tails)} \text{(Tails, Heads)}\}$
33
 
34
- Sets are the building blocks of probability, and will arise frequently in our study.
35
- """
36
- )
37
  return
38
 
39
 
40
  @app.cell(hide_code=True)
41
  def _(mo):
42
- mo.md(r"""## Set operations""")
 
 
43
  return
44
 
45
 
46
  @app.cell(hide_code=True)
47
  def _(mo):
48
- mo.md(r"""In Python, sets are made with the `set` function:""")
 
 
49
  return
50
 
51
 
@@ -65,15 +67,13 @@ def _():
65
 
66
  @app.cell(hide_code=True)
67
  def _(mo):
68
- mo.md(
69
- r"""
70
- Below we explain common operations on sets.
71
 
72
- _**Try it!** Try modifying the definitions of `A` and `B` above, and see how the results change below._
73
 
74
- The **union** $A \cup B$ of sets $A$ and $B$ is the set of elements in $A$, $B$, or both.
75
- """
76
- )
77
  return
78
 
79
 
@@ -85,7 +85,9 @@ def _(A, B):
85
 
86
  @app.cell(hide_code=True)
87
  def _(mo):
88
- mo.md(r"""The **intersection** $A \cap B$ is the set of elements in both $A$ and $B$""")
 
 
89
  return
90
 
91
 
@@ -97,7 +99,9 @@ def _(A, B):
97
 
98
  @app.cell(hide_code=True)
99
  def _(mo):
100
- mo.md(r"""The **difference** $A \setminus B$ is the set of elements in $A$ that are not in $B$.""")
 
 
101
  return
102
 
103
 
@@ -109,13 +113,11 @@ def _(A, B):
109
 
110
  @app.cell(hide_code=True)
111
  def _(mo):
112
- mo.md(
113
- """
114
- ### 🎬 An interactive example
115
 
116
- Here's a simple example that classifies TV shows into sets by genre, and uses these sets to recommend shows to a user based on their preferences.
117
- """
118
- )
119
  return
120
 
121
 
@@ -175,7 +177,7 @@ def _(mo, recommendations, viewer_type):
175
  **Why these shows?**
176
  {explanation[viewer_type.value]}
177
  """)
178
- return explanation, result
179
 
180
 
181
  @app.cell(hide_code=True)
@@ -214,58 +216,54 @@ def _(mo):
214
 
215
  @app.cell(hide_code=True)
216
  def _(mo):
217
- mo.md(
218
- r"""
219
- ## 🧮 Set properties
220
 
221
- Here are some important properties of the set operations:
222
 
223
- 1. **Commutative**: $A \cup B = B \cup A$
224
- 2. **Associative**: $(A \cup B) \cup C = A \cup (B \cup C)$
225
- 3. **Distributive**: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
226
- """
227
- )
228
  return
229
 
230
 
231
  @app.cell(hide_code=True)
232
  def _(mo):
233
- mo.md(
234
- r"""
235
- ## Set builder notation
236
 
237
- To compactly describe the elements in a set, we can use **set builder notation**, which specifies conditions that must be true for elements to be in the set.
238
 
239
- For example, here is how to specify the set of positive numbers less than 10:
240
 
241
- \[
242
- \{x \mid 0 < x < 10 \}
243
- \]
244
 
245
- The predicate to the right of the vertical bar $\mid$ specifies conditions that must be true for an element to be in the set; the expression to the left of $\mid$ specifies the value being included.
246
 
247
- In Python, set builder notation is called a "set comprehension."
248
- """
249
- )
250
  return
251
 
252
 
253
- @app.cell
254
- def _():
255
- def predicate(x):
256
- return x > 0 and x < 10
257
- return (predicate,)
258
 
259
 
260
  @app.cell
261
- def _(predicate):
262
  set(x for x in range(100) if predicate(x))
263
  return
264
 
265
 
266
  @app.cell(hide_code=True)
267
  def _(mo):
268
- mo.md("""**Try it!** Try modifying the `predicate` function above and see how the set changes.""")
 
 
269
  return
270
 
271
 
 
7
 
8
  import marimo
9
 
10
+ __generated_with = "0.18.4"
11
  app = marimo.App()
12
 
13
 
14
  @app.cell(hide_code=True)
15
  def _(mo):
16
+ mo.md(r"""
17
+ # Sets
 
18
 
19
+ Probability is the study of "events", assigning numerical values to how likely
20
+ events are to occur. For example, probability lets us quantify how likely it is for it to rain or shine on a given day.
21
 
22
 
23
+ Typically we reason about _sets_ of events. In mathematics,
24
+ a set is a collection of elements, with no element included more than once.
25
+ Elements can be any kind of object.
26
 
27
+ For example:
28
 
29
+ - ☀️ Weather events: $\{\text{Rain}, \text{Overcast}, \text{Clear}\}$
30
+ - 🎲 Die rolls: $\{1, 2, 3, 4, 5, 6\}$
31
+ - 🪙 Pairs of coin flips = $\{ \text{(Heads, Heads)}, \text{(Heads, Tails)}, \text{(Tails, Tails)}, \text{(Tails, Heads)}\}$
32
 
33
+ Sets are the building blocks of probability, and will arise frequently in our study.
34
+ """)
 
35
  return
36
 
37
 
38
  @app.cell(hide_code=True)
39
  def _(mo):
40
+ mo.md(r"""
41
+ ## Set operations
42
+ """)
43
  return
44
 
45
 
46
  @app.cell(hide_code=True)
47
  def _(mo):
48
+ mo.md(r"""
49
+ In Python, sets are made with the `set` function:
50
+ """)
51
  return
52
 
53
 
 
67
 
68
  @app.cell(hide_code=True)
69
  def _(mo):
70
+ mo.md(r"""
71
+ Below we explain common operations on sets.
 
72
 
73
+ _**Try it!** Try modifying the definitions of `A` and `B` above, and see how the results change below._
74
 
75
+ The **union** $A \cup B$ of sets $A$ and $B$ is the set of elements in $A$, $B$, or both.
76
+ """)
 
77
  return
78
 
79
 
 
85
 
86
  @app.cell(hide_code=True)
87
  def _(mo):
88
+ mo.md(r"""
89
+ The **intersection** $A \cap B$ is the set of elements in both $A$ and $B$.
90
+ """)
91
  return
92
 
93
 
 
99
 
100
  @app.cell(hide_code=True)
101
  def _(mo):
102
+ mo.md(r"""
103
+ The **difference** $A \setminus B$ is the set of elements in $A$ that are not in $B$.
104
+ """)
105
  return
106
 
107
 
 
113
 
114
  @app.cell(hide_code=True)
115
  def _(mo):
116
+ mo.md("""
117
+ ### 🎬 An interactive example
 
118
 
119
+ Here's a simple example that classifies TV shows into sets by genre, and uses these sets to recommend shows to a user based on their preferences.
120
+ """)
 
121
  return
122
 
123
 
 
177
  **Why these shows?**
178
  {explanation[viewer_type.value]}
179
  """)
180
+ return
181
 
182
 
183
  @app.cell(hide_code=True)
 
216
 
217
  @app.cell(hide_code=True)
218
  def _(mo):
219
+ mo.md(r"""
220
+ ## 🧮 Set properties
 
221
 
222
+ Here are some important properties of the set operations:
223
 
224
+ 1. **Commutative**: $A \cup B = B \cup A$
225
+ 2. **Associative**: $(A \cup B) \cup C = A \cup (B \cup C)$
226
+ 3. **Distributive**: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$
227
+ """)
 
228
  return
229
 
230
 
231
  @app.cell(hide_code=True)
232
  def _(mo):
233
+ mo.md(r"""
234
+ ## Set builder notation
 
235
 
236
+ To compactly describe the elements in a set, we can use **set builder notation**, which specifies conditions that must be true for elements to be in the set.
237
 
238
+ For example, here is how to specify the set of positive numbers less than 10:
239
 
240
+ \[
241
+ \{x \mid 0 < x < 10 \}
242
+ \]
243
 
244
+ The predicate to the right of the vertical bar $\mid$ specifies conditions that must be true for an element to be in the set; the expression to the left of $\mid$ specifies the value being included.
245
 
246
+ In Python, set builder notation is called a "set comprehension."
247
+ """)
 
248
  return
249
 
250
 
251
+ @app.function
252
+ def predicate(x):
253
+ return x > 0 and x < 10
 
 
254
 
255
 
256
  @app.cell
257
+ def _():
258
  set(x for x in range(100) if predicate(x))
259
  return
260
 
261
 
262
  @app.cell(hide_code=True)
263
  def _(mo):
264
+ mo.md("""
265
+ **Try it!** Try modifying the `predicate` function above and see how the set changes.
266
+ """)
267
  return
268
 
269
 
probability/02_axioms.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.2"
13
  app = marimo.App(width="medium")
14
 
15
 
@@ -21,49 +21,43 @@ def _():
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
- mo.md(
25
- r"""
26
- # Axioms of Probability
27
 
28
- Probability theory is built on three fundamental axioms, known as the [Kolmogorov axioms](https://en.wikipedia.org/wiki/Probability_axioms). These axioms form
29
- the mathematical foundation for all of probability theory[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/probability).
30
 
31
- Let's explore each axiom and understand why they make intuitive sense:
32
- """
33
- )
34
  return
35
 
36
 
37
  @app.cell(hide_code=True)
38
  def _(mo):
39
- mo.md(
40
- r"""
41
- ## The Three Axioms
42
-
43
- | Axiom | Mathematical Form | Meaning |
44
- |-------|------------------|----------|
45
- | **Axiom 1** | $0 \leq P(E) \leq 1$ | All probabilities are between 0 and 1 |
46
- | **Axiom 2** | $P(S) = 1$ | The probability of the sample space is 1 |
47
- | **Axiom 3** | $P(E \cup F) = P(E) + P(F)$ | For mutually exclusive events, probabilities add |
48
-
49
- where the set $S$ is the sample space (all possible outcomes), and $E$ and $F$ are sets that represent events. The notation $P(E)$ denotes the probability of $E$, which you can interpret as the chance that something happens. $P(E) = 0$ means that the event cannot happen, while $P(E) = 1$ means the event will happen no matter what; $P(E) = 0.5$ means that $E$ has a 50% chance of happening.
50
-
51
- For an example, when rolling a fair six-sided die once, the sample space $S$ is the set of die faces ${1, 2, 3, 4, 5, 6}$, and there are many possible events; we'll see some examples below.
52
- """
53
- )
54
  return
55
 
56
 
57
  @app.cell(hide_code=True)
58
  def _(mo):
59
- mo.md(
60
- r"""
61
- ## Understanding Through Examples
62
 
63
- Let's explore these axioms using a simple experiment: rolling a fair six-sided die.
64
- We'll use this to demonstrate why each axiom makes intuitive sense.
65
- """
66
- )
67
  return
68
 
69
 
@@ -144,62 +138,58 @@ def _(event, mo, np, plt):
144
  """)
145
 
146
  mo.hstack([plt.gcf(), explanation])
147
- return ax, colors, dice, event_map, explanation, fig, outcomes, prob
148
 
149
 
150
  @app.cell(hide_code=True)
151
  def _(mo):
152
- mo.md(
153
- r"""
154
- ## Why These Axioms Matter
155
 
156
- These axioms are more than just rules - they provide the foundation for all of probability theory:
157
 
158
- 1. **Non-negativity** (Axiom 1) makes intuitive sense: you can't have a negative number of occurrences
159
- in any experiment.
160
 
161
- 2. **Normalization** (Axiom 2) ensures that something must happen - the total probability must be 1.
162
 
163
- 3. **Additivity** (Axiom 3) lets us build complex probabilities from simple ones, but only for events
164
- that can't happen together (mutually exclusive events).
165
 
166
- From these simple rules, we can derive all the powerful tools of probability theory that are used in
167
- statistics, machine learning, and other fields.
168
- """
169
- )
170
  return
171
 
172
 
173
  @app.cell(hide_code=True)
174
  def _(mo):
175
- mo.md(
176
- r"""
177
- ## 🤔 Test Your Understanding
178
 
179
- Consider rolling two dice. Which of these statements follow from the axioms?
180
 
181
- <details>
182
- <summary>1. P(sum is 13) = 0</summary>
183
 
184
- ✅ Correct! This follows from Axiom 1. Since no combination of dice can sum to 13,
185
- the probability must be non-negative but can be 0.
186
- </details>
187
 
188
- <details>
189
- <summary>2. P(sum is 7) + P(sum is not 7) = 1</summary>
190
 
191
- ✅ Correct! This follows from Axioms 2 and 3. These events are mutually exclusive and cover
192
- the entire sample space.
193
- </details>
194
 
195
- <details>
196
- <summary>3. P(first die is 6 or second die is 6) = P(first die is 6) + P(second die is 6)</summary>
197
 
198
- ❌ Incorrect! This doesn't follow from Axiom 3 because the events are not mutually exclusive -
199
- you could roll (6,6).
200
- </details>
201
- """
202
- )
203
  return
204
 
205
 
 
9
 
10
  import marimo
11
 
12
+ __generated_with = "0.18.4"
13
  app = marimo.App(width="medium")
14
 
15
 
 
21
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
+ mo.md(r"""
25
+ # Axioms of Probability
 
26
 
27
+ Probability theory is built on three fundamental axioms, known as the [Kolmogorov axioms](https://en.wikipedia.org/wiki/Probability_axioms). These axioms form
28
+ the mathematical foundation for all of probability theory[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/probability).
29
 
30
+ Let's explore each axiom and understand why they make intuitive sense:
31
+ """)
 
32
  return
33
 
34
 
35
  @app.cell(hide_code=True)
36
  def _(mo):
37
+ mo.md(r"""
38
+ ## The Three Axioms
39
+
40
+ | Axiom | Mathematical Form | Meaning |
41
+ |-------|------------------|----------|
42
+ | **Axiom 1** | $0 \leq P(E) \leq 1$ | All probabilities are between 0 and 1 |
43
+ | **Axiom 2** | $P(S) = 1$ | The probability of the sample space is 1 |
44
+ | **Axiom 3** | $P(E \cup F) = P(E) + P(F)$ | For mutually exclusive events, probabilities add |
45
+
46
+ where the set $S$ is the sample space (all possible outcomes), and $E$ and $F$ are sets that represent events. The notation $P(E)$ denotes the probability of $E$, which you can interpret as the chance that something happens. $P(E) = 0$ means that the event cannot happen, while $P(E) = 1$ means the event will happen no matter what; $P(E) = 0.5$ means that $E$ has a 50% chance of happening.
47
+
48
+ For example, when rolling a fair six-sided die once, the sample space $S$ is the set of die faces $\{1, 2, 3, 4, 5, 6\}$, and there are many possible events; we'll see some examples below.
49
+ """)
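We can spot-check the axioms for the fair die with exact fractions:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    # For a fair die, every face is equally likely
    return Fraction(len(event & sample_space), len(sample_space))

even, odd = {2, 4, 6}, {1, 3, 5}

assert 0 <= prob(even) <= 1                        # Axiom 1
assert prob(sample_space) == 1                     # Axiom 2
assert prob(even | odd) == prob(even) + prob(odd)  # Axiom 3 (mutually exclusive)
```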
 
 
50
  return
51
 
52
 
53
  @app.cell(hide_code=True)
54
  def _(mo):
55
+ mo.md(r"""
56
+ ## Understanding Through Examples
 
57
 
58
+ Let's explore these axioms using a simple experiment: rolling a fair six-sided die.
59
+ We'll use this to demonstrate why each axiom makes intuitive sense.
60
+ """)
 
61
  return
62
 
63
 
 
138
  """)
139
 
140
  mo.hstack([plt.gcf(), explanation])
141
+ return
142
 
143
 
144
  @app.cell(hide_code=True)
145
  def _(mo):
146
+ mo.md(r"""
147
+ ## Why These Axioms Matter
 
148
 
149
+ These axioms are more than just rules - they provide the foundation for all of probability theory:
150
 
151
+ 1. **Non-negativity** (Axiom 1) makes intuitive sense: an event can't occur a negative number of times
152
+ in any experiment, so its long-run frequency, and hence its probability, can't be negative.
153
 
154
+ 2. **Normalization** (Axiom 2) ensures that something must happen - the total probability must be 1.
155
 
156
+ 3. **Additivity** (Axiom 3) lets us build complex probabilities from simple ones, but only for events
157
+ that can't happen together (mutually exclusive events).
158
 
159
+ From these simple rules, we can derive all the powerful tools of probability theory that are used in
160
+ statistics, machine learning, and other fields.
161
+ """)
 
162
  return
163
 
164
 
165
  @app.cell(hide_code=True)
166
  def _(mo):
167
+ mo.md(r"""
168
+ ## 🤔 Test Your Understanding
 
169
 
170
+ Consider rolling two dice. Which of these statements follow from the axioms?
171
 
172
+ <details>
173
+ <summary>1. P(sum is 13) = 0</summary>
174
 
175
+ ✅ Correct! Since no combination of two dice can sum to 13, the event is impossible and
176
+ its probability is 0, the smallest value Axiom 1 allows.
177
+ </details>
178
 
179
+ <details>
180
+ <summary>2. P(sum is 7) + P(sum is not 7) = 1</summary>
181
 
182
+ ✅ Correct! This follows from Axioms 2 and 3. These events are mutually exclusive and cover
183
+ the entire sample space.
184
+ </details>
185
 
186
+ <details>
187
+ <summary>3. P(first die is 6 or second die is 6) = P(first die is 6) + P(second die is 6)</summary>
188
 
189
+ ❌ Incorrect! This doesn't follow from Axiom 3 because the events are not mutually exclusive -
190
+ you could roll (6,6).
191
+ </details>
192
+ """)
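Each quiz answer above can be verified by brute-force enumeration over the 36 equally likely outcomes of two fair dice; a quick sketch (plain Python, separate from the notebook's cells):

```python
from fractions import Fraction

# All 36 equally likely outcomes for two fair dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def prob(predicate):
    """P(event) by counting outcomes that satisfy the predicate."""
    return Fraction(sum(1 for o in outcomes if predicate(o)), len(outcomes))

# 1. P(sum is 13) = 0: no pair of faces sums to 13.
assert prob(lambda o: sum(o) == 13) == 0

# 2. P(sum is 7) + P(sum is not 7) = 1 (Axioms 2 and 3).
assert prob(lambda o: sum(o) == 7) + prob(lambda o: sum(o) != 7) == 1

# 3. "First die is 6" and "second die is 6" overlap at (6, 6),
#    so simply adding their probabilities overcounts the union.
p_union = prob(lambda o: o[0] == 6 or o[1] == 6)
assert p_union == Fraction(11, 36)  # not 12/36
assert p_union < prob(lambda o: o[0] == 6) + prob(lambda o: o[1] == 6)
```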
 
193
  return
194
 
195
 
probability/03_probability_of_or.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.2"
13
  app = marimo.App(width="medium")
14
 
15
 
@@ -24,43 +24,39 @@ def _():
24
  import matplotlib.pyplot as plt
25
  from matplotlib_venn import venn2
26
  import numpy as np
27
- return np, plt, venn2
28
 
29
 
30
  @app.cell(hide_code=True)
31
  def _(mo):
32
- mo.md(
33
- r"""
34
- # Probability of Or
35
 
36
- When calculating the probability of either one event _or_ another occurring, we need to be careful about how we combine probabilities. The method depends on whether the events can happen together[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/prob_or/).
37
 
38
- Let's explore how to calculate $P(E \cup F)$, i.e. $P(E \text{ or } F)$, in different scenarios.
39
- """
40
- )
41
  return
42
 
43
 
44
  @app.cell(hide_code=True)
45
  def _(mo):
46
- mo.md(
47
- r"""
48
- ## Mutually Exclusive Events
49
 
50
- Two events $E$ and $F$ are **mutually exclusive** if they cannot occur simultaneously.
51
- In set notation, this means:
52
 
53
- $E \cap F = \emptyset$
54
 
55
- For example:
56
 
57
- - Rolling an even number (2,4,6) vs rolling an odd number (1,3,5)
58
- - Drawing a heart vs drawing a spade from a deck
59
- - Passing vs failing a test
60
 
61
- Here's a Python function to check if two sets of outcomes are mutually exclusive:
62
- """
63
- )
64
  return
65
 
66
 
@@ -90,21 +86,19 @@ def _(are_mutually_exclusive, even_numbers, prime_numbers):
90
 
91
  @app.cell(hide_code=True)
92
  def _(mo):
93
- mo.md(
94
- r"""
95
- ## Or with Mutually Exclusive Events
96
 
97
- For mutually exclusive events, the probability of either event occurring is simply the sum of their individual probabilities:
98
 
99
- $P(E \cup F) = P(E) + P(F)$
100
 
101
- This extends to multiple events. For $n$ mutually exclusive events $E_1, E_2, \ldots, E_n$:
102
 
103
- $P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^n P(E_i)$
104
 
105
- Let's implement this calculation:
106
- """
107
- )
108
  return
109
 
110
 
@@ -121,34 +115,28 @@ def _():
121
  # P(prime) = P(2) + P(3) + P(5)
122
  p_prime_mutually_exclusive = prob_union_mutually_exclusive([1/6, 1/6, 1/6])
123
  print(f"P(rolling a prime number) = {p_prime_mutually_exclusive}")
124
- return (
125
- p_even_mutually_exclusive,
126
- p_prime_mutually_exclusive,
127
- prob_union_mutually_exclusive,
128
- )
129
 
130
 
131
  @app.cell(hide_code=True)
132
  def _(mo):
133
- mo.md(
134
- r"""
135
- ## Or with Non-Mutually Exclusive Events
136
 
137
- When events can occur together, we need to use the **inclusion-exclusion principle**:
138
 
139
- $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
140
 
141
- Why subtract $P(E \cap F)$? Because when we add $P(E)$ and $P(F)$, we count the overlap twice!
142
 
143
- For example, consider calculating $P(\text{prime or even})$ when rolling a die:
144
 
145
- - Prime numbers: {2, 3, 5}
146
- - Even numbers: {2, 4, 6}
147
- - The number 2 is counted twice unless we subtract its probability
148
 
149
- Here's how to implement this calculation:
150
- """
151
- )
152
  return
153
 
154
 
@@ -166,40 +154,34 @@ def _():
166
 
167
  result = prob_union_general(p_prime_general, p_even_general, p_intersection)
168
  print(f"P(prime or even) = {p_prime_general} + {p_even_general} - {p_intersection} = {result}")
169
- return (
170
- p_even_general,
171
- p_intersection,
172
- p_prime_general,
173
- prob_union_general,
174
- result,
175
- )
176
 
177
 
178
  @app.cell(hide_code=True)
179
  def _(mo):
180
- mo.md(
181
- r"""
182
- ### Extension to Three Events
183
 
184
- For three events, the inclusion-exclusion principle becomes:
185
 
186
- $P(E_1 \cup E_2 \cup E_3) = P(E_1) + P(E_2) + P(E_3)$
187
- $- P(E_1 \cap E_2) - P(E_1 \cap E_3) - P(E_2 \cap E_3)$
188
- $+ P(E_1 \cap E_2 \cap E_3)$
189
 
190
- The pattern is:
191
 
192
- 1. Add individual probabilities
193
- 2. Subtract probabilities of pairs
194
- 3. Add probability of triple intersection
195
- """
196
- )
197
  return
198
 
199
 
200
  @app.cell(hide_code=True)
201
  def _(mo):
202
- mo.md(r"""### Interactive example:""")
 
 
203
  return
204
 
205
 
@@ -298,57 +280,53 @@ def _(event_type, mo, plt, venn2):
298
  plt.gcf(),
299
  mo.md(data["explanation"])
300
  ])
301
- return data, events_data, v
302
 
303
 
304
  @app.cell(hide_code=True)
305
  def _(mo):
306
- mo.md(
307
- r"""
308
- ## 🤔 Test Your Understanding
309
 
310
- Consider rolling a six-sided die. Which of these statements are true?
311
 
312
- <details>
313
- <summary>1. P(even or less than 3) = P(even) + P(less than 3)</summary>
314
 
315
- ❌ Incorrect! These events are not mutually exclusive (2 is both even and less than 3).
316
- We need to use the inclusion-exclusion principle.
317
- </details>
318
 
319
- <details>
320
- <summary>2. P(even or greater than 4) = 4/6</summary>
321
 
322
- ✅ Correct! {2,4,6} ∪ {5,6} = {2,4,5,6}, so probability is 4/6.
323
- </details>
324
 
325
- <details>
326
- <summary>3. P(prime or odd) = 5/6</summary>
327
 
328
- ✅ Correct! {2,3,5} ∪ {1,3,5} = {1,2,3,5}, so probability is 5/6.
329
- </details>
330
- """
331
- )
332
  return
333
 
334
 
335
  @app.cell(hide_code=True)
336
  def _(mo):
337
- mo.md(
338
- """
339
- ## Summary
340
 
341
- You've learned:
342
 
343
- - How to identify mutually exclusive events
344
- - The addition rule for mutually exclusive events
345
- - The inclusion-exclusion principle for overlapping events
346
- - How to extend these concepts to multiple events
347
 
348
- In the next lesson, we'll explore **conditional probability** - how the probability
349
- of one event changes when we know another event has occurred.
350
- """
351
- )
352
  return
353
 
354
 
 
9
 
10
  import marimo
11
 
12
+ __generated_with = "0.18.4"
13
  app = marimo.App(width="medium")
14
 
15
 
 
24
  import matplotlib.pyplot as plt
25
  from matplotlib_venn import venn2
26
  import numpy as np
27
+ return plt, venn2
28
 
29
 
30
  @app.cell(hide_code=True)
31
  def _(mo):
32
+ mo.md(r"""
33
+ # Probability of Or
 
34
 
35
+ When calculating the probability of either one event _or_ another occurring, we need to be careful about how we combine probabilities. The method depends on whether the events can happen together[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/prob_or/).
36
 
37
+ Let's explore how to calculate $P(E \cup F)$, i.e. $P(E \text{ or } F)$, in different scenarios.
38
+ """)
 
39
  return
40
 
41
 
42
  @app.cell(hide_code=True)
43
  def _(mo):
44
+ mo.md(r"""
45
+ ## Mutually Exclusive Events
 
46
 
47
+ Two events $E$ and $F$ are **mutually exclusive** if they cannot occur simultaneously.
48
+ In set notation, this means:
49
 
50
+ $E \cap F = \emptyset$
51
 
52
+ For example:
53
 
54
+ - Rolling an even number (2,4,6) vs rolling an odd number (1,3,5)
55
+ - Drawing a heart vs drawing a spade from a deck
56
+ - Passing vs failing a test
57
 
58
+ Here's a Python function to check if two sets of outcomes are mutually exclusive:
59
+ """)
 
60
  return
61
 
62
 
 
86
 
87
  @app.cell(hide_code=True)
88
  def _(mo):
89
+ mo.md(r"""
90
+ ## Or with Mutually Exclusive Events
 
91
 
92
+ For mutually exclusive events, the probability of either event occurring is simply the sum of their individual probabilities:
93
 
94
+ $P(E \cup F) = P(E) + P(F)$
95
 
96
+ This extends to multiple events. For $n$ mutually exclusive events $E_1, E_2, \ldots, E_n$:
97
 
98
+ $P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i=1}^n P(E_i)$
99
 
100
+ Let's implement this calculation:
101
+ """)
 
102
  return
103
 
104
 
 
115
  # P(prime) = P(2) + P(3) + P(5)
116
  p_prime_mutually_exclusive = prob_union_mutually_exclusive([1/6, 1/6, 1/6])
117
  print(f"P(rolling a prime number) = {p_prime_mutually_exclusive}")
118
+ return
 
 
 
 
119
 
120
 
121
  @app.cell(hide_code=True)
122
  def _(mo):
123
+ mo.md(r"""
124
+ ## Or with Non-Mutually Exclusive Events
 
125
 
126
+ When events can occur together, we need to use the **inclusion-exclusion principle**:
127
 
128
+ $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
129
 
130
+ Why subtract $P(E \cap F)$? Because when we add $P(E)$ and $P(F)$, we count the overlap twice!
131
 
132
+ For example, consider calculating $P(\text{prime or even})$ when rolling a die:
133
 
134
+ - Prime numbers: {2, 3, 5}
135
+ - Even numbers: {2, 4, 6}
136
+ - The number 2 is counted twice unless we subtract its probability
137
 
138
+ Here's how to implement this calculation:
139
+ """)
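For the die example, inclusion-exclusion can be cross-checked against a direct count of the union; a small sketch (plain Python, separate from the notebook's cells):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
primes, evens = {2, 3, 5}, {2, 4, 6}

def prob(event):
    """P(E) for a fair die, by counting outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

# Inclusion-exclusion: P(E ∪ F) = P(E) + P(F) - P(E ∩ F)
lhs = prob(primes | evens)  # direct count: {2,3,4,5,6} -> 5/6
rhs = prob(primes) + prob(evens) - prob(primes & evens)
assert lhs == rhs == Fraction(5, 6)
```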
 
140
  return
141
 
142
 
 
154
 
155
  result = prob_union_general(p_prime_general, p_even_general, p_intersection)
156
  print(f"P(prime or even) = {p_prime_general} + {p_even_general} - {p_intersection} = {result}")
157
+ return
 
 
 
 
 
 
158
 
159
 
160
  @app.cell(hide_code=True)
161
  def _(mo):
162
+ mo.md(r"""
163
+ ### Extension to Three Events
 
164
 
165
+ For three events, the inclusion-exclusion principle becomes:
166
 
167
+ $P(E_1 \cup E_2 \cup E_3) = P(E_1) + P(E_2) + P(E_3)$
168
+ $- P(E_1 \cap E_2) - P(E_1 \cap E_3) - P(E_2 \cap E_3)$
169
+ $+ P(E_1 \cap E_2 \cap E_3)$
170
 
171
+ The pattern is:
172
 
173
+ 1. Add individual probabilities
174
+ 2. Subtract probabilities of pairs
175
+ 3. Add probability of triple intersection
176
+ """)
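The notebook doesn't implement the three-event case, but the pattern above translates directly into code; a minimal sketch with a die example (the function name is illustrative):

```python
from fractions import Fraction

def prob_union_three(p1, p2, p3, p12, p13, p23, p123):
    """P(E1 ∪ E2 ∪ E3) by inclusion-exclusion."""
    return (p1 + p2 + p3) - (p12 + p13 + p23) + p123

# Die example: E1 = even {2,4,6}, E2 = prime {2,3,5}, E3 = greater than 4 {5,6}
f = Fraction
p = prob_union_three(
    f(3, 6), f(3, 6), f(2, 6),  # individual probabilities
    f(1, 6), f(1, 6), f(1, 6),  # pairwise intersections: {2}, {6}, {5}
    f(0, 6),                    # triple intersection is empty
)
assert p == Fraction(5, 6)  # the union is {2, 3, 4, 5, 6}
```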
 
177
  return
178
 
179
 
180
  @app.cell(hide_code=True)
181
  def _(mo):
182
+ mo.md(r"""
183
+ ### Interactive example:
184
+ """)
185
  return
186
 
187
 
 
280
  plt.gcf(),
281
  mo.md(data["explanation"])
282
  ])
283
+ return
284
 
285
 
286
  @app.cell(hide_code=True)
287
  def _(mo):
288
+ mo.md(r"""
289
+ ## 🤔 Test Your Understanding
 
290
 
291
+ Consider rolling a six-sided die. Which of these statements are true?
292
 
293
+ <details>
294
+ <summary>1. P(even or less than 3) = P(even) + P(less than 3)</summary>
295
 
296
+ ❌ Incorrect! These events are not mutually exclusive (2 is both even and less than 3).
297
+ We need to use the inclusion-exclusion principle.
298
+ </details>
299
 
300
+ <details>
301
+ <summary>2. P(even or greater than 4) = 4/6</summary>
302
 
303
+ ✅ Correct! {2,4,6} ∪ {5,6} = {2,4,5,6}, so probability is 4/6.
304
+ </details>
305
 
306
+ <details>
307
+ <summary>3. P(prime or odd) = 5/6</summary>
308
 
309
+ ✅ Correct! {2,3,5} ∪ {1,3,5} = {1,2,3,5}, so probability is 5/6.
310
+ </details>
311
+ """)
 
312
  return
313
 
314
 
315
  @app.cell(hide_code=True)
316
  def _(mo):
317
+ mo.md("""
318
+ ## Summary
 
319
 
320
+ You've learned:
321
 
322
+ - How to identify mutually exclusive events
323
+ - The addition rule for mutually exclusive events
324
+ - The inclusion-exclusion principle for overlapping events
325
+ - How to extend these concepts to multiple events
326
 
327
+ In the next lesson, we'll explore **conditional probability** - how the probability
328
+ of one event changes when we know another event has occurred.
329
+ """)
 
330
  return
331
 
332
 
probability/04_conditional_probability.py CHANGED
@@ -10,7 +10,7 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.11.4"
14
  app = marimo.App(width="medium", app_title="Conditional Probability")
15
 
16
 
@@ -22,42 +22,38 @@ def _():
22
 
23
  @app.cell(hide_code=True)
24
  def _(mo):
25
- mo.md(
26
- r"""
27
- # Conditional Probability
28
 
29
- _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/cond_prob/), by Stanford professor Chris Piech._
30
 
31
- In probability theory, we often want to update our beliefs when we receive new information.
32
- Conditional probability helps us formalize this process by calculating "_what is the chance of
33
- event $E$ happening given that we have already observed some other event $F$?_"[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/cond_prob/)
34
 
35
- When we condition on an event $F$:
36
 
37
- - We enter the universe where $F$ has occurred
38
- - Only outcomes consistent with $F$ are possible
39
- - Our sample space reduces to $F$
40
- """
41
- )
42
  return
43
 
44
 
45
  @app.cell(hide_code=True)
46
  def _(mo):
47
- mo.md(
48
- r"""
49
- ## Definition of Conditional Probability
50
 
51
- The probability of event $E$ given that event $F$ has occurred is denoted as $P(E \mid F)$ and is defined as:
52
 
53
- $$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
54
 
55
- This formula tells us that the conditional probability is the probability of both events occurring
56
- divided by the probability of the conditioning event.
57
 
58
- Let's start with a visual example.
59
- """
60
- )
61
  return
62
 
63
 
@@ -66,7 +62,7 @@ def _():
66
  import matplotlib.pyplot as plt
67
  from matplotlib_venn import venn3
68
  import numpy as np
69
- return np, plt, venn3
70
 
71
 
72
  @app.cell(hide_code=True)
@@ -138,75 +134,73 @@ def _(mo, plt, venn3):
138
  """)
139
 
140
  mo.vstack([mo.center(plt.gcf()), explanation])
141
- return explanation, id, rect, v
142
 
143
 
144
  @app.cell(hide_code=True)
145
  def _(mo):
146
- mo.md(
147
- r"Next, here's a function that computes $P(E \mid F)$, given $P( E \cap F)$ and $P(F)$"
148
- )
149
  return
150
 
151
 
152
- @app.cell
153
- def _():
154
- def conditional_probability(p_intersection, p_condition):
155
- if p_condition == 0:
156
- raise ValueError("Cannot condition on an impossible event")
157
- if p_intersection > p_condition:
158
- raise ValueError("P(E∩F) cannot be greater than P(F)")
159
 
160
- return p_intersection / p_condition
161
- return (conditional_probability,)
162
 
163
 
164
  @app.cell
165
- def _(conditional_probability):
166
  # Example 1: Rolling a die
167
  # E: Rolling an even number (2,4,6)
168
  # F: Rolling a number greater than 3 (4,5,6)
169
  p_even_given_greater_than_3 = conditional_probability(2 / 6, 3 / 6)
170
  print("Example 1: Rolling a die")
171
  print(f"P(Even | >3) = {p_even_given_greater_than_3}") # Should be 2/3
172
- return (p_even_given_greater_than_3,)
173
 
174
 
175
  @app.cell
176
- def _(conditional_probability):
177
  # Example 2: Cards
178
  # E: Drawing a Heart
179
  # F: Drawing a Face card (J,Q,K)
180
  p_heart_given_face = conditional_probability(3 / 52, 12 / 52)
181
  print("\nExample 2: Drawing cards")
182
  print(f"P(Heart | Face card) = {p_heart_given_face}") # Should be 1/4
183
- return (p_heart_given_face,)
184
 
185
 
186
  @app.cell
187
- def _(conditional_probability):
188
  # Example 3: Student grades
189
  # E: Getting an A
190
  # F: Studying more than 3 hours
191
  p_a_given_study = conditional_probability(0.24, 0.40)
192
  print("\nExample 3: Student grades")
193
  print(f"P(A | Studied >3hrs) = {p_a_given_study}") # Should be 0.6
194
- return (p_a_given_study,)
195
 
196
 
197
  @app.cell
198
- def _(conditional_probability):
199
  # Example 4: Weather
200
  # E: Raining
201
  # F: Cloudy
202
  p_rain_given_cloudy = conditional_probability(0.15, 0.30)
203
  print("\nExample 4: Weather")
204
  print(f"P(Rain | Cloudy) = {p_rain_given_cloudy}") # Should be 0.5
205
- return (p_rain_given_cloudy,)
206
 
207
 
208
  @app.cell
209
- def _(conditional_probability):
210
  # Example 5: Error cases
211
  print("\nExample 5: Error cases")
212
  try:
@@ -225,72 +219,66 @@ def _(conditional_probability):
225
 
226
  @app.cell(hide_code=True)
227
  def _(mo):
228
- mo.md(
229
- r"""
230
- ## The Conditional Paradigm
231
 
232
- When we condition on an event, we enter a new probability universe. In this universe:
233
 
234
- 1. All probability axioms still hold
235
- 2. We must consistently condition on the same event
236
- 3. Our sample space becomes the conditioning event
237
 
238
- Here's how our familiar probability rules look when conditioned on event $G$:
239
 
240
- | Rule | Original | Conditioned on $G$ |
241
- |------|----------|-------------------|
242
- | Axiom 1 | $0 \leq P(E) \leq 1$ | $0 \leq P(E \mid G) \leq 1$ |
243
- | Axiom 2 | $P(S) = 1$ | $P(S \mid G) = 1$ |
244
- | Axiom 3* | $P(E \cup F) = P(E) + P(F)$ | $P(E \cup F \mid G) = P(E \mid G) + P(F \mid G)$ |
245
- | Complement | $P(E^C) = 1 - P(E)$ | $P(E^C \mid G) = 1 - P(E \mid G)$ |
246
 
247
- *_For mutually exclusive events_
248
- """
249
- )
250
  return
251
 
252
 
253
  @app.cell(hide_code=True)
254
  def _(mo):
255
- mo.md(
256
- r"""
257
- ## Multiple Conditions
258
 
259
- We can condition on multiple events. The notation $P(E \mid F,G)$ means "_the probability of $E$
260
- occurring, given that both $F$ and $G$ have occurred._"
261
 
262
- The conditional probability formula still holds in the universe where $G$ has occurred:
263
 
264
- $$P(E \mid F,G) = \frac{P(E \cap F \mid G)}{P(F \mid G)}$$
265
 
266
- This is a powerful extension that allows us to update our probabilities as we receive
267
- multiple pieces of information.
268
- """
269
- )
270
  return
271
 
272
 
273
- @app.cell
274
- def _():
275
- def multiple_conditional_probability(
276
- p_intersection_all, p_intersection_conditions, p_condition
277
- ):
278
- """Calculate P(E|F,G) = P(E∩F|G)/P(F|G) = P(E∩F∩G)/P(F∩G)"""
279
- if p_condition == 0:
280
- raise ValueError("Cannot condition on an impossible event")
281
- if p_intersection_conditions == 0:
282
- raise ValueError(
283
- "Cannot condition on an impossible combination of events"
284
- )
285
- if p_intersection_all > p_intersection_conditions:
286
- raise ValueError("P(E∩F∩G) cannot be greater than P(F∩G)")
287
-
288
- return p_intersection_all / p_intersection_conditions
289
- return (multiple_conditional_probability,)
290
 
291
 
292
  @app.cell
293
- def _(multiple_conditional_probability):
294
  # Example: College admissions
295
  # E: Getting admitted
296
  # F: High GPA
@@ -310,58 +298,54 @@ def _(multiple_conditional_probability):
310
  multiple_conditional_probability(0.3, 0.2, 0.2)
311
  except ValueError as e:
312
  print(f"\nError case: {e}")
313
- return (p_admit_given_both,)
314
 
315
 
316
  @app.cell(hide_code=True)
317
  def _(mo):
318
- mo.md(
319
- r"""
320
- ## 🤔 Test Your Understanding
321
-
322
- Which of these statements about conditional probability are true?
323
-
324
- <details>
325
- <summary>Knowing F occurred always decreases the probability of E</summary>
326
- ❌ False! Conditioning on F can either increase or decrease P(E), depending on how E and F are related.
327
- </details>
328
-
329
- <details>
330
- <summary>P(E|F) represents entering a new probability universe where F has occurred</summary>
331
- ✅ True! We restrict ourselves to only the outcomes where F occurred, making F our new sample space.
332
- </details>
333
-
334
- <details>
335
- <summary>If P(E|F) = P(E), then E and F must be the same event</summary>
336
- ❌ False! This actually means E and F are independent - knowing one doesn't affect the other.
337
- </details>
338
-
339
- <details>
340
- <summary>P(E|F) can be calculated by dividing P(E∩F) by P(F)</summary>
341
- ✅ True! This is the fundamental definition of conditional probability.
342
- </details>
343
- """
344
- )
345
  return
346
 
347
 
348
  @app.cell(hide_code=True)
349
  def _(mo):
350
- mo.md(
351
- """
352
- ## Summary
353
 
354
- You've learned:
355
 
356
- - How conditional probability updates our beliefs with new information
357
- - The formula $P(E \mid F) = P(E \cap F)/P(F)$ and its intuition
358
- - How probability rules work in conditional universes
359
- - How to handle multiple conditions
360
 
361
- In the next lesson, we'll explore **independence** - when knowing about one event
362
- tells us nothing about another.
363
- """
364
- )
365
  return
366
 
367
 
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App(width="medium", app_title="Conditional Probability")
15
 
16
 
 
22
 
23
  @app.cell(hide_code=True)
24
  def _(mo):
25
+ mo.md(r"""
26
+ # Conditional Probability
 
27
 
28
+ _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/cond_prob/), by Stanford professor Chris Piech._
29
 
30
+ In probability theory, we often want to update our beliefs when we receive new information.
31
+ Conditional probability helps us formalize this process by calculating "_what is the chance of
32
+ event $E$ happening given that we have already observed some other event $F$?_"[<sup>1</sup>](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/cond_prob/)
33
 
34
+ When we condition on an event $F$:
35
 
36
+ - We enter the universe where $F$ has occurred
37
+ - Only outcomes consistent with $F$ are possible
38
+ - Our sample space reduces to $F$
39
+ """)
 
40
  return
41
 
42
 
43
  @app.cell(hide_code=True)
44
  def _(mo):
45
+ mo.md(r"""
46
+ ## Definition of Conditional Probability
 
47
 
48
+ The probability of event $E$ given that event $F$ has occurred is denoted as $P(E \mid F)$ and is defined as:
49
 
50
+ $$P(E \mid F) = \frac{P(E \cap F)}{P(F)}$$
51
 
52
+ This formula tells us that the conditional probability is the probability of both events occurring
53
+ divided by the probability of the conditioning event.
54
 
55
+ Let's start with a visual example.
56
+ """)
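Before the visual example, the definition can be sanity-checked by direct counting on a die; a minimal sketch (plain Python, separate from the notebook's cells):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}  # rolling an even number
F = {4, 5, 6}  # rolling a number greater than 3

def prob(event):
    return Fraction(len(event), len(sample_space))

# P(E | F) = P(E ∩ F) / P(F): restrict to the world where F happened.
p_e_given_f = prob(E & F) / prob(F)
assert p_e_given_f == Fraction(2, 3)

# Equivalently: count E's outcomes inside the reduced sample space F.
assert p_e_given_f == Fraction(len(E & F), len(F))
```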
 
57
  return
58
 
59
 
 
62
  import matplotlib.pyplot as plt
63
  from matplotlib_venn import venn3
64
  import numpy as np
65
+ return plt, venn3
66
 
67
 
68
  @app.cell(hide_code=True)
 
134
  """)
135
 
136
  mo.vstack([mo.center(plt.gcf()), explanation])
137
+ return
138
 
139
 
140
  @app.cell(hide_code=True)
141
  def _(mo):
142
+ mo.md(r"""
143
+ Next, here's a function that computes $P(E \mid F)$, given $P( E \cap F)$ and $P(F)$
144
+ """)
145
  return
146
 
147
 
148
+ @app.function
149
+ def conditional_probability(p_intersection, p_condition):
150
+ if p_condition == 0:
151
+ raise ValueError("Cannot condition on an impossible event")
152
+ if p_intersection > p_condition:
153
+ raise ValueError("P(E∩F) cannot be greater than P(F)")
 
154
 
155
+ return p_intersection / p_condition
 
156
 
157
 
158
  @app.cell
159
+ def _():
160
  # Example 1: Rolling a die
161
  # E: Rolling an even number (2,4,6)
162
  # F: Rolling a number greater than 3 (4,5,6)
163
  p_even_given_greater_than_3 = conditional_probability(2 / 6, 3 / 6)
164
  print("Example 1: Rolling a die")
165
  print(f"P(Even | >3) = {p_even_given_greater_than_3}") # Should be 2/3
166
+ return
167
 
168
 
169
  @app.cell
170
+ def _():
171
  # Example 2: Cards
172
  # E: Drawing a Heart
173
  # F: Drawing a Face card (J,Q,K)
174
  p_heart_given_face = conditional_probability(3 / 52, 12 / 52)
175
  print("\nExample 2: Drawing cards")
176
  print(f"P(Heart | Face card) = {p_heart_given_face}") # Should be 1/4
177
+ return
178
 
179
 
180
  @app.cell
181
+ def _():
182
  # Example 3: Student grades
183
  # E: Getting an A
184
  # F: Studying more than 3 hours
185
  p_a_given_study = conditional_probability(0.24, 0.40)
186
  print("\nExample 3: Student grades")
187
  print(f"P(A | Studied >3hrs) = {p_a_given_study}") # Should be 0.6
188
+ return
189
 
190
 
191
  @app.cell
192
+ def _():
193
  # Example 4: Weather
194
  # E: Raining
195
  # F: Cloudy
196
  p_rain_given_cloudy = conditional_probability(0.15, 0.30)
197
  print("\nExample 4: Weather")
198
  print(f"P(Rain | Cloudy) = {p_rain_given_cloudy}") # Should be 0.5
199
+ return
200
 
201
 
202
  @app.cell
203
+ def _():
204
  # Example 5: Error cases
205
  print("\nExample 5: Error cases")
206
  try:
 
219
 
220
  @app.cell(hide_code=True)
221
  def _(mo):
222
+ mo.md(r"""
223
+ ## The Conditional Paradigm
 
224
 
225
+ When we condition on an event, we enter a new probability universe. In this universe:
226
 
227
+ 1. All probability axioms still hold
228
+ 2. We must consistently condition on the same event
229
+ 3. Our sample space becomes the conditioning event
230
 
231
+ Here's how our familiar probability rules look when conditioned on event $G$:
232
 
233
+ | Rule | Original | Conditioned on $G$ |
234
+ |------|----------|-------------------|
235
+ | Axiom 1 | $0 \leq P(E) \leq 1$ | $0 \leq P(E \mid G) \leq 1$ |
236
+ | Axiom 2 | $P(S) = 1$ | $P(S \mid G) = 1$ |
237
+ | Axiom 3* | $P(E \cup F) = P(E) + P(F)$ | $P(E \cup F \mid G) = P(E \mid G) + P(F \mid G)$ |
238
+ | Complement | $P(E^C) = 1 - P(E)$ | $P(E^C \mid G) = 1 - P(E \mid G)$ |
239
 
240
+ *_For mutually exclusive events_
241
+ """)
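The conditioned rules in the table can be verified numerically inside the universe where $G$ has occurred; here is a sketch checking the complement rule and conditioned additivity on a die (event choices and names are illustrative):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
G = {4, 5, 6}  # conditioning event: roll greater than 3
E = {4, 6}     # even and greater than 3
F = {5}        # mutually exclusive with E

def cond_prob(event, given):
    """P(event | given) by counting within the reduced sample space."""
    return Fraction(len(event & given), len(given))

# Complement rule still holds after conditioning: P(E^C | G) = 1 - P(E | G)
E_complement = sample_space - E
assert cond_prob(E_complement, G) == 1 - cond_prob(E, G)

# Axiom 3 still holds for mutually exclusive E and F:
assert cond_prob(E | F, G) == cond_prob(E, G) + cond_prob(F, G)
```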
 
242
  return
243
 
244
 
245
  @app.cell(hide_code=True)
246
  def _(mo):
247
+ mo.md(r"""
248
+ ## Multiple Conditions
 
249
 
250
+ We can condition on multiple events. The notation $P(E \mid F,G)$ means "_the probability of $E$
251
+ occurring, given that both $F$ and $G$ have occurred._"
252
 
253
+ The conditional probability formula still holds in the universe where $G$ has occurred:
254
 
255
+ $$P(E \mid F,G) = \frac{P(E \cap F \mid G)}{P(F \mid G)}$$
256
 
257
+ This is a powerful extension that allows us to update our probabilities as we receive
258
+ multiple pieces of information.
259
+ """)
 
260
  return
261
 
262
 
263
+ @app.function
264
+ def multiple_conditional_probability(
265
+ p_intersection_all, p_intersection_conditions, p_condition
266
+ ):
267
+ """Calculate P(E|F,G) = P(E∩F|G)/P(F|G) = P(E∩F∩G)/P(F∩G)"""
268
+ if p_condition == 0:
269
+ raise ValueError("Cannot condition on an impossible event")
270
+ if p_intersection_conditions == 0:
271
+ raise ValueError(
272
+ "Cannot condition on an impossible combination of events"
273
+ )
274
+ if p_intersection_all > p_intersection_conditions:
275
+ raise ValueError("P(E∩F∩G) cannot be greater than P(F∩G)")
276
+
277
+ return p_intersection_all / p_intersection_conditions
 
 
278
 
279
 
280
  @app.cell
281
+ def _():
282
  # Example: College admissions
283
  # E: Getting admitted
284
  # F: High GPA
 
298
  multiple_conditional_probability(0.3, 0.2, 0.2)
299
  except ValueError as e:
300
  print(f"\nError case: {e}")
301
+ return
302
 
303
 
304
  @app.cell(hide_code=True)
305
  def _(mo):
306
+ mo.md(r"""
307
+ ## 🤔 Test Your Understanding
308
+
309
+ Which of these statements about conditional probability are true?
310
+
311
+ <details>
312
+ <summary>Knowing F occurred always decreases the probability of E</summary>
313
+ False! Conditioning on F can either increase or decrease P(E), depending on how E and F are related.
314
+ </details>
315
+
316
+ <details>
317
+ <summary>P(E|F) represents entering a new probability universe where F has occurred</summary>
318
+ True! We restrict ourselves to only the outcomes where F occurred, making F our new sample space.
319
+ </details>
320
+
321
+ <details>
322
+ <summary>If P(E|F) = P(E), then E and F must be the same event</summary>
323
+ False! This actually means E and F are independent - knowing one doesn't affect the other.
324
+ </details>
325
+
326
+ <details>
327
+ <summary>P(E|F) can be calculated by dividing P(E∩F) by P(F)</summary>
328
+ True! This is the fundamental definition of conditional probability (provided P(F) > 0).
329
+ </details>
330
+ """)
 
 
331
  return
332
 
333
 
334
  @app.cell(hide_code=True)
335
  def _(mo):
336
+ mo.md(r"""
337
+ ## Summary
 
338
 
339
+ You've learned:
340
 
341
+ - How conditional probability updates our beliefs with new information
342
+ - The formula $P(E \mid F) = P(E \cap F)/P(F)$ and its intuition
343
+ - How probability rules work in conditional universes
344
+ - How to handle multiple conditions
345
 
346
+ In the next lesson, we'll explore **independence** - when knowing about one event
347
+ tells us nothing about another.
348
+ """)
 
349
  return
350
 
351
 
probability/05_independence.py CHANGED
@@ -7,7 +7,7 @@
7
 
8
  import marimo
9
 
10
- __generated_with = "0.11.4"
11
  app = marimo.App()
12
 
13
 
@@ -19,88 +19,84 @@ def _():
19
 
20
  @app.cell(hide_code=True)
21
  def _(mo):
22
- mo.md(
23
- """
24
- # Independence in Probability Theory
25
 
26
- _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/independence/), by Stanford professor Chris Piech._
27
 
28
- In probability theory, independence is a fundamental concept that helps us understand
29
- when events don't influence each other. Two events are independent if knowing the
30
- outcome of one event doesn't change our belief about the other event occurring.
31
 
32
- ## Definition of Independence
33
 
34
- Two events $E$ and $F$ are independent if:
35
 
36
- $$P(E|F) = P(E)$$
37
 
38
- This means that knowing $F$ occurred doesn't change the probability of $E$ occurring.
39
 
40
- ### _Alternative Definition_
41
 
42
- Using the chain rule, we can derive another equivalent definition:
43
 
44
- $$P(E \cap F) = P(E) \cdot P(F)$$
45
- """
46
- )
47
  return
48
 
49
 
50
  @app.cell(hide_code=True)
51
  def _(mo):
52
- mo.md(
53
- r"""
54
- ## Independence is Symmetric
55
 
56
- This property is symmetric: if $E$ is independent of $F$, then $F$ is independent of $E$.
57
- We can prove this using Bayes' Theorem:
58
 
59
- \[P(E|F) = \frac{P(F|E)P(E)}{P(F)}\]
60
 
61
- \[= \frac{P(F)P(E)}{P(F)}\]
62
 
63
- \[= P(E)\]
64
 
65
- ## Independence and Complements
66
 
67
- Given independent events $A$ and $B$, we can prove that $A$ and $B^C$ are also independent:
68
 
69
 
70
- \[P(AB^C) = P(A) - P(AB)\]
71
 
72
- \[= P(A) - P(A)P(B)\]
73
 
74
- \[= P(A)(1 - P(B))\]
75
 
76
- \[= P(A)P(B^C)\]
77
 
78
- ## Generalized Independence
79
 
80
- Events $E_1, E_2, \ldots, E_n$ are independent if for every subset with $r$ elements (where $r \leq n$):
81
 
82
- \[P(E_1, E_2, \ldots, E_r) = \prod_{i=1}^r P(E_i)\]
83
 
84
- For example, consider getting 5 heads on 5 coin flips. Let $H_i$ be the event that the $i$th flip is heads:
85
 
86
 
87
- \[P(H_1, H_2, H_3, H_4, H_5) = P(H_1)P(H_2)P(H_3)P(H_4)P(H_5)\]
88
 
89
- \[= \prod_{i=1}^5 P(H_i)\]
90
 
91
- \[= \left(\frac{1}{2}\right)^5 = 0.03125\]
92
 
93
- ## Conditional Independence
94
 
95
- Events $E_1, E_2, E_3$ are conditionally independent given event $F$ if:
96
 
97
- \[P(E_1, E_2, E_3 | F) = P(E_1|F)P(E_2|F)P(E_3|F)\]
98
 
99
- This can be written more succinctly using product notation:
100
 
101
- \[P(E_1, E_2, E_3 | F) = \prod_{i=1}^3 P(E_i|F)\]
102
- """
103
- )
104
  return
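The 5-heads computation above can be checked by enumerating all $2^5$ equally likely flip sequences; a short sketch (plain Python, separate from the notebook's cells):

```python
from fractions import Fraction
from itertools import product

# All 32 equally likely outcomes of 5 fair coin flips.
sequences = list(product("HT", repeat=5))

p_all_heads = Fraction(
    sum(1 for seq in sequences if all(flip == "H" for flip in seq)),
    len(sequences),
)

# Matches the product of the individual probabilities: (1/2)^5
assert p_all_heads == Fraction(1, 2) ** 5
assert float(p_all_heads) == 0.03125
```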
105
 
106
 
@@ -121,21 +117,19 @@ def _(mo):
121
  callout_text,
122
  kind="warn"
123
  )
124
- return (callout_text,)
125
 
126
 
127
- @app.cell
128
- def _():
129
- def check_independence(p_e, p_f, p_intersection):
130
- expected = p_e * p_f
131
- tolerance = 1e-5 # Stricter tolerance for comparison
132
 
133
- return abs(p_intersection - expected) < tolerance
134
- return (check_independence,)
135
 
136
 
137
  @app.cell
138
- def _(check_independence, mo):
139
  # Example 1: Rolling dice
140
  p_first_even = 0.5 # P(First die is even)
141
  p_second_six = 1/6 # P(Second die is 6)
@@ -157,11 +151,11 @@ def _(check_independence, mo):
157
  </details>
158
  """
159
  mo.md(example1)
160
- return dice_independent, example1, p_both, p_first_even, p_second_six
161
 
162
 
163
  @app.cell
164
- def _(check_independence, mo):
165
  # Example 2: Drawing cards (dependent events)
166
  p_first_heart = 13/52 # P(First card is heart)
167
  p_second_heart = 12/51 # P(Second card is heart | First was heart)
@@ -192,18 +186,11 @@ def _(check_independence, mo):
192
  </details>
193
  """
194
  mo.md(example2)
195
- return (
196
- cards_independent,
197
- example2,
198
- p_both_hearts,
199
- p_first_heart,
200
- p_second_heart,
201
- theoretical_if_independent,
202
- )
203
 
204
 
205
  @app.cell
206
- def _(check_independence, mo):
207
  # Example 3: Computer system
208
  p_hardware = 0.02 # P(Hardware failure)
209
  p_software = 0.03 # P(Software crash)
@@ -224,60 +211,54 @@ def _(check_independence, mo):
224
  </details>
225
  """
226
  mo.md(example3)
227
- return (
228
- example3,
229
- p_both_failure,
230
- p_hardware,
231
- p_software,
232
- system_independent,
233
- )
234
 
235
 
236
  @app.cell(hide_code=True)
237
  def _(mo):
238
- mo.md(
239
- """
240
- ## Establishing Independence
241
 
242
- In practice, we can establish independence through:
243
 
244
- 1. **Mathematical Verification**: Show that P(E∩F) = P(E)P(F)
245
- 2. **Empirical Testing**: Analyze data to check if events appear independent
246
- 3. **Domain Knowledge**: Use understanding of the system to justify independence
247
 
248
- > **Note**: Perfect independence is rare in real data. We often make independence assumptions
249
- when dependencies are negligible and the simplification is useful.
250
 
251
- ## Backup Systems in Space Missions
252
 
253
- Consider a space mission with two backup life support systems:
254
 
255
- $$P(\text{Primary fails}) = p_1$$
256
 
257
- $$P(\text{Secondary fails}) = p_2$$
258
 
259
- If the systems are truly independent (different power sources, separate locations, distinct technologies):
260
 
261
- $$P(\text{Life support fails}) = p_1p_2$$
262
 
263
- For example:
264
 
265
- - If $p_1 = 0.01$ and $p_2 = 0.02$ (99% and 98% reliable)
266
- - Then $P(\text{Total failure}) = 0.0002$ (99.98% reliable)
267
 
268
- However, if both systems share vulnerabilities (same radiation exposure, temperature extremes):
269
 
270
- $$P(\text{Life support fails}) > p_1p_2$$
271
 
272
- This example shows why space agencies invest heavily in ensuring true independence of backup systems.
273
- """
274
- )
275
  return
276
 
277
 
278
  @app.cell(hide_code=True)
279
  def _(mo):
280
- mo.md(r"""## Interactive Example""")
 
 
281
  return
282
 
283
 
@@ -292,7 +273,7 @@ def _(mo):
292
  flip_button = mo.ui.run_button(label="Flip Coins!", kind="info")
293
  reset_button = mo.ui.run_button(label="Reset", kind="danger")
294
  stats_display = mo.md("*Click 'Flip Coins!' to start simulation*")
295
- return flip_button, reset_button, stats_display
296
 
297
 
298
  @app.cell(hide_code=True)
@@ -337,95 +318,80 @@ def _(flip_button, mo, np, reset_button):
337
 
338
  new_stats_display = mo.md(stats)
339
  new_stats_display
340
- return (
341
- coin1,
342
- coin2,
343
- new_stats_display,
344
- p_both_h,
345
- p_h1,
346
- p_h2,
347
- p_product,
348
- stats,
349
- )
350
 
351
 
352
  @app.cell(hide_code=True)
353
  def _(mo):
354
- mo.md(
355
- """
356
- ## Understanding the Simulation
357
 
358
- This simulation demonstrates independence using coin flips, where each coin's outcome is unaffected by the other.
359
 
360
- ### Reading the Results
361
 
362
- 1. **Individual Probabilities:**
363
 
364
- - P(H₁): 1 if heads, 0 if tails on first coin
365
- - P(H₂): 1 if heads, 0 if tails on second coin
366
 
367
- 2. **Testing Independence:**
368
 
369
- - P(Both Heads): 1 if both show heads, 0 otherwise
370
- - P(H₁)P(H₂): Product of individual results
371
 
372
- > **Note**: Each click performs a new independent trial. While a single flip shows binary outcomes (0 or 1),
373
- the theoretical probability is 0.5 for each coin and 0.25 for both heads.
374
- """
375
- )
376
  return
377
 
378
 
379
  @app.cell(hide_code=True)
380
  def _(mo):
381
- mo.md(
382
- r"""
383
- ## 🤔 Test Your Understanding
384
-
385
- Which of these statements about independence are true?
386
-
387
- <details>
388
- <summary>If P(E|F) = P(E), then E and F are independent</summary>
389
- ✅ True! This is one definition of independence - knowing F occurred doesn't change the probability of E.
390
- </details>
391
-
392
- <details>
393
- <summary>Independent events cannot occur simultaneously</summary>
394
- ❌ False! Independent events can and do occur together - their joint probability is just the product of their individual probabilities.
395
- </details>
396
-
397
- <details>
398
- <summary>If P(E∩F) = P(E)P(F), then E and F are independent</summary>
399
- ✅ True! This is the multiplicative definition of independence.
400
- </details>
401
-
402
- <details>
403
- <summary>Independence is symmetric: if E is independent of F, then F is independent of E</summary>
404
- ✅ True! The definition P(E∩F) = P(E)P(F) is symmetric in E and F.
405
- </details>
406
-
407
- <details>
408
- <summary>Three events being pairwise independent means they are mutually independent</summary>
409
- ❌ False! Pairwise independence doesn't guarantee mutual independence - we need to check all combinations.
410
- </details>
411
- """
412
- )
413
  return
414
 
415
 
416
  @app.cell(hide_code=True)
417
  def _(mo):
418
- mo.md(
419
- """
420
- ## Summary
421
 
422
- In this exploration of probability independence, we've discovered how to recognize when events truly don't influence each other. Through the lens of both mathematical definitions and interactive examples, we've seen how independence manifests in scenarios ranging from simple coin flips to critical system designs.
423
 
424
- The power of independence lies in its simplicity: when events are independent, we can multiply their individual probabilities to understand their joint behavior. Yet, as our examples showed, true independence is often more nuanced than it first appears. What seems independent might harbor hidden dependencies, and what appears dependent might be independent under certain conditions.
425
 
426
- _The art lies not just in calculating probabilities, but in developing the intuition to recognize independence in real-world scenarios—a skill essential for making informed decisions in uncertain situations._
427
- """
428
- )
429
  return
430
 
431
 
@@ -433,7 +399,7 @@ def _(mo):
433
  def _():
434
  import numpy as np
435
  import pandas as pd
436
- return np, pd
437
 
438
 
439
  if __name__ == "__main__":
 
7
 
8
  import marimo
9
 
10
+ __generated_with = "0.18.4"
11
  app = marimo.App()
12
 
13
 
 
19
 
20
  @app.cell(hide_code=True)
21
  def _(mo):
22
+ mo.md(r"""
23
+ # Independence in Probability Theory
 
24
 
25
+ _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/independence/), by Stanford professor Chris Piech._
26
 
27
+ In probability theory, independence is a fundamental concept that helps us understand
28
+ when events don't influence each other. Two events are independent if knowing the
29
+ outcome of one event doesn't change our belief about the other event occurring.
30
 
31
+ ## Definition of Independence
32
 
33
+ Two events $E$ and $F$ are independent if:
34
 
35
+ $$P(E|F) = P(E)$$
36
 
37
+ This means that knowing $F$ occurred doesn't change the probability of $E$ occurring.
38
 
39
+ ### _Alternative Definition_
40
 
41
+ Using the chain rule, we can derive another equivalent definition:
42
 
43
+ $$P(E \cap F) = P(E) \cdot P(F)$$
44
+ """)
 
45
  return
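The two definitions above can be checked by brute-force enumeration of a small sample space. A minimal sketch (not part of the notebook) using two fair dice, with $E$ = "first die is even" and $F$ = "second die shows a 6":

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))
E = {o for o in outcomes if o[0] % 2 == 0}   # first die even
F = {o for o in outcomes if o[1] == 6}       # second die is 6

p_e = len(E) / len(outcomes)                 # 1/2
p_f = len(F) / len(outcomes)                 # 1/6
p_ef = len(E & F) / len(outcomes)            # 1/12
p_e_given_f = p_ef / p_f                     # P(E|F)

# Both definitions agree: P(E|F) = P(E) and P(E ∩ F) = P(E)P(F).
assert abs(p_e_given_f - p_e) < 1e-12
assert abs(p_ef - p_e * p_f) < 1e-12
```

Enumeration only scales to tiny sample spaces, but it makes the equivalence of the two definitions concrete.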
46
 
47
 
48
  @app.cell(hide_code=True)
49
  def _(mo):
50
+ mo.md(r"""
51
+ ## Independence is Symmetric
 
52
 
53
+ This property is symmetric: if $E$ is independent of $F$, then $F$ is independent of $E$.
54
+ We can prove this using Bayes' Theorem:
55
 
56
+ \[P(E|F) = \frac{P(F|E)P(E)}{P(F)}\]
57
 
58
+ \[= \frac{P(F)P(E)}{P(F)}\]
59
 
60
+ \[= P(E)\]
61
 
62
+ ## Independence and Complements
63
 
64
+ Given independent events $A$ and $B$, we can prove that $A$ and $B^C$ are also independent:
65
 
66
 
67
+ \[P(AB^C) = P(A) - P(AB)\]
68
 
69
+ \[= P(A) - P(A)P(B)\]
70
 
71
+ \[= P(A)(1 - P(B))\]
72
 
73
+ \[= P(A)P(B^C)\]
74
 
75
+ ## Generalized Independence
76
 
77
+ Events $E_1, E_2, \ldots, E_n$ are independent if for every subset with $r$ elements (where $r \leq n$):
78
 
79
+ \[P(E_1, E_2, \ldots, E_r) = \prod_{i=1}^r P(E_i)\]
80
 
81
+ For example, consider getting 5 heads on 5 coin flips. Let $H_i$ be the event that the $i$th flip is heads:
82
 
83
 
84
+ \[P(H_1, H_2, H_3, H_4, H_5) = P(H_1)P(H_2)P(H_3)P(H_4)P(H_5)\]
85
 
86
+ \[= \prod_{i=1}^5 P(H_i)\]
87
 
88
+ \[= \left(\frac{1}{2}\right)^5 = 0.03125\]
89
 
90
+ ## Conditional Independence
91
 
92
+ Events $E_1, E_2, E_3$ are conditionally independent given event $F$ if:
93
 
94
+ \[P(E_1, E_2, E_3 | F) = P(E_1|F)P(E_2|F)P(E_3|F)\]
95
 
96
+ This can be written more succinctly using product notation:
97
 
98
+ \[P(E_1, E_2, E_3 | F) = \prod_{i=1}^3 P(E_i|F)\]
99
+ """)
 
100
  return
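The five-flip result derived above is just a product of marginal probabilities; a one-line sketch:

```python
import math

# P(5 heads in 5 independent fair flips) = (1/2)^5
p_all_heads = math.prod([0.5] * 5)
print(p_all_heads)  # 0.03125
```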
101
 
102
 
 
117
  callout_text,
118
  kind="warn"
119
  )
120
+ return
121
 
122
 
123
+ @app.function
124
+ def check_independence(p_e, p_f, p_intersection):
125
+ expected = p_e * p_f
126
+ tolerance = 1e-5 # Stricter tolerance for comparison
 
127
 
128
+ return abs(p_intersection - expected) < tolerance
 
129
 
130
 
131
  @app.cell
132
+ def _(mo):
133
  # Example 1: Rolling dice
134
  p_first_even = 0.5 # P(First die is even)
135
  p_second_six = 1/6 # P(Second die is 6)
 
151
  </details>
152
  """
153
  mo.md(example1)
154
+ return
155
 
156
 
157
  @app.cell
158
+ def _(mo):
159
  # Example 2: Drawing cards (dependent events)
160
  p_first_heart = 13/52 # P(First card is heart)
161
  p_second_heart = 12/51 # P(Second card is heart | First was heart)
 
186
  </details>
187
  """
188
  mo.md(example2)
189
+ return
 
 
 
 
 
 
 
190
 
191
 
192
  @app.cell
193
+ def _(mo):
194
  # Example 3: Computer system
195
  p_hardware = 0.02 # P(Hardware failure)
196
  p_software = 0.03 # P(Software crash)
 
211
  </details>
212
  """
213
  mo.md(example3)
214
+ return
 
 
 
 
 
 
215
 
216
 
217
  @app.cell(hide_code=True)
218
  def _(mo):
219
+ mo.md(r"""
220
+ ## Establishing Independence
 
221
 
222
+ In practice, we can establish independence through:
223
 
224
+ 1. **Mathematical Verification**: Show that P(E∩F) = P(E)P(F)
225
+ 2. **Empirical Testing**: Analyze data to check if events appear independent
226
+ 3. **Domain Knowledge**: Use understanding of the system to justify independence
227
 
228
+ > **Note**: Perfect independence is rare in real data. We often make independence assumptions
229
+ when dependencies are negligible and the simplification is useful.
230
 
231
+ ## Backup Systems in Space Missions
232
 
233
+ Consider a space mission with two backup life support systems:
234
 
235
+ $$P(\text{Primary fails}) = p_1$$
236
 
237
+ $$P(\text{Secondary fails}) = p_2$$
238
 
239
+ If the systems are truly independent (different power sources, separate locations, distinct technologies):
240
 
241
+ $$P(\text{Life support fails}) = p_1p_2$$
242
 
243
+ For example:
244
 
245
+ - If $p_1 = 0.01$ and $p_2 = 0.02$ (99% and 98% reliable)
246
+ - Then $P(\text{Total failure}) = 0.0002$ (99.98% reliable)
247
 
248
+ However, if both systems share vulnerabilities (same radiation exposure, temperature extremes):
249
 
250
+ $$P(\text{Life support fails}) > p_1p_2$$
251
 
252
+ This example shows why space agencies invest heavily in ensuring true independence of backup systems.
253
+ """)
 
254
  return
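With the example's figures, the reliability computation is two lines; a sketch that assumes the independence stated above:

```python
p1, p2 = 0.01, 0.02          # P(primary fails), P(secondary fails)
p_total_failure = p1 * p2    # valid only if the two failures are independent
reliability = 1 - p_total_failure
print(f"{p_total_failure:.4f}, {reliability:.4f}")  # 0.0002, 0.9998
```

If the systems share vulnerabilities, this product understates the true failure probability.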
255
 
256
 
257
  @app.cell(hide_code=True)
258
  def _(mo):
259
+ mo.md(r"""
260
+ ## Interactive Example
261
+ """)
262
  return
263
 
264
 
 
273
  flip_button = mo.ui.run_button(label="Flip Coins!", kind="info")
274
  reset_button = mo.ui.run_button(label="Reset", kind="danger")
275
  stats_display = mo.md("*Click 'Flip Coins!' to start simulation*")
276
+ return flip_button, reset_button
277
 
278
 
279
  @app.cell(hide_code=True)
 
318
 
319
  new_stats_display = mo.md(stats)
320
  new_stats_display
321
+ return
 
 
 
 
 
 
 
 
 
322
 
323
 
324
  @app.cell(hide_code=True)
325
  def _(mo):
326
+ mo.md("""
327
+ ## Understanding the Simulation
 
328
 
329
+ This simulation demonstrates independence using coin flips, where each coin's outcome is unaffected by the other.
330
 
331
+ ### Reading the Results
332
 
333
+ 1. **Individual Probabilities:**
334
 
335
+ - P(H₁): 1 if heads, 0 if tails on first coin
336
+ - P(H₂): 1 if heads, 0 if tails on second coin
337
 
338
+ 2. **Testing Independence:**
339
 
340
+ - P(Both Heads): 1 if both show heads, 0 otherwise
341
+ - P(H₁)P(H₂): Product of individual results
342
 
343
+ > **Note**: Each click performs a new independent trial. While a single flip shows binary outcomes (0 or 1),
344
+ the theoretical probability is 0.5 for each coin and 0.25 for both heads.
345
+ """)
 
346
  return
347
 
348
 
349
  @app.cell(hide_code=True)
350
  def _(mo):
351
+ mo.md(r"""
352
+ ## 🤔 Test Your Understanding
353
+
354
+ Which of these statements about independence are true?
355
+
356
+ <details>
357
+ <summary>If P(E|F) = P(E), then E and F are independent</summary>
358
+ ✅ True! This is one definition of independence - knowing F occurred doesn't change the probability of E.
359
+ </details>
360
+
361
+ <details>
362
+ <summary>Independent events cannot occur simultaneously</summary>
363
+ ❌ False! Independent events can and do occur together - their joint probability is just the product of their individual probabilities.
364
+ </details>
365
+
366
+ <details>
367
+ <summary>If P(E∩F) = P(E)P(F), then E and F are independent</summary>
368
+ ✅ True! This is the multiplicative definition of independence.
369
+ </details>
370
+
371
+ <details>
372
+ <summary>Independence is symmetric: if E is independent of F, then F is independent of E</summary>
373
+ ✅ True! The definition P(E∩F) = P(E)P(F) is symmetric in E and F.
374
+ </details>
375
+
376
+ <details>
377
+ <summary>Three events being pairwise independent means they are mutually independent</summary>
378
+ ❌ False! Pairwise independence doesn't guarantee mutual independence - we need to check all combinations.
379
+ </details>
380
+ """)
 
 
381
  return
382
 
383
 
384
  @app.cell(hide_code=True)
385
  def _(mo):
386
+ mo.md("""
387
+ ## Summary
 
388
 
389
+ In this exploration of probability independence, we've discovered how to recognize when events truly don't influence each other. Through the lens of both mathematical definitions and interactive examples, we've seen how independence manifests in scenarios ranging from simple coin flips to critical system designs.
390
 
391
+ The power of independence lies in its simplicity: when events are independent, we can multiply their individual probabilities to understand their joint behavior. Yet, as our examples showed, true independence is often more nuanced than it first appears. What seems independent might harbor hidden dependencies, and what appears dependent might be independent under certain conditions.
392
 
393
+ _The art lies not just in calculating probabilities, but in developing the intuition to recognize independence in real-world scenarios—a skill essential for making informed decisions in uncertain situations._
394
+ """)
 
395
  return
396
 
397
 
 
399
  def _():
400
  import numpy as np
401
  import pandas as pd
402
+ return (np,)
403
 
404
 
405
  if __name__ == "__main__":
probability/06_probability_of_and.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.4"
13
  app = marimo.App(width="medium")
14
 
15
 
@@ -28,38 +28,34 @@ def _():
28
 
29
  @app.cell(hide_code=True)
30
  def _(mo):
31
- mo.md(
32
- r"""
33
- # Probability of And
34
- _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/prob_and/), by Stanford professor Chris Piech._
35
-
36
- When calculating the probability of both events occurring together, we need to consider whether the events are independent or dependent.
37
- Let's explore how to calculate $P(E \cap F)$, i.e. $P(E \text{ and } F)$, in different scenarios.
38
- """
39
- )
40
  return
41
 
42
 
43
  @app.cell(hide_code=True)
44
  def _(mo):
45
- mo.md(
46
- r"""
47
- ## And with Independent Events
48
 
49
- Two events $E$ and $F$ are **independent** if knowing one event occurred doesn't affect the probability of the other.
50
- For independent events:
51
 
52
- $P(E \text{ and } F) = P(E) \cdot P(F)$
53
 
54
- For example:
55
 
56
- - Rolling a 6 on one die and getting heads on a coin flip
57
- - Drawing a heart from a deck, replacing it, and drawing another heart
58
- - Getting a computer error on Monday vs. Tuesday
59
 
60
- Here's a Python function to calculate probability for independent events:
61
- """
62
- )
63
  return
64
 
65
 
@@ -73,7 +69,7 @@ def _():
73
  p_heads = 1/2 # P(getting heads)
74
  p_both = calc_independent_prob(p_six, p_heads)
75
  print(f"Example 1: P(rolling 6 AND getting heads) = {p_six:.3f} × {p_heads:.3f} = {p_both:.3f}")
76
- return calc_independent_prob, p_both, p_heads, p_six
77
 
78
 
79
  @app.cell
@@ -83,30 +79,28 @@ def _(calc_independent_prob):
83
  p_disk_fail = 0.03 # P(disk failure)
84
  p_both_fail = calc_independent_prob(p_cpu_fail, p_disk_fail)
85
  print(f"Example 2: P(both CPU and disk failing) = {p_cpu_fail:.3f} × {p_disk_fail:.3f} = {p_both_fail:.3f}")
86
- return p_both_fail, p_cpu_fail, p_disk_fail
87
 
88
 
89
  @app.cell(hide_code=True)
90
  def _(mo):
91
- mo.md(
92
- r"""
93
- ## And with Dependent Events
94
 
95
- For dependent events, we use the **chain rule**:
96
 
97
- $P(E \text{ and } F) = P(E) \cdot P(F|E)$
98
 
99
- where $P(F|E)$ is the probability of $F$ occurring given that $E$ has occurred.
100
 
101
- For example:
102
 
103
- - Drawing two hearts without replacement
104
- - Getting two consecutive heads in poker
105
- - System failures in connected components
106
 
107
- Let's implement this calculation:
108
- """
109
- )
110
  return
111
 
112
 
@@ -120,7 +114,7 @@ def _():
120
  p_second_heart = 12/51 # P(second heart | first heart)
121
  p_both_hearts = calc_dependent_prob(p_first_heart, p_second_heart)
122
  print(f"Example 1: P(two hearts) = {p_first_heart:.3f} × {p_second_heart:.3f} = {p_both_hearts:.3f}")
123
- return calc_dependent_prob, p_both_hearts, p_first_heart, p_second_heart
124
 
125
 
126
  @app.cell
@@ -130,32 +124,32 @@ def _(calc_dependent_prob):
130
  p_second_ace = 3/51 # P(second ace | first ace)
131
  p_both_aces = calc_dependent_prob(p_first_ace, p_second_ace)
132
  print(f"Example 2: P(two aces) = {p_first_ace:.3f} × {p_second_ace:.3f} = {p_both_aces:.3f}")
133
- return p_both_aces, p_first_ace, p_second_ace
134
 
135
 
136
  @app.cell(hide_code=True)
137
  def _(mo):
138
- mo.md(
139
- r"""
140
- ## Multiple Events
141
 
142
- For multiple independent events:
143
 
144
- $P(E_1 \text{ and } E_2 \text{ and } \cdots \text{ and } E_n) = \prod_{i=1}^n P(E_i)$
145
 
146
- For dependent events:
147
 
148
- $P(E_1 \text{ and } E_2 \text{ and } \cdots \text{ and } E_n) = P(E_1) \cdot P(E_2|E_1) \cdot P(E_3|E_1,E_2) \cdots P(E_n|E_1,\ldots,E_{n-1})$
149
 
150
- Let's visualize these probabilities:
151
- """
152
- )
153
  return
154
 
155
 
156
  @app.cell(hide_code=True)
157
  def _(mo):
158
- mo.md(r"""### Interactive example""")
 
 
159
  return
160
 
161
 
@@ -250,65 +244,61 @@ def _(event_type, mo, plt, venn2):
250
 
251
  # Display explanation alongside visualization
252
  mo.hstack([plt.gcf(), mo.md(data["explanation"])])
253
- return data, events_data, v
254
 
255
 
256
  @app.cell(hide_code=True)
257
  def _(mo):
258
- mo.md(
259
- r"""
260
- ## 🤔 Test Your Understanding
261
 
262
- Which of these statements about AND probability are true?
263
 
264
- <details>
265
- <summary>1. The probability of getting two sixes in a row with a fair die is 1/36</summary>
266
 
267
- ✅ True! Since die rolls are independent events:
268
- P(two sixes) = P(first six) × P(second six) = 1/6 × 1/6 = 1/36
269
- </details>
270
 
271
- <details>
272
- <summary>2. When drawing cards without replacement, P(two kings) = 4/52 × 4/52</summary>
273
 
274
- ❌ False! This is a dependent event. The correct calculation is:
275
- P(two kings) = P(first king) × P(second king | first king) = 4/52 × 3/51
276
- </details>
277
 
278
- <details>
279
- <summary>3. If P(A) = 0.3 and P(B) = 0.4, then P(A and B) must be 0.12</summary>
280
 
281
- ❌ False! P(A and B) = 0.12 only if A and B are independent events.
282
- If they're dependent, we need P(B|A) to calculate P(A and B).
283
- </details>
284
 
285
- <details>
286
- <summary>4. The probability of rolling a six AND getting tails is (1/6 × 1/2)</summary>
287
 
288
- ✅ True! These are independent events, so we multiply their individual probabilities:
289
- P(six and tails) = P(six) × P(tails) = 1/6 × 1/2 = 1/12
290
- </details>
291
- """
292
- )
293
  return
294
 
295
 
296
  @app.cell(hide_code=True)
297
  def _(mo):
298
- mo.md(
299
- """
300
- ## Summary
301
 
302
- You've learned:
303
 
304
- - How to identify independent vs dependent events
305
- - The multiplication rule for independent events
306
- - The chain rule for dependent events
307
- - How to extend these concepts to multiple events
308
 
309
- In the next lesson, we'll explore **law of total probability** in more detail, building on our understanding of various topics.
310
- """
311
- )
312
  return
313
 
314
 
 
9
 
10
  import marimo
11
 
12
+ __generated_with = "0.18.4"
13
  app = marimo.App(width="medium")
14
 
15
 
 
28
 
29
  @app.cell(hide_code=True)
30
  def _(mo):
31
+ mo.md(r"""
32
+ # Probability of And
33
+ _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/prob_and/), by Stanford professor Chris Piech._
34
+
35
+ When calculating the probability of both events occurring together, we need to consider whether the events are independent or dependent.
36
+ Let's explore how to calculate $P(E \cap F)$, i.e. $P(E \text{ and } F)$, in different scenarios.
37
+ """)
 
 
38
  return
39
 
40
 
41
  @app.cell(hide_code=True)
42
  def _(mo):
43
+ mo.md(r"""
44
+ ## And with Independent Events
 
45
 
46
+ Two events $E$ and $F$ are **independent** if knowing one event occurred doesn't affect the probability of the other.
47
+ For independent events:
48
 
49
+ $P(E \text{ and } F) = P(E) \cdot P(F)$
50
 
51
+ For example:
52
 
53
+ - Rolling a 6 on one die and getting heads on a coin flip
54
+ - Drawing a heart from a deck, replacing it, and drawing another heart
55
+ - Getting a computer error on Monday vs. Tuesday
56
 
57
+ Here's a Python function to calculate probability for independent events:
58
+ """)
 
59
  return
60
 
61
 
 
69
  p_heads = 1/2 # P(getting heads)
70
  p_both = calc_independent_prob(p_six, p_heads)
71
  print(f"Example 1: P(rolling 6 AND getting heads) = {p_six:.3f} × {p_heads:.3f} = {p_both:.3f}")
72
+ return (calc_independent_prob,)
73
 
74
 
75
  @app.cell
 
79
  p_disk_fail = 0.03 # P(disk failure)
80
  p_both_fail = calc_independent_prob(p_cpu_fail, p_disk_fail)
81
  print(f"Example 2: P(both CPU and disk failing) = {p_cpu_fail:.3f} × {p_disk_fail:.3f} = {p_both_fail:.3f}")
82
+ return
83
 
84
 
85
  @app.cell(hide_code=True)
86
  def _(mo):
87
+ mo.md(r"""
88
+ ## And with Dependent Events
 
89
 
90
+ For dependent events, we use the **chain rule**:
91
 
92
+ $P(E \text{ and } F) = P(E) \cdot P(F|E)$
93
 
94
+ where $P(F|E)$ is the probability of $F$ occurring given that $E$ has occurred.
95
 
96
+ For example:
97
 
98
+ - Drawing two hearts without replacement
99
+ - Getting two consecutive aces in poker
100
+ - System failures in connected components
101
 
102
+ Let's implement this calculation:
103
+ """)
 
104
  return
105
 
106
 
 
114
  p_second_heart = 12/51 # P(second heart | first heart)
115
  p_both_hearts = calc_dependent_prob(p_first_heart, p_second_heart)
116
  print(f"Example 1: P(two hearts) = {p_first_heart:.3f} × {p_second_heart:.3f} = {p_both_hearts:.3f}")
117
+ return (calc_dependent_prob,)
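As a sanity check on the two-hearts result above, one can simulate draws without replacement; a quick Monte Carlo sketch (the helper name, trial count, and seed are illustrative, not part of the notebook):

```python
import random

def estimate_two_hearts(trials=200_000, seed=0):
    """Estimate P(two hearts) by simulating two draws without replacement."""
    rng = random.Random(seed)
    deck = [suit for suit in range(4) for _ in range(13)]  # suit 0 = hearts
    hits = 0
    for _ in range(trials):
        first, second = rng.sample(deck, 2)  # sample without replacement
        hits += (first == 0 and second == 0)
    return hits / trials

exact = (13/52) * (12/51)          # chain rule: about 0.0588
estimate = estimate_two_hearts()   # should land close to `exact`
```

With 200,000 trials the estimate typically agrees with the chain-rule value to about three decimal places.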
118
 
119
 
120
  @app.cell
 
124
  p_second_ace = 3/51 # P(second ace | first ace)
125
  p_both_aces = calc_dependent_prob(p_first_ace, p_second_ace)
126
  print(f"Example 2: P(two aces) = {p_first_ace:.3f} × {p_second_ace:.3f} = {p_both_aces:.3f}")
127
+ return
128
 
129
 
130
  @app.cell(hide_code=True)
131
  def _(mo):
132
+ mo.md(r"""
133
+ ## Multiple Events
 
134
 
135
+ For multiple independent events:
136
 
137
+ $P(E_1 \text{ and } E_2 \text{ and } \cdots \text{ and } E_n) = \prod_{i=1}^n P(E_i)$
138
 
139
+ For dependent events:
140
 
141
+ $P(E_1 \text{ and } E_2 \text{ and } \cdots \text{ and } E_n) = P(E_1) \cdot P(E_2|E_1) \cdot P(E_3|E_1,E_2) \cdots P(E_n|E_1,\ldots,E_{n-1})$
142
 
143
+ Let's visualize these probabilities:
144
+ """)
 
145
  return
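Both formulas reduce to multiplying a list of factors; a minimal sketch contrasting the two cases (the specific events are chosen for illustration):

```python
import math

# Independent events: product of marginals (four fair coin flips, all heads).
p_independent = math.prod([0.5] * 4)         # 0.0625

# Dependent events: product of successive conditionals
# (four hearts drawn without replacement).
conditionals = [13/52, 12/51, 11/50, 10/49]  # P(E_i | E_1, ..., E_{i-1})
p_dependent = math.prod(conditionals)        # about 0.00264
```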
146
 
147
 
148
  @app.cell(hide_code=True)
149
  def _(mo):
150
+ mo.md(r"""
151
+ ### Interactive example
152
+ """)
153
  return
154
 
155
 
 
244
 
245
  # Display explanation alongside visualization
246
  mo.hstack([plt.gcf(), mo.md(data["explanation"])])
247
+ return
248
 
249
 
250
  @app.cell(hide_code=True)
251
  def _(mo):
252
+ mo.md(r"""
253
+ ## 🤔 Test Your Understanding
 
254
 
255
+ Which of these statements about AND probability are true?
256
 
257
+ <details>
258
+ <summary>1. The probability of getting two sixes in a row with a fair die is 1/36</summary>
259
 
260
+ ✅ True! Since die rolls are independent events:
261
+ P(two sixes) = P(first six) × P(second six) = 1/6 × 1/6 = 1/36
262
+ </details>
263
 
264
+ <details>
265
+ <summary>2. When drawing cards without replacement, P(two kings) = 4/52 × 4/52</summary>
266
 
267
+ ❌ False! This is a dependent event. The correct calculation is:
268
+ P(two kings) = P(first king) × P(second king | first king) = 4/52 × 3/51
269
+ </details>
270
 
271
+ <details>
272
+ <summary>3. If P(A) = 0.3 and P(B) = 0.4, then P(A and B) must be 0.12</summary>
273
 
274
+ ❌ False! P(A and B) = 0.12 only if A and B are independent events.
275
+ If they're dependent, we need P(B|A) to calculate P(A and B).
276
+ </details>
277
 
278
+ <details>
279
+ <summary>4. The probability of rolling a six AND getting tails is (1/6 × 1/2)</summary>
280
 
281
+ ✅ True! These are independent events, so we multiply their individual probabilities:
282
+ P(six and tails) = P(six) × P(tails) = 1/6 × 1/2 = 1/12
283
+ </details>
284
+ """)
 
285
  return
286
 
287
 
288
  @app.cell(hide_code=True)
289
  def _(mo):
290
+ mo.md("""
291
+ ## Summary
 
292
 
293
+ You've learned:
294
 
295
+ - How to identify independent vs dependent events
296
+ - The multiplication rule for independent events
297
+ - The chain rule for dependent events
298
+ - How to extend these concepts to multiple events
299
 
300
+ In the next lesson, we'll explore **law of total probability** in more detail, building on our understanding of various topics.
301
+ """)
 
302
  return
303
 
304
 
probability/07_law_of_total_probability.py CHANGED
@@ -9,7 +9,7 @@
9
 
10
  import marimo
11
 
12
- __generated_with = "0.11.7"
13
  app = marimo.App(width="medium")
14
 
15
 
@@ -24,56 +24,52 @@ def _():
24
  import matplotlib.pyplot as plt
25
  from matplotlib_venn import venn2
26
  import numpy as np
27
- return np, plt, venn2
28
 
29
 
30
  @app.cell(hide_code=True)
31
  def _(mo):
32
- mo.md(
33
- r"""
34
- # Law of Total Probability
35
 
36
- _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/law_total/), by Stanford professor Chris Piech._
37
 
38
- The Law of Total Probability is a fundamental rule that helps us calculate probabilities by breaking down complex events into simpler parts. It's particularly useful when we want to compute the probability of an event that can occur through multiple distinct scenarios.
39
- """
40
- )
41
  return
42
 
43
 
44
  @app.cell(hide_code=True)
45
  def _(mo):
46
- mo.md(
47
- r"""
48
- ## The Core Concept
49
 
50
- The Law of Total Probability emerged from a simple but powerful observation: any event E can be broken down into parts based on another event F and its complement Fᶜ.
51
 
52
- ### From Simple Observation to Powerful Law
53
 
54
- Consider an event E that can occur in two ways:
55
 
56
- 1. When F occurs (E ∩ F)
57
- 2. When F doesn't occur (E ∩ Fᶜ)
58
 
59
- This leads to our first insight:
60
 
61
- $P(E) = P(E \cap F) + P(E \cap F^c)$
62
 
63
- Applying the chain rule to each term:
64
 
65
- \begin{align}
66
- P(E) &= P(E \cap F) + P(E \cap F^c) \\
67
- &= P(E|F)P(F) + P(E|F^c)P(F^c)
68
- \end{align}
69
 
70
- This two-part version generalizes to any number of [mutually exclusive](marimo.app/https://github.com/marimo-team/learn/blob/main/probability/03_probability_of_or.py) events that cover the sample space:
71
 
72
- $P(A) = \sum_{i=1}^n P(A|B_i)P(B_i)$
73
 
74
- where {B₁, B₂, ..., Bₙ} forms a partition of the sample space.
75
- """
76
- )
77
  return
78
 
79
 
@@ -98,7 +94,7 @@ def _():
98
 
99
  print("Odd/Even partition:", is_valid_partition(partition1, sample_space))
100
  print("Number pairs partition:", is_valid_partition(partition2, sample_space))
- return is_valid_partition, partition1, partition2, sample_space


  @app.cell
@@ -111,7 +107,7 @@ def _(is_valid_partition):
  print("Student Grades Examples:")
  print("Pass/Fail partition:", is_valid_partition(passing_partition, grade_space))
  print("Individual grades partition:", is_valid_partition(letter_groups, grade_space))
- return grade_space, letter_groups, passing_partition


  @app.cell
@@ -124,7 +120,7 @@ def _(is_valid_partition):
  print("\nPlaying Cards Examples:")
  print("Color-based partition:", is_valid_partition(color_partition, card_space)) # True
  print("Invalid partition:", is_valid_partition(invalid_partition, card_space)) # False
- return card_space, color_partition, invalid_partition


  @app.cell(hide_code=True)
@@ -151,75 +147,71 @@ def _(mo, plt, venn2):
  """)

  mo.hstack([plt.gca(), viz_explanation])
- return v, viz_explanation


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Computing Total Probability
-
- To use the Law of Total Probability:
-
- 1. Identify a partition of the sample space
- 2. Calculate $P(B_i)$ for each part
- 3. Calculate $P(A|B_i)$ for each part
- 4. Sum the products $P(A|B_i)P(B_i)$
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(r"""Let's implement this calculation:""")
  return


- @app.cell
- def _():
- def total_probability(conditional_probs, partition_probs):
- """Calculate total probability using Law of Total Probability
- conditional_probs: List of P(A|Bi)
- partition_probs: List of P(Bi)
- """
- if len(conditional_probs) != len(partition_probs):
- raise ValueError("Must have same number of conditional and partition probabilities")

- if abs(sum(partition_probs) - 1) > 1e-10:
- raise ValueError("Partition probabilities must sum to 1")

- return sum(c * p for c, p in zip(conditional_probs, partition_probs))
- return (total_probability,)


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Example: System Reliability

- Consider a computer system that can be in three states:

- - Normal (70% of time)
- - Degraded (20% of time)
- - Critical (10% of time)

- The probability of errors in each state:

- - P(Error | Normal) = 0.01 (1%)
- - P(Error | Degraded) = 0.15 (15%)
- - P(Error | Critical) = 0.45 (45%)

- Let's calculate the overall probability of encountering an error:
- """
- )
  return


  @app.cell
- def _(mo, total_probability):
  # System states and probabilities
  states = ["Normal", "Degraded", "Critical"]
  state_probs = [0.7, 0.2, 0.1] # System spends 70%, 20%, 10% of time in each state
@@ -252,12 +244,14 @@ def _(mo, total_probability):
  Total: {total_error:.3f} or {total_error:.1%} chance of error
  """)
  explanation
- return error_probs, explanation, state_probs, states, total_error


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(r"""## Interactive Example:""")
  return


@@ -311,24 +305,22 @@ def _(late_given_dry, late_given_rain, mo, plt, venn2, weather_prob):
  plt.title("Weather and Traffic Probability")

  mo.hstack([plt.gca(), explanation_example])
- return explanation_example, p_dry, p_late, p_rain


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Visual Intuition

- The Law of Total Probability works because:

- 1. The partition divides the sample space into non-overlapping regions
- 2. Every outcome belongs to exactly one region
- 3. We account for all possible ways an event can occur

- Let's visualize this with a tree diagram:
- """
- )
  return


@@ -371,49 +363,45 @@ def _(plt):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## 🤔 Test Your Understanding
-
- For a fair six-sided die with partitions:
- - B: Numbers less than 3 {1,2}
- - B: Numbers from 3 to 4 {3,4}
- - B₃: Numbers greater than 4 {5,6}
-
- **Question 1**: Which of these statements correctly describes the partition?
- <details>
- <summary>The sets overlap at number 3</summary>
- ❌ Incorrect! The sets are clearly separated with no overlapping numbers.
- </details>
- <details>
- <summary>Some numbers are missing from the partition</summary>
- ❌ Incorrect! All numbers from 1 to 6 are included exactly once.
- </details>
- <details>
- <summary>The sets form a valid partition of {1,2,3,4,5,6}</summary>
- ✅ Correct! The sets are mutually exclusive and their union covers all outcomes.
- </details>
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- """
- ## Summary

- You've learned:

- - How to identify valid partitions of a sample space
- - The Law of Total Probability formula and its components
- - How to break down complex probability calculations
- - Applications to real-world scenarios

- In the next lesson, we'll explore **Bayes' Theorem**, which builds on these concepts to solve even more sophisticated probability problems.
- """
- )
  return


  import marimo

+ __generated_with = "0.18.4"
  app = marimo.App(width="medium")


  import matplotlib.pyplot as plt
  from matplotlib_venn import venn2
  import numpy as np
+ return plt, venn2


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ # Law of Total Probability

+ _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/law_total/), by Stanford professor Chris Piech._

+ The Law of Total Probability is a fundamental rule that helps us calculate probabilities by breaking down complex events into simpler parts. It's particularly useful when we want to compute the probability of an event that can occur through multiple distinct scenarios.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## The Core Concept

+ The Law of Total Probability emerged from a simple but powerful observation: any event E can be broken down into parts based on another event F and its complement Fᶜ.

+ ### From Simple Observation to Powerful Law

+ Consider an event E that can occur in two ways:

+ 1. When F occurs (E ∩ F)
+ 2. When F doesn't occur (E ∩ Fᶜ)

+ This leads to our first insight:

+ $P(E) = P(E \cap F) + P(E \cap F^c)$

+ Applying the chain rule to each term:

+ \begin{align}
+ P(E) &= P(E \cap F) + P(E \cap F^c) \\
+ &= P(E|F)P(F) + P(E|F^c)P(F^c)
+ \end{align}

+ This two-part version generalizes to any number of [mutually exclusive](https://marimo.app/https://github.com/marimo-team/learn/blob/main/probability/03_probability_of_or.py) events that cover the sample space:

+ $P(A) = \sum_{i=1}^n P(A|B_i)P(B_i)$

+ where {B₁, B₂, ..., Bₙ} forms a partition of the sample space.
+ """)
  return


  print("Odd/Even partition:", is_valid_partition(partition1, sample_space))
  print("Number pairs partition:", is_valid_partition(partition2, sample_space))
+ return (is_valid_partition,)


  @app.cell
  print("Student Grades Examples:")
  print("Pass/Fail partition:", is_valid_partition(passing_partition, grade_space))
  print("Individual grades partition:", is_valid_partition(letter_groups, grade_space))
+ return


  @app.cell
  print("\nPlaying Cards Examples:")
  print("Color-based partition:", is_valid_partition(color_partition, card_space)) # True
  print("Invalid partition:", is_valid_partition(invalid_partition, card_space)) # False
+ return


  @app.cell(hide_code=True)
  """)

  mo.hstack([plt.gca(), viz_explanation])
+ return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Computing Total Probability
+
+ To use the Law of Total Probability:
+
+ 1. Identify a partition of the sample space
+ 2. Calculate $P(B_i)$ for each part
+ 3. Calculate $P(A|B_i)$ for each part
+ 4. Sum the products $P(A|B_i)P(B_i)$
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ Let's implement this calculation:
+ """)
  return


+ @app.function
+ def total_probability(conditional_probs, partition_probs):
+ """Calculate total probability using Law of Total Probability
+ conditional_probs: List of P(A|Bi)
+ partition_probs: List of P(Bi)
+ """
+ if len(conditional_probs) != len(partition_probs):
+ raise ValueError("Must have same number of conditional and partition probabilities")

+ if abs(sum(partition_probs) - 1) > 1e-10:
+ raise ValueError("Partition probabilities must sum to 1")

+ return sum(c * p for c, p in zip(conditional_probs, partition_probs))


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Example: System Reliability

+ Consider a computer system that can be in three states:

+ - Normal (70% of time)
+ - Degraded (20% of time)
+ - Critical (10% of time)

+ The probability of errors in each state:

+ - P(Error | Normal) = 0.01 (1%)
+ - P(Error | Degraded) = 0.15 (15%)
+ - P(Error | Critical) = 0.45 (45%)

+ Let's calculate the overall probability of encountering an error:
+ """)
  return


  @app.cell
+ def _(mo):
  # System states and probabilities
  states = ["Normal", "Degraded", "Critical"]
  state_probs = [0.7, 0.2, 0.1] # System spends 70%, 20%, 10% of time in each state

  Total: {total_error:.3f} or {total_error:.1%} chance of error
  """)
  explanation
+ return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Interactive Example
+ """)
  return


  plt.title("Weather and Traffic Probability")

  mo.hstack([plt.gca(), explanation_example])
+ return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Visual Intuition

+ The Law of Total Probability works because:

+ 1. The partition divides the sample space into non-overlapping regions
+ 2. Every outcome belongs to exactly one region
+ 3. We account for all possible ways an event can occur

+ Let's visualize this with a tree diagram:
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## 🤔 Test Your Understanding
+
+ For a fair six-sided die with partitions:
+ - B₁: Numbers less than 3 {1,2}
+ - B₂: Numbers from 3 to 4 {3,4}
+ - B₃: Numbers greater than 4 {5,6}
+
+ **Question 1**: Which of these statements correctly describes the partition?
+ <details>
+ <summary>The sets overlap at number 3</summary>
+ ❌ Incorrect! The sets are clearly separated with no overlapping numbers.
+ </details>
+ <details>
+ <summary>Some numbers are missing from the partition</summary>
+ ❌ Incorrect! All numbers from 1 to 6 are included exactly once.
+ </details>
+ <details>
+ <summary>The sets form a valid partition of {1,2,3,4,5,6}</summary>
+ ✅ Correct! The sets are mutually exclusive and their union covers all outcomes.
+ </details>
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md("""
+ ## Summary

+ You've learned:

+ - How to identify valid partitions of a sample space
+ - The Law of Total Probability formula and its components
+ - How to break down complex probability calculations
+ - Applications to real-world scenarios

+ In the next lesson, we'll explore **Bayes' Theorem**, which builds on these concepts to solve even more sophisticated probability problems.
+ """)
  return
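The `total_probability` helper added in this diff can be sanity-checked by hand against the System Reliability example. A standalone sketch (the probabilities are the ones the notebook assumes):

```python
# Standalone sketch mirroring the total_probability helper above.
def total_probability(conditional_probs, partition_probs):
    """P(A) = sum_i P(A|B_i) * P(B_i), for a partition {B_i}."""
    if abs(sum(partition_probs) - 1) > 1e-10:
        raise ValueError("Partition probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditional_probs, partition_probs))

# Numbers from the System Reliability example
state_probs = [0.7, 0.2, 0.1]           # P(Normal), P(Degraded), P(Critical)
error_given_state = [0.01, 0.15, 0.45]  # P(Error | state)

p_error = total_probability(error_given_state, state_probs)
print(round(p_error, 3))  # 0.082
```

The result matches the `{total_error:.3f}` figure the example cell reports: 0.007 + 0.030 + 0.045 = 0.082.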
 
probability/08_bayes_theorem.py CHANGED
@@ -9,7 +9,7 @@

  import marimo

- __generated_with = "0.11.8"
  app = marimo.App(width="medium", app_title="Bayes Theorem")


@@ -23,140 +23,128 @@ def _():
  def _():
  import matplotlib.pyplot as plt
  import numpy as np
- return np, plt


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- # Bayes' Theorem

- _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/bayes_theorem/), by Stanford professor Chris Piech._

- In the 1740s, an English minister named Thomas Bayes discovered a profound mathematical relationship that would revolutionize how we reason about uncertainty. His theorem provides an elegant framework for calculating the probability of a hypothesis being true given observed evidence.

- At its core, Bayes' Theorem connects two different types of probabilities: the probability of a hypothesis given evidence $P(H|E)$, and its reverse - the probability of evidence given a hypothesis $P(E|H)$. This relationship is particularly powerful because it allows us to compute difficult probabilities using ones that are easier to measure.
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## The Heart of Bayesian Reasoning

- The fundamental insight of Bayes' Theorem lies in its ability to relate what we want to know with what we can measure. When we observe evidence $E$, we often want to know the probability of a hypothesis $H$ being true. However, it's typically much easier to measure how likely we are to observe the evidence when we know the hypothesis is true.

- This reversal of perspective - from $P(H|E)$ to $P(E|H)$ - is powerful because it lets us:
- 1. Start with what we know (prior beliefs)
- 2. Use easily measurable relationships (likelihood)
- 3. Update our beliefs with new evidence

- This approach mirrors both how humans naturally learn and the scientific method: we begin with prior beliefs, gather evidence, and update our understanding based on that evidence. This makes Bayes' Theorem not just a mathematical tool, but a framework for rational thinking.
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## The Formula

- Bayes' Theorem states:

- $P(H|E) = \frac{P(E|H)P(H)}{P(E)}$

- Where:

- - $P(H|E)$ is the **posterior probability** - probability of hypothesis H given evidence E
- - $P(E|H)$ is the **likelihood** - probability of evidence E given hypothesis H
- - $P(H)$ is the **prior probability** - initial probability of hypothesis H
- - $P(E)$ is the **evidence** - total probability of observing evidence E

- The denominator $P(E)$ can be expanded using the [Law of Total Probability](https://marimo.app/gh/marimo-team/learn/main?entrypoint=probability%2F07_law_of_total_probability.py):

- $P(E) = P(E|H)P(H) + P(E|H^c)P(H^c)$
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Understanding Each Component
-
- ### 1. Prior Probability - $P(H)$
- - Initial belief about hypothesis before seeing evidence
- - Based on previous knowledge or assumptions
- - Example: Probability of having a disease before any tests
-
- ### 2. Likelihood - $P(E|H)$
- - Probability of evidence given hypothesis is true
- - Often known from data or scientific studies
- - Example: Probability of positive test given disease present
-
- ### 3. Evidence - $P(E)$
- - Total probability of observing the evidence
- - Acts as a normalizing constant
- - Can be calculated using Law of Total Probability
-
- ### 4. Posterior - $P(H|E)$
- - Updated probability after considering evidence
- - Combines prior knowledge with new evidence
- - Becomes new prior for future updates
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Real-World Examples

- ### 1. Medical Testing
- - **Want to know**: $P(\text{Disease}|\text{Positive})$ - Probability of disease given positive test
- - **Easy to know**: $P(\text{Positive}|\text{Disease})$ - Test accuracy for sick people
- - **Causality**: Disease causes test results, not vice versa

- ### 2. Student Ability
- - **Want to know**: $P(\text{High Ability}|\text{Good Grade})$ - Probability student is skilled given good grade
- - **Easy to know**: $P(\text{Good Grade}|\text{High Ability})$ - Probability good students get good grades
- - **Causality**: Ability influences grades, not vice versa

- ### 3. Cell Phone Location
- - **Want to know**: $P(\text{Location}|\text{Signal Strength})$ - Probability of phone location given signal
- - **Easy to know**: $P(\text{Signal Strength}|\text{Location})$ - Signal strength at known locations
- - **Causality**: Location determines signal strength, not vice versa

- These examples highlight a common pattern: what we want to know (posterior) is harder to measure directly than its reverse (likelihood).
- """
- )
  return


- @app.cell
- def _():
- def calculate_posterior(prior, likelihood, false_positive_rate):
- # Calculate P(E) using Law of Total Probability
- p_e = likelihood * prior + false_positive_rate * (1 - prior)

- # Calculate posterior using Bayes' Theorem
- posterior = (likelihood * prior) / p_e
- return posterior, p_e
- return (calculate_posterior,)


  @app.cell
- def _(calculate_posterior):
  # Medical test example
  p_disease = 0.01 # Prior: 1% have the disease
  p_positive_given_disease = 0.95 # Likelihood: 95% test accuracy
@@ -167,13 +155,7 @@ def _(calculate_posterior):
  p_positive_given_disease,
  p_positive_given_healthy
  )
- return (
- medical_evidence,
- medical_posterior,
- p_disease,
- p_positive_given_disease,
- p_positive_given_healthy,
- )


  @app.cell
@@ -203,7 +185,7 @@ def _(medical_posterior, mo):


  @app.cell
- def _(calculate_posterior):
  # Student ability example
  p_high_ability = 0.30 # Prior: 30% of students have high ability
  p_good_grade_given_high = 0.90 # Likelihood: 90% of high ability students get good grades
@@ -214,13 +196,7 @@ def _(calculate_posterior):
  p_good_grade_given_high,
  p_good_grade_given_low
  )
- return (
- p_good_grade_given_high,
- p_good_grade_given_low,
- p_high_ability,
- student_evidence,
- student_posterior,
- )


  @app.cell
@@ -250,7 +226,7 @@ def _(mo, student_posterior):


  @app.cell
- def _(calculate_posterior):
  # Cell phone location example
  p_location_a = 0.25 # Prior probability of being in location A
  p_strong_signal_at_a = 0.85 # Likelihood of strong signal at A
@@ -261,13 +237,7 @@ def _(calculate_posterior):
  p_strong_signal_at_a,
  p_strong_signal_elsewhere
  )
- return (
- location_evidence,
- location_posterior,
- p_location_a,
- p_strong_signal_at_a,
- p_strong_signal_elsewhere,
- )


  @app.cell
@@ -298,7 +268,9 @@ def _(location_posterior, mo):

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(r"""## Interactive example""")
  return


@@ -394,87 +366,79 @@ def _(

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Applications in Computer Science

- Bayes' Theorem is fundamental in many computing applications:

- 1. **Spam Filtering**

- - $P(\text{Spam}|\text{Words})$ = Probability email is spam given its words
- - Updates as new emails are classified

- 2. **Machine Learning**

- - Naive Bayes classifiers
- - Probabilistic graphical models
- - Bayesian neural networks

- 3. **Computer Vision**

- - Object detection confidence
- - Face recognition systems
- - Image classification
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- """
- ## 🤔 Test Your Understanding

- Pick which of these statements about Bayes' Theorem you think are correct:

- <details>
- <summary>The posterior probability will always be larger than the prior probability</summary>
- ❌ Incorrect! Evidence can either increase or decrease our belief in the hypothesis. For example, a negative medical test decreases the probability of having a disease.
- </details>

- <details>
- <summary>If the likelihood is 0.9 and the prior is 0.5, then the posterior must equal 0.9</summary>
- ❌ Incorrect! We also need the false positive rate to calculate the posterior probability. The likelihood alone doesn't determine the posterior.
- </details>

- <details>
- <summary>The denominator acts as a normalizing constant to ensure the posterior is a valid probability</summary>
- ✅ Correct! The denominator ensures the posterior probability is between 0 and 1 by considering all ways the evidence could occur.
- </details>
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- """
- ## Summary

- You've learned:

- - The components and intuition behind Bayes' Theorem
- - How to update probabilities when new evidence arrives
- - Why posterior probabilities can be counterintuitive
- - Real-world applications in computer science

- In the next lesson, we'll explore Random Variables, which help us work with numerical outcomes in probability.
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ### Appendix
- Below (hidden) cell blocks are responsible for the interactive example above
- """
- )
  return


  import marimo

+ __generated_with = "0.18.4"
  app = marimo.App(width="medium", app_title="Bayes Theorem")


  def _():
  import matplotlib.pyplot as plt
  import numpy as np
+ return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ # Bayes' Theorem

+ _This notebook is a computational companion to the book ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part1/bayes_theorem/), by Stanford professor Chris Piech._

+ In the 1740s, an English minister named Thomas Bayes discovered a profound mathematical relationship that would revolutionize how we reason about uncertainty. His theorem provides an elegant framework for calculating the probability of a hypothesis being true given observed evidence.

+ At its core, Bayes' Theorem connects two different types of probabilities: the probability of a hypothesis given evidence $P(H|E)$, and its reverse - the probability of evidence given a hypothesis $P(E|H)$. This relationship is particularly powerful because it allows us to compute difficult probabilities using ones that are easier to measure.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## The Heart of Bayesian Reasoning

+ The fundamental insight of Bayes' Theorem lies in its ability to relate what we want to know with what we can measure. When we observe evidence $E$, we often want to know the probability of a hypothesis $H$ being true. However, it's typically much easier to measure how likely we are to observe the evidence when we know the hypothesis is true.

+ This reversal of perspective - from $P(H|E)$ to $P(E|H)$ - is powerful because it lets us:
+ 1. Start with what we know (prior beliefs)
+ 2. Use easily measurable relationships (likelihood)
+ 3. Update our beliefs with new evidence

+ This approach mirrors both how humans naturally learn and the scientific method: we begin with prior beliefs, gather evidence, and update our understanding based on that evidence. This makes Bayes' Theorem not just a mathematical tool, but a framework for rational thinking.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## The Formula

+ Bayes' Theorem states:

+ $P(H|E) = \frac{P(E|H)P(H)}{P(E)}$

+ Where:

+ - $P(H|E)$ is the **posterior probability** - probability of hypothesis H given evidence E
+ - $P(E|H)$ is the **likelihood** - probability of evidence E given hypothesis H
+ - $P(H)$ is the **prior probability** - initial probability of hypothesis H
+ - $P(E)$ is the **evidence** - total probability of observing evidence E

+ The denominator $P(E)$ can be expanded using the [Law of Total Probability](https://marimo.app/gh/marimo-team/learn/main?entrypoint=probability%2F07_law_of_total_probability.py):

+ $P(E) = P(E|H)P(H) + P(E|H^c)P(H^c)$
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Understanding Each Component
+
+ ### 1. Prior Probability - $P(H)$
+ - Initial belief about hypothesis before seeing evidence
+ - Based on previous knowledge or assumptions
+ - Example: Probability of having a disease before any tests
+
+ ### 2. Likelihood - $P(E|H)$
+ - Probability of evidence given hypothesis is true
+ - Often known from data or scientific studies
+ - Example: Probability of positive test given disease present
+
+ ### 3. Evidence - $P(E)$
+ - Total probability of observing the evidence
+ - Acts as a normalizing constant
+ - Can be calculated using Law of Total Probability
+
+ ### 4. Posterior - $P(H|E)$
+ - Updated probability after considering evidence
+ - Combines prior knowledge with new evidence
+ - Becomes new prior for future updates
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Real-World Examples

+ ### 1. Medical Testing
+ - **Want to know**: $P(\text{Disease}|\text{Positive})$ - Probability of disease given positive test
+ - **Easy to know**: $P(\text{Positive}|\text{Disease})$ - Test accuracy for sick people
+ - **Causality**: Disease causes test results, not vice versa

+ ### 2. Student Ability
+ - **Want to know**: $P(\text{High Ability}|\text{Good Grade})$ - Probability student is skilled given good grade
+ - **Easy to know**: $P(\text{Good Grade}|\text{High Ability})$ - Probability good students get good grades
+ - **Causality**: Ability influences grades, not vice versa

+ ### 3. Cell Phone Location
+ - **Want to know**: $P(\text{Location}|\text{Signal Strength})$ - Probability of phone location given signal
+ - **Easy to know**: $P(\text{Signal Strength}|\text{Location})$ - Signal strength at known locations
+ - **Causality**: Location determines signal strength, not vice versa

+ These examples highlight a common pattern: what we want to know (posterior) is harder to measure directly than its reverse (likelihood).
+ """)
  return


+ @app.function
+ def calculate_posterior(prior, likelihood, false_positive_rate):
+ # Calculate P(E) using Law of Total Probability
+ p_e = likelihood * prior + false_positive_rate * (1 - prior)

+ # Calculate posterior using Bayes' Theorem
+ posterior = (likelihood * prior) / p_e
+ return posterior, p_e


  @app.cell
+ def _():
  # Medical test example
  p_disease = 0.01 # Prior: 1% have the disease
  p_positive_given_disease = 0.95 # Likelihood: 95% test accuracy

  p_positive_given_disease,
  p_positive_given_healthy
  )
+ return (medical_posterior,)


  @app.cell


  @app.cell
+ def _():
  # Student ability example
  p_high_ability = 0.30 # Prior: 30% of students have high ability
  p_good_grade_given_high = 0.90 # Likelihood: 90% of high ability students get good grades

  p_good_grade_given_high,
  p_good_grade_given_low
  )
+ return (student_posterior,)


  @app.cell


  @app.cell
+ def _():
  # Cell phone location example
  p_location_a = 0.25 # Prior probability of being in location A
  p_strong_signal_at_a = 0.85 # Likelihood of strong signal at A

  p_strong_signal_at_a,
  p_strong_signal_elsewhere
  )
+ return (location_posterior,)


  @app.cell


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Interactive example
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ## Applications in Computer Science

+ Bayes' Theorem is fundamental in many computing applications:

+ 1. **Spam Filtering**

+ - $P(\text{Spam}|\text{Words})$ = Probability email is spam given its words
+ - Updates as new emails are classified

+ 2. **Machine Learning**

+ - Naive Bayes classifiers
+ - Probabilistic graphical models
+ - Bayesian neural networks

+ 3. **Computer Vision**

+ - Object detection confidence
+ - Face recognition systems
+ - Image classification
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md("""
+ ## 🤔 Test Your Understanding

+ Pick which of these statements about Bayes' Theorem you think are correct:

+ <details>
+ <summary>The posterior probability will always be larger than the prior probability</summary>
+ ❌ Incorrect! Evidence can either increase or decrease our belief in the hypothesis. For example, a negative medical test decreases the probability of having a disease.
+ </details>

+ <details>
+ <summary>If the likelihood is 0.9 and the prior is 0.5, then the posterior must equal 0.9</summary>
+ ❌ Incorrect! We also need the false positive rate to calculate the posterior probability. The likelihood alone doesn't determine the posterior.
+ </details>

+ <details>
+ <summary>The denominator acts as a normalizing constant to ensure the posterior is a valid probability</summary>
+ ✅ Correct! The denominator ensures the posterior probability is between 0 and 1 by considering all ways the evidence could occur.
+ </details>
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md("""
+ ## Summary

+ You've learned:

+ - The components and intuition behind Bayes' Theorem
+ - How to update probabilities when new evidence arrives
+ - Why posterior probabilities can be counterintuitive
+ - Real-world applications in computer science

+ In the next lesson, we'll explore Random Variables, which help us work with numerical outcomes in probability.
+ """)
  return


  @app.cell(hide_code=True)
  def _(mo):
+ mo.md(r"""
+ ### Appendix
+ The (hidden) cells below implement the interactive example above
+ """)
  return
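The `calculate_posterior` helper promoted to `@app.function` above can be exercised on its own. A standalone sketch reusing the medical-test prior and likelihood from the diff; the 10% false positive rate is an assumed value for illustration (the actual `p_positive_given_healthy` value is not visible in this hunk). It also shows the "becomes new prior for future updates" point: conditioning on a second positive test.

```python
# Standalone sketch mirroring the calculate_posterior helper above.
def calculate_posterior(prior, likelihood, false_positive_rate):
    # P(E) via the Law of Total Probability
    p_e = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' Theorem
    return (likelihood * prior) / p_e, p_e

# Medical-test prior and likelihood; 10% false positive rate assumed.
posterior1, _ = calculate_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.10)

# The posterior becomes the prior for a second positive test
posterior2, _ = calculate_posterior(prior=posterior1, likelihood=0.95, false_positive_rate=0.10)

print(round(posterior1, 3), round(posterior2, 3))  # 0.088 0.477
```

Even with a 95%-accurate test, one positive result only lifts the probability of disease to about 9%; a second independent positive is what pushes it near 48%.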
 
probability/09_random_variables.py CHANGED
@@ -10,7 +10,7 @@

  import marimo

- __generated_with = "0.11.10"
  app = marimo.App(width="medium", app_title="Random Variables")


@@ -30,90 +30,82 @@ def _():

  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- # Random Variables

- _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/rvs/), by Stanford professor Chris Piech._

- Random variables are functions that map outcomes from a probability space to numbers. This mathematical abstraction allows us to:

- - Work with numerical outcomes in probability
- - Calculate expected values and variances
- - Model real-world phenomena quantitatively
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Types of Random Variables
-
- ### Discrete Random Variables
- - Take on countable values (finite or infinite)
- - Described by a probability mass function (PMF)
- - Example: Number of heads in 3 coin flips
-
- ### Continuous Random Variables
- - Take on uncountable values in an interval
- - Described by a probability density function (PDF)
- - Example: Height of a randomly selected person
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Properties of Random Variables
-
- Each random variable has several key properties:
-
- | Property | Description | Example |
- |----------|-------------|---------|
- | Meaning | Semantic description | Number of successes in n trials |
- | Symbol | Notation used | $X$, $Y$, $Z$ |
- | Support/Range | Possible values | $\{0,1,2,...,n\}$ for binomial |
- | Distribution | PMF or PDF | $p_X(x)$ or $f_X(x)$ |
- | Expectation | Weighted average | $E[X]$ |
- | Variance | Measure of spread | $\text{Var}(X)$ |
- | Standard Deviation | Square root of variance | $\sigma_X$ |
- | Mode | Most likely value | argmax$_x$ $p_X(x)$ |
-
- Additional properties include:
-
- - [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) (measure of uncertainty)
- - [Median](https://en.wikipedia.org/wiki/Median) (middle value)
- - [Skewness](https://en.wikipedia.org/wiki/Skewness) (asymmetry measure)
- - [Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) (tail heaviness measure)
- """
- )
  return


  @app.cell(hide_code=True)
  def _(mo):
- mo.md(
- r"""
- ## Probability Mass Functions (PMF)

- For discrete random variables, the PMF $p_X(x)$ gives the probability that $X$ equals $x$:

- $p_X(x) = P(X = x)$

- Properties of a PMF:

- 1. $p_X(x) \geq 0$ for all $x$
- 2. $\sum_x p_X(x) = 1$

- Let's implement a PMF for rolling a fair die:
- """
- )
  return
 
@@ -135,27 +127,25 @@ def _(np, plt):
135
  plt.ylabel("Probability")
136
  plt.grid(True, alpha=0.3)
137
  plt.gca()
138
- return die_pmf, probabilities
139
 
140
 
141
  @app.cell(hide_code=True)
142
  def _(mo):
143
- mo.md(
144
- r"""
145
- ## Probability Density Functions (PDF)
146
 
147
- For continuous random variables, we use a PDF $f_X(x)$. The probability of $X$ falling in an interval $[a,b]$ is:
148
 
149
- $P(a \leq X \leq b) = \int_a^b f_X(x)dx$
150
 
151
- Properties of a PDF:
152
 
153
- 1. $f_X(x) \geq 0$ for all $x$
154
- 2. $\int_{-\infty}^{\infty} f_X(x)dx = 1$
155
 
156
- Let's look at the normal distribution, a common continuous random variable:
157
- """
158
- )
159
  return
160
 
161
 
@@ -178,24 +168,22 @@ def _(np, plt, stats):
178
 
179
  @app.cell(hide_code=True)
180
  def _(mo):
181
- mo.md(
182
- r"""
183
- ## Expected Value
184
 
185
- The expected value $E[X]$ is the long-run average of a random variable.
186
 
187
- For discrete random variables:
188
- $E[X] = \sum_x x \cdot p_X(x)$
189
 
190
- For continuous random variables:
191
- $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)dx$
192
 
193
- Properties:
194
 
195
- 1. $E[aX + b] = aE[X] + b$
196
- 2. $E[X + Y] = E[X] + E[Y]$
197
- """
198
- )
199
  return
200
 
201
 
@@ -209,7 +197,7 @@ def _(np):
209
  die_probs = np.ones(6) / 6
210
 
211
  E_X = expected_value_discrete(die_values, die_probs)
212
- return E_X, die_probs, die_values, expected_value_discrete
213
 
214
 
215
  @app.cell
@@ -220,23 +208,21 @@ def _(E_X):
220
 
221
  @app.cell(hide_code=True)
222
  def _(mo):
223
- mo.md(
224
- r"""
225
- ## Variance
226
 
227
- The variance $\text{Var}(X)$ measures the spread of a random variable around its mean:
228
 
229
- $\text{Var}(X) = E[(X - E[X])^2]$
230
 
231
- This can be computed as:
232
- $\text{Var}(X) = E[X^2] - (E[X])^2$
233
 
234
- Properties:
235
 
236
- 1. $\text{Var}(aX) = a^2Var(X)$
237
- 2. $\text{Var}(X + b) = Var(X)$
238
- """
239
- )
240
  return
241
 
242
 
@@ -278,7 +264,7 @@ def _(variance_discrete):
278
  coin_probs = [0.5, 0.5]
279
  coin_mean = sum(x * p for x, p in zip(coin_values, coin_probs))
280
  coin_var = variance_discrete(coin_values, coin_probs, coin_mean)
281
- return coin_mean, coin_probs, coin_values, coin_var
282
 
283
 
284
  @app.cell
@@ -289,7 +275,7 @@ def _(np, stats, variance_discrete):
289
  normal_probs = normal_probs / sum(normal_probs) # normalize
290
  normal_mean = 0
291
  normal_var = variance_discrete(normal_values, normal_probs, normal_mean)
292
- return normal_mean, normal_probs, normal_values, normal_var
293
 
294
 
295
  @app.cell
@@ -299,7 +285,7 @@ def _(np, variance_discrete):
299
  uniform_probs = np.ones_like(uniform_values) / len(uniform_values)
300
  uniform_mean = 0.5
301
  uniform_var = variance_discrete(uniform_values, uniform_probs, uniform_mean)
302
- return uniform_mean, uniform_probs, uniform_values, uniform_var
303
 
304
 
305
  @app.cell(hide_code=True)
@@ -318,44 +304,40 @@ def _(coin_var, mo, normal_var, uniform_var):
318
 
319
  @app.cell(hide_code=True)
320
  def _(mo):
321
- mo.md(
322
- r"""
323
- ## Common Distributions
324
 
325
- 1. Bernoulli Distribution
326
- - Models a single success/failure experiment
327
- - $P(X = 1) = p$, $P(X = 0) = 1-p$
328
- - $E[X] = p$, $\text{Var}(X) = p(1-p)$
329
 
330
- 2. Binomial Distribution
331
 
332
- - Models number of successes in $n$ independent trials
333
- - $P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$
334
- - $E[X] = np$, $\text{Var}(X) = np(1-p)$
335
 
336
- 3. Normal Distribution
337
 
338
- - Bell-shaped curve defined by mean $\mu$ and variance $\sigma^2$
339
- - PDF: $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
340
- - $E[X] = \mu$, $\text{Var}(X) = \sigma^2$
341
- """
342
- )
343
  return
344
 
345
 
346
  @app.cell(hide_code=True)
347
  def _(mo):
348
- mo.md(
349
- r"""
350
- ### Example: Comparing Discrete and Continuous Distributions
351
 
352
- This example shows the relationship between a Binomial distribution (discrete) and its Normal approximation (continuous).
353
- The parameters control both distributions:
354
 
355
- - **Number of Trials**: Controls the range of possible values and the shape's width
356
- - **Success Probability**: Affects the distribution's center and skewness
357
- """
358
- )
359
  return
360
 
361
 
@@ -405,7 +387,7 @@ def _(n_trials, np, p_success, plt, stats):
405
 
406
  plt.tight_layout()
407
  plt.gca()
408
- return ax1, ax2, fig, k, mu, pdf, pmf, sigma, x
409
 
410
 
411
  @app.cell(hide_code=True)
@@ -426,42 +408,40 @@ def _(mo, n_trials, np, p_success):
426
 
427
  @app.cell(hide_code=True)
428
  def _(mo):
429
- mo.md(
430
- r"""
431
- ## Practice Problems
432
-
433
- ### Problem 1: Discrete Random Variable
434
- Let $X$ be the sum when rolling two fair dice. Find:
435
-
436
- 1. The support of $X$
437
- 2. The PMF $p_X(x)$
438
- 3. $E[X]$ and $\text{Var}(X)$
439
-
440
- <details>
441
- <summary>Solution</summary>
442
- Let's solve this step by step:
443
- ```python
444
- def two_dice_pmf(x):
445
- outcomes = [(i,j) for i in range(1,7) for j in range(1,7)]
446
- favorable = [pair for pair in outcomes if sum(pair) == x]
447
- return len(favorable)/36
448
-
449
- # Support: {2,3,...,12}
450
- # E[X] = 7
451
- # Var(X) = 5.83
452
- ```
453
- </details>
454
-
455
- ### Problem 2: Continuous Random Variable
456
- For a uniform random variable on $[0,1]$, verify that:
457
-
458
- 1. The PDF integrates to 1
459
- 2. $E[X] = 1/2$
460
- 3. $\text{Var}(X) = 1/12$
461
-
462
- Try solving this yourself first, then check the solution below.
463
- """
464
- )
465
  return
466
 
467
 
@@ -479,72 +459,66 @@ def _(mktext, mo):
479
 
480
  @app.cell(hide_code=True)
481
  def _(mo):
482
- mktext = mo.md(
483
- r"""
484
- Let's solve each part:
485
 
486
- 1. **PDF integrates to 1**:
487
- $\int_0^1 1 \, dx = [x]_0^1 = 1 - 0 = 1$
488
 
489
- 2. **Expected Value**:
490
- $E[X] = \int_0^1 x \cdot 1 \, dx = [\frac{x^2}{2}]_0^1 = \frac{1}{2} - 0 = \frac{1}{2}$
491
 
492
- 3. **Variance**:
493
- $\text{Var}(X) = E[X^2] - (E[X])^2$
494
 
495
- First calculate $E[X^2]$:
496
- $E[X^2] = \int_0^1 x^2 \cdot 1 \, dx = [\frac{x^3}{3}]_0^1 = \frac{1}{3}$
497
 
498
- Then:
499
- $\text{Var}(X) = \frac{1}{3} - (\frac{1}{2})^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$
500
- """
501
- )
502
  return (mktext,)
503
 
504
 
505
  @app.cell(hide_code=True)
506
  def _(mo):
507
- mo.md(
508
- r"""
509
- ## 🤔 Test Your Understanding
510
 
511
- Pick which of these statements about random variables you think are correct:
512
 
513
- <details>
514
- <summary>The probability density function can be greater than 1</summary>
515
- ✅ Correct! Unlike PMFs, PDFs can exceed 1 as long as the total area equals 1.
516
- </details>
517
 
518
- <details>
519
- <summary>The expected value of a random variable must equal one of its possible values</summary>
520
- ❌ Incorrect! For example, the expected value of a fair die is 3.5, which is not a possible outcome.
521
- </details>
522
 
523
- <details>
524
- <summary>Adding a constant to a random variable changes its variance</summary>
525
- ❌ Incorrect! Adding a constant shifts the distribution but doesn't affect its spread.
526
- </details>
527
- """
528
- )
529
  return
530
 
531
 
532
  @app.cell(hide_code=True)
533
  def _(mo):
534
- mo.md(
535
- """
536
- ## Summary
537
 
538
- You've learned:
539
 
540
- - The difference between discrete and continuous random variables
541
- - How PMFs and PDFs describe probability distributions
542
- - Methods for calculating expected values and variances
543
- - Properties of common probability distributions
544
 
545
- In the next lesson, we'll explore Probability Mass Functions in detail, focusing on their properties and applications.
546
- """
547
- )
548
  return
549
 
550
 
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App(width="medium", app_title="Random Variables")
15
 
16
 
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
+ mo.md(r"""
34
+ # Random Variables
 
35
 
36
+ _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/rvs/), by Stanford professor Chris Piech._
37
 
38
+ Random variables are functions that map outcomes from a probability space to numbers. This mathematical abstraction allows us to:
39
 
40
+ - Work with numerical outcomes in probability
41
+ - Calculate expected values and variances
42
+ - Model real-world phenomena quantitatively
43
+ """)
 
44
  return
45
 
46
 
47
  @app.cell(hide_code=True)
48
  def _(mo):
49
+ mo.md(r"""
50
+ ## Types of Random Variables
51
+
52
+ ### Discrete Random Variables
53
+ - Take on countable values (finite or infinite)
54
+ - Described by a probability mass function (PMF)
55
+ - Example: Number of heads in 3 coin flips
56
+
57
+ ### Continuous Random Variables
58
+ - Take on uncountable values in an interval
59
+ - Described by a probability density function (PDF)
60
+ - Example: Height of a randomly selected person
61
+ """)
 
 
62
  return
63
 
64
 
65
  @app.cell(hide_code=True)
66
  def _(mo):
67
+ mo.md(r"""
68
+ ## Properties of Random Variables
69
+
70
+ Each random variable has several key properties:
71
+
72
+ | Property | Description | Example |
73
+ |----------|-------------|---------|
74
+ | Meaning | Semantic description | Number of successes in n trials |
75
+ | Symbol | Notation used | $X$, $Y$, $Z$ |
76
+ | Support/Range | Possible values | $\{0,1,2,...,n\}$ for binomial |
77
+ | Distribution | PMF or PDF | $p_X(x)$ or $f_X(x)$ |
78
+ | Expectation | Weighted average | $E[X]$ |
79
+ | Variance | Measure of spread | $\text{Var}(X)$ |
80
+ | Standard Deviation | Square root of variance | $\sigma_X$ |
81
+ | Mode | Most likely value | $\arg\max_x p_X(x)$ |
82
+
83
+ Additional properties include:
84
+
85
+ - [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) (measure of uncertainty)
86
+ - [Median](https://en.wikipedia.org/wiki/Median) (middle value)
87
+ - [Skewness](https://en.wikipedia.org/wiki/Skewness) (asymmetry measure)
88
+ - [Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) (tail heaviness measure)
89
+ """)
 
 
90
  return
91
 
92
 
93
  @app.cell(hide_code=True)
94
  def _(mo):
95
+ mo.md(r"""
96
+ ## Probability Mass Functions (PMF)
 
97
 
98
+ For discrete random variables, the PMF $p_X(x)$ gives the probability that $X$ equals $x$:
99
 
100
+ $p_X(x) = P(X = x)$
101
 
102
+ Properties of a PMF:
103
 
104
+ 1. $p_X(x) \geq 0$ for all $x$
105
+ 2. $\sum_x p_X(x) = 1$
106
 
107
+ Let's implement a PMF for rolling a fair die:
108
+ """)
 
109
  return
110
 
111
 
 
127
  plt.ylabel("Probability")
128
  plt.grid(True, alpha=0.3)
129
  plt.gca()
130
+ return
131
 
132
 
133
  @app.cell(hide_code=True)
134
  def _(mo):
135
+ mo.md(r"""
136
+ ## Probability Density Functions (PDF)
 
137
 
138
+ For continuous random variables, we use a PDF $f_X(x)$. The probability of $X$ falling in an interval $[a,b]$ is:
139
 
140
+ $P(a \leq X \leq b) = \int_a^b f_X(x)dx$
141
 
142
+ Properties of a PDF:
143
 
144
+ 1. $f_X(x) \geq 0$ for all $x$
145
+ 2. $\int_{-\infty}^{\infty} f_X(x)dx = 1$
146
 
147
+ Let's look at the normal distribution, a common continuous random variable:
148
+ """)
 
149
  return
150
 
151
 
 
168
 
169
  @app.cell(hide_code=True)
170
  def _(mo):
171
+ mo.md(r"""
172
+ ## Expected Value
 
173
 
174
+ The expected value $E[X]$ is the long-run average of a random variable.
175
 
176
+ For discrete random variables:
177
+ $E[X] = \sum_x x \cdot p_X(x)$
178
 
179
+ For continuous random variables:
180
+ $E[X] = \int_{-\infty}^{\infty} x \cdot f_X(x)dx$
181
 
182
+ Properties:
183
 
184
+ 1. $E[aX + b] = aE[X] + b$
185
+ 2. $E[X + Y] = E[X] + E[Y]$
186
+ """)
 
187
  return
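The two linearity properties above can be spot-checked numerically. A minimal sketch, assuming a fair six-sided die; the constants `a = 2` and `b = 1` are illustrative choices, not from the notebook:

```python
import numpy as np

# Fair six-sided die: values 1..6, each with probability 1/6
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

E_X = np.sum(values * probs)  # E[X] = 3.5

# Property 1: E[aX + b] = a*E[X] + b, checked for a = 2, b = 1
a, b = 2, 1
E_aXb = np.sum((a * values + b) * probs)
assert np.isclose(E_aXb, a * E_X + b)

# Property 2: E[X + Y] = E[X] + E[Y], demonstrated with two independent dice
E_sum = sum((x + y) * (1 / 36) for x in values for y in values)
assert np.isclose(E_sum, 2 * E_X)
```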
188
 
189
 
 
197
  die_probs = np.ones(6) / 6
198
 
199
  E_X = expected_value_discrete(die_values, die_probs)
200
+ return E_X, die_probs, die_values
201
 
202
 
203
  @app.cell
 
208
 
209
  @app.cell(hide_code=True)
210
  def _(mo):
211
+ mo.md(r"""
212
+ ## Variance
 
213
 
214
+ The variance $\text{Var}(X)$ measures the spread of a random variable around its mean:
215
 
216
+ $\text{Var}(X) = E[(X - E[X])^2]$
217
 
218
+ This can be computed as:
219
+ $\text{Var}(X) = E[X^2] - (E[X])^2$
220
 
221
+ Properties:
222
 
223
+ 1. $\text{Var}(aX) = a^2\,\text{Var}(X)$
224
+ 2. $\text{Var}(X + b) = \text{Var}(X)$
225
+ """)
 
226
  return
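As with expectation, the two variance properties can be verified numerically. A hedged sketch using the same fair die; the constants `a = 3` and `b = 10` are illustrative choices:

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

mean = np.sum(values * probs)
var_X = np.sum((values - mean) ** 2 * probs)  # Var(X) = 35/12

# Var(aX) = a^2 * Var(X), checked for a = 3
a = 3
var_aX = np.sum((a * values - a * mean) ** 2 * probs)
assert np.isclose(var_aX, a**2 * var_X)

# Var(X + b) = Var(X), checked for b = 10
b = 10
var_Xb = np.sum(((values + b) - (mean + b)) ** 2 * probs)
assert np.isclose(var_Xb, var_X)
```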
227
 
228
 
 
264
  coin_probs = [0.5, 0.5]
265
  coin_mean = sum(x * p for x, p in zip(coin_values, coin_probs))
266
  coin_var = variance_discrete(coin_values, coin_probs, coin_mean)
267
+ return (coin_var,)
268
 
269
 
270
  @app.cell
 
275
  normal_probs = normal_probs / sum(normal_probs) # normalize
276
  normal_mean = 0
277
  normal_var = variance_discrete(normal_values, normal_probs, normal_mean)
278
+ return (normal_var,)
279
 
280
 
281
  @app.cell
 
285
  uniform_probs = np.ones_like(uniform_values) / len(uniform_values)
286
  uniform_mean = 0.5
287
  uniform_var = variance_discrete(uniform_values, uniform_probs, uniform_mean)
288
+ return (uniform_var,)
289
 
290
 
291
  @app.cell(hide_code=True)
 
304
 
305
  @app.cell(hide_code=True)
306
  def _(mo):
307
+ mo.md(r"""
308
+ ## Common Distributions
 
309
 
310
+ 1. Bernoulli Distribution
311
+ - Models a single success/failure experiment
312
+ - $P(X = 1) = p$, $P(X = 0) = 1-p$
313
+ - $E[X] = p$, $\text{Var}(X) = p(1-p)$
314
 
315
+ 2. Binomial Distribution
316
 
317
+ - Models number of successes in $n$ independent trials
318
+ - $P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$
319
+ - $E[X] = np$, $\text{Var}(X) = np(1-p)$
320
 
321
+ 3. Normal Distribution
322
 
323
+ - Bell-shaped curve defined by mean $\mu$ and variance $\sigma^2$
324
+ - PDF: $f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
325
+ - $E[X] = \mu$, $\text{Var}(X) = \sigma^2$
326
+ """)
 
327
  return
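The stated means and variances can be cross-checked against `scipy.stats` (already imported elsewhere in this notebook); the parameter values below are illustrative choices:

```python
from scipy import stats

# Bernoulli(p): E[X] = p, Var(X) = p(1 - p)
p = 0.3
bern = stats.bernoulli(p)
assert abs(bern.mean() - p) < 1e-9
assert abs(bern.var() - p * (1 - p)) < 1e-9

# Binomial(n, p): E[X] = np, Var(X) = np(1 - p)
n = 10
binom = stats.binom(n, p)
assert abs(binom.mean() - n * p) < 1e-9
assert abs(binom.var() - n * p * (1 - p)) < 1e-9

# Normal(mu, sigma): E[X] = mu, Var(X) = sigma^2
mu, sigma = 1.5, 2.0
norm = stats.norm(mu, sigma)
assert abs(norm.mean() - mu) < 1e-9
assert abs(norm.var() - sigma**2) < 1e-9
```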
328
 
329
 
330
  @app.cell(hide_code=True)
331
  def _(mo):
332
+ mo.md(r"""
333
+ ### Example: Comparing Discrete and Continuous Distributions
 
334
 
335
+ This example shows the relationship between a Binomial distribution (discrete) and its Normal approximation (continuous).
336
+ The parameters control both distributions:
337
 
338
+ - **Number of Trials**: Controls the range of possible values and the shape's width
339
+ - **Success Probability**: Affects the distribution's center and skewness
340
+ """)
 
341
  return
342
 
343
 
 
387
 
388
  plt.tight_layout()
389
  plt.gca()
390
+ return
391
 
392
 
393
  @app.cell(hide_code=True)
 
408
 
409
  @app.cell(hide_code=True)
410
  def _(mo):
411
+ mo.md(r"""
412
+ ## Practice Problems
413
+
414
+ ### Problem 1: Discrete Random Variable
415
+ Let $X$ be the sum when rolling two fair dice. Find:
416
+
417
+ 1. The support of $X$
418
+ 2. The PMF $p_X(x)$
419
+ 3. $E[X]$ and $\text{Var}(X)$
420
+
421
+ <details>
422
+ <summary>Solution</summary>
423
+ Let's solve this step by step:
424
+ ```python
425
+ def two_dice_pmf(x):
426
+ outcomes = [(i,j) for i in range(1,7) for j in range(1,7)]
427
+ favorable = [pair for pair in outcomes if sum(pair) == x]
428
+ return len(favorable)/36
429
+
430
+ # Support: {2,3,...,12}
431
+ # E[X] = 7
432
+ # Var(X) = 5.83
433
+ ```
434
+ </details>
435
+
436
+ ### Problem 2: Continuous Random Variable
437
+ For a uniform random variable on $[0,1]$, verify that:
438
+
439
+ 1. The PDF integrates to 1
440
+ 2. $E[X] = 1/2$
441
+ 3. $\text{Var}(X) = 1/12$
442
+
443
+ Try solving this yourself first, then check the solution below.
444
+ """)
 
 
445
  return
446
 
447
 
 
459
 
460
  @app.cell(hide_code=True)
461
  def _(mo):
462
+ mktext = mo.md(r"""
463
+ Let's solve each part:
 
464
 
465
+ 1. **PDF integrates to 1**:
466
+ $\int_0^1 1 \, dx = [x]_0^1 = 1 - 0 = 1$
467
 
468
+ 2. **Expected Value**:
469
+ $E[X] = \int_0^1 x \cdot 1 \, dx = [\frac{x^2}{2}]_0^1 = \frac{1}{2} - 0 = \frac{1}{2}$
470
 
471
+ 3. **Variance**:
472
+ $\text{Var}(X) = E[X^2] - (E[X])^2$
473
 
474
+ First calculate $E[X^2]$:
475
+ $E[X^2] = \int_0^1 x^2 \cdot 1 \, dx = [\frac{x^3}{3}]_0^1 = \frac{1}{3}$
476
 
477
+ Then:
478
+ $\text{Var}(X) = \frac{1}{3} - (\frac{1}{2})^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}$
479
+ """)
 
480
  return (mktext,)
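The closed-form results in the solution can also be confirmed numerically. A sketch using a midpoint Riemann sum; the step size `dx` is an arbitrary choice:

```python
import numpy as np

# Uniform(0, 1): f(x) = 1 on [0, 1]. Approximate the three integrals
# with a midpoint Riemann sum (a numerical sketch, not a proof).
dx = 1e-5
x = np.arange(dx / 2, 1, dx)   # midpoints of [0, 1]

total = len(x) * dx            # integral of 1 dx        ~ 1
mean = np.sum(x * dx)          # integral of x dx        ~ 1/2
second = np.sum(x**2 * dx)     # integral of x^2 dx      ~ 1/3
var = second - mean**2         # E[X^2] - (E[X])^2       ~ 1/12
print(total, mean, var)
```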
481
 
482
 
483
  @app.cell(hide_code=True)
484
  def _(mo):
485
+ mo.md(r"""
486
+ ## 🤔 Test Your Understanding
 
487
 
488
+ Pick which of these statements about random variables you think are correct:
489
 
490
+ <details>
491
+ <summary>The probability density function can be greater than 1</summary>
492
+ ✅ Correct! Unlike PMFs, PDFs can exceed 1 as long as the total area equals 1.
493
+ </details>
494
 
495
+ <details>
496
+ <summary>The expected value of a random variable must equal one of its possible values</summary>
497
+ ❌ Incorrect! For example, the expected value of a fair die is 3.5, which is not a possible outcome.
498
+ </details>
499
 
500
+ <details>
501
+ <summary>Adding a constant to a random variable changes its variance</summary>
502
+ ❌ Incorrect! Adding a constant shifts the distribution but doesn't affect its spread.
503
+ </details>
504
+ """)
 
505
  return
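The first statement (that a PDF can exceed 1) is easy to demonstrate concretely. A short sketch using `scipy.stats`, where the interval `[0, 0.5]` is an illustrative choice:

```python
from scipy import stats

# Uniform on [0, 0.5]: the density is 1/0.5 = 2 everywhere on the
# support, yet the total area under the curve is still 2 * 0.5 = 1.
u = stats.uniform(loc=0, scale=0.5)
print(u.pdf(0.25))               # 2.0 -- a density greater than 1
print(u.cdf(0.5) - u.cdf(0.0))   # 1.0 -- total probability is still 1
```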
506
 
507
 
508
  @app.cell(hide_code=True)
509
  def _(mo):
510
+ mo.md("""
511
+ ## Summary
 
512
 
513
+ You've learned:
514
 
515
+ - The difference between discrete and continuous random variables
516
+ - How PMFs and PDFs describe probability distributions
517
+ - Methods for calculating expected values and variances
518
+ - Properties of common probability distributions
519
 
520
+ In the next lesson, we'll explore Probability Mass Functions in detail, focusing on their properties and applications.
521
+ """)
 
522
  return
523
 
524
 
probability/10_probability_mass_function.py CHANGED
@@ -10,57 +10,51 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.12.6"
14
  app = marimo.App(width="medium", app_title="Probability Mass Functions")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
- mo.md(
20
- r"""
21
- # Probability Mass Functions
22
 
23
- _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/pmf/), by Stanford professor Chris Piech._
24
 
25
- PMFs are really important in discrete probability. They tell us how likely each possible outcome is for a discrete random variable.
26
 
27
- What's interesting about PMFs is that they can be represented in multiple ways - equations, graphs, or even empirical data. The core idea is simple: they map each possible value to its probability.
28
- """
29
- )
30
  return
31
 
32
 
33
  @app.cell(hide_code=True)
34
  def _(mo):
35
- mo.md(
36
- r"""
37
- ## Properties of a PMF
38
 
39
- For a function $p_X(x)$ to be a valid PMF:
40
 
41
- 1. **Non-negativity**: probability can't be negative, so $p_X(x) \geq 0$ for all $x$
42
- 2. **Unit total probability**: all probabilities sum to 1, i.e., $\sum_x p_X(x) = 1$
43
 
44
- The second property makes intuitive sense - a random variable must take some value, and the sum of all possibilities should be 100%.
45
- """
46
- )
47
  return
48
 
49
 
50
  @app.cell(hide_code=True)
51
  def _(mo):
52
- mo.md(
53
- r"""
54
- ## PMFs as Graphs
55
 
56
- Let's start by looking at PMFs as graphs where the $x$-axis is the values that the random variable could take on and the $y$-axis is the probability of the random variable taking on said value.
57
 
58
- In the following example, we show two PMFs:
59
 
60
- - On the left: PMF for the random variable $X$ = the value of a single six-sided die roll
61
- - On the right: PMF for the random variable $Y$ = value of the sum of two dice rolls
62
- """
63
- )
64
  return
65
 
66
 
@@ -102,53 +96,39 @@ def _(np, plt):
102
 
103
  plt.tight_layout()
104
  plt.gca()
105
- return (
106
- dice_ax1,
107
- dice_ax2,
108
- dice_fig,
109
- dice_prob,
110
- dice_sum,
111
- single_die_probs,
112
- single_die_values,
113
- two_dice_probs,
114
- two_dice_values,
115
- )
116
 
117
 
118
  @app.cell(hide_code=True)
119
  def _(mo):
120
- mo.md(
121
- r"""
122
- These graphs really show us how likely each value is when we roll the dice.
123
 
124
- looking at the right graph, when we see "6" on the $x$-axis with probability $\frac{5}{36}$ on the $y$-axis, that's telling us there's a $\frac{5}{36}$ chance of rolling a sum of 6 with two dice. or more formally: $P(Y = 6) = \frac{5}{36}$.
125
 
126
- Similarly, the value "2" has probability "$\frac{1}{36}$" - that's because there's only one way to get a sum of 2 (rolling 1 on both dice). and you'll notice there's no value for "1" since you can't get a sum of 1 with two dice - the minimum possible is 2.
127
- """
128
- )
129
  return
130
 
131
 
132
  @app.cell(hide_code=True)
133
  def _(mo):
134
- mo.md(
135
- r"""
136
- ## PMFs as Equations
137
 
138
- Here is the exact same information in equation form:
139
 
140
- For a single die roll $X$:
141
- $$P(X=x) = \frac{1}{6} \quad \text{ if } 1 \leq x \leq 6$$
142
 
143
- For the sum of two dice $Y$:
144
- $$P(Y=y) = \begin{cases}
145
- \frac{(y-1)}{36} & \text{ if } 2 \leq y \leq 7\\
146
- \frac{(13-y)}{36} & \text{ if } 8 \leq y \leq 12
147
- \end{cases}$$
148
 
149
- Let's implement the PMF for $Y$, the sum of two dice, in Python code:
150
- """
151
- )
152
  return
153
 
154
 
@@ -167,12 +147,14 @@ def _():
167
  test_values = [1, 2, 7, 12, 13]
168
  for test_y in test_values:
169
  print(f"P(Y = {test_y}) = {pmf_sum_two_dice(test_y)}")
170
- return pmf_sum_two_dice, test_values, test_y
171
 
172
 
173
  @app.cell(hide_code=True)
174
  def _(mo):
175
- mo.md(r"""Now, let's verify that our PMF satisfies the property that the sum of all probabilities equals 1:""")
 
 
176
  return
177
 
178
 
@@ -183,7 +165,7 @@ def _(pmf_sum_two_dice):
183
  # Round to 10 decimal places to handle floating-point precision
184
  verify_total_prob_rounded = round(verify_total_prob, 10)
185
  print(f"Sum of all probabilities: {verify_total_prob_rounded}")
186
- return verify_total_prob, verify_total_prob_rounded
187
 
188
 
189
  @app.cell(hide_code=True)
@@ -205,18 +187,16 @@ def _(plt, pmf_sum_two_dice):
205
  plt.text(verify_y_values[verify_i], verify_prob + 0.001, f'{verify_prob:.3f}', ha='center')
206
 
207
  plt.gca() # Return the current axes to ensure proper display
208
- return verify_i, verify_prob, verify_probabilities, verify_y_values
209
 
210
 
211
  @app.cell(hide_code=True)
212
  def _(mo):
213
- mo.md(
214
- r"""
215
- ## Data to Histograms to Probability Mass Functions
216
 
217
- Here's something I find interesting — one way to represent a likelihood function is just through raw data. instead of mathematical formulas, we can actually approximate a PMF by collecting data points. let's see this in action by simulating lots of dice rolls and building an empirical PMF:
218
- """
219
- )
220
  return
221
 
222
 
@@ -236,7 +216,7 @@ def _(np):
236
  # Display a small sample of the data
237
  print(f"First 20 dice sums: {sim_dice_sums[:20]}")
238
  print(f"Total number of trials: {sim_num_trials}")
239
- return sim_dice_sums, sim_die1, sim_die2, sim_num_trials
240
 
241
 
242
  @app.cell(hide_code=True)
@@ -296,32 +276,16 @@ def _(collections, np, plt, sim_dice_sums):
296
  plt.text(sim_sorted_values[sim_i], sim_count + 19, str(sim_count), ha='center')
297
 
298
  plt.gca() # Return the current axes to ensure proper display
299
- return (
300
- sim_ax1,
301
- sim_ax2,
302
- sim_count,
303
- sim_counter,
304
- sim_counts,
305
- sim_empirical_pmf,
306
- sim_fig,
307
- sim_i,
308
- sim_prob,
309
- sim_sorted_values,
310
- sim_theoretical_pmf,
311
- sim_theoretical_values,
312
- sim_y,
313
- )
314
 
315
 
316
  @app.cell(hide_code=True)
317
  def _(mo):
318
- mo.md(
319
- r"""
320
- When we normalize a histogram (divide each count by total sample size), we get a pretty good approximation of the true PMF. it's a simple yet powerful idea - count how many times each value appears, then divide by the total number of trials.
321
 
322
- let's make this concrete. say we want to estimate $P(Y=3)$ - the probability of rolling a sum of 3 with two dice. we just count how many 3's show up in our simulated rolls and divide by the total number of rolls:
323
- """
324
- )
325
  return
326
 
327
 
@@ -338,20 +302,18 @@ def _(sim_counter, sim_dice_sums):
338
  print(f"Empirical P(Y=3): {sim_count_of_3}/{len(sim_dice_sums)} = {sim_empirical_prob:.4f}")
339
  print(f"Theoretical P(Y=3): 2/36 = {sim_theoretical_prob:.4f}")
340
  print(f"Difference: {abs(sim_empirical_prob - sim_theoretical_prob):.4f}")
341
- return sim_count_of_3, sim_empirical_prob, sim_theoretical_prob
342
 
343
 
344
  @app.cell(hide_code=True)
345
  def _(mo):
346
- mo.md(
347
- r"""
348
- As we can see, with a large number of trials, the empirical PMF becomes a very good approximation of the theoretical PMF. This is an example of the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) in action.
349
 
350
- ## Interactive Example: Exploring PMFs
351
 
352
- Let's create an interactive tool to explore different PMFs:
353
- """
354
- )
355
  return
356
 
357
 
@@ -482,38 +444,20 @@ def _(dist_param1, dist_param2, dist_selection, np, plt, stats):
482
  bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
483
 
484
  plt.gca() # Return the current axes to ensure proper display
485
- return (
486
- dist_baseline,
487
- dist_lam,
488
- dist_markerline,
489
- dist_max_x,
490
- dist_mean,
491
- dist_n,
492
- dist_p,
493
- dist_pmf_values,
494
- dist_props_text,
495
- dist_std_dev,
496
- dist_stemlines,
497
- dist_title,
498
- dist_variance,
499
- dist_x_label,
500
- dist_x_values,
501
- )
502
 
503
 
504
  @app.cell(hide_code=True)
505
  def _(mo):
506
- mo.md(
507
- r"""
508
- ## Expected Value from a PMF
509
 
510
- The expected value (or mean) of a discrete random variable is calculated using its PMF:
511
 
512
- $$E[X] = \sum_x x \cdot p_X(x)$$
513
 
514
- This represents the long-run average value of the random variable.
515
- """
516
- )
517
  return
518
 
519
 
@@ -527,24 +471,22 @@ def _(dist_pmf_values, dist_x_values):
527
  ev_dist_mean = calc_expected_value(dist_x_values, dist_pmf_values)
528
 
529
  print(f"Expected value: {ev_dist_mean:.4f}")
530
- return calc_expected_value, ev_dist_mean
531
 
532
 
533
  @app.cell(hide_code=True)
534
  def _(mo):
535
- mo.md(
536
- r"""
537
- ## Variance from a PMF
538
 
539
- The variance measures the spread or dispersion of a random variable around its mean:
540
 
541
- $$\text{Var}(X) = E[(X - E[X])^2] = \sum_x (x - E[X])^2 \cdot p_X(x)$$
542
 
543
- An alternative formula is:
544
 
545
- $$\text{Var}(X) = E[X^2] - (E[X])^2 = \sum_x x^2 \cdot p_X(x) - \left(\sum_x x \cdot p_X(x)\right)^2$$
546
- """
547
- )
548
  return
549
 
550
 
@@ -560,22 +502,20 @@ def _(dist_pmf_values, dist_x_values, ev_dist_mean, np):
560
 
561
  print(f"Variance: {var_dist_var:.4f}")
562
  print(f"Standard deviation: {var_dist_std_dev:.4f}")
563
- return calc_variance, var_dist_std_dev, var_dist_var
564
 
565
 
566
  @app.cell(hide_code=True)
567
  def _(mo):
568
- mo.md(
569
- r"""
570
- ## PMF vs. CDF
571
 
572
- The **Cumulative Distribution Function (CDF)** is related to the PMF but gives the probability that the random variable $X$ is less than or equal to a value $x$:
573
 
574
- $$F_X(x) = P(X \leq x) = \sum_{k \leq x} p_X(k)$$
575
 
576
- While the PMF gives the probability mass at each point, the CDF accumulates these probabilities.
577
- """
578
- )
579
  return
580
 
581
 
@@ -612,77 +552,69 @@ def _(dist_pmf_values, dist_x_values, np, plt):
612
 
613
  plt.tight_layout()
614
  plt.gca() # Return the current axes to ensure proper display
615
- return cdf_ax1, cdf_ax2, cdf_dist_values, cdf_fig, x_max, x_min
616
 
617
 
618
  @app.cell(hide_code=True)
619
  def _(mo):
620
- mo.md(
621
- r"""
622
- The graphs above illustrate the key difference between PMF and CDF:
623
 
624
- - **PMF (left)**: Shows the probability of the random variable taking each specific value: P(X = x)
625
- - **CDF (right)**: Shows the probability of the random variable being less than or equal to each value: P(X ≤ x)
626
 
627
- The CDF at any point is the sum of all PMF values up to and including that point. This is why the CDF is always non-decreasing and eventually reaches 1. For discrete distributions like this one, the CDF forms a step function that jumps at each value in the support of the random variable.
628
- """
629
- )
630
  return
631
 
632
 
633
  @app.cell(hide_code=True)
634
  def _(mo):
635
- mo.md(
636
- r"""
637
- ## Test Your Understanding
638
-
639
- Choose what you believe are the correct options in the questions below:
640
-
641
- <details>
642
- <summary>If X is a discrete random variable with PMF p(x), then p(x) must always be less than 1</summary>
643
- ❌ False! While most values in a PMF are typically less than 1, a PMF can have p(x) = 1 for a specific value if the random variable always takes that value (with 100% probability).
644
- </details>
645
-
646
- <details>
647
- <summary>The sum of all probabilities in a PMF must equal exactly 1</summary>
648
- ✅ True! This is a fundamental property of any valid PMF. The total probability across all possible values must be 1, as the random variable must take some value.
649
- </details>
650
-
651
- <details>
652
- <summary>A PMF can be estimated from data by creating a normalized histogram</summary>
653
- ✅ True! Counting the frequency of each value and dividing by the total number of observations gives an empirical PMF.
654
- </details>
655
-
656
- <details>
657
- <summary>The expected value of a discrete random variable is always one of the possible values of the variable</summary>
658
- ❌ False! The expected value is a weighted average and may not be a value the random variable can actually take. For example, the expected value of a fair die roll is 3.5, which is not a possible outcome.
659
- </details>
660
- """
661
- )
662
  return
663
 
664
 
665
  @app.cell(hide_code=True)
666
  def _(mo):
667
- mo.md(
668
- r"""
669
- ## Practical Applications of PMFs
670
 
671
- PMFs pop up everywhere - network engineers use them to model traffic patterns, reliability teams predict equipment failures, and marketers analyze purchase behavior. In finance, they help price options; in gaming, they're behind every dice roll. Machine learning algorithms like Naive Bayes rely on them, and they're essential for modeling rare events like genetic mutations or system failures.
672
- """
673
- )
674
  return
675
 
676
 
677
  @app.cell(hide_code=True)
678
  def _(mo):
679
- mo.md(
680
- r"""
681
- ## Key Takeaways
682
 
683
- PMFs give us the probability picture for discrete random variables - they tell us how likely each value is, must be non-negative, and always sum to 1. We can write them as equations, draw them as graphs, or estimate them from data. They're the foundation for calculating expected values and variances, which we'll explore in our next notebook on Expectation, where we'll learn how to summarize random variables with a single, most "expected" value.
684
- """
685
- )
686
  return
687
 
688
 
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App(width="medium", app_title="Probability Mass Functions")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
+ mo.md(r"""
20
+ # Probability Mass Functions
 
21
 
22
+ _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/pmf/), by Stanford professor Chris Piech._
23
 
24
+ PMFs are really important in discrete probability. They tell us how likely each possible outcome is for a discrete random variable.
25
 
26
+ What's interesting about PMFs is that they can be represented in multiple ways - equations, graphs, or even empirical data. The core idea is simple: they map each possible value to its probability.
27
+ """)
 
28
  return
29
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
+ mo.md(r"""
34
+ ## Properties of a PMF
 
35
 
36
+ For a function $p_X(x)$ to be a valid PMF:
37
 
38
+ 1. **Non-negativity**: probability can't be negative, so $p_X(x) \geq 0$ for all $x$
39
+ 2. **Unit total probability**: all probabilities sum to 1, i.e., $\sum_x p_X(x) = 1$
40
 
41
+ The second property makes intuitive sense - a random variable must take some value, and the sum of all possibilities should be 100%.
42
+ """)
 
43
  return
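The two properties above can be checked mechanically. A minimal sketch (the helper name `is_valid_pmf` is my own, not from the notebook):

```python
def is_valid_pmf(pmf, tol=1e-9):
    """Check the two PMF properties for a dict mapping value -> probability."""
    probs = list(pmf.values())
    non_negative = all(p >= 0 for p in probs)   # property 1: p(x) >= 0
    sums_to_one = abs(sum(probs) - 1.0) < tol   # property 2: total probability is 1
    return non_negative and sums_to_one

die_pmf = {x: 1 / 6 for x in range(1, 7)}   # fair six-sided die
print(is_valid_pmf(die_pmf))                # True
print(is_valid_pmf({1: 0.5, 2: 0.6}))       # False (total is 1.1)
```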
44
 
45
 
46
  @app.cell(hide_code=True)
47
  def _(mo):
48
+ mo.md(r"""
49
+ ## PMFs as Graphs
 
50
 
51
+ Let's start by looking at PMFs as graphs where the $x$-axis is the values that the random variable could take on and the $y$-axis is the probability of the random variable taking on said value.
52
 
53
+ In the following example, we show two PMFs:
54
 
55
+ - On the left: PMF for the random variable $X$ = the value of a single six-sided die roll
56
+ - On the right: PMF for the random variable $Y$ = value of the sum of two dice rolls
57
+ """)
 
58
  return
59
 
60
 
 
96
 
97
  plt.tight_layout()
98
  plt.gca()
99
+ return
 
 
 
 
 
 
 
 
 
 
100
 
101
 
102
  @app.cell(hide_code=True)
103
  def _(mo):
104
+ mo.md(r"""
105
+ These graphs really show us how likely each value is when we roll the dice.
 
106
 
107
+ Looking at the right graph, when we see "6" on the $x$-axis with probability $\frac{5}{36}$ on the $y$-axis, that's telling us there's a $\frac{5}{36}$ chance of rolling a sum of 6 with two dice. More formally: $P(Y = 6) = \frac{5}{36}$.
108
 
109
+ Similarly, the value "2" has probability $\frac{1}{36}$ - that's because there's only one way to get a sum of 2 (rolling 1 on both dice). And you'll notice there's no value for "1" since you can't get a sum of 1 with two dice - the minimum possible is 2.
110
+ """)
 
111
  return
112
 
113
 
114
  @app.cell(hide_code=True)
115
  def _(mo):
116
+ mo.md(r"""
117
+ ## PMFs as Equations
 
118
 
119
+ Here is the exact same information in equation form:
120
 
121
+ For a single die roll $X$:
122
+ $$P(X=x) = \frac{1}{6} \quad \text{ if } 1 \leq x \leq 6$$
123
 
124
+ For the sum of two dice $Y$:
125
+ $$P(Y=y) = \begin{cases}
126
+ \frac{(y-1)}{36} & \text{ if } 2 \leq y \leq 7\\
127
+ \frac{(13-y)}{36} & \text{ if } 8 \leq y \leq 12
128
+ \end{cases}$$
129
 
130
+ Let's implement the PMF for $Y$, the sum of two dice, in Python code:
131
+ """)
 
132
  return
133
 
134
 
 
147
  test_values = [1, 2, 7, 12, 13]
148
  for test_y in test_values:
149
  print(f"P(Y = {test_y}) = {pmf_sum_two_dice(test_y)}")
150
+ return (pmf_sum_two_dice,)
151
 
152
 
153
  @app.cell(hide_code=True)
154
  def _(mo):
155
+ mo.md(r"""
156
+ Now, let's verify that our PMF satisfies the property that the sum of all probabilities equals 1:
157
+ """)
158
  return
159
 
160
 
 
165
  # Round to 10 decimal places to handle floating-point precision
166
  verify_total_prob_rounded = round(verify_total_prob, 10)
167
  print(f"Sum of all probabilities: {verify_total_prob_rounded}")
168
+ return
169
 
170
 
171
  @app.cell(hide_code=True)
 
187
  plt.text(verify_y_values[verify_i], verify_prob + 0.001, f'{verify_prob:.3f}', ha='center')
188
 
189
  plt.gca() # Return the current axes to ensure proper display
190
+ return
191
 
192
 
193
  @app.cell(hide_code=True)
194
  def _(mo):
195
+ mo.md(r"""
196
+ ## Data to Histograms to Probability Mass Functions
 
197
 
198
+ Here's something I find interesting - one way to represent a PMF is just through raw data. Instead of mathematical formulas, we can approximate a PMF by collecting data points. Let's see this in action by simulating lots of dice rolls and building an empirical PMF:
199
+ """)
 
200
  return
201
 
202
 
 
216
  # Display a small sample of the data
217
  print(f"First 20 dice sums: {sim_dice_sums[:20]}")
218
  print(f"Total number of trials: {sim_num_trials}")
219
+ return (sim_dice_sums,)
220
 
221
 
222
  @app.cell(hide_code=True)
 
276
  plt.text(sim_sorted_values[sim_i], sim_count + 19, str(sim_count), ha='center')
277
 
278
  plt.gca() # Return the current axes to ensure proper display
279
+ return (sim_counter,)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
280
 
281
 
282
  @app.cell(hide_code=True)
283
  def _(mo):
284
+ mo.md(r"""
285
+ When we normalize a histogram (divide each count by total sample size), we get a pretty good approximation of the true PMF. It's a simple yet powerful idea - count how many times each value appears, then divide by the total number of trials.
 
286
 
287
+ Let's make this concrete. Say we want to estimate $P(Y=3)$ - the probability of rolling a sum of 3 with two dice. We just count how many 3's show up in our simulated rolls and divide by the total number of rolls:
288
+ """)
 
289
  return
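The normalize-a-histogram recipe fits in a few lines. A hedged sketch (names and seed are my own; the notebook's own simulation code differs):

```python
import random
from collections import Counter

random.seed(0)  # seed chosen arbitrarily, for reproducibility
n_trials = 100_000
rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(n_trials)]

# Normalized histogram = empirical PMF
counts = Counter(rolls)
empirical_pmf = {y: counts[y] / n_trials for y in sorted(counts)}

for y in (2, 7, 12):
    print(f"P(Y={y}) is about {empirical_pmf[y]:.4f}")  # theory: 1/36, 6/36, 1/36
```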
290
 
291
 
 
302
  print(f"Empirical P(Y=3): {sim_count_of_3}/{len(sim_dice_sums)} = {sim_empirical_prob:.4f}")
303
  print(f"Theoretical P(Y=3): 2/36 = {sim_theoretical_prob:.4f}")
304
  print(f"Difference: {abs(sim_empirical_prob - sim_theoretical_prob):.4f}")
305
+ return
306
 
307
 
308
  @app.cell(hide_code=True)
309
  def _(mo):
310
+ mo.md(r"""
311
+ As we can see, with a large number of trials, the empirical PMF becomes a very good approximation of the theoretical PMF. This is an example of the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) in action.
 
312
 
313
+ ## Interactive Example: Exploring PMFs
314
 
315
+ Let's create an interactive tool to explore different PMFs:
316
+ """)
 
317
  return
318
 
319
 
 
444
  bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
445
 
446
  plt.gca() # Return the current axes to ensure proper display
447
+ return dist_pmf_values, dist_x_values
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
448
 
449
 
450
  @app.cell(hide_code=True)
451
  def _(mo):
452
+ mo.md(r"""
453
+ ## Expected Value from a PMF
 
454
 
455
+ The expected value (or mean) of a discrete random variable is calculated using its PMF:
456
 
457
+ $$E[X] = \sum_x x \cdot p_X(x)$$
458
 
459
+ This represents the long-run average value of the random variable.
460
+ """)
 
461
  return
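As a quick sketch of this formula (helper names are my own): the piecewise PMF for the sum of two dice can be written compactly as $p(y) = (6 - |y - 7|)/36$, and weighting each value by its probability gives the familiar mean of 7.

```python
def expected_value(pmf):
    """E[X] = sum of x * p(x) over a dict mapping value -> probability."""
    return sum(x * p for x, p in pmf.items())

# PMF of the sum of two dice, as a closed form equivalent to the piecewise version
two_dice_pmf = {y: (6 - abs(y - 7)) / 36 for y in range(2, 13)}
print(expected_value(two_dice_pmf))   # 7, up to floating point
```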
462
 
463
 
 
471
  ev_dist_mean = calc_expected_value(dist_x_values, dist_pmf_values)
472
 
473
  print(f"Expected value: {ev_dist_mean:.4f}")
474
+ return (ev_dist_mean,)
475
 
476
 
477
  @app.cell(hide_code=True)
478
  def _(mo):
479
+ mo.md(r"""
480
+ ## Variance from a PMF
 
481
 
482
+ The variance measures the spread or dispersion of a random variable around its mean:
483
 
484
+ $$\text{Var}(X) = E[(X - E[X])^2] = \sum_x (x - E[X])^2 \cdot p_X(x)$$
485
 
486
+ An alternative formula is:
487
 
488
+ $$\text{Var}(X) = E[X^2] - (E[X])^2 = \sum_x x^2 \cdot p_X(x) - \left(\sum_x x \cdot p_X(x)\right)^2$$
489
+ """)
 
490
  return
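Both variance formulas are easy to check against each other in code. A minimal sketch (helper name is my own):

```python
def variance(pmf):
    """Var(X) computed with both formulas; they should agree."""
    mean = sum(x * p for x, p in pmf.items())
    by_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())
    by_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2
    assert abs(by_definition - by_shortcut) < 1e-9  # the two formulas match
    return by_definition

die_pmf = {x: 1 / 6 for x in range(1, 7)}
print(variance(die_pmf))   # 35/12, about 2.9167
```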
491
 
492
 
 
502
 
503
  print(f"Variance: {var_dist_var:.4f}")
504
  print(f"Standard deviation: {var_dist_std_dev:.4f}")
505
+ return
506
 
507
 
508
  @app.cell(hide_code=True)
509
  def _(mo):
510
+ mo.md(r"""
511
+ ## PMF vs. CDF
 
512
 
513
+ The **Cumulative Distribution Function (CDF)** is related to the PMF but gives the probability that the random variable $X$ is less than or equal to a value $x$:
514
 
515
+ $$F_X(x) = P(X \leq x) = \sum_{k \leq x} p_X(k)$$
516
 
517
+ While the PMF gives the probability mass at each point, the CDF accumulates these probabilities.
518
+ """)
 
519
  return
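Accumulating the PMF in sorted order gives the CDF directly. A small sketch (helper name is my own):

```python
def cdf_from_pmf(pmf):
    """F(x) = P(X <= x): accumulate PMF values in sorted order."""
    cdf, running_total = {}, 0.0
    for x in sorted(pmf):
        running_total += pmf[x]
        cdf[x] = running_total
    return cdf

die_cdf = cdf_from_pmf({x: 1 / 6 for x in range(1, 7)})
print(die_cdf[3])   # P(X <= 3), about 0.5
print(die_cdf[6])   # about 1.0 -- the CDF always ends at 1
```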
520
 
521
 
 
552
 
553
  plt.tight_layout()
554
  plt.gca() # Return the current axes to ensure proper display
555
+ return
556
 
557
 
558
  @app.cell(hide_code=True)
559
  def _(mo):
560
+ mo.md(r"""
561
+ The graphs above illustrate the key difference between PMF and CDF:
 
562
 
563
+ - **PMF (left)**: Shows the probability of the random variable taking each specific value: P(X = x)
564
+ - **CDF (right)**: Shows the probability of the random variable being less than or equal to each value: P(X ≤ x)
565
 
566
+ The CDF at any point is the sum of all PMF values up to and including that point. This is why the CDF is always non-decreasing and eventually reaches 1. For discrete distributions like this one, the CDF forms a step function that jumps at each value in the support of the random variable.
567
+ """)
 
568
  return
569
 
570
 
571
  @app.cell(hide_code=True)
572
  def _(mo):
573
+ mo.md(r"""
574
+ ## Test Your Understanding
575
+
576
+ Choose what you believe are the correct options in the questions below:
577
+
578
+ <details>
579
+ <summary>If X is a discrete random variable with PMF p(x), then p(x) must always be less than 1</summary>
580
+ ❌ False! While most values in a PMF are typically less than 1, a PMF can have p(x) = 1 for a specific value if the random variable always takes that value (with 100% probability).
581
+ </details>
582
+
583
+ <details>
584
+ <summary>The sum of all probabilities in a PMF must equal exactly 1</summary>
585
+ ✅ True! This is a fundamental property of any valid PMF. The total probability across all possible values must be 1, as the random variable must take some value.
586
+ </details>
587
+
588
+ <details>
589
+ <summary>A PMF can be estimated from data by creating a normalized histogram</summary>
590
+ ✅ True! Counting the frequency of each value and dividing by the total number of observations gives an empirical PMF.
591
+ </details>
592
+
593
+ <details>
594
+ <summary>The expected value of a discrete random variable is always one of the possible values of the variable</summary>
595
+ ❌ False! The expected value is a weighted average and may not be a value the random variable can actually take. For example, the expected value of a fair die roll is 3.5, which is not a possible outcome.
596
+ </details>
597
+ """)
 
 
598
  return
599
 
600
 
601
  @app.cell(hide_code=True)
602
  def _(mo):
603
+ mo.md(r"""
604
+ ## Practical Applications of PMFs
 
605
 
606
+ PMFs pop up everywhere - network engineers use them to model traffic patterns, reliability teams predict equipment failures, and marketers analyze purchase behavior. In finance, they help price options; in gaming, they're behind every dice roll. Machine learning algorithms like Naive Bayes rely on them, and they're essential for modeling rare events like genetic mutations or system failures.
607
+ """)
 
608
  return
609
 
610
 
611
  @app.cell(hide_code=True)
612
  def _(mo):
613
+ mo.md(r"""
614
+ ## Key Takeaways
 
615
 
616
+ PMFs give us the probability picture for discrete random variables - they tell us how likely each value is, must be non-negative, and always sum to 1. We can write them as equations, draw them as graphs, or estimate them from data. They're the foundation for calculating expected values and variances, which we'll explore in our next notebook on Expectation, where we'll learn how to summarize random variables with a single, most "expected" value.
617
+ """)
 
618
  return
619
 
620
 
probability/11_expectation.py CHANGED
@@ -10,55 +10,49 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.12.6"
14
  app = marimo.App(width="medium", app_title="Expectation")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
- mo.md(
20
- r"""
21
- # Expectation
22
 
23
- _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/expectation/), by Stanford professor Chris Piech._
24
 
25
- Expectations are fascinating — they represent the "center of mass" of a probability distribution. while they're often called "expected values" or "averages," they don't always match our intuition about what's "expected" to happen.
26
 
27
- For me, the most interesting part about expectations is how they quantify what happens "on average" in the long run, even if that average isn't a possible outcome (like expecting 3.5 on a standard die roll).
28
- """
29
- )
30
  return
31
 
32
 
33
  @app.cell(hide_code=True)
34
  def _(mo):
35
- mo.md(
36
- r"""
37
- ## Definition of Expectation
38
 
39
- Expectation (written as $E[X]$) is basically the "average outcome" of a random variable, but with a twist - we weight each possible value by how likely it is to occur. I like to think of it as the "center of gravity" for probability.
40
 
41
- $$E[X] = \sum_x x \cdot P(X=x)$$
42
 
43
- People call this concept by different names - mean, weighted average, center of mass, or 1st moment if you're being fancy. They're all calculated the same way, though: multiply each value by its probability, then add everything up.
44
- """
45
- )
46
  return
47
 
48
 
49
  @app.cell(hide_code=True)
50
  def _(mo):
51
- mo.md(
52
- r"""
53
- ## Intuition Behind Expectation
54
 
55
- The expected value represents the long-run average value of a random variable over many independent repetitions of an experiment.
56
 
57
- For example, if you roll a fair six-sided die many times and calculate the average of all rolls, that average will approach the expected value of 3.5 as the number of rolls increases.
58
 
59
- Let's visualize this concept:
60
- """
61
- )
62
  return
63
 
64
 
@@ -91,12 +85,14 @@ def _(np, plt):
91
  arrowprops=dict(facecolor='black', shrink=0.05, width=1.5))
92
 
93
  plt.gca()
94
- return exp_die_rolls, exp_num_rolls, exp_running_avg
95
 
96
 
97
  @app.cell(hide_code=True)
98
  def _(mo):
99
- mo.md(r"""## Properties of Expectation""")
 
 
100
  return
101
 
102
 
@@ -145,25 +141,23 @@ def _(mo):
145
 
146
  @app.cell(hide_code=True)
147
  def _(mo):
148
- mo.md(
149
- r"""
150
- ## Calculating Expectation
151
 
152
- Let's calculate the expected value for some common examples:
153
 
154
- ### Example 1: Fair Die Roll
155
 
156
- For a fair six-sided die, the PMF is:
157
 
158
- $$P(X=x) = \frac{1}{6} \text{ for } x \in \{1, 2, 3, 4, 5, 6\}$$
159
 
160
- The expected value is:
161
 
162
- $$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6} = 3.5$$
163
 
164
- Let's implement this calculation in Python:
165
- """
166
- )
167
  return
168
 
169
 
@@ -179,18 +173,16 @@ def _():
179
 
180
  exp_die_result = calc_expectation_die()
181
  print(f"Expected value of a fair die roll: {exp_die_result}")
182
- return calc_expectation_die, exp_die_result
183
 
184
 
185
  @app.cell(hide_code=True)
186
  def _(mo):
187
- mo.md(
188
- r"""
189
- ### Example 2: Sum of Two Dice
190
 
191
- Now let's calculate the expected value for the sum of two fair dice. First, we need the PMF:
192
- """
193
- )
194
  return
195
 
196
 
@@ -210,7 +202,7 @@ def _():
210
  exp_test_values = [2, 7, 12]
211
  for exp_test_y in exp_test_values:
212
  print(f"P(Y = {exp_test_y}) = {pmf_sum_two_dice(exp_test_y)}")
213
- return exp_test_values, exp_test_y, pmf_sum_two_dice
214
 
215
 
216
  @app.cell
@@ -239,24 +231,16 @@ def _(pmf_sum_two_dice):
239
 
240
  # Verify that this equals 7
241
  print(f"Is the expected value exactly 7? {abs(exp_sum_result - 7) < 1e-10}")
242
- return (
243
- calc_expectation_sum_two_dice,
244
- exp_direct_calc,
245
- exp_direct_calc_rounded,
246
- exp_sum_result,
247
- exp_sum_result_rounded,
248
- )
249
 
250
 
251
  @app.cell(hide_code=True)
252
  def _(mo):
253
- mo.md(
254
- r"""
255
- ### Visualizing Expectation
256
 
257
- Let's visualize the expectation for the sum of two dice. The expected value is the "center of mass" of the PMF:
258
- """
259
- )
260
  return
261
 
262
 
@@ -283,18 +267,16 @@ def _(plt, pmf_sum_two_dice):
283
 
284
  plt.tight_layout()
285
  plt.gca()
286
- return dice_ax, dice_fig, exp_i, exp_prob, exp_probabilities, exp_y_values
287
 
288
 
289
  @app.cell(hide_code=True)
290
  def _(mo):
291
- mo.md(
292
- r"""
293
- ## Demonstrating the Properties of Expectation
294
 
295
- Let's demonstrate some of these properties with examples:
296
- """
297
- )
298
  return
299
 
300
 
@@ -321,25 +303,16 @@ def _(exp_die_result):
321
 
322
  # Verify they match
323
  print(f"Do they match? {abs(prop_expected_using_property - prop_expected_direct) < 1e-10}")
324
- return (
325
- prop_a,
326
- prop_b,
327
- prop_expected_direct,
328
- prop_expected_direct_rounded,
329
- prop_expected_using_property,
330
- prop_expected_using_property_rounded,
331
- )
332
 
333
 
334
  @app.cell(hide_code=True)
335
  def _(mo):
336
- mo.md(
337
- r"""
338
- ### Law of the Unconscious Statistician (LOTUS)
339
 
340
- Let's use LOTUS to calculate $E[X^2]$ for a die roll, which will be useful when we study variance:
341
- """
342
- )
343
  return
344
 
345
 
@@ -358,38 +331,27 @@ def _():
358
 
359
  print(f"E[X^2] for a die roll = {lotus_expected_x_squared_rounded}")
360
  print(f"(E[X])^2 for a die roll = {expected_x_squared_rounded}")
361
- return (
362
- expected_x_squared,
363
- expected_x_squared_rounded,
364
- lotus_die_probs,
365
- lotus_die_values,
366
- lotus_expected_x_squared,
367
- lotus_expected_x_squared_rounded,
368
- )
369
 
370
 
371
  @app.cell(hide_code=True)
372
  def _(mo):
373
- mo.md(
374
- r"""
375
- /// Note
376
- Note that E[X^2] != (E[X])^2
377
- """
378
- )
379
  return
380
 
381
 
382
  @app.cell(hide_code=True)
383
  def _(mo):
384
- mo.md(
385
- r"""
386
- ## Interactive Example
387
 
388
- Let's explore how the expected value changes as we adjust the parameters of common probability distributions. This interactive visualization focuses specifically on the relationship between distribution parameters and expected values.
389
 
390
- Use the controls below to select a distribution and adjust its parameters. The graph will show how the expected value changes across a range of parameter values.
391
- """
392
- )
393
  return
394
 
395
 
@@ -423,7 +385,9 @@ def _(dist_description):
423
 
424
  @app.cell(hide_code=True)
425
  def _(mo):
426
- mo.md("""### Adjust Parameters""")
 
 
427
  return
428
 
429
 
@@ -549,37 +513,16 @@ def _(
549
 
550
  plt.tight_layout()
551
  plt.gca()
552
- return (
553
- annotation_x,
554
- annotation_y,
555
- current_expected,
556
- current_param,
557
- dist_ax,
558
- dist_fig,
559
- dist_props,
560
- expected_values,
561
- formula,
562
- lambda_max,
563
- lambda_min,
564
- max_y,
565
- n,
566
- p_max,
567
- p_min,
568
- param_values,
569
- title,
570
- x_label,
571
- )
572
 
573
 
574
  @app.cell(hide_code=True)
575
  def _(mo):
576
- mo.md(
577
- r"""
578
- ## Expectation vs. Mode
579
 
580
- The expected value (mean) of a random variable is not always the same as its most likely value (mode). Let's explore this with an example:
581
- """
582
- )
583
  return
584
 
585
 
@@ -633,94 +576,75 @@ def _(np, plt, stats):
633
 
634
  plt.tight_layout()
635
  plt.gca()
636
- return (
637
- max_x,
638
- mid_x,
639
- min_x,
640
- skew_ax,
641
- skew_expected,
642
- skew_expected_rounded,
643
- skew_fig,
644
- skew_mode,
645
- skew_n,
646
- skew_p,
647
- skew_pmf_values,
648
- skew_x_values,
649
- )
650
 
651
 
652
  @app.cell(hide_code=True)
653
  def _(mo):
654
- mo.md(
655
- r"""
656
- /// NOTE
657
- For the sum of two dice we calculated earlier, we found the expected value to be exactly 7. In that case, 7 also happens to be the mode (most likely outcome) of the distribution. However, this is just a coincidence for this particular example!
658
 
659
- As we can see from the binomial distribution above, the expected value (2.50) and the mode (2) are often different values (this is common in skewed distributions). The expected value represents the "center of mass" of the distribution, while the mode represents the most likely single outcome.
660
- """
661
- )
662
  return
663
 
664
 
665
  @app.cell(hide_code=True)
666
  def _(mo):
667
- mo.md(
668
- r"""
669
- ## 🤔 Test Your Understanding
670
-
671
- Choose what you believe are the correct options in the questions below:
672
-
673
- <details>
674
- <summary>The expected value of a random variable is always one of the possible values the random variable can take.</summary>
675
- ❌ False! The expected value is a weighted average and may not be a value the random variable can actually take. For example, the expected value of a fair die roll is 3.5, which is not a possible outcome.
676
- </details>
677
-
678
- <details>
679
- <summary>If X and Y are independent random variables, then E[X·Y] = E[X]·E[Y].</summary>
680
- ✅ True! For independent random variables, the expectation of their product equals the product of their expectations.
681
- </details>
682
-
683
- <details>
684
- <summary>The expected value of a constant random variable (one that always takes the same value) is that constant.</summary>
685
- ✅ True! If X = c with probability 1, then E[X] = c.
686
- </details>
687
-
688
- <details>
689
- <summary>The expected value of the sum of two random variables is always the sum of their expected values, regardless of whether they are independent.</summary>
690
- ✅ True! This is the linearity of expectation property: E[X + Y] = E[X] + E[Y], which holds regardless of dependence.
691
- </details>
692
- """
693
- )
694
  return
695
 
696
 
697
  @app.cell(hide_code=True)
698
  def _(mo):
699
- mo.md(
700
- r"""
701
- ## Practical Applications of Expectation
702
 
703
- Expected values show up everywhere - from investment decisions and insurance pricing to machine learning algorithms and game design. Engineers use them to predict system reliability, data scientists to understand customer behavior, and economists to model market outcomes. They're essential for risk assessment in project management and for optimizing resource allocation in operations research.
704
- """
705
- )
706
  return
707
 
708
 
709
  @app.cell(hide_code=True)
710
  def _(mo):
711
- mo.md(
712
- r"""
713
- ## Key Takeaways
714
 
715
- Expectation gives us a single value that summarizes a random variable's central tendency - it's the weighted average of all possible outcomes, where the weights are probabilities. The linearity property makes expectations easy to work with, even for complex combinations of random variables. While a PMF gives the complete probability picture, expectation provides an essential summary that helps us make decisions under uncertainty. In our next notebook, we'll explore variance, which measures how spread out a random variable's values are around its expectation.
716
- """
717
- )
718
  return
719
 
720
 
721
  @app.cell(hide_code=True)
722
  def _(mo):
723
- mo.md(r"""#### Appendix (containing helper code)""")
 
 
724
  return
725
 
726
 
@@ -736,7 +660,7 @@ def _():
736
  import numpy as np
737
  from scipy import stats
738
  import collections
739
- return collections, np, plt, stats
740
 
741
 
742
  @app.cell(hide_code=True)
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App(width="medium", app_title="Expectation")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
+ mo.md(r"""
20
+ # Expectation
 
21
 
22
+ _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/expectation/), by Stanford professor Chris Piech._
23
 
24
+ Expectations are fascinating - they represent the "center of mass" of a probability distribution. While they're often called "expected values" or "averages," they don't always match our intuition about what's "expected" to happen.
25
 
26
+ For me, the most interesting part about expectations is how they quantify what happens "on average" in the long run, even if that average isn't a possible outcome (like expecting 3.5 on a standard die roll).
27
+ """)
 
28
  return
29
 
30
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
+ mo.md(r"""
34
+ ## Definition of Expectation
 
35
 
36
+ Expectation (written as $E[X]$) is basically the "average outcome" of a random variable, but with a twist - we weight each possible value by how likely it is to occur. I like to think of it as the "center of gravity" for probability.
37
 
38
+ $$E[X] = \sum_x x \cdot P(X=x)$$
39
 
40
+ People call this concept by different names - mean, weighted average, center of mass, or 1st moment if you're being fancy. They're all calculated the same way, though: multiply each value by its probability, then add everything up.
41
+ """)
 
42
  return
43
 
44
 
45
  @app.cell(hide_code=True)
46
  def _(mo):
47
+ mo.md(r"""
48
+ ## Intuition Behind Expectation
 
49
 
50
+ The expected value represents the long-run average value of a random variable over many independent repetitions of an experiment.
51
 
52
+ For example, if you roll a fair six-sided die many times and calculate the average of all rolls, that average will approach the expected value of 3.5 as the number of rolls increases.
53
 
54
+ Let's visualize this concept:
55
+ """)
 
56
  return
57
 
58
 
 
85
  arrowprops=dict(facecolor='black', shrink=0.05, width=1.5))
86
 
87
  plt.gca()
88
+ return
89
 
90
 
91
  @app.cell(hide_code=True)
92
  def _(mo):
93
+ mo.md(r"""
94
+ ## Properties of Expectation
95
+ """)
96
  return
97
 
98
 
 
141
 
142
  @app.cell(hide_code=True)
143
  def _(mo):
144
+ mo.md(r"""
145
+ ## Calculating Expectation
 
146
 
147
+ Let's calculate the expected value for some common examples:
148
 
149
+ ### Example 1: Fair Die Roll
150
 
151
+ For a fair six-sided die, the PMF is:
152
 
153
+ $$P(X=x) = \frac{1}{6} \text{ for } x \in \{1, 2, 3, 4, 5, 6\}$$
154
 
155
+ The expected value is:
156
 
157
+ $$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = \frac{21}{6} = 3.5$$
158
 
159
+ Let's implement this calculation in Python:
160
+ """)
 
161
  return
162
 
163
 
 
173
 
174
  exp_die_result = calc_expectation_die()
175
  print(f"Expected value of a fair die roll: {exp_die_result}")
176
+ return (exp_die_result,)
177
 
178
 
179
  @app.cell(hide_code=True)
180
  def _(mo):
181
+ mo.md(r"""
182
+ ### Example 2: Sum of Two Dice
 
183
 
184
+ Now let's calculate the expected value for the sum of two fair dice. First, we need the PMF:
185
+ """)
 
186
  return
187
 
188
 
 
202
  exp_test_values = [2, 7, 12]
203
  for exp_test_y in exp_test_values:
204
  print(f"P(Y = {exp_test_y}) = {pmf_sum_two_dice(exp_test_y)}")
205
+ return (pmf_sum_two_dice,)
206
 
207
 
208
  @app.cell
 
231
 
232
  # Verify that this equals 7
233
  print(f"Is the expected value exactly 7? {abs(exp_sum_result - 7) < 1e-10}")
234
+ return
 
 
 
 
 
 
235
 
236
 
237
  @app.cell(hide_code=True)
238
  def _(mo):
239
+ mo.md(r"""
240
+ ### Visualizing Expectation
 
241
 
242
+ Let's visualize the expectation for the sum of two dice. The expected value is the "center of mass" of the PMF:
243
+ """)
 
244
  return
245
 
246
 
 
267
 
268
  plt.tight_layout()
269
  plt.gca()
270
+ return
271
 
272
 
273
  @app.cell(hide_code=True)
274
  def _(mo):
275
+ mo.md(r"""
276
+ ## Demonstrating the Properties of Expectation
 
277
 
278
+ Let's demonstrate some of these properties with examples:
279
+ """)
 
280
  return
281
 
282
 
 
303
 
304
  # Verify they match
305
  print(f"Do they match? {abs(prop_expected_using_property - prop_expected_direct) < 1e-10}")
306
+ return
 
 
 
 
 
 
 
307
 
308
 
309
  @app.cell(hide_code=True)
310
  def _(mo):
311
+ mo.md(r"""
312
+ ### Law of the Unconscious Statistician (LOTUS)
 
313
 
314
+ Let's use LOTUS to calculate $E[X^2]$ for a die roll, which will be useful when we study variance:
315
+ """)
 
316
  return
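LOTUS in code is a one-liner: apply $g$ to each value before weighting by the PMF, with no need to derive the PMF of $g(X)$ itself. A sketch (names are my own, not the notebook's):

```python
def lotus(g, pmf):
    """E[g(X)] = sum of g(x) * p(x) over a dict mapping value -> probability."""
    return sum(g(x) * p for x, p in pmf.items())

die_pmf = {x: 1 / 6 for x in range(1, 7)}
e_x_squared = lotus(lambda x: x ** 2, die_pmf)        # E[X^2] = 91/6, about 15.167
e_x_then_squared = lotus(lambda x: x, die_pmf) ** 2   # (E[X])^2 = 3.5^2 = 12.25
print(e_x_squared, e_x_then_squared)                  # note that they differ
```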
317
 
318
 
 
331
 
332
  print(f"E[X^2] for a die roll = {lotus_expected_x_squared_rounded}")
333
  print(f"(E[X])^2 for a die roll = {expected_x_squared_rounded}")
334
+ return
 
 
 
 
 
 
 
335
 
336
 
337
  @app.cell(hide_code=True)
338
  def _(mo):
339
+ mo.md(r"""
340
+ /// Note
341
+ Note that $E[X^2] \neq (E[X])^2$
342
+ """)
 
 
343
  return
344
 
345
 
346
  @app.cell(hide_code=True)
347
  def _(mo):
348
+ mo.md(r"""
349
+ ## Interactive Example
 
350
 
351
+ Let's explore how the expected value changes as we adjust the parameters of common probability distributions. This interactive visualization focuses specifically on the relationship between distribution parameters and expected values.
352
 
353
+ Use the controls below to select a distribution and adjust its parameters. The graph will show how the expected value changes across a range of parameter values.
354
+ """)
 
355
  return
356
 
357
 
 
385
 
386
  @app.cell(hide_code=True)
387
  def _(mo):
388
+ mo.md("""
389
+ ### Adjust Parameters
390
+ """)
391
  return
392
 
393
 
 
513
 
514
  plt.tight_layout()
515
  plt.gca()
516
+ return
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
517
 
518
 
519
  @app.cell(hide_code=True)
520
  def _(mo):
521
+ mo.md(r"""
522
+ ## Expectation vs. Mode
 
523
 
524
+ The expected value (mean) of a random variable is not always the same as its most likely value (mode). Let's explore this with an example:
525
+ """)
 
526
  return
527
 
528
 
 
576
 
577
  plt.tight_layout()
578
  plt.gca()
579
+ return
 
 
 
 
 
 
 
 
 
 
 
 
 
580
 
581
 
582
  @app.cell(hide_code=True)
583
  def _(mo):
584
+ mo.md(r"""
585
+ /// NOTE
586
+ For the sum of two dice we calculated earlier, we found the expected value to be exactly 7. In that case, 7 also happens to be the mode (most likely outcome) of the distribution. However, this is just a coincidence for this particular example!
 
587
 
588
+ As we can see from the binomial distribution above, the expected value (2.50) and the mode (2) are often different values (this is common in skewed distributions). The expected value represents the "center of mass" of the distribution, while the mode represents the most likely single outcome.
589
+ """)
 
590
  return
591
 
592
 
593
  @app.cell(hide_code=True)
594
  def _(mo):
595
+ mo.md(r"""
596
+ ## 🤔 Test Your Understanding
597
+
598
+ Choose what you believe are the correct options in the questions below:
599
+
600
+ <details>
601
+ <summary>The expected value of a random variable is always one of the possible values the random variable can take.</summary>
602
+ ❌ False! The expected value is a weighted average and may not be a value the random variable can actually take. For example, the expected value of a fair die roll is 3.5, which is not a possible outcome.
603
+ </details>
604
+
605
+ <details>
606
+ <summary>If X and Y are independent random variables, then E[X·Y] = E[X]·E[Y].</summary>
607
+ ✅ True! For independent random variables, the expectation of their product equals the product of their expectations.
608
+ </details>
609
+
610
+ <details>
611
+ <summary>The expected value of a constant random variable (one that always takes the same value) is that constant.</summary>
612
+ ✅ True! If X = c with probability 1, then E[X] = c.
613
+ </details>
614
+
615
+ <details>
616
+ <summary>The expected value of the sum of two random variables is always the sum of their expected values, regardless of whether they are independent.</summary>
617
+ ✅ True! This is the linearity of expectation property: E[X + Y] = E[X] + E[Y], which holds regardless of dependence.
618
+ </details>
619
+ """)
 
 
620
  return
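Both the product and the linearity claims above can be checked numerically. A small simulation sketch (the fair-die setup, sample size, and seed are illustrative choices):

```python
# Linearity E[X + Y] = E[X] + E[Y] holds even for dependent variables,
# while E[XY] = E[X]E[Y] requires independence - here Y = 7 - X is
# fully dependent on X, so only linearity should survive.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=200_000)  # fair die rolls
y = 7 - x                             # deterministic function of x

lin_gap = abs(np.mean(x + y) - (np.mean(x) + np.mean(y)))
prod_gap = abs(np.mean(x * y) - np.mean(x) * np.mean(y))
print(lin_gap, prod_gap)  # linearity gap ~0, product gap is large
```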
621
 
622
 
623
  @app.cell(hide_code=True)
624
  def _(mo):
625
+ mo.md(r"""
626
+ ## Practical Applications of Expectation
 
627
 
628
+ Expected values show up everywhere - from investment decisions and insurance pricing to machine learning algorithms and game design. Engineers use them to predict system reliability, data scientists to understand customer behavior, and economists to model market outcomes. They're essential for risk assessment in project management and for optimizing resource allocation in operations research.
629
+ """)
 
630
  return
631
 
632
 
633
  @app.cell(hide_code=True)
634
  def _(mo):
635
+ mo.md(r"""
636
+ ## Key Takeaways
 
637
 
638
+ Expectation gives us a single value that summarizes a random variable's central tendency - it's the weighted average of all possible outcomes, where the weights are probabilities. The linearity property makes expectations easy to work with, even for complex combinations of random variables. While a PMF gives the complete probability picture, expectation provides an essential summary that helps us make decisions under uncertainty. In our next notebook, we'll explore variance, which measures how spread out a random variable's values are around its expectation.
639
+ """)
 
640
  return
641
 
642
 
643
  @app.cell(hide_code=True)
644
  def _(mo):
645
+ mo.md(r"""
646
+ #### Appendix (containing helper code)
647
+ """)
648
  return
649
 
650
 
 
660
  import numpy as np
661
  from scipy import stats
662
  import collections
663
+ return np, plt, stats
664
 
665
 
666
  @app.cell(hide_code=True)
probability/12_variance.py CHANGED
@@ -11,77 +11,69 @@
11
 
12
  import marimo
13
 
14
- __generated_with = "0.11.20"
15
  app = marimo.App(width="medium", app_title="Variance")
16
 
17
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
- mo.md(
21
- r"""
22
- # Variance
23
 
24
- _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/variance/), by Stanford professor Chris Piech._
25
 
26
- In our previous exploration of random variables, we learned about expectation - a measure of central tendency. However, knowing the average value alone doesn't tell us everything about a distribution. Consider these questions:
27
 
28
- - How spread out are the values around the mean?
29
- - How reliable is the expectation as a predictor of individual outcomes?
30
- - How much do individual samples typically deviate from the average?
31
 
32
- This is where **variance** comes in - it measures the spread or dispersion of a random variable around its expected value.
33
- """
34
- )
35
  return
36
 
37
 
38
  @app.cell(hide_code=True)
39
  def _(mo):
40
- mo.md(
41
- r"""
42
- ## Definition of Variance
43
 
44
- The variance of a random variable $X$ with expected value $\mu = E[X]$ is defined as:
45
 
46
- $$\text{Var}(X) = E[(X-\mu)^2]$$
47
 
48
- This definition captures the average squared deviation from the mean. There's also an equivalent, often more convenient formula:
49
 
50
- $$\text{Var}(X) = E[X^2] - (E[X])^2$$
51
 
52
- /// tip
53
- The second formula is usually easier to compute, as it only requires calculating $E[X^2]$ and $E[X]$, rather than working with deviations from the mean.
54
- """
55
- )
56
  return
57
 
58
 
59
  @app.cell(hide_code=True)
60
  def _(mo):
61
- mo.md(
62
- r"""
63
- ## Intuition Through Example
64
 
65
- Let's look at a real-world example that illustrates why variance is important. Consider three different groups of graders evaluating assignments in a massive online course. Each grader has their own "grading distribution" - their pattern of assigning scores to work that deserves a 70/100.
66
 
67
- The visualization below shows the probability distributions for three types of graders. Try clicking and dragging the blue numbers to adjust the parameters and see how they affect the variance.
68
- """
69
- )
70
  return
71
 
72
 
73
  @app.cell(hide_code=True)
74
  def _(mo):
75
- mo.md(
76
- r"""
77
- /// TIP
78
- Try adjusting the blue numbers above to see how:
79
-
80
- - Increasing spread increases variance
81
- - The mixture ratio affects how many outliers appear in Grader C's distribution
82
- - Changing the true grade shifts all distributions but maintains their relative variances
83
- """
84
- )
85
  return
86
 
87
 
@@ -165,50 +157,35 @@ def _(
165
 
166
  plt.tight_layout()
167
  plt.gca()
168
- return (
169
- ax1,
170
- ax2,
171
- ax3,
172
- grader_a,
173
- grader_b,
174
- grader_c,
175
- grader_fig,
176
- var_a,
177
- var_b,
178
- var_c,
179
- )
180
 
181
 
182
  @app.cell(hide_code=True)
183
  def _(mo):
184
- mo.md(
185
- r"""
186
- /// note
187
- All three distributions have the same expected value (the true grade), but they differ significantly in their spread:
188
-
189
- - **Grader A** has high variance - grades vary widely from the true value
190
- - **Grader B** has low variance - grades consistently stay close to the true value
191
- - **Grader C** has a mixture distribution - mostly consistent but with occasional extreme values
192
-
193
- This illustrates why variance is crucial: two distributions can have the same mean but behave very differently in practice.
194
- """
195
- )
196
  return
197
 
198
 
199
  @app.cell(hide_code=True)
200
  def _(mo):
201
- mo.md(
202
- r"""
203
- ## Computing Variance
204
 
205
- Let's work through some concrete examples to understand how to calculate variance.
206
 
207
- ### Example 1: Fair Die Roll
208
 
209
- Consider rolling a fair six-sided die. We'll calculate its variance step by step:
210
- """
211
- )
212
  return
213
 
214
 
@@ -234,75 +211,62 @@ def _(np):
234
  print(f"E[X^2] = {expected_square:.2f}")
235
  print(f"Var(X) = {variance:.2f}")
236
  print(f"Standard Deviation = {std_dev:.2f}")
237
- return (
238
- die_probs,
239
- die_values,
240
- expected_square,
241
- expected_value,
242
- std_dev,
243
- variance,
244
- )
245
 
246
 
247
  @app.cell(hide_code=True)
248
  def _(mo):
249
- mo.md(
250
- r"""
251
- /// NOTE
252
- For a fair die:
253
-
254
- - The expected value (3.50) tells us the average roll
255
- - The variance (2.92) tells us how much typical rolls deviate from this average
256
- - The standard deviation (1.71) gives us this spread in the original units
257
- """
258
- )
259
  return
260
 
261
 
262
  @app.cell(hide_code=True)
263
  def _(mo):
264
- mo.md(
265
- r"""
266
- ## Properties of Variance
267
 
268
- Variance has several important properties that make it useful for analyzing random variables:
269
 
270
- 1. **Non-negativity**: $\text{Var}(X) \geq 0$ for any random variable $X$
271
- 2. **Variance of a constant**: $\text{Var}(c) = 0$ for any constant $c$
272
- 3. **Scaling**: $\text{Var}(aX) = a^2\text{Var}(X)$ for any constant $a$
273
- 4. **Translation**: $\text{Var}(X + b) = \text{Var}(X)$ for any constant $b$
274
- 5. **Independence**: If $X$ and $Y$ are independent, then $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
275
 
276
- Let's verify a property with an example.
277
- """
278
- )
279
  return
280
 
281
 
282
  @app.cell(hide_code=True)
283
  def _(mo):
284
- mo.md(
285
- r"""
286
- ## Proof of Variance Formula
287
-
288
- The equivalence of the two variance formulas is a fundamental result in probability theory. Here's the proof:
289
-
290
- Starting with the definition $\text{Var}(X) = E[(X-\mu)^2]$ where $\mu = E[X]$:
291
-
292
- \begin{align}
293
- \text{Var}(X) &= E[(X-\mu)^2] \\
294
- &= \sum_x(x-\mu)^2P(x) && \text{Definition of Expectation}\\
295
- &= \sum_x (x^2 -2\mu x + \mu^2)P(x) && \text{Expanding the square}\\
296
- &= \sum_x x^2P(x)- 2\mu \sum_x xP(x) + \mu^2 \sum_x P(x) && \text{Distributing the sum}\\
297
- &= E[X^2]- 2\mu E[X] + \mu^2 && \text{Definition of expectation}\\
298
- &= E[X^2]- 2(E[X])^2 + (E[X])^2 && \text{Since }\mu = E[X]\\
299
- &= E[X^2]- (E[X])^2 && \text{Simplifying}
300
- \end{align}
301
-
302
- /// tip
303
- This proof shows why the formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is so useful - it's much easier to compute $E[X^2]$ and $E[X]$ separately than to work with deviations directly.
304
- """
305
- )
306
  return
307
 
308
 
@@ -322,7 +286,7 @@ def _(die_probs, die_values, np):
322
  print(f"Scaled Variance (a={a}): {scaled_var:.2f}")
323
  print(f"a^2 * Original Variance: {a**2 * original_var:.2f}")
324
  print(f"Property holds: {abs(scaled_var - a**2 * original_var) < 1e-10}")
325
- return a, original_var, scaled_values, scaled_var
326
 
327
 
328
  @app.cell
@@ -333,23 +297,21 @@ def _():
333
 
334
  @app.cell(hide_code=True)
335
  def _(mo):
336
- mo.md(
337
- r"""
338
- ## Standard Deviation
339
 
340
- While variance is mathematically convenient, it has one practical drawback: its units are squared. For example, if we're measuring grades (0-100), the variance is in "grade points squared." This makes it hard to interpret intuitively.
341
 
342
- The **standard deviation**, denoted by $\sigma$ or $\text{SD}(X)$, is the square root of variance:
343
 
344
- $$\sigma = \sqrt{\text{Var}(X)}$$
345
 
346
- /// tip
347
- Standard deviation is often more intuitive because it's in the same units as the original data. For a normal distribution, approximately:
348
- - 68% of values fall within 1 standard deviation of the mean
349
- - 95% of values fall within 2 standard deviations
350
- - 99.7% of values fall within 3 standard deviations
351
- """
352
- )
353
  return
354
 
355
 
@@ -452,93 +414,80 @@ def _(normal_mean, normal_std, np, plt, stats):
452
 
453
  plt.tight_layout()
454
  plt.gca()
455
- return (
456
- normal_ax,
457
- normal_fig,
458
- one_sigma_left,
459
- one_sigma_right,
460
- three_sigma_left,
461
- three_sigma_right,
462
- two_sigma_left,
463
- two_sigma_right,
464
- )
465
 
466
 
467
  @app.cell(hide_code=True)
468
  def _(mo):
469
- mo.md(
470
- r"""
471
- /// tip
472
- The interactive visualization above demonstrates how standard deviation (σ) affects the shape of a normal distribution:
473
-
474
- - The **red region** covers μ ± 1σ, containing approximately 68% of the probability
475
- - The **green region** covers μ ± 2σ, containing approximately 95% of the probability
476
- - The **blue region** covers μ ± 3σ, containing approximately 99.7% of the probability
477
-
478
- This is known as the "68-95-99.7 rule" or the "empirical rule" and is a useful heuristic for understanding the spread of data.
479
- """
480
- )
481
  return
482
 
483
 
484
  @app.cell(hide_code=True)
485
  def _(mo):
486
- mo.md(
487
- r"""
488
- ## 🤔 Test Your Understanding
489
-
490
- Choose what you believe are the correct options in the questions below:
491
-
492
- <details>
493
- <summary>The variance of a random variable can be negative.</summary>
494
- ❌ False! Variance is defined as an expected value of squared deviations, and squares are always non-negative.
495
- </details>
496
-
497
- <details>
498
- <summary>If X and Y are independent random variables, then Var(X + Y) = Var(X) + Var(Y).</summary>
499
- ✅ True! This is one of the key properties of variance for independent random variables.
500
- </details>
501
-
502
- <details>
503
- <summary>Multiplying a random variable by 2 multiplies its variance by 2.</summary>
504
- ❌ False! Multiplying a random variable by a constant a multiplies its variance by a². So multiplying by 2 multiplies variance by 4.
505
- </details>
506
-
507
- <details>
508
- <summary>Standard deviation is always equal to the square root of variance.</summary>
509
- ✅ True! By definition, standard deviation σ = √Var(X).
510
- </details>
511
-
512
- <details>
513
- <summary>If Var(X) = 0, then X must be a constant.</summary>
514
- ✅ True! Zero variance means there is no spread around the mean, so X can only take one value.
515
- </details>
516
- """
517
- )
518
  return
519
 
520
 
521
  @app.cell(hide_code=True)
522
  def _(mo):
523
- mo.md(
524
- r"""
525
- ## Key Takeaways
526
 
527
- Variance gives us a way to measure how spread out a random variable is around its mean. It's like the "uncertainty" in our expectation - a high variance means individual outcomes can differ widely from what we expect on average.
528
 
529
- Standard deviation brings this measure back to the original units, making it easier to interpret. For grades, a standard deviation of 10 points means typical grades fall within about 10 points of the average.
530
 
531
- Variance pops up everywhere - from weather forecasts (how reliable is the predicted temperature?) to financial investments (how risky is this stock?) to quality control (how consistent is our manufacturing process?).
532
 
533
- In our next notebook, we'll explore more properties of random variables and see how they combine to form more complex distributions.
534
- """
535
- )
536
  return
537
 
538
 
539
  @app.cell(hide_code=True)
540
  def _(mo):
541
- mo.md(r"""Appendix (containing helper code):""")
 
 
542
  return
543
 
544
 
 
11
 
12
  import marimo
13
 
14
+ __generated_with = "0.18.4"
15
  app = marimo.App(width="medium", app_title="Variance")
16
 
17
 
18
  @app.cell(hide_code=True)
19
  def _(mo):
20
+ mo.md(r"""
21
+ # Variance
 
22
 
23
+ _This notebook is a computational companion to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/variance/), by Stanford professor Chris Piech._
24
 
25
+ In our previous exploration of random variables, we learned about expectation - a measure of central tendency. However, knowing the average value alone doesn't tell us everything about a distribution. Consider these questions:
26
 
27
+ - How spread out are the values around the mean?
28
+ - How reliable is the expectation as a predictor of individual outcomes?
29
+ - How much do individual samples typically deviate from the average?
30
 
31
+ This is where **variance** comes in - it measures the spread or dispersion of a random variable around its expected value.
32
+ """)
 
33
  return
34
 
35
 
36
  @app.cell(hide_code=True)
37
  def _(mo):
38
+ mo.md(r"""
39
+ ## Definition of Variance
 
40
 
41
+ The variance of a random variable $X$ with expected value $\mu = E[X]$ is defined as:
42
 
43
+ $$\text{Var}(X) = E[(X-\mu)^2]$$
44
 
45
+ This definition captures the average squared deviation from the mean. There's also an equivalent, often more convenient formula:
46
 
47
+ $$\text{Var}(X) = E[X^2] - (E[X])^2$$
48
 
49
+ /// tip
50
+ The second formula is usually easier to compute, as it only requires calculating $E[X^2]$ and $E[X]$, rather than working with deviations from the mean.
51
+ """)
 
52
  return
53
 
54
 
55
  @app.cell(hide_code=True)
56
  def _(mo):
57
+ mo.md(r"""
58
+ ## Intuition Through Example
 
59
 
60
+ Let's look at a real-world example that illustrates why variance is important. Consider three different groups of graders evaluating assignments in a massive online course. Each grader has their own "grading distribution" - their pattern of assigning scores to work that deserves a 70/100.
61
 
62
+ The visualization below shows the probability distributions for three types of graders. Try clicking and dragging the blue numbers to adjust the parameters and see how they affect the variance.
63
+ """)
 
64
  return
65
 
66
 
67
  @app.cell(hide_code=True)
68
  def _(mo):
69
+ mo.md(r"""
70
+ /// TIP
71
+ Try adjusting the blue numbers above to see how:
72
+
73
+ - Increasing spread increases variance
74
+ - The mixture ratio affects how many outliers appear in Grader C's distribution
75
+ - Changing the true grade shifts all distributions but maintains their relative variances
76
+ """)
 
 
77
  return
78
 
79
 
 
157
 
158
  plt.tight_layout()
159
  plt.gca()
160
+ return
 
 
 
 
 
 
 
 
 
 
 
161
 
162
 
163
  @app.cell(hide_code=True)
164
  def _(mo):
165
+ mo.md(r"""
166
+ /// note
167
+ All three distributions have the same expected value (the true grade), but they differ significantly in their spread:
168
+
169
+ - **Grader A** has high variance - grades vary widely from the true value
170
+ - **Grader B** has low variance - grades consistently stay close to the true value
171
+ - **Grader C** has a mixture distribution - mostly consistent but with occasional extreme values
172
+
173
+ This illustrates why variance is crucial: two distributions can have the same mean but behave very differently in practice.
174
+ """)
 
 
175
  return
176
 
177
 
178
  @app.cell(hide_code=True)
179
  def _(mo):
180
+ mo.md(r"""
181
+ ## Computing Variance
 
182
 
183
+ Let's work through some concrete examples to understand how to calculate variance.
184
 
185
+ ### Example 1: Fair Die Roll
186
 
187
+ Consider rolling a fair six-sided die. We'll calculate its variance step by step:
188
+ """)
 
189
  return
190
 
191
 
 
211
  print(f"E[X^2] = {expected_square:.2f}")
212
  print(f"Var(X) = {variance:.2f}")
213
  print(f"Standard Deviation = {std_dev:.2f}")
214
+ return die_probs, die_values
 
 
 
 
 
 
 
215
 
216
 
217
  @app.cell(hide_code=True)
218
  def _(mo):
219
+ mo.md(r"""
220
+ /// NOTE
221
+ For a fair die:
222
+
223
+ - The expected value (3.50) tells us the average roll
224
+ - The variance (2.92) tells us how much typical rolls deviate from this average
225
+ - The standard deviation (1.71) gives us this spread in the original units
226
+ """)
 
 
227
  return
228
 
229
 
230
  @app.cell(hide_code=True)
231
  def _(mo):
232
+ mo.md(r"""
233
+ ## Properties of Variance
 
234
 
235
+ Variance has several important properties that make it useful for analyzing random variables:
236
 
237
+ 1. **Non-negativity**: $\text{Var}(X) \geq 0$ for any random variable $X$
238
+ 2. **Variance of a constant**: $\text{Var}(c) = 0$ for any constant $c$
239
+ 3. **Scaling**: $\text{Var}(aX) = a^2\text{Var}(X)$ for any constant $a$
240
+ 4. **Translation**: $\text{Var}(X + b) = \text{Var}(X)$ for any constant $b$
241
+ 5. **Independence**: If $X$ and $Y$ are independent, then $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
242
 
243
+ Let's verify a property with an example.
244
+ """)
 
245
  return
246
 
247
 
248
  @app.cell(hide_code=True)
249
  def _(mo):
250
+ mo.md(r"""
251
+ ## Proof of Variance Formula
252
+
253
+ The equivalence of the two variance formulas is a fundamental result in probability theory. Here's the proof:
254
+
255
+ Starting with the definition $\text{Var}(X) = E[(X-\mu)^2]$ where $\mu = E[X]$:
256
+
257
+ \begin{align}
258
+ \text{Var}(X) &= E[(X-\mu)^2] \\
259
+ &= \sum_x(x-\mu)^2P(x) && \text{Definition of Expectation}\\
260
+ &= \sum_x (x^2 -2\mu x + \mu^2)P(x) && \text{Expanding the square}\\
261
+ &= \sum_x x^2P(x)- 2\mu \sum_x xP(x) + \mu^2 \sum_x P(x) && \text{Distributing the sum}\\
262
+ &= E[X^2]- 2\mu E[X] + \mu^2 && \text{Definition of expectation}\\
263
+ &= E[X^2]- 2(E[X])^2 + (E[X])^2 && \text{Since }\mu = E[X]\\
264
+ &= E[X^2]- (E[X])^2 && \text{Simplifying}
265
+ \end{align}
266
+
267
+ /// tip
268
+ This proof shows why the formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is so useful - it's much easier to compute $E[X^2]$ and $E[X]$ separately than to work with deviations directly.
269
+ """)
 
 
270
  return
271
 
272
 
 
286
  print(f"Scaled Variance (a={a}): {scaled_var:.2f}")
287
  print(f"a^2 * Original Variance: {a**2 * original_var:.2f}")
288
  print(f"Property holds: {abs(scaled_var - a**2 * original_var) < 1e-10}")
289
+ return
290
 
291
 
292
  @app.cell
 
297
 
298
  @app.cell(hide_code=True)
299
  def _(mo):
300
+ mo.md(r"""
301
+ ## Standard Deviation
 
302
 
303
+ While variance is mathematically convenient, it has one practical drawback: its units are squared. For example, if we're measuring grades (0-100), the variance is in "grade points squared." This makes it hard to interpret intuitively.
304
 
305
+ The **standard deviation**, denoted by $\sigma$ or $\text{SD}(X)$, is the square root of variance:
306
 
307
+ $$\sigma = \sqrt{\text{Var}(X)}$$
308
 
309
+ /// tip
310
+ Standard deviation is often more intuitive because it's in the same units as the original data. For a normal distribution, approximately:
311
+ - 68% of values fall within 1 standard deviation of the mean
312
+ - 95% of values fall within 2 standard deviations
313
+ - 99.7% of values fall within 3 standard deviations
314
+ """)
 
315
  return
316
 
317
 
 
414
 
415
  plt.tight_layout()
416
  plt.gca()
417
+ return
 
 
 
 
 
 
 
 
 
418
 
419
 
420
  @app.cell(hide_code=True)
421
  def _(mo):
422
+ mo.md(r"""
423
+ /// tip
424
+ The interactive visualization above demonstrates how standard deviation (σ) affects the shape of a normal distribution:
425
+
426
+ - The **red region** covers μ ± 1σ, containing approximately 68% of the probability
427
+ - The **green region** covers μ ± 2σ, containing approximately 95% of the probability
428
+ - The **blue region** covers μ ± 3σ, containing approximately 99.7% of the probability
429
+
430
+ This is known as the "68-95-99.7 rule" or the "empirical rule" and is a useful heuristic for understanding the spread of data.
431
+ """)
 
 
432
  return
433
 
434
 
435
  @app.cell(hide_code=True)
436
  def _(mo):
437
+ mo.md(r"""
438
+ ## 🤔 Test Your Understanding
439
+
440
+ Choose what you believe are the correct options in the questions below:
441
+
442
+ <details>
443
+ <summary>The variance of a random variable can be negative.</summary>
444
+ ❌ False! Variance is defined as an expected value of squared deviations, and squares are always non-negative.
445
+ </details>
446
+
447
+ <details>
448
+ <summary>If X and Y are independent random variables, then Var(X + Y) = Var(X) + Var(Y).</summary>
449
+ ✅ True! This is one of the key properties of variance for independent random variables.
450
+ </details>
451
+
452
+ <details>
453
+ <summary>Multiplying a random variable by 2 multiplies its variance by 2.</summary>
454
+ ❌ False! Multiplying a random variable by a constant a multiplies its variance by a². So multiplying by 2 multiplies variance by 4.
455
+ </details>
456
+
457
+ <details>
458
+ <summary>Standard deviation is always equal to the square root of variance.</summary>
459
+ ✅ True! By definition, standard deviation σ = √Var(X).
460
+ </details>
461
+
462
+ <details>
463
+ <summary>If Var(X) = 0, then X must be a constant.</summary>
464
+ ✅ True! Zero variance means there is no spread around the mean, so X can only take one value.
465
+ </details>
466
+ """)
 
 
467
  return
468
 
469
 
470
  @app.cell(hide_code=True)
471
  def _(mo):
472
+ mo.md(r"""
473
+ ## Key Takeaways
 
474
 
475
+ Variance gives us a way to measure how spread out a random variable is around its mean. It's like the "uncertainty" in our expectation - a high variance means individual outcomes can differ widely from what we expect on average.
476
 
477
+ Standard deviation brings this measure back to the original units, making it easier to interpret. For grades, a standard deviation of 10 points means typical grades fall within about 10 points of the average.
478
 
479
+ Variance pops up everywhere - from weather forecasts (how reliable is the predicted temperature?) to financial investments (how risky is this stock?) to quality control (how consistent is our manufacturing process?).
480
 
481
+ In our next notebook, we'll explore more properties of random variables and see how they combine to form more complex distributions.
482
+ """)
 
483
  return
484
 
485
 
486
  @app.cell(hide_code=True)
487
  def _(mo):
488
+ mo.md(r"""
489
+ Appendix (containing helper code):
490
+ """)
491
  return
492
 
493
 
probability/13_bernoulli_distribution.py CHANGED
@@ -10,60 +10,54 @@
10
 
11
  import marimo
12
 
13
- __generated_with = "0.12.6"
14
  app = marimo.App(width="medium", app_title="Bernoulli Distribution")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
- mo.md(
20
- r"""
21
- # Bernoulli Distribution
22
 
23
- > _Note:_ This notebook builds on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
24
 
25
- ## Parametric Random Variables
26
 
27
- Probability has a bunch of classic random variable patterns that show up over and over. Let's explore some of the most important parametric discrete distributions.
28
 
29
- Bernoulli is honestly the simplest distribution you'll ever see, but it's ridiculously powerful in practice. What makes it fascinating to me is how it captures any yes/no scenario: success/failure, heads/tails, 1/0.
30
 
31
- I think of these distributions as the atoms of probability — they're the fundamental building blocks that everything else is made from.
32
- """
33
- )
34
  return
35
 
36
 
37
  @app.cell(hide_code=True)
38
  def _(mo):
39
- mo.md(
40
- r"""
41
- ## Bernoulli Random Variables
42
 
43
- A Bernoulli random variable boils down to just two possible values: 1 (success) or 0 (failure). Dead simple, but incredibly useful.
44
 
45
- Some everyday examples where I see these:
46
 
47
- - Coin flip (heads=1, tails=0)
48
- - Whether that sketchy email is spam
49
- - If someone actually clicks my ad
50
- - Whether my code compiles first try (almost always 0 for me)
51
 
52
- All you need (the classic expression) is a single parameter $p$ - the probability of success.
53
- """
54
- )
55
  return
56
 
57
 
58
  @app.cell(hide_code=True)
59
  def _(mo):
60
- mo.md(
61
- r"""
62
- ## Key Properties of a Bernoulli Random Variable
63
 
64
- If $X$ is declared to be a Bernoulli random variable with parameter $p$, denoted $X \sim \text{Bern}(p)$, it has the following properties:
65
- """
66
- )
67
  return
68
 
69
 
@@ -72,31 +66,29 @@ def _(stats):
72
  # Define the Bernoulli distribution function
73
  def Bern(p):
74
  return stats.bernoulli(p)
75
- return (Bern,)
76
 
77
 
78
  @app.cell(hide_code=True)
79
  def _(mo):
80
- mo.md(
81
- r"""
82
- ## Bernoulli Distribution Properties
83
-
84
- $\begin{array}{lll}
85
- \text{Notation:} & X \sim \text{Bern}(p) \\
86
- \text{Description:} & \text{A boolean variable that is 1 with probability } p \\
87
- \text{Parameters:} & p, \text{ the probability that } X = 1 \\
88
- \text{Support:} & x \text{ is either 0 or 1} \\
89
- \text{PMF equation:} & P(X = x) =
90
- \begin{cases}
91
- p & \text{if }x = 1\\
92
- 1-p & \text{if }x = 0
93
- \end{cases} \\
94
- \text{PMF (smooth):} & P(X = x) = p^x(1-p)^{1-x} \\
95
- \text{Expectation:} & E[X] = p \\
96
- \text{Variance:} & \text{Var}(X) = p(1-p) \\
97
- \end{array}$
98
- """
99
- )
100
  return
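The table's expectation and variance entries agree with scipy's implementation (a standalone check; `p = 0.3` is an arbitrary choice):

```python
# E[X] = p and Var(X) = p(1 - p) for a Bernoulli(p).
from scipy import stats

p = 0.3
X = stats.bernoulli(p)
print(X.mean(), X.var())  # 0.3 and p * (1 - p) = 0.21
```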
101
 
102
 
@@ -158,62 +150,58 @@ def _(expected_value, p_slider, plt, probabilities, values, variance):
158
  ax.legend()
159
  plt.tight_layout()
160
  plt.gca()
161
- return ax, fig
162
 
163
 
164
  @app.cell(hide_code=True)
165
  def _(mo):
166
- mo.md(
167
- r"""
168
- ## Expectation and Variance of a Bernoulli
169
-
170
- > _Note:_ The following derivations are included as reference material. The credit for these mathematical formulations belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
171
-
172
- Let's work through why $E[X] = p$ for a Bernoulli:
173
-
174
- \begin{align}
175
- E[X] &= \sum_x x \cdot P(X=x) && \text{Definition of expectation} \\
176
- &= 1 \cdot p + 0 \cdot (1-p) &&
177
- X \text{ can take on values 0 and 1} \\
178
- &= p && \text{Remove the 0 term}
179
- \end{align}
180
-
181
- And for variance, we first need $E[X^2]$:
182
-
183
- \begin{align}
184
- E[X^2]
185
- &= \sum_x x^2 \cdot P(X=x) &&\text{LOTUS}\\
186
- &= 0^2 \cdot (1-p) + 1^2 \cdot p\\
187
- &= p
188
- \end{align}
189
-
190
- \begin{align}
191
- \text{Var}(X)
192
- &= E[X^2] - E[X]^2&& \text{Def of variance} \\
193
- &= p - p^2 && \text{Substitute }E[X^2]=p, E[X] = p \\
194
- &= p (1-p) && \text{Factor out }p
195
- \end{align}
196
- """
197
- )
198
  return
199
 
200
 
201
  @app.cell(hide_code=True)
202
  def _(mo):
203
- mo.md(
204
- r"""
205
- ## Indicator Random Variables
206
 
207
- Indicator variables are a clever trick I like to use — they turn events into numbers. Instead of dealing with "did the event happen?" (yes/no), we get "1" if it happened and "0" if it didn't.
208
 
209
- Formally: an indicator variable $I$ for event $A$ equals 1 when $A$ occurs and 0 otherwise. These are just Bernoulli variables where $p = P(A)$. People often use notation like $I_A$ to name them.
210
 
211
- Two key properties that make them super useful:
212
 
213
- - $P(I=1)=P(A)$ - probability of getting a 1 is just the probability of the event
214
- - $E[I]=P(A)$ - the expected value equals the probability (this one's a game-changer!)
215
- """
216
- )
217
  return
218
 
219
 
@@ -234,7 +222,9 @@ def _(mo):
234
 
235
  @app.cell(hide_code=True)
236
  def _(mo):
237
- mo.md(r"""## Simulation""")
 
 
238
  return
239
 
240
 
@@ -276,7 +266,7 @@ def _(np, num_trials_slider, p_sim_slider, plt):
276
 
277
  plt.tight_layout()
278
  plt.gca()
279
- return cumulative_mean, p, trials
280
 
281
 
282
  @app.cell(hide_code=True)
@@ -296,88 +286,84 @@ def _(mo, np, trials):
296
 
297
  This demonstrates how the sample proportion approaches the true probability $p$ as the number of trials increases.
298
  """)
299
- return num_successes, num_trials, proportion
300
 
301
 
302
  @app.cell(hide_code=True)
303
  def _(mo):
304
- mo.md(
305
- r"""
306
- ## 🤔 Test Your Understanding
307
 
308
- Pick which of these statements about Bernoulli random variables you think are correct:
309
 
310
- /// details | The variance of a Bernoulli random variable is always less than or equal to 0.25
311
- ✅ Correct! The variance $p(1-p)$ reaches its maximum value of 0.25 when $p = 0.5$.
312
- ///
313
 
314
- /// details | The expected value of a Bernoulli random variable must be either 0 or 1
315
- ❌ Incorrect! The expected value is $p$, which can be any value between 0 and 1.
316
- ///
317
 
318
- /// details | If $X \sim \text{Bern}(0.3)$ and $Y \sim \text{Bern}(0.7)$, then $X$ and $Y$ have the same variance
319
- ✅ Correct! $\text{Var}(X) = 0.3 \times 0.7 = 0.21$ and $\text{Var}(Y) = 0.7 \times 0.3 = 0.21$.
320
- ///
321
 
322
- /// details | Two independent coin flips can be modeled as the sum of two Bernoulli random variables
323
- ✅ Correct! The sum would follow a Binomial distribution with $n=2$.
324
- ///
325
- """
326
- )
327
  return
328
 
329
 
330
  @app.cell(hide_code=True)
331
  def _(mo):
332
- mo.md(
333
- r"""
334
- ## Applications of Bernoulli Random Variables
335
 
336
- Bernoulli random variables are used in many real-world scenarios:
337
 
338
- 1. **Quality Control**: Testing if a manufactured item is defective (1) or not (0)
339
 
340
- 2. **A/B Testing**: Determining if a user clicks (1) or doesn't click (0) on a website button
341
 
342
- 3. **Medical Testing**: Checking if a patient tests positive (1) or negative (0) for a disease
343
 
344
- 4. **Election Modeling**: Modeling if a particular voter votes for candidate A (1) or not (0)
345
 
346
- 5. **Financial Markets**: Modeling if a stock price goes up (1) or down (0) in a simplified model
347
 
348
- Because Bernoulli random variables are parametric, as soon as you declare a random variable to be of type Bernoulli, you automatically know all of its pre-derived properties!
349
- """
350
- )
351
  return
352
 
353
 
354
  @app.cell(hide_code=True)
355
  def _(mo):
356
- mo.md(
357
- r"""
358
- ## Summary
359
 
360
- And that's a wrap on Bernoulli distributions! We've learnt the simplest of all probability distributions — the one that only has two possible outcomes. Flip a coin, check if an email is spam, see if your blind date shows up — these are all Bernoulli trials with success probability $p$.
361
 
362
- The beauty of Bernoulli is in its simplicity: just set $p$ (the probability of success) and you're good to go! The PMF gives us $P(X=1) = p$ and $P(X=0) = 1-p$, while expectation is simply $p$ and variance is $p(1-p)$. Oh, and when you're tracking whether specific events happen or not? That's an indicator random variable — just another Bernoulli in disguise!
363
 
364
- Two key things to remember:
365
 
366
- /// note
367
- 💡 **Maximum Variance**: A Bernoulli's variance $p(1-p)$ reaches its maximum at $p=0.5$, making a fair coin the most "unpredictable" Bernoulli random variable.
368
 
369
- 💡 **Instant Properties**: When you identify a random variable as Bernoulli, you instantly know all its properties—expectation, variance, PMF—without additional calculations.
370
- ///
371
 
372
- Next up: Binomial distribution—where we'll see what happens when we let Bernoulli trials have a party and add themselves together!
373
- """
374
- )
375
  return
376
 
377
 
378
  @app.cell(hide_code=True)
379
  def _(mo):
380
- mo.md(r"""#### Appendix (containing helper code for the notebook)""")
 
 
381
  return
382
 
383
 
@@ -390,7 +376,7 @@ def _():
390
  @app.cell(hide_code=True)
391
  def _():
392
  from marimo import Html
393
- return (Html,)
394
 
395
 
396
  @app.cell(hide_code=True)
@@ -407,7 +393,7 @@ def _():
407
 
408
  # Set random seed for reproducibility
409
  np.random.seed(42)
410
- return math, np, plt, stats
411
 
412
 
413
  @app.cell(hide_code=True)
 
10
 
11
  import marimo
12
 
13
+ __generated_with = "0.18.4"
14
  app = marimo.App(width="medium", app_title="Bernoulli Distribution")
15
 
16
 
17
  @app.cell(hide_code=True)
18
  def _(mo):
19
+ mo.md(r"""
20
+ # Bernoulli Distribution
 
21
 
22
+ > _Note:_ This notebook builds on concepts from ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
23
 
24
+ ## Parametric Random Variables
25
 
26
+ Probability has a bunch of classic random variable patterns that show up over and over. Let's explore some of the most important parametric discrete distributions.
27
 
28
+ Bernoulli is honestly the simplest distribution you'll ever see, but it's ridiculously powerful in practice. What makes it fascinating to me is how it captures any yes/no scenario: success/failure, heads/tails, 1/0.
29
 
30
+ I think of these distributions as the atoms of probability — they're the fundamental building blocks that everything else is made from.
31
+ """)
 
32
  return
33
 
34
 
35
  @app.cell(hide_code=True)
36
  def _(mo):
37
+ mo.md(r"""
38
+ ## Bernoulli Random Variables
 
39
 
40
+ A Bernoulli random variable boils down to just two possible values: 1 (success) or 0 (failure). Dead simple, but incredibly useful.
41
 
42
+ Some everyday examples where I see these:
43
 
44
+ - Coin flip (heads=1, tails=0)
45
+ - Whether that sketchy email is spam
46
+ - If someone actually clicks my ad
47
+ - Whether my code compiles first try (almost always 0 for me)
48
 
49
+ All you need is a single parameter $p$ - the probability of success.
50
+ """)
 
51
  return
52
 
53
 
54
  @app.cell(hide_code=True)
55
  def _(mo):
56
+ mo.md(r"""
57
+ ## Key Properties of a Bernoulli Random Variable
 
58
 
59
+ If $X$ is declared to be a Bernoulli random variable with parameter $p$, denoted $X \sim \text{Bern}(p)$, it has the following properties:
60
+ """)
 
61
  return
62
 
63
 
 
66
  # Define the Bernoulli distribution function
67
  def Bern(p):
68
  return stats.bernoulli(p)
69
+ return
70
 
71
 
72
  @app.cell(hide_code=True)
73
  def _(mo):
74
+ mo.md(r"""
75
+ ## Bernoulli Distribution Properties
76
+
77
+ $\begin{array}{lll}
78
+ \text{Notation:} & X \sim \text{Bern}(p) \\
79
+ \text{Description:} & \text{A boolean variable that is 1 with probability } p \\
80
+ \text{Parameters:} & p, \text{ the probability that } X = 1 \\
81
+ \text{Support:} & x \text{ is either 0 or 1} \\
82
+ \text{PMF equation:} & P(X = x) =
83
+ \begin{cases}
84
+ p & \text{if }x = 1\\
85
+ 1-p & \text{if }x = 0
86
+ \end{cases} \\
87
+ \text{PMF (smooth):} & P(X = x) = p^x(1-p)^{1-x} \\
88
+ \text{Expectation:} & E[X] = p \\
89
+ \text{Variance:} & \text{Var}(X) = p(1-p) \\
90
+ \end{array}$
91
+ """)
 
 
92
  return
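As a quick sanity check, the formulas in this table can be reproduced with `scipy.stats.bernoulli` (the same `stats` module the notebook's appendix imports); a minimal, self-contained sketch:

```python
from scipy import stats

p = 0.3
X = stats.bernoulli(p)

# PMF matches the table: P(X = 1) = p, P(X = 0) = 1 - p
print(X.pmf(1), X.pmf(0))   # ≈ 0.3 and ≈ 0.7

# Expectation and variance match E[X] = p and Var(X) = p(1 - p)
print(X.mean(), X.var())    # ≈ 0.3 and ≈ 0.21
```

The choice $p = 0.3$ is arbitrary; any value in $[0, 1]$ works the same way.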
93
 
94
 
 
150
  ax.legend()
151
  plt.tight_layout()
152
  plt.gca()
153
+ return
154
 
155
 
156
  @app.cell(hide_code=True)
157
  def _(mo):
158
+ mo.md(r"""
159
+ ## Expectation and Variance of a Bernoulli
160
+
161
+ > _Note:_ The following derivations are included as reference material. The credit for these mathematical formulations belongs to ["Probability for Computer Scientists"](https://chrispiech.github.io/probabilityForComputerScientists/en/part2/bernoulli/) by Chris Piech.
162
+
163
+ Let's work through why $E[X] = p$ for a Bernoulli:
164
+
165
+ \begin{align}
166
+ E[X] &= \sum_x x \cdot P(X=x) && \text{Definition of expectation} \\
167
+ &= 1 \cdot p + 0 \cdot (1-p) &&
168
+ X \text{ can take on values 0 and 1} \\
169
+ &= p && \text{Remove the 0 term}
170
+ \end{align}
171
+
172
+ And for variance, we first need $E[X^2]$:
173
+
174
+ \begin{align}
175
+ E[X^2]
176
+ &= \sum_x x^2 \cdot P(X=x) &&\text{LOTUS}\\
177
+ &= 0^2 \cdot (1-p) + 1^2 \cdot p\\
178
+ &= p
179
+ \end{align}
180
+
181
+ \begin{align}
182
+ \text{Var}(X)
183
+ &= E[X^2] - E[X]^2&& \text{Def of variance} \\
184
+ &= p - p^2 && \text{Substitute }E[X^2]=p, E[X] = p \\
185
+ &= p (1-p) && \text{Factor out }p
186
+ \end{align}
187
+ """)
 
 
188
  return
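Both derived results are easy to confirm by Monte Carlo; a sketch using NumPy directly (the seed and sample size here are arbitrary choices, independent of the notebook's helper cells):

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.3

# Draw many Bernoulli(p) samples (a Binomial with n=1 is a Bernoulli)
samples = rng.binomial(n=1, p=p, size=100_000)

# Empirical mean and variance should be close to the derived
# E[X] = p and Var(X) = p(1 - p)
print(samples.mean())   # close to 0.3
print(samples.var())    # close to 0.3 * 0.7 = 0.21
```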
189
 
190
 
191
  @app.cell(hide_code=True)
192
  def _(mo):
193
+ mo.md(r"""
194
+ ## Indicator Random Variables
 
195
 
196
+ Indicator variables are a clever trick I like to use — they turn events into numbers. Instead of dealing with "did the event happen?" (yes/no), we get "1" if it happened and "0" if it didn't.
197
 
198
+ Formally: an indicator variable $I$ for event $A$ equals 1 when $A$ occurs and 0 otherwise. These are just Bernoulli variables where $p = P(A)$. People often use notation like $I_A$ to name them.
199
 
200
+ Two key properties that make them super useful:
201
 
202
+ - $P(I=1)=P(A)$ - probability of getting a 1 is just the probability of the event
203
+ - $E[I]=P(A)$ - the expected value equals the probability (this one's a game-changer!)
204
+ """)
 
205
  return
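To make this concrete, here is a small sketch showing that the sample mean of an indicator estimates $P(A)$ (the fair-die event is an arbitrary example, not one from the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)

# Event A: a fair six-sided die shows a 6, so P(A) = 1/6
rolls = rng.integers(1, 7, size=100_000)
I_A = (rolls == 6).astype(int)   # indicator: 1 if A occurred, else 0

# E[I_A] = P(A): the indicator's sample mean estimates the probability
print(I_A.mean())   # close to 1/6 ≈ 0.167
```

This trick of "averaging an indicator to get a probability" is exactly why $E[I] = P(A)$ is such a workhorse.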
206
 
207
 
 
222
 
223
  @app.cell(hide_code=True)
224
  def _(mo):
225
+ mo.md(r"""
226
+ ## Simulation
227
+ """)
228
  return
229
 
230
 
 
266
 
267
  plt.tight_layout()
268
  plt.gca()
269
+ return (trials,)
270
 
271
 
272
  @app.cell(hide_code=True)
 
286
 
287
  This demonstrates how the sample proportion approaches the true probability $p$ as the number of trials increases.
288
  """)
289
+ return
290
 
291
 
292
  @app.cell(hide_code=True)
293
  def _(mo):
294
+ mo.md(r"""
295
+ ## 🤔 Test Your Understanding
 
296
 
297
+ Pick which of these statements about Bernoulli random variables you think are correct:
298
 
299
+ /// details | The variance of a Bernoulli random variable is always less than or equal to 0.25
300
+ ✅ Correct! The variance $p(1-p)$ reaches its maximum value of 0.25 when $p = 0.5$.
301
+ ///
302
 
303
+ /// details | The expected value of a Bernoulli random variable must be either 0 or 1
304
+ ❌ Incorrect! The expected value is $p$, which can be any value between 0 and 1.
305
+ ///
306
 
307
+ /// details | If $X \sim \text{Bern}(0.3)$ and $Y \sim \text{Bern}(0.7)$, then $X$ and $Y$ have the same variance
308
+ ✅ Correct! $\text{Var}(X) = 0.3 \times 0.7 = 0.21$ and $\text{Var}(Y) = 0.7 \times 0.3 = 0.21$.
309
+ ///
310
 
311
+ /// details | Two independent coin flips can be modeled as the sum of two Bernoulli random variables
312
+ ✅ Correct! The sum would follow a Binomial distribution with $n=2$.
313
+ ///
314
+ """)
 
315
  return
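The variance claims in the quiz are easy to verify numerically; a standalone NumPy sketch:

```python
import numpy as np

# Bernoulli variance p(1 - p) over a grid of p values
p = np.linspace(0, 1, 101)
var = p * (1 - p)

# Maximum variance is 0.25, reached at p = 0.5 (the fair coin)
print(var.max(), p[var.argmax()])

# Symmetry: Bern(0.3) and Bern(0.7) have the same variance
print(0.3 * 0.7, 0.7 * 0.3)
```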
316
 
317
 
318
  @app.cell(hide_code=True)
319
  def _(mo):
320
+ mo.md(r"""
321
+ ## Applications of Bernoulli Random Variables
 
322
 
323
+ Bernoulli random variables are used in many real-world scenarios:
324
 
325
+ 1. **Quality Control**: Testing if a manufactured item is defective (1) or not (0)
326
 
327
+ 2. **A/B Testing**: Determining if a user clicks (1) or doesn't click (0) on a website button
328
 
329
+ 3. **Medical Testing**: Checking if a patient tests positive (1) or negative (0) for a disease
330
 
331
+ 4. **Election Modeling**: Modeling if a particular voter votes for candidate A (1) or not (0)
332
 
333
+ 5. **Financial Markets**: Modeling if a stock price goes up (1) or down (0) in a simplified model
334
 
335
+ Because Bernoulli random variables are parametric, as soon as you declare a random variable to be of type Bernoulli, you automatically know all of its pre-derived properties!
336
+ """)
 
337
  return
338
 
339
 
340
  @app.cell(hide_code=True)
341
  def _(mo):
342
+ mo.md(r"""
343
+ ## Summary
 
344
 
345
+ And that's a wrap on Bernoulli distributions! We've learnt the simplest of all probability distributions — the one that only has two possible outcomes. Flip a coin, check if an email is spam, see if your blind date shows up — these are all Bernoulli trials with success probability $p$.
346
 
347
+ The beauty of Bernoulli is in its simplicity: just set $p$ (the probability of success) and you're good to go! The PMF gives us $P(X=1) = p$ and $P(X=0) = 1-p$, while expectation is simply $p$ and variance is $p(1-p)$. Oh, and when you're tracking whether specific events happen or not? That's an indicator random variable — just another Bernoulli in disguise!
348
 
349
+ Two key things to remember:
350
 
351
+ /// note
352
+ 💡 **Maximum Variance**: A Bernoulli's variance $p(1-p)$ reaches its maximum at $p=0.5$, making a fair coin the most "unpredictable" Bernoulli random variable.
353
 
354
+ 💡 **Instant Properties**: When you identify a random variable as Bernoulli, you instantly know all its properties—expectation, variance, PMF—without additional calculations.
355
+ ///
356
 
357
+ Next up: Binomial distribution—where we'll see what happens when we let Bernoulli trials have a party and add themselves together!
358
+ """)
 
359
  return
360
 
361
 
362
  @app.cell(hide_code=True)
363
  def _(mo):
364
+ mo.md(r"""
365
+ #### Appendix (containing helper code for the notebook)
366
+ """)
367
  return
368
 
369
 
 
376
  @app.cell(hide_code=True)
377
  def _():
378
  from marimo import Html
379
+ return
380
 
381
 
382
  @app.cell(hide_code=True)
 
393
 
394
  # Set random seed for reproducibility
395
  np.random.seed(42)
396
+ return np, plt, stats
397
 
398
 
399
  @app.cell(hide_code=True)