diff --git a/_server/README.md b/_server/README.md index 39670b8c374acc26ae3bdae95df1a2f8dda73623..80de9a7fec1ef68c920bdd552c41c0995971dfb6 100644 --- a/_server/README.md +++ b/_server/README.md @@ -1,3 +1,8 @@ +--- +title: Readme +marimo-version: 0.18.4 +--- + # marimo learn server This folder contains server code for hosting marimo apps. @@ -21,4 +26,4 @@ docker build -t marimo-learn . ```bash docker run -p 7860:7860 marimo-learn -``` +``` \ No newline at end of file diff --git a/daft/01_what_makes_daft_special.py b/daft/01_what_makes_daft_special.py index 1102e245cae4f86807d0f9cced9703b70f234306..7217a9d278b8d9424454dd866a8bd8ace1b4044c 100644 --- a/daft/01_what_makes_daft_special.py +++ b/daft/01_what_makes_daft_special.py @@ -8,28 +8,25 @@ import marimo -__generated_with = "0.13.6" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # What Makes Daft Special? > _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_. Welcome to the course on [Daft](https://www.getdaft.io/), the distributed dataframe library! In this first chapter, we'll explore what Daft is and what makes it a noteworthy tool in the landscape of data processing. We'll look at its core design choices and how they aim to help you work with data more effectively, whether you're a data engineer, data scientist, or analyst. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## 🎯 Introducing Daft: A Unified Data Engine Daft is a distributed query engine designed to handle a wide array of data tasks, from data engineering and analytics to powering ML/AI workflows. It provides both a Python DataFrame API, familiar to users of libraries like Pandas, and a SQL interface, allowing you to choose the interaction style that best suits your needs or the task at hand. 
@@ -37,8 +34,7 @@ def _(mo): The main goal of Daft is to provide a robust and versatile platform for processing data, whether it's gigabytes on your laptop or petabytes on a cluster. Let's go ahead and `pip install daft` to see it in action! - """ - ) + """) return @@ -86,8 +82,7 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## 🦀 Built with Rust: Performance and Simplicity One of Daft's key characteristics is that its core engine is written in Rust. This choice has several implications for users: @@ -97,8 +92,7 @@ def _(mo): * **Simplified Developer Experience**: Rust-based systems typically require less configuration tuning compared to JVM-based systems. You don't need to worry about JVM heap sizes, garbage collection parameters, or managing Java dependencies. Daft also leverages [Apache Arrow](https://arrow.apache.org/) for its in-memory data format. This allows for efficient data exchange between Daft's Rust core and Python, often with zero-copy data sharing, further enhancing performance. - """ - ) + """) return @@ -118,7 +112,9 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory!""") + mo.md(r""" + A cornerstone of Daft's design is **lazy execution**. Imagine defining a DataFrame with a trillion rows on your laptop – usually not a great prospect for your device's memory! + """) return @@ -135,7 +131,9 @@ def _(daft): @app.cell(hide_code=True) def _(mo): - mo.md(r"""With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan:""") + mo.md(r""" + With Daft, this is perfectly fine. Operations like `with_column` or `filter` don't compute results immediately. 
Instead, Daft builds a *logical plan* – a blueprint of the transformations you've defined. You can inspect this plan: + """) return @@ -147,14 +145,15 @@ def _(mo, trillion_rows_df): @app.cell(hide_code=True) def _(mo): - mo.md(r"""This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow.""") + mo.md(r""" + This plan is only executed (and data materialized) when you explicitly request it (e.g., with `.show()`, `.collect()`, or by writing to a file). Before execution, Daft's optimizer works to make your query run as efficiently as possible. This approach allows you to define complex operations on massive datasets without immediate computational cost or memory overflow. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## 🌐 Scale Your Work: From Laptop to Cluster Daft is designed with scalability in mind. As the trillion-row dataframe example above illustrates, you can write your data processing logic using Daft's Python API, and this same code can run: @@ -163,15 +162,13 @@ def _(mo): * **On a Cluster**: By integrating with [Ray](https://www.ray.io/), a framework for distributed computing. This allows Daft to scale out to process very large datasets across many machines. This "write once, scale anywhere" approach means you don't need to significantly refactor your code when moving from local development to large-scale distributed execution. We'll delve into distributed computing with Ray in a later chapter. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## 🖼️ Handling More Than Just Tables: Multimodal Data Support Modern datasets often contain more than just numbers and text. 
They might include images, audio clips, URLs pointing to external files, tensor data from machine learning models, or complex nested structures like JSON. @@ -179,8 +176,7 @@ def _(mo): Daft is built to accommodate these **multimodal data types** as integral parts of a DataFrame. This means you can have columns containing image data, embeddings, or other complex Python objects, and Daft provides mechanisms to process them. This is particularly useful for ML/AI pipelines and advanced analytics where diverse data sources are common. As an example of how Daft simplifies working with such complex data, let's see how we can process image URLs. With just a few lines of Daft code, we can pull open data from the [National Gallery of Art](https://github.com/NationalGalleryOfArt/opendata), then directly fetch, decode, and even resize the images within our DataFrame: - """ - ) + """) return @@ -217,20 +213,23 @@ def _(daft): @app.cell(hide_code=True) def _(mo): - mo.md(r"""> Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog.""") + mo.md(r""" + > Example inspired by the great post [Exploring Art with TypeScript, Jupyter, Polars, and Observable Plot](https://deno.com/blog/exploring-art-with-typescript-and-jupyter) published on Deno's blog. + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. 
Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers.""") + mo.md(r""" + In later chapters, we'll explore in more detail how to work with these image objects and other complex types, including applying User-Defined Functions (UDFs) for custom processing. Until then, you can [take a look at a more complex example](https://blog.getdaft.io/p/we-cloned-over-15000-repos-to-find), in which Daft is used to clone over 15,000 GitHub repos to find the best developers. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## 🧑‍💻 Designed for Developers: Python and SQL Interfaces Daft aims to be developer-friendly by offering flexible ways to interact with your data: @@ -239,8 +238,7 @@ def _(mo): * **SQL Interface**: For those who prefer SQL or have existing SQL-based logic, Daft allows you to write queries using SQL syntax. Daft can execute SQL queries directly or even translate SQL expressions into its native expression system. This dual-interface approach allows developers to choose the most appropriate tool for their specific task or leverage existing skills. - """ - ) + """) return @@ -285,8 +283,7 @@ def _(daft): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## 🟣 Daft's Value Proposition So, what makes Daft special? It's the combination of these design choices: @@ -299,8 +296,7 @@ def _(mo): These elements combine to make Daft a versatile tool for tackling modern data challenges. And this is just scratching the surface. Daft is a growing data engine with an ambitious vision: to unify data engineering, analytics, and ML/AI workflows 🚀. 
- """ - ) + """) return @@ -308,7 +304,6 @@ def _(mo): def _(): import daft import marimo as mo - return daft, mo diff --git a/daft/README.md b/daft/README.md index 79196f37628bdfe76db0fed991da8262a07508af..e51a66fd2f1991207d7bbe3a52703550a3526b3e 100644 --- a/daft/README.md +++ b/daft/README.md @@ -1,3 +1,8 @@ +--- +title: Readme +marimo-version: 0.18.4 +--- + # Learn Daft _🚧 This collection is a work in progress. Please help us add notebooks!_ @@ -23,4 +28,4 @@ You can also open notebooks in our online playground by appending marimo.app/ to **Thanks to all our notebook authors!** -* [Péter Gyarmati](https://github.com/peter-gy) +* [Péter Gyarmati](https://github.com/peter-gy) \ No newline at end of file diff --git a/duckdb/008_loading_parquet.py b/duckdb/008_loading_parquet.py index a85ca40bba6b0f5989af31e342a791f8821db19a..ffc0b4f35f0f77fea0d3ecc7b4f0c0e722306d2f 100644 --- a/duckdb/008_loading_parquet.py +++ b/duckdb/008_loading_parquet.py @@ -11,39 +11,35 @@ import marimo -__generated_with = "0.14.10" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Loading Parquet files with DuckDB *By [Thomas Liang](https://github.com/thliang01)* # - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - [Apache Parquet](https://parquet.apache.org/) is a popular columnar storage format, optimized for analytics. Its columnar nature allows query engines like DuckDB to read only the necessary columns, leading to significant performance gains, especially for wide tables. - - DuckDB has excellent, built-in support for reading Parquet files, making it incredibly easy to query and analyze Parquet data directly without a separate loading step. 
- - In this notebook, we'll explore how to load and analyze Airbnb's stock price data from a remote Parquet file: - - """ - ) + mo.md(r""" + [Apache Parquet](https://parquet.apache.org/) is a popular columnar storage format, optimized for analytics. Its columnar nature allows query engines like DuckDB to read only the necessary columns, leading to significant performance gains, especially for wide tables. + + DuckDB has excellent, built-in support for reading Parquet files, making it incredibly easy to query and analyze Parquet data directly without a separate loading step. + + In this notebook, we'll explore how to load and analyze Airbnb's stock price data from a remote Parquet file: + + """) return @@ -55,24 +51,24 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Using `FROM` to query Parquet files""") + mo.md(r""" + ## Using `FROM` to query Parquet files + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly. + mo.md(r""" + The simplest way to query a Parquet file is to use it directly in a `FROM` clause, just like you would with a table. DuckDB will automatically detect that it's a Parquet file and read it accordingly. - Let's query a dataset of Airbnb's stock price from Hugging Face. - """ - ) + Let's query a dataset of Airbnb's stock price from Hugging Face. + """) return @app.cell -def _(AIRBNB_URL, mo, null): +def _(AIRBNB_URL, mo): mo.sql( f""" SELECT * @@ -85,24 +81,24 @@ def _(AIRBNB_URL, mo, null): @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Using `read_parquet`""") + mo.md(r""" + ## Using `read_parquet` + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - For more control, you can use the `read_parquet` table function. 
This is useful when you need to specify options, for example, when dealing with multiple files or specific data types. - Some useful options for `read_parquet` include: + mo.md(r""" + For more control, you can use the `read_parquet` table function. This is useful when you need to specify options, for example, when dealing with multiple files or specific data types. + Some useful options for `read_parquet` include: - - `binary_as_string=True`: Reads `BINARY` columns as `VARCHAR`. - - `filename=True`: Adds a `filename` column with the path of the file for each row. - - `hive_partitioning=True`: Enables reading of Hive-partitioned datasets. + - `binary_as_string=True`: Reads `BINARY` columns as `VARCHAR`. + - `filename=True`: Adds a `filename` column with the path of the file for each row. + - `hive_partitioning=True`: Enables reading of Hive-partitioned datasets. - Here, we'll use `read_parquet` to select only a few relevant columns. This is much more efficient than `SELECT *` because DuckDB only needs to read the data for the columns we specify. - """ - ) + Here, we'll use `read_parquet` to select only a few relevant columns. This is much more efficient than `SELECT *` because DuckDB only needs to read the data for the columns we specify. + """) return @@ -120,31 +116,29 @@ def _(AIRBNB_URL, mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - You can also read multiple Parquet files at once using a glob pattern. For example, to read all Parquet files in a directory `data/`: + mo.md(r""" + You can also read multiple Parquet files at once using a glob pattern. 
For example, to read all Parquet files in a directory `data/`: - ```sql - SELECT * FROM read_parquet('data/*.parquet'); - ``` - """ - ) + ```sql + SELECT * FROM read_parquet('data/*.parquet'); + ``` + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Creating a table from a Parquet file""") + mo.md(r""" + ## Creating a table from a Parquet file + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - While querying Parquet files directly is powerful, sometimes it's useful to load the data into a persistent table within your DuckDB database. This can simplify subsequent queries and is a good practice if you'll be accessing the data frequently. - """ - ) + mo.md(r""" + While querying Parquet files directly is powerful, sometimes it's useful to load the data into a persistent table within your DuckDB database. This can simplify subsequent queries and is a good practice if you'll be accessing the data frequently. + """) return @@ -156,7 +150,7 @@ def _(AIRBNB_URL, mo): SELECT * FROM read_parquet('{AIRBNB_URL}'); """ ) - return airbnb_stock, stock_table + return (stock_table,) @app.cell(hide_code=True) @@ -172,7 +166,7 @@ def _(mo, stock_table): @app.cell -def _(airbnb_stock, mo): +def _(mo): mo.sql( f""" SELECT * FROM airbnb_stock LIMIT 5; @@ -183,18 +177,22 @@ def _(airbnb_stock, mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Analysis and Visualization""") + mo.md(r""" + ## Analysis and Visualization + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Let's perform a simple analysis: plotting the closing stock price over time.""") + mo.md(r""" + Let's perform a simple analysis: plotting the closing stock price over time. 
+ """) return @app.cell -def _(airbnb_stock, mo): +def _(mo): stock_data = mo.sql( f""" SELECT @@ -209,7 +207,9 @@ def _(airbnb_stock, mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now we can easily visualize this result using marimo's integration with plotting libraries like Plotly.""") + mo.md(r""" + Now we can easily visualize this result using marimo's integration with plotting libraries like Plotly. + """) return @@ -227,14 +227,15 @@ def _(px, stock_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Conclusion""") + mo.md(r""" + ## Conclusion + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" In this notebook, we've seen how easy it is to work with Parquet files in DuckDB. We learned how to: DuckDB's native Parquet support makes it a powerful tool for interactive data analysis on large datasets without complex ETL pipelines. - """ - ) + """) return diff --git a/duckdb/009_loading_json.py b/duckdb/009_loading_json.py index 05334511dc2776031d6415b881770bd4426dfc48..d48cadb5339bf0f69c4d12896e13e2e8e6364d71 100644 --- a/duckdb/009_loading_json.py +++ b/duckdb/009_loading_json.py @@ -10,38 +10,34 @@ import marimo -__generated_with = "0.12.8" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Loading JSON + mo.md(r""" + # Loading JSON - DuckDB supports reading and writing JSON through the `json` extension that should be present in most distributions and is autoloaded on first-use. If it's not, you can [install and load](https://duckdb.org/docs/stable/data/json/installing_and_loading.html) it manually like any other extension. + DuckDB supports reading and writing JSON through the `json` extension that should be present in most distributions and is autoloaded on first-use. If it's not, you can [install and load](https://duckdb.org/docs/stable/data/json/installing_and_loading.html) it manually like any other extension. 
- In this tutorial we'll cover 4 different ways we can transfer JSON data in and out of DuckDB: + In this tutorial we'll cover 4 different ways we can transfer JSON data in and out of DuckDB: - - [`FROM`](https://duckdb.org/docs/stable/sql/query_syntax/from.html) statement. - - [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function. - - [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy#copy--from) statement. - - [`IMPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement. - """ - ) + - [`FROM`](https://duckdb.org/docs/stable/sql/query_syntax/from.html) statement. + - [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function. + - [`COPY`](https://duckdb.org/docs/stable/sql/statements/copy#copy--from) statement. + - [`IMPORT DATABASE`](https://duckdb.org/docs/stable/sql/statements/export.html) statement. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Using `FROM` + mo.md(r""" + ## Using `FROM` - Loading data using `FROM` is simple and straightforward. We use a path or URL to the file we want to load where we'd normally put a table name. When we do this, DuckDB attempts to infer the right way to read the file including the correct format and column types. In most cases this is all we need to load data into DuckDB. - """ - ) + Loading data using `FROM` is simple and straightforward. We use a path or URL to the file we want to load where we'd normally put a table name. When we do this, DuckDB attempts to infer the right way to read the file including the correct format and column types. In most cases this is all we need to load data into DuckDB. 
+ """) return @@ -57,20 +53,18 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Using `read_json` + mo.md(r""" + ## Using `read_json` - For greater control over how the JSON is read, we can directly call the [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function. It supports a few different arguments — some common ones are: + For greater control over how the JSON is read, we can directly call the [`read_json`](https://duckdb.org/docs/stable/data/json/loading_json#the-read_json-function) function. It supports a few different arguments — some common ones are: - - `format='array'` or `format='newline_delimited'` - the former tells DuckDB that the rows should be read from a top-level JSON array while the latter means the rows should be read from JSON objects separated by a newline (JSONL/NDJSON). - - `ignore_errors=true` - skips lines with parse errors when reading newline delimited JSON. - - `columns={columnName: type, ...}` - lets you set types for individual columns manually. - - `dateformat` and `timestampformat` - controls how DuckDB attempts to parse [Date](https://duckdb.org/docs/stable/sql/data_types/date) and [Timestamp](https://duckdb.org/docs/stable/sql/data_types/timestamp) types. Use the format specifiers specified in the [docs](https://duckdb.org/docs/stable/sql/functions/dateformat.html#format-specifiers). + - `format='array'` or `format='newline_delimited'` - the former tells DuckDB that the rows should be read from a top-level JSON array while the latter means the rows should be read from JSON objects separated by a newline (JSONL/NDJSON). + - `ignore_errors=true` - skips lines with parse errors when reading newline delimited JSON. + - `columns={columnName: type, ...}` - lets you set types for individual columns manually. 
+ - `dateformat` and `timestampformat` - controls how DuckDB attempts to parse [Date](https://duckdb.org/docs/stable/sql/data_types/date) and [Timestamp](https://duckdb.org/docs/stable/sql/data_types/timestamp) types. Use the format specifiers specified in the [docs](https://duckdb.org/docs/stable/sql/functions/dateformat.html#format-specifiers). - We could rewrite the previous query more explicitly as: - """ - ) + We could rewrite the previous query more explicitly as: + """) return @@ -99,24 +93,24 @@ def _(mo): ; """ ) - return (cars_df,) + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Other than singular files we can read [multiple files](https://duckdb.org/docs/stable/data/multiple_files/overview.html) at a time by either passing a list of files or a UNIX glob pattern.""") + mo.md(r""" + Other than singular files we can read [multiple files](https://duckdb.org/docs/stable/data/multiple_files/overview.html) at a time by either passing a list of files or a UNIX glob pattern. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Using `COPY` + mo.md(r""" + ## Using `COPY` - `COPY` is for useful both for importing and exporting data in a variety of formats including JSON. For example, we can import data into an existing table from a JSON file. - """ - ) + `COPY` is for useful both for importing and exporting data in a variety of formats including JSON. For example, we can import data into an existing table from a JSON file. + """) return @@ -137,11 +131,11 @@ def _(mo): ); """ ) - return (cars2,) + return @app.cell -def _(cars2, mo): +def _(mo): _df = mo.sql( f""" COPY cars2 FROM 'https://raw.githubusercontent.com/vega/vega-datasets/refs/heads/main/data/cars.json' (FORMAT json, ARRAY true, DATEFORMAT '%Y-%m-%d'); @@ -153,7 +147,9 @@ def _(cars2, mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Similarly, we can write data from a table or select statement to a JSON file. 
For example, we create a new JSONL file with just the car names and miles per gallon. We first create a temporary directory to avoid cluttering our project directory.""") + mo.md(r""" + Similarly, we can write data from a table or select statement to a JSON file. For example, we create a new JSONL file with just the car names and miles per gallon. We first create a temporary directory to avoid cluttering our project directory. + """) return @@ -164,11 +160,11 @@ def _(Path): TMP_DIR = TemporaryDirectory() COPY_PATH = Path(TMP_DIR.name) / "cars_mpg.jsonl" print(COPY_PATH) - return COPY_PATH, TMP_DIR, TemporaryDirectory + return COPY_PATH, TMP_DIR @app.cell -def _(COPY_PATH, cars2, mo): +def _(COPY_PATH, mo): _df = mo.sql( f""" COPY ( @@ -191,13 +187,11 @@ def _(COPY_PATH, Path): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Using `IMPORT DATABASE` + mo.md(r""" + ## Using `IMPORT DATABASE` - The last method we can use to load JSON data is using the `IMPORT DATABASE` statement. It works in conjunction with `EXPORT DATABASE` to save and load an entire database to and from a directory. For example let's try and export our default in-memory database. - """ - ) + The last method we can use to load JSON data is using the `IMPORT DATABASE` statement. It works in conjunction with `EXPORT DATABASE` to save and load an entire database to and from a directory. For example let's try and export our default in-memory database. + """) return @@ -226,7 +220,9 @@ def _(EXPORT_PATH, Path): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can then load the database back into DuckDB.""") + mo.md(r""" + We can then load the database back into DuckDB. + """) return @@ -250,14 +246,12 @@ def _(TMP_DIR): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Further Reading + mo.md(r""" + ## Further Reading - - Complete information on the JSON support in DuckDB can be found in their [documentation](https://duckdb.org/docs/stable/data/json/overview.html). 
- - You can also learn more about using SQL in marimo from the [examples](https://github.com/marimo-team/marimo/tree/main/examples/sql). - """ - ) + - Complete information on the JSON support in DuckDB can be found in their [documentation](https://duckdb.org/docs/stable/data/json/overview.html). + - You can also learn more about using SQL in marimo from the [examples](https://github.com/marimo-team/marimo/tree/main/examples/sql). + """) return diff --git a/duckdb/011_working_with_apache_arrow.py b/duckdb/011_working_with_apache_arrow.py index 3f105e7000ee61c740f5790982f1d6685b7c176c..7765754b77735a5b8526decb5610584aa63c6215 100644 --- a/duckdb/011_working_with_apache_arrow.py +++ b/duckdb/011_working_with_apache_arrow.py @@ -14,41 +14,37 @@ import marimo -__generated_with = "0.14.12" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Working with Apache Arrow *By [Thomas Liang](https://github.com/thliang01)* # - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another. + mo.md(r""" + [Apache Arrow](https://arrow.apache.org/) is a multi-language toolbox for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another. - A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. 
This data format has a rich data type system (included nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more. + A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (included nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more. - DuckDB has native support for Apache Arrow, which is an in-memory columnar data format. This allows for efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow). + DuckDB has native support for Apache Arrow, which is an in-memory columnar data format. This allows for efficient data transfer between DuckDB and other Arrow-compatible systems, such as Polars and Pandas (via PyArrow). - In this notebook, we'll explore how to: + In this notebook, we'll explore how to: - - Create an Arrow table from a DuckDB query. - - Load an Arrow table into DuckDB. - - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames. - - Combining data from multiple sources - - Performance benefits - """ - ) + - Create an Arrow table from a DuckDB query. + - Load an Arrow table into DuckDB. + - Convert between DuckDB, Arrow, and Polars/Pandas DataFrames. + - Combining data from multiple sources + - Performance benefits + """) return @@ -71,23 +67,21 @@ def _(mo): (5, 'Eve', 40, 'London'); """ ) - return (users,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 1. Creating an Arrow Table from a DuckDB Query + mo.md(r""" + ## 1. Creating an Arrow Table from a DuckDB Query - You can directly fetch the results of a DuckDB query as an Apache Arrow table using the `.arrow()` method on the query result. 
- """ - ) + You can directly fetch the results of a DuckDB query as an Apache Arrow table using the `.arrow()` method on the query result. + """) return @app.cell -def _(mo, users): +def _(mo): users_arrow_table = mo.sql( # type: ignore """ SELECT * FROM users WHERE age > 30; @@ -98,7 +92,9 @@ def _(mo, users): @app.cell(hide_code=True) def _(mo): - mo.md(r"""The `.arrow()` method returns a `pyarrow.Table` object. We can inspect its schema:""") + mo.md(r""" + The `.arrow()` method returns a `pyarrow.Table` object. We can inspect its schema: + """) return @@ -110,13 +106,11 @@ def _(users_arrow_table): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 2. Loading an Arrow Table into DuckDB + mo.md(r""" + ## 2. Loading an Arrow Table into DuckDB - You can also register an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly with DuckDB. This allows you to query the in-memory data without any copying, which is highly efficient. - """ - ) + You can also register an existing Arrow table (or a Polars/Pandas DataFrame, which uses Arrow under the hood) directly with DuckDB. This allows you to query the in-memory data without any copying, which is highly efficient. + """) return @@ -129,17 +123,19 @@ def _(pa): 'age': [22, 45], 'city': ['Berlin', 'Tokyo'] }) - return (new_data,) + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now, we can query this Arrow table `new_data` directly from SQL by embedding it in the query.""") + mo.md(r""" + Now, we can query this Arrow table `new_data` directly from SQL by embedding it in the query. + """) return @app.cell -def _(mo, new_data): +def _(mo): mo.sql( f""" SELECT name, age, city @@ -152,19 +148,19 @@ def _(mo, new_data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 3. Convert between DuckDB, Arrow, and Polars/Pandas DataFrames. + mo.md(r""" + ## 3. Convert between DuckDB, Arrow, and Polars/Pandas DataFrames. 
- The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast. - """ - ) + The real power of DuckDB's Arrow integration comes from its seamless interoperability with data frame libraries like Polars and Pandas. Because they all share the Arrow in-memory format, conversions are often zero-copy and extremely fast. + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""### From DuckDB to Polars/Pandas""") + mo.md(r""" + ### From DuckDB to Polars/Pandas + """) return @@ -186,7 +182,9 @@ def _(users_arrow_table): @app.cell(hide_code=True) def _(mo): - mo.md(r"""### From Polars/Pandas to DuckDB""") + mo.md(r""" + ### From Polars/Pandas to DuckDB + """) return @@ -199,17 +197,19 @@ def _(pl): "price": [1200.00, 25.50, 75.00] }) polars_df - return (polars_df,) + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now we can query this Polars DataFrame directly in DuckDB:""") + mo.md(r""" + Now we can query this Polars DataFrame directly in DuckDB: + """) return @app.cell -def _(mo, polars_df): +def _(mo): # Query the Polars DataFrame directly in DuckDB mo.sql( f""" @@ -224,7 +224,9 @@ def _(mo, polars_df): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Similarly, we can query a Pandas DataFrame:""") + mo.md(r""" + Similarly, we can query a Pandas DataFrame: + """) return @@ -238,11 +240,11 @@ def _(pd): "order_date": pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17']) }) pandas_df - return (pandas_df,) + return @app.cell -def _(mo, pandas_df): +def _(mo): # Query the Pandas DataFrame in DuckDB mo.sql( f""" @@ -257,18 +259,16 @@ def _(mo, pandas_df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 4. Advanced Example: Combining Multiple Data Sources + mo.md(r""" + ## 4. 
Advanced Example: Combining Multiple Data Sources - One of the most powerful features is the ability to join data from different sources (DuckDB tables, Arrow tables, Polars/Pandas DataFrames) in a single query: - """ - ) + One of the most powerful features is the ability to join data from different sources (DuckDB tables, Arrow tables, Polars/Pandas DataFrames) in a single query: + """) return @app.cell -def _(mo, pandas_df, polars_df, users): +def _(mo): # Join the DuckDB users table with the Polars products DataFrame and Pandas orders DataFrame result = mo.sql( f""" @@ -291,27 +291,28 @@ def _(mo, pandas_df, polars_df, users): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 5. Performance Benefits of Arrow Integration + mo.md(r""" + ## 5. Performance Benefits of Arrow Integration - The zero-copy integration between DuckDB and Apache Arrow delivers significant performance and memory benefits. This seamless integration enables: + The zero-copy integration between DuckDB and Apache Arrow delivers significant performance and memory benefits. This seamless integration enables: - ### Key Benefits: + ### Key Benefits: - - **Memory Efficiency**: Arrow's columnar format uses 20-40% less memory than traditional DataFrames through compact columnar representation and better compression ratios - - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage - - **Query Performance**: 2-10x faster queries compared to traditional approaches that require data copying - - **Larger-than-Memory Analysis**: Both DuckDB and Arrow-compatible libraries support streaming query results, allowing you to execute queries on data larger than available memory by processing data in batches. 
- - **Advanced Query Optimization**: DuckDB's optimizer can push down filters and projections directly into Arrow scans, reading only relevant columns and partitions - Let's demonstrate these benefits with concrete examples: - """ - ) + - **Memory Efficiency**: Arrow's columnar format uses 20-40% less memory than traditional DataFrames through compact columnar representation and better compression ratios + - **Zero-Copy Operations**: Data can be shared between DuckDB and Arrow-compatible systems (Polars, Pandas) without any data copying, eliminating redundant memory usage + - **Query Performance**: 2-10x faster queries compared to traditional approaches that require data copying + - **Larger-than-Memory Analysis**: Both DuckDB and Arrow-compatible libraries support streaming query results, allowing you to execute queries on data larger than available memory by processing data in batches. + - **Advanced Query Optimization**: DuckDB's optimizer can push down filters and projections directly into Arrow scans, reading only relevant columns and partitions + Let's demonstrate these benefits with concrete examples: + """) return + @app.cell(hide_code=True) def _(mo): - mo.md(r"""### Memory Efficiency Demonstration""") + mo.md(r""" + ### Memory Efficiency Demonstration + """) return @@ -352,18 +353,22 @@ def _(pd, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""### Performance Comparison: Arrow vs Non-Arrow Approaches""") + mo.md(r""" + ### Performance Comparison: Arrow vs Non-Arrow Approaches + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Let's compare three approaches for the same analytical query:""") + mo.md(r""" + Let's compare three approaches for the same analytical query: + """) return @app.cell -def _(duckdb, mo, pandas_data, polars_data, time): +def _(duckdb, mo, pandas_data, time): # Test query: group by category and calculate aggregations query = """ SELECT @@ -425,14 +430,16 @@ def _(duckdb, mo, pandas_data, polars_data, time): 
@app.cell(hide_code=True) def _(mo): - mo.md(r"""### Visualizing the Performance Difference""") + mo.md(r""" + ### Visualizing the Performance Difference + """) return @app.cell def _(approach1_time, approach2_time, approach3_time, mo, pl): import altair as alt - + # Create a bar chart showing the performance comparison performance_data = pl.DataFrame({ "Approach": ["Traditional\n(Copy to DuckDB)", "Pandas\nGroupBy", "Arrow-based\n(Zero-copy)"], @@ -450,27 +457,30 @@ def _(approach1_time, approach2_time, approach3_time, mo, pl): width=400, height=300 ) - + # Display using marimo's altair_chart UI element mo.ui.altair_chart(chart) - return alt, chart, performance_data - + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""### Complex Query Performance""") + mo.md(r""" + ### Complex Query Performance + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Let's test a more complex query with joins and window functions:""") + mo.md(r""" + Let's test a more complex query with joins and window functions: + """) return @app.cell -def _(mo, pl, polars_data, time): +def _(mo, pl, time): # Create additional datasets for join operations categories_df = pl.DataFrame({ "category": [f"cat_{i}" for i in range(100)], @@ -510,23 +520,21 @@ def _(mo, pl, polars_data, time): print(f"Complex query with joins and window functions completed in {complex_query_time:.3f} seconds") complex_result - return (categories_df,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Memory Efficiency During Operations Let's demonstrate how Arrow's zero-copy operations save memory during data transformations: - """ - ) + """) return @app.cell -def _(polars_data, time): +def _(polars_data, psutil, time): import os import pyarrow.compute as pc # Add this import @@ -558,7 +566,7 @@ def _(polars_data, time): copy_ops_time = time.time() - latest_start_time memory_after_copy = process.memory_info().rss / 1024 / 1024 # MB - + print("Memory Usage Comparison:") 
print(f"Initial memory: {memory_before:.2f} MB") print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)") @@ -567,14 +575,12 @@ def _(polars_data, time): print(f"Arrow operations: {arrow_ops_time:.3f} seconds") print(f"Copy operations: {copy_ops_time:.3f} seconds") print(f"Speedup: {copy_ops_time/arrow_ops_time:.1f}x") - return pc - + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Summary In this notebook, we've explored: @@ -590,8 +596,7 @@ def _(mo): - **Better scalability**: Can handle larger datasets within the same memory constraints The seamless integration between DuckDB and Arrow-compatible systems makes it easy to work with data across different tools while maintaining high performance and memory efficiency. - """ - ) + """) return @@ -604,7 +609,7 @@ def _(): import duckdb import sqlglot import psutil - return duckdb, mo, pa, pd, pl + return duckdb, mo, pa, pd, pl, psutil if __name__ == "__main__": diff --git a/duckdb/01_getting_started.py b/duckdb/01_getting_started.py index 849b5e85122c82d5a43b99ccbb3ee80070d68923..d6a735f2a793e0ef889b8a9edeb8c262f73617fb 100644 --- a/duckdb/01_getting_started.py +++ b/duckdb/01_getting_started.py @@ -15,26 +15,23 @@ import marimo -__generated_with = "0.13.4" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - rf""" + mo.md(rf"""

DuckDB Image

- """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - rf""" + mo.md(rf""" # 🦆 **DuckDB**: An Embeddable Analytical Database System ## What is DuckDB? @@ -83,15 +80,13 @@ def _(mo): /// attention | Note DuckDB requires Python 3.7 or newer. You also need to have Python and `pip` or `conda` installed on your system. /// - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # [1. DuckDB Connections: In-Memory vs. File-based](https://duckdb.org/docs/stable/connect/overview.html) DuckDB is a lightweight, _relational database management system (RDBMS)_ designed for analytical workloads. Unlike traditional client-server databases, it operates _in-process_ (embedded within your application) and supports both _in-memory_ (temporary) and _file-based_ (persistent) storage. @@ -105,8 +100,7 @@ def _(mo): | Performance | Faster for most operations | Slightly slower but provides persistence | | Creation | duckdb.connect(':memory:') | duckdb.connect('filename.db') | | Multiple Connection Access | Limited to single connection | Multiple connections can access the same database | - """ - ) + """) return @@ -134,8 +128,7 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" ## Creating DuckDB Connections Let's create both types of DuckDB connections and explore their characteristics. @@ -144,8 +137,7 @@ def _(mo): 2. **File-based connection**: Data persists between sessions We'll then demonstrate the key differences between these connection types. - """ - ) + """) return @@ -176,28 +168,28 @@ def _(file_db, memory_db): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Testing Connection Persistence - Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist. + Let's demonstrate how in-memory databases are ephemeral, while file-based databases persist. 1. First, we'll query our tables to confirm the data was properly inserted 2. 
Then, we'll simulate an application restart by creating new connections 3. Finally, we'll check which data persists after the "restart" - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Current Database Contents""") + mo.md(r""" + ## Current Database Contents + """) return @app.cell(hide_code=True) -def _(mem_test, memory_db, mo): +def _(memory_db, mo): _df = mo.sql( f""" SELECT * FROM mem_test @@ -208,7 +200,7 @@ def _(mem_test, memory_db, mo): @app.cell(hide_code=True) -def _(file_db, file_test, mo): +def _(file_db, mo): _df = mo.sql( f""" SELECT * FROM file_test @@ -227,7 +219,9 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md(rf"""## 🔄 Simulating Application Restart...""") + mo.md(rf""" + ## 🔄 Simulating Application Restart... + """) return @@ -311,8 +305,7 @@ def _(file_data, file_data_available, mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # [2. Creating Tables in DuckDB](https://duckdb.org/docs/stable/sql/statements/create_table.html) DuckDB supports standard SQL syntax for creating tables. Let's create more complex tables to demonstrate different data types and constraints. @@ -326,8 +319,7 @@ def _(mo): - **CREATE OR REPLACE** to recreate tables - **Primary keys** and other constraints - **Various data types** including INTEGER, VARCHAR, TIMESTAMP, DECIMAL, etc. - """ - ) + """) return @@ -406,8 +398,7 @@ def _(memory_schema, mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # [3. Inserting Data Into Tables](https://duckdb.org/docs/stable/sql/statements/insert) DuckDB supports multiple ways to insert data: @@ -418,8 +409,7 @@ def _(mo): 4. **Bulk inserts**: For efficient loading of multiple rows Let's demonstrate these different insertion methods: - """ - ) + """) return @@ -741,8 +731,7 @@ def _(file_results, memory_results, mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # [4. 
Using SQL Directly in marimo](https://duckdb.org/docs/stable/sql/query_syntax/select) There are multiple ways to leverage DuckDB's SQL capabilities in marimo: @@ -752,8 +741,7 @@ def _(mo): 3. **Interactive queries**: Combining UI elements with SQL execution Let's explore these approaches: - """ - ) + """) return @@ -808,7 +796,9 @@ def _(age_threshold, filtered_users, mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""# [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html)""") + mo.md(r""" + # [5. Working with Polars and DuckDB](https://duckdb.org/docs/stable/guides/python/polars.html) + """) return @@ -904,7 +894,9 @@ def _(complex_query_result, mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""# [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html)""") + mo.md(r""" + # [6. Advanced Queries: Joins Between Tables](https://duckdb.org/docs/stable/guides/performance/join_operations.html) + """) return @@ -950,12 +942,10 @@ def _(new_memory_db): @app.cell(hide_code=True) def _(mo): - mo.md( - rf""" + mo.md(rf""" ## Join Result (Users and Departments): - """ - ) + """) return @@ -967,12 +957,10 @@ def _(join_result, mo): @app.cell(hide_code=True) def _(mo): - mo.md( - rf""" + mo.md(rf""" ## Different Types of Joins - """ - ) + """) return @@ -1122,7 +1110,9 @@ def _(join_description, join_tabs, mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""# [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html)""") + mo.md(r""" + # [7. Aggregate Functions in DuckDB](https://duckdb.org/docs/stable/sql/functions/aggregates.html) + """) return @@ -1224,7 +1214,9 @@ def _(mo, window_result): @app.cell(hide_code=True) def _(mo): - mo.md(r"""# [8. Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html)""") + mo.md(r""" + # [8. 
Converting DuckDB Results to Polars/Pandas](https://duckdb.org/docs/stable/guides/python/polars.html) + """) return @@ -1342,7 +1334,9 @@ def _(mo, pandas_result): @app.cell(hide_code=True) def _(mo): - mo.md("""# 9. Data Visualization with DuckDB and Plotly""") + mo.md(""" + # 9. Data Visualization with DuckDB and Plotly + """) return @@ -1498,8 +1492,7 @@ def _(age_groups, mo, new_memory_db, plotly_express): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" /// admonition | ## Database Management Best Practices /// @@ -1538,14 +1531,15 @@ def _(mo): - Create indexes for frequently queried columns - For large datasets, consider partitioning - Use prepared statements for repeated queries - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md(rf"""## 10. Interactive DuckDB Dashboard with marimo and Plotly""") + mo.md(rf""" + ## 10. Interactive DuckDB Dashboard with marimo and Plotly + """) return @@ -1736,8 +1730,7 @@ def _( @app.cell(hide_code=True) def _(mo): - mo.md( - rf""" + mo.md(rf""" # Summary and Key Takeaways In this notebook, we've explored DuckDB, a powerful embedded analytical database system. Here's what we covered: @@ -1770,8 +1763,7 @@ def _(mo): - Experiment with more complex queries and window functions - Use DuckDB's COPY functionality to import/export data from/to files - Create more advanced interactive dashboards with marimo and Plotly - """ - ) + """) return diff --git a/duckdb/DuckDB_Loading_CSVs.py b/duckdb/DuckDB_Loading_CSVs.py index f54da63327693e8bed93187e4623e90107e6ea6b..d7a25a2314a1bfa8ae2b932f5bf8f2f259db0d4d 100644 --- a/duckdb/DuckDB_Loading_CSVs.py +++ b/duckdb/DuckDB_Loading_CSVs.py @@ -13,39 +13,41 @@ import marimo -__generated_with = "0.12.10" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md(r"""#Loading CSVs with DuckDB""") + mo.md(r""" + #Loading CSVs with DuckDB + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" -

When I first learned about DuckDB, it was a game changer. I used to load the data I wanted to work on into database software like MS SQL Server, then build a bridge to an IDE for the language I wanted to use, such as Python or R; it was quite the hassle. DuckDB changed my whole world: now I could import the data file into the IDE or notebook, make a DuckDB connection, and there we go! But then I realized I didn't even need to import the file with Python first. I could query the CSV file directly using SQL through a DuckDB connection.

- - ##Introduction -

I found this dataset on the evolution of AI research by discipline from the OECD, and it piqued my interest. I feel like publications in natural language processing jumped drastically in the mid-2010s, and I'm excited to find out if that's the case.

- -

In this notebook, we'll:

- - """ - ) + mo.md(r""" +

When I first learned about DuckDB, it was a game changer. I used to load the data I wanted to work on into database software like MS SQL Server, then build a bridge to an IDE for the language I wanted to use, such as Python or R; it was quite the hassle. DuckDB changed my whole world: now I could import the data file into the IDE or notebook, make a DuckDB connection, and there we go! But then I realized I didn't even need to import the file with Python first. I could query the CSV file directly using SQL through a DuckDB connection.

+ + ##Introduction +

I found this dataset on the evolution of AI research by discipline from the OECD, and it piqued my interest. I feel like publications in natural language processing jumped drastically in the mid-2010s, and I'm excited to find out if that's the case.

+ +

In this notebook, we'll:

+ + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""##Load the CSV""") + mo.md(r""" + ##Load the CSV + """) return @@ -67,7 +69,9 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""##Create Another Table""") + mo.md(r""" + ##Create Another Table + """) return @@ -80,11 +84,11 @@ def _(mo): SELECT Year, Concept, publications FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv" """ ) - return Discipline_Analysis, Domain_Analysis + return @app.cell -def _(Domain_Analysis, mo): +def _(mo): Analysis = mo.sql( f""" SELECT * @@ -93,11 +97,11 @@ def _(Domain_Analysis, mo): ORDER BY Year """ ) - return (Analysis,) + return @app.cell -def _(Domain_Analysis, mo): +def _(mo): _df = mo.sql( f""" SELECT @@ -111,7 +115,7 @@ def _(Domain_Analysis, mo): @app.cell -def _(Domain_Analysis, mo): +def _(mo): NLP_Analysis = mo.sql( f""" SELECT @@ -137,21 +141,23 @@ def _(NLP_Analysis, px): @app.cell(hide_code=True) def _(mo): - mo.md(r"""

We can see a significant increase in NLP publications from 2020 onwards, which makes sense given the rapid emergence of commercial large language models and AI assistants.

""") + mo.md(r""" +

We can see a significant increase in NLP publications from 2020 onwards, which makes sense given the rapid emergence of commercial large language models and AI assistants.

+ """) + return + @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ##Conclusion -

In this notebook, we learned how to:

- - """ - ) + mo.md(r""" + ##Conclusion +

In this notebook, we learned how to:

+ + """) return @@ -159,7 +165,7 @@ def _(mo): def _(): import pyarrow import polars - return polars, pyarrow + return @app.cell diff --git a/duckdb/README.md b/duckdb/README.md index 1b7be852df8e5a382193ccf25ba5ce5af91523d7..8d4b80b21b718dd48ef9af963f065d77e2e749b0 100644 --- a/duckdb/README.md +++ b/duckdb/README.md @@ -1,3 +1,8 @@ +--- +title: Readme +marimo-version: 0.18.4 +--- + # Learn DuckDB _🚧 This collection is a work in progress. Please help us add notebooks!_ diff --git a/functional_programming/05_functors.py b/functional_programming/05_functors.py index 8954f4b1a68a8fed1a9c14f5befe1f14cc9fb8c7..cf942c543f8b9ea8a3b87c039e68ae44ec0fa9a3 100644 --- a/functional_programming/05_functors.py +++ b/functional_programming/05_functors.py @@ -7,102 +7,98 @@ import marimo -__generated_with = "0.12.8" +__generated_with = "0.18.4" app = marimo.App(app_title="Category Theory and Functors") @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # Category Theory and Functors + mo.md(""" + # Category Theory and Functors - In this notebook, you will learn: + In this notebook, you will learn: - * Why `length` is a *functor* from the category of `list concatenation` to the category of `integer addition` - * How to *lift* an ordinary function into a specific *computational context* - * How to write an *adapter* between two categories + * Why `length` is a *functor* from the category of `list concatenation` to the category of `integer addition` + * How to *lift* an ordinary function into a specific *computational context* + * How to write an *adapter* between two categories - In short, a mathematical functor is a **mapping** between two categories in category theory. In practice, a functor represents a type that can be mapped over. + In short, a mathematical functor is a **mapping** between two categories in category theory. In practice, a functor represents a type that can be mapped over. 
- /// admonition | Intuitions + /// admonition | Intuitions - - A simple intuition is that a `Functor` represents a **container** of values, along with the ability to apply a function uniformly to every element in the container. - - Another intuition is that a `Functor` represents some sort of **computational context**. - - Mathematically, `Functors` generalize the idea of a container or a computational context. - /// + - A simple intuition is that a `Functor` represents a **container** of values, along with the ability to apply a function uniformly to every element in the container. + - Another intuition is that a `Functor` represents some sort of **computational context**. + - Mathematically, `Functors` generalize the idea of a container or a computational context. + /// - We will start with intuition, introduce the basics of category theory, and then examine functors from a categorical perspective. + We will start with intuition, introduce the basics of category theory, and then examine functors from a categorical perspective. - /// details | Notebook metadata - type: info + /// details | Notebook metadata + type: info - version: 0.1.5 | last modified: 2025-04-11 | author: [métaboulie](https://github.com/metaboulie)
- reviewer: [Haleshot](https://github.com/Haleshot) + version: 0.1.5 | last modified: 2025-04-11 | author: [métaboulie](https://github.com/metaboulie)
+ reviewer: [Haleshot](https://github.com/Haleshot) - /// - """ - ) + /// + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # Functor as a Computational Context + mo.md(""" + # Functor as a Computational Context - A [**Functor**](https://wiki.haskell.org/Functor) is an abstraction that represents a computational context with the ability to apply a function to every value inside it without altering the structure of the context itself. This enables transformations while preserving the shape of the data. + A [**Functor**](https://wiki.haskell.org/Functor) is an abstraction that represents a computational context with the ability to apply a function to every value inside it without altering the structure of the context itself. This enables transformations while preserving the shape of the data. - To understand this, let's look at a simple example. + To understand this, let's look at a simple example. - ## [The One-Way Wrapper Design Pattern](http://blog.sigfpe.com/2007/04/trivial-monad.html) + ## [The One-Way Wrapper Design Pattern](http://blog.sigfpe.com/2007/04/trivial-monad.html) - Often, we need to wrap data in some kind of context. However, when performing operations on wrapped data, we typically have to: + Often, we need to wrap data in some kind of context. However, when performing operations on wrapped data, we typically have to: - 1. Unwrap the data. - 2. Modify the unwrapped data. - 3. Rewrap the modified data. + 1. Unwrap the data. + 2. Modify the unwrapped data. + 3. Rewrap the modified data. - This process is tedious and inefficient. Instead, we want to wrap data **once** and apply functions directly to the wrapped data without unwrapping it. + This process is tedious and inefficient. Instead, we want to wrap data **once** and apply functions directly to the wrapped data without unwrapping it. - /// admonition | Rules for a One-Way Wrapper + /// admonition | Rules for a One-Way Wrapper - 1. We can wrap values, but we cannot unwrap them. 
- 2. We should still be able to apply transformations to the wrapped data. - 3. Any operation that depends on wrapped data should itself return a wrapped result. - /// + 1. We can wrap values, but we cannot unwrap them. + 2. We should still be able to apply transformations to the wrapped data. + 3. Any operation that depends on wrapped data should itself return a wrapped result. + /// - Let's define such a `Wrapper` class: + Let's define such a `Wrapper` class: - ```python - from dataclasses import dataclass - from typing import TypeVar + ```python + from dataclasses import dataclass + from typing import TypeVar - A = TypeVar("A") - B = TypeVar("B") + A = TypeVar("A") + B = TypeVar("B") - @dataclass - class Wrapper[A]: - value: A - ``` + @dataclass + class Wrapper[A]: + value: A + ``` - Now, we can create an instance of wrapped data: + Now, we can create an instance of wrapped data: - ```python - wrapped = Wrapper(1) - ``` + ```python + wrapped = Wrapper(1) + ``` - ### Mapping Functions Over Wrapped Data + ### Mapping Functions Over Wrapped Data - To modify wrapped data while keeping it wrapped, we define an `fmap` method: - """ - ) + To modify wrapped data while keeping it wrapped, we define an `fmap` method: + """) return @app.cell -def _(B, Callable, Functor, dataclass): +def _(A, B, Callable, Functor, dataclass): @dataclass class Wrapper[A](Functor): value: A @@ -115,26 +111,24 @@ def _(B, Callable, Functor, dataclass): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - /// attention + mo.md(r""" + /// attention - To distinguish between regular types and functors, we use the prefix `f` to indicate `Functor`. + To distinguish between regular types and functors, we use the prefix `f` to indicate `Functor`. 
- For instance, + For instance, - - `a: A` is a regular variable of type `A` - - `g: Callable[[A], B]` is a regular function from type `A` to `B` - - `fa: Functor[A]` is a *Functor* wrapping a value of type `A` - - `fg: Functor[Callable[[A], B]]` is a *Functor* wrapping a function from type `A` to `B` + - `a: A` is a regular variable of type `A` + - `g: Callable[[A], B]` is a regular function from type `A` to `B` + - `fa: Functor[A]` is a *Functor* wrapping a value of type `A` + - `fg: Functor[Callable[[A], B]]` is a *Functor* wrapping a function from type `A` to `B` - and we will avoid using `f` to represent a function + and we will avoid using `f` to represent a function - /// + /// - > Try with Wrapper below - """ - ) + > Try with Wrapper below + """) return @@ -149,46 +143,42 @@ def _(Wrapper, pp): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - We can analyze the type signature of `fmap` for `Wrapper`: + mo.md(""" + We can analyze the type signature of `fmap` for `Wrapper`: - * `g` is of type `Callable[[A], B]` - * `fa` is of type `Wrapper[A]` - * The return value is of type `Wrapper[B]` + * `g` is of type `Callable[[A], B]` + * `fa` is of type `Wrapper[A]` + * The return value is of type `Wrapper[B]` - Thus, in Python's type system, we can express the type signature of `fmap` as: + Thus, in Python's type system, we can express the type signature of `fmap` as: - ```python - fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]: - ``` + ```python + fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B]: + ``` - Essentially, `fmap`: + Essentially, `fmap`: - 1. Takes a function `Callable[[A], B]` and a `Wrapper[A]` instance as input. - 2. Applies the function to the value inside the wrapper. - 3. Returns a new `Wrapper[B]` instance with the transformed value, leaving the original wrapper and its internal data unmodified. + 1. Takes a function `Callable[[A], B]` and a `Wrapper[A]` instance as input. + 2. 
Applies the function to the value inside the wrapper. + 3. Returns a new `Wrapper[B]` instance with the transformed value, leaving the original wrapper and its internal data unmodified. - Now, let's examine `list` as a similar kind of wrapper. - """ - ) + Now, let's examine `list` as a similar kind of wrapper. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## The List Functor + mo.md(""" + ## The List Functor - We can define a `List` class to represent a wrapped list that supports `fmap`: - """ - ) + We can define a `List` class to represent a wrapped list that supports `fmap`: + """) return @app.cell -def _(B, Callable, Functor, dataclass): +def _(A, B, Callable, Functor, dataclass): @dataclass class List[A](Functor): value: list[A] @@ -201,7 +191,9 @@ def _(B, Callable, Functor, dataclass): @app.cell(hide_code=True) def _(mo): - mo.md(r"""> Try with List below""") + mo.md(r""" + > Try with List below + """) return @@ -215,114 +207,106 @@ def _(List, pp): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Extracting the Type of `fmap` + mo.md(""" + ### Extracting the Type of `fmap` - The type signature of `fmap` for `List` is: + The type signature of `fmap` for `List` is: - ```python - fmap(g: Callable[[A], B], fa: List[A]) -> List[B] - ``` + ```python + fmap(g: Callable[[A], B], fa: List[A]) -> List[B] + ``` - Similarly, for `Wrapper`: + Similarly, for `Wrapper`: - ```python - fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B] - ``` + ```python + fmap(g: Callable[[A], B], fa: Wrapper[A]) -> Wrapper[B] + ``` - Both follow the same pattern, which we can generalize as: + Both follow the same pattern, which we can generalize as: - ```python - fmap(g: Callable[[A], B], fa: Functor[A]) -> Functor[B] - ``` + ```python + fmap(g: Callable[[A], B], fa: Functor[A]) -> Functor[B] + ``` - where `Functor` can be `Wrapper`, `List`, or any other wrapper type that follows the same structure. 
+ where `Functor` can be `Wrapper`, `List`, or any other wrapper type that follows the same structure. - ### Functors in Haskell (optional) + ### Functors in Haskell (optional) - In Haskell, the type of `fmap` is: + In Haskell, the type of `fmap` is: - ```haskell - fmap :: Functor f => (a -> b) -> f a -> f b - ``` + ```haskell + fmap :: Functor f => (a -> b) -> f a -> f b + ``` - or equivalently: + or equivalently: - ```haskell - fmap :: Functor f => (a -> b) -> (f a -> f b) - ``` + ```haskell + fmap :: Functor f => (a -> b) -> (f a -> f b) + ``` - This means that `fmap` **lifts** an ordinary function into the **functor world**, allowing it to operate within a computational context. + This means that `fmap` **lifts** an ordinary function into the **functor world**, allowing it to operate within a computational context. - Now, let's define an abstract class for `Functor`. - """ - ) + Now, let's define an abstract class for `Functor`. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Defining Functor + mo.md(""" + ## Defining Functor - Recall that, a **Functor** is an abstraction that allows us to apply a function to values inside a computational context while preserving its structure. + Recall that, a **Functor** is an abstraction that allows us to apply a function to values inside a computational context while preserving its structure. 
- To define `Functor` in Python, we use an abstract base class: + To define `Functor` in Python, we use an abstract base class: - ```python - @dataclass - class Functor[A](ABC): - @classmethod - @abstractmethod - def fmap(g: Callable[[A], B], fa: "Functor[A]") -> "Functor[B]": - raise NotImplementedError - ``` + ```python + @dataclass + class Functor[A](ABC): + @classmethod + @abstractmethod + def fmap(g: Callable[[A], B], fa: "Functor[A]") -> "Functor[B]": + raise NotImplementedError + ``` - We can now extend custom wrappers, containers, or computation contexts with this `Functor` base class, implement the `fmap` method, and apply any function. - """ - ) + We can now extend custom wrappers, containers, or computation contexts with this `Functor` base class, implement the `fmap` method, and apply any function. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # More Functor instances (optional) + mo.md(r""" + # More Functor instances (optional) - In this section, we will explore more *Functor* instances to help you build up a better comprehension. + In this section, we will explore more *Functor* instances to help you build up a better comprehension. - The main reference is [Data.Functor](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Functor.html) - """ - ) + The main reference is [Data.Functor](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Functor.html) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## The [Maybe](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Maybe.html#t:Maybe) Functor + mo.md(r""" + ## The [Maybe](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Maybe.html#t:Maybe) Functor - **`Maybe`** is a functor that can either hold a value (`Just(value)`) or be `Nothing` (equivalent to `None` in Python). + **`Maybe`** is a functor that can either hold a value (`Just(value)`) or be `Nothing` (equivalent to `None` in Python). 
- - If the value exists, `fmap` applies the function to this value inside the functor. - - If the value is `None`, `fmap` simply returns `None`. + - If the value exists, `fmap` applies the function to this value inside the functor. + - If the value is `None`, `fmap` simply returns `None`. - /// admonition - By using `Maybe` as a functor, we gain the ability to apply transformations (`fmap`) to potentially absent values, without having to explicitly handle the `None` case every time. - /// + /// admonition + By using `Maybe` as a functor, we gain the ability to apply transformations (`fmap`) to potentially absent values, without having to explicitly handle the `None` case every time. + /// - We can implement the `Maybe` functor as: - """ - ) + We can implement the `Maybe` functor as: + """) return @app.cell -def _(B, Callable, Functor, dataclass): +def _(A, B, Callable, Functor, dataclass): @dataclass class Maybe[A](Functor): value: None | A @@ -345,24 +329,22 @@ def _(Maybe, pp): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## The [Either](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Either.html#t:Either) Functor + mo.md(r""" + ## The [Either](https://hackage.haskell.org/package/base-4.21.0.0/docs/Data-Either.html#t:Either) Functor - The `Either` type represents values with two possibilities: a value of type `Either a b` is either `Left a` or `Right b`. + The `Either` type represents values with two possibilities: a value of type `Either a b` is either `Left a` or `Right b`. - The `Either` type is sometimes used to represent a value which is **either correct or an error**; by convention, the `left` attribute is used to hold an error value and the `right` attribute is used to hold a correct value. + The `Either` type is sometimes used to represent a value which is **either correct or an error**; by convention, the `left` attribute is used to hold an error value and the `right` attribute is used to hold a correct value.
- `fmap` for `Either` will ignore Left values, but will apply the supplied function to values contained in the Right. + `fmap` for `Either` will ignore Left values, but will apply the supplied function to values contained in the Right. - The implementation is: - """ - ) + The implementation is: + """) return @app.cell -def _(B, Callable, Functor, Union, dataclass): +def _(A, B, Callable, Functor, Union, dataclass): @dataclass class Either[A](Functor): left: A = None @@ -400,29 +382,27 @@ def _(Either): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## The [RoseTree](https://en.wikipedia.org/wiki/Rose_tree) Functor + mo.md(""" + ## The [RoseTree](https://en.wikipedia.org/wiki/Rose_tree) Functor - A **RoseTree** is a tree where: + A **RoseTree** is a tree where: - - Each node holds a **value**. - - Each node has a **list of child nodes** (which are also RoseTrees). + - Each node holds a **value**. + - Each node has a **list of child nodes** (which are also RoseTrees). - This structure is useful for representing hierarchical data, such as: + This structure is useful for representing hierarchical data, such as: - - Abstract Syntax Trees (ASTs) - - File system directories - - Recursive computations + - Abstract Syntax Trees (ASTs) + - File system directories + - Recursive computations - The implementation is: - """ - ) + The implementation is: + """) return @app.cell -def _(B, Callable, Functor, dataclass): +def _(A, B, Callable, Functor, dataclass): @dataclass class RoseTree[A](Functor): value: A # The value stored in the node. @@ -459,34 +439,32 @@ def _(RoseTree, pp): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Generic Functions that can be Used with Any Functor + mo.md(""" + ## Generic Functions that can be Used with Any Functor - One of the powerful features of functors is that we can write **generic functions** that can work with any functor. 
+ One of the powerful features of functors is that we can write **generic functions** that can work with any functor. - Remember that in Haskell, the type of `fmap` can be written as: + Remember that in Haskell, the type of `fmap` can be written as: - ```haskell - fmap :: Functor f => (a -> b) -> (f a -> f b) - ``` + ```haskell + fmap :: Functor f => (a -> b) -> (f a -> f b) + ``` - Translating to Python, we get: + Translating to Python, we get: - ```python - def fmap(g: Callable[[A], B]) -> Callable[[Functor[A]], Functor[B]] - ``` + ```python + def fmap(g: Callable[[A], B]) -> Callable[[Functor[A]], Functor[B]] + ``` - This means that `fmap`: + This means that `fmap`: - - Takes an **ordinary function** `Callable[[A], B]` as input. - - Outputs a function that: - - Takes a **functor** of type `Functor[A]` as input. - - Outputs a **functor** of type `Functor[B]`. + - Takes an **ordinary function** `Callable[[A], B]` as input. + - Outputs a function that: + - Takes a **functor** of type `Functor[A]` as input. + - Outputs a **functor** of type `Functor[B]`. - Inspired by this, we can implement an `inc` function which takes a functor, applies the function `lambda x: x + 1` to every value inside it, and returns a new functor with the updated values. - """ - ) + Inspired by this, we can implement an `inc` function which takes a functor, applies the function `lambda x: x + 1` to every value inside it, and returns a new functor with the updated values. + """) return @@ -506,55 +484,51 @@ def _(flist, inc, pp, rosetree, wrapper): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - /// admonition | exercise - Implement other generic functions and apply them to different *Functor* instances. - /// - """ - ) + mo.md(r""" + /// admonition | exercise + Implement other generic functions and apply them to different *Functor* instances. 
+ /// + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""# Functor laws and utility functions""") + mo.md(r""" + # Functor laws and utility functions + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Functor laws + mo.md(""" + ## Functor laws - In addition to providing a function `fmap` of the specified type, functors are also required to satisfy two equational laws: + In addition to providing a function `fmap` of the specified type, functors are also required to satisfy two equational laws: - ```haskell - fmap id = id -- fmap preserves identity - fmap (g . h) = fmap g . fmap h -- fmap distributes over composition - ``` + ```haskell + fmap id = id -- fmap preserves identity + fmap (g . h) = fmap g . fmap h -- fmap distributes over composition + ``` - 1. `fmap` should preserve the **identity function**: mapping `id` over a functor returns that functor unchanged, so `fmap id` is itself the identity on functor values. - 2. `fmap` should also preserve **function composition**. Mapping the composed function `g . h` over a functor should give the same result as first mapping `h` over it and then mapping `g` over the result. + 1. `fmap` should preserve the **identity function**: mapping `id` over a functor returns that functor unchanged, so `fmap id` is itself the identity on functor values. + 2. `fmap` should also preserve **function composition**. Mapping the composed function `g . h` over a functor should give the same result as first mapping `h` over it and then mapping `g` over the result. - /// admonition | - - Any `Functor` instance satisfying the first law `(fmap id = id)` will [automatically satisfy the second law](https://github.com/quchen/articles/blob/master/second_functor_law.md) as well. - /// - """ - ) + /// admonition | + - Any `Functor` instance satisfying the first law `(fmap id = id)` will [automatically satisfy the second law](https://github.com/quchen/articles/blob/master/second_functor_law.md) as well.
+ /// + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Functor laws verification + mo.md(r""" + ### Functor laws verification - We can define `id` and `compose` in `Python` as: - """ - ) + We can define `id` and `compose` in `Python` as: + """) return @@ -562,12 +536,14 @@ def _(mo): def _(): id = lambda x: x compose = lambda f, g: lambda x: f(g(x)) - return compose, id + return (id,) @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can add a helper function `check_functor_law` to verify that an instance satisfies the functor laws:""") + mo.md(r""" + We can add a helper function `check_functor_law` to verify that an instance satisfies the functor laws: + """) return @@ -581,7 +557,9 @@ def _(id): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can verify the functor we've defined:""") + mo.md(r""" + We can verify the functor we've defined: + """) return @@ -589,17 +567,19 @@ def _(mo): def _(check_functor_law, flist, pp, rosetree, wrapper): for functor in (wrapper, flist, rosetree): pp(check_functor_law(functor)) - return (functor,) + return @app.cell(hide_code=True) def _(mo): - mo.md("""And here is an `EvilFunctor`. We can verify it's not a valid `Functor`.""") + mo.md(""" + And here is an `EvilFunctor`. We can verify it's not a valid `Functor`. 
+ """) return @app.cell -def _(B, Callable, Functor, dataclass): +def _(A, B, Callable, Functor, dataclass): @dataclass class EvilFunctor[A](Functor): value: list[A] @@ -624,31 +604,29 @@ def _(EvilFunctor, check_functor_law, pp): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Utility functions - - ```python - @classmethod - def const(cls, fa: "Functor[A]", b: B) -> "Functor[B]": - return cls.fmap(lambda _: b, fa) - - @classmethod - def void(cls, fa: "Functor[A]") -> "Functor[None]": - return cls.const(fa, None) - - @classmethod - def unzip( - cls, fab: "Functor[tuple[A, B]]" - ) -> tuple["Functor[A]", "Functor[B]"]: - return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab) - ``` - - - `const` replaces all values inside a functor with a constant `b` - - `void` is equivalent to `const(fa, None)`, transforming all values in a functor into `None` - - `unzip` is a generalization of the regular *unzip* on a list of pairs - """ - ) + mo.md(r""" + ## Utility functions + + ```python + @classmethod + def const(cls, fa: "Functor[A]", b: B) -> "Functor[B]": + return cls.fmap(lambda _: b, fa) + + @classmethod + def void(cls, fa: "Functor[A]") -> "Functor[None]": + return cls.const(fa, None) + + @classmethod + def unzip( + cls, fab: "Functor[tuple[A, B]]" + ) -> tuple["Functor[A]", "Functor[B]"]: + return cls.fmap(lambda p: p[0], fab), cls.fmap(lambda p: p[1], fab) + ``` + + - `const` replaces all values inside a functor with a constant `b` + - `void` is equivalent to `const(fa, None)`, transforming all values in a functor into `None` + - `unzip` is a generalization of the regular *unzip* on a list of pairs + """) return @@ -676,13 +654,11 @@ def _(List, Maybe): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - /// admonition - You can always override these utility functions with a more efficient implementation for specific instances. 
- /// - """ - ) + mo.md(r""" + /// admonition + You can always override these utility functions with a more efficient implementation for specific instances. + /// + """) return @@ -697,7 +673,9 @@ def _(List, RoseTree, flist, pp, rosetree): @app.cell(hide_code=True) def _(mo): - mo.md("""# Formal implementation of Functor""") + mo.md(""" + # Formal implementation of Functor + """) return @@ -728,291 +706,275 @@ def _(ABC, B, Callable, abstractmethod, dataclass): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Limitations of Functor + mo.md(""" + ## Limitations of Functor - Functors abstract the idea of mapping a function over each element of a structure. Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types: + Functors abstract the idea of mapping a function over each element of a structure. Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types: - ```haskell - fmap0 :: a -> f a + ```haskell + fmap0 :: a -> f a - fmap1 :: (a -> b) -> f a -> f b + fmap1 :: (a -> b) -> f a -> f b - fmap2 :: (a -> b -> c) -> f a -> f b -> f c + fmap2 :: (a -> b -> c) -> f a -> f b -> f c - fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d - ``` + fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d + ``` - And we have to declare a special version of the functor class for each case. + And we have to declare a special version of the functor class for each case. - We will learn how to resolve this problem in the next notebook on `Applicatives`. 
- """ - ) + We will learn how to resolve this problem in the next notebook on `Applicatives`. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # Introduction to Categories + mo.md(""" + # Introduction to Categories - A [category](https://en.wikibooks.org/wiki/Haskell/Category_theory#Introduction_to_categories) is, in essence, a simple collection. It has three components: + A [category](https://en.wikibooks.org/wiki/Haskell/Category_theory#Introduction_to_categories) is, in essence, a simple collection. It has three components: - - A collection of **objects**. - - A collection of **morphisms**, each of which ties two objects (a _source object_ and a _target object_) together. If $f$ is a morphism with source object $C$ and target object $B$, we write $f : C → B$. - - A notion of **composition** of these morphisms. If $g : A → B$ and $f : B → C$ are two morphisms, they can be composed, resulting in a morphism $f ∘ g : A → C$. + - A collection of **objects**. + - A collection of **morphisms**, each of which ties two objects (a _source object_ and a _target object_) together. If $f$ is a morphism with source object $C$ and target object $B$, we write $f : C → B$. + - A notion of **composition** of these morphisms. If $g : A → B$ and $f : B → C$ are two morphisms, they can be composed, resulting in a morphism $f ∘ g : A → C$. - ## Category laws + ## Category laws - There are three laws that categories need to follow. + There are three laws that categories need to follow. - 1. The composition of morphisms needs to be **associative**. Symbolically, $f ∘ (g ∘ h) = (f ∘ g) ∘ h$ + 1. The composition of morphisms needs to be **associative**. Symbolically, $f ∘ (g ∘ h) = (f ∘ g) ∘ h$ - - Morphisms are applied right to left, so with $f ∘ g$ first $g$ is applied, then $f$. + - Morphisms are applied right to left, so with $f ∘ g$ first $g$ is applied, then $f$. - 2. The category needs to be **closed** under the composition operation. 
So if $f : B → C$ and $g : A → B$, then there must be some morphism $h : A → C$ in the category such that $h = f ∘ g$. + 2. The category needs to be **closed** under the composition operation. So if $f : B → C$ and $g : A → B$, then there must be some morphism $h : A → C$ in the category such that $h = f ∘ g$. - 3. Given a category $C$, every object $A$ needs an **identity** morphism $id_A : A → A$ that acts as an identity for composition with other morphisms. Put precisely, for every morphism $g : A → B$: $g ∘ id_A = id_B ∘ g = g$ + 3. Given a category $C$, every object $A$ needs an **identity** morphism $id_A : A → A$ that acts as an identity for composition with other morphisms. Put precisely, for every morphism $g : A → B$: $g ∘ id_A = id_B ∘ g = g$ - /// attention | The definition of a category does not define: + /// attention | The definition of a category does not define: - - what `∘` is, - - what `id` is, or - - what `f`, `g`, and `h` might be. + - what `∘` is, + - what `id` is, or + - what `f`, `g`, and `h` might be. - Instead, category theory leaves it up to us to discover what they might be. - /// - """ - ) + Instead, category theory leaves it up to us to discover what they might be. + /// + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## The Python category - - The main category we'll be concerning ourselves with in this part is the Python category, or we can give it a shorter name: `Py`. `Py` treats Python types as objects and Python functions as morphisms. A function `def f(a: A) -> B` for types `A` and `B` is a morphism in `Py`. - - Remember that we defined the `id` and `compose` functions above as: - - ```Python - def id(x: A) -> A: - return x - - def compose(f: Callable[[B], C], g: Callable[[A], B]) -> Callable[[A], C]: - return lambda x: f(g(x)) - ``` - - The second law is easy to check: `compose(f, g)` is itself a Python function from `A` to `C`, so `Py` is closed under composition.
- - For the first law, we have: - - ```python - # compose(f, g) = lambda x: f(g(x)) - f ∘ (g ∘ h) - = compose(f, compose(g, h)) - = lambda x: f(compose(g, h)(x)) - = lambda x: f((lambda y: g(h(y)))(x)) - = lambda x: f(g(h(x))) - - (f ∘ g) ∘ h - = compose(compose(f, g), h) - = lambda x: compose(f, g)(h(x)) - = lambda x: (lambda y: f(g(y)))(h(x)) - = lambda x: f(g(h(x))) - ``` - - For the third law, we have: - - ```python - g ∘ id_A - = compose(g, id) # g: Callable[[A], B], id: Callable[[A], A] - = lambda x: g(id(x)) - = lambda x: g(x) # id(x) = x - = g - ``` - A similar proof applies to $id_B ∘ g = g$. - - Thus `Py` is a valid category. - """ - ) + mo.md(""" + ## The Python category + + The main category we'll be concerning ourselves with in this part is the Python category, or we can give it a shorter name: `Py`. `Py` treats Python types as objects and Python functions as morphisms. A function `def f(a: A) -> B` for types `A` and `B` is a morphism in `Py`. + + Remember that we defined the `id` and `compose` functions above as: + + ```Python + def id(x: A) -> A: + return x + + def compose(f: Callable[[B], C], g: Callable[[A], B]) -> Callable[[A], C]: + return lambda x: f(g(x)) + ``` + + The second law is easy to check: `compose(f, g)` is itself a Python function from `A` to `C`, so `Py` is closed under composition. + + For the first law, we have: + + ```python + # compose(f, g) = lambda x: f(g(x)) + f ∘ (g ∘ h) + = compose(f, compose(g, h)) + = lambda x: f(compose(g, h)(x)) + = lambda x: f((lambda y: g(h(y)))(x)) + = lambda x: f(g(h(x))) + + (f ∘ g) ∘ h + = compose(compose(f, g), h) + = lambda x: compose(f, g)(h(x)) + = lambda x: (lambda y: f(g(y)))(h(x)) + = lambda x: f(g(h(x))) + ``` + + For the third law, we have: + + ```python + g ∘ id_A + = compose(g, id) # g: Callable[[A], B], id: Callable[[A], A] + = lambda x: g(id(x)) + = lambda x: g(x) # id(x) = x + = g + ``` + A similar proof applies to $id_B ∘ g = g$. + + Thus `Py` is a valid category.
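The equational argument for `Py` can also be spot-checked at runtime. The following is only a minimal sketch, not part of the notebook: the sample morphisms `f`, `g`, and `h` are hypothetical and chosen purely for illustration, and comparing results on a handful of inputs is evidence, not a proof.

```python
# Same definitions as in the text; note that `id` shadows the Python builtin here.
def id(x):
    return x

def compose(f, g):
    return lambda x: f(g(x))

# Hypothetical sample morphisms (assumptions for illustration only).
def h(x: int) -> int:
    return x + 1

def g(x: int) -> str:
    return str(x)

def f(s: str) -> list[str]:
    return list(s)

samples = range(-3, 4)

# Associativity: f ∘ (g ∘ h) and (f ∘ g) ∘ h agree on every sample input.
assoc_ok = all(
    compose(f, compose(g, h))(x) == compose(compose(f, g), h)(x)
    for x in samples
)

# Identity: g ∘ id and id ∘ g both behave like g on every sample input.
identity_ok = all(
    compose(g, id)(x) == g(x) == compose(id, g)(x)
    for x in samples
)

print(assoc_ok, identity_ok)
```

Functions cannot be compared for equality directly in Python, so the check compares their outputs pointwise on sample inputs instead.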
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # Functors, again + mo.md(""" + # Functors, again - A functor is essentially a transformation between categories, so given categories $C$ and $D$, a functor $F : C → D$: + A functor is essentially a transformation between categories, so given categories $C$ and $D$, a functor $F : C → D$: - - Maps any object $A$ in $C$ to $F ( A )$, in $D$. - - Maps morphisms $f : A → B$ in $C$ to $F ( f ) : F ( A ) → F ( B )$ in $D$. + - Maps any object $A$ in $C$ to $F ( A )$, in $D$. + - Maps morphisms $f : A → B$ in $C$ to $F ( f ) : F ( A ) → F ( B )$ in $D$. - /// admonition | + /// admonition | - Endofunctors are functors from a category to itself. + Endofunctors are functors from a category to itself. - /// - """ - ) + /// + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Functors on the category of Python + mo.md(""" + ## Functors on the category of Python - Remember that a functor has two parts: it maps objects in one category to objects in another and morphisms in the first category to morphisms in the second. + Remember that a functor has two parts: it maps objects in one category to objects in another and morphisms in the first category to morphisms in the second. - Functors in Python are from `Py` to `Func`, where `Func` is the subcategory of `Py` defined on just that functor's types. E.g. the RoseTree functor goes from `Py` to `RoseTree`, where `RoseTree` is the category containing only RoseTree types, that is, `RoseTree[T]` for any type `T`. The morphisms in `RoseTree` are functions defined on RoseTree types, that is, functions `Callable[[RoseTree[T]], RoseTree[U]]` for types `T`, `U`. + Functors in Python are from `Py` to `Func`, where `Func` is the subcategory of `Py` defined on just that functor's types. E.g. the RoseTree functor goes from `Py` to `RoseTree`, where `RoseTree` is the category containing only RoseTree types, that is, `RoseTree[T]` for any type `T`. 
The morphisms in `RoseTree` are functions defined on RoseTree types, that is, functions `Callable[[RoseTree[T]], RoseTree[U]]` for types `T`, `U`. - Recall the definition of `Functor`: + Recall the definition of `Functor`: - ```Python - @dataclass - class Functor[A](ABC) - ``` + ```Python + @dataclass + class Functor[A](ABC) + ``` - And RoseTree: + And RoseTree: - ```Python - @dataclass - class RoseTree[A](Functor) - ``` + ```Python + @dataclass + class RoseTree[A](Functor) + ``` - **Here's the key part:** the _type constructor_ `RoseTree` takes any type `T` to a new type, `RoseTree[T]`. Also, `fmap` restricted to `RoseTree` types takes a function `Callable[[A], B]` to a function `Callable[[RoseTree[A]], RoseTree[B]]`. + **Here's the key part:** the _type constructor_ `RoseTree` takes any type `T` to a new type, `RoseTree[T]`. Also, `fmap` restricted to `RoseTree` types takes a function `Callable[[A], B]` to a function `Callable[[RoseTree[A]], RoseTree[B]]`. - But that's it. We've defined two parts, something that takes objects in `Py` to objects in another category (that of `RoseTree` types and functions defined on `RoseTree` types), and something that takes morphisms in `Py` to morphisms in this category. So `RoseTree` is a functor. + But that's it. We've defined two parts, something that takes objects in `Py` to objects in another category (that of `RoseTree` types and functions defined on `RoseTree` types), and something that takes morphisms in `Py` to morphisms in this category. So `RoseTree` is a functor. - To sum up: + To sum up: - - We work in the category **Py** and its subcategories. - - **Objects** are types (e.g., `int`, `str`, `list`). - - **Morphisms** are functions (`Callable[[A], B]`). - - **Things that take a type and return another type** are type constructors (`RoseTree[T]`). - - **Things that take a function and return another function** are higher-order functions (`Callable[[Callable[[A], B]], Callable[[C], D]]`). 
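- - **Abstract base classes (ABC)** and duck typing provide a way to express polymorphism, capturing the idea that in category theory, structures are often defined over multiple objects at once. - """ - ) + - We work in the category **Py** and its subcategories. + - **Objects** are types (e.g., `int`, `str`, `list`). + - **Morphisms** are functions (`Callable[[A], B]`). + - **Things that take a type and return another type** are type constructors (`RoseTree[T]`). + - **Things that take a function and return another function** are higher-order functions (`Callable[[Callable[[A], B]], Callable[[C], D]]`). + - **Abstract base classes (ABC)** and duck typing provide a way to express polymorphism, capturing the idea that in category theory, structures are often defined over multiple objects at once. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Functor laws, again + mo.md(""" + ## Functor laws, again - Once again there are a few axioms that functors have to obey. + Once again there are a few axioms that functors have to obey. - 1. Given an identity morphism $id_A$ on an object $A$, $F ( id_A )$ must be the identity morphism on $F ( A )$: + 1. Given an identity morphism $id_A$ on an object $A$, $F ( id_A )$ must be the identity morphism on $F ( A )$: - $$F({id} _{A})={id} _{F(A)}$$ + $$F({id} _{A})={id} _{F(A)}$$ - 2. Functors must distribute over morphism composition. + 2. Functors must distribute over morphism composition. - $$F(f\circ g)=F(f)\circ F(g)$$ - """ - ) + $$F(f\circ g)=F(f)\circ F(g)$$ + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - Remember that we defined the `id` and `compose` as - ```python - id = lambda x: x - compose = lambda f, g: lambda x: f(g(x)) - ``` - - We can define `fmap` as: - - ```python - fmap = lambda g, functor: functor.fmap(g, functor) - ``` - - Let's prove that `fmap` is a functor. - - First, let's define a `Category` for a specific `Functor`. We choose to define the `Category` for the `Wrapper` as `WrapperCategory` here for simplicity, but remember that `Wrapper` can be any `Functor`(i.e. `List`, `RoseTree`, `Maybe` and more): - - We define `WrapperCategory` as: - - ```python - @dataclass - class WrapperCategory: - @staticmethod - def id(wrapper: Wrapper[A]) -> Wrapper[A]: - return Wrapper(wrapper.value) - - @staticmethod - def compose( - f: Callable[[Wrapper[B]], Wrapper[C]], - g: Callable[[Wrapper[A]], Wrapper[B]], - wrapper: Wrapper[A] - ) -> Callable[[Wrapper[A]], Wrapper[C]]: - return f(g(Wrapper(wrapper.value))) - ``` - - And `Wrapper` is: - - ```Python - @dataclass - class Wrapper[A](Functor): - value: A - - @classmethod - def fmap(cls, g: Callable[[A], B], fa: "Wrapper[A]") -> "Wrapper[B]": - return Wrapper(g(fa.value)) - ``` - """ - ) + mo.md(""" + Remember that we defined the `id` and `compose` as + ```python + id = lambda x: x + compose = lambda f, g: lambda x: f(g(x)) + ``` + + We can define `fmap` as: + + ```python + fmap = lambda g, functor: functor.fmap(g, functor) + ``` + + Let's prove that `fmap` is a functor. + + First, let's define a `Category` for a specific `Functor`. We choose to define the `Category` for the `Wrapper` as `WrapperCategory` here for simplicity, but remember that `Wrapper` can be any `Functor`(i.e.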
We choose to define the `Category` for the `Wrapper` as `WrapperCategory` here for simplicity, but remember that `Wrapper` can be any `Functor`(i.e. `List`, `RoseTree`, `Maybe` and more): - - We define `WrapperCategory` as: - - ```python - @dataclass - class WrapperCategory: - @staticmethod - def id(wrapper: Wrapper[A]) -> Wrapper[A]: - return Wrapper(wrapper.value) - - @staticmethod - def compose( - f: Callable[[Wrapper[B]], Wrapper[C]], - g: Callable[[Wrapper[A]], Wrapper[B]], - wrapper: Wrapper[A] - ) -> Callable[[Wrapper[A]], Wrapper[C]]: - return f(g(Wrapper(wrapper.value))) - ``` - - And `Wrapper` is: - - ```Python - @dataclass - class Wrapper[A](Functor): - value: A - - @classmethod - def fmap(cls, g: Callable[[A], B], fa: "Wrapper[A]") -> "Wrapper[B]": - return Wrapper(g(fa.value)) - ``` - """ - ) + mo.md(""" + Remember that we defined the `id` and `compose` as + ```python + id = lambda x: x + compose = lambda f, g: lambda x: f(g(x)) + ``` + + We can define `fmap` as: + + ```python + fmap = lambda g, functor: functor.fmap(g, functor) + ``` + + Let's prove that `fmap` is a functor. + + First, let's define a `Category` for a specific `Functor`. We choose to define the `Category` for the `Wrapper` as `WrapperCategory` here for simplicity, but remember that `Wrapper` can be any `Functor`(i.e. 
`List`, `RoseTree`, `Maybe` and more): + + We define `WrapperCategory` as: + + ```python + @dataclass + class WrapperCategory: + @staticmethod + def id(wrapper: Wrapper[A]) -> Wrapper[A]: + return Wrapper(wrapper.value) + + @staticmethod + def compose( + f: Callable[[Wrapper[B]], Wrapper[C]], + g: Callable[[Wrapper[A]], Wrapper[B]], + wrapper: Wrapper[A] + ) -> Callable[[Wrapper[A]], Wrapper[C]]: + return f(g(Wrapper(wrapper.value))) + ``` + + And `Wrapper` is: + + ```Python + @dataclass + class Wrapper[A](Functor): + value: A + + @classmethod + def fmap(cls, g: Callable[[A], B], fa: "Wrapper[A]") -> "Wrapper[B]": + return Wrapper(g(fa.value)) + ``` + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - We can prove that: - - ```python - fmap(id, wrapper) - = Wrapper.fmap(id, wrapper) - = Wrapper(id(wrapper.value)) - = Wrapper(wrapper.value) - = WrapperCategory.id(wrapper) - ``` - and: - ```python - fmap(compose(f, g), wrapper) - = Wrapper.fmap(compose(f, g), wrapper) - = Wrapper(compose(f, g)(wrapper.value)) - = Wrapper(f(g(wrapper.value))) - - WrapperCategory.compose(fmap(f, wrapper), fmap(g, wrapper), wrapper) - = fmap(f, wrapper)(fmap(g, wrapper)(wrapper)) - = fmap(f, wrapper)(Wrapper.fmap(g, wrapper)) - = fmap(f, wrapper)(Wrapper(g(wrapper.value))) - = Wrapper.fmap(f, Wrapper(g(wrapper.value))) - = Wrapper(f(Wrapper(g(wrapper.value)).value)) - = Wrapper(f(g(wrapper.value))) # Wrapper(g(wrapper.value)).value = g(wrapper.value) - ``` - - So our `Wrapper` is a valid `Functor`. - - > Try validating functor laws for `Wrapper` below. 
- """ - ) + mo.md(""" + We can prove that: + + ```python + fmap(id, wrapper) + = Wrapper.fmap(id, wrapper) + = Wrapper(id(wrapper.value)) + = Wrapper(wrapper.value) + = WrapperCategory.id(wrapper) + ``` + and: + ```python + fmap(compose(f, g), wrapper) + = Wrapper.fmap(compose(f, g), wrapper) + = Wrapper(compose(f, g)(wrapper.value)) + = Wrapper(f(g(wrapper.value))) + + WrapperCategory.compose(fmap(f, wrapper), fmap(g, wrapper), wrapper) + = fmap(f, wrapper)(fmap(g, wrapper)(wrapper)) + = fmap(f, wrapper)(Wrapper.fmap(g, wrapper)) + = fmap(f, wrapper)(Wrapper(g(wrapper.value))) + = Wrapper.fmap(f, Wrapper(g(wrapper.value))) + = Wrapper(f(Wrapper(g(wrapper.value)).value)) + = Wrapper(f(g(wrapper.value))) # Wrapper(g(wrapper.value)).value = g(wrapper.value) + ``` + + So our `Wrapper` is a valid `Functor`. + + > Try validating functor laws for `Wrapper` below. + """) return @@ -1042,19 +1004,17 @@ def _(WrapperCategory, id, pp, wrapper): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Length as a Functor + mo.md(""" + ## Length as a Functor - Remember that a functor is a transformation between two categories. It is not only limited to a functor from `Py` to `Func`, but also includes transformations between other mathematical structures. + Remember that a functor is a transformation between two categories. It is not only limited to a functor from `Py` to `Func`, but also includes transformations between other mathematical structures. - Let’s prove that **`length`** can be viewed as a functor. Specifically, we will demonstrate that `length` is a functor from the **category of list concatenation** to the **category of integer addition**. + Let’s prove that **`length`** can be viewed as a functor. Specifically, we will demonstrate that `length` is a functor from the **category of list concatenation** to the **category of integer addition**. 
- ### Category of List Concatenation + ### Category of List Concatenation - First, let’s define the category of list concatenation: - """ - ) + First, let’s define the category of list concatenation: + """) return @@ -1078,24 +1038,20 @@ def _(A, dataclass): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - - **Identity**: The identity element is an empty list (`ListConcatenation([])`). - - **Composition**: The composition of two lists is their concatenation (`this.value + other.value`). - """ - ) + mo.md(""" + - **Identity**: The identity element is an empty list (`ListConcatenation([])`). + - **Composition**: The composition of two lists is their concatenation (`this.value + other.value`). + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Category of Integer Addition + mo.md(""" + ### Category of Integer Addition - Now, let's define the category of integer addition: - """ - ) + Now, let's define the category of integer addition: + """) return @@ -1117,28 +1073,24 @@ def _(dataclass): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - - **Identity**: The identity element is `IntAddition(0)` (the additive identity). - - **Composition**: The composition of two integers is their sum (`this.value + other.value`). - """ - ) + mo.md(""" + - **Identity**: The identity element is `IntAddition(0)` (the additive identity). + - **Composition**: The composition of two integers is their sum (`this.value + other.value`). 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Defining the Length Functor + mo.md(""" + ### Defining the Length Functor - We now define the `length` function as a functor, mapping from the category of list concatenation to the category of integer addition: + We now define the `length` function as a functor, mapping from the category of list concatenation to the category of integer addition: - ```python - length = lambda l: IntAddition(len(l.value)) - ``` - """ - ) + ```python + length = lambda l: IntAddition(len(l.value)) + ``` + """) return @@ -1150,23 +1102,23 @@ def _(IntAddition): @app.cell(hide_code=True) def _(mo): - mo.md("""This function takes an instance of `ListConcatenation`, computes its length, and returns an `IntAddition` instance with the computed length.""") + mo.md(""" + This function takes an instance of `ListConcatenation`, computes its length, and returns an `IntAddition` instance with the computed length. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Verifying Functor Laws + mo.md(""" + ### Verifying Functor Laws - Now, let’s verify that `length` satisfies the two functor laws. + Now, let’s verify that `length` satisfies the two functor laws. - **Identity Law** + **Identity Law** - The identity law states that applying the functor to the identity element of one category should give the identity element of the other category. - """ - ) + The identity law states that applying the functor to the identity element of one category should give the identity element of the other category. 
+ """) return @@ -1178,19 +1130,19 @@ def _(IntAddition, ListConcatenation, length, pp): @app.cell(hide_code=True) def _(mo): - mo.md("""This ensures that the length of an empty list (identity in the `ListConcatenation` category) is `0` (identity in the `IntAddition` category).""") + mo.md(""" + This ensures that the length of an empty list (identity in the `ListConcatenation` category) is `0` (identity in the `IntAddition` category). + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - **Composition Law** + mo.md(""" + **Composition Law** - The composition law states that the functor should preserve composition. Applying the functor to a composed element should be the same as composing the functor applied to the individual elements. - """ - ) + The composition law states that the functor should preserve composition. Applying the functor to a composed element should be the same as composing the functor applied to the individual elements. + """) return @@ -1202,36 +1154,36 @@ def _(IntAddition, ListConcatenation, length, pp): length(ListConcatenation.compose(lista, listb)) == IntAddition.compose(length(lista), length(listb)) ) - return lista, listb + return @app.cell(hide_code=True) def _(mo): - mo.md("""This ensures that the length of the concatenation of two lists is the same as the sum of the lengths of the individual lists.""") + mo.md(""" + This ensures that the length of the concatenation of two lists is the same as the sum of the lengths of the individual lists. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Bifunctor + mo.md(r""" + # Bifunctor - A `Bifunctor` is a type constructor that takes two type arguments and **is a functor in both arguments.** + A `Bifunctor` is a type constructor that takes two type arguments and **is a functor in both arguments.** - For example, think about `Either`'s usual `Functor` instance. 
It only allows you to fmap over the second type parameter: `right` values get mapped, `left` values stay as they are. + For example, think about `Either`'s usual `Functor` instance. It only allows you to `fmap` over the second type parameter: `right` values get mapped, `left` values stay as they are. - However, its `Bifunctor` instance allows you to map both halves of the sum. + However, its `Bifunctor` instance allows you to map both halves of the sum. - There are three core methods for `Bifunctor`: + There are three core methods for `Bifunctor`: - - `bimap` allows mapping over both type arguments at once. - - `first` and `second` are also provided for mapping over only one type argument at a time. + - `bimap` allows mapping over both type arguments at once. + - `first` and `second` are also provided for mapping over only one type argument at a time. - The abstraction of `Bifunctor` is: - """ - ) + The abstraction of `Bifunctor` is: + """) return @@ -1261,38 +1213,36 @@ def _(ABC, B, Callable, D, dataclass, f, id): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - /// admonition | minimal implementation requirement - - `bimap` or both `first` and `second` - /// - """ - ) + mo.md(r""" + /// admonition | minimal implementation requirement + - `bimap` or both `first` and `second` + /// + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Instances of Bifunctor""") + mo.md(r""" + ## Instances of Bifunctor + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### The Either Bifunctor + mo.md(r""" + ### The Either Bifunctor - For the `Either Bifunctor`, we allow it to map a function over the `left` value as well. + For the `Either Bifunctor`, we allow it to map a function over the `left` value as well. - Notice that, the `Either Bifunctor` still only contains the `left` value or the `right` value. + Notice that the `Either Bifunctor` still contains only one of the two: either the `left` value or the `right` value.
+ """) return @app.cell -def _(B, Bifunctor, Callable, D, dataclass): +def _(A, B, Bifunctor, C, Callable, D, dataclass): @dataclass class BiEither[A, C](Bifunctor): left: A = None @@ -1334,18 +1284,16 @@ def _(BiEither): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### The 2d Tuple Bifunctor + mo.md(r""" + ### The 2d Tuple Bifunctor - For 2d tuples, we simply expect `bimap` to map 2 functions to the 2 elements in the tuple respectively. - """ - ) + For 2d tuples, we simply expect `bimap` to map 2 functions to the 2 elements in the tuple respectively. + """) return @app.cell -def _(B, Bifunctor, Callable, D, dataclass): +def _(A, B, Bifunctor, C, Callable, D, dataclass): @dataclass class BiTuple[A, C](Bifunctor): value: tuple[A, C] @@ -1368,19 +1316,17 @@ def _(BiTuple): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Bifunctor laws + mo.md(r""" + ## Bifunctor laws - The only law we need to follow is + The only law we need to follow is - ```python - bimap(id, id, fa) == id(fa) - ``` + ```python + bimap(id, id, fa) == id(fa) + ``` - and then other laws are followed automatically. - """ - ) + and then other laws are followed automatically. + """) return @@ -1394,24 +1340,22 @@ def _(BiEither, BiTuple, id): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # Further reading - - - [The Trivial Monad](http://blog.sigfpe.com/2007/04/trivial-monad.html) - - [Haskellforall: The Category Design Pattern](https://www.haskellforall.com/2012/08/the-category-design-pattern.html) - - [Haskellforall: The Functor Design Pattern](https://www.haskellforall.com/2012/09/the-functor-design-pattern.html) - - /// attention | ATTENTION - The functor design pattern doesn't work at all if you aren't using categories in the first place. This is why you should structure your tools using the compositional category design pattern so that you can take advantage of functors to easily mix your tools together. 
- /// - - - [Haskellwiki: Functor](https://wiki.haskell.org/index.php?title=Functor) - - [Haskellwiki: Typeclassopedia#Functor](https://wiki.haskell.org/index.php?title=Typeclassopedia#Functor) - - [Haskellwiki: Typeclassopedia#Category](https://wiki.haskell.org/index.php?title=Typeclassopedia#Category) - - [Haskellwiki: Category Theory](https://en.wikibooks.org/wiki/Haskell/Category_theory) - """ - ) + mo.md(""" + # Further reading + + - [The Trivial Monad](http://blog.sigfpe.com/2007/04/trivial-monad.html) + - [Haskellforall: The Category Design Pattern](https://www.haskellforall.com/2012/08/the-category-design-pattern.html) + - [Haskellforall: The Functor Design Pattern](https://www.haskellforall.com/2012/09/the-functor-design-pattern.html) + + /// attention | ATTENTION + The functor design pattern doesn't work at all if you aren't using categories in the first place. This is why you should structure your tools using the compositional category design pattern so that you can take advantage of functors to easily mix your tools together. 
+ /// + + - [Haskellwiki: Functor](https://wiki.haskell.org/index.php?title=Functor) + - [Haskellwiki: Typeclassopedia#Functor](https://wiki.haskell.org/index.php?title=Typeclassopedia#Functor) + - [Haskellwiki: Typeclassopedia#Category](https://wiki.haskell.org/index.php?title=Typeclassopedia#Category) + - [Haskellwiki: Category Theory](https://en.wikibooks.org/wiki/Haskell/Category_theory) + """) return diff --git a/functional_programming/06_applicatives.py b/functional_programming/06_applicatives.py index ce10022cb9aed304a2de952909cfd92aa2bdadc6..22e19e0ac3dee560b395ec4c9c41b0ab56bc61ec 100644 --- a/functional_programming/06_applicatives.py +++ b/functional_programming/06_applicatives.py @@ -7,266 +7,261 @@ import marimo -__generated_with = "0.12.9" +__generated_with = "0.18.4" app = marimo.App(app_title="Applicative programming with effects") @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - # Applicative programming with effects +def _(mo): + mo.md(r""" + # Applicative programming with effects - `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style. + `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style. - Applicative is a functor with application, providing operations to + Applicative is a functor with application, providing operations to - + embed pure expressions (`pure`), and - + sequence computations and combine their results (`apply`). + + embed pure expressions (`pure`), and + + sequence computations and combine their results (`apply`). - In this notebook, you will learn: + In this notebook, you will learn: - 1. How to view `Applicative` as multi-functor intuitively. - 2. How to use `lift` to simplify chaining application. - 3. How to bring *effects* to the functional pure world. - 4. How to view `Applicative` as a lax monoidal functor. 
- 5. How to use `Alternative` to amalgamate multiple computations into a single computation. + 1. How to view `Applicative` intuitively as a multifunctor. + 2. How to use `lift` to simplify chained applications. + 3. How to bring *effects* into the purely functional world. + 4. How to view `Applicative` as a lax monoidal functor. + 5. How to use `Alternative` to amalgamate multiple computations into a single computation. + /// details | Notebook metadata + type: info - /// details | Notebook metadata - type: info - version: 0.1.3 | last modified: 2025-04-16 | author: [métaboulie](https://github.com/metaboulie)
- reviewer: [Haleshot](https://github.com/Haleshot) + version: 0.1.3 | last modified: 2025-04-16 | author: [métaboulie](https://github.com/metaboulie)
+ reviewer: [Haleshot](https://github.com/Haleshot) - /// - """ - ) + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - # The intuition: [Multifunctor](https://arxiv.org/pdf/2401.14286) +def _(mo): + mo.md(r""" + # The intuition: [Multifunctor](https://arxiv.org/pdf/2401.14286) - ## Limitations of functor + ## Limitations of functor - Recall that functors abstract the idea of mapping a function over each element of a structure. + Recall that functors abstract the idea of mapping a function over each element of a structure. - Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types: + Suppose now that we wish to generalise this idea to allow functions with any number of arguments to be mapped, rather than being restricted to functions with a single argument. More precisely, suppose that we wish to define a hierarchy of `fmap` functions with the following types: - ```haskell - fmap0 :: a -> f a + ```haskell + fmap0 :: a -> f a - fmap1 :: (a -> b) -> f a -> f b + fmap1 :: (a -> b) -> f a -> f b - fmap2 :: (a -> b -> c) -> f a -> f b -> f c + fmap2 :: (a -> b -> c) -> f a -> f b -> f c - fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d - ``` + fmap3 :: (a -> b -> c -> d) -> f a -> f b -> f c -> f d + ``` - And we have to declare a special version of the functor class for each case. - """ - ) + And we have to declare a special version of the functor class for each case. + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Defining Multifunctor - - /// admonition - we use prefix `f` rather than `ap` to indicate *Applicative Functor* - /// - - As a result, we may want to define a single `Multifunctor` such that: +def _(mo): + mo.md(r""" + ## Defining Multifunctor - 1. 
Lift a regular n-argument function into the context of functors + /// admonition + we use prefix `f` rather than `ap` to indicate *Applicative Functor* + /// - ```python - # lift a regular 3-argument function `g` - g: Callable[[A, B, C], D] - # into the context of functors - fg: Callable[[Functor[A], Functor[B], Functor[C]], Functor[D]] - ``` + As a result, we may want to define a single `Multifunctor` such that: - 3. Apply it to n functor-wrapped values + 1. Lift a regular n-argument function into the context of functors - ```python - # fa: Functor[A], fb: Functor[B], fc: Functor[C] - fg(fa, fb, fc) - ``` + ```python + # lift a regular 3-argument function `g` + g: Callable[[A, B, C], D] + # into the context of functors + fg: Callable[[Functor[A], Functor[B], Functor[C]], Functor[D]] + ``` - 5. Get a single functor-wrapped result + 3. Apply it to n functor-wrapped values - ```python - fd: Functor[D] - ``` + ```python + # fa: Functor[A], fb: Functor[B], fc: Functor[C] + fg(fa, fb, fc) + ``` - We will define a function `lift` such that + 5. Get a single functor-wrapped result ```python - fd = lift(g, fa, fb, fc) + fd: Functor[D] ``` - """ - ) - -@app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Pure, apply and lift + We will define a function `lift` such that - Traditionally, applicative functors are presented through two core operations: + ```python + fd = lift(g, fa, fb, fc) + ``` + """) + return - 1. `pure`: embeds an object (value or function) into the applicative functor - ```python - # a -> F a - pure: Callable[[A], Applicative[A]] - # for example, if `a` is - a: A - # then we can have `fa` as - fa: Applicative[A] = pure(a) - # or if we have a regular function `g` - g: Callable[[A], B] - # then we can have `fg` as - fg: Applicative[Callable[[A], B]] = pure(g) - ``` +@app.cell(hide_code=True) +def _(mo): + mo.md(r""" + ## Pure, apply and lift - 2. 
`apply`: applies a function inside an applicative functor to a value inside an applicative functor - ```python - # F (a -> b) -> F a -> F b - apply: Callable[[Applicative[Callable[[A], B]], Applicative[A]], Applicative[B]] - # and we can have - fd = apply(apply(apply(fg, fa), fb), fc) - ``` + 1. `pure`: embeds an object (value or function) into the applicative functor + ```python + # a -> F a + pure: Callable[[A], Applicative[A]] + # for example, if `a` is + a: A + # then we can have `fa` as + fa: Applicative[A] = pure(a) + # or if we have a regular function `g` + g: Callable[[A], B] + # then we can have `fg` as + fg: Applicative[Callable[[A], B]] = pure(g) + ``` - As a result, + 2. `apply`: applies a function inside an applicative functor to a value inside an applicative functor ```python - lift(g, fa, fb, fc) = apply(apply(apply(pure(g), fa), fb), fc) + # F (a -> b) -> F a -> F b + apply: Callable[[Applicative[Callable[[A], B]], Applicative[A]], Applicative[B]] + # and we can have + fd = apply(apply(apply(fg, fa), fb), fc) ``` - """ - ) + + + As a result, + + ```python + lift(g, fa, fb, fc) = apply(apply(apply(pure(g), fa), fb), fc) + ``` + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - /// admonition | How to use *Applicative* in the manner of *Multifunctor* +def _(mo): + mo.md(r""" + /// admonition | How to use *Applicative* in the manner of *Multifunctor* - 1. Define `pure` and `apply` for an `Applicative` subclass + 1. Define `pure` and `apply` for an `Applicative` subclass - - We can define them much easier compared with `lift`. + - These are much easier to define than `lift`. - 2. Use the `lift` method + 2. Use the `lift` method - - We can use it much more convenient compared with the combination of `pure` and `apply`. + - It is much more convenient to use than a chain of `pure` and `apply` calls.
- /// + /// - /// attention | You can suppress the chaining application of `apply` and `pure` as: + /// attention | You can suppress the chaining application of `apply` and `pure` as: - ```python - apply(pure(g), fa) -> lift(g, fa) - apply(apply(pure(g), fa), fb) -> lift(g, fa, fb) - apply(apply(apply(pure(g), fa), fb), fc) -> lift(g, fa, fb, fc) - ``` + ```python + apply(pure(g), fa) -> lift(g, fa) + apply(apply(pure(g), fa), fb) -> lift(g, fa, fb) + apply(apply(apply(pure(g), fa), fb), fc) -> lift(g, fa, fb, fc) + ``` - /// - """ - ) + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Abstracting applicatives +def _(mo): + mo.md(r""" + ## Abstracting applicatives - We can now provide an initial abstraction definition of applicatives: + We can now provide an initial abstraction definition of applicatives: - ```python - @dataclass - class Applicative[A](Functor, ABC): - @classmethod - @abstractmethod - def pure(cls, a: A) -> "Applicative[A]": - raise NotImplementedError("Subclasses must implement pure") - - @classmethod - @abstractmethod - def apply( - cls, fg: "Applicative[Callable[[A], B]]", fa: "Applicative[A]" - ) -> "Applicative[B]": - raise NotImplementedError("Subclasses must implement apply") - - @classmethod - def lift(cls, f: Callable, *args: "Applicative") -> "Applicative": - curr = cls.pure(f) - if not args: - return curr - for arg in args: - curr = cls.apply(curr, arg) + ```python + @dataclass + class Applicative[A](Functor, ABC): + @classmethod + @abstractmethod + def pure(cls, a: A) -> "Applicative[A]": + raise NotImplementedError("Subclasses must implement pure") + + @classmethod + @abstractmethod + def apply( + cls, fg: "Applicative[Callable[[A], B]]", fa: "Applicative[A]" + ) -> "Applicative[B]": + raise NotImplementedError("Subclasses must implement apply") + + @classmethod + def lift(cls, f: Callable, *args: "Applicative") -> "Applicative": + curr = cls.pure(f) + if not args: return curr - ``` + for arg in 
args: + curr = cls.apply(curr, arg) + return curr + ``` - /// attention | minimal implementation requirement + /// attention | minimal implementation requirement - - `pure` - - `apply` - /// - """ - ) + - `pure` + - `apply` + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""# Instances, laws and utility functions""") +def _(mo): + mo.md(r""" + # Instances, laws and utility functions + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Applicative instances +def _(mo): + mo.md(r""" + ## Applicative instances - When we are actually implementing an *Applicative* instance, we can keep in mind that `pure` and `apply` fundamentally: + When we are actually implementing an *Applicative* instance, we can keep in mind that `pure` and `apply` fundamentally: - - embed an object (value or function) to the computational context - - apply a function inside the computation context to a value inside the computational context - """ - ) + - embed an object (value or function) to the computational context + - apply a function inside the computation context to a value inside the computational context + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ### The Wrapper Applicative +def _(mo): + mo.md(r""" + ### The Wrapper Applicative - - `pure` should simply *wrap* an object, in the sense that: + - `pure` should simply *wrap* an object, in the sense that: - ```haskell - Wrapper.pure(1) => Wrapper(value=1) - ``` + ```haskell + Wrapper.pure(1) => Wrapper(value=1) + ``` - - `apply` should apply a *wrapped* function to a *wrapped* value + - `apply` should apply a *wrapped* function to a *wrapped* value - The implementation is: - """ - ) + The implementation is: + """) + return @app.cell -def _(Applicative, dataclass): +def _(A, Applicative, dataclass): @dataclass class Wrapper[A](Applicative): value: A @@ -284,42 +279,45 @@ def _(Applicative, dataclass): @app.cell(hide_code=True) -def _(mo) -> None: - 
mo.md(r"""> try with Wrapper below""") +def _(mo): + mo.md(r""" + > try with Wrapper below + """) + return @app.cell -def _(Wrapper) -> None: +def _(Wrapper): Wrapper.lift( lambda a: lambda b: lambda c: a + b * c, Wrapper(1), Wrapper(2), Wrapper(3), ) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ### The List Applicative +def _(mo): + mo.md(r""" + ### The List Applicative - - `pure` should wrap the object in a list, in the sense that: + - `pure` should wrap the object in a list, in the sense that: - ```haskell - List.pure(1) => List(value=[1]) - ``` + ```haskell + List.pure(1) => List(value=[1]) + ``` - - `apply` should apply a list of functions to a list of values - - you can think of this as cartesian product, concatenating the result of applying every function to every value + - `apply` should apply a list of functions to a list of values + - you can think of this as cartesian product, concatenating the result of applying every function to every value - The implementation is: - """ - ) + The implementation is: + """) + return @app.cell -def _(Applicative, dataclass, product): +def _(A, Applicative, dataclass, product): @dataclass class List[A](Applicative): value: list[A] @@ -335,47 +333,51 @@ def _(Applicative, dataclass, product): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""> try with List below""") +def _(mo): + mo.md(r""" + > try with List below + """) + return @app.cell -def _(List) -> None: +def _(List): List.apply( List([lambda a: a + 1, lambda a: a * 2]), List([1, 2]), ) + return @app.cell -def _(List) -> None: +def _(List): List.lift(lambda a: lambda b: a + b, List([1, 2]), List([3, 4, 5])) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ### The Maybe Applicative +def _(mo): + mo.md(r""" + ### The Maybe Applicative - - `pure` should wrap the object in a Maybe, in the sense that: + - `pure` should wrap the object in a Maybe, in the sense that: - ```haskell - Maybe.pure(1) => "Just 1" - 
Maybe.pure(None) => "Nothing" - ``` + ```haskell + Maybe.pure(1) => "Just 1" + Maybe.pure(None) => "Nothing" + ``` - - `apply` should apply a function maybe exist to a value maybe exist - - if the function is `None` or the value is `None`, simply returns `None` - - else apply the function to the value and wrap the result in `Just` + - `apply` should apply a function that may or may not exist to a value that may or may not exist + - if the function is `None` or the value is `None`, simply return `None` + - otherwise, apply the function to the value and wrap the result in `Just` - The implementation is: - """ - ) + The implementation is: + """) + return @app.cell -def _(Applicative, dataclass): +def _(A, Applicative, dataclass): @dataclass class Maybe[A](Applicative): value: None | A @@ -399,51 +401,55 @@ def _(Applicative, dataclass): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""> try with Maybe below""") +def _(mo): + mo.md(r""" + > try with Maybe below + """) + return @app.cell -def _(Maybe) -> None: +def _(Maybe): Maybe.lift( lambda a: lambda b: a + b, Maybe(1), Maybe(2), ) + return @app.cell -def _(Maybe) -> None: +def _(Maybe): Maybe.lift( lambda a: lambda b: None, Maybe(1), Maybe(2), ) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ### The Either Applicative +def _(mo): + mo.md(r""" + ### The Either Applicative - - `pure` should wrap the object in `Right`, in the sense that: + - `pure` should wrap the object in `Right`, in the sense that: - ```haskell - Either.pure(1) => Right(1) - ``` + ```haskell + Either.pure(1) => Right(1) + ``` - - `apply` should apply a function that is either on Left or Right to a value that is either on Left or Right - - if the function is `Left`, simply returns the `Left` of the function - - else `fmap` the `Right` of the function to the value + - `apply` should apply a function that is either on Left or Right to a value that is either on Left or Right + - if the function is `Left`, simply return the `Left` of the function + -
else `fmap` the `Right` of the function to the value - The implementation is: - """ - ) + The implementation is: + """) + return @app.cell -def _(Applicative, B, Callable, Union, dataclass): +def _(A, Applicative, B, Callable, Union, dataclass): @dataclass class Either[A](Applicative): left: A = None @@ -486,171 +492,180 @@ def _(Applicative, B, Callable, Union, dataclass): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""> try with `Either` below""") +def _(mo): + mo.md(r""" + > try with `Either` below + """) + return @app.cell -def _(Either) -> None: +def _(Either): Either.apply(Either(left=TypeError("Parse Error")), Either(right=2)) + return @app.cell -def _(Either) -> None: +def _(Either): Either.apply( Either(right=lambda x: x + 1), Either(left=TypeError("Parse Error")) ) + return @app.cell -def _(Either) -> None: +def _(Either): Either.apply(Either(right=lambda x: x + 1), Either(right=1)) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Collect the list of response with sequenceL +def _(mo): + mo.md(r""" + ## Collect the list of response with sequenceL - One often wants to execute a list of commands and collect the list of their response, and we can define a function `sequenceL` for this + One often wants to execute a list of commands and collect the list of their response, and we can define a function `sequenceL` for this - /// admonition - In a further notebook about `Traversable`, we will have a more generic `sequence` that execute a **sequence** of commands and collect the **sequence** of their response, which is not limited to `list`. - /// + /// admonition + In a further notebook about `Traversable`, we will have a more generic `sequence` that execute a **sequence** of commands and collect the **sequence** of their response, which is not limited to `list`. 
+ /// - ```python - @classmethod - def sequenceL(cls, fas: list["Applicative[A]"]) -> "Applicative[list[A]]": - if not fas: - return cls.pure([]) + ```python + @classmethod + def sequenceL(cls, fas: list["Applicative[A]"]) -> "Applicative[list[A]]": + if not fas: + return cls.pure([]) - return cls.apply( - cls.fmap(lambda v: lambda vs: [v] + vs, fas[0]), - cls.sequenceL(fas[1:]), - ) - ``` + return cls.apply( + cls.fmap(lambda v: lambda vs: [v] + vs, fas[0]), + cls.sequenceL(fas[1:]), + ) + ``` - Let's try `sequenceL` with the instances. - """ - ) + Let's try `sequenceL` with the instances. + """) + return @app.cell -def _(Wrapper) -> None: +def _(Wrapper): Wrapper.sequenceL([Wrapper(1), Wrapper(2), Wrapper(3)]) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - /// attention - For the `Maybe` Applicative, the presence of any `Nothing` causes the entire computation to return Nothing. - /// - """ - ) +def _(mo): + mo.md(r""" + /// attention + For the `Maybe` Applicative, the presence of any `Nothing` causes the entire computation to return Nothing. + /// + """) + return @app.cell -def _(Maybe) -> None: +def _(Maybe): Maybe.sequenceL([Maybe(1), Maybe(2), Maybe(None), Maybe(3)]) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""The result of `sequenceL` for `List Applicative` is the Cartesian product of the input lists, yielding all possible ordered combinations of elements from each list.""") +def _(mo): + mo.md(r""" + The result of `sequenceL` for `List Applicative` is the Cartesian product of the input lists, yielding all possible ordered combinations of elements from each list. 
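To see the Cartesian-product behaviour in isolation, here is a minimal sketch on plain Python lists. The function `sequence_lists` is a hypothetical helper that mirrors the recursive shape of `sequenceL` without the `List` wrapper, and the comparison against `itertools.product` is only an illustration of the claim above:

```python
from itertools import product


def sequence_lists(fas: list[list]) -> list[list]:
    # Base case: sequencing nothing yields one empty combination.
    if not fas:
        return [[]]
    # Prepend each value of the first list to every combination of the rest.
    return [[v] + vs for v in fas[0] for vs in sequence_lists(fas[1:])]


inputs = [[1, 2], [3], [5, 6, 7]]
combinations = sequence_lists(inputs)

# The same combinations, in the same order, as the standard Cartesian product.
matches_product = combinations == [list(c) for c in product(*inputs)]
```

With the inputs above there are 2 × 1 × 3 = 6 combinations, and an empty inner list would make the whole result empty.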
+ """) + return @app.cell -def _(List) -> None: +def _(List): List.sequenceL([List([1, 2]), List([3]), List([5, 6, 7])]) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Applicative laws - - /// admonition | id and compose - - Remember that - - - `id = lambda x: x` - - `compose = lambda f: lambda g: lambda x: f(g(x))` - - /// - - Traditionally, there are four laws that `Applicative` instances should satisfy. In some sense, they are all concerned with making sure that `pure` deserves its name: - - - The identity law: - ```python - # fa: Applicative[A] - apply(pure(id), fa) = fa - ``` - - Homomorphism: - ```python - # a: A - # g: Callable[[A], B] - apply(pure(g), pure(a)) = pure(g(a)) - ``` - Intuitively, applying a non-effectful function to a non-effectful argument in an effectful context is the same as just applying the function to the argument and then injecting the result into the context with pure. - - Interchange: - ```python - # a: A - # fg: Applicative[Callable[[A], B]] - apply(fg, pure(a)) = apply(pure(lambda g: g(a)), fg) - ``` - Intuitively, this says that when evaluating the application of an effectful function to a pure argument, the order in which we evaluate the function and its argument doesn't matter. - - Composition: - ```python - # fg: Applicative[Callable[[B], C]] - # fh: Applicative[Callable[[A], B]] - # fa: Applicative[A] - apply(fg, apply(fh, fa)) = lift(compose, fg, fh, fa) - ``` - This one is the trickiest law to gain intuition for. In some sense it is expressing a sort of associativity property of `apply`. - - We can add 4 helper functions to `Applicative` to check whether an instance respects the laws or not: +def _(mo): + mo.md(r""" + ## Applicative laws + + /// admonition | id and compose + + Remember that + + - `id = lambda x: x` + - `compose = lambda f: lambda g: lambda x: f(g(x))` + + /// + + Traditionally, there are four laws that `Applicative` instances should satisfy. 
In some sense, they are all concerned with making sure that `pure` deserves its name: + + - The identity law: + ```python + # fa: Applicative[A] + apply(pure(id), fa) = fa + ``` + - Homomorphism: + ```python + # a: A + # g: Callable[[A], B] + apply(pure(g), pure(a)) = pure(g(a)) + ``` + Intuitively, applying a non-effectful function to a non-effectful argument in an effectful context is the same as just applying the function to the argument and then injecting the result into the context with pure. + - Interchange: + ```python + # a: A + # fg: Applicative[Callable[[A], B]] + apply(fg, pure(a)) = apply(pure(lambda g: g(a)), fg) + ``` + Intuitively, this says that when evaluating the application of an effectful function to a pure argument, the order in which we evaluate the function and its argument doesn't matter. + - Composition: + ```python + # fg: Applicative[Callable[[B], C]] + # fh: Applicative[Callable[[A], B]] + # fa: Applicative[A] + apply(fg, apply(fh, fa)) = lift(compose, fg, fh, fa) + ``` + This one is the trickiest law to gain intuition for. In some sense it is expressing a sort of associativity property of `apply`. 
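The four laws can be spot-checked on concrete values. The sketch below assumes a bare-bones list applicative built from two hypothetical helpers, `pure_list` and `apply_list`, operating on plain Python lists; it is not the notebook's `List` class, and passing on a few sample values does not prove the laws in general:

```python
from typing import Any, Callable


def pure_list(a: Any) -> list:
    # Embed a single value (or function) in a one-element list.
    return [a]


def apply_list(fg: list[Callable], fa: list) -> list:
    # Apply every function in fg to every value in fa.
    return [g(a) for g in fg for a in fa]


identity = lambda x: x
compose = lambda f: lambda g: lambda x: f(g(x))

fa = [1, 2, 3]
fg = [lambda x: x + 1, lambda x: x * 2]
fh = [lambda x: x - 3]
g = lambda x: x * 10

# Identity: applying the wrapped identity function changes nothing.
identity_ok = apply_list(pure_list(identity), fa) == fa
# Homomorphism: pure commutes with ordinary function application.
homomorphism_ok = apply_list(pure_list(g), pure_list(4)) == pure_list(g(4))
# Interchange: a pure argument can be moved to the function side.
interchange_ok = apply_list(fg, pure_list(5)) == apply_list(pure_list(lambda k: k(5)), fg)
# Composition: nesting apply agrees with lifting compose over fg, fh and fa.
composition_ok = apply_list(fg, apply_list(fh, fa)) == apply_list(
    apply_list(apply_list(pure_list(compose), fg), fh), fa
)
```

Comparing result lists with `==` works here because the outputs are plain integers; lists of functions cannot be compared this way, which is why each check ends by applying to values.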
+ + We can add 4 helper functions to `Applicative` to check whether an instance respects the laws or not: + + ```python + @dataclass + class Applicative[A](Functor, ABC): - ```python - @dataclass - class Applicative[A](Functor, ABC): - - @classmethod - def check_identity(cls, fa: "Applicative[A]"): - if cls.lift(id, fa) != fa: - raise ValueError("Instance violates identity law") - return True - - @classmethod - def check_homomorphism(cls, a: A, f: Callable[[A], B]): - if cls.lift(f, cls.pure(a)) != cls.pure(f(a)): - raise ValueError("Instance violates homomorphism law") - return True - - @classmethod - def check_interchange(cls, a: A, fg: "Applicative[Callable[[A], B]]"): - if cls.apply(fg, cls.pure(a)) != cls.lift(lambda g: g(a), fg): - raise ValueError("Instance violates interchange law") - return True - - @classmethod - def check_composition( - cls, - fg: "Applicative[Callable[[B], C]]", - fh: "Applicative[Callable[[A], B]]", - fa: "Applicative[A]", - ): - if cls.apply(fg, cls.apply(fh, fa)) != cls.lift(compose, fg, fh, fa): - raise ValueError("Instance violates composition law") - return True - ``` + @classmethod + def check_identity(cls, fa: "Applicative[A]"): + if cls.lift(id, fa) != fa: + raise ValueError("Instance violates identity law") + return True - > Try to validate applicative laws below - """ - ) + @classmethod + def check_homomorphism(cls, a: A, f: Callable[[A], B]): + if cls.lift(f, cls.pure(a)) != cls.pure(f(a)): + raise ValueError("Instance violates homomorphism law") + return True + + @classmethod + def check_interchange(cls, a: A, fg: "Applicative[Callable[[A], B]]"): + if cls.apply(fg, cls.pure(a)) != cls.lift(lambda g: g(a), fg): + raise ValueError("Instance violates interchange law") + return True + + @classmethod + def check_composition( + cls, + fg: "Applicative[Callable[[B], C]]", + fh: "Applicative[Callable[[A], B]]", + fa: "Applicative[A]", + ): + if cls.apply(fg, cls.apply(fh, fa)) != cls.lift(compose, fg, fh, fa): + raise 
ValueError("Instance violates composition law") + return True + ``` + + > Try to validate applicative laws below + """) + return @app.cell @@ -662,7 +677,7 @@ def _(): @app.cell -def _(List, Wrapper) -> None: +def _(List, Wrapper): print("Checking Wrapper") print(Wrapper.check_identity(Wrapper.pure(1))) print(Wrapper.check_homomorphism(1, lambda x: x + 1)) @@ -684,79 +699,77 @@ def _(List, Wrapper) -> None: List.pure(lambda x: x * 2), List.pure(lambda x: x + 0.1), List.pure(1) ) ) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Utility functions +def _(mo): + mo.md(r""" + ## Utility functions - /// attention | using `fmap` - `fmap` is defined automatically using `pure` and `apply`, so you can use `fmap` with any `Applicative` - /// + /// attention | using `fmap` + `fmap` is defined automatically using `pure` and `apply`, so you can use `fmap` with any `Applicative` + /// - ```python - @dataclass - class Applicative[A](Functor, ABC): - @classmethod - def skip( - cls, fa: "Applicative[A]", fb: "Applicative[B]" - ) -> "Applicative[B]": - ''' - Sequences the effects of two Applicative computations, - but discards the result of the first. - ''' - return cls.apply(cls.const(fa, id), fb) - - @classmethod - def keep( - cls, fa: "Applicative[A]", fb: "Applicative[B]" - ) -> "Applicative[B]": - ''' - Sequences the effects of two Applicative computations, - but discard the result of the second. - ''' - return cls.lift(const, fa, fb) - - @classmethod - def revapp( - cls, fa: "Applicative[A]", fg: "Applicative[Callable[[A], [B]]]" - ) -> "Applicative[B]": - ''' - The first computation produces values which are provided - as input to the function(s) produced by the second computation. 
- ''' - return cls.lift(lambda a: lambda f: f(a), fa, fg) - ``` + ```python + @dataclass + class Applicative[A](Functor, ABC): + @classmethod + def skip( + cls, fa: "Applicative[A]", fb: "Applicative[B]" + ) -> "Applicative[B]": + ''' + Sequences the effects of two Applicative computations, + but discards the result of the first. + ''' + return cls.apply(cls.const(fa, id), fb) - - `skip` sequences the effects of two Applicative computations, but **discards the result of the first**. For example, if `m1` and `m2` are instances of type `Maybe[Int]`, then `Maybe.skip(m1, m2)` is `Nothing` whenever either `m1` or `m2` is `Nothing`; but if not, it will have the same value as `m2`. - - Likewise, `keep` sequences the effects of two computations, but **keeps only the result of the first**. - - `revapp` is similar to `apply`, but where the first computation produces value(s) which are provided as input to the function(s) produced by the second computation. - """ - ) + @classmethod + def keep( + cls, fa: "Applicative[A]", fb: "Applicative[B]" + ) -> "Applicative[B]": + ''' + Sequences the effects of two Applicative computations, + but discard the result of the second. + ''' + return cls.lift(const, fa, fb) + + @classmethod + def revapp( + cls, fa: "Applicative[A]", fg: "Applicative[Callable[[A], [B]]]" + ) -> "Applicative[B]": + ''' + The first computation produces values which are provided + as input to the function(s) produced by the second computation. + ''' + return cls.lift(lambda a: lambda f: f(a), fa, fg) + ``` + + - `skip` sequences the effects of two Applicative computations, but **discards the result of the first**. For example, if `m1` and `m2` are instances of type `Maybe[Int]`, then `Maybe.skip(m1, m2)` is `Nothing` whenever either `m1` or `m2` is `Nothing`; but if not, it will have the same value as `m2`. + - Likewise, `keep` sequences the effects of two computations, but **keeps only the result of the first**. 
+ - `revapp` is similar to `apply`, but where the first computation produces value(s) which are provided as input to the function(s) produced by the second computation. + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - /// admonition | Exercise - Try to use utility functions with different instances - /// - """ - ) +def _(mo): + mo.md(r""" + /// admonition | Exercise + Try to use utility functions with different instances + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - # Formal implementation of Applicative +def _(mo): + mo.md(r""" + # Formal implementation of Applicative - Now, we can give the formal implementation of `Applicative` - """ - ) + Now, we can give the formal implementation of `Applicative` + """) + return @app.cell @@ -887,40 +900,38 @@ def _( @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - # Effectful programming +def _(mo): + mo.md(r""" + # Effectful programming - Our original motivation for applicatives was the desire to generalise the idea of mapping to functions with multiple arguments. This is a valid interpretation of the concept of applicatives, but from the three instances we have seen it becomes clear that there is also another, more abstract view. + Our original motivation for applicatives was the desire to generalise the idea of mapping to functions with multiple arguments. This is a valid interpretation of the concept of applicatives, but from the three instances we have seen it becomes clear that there is also another, more abstract view. - The arguments are no longer just plain values but may also have effects, such as the possibility of failure, having many ways to succeed, or performing input/output actions. In this manner, applicative functors can also be viewed as abstracting the idea of **applying pure functions to effectful arguments**, with the precise form of effects that are permitted depending on the nature of the underlying functor. 
- """ - ) + The arguments are no longer just plain values but may also have effects, such as the possibility of failure, having many ways to succeed, or performing input/output actions. In this manner, applicative functors can also be viewed as abstracting the idea of **applying pure functions to effectful arguments**, with the precise form of effects that are permitted depending on the nature of the underlying functor. + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## The IO Applicative +def _(mo): + mo.md(r""" + ## The IO Applicative - We will try to define an `IO` applicative here. + We will try to define an `IO` applicative here. - As before, we first abstract how `pure` and `apply` should function. + As before, we first abstract how `pure` and `apply` should function. - - `pure` should wrap the object in an IO action, and make the object *callable* if it's not because we want to perform the action later: + - `pure` should wrap the object in an IO action, and make the object *callable* if it's not because we want to perform the action later: - ```haskell - IO.pure(1) => IO(effect=lambda: 1) - IO.pure(f) => IO(effect=f) - ``` + ```haskell + IO.pure(1) => IO(effect=lambda: 1) + IO.pure(f) => IO(effect=f) + ``` - - `apply` should perform an action that produces a value, then apply the function with the value + - `apply` should perform an action that produces a value, then apply the function with the value - The implementation is: - """ - ) + The implementation is: + """) + return @app.cell @@ -943,8 +954,11 @@ def _(Applicative, Callable, dataclass): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""For example, a function that reads a given number of lines from the keyboard can be defined in applicative style as follows:""") +def _(mo): + mo.md(r""" + For example, a function that reads a given number of lines from the keyboard can be defined in applicative style as follows: + """) + return @app.cell @@ -953,29 +967,31 @@ 
def _(IO): return IO.sequenceL([ IO.pure(input(f"input the {i}th str")) for i in range(1, n + 1) ]) - return (get_chars,) + return @app.cell -def _() -> None: +def _(): # get_chars()() return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""# From the perspective of category theory""") +def _(mo): + mo.md(r""" + # From the perspective of category theory + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Lax Monoidal Functor +def _(mo): + mo.md(r""" + ## Lax Monoidal Functor - An alternative, equivalent formulation of `Applicative` is given by - """ - ) + An alternative, equivalent formulation of `Applicative` is given by + """) + return @app.cell @@ -997,97 +1013,92 @@ def _(ABC, Functor, abstractmethod, dataclass): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - Intuitively, this states that a *monoidal functor* is one which has some sort of "default shape" and which supports some sort of "combining" operation. +def _(mo): + mo.md(r""" + Intuitively, this states that a *monoidal functor* is one which has some sort of "default shape" and which supports some sort of "combining" operation. - - `unit` provides the identity element - - `tensor` combines two contexts into a product context + - `unit` provides the identity element + - `tensor` combines two contexts into a product context - More technically, the idea is that `monoidal functor` preserves the "monoidal structure" given by the pairing constructor `(,)` and unit type `()`. - """ - ) + More technically, the idea is that `monoidal functor` preserves the "monoidal structure" given by the pairing constructor `(,)` and unit type `()`. 
+ """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - Furthermore, to deserve the name "monoidal", instances of Monoidal ought to satisfy the following laws, which seem much more straightforward than the traditional Applicative laws: +def _(mo): + mo.md(r""" + Furthermore, to deserve the name "monoidal", instances of Monoidal ought to satisfy the following laws, which seem much more straightforward than the traditional Applicative laws: - - Left identity + - Left identity - `tensor(unit, v) ≅ v` + `tensor(unit, v) ≅ v` - - Right identity + - Right identity - `tensor(u, unit) ≅ u` + `tensor(u, unit) ≅ u` - - Associativity + - Associativity - `tensor(u, tensor(v, w)) ≅ tensor(tensor(u, v), w)` - """ - ) + `tensor(u, tensor(v, w)) ≅ tensor(tensor(u, v), w)` + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - /// admonition | ≅ indicates isomorphism +def _(mo): + mo.md(r""" + /// admonition | ≅ indicates isomorphism - `≅` refers to *isomorphism* rather than equality. + `≅` refers to *isomorphism* rather than equality. - In particular we consider `(x, ()) ≅ x ≅ ((), x)` and `((x, y), z) ≅ (x, (y, z))` + In particular we consider `(x, ()) ≅ x ≅ ((), x)` and `((x, y), z) ≅ (x, (y, z))` - /// - """ - ) + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Mutual definability of Monoidal and Applicative - - We can implement `pure` and `apply` in terms of `unit` and `tensor`, and vice versa. - - ```python - pure(a) = fmap((lambda _: a), unit) - apply(fg, fa) = fmap((lambda pair: pair[0](pair[1])), tensor(fg, fa)) - ``` - - ```python - unit() = pure(()) - tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb) - ``` - """ - ) +def _(mo): + mo.md(r""" + ## Mutual definability of Monoidal and Applicative + + We can implement `pure` and `apply` in terms of `unit` and `tensor`, and vice versa. 
+ + ```python + pure(a) = fmap((lambda _: a), unit) + apply(fg, fa) = fmap((lambda pair: pair[0](pair[1])), tensor(fg, fa)) + ``` + + ```python + unit() = pure(()) + tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb) + ``` + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Instance: ListMonoidal +def _(mo): + mo.md(r""" + ## Instance: ListMonoidal - - `unit` should simply return a empty tuple wrapper in a list + - `unit` should simply return a empty tuple wrapper in a list - ```haskell - ListMonoidal.unit() => [()] - ``` + ```haskell + ListMonoidal.unit() => [()] + ``` - - `tensor` should return the *cartesian product* of the items of 2 ListMonoidal instances + - `tensor` should return the *cartesian product* of the items of 2 ListMonoidal instances - The implementation is: - """ - ) + The implementation is: + """) + return @app.cell -def _(B, Callable, Monoidal, dataclass, product): +def _(A, B, Callable, Monoidal, dataclass, product): @dataclass class ListMonoidal[A](Monoidal): items: list[A] @@ -1111,8 +1122,11 @@ def _(B, Callable, Monoidal, dataclass, product): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""> try with `ListMonoidal` below""") +def _(mo): + mo.md(r""" + > try with `ListMonoidal` below + """) + return @app.cell @@ -1124,13 +1138,17 @@ def _(ListMonoidal): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""and we can prove that `tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)`:""") +def _(mo): + mo.md(r""" + and we can prove that `tensor(fa, fb) = lift(lambda fa: lambda fb: (fa, fb), fa, fb)`: + """) + return @app.cell -def _(List, xs, ys) -> None: +def _(List, xs, ys): List.lift(lambda fa: lambda fb: (fa, fb), List(xs.items), List(ys.items)) + return @app.cell(hide_code=True) @@ -1179,83 +1197,81 @@ def _(TypeVar): A = TypeVar("A") B = TypeVar("B") C = TypeVar("C") - return A, B, C + return A, B @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - # From 
Applicative to Alternative - - ## Abstracting Alternative - - In our studies so far, we saw that both `Maybe` and `List` can represent computations with a varying number of results. - - We use `Maybe` to indicate a computation can fail somehow and `List` for computations that can have many possible results. In both of these cases, one useful operation is amalgamating all possible results from multiple computations into a single computation. +def _(mo): + mo.md(r""" + # From Applicative to Alternative - `Alternative` formalizes computations that support: + ## Abstracting Alternative - - **Failure** (empty result) - - **Choice** (combination of results) - - **Repetition** (multiple results) + In our studies so far, we saw that both `Maybe` and `List` can represent computations with a varying number of results. - It extends `Applicative` with monoidal structure, where: + We use `Maybe` to indicate a computation can fail somehow and `List` for computations that can have many possible results. In both of these cases, one useful operation is amalgamating all possible results from multiple computations into a single computation. 
- ```python - @dataclass - class Alternative[A](Applicative, ABC): - @classmethod - @abstractmethod - def empty(cls) -> "Alternative[A]": - '''Identity element for alternative computations''' - - @classmethod - @abstractmethod - def alt( - cls, fa: "Alternative[A]", fb: "Alternative[A]" - ) -> "Alternative[A]": - '''Binary operation combining computations''' - ``` + `Alternative` formalizes computations that support: - - `empty` is the identity element (e.g., `Maybe(None)`, `List([])`) - - `alt` is a combination operator (e.g., `Maybe` fallback, list concatenation) + - **Failure** (empty result) + - **Choice** (combination of results) + - **Repetition** (multiple results) - `empty` and `alt` should satisfy the following **laws**: - - ```python - # Left identity - alt(empty, fa) == fa - # Right identity - alt(fa, empty) == fa - # Associativity - alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc) - ``` + It extends `Applicative` with monoidal structure, where: - /// admonition - Actually, `Alternative` is a *monoid* on `Applicative Functors`. We will talk about *monoid* and review these laws in the next notebook about `Monads`. 
- /// + ```python + @dataclass + class Alternative[A](Applicative, ABC): + @classmethod + @abstractmethod + def empty(cls) -> "Alternative[A]": + '''Identity element for alternative computations''' - /// attention | minimal implementation requirement - - `empty` - - `alt` - /// - """ - ) + @classmethod + @abstractmethod + def alt( + cls, fa: "Alternative[A]", fb: "Alternative[A]" + ) -> "Alternative[A]": + '''Binary operation combining computations''' + ``` + + - `empty` is the identity element (e.g., `Maybe(None)`, `List([])`) + - `alt` is a combination operator (e.g., `Maybe` fallback, list concatenation) + + `empty` and `alt` should satisfy the following **laws**: + + ```python + # Left identity + alt(empty, fa) == fa + # Right identity + alt(fa, empty) == fa + # Associativity + alt(fa, alt(fb, fc)) == alt(alt(fa, fb), fc) + ``` + + /// admonition + Actually, `Alternative` is a *monoid* on `Applicative Functors`. We will talk about *monoid* and review these laws in the next notebook about `Monads`. 
+ /// + + /// attention | minimal implementation requirement + - `empty` + - `alt` + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## Instances of Alternative +def _(mo): + mo.md(r""" + ## Instances of Alternative - ### The Maybe Alternative + ### The Maybe Alternative - - `empty`: the identity element of `Maybe` is `Maybe(None)` - - `alt`: return the first element if it's not `None`, else return the second element - """ - ) + - `empty`: the identity element of `Maybe` is `Maybe(None)` + - `alt`: return the first element if it's not `None`, else return the second element + """) + return @app.cell @@ -1278,31 +1294,32 @@ def _(Alternative, Maybe, dataclass): @app.cell -def _(AltMaybe) -> None: +def _(AltMaybe): print(AltMaybe.empty()) print(AltMaybe.alt(AltMaybe(None), AltMaybe(1))) print(AltMaybe.alt(AltMaybe(None), AltMaybe(None))) print(AltMaybe.alt(AltMaybe(1), AltMaybe(None))) print(AltMaybe.alt(AltMaybe(1), AltMaybe(2))) + return @app.cell -def _(AltMaybe) -> None: +def _(AltMaybe): print(AltMaybe.check_left_identity(AltMaybe(1))) print(AltMaybe.check_right_identity(AltMaybe(1))) print(AltMaybe.check_associativity(AltMaybe(1), AltMaybe(2), AltMaybe(None))) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ### The List Alternative - - - `empty`: the identity element of `List` is `List([])` - - `alt`: return the concatenation of 2 input lists - """ - ) +def _(mo): + mo.md(r""" + ### The List Alternative + + - `empty`: the identity element of `List` is `List([])` + - `alt`: return the concatenation of 2 input lists + """) + return @app.cell @@ -1320,23 +1337,26 @@ def _(Alternative, List, dataclass): @app.cell -def _(AltList) -> None: +def _(AltList): print(AltList.empty()) print(AltList.alt(AltList([1, 2, 3]), AltList([4, 5]))) + return @app.cell -def _(AltList) -> None: +def _(AltList): AltList([1]) + return @app.cell -def _(AltList) -> None: +def _(AltList): AltList([1]) + return @app.cell -def 
_(AltList) -> None: +def _(AltList): print(AltList.check_left_identity(AltList([1, 2, 3]))) print(AltList.check_right_identity(AltList([1, 2, 3]))) print( @@ -1344,77 +1364,88 @@ def _(AltList) -> None: AltList([1, 2]), AltList([3, 4, 5]), AltList([6]) ) ) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - ## some and many +def _(mo): + mo.md(r""" + ## some and many - /// admonition | This section mainly refers to + /// admonition | This section mainly refers to - - https://stackoverflow.com/questions/7671009/some-and-many-functions-from-the-alternative-type-class/7681283#7681283 + - https://stackoverflow.com/questions/7671009/some-and-many-functions-from-the-alternative-type-class/7681283#7681283 - /// + /// - First let's have a look at the implementation of `some` and `many`: + First let's have a look at the implementation of `some` and `many`: - ```python - @classmethod - def some(cls, fa: "Alternative[A]") -> "Alternative[list[A]]": - # Short-circuit if input is empty - if fa == cls.empty(): - return cls.empty() + ```python + @classmethod + def some(cls, fa: "Alternative[A]") -> "Alternative[list[A]]": + # Short-circuit if input is empty + if fa == cls.empty(): + return cls.empty() - return cls.apply( - cls.fmap(lambda a: lambda b: [a] + b, fa), cls.many(fa) - ) + return cls.apply( + cls.fmap(lambda a: lambda b: [a] + b, fa), cls.many(fa) + ) - @classmethod - def many(cls, fa: "Alternative[A]") -> "Alternative[list[A]]": - # Directly return empty list if input is empty - if fa == cls.empty(): - return cls.pure([]) + @classmethod + def many(cls, fa: "Alternative[A]") -> "Alternative[list[A]]": + # Directly return empty list if input is empty + if fa == cls.empty(): + return cls.pure([]) - return cls.alt(cls.some(fa), cls.pure([])) - ``` + return cls.alt(cls.some(fa), cls.pure([])) + ``` - So `some f` runs `f` once, then *many* times, and conses the results. `many f` runs f *some* times, or *alternatively* just returns the empty list. 
+ So `some f` runs `f` once, then *many* times, and conses the results. `many f` runs f *some* times, or *alternatively* just returns the empty list. - The idea is that they both run `f` as often as possible until it **fails**, collecting the results in a list. The difference is that `some f` immediately fails if `f` fails, while `many f` will still succeed and *return* the empty list in such a case. But what all this exactly means depends on how `alt` is defined. + The idea is that they both run `f` as often as possible until it **fails**, collecting the results in a list. The difference is that `some f` immediately fails if `f` fails, while `many f` will still succeed and *return* the empty list in such a case. But what all this exactly means depends on how `alt` is defined. - Let's see what it does for the instances `AltMaybe` and `AltList`. - """ - ) + Let's see what it does for the instances `AltMaybe` and `AltList`. + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""For `AltMaybe`. `None` means failure, so some `None` fails as well and evaluates to `None` while many `None` succeeds and evaluates to `Just []`. Both `some (Just ())` and `many (Just ())` never return, because `Just ()` never fails.""") +def _(mo): + mo.md(r""" + For `AltMaybe`. `None` means failure, so some `None` fails as well and evaluates to `None` while many `None` succeeds and evaluates to `Just []`. Both `some (Just ())` and `many (Just ())` never return, because `Just ()` never fails. + """) + return @app.cell -def _(AltMaybe) -> None: +def _(AltMaybe): print(AltMaybe.some(AltMaybe.empty())) print(AltMaybe.many(AltMaybe.empty())) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""For `AltList`, `[]` means failure, so `some []` evaluates to `[]` (no answers) while `many []` evaluates to `[[]]` (there's one answer and it is the empty list). 
Again `some [()]` and `many [()]` don't return.""") +def _(mo): + mo.md(r""" + For `AltList`, `[]` means failure, so `some []` evaluates to `[]` (no answers) while `many []` evaluates to `[[]]` (there's one answer and it is the empty list). Again `some [()]` and `many [()]` don't return. + """) + return @app.cell -def _(AltList) -> None: +def _(AltList): print(AltList.some(AltList.empty())) print(AltList.many(AltList.empty())) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md(r"""## Formal implementation of Alternative""") +def _(mo): + mo.md(r""" + ## Formal implementation of Alternative + """) + return @app.cell @@ -1472,42 +1503,40 @@ def _(ABC, Applicative, abstractmethod, dataclass): @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - /// admonition +def _(mo): + mo.md(r""" + /// admonition - We will explore more about `Alternative` in a future notebooks about [Monadic Parsing](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/monadic-parsing-in-haskell/E557DFCCE00E0D4B6ED02F3FB0466093) + We will explore more about `Alternative` in a future notebooks about [Monadic Parsing](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/monadic-parsing-in-haskell/E557DFCCE00E0D4B6ED02F3FB0466093) - /// - """ - ) + /// + """) + return @app.cell(hide_code=True) -def _(mo) -> None: - mo.md( - r""" - # Further reading - - Notice that these reading sources are optional and non-trivial - - - [Applicaive Programming with Effects](https://www.staff.city.ac.uk/~ross/papers/Applicative.html) - - [Equivalence of Applicative Functors and - Multifunctors](https://arxiv.org/pdf/2401.14286) - - [Applicative functor](https://wiki.haskell.org/index.php?title=Applicative_functor) - - [Control.Applicative](https://hackage.haskell.org/package/base-4.21.0.0/docs/Control-Applicative.html#t:Applicative) - - [Typeclassopedia#Applicative](https://wiki.haskell.org/index.php?title=Typeclassopedia#Applicative) - - 
[Notions of computation as monoids](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/notions-of-computation-as-monoids/70019FC0F2384270E9F41B9719042528) - - [Free Applicative Functors](https://arxiv.org/abs/1403.0749) - - [The basics of applicative functors, put to practical work](http://www.serpentine.com/blog/2008/02/06/the-basics-of-applicative-functors-put-to-practical-work/) - - [Abstracting with Applicatives](http://comonad.com/reader/2012/abstracting-with-applicatives/) - - [Static analysis with Applicatives](https://gergo.erdi.hu/blog/2012-12-01-static_analysis_with_applicatives/) - - [Explaining Applicative functor in categorical terms - monoidal functors](https://cstheory.stackexchange.com/questions/12412/explaining-applicative-functor-in-categorical-terms-monoidal-functors) - - [Applicative, A Strong Lax Monoidal Functor](https://beuke.org/applicative/) - - [Applicative Functors](https://bartoszmilewski.com/2017/02/06/applicative-functors/) - """ - ) +def _(mo): + mo.md(r""" + # Further reading + + Notice that these reading sources are optional and non-trivial + + - [Applicaive Programming with Effects](https://www.staff.city.ac.uk/~ross/papers/Applicative.html) + - [Equivalence of Applicative Functors and + Multifunctors](https://arxiv.org/pdf/2401.14286) + - [Applicative functor](https://wiki.haskell.org/index.php?title=Applicative_functor) + - [Control.Applicative](https://hackage.haskell.org/package/base-4.21.0.0/docs/Control-Applicative.html#t:Applicative) + - [Typeclassopedia#Applicative](https://wiki.haskell.org/index.php?title=Typeclassopedia#Applicative) + - [Notions of computation as monoids](https://www.cambridge.org/core/journals/journal-of-functional-programming/article/notions-of-computation-as-monoids/70019FC0F2384270E9F41B9719042528) + - [Free Applicative Functors](https://arxiv.org/abs/1403.0749) + - [The basics of applicative functors, put to practical 
work](http://www.serpentine.com/blog/2008/02/06/the-basics-of-applicative-functors-put-to-practical-work/) + - [Abstracting with Applicatives](http://comonad.com/reader/2012/abstracting-with-applicatives/) + - [Static analysis with Applicatives](https://gergo.erdi.hu/blog/2012-12-01-static_analysis_with_applicatives/) + - [Explaining Applicative functor in categorical terms - monoidal functors](https://cstheory.stackexchange.com/questions/12412/explaining-applicative-functor-in-categorical-terms-monoidal-functors) + - [Applicative, A Strong Lax Monoidal Functor](https://beuke.org/applicative/) + - [Applicative Functors](https://bartoszmilewski.com/2017/02/06/applicative-functors/) + """) + return if __name__ == "__main__": diff --git a/functional_programming/CHANGELOG.md b/functional_programming/CHANGELOG.md index 4305c34202ed3891d818f07b8fd858aa1cda45b4..0c8dd2ae71762c1e7b59bd17ebd8ddb19f7e623a 100644 --- a/functional_programming/CHANGELOG.md +++ b/functional_programming/CHANGELOG.md @@ -1,3 +1,8 @@ +--- +title: Changelog +marimo-version: 0.18.4 +--- + # Changelog of the functional-programming course ## 2025-04-16 @@ -121,4 +126,4 @@ for reviewing **functors.py** -- Demo version of notebook `05_functors.py` +- Demo version of notebook `05_functors.py` \ No newline at end of file diff --git a/functional_programming/README.md b/functional_programming/README.md index f264dfd8253ef6661228b10c45e6dd2a6104a84f..72f94a5fc4db533aa1f0b9a845fd768e0d5e3948 100644 --- a/functional_programming/README.md +++ b/functional_programming/README.md @@ -1,3 +1,8 @@ +--- +title: Readme +marimo-version: 0.18.4 +--- + # Learn Functional Programming _🚧 This collection is a [work in progress](https://github.com/marimo-team/learn/issues/51)._ @@ -24,13 +29,13 @@ Topics include: To run a notebook locally, use -```bash -uvx marimo edit +```bash +uvx marimo edit ``` For example, run the `Functor` tutorial with -```bash +```bash uvx marimo edit 
https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py ``` @@ -52,11 +57,11 @@ on Discord (@eugene.hs). ## Description of notebooks Check [here](https://github.com/marimo-team/learn/issues/51) for current series -structure. +structure. | Notebook | Title | Key Concepts | Prerequisites | -|----------|-------|--------------|---------------| -| [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions | +|----------|-------|--------------|---------------| +| [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions | | [06. Applicatives](https://github.com/marimo-team/learn/blob/main/functional_programming/06_applicatives.py) | Applicative programming with effects | Applicative Functor, pure, apply, Effectful programming, Alternative | Functors | **Authors.** @@ -69,4 +74,4 @@ Thanks to all our notebook authors! Thanks to all our notebook reviews! -- [Haleshot](https://github.com/Haleshot) +- [Haleshot](https://github.com/Haleshot) \ No newline at end of file diff --git a/optimization/01_least_squares.py b/optimization/01_least_squares.py index aa1309b96b265e6207c42711b0da507e8ab04289..b69d71966f1494f648bb6e466bfdda88c233703f 100644 --- a/optimization/01_least_squares.py +++ b/optimization/01_least_squares.py @@ -9,7 +9,7 @@ import marimo -__generated_with = "0.11.0" +__generated_with = "0.18.4" app = marimo.App() @@ -21,45 +21,41 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Least squares + mo.md(r""" + # Least squares - In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times - n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. 
We seek a vector - $x \in \mathcal{R}^{n}$ such that $Ax$ is close to $b$. The matrices $A$ and $b$ are problem data or constants, and $x$ is the variable we are solving for. + In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times + n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. We seek a vector + $x \in \mathcal{R}^{n}$ such that $Ax$ is close to $b$. The matrices $A$ and $b$ are problem data or constants, and $x$ is the variable we are solving for. - Closeness is defined as the sum of the squared differences: + Closeness is defined as the sum of the squared differences: - \[ \sum_{i=1}^m (a_i^Tx - b_i)^2, \] + \[ \sum_{i=1}^m (a_i^Tx - b_i)^2, \] - also known as the $\ell_2$-norm squared, $\|Ax - b\|_2^2$. + also known as the $\ell_2$-norm squared, $\|Ax - b\|_2^2$. - For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $a_i^T$ of $A$ is the feature vector for user $i$, while the corresponding entry $b_i$ of $b$ is the measurement we want to predict from $a_i^T$, such as ad spending. The prediction for user $i$ is given by $a_i^Tx$. + For example, we might have a dataset of $m$ users, each represented by $n$ features. Each row $a_i^T$ of $A$ is the feature vector for user $i$, while the corresponding entry $b_i$ of $b$ is the measurement we want to predict from $a_i^T$, such as ad spending. The prediction for user $i$ is given by $a_i^Tx$. - We find the optimal value of $x$ by solving the optimization problem + We find the optimal value of $x$ by solving the optimization problem - \[ - \begin{array}{ll} - \text{minimize} & \|Ax - b\|_2^2. - \end{array} - \] + \[ + \begin{array}{ll} + \text{minimize} & \|Ax - b\|_2^2. + \end{array} + \] - Let $x^\star$ denote the optimal $x$. The quantity $r = Ax^\star - b$ is known as the residual. If $\|r\|_2 = 0$, we have a perfect fit. - """ - ) + Let $x^\star$ denote the optimal $x$. The quantity $r = Ax^\star - b$ is known as the residual. 
If $\|r\|_2 = 0$, we have a perfect fit. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Example + mo.md(r""" + ## Example - In this example, we use the Python library [CVXPY](https://github.com/cvxpy/cvxpy) to construct and solve a least-squares problems. - """ - ) + In this example, we use the Python library [CVXPY](https://github.com/cvxpy/cvxpy) to construct and solve a least-squares problems. + """) return @@ -91,7 +87,7 @@ def _(A, b, cp, n): objective = cp.sum_squares(A @ x - b) problem = cp.Problem(cp.Minimize(objective)) optimal_value = problem.solve() - return objective, optimal_value, problem, x + return optimal_value, x @app.cell @@ -108,14 +104,12 @@ def _(A, b, cp, mo, optimal_value, x): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Further reading + mo.md(r""" + ## Further reading - For a primer on least squares, with many real-world examples, check out the free book - [Vectors, Matrices, and Least Squares](https://web.stanford.edu/~boyd/vmls/), which is used for undergraduate linear algebra education at Stanford. - """ - ) + For a primer on least squares, with many real-world examples, check out the free book + [Vectors, Matrices, and Least Squares](https://web.stanford.edu/~boyd/vmls/), which is used for undergraduate linear algebra education at Stanford. + """) return diff --git a/optimization/02_linear_program.py b/optimization/02_linear_program.py index cd30b41bfc6d780c4b18826ad2f7f3e0a39ebec7..40cdc1f19b9ad84fd86dfab5b53f049d0889bea0 100644 --- a/optimization/02_linear_program.py +++ b/optimization/02_linear_program.py @@ -11,7 +11,7 @@ import marimo -__generated_with = "0.11.0" +__generated_with = "0.18.4" app = marimo.App() @@ -23,33 +23,31 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Linear program + mo.md(r""" + # Linear program - A linear program is an optimization problem with a linear objective and affine - inequality constraints. 
A common standard form is the following: + A linear program is an optimization problem with a linear objective and affine + inequality constraints. A common standard form is the following: - \[ - \begin{array}{ll} - \text{minimize} & c^Tx \\ - \text{subject to} & Ax \leq b. - \end{array} - \] + \[ + \begin{array}{ll} + \text{minimize} & c^Tx \\ + \text{subject to} & Ax \leq b. + \end{array} + \] - Here $A \in \mathcal{R}^{m \times n}$, $b \in \mathcal{R}^m$, and $c \in \mathcal{R}^n$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Ax \leq b$ is elementwise. + Here $A \in \mathcal{R}^{m \times n}$, $b \in \mathcal{R}^m$, and $c \in \mathcal{R}^n$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Ax \leq b$ is elementwise. - For example, we might have $n$ different products, each constructed out of $m$ components. Each entry $A_{ij}$ is the amount of component $i$ required to build one unit of product $j$. Each entry $b_i$ is the total amount of component $i$ available. We lose $c_j$ for each unit of product $j$ ($c_j < 0$ indicates profit). Our goal then is to choose how many units of each product $j$ to make, $x_j$, in order to minimize loss without exceeding our budget for any component. + For example, we might have $n$ different products, each constructed out of $m$ components. Each entry $A_{ij}$ is the amount of component $i$ required to build one unit of product $j$. Each entry $b_i$ is the total amount of component $i$ available. We lose $c_j$ for each unit of product $j$ ($c_j < 0$ indicates profit). Our goal then is to choose how many units of each product $j$ to make, $x_j$, in order to minimize loss without exceeding our budget for any component. - In addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$. 
A positive entry $\lambda^\star_i$ indicates that the constraint $a_i^Tx \leq b_i$ holds with equality for $x^\star$ and suggests that changing $b_i$ would change the optimal value. + In addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$. A positive entry $\lambda^\star_i$ indicates that the constraint $a_i^Tx \leq b_i$ holds with equality for $x^\star$ and suggests that changing $b_i$ would change the optimal value. - **Why linear programming?** Linear programming is a way to achieve an optimal outcome, such as maximum utility or lowest cost, subject to a linear objective function and affine constraints. Developed in the 20th century, linear programming is widely used today to solve problems in resource allocation, scheduling, transportation, and more. The discovery of polynomial-time algorithms to solve linear programs was of tremendous worldwide importance and entered the public discourse, even making the front page of the New York Times. + **Why linear programming?** Linear programming is a way to achieve an optimal outcome, such as maximum utility or lowest cost, subject to a linear objective function and affine constraints. Developed in the 20th century, linear programming is widely used today to solve problems in resource allocation, scheduling, transportation, and more. The discovery of polynomial-time algorithms to solve linear programs was of tremendous worldwide importance and entered the public discourse, even making the front page of the New York Times. - In the late 20th and early 21st century, researchers generalized linear programming to a much wider class of problems called convex optimization problems. Nearly all convex optimization problems can be solved efficiently and reliably, and even more difficult problems are readily solved by a sequence of convex optimization problems. 
Today, convex optimization is used to fit machine learning models, land rockets in real-time at SpaceX, plan trajectories for self-driving cars at Waymo, execute many billions of dollars of financial trades a day, and much more. + In the late 20th and early 21st century, researchers generalized linear programming to a much wider class of problems called convex optimization problems. Nearly all convex optimization problems can be solved efficiently and reliably, and even more difficult problems are readily solved by a sequence of convex optimization problems. Today, convex optimization is used to fit machine learning models, land rockets in real-time at SpaceX, plan trajectories for self-driving cars at Waymo, execute many billions of dollars of financial trades a day, and much more. - This marimo learn course uses CVXPY, a modeling language for convex optimization problems developed originally at Stanford, to construct and solve convex programs. - """ - ) + This marimo learn course uses CVXPY, a modeling language for convex optimization problems developed originally at Stanford, to construct and solve convex programs. + """) return @@ -66,13 +64,11 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Example + mo.md(r""" + ## Example - Here we use CVXPY to construct and solve a linear program. - """ - ) + Here we use CVXPY to construct and solve a linear program. + """) return @@ -119,7 +115,9 @@ def _(np): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We've randomly generated problem data $A$ and $B$. The vector for $c$ is shown below. Try playing with the value of $c$ by dragging the components, and see how the level curves change in the visualization below.""") + mo.md(r""" + We've randomly generated problem data $A$ and $B$. The vector for $c$ is shown below. Try playing with the value of $c$ by dragging the components, and see how the level curves change in the visualization below. 
+ """) return @@ -129,7 +127,7 @@ def _(mo, np): c_widget = mo.ui.anywidget(Matrix(matrix=np.array([[0.1, -0.2]]), step=0.01)) c_widget - return Matrix, c_widget + return (c_widget,) @app.cell @@ -149,7 +147,9 @@ def _(A, b, c, cp): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Below, we plot the feasible region of the problem — the intersection of the inequalities — and the level curves of the objective function. The optimal value $x^\star$ is the point farthest in the feasible region in the direction $-c$.""") + mo.md(r""" + Below, we plot the feasible region of the problem — the intersection of the inequalities — and the level curves of the objective function. The optimal value $x^\star$ is the point farthest in the feasible region in the direction $-c$. + """) return @@ -249,7 +249,7 @@ def _(np): ax.set_xlim(np.min(x_vals), np.max(x_vals)) ax.set_ylim(np.min(y_vals), np.max(y_vals)) return ax - return make_plot, plt + return (make_plot,) @app.cell(hide_code=True) @@ -257,7 +257,7 @@ def _(mo, prob, x): mo.md( f""" The optimal value is {prob.value:.04f}. - + A solution $x$ is {mo.as_html(list(x.value))} A dual solution is is {mo.as_html(list(prob.constraints[0].dual_value))} """ diff --git a/optimization/03_minimum_fuel_optimal_control.py b/optimization/03_minimum_fuel_optimal_control.py index 9a5655a8a101e8fa53f71778a415aef32c47059e..7c81c3014a7b6b7422fe3ae50427a4a33712c83e 100644 --- a/optimization/03_minimum_fuel_optimal_control.py +++ b/optimization/03_minimum_fuel_optimal_control.py @@ -1,6 +1,6 @@ import marimo -__generated_with = "0.11.0" +__generated_with = "0.18.4" app = marimo.App() @@ -12,46 +12,44 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Minimal fuel optimal control + mo.md(r""" + # Minimal fuel optimal control - This notebook includes an application of linear programming to controlling a - physical system, adapted from [Convex - Optimization](https://web.stanford.edu/~boyd/cvxbook/) by Boyd and Vandenberghe. 
+ This notebook includes an application of linear programming to controlling a + physical system, adapted from [Convex + Optimization](https://web.stanford.edu/~boyd/cvxbook/) by Boyd and Vandenberghe. - We consider a linear dynamical system with state $x(t) \in \mathbf{R}^n$, for $t = 0, \ldots, T$. At each time step $t = 0, \ldots, T - 1$, an actuator or input signal $u(t)$ is applied, affecting the state. The dynamics - of the system is given by the linear recurrence + We consider a linear dynamical system with state $x(t) \in \mathbf{R}^n$, for $t = 0, \ldots, T$. At each time step $t = 0, \ldots, T - 1$, an actuator or input signal $u(t)$ is applied, affecting the state. The dynamics + of the system is given by the linear recurrence - \[ - x(t + 1) = Ax(t) + bu(t), \quad t = 0, \ldots, T - 1, - \] + \[ + x(t + 1) = Ax(t) + bu(t), \quad t = 0, \ldots, T - 1, + \] - where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$ are given and encode how the system evolves. The initial state $x(0)$ is also given. + where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$ are given and encode how the system evolves. The initial state $x(0)$ is also given. - The _minimum fuel optimal control problem_ is to choose the inputs $u(0), \ldots, u(T - 1)$ so as to achieve - a given desired state $x_\text{des} = x(T)$ while minimizing the total fuel consumed + The _minimum fuel optimal control problem_ is to choose the inputs $u(0), \ldots, u(T - 1)$ so as to achieve + a given desired state $x_\text{des} = x(T)$ while minimizing the total fuel consumed - \[ - F = \sum_{t=0}^{T - 1} f(u(t)). - \] + \[ + F = \sum_{t=0}^{T - 1} f(u(t)). + \] - The function $f : \mathbf{R} \to \mathbf{R}$ tells us how much fuel is consumed as a function of the input, and is given by + The function $f : \mathbf{R} \to \mathbf{R}$ tells us how much fuel is consumed as a function of the input, and is given by - \[ - f(a) = \begin{cases} - |a| & |a| \leq 1 \\ - 2|a| - 1 & |a| > 1. 
- \end{cases} - \] + \[ + f(a) = \begin{cases} + |a| & |a| \leq 1 \\ + 2|a| - 1 & |a| > 1. + \end{cases} + \] - This means the fuel use is proportional to the magnitude of the signal between $-1$ and $1$, but for larger signals the marginal fuel efficiency is half. + This means the fuel use is proportional to the magnitude of the signal between $-1$ and $1$, but for larger signals the marginal fuel efficiency is half. - **This notebook.** In this notebook we use CVXPY to formulate the minimum fuel optimal control problem as a linear program. The notebook lets you play with the initial and target states, letting you see how they affect the planned trajectory of inputs $u$. + **This notebook.** In this notebook we use CVXPY to formulate the minimum fuel optimal control problem as a linear program. The notebook lets you play with the initial and target states, letting you see how they affect the planned trajectory of inputs $u$. - First, we create the **problem data**. - """ - ) + First, we create the **problem data**. + """) return @@ -85,7 +83,7 @@ def _(mo, n, np): rf""" Choose a value for $x_0$ ... - + {x0_widget} """ ) @@ -99,7 +97,7 @@ def _(mo, n, np): ) mo.hstack([_a, _b], justify="space-around") - return wigglystuff, x0_widget, xdes_widget + return x0_widget, xdes_widget @app.cell @@ -111,7 +109,9 @@ def _(x0_widget, xdes_widget): @app.cell(hide_code=True) def _(mo): - mo.md(r"""**Next, we specify the problem as a linear program using CVXPY.** This problem is linear because the objective and constraints are affine. (In fact, the objective is piecewise affine, but CVXPY rewrites it to be affine for you.)""") + mo.md(r""" + **Next, we specify the problem as a linear program using CVXPY.** This problem is linear because the objective and constraints are affine. (In fact, the objective is piecewise affine, but CVXPY rewrites it to be affine for you.) 
+ """) return @@ -134,18 +134,16 @@ def _(A, T, b, cp, mo, n, x0, xdes): fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve() mo.md(f"Achieved a fuel usage of {fuel_used:.02f}. 🚀") - return X, constraints, fuel_used, objective, u + return (u,) @app.cell(hide_code=True) def _(mo): - mo.md( - """ - Finally, we plot the chosen inputs over time. + mo.md(""" + Finally, we plot the chosen inputs over time. - **🌊 Try it!** Change the initial and desired states; how do fuel usage and controls change? Can you explain what you see? You can also try experimenting with the value of $T$. - """ - ) + **🌊 Try it!** Change the initial and desired states; how do fuel usage and controls change? Can you explain what you see? You can also try experimenting with the value of $T$. + """) return diff --git a/optimization/04_quadratic_program.py b/optimization/04_quadratic_program.py index a7fbd1be150e125c8494e7ebe2225c1bf528bd6b..b81fa6857c885959e93bd3a815d23c392ddf1205 100644 --- a/optimization/04_quadratic_program.py +++ b/optimization/04_quadratic_program.py @@ -11,7 +11,7 @@ import marimo -__generated_with = "0.11.0" +__generated_with = "0.18.4" app = marimo.App() @@ -23,53 +23,49 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Quadratic program + mo.md(r""" + # Quadratic program - A quadratic program is an optimization problem with a quadratic objective and - affine equality and inequality constraints. A common standard form is the - following: + A quadratic program is an optimization problem with a quadratic objective and + affine equality and inequality constraints. A common standard form is the + following: - \[ - \begin{array}{ll} - \text{minimize} & (1/2)x^TPx + q^Tx\\ - \text{subject to} & Gx \leq h \\ - & Ax = b. - \end{array} - \] + \[ + \begin{array}{ll} + \text{minimize} & (1/2)x^TPx + q^Tx\\ + \text{subject to} & Gx \leq h \\ + & Ax = b. 
+ \end{array} + \] - Here $P \in \mathcal{S}^{n}_+$, $q \in \mathcal{R}^n$, $G \in \mathcal{R}^{m \times n}$, $h \in \mathcal{R}^m$, $A \in \mathcal{R}^{p \times n}$, and $b \in \mathcal{R}^p$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Gx \leq h$ is elementwise. + Here $P \in \mathcal{S}^{n}_+$, $q \in \mathcal{R}^n$, $G \in \mathcal{R}^{m \times n}$, $h \in \mathcal{R}^m$, $A \in \mathcal{R}^{p \times n}$, and $b \in \mathcal{R}^p$ are problem data and $x \in \mathcal{R}^{n}$ is the optimization variable. The inequality constraint $Gx \leq h$ is elementwise. - **Why quadratic programming?** Quadratic programs are convex optimization problems that generalize both least-squares and linear programming.They can be solved efficiently and reliably, even in real-time. + **Why quadratic programming?** Quadratic programs are convex optimization problems that generalize both least-squares and linear programming.They can be solved efficiently and reliably, even in real-time. - **An example from finance.** A simple example of a quadratic program arises in finance. Suppose we have $n$ different stocks, an estimate $r \in \mathcal{R}^n$ of the expected return on each stock, and an estimate $\Sigma \in \mathcal{S}^{n}_+$ of the covariance of the returns. Then we solve the optimization problem + **An example from finance.** A simple example of a quadratic program arises in finance. Suppose we have $n$ different stocks, an estimate $r \in \mathcal{R}^n$ of the expected return on each stock, and an estimate $\Sigma \in \mathcal{S}^{n}_+$ of the covariance of the returns. 
Then we solve the optimization problem - \[ - \begin{array}{ll} - \text{minimize} & (1/2)x^T\Sigma x - r^Tx\\ - \text{subject to} & x \geq 0 \\ - & \mathbf{1}^Tx = 1, - \end{array} - \] + \[ + \begin{array}{ll} + \text{minimize} & (1/2)x^T\Sigma x - r^Tx\\ + \text{subject to} & x \geq 0 \\ + & \mathbf{1}^Tx = 1, + \end{array} + \] - to find a nonnegative portfolio allocation $x \in \mathcal{R}^n_+$ that optimally balances expected return and variance of return. + to find a nonnegative portfolio allocation $x \in \mathcal{R}^n_+$ that optimally balances expected return and variance of return. - When we solve a quadratic program, in addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$ corresponding to the inequality constraints. A positive entry $\lambda^\star_i$ indicates that the constraint $g_i^Tx \leq h_i$ holds with equality for $x^\star$ and suggests that changing $h_i$ would change the optimal value. - """ - ) + When we solve a quadratic program, in addition to a solution $x^\star$, we obtain a dual solution $\lambda^\star$ corresponding to the inequality constraints. A positive entry $\lambda^\star_i$ indicates that the constraint $g_i^Tx \leq h_i$ holds with equality for $x^\star$ and suggests that changing $h_i$ would change the optimal value. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Example + mo.md(r""" + ## Example - In this example, we use CVXPY to construct and solve a quadratic program. - """ - ) + In this example, we use CVXPY to construct and solve a quadratic program. + """) return @@ -82,7 +78,9 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md("""First we generate synthetic data. In this problem, we don't include equality constraints, only inequality.""") + mo.md(""" + First we generate synthetic data. In this problem, we don't include equality constraints, only inequality. 
+ """) return @@ -95,7 +93,7 @@ def _(np): q = np.random.randn(n) G = np.random.randn(m, n) h = G @ np.random.randn(n) - return G, h, m, n, q + return G, h, n, q @app.cell(hide_code=True) @@ -114,7 +112,7 @@ def _(mo, np): {P_widget.center()} """ ) - return P_widget, wigglystuff + return (P_widget,) @app.cell @@ -125,7 +123,9 @@ def _(P_widget, np): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Next, we specify the problem. Notice that we use the `quad_form` function from CVXPY to create the quadratic form $x^TPx$.""") + mo.md(r""" + Next, we specify the problem. Notice that we use the `quad_form` function from CVXPY to create the quadratic form $x^TPx$. + """) return @@ -162,14 +162,12 @@ def _(G, P, h, plot_contours, q, x): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - In this plot, the gray shaded region is the feasible region (points satisfying the inequality), and the ellipses are level curves of the quadratic form. + mo.md(r""" + In this plot, the gray shaded region is the feasible region (points satisfying the inequality), and the ellipses are level curves of the quadratic form. - **🌊 Try it!** Try changing the entries of $P$ above with your mouse. How do the - level curves and the optimal value of $x$ change? Can you explain what you see? - """ - ) + **🌊 Try it!** Try changing the entries of $P$ above with your mouse. How do the + level curves and the optimal value of $x$ change? Can you explain what you see? 
+ """) return @@ -178,7 +176,7 @@ def _(P, mo): mo.md( rf""" The above contour lines were generated with - + \[ P= \begin{{bmatrix}} {P[0, 0]:.01f} & {P[0, 1]:.01f} \\ diff --git a/optimization/05_portfolio_optimization.py b/optimization/05_portfolio_optimization.py index c61001717a219c428eed2c54946d861d879cbe41..b3c42476e6f7ae0926ac8e0e216ddea693968f37 100644 --- a/optimization/05_portfolio_optimization.py +++ b/optimization/05_portfolio_optimization.py @@ -12,7 +12,7 @@ import marimo -__generated_with = "0.11.2" +__generated_with = "0.18.4" app = marimo.App() @@ -24,88 +24,78 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md(r"""# Portfolio optimization""") + mo.md(r""" + # Portfolio optimization + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - In this example we show how to use CVXPY to design a financial portfolio; this is called _portfolio optimization_. + mo.md(r""" + In this example we show how to use CVXPY to design a financial portfolio; this is called _portfolio optimization_. - In portfolio optimization we have some amount of money to invest in any of $n$ different assets. - We choose what fraction $w_i$ of our money to invest in each asset $i$, $i=1, \ldots, n$. The goal is to maximize return of the portfolio while minimizing risk. - """ - ) + In portfolio optimization we have some amount of money to invest in any of $n$ different assets. + We choose what fraction $w_i$ of our money to invest in each asset $i$, $i=1, \ldots, n$. The goal is to maximize return of the portfolio while minimizing risk. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Asset returns and risk + mo.md(r""" + ## Asset returns and risk - We will only model investments held for one period. The initial prices are $p_i > 0$. The end of period prices are $p_i^+ >0$. The asset (fractional) returns are $r_i = (p_i^+-p_i)/p_i$. The portfolio (fractional) return is $R = r^Tw$. + We will only model investments held for one period. 
The initial prices are $p_i > 0$. The end of period prices are $p_i^+ >0$. The asset (fractional) returns are $r_i = (p_i^+-p_i)/p_i$. The portfolio (fractional) return is $R = r^Tw$. - A common model is that $r$ is a random variable with mean ${\bf E}r = \mu$ and covariance ${\bf E{(r-\mu)(r-\mu)^T}} = \Sigma$. - It follows that $R$ is a random variable with ${\bf E}R = \mu^T w$ and ${\bf var}(R) = w^T\Sigma w$. In real-world applications, $\mu$ and $\Sigma$ are estimated from data and models, and $w$ is chosen using a library like CVXPY. + A common model is that $r$ is a random variable with mean ${\bf E}r = \mu$ and covariance ${\bf E{(r-\mu)(r-\mu)^T}} = \Sigma$. + It follows that $R$ is a random variable with ${\bf E}R = \mu^T w$ and ${\bf var}(R) = w^T\Sigma w$. In real-world applications, $\mu$ and $\Sigma$ are estimated from data and models, and $w$ is chosen using a library like CVXPY. - ${\bf E}R$ is the (mean) *return* of the portfolio. ${\bf var}(R)$ is the *risk* of the portfolio. Portfolio optimization has two competing objectives: high return and low risk. - """ - ) + ${\bf E}R$ is the (mean) *return* of the portfolio. ${\bf var}(R)$ is the *risk* of the portfolio. Portfolio optimization has two competing objectives: high return and low risk. 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Classical (Markowitz) portfolio optimization + mo.md(r""" + ## Classical (Markowitz) portfolio optimization - Classical (Markowitz) portfolio optimization solves the optimization problem - """ - ) + Classical (Markowitz) portfolio optimization solves the optimization problem + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - $$ - \begin{array}{ll} \text{maximize} & \mu^T w - \gamma w^T\Sigma w\\ - \text{subject to} & {\bf 1}^T w = 1, w \geq 0, - \end{array} - $$ - """ - ) + mo.md(r""" + $$ + \begin{array}{ll} \text{maximize} & \mu^T w - \gamma w^T\Sigma w\\ + \text{subject to} & {\bf 1}^T w = 1, w \geq 0, + \end{array} + $$ + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - where $w \in {\bf R}^n$ is the optimization variable and $\gamma >0$ is a constant called the *risk aversion parameter*. The constraint $\mathbf{1}^Tw = 1$ says the portfolio weight vector must sum to 1, and $w \geq 0$ says that we can't invest a negative amount into any asset. + mo.md(r""" + where $w \in {\bf R}^n$ is the optimization variable and $\gamma >0$ is a constant called the *risk aversion parameter*. The constraint $\mathbf{1}^Tw = 1$ says the portfolio weight vector must sum to 1, and $w \geq 0$ says that we can't invest a negative amount into any asset. - The objective $\mu^Tw - \gamma w^T\Sigma w$ is the *risk-adjusted return*. Varying $\gamma$ gives the optimal *risk-return trade-off*. - We can get the same risk-return trade-off by fixing return and minimizing risk. - """ - ) + The objective $\mu^Tw - \gamma w^T\Sigma w$ is the *risk-adjusted return*. Varying $\gamma$ gives the optimal *risk-return trade-off*. + We can get the same risk-return trade-off by fixing return and minimizing risk. 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Example + mo.md(r""" + ## Example - In the following code we compute and plot the optimal risk-return trade-off for $10$ assets. First we generate random problem data $\mu$ and $\Sigma$. - """ - ) + In the following code we compute and plot the optimal risk-return trade-off for $10$ assets. First we generate random problem data $\mu$ and $\Sigma$. + """) return @@ -148,7 +138,7 @@ def _(mo, np): _Try changing the entries of $\mu$ and see how the plots below change._ """ ) - return mu_widget, wigglystuff + return (mu_widget,) @app.cell @@ -163,7 +153,9 @@ def _(mu_widget, np): @app.cell(hide_code=True) def _(mo): - mo.md("""Next, we solve the problem for 100 different values of $\gamma$""") + mo.md(""" + Next, we solve the problem for 100 different values of $\gamma$ + """) return @@ -176,7 +168,7 @@ def _(Sigma, mu, n): ret = mu.T @ w risk = cp.quad_form(w, Sigma) prob = cp.Problem(cp.Maximize(ret - gamma * risk), [cp.sum(w) == 1, w >= 0]) - return cp, gamma, prob, ret, risk, w + return cp, gamma, prob, ret, risk @app.cell @@ -195,7 +187,9 @@ def _(cp, gamma, np, prob, ret, risk): @app.cell(hide_code=True) def _(mo): - mo.md("""Plotted below are the risk return tradeoffs for two values of $\gamma$ (blue squares), and the risk return tradeoffs for investing fully in each asset (red circles)""") + mo.md(""" + Plotted below are the risk return tradeoffs for two values of $\gamma$ (blue squares), and the risk return tradeoffs for investing fully in each asset (red circles) + """) return @@ -218,17 +212,15 @@ def _(Sigma, cp, gamma_vals, mu, n, ret_data, risk_data): plt.xlabel("Standard deviation") plt.ylabel("Return") plt.show() - return ax, fig, marker, markers_on, plt + return markers_on, plt @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - We plot below the return distributions for the two risk aversion values marked on the trade-off curve. 
- Notice that the probability of a loss is near 0 for the low risk value and far above 0 for the high risk value. - """ - ) + mo.md(r""" + We plot below the return distributions for the two risk aversion values marked on the trade-off curve. + Notice that the probability of a loss is near 0 for the low risk value and far above 0 for the high risk value. + """) return @@ -250,7 +242,7 @@ def _(gamma, gamma_vals, markers_on, np, plt, prob, ret, risk): plt.ylabel("Density") plt.legend(loc="upper right") plt.show() - return midx, spstats, x + return if __name__ == "__main__": diff --git a/optimization/06_convex_optimization.py b/optimization/06_convex_optimization.py index 3fec569a8dbb99b146614d4af5a19d1c642b36dc..cbf1f7d74bf6ba6c292e2b9ff2a554a9f0806853 100644 --- a/optimization/06_convex_optimization.py +++ b/optimization/06_convex_optimization.py @@ -9,7 +9,7 @@ import marimo -__generated_with = "0.11.2" +__generated_with = "0.18.4" app = marimo.App() @@ -21,41 +21,39 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Convex optimization - - In the previous tutorials, we learned about least squares, linear programming, - and quadratic programming, and saw applications of each. We also learned that these problem - classes can be solved efficiently and reliably using CVXPY. That's because these problem classes are a special - case of a more general class of tractable problems, called **convex optimization problems.** - - A convex optimization problem is an optimization problem that minimizes a convex - function, subject to affine equality constraints and convex inequality - constraints ($f_i(x)\leq 0$, where $f_i$ is a convex function). - - **CVXPY.** CVXPY lets you specify and solve any convex optimization problem, - abstracting away the more specific problem classes. You start with CVXPY's **atomic functions**, like `cp.exp`, `cp.log`, and `cp.square`, and compose them to build more complex convex functions. 
As long as the functions are composed in the right way — as long as they are "DCP-compliant" — your resulting problem will be convex and solvable by CVXPY. - """ - ) + mo.md(r""" + # Convex optimization + + In the previous tutorials, we learned about least squares, linear programming, + and quadratic programming, and saw applications of each. We also learned that these problem + classes can be solved efficiently and reliably using CVXPY. That's because these problem classes are a special + case of a more general class of tractable problems, called **convex optimization problems.** + + A convex optimization problem is an optimization problem that minimizes a convex + function, subject to affine equality constraints and convex inequality + constraints ($f_i(x)\leq 0$, where $f_i$ is a convex function). + + **CVXPY.** CVXPY lets you specify and solve any convex optimization problem, + abstracting away the more specific problem classes. You start with CVXPY's **atomic functions**, like `cp.exp`, `cp.log`, and `cp.square`, and compose them to build more complex convex functions. As long as the functions are composed in the right way — as long as they are "DCP-compliant" — your resulting problem will be convex and solvable by CVXPY. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - **🛑 Stop!** Before proceeding, read the CVXPY docs to learn about atomic functions and the DCP ruleset: + mo.md(r""" + **🛑 Stop!** Before proceeding, read the CVXPY docs to learn about atomic functions and the DCP ruleset: - https://www.cvxpy.org/tutorial/index.html - """ - ) + https://www.cvxpy.org/tutorial/index.html + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""**Is my problem DCP-compliant?** Below is a sample CVXPY problem. It is DCP-compliant. Try typing in other problems and seeing if they are DCP-compliant. 
If you know your problem is convex, there exists a way to express it in a DCP-compliant way.""") + mo.md(r""" + **Is my problem DCP-compliant?** Below is a sample CVXPY problem. It is DCP-compliant. Try typing in other problems and seeing if they are DCP-compliant. If you know your problem is convex, there exists a way to express it in a DCP-compliant way. + """) return @@ -71,7 +69,7 @@ def _(mo): constraints = [x >= 0, cp.sum(x) == 1] problem = cp.Problem(cp.Maximize(objective), constraints) mo.md(f"Is my problem DCP? `{problem.is_dcp()}`") - return P_sqrt, constraints, cp, np, objective, problem, x + return problem, x @app.cell diff --git a/optimization/07_sdp.py b/optimization/07_sdp.py index bcc0faa0c33eb9b717b4ebb3032ada27e43dc106..0783ad3a473e4d0a6d0b28ae51cbd1f619576fed 100644 --- a/optimization/07_sdp.py +++ b/optimization/07_sdp.py @@ -10,7 +10,7 @@ import marimo -__generated_with = "0.11.2" +__generated_with = "0.18.4" app = marimo.App() @@ -22,49 +22,47 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md(r"""# Semidefinite program""") + mo.md(r""" + # Semidefinite program + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - _This notebook introduces an advanced topic._ A semidefinite program (SDP) is an optimization problem of the form - - \[ - \begin{array}{ll} - \text{minimize} & \mathbf{tr}(CX) \\ - \text{subject to} & \mathbf{tr}(A_iX) = b_i, \quad i=1,\ldots,p \\ - & X \succeq 0, - \end{array} - \] - - where $\mathbf{tr}$ is the trace function, $X \in \mathcal{S}^{n}$ is the optimization variable and $C, A_1, \ldots, A_p \in \mathcal{S}^{n}$, and $b_1, \ldots, b_p \in \mathcal{R}$ are problem data, and $X \succeq 0$ is a matrix inequality. Here $\mathcal{S}^{n}$ denotes the set of $n$-by-$n$ symmetric matrices. 
- - **Example.** An example of an SDP is to complete a covariance matrix $\tilde \Sigma \in \mathcal{S}^{n}_+$ with missing entries $M \subset \{1,\ldots,n\} \times \{1,\ldots,n\}$: - - \[ - \begin{array}{ll} - \text{minimize} & 0 \\ - \text{subject to} & \Sigma_{ij} = \tilde \Sigma_{ij}, \quad (i,j) \notin M \\ - & \Sigma \succeq 0, - \end{array} - \] - """ - ) + mo.md(r""" + _This notebook introduces an advanced topic._ A semidefinite program (SDP) is an optimization problem of the form + + \[ + \begin{array}{ll} + \text{minimize} & \mathbf{tr}(CX) \\ + \text{subject to} & \mathbf{tr}(A_iX) = b_i, \quad i=1,\ldots,p \\ + & X \succeq 0, + \end{array} + \] + + where $\mathbf{tr}$ is the trace function, $X \in \mathcal{S}^{n}$ is the optimization variable and $C, A_1, \ldots, A_p \in \mathcal{S}^{n}$, and $b_1, \ldots, b_p \in \mathcal{R}$ are problem data, and $X \succeq 0$ is a matrix inequality. Here $\mathcal{S}^{n}$ denotes the set of $n$-by-$n$ symmetric matrices. + + **Example.** An example of an SDP is to complete a covariance matrix $\tilde \Sigma \in \mathcal{S}^{n}_+$ with missing entries $M \subset \{1,\ldots,n\} \times \{1,\ldots,n\}$: + + \[ + \begin{array}{ll} + \text{minimize} & 0 \\ + \text{subject to} & \Sigma_{ij} = \tilde \Sigma_{ij}, \quad (i,j) \notin M \\ + & \Sigma \succeq 0, + \end{array} + \] + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Example + mo.md(r""" + ## Example - In the following code, we show how to specify and solve an SDP with CVXPY. - """ - ) + In the following code, we show how to specify and solve an SDP with CVXPY. 
+ """) return @@ -87,7 +85,7 @@ def _(np): for i in range(p): A.append(np.random.randn(n, n)) b.append(np.random.randn()) - return A, C, b, i, n, p + return A, C, b, n, p @app.cell @@ -101,7 +99,7 @@ def _(A, C, b, cp, n, p): constraints += [cp.trace(A[i] @ X) == b[i] for i in range(p)] prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints) _ = prob.solve() - return X, constraints, prob + return X, prob @app.cell @@ -111,7 +109,7 @@ def _(X, mo, prob, wigglystuff): The optimal value is {prob.value:0.4f}. A solution for $X$ is (rounded to the nearest decimal) is: - + {mo.ui.anywidget(wigglystuff.Matrix(X.value)).center()} """ ) diff --git a/optimization/README.md b/optimization/README.md index d846f9baa8cf9f35e2af10ab7b386e8b313e17df..edbfa9db0b1974dc235d7b888b6fe3b7df55dd9d 100644 --- a/optimization/README.md +++ b/optimization/README.md @@ -1,3 +1,8 @@ +--- +title: Readme +marimo-version: 0.18.4 +--- + # Learn optimization This collection of marimo notebooks teaches you the basics of convex @@ -30,4 +35,4 @@ to a notebook's URL: [marimo.app/github.com/marimo-team/learn/blob/main/optimiza **Thanks to all our notebook authors!** -* [Akshay Agrawal](https://github.com/akshayka) +* [Akshay Agrawal](https://github.com/akshayka) \ No newline at end of file diff --git a/polars/01_why_polars.py b/polars/01_why_polars.py index 7f8f78fd81dae54f6a539141906b08bfa1e8dcde..0ed303c6b0ec806499a754e907aea3cb28ec91fc 100644 --- a/polars/01_why_polars.py +++ b/polars/01_why_polars.py @@ -9,7 +9,7 @@ import marimo -__generated_with = "0.11.8" +__generated_with = "0.18.4" app = marimo.App(width="medium") @@ -21,17 +21,15 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # An introduction to Polars + mo.md(""" + # An introduction to Polars - _By [Koushik Khan](https://github.com/koushikkhan)._ + _By [Koushik Khan](https://github.com/koushikkhan)._ - This notebook provides a birds-eye overview of [Polars](https://pola.rs/), a fast and user-friendly data 
manipulation library for Python, and compares it to alternatives like Pandas and PySpark. + This notebook provides a birds-eye overview of [Polars](https://pola.rs/), a fast and user-friendly data manipulation library for Python, and compares it to alternatives like Pandas and PySpark. - Like Pandas and PySpark, the central data structure in Polars is **the DataFrame**, a tabular data structure consisting of named columns. For example, the next cell constructs a DataFrame that records the gender, age, and height in centimeters for a number of individuals. - """ - ) + Like Pandas and PySpark, the central data structure in Polars is **the DataFrame**, a tabular data structure consisting of named columns. For example, the next cell constructs a DataFrame that records the gender, age, and height in centimeters for a number of individuals. + """) return @@ -48,46 +46,40 @@ def _(): } ) df_pl - return df_pl, pl + return (pl,) @app.cell(hide_code=True) def _(mo): - mo.md( - """ - Unlike Python's earliest DataFrame library Pandas, Polars was designed with performance and usability in mind — Polars can scale to large datasets with ease while maintaining a simple and intuitive API. + mo.md(""" + Unlike Python's earliest DataFrame library Pandas, Polars was designed with performance and usability in mind — Polars can scale to large datasets with ease while maintaining a simple and intuitive API. - Polars' performance is due to a number of factors, including its implementation in rust and its ability to perform operations in a parallelized and vectorized manner. It supports a wide range of data types, advanced query optimizations, and seamless integration with other Python libraries, making it a versatile tool for data scientists, engineers, and analysts. Additionally, Polars provides a lazy API for deferred execution, allowing users to optimize their workflows by chaining operations and executing them in a single pass. 
+ Polars' performance is due to a number of factors, including its implementation in rust and its ability to perform operations in a parallelized and vectorized manner. It supports a wide range of data types, advanced query optimizations, and seamless integration with other Python libraries, making it a versatile tool for data scientists, engineers, and analysts. Additionally, Polars provides a lazy API for deferred execution, allowing users to optimize their workflows by chaining operations and executing them in a single pass. - With its focus on speed, scalability, and ease of use, Polars is quickly becoming a go-to choice for data professionals looking to streamline their data processing pipelines and tackle large-scale data challenges. - """ - ) + With its focus on speed, scalability, and ease of use, Polars is quickly becoming a go-to choice for data professionals looking to streamline their data processing pipelines and tackle large-scale data challenges. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Choosing Polars over Pandas + mo.md(""" + ## Choosing Polars over Pandas - In this section we'll give a few reasons why Polars is a better choice than Pandas, along with examples. - """ - ) + In this section we'll give a few reasons why Polars is a better choice than Pandas, along with examples. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Intuitive syntax + mo.md(""" + ### Intuitive syntax - Polars' syntax is similar to PySpark and intuitive like SQL, making heavy use of **method chaining**. This makes it easy for data professionals to transition to Polars, and leads to an API that is more concise and readable than Pandas. + Polars' syntax is similar to PySpark and intuitive like SQL, making heavy use of **method chaining**. This makes it easy for data professionals to transition to Polars, and leads to an API that is more concise and readable than Pandas. 
- **Example.** In the next few cells, we contrast the code to perform a basic filter and aggregation of data with Pandas to the code required to accomplish the same task with `Polars`. - """ - ) + **Example.** In the next few cells, we contrast the code to perform a basic filter and aggregation of data with Pandas to the code required to accomplish the same task with `Polars`. + """) return @@ -112,12 +104,14 @@ def _(): # step-2: groupby and aggregation result_pd = filtered_df_pd.groupby("Gender")["Height_CM"].mean() result_pd - return df_pd, filtered_df_pd, pd, result_pd + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""The same example can be worked out in Polars more concisely, using method chaining. Notice how the Polars code is essentially as readable as English.""") + mo.md(r""" + The same example can be worked out in Polars more concisely, using method chaining. Notice how the Polars code is essentially as readable as English. + """) return @@ -137,17 +131,15 @@ def _(pl): # filter, groupby and aggregation using method chaining result_pl = data_pl.filter(pl.col("Age") > 15).group_by("Gender").agg(pl.mean("Height_CM")) result_pl - return data_pl, result_pl + return (data_pl,) @app.cell(hide_code=True) def _(mo): - mo.md( - """ - Notice how Polars uses a *method-chaining* approach, similar to PySpark, which makes the code more readable and expressive while using a *single line* to design the query. - Additionally, Polars supports SQL-like operations *natively*, that allows you to write SQL queries directly on polars dataframe: - """ - ) + mo.md(""" + Notice how Polars uses a *method-chaining* approach, similar to PySpark, which makes the code more readable and expressive while using a *single line* to design the query. 
+ Additionally, Polars supports SQL-like operations *natively*, that allows you to write SQL queries directly on polars dataframe: + """) return @@ -155,159 +147,145 @@ def _(mo): def _(data_pl): result = data_pl.sql("SELECT Gender, AVG(Height_CM) FROM self WHERE Age > 15 GROUP BY Gender") result - return (result,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### A large collection of built-in APIs + mo.md(""" + ### A large collection of built-in APIs - Polars has a comprehensive API that enables to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly improves performance. - """ - ) + Polars has a comprehensive API that enables to perform virtually any operation using built-in methods. In contrast, Pandas often requires more complex operations to be handled using the `apply` method with a lambda function. The issue with `apply` is that it processes rows sequentially, looping through the DataFrame one row at a time, which can be inefficient. By leveraging Polars' built-in methods, you can operate on entire columns at once, unlocking the power of **SIMD (Single Instruction, Multiple Data)** parallelism. This approach not only simplifies your code but also significantly improves performance. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Query optimization 📈 + mo.md(""" + ### Query optimization 📈 - A key factor behind Polars' performance lies in its **evaluation strategy**. 
While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more. + A key factor behind Polars' performance lies in its **evaluation strategy**. While Pandas defaults to **eager execution**, executing operations in the exact order they are written, Polars offers both **eager and lazy execution**. With lazy execution, Polars employs a **query optimizer** that analyzes all required operations and determines the most efficient way to execute them. This optimization can involve reordering operations, eliminating redundant calculations, and more. - For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column: + For example, consider the following expression to calculate the mean of the `Number1` column for categories "A" and "B" in the `Category` column: - ```python - ( - df - .groupby(by="Category").agg(pl.col("Number1").mean()) - .filter(pl.col("Category").is_in(["A", "B"])) - ) - ``` - - If executed eagerly, the `groupby` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `groupby` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency. 
- """ + ```python + ( + df + .groupby(by="Category").agg(pl.col("Number1").mean()) + .filter(pl.col("Category").is_in(["A", "B"])) ) + ``` + + If executed eagerly, the `groupby` operation would first be applied to the entire DataFrame, followed by filtering the results by `Category`. However, with **lazy execution**, Polars can optimize this process by first filtering the DataFrame to include only the relevant categories ("A" and "B") and then performing the `groupby` operation on the reduced dataset. This approach minimizes unnecessary computations and significantly improves efficiency. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Scalability — handling large datasets in memory ⬆️ + mo.md(""" + ### Scalability — handling large datasets in memory ⬆️ - Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger. + Pandas is limited by its single-threaded design and reliance on Python, which makes it inefficient for processing large datasets. Polars, on the other hand, is built in Rust and optimized for parallel processing, enabling it to handle datasets that are orders of magnitude larger. 
- **Example: Processing a Large Dataset** - In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors: + **Example: Processing a Large Dataset** + In Pandas, loading a large dataset (e.g., 10GB) often results in memory errors: - ```python - # This may fail with large datasets - df = pd.read_csv("large_dataset.csv") - ``` + ```python + # This may fail with large datasets + df = pd.read_csv("large_dataset.csv") + ``` - In Polars, the same operation runs quickly, without memory pressure: + In Polars, the same operation runs quickly, without memory pressure: - ```python - df = pl.read_csv("large_dataset.csv") - ``` + ```python + df = pl.read_csv("large_dataset.csv") + ``` - Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets: + Polars also supports lazy evaluation, which allows you to optimize your workflows by deferring computations until necessary. This is particularly useful for large datasets: - ```python - df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame - result = df.filter(pl.col("A") > 1).groupby("A").agg(pl.sum("B")).collect() # Execute - ``` - """ - ) + ```python + df = pl.scan_csv("large_dataset.csv") # Lazy DataFrame + result = df.filter(pl.col("A") > 1).groupby("A").agg(pl.sum("B")).collect() # Execute + ``` + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Compatibility with other machine learning libraries 🤝 + mo.md(""" + ### Compatibility with other machine learning libraries 🤝 - Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models. + Polars integrates seamlessly with popular machine learning libraries like Scikit-learn, PyTorch, and TensorFlow. 
Its ability to handle large datasets efficiently makes it an excellent choice for preprocessing data before feeding it into ML models. - **Example: Preprocessing Data for Scikit-learn** + **Example: Preprocessing Data for Scikit-learn** - ```python - import polars as pl - from sklearn.linear_model import LinearRegression + ```python + import polars as pl + from sklearn.linear_model import LinearRegression - # Load and preprocess data - df = pl.read_csv("data.csv") - X = df.select(["feature1", "feature2"]).to_numpy() - y = df.select("target").to_numpy() + # Load and preprocess data + df = pl.read_csv("data.csv") + X = df.select(["feature1", "feature2"]).to_numpy() + y = df.select("target").to_numpy() - # Train a model - model = LinearRegression() - model.fit(X, y) - ``` + # Train a model + model = LinearRegression() + model.fit(X, y) + ``` - Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library: + Polars also supports conversion to other formats like NumPy arrays and Pandas DataFrames, ensuring compatibility with virtually any ML library: - ```python - # Convert to Pandas DataFrame - pandas_df = df.to_pandas() + ```python + # Convert to Pandas DataFrame + pandas_df = df.to_pandas() - # Convert to NumPy array - numpy_array = df.to_numpy() - ``` - """ - ) + # Convert to NumPy array + numpy_array = df.to_numpy() + ``` + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ### Easy to use, with room for power users + mo.md(""" + ### Easy to use, with room for power users - Polars supports advanced operations like + Polars supports advanced operations like - - **date handling** - - **window functions** - - **joins** - - **nested data types** + - **date handling** + - **window functions** + - **joins** + - **nested data types** - which is making it a versatile tool for data manipulation. - """ - ) + which is making it a versatile tool for data manipulation. 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## Why not PySpark? + mo.md(""" + ## Why not PySpark? - While **PySpark** is versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels. + While **PySpark** is versatile tool that has transformed the way big data is handled and processed in Python, its **complex setup process** can be intimidating, especially for beginners. In contrast, **Polars** requires minimal setup and is ready to use right out of the box, making it more accessible for users of all skill levels. - When deciding between the two, **PySpark** is the preferred choice for processing large datasets distributed across a **multi-node cluster**. However, for computations on a **single-node machine**, **Polars** is an excellent alternative. Remarkably, Polars is capable of handling datasets that exceed the size of the available RAM, making it a powerful tool for efficient data processing even on limited hardware. - """ - ) + When deciding between the two, **PySpark** is the preferred choice for processing large datasets distributed across a **multi-node cluster**. However, for computations on a **single-node machine**, **Polars** is an excellent alternative. Remarkably, Polars is capable of handling datasets that exceed the size of the available RAM, making it a powerful tool for efficient data processing even on limited hardware. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - ## 🔖 References + mo.md(""" + ## 🔖 References - - [Polars official website](https://pola.rs/) - - [Polars vs. Pandas](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/) - """ - ) + - [Polars official website](https://pola.rs/) + - [Polars vs. 
Pandas](https://blog.jetbrains.com/pycharm/2024/07/polars-vs-pandas/) + """) return diff --git a/polars/02_dataframes.py b/polars/02_dataframes.py index e090b4f67ed6659dd6b945f4e8c610d5d30d856e..71ad9658833bab2f4deb8d4857186d7871449945 100644 --- a/polars/02_dataframes.py +++ b/polars/02_dataframes.py @@ -10,14 +10,13 @@ import marimo -__generated_with = "0.13.10" +__generated_with = "0.18.4" app = marimo.App() @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # DataFrames Author: [*Raine Hoang*](https://github.com/Jystine) @@ -25,33 +24,31 @@ def _(mo): /// Note The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/index.html). - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" ## Defining a DataFrame At the most basic level, all that you need to do in order to create a DataFrame in Polars is to use the .DataFrame() method and pass in some data into the data parameter. However, there are restrictions as to what exactly you can pass into this method. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""### What Can Be a DataFrame?""") + mo.md(r""" + ### What Can Be a DataFrame? + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" There are [5 data types](https://github.com/pola-rs/polars/blob/py-1.29.0/py-polars/polars/dataframe/frame.py#L197) that can be converted into a DataFrame. 1. Dictionary @@ -59,20 +56,17 @@ def _(mo): 3. NumPy Array 4. Series 5. Pandas DataFrame - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" #### Dictionary Dictionaries are structures that store data as `key:value` pairs. 
Let's say we have the following dictionary: - """ - ) + """) return @@ -85,7 +79,9 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md(r"""In order to convert this dictionary into a DataFrame, we simply need to pass it into the data parameter in the `.DataFrame()` method like so.""") + mo.md(r""" + In order to convert this dictionary into a DataFrame, we simply need to pass it into the data parameter in the `.DataFrame()` method like so. + """) return @@ -98,25 +94,21 @@ def _(dct_data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - In this case, Polars turned each of the lists in the dictionary into a column in the DataFrame. + mo.md(r""" + In this case, Polars turned each of the lists in the dictionary into a column in the DataFrame. The other data structures will follow a similar pattern when converting them to DataFrames. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ##### Sequence Sequences are data structures that contain collections of items, which can be accessed using its index. Examples of sequences are lists, tuples, and strings. We will be using a list of lists in order to demonstrate how to convert a sequence in a DataFrame. - """ - ) + """) return @@ -136,19 +128,19 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Notice that since we didn't specify the column names, Polars automatically named them `column_0`, `column_1`, and `column_2`. Later, we will show you how to specify the names of the columns.""") + mo.md(r""" + Notice that since we didn't specify the column names, Polars automatically named them `column_0`, `column_1`, and `column_2`. Later, we will show you how to specify the names of the columns. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ##### NumPy Array NumPy arrays are considered a sequence of items that can also be accessed using its index. 
An important thing to note is that all of the items in an array must have the same data type. - """ - ) + """) return @@ -168,19 +160,19 @@ def _(arr_data, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Notice that each inner array is a row in the DataFrame, not a column like the previous methods discussed. Later, we will go over how to tell Polars if we the information in the data structure to be presented as rows or columns.""") + mo.md(r""" + Notice that each inner array is a row in the DataFrame, not a column like the previous methods discussed. Later, we will go over how to tell Polars if we the information in the data structure to be presented as rows or columns. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ##### Series Series are a way to store a single column in a DataFrame and all entries in a series must have the same data type. You can combine these series together to form one DataFrame. - """ - ) + """) return @@ -200,13 +192,11 @@ def _(pl, pl_series): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ##### Pandas DataFrame Another popular package that utilizes DataFrames is pandas. By passing in a pandas DataFrame into .DataFrame(), you can easily convert it into a Polars DataFrame. - """ - ) + """) return @@ -230,19 +220,19 @@ def _(pd_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now that we've looked over what can be converted into a DataFrame and the basics of it, let's look at the structure of the DataFrame.""") + mo.md(r""" + Now that we've looked over what can be converted into a DataFrame and the basics of it, let's look at the structure of the DataFrame. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## DataFrame Structure Let's recall one of the DataFrames we defined earlier. 
- """ - ) + """) return @@ -254,14 +244,15 @@ def _(dct_df): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can see that this DataFrame has 4 rows and 3 columns as indicated by the text beneath the DataFrame. Each column has a name that can be used to access the data within that column. In this case, the names are: "col1", "col2", and "col3". Below the column name, there is text that indicates the data type stored within that column. "col1" has the text "i64" underneath its name, meaning that that column stores integers. "col2" stores strings as seen by the "str" under the column name. Finally, "col3" stores floats as it has "f64" under the column name. Polars will automatically assume the data types stored in each column, but we will go over a way to specify it later in this tutorial. Each column can only hold one data type at a time, so you can't have a string and an integer in the same column.""") + mo.md(r""" + We can see that this DataFrame has 4 rows and 3 columns as indicated by the text beneath the DataFrame. Each column has a name that can be used to access the data within that column. In this case, the names are: "col1", "col2", and "col3". Below the column name, there is text that indicates the data type stored within that column. "col1" has the text "i64" underneath its name, meaning that that column stores integers. "col2" stores strings as seen by the "str" under the column name. Finally, "col3" stores floats as it has "f64" under the column name. Polars will automatically assume the data types stored in each column, but we will go over a way to specify it later in this tutorial. Each column can only hold one data type at a time, so you can't have a string and an integer in the same column. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Parameters On top of the "data" parameter, there are 6 additional parameters you can specify: @@ -272,20 +263,17 @@ def _(mo): 4. orient 5. infer_schema_length 6. 
nan_to_null - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" #### Schema Let's recall the DataFrame we created using a sequence. - """ - ) + """) return @@ -297,7 +285,9 @@ def _(seq_df): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can see that the column names and data type were inferred by Polars. The schema parameter allows us to specify the column names and data type we want for each column. There are 3 ways you can use this parameter. The first way involves using a dictionary to define the following key value pair: column name:data type.""") + mo.md(r""" + We can see that the column names and data type were inferred by Polars. The schema parameter allows us to specify the column names and data type we want for each column. There are 3 ways you can use this parameter. The first way involves using a dictionary to define the following key value pair: column name:data type. + """) return @@ -309,7 +299,9 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""You can also do this using a list of (column name, data type) pairs instead of a dictionary.""") + mo.md(r""" + You can also do this using a list of (column name, data type) pairs instead of a dictionary. + """) return @@ -321,7 +313,9 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Notice how both the column names and the data type (text underneath the column name) is different from the original `seq_df`. If you only wanted to specify the column names and let Polars assume the data type, you can do so using a list of column names.""") + mo.md(r""" + Notice how both the column names and the data type (text underneath the column name) is different from the original `seq_df`. If you only wanted to specify the column names and let Polars assume the data type, you can do so using a list of column names. 
+ """) return @@ -333,19 +327,19 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""The text under the column names is different from the previous two DataFrames we created since we didn't explicitly tell Polars what data type we wanted in each column.""") + mo.md(r""" + The text under the column names is different from the previous two DataFrames we created since we didn't explicitly tell Polars what data type we wanted in each column. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" #### Schema_Overrides If you only wanted to specify the data type of specific columns and let Polars infer the rest, you can use the schema_overrides parameter for that. This parameter requires that you pass in a dictionary where the key value pair is column name:data type. Unlike the schema parameter, the column name must match the name already present in the DataFrame as that is how Polars will identify which column you want to specify the data type. If you use a column name that doesn't already exist, Polars won't be able to change the data type. - """ - ) + """) return @@ -357,13 +351,11 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Notice here that only the data type in the first column changed while Polars inferred the rest. It is important to note that if you only use the schema_overrides parameter, you are limited to how much you can change the data type. In the example above, we were able to change the data type from int32 to int16 without any further parameters since the data type is still an integer. However, if we wanted to change the first column to be a string, we would get an error as Polars has already strictly set the schema to only take in integer values. 
- """ - ) + """) return @@ -378,25 +370,27 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""If we wanted to use schema_override to completely change the data type of the column, we need an additional parameter: strict.""") + mo.md(r""" + If we wanted to use schema_override to completely change the data type of the column, we need an additional parameter: strict. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" #### Strict The strict parameter allows you to specify if you want a column's data type to be enforced with flexibility or not. When set to `True`, Polars will raise an error if there is a data type that doesn't match the data type the column is expecting. It will not attempt to type cast it to the correct data type as Polars prioritizes that all the data can be converted without any loss or error. When set to `False`, Polars will attempt to type cast the data into the data type the column wants. If it is unable to successfully convert the data type, the value will be replaced with a null value. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Let's see an example of what happens when strict is set to `True`. The cell below should show an error.""") + mo.md(r""" + Let's see an example of what happens when strict is set to `True`. The cell below should show an error. + """) return @@ -413,7 +407,9 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now let's try setting strict to `False`.""") + mo.md(r""" + Now let's try setting strict to `False`. + """) return @@ -425,19 +421,19 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Since we allowed for Polars to change the schema by setting strict to `False`, we were able to cast the first column to be strings.""") + mo.md(r""" + Since we allowed for Polars to change the schema by setting strict to `False`, we were able to cast the first column to be strings. 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" #### Orient Let's recall the DataFrame we made by using an array and the data used to make it. - """ - ) + """) return @@ -455,7 +451,9 @@ def _(arr_df): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Notice how Polars decided to make each inner array a row in the DataFrame. If we wanted to make it so that each inner array was a column instead of a row, all we would need to do is pass `"col"` into the orient parameter.""") + mo.md(r""" + Notice how Polars decided to make each inner array a row in the DataFrame. If we wanted to make it so that each inner array was a column instead of a row, all we would need to do is pass `"col"` into the orient parameter. + """) return @@ -467,7 +465,9 @@ def _(arr_data, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""If we wanted to do the opposite, then we pass `"row"` into the orient parameter.""") + mo.md(r""" + If we wanted to do the opposite, then we pass `"row"` into the orient parameter. + """) return @@ -485,39 +485,33 @@ def _(pl, seq_data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" #### Infer_Schema_Length Without setting the schema ourselves, Polars uses the data provided to infer the data types of the columns. It does this by looking at each of the rows in the data provided. You can specify to Polars how many rows to look at by using the infer_schema_length parameter. For example, if you were to set this parameter to 5, then Polars would use the first 5 rows to infer the schema. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" #### NaN_To_Null If there are np.nan values in the data, you can convert them to null values by setting the nan_to_null parameter to `True`. 
- """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Summary - DataFrames are a useful data structure that can be used to organize and perform additional analysis on your data. In this notebook, we have learned how to define DataFrames, what can be a DataFrame, the structure of it, and additional parameters you can set while creating it. + DataFrames are a useful data structure that can be used to organize and perform additional analysis on your data. In this notebook, we have learned how to define DataFrames, what can be a DataFrame, the structure of it, and additional parameters you can set while creating it. In order to create a DataFrame, you pass your data into the .DataFrame() method through the data parameter. The data you pass through must be either a dictionary, sequence, array, series, or pandas DataFrame. Once defined, the DataFrame will separate the data into different columns and the data within the column must have the same data type. There exists additional parameters besides data that allows you to further customize the ending DataFrame. Some examples of these are orient, strict, and infer_schema_length. - """ - ) + """) return diff --git a/polars/03_loading_data.py b/polars/03_loading_data.py index f14a57d721639ce790e24232e301387fd8112e60..ff9ee9885b7fc55dae400e3c0b38b8cdb2d84440 100644 --- a/polars/03_loading_data.py +++ b/polars/03_loading_data.py @@ -14,14 +14,13 @@ import marimo -__generated_with = "0.15.2" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Loading Data _By [etrotta](https://github.com/etrotta)._ @@ -29,8 +28,7 @@ def _(mo): This tutorial covers how to load data of varying formats and from different sources using [polars](https://docs.pola.rs/). 
It includes examples of how to load and write to a variety of formats, shows how to convert data from other libraries to support formats not supported directly by polars, includes relevant links for users that need to connect with external sources, and explains how to deal with custom formats via plugins. - """ - ) + """) return @@ -80,12 +78,10 @@ def _(mo, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Parquet Parquet is a popular format for storing tabular data based on the Arrow memory spec, it is a great default and you'll find a lot of datasets already using it in sites like HuggingFace - """ - ) + """) return @@ -100,14 +96,12 @@ def _(df, folder, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## CSV A classic and common format that has been widely used for decades. The API is almost identical to Parquet - You can just replace `parquet` by `csv` and it will work with the default settings, but polars also allows for you to customize some settings such as the delimiter and quoting rules. - """ - ) + """) return @@ -123,8 +117,7 @@ def _(df, folder, lz, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## JSON JavaScript Object Notation is somewhat commonly used for storing unstructed data, and extremely commonly used for API responses. @@ -138,8 +131,7 @@ def _(mo): Polars supports Lists with variable length, Arrays with fixed length, and Structs with well defined fields, but not mappings with arbitrary keys. You might want to transform data by unnesting structs and exploding lists after loading from complex JSON files. - """ - ) + """) return @@ -163,8 +155,7 @@ def _(df, folder, lz, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Databases Polars doesn't supports any databases _directly_, but rather uses other libraries as Engines. 
Reading and writing to databases using polars methods does not supports Lazy execution, but you may pass an SQL Query for the database to pre-filter the data before reaches polars. See the [User Guide](https://docs.pola.rs/user-guide/io/database) for more details. @@ -172,8 +163,7 @@ def _(mo): You can also use other libraries with [arrow support](#arrow-support) or [polars plugins](#plugin-support) to read from databases before loading into polars, some of which support lazy reading. Using the Arrow Database Connectivity SQLite support as an example: - """ - ) + """) return @@ -190,43 +180,37 @@ def _(df, folder, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Excel From a performance perspective, we recommend using other formats if possible, such as Parquet or CSV files. Similarly to Databases, polars doesn't supports it natively but rather uses other libraries as Engines. See the [User Guide](https://docs.pola.rs/user-guide/io/excel) if you need to use it. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Others natively supported If you understood the above examples, then all other formats should feel familiar - the core API is the same for all formats, `read` and `write` for the Eager API or `scan` and `sink` for the lazy API. See https://docs.pola.rs/api/python/stable/reference/io.html for the full list of formats natively supported by Polars - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Arrow Support You can convert Arrow compatible data from other libraries such as `pandas`, `duckdb` or `pyarrow` to polars DataFrames and vice-versa, much of the time without even having to copy data. This allows for you to use other libraries to load data in formats not support by polars, then convert the dataframe in-memory to polars. 
- """ - ) + """) return @@ -241,13 +225,11 @@ def _(df, folder, pd, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Plugin Support You can also write [IO Plugins](https://docs.pola.rs/user-guide/plugins/io_plugins/) for Polars in order to support any format you need, or use other libraries that support polars via their own plugins such as DuckDB. - """ - ) + """) return @@ -261,8 +243,7 @@ def _(duckdb, folder): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Creating your own Plugin The simplest form of plugins are essentially generators that yield DataFrames. @@ -273,12 +254,11 @@ def _(mo): - You must use `register_io_source` for polars to create the LazyFrame which will consume the Generator - You are expected to provide a Schema before the Generator starts - - - For many use cases the Plugin may be able to infer it, but you could also pass it explicitly to the plugin function + - - For many use cases the Plugin may be able to infer it, but you could also pass it explicitly to the plugin function - Ideally you should parse some of the filters and column selectors to avoid unnecessary work, but it is possible to delegate that to polars after loading the data in order to keep it simpler (at the cost of efficiency) Efficiently parsing the filter expressions is out of the scope for this notebook. - """ - ) + """) return @@ -351,8 +331,7 @@ def _(Iterator, get_positional_names, itertools, pl, register_io_source): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### DuckDB As demonstrated above, in addition to Arrow interoperability support, [DuckDB](https://duckdb.org/) also has added support for loading query results into a polars DataFrame or LazyFrame via a polars plugin. 
@@ -363,8 +342,7 @@ def _(mo): - https://duckdb.org/docs/stable/guides/python/polars.html You can learn more about DuckDB in the marimo course about it as well, including Marimo SQL related features - """ - ) + """) return @@ -398,16 +376,14 @@ def _(duckdb_conn, duckdb_query): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Hive Partitions There is also support for [Hive](https://docs.pola.rs/user-guide/io/hive/) partitioned data, but parts of the API are still unstable (may change in future polars versions ). Even without using partitions, many methods also support glob patterns to read multiple files in the same folder such as `scan_csv(folder / "*.csv")` - """ - ) + """) return @@ -422,28 +398,24 @@ def _(df, folder, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Reading from the Cloud - Polars also has support for reading public and private datasets from multiple websites + Polars also has support for reading public and private datasets from multiple websites and cloud storage solutions. If you must (re)use the same file many times in the same machine you may want to manually download it then load from your local file system instead to avoid re-downloading though, or download and write to disk only if the file does not exists. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Arbitrary web sites You can load files from nearly any website just by using a HTTPS URL, as long as it is not locked behind authorization. 
- """ - ) + """) return @@ -455,15 +427,13 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Hugging Face & Kaggle Datasets Look for polars inside of dropdowns such as "Use this dataset" in Hugging Face or "Code" in Kaggle, and oftentimes you'll get a snippet to load data directly into a dataframe you can use Read more: [Hugging Face](https://docs.pola.rs/user-guide/io/hugging-face/), [Kaggle](https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpolars) - """ - ) + """) return @@ -475,15 +445,13 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Cloud Storage - AWS S3, Azure Blob Storage, Google Cloud Storage The API is the same for all three storage providers, check the [User Guide](https://docs.pola.rs/user-guide/io/cloud-storage/) if you need of any of them. Runnable examples are not included in this Notebook as it would require setting up authentication, but the disabled cell below shows an example using Azure. - """ - ) + """) return @@ -510,13 +478,11 @@ def _(adlfs, df, os, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Multiplexing You can also split a query into multiple sinks via [multiplexing](https://docs.pola.rs/user-guide/lazy/multiplexing/), to avoid reading multiple times, repeating the same operations for each sink or collecting intermediary results into memory. - """ - ) + """) return @@ -540,13 +506,11 @@ def _(folder, lz, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Async Execution Polars also has experimental support for running lazy queries in `async` mode, letting you `await` operations inside of async functions. - """ - ) + """) return @@ -566,27 +530,23 @@ async def _(folder, lz, pl, sinks): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Conclusion As you have seen, polars makes it easy to work with a variety of formats and different data sources. 
From natively supported formats such as Parquet and CSV files, to using other libraries as an intermediary for XML or geospatial data, and plugins for newly emerging or proprietary formats, as long as your data can fit in a table then odds are you can turn it into a polars DataFrame. Combined with loading directly from remote sources, including public data platforms such as Hugging Face and Kaggle as well as private data in your cloud, you can import datasets for almost anything you can imagine. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Utilities Imports, utility functions and alike used through the Notebook - """ - ) + """) return diff --git a/polars/04_basic_operations.py b/polars/04_basic_operations.py index ae16fe13f425baffaf6eeebfa9abeabd923e59c6..fdcebeabc4d11e2398c43448fce1d3b07c79c11e 100644 --- a/polars/04_basic_operations.py +++ b/polars/04_basic_operations.py @@ -8,7 +8,7 @@ import marimo -__generated_with = "0.11.13" +__generated_with = "0.18.4" app = marimo.App(width="medium") @@ -20,14 +20,12 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Basic operations on data - _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._ + mo.md(r""" + # Basic operations on data + _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._ - In this notebook, you'll learn how to perform arithmetic operations, comparisons, and conditionals on a Polars dataframe. We'll work with a DataFrame that tracks software usage by year, categorized as either Vintage (old) or Modern (new). - """ - ) + In this notebook, you'll learn how to perform arithmetic operations, comparisons, and conditionals on a Polars dataframe. We'll work with a DataFrame that tracks software usage by year, categorized as either Vintage (old) or Modern (new). 
+ """) return @@ -107,13 +105,11 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Arithmetic - ### Addition - Let's add 42 users to each piece of software. This means adding 42 to each value under **users**. - """ - ) + mo.md(r""" + ## Arithmetic + ### Addition + Let's add 42 users to each piece of software. This means adding 42 to each value under **users**. + """) return @@ -125,7 +121,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Another way to perform the above operation is using the built-in function.""") + mo.md(r""" + Another way to perform the above operation is using the built-in function. + """) return @@ -137,12 +135,10 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Subtraction - Let's subtract 42 users to each piece of software. - """ - ) + mo.md(r""" + ### Subtraction + Let's subtract 42 users to each piece of software. + """) return @@ -154,7 +150,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Alternatively, you could subtract like this:""") + mo.md(r""" + Alternatively, you could subtract like this: + """) return @@ -166,12 +164,10 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Division - Suppose the **users** values are inflated, we can reduce them by dividing by 1000. Here's how to do it. - """ - ) + mo.md(r""" + ### Division + Suppose the **users** values are inflated, we can reduce them by dividing by 1000. Here's how to do it. + """) return @@ -183,7 +179,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Or we could do it with a built-in expression.""") + mo.md(r""" + Or we could do it with a built-in expression. 
+ """) return @@ -195,7 +193,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""If we didn't care about the remainder after division (i.e remove numbers after decimal point) we could do it like this.""") + mo.md(r""" + If we didn't care about the remainder after division (i.e remove numbers after decimal point) we could do it like this. + """) return @@ -207,12 +207,10 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Multiplication - Let's pretend the *user* values are deflated and increase them by multiplying by 100. - """ - ) + mo.md(r""" + ### Multiplication + Let's pretend the *user* values are deflated and increase them by multiplying by 100. + """) return @@ -224,7 +222,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Polars also has a built-in function for multiplication.""") + mo.md(r""" + Polars also has a built-in function for multiplication. + """) return @@ -236,7 +236,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""So far, we've only modified the values in an existing column. Let's create a column **decade** that will represent the years as decades. Thus 1985 will be 1980 and 2008 will be 2000.""") + mo.md(r""" + So far, we've only modified the values in an existing column. Let's create a column **decade** that will represent the years as decades. Thus 1985 will be 1980 and 2008 will be 2000. + """) return @@ -248,7 +250,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We could create a new column another way as follows:""") + mo.md(r""" + We could create a new column another way as follows: + """) return @@ -260,16 +264,14 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - **Tip** - Polars encounrages you to perform your operations as a chain. This enables you to take advantage of the query optimizer. We'll build upon the above code as a chain. 
+ mo.md(r""" + **Tip** + Polars encounrages you to perform your operations as a chain. This enables you to take advantage of the query optimizer. We'll build upon the above code as a chain. - ## Comparison - ### Equal - Let's get all the software categorized as Vintage. - """ - ) + ## Comparison + ### Equal + Let's get all the software categorized as Vintage. + """) return @@ -284,7 +286,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We could also do a double comparison. VisiCal is the only software that's vintage and in the decade 1970s. Let's perform this comparison operation.""") + mo.md(r""" + We could also do a double comparison. VisiCal is the only software that's vintage and in the decade 1970s. Let's perform this comparison operation. + """) return @@ -300,13 +304,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - We could also do this comparison in one line, if readability is not a concern + mo.md(r""" + We could also do this comparison in one line, if readability is not a concern - **Notice** that we must enclose the two expressions between the `&` with parenthesis. - """ - ) + **Notice** that we must enclose the two expressions between the `&` with parenthesis. + """) return @@ -321,7 +323,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can also use the built-in function for equal to comparisons.""") + mo.md(r""" + We can also use the built-in function for equal to comparisons. + """) return @@ -336,12 +340,10 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Not equal - We can also compare if something is `not` equal to something. In this case, category is not vintage. - """ - ) + mo.md(r""" + ### Not equal + We can also compare if something is `not` equal to something. In this case, category is not vintage. 
+ """) return @@ -356,7 +358,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Or with the built-in function.""") + mo.md(r""" + Or with the built-in function. + """) return @@ -371,7 +375,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Or if you want to be extra clever, you can use the negation symbol `~` used in logic.""") + mo.md(r""" + Or if you want to be extra clever, you can use the negation symbol `~` used in logic. + """) return @@ -386,12 +392,10 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Greater than - Let's get the software where the year is greater than 2008 from the above dataframe. - """ - ) + mo.md(r""" + ### Greater than + Let's get the software where the year is greater than 2008 from the above dataframe. + """) return @@ -407,7 +411,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Or if we wanted the year 2008 to be included, we could use great or equal to.""") + mo.md(r""" + Or if we wanted the year 2008 to be included, we could use great or equal to. + """) return @@ -423,7 +429,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We could do the previous two operations with built-in functions. Here's with greater than.""") + mo.md(r""" + We could do the previous two operations with built-in functions. Here's with greater than. + """) return @@ -439,7 +447,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""And here's with greater or equal to""") + mo.md(r""" + And here's with greater or equal to + """) return @@ -455,14 +465,12 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - **Note**: For "less than", and "less or equal to" you can use the operators `<` or `<=`. Alternatively, you can use built-in functions `lt` or `le` respectively. + mo.md(r""" + **Note**: For "less than", and "less or equal to" you can use the operators `<` or `<=`. 
Alternatively, you can use built-in functions `lt` or `le` respectively. - ### Is between - Polars also allows us to filter between a range of values. Let's get the modern software were the year is between 2013 and 2016. This is inclusive on both ends (i.e. both years are part of the result). - """ - ) + ### Is between + Polars also allows us to filter between a range of values. Let's get the modern software were the year is between 2013 and 2016. This is inclusive on both ends (i.e. both years are part of the result). + """) return @@ -478,14 +486,12 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Or operator - If we only want either one of the conditions in the comparison to be met, we could use `|`, which is the `or` operator. + mo.md(r""" + ### Or operator + If we only want either one of the conditions in the comparison to be met, we could use `|`, which is the `or` operator. - Let's get software that is either modern or used in the decade 1980s. - """ - ) + Let's get software that is either modern or used in the decade 1980s. + """) return @@ -500,14 +506,12 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Conditionals - Polars also allows you create new columns based on a condition. Let's create a column *status* that will indicate if the software is "discontinued" or "in use". + mo.md(r""" + ## Conditionals + Polars also allows you create new columns based on a condition. Let's create a column *status* that will indicate if the software is "discontinued" or "in use". - Here's a list of products that are no longer in use. - """ - ) + Here's a list of products that are no longer in use. + """) return @@ -519,7 +523,9 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Here's how we can get a dataframe of the products that are discontinued.""") + mo.md(r""" + Here's how we can get a dataframe of the products that are discontinued. 
+ """) return @@ -534,7 +540,9 @@ def _(df, discontinued_list, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now, let's create the **status** column.""") + mo.md(r""" + Now, let's create the **status** column. + """) return @@ -553,12 +561,10 @@ def _(df, discontinued_list, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Unique counts - Sometimes you may want to see only the unique values in a column. Let's check the unique decades we have in our DataFrame. - """ - ) + mo.md(r""" + ## Unique counts + Sometimes you may want to see only the unique values in a column. Let's check the unique decades we have in our DataFrame. + """) return @@ -578,7 +584,9 @@ def _(df, discontinued_list, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Finally, let's find out the number of software used in each decade.""") + mo.md(r""" + Finally, let's find out the number of software used in each decade. + """) return @@ -598,7 +606,9 @@ def _(df, discontinued_list, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We could also rewrite the above code as follows:""") + mo.md(r""" + We could also rewrite the above code as follows: + """) return @@ -618,7 +628,9 @@ def _(df, discontinued_list, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Hopefully, we've picked your interest to try out Polars the next time you analyze your data.""") + mo.md(r""" + Hopefully, we've picked your interest to try out Polars the next time you analyze your data. 
+ """) return diff --git a/polars/05_reactive_plots.py b/polars/05_reactive_plots.py index e2d4654042f5cf2774839d2de2d9bdf32fb478ed..cb4696cb98f900bd053e20bbb256ae7d9bce4c0a 100644 --- a/polars/05_reactive_plots.py +++ b/polars/05_reactive_plots.py @@ -11,26 +11,24 @@ import marimo -__generated_with = "0.12.10" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - """ - # Reactive Plots + mo.md(""" + # Reactive Plots - _By [etrotta](https://github.com/etrotta)._ + _By [etrotta](https://github.com/etrotta)._ - This tutorial covers Data Visualisation basics using marimo, [polars](https://docs.pola.rs/) and [plotly](https://plotly.com/python/plotly-express/). - It shows how to load data, explore and visualise it, then use User Interface elements (including the plots themselves) to filter and select data for more refined analysis. + This tutorial covers Data Visualisation basics using marimo, [polars](https://docs.pola.rs/) and [plotly](https://plotly.com/python/plotly-express/). + It shows how to load data, explore and visualise it, then use User Interface elements (including the plots themselves) to filter and select data for more refined analysis. - We will be using a [Spotify Tracks dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). Before you write any code yourself, I recommend taking some time to understand the data you're working with, from which columns are available to what are their possible values, as well as more abstract details such as the scope, coverage and intended uses of the dataset. + We will be using a [Spotify Tracks dataset](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). 
Before you write any code yourself, I recommend taking some time to understand the data you're working with, from which columns are available to what are their possible values, as well as more abstract details such as the scope, coverage and intended uses of the dataset. - Note that this dataset does not contains data about ***all*** tracks, you can try using a larger dataset such as [bigdata-pw/Spotify](https://huggingface.co/datasets/bigdata-pw/Spotify), but I'm sticking with the smaller one to keep the notebook size manageable for most users. - """ - ) + Note that this dataset does not contains data about ***all*** tracks, you can try using a larger dataset such as [bigdata-pw/Spotify](https://huggingface.co/datasets/bigdata-pw/Spotify), but I'm sticking with the smaller one to keep the notebook size manageable for most users. + """) return @@ -47,20 +45,18 @@ def _(pl): # Or save to a local file first if you want to avoid downloading it each time you run: # file_path = "spotify-tracks.parquet" # lz = pl.scan_parquet(file_path) - return URL, branch, file_path, lz, repo_id + return (lz,) @app.cell(hide_code=True) def _(mo): - mo.md( - """ - You should always take a look at the data you are working on before actually doing any operations on it - for data coming from sources such as HuggingFace or Kaggle you can preview it via their websites, and optionally filter or do some transformations before downloading. + mo.md(""" + You should always take a look at the data you are working on before actually doing any operations on it - for data coming from sources such as HuggingFace or Kaggle you can preview it via their websites, and optionally filter or do some transformations before downloading. - The [Polars Lazy API](https://docs.pola.rs/user-guide/lazy/) allows for you define operations before loading the data, and polars will optimize the plan in order to avoid doing unnecessary operations or loading data we do not care about. 
+ The [Polars Lazy API](https://docs.pola.rs/user-guide/lazy/) allows for you define operations before loading the data, and polars will optimize the plan in order to avoid doing unnecessary operations or loading data we do not care about. - Let's say that looking at the dataset's preview in the Data Viewer, we decided we do not want the Unnamed column (which appears to be the row index), nor do we care about the original ID, and we only want non-explicit tracks. - """ - ) + Let's say that looking at the dataset's preview in the Data Viewer, we decided we do not want the Unnamed column (which appears to be the row index), nor do we care about the original ID, and we only want non-explicit tracks. + """) return @@ -87,18 +83,16 @@ def _(lz, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - When you start exploring a dataset, some of the first things to do may include: + mo.md(r""" + When you start exploring a dataset, some of the first things to do may include: - - investigating any values that seem weird - - verifying if there could be issues in the data - - checking for potential bugs in our pipelines - - ensuring you understand the data correctly, including its relationships and edge cases + - investigating any values that seem weird + - verifying if there could be issues in the data + - checking for potential bugs in our pipelines + - ensuring you understand the data correctly, including its relationships and edge cases - For example, the "min" value for the duration column is zero, and the max is over an hour. Why is that? - """ - ) + For example, the "min" value for the duration column is zero, and the max is over an hour. Why is that? + """) return @@ -112,13 +106,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - For this Notebook we will be using [plotly](https://plotly.com/python), but Marimo also [supports other plotting libraries](https://docs.marimo.io/guides/working_with_data/plotting/). 
+ mo.md(r""" + For this Notebook we will be using [plotly](https://plotly.com/python), but Marimo also [supports other plotting libraries](https://docs.marimo.io/guides/working_with_data/plotting/). - Let's visualize it using a [bar chart](https://plotly.com/python/bar-charts/) and get a feel for which region makes sense to focus on for our analysis - """ - ) + Let's visualize it using a [bar chart](https://plotly.com/python/bar-charts/) and get a feel for which region makes sense to focus on for our analysis + """) return @@ -129,20 +121,18 @@ def _(df, mo, px): fig.update_layout(selectdirection="h") plot = mo.ui.plotly(fig) plot - return duration_counts, fig, plot + return (plot,) @app.cell(hide_code=True) def _(mo): - mo.md( - """ - Note how there are a few outliers with extremely little duration (less than 2 minutes) and a few with extremely long duration (more than 6 minutes) + mo.md(""" + Note how there are a few outliers with extremely little duration (less than 2 minutes) and a few with extremely long duration (more than 6 minutes) - You can select a region in the graph by clicking and dragging, which can later be used to filter or transform data. In this Notebook we set a default if there is no selection, but you should try selecting a region yourself. + You can select a region in the graph by clicking and dragging, which can later be used to filter or transform data. In this Notebook we set a default if there is no selection, but you should try selecting a region yourself. - We will focus on those within that middle ground from around 120 seconds to 360 seconds, but you can play around with it a bit and see how the results change if you move the Selection region. Perhaps you can even find some Classical songs? - """ - ) + We will focus on those within that middle ground from around 120 seconds to 360 seconds, but you can play around with it a bit and see how the results change if you move the Selection region. 
Perhaps you can even find some Classical songs? + """) return @@ -154,7 +144,7 @@ def _(pl, plot): @app.cell -def _(df, get_extremes, pl, plot): +def _(df, pl, plot): # Now, we want to filter to only include tracks whose duration falls inside of our selection - we will need to first identify the extremes, then filter based on them min_dur, max_dur = get_extremes( plot.value, col="duration_seconds", defaults_if_missing=(120, 360) @@ -168,27 +158,25 @@ def _(df, get_extremes, pl, plot): # Actually apply the filter filtered_duration = df.filter(duration_in_range) filtered_duration - return duration_in_range, filtered_duration, max_dur, min_dur + return (filtered_duration,) @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Now that our data is 'clean', let's start coming up with and answering some questions about it. Some examples: - - - Which tracks or artists are the most popular? (Both globally as well as for each genre) - - Which genres are the most popular? The loudest? - - What are some common combinations of different artists? - - What can we infer anything based on the track's title or artist name? - - How popular is some specific song you like? - - How much does the mode and key affect other attributes? - - Can you classify a song's genre based on its attributes? - - For brevity, we will not explore all of them - feel free to try some of the others yourself, or go more in deep in the explored ones. - Make sure to come up with some questions of your own and explore them as well! - """ - ) + mo.md(r""" + Now that our data is 'clean', let's start coming up with and answering some questions about it. Some examples: + + - Which tracks or artists are the most popular? (Both globally as well as for each genre) + - Which genres are the most popular? The loudest? + - What are some common combinations of different artists? + - What can we infer anything based on the track's title or artist name? + - How popular is some specific song you like? 
+ - How much does the mode and key affect other attributes? + - Can you classify a song's genre based on its attributes? + + For brevity, we will not explore all of them - feel free to try some of the others yourself, or go more in deep in the explored ones. + Make sure to come up with some questions of your own and explore them as well! + """) return @@ -235,18 +223,16 @@ def _(filter_genre, filtered_duration, mo, pl): ), ], ) - return (most_popular_artists,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - So far so good - but there's been a distinct lack of visualations, so let's fix that. + mo.md(r""" + So far so good - but there's been a distinct lack of visualations, so let's fix that. - Let's start simple, just some metrics for each genre: - """ - ) + Let's start simple, just some metrics for each genre: + """) return @@ -263,22 +249,20 @@ def _(filtered_duration, pl, px): x="popularity", ) fig_dur_per_genre - return (fig_dur_per_genre,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Now, why don't we play a bit with marimo's UI elements? + mo.md(r""" + Now, why don't we play a bit with marimo's UI elements? 
- We will use Dropdowns to allow for the user to select any column to use for the visualisation, and throw in some extras + We will use Dropdowns to allow for the user to select any column to use for the visualisation, and throw in some extras - - A slider for the transparency to help understand dense clusters - - Add a Trendline to the scatterplot (requires statsmodels) - - Filter by some specific Genre - """ - ) + - A slider for the transparency to help understand dense clusters + - Add a Trendline to the scatterplot (requires statsmodels) + - Filter by some specific Genre + """) return @@ -312,18 +296,16 @@ def _( chart2 = mo.ui.plotly(fig2) mo.vstack([mo.hstack([x_axis, y_axis, color, alpha, include_trendline, filter_genre2]), chart2]) - return chart2, fig2 + return (chart2,) @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - As we have seen before, we can also use the plot as an input to select a region and look at it in more detail. + mo.md(r""" + As we have seen before, we can also use the plot as an input to select a region and look at it in more detail. - Try selecting a region then performing some explorations of your own with the data inside of it. - """ - ) + Try selecting a region then performing some explorations of your own with the data inside of it. + """) return @@ -340,47 +322,45 @@ def _(chart2, filtered_duration, mo, pl): pl.col(column_order), pl.exclude(*column_order) ) out - return active_columns, column_order, out + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - In this notebook, we've focused on a few key aspects. First, it's essential to *understand* the data you're working with — this forms the foundation of any analysis. + mo.md(r""" + In this notebook, we've focused on a few key aspects. First, it's essential to *understand* the data you're working with — this forms the foundation of any analysis. - Creating plots is a powerful way to identify patterns, outliers, and trends. 
These visualizations are not just for _presentation_; they are tools for deeper insight. + Creating plots is a powerful way to identify patterns, outliers, and trends. These visualizations are not just for _presentation_; they are tools for deeper insight. - /// NOTE - With marimo's `interactive` UI elements, exploring different _facets_ of the data becomes seamless, allowing for dynamic analysis without altering the code. + /// NOTE + With marimo's `interactive` UI elements, exploring different _facets_ of the data becomes seamless, allowing for dynamic analysis without altering the code. - Keep these points in mind as you continue to work with data. - """ - ) + Keep these points in mind as you continue to work with data. + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""# Utility Functions and UI Elements""") + mo.md(r""" + # Utility Functions and UI Elements + """) return -@app.cell -def get_extremes(): - def get_extremes(selection, col, defaults_if_missing): - "Get the minimum and maximum values for a given column within the selection" - if selection is None or len(selection) == 0: - print( - f"Could not find a selected region. Using default values {defaults_if_missing} instead, try clicking and dragging in the plot to change them." - ) - return defaults_if_missing - else: - return ( - min(row[col] for row in selection), - max(row[col] for row in selection), - ) - return (get_extremes,) +@app.function +def get_extremes(selection, col, defaults_if_missing): + "Get the minimum and maximum values for a given column within the selection" + if selection is None or len(selection) == 0: + print( + f"Could not find a selected region. Using default values {defaults_if_missing} instead, try clicking and dragging in the plot to change them." 
+ ) + return defaults_if_missing + else: + return ( + min(row[col] for row in selection), + max(row[col] for row in selection), + ) @app.cell @@ -426,20 +406,14 @@ def _(filtered_duration, mo): searchable=True, label="Filter by Track Genre:", ) - return ( - alpha, - color, - filter_genre2, - include_trendline, - options, - x_axis, - y_axis, - ) + return alpha, color, filter_genre2, include_trendline, x_axis, y_axis @app.cell(hide_code=True) def _(mo): - mo.md("""# Appendix : Some other examples""") + mo.md(""" + # Appendix : Some other examples + """) return @@ -461,12 +435,7 @@ def _(filtered_duration, mo, pl): # So we just provide freeform text boxes and filter ourselfves later # (the "alternative_" in the name is just to avoid conflicts with the above cell, # despite this being disabled marimo still requires global variables to be unique) - return ( - all_artists, - all_tracks, - alternative_filter_artist, - alternative_filter_track, - ) + return @app.cell @@ -503,7 +472,7 @@ def _(filter_artist, filter_track, filtered_duration, mo, pl): ) mo.vstack([mo.md("Filter a track based on its name or artist"), filter_artist, filter_track, filtered_artist_track]) - return filtered_artist_track, score_match_text + return @app.cell @@ -532,7 +501,7 @@ def _(filter_genre2, filtered_duration, mo, pl): ], align="center", ) - return (artist_combinations,) + return @app.cell diff --git a/polars/06_Dataframe_Transformer.py b/polars/06_Dataframe_Transformer.py index 905a7cdf1064529ad5c0f9bb6a2df27c2b9f9db1..1099809b30053803e6f838633d8d8b1ec3aac7ad 100644 --- a/polars/06_Dataframe_Transformer.py +++ b/polars/06_Dataframe_Transformer.py @@ -12,21 +12,19 @@ import marimo -__generated_with = "0.14.10" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Polars with Marimo's Dataframe Transformer *By [jesshart](https://github.com/jesshart)* The goal of this notebook is to explore Marimo's data explore 
capabilities alonside the power of polars. Feel free to reference the latest about these Marimo features here: https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes - """ - ) + """) return @@ -40,14 +38,12 @@ def _(requests): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Loading Data Let's start by loading our data and getting into the `.lazy()` format so our transformations and queries are speedy. Read more about `.lazy()` here: https://docs.pola.rs/user-guide/lazy/ - """ - ) + """) return @@ -60,21 +56,18 @@ def _(json_data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Above, you will notice that when you reference the object as a standalone, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from. + mo.md(r""" + Above, you will notice that when you reference the object as a standalone, you get out-of-the-box convenience from `marimo`. You have the `Table` and `Query Plan` options to choose from. - 💡 Try out the `Table` view! You can click the `Preview data` button to get a quick view of your data. - 💡 Take a look at the `Query plan`. Learn more about Polar's query plan here: https://docs.pola.rs/user-guide/lazy/query-plan/ - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## marimo's Native Dataframe UI There are a few ways to leverage marimo's native dataframe UI. One is by doing what we saw above—by referencing a `pl.LazyFrame` directly. You can also try, @@ -83,19 +76,16 @@ def _(mo): - Referencing a `pl.DataFrame` and see how it different from its corresponding lazy version - Use `mo.ui.table` - Use `mo.ui.dataframe` - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Reference a `pl.DataFrame` Let's reference the same frame as before, but this time as a `pl.DataFrame` by calling `.collect()` on it. 
- """ - ) + """) return @@ -107,26 +97,22 @@ def _(demand: "pl.LazyFrame"): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Note how much functionality we have right out-of-the-box. Click on column names to see rich features like sorting, freezing, filtering, searching, and more! Notice how `order_quantity` has a green bar chart under it indicating the distribution of values for the field! Don't miss the `Download` feature as well which supports downloading in CSV, json, or parquet format! - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Use `mo.ui.table` The `mo.ui.table` allows you to select rows for use downstream. You can select the rows you want, and then use these as filtered rows downstream. - """ - ) + """) return @@ -144,7 +130,9 @@ def _(demand_table): @app.cell(hide_code=True) def _(mo): - mo.md(r"""I like to use this feature to select groupings based on summary statistics so I can quickly explore subsets of categories. Let me show you what I mean.""") + mo.md(r""" + I like to use this feature to select groupings based on summary statistics so I can quickly explore subsets of categories. Let me show you what I mean. + """) return @@ -175,13 +163,11 @@ def _(summary_table): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Now, instead of manually creating a filter for what I want to take a closer look at, I simply select from the ui and do a simple join to get that aggregated level with more detail. The following cell uses the output of the `mo.ui.table` selection, selects its unique keys, and uses that to join for the selected subset of the original table. 
- """ - ) + """) return @@ -199,13 +185,17 @@ def _(demand: "pl.LazyFrame", pl, summary_table): @app.cell(hide_code=True) def _(mo): - mo.md("""You can learn more about joins in Polars by checking out my other interactive notebook here: https://marimo.io/p/@jesshart/basic-polars-joins""") + mo.md(""" + You can learn more about joins in Polars by checking out my other interactive notebook here: https://marimo.io/p/@jesshart/basic-polars-joins + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Use `mo.ui.dataframe`""") + mo.md(r""" + ## Use `mo.ui.dataframe` + """) return @@ -218,7 +208,9 @@ def _(demand: "pl.LazyFrame", mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Below I simply call the object into view. We will play with it in the following cells.""") + mo.md(r""" + Below I simply call the object into view. We will play with it in the following cells. + """) return @@ -230,7 +222,9 @@ def _(mo_dataframe): @app.cell(hide_code=True) def _(mo): - mo.md(r"""One way to group this data in polars code directly would be to group by product family to get the mean. This is how it is done in polars:""") + mo.md(r""" + One way to group this data in polars code directly would be to group by product family to get the mean. This is how it is done in polars: + """) return @@ -245,16 +239,14 @@ def _(demand_cached, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - f""" + mo.md(f""" ## Try Before You Buy 1. Now try to do the same summary using Marimo's `mo.ui.dataframe` object above. Also, note how your aggregated column is already renamed! Nice touch! 2. Try (1) again but use select statements first (This is actually better polars practice anyway since it reduces the frame as you move to aggregation.) 
*When you are ready, check the `Python Code` tab at the top of the table to compare your output to the answer below.* - """ - ) + """) return @@ -331,29 +323,27 @@ def _(demand_agg: "pl.DataFrame", mo, px): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # About this Notebook Polars and Marimo are both relatively new to the data wrangling space, but their power (and the thrill of their use) cannot be overstated—well, I suppose it could, but you get the meaning. In this notebook, you learn how to leverage basic Polars skills to load-in and explore your data in concert with Marimo's powerful UI elements. ## 📚 Documentation References - - **Marimo: Dataframe Transformation Guide** + - **Marimo: Dataframe Transformation Guide** https://docs.marimo.io/guides/working_with_data/dataframes/?h=dataframe#transforming-dataframes - - **Polars: Lazy API Overview** + - **Polars: Lazy API Overview** https://docs.pola.rs/user-guide/lazy/ - - **Polars: Query Plan Explained** + - **Polars: Query Plan Explained** https://docs.pola.rs/user-guide/lazy/query-plan/ - - **Marimo Notebook: Basic Polars Joins (by jesshart)** + - **Marimo Notebook: Basic Polars Joins (by jesshart)** https://marimo.io/p/@jesshart/basic-polars-joins - - **Marimo Learn: Interactive Graphs with Polars** + - **Marimo Learn: Interactive Graphs with Polars** https://github.com/marimo-team/learn/blob/main/polars/05_reactive_plots.py - """ - ) + """) return diff --git a/polars/07-querying-with-sql.py b/polars/07-querying-with-sql.py index 52f481c85491fbbde8751c3b4febb423f0353bb8..ce3466f6a1bc321b8a3f7ed62f4361b04032c80b 100644 --- a/polars/07-querying-with-sql.py +++ b/polars/07-querying-with-sql.py @@ -35,7 +35,7 @@ def _(mo): @app.cell -def _(mo, reviews, sqlite_engine): +def _(mo, sqlite_engine): _df = mo.sql( f""" SELECT * FROM reviews LIMIT 100 @@ -91,7 +91,7 @@ def _(mo): @app.cell -def _(hotels, mo, sqlite_engine): +def _(mo, sqlite_engine): _df = mo.sql( f""" SELECT * FROM hotels LIMIT 10 @@ 
-112,7 +112,7 @@ def _(mo): @app.cell -def _(mo, reviews, sqlite_engine, users): +def _(mo, sqlite_engine): polars_age_groups = mo.sql( f""" SELECT reviews.*, age_group FROM reviews JOIN users ON reviews.user_id = users.user_id LIMIT 1000 @@ -139,7 +139,7 @@ def _(mo): @app.cell -def _(mo, reviews, sqlite_engine, users): +def _(mo, sqlite_engine): _df = mo.sql( f""" SELECT age_group, AVG(reviews.score_overall) FROM reviews JOIN users ON reviews.user_id = users.user_id GROUP BY age_group @@ -158,7 +158,7 @@ def _(mo): @app.cell -def _(mo, polars_age_groups): +def _(mo): _df = mo.sql( f""" SELECT * FROM polars_age_groups LIMIT 10 @@ -261,7 +261,7 @@ def _(mo): @app.cell -def _(duckdb, hotels): +def _(duckdb): duckdb.sql("SELECT * FROM hotels").pl(lazy=True).sort("cleanliness_base", descending=True).limit(5).collect() return diff --git a/polars/08_working_with_columns.py b/polars/08_working_with_columns.py index 89fa9e06ba63c4913cd03d8e0b44d1a0cb6aec80..915b7080c48ba9ea7a8d36fcbd3f939d0d2e9f18 100644 --- a/polars/08_working_with_columns.py +++ b/polars/08_working_with_columns.py @@ -8,37 +8,33 @@ import marimo -__generated_with = "0.12.0" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Working with Columns + mo.md(r""" + # Working with Columns - Author: [Deb Debnath](https://github.com/debajyotid2) + Author: [Deb Debnath](https://github.com/debajyotid2) - **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/expressions/expression-expansion). - """ - ) + **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/expressions/expression-expansion). + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Expressions + mo.md(r""" + ## Expressions - Data transformations are sometimes complicated, or involve massive computations which are time-consuming. 
You can make a small version of the dataset with the schema you are trying to work your transformation into. But there is a better way to do it in Polars. + Data transformations are sometimes complicated, or involve massive computations which are time-consuming. You can make a small version of the dataset with the schema you are trying to work your transformation into. But there is a better way to do it in Polars. - A Polars expression is a lazy representation of a data transformation. "Lazy" means that the transformation is not eagerly (immediately) executed. + A Polars expression is a lazy representation of a data transformation. "Lazy" means that the transformation is not eagerly (immediately) executed. - Expressions are modular and flexible. They can be composed to build more complex expressions. For example, to calculate speed from distance and time, you can have an expression as: - """ - ) + Expressions are modular and flexible. They can be composed to build more complex expressions. For example, to calculate speed from distance and time, you can have an expression as: + """) return @@ -46,24 +42,24 @@ def _(mo): def _(pl): speed_expr = pl.col("distance") / (pl.col("time")) speed_expr - return (speed_expr,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Expression expansion + mo.md(r""" + ## Expression expansion - Expression expansion lets you write a single expression that can expand to multiple different expressions. So rather than repeatedly defining separate expressions, you can avoid redundancy while adhering to clean code principles (Do not Repeat Yourself - [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)). Since expressions are reusable, they aid in writing concise code. - """ - ) + Expression expansion lets you write a single expression that can expand to multiple different expressions. 
So rather than repeatedly defining separate expressions, you can avoid redundancy while adhering to clean code principles (Do not Repeat Yourself - [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)). Since expressions are reusable, they aid in writing concise code. + """) return @app.cell(hide_code=True) def _(mo): - mo.md("""For the examples in this notebook, we will use a sliver of the *AI4I 2020 Predictive Maintenance Dataset*. This dataset comprises of measurements taken from sensors in industrial machinery undergoing preventive maintenance checks - basically being tested for failure conditions.""") + mo.md(""" + For the examples in this notebook, we will use a sliver of the *AI4I 2020 Predictive Maintenance Dataset*. This dataset comprises of measurements taken from sensors in industrial machinery undergoing preventive maintenance checks - basically being tested for failure conditions. + """) return @@ -80,32 +76,28 @@ def _(StringIO, pl): data = pl.read_csv(StringIO(data_csv)) data - return data, data_csv + return (data,) @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Function `col` + mo.md(r""" + ## Function `col` - The function `col` is used to refer to one column of a dataframe. It is one of the fundamental building blocks of expressions in Polars. `col` is also really handy in expression expansion. - """ - ) + The function `col` is used to refer to one column of a dataframe. It is one of the fundamental building blocks of expressions in Polars. `col` is also really handy in expression expansion. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Explicit expansion by column name + mo.md(r""" + ### Explicit expansion by column name - The simplest form of expression expansion happens when you provide multiple column names to the function `col`. + The simplest form of expression expansion happens when you provide multiple column names to the function `col`. - Say you wish to convert all temperature values in deg. 
Kelvin (K) to deg. Fahrenheit (F). One way to do this would be to define individual expressions for each column as follows: - """ - ) + Say you wish to convert all temperature values in deg. Kelvin (K) to deg. Fahrenheit (F). One way to do this would be to define individual expressions for each column as follows: + """) return @@ -118,12 +110,14 @@ def _(data, pl): result = data.with_columns(exprs) result - return exprs, result + return (result,) @app.cell(hide_code=True) def _(mo): - mo.md(r"""Expression expansion can reduce this verbosity when you list the column names you want the expression to expand to inside the `col` function. The result is the same as before.""") + mo.md(r""" + Expression expansion can reduce this verbosity when you list the column names you want the expression to expand to inside the `col` function. The result is the same as before. + """) return @@ -139,28 +133,28 @@ def _(data, pl, result): ).round(2) ) result_2.equals(result) - return (result_2,) + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""In this case, the expression that does the temperature conversion is expanded to a list of two expressions. The expansion of the expression is predictable and intuitive.""") + mo.md(r""" + In this case, the expression that does the temperature conversion is expanded to a list of two expressions. The expansion of the expression is predictable and intuitive. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Expansion by data type + mo.md(r""" + ### Expansion by data type - Can we do better than explicitly writing the names of every columns we want transformed? Yes. + Can we do better than explicitly writing the names of every columns we want transformed? Yes. - If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided. 
+ If you provide data types instead of column names, the expression is expanded to all columns that match one of the data types provided. - The example below performs the exact same computation as before: - """ - ) + The example below performs the exact same computation as before: + """) return @@ -168,18 +162,16 @@ def _(mo): def _(data, pl, result): result_3 = data.with_columns(((pl.col(pl.Float64) - 273.15) * 1.8 + 32).round(2)) result_3.equals(result) - return (result_3,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - However, you should be careful to ensure that the transformation is only applied to the columns you want. For ensuring this it is important to know the schema of the data beforehand. + mo.md(r""" + However, you should be careful to ensure that the transformation is only applied to the columns you want. For ensuring this it is important to know the schema of the data beforehand. - `col` accepts multiple data types in case the columns you need have more than one data type. - """ - ) + `col` accepts multiple data types in case the columns you need have more than one data type. + """) return @@ -195,18 +187,16 @@ def _(data, pl, result): ).round(2) ) result.equals(result_4) - return (result_4,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Expansion by pattern matching + mo.md(r""" + ### Expansion by pattern matching - `col` also accepts regular expressions for selecting columns by pattern matching. Regular expressions start and end with ^ and $, respectively. - """ - ) + `col` also accepts regular expressions for selecting columns by pattern matching. Regular expressions start and end with ^ and $, respectively. + """) return @@ -218,7 +208,9 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Regular expressions can be combined with exact column names.""") + mo.md(r""" + Regular expressions can be combined with exact column names. 
+ """) return @@ -230,7 +222,9 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""**Note**: You _cannot_ mix strings (exact names, regular expressions) and data types in a `col` function.""") + mo.md(r""" + **Note**: You _cannot_ mix strings (exact names, regular expressions) and data types in a `col` function. + """) return @@ -245,13 +239,11 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Selecting all columns + mo.md(r""" + ## Selecting all columns - To select all columns, you can use the `all` function. - """ - ) + To select all columns, you can use the `all` function. + """) return @@ -259,18 +251,16 @@ def _(mo): def _(data, pl): result_6 = data.select(pl.all()) result_6.equals(data) - return (result_6,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Excluding columns + mo.md(r""" + ## Excluding columns - There are scenarios where we might want to exclude specific columns from the ones selected by building expressions, e.g. by the `col` or `all` functions. For this purpose, we use the function `exclude`, which accepts exactly the same types of arguments as `col`: - """ - ) + There are scenarios where we might want to exclude specific columns from the ones selected by building expressions, e.g. by the `col` or `all` functions. For this purpose, we use the function `exclude`, which accepts exactly the same types of arguments as `col`: + """) return @@ -282,7 +272,9 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""`exclude` can also be used after the function `col`:""") + mo.md(r""" + `exclude` can also be used after the function `col`: + """) return @@ -294,13 +286,11 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Column renaming + mo.md(r""" + ## Column renaming - When applying a transformation with an expression to a column, the data in the column gets overwritten with the transformed data. 
However, this might not be the intended outcome in all situations - ideally you would want to store transformed data in a new column. Applying multiple transformations to the same column at the same time without renaming leads to errors. - """ - ) + When applying a transformation with an expression to a column, the data in the column gets overwritten with the transformed data. However, this might not be the intended outcome in all situations - ideally you would want to store transformed data in a new column. Applying multiple transformations to the same column at the same time without renaming leads to errors. + """) return @@ -315,18 +305,16 @@ def _(data, pl): ) except DuplicateError as err: print("DuplicateError:", err) - return (DuplicateError,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Renaming a single column with `alias` + mo.md(r""" + ### Renaming a single column with `alias` - The function `alias` lets you rename a single column: - """ - ) + The function `alias` lets you rename a single column: + """) return @@ -341,13 +329,11 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Prefixing and suffixing column names + mo.md(r""" + ### Prefixing and suffixing column names - As `alias` renames a single column at a time, it cannot be used during expression expansion. If it is sufficient add a static prefix or a static suffix to the existing names, you can use the functions `name.prefix` and `name.suffix` with `col`: - """ - ) + As `alias` renames a single column at a time, it cannot be used during expression expansion. If it is sufficient add a static prefix or a static suffix to the existing names, you can use the functions `name.prefix` and `name.suffix` with `col`: + """) return @@ -362,13 +348,11 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Dynamic name replacement + mo.md(r""" + ### Dynamic name replacement - If a static prefix/suffix is not enough, use `name.map`. 
`name.map` requires a function that transforms column names to the desired. The transformation should lead to unique names to avoid `DuplicateError`. - """ - ) + If a static prefix/suffix is not enough, use `name.map`. `name.map` requires a function that transforms column names to the desired. The transformation should lead to unique names to avoid `DuplicateError`. + """) return @@ -381,13 +365,11 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Programmatically generating expressions + mo.md(r""" + ## Programmatically generating expressions - For this example, we will first create four additional columns with the rolling mean temperatures of the two temperature columns. Such transformations are sometimes used to create additional features for machine learning models or data analysis. - """ - ) + For this example, we will first create four additional columns with the rolling mean temperatures of the two temperature columns. Such transformations are sometimes used to create additional features for machine learning models or data analysis. + """) return @@ -402,13 +384,17 @@ def _(data, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Now, suppose we want to calculate the difference between the rolling mean and actual temperatures. We cannot use expression expansion here as we want differences between specific columns.""") + mo.md(r""" + Now, suppose we want to calculate the difference between the rolling mean and actual temperatures. We cannot use expression expansion here as we want differences between specific columns. 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""At first, you may think about using a `for` loop:""") + mo.md(r""" + At first, you may think about using a `for` loop: + """) return @@ -421,12 +407,14 @@ def _(ext_temp_data, pl): .round(2).alias(f"Delta {col_name} temperature") ) _result - return (col_name,) + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Using a `for` loop is functional, but not scalable, as each expression needs to be defined in an iteration and executed serially. Instead we can use a generator in Python to programmatically create all expressions at once. In conjunction with the `with_columns` context, we can take advantage of parallel execution of computations and query optimization from Polars.""") + mo.md(r""" + Using a `for` loop is functional, but not scalable, as each expression needs to be defined in an iteration and executed serially. Instead we can use a generator in Python to programmatically create all expressions at once. In conjunction with the `with_columns` context, we can take advantage of parallel execution of computations and query optimization from Polars. + """) return @@ -439,18 +427,16 @@ def _(ext_temp_data, pl): ext_temp_data.with_columns(delta_expressions(["Air", "Process"])) - return (delta_expressions,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## More flexible column selections + mo.md(r""" + ## More flexible column selections - For more flexible column selections, you can use column selectors from `selectors`. Column selectors allow for more expressiveness in the way you specify selections. For example, column selectors can perform the familiar set operations of union, intersection, difference, etc. We can use the union operation with the functions `string` and `ends_with` to select all string columns and the columns whose names end with "`_high`": - """ - ) + For more flexible column selections, you can use column selectors from `selectors`. 
Column selectors allow for more expressiveness in the way you specify selections. For example, column selectors can perform the familiar set operations of union, intersection, difference, etc. We can use the union operation with the functions `string` and `ends_with` to select all string columns and the columns whose names end with "`_high`": + """) return @@ -464,30 +450,30 @@ def _(data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Likewise, you can pick columns based on the category of the type of data, offering more flexibility than the `col` function. As an example, `cs.numeric` selects numeric data types (including `pl.Float32`, `pl.Float64`, `pl.Int32`, etc.) or `cs.temporal` for all dates, times and similar data types.""") + mo.md(r""" + Likewise, you can pick columns based on the category of the type of data, offering more flexibility than the `col` function. As an example, `cs.numeric` selects numeric data types (including `pl.Float32`, `pl.Float64`, `pl.Int32`, etc.) or `cs.temporal` for all dates, times and similar data types. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Combining selectors with set operations + mo.md(r""" + ### Combining selectors with set operations - Multiple selectors can be combined using set operations and the usual Python operators: + Multiple selectors can be combined using set operations and the usual Python operators: - | Operator | Operation | - |:--------:|:--------------------:| - | `A | B` | Union | - | `A & B` | Intersection | - | `A - B` | Difference | - | `A ^ B` | Symmetric difference | - | `~A` | Complement | + | Operator | Operation | + |:--------:|:--------------------:| + | `A | B` | Union | + | `A & B` | Intersection | + | `A - B` | Difference | + | `A ^ B` | Symmetric difference | + | `~A` | Complement | - For example, to select all failure indicator variables excluding the failure variables due to wear, we can perform a set difference between the column selectors. 
- """ - ) + For example, to select all failure indicator variables excluding the failure variables due to wear, we can perform a set difference between the column selectors. + """) return @@ -499,13 +485,11 @@ def _(cs, data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Resolving operator ambiguity + mo.md(r""" + ### Resolving operator ambiguity - Expression functions can be chained on top of selectors: - """ - ) + Expression functions can be chained on top of selectors: + """) return @@ -518,13 +502,11 @@ def _(cs, data, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - However, operators that perform set operations on column selectors operate on both selectors and on expressions. For example, the operator `~` on a selector represents the set operation “complement” and on an expression represents the Boolean operation of negation. + mo.md(r""" + However, operators that perform set operations on column selectors operate on both selectors and on expressions. For example, the operator `~` on a selector represents the set operation “complement” and on an expression represents the Boolean operation of negation. - For instance, if you want to negate the Boolean values in the columns “HDF”, “OSF”, and “RNF”, at first you would think about using the `~` operator with the column selector to choose all failure variables containing "W". Because of the operator ambiguity here, the columns that are not of interest are selected here. - """ - ) + For instance, if you want to negate the Boolean values in the columns “HDF”, “OSF”, and “RNF”, at first you would think about using the `~` operator with the column selector to choose all failure variables containing "W". Because of the operator ambiguity here, the columns that are not of interest are selected here. 
+ """) return @@ -536,7 +518,9 @@ def _(cs, ext_failure_data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""To resolve the operator ambiguity, we use `as_expr`:""") + mo.md(r""" + To resolve the operator ambiguity, we use `as_expr`: + """) return @@ -548,13 +532,11 @@ def _(cs, ext_failure_data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Debugging selectors + mo.md(r""" + ### Debugging selectors - The function `cs.is_selector` helps check whether a complex chain of selectors and operators ultimately results in a selector. For example, to resolve any ambiguity with the selector in the last example, we can do: - """ - ) + The function `cs.is_selector` helps check whether a complex chain of selectors and operators ultimately results in a selector. For example, to resolve any ambiguity with the selector in the last example, we can do: + """) return @@ -566,7 +548,9 @@ def _(cs): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Additionally we can use `expand_selector` to see what columns a selector expands into. Note that for this function we need to provide additional context in the form of the dataframe.""") + mo.md(r""" + Additionally we can use `expand_selector` to see what columns a selector expands into. Note that for this function we need to provide additional context in the form of the dataframe. + """) return @@ -581,14 +565,12 @@ def _(cs, ext_failure_data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### References + mo.md(r""" + ### References - 1. AI4I 2020 Predictive Maintenance Dataset [Dataset]. (2020). UCI Machine Learning Repository. ([link](https://doi.org/10.24432/C5HS5C)). - 2. Polars documentation ([link](https://docs.pola.rs/user-guide/expressions/expression-expansion/#more-flexible-column-selections)) - """ - ) + 1. AI4I 2020 Predictive Maintenance Dataset [Dataset]. (2020). UCI Machine Learning Repository. ([link](https://doi.org/10.24432/C5HS5C)). + 2. 
Polars documentation ([link](https://docs.pola.rs/user-guide/expressions/expression-expansion/#more-flexible-column-selections)) + """) return @@ -598,7 +580,7 @@ def _(): import marimo as mo import polars as pl from io import StringIO - return StringIO, csv, mo, pl + return StringIO, mo, pl if __name__ == "__main__": diff --git a/polars/09_data_types.py b/polars/09_data_types.py index 615e1e7eb073891dd3da7da9c6776d02d7dd2d06..c719c0dbb4752ab0252d434ce5f15ae22d059d65 100644 --- a/polars/09_data_types.py +++ b/polars/09_data_types.py @@ -8,52 +8,46 @@ import marimo -__generated_with = "0.12.0" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Data Types + mo.md(r""" + # Data Types - Author: [Deb Debnath](https://github.com/debajyotid2) + Author: [Deb Debnath](https://github.com/debajyotid2) - **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/). - """ - ) + **Note**: The following tutorial has been adapted from the Polars [documentation](https://docs.pola.rs/user-guide/concepts/data-types-and-structures/). + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Polars supports a variety of data types that fall broadly under the following categories: + mo.md(r""" + Polars supports a variety of data types that fall broadly under the following categories: - - Numeric data types: integers and floating point numbers. - - Nested data types: lists, structs, and arrays. - - Temporal: dates, datetimes, times, and time deltas. - - Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects. + - Numeric data types: integers and floating point numbers. + - Nested data types: lists, structs, and arrays. + - Temporal: dates, datetimes, times, and time deltas. + - Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects. 
- All types support missing values represented by `null` which is different from `NaN` used in floating point data types. The numeric datatypes in Polars loosely follow the type system of the Rust language, since its core functionalities are built in Rust. + All types support missing values represented by `null` which is different from `NaN` used in floating point data types. The numeric datatypes in Polars loosely follow the type system of the Rust language, since its core functionalities are built in Rust. - [Here](https://docs.pola.rs/api/python/stable/reference/datatypes.html) is a full list of all data types Polars supports. - """ - ) + [Here](https://docs.pola.rs/api/python/stable/reference/datatypes.html) is a full list of all data types Polars supports. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Series + mo.md(r""" + ## Series - A series is a 1-dimensional data structure that can hold only one data type. - """ - ) + A series is a 1-dimensional data structure that can hold only one data type. + """) return @@ -61,12 +55,14 @@ def _(mo): def _(pl): s = pl.Series("emojis", ["😀", "🤣", "🥶", "💀", "🤖"]) s - return (s,) + return @app.cell(hide_code=True) def _(mo): - mo.md(r"""Unless specified, Polars infers the datatype from the supplied values.""") + mo.md(r""" + Unless specified, Polars infers the datatype from the supplied values. + """) return @@ -75,20 +71,18 @@ def _(pl): s1 = pl.Series("friends", ["Евгений", "अभिषेक", "秀良", "Federico", "Bob"]) s2 = pl.Series("uints", [0x00, 0x01, 0x10, 0x11], dtype=pl.UInt8) s1.dtype, s2.dtype - return s1, s2 + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Dataframe + mo.md(r""" + ## Dataframe - A dataframe is a 2-dimensional data structure that contains uniquely named series and can hold multiple data types. Dataframes are more commonly used for data manipulation using the functionality of Polars. 
+ A dataframe is a 2-dimensional data structure that contains uniquely named series and can hold multiple data types. Dataframes are more commonly used for data manipulation using the functionality of Polars. - The snippet below shows how to create a dataframe from a dictionary of lists: - """ - ) + The snippet below shows how to create a dataframe from a dictionary of lists: + """) return @@ -108,28 +102,24 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### Inspecting a dataframe + mo.md(r""" + ### Inspecting a dataframe - Polars has various functions to explore the data in a dataframe. We will use the dataframe `data` defined above in our examples. Alongside we can also see a view of the dataframe rendered by `marimo` as the cells are executed. + Polars has various functions to explore the data in a dataframe. We will use the dataframe `data` defined above in our examples. Alongside we can also see a view of the dataframe rendered by `marimo` as the cells are executed. - ///note - We can also use `marimo`'s built in data-inspection elements/features such as [`mo.ui.dataframe`](https://docs.marimo.io/api/inputs/dataframe/#marimo.ui.dataframe) & [`mo.ui.data_explorer`](https://docs.marimo.io/api/inputs/data_explorer/). For more check out our Polars tutorials at [`marimo learn`](https://marimo-team.github.io/learn/)! - """ - ) + ///note + We can also use `marimo`'s built in data-inspection elements/features such as [`mo.ui.dataframe`](https://docs.marimo.io/api/inputs/dataframe/#marimo.ui.dataframe) & [`mo.ui.data_explorer`](https://docs.marimo.io/api/inputs/data_explorer/). For more check out our Polars tutorials at [`marimo learn`](https://marimo-team.github.io/learn/)! + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ - #### Head + mo.md(""" + #### Head - The function `head` shows the first rows of a dataframe. Unless specified, it shows the first 5 rows. - """ - ) + The function `head` shows the first rows of a dataframe. 
Unless specified, it shows the first 5 rows. + """) return @@ -141,13 +131,11 @@ def _(data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - #### Glimpse + mo.md(r""" + #### Glimpse - The function `glimpse` is an alternative to `head` to view the first few columns, but displays each line of the output corresponding to a single column. That way, it makes inspecting wider dataframes easier. - """ - ) + The function `glimpse` is an alternative to `head` to view the first few columns, but displays each line of the output corresponding to a single column. That way, it makes inspecting wider dataframes easier. + """) return @@ -159,13 +147,11 @@ def _(data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - #### Tail + mo.md(r""" + #### Tail - The `tail` function, just like its name suggests, shows the last rows of a dataframe. Unless the number of rows is specified, it will show the last 5 rows. - """ - ) + The `tail` function, just like its name suggests, shows the last rows of a dataframe. Unless the number of rows is specified, it will show the last 5 rows. + """) return @@ -177,13 +163,11 @@ def _(data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - #### Sample + mo.md(r""" + #### Sample - `sample` can be used to show a specified number of randomly selected rows from the dataframe. Unless the number of rows is specified, it will show a single row. `sample` does not preserve order of the rows. - """ - ) + `sample` can be used to show a specified number of randomly selected rows from the dataframe. Unless the number of rows is specified, it will show a single row. `sample` does not preserve order of the rows. + """) return @@ -194,18 +178,16 @@ def _(data): random.seed(42) # For reproducibility. data.sample(3) - return (random,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - #### Describe + mo.md(r""" + #### Describe - The function `describe` describes the summary statistics for all columns of a dataframe. 
- """ - ) + The function `describe` describes the summary statistics for all columns of a dataframe. + """) return @@ -217,13 +199,11 @@ def _(data): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Schema + mo.md(r""" + ## Schema - A schema is a mapping showing the datatype corresponding to every column of a dataframe. The schema of a dataframe can be viewed using the attribute `schema`. - """ - ) + A schema is a mapping showing the datatype corresponding to every column of a dataframe. The schema of a dataframe can be viewed using the attribute `schema`. + """) return @@ -235,7 +215,9 @@ def _(data): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Since a schema is a mapping, it can be specified in the form of a Python dictionary. Then this dictionary can be used to specify the schema of a dataframe on definition. If not specified or the entry is `None`, Polars infers the datatype from the contents of the column. Note that if the schema is not specified, it will be inferred automatically by default.""") + mo.md(r""" + Since a schema is a mapping, it can be specified in the form of a Python dictionary. Then this dictionary can be used to specify the schema of a dataframe on definition. If not specified or the entry is `None`, Polars infers the datatype from the contents of the column. Note that if the schema is not specified, it will be inferred automatically by default. + """) return @@ -255,7 +237,9 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Sometimes the automatically inferred schema is enough for some columns, but we might wish to override the inference of only some columns. We can specify the schema for those columns using `schema_overrides`.""") + mo.md(r""" + Sometimes the automatically inferred schema is enough for some columns, but we might wish to override the inference of only some columns. We can specify the schema for those columns using `schema_overrides`. 
+ """) return @@ -275,13 +259,11 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### References + mo.md(r""" + ### References - 1. Polars documentation ([link](https://docs.pola.rs/api/python/stable/reference/datatypes.html)) - """ - ) + 1. Polars documentation ([link](https://docs.pola.rs/api/python/stable/reference/datatypes.html)) + """) return diff --git a/polars/10_strings.py b/polars/10_strings.py index 585d0dd04855f2b72574c3df4aeffb8662a6fc98..9c5b4d8c28db49ac98b7ecbb724946eb598d10bf 100644 --- a/polars/10_strings.py +++ b/polars/10_strings.py @@ -10,36 +10,32 @@ import marimo -__generated_with = "0.11.17" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Strings + mo.md(r""" + # Strings - _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_. + _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_. - In this chapter we're going to dig into string manipulation. For a fun twist, we'll be mostly playing around with a dataset that every Polars user has bumped into without really thinking about it—the source code of the `polars` module itself. More precisely, we'll use a dataframe that pulls together all the Polars expressions and their docstrings, giving us a cool, hands-on way to explore the expression API in a truly data-driven manner. + In this chapter we're going to dig into string manipulation. For a fun twist, we'll be mostly playing around with a dataset that every Polars user has bumped into without really thinking about it—the source code of the `polars` module itself. More precisely, we'll use a dataframe that pulls together all the Polars expressions and their docstrings, giving us a cool, hands-on way to explore the expression API in a truly data-driven manner. - We'll cover parsing, length calculation, case conversion, and much more, with practical examples and visualizations. 
Finally, we will combine various techniques you learned in prior chapters to build a fully interactive playground in which you can execute the official code examples of Polars expressions. - """ - ) + We'll cover parsing, length calculation, case conversion, and much more, with practical examples and visualizations. Finally, we will combine various techniques you learned in prior chapters to build a fully interactive playground in which you can execute the official code examples of Polars expressions. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🛠️ Parsing & Conversion + mo.md(r""" + ## 🛠️ Parsing & Conversion - Let's warm up with one of the most frequent use cases: parsing raw strings into various formats. - We'll take a tiny dataframe with metadata about Python packages represented as raw JSON strings and we'll use Polars string expressions to parse the attributes into their true data types. - """ - ) + Let's warm up with one of the most frequent use cases: parsing raw strings into various formats. + We'll take a tiny dataframe with metadata about Python packages represented as raw JSON strings and we'll use Polars string expressions to parse the attributes into their true data types. 
+ """) return @@ -58,7 +54,9 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We can use the [`json_decode`](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.json_decode.html) expression to parse the raw JSON strings into Polars-native structs and we can use the [unnest](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html) dataframe operation to have a dedicated column per parsed attribute.""") + mo.md(r""" + We can use the [`json_decode`](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.json_decode.html) expression to parse the raw JSON strings into Polars-native structs and we can use the [unnest](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.unnest.html) dataframe operation to have a dedicated column per parsed attribute. + """) return @@ -71,13 +69,17 @@ def _(pip_metadata_raw_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""This is already a much friendlier representation of the data we started out with, but note that since the JSON entries had only string attributes, all values are strings, even the temporal `released_at` and numerical `size_mb` columns.""") + mo.md(r""" + This is already a much friendlier representation of the data we started out with, but note that since the JSON entries had only string attributes, all values are strings, even the temporal `released_at` and numerical `size_mb` columns. 
+ """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""As we know that the `size_mb` column should have a decimal representation, we go ahead and use [`to_decimal`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_decimal.html#polars.Expr.str.to_decimal) to perform the conversion.""") + mo.md(r""" + As we know that the `size_mb` column should have a decimal representation, we go ahead and use [`to_decimal`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_decimal.html#polars.Expr.str.to_decimal) to perform the conversion. + """) return @@ -93,25 +95,23 @@ def _(pip_metadata_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Moving on to the `released_at` attribute which indicates the exact time when a given Python package got released, we have a bit more options to consider. We can convert to `Date`, `DateTime`, and `Time` types based on the desired temporal granularity. The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion, all we need is to provide the desired format string. + mo.md(r""" + Moving on to the `released_at` attribute which indicates the exact time when a given Python package got released, we have a bit more options to consider. We can convert to `Date`, `DateTime`, and `Time` types based on the desired temporal granularity. 
The [`to_date`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_date.html), [`to_datetime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_datetime.html), and [`to_time`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_time.html) expressions are here to help us with the conversion, all we need is to provide the desired format string. - Since Polars uses Rust under the hood to implement all its expressions, we need to consult the [`chrono::format`](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) reference to come up with appropriate format strings. + Since Polars uses Rust under the hood to implement all its expressions, we need to consult the [`chrono::format`](https://docs.rs/chrono/latest/chrono/format/strftime/index.html) reference to come up with appropriate format strings. - Here's a quick reference: + Here's a quick reference: - | Specifier | Meaning | - |-----------|--------------------| - | `%Y` | Year (e.g., 2025) | - | `%m` | Month (01-12) | - | `%d` | Day (01-31) | - | `%H` | Hour (00-23) | - | `%z` | UTC offset | + | Specifier | Meaning | + |-----------|--------------------| + | `%Y` | Year (e.g., 2025) | + | `%m` | Month (01-12) | + | `%d` | Day (01-31) | + | `%H` | Hour (00-23) | + | `%z` | UTC offset | - The raw strings we are working with look like `"2025-03-02T20:31:12+0000"`. We can match this using the `"%Y-%m-%dT%H:%M:%S%z"` format string. - """ - ) + The raw strings we are working with look like `"2025-03-02T20:31:12+0000"`. We can match this using the `"%Y-%m-%dT%H:%M:%S%z"` format string. 
+ """) return @@ -129,7 +129,9 @@ def _(pip_metadata_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Alternatively, instead of using three different functions to perform the conversion to date, we can use a single one, [`strptime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strptime.html) which takes the desired temporal data type as its first parameter.""") + mo.md(r""" + Alternatively, instead of using three different functions to perform the conversion to date, we can use a single one, [`strptime`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strptime.html) which takes the desired temporal data type as its first parameter. + """) return @@ -147,7 +149,9 @@ def _(pip_metadata_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""And to wrap up this section on parsing and conversion, let's consider a final scenario. What if we don't want to parse the entire raw JSON string, because we only need a subset of its attributes? Well, in this case we can leverage the [`json_path_match`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_path_match.html) expression to extract only the desired attributes using standard [JSONPath](https://goessner.net/articles/JsonPath/) syntax.""") + mo.md(r""" + And to wrap up this section on parsing and conversion, let's consider a final scenario. What if we don't want to parse the entire raw JSON string, because we only need a subset of its attributes? Well, in this case we can leverage the [`json_path_match`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.json_path_match.html) expression to extract only the desired attributes using standard [JSONPath](https://goessner.net/articles/JsonPath/) syntax. 
+ """) return @@ -165,17 +169,15 @@ def _(pip_metadata_raw_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 📊 Dataset Overview + mo.md(r""" + ## 📊 Dataset Overview - Now that we got our hands dirty, let's consider a somewhat wilder dataset for the subsequent sections: a dataframe of metadata about every single expression in your current Polars module. + Now that we got our hands dirty, let's consider a somewhat wilder dataset for the subsequent sections: a dataframe of metadata about every single expression in your current Polars module. - At the risk of stating the obvious, in the previous section, when we typed `pl.col('raw_json').str.json_decode()`, we accessed the `json_decode` member of the `str` expression namespace through the `pl.col('raw_json')` expression *instance*. Under the hood, deep inside the Polars source code, there is a corresponding `def json_decode(...)` method with a carefully authored docstring explaining the purpose and signature of the member. + At the risk of stating the obvious, in the previous section, when we typed `pl.col('raw_json').str.json_decode()`, we accessed the `json_decode` member of the `str` expression namespace through the `pl.col('raw_json')` expression *instance*. Under the hood, deep inside the Polars source code, there is a corresponding `def json_decode(...)` method with a carefully authored docstring explaining the purpose and signature of the member. - Since Python makes module introspection simple, we can easily enumerate all Polars expressions and organize their metadata in `expressions_df`, to be used for all the upcoming string manipulation examples. - """ - ) + Since Python makes module introspection simple, we can easily enumerate all Polars expressions and organize their metadata in `expressions_df`, to be used for all the upcoming string manipulation examples. 
+ """) return @@ -214,12 +216,14 @@ def _(pl): expressions_df = pl.from_dicts(list_expr_meta(), infer_schema_length=None).sort('namespace', 'member') expressions_df - return expressions_df, list_expr_meta, list_members + return (expressions_df,) @app.cell(hide_code=True) def _(mo): - mo.md(r"""As the following visualization shows, `str` is one of the richest Polars expression namespaces with multiple dozens of functions in it.""") + mo.md(r""" + As the following visualization shows, `str` is one of the richest Polars expression namespaces with multiple dozens of functions in it. + """) return @@ -234,17 +238,15 @@ def _(alt, expressions_df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 📏 Length Calculation + mo.md(r""" + ## 📏 Length Calculation - A common use case is to compute the length of a string. Most people associate string length exclusively with the number of characters the said string consists of; however, in certain scenarios it is useful to also know how much memory is required for storing, so how many bytes are required to represent the textual data. + A common use case is to compute the length of a string. Most people associate string length exclusively with the number of characters the said string consists of; however, in certain scenarios it is useful to also know how much memory is required for storing, so how many bytes are required to represent the textual data. - The expressions [`len_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html) and [`len_bytes`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html) are here to help us with these calculations. + The expressions [`len_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html) and [`len_bytes`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html) are here to help us with these calculations. 
- Below, we compute `docstring_len_chars` and `docstring_len_bytes` columns to see how many characters and bytes the documentation of each expression is made up of. - """ - ) + Below, we compute `docstring_len_chars` and `docstring_len_bytes` columns to see how many characters and bytes the documentation of each expression is made up of. + """) return @@ -262,7 +264,9 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""As the dataframe preview above and the scatterplot below show, the docstring length measured in bytes is almost always bigger than the length expressed in characters. This is due to the fact that the docstrings include characters which require more than a single byte to represent, such as "╞" for displaying dataframe header and body separators.""") + mo.md(r""" + As the dataframe preview above and the scatterplot below show, the docstring length measured in bytes is almost always bigger than the length expressed in characters. This is due to the fact that the docstrings include characters which require more than a single byte to represent, such as "╞" for displaying dataframe header and body separators. + """) return @@ -278,13 +282,11 @@ def _(alt, docstring_length_df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔠 Case Conversion + mo.md(r""" + ## 🔠 Case Conversion - Another frequent string transformation is lowercasing, uppercasing, and titlecasing. We can use [`to_lowercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html), [`to_uppercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html) and [`to_titlecase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html) for doing so. - """ - ) + Another frequent string transformation is lowercasing, uppercasing, and titlecasing. 
We can use [`to_lowercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html), [`to_uppercase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html) and [`to_titlecase`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html) for doing so. + """) return @@ -300,15 +302,13 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## ➕ Padding + mo.md(r""" + ## ➕ Padding - Sometimes we need to ensure that strings have a fixed-size character length. [`pad_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_start.html) and [`pad_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_end.html) can be used to fill the "front" or "back" of a string with a supplied character, while [`zfill`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.zfill.html) is a utility for padding the start of a string with `"0"` until it reaches a particular length. In other words, `zfill` is a more specific version of `pad_start`, where the `fill_char` parameter is explicitly set to `"0"`. + Sometimes we need to ensure that strings have a fixed-size character length. [`pad_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_start.html) and [`pad_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.pad_end.html) can be used to fill the "front" or "back" of a string with a supplied character, while [`zfill`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.zfill.html) is a utility for padding the start of a string with `"0"` until it reaches a particular length. In other words, `zfill` is a more specific version of `pad_start`, where the `fill_char` parameter is explicitly set to `"0"`. 
- In the example below we take the unique Polars expression namespaces and pad them so that they have a uniform length which you can control via a slider. - """ - ) + In the example below we take the unique Polars expression namespaces and pad them so that they have a uniform length which you can control via a slider. + """) return @@ -340,15 +340,13 @@ def _(mo, padded_df, padding): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔄 Replacing + mo.md(r""" + ## 🔄 Replacing - Let's say we want to convert from `snake_case` API member names to `kebab-case`, that is, we need to replace the underscore character with a hyphen. For operations like that, we can use [`replace`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html) and [`replace_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html). + Let's say we want to convert from `snake_case` API member names to `kebab-case`, that is, we need to replace the underscore character with a hyphen. For operations like that, we can use [`replace`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html) and [`replace_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html). - As the example below demonstrates, `replace` stops after the first occurrence of the to-be-replaced pattern, while `replace_all` goes all the way through and changes all underscores to hyphens resulting in the `kebab-case` representation we were looking for. - """ - ) + As the example below demonstrates, `replace` stops after the first occurrence of the to-be-replaced pattern, while `replace_all` goes all the way through and changes all underscores to hyphens resulting in the `kebab-case` representation we were looking for. 
+ """) return @@ -364,13 +362,11 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance. + mo.md(r""" + A related expression is [`replace_many`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_many.html), which accepts *many* pairs of to-be-matched patterns and corresponding replacements and uses the [Aho–Corasick algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) to carry out the operation with great performance. - In the example below we replace all instances of `"min"` with `"minimum"` and `"max"` with `"maximum"` using a single expression. - """ - ) + In the example below we replace all instances of `"min"` with `"minimum"` and `"max"` with `"maximum"` using a single expression. + """) return @@ -390,15 +386,13 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔍 Searching & Matching + mo.md(r""" + ## 🔍 Searching & Matching - A common need when working with strings is to determine whether their content satisfies some condition: whether it starts or ends with a particular substring or contains a certain pattern. + A common need when working with strings is to determine whether their content satisfies some condition: whether it starts or ends with a particular substring or contains a certain pattern. - Let's suppose we want to determine whether a member of the Polars expression API is a "converter", such as `to_decimal`, identified by its `"to_"` prefix. 
We can use [`starts_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html) to perform this check. - """ - ) + Let's suppose we want to determine whether a member of the Polars expression API is a "converter", such as `to_decimal`, identified by its `"to_"` prefix. We can use [`starts_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html) to perform this check. + """) return @@ -414,13 +408,11 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Throughout this course as you have gained familiarity with the expression API you might have noticed that some members end with an underscore such as `or_`, since their "body" is a reserved Python keyword. + mo.md(r""" + Throughout this course as you have gained familiarity with the expression API you might have noticed that some members end with an underscore such as `or_`, since their "body" is a reserved Python keyword. - Let's use [`ends_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html) to find all the members which are named after such keywords. - """ - ) + Let's use [`ends_with`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html) to find all the members which are named after such keywords. + """) return @@ -436,13 +428,11 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members. + mo.md(r""" + Now let's move on to analyzing the docstrings in a bit more detail. Based on their content we can determine whether a member is deprecated, accepts parameters, comes with examples, or references external URL(s) & related members. 
- As demonstrated below, we can compute all these boolean attributes using [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) to check whether the docstring includes a particular substring. - """ - ) + As demonstrated below, we can compute all these boolean attributes using [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) to check whether the docstring includes a particular substring. + """) return @@ -462,7 +452,9 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""For scenarios where we want to combine multiple substrings to check for, we can use the [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) expression to check for the presence of various patterns.""") + mo.md(r""" + For scenarios where we want to combine multiple substrings to check for, we can use the [`contains`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html) expression to check for the presence of various patterns. + """) return @@ -478,21 +470,19 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments are going on within each of these examples, right? That's not as simple as checking for a pre-defined literal string containment though, because variables can have arbitrary names - any valid Python identifier is allowed. While the `contains` function supports checking for regular expressions instead of literal strings too, it would not suffice for this exercise because it only tells us whether there is at least a single occurrence of the sought pattern rather than telling us the exact number of matches. 
+ mo.md(r""" + From the above analysis we could see that almost all the members come with code examples. It would be interesting to know how many variable assignments are going on within each of these examples, right? That's not as simple as checking for a pre-defined literal string containment though, because variables can have arbitrary names - any valid Python identifier is allowed. While the `contains` function supports checking for regular expressions instead of literal strings too, it would not suffice for this exercise because it only tells us whether there is at least a single occurrence of the sought pattern rather than telling us the exact number of matches. - Fortunately, we can take advantage of [`count_matches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.count_matches.html) to achieve exactly what we want. We specify the regular expression `r'[a-zA-Z_][a-zA-Z0-9_]* = '` according to the [`regex` Rust crate](https://docs.rs/regex/latest/regex/) to match Python identifiers and we leave the rest to Polars. + Fortunately, we can take advantage of [`count_matches`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.count_matches.html) to achieve exactly what we want. We specify the regular expression `r'[a-zA-Z_][a-zA-Z0-9_]* = '` according to the [`regex` Rust crate](https://docs.rs/regex/latest/regex/) to match Python identifiers and we leave the rest to Polars. - In `count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = ')`: + In `count_matches(r'[a-zA-Z_][a-zA-Z0-9_]* = ')`: - - `[a-zA-Z_]` matches a letter or underscore (start of a Python identifier). - - `[a-zA-Z0-9_]*` matches zero or more letters, digits, or underscores. - - ` = ` matches a space, equals sign, and space (indicating assignment). + - `[a-zA-Z_]` matches a letter or underscore (start of a Python identifier). + - `[a-zA-Z0-9_]*` matches zero or more letters, digits, or underscores. 
+ - ` = ` matches a space, equals sign, and space (indicating assignment). - This finds variable assignments like `x = ` or `df_result = ` in docstrings. - """ - ) + This finds variable assignments like `x = ` or `df_result = ` in docstrings. + """) return @@ -508,7 +498,9 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""A related application example is to *find* the first index where a particular pattern is present, so that it can be used for downstream processing such as slicing. Below we use the [`find`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.find.html) expression to determine the index at which a code example starts in the docstring - identified by the Python shell substring `">>>"`.""") + mo.md(r""" + A related application example is to *find* the first index where a particular pattern is present, so that it can be used for downstream processing such as slicing. Below we use the [`find`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.find.html) expression to determine the index at which a code example starts in the docstring - identified by the Python shell substring `">>>"`. + """) return @@ -524,13 +516,11 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## ✂️ Slicing and Substrings + mo.md(r""" + ## ✂️ Slicing and Substrings - Sometimes we are only interested in a particular substring. We can use [`head`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html), [`tail`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html) and [`slice`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html) to extract a substring from the start, end, or between arbitrary indices. - """ - ) + Sometimes we are only interested in a particular substring. 
We can use [`head`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html), [`tail`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html) and [`slice`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html) to extract a substring from the start, end, or between arbitrary indices. + """) return @@ -564,17 +554,15 @@ def _(mo, slice, sliced_df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## ➗ Splitting + mo.md(r""" + ## ➗ Splitting - Certain strings follow a well-defined structure and we might be only interested in some parts of them. For example, when dealing with `snake_cased_expression` member names we might be curious to get only the first, second, or $n^{\text{th}}$ word before an underscore. We would need to *split* the string at a particular pattern for downstream processing. + Certain strings follow a well-defined structure and we might be only interested in some parts of them. For example, when dealing with `snake_cased_expression` member names we might be curious to get only the first, second, or $n^{\text{th}}$ word before an underscore. We would need to *split* the string at a particular pattern for downstream processing. - The [`split`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html), [`split_exact`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split_exact.html) and [`splitn`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.splitn.html) expressions enable us to achieve this. 
+ The [`split`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html), [`split_exact`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split_exact.html) and [`splitn`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.splitn.html) expressions enable us to achieve this. - The primary difference between these string splitting utilities is that `split` produces a list of variadic length based on the number of resulting segments, `splitn` returns a struct with at least `0` and at most `n` fields while `split_exact` returns a struct of exactly `n` fields. - """ - ) + The primary difference between these string splitting utilities is that `split` produces a list of variadic length based on the number of resulting segments, `splitn` returns a struct with at least `0` and at most `n` fields while `split_exact` returns a struct of exactly `n` fields. + """) return @@ -591,7 +579,9 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""As a more practical example, we can use the `split` expression with some aggregation to count the number of times a particular word occurs in member names across all namespaces. This enables us to create a word cloud of the API members' constituents!""") + mo.md(r""" + As a more practical example, we can use the `split` expression with some aggregation to count the number of times a particular word occurs in member names across all namespaces. This enables us to create a word cloud of the API members' constituents! 
+ """) return @@ -640,20 +630,18 @@ def _(alt, expressions_df, pl, random, wordcloud_height, wordcloud_width): size=alt.Size("len:Q", legend=None), tooltip=["member", "len"], ).configure_view(strokeWidth=0) - return wordcloud, wordcloud_df + return (wordcloud,) @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔗 Concatenation & Joining + mo.md(r""" + ## 🔗 Concatenation & Joining - Often we would like to create longer strings from strings we already have. We might want to create a formatted, sentence-like string or join multiple existing strings in our dataframe into a single one. + Often we would like to create longer strings from strings we already have. We might want to create a formatted, sentence-like string or join multiple existing strings in our dataframe into a single one. - The top-level [`concat_str`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html) expression enables us to combine strings *horizontally* in a dataframe. As the example below shows, we can take the `member` and `namespace` column of each row and construct a `description` column in which each row will correspond to the value ``f"- Expression `{member}` belongs to namespace `{namespace}`"``. - """ - ) + The top-level [`concat_str`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html) expression enables us to combine strings *horizontally* in a dataframe. As the example below shows, we can take the `member` and `namespace` column of each row and construct a `description` column in which each row will correspond to the value ``f"- Expression `{member}` belongs to namespace `{namespace}`"``. 
+ """) return @@ -679,13 +667,11 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line. + mo.md(r""" + Now that we have constructed these bullet points through *horizontal* concatenation of strings, we can perform a *vertical* one so that we end up with a single string in which we have a bullet point on each line. - We will use the [`join`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.join.html) expression to do so. - """ - ) + We will use the [`join`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.join.html) expression to do so. + """) return @@ -708,17 +694,15 @@ def _(descriptions_df, mo, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔍 Pattern-based Extraction + mo.md(r""" + ## 🔍 Pattern-based Extraction - In the vast majority of the cases, when dealing with unstructured text data, all we really want is to extract something structured from it. A common use case is to extract URLs from text to get a better understanding of related content. + In the vast majority of the cases, when dealing with unstructured text data, all we really want is to extract something structured from it. A common use case is to extract URLs from text to get a better understanding of related content. - In the example below that's exactly what we do. We scan the `docstring` of each API member and extract URLs from them using [`extract`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html) and [`extract_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_all.html) using a simple regular expression to match http and https URLs. + In the example below that's exactly what we do. 
We scan the `docstring` of each API member and extract URLs from them using [`extract`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html) and [`extract_all`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_all.html) using a simple regular expression to match http and https URLs. - Note that `extract` stops after a first match and returns a scalar result (or `null` if there was no match) while `extract_all` returns a - potentially empty - list of matches. - """ - ) + Note that `extract` stops after a first match and returns a scalar result (or `null` if there was no match) while `extract_all` returns a - potentially empty - list of matches. + """) return @@ -731,20 +715,18 @@ def _(expressions_df, pl): url_match=pl.col('docstring').str.extract(url_pattern), url_matches=pl.col('docstring').str.extract_all(url_pattern), ).filter(pl.col('url_match').is_not_null()) - return (url_pattern,) + return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way. + mo.md(r""" + Note that in each `docstring` where a code example involving dataframes is present, we will see an output such as "shape: (5, 2)" indicating the number of rows and columns of the dataframe produced by the sample code. Let's say we would like to *capture* this information in a structured way. - [`extract_groups`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_groups.html) is a really powerful expression allowing us to achieve exactly that. 
+ [`extract_groups`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract_groups.html) is a really powerful expression allowing us to achieve exactly that. - Below we define the regular expression `r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"` with two capture groups, named `height` and `width` and pass it as the parameter of `extract_groups`. After execution, for each `docstring`, we end up with fully structured data we can further process downstream! - """ - ) + Below we define the regular expression `r"shape:\s*\((?<height>\S+),\s*(?<width>\S+)\)"` with two capture groups, named `height` and `width` and pass it as the parameter of `extract_groups`. After execution, for each `docstring`, we end up with fully structured data we can further process downstream! + """) return @@ -760,15 +742,13 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🧹 Stripping + mo.md(r""" + ## 🧹 Stripping - Strings might require some cleaning before further processing, such as the removal of some characters from the beginning or end of the text. [`strip_chars_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html), [`strip_chars_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html) and [`strip_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html) are here to facilitate this. + Strings might require some cleaning before further processing, such as the removal of some characters from the beginning or end of the text.
[`strip_chars_start`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html), [`strip_chars_end`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html) and [`strip_chars`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html) are here to facilitate this. - All we need to do is to specify a set of characters we would like to get rid of and Polars handles the rest for us. - """ - ) + All we need to do is to specify a set of characters we would like to get rid of and Polars handles the rest for us. + """) return @@ -785,15 +765,13 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set. + mo.md(r""" + Note that when using the above expressions, the specified characters do not need to form a sequence; they are handled as a set. However, in certain use cases we only want to strip complete substrings, so we would need our input to be strictly treated as a sequence rather than as a set. - That's exactly the rationale behind [`strip_prefix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html) and [`strip_suffix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html). + That's exactly the rationale behind [`strip_prefix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html) and [`strip_suffix`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html). 
- Below we use these to remove the `"to_"` prefixes and `"_with"` suffixes from each member name. - """ - ) + Below we use these to remove the `"to_"` prefixes and `"_with"` suffixes from each member name. + """) return @@ -809,13 +787,11 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔑 Encoding & Decoding + mo.md(r""" + ## 🔑 Encoding & Decoding - Should you find yourself in the need of encoding your strings into [base64](https://en.wikipedia.org/wiki/Base64) or [hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) format, then Polars has your back with its [`encode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.encode.html) expression. - """ - ) + Should you find yourself in the need of encoding your strings into [base64](https://en.wikipedia.org/wiki/Base64) or [hexadecimal](https://en.wikipedia.org/wiki/Hexadecimal) format, then Polars has your back with its [`encode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.encode.html) expression. + """) return @@ -832,7 +808,9 @@ def _(expressions_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""And of course, you can convert back into a human-readable representation using the [`decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.decode.html) expression.""") + mo.md(r""" + And of course, you can convert back into a human-readable representation using the [`decode`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.decode.html) expression. 
+ """) return @@ -847,19 +825,17 @@ def _(encoded_df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🚀 Application: Dynamic Execution of Polars Examples + mo.md(r""" + ## 🚀 Application: Dynamic Execution of Polars Examples - Now that we are familiar with string expressions, we can combine them with other Polars operations to build a fully interactive playground where code examples of Polars expressions can be explored. + Now that we are familiar with string expressions, we can combine them with other Polars operations to build a fully interactive playground where code examples of Polars expressions can be explored. - We make use of string expressions to extract the raw Python source code of examples from the docstrings and we leverage the interactive Marimo environment to enable the selection of expressions via a searchable dropdown and a fully functional code editor whose output is rendered with Marimo's rich display utilities. + We make use of string expressions to extract the raw Python source code of examples from the docstrings and we leverage the interactive Marimo environment to enable the selection of expressions via a searchable dropdown and a fully functional code editor whose output is rendered with Marimo's rich display utilities. - In other words, we will use Polars to execute Polars. ❄️ How cool is that? + In other words, we will use Polars to execute Polars. ❄️ How cool is that? 
- --- - """ - ) + --- + """) return @@ -894,7 +870,7 @@ def _(mo, selected_expression_record): @app.cell(hide_code=True) -def _(example_editor, execute_code): +def _(example_editor): execution_result = execute_code(example_editor.value) return (execution_result,) @@ -943,50 +919,48 @@ def _(expressions_df, pl): return (code_df,) -@app.cell(hide_code=True) -def _(): - def execute_code(code: str): - import ast - - # Create a new local namespace for execution - local_namespace = {} +@app.function(hide_code=True) +def execute_code(code: str): + import ast - # Parse the code into an AST to identify the last expression - parsed_code = ast.parse(code) + # Create a new local namespace for execution + local_namespace = {} - # Check if there's at least one statement - if not parsed_code.body: - return None + # Parse the code into an AST to identify the last expression + parsed_code = ast.parse(code) - # If the last statement is an expression, we'll need to get its value - last_is_expr = isinstance(parsed_code.body[-1], ast.Expr) + # Check if there's at least one statement + if not parsed_code.body: + return None - if last_is_expr: - # Split the code: everything except the last statement, and the last statement - last_expr = ast.Expression(parsed_code.body[-1].value) + # If the last statement is an expression, we'll need to get its value + last_is_expr = isinstance(parsed_code.body[-1], ast.Expr) - # Remove the last statement from the parsed code - parsed_code.body = parsed_code.body[:-1] + if last_is_expr: + # Split the code: everything except the last statement, and the last statement + last_expr = ast.Expression(parsed_code.body[-1].value) - # Execute everything except the last statement - if parsed_code.body: - exec( - compile(parsed_code, "", "exec"), - globals(), - local_namespace, - ) + # Remove the last statement from the parsed code + parsed_code.body = parsed_code.body[:-1] - # Execute the last statement and get its value - result = eval( - compile(last_expr, "", 
"eval"), globals(), local_namespace + # Execute everything except the last statement + if parsed_code.body: + exec( + compile(parsed_code, "", "exec"), + globals(), + local_namespace, ) - return result - else: - # If the last statement is not an expression (e.g., an assignment), - # execute the entire code and return None - exec(code, globals(), local_namespace) - return None - return (execute_code,) + + # Execute the last statement and get its value + result = eval( + compile(last_expr, "", "eval"), globals(), local_namespace + ) + return result + else: + # If the last statement is not an expression (e.g., an assignment), + # execute the entire code and return None + exec(code, globals(), local_namespace) + return None @app.cell(hide_code=True) diff --git a/polars/11_missing_data.py b/polars/11_missing_data.py index c6e2cd3d835d9ed43db8a77165313a89fff3bf03..6d8082bd2568e7f068520f58a063c01e3f016f2c 100644 --- a/polars/11_missing_data.py +++ b/polars/11_missing_data.py @@ -8,14 +8,13 @@ import marimo -__generated_with = "0.15.3" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Dealing with Missing Data _by [etrotta](https://github.com/etrotta) and [Felix Najera](https://github.com/folicks)_ @@ -24,20 +23,17 @@ def _(mo): First we provide an overview of the methods available in polars, then we walk through a mini case study with real world data showing how to use it, and at last we provide some additional information in the 'Bonus Content' section. 
You can navigate to skip around to each header using the menu on the right side - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Methods for working with Nulls We'll be using the following DataFrame to show the most important methods: - """ - ) + """) return @@ -59,13 +55,11 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Counting nulls A simple yet convenient aggregation - """ - ) + """) return @@ -77,13 +71,11 @@ def _(df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Dropping Nulls The simplest way of dealing with null values is throwing them away, but that is not always a good idea. - """ - ) + """) return @@ -101,8 +93,7 @@ def _(df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Filtering null values To filter in polars, you'll typically use `df.filter(expression)` or `df.remove(expression)` methods. @@ -112,8 +103,7 @@ def _(mo): Remove will only remove rows in which the expression evaluates to True. It will keep rows in which it evaluates to None. - """ - ) + """) return @@ -131,13 +121,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" You may also be tempted to use `== None` or `!= None`, but operators in polars will generally propagate null values. You can use `.eq_missing()` or `.ne_missing()` methods if you want to be strict about it, but there are also `.is_null()` and `.is_not_null()` methods you can use. - """ - ) + """) return @@ -156,8 +144,7 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Filling Null values You can also fill in the values with constants, calculations or by consulting external data sources. @@ -165,8 +152,7 @@ def _(mo): Be careful not to treat estimated or guessed values as if they a ground truth however, otherwise you may end up making conclusions about a reality that does not exists. 
As an exercise, let's guess some values to fill in nulls, then try giving names to the animals with `null` by editing the cells - """ - ) + """) return @@ -192,8 +178,7 @@ def _(guesstimates): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### TL;DR Before we head into the mini case study, a brief review of what we have covered: @@ -207,24 +192,21 @@ def _(mo): You can also refer to the polars [User Guide](https://docs.pola.rs/user-guide/expressions/missing-data/) more more information. Whichever approach you take, remember to document how you handled it! - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Mini Case Study - We will be using a dataset from `alertario` about the weather in Rio de Janeiro, originally available in Google Big Query under `datario.clima_pluviometro`. What you need to know about it: + We will be using a dataset from `alertario` about the weather in Rio de Janeiro, originally available in Google Big Query under `datario.clima_pluviometro`. What you need to know about it: - Contains multiple stations covering the Municipality of Rio de Janeiro - Measures the precipitation as millimeters, with a granularity of 15 minutes - We filtered to only include data about 2020, 2021 and 2022 - """ - ) + """) return @@ -257,8 +239,7 @@ def _(pl, px, stations): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Stations First, let's take a look at some of the stations. Notice how @@ -267,8 +248,7 @@ def _(mo): - There are some columns that do not even contain data at all! We will remove the empty columns and remove rows without coordinates - """ - ) + """) return @@ -295,16 +275,14 @@ def _(dirty_stations, mo, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Precipitation Now, let's move on to the Precipitation data. 
## Part 1 - Null Values First of all, let's check for null values: - """ - ) + """) return @@ -328,8 +306,7 @@ def _(dirty_weather, mo, rain): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### First option to fixing it: Dropping data. We could just remove those rows like we did for the stations, which may be a passable solution for some problems, but is not always the best idea. @@ -354,8 +331,7 @@ def _(mo): Let's investigate a bit more before deciding on following with either approach. For example, is our current data even complete, or are we already missing some rows beyond those with null values? - """ - ) + """) return @@ -387,8 +363,7 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Part 2 - Missing Rows We can see that we expected there to be 1096 rows for each hour for each station (from the start of 2020 to the end of 2022) , but in reality we see between 1077 and 1096 rows. @@ -400,8 +375,7 @@ def _(mo): Given that we are working with time series data, we will [upsample](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.upsample.html) the data, but you could also create a DataFrame containing all expected rows then use `join(how="...")` However, that will give us _even more_ null values, so we will want to fill them in afterwards. For this case, we will just use a forward fill followed by a backwards fill. - """ - ) + """) return @@ -435,15 +409,13 @@ def _(dirty_weather, mo, pl, rain): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Now that we finally have a clean dataset, let's play around with it a little. ### Example App Let's display the amount of precipitation each station measured within a timeframe, aggregated to a lower granularity. 
- """ - ) + """) return @@ -534,13 +506,11 @@ def _(animation_data, pl, px): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" If we were missing some rows, we would have circles popping in and out of existence instead of a smooth animation! In many scenarios, missing data can also lead to wrong results overall, for example if we were to estimate the total amount of rainfall during the observed period: - """ - ) + """) return @@ -556,20 +526,17 @@ def _(dirty_weather, mo, rain, weather): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Which is still a relatively small difference, but every drop counts when you are dealing with the weather. For datasets with a higher share of missing values, that difference can get much higher. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Bonus Content ## Appendix A: Missing Time Zones @@ -577,8 +544,7 @@ def _(mo): The original dataset contained naive datetimes instead of timezone-aware, but we can infer whenever it refers to UTC time or local time (for this case, -03:00 UTC) based on the measurements. For example, we can select one specific interval during which we know that rained a lot, or graph the average amount of precipitation for each hour of the day, then compare the data timestamps with a ground truth. - """ - ) + """) return @@ -635,13 +601,11 @@ def _(dirty_weather_naive, pl, rain, stations): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" By externally researching the expected distribution and looking up some of the extreme weather events, we can come to a conclusion about whenever it is aligned with the local time or with UTC. In this case, the distribution matches the normal weather for this region and we can see that the hours with the most precipitation match those of historical events, so it is safe to say it is using local time (equivalent to the Americas/São Paulo time zone). 
- """ - ) + """) return @@ -655,8 +619,7 @@ def _(dirty_weather_naive, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Appendix B: Not a Number While some other tools without proper support for missing values may use `NaN` as a way to indicate a value is missing, in polars it is treated exclusively as a float value, much like `0.0`, `1.0` or `infinity`. @@ -664,8 +627,7 @@ def _(mo): You can use `.fill_null(float('nan'))` if you need to convert floats to a format such tools accept, or use `.fill_nan(None)` if you are importing data from them, assuming that there are no values which really are supposed to be the float NaN. Remember that many calculations can result in NaN, for example dividing by zero: - """ - ) + """) return @@ -696,29 +658,25 @@ def _(day_perc, mo, perc_col): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Appendix C: Everything else As long as this Notebook is, it cannot reasonably cover ***everything*** that may have to deal with missing values, as that is literally everything that may have to deal with data. This section very briefly covers some other features not mentioned above - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Missing values in Aggregations Many aggregations methods will ignore/skip missing values, while others take them into consideration. Always check the documentation of the method you're using, much of the time docstrings will explain their behaviour. - """ - ) + """) return @@ -733,13 +691,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Missing values in Joins By default null values will never produce matches using [join](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html), but you can specify `nulls_equal=True` to join Null values with each other. 
- """ - ) + """) return @@ -772,13 +728,11 @@ def _(age_groups, df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Utilities Loading data and imports - """ - ) + """) return diff --git a/polars/12_aggregations.py b/polars/12_aggregations.py index 1b44da380a27ab65e8124ac8b8486ccb5751899f..fe5385e4aa2a65ab20ecd9adacb5d6d77f53dd88 100644 --- a/polars/12_aggregations.py +++ b/polars/12_aggregations.py @@ -8,7 +8,7 @@ import marimo -__generated_with = "0.12.9" +__generated_with = "0.18.4" app = marimo.App(width="medium") @@ -20,14 +20,12 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # Aggregations - _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._ + mo.md(r""" + # Aggregations + _By [Joram Mutenge](https://www.udemy.com/user/joram-mutenge/)._ - In this notebook, you'll learn how to perform different types of aggregations in Polars, including grouping by categories and time. We'll analyze sales data from a clothing store, focusing on three product categories: hats, socks, and sweaters. - """ - ) + In this notebook, you'll learn how to perform different types of aggregations in Polars, including grouping by categories and time. We'll analyze sales data from a clothing store, focusing on three product categories: hats, socks, and sweaters. + """) return @@ -44,13 +42,11 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Grouping by category - ### With single category - Let's find out how many of each product category we sold. - """ - ) + mo.md(r""" + ## Grouping by category + ### With single category + Let's find out how many of each product category we sold. + """) return @@ -65,13 +61,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - It looks like we sold more sweaters. Maybe this was a winter season. + mo.md(r""" + It looks like we sold more sweaters. Maybe this was a winter season. 
- Let's add another aggregate to see how much was spent on the total units for each product. - """ - ) + Let's add another aggregate to see how much was spent on the total units for each product. + """) return @@ -87,7 +81,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""We could also write aggregate code for the two columns as a single line.""") + mo.md(r""" + We could also write aggregate code for the two columns as a single line. + """) return @@ -102,7 +98,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Actually, the way we've been writing the aggregate lines is syntactic sugar. Here's a longer way of doing it as shown in the [Polars documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html).""") + mo.md(r""" + Actually, the way we've been writing the aggregate lines is syntactic sugar. Here's a longer way of doing it as shown in the [Polars documentation](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html). + """) return @@ -118,12 +116,10 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ### With multiple categories - We can also group by multiple categories. Let's find out how many items we sold in each product category for each SKU. This more detailed aggregation will produce more rows than the previous DataFrame. - """ - ) + mo.md(r""" + ### With multiple categories + We can also group by multiple categories. Let's find out how many items we sold in each product category for each SKU. This more detailed aggregation will produce more rows than the previous DataFrame. + """) return @@ -138,13 +134,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Aggregations when grouping data are not limited to sums. 
You can also use functions like [`max`, `min`, `median`, `first`, and `last`](https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations). + mo.md(r""" + Aggregations when grouping data are not limited to sums. You can also use functions like [`max`, `min`, `median`, `first`, and `last`](https://docs.pola.rs/user-guide/expressions/aggregation/#basic-aggregations). - Let's find the largest sale quantity for each product category. - """ - ) + Let's find the largest sale quantity for each product category. + """) return @@ -159,13 +153,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Let's make the aggregation more interesting. We'll identify the first customer to purchase each item, along with the quantity they bought and the amount they spent. + mo.md(r""" + Let's make the aggregation more interesting. We'll identify the first customer to purchase each item, along with the quantity they bought and the amount they spent. - **Note:** To make this work, we'll have to sort the date from earliest to latest. - """ - ) + **Note:** To make this work, we'll have to sort the date from earliest to latest. + """) return @@ -181,14 +173,12 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Grouping by time - Since `datetime` is a special data type in Polars, we can perform various group-by aggregations on it. + mo.md(r""" + ## Grouping by time + Since `datetime` is a special data type in Polars, we can perform various group-by aggregations on it. - Our dataset spans a two-year period. Let's calculate the total dollar sales for each year. We'll do it the naive way first so you can appreciate grouping with time. - """ - ) + Our dataset spans a two-year period. Let's calculate the total dollar sales for each year. We'll do it the naive way first so you can appreciate grouping with time. 
+ """) return @@ -204,13 +194,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - We had more sales in 2014. + mo.md(r""" + We had more sales in 2014. - Now let's perform the above operation by grouping with time. This requires sorting the dataframe first. - """ - ) + Now let's perform the above operation by grouping with time. This requires sorting the dataframe first. + """) return @@ -226,13 +214,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - The beauty of grouping with time is that it allows us to resample the data by selecting whatever time interval we want. + mo.md(r""" + The beauty of grouping with time is that it allows us to resample the data by selecting whatever time interval we want. - Let's find out what the quarterly sales were for 2014 - """ - ) + Let's find out what the quarterly sales were for 2014 + """) return @@ -249,13 +235,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Here's an interesting question we can answer that takes advantage of grouping by time. + mo.md(r""" + Here's an interesting question we can answer that takes advantage of grouping by time. - Let's find the hour of the day where we had the most sales in dollars. - """ - ) + Let's find the hour of the day where we had the most sales in dollars. + """) return @@ -272,7 +256,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Just for fun, let's find the median number of items sold in each SKU and the total dollar amount in each SKU every six days.""") + mo.md(r""" + Just for fun, let's find the median number of items sold in each SKU and the total dollar amount in each SKU every six days. + """) return @@ -290,7 +276,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Let's rename the columns to clearly indicate the type of aggregation performed. 
This will help us identify the aggregation method used on a column without needing to check the code.""") + mo.md(r""" + Let's rename the columns to clearly indicate the type of aggregation performed. This will help us identify the aggregation method used on a column without needing to check the code. + """) return @@ -308,15 +296,13 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## Grouping with over + mo.md(r""" + ## Grouping with over - Sometimes, we may want to perform an aggregation but also keep all the columns and rows of the dataframe. + Sometimes, we may want to perform an aggregation but also keep all the columns and rows of the dataframe. - Let's assign a value to indicate the number of times each customer visited and bought something. - """ - ) + Let's assign a value to indicate the number of times each customer visited and bought something. + """) return @@ -330,7 +316,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Finally, let's determine which customers visited the store the most and bought something.""") + mo.md(r""" + Finally, let's determine which customers visited the store the most and bought something. + """) return @@ -347,7 +335,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""There's more you can do with aggregations in Polars such as [sorting with aggregations](https://docs.pola.rs/user-guide/expressions/aggregation/#sorting). We hope that in this notebook, we've armed you with the tools to get started.""") + mo.md(r""" + There's more you can do with aggregations in Polars such as [sorting with aggregations](https://docs.pola.rs/user-guide/expressions/aggregation/#sorting). We hope that in this notebook, we've armed you with the tools to get started. 
+ """) return diff --git a/polars/13_window_functions.py b/polars/13_window_functions.py index b9f69a47810c79a14e1fb7fddbea835b07b887b6..c4f3117d48358e1df6f47111584b5b061d237c41 100644 --- a/polars/13_window_functions.py +++ b/polars/13_window_functions.py @@ -11,14 +11,13 @@ import marimo -__generated_with = "0.13.11" +__generated_with = "0.18.4" app = marimo.App(width="medium", app_title="Window Functions") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" # Window Functions _By [Henry Harbeck](https://github.com/henryharbeck)._ @@ -26,8 +25,7 @@ def _(mo): You'll work with partitions, ordering and Polars' available "mapping strategies". We'll use a dataset with a few days of paid and organic digital revenue data. - """ - ) + """) return @@ -53,8 +51,7 @@ def _(): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## What is a window function? A window function performs a calculation across a set of rows that are related to the current row. @@ -64,32 +61,27 @@ def _(mo): Window functions can be used by specifying the [`over`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html) method on an expression. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Partitions Partitions are the "group by" columns. We will have one "window" of data per unique value in the partition column(s), to which the function will be applied. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Partitioning by a single column Let's get the total revenue per date... - """ - ) + """) return @@ -103,7 +95,9 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""And then see what percentage of the daily total was Paid and what percentage was Organic.""") + mo.md(r""" + And then see what percentage of the daily total was Paid and what percentage was Organic. 
+ """) return @@ -115,12 +109,10 @@ def _(daily_revenue, df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Let's now calculate the maximum revenue, cumulative revenue, rank the revenue and calculate the day-on-day change, all partitioned (split) by channel. - """ - ) + """) return @@ -137,28 +129,24 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Note that aggregation functions such as `sum` and `max` have their value applied back to each row in the partition (group). Non-aggregate functions such as `cum_sum`, `rank` and `diff` can produce different values per row, but still only consider rows within their partition. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Partitioning by multiple columns We can also partition by multiple columns. Let's add a column to see whether it is a weekday (business day), then get the maximum revenue by that and the channel. - """ - ) + """) return @@ -176,15 +164,13 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Partitioning by expressions Polars also lets you partition by expressions without needing to create them as columns first. So, we could re-write the previous window function as... - """ - ) + """) return @@ -200,20 +186,17 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" Window functions fit into Polars' composable [expressions API](https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions), so can be combined with all [aggregation methods](https://docs.pola.rs/api/python/stable/reference/expressions/aggregation.html) and methods that consider more than 1 row (e.g., `cum_sum`, `rank` and `diff` as we just saw). - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Ordering The `order_by` parameter controls how to order the data within the window. 
The function is applied to the data in this @@ -221,21 +204,18 @@ def _(mo): Up until this point, we have been letting Polars do the window function calculations based on the order of the rows in the DataFrame. There can be times where we would like order of the calculation and the order of the output itself to differ. - """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" ### Ordering in a window function Let's say we want the DataFrame ordered by day of week, but we still want cumulative revenue and the first revenue observation, both ordered by date and partitioned by channel... - """ - ) + """) return @@ -261,21 +241,19 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Note about window function ordering compared to SQL It is worth noting that traditionally in SQL, many more functions require an `ORDER BY` within `OVER` than in equivalent functions in Polars. For example, an SQL `RANK()` expression like... - """ - ) + """) return @app.cell -def _(df, mo): +def _(mo): _df = mo.sql( f""" SELECT @@ -293,12 +271,10 @@ def _(df, mo): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ...does not require an `order_by` in Polars as the column and the function are already bound (including with the `descending=True` argument). - """ - ) + """) return @@ -315,13 +291,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Descending order We can also order in descending order by passing `descending=True`... - """ - ) + """) return @@ -348,29 +322,25 @@ def _(df_sorted, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" ## Mapping Strategies Mapping Strategies control how Polars maps the result of the window function back to the original DataFrame Generally (by default) the result of a window function is assigned back to rows within the group. Through Polars' mapping strategies, we will explore other possibilities. 
- """ - ) + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" ### Group to rows "group_to_rows" is the default mapping strategy and assigns the result of the window function back to the rows in the window. - """ - ) + """) return @@ -384,13 +354,11 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - """ + mo.md(""" ### Join The "join" mapping strategy aggregates the resulting values in a list and repeats the list for all rows in the group. - """ - ) + """) return @@ -404,8 +372,7 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Explode The "explode" mapping strategy is similar to "group_to_rows", but is typically faster and does not preserve the order of @@ -413,8 +380,7 @@ def _(mo): It should also only be used in a `select` context and not `with_columns`. The result of "explode" is similar to a `group_by` followed by an `agg` followed by an `explode`. - """ - ) + """) return @@ -431,26 +397,28 @@ def _(df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Note the modified order of the rows in the output, (but data is the same)...""") + mo.md(r""" + Note the modified order of the rows in the output, (but data is the same)... + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""## Other tips and tricks""") + mo.md(r""" + ## Other tips and tricks + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Reusing a window In SQL there is a `WINDOW` keyword, which easily allows the re-use of the same window specification across expressions without needing to repeat it. In Polars, this can be achieved by using `dict` unpacking to pass arguments to `over`. - """ - ) + """) return @@ -472,8 +440,7 @@ def _(df_sorted, pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ### Rolling Windows Much like in SQL, Polars also gives you the ability to do rolling window computations. 
In Polars, the rolling calculation @@ -481,8 +448,7 @@ def _(mo): Let's look at an example of that now by filtering out one day of our data and then calculating both a 3-day and 3-row max revenue split by channel... - """ - ) + """) return @@ -503,27 +469,29 @@ def _(date, df, pl): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Notice the difference in the 2nd last row...""") + mo.md(r""" + Notice the difference in the 2nd last row... + """) return @app.cell(hide_code=True) def _(mo): - mo.md(r"""We hope you enjoyed this notebook, demonstrating window functions in Polars!""") + mo.md(r""" + We hope you enjoyed this notebook, demonstrating window functions in Polars! + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" + mo.md(r""" ## Additional References - [Polars User guide - Window functions](https://docs.pola.rs/user-guide/expressions/window-functions/) - [Polars over method API reference](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.over.html) - [PostgreSQL window function documentation](https://www.postgresql.org/docs/current/tutorial-window.html) - """ - ) + """) return diff --git a/polars/14_user_defined_functions.py b/polars/14_user_defined_functions.py index 6ce5ad8f3d365c008c28b2dc8bff962e769264c4..34e568ce582f86a57ef4fc5a3e85844359bdebdd 100644 --- a/polars/14_user_defined_functions.py +++ b/polars/14_user_defined_functions.py @@ -14,58 +14,52 @@ import marimo -__generated_with = "0.11.17" +__generated_with = "0.18.4" app = marimo.App(width="medium") @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - # User-Defined Functions + mo.md(r""" + # User-Defined Functions - _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_. + _By [Péter Ferenc Gyarmati](http://github.com/peter-gy)_. - Throughout the previous chapters, you've seen how Polars provides a comprehensive set of built-in expressions for flexible data transformation. But what happens when you need something *more*? 
Perhaps your project has unique requirements, or you need to integrate functionality from an external Python library. This is where User-Defined Functions (UDFs) come into play, allowing you to extend Polars with your own custom logic. + Throughout the previous chapters, you've seen how Polars provides a comprehensive set of built-in expressions for flexible data transformation. But what happens when you need something *more*? Perhaps your project has unique requirements, or you need to integrate functionality from an external Python library. This is where User-Defined Functions (UDFs) come into play, allowing you to extend Polars with your own custom logic. - In this chapter, we'll weigh the performance trade-offs of UDFs, pinpoint situations where they're truly beneficial, and explore different ways to effectively incorporate them into your Polars workflows. We'll walk through a complete, practical example. - """ - ) + In this chapter, we'll weigh the performance trade-offs of UDFs, pinpoint situations where they're truly beneficial, and explore different ways to effectively incorporate them into your Polars workflows. We'll walk through a complete, practical example. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## ⚖️ The Cost of UDFs + mo.md(r""" + ## ⚖️ The Cost of UDFs - > Performance vs. Flexibility + > Performance vs. Flexibility - Polars' built-in expressions are highly optimized for speed and parallel processing. User-defined functions (UDFs), however, introduce a significant performance overhead because they rely on standard Python code, which often runs in a single thread and bypasses Polars' logical optimizations. Therefore, always prioritize native Polars operations *whenever possible*. + Polars' built-in expressions are highly optimized for speed and parallel processing. 
User-defined functions (UDFs), however, introduce a significant performance overhead because they rely on standard Python code, which often runs in a single thread and bypasses Polars' logical optimizations. Therefore, always prioritize native Polars operations *whenever possible*. - However, UDFs become inevitable when you need to: + However, UDFs become inevitable when you need to: - - **Integrate external libraries:** Use functionality not directly available in Polars. - - **Implement custom logic:** Handle complex transformations that can't be easily expressed with Polars' built-in functions. + - **Integrate external libraries:** Use functionality not directly available in Polars. + - **Implement custom logic:** Handle complex transformations that can't be easily expressed with Polars' built-in functions. - Let's dive into a real-world project where UDFs were the only way to get the job done, demonstrating a scenario where native Polars expressions simply weren't sufficient. - """ - ) + Let's dive into a real-world project where UDFs were the only way to get the job done, demonstrating a scenario where native Polars expressions simply weren't sufficient. + """) return @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 📊 Project Overview + mo.md(r""" + ## 📊 Project Overview - > Scraping and Analyzing Observable Notebook Statistics + > Scraping and Analyzing Observable Notebook Statistics - If you're into data visualization, you've probably seen [D3.js](https://d3js.org/) and [Observable Plot](https://observablehq.com/plot/). Both have extensive galleries showcasing amazing visualizations. Each gallery item is a standalone [Observable notebook](https://observablehq.com/documentation/notebooks/), with metrics like stars, comments, and forks – indicators of popularity. But getting and analyzing these statistics directly isn't straightforward. We'll need to scrape the web. 
- """ - ) + If you're into data visualization, you've probably seen [D3.js](https://d3js.org/) and [Observable Plot](https://observablehq.com/plot/). Both have extensive galleries showcasing amazing visualizations. Each gallery item is a standalone [Observable notebook](https://observablehq.com/documentation/notebooks/), with metrics like stars, comments, and forks – indicators of popularity. But getting and analyzing these statistics directly isn't straightforward. We'll need to scrape the web. + """) return @@ -90,7 +84,9 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Our goal is to use Polars UDFs to fetch the HTML content of these gallery pages. Then, we'll use the `BeautifulSoup` Python library to parse the HTML and extract the relevant metadata. After some data wrangling with native Polars expressions, we'll have a DataFrame listing each visualization notebook. Then, we'll use another UDF to retrieve the number of likes, forks, and comments for each notebook. Finally, we will create our own high-performance UDF to implement a custom notebook ranking scheme. This will involve multiple steps, showcasing different UDF approaches.""") + mo.md(r""" + Our goal is to use Polars UDFs to fetch the HTML content of these gallery pages. Then, we'll use the `BeautifulSoup` Python library to parse the HTML and extract the relevant metadata. After some data wrangling with native Polars expressions, we'll have a DataFrame listing each visualization notebook. Then, we'll use another UDF to retrieve the number of likes, forks, and comments for each notebook. Finally, we will create our own high-performance UDF to implement a custom notebook ranking scheme. This will involve multiple steps, showcasing different UDF approaches. 
+ """) return @@ -109,7 +105,9 @@ def _(mo): @app.cell(hide_code=True) def _(mo): - mo.md(r"""Our starting point, `url_df`, is a simple DataFrame with a single `url` column containing the URLs of the D3 and Observable Plot gallery notebooks.""") + mo.md(r""" + Our starting point, `url_df`, is a simple DataFrame with a single `url` column containing the URLs of the D3 and Observable Plot gallery notebooks. + """) return @@ -129,19 +127,17 @@ def _(pl): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - ## 🔂 Element-Wise UDFs + mo.md(r""" + ## 🔂 Element-Wise UDFs - > Processing Value by Value + > Processing Value by Value - The most common way to use UDFs is to apply them element-wise. This means our custom function will execute for *each individual row* in a specified column. Our first task is to fetch the HTML content for each URL in `url_df`. + The most common way to use UDFs is to apply them element-wise. This means our custom function will execute for *each individual row* in a specified column. Our first task is to fetch the HTML content for each URL in `url_df`. - We'll define a Python function that takes a `url` (a string) as input, uses the `httpx` library (an HTTP client) to fetch the content, and returns the HTML as a string. We then integrate this function into Polars using the [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) expression. + We'll define a Python function that takes a `url` (a string) as input, uses the `httpx` library (an HTTP client) to fetch the content, and returns the HTML as a string. We then integrate this function into Polars using the [`map_elements`](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.map_elements.html) expression. - You'll notice we have to explicitly specify the `return_dtype`. This is *crucial*. Polars doesn't automatically know what our custom function will return. 
We're responsible for defining the function's logic and, therefore, its output type. By providing the `return_dtype`, we help Polars maintain its internal representation of the DataFrame's schema, enabling query optimization. Think of it as giving Polars a "heads-up" about the data type it should expect. - """ - ) + You'll notice we have to explicitly specify the `return_dtype`. This is *crucial*. Polars doesn't automatically know what our custom function will return. We're responsible for defining the function's logic and, therefore, its output type. By providing the `return_dtype`, we help Polars maintain its internal representation of the DataFrame's schema, enabling query optimization. Think of it as giving Polars a "heads-up" about the data type it should expect. + """) return @@ -159,13 +155,11 @@ def _(httpx, pl, url_df): @app.cell(hide_code=True) def _(mo): - mo.md( - r""" - Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this. + mo.md(r""" + Now, `html_df` holds the HTML for each URL. We need to parse it. Again, a UDF is the way to go. Parsing HTML with native Polars expressions would be a nightmare! Instead, we'll use the [`beautifulsoup4`](https://pypi.org/project/beautifulsoup4/) library, a standard tool for this. - These Observable pages are built with [Next.js](https://nextjs.org/), which helpfully serializes page properties as JSON within the HTML. This simplifies our UDF: we'll extract the raw JSON from the `