Spaces:
Sleeping
Sleeping
Upload 2 files
Browse files- file/streamlit_CSV.ipynb +125 -0
- file/streamlit_JSON.ipynb +134 -0
file/streamlit_CSV.ipynb
ADDED
|
@@ -0,0 +1,125 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"nbformat": 4,
|
| 3 |
+
"nbformat_minor": 0,
|
| 4 |
+
"metadata": {
|
| 5 |
+
"colab": {
|
| 6 |
+
"provenance": []
|
| 7 |
+
},
|
| 8 |
+
"kernelspec": {
|
| 9 |
+
"name": "python3",
|
| 10 |
+
"display_name": "Python 3"
|
| 11 |
+
},
|
| 12 |
+
"language_info": {
|
| 13 |
+
"name": "python"
|
| 14 |
+
}
|
| 15 |
+
},
|
| 16 |
+
"cells": [
|
| 17 |
+
{
|
| 18 |
+
"cell_type": "markdown",
|
| 19 |
+
"source": [
|
| 20 |
+
"## Reading a CSV File"
|
| 21 |
+
],
|
| 22 |
+
"metadata": {
|
| 23 |
+
"id": "QqqobpZnHA4L"
|
| 24 |
+
}
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"cell_type": "code",
|
| 28 |
+
"source": [
|
| 29 |
+
"import pandas as pd\n",
|
| 30 |
+
"\n",
|
| 31 |
+
"pd.read_csv(\"sample.csv\")"
|
| 32 |
+
],
|
| 33 |
+
"metadata": {
|
| 34 |
+
"id": "--u1kQ0zHQ17"
|
| 35 |
+
},
|
| 36 |
+
"execution_count": null,
|
| 37 |
+
"outputs": []
|
| 38 |
+
},
|
| 39 |
+
{
|
| 40 |
+
"cell_type": "markdown",
|
| 41 |
+
"source": [
|
| 42 |
+
"## **1. ParserError:**\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"- This error occurs when a row has more columns than expected.\n",
|
| 45 |
+
"- This error is mostly caused when the CSV is created in a text editor"
|
| 46 |
+
],
|
| 47 |
+
"metadata": {
|
| 48 |
+
"id": "y-cw10rIHREJ"
|
| 49 |
+
}
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"cell_type": "code",
|
| 53 |
+
"source": [
|
| 54 |
+
"import pandas as pd\n",
|
| 55 |
+
"\n",
|
| 56 |
+
"pd.read_csv('sample.csv',on_bad_lines='warn')"
|
| 57 |
+
],
|
| 58 |
+
"metadata": {
|
| 59 |
+
"id": "kkN-ZyrbHRQF"
|
| 60 |
+
},
|
| 61 |
+
"execution_count": null,
|
| 62 |
+
"outputs": []
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"cell_type": "markdown",
|
| 66 |
+
"source": [
|
| 67 |
+
"## **2. Encoding:**\n",
|
| 68 |
+
"\n",
|
| 69 |
+
"- Encoding is the process of translating characters, numbers, symbols, etc. into binary numbers.\n",
|
| 70 |
+
"- If the proper encoding is not used while reading a CSV, the characters will be decoded incorrectly, which will cause loss of information.\n",
|
| 71 |
+
"- Most CSV files are encoded in `UTF-8`, but not all"
|
| 72 |
+
],
|
| 73 |
+
"metadata": {
|
| 74 |
+
"id": "7KBX5a6oHRbu"
|
| 75 |
+
}
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"cell_type": "code",
|
| 79 |
+
"source": [
|
| 80 |
+
"import pandas as pd\n",
|
| 81 |
+
"import encodings\n",
|
| 82 |
+
"l=encodings.aliases.aliases.keys() # list of all encodings\n",
|
| 83 |
+
"for y in l:\n",
|
| 84 |
+
" try:\n",
|
| 85 |
+
" pd.read_csv('sample.csv',encoding=y)\n",
|
| 86 |
+
" print('{} is a correct encoding'.format(y))\n",
|
| 87 |
+
" except UnicodeDecodeError:\n",
|
| 88 |
+
" print('{} is not a correct encoding'.format(y))\n",
|
| 89 |
+
" except LookupError:\n",
|
| 90 |
+
" print('{} is not supported'.format(y))"
|
| 91 |
+
],
|
| 92 |
+
"metadata": {
|
| 93 |
+
"id": "JO4cJrwFHRme"
|
| 94 |
+
},
|
| 95 |
+
"execution_count": null,
|
| 96 |
+
"outputs": []
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"cell_type": "markdown",
|
| 100 |
+
"source": [
|
| 101 |
+
"## **3. Out of memory:**\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"- If we don't have enough memory to load the dataset, we can divide it into chunks.\n",
|
| 104 |
+
"- Chunks are parts of the data; `chunksize` specifies the number of rows per chunk.\n",
|
| 105 |
+
"- If we have 100_00_000 rows & `chunksize` = 1000, the data will be divided into chunks of 1000 rows each.\n",
|
| 106 |
+
"- Its output will be an iterator (a `TextFileReader`), not a DataFrame"
|
| 107 |
+
],
|
| 108 |
+
"metadata": {
|
| 109 |
+
"id": "ZIVWCn22HRwj"
|
| 110 |
+
}
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "code",
|
| 114 |
+
"source": [
|
| 115 |
+
"import pandas as pd\n",
|
| 116 |
+
"pd.read_csv('spam.csv', encoding='latin', chunksize= 100)"
|
| 117 |
+
],
|
| 118 |
+
"metadata": {
|
| 119 |
+
"id": "hooz_lCRHR5u"
|
| 120 |
+
},
|
| 121 |
+
"execution_count": null,
|
| 122 |
+
"outputs": []
|
| 123 |
+
}
|
| 124 |
+
]
|
| 125 |
+
}
|
file/streamlit_JSON.ipynb
ADDED
|
@@ -0,0 +1,134 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"nbformat": 4,
|
| 3 |
+
"nbformat_minor": 0,
|
| 4 |
+
"metadata": {
|
| 5 |
+
"colab": {
|
| 6 |
+
"provenance": []
|
| 7 |
+
},
|
| 8 |
+
"kernelspec": {
|
| 9 |
+
"name": "python3",
|
| 10 |
+
"display_name": "Python 3"
|
| 11 |
+
},
|
| 12 |
+
"language_info": {
|
| 13 |
+
"name": "python"
|
| 14 |
+
}
|
| 15 |
+
},
|
| 16 |
+
"cells": [
|
| 17 |
+
{
|
| 18 |
+
"cell_type": "markdown",
|
| 19 |
+
"source": [
|
| 20 |
+
"## **Reading JSON File**"
|
| 21 |
+
],
|
| 22 |
+
"metadata": {
|
| 23 |
+
"id": "4L2BwncXK7Uv"
|
| 24 |
+
}
|
| 25 |
+
},
|
| 26 |
+
{
|
| 27 |
+
"cell_type": "code",
|
| 28 |
+
"source": [
|
| 29 |
+
"import pandas as pd\n",
|
| 30 |
+
"\n",
|
| 31 |
+
"pd.read_json(\"sample.json\")"
|
| 32 |
+
],
|
| 33 |
+
"metadata": {
|
| 34 |
+
"id": "sVqhFskxK-r5"
|
| 35 |
+
},
|
| 36 |
+
"execution_count": null,
|
| 37 |
+
"outputs": []
|
| 38 |
+
},
|
| 39 |
+
{
|
| 40 |
+
"cell_type": "markdown",
|
| 41 |
+
"source": [
|
| 42 |
+
"## **Handling Structured JSON**\n",
|
| 43 |
+
"\n",
|
| 44 |
+
"- The `orient` parameter in `pd.read_json()` specifies the format of JSON data being read:\n",
|
| 45 |
+
" - **\"split\"**: Dictionary format with keys as \"index\", \"columns\", and \"data\".\n",
|
| 46 |
+
" - **\"records\"**: List of dictionaries where each dictionary represents a row.\n",
|
| 47 |
+
" - **\"index\"**: Dictionary format with row indices as keys and dictionaries of column data as values.\n",
|
| 48 |
+
" - **\"columns\"**: Default format where keys are column names and values are arrays of data"
|
| 49 |
+
],
|
| 50 |
+
"metadata": {
|
| 51 |
+
"id": "clN_z7L8K-zD"
|
| 52 |
+
}
|
| 53 |
+
},
|
| 54 |
+
{
|
| 55 |
+
"cell_type": "code",
|
| 56 |
+
"source": [
|
| 57 |
+
"import pandas as pd\n",
|
| 58 |
+
"\n",
|
| 59 |
+
"# Sample Structured JSON\n",
|
| 60 |
+
"structured_json = {\n",
|
| 61 |
+
" \"name\": [\"John\", \"Doe\", \"Jane\"],\n",
|
| 62 |
+
" \"age\": [30, 25, 28],\n",
|
| 63 |
+
" \"city\": [\"New York\", \"Los Angeles\", \"Chicago\"]\n",
|
| 64 |
+
"}\n",
|
| 65 |
+
"\n",
|
| 66 |
+
"# Reading JSON with different 'orient' values\n",
|
| 67 |
+
"df_default = pd.read_json('structured.json') # Default (columns)\n",
|
| 68 |
+
"df_split = pd.read_json('structured.json', orient='split')\n",
|
| 69 |
+
"df_index = pd.read_json('structured.json', orient='index')\n",
|
| 70 |
+
"\n",
|
| 71 |
+
"print(df_default)\n",
|
| 72 |
+
"print(df_split)\n",
|
| 73 |
+
"print(df_index)"
|
| 74 |
+
],
|
| 75 |
+
"metadata": {
|
| 76 |
+
"id": "yFW5vsHkK-6Q"
|
| 77 |
+
},
|
| 78 |
+
"execution_count": null,
|
| 79 |
+
"outputs": []
|
| 80 |
+
},
|
| 81 |
+
{
|
| 82 |
+
"cell_type": "markdown",
|
| 83 |
+
"source": [
|
| 84 |
+
"## **Handling Semi-Structured JSON**\n",
|
| 85 |
+
"\n",
|
| 86 |
+
"- `pandas.json_normalize()` is used to flatten nested JSON objects into a DataFrame.\n",
|
| 87 |
+
" - **`record_path`**: Specifies the path in the JSON to extract records from nested lists.\n",
|
| 88 |
+
" - **`meta`**: Includes additional metadata fields from parent records.\n",
|
| 89 |
+
" - **`max_level`**: Limits the number of levels to flatten."
|
| 90 |
+
],
|
| 91 |
+
"metadata": {
|
| 92 |
+
"id": "-xtBajp5K_Ge"
|
| 93 |
+
}
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"cell_type": "code",
|
| 97 |
+
"source": [
|
| 98 |
+
"import pandas as pd\n",
|
| 99 |
+
"import json\n",
|
| 100 |
+
"\n",
|
| 101 |
+
"# Sample Semi-Structured JSON\n",
|
| 102 |
+
"semi_structured_json = [\n",
|
| 103 |
+
" {\n",
|
| 104 |
+
" \"name\": \"John\",\n",
|
| 105 |
+
" \"age\": 30,\n",
|
| 106 |
+
" \"address\": {\"city\": \"New York\", \"zip\": \"10001\"},\n",
|
| 107 |
+
" \"skills\": [\"Python\", \"SQL\"]\n",
|
| 108 |
+
" },\n",
|
| 109 |
+
" {\n",
|
| 110 |
+
" \"name\": \"Jane\",\n",
|
| 111 |
+
" \"age\": 28,\n",
|
| 112 |
+
" \"address\": {\"city\": \"Chicago\", \"zip\": \"60601\"},\n",
|
| 113 |
+
" \"skills\": [\"Java\", \"C++\"]\n",
|
| 114 |
+
" }\n",
|
| 115 |
+
"]\n",
|
| 116 |
+
"\n",
|
| 117 |
+
"# Flattening nested JSON\n",
|
| 118 |
+
"df = pd.json_normalize(\n",
|
| 119 |
+
" semi_structured_json,\n",
|
| 120 |
+
" record_path=['skills'],\n",
|
| 121 |
+
" meta=['name', 'age', ['address', 'city'], ['address', 'zip']],\n",
|
| 122 |
+
" max_level=1\n",
|
| 123 |
+
")\n",
|
| 124 |
+
"\n",
|
| 125 |
+
"print(df)"
|
| 126 |
+
],
|
| 127 |
+
"metadata": {
|
| 128 |
+
"id": "7tfcZuiRK_Ln"
|
| 129 |
+
},
|
| 130 |
+
"execution_count": null,
|
| 131 |
+
"outputs": []
|
| 132 |
+
}
|
| 133 |
+
]
|
| 134 |
+
}
|