Space: uc-ctds/llama-data-model-generator-demo

avantol committed on Jul 21, 2025

Commit 025ac5e

1 Parent(s): 3176ee9

feat(pfb): fix paths in notebook
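
In the notebook diff below, the clone cell drops its `!cd ...` line. A likely reason (an assumption; the commit message does not say) is that each `!` line in IPython runs in its own subshell, so a directory change made there does not carry over to later cells. A minimal sketch of that behavior, using `subprocess` as a stand-in for the notebook's `!` shell escape:

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a notebook line like `!cd some/dir`: the cd happens in a
    # child shell, which exits as soon as the command finishes.
    subprocess.run(f'cd "{tmp}"', shell=True, check=True)
    cwd_after_shell_cd = os.getcwd()

    # Changing directory inside the Python process itself does persist
    # (this is what the `%cd` magic does in IPython).
    os.chdir(tmp)
    chdir_took_effect = os.path.samefile(os.getcwd(), tmp)
    os.chdir(start_dir)  # restore before the temp dir is removed

# The parent process is still where it started after the shell-level cd.
print(cwd_after_shell_cd == start_dir)
print(chdir_took_effect)
```

Importing through the clone directory's name, as the updated cell 6 does, avoids depending on the working directory at all.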

Files changed (2)

__init__.py +0 -0
serialized_file_creation_demo/serialized_file_creation_demo.ipynb +106 -36
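
Most of the +106 lines in the notebook are per-cell `metadata` blocks (the `colab` keys suggest the file was re-saved from Google Colab, though that is an inference). In the diff below, the `source` changes are limited to cells 0, 4 and 6. A rough sketch of how a reviewer might separate source changes from metadata churn; the two small notebook dicts are hypothetical stand-ins modeled on cells 3 and 4, not the full file:

```python
def source_changes(old_nb: dict, new_nb: dict) -> list[str]:
    """Return ids of cells whose `source` differs, ignoring cell metadata."""
    old_sources = {c["id"]: c.get("source", []) for c in old_nb["cells"]}
    return [
        c["id"]
        for c in new_nb["cells"]
        if old_sources.get(c["id"]) != c.get("source", [])
    ]

# Hypothetical two-cell excerpts of the old and new notebook.
old_nb = {"cells": [
    {"id": "3", "metadata": {}, "source": ["We need some helper files."]},
    {"id": "4", "metadata": {}, "source": [
        "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo\n",
        "!cd llama-data-model-generator-demo/serialized_file_creation_demo",
    ]},
]}
new_nb = {"cells": [
    # Only metadata changed here, so this cell should not be reported.
    {"id": "3", "metadata": {"id": "3"}, "source": ["We need some helper files."]},
    {"id": "4", "metadata": {"id": "4"}, "source": [
        "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo llama_data_model_generator_demo\n",
    ]},
]}

changed = source_changes(old_nb, new_nb)
print(changed)
```

With real files, the two dicts would come from `json.load` on the old and new `.ipynb`.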

__init__.py ADDED

File without changes
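
The empty `__init__.py` pairs with the new clone target name in the notebook diff below. Hyphens are not valid in Python identifiers, so a directory named `llama-data-model-generator-demo` cannot appear in an `import` statement, while `llama_data_model_generator_demo` can. A minimal sketch of the difference, using a hypothetical package `demo_pkg` and a hypothetical `utils.hello` helper (neither is from the repository):

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway package: demo_pkg/__init__.py and demo_pkg/utils.py.
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "utils.py").write_text("def hello():\n    return 'hi'\n")

    # Make the parent directory importable, as the notebook's working
    # directory is after the clone.
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        utils = importlib.import_module("demo_pkg.utils")
        greeting = utils.hello()
    finally:
        sys.path.remove(tmp)

# A hyphenated name is rejected by the parser before any file lookup happens.
try:
    compile("from demo-pkg.utils import *", "<example>", "exec")
    hyphen_import_parses = True
except SyntaxError:
    hyphen_import_parses = False

print(greeting, hyphen_import_parses)
```

Python 3 can also import a directory without `__init__.py` as a namespace package, so the added file mainly makes the package explicit.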

serialized_file_creation_demo/serialized_file_creation_demo.ipynb CHANGED

@@ -3,20 +3,23 @@
   {
    "cell_type": "markdown",
    "id": "0",
-   "metadata": {},
    "source": [
     "# Creation of Serialized File From AI Model Output\n",
     "---\n",
-    "This notebook demonstrates how to use the AI-assited data model output (originally just a collection of TSV files) to a serialized file, a [PFB (Portable Format for Bioinformatics)](https://pmc.ncbi.nlm.nih.gov/articles/PMC10035862/) file. \n",
     "\n",
-    "PFB is widely used within NIH-funded initiativies that our center is a part of, as a means for efficient storage and transfer of data between systems.\n",
-    "    "
    ]
   },
   {
    "cell_type": "markdown",
    "id": "1",
-   "metadata": {},
    "source": [
     "### Setup"
    ]
@@ -25,7 +28,13 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "2",
-   "metadata": {},
    "outputs": [],
    "source": [
     "%pip install pandas gen3"
@@ -34,7 +43,9 @@
   {
    "cell_type": "markdown",
    "id": "3",
-   "metadata": {},
    "source": [
     "We need some helper files to demonstrate this, so pull them in from Huggingface."
    ]
@@ -43,17 +54,24 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "4",
-   "metadata": {},
    "outputs": [],
    "source": [
-    "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo\n",
-    "!cd llama-data-model-generator-demo/serialized_file_creation_demo"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "5",
-   "metadata": {},
    "source": [
     "### Imports and Initial Loading"
    ]
@@ -62,10 +80,12 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "6",
-   "metadata": {},
    "outputs": [],
    "source": [
-    "from utils import *\n",
     "import os\n",
     "from pathlib import Path\n",
     "import pandas as pd"
@@ -75,7 +95,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "7",
-   "metadata": {},
    "outputs": [],
    "source": [
     "# read in the minimal Gen3 data model scaffold\n",
@@ -86,7 +108,9 @@
   {
    "cell_type": "markdown",
    "id": "8",
-   "metadata": {},
    "source": [
     "We are demonstrating the ability to use this against an AI-generated model, but not directly inferencing to get the data model. Instead we're using a Sythnetic Data Contribution (a sample of what a data contributor would provide AND the expected simplified data model). We use these to train and test the AI model. For simplicity, we're using the model here."
    ]
@@ -95,7 +119,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "9",
-   "metadata": {},
    "outputs": [],
    "source": [
     "# Find the simplified data model in a Synthetic Data Contribution directory\n",
@@ -111,7 +137,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "10",
-   "metadata": {},
    "outputs": [],
    "source": [
     "sdm = read_schema(schema=sdm_path)"
@@ -120,7 +148,9 @@
   {
    "cell_type": "markdown",
    "id": "11",
-   "metadata": {},
    "source": [
     "### Creation of Serialized File"
    ]
@@ -128,7 +158,9 @@
   {
    "cell_type": "markdown",
    "id": "12",
-   "metadata": {},
    "source": [
     "As of writing, PFB requires a Gen3-style data model, so the next steps are to ensure we can go from the simplified AI model output to a Gen3 data model. Note that in the future we may allow alternative, non-Gen3 models to create such PFBs."
    ]
@@ -137,7 +169,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "13",
-   "metadata": {},
    "outputs": [],
    "source": [
     "## Create a Gen3 data model from the simplified data model\n",
@@ -157,7 +191,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "14",
-   "metadata": {},
    "outputs": [],
    "source": [
     "## Write the Gen3-style data model to a JSON file\n",
@@ -171,7 +207,9 @@
   {
    "cell_type": "markdown",
    "id": "15",
-   "metadata": {},
    "source": [
     "Now we have the data model in proper format, we can serialize it into a PFB."
    ]
@@ -180,7 +218,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "16",
-   "metadata": {},
    "outputs": [],
    "source": [
     "# Convert the Gen3-style data model to PFB format schema\n",
@@ -191,7 +231,9 @@
   {
    "cell_type": "markdown",
    "id": "17",
-   "metadata": {},
    "source": [
     "### PFB Utilities"
    ]
@@ -199,7 +241,9 @@
   {
    "cell_type": "markdown",
    "id": "18",
-   "metadata": {},
    "source": [
     "Now we can demonstrate creation of a PFB when you have content for it (in this case in the form of TSV metadata). The above is a PFB which contains only the data model."
    ]
@@ -208,7 +252,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "19",
-   "metadata": {},
    "outputs": [],
    "source": [
     "# Get a list of TSV files in the sdm_dir\n",
@@ -220,7 +266,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "20",
-   "metadata": {},
    "outputs": [],
    "source": [
     "# calculate tsv file size and md5sum for each tsv_files\n",
@@ -254,7 +302,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "21",
-   "metadata": {},
    "outputs": [],
    "source": [
     "%ls -l $sdm_dir/tsv_metadata"
@@ -264,7 +314,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "22",
-   "metadata": {},
    "outputs": [],
    "source": [
     "tsv_metadata"
@@ -274,7 +326,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "23",
-   "metadata": {},
    "outputs": [],
    "source": [
     "pfb_data = os.path.join(sdm_dir, Path(out_file).stem + \"_data.avro\")\n",
@@ -286,7 +340,9 @@
   {
    "cell_type": "markdown",
    "id": "24",
-   "metadata": {},
    "source": [
     "PFB contains a utility to convert from the serialized format to more readable and workable files, including TSVs. Here we demonstrate that utility:"
    ]
@@ -295,7 +351,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "25",
-   "metadata": {},
    "outputs": [],
    "source": [
     "!gen3 pfb to -i $pfb_data tsv # convert the PFB file to TSV format"
@@ -305,7 +363,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "26",
-   "metadata": {},
    "outputs": [],
    "source": [
     "!gen3 pfb show -i $pfb_data # show the contents of the PFB file"
@@ -315,7 +375,9 @@
    "cell_type": "code",
    "execution_count": null,
    "id": "27",
-   "metadata": {},
    "outputs": [],
    "source": [
     "!gen3 pfb show -i $pfb_data schema | jq # show the schema of the PFB file"
@@ -324,7 +386,9 @@
   {
    "cell_type": "markdown",
    "id": "28",
-   "metadata": {},
    "source": [
     "Now we've gone all the way from a dump of data contribution files, to a simple structured data model, to a serialized PFB, and back to usable files!"
    ]
@@ -332,11 +396,17 @@
   {
    "cell_type": "markdown",
    "id": "29",
-   "metadata": {},
    "source": []
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",

   {
    "cell_type": "markdown",
    "id": "0",
+   "metadata": {
+    "id": "0"
+   },
    "source": [
     "# Creation of Serialized File From AI Model Output\n",
     "---\n",
+    "This notebook demonstrates how to use the AI-assisted data model output (originally just a collection of TSV files) to a serialized file, a [PFB (Portable Format for Bioinformatics)](https://pmc.ncbi.nlm.nih.gov/articles/PMC10035862/) file.\n",
     "\n",
+    "PFB is widely used within NIH-funded initiatives that our center is a part of, as a means for efficient storage and transfer of data between systems."
    ]
   },
   {
    "cell_type": "markdown",
    "id": "1",
+   "metadata": {
+    "id": "1"
+   },
    "source": [
     "### Setup"
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "2",
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "2",
+    "outputId": "93bf3200-e3e2-4607-b7fc-23de90f967e1"
+   },
    "outputs": [],
    "source": [
     "%pip install pandas gen3"
   {
    "cell_type": "markdown",
    "id": "3",
+   "metadata": {
+    "id": "3"
+   },
    "source": [
     "We need some helper files to demonstrate this, so pull them in from Huggingface."
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "4",
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "4",
+    "outputId": "ca90e09b-4d66-4019-ea91-4f9694b246ec"
+   },
    "outputs": [],
    "source": [
+    "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo llama_data_model_generator_demo\n"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "5",
+   "metadata": {
+    "id": "5"
+   },
    "source": [
     "### Imports and Initial Loading"
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "6",
+   "metadata": {
+    "id": "6"
+   },
    "outputs": [],
    "source": [
+    "from llama_data_model_generator_demo.utils import *\n",
     "import os\n",
     "from pathlib import Path\n",
     "import pandas as pd"
    "cell_type": "code",
    "execution_count": null,
    "id": "7",
+   "metadata": {
+    "id": "7"
+   },
    "outputs": [],
    "source": [
     "# read in the minimal Gen3 data model scaffold\n",
   {
    "cell_type": "markdown",
    "id": "8",
+   "metadata": {
+    "id": "8"
+   },
    "source": [
     "We are demonstrating the ability to use this against an AI-generated model, but not directly inferencing to get the data model. Instead we're using a Sythnetic Data Contribution (a sample of what a data contributor would provide AND the expected simplified data model). We use these to train and test the AI model. For simplicity, we're using the model here."
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "9",
+   "metadata": {
+    "id": "9"
+   },
    "outputs": [],
    "source": [
     "# Find the simplified data model in a Synthetic Data Contribution directory\n",
    "cell_type": "code",
    "execution_count": null,
    "id": "10",
+   "metadata": {
+    "id": "10"
+   },
    "outputs": [],
    "source": [
     "sdm = read_schema(schema=sdm_path)"
   {
    "cell_type": "markdown",
    "id": "11",
+   "metadata": {
+    "id": "11"
+   },
    "source": [
     "### Creation of Serialized File"
    ]
   {
    "cell_type": "markdown",
    "id": "12",
+   "metadata": {
+    "id": "12"
+   },
    "source": [
     "As of writing, PFB requires a Gen3-style data model, so the next steps are to ensure we can go from the simplified AI model output to a Gen3 data model. Note that in the future we may allow alternative, non-Gen3 models to create such PFBs."
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "13",
+   "metadata": {
+    "id": "13"
+   },
    "outputs": [],
    "source": [
     "## Create a Gen3 data model from the simplified data model\n",
    "cell_type": "code",
    "execution_count": null,
    "id": "14",
+   "metadata": {
+    "id": "14"
+   },
    "outputs": [],
    "source": [
     "## Write the Gen3-style data model to a JSON file\n",
   {
    "cell_type": "markdown",
    "id": "15",
+   "metadata": {
+    "id": "15"
+   },
    "source": [
     "Now we have the data model in proper format, we can serialize it into a PFB."
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "16",
+   "metadata": {
+    "id": "16"
+   },
    "outputs": [],
    "source": [
     "# Convert the Gen3-style data model to PFB format schema\n",
   {
    "cell_type": "markdown",
    "id": "17",
+   "metadata": {
+    "id": "17"
+   },
    "source": [
     "### PFB Utilities"
    ]
   {
    "cell_type": "markdown",
    "id": "18",
+   "metadata": {
+    "id": "18"
+   },
    "source": [
     "Now we can demonstrate creation of a PFB when you have content for it (in this case in the form of TSV metadata). The above is a PFB which contains only the data model."
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "19",
+   "metadata": {
+    "id": "19"
+   },
    "outputs": [],
    "source": [
     "# Get a list of TSV files in the sdm_dir\n",
    "cell_type": "code",
    "execution_count": null,
    "id": "20",
+   "metadata": {
+    "id": "20"
+   },
    "outputs": [],
    "source": [
     "# calculate tsv file size and md5sum for each tsv_files\n",
    "cell_type": "code",
    "execution_count": null,
    "id": "21",
+   "metadata": {
+    "id": "21"
+   },
    "outputs": [],
    "source": [
     "%ls -l $sdm_dir/tsv_metadata"
    "cell_type": "code",
    "execution_count": null,
    "id": "22",
+   "metadata": {
+    "id": "22"
+   },
    "outputs": [],
    "source": [
     "tsv_metadata"
    "cell_type": "code",
    "execution_count": null,
    "id": "23",
+   "metadata": {
+    "id": "23"
+   },
    "outputs": [],
    "source": [
     "pfb_data = os.path.join(sdm_dir, Path(out_file).stem + \"_data.avro\")\n",
   {
    "cell_type": "markdown",
    "id": "24",
+   "metadata": {
+    "id": "24"
+   },
    "source": [
     "PFB contains a utility to convert from the serialized format to more readable and workable files, including TSVs. Here we demonstrate that utility:"
    ]
    "cell_type": "code",
    "execution_count": null,
    "id": "25",
+   "metadata": {
+    "id": "25"
+   },
    "outputs": [],
    "source": [
     "!gen3 pfb to -i $pfb_data tsv # convert the PFB file to TSV format"
    "cell_type": "code",
    "execution_count": null,
    "id": "26",
+   "metadata": {
+    "id": "26"
+   },
    "outputs": [],
    "source": [
     "!gen3 pfb show -i $pfb_data # show the contents of the PFB file"
    "cell_type": "code",
    "execution_count": null,
    "id": "27",
+   "metadata": {
+    "id": "27"
+   },
    "outputs": [],
    "source": [
     "!gen3 pfb show -i $pfb_data schema | jq # show the schema of the PFB file"
   {
    "cell_type": "markdown",
    "id": "28",
+   "metadata": {
+    "id": "28"
+   },
    "source": [
     "Now we've gone all the way from a dump of data contribution files, to a simple structured data model, to a serialized PFB, and back to usable files!"
    ]
   {
    "cell_type": "markdown",
    "id": "29",
+   "metadata": {
+    "id": "29"
+   },
    "source": []
   }
  ],
  "metadata": {
+  "colab": {
+   "provenance": [],
+   "toc_visible": true
+  },
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",