avantol commited on
Commit
0a4b5f6
·
1 Parent(s): 5c660a7

feat(notebook): initial attempt to add notebook

Browse files
Files changed (2) hide show
  1. ai_assisted_data_curation.ipynb +338 -0
  2. app.py +4 -4
ai_assisted_data_curation.ipynb ADDED
@@ -0,0 +1,338 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "0",
6
+ "metadata": {},
7
+ "source": [
8
+ "# AI-Assisted Data Curation Toolkit"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "1",
14
+ "metadata": {},
15
+ "source": [
16
+ "This notebook demonstrates the AI-Assisted Data Curation Toolkit. It is capable of suggesting harmonizations from a source data model into a target data model using AI-backed approaches, but leaving the expert curator in complete control."
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "code",
21
+ "execution_count": null,
22
+ "id": "e9f03bd7",
23
+ "metadata": {},
24
+ "outputs": [],
25
+ "source": [
26
+ "%pip install ai_harmonization"
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "markdown",
31
+ "id": "2",
32
+ "metadata": {},
33
+ "source": [
34
+ "## Setup"
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "code",
39
+ "execution_count": null,
40
+ "id": "4",
41
+ "metadata": {},
42
+ "outputs": [],
43
+ "source": [
44
+ "import os\n",
45
+ "import json\n",
46
+ "\n",
47
+ "from ai_harmonization.interactive import (\n",
48
+ " get_interactive_table_for_suggestions,\n",
49
+ " get_nodes_and_properties_df,\n",
50
+ ")\n",
51
+ "from ai_harmonization.simple_data_model import (\n",
52
+ " SimpleDataModel,\n",
53
+ " get_data_model_as_node_prop_type_descriptions,\n",
54
+ ")\n",
55
+ "from ai_harmonization.harmonization_approaches.similarity_inmem import (\n",
56
+ " SimilaritySearchInMemoryVectorDb,\n",
57
+ ")\n",
58
+ "from ai_harmonization.harmonization_approaches.embeddings import BGEEmbeddings"
59
+ ]
60
+ },
61
+ {
62
+ "cell_type": "markdown",
63
+ "id": "5",
64
+ "metadata": {},
65
+ "source": [
66
+ "Set available GPUs (skip this step if using CPUs)"
67
+ ]
68
+ },
69
+ {
70
+ "cell_type": "code",
71
+ "execution_count": null,
72
+ "id": "6",
73
+ "metadata": {},
74
+ "outputs": [],
75
+ "source": [
76
+ "# os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1,2,3\" # change as necessary"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "id": "7",
82
+ "metadata": {},
83
+ "source": [
84
+ "## Use a Harmonization Approach to get Suggestions"
85
+ ]
86
+ },
87
+ {
88
+ "cell_type": "markdown",
89
+ "id": "8",
90
+ "metadata": {},
91
+ "source": [
92
+ "### Get Input Data\n",
93
+ "\n",
94
+ "- A `source data model` you want to harmonize from\n",
95
+ "- A `target data model` you want to harmonize to\n",
96
+ "\n",
97
+ "For this initial example, you can just use hard-coded examples.\n",
98
+ "\n",
99
+ "- The `example_synthetic_source_model.json` is a synthetically generated model for example purposes\n",
100
+ "- The `example_real_source_model.json` is a real original study before ingestion into the NHLBI BioData Catalyst ecosystem (e.g. not yet harmonized)\n",
101
+ "- The `target data model` example is the **NHLBI BioData Catalyst Gen3 Data Dictionary v4.6.5** (latest version as of 21 AUG 2025)\n",
102
+ "\n",
103
+ "You can change this to supply your own source model, so long as the format follows the example. Similarly for target model. The source model will eventually come from a connection to a previously released AI-backed tool for Schema Generation, allowing this entire flow to start from arbitrary TSVs."
104
+ ]
105
+ },
106
+ {
107
+ "cell_type": "code",
108
+ "execution_count": null,
109
+ "id": "9",
110
+ "metadata": {},
111
+ "outputs": [],
112
+ "source": [
113
+ "# source_file = \"./examples/example_synthetic_source_model.json\"\n",
114
+ "source_file = \"./examples/example_real_source_model.json\"\n",
115
+ "\n",
116
+ "target_file = \"./examples/example_target_model_BDC.json\"\n",
117
+ "\n",
118
+ "with open(source_file, \"r\") as f:\n",
119
+ " input_source_model = json.load(f)\n",
120
+ "\n",
121
+ "input_source_model = SimpleDataModel.get_from_unknown_json_format(\n",
122
+ " json.dumps(input_source_model)\n",
123
+ ")\n",
124
+ "\n",
125
+ "with open(target_file, \"r\") as f:\n",
126
+ " input_target_model = json.load(f)\n",
127
+ "\n",
128
+ "input_target_model = SimpleDataModel.get_from_unknown_json_format(\n",
129
+ " json.dumps(input_target_model)\n",
130
+ ")"
131
+ ]
132
+ },
133
+ {
134
+ "cell_type": "code",
135
+ "execution_count": null,
136
+ "id": "10",
137
+ "metadata": {},
138
+ "outputs": [],
139
+ "source": [
140
+ "print(\"Source Model\")\n",
141
+ "input_source_model.get_property_df()"
142
+ ]
143
+ },
144
+ {
145
+ "cell_type": "code",
146
+ "execution_count": null,
147
+ "id": "11",
148
+ "metadata": {},
149
+ "outputs": [],
150
+ "source": [
151
+ "print(\"Target Model\")\n",
152
+ "input_target_model.get_property_df()"
153
+ ]
154
+ },
155
+ {
156
+ "cell_type": "markdown",
157
+ "id": "12",
158
+ "metadata": {},
159
+ "source": [
160
+ "### Use a Specific Harmonization Approach to get Suggestions"
161
+ ]
162
+ },
163
+ {
164
+ "cell_type": "code",
165
+ "execution_count": null,
166
+ "id": "13",
167
+ "metadata": {},
168
+ "outputs": [],
169
+ "source": [
170
+ "embedding_fn = BGEEmbeddings(model_name=\"BAAI/bge-large-en-v1.5\")\n",
171
+ "batch_size = 32\n",
172
+ "\n",
173
+ "harmonization_approach = SimilaritySearchInMemoryVectorDb(\n",
174
+ " # A unique name for this file and embedding algorithm within the limits of the length required by the in-memory vectorstore\n",
175
+ " vectordb_persist_directory_name=f\"{os.path.basename(target_file)[:53]}-{embedding_fn.model.name_or_path.split(\"/\")[-1][:5]}-0\",\n",
176
+ " input_target_model=input_target_model,\n",
177
+ " embedding_function=embedding_fn,\n",
178
+ " batch_size=batch_size,\n",
179
+ ")\n",
180
+ "\n",
181
+ "max_suggestions_per_property = 10\n",
182
+ "# set threshold low to just get top properties no matter what\n",
183
+ "score_threshold = 0\n",
184
+ "\n",
185
+ "suggestions = harmonization_approach.get_harmonization_suggestions(\n",
186
+ " input_source_model=input_source_model,\n",
187
+ " input_target_model=input_target_model,\n",
188
+ " score_threshold=score_threshold,\n",
189
+ " k=max_suggestions_per_property,\n",
190
+ ")"
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "markdown",
195
+ "id": "14",
196
+ "metadata": {},
197
+ "source": [
198
+ "### Visualize Suggestions"
199
+ ]
200
+ },
201
+ {
202
+ "cell_type": "code",
203
+ "execution_count": null,
204
+ "id": "15",
205
+ "metadata": {},
206
+ "outputs": [],
207
+ "source": [
208
+ "table_df = suggestions.to_simlified_dataframe()\n",
209
+ "table_df.sort_values(by=\"Similarity\", ascending=False, inplace=True)\n",
210
+ "table_df"
211
+ ]
212
+ },
213
+ {
214
+ "cell_type": "code",
215
+ "execution_count": null,
216
+ "id": "16",
217
+ "metadata": {},
218
+ "outputs": [],
219
+ "source": [
220
+ "# Group by 'Original Node.Property' and find the index of max similarity for each group\n",
221
+ "idx = table_df.groupby(\"Original Node.Property\")[\"Similarity\"].idxmax()\n",
222
+ "\n",
223
+ "# Filter DataFrame using the indices found above\n",
224
+ "filtered_df = table_df.loc[idx]\n",
225
+ "filtered_df.drop(columns=[\"Original Description\", \"Target Description\"], inplace=True)\n",
226
+ "filtered_df.sort_values(by=\"Similarity\", ascending=False, inplace=True)\n",
227
+ "filtered_df"
228
+ ]
229
+ },
230
+ {
231
+ "cell_type": "markdown",
232
+ "id": "17",
233
+ "metadata": {},
234
+ "source": [
235
+ "### Create Interactive Table for Selecting Suggestions"
236
+ ]
237
+ },
238
+ {
239
+ "cell_type": "code",
240
+ "execution_count": null,
241
+ "id": "18",
242
+ "metadata": {},
243
+ "outputs": [],
244
+ "source": [
245
+ "table = get_interactive_table_for_suggestions(\n",
246
+ " table_df,\n",
247
+ " column_for_filtering=1,\n",
248
+ " # additional config for the interactive table\n",
249
+ " maxBytes=\"2MB\",\n",
250
+ " pageLength=50,\n",
251
+ ")\n",
252
+ "table"
253
+ ]
254
+ },
255
+ {
256
+ "cell_type": "markdown",
257
+ "id": "19",
258
+ "metadata": {},
259
+ "source": [
260
+ "> **Don't see the table or see an error above?** Try restarting the kernel, then try restarting jupyter lab (if that's what you're using). The installs for AnyWidgets might not be picked up yet.\n",
261
+ "\n",
262
+ "> **Dark Theme?** If you're using a dark theme, you might need to switch to light for the table to display properly. \n",
263
+ "\n",
264
+ "> **Using VS Code Jupyter Extension?** Links might not work"
265
+ ]
266
+ },
267
+ {
268
+ "cell_type": "markdown",
269
+ "id": "20",
270
+ "metadata": {},
271
+ "source": [
272
+ "To use the selections above, record them below in `manual_selection_indexes` or use multi-select in the above table and the below will automatically use those. "
273
+ ]
274
+ },
275
+ {
276
+ "cell_type": "code",
277
+ "execution_count": null,
278
+ "id": "21",
279
+ "metadata": {},
280
+ "outputs": [],
281
+ "source": [
282
+ "# Fill this out manually as you go, or we'll use the table selections\n",
283
+ "manual_selection_indexes = [] # [1, 8, 24, ...]\n",
284
+ "\n",
285
+ "selected_rows = manual_selection_indexes or table.selected_rows\n",
286
+ "\n",
287
+ "print(f\"Selected Suggestions: {selected_rows}\")"
288
+ ]
289
+ },
290
+ {
291
+ "cell_type": "code",
292
+ "execution_count": null,
293
+ "id": "22",
294
+ "metadata": {},
295
+ "outputs": [],
296
+ "source": [
297
+ "table_df.loc[selected_rows]"
298
+ ]
299
+ },
300
+ {
301
+ "cell_type": "code",
302
+ "execution_count": null,
303
+ "id": "23",
304
+ "metadata": {},
305
+ "outputs": [],
306
+ "source": [
307
+ "table_df.loc[selected_rows].to_csv(\n",
308
+ " \"./selected_suggestions.tsv\",\n",
309
+ " index=False,\n",
310
+ " na_rep=\"N/A\",\n",
311
+ " sep=\"\\t\",\n",
312
+ " quotechar='\"',\n",
313
+ ")"
314
+ ]
315
+ }
316
+ ],
317
+ "metadata": {
318
+ "kernelspec": {
319
+ "display_name": "ai-harmonization (3.13.5)",
320
+ "language": "python",
321
+ "name": "python3"
322
+ },
323
+ "language_info": {
324
+ "codemirror_mode": {
325
+ "name": "ipython",
326
+ "version": 3
327
+ },
328
+ "file_extension": ".py",
329
+ "mimetype": "text/x-python",
330
+ "name": "python",
331
+ "nbconvert_exporter": "python",
332
+ "pygments_lexer": "ipython3",
333
+ "version": "3.13.5"
334
+ }
335
+ },
336
+ "nbformat": 4,
337
+ "nbformat_minor": 5
338
+ }
app.py CHANGED
@@ -1,7 +1,7 @@
1
  import gradio as gr
2
 
3
- def greet(name):
4
- return "Hello " + name + "!!"
5
 
6
- demo = gr.Interface(fn=greet, inputs="text", outputs="text")
7
- demo.launch()
 
1
  import gradio as gr
2
 
3
+ def show_link():
4
+ return "Check out the Jupyter notebook demo: https://huggingface.co/spaces/uc-ctds/ai_assisted_data_curation_toolkit/blob/main/ai_assisted_data_curation.ipynb"
5
 
6
+ interface = gr.Interface(fn=show_link, inputs=None, outputs="text")
7
+ interface.launch()