Spaces: uc-ctds/ai_assisted_data_curation_toolkit

avantol committed on Nov 20, 2025

Commit 7c91de7

1 Parent(s): b2300b5

feat(notebook): more docs, better defaults

Files changed (1)

ai_assisted_data_curation.ipynb +17 -4

@@ -198,6 +198,16 @@
     "### Use a Specific Harmonization Approach to get Suggestions"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "deb30aa8",
+   "metadata": {},
+   "source": [
+    "We are using a specially trained embedding model created by UChicago CTDS, which is optimized for variable-level mapping. \n",
+    "\n",
+    "You can view details of the model here: https://huggingface.co/uc-ctds/bge-large-en-v1.5-bio-mapping"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -210,13 +220,16 @@
     "\n",
     "harmonization_approach = SimilaritySearchInMemoryVectorDb(\n",
     "    # A unique name for this file and embedding algorithm within the limits of the length required by the in-memory vectostore\n",
-    "    vectordb_persist_directory_name=f\"{os.path.basename(target_file)[:53]}-{embedding_fn.model.name_or_path.split(\"/\")[-1][:5]}-0\",\n",
+    "    vectordb_persist_directory_name=f\"{os.path.basename(target_file)[:53]}-{embedding_fn.model_name.split(\"/\")[-1][:5]}-0\",\n",
     "    input_target_model=input_target_model,\n",
     "    embedding_function=embedding_fn,\n",
     "    batch_size=batch_size,\n",
     ")\n",
     "\n",
-    "max_suggestions_per_property = 10\n",
+    "# By default, get all options (will eventually sort by most relevant)\n",
+    "max_suggestions_per_property = len(harmonization_approach.vectorstore.get()[\"ids\"])\n",
+    "# max_suggestions_per_property = 10\n",
+    "\n",
     "# set threshold low to just get top properties no matter what\n",
     "score_threshold = 0\n",
     "\n",
@@ -297,9 +310,9 @@
    "source": [
     "> **Don't see the table or see an error above?** Try restarting the kernel, then try restarting jupyter lab (if that's what you're using). The installs for AnyWidgets might not be picked up yet.\n",
     "\n",
-    "> **Dark Theme?** If you're using a dark theme, you might need to switch to light for the table to display properly. \n",
+    "> **Colors / Theme off?** If you're using a dark theme, you might need to switch to light for the table to display properly (or vice-versa).\n",
     "\n",
-    "> **Using VS Code Jupyter Extension?** Links might not work"
+    "> **Using VS Code Jupyter Extension?** Any Embedded links (if they exist) might not work"
    ]
   },
   {
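
The `vectordb_persist_directory_name` expression in the diff packs three steps into one f-string: truncate the target file's basename to 53 characters, take the first 5 characters of the last path segment of the model name, and append a `-0` suffix. A minimal sketch of the same logic as a standalone function may make the length budget easier to see. The helper name `persist_directory_name`, its `suffix` parameter, and the example file path are hypothetical and not part of the notebook. Building the parts separately also avoids nesting double quotes inside a double-quoted f-string, which only parses on Python 3.12 or later.

```python
import os


def persist_directory_name(target_file: str, model_name: str, suffix: str = "0") -> str:
    """Hypothetical helper mirroring the f-string used in the notebook cell."""
    # Keep at most 53 characters of the file name (directories stripped).
    file_part = os.path.basename(target_file)[:53]
    # Keep at most 5 characters of the model name's last path segment,
    # e.g. "uc-ctds/bge-large-en-v1.5-bio-mapping" -> "bge-l".
    model_part = model_name.split("/")[-1][:5]
    return f"{file_part}-{model_part}-{suffix}"


# Example with an assumed (made-up) target file path.
name = persist_directory_name(
    "/data/models/example_target_model.json",
    "uc-ctds/bge-large-en-v1.5-bio-mapping",
)
print(name)  # example_target_model.json-bge-l-0
```

With a one-character suffix the result is at most 53 + 1 + 5 + 1 + 1 = 61 characters, which is presumably how the expression stays within the vector store's name-length limit mentioned in the code comment.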