avantol committed on
Commit
7c91de7
·
1 Parent(s): b2300b5

feat(notebook): more docs, better defaults

Files changed (1)
  1. ai_assisted_data_curation.ipynb +17 -4
ai_assisted_data_curation.ipynb CHANGED
@@ -198,6 +198,16 @@
  "### Use a Specific Harmonization Approach to get Suggestions"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "id": "deb30aa8",
+ "metadata": {},
+ "source": [
+ "We are using a specially trained embedding model created by UChicago CTDS, which is optimized for variable-level mapping. \n",
+ "\n",
+ "You can view details of the model here: https://huggingface.co/uc-ctds/bge-large-en-v1.5-bio-mapping"
+ ]
+ },
  {
  "cell_type": "code",
  "execution_count": null,
@@ -210,13 +220,16 @@
  "\n",
  "harmonization_approach = SimilaritySearchInMemoryVectorDb(\n",
  " # A unique name for this file and embedding algorithm within the limits of the length required by the in-memory vectostore\n",
- " vectordb_persist_directory_name=f\"{os.path.basename(target_file)[:53]}-{embedding_fn.model.name_or_path.split(\"/\")[-1][:5]}-0\",\n",
+ " vectordb_persist_directory_name=f\"{os.path.basename(target_file)[:53]}-{embedding_fn.model_name.split(\"/\")[-1][:5]}-0\",\n",
  " input_target_model=input_target_model,\n",
  " embedding_function=embedding_fn,\n",
  " batch_size=batch_size,\n",
  ")\n",
  "\n",
- "max_suggestions_per_property = 10\n",
+ "# By default, get all options (will eventually sort by most relevant)\n",
+ "max_suggestions_per_property = len(harmonization_approach.vectorstore.get()[\"ids\"])\n",
+ "# max_suggestions_per_property = 10\n",
+ "\n",
  "# set threshold low to just get top properties no matter what\n",
  "score_threshold = 0\n",
  "\n",
@@ -297,9 +310,9 @@
  "source": [
  "> **Don't see the table or see an error above?** Try restarting the kernel, then try restarting jupyter lab (if that's what you're using). The installs for AnyWidgets might not be picked up yet.\n",
  "\n",
- "> **Dark Theme?** If you're using a dark theme, you might need to switch to light for the table to display properly. \n",
+ "> **Colors / Theme off?** If you're using a dark theme, you might need to switch to light for the table to display properly (or vice-versa).\n",
  "\n",
- "> **Using VS Code Jupyter Extension?** Links might not work"
+ "> **Using VS Code Jupyter Extension?** Any Embedded links (if they exist) might not work"
  ]
  },
  {
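
Note on the `vectordb_persist_directory_name` line: once the notebook JSON is unescaped, the f-string reuses double quotes inside its own replacement field (`split("/")`). That only parses on Python 3.12 or later; earlier interpreters raise a `SyntaxError`. A minimal sketch of the same naming scheme that avoids the nested quotes is below. The helper name `build_persist_directory_name` and the example path are hypothetical and not part of this commit; the model name is taken from the link in the added markdown cell.

```python
import os


def build_persist_directory_name(target_file: str, model_name: str) -> str:
    """Hypothetical helper mirroring the naming scheme in the diff above."""
    # Keep at most 53 characters of the target file's base name.
    file_part = os.path.basename(target_file)[:53]
    # Keep at most 5 characters of the last "/"-separated segment of the model name.
    model_part = model_name.split("/")[-1][:5]
    # The result is at most 53 + 1 + 5 + 2 = 61 characters long.
    return f"{file_part}-{model_part}-0"


name = build_persist_directory_name(
    "/data/example_target.tsv",  # made-up path, for illustration only
    "uc-ctds/bge-large-en-v1.5-bio-mapping",
)
print(name)
```

In the notebook itself, the call would presumably pass `target_file` and `embedding_fn.model_name`, the two values the changed line already uses.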