Spaces:

TwinklData
/

Community_Collections_App

Sleeping

App Files Files Community

lynn-twinkl commited on Apr 13, 2025

Commit

20a2002

1 Parent(s): 56aa4c1

removed md file

Browse files

Files changed (1) hide show

notebooks/training-code.md +0 -32

notebooks/training-code.md DELETED Viewed

@@ -1,32 +0,0 @@
-Below are some observations and suggestions that may help you improve or validate your current training approach. Overall, the code “works” in the sense that it will run and produce a model—but there are a few important details that can make training more stable and give you better results, especially if you are using spaCy 3.x:
-1) Mixing a pre‐trained pipeline with “begin_training()”
-   – In your code, you load the pre‐trained model “en_core_web_md” and then call nlp.begin_training(). In spaCy 3.x, begin_training() is typically used to initialize weights from scratch (as if you had a blank model). By loading “en_core_web_md,” you actually start off with weights that are already pre‐trained.
-   – If your goal is to leverage the pre‐trained weights and fine‐tune them with your custom entities, it is often safer to use either nlp.initialize() (within a config-based training workflow) or, in older spaCy 2.x code, use resume_training().
-   – Using begin_training() on a pipeline that already has weights can sometimes lead to unexpected results (often referred to as “catastrophic forgetting”).
-2) Using a config-based workflow (spaCy 3.x)
-   – spaCy 3 introduced a config system that separates out hyperparameters and makes it easier to reproduce training. While the scripted approach you’re using (calling nlp.update in a loop) is still acceptable in simpler workflows, it does not take advantage of spaCy 3’s main training features (e.g., handling splits, logging, saving intermediate checkpoints, etc.).
-   – If you plan to maintain and iterate on this model, consider switching to a spaCy project with a config file. This ensures you can easily re-run training and track changes.
-3) Disable_pipes vs. continuing to train the tagger/parser
-   – In your code, you disable all other pipes except NER. This means you’re not updating the parser, tagger, etc. If your goal truly is to train only the NER component on your new labels, that’s perfect. But if you ever want to keep the tagger or parser up-to-date, or avoid catastrophic forgetting, you’ll need to be more deliberate about which components are updated.
-   – Because you are loading “en_core_web_md,” it already includes a tagger, parser, etc. If you truly do not need them, you could remove them from the pipeline altogether.
-4) Monitoring performance and stopping criteria
-   – You have a fixed 100-epoch training loop. You’ll see a final loss, but you have no early stopping or validation set. Adding at least a small dev set to measure F-scores (precision/recall) each epoch can help you avoid overfitting and let you stop training once the model peaks.
-   – Because NER can easily overfit, you may not actually need 100 epochs. Sometimes 10–20 epochs is enough, especially for smaller datasets.
-5) Dropout rate
-   – You’re using drop=0.5 in nlp.update(...). That can be okay for small data, but you might experiment with a lower (or slightly adaptive) dropout. You may get better entity performance with something like 0.1–0.3. There’s no universal rule—experiment and see.
-6) Overall structure looks good
-   – The logic for transforming your JSON labels into spaCy’s entity format, adding them to TRAIN_DATA, and iterating over minibatches is solid.
-   – You’re using the Example class from spacy.training.example, which is correct for spaCy 3.x code.
-To summarize:
-• The script will run and train a model, but if you are aiming to fine‐tune a pre‐trained pipeline (en_core_web_md), consider using a more modern approach (e.g., a config file plus nlp.initialize() or the spaCy CLI).
-• Double‐check that you actually want to overwrite all of the existing pipeline’s layers with begin_training()—it can cause forgetting of the original pipeline’s knowledge.
-• Add a small validation set (or cross‐validation approach) so that you can track performance beyond the raw loss, and consider early stopping or fewer epochs if the model is converging early.
-If the model’s predictions are working for you in practice, that’s great! But if you want to push performance further, those are the key areas to refine.