lynn-twinkl committed on
Commit
a7a8438
·
1 Parent(s): c63c700

How to use ner-training docs

  1. ner-training/readme.md +36 -0
ner-training/readme.md ADDED
@@ -0,0 +1,36 @@
+ # Appropriate Usage for NER Training
+
+ ## Cleaning and Debugging Training Data
+
+ We first need to debug the raw labeled data exported from Label Studio. Labeled spans sometimes include trailing whitespace or punctuation, which spaCy _really_ doesn't like, so we need to strip it.
+
+ `python3 debug_labeled_data.py raw_labeled_data_path text_key_to_debug outdir`
+
+ This creates a new debugged JSON file in the specified output directory. **Use this file for the next step.**
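
To illustrate the kind of cleanup described above, here is a minimal sketch of trimming whitespace and punctuation from a labeled span's character offsets. It is not the contents of `debug_labeled_data.py`; the `trim_span` helper, the `STRIP_CHARS` set, and the sample sentence are all hypothetical, and the sketch also trims leading characters, which the script may or may not do.

```python
import string

# Characters removed from span edges. This set is an assumption; entities that
# legitimately end in punctuation (e.g. "Inc.") would need a narrower set.
STRIP_CHARS = string.whitespace + string.punctuation


def trim_span(text, start, end, strip_chars=STRIP_CHARS):
    """Shrink the half-open span [start, end) so it does not begin or end
    with a character from strip_chars."""
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


text = "Contact Jane Doe, in Berlin. "
start, end = trim_span(text, 8, 17)  # the labeled span covers "Jane Doe,"
print(repr(text[start:end]))  # prints 'Jane Doe'
```

Adjusting the offsets rather than the text keeps every other annotation on the same document valid.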
+
+ ## Preparing Data For Training
+
+ Now we need to convert the debugged labeled data into spaCy's binary format. Before doing so, however, we must split the data into training and dev sets so the model can be evaluated on examples it was not trained on.
+
+ 1. `python3 split_data.py debugged_json_path`
+
+ This creates `train.json` and `dev.json` in the current working directory.
+
+ 2. Move these files into the `training_data` directory: `mv train.json dev.json training_data/`
+
+ 3. Convert both sets into spaCy's binary format:
+
+ `python3 convert_to_spacy.py training_data/train.json training_data/train.spacy`
+ `python3 convert_to_spacy.py training_data/dev.json training_data/dev.spacy`
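
For readers who want to see what the split in step 1 could look like, here is a rough sketch. It is not the actual `split_data.py`: the function names, the 80/20 ratio, the fixed seed, and the assumption that the debugged file is a top-level JSON list are all illustrative.

```python
import json
import random


def split_examples(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and return (train, dev) lists.

    dev_fraction and seed are assumed defaults, not values taken from the repo.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def split_file(debugged_json_path, train_path="train.json", dev_path="dev.json"):
    """Read the debugged JSON file and write the two subsets next to the caller."""
    with open(debugged_json_path, encoding="utf-8") as f:
        examples = json.load(f)  # assumes a top-level JSON list of tasks
    train, dev = split_examples(examples)
    for path, subset in ((train_path, train), (dev_path, dev)):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(subset, f, ensure_ascii=False, indent=2)
```

Shuffling before slicing matters when the export is ordered (for example by annotation date), since an unshuffled tail would make a biased dev set.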
+
+ ## Training
+
+ To start training from the CLI, run the following command:
+
+ ```
+ python -m spacy train transformer.cfg \
+ --paths.train training_data/train.spacy \
+ --paths.dev training_data/dev.spacy \
+ --gpu-id 0 \
+ --output ./roberta_model
+ ```