Update model: train for 25 epochs and add two more datasets, i.e. mit_restaurant and mit_movie_trivia.

Files changed (6)

README.md +131 -39
added_tokens.json +1 -1
config.json +4 -2
pytorch_model.bin +2 -2
tokenizer.json +0 -0
tokenizer_config.json +1 -1

README.md CHANGED

@@ -17,59 +17,151 @@ The FSNER model was proposed in [Example-Based Named Entity Recognition](https:/
 ## Model Training Details
 -----
-| identifier        | epochs           | datasets  |
-| ---------- |:----------:| :-----:|
-| [sayef/fsner-bert-base-uncased](https://huggingface.co/sayef/fsner-bert-base-uncased)      | 10 | ontonotes5, conll2003, wnut2017, and fin (Alvarado et al.). |
 ## Installation and Example Usage
 ------
-- Installation: `pip install fsner`
-```python
-from fsner import FSNERModel, FSNERTokenizerUtils
-model = FSNERModel("sayef/fsner-bert-base-uncased")
-tokenizer = FSNERTokenizerUtils("sayef/fsner-bert-base-uncased")
-# size of query and supports must be the same. Each query corresponds to the same index of supports.
-query = [
-    'KWE 4000 can reach with a maximum speed from up to 450 P/min an accuracy from 50 mg',
-    'I would like to order a computer from eBay.',
-]
-# each list in supports are the examples of one entity type
-# wrap entities around with [E] and [/E] in the examples
-supports = [
-        [
-           'Horizontal flow wrapper [E] Pack 403 [/E] features the new retrofit-kit „paper-ON-form“',
-           '[E] Paloma Pick-and-Place-Roboter [/E] arranges the bakery products for the downstream tray-forming equipment',
-           'Finally, the new [E] Kliklok ACE [/E] carton former forms cartons and trays without the use of glue',
-           'We set up our pilot plant with the right [E] FibreForm® [/E] configuration to make prototypes for your marketing tests and package validation',
-           'The [E] CAR-T5 [/E] is a reliable, purely mechanically driven cartoning machine for versatile application fields'
-        ],
-        [
-            "[E] Walmart [/E] is a leading e-commerce company",
-            "I recently ordered a book from [E] Amazon [/E]",
-            "I ordered this from [E] ShopClues [/E]",
-            "Fridge can be ordered in [E] Amazon [/E]",
-            "[E] Flipkart [/E] started it's journey from zero"
-        ]
-   ]
-device = 'cpu'
-W_query = tokenizer.tokenize(query).to(device)
-W_supports = tokenizer.tokenize(supports).to(device)
-start_prob, end_prob = model(W_query, W_supports)
-output = tokenizer.extract_entity_from_scores(query, W_query, start_prob, end_prob, thresh=0.50)
-print(output)
 ```

 ## Model Training Details
 -----
+| identifier        | epochs |                                            datasets                                             |
+| ---------- |:------:|:-----------------------------------------------------------------------------------------------:|
+| [sayef/fsner-bert-base-uncased](https://huggingface.co/sayef/fsner-bert-base-uncased)      |   25   |  ontonotes5, conll2003, wnut2017, mit_movie_trivia, mit_restaurant and fin (Alvarado et al.).   |
 ## Installation and Example Usage
 ------
+You can use the FSNER model in one of three ways:
+1. Install directly from PyPI with `pip install fsner`, then import the model as shown in the code example below,
+   or
+2. Install from source with `python setup.py install`, then import the model as shown in the code example below,
+   or
+3. Clone the [repo](https://github.com/sayef/fsner), add the absolute path of the `fsner/src` directory to your `PYTHONPATH`, and import the model as shown in the code example below.
+```python
+import json
+from fsner import FSNERModel, FSNERTokenizerUtils, pretty_embed
+query_texts = [
+    "Does Luke's serve lunch?",
+    "Chang does not speak Taiwanese very well.",
+    "I like Berlin."
+]
+# Each list in supports holds the examples of one entity type.
+# Wrap each entity with [E] and [/E] in the examples.
+# Each sentence should have only one pair of [E] ... [/E]
+support_texts = {
+    "Restaurant": [
+        "What time does [E] Subway [/E] open for breakfast?",
+        "Is there a [E] China Garden [/E] restaurant in newark?",
+        "Does [E] Le Cirque [/E] have valet parking?",
+        "Is there a [E] McDonalds [/E] on main street?",
+        "Does [E] Mike's Diner [/E] offer huge portions and outdoor dining?"
+    ],
+    "Language": [
+        "Although I understood no [E] French [/E] in those days , I was prepared to spend the whole day with Chien - chien .",
+        "like what the hell 's that called in [E] English [/E] ? I have to register to be here like since I 'm a foreigner .",
+        "So , I 'm also working on an [E] English [/E] degree because that 's my real interest .",
+        "Al - Jazeera TV station , established in November 1996 in Qatar , is an [E] Arabic - language [/E] news TV station broadcasting global news and reports nonstop around the clock .",
+        "They think it 's far better for their children to be here improving their [E] English [/E] than sitting at home in front of a TV . \"",
+        "The only solution seemed to be to have her learn [E] French [/E] .",
+        "I have to read sixty pages of [E] Russian [/E] today ."
+    ]
+}
+device = 'cpu'
+tokenizer = FSNERTokenizerUtils("checkpoints/model")
+queries = tokenizer.tokenize(query_texts).to(device)
+supports = tokenizer.tokenize(list(support_texts.values())).to(device)
+model = FSNERModel("checkpoints/model")
+model.to(device)
+p_starts, p_ends = model.predict(queries, supports)
+# The supports can be prepared once and reused multiple times with different queries
+# ------------------------------------------------------------------------------
+# start_token_embeddings, end_token_embeddings = model.prepare_supports(supports)
+# p_starts, p_ends = model.predict(queries, start_token_embeddings=start_token_embeddings,
+#                                  end_token_embeddings=end_token_embeddings)
+output = tokenizer.extract_entity_from_scores(query_texts, queries, p_starts, p_ends,
+                                              entity_keys=list(support_texts.keys()), thresh=0.50)
+print(json.dumps(output, indent=2))
+# pretty_embed needs displaCy (bundled with spaCy) to be installed
+pretty_embed(query_texts, output, list(support_texts.keys()))
+```
+<!DOCTYPE html>
+<html lang="en">
+    <head>
+        <title>displaCy</title>
+    </head>
+    <body style="font-size: 16px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol'; padding: 4rem 2rem; direction: ltr">
+<figure style="margin-bottom: 6rem">
+<div class="entities" style="line-height: 2.5; direction: ltr">
+<div class="entities" style="line-height: 2.5; direction: ltr">Does
+<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
+    Luke's
+    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">Restaurant</span>
+</mark>
+ serve lunch?</div>
+<div class="entities" style="line-height: 2.5; direction: ltr">Chang does not speak
+<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
+    Taiwanese
+    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">Language</span>
+</mark>
+ very well.</div>
+<div class="entities" style="line-height: 2.5; direction: ltr">I like Berlin.</div>
+ </div>
+</figure>
+</body>
+</html>
+## Datasets preparation
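+1. The dataset must be converted into the following format: a JSON object that maps each entity type to a list of example sentences, with the entity wrapped in [E] and [/E]. For example, a dataset file train.json might look like this: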
+```json
+{
+  "CARDINAL_NUMBER": [
+    "Washington , cloudy , [E] 2 [/E] to 6 degrees .",
+    "New Dehli , sunny , [E] 6 [/E] to 19 degrees .",
+    "Well this is number [E] two [/E] .",
+    "....."
+  ],
+  "LANGUAGE": [
+    "They do n't have the Quicken [E] Dutch [/E] version ?",
+    "they learned a lot of [E] German [/E] .",
+    "and then [E] Dutch [/E] it 's Mifrau",
+    "...."
+  ],
+  "MONEY": [
+    "Per capita personal income ranged from $ [E] 11,116 [/E] in Mississippi to $ 23,059 in Connecticut ... .",
+    "The trade surplus was [E] 582 million US dollars [/E] .",
+    "It settled with a loss of 4.95 cents at $ [E] 1.3210 [/E] a pound .",
+    "...."
+  ]
+}
+```
+2. The converted ontonotes5 dataset can be found here:
+    1. [train](https://gist.githubusercontent.com/sayef/46deaf7e6c6e1410b430ddc8aff9c557/raw/ea7ae2ae933bfc9c0daac1aa52a9dc093d5b36f4/ontonotes5.train.json)
+    2. [dev](https://gist.githubusercontent.com/sayef/46deaf7e6c6e1410b430ddc8aff9c557/raw/ea7ae2ae933bfc9c0daac1aa52a9dc093d5b36f4/ontonotes5.dev.json)
+3. Then use the `examples/train.py` script to train or evaluate your FSNER model.
+```bash
+python train.py --pretrained-model bert-base-uncased --mode train --train-data train.json --val-data val.json \
+                --train-batch-size 6 --val-batch-size 6 --n-examples-per-entity 10 --neg-example-batch-ratio 1/3 --max-epochs 25 --device gpu \
+                --gpus -1 --strategy ddp
 ```
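
The "Datasets preparation" step in the new README describes the target format but not how to produce it. The following is a minimal sketch of one possible conversion, not the author's own script. It assumes a hypothetical input shape in which each sample is a `(tokens, spans)` pair and each span is `(start, end, label)` with an exclusive `end`; the function name `to_fsner_format` is also hypothetical. It emits one sentence per entity so that each sentence carries a single `[E] ... [/E]` pair, as the README requires.

```python
import json
from collections import defaultdict


def to_fsner_format(samples):
    """Group sentences by entity type, marking exactly one entity per sentence.

    `samples` is an assumed input shape (not defined by the fsner package):
    an iterable of (tokens, spans) pairs, where spans is a list of
    (start, end, label) tuples and `end` is exclusive.
    """
    grouped = defaultdict(list)
    for tokens, spans in samples:
        for start, end, label in spans:
            # Wrap only the current span so the sentence has a single [E] ... [/E] pair.
            marked = tokens[:start] + ["[E]"] + tokens[start:end] + ["[/E]"] + tokens[end:]
            grouped[label].append(" ".join(marked))
    return dict(grouped)


samples = [
    (["they", "learned", "a", "lot", "of", "German", "."], [(5, 6, "LANGUAGE")]),
    (["Well", "this", "is", "number", "two", "."], [(4, 5, "CARDINAL_NUMBER")]),
]
converted = to_fsner_format(samples)
print(json.dumps(converted, indent=2))
```

A sentence with several entities yields several output sentences, one per entity, each filed under that entity's type.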

added_tokens.json CHANGED

@@ -1 +1 @@
-{"[/E]": 30523, "[E]": 30522}
+{"[E]": 30522, "[/E]": 30523}
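
The two sides of the added_tokens.json hunk differ only in key order, which is not significant in a JSON object. A quick check, sketched with the standard library only (the variable names are illustrative):

```python
import json

old = json.loads('{"[/E]": 30523, "[E]": 30522}')
new = json.loads('{"[E]": 30522, "[/E]": 30523}')

# Dict equality ignores key order, so the token-to-id mapping is unchanged.
same_mapping = old == new
# The highest added id plus one gives the total vocabulary size.
vocab_size = max(new.values()) + 1

print(same_mapping, vocab_size)
```

The computed size agrees with the `vocab_size` of 30524 shown in the config.json diff.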

config.json CHANGED

@@ -1,9 +1,10 @@
 {
-  "_name_or_path": "./fsner-bert-base-uncased/",
   "architectures": [
     "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
   "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
@@ -17,7 +18,8 @@
   "num_hidden_layers": 12,
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
-  "transformers_version": "4.3.3",
   "type_vocab_size": 2,
   "use_cache": true,
   "vocab_size": 30524

 {
+  "_name_or_path": "checkpoints/model4",
   "architectures": [
     "BertModel"
   ],
   "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
   "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
   "num_hidden_layers": 12,
   "pad_token_id": 0,
   "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.17.0",
   "type_vocab_size": 2,
   "use_cache": true,
   "vocab_size": 30524

pytorch_model.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:96b397b8b984de4833b1193e4a0b1a882a898d1a0696e7138fe7aaf1b1970e87
-size 438019662

 version https://git-lfs.github.com/spec/v1
+oid sha256:c2a2401a91d2bf80826341c52a0c1f8b6814f36c1b7852d4c93482a13041260f
+size 438017329

tokenizer.json CHANGED

The diff for this file is too large to render. See raw diff

tokenizer_config.json CHANGED

@@ -1 +1 @@
-{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "bert-base-uncased", "tokenizer_class": "BertTokenizer"}
+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "checkpoints/model4", "tokenizer_class": "BertTokenizer"}