---
language:
- en
pipeline_tag: token-classification
tags:
- semantics
license: apache-2.0
---
# An English semantic tagging model based on `bert-base-uncased`
This is a `bert-base-uncased` model fine-tuned for **semantic tagging**.

As training data, I use the English fragment (both gold and silver data) from the Parallel Meaning Bank's Universal Semantic Tags dataset [1].

## Inference
The model is trained to predict a tag from the embedded representation of the first subword of each word. The code below reproduces the training setting at inference time ([Hugging Face's standard pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) does not behave as intended here). Note that the code assumes the input is already split into words by single spaces.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from spacy_alignments.tokenizations import get_alignments
import torch

tokenizer = AutoTokenizer.from_pretrained("hfunakura/en-bertsemtagger-gold")
model = AutoModelForTokenClassification.from_pretrained("hfunakura/en-bertsemtagger-gold")

# define the tagset
id2semtag = {"0": "@@UNK@@", "1": "PRO", "2": "CTC", "3": "INT", "4": "EMP", "5": "DEC", "6": "ITJ", "7": "GRE", "8": "NEC", "9": "PFT", "10": "IMP", "11": "HAP", "12": "ROL", "13": "MOY", "14": "PRG", "15": "HAS", "16": "CLO", "17": "MOR", "18": "DEF", "19": "BUT", "20": "YOC", "21": "PRI", "22": "EQU", "23": "SUB", "24": "APX", "25": "REL", "26": "XCL", "27": "CON", "28": "GPO", "29": "QUE", "30": "DIS", "31": "IST", "32": "COL", "33": "SCO", "34": "GRP", "35": "EXS", "36": "FUT", "37": "ENS", "38": "QUC", "39": "DOM", "40": "SST", "41": "NIL", "42": "COO", "43": "QUV", "44": "PST", "45": "UNK", "46": "EXT", "47": "NTH", "48": "LIT", "49": "ORG", "50": "EXG", "51": "REF", "52": "DOW", "53": "TOP", "54": "EPS", "55": "DXT", "56": "AND", "57": "UOM", "58": "ALT", "59": "POS", "60": "PRX", "61": "GEO", "62": "BOT", "63": "DEG", "64": "ART", "65": "PER", "66": "GPE", "67": "EFS", "68": "DST", "69": "LES", "70": "ORD", "71": "NOT", "72": "NOW", "-100": "@@PAD@@"}

class SemtaggerPipeline():
    def __init__(self, model, tokenizer, id2semtag):
        self.model = model
        self.tokenizer = tokenizer
        self.id2semtag = id2semtag
    def predict(self, text):
        # get alignments
        encoding_list = self.tokenizer(text, add_special_tokens=False)
        encoded_tokens = self.tokenizer.convert_ids_to_tokens(encoding_list["input_ids"])
        words = text.split(" ")
        alignments = get_alignments(encoded_tokens, words)[1]
        is_first_list = []
        for alignment in alignments:
            is_first_list += [1] + [0]*(len(alignment)-1)
        is_first = torch.tensor(is_first_list)
        # run the model and keep only the predictions for first subwords
        encoding = self.tokenizer(text, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            logits = self.model(**encoding).logits
        preds = logits.argmax(-1)[0][is_first==1]
        pred_labels = [self.id2semtag[str(int(i))] for i in preds]
        result = [f"{word}/{label}" for word, label in zip(words,pred_labels)]
        return " ".join(result)

pipeline = SemtaggerPipeline(model, tokenizer, id2semtag)
print(pipeline.predict("Jim and Mary smiled and left ."))
```
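
The first-subword selection itself can be illustrated without downloading the model. The following is a minimal sketch, not the code used for training: it assumes each subword position comes with the index of its source word, which is the kind of mapping that fast tokenizers in `transformers` return from `word_ids()` when called with `is_split_into_words=True`. The helper names, the subword split of "smiled", and the per-subword labels are all hypothetical and hand-written for illustration; they are not model output.

```python
def first_subword_mask(word_ids):
    """Return 1 for the first subword of each word and 0 elsewhere.

    `word_ids` gives, for each subword position, the index of the word it
    belongs to, or None for special tokens such as [CLS] and [SEP].
    """
    mask = []
    previous = None
    for word_id in word_ids:
        is_first = word_id is not None and word_id != previous
        mask.append(1 if is_first else 0)
        previous = word_id
    return mask


def tag_words(words, subword_labels, mask):
    """Keep the label at each first-subword position and pair it with its word."""
    labels = [label for label, keep in zip(subword_labels, mask) if keep]
    return " ".join(f"{word}/{label}" for word, label in zip(words, labels))


# Hypothetical example: "smiled" is assumed to be split into two subwords.
words = ["Jim", "smiled", "."]
word_ids = [0, 1, 1, 2]
# Hand-written per-subword labels, standing in for argmax over the logits.
subword_labels = ["PER", "EPS", "EPS", "NIL"]

mask = first_subword_mask(word_ids)
tagged = tag_words(words, subword_labels, mask)
print(mask)
print(tagged)
```

Only the label at the first subword of each word is kept, so a word split into several pieces still receives exactly one tag.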

## References
[1] Lasha Abzianidze and Johan Bos (2017). Towards Universal Semantic Tagging. In Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017), Short Papers, pp. 1–6, Montpellier, France. Data: https://pmb.let.rug.nl/data.php