Upload MultilabelNerPipeline

Files changed (8)

README.md +199 -0
config.json +93 -0
configuration_multilabelbert.py +7 -0
model.safetensors +3 -0
modeling_multilabelbert.py +76 -0
multilabel_ner.py +182 -0
tokenizer.json +0 -0
tokenizer_config.json +14 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,93 @@

+{
+  "add_cross_attention": false,
+  "architectures": [
+    "BertForMultiLabelTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "auto_map": {
+    "AutoConfig": "configuration_multilabelbert.MultiLabelBertConfig",
+    "AutoModelForTokenClassification": "modeling_multilabelbert.BertForMultiLabelTokenClassification"
+  },
+  "bos_token_id": null,
+  "classifier_dropout": null,
+  "custom_pipelines": {
+    "multilabel-ner": {
+      "default": {
+        "model": {
+          "pt": [
+            "jvaquet/multilabel-classification-bert",
+            "main"
+          ]
+        }
+      },
+      "impl": "multilabel_ner.MultilabelNerPipeline",
+      "pt": [
+        "AutoModelForTokenClassification"
+      ],
+      "type": "text"
+    }
+  },
+  "directionality": "bidi",
+  "dtype": "float32",
+  "eos_token_id": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "B-MISC",
+    "1": "I-MISC",
+    "2": "E-MISC",
+    "3": "S-MISC",
+    "4": "B-ORG",
+    "5": "I-ORG",
+    "6": "E-ORG",
+    "7": "S-ORG",
+    "8": "B-PER",
+    "9": "I-PER",
+    "10": "E-PER",
+    "11": "S-PER",
+    "12": "B-LOC",
+    "13": "I-LOC",
+    "14": "E-LOC",
+    "15": "S-LOC"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "is_decoder": false,
+  "label2id": {
+    "B-LOC": 12,
+    "B-MISC": 0,
+    "B-ORG": 4,
+    "B-PER": 8,
+    "E-LOC": 14,
+    "E-MISC": 2,
+    "E-ORG": 6,
+    "E-PER": 10,
+    "I-LOC": 13,
+    "I-MISC": 1,
+    "I-ORG": 5,
+    "I-PER": 9,
+    "S-LOC": 15,
+    "S-MISC": 3,
+    "S-ORG": 7,
+    "S-PER": 11
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "MultiLabelBert",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "tie_word_embeddings": true,
+  "transformers_version": "5.5.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 28996
+}
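Note on the label map: the `id2label` block above uses four tag prefixes (`B`, `I`, `E`, `S`) for each of four entity types and has no explicit outside tag, which is consistent with the pipeline deriving entity types via `label[2:]`. A minimal sketch of that grouping follows; `group_labels` is a hypothetical helper for illustration and is not part of the commit.

```python
from collections import defaultdict

# Copied from the id2label block of config.json in this commit.
ID2LABEL = {
    0: "B-MISC", 1: "I-MISC", 2: "E-MISC", 3: "S-MISC",
    4: "B-ORG", 5: "I-ORG", 6: "E-ORG", 7: "S-ORG",
    8: "B-PER", 9: "I-PER", 10: "E-PER", 11: "S-PER",
    12: "B-LOC", 13: "I-LOC", 14: "E-LOC", 15: "S-LOC",
}


def group_labels(id2label):
    """Map each entity type to {tag prefix: logit index}. Hypothetical helper."""
    groups = defaultdict(dict)
    for idx, label in id2label.items():
        prefix, entity_type = label.split("-", 1)
        groups[entity_type][prefix] = idx
    return dict(groups)


LABEL_GROUPS = group_labels(ID2LABEL)
print(LABEL_GROUPS["PER"])  # {'B': 8, 'I': 9, 'E': 10, 'S': 11}
```

Each entity type therefore owns four independent output logits, so a token can carry several tags at once (for example `I-LOC` and `E-LOC`).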

configuration_multilabelbert.py ADDED

	@@ -0,0 +1,7 @@

+from transformers import BertConfig, AutoConfig
+class MultiLabelBertConfig(BertConfig):
+    model_type = 'MultiLabelBert'
+AutoConfig.register('MultiLabelBert', MultiLabelBertConfig)
+MultiLabelBertConfig.register_for_auto_class()

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c8d43ff02648f12146cc4cec24cd36dfbaaa077d6417b571efa928fecfee36f8
+size 1330231072

modeling_multilabelbert.py ADDED

	@@ -0,0 +1,76 @@

+from transformers import BertPreTrainedModel, BertModel, AutoConfig, AutoModelForTokenClassification
+import torch
+import torch.nn as nn
+from transformers.modeling_outputs import TokenClassifierOutput
+from transformers.utils import TransformersKwargs, can_return_tuple
+from transformers.processing_utils import Unpack
+from .configuration_multilabelbert import MultiLabelBertConfig
+from typing import Optional
+class BertForMultiLabelTokenClassification(BertPreTrainedModel):
+    config_class = MultiLabelBertConfig
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.bert = BertModel(config, add_pooling_layer=False)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+        # Initialize weights and apply final processing
+        self.post_init()
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        token_type_ids: torch.Tensor | None = None,
+        position_ids: torch.Tensor | None = None,
+        inputs_embeds: torch.Tensor | None = None,
+        labels: torch.Tensor | None = None,
+        special_tokens_mask: Optional[torch.Tensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple[torch.Tensor] | TokenClassifierOutput:
+        outputs = self.bert(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            inputs_embeds=inputs_embeds,
+            return_dict=True,
+            **kwargs,
+        )
+        sequence_output = outputs[0]
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+        loss = None
+        if labels is not None:
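+            # Multi-label objective: one independent sigmoid/BCE term per (token, label) cell.
+            loss_fct = nn.BCEWithLogitsLoss(reduction='none')
+            # BCEWithLogitsLoss expects float targets, so cast labels to the logits dtype.
+            loss = loss_fct(logits, labels.to(logits.dtype))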
+            if special_tokens_mask is not None:
+                loss = loss[special_tokens_mask != 1].mean()
+            else:
+                loss = loss.mean()
+        return TokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+AutoModelForTokenClassification.register(MultiLabelBertConfig, BertForMultiLabelTokenClassification)
+BertForMultiLabelTokenClassification.register_for_auto_class('AutoModelForTokenClassification')
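Note on the loss: `forward` computes an element-wise binary cross-entropy on the logits and, when `special_tokens_mask` is given, averages only over tokens whose mask value is not 1. The following is a dependency-free sketch of that arithmetic on nested lists, assuming shapes `[batch][token][label]`; the function names are hypothetical and the model itself uses `nn.BCEWithLogitsLoss`.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def masked_multilabel_loss(logits, labels, special_tokens_mask=None):
    """Mean BCE over every (token, label) cell of the non-special tokens.

    logits, labels: nested lists shaped [batch][token][label].
    special_tokens_mask: nested list shaped [batch][token]; 1 marks a special token.
    """
    losses = []
    for b, sequence in enumerate(logits):
        for t, token_logits in enumerate(sequence):
            if special_tokens_mask is not None and special_tokens_mask[b][t] == 1:
                continue  # skip [CLS], [SEP], [PAD] and similar positions
            for k, logit in enumerate(token_logits):
                losses.append(bce_with_logits(logit, labels[b][t][k]))
    return sum(losses) / len(losses)


# One sequence of three tokens ([CLS], a real token, [SEP]) and two labels.
logits = [[[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]]
labels = [[[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]]
mask = [[1, 0, 1]]

masked = masked_multilabel_loss(logits, labels, mask)
unmasked = masked_multilabel_loss(logits, labels)
```

In this toy input the special tokens have uninformative logits of 0, so including them inflates the mean; masking leaves only the confidently correct middle token.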

multilabel_ner.py ADDED

	@@ -0,0 +1,182 @@

+from transformers import Pipeline
+import torch
+import torch.nn as nn
+MODEL_FOR_MULTILABEL_TOKEN_CLASSIFICATION = [
+    'BertForMultiLabelTokenClassification'
+]
+class MultilabelNerPipeline(Pipeline):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.check_model_type(MODEL_FOR_MULTILABEL_TOKEN_CLASSIFICATION)
+        self.entity_types = {label[2:] for label in self.model.config.label2id}
+    def _sanitize_parameters(self, **kwargs):
+        preprocess_kwargs = {}
+        if 'stride' in kwargs:
+            preprocess_kwargs['stride'] = kwargs['stride']
+        postprocess_kwargs = {}
+        if 'threshold' in kwargs:
+            postprocess_kwargs['threshold'] = kwargs['threshold']
+        if 'use_hierarchy_heuristic' in kwargs:
+            postprocess_kwargs['use_hierarchy_heuristic'] = kwargs['use_hierarchy_heuristic']
+        return preprocess_kwargs, {}, postprocess_kwargs
+    def preprocess(self, inputs, stride=128):
+        tokenized_inputs = self.tokenizer(inputs,
+            truncation=True,
+            padding=True,
+            stride=stride,
+            return_tensors='pt',
+            return_overflowing_tokens=True,
+            return_special_tokens_mask=True
+            )
+        n_samples = tokenized_inputs.input_ids.size()[0]
+        char_offsets = [tokenized_inputs[idx].offsets for idx in range(n_samples)]
+        return {
+            'input_ids': tokenized_inputs.input_ids,
+            'attention_mask': tokenized_inputs.attention_mask,
+            'char_offsets': char_offsets,
+            'special_tokens_mask': tokenized_inputs.special_tokens_mask,
+            'text': inputs
+        }
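+    def _forward(self, model_inputs):
+        # Pass only the tensors the model consumes; text, offsets and the
+        # special-token mask are carried through unchanged for postprocess.
+        logits = self.model(
+            input_ids=model_inputs['input_ids'],
+            attention_mask=model_inputs['attention_mask'],
+        ).logits
+        return {
+            'logits': logits,
+            'text': model_inputs['text'],
+            'char_offsets': model_inputs['char_offsets'],
+            'special_tokens_mask': model_inputs['special_tokens_mask']
+        }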
+    def postprocess(self, model_outputs, threshold=0.5, use_hierarchy_heuristic=False):
+        predictions = nn.functional.sigmoid(model_outputs['logits'])
+        predictions[model_outputs['special_tokens_mask'] == 1] = 0
+        spans_single = self.extract_single_token_spans(predictions, threshold)
+        spans_multi = self.extract_multi_token_spans(predictions, threshold)
+        spans = self.token_spans_to_char_spans(spans_single + spans_multi, model_outputs['char_offsets'], model_outputs['text'])
+        spans = self.deduplicate_spans(spans)
+        if use_hierarchy_heuristic:
+            spans = self.apply_hierarchy_heristic(spans)
+        return spans
+    def extract_single_token_spans(self, predictions, threshold):
+        return [{
+            'label': entity_type,
+            'batch': idx_batch,
+            'span_token': (int(idx_token), int(idx_token+1))
+        }
+            for entity_type in self.entity_types
+            for idx_batch, idx_token in zip(*torch.where(predictions[:,:, self.model.config.label2id[f'S-{entity_type}']] >= threshold))
+        ]
+    def extract_multi_token_spans(self, predictions, threshold):
+        return [{
+            'label': entity_type,
+            'batch': idx_batch_begin,
+            'span_token': (int(idx_token_begin), int(idx_token_end+1))
+        }
+            for entity_type in self.entity_types
+            for idx_batch_begin, idx_token_begin in zip(*torch.where(predictions[:,:, self.model.config.label2id[f'B-{entity_type}']] >= threshold))
+            for idx_batch_end, idx_token_end in zip(*torch.where(predictions[:,:, self.model.config.label2id[f'E-{entity_type}']] >= threshold))
+            if idx_batch_begin == idx_batch_end
+            if idx_token_begin < idx_token_end
+            if torch.all(predictions[idx_batch_begin, idx_token_begin+1:idx_token_end, self.model.config.label2id[f'I-{entity_type}']] >= threshold)
+        ]
+    def token_spans_to_char_spans(self, spans, char_offsets, text):
+        return [{
+            'label': span['label'],
+            'span': (char_start, char_end),
+            'text': text[char_start:char_end]
+        }
+            for span in spans
+            if (batch := span['batch']) is not None
+            if (span_token := span['span_token']) is not None
+            if (char_start := char_offsets[batch][span_token[0]][0]) is not None
+            if (char_end := char_offsets[batch][span_token[1]-1][1]) is not None]
+    def deduplicate_spans(self, spans):
+        return [dict(tup)
+                for tup in {tuple(span.items()) for span in spans}
+            ]
+    def apply_hierarchy_heristic(self, spans):
+        def _group_spans(spans):
+            groups = []
+            for span in sorted(spans, key=lambda span: span['span'][0] - span['span'][1]):
+                found_group = False
+                for cur_group in groups:
+                    if (cur_group['label'] == span['label']
+                            and cur_group['start'] <= span['span'][0]
+                            and cur_group['end'] >= span['span'][1]):
+                        cur_group['spans'].append(span)
+                        found_group = True
+                        break
+                # If no group found, make new one
+                if not found_group:
+                    groups.append({
+                        'start': span['span'][0],
+                        'end': span['span'][1],
+                        'spans': [span],
+                        'label': span['label']
+                    })
+            return groups
+        return_spans = []
+        for group in _group_spans(spans):
+            sorted_spans = sorted(group['spans'], key=lambda span: span['span'][1] - span['span'][0])
+            # Collect all start and end positions
+            span_starts = {span['span'][0] for span in sorted_spans}
+            span_ends = {span['span'][1] for span in sorted_spans}
+            # Except for start and end of group
+            span_starts.discard(sorted_spans[-1]['span'][0])
+            span_ends.discard(sorted_spans[-1]['span'][1])
+            # Preserve encapsulating span
+            cur_spans = [sorted_spans[-1]]
+            # Iteratively add shortest span, if it covers an unused start or end point
+            for cur_span in sorted_spans[:-1]:
+                if len(span_starts) + len(span_ends) == 0:
+                    break
+                if cur_span['span'][0] in span_starts \
+                        or cur_span['span'][1] in span_ends:
+                    cur_spans.append(cur_span)
+                    span_starts.discard(cur_span['span'][0])
+                    span_ends.discard(cur_span['span'][1])
+            return_spans += cur_spans
+        return return_spans
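Note on the hierarchy heuristic: spans are first grouped under the longest same-label span that contains them. Within a group the outer span is always kept, and shorter spans are then added from shortest to longest only while they cover a start or end position that no kept span has covered yet. The following compact restatement works on `(label, start, end)` tuples instead of the pipeline's dicts; it is a sketch for tracing the behaviour, not the committed implementation.

```python
def hierarchy_heuristic(spans):
    """spans: list of (label, start, end) tuples with half-open character offsets."""
    # Group each span under the first (longest) same-label span that contains it.
    groups = []
    for label, start, end in sorted(spans, key=lambda s: s[1] - s[2]):
        for group in groups:
            g_label, g_start, g_end = group[0]
            if g_label == label and g_start <= start and g_end >= end:
                group.append((label, start, end))
                break
        else:
            groups.append([(label, start, end)])

    kept = []
    for group in groups:
        outer = group[0]
        inner = sorted(group[1:], key=lambda s: s[2] - s[1])  # shortest first
        # Boundaries still unexplained, excluding those of the outer span itself.
        open_starts = {s[1] for s in inner} - {outer[1]}
        open_ends = {s[2] for s in inner} - {outer[2]}
        kept.append(outer)
        for span in inner:
            if not open_starts and not open_ends:
                break
            if span[1] in open_starts or span[2] in open_ends:
                kept.append(span)
                open_starts.discard(span[1])
                open_ends.discard(span[2])
    return kept


# Character spans over the text "New York City" (length 13).
candidates = [
    ("LOC", 0, 13),  # New York City
    ("LOC", 0, 8),   # New York
    ("LOC", 4, 8),   # York
    ("LOC", 4, 13),  # York City
    ("ORG", 0, 8),   # different label, so it forms its own group
]
result = hierarchy_heuristic(candidates)
print(result)  # [('LOC', 0, 13), ('LOC', 4, 8), ('ORG', 0, 8)]
```

In this trace "York" is kept because it is the shortest span touching the inner boundaries 4 and 8; once those are covered, "New York" and "York City" add no new boundary and are dropped.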
+from transformers.pipelines import PIPELINE_REGISTRY
+from transformers import AutoModelForTokenClassification
+PIPELINE_REGISTRY.register_pipeline(
+    'multilabel-ner',
+    pipeline_class=MultilabelNerPipeline,
+    pt_model=AutoModelForTokenClassification,
+    default={'pt': ('jvaquet/multilabel-classification-bert', 'main')},
+    type='text',
+)
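Note on usage: since `config.json` declares the `multilabel-ner` entry under `custom_pipelines`, the pipeline should be loadable from the Hub with `trust_remote_code=True`. The sketch below assumes the repository id is the default checkpoint named in the config (`jvaquet/multilabel-classification-bert`); the two wrapper functions are hypothetical helpers, while `threshold` and `use_hierarchy_heuristic` are the call-time parameters accepted by `_sanitize_parameters` (`stride` is accepted as well).

```python
def load_multilabel_ner(repo_id="jvaquet/multilabel-classification-bert"):
    """Build the custom pipeline from the Hub (downloads roughly 1.3 GB of weights).

    The repo id is an assumption taken from the default checkpoint in config.json.
    """
    # Imported lazily so this sketch can be loaded without transformers installed.
    from transformers import pipeline

    # trust_remote_code is needed because the pipeline class lives in the repo.
    return pipeline(model=repo_id, trust_remote_code=True)


def find_entities(ner, text, threshold=0.5, use_hierarchy_heuristic=False):
    """Run the pipeline with the postprocess options defined in multilabel_ner.py."""
    return ner(
        text,
        threshold=threshold,
        use_hierarchy_heuristic=use_hierarchy_heuristic,
    )


if __name__ == "__main__":
    ner = load_multilabel_ner()
    for entity in find_entities(ner, "Angela Merkel visited New York City."):
        # Each result is a dict with the keys 'label', 'span' and 'text'.
        print(entity["label"], entity["span"], entity["text"])
```

Only enable `trust_remote_code` for repositories whose code you have reviewed, since it executes the repo's Python files locally.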

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "backend": "tokenizers",
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "is_local": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}