davanstrien (HF Staff) committed · Commit 52c4067 · verified · 1 parent: e758e1d

Initial: HumanSignal huggingface_ner example patched for HF Spaces

Files changed (9)
  1. .dockerignore +18 -0
  2. Dockerfile +48 -0
  3. README.md +151 -5
  4. _wsgi.py +122 -0
  5. model.py +260 -0
  6. requirements-base.txt +2 -0
  7. requirements-test.txt +2 -0
  8. requirements.txt +4 -0
  9. test_api.py +232 -0
.dockerignore ADDED
@@ -0,0 +1,18 @@
+ # Exclude everything
+ **
+
+ # Include Dockerfile and docker-compose for reference (optional, decide based on your use case)
+ !Dockerfile
+ !docker-compose.yml
+
+ # Include Python application files
+ !*.py
+
+ # Include requirements files
+ !requirements*.txt
+
+ # Include script
+ !*.sh
+
+ # Exclude specific requirements if necessary
+ # requirements-test.txt (Uncomment if you decide to exclude this)
Dockerfile ADDED
@@ -0,0 +1,48 @@
+ # syntax=docker/dockerfile:1
+ ARG PYTHON_VERSION=3.11
+
+ FROM python:${PYTHON_VERSION}-slim AS python-base
+ ARG TEST_ENV
+
+ WORKDIR /app
+
+ ENV PYTHONUNBUFFERED=1 \
+     PYTHONDONTWRITEBYTECODE=1 \
+     PORT=${PORT:-9090} \
+     PIP_CACHE_DIR=/.cache \
+     WORKERS=1 \
+     THREADS=8
+
+ # Update the base OS
+ RUN --mount=type=cache,target="/var/cache/apt",sharing=locked \
+     --mount=type=cache,target="/var/lib/apt/lists",sharing=locked \
+     set -eux; \
+     apt-get update; \
+     apt-get upgrade -y; \
+     apt install --no-install-recommends -y \
+         git; \
+     apt-get autoremove -y
+
+ # install base requirements
+ COPY requirements-base.txt .
+ RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
+     pip install -r requirements-base.txt
+
+ # install custom requirements
+ COPY requirements.txt .
+ RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
+     pip install -r requirements.txt
+
+ # install test requirements if needed
+ COPY requirements-test.txt .
+ # build only when TEST_ENV="true"
+ RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
+     if [ "$TEST_ENV" = "true" ]; then \
+       pip install -r requirements-test.txt; \
+     fi
+
+ COPY . .
+
+ EXPOSE 9090
+
+ CMD gunicorn --preload --bind :$PORT --workers $WORKERS --threads $THREADS --timeout 0 _wsgi:app
README.md CHANGED
@@ -1,10 +1,156 @@
  ---
- title: Ls Huggingface Ner Backend
- emoji: 📉
- colorFrom: blue
- colorTo: indigo
+ title: LS Hugging Face NER Backend
+ emoji: 🏷️
+ colorFrom: purple
+ colorTo: pink
  sdk: docker
+ app_port: 9090
  pinned: false
+ license: apache-2.0
+ short_description: HF NER models as a Label Studio ML backend
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # LS Hugging Face NER Backend (Hugging Face Spaces)
+
+ This Space wraps HumanSignal's [`huggingface_ner` ML backend example](https://github.com/HumanSignal/label-studio-ml-backend/tree/master/label_studio_ml/examples/huggingface_ner) as a Hugging Face Space, so it can be used as a `/predict` backend for a Label Studio instance.
+
+ **Patches from the upstream example (minimal):**
+ - Added Spaces SDK frontmatter at the top of this README.
+ - Removed `docker-compose.yml` (not used on Spaces).
+
+ The Dockerfile already binds to `$PORT` and listens on `0.0.0.0`, so it deploys as-is. The default base image is `python:3.11-slim` (CPU), which is fine for small NER models like `dslim/bert-base-NER`. For larger models, upgrade the hardware to a GPU tier.
+
+ Connect from Label Studio: set the ML backend URL to `https://davanstrien-ls-huggingface-ner-backend.hf.space`.
+
+ ---
+
+ <!-- Original upstream README below -->
+
+ <!--
+ ---
+ title: Hugging Face NER
+ type: guide
+ tier: all
+ order: 25
+ hide_menu: true
+ hide_frontmatter_title: true
+ meta_title: Label Studio tutorial to run Hugging Face NER backend
+ meta_description: This tutorial explains how to run a Hugging Face NER backend in Label Studio.
+ categories:
+     - Natural Language Processing
+     - Named Entity Recognition
+     - Hugging Face
+ image: "/guide/ml_tutorials/hf-ner.png"
+ ---
+ -->
+
+ # Hugging Face NER model with Label Studio
+
+ This project uses a custom machine learning backend model for Named Entity Recognition (NER) with Hugging Face's transformers and Label Studio.
+
+ The model instantiates `AutoModelForTokenClassification` from Hugging Face's transformers library and fine-tunes it on the NER task.
+
+ - If you want to use this model only in inference mode, it serves predictions from the pre-trained model.
+ - If you want to fine-tune the model, you can use the Label Studio interface to provide training data and train the model.
+
+ Read more about the compatible models from [Hugging Face's official documentation](https://huggingface.co/docs/transformers/en/tasks/token_classification).
+
+ ## Before you begin
+
+ Before you begin, you must install the [Label Studio ML backend](https://github.com/HumanSignal/label-studio-ml-backend?tab=readme-ov-file#quickstart).
+
+ This tutorial uses the [`huggingface_ner` example](https://github.com/HumanSignal/label-studio-ml-backend/tree/master/label_studio_ml/examples/huggingface_ner).
+
+
+ ## Labeling configuration
+
+ This ML backend works with the default NER template from Label Studio. You can find this by selecting Label Studio's pre-built NER template when configuring the labeling interface. It is available under **Natural Language Processing > Named Entity Recognition**:
+
+ ```xml
+ <View>
+   <Labels name="label" toName="text">
+     <Label value="PER" background="red"/>
+     <Label value="ORG" background="darkorange"/>
+     <Label value="LOC" background="orange"/>
+     <Label value="MISC" background="green"/>
+   </Labels>
+
+   <Text name="text" value="$text"/>
+ </View>
+ ```
+
+ You can then customize the template to suit your needs (for example, modifying the label names). However, note the following about model output compatibility:
+
+ > If you plan to use your model only for inference, make sure the output label names are compatible with those listed in the XML labeling configuration. If you plan to train the model, you must provide a baseline pretrained model that can be fine-tuned (i.e., one whose last layer can be trained, for example `distilbert/distilbert-base-uncased`). Otherwise, you may see a tensor size mismatch error during training.
+
+ ## Running with Docker (recommended)
+
+ 1. Start the Machine Learning backend on `http://localhost:9090` with the prebuilt image:
+
+ ```bash
+ docker-compose up
+ ```
+
+ 2. Validate that the backend is running:
+
+ ```bash
+ $ curl http://localhost:9090/
+ {"status":"UP"}
+ ```
+
+ 3. Create a project in Label Studio. Then from the **Model** page in the project settings, [connect the model](https://labelstud.io/guide/ml#Connect-the-model-to-Label-Studio). The default URL is `http://localhost:9090`.
+
+
+ ## Building from source (advanced)
+
+ To build the ML backend from source, you have to clone the repository and build the Docker image:
+
+ ```bash
+ docker-compose build
+ ```
+
+ ## Running without Docker (advanced)
+
+ To run the ML backend without Docker, you have to clone the repository and install all dependencies using pip:
+
+ ```bash
+ python -m venv ml-backend
+ source ml-backend/bin/activate
+ pip install -r requirements.txt
+ ```
+
+ Then you can start the ML backend:
+
+ ```bash
+ label-studio-ml start ./huggingface_ner
+ ```
+
+ # Configuration
+
+ Parameters can be set in `docker-compose.yml` before running the container.
+
+
+ The following common parameters are available:
+ - `BASIC_AUTH_USER` - Specify the basic auth user for the model server
+ - `BASIC_AUTH_PASS` - Specify the basic auth password for the model server
+ - `LOG_LEVEL` - Set the log level for the model server
+ - `WORKERS` - Specify the number of workers for the model server
+ - `THREADS` - Specify the number of threads for the model server
+ - `BASELINE_MODEL_NAME`: The name of the baseline model to use. Default is `dslim/bert-base-NER`.
+ - `FINETUNED_MODEL_NAME`: The name of the fine-tuned model. Default is `finetuned_model`.
+ - `LABEL_STUDIO_HOST`: The host of the Label Studio instance. Default is 'http://localhost:8080'.
+ - `LABEL_STUDIO_API_KEY`: The API key for the Label Studio instance.
+ - `START_TRAINING_EACH_N_UPDATES`: The number of updates after which to start training. Default is `10`.
+ - `LEARNING_RATE`: The learning rate for the model. Default is `1e-3`.
+ - `NUM_TRAIN_EPOCHS`: The number of training epochs. Default is `10`.
+ - `WEIGHT_DECAY`: The weight decay for the model. Default is `0.01`.
+ - `MODEL_DIR`: The directory where the model is stored. Default is `'./results'`.
+
+ > Note: The `LABEL_STUDIO_API_KEY` is required for training the model. This can be found by logging
+ into Label Studio and [going to the **Account & Settings** page](https://labelstud.io/guide/user_account#Access-token).
+
+ # Customization
+
+ The ML backend can be customized by adding your own models and logic inside `./huggingface_ner/model.py`.
+
+ Modify the `predict()` and `fit()` methods to implement your own logic.
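
The README's health check and the `/predict` request shape used in `test_api.py` can be sketched in Python as follows. This is a minimal illustration, not part of the commit: `is_up`, `build_predict_request`, and `check_health` are hypothetical helper names, and `BASE_URL` assumes the local default from the README (for the Space, substitute the `hf.space` URL given above).

```python
import json
import urllib.request

BASE_URL = "http://localhost:9090"  # assumption: local default from the README


def is_up(body: str) -> bool:
    """Return True when a health response body reports status UP."""
    return json.loads(body).get("status") == "UP"


def build_predict_request(text: str, label_config: str) -> bytes:
    """Encode a single-task /predict payload in the shape test_api.py uses."""
    payload = {"tasks": [{"data": {"text": text}}], "label_config": label_config}
    return json.dumps(payload).encode("utf-8")


def check_health(base_url: str = BASE_URL) -> bool:
    """Fetch the root endpoint and report whether the backend is up.

    Not called here, because it needs a running backend.
    """
    with urllib.request.urlopen(base_url + "/", timeout=10) as resp:
        return is_up(resp.read().decode("utf-8"))


# Offline demonstration using the response body shown in the README.
print(is_up('{"status":"UP"}'))
```

Keeping the parsing and payload construction separate from the network call makes both pieces checkable without a live server.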
_wsgi.py ADDED
@@ -0,0 +1,122 @@
+ import os
+ import argparse
+ import json
+ import logging
+ import logging.config
+
+ logging.config.dictConfig({
+     "version": 1,
+     "disable_existing_loggers": False,
+     "formatters": {
+         "standard": {
+             "format": "[%(asctime)s] [%(levelname)s] [%(name)s::%(funcName)s::%(lineno)d] %(message)s"
+         }
+     },
+     "handlers": {
+         "console": {
+             "class": "logging.StreamHandler",
+             "level": os.getenv('LOG_LEVEL'),
+             "stream": "ext://sys.stdout",
+             "formatter": "standard"
+         }
+     },
+     "root": {
+         "level": os.getenv('LOG_LEVEL'),
+         "handlers": [
+             "console"
+         ],
+         "propagate": True
+     }
+ })
+
+ from label_studio_ml.api import init_app
+ from model import HuggingFaceNER
+
+
+ _DEFAULT_CONFIG_PATH = os.path.join(os.path.dirname(__file__), 'config.json')
+
+
+ def get_kwargs_from_config(config_path=_DEFAULT_CONFIG_PATH):
+     if not os.path.exists(config_path):
+         return dict()
+     with open(config_path) as f:
+         config = json.load(f)
+     assert isinstance(config, dict)
+     return config
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description='Label studio')
+     parser.add_argument(
+         '-p', '--port', dest='port', type=int, default=9090,
+         help='Server port')
+     parser.add_argument(
+         '--host', dest='host', type=str, default='0.0.0.0',
+         help='Server host')
+     parser.add_argument(
+         '--kwargs', '--with', dest='kwargs', metavar='KEY=VAL', nargs='+', type=lambda kv: kv.split('='),
+         help='Additional LabelStudioMLBase model initialization kwargs')
+     parser.add_argument(
+         '-d', '--debug', dest='debug', action='store_true',
+         help='Switch debug mode')
+     parser.add_argument(
+         '--log-level', dest='log_level', choices=['DEBUG', 'INFO', 'WARNING', 'ERROR'], default=None,
+         help='Logging level')
+     parser.add_argument(
+         '--model-dir', dest='model_dir', default=os.path.dirname(__file__),
+         help='Directory where models are stored (relative to the project directory)')
+     parser.add_argument(
+         '--check', dest='check', action='store_true',
+         help='Validate model instance before launching server')
+     parser.add_argument('--basic-auth-user',
+                         default=os.environ.get('ML_SERVER_BASIC_AUTH_USER', None),
+                         help='Basic auth user')
+
+     parser.add_argument('--basic-auth-pass',
+                         default=os.environ.get('ML_SERVER_BASIC_AUTH_PASS', None),
+                         help='Basic auth pass')
+
+     args = parser.parse_args()
+
+     # setup logging level
+     if args.log_level:
+         logging.root.setLevel(args.log_level)
+
+     def isfloat(value):
+         try:
+             float(value)
+             return True
+         except ValueError:
+             return False
+
+     def parse_kwargs():
+         param = dict()
+         for k, v in args.kwargs:
+             if v.isdigit():
+                 param[k] = int(v)
+             elif v == 'True' or v == 'true':
+                 param[k] = True
+             elif v == 'False' or v == 'false':
+                 param[k] = False
+             elif isfloat(v):
+                 param[k] = float(v)
+             else:
+                 param[k] = v
+         return param
+
+     kwargs = get_kwargs_from_config()
+
+     if args.kwargs:
+         kwargs.update(parse_kwargs())
+
+     if args.check:
+         print('Check "' + HuggingFaceNER.__name__ + '" instance creation..')
+         model = HuggingFaceNER(**kwargs)
+
+     app = init_app(model_class=HuggingFaceNER, basic_auth_user=args.basic_auth_user, basic_auth_pass=args.basic_auth_pass)
+
+     app.run(host=args.host, port=args.port, debug=args.debug)
+
+ else:
+     # for uWSGI use
+     app = init_app(model_class=HuggingFaceNER)
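
The `parse_kwargs` helper in `_wsgi.py` coerces each `KEY=VAL` value by trying integer, boolean, and float interpretations in that order before falling back to the raw string. A minimal standalone sketch of that ordering follows. `coerce_value` and `parse_pairs` are hypothetical names, and splitting on the first `=` only (via `str.partition`) is a deliberate deviation from the `kv.split('=')` used in the diff.

```python
def coerce_value(raw: str):
    """Coerce a string in the same order as parse_kwargs: int, bool, float, str."""
    if raw.isdigit():
        return int(raw)
    if raw in ("True", "true"):
        return True
    if raw in ("False", "false"):
        return False
    try:
        return float(raw)
    except ValueError:
        return raw


def parse_pairs(pairs):
    """Turn ['key=value', ...] into a dict, splitting on the first '=' only."""
    params = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        params[key] = coerce_value(value)
    return params


params = parse_pairs(["epochs=10", "lr=1e-3", "debug=true", "offset=-5", "name=a=b"])
print(params)
```

One consequence of this ordering is that a negative integer such as `-5` comes back as the float `-5.0`, because `str.isdigit()` rejects the sign and the value falls through to the float branch.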
model.py ADDED
@@ -0,0 +1,260 @@
+ import os
+ import pathlib
+ import re
+ import label_studio_sdk
+ import logging
+
+ from typing import List, Dict, Optional
+ from label_studio_ml.model import LabelStudioMLBase
+ from label_studio_ml.response import ModelResponse
+ from transformers import pipeline, Pipeline
+ from itertools import groupby
+ from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer, AutoTokenizer
+ from transformers import DataCollatorForTokenClassification
+ from datasets import Dataset, ClassLabel, Value, Sequence, Features
+ from functools import partial
+
+ logger = logging.getLogger(__name__)
+ _model: Optional[Pipeline] = None
+ MODEL_DIR = os.getenv('MODEL_DIR', './results')
+ BASELINE_MODEL_NAME = os.getenv('BASELINE_MODEL_NAME', 'dslim/bert-base-NER')
+ FINETUNED_MODEL_NAME = os.getenv('FINETUNED_MODEL_NAME', 'finetuned_model')
+
+
+ def reload_model():
+     global _model
+     _model = None
+     try:
+         chk_path = str(pathlib.Path(MODEL_DIR) / FINETUNED_MODEL_NAME)
+         logger.info(f"Loading finetuned model from {chk_path}")
+         _model = pipeline("ner", model=chk_path, tokenizer=chk_path)
+     except Exception:
+         # if finetuned model is not available, use the baseline model with the original labels
+         logger.info(f"Loading baseline model {BASELINE_MODEL_NAME}")
+         _model = pipeline("ner", model=BASELINE_MODEL_NAME, tokenizer=BASELINE_MODEL_NAME)
+
+
+ reload_model()
+
+
+ class HuggingFaceNER(LabelStudioMLBase):
+     """Custom ML Backend model
+     """
+     LABEL_STUDIO_HOST = os.getenv('LABEL_STUDIO_HOST', 'http://localhost:8080')
+     LABEL_STUDIO_API_KEY = os.getenv('LABEL_STUDIO_API_KEY')
+     START_TRAINING_EACH_N_UPDATES = int(os.getenv('START_TRAINING_EACH_N_UPDATES', 10))
+     LEARNING_RATE = float(os.getenv('LEARNING_RATE', 1e-3))
+     NUM_TRAIN_EPOCHS = int(os.getenv('NUM_TRAIN_EPOCHS', 10))
+     WEIGHT_DECAY = float(os.getenv('WEIGHT_DECAY', 0.01))
+
+     def get_labels(self):
+         li = self.label_interface
+         from_name, _, _ = li.get_first_tag_occurence('Labels', 'Text')
+         tag = li.get_tag(from_name)
+         return tag.labels
+
+     def setup(self):
+         """Configure any parameters of your model here
+         """
+         self.set("model_version", f'{self.__class__.__name__}-v0.0.1')
+
+     def predict(self, tasks: List[Dict], context: Optional[Dict] = None, **kwargs) -> ModelResponse:
+         """ Write your inference logic here
+             :param tasks: [Label Studio tasks in JSON format](https://labelstud.io/guide/task_format.html)
+             :param context: [Label Studio context in JSON format](https://labelstud.io/guide/ml_create#Implement-prediction-logic)
+             :return model_response
+                 ModelResponse(predictions=predictions) with
+                 predictions: [Predictions array in JSON format](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks)
+         """
+         li = self.label_interface
+         from_name, to_name, value = li.get_first_tag_occurence('Labels', 'Text')
+         texts = [self.preload_task_data(task, task['data'][value]) for task in tasks]
+
+         # run predictions
+         model_predictions = _model(texts)
+
+         predictions = []
+         for prediction in model_predictions:
+             # prediction returned in the format: [{'entity': 'B-ORG', 'score': 0.999, 'index': 1, 'start': 0, 'end': 7, 'word': 'Google'}, ...]
+             # we need to group them by 'B-' and 'I-' prefixes to form entities
+             results = []
+             avg_score = 0
+             for label, group in groupby(prediction, key=lambda x: re.sub(r'^[BI]-', '', x['entity'])):
+                 entities = list(group)
+                 start = entities[0]['start']
+                 end = entities[-1]['end']
+                 score = float(sum([entity['score'] for entity in entities]) / len(entities))
+                 results.append({
+                     'from_name': from_name,
+                     'to_name': to_name,
+                     'type': 'labels',
+                     'value': {
+                         'start': start,
+                         'end': end,
+                         'labels': [label]
+                     },
+                     'score': score
+                 })
+                 avg_score += score
+             if results:
+                 predictions.append({
+                     'result': results,
+                     'score': avg_score / len(results),
+                     'model_version': self.get('model_version')
+                 })
+
+         return ModelResponse(predictions=predictions, model_version=self.get('model_version'))
+
+     def _get_tasks(self, project_id):
+         # download annotated tasks from Label Studio
+         ls = label_studio_sdk.Client(self.LABEL_STUDIO_HOST, self.LABEL_STUDIO_API_KEY)
+         project = ls.get_project(id=project_id)
+         tasks = project.get_labeled_tasks()
+         return tasks
+
+     def tokenize_and_align_labels(self, examples, tokenizer):
+         """
+         From example https://huggingface.co/docs/transformers/en/tasks/token_classification#preprocess
+         """
+         tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
+
+         labels = []
+         for i, label in enumerate(examples["ner_tags"]):
+             word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
+             previous_word_idx = None
+             label_ids = []
+             for word_idx in word_ids:  # Set the special tokens to -100.
+                 if word_idx is None:
+                     label_ids.append(-100)
+                 elif word_idx != previous_word_idx:  # Only label the first token of a given word.
+                     label_ids.append(label[word_idx])
+                 else:
+                     label_ids.append(-100)
+                 previous_word_idx = word_idx
+             labels.append(label_ids)
+
+         tokenized_inputs["labels"] = labels
+         return tokenized_inputs
+
+     def fit(self, event, data, **kwargs):
+         """Download dataset from Label Studio and prepare data for training in BERT
+         """
+         if event not in ('ANNOTATION_CREATED', 'ANNOTATION_UPDATED', 'START_TRAINING'):
+             logger.info(f"Skip training: event {event} is not supported")
+             return
+
+         # Get project from annotation first if present, otherwise fall back to top-level project field
+         project = data.get('annotation', {}).get('project') or data.get('project')
+         # Handle both possible formats
+         if isinstance(project, dict):
+             project_id = project.get('id')
+         else:
+             project_id = project
+         # If project_id is still None, log and safely exit
+         if project_id is None:
+             logger.error(f"Cannot find project_id in webhook payload: {data}")
+             return
+
+         tasks = self._get_tasks(project_id)
+
+         if len(tasks) % self.START_TRAINING_EACH_N_UPDATES != 0 and event != 'START_TRAINING':
+             logger.info(f"Skip training: {len(tasks)} tasks are not multiple of {self.START_TRAINING_EACH_N_UPDATES}")
+             return
+
+         # we need to convert Label Studio NER annotations to the Hugging Face NER format in datasets
+         # for example:
+         # {'id': '0',
+         #  'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
+         #  'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
+         # }
+         ds_raw = []
+         from_name, to_name, value = self.label_interface.get_first_tag_occurence('Labels', 'Text')
+         tokenizer = AutoTokenizer.from_pretrained(BASELINE_MODEL_NAME)
+
+         no_label = 'O'
+         label_to_id = {no_label: 0}
+         for task in tasks:
+             for annotation in task['annotations']:
+                 if not annotation.get('result'):
+                     continue
+                 spans = [{'label': r['value']['labels'][0], 'start': r['value']['start'], 'end': r['value']['end']} for r in annotation['result']]
+                 spans = sorted(spans, key=lambda x: x['start'])
+                 text = self.preload_task_data(task, task['data'][value])
+
+                 # add 'O'-labeled spans for the unlabeled chunks of text between the labeled spans, as well as at the beginning and end of the text
+                 last_end = 0
+                 all_spans = []
+                 for span in spans:
+                     if last_end < span['start']:
+                         all_spans.append({'label': no_label, 'start': last_end, 'end': span['start']})
+                     all_spans.append(span)
+                     last_end = span['end']
+                 if last_end < len(text):
+                     all_spans.append({'label': no_label, 'start': last_end, 'end': len(text)})
+
+                 # now tokenize chunks separately and add them to the dataset
+                 item = {'id': task['id'], 'tokens': [], 'ner_tags': []}
+                 for span in all_spans:
+                     tokens = tokenizer.tokenize(text[span['start']:span['end']])
+                     item['tokens'].extend(tokens)
+                     if span['label'] == no_label:
+                         item['ner_tags'].extend([label_to_id[no_label]] * len(tokens))
+                     else:
+                         label = 'B-' + span['label']
+                         if label not in label_to_id:
+                             label_to_id[label] = len(label_to_id)
+                         item['ner_tags'].append(label_to_id[label])
+                         if len(tokens) > 1:
+                             label = 'I-' + span['label']
+                             if label not in label_to_id:
+                                 label_to_id[label] = len(label_to_id)
+                             item['ner_tags'].extend([label_to_id[label] for _ in range(1, len(tokens))])
+                 ds_raw.append(item)
+
+         logger.debug(f"Dataset: {ds_raw}")
+         # convert to huggingface dataset
+         # Define the features of your dataset
+         features = Features({
+             'id': Value('string'),
+             'tokens': Sequence(Value('string')),
+             'ner_tags': Sequence(ClassLabel(names=list(label_to_id.keys())))
+         })
+         hf_dataset = Dataset.from_list(ds_raw, features=features)
+         tokenized_dataset = hf_dataset.map(partial(self.tokenize_and_align_labels, tokenizer=tokenizer), batched=True)
+
+         logger.debug(f"HF Dataset: {tokenized_dataset}")
+
+         data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
+         id_to_label = {i: label for label, i in label_to_id.items()}
+         logger.debug(f"Labels: {id_to_label}")
+
+         model = AutoModelForTokenClassification.from_pretrained(
+             BASELINE_MODEL_NAME, num_labels=len(id_to_label),
+             id2label=id_to_label, label2id=label_to_id)
+         logger.debug(f"Model: {model}")
+
+         training_args = TrainingArguments(
+             output_dir=str(pathlib.Path(MODEL_DIR) / FINETUNED_MODEL_NAME),
+             learning_rate=self.LEARNING_RATE,
+             per_device_train_batch_size=8,
+             num_train_epochs=self.NUM_TRAIN_EPOCHS,
+             weight_decay=self.WEIGHT_DECAY,
+             evaluation_strategy="no",
+         )
+
+         trainer = Trainer(
+             model=model,
+             args=training_args,
+             train_dataset=tokenized_dataset,
+             tokenizer=tokenizer,
+             data_collator=data_collator,
+         )
+         trainer.train()
+
+         chk_path = str(pathlib.Path(MODEL_DIR) / FINETUNED_MODEL_NAME)
+         logger.info(f"Model is trained and saved as {chk_path}")
+         trainer.save_model(chk_path)
+
+         # reload model
+         # TODO: this is not thread-safe, should be done with critical section
+         reload_model()
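
The span-merging step in `predict()` groups consecutive token predictions whose labels match once the `B-`/`I-` prefix is stripped. The sketch below isolates that step so it can be run without loading a model. `merge_token_predictions` is a hypothetical name, and the sample token dictionaries are invented for illustration, shaped like the pipeline output described in the code comment.

```python
import re
from itertools import groupby


def merge_token_predictions(tokens):
    """Merge consecutive tokens sharing an entity type into character spans."""
    spans = []
    # groupby only merges *adjacent* items, so token order matters.
    for label, group in groupby(tokens, key=lambda t: re.sub(r"^[BI]-", "", t["entity"])):
        items = list(group)
        spans.append({
            "label": label,
            "start": items[0]["start"],   # start of the first token in the run
            "end": items[-1]["end"],      # end of the last token in the run
            "score": sum(t["score"] for t in items) / len(items),
        })
    return spans


# Invented sample: two tokens forming a PER entity, two forming a LOC entity.
sample = [
    {"entity": "B-PER", "score": 1.0, "start": 10, "end": 12},
    {"entity": "I-PER", "score": 0.5, "start": 12, "end": 15},
    {"entity": "B-LOC", "score": 0.75, "start": 44, "end": 47},
    {"entity": "I-LOC", "score": 0.75, "start": 48, "end": 52},
]
spans = merge_token_predictions(sample)
print(spans)
```

Because the grouping key discards the prefix, two back-to-back entities of the same type (for example two `B-PER` tokens in a row) would be merged into a single span under this scheme.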
requirements-base.txt ADDED
@@ -0,0 +1,2 @@
+ gunicorn==23.0.0
+ label-studio-ml @ git+https://github.com/HumanSignal/label-studio-ml-backend.git
requirements-test.txt ADDED
@@ -0,0 +1,2 @@
+ pytest
+ pytest-cov
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ transformers==4.30.2
+ datasets==2.18.0
+ accelerate==0.28.0
+
test_api.py ADDED
@@ -0,0 +1,232 @@
+ """
+ This file contains tests for the API of your model. You can run these tests by installing test requirements:
+
+     ```bash
+     pip install -r requirements-test.txt
+     ```
+ Then execute `pytest` in the directory of this file.
+
+ - Change `NewModel` to the name of the class in your model.py file.
+ - Change the `request` and `expected_response` variables to match the input and output of your model.
+ """
+
+ import pytest
+ import json
+ from model import HuggingFaceNER
+ import unittest.mock as mock
+
+
+ @pytest.fixture
+ def client():
+     from _wsgi import init_app
+     app = init_app(model_class=HuggingFaceNER)
+     app.config['TESTING'] = True
+     with app.test_client() as client:
+         yield client
+
+
+ def test_predict(client):
+     request = {
+         'tasks': [{
+             'data': {
+                 'text': 'President Obama is speaking at 3pm today in New York.'
+             }
+         }],
+         # Your labeling configuration here
+         'label_config': '''
+         <View>
+           <Text name="text" value="$text"/>
+           <Labels name="ner" toName="text">
+             <Label value="Person"/>
+             <Label value="Location"/>
+             <Label value="Time"/>
+           </Labels>
+         </View>
+         '''
+     }
+
+     expected_response = {
+         'results': [{
+             'model_version': 'HuggingFaceNER-v0.0.1',
+             'result': [{
+                 'from_name': 'ner',
+                 'score': 0.9974774718284607,
+                 'to_name': 'text',
+                 'type': 'labels',
+                 'value': {
+                     'end': 15,
+                     'labels': ['PER'],
+                     'start': 10}},
+                 {'from_name': 'ner',
+                  'score': 0.9994751214981079,
+                  'to_name': 'text',
+                  'type': 'labels',
+                  'value': {'end': 52,
+                            'labels': ['LOC'],
+                            'start': 44}}],
+             'score': 0.9984762966632843}]
+     }
+
+     response = client.post('/predict', data=json.dumps(request), content_type='application/json')
+     assert response.status_code == 200
+     response = json.loads(response.data)
+     assert response['results'][0]['model_version'] == expected_response['results'][0]['model_version']
+     assert response['results'][0]['result'][0]['value'] == expected_response['results'][0]['result'][0]['value']
+     assert response['results'][0]['result'][1]['value'] == expected_response['results'][0]['result'][1]['value']
+
+
+ # mock response of label_studio_sdk.Project.get_labeled_tasks() and return the list of Label Studio tasks with NER annotations
+ def get_labeled_tasks_mock(self, project_id):
+     return [
+         {
+             'id': '0',
+             'data': {'text': 'President Obama is speaking at 3pm today in New York'},
+             'annotations': [
+                 {
+                     'result': [
+                         {
+                             'from_name': 'ner',
+                             'to_name': 'text',
+                             'type': 'labels',
+                             'value': {
+                                 'start': 10,
+                                 'end': 15,
+                                 'labels': ['Person']
+                             }
+                         },
+                         {
+                             'from_name': 'ner',
+                             'to_name': 'text',
+                             'type': 'labels',
+                             'value': {
+                                 'start': 44,
+                                 'end': 52,
+                                 'labels': ['Location']
+                             }
+                         },
+                         {
+                             'from_name': 'ner',
+                             'to_name': 'text',
+                             'type': 'labels',
+                             'value': {
+                                 'start': 31,
+                                 'end': 40,
+                                 'labels': ['Time']
+                             }
+                         }
+                     ]
+                 }
+             ]
+         }
+     ]
+
+
+ # mock HuggingFaceNER.START_TRAINING_EACH_N_UPDATES to 1 to trigger training in the test
+ @pytest.fixture
+ def mock_start_training():
+     with mock.patch.object(HuggingFaceNER, 'START_TRAINING_EACH_N_UPDATES', new=1):
+         yield
+
+
+ @pytest.fixture
+ def mock_get_labeled_tasks():
+     with mock.patch.object(HuggingFaceNER, '_get_tasks', new=get_labeled_tasks_mock):
+         yield
+
+
+ @pytest.fixture
+ def mock_baseline_model_name_for_train():
+     with mock.patch('model.BASELINE_MODEL_NAME', new='distilbert/distilbert-base-uncased'):
+         yield
+
+
+ def test_fit(client, mock_get_labeled_tasks, mock_start_training, mock_baseline_model_name_for_train):
+     request = {
+         'action': 'ANNOTATION_CREATED',
+         'project': {
+             'id': 12345,
+             'label_config': '''
+             <View>
+               <Text name="text" value="$text"/>
+               <Labels name="ner" toName="text">
+                 <Label value="Person"/>
+                 <Label value="Location"/>
+                 <Label value="Time"/>
+               </Labels>
+             </View>
+             '''
+         },
+         'annotation': {
+             'project': 12345
+         }
+     }
+
+     response = client.post('/webhook', data=json.dumps(request), content_type='application/json')
+     assert response.status_code == 201
+
+     # assert new model is created in ./results/finetuned_model directory
+     import os
+     from model import MODEL_DIR
+     results_dir = os.path.join(MODEL_DIR, 'finetuned_model')
+     assert os.path.exists(os.path.join(results_dir, 'pytorch_model.bin'))
+
+     # now let's test whether the model is trained by running predict
+     request = {
+         'tasks': [{
+             'data': {
+                 'text': 'President Obama is speaking at 3pm today in New York.'
+             }
+         }],
+         # Your labeling configuration here
+         'label_config': '''
+         <View>
+           <Text name="text" value="$text"/>
+           <Labels name="ner" toName="text">
+             <Label value="Person"/>
+             <Label value="Location"/>
+             <Label value="Time"/>
+           </Labels>
+         </View>
+         '''
+     }
+
+     response = client.post('/predict', data=json.dumps(request), content_type='application/json')
+     assert response.status_code == 200
+
+     # TODO: we also need to check the prediction results to make sure the model is trained correctly
+     # but the training needs to be deterministic to make the test stable
+     # assert response is as expected
+
+     # remove './results/finetuned_model' directory after testing
+     import shutil
+     shutil.rmtree(results_dir)
+
+ def test_fit_missing_annotation(monkeypatch):
+     # Initialize the model
+     model = HuggingFaceNER()
+
+     # Mock label_interface to avoid AttributeError
+     model.label_interface = mock.MagicMock()
+     # Mock get_first_tag_occurence to return fake values
+     model.label_interface.get_first_tag_occurence.return_value = ('Labels', 'Text', 'text_field_name')
+
+     # Mock data payload with annotation missing, only project present
+     payload = {
+         "action": "ANNOTATION_UPDATED",
+         "project": {"id": 123, "name": "Test Project"}
+     }
+
+     # Monkeypatch _get_tasks to return one fake task
+     monkeypatch.setattr(model, "_get_tasks", lambda project_id: [
+         {
+             "id": "1",
+             "data": {"text_field_name": "Hello world"},
+             "annotations": []
+         }
+     ])
+
+     # Call fit()
+     try:
+         model.fit(event="ANNOTATION_UPDATED", data=payload)
+     except Exception as e:
+         pytest.fail(f"fit() raised an exception when annotation is missing: {e}")
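
`test_fit_missing_annotation` exercises the project-ID fallback at the top of `fit()`: the ID is taken from `annotation.project` when present, otherwise from the top-level `project` field, which may be either a bare ID or a dictionary. A minimal sketch of that lookup in isolation, using a hypothetical helper name `extract_project_id`:

```python
from typing import Any, Dict, Optional


def extract_project_id(data: Dict[str, Any]) -> Optional[Any]:
    """Prefer annotation.project, then fall back to the top-level project field."""
    project = data.get("annotation", {}).get("project") or data.get("project")
    # The project may arrive as a full object or as a bare ID.
    if isinstance(project, dict):
        return project.get("id")
    return project


# Payload shapes taken from the two fit() tests in test_api.py.
with_annotation = {"project": {"id": 12345}, "annotation": {"project": 12345}}
project_only = {"action": "ANNOTATION_UPDATED", "project": {"id": 123, "name": "Test Project"}}

print(extract_project_id(with_annotation))
print(extract_project_id(project_only))
print(extract_project_id({}))
```

A `None` result corresponds to the branch in `fit()` that logs an error and returns without training.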