Spaces: davanstrien/ls-gliner-backend

davanstrien (HF Staff) committed 12 days ago

Commit e5b9c3b (verified) · 1 parent: 636747c

Initial: HumanSignal gliner example patched for HF Spaces


Files changed (8)

Dockerfile +48 -0
README.md +123 -5
_wsgi.py +122 -0
model.py +261 -0
requirements-base.txt +2 -0
requirements-test.txt +2 -0
requirements.txt +5 -0
test_api.py +68 -0

Dockerfile ADDED

	@@ -0,0 +1,48 @@

+# syntax=docker/dockerfile:1
+ARG PYTHON_VERSION=3.11
+FROM python:${PYTHON_VERSION}-slim AS python-base
+ARG TEST_ENV
+WORKDIR /app
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PORT=${PORT:-9090} \
+    PIP_CACHE_DIR=/.cache \
+    WORKERS=1 \
+    THREADS=8
+# Update the base OS
+RUN --mount=type=cache,target="/var/cache/apt",sharing=locked \
+    --mount=type=cache,target="/var/lib/apt/lists",sharing=locked \
+    set -eux; \
+    apt-get update; \
+    apt-get upgrade -y; \
+    apt-get install --no-install-recommends -y \
+        git; \
+    apt-get autoremove -y
+# install base requirements
+COPY requirements-base.txt .
+RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
+    pip install -r requirements-base.txt
+# install custom requirements
+COPY requirements.txt .
+RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
+    pip install -r requirements.txt
+# install test requirements if needed
+COPY requirements-test.txt .
+# build only when TEST_ENV="true"
+RUN --mount=type=cache,target=${PIP_CACHE_DIR},sharing=locked \
+    if [ "$TEST_ENV" = "true" ]; then \
+      pip install -r requirements-test.txt; \
+    fi
+COPY . .
+EXPOSE 9090
+CMD gunicorn --preload --bind :$PORT --workers $WORKERS --threads $THREADS --timeout 0 _wsgi:app
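
Review note: once the container is running, gunicorn binds to `$PORT` (9090 unless overridden), and the upstream README included in this commit says `GET /` answers with `{"status":"UP"}`. A minimal health-check sketch under those assumptions follows. The helper names `health_url`, `is_up`, and `check_backend` are hypothetical and not part of the commit.

```python
import json
import os
import urllib.request


def health_url(env=None, host="localhost"):
    """Build the health-check URL; PORT falls back to 9090 when unset or empty."""
    env = os.environ if env is None else env
    port = env.get("PORT") or "9090"
    return f"http://{host}:{port}/"


def is_up(body):
    """Return True when the response body is JSON with status == "UP"."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


def check_backend(url, timeout=5.0):
    """GET the URL and report whether the backend answered as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_up(resp.read())
    except OSError:
        # Connection refused, DNS failure, HTTP error, timeout, and so on.
        return False
```

Calling `check_backend(health_url())` from a shell inside or next to the container should return `True` once the model server has started; any connection problem is reported as `False` rather than raised.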

README.md CHANGED

@@ -1,10 +1,128 @@
 ---
-title: Ls Gliner Backend
-emoji: 😻
-colorFrom: green
-colorTo: red
 sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: LS GLiNER Backend
+emoji: 🪄
+colorFrom: pink
+colorTo: purple
 sdk: docker
+app_port: 9090
 pinned: false
+license: apache-2.0
+short_description: GLiNER zero-shot NER as a Label Studio ML backend
 ---
+# LS GLiNER Backend (Hugging Face Spaces)
+This Space wraps HumanSignal's [`gliner` ML backend example](https://github.com/HumanSignal/label-studio-ml-backend/tree/master/label_studio_ml/examples/gliner) as a Hugging Face Space. GLiNER is a zero-shot NER model — it accepts arbitrary user-defined labels at inference time, so it can predict any label your LS project's config defines without retraining.
+Default model: `urchade/gliner_medium-v2.1` (~750MB). Override via `GLINER_MODEL_NAME` env var.
+**Patches from the upstream example (minimal):**
+- Added Spaces SDK frontmatter at the top of this README.
+- Removed `docker-compose.yml` (not used on Spaces).
+Connect from Label Studio: set the ML backend URL to `https://davanstrien-ls-gliner-backend.hf.space`.
+---
+<!-- Original upstream README below -->
+<!--
+---
+title: Use GLiNER for NER annotation
+type: guide
+tier: all
+order: 37
+hide_menu: true
+hide_frontmatter_title: true
+meta_title: Use GLiNER for NER annotation
+meta_description: Tutorial on how to use GLiNER with your Label Studio project to complete NER tasks
+categories:
+    - Natural Language Processing
+    - Named Entity Recognition
+    - GLiNER
+    - BERT
+    - Hugging Face
+image: "/guide/ml_tutorials/gliner.png"
+---
+-->
+# Use GLiNER for NER annotation
+The GLiNER model is a BERT-family model for generalist NER. We download the model from Hugging Face, but the original model is available on [GitHub](https://github.com/urchade/GLiNER).
+## Before you begin
+Before you begin, you must install the [Label Studio ML backend](https://github.com/HumanSignal/label-studio-ml-backend?tab=readme-ov-file#quickstart).
+This tutorial uses the [`gliner` example](https://github.com/HumanSignal/label-studio-ml-backend/tree/master/label_studio_ml/examples/gliner).
+## Running with Docker (recommended)
+1. Start the Machine Learning backend on `http://localhost:9090` with the prebuilt image:
+```bash
+docker-compose up
+```
+2. Validate that the backend is running:
+```bash
+$ curl http://localhost:9090/
+{"status":"UP"}
+```
+3. Create a project in Label Studio. Then from the **Model** page in the project settings, [connect the model](https://labelstud.io/guide/ml#Connect-the-model-to-Label-Studio). The default URL is `http://localhost:9090`.
+## Building from source (advanced)
+To build the ML backend from source, you have to clone the repository and build the Docker image:
+```bash
+docker-compose build
+```
+## Running without Docker (advanced)
+To run the ML backend without Docker, you have to clone the repository and install all dependencies using pip:
+```bash
+python -m venv ml-backend
+source ml-backend/bin/activate
+pip install -r requirements.txt
+```
+Then you can start the ML backend:
+```bash
+label-studio-ml start ./dir_with_your_model
+```
+## Configuration
+Parameters can be set in `docker-compose.yml` before running the container.
+The following common parameters are available:
+- `BASIC_AUTH_USER` - Specify the basic auth user for the model server.
+- `BASIC_AUTH_PASS` - Specify the basic auth password for the model server.
+- `LOG_LEVEL` - Set the log level for the model server.
+- `WORKERS` - Specify the number of workers for the model server.
+- `THREADS` - Specify the number of threads for the model server.
+- `LABEL_STUDIO_URL` - Specify the URL of your Label Studio instance. Note that this might need to be `http://host.docker.internal:8080` if you are running Label Studio on another Docker container.
+- `LABEL_STUDIO_API_KEY` - Specify the API key for authenticating your Label Studio instance. You can find this by logging into Label Studio and [going to the **Account & Settings** page](https://labelstud.io/guide/user_account#Access-token).
+## A Note on Model Training
+If you plan to use a webhook to train this model on "Start Training", note that you do
+not need to configure a separate webhook. Instead, go to the three dots next to your model
+on the Model tab in your project settings and click "start training".
+Additionally, note that this container is configured for a **VERY SMALL** demo set, with only 1
+non-eval sample (we expect the first 10 data samples to be used for evaluation).
+If you're working with a larger dataset, be sure to:
+1. Update `num_steps` and `batch_size` to the number of training steps you want and the batch size that works for your dataset.
+2. Change the uploaded model after training (line 239 of `model.py`) to the highest checkpoint that you have.
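
Review note: the training note above is easier to follow with the numbers worked out. The sketch below mirrors the arithmetic in `fit` from `model.py` in this commit (first 10 labeled tasks held out for evaluation, `num_steps = 10`, `batch_size = 1`). The function name `plan_epochs` and the explicit error for an empty training split are assumptions added for illustration; the committed code has no such guard.

```python
from math import floor

EVAL_SIZE = 10  # fit() keeps the first 10 labeled tasks for evaluation


def plan_epochs(num_labeled, num_steps=10, batch_size=1):
    """Return the epoch count that fit() would derive for num_labeled tasks."""
    train_size = max(0, num_labeled - EVAL_SIZE)
    num_batches = floor(train_size / batch_size)
    if num_batches == 0:
        # Hypothetical guard: without it the division below would fail.
        raise ValueError("need more labeled tasks than the evaluation split")
    return max(1, floor(num_steps / num_batches))


# 11 labeled tasks leave 1 training sample, so 10 steps become 10 epochs.
print(plan_epochs(11))
# 110 labeled tasks leave 100 batches, so 10 steps round up to 1 epoch.
print(plan_epochs(110))
```

With the defaults, any project with more than 20 labeled tasks trains for a single epoch, which is why the note asks you to raise `num_steps` for larger datasets.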

_wsgi.py ADDED

	@@ -0,0 +1,122 @@

+import os
+import argparse
+import json
+import logging
+import logging.config
+logging.config.dictConfig({
+  "version": 1,
+  "disable_existing_loggers": False,
+  "formatters": {
+    "standard": {
+      "format": "[%(asctime)s] [%(levelname)s] [%(name)s::%(funcName)s::%(lineno)d] %(message)s"
+    }
+  },
+  "handlers": {
+    "console": {
+      "class": "logging.StreamHandler",
+      "level": os.getenv('LOG_LEVEL'),
+      "stream": "ext://sys.stdout",
+      "formatter": "standard"
+    }
+  },
+  "root": {
+    "level": os.getenv('LOG_LEVEL'),
+    "handlers": [
+      "console"
+    ],
+    "propagate": True
+  }
+})
+from label_studio_ml.api import init_app
+from model import GLiNERModel
+_DEFAULT_CONFIG_PATH = os.path.join(os.path.dirname(__file__), 'config.json')
+def get_kwargs_from_config(config_path=_DEFAULT_CONFIG_PATH):
+    if not os.path.exists(config_path):
+        return dict()
+    with open(config_path) as f:
+        config = json.load(f)
+    assert isinstance(config, dict)
+    return config
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Label studio')
+    parser.add_argument(
+        '-p', '--port', dest='port', type=int, default=9090,
+        help='Server port')
+    parser.add_argument(
+        '--host', dest='host', type=str, default='0.0.0.0',
+        help='Server host')
+    parser.add_argument(
+        '--kwargs', '--with', dest='kwargs', metavar='KEY=VAL', nargs='+', type=lambda kv: kv.split('='),
+        help='Additional LabelStudioMLBase model initialization kwargs')
+    parser.add_argument(
+        '-d', '--debug', dest='debug', action='store_true',
+        help='Switch debug mode')
+    parser.add_argument(
+        '--log-level', dest='log_level', choices=['DEBUG', 'INFO', 'WARNING', 'ERROR'], default=None,
+        help='Logging level')
+    parser.add_argument(
+        '--model-dir', dest='model_dir', default=os.path.dirname(__file__),
+        help='Directory where models are stored (relative to the project directory)')
+    parser.add_argument(
+        '--check', dest='check', action='store_true',
+        help='Validate model instance before launching server')
+    parser.add_argument('--basic-auth-user',
+                        default=os.environ.get('ML_SERVER_BASIC_AUTH_USER', None),
+                        help='Basic auth user')
+    parser.add_argument('--basic-auth-pass',
+                        default=os.environ.get('ML_SERVER_BASIC_AUTH_PASS', None),
+                        help='Basic auth pass')
+    args = parser.parse_args()
+    # setup logging level
+    if args.log_level:
+        logging.root.setLevel(args.log_level)
+    def isfloat(value):
+        try:
+            float(value)
+            return True
+        except ValueError:
+            return False
+    def parse_kwargs():
+        param = dict()
+        for k, v in args.kwargs:
+            if v.isdigit():
+                param[k] = int(v)
+            elif v == 'True' or v == 'true':
+                param[k] = True
+            elif v == 'False' or v == 'false':
+                param[k] = False
+            elif isfloat(v):
+                param[k] = float(v)
+            else:
+                param[k] = v
+        return param
+    kwargs = get_kwargs_from_config()
+    if args.kwargs:
+        kwargs.update(parse_kwargs())
+    if args.check:
+        print('Check "' + GLiNERModel.__name__ + '" instance creation..')
+        model = GLiNERModel(**kwargs)
+    app = init_app(model_class=GLiNERModel, basic_auth_user=args.basic_auth_user, basic_auth_pass=args.basic_auth_pass)
+    app.run(host=args.host, port=args.port, debug=args.debug)
+else:
+    # for WSGI servers such as gunicorn or uWSGI
+    app = init_app(model_class=GLiNERModel)
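
Review note: the `--kwargs KEY=VAL` handling above coerces each value in a fixed order (integer, boolean, float, then plain string). The standalone sketch below reproduces that order so the behavior can be tried in isolation. It differs from the committed code in two labeled ways: it splits on the first `=` only, so values may contain `=`, and it uses `isdecimal()` so every string it accepts is valid input to `int()`. The names `coerce` and `parse_kwargs` here are illustrative.

```python
def coerce(value):
    """Convert a command-line string to int, bool, float, or leave it as str."""
    if value.isdecimal():
        return int(value)
    if value in ("True", "true"):
        return True
    if value in ("False", "false"):
        return False
    try:
        # Negative numbers and decimals land here, as in the original order.
        return float(value)
    except ValueError:
        return value


def parse_kwargs(pairs):
    """Turn ["key=val", ...] into a dict of coerced values."""
    params = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected KEY=VAL, got {pair!r}")
        params[key] = coerce(value)
    return params


print(parse_kwargs(["batch=8", "debug=true", "threshold=0.5", "tag=a=b"]))
```

Note that a negative integer such as `offset=-3` is not all digits, so it falls through to the float branch and becomes `-3.0`; the committed parser behaves the same way.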

model.py ADDED

	@@ -0,0 +1,261 @@

+import logging
+import os
+from math import floor
+from typing import List, Dict, Optional
+import pathlib
+import label_studio_sdk
+from gliner import GLiNER
+from gliner.data_processing.collator import DataCollator
+from gliner.training import Trainer, TrainingArguments
+from label_studio_sdk.label_interface.objects import PredictionValue
+from label_studio_ml.model import LabelStudioMLBase
+from label_studio_ml.response import ModelResponse
+logger = logging.getLogger(__name__)
+GLINER_MODEL_NAME = os.getenv("GLINER_MODEL_NAME", "urchade/gliner_medium-v2.1")
+class GLiNERModel(LabelStudioMLBase):
+    """
+    Custom ML Backend for GLiNER model
+    """
+    def setup(self):
+        """Configure any parameters of your model here
+        """
+        self.LABEL_STUDIO_HOST = os.getenv('LABEL_STUDIO_URL', 'http://localhost:8080')
+        self.LABEL_STUDIO_API_KEY = os.getenv('LABEL_STUDIO_API_KEY')
+        self.MODEL_DIR = os.getenv("MODEL_DIR", "/data/models")
+        self.finetuned_model_path = os.getenv("FINETUNED_MODEL_PATH", f"models/checkpoint-10")
+        self.threshold = float(os.getenv('THRESHOLD', 0.5))
+        self.model = None
+    def lazy_init(self):
+        if not self.model:
+            try:
+                logger.info(f"Loading Pretrained Model from {self.finetuned_model_path}")
+                self.model = GLiNER.from_pretrained(str(pathlib.Path(self.MODEL_DIR, self.finetuned_model_path)), local_files_only=True)
+                self.set("model_version", f'{self.__class__.__name__}-v0.0.2')
+            except Exception:
+                # If no finetuned model, use default
+                logger.info(f"No Pretrained Model Found. Loading GLINER model {GLINER_MODEL_NAME}")
+                self.model = GLiNER.from_pretrained(GLINER_MODEL_NAME)
+                self.set("model_version", f'{self.__class__.__name__}-v0.0.1')
+    def convert_to_ls_annotation(self, prediction, from_name, to_name):
+        """
+        Convert from GLiNER output format to Label Studio annotation format
+        :param prediction: The prediction output from GLiNER
+        :param from_name: name of the Labels control tag
+        :param to_name: name of the Text object tag the labels apply to
+        """
+        results = []
+        sent_preds = []
+        for ent in prediction:
+            label = [ent['label']]
+            if label:
+                score = ent['score']
+                sent_preds.append({
+                    'from_name': from_name,
+                    'to_name': to_name,
+                    'type': 'labels',
+                    "value": {
+                        "start": ent['start'],
+                        "end": ent['end'],
+                        "text": ent['text'],
+                        "labels": label
+                    },
+                    "score": round(score, 4)
+                })
+        # add minimum of certainty scores of entities in sentence for active learning use
+        score = min([p['score'] for p in sent_preds]) if sent_preds else 2.0
+        results.append(PredictionValue(
+            result=sent_preds,
+            score=score,
+            model_version=self.get('model_version')
+        ))
+        return results
+    def convert_char_to_token_span(self, text: List, start: int, end: int):
+        """
+        A helper function to convert character spans to token spans
+        :param text: a list of the tokenized text
+        :param start: the first character of the span, as an int
+        :param end: the last character of the span, as an int
+        :return: the first and last tokens of the span, as ints
+        """
+        start_token = None
+        end_token = None
+        total_char = 0
+        for i, word in enumerate(text):
+            # compare against None explicitly: token index 0 is falsy and would otherwise be overwritten
+            if total_char >= start and start_token is None:
+                start_token = i
+            if total_char >= end and end_token is None:
+                end_token = i
+            total_char += (len(word) + 1)
+        if end_token is None:
+            end_token = len(text)
+        return start_token, end_token
+    def predict(self, tasks: List[Dict], context: Optional[Dict] = None, **kwargs) -> ModelResponse:
+        """ inference logic
+            :param tasks: [Label Studio tasks in JSON format](https://labelstud.io/guide/task_format.html)
+            :param context: [Label Studio context in JSON format](https://labelstud.io/guide/ml_create#Implement-prediction-logic)
+            :return model_response
+                ModelResponse(predictions=predictions) with
+                predictions: [Predictions array in JSON format](https://labelstud.io/guide/export.html#Label-Studio-JSON-format-of-annotated-tasks)
+        """
+        print(f'''\
+        Run prediction on {tasks}
+        Received context: {context}
+        Project ID: {self.project_id}
+        Label config: {self.label_config}
+        Parsed JSON Label config: {self.parsed_label_config}
+        Extra params: {self.extra_params}''')
+        # TODO: this may result in single-time timeout for large models - consider adjusting the timeout on Label Studio side
+        self.lazy_init()
+        # make predictions with currently set model
+        from_name, to_name, value = self.label_interface.get_first_tag_occurence('Labels', 'Text')
+        # get labels from the labeling configuration
+        labels = sorted(self.label_interface.get_tag(from_name).labels)
+        texts = [task['data'][value] for task in tasks]
+        predictions = []
+        for text in texts:
+            entities = self.model.predict_entities(text, labels, threshold=self.threshold)
+            pred = self.convert_to_ls_annotation(entities, from_name, to_name)
+            predictions.extend(pred)
+        return ModelResponse(predictions=predictions)
+    def process_training_data(self, task):
+        """
+        Process the task from Label Studio export to isolate the information needed for prediction.
+        We need the tokenized text of the input, along with the start and end indices, by word, of the annotated spans
+        :param task: the task as output by Label Studio
+        """
+        # We get the list of tokens from the original data sample we uploaded
+        tokens = task['data']['tokens']
+        ner = []
+        # Parse the annotations
+        for annotation in task['annotations']:
+            for result in annotation['result']:
+                start = result['value']['start']
+                end = result['value']['end']
+                start_token, end_token = self.convert_char_to_token_span(tokens, start, end)
+                label = result['value']['labels'][0]
+                ner.append([start_token, end_token, label])
+        return tokens, ner
+    def train(self, model, training_args, train_data, eval_data=None):
+        """
+        retrain the GLiNER model. Code adapted from the GLiNER finetuning notebook.
+        :param model: the model to train
+        :param training_args: the TrainingArguments object holding the training parameters
+        :param train_data: the training data, as a list of dictionaries
+        :param eval_data: the eval data
+        """
+        # TODO: this may result in single-time timeout for large models - consider adjusting the timeout on Label Studio side
+        self.lazy_init()
+        logger.info("Training Model")
+        if training_args.use_cpu:
+            model = model.to('cpu')
+        else:
+            model = model.to("cuda")
+        data_collator = DataCollator(model.config, data_processor=model.data_processor, prepare_labels=True)
+        trainer = Trainer(
+            model=model,
+            args=training_args,
+            train_dataset=train_data,
+            eval_dataset=eval_data,
+            tokenizer=model.data_processor.transformer_tokenizer,
+            data_collator=data_collator,
+        )
+        trainer.train()
+        #Save model
+        ckpt = str(pathlib.Path(self.MODEL_DIR, self.finetuned_model_path))
+        logger.info(f"Model Trained, saving to {ckpt} ")
+        trainer.save_model(ckpt)
+    def fit(self, event, data, **kwargs):
+        """
+        This method is called each time an annotation is created or updated
+        You can run your logic here to update the model and persist it to the cache
+        It is not recommended to perform long-running operations here, as it will block the main thread
+        Instead, consider running a separate process or a thread (like RQ worker) to perform the training
+        :param event: event type can be ('ANNOTATION_CREATED', 'ANNOTATION_UPDATED')
+        :param data: the payload received from the event (check [Webhook event reference](https://labelstud.io/guide/webhook_reference.html))
+        """
+        self.lazy_init()
+        # we only train the model if the "start training" button is pressed from settings.
+        if event == "START_TRAINING":
+            logger.info("Fitting model")
+            # download annotated tasks from Label Studio
+            ls = label_studio_sdk.Client(self.LABEL_STUDIO_HOST, self.LABEL_STUDIO_API_KEY)
+            project = ls.get_project(id=self.project_id)
+            tasks = project.get_labeled_tasks()
+            logger.info(f"Downloaded {len(tasks)} labeled tasks from Label Studio")
+            training_data = []
+            for task in tasks:
+                tokens, ner = self.process_training_data(task)
+                training_data.append({"tokenized_text": tokens, "ner": ner})
+            from_name, to_name, value = self.label_interface.get_first_tag_occurence('Labels', 'Text')
+            eval_data = {
+                "entity_types": sorted(self.label_interface.get_tag(from_name).labels),
+                "samples": training_data[:10]
+            }
+            training_data = training_data[10:]
+            logger.debug(training_data)
+            # Define the hyperparameters in a config variable
+            # This comes from the pretraining example in the GLiNER repo
+            num_steps = 10
+            batch_size = 1
+            data_size = len(training_data)
+            num_batches = floor(data_size / batch_size)
+            num_epochs = max(1, floor(num_steps / num_batches))
+            training_args = TrainingArguments(
+                output_dir="models/training_output",
+                learning_rate=5e-6,
+                weight_decay=0.01,
+                others_lr=1e-5,
+                others_weight_decay=0.01,
+                lr_scheduler_type="linear",  # cosine
+                warmup_ratio=0.1,
+                per_device_train_batch_size=batch_size,
+                per_device_eval_batch_size=batch_size,
+                focal_loss_alpha=0.75,
+                focal_loss_gamma=2,
+                num_train_epochs=num_epochs,
+                evaluation_strategy="steps",
+                save_steps=100,
+                save_total_limit=10,
+                dataloader_num_workers=0,
+                use_cpu=True,
+                report_to="none",
+            )
+            self.train(self.model, training_args, training_data, eval_data)
+        else:
+            logger.info("Model training not triggered")
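
Review note: `convert_char_to_token_span` is the step that turns Label Studio character offsets into token indices for training, and it is easy to misread. Below is a standalone sketch of the same walk, assuming the text is the tokens joined by single spaces (as in the sample task in `test_api.py`). It uses explicit `is None` checks so that a span starting at token 0 is kept. The function name `char_to_token_span` is illustrative, and the returned end index is the first token starting at or after `end`, as in the loop above.

```python
from typing import List, Optional, Tuple


def char_to_token_span(tokens: List[str], start: int, end: int) -> Tuple[Optional[int], int]:
    """Map a character span to token indices for space-joined tokens."""
    start_token = None
    end_token = None
    offset = 0  # character offset at which the current token begins
    for i, token in enumerate(tokens):
        if start_token is None and offset >= start:
            start_token = i
        if end_token is None and offset >= end:
            end_token = i
        offset += len(token) + 1  # +1 for the joining space
    if end_token is None:
        # The span runs to the end of the text.
        end_token = len(tokens)
    return start_token, end_token


tokens = ['atomoxetine', '[', 'oral', 'suspension', ']',
          'norepinephrine', 'reuptake', 'inhibitor']
print(char_to_token_span(tokens, 0, 11))   # "atomoxetine"
print(char_to_token_span(tokens, 32, 65))  # "norepinephrine reuptake inhibitor"
```

For the two spans in the expected test response this yields `(0, 1)` and `(5, 8)`. Whether GLiNER expects the end index to be inclusive or exclusive is not stated in this commit, so check that against the GLiNER training data format before relying on it.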

requirements-base.txt ADDED

	@@ -0,0 +1,2 @@


+gunicorn==23.0.0
+label-studio-ml @ git+https://github.com/HumanSignal/label-studio-ml-backend.git

requirements-test.txt ADDED

	@@ -0,0 +1,2 @@


+pytest
+pytest-cov

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+gliner==0.2.16
+torch==2.7.1
+accelerate>=0.26.0
+transformers==4.38.2
+huggingface-hub==0.21.4

test_api.py ADDED

	@@ -0,0 +1,68 @@

+"""
+This file contains tests for the API of your model. You can run these tests by installing test requirements:
+    ```bash
+    pip install -r requirements-test.txt
+    ```
+Then execute `pytest` in the directory of this file.
+- Change `NewModel` to the name of the class in your model.py file.
+- Change the `request` and `expected_response` variables to match the input and output of your model.
+"""
+import pytest
+import json
+from model import GLiNERModel
+@pytest.fixture
+def client():
+    from _wsgi import init_app
+    app = init_app(model_class=GLiNERModel)
+    app.config['TESTING'] = True
+    with app.test_client() as client:
+        yield client
+def test_predict(client):
+    request = {
+        'tasks': [{'id': 6,
+                   'data': {'id': '5316', 'sample_id': '83dd3f62-4dd5-45eb-8626-ee8539963194',
+                            'tokens': ['atomoxetine', '[', 'oral', 'suspension', ']', 'norepinephrine', 'reuptake',
+                                       'inhibitor'],
+                            'ner_tags': ['B-Medication/Vaccine', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
+                            'ner_tags_index': [63, 0, 0, 0, 0, 0, 0, 0],
+                            'text': 'atomoxetine [ oral suspension ] norepinephrine reuptake inhibitor'},
+                   'meta': {},
+                   'created_at': '2024-04-13T19:22:37.153686Z',
+                   'updated_at': '2024-05-03T00:03:22.356871Z',
+                   'is_labeled': False,
+                   'overlap': 1,
+                   'inner_id': 6,
+                   'total_annotations': 1,
+                   'cancelled_annotations': 0,
+                   'total_predictions': 0,
+                   'comment_count': 0,
+                   'unresolved_comment_count': 0,
+                   'last_comment_updated_at': None,
+                   'project': 2,
+                   'updated_by': 1,
+                   'file_upload': None,
+                   'comment_authors': [],
+                   'predictions': [],
+                   }],
+        # Your labeling configuration here
+        'label_config': '<View> \\n <Labels name="label" toName="text">\\n<Label value="Medication/Vaccine" background="red"/>\\n<Label value="MedicalProcedure" background="blue"/>\\n<Label value="AnatomicalStructure" background="orange"/>\\n<Label value="Symptom" background="green"/>\\n<Label value="Disease" background="purple"/>\\n</Labels>\\n<Text name="text" value="$text"/>\\n</View>'
+    }
+    expected_response = {"results": [{"model_version": "GLiNERModel-v0.0.1", "result": [
+        {"from_name": "label", "score": 0.922, "to_name": "text", "type": "labels",
+         "value": {"end": 11, "labels": ["Medication/Vaccine"], "start": 0, "text": "atomoxetine"}},
+        {"from_name": "label", "score": 0.7053, "to_name": "text", "type": "labels",
+         "value": {"end": 65, "labels": ["Medication/Vaccine"], "start": 32,
+                   "text": "norepinephrine reuptake inhibitor"}}], "score": 0.7053}]}
+    response = client.post('/predict', data=json.dumps(request), content_type='application/json')
+    assert response.status_code == 200
+    response = json.loads(response.data)
+    assert expected_response == response
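
Review note: `test_predict` compares the whole response for exact equality, including model scores such as `0.922`, which may shift slightly across hardware or library versions. One possible alternative is to compare scores within a tolerance and everything else exactly. The sketch below is not part of the commit; `predictions_match` and the default tolerance of `0.01` are assumptions chosen for illustration.

```python
import math


def _scores_close(a, b, tol):
    """True when both dicts carry a score and the two differ by at most tol."""
    if a.get("score") is None or b.get("score") is None:
        return False
    return math.isclose(a["score"], b["score"], abs_tol=tol)


def _without_score(region):
    """Copy of a region dict with the score removed, for exact comparison."""
    return {k: v for k, v in region.items() if k != "score"}


def predictions_match(expected, actual, tol=0.01):
    """Compare two /predict responses, allowing scores to differ by tol."""
    exp_preds = expected.get("results", [])
    act_preds = actual.get("results", [])
    if len(exp_preds) != len(act_preds):
        return False
    for exp, act in zip(exp_preds, act_preds):
        if exp.get("model_version") != act.get("model_version"):
            return False
        if not _scores_close(exp, act, tol):
            return False
        exp_regions = exp.get("result", [])
        act_regions = act.get("result", [])
        if len(exp_regions) != len(act_regions):
            return False
        for e, a in zip(exp_regions, act_regions):
            if not _scores_close(e, a, tol):
                return False
            if _without_score(e) != _without_score(a):
                return False
    return True
```

In the test, `assert predictions_match(expected_response, response)` could then replace the final equality check; spans, labels, and the model version are still compared exactly.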