---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- text-classification
- dialogue-act-classification
- multi-task-learning
- onnx
- roleplay
- game-ai
language:
- en
pipeline_tag: text-classification
datasets:
- deepset/prompt-injections
- hackaprompt/hackaprompt-dataset
- lakera-ai/gandalf_ignore_instructions
---

# distilbert-multitask

Multi-task DistilBERT classifier for conversational AI pipelines in interactive fiction and games. Performs two classification tasks in a single forward pass:

| Task | Output | Notes |
|---|---|---|
| Dialogue act | 21-class label | Classifies player utterance type |
| Manipulation detection | Binary probability | Detects prompt injection / NPC takeover attempts |

## Dialogue Act Labels

`accusation`, `acknowledgment`, `action`, `agree`, `command`, `conditional`, `confession`, `disagree`, `emote`, `farewell`, `flirt`, `greeting`, `hedge`, `hostile`, `intent`, `offer`, `opinion`, `out_of_character`, `question`, `statement`, `yes_no_question`
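
A loaded index-to-label map can be sanity-checked against the label list above before it is used to decode predictions. The following is a minimal sketch, not part of the released files: `DIALOGUE_ACTS` and `check_label_map` are hypothetical names, and the check assumes the map uses contiguous integer ids starting at 0. The actual id order is whatever `label_map_da.json` defines, so the sketch compares sets rather than positions.

```python
# Dialogue act labels copied from the list above (21 classes).
DIALOGUE_ACTS = (
    "accusation", "acknowledgment", "action", "agree", "command",
    "conditional", "confession", "disagree", "emote", "farewell",
    "flirt", "greeting", "hedge", "hostile", "intent", "offer",
    "opinion", "out_of_character", "question", "statement",
    "yes_no_question",
)


def check_label_map(labels: dict) -> None:
    """Raise ValueError if an {int id: label} map disagrees with the documented labels.

    Assumption: ids are contiguous integers 0..len(DIALOGUE_ACTS)-1.
    """
    expected_ids = set(range(len(DIALOGUE_ACTS)))
    if set(labels.keys()) != expected_ids:
        raise ValueError(f"unexpected label ids: {sorted(labels.keys())}")
    if set(labels.values()) != set(DIALOGUE_ACTS):
        unknown = set(labels.values()) ^ set(DIALOGUE_ACTS)
        raise ValueError(f"label names differ from the documented set: {sorted(unknown)}")
```

Calling `check_label_map(labels)` right after loading the JSON file would surface a mismatched or truncated map early, instead of as a `KeyError` at prediction time.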

## Model Details

- **Base model:** [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- **Format:** ONNX (CPU inference via `onnxruntime`)
- **Inference time:** ~10–15 ms per input on CPU
- **Input:** Single sentence (player utterance in a conversation)
- **Output:** Dialogue act label + manipulation detection probability
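
The inference time above depends on the CPU and the input length, so it may be worth measuring on the target machine. The helper below is a small sketch using only the standard library; `measure_latency_ms` is a hypothetical name, and the warmup and run counts are arbitrary defaults rather than the settings used for the figure quoted here.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, runs=50):
    """Return (median, max) wall-clock time of fn() in milliseconds."""
    # Warm-up calls are not timed, so one-off initialization cost is excluded.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)
```

With the `session` and `inputs` objects from the Usage section, a call such as `measure_latency_ms(lambda: session.run(None, dict(inputs)))` would time a single forward pass that produces both outputs.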

## Usage

```python
import json
import numpy as np
import onnxruntime
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the ONNX model, tokenizer files, and label map to a local directory.
snapshot_download(repo_id="myemfar/distilbert-multitask", local_dir="./distilbert_multitask")

session = onnxruntime.InferenceSession("./distilbert_multitask/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("./distilbert_multitask")

# JSON object keys are strings, so convert them to integer class ids.
with open("./distilbert_multitask/label_map_da.json") as f:
    labels = {int(k): v for k, v in json.load(f).items()}

# One forward pass returns the logits of both heads.
inputs = tokenizer("Where is the tavern?", return_tensors="np")
logits_da, logits_manip = session.run(None, dict(inputs))

da_label = labels[int(np.argmax(logits_da))]  # "question"
manip_prob = float(1 / (1 + np.exp(-logits_manip[0][0])))  # sigmoid
```
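
The raw logits can also be turned into a confidence score for the dialogue act and a yes/no manipulation decision. The sketch below is illustrative: `postprocess` and its result keys are hypothetical names, and the 0.5 threshold is an assumed default that is not stated in this card and would likely need tuning on in-domain dialogue. It assumes `logits_da` has shape `(1, num_classes)` and `logits_manip` has shape `(1, 1)`, matching the indexing in the snippet above.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def postprocess(logits_da, logits_manip, labels, manip_threshold=0.5):
    """Convert the two heads' logits for a single input into a result dict.

    manip_threshold is an assumed default, not a value from the model card.
    """
    probs = softmax(np.asarray(logits_da, dtype=np.float64))[0]
    idx = int(np.argmax(probs))
    manip_logit = float(np.asarray(logits_manip, dtype=np.float64)[0][0])
    manip_prob = float(1.0 / (1.0 + np.exp(-manip_logit)))  # sigmoid
    return {
        "dialogue_act": labels[idx],
        "dialogue_act_confidence": float(probs[idx]),
        "manipulation_prob": manip_prob,
        "is_manipulation": manip_prob >= manip_threshold,
    }
```

A caller could pass `logits_da`, `logits_manip`, and `labels` from the snippet above directly, then branch on `is_manipulation` before handing the utterance to the rest of the dialogue pipeline.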

## Training Data

### Dialogue Act Classification
Synthetic training data generated via Claude across 21 conversational categories, curated for interactive fiction and RPG dialogue contexts. Approximately 2,000 labeled examples with targeted augmentation at category boundaries.

### Manipulation / Prompt Injection Detection
Fine-tuned on a combination of three public datasets plus domain-specific negative examples (in-character RPG dialogue):

| Dataset | License | Description |
|---|---|---|
| [deepset/prompt-injections](https://huggingface.co/datasets/deepset/prompt-injections) | CC BY 4.0 | Benign queries + prompt injection examples |
| [hackaprompt/hackaprompt-dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | Apache 2.0 | Red-teaming competition submissions |
| [lakera-ai/gandalf_ignore_instructions](https://huggingface.co/datasets/lakera-ai/gandalf_ignore_instructions) | CC BY 4.0 | Instruction-override attempts from Lakera's Gandalf challenge |