---

license: apache-2.0
base_model: distilbert-base-uncased
tags:
  - text-classification
  - dialogue-act-classification
  - multi-task-learning
  - onnx
  - roleplay
  - game-ai
language:
  - en
pipeline_tag: text-classification
datasets:
  - deepset/prompt-injections
  - hackaprompt/hackaprompt-dataset
  - lakera-ai/gandalf_ignore_instructions
---


# distilbert-multitask

A multi-task DistilBERT classifier for conversational AI pipelines in interactive fiction and games. It performs two classification tasks in a single forward pass:

| Task | Output | Notes |
|---|---|---|
| Dialogue act | 21-class label | Classifies player utterance type |
| Manipulation detection | Binary probability | Detects prompt injection / NPC takeover attempts |

## Dialogue Act Labels

`accusation`, `acknowledgment`, `action`, `agree`, `command`, `conditional`, `confession`, `disagree`, `emote`, `farewell`, `flirt`, `greeting`, `hedge`, `hostile`, `intent`, `offer`, `opinion`, `out_of_character`, `question`, `statement`, `yes_no_question`

## Model Details

- **Base model:** [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- **Format:** ONNX (CPU inference via `onnxruntime`)
- **Inference time:** ~10–15 ms per input on CPU
- **Input:** Single sentence (player utterance in a conversation)
- **Output:** Dialogue act label + manipulation detection probability
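The latency figure above depends on hardware and input length, so it may be worth measuring on your own machine. The following is a minimal sketch of one way to do that; `time_calls` is a hypothetical helper (not part of this repository), and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time


def time_calls(fn, warmup=5, runs=50):
    """Return per-call latencies in milliseconds for a zero-argument callable.

    Hypothetical helper for illustration; warmup/runs defaults are arbitrary.
    """
    # Warm-up calls are discarded so one-time setup costs do not skew results.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Stand-in workload so the sketch runs on its own. To time the model, pass a
# callable that wraps the ONNX Runtime call, e.g.
# lambda: session.run(None, dict(inputs))
samples = time_calls(lambda: sum(range(1000)))
median_ms = statistics.median(samples)
print(f"median latency: {median_ms:.3f} ms over {len(samples)} runs")
```

Reporting the median rather than the mean keeps occasional scheduling spikes from dominating the result.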

## Usage

```python
import json

import numpy as np
import onnxruntime
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the model repository (ONNX model, tokenizer, label map) locally.
snapshot_download(repo_id="myemfar/distilbert-multitask", local_dir="./distilbert_multitask")

session = onnxruntime.InferenceSession("./distilbert_multitask/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("./distilbert_multitask")

# JSON keys are strings, so convert them to integer class indices.
with open("./distilbert_multitask/label_map_da.json") as f:
    labels = {int(k): v for k, v in json.load(f).items()}

inputs = tokenizer("Where is the tavern?", return_tensors="np")
logits_da, logits_manip = session.run(None, dict(inputs))

da_label = labels[int(np.argmax(logits_da))]           # "question"
manip_prob = float(1 / (1 + np.exp(-logits_manip[0][0])))  # sigmoid
```
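Downstream code often wants a confidence score and a yes/no flag rather than raw logits. The sketch below shows one possible post-processing step using only the standard library. It rests on several assumptions: `interpret` is a hypothetical helper, the 0.5 threshold is an arbitrary starting point that should be tuned for your application, the three-entry label map and the logit values are made up for illustration, and applying softmax to the dialogue act logits assumes that head was trained with a standard cross-entropy objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat sequence of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function, branching on sign to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def interpret(da_logits, manip_logit, labels, threshold=0.5):
    """Turn raw model outputs into a label, a confidence, and a flag.

    Hypothetical helper; the threshold default is an assumption.
    """
    probs = softmax(da_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    manip_prob = sigmoid(manip_logit)
    return {
        "dialogue_act": labels[best],
        "dialogue_act_confidence": probs[best],
        "manipulation_probability": manip_prob,
        "flagged": manip_prob >= threshold,
    }


# Made-up label map and logits so the sketch runs without the model.
demo_labels = {0: "greeting", 1: "question", 2: "statement"}
result = interpret([0.1, 2.3, -1.0], -3.0, demo_labels)
print(result)
```

With the real model, the same function could be called as `interpret(logits_da[0].tolist(), float(logits_manip[0][0]), labels)`, using the variables from the usage example above.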

## Training Data

### Dialogue Act Classification
The training data is synthetic, generated with Claude across the 21 dialogue act categories and curated for interactive fiction and RPG dialogue contexts. It comprises approximately 2,000 labeled examples, with targeted augmentation at category boundaries.

### Manipulation / Prompt Injection Detection
Fine-tuned on a combination of three public datasets plus domain-specific negative examples (in-character RPG dialogue):

| Dataset | License | Description |
|---|---|---|
| [deepset/prompt-injections](https://huggingface.co/datasets/deepset/prompt-injections) | CC BY 4.0 | Benign queries + prompt injection examples |
| [hackaprompt/hackaprompt-dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset) | Apache 2.0 | Red-teaming competition submissions |
| [lakera-ai/gandalf_ignore_instructions](https://huggingface.co/datasets/lakera-ai/gandalf_ignore_instructions) | CC BY 4.0 | Instruction-override attempts from Lakera's Gandalf challenge |