Upload folder using huggingface_hub

Files changed (9)

README.md +97 -0
config.json +37 -0
label_map.json +6 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +63 -0
training_config.json +16 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,97 @@

+---
+language:
+- en
+- de
+- ru
+license: apache-2.0
+library_name: transformers
+tags:
+- dialogue-act-classification
+- distilbert
+- multilingual
+- conversational-ai
+- asr
+base_model: distilbert-base-multilingual-cased
+metrics:
+- accuracy
+- f1
+pipeline_tag: text-classification
+---
+# distilbert-multilingual-dialogue-act-classifier
+Fine-tuned **DistilBERT** (`distilbert-base-multilingual-cased`) for **4-class dialogue act classification** in English, German, and Russian. Trained on conversational dialogue data (the training mix also includes Italian, for which no evaluation is reported below) and optimized for ASR transcripts.
+## Labels
+| Index | Label | Description |
+|-------|-------|-------------|
+| 0 | commissive | Promises, commitments ("I'll handle it.") |
+| 1 | directive | Commands, requests ("Send the report.") |
+| 2 | inform | Statements, facts ("The deadline is Friday.") |
+| 3 | question | Questions, inquiries ("What is the timeline?") |
+## Evaluation
+Per-language performance on held-out test sets:
+| Language | Test Set | Accuracy | F1 Macro |
+|----------|----------|----------|----------|
+| English | SILICONE dyda_da | 80.8% | 0.725 |
+| English | XDailyDialog | 82.5% | 0.750 |
+| German | XDailyDialog | 81.8% | 0.738 |
+| Russian | xdailydialog-ru | 81.7% | 0.734 |
+
+Edge-case test suite (disfluent ASR input, conversational utterances): **77.8%** (35/45)
+## Usage
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+model = AutoModelForSequenceClassification.from_pretrained("WSHAPER/distilbert-multilingual-dialogue-act-classifier")
+tokenizer = AutoTokenizer.from_pretrained("WSHAPER/distilbert-multilingual-dialogue-act-classifier")
+texts = ["What is the timeline?", "Send the report.", "The meeting went well."]
+inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
+with torch.no_grad():
+    logits = model(**inputs).logits
+    probs = torch.softmax(logits, dim=-1)
+    preds = torch.argmax(probs, dim=-1)
+labels = ["commissive", "directive", "inform", "question"]
+for text, pred, prob in zip(texts, preds, probs):
+    print(f"{text} → {labels[pred]} ({prob[pred]:.2f})")
+```
+## Training Details
+- **Base model**: `distilbert-base-multilingual-cased` (~135M params)
+- **Training data**:
+  - [XDailyDialog](https://github.com/liuzeming01/XDailyDialog) — EN, DE, IT (~249K utterances)
+  - [WSHAPER/xdailydialog-ru](https://huggingface.co/datasets/WSHAPER/xdailydialog-ru) — RU (~82K utterances)
+  - Total: ~331K utterances across 4 languages
+- **Hyperparameters**: 5 epochs, batch 32, lr 2e-5, warmup 10%
+- **Hardware**: NVIDIA RTX A3000 12GB, ~1.5 hours
+## Rust Inference (candle-transformers)
+This model is compatible with `candle-transformers` for pure Rust inference:
+```rust
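+// Sketch only: `vb` is a candle_nn::VarBuilder over model.safetensors; tokenizer.json loads via the `tokenizers` crate.
+use candle_transformers::models::distilbert::{Config, DistilBertModel};
+let config: Config = serde_json::from_str(&std::fs::read_to_string("config.json")?)?;
+let distilbert = DistilBertModel::load(vb.pp("distilbert"), &config)?;
+// Classification head: pre_classifier (768 -> 768, then ReLU) followed by classifier (768 -> 4); 768 is `dim` in config.json.
+let pre_classifier = candle_nn::linear(768, 768, vb.pp("pre_classifier"))?;
+let classifier = candle_nn::linear(768, 4, vb.pp("classifier"))?;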
+```
+## Links
+- **GitHub**: [WSHAPER/dialogue-act-classifier](https://github.com/WSHAPER/dialogue-act-classifier) — training code, evaluation scripts, export tools
+- **Russian dataset**: [WSHAPER/xdailydialog-ru](https://huggingface.co/datasets/WSHAPER/xdailydialog-ru) — Russian translation of XDailyDialog
+## License
+Apache-2.0
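
A note on the Evaluation table in the README above: accuracy sits near 81% while macro F1 is roughly 0.73 to 0.75. One common cause of such a gap is that macro F1 weights every class equally, so a rare class that is often missed pulls the average down even when overall accuracy stays high; whether that is what happens for this model is not stated. The sketch below computes both metrics on a small toy example. The toy labels and predictions are invented purely for illustration and do not come from any of the listed test sets, and the helper names `accuracy` and `macro_f1` are hypothetical.

```python
from collections import Counter

LABELS = ["commissive", "directive", "inform", "question"]


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (illustrative only): the two rare classes are always predicted as "inform".
y_true = ["inform"] * 6 + ["question"] * 2 + ["directive", "commissive"]
y_pred = ["inform"] * 6 + ["question"] * 2 + ["inform", "inform"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case eight of ten predictions are correct, yet two of the four classes contribute an F1 of zero, so the macro average ends up far below the accuracy.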

config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "commissive",
+    "1": "directive",
+    "2": "inform",
+    "3": "question"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "commissive": 0,
+    "directive": 1,
+    "inform": 2,
+    "question": 3
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "output_past": true,
+  "pad_token_id": 0,
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.53.1",
+  "vocab_size": 119547
+}
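
The values in this config are enough to estimate the model's parameter count by hand. The sketch below assumes the standard Hugging Face `DistilBertForSequenceClassification` layout (word and position embeddings with one LayerNorm, six blocks with four attention projections, a two-layer feed-forward network and two LayerNorms each, then a `pre_classifier` and a `classifier` linear layer); that layout is an assumption drawn from the library, not something this commit spells out. The function name is hypothetical.

```python
def linear_params(n_in, n_out):
    """Weight matrix plus bias vector of a dense layer."""
    return n_in * n_out + n_out


def distilbert_classifier_params(vocab_size, dim, hidden_dim, n_layers,
                                 max_positions, num_labels):
    layer_norm = 2 * dim  # scale and shift vectors
    embeddings = vocab_size * dim + max_positions * dim + layer_norm
    attention = 4 * linear_params(dim, dim)  # q, k, v and output projections
    ffn = linear_params(dim, hidden_dim) + linear_params(hidden_dim, dim)
    block = attention + ffn + 2 * layer_norm
    # Assumed head: pre_classifier (dim -> dim) followed by classifier (dim -> num_labels)
    head = linear_params(dim, dim) + linear_params(dim, num_labels)
    return embeddings + n_layers * block + head


# Values copied from config.json above; num_labels is the size of id2label.
total = distilbert_classifier_params(
    vocab_size=119547, dim=768, hidden_dim=3072,
    n_layers=6, max_positions=512, num_labels=4,
)
weight_bytes = total * 4  # torch_dtype is float32, i.e. 4 bytes per parameter

print(f"parameters: {total:,}")
print(f"float32 weight bytes: {weight_bytes:,}")
```

Under these assumptions the estimate comes to roughly 135M parameters. The resulting float32 byte count falls a few kilobytes short of the `size` recorded in the `model.safetensors` pointer, a gap that would be consistent with the file's metadata header.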

label_map.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "0": "commissive",
+  "1": "directive",
+  "2": "inform",
+  "3": "question"
+}
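
Because the label mapping is stored twice (here and in `id2label`/`label2id` in `config.json`), a quick consistency check helps catch drift between the two files. The sketch below inlines excerpts of both files as strings so it runs without any downloads; in practice the same check would read the files from disk. The helper name `load_index_map` is hypothetical.

```python
import json

# Excerpts copied from config.json and label_map.json in this commit.
CONFIG_EXCERPT = """
{
  "id2label": {"0": "commissive", "1": "directive", "2": "inform", "3": "question"},
  "label2id": {"commissive": 0, "directive": 1, "inform": 2, "question": 3}
}
"""
LABEL_MAP = """
{"0": "commissive", "1": "directive", "2": "inform", "3": "question"}
"""


def load_index_map(mapping):
    """JSON object keys are always strings, so convert them back to int indices."""
    return {int(index): label for index, label in mapping.items()}


config = json.loads(CONFIG_EXCERPT)
id2label = load_index_map(config["id2label"])
label2id = config["label2id"]
label_map = load_index_map(json.loads(LABEL_MAP))

# The three views of the mapping should agree with each other.
assert id2label == label_map, "label_map.json disagrees with config.json"
assert label2id == {label: index for index, label in id2label.items()}

# Ordered label list, usable for indexing model predictions.
labels = [id2label[i] for i in range(len(id2label))]
print(labels)
```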

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f226db309b0b679faaa4dc3b955f31b6024cbe87a7ea43af2a372b78d0be38b5
+size 541323496

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,63 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_length": 128,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "[PAD]",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}

training_config.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "base_model": "distilbert-base-multilingual-cased",
+  "num_labels": 4,
+  "max_seq_length": 128,
+  "epochs": 5,
+  "batch_size": 32,
+  "learning_rate": 2e-05,
+  "seed": 42,
+  "trained_at": "20260514_193557",
+  "languages": [
+    "it",
+    "de",
+    "ru",
+    "en"
+  ]
+}
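
For a rough sense of the optimizer schedule these settings imply, the arithmetic can be sketched as follows. It combines `epochs` and `batch_size` from this file with the 10% warmup mentioned in the README. The example count of 331,000 is an assumption: it is the README's approximate total, and the real training split is not stated and is presumably smaller once held-out test data is removed. The sketch also assumes a single device and no gradient accumulation; the function name is hypothetical.

```python
import math


def schedule(num_examples, batch_size, epochs, warmup_percent):
    """Return (steps per epoch, total optimizer steps, warmup steps)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Integer arithmetic avoids floating-point rounding surprises.
    warmup_steps = total_steps * warmup_percent // 100
    return steps_per_epoch, total_steps, warmup_steps


# 331_000 is an assumed upper bound on the training-set size (see the note above).
steps_per_epoch, total_steps, warmup_steps = schedule(
    num_examples=331_000, batch_size=32, epochs=5, warmup_percent=10,
)
print(steps_per_epoch, total_steps, warmup_steps)
```

With a smaller training split the step counts shrink proportionally, so these numbers are best read as an upper bound.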

vocab.txt ADDED

The diff for this file is too large to render. See raw diff