---
license: mit
tags:
- masked-language-modeling
- dialogue
- speaker-aware
- transformer
- saute
- pytorch
datasets:
- SODA
language:
- en
pipeline_tag: fill-mask
model_type: saute
library_name: transformers
---

# 👨‍🍳 SAUTE: Speaker-Aware Utterance Embedding Unit

**SAUTE** is a lightweight, speaker-aware transformer architecture for effective modeling of multi-speaker dialogues. It combines **EDU-level (elementary discourse unit) utterance embeddings**, **speaker-sensitive memory**, and **efficient linear attention** to encode rich conversational context with minimal overhead.

---

## 🧠 Overview

SAUTE is tailored for:

- 🗣️ Multi-turn conversations
- 👥 Multi-speaker interactions
- 🧵 Long-range dialogue dependencies

It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.
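
A minimal sketch of this idea (illustrative NumPy, not the actual implementation; the ReLU feature map and the shapes are assumptions): each speaker accumulates an outer-product memory of their EDU embeddings, and any token can read that memory back at a cost independent of dialogue length.

```python
import numpy as np

def update_memory(memory, edu_emb):
    """Add one utterance to a speaker's (d x d) outer-product memory."""
    phi = np.maximum(edu_emb, 0.0)        # simple positive feature map (assumed)
    return memory + np.outer(phi, edu_emb)

def read_memory(memory, token_query):
    """Linear-attention-style readout: O(d^2) per token, independent of dialogue length."""
    phi_q = np.maximum(token_query, 0.0)
    return phi_q @ memory                 # (d,) context summary for this token

d = 8
memory = np.zeros((d, d))
for _ in range(3):                        # three utterances from the same speaker
    memory = update_memory(memory, np.random.randn(d))
context = read_memory(memory, np.random.randn(d))
print(context.shape)                      # (8,)
```

Because each speaker's history is folded into a fixed-size matrix, the per-token cost stays constant no matter how long the conversation grows.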

---

## 🧱 Architecture

> 🔁 SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.

- **EDU-Level Encoder**: Mean-pooled BERT outputs per utterance.
- **Speaker Memory**: Outer-product accumulation per speaker.
- **Contextualization Layer**: Integrates memory summaries with current token representations.

![SAUTE architecture](https://github.com/user-attachments/assets/e1b288f8-a9b7-4e9a-9dbb-d55055a13f71)

---

## 🚀 Key Features

- 🧠 **Speaker-Aware Memory**: Structured per-speaker representation of dialogue context.
- ⚡ **Linear Attention**: Efficient and scalable to long dialogues.
- 🧩 **Pretrained-Transformer Compatible**: Plugs into frozen or fine-tuned BERT models.
- 🪶 **Lightweight**: Roughly 4M fewer parameters than a 2-layer transformer, with strong MLM performance improvements.

---

## 📊 Performance (Masked Language Modeling on SODA)

| Model                      | Avg MLM Acc (%) | Best MLM Acc (%) |
|----------------------------|-----------------|------------------|
| BERT-base (frozen)         | 33.45           | 45.89            |
| + 1-layer Transformer      | 68.20           | 76.69            |
| + 2-layer Transformer      | 71.81           | 79.54            |
| **+ 1-layer SAUTE (ours)** | **72.05**       | **80.40**        |
| + 3-layer Transformer      | 73.50           | 80.84            |
| **+ 3-layer SAUTE (ours)** | **75.65**       | **85.55**        |

> At each depth, SAUTE achieves higher accuracy than a deeper plain transformer while using fewer parameters.

---

## 📄 Citation / Paper

📎 [SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)](https://github.com/user-attachments/files/20689695/SAUTE_Speaker_Aware_Utterance_Embedding_Unit.pdf)

---

## 🔧 How to Use

The snippet below shows the intended call pattern; the input-preparation lines are illustrative (the exact per-utterance input layout expected by the model may differ).

```python
from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare inputs (illustrative example: a short two-speaker dialogue)
dialogue = ["Hi, how are you?", "Doing well, thanks!"]
speaker_names = ["Alice", "Bob"]
encoded = tokenizer(dialogue, padding=True, return_tensors="pt")

# Forward pass with speaker-aware context
outputs = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    speaker_names=speaker_names,
)
```