---
license: mit
tags:
- masked-language-modeling
- dialogue
- speaker-aware
- transformer
- saute
- pytorch
datasets:
- SODA
language:
- en
pipeline_tag: fill-mask
model_type: saute
library_name: transformers
---
# πŸ‘¨β€πŸ³ SAUTE: Speaker-Aware Utterance Embedding Unit
**SAUTE** is a lightweight, speaker-aware transformer architecture designed for effective modeling of multi-speaker dialogues. It combines **EDU-level utterance embeddings**, **speaker-sensitive memory**, and **efficient linear attention** to encode rich conversational context with minimal overhead.
---
## 🧠 Overview
SAUTE is tailored for:
- πŸ—£οΈ Multi-turn conversations
- πŸ‘₯ Multi-speaker interactions
- 🧡 Long-range dialog dependencies
It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.
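The per-speaker memory idea can be sketched as follows. This is a minimal NumPy illustration (not the actual implementation): each utterance embedding updates its speaker's fixed-size memory via an outer product, so reading context later costs O(dΒ²) regardless of dialogue length. All names and shapes here are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: per-speaker memory accumulated from utterance
# embeddings via outer products. Each speaker s keeps a d x d matrix
# M_s; an utterance embedding u updates it as M_s += u u^T, and a
# token query q reads its speaker's summary as M_s @ q.

d = 4
rng = np.random.default_rng(0)

# (speaker, utterance embedding) pairs for a toy three-turn dialogue
utterances = [
    ("alice", rng.standard_normal(d)),
    ("bob",   rng.standard_normal(d)),
    ("alice", rng.standard_normal(d)),
]

memory = {}  # speaker -> d x d accumulated outer products
for speaker, u in utterances:
    M = memory.setdefault(speaker, np.zeros((d, d)))
    M += np.outer(u, u)  # key and value both taken as u in this sketch

# Reading: a query attends to its speaker's memory in O(d^2),
# independent of how many utterances that speaker produced.
q = rng.standard_normal(d)
context = memory["alice"] @ q
assert context.shape == (d,)
```

The memory size is fixed per speaker, which is what sidesteps the quadratic token-to-token attention cost.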
---
## 🧱 Architecture
> πŸ” SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.
- **EDU-Level Encoder**: Mean-pooled BERT outputs per utterance.
- **Speaker Memory**: Outer-product-based accumulation per speaker.
- **Contextualization Layer**: Integrates memory summaries with current token representations.
![saute-architecture](https://github.com/user-attachments/assets/7f18d5b8-9c6b-4577-b718-206a34d84535)
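The EDU-level encoder step (mean-pooled encoder outputs per utterance) can be sketched as masked mean pooling. The shapes below are illustrative assumptions, not the model's actual tensor layout.

```python
import numpy as np

# Sketch of EDU-level embeddings: mean-pool token vectors per
# utterance, ignoring padding positions, to get one embedding
# per utterance. Shapes: (num_utterances, max_tokens, hidden_dim).
hidden = np.arange(24, dtype=float).reshape(2, 3, 4)  # toy encoder outputs
mask = np.array([[1, 1, 0],   # utterance 0 has 2 real tokens
                 [1, 1, 1]],  # utterance 1 has 3 real tokens
                dtype=float)

summed = (hidden * mask[..., None]).sum(axis=1)       # mask out padding, sum tokens
edu_emb = summed / mask.sum(axis=1, keepdims=True)    # divide by real-token count
assert edu_emb.shape == (2, 4)  # one vector per utterance
```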
---
## πŸš€ Key Features
- 🧠 **Speaker-Aware Memory**: Structured per-speaker representation of dialogue context.
- ⚑ **Linear Attention**: Efficient and scalable to long dialogues.
- 🧩 **Pretrained Transformer Compatible**: Can plug into frozen or fine-tuned BERT models.
- πŸͺΆ **Lightweight**: roughly 4M fewer parameters than a 2-layer transformer encoder, with strong MLM performance improvements.
---
## πŸ“ˆ Performance (on SODA, Masked Language Modeling)
| Model                      | Avg MLM Acc (%) | Best MLM Acc (%) |
|----------------------------|-----------------|------------------|
| BERT-base (frozen)         | 33.45           | 45.89            |
| + 1-layer Transformer      | 68.20           | 76.69            |
| + 2-layer Transformer      | 71.81           | 79.54            |
| **+ 1-layer SAUTE (Ours)** | **72.05**       | **80.40**        |
| + 3-layer Transformer      | 73.50           | 80.84            |
| **+ 3-layer SAUTE (Ours)** | **75.65**       | **85.55**        |
> SAUTE achieves the best accuracy using fewer parameters than multi-layer transformers.
---
## πŸ“š Citation / Paper
πŸ“„ [SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)](https://github.com/user-attachments/files/20689695/SAUTE_Speaker_Aware_Utterance_Embedding_Unit.pdf)
---
## πŸ”§ How to Use
```python
from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare inputs (illustrative example: one utterance per turn;
# the exact expected format of speaker_names is an assumption here)
utterances = ["Hi, how are you?", "I'm good, thanks!"]
speaker_names = ["alice", "bob"]
encoded = tokenizer(utterances, padding=True, return_tensors="pt")

outputs = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    speaker_names=speaker_names,
)
```