---
license: mit
tags:
- masked-language-modeling
- dialogue
- speaker-aware
- transformer
- saute
- pytorch
datasets:
- SODA
language:
- en
pipeline_tag: fill-mask
model_type: saute
library_name: transformers
---

# 👨‍🍳 SAUTE: Speaker-Aware Utterance Embedding Unit

**SAUTE** is a lightweight, speaker-aware transformer architecture designed for effective modeling of multi-speaker dialogues. It combines **EDU-level utterance embeddings**, **speaker-sensitive memory**, and **efficient linear attention** to encode rich conversational context with minimal overhead.

---

## 🧠 Overview

SAUTE is tailored for:

- 🗣️ Multi-turn conversations
- 👥 Multi-speaker interactions
- 🧵 Long-range dialogue dependencies

It avoids the quadratic cost of full self-attention by summarizing per-speaker memory from EDU embeddings and injecting contextual information through lightweight linear attention mechanisms.

---

## 🧱 Architecture

> 🔍 SAUTE contextualizes each token with speaker-specific memory summaries built from utterance-level embeddings.

- **EDU-Level Encoder**: Mean-pooled BERT outputs per utterance.
- **Speaker Memory**: Outer-product-based accumulation per speaker.
- **Contextualization Layer**: Integrates memory summaries with current token representations.

![saute-architecture](https://github.com/user-attachments/assets/7f18d5b8-9c6b-4577-b718-206a34d84535)

---

## 🚀 Key Features

- 🧠 **Speaker-Aware Memory**: Structured per-speaker representation of dialogue context.
- ⚡ **Linear Attention**: Efficient and scalable to long dialogues.
- 🧩 **Pretrained Transformer Compatible**: Plugs into frozen or fine-tuned BERT models.
- 🪶 **Lightweight**: roughly 4M fewer parameters than a 2-layer transformer encoder, with strong MLM performance improvements.

---

## 📈 Performance (on SODA, Masked Language Modeling)

| Model                      | Avg MLM Acc (%) | Best MLM Acc (%) |
|----------------------------|-----------------|------------------|
| BERT-base (frozen)         | 33.45           | 45.89            |
| + 1-layer Transformer      | 68.20           | 76.69            |
| + 2-layer Transformer      | 71.81           | 79.54            |
| **+ 1-layer SAUTE (Ours)** | **72.05**       | **80.40**        |
| + 3-layer Transformer      | 73.5            | 80.84            |
| **+ 3-layer SAUTE (Ours)** | **75.65**       | **85.55**        |

> SAUTE achieves the best accuracy while using fewer parameters than multi-layer transformers.

---

## 📚 Citation / Paper

📄 [SAUTE: Speaker-Aware Utterance Embedding Unit (PDF)](https://github.com/user-attachments/files/20689695/SAUTE_Speaker_Aware_Utterance_Embedding_Unit.pdf)

---

## 🔧 How to Use

```python
from saute_model import SAUTEConfig, UtteranceEmbedings
from transformers import BertTokenizerFast

# Load tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = UtteranceEmbedings.from_pretrained("JustinDuc/saute")

# Prepare inputs (example): input_ids, attention_mask, and speaker_names
# must be built from the dialogue before calling the model
outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    speaker_names=speaker_names
)
```
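The snippet above leaves `input_ids`, `attention_mask`, and `speaker_names` undefined. Below is a minimal sketch of how these inputs could be prepared for a short two-speaker dialogue. The per-utterance batching and the format of `speaker_names` (one speaker label per utterance) are assumptions for illustration, not the documented API; check the `saute_model` source for the exact expected shapes.

```python
# Minimal sketch of input preparation (assumed format: one row per utterance,
# with speaker_names giving the speaker of each utterance).
dialogue = [
    ("Alice", "Hey, did you finish the report?"),
    ("Bob", "Almost, I just need to add the figures."),
    ("Alice", "Great, send it over when you're done."),
]

speaker_names = [speaker for speaker, _ in dialogue]
utterances = [text for _, text in dialogue]

# Tokenize each utterance separately so the model sees utterance-level rows.
encoded = tokenizer(
    utterances,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

outputs = model(
    input_ids=encoded["input_ids"],            # (num_utterances, seq_len)
    attention_mask=encoded["attention_mask"],  # (num_utterances, seq_len)
    speaker_names=speaker_names,               # one speaker label per utterance
)
```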