---
license: mit
tags:
- quote-attribution
- speaker-identification
- dialogue-attribution
- nlp
- transformers
- bert
language:
- en
datasets:
- aNameNobodyChose/quote-speaker-attribution
---

# 🗣️ QuoteCaster: Speaker-Aware Quote Encoder

**QuoteCaster** is a fine-tuned BERT-based model that encodes dialogue quotes together with their surrounding context to **identify or group quotes by speaker**, even in stories the model has never seen before.

The encoder supports unsupervised or few-shot quote attribution by mapping quotes with similar speaking styles (together with their context) to nearby points in embedding space. This makes it well suited to clustering and nearest-neighbor speaker inference.

---

## Model Details

- **Base model**: `bert-base-uncased`
- **Trained with**: Triplet Margin Loss
- **Objective**: Pull quotes from the same speaker together, push different ones apart
- **Input**: `context [SEP] quote`
- **Output**: `[CLS]` embedding as a 768-dimensional vector
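
To make the training objective above concrete, here is a minimal NumPy sketch of a triplet margin loss on toy vectors. The margin of 1.0 and the Euclidean distance are assumptions (they match the defaults of PyTorch's `torch.nn.TripletMarginLoss`); the values actually used to train QuoteCaster are not stated here, and the function name `triplet_margin_loss` is illustrative only.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # anchor to same-speaker quote
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # anchor to other-speaker quote
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy 2-D vectors standing in for 768-dimensional [CLS] embeddings.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])        # same speaker, distance 1 from the anchor
far_negative = np.array([[3.0, 4.0]])    # different speaker, distance 5
near_negative = np.array([[0.0, 0.5]])   # different speaker, distance 0.5

# Well-separated triplet: the negative is already farther than positive + margin.
loss_separated = triplet_margin_loss(anchor, positive, far_negative)

# Violating triplet: the negative is closer than the positive, so the loss is positive.
loss_violating = triplet_margin_loss(anchor, positive, near_negative)
```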

---

## Use Case

QuoteCaster is ideal for:

- 🧠 Clustering quotes by speaker using KMeans or Agglomerative Clustering
- Zero-shot speaker inference on unseen stories
- 🧪 Dialogue structure analysis in novels, scripts, or plays
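
As a rough illustration of the clustering use case, the sketch below groups embeddings with scikit-learn's `KMeans`. The embeddings are synthetic stand-ins, not real QuoteCaster outputs, and `n_clusters=2` is an assumption: in practice the number of speakers has to be known or estimated, and the rows of `embeddings` would be the 768-dimensional vectors produced by the encoder.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for QuoteCaster embeddings: two speakers, five quotes each.
rng = np.random.default_rng(0)
speaker_a = rng.normal(loc=-1.0, scale=0.1, size=(5, 768))
speaker_b = rng.normal(loc=1.0, scale=0.1, size=(5, 768))
embeddings = np.vstack([speaker_a, speaker_b])

# n_clusters is the assumed number of speakers in the story.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Quotes that share a label are treated as coming from the same speaker.
# Cluster ids are arbitrary: they group quotes but do not name the speaker.
```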
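
Nearest-neighbor speaker inference could be sketched as follows, assuming a handful of quotes with known speakers are available as references. The helper `nearest_speaker`, the toy 3-D vectors, and the speaker names are all hypothetical; cosine similarity is an assumed choice of metric rather than one the model card prescribes.

```python
import numpy as np

def nearest_speaker(query, reference_embeddings, reference_speakers):
    """Return the speaker of the reference quote most cosine-similar to the query."""
    refs = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    similarities = refs @ q  # cosine similarity to every reference quote
    return reference_speakers[int(np.argmax(similarities))]

# Toy 3-D vectors standing in for QuoteCaster embeddings of labeled quotes.
reference_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
reference_speakers = ["Alice", "Alice", "Bob"]  # hypothetical speaker labels

# An unlabeled quote whose embedding lies closest to the third reference.
predicted = nearest_speaker(np.array([0.1, 0.8, 0.1]), reference_embeddings, reference_speakers)
```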
---

## Example: Inference with QuoteCaster

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load fine-tuned encoder
model = AutoModel.from_pretrained("aNameNobodyChose/quote-caster-encoder")
tokenizer = AutoTokenizer.from_pretrained("aNameNobodyChose/quote-caster-encoder")
model.eval()  # disable dropout so embeddings are deterministic

# Encode a quote with its surrounding context
def encode_quote(context, quote):
    text = f"{context} [SEP] {quote}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] token embedding, shape (1, 768)
