|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- huuuyeah/meetingbank |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- rouge |
|
|
base_model: |
|
|
- google/bigbird-pegasus-large-bigpatent |
|
|
pipeline_tag: summarization |
|
|
library_name: transformers |
|
|
--- |
|
|
# MeetingScript |
|
|
|
|
|
> A BigBird-Pegasus model fine-tuned for meeting transcript summarization on the MeetingBank dataset.
|
|
|
|
|
📦 **Model Files** |
|
|
- **Weights & config**: `pytorch_model.bin`, `config.json` |
|
|
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json` |
|
|
- **Generation defaults**: `generation_config.json` |
|
|
|
|
|
🔗 **Code:** https://github.com/kevin0437/Meeting_scripts
|
|
|
|
|
--- |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**MeetingScript** is a sequence‑to‑sequence model based on |
|
|
[google/bigbird-pegasus-large-bigpatent](https://huggingface.co/google/bigbird-pegasus-large-bigpatent) |
|
|
and fine-tuned on the [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank) corpus of meeting transcripts paired with human-written summaries.
|
|
It is designed to take long meeting transcripts (up to 4096 tokens) and produce concise, coherent summaries. |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluated on the held-out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, `max_length=600`):
|
|
|
|
|
| Metric | F1 Score (%) | |
|
|
|-------------|-------------:| |
|
|
| **ROUGE‑1** | 51.5556 | |
|
|
| **ROUGE‑2** | 38.5378 | |
|
|
| **ROUGE‑L** | 48.0786 | |
|
|
| **ROUGE‑Lsum** | 48.0142 | |
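To illustrate what these metrics measure, here is a minimal, toy sketch of unigram ROUGE-1 F1 (lowercased whitespace tokens, per-token overlap counts). The scores above were produced with a full ROUGE implementation; this sketch only shows the precision/recall/F1 mechanics:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Toy unigram ROUGE-1 F1 between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Overlap = sum of per-token minimum counts between the two texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the council approved the budget amendment",
    "council approved the new budget",
)
print(f"ROUGE-1 F1: {score:.3f}")  # → ROUGE-1 F1: 0.727
```

ROUGE-2 and ROUGE-L differ only in the unit of overlap (bigrams and longest common subsequence, respectively).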
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
import torch |
|
|
|
|
|
# 1) Load from the Hub
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript")

# Move the model to GPU if available, otherwise run on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# 2) Summarize a long transcript
transcript = """
Alice: Good morning everyone, let’s get started…
Bob: I updated the design mockups…
… (thousands of words) …
"""
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt"
).to(device)
|
|
|
|
|
summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,  # raise (e.g. to 600, as in evaluation) for longer summaries
    early_stopping=True
)
|
|
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True) |
|
|
print("📝 Summary:", summary) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Data |
|
|
- **Dataset:** [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank)
- **Splits:** Train (5000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding-window chunking for sequences longer than 4096 tokens
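The chunking step can be sketched as follows. This is a minimal illustration of sliding-window chunking over token IDs; the window size matches the model's 4096-token limit, but the stride (and hence the overlap between consecutive windows) is an assumption, not the exact value used in training:

```python
def chunk_token_ids(token_ids, window=4096, stride=3584):
    """Split a long token sequence into overlapping windows.

    Consecutive windows overlap by (window - stride) tokens so that
    content near a chunk boundary appears in both chunks.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# Example: a 10,000-token transcript becomes 3 overlapping chunks
ids = list(range(10000))
print([len(c) for c in chunk_token_ids(ids)])  # → [4096, 4096, 2832]
```

Each chunk is summarized independently, and the per-chunk summaries can then be concatenated or summarized again to cover the full meeting.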