---
license: apache-2.0
datasets:
- huuuyeah/meetingbank
language:
- en
metrics:
- rouge
base_model:
- google/bigbird-pegasus-large-bigpatent
pipeline_tag: summarization
library_name: transformers
---
# MeetingScript
> A BigBird‑Pegasus model fine‑tuned for meeting transcript summarization on the MeetingBank dataset.
📦 **Model Files**
- **Weights & config**: `pytorch_model.bin`, `config.json`
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json`
- **Generation defaults**: `generation_config.json`
🔗 **Code:** https://github.com/kevin0437/Meeting_scripts
---
## Model Description
**MeetingScript** is a sequence‑to‑sequence model based on
[google/bigbird-pegasus-large-bigpatent](https://huggingface.co/google/bigbird-pegasus-large-bigpatent)
and fine‑tuned on the [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank) corpus of meeting transcripts paired with human‐written summaries.
It is designed to take long meeting transcripts (up to 4096 tokens) and produce concise, coherent summaries.
---
## Evaluation Results
Evaluated on the held‑out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, max_length=600):
| Metric | F1 Score (%) |
|-------------|-------------:|
| **ROUGE‑1** | 51.5556 |
| **ROUGE‑2** | 38.5378 |
| **ROUGE‑L** | 48.0786 |
| **ROUGE‑Lsum** | 48.0142 |
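The scores above measure n‑gram overlap between generated and reference summaries. As a rough illustration of the metric (not the evaluation script used here, which would typically rely on a library such as `rouge_score` with proper tokenization and stemming), ROUGE‑1 F1 can be sketched as:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.

    Illustrative sketch only: whitespace tokenization, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 3 shared unigrams out of 5 candidate / 4 reference tokens
print(round(rouge1_f1("the council approved the budget",
                      "council approved budget proposal"), 4))  # → 0.6667
```

ROUGE‑2 follows the same pattern over bigrams, and ROUGE‑L uses the longest common subsequence instead of fixed n‑grams.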
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# 1) Load from the Hub and move the model to the available device
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript").to(device)

# 2) Summarize a long transcript
transcript = """
Alice: Good morning everyone, let’s get started…
Bob: I updated the design mockups…
… (thousands of words) …
"""

# Tokenize, truncating to the model's 4096-token input limit
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt"
).to(device)

# Generate the summary with beam search
summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,
    early_stopping=True
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("📝 Summary:", summary)
```
---
## Training Data
- **Dataset:** [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank)
- **Splits:** Train (5000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding‑window chunking for sequences > 4096 tokens
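The sliding‑window step can be sketched as follows. The exact window and stride used during training are not documented here, so the values below are assumptions chosen to give consecutive chunks a 512‑token overlap:

```python
def chunk_token_ids(token_ids, window=4096, stride=3584):
    """Split a long token-id sequence into overlapping windows.

    `window` and `stride` are illustrative defaults; consecutive chunks
    share `window - stride` tokens of context.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # last window already reaches the end of the sequence
    return chunks

# A 10,000-token transcript splits into three overlapping chunks
chunks = chunk_token_ids(list(range(10_000)))
print([len(c) for c in chunks])  # → [4096, 4096, 2832]
```

Each chunk would then be summarized independently (or the per‑chunk summaries merged), since the model's encoder accepts at most 4096 tokens per pass.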