---
license: apache-2.0
datasets:
- huuuyeah/meetingbank
language:
- en
metrics:
- rouge
base_model:
- google/bigbird-pegasus-large-bigpatent
pipeline_tag: summarization
library_name: transformers
---

# MeetingScript

> A BigBird-Pegasus model fine-tuned for meeting transcript summarization on the MeetingBank dataset.

📦 **Model Files**

- **Weights & config**: `pytorch_model.bin`, `config.json`
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json`
- **Generation defaults**: `generation_config.json`

🔗 **Code:** https://github.com/kevin0437/Meeting_scripts

---

## Model Description

**MeetingScript** is a sequence-to-sequence model based on [google/bigbird-pegasus-large-bigpatent](https://huggingface.co/google/bigbird-pegasus-large-bigpatent) and fine-tuned on the [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank) corpus of meeting transcripts paired with human-written summaries. It is designed to take long meeting transcripts (up to 4096 tokens) and produce concise, coherent summaries.
---

## Evaluation Results

Evaluated on the held-out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, `max_length=600`):

| Metric         | F1 Score (%) |
|----------------|-------------:|
| **ROUGE-1**    |      51.5556 |
| **ROUGE-2**    |      38.5378 |
| **ROUGE-L**    |      48.0786 |
| **ROUGE-Lsum** |      48.0142 |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# 1) Load from the Hub
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript")

# Move the model to GPU when one is available; fall back to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# 2) Summarize a long transcript
transcript = """
Alice: Good morning everyone, let’s get started…
Bob: I updated the design mockups…
… (thousands of words) …
"""
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt",
).to(device)

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,
    early_stopping=True,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("📝 Summary:", summary)
```

---

## Training Data

- **Dataset:** [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank)
- **Splits:** Train (5000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding-window chunking for sequences > 4096 tokens
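The chunking code itself is not part of this card. As a rough illustration only, sliding-window chunking over an already-tokenized transcript could look like the sketch below; the `chunk_tokens` name, the 4096-token window, and the 3584-token stride (i.e. a 512-token overlap) are assumptions for the example, not the project's actual preprocessing parameters:

```python
def chunk_tokens(token_ids, window=4096, stride=3584):
    """Split a long token sequence into overlapping fixed-size windows.

    Consecutive windows overlap by (window - stride) tokens so that
    sentences cut at a chunk boundary still appear whole in one chunk.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# Example: a 10,000-"token" transcript split into 4096-token windows
chunks = chunk_tokens(list(range(10_000)))
print([len(c) for c in chunks])  # [4096, 4096, 2832]
```

Each window can then be summarized independently (or paired with its target during fine-tuning), with the overlap preserving context across chunk boundaries; the exact window/stride used to train MeetingScript is not documented here.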