|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- huuuyeah/meetingbank |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- rouge |
|
|
base_model: |
|
|
- google/bigbird-pegasus-large-bigpatent |
|
|
pipeline_tag: summarization |
|
|
library_name: transformers |
|
|
--- |
|
|
# MeetingScript |
|
|
|
|
|
> A BigBird-Pegasus model fine-tuned for meeting transcript summarization on the MeetingBank dataset.
|
|
|
|
|
📦 **Model Files** |
|
|
- **Weights & config**: `pytorch_model.bin`, `config.json` |
|
|
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json` |
|
|
- **Generation defaults**: `generation_config.json` |
|
|
|
|
|
🔗 **Code:** https://github.com/kevin0437/Meeting_scripts
|
|
|
|
|
--- |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**MeetingScript** is a sequence‑to‑sequence model based on |
|
|
[google/bigbird-pegasus-large-bigpatent](https://huggingface.co/google/bigbird-pegasus-large-bigpatent) |
|
|
and fine-tuned on the [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank) corpus of meeting transcripts paired with human-written summaries.
|
|
It is designed to take long meeting transcripts (up to 4096 tokens) and produce concise, coherent summaries. |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
Evaluated on the held-out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, `max_length=600`):
|
|
|
|
|
| Metric | F1 Score (%) | |
|
|
|-------------|-------------:| |
|
|
| **ROUGE‑1** | 51.5556 | |
|
|
| **ROUGE‑2** | 38.5378 | |
|
|
| **ROUGE‑L** | 48.0786 | |
|
|
| **ROUGE‑Lsum** | 48.0142 | |
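To illustrate what these metrics measure, here is a minimal, toy sketch of unigram ROUGE-1 F1 (lowercased whitespace tokens, per-token overlap counts). The scores above were produced with a full ROUGE implementation; this sketch only shows the precision/recall/F1 mechanics:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Toy unigram ROUGE-1 F1 between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Overlap = sum of per-token minimum counts between the two texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the council approved the budget amendment",
    "council approved the new budget",
)
print(f"ROUGE-1 F1: {score:.3f}")  # → ROUGE-1 F1: 0.727
```

ROUGE-2 and ROUGE-L differ only in the unit of overlap (bigrams and longest common subsequence, respectively).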
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
import torch |
|
|
|
|
|
# 1) Load from the Hub
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript")

# Move the model to GPU if available, otherwise run on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# 2) Summarize a long transcript
transcript = """
Alice: Good morning everyone, let’s get started…
Bob: I updated the design mockups…
… (thousands of words) …
"""
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt"
).to(device)
|
|
|
|
|
summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,  # raise (e.g. to 600, as in evaluation) for longer summaries
    early_stopping=True
)
|
|
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True) |
|
|
print("📝 Summary:", summary) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Data |
|
|
- **Dataset:** [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank)
- **Splits:** Train (5000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding-window chunking for sequences longer than 4096 tokens
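The chunking step can be sketched as follows. This is a minimal illustration of sliding-window chunking over token IDs; the window size matches the model's 4096-token limit, but the stride (and hence the overlap between consecutive windows) is an assumption, not the exact value used in training:

```python
def chunk_token_ids(token_ids, window=4096, stride=3584):
    """Split a long token sequence into overlapping windows.

    Consecutive windows overlap by (window - stride) tokens so that
    content near a chunk boundary appears in both chunks.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# Example: a 10,000-token transcript becomes 3 overlapping chunks
ids = list(range(10000))
print([len(c) for c in chunk_token_ids(ids)])  # → [4096, 4096, 2832]
```

Each chunk is summarized independently, and the per-chunk summaries can then be concatenated or summarized again to cover the full meeting.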