	# Arabic News Article Summarization with mT5

	This project fine-tunes the `google/mt5-small` model on a BBC Arabic news dataset to generate concise summaries of Arabic news articles. It builds on mT5's multilingual pretraining, which includes Arabic, and adapts the model to Arabic news text through task-specific fine-tuning.

	## Introduction

	This project uses the multilingual capabilities of the `google/mt5-small` model for Arabic text summarization. Fine-tuning the model on the BBC Arabic news dataset improves its ability to generate accurate and concise summaries of Arabic news articles. Training uses the Transformers library, and summary quality is evaluated with ROUGE scores. You can replicate this model by following the [Training Repo](https://github.com/yalsaffar/mt5-small-Arabic-Summarization).
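
To make the ROUGE evaluation mentioned above more concrete, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) in plain Python. It is an illustration only, not the evaluation code from the training repo: the helper name `rouge1_f1` and the whitespace tokenization are assumptions, and the example sentences are made up for the demonstration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    # Assumption: plain whitespace tokenization, with no stemming or normalization.
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example pair: the candidate drops one word of the reference.
reference = "الحكومة تعلن خطة اقتصادية جديدة"
candidate = "الحكومة تعلن خطة جديدة"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

Here all four candidate words appear in the five-word reference, so precision is 1.0 and recall is 0.8. Established ROUGE packages apply their own tokenization rules, so their scores on Arabic text may differ from this sketch.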

	## Dataset

	The dataset comprises BBC Arabic news articles, split into 32,000 training rows, 4,000 validation rows, and 4,000 test rows.

	- Dataset Source: [BBC Arabic News Data](https://www.kaggle.com/datasets/fadyelkbeer/arabic-summarization-bbc-news)
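
The row counts above correspond to an 80/10/10 split of 40,000 rows. The sketch below shows one way such a split could be produced. How the original split was actually made is not stated here, so the shuffling, the seed, and the helper name `split_rows` are all assumptions.

```python
import random


def split_rows(rows, train_pct=80, val_pct=10, seed=42):
    """Shuffle rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


rows = list(range(40_000))  # stand-in for 40,000 article/summary pairs
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))
```

With 40,000 input rows this yields 32,000 training, 4,000 validation, and 4,000 test rows, matching the sizes listed for this dataset.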

	## Model

	`google/mt5-small` is the smallest checkpoint of mT5, the multilingual variant of T5, which was pretrained on text covering 101 languages, including Arabic. This project fine-tunes it for Arabic news summarization.

	- Pretrained Model: [google/mt5-small](https://huggingface.co/google/mt5-small)


	## Usage

	To use this model for summarizing Arabic news articles, follow the steps below:

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
	import torch

	model_name = "yalsaffar/mt5-small-Arabic-Summarization"
	# Use the GPU when one is available, otherwise fall back to the CPU
	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Load tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	config = AutoConfig.from_pretrained(
	    model_name,
	    max_length=128,
	    length_penalty=0.6,
	    no_repeat_ngram_size=2,
	    num_beams=15,
	)
	model = AutoModelForSeq2SeqLM.from_pretrained(model_name, config=config).to(device)

	# Prepare input
	input_text = "الأخبار ...."
	input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

	# Generate summary; arguments passed to generate() take precedence over the config values above
	with torch.no_grad():
	    preds = model.generate(
	        input_ids,
	        num_beams=15,
	        num_return_sequences=1,
	        no_repeat_ngram_size=1,
	        remove_invalid_values=True,
	        max_length=128,
	    )

	# Convert ids to text
	summary = tokenizer.batch_decode(preds, skip_special_tokens=True)

	print("*** Original Text ***")
	print(input_text)
	print("*** Generated Summary ***")
	print(summary[0])
	```
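
The `no_repeat_ngram_size` argument used above prevents any n-gram of that size from appearing twice in the generated sequence. The following pure-Python sketch illustrates the idea on a list of tokens. It is not the Transformers implementation, and `banned_next_tokens` is a hypothetical helper written only for this explanation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    n = ngram_size
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the next n-gram (empty when n == 1).
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        ngram = tuple(generated[i:i + n])
        # An earlier n-gram with the same prefix makes its final token forbidden.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["a", "b", "c", "a", "b"]
print(banned_next_tokens(tokens, 2))  # only "c": the bigram ("b", "c") already occurred
print(sorted(banned_next_tokens(tokens, 1)))  # every token generated so far
```

With a size of 1, as passed to `generate()` above, no token may appear twice in the summary at all, which is a much stricter constraint than the size of 2 set in the config.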

	## License

	This project is licensed under the MIT License - see the LICENSE file for details.