# Arabic News Article Summarization with mT5

This project fine-tunes the `google/mt5-small` model on the BBC Arabic news dataset to summarize news articles into concise summaries. It builds on the multilingual pretraining of this Transformer-based model to handle Arabic text.
## Introduction

This project uses the multilingual capabilities of the `google/mt5-small` model for Arabic text summarization. Fine-tuning on the BBC Arabic news dataset adapts the model to generate accurate, concise summaries of Arabic news articles. The training loop is built with the Transformers library, and ROUGE scores serve as the evaluation metric. To replicate this model, follow the [Training Repo](https://github.com/yalsaffar/mt5-small-Arabic-Summarization).
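
To make the ROUGE metric mentioned above concrete, here is a minimal sketch of ROUGE-N computed over whitespace-separated tokens. It is an illustration only, not the evaluation code used in the training repo: the function name `rouge_n` and the whitespace tokenization are assumptions, and established ROUGE packages apply their own tokenization and normalization.

```python
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Return ROUGE-N precision, recall, and F1 for one reference/candidate pair.

    Assumption: tokens are obtained by splitting on whitespace, which is a
    simplification for Arabic text.
    """

    def ngram_counts(text: str) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngram_counts(reference)
    cand_counts = ngram_counts(candidate)

    # Clipped overlap: each n-gram counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())

    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference summary and generated summary.
scores = rouge_n("الطقس حار اليوم في المدينة", "الطقس حار في المدينة")
print(scores)
```

Recall measures how much of the reference summary the generated summary covers, while precision penalizes extra content, so reporting both (or F1) gives a more balanced picture than either alone.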
## Dataset

The dataset comprises BBC Arabic news articles, split into 32,000 training rows, 4,000 validation rows, and 4,000 testing rows.

- **Dataset Source:** [BBC Arabic News Data](https://www.kaggle.com/datasets/fadyelkbeer/arabic-summarization-bbc-news)
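
As a rough illustration of a split with these sizes, the sketch below partitions row indices into train, validation, and test sets. How the published split was actually produced is not stated here, so the shuffling, the seed, and the helper name `split_indices` are all assumptions; the total of 40,000 rows is simply the sum of the three stated sizes.

```python
import random


def split_indices(n_rows: int, n_val: int, n_test: int, seed: int = 0):
    """Shuffle row indices and split them into train/validation/test lists.

    Assumption: a seeded random shuffle; the original split may have been
    created differently.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# 32,000 + 4,000 + 4,000 rows, matching the sizes listed above.
train_idx, val_idx, test_idx = split_indices(40_000, n_val=4_000, n_test=4_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing ROUGE scores between training configurations.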
## Model

`google/mt5-small` is a small checkpoint of mT5, the multilingual extension of T5 that covers 101 languages, including Arabic. This project fine-tunes it for Arabic news summarization.

- **Pretrained Model:** [google/mt5-small](https://huggingface.co/google/mt5-small)
## Usage

To use this model for summarizing Arabic news articles, follow the steps below:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
import torch

model_name = "yalsaffar/mt5-small-Arabic-Summarization"
# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(
    model_name,
    max_length=128,
    length_penalty=0.6,
    no_repeat_ngram_size=2,
    num_beams=15,
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, config=config).to(device)

# Prepare input
input_text = "الأخبار ...."
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# Generate summary (arguments passed here take precedence over the config values)
with torch.no_grad():
    preds = model.generate(
        input_ids,
        num_beams=15,
        num_return_sequences=1,
        no_repeat_ngram_size=1,
        remove_invalid_values=True,
        max_length=128,
    )

# Convert ids to text
summary = tokenizer.batch_decode(preds, skip_special_tokens=True)
print("***** Original Text *****")
print(input_text)
print("***** Generated Summary *****")
print(summary[0])
```
## License

This project is licensed under the MIT License; see the LICENSE file for details.