|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- llama-factory |
|
|
--- |
|
|
|
|
|
# Qwen-Ar-GEC |
|
|
|
|
|
Qwen-Ar-GEC is a fine-tuned adaptation of the Qwen model for **Arabic Grammatical Error Correction (GEC)**.

The model automatically detects and corrects grammatical and spelling errors in Arabic text while adding full diacritics (tashkeel),

making it useful for applications such as language learning, academic writing assistance, and automated proofreading.
|
|
|
|
|
# Architecture |
|
|
|
|
|
This model was fine-tuned from **[Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)** using the **QLoRA** method on **50,000 samples**.
|
|
The fine-tuning followed the system instruction below: |
|
|
|
|
|
```
صحّح الأخطاء النحوية والإملائية فقط إن وُجدت. أضف التشكيل الكامل على كل الحروف إجباريًا - حتى لو كان النص صحيحًا. لا تغيّر أي كلمة أو اسم أو رقم أو بنية جملة. إذا لم يكن هناك خطأ نحوي أو إملائي، أعد إنتاج المدخلات كما هي - لكن مع التشكيل الكامل. لا تضف شروحات. لا تكرر المدخلات. لا تعدّل المعنى.
```

In English: *"Correct the grammatical and spelling errors only if present. Add full diacritics (tashkeel) on all letters, mandatorily - even if the text is already correct. Do not change any word, name, number, or sentence structure. If there is no grammatical or spelling error, reproduce the input as-is - but with full diacritics. Do not add explanations. Do not repeat the input. Do not alter the meaning."*
|
|
Training was conducted with **[Llama Factory](https://github.com/hiyouga/LLaMA-Factory)**, using LoRA rank `r = 32` and `alpha = 64`.
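
For reference, a QLoRA run with these settings could be expressed as a LLaMA-Factory config roughly like the sketch below. Only the base model, `lora_rank`, `lora_alpha`, and the 50,000-sample cap come from this card; the dataset name, quantization bit, and remaining hyperparameters are illustrative assumptions, not the exact recipe used.

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
quantization_bit: 4              # QLoRA: 4-bit quantized base model (assumed)

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32                    # r = 32, as stated above
lora_alpha: 64                   # alpha = 64, as stated above
lora_target: all

### dataset
dataset: arabic_gec              # placeholder name for the pre-processed dataset
template: qwen
max_samples: 50000               # matches the 50,000 training samples

### train (all values below are assumed)
output_dir: saves/qwen-ar-gec
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

Such a config is typically launched with `llamafactory-cli train <config>.yaml`.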
|
|
|
|
|
|
|
|
# Dataset |
|
|
|
|
|
This model was trained on 50,000 samples of **[our dataset](https://huggingface.co/datasets/CUAIStudents/Arabic-Tashkeel)**, with only light pre-processing, since the larger base model already carries broader linguistic knowledge.
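
To inspect the raw data yourself, it can be pulled straight from the Hub. This is a minimal sketch; the `train` split name and the 50,000-example subsample are assumptions for illustration.

```python
from datasets import load_dataset

# Load the public dataset (split name assumed to be "train")
ds = load_dataset("CUAIStudents/Arabic-Tashkeel", split="train")
print(ds[0])  # inspect the raw fields before any pre-processing

# Subsample 50,000 examples, mirroring the training setup described above
subset = ds.shuffle(seed=42).select(range(50_000))
```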
|
|
|
|
|
|
|
|
# Usage |
|
|
|
|
|
```python |
|
|
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model_name = "Abdo-Alshoki/qwen-ar-gec-v2" |
|
|
|
|
|
# Load model and tokenizer (bf16 weights + automatic device placement)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bf16 support
    device_map="auto",
)
|
|
|
|
|
# Recommended system instruction (same as training) |
|
|
system_prompt = "صحّح الأخطاء النحوية والإملائية فقط إن وُجدت. أضف التشكيل الكامل على كل الحروف إجباريًا - حتى لو كان النص صحيحًا. لا تغيّر أي كلمة أو اسم أو رقم أو بنية جملة. إذا لم يكن هناك خطأ نحوي أو إملائي، أعد إنتاج المدخلات كما هي - لكن مع التشكيل الكامل. لا تضف شروحات. لا تكرر المدخلات. لا تعدّل المعنى."
|
|
|
|
|
# Example input |
|
|
messages = [ |
|
|
{"role": "system", "content": system_prompt}, |
|
|
{"role": "user", "content": "ู
ููู ุงููู
ูููู
ูู ุฃููู ูุงู ููุณุณูููุทูุคุฃ ุฃูุจูุฏูุงุ ูููุงู ููุจูููููุง ููู ุงูุฎูุงุฑูุฌู ุทูููููุงู ูุฃููููููู
ู ููุญูุชูุงุฌูููู ุฅููู ุงูุฑููุทูุงุจู."} |
|
|
] |
|
|
|
|
|
# Format prompt and tokenize |
|
|
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
|
|
|
# Generate output (greedy decoding keeps corrections deterministic)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# e.g. ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ (corrected and fully diacritized)
|
|
|
|
|
``` |
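
To correct several sentences, the steps above can be wrapped in a small helper. This is a minimal sketch reusing `model`, `tokenizer`, and `system_prompt` from the previous block; the sample sentences are illustrative.

```python
def correct(text: str, max_new_tokens: int = 512) -> str:
    """Return the corrected, fully diacritized version of `text`."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)


# Illustrative batch of sentences with common errors
corrected = [correct(s) for s in ["ذهب الولد الى المدرسه", "هذا مثال اخر"]]
print(corrected)
```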
|
|
|
|
|
# Limitations and Improvements
|
|
|
|
|
This model achieves promising accuracy on our dataset; however, the dataset contains limited coverage of Modern Standard Arabic (MSA). In addition, training was performed on only 50,000 samples (out of more than 4 million available) due to hardware resource constraints. Training on the full dataset and broadening MSA coverage are the natural next steps for improvement.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|