|
|
--- |
|
|
library_name: transformers |
|
|
tags: |
|
|
- llama-factory |
|
|
--- |
|
|
|
|
|
# Qwen-Ar-GEC |
|
|
|
|
|
Qwen-Ar-GEC is a fine-tuned adaptation of the Qwen model for **Arabic Grammatical Error Correction (GEC)**.

The model automatically detects and corrects grammatical and spelling errors in Arabic text while adding full diacritics (tashkeel),

making it useful for applications such as language learning, academic writing assistance, and automated proofreading.
|
|
|
|
|
# Architecture |
|
|
|
|
|
This model was fine-tuned from **[Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)** using the **QLoRA** method on **50,000 samples**.
|
|
The fine-tuning followed the system instruction below: |
|
|
|
|
|
```
صحّح الأخطاء النحوية والإملائية فقط إن وُجدت. أضف التشكيل الكامل على كل الحروف إجباريًا - حتى لو كان النص صحيحًا. لا تغيّر أي كلمة أو اسم أو رقم أو بنية جملة. إذا لم يكن هناك خطأ نحوي أو إملائي، أعد إنتاج المدخلات كما هي - لكن مع التشكيل الكامل. لا تضف شروحات. لا تكرر المدخلات. لا تعدّل المعنى.
```

In English: *"Correct the grammatical and spelling errors only if present. Add full diacritics (tashkeel) on all letters, mandatorily - even if the text is already correct. Do not change any word, name, number, or sentence structure. If there is no grammatical or spelling error, reproduce the input as-is - but with full diacritics. Do not add explanations. Do not repeat the input. Do not alter the meaning."*
|
|
Training was conducted with **[Llama Factory](https://github.com/hiyouga/LLaMA-Factory)**, using LoRA rank `r = 32` and `alpha = 64`.
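
For reference, a QLoRA run with these settings could be expressed as a LLaMA-Factory config roughly like the sketch below. Only the base model, `lora_rank`, `lora_alpha`, and the 50,000-sample cap come from this card; the dataset name, quantization bit, and remaining hyperparameters are illustrative assumptions, not the exact recipe used.

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
quantization_bit: 4              # QLoRA: 4-bit quantized base model (assumed)

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32                    # r = 32, as stated above
lora_alpha: 64                   # alpha = 64, as stated above
lora_target: all

### dataset
dataset: arabic_gec              # placeholder name for the pre-processed dataset
template: qwen
max_samples: 50000               # matches the 50,000 training samples

### train (all values below are assumed)
output_dir: saves/qwen-ar-gec
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

Such a config is typically launched with `llamafactory-cli train <config>.yaml`.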
|
|
|
|
|
|
|
|
# Dataset |
|
|
|
|
|
This model was trained on 50,000 samples of **[our dataset](https://huggingface.co/datasets/CUAIStudents/Arabic-Tashkeel)**, with only light pre-processing, since the larger base model already carries broader linguistic knowledge.
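
To inspect the raw data yourself, it can be pulled straight from the Hub. This is a minimal sketch; the `train` split name and the 50,000-example subsample are assumptions for illustration.

```python
from datasets import load_dataset

# Load the public dataset (split name assumed to be "train")
ds = load_dataset("CUAIStudents/Arabic-Tashkeel", split="train")
print(ds[0])  # inspect the raw fields before any pre-processing

# Subsample 50,000 examples, mirroring the training setup described above
subset = ds.shuffle(seed=42).select(range(50_000))
```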
|
|
|
|
|
|
|
|
# Usage |
|
|
|
|
|
```python |
|
|
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model_name = "Abdo-Alshoki/qwen-ar-gec-v2" |
|
|
|
|
|
# Load model and tokenizer (bf16 weights + automatic device placement)

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bf16 support
    device_map="auto",
)
|
|
|
|
|
# Recommended system instruction (same as training) |
|
|
system_prompt = "صحّح الأخطاء النحوية والإملائية فقط إن وُجدت. أضف التشكيل الكامل على كل الحروف إجباريًا - حتى لو كان النص صحيحًا. لا تغيّر أي كلمة أو اسم أو رقم أو بنية جملة. إذا لم يكن هناك خطأ نحوي أو إملائي، أعد إنتاج المدخلات كما هي - لكن مع التشكيل الكامل. لا تضف شروحات. لا تكرر المدخلات. لا تعدّل المعنى."
|
|
|
|
|
# Example input |
|
|
messages = [ |
|
|
{"role": "system", "content": system_prompt}, |
|
|
{"role": "user", "content": "ู
ููู ุงููู
ูููู
ูู ุฃููู ูุงู ููุณุณูููุทูุคุฃ ุฃูุจูุฏูุงุ ูููุงู ููุจูููููุง ููู ุงูุฎูุงุฑูุฌู ุทูููููุงู ูุฃููููููู
ู ููุญูุชูุงุฌูููู ุฅููู ุงูุฑููุทูุงุจู."} |
|
|
] |
|
|
|
|
|
# Format prompt and tokenize |
|
|
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
|
|
|
# Generate output (greedy decoding keeps corrections deterministic)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# e.g. ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ (corrected and fully diacritized)
|
|
|
|
|
``` |
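
To correct several sentences, the steps above can be wrapped in a small helper. This is a minimal sketch reusing `model`, `tokenizer`, and `system_prompt` from the previous block; the sample sentences are illustrative.

```python
def correct(text: str, max_new_tokens: int = 512) -> str:
    """Return the corrected, fully diacritized version of `text`."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)


# Illustrative batch of sentences with common errors
corrected = [correct(s) for s in ["ذهب الولد الى المدرسه", "هذا مثال اخر"]]
print(corrected)
```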
|
|
|
|
|
# Limitations and Improvements
|
|
|
|
|
This model achieves promising accuracy on our dataset; however, the dataset contains limited coverage of Modern Standard Arabic (MSA). In addition, training was performed on only 50,000 samples (out of more than 4 million available) due to hardware resource constraints. Training on the full dataset and broadening MSA coverage are the natural next steps for improvement.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|