---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---
# Model Card for FaseehGPT
## Model Details
* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics* 🔗 [GitHub](https://github.com/alphatechlogics) | 🤗 [Hugging Face](https://huggingface.co/alphatechlogics) | 💼 [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer**: *Ahsan Umar* 🔗 [GitHub](https://github.com/codewithdark-git) | 🤗 [Hugging Face](https://huggingface.co/codewithdark) | 💼 [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text
FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent, contextually relevant continuations. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized for resource-constrained environments such as Google Colab (free GPU). The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.
---
## Model Architecture
* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:
* Vocabulary Size: ~32,000 (from `asafaya/bert-base-arabic` tokenizer)
* Embedding Dimension: 512
* Number of Layers: 12
* Number of Attention Heads: 8
* Feed-forward Dimension: 2048
* Total Parameters: ~70.7 million
* **Configuration**:
* Maximum Sequence Length: 512
* Dropout Rate: 0.1
* Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings (see the block sketch below)
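A minimal PyTorch sketch of one such decoder block, using the hyperparameters above. Class and layer names are illustrative assumptions, not the repository's actual implementation; the full model stacks 12 of these blocks and ties the input and output embedding matrices:

```python
import torch
import torch.nn as nn

# Illustrative decoder block with the hyperparameters listed above.
# Names and layer ordering are assumptions, not the repo's actual code.
class DecoderBlock(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.GELU(),                        # GELU activation, as stated above
            nn.Linear(ffn_dim, embed_dim),
            nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier ones
        sz = x.size(1)
        mask = torch.triu(
            torch.full((sz, sz), float("-inf"), device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)          # residual + norm around attention
        return self.norm2(x + self.ffn(x))    # residual + norm around the FFN
```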
---
## Training Details
### Datasets
* **`arbml/Arabic_News`**: 7,114,814 news article texts
* **`arbml/Arabic_Literature`**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)
* **Training Set**: 45,000 (90%)
* **Validation Set**: 5,000 (10%)
### Training Configuration
* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) with decay
* **Batch Size**: 16 effective (per-step batch of 4 with 4 gradient accumulation steps; see the training-loop sketch below)
* **Hardware**: Kaggle (NVIDIA P100 GPU)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
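Roughly, the configuration above corresponds to a loop like the following sketch; `model` and `train_loader` are assumed placeholders, not the actual training script:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Sketch of the stated setup: AdamW, 10% linear warmup, gradient accumulation.
# `model` and `train_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
epochs, accum_steps = 20, 4                    # effective batch = 4 x 4 = 16
total_steps = epochs * (len(train_loader) // accum_steps)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,            # then linear decay to zero
)

for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / accum_steps   # scale loss for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:          # update every 4 micro-batches
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```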
---
## Sample Generated Text (Epoch 20)
**Prompt 1**: `"اللغة العربية"`
**Output**:
> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ
**Prompt 2**: `"كان يا مكان في قديم الزمان"`
**Output**:
> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في
**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.
---
## Usage
FaseehGPT can be used to generate Arabic text from a prompt. Example code:
```python
from transformers import AutoModel, AutoTokenizer

# Load the custom model (trust_remote_code pulls in the repo's model class)
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")
model.eval()

# Generate text; do_sample=True is required for temperature/top_k/top_p to apply
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Parameters for Generation
* `max_new_tokens`: Max tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

Note: `temperature`, `top_k`, and `top_p` only take effect when sampling is enabled with `do_sample=True`.
**Expected Output**: Arabic text continuing the given prompt; quality varies with training and the generation settings above.
---
## Dataset Description
* **Source**: Hugging Face Datasets
* **Used Datasets**:
* `arbml/Arabic_News`: News across diverse topics with formal Arabic
* `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training
### Preprocessing
* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`; see the sketch after this list)
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
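A minimal sketch of this overlapping split, assuming token IDs from the tokenizer above (the function name is hypothetical):

```python
# Illustrative 50%-overlap chunking; the function name is an assumption.
def chunk_token_ids(token_ids, max_seq_len=512):
    stride = max_seq_len // 2                  # overlap of half a window
    chunks = []
    for start in range(0, max(1, len(token_ids) - stride), stride):
        chunks.append(token_ids[start:start + max_seq_len])
    return chunks
```

Each chunk after the first repeats the last 256 tokens of its predecessor, so no context is lost at chunk boundaries.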
---
## Evaluation
* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains
### Recommendations
* Extract loss from checkpoint `model_checkpoint_epoch_20.pt`
* Use verbose logging in future training
* Add evaluation metrics such as perplexity and BLEU (a perplexity sketch follows this list)
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing
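Since perplexity is just the exponential of mean cross-entropy, it can be added with a few lines; in this sketch `model` and `val_loader` are assumed placeholders, not artifacts of this repo:

```python
import math
import torch

# Sketch: perplexity = exp(mean cross-entropy) over a validation set.
@torch.no_grad()
def evaluate_perplexity(model, val_loader, device="cuda"):
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        total_loss += model(**batch).loss.item()   # cross-entropy per batch
        num_batches += 1
    return math.exp(total_loss / num_batches)
```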
---
## Limitations
* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Small subset used due to Colab GPU limits
* **Language Specificity**: Only Arabic supported; others untested
* **Training Duration**: 8.18 hours is too short to cover the full dataset
---
## Ethical Considerations
* **Bias**: May reflect cultural or topical biases from source data
* **Usage**: For research/non-commercial use; validate outputs
* **Privacy**: Datasets are public; comply with Hugging Face policies
---
## How to Contribute
* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via issue tracker
* **Training**: Resume on full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.
---
## Citation
```bibtex
@article{umar2025faseehgpt,
  title     = {FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author    = {Umar, Ahsan},
  year      = {2025},
  publisher = {Engineering Archive}
}
```