---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics*
  🔗 [GitHub](https://github.com/alphatechlogics) | 🤗 [Hugging Face](https://huggingface.co/alphatechlogics) | 💼 [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer**: *Ahsan Umar*
  🔗 [GitHub](https://github.com/codewithdark-git) | 🤗 [Hugging Face](https://huggingface.co/codewithdark) | 💼 [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of public Arabic datasets to generate coherent, contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized to run in resource-constrained environments such as Google Colab's free GPU tier. The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:
  * Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million
* **Configuration**:
  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings

---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)
  * **Training Set**: 45,000 (90%)
  * **Validation Set**: 5,000 (10%)

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) followed by linear decay
* **Batch Size**: Effective 16 (batch size 4 with 4 gradient accumulation steps)
* **Hardware**: Kaggle (P100 GPU)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20 (see the sketch below)
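
A minimal sketch of this configuration (AdamW, 10% linear warmup with decay, 4-step gradient accumulation) is shown below for orientation. It is not the released training script: the `model` and `train_loader` objects, and the Hugging Face-style `.loss` return, are assumptions.

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Placeholders: `model` (the FaseehGPT module) and `train_loader`
# (batches of token IDs) are assumed to exist already.
epochs = 20
accum_steps = 4  # effective batch 16 = 4 samples/step x 4 accumulated steps

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps = (len(train_loader) // accum_steps) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% linear warmup
    num_training_steps=total_steps,           # then linear decay
)

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        input_ids = batch["input_ids"]
        # Assumes a HF-style forward that returns .loss when labels are passed
        loss = model(input_ids, labels=input_ids).loss
        (loss / accum_steps).backward()  # scale so accumulated grads average
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```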
---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")

**Output**:

> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time")

**Output**:

> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis**: The generated text shows some local coherence but contains grammatical and semantic inconsistencies; the model would likely benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer; trust_remote_code=True pulls in the repo's
# custom GPT-style model class, which provides generate()
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Encode the prompt and sample a continuation
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Maximum number of tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

**Expected Output**: Arabic text that continues the given prompt; quality depends on training quality and generation settings.

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:
  * `arbml/Arabic_News`: News across diverse topics, written in formal Arabic
  * `arbml/Arabic_Literature`: Novels and poetry, providing rich linguistic variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training

### Preprocessing

* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`)
* Special tokens: the standard special tokens of the `asafaya/bert-base-arabic` tokenizer (`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`)

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract loss from the checkpoint `model_checkpoint_epoch_20.pt` (see the sketches at the end of this card)
* Use verbose logging in future training runs
* Add evaluation metrics: perplexity, BLEU
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Only a small data subset was used due to Colab GPU limits
* **Language Specificity**: Only Arabic is supported; other languages are untested
* **Training Duration**: 8.18 hours is insufficient for the full dataset

---

## Ethical Considerations

* **Bias**: May reflect cultural or topical biases present in the source data
* **Usage**: Intended for research and non-commercial use; validate outputs before relying on them
* **Privacy**: Datasets are public; comply with Hugging Face policies

---

## How to Contribute

* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via the issue tracker
* **Training**: Resume training on the full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.

---

## Citation

```bibtex
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  year={2025},
  publisher={Engineering Archive}
}
```
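
---

## Appendix: Evaluation Starters

The Recommendations above suggest extracting the loss from `model_checkpoint_epoch_20.pt`. A minimal way to inspect the checkpoint is sketched below; the `"loss"` key is a guess and depends on what the training script actually saved.

```python
import torch

# Load the epoch-20 checkpoint on CPU and list its contents
ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
print(list(ckpt.keys()))

# Print the stored loss, if the script saved one under this (hypothetical) key
if "loss" in ckpt:
    print("recorded loss:", ckpt["loss"])
```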
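
For the recommended perplexity metric, one common formulation is the exponential of the mean per-token cross-entropy on the validation set. The sketch below assumes a `val_loader` of tokenized batches and the same Hugging Face-style forward as in the training sketch; it is a starting point, not released code.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, val_loader, device="cpu"):
    """Perplexity = exp(mean per-token cross-entropy) on the validation set."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        # Assumes the forward returns mean cross-entropy when labels are given
        loss = model(input_ids, labels=input_ids).loss
        total_loss += loss.item() * input_ids.numel()
        total_tokens += input_ids.numel()
    return math.exp(total_loss / total_tokens)
```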