---
license: apache-2.0
datasets:
  - arbml/Arabic_Literature
  - arbml/Arabic_News
  - khalidalt/ultimate_arabic_news
  - pain/Arabic-Tweets
language:
  - ar
pipeline_tag: text-generation
library_name: transformers
tags:
  - torch
  - custom
  - GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics*  🔗 [GitHub](https://github.com/alphatechlogics) | 🤗 [Hugging Face](https://huggingface.co/alphatechlogics) | 💼 [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer**: *Ahsan Umar*  🔗 [GitHub](https://github.com/codewithdark-git) | 🤗 [Hugging Face](https://huggingface.co/codewithdark) | 💼 [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent and contextually relevant continuations. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized for resource-constrained environments such as free Colab or Kaggle GPUs. The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:

  * Vocabulary Size: ~32,000 (from `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million
* **Configuration**:

  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings
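
The hyperparameters above describe a fairly standard decoder stack. As a minimal sketch (class and method names are illustrative, not the repository's actual implementation), it could be assembled in PyTorch like this:

```python
import torch
import torch.nn as nn

class FaseehGPTSketch(nn.Module):
    """Hypothetical minimal decoder-only model matching the card's hyperparameters."""

    def __init__(self, vocab_size=32000, embed_dim=512, num_layers=12,
                 num_heads=8, ffn_dim=2048, max_seq_len=512, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ffn_dim,
            dropout=dropout, activation="gelu", batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(block, num_layers)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying, as stated above
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)  # Normal(0, 0.02)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        # Causal mask: -inf above the diagonal blocks attention to future tokens
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=input_ids.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)  # (batch, seq_len, vocab_size) logits
```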

---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)

  * **Training Set**: 45,000 (90%)
  * **Validation Set**: 5,000 (10%)
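
A minimal sketch of the sampling and split (the seed and variable names are assumptions; the actual script is not documented in this card):

```python
import random

random.seed(42)                 # assumed seed; not documented in the card
random.shuffle(all_texts)       # all_texts: the pooled News + Literature texts
subset = all_texts[:50_000]
train_texts = subset[:45_000]   # 90%
val_texts = subset[45_000:]     # 10%
```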

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) with decay
* **Batch Size**: Effective 16 (4 gradient accumulation steps)
* **Hardware**: Kaggle (P100)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
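
A sketch of this optimization setup (names such as `model`, `steps_per_epoch`, and `train_loader` are assumed to exist in the training script, which is not published here):

```python
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps = 20 * steps_per_epoch  # 20 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,
)

accum_steps = 4  # micro-batches of 4 accumulate to an effective batch size of 16
for step, input_ids in enumerate(train_loader):
    logits = model(input_ids)
    # Next-token prediction: logits at position t are scored against token t+1
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```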

---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")
**Output**:

> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time, in olden days")
**Output**:

> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"  # "Peace be upon you"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,  # in the standard Transformers generate API, the sampling knobs below only take effect with do_sample=True
    temperature=1.0,
    top_k=50,
    top_p=0.9,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Max tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to top-k tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)
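
For example, a more conservative continuation can be requested by lowering the temperature and tightening the nucleus (the values here are illustrative, not tuned recommendations):

```python
# More focused sampling: lower temperature, tighter top-k/top-p
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_k=40,
    top_p=0.85,
)
```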

**Expected Output**: Arabic text that continues the given prompt, depending on training quality and settings.

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:

  * `arbml/Arabic_News`: News across diverse topics with formal Arabic
  * `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training

### Preprocessing

* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`)
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
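
A hedged sketch of the chunking step (the function name and signature are assumptions, not the repository's code):

```python
def chunk_token_ids(token_ids, max_seq_len=512):
    """Split a long token sequence into chunks with 50% overlap."""
    stride = max_seq_len // 2  # stride = max_seq_len // 2, as described above
    last_start = max(len(token_ids) - max_seq_len, 0)
    return [token_ids[s:s + max_seq_len] for s in range(0, last_start + 1, stride)]
```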

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract loss from checkpoint `model_checkpoint_epoch_20.pt`
* Use verbose logging in future training
* Add evaluation metrics such as perplexity and BLEU (see the sketch after this list)
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing
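
Perplexity is the exponential of the mean token-level cross-entropy, so it can be computed directly from a held-out loss. A minimal sketch, assuming the model returns logits of shape `(batch, seq, vocab)` (the actual forward signature may differ):

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids):
    logits = model(input_ids)
    # Shift by one position: logits at step t predict the token at step t+1
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return math.exp(loss.item())
```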

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Only 50,000 of the ~8.7M available texts were used, due to free-tier GPU limits
* **Language Specificity**: Only Arabic supported; others untested
* **Training Duration**: 8.18 hours insufficient for full dataset

---

## Ethical Considerations

* **Bias**: May reflect cultural or topical biases from source data
* **Usage**: For research/non-commercial use; validate outputs
* **Privacy**: Datasets are public; comply with Hugging Face policies

---

## How to Contribute

* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via issue tracker
* **Training**: Resume on full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.

---

## Citation

```bibtex
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  year={2025},
  publisher={Engineering Archive}
}
```