---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---
# Model Card for FaseehGPT
## Model Details
* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder: *Alphatechlogics*** [GitHub](https://github.com/alphatechlogics) | [Hugging Face](https://huggingface.co/alphatechlogics) | [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer: *Ahsan Umar*** [GitHub](https://github.com/codewithdark-git) | [Hugging Face](https://huggingface.co/codewithdark) | [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text
FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of public Arabic datasets to generate coherent, contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized for resource-constrained environments such as free Colab/Kaggle GPUs. The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.
---
## Model Architecture
* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:
* Vocabulary Size: \~32,000 (from `asafaya/bert-base-arabic` tokenizer)
* Embedding Dimension: 512
* Number of Layers: 12
* Number of Attention Heads: 8
* Feed-forward Dimension: 2048
* Total Parameters: \~70.7 million
* **Configuration**:
* Maximum Sequence Length: 512
* Dropout Rate: 0.1
* Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings
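The listed hyperparameters can be sanity-checked with a back-of-envelope parameter count (assuming standard GPT-style blocks with biased linear layers and two LayerNorms per block; this is an independent estimate, not the repository's actual code). Counting a separate output head gives roughly 70.9M, close to the card's ~70.7M figure, while fully tied embeddings would give roughly 54.5M:

```python
# Back-of-envelope parameter count from the hyperparameters listed above.
vocab_size, embed_dim, n_layers, ff_dim, max_len = 32_000, 512, 12, 2_048, 512

tok_emb = vocab_size * embed_dim                   # token embedding table
pos_emb = max_len * embed_dim                      # learned position embeddings
attn = 4 * (embed_dim * embed_dim + embed_dim)     # Q, K, V, output projections (+ biases)
mlp = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
norms = 2 * 2 * embed_dim                          # two LayerNorms (gain + bias) per block
per_layer = attn + mlp + norms

tied = tok_emb + pos_emb + n_layers * per_layer + 2 * embed_dim  # final LayerNorm, tied head
untied = tied + vocab_size * embed_dim                           # separate output head

print(f"tied: {tied / 1e6:.1f}M, untied: {untied / 1e6:.1f}M")
```

The exact totals depend on bias and normalization conventions, so treat these numbers as approximations.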
---
## Training Details
### Datasets
* **arbml/Arabic\_News**: 7,114,814 news article texts
* **arbml/Arabic\_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)
* **Training Set**: 45,000 (90%)
* **Validation Set**: 5,000 (10%)
### Training Configuration
* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) with decay
* **Batch Size**: Effective 16 (4 gradient accumulation steps)
* **Hardware**: Kaggle (P100)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
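The configuration above can be sketched as a training loop (a minimal stand-in using a toy `nn.Linear` model and random data; the repository's real loop is not published, so the structure here is illustrative):

```python
import torch
from torch import nn

# Hyperparameters mirroring the card: lr 3e-4, weight decay 0.01,
# 10% linear warmup then linear decay, effective batch 16 via 4 accumulation steps.
model = nn.Linear(8, 8)  # stand-in for the FaseehGPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

total_steps = 100
warmup_steps = int(0.10 * total_steps)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup from 0 to peak
    # linear decay from peak back to 0
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

accum_steps = 4  # micro-batch of 4 -> effective batch size 16
for step in range(total_steps):
    for _ in range(accum_steps):
        x = torch.randn(4, 8)
        loss = model(x).pow(2).mean() / accum_steps  # scale loss for accumulation
        loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

In a real run the dummy loss would be the cross-entropy over next-token predictions, and `total_steps` would be derived from the dataset size and epoch count.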
---
## Sample Generated Text (Epoch 20)
**Prompt 1**: `"اللغة العربية"` (the Arabic language)
**Output**:
> *(sample output lost to text-encoding corruption)*
**Prompt 2**: `"كان يا مكان في قديم الزمان"` (once upon a time)
**Output**:
> *(sample output lost to text-encoding corruption)*
**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.
---
## Usage
FaseehGPT can be used to generate Arabic text from a prompt. Example code:
```python
from transformers import AutoModel, AutoTokenizer
# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")
# Generate text
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# do_sample=True is required for temperature/top_k/top_p to take effect
outputs = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Parameters for Generation
* `max_new_tokens`: Max tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to top-k tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)
**Expected Output**: Arabic text that continues the given prompt, depending on training quality and settings.
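For reference, a common implementation of the top-k / nucleus filtering that these parameters control looks like the following (a sketch, not the repository's actual sampling code):

```python
import torch

def filter_logits(logits, top_k=50, top_p=0.9):
    """Mask logits so sampling considers only the top-k tokens whose
    cumulative probability stays within top_p (nucleus sampling)."""
    if top_k > 0:
        # keep only the top_k highest logits
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        to_remove = cum_probs > top_p
        to_remove[..., 1:] = to_remove[..., :-1].clone()  # keep first token over threshold
        to_remove[..., 0] = False
        remove_mask = to_remove.scatter(-1, sorted_idx, to_remove)
        logits = logits.masked_fill(remove_mask, float("-inf"))
    return logits

# sample the next token from the filtered distribution
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
probs = torch.softmax(filter_logits(logits, top_k=3, top_p=0.9), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

Lower `top_k`/`top_p` values make generation more conservative; higher `temperature` flattens the distribution before filtering.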
---
## Dataset Description
* **Source**: Hugging Face Datasets
* **Used Datasets**:
* `arbml/Arabic_News`: News across diverse topics with formal Arabic
* `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training
### Preprocessing
* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`)
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
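The overlapping-chunk step can be sketched as follows (an illustrative helper; the repository's preprocessing code is not shown here). Each window starts half a window after the previous one, so every token appears in at most two chunks:

```python
def chunk_ids(token_ids, max_seq_len=512):
    """Split a long token sequence into overlapping windows with
    stride = max_seq_len // 2, as described above."""
    stride = max_seq_len // 2
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_seq_len])
        if start + max_seq_len >= len(token_ids):
            break  # the final window already reaches the end of the text
    return chunks

# four overlapping windows of up to 4 tokens each
print(chunk_ids(list(range(10)), max_seq_len=4))
```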
---
## Evaluation
* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains
### Recommendations
* Extract loss from checkpoint `model_checkpoint_epoch_20.pt`
* Use verbose logging in future training
* Add evaluation metrics: Perplexity, BLEU
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing
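Perplexity, the first recommended metric, follows directly from the cross-entropy loss already being tracked: it is the exponential of the mean loss in nats.

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats)."""
    return math.exp(mean_cross_entropy)

# e.g. a validation loss of 3.9 nats corresponds to a perplexity of about 49.4
print(perplexity(3.9))
```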
---
## Limitations
* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Small subset used due to Colab GPU limits
* **Language Specificity**: Only Arabic supported; others untested
* **Training Duration**: 8.18 hours insufficient for full dataset
---
## Ethical Considerations
* **Bias**: May reflect cultural or topical biases from source data
* **Usage**: For research/non-commercial use; validate outputs
* **Privacy**: Datasets are public; comply with Hugging Face policies
---
## How to Contribute
* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via issue tracker
* **Training**: Resume on full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.
---
## Citation
```bibtex
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  year={2025},
  publisher={Engineering Archive}
}
```