---
license: apache-2.0
datasets:
  - arbml/Arabic_Literature
  - arbml/Arabic_News
  - khalidalt/ultimate_arabic_news
  - pain/Arabic-Tweets
language:
  - ar
pipeline_tag: text-generation
library_name: transformers
tags:
  - torch
  - custom
  - GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics*  🔗 [GitHub](https://github.com/alphatechlogics) | 🤗 [Hugging Face](https://huggingface.co/alphatechlogics) | 💼 [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer**: *Ahsan Umar*  🔗 [GitHub](https://github.com/codewithdark-git) | 🤗 [Hugging Face](https://huggingface.co/codewithdark) | 💼 [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent and contextually relevant continuations. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized for resource-constrained environments such as free Colab or Kaggle GPUs. The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:

  * Vocabulary Size: ~32,000 (from `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million
* **Configuration**:

  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings
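
The hyperparameters above describe a fairly standard decoder stack. As a minimal sketch (class and method names are illustrative, not the repository's actual implementation), it could be assembled in PyTorch like this:

```python
import torch
import torch.nn as nn

class FaseehGPTSketch(nn.Module):
    """Hypothetical minimal decoder-only model matching the card's hyperparameters."""

    def __init__(self, vocab_size=32000, embed_dim=512, num_layers=12,
                 num_heads=8, ffn_dim=2048, max_seq_len=512, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
        block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=ffn_dim,
            dropout=dropout, activation="gelu", batch_first=True,
        )
        self.blocks = nn.TransformerEncoder(block, num_layers)
        self.lm_head = nn.Linear(embed_dim, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying, as stated above
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)  # Normal(0, 0.02)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        pos = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)
        # Causal mask: -inf above the diagonal blocks attention to future tokens
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=input_ids.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)  # (batch, seq_len, vocab_size) logits
```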

---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)

  * **Training Set**: 45,000 (90%)
  * **Validation Set**: 5,000 (10%)
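
A minimal sketch of the sampling and split (the seed and variable names are assumptions; the actual script is not documented in this card):

```python
import random

random.seed(42)                 # assumed seed; not documented in the card
random.shuffle(all_texts)       # all_texts: the pooled News + Literature texts
subset = all_texts[:50_000]
train_texts = subset[:45_000]   # 90%
val_texts = subset[45_000:]     # 10%
```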

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) with decay
* **Batch Size**: Effective 16 (4 gradient accumulation steps)
* **Hardware**: Kaggle (P100)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
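
A sketch of this optimization setup (names such as `model`, `steps_per_epoch`, and `train_loader` are assumed to exist in the training script, which is not published here):

```python
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps = 20 * steps_per_epoch  # 20 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,
)

accum_steps = 4  # micro-batches of 4 accumulate to an effective batch size of 16
for step, input_ids in enumerate(train_loader):
    logits = model(input_ids)
    # Next-token prediction: logits at position t are scored against token t+1
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```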

---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")
**Output**:

> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time, in olden days")
**Output**:

> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"  # "Peace be upon you"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,  # in the standard Transformers generate API, the sampling knobs below only take effect with do_sample=True
    temperature=1.0,
    top_k=50,
    top_p=0.9,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Max tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to top-k tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)
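
For example, a more conservative continuation can be requested by lowering the temperature and tightening the nucleus (the values here are illustrative, not tuned recommendations):

```python
# More focused sampling: lower temperature, tighter top-k/top-p
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_k=40,
    top_p=0.85,
)
```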

**Expected Output**: Arabic text that continues the given prompt, depending on training quality and settings.

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:

  * `arbml/Arabic_News`: News across diverse topics with formal Arabic
  * `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training

### Preprocessing

* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`)
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
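
A hedged sketch of the chunking step (the function name and signature are assumptions, not the repository's code):

```python
def chunk_token_ids(token_ids, max_seq_len=512):
    """Split a long token sequence into chunks with 50% overlap."""
    stride = max_seq_len // 2  # stride = max_seq_len // 2, as described above
    last_start = max(len(token_ids) - max_seq_len, 0)
    return [token_ids[s:s + max_seq_len] for s in range(0, last_start + 1, stride)]
```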

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract loss from checkpoint `model_checkpoint_epoch_20.pt`
* Use verbose logging in future training
* Add evaluation metrics such as perplexity and BLEU (see the sketch after this list)
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing
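
Perplexity is the exponential of the mean token-level cross-entropy, so it can be computed directly from a held-out loss. A minimal sketch, assuming the model returns logits of shape `(batch, seq, vocab)` (the actual forward signature may differ):

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids):
    logits = model(input_ids)
    # Shift by one position: logits at step t predict the token at step t+1
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )
    return math.exp(loss.item())
```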

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Only 50,000 of the ~8.7M available texts were used, due to free-tier GPU limits
* **Language Specificity**: Only Arabic supported; others untested
* **Training Duration**: 8.18 hours insufficient for full dataset

---

## Ethical Considerations

* **Bias**: May reflect cultural or topical biases from source data
* **Usage**: For research/non-commercial use; validate outputs
* **Privacy**: Datasets are public; comply with Hugging Face policies

---

## How to Contribute

* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via issue tracker
* **Training**: Resume on full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.

---

## Citation

```bibtex
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  year={2025},
  publisher={Engineering Archive}
}
```