Update README.md
README.md
---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Developers**: [Ahsan Umar](https://huggingface.co/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is optimized for resource-constrained environments such as Google Colab's free GPU tier. The model was trained for 20 epochs, with checkpoints saved and sample generations produced along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters** (a rough breakdown is sketched below):
  * Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million
* **Configuration**:
  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling for text generation, with weight tying between input and output embeddings for efficiency
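
The total can be roughly reproduced from these hyperparameters. The sketch below is a back-of-envelope estimate assuming a standard GPT block layout (Q/K/V/output projections plus a two-layer feed-forward network); it is not taken from the FaseehGPT source and ignores biases and LayerNorm weights.

```python
# Back-of-envelope parameter estimate from the configuration above (approximate).
vocab_size, embed_dim, num_layers, ff_dim, max_seq_len = 32_000, 512, 12, 2048, 512

token_emb = vocab_size * embed_dim                       # ~16.4M
pos_emb = max_seq_len * embed_dim                        # ~0.26M
per_layer = 4 * embed_dim**2 + 2 * embed_dim * ff_dim    # attention + feed-forward projections
transformer = num_layers * per_layer                     # ~37.7M
lm_head = vocab_size * embed_dim                         # shares weights with token_emb when tied

print(f"transformer + embeddings: {(token_emb + pos_emb + transformer) / 1e6:.1f}M")  # ~54.4M
print(f"with LM head counted separately: "
      f"{(token_emb + pos_emb + transformer + lm_head) / 1e6:.1f}M")                  # ~70.8M
```

The reported ~70.7M total is consistent with counting the output projection alongside the (tied) input embedding.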
---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts, randomly sampled (loading and split sketched below)
  * **Training Set**: 45,000 texts (90%)
  * **Validation Set**: 5,000 texts (10%)
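
The exact sampling code is not part of this card; the following is a minimal sketch using the `datasets` library. The dataset config, column layout, and seed are assumptions, and the real subset was drawn from both datasets rather than only `arbml/Arabic_Literature`.

```python
from datasets import load_dataset

# Illustrative sketch of the 50,000-text subset and 90/10 split described above;
# not the actual training script.
literature = load_dataset("arbml/Arabic_Literature", split="train")

subset = literature.shuffle(seed=42).select(range(50_000))
split = subset.train_test_split(test_size=0.1, seed=42)

train_texts, val_texts = split["train"], split["test"]
print(len(train_texts), len(val_texts))  # 45000 5000
```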

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(the "Karpathy constant")*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) followed by linear decay
* **Batch Size**: Effective batch size of 16, via 4 gradient accumulation steps (see the sketch below)
* **Hardware**: Kaggle (P100 GPU)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
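
For reference, the optimizer, warmup/decay schedule, and gradient accumulation can be wired up roughly as follows. This is an illustrative PyTorch sketch with a toy model and random data, not the actual FaseehGPT training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Toy stand-ins for the model and dataloader; only the optimization wiring
# (AdamW, 10% linear warmup + decay, 4-step gradient accumulation) is the point.
model = torch.nn.Linear(512, 10)
loss_fn = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(4, 512), torch.randint(0, 10, (4,))) for _ in range(32)]

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

accum_steps = 4                                # 4 micro-batches of 4 -> effective batch size 16
total_steps = len(batches) // accum_steps      # optimizer steps, not micro-batches
warmup_steps = max(1, int(0.1 * total_steps))  # 10% linear warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                                               # warm up linearly ...
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))   # ... then decay linearly

scheduler = LambdaLR(optimizer, lr_lambda)

optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average over the effective batch
    loss.backward()
    if (i + 1) % accum_steps == 0:             # update only once per effective batch
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```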

---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")

**Output**:

> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time, long ago")

**Output**:

> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis**: The generated text shows some local coherence but includes grammatical and semantic inconsistencies; the model would likely benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT")
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Encode a prompt (any Arabic text)
prompt = "اللغة العربية"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate a continuation with sampling
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Maximum number of tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

**Expected Output**: Arabic text that continues the given prompt; quality depends on training and the sampling settings.
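
To make the `top_k` and `top_p` settings concrete, the sketch below shows how the two filters act on a single step of next-token logits. It illustrates the sampling idea only and is not FaseehGPT's internal implementation.

```python
import torch

def filter_logits(logits, top_k=50, top_p=0.9):
    """Conceptual top-k then top-p (nucleus) filtering for one step of logits."""
    # Top-k: drop everything outside the k most likely tokens.
    kth_best = torch.topk(logits, top_k).values[..., -1:]
    logits = torch.where(logits < kth_best, torch.full_like(logits, float("-inf")), logits)

    # Top-p: drop tokens once the probability mass before them already exceeds top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    remove = (torch.cumsum(probs, dim=-1) - probs) > top_p
    sorted_logits[remove] = float("-inf")
    return logits.scatter(-1, sorted_idx, sorted_logits)

logits = torch.randn(1, 32_000)                          # fake next-token logits
probs = torch.softmax(filter_logits(logits), dim=-1)
next_token_id = torch.multinomial(probs, num_samples=1)  # sample one token id
print(next_token_id)
```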

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:
  * `arbml/Arabic_News`: News articles covering diverse topics, providing formal and varied Arabic text
  * `arbml/Arabic_Literature`: Literary works, including novels and poetry, offering rich linguistic patterns
* **Total Texts**: 8,707,443 in the full datasets; 50,000 used for training

### Preprocessing

* Texts are tokenized with `asafaya/bert-base-arabic`
* Long texts are split into overlapping chunks (`stride = max_seq_len // 2`) to fit the maximum sequence length of 512 (see the sketch below)
* Special tokens (`<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`) are added for language modeling
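
The overlapping-chunk step can be illustrated as follows; this is a minimal sketch of the idea, not the project's actual preprocessing code. The overlap preserves context that would otherwise be lost at chunk boundaries.

```python
def chunk_ids(ids, max_seq_len=512, stride=None):
    """Split a list of token ids into overlapping windows (default stride = max_seq_len // 2)."""
    stride = stride or max_seq_len // 2
    if len(ids) <= max_seq_len:
        return [ids]
    return [ids[start:start + max_seq_len]
            for start in range(0, len(ids) - stride, stride)]

# A 1,000-token document becomes three overlapping windows of at most 512 tokens.
print([len(c) for c in chunk_ids(list(range(1000)))])  # [512, 512, 488]
```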

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss values are unavailable due to incomplete logging during training
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract the recorded loss from the checkpoint `model_checkpoint_epoch_20.pt`
* Use more verbose logging in future training runs
* Compute additional metrics such as perplexity and BLEU to quantify generation quality (see the sketch below)
* Experiment with a smaller configuration (e.g., `embed_dim=256`, `num_layers=6`) for faster evaluation on Colab
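
Two of these items need only a few lines to get started. The snippet below lists whatever the checkpoint actually contains rather than assuming its layout, and computes perplexity as the exponential of the mean token-level cross-entropy; the loss values shown are placeholders.

```python
import math
import torch

# Inspect what was saved in the epoch-20 checkpoint (its keys are not documented,
# so just list them instead of assuming a particular structure).
ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))

# Perplexity from mean token-level cross-entropy (natural log); replace the
# placeholder values with real validation losses.
val_losses = [4.2, 4.0, 3.9]
perplexity = math.exp(sum(val_losses) / len(val_losses))
print(f"perplexity ~ {perplexity:.1f}")  # lower is better
```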

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests the model is undertrained
* **Resource Constraints**: Trained on a 50,000-text subset due to limited free-tier GPU resources, potentially reducing generalization compared to the full 8.7M-text dataset
* **Language Specificity**: Optimized for Arabic; performance on other languages is untested
* **Training Duration**: 8.18 hours for 20 epochs on the subset; training on the full dataset requires more powerful hardware

---

## Ethical Considerations

* **Bias**: Outputs may reflect cultural or topical biases present in the source datasets
* **Usage**: Intended for research and non-commercial use; validate outputs before downstream use
* **Privacy**: The training datasets are public; usage should comply with Hugging Face policies

---

## How to Contribute

* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest improvements via the repository's issue tracker
* **Training**: Resume training with the full dataset or better hardware to improve performance
* **Evaluation**: Contribute scripts for perplexity, BLEU, or other metrics to assess text quality

---

## Citation

If you use FaseehGPT in your research, please cite:

```bibtex
@misc{faseehgpt2025,
  title  = {FaseehGPT: An Arabic Language Model},
  author = {Rohma, Ahsan Umar},
  year   = {2025},
  url    = {https://huggingface.co/alphatechlogics/FaseehGPT}
}
```