---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---
|
|
|
|
|
# Model Card for FaseehGPT |
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Model Name**: FaseehGPT |
|
|
* **Model Type**: Decoder-only Transformer (GPT-style) |
|
|
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT) |
|
|
* **Version**: 1.1 |
|
|
* **Builder**: *Alphatechlogics* ([GitHub](https://github.com/alphatechlogics) | [Hugging Face](https://huggingface.co/alphatechlogics) | [LinkedIn](https://www.linkedin.com/company/alphatechlogics))
|
|
* **Developer**: *Ahsan Umar* ([GitHub](https://github.com/codewithdark-git) | [Hugging Face](https://huggingface.co/codewithdark) | [LinkedIn](https://linkedin.com/in/codewithdark))
|
|
* **Date**: July 10, 2025 |
|
|
* **License**: Apache 2.0 |
|
|
* **Framework**: PyTorch, Hugging Face Transformers |
|
|
* **Language**: Arabic |
|
|
* **Intended Use**: Text generation and language modeling for Arabic text |
|
|
|
|
|
FaseehGPT is a GPT-style language model for Arabic, trained on a subset of public Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized for resource-constrained environments such as free-tier Google Colab or Kaggle GPUs. The model was trained for 20 epochs, with checkpoints and sample generations saved during training.
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers |
|
|
* **Parameters**: |
|
|
|
|
|
* Vocabulary Size: \~32,000 (from `asafaya/bert-base-arabic` tokenizer) |
|
|
* Embedding Dimension: 512 |
|
|
* Number of Layers: 12 |
|
|
* Number of Attention Heads: 8 |
|
|
* Feed-forward Dimension: 2048 |
|
|
* Total Parameters: \~70.7 million |
|
|
* **Configuration**: |
|
|
|
|
|
* Maximum Sequence Length: 512 |
|
|
* Dropout Rate: 0.1 |
|
|
* Activation Function: GELU |
|
|
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02) |
|
|
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings |
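
To make the configuration above concrete, the following is a minimal, illustrative PyTorch sketch of a pre-norm decoder block plus tied embeddings using the listed hyperparameters. It is not the actual FaseehGPT implementation; the class and variable names (`DecoderBlock`, `lm_head`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn

# Hyperparameters as listed above; names here are illustrative, not the real config.
VOCAB_SIZE = 32_000
EMBED_DIM = 512
NUM_LAYERS = 12
NUM_HEADS = 8
FFN_DIM = 2048
MAX_SEQ_LEN = 512
DROPOUT = 0.1

class DecoderBlock(nn.Module):
    """One decoder block: masked (causal) self-attention followed by a GELU feed-forward."""

    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(EMBED_DIM)
        self.attn = nn.MultiheadAttention(EMBED_DIM, NUM_HEADS, dropout=DROPOUT, batch_first=True)
        self.ln2 = nn.LayerNorm(EMBED_DIM)
        self.ffn = nn.Sequential(
            nn.Linear(EMBED_DIM, FFN_DIM),
            nn.GELU(),
            nn.Linear(FFN_DIM, EMBED_DIM),
            nn.Dropout(DROPOUT),
        )

    def forward(self, x):
        # Causal mask: True entries are positions a token is NOT allowed to attend to.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x

# Token embedding; with weight tying, the output projection reuses the same matrix.
token_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE, bias=False)
lm_head.weight = token_embedding.weight  # tie input and output embeddings
blocks = nn.ModuleList(DecoderBlock() for _ in range(NUM_LAYERS))
```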
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Datasets |
|
|
|
|
|
* **arbml/Arabic\_News**: 7,114,814 news article texts |
|
|
* **arbml/Arabic\_Literature**: 1,592,629 literary texts |
|
|
* **Subset Used**: 50,000 texts (randomly sampled) |
|
|
|
|
|
* **Training Set**: 45,000 (90%) |
|
|
* **Validation Set**: 5,000 (10%) |
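
The subset and split above can be reproduced along these lines with the `datasets` library. This is a hedged sketch, not the exact recipe used for FaseehGPT: the `train` split names, the `text` column, and the seed are assumptions.

```python
from datasets import load_dataset, concatenate_datasets

# Load the two corpora used for training; split names and column names are assumed.
news = load_dataset("arbml/Arabic_News", split="train")
literature = load_dataset("arbml/Arabic_Literature", split="train")

# Keep only the text column so the two corpora can be concatenated.
news = news.select_columns(["text"])
literature = literature.select_columns(["text"])

# Pool, shuffle, and take a 50,000-text subset.
corpus = concatenate_datasets([news, literature]).shuffle(seed=42)
subset = corpus.select(range(50_000))

# 90/10 split: 45,000 training texts, 5,000 validation texts.
split = subset.train_test_split(test_size=0.1, seed=42)
train_texts, val_texts = split["train"], split["test"]
```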
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
* **Epochs**: 20 |
|
|
* **Learning Rate**: 3e-4 *(Karpathy constant)* |
|
|
* **Optimizer**: AdamW (weight decay = 0.01) |
|
|
* **Scheduler**: Linear warmup (10% of steps) with decay |
|
|
* **Batch Size**: Effective 16 (4 gradient accumulation steps) |
|
|
* **Hardware**: Kaggle (P100) |
|
|
* **Training Duration**: 8.18 hours |
|
|
* **Checkpoint**: Saved at epoch 20 |
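
A hedged sketch of how the configuration above maps onto an optimizer, scheduler, and gradient-accumulation loop. It assumes a `model` that returns a loss in Hugging Face style and a `train_loader` that yields tokenized micro-batches of 4 texts (4 texts × 4 accumulation steps = effective batch size 16); it is not the exact training script.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

EPOCHS = 20
LEARNING_RATE = 3e-4
WEIGHT_DECAY = 0.01
ACCUM_STEPS = 4  # 4 micro-batches per optimizer step -> effective batch size 16

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
total_steps = (len(train_loader) // ACCUM_STEPS) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,           # then linear decay
)

for epoch in range(EPOCHS):
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / ACCUM_STEPS  # scale loss for accumulation
        loss.backward()
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```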
|
|
|
|
|
--- |
|
|
|
|
|
## Sample Generated Text (Epoch 20) |
|
|
|
|
|
**Prompt 1**: `"اللغة العربية"` (*"the Arabic language"*)
|
|
**Output**: |
|
|
|
|
|
> ุงููุบุฉ ุงูุนุฑุจูุฉ ุงูุฑุจ ููุญ ุงูู ูู
ุง ุฐูู ูุฐู ุงูุจูุงู ุดุนุฑู ูุงูู ุงูุงุณุชุงุฐุฑ ู
ู ูุชุฌ ู
ุนูู
ูู
ูููู ูุตููู ูู ุงููุฑูุฉ ุงูุชููุงุงููุง ุงูุฎุทุงุจ ู
ุงู ู
ุณูู
ูู ุ ุชูููุจุฉ ูุญูุงุฉ โุฒุฉ ุงูุดุฎุตูุฉ ู
ุณูู
ุดุจู ู
ูุฐ |
|
|
|
|
|
**Prompt 2**: `"كان يا مكان في قديم الزمان"` (*"Once upon a time"*)
|
|
**Output**: |
|
|
|
|
|
> ูุงู ูุง ู
ูุงู ูู ูุฏูู
ุงูุฒู
ุงู ุงูุงูุณุงู ุงูุงูุณุงู ุจุนุถ ูุง ุงูุฑ ููุฏ ุงูุงูุณุงู ุฐูู ุงููุงุฑูุงุฑู ุนุฑุถ ุนุฑุถ ูุฑูู.ุฑุญ ูุดุง ุงูู
ุทููุจ ูุนู
ู ูููุชุจ ุงูุงุฑุฏูู ูุจุฏู ุงูุณุงุจู ูุงู " ูุฑูุฏ " ุตูุฑุฉ ููุง ูุงูู
ุง " ุงูุชู ุงููุนูู
ุงูุตุญูุญ ุจู
ุน ููููุท ". ูุฑูุฏ ูุตุฑ ุชูููู ุฏููุชูุชู ูุฏ ูู ุซู
ุงููุฉ ุฌุณุฏ ". ุงูุตุญููุฉ ุงูู ุงูุงุณูุงู
ุงูุจูุฏ ุงูุชู " ูุง ู
ู ุซุงูุซุฉ ุดุจู ูุงูุช ุจุตูุชู ูู ุงููุนูุฏูุง ุงูุจุฑ ุงูุชู ูู ู
ุง ู
ู ุ ุฑุญุจ ู
ูู
ุฉ ู
ุฒ ุงูู ููุจุฑ ุจุณุฑุนุฉุงููุฉ ุ ุงูุงุฑุฌุญ ู
ุง ุนู ุจู ุงูููุงุจ ูู |
|
|
|
|
|
**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning. |
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
FaseehGPT can be used to generate Arabic text from a prompt. Example code: |
|
|
|
|
|
```python |
|
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
# Load model and tokenizer |
|
|
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)  # custom architecture, so remote code must be trusted
|
|
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT") |
|
|
|
|
|
# Generate text |
|
|
prompt = "السلام عليكم"  # "Peace be upon you"
|
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids |
|
|
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9) |
|
|
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
print(generated_text) |
|
|
``` |
|
|
|
|
|
### Parameters for Generation |
|
|
|
|
|
* `max_new_tokens`: Max tokens to generate (e.g., 100) |
|
|
* `temperature`: Controls randomness (default: 1.0) |
|
|
* `top_k`: Limits sampling to top-k tokens (default: 50) |
|
|
* `top_p`: Nucleus sampling threshold (default: 0.9) |
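
As a rough illustration of how these parameters interact, the snippet below reuses `model`, `tokenizer`, and `input_ids` from the usage example above: lower temperature with a tighter `top_k`/`top_p` biases generation toward safer continuations, while higher values increase diversity. Exact behavior depends on the model's custom `generate` implementation, and the variable names here are hypothetical.

```python
# More deterministic: low temperature and a tight sampling pool.
conservative = model.generate(input_ids, max_new_tokens=100, temperature=0.7, top_k=20, top_p=0.8)

# More diverse: higher temperature and a wider sampling pool.
creative = model.generate(input_ids, max_new_tokens=100, temperature=1.2, top_k=100, top_p=0.95)

for name, output in [("conservative", conservative), ("creative", creative)]:
    print(name, "->", tokenizer.decode(output[0], skip_special_tokens=True))
```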
|
|
|
|
|
**Expected Output**: Arabic text that continues the given prompt, depending on training quality and settings. |
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset Description |
|
|
|
|
|
* **Source**: Hugging Face Datasets |
|
|
* **Used Datasets**: |
|
|
|
|
|
* `arbml/Arabic_News`: News across diverse topics with formal Arabic |
|
|
* `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety |
|
|
* **Total Texts**: 8,707,443 (full); 50,000 used for training |
|
|
|
|
|
### Preprocessing |
|
|
|
|
|
* Tokenized using `asafaya/bert-base-arabic` |
|
|
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`) |
|
|
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>` |
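
A minimal sketch of the overlapping-chunk step described above. It shows only the stride logic; the actual pipeline also adds the `<SOS>`/`<EOS>` markers and pads short chunks, which is omitted here, and the helper name `chunk_document` is hypothetical.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
MAX_SEQ_LEN = 512
STRIDE = MAX_SEQ_LEN // 2  # consecutive chunks overlap by half a window

def chunk_document(text):
    """Tokenize one document and split it into overlapping fixed-length chunks."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(ids[start:start + MAX_SEQ_LEN])
        if start + MAX_SEQ_LEN >= len(ids):
            break  # last window already reaches the end of the document
        start += STRIDE
    return chunks
```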
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
* **Metrics**: Cross-entropy loss (training and validation) |
|
|
* **Status**: Loss metrics unavailable due to incomplete logging |
|
|
* **Observations**: Generated samples show partial learning; some incoherence remains |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
* Extract loss from checkpoint `model_checkpoint_epoch_20.pt` |
|
|
* Use verbose logging in future training |
|
|
* Add evaluation metrics: Perplexity, BLEU |
|
|
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing |
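
For the first two recommendations, a hedged sketch of recovering the logged loss from the saved checkpoint and converting cross-entropy loss to perplexity. The checkpoint's key names (`"val_loss"`) are assumptions and may differ; inspect the keys first.

```python
import math
import torch

# Load the epoch-20 checkpoint on CPU and list what it stores.
checkpoint = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
print(checkpoint.keys())

val_loss = checkpoint.get("val_loss")  # key name is an assumption
if val_loss is not None:
    # Perplexity is the exponential of the mean cross-entropy loss (in nats).
    print(f"validation loss = {val_loss:.4f}, perplexity = {math.exp(val_loss):.2f}")
```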
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* **Generated Text Quality**: Inconsistent coherence suggests undertraining |
|
|
* **Resource Constraints**: Small subset used due to Colab GPU limits |
|
|
* **Language Specificity**: Only Arabic supported; others untested |
|
|
* **Training Duration**: 8.18 hours of training is not enough to cover the full dataset
|
|
|
|
|
--- |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
* **Bias**: May reflect cultural or topical biases from source data |
|
|
* **Usage**: For research/non-commercial use; validate outputs |
|
|
* **Privacy**: Datasets are public; comply with Hugging Face policies |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Contribute |
|
|
|
|
|
* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT) |
|
|
* **Issues**: Report bugs or suggest features via issue tracker |
|
|
* **Training**: Resume on full dataset or better hardware |
|
|
* **Evaluation**: Add scripts for BLEU, perplexity, etc. |
|
|
|
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{umar2025faseehgpt, |
|
|
title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding}, |
|
|
author={Umar, Ahsan}, |
|
|
year={2025},
publisher={Engineering Archive}
|
|
} |
|
|
``` |
|
|
|