---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---
# Model Card for FaseehGPT
## Model Details
* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics* 🔗 [GitHub](https://github.com/alphatechlogics) | 🤗 [Hugging Face](https://huggingface.co/alphatechlogics) | 💼 [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer**: *Ahsan Umar* 🔗 [GitHub](https://github.com/codewithdark-git) | 🤗 [Hugging Face](https://huggingface.co/codewithdark) | 💼 [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text
FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent, contextually relevant continuations. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized for resource-constrained environments such as Google Colab (free GPU). The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.
---
## Model Architecture
* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:
* Vocabulary Size: ~32,000 (from `asafaya/bert-base-arabic` tokenizer)
* Embedding Dimension: 512
* Number of Layers: 12
* Number of Attention Heads: 8
* Feed-forward Dimension: 2048
* Total Parameters: ~70.7 million
* **Configuration**:
* Maximum Sequence Length: 512
* Dropout Rate: 0.1
* Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings (see the block sketch below)
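A minimal PyTorch sketch of one such decoder block, using the hyperparameters above. Class and layer names are illustrative assumptions, not the repository's actual implementation; the full model stacks 12 of these blocks and ties the input and output embedding matrices:

```python
import torch
import torch.nn as nn

# Illustrative decoder block with the hyperparameters listed above.
# Names and layer ordering are assumptions, not the repo's actual code.
class DecoderBlock(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads, dropout=dropout, batch_first=True
        )
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.GELU(),                        # GELU activation, as stated above
            nn.Linear(ffn_dim, embed_dim),
            nn.Dropout(dropout),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier ones
        sz = x.size(1)
        mask = torch.triu(
            torch.full((sz, sz), float("-inf"), device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)          # residual + norm around attention
        return self.norm2(x + self.ffn(x))    # residual + norm around the FFN
```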
---
## Training Details
### Datasets
* **`arbml/Arabic_News`**: 7,114,814 news article texts
* **`arbml/Arabic_Literature`**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)
* **Training Set**: 45,000 (90%)
* **Validation Set**: 5,000 (10%)
### Training Configuration
* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) with decay
* **Batch Size**: 16 effective (per-step batch of 4 with 4 gradient accumulation steps; see the training-loop sketch below)
* **Hardware**: Kaggle (NVIDIA P100 GPU)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
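Roughly, the configuration above corresponds to a loop like the following sketch; `model` and `train_loader` are assumed placeholders, not the actual training script:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Sketch of the stated setup: AdamW, 10% linear warmup, gradient accumulation.
# `model` and `train_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
epochs, accum_steps = 20, 4                    # effective batch = 4 x 4 = 16
total_steps = epochs * (len(train_loader) // accum_steps)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,            # then linear decay to zero
)

for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / accum_steps   # scale loss for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:          # update every 4 micro-batches
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```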
---
## Sample Generated Text (Epoch 20)
**Prompt 1**: `"اللغة العربية"`
**Output**:
> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ
**Prompt 2**: `"كان يا مكان في قديم الزمان"`
**Output**:
> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في
**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.
---
## Usage
FaseehGPT can be used to generate Arabic text from a prompt. Example code:
```python
from transformers import AutoModel, AutoTokenizer

# Load the custom model (trust_remote_code pulls in the repo's model class)
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")
model.eval()

# Generate text; do_sample=True is required for temperature/top_k/top_p to apply
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Parameters for Generation
* `max_new_tokens`: Max tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

Note: `temperature`, `top_k`, and `top_p` only take effect when sampling is enabled with `do_sample=True`.
**Expected Output**: Arabic text continuing the given prompt; quality varies with training and the generation settings above.
---
## Dataset Description
* **Source**: Hugging Face Datasets
* **Used Datasets**:
* `arbml/Arabic_News`: News across diverse topics with formal Arabic
* `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training
### Preprocessing
* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`; see the sketch after this list)
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
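A minimal sketch of this overlapping split, assuming token IDs from the tokenizer above (the function name is hypothetical):

```python
# Illustrative 50%-overlap chunking; the function name is an assumption.
def chunk_token_ids(token_ids, max_seq_len=512):
    stride = max_seq_len // 2                  # overlap of half a window
    chunks = []
    for start in range(0, max(1, len(token_ids) - stride), stride):
        chunks.append(token_ids[start:start + max_seq_len])
    return chunks
```

Each chunk after the first repeats the last 256 tokens of its predecessor, so no context is lost at chunk boundaries.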
---
## Evaluation
* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains
### Recommendations
* Extract loss from checkpoint `model_checkpoint_epoch_20.pt`
* Use verbose logging in future training
* Add evaluation metrics such as perplexity and BLEU (a perplexity sketch follows this list)
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing
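Since perplexity is just the exponential of mean cross-entropy, it can be added with a few lines; in this sketch `model` and `val_loader` are assumed placeholders, not artifacts of this repo:

```python
import math
import torch

# Sketch: perplexity = exp(mean cross-entropy) over a validation set.
@torch.no_grad()
def evaluate_perplexity(model, val_loader, device="cuda"):
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in val_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        total_loss += model(**batch).loss.item()   # cross-entropy per batch
        num_batches += 1
    return math.exp(total_loss / num_batches)
```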
---
## Limitations
* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Small subset used due to Colab GPU limits
* **Language Specificity**: Only Arabic supported; others untested
* **Training Duration**: 8.18 hours is too short to cover the full dataset
---
## Ethical Considerations
* **Bias**: May reflect cultural or topical biases from source data
* **Usage**: For research/non-commercial use; validate outputs
* **Privacy**: Datasets are public; comply with Hugging Face policies
---
## How to Contribute
* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via issue tracker
* **Training**: Resume on full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.
---
## Citation
```bibtex
@article{umar2025faseehgpt,
  title     = {FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author    = {Umar, Ahsan},
  year      = {2025},
  publisher = {Engineering Archive}
}
```