---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics* ([GitHub](https://github.com/alphatechlogics) | [Hugging Face](https://huggingface.co/alphatechlogics) | [LinkedIn](https://www.linkedin.com/company/alphatechlogics))
* **Developer**: *Ahsan Umar* ([GitHub](https://github.com/codewithdark-git) | [Hugging Face](https://huggingface.co/codewithdark) | [LinkedIn](https://linkedin.com/in/codewithdark))
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model designed for Arabic text processing, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is optimized for resource-constrained environments such as Google Colab (free GPU). The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:
  * Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million (see the estimate below)
* **Configuration**:
  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings
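
The following back-of-the-envelope estimate sanity-checks the parameter count using only the hyperparameters listed above. It assumes learned positional embeddings and counts the output projection as a separate matrix; the exact bookkeeping in the released implementation (biases, LayerNorms, weight tying) may differ.

```python
# Rough parameter estimate from the hyperparameters above (illustrative only).
vocab_size, embed_dim, num_layers, ff_dim, max_seq_len = 32_000, 512, 12, 2048, 512

token_embeddings = vocab_size * embed_dim          # ~16.4M
position_embeddings = max_seq_len * embed_dim      # ~0.3M (learned positions assumed)
attention_per_layer = 4 * embed_dim * embed_dim    # Q, K, V and output projections
ffn_per_layer = 2 * embed_dim * ff_dim             # up- and down-projections
lm_head = vocab_size * embed_dim                   # output projection, counted separately here

total = (token_embeddings + position_embeddings
         + num_layers * (attention_per_layer + ffn_per_layer)
         + lm_head)
print(f"~{total / 1e6:.1f}M parameters")           # ~70.8M, close to the reported ~70.7M
```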

---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled); see the loading sketch below
  * **Training Set**: 45,000 (90%)
  * **Validation Set**: 5,000 (10%)
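
A minimal sketch of how such a subset and split could be produced with the Hugging Face `datasets` library is shown below. It is illustrative, not the project's actual data pipeline; the `"train"` split name, the seed, and the assumption that the two corpora expose compatible columns are all assumptions.

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative sketch: sample 50,000 texts from the two corpora and make a 90/10 split.
news = load_dataset("arbml/Arabic_News", split="train")             # split name assumed
literature = load_dataset("arbml/Arabic_Literature", split="train")

# Assumes both datasets share compatible features; otherwise align columns first.
combined = concatenate_datasets([news, literature]).shuffle(seed=42)
subset = combined.select(range(50_000))                              # 50,000 randomly sampled texts

splits = subset.train_test_split(test_size=0.1, seed=42)             # 90% / 10%
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))                                    # 45000 5000
```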

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) with decay (sketched below)
* **Batch Size**: Effective 16 (4 gradient accumulation steps)
* **Hardware**: Kaggle (P100)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
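
A rough sketch of this configuration is shown below. It assumes a `model` whose forward pass returns a Hugging Face-style output with a `.loss`, and a `train_loader` yielding tokenized batches; neither is defined here, and the project's actual training loop may differ.

```python
import torch
from transformers import get_linear_schedule_with_warmup

EPOCHS = 20
ACCUMULATION_STEPS = 4                       # effective batch size 16
num_optimizer_steps = (len(train_loader) * EPOCHS) // ACCUMULATION_STEPS

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_optimizer_steps),   # 10% linear warmup
    num_training_steps=num_optimizer_steps,
)

for epoch in range(EPOCHS):
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / ACCUMULATION_STEPS   # scale loss for accumulation
        loss.backward()
        if (step + 1) % ACCUMULATION_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```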

---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")
**Output**:

> *(Arabic continuation generated by the model; see the analysis below.)*

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time, long ago")
**Output**:

> *(A longer Arabic continuation generated by the model; see the analysis below.)*

**Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Maximum number of tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the top-k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

**Expected Output**: Arabic text that continues the given prompt; quality depends on training and on the sampling settings above, whose effect is sketched below.
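
The sketch below shows the standard top-k and nucleus (top-p) filtering that these parameters imply: logits outside the top-k tokens, or outside the smallest set whose cumulative probability reaches `top_p`, are masked before sampling. It illustrates the general technique; the model's own `generate` implementation may differ in detail.

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50, top_p: float = 0.9) -> int:
    """Illustrative top-k + nucleus (top-p) sampling over a 1-D logits vector."""
    logits = logits / temperature

    # Top-k: drop everything below the k-th highest logit.
    if top_k > 0:
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        to_remove = cumulative > top_p
        to_remove[1:] = to_remove[:-1].clone()   # always keep the token that crosses the threshold
        to_remove[0] = False
        logits[sorted_idx[to_remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Example with random logits over the ~32,000-token vocabulary:
next_token_id = sample_next_token(torch.randn(32_000))
```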

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:
  * `arbml/Arabic_News`: News articles across diverse topics, written in formal Arabic
  * `arbml/Arabic_Literature`: Novels and poetry, providing rich language variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training

### Preprocessing

* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`; see the sketch below)
* Special tokens: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
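
The overlapping-chunk step can be pictured with the short sketch below. It is illustrative only; the exact boundary handling in the actual preprocessing code is an assumption.

```python
def split_into_chunks(token_ids: list[int], max_seq_len: int = 512) -> list[list[int]]:
    """Split a long token sequence into overlapping chunks with stride = max_seq_len // 2."""
    stride = max_seq_len // 2
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_seq_len])
        if start + max_seq_len >= len(token_ids):   # the last chunk already reaches the end
            break
    return chunks

# Example: a 1,000-token text yields chunks covering [0:512], [256:768], and [512:1000].
```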

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract loss from checkpoint `model_checkpoint_epoch_20.pt`
* Use verbose logging in future training
* Add evaluation metrics: perplexity, BLEU (see the sketch below)
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing
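
As a starting point for the first and third recommendations, the sketch below lists whatever the saved checkpoint contains and shows how perplexity follows directly from a cross-entropy loss. The checkpoint's contents and the example loss value are assumptions, not logged results.

```python
import math
import torch

# Inspect the checkpoint for any logged values; the key names are unknown,
# so list what is stored rather than assuming a particular field.
checkpoint = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))

# Perplexity is exp(cross-entropy loss in nats), so it can be reported directly
# once a validation loss is available.
validation_loss = 4.2   # hypothetical value, for illustration only
print(f"perplexity = exp({validation_loss}) = {math.exp(validation_loss):.1f}")
```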

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Small subset used due to Colab GPU limits
* **Language Specificity**: Only Arabic supported; others untested
* **Training Duration**: 8.18 hours insufficient for full dataset

---

## Ethical Considerations

* **Bias**: May reflect cultural or topical biases from source data
* **Usage**: For research/non-commercial use; validate outputs
* **Privacy**: Datasets are public; comply with Hugging Face policies

---

## How to Contribute

* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via the issue tracker
* **Training**: Resume training on the full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.

---

## Citation

```bibtex
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  year={2025},
  publisher={Engineering Archive}
}
```