---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Builder**: *Alphatechlogics*
  🔗 [GitHub](https://github.com/alphatechlogics) | 🤗 [Hugging Face](https://huggingface.co/alphatechlogics) | 💼 [LinkedIn](https://www.linkedin.com/company/alphatechlogics)
* **Developer**: *Ahsan Umar*
  🔗 [GitHub](https://github.com/codewithdark-git) | 🤗 [Hugging Face](https://huggingface.co/codewithdark) | 💼 [LinkedIn](https://linkedin.com/in/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of public Arabic datasets to generate coherent, contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is sized to run in resource-constrained environments such as Google Colab's free GPU tier. The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters**:
  * Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million
* **Configuration**:
  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings

---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts (randomly sampled)
  * **Training Set**: 45,000 (90%)
  * **Validation Set**: 5,000 (10%)

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(Karpathy constant)*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) followed by linear decay
* **Batch Size**: Effective 16 (batch size 4 with 4 gradient accumulation steps)
* **Hardware**: Kaggle (P100 GPU)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20 (see the sketch below)
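
A minimal sketch of this configuration (AdamW, 10% linear warmup with decay, 4-step gradient accumulation) is shown below for orientation. It is not the released training script: the `model` and `train_loader` objects, and the Hugging Face-style `.loss` return, are assumptions.

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Placeholders: `model` (the FaseehGPT module) and `train_loader`
# (batches of token IDs) are assumed to exist already.
epochs = 20
accum_steps = 4  # effective batch 16 = 4 samples/step x 4 accumulated steps

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps = (len(train_loader) // accum_steps) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% linear warmup
    num_training_steps=total_steps,           # then linear decay
)

model.train()
for epoch in range(epochs):
    for step, batch in enumerate(train_loader):
        input_ids = batch["input_ids"]
        # Assumes a HF-style forward that returns .loss when labels are passed
        loss = model(input_ids, labels=input_ids).loss
        (loss / accum_steps).backward()  # scale so accumulated grads average
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```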
---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")

**Output**:

> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time")

**Output**:

> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis**: The generated text shows some local coherence but contains grammatical and semantic inconsistencies; the model would likely benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer; trust_remote_code=True pulls in the repo's
# custom GPT-style model class, which provides generate()
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Encode the prompt and sample a continuation
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Maximum number of tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

**Expected Output**: Arabic text that continues the given prompt; quality depends on training quality and generation settings.

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:
  * `arbml/Arabic_News`: News across diverse topics, written in formal Arabic
  * `arbml/Arabic_Literature`: Novels and poetry, providing rich linguistic variety
* **Total Texts**: 8,707,443 (full); 50,000 used for training

### Preprocessing

* Tokenized using `asafaya/bert-base-arabic`
* Long texts split into overlapping chunks (`stride = max_seq_len // 2`)
* Special tokens: the standard special tokens of the `asafaya/bert-base-arabic` tokenizer (`[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`)

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss metrics unavailable due to incomplete logging
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract loss from the checkpoint `model_checkpoint_epoch_20.pt` (see the sketches at the end of this card)
* Use verbose logging in future training runs
* Add evaluation metrics: perplexity, BLEU
* Try smaller models (e.g., `embed_dim=256`, `num_layers=6`) for faster Colab testing

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests undertraining
* **Resource Constraints**: Only a small data subset was used due to Colab GPU limits
* **Language Specificity**: Only Arabic is supported; other languages are untested
* **Training Duration**: 8.18 hours is insufficient for the full dataset

---

## Ethical Considerations

* **Bias**: May reflect cultural or topical biases present in the source data
* **Usage**: Intended for research and non-commercial use; validate outputs before relying on them
* **Privacy**: Datasets are public; comply with Hugging Face policies

---

## How to Contribute

* **Repo**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest features via the issue tracker
* **Training**: Resume training on the full dataset or better hardware
* **Evaluation**: Add scripts for BLEU, perplexity, etc.

---

## Citation

```bibtex
@article{umar2025faseehgpt,
  title={FaseehGPT: A Lightweight Transformer Model for Arabic Text Generation with Enhanced Morphological Understanding},
  author={Umar, Ahsan},
  year={2025},
  publisher={Engineering Archive}
}
```
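
---

## Appendix: Evaluation Starters

The Recommendations above suggest extracting the loss from `model_checkpoint_epoch_20.pt`. A minimal way to inspect the checkpoint is sketched below; the `"loss"` key is a guess and depends on what the training script actually saved.

```python
import torch

# Load the epoch-20 checkpoint on CPU and list its contents
ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
print(list(ckpt.keys()))

# Print the stored loss, if the script saved one under this (hypothetical) key
if "loss" in ckpt:
    print("recorded loss:", ckpt["loss"])
```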
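
For the recommended perplexity metric, one common formulation is the exponential of the mean per-token cross-entropy on the validation set. The sketch below assumes a `val_loader` of tokenized batches and the same Hugging Face-style forward as in the training sketch; it is a starting point, not released code.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, val_loader, device="cpu"):
    """Perplexity = exp(mean per-token cross-entropy) on the validation set."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in val_loader:
        input_ids = batch["input_ids"].to(device)
        # Assumes the forward returns mean cross-entropy when labels are given
        loss = model(input_ids, labels=input_ids).loss
        total_loss += loss.item() * input_ids.numel()
        total_tokens += input_ids.numel()
    return math.exp(total_loss / total_tokens)
```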