# Model Card for FaseehGPT

## Model Details

- **Model Name:** FaseehGPT
- **Model Type:** Decoder-only Transformer (GPT-style)
- **Repository:** [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
- **Version:** 1.1
- **Developers:** [Ahsan Umar](https://huggingface.co/codewithdark)
- **Date:** July 10, 2025
- **License:** Apache 2.0
- **Framework:** PyTorch, Hugging Face Transformers
- **Language:** Arabic
- **Intended Use:** Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic. It was trained on a subset of public Arabic datasets to generate coherent, contextually relevant text, reuses the pre-trained Arabic tokenizer `asafaya/bert-base-arabic`, and is sized for resource-constrained environments such as Google Colab's free GPU tier. Training ran for 20 epochs, with checkpoints saved and sample text generated along the way.

## Model Architecture

- **Architecture:** Decoder-only transformer with multi-head self-attention and feed-forward layers
- **Parameters:**
  - Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
  - Embedding Dimension: 512
  - Number of Layers: 12
  - Number of Attention Heads: 8
  - Feed-forward Dimension: 2048
  - Total Parameters: ~70.7 million
- **Configuration:**
  - Maximum Sequence Length: 512
  - Dropout Rate: 0.1
  - Activation Function: GELU
- **Weight Initialization:** Normal distribution (mean=0, std=0.02)
- **Special Features:** Top-k and top-p sampling for text generation, with weight tying between the input embeddings and the output projection for parameter efficiency

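As a rough sanity check, the hyperparameters above can be plugged into stock PyTorch layers. This is a hypothetical sketch, not the actual FaseehGPT implementation (which may differ in block structure, biases, and normalization):

```python
import torch.nn as nn

class GPTSketch(nn.Module):
    """Hypothetical stand-in assembled from the hyperparameters listed above."""
    def __init__(self, vocab=32_000, d=512, n_layers=12, n_heads=8, ffn=2048, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)      # token embeddings
        self.pos_emb = nn.Embedding(max_len, d)    # learned positions (assumed)
        block = nn.TransformerEncoderLayer(        # acts as a decoder block under a causal mask
            d_model=d, nhead=n_heads, dim_feedforward=ffn,
            dropout=0.1, activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=n_layers)
        self.lm_head = nn.Linear(d, vocab, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying

model = GPTSketch()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")  # ~54.5M
```

Counting the tied embedding/output matrix once gives roughly 54.5M parameters here; counting it twice adds another ~16.4M for ~70.9M, close to the stated ~70.7M, so the exact figure likely depends on how shared weights are counted.
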
## Training Details

- **Datasets:**
  - `arbml/Arabic_News`: 7,114,814 news article texts
  - `arbml/Arabic_Literature`: 1,592,629 literary texts
- **Subset Used:** 50,000 randomly sampled texts for training and evaluation
  - Training Set: 45,000 texts (90%)
  - Validation Set: 5,000 texts (10%)
- **Training Configuration** (see the sketch after this list):
  - Epochs: 20
  - Learning Rate: 3e-4 (the "Karpathy constant")
  - Optimizer: AdamW (weight decay = 0.01)
  - Scheduler: Linear warmup (10% of steps) followed by decay
  - Batch Size: Effective batch size of 16, via 4 gradient-accumulation steps
  - Hardware: Kaggle (NVIDIA P100 GPU)
  - Training Duration: 8.18 hours
  - Checkpoint: Saved at epoch 20

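A minimal sketch of this optimization setup, assuming a plain PyTorch loop and the `get_linear_schedule_with_warmup` helper from `transformers`; `model`, `train_loader`, and `num_epochs` are placeholders, and the actual training script may differ:

```python
import torch
from transformers import get_linear_schedule_with_warmup

accum_steps = 4  # 4 gradient-accumulation steps -> effective batch size 16
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps = num_epochs * len(train_loader) // accum_steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # linear warmup over 10% of steps
    num_training_steps=total_steps,           # then linear decay
)

for epoch in range(num_epochs):
    for step, batch in enumerate(train_loader):
        loss = model(batch["input_ids"], labels=batch["labels"]).loss
        (loss / accum_steps).backward()        # accumulate scaled gradients
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```
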
### Sample Generated Text (at epoch 20)

**Prompt 1:** "اللغة العربية" ("the Arabic language")

**Output:** اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2:** "كان يا مكان في قديم الزمان" ("once upon a time, long ago")

**Output:** كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis:** The generated text shows some coherence but includes grammatical and semantic inconsistencies, suggesting the model may benefit from further training or fine-tuning.

## Usage

FaseehGPT can generate Arabic text from a prompt. Below is an example of loading and using the model with the Hugging Face `transformers` library:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer (trust_remote_code loads the custom model class)
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"  # "Peace be upon you"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

**Parameters for Generation:**

- `max_new_tokens`: Maximum number of tokens to generate (e.g., 100).
- `temperature`: Controls randomness (default: 1.0).
- `top_k`: Restricts sampling to the k most likely tokens (default: 50).
- `top_p`: Nucleus sampling threshold (default: 0.9).

**Expected Output:** Arabic text continuing from the prompt; quality depends on how far training has progressed and on the sampling settings.

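Since the Limitations section suggests a lower temperature and adjusted top-k/top-p for more coherent output, here is the same call with more conservative values (illustrative, not tuned for this model):

```python
# Illustrative, untuned settings: less random than the defaults above
outputs = model.generate(input_ids, max_new_tokens=100,
                         temperature=0.7, top_k=40, top_p=0.85)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
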
## Dataset Description

- **Source:** Hugging Face Datasets
- **Datasets Used:**
  - `arbml/Arabic_News`: News articles covering diverse topics, providing formal and varied Arabic text.
  - `arbml/Arabic_Literature`: Literary works, including novels and poetry, offering rich linguistic patterns.
- **Total Texts:** 8,707,443 across the full datasets; 50,000 used in the example training run.
- **Preprocessing** (a chunking sketch follows this list):
  - Texts are tokenized with the `asafaya/bert-base-arabic` tokenizer.
  - Long texts are split into overlapping chunks (stride: `max_seq_len // 2`) to fit the maximum sequence length of 512.
  - Special tokens (`<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`) are added for language modeling.

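A minimal sketch of the overlapping-chunk split described above; `chunk_token_ids` is a hypothetical helper operating on an already-tokenized sequence, not the card's actual preprocessing code:

```python
def chunk_token_ids(token_ids, max_seq_len=512):
    """Split a long token sequence into windows of up to max_seq_len tokens.

    Each window starts max_seq_len // 2 tokens after the previous one,
    so consecutive windows overlap by half their length.
    """
    stride = max_seq_len // 2
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_seq_len])
        if start + max_seq_len >= len(token_ids):
            break  # this window already reaches the end of the text
    return chunks

# e.g. a 1,200-token text yields windows starting at positions 0, 256, 512, 768
```
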
## Evaluation

- **Metrics:** Cross-entropy loss (training and validation).
- **Status:** Loss values are unavailable in the recorded output due to incomplete logging. Sample generations at epoch 20 indicate partial learning of Arabic linguistic patterns, but coherence is limited.
- **Recommendations** (a perplexity sketch follows this list):
  - Extract loss values from the checkpoint file (`model_checkpoint_epoch_20.pt`) or rerun training with verbose logging.
  - Compute additional metrics such as perplexity or BLEU to quantify generation quality.
  - Experiment with a smaller model (e.g., `embed_dim=256`, `num_layers=6`) for faster evaluation on Colab.

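Perplexity is the exponential of the mean cross-entropy, so it can be computed with a short evaluation loop. A sketch assuming the model returns a `.loss` for next-token prediction and that `val_loader` yields batches of `input_ids` (names are placeholders; the checkpoint's internal layout is unknown, so inspect it first):

```python
import math
import torch

# Peek inside the saved checkpoint; its layout is an assumption here
ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
print(ckpt.keys())  # look for stored loss values alongside the weights

@torch.no_grad()
def perplexity(model, val_loader, device="cuda"):
    """Perplexity = exp(mean cross-entropy over the validation set).

    Approximate: averages per-batch losses rather than weighting by token count.
    """
    model.eval()
    total, batches = 0.0, 0
    for batch in val_loader:
        ids = batch["input_ids"].to(device)
        total += model(ids, labels=ids).loss.item()  # next-token CE loss
        batches += 1
    return math.exp(total / batches)
```
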
## Limitations

- **Generated Text Quality:** Sample outputs are only partially coherent, indicating possible undertraining or a need for hyperparameter tuning (e.g., lower temperature, adjusted top-k/top-p).
- **Resource Constraints:** Trained on a 50,000-text subset due to GPU limitations, which may reduce generalization compared to the full 8.7M-text dataset.
- **Language Specificity:** Optimized for Arabic; performance on other languages is untested.
- **Training Duration:** 8.18 hours for 20 epochs on the subset; training on the full dataset requires more powerful hardware.

## Ethical Considerations

- **Bias:** The model may reflect biases present in the training datasets, such as regional or topical biases in news and literary styles.
- **Usage:** Intended for research and non-commercial applications. Users should verify generated text for accuracy and cultural appropriateness.
- **Data Privacy:** The datasets are publicly available on Hugging Face, but users must comply with their data usage policies.

## How to Contribute

- **Repository:** [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
- **Issues:** Report bugs or suggest improvements via the repository's issue tracker.
- **Training:** Resume training with the full dataset or enhanced hardware to improve performance.
- **Evaluation:** Contribute scripts for computing perplexity, BLEU, or other metrics to assess text quality.

## Citation

If you use FaseehGPT in your research, please cite:

```bibtex
@misc{faseehgpt2025,
  title={FaseehGPT: An Arabic Language Model},
  author={Rohma, Ahsan Umar},
  year={2025},
  url={https://huggingface.co/alphatechlogics/FaseehGPT}
}
```