Update README.md
README.md
---
license: apache-2.0
datasets:
- arbml/Arabic_Literature
- arbml/Arabic_News
- khalidalt/ultimate_arabic_news
- pain/Arabic-Tweets
language:
- ar
pipeline_tag: text-generation
library_name: transformers
tags:
- torch
- custom
- GPT
---

# Model Card for FaseehGPT

## Model Details

* **Model Name**: FaseehGPT
* **Model Type**: Decoder-only Transformer (GPT-style)
* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Version**: 1.1
* **Developers**: [Ahsan Umar](https://huggingface.co/codewithdark)
* **Date**: July 10, 2025
* **License**: Apache 2.0
* **Framework**: PyTorch, Hugging Face Transformers
* **Language**: Arabic
* **Intended Use**: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is optimized for resource-constrained environments such as Google Colab's free GPU tier. The model was trained for 20 epochs, with checkpoints saved and sample generations produced along the way.

---

## Model Architecture

* **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
* **Parameters** (a rough breakdown is sketched below):
  * Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
  * Embedding Dimension: 512
  * Number of Layers: 12
  * Number of Attention Heads: 8
  * Feed-forward Dimension: 2048
  * Total Parameters: ~70.7 million
* **Configuration**:
  * Maximum Sequence Length: 512
  * Dropout Rate: 0.1
  * Activation Function: GELU
* **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
* **Special Features**: Supports top-k and top-p sampling for text generation, with weight tying between input and output embeddings for efficiency
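
The total can be roughly reproduced from these hyperparameters. The sketch below is a back-of-envelope estimate assuming a standard GPT block layout (Q/K/V/output projections plus a two-layer feed-forward network); it is not taken from the FaseehGPT source and ignores biases and LayerNorm weights.

```python
# Back-of-envelope parameter estimate from the configuration above (approximate).
vocab_size, embed_dim, num_layers, ff_dim, max_seq_len = 32_000, 512, 12, 2048, 512

token_emb = vocab_size * embed_dim                       # ~16.4M
pos_emb = max_seq_len * embed_dim                        # ~0.26M
per_layer = 4 * embed_dim**2 + 2 * embed_dim * ff_dim    # attention + feed-forward projections
transformer = num_layers * per_layer                     # ~37.7M
lm_head = vocab_size * embed_dim                         # shares weights with token_emb when tied

print(f"transformer + embeddings: {(token_emb + pos_emb + transformer) / 1e6:.1f}M")  # ~54.4M
print(f"with LM head counted separately: "
      f"{(token_emb + pos_emb + transformer + lm_head) / 1e6:.1f}M")                  # ~70.8M
```

The reported ~70.7M total is consistent with counting the output projection alongside the (tied) input embedding.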
---

## Training Details

### Datasets

* **arbml/Arabic_News**: 7,114,814 news article texts
* **arbml/Arabic_Literature**: 1,592,629 literary texts
* **Subset Used**: 50,000 texts, randomly sampled (loading and split sketched below)
  * **Training Set**: 45,000 texts (90%)
  * **Validation Set**: 5,000 texts (10%)
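
The exact sampling code is not part of this card; the following is a minimal sketch using the `datasets` library. The dataset config, column layout, and seed are assumptions, and the real subset was drawn from both datasets rather than only `arbml/Arabic_Literature`.

```python
from datasets import load_dataset

# Illustrative sketch of the 50,000-text subset and 90/10 split described above;
# not the actual training script.
literature = load_dataset("arbml/Arabic_Literature", split="train")

subset = literature.shuffle(seed=42).select(range(50_000))
split = subset.train_test_split(test_size=0.1, seed=42)

train_texts, val_texts = split["train"], split["test"]
print(len(train_texts), len(val_texts))  # 45000 5000
```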

### Training Configuration

* **Epochs**: 20
* **Learning Rate**: 3e-4 *(the "Karpathy constant")*
* **Optimizer**: AdamW (weight decay = 0.01)
* **Scheduler**: Linear warmup (10% of steps) followed by linear decay
* **Batch Size**: Effective batch size of 16, via 4 gradient accumulation steps (see the sketch below)
* **Hardware**: Kaggle (P100 GPU)
* **Training Duration**: 8.18 hours
* **Checkpoint**: Saved at epoch 20
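
For reference, the optimizer, warmup/decay schedule, and gradient accumulation can be wired up roughly as follows. This is an illustrative PyTorch sketch with a toy model and random data, not the actual FaseehGPT training script.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Toy stand-ins for the model and dataloader; only the optimization wiring
# (AdamW, 10% linear warmup + decay, 4-step gradient accumulation) is the point.
model = torch.nn.Linear(512, 10)
loss_fn = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(4, 512), torch.randint(0, 10, (4,))) for _ in range(32)]

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

accum_steps = 4                                # 4 micro-batches of 4 -> effective batch size 16
total_steps = len(batches) // accum_steps      # optimizer steps, not micro-batches
warmup_steps = max(1, int(0.1 * total_steps))  # 10% linear warmup

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                                               # warm up linearly ...
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))   # ... then decay linearly

scheduler = LambdaLR(optimizer, lr_lambda)

optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average over the effective batch
    loss.backward()
    if (i + 1) % accum_steps == 0:             # update only once per effective batch
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```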

---

## Sample Generated Text (Epoch 20)

**Prompt 1**: `"اللغة العربية"` ("the Arabic language")

**Output**:

> اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

**Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time, long ago")

**Output**:

> كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

**Analysis**: The generated text shows some local coherence but includes grammatical and semantic inconsistencies; the model would likely benefit from further training or fine-tuning.

---

## Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT")
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Encode a prompt (any Arabic text)
prompt = "اللغة العربية"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate a continuation with sampling
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Parameters for Generation

* `max_new_tokens`: Maximum number of tokens to generate (e.g., 100)
* `temperature`: Controls randomness (default: 1.0)
* `top_k`: Limits sampling to the k most likely tokens (default: 50)
* `top_p`: Nucleus sampling threshold (default: 0.9)

**Expected Output**: Arabic text that continues the given prompt; quality depends on training and the sampling settings.
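
To make the `top_k` and `top_p` settings concrete, the sketch below shows how the two filters act on a single step of next-token logits. It illustrates the sampling idea only and is not FaseehGPT's internal implementation.

```python
import torch

def filter_logits(logits, top_k=50, top_p=0.9):
    """Conceptual top-k then top-p (nucleus) filtering for one step of logits."""
    # Top-k: drop everything outside the k most likely tokens.
    kth_best = torch.topk(logits, top_k).values[..., -1:]
    logits = torch.where(logits < kth_best, torch.full_like(logits, float("-inf")), logits)

    # Top-p: drop tokens once the probability mass before them already exceeds top_p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    remove = (torch.cumsum(probs, dim=-1) - probs) > top_p
    sorted_logits[remove] = float("-inf")
    return logits.scatter(-1, sorted_idx, sorted_logits)

logits = torch.randn(1, 32_000)                          # fake next-token logits
probs = torch.softmax(filter_logits(logits), dim=-1)
next_token_id = torch.multinomial(probs, num_samples=1)  # sample one token id
print(next_token_id)
```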

---

## Dataset Description

* **Source**: Hugging Face Datasets
* **Used Datasets**:
  * `arbml/Arabic_News`: News articles covering diverse topics, providing formal and varied Arabic text
  * `arbml/Arabic_Literature`: Literary works, including novels and poetry, offering rich linguistic patterns
* **Total Texts**: 8,707,443 in the full datasets; 50,000 used for training

### Preprocessing

* Texts are tokenized with `asafaya/bert-base-arabic`
* Long texts are split into overlapping chunks (`stride = max_seq_len // 2`) to fit the maximum sequence length of 512 (see the sketch below)
* Special tokens (`<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`) are added for language modeling
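
The overlapping-chunk step can be illustrated as follows; this is a minimal sketch of the idea, not the project's actual preprocessing code. The overlap preserves context that would otherwise be lost at chunk boundaries.

```python
def chunk_ids(ids, max_seq_len=512, stride=None):
    """Split a list of token ids into overlapping windows (default stride = max_seq_len // 2)."""
    stride = stride or max_seq_len // 2
    if len(ids) <= max_seq_len:
        return [ids]
    return [ids[start:start + max_seq_len]
            for start in range(0, len(ids) - stride, stride)]

# A 1,000-token document becomes three overlapping windows of at most 512 tokens.
print([len(c) for c in chunk_ids(list(range(1000)))])  # [512, 512, 488]
```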

---

## Evaluation

* **Metrics**: Cross-entropy loss (training and validation)
* **Status**: Loss values are unavailable due to incomplete logging during training
* **Observations**: Generated samples show partial learning; some incoherence remains

### Recommendations

* Extract the recorded loss from the checkpoint `model_checkpoint_epoch_20.pt`
* Use more verbose logging in future training runs
* Compute additional metrics such as perplexity and BLEU to quantify generation quality (see the sketch below)
* Experiment with a smaller configuration (e.g., `embed_dim=256`, `num_layers=6`) for faster evaluation on Colab
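
Two of these items need only a few lines to get started. The snippet below lists whatever the checkpoint actually contains rather than assuming its layout, and computes perplexity as the exponential of the mean token-level cross-entropy; the loss values shown are placeholders.

```python
import math
import torch

# Inspect what was saved in the epoch-20 checkpoint (its keys are not documented,
# so just list them instead of assuming a particular structure).
ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))

# Perplexity from mean token-level cross-entropy (natural log); replace the
# placeholder values with real validation losses.
val_losses = [4.2, 4.0, 3.9]
perplexity = math.exp(sum(val_losses) / len(val_losses))
print(f"perplexity ~ {perplexity:.1f}")  # lower is better
```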

---

## Limitations

* **Generated Text Quality**: Inconsistent coherence suggests the model is undertrained
* **Resource Constraints**: Trained on a 50,000-text subset due to limited free-tier GPU resources, potentially reducing generalization compared to the full 8.7M-text dataset
* **Language Specificity**: Optimized for Arabic; performance on other languages is untested
* **Training Duration**: 8.18 hours for 20 epochs on the subset; training on the full dataset requires more powerful hardware

---

## Ethical Considerations

* **Bias**: Outputs may reflect cultural or topical biases present in the source datasets
* **Usage**: Intended for research and non-commercial use; validate outputs before downstream use
* **Privacy**: The training datasets are public; usage should comply with Hugging Face policies

---

## How to Contribute

* **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
* **Issues**: Report bugs or suggest improvements via the repository's issue tracker
* **Training**: Resume training with the full dataset or better hardware to improve performance
* **Evaluation**: Contribute scripts for perplexity, BLEU, or other metrics to assess text quality

---

## Citation

If you use FaseehGPT in your research, please cite:

```bibtex
@misc{faseehgpt2025,
  title  = {FaseehGPT: An Arabic Language Model},
  author = {Rohma, Ahsan Umar},
  year   = {2025},
  url    = {https://huggingface.co/alphatechlogics/FaseehGPT}
}
```