codewithdark committed · Commit accea60 · verified · 1 Parent(s): 084a962

Update README.md

Files changed (1): README.md (+143 −95)

README.md CHANGED
@@ -1,76 +1,107 @@
- Model Card for FaseehGPT
- Model Details
- Model Name: FaseehGPT
- Model Type: Decoder-only Transformer (GPT-style)
- Repository: alphatechlogics/FaseehGPT
- Version: 1.1
- Developers: [Ahsan Umar](https://huggingface.co/codewithdark)
- Date: July 10, 2025
- License: Apache 2.0
- Framework: PyTorch, Hugging Face Transformers
- Language: Arabic
- Intended Use: Text generation and language modeling for Arabic text
- FaseehGPT is a GPT-style language model designed for Arabic text processing, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It leverages a pre-trained Arabic tokenizer (asafaya/bert-base-arabic) and is optimized for resource-constrained environments like Google Colab's free GPU. The model completed training for 20 epochs, with checkpoints saved and sample text generated.
- Model Architecture
- Architecture: Decoder-only transformer with multi-head self-attention and feed-forward layers.
- Parameters:
- Vocabulary Size: ~32,000 (from the asafaya/bert-base-arabic tokenizer)
- Embedding Dimension: 512
- Number of Layers: 12
- Number of Attention Heads: 8
- Feed-forward Dimension: 2048
- Total Parameters: ~70.7 million
- Configuration:
- Maximum Sequence Length: 512
- Dropout Rate: 0.1
- Activation Function: GELU
- Weight Initialization: Normal distribution (mean=0, std=0.02)
- Special Features: Supports top-k and top-p sampling for text generation, with weight tying between input and output embeddings for efficiency.
- Training Details
- Datasets:
- arbml/Arabic_News: 7,114,814 news article texts
- arbml/Arabic_Literature: 1,592,629 literary texts
- Subset Used: 50,000 texts (randomly sampled) for training and evaluation
- Training Set: 45,000 texts (90%)
- Validation Set: 5,000 texts (10%)
- Training Configuration:
- Epochs: 20
- Learning Rate: 3e-4 # Karpathy constant
- Optimizer: AdamW (weight decay=0.01)
- Scheduler: Linear warmup (10% of steps) with decay
- Batch Size: Effective batch size of 16 (using 4 gradient accumulation steps)
- Hardware: Kaggle (P100)
- Training Duration: 8.18 hours
- Checkpoint: Saved at epoch 20
- Sample Generated Text (at epoch 20):
- Prompt 1: "اللغة العربية" ("the Arabic language")
- Output: اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ
- Prompt 2: "كان يا مكان في قديم الزمان" ("once upon a time, long ago")
- Output: كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في
- Analysis: The generated text shows some coherence but includes grammatical and semantic inconsistencies, suggesting the model may benefit from further training or fine-tuning.
- Usage
- FaseehGPT can be used for generating Arabic text given a prompt. Below is an example of how to load and use the model with the Hugging Face transformers library.
  from transformers import AutoModel, AutoTokenizer

  # Load model and tokenizer
@@ -83,67 +114,84 @@ input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
  generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(generated_text)
- Parameters for Generation:
- max_new_tokens: Maximum number of tokens to generate (e.g., 100).
- temperature: Controls randomness (default: 1.0).
- top_k: Limits sampling to top-k tokens (default: 50).
- top_p: Nucleus sampling threshold (default: 0.9).
- Expected Output: Generates Arabic text continuing from the prompt, with quality dependent on training completion and hyperparameter settings.
- Dataset Description
- Source: Hugging Face Datasets
- Datasets Used:
- arbml/Arabic_News: News articles covering diverse topics, providing formal and varied Arabic text.
- arbml/Arabic_Literature: Literary works, including novels and poetry, offering rich linguistic patterns.
- Total Texts: 8,707,443 (full dataset); 50,000 used in example training.
- Preprocessing:
- Texts are tokenized using asafaya/bert-base-arabic.
- Long texts are split into overlapping chunks (stride: max_seq_len // 2) to fit the maximum sequence length (512).
- Special tokens (<SOS>, <EOS>, <PAD>, <UNK>) are added for language modeling.
- Evaluation
- Metrics: Cross-entropy loss (training and validation).
- Status: Loss metrics are unavailable in the provided output due to incomplete logging. Sample text generation at epoch 20 indicates partial learning of Arabic linguistic patterns, but coherence is limited.
- Recommendations:
- Extract loss values from the checkpoint file (model_checkpoint_epoch_20.pt) or rerun training with verbose logging.
- Compute additional metrics like perplexity or BLEU to quantify generation quality.
- Experiment with a smaller model (e.g., embed_dim=256, num_layers=6) for faster evaluation on Colab.
- Limitations
- Generated Text Quality: Sample outputs show partial coherence, indicating potential undertraining or need for hyperparameter tuning (e.g., lower temperature, adjusted top-k/top-p).
- Resource Constraints: Trained on a 50,000-text subset due to Colab's GPU limitations, potentially reducing generalization compared to the full 8.7M-text dataset.
- Language Specificity: Optimized for Arabic; performance on other languages is untested.
- Training Duration: 8.18 hours for 20 epochs on a limited dataset; full dataset training requires more powerful hardware.
- Ethical Considerations
- Bias: The model may reflect biases in the training datasets, such as regional or topical biases in news or literary styles.
- Usage: Intended for research and non-commercial applications. Users should verify generated text for accuracy and cultural appropriateness.
- Data Privacy: Datasets are publicly available on Hugging Face, but users must comply with data usage policies.
- How to Contribute
- Repository: alphatechlogics/FaseehGPT
- Issues: Report bugs or suggest improvements via the repository's issue tracker.
- Training: Resume training with the full dataset or enhanced hardware to improve performance.
- Evaluation: Contribute scripts for computing perplexity, BLEU, or other metrics to assess text quality.
- Citation
- If you use FaseehGPT in your research, please cite:
  @misc{faseehgpt2025,
- title={FaseehGPT: An Arabic Language Model},
- author={Rohma, Ahsan Umar},
- year={2025},
- url={https://huggingface.co/alphatechlogics/FaseehGPT}
  }

+ ---
+ license: apache-2.0
+ datasets:
+ - arbml/Arabic_Literature
+ - arbml/Arabic_News
+ - khalidalt/ultimate_arabic_news
+ - pain/Arabic-Tweets
+ language:
+ - ar
+ pipeline_tag: text-generation
+ library_name: transformers
+ tags:
+ - torch
+ - custom
+ - GPT
+ ---
+
+ # Model Card for FaseehGPT
+
+ ## Model Details
+
+ * **Model Name**: FaseehGPT
+ * **Model Type**: Decoder-only Transformer (GPT-style)
+ * **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
+ * **Version**: 1.1
+ * **Developers**: [Ahsan Umar](https://huggingface.co/codewithdark)
+ * **Date**: July 10, 2025
+ * **License**: Apache 2.0
+ * **Framework**: PyTorch, Hugging Face Transformers
+ * **Language**: Arabic
+ * **Intended Use**: Text generation and language modeling for Arabic text
+
+ FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (`asafaya/bert-base-arabic`) and is optimized for resource-constrained environments such as Google Colab and Kaggle free GPUs. The model was trained for 20 epochs, with checkpoints saved and sample generations recorded.
+
+ ---
+
+ ## Model Architecture
+
+ * **Architecture**: Decoder-only transformer with multi-head self-attention and feed-forward layers
+ * **Parameters**:
+   * Vocabulary Size: ~32,000 (from the `asafaya/bert-base-arabic` tokenizer)
+   * Embedding Dimension: 512
+   * Number of Layers: 12
+   * Number of Attention Heads: 8
+   * Feed-forward Dimension: 2048
+   * Total Parameters: ~70.7 million (sanity-checked below)
+ * **Configuration**:
+   * Maximum Sequence Length: 512
+   * Dropout Rate: 0.1
+   * Activation Function: GELU
+ * **Weight Initialization**: Normal distribution (mean = 0, std = 0.02)
+ * **Special Features**: Supports top-k and top-p sampling; weight tying between input and output embeddings
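+
+ The headline figure can be checked with a quick back-of-the-envelope count. This is only a sketch assuming a standard GPT block layout (biases and LayerNorm ignored); FaseehGPT's exact layer shapes may differ slightly:
+
+ ```python
+ # Rough parameter estimate from the configuration listed above.
+ vocab_size, embed_dim, n_layers, ff_dim, max_len = 32_000, 512, 12, 2_048, 512
+
+ token_emb = vocab_size * embed_dim        # ~16.4M token embeddings
+ pos_emb = max_len * embed_dim             # ~0.3M positional embeddings
+ attn = 4 * embed_dim * embed_dim          # Q, K, V and output projections
+ ffn = 2 * embed_dim * ff_dim              # up- and down-projections
+ per_layer = attn + ffn                    # ~3.1M per transformer block
+ output_head = vocab_size * embed_dim      # 0 if fully tied to token_emb
+
+ total = token_emb + pos_emb + n_layers * per_layer + output_head
+ print(f"~{total / 1e6:.1f}M parameters")  # ~70.8M, close to the stated ~70.7M
+ ```
+
+ Note that ~70.7M matches the count with an untied output head; with full weight tying the estimate drops to roughly 54.4M, so the reported total may count the shared matrix on both ends.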
+
+ ---
+
+ ## Training Details
+
+ ### Datasets
+
+ * `arbml/Arabic_News`: 7,114,814 news article texts
+ * `arbml/Arabic_Literature`: 1,592,629 literary texts
+ * **Subset Used**: 50,000 texts, randomly sampled (see the sketch below)
+   * **Training Set**: 45,000 (90%)
+   * **Validation Set**: 5,000 (10%)
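+
+ A minimal sketch of that sampling and split (illustrative only; the actual sampling code, random seed, and column schemas are not published, and pooling the two corpora assumes they expose compatible columns):
+
+ ```python
+ from datasets import load_dataset, concatenate_datasets
+
+ # Pool both corpora (assumes each has a "train" split with matching features).
+ news = load_dataset("arbml/Arabic_News", split="train")
+ literature = load_dataset("arbml/Arabic_Literature", split="train")
+ pool = concatenate_datasets([news, literature]).shuffle(seed=42)
+
+ # 50,000 random texts, split 90/10 into train/validation.
+ subset = pool.select(range(50_000))
+ splits = subset.train_test_split(test_size=0.1)
+ train_ds, val_ds = splits["train"], splits["test"]
+ ```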
 
+ ### Training Configuration
+
+ * **Epochs**: 20
+ * **Learning Rate**: 3e-4 *(the "Karpathy constant")*
+ * **Optimizer**: AdamW (weight decay = 0.01)
+ * **Scheduler**: Linear warmup (10% of steps) with decay (see the sketch below)
+ * **Batch Size**: Effective 16 (4 gradient accumulation steps)
+ * **Hardware**: Kaggle (P100)
+ * **Training Duration**: 8.18 hours
+ * **Checkpoint**: Saved at epoch 20
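+
+ In code, this configuration corresponds roughly to the following (a sketch using Hugging Face's scheduler utility; `model`, `train_loader`, the per-device batch size of 4, and a forward pass that returns `.loss` are assumptions, not the published training script):
+
+ ```python
+ import torch
+ from transformers import get_linear_schedule_with_warmup
+
+ accum_steps = 4                                      # 4 micro-batches of 4 = effective batch 16
+ num_training_steps = 20 * (45_000 // 16)             # epochs x optimizer steps per epoch
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
+ scheduler = get_linear_schedule_with_warmup(
+     optimizer,
+     num_warmup_steps=int(0.1 * num_training_steps),  # 10% linear warmup
+     num_training_steps=num_training_steps,
+ )
+
+ for step, batch in enumerate(train_loader):
+     loss = model(**batch).loss / accum_steps         # scale loss for accumulation
+     loss.backward()
+     if (step + 1) % accum_steps == 0:
+         optimizer.step()
+         scheduler.step()
+         optimizer.zero_grad()
+ ```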
+
+ ---
+
+ ## Sample Generated Text (Epoch 20)
+
+ **Prompt 1**: `"اللغة العربية"` ("the Arabic language")
+ **Output**:
+
+ > اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ
+
+ **Prompt 2**: `"كان يا مكان في قديم الزمان"` ("once upon a time, long ago")
+ **Output**:
+
+ > كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في
+
+ **Analysis**: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.
+
+ ---
+
+ ## Usage
+
+ FaseehGPT can be used to generate Arabic text from a prompt. Example:
+
+ ```python
  from transformers import AutoModel, AutoTokenizer

  # Load model and tokenizer
  # ... (model/tokenizer loading and prompt-encoding lines are unchanged and elided in this diff view) ...
  outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
  generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(generated_text)
+ ```
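+
+ Note that `AutoModel` does not normally expose `.generate()`; repositories that ship custom model code are usually loaded with a causal-LM class and `trust_remote_code=True`. The following is an assumption, so check the repository files for the exact entry point:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # trust_remote_code is typically required for custom architectures on the Hub.
+ tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")
+ model = AutoModelForCausalLM.from_pretrained(
+     "alphatechlogics/FaseehGPT", trust_remote_code=True
+ )
+ ```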
+
+ ### Parameters for Generation
+
+ * `max_new_tokens`: Maximum number of tokens to generate (e.g., 100)
+ * `temperature`: Controls randomness (default: 1.0)
+ * `top_k`: Limits sampling to the k most likely tokens (default: 50)
+ * `top_p`: Nucleus sampling threshold (default: 0.9)
+
+ **Expected Output**: Arabic text continuing the given prompt; quality depends on training progress and sampling settings.
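+
+ To see how these knobs interact, a small sweep can be run (a toy example; assumes `model` and `tokenizer` are loaded as above, and note that sampling parameters only take effect with `do_sample=True`):
+
+ ```python
+ prompt = "اللغة العربية"
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+
+ # Lower temperature and a tighter top-k generally trade diversity for coherence.
+ for temperature, top_k in [(1.0, 50), (0.7, 40)]:
+     out = model.generate(
+         input_ids,
+         max_new_tokens=100,
+         do_sample=True,
+         temperature=temperature,
+         top_k=top_k,
+         top_p=0.9,
+     )
+     print(temperature, top_k, tokenizer.decode(out[0], skip_special_tokens=True))
+ ```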
+
+ ---
+
+ ## Dataset Description
+
+ * **Source**: Hugging Face Datasets
+ * **Datasets Used**:
+   * `arbml/Arabic_News`: News across diverse topics, in formal Arabic
+   * `arbml/Arabic_Literature`: Novels and poetry, providing rich linguistic variety
+ * **Total Texts**: 8,707,443 (full); 50,000 used for training
+
+ ### Preprocessing
+
+ * Tokenized with `asafaya/bert-base-arabic`
+ * Long texts split into overlapping chunks (`stride = max_seq_len // 2`) to fit the 512-token maximum (sketched below)
+ * Special tokens added for language modeling: `<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`
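+
+ The chunking step can be sketched as follows (illustrative; the actual preprocessing may differ in details such as where the special tokens are inserted):
+
+ ```python
+ def chunk_token_ids(token_ids, max_seq_len=512):
+     """Split a long token sequence into overlapping chunks (stride = max_seq_len // 2)."""
+     stride = max_seq_len // 2
+     return [
+         token_ids[start : start + max_seq_len]
+         for start in range(0, max(len(token_ids) - stride, 1), stride)
+     ]
+
+ # A 1,200-token document yields chunks starting at offsets 0, 256, 512, 768.
+ print([len(c) for c in chunk_token_ids(list(range(1_200)))])  # [512, 512, 512, 432]
+ ```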
 
+ ---
+
+ ## Evaluation
+
+ * **Metrics**: Cross-entropy loss (training and validation)
+ * **Status**: Loss values unavailable due to incomplete logging
+ * **Observations**: Generated samples show partial learning; some incoherence remains
+
+ ### Recommendations
+
+ * Extract loss values from the checkpoint `model_checkpoint_epoch_20.pt` (see the sketch below)
+ * Use verbose logging in future training runs
+ * Add evaluation metrics such as perplexity and BLEU
+ * Try a smaller model (e.g., `embed_dim=256`, `num_layers=6`) for faster testing on Colab
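+
+ Perplexity follows directly from cross-entropy loss (perplexity = exp(loss)), so it is cheap to report once loss values are recovered. A sketch for inspecting the checkpoint; the key name `val_loss` is a guess, so check what the file actually stores:
+
+ ```python
+ import math
+ import torch
+
+ ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
+ print(ckpt.keys())  # inspect what was saved alongside the weights
+
+ if "val_loss" in ckpt:  # hypothetical key for mean validation cross-entropy
+     print("validation perplexity:", math.exp(ckpt["val_loss"]))
+ ```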
+
+ ---
+
+ ## Limitations
+
+ * **Generated Text Quality**: Inconsistent coherence suggests undertraining or a need for hyperparameter tuning
+ * **Resource Constraints**: Trained on a 50,000-text subset due to free-tier GPU limits, which may reduce generalization versus the full 8.7M-text corpus
+ * **Language Specificity**: Optimized for Arabic only; other languages are untested
+ * **Training Duration**: 8.18 hours covered 20 epochs on the subset; training on the full dataset requires more powerful hardware
+
+ ---
+
+ ## Ethical Considerations
+
+ * **Bias**: May reflect regional, cultural, or topical biases in the source data
+ * **Usage**: Intended for research and non-commercial use; validate generated text for accuracy and cultural appropriateness
+ * **Privacy**: Datasets are publicly available; comply with Hugging Face data-usage policies
+
+ ---
+
+ ## How to Contribute
+
+ * **Repository**: [alphatechlogics/FaseehGPT](https://huggingface.co/alphatechlogics/FaseehGPT)
+ * **Issues**: Report bugs or suggest features via the issue tracker
+ * **Training**: Resume training on the full dataset or better hardware
+ * **Evaluation**: Contribute scripts for perplexity, BLEU, and other metrics
+
+ ---
+
+ ## Citation
+
+ If you use FaseehGPT in your research, please cite:
+
+ ```bibtex
  @misc{faseehgpt2025,
+   title = {FaseehGPT: An Arabic Language Model},
+   author = {Rohma, Ahsan Umar},
+   year = {2025},
+   url = {https://huggingface.co/alphatechlogics/FaseehGPT}
  }
+ ```