Commit 084a962 (verified) by codewithdark · Parent: aba9604
Update README.md

Files changed (1): README.md (+149 −15)

README.md CHANGED
@@ -1,15 +1,149 @@
- ---
- license: mit
- datasets:
- - arbml/Arabic_News
- - arbml/Arabic_Literature
- - khalidalt/ultimate_arabic_news
- - pain/Arabic-Tweets
- language:
- - ar
- pipeline_tag: text-generation
- library_name: transformers
- tags:
- - Custom
- - pytorch
- ---
# Model Card for FaseehGPT

## Model Details

- **Model Name:** FaseehGPT
- **Model Type:** Decoder-only Transformer (GPT-style)
- **Repository:** alphatechlogics/FaseehGPT
- **Version:** 1.1
- **Developers:** [Ahsan Umar](https://huggingface.co/codewithdark)
- **Date:** July 10, 2025
- **License:** Apache 2.0
- **Framework:** PyTorch, Hugging Face Transformers
- **Language:** Arabic
- **Intended Use:** Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses the pre-trained asafaya/bert-base-arabic tokenizer and is sized for resource-constrained environments such as free-tier Colab or Kaggle GPUs. Training ran for 20 epochs, with checkpoints saved and sample text generated along the way.
## Model Architecture

- **Architecture:** Decoder-only transformer with multi-head self-attention and feed-forward layers.
- **Parameters:**
  - Vocabulary Size: ~32,000 (from the asafaya/bert-base-arabic tokenizer)
  - Embedding Dimension: 512
  - Number of Layers: 12
  - Number of Attention Heads: 8
  - Feed-forward Dimension: 2048
  - Total Parameters: ~70.7 million
- **Configuration:**
  - Maximum Sequence Length: 512
  - Dropout Rate: 0.1
  - Activation Function: GELU
- **Weight Initialization:** Normal distribution (mean=0, std=0.02)
- **Special Features:** Top-k and top-p sampling for text generation, with weight tying between the input and output embeddings for efficiency.
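The checkpoint's exact module layout is defined by the repository's custom modeling code; purely for orientation, a standard pre-norm decoder block with the hyperparameters above might look like the following minimal sketch (the class and variable names are illustrative, not the repository's actual identifiers):

```python
import torch
import torch.nn as nn

# Hyperparameters taken from the table above (width, heads, FFN width, dropout).
EMBED_DIM, N_HEADS, FF_DIM, DROPOUT = 512, 8, 2048, 0.1

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: masked self-attention + GELU feed-forward."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(EMBED_DIM)
        self.attn = nn.MultiheadAttention(EMBED_DIM, N_HEADS, dropout=DROPOUT, batch_first=True)
        self.ln2 = nn.LayerNorm(EMBED_DIM)
        self.ff = nn.Sequential(
            nn.Linear(EMBED_DIM, FF_DIM), nn.GELU(),
            nn.Linear(FF_DIM, EMBED_DIM), nn.Dropout(DROPOUT),
        )

    def forward(self, x):
        # Causal mask: True entries mark positions a token may NOT attend to.
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.ff(self.ln2(x))

# Quick shape check: a batch of 2 sequences of 16 token embeddings.
print(DecoderBlock()(torch.randn(2, 16, EMBED_DIM)).shape)  # torch.Size([2, 16, 512])
```

A full model would stack 12 such blocks between the (tied) token/position embeddings and the output projection.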
## Training Details

- **Datasets:**
  - arbml/Arabic_News: 7,114,814 news article texts
  - arbml/Arabic_Literature: 1,592,629 literary texts
  - Subset Used: 50,000 randomly sampled texts for training and evaluation
    - Training Set: 45,000 texts (90%)
    - Validation Set: 5,000 texts (10%)
- **Training Configuration:**
  - Epochs: 20
  - Learning Rate: 3e-4 (the "Karpathy constant")
  - Optimizer: AdamW (weight decay = 0.01)
  - Scheduler: Linear warmup (10% of steps) followed by linear decay
  - Batch Size: Effective batch size of 16 (4 gradient accumulation steps)
  - Hardware: Kaggle (NVIDIA P100 GPU)
  - Training Duration: 8.18 hours
  - Checkpoint: Saved at epoch 20
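As a rough illustration of this setup (AdamW, 10% linear warmup, 4-step gradient accumulation), here is a runnable sketch; the tiny linear model and random data are stand-ins, since the actual training script lives in the repository:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Stand-in model and data so the sketch runs end to end.
model = torch.nn.Linear(512, 32000)
loader = DataLoader(TensorDataset(torch.randn(64, 512),
                                  torch.randint(0, 32000, (64,))), batch_size=4)

ACCUM_STEPS, EPOCHS = 4, 20                      # effective batch size 16 = 4 x 4
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
total_steps = EPOCHS * len(loader) // ACCUM_STEPS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),     # linear warmup over 10% of steps
    num_training_steps=total_steps)

for epoch in range(EPOCHS):
    for step, (x, y) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(x), y) / ACCUM_STEPS
        loss.backward()                          # gradients accumulate across micro-batches
        if (step + 1) % ACCUM_STEPS == 0:        # update once per effective batch
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```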
61
+ Sample Generated Text (at epoch 20):
62
+ Prompt 1: "اللغة العربية"
63
+ Output: اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ
64
+
65
+
66
+ Prompt 2: "كان يا مكان في قديم الزمان"
67
+ Output: كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في
68
+
69
+
70
+ Analysis: The generated text shows some coherence but includes grammatical and semantic inconsistencies, suggesting the model may benefit from further training or fine-tuning.
71
+
## Usage

FaseehGPT can generate Arabic text from a prompt. Below is an example of loading and using the model with the Hugging Face transformers library:

```python
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (trust_remote_code is required for the custom architecture)
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Encode the prompt and generate a continuation
prompt = "السلام عليكم"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
**Parameters for Generation:**

- `max_new_tokens`: Maximum number of tokens to generate (e.g., 100).
- `temperature`: Controls randomness (default: 1.0).
- `top_k`: Limits sampling to the k most likely tokens (default: 50).
- `top_p`: Nucleus sampling threshold (default: 0.9).

**Expected Output:** Arabic text continuing from the prompt; quality depends on training completion and hyperparameter settings. A generic sketch of the filtering these parameters control follows below.
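These parameters control a standard filtering step applied to the logits before sampling. The following is a generic, self-contained sketch of top-k plus nucleus (top-p) filtering; it illustrates the idea and is not necessarily the implementation inside the model's `generate`:

```python
import torch

def filter_logits(logits, top_k=50, top_p=0.9):
    """Generic top-k then top-p (nucleus) filtering over a [batch, vocab] logit tensor."""
    # Top-k: mask everything below the k-th largest logit.
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: keep the smallest set of tokens whose probability mass exceeds top_p.
    sorted_logits, idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum_before = probs.cumsum(dim=-1) - probs      # mass strictly before each token
    sorted_logits[cum_before > top_p] = float("-inf")
    return torch.full_like(logits, float("-inf")).scatter(-1, idx, sorted_logits)

# Sample one next token from random stand-in logits at temperature 1.0.
logits = torch.randn(1, 32000) / 1.0
next_token = torch.multinomial(torch.softmax(filter_logits(logits), dim=-1), num_samples=1)
```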
## Dataset Description

- **Source:** Hugging Face Datasets
- **Datasets Used:**
  - arbml/Arabic_News: News articles covering diverse topics, providing formal and varied Arabic text.
  - arbml/Arabic_Literature: Literary works, including novels and poetry, offering rich linguistic patterns.
- **Total Texts:** 8,707,443 (full dataset); 50,000 used in the training run described above.
- **Preprocessing:**
  - Texts are tokenized with asafaya/bert-base-arabic.
  - Long texts are split into overlapping chunks (stride: max_seq_len // 2) to fit the maximum sequence length (512), as sketched below.
  - Special tokens (`<SOS>`, `<EOS>`, `<PAD>`, `<UNK>`) are added for language modeling.
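The overlapping-chunk step can be pictured with a short sketch; the helper below is hypothetical and only illustrates the stride-based windowing described above:

```python
def chunk_token_ids(ids, max_seq_len=512):
    """Split one long token-id sequence into windows that overlap by half a window."""
    stride = max_seq_len // 2          # overlap of max_seq_len // 2, as described above
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + max_seq_len])
        if start + max_seq_len >= len(ids):
            break                      # last window already reaches the end of the text
    return chunks

ids = list(range(1300))                # stand-in for real tokenizer output
print([len(c) for c in chunk_token_ids(ids)])   # [512, 512, 512, 512, 276]
```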
## Evaluation

- **Metrics:** Cross-entropy loss (training and validation).
- **Status:** Loss values are unavailable in the recorded output due to incomplete logging. Sample generations at epoch 20 indicate partial learning of Arabic linguistic patterns, but coherence is limited.
- **Recommendations:**
  - Extract loss values from the checkpoint file (model_checkpoint_epoch_20.pt) or rerun training with verbose logging; see the sketch below.
  - Compute additional metrics such as perplexity or BLEU to quantify generation quality.
  - Experiment with a smaller model (e.g., embed_dim=256, num_layers=6) for faster iteration on free-tier GPUs.
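On the perplexity recommendation: perplexity is simply the exponential of the mean cross-entropy loss. A minimal sketch, assuming the checkpoint is a dict that stored a `val_loss` entry (the key name is a guess; adjust it to whatever the training script actually saved):

```python
import math
import torch

# Hypothetical checkpoint layout; the key below is an assumption, not a documented field.
ckpt = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
val_loss = ckpt.get("val_loss")              # mean cross-entropy in nats, if it was logged
if val_loss is not None:
    print(f"validation perplexity: {math.exp(val_loss):.2f}")   # perplexity = exp(CE loss)
```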
## Limitations

- **Generated Text Quality:** Sample outputs show only partial coherence, indicating undertraining or a need for hyperparameter tuning (e.g., lower temperature, adjusted top-k/top-p).
- **Resource Constraints:** Trained on a 50,000-text subset due to free-tier GPU limitations, potentially reducing generalization compared to the full 8.7M-text dataset.
- **Language Specificity:** Optimized for Arabic; performance on other languages is untested.
- **Training Duration:** 8.18 hours for 20 epochs on a limited subset; training on the full dataset would require more powerful hardware.
## Ethical Considerations

- **Bias:** The model may reflect biases present in the training data, such as regional or topical biases in news and literary styles.
- **Usage:** Intended for research and non-commercial applications. Users should verify generated text for accuracy and cultural appropriateness.
- **Data Privacy:** The datasets are publicly available on Hugging Face, but users must comply with their usage policies.
## How to Contribute

- **Repository:** alphatechlogics/FaseehGPT
- **Issues:** Report bugs or suggest improvements via the repository's issue tracker.
- **Training:** Resume training with the full dataset or better hardware to improve performance.
- **Evaluation:** Contribute scripts for computing perplexity, BLEU, or other metrics to assess text quality.
## Citation

If you use FaseehGPT in your research, please cite:

```bibtex
@misc{faseehgpt2025,
  title={FaseehGPT: An Arabic Language Model},
  author={Rohma, Ahsan Umar},
  year={2025},
  url={https://huggingface.co/alphatechlogics/FaseehGPT}
}
```