# Custom GPT Model

## Model Description

This model was designed and pretrained from scratch, without using the Hugging Face library.

## Model Parameters

- **Block Size**: `256` (maximum sequence length)
- **Micro Batch Size**: `128`
- **Sequence Length**: `256`
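As a quick illustration, the hyperparameters above could be grouped into a single configuration object. This is a sketch only: the class and field names are my assumptions, not taken from the model's actual code.

```python
from dataclasses import dataclass

# Illustrative sketch: collects the README's hyperparameters in one place.
# Class and field names are assumptions, not the model's real code.
@dataclass
class GPTConfig:
    block_size: int = 256         # maximum sequence length
    micro_batch_size: int = 128   # sequences per micro batch
    sequence_length: int = 256    # tokens per training sequence

cfg = GPTConfig()
```

Keeping all the training knobs in one dataclass like this makes it easy to log or serialize the exact configuration of a run.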

## Dataset Description

### Overview

The model was pretrained on a significant subset of the **HuggingFaceFW/fineweb-edu** dataset: 3 billion tokens selected from its "Sample 10B" segment. The dataset provides a rich corpus compiled from educational and academic web sources, making it an excellent foundation for language models that need a strong grasp of academic and formal text.

### Dataset Source

The dataset is hosted and maintained in Hugging Face's dataset repository. More detailed information and access are available on its dedicated page:

[HuggingFaceFW/fineweb-edu Sample 10B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/10BT)

### Training Details

- **Total Tokens Used for Training**: 3 billion
- **Training Duration**: 3 epochs, to ensure sufficient exposure to the data while optimizing the learning trajectory
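To make the training budget concrete, here is a back-of-envelope calculation (my own arithmetic, not figures from the training logs): each optimizer step consumes micro-batch-size × sequence-length tokens, which fixes the number of steps per epoch.

```python
# Back-of-envelope step counts implied by the parameters above;
# these are derived estimates, not values from the actual training run.
micro_batch_size = 128
sequence_length = 256
total_tokens = 3_000_000_000   # the 3B-token training subset
epochs = 3

tokens_per_step = micro_batch_size * sequence_length   # 32,768 tokens per step
steps_per_epoch = total_tokens // tokens_per_step      # about 91,552 steps
total_steps = steps_per_epoch * epochs                 # about 274,656 steps

print(f"{tokens_per_step=} {steps_per_epoch=} {total_steps=}")
```

So three passes over the 3B-token subset correspond to roughly 275k optimizer steps at these batch settings.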

### Tokenization

For tokenization, this model uses:

```python