# Custom GPT Model

## Model Description

This model was designed and pretrained from scratch, without using the Hugging Face library.

## Model Parameters

- **Block Size**: `256` (maximum sequence length)
- **Micro Batch Size**: `128`
- **Sequence Length**: `256`
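As a quick illustration, the hyperparameters above could be grouped into a single configuration object. This is a sketch only: the class and field names are my assumptions, not taken from the model's actual code.

```python
from dataclasses import dataclass

# Illustrative sketch: collects the README's hyperparameters in one place.
# Class and field names are assumptions, not the model's real code.
@dataclass
class GPTConfig:
    block_size: int = 256         # maximum sequence length
    micro_batch_size: int = 128   # sequences per micro batch
    sequence_length: int = 256    # tokens per training sequence

cfg = GPTConfig()
```

Keeping all the training knobs in one dataclass like this makes it easy to log or serialize the exact configuration of a run.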

## Dataset Description

### Overview

The model was pretrained on a significant subset of the **HuggingFaceFW/fineweb-edu** dataset: 3 billion tokens selected from its "Sample 10B" segment. The dataset provides a rich corpus compiled from educational and academic web sources, making it an excellent foundation for language models that need a strong grasp of academic and formal text.

### Dataset Source

The dataset is hosted and maintained in Hugging Face's dataset repository. More detailed information and access are available on its dedicated page:

[HuggingFaceFW/fineweb-edu Sample 10B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/10BT)

### Training Details

- **Total Tokens Used for Training**: 3 billion
- **Training Duration**: 3 epochs, to ensure sufficient exposure to the data while optimizing the learning trajectory
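To make the training budget concrete, here is a back-of-envelope calculation (my own arithmetic, not figures from the training logs): each optimizer step consumes micro-batch-size × sequence-length tokens, which fixes the number of steps per epoch.

```python
# Back-of-envelope step counts implied by the parameters above;
# these are derived estimates, not values from the actual training run.
micro_batch_size = 128
sequence_length = 256
total_tokens = 3_000_000_000   # the 3B-token training subset
epochs = 3

tokens_per_step = micro_batch_size * sequence_length   # 32,768 tokens per step
steps_per_epoch = total_tokens // tokens_per_step      # about 91,552 steps
total_steps = steps_per_epoch * epochs                 # about 274,656 steps

print(f"{tokens_per_step=} {steps_per_epoch=} {total_steps=}")
```

So three passes over the 3B-token subset correspond to roughly 275k optimizer steps at these batch settings.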

### Tokenization

For tokenization, this model uses:

```python