# Model Card: Custom Language Model
## Overview
This model was trained on the WikiText-103 dataset to generate text from input prompts.
## Dataset
- **Dataset Used**: WikiText-103
- **Source**: [Hugging Face Datasets](https://huggingface.co/datasets/wikitext)
- **Dataset Details**: WikiText-103 is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling and other text generation tasks.
## Data Cleaning
To ensure high-quality input for training, the dataset underwent the following cleaning steps:
1. Removal of non-standard characters and punctuation.
2. Tokenization using BERT's tokenizer.
3. Lowercasing all text.
4. Filtering out any overly short or long sequences to maintain a consistent input size.
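The four steps above can be sketched as a single function. This is a minimal illustration, not the code used for training: the function name `clean_corpus`, the regular expression used to define "non-standard characters", and the length thresholds `min_tokens` and `max_tokens` are all assumptions, and whitespace splitting stands in for BERT's tokenizer so the example runs without downloads.

```python
import re


def clean_corpus(lines, tokenize=str.split, min_tokens=5, max_tokens=128):
    """Clean raw text lines and return a list of token lists.

    All defaults are illustrative assumptions, not values from the model card.
    """
    cleaned = []
    for line in lines:
        # Step 1: replace anything that is not a letter, digit, or whitespace
        # (an assumed definition of "non-standard characters and punctuation").
        text = re.sub(r"[^A-Za-z0-9\s]", " ", line)
        # Step 2: tokenize (whitespace split is a stand-in for BERT's tokenizer).
        tokens = tokenize(text)
        # Step 3: lowercase every token.
        tokens = [token.lower() for token in tokens]
        # Step 4: keep only sequences whose length falls inside the bounds.
        if min_tokens <= len(tokens) <= max_tokens:
            cleaned.append(tokens)
    return cleaned


sample = ["Hello, World! This is a TEST sentence.", "Too short."]
print(clean_corpus(sample, min_tokens=3, max_tokens=10))
```

To use an actual BERT tokenizer, one option is to pass `AutoTokenizer.from_pretrained("bert-base-uncased").tokenize` as the `tokenize` argument. The checkpoint name is an assumption, since the card does not say which BERT tokenizer was used.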
## Neural Network Definition
The neural network used for this model is based on a transformer architecture with the following specifications:
- **Model Type**: BERT-based transformer
- **Number of Layers**: 5
- **Dropout**: Applied at each layer to prevent overfitting
- **Optimizer**: AdamW with a learning rate of 5e-5
- **Loss Function**: Cross-entropy loss for language modeling
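To make the loss function concrete, the following sketch computes cross-entropy for language modeling in plain Python: the mean negative log-probability that the model assigns to the correct token at each position. The helper names `softmax` and `cross_entropy` and the toy logits are illustrative; in practice a framework implementation would be used, and the card does not say which one was.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy vocabulary of 4 tokens, two positions.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
targets = [0, 3]
print(cross_entropy(uniform, targets))    # equals ln(4) for a uniform prediction
print(cross_entropy(confident, targets))  # lower, since the targets get high scores
```

A model that predicts uniformly over a vocabulary of size V has a loss of ln V, which gives a useful baseline when reading a training loss curve.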
## Training Details
The model was trained on an L4 GPU with the following resources:
- **CPU Cores**: 16
- **System RAM**: 62.8 GB
- **GPU RAM**: 22.5 GB
- **Disk**: 201.2 GB
**Training Configuration**:
- **Batch Size**: Dynamic, adjusted based on GPU RAM availability
- **Epochs**: 50
- **Initial Learning Rate**: 5e-5
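The card does not describe how the batch size was adjusted to GPU RAM. One common approach, shown here purely as an assumption, is to start from a large batch size and halve it whenever a trial training step runs out of memory. The function `find_batch_size`, its parameters, and the `fake_step` stand-in are hypothetical.

```python
def find_batch_size(try_step, start=64, minimum=1, oom_error=MemoryError):
    """Halve the batch size until try_step(batch_size) stops raising oom_error."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)  # run one trial step at this batch size
            return batch_size
        except oom_error:
            batch_size //= 2  # out of memory: retry with half the batch
    raise RuntimeError("no batch size fits in memory")


def fake_step(batch_size):
    """Stand-in for a real training step; pretends anything above 16 runs out of memory."""
    if batch_size > 16:
        raise MemoryError


print(find_batch_size(fake_step, start=64))
```

With a recent PyTorch version, `oom_error` could be set to `torch.cuda.OutOfMemoryError` and `try_step` would run a real forward and backward pass.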
### Training Results
Training involved several experiments with different batch sizes and numbers of epochs. The final training loss was plotted to visualize the model's performance.
## Usage
To use this model, you can load it from Hugging Face and generate text as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")
model = AutoModelForCausalLM.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")

# Encode the prompt as PyTorch tensors.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate a continuation with the default generation settings and decode it.
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```