# Model Card: Custom Language Model

## Overview

This model was trained on the WikiText-103 dataset to generate text from input prompts.

## Dataset

**Dataset Used**: WikiText-103

**Source**: [Hugging Face Datasets](https://huggingface.co/datasets/wikitext)

**Dataset Details**: The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling and other text generation tasks.

## Data Cleaning

To ensure high-quality input for training, the dataset underwent the following cleaning steps:
1. Removal of non-standard characters and punctuation.
2. Tokenization using BERT's tokenizer.
3. Lowercasing all text.
4. Filtering out any overly short or long sequences to maintain a consistent input size.
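
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the code used to train the model: the regular expression that defines "non-standard" characters, the `min_tokens` and `max_tokens` thresholds, and the function names `clean_text` and `filter_by_length` are all assumptions, and a plain whitespace split stands in for BERT's tokenizer so the example runs without extra dependencies.

```python
import re
from typing import Callable, Iterable, List

# Assumption: "non-standard characters and punctuation" means anything
# other than ASCII letters, digits, and whitespace.
_NON_STANDARD = re.compile(r"[^A-Za-z0-9\s]")
_WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip non-standard characters, collapse whitespace, and lowercase."""
    text = _NON_STANDARD.sub(" ", text)
    text = _WHITESPACE.sub(" ", text).strip()
    return text.lower()


def filter_by_length(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer
    min_tokens: int = 5,    # hypothetical lower bound
    max_tokens: int = 512,  # hypothetical upper bound
) -> List[str]:
    """Clean each text and keep only those whose token count is in range."""
    kept = []
    for text in texts:
        cleaned = clean_text(text)
        n_tokens = len(tokenize(cleaned))
        if min_tokens <= n_tokens <= max_tokens:
            kept.append(cleaned)
    return kept
```

To count tokens with a BERT tokenizer instead, one option is to pass `AutoTokenizer.from_pretrained("bert-base-uncased").tokenize` as the `tokenize` argument; the checkpoint name here is an assumption, since the card does not say which BERT tokenizer was used.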

## Neural Network Definition

The neural network used for this model is based on a transformer architecture with the following specifications:
- **Model Type**: BERT-based transformer
- **Number of Layers**: 5
- **Dropout**: Applied at each layer to prevent overfitting
- **Optimizer**: AdamW with a learning rate of 5e-5
- **Loss Function**: Cross-entropy loss for language modeling
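
For readers unfamiliar with the loss listed above, here is a small dependency-free sketch of cross-entropy for language modeling: the mean negative log-probability the model assigns to the correct token at each position. The function name `cross_entropy` and the toy logits are illustrative assumptions; a real training loop would rely on the framework's built-in loss rather than this hand-written version.

```python
import math
from typing import List


def cross_entropy(logits: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target token at each position.

    logits:  one row of unnormalized scores per position (vocabulary-sized).
    targets: index of the correct token for each position.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        # -log softmax(row)[target] = log_z - row[target]
        total += log_z - row[target]
    return total / len(targets)


# Uniform scores over a 4-token vocabulary give a loss of log(4),
# which corresponds to a perplexity of exp(loss) = 4.
loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
perplexity = math.exp(loss)
```

Perplexity, the exponential of this loss, is a common way to report language-model quality on WikiText-103.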

## Training Details

The model was trained on an NVIDIA L4 GPU with the following resources:
- **CPU Cores**: 16
- **System RAM**: 62.8 GB
- **GPU RAM**: 22.5 GB
- **Disk**: 201.2 GB

**Training Configuration**:
- **Batch Size**: Dynamic, adjusted based on GPU RAM availability
- **Epochs**: 50
- **Initial Learning Rate**: 5e-5
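
The card does not say how the batch size was adjusted to GPU RAM. One common approach, sketched here purely as an assumption, is to start from a large batch size and halve it whenever a trial training step runs out of memory. The names `find_batch_size` and `try_step` are hypothetical, and Python's built-in `MemoryError` stands in for the framework's out-of-memory exception.

```python
from typing import Callable


def find_batch_size(
    try_step: Callable[[int], None],  # runs one trial step at a given batch size
    start: int = 64,                  # hypothetical starting batch size
    minimum: int = 1,
) -> int:
    """Return the largest batch size (halving from `start`) that fits in memory."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except MemoryError:
            # The trial step did not fit; retry with half the batch size.
            batch_size //= 2
    raise RuntimeError("No batch size at or above the minimum fits in memory.")


def fake_step(batch_size: int) -> None:
    """Toy stand-in for a training step that fits at most 20 samples."""
    if batch_size > 20:
        raise MemoryError


chosen = find_batch_size(fake_step, start=64)  # tries 64, 32, then 16
```

In a real training loop, `try_step` would run a forward and backward pass on a batch of the given size and catch the GPU out-of-memory error raised by the framework in use.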

### Training Results

Training involved several experiments with different batch sizes and numbers of epochs. The final training loss was plotted to visualize the model's performance.

## Usage

To use this model, you can load it from Hugging Face and generate text as follows:

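```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")
model = AutoModelForCausalLM.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")

# Encode the prompt as PyTorch tensors.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate a continuation; max_new_tokens=50 is an illustrative value, adjust as needed.
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode the generated token IDs back into text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```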