# Model Card: Custom Language Model

## Overview

This model was trained on the WikiText-103 dataset to generate text continuations from input prompts.

## Dataset

**Dataset Used**: WikiText-103

**Source**: [Hugging Face Datasets](https://huggingface.co/datasets/wikitext)

**Dataset Details**: The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling and other text generation tasks.
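For reference, the dataset can be loaded directly with the `datasets` library; the configuration name below is the standard raw-text variant on the Hub.

```python
from datasets import load_dataset

# "wikitext-103-raw-v1" is the raw-text configuration of the WikiText dataset;
# a pre-tokenized "wikitext-103-v1" variant is also available on the Hub.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")
print(dataset["train"][0]["text"])
```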
## Data Cleaning

To ensure high-quality input for training, the dataset underwent the following cleaning steps (sketched in code after the list):

1. Removal of non-standard characters and punctuation.
2. Tokenization using BERT's tokenizer.
3. Lowercasing all text.
4. Filtering out overly short or long sequences to maintain a consistent input size.
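The steps above can be approximated as follows. This is a minimal sketch: the `bert-base-uncased` checkpoint, the character regex, and the length bounds are all assumptions not specified in the card, and lowercasing is applied before tokenization here since `bert-base-uncased` expects lowercased input.

```python
import re
from transformers import AutoTokenizer

# Assumed checkpoint and length bounds; the card does not state exact values.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MIN_TOKENS, MAX_TOKENS = 8, 512  # illustrative filtering thresholds

def clean_example(text: str):
    # Steps 1 and 3: strip non-standard characters and lowercase the text.
    text = re.sub(r"[^a-z0-9\s.,!?'\"-]", " ", text.lower())
    # Step 2: tokenize with BERT's tokenizer.
    ids = tokenizer(text, truncation=True, max_length=MAX_TOKENS)["input_ids"]
    # Step 4: filter out overly short or long sequences.
    return ids if MIN_TOKENS <= len(ids) <= MAX_TOKENS else None
```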
## Neural Network Definition

The neural network is based on a transformer architecture with the following specifications (a configuration sketch follows the list):

- **Model Type**: BERT-based transformer
- **Number of Layers**: 5
- **Dropout**: Applied at each layer to prevent overfitting
- **Optimizer**: AdamW with a learning rate of 5e-5
- **Loss Function**: Cross-entropy loss for language modeling
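The specification above could be realized as follows. Only the layer count, optimizer, and learning rate are stated in the card; the hidden size, head count, and dropout values below are assumptions.

```python
import torch
from transformers import BertConfig, BertLMHeadModel

# Hidden size and attention heads keep BertConfig defaults (assumed).
config = BertConfig(
    num_hidden_layers=5,               # "Number of Layers: 5"
    hidden_dropout_prob=0.1,           # dropout at each layer (value assumed)
    attention_probs_dropout_prob=0.1,  # (value assumed)
    is_decoder=True,                   # decoder mode so the model can generate text
)
model = BertLMHeadModel(config)

# AdamW with the stated learning rate; the cross-entropy language-modeling
# loss is computed internally when labels are passed to the model.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```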
## Training Details

The model was trained on an NVIDIA L4 GPU with the following resources:

- **CPU Cores**: 16
- **System RAM**: 62.8 GB
- **GPU RAM**: 22.5 GB
- **Disk**: 201.2 GB

**Training Configuration** (a training sketch follows the list):

- **Batch Size**: Dynamic, adjusted based on GPU RAM availability
- **Epochs**: 50
- **Initial Learning Rate**: 5e-5
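A training setup matching this configuration might look like the sketch below. Here `model` is the network defined earlier, `train_dataset` is a placeholder for a tokenized WikiText-103 split, and `auto_find_batch_size` is one way (assumed here) to realize the dynamic batch size, backing off automatically on GPU out-of-memory errors.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",   # placeholder path
    num_train_epochs=50,
    learning_rate=5e-5,
    auto_find_batch_size=True,    # requires the accelerate package
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```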
### Training Results

Training involved several experiments with different batch sizes and epoch counts. The final training loss was plotted to visualize the model's convergence (a plotting sketch follows).
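A loss plot like the one described can be reconstructed from the `Trainer` log history, assuming the `trainer` object from the previous sketch has finished training.

```python
import matplotlib.pyplot as plt

# log_history holds the losses recorded at each logging step during training.
losses = [entry["loss"] for entry in trainer.state.log_history if "loss" in entry]
plt.plot(losses)
plt.xlabel("Logging step")
plt.ylabel("Training loss")
plt.title("Training loss")
plt.show()
```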
## Usage

To use this model, load it from the Hugging Face Hub and generate text as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")
model = AutoModelForCausalLM.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")

# Encode a prompt, generate a continuation, and decode it back to text.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Sampling options such as `do_sample=True`, `temperature`, and `top_p` can also be passed to `generate` to vary the output.