# Model Card: Custom Language Model

## Overview

This model was trained on the WikiText-103 dataset to generate text from input prompts.

## Dataset

**Dataset Used**: WikiText-103

**Source**: [Hugging Face Datasets](https://huggingface.co/datasets/wikitext)

**Dataset Details**: The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling and other text generation tasks.

## Data Cleaning

To ensure high-quality input for training, the dataset underwent the following cleaning steps (a code sketch follows the list):

1. Removal of non-standard characters and punctuation.
2. Tokenization using BERT's tokenizer.
3. Lowercasing all text.
4. Filtering out overly short or long sequences to maintain a consistent input size.
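The original preprocessing script is not included in this card. The following is a minimal sketch of the four steps above, assuming the `bert-base-uncased` tokenizer; the regex and the length bounds `MIN_TOKENS` and `MAX_TOKENS` are hypothetical, since the actual thresholds are not documented.

```python
import re
from transformers import AutoTokenizer

# Hypothetical length bounds; the actual filtering thresholds are not documented.
MIN_TOKENS, MAX_TOKENS = 8, 512

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def clean_and_tokenize(text: str):
    # Step 1: remove non-standard characters and punctuation (illustrative regex).
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Step 3: lowercase all text (bert-base-uncased also lowercases internally).
    text = text.lower()
    # Step 2: tokenize with BERT's tokenizer.
    ids = tokenizer(text, add_special_tokens=True)["input_ids"]
    # Step 4: drop overly short or long sequences for a consistent input size.
    if len(ids) < MIN_TOKENS or len(ids) > MAX_TOKENS:
        return None
    return ids
```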
## Neural Network Definition

The neural network used for this model is based on a transformer architecture with the following specifications (a configuration sketch follows the list):

- **Model Type**: BERT-based transformer
- **Number of Layers**: 5
- **Dropout**: Applied at each layer to prevent overfitting
- **Optimizer**: AdamW with a learning rate of 5e-5
- **Loss Function**: Cross-entropy loss for language modeling
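The exact model definition is not reproduced in this card. As a rough illustration, the sketch below builds a 5-layer BERT-style decoder using Hugging Face's `BertConfig` and `BertLMHeadModel`; the dropout probability, hidden size, and attention head count are library defaults, not documented values.

```python
import torch
from transformers import BertConfig, BertLMHeadModel

# Five transformer layers as documented; the 0.1 dropout probability and all
# other dimensions are Hugging Face defaults and should be treated as assumptions.
config = BertConfig(
    num_hidden_layers=5,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    is_decoder=True,  # lets the BERT-based model generate text left-to-right
)
model = BertLMHeadModel(config)

# AdamW with the documented initial learning rate of 5e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```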
## Training Details

The model was trained on an L4 GPU with the following resources:

- **CPU Cores**: 16
- **System RAM**: 62.8 GB
- **GPU RAM**: 22.5 GB
- **Disk**: 201.2 GB

**Training Configuration** (a sketch of the training step follows the list):

- **Batch Size**: Dynamic, adjusted based on available GPU RAM
- **Epochs**: 50
- **Initial Learning Rate**: 5e-5
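The training script itself is not part of this card. Continuing the configuration sketch above, a minimal training loop under those settings might look as follows; `train_loader` is a hypothetical `DataLoader` yielding batches of token IDs.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for epoch in range(50):  # 50 epochs as documented
    for batch in train_loader:  # hypothetical DataLoader of tokenized batches
        input_ids = batch["input_ids"].to(device)
        # Supplying labels makes the model compute cross-entropy loss internally.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```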
### Training Results

Training involved several experiments with different batch sizes and epoch counts. The final training loss was plotted to visualize the model's performance.
## Usage

To use this model, load it from Hugging Face and generate text as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")
model = AutoModelForCausalLM.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")

# Tokenize a prompt and generate a continuation with default settings.
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
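Note that `generate` with default settings produces only a short continuation. Parameters such as `max_new_tokens` (to control output length) and `do_sample=True` (to sample rather than decode greedily) can be passed to `model.generate` for longer or more varied text.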