---
language:
- en
tags:
- gpt2
- text-generation
- pytorch
license: mit
---

# SchorbGPT-Medium

This is a medium-sized language model trained on web data. The model uses the GPT-2 architecture and tokenizer.

## Model Details

- Model Type: GPT-2
- Training Data: Web text data
- Number of Parameters: GPT-2 medium scale
- Context Length: 512 tokens
- Training Framework: PyTorch

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)  # generate up to 100 new tokens
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Performance and Model Analysis

### Zero-shot Evaluation Results

| Task | Metric | Value | Stderr |
|------|--------|-------|--------|
| WikiText | bits_per_byte | 0.9860 | N/A |
| WikiText | byte_perplexity | 1.9806 | N/A |
| WikiText | word_perplexity | 38.6497 | N/A |
| ARC Easy | accuracy | 48.02% | ±1.03% |
| ARC Easy | accuracy (normalized) | 42.17% | ±1.01% |
| HellaSwag | accuracy | 29.06% | ±0.45% |
| HellaSwag | accuracy (normalized) | 31.26% | ±0.46% |
| LAMBADA | accuracy | 33.90% | ±0.66% |
| LAMBADA | perplexity | 36.2055 | ±1.4052 |
| PIQA | accuracy | 61.92% | ±1.13% |
| PIQA | accuracy (normalized) | 62.46% | ±1.13% |
| Winogrande | accuracy | 50.59% | ±1.41% |

### Analysis and Comparisons

#### Language Modeling Performance

The model achieves a word perplexity of 38.65 on WikiText, which is competitive with similar-sized models. For comparison:

- Original GPT-2 (small): ~35-40 perplexity
- GPT-2 medium: ~30-35 perplexity
- BERT-base: ~40-45 perplexity (a masked language model, so its perplexity is not directly comparable)

#### Task-Specific Analysis

1. Physical and Commonsense Reasoning:
   - PIQA: 61.92% (random baseline: 50%)
   - Comparable to GPT-2 small/medium performance
   - Shows good physical commonsense understanding

2. Science Knowledge:
   - ARC Easy: 48.02% (random baseline: 25%)
   - Well above random chance, demonstrating basic scientific knowledge
   - Similar to the performance of early GPT-2 variants

3. Linguistic Understanding:
   - LAMBADA: 33.90% accuracy with a perplexity of 36.21
   - HellaSwag: 31.26% (random baseline: 25%)
   - Performance indicates basic linguistic and contextual understanding
   - A typical range for non-fine-tuned models of this scale

4. Reasoning and Logic:
   - Winogrande: 50.59% (random baseline: 50%)
   - On par with random chance, suggesting room for improvement on complex reasoning tasks
   - Common for base models without task-specific fine-tuning

### Strengths and Limitations

**Strengths:**
- Strong performance on physical commonsense (PIQA)
- Decent basic science knowledge (ARC Easy)
- Competitive language modeling metrics

**Limitations:**
- Limited complex reasoning capabilities (Winogrande)
- Basic linguistic understanding could be improved (LAMBADA, HellaSwag)
- Performance typical of base models without task-specific fine-tuning

## Limitations

This is a base model without fine-tuning or alignment. It should be used with appropriate consideration of its capabilities and limitations.
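
## Sampling Example

The usage snippet above decodes greedily. For more varied output, sampling can be enabled through `generate`. Below is a minimal sketch; the temperature and top-p values are illustrative starting points rather than tuned recommendations, and the prompt is truncated to the model's 512-token context window.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# Truncate the prompt to the model's 512-token context window
inputs = tokenizer("Your prompt here", return_tensors="pt",
                   truncation=True, max_length=512)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,          # generate up to 100 new tokens
    do_sample=True,              # sample instead of greedy decoding
    temperature=0.8,             # illustrative values, not tuned
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```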
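
## Notes on the WikiText Metrics

The three WikiText numbers in the evaluation table are different views of the same measurement: bits per byte is the base-2 logarithm of byte perplexity, and word perplexity follows from byte perplexity and the corpus's average bytes-per-word ratio. The sketch below verifies this numerically; the bytes-per-word ratio of ~5.35 is an assumption for WikiText, not a value reported with this model.

```python
import math

byte_perplexity = 1.9806                      # reported in the table above
bits_per_byte = math.log2(byte_perplexity)
print(f"bits per byte: {bits_per_byte:.3f}")  # ~0.986, matches the table up to rounding

# Word perplexity follows from byte perplexity raised to the corpus's
# average bytes-per-word ratio (assumed ~5.35 for WikiText here).
bytes_per_word = 5.35
word_perplexity = byte_perplexity ** bytes_per_word
print(f"word perplexity: {word_perplexity:.1f}")  # ~38.7, close to the reported 38.65
```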
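
## Reproducing the Evaluation

The metric names above (`acc`/`acc_norm`, `byte_perplexity`, `bits_per_byte`) match those produced by EleutherAI's lm-evaluation-harness. Assuming that harness (v0.4+) is used, a sketch of a zero-shot run follows; the task identifiers are assumptions, and in particular `lambada_openai` may not be the LAMBADA variant behind the reported numbers.

```python
# pip install lm-eval
import lm_eval

# Zero-shot evaluation over the tasks reported in the table above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=iimaginary/schorbGPT-medium",
    tasks=["wikitext", "arc_easy", "hellaswag",
           "lambada_openai", "piqa", "winogrande"],
    num_fewshot=0,
)
print(results["results"])
```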