---
language:
- en
tags:
- gpt2
- text-generation
- pytorch
license: mit
---

# SchorbGPT-Medium

This is a medium-sized language model trained on web data. The model uses the GPT-2 architecture and tokenizer.

## Model Details

- Model Type: GPT-2
- Training Data: Web text data
- Number of Parameters: GPT-2 medium scale
- Context Length: 512 tokens
- Training Framework: PyTorch

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

text = "Your prompt here"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)  # generate up to 100 new tokens
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Performance and Model Analysis

### Zero-shot Evaluation Results

| Task | Metric | Value | Stderr |
|------|--------|-------|--------|
| WikiText | bits_per_byte | 0.9860 | N/A |
| WikiText | byte_perplexity | 1.9806 | N/A |
| WikiText | word_perplexity | 38.6497 | N/A |
| ARC Easy | accuracy | 48.02% | ±1.03% |
| ARC Easy | accuracy (normalized) | 42.17% | ±1.01% |
| HellaSwag | accuracy | 29.06% | ±0.45% |
| HellaSwag | accuracy (normalized) | 31.26% | ±0.46% |
| LAMBADA | accuracy | 33.90% | ±0.66% |
| LAMBADA | perplexity | 36.2055 | ±1.4052 |
| PIQA | accuracy | 61.92% | ±1.13% |
| PIQA | accuracy (normalized) | 62.46% | ±1.13% |
| Winogrande | accuracy | 50.59% | ±1.41% |

### Analysis and Comparisons

#### Language Modeling Performance

The model achieves a word perplexity of 38.65 on WikiText, which is competitive with similar-sized models. For comparison:

- Original GPT-2 (small): ~35-40 perplexity
- GPT-2 medium: ~30-35 perplexity
- BERT-base: ~40-45 perplexity (a masked language model, so its perplexity is not directly comparable)

#### Task-Specific Analysis

1. Physical and Commonsense Reasoning:
   - PIQA: 61.92% (random baseline: 50%)
   - Comparable to GPT-2 small/medium performance
   - Shows good physical commonsense understanding

2. Science Knowledge:
   - ARC Easy: 48.02% (random baseline: 25%)
   - Well above random chance, demonstrating basic scientific knowledge
   - Similar to the performance of early GPT-2 variants

3. Linguistic Understanding:
   - LAMBADA: 33.90% accuracy with a perplexity of 36.21
   - HellaSwag: 31.26% (random baseline: 25%)
   - Performance indicates basic linguistic and contextual understanding
   - A typical range for non-fine-tuned models of this scale

4. Reasoning and Logic:
   - Winogrande: 50.59% (random baseline: 50%)
   - On par with random chance, suggesting room for improvement on complex reasoning tasks
   - Common for base models without task-specific fine-tuning

### Strengths and Limitations

**Strengths:**
- Strong performance on physical commonsense (PIQA)
- Decent basic science knowledge (ARC Easy)
- Competitive language modeling metrics

**Limitations:**
- Limited complex reasoning capabilities (Winogrande)
- Basic linguistic understanding could be improved (LAMBADA, HellaSwag)
- Performance typical of base models without task-specific fine-tuning

## Limitations

This is a base model without fine-tuning or alignment. It should be used with appropriate consideration of its capabilities and limitations.
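
## Sampling Example

The usage snippet above decodes greedily. For more varied output, sampling can be enabled through `generate`. Below is a minimal sketch; the temperature and top-p values are illustrative starting points rather than tuned recommendations, and the prompt is truncated to the model's 512-token context window.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iimaginary/schorbGPT-medium")
model = AutoModelForCausalLM.from_pretrained("iimaginary/schorbGPT-medium")

# Truncate the prompt to the model's 512-token context window
inputs = tokenizer("Your prompt here", return_tensors="pt",
                   truncation=True, max_length=512)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,          # generate up to 100 new tokens
    do_sample=True,              # sample instead of greedy decoding
    temperature=0.8,             # illustrative values, not tuned
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```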
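
## Notes on the WikiText Metrics

The three WikiText numbers in the evaluation table are different views of the same measurement: bits per byte is the base-2 logarithm of byte perplexity, and word perplexity follows from byte perplexity and the corpus's average bytes-per-word ratio. The sketch below verifies this numerically; the bytes-per-word ratio of ~5.35 is an assumption for WikiText, not a value reported with this model.

```python
import math

byte_perplexity = 1.9806                      # reported in the table above
bits_per_byte = math.log2(byte_perplexity)
print(f"bits per byte: {bits_per_byte:.3f}")  # ~0.986, matches the table up to rounding

# Word perplexity follows from byte perplexity raised to the corpus's
# average bytes-per-word ratio (assumed ~5.35 for WikiText here).
bytes_per_word = 5.35
word_perplexity = byte_perplexity ** bytes_per_word
print(f"word perplexity: {word_perplexity:.1f}")  # ~38.7, close to the reported 38.65
```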
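
## Reproducing the Evaluation

The metric names above (`acc`/`acc_norm`, `byte_perplexity`, `bits_per_byte`) match those produced by EleutherAI's lm-evaluation-harness. Assuming that harness (v0.4+) is used, a sketch of a zero-shot run follows; the task identifiers are assumptions, and in particular `lambada_openai` may not be the LAMBADA variant behind the reported numbers.

```python
# pip install lm-eval
import lm_eval

# Zero-shot evaluation over the tasks reported in the table above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=iimaginary/schorbGPT-medium",
    tasks=["wikitext", "arc_easy", "hellaswag",
           "lambada_openai", "piqa", "winogrande"],
    num_fewshot=0,
)
print(results["results"])
```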