GPT-2 Small — Screenplay Scriptwriting Model

Study Context: This is the first model in a dual-architecture comparative study on screenplay generation using GPT-2 Small. This model is a full-parameter fine-tune executed on a cloud NVIDIA T4 GPU. The second model is a LoRA adapter trained entirely on consumer edge hardware — an Apple Silicon MacBook Air — using PEFT to operate within the hard constraints of a fanless, unified-memory device.

A fully fine-tuned GPT-2 Small (124M) causal language model trained end-to-end on ~94M tokens of professional screenplay corpora, with stateful MLOps checkpoint recovery from a mid-run hardware preemption event.


Model Description

This model is a full-parameter fine-tune of OpenAI's GPT-2 Small (124M parameters) for causal language modeling, specialized for screenplay and script generation. All 124,439,808 parameters were unfrozen and updated during training; this is not a LoRA, adapter, or other PEFT-based model, and the released weights replace the base GPT-2 checkpoint in full.

The model has internalized the highly structured formatting conventions of professional screenplays: scene slugs (INT./EXT.), character action lines, dialogue blocks, parentheticals, and production draft metadata — making it capable of generating coherent, industry-formatted script content from open-ended prompts.
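
As a rough way to check generated text against the simplest of these conventions, the sketch below recognizes a basic scene slug. The pattern and the `parse_slugline` helper are hypothetical illustrations; they cover only the plain `INT./EXT. LOCATION - DAY/NIGHT` form and not the many variants real scripts use.

```python
import re

# Hypothetical pattern for illustration: it covers only the simple
# "INT./EXT. LOCATION - DAY/NIGHT" form, not every real-world variant.
SLUGLINE = re.compile(r"^(INT\./EXT\.|INT\.|EXT\.)\s+(.+?)\s+-\s+(DAY|NIGHT)\s*$")


def parse_slugline(line):
    """Return (prefix, location, time_of_day), or None if the line is not a slug."""
    match = SLUGLINE.match(line.strip())
    return match.groups() if match else None


print(parse_slugline("INT. POLICE PRECINCT - NIGHT"))
print(parse_slugline("Detective HARRIS slams a folder on the table."))
```

A check like this could be run over each line of a generated sample to count how many scene headings are well formed.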

Property          Value
Base Model        GPT-2 Small (openai-community/gpt2)
Parameter Count   124,439,808 (100% updated)
Architecture      Decoder-only Transformer (GPT-2)
Fine-tune Method  Full-Parameter Overwrite (no PEFT/LoRA)
Task              Causal Language Modeling / Script Generation
Context Window    512 tokens (contiguous)
Language          English
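
The parameter count can be reproduced from the published GPT-2 Small dimensions (vocabulary 50,257, 1,024 positions, width 768, 12 layers). The sketch below is a back-of-the-envelope tally, not code from the training run; the function name is illustrative.

```python
# GPT-2 Small hyperparameters as published with the base checkpoint.
VOCAB, N_CTX, D_MODEL, N_LAYER = 50257, 1024, 768, 12


def gpt2_param_count(vocab=VOCAB, n_ctx=N_CTX, d=D_MODEL, n_layer=N_LAYER):
    embeddings = vocab * d + n_ctx * d            # token + position embeddings
    layer_norm = 2 * d                            # scale + bias
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # c_attn + c_proj, weights + biases
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # c_fc + c_proj, weights + biases
    block = 2 * layer_norm + attn + mlp
    # The LM head shares its weights with the token embedding, so it adds nothing.
    return embeddings + n_layer * block + layer_norm  # plus the final layer norm


print(f"{gpt2_param_count():,}")  # → 124,439,808
```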

Training Data

The model was trained on a corpus of approximately 94 million tokens of raw, professionally formatted screenplay text files. The dataset consists of:

  • Standard industry-formatted .fountain / plain-text screenplay sources
  • Scene slugline notation (INT. LOCATION - DAY/NIGHT)
  • Character cues, action blocks, parentheticals, and dialogue
  • Production draft metadata headers and transition markers

No dataset card is available at this time. The corpus was not filtered for content rating or genre — the model reflects the full stylistic and tonal range of the training material.
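
The preprocessing pipeline is not documented here. One common way to obtain contiguous 512-token training sequences, assumed in this sketch, is to concatenate the tokenized corpus into one stream and cut it into fixed-size blocks, dropping the ragged tail. The function name and the toy token stream are illustrative.

```python
def chunk_into_blocks(token_ids, block_size=512):
    """Split one long token stream into contiguous, non-overlapping blocks."""
    usable = (len(token_ids) // block_size) * block_size  # drop the ragged tail
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# Toy stand-in for a tokenized corpus: 1,100 token ids.
stream = list(range(1100))
blocks = chunk_into_blocks(stream)
print(len(blocks), len(blocks[0]))  # → 2 512
```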


Training Procedure & Infrastructure

Compute Infrastructure

Component           Specification
Accelerator         NVIDIA T4 Cloud GPU
CUDA Backend        Enabled
Precision Strategy  FP16 Mixed Precision (torch.cuda.amp via HF Accelerate)

Hyperparameters

Hyperparameter               Value
Optimizer                    AdamW
Learning Rate                5e-5 (linear decay)
per_device_train_batch_size  4
gradient_accumulation_steps  4
Effective Global Batch Size  16
Total Optimization Steps     9,272 (1 full epoch)
Total FLOs                   3.876 × 10¹⁶
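
The effective batch size and the learning rate at any step follow directly from these values. The sketch below assumes a single GPU and no warmup, neither of which is stated in this card, and models the schedule as a straight line from the peak rate to zero.

```python
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 4
TOTAL_STEPS = 9272
PEAK_LR = 5e-5
NUM_DEVICES = 1    # assumption: a single T4
WARMUP_STEPS = 0   # assumption: warmup is not stated in this card

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES


def linear_decay_lr(step):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


print(effective_batch)                 # → 16
print(f"{linear_decay_lr(5600):.3e}")  # rate in effect around the preemption point
```

Under these assumptions the learning rate at Step 5,600 is roughly 40% of its peak, which is the value the restored scheduler has to pick up from after a resume.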

MLOps Resiliency & Checkpoint Recovery

A defining characteristic of this training run is its stateful recovery from a mid-training hardware preemption event. The full timeline is documented below as an engineering reference.

Timeline

[00:00:00] → Training initiated on primary cloud instance (T4 GPU).
                Checkpoints configured to persist every 200 global steps.

[04:43:00] → HARDWARE PREEMPTION at global Step 5,600 (60.4% complete).
                Primary compute container abruptly disconnected.
                Checkpoint preserved: model.safetensors, optimizer.pt, scheduler.pt

[04:43:xx] → Hot-resume initiated on secondary cloud instance from Step 5,601.
                Full optimizer state (momentum buffers, variance estimates),
                learning rate scheduler, and gradient context fully restored.

[07:43:30] → Training complete at global Step 9,272.
                Zero loss discontinuity detected across the resume boundary.

Total aggregate compute time: 7 hours, 43 minutes, 30 seconds across both instances.

The validation losses just before and after the interruption, 1.3305 at Step 5,600 and 1.3276 at Step 5,800 (see the convergence table below), continue the same downward trend, with no regression attributable to the preemption. This indicates that the Hugging Face Trainer's native checkpoint serialization, which saves the full optimizer and scheduler state alongside the model weights, was sufficient to resume this run on stateless cloud infrastructure without a visible loss discontinuity.
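
A minimal sketch of the resume step, assuming the run used the Trainer's default `checkpoint-<step>` folder naming and wrote to storage that outlived the preempted instance. The `latest_checkpoint` helper is illustrative (transformers ships a similar `get_last_checkpoint` utility in `transformers.trainer_utils`), and the directory below is a throwaway stand-in.

```python
import os
import re
import tempfile

CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    steps = []
    for name in os.listdir(output_dir):
        match = CHECKPOINT_DIR.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(output_dir, f"checkpoint-{max(steps)}")


# Demo on a temporary directory standing in for persistent storage.
with tempfile.TemporaryDirectory() as run_dir:
    for step in (5200, 5400, 5600):
        os.makedirs(os.path.join(run_dir, f"checkpoint-{step}"))
    resume_path = latest_checkpoint(run_dir)
    resume_name = os.path.basename(resume_path)
    print(resume_name)  # → checkpoint-5600

# On the replacement instance, the path would then be handed to the Trainer:
#   trainer.train(resume_from_checkpoint=resume_path)
```

Passing the checkpoint path to `Trainer.train` restores the model weights together with the optimizer and scheduler state saved in that folder.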


Training Metrics & Convergence

The model shows clear asymptotic convergence on screenplay formatting conventions and domain vocabulary across the full 9,272-step run.

Global Step  Training Phase                Validation Loss  Notes
200          Baseline (early)              1.4586           Initial domain vocabulary acquisition
2,000        Formatting alignment          1.3653           Scene/dialogue structure stabilizing
5,600        Pre-crash state               1.3305           Checkpoint preserved at preemption
5,800        Post-resume stability         1.3276           Confirmed loss continuity after resume
9,272        Final (absolute termination)  1.3194           Convergence plateau reached

Total loss reduction: −0.1392 across the full run (−9.5% relative improvement from baseline).

The small change between Steps 5,600 and 5,800 (−0.0029) is in line with the gradual decline before the interruption, which is consistent with the optimizer state having been fully restored and training having resumed without gradient shock or instability.
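
The figures quoted above can be recomputed from the table. The sketch also converts loss to perplexity, which assumes the reported loss is mean token-level cross-entropy in nats; that is the usual convention for causal language modeling but is not stated explicitly in this card.

```python
import math

# Validation losses copied from the convergence table (global step -> loss).
val_loss = {200: 1.4586, 2000: 1.3653, 5600: 1.3305, 5800: 1.3276, 9272: 1.3194}

total_delta = val_loss[9272] - val_loss[200]
relative_change = total_delta / val_loss[200]
resume_delta = val_loss[5800] - val_loss[5600]

# Assumption: the loss is mean cross-entropy in nats, so exp(loss) is perplexity.
perplexity = {step: math.exp(loss) for step, loss in val_loss.items()}

print(f"total change:  {total_delta:+.4f} ({relative_change:+.1%})")
print(f"resume change: {resume_delta:+.4f}")
print(f"perplexity:    {perplexity[200]:.2f} -> {perplexity[9272]:.2f}")
```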


Usage & Inference

Loading the Model

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "raghavnimbalkar/gpt2-screenplay-generator"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

Recommended Inference Parameters

The following nucleus sampling configuration is recommended to produce high-fidelity, coherent screenplay output while avoiding repetitive boilerplate:

Parameter           Recommended Value  Notes
max_length          Up to 512          Hard context window limit
temperature         0.75–0.85          Lower = sharper dialogue; higher = creative variance
top_k               40 or 50           Limits vocabulary sampling pool
top_p               0.92–0.95          Nucleus sampling threshold
repetition_penalty  1.12–1.15          Critical: prevents screenplay boilerplate loops
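
To make the roles of top_k and top_p concrete, here is a simplified pure-Python sketch of the two filters applied to a toy probability distribution. It illustrates the idea only and is not the library's implementation; the function name and the five-token distribution are made up for the example.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.92):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized mass reaches top_p. Returns {token_index: probability}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / mass  # renormalized after the top-k cut
        if cumulative >= top_p:
            break
    kept_mass = sum(p for _, p in kept)
    return {index: p / kept_mass for index, p in kept}


# Hypothetical next-token distribution over a five-token vocabulary.
toy_distribution = [0.5, 0.3, 0.1, 0.05, 0.05]
filtered = top_k_top_p_filter(toy_distribution, top_k=3, top_p=0.8)
print(filtered)
```

With these toy settings only the two most likely tokens survive, and their probabilities are rescaled to sum to one before sampling.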

Inference Example

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "raghavnimbalkar/gpt2-screenplay-generator"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Prompt in screenplay format: a scene slug followed by an action line.
prompt = "INT. POLICE PRECINCT - NIGHT\n\nDetective HARRIS slams a folder on the table."

inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=512,  # counts prompt tokens plus generated tokens
        temperature=0.80,
        top_k=50,
        top_p=0.92,
        repetition_penalty=1.13,
        do_sample=True,  # required for temperature/top_k/top_p to take effect
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Tip: A repetition_penalty in the 1.12–1.15 range is especially important for this model. Screenplay corpora contain many repeated structural tokens (INT., EXT., CUT TO:, character cues), and without a penalty the model tends to loop on them during unconstrained generation.
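
The effect of the penalty can be sketched in a few lines. The version below follows the commonly described multiplicative scheme, in which the logit of every already-generated token is pushed toward lower probability; it is a simplified illustration with made-up logits, not the library's code.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.13):
    """Down-weight every token that already appears in the generated sequence."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive logits shrink and negative logits move further from zero,
        # so a penalized token becomes less likely either way.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = [2.0, -1.0, 0.5]  # hypothetical scores for a three-token vocabulary
penalized = apply_repetition_penalty(toy_logits, generated_ids=[0, 1, 0])
print(penalized)
```

Tokens 0 and 1 have already been generated and are penalized once each, however often they occurred; token 2 is untouched.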


Comparison with LoRA Adapter Model

This model is one half of an ongoing comparative study. The table below contrasts both trained models across architecture, compute, and convergence dimensions.

Property              Full-Parameter (This Model)                LoRA Adapter (Local)
Hardware              NVIDIA T4 (Cloud GPU)                      Apple Silicon MacBook Air (MPS)
Fine-tune Method      Full-parameter overwrite                   LoRA / PEFT (c_attn only)
Trainable Parameters  124,439,808 (100%)                         294,912 (0.24%)
Epoch Coverage        1.0 (full corpus)                          0.51 (half corpus)
Total Steps           9,272                                      4,700
Training Time         7h 43m 30s                                 7h 51m 02s
Final Eval Loss       1.3194                                     2.4017
Step Throughput       ~3.0s/step                                 ~6.01s/step
MLOps Event           Hardware preemption + stateful hot-resume  17× speedup via LoRA over full-param attempt

Both models trained for approximately the same wall-clock time (~7h 45m). The gap in final evaluation loss therefore reflects full-parameter updates and full corpus coverage on the cloud GPU versus adapter-based efficiency on constrained hardware, not a difference in time invested. The LoRA adapter is a deliberate trade-off: edge feasibility over convergence depth.
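
Two rows of the comparison can be cross-checked with simple arithmetic. The LoRA rank is not stated in this card; the sketch assumes rank 8, which is the value that reproduces the 294,912 trainable parameters when only c_attn is adapted in all 12 layers. Function names are illustrative.

```python
D_MODEL, N_LAYER = 768, 12
FULL_PARAMS = 124_439_808
LORA_RANK = 8  # assumption: the rank is not stated in this card


def lora_params_c_attn(rank, d_model=D_MODEL, n_layer=N_LAYER):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    d_in, d_out = d_model, 3 * d_model  # c_attn projects 768 -> 2304
    return n_layer * rank * (d_in + d_out)


def seconds_per_step(hours, minutes, seconds, steps):
    return (hours * 3600 + minutes * 60 + seconds) / steps


lora_params = lora_params_c_attn(LORA_RANK)
print(f"{lora_params:,} trainable ({lora_params / FULL_PARAMS:.2%} of the base model)")
print(f"full fine-tune: {seconds_per_step(7, 43, 30, 9272):.2f} s/step")
print(f"LoRA adapter:   {seconds_per_step(7, 51, 2, 4700):.2f} s/step")
```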


Intended Use

Intended uses:

  • Screenplay drafting assistance and creative ideation
  • Automated scene/dialogue continuation from a provided slug or action line
  • Style transfer and scriptwriting research
  • Educational exploration of domain-adaptive fine-tuning on structured text

Out-of-scope uses:

  • Factual question answering (this is a generative, not retrieval, model)
  • Production-ready script generation without human editorial review
  • Any use case requiring truthfulness, citation, or factual accuracy

Bias, Risks, and Limitations

  • The model was trained on an unfiltered corpus spanning multiple genres and tones; it may generate content reflecting biases, stereotypes, or mature themes present in its training data.
  • As a 124M parameter model, outputs are prone to incoherence over long sequences and may not maintain narrative or character consistency beyond a few exchanges.
  • The model has no instruction-following capability; it is a raw next-token predictor conditioned on screenplay-formatted text.
  • Users should apply content moderation filters appropriate for their deployment context.

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact Calculator.

Property        Value
Hardware Type   NVIDIA T4 (Cloud GPU)
Hours Used      ~7.72 hours (across 2 instances)
Cloud Provider  (Not disclosed)
Compute Region  (Not disclosed)
Carbon Emitted  0.31 kg CO2eq
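
For orientation, estimates of this kind are typically energy used multiplied by the carbon intensity of the grid. The sketch below is a back-of-the-envelope check, not the calculator's own computation: it assumes the T4 drew its full 70 W board power for the whole run and then derives the grid intensity that the reported figure would imply.

```python
GPU_POWER_KW = 0.070    # assumption: T4 running at its 70 W board power limit
HOURS_USED = 7.72       # from the table above
REPORTED_KG_CO2 = 0.31  # from the table above

energy_kwh = GPU_POWER_KW * HOURS_USED
implied_intensity = REPORTED_KG_CO2 / energy_kwh  # kg CO2eq per kWh

print(f"energy: {energy_kwh:.2f} kWh")
print(f"implied grid intensity: {implied_intensity:.2f} kg CO2eq/kWh")
```

Because the provider and region are not disclosed, the implied intensity is only a consistency check on the reported number, not a statement about where the run took place.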

Citation

If you reference this model or its training methodology in research, please cite the base model:

@article{radford2019language,
  title   = {Language Models are Unsupervised Multitask Learners},
  author  = {Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year    = {2019}
}

Model Card Contact

For questions about this fine-tune's training methodology, dataset, or inference behavior, please open an issue in this repository.
