---
license: apache-2.0
---
|
|
## Markov vs Transformer: Text Generation Experiment

This project compares classic statistical models (character-level n-gram Markov chains) with a modern transformer (GPT-2) on the same text corpus: Pushkin’s poetry.

It highlights the jump from local memory to long-context reasoning in text generation.

## Motivation

In 1913, Andrey Markov analyzed the letter sequences of Pushkin’s *Eugene Onegin*, showing that dependencies between characters could model text.

Fast forward a century: transformers now handle hundreds of tokens of context with attention.

This repo recreates that evolution, side by side.
|
|
|
|
|
## Models Compared

### Markov chains (n = 1, 3, 5)

- Generate text from local character windows (see the sketch below).
- Capture letter frequencies and small fragments of fluency.
- Fail on long-range dependencies → loops & gibberish.
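
To make “local character windows” concrete, the sketch below builds an n = 3 transition table on a toy string (the string and variable names are illustrative, not the repo’s code):

```python
from collections import Counter, defaultdict

text = "I loved you silently, without hope"  # toy stand-in for the Pushkin corpus
n = 3

# Map each 3-character window to counts of the character that follows it
table = defaultdict(Counter)
for i in range(len(text) - n):
    table[text[i:i + n]][text[i + n]] += 1

print(table["lov"])  # Counter({'e': 1}): after "lov" the model has only ever seen "e"
```

Generation is just repeated sampling from these counts, which is why a small n so easily re-enters a window it has seen before and loops.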
|
|
|
|
|
### GPT-2 (medium)

- Pretrained transformer with 345M parameters.
- Extends prompts into fluent, poetic lines.
- Still prone to degeneration without sampling controls (see the sampling example under How to Run).
|
|
|
|
|
## Results

Prompt: “I loved you” (Pushkin corpus)

| Model | Sample output (excerpt) |
| --- | --- |
| Markov (n=3) | “I longer trouble you so tenderly, sorrow. I loved in my so sincerely extinguished in my shyness, no loved now by jealousy. I loved you may be loved you may be loved you so tenderly…” |
| Markov (n=5) | “I loved you silently, without hope, Tormented now by jealousy. I loved you: and perhaps this flame Has not entirely extinguished in my soul; But let it no longer trouble you; I do not entirely extinguished in my soul; But let it…” |
| GPT-2 medium | “I love you with all my heart, without reserve. I am in love with you now, and have never been. I am in love with you now, and will never be… I love you with all my heart, without reserve…” |
|
|
|
|
|
## Key Observations

- Markov chains: good for local coherence, but they collapse quickly.
- Transformers: sustain global structure and produce more creative continuations.
- Both models show failure modes; the repetition loops above highlight why sampling strategies matter.
- Together, the two demonstrate the leap from statistical modeling → neural networks → generative AI.
|
|
|
|
|
## How to Run

### Markov chains

```python
# Train a character-level 5-gram chain, then generate from a seed prompt
corpus = open("pushkin.txt", encoding="utf-8").read()  # your Pushkin poetry corpus of choice

mc = NGramMarkov(n=5)
mc.train(corpus)
print(mc.generate("I loved you", 200))  # seed text, target length in characters
```
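
The `NGramMarkov` class itself is not shown in this card. A minimal character-level implementation consistent with the `train`/`generate` calls above might look like this (the internals are an assumption, not the repo’s exact code):

```python
import random
from collections import Counter, defaultdict

class NGramMarkov:
    """Character-level n-gram Markov chain (minimal sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)  # n-char context -> next-char counts

    def train(self, text):
        # Count which character follows every n-character window in the corpus
        for i in range(len(text) - self.n):
            self.table[text[i:i + self.n]][text[i + self.n]] += 1

    def generate(self, seed, length):
        # Extend the seed one character at a time, sampling from the counts
        out = seed
        while len(out) < length:
            counts = self.table.get(out[-self.n:])
            if not counts:  # unseen context: stop early
                break
            chars, weights = zip(*counts.items())
            out += random.choices(chars, weights=weights)[0]
        return out
```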
|
|
|
|
|
|
|
|
### GPT-2 (via 🤗 Transformers)

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-medium")

# The pipeline returns a list of dicts; index into it to get the text
result = generator("I loved you", max_length=80, do_sample=True)
print(result[0]["generated_text"])
```
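
The repetition in the sample output above is exactly what sampling controls address. The pipeline forwards generation flags such as `temperature`, `top_p`, and `no_repeat_ngram_size` to `model.generate`; the values below are illustrative:

```python
result = generator(
    "I loved you",
    max_new_tokens=60,       # length of the continuation, in new tokens
    do_sample=True,
    temperature=0.9,         # <1.0 sharpens, >1.0 flattens the distribution
    top_p=0.92,              # nucleus sampling: keep the smallest token set covering 92%
    no_repeat_ngram_size=3,  # forbid verbatim repeats of any 3-gram
)
print(result[0]["generated_text"])
```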
|
|
|
|
|
## ✨ Author

Developed by [Naga Adithya Kaushik (GenAIDevTOProd)], assisted by AI (debugging and text-corpus generation only).

For research, debugging, and teaching purposes.