---
title: Markov Chain Language Model
emoji: 🔗
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: true
license: mit
models:
  - OpenTransformer/markov-5gram-500m
short_description: N-gram LM with Kneser-Ney smoothing
---

# Markov Chain Language Model

Interactive demo of a classical n-gram language model with Modified Kneser-Ney smoothing.

No neural network: this is pure statistical language modelling using n-gram counts and interpolated backoff.

## Architecture

- **Model:** 5-gram with Modified Kneser-Ney smoothing
- **Training data:** 500M tokens from web-crawl datasets
- **Storage:** GPU hash tables (sorted int64 keys + `torch.searchsorted`)
- **Inference:** batch-parallel probability computation via binary search
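The storage scheme can be sketched as follows: each n-gram is packed into a single integer key, the keys are kept sorted, and a count is retrieved by binary search. The vocabulary size, packing scheme, and toy counts below are illustrative assumptions, not the model's actual layout; the Space does the equivalent lookup batched on GPU with `torch.searchsorted`, while this CPU sketch uses the stdlib `bisect` module:

```python
import bisect

VOCAB = 50_000  # assumed vocabulary size (illustrative)

def pack(ngram):
    """Pack a tuple of token ids into a single integer key (base-VOCAB encoding)."""
    key = 0
    for tok in ngram:
        key = key * VOCAB + tok
    return key

# Toy table: counts for three trigrams, stored as parallel sorted arrays.
table = sorted([(pack((1, 2, 3)), 10), (pack((1, 2, 4)), 5), (pack((7, 8, 9)), 2)])
keys = [k for k, _ in table]
counts = [c for _, c in table]

def lookup(ngram):
    """Return the stored count for an n-gram, or 0 if it was never seen."""
    key = pack(ngram)
    idx = bisect.bisect_left(keys, key)  # binary search over sorted keys
    if idx < len(keys) and keys[idx] == key:
        return counts[idx]
    return 0
```

With sorted keys, a batch of lookups is just a vectorised binary search, which is why the table can live on GPU as plain int64 tensors rather than a pointer-based hash map.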

## How It Works

1. **Count:** track how often every sequence of 1–5 tokens appears in the training data
2. **Smooth:** apply Modified Kneser-Ney smoothing to handle unseen n-grams
3. **Predict:** for a given context, compute P(next_token | context) across all orders
4. **Sample:** draw from the smoothed distribution with temperature/top-k/top-p
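Steps 1–3 can be illustrated with interpolated Kneser-Ney for bigrams. This is a simplified sketch, not the model's implementation: Modified Kneser-Ney uses three count-dependent discounts and interpolates across all five orders, whereas this toy version uses a single discount and two orders:

```python
from collections import Counter

def kneser_ney_bigram(tokens, discount=0.75):
    """Simplified interpolated Kneser-Ney over bigrams (single discount)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])            # c(u): how often u appears as a context
    follow = Counter(u for (u, _) in bigrams)  # N1+(u .): distinct words following u
    precede = Counter(w for (_, w) in bigrams) # N1+(. w): distinct words preceding w
    total_types = len(bigrams)                 # N1+(. .): total distinct bigram types

    def prob(u, w):
        cont = precede[w] / total_types        # continuation probability P_cont(w)
        if contexts[u] == 0:
            return cont                        # unseen context: back off entirely
        discounted = max(bigrams[(u, w)] - discount, 0) / contexts[u]
        lam = discount * follow[u] / contexts[u]  # mass freed up by discounting
        return discounted + lam * cont
    return prob
```

The freed-up probability mass `lam` is redistributed according to how many distinct contexts a word appears in, rather than its raw frequency, which is what lets Kneser-Ney handle unseen n-grams gracefully.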

## Performance

| Metric                | Value   |
|-----------------------|---------|
| Perplexity (Pile)     | 46,047  |
| Top-1 accuracy (Pile) | 15.14%  |
| N-gram entries        | 61.6M   |
| Memory                | 1.83 GB |
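For reference, perplexity is the exponential of the average negative log-likelihood per token, so the reported 46,047 corresponds to roughly 10.7 nats of cross-entropy per token. A minimal computation:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the mean negative log-likelihood (log_probs in nats)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model assigning probability 1/4 to every token has perplexity 4.
```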

## Links

Built by OpenTransformers Ltd. Part of AGILLM research.