Indus Script Models

Five trained models (two TinyBERT checkpoints, an N-gram model, ELECTRA, and DeBERTa) plus a NanoGPT generator for the undeciphered Indus Valley Script (2600–1900 BCE).

What's in this repo

models/
  mlm/best/           TinyBERT masked language model
  cls/best/           TinyBERT sequence classifier (valid vs corrupted)
  ngram_model.pkl     N-gram RTL transition model
  electra/best/       ELECTRA token discriminator
  deberta/best/       DeBERTa sequence discriminator
  nanogpt_indus.pt    NanoGPT generator (153K params)
data/
  indus_tokenizer/    Custom tokenizer (641 Indus sign tokens)
  id_to_glyph.json    Sign ID → glyph character mapping
inference.py          Run all tasks (see below)
indus_ngram.py        Required by ngram_model.pkl
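
A minimal sketch of how the sign-to-glyph mapping might be used to render a sequence. The key format of id_to_glyph.json is not documented here, so the code assumes a flat JSON object whose keys are either the full token (T638) or the bare number (638) and tries both. The names load_glyph_map and signs_to_glyphs, and the toy mapping, are hypothetical and not part of inference.py.

```python
import json
from pathlib import Path


def load_glyph_map(path="data/id_to_glyph.json"):
    """Load the sign-ID -> glyph mapping (assumed to be a flat JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def signs_to_glyphs(sequence, glyph_map, unknown="?"):
    """Render a space-separated sign sequence such as 'T638 T177' as glyphs.

    Assumption: keys look like 'T638' or '638'; both forms are tried.
    """
    glyphs = []
    for token in sequence.split():
        bare = token[1:] if token.startswith("T") else token
        glyphs.append(glyph_map.get(token, glyph_map.get(bare, unknown)))
    return "".join(glyphs)


# Toy mapping for illustration only; the real glyphs live in id_to_glyph.json.
toy_map = {"638": "A", "T177": "B"}
print(signs_to_glyphs("T638 T177 T420", toy_map))  # prints "AB?"
```

With the real file, the call would be signs_to_glyphs("T638 T177 T420 T122", load_glyph_map()).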

How the pipeline works

Stage 1 — Real inscriptions (3,310 sequences): Five models were trained independently on real Indus Script inscriptions. Each learned a different aspect of grammar:

TinyBERT MLM → which signs can fill a masked position
TinyBERT Classifier → valid sequence vs corrupted
N-gram RTL → right-to-left transition probabilities
ELECTRA → token-level real vs fake discrimination
DeBERTa → sequence-level real vs fake discrimination
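
To make the N-gram RTL entry concrete, here is a rough sketch of a right-to-left bigram transition model. It is not the code in indus_ngram.py. It assumes each inscription is stored as a space-separated string in left-to-right display order and is reversed before counting, and it uses add-alpha smoothing, which the repo may or may not do. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_rtl_bigrams(sequences):
    """Count sign-to-sign transitions while reading each sequence right to left."""
    counts = defaultdict(Counter)
    for seq in sequences:
        signs = seq.split()[::-1]  # reverse so the rightmost sign is read first
        for prev, nxt in zip(signs, signs[1:]):
            counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed P(nxt | prev) in RTL reading order (assumed smoothing)."""
    row = counts.get(prev, Counter())
    total = sum(row.values())
    return (row[nxt] + alpha) / (total + alpha * vocab_size)


# Toy data for illustration; not real inscriptions.
toy = ["T638 T177 T122", "T604 T177 T122"]
counts = train_rtl_bigrams(toy)
p = transition_prob(counts, "T122", "T177", vocab_size=641)
```

Here vocab_size=641 is taken from the tokenizer size listed above. Whether that figure includes special tokens is not stated, so treat it as a parameter to adjust.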

Stage 2 — Generate + filter: NanoGPT generates candidate sequences in RTL order. Each candidate is scored by a weighted ensemble: BERT (50%) + N-gram (25%) + ELECTRA (25%). Only sequences with an ensemble score ≥ 85% are kept. Candidates that exactly match a real inscription are set aside as validation evidence.
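
The scoring and filtering step can be sketched as follows. The weights and the 0.85 threshold come from the description above. Everything else is an assumption: the function names are hypothetical, each model is assumed to return a score in [0, 1], and exact matches are assumed to be separated after thresholding (the order is not stated).

```python
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}
THRESHOLD = 0.85


def ensemble_score(bert, ngram, electra):
    """Weighted average of the three per-model scores."""
    return (WEIGHTS["bert"] * bert
            + WEIGHTS["ngram"] * ngram
            + WEIGHTS["electra"] * electra)


def filter_candidates(scored, real_inscriptions):
    """Split candidates that pass the threshold into novel vs. reproduced.

    scored: iterable of (sequence, bert, ngram, electra) tuples.
    Returns (kept, reproduced); candidates below the threshold are dropped.
    """
    real = set(real_inscriptions)
    kept, reproduced = [], []
    for seq, bert, ngram, electra in scored:
        if ensemble_score(bert, ngram, electra) < THRESHOLD:
            continue
        (reproduced if seq in real else kept).append(seq)
    return kept, reproduced


# Scores copied from the example output in this README.
print(f"{ensemble_score(0.9650, 0.8930, 0.9410):.4f}")  # prints 0.9410
```

The printed value matches the Ensemble line in the example output, which suggests the reported ensemble is this plain weighted average.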

Stage 3 — Retrain on combined data (3,310 real + 5,000 synthetic = 8,310 sequences): All models were retrained on the combined set. TinyBERT classifier accuracy rose from 78% to 89%, and NanoGPT perplexity (PPL) fell from 32.5 to 13.3. The final 5,000 sequences were generated with the retrained models.
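
For readers unfamiliar with the PPL figure, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that standard definition. How the numbers above were computed (which data split, whether special tokens count) is not stated, so this is illustrative only and the function name is hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities the model gave each true token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)
```

Read this way, a drop from 32.5 to 13.3 means the retrained generator is about as uncertain as a uniform choice among roughly 13 signs per position, down from roughly 32.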

Quick start

pip install torch transformers huggingface_hub

# Clone this repo
git clone https://huggingface.co/YOUR_USERNAME/indus-script-models
cd indus-script-models

# Run demo (validates 5 example sequences)
python inference.py --task demo

# Validate a sequence
python inference.py --task validate --sequence "T638 T177 T420 T122"

# Predict a masked sign
python inference.py --task predict --sequence "T638 [MASK] T420 T122"

# Generate 10 new sequences
python inference.py --task generate --count 10

# Score any sequence
python inference.py --task score --sequence "T604 T123 T609"
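
To call these commands from Python, for example to score many sequences in a loop, a thin wrapper along these lines may help. It only uses the flags shown above (--task, --sequence, --count). The helper names build_command and run_task are hypothetical, and the format of the script's stdout is not assumed.

```python
import subprocess
import sys


def build_command(task, sequence=None, count=None, script="inference.py"):
    """Build the argv list for one inference.py invocation."""
    cmd = [sys.executable, script, "--task", task]
    if sequence is not None:
        cmd += ["--sequence", sequence]
    if count is not None:
        cmd += ["--count", str(count)]
    return cmd


def run_task(task, **kwargs):
    """Run inference.py from the repo root and return its stdout as text."""
    result = subprocess.run(
        build_command(task, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage, from inside the cloned repo:
#   print(run_task("validate", sequence="T638 T177 T420 T122"))
#   print(run_task("generate", count=10))
```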

Example output

Loading models...
  ✓ TinyBERT
  ✓ N-gram
  ✓ ELECTRA

  Sequence  : T638 T177 T420 T122
  Glyphs    : 𐦭𐦬𐦰𐦡
  BERT      : 0.9650
  N-gram    : 0.8930
  ELECTRA   : 0.9410
  Ensemble  : 0.9410
  Verdict   : ✅ VALID (≥85%)

Model performance

Model	Metric	Value
TinyBERT Classifier	Test accuracy	89.0%
TinyBERT MLM	Val loss	2.06
N-gram RTL	Pairwise accuracy	88.2%
ELECTRA	Token accuracy	95.1%
DeBERTa	Test accuracy	87.1%
NanoGPT	Perplexity	13.3

Key findings

RTL confirmed — right-to-left has 12% stronger grammatical structure than LTR
Grammar proven — H1→H2→H3 = 6.03→3.41→2.39 bits (language-like decay)
Zipf's law — R²=0.968 (language-like token distribution)
752 seal reproductions — model independently reproduced real inscriptions
Sign roles — PREFIX (T638, T604), SUFFIX (T123, T122), CORE (T101, T268)
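
The entropy and Zipf figures above can be reproduced in spirit with the sketch below. The exact definitions used for these findings are not given, so this assumes H1 is the unigram entropy, H2 and H3 are the entropies of a sign given the previous one and two signs, and the Zipf R² comes from a least-squares line fitted to log frequency against log rank. The function names are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(sequences, order):
    """Entropy in bits of a sign given the previous (order - 1) signs."""
    joint, context = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - order + 1):
            gram = tuple(seq[i:i + order])
            joint[gram] += 1
            context[gram[:-1]] += 1
    total = sum(joint.values())
    return -sum(c / total * math.log2(c / context[g[:-1]])
                for g, c in joint.items())


def zipf_r_squared(sequences):
    """R^2 of a straight-line fit to log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(s for seq in sequences for s in seq).values(),
                   reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("need at least two signs with differing frequencies")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Toy data for illustration; each sequence is a list of sign tokens.
toy = [["A", "B", "A", "B"]]
h1 = conditional_entropy(toy, 1)  # two equally frequent signs: 1 bit
h2 = conditional_entropy(toy, 2)  # next sign fully determined: 0 bits
```

On the real corpus, one would pass each inscription as a list of sign tokens (reversed first for an RTL reading) and compare orders 1, 2, and 3.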

Dataset

The 5,000 synthetic sequences are available at: YOUR_USERNAME/indus-script-synthetic