# Indus Script Models

Five trained models + a NanoGPT generator for the undeciphered Indus Valley Script (2600–1900 BCE).

## What's in this repo

```
models/
  mlm/best/            TinyBERT masked language model
  cls/best/            TinyBERT sequence classifier (valid vs corrupted)
  ngram_model.pkl      N-gram RTL transition model
  electra/best/        ELECTRA token discriminator
  deberta/best/        DeBERTa sequence discriminator
  nanogpt_indus.pt     NanoGPT generator (153K params)
data/
  indus_tokenizer/     Custom tokenizer (641 Indus sign tokens)
  id_to_glyph.json     Sign ID → glyph character mapping
inference.py           Run all tasks (see below)
indus_ngram.py         Required by ngram_model.pkl
```

## How the pipeline works

**Stage 1: Real inscriptions (3,310 sequences).** Five models are trained independently on real Indus Script inscriptions. Each learns a different aspect of the sign grammar:

- TinyBERT MLM → which signs can fill a masked position
- TinyBERT Classifier → valid sequence vs corrupted
- N-gram RTL → right-to-left transition probabilities
- ELECTRA → token-level real vs fake discrimination
- DeBERTa → sequence-level real vs fake discrimination

**Stage 2: Generate + filter.** NanoGPT generates candidate sequences in RTL order. Each candidate is scored by a weighted ensemble: BERT (50%) + N-gram (25%) + ELECTRA (25%). Only sequences with an ensemble score ≥ 85% are kept. Candidates that exactly match a real inscription are set aside as validation evidence.

**Stage 3: Retrain on combined data (3,310 real + 5,000 synthetic = 8,310).** All models are retrained on the combined set: TinyBERT accuracy improves from 78% to 89%, and NanoGPT perplexity drops from 32.5 to 13.3. The final 5,000 sequences are generated with the retrained models.

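
The Stage 2 filter can be sketched in a few lines. This is a minimal illustration of the weighted ensemble and the 85% cutoff described above, not the code in `inference.py`: the function names `ensemble_score` and `filter_candidates` are hypothetical, each model is assumed to return a score in [0, 1], and exact matches are assumed to be separated only among candidates that pass the threshold.

```python
# Weights and threshold as stated in the Stage 2 description.
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}
THRESHOLD = 0.85


def ensemble_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-model scores (each assumed to lie in [0, 1])."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def filter_candidates(candidates, real_sequences):
    """Split passing candidates into novel sequences and exact reproductions.

    candidates: iterable of (sequence_string, per_model_scores) pairs.
    real_sequences: iterable of real inscription strings.
    Hypothetical helper; the separation order is an assumption.
    """
    real = set(real_sequences)
    novel, reproduced = [], []
    for sequence, scores in candidates:
        if ensemble_score(scores) < THRESHOLD:
            continue  # below the 85% cutoff: discard
        (reproduced if sequence in real else novel).append(sequence)
    return novel, reproduced


if __name__ == "__main__":
    # Per-model scores taken from this README's example output.
    example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
    print(round(ensemble_score(example), 4))  # 0.941
```

With these weights the BERT score contributes as much as the other two models combined, so a candidate with a weak BERT score is hard to rescue even when N-gram and ELECTRA both rate it highly.
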
## Quick start

```bash
pip install torch transformers huggingface_hub

# Clone this repo
git clone https://huggingface.co/YOUR_USERNAME/indus-script-models
cd indus-script-models

# Run demo (validates 5 example sequences)
python inference.py --task demo

# Validate a sequence
python inference.py --task validate --sequence "T638 T177 T420 T122"

# Predict a masked sign
python inference.py --task predict --sequence "T638 [MASK] T420 T122"

# Generate 10 new sequences
python inference.py --task generate --count 10

# Score any sequence
python inference.py --task score --sequence "T604 T123 T609"
```

## Example output

```
Loading models...
  ✓ TinyBERT
  ✓ N-gram
  ✓ ELECTRA

Sequence : T638 T177 T420 T122
Glyphs   : 𐦭𐦬𐦰𐦡
BERT     : 0.9650
N-gram   : 0.8930
ELECTRA  : 0.9410
Ensemble : 0.9410
Verdict  : ✅ VALID (≥85%)
```

## Model performance

| Model | Metric | Value |
|---|---|---|
| TinyBERT Classifier | Test accuracy | 89.0% |
| TinyBERT MLM | Val loss | 2.06 |
| N-gram RTL | Pairwise accuracy | 88.2% |
| ELECTRA | Token accuracy | 95.1% |
| DeBERTa | Test accuracy | 87.1% |
| NanoGPT | Perplexity | 13.3 |

## Key findings

- **RTL confirmed**: right-to-left has 12% stronger grammatical structure than LTR
- **Grammar proven**: H1→H2→H3 = 6.03→3.41→2.39 bits (language-like decay)
- **Zipf's law**: R²=0.968 (language-like token distribution)
- **752 seal reproductions**: model independently reproduced real inscriptions
- **Sign roles**: PREFIX (T638, T604), SUFFIX (T123, T122), CORE (T101, T268)

## Dataset

The 5,000 synthetic sequences are available at: [YOUR_USERNAME/indus-script-synthetic](https://huggingface.co/datasets/YOUR_USERNAME/indus-script-synthetic)
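
The H1→H2→H3 figures under Key findings can be checked against the dataset with a small entropy estimator. The sketch below is an assumption about what those figures measure: it treats Hn as the entropy of a sign given the n-1 signs read before it, estimated from raw n-gram counts. This README does not state the exact estimator, so the function name `conditional_entropy` and the toy corpus are illustrative only. Sequences are assumed to be lists of sign tokens already in reading order.

```python
import math
from collections import Counter


def conditional_entropy(sequences, n):
    """Estimate H(sign | previous n-1 signs) in bits from n-gram counts.

    For n = 1 the context is empty, so this reduces to unigram entropy.
    Plug-in estimate with no smoothing; small corpora bias it downward.
    """
    ngrams = Counter(
        tuple(seq[i:i + n])
        for seq in sequences
        for i in range(len(seq) - n + 1)
    )
    # Count how often each (n-1)-sign context starts one of the n-grams.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    return -sum(
        (count / total) * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )


if __name__ == "__main__":
    # Toy corpus made up for illustration; not real inscriptions.
    corpus = [
        "T638 T177 T420 T122".split(),
        "T638 T177 T123".split(),
        "T604 T420 T122".split(),
    ]
    for order in (1, 2, 3):
        print(f"H{order} = {conditional_entropy(corpus, order):.2f} bits")
```

A falling sequence of values means each additional sign of context makes the next sign more predictable, which is the pattern the decay figures describe.
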