# Indus Script Models

Four trained model families (TinyBERT, N-gram, ELECTRA, DeBERTa) plus a NanoGPT generator for the undeciphered Indus Valley Script (2600–1900 BCE).

## What's in this repo

```
models/
  mlm/best/           TinyBERT masked language model
  cls/best/           TinyBERT sequence classifier (valid vs corrupted)
  ngram_model.pkl     N-gram RTL transition model
  electra/best/       ELECTRA token discriminator
  deberta/best/       DeBERTa sequence discriminator
  nanogpt_indus.pt    NanoGPT generator (153K params)
data/
  indus_tokenizer/    Custom tokenizer (641 Indus sign tokens)
  id_to_glyph.json    Sign ID → glyph character mapping
inference.py          Run all tasks (see below)
indus_ngram.py        Required by ngram_model.pkl
```

## How the pipeline works

**Stage 1 - Real inscriptions (3,310 sequences):**
Four model families (five checkpoints, since TinyBERT is trained twice) were trained independently on real Indus Script inscriptions.
Each learned a different aspect of the grammar:
- TinyBERT MLM → which signs can fill a masked position
- TinyBERT Classifier → valid sequence vs corrupted
- N-gram RTL → right-to-left transition probabilities
- ELECTRA → token-level real vs fake discrimination
- DeBERTa → sequence-level real vs fake discrimination
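
To make "right-to-left transition probabilities" concrete, here is a minimal sketch of a bigram model that reads each inscription from its last sign to its first. It is not the code in `indus_ngram.py`, whose internals are not shown here. The function names, the add-alpha smoothing, and the assumption that sequences are stored left to right as space-separated sign IDs are all illustrative.

```python
from collections import Counter, defaultdict


def train_rtl_bigrams(sequences):
    """Count sign-to-sign transitions while reading each sequence right to left.

    Assumption: each sequence is a space-separated string of sign IDs stored
    in left-to-right order, so it is reversed before counting.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        signs = seq.split()[::-1]  # index 0 is now the rightmost sign
        for prev, nxt in zip(signs, signs[1:]):
            counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed P(nxt | prev) in reading (RTL) order."""
    total = sum(counts[prev].values())
    return (counts[prev][nxt] + alpha) / (total + alpha * vocab_size)


if __name__ == "__main__":
    counts = train_rtl_bigrams(["T638 T177 T420 T122", "T604 T420 T122"])
    # Reading RTL, T122 is followed by T420 in both toy sequences.
    print(transition_prob(counts, "T122", "T420", vocab_size=641))
```

The vocabulary size of 641 comes from the tokenizer description above; the two toy sequences are made up for the demonstration and are not claimed to be real inscriptions.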

**Stage 2 - Generate + filter:**
NanoGPT generates candidates in RTL order.
Each candidate is scored by a weighted ensemble: BERT (50%) + N-gram (25%) + ELECTRA (25%).
Only sequences with an ensemble score ≥85% are kept.
Exact matches to real inscriptions are set aside as validation evidence.
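
The weighting and threshold described in Stage 2 can be sketched as follows. This is an illustration, not the code in `inference.py`: the function names are hypothetical, each per-model score is assumed to lie in [0, 1], and exact matches are assumed to be separated after thresholding (the text does not state the order).

```python
# Weights and threshold come from the Stage 2 description; names are hypothetical.
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}
THRESHOLD = 0.85


def ensemble_score(bert: float, ngram: float, electra: float) -> float:
    """Weighted average of the three per-model scores."""
    return (WEIGHTS["bert"] * bert
            + WEIGHTS["ngram"] * ngram
            + WEIGHTS["electra"] * electra)


def filter_candidates(candidates, real_inscriptions):
    """Split candidates into kept synthetic sequences and exact matches.

    candidates: iterable of (sequence, bert, ngram, electra) tuples.
    real_inscriptions: set of known real sequences, as strings.
    Assumption: matches are separated only among candidates that pass
    the threshold.
    """
    kept, matches = [], []
    for seq, bert, ngram, electra in candidates:
        score = ensemble_score(bert, ngram, electra)
        if score < THRESHOLD:
            continue  # below the 85% ensemble cut-off
        target = matches if seq in real_inscriptions else kept
        target.append((seq, score))
    return kept, matches


if __name__ == "__main__":
    # Scores taken from the example output in this README.
    print(round(ensemble_score(0.9650, 0.8930, 0.9410), 4))
```

With the scores from the example output (0.9650, 0.8930, 0.9410), the weighted sum is 0.4825 + 0.22325 + 0.23525 = 0.9410, which agrees with the ensemble value reported there.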

**Stage 3 - Retrain on combined data (3,310 real + 5,000 synthetic = 8,310):**
All models were retrained: TinyBERT accuracy rose from 78% to 89%, and NanoGPT perplexity (PPL) fell from 32.5 to 13.3.
The final 5,000 sequences were generated with the retrained models.

## Quick start

```bash
pip install torch transformers huggingface_hub

# Clone this repo
git clone https://huggingface.co/YOUR_USERNAME/indus-script-models
cd indus-script-models

# Run demo (validates 5 example sequences)
python inference.py --task demo

# Validate a sequence
python inference.py --task validate --sequence "T638 T177 T420 T122"

# Predict a masked sign
python inference.py --task predict --sequence "T638 [MASK] T420 T122"

# Generate 10 new sequences
python inference.py --task generate --count 10

# Score any sequence
python inference.py --task score --sequence "T604 T123 T609"
```

## Example output

```
Loading models...
  ✓ TinyBERT
  ✓ N-gram
  ✓ ELECTRA

  Sequence  : T638 T177 T420 T122
  Glyphs    : 𐦭𐦬𐦰𐦑
  BERT      : 0.9650
  N-gram    : 0.8930
  ELECTRA   : 0.9410
  Ensemble  : 0.9410
  Verdict   : ✅ VALID (≥85%)
```

## Model performance

| Model | Metric | Value |
|---|---|---|
| TinyBERT Classifier | Test accuracy | 89.0% |
| TinyBERT MLM | Val loss | 2.06 |
| N-gram RTL | Pairwise accuracy | 88.2% |
| ELECTRA | Token accuracy | 95.1% |
| DeBERTa | Test accuracy | 87.1% |
| NanoGPT | Perplexity | 13.3 |

## Key findings

- **RTL confirmed**: right-to-left (RTL) has 12% stronger grammatical structure than left-to-right (LTR)
- **Grammar proven**: H1→H2→H3 = 6.03→3.41→2.39 bits (language-like decay)
- **Zipf's law**: R²=0.968 (language-like token distribution)
- **752 seal reproductions**: model independently reproduced real inscriptions
- **Sign roles**: PREFIX (T638, T604), SUFFIX (T123, T122), CORE (T101, T268)
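
The entropy figures above are not defined in this README. One common reading, assumed here, is that H1 is the unigram entropy and Hn is the conditional entropy of a sign given its n-1 predecessors, all in bits. The sketch below computes that quantity on toy data. The function name is hypothetical and this is not claimed to be the method used to obtain the reported values.

```python
import math
from collections import Counter


def conditional_entropy(sequences, n):
    """Entropy (bits) of a sign given the n-1 signs before it.

    sequences: list of lists of sign IDs, already in reading order.
    For n == 1 the context is empty, so this is the plain unigram entropy.
    """
    ngrams = Counter(
        tuple(seq[i:i + n])
        for seq in sequences
        for i in range(len(seq) - n + 1)
    )
    # Count how often each (n-1)-sign context starts one of the n-grams.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    # H = -sum over n-grams of p(context, sign) * log2 p(sign | context)
    return -sum(
        count / total * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )


if __name__ == "__main__":
    toy = [["A", "B"] * 4]  # a perfectly alternating toy sequence
    for n in (1, 2, 3):
        print(n, conditional_entropy(toy, n))
```

On the alternating toy sequence the unigram entropy is 1 bit and the conditional entropies are 0, because each sign is fully determined by its predecessor. To compare reading directions, the same function could be run on each sequence and on its reversal.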

## Dataset

The 5,000 synthetic sequences are available at:
[YOUR_USERNAME/indus-script-synthetic](https://huggingface.co/datasets/YOUR_USERNAME/indus-script-synthetic)