	# Indus Script Models

	Five trained models plus a NanoGPT generator for the undeciphered Indus Valley script (2600–1900 BCE).

	## What's in this repo

	```
	models/
	  mlm/best/            TinyBERT masked language model
	  cls/best/            TinyBERT sequence classifier (valid vs corrupted)
	  ngram_model.pkl      N-gram RTL transition model
	  electra/best/        ELECTRA token discriminator
	  deberta/best/        DeBERTa sequence discriminator
	  nanogpt_indus.pt     NanoGPT generator (153K params)
	data/
	  indus_tokenizer/     Custom tokenizer (641 Indus sign tokens)
	  id_to_glyph.json     Sign ID → glyph character mapping
	inference.py           Run all tasks (see below)
	indus_ngram.py         Required by ngram_model.pkl
	```
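As a minimal sketch of how `id_to_glyph.json` might be used, the snippet below turns a space-separated sign sequence into a glyph string. It assumes the file is a flat JSON object keyed by sign IDs such as `"T638"`. The README does not confirm that layout, and the helper names and toy mapping are hypothetical.

```python
import json


def load_glyph_map(path):
    """Load a sign-ID -> glyph mapping (assumed: flat JSON object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def sequence_to_glyphs(sequence, id_to_glyph, unknown="?"):
    """Map each space-separated sign ID to its glyph; unknown IDs become `unknown`."""
    return "".join(id_to_glyph.get(token, unknown) for token in sequence.split())


# Toy mapping for illustration only; the real file maps IDs to glyph characters.
toy_map = {"T638": "A", "T177": "B"}
glyphs = sequence_to_glyphs("T638 T177 T999", toy_map)
```

With the real file, `load_glyph_map("data/id_to_glyph.json")` would replace `toy_map`, assuming the file sits under `data/`.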

	## How the pipeline works

	Stage 1 — Real inscriptions (3,310 sequences):
	Five models were trained independently on the real Indus script inscriptions.
	Each learned a different aspect of the grammar:
	- TinyBERT MLM → which signs can fill a masked position
	- TinyBERT Classifier → valid sequence vs corrupted
	- N-gram RTL → right-to-left transition probabilities
	- ELECTRA → token-level real vs fake discrimination
	- DeBERTa → sequence-level real vs fake discrimination
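To make the right-to-left transition idea concrete, here is a small sketch of a bigram model counted over reversed sequences. It is not the code in `indus_ngram.py`. It assumes sequences are stored left to right as lists of sign IDs, and the add-alpha smoothing and function names are illustrative choices.

```python
from collections import Counter, defaultdict


def train_rtl_bigrams(sequences):
    """Count sign-to-sign transitions while reading each sequence right to left."""
    counts = defaultdict(Counter)
    for seq in sequences:
        rtl = list(reversed(seq))  # assumed: input is stored left to right
        for prev, nxt in zip(rtl, rtl[1:]):
            counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(nxt | prev) in right-to-left order."""
    row = counts.get(prev, Counter())
    total = sum(row.values())
    return (row[nxt] + alpha) / (total + alpha * vocab_size)


toy = [["A", "B", "C"], ["A", "B", "D"]]
model = train_rtl_bigrams(toy)
p = transition_prob(model, "B", "A", vocab_size=4)
```

In the toy data, reading right to left, `B` is always followed by `A`, so the smoothed estimate is (2 + 1) / (2 + 4).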

	Stage 2 — Generate + filter:
	NanoGPT generates candidate sequences in RTL order.
	Each candidate is scored by a weighted ensemble: BERT (50%) + N-gram (25%) + ELECTRA (25%).
	Only candidates with an ensemble score ≥ 85% are kept.
	Candidates that exactly match a real inscription are set aside as validation evidence.
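The scoring and filtering step can be sketched as a weighted average followed by a threshold check. The weights and the 0.85 cutoff come from the description above. The function names are hypothetical, and each model is assumed to return a score between 0 and 1.

```python
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}
THRESHOLD = 0.85


def ensemble_score(scores, weights=WEIGHTS):
    """Weighted average of per-model scores (each assumed to lie in [0, 1])."""
    return sum(weights[name] * scores[name] for name in weights)


def is_valid(scores, threshold=THRESHOLD):
    """Keep a candidate only if its ensemble score reaches the threshold."""
    return ensemble_score(scores) >= threshold


def split_exact_matches(candidates, real_inscriptions):
    """Separate candidates that reproduce a real inscription from novel ones."""
    real = set(real_inscriptions)
    matches = [c for c in candidates if c in real]
    novel = [c for c in candidates if c not in real]
    return novel, matches


# Scores taken from the "Example output" section of this README.
example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
score = ensemble_score(example)
```

With those three scores the weighted sum is 0.4825 + 0.22325 + 0.23525 = 0.9410, which matches the ensemble value shown in the example output.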

	Stage 3 — Retrain on combined data (3,310 real + 5,000 synthetic = 8,310):
	All models are retrained on the combined set: TinyBERT classifier accuracy rises from 78% to 89%, and NanoGPT perplexity (PPL) falls from 32.5 to 13.3.
	The final 5,000 sequences are generated with the retrained models.
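For readers unfamiliar with the perplexity figures, the sketch below shows the usual definition: the exponential of the mean negative log-probability per token. How this repo computes PPL is not stated, so treat this as the standard formula rather than the author's exact evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_quarter = perplexity([math.log(0.25)] * 8)

# Baseline for comparison: a uniform guess over the 641-token vocabulary.
uniform_vocab = perplexity([math.log(1 / 641)] * 8)
```

Under this definition a uniform guess over the 641 sign tokens gives a perplexity of 641, which puts the reported 32.5 and 13.3 in context.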

	## Quick start

	```bash
	pip install torch transformers huggingface_hub

	# Clone this repo
	git clone https://huggingface.co/YOUR_USERNAME/indus-script-models
	cd indus-script-models

	# Run demo (validates 5 example sequences)
	python inference.py --task demo

	# Validate a sequence
	python inference.py --task validate --sequence "T638 T177 T420 T122"

	# Predict a masked sign
	python inference.py --task predict --sequence "T638 [MASK] T420 T122"

	# Generate 10 new sequences
	python inference.py --task generate --count 10

	# Score any sequence
	python inference.py --task score --sequence "T604 T123 T609"
	```
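To script several calls from Python, a thin wrapper around the CLI is one option. This sketch uses only the flags shown above (`--task`, `--sequence`, `--count`). The helper names are hypothetical, and `run_task` assumes it is called from the repo root.

```python
import subprocess
import sys


def build_command(task, sequence=None, count=None, script="inference.py"):
    """Build the argument list for one inference.py invocation."""
    cmd = [sys.executable, script, "--task", task]
    if sequence is not None:
        cmd += ["--sequence", sequence]
    if count is not None:
        cmd += ["--count", str(count)]
    return cmd


def run_task(task, **kwargs):
    """Run inference.py and return its stdout; raises if the script fails."""
    result = subprocess.run(
        build_command(task, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_task("validate", sequence="T638 T177 T420 T122")` would return the printed report as a string.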

	## Example output

	```
	Loading models...
	✓ TinyBERT
	✓ N-gram
	✓ ELECTRA

	Sequence : T638 T177 T420 T122
	Glyphs : 𐦭𐦬𐦰𐦡
	BERT : 0.9650
	N-gram : 0.8930
	ELECTRA : 0.9410
	Ensemble : 0.9410
	Verdict : ✅ VALID (≥85%)
	```

	## Model performance

	| Model | Metric | Value |
	|---|---|---|
	| TinyBERT Classifier | Test accuracy | 89.0% |
	| TinyBERT MLM | Val loss | 2.06 |
	| N-gram RTL | Pairwise accuracy | 88.2% |
	| ELECTRA | Token accuracy | 95.1% |
	| DeBERTa | Test accuracy | 87.1% |
	| NanoGPT | Perplexity | 13.3 |

	## Key findings

	- RTL confirmed: right-to-left reading shows 12% stronger grammatical structure than left-to-right (LTR)
	- Grammar proven: H1→H2→H3 = 6.03→3.41→2.39 bits (entropy falls as context grows, a language-like decay)
	- Zipf's law: R² = 0.968 (language-like token distribution)
	- 752 seal reproductions: the generator independently produced sequences that exactly match real inscriptions
	- Sign roles: PREFIX (T638, T604), SUFFIX (T123, T122), CORE (T101, T268)
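The entropy and Zipf figures above can be approximated with plug-in estimates like the ones below. This is a sketch under assumptions, not the analysis that produced the reported numbers: it reads H1, H2, H3 as the entropy of a sign given 0, 1 and 2 preceding signs, and it fits Zipf's law by ordinary least squares on log rank versus log frequency.

```python
import math
from collections import Counter


def conditional_entropy(sequences, n):
    """Entropy in bits of a sign given its n-1 predecessors (n=1: unigram entropy)."""
    ngrams = Counter(
        tuple(seq[i:i + n]) for seq in sequences for i in range(len(seq) - n + 1)
    )
    contexts = Counter()
    for gram, c in ngrams.items():
        contexts[gram[:-1]] += c  # marginal count of the (n-1)-sign context
    total = sum(ngrams.values())
    return -sum(
        c / total * math.log2(c / contexts[gram[:-1]]) for gram, c in ngrams.items()
    )


def zipf_r_squared(sequences):
    """R^2 of a straight-line fit to log(frequency) vs log(rank)."""
    freqs = sorted(Counter(t for s in sequences for t in s).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)  # squared Pearson correlation


alternating = [["A", "B"] * 4]
h1 = conditional_entropy(alternating, 1)
h2 = conditional_entropy(alternating, 2)
zipfian = [["a"] * 120, ["b"] * 60, ["c"] * 40, ["d"] * 30, ["e"] * 24]
r2 = zipf_r_squared(zipfian)
```

In the alternating toy sequence each sign fully determines the next, so the entropy drops from 1 bit to 0 once one sign of context is known. To compare reading directions, the same function can be run on reversed sequences.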

	## Dataset

	The 5,000 synthetic sequences are available at:
	[YOUR_USERNAME/indus-script-synthetic](https://huggingface.co/datasets/YOUR_USERNAME/indus-script-synthetic)