Indo‑Aryan Shared Manifold

This repository contains code, data, and results for the paper:

"Zero‑Shot Reasoning Transfer Across Perso‑Arabic Languages: Tokenization, Procrustes Alignment, and the Bridge Language Hypothesis"

Aakash Meghwar – April 2026

📊 Corrected Results (April 2026)

1. Zero‑shot classification (25 classes, trained on Urdu)

| Language | Accuracy |
|---|---|
| Urdu (in‑language validation) | 70% |
| Punjabi | 55% |
| Sindhi | 21% |
| Saraiki | 53% |

Chance‑level accuracy for 25 classes is 4% (1/25).
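To put the accuracies above in context, the following minimal sketch expresses each one as a multiple of chance and as a fraction of the Urdu in‑language score. The function names `chance_baseline` and `transfer_summary` are hypothetical helpers introduced for illustration and are not part of the repository; the accuracy values are copied from the table above.

```python
def chance_baseline(n_classes):
    """Expected accuracy of uniform random guessing over n_classes labels."""
    return 1.0 / n_classes


def transfer_summary(accuracies, source, n_classes):
    """Relate each accuracy to chance level and to the source-language score."""
    chance = chance_baseline(n_classes)
    source_acc = accuracies[source]
    return {
        lang: {
            "lift_over_chance": acc / chance,
            "retained_vs_source": acc / source_acc,
        }
        for lang, acc in accuracies.items()
    }


# Accuracies as reported in the zero-shot table (fractions, not percentages).
reported = {"Urdu": 0.70, "Punjabi": 0.55, "Sindhi": 0.21, "Saraiki": 0.53}
summary = transfer_summary(reported, source="Urdu", n_classes=25)

for lang, row in summary.items():
    print(
        f"{lang}: {row['lift_over_chance']:.2f}x chance, "
        f"{row['retained_vs_source']:.0%} of the Urdu score"
    )
```

Reporting both ratios separates two questions: whether the classifier transfers at all (lift over chance) and how much of the in‑language performance survives the transfer.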

2. Procrustes alignment (Urdu → target)

| Target | Procrustes Distance ↓ | Nearest‑Neighbor Accuracy ↑ |
|---|---|---|
| Punjabi | 0.0432 | 75.4% |
| Sindhi | 0.0717 | 43.4% |
| Saraiki | 0.0498 | 69.5% |
| Hindko | 0.0432 | 75.4% |
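The two columns above can be illustrated with a minimal NumPy sketch of orthogonal Procrustes alignment. It assumes that row `i` of the source and target embedding matrices corresponds to the same word, that both matrices are centred and scaled to unit Frobenius norm before fitting, and that the distance is the squared residual after alignment. These normalisation choices and the function names are assumptions made for illustration; the exact definition used in this repository may differ.

```python
import numpy as np


def normalise(E):
    """Centre the rows of E and scale the matrix to unit Frobenius norm."""
    centred = E - E.mean(axis=0)
    return centred / np.linalg.norm(centred)


def procrustes_align(source, target):
    """Fit the orthogonal map R minimising ||S @ R - T||_F.

    Returns (R, distance, mapped_source, normalised_target).
    """
    S, T = normalise(source), normalise(target)
    # Closed-form solution: R = U @ Vt where U, _, Vt = svd(S.T @ T).
    U, _, Vt = np.linalg.svd(S.T @ T)
    R = U @ Vt
    mapped = S @ R
    distance = float(np.sum((mapped - T) ** 2))
    return R, distance, mapped, T


def nearest_neighbor_accuracy(mapped, target):
    """Fraction of mapped source rows whose cosine-nearest target row is their pair."""
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = target / np.linalg.norm(target, axis=1, keepdims=True)
    nearest = np.argmax(a @ b.T, axis=1)
    return float(np.mean(nearest == np.arange(len(a))))


# Synthetic check: the target is an exact rotation of the source embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix
Y = X @ Q

R, distance, mapped, T = procrustes_align(X, Y)
accuracy = nearest_neighbor_accuracy(mapped, T)
print(f"distance={distance:.3e}, nearest-neighbour accuracy={accuracy:.3f}")
```

On real embeddings the residual is non‑zero, and a lower distance generally goes with a higher nearest‑neighbour accuracy, which is the pattern the table shows.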

3. Token fertility (tokens per word)

| Language | SindhiNLTK (our tokenizer) | XLM‑R (baseline) |
|---|---|---|
| Urdu | 1.293 | 1.293 |
| Punjabi | 1.609 | 1.609 |
| Sindhi | 1.000 | 1.475 |
| Saraiki | 1.628 | 1.628 |

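For Sindhi, SindhiNLTK reaches a fertility of 1.000 tokens per word against 1.475 for XLM‑R, roughly 32% fewer tokens for the same text. For Urdu, Punjabi, and Saraiki the two tokenizers give identical fertility.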
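Token fertility as reported here is tokens per word. A minimal sketch of that calculation follows, assuming words are whitespace‑delimited and that a tokenizer is any callable mapping a word to a list of tokens. The two toy tokenizers are hypothetical stand‑ins and do not reproduce SindhiNLTK or XLM‑R.

```python
def token_fertility(sentences, tokenize):
    """Average number of tokens produced per whitespace-delimited word."""
    n_words = 0
    n_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        n_words += len(words)
        n_tokens += sum(len(tokenize(word)) for word in words)
    if n_words == 0:
        raise ValueError("cannot compute fertility on an empty corpus")
    return n_tokens / n_words


def whole_word_tokenizer(word):
    """Toy tokenizer that never splits a word (fertility is exactly 1)."""
    return [word]


def chunk_tokenizer(word, size=4):
    """Toy subword tokenizer that cuts a word into fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


corpus = ["abcdef gh", "ijkl mnopqrstu"]
print(token_fertility(corpus, whole_word_tokenizer))
print(token_fertility(corpus, chunk_tokenizer))
```

A fertility of 1.0 means every word maps to a single token, so sequence length equals word count; higher values mean longer input sequences for the same text.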

4. Bridge hypothesis (Saraiki as geometric anchor)

Direct Urdu → Sindhi Procrustes distance: 0.0717
Two‑step Urdu → Saraiki → Sindhi distance: 0.0717
✅ Routing through Saraiki adds no distance over the direct mapping, which supports the Shared Manifold Hypothesis: Saraiki acts as a geometric bridge between Urdu and Sindhi.
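One way to compare the direct and two‑step routes is to compose the two fitted orthogonal maps and measure the residual against the final target. The sketch below assumes the three embedding matrices share a row‑aligned vocabulary and uses the same centring and unit‑norm scaling for all of them; the function names are hypothetical, and how the repository computes the two‑step distance is not stated here.

```python
import numpy as np


def normalise(E):
    """Centre the rows of E and scale the matrix to unit Frobenius norm."""
    centred = E - E.mean(axis=0)
    return centred / np.linalg.norm(centred)


def fit_rotation(A, B):
    """Orthogonal matrix R minimising ||A @ R - B||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


def bridge_vs_direct(source, bridge, target):
    """Return (direct_distance, two_step_distance) between source and target."""
    S, B, T = normalise(source), normalise(bridge), normalise(target)
    R_direct = fit_rotation(S, T)
    # Compose source -> bridge and bridge -> target into one orthogonal map.
    R_two_step = fit_rotation(S, B) @ fit_rotation(B, T)
    direct = float(np.sum((S @ R_direct - T) ** 2))
    two_step = float(np.sum((S @ R_two_step - T) ** 2))
    return direct, two_step


# Synthetic example: bridge and target are noisy rotations of the source.
rng = np.random.default_rng(1)
src = rng.normal(size=(40, 6))
Q1, _ = np.linalg.qr(rng.normal(size=(6, 6)))
Q2, _ = np.linalg.qr(rng.normal(size=(6, 6)))
brg = src @ Q1 + 0.1 * rng.normal(size=(40, 6))
tgt = src @ Q2 + 0.1 * rng.normal(size=(40, 6))

direct, two_step = bridge_vs_direct(src, brg, tgt)
print(f"direct={direct:.4f}, two-step={two_step:.4f}")
```

Because a product of orthogonal matrices is itself orthogonal, the composed map is one candidate for the direct problem, so under this definition the two‑step distance can never fall below the direct one. Equality therefore means the detour through the bridge language costs nothing.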

📁 Repository Files

| File | Description |
|---|---|
| `zero_shot_test.ipynb` | Complete notebook for zero‑shot reasoning transfer (25 classes). |
| `procrustes_results.csv` | Procrustes distances and nearest‑neighbour accuracies (800‑word vocabulary). |
| `fertility_results.csv` | Token fertility comparison (SindhiNLTK vs XLM‑R). |
| `zero_shot_accuracy.csv` | Classification accuracies for Urdu, Punjabi, Sindhi, Saraiki. |
| `*.png` | Publication‑ready figures (300 DPI). |

📖 Citation

If you use this data or code in your research, please cite:

@misc{meghwar2026indoaaryanshared,
  author = {Meghwar, Aakash},
  title = {Indo‑Aryan Shared Manifold: Zero‑Shot Reasoning Transfer and Procrustes Alignment},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/aakashMeghwar01/IndoAryan-Shared-Manifold}
}
