Indo‑Aryan Shared Manifold
This repository contains code, data, and results for the paper:
"Zero‑Shot Reasoning Transfer Across Perso‑Arabic Languages: Tokenization, Procrustes Alignment, and the Bridge Language Hypothesis"
Aakash Meghwar – April 2026
📊 Corrected Results (April 2026)
1. Zero‑shot classification (25 classes, trained on Urdu)
| Language | Accuracy |
|---|---|
| Urdu (in‑language validation) | 70% |
| Punjabi | 55% |
| Sindhi | 21% |
| Saraiki | 53% |
Random baseline for 25 classes = 4%
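
The evaluation protocol behind this table (fit on Urdu only, score on each target language without further training) can be sketched as follows. This is a minimal illustration, not the notebook's code: the nearest-centroid classifier, the toy 2-D vectors, and the function names `fit_centroids`, `predict`, `accuracy`, and `chance_baseline` are all assumptions made for the example.

```python
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Average the source-language vectors of each class (hypothetical classifier)."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def accuracy(centroids, vectors, labels):
    """Fraction of target-language examples classified correctly."""
    correct = sum(predict(centroids, v) == y for v, y in zip(vectors, labels))
    return correct / len(labels)


def chance_baseline(num_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes


# Toy stand-ins for sentence embeddings; real inputs would come from the encoder.
source_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
source_y = ["a", "a", "b", "b"]
target_x = [[0.2, 0.4], [5.5, 5.1], [4.0, 4.0]]
target_y = ["a", "b", "a"]

model = fit_centroids(source_x, source_y)      # trained on the source language only
print(accuracy(model, target_x, target_y))     # zero-shot score on the target
print(chance_baseline(25))                     # baseline for the 25-class setting
```

Comparing each accuracy with `chance_baseline(25)` makes the size of the transfer explicit: even the weakest result in the table is well above random guessing.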
2. Procrustes alignment (Urdu → target)
| Target | Procrustes Distance ↓ | Nearest‑Neighbor Accuracy ↑ |
|---|---|---|
| Punjabi | 0.0432 | 75.4% |
| Sindhi | 0.0717 | 43.4% |
| Saraiki | 0.0498 | 69.5% |
| Hindko | 0.0432 | 75.4% |
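
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them with NumPy. It assumes that row `i` of both matrices holds the embedding of the same vocabulary item, that both matrices are centred and scaled to unit Frobenius norm before alignment, and that the distance is the squared residual after the best orthogonal map. The repository does not state these details, so the function names and the normalisation are assumptions rather than the paper's exact procedure.

```python
import numpy as np


def normalise(matrix):
    """Centre the rows and scale the whole matrix to unit Frobenius norm."""
    centred = matrix - matrix.mean(axis=0)
    return centred / np.linalg.norm(centred)


def orthogonal_procrustes(source, target):
    """Orthogonal R minimising ||source @ R - target||_F, via SVD of source.T @ target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def procrustes_distance(source, target):
    """Squared residual after aligning normalised source onto normalised target."""
    src, tgt = normalise(source), normalise(target)
    rotation = orthogonal_procrustes(src, tgt)
    return float(np.linalg.norm(src @ rotation - tgt) ** 2), rotation


def nearest_neighbour_accuracy(source, target):
    """Share of source rows whose cosine-nearest target row has the same index."""
    src, tgt = normalise(source), normalise(target)
    mapped = src @ orthogonal_procrustes(src, tgt)
    mapped_unit = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt_unit = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (mapped_unit @ tgt_unit.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())


# Synthetic check: the target is an exact rotation of the source.
rng = np.random.default_rng(0)
urdu = rng.normal(size=(20, 5))                  # stand-in for Urdu word vectors
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(5, 5)))
target_lang = urdu @ hidden_rotation             # stand-in for a target language

distance, _ = procrustes_distance(urdu, target_lang)
print(distance, nearest_neighbour_accuracy(urdu, target_lang))
```

On this synthetic pair the distance is numerically zero and every word retrieves its own translation, which is the ideal case the table's values are measured against.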
3. Token fertility (tokens per word)
| Language | SindhiNLTK (our tokenizer) | XLM‑R (baseline) |
|---|---|---|
| Urdu | 1.293 | 1.293 |
| Punjabi | 1.609 | 1.609 |
| Sindhi | 1.000 | 1.475 |
| Saraiki | 1.628 | 1.628 |
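SindhiNLTK reaches a token fertility of 1.000 for Sindhi (one token per word), compared with 1.475 for XLM‑R, which is about 32% fewer tokens for the same text. For Urdu, Punjabi, and Saraiki the two columns are identical.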
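
Token fertility is the average number of tokens produced per whitespace-separated word. A minimal sketch of the calculation is given below. The `tokenize` argument is a hypothetical callable that maps one word to a list of tokens; it stands in for whichever tokenizer is being measured and is not the SindhiNLTK or XLM‑R API. Tokenizing word by word is also an assumption of this example.

```python
def token_fertility(sentences, tokenize):
    """Average number of tokens per whitespace-separated word."""
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        for word in sentence.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words found in the input")
    return total_tokens / total_words


def whole_word(word):
    """Toy tokenizer that never splits a word (fertility 1.0)."""
    return [word]


def three_char_chunks(word):
    """Toy subword tokenizer that cuts a word into pieces of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


corpus = ["alpha beta", "gamma"]
print(token_fertility(corpus, whole_word))
print(token_fertility(corpus, three_char_chunks))
```

A value of 1.0 means no word is split. Higher values mean longer token sequences for the same text, which is why lower fertility is read as more efficient.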
4. Bridge hypothesis (Saraiki as geometric anchor)
- Direct Urdu → Sindhi Procrustes distance: 0.0717
- Two‑step Urdu → Saraiki → Sindhi distance: 0.0717
- ✅ Supports the Shared Manifold Hypothesis – Saraiki acts as a geometric bridge between Urdu and Sindhi.
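
One way to read the direct versus two-step comparison is sketched below. The repository does not say how the two-step distance is computed, so this example assumes it is the residual left after composing two orthogonal Procrustes maps (source → bridge, then bridge → target), with all matrices already centred, normalised, and row-aligned. The function names are hypothetical.

```python
import numpy as np


def fit_rotation(source, target):
    """Orthogonal map minimising ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def residual(source, target, rotation):
    """Squared Frobenius residual after applying a given map."""
    return float(np.linalg.norm(source @ rotation - target) ** 2)


def direct_and_bridged(source, bridge, target):
    """Residuals of the direct map and of the map composed through a bridge."""
    direct = fit_rotation(source, target)
    two_step = fit_rotation(source, bridge) @ fit_rotation(bridge, target)
    return residual(source, target, direct), residual(source, target, two_step)


# Synthetic stand-ins for three embedding spaces over the same vocabulary.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 8))
bridge = source + 0.05 * rng.normal(size=(50, 8))
target = bridge + 0.05 * rng.normal(size=(50, 8))

direct_res, bridged_res = direct_and_bridged(source, bridge, target)
print(direct_res, bridged_res)
```

Under this definition the composed map is itself orthogonal, so its residual can match the direct residual but never fall below it. Equal values, as reported above, are therefore the best case for a bridge language.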
📁 Repository Files
| File | Description |
|---|---|
| `zero_shot_test.ipynb` | Complete notebook for zero‑shot reasoning transfer (25 classes). |
| `procrustes_results.csv` | Procrustes distances and nearest‑neighbour accuracies (800‑word vocabulary). |
| `fertility_results.csv` | Token fertility comparison (SindhiNLTK vs XLM‑R). |
| `zero_shot_accuracy.csv` | Classification accuracies for Urdu, Punjabi, Sindhi, Saraiki. |
| `*.png` | Publication‑ready figures (300 DPI). |
📖 Citation
If you use this data or code in your research, please cite:
@misc{meghwar2026indoaaryanshared,
author = {Meghwar, Aakash},
title = {Indo‑Aryan Shared Manifold: Zero‑Shot Reasoning Transfer and Procrustes Alignment},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/aakashMeghwar01/IndoAryan-Shared-Manifold}
}