Indo‑Aryan Shared Manifold

This repository contains code, data, and results for the paper:

"Zero‑Shot Reasoning Transfer Across Perso‑Arabic Languages: Tokenization, Procrustes Alignment, and the Bridge Language Hypothesis"

Aakash Meghwar – April 2026


📊 Corrected Results (April 2026)

1. Zero‑shot classification (25 classes, trained on Urdu)

| Language | Accuracy |
|---|---|
| Urdu (in‑language validation) | 70% |
| Punjabi | 55% |
| Sindhi | 21% |
| Saraiki | 53% |

Random baseline for 25 classes = 4%
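To put the accuracies next to the chance level, the short sketch below recomputes the 4% baseline and expresses each reported accuracy as a multiple of chance. The variable names are illustrative and are not taken from the notebook; the numbers are copied from the results above.

```python
# Chance level for a balanced 25-way classifier.
NUM_CLASSES = 25
chance = 1 / NUM_CLASSES

# Accuracies as reported in the results above (fractions, not percentages).
accuracy = {"Urdu": 0.70, "Punjabi": 0.55, "Sindhi": 0.21, "Saraiki": 0.53}

# Lift over chance: how many times better than random guessing.
lift = {lang: acc / chance for lang, acc in accuracy.items()}

for lang, value in lift.items():
    print(f"{lang}: {value:.2f}x chance")
```

Even the weakest transfer result (Sindhi, 21%) sits at roughly five times the random baseline.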

2. Procrustes alignment (Urdu → target)

| Target | Procrustes Distance ↓ | Nearest‑Neighbor Accuracy ↑ |
|---|---|---|
| Punjabi | 0.0432 | 75.4% |
| Sindhi | 0.0717 | 43.4% |
| Saraiki | 0.0498 | 69.5% |
| Hindko | 0.0432 | 75.4% |
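For readers unfamiliar with the two metrics, here is a minimal sketch of orthogonal Procrustes alignment and nearest‑neighbor retrieval in NumPy. It is not the code from the notebook: the function names are hypothetical, the distance normalisation (relative squared residual) is an assumption, and the data are synthetic. The 800 rows only mirror the vocabulary size mentioned for `procrustes_results.csv`.

```python
import numpy as np


def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||X @ W - Y||_F; row i of X pairs with row i of Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def procrustes_distance(X: np.ndarray, Y: np.ndarray, W: np.ndarray) -> float:
    """Relative squared residual after alignment (an assumed normalisation)."""
    return float(np.linalg.norm(X @ W - Y) ** 2 / np.linalg.norm(Y) ** 2)


def nn_accuracy(X_mapped: np.ndarray, Y: np.ndarray) -> float:
    """Share of mapped source rows whose cosine-nearest target row is their own pair."""
    A = X_mapped / np.linalg.norm(X_mapped, axis=1, keepdims=True)
    B = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    nearest = (A @ B.T).argmax(axis=1)
    return float((nearest == np.arange(len(Y))).mean())


# Synthetic stand-in for paired word embeddings: Y is an exact rotation of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Y = X @ Q

W = procrustes_align(X, Y)
print(procrustes_distance(X, Y, W), nn_accuracy(X @ W, Y))
```

With real embeddings the two spaces are not exact rotations of each other, so the distance is above zero and the retrieval accuracy falls below 100%, as in the table above.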

3. Token fertility (tokens per word)

| Language | SindhiNLTK (our tokenizer) | XLM‑R (baseline) |
|---|---|---|
| Urdu | 1.293 | 1.293 |
| Punjabi | 1.609 | 1.609 |
| Sindhi | 1.000 | 1.475 |
| Saraiki | 1.628 | 1.628 |

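For Sindhi, SindhiNLTK reaches a fertility of 1.000 tokens per word versus 1.475 for XLM‑R, about 32% fewer tokens for the same text. For Urdu, Punjabi, and Saraiki the two tokenizers give identical fertility.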
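Token fertility is the average number of tokens a tokenizer produces per whitespace‑separated word. The sketch below shows one way to compute it. The function `token_fertility` and the two toy tokenizers are hypothetical stand‑ins, not the SindhiNLTK or XLM‑R APIs, and whether the reported numbers were computed word by word or per sentence is not stated here, so treat the word‑by‑word choice as an assumption.

```python
from typing import Callable, Iterable, List


def token_fertility(sentences: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_words = 0
    total_tokens = 0
    for sentence in sentences:
        words = sentence.split()
        total_words += len(words)
        # Tokenize each word on its own so tokens never span word boundaries.
        total_tokens += sum(len(tokenize(word)) for word in words)
    if total_words == 0:
        raise ValueError("no words found in the input")
    return total_tokens / total_words


def word_level(word: str) -> List[str]:
    """Toy tokenizer: every word is a single token (fertility 1.0)."""
    return [word]


def chunks_of_three(word: str) -> List[str]:
    """Toy subword tokenizer: split a word into pieces of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


sample = ["abc defgh", "xy"]
print(token_fertility(sample, word_level))
print(token_fertility(sample, chunks_of_three))
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, for example the `tokenize` method of a Hugging Face tokenizer loaded for XLM‑R.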

4. Bridge hypothesis (Saraiki as geometric anchor)

  • Direct Urdu → Sindhi Procrustes distance: 0.0717
  • Two‑step Urdu → Saraiki → Sindhi distance: 0.0717
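  • ✅ The two‑step path through Saraiki reproduces the direct Urdu → Sindhi distance (0.0717 in both cases). This supports the Shared Manifold Hypothesis: Saraiki acts as a geometric bridge between Urdu and Sindhi.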
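The comparison above can be sketched by composing two orthogonal maps (source → bridge, then bridge → target) and measuring the residual against the target, next to the residual of the direct map. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, the residual normalisation is assumed, and the embeddings are synthetic.

```python
import numpy as np


def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||X @ W - Y||_F for row-paired X and Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def residual(X: np.ndarray, Y: np.ndarray, W: np.ndarray) -> float:
    """Relative squared residual of mapping X onto Y with W (assumed metric)."""
    return float(np.linalg.norm(X @ W - Y) ** 2 / np.linalg.norm(Y) ** 2)


def bridge_distances(X_src, X_bridge, X_tgt):
    """Return (direct, two_step) residuals from source to target."""
    W_direct = procrustes_align(X_src, X_tgt)
    # Compose source -> bridge with bridge -> target; the product is orthogonal.
    W_two_step = procrustes_align(X_src, X_bridge) @ procrustes_align(X_bridge, X_tgt)
    return residual(X_src, X_tgt, W_direct), residual(X_src, X_tgt, W_two_step)


# Synthetic stand-ins for the three embedding spaces (rows are paired words).
rng = np.random.default_rng(0)
X_src = rng.normal(size=(800, 16))
Q1, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Q2, _ = np.linalg.qr(rng.normal(size=(16, 16)))
X_bridge = X_src @ Q1 + 0.05 * rng.normal(size=(800, 16))
X_tgt = X_bridge @ Q2 + 0.05 * rng.normal(size=(800, 16))

direct, two_step = bridge_distances(X_src, X_bridge, X_tgt)
print(f"direct={direct:.4f}  two_step={two_step:.4f}")
```

Under this definition the direct map is the best orthogonal map, so the two‑step residual can never be smaller than the direct one. Equal values mean that routing through the bridge language loses nothing relative to the optimum.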

📁 Repository Files

| File | Description |
|---|---|
| zero_shot_test.ipynb | Complete notebook for zero‑shot reasoning transfer (25 classes). |
| procrustes_results.csv | Procrustes distances and nearest‑neighbour accuracies (800‑word vocabulary). |
| fertility_results.csv | Token fertility comparison (SindhiNLTK vs XLM‑R). |
| zero_shot_accuracy.csv | Classification accuracies for Urdu, Punjabi, Sindhi, Saraiki. |
| *.png | Publication‑ready figures (300 DPI). |

📖 Citation

If you use this data or code in your research, please cite:

@misc{meghwar2026indoaaryanshared,
  author = {Meghwar, Aakash},
  title = {Indo‑Aryan Shared Manifold: Zero‑Shot Reasoning Transfer and Procrustes Alignment},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/aakashMeghwar01/IndoAryan-Shared-Manifold}
}