synthocr-gen: A synthetic OCR dataset generator for low-resource languages - breaking the data barrier Paper • 2601.16113 • Published 4 days ago
ks-lit-3m: A 3.1 million word Kashmiri text dataset for large language model pretraining Paper • 2601.01091 • Published 23 days ago
600k-ks-ocr: A large-scale synthetic dataset for optical character recognition in Kashmiri script Paper • 2601.01088 • Published 23 days ago
Omarrran/3.1Million_KASHMIRI_text_Pre_training_Dataset_for_LLM_2026_by_HNM Viewer • Updated 24 days ago • 1 • 26 • 2