> [!IMPORTANT]
> #### _Link to Our Paper_
> <i> [Read the paper on arXiv](https://arxiv.org/pdf/2504.07989) </i>
> #### _Must read_
> * <i> `✨ All Models & Datasets can be found after this README section! ✨`</i>
> * <i> Refer to our [GitHub](https://github.com/VizuaraAI/Tiny-Stories-Regional) to find <b> results, extensive guides, resources and code </b> for our TinyStories-Regional framework, which extends <br> the TinyStories approach ([Eldan & Li, 2023](https://arxiv.org/abs/2305.07759)) to three major Indian languages: `Hindi, Marathi, and Bangla`. </i>

<br>

# Abstract

Small, resource-efficient language models are pivotal for extending high-quality text generation to low-resource and regional languages, the true frontier of linguistic equity in AI. Yet research largely prioritises massive English-centric systems, leaving regional-centric (low-resource) language modelling underexplored, particularly how tokenizer design, dataset diversity, and linguistic structure shape a Small Language Model's (SLM's) effectiveness under realistic computational and data constraints. We present *Regional-TinyStories*, a lightweight framework that treats SLMs as cost-effective stand-ins for LLMs to enable rapid, variable-wise analysis. Extending TinyStories to Hindi, Marathi, and Bangla, we release datasets of 2M synthetic and translated stories per language and train over 20 SLMs spanning 5–157M parameters. Using this framework, we (i) uncover contrasts between form-oriented (grammar, fluency) and content-oriented (context, completeness, creativity) metrics; (ii) chart language-specific learning dynamics; (iii) rank tokenizers, showing the Indic-specific Sarvam-1 outperforming SUTRA and the generic Tiktoken (GPT-2) across all metrics; and (iv) demonstrate that dataset semantic quality (translation vs. synthetic) strongly governs downstream generation. Validation through an LLM-as-Judge ensemble (GPT-4o, LLaMA-3.3-70B) and a 100+ participant human study confirms these trends while exposing systematic score inflation in automated evaluations. *Regional-TinyStories* offers a reproducible path to benchmark tokenizers, datasets, and SLM designs for scalable, context-faithful generation in low-resource languages.
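
To make the evaluation setup concrete, the per-story metric scores from the LLM-as-Judge ensemble can be combined by averaging each metric across judges. The sketch below is illustrative only: the judge names, the 0–10 scale, and the plain-average aggregation are assumptions, not the paper's exact rubric (see our GitHub for the real evaluation code).

```python
# Illustrative sketch: averaging story-quality scores over an LLM-as-Judge
# ensemble. Judge names and the 0-10 scale are assumptions for illustration;
# see the paper and GitHub repo for the actual evaluation rubric.
METRICS = ["context", "completeness", "creativity", "fluency", "grammar"]

def aggregate(scores_by_judge):
    """Average each metric across all judges in the ensemble."""
    n = len(scores_by_judge)
    return {m: sum(j[m] for j in scores_by_judge.values()) / n for m in METRICS}

ensemble = {
    "gpt-4o":        {"context": 8, "completeness": 7, "creativity": 6, "fluency": 9, "grammar": 9},
    "llama-3.3-70b": {"context": 7, "completeness": 7, "creativity": 5, "fluency": 8, "grammar": 9},
}
scores = aggregate(ensemble)
```

Reporting per-metric averages (rather than one overall score) is what lets the form-oriented vs. content-oriented contrast above show up in the results.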

---
# ⚙️ Usage
* Datasets can be downloaded (below) as `.json` files. Please check the format of each entry using the dataset viewer tab :)
* Model weights from the Hugging Face Hub can be loaded for inference via the `config.py` script provided in our GitHub repo.
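
As a quick sanity check on a downloaded dataset file, each `.json` entry can be inspected with the standard library alone. This is a minimal sketch; the `story` field and the sample text are hypothetical placeholders, so verify the real schema in the dataset viewer tab before relying on it.

```python
import json

# Hypothetical sample mimicking a downloaded dataset file; the actual field
# names may differ -- confirm them in the dataset viewer tab on the Hub.
raw = '[{"story": "एक समय की बात है, एक छोटा लड़का था।"}]'
entries = json.loads(raw)

# Inspect how many entries there are and what fields each one carries.
print(len(entries), sorted(entries[0].keys()))
```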
# 📝 Citation
If you use Vizuara's Regional-TinyStories in your research, please cite us using the following BibTeX template:
```text
@misc{patil2025regionaltinystoriesusing,