---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- lora
- aya
- tiny-aya
- multilingual
- code
- legesher
- tiny-aya-expedition
- language-decoded
library_name: transformers
base_model:
- CohereLabs/tiny-aya-global
- CohereLabs/tiny-aya-fire
- CohereLabs/tiny-aya-earth
- CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---
# Language Decoded LoRA
LoRA adapters fine-tuned across multilingual code training conditions for the **Language Decoded** project (part of Cohere's Tiny Aya Expedition).
## Research Question
> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?
## Base Models
All adapters are trained on top of [Tiny Aya](https://huggingface.co/collections/CohereLabs/tiny-aya) (3.35B parameters), a multilingual model optimized for 70+ languages.
| Model | HF ID | Regional Strength |
|---|---|---|
| **Global** | `CohereLabs/tiny-aya-global` | Balanced across all languages |
| **Fire** | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
| **Earth** | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
| **Water** | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |
## Model Structure
This repo contains LoRA adapters organized by experimental condition and base model variant:
| Subdirectory | Condition | Training Data |
|---|---|---|
| `global/baseline/` | Condition 1 | No code augmentation |
| `global/english-code/` | Condition 2 | English-keyword Python code |
| `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, Chinese keywords |
| `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
| `fire/multilingual-code/` | Regional | Urdu-keyword Python on Fire variant |
| `earth/multilingual-code/` | Regional | Amharic-keyword Python on Earth variant |
| `water/multilingual-code/` | Regional | Chinese-keyword Python on Water variant |
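When sweeping over conditions, the table above can be expressed as a small lookup that produces the `subfolder` argument expected by PEFT. This is an illustrative sketch only; `ADAPTER_DIRS` and `adapter_subfolder` are not part of this repo.

```python
# Map (base variant, condition) pairs to the adapter subfolders in this repo.
# Illustrative helper, not shipped with the repo itself.
ADAPTER_DIRS = {
    ("global", "baseline"): "global/baseline",
    ("global", "english-code"): "global/english-code",
    ("global", "multilingual-code"): "global/multilingual-code",
    ("global", "multilingual-text"): "global/multilingual-text",
    ("fire", "multilingual-code"): "fire/multilingual-code",
    ("earth", "multilingual-code"): "earth/multilingual-code",
    ("water", "multilingual-code"): "water/multilingual-code",
}

def adapter_subfolder(base: str, condition: str) -> str:
    """Return the `subfolder` argument for PeftModel.from_pretrained."""
    key = (base, condition)
    if key not in ADAPTER_DIRS:
        # Regional variants only have a multilingual-code adapter.
        raise ValueError(f"No adapter for base={base!r}, condition={condition!r}")
    return ADAPTER_DIRS[key]
```

Note that only the Global variant covers all four conditions; the Fire, Earth, and Water variants were trained only under the multilingual-code condition.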
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")
# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code")
# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code")
```
## Training Details
- **Base models**: Tiny Aya 3.35B — Global, Fire, Earth, Water ([CohereLabs](https://huggingface.co/CohereLabs))
- **Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Parameters**: 3.35B base, ~0.1% trainable via LoRA
*Detailed hyperparameters and training configs will be added as training completes.*
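As a rough sanity check on the "~0.1% trainable" figure: LoRA adds two low-rank factors per adapted weight matrix, so a matrix of shape `(d_out, d_in)` at rank `r` contributes `r * (d_in + d_out)` trainable parameters. The sketch below uses hypothetical values (rank, hidden size, layer count, and which projections are adapted are placeholders, since the real training config is not yet published):

```python
def lora_params(shapes, r):
    """Total LoRA trainable params at rank r: each weight W (d_out x d_in)
    gains factors A (r x d_in) and B (d_out x r)."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# HYPOTHETICAL config, not the actual Language Decoded setup:
# 30 layers, adapting only the q and v projections of a 3072-wide model.
hidden = 3072
shapes = [(hidden, hidden)] * 2 * 30
trainable = lora_params(shapes, r=8)
base = 3_350_000_000  # Tiny Aya, 3.35B parameters

# ~2.9M trainable, roughly 0.09% of the 3.35B base
print(f"{trainable:,} trainable ({100 * trainable / base:.3f}% of base)")
```

Under these placeholder assumptions the trainable fraction lands near the quoted ~0.1%; the exact figure depends on the rank and target modules, which will be documented with the final training configs.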
## Evaluation
Models are evaluated on multilingual reasoning benchmarks:
| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |
*Results will be added as evaluation completes.*
## Related Resources
- **Base models**: [Tiny Aya Collection](https://huggingface.co/collections/CohereLabs/tiny-aya)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Community code**: [Legesher/language-decoded-community](https://huggingface.co/datasets/Legesher/language-decoded-community)
- **Experiments**: [Legesher/language-decoded-experiments](https://huggingface.co/datasets/Legesher/language-decoded-experiments)
- **Transpilation tool**: [Legesher](https://github.com/Legesher/legesher)
## Citation
```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```
## License
CC-BY-NC 4.0 (inherits from Tiny Aya base models)