Language Decoded LoRA
LoRA adapters fine-tuned on multilingual code conditions for the Language Decoded project (part of Cohere's Tiny Aya Expedition).
Research Question
Does fine-tuning on non-English code improve multilingual reasoning, and is the benefit language-dependent or structure-dependent?
Base Models
All adapters are trained on top of Tiny Aya (3.35B parameters), a multilingual model optimized for 70+ languages.
| Model | HF ID | Regional Strength |
|---|---|---|
| Global | CohereLabs/tiny-aya-global | Balanced across all languages |
| Fire | CohereLabs/tiny-aya-fire | South Asian (Urdu) |
| Earth | CohereLabs/tiny-aya-earth | West Asian & African (Amharic) |
| Water | CohereLabs/tiny-aya-water | European & Asia Pacific (Chinese) |
Model Structure
This repo contains LoRA adapters organized by experimental condition and base model variant:
| Subdirectory | Condition | Training Data |
|---|---|---|
| global/baseline/ | Condition 1 | No code augmentation |
| global/english-code/ | Condition 2 | English-keyword Python code |
| global/multilingual-code/ | Condition 3 | Python transpiled to Urdu, Amharic, and Chinese keywords |
| global/multilingual-text/ | Condition 4 | Non-code multilingual text |
| fire/multilingual-code/ | Regional | Urdu-keyword Python on the Fire variant |
| earth/multilingual-code/ | Regional | Amharic-keyword Python on the Earth variant |
| water/multilingual-code/ | Regional | Chinese-keyword Python on the Water variant |
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")

# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code")

# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code")
```
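Once an adapter is attached, generation works exactly as with the base model. The prompt below is an illustrative placeholder, not a benchmark item:

```python
# Quick generation check with the adapter-augmented model.
inputs = tokenizer("Explain why the sky is blue, step by step:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```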
Training Details
- Base models: Tiny Aya 3.35B (Global, Fire, Earth, Water) from CohereLabs
- Method: QLoRA (Quantized Low-Rank Adaptation)
- Training data: Legesher/language-decoded-data
- Parameters: 3.35B base, ~0.1% trainable via LoRA
Detailed hyperparameters and training configs will be added as training completes.
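Until then, the sketch below shows what a minimal QLoRA setup for a ~3B base model typically looks like. All values (rank, alpha, dropout, target modules) are illustrative assumptions, not the project's actual configuration, and the module names depend on the base architecture.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/tiny-aya-global", quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

# Low-rank adapters on the attention projections (hyperparameters are illustrative).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # on the order of 0.1% of the 3.35B parameters
```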
Evaluation
Models are evaluated on multilingual reasoning benchmarks:
| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |
Results will be added as evaluation completes.
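These benchmarks are typically scored by comparing the model's log-likelihood over a fixed set of candidate continuations. The sketch below shows that scoring loop for a single multiple-choice item; the prompt and candidates are made-up placeholders, not items from any benchmark above, and it assumes `model` and `tokenizer` were loaded as in the Usage section.

```python
import torch

def score_choice(model, tokenizer, prompt, choice):
    """Sum of token log-probabilities of `choice` given `prompt`."""
    # Assumes the prompt tokenization is a prefix of the full tokenization
    # (a common approximation in multiple-choice scoring).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits
    # Log-probs for predicting each next token, conditioned on its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    choice_ids = full_ids[0, prompt_ids.shape[1]:]
    positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[pos, tok].item() for pos, tok in zip(positions, choice_ids))

# Hypothetical XStoryCloze-style item: pick the more plausible ending.
prompt = "Sara planted tomato seeds in spring. By August,"
choices = [" her garden was full of ripe tomatoes.", " the seeds turned into a bicycle."]
best = max(choices, key=lambda c: score_choice(model, tokenizer, prompt, c))
print(best)
```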
Related Resources
- Base models: Tiny Aya Collection
- Training data: Legesher/language-decoded-data
- Community code: Legesher/language-decoded-community
- Experiments: Legesher/language-decoded-experiments
- Transpilation tool: Legesher
Citation
```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```
License
CC-BY-NC 4.0 (inherits from Tiny Aya base models)