---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- lora
- aya
- tiny-aya
- multilingual
- code
- legesher
- tiny-aya-expedition
- language-decoded
library_name: transformers
base_model:
- CohereLabs/tiny-aya-global
- CohereLabs/tiny-aya-fire
- CohereLabs/tiny-aya-earth
- CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---

# Language Decoded LoRA

LoRA adapters fine-tuned on multilingual code conditions for the **Language Decoded** project (part of Cohere's Tiny Aya Expedition).

## Research Question

> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?

## Base Models

All adapters are trained on [Tiny Aya](https://huggingface.co/collections/CohereLabs/tiny-aya) (3.35B parameters), a multilingual model family optimized for 70+ languages.

| Model | HF ID | Regional Strength |
|---|---|---|
| **Global** | `CohereLabs/tiny-aya-global` | Balanced across all languages |
| **Fire** | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
| **Earth** | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
| **Water** | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |

## Model Structure

This repo contains LoRA adapters organized by experimental condition and base model variant:

| Subdirectory | Condition | Training Data |
|---|---|---|
| `global/baseline/` | Condition 1 | No code augmentation |
| `global/english-code/` | Condition 2 | English-keyword Python code |
| `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, and Chinese keywords |
| `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
| `fire/multilingual-code/` | Regional | Urdu-keyword Python on the Fire variant |
| `earth/multilingual-code/` | Regional | Amharic-keyword Python on the Earth variant |
| `water/multilingual-code/` | Regional | Chinese-keyword Python on the Water variant |
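The directory layout above can be captured in a small lookup table when selecting adapters programmatically. Note that `ADAPTERS` and `adapter_subfolder` below are illustrative names for a helper you would write yourself, not utilities shipped with this repo:

```python
# Map (base variant, condition) pairs to adapter subfolders in this repo.
# This table mirrors the directory layout documented above.
ADAPTERS = {
    ("global", "baseline"): "global/baseline",
    ("global", "english-code"): "global/english-code",
    ("global", "multilingual-code"): "global/multilingual-code",
    ("global", "multilingual-text"): "global/multilingual-text",
    ("fire", "multilingual-code"): "fire/multilingual-code",
    ("earth", "multilingual-code"): "earth/multilingual-code",
    ("water", "multilingual-code"): "water/multilingual-code",
}

def adapter_subfolder(variant: str, condition: str) -> str:
    """Return the repo subfolder for a (variant, condition) pair."""
    try:
        return ADAPTERS[(variant, condition)]
    except KeyError:
        valid = ", ".join(f"{v}/{c}" for v, c in sorted(ADAPTERS))
        raise KeyError(f"No adapter for {variant}/{condition}; available: {valid}")

print(adapter_subfolder("fire", "multilingual-code"))  # fire/multilingual-code
```

The resulting string can be passed directly as the `subfolder` argument shown in the Usage section below.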
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")

# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code")

# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code")
```

## Training Details

- **Base models**: Tiny Aya 3.35B — Global, Fire, Earth, Water ([CohereLabs](https://huggingface.co/CohereLabs))
- **Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Parameters**: 3.35B base, ~0.1% trainable via LoRA

*Detailed hyperparameters and training configs will be added as training completes.*
## Evaluation

Models are evaluated on multilingual reasoning benchmarks:

| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |

*Results will be added as evaluation completes.*
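Since the research question hinges on whether gains track specific languages, results will need to be broken out per language rather than pooled. Whatever harness produces the predictions, that breakdown reduces to a grouped mean over (language, correct) pairs; a minimal stdlib sketch with made-up sample records:

```python
from collections import defaultdict

def per_language_accuracy(records):
    """records: iterable of (language, is_correct) pairs -> {language: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, ok in records:
        total[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / total[lang] for lang in total}

# Made-up XNLI-style outcomes for illustration only.
records = [("ur", True), ("ur", False), ("am", True), ("zh", True), ("zh", True)]
print(per_language_accuracy(records))  # {'ur': 0.5, 'am': 1.0, 'zh': 1.0}
```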
## Related Resources

- **Base models**: [Tiny Aya Collection](https://huggingface.co/collections/CohereLabs/tiny-aya)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Community code**: [Legesher/language-decoded-community](https://huggingface.co/datasets/Legesher/language-decoded-community)
- **Experiments**: [Legesher/language-decoded-experiments](https://huggingface.co/datasets/Legesher/language-decoded-experiments)
- **Transpilation tool**: [Legesher](https://github.com/Legesher/legesher)

## Citation

```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```

## License

CC-BY-NC 4.0 (inherits from the Tiny Aya base models)