---
license: apache-2.0
language:
- en
- zh
- es
- ur
tags:
- lora
- aya
- tiny-aya
- multilingual
- code
- legesher
- tiny-aya-expedition
- language-decoded
- unsloth
library_name: transformers
base_model:
- CohereLabs/tiny-aya-base
pipeline_tag: text-generation
---
# Language Decoded LoRA
QLoRA adapters for the **Language Decoded** project (part of [Cohere's Tiny Aya Expedition](https://aya.for.ai)), fine-tuned on multilingual code — one adapter per experimental condition.
## Research Question
> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?
## Base Model
All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B parameters).
## Model Structure
This repo is the canonical hub for all Language Decoded LoRA adapters, organized by experimental condition:
| Subdirectory | Condition | Training Data |
| --------------------- | ----------- | ----------------------------------------------------- |
| `condition-1-en-32k/` | Condition 1 | English Python from The Stack Dedup (full 32k corpus) |
| `condition-1-en-5k/` | Condition 1 | English Python from The Stack Dedup (5k subset) |
| `condition-2-zh-5k/` | Condition 2 | Chinese keyword-swapped Python (Legesher-transpiled) |
| `condition-2-es-5k/` | Condition 2 | Spanish keyword-swapped Python (Legesher-transpiled) |
| `condition-2-ur-5k/` | Condition 2 | Urdu keyword-swapped Python (Legesher-transpiled) |
| `condition-3-zh-5k/` | Condition 3 | Transpiled + native Chinese code (blended) |
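
The Condition 2 data is produced by keyword swapping: Python's English keywords are replaced with translations while identifiers, literals, and program structure are left intact. A minimal sketch of the idea using the standard-library `tokenize` module — the keyword mapping here is an illustrative assumption, not Legesher's actual translation table:

```python
import io
import tokenize

# Illustrative mapping from English Python keywords to Chinese equivalents.
# (These entries are assumptions for demonstration; Legesher maintains its
# own full translation tables.)
KEYWORD_MAP = {"def": "定义", "return": "返回", "if": "如果", "else": "否则"}

def swap_keywords(source: str, mapping: dict) -> str:
    """Swap keywords token by token, so identifiers and string literals stay untouched."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = mapping.get(tok.string, tok.string) if tok.type == tokenize.NAME else tok.string
        out.append((tok.type, text))
    return tokenize.untokenize(out)

print(swap_keywords("def double(x):\n    return x * 2\n", KEYWORD_MAP))
```

Working at the token level (rather than string replacement) matters: a variable named `define` or a string containing `"return"` must not be rewritten.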
### The Experimental Ladder
- **Baseline --> 1**: Does code help at all?
- **1 --> 2**: Does the language of keywords matter?
- **2 --> 3**: Does diversity of native-language sources add value beyond keyword swap?
- **3 --> 4**: Does code written in the cultural context of a language carry unique signal? (Condition 4 adapters are not yet included in this repo.)
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")
# Load a LoRA adapter (e.g., Condition 1 — English code)
model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-1-en-5k")
# To use a different adapter (e.g., Condition 2 — Chinese keyword-swapped),
# reload the base model first: PeftModel.from_pretrained attaches the adapter
# to the base model instance in place, so reusing it would stack adapters
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-2-zh-5k")
```
## Training Details
| Parameter | Value |
| ------------------ | ------------------------------------------------------------------------------------------------ |
| Base model | [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B params) |
| Method | QLoRA 4-bit (NF4), ~5.4GB VRAM |
| Hardware | Kaggle T4 (16GB) |
| Tokenizer | CohereLabs/tiny-aya-base |
| Transpilation tool | [Legesher](https://github.com/legesher/legesher) v0.7.3 |
| Training data | [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data) |
### QLoRA Hyperparameters
| Parameter | Value |
| --------------- | ------------------------------------------------------------- |
| LoRA rank (`r`) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj |
| Bias | none |
| Task type | CAUSAL_LM |
| PEFT version | 0.18.1 |
| Quantization | NF4 (4-bit) via Unsloth |
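
With rank `r = 16` and alpha 32 (a scaling factor of alpha/r = 2), LoRA trains only two low-rank factors per targeted projection matrix — `r * (d_in + d_out)` parameters instead of `d_in * d_out`. A back-of-envelope sketch; the hidden size below is hypothetical, not tiny-aya-base's actual dimensions:

```python
def lora_params(d_in: int, d_out: int, r: int = 16) -> int:
    """LoRA trains two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

# Hypothetical square projection for illustration only
d = 2048
full = d * d                 # parameters in one full projection matrix
adapter = lora_params(d, d)  # parameters the LoRA adapter trains instead
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

This roughly 1–2% trainable fraction, combined with the NF4-quantized frozen base, is what lets the 3.35B model fit in ~5.4GB of VRAM on a Kaggle T4.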
## Evaluation
Models are evaluated on multilingual reasoning benchmarks with dual prompts (English + language-specific):
| Benchmark | What it measures | Examples per language |
| --------- | -------------------------- | --------------------- |
| MGSM | Math reasoning | 250 (full set) |
| X-CSQA | Commonsense reasoning | ~1,000 (full set) |
| XNLI | Natural language inference | ~5,000 (full set) |
_Results will be added as evaluation completes._
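
Concretely, dual-prompt evaluation means each benchmark item is scored twice: once with an English instruction and once with an instruction in the target language. A minimal sketch — the templates below are illustrative assumptions, not the project's actual evaluation harness:

```python
# Illustrative instruction templates (assumptions for demonstration,
# not the project's actual prompt templates).
TEMPLATES = {
    "en": "Solve the following problem step by step:\n{question}",
    "zh": "请一步一步解决下面的问题：\n{question}",
}

def dual_prompts(question: str) -> dict:
    """Build one prompt per template language for the same benchmark item."""
    return {lang: tpl.format(question=question) for lang, tpl in TEMPLATES.items()}

prompts = dual_prompts("Janet has 3 apples and buys 2 more. How many does she have?")
```

Scoring the same items under both instruction languages separates gains in underlying reasoning from gains in instruction-language comprehension.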
## Limitations
- **Single base model**: All adapters are trained on CohereLabs/tiny-aya-base (3.35B params). Results may not generalize to larger or architecturally different models.
- **Limited training data**: Each condition uses a 5k-file subset for QLoRA fine-tuning, constrained by Kaggle T4 hardware limits.
- **Evaluation scope**: Currently evaluated on 3 benchmarks (MGSM, X-CSQA, XNLI). Other reasoning tasks may show different patterns.
- **Consumer hardware**: Training on Kaggle T4 (16GB) with 4-bit quantization introduces approximation that may affect adapter quality compared to full-precision training.
## Related Resources
- **Training data**: [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data)
- **Community code**: [legesher/language-decoded-community](https://huggingface.co/datasets/legesher/language-decoded-community)
- **Experiment tracking**: [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments)
- **Transpilation tool**: [Legesher on GitHub](https://github.com/legesher/legesher)
## Citation
```bibtex
@misc{language-decoded-2026,
title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/legesher/language-decoded-lora}
}
```
## License
Apache 2.0
|