---
license: cc-by-nc-4.0
language:
- en
- zh
- es
- ur
tags:
- lora
- aya
- tiny-aya
- multilingual
- code
- legesher
- tiny-aya-expedition
- language-decoded
- unsloth
library_name: transformers
base_model:
- CohereLabs/tiny-aya-base
pipeline_tag: text-generation
---

# Language Decoded LoRA

QLoRA adapters fine-tuned on multilingual code conditions for the **Language Decoded** project (part of [Cohere's Tiny Aya Expedition](https://aya.for.ai)).

## Research Question

> Does fine-tuning on non-English code improve multilingual reasoning, and is the benefit language-dependent or structure-dependent?

## Base Model

All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B parameters).

## Model Structure

This repo contains QLoRA adapters organized by experimental condition:

| Subdirectory | Condition | Training Data |
|---|---|---|
| `baseline/` | Baseline | No fine-tuning (base model eval only) |
| `condition-1-en/` | Condition 1 | English Python from The Stack Dedup |
| `condition-2-zh/` | Condition 2 | Chinese keyword-swapped Python (Legesher-transpiled) |
| `condition-2-es/` | Condition 2 | Spanish keyword-swapped Python (Legesher-transpiled) |
| `condition-2-ur/` | Condition 2 | Urdu keyword-swapped Python (Legesher-transpiled) |
| `condition-3-zh/` | Condition 3 | Transpiled + native Chinese code (Wenyan + community) |
| `condition-3-es/` | Condition 3 | Transpiled + native Spanish code (Latino + community) |
| `condition-3-ur/` | Condition 3 | Transpiled + native Urdu code (Qalb + community) |
| `condition-4-combined/` | Condition 4 | All strictly native code (combined) |

### The Experimental Ladder

- **Baseline → 1**: Does code help at all?
- **1 → 2**: Does the language of keywords matter?
- **2 → 3**: Does diversity of native-language sources add value beyond keyword swap?
- **3 → 4**: Does code written in the cultural context of a language carry unique signal?
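The subdirectory naming in the Model Structure table is regular enough to capture in a small lookup helper. The sketch below is illustrative only: `adapter_subfolder` and its constants are hypothetical names, not part of this repo, and the baseline is left out because it has no adapter to load.

```python
from typing import Optional

# Hypothetical helper (not shipped with this repo): maps an experimental
# condition and language code to the subfolder names in the table above.
LANGUAGES = ("zh", "es", "ur")      # languages with per-language adapters
PER_LANGUAGE_CONDITIONS = (2, 3)    # conditions trained once per language


def adapter_subfolder(condition: int, language: Optional[str] = None) -> str:
    """Return the adapter subfolder for a condition (1-4)."""
    if condition == 1:
        # Condition 1 is English-only.
        if language not in (None, "en"):
            raise ValueError("Condition 1 only exists for 'en'")
        return "condition-1-en"
    if condition in PER_LANGUAGE_CONDITIONS:
        if language not in LANGUAGES:
            raise ValueError(f"Condition {condition} needs one of {LANGUAGES}")
        return f"condition-{condition}-{language}"
    if condition == 4:
        # Condition 4 combines all languages into a single adapter.
        return "condition-4-combined"
    raise ValueError(f"Unknown condition: {condition}")
```

The returned string can be passed as the `subfolder` argument when loading an adapter, which keeps typos in condition names from surfacing as download errors.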

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

# Load a LoRA adapter (e.g., Condition 1: English code)
model = PeftModel.from_pretrained(
    base_model,
    "legesher/language-decoded-lora",
    subfolder="condition-1-en",
    adapter_name="condition-1-en",
)

# Add a language-specific adapter (e.g., Condition 2: Chinese keyword-swapped).
# PeftModel.from_pretrained modifies base_model in place, so further adapters
# are attached with load_adapter rather than by wrapping base_model again.
model.load_adapter(
    "legesher/language-decoded-lora",
    adapter_name="condition-2-zh",
    subfolder="condition-2-zh",
)

# Select which adapter is active for inference
model.set_adapter("condition-2-zh")
```

## Training Details

| Parameter | Value |
|---|---|
| Base model | [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B params) |
| Method | QLoRA 4-bit (NF4), ~5.4 GB VRAM |
| Hardware | Kaggle T4 (16 GB) |
| Tokenizer | CohereLabs/tiny-aya-base |
| Transpilation tool | [Legesher](https://github.com/legesher/legesher) v0.7.3 |
| Training data | [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data) |

*Detailed hyperparameters and training configs will be added as training completes.*

## Evaluation

Models are evaluated on multilingual reasoning benchmarks with dual prompts (English + language-specific):

| Benchmark | What it measures | Examples per language |
|---|---|---|
| MGSM | Math reasoning | 250 (full set) |
| X-CSQA | Commonsense reasoning | ~1,000 (full set) |
| XNLI | Natural language inference | ~5,000 (full set) |

*Results will be added as evaluation completes.*

## Related Resources

- **Training data**: [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data)
- **Community code**: [legesher/language-decoded-community](https://huggingface.co/datasets/legesher/language-decoded-community)
- **Experiment tracking**: [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments)
- **Transpilation tool**: [Legesher on GitHub](https://github.com/legesher/legesher)

## Citation

```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/legesher/language-decoded-lora}
}
```

## License

CC-BY-NC 4.0 (inherited from the Tiny Aya base model)