Language Decoded LoRA — Condition 3: Chinese Mixed Native Sources
A blended dataset of 3,486 native Chinese code files and 1,514 transpiled Python files (5k subset). This condition tests whether diverse native-language code adds value beyond simple keyword swapping.
Part of the Language Decoded project (Cohere's Tiny Aya Expedition).
For full experiment details, see the Language Decoded LoRA hub.
Training Data
legesher/language-decoded-data / condition-3-zh-5k
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and its tokenizer
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

# Attach the Condition 3 LoRA adapter to the frozen base model
model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora-condition-3-zh-5k")
Citation
@misc{language-decoded-2026,
title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/legesher/language-decoded-lora}
}
License
Apache 2.0
Base model
CohereLabs/tiny-aya-base