Language Decoded LoRA — Condition 3: Chinese Mixed Native Sources
A blended dataset of 3,486 native Chinese code files and 1,514 transpiled Python files (5k subset). This condition tests whether diverse native-language code adds value beyond simple keyword swapping.
Part of the Language Decoded project (Cohere's Tiny Aya Expedition).
For full experiment details, see the Language Decoded LoRA hub.
Training Data
legesher/language-decoded-data / condition-3-zh-5k
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and its tokenizer
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

# Attach the Condition 3 LoRA adapter to the frozen base model
model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora-condition-3-zh-5k")
Citation
@misc{language-decoded-2026,
title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/legesher/language-decoded-lora}
}
License
Apache 2.0
Base model
CohereLabs/tiny-aya-base