---
license: cc-by-nc-4.0
language:
  - multilingual
tags:
  - lora
  - aya
  - tiny-aya
  - multilingual
  - code
  - legesher
  - tiny-aya-expedition
  - language-decoded
library_name: transformers
base_model:
  - CohereLabs/tiny-aya-global
  - CohereLabs/tiny-aya-fire
  - CohereLabs/tiny-aya-earth
  - CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---

# Language Decoded LoRA

LoRA adapters fine-tuned on multilingual code conditions for the Language Decoded project (part of Cohere's Tiny Aya Expedition).

## Research Question

Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?

## Base Models

All adapters are trained on variants of Tiny Aya (3.35B parameters), a multilingual model optimized for 70+ languages.

| Model | HF ID | Regional Strength |
|---|---|---|
| Global | `CohereLabs/tiny-aya-global` | Balanced across all languages |
| Fire | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
| Earth | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
| Water | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |

## Model Structure

This repo contains LoRA adapters organized by experimental condition and base model variant:

| Subdirectory | Condition | Training Data |
|---|---|---|
| `global/baseline/` | Condition 1 | No code augmentation |
| `global/english-code/` | Condition 2 | English-keyword Python code |
| `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, and Chinese keywords |
| `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
| `fire/multilingual-code/` | Regional | Urdu-keyword Python on the Fire variant |
| `earth/multilingual-code/` | Regional | Amharic-keyword Python on the Earth variant |
| `water/multilingual-code/` | Regional | Chinese-keyword Python on the Water variant |
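The multilingual-code conditions rest on transpiling Python keywords into other languages. The sketch below illustrates the general idea with the standard-library `tokenize` module; the keyword mapping and the tokenizer-based approach are assumptions for demonstration only, not the project's actual transpilation pipeline:

```python
import io
import tokenize

# Illustrative Urdu keyword mapping -- an assumption for demonstration,
# not the project's actual keyword set.
URDU_KEYWORDS = {"def": "تعریف", "return": "واپس", "if": "اگر", "else": "ورنہ"}

def transpile(source: str, mapping: dict) -> str:
    """Replace English Python keywords with target-language equivalents."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Keywords are lexed as NAME tokens; swap mapped ones, keep the rest.
        if tok.type == tokenize.NAME and tok.string in mapping:
            out.append((tok.type, mapping[tok.string]))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

code = "def square(x):\n    return x * x\n"
print(transpile(code, URDU_KEYWORDS))
```

Identifiers and literals pass through unchanged, so only the structural vocabulary of the language changes.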

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")

# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(
    base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code"
)

# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(
    base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code"
)
```
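When sweeping over adapters programmatically, the `subfolder` strings from the Model Structure table can be built with a small helper. This is hypothetical convenience code, not part of the repo; the valid combinations are inferred from that table:

```python
# Variants and conditions as laid out in the Model Structure table.
VARIANTS = {"global", "fire", "earth", "water"}
CONDITIONS = {"baseline", "english-code", "multilingual-code", "multilingual-text"}

def adapter_subfolder(variant: str, condition: str) -> str:
    """Return the repo subfolder for a (variant, condition) pair."""
    if variant not in VARIANTS or condition not in CONDITIONS:
        raise ValueError(f"unknown adapter: {variant}/{condition}")
    # Per the table, regional variants only carry multilingual-code adapters.
    if variant != "global" and condition != "multilingual-code":
        raise ValueError(f"{variant} only has a multilingual-code adapter")
    return f"{variant}/{condition}"

print(adapter_subfolder("fire", "multilingual-code"))  # fire/multilingual-code
```

The returned string can be passed directly as the `subfolder` argument to `PeftModel.from_pretrained`.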

## Training Details

- **Base models:** Tiny Aya 3.35B (Global, Fire, Earth, and Water variants by CohereLabs)
- **Method:** QLoRA (Quantized Low-Rank Adaptation)
- **Training data:** `Legesher/language-decoded-data`
- **Parameters:** 3.35B base, ~0.1% trainable via LoRA

Detailed hyperparameters and training configs will be added as training completes.
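Until then, a QLoRA setup of the kind described above might look like the following sketch. Every value here (rank, alpha, dropout, target modules, quantization settings) is an illustrative assumption, not a published hyperparameter:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# All values below are illustrative assumptions, not the project's
# published hyperparameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

lora_config = LoraConfig(
    r=16,                                   # assumed rank (~0.1% trainable params)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
```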

## Evaluation

Models are evaluated on multilingual reasoning benchmarks:

| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |

Results will be added as evaluation completes.

## Related Resources

## Citation

```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```

## License

CC-BY-NC 4.0 (inherited from the Tiny Aya base models).