---
license: cc-by-nc-4.0
language:
  - multilingual
tags:
  - lora
  - aya
  - tiny-aya
  - multilingual
  - code
  - legesher
  - tiny-aya-expedition
  - language-decoded
library_name: transformers
base_model:
  - CohereLabs/tiny-aya-global
  - CohereLabs/tiny-aya-fire
  - CohereLabs/tiny-aya-earth
  - CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---

# Language Decoded LoRA

LoRA adapters fine-tuned on multilingual code conditions for the Language Decoded project (part of Cohere's Tiny Aya Expedition).

## Research Question

Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?

## Base Models

All adapters are trained on variants of Tiny Aya (3.35B parameters), a multilingual model optimized for 70+ languages.

| Model | HF ID | Regional Strength |
|---|---|---|
| Global | `CohereLabs/tiny-aya-global` | Balanced across all languages |
| Fire | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
| Earth | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
| Water | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |

## Model Structure

This repo contains LoRA adapters organized by experimental condition and base model variant:

| Subdirectory | Condition | Training Data |
|---|---|---|
| `global/baseline/` | Condition 1 | No code augmentation |
| `global/english-code/` | Condition 2 | English-keyword Python code |
| `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, and Chinese keywords |
| `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
| `fire/multilingual-code/` | Regional | Urdu-keyword Python on the Fire variant |
| `earth/multilingual-code/` | Regional | Amharic-keyword Python on the Earth variant |
| `water/multilingual-code/` | Regional | Chinese-keyword Python on the Water variant |
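The multilingual-code conditions rest on transpiling Python keywords into other languages. The sketch below illustrates the general idea with the standard-library `tokenize` module; the keyword mapping and the tokenizer-based approach are assumptions for demonstration only, not the project's actual transpilation pipeline:

```python
import io
import tokenize

# Illustrative Urdu keyword mapping -- an assumption for demonstration,
# not the project's actual keyword set.
URDU_KEYWORDS = {"def": "تعریف", "return": "واپس", "if": "اگر", "else": "ورنہ"}

def transpile(source: str, mapping: dict) -> str:
    """Replace English Python keywords with target-language equivalents."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Keywords are lexed as NAME tokens; swap mapped ones, keep the rest.
        if tok.type == tokenize.NAME and tok.string in mapping:
            out.append((tok.type, mapping[tok.string]))
        else:
            out.append((tok.type, tok.string))
    return tokenize.untokenize(out)

code = "def square(x):\n    return x * x\n"
print(transpile(code, URDU_KEYWORDS))
```

Identifiers and literals pass through unchanged, so only the structural vocabulary of the language changes.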

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")

# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(
    base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code"
)

# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(
    base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code"
)
```
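When sweeping over adapters programmatically, the `subfolder` strings from the Model Structure table can be built with a small helper. This is hypothetical convenience code, not part of the repo; the valid combinations are inferred from that table:

```python
# Variants and conditions as laid out in the Model Structure table.
VARIANTS = {"global", "fire", "earth", "water"}
CONDITIONS = {"baseline", "english-code", "multilingual-code", "multilingual-text"}

def adapter_subfolder(variant: str, condition: str) -> str:
    """Return the repo subfolder for a (variant, condition) pair."""
    if variant not in VARIANTS or condition not in CONDITIONS:
        raise ValueError(f"unknown adapter: {variant}/{condition}")
    # Per the table, regional variants only carry multilingual-code adapters.
    if variant != "global" and condition != "multilingual-code":
        raise ValueError(f"{variant} only has a multilingual-code adapter")
    return f"{variant}/{condition}"

print(adapter_subfolder("fire", "multilingual-code"))  # fire/multilingual-code
```

The returned string can be passed directly as the `subfolder` argument to `PeftModel.from_pretrained`.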

## Training Details

- **Base models:** Tiny Aya 3.35B (Global, Fire, Earth, and Water variants by CohereLabs)
- **Method:** QLoRA (Quantized Low-Rank Adaptation)
- **Training data:** `Legesher/language-decoded-data`
- **Parameters:** 3.35B base, ~0.1% trainable via LoRA

Detailed hyperparameters and training configs will be added as training completes.
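Until then, a QLoRA setup of the kind described above might look like the following sketch. Every value here (rank, alpha, dropout, target modules, quantization settings) is an illustrative assumption, not a published hyperparameter:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# All values below are illustrative assumptions, not the project's
# published hyperparameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

lora_config = LoraConfig(
    r=16,                                   # assumed rank (~0.1% trainable params)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
```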

## Evaluation

Models are evaluated on multilingual reasoning benchmarks:

| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |

Results will be added as evaluation completes.

## Related Resources

## Citation

```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```

## License

CC-BY-NC 4.0 (inherited from the Tiny Aya base models).