---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- lora
- aya
- tiny-aya
- multilingual
- code
- legesher
- tiny-aya-expedition
- language-decoded
library_name: transformers
base_model:
- CohereLabs/tiny-aya-global
- CohereLabs/tiny-aya-fire
- CohereLabs/tiny-aya-earth
- CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---
# Language Decoded LoRA
LoRA adapters fine-tuned across multilingual code training conditions for the **Language Decoded** project (part of Cohere's Tiny Aya Expedition).
## Research Question
> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?
## Base Models
All adapters are trained on top of [Tiny Aya](https://huggingface.co/collections/CohereLabs/tiny-aya) (3.35B parameters), a multilingual model optimized for 70+ languages.
| Model | HF ID | Regional Strength |
|---|---|---|
| **Global** | `CohereLabs/tiny-aya-global` | Balanced across all languages |
| **Fire** | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
| **Earth** | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
| **Water** | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |
## Model Structure
This repo contains LoRA adapters organized by experimental condition and base model variant:
| Subdirectory | Condition | Training Data |
|---|---|---|
| `global/baseline/` | Condition 1 | No code augmentation |
| `global/english-code/` | Condition 2 | English-keyword Python code |
| `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, Chinese keywords |
| `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
| `fire/multilingual-code/` | Regional | Urdu-keyword Python on Fire variant |
| `earth/multilingual-code/` | Regional | Amharic-keyword Python on Earth variant |
| `water/multilingual-code/` | Regional | Chinese-keyword Python on Water variant |
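When sweeping over conditions, the table above can be expressed as a small lookup that produces the `subfolder` argument expected by PEFT. This is an illustrative sketch only; `ADAPTER_DIRS` and `adapter_subfolder` are not part of this repo.

```python
# Map (base variant, condition) pairs to the adapter subfolders in this repo.
# Illustrative helper, not shipped with the repo itself.
ADAPTER_DIRS = {
    ("global", "baseline"): "global/baseline",
    ("global", "english-code"): "global/english-code",
    ("global", "multilingual-code"): "global/multilingual-code",
    ("global", "multilingual-text"): "global/multilingual-text",
    ("fire", "multilingual-code"): "fire/multilingual-code",
    ("earth", "multilingual-code"): "earth/multilingual-code",
    ("water", "multilingual-code"): "water/multilingual-code",
}

def adapter_subfolder(base: str, condition: str) -> str:
    """Return the `subfolder` argument for PeftModel.from_pretrained."""
    key = (base, condition)
    if key not in ADAPTER_DIRS:
        # Regional variants only have a multilingual-code adapter.
        raise ValueError(f"No adapter for base={base!r}, condition={condition!r}")
    return ADAPTER_DIRS[key]
```

Note that only the Global variant covers all four conditions; the Fire, Earth, and Water variants were trained only under the multilingual-code condition.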
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")
# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code")
# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code")
```
## Training Details
- **Base models**: Tiny Aya 3.35B — Global, Fire, Earth, Water ([CohereLabs](https://huggingface.co/CohereLabs))
- **Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Parameters**: 3.35B base, ~0.1% trainable via LoRA
*Detailed hyperparameters and training configs will be added as training completes.*
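As a rough sanity check on the "~0.1% trainable" figure: LoRA adds two low-rank factors per adapted weight matrix, so a matrix of shape `(d_out, d_in)` at rank `r` contributes `r * (d_in + d_out)` trainable parameters. The sketch below uses hypothetical values (rank, hidden size, layer count, and which projections are adapted are placeholders, since the real training config is not yet published):

```python
def lora_params(shapes, r):
    """Total LoRA trainable params at rank r: each weight W (d_out x d_in)
    gains factors A (r x d_in) and B (d_out x r)."""
    return sum(r * (d_in + d_out) for (d_out, d_in) in shapes)

# HYPOTHETICAL config, not the actual Language Decoded setup:
# 30 layers, adapting only the q and v projections of a 3072-wide model.
hidden = 3072
shapes = [(hidden, hidden)] * 2 * 30
trainable = lora_params(shapes, r=8)
base = 3_350_000_000  # Tiny Aya, 3.35B parameters

# ~2.9M trainable, roughly 0.09% of the 3.35B base
print(f"{trainable:,} trainable ({100 * trainable / base:.3f}% of base)")
```

Under these placeholder assumptions the trainable fraction lands near the quoted ~0.1%; the exact figure depends on the rank and target modules, which will be documented with the final training configs.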
## Evaluation
Models are evaluated on multilingual reasoning benchmarks:
| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |
*Results will be added as evaluation completes.*
## Related Resources
- **Base models**: [Tiny Aya Collection](https://huggingface.co/collections/CohereLabs/tiny-aya)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Community code**: [Legesher/language-decoded-community](https://huggingface.co/datasets/Legesher/language-decoded-community)
- **Experiments**: [Legesher/language-decoded-experiments](https://huggingface.co/datasets/Legesher/language-decoded-experiments)
- **Transpilation tool**: [Legesher](https://github.com/Legesher/legesher)
## Citation
```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```
## License
CC-BY-NC 4.0 (inherits from Tiny Aya base models)