madiedgar committed · Commit f48842c (verified) · 1 Parent(s): b51e5c6

init: create README.md

Files changed (1): README.md (+118 −3)
---
license: cc-by-nc-4.0
language:
- multilingual
tags:
- lora
- aya
- tiny-aya
- multilingual
- code
- legesher
- tiny-aya-expedition
- language-decoded
library_name: transformers
base_model:
- CohereLabs/tiny-aya-global
- CohereLabs/tiny-aya-fire
- CohereLabs/tiny-aya-earth
- CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---

# Language Decoded LoRA

LoRA adapters fine-tuned under multilingual code-augmentation conditions for the **Language Decoded** project (part of Cohere's Tiny Aya Expedition).

## Research Question

> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?

## Base Models

All adapters are trained on [Tiny Aya](https://huggingface.co/collections/CohereLabs/tiny-aya) (3.35B parameters), a multilingual model optimized for 70+ languages.

| Model | HF ID | Regional Strength |
|---|---|---|
| **Global** | `CohereLabs/tiny-aya-global` | Balanced across all languages |
| **Fire** | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
| **Earth** | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
| **Water** | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |

## Model Structure

This repo contains LoRA adapters organized by experimental condition and base model variant:

| Subdirectory | Condition | Training Data |
|---|---|---|
| `global/baseline/` | Condition 1 | No code augmentation |
| `global/english-code/` | Condition 2 | English-keyword Python code |
| `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, and Chinese keywords |
| `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
| `fire/multilingual-code/` | Regional | Urdu-keyword Python on the Fire variant |
| `earth/multilingual-code/` | Regional | Amharic-keyword Python on the Earth variant |
| `water/multilingual-code/` | Regional | Chinese-keyword Python on the Water variant |

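To make the keyword-transpilation conditions concrete, here is a toy sketch of whole-word keyword substitution in the spirit of the Legesher tool. The Urdu mapping and the `transpile` helper below are illustrative assumptions, not Legesher's actual translation tables or API.

```python
# Toy keyword transpiler (illustrative only).
# The Urdu keyword mapping is a hypothetical example, NOT Legesher's real table.
import re

URDU_KEYWORDS = {"if": "اگر", "else": "ورنہ", "def": "تعریف", "return": "واپس"}

def transpile(source: str, mapping: dict) -> str:
    # Replace whole-word English keywords with their target-language forms;
    # \b boundaries keep identifiers like "differ" untouched.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)

code = "def sign(x):\n    if x >= 0:\n        return 1\n    return -1"
print(transpile(code, URDU_KEYWORDS))
```

A real transpiler must also respect strings, comments, and grammar, which is what the Legesher tool linked below handles.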
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model (Global variant)
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")

# Load a LoRA adapter (e.g., multilingual code on Global)
model = PeftModel.from_pretrained(base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code")

# Or load a regional variant (e.g., Urdu code on Fire)
base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
model_fire = PeftModel.from_pretrained(base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code")
```

## Training Details

- **Base models**: Tiny Aya 3.35B — Global, Fire, Earth, Water ([CohereLabs](https://huggingface.co/CohereLabs))
- **Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Parameters**: 3.35B base, ~0.1% trainable via LoRA

*Detailed hyperparameters and training configs will be added as training completes.*

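A back-of-envelope sanity check of the "~0.1% trainable" figure. The hidden size, layer count, and LoRA rank below are assumed values for illustration, not Tiny Aya's published architecture:

```python
# Rough check of the "~0.1% trainable" claim under ASSUMED shapes.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two low-rank factors per adapted layer: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

hidden = 2048      # assumed hidden size
n_layers = 30      # assumed number of transformer layers
rank = 8           # assumed LoRA rank
total = 3.35e9     # base parameter count from this card

# Adapting the q/k/v/o projections (hidden x hidden) in every layer:
trainable = n_layers * 4 * lora_params(hidden, hidden, rank)
print(f"trainable: {trainable:,} ({trainable / total:.2%} of base)")
```

With these assumptions the trainable fraction lands near 0.1%, consistent with the bullet above; the exact figure depends on the real rank and target modules.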
## Evaluation

Models are evaluated on multilingual reasoning benchmarks:

| Benchmark | Task | Languages |
|---|---|---|
| XNLI | Natural language inference | 15 |
| XStoryCloze | Story completion | 11 |
| TyDi QA | Question answering | 11 |
| MMLU | Knowledge | Multilingual |

*Results will be added as evaluation completes.*

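Per-language scores on suites like XNLI are typically summarized as a macro average over languages. A minimal sketch of that aggregation, with invented placeholder numbers (not project results):

```python
# Macro-average across languages, the usual headline number for XNLI-style suites.
# All accuracies below are invented placeholders, NOT results from this project.
per_lang_accuracy = {"en": 0.81, "ur": 0.64, "am": 0.58, "zh": 0.72}
macro = sum(per_lang_accuracy.values()) / len(per_lang_accuracy)
print(f"macro accuracy: {macro:.4f}")  # → 0.6875
```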
## Related Resources

- **Base models**: [Tiny Aya Collection](https://huggingface.co/collections/CohereLabs/tiny-aya)
- **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- **Community code**: [Legesher/language-decoded-community](https://huggingface.co/datasets/Legesher/language-decoded-community)
- **Experiments**: [Legesher/language-decoded-experiments](https://huggingface.co/datasets/Legesher/language-decoded-experiments)
- **Transpilation tool**: [Legesher](https://github.com/Legesher/legesher)

## Citation

```bibtex
@misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
  author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Legesher/language-decoded-lora}
}
```

## License

CC-BY-NC 4.0 (inherits from the Tiny Aya base models)