madiedgar committed
Commit e4f13e7 · 1 Parent(s): f48842c

Update model card: fix conditions, base model, benchmarks, languages (#1)


- Update model card: fix conditions, base model, benchmarks, languages (bf55f60c10d5d40de2f0c7a6653c7e707beef94c)

Files changed (1)
  1. README.md +51 -48
README.md CHANGED
@@ -1,7 +1,10 @@
---
license: cc-by-nc-4.0
language:
- - multilingual
tags:
- lora
- aya
@@ -13,45 +16,44 @@ tags:
- language-decoded
library_name: transformers
base_model:
- - CohereLabs/tiny-aya-global
- - CohereLabs/tiny-aya-fire
- - CohereLabs/tiny-aya-earth
- - CohereLabs/tiny-aya-water
pipeline_tag: text-generation
---

# Language Decoded LoRA

- LoRA adapters fine-tuned on multilingual code conditions for the **Language Decoded** project (part of Cohere's Tiny Aya Expedition).

## Research Question

> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?

- ## Base Models
-
- All adapters are trained on [Tiny Aya](https://huggingface.co/collections/CohereLabs/tiny-aya) (3.35B parameters), a multilingual model optimized for 70+ languages.

- | Model | HF ID | Regional Strength |
- |---|---|---|
- | **Global** | `CohereLabs/tiny-aya-global` | Balanced across all languages |
- | **Fire** | `CohereLabs/tiny-aya-fire` | South Asian (Urdu) |
- | **Earth** | `CohereLabs/tiny-aya-earth` | West Asian & African (Amharic) |
- | **Water** | `CohereLabs/tiny-aya-water` | European & Asia Pacific (Chinese) |

## Model Structure

- This repo contains LoRA adapters organized by experimental condition and base model variant:

| Subdirectory | Condition | Training Data |
|---|---|---|
- | `global/baseline/` | Condition 1 | No code augmentation |
- | `global/english-code/` | Condition 2 | English-keyword Python code |
- | `global/multilingual-code/` | Condition 3 | Python transpiled to Urdu, Amharic, Chinese keywords |
- | `global/multilingual-text/` | Condition 4 | Non-code multilingual text |
- | `fire/multilingual-code/` | Regional | Urdu-keyword Python on Fire variant |
- | `earth/multilingual-code/` | Regional | Amharic-keyword Python on Earth variant |
- | `water/multilingual-code/` | Regional | Chinese-keyword Python on Water variant |

## Usage

@@ -59,47 +61,48 @@ This repo contains LoRA adapters organized by experimental condition and base mo
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

- # Load base model (Global variant)
- base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-global")
- tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-global")

- # Load a LoRA adapter (e.g., multilingual code on Global)
- model = PeftModel.from_pretrained(base_model, "Legesher/language-decoded-lora", subfolder="global/multilingual-code")

- # Or load a regional variant (e.g., Urdu code on Fire)
- base_fire = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-fire")
- model_fire = PeftModel.from_pretrained(base_fire, "Legesher/language-decoded-lora", subfolder="fire/multilingual-code")
```

## Training Details

- - **Base models**: Tiny Aya 3.35B — Global, Fire, Earth, Water ([CohereLabs](https://huggingface.co/CohereLabs))
- - **Method**: QLoRA (Quantized Low-Rank Adaptation)
- - **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- - **Parameters**: 3.35B base, ~0.1% trainable via LoRA

*Detailed hyperparameters and training configs will be added as training completes.*

## Evaluation

- Models are evaluated on multilingual reasoning benchmarks:

- | Benchmark | Task | Languages |
|---|---|---|
- | XNLI | Natural language inference | 15 |
- | XStoryCloze | Story completion | 11 |
- | TyDi QA | Question answering | 11 |
- | MMLU | Knowledge | Multilingual |

*Results will be added as evaluation completes.*

## Related Resources

- - **Base models**: [Tiny Aya Collection](https://huggingface.co/collections/CohereLabs/tiny-aya)
- - **Training data**: [Legesher/language-decoded-data](https://huggingface.co/datasets/Legesher/language-decoded-data)
- - **Community code**: [Legesher/language-decoded-community](https://huggingface.co/datasets/Legesher/language-decoded-community)
- - **Experiments**: [Legesher/language-decoded-experiments](https://huggingface.co/datasets/Legesher/language-decoded-experiments)
- - **Transpilation tool**: [Legesher](https://github.com/Legesher/legesher)

## Citation

@@ -109,10 +112,10 @@ Models are evaluated on multilingual reasoning benchmarks:
author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
year={2026},
publisher={Hugging Face},
- url={https://huggingface.co/Legesher/language-decoded-lora}
}
```

## License

- CC-BY-NC 4.0 (inherits from Tiny Aya base models)
 
---
license: cc-by-nc-4.0
language:
+ - en
+ - zh
+ - es
+ - ur
tags:
- lora
- aya

- language-decoded
library_name: transformers
base_model:
+ - CohereLabs/tiny-aya-base
pipeline_tag: text-generation
---

# Language Decoded LoRA

+ QLoRA adapters fine-tuned on multilingual code conditions for the **Language Decoded** project (part of [Cohere's Tiny Aya Expedition](https://aya.for.ai)).

## Research Question

> Does fine-tuning on non-English code improve multilingual reasoning — and is the benefit language-dependent or structure-dependent?

+ ## Base Model

+ All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B parameters).

## Model Structure

+ This repo contains QLoRA adapters organized by experimental condition:

| Subdirectory | Condition | Training Data |
|---|---|---|
+ | `baseline/` | Baseline | No fine-tuning (base model eval only) |
+ | `condition-1-en/` | Condition 1 | English Python from The Stack Dedup |
+ | `condition-2-zh/` | Condition 2 | Chinese keyword-swapped Python (Legesher-transpiled) |
+ | `condition-2-es/` | Condition 2 | Spanish keyword-swapped Python (Legesher-transpiled) |
+ | `condition-2-ur/` | Condition 2 | Urdu keyword-swapped Python (Legesher-transpiled) |
+ | `condition-3-zh/` | Condition 3 | Transpiled + native Chinese code (Wenyan + community) |
+ | `condition-3-es/` | Condition 3 | Transpiled + native Spanish code (Latino + community) |
+ | `condition-3-ur/` | Condition 3 | Transpiled + native Urdu code (Qalb + community) |
+ | `condition-4-combined/` | Condition 4 | All strictly native code (combined) |
+
+ ### The Experimental Ladder
+
+ - **Baseline → 1**: Does code help at all?
+ - **1 → 2**: Does the language of keywords matter?
+ - **2 → 3**: Does diversity of native-language sources add value beyond keyword swap?
+ - **3 → 4**: Does code written in the cultural context of a language carry unique signal?
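The keyword-swap conditions above can be illustrated with a minimal sketch of the Legesher-style transpilation step (Python → Spanish here, the direction used to generate Condition 2 data). The mapping table and function name below are hypothetical illustrations; Legesher's actual per-language keyword tables may differ:

```python
import re

# Hypothetical Python-to-Spanish keyword table (Legesher's real mapping may differ).
PY_TO_ES = {
    "def": "definir",
    "if": "si",
    "else": "sino",
    "for": "para",
    "while": "mientras",
    "return": "devolver",
}

_KEYWORD = re.compile(r"\b(" + "|".join(PY_TO_ES) + r")\b")

def to_spanish(source: str) -> str:
    """Swap Python keywords for Spanish ones, leaving identifiers intact."""
    return _KEYWORD.sub(lambda m: PY_TO_ES[m.group(1)], source)

print(to_spanish("def doble(x):\n    return x * 2"))
# prints:
# definir doble(x):
#     devolver x * 2
```

The word-boundary regex ensures identifiers containing keyword substrings (e.g. `offset`, `default`) are left untouched — only standalone keywords are swapped.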

## Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

+ # Load base model
+ base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
+ tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

+ # Load a LoRA adapter (e.g., Condition 1 — English code)
+ model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-1-en")

+ # Load a language-specific adapter (e.g., Condition 2 — Chinese keyword-swapped)
+ model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-2-zh")
  ```

## Training Details

+ | Parameter | Value |
+ |---|---|
+ | Base model | [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B params) |
+ | Method | QLoRA 4-bit (NF4), ~5.4GB VRAM |
+ | Hardware | Kaggle T4 (16GB) |
+ | Tokenizer | CohereLabs/tiny-aya-base |
+ | Transpilation tool | [Legesher](https://github.com/legesher/legesher) v0.7.3 |
+ | Training data | [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data) |

*Detailed hyperparameters and training configs will be added as training completes.*

## Evaluation

+ Models are evaluated on multilingual reasoning benchmarks with dual prompts (English + language-specific):

+ | Benchmark | What it measures | Examples per language |
|---|---|---|
+ | MGSM | Math reasoning | 250 (full set) |
+ | X-CSQA | Commonsense reasoning | ~1,000 (full set) |
+ | XNLI | Natural language inference | ~5,000 (full set) |
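The dual-prompt setup can be sketched as a pair of prompt variants per benchmark item. The instruction wording and function name below are hypothetical, since the card does not publish its actual prompt templates:

```python
def dual_prompts(question: str, native_instruction: str) -> dict[str, str]:
    """Build the two evaluation prompts for one benchmark item:
    an English instruction and a language-specific one."""
    return {
        "en": f"Answer the following question step by step.\n\n{question}",
        "native": f"{native_instruction}\n\n{question}",
    }

# Example: a Spanish MGSM-style item (illustrative).
prompts = dual_prompts(
    "Si tengo 3 manzanas y compro 4 más, ¿cuántas tengo?",
    "Responde la siguiente pregunta paso a paso.",
)
```

Scoring each adapter under both variants separates gains in the underlying capability from gains in understanding the prompt language itself.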
 

*Results will be added as evaluation completes.*

## Related Resources

+ - **Training data**: [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data)
+ - **Community code**: [legesher/language-decoded-community](https://huggingface.co/datasets/legesher/language-decoded-community)
+ - **Experiment tracking**: [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments)
+ - **Transpilation tool**: [Legesher on GitHub](https://github.com/legesher/legesher)

## Citation

author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
year={2026},
publisher={Hugging Face},
+ url={https://huggingface.co/legesher/language-decoded-lora}
}
```

## License

+ CC-BY-NC 4.0 (inherits from Tiny Aya base model)