madiedgar committed · Commit 689d1ea · 1 Parent(s): 57a2134

docs: fix license to Apache 2.0, add hyperparameters + limitations, update citations (#3)

- docs: fix license to Apache 2.0, add hyperparameters + limitations, update citations (52c78860fd325c1f14657b96be34609c1d835ea4)

Files changed (1)
  1. README.md +65 -51
README.md CHANGED
@@ -1,23 +1,23 @@
  ---
- license: cc-by-nc-4.0
  language:
- - en
- - zh
- - es
- - ur
  tags:
- - lora
- - aya
- - tiny-aya
- - multilingual
- - code
- - legesher
- - tiny-aya-expedition
- - language-decoded
- - unsloth
  library_name: transformers
  base_model:
- - CohereLabs/tiny-aya-base
  pipeline_tag: text-generation
  ---
 
@@ -35,26 +35,22 @@ All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/Co

  ## Model Structure

- This repo contains QLoRA adapters organized by experimental condition:

- | Subdirectory | Condition | Training Data |
- |---|---|---|
- | `baseline/` | Baseline | No fine-tuning (base model eval only) |
- | `condition-1-en/` | Condition 1 | English Python from The Stack Dedup |
- | `condition-2-zh/` | Condition 2 | Chinese keyword-swapped Python (Legesher-transpiled) |
- | `condition-2-es/` | Condition 2 | Spanish keyword-swapped Python (Legesher-transpiled) |
- | `condition-2-ur/` | Condition 2 | Urdu keyword-swapped Python (Legesher-transpiled) |
- | `condition-3-zh/` | Condition 3 | Transpiled + native Chinese code (Wenyan + community) |
- | `condition-3-es/` | Condition 3 | Transpiled + native Spanish code (Latino + community) |
- | `condition-3-ur/` | Condition 3 | Transpiled + native Urdu code (Qalb + community) |
- | `condition-4-combined/` | Condition 4 | All strictly native code (combined) |

  ### The Experimental Ladder

- - **Baseline → 1**: Does code help at all?
- - **1 → 2**: Does the language of keywords matter?
- - **2 → 3**: Does diversity of native-language sources add value beyond keyword swap?
- - **3 → 4**: Does code written in the cultural context of a language carry unique signal?

  ## Usage
 
@@ -67,36 +63,54 @@ base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
  tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

  # Load a LoRA adapter (e.g., Condition 1 — English code)
- model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-1-en")

  # Load a language-specific adapter (e.g., Condition 2 — Chinese keyword-swapped)
- model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-2-zh")
  ```

  ## Training Details

- | Parameter | Value |
- |---|---|
- | Base model | [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B params) |
- | Method | QLoRA 4-bit (NF4), ~5.4GB VRAM |
- | Hardware | Kaggle T4 (16GB) |
- | Tokenizer | CohereLabs/tiny-aya-base |
- | Transpilation tool | [Legesher](https://github.com/legesher/legesher) v0.7.3 |
- | Training data | [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data) |
-
- *Detailed hyperparameters and training configs will be added as training completes.*

  ## Evaluation

  Models are evaluated on multilingual reasoning benchmarks with dual prompts (English + language-specific):

- | Benchmark | What it measures | Examples per language |
- |---|---|---|
- | MGSM | Math reasoning | 250 (full set) |
- | X-CSQA | Commonsense reasoning | ~1,000 (full set) |
- | XNLI | Natural language inference | ~5,000 (full set) |

- *Results will be added as evaluation completes.*

  ## Related Resources
 
@@ -110,7 +124,7 @@ Models are evaluated on multilingual reasoning benchmarks with dual prompts (Eng
  ```bibtex
  @misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
- author={Madison Edgar and Saad Bazaz and Rafay Mustafa and Sarah Jawaid and Rashik Shahjahan and Khojasteh Mirza and Sohaib Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/legesher/language-decoded-lora}
@@ -119,4 +133,4 @@ Models are evaluated on multilingual reasoning benchmarks with dual prompts (Eng

  ## License

- CC-BY-NC 4.0 (inherits from Tiny Aya base model)
 
  ---
+ license: apache-2.0
  language:
+ - en
+ - zh
+ - es
+ - ur
  tags:
+ - lora
+ - aya
+ - tiny-aya
+ - multilingual
+ - code
+ - legesher
+ - tiny-aya-expedition
+ - language-decoded
+ - unsloth
  library_name: transformers
  base_model:
+ - CohereLabs/tiny-aya-base
  pipeline_tag: text-generation
  ---
 
 
  ## Model Structure

+ This repo is the canonical hub for all Language Decoded LoRA adapters, organized by experimental condition:

+ | Subdirectory | Condition | Training Data |
+ | -------------------- | ----------- | ---------------------------------------------------- |
+ | `condition-1-en-5k/` | Condition 1 | English Python from The Stack Dedup (5k subset) |
+ | `condition-2-zh-5k/` | Condition 2 | Chinese keyword-swapped Python (Legesher-transpiled) |
+ | `condition-2-es-5k/` | Condition 2 | Spanish keyword-swapped Python (Legesher-transpiled) |
+ | `condition-2-ur-5k/` | Condition 2 | Urdu keyword-swapped Python (Legesher-transpiled) |
+ | `condition-3-zh-5k/` | Condition 3 | Transpiled + native Chinese code (blended) |
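Editor's aside: for intuition, the keyword swapping behind the Condition 2 rows above can be sketched as a toy transpiler. The keyword map below is purely illustrative; Legesher's real keyword tables and its tokenizer-aware rewriting (which respects strings, comments, and identifiers) are not reproduced here.

```python
import re

# Toy English->Spanish keyword map (illustrative only; not Legesher's mapping).
EN_TO_ES = {
    "if": "si",
    "else": "sino",
    "for": "para",
    "while": "mientras",
    "def": "definir",
    "return": "retornar",
}

def swap_keywords(source: str, mapping: dict) -> str:
    """Replace whole-word Python keywords with target-language equivalents.

    Naive regex substitution: a real transpiler must tokenize so that
    keywords inside strings, comments, or identifiers are left alone.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)

code = "def mayor(a, b):\n    if a > b:\n        return a\n    return b\n"
print(swap_keywords(code, EN_TO_ES))
```

The structure of the program (indentation, operators, control flow) is untouched; only the surface keywords change, which is exactly the variable the Condition 1 vs. Condition 2 comparison isolates.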
  ### The Experimental Ladder

+ - **Baseline --> 1**: Does code help at all?
+ - **1 --> 2**: Does the language of keywords matter?
+ - **2 --> 3**: Does diversity of native-language sources add value beyond keyword swap?
+ - **3 --> 4**: Does code written in the cultural context of a language carry unique signal?

  ## Usage
 
  tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

  # Load a LoRA adapter (e.g., Condition 1 — English code)
+ model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-1-en-5k")

  # Load a language-specific adapter (e.g., Condition 2 — Chinese keyword-swapped)
+ model = PeftModel.from_pretrained(base_model, "legesher/language-decoded-lora", subfolder="condition-2-zh-5k")
  ```
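Editor's aside: since every adapter lives in this one repo, the `subfolder` string is the only thing that varies between loads. A small helper like the one below (hypothetical, not part of this repo's API; subfolder names taken from the Model Structure table) can guard against typos before any download happens.

```python
# Hypothetical convenience helper (not shipped with this repo).
ADAPTER_REPO = "legesher/language-decoded-lora"

# Subfolders as listed in the Model Structure table of this README.
KNOWN_SUBFOLDERS = {
    "condition-1-en-5k",
    "condition-2-zh-5k",
    "condition-2-es-5k",
    "condition-2-ur-5k",
    "condition-3-zh-5k",
}

def adapter_subfolder(condition: int, lang: str) -> str:
    """Build and validate the subfolder name for (condition, language)."""
    name = f"condition-{condition}-{lang}-5k"
    if name not in KNOWN_SUBFOLDERS:
        raise ValueError(f"unknown adapter subfolder: {name}")
    return name

# Usage with the snippet above, e.g.:
# model = PeftModel.from_pretrained(base_model, ADAPTER_REPO,
#                                   subfolder=adapter_subfolder(2, "zh"))
print(adapter_subfolder(2, "zh"))
```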
71
 
72
  ## Training Details
73
 
74
+ | Parameter | Value |
75
+ | ------------------ | ------------------------------------------------------------------------------------------------ |
76
+ | Base model | [CohereLabs/tiny-aya-base](https://huggingface.co/CohereLabs/tiny-aya-base) (3.35B params) |
77
+ | Method | QLoRA 4-bit (NF4), ~5.4GB VRAM |
78
+ | Hardware | Kaggle T4 (16GB) |
79
+ | Tokenizer | CohereLabs/tiny-aya-base |
80
+ | Transpilation tool | [Legesher](https://github.com/legesher/legesher) v0.7.3 |
81
+ | Training data | [legesher/language-decoded-data](https://huggingface.co/datasets/legesher/language-decoded-data) |
82
+
83
+ ### QLoRA Hyperparameters
84
+
85
+ | Parameter | Value |
86
+ | --------------- | ------------------------------------------------------------- |
87
+ | LoRA rank (`r`) | 16 |
88
+ | LoRA alpha | 32 |
89
+ | LoRA dropout | 0.0 |
90
+ | Target modules | q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj |
91
+ | Bias | none |
92
+ | Task type | CAUSAL_LM |
93
+ | PEFT version | 0.18.1 |
94
+ | Quantization | NF4 (4-bit) via Unsloth |
95
 
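Editor's aside: the hyperparameter table above corresponds to a PEFT `adapter_config.json` along these lines (an abbreviated, illustrative reconstruction; the actual file in each subfolder may carry additional fields):

```json
{
  "peft_type": "LORA",
  "task_type": "CAUSAL_LM",
  "base_model_name_or_path": "CohereLabs/tiny-aya-base",
  "r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.0,
  "bias": "none",
  "target_modules": [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "up_proj", "down_proj", "gate_proj"
  ]
}
```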
  ## Evaluation

  Models are evaluated on multilingual reasoning benchmarks with dual prompts (English + language-specific):

+ | Benchmark | What it measures | Examples per language |
+ | --------- | -------------------------- | --------------------- |
+ | MGSM | Math reasoning | 250 (full set) |
+ | X-CSQA | Commonsense reasoning | ~1,000 (full set) |
+ | XNLI | Natural language inference | ~5,000 (full set) |
+
+ _Results will be added as evaluation completes._
+
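Editor's aside: the dual-prompt protocol boils down to scoring each item twice, once under an English prompt and once under a language-specific prompt, then aggregating per (benchmark, language, prompt style). The records below are placeholders standing in for real model outputs on MGSM/X-CSQA/XNLI; only the aggregation logic is shown.

```python
from collections import defaultdict

# Placeholder records: (benchmark, language, prompt_style, correct).
# A real run would fill these from model predictions on the benchmarks.
records = [
    ("MGSM", "zh", "en_prompt", True),
    ("MGSM", "zh", "native_prompt", False),
    ("MGSM", "zh", "en_prompt", False),
    ("MGSM", "zh", "native_prompt", True),
]

def accuracy_by_prompt(records):
    """Aggregate accuracy per (benchmark, language, prompt_style) key."""
    totals = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_seen]
    for bench, lang, style, correct in records:
        key = (bench, lang, style)
        totals[key][0] += int(correct)
        totals[key][1] += 1
    return {k: n_correct / n_seen for k, (n_correct, n_seen) in totals.items()}

print(accuracy_by_prompt(records))
```

Comparing the `en_prompt` and `native_prompt` scores for the same adapter is what separates "the adapter helps" from "the adapter helps in that language".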
+ ## Limitations

+ - **Single base model**: All adapters are trained on CohereLabs/tiny-aya-base (3.35B params). Results may not generalize to larger or architecturally different models.
+ - **Limited training data**: Each condition uses a 5k-file subset for QLoRA fine-tuning, constrained by Kaggle T4 hardware limits.
+ - **Evaluation scope**: Currently evaluated on 3 benchmarks (MGSM, X-CSQA, XNLI). Other reasoning tasks may show different patterns.
+ - **Consumer hardware**: Training on Kaggle T4 (16GB) with 4-bit quantization introduces approximation that may affect adapter quality compared to full-precision training.
  ## Related Resources

  ```bibtex
  @misc{language-decoded-2026,
  title={Language Decoded: Investigating Language-Dependent vs. Structure-Dependent Reasoning Benefits of Code},
+ author={Madison Edgar and Saad Ahmed Bazaz and Tom Sherborne and Rashik Shahjahan and Khojasteh Mirza and Sarah Jawaid and Rafay Mustafa and Sohaib Ahmed Bazaz},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/legesher/language-decoded-lora}

  ## License

+ Apache 2.0