docs: delineate Phase 2 (Stack v1) vs Phase 3 (Stack v2-dedup) adapters

#12
Files changed (2)
  1. MANIFEST.md +43 -0
  2. README.md +55 -13
MANIFEST.md ADDED
@@ -0,0 +1,43 @@
+ # Adapter Manifest: `legesher/language-decoded-lora`
+
+ Every trained adapter in this repo, mapped to its project phase and source code corpus. Paper citations should use **Phase 3** adapters only. See the README [Provenance & Manifest](README.md#provenance--manifest) section for the summary.
+
+ Generated from the repo file tree; one row per directory containing `adapter_config.json`.
+
+ ## Phase 3: paper adapters (`bigcode/the-stack-v2-dedup`, Legesher v0.7.3)
+
+ | Adapter path | Condition | Seed |
+ | --- | --- | --- |
+ | `tiny-aya-base/condition-1-en-20k-seed42/` | 1 | 42 |
+ | `tiny-aya-base/condition-1-en-5k-seed123/` | 1 | 123 |
+ | `tiny-aya-base/condition-1-en-5k-seed42/` | 1 | 42 |
+ | `tiny-aya-base/condition-1-en-5k-seed456/` | 1 | 456 |
+ | `tiny-aya-base/condition-2-es-20k-seed42/` | 2 | 42 |
+ | `tiny-aya-base/condition-2-es-5k-seed123/` | 2 | 123 |
+ | `tiny-aya-base/condition-2-es-5k-seed42/` | 2 | 42 |
+ | `tiny-aya-base/condition-2-es-5k-seed456/` | 2 | 456 |
+ | `tiny-aya-base/condition-2-ur-20k-seed42/` | 2 | 42 |
+ | `tiny-aya-base/condition-2-ur-5k-seed123/` | 2 | 123 |
+ | `tiny-aya-base/condition-2-ur-5k-seed42/` | 2 | 42 |
+ | `tiny-aya-base/condition-2-ur-5k-seed456/` | 2 | 456 |
+ | `tiny-aya-base/condition-2-zh-20k-seed42/` | 2 | 42 |
+ | `tiny-aya-base/condition-2-zh-5k-seed123/` | 2 | 123 |
+ | `tiny-aya-base/condition-2-zh-5k-seed42/` | 2 | 42 |
+ | `tiny-aya-base/condition-2-zh-5k-seed456/` | 2 | 456 |
+ | `tiny-aya-base/condition-3-zh-5k-native-code-seed42/` | 3 | 42 |
+ | `tiny-aya-base/condition-5-es-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+ | `tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+ | `tiny-aya-base/condition-5-zh-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+
+ ## Phase 2: preliminary adapters (`bigcode/the-stack` v1, Legesher v0.5.1 / v0.6.0)
+
+ > Retained for reproducibility of the March-2026 hackathon results. **Not cited in the paper.** The standalone repos that mirrored these were renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this repo.
+
+ | Adapter path | Condition | Tier |
+ | --- | --- | --- |
+ | `condition-1-en-32k/` | 1 | 32k |
+ | `condition-1-en-5k/` | 1 | 5k |
+ | `condition-2-es-5k/` | 2 | 5k |
+ | `condition-2-ur-5k/` | 2 | 5k |
+ | `condition-2-zh-5k/` | 2 | 5k |
+ | `condition-3-zh-5k/` | 3 | 5k |
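The manifest says it is generated from the repo file tree, with one row per directory containing `adapter_config.json`. The generator itself is not part of this diff, so the following is only a minimal sketch of how such rows could be derived from a list of repo file paths. It assumes the conventions visible in the two tables: Phase 3 adapters sit one level below `tiny-aya-base/`, Phase 2 adapters are flat top-level folders, and folder names start with `condition-<n>-<lang>-<tier>` with an optional `-seed<k>` suffix. The name `manifest_rows` and both regular expressions are illustrative, not the project's tooling.

```python
import re
from pathlib import PurePosixPath

# Assumptions inferred from the manifest tables, not from project code.
PHASE3_PREFIX = "tiny-aya-base"
NAME_HEAD = re.compile(r"^condition-(\d+)-([a-z]+)-(\d+k)")
SEED_SUFFIX = re.compile(r"-seed(\d+)$")


def manifest_rows(file_paths):
    """Return one dict per adapter directory found in file_paths."""
    rows = []
    for path in file_paths:
        p = PurePosixPath(path)
        # An adapter directory is identified by its adapter_config.json.
        if p.name != "adapter_config.json":
            continue
        parts = p.parent.parts
        if len(parts) == 1:
            phase = 2  # flat top-level folder
        elif len(parts) == 2 and parts[0] == PHASE3_PREFIX:
            phase = 3  # one level below tiny-aya-base/
        else:
            continue  # layout not described in the README; skip it
        name = parts[-1]
        head = NAME_HEAD.match(name)
        if head is None:
            continue
        seed = SEED_SUFFIX.search(name)
        rows.append({
            "path": "/".join(parts) + "/",
            "phase": phase,
            "condition": int(head.group(1)),
            "language": head.group(2),
            "tier": head.group(3),
            "seed": int(seed.group(1)) if seed else None,
        })
    return sorted(rows, key=lambda row: row["path"])
```

In practice the path list could come from a repo listing such as `huggingface_hub.list_repo_files("legesher/language-decoded-lora")`; the sketch takes a plain list so it can be checked without network access.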
README.md CHANGED
@@ -51,19 +51,41 @@ All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/Co
 
  ## Adapter Inventory
 
- Each subdirectory is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under [Training Details](#training-details).
 
- | Subdirectory | Condition | Training data | Seeds |
- | ----------------------------------------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
- | `condition-1-en-5k-seed{42,123,456}/` | 1 | Raw English Python from `bigcode/the-stack-v2-dedup` (5k file subset) | 42, 123, 456 |
- | `condition-1-en-20k-seed42/` | 1 | Raw English Python (20k file subset) | 42 |
- | `condition-2-{zh,es,ur}-5k-seed{42,123,456}/` | 2 | The **same 5k subset as cond-1**, processed through Legesher v0.7.3: Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
- | `condition-2-{zh,es,ur}-20k-seed42/` | 2 | The **same 20k subset as cond-1**, processed through Legesher v0.7.3 | 42 |
- | `condition-3-zh-5k-native-code-seed42/` | 3 | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42 |
- | `condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/` | 5 | The **same 5k subset as cond-1**, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through `c4ai-aya-expanse-32b` via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42 |
 
  **Condition 4 ("Community-Contributed Native Code")** is pending sufficient direct community contributions to the [`legesher/legesher-native-code`](https://huggingface.co/spaces/legesher/legesher-native-code) HF Space; no cond-4 adapter exists yet.
 
  ### Source-file control
 
  Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn from `bigcode/the-stack-v2-dedup` (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception: its source files are a different population by design.
@@ -78,6 +100,17 @@ Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn fro
 
  For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments).
 
  ## Usage
 
  ```python
@@ -88,25 +121,34 @@ from peft import PeftModel
  base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
  tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")
 
- # Load a LoRA adapter, e.g., cond-1 (English code, seed 42, 5k tier)
  model = PeftModel.from_pretrained(
      base_model,
      "legesher/language-decoded-lora",
-     subfolder="condition-1-en-5k-seed42",
  )
 
  # Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42)
  model = PeftModel.from_pretrained(
      base_model,
      "legesher/language-decoded-lora",
-     subfolder="condition-2-zh-5k-seed42",
  )
 
  # Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
  model = PeftModel.from_pretrained(
      base_model,
      "legesher/language-decoded-lora",
-     subfolder="condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
  )
  ```
 
 
 
  ## Adapter Inventory
 
+ This repo holds adapters from **two generations of the project**, kept side by side and clearly separated by folder. See the [Provenance & Manifest](#provenance--manifest) section for a complete path → phase → source-corpus map, and [`MANIFEST.md`](MANIFEST.md) for the machine-readable version.
 
+ - **Paper adapters (Phase 3 · The Stack v2-dedup)** live under the **`tiny-aya-base/`** prefix. These are the adapters cited in the submitted paper; cond-1, cond-2, and cond-5 were re-trained from scratch on the cleaner [`bigcode/the-stack-v2-dedup`](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) corpus.
+ - **Preliminary adapters (Phase 2 · The Stack v1)** live as **flat top-level folders** (`condition-1-en-32k/`, `condition-2-zh-5k/`, …). These are the original March-2026 hackathon adapters trained on [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1, non-dedup), retained for reproducibility. **Do not cite these for the paper.**
+
+ ### Paper adapters: Phase 3 · The Stack v2-dedup
+
+ Each subdirectory under `tiny-aya-base/` is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under [Training Details](#training-details).
+
+ | Subdirectory (under `tiny-aya-base/`) | Condition | Training data | Seeds |
+ | ------------------------------------------------------------------------ | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
+ | `tiny-aya-base/condition-1-en-5k-seed{42,123,456}/` | 1 | Raw English Python from `bigcode/the-stack-v2-dedup` (5k file subset) | 42, 123, 456 |
+ | `tiny-aya-base/condition-1-en-20k-seed42/` | 1 | Raw English Python (20k file subset) | 42 |
+ | `tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/` | 2 | The **same 5k subset as cond-1**, processed through Legesher v0.7.3: Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
+ | `tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/` | 2 | The **same 20k subset as cond-1**, processed through Legesher v0.7.3 | 42 |
+ | `tiny-aya-base/condition-3-zh-5k-native-code-seed42/` | 3 | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42 |
+ | `tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/` | 5 | The **same 5k subset as cond-1**, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through `c4ai-aya-expanse-32b` via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42 |
 
  **Condition 4 ("Community-Contributed Native Code")** is pending sufficient direct community contributions to the [`legesher/legesher-native-code`](https://huggingface.co/spaces/legesher/legesher-native-code) HF Space; no cond-4 adapter exists yet.
 
+ ### Preliminary adapters: Phase 2 · The Stack v1
+
+ These flat top-level folders are the original hackathon adapters, trained on [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1, non-dedup) with Legesher v0.5.1 / v0.6.0. They are **superseded by the `tiny-aya-base/` Phase 3 adapters above** and are kept only for reproducibility of the preliminary results. The `32k` size and the single-seed setup are Phase 2 signatures.
+
+ | Subdirectory (top level) | Condition | Source corpus | Notes |
+ | ------------------------ | --------- | ---------------------------------------------- | ---------------------------------- |
+ | `condition-1-en-32k/` | 1 | `bigcode/the-stack` (v1) | Phase 2 32k tier; no Phase 3 equivalent |
+ | `condition-1-en-5k/` | 1 | `bigcode/the-stack` (v1) | Preliminary; use `tiny-aya-base/condition-1-en-5k-seed42/` for the paper |
+ | `condition-2-es-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
+ | `condition-2-ur-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
+ | `condition-2-zh-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
+ | `condition-3-zh-5k/` | 3 | Community-collected raw Chinese code | Preliminary; corpus unchanged across phases |
+
+ > The standalone per-adapter repos that previously published these Phase 2 / v1 adapters (`legesher/language-decoded-lora-condition-*`) have been renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this umbrella repo. Their old URLs continue to resolve via Hugging Face redirects.
+
  ### Source-file control
 
  Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn from `bigcode/the-stack-v2-dedup` (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception: its source files are a different population by design.
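The Phase 3 inventory table uses shell-style brace notation (`{zh,es,ur}`, `{42,123,456}`) to compress several adapter folders into one row, while `MANIFEST.md` lists each folder separately. As a cross-check, the sketch below expands the six table patterns into concrete paths. `expand_braces` is a hypothetical helper written for this illustration; it handles only flat, non-nested groups, which is all the table uses.

```python
import itertools
import re

BRACE_GROUP = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern: str) -> list[str]:
    """Expand every flat {a,b,c} group in pattern into concrete strings."""
    # Splitting on a pattern with one capturing group alternates
    # literal text (even indices) and group bodies (odd indices).
    parts = BRACE_GROUP.split(pattern)
    literals = parts[0::2]
    choices = [body.split(",") for body in parts[1::2]]
    expanded = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for choice, literal in zip(combo, literals[1:]):
            pieces.append(choice)
            pieces.append(literal)
        expanded.append("".join(pieces))
    return expanded


# The six Phase 3 rows, copied from the README table.
PHASE3_PATTERNS = [
    "tiny-aya-base/condition-1-en-5k-seed{42,123,456}/",
    "tiny-aya-base/condition-1-en-20k-seed42/",
    "tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/",
    "tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/",
    "tiny-aya-base/condition-3-zh-5k-native-code-seed42/",
    "tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/",
]

phase3_paths = [p for pattern in PHASE3_PATTERNS for p in expand_braces(pattern)]
print(len(phase3_paths))
```

The expansion yields 20 distinct paths, the same number as the Phase 3 rows in `MANIFEST.md`, so the compact table and the manifest describe the same set of folders.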
 
 
  For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments).
 
+ ## Provenance & Manifest
+
+ The two adapter generations are distinguished by **folder location and source corpus**, matching the convention used across the project's repos (`phase-2-the-stack-v1-*` on [`language-decoded-data`](https://huggingface.co/datasets/legesher/language-decoded-data), `phase2/` and `phase3/` on [`language-decoded-experiments`](https://huggingface.co/datasets/legesher/language-decoded-experiments)):
+
+ | Generation | Location in this repo | Source corpus | Legesher | Tier / seeds | Cite for paper? |
+ | --- | --- | --- | --- | --- | --- |
+ | **Phase 3 (paper)** | `tiny-aya-base/…-seed*/` | [`bigcode/the-stack-v2-dedup`](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) | v0.7.3 | 5k (3 seeds for cond-1/2, seed 42 only for cond-3/5) + 20k (1 seed) | ✅ Yes |
+ | **Phase 2 (preliminary)** | flat top-level `condition-*/` | [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1) | v0.5.1 / v0.6.0 | 5k / 32k (1 seed) | ❌ No |
+
+ A complete, machine-readable path → phase → corpus → condition map is in [`MANIFEST.md`](MANIFEST.md). Training-data provenance for each condition is detailed on [`language-decoded-data`](https://huggingface.co/datasets/legesher/language-decoded-data); the phase comparison is in the ["Phase 2 → Phase 3 at a glance"](https://huggingface.co/datasets/legesher/language-decoded-experiments#phase-2--phase-3-at-a-glance) table on the experiments repo.
+
  ## Usage
 
  ```python
 
  base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
  tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")
 
+ # Load a paper (Phase 3 · Stack v2-dedup) adapter, e.g., cond-1 (English code, seed 42, 5k tier).
+ # Paper adapters live under the `tiny-aya-base/` prefix.
  model = PeftModel.from_pretrained(
      base_model,
      "legesher/language-decoded-lora",
+     subfolder="tiny-aya-base/condition-1-en-5k-seed42",
  )
 
  # Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42)
  model = PeftModel.from_pretrained(
      base_model,
      "legesher/language-decoded-lora",
+     subfolder="tiny-aya-base/condition-2-zh-5k-seed42",
  )
 
  # Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
  model = PeftModel.from_pretrained(
      base_model,
      "legesher/language-decoded-lora",
+     subfolder="tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
+ )
+
+ # To load a *preliminary* Phase 2 / Stack v1 adapter instead, use the flat top-level
+ # folder (no `tiny-aya-base/` prefix), e.g. the original cond-2 Chinese hackathon adapter:
+ model = PeftModel.from_pretrained(
+     base_model,
+     "legesher/language-decoded-lora",
+     subfolder="condition-2-zh-5k",
  )
  ```
 
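In the updated usage snippet, the only thing separating a paper adapter from a preliminary one is the `subfolder` string. The sketch below makes that choice explicit through a `phase` argument, so a Phase 2 folder cannot be loaded by accident when Phase 3 was intended. Both `adapter_subfolder` and `load_adapter` are hypothetical helpers, not part of the repo. The seed-suffix check is an assumption drawn from the manifest, where every Phase 3 folder ends in `-seed<k>` and no Phase 2 folder does.

```python
PHASE3_PREFIX = "tiny-aya-base"  # per the README: paper adapters live under this prefix
REPO_ID = "legesher/language-decoded-lora"


def adapter_subfolder(name: str, phase: int = 3) -> str:
    """Return the `subfolder` value for an adapter folder name and project phase."""
    name = name.strip("/")
    has_seed = "-seed" in name  # assumed naming signature, see lead-in
    if phase == 3:
        if not has_seed:
            raise ValueError(f"{name!r} has no seed suffix; it looks like a Phase 2 folder")
        return f"{PHASE3_PREFIX}/{name}"
    if phase == 2:
        if has_seed:
            raise ValueError(f"{name!r} has a seed suffix; it looks like a Phase 3 folder")
        return name
    raise ValueError(f"unknown phase: {phase!r}")


def load_adapter(base_model, name: str, phase: int = 3):
    """Wrap base_model with the chosen adapter, mirroring the README call."""
    # Imported lazily so adapter_subfolder stays usable without peft installed.
    from peft import PeftModel

    return PeftModel.from_pretrained(
        base_model,
        REPO_ID,
        subfolder=adapter_subfolder(name, phase),
    )
```

With this, `load_adapter(base_model, "condition-2-zh-5k-seed42")` would resolve to the paper adapter, and `load_adapter(base_model, "condition-2-zh-5k", phase=2)` to the hackathon one.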