docs: delineate Phase 2 (Stack v1) vs Phase 3 (Stack v2-dedup) adapters
#12
by madiedgar - opened
- MANIFEST.md +43 -0
- README.md +55 -13
MANIFEST.md
ADDED
@@ -0,0 +1,43 @@
+# Adapter Manifest: `legesher/language-decoded-lora`
+
+Every trained adapter in this repo, mapped to its project phase and source code corpus. Paper citations should use **Phase 3** adapters only. See the README [Provenance & Manifest](README.md#provenance--manifest) section for the summary.
+
+Generated from the repo file tree; one row per directory containing `adapter_config.json`.
+
+## Phase 3: paper adapters (`bigcode/the-stack-v2-dedup`, Legesher v0.7.3)
+
+| Adapter path | Condition | Seed |
+| --- | --- | --- |
+| `tiny-aya-base/condition-1-en-20k-seed42/` | 1 | 42 |
+| `tiny-aya-base/condition-1-en-5k-seed123/` | 1 | 123 |
+| `tiny-aya-base/condition-1-en-5k-seed42/` | 1 | 42 |
+| `tiny-aya-base/condition-1-en-5k-seed456/` | 1 | 456 |
+| `tiny-aya-base/condition-2-es-20k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-es-5k-seed123/` | 2 | 123 |
+| `tiny-aya-base/condition-2-es-5k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-es-5k-seed456/` | 2 | 456 |
+| `tiny-aya-base/condition-2-ur-20k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-ur-5k-seed123/` | 2 | 123 |
+| `tiny-aya-base/condition-2-ur-5k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-ur-5k-seed456/` | 2 | 456 |
+| `tiny-aya-base/condition-2-zh-20k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-zh-5k-seed123/` | 2 | 123 |
+| `tiny-aya-base/condition-2-zh-5k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-zh-5k-seed456/` | 2 | 456 |
+| `tiny-aya-base/condition-3-zh-5k-native-code-seed42/` | 3 | 42 |
+| `tiny-aya-base/condition-5-es-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+| `tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+| `tiny-aya-base/condition-5-zh-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+
+## Phase 2: preliminary adapters (`bigcode/the-stack` v1, Legesher v0.5.1 / v0.6.0)
+
+> Retained for reproducibility of the March-2026 hackathon results. **Not cited in the paper.** The standalone repos that mirrored these were renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this repo.
+
+| Adapter path | Condition | Tier |
+| --- | --- | --- |
+| `condition-1-en-32k/` | 1 | 32k |
+| `condition-1-en-5k/` | 1 | 5k |
+| `condition-2-es-5k/` | 2 | 5k |
+| `condition-2-ur-5k/` | 2 | 5k |
+| `condition-2-zh-5k/` | 2 | 5k |
+| `condition-3-zh-5k/` | 3 | 5k |
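The manifest states that it is generated from the repo file tree, one row per directory containing `adapter_config.json`, but the generator itself is not part of this PR. The following is only a minimal sketch of how such rows could be derived from directory names and cross-checked against the brace patterns in the README's Phase 3 inventory table. All function names (`find_adapter_dirs`, `expand_braces`, `parse_adapter_path`, `manifest_row`) are hypothetical, and the naming convention is inferred from the paths listed above.

```python
import os
import re
from itertools import product

# Brace patterns copied from the README's Phase 3 inventory table.
README_PATTERNS = [
    "tiny-aya-base/condition-1-en-5k-seed{42,123,456}/",
    "tiny-aya-base/condition-1-en-20k-seed42/",
    "tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/",
    "tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/",
    "tiny-aya-base/condition-3-zh-5k-native-code-seed42/",
    "tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/",
]


def find_adapter_dirs(root):
    """Yield repo-relative directories (trailing slash) holding adapter_config.json."""
    for dirpath, _dirs, files in os.walk(root):
        if "adapter_config.json" in files:
            yield os.path.relpath(dirpath, root).replace(os.sep, "/") + "/"


def expand_braces(pattern):
    """Expand shell-style {a,b,c} groups (no nesting) into concrete strings."""
    parts = re.split(r"\{([^{}]*)\}", pattern)
    # Even indices are literal text; odd indices are comma-separated choices.
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


def parse_adapter_path(path):
    """Split one adapter directory path into manifest fields (assumed convention)."""
    # Assumption: Phase 3 adapters live under tiny-aya-base/, Phase 2 are top level.
    phase = 3 if path.startswith("tiny-aya-base/") else 2
    parts = path.rstrip("/").split("/")[-1].split("-")
    if len(parts) < 4 or parts[0] != "condition":
        raise ValueError(f"unrecognized adapter path: {path!r}")
    seed = int(parts[-1][4:]) if parts[-1].startswith("seed") else None
    return {
        "path": path,
        "phase": phase,
        "condition": int(parts[1]),
        "lang": parts[2],
        "tier": parts[3],
        "seed": seed,
    }


def manifest_row(info):
    """Render one table row: the Phase 3 table lists the seed, Phase 2 the tier."""
    last = info["seed"] if info["phase"] == 3 else info["tier"]
    return f"| `{info['path']}` | {info['condition']} | {last} |"


if __name__ == "__main__":
    expanded = sorted(p for pat in README_PATTERNS for p in expand_braces(pat))
    for path in expanded:
        print(manifest_row(parse_adapter_path(path)))
```

Expanding the six README patterns gives 20 paths, the same number of rows as the Phase 3 table above, so a check like this could flag drift between the two files.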
README.md
CHANGED
@@ -51,19 +51,41 @@ All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/Co

 ## Adapter Inventory

-
+This repo holds adapters from **two generations of the project**, kept side by side and clearly separated by folder. See the [Provenance & Manifest](#provenance--manifest) section for a complete path → phase → source-corpus map, and [`MANIFEST.md`](MANIFEST.md) for the machine-readable version.

-
-
-
-
-
-
-
-| `
+- **Paper adapters (Phase 3 · The Stack v2-dedup)**: live under the **`tiny-aya-base/`** prefix. These are the adapters cited in the submitted paper; cond-1, cond-2, and cond-5 were re-trained from scratch on the cleaner [`bigcode/the-stack-v2-dedup`](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) corpus.
+- **Preliminary adapters (Phase 2 · The Stack v1)**: live as **flat top-level folders** (`condition-1-en-32k/`, `condition-2-zh-5k/`, …). These are the original March-2026 hackathon adapters trained on [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1, non-dedup), retained for reproducibility. **Do not cite these for the paper.**
+
+### Paper adapters: Phase 3 · The Stack v2-dedup
+
+Each subdirectory under `tiny-aya-base/` is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under [Training Details](#training-details).
+
+| Subdirectory (under `tiny-aya-base/`) | Condition | Training data | Seeds |
+| --- | --- | --- | --- |
+| `tiny-aya-base/condition-1-en-5k-seed{42,123,456}/` | 1 | Raw English Python from `bigcode/the-stack-v2-dedup` (5k file subset) | 42, 123, 456 |
+| `tiny-aya-base/condition-1-en-20k-seed42/` | 1 | Raw English Python (20k file subset) | 42 |
+| `tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/` | 2 | The **same 5k subset as cond-1**, processed through Legesher v0.7.3: Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
+| `tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/` | 2 | The **same 20k subset as cond-1**, processed through Legesher v0.7.3 | 42 |
+| `tiny-aya-base/condition-3-zh-5k-native-code-seed42/` | 3 | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42 |
+| `tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/` | 5 | The **same 5k subset as cond-1**, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through `c4ai-aya-expanse-32b` via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42 |

 **Condition 4 ("Community-Contributed Native Code")** is pending sufficient direct community contributions to the [`legesher/legesher-native-code`](https://huggingface.co/spaces/legesher/legesher-native-code) HF Space; no cond-4 adapter exists yet.

+### Preliminary adapters: Phase 2 · The Stack v1
+
+These flat top-level folders are the original hackathon adapters, trained on [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1, non-dedup) with Legesher v0.5.1 / v0.6.0. They are **superseded by the `tiny-aya-base/` Phase 3 adapters above** and are kept only for reproducibility of the preliminary results. The `32k` size and the single-seed setup are Phase 2 signatures.
+
+| Subdirectory (top level) | Condition | Source corpus | Notes |
+| --- | --- | --- | --- |
+| `condition-1-en-32k/` | 1 | `bigcode/the-stack` (v1) | Phase 2 32k tier; no Phase 3 equivalent |
+| `condition-1-en-5k/` | 1 | `bigcode/the-stack` (v1) | Preliminary; use `tiny-aya-base/condition-1-en-5k-seed42/` for the paper |
+| `condition-2-es-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
+| `condition-2-ur-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
+| `condition-2-zh-5k/` | 2 | `bigcode/the-stack` (v1), Legesher transpiled | Preliminary |
+| `condition-3-zh-5k/` | 3 | Community-collected raw Chinese code | Preliminary; corpus unchanged across phases |
+
+> The standalone per-adapter repos that previously published these Phase 2 / v1 adapters (`legesher/language-decoded-lora-condition-*`) have been renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this umbrella repo. Their old URLs continue to resolve via Hugging Face redirects.
+
 ### Source-file control

 Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn from `bigcode/the-stack-v2-dedup` (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception: its source files are a different population by design.
@@ -78,6 +100,17 @@ Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn fro

 For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments).

+## Provenance & Manifest
+
+The two adapter generations are distinguished by **folder location and source corpus**, matching the convention used across the project's repos (`phase-2-the-stack-v1-*` on [`language-decoded-data`](https://huggingface.co/datasets/legesher/language-decoded-data), `phase2/` and `phase3/` on [`language-decoded-experiments`](https://huggingface.co/datasets/legesher/language-decoded-experiments)):
+
+| Generation | Location in this repo | Source corpus | Legesher | Tier / seeds | Cite for paper? |
+| --- | --- | --- | --- | --- | --- |
+| **Phase 3 (paper)** | `tiny-aya-base/…-seed*/` | [`bigcode/the-stack-v2-dedup`](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) | v0.7.3 | 5k (3 seeds) + 20k (1 seed) | ✅ Yes |
+| **Phase 2 (preliminary)** | flat top-level `condition-*/` | [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1) | v0.5.1 / v0.6.0 | 5k / 32k (1 seed) | ❌ No |
+
+A complete, machine-readable path → phase → corpus → condition map is in [`MANIFEST.md`](MANIFEST.md). Training-data provenance for each condition is detailed on [`language-decoded-data`](https://huggingface.co/datasets/legesher/language-decoded-data); the phase comparison is in the ["Phase 2 → Phase 3 at a glance"](https://huggingface.co/datasets/legesher/language-decoded-experiments#phase-2--phase-3-at-a-glance) table on the experiments repo.
+
 ## Usage

 ```python
@@ -88,25 +121,34 @@ from peft import PeftModel
 base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
 tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")

-# Load a
+# Load a paper (Phase 3 · Stack v2-dedup) adapter, e.g., cond-1 (English code, seed 42, 5k tier).
+# Paper adapters live under the `tiny-aya-base/` prefix.
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
-    subfolder="condition-1-en-5k-seed42",
+    subfolder="tiny-aya-base/condition-1-en-5k-seed42",
 )

 # Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
-    subfolder="condition-2-zh-5k-seed42",
+    subfolder="tiny-aya-base/condition-2-zh-5k-seed42",
 )

 # Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
-    subfolder="condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
+    subfolder="tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
+)
+
+# To load a *preliminary* Phase 2 / Stack v1 adapter instead, use the flat top-level
+# folder (no `tiny-aya-base/` prefix), e.g. the original cond-2 Chinese hackathon adapter:
+model = PeftModel.from_pretrained(
+    base_model,
+    "legesher/language-decoded-lora",
+    subfolder="condition-2-zh-5k",
 )
 ```

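Because the only difference between loading a paper adapter and a preliminary one is the `subfolder` string, a small helper can make the phase choice explicit. This is a sketch under assumptions, not part of the PR: `adapter_subfolder` and `load_adapter` are hypothetical names, the folder convention is inferred from the README tables, and `load_adapter` assumes `transformers` and `peft` are installed and that the Hub is reachable. It loads a fresh base model on each call, on the assumption that wrapping one shared `base_model` object with several adapters in turn is not what a reader comparing conditions wants.

```python
REPO_ID = "legesher/language-decoded-lora"
BASE_MODEL = "CohereLabs/tiny-aya-base"


def adapter_subfolder(condition, lang, tier="5k", seed=42, phase=3, variant=None):
    """Build the subfolder string for one adapter (hypothetical helper).

    Phase 3 (paper):       tiny-aya-base/condition-<c>-<lang>-<tier>[-<variant>]-seed<seed>
    Phase 2 (preliminary): condition-<c>-<lang>-<tier>   (flat, no seed suffix)
    """
    name = f"condition-{condition}-{lang}-{tier}"
    if phase == 2:
        return name
    if phase != 3:
        raise ValueError(f"unknown phase: {phase!r}")
    if variant:
        # e.g. "native-code" for cond-3, "c4ai-aya-expanse-32b" for cond-5
        name = f"{name}-{variant}"
    return f"tiny-aya-base/{name}-seed{seed}"


def load_adapter(subfolder):
    """Load a fresh base model and attach the adapter stored in `subfolder`."""
    # Imported lazily so the path helper above works without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = PeftModel.from_pretrained(base_model, REPO_ID, subfolder=subfolder)
    return model, tokenizer


if __name__ == "__main__":
    print(adapter_subfolder(1, "en"))
    print(adapter_subfolder(5, "ur", variant="c4ai-aya-expanse-32b"))
    print(adapter_subfolder(2, "zh", phase=2))
```

Calling `load_adapter(adapter_subfolder(2, "zh"))` would then fetch the Phase 3 Chinese cond-2 adapter, while passing `phase=2` selects the flat top-level hackathon folder.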