
docs: delineate Phase 2 (Stack v1) vs Phase 3 (Stack v2-dedup) adapters

#12 by madiedgar - opened May 31

base: refs/heads/main ← from: refs/pr/12

+98 -13

Files changed (2)

MANIFEST.md +43 -0
README.md +55 -13

MANIFEST.md ADDED

@@ -0,0 +1,43 @@

+# Adapter Manifest — `legesher/language-decoded-lora`
+Every trained adapter in this repo, mapped to its project phase and source code corpus. Paper citations should use **Phase 3** adapters only. See the README [Provenance & Manifest](README.md#provenance--manifest) section for the summary.
+Generated from the repo file tree; one row per directory containing `adapter_config.json`.
+## Phase 3 — paper adapters (`bigcode/the-stack-v2-dedup`, Legesher v0.7.3)
+| Adapter path | Condition | Seed |
+| --- | --- | --- |
+| `tiny-aya-base/condition-1-en-20k-seed42/` | 1 | 42 |
+| `tiny-aya-base/condition-1-en-5k-seed123/` | 1 | 123 |
+| `tiny-aya-base/condition-1-en-5k-seed42/` | 1 | 42 |
+| `tiny-aya-base/condition-1-en-5k-seed456/` | 1 | 456 |
+| `tiny-aya-base/condition-2-es-20k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-es-5k-seed123/` | 2 | 123 |
+| `tiny-aya-base/condition-2-es-5k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-es-5k-seed456/` | 2 | 456 |
+| `tiny-aya-base/condition-2-ur-20k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-ur-5k-seed123/` | 2 | 123 |
+| `tiny-aya-base/condition-2-ur-5k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-ur-5k-seed456/` | 2 | 456 |
+| `tiny-aya-base/condition-2-zh-20k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-zh-5k-seed123/` | 2 | 123 |
+| `tiny-aya-base/condition-2-zh-5k-seed42/` | 2 | 42 |
+| `tiny-aya-base/condition-2-zh-5k-seed456/` | 2 | 456 |
+| `tiny-aya-base/condition-3-zh-5k-native-code-seed42/` | 3 | 42 |
+| `tiny-aya-base/condition-5-es-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+| `tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+| `tiny-aya-base/condition-5-zh-5k-c4ai-aya-expanse-32b-seed42/` | 5 | 42 |
+## Phase 2 — preliminary adapters (`bigcode/the-stack` v1, Legesher v0.5.1 / v0.6.0)
+> Retained for reproducibility of the March-2026 hackathon results. **Not cited in the paper.** The standalone repos that mirrored these were renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this repo.
+| Adapter path | Condition | Tier |
+| --- | --- | --- |
+| `condition-1-en-32k/` | 1 | 32k |
+| `condition-1-en-5k/` | 1 | 5k |
+| `condition-2-es-5k/` | 2 | 5k |
+| `condition-2-ur-5k/` | 2 | 5k |
+| `condition-2-zh-5k/` | 2 | 5k |
+| `condition-3-zh-5k/` | 3 | 5k |
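
The manifest above is described as generated from the repo file tree, one row per directory containing `adapter_config.json`. The generator itself is not part of this PR, so the following is only a minimal sketch of how such rows could be derived, assuming the folder naming shown in the tables. The function names `adapter_dirs` and `manifest_row` are hypothetical, and the phase rule (Phase 3 if the path starts with `tiny-aya-base/`, otherwise Phase 2) is taken from the README text.

```python
import re
from pathlib import PurePosixPath

# Prefix the README gives for paper (Phase 3) adapters.
PHASE3_PREFIX = "tiny-aya-base"


def adapter_dirs(repo_files):
    """Return the sorted directories that directly contain adapter_config.json."""
    dirs = set()
    for name in repo_files:
        path = PurePosixPath(name)
        # Skip a config at the repo root; only subdirectories are adapters.
        if path.name == "adapter_config.json" and str(path.parent) != ".":
            dirs.add(str(path.parent))
    return sorted(dirs)


def manifest_row(adapter_dir):
    """Parse one adapter directory into phase, condition, language, tier, seed."""
    parts = PurePosixPath(adapter_dir).parts
    folder = parts[-1]
    match = re.match(r"condition-(\d+)-([a-z]{2})-(\d+k)", folder)
    if match is None:
        raise ValueError(f"unrecognized adapter folder: {adapter_dir}")
    seed = re.search(r"seed(\d+)$", folder)
    return {
        "path": adapter_dir + "/",
        "phase": 3 if parts[0] == PHASE3_PREFIX else 2,
        "condition": int(match.group(1)),
        "language": match.group(2),
        "tier": match.group(3),
        # Phase 2 folders carry no seed suffix, so seed stays None for them.
        "seed": int(seed.group(1)) if seed else None,
    }


if __name__ == "__main__":
    # Illustrative file list; a real one could come from
    # huggingface_hub.list_repo_files("legesher/language-decoded-lora").
    files = [
        "README.md",
        "tiny-aya-base/condition-2-zh-5k-seed123/adapter_config.json",
        "condition-1-en-32k/adapter_config.json",
    ]
    for directory in adapter_dirs(files):
        print(manifest_row(directory))
```

Keeping the parsing separate from the file listing means the same rows can be rebuilt from a local checkout or from a cached file list without network access.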

README.md CHANGED

@@ -51,19 +51,41 @@ All adapters are trained on [CohereLabs/tiny-aya-base](https://huggingface.co/Co
 ## Adapter Inventory
-Each subdirectory is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under [Training Details](#training-details).
-| Subdirectory                                                | Condition | Training data                                                                                                                                  | Seeds      |
-| ----------------------------------------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
-| `condition-1-en-5k-seed{42,123,456}/`                       | 1         | Raw English Python from `bigcode/the-stack-v2-dedup` (5k file subset)                                                                          | 42, 123, 456 |
-| `condition-1-en-20k-seed42/`                                | 1         | Raw English Python (20k file subset)                                                                                                           | 42         |
-| `condition-2-{zh,es,ur}-5k-seed{42,123,456}/`               | 2         | The **same 5k subset as cond-1**, processed through Legesher v0.7.3 — Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
-| `condition-2-{zh,es,ur}-20k-seed42/`                        | 2         | The **same 20k subset as cond-1**, processed through Legesher v0.7.3                                                                           | 42         |
-| `condition-3-zh-5k-native-code-seed42/`                     | 3         | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42         |
-| `condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/`    | 5         | The **same 5k subset as cond-1**, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through `c4ai-aya-expanse-32b` via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42         |
 **Condition 4 ("Community-Contributed Native Code")** is pending sufficient direct community contributions to the [`legesher/legesher-native-code`](https://huggingface.co/spaces/legesher/legesher-native-code) HF Space; no cond-4 adapter exists yet.
 ### Source-file control
 Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn from `bigcode/the-stack-v2-dedup` (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception — its source files are a different population by design.
@@ -78,6 +100,17 @@ Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn fro
 For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments).
 ## Usage
 ```python
@@ -88,25 +121,34 @@ from peft import PeftModel
 base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
 tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")
-# Load a LoRA adapter — e.g., cond-1 (English code, seed 42, 5k tier)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
-    subfolder="condition-1-en-5k-seed42",
 )
 # Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
-    subfolder="condition-2-zh-5k-seed42",
 )
 # Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
-    subfolder="condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
 )
 ```

 ## Adapter Inventory
+This repo holds adapters from **two generations of the project**, kept side by side and clearly separated by folder. See the [Provenance & Manifest](#provenance--manifest) section for a complete path → phase → source-corpus map, and [`MANIFEST.md`](MANIFEST.md) for the machine-readable version.
+- **Paper adapters (Phase 3 · The Stack v2-dedup)** — live under the **`tiny-aya-base/`** prefix. These are the adapters cited in the submitted paper; cond-1, cond-2, and cond-5 were re-trained from scratch on the cleaner [`bigcode/the-stack-v2-dedup`](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) corpus.
+- **Preliminary adapters (Phase 2 · The Stack v1)** — live as **flat top-level folders** (`condition-1-en-32k/`, `condition-2-zh-5k/`, …). These are the original March-2026 hackathon adapters trained on [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1, non-dedup), retained for reproducibility. **Do not cite these for the paper.**
+### Paper adapters — Phase 3 · The Stack v2-dedup
+Each subdirectory under `tiny-aya-base/` is one trained condition × file-volume × seed combination. All adapters share the QLoRA hyperparameters listed under [Training Details](#training-details).
+| Subdirectory (under `tiny-aya-base/`)                                    | Condition | Training data                                                                                                                                  | Seeds      |
+| ------------------------------------------------------------------------ | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
+| `tiny-aya-base/condition-1-en-5k-seed{42,123,456}/`                      | 1         | Raw English Python from `bigcode/the-stack-v2-dedup` (5k file subset)                                                                          | 42, 123, 456 |
+| `tiny-aya-base/condition-1-en-20k-seed42/`                               | 1         | Raw English Python (20k file subset)                                                                                                           | 42         |
+| `tiny-aya-base/condition-2-{zh,es,ur}-5k-seed{42,123,456}/`             | 2         | The **same 5k subset as cond-1**, processed through Legesher v0.7.3 — Python's reserved words (keywords, exceptions, built-in functions, numerical system for some target languages) translated to the target language; user logic preserved | 42, 123, 456 |
+| `tiny-aya-base/condition-2-{zh,es,ur}-20k-seed42/`                      | 2         | The **same 20k subset as cond-1**, processed through Legesher v0.7.3                                                                           | 42         |
+| `tiny-aya-base/condition-3-zh-5k-native-code-seed42/`                   | 3         | Community-collected raw Chinese code from varied online public-source repositories (different source-file population from cond-1/2/5 by design) | 42         |
+| `tiny-aya-base/condition-5-{zh,es,ur}-5k-c4ai-aya-expanse-32b-seed42/`  | 5         | The **same 5k subset as cond-1**, first transpiled by Legesher v0.7.3 to translate Python's reserved words, then run through `c4ai-aya-expanse-32b` via the Cohere API to translate the remaining content (identifiers, comments, docstrings, string literals) | 42         |
 **Condition 4 ("Community-Contributed Native Code")** is pending sufficient direct community contributions to the [`legesher/legesher-native-code`](https://huggingface.co/spaces/legesher/legesher-native-code) HF Space; no cond-4 adapter exists yet.
+### Preliminary adapters — Phase 2 · The Stack v1
+These flat top-level folders are the original hackathon adapters, trained on [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1, non-dedup) with Legesher v0.5.1 / v0.6.0. They are **superseded by the `tiny-aya-base/` Phase 3 adapters above** and are kept only for reproducibility of the preliminary results. The `32k` size and the single-seed setup are Phase 2 signatures.
+| Subdirectory (top level) | Condition | Source corpus                                  | Notes                              |
+| ------------------------ | --------- | ---------------------------------------------- | ---------------------------------- |
+| `condition-1-en-32k/`    | 1         | `bigcode/the-stack` (v1)                        | Phase 2 32k tier; no Phase 3 equivalent |
+| `condition-1-en-5k/`     | 1         | `bigcode/the-stack` (v1)                        | Preliminary; use `tiny-aya-base/condition-1-en-5k-seed42/` for the paper |
+| `condition-2-es-5k/`     | 2         | `bigcode/the-stack` (v1), Legesher transpiled   | Preliminary                        |
+| `condition-2-ur-5k/`     | 2         | `bigcode/the-stack` (v1), Legesher transpiled   | Preliminary                        |
+| `condition-2-zh-5k/`     | 2         | `bigcode/the-stack` (v1), Legesher transpiled   | Preliminary                        |
+| `condition-3-zh-5k/`     | 3         | Community-collected raw Chinese code            | Preliminary; corpus unchanged across phases |
+> The standalone per-adapter repos that previously published these Phase 2 / v1 adapters (`legesher/language-decoded-lora-condition-*`) have been renamed to `legesher/language-decoded-lora-phase-2-the-stack-v1-condition-*` and deprecated in favor of this umbrella repo. Their old URLs continue to resolve via Hugging Face redirects.
 ### Source-file control
 Cond-1, cond-2, and cond-5 all train on the **same 5,000-file subset** drawn from `bigcode/the-stack-v2-dedup` (with a parallel 20k subset for the 20k tier). Differences across these conditions reflect the processing pipeline (raw / transpiled / fully translated), not file-quality or content drift. Cond-3 is the deliberate exception — its source files are a different population by design.
 For the full ladder including future directions (natural-language text control, combined-language training, similar-script evaluation), see [legesher/language-decoded-experiments](https://huggingface.co/datasets/legesher/language-decoded-experiments).
+## Provenance & Manifest
+The two adapter generations are distinguished by **folder location and source corpus**, matching the convention used across the project's repos (`phase-2-the-stack-v1-*` on [`language-decoded-data`](https://huggingface.co/datasets/legesher/language-decoded-data), `phase2/` / `phase3/` on [`language-decoded-experiments`](https://huggingface.co/datasets/legesher/language-decoded-experiments)):
+| Generation | Location in this repo | Source corpus | Legesher | Tier / seeds | Cite for paper? |
+| --- | --- | --- | --- | --- | --- |
+| **Phase 3 (paper)** | `tiny-aya-base/…-seed*/` | [`bigcode/the-stack-v2-dedup`](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup) | v0.7.3 | 5k (3 seeds) + 20k (1 seed) | ✅ Yes |
+| **Phase 2 (preliminary)** | flat top-level `condition-*/` | [`bigcode/the-stack`](https://huggingface.co/datasets/bigcode/the-stack) (v1) | v0.5.1 / v0.6.0 | 5k / 32k (1 seed) | ❌ No |
+A complete, machine-readable path → phase → corpus → condition map is in [`MANIFEST.md`](MANIFEST.md). Training-data provenance for each condition is detailed on [`language-decoded-data`](https://huggingface.co/datasets/legesher/language-decoded-data); the phase comparison is in the ["Phase 2 → Phase 3 at a glance"](https://huggingface.co/datasets/legesher/language-decoded-experiments#phase-2--phase-3-at-a-glance) table on the experiments repo.
 ## Usage
 ```python
 base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/tiny-aya-base")
 tokenizer = AutoTokenizer.from_pretrained("CohereLabs/tiny-aya-base")
+# Load a paper (Phase 3 · Stack v2-dedup) adapter — e.g., cond-1 (English code, seed 42, 5k tier).
+# Paper adapters live under the `tiny-aya-base/` prefix.
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
+    subfolder="tiny-aya-base/condition-1-en-5k-seed42",
 )
 # Or a language-specific cond-2 adapter (Chinese reserved-word translation, seed 42)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
+    subfolder="tiny-aya-base/condition-2-zh-5k-seed42",
 )
 # Or a cond-5 adapter (Synthesized Native Code, Urdu, seed 42)
 model = PeftModel.from_pretrained(
     base_model,
     "legesher/language-decoded-lora",
+    subfolder="tiny-aya-base/condition-5-ur-5k-c4ai-aya-expanse-32b-seed42",
+)
+# To load a *preliminary* Phase 2 / Stack v1 adapter instead, use the flat top-level
+# folder (no `tiny-aya-base/` prefix) — e.g. the original cond-2 Chinese hackathon adapter:
+model = PeftModel.from_pretrained(
+    base_model,
+    "legesher/language-decoded-lora",
+    subfolder="condition-2-zh-5k",
 )
 ```
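
Since the only difference between loading a Phase 3 and a Phase 2 adapter is the `subfolder` string, a small helper can build it from its parts and reduce the chance of citing the wrong generation by accident. This is a sketch under the naming convention shown in the tables above; `adapter_subfolder` and `load_adapter` are hypothetical names, not part of the repo, and the `variant` argument stands in for the extra suffixes such as `native-code` or `c4ai-aya-expanse-32b`.

```python
def adapter_subfolder(condition, language, tier, seed=None, variant=None, phase=3):
    """Build the subfolder for an adapter in legesher/language-decoded-lora.

    Phase 3 (paper) adapters live under `tiny-aya-base/` and end in `-seed<N>`.
    Phase 2 (preliminary) adapters are flat top-level folders with no seed.
    """
    name = f"condition-{condition}-{language}-{tier}"
    if variant:
        name += f"-{variant}"
    if phase == 3:
        if seed is None:
            raise ValueError("Phase 3 adapter folders include a seed")
        return f"tiny-aya-base/{name}-seed{seed}"
    if phase == 2:
        return name
    raise ValueError(f"unknown phase: {phase}")


def load_adapter(subfolder,
                 repo_id="legesher/language-decoded-lora",
                 base_id="CohereLabs/tiny-aya-base"):
    """Load the base model and attach one adapter, as in the Usage snippet."""
    # Imported lazily so the path helper works without these packages installed.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    return PeftModel.from_pretrained(base_model, repo_id, subfolder=subfolder)


if __name__ == "__main__":
    print(adapter_subfolder(1, "en", "5k", seed=42))
    print(adapter_subfolder(5, "ur", "5k", seed=42, variant="c4ai-aya-expanse-32b"))
    print(adapter_subfolder(2, "zh", "5k", phase=2))
```

Defaulting `phase` to 3 makes the paper adapters the path of least resistance, while a Phase 2 adapter has to be requested explicitly.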