Text Generation
Transformers
English
qwen2
code-generation
python
fine-tuning
Qwen
tools
agent-framework
multi-agent
conversational
walidsobhie-code committed on
Commit · 4ca507e
Parent(s): 2e091e7
feat: add code completion generator and model registry tools
- scripts/generate_code_completion_data.py: Multi-language code completion generator
- scripts/model_info.py: Model metadata extraction tool
- scripts/compare_models.py: Compare model versions
- MODEL_REGISTRY.md: Version tracking documentation
- training-data/README.md: Training data format docs
- MODEL_REGISTRY.md +69 -0
- scripts/compare_models.py +220 -0
- scripts/generate_code_completion_data.py +262 -0
- scripts/model_info.py +167 -0
- training-data/README.md +182 -0
MODEL_REGISTRY.md
ADDED
@@ -0,0 +1,69 @@
+# Stack 2.9 Model Registry
+
+> Version tracking for all Stack 2.9 model variants.
+
+---
+
+## Model Versions
+
+| Version | Status | Date | Base Model | Parameters | Dataset | Performance | Use Case |
+|---------|--------|------|------------|------------|---------|-------------|----------|
+| `stack-2.9-1.5B` | 🟡 In Training | 2026-04-06 | Llama 3.2-1B | 1.5B | Stack 2.9 dedup | TBD | Research, fine-tuning base |
+| `stack-2.9-7B` | 🔴 Planned | TBD | Llama 3.1-8B | 7B | Stack 2.9 dedup | TBD | General-purpose inference |
+| `stack-2.9-7B-QLoRA` | 🔴 Planned | TBD | Llama 3.1-8B | 7B (quantized) | Stack 2.9 dedup | TBD | Edge deployment, low-memory |
+
+---
+
+## Version Details
+
+### stack-2.9-1.5B (Current)
+
+- **Status:** In Training
+- **Architecture:** Transformer (pretrained)
+- **Base Model:** Llama 3.2-1B
+- **Parameters:** 1.5B
+- **Training Data:** Stack 2.9 deduplicated
+- **Context Length:** 128k tokens
+- **Vocabulary Size:** ~128K
+- **Precision:** BF16
+- **Training Hardware:** 8x H100 (to be confirmed)
+- **Expected Completion:** TBD
+- **Notes:** First iteration of Stack 2.9, used as baseline for larger variants
+
+### stack-2.9-7B (Planned)
+
+- **Status:** Planned
+- **Architecture:** Transformer (pretrained)
+- **Base Model:** Llama 3.1-8B
+- **Parameters:** 7B
+- **Training Data:** Stack 2.9 deduplicated
+- **Context Length:** 128k tokens
+- **Vocabulary Size:** ~128K
+- **Precision:** BF16
+- **Training Hardware:** TBD
+- **Expected Start:** TBD
+- **Notes:** Scale-up from 1.5B, targeting general-purpose use
+
+### stack-2.9-7B-QLoRA (Planned)
+
+- **Status:** Planned
+- **Architecture:** Transformer + QLoRA
+- **Base Model:** Llama 3.1-8B
+- **Parameters:** 7B (4-bit quantized)
+- **Training Data:** Stack 2.9 deduplicated
+- **Context Length:** 128k tokens
+- **Vocabulary Size:** ~128K
+- **Quantization:** 4-bit NF4
+- **LoRA Rank:** TBD
+- **LoRA Alpha:** TBD
+- **LoRA Dropout:** TBD
+- **Target Modules:** TBD
+- **Notes:** Quantized for consumer GPU deployment (e.g., 24GB VRAM)
+
+---
+
+## Changelog
+
+| Date | Version | Change |
+|------|---------|--------|
+| 2026-04-06 | stack-2.9-1.5B | Initial entry; training started |
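MODEL_REGISTRY.md is the human-readable view, while scripts/compare_models.py in this commit reads a machine-readable models/registry.json that the commit does not include. The sketch below shows what one entry might look like. The key names are the ones compare_models.py looks up; the status string, the integer parameter count, and the context length value are assumptions made for illustration, not the author's actual schema.

```python
import json

# Hypothetical entry mirroring the stack-2.9-1.5B row of MODEL_REGISTRY.md.
# Key names follow what scripts/compare_models.py reads; the real layout of
# models/registry.json is not part of this commit, so treat this as a guess.
entry = {
    "version": "stack-2.9-1.5B",
    "status": "in_training",          # assumed spelling of "In Training"
    "created_at": "2026-04-06",
    "base_model": "Llama 3.2-1B",
    "parameters": 1_500_000_000,      # assumed integer form of "1.5B"
    "precision": "BF16",
    "context_length": 128_000,        # "128k tokens"; exact value assumed
    "dataset": "Stack 2.9 dedup",
    "use_case": "Research, fine-tuning base",
    "performance": {},                # benchmarks are still TBD
}

# compare_models.py expects a top-level "models" list.
registry = {"models": [entry]}
registry_json = json.dumps(registry, indent=2)
print(registry_json)
```

Keeping `parameters` as an integer matters here because the comparison script formats and divides that value numerically.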
scripts/compare_models.py
ADDED
@@ -0,0 +1,220 @@
+#!/usr/bin/env python3
+"""
+compare_models.py: Compare different Stack 2.9 model versions.
+
+Reads from models/registry.json and produces a side-by-side comparison
+of model properties and performance metrics.
+
+Usage:
+    python scripts/compare_models.py
+    python scripts/compare_models.py --models stack-2.9-1.5B stack-2.9-7B
+    python scripts/compare_models.py --metrics hellaswag mmlu humaneval
+    python scripts/compare_models.py --verbose
+"""
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+
+REGISTRY_PATH = Path(__file__).parent.parent / "models" / "registry.json"
+
+ALL_METRICS = ["hellaswag", "arc_challenge", "mmlu", "humaneval", "loss"]
+
+
+def load_registry(registry_path: Path = REGISTRY_PATH) -> dict:
+    """Load the model registry JSON."""
+    if not registry_path.exists():
+        print(f"ERROR: Registry not found at {registry_path}", file=sys.stderr)
+        sys.exit(1)
+    with open(registry_path) as f:
+        return json.load(f)
+
+
+def format_params(n: int) -> str:
+    """Format a parameter count as a short string, e.g. 1.5B or 350M."""
+    if n >= 1_000_000_000:
+        return f"{n / 1_000_000_000:.1f}B"
+    elif n >= 1_000_000:
+        return f"{n / 1_000_000:.0f}M"
+    return str(n)
+
+
+def compare_params(a: int, b: int) -> str:
+    """Compare two parameter counts (both must be non-zero)."""
+    ratio = b / a
+    if ratio > 1:
+        return f"  {ratio:.1f}x larger ({format_params(b)} vs {format_params(a)})"
+    else:
+        return f"  {1/ratio:.1f}x smaller ({format_params(b)} vs {format_params(a)})"
+
+
+def build_row(version: str, key: str, value) -> str:
+    """Build a comparison table row."""
+    if value is None:
+        val_str = "-"
+    elif isinstance(value, float):
+        val_str = f"{value:.4f}"
+    elif isinstance(value, int):
+        val_str = f"{value:,}"
+    else:
+        val_str = str(value)
+    return f"  {version:<22}  {key:<30}  {val_str}"
+
+
+def print_comparison(models: list, metrics: list, verbose: bool = False):
+    """Print a side-by-side comparison table."""
+    # Header
+    versions = [m["version"] for m in models]
+    max_ver_len = max(len(v) for v in versions)
+
+    print(f"\n{'='*72}")
+    print("  Model Comparison: Stack 2.9")
+    print(f"{'='*72}")
+
+    # Non-metric fields; tuple keys are paths into nested dicts
+    fields = [
+        ("Base Model", "base_model"),
+        ("Parameters", "parameters"),
+        ("Quantization", "quantization"),
+        ("Precision", "precision"),
+        ("Context Length", "context_length"),
+        ("Vocabulary Size", "vocabulary_size"),
+        ("Dataset", "dataset"),
+        ("LoRA Rank", ("lora", "rank")),
+        ("LoRA Alpha", ("lora", "alpha")),
+        ("LoRA Dropout", ("lora", "dropout")),
+        ("Status", "status"),
+        ("Created", "created_at"),
+        ("Use Case", "use_case"),
+    ]
+
+    print(f"\n  {'Model':<{max_ver_len}}  {'Field':<30}  {'Value'}")
+    print(f"  {'-'*max_ver_len}  {'-'*30}  {'-'*20}")
+
+    for label, key in fields:
+        row_values = []
+        for m in models:
+            if isinstance(key, tuple):
+                # Walk the nested path; None marks a missing key, so falsy
+                # values such as a dropout of 0.0 are still reported.
+                nested = m
+                for k in key:
+                    nested = nested.get(k) if isinstance(nested, dict) else None
+                row_values.append(nested)
+            else:
+                val = m.get(key)
+                # Format parameters as human-readable
+                if key == "parameters" and val:
+                    val = f"{format_params(val)} ({val:,})"
+                row_values.append(val)
+        # Skip fields that no model defines
+        if all(v is None for v in row_values):
+            continue
+        print(f"\n  {label}:")
+        for i, (ver, val) in enumerate(zip(versions, row_values)):
+            if val is None:
+                val_str = "-"
+            elif isinstance(val, float):
+                val_str = f"{val:.4f}"
+            elif isinstance(val, int):
+                val_str = f"{val:,}"
+            else:
+                val_str = str(val)
+            # Arrow marks values that differ from the first model
+            marker = " →" if i > 0 and row_values[i] != row_values[0] else "  "
+            print(f"  {marker} {ver:<{max_ver_len}}  {val_str}")
+
+    # Performance metrics comparison
+    has_any_metrics = any(
+        any(m.get("performance", {}).get(metric) is not None for m in models)
+        for metric in metrics
+    )
+    if has_any_metrics:
+        print("\n\n  Performance Benchmarks")
+        print(f"  {'-'*max_ver_len}  {'-'*30}  {'-'*10}")
+
+        for metric in metrics:
+            metric_name = metric.replace("_", " ").title()
+            values = [m.get("performance", {}).get(metric) for m in models]
+            if all(v is None for v in values):
+                continue
+            print(f"\n  {metric_name}:")
+            for i, (ver, val) in enumerate(zip(versions, values)):
+                if val is None:
+                    val_str = "N/A"
+                else:
+                    val_str = f"{val:.4f}"
+                marker = " →" if i > 0 else "  "
+                print(f"  {marker} {ver:<{max_ver_len}}  {val_str}")
+
+    # Parameter size comparison (pairwise)
+    if len(models) >= 2:
+        print("\n\n  Parameter Size Comparison:")
+        for i in range(len(models)):
+            for j in range(i + 1, len(models)):
+                a, b = models[i], models[j]
+                pa = a.get("parameters", 0)
+                pb = b.get("parameters", 0)
+                if pa and pb:
+                    # Report the plain ratio, which reads correctly whether
+                    # b is larger or smaller than a.
+                    ratio = pb / pa
+                    print(f"    {b['version']} is {ratio:.2f}x the size of {a['version']}")
+
+    print(f"\n{'='*72}\n")
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Compare Stack 2.9 model versions side by side."
+    )
+    parser.add_argument(
+        "--models", "-m",
+        nargs="+",
+        metavar="VERSION",
+        help="Model versions to compare (e.g., stack-2.9-1.5B stack-2.9-7B). "
+             "If omitted, compares all available models."
+    )
+    parser.add_argument(
+        "--metrics", "-M",
+        nargs="+",
+        choices=ALL_METRICS,
+        default=ALL_METRICS,
+        help=f"Benchmark metrics to include (default: all). Choices: {ALL_METRICS}"
+    )
+    parser.add_argument(
+        "--verbose", "-v",
+        action="store_true",
+        help="Show verbose output."
+    )
+    parser.add_argument(
+        "--registry",
+        default=REGISTRY_PATH,
+        metavar="PATH",
+        help=f"Path to registry.json (default: {REGISTRY_PATH})."
+    )
+    args = parser.parse_args()
+
+    registry_path = Path(args.registry)
+    registry = load_registry(registry_path)
+    models = registry.get("models", [])
+
+    if args.models:
+        selected = []
+        for v in args.models:
+            found = next((m for m in models if m["version"] == v), None)
+            if found:
+                selected.append(found)
+            else:
+                print(f"WARNING: Model '{v}' not found in registry. Skipping.", file=sys.stderr)
+                available = ", ".join(m["version"] for m in models)
+                print(f"  Available: {available}", file=sys.stderr)
+    else:
+        selected = models
+
+    # An empty selection (or an empty registry) leaves nothing to print
+    if not selected:
+        print("ERROR: No valid models selected.", file=sys.stderr)
+        sys.exit(1)
+
+    print_comparison(selected, metrics=args.metrics, verbose=args.verbose)
+
+
+if __name__ == "__main__":
+    main()
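The `fields` list in `print_comparison` mixes plain keys with tuple paths such as `("lora", "rank")`. A minimal standalone sketch of that lookup follows, assuming a helper named `get_field` (hypothetical, not part of the commit) and a made-up model entry whose LoRA values are placeholders, since the registry still lists them as TBD. It returns `None` only when a key is absent, so a legitimate value like a dropout of `0.0` is not mistaken for a missing field.

```python
from typing import Any, Union


def get_field(model: dict, key: Union[str, tuple]) -> Any:
    """Return a field from a registry entry; tuple keys walk nested dicts.

    Returns None when any part of the path is missing.
    """
    path = key if isinstance(key, tuple) else (key,)
    current: Any = model
    for part in path:
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# Made-up entry for illustration; rank and dropout are placeholder values.
model = {"version": "stack-2.9-7B-QLoRA", "lora": {"rank": 16, "dropout": 0.0}}

print(get_field(model, "version"))
print(get_field(model, ("lora", "rank")))
print(get_field(model, ("lora", "dropout")))  # 0.0 is kept, not dropped
print(get_field(model, ("lora", "alpha")))    # key absent, so None
```

Checking membership with `in` rather than truthiness is the design choice that keeps zero-valued settings visible in the comparison.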
scripts/generate_code_completion_data.py
ADDED
@@ -0,0 +1,262 @@
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Synthetic Code Completion Training Data Generator for Stack 2.9
|
| 4 |
+
Generates training examples for pure code completion without tools.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import json
|
| 8 |
+
import random
|
| 9 |
+
import argparse
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from typing import Dict, List
|
| 12 |
+
|
| 13 |
+
LANGUAGES = ["python", "javascript", "go", "rust", "typescript"]
|
| 14 |
+
DIFFICULTY_EASY = "easy"
|
| 15 |
+
DIFFICULTY_MEDIUM = "medium"
|
| 16 |
+
DIFFICULTY_HARD = "hard"
|
| 17 |
+
|
| 18 |
+
# Code templates organized by language -> difficulty -> templates
|
| 19 |
+
CODE_TEMPLATES = {
|
| 20 |
+
"python": {
|
| 21 |
+
DIFFICULTY_EASY: [
|
| 22 |
+
{"context": "def greet(name):", "completion": ' return f"Hello, {name}!"', "description": "Simple greeting function"},
|
| 23 |
+
{"context": "numbers = [1, 2, 3, 4, 5]\n\n", "completion": "for num in numbers:\n print(num)", "description": "Loop through list"},
|
| 24 |
+
{"context": "class Person:\n def __init__(self, name):", "completion": " self.name = name", "description": "Class init"},
|
| 25 |
+
{"context": "def add(a, b):\n ", "completion": " return a + b", "description": "Add function"},
|
| 26 |
+
{"context": "if x > 0:\n print('positive')\nelif x < 0:\n ", "completion": " print('negative')", "description": "Conditional"},
|
| 27 |
+
],
|
| 28 |
+
DIFFICULTY_MEDIUM: [
|
| 29 |
+
{"context": "def fibonacci(n):\n if n <= 1:\n return n\n ", "completion": " return fibonacci(n-1) + fibonacci(n-2)", "description": "Fibonacci"},
|
| 30 |
+
{"context": "class Calculator:\n def __init__(self):\n self.result = 0\n \n def add(self, x):\n ", "completion": " self.result += x\n return self.result", "description": "Calculator"},
|
| 31 |
+
{"context": "async def fetch_data(url):\n async with aiohttp.ClientSession() as session:\n async with session.get(url) as response:\n ", "completion": " return await response.json()", "description": "Async HTTP"},
|
| 32 |
+
{"context": "def validate_email(email):\n pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n ", "completion": " return re.match(pattern, email) is not None", "description": "Email validation"},
|
| 33 |
+
{"context": "@app.route('/users/<int:user_id>')\ndef get_user(user_id):\n user = User.query.get_or_404(user_id)\n ", "completion": " return jsonify(user.to_dict())", "description": "Flask route"},
|
| 34 |
+
],
|
| 35 |
+
DIFFICULTY_HARD: [
|
| 36 |
+
{"context": "class LRUCache:\n def __init__(self, capacity):\n self.capacity = capacity\n self.cache = OrderedDict()\n \n def get(self, key):\n if key not in self.cache:\n return -1\n ", "completion": " self.cache.move_to_end(key)\n return self.cache[key]", "description": "LRU Cache"},
|
| 37 |
+
{"context": "def merge_sort(arr):\n if len(arr) <= 1:\n return arr\n \n mid = len(arr) // 2\n left = merge_sort(arr[:mid])\n right = merge_sort(arr[mid:])\n ", "completion": " return merge(left, right)", "description": "Merge sort"},
|
| 38 |
+
{"context": "class BinaryTree:\n def __init__(self, value):\n self.value = value\n self.left = None\n self.right = None\n \n def inorder(self, node, result=None):\n if result is None:\n result = []\n if node:\n ", "completion": " self.inorder(node.left, result)\n result.append(node.value)\n self.inorder(node.right, result)\n return result", "description": "Binary tree inorder"},
|
| 39 |
+
{"context": "def bellman_ford(graph, source):\n dist = {v: float('inf') for v in graph}\n dist[source] = 0\n \n for _ in range(len(graph) - 1):\n for u, v, w in graph.edges:\n if dist[u] != float('inf') and dist[u] + w < dist[v]:\n ", "completion": " dist[v] = dist[u] + w\n return dist", "description": "Bellman-Ford"},
|
| 40 |
+
],
|
| 41 |
+
},
|
| 42 |
+
"javascript": {
|
| 43 |
+
DIFFICULTY_EASY: [
|
| 44 |
+
{"context": "const greet = (name) => {", "completion": ' return `Hello, ${name}!`;', "description": "Arrow greeting"},
|
| 45 |
+
{"context": "const numbers = [1, 2, 3, 4, 5];\n\n", "completion": "numbers.forEach(num => console.log(num));", "description": "forEach loop"},
|
| 46 |
+
{"context": "class Person {\n constructor(name) {", "completion": " this.name = name;", "description": "JS class constructor"},
|
| 47 |
+
{"context": "const add = (a, b) => {", "completion": " return a + b;", "description": "Add function"},
|
| 48 |
+
{"context": "if (x > 0) {\n console.log('positive');\n} else if (x < 0) {\n ", "completion": " console.log('negative');", "description": "Conditional"},
|
| 49 |
+
],
|
| 50 |
+
DIFFICULTY_MEDIUM: [
|
| 51 |
+
{"context": "const fetchData = async (url) => {\n try {\n const response = await fetch(url);\n ", "completion": " return await response.json();\n } catch (error) {\n console.error('Error:', error);\n }", "description": "Async fetch"},
|
| 52 |
+
{"context": "class EventEmitter {\n constructor() {\n this.events = {};\n }\n \n on(event, callback) {\n ", "completion": " if (!this.events[event]) this.events[event] = [];\n this.events[event].push(callback);", "description": "Event emitter"},
|
| 53 |
+
{"context": "const debounce = (func, delay) => {\n let timeoutId;\n return (...args) => {\n clearTimeout(timeoutId);\n ", "completion": " timeoutId = setTimeout(() => func.apply(this, args), delay);", "description": "Debounce"},
|
| 54 |
+
{"context": "const memoize = (fn) => {\n const cache = new Map();\n return (n) => {\n if (cache.has(n)) {\n return cache.get(n);\n }\n ", "completion": " const result = fn(n);\n cache.set(n, result);\n return result;", "description": "Memoize"},
|
| 55 |
+
],
|
| 56 |
+
DIFFICULTY_HARD: [
|
| 57 |
+
{"context": "class PromisePool {\n constructor(maxConcurrent) {\n this.maxConcurrent = maxConcurrent;\n this.running = 0;\n this.queue = [];\n }\n \n add(promiseFn) {\n return new Promise((resolve, reject) => {\n ", "completion": " this.queue.push({ promiseFn, resolve, reject });\n this.process();\n });", "description": "Promise pool"},
|
| 58 |
+
{"context": "const virtualDOM = {\n createElement(tag, props, ...children) {\n return {\n tag,\n props: props || {},\n children: children.flat(),\n };\n },\n render(vnode, container) {\n ", "completion": " const el = document.createElement(vnode.tag);\n Object.entries(vnode.props || {}).forEach(([key, value]) => el.setAttribute(key, value));\n vnode.children.forEach(child => {\n if (typeof child === 'string') el.appendChild(document.createTextNode(child));\n else this.render(child, el);\n });\n container.appendChild(el);", "description": "Virtual DOM"},
|
| 59 |
+
],
|
| 60 |
+
},
|
| 61 |
+
"go": {
|
| 62 |
+
DIFFICULTY_EASY: [
|
| 63 |
+
{"context": "func greet(name string) string {", "completion": ' return "Hello, " + name + "!"', "description": "Greet function"},
|
| 64 |
+
{"context": "func add(a, b int) int {", "completion": " return a + b", "description": "Add function"},
|
| 65 |
+
{"context": "type Person struct {\n Name string\n ", "completion": " Age int", "description": "Struct definition"},
|
| 66 |
+
{"context": "for i := 0; i < 10; i++ {\n ", "completion": " fmt.Println(i)", "description": "For loop"},
|
| 67 |
+
{"context": "if x > 0 {\n fmt.Println(\"positive\")\n} else {\n ", "completion": ' fmt.Println("non-positive")', "description": "If-else"},
|
| 68 |
+
],
|
| 69 |
+
DIFFICULTY_MEDIUM: [
|
| 70 |
+
{"context": "func (p Person) Greet() string {", "completion": ' return fmt.Sprintf("Hello, %s!", p.Name)', "description": "Method"},
|
| 71 |
+
{"context": "func worker(jobs <-chan int, results chan<- int) {\n for j := range jobs {\n ", "completion": " results <- j * 2", "description": "Worker goroutine"},
|
| 72 |
+
{"context": "type Handler interface {\n Handle(ctx context.Context, req Request) Response\n ", "completion": " Cleanup(ctx context.Context)", "description": "Interface"},
|
| 73 |
+
{"context": "func fetchData(url string) ([]byte, error) {\n resp, err := http.Get(url)\n if err != nil {\n return nil, err\n }\n defer resp.Body.Close()\n ", "completion": " return io.ReadAll(resp.Body)", "description": "HTTP GET"},
|
| 74 |
+
],
|
| 75 |
+
DIFFICULTY_HARD: [
|
| 76 |
+
{"context": "type TreeNode struct {\n Val int\n Left *TreeNode\n Right *TreeNode\n}\n\nfunc (root *TreeNode) InorderTraversal() []int {\n var result []int\n var inorder func(*TreeNode)\n inorder = func(node *TreeNode) {\n if node == nil {\n return\n }\n ", "completion": " inorder(node.Left)\n result = append(result, node.Val)\n inorder(node.Right)", "description": "Tree inorder"},
|
| 77 |
+
{"context": "func (c *Client) StreamProcess(ctx context.Context, req *Request, stream chan<- *Response) error {\n for {\n select {\n case <-ctx.Done():\n return ctx.Err()\n default:\n result, err := c.processOne(req)\n if err != nil {\n return err\n }\n ", "completion": " select {\n case stream <- result:\n case <-ctx.Done():\n return ctx.Err()\n }", "description": "Streaming"},
|
| 78 |
+
],
|
| 79 |
+
},
|
| 80 |
+
"rust": {
|
| 81 |
+
DIFFICULTY_EASY: [
|
| 82 |
+
{"context": "fn greet(name: &str) -> String {", "completion": ' format!("Hello, {}!", name)', "description": "Greet function"},
|
| 83 |
+
{"context": "fn add(a: i32, b: i32) -> i32 {", "completion": " a + b", "description": "Add function"},
|
| 84 |
+
{"context": "struct Person {\n name: String,\n ", "completion": " age: u32,", "description": "Struct"},
|
| 85 |
+
{"context": "let numbers = vec![1, 2, 3, 4, 5];\nfor num in &numbers {\n ", "completion": " println!(\"{}\", num);", "description": "For loop"},
|
| 86 |
+
{"context": "fn main() {\n let result = match value {\n Some(x) => x,\n ", "completion": " None => 0,", "description": "Match"},
|
| 87 |
+
],
|
| 88 |
+
DIFFICULTY_MEDIUM: [
|
| 89 |
+
{"context": "impl Person {\n fn new(name: String, age: u32) -> Self {", "completion": " Person { name, age }", "description": "Constructor"},
|
| 90 |
+
{"context": "fn fetch_data(url: &str) -> Result<String, Error> {\n let response = reqwest::blocking::get(url)?;\n ", "completion": " let body = response.text()?;\n Ok(body)", "description": "HTTP request"},
|
| 91 |
+
{"context": "fn process_items<T: Display>(items: Vec<T>) -> String {\n items\n .iter()\n .enumerate()\n .map(|(i, item)| format!(\"{}: {}\", i, item))\n ", "completion": " .collect::<Vec<_>>()\n .join(\", \")", "description": "Iterator chain"},
|
| 92 |
+
{"context": "fn spawn_worker(jobs: Arc<Mutex<Vec<Job>>>) {\n thread::spawn(move || {\n loop {\n let job = {\n let mut jobs = jobs.lock().unwrap();\n jobs.pop()\n };\n match job {\n Some(job) => job.execute(),\n ", "completion": " None => break,\n };\n }\n });", "description": "Worker thread"},
|
| 93 |
+
],
|
| 94 |
+
DIFFICULTY_HARD: [
|
| 95 |
+
{"context": "pub struct LRUCache<K, V> {\n capacity: usize,\n cache: LinkedHashMap<K, V>,\n}\n\nimpl<K: Eq + Hash + Clone, V: Clone> LRUCache<K, V> {\n pub fn get(&mut self, key: &K) -> Option<&V> {\n if self.cache.contains_key(key) {\n ", "completion": " self.cache.remove(key);\n let value = self.cache[key].clone();\n self.cache.insert(key.clone(), value);\n self.cache.get(key)\n } else {\n None\n }", "description": "LRU Cache"},
|
| 96 |
+
{"context": "pub trait Observer<T> {\n fn update(&self, event: &T);\n}\n\npub struct Subject<T> {\n observers: Vec<Box<dyn Observer<T>>>,\n}\n\nimpl<T> Subject<T> {\n pub fn notify(&self, event: &T) {\n for observer in &self.observers {\n ", "completion": " observer.update(event);", "description": "Observer pattern"},
|
| 97 |
+
],
|
| 98 |
+
},
|
| 99 |
+
}
|
| 100 |
+
|
| 101 |
+
VARIANTS = ["basic", "explain", "debug", "optimize"]
|
| 102 |
+
|
| 103 |
+
VARIANT_PROMPTS = {
|
| 104 |
+
"basic": {"system": "You are a helpful AI assistant that helps with code completion.", "user_prefix": "Complete the following code:\n\n"},
|
| 105 |
+
"explain": {"system": "You are a helpful AI assistant that explains and completes code.", "user_prefix": "Explain what this code does and complete it:\n\n"},
|
| 106 |
+
"debug": {"system": "You are a helpful AI assistant that finds bugs and suggests fixes.", "user_prefix": "There's a bug in this code. Fix and complete it:\n\n"},
|
| 107 |
+
"optimize": {"system": "You are a helpful AI assistant that optimizes code for performance.", "user_prefix": "Optimize this code and complete it:\n\n"},
|
| 108 |
+
}
|
| 109 |
+
|
| 110 |
+
|
| 111 |
+
def create_completion_example(context, completion, language, difficulty, variant, description):
|
| 112 |
+
"""Create a single code completion example."""
|
| 113 |
+
variant_info = VARIANT_PROMPTS[variant]
|
| 114 |
+
messages = [
|
| 115 |
+
{"role": "system", "content": variant_info["system"]},
|
| 116 |
+
{"role": "user", "content": f"{variant_info['user_prefix']}```{language}\n{context}```"},
|
| 117 |
+
{"role": "assistant", "content": f"Here's the completed code:\n\n```{language}\n{context}{completion}\n```"}
|
| 118 |
+
]
|
| 119 |
+
return {
|
| 120 |
+
"messages": messages,
|
| 121 |
+
"language": language,
|
| 122 |
+
"difficulty": difficulty,
|
| 123 |
+
"variant": variant,
|
| 124 |
+
"description": description,
|
| 125 |
+
"context": context,
|
| 126 |
+
"completion": completion,
|
| 127 |
+
}
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
def generate_examples_for_language(language, difficulty, num_examples, variants):
|
| 131 |
+
"""Generate examples for a specific language and difficulty."""
|
| 132 |
+
templates = CODE_TEMPLATES[language][difficulty]
|
| 133 |
+
examples = []
|
| 134 |
+
for i in range(num_examples):
|
| 135 |
+
template = templates[i % len(templates)]
|
| 136 |
+
variant = random.choice(variants)
|
| 137 |
+
example = create_completion_example(
|
| 138 |
+
context=template["context"],
|
| 139 |
+
completion=template["completion"],
|
| 140 |
+
language=language,
|
| 141 |
+
difficulty=difficulty,
|
| 142 |
+
variant=variant,
|
| 143 |
+
description=template["description"]
|
| 144 |
+
)
|
| 145 |
+
examples.append(example)
|
| 146 |
+
return examples
|
| 147 |
+
|
| 148 |
+
|
| 149 |
+
def generate_dataset(num_examples=1000, languages=None, difficulties=None, variants=None, balance=True):
|
| 150 |
+
"""Generate the complete dataset."""
|
| 151 |
+
if languages is None:
|
| 152 |
+
languages = LANGUAGES
|
| 153 |
+
if difficulties is None:
|
| 154 |
+
difficulties = [DIFFICULTY_EASY, DIFFICULTY_MEDIUM, DIFFICULTY_HARD]
|
| 155 |
+
if variants is None:
|
| 156 |
+
variants = VARIANTS
|
| 157 |
+
|
| 158 |
+
examples = []
|
| 159 |
+
|
| 160 |
+
if balance:
|
| 161 |
+
examples_per_lang = num_examples // len(languages)
|
| 162 |
+
examples_per_diff = examples_per_lang // len(difficulties)
|
| 163 |
+
remainder = num_examples % (len(languages) * len(difficulties))
|
| 164 |
+
|
| 165 |
+
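for lang_idx, lang in enumerate(languages):
|
| 166 |
+
for diff_idx, diff in enumerate(difficulties):
|
| 167 |
+
count = examples_per_diff + (1 if lang_idx * len(difficulties) + diff_idx < remainder else 0)  # spread the remainder over (language, difficulty) cells so the total equals num_examples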
|
| 168 |
+
lang_examples = generate_examples_for_language(lang, diff, count, variants)
|
| 169 |
+
examples.extend(lang_examples)
|
| 170 |
+
else:
|
| 171 |
+
for _ in range(num_examples):
|
| 172 |
+
lang = random.choice(languages)
|
| 173 |
+
diff = random.choice(difficulties)
|
| 174 |
+
template = random.choice(CODE_TEMPLATES[lang][diff])
|
| 175 |
+
variant = random.choice(variants)
|
| 176 |
+
example = create_completion_example(
|
| 177 |
+
context=template["context"],
|
| 178 |
+
completion=template["completion"],
|
| 179 |
+
language=lang,
|
| 180 |
+
difficulty=diff,
|
| 181 |
+
variant=variant,
|
| 182 |
+
description=template["description"]
|
| 183 |
+
)
|
| 184 |
+
examples.append(example)
|
| 185 |
+
|
| 186 |
+
random.shuffle(examples)
|
| 187 |
+
return examples
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
def save_jsonl(examples, output_path):
|
| 191 |
+
"""Save examples to JSONL format."""
|
| 192 |
+
output_file = Path(output_path)
|
| 193 |
+
output_file.parent.mkdir(parents=True, exist_ok=True)
|
| 194 |
+
with open(output_file, 'w', encoding='utf-8') as f:
|
| 195 |
+
for example in examples:
|
| 196 |
+
f.write(json.dumps(example, ensure_ascii=False) + '\n')
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
def save_json(examples, output_path):
|
| 200 |
+
"""Save examples to JSON format."""
|
| 201 |
+
output_file = Path(output_path)
|
| 202 |
+
output_file.parent.mkdir(parents=True, exist_ok=True)
|
| 203 |
+
with open(output_file, 'w', encoding='utf-8') as f:
|
| 204 |
+
json.dump(examples, f, ensure_ascii=False, indent=2)
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def main():
|
| 208 |
+
parser = argparse.ArgumentParser(description="Generate synthetic code completion training data")
|
| 209 |
+
parser.add_argument("--num-examples", type=int, default=1000, help="Number of examples to generate")
|
| 210 |
+
parser.add_argument("--output-dir", type=str, default="training-data/code-completion", help="Output directory")
|
| 211 |
+
parser.add_argument("--output-format", choices=["jsonl", "json", "both"], default="jsonl", help="Output format")
|
| 212 |
+
parser.add_argument("--seed", type=int, default=42, help="Random seed")
|
| 213 |
+
args = parser.parse_args()
|
| 214 |
+
|
| 215 |
+
random.seed(args.seed)
|
| 216 |
+
|
| 217 |
+
print(f"Generating {args.num_examples} code completion training examples...")
|
| 218 |
+
print(f" Languages: {LANGUAGES}")
|
| 219 |
+
print(f" Output directory: {args.output_dir}")
|
| 220 |
+
|
| 221 |
+
examples = generate_dataset(
|
| 222 |
+
num_examples=args.num_examples,
|
| 223 |
+
languages=LANGUAGES,
|
| 224 |
+
difficulties=[DIFFICULTY_EASY, DIFFICULTY_MEDIUM, DIFFICULTY_HARD],
|
| 225 |
+
variants=VARIANTS
|
| 226 |
+
)
|
| 227 |
+
|
| 228 |
+
output_dir = Path(args.output_dir)
|
| 229 |
+
|
| 230 |
+
if args.output_format in ["jsonl", "both"]:
|
| 231 |
+
jsonl_path = output_dir / "code_completion.jsonl"
|
| 232 |
+
save_jsonl(examples, str(jsonl_path))
|
| 233 |
+
print(f"Saved JSONL: {jsonl_path}")
|
| 234 |
+
|
| 235 |
+
if args.output_format in ["json", "both"]:
|
| 236 |
+
json_path = output_dir / "code_completion.json"
|
| 237 |
+
save_json(examples, str(json_path))
|
| 238 |
+
print(f"Saved JSON: {json_path}")
|
| 239 |
+
|
| 240 |
+
# Statistics
|
| 241 |
+
print(f"\nStatistics:")
|
| 242 |
+
print(f" Total examples: {len(examples)}")
|
| 243 |
+
|
| 244 |
+
lang_counts = {}
|
| 245 |
+
diff_counts = {}
|
| 246 |
+
for ex in examples:
|
| 247 |
+
lang_counts[ex["language"]] = lang_counts.get(ex["language"], 0) + 1
|
| 248 |
+
diff_counts[ex["difficulty"]] = diff_counts.get(ex["difficulty"], 0) + 1
|
| 249 |
+
|
| 250 |
+
print(f" By language:")
|
| 251 |
+
for lang, count in sorted(lang_counts.items(), key=lambda x: x[1], reverse=True):
|
| 252 |
+
print(f" - {lang}: {count}")
|
| 253 |
+
|
| 254 |
+
print(f" By difficulty:")
|
| 255 |
+
for diff, count in sorted(diff_counts.items(), key=lambda x: x[1], reverse=True):
|
| 256 |
+
print(f" - {diff}: {count}")
|
| 257 |
+
|
| 258 |
+
print(f"\nGeneration complete!")
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
if __name__ == "__main__":
|
| 262 |
+
main()
|
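The balanced branch of `generate_dataset` divides `num_examples` across every language and difficulty. A minimal sketch of one way to make such a split sum exactly to the requested total is shown here; `allocate_counts` is a hypothetical helper for illustration and is not part of the script.

```python
def allocate_counts(num_examples, languages, difficulties):
    """Split num_examples across (language, difficulty) cells as evenly as possible."""
    cells = [(lang, diff) for lang in languages for diff in difficulties]
    base, remainder = divmod(num_examples, len(cells))
    # The first `remainder` cells take one extra example, so the counts sum to num_examples.
    return {cell: base + (1 if idx < remainder else 0) for idx, cell in enumerate(cells)}


counts = allocate_counts(
    1000,
    ["python", "javascript", "go", "rust", "typescript"],
    ["easy", "medium", "hard"],
)
print(sum(counts.values()), min(counts.values()), max(counts.values()))
```

With 1000 examples and 15 cells, ten cells receive 67 examples and five receive 66, so no cell differs from another by more than one.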
scripts/model_info.py
ADDED
|
@@ -0,0 +1,167 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
model_info.py — Extract and report Stack 2.9 model metadata.
|
| 4 |
+
|
| 5 |
+
Reads from models/registry.json and optionally from a model checkpoint
|
| 6 |
+
directory to extract/verify metadata.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python scripts/model_info.py # Show all models
|
| 10 |
+
python scripts/model_info.py --model stack-2.9-1.5B
|
| 11 |
+
python scripts/model_info.py --model stack-2.9-7B-QLoRA --verbose
|
| 12 |
+
python scripts/model_info.py --export-json /path/to/output.json
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
import argparse
|
| 16 |
+
import json
|
| 17 |
+
import os
|
| 18 |
+
import sys
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
from typing import Optional
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
REGISTRY_PATH = Path(__file__).parent.parent / "models" / "registry.json"
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def load_registry(registry_path: Path = REGISTRY_PATH) -> dict:
|
| 27 |
+
"""Load the model registry JSON."""
|
| 28 |
+
if not registry_path.exists():
|
| 29 |
+
print(f"ERROR: Registry not found at {registry_path}", file=sys.stderr)
|
| 30 |
+
sys.exit(1)
|
| 31 |
+
with open(registry_path, encoding="utf-8") as f:
|
| 32 |
+
return json.load(f)
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
def format_params(n: int) -> str:
|
| 36 |
+
"""Format parameter count as human-readable string."""
|
| 37 |
+
if n >= 1_000_000_000:
|
| 38 |
+
return f"{n / 1_000_000_000:.1f}B"
|
| 39 |
+
elif n >= 1_000_000:
|
| 40 |
+
return f"{n / 1_000_000:.0f}M"
|
| 41 |
+
return str(n)
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def format_lora(config: Optional[dict]) -> str:
|
| 45 |
+
"""Format LoRA config as readable string."""
|
| 46 |
+
if not config:
|
| 47 |
+
return "N/A (full model)"
|
| 48 |
+
lines = [
|
| 49 |
+
f" Rank (r): {config.get('rank', 'N/A')}",
|
| 50 |
+
f" Alpha: {config.get('alpha', 'N/A')}",
|
| 51 |
+
f" Dropout: {config.get('dropout', 'N/A')}",
|
| 52 |
+
f" Target Modules: {', '.join(config.get('target_modules', []))}",
|
| 53 |
+
]
|
| 54 |
+
if config.get("modules_to_save"):
|
| 55 |
+
lines.append(f" Modules to Save: {', '.join(config['modules_to_save'])}")
|
| 56 |
+
return "\n".join(lines)
|
| 57 |
+
|
| 58 |
+
|
| 59 |
+
def format_performance(metrics: dict) -> str:
|
| 60 |
+
"""Format performance metrics."""
|
| 61 |
+
benchmarks = {
|
| 62 |
+
"HellaSwag": metrics.get("hellaswag"),
|
| 63 |
+
"ARC-Challenge": metrics.get("arc_challenge"),
|
| 64 |
+
"MMLU": metrics.get("mmlu"),
|
| 65 |
+
"HumanEval": metrics.get("humaneval"),
|
| 66 |
+
"Training Loss": metrics.get("loss"),
|
| 67 |
+
}
|
| 68 |
+
lines = []
|
| 69 |
+
for name, value in benchmarks.items():
|
| 70 |
+
if value is not None:
|
| 71 |
+
lines.append(f" {name:20s} {value}")
|
| 72 |
+
else:
|
| 73 |
+
lines.append(f" {name:20s} N/A")
|
| 74 |
+
return "\n".join(lines) if any(v is not None for v in benchmarks.values()) else " No benchmarks yet"
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
def status_emoji(status: str) -> str:
|
| 78 |
+
"""Return emoji for model status."""
|
| 79 |
+
return {
|
| 80 |
+
"in_training": "🟡 IN TRAINING",
|
| 81 |
+
"planned": "🔴 PLANNED",
|
| 82 |
+
"released": "🟢 RELEASED",
|
| 83 |
+
"deprecated": "⚠️ DEPRECATED",
|
| 84 |
+
}.get(status, f"({status})")
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
def print_model(model: dict, verbose: bool = False):
|
| 88 |
+
"""Print a single model's info."""
|
| 89 |
+
print(f"\n{'='*60}")
|
| 90 |
+
print(f" {model['version']} [{status_emoji(model['status'])}]")
|
| 91 |
+
print(f"{'='*60}")
|
| 92 |
+
|
| 93 |
+
print(f"\n Base Model: {model['base_model']}")
|
| 94 |
+
print(f" Parameters: {format_params(model['parameters'])} ({model['parameters']:,})")
|
| 95 |
+
print(f" Quantization: {model.get('quantization') or 'None (full precision)'}")
|
| 96 |
+
print(f" Precision: {model.get('precision', 'N/A')}")
|
| 97 |
+
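print(f" Context Length: {format(model['context_length'], ',') + ' tokens' if model.get('context_length') is not None else 'N/A'}")
|
| 98 |
+
print(f" Vocab Size: {format(model['vocabulary_size'], ',') if model.get('vocabulary_size') is not None else 'N/A'}")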
|
| 99 |
+
print(f" Dataset: {model['dataset']}")
|
| 100 |
+
print(f" Created: {model.get('created_at') or 'TBD'}")
|
| 101 |
+
|
| 102 |
+
print(f"\n LoRA Config:")
|
| 103 |
+
print(f" {format_lora(model.get('lora'))}")
|
| 104 |
+
|
| 105 |
+
print(f"\n Performance Metrics:")
|
| 106 |
+
print(f" {format_performance(model.get('performance', {}))}")
|
| 107 |
+
|
| 108 |
+
print(f"\n Use Case: {model['use_case']}")
|
| 109 |
+
if model.get("notes"):
|
| 110 |
+
print(f" Notes: {model['notes']}")
|
| 111 |
+
|
| 112 |
+
|
| 113 |
+
def main():
|
| 114 |
+
parser = argparse.ArgumentParser(
|
| 115 |
+
description="Extract and report Stack 2.9 model metadata."
|
| 116 |
+
)
|
| 117 |
+
parser.add_argument(
|
| 118 |
+
"--model", "-m",
|
| 119 |
+
help="Specific model version to show (e.g., stack-2.9-1.5B). "
|
| 120 |
+
"If omitted, shows all models."
|
| 121 |
+
)
|
| 122 |
+
parser.add_argument(
|
| 123 |
+
"--verbose", "-v",
|
| 124 |
+
action="store_true",
|
| 125 |
+
help="Show verbose output (same as default)."
|
| 126 |
+
)
|
| 127 |
+
parser.add_argument(
|
| 128 |
+
"--export-json", "-o",
|
| 129 |
+
metavar="PATH",
|
| 130 |
+
help="Export selected model(s) as JSON to a file."
|
| 131 |
+
)
|
| 132 |
+
parser.add_argument(
|
| 133 |
+
"--registry",
|
| 134 |
+
default=REGISTRY_PATH,
|
| 135 |
+
metavar="PATH",
|
| 136 |
+
help=f"Path to registry.json (default: {REGISTRY_PATH})."
|
| 137 |
+
)
|
| 138 |
+
args = parser.parse_args()
|
| 139 |
+
|
| 140 |
+
registry_path = Path(args.registry)
|
| 141 |
+
registry = load_registry(registry_path)
|
| 142 |
+
models = registry.get("models", [])
|
| 143 |
+
|
| 144 |
+
if args.model:
|
| 145 |
+
selected = [m for m in models if m["version"] == args.model]
|
| 146 |
+
if not selected:
|
| 147 |
+
print(f"ERROR: Model '{args.model}' not found in registry.", file=sys.stderr)
|
| 148 |
+
print("Available models:", ", ".join(m["version"] for m in models), file=sys.stderr)
|
| 149 |
+
sys.exit(1)
|
| 150 |
+
else:
|
| 151 |
+
selected = models
|
| 152 |
+
|
| 153 |
+
for model in selected:
|
| 154 |
+
print_model(model, verbose=args.verbose)
|
| 155 |
+
|
| 156 |
+
# Export to JSON if requested
|
| 157 |
+
if args.export_json:
|
| 158 |
+
output = {"registry_version": registry.get("registry_version"), "models": selected}
|
| 159 |
+
with open(args.export_json, "w") as f:
|
| 160 |
+
json.dump(output, f, indent=2)
|
| 161 |
+
print(f"\n✓ Exported to {args.export_json}")
|
| 162 |
+
|
| 163 |
+
print()
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
if __name__ == "__main__":
|
| 167 |
+
main()
|
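`model_info.py` reads `models/registry.json`, whose contents are not shown in this diff. The sketch below infers the shape of one entry from the keys that `print_model`, `format_lora`, and `format_performance` access. Every value is a placeholder (only the version string `stack-2.9-1.5B` comes from the script's usage notes), and `missing_keys` is a hypothetical helper.

```python
# Keys that print_model reads with model[...]; a missing one raises KeyError.
REQUIRED_KEYS = ("version", "status", "base_model", "parameters", "dataset", "use_case")

# Hypothetical registry content: every value below is a placeholder.
example_registry = {
    "registry_version": "0.0.0",
    "models": [
        {
            "version": "stack-2.9-1.5B",
            "status": "planned",
            "base_model": "example-org/example-base-model",
            "parameters": 1_500_000_000,
            "quantization": None,
            "precision": "bf16",
            "context_length": 32768,
            "vocabulary_size": 150000,
            "dataset": "example-dataset",
            "created_at": None,
            "lora": {"rank": 16, "alpha": 32, "dropout": 0.05, "target_modules": ["q_proj", "v_proj"]},
            "performance": {"humaneval": None, "loss": None},
            "use_case": "Code completion",
            "notes": "",
        }
    ],
}


def missing_keys(model):
    """Return the required keys that are absent from one registry entry."""
    return [key for key in REQUIRED_KEYS if key not in model]


for model in example_registry["models"]:
    print(model["version"], missing_keys(model))
```

Running a check like this before printing would turn a `KeyError` deep inside `print_model` into a readable list of absent fields.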
training-data/README.md
ADDED
|
@@ -0,0 +1,182 @@
|
| 1 |
+
# Stack 2.9 Training Data
|
| 2 |
+
|
| 3 |
+
This directory contains synthetic training data for fine-tuning code generation models.
|
| 4 |
+
|
| 5 |
+
## Directory Structure
|
| 6 |
+
|
| 7 |
+
```
|
| 8 |
+
training-data/
|
| 9 |
+
├── README.md # This file
|
| 10 |
+
├── tool_examples.jsonl # Tool-calling examples (Qwen2.5-Coder format)
|
| 11 |
+
├── tool_examples.json # Same as above in JSON format
|
| 12 |
+
├── code-completion/ # Pure code completion examples
|
| 13 |
+
│ ├── code_completion.jsonl
|
| 14 |
+
│ └── code_completion.json
|
| 15 |
+
└── training-data-expanded/ # Additional generated data
|
| 16 |
+
└── tool_examples.jsonl # 5000 expanded tool-calling examples
|
| 17 |
+
```
|
| 18 |
+
|
| 19 |
+
## Data Formats
|
| 20 |
+
|
| 21 |
+
### Tool-Calling Examples
|
| 22 |
+
|
| 23 |
+
**Format:** Qwen2.5-Coder style with `tool_calls`
|
| 24 |
+
|
| 25 |
+
Each example contains:
|
| 26 |
+
- `messages`: Array of conversation messages (system, user, assistant, tool)
|
| 27 |
+
- `tools`: Array of tool definitions
|
| 28 |
+
|
| 29 |
+
**Example structure:**
|
| 30 |
+
```json
|
| 31 |
+
{
|
| 32 |
+
"messages": [
|
| 33 |
+
{"role": "system", "content": "You are a helpful AI assistant..."},
|
| 34 |
+
{"role": "user", "content": "Read the file at src/main.py..."},
|
| 35 |
+
{
|
| 36 |
+
"role": "assistant",
|
| 37 |
+
"content": null,
|
| 38 |
+
"tool_calls": [
|
| 39 |
+
{
|
| 40 |
+
"id": "call_1234",
|
| 41 |
+
"type": "function",
|
| 42 |
+
"function": {
|
| 43 |
+
"name": "FileRead",
|
| 44 |
+
"arguments": "{\"path\": \"src/main.py\"}"
|
| 45 |
+
}
|
| 46 |
+
}
|
| 47 |
+
]
|
| 48 |
+
},
|
| 49 |
+
{
|
| 50 |
+
"role": "tool",
|
| 51 |
+
"content": "Successfully read file: src/main.py\n...",
|
| 52 |
+
"tool_call_id": "call_1234",
|
| 53 |
+
"name": "FileRead"
|
| 54 |
+
},
|
| 55 |
+
{"role": "assistant", "content": "Here's the contents..."}
|
| 56 |
+
],
|
| 57 |
+
"tools": [...]
|
| 58 |
+
}
|
| 59 |
+
```
|
| 60 |
+
|
| 61 |
+
**Available Tools:**
|
| 62 |
+
- `Bash` - Execute bash commands
|
| 63 |
+
- `FileRead` - Read file contents
|
| 64 |
+
- `FileWrite` - Write/create files
|
| 65 |
+
- `WebSearch` - Search the web
|
| 66 |
+
- `Grep` - Search patterns in files
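A tool-calling example is only usable if each `tool` message answers a call the assistant actually made. The sketch below checks that pairing and confirms that `arguments` is a JSON-encoded string, as in the example structure above. `check_tool_example` is a hypothetical helper, not something shipped in this repository.

```python
import json


def check_tool_example(example):
    """Return a list of problems found in one tool-calling example."""
    problems = []
    declared = {}  # tool_call id -> function name, filled as assistant calls are seen
    for msg in example.get("messages", []):
        if msg.get("role") == "assistant":
            for call in msg.get("tool_calls") or []:
                fn = call.get("function", {})
                declared[call.get("id")] = fn.get("name")
                try:
                    json.loads(fn.get("arguments", ""))
                except (TypeError, ValueError):
                    problems.append(f"arguments of {call.get('id')} are not valid JSON")
        elif msg.get("role") == "tool":
            call_id = msg.get("tool_call_id")
            if call_id not in declared:
                problems.append(f"tool message references unknown call {call_id}")
            elif msg.get("name") != declared[call_id]:
                problems.append(f"tool message name does not match call {call_id}")
    return problems


sample = {
    "messages": [
        {"role": "user", "content": "Read the file at src/main.py..."},
        {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1234", "type": "function",
             "function": {"name": "FileRead", "arguments": '{"path": "src/main.py"}'}},
        ]},
        {"role": "tool", "content": "...", "tool_call_id": "call_1234", "name": "FileRead"},
    ],
}
print(check_tool_example(sample))
```

An empty list means the example passed these two checks; it says nothing about whether the tool definitions in `tools` match the calls.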
|
| 67 |
+
|
| 68 |
+
### Code Completion Examples
|
| 69 |
+
|
| 70 |
+
**Format:** Chat-based with context and completion
|
| 71 |
+
|
| 72 |
+
Each example contains:
|
| 73 |
+
- `messages`: Array of conversation messages
|
| 74 |
+
- `language`: Programming language (python, javascript, go, rust, typescript)
|
| 75 |
+
- `difficulty`: easy, medium, hard
|
| 76 |
+
- `variant`: basic, explain, debug, optimize
|
| 77 |
+
- `context`: The code context to complete
|
| 78 |
+
- `completion`: The expected completion
|
| 79 |
+
|
| 80 |
+
**Example structure:**
|
| 81 |
+
```json
|
| 82 |
+
{
|
| 83 |
+
"messages": [
|
| 84 |
+
{"role": "system", "content": "You are a helpful AI assistant..."},
|
| 85 |
+
{"role": "user", "content": "Complete the following code:\n\n```python\ndef greet(name):\n```"},
|
| 86 |
+
{"role": "assistant", "content": "Here's the completed code:\n\n```python\ndef greet(name):\n    return f\"Hello, {name}!\"\n```"}
|
| 87 |
+
],
|
| 88 |
+
"language": "python",
|
| 89 |
+
"difficulty": "easy",
|
| 90 |
+
"variant": "basic",
|
| 91 |
+
"description": "Simple function that returns a greeting",
|
| 92 |
+
"context": "def greet(name):\n",
|
| 93 |
+
"completion": "    return f\"Hello, {name}!\""
|
| 94 |
+
}
|
| 95 |
+
```
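The generator builds the assistant message by concatenating `context` and `completion` inside a fenced block, so the two top-level fields should reappear, back to back, in the final assistant message. A minimal sketch of that consistency check follows. `completion_is_consistent` is a hypothetical helper, and the sample assumes `context` ends with a newline, which the plain concatenation requires for the code to stay on separate lines.

```python
FENCE = "`" * 3  # built at runtime so this snippet contains no literal fence


def completion_is_consistent(example):
    """True when the last assistant message contains context directly followed by completion."""
    assistant = [m["content"] for m in example["messages"] if m["role"] == "assistant"]
    return bool(assistant) and (example["context"] + example["completion"]) in assistant[-1]


sample = {
    "messages": [
        {"role": "user", "content": "Complete the following code:\n\n" + FENCE + "python\ndef greet(name):\n" + FENCE},
        {"role": "assistant", "content": "Here's the completed code:\n\n" + FENCE
            + 'python\ndef greet(name):\n    return f"Hello, {name}!"\n' + FENCE},
    ],
    "context": "def greet(name):\n",
    "completion": '    return f"Hello, {name}!"',
}
print(completion_is_consistent(sample))
```

A check like this catches records whose `messages` were edited by hand without updating `context` or `completion`.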
|
| 96 |
+
|
| 97 |
+
## Generation Scripts
|
| 98 |
+
|
| 99 |
+
### Tool Data Generator
|
| 100 |
+
|
| 101 |
+
```bash
|
| 102 |
+
python3 scripts/generate_tool_data.py \
|
| 103 |
+
--num-examples 5000 \
|
| 104 |
+
--output-dir training-data-expanded \
|
| 105 |
+
--output-format jsonl
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
### Code Completion Generator
|
| 109 |
+
|
| 110 |
+
```bash
|
| 111 |
+
python3 scripts/generate_code_completion_data.py \
|
| 112 |
+
--num-examples 1000 \
|
| 113 |
+
--output-dir training-data/code-completion \
|
| 114 |
+
--output-format jsonl \
|
| 115 |
+
--seed 42
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
## Difficulty Levels
|
| 120 |
+
|
| 121 |
+
| Level | Description |
|
| 122 |
+
|-------|-------------|
|
| 123 |
+
| **easy** | Simple functions, basic operations, single concepts |
|
| 124 |
+
| **medium** | Intermediate patterns, async operations, error handling |
|
| 125 |
+
| **hard** | Complex algorithms, data structures, design patterns |
|
| 126 |
+
|
| 127 |
+
## Variants
|
| 128 |
+
|
| 129 |
+
| Variant | Description |
|
| 130 |
+
|---------|-------------|
|
| 131 |
+
| **basic** | Standard code completion |
|
| 132 |
+
| **explain** | Code completion with explanation |
|
| 133 |
+
| **debug** | Bug fixing and completion |
|
| 134 |
+
| **optimize** | Performance optimization and completion |
|
| 135 |
+
|
| 136 |
+
## Supported Languages
|
| 137 |
+
|
| 138 |
+
- Python
|
| 139 |
+
- JavaScript
|
| 140 |
+
- Go
|
| 141 |
+
- Rust
|
| 142 |
+
- TypeScript
|
| 143 |
+
|
| 144 |
+
## Usage
|
| 145 |
+
|
| 146 |
+
### Training with MLflow
|
| 147 |
+
|
| 148 |
+
```bash
|
| 149 |
+
mlflow run . -P num_examples=5000
|
| 150 |
+
```
|
| 151 |
+
|
| 152 |
+
### Loading Data for Training
|
| 153 |
+
|
| 154 |
+
```python
|
| 155 |
+
import json
|
| 156 |
+
|
| 157 |
+
# Load JSONL
|
| 158 |
+
with open("training-data/tool_examples.jsonl", "r", encoding="utf-8") as f:
|
| 159 |
+
for line in f:
|
| 160 |
+
example = json.loads(line)
|
| 161 |
+
# Process example
|
| 162 |
+
pass
|
| 163 |
+
|
| 164 |
+
# Load JSON
|
| 165 |
+
with open("training-data/tool_examples.json", "r", encoding="utf-8") as f:
|
| 166 |
+
data = json.load(f)
|
| 167 |
+
```
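For larger files it can help to stream the JSONL rather than hold everything in memory, and to confirm the language and difficulty balance before training. The sketch below does both; `iter_jsonl` and `count_by` are hypothetical helper names, and the field names are those listed under Code Completion Examples.

```python
import json
from collections import Counter


def iter_jsonl(path):
    """Yield one parsed example per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def count_by(path, field):
    """Count examples by a top-level field such as "language" or "difficulty"."""
    return Counter(example.get(field) for example in iter_jsonl(path))
```

Calling `count_by(path, "language")` on a generated code completion file returns a `Counter` keyed by language. Examples that lack the field are counted under `None`.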
|
| 168 |
+
|
| 169 |
+
## Augmentation
|
| 170 |
+
|
| 171 |
+
The tool-calling generator applies augmentation to create diversity:
|
| 172 |
+
- Varying file paths
|
| 173 |
+
- Varying command options
|
| 174 |
+
- Varying search queries
|
| 175 |
+
- Varying code snippets
|
| 176 |
+
|
| 177 |
+
## Quality Guidelines
|
| 178 |
+
|
| 179 |
+
- All generated code is syntactically correct
|
| 180 |
+
- Examples include realistic context
|
| 181 |
+
- Tools have proper arguments and responses
|
| 182 |
+
- Code completions come from fixed templates and are reproducible for a given `--seed`
|