---
title: CodeWraith
emoji: 🔮
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
python_version: "3.12"
pinned: false
license: mit
---

# CodeWraith

**Module-to-Spec Transformer** -- Automates the generation of high-fidelity, verifiable technical specifications from Python source code.

CodeWraith uses a teacher-student architecture: a large model generates gold-standard training data, a verification pipeline ensures accuracy, and a fine-tuned lightweight model delivers fast, deployable inference.

## Architecture

```
                    ┌─────────────┐
  Python Source ──> │   Teacher   │ ──> Training Pairs (code -> spec)
                    │ LLM via     │         │
                    │ vLLM/Ollama │         │
                    └─────────────┘         │
                                            ▼
                    ┌─────────────┐    ┌─────────────┐
                    │  Verifier   │<── │  Training    │
                    │ AST + Judge │    │  Dataset     │
                    └─────────────┘    └──────┬──────┘
                          │                   │
                          │ validates         │ trains
                          ▼                   ▼
                    ┌─────────────┐    ┌─────────────┐
                    │  Verified   │    │   Student   │
                    │  Specs      │    │ Llama 3B/8B │
                    └─────────────┘    │ + LoRA      │
                                       └──────┬──────┘
                                              │
                                              ▼
                                       ┌─────────────┐
                                       │  Gradio App │ <── RAG Retriever
                                       │  HF Spaces  │     (ChromaDB)
                                       └─────────────┘
```

## Components

| Component | Directory | Purpose |
|-----------|-----------|---------|
| **Teacher** | `src/codewraith/teacher/` | Generates synthetic training pairs using a large LLM via vLLM (JSON-constrained) or Ollama |
| **Verifier** | `src/codewraith/verifier/` | AST-based structural validation + LLM-as-Judge semantic audit |
| **Student** | `src/codewraith/student/` | LoRA fine-tuning via Unsloth, evaluation pipeline |
| **App** | `src/codewraith/app/` | Gradio web interface with RAG retrieval, deployed on HuggingFace Spaces |

## Verification Pipeline

1. **Structural Validation**: Uses Python's `ast` module to verify function signatures, arguments, and class hierarchies match the source
2. **Semantic Audit**: LLM-as-a-Judge evaluates completeness, accuracy, hallucination, and detail (scored 0-10 each)
3. **Round-trip Consistency**: Tests whether an LLM can reconstruct the module's function/class signatures from the spec alone
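
The structural validation step can be illustrated with a minimal sketch. It assumes the spec is plain text and checks only whether each function and argument name from the source appears in it; the helper names `extract_signatures` and `structural_score` are hypothetical and are not taken from `ast_checker.py`, and the substring match is a deliberate simplification.

```python
import ast


def extract_signatures(source: str) -> dict[str, list[str]]:
    """Map each function name in `source` to its positional argument names."""
    signatures = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signatures[node.name] = [arg.arg for arg in node.args.args]
    return signatures


def structural_score(source: str, spec_text: str) -> float:
    """Fraction of function and argument names from the source found in the spec."""
    expected = []
    for name, args in extract_signatures(source).items():
        expected.append(name)
        # Skip implicit receiver arguments, which specs rarely document.
        expected.extend(a for a in args if a not in ("self", "cls"))
    if not expected:
        return 1.0
    # Crude substring match (assumption); a real checker would parse the spec.
    found = sum(1 for name in expected if name in spec_text)
    return found / len(expected)


source = "def add(left, right):\n    return left + right\n"
spec = "### add(left)\nAdds two numbers."
print(structural_score(source, spec))  # 2 of 3 expected names are present
```

A score of 1.0 here only means every name was mentioned; it says nothing about whether the description is correct, which is what the semantic audit is for.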

## Sampling & Inference

The inference pipeline uses **nucleus sampling** (top-p) combined with temperature scaling to balance output quality and diversity:

| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| **Temperature** | 0.7 | 0.0 - 2.0 | Controls randomness. Lower values (0.1-0.3) produce more deterministic, structured output. Higher values increase diversity but risk incoherence. |
| **Top-p** | 0.9 | 0.0 - 1.0 | Nucleus sampling threshold. At each step, sampling is restricted to the smallest set of highest-probability tokens whose cumulative probability reaches top-p. A value of 0.9 keeps the tokens covering the top 90% of probability mass and filters out the low-likelihood tail. |
| **Max Tokens** | 2048 | 256 - 8192 | Maximum generation length. Technical specs for typical modules run 500-1500 tokens; larger modules may need 4096+. |

**Why nucleus sampling over beam search?** Spec generation benefits from controlled creativity -- mermaid diagrams and natural language descriptions need some variation, while function signatures need precision. Nucleus sampling with moderate temperature (0.7) gives the model freedom in prose while the fine-tuning keeps structured elements accurate. For maximum precision, users can lower temperature to 0.1-0.3.
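
To make the interaction between temperature and top-p concrete, here is a small pure-Python sketch of one sampling step. It is an illustration of the general technique, not the app's inference code (which presumably delegates to the model library); `top_p_filter` and `sample_token` are hypothetical names, and the logits are made up.

```python
import math
import random


def top_p_filter(logits: list[float], temperature: float = 0.7,
                 top_p: float = 0.9) -> dict[int, float]:
    """Return renormalized probabilities for the nucleus of token indices."""
    # Temperature scaling: divide logits before the softmax (temperature > 0).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract peak for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in nucleus}


def sample_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token index from the nucleus distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]


# Illustrative logits for a four-token vocabulary.
print(top_p_filter([2.0, 1.0, 0.0, -5.0], temperature=1.0, top_p=0.9))
```

With these logits the first two tokens already cover more than 90% of the mass, so the two least likely tokens are never sampled. Lowering the temperature sharpens the distribution and shrinks the nucleus further.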

## Retrieval-Augmented Generation (RAG)

At inference time, the app optionally retrieves similar code-spec pairs from a ChromaDB vector index to provide few-shot context:

1. **Indexing**: All training pairs are embedded using `sentence-transformers` and stored in ChromaDB (193 pairs)
2. **Retrieval**: When a user submits code, the retriever finds the 3 most similar source files by cosine similarity
3. **Augmentation**: Retrieved examples are prepended to the user's input as context, giving the model concrete formatting examples
4. **Auto-truncation**: If the retrieved context pushes the input beyond 6000 tokens, the RAG context is dropped automatically to prevent context overflow

RAG improves output consistency, especially for formatting patterns like mermaid diagrams and markdown tables that the model may not reliably produce from fine-tuning alone.
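
The retrieve, augment, and auto-truncate steps can be sketched without any vector database. The example below is a simplified stand-in: it assumes embeddings are plain lists of floats, ranks by cosine similarity in Python rather than through ChromaDB, and counts tokens by whitespace splitting. The names `retrieve` and `build_prompt` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=3):
    """index is a list of (embedding, example_text); return the k closest texts."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(code, examples, max_tokens=6000,
                 count_tokens=lambda s: len(s.split())):
    """Prepend retrieved examples unless that would exceed the token budget."""
    augmented = "\n\n".join(examples + [code])
    if count_tokens(augmented) > max_tokens:
        return code  # drop the RAG context entirely rather than truncate it
    return augmented


# Toy two-dimensional embeddings, for illustration only.
index = [([1, 0], "example A"), ([0, 1], "example B"), ([1, 1], "example C")]
print(retrieve([1, 0], index, k=2))
```

Dropping the whole context, instead of trimming it, keeps every remaining example intact; a half-truncated example could teach the model a malformed format.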

## Quick Start

### Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) package manager
- [Ollama](https://ollama.ai/) (for teacher model / judge)
- NVIDIA GPU with 32GB+ VRAM (for training)

### Install

```bash
git clone <repo-url>
cd CodeWraith

# Base install (verifier works with no ML dependencies)
uv venv
uv sync

# Install ML dependencies (transformers, unsloth, vllm, etc.)
uv sync --extra ml

# Install app dependencies (gradio, chromadb)
uv sync --extra app

# Install everything
uv sync --extra all

# Install dev tools (pytest, ruff)
uv sync --extra dev
```

### Run Tests

```bash
uv run pytest
```

## Full Pipeline

### Step 1: Collect Source Files

Pull diverse Python modules from the `the-stack-dedup` dataset on HuggingFace.
Requires accepting the [Terms of Use](https://huggingface.co/datasets/bigcode/the-stack-dedup) on HuggingFace.

```bash
uv run --extra ml python3 -m codewraith.teacher.collect
```

This collects 150 clean (well-starred) and 100 messy (zero-star) Python files
into `data/source_files/`. Resumable if interrupted.

### Step 2: Optimize Prompt with DSPy

Uses DSPy's BootstrapFewShot optimizer to find the best prompt for spec generation.
Requires Ollama running with the configured teacher model.

```bash
# Run optimization
uv run --extra ml python3 -m codewraith.teacher.optimize
```

Saves the optimized generator to `data/optimized_generator.json`.
Falls back to raw Ollama generation if DSPy optimization is unavailable or returns null.

### Step 3: Generate Training Data

Generate specs for all collected source files. Two backends are available:

```bash
# vLLM backend (recommended) -- JSON-constrained output for consistent structure
# Requires vLLM server running with a code-specialized model
uv run --extra ml python3 -c "
from codewraith.teacher.generator import generate_dataset
generate_dataset('data/source_files', 'data/training_pairs.jsonl', backend='vllm')
"

# Ollama backend -- raw generation, uses DSPy-optimized prompt if available
uv run --extra ml python3 -c "
from codewraith.teacher.generator import generate_dataset
generate_dataset('data/source_files', 'data/training_pairs.jsonl')
"
```

Writes pairs incrementally to JSONL. Fully resumable if interrupted.
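
One common way to get this kind of resumability is to append one JSON record per line and skip sources that already appear in the output. The sketch below assumes that pattern; the record fields (`source_path`, `spec`) and the function names are hypothetical and may not match what `generate_dataset` actually does.

```python
import json
from pathlib import Path


def completed_sources(output_path):
    """Return the set of source paths already present in the JSONL output."""
    path = Path(output_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["source_path"])
            except (json.JSONDecodeError, KeyError):
                continue  # ignore malformed lines instead of failing the run
    return done


def generate_resumable(source_paths, output_path, make_spec):
    """Append a record for each source not yet in the output; return the count."""
    done = completed_sources(output_path)
    written = 0
    with Path(output_path).open("a", encoding="utf-8") as handle:
        for source_path in source_paths:
            if source_path in done:
                continue
            record = {"source_path": source_path, "spec": make_spec(source_path)}
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # persist each pair before starting the next one
            written += 1
    return written
```

Calling `generate_resumable` a second time with the same inputs writes nothing new, so an interrupted run can simply be restarted.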

### Step 4: Clean Dataset

Filter out null outputs, too-short specs, and outliers.

```bash
uv run python3 -m codewraith.teacher.clean_dataset
```
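
As a rough illustration of the three filters, the sketch below drops null specs, specs under a minimum length, and specs whose length is far from the mean. The thresholds (`min_chars=200`, `z_cutoff=3.0`), the `spec` field name, and the z-score outlier rule are all assumptions made for the example, not values taken from `clean_dataset.py`.

```python
import statistics


def clean_pairs(pairs, min_chars=200, z_cutoff=3.0):
    """Drop null specs, too-short specs, and length outliers."""
    # Null or empty specs fail the first test; short ones fail the second.
    kept = [p for p in pairs if p.get("spec") and len(p["spec"]) >= min_chars]
    if len(kept) < 2:
        return kept
    lengths = [len(p["spec"]) for p in kept]
    mean = statistics.mean(lengths)
    spread = statistics.pstdev(lengths)
    if spread == 0:
        return kept  # all specs have the same length, nothing to flag
    # Keep specs whose length is within z_cutoff standard deviations of the mean.
    return [p for p in kept if abs(len(p["spec"]) - mean) / spread <= z_cutoff]


pairs = [{"spec": None}, {"spec": "too short"}, {"spec": "x" * 300}, {"spec": "y" * 320}]
print(len(clean_pairs(pairs)))
```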

### Step 5: Train Student Model

Fine-tune with Unsloth + LoRA. Supports both 3B and 8B models.

```bash
# Train Llama 3.2 3B (fast, ~3-4 minutes)
uv run --extra ml python3 -m codewraith.student.trainer 3b

# Train Llama 3.1 8B (better quality, ~8-10 minutes)
uv run --extra ml python3 -m codewraith.student.trainer 8b
```

Adapters are saved to `models/codewraith-lora-{3b,8b}/`.

### Step 6: Evaluate

Run evaluation comparing structural accuracy across models.

```bash
# Evaluate 3B
uv run --extra ml python3 -m codewraith.student.evaluate 3b

# Evaluate 8B
uv run --extra ml python3 -m codewraith.student.evaluate 8b
```

Generates `data/eval_report.md` with comparison metrics.

### Step 7: Run Gradio App

```bash
uv run --extra ml --extra app python3 -m codewraith.app.main
```

Auto-detects the best available adapter (prefers 8B over 3B).
Opens a web UI with code input, sampling parameter controls, and live spec generation.
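
The adapter preference can be pictured as a simple ordered lookup. This sketch assumes adapters live under `models/codewraith-lora-{3b,8b}/` as described in Step 5; the function name `find_best_adapter` is hypothetical and the app's real detection logic may differ (for example, by falling back to the HF Hub).

```python
from pathlib import Path


def find_best_adapter(models_dir="models", preference=("8b", "3b")):
    """Return the first existing adapter directory in preference order, or None."""
    for size in preference:
        candidate = Path(models_dir) / f"codewraith-lora-{size}"
        if candidate.is_dir():
            return candidate
    return None


adapter = find_best_adapter()
print(adapter if adapter else "no local adapter found")
```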

### Step 8: Deploy to HF Spaces

```bash
# Push adapter to HuggingFace Hub
uv run --extra ml python3 -c "
from codewraith.student.trainer import load_base_model, push_to_hub
from peft import PeftModel
model, tokenizer = load_base_model('8b')
model = PeftModel.from_pretrained(model, './models/codewraith-lora-8b')
push_to_hub(model, tokenizer, 'slenk/codewraith-lora-8b')
"

# Upload app to HuggingFace Spaces (uses .hfignore to exclude large files)
hf upload slenk/codewraith . . --repo-type space \
  --exclude "models/*" --exclude ".venv/*" --exclude "adapter/*" \
  --exclude ".git/*" --exclude "tests/*" --exclude "scripts/*"
```

The Space downloads the LoRA adapter from HF Hub at startup, so model weights
are not included in the Space repository. A `.hfignore` file is provided to
exclude development artifacts from uploads.

## Model Evolution

The project iterated through multiple teacher models and training configurations to find the best combination:

| Version | Teacher Model | Student | Key Finding |
|---------|--------------|---------|-------------|
| v1 | Llama 3.1 70B (Q4) | 3B, 8B | Baseline. Functional specs but inconsistent formatting. |
| v2 | Llama 3.1 70B (Q4) | 3B, 8B | Improved hyperparameters (r=32, 8192 context, 4 epochs). 8B reached 0.89 structural score. |
| v3 | Qwen3 30B-A3B (MoE) | 3B, 8B | Better structured output -- tables, type annotations, cleaner markdown. 3B chosen as primary (0.92 structural). |
| v4 | Gemma 4 26B | 3B, 8B | Higher structural scores (8B: 0.95, 100% class coverage). Wordier prose but weaker return type coverage (67%). 8B selected as deployed model. |
| v5 | Qwen2.5-Coder 32B (Q6) | 8B | Code-specialized teacher for more precise, structured specifications. |
| v6 | Qwen2.5-Coder 32B (AWQ) via vLLM | 8B | JSON-constrained generation via vLLM ensures consistent spec structure. 171 pairs, 0.97 structural score. |
| v7 | Qwen2.5-Coder 14B (AWQ) via vLLM | 8B | Smaller teacher with 16384 context recovers large files. 231 pairs (+35%), 0.97 structural score maintained. |

Each iteration preserved previous model adapters for comparison. The teacher model
has the largest impact on output quality -- a code-specialized teacher (Qwen2.5-Coder)
is expected to produce more precise function signatures and structured formatting than
general-purpose models.

## Evaluation Results

### v7 -- Current (Qwen2.5-Coder 14B AWQ via vLLM)

Models trained with 4096 context, LoRA r=16, 3 epochs.
Training data generated by Qwen2.5-Coder-14B-Instruct-AWQ via vLLM with JSON-constrained output.
Evaluated on 34 held-out examples (197 train / 34 eval split from 231 total pairs).

#### Llama 3.1 8B (CodeWraith-8b-v7)

| Metric | Score |
|--------|-------|
| Avg Structural Score | 0.97 |
| Function Coverage | 97% |
| Class Coverage | 100% |
| Argument Coverage | 95% |
| Return Type Coverage | 90% |
| Perfect Scores | 25/34 |
| Good Scores (>=80%) | 29/34 |
| Training Loss | 0.12 |

**Key change from v6:** Switched from the 32B teacher (limited to 4096 context on 32GB VRAM)
to the 14B teacher with 16384 context. This recovered 60 additional training pairs from source
files that previously exceeded the context window, increasing the dataset by 35%. Structural
score held steady at 0.97 with the larger eval set. Four low scores (0.50) were traced to Python 2
syntax in source files, not to problems in the model output.
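
The dataset figures can be cross-checked with a few lines of arithmetic, using only the pair counts reported in the v6 and v7 evaluation sections of this README.

```python
# Pair counts and splits as reported for v6 and v7 in this README.
v6 = {"train": 145, "eval": 26, "total": 171}
v7 = {"train": 197, "eval": 34, "total": 231}

for name, split in (("v6", v6), ("v7", v7)):
    # Each split should account for every pair in the dataset.
    assert split["train"] + split["eval"] == split["total"], name

added_pairs = v7["total"] - v6["total"]
growth = added_pairs / v6["total"]
print(f"added pairs: {added_pairs}, growth: {growth:.1%}")
# added pairs: 60, growth: 35.1%
```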

### v6 -- Previous (Qwen2.5-Coder 32B AWQ via vLLM)

Models trained with 8192 context, LoRA r=32, 3 epochs, dropout=0.05.
Training data generated by Qwen2.5-Coder-32B-Instruct-AWQ via vLLM with JSON-constrained output.
Evaluated on 26 held-out examples (145 train / 26 eval split from 171 total pairs).

#### Llama 3.1 8B (CodeWraith-8b-v6)

| Metric | Score |
|--------|-------|
| Avg Structural Score | 0.97 |
| Perfect Scores | 19/26 |
| Good Scores (>=80%) | 22/26 |

### v5 -- Previous (Qwen2.5-Coder 32B via Ollama)

Models trained with 8192 context, LoRA r=32, 4 epochs, dropout=0.05.
Training data generated by Qwen2.5-Coder 32B (Q6 quantization) via Ollama.
Evaluated on 37 held-out examples (proper train/eval split, no data leakage).

#### Llama 3.1 8B (CodeWraith-8b-v5)

| Metric | Score |
|--------|-------|
| Avg Structural Score | 0.99 |
| Function Coverage | 97% |
| Class Coverage | 100% |
| Argument Coverage | 99% |
| Return Type Coverage | 100% |
| Perfect Scores | 29/37 |
| Good Scores (>=80%) | 36/37 |
| Training Loss | 0.33 |

### v4 -- Previous (Gemma 4 26B Teacher)

Evaluated on 28 held-out examples.

#### Llama 3.1 8B (CodeWraith-8b-v4)

| Metric | v4 | v5 | Change |
|--------|-----|-----|--------|
| Structural Score | 0.95 | 0.99 | +0.04 |
| Function Coverage | 90% | 97% | +7 pts |
| Class Coverage | 100% | 100% | -- |
| Argument Coverage | 94% | 99% | +5 pts |
| Return Type Coverage | 67% | 100% | +33 pts |
| Perfect Scores | 78% | 78% | -- |
| Good Scores (>=80%) | 89% | 97% | +8 pts |
| Training Loss | 0.59 | 0.33 | -44% |

### Analysis

The v5 model using a **code-specialized teacher** (Qwen2.5-Coder 32B) dramatically
improved over v4's general-purpose teacher (Gemma 4 26B):

- **Return type coverage recovered from 67% to 100%** -- the v4 regression was caused
  by Gemma producing prose descriptions instead of precise type annotations
- **Training loss dropped 44%** -- the code-specialized teacher produces more consistent,
  structured output that the student model learns more efficiently
- **97% good scores** -- only 1 of 37 examples scored below 80%
- The code-specialized teacher generates more precise function signatures and parameter
  types, which directly translates to higher AST verification scores

### HuggingFace Models

- Deployed (8B LoRA adapter): https://huggingface.co/slenk/codewraith-lora-8b
- Merged (8B standalone): https://huggingface.co/slenk/codewraith-merged-8b
- Alternative (3B LoRA adapter): https://huggingface.co/slenk/codewraith-lora-3b

## Environment

- **Teacher model**: Configurable via Ollama at `127.0.0.1:11434` (tested with Llama 70B, Qwen3 30B, Gemma 4 26B, Qwen2.5-Coder 32B)
- **Student models**: Llama 3.2 3B / Llama 3.1 8B fine-tuned with LoRA via Unsloth
- **Prompt optimization**: DSPy BootstrapFewShot with AST checker as metric
- **RAG retrieval**: ChromaDB + sentence-transformers for few-shot context at inference
- **Deployment**: Gradio on HuggingFace Spaces with ZeroGPU (A10G)
- **Hardware (local)**: NVIDIA RTX 5090 (32GB VRAM)

## Project Structure

```
CodeWraith/
├── pyproject.toml
├── README.md
├── Modelfile.teacher
├── src/codewraith/
│   ├── teacher/
│   │   ├── collect.py          # HF dataset collection
│   │   ├── optimize.py         # DSPy prompt optimization
│   │   ├── generator.py        # Training data generation (Ollama + vLLM backends)
│   │   └── clean_dataset.py    # Dataset filtering
│   ├── spec_schema.py           # Pydantic ModuleSpec schema + markdown renderer
│   ├── verifier/
│   │   ├── ast_checker.py      # AST structural validation
│   │   └── judge.py            # LLM-as-Judge semantic audit
│   ├── student/
│   │   ├── trainer.py          # Unsloth + LoRA fine-tuning
│   │   └── evaluate.py         # Model evaluation pipeline
│   └── app/
│       ├── main.py             # Gradio inference UI
│       └── retriever.py        # RAG retrieval from ChromaDB
├── app.py                      # HF Spaces entry point
├── data/
│   ├── chromadb/               # Vector index for RAG retrieval
│   ├── source_files/           # Collected Python source files
│   ├── training_pairs*.jsonl   # Generated training data (per version)
│   └── eval_report*.md         # Evaluation reports
├── models/                     # Local LoRA adapters (gitignored, hosted on HF Hub)
├── scripts/
│   └── retrain.py              # Full retrain pipeline
└── tests/                      # Test suite
```