Spaces: slenk/codewraith

slenk committed on Apr 12

Commit d09a8cf (verified), 1 parent: cf6c23e

Upload README.md with huggingface_hub

README.md +282 -9

@@ -11,16 +11,289 @@ pinned: false
 license: mit
 ---
-# CodeWraith - Module-to-Spec Transformer
-Generate technical specifications from Python source code using a fine-tuned LLM.
-## How it works
-1. Paste Python source code in the left panel
-2. Adjust sampling parameters (temperature, top_p, max tokens)
-3. Toggle RAG to include similar examples as context
-4. Click **Generate Specification**
-The model is a LoRA-fine-tuned Llama that was trained on 200+ Python module / specification pairs
-generated by a teacher model and verified with AST-based structural validation.
+# CodeWraith
+**Module-to-Spec Transformer** -- Automates the generation of high-fidelity, verifiable technical specifications from Python source code.
+CodeWraith uses a teacher-student architecture: a large model generates gold-standard training data, a verification pipeline ensures accuracy, and a fine-tuned lightweight model delivers fast, deployable inference.
+## Architecture
+```
+                    ┌─────────────┐
+  Python Source ──> │   Teacher   │ ──> Training Pairs (code -> spec)
+                    │ Qwen3 30B   │         │
+                    │ (Ollama)    │         │
+                    └─────────────┘         │
+                                            ▼
+                    ┌─────────────┐    ┌─────────────┐
+                    │  Verifier   │<── │  Training   │
+                    │ AST + Judge │    │  Dataset    │
+                    └─────────────┘    └──────┬──────┘
+                          │                   │
+                          │ validates         │ trains
+                          ▼                   ▼
+                    ┌─────────────┐    ┌─────────────┐
+                    │  Verified   │    │   Student   │
+                    │  Specs      │    │ Llama 3B/8B │
+                    └─────────────┘    │ + LoRA      │
+                                       └──────┬──────┘
+                                              │
+                                              ▼
+                                       ┌─────────────┐
+                                       │  Gradio App │
+                                       │  HF Spaces  │
+                                       └─────────────┘
+```
+## Components
+| Component | Directory | Purpose |
+|-----------|-----------|---------|
+| **Teacher** | `src/codewraith/teacher/` | Generates synthetic training pairs with a teacher LLM served by Ollama (Qwen3 30B originally, Gemma 4 26B for the reported results) |
+| **Verifier** | `src/codewraith/verifier/` | AST-based structural validation + LLM-as-Judge semantic audit |
+| **Student** | `src/codewraith/student/` | LoRA fine-tuning via Unsloth, evaluation pipeline |
+| **App** | `src/codewraith/app/` | Gradio web interface deployed on HuggingFace Spaces |
+## Verification Pipeline
+1. **Structural Validation**: Uses Python's `ast` module to verify that the function signatures, arguments, and class hierarchies described in the spec match the source
+2. **Semantic Audit**: An LLM-as-Judge scores completeness, accuracy, hallucination, and detail (0-10 each)
+3. **Round-trip Consistency**: Tests whether an LLM can reconstruct the module's function/class signatures from the spec alone
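The structural validation step can be illustrated with a minimal sketch. It assumes the spec is plain text and treats a function as covered when its name appears anywhere in that text, which is a deliberate simplification. The helper names `extract_signatures` and `function_coverage` are hypothetical and are not the API of `ast_checker.py`.

```python
import ast
from typing import Dict, List


def extract_signatures(source: str) -> Dict[str, List[str]]:
    """Map each function (methods as Class.method) to its argument names.

    Functions nested inside other functions are intentionally ignored.
    """
    signatures: Dict[str, List[str]] = {}

    def visit(node: ast.AST, prefix: str = "") -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = child.args
                names = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
                if args.vararg:
                    names.append("*" + args.vararg.arg)
                if args.kwarg:
                    names.append("**" + args.kwarg.arg)
                signatures[prefix + child.name] = names
            elif isinstance(child, ast.ClassDef):
                visit(child, prefix + child.name + ".")

    visit(ast.parse(source))
    return signatures


def function_coverage(source: str, spec: str) -> float:
    """Fraction of functions in the source whose bare name appears in the spec."""
    names = [qualified.split(".")[-1] for qualified in extract_signatures(source)]
    if not names:
        return 1.0  # nothing to cover
    # Naive substring match; a real checker would parse the spec's structure.
    return sum(name in spec for name in names) / len(names)
```

A stricter check would also compare the argument lists returned by `extract_signatures` against the arguments the spec documents.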
+## Quick Start
+### Prerequisites
+- Python 3.10+
+- [uv](https://docs.astral.sh/uv/) package manager
+- [Ollama](https://ollama.ai/) (for teacher model / judge)
+- NVIDIA GPU with 32GB+ VRAM (for training)
+### Install
+```bash
+git clone <repo-url>
+cd CodeWraith
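+# The `uv sync` commands below are alternatives: each one makes the environment
+# match exactly the extras it names, so run only the variant you need.
+# Base install (verifier works with no ML dependencies)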
+uv venv
+uv sync
+# Install ML dependencies (datasets, transformers, dspy)
+uv sync --extra ml
+# Install training dependencies (unsloth, peft, trl)
+uv sync --extra ml --extra training
+# Install app dependencies (gradio)
+uv sync --extra app
+# Install everything
+uv sync --extra all
+# Install dev tools (pytest, ruff)
+uv sync --extra dev
+```
+### Run Tests
+```bash
+uv run pytest
+```
+## Full Pipeline
+### Step 1: Collect Source Files
+Pull diverse Python modules from the `bigcode/the-stack-dedup` dataset on HuggingFace.
+This requires accepting the dataset's [Terms of Use](https://huggingface.co/datasets/bigcode/the-stack-dedup) on HuggingFace.
+```bash
+uv run --extra ml python3 -m codewraith.teacher.collect
+```
+This collects 150 clean (well-starred) and 100 messy (zero-star) Python files
+into `data/source_files/`. The run is resumable if interrupted.
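As a rough illustration of the quota logic described above, the sketch below sorts records into the clean and messy buckets until both quotas are filled. The `stars` field name, the 100-star threshold for "well-starred", and the `select_files` helper are all assumptions made for the example; they are not taken from `collect.py`.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Quotas taken from the step description: 150 clean files, 100 messy files.
QUOTAS = {"clean": 150, "messy": 100}


def bucket_for(stars: int, clean_min_stars: int = 100) -> Optional[str]:
    """Classify a file by repository stars (the threshold is an assumption)."""
    if stars >= clean_min_stars:
        return "clean"
    if stars == 0:
        return "messy"
    return None  # in-between files are skipped


def select_files(
    records: Iterable[dict], already: Optional[Dict[str, int]] = None
) -> Iterator[Tuple[str, dict]]:
    """Yield (bucket, record) pairs until both quotas are met.

    `already` holds counts from a previous, interrupted run so that a
    restart only collects what is still missing.
    """
    counts = {bucket: 0 for bucket in QUOTAS}
    counts.update(already or {})
    for record in records:
        bucket = bucket_for(record.get("stars") or 0)
        if bucket is None or counts[bucket] >= QUOTAS[bucket]:
            continue
        counts[bucket] += 1
        yield bucket, record
        if all(counts[b] >= QUOTAS[b] for b in QUOTAS):
            return
```

On restart, `already` could be derived by counting the files that are already present in each bucket of the output directory.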
+### Step 2: Optimize Prompt with DSPy
+Uses DSPy's BootstrapFewShot optimizer to find the best prompt for spec generation.
+Requires Ollama running with `qwen3:30b-a3b`.
+```bash
+# Pull the teacher model
+ollama pull qwen3:30b-a3b
+# Run optimization
+uv run --extra ml python3 -m codewraith.teacher.optimize
+```
+Saves the optimized generator to `data/optimized_generator.json`.
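The Environment section notes that the AST checker serves as the optimization metric. A DSPy metric is a callable that takes `(example, prediction, trace=None)`; the sketch below follows that convention without importing DSPy. The field names `source_code` and `spec`, the 0.8 threshold, and the name-presence scoring are assumptions for illustration only.

```python
import ast
from types import SimpleNamespace  # stands in for DSPy example/prediction objects


def defined_names(source: str) -> set:
    """Names of all functions and classes defined anywhere in the source."""
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {node.name for node in ast.walk(ast.parse(source)) if isinstance(node, kinds)}


def structural_metric(example, prediction, trace=None, threshold: float = 0.8):
    """Score a generated spec by how many defined names it mentions.

    Returns a float for evaluation; during bootstrapping (trace is not None)
    it returns a bool so that only strong demonstrations are kept.
    """
    names = defined_names(example.source_code)
    score = 1.0 if not names else sum(n in prediction.spec for n in names) / len(names)
    if trace is not None:
        return score >= threshold
    return score


# Usage with stand-in objects:
example = SimpleNamespace(source_code="def parse(x):\n    return x\n\nclass Loader:\n    pass\n")
prediction = SimpleNamespace(spec="`parse` returns its input.")
print(structural_metric(example, prediction))
```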
+### Step 3: Generate Training Data
+Generate specs for all collected source files using the optimized prompt.
+```bash
+uv run --extra ml python3 -c "
+from codewraith.teacher.generator import generate_dataset
+generate_dataset('data/source_files', 'data/training_pairs.jsonl')
+"
+```
+Writes pairs incrementally to JSONL. Fully resumable if interrupted.
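One way to get incremental, resumable JSONL output is to append one record per file and skip files already recorded on restart. The sketch below assumes a record layout of `file`, `code`, and `spec` keys and a caller-supplied `make_spec` function; both are hypothetical and may differ from what `generate_dataset` actually writes.

```python
import json
from pathlib import Path
from typing import Callable


def generate_resumable(source_dir, output_path, make_spec: Callable[[str], str]) -> int:
    """Append one JSON line per source file, skipping files already written."""
    output = Path(output_path)
    done = set()
    if output.exists():
        with output.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    done.add(json.loads(line)["file"])
                except (json.JSONDecodeError, KeyError):
                    continue  # ignore blank or malformed lines

    written = 0
    with output.open("a", encoding="utf-8") as fh:
        for path in sorted(Path(source_dir).glob("*.py")):
            if path.name in done:
                continue
            code = path.read_text(encoding="utf-8")
            record = {"file": path.name, "code": code, "spec": make_spec(code)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # keep finished records on disk if the run is interrupted
            written += 1
    return written
```

The function returns the number of new records, so a second call over an unchanged directory returns 0.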
+### Step 4: Clean Dataset
+Filter out null outputs, too-short specs, and outliers.
+```bash
+uv run python3 -m codewraith.teacher.clean_dataset
+```
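The three filters named above could look roughly like the following. The 200-character minimum, the z-score cutoff of 3, and the use of spec length as the outlier criterion are assumptions chosen for the example; the actual thresholds in `clean_dataset.py` are not stated here.

```python
from statistics import mean, pstdev


def clean_pairs(pairs, min_spec_chars: int = 200, max_z: float = 3.0):
    """Drop null specs, too-short specs, and length outliers."""
    # 1. Remove null or empty outputs and specs below the minimum length.
    kept = [p for p in pairs if p.get("spec") and len(p["spec"].strip()) >= min_spec_chars]
    if len(kept) < 2:
        return kept

    # 2. Remove specs whose length is more than max_z standard deviations from the mean.
    lengths = [len(p["spec"]) for p in kept]
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:
        return kept
    return [p for p, n in zip(kept, lengths) if abs(n - mu) / sigma <= max_z]
```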
+### Step 5: Train Student Model
+Fine-tune with Unsloth + LoRA. Supports both 3B and 8B models.
+```bash
+# Train Llama 3.2 3B (fast, ~3-4 minutes)
+uv run --extra ml --extra training python3 -m codewraith.student.trainer 3b
+# Train Llama 3.1 8B (better quality, ~8-10 minutes)
+uv run --extra ml --extra training python3 -m codewraith.student.trainer 8b
+```
+Adapters are saved to `models/codewraith-lora-{3b,8b}/`.
+### Step 6: Evaluate
+Run evaluation comparing structural accuracy across models.
+```bash
+# Evaluate 3B
+uv run --extra ml --extra training python3 -m codewraith.student.evaluate 3b
+# Evaluate 8B
+uv run --extra ml --extra training python3 -m codewraith.student.evaluate 8b
+```
+Generates `data/eval_report.md` with comparison metrics.
+### Step 7: Run Gradio App
+```bash
+uv run --extra ml --extra training --extra app python3 -m codewraith.app.main
+```
+Auto-detects the best available adapter (prefers 8B over 3B).
+Opens a web UI with code input, sampling parameter controls, and live spec generation.
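Adapter auto-detection can be sketched as a preference-ordered search of the `models/` directory. The directory names come from Step 5; the `pick_adapter` helper is hypothetical, and checking for `adapter_config.json` (the config file PEFT writes when saving an adapter) is an assumption about how the app recognizes a usable adapter.

```python
from pathlib import Path
from typing import Optional

# Preferred first: 8B, then 3B.
ADAPTER_PREFERENCE = ("codewraith-lora-8b", "codewraith-lora-3b")


def pick_adapter(models_dir="models") -> Optional[Path]:
    """Return the first adapter directory that contains a PEFT adapter config."""
    for name in ADAPTER_PREFERENCE:
        candidate = Path(models_dir) / name
        if (candidate / "adapter_config.json").is_file():
            return candidate
    return None  # no trained adapter found
```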
+### Step 8: Deploy to HF Spaces
+```bash
+# Push adapter to HuggingFace Hub
+uv run --extra ml --extra training python3 -c "
+from codewraith.student.trainer import load_base_model, push_to_hub
+from peft import PeftModel
+model, tokenizer = load_base_model('3b')
+model = PeftModel.from_pretrained(model, './models/codewraith-lora-3b')
+push_to_hub(model, tokenizer, 'your-username/codewraith-lora-3b')
+"
+```
+## Evaluation Results
+Both models were trained with an 8192-token context window, LoRA r=32, 4 epochs, and dropout=0.05.
+Training data was generated by the Gemma 4 26B teacher model with DSPy-optimized prompts.
+Both models were evaluated on the same 28 held-out examples (proper train/eval split, no data leakage).
+### Llama 3.1 8B (CodeWraith-8b) -- Deployed Model
+| Metric | Score |
+|--------|-------|
+| Avg Structural Score | 0.95 |
+| Function Coverage | 90% |
+| Class Coverage | 100% |
+| Argument Coverage | 94% |
+| Return Type Coverage | 67% |
+| Perfect Scores | 22/28 |
+| Good Scores (>=80%) | 25/28 |
+| Avg Inference Time | 28s |
+| Training Loss | 0.59 |
+### Llama 3.2 3B (CodeWraith-3b)
+| Metric | Score |
+|--------|-------|
+| Avg Structural Score | 0.91 |
+| Function Coverage | 86% |
+| Class Coverage | 96% |
+| Argument Coverage | 93% |
+| Return Type Coverage | 67% |
+| Perfect Scores | 19/28 |
+| Good Scores (>=80%) | 24/28 |
+| Avg Inference Time | 26s |
+| Training Loss | 0.76 |
+### Analysis
+The 8B model was selected for deployment because:
+- Higher overall structural score (0.95 vs 0.91)
+- Perfect class coverage (100% vs 96%)
+- More perfect scores (22/28 vs 19/28)
+- Higher-quality training data from the Gemma 4 26B teacher, which benefited the larger model more
+The cost is small: average inference time is 28s for the 8B model vs 26s for the 3B model, and return type coverage is 67% for both.
+Training data was generated using Gemma 4 26B as the teacher model (replacing Qwen3 30B),
+which produced higher-quality specs with better-structured Markdown and Mermaid diagrams.
+DSPy BootstrapFewShot was used to optimize the generation prompt.
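The summary rows in the tables above (average score, perfect scores, good scores) can be derived from per-example structural scores. The sketch below shows one way to compute them; the `summarize` helper is hypothetical and the sample scores are invented for illustration, not taken from the evaluation run.

```python
def summarize(scores, good_threshold: float = 0.8) -> dict:
    """Aggregate per-example structural scores (each in the range 0.0 to 1.0)."""
    if not scores:
        raise ValueError("at least one score is required")
    return {
        "avg": round(sum(scores) / len(scores), 2),
        "perfect": sum(s == 1.0 for s in scores),
        "good": sum(s >= good_threshold for s in scores),
        "total": len(scores),
    }


# Invented example scores, not the real evaluation data.
print(summarize([1.0, 1.0, 1.0, 0.8, 0.5]))
```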
+### HuggingFace Models
+- Deployed (8B): https://huggingface.co/slenk/codewraith-lora-8b
+- Alternative (3B): https://huggingface.co/slenk/codewraith-lora-3b
+## Environment
+- **Teacher model**: Gemma 4 26B via Ollama at `127.0.0.1:11434`
+- **Student models**: Llama 3.2 3B / Llama 3.1 8B fine-tuned with LoRA via Unsloth
+- **Prompt optimization**: DSPy BootstrapFewShot with AST checker as metric
+- **Deployment**: Gradio on HuggingFace Spaces
+- **Hardware**: NVIDIA RTX 5090 (32GB VRAM)
+## Project Structure
+```
+CodeWraith/
+├── pyproject.toml
+├── README.md
+├── Modelfile.teacher
+├── src/codewraith/
+│   ├── teacher/
+│   │   ├── collect.py          # HF dataset collection
+│   │   ├── optimize.py         # DSPy prompt optimization
+│   │   ├── generator.py        # Training data generation
+│   │   └── clean_dataset.py    # Dataset filtering
+│   ├── verifier/
+│   │   ├── ast_checker.py      # AST structural validation
+│   │   └── judge.py            # LLM-as-Judge semantic audit
+│   ├── student/
+│   │   ├── trainer.py          # Unsloth + LoRA fine-tuning
+│   │   └── evaluate.py         # Model evaluation pipeline
+│   └── app/
+│       └── main.py             # Gradio inference UI
+├── data/                       # Training data, eval sets, reports
+├── models/                     # Saved LoRA adapters
+└── tests/                      # Test suite (96% coverage)
+```
+## Rubric Alignment
+| Rubric Section | Points | Implementation |
+|---------------|--------|----------------|
+| Model Functionality (training + LoRA + eval) | 20 | `student/trainer.py`, `student/evaluate.py`, 3B vs 8B comparison |
+| Innovation & Creativity | 20 | Teacher-student architecture, DSPy prompt optimization, AST verification pipeline |
+| Environment Setup (deployment) | 15 | `app/main.py`, Gradio on HF Spaces |
+| Inference Pipeline (sampling) | 15 | `app/main.py` with temperature/top_p/max_tokens controls |
+| Technical Documentation | 15 | This README, evaluation reports, docstrings |
+| Demo & Presentation | 15 | Live Gradio app as interactive demo |