---
title: CodeWraith
emoji: 🔮
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 6.11.0
app_file: app.py
python_version: "3.12"
pinned: false
license: mit
---
# CodeWraith

**Module-to-Spec Transformer** -- Automates the generation of high-fidelity, verifiable technical specifications from Python source code.

CodeWraith uses a teacher-student architecture: a large model generates gold-standard training data, a verification pipeline ensures accuracy, and a fine-tuned lightweight model delivers fast, deployable inference.

## Architecture
```
                 ┌─────────────┐
Python Source ──>│ Teacher     │──> Training Pairs (code -> spec)
                 │ LLM via     │                  │
                 │ vLLM/Ollama │                  │
                 └─────────────┘                  │
                                                  ▼
                 ┌─────────────┐          ┌─────────────┐
                 │ Verifier    │<─────────│ Training    │
                 │ AST + Judge │          │ Dataset     │
                 └─────────────┘          └──────┬──────┘
                        │                        │
                        │ validates              │ trains
                        ▼                        ▼
                 ┌─────────────┐          ┌─────────────┐
                 │ Verified    │          │ Student     │
                 │ Specs       │          │ Llama 3B/8B │
                 └─────────────┘          │ + LoRA      │
                                          └──────┬──────┘
                                                 │
                                                 ▼
                                          ┌─────────────┐
                                          │ Gradio App  │<── RAG Retriever
                                          │ HF Spaces   │    (ChromaDB)
                                          └─────────────┘
```
## Components

| Component | Directory | Purpose |
|-----------|-----------|---------|
| **Teacher** | `src/codewraith/teacher/` | Generates synthetic training pairs using a large LLM via vLLM (JSON-constrained) or Ollama |
| **Verifier** | `src/codewraith/verifier/` | AST-based structural validation + LLM-as-Judge semantic audit |
| **Student** | `src/codewraith/student/` | LoRA fine-tuning via Unsloth, evaluation pipeline |
| **App** | `src/codewraith/app/` | Gradio web interface with RAG retrieval, deployed on HuggingFace Spaces |
## Verification Pipeline

1. **Structural Validation**: Uses Python's `ast` module to verify that function signatures, arguments, and class hierarchies match the source (see the sketch after this list)
2. **Semantic Audit**: An LLM-as-a-Judge evaluates completeness, accuracy, hallucination, and detail (each scored 0-10)
3. **Round-trip Consistency**: Tests whether an LLM can reconstruct the module's function/class signatures from the spec alone
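
A minimal sketch of the structural check, with hypothetical helper names (the real implementation lives in `src/codewraith/verifier/ast_checker.py` and checks more than signatures):

```python
import ast

def extract_signatures(source: str) -> set[str]:
    """Collect 'name(arg1, arg2, ...)' strings for every function in the source."""
    sigs = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.add(f"{node.name}({args})")
    return sigs

def structural_score(source: str, spec: str) -> float:
    """Fraction of source signatures that appear verbatim in the generated spec."""
    sigs = extract_signatures(source)
    if not sigs:
        return 1.0  # nothing to verify
    return sum(sig in spec for sig in sigs) / len(sigs)
```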
## Sampling & Inference

The inference pipeline uses **nucleus sampling** (top-p) combined with temperature scaling to balance output quality and diversity:

| Parameter | Default | Range | Purpose |
|-----------|---------|-------|---------|
| **Temperature** | 0.7 | 0.0 - 2.0 | Controls randomness. Lower values (0.1-0.3) produce more deterministic, structured output. Higher values increase diversity but risk incoherence. |
| **Top-p** | 0.9 | 0.0 - 1.0 | Nucleus sampling threshold. At each step, sampling is restricted to the smallest set of tokens whose cumulative probability reaches top-p; 0.9 keeps the top 90% of the probability mass and filters out the low-likelihood tail. |
| **Max Tokens** | 2048 | 256 - 8192 | Maximum generation length. Technical specs for typical modules run 500-1500 tokens; larger modules may need 4096+. |

**Why nucleus sampling over beam search?** Spec generation benefits from controlled creativity -- mermaid diagrams and natural language descriptions need some variation, while function signatures need precision. Nucleus sampling with moderate temperature (0.7) gives the model freedom in prose while the fine-tuning keeps structured elements accurate. For maximum precision, users can lower temperature to 0.1-0.3.
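
For illustration only (not the project's inference code), temperature scaling followed by top-p filtering over a single logit vector looks roughly like this:

```python
import numpy as np

def sample_top_p(logits: np.ndarray, temperature: float = 0.7, top_p: float = 0.9) -> int:
    """Temperature-scaled nucleus sampling over a 1-D array of logits."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())            # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set reaching top_p mass
    nucleus = order[:cutoff]
    renormed = probs[nucleus] / probs[nucleus].sum() # renormalize over the nucleus
    return int(np.random.choice(nucleus, p=renormed))
```

Lowering `temperature` sharpens the distribution before the top-p cut, which is why 0.1-0.3 yields near-deterministic signatures.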
## Retrieval-Augmented Generation (RAG)

At inference time, the app optionally retrieves similar code-spec pairs from a ChromaDB vector index to provide few-shot context:

1. **Indexing**: All training pairs are embedded using `sentence-transformers` and stored in ChromaDB (193 pairs)
2. **Retrieval**: When a user submits code, the retriever finds the 3 most similar source files by cosine similarity
3. **Augmentation**: Retrieved examples are prepended to the user's input as context, giving the model concrete formatting examples
4. **Auto-truncation**: If RAG context pushes the input beyond 6000 tokens, it is automatically dropped to prevent context overflow

RAG improves output consistency, especially for formatting patterns like mermaid diagrams and markdown tables that the model may not reliably produce from fine-tuning alone.
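
A rough sketch of the retrieve-and-augment path, assuming a hypothetical collection name and embedding model (the real logic lives in `src/codewraith/app/retriever.py`):

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="data/chromadb")
collection = client.get_collection("training_pairs")  # collection name is an assumption
embedder = SentenceTransformer("all-MiniLM-L6-v2")    # embedding model is an assumption

def retrieve_examples(user_code: str, k: int = 3) -> list[str]:
    """Return the k most similar stored code-spec pairs for few-shot context."""
    query = embedder.encode(user_code).tolist()
    return collection.query(query_embeddings=[query], n_results=k)["documents"][0]

def build_prompt(user_code: str, max_tokens: int = 6000) -> str:
    """Prepend retrieved examples; drop them if the prompt would overflow."""
    augmented = "\n\n".join(retrieve_examples(user_code)) + "\n\n" + user_code
    # ~4 characters per token stands in for real tokenization in this sketch
    return user_code if len(augmented) // 4 > max_tokens else augmented
```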
## Quick Start

### Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) package manager
- [Ollama](https://ollama.ai/) (for teacher model / judge)
- NVIDIA GPU with 32GB+ VRAM (for training)

### Install

```bash
git clone <repo-url>
cd CodeWraith

# Base install (verifier works with no ML dependencies)
uv venv
uv sync

# Install ML dependencies (transformers, unsloth, vllm, etc.)
uv sync --extra ml

# Install app dependencies (gradio, chromadb)
uv sync --extra app

# Install everything
uv sync --extra all

# Install dev tools (pytest, ruff)
uv sync --extra dev
```

### Run Tests

```bash
uv run pytest
```
## Full Pipeline

### Step 1: Collect Source Files

Pull diverse Python modules from the `bigcode/the-stack-dedup` dataset on HuggingFace.
Requires accepting the [Terms of Use](https://huggingface.co/datasets/bigcode/the-stack-dedup) on HuggingFace.

```bash
uv run --extra ml python3 -m codewraith.teacher.collect
```

This collects 150 clean (well-starred) and 100 messy (zero-star) Python files
into `data/source_files/`. Resumable if interrupted.
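
In outline, the collection step streams the dataset and buckets files by star count; the star threshold, field handling, and file naming below are assumptions:

```python
from pathlib import Path
from datasets import load_dataset

# Stream so the multi-TB dataset is never fully downloaded
stack = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",  # Python subset
    split="train",
    streaming=True,
)

out_dir = Path("data/source_files")
out_dir.mkdir(parents=True, exist_ok=True)

clean, messy = 0, 0
for i, row in enumerate(stack):
    stars = row.get("max_stars_count") or 0
    if stars >= 100 and clean < 150:   # "well-starred" threshold is an assumption
        bucket, clean = "clean", clean + 1
    elif stars == 0 and messy < 100:
        bucket, messy = "messy", messy + 1
    else:
        continue
    (out_dir / f"{bucket}_{i}.py").write_text(row["content"], encoding="utf-8")
    if clean >= 150 and messy >= 100:
        break
```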
### Step 2: Optimize Prompt with DSPy

Uses DSPy's BootstrapFewShot optimizer to find the best prompt for spec generation.
Requires Ollama running with the configured teacher model.

```bash
# Run optimization
uv run --extra ml python3 -m codewraith.teacher.optimize
```

Saves the optimized generator to `data/optimized_generator.json`.
Falls back to raw Ollama generation if DSPy optimization is unavailable or returns null.
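
In outline, a BootstrapFewShot run looks like the sketch below; the signature fields, teacher model name, JSONL field names, and the simplified AST metric are all assumptions:

```python
import ast
import json
import dspy

# Teacher served by Ollama; the model name here is an assumption
lm = dspy.LM("ollama_chat/qwen2.5-coder:14b", api_base="http://127.0.0.1:11434")
dspy.configure(lm=lm)

class CodeToSpec(dspy.Signature):
    """Write a technical specification for a Python module."""
    source_code: str = dspy.InputField()
    spec: str = dspy.OutputField()

def ast_metric(example, prediction, trace=None):
    """Pass if every function/class name from the source appears in the spec."""
    names = [n.name for n in ast.walk(ast.parse(example.source_code))
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
    return all(name in prediction.spec for name in names)

seed = [json.loads(line) for line in open("data/training_pairs.jsonl")][:20]
trainset = [dspy.Example(source_code=p["code"], spec=p["spec"]).with_inputs("source_code")
            for p in seed]  # field names "code"/"spec" are assumptions

optimizer = dspy.BootstrapFewShot(metric=ast_metric)
optimized = optimizer.compile(dspy.Predict(CodeToSpec), trainset=trainset)
optimized.save("data/optimized_generator.json")
```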
### Step 3: Generate Training Data

Generate specs for all collected source files. Two backends are available:

```bash
# vLLM backend (recommended) -- JSON-constrained output for consistent structure
# Requires vLLM server running with a code-specialized model
uv run --extra ml python3 -c "
from codewraith.teacher.generator import generate_dataset
generate_dataset('data/source_files', 'data/training_pairs.jsonl', backend='vllm')
"

# Ollama backend -- raw generation, uses DSPy-optimized prompt if available
uv run --extra ml python3 -c "
from codewraith.teacher.generator import generate_dataset
generate_dataset('data/source_files', 'data/training_pairs.jsonl')
"
```

Writes pairs incrementally to JSONL. Fully resumable if interrupted.
### Step 4: Clean Dataset

Filter out null outputs, too-short specs, and outliers.

```bash
uv run python3 -m codewraith.teacher.clean_dataset
```
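
The filtering amounts to something like this sketch (thresholds, output path, and JSONL field names are assumptions):

```python
import json
from pathlib import Path

def clean_dataset(src: str = "data/training_pairs.jsonl",
                  dst: str = "data/training_pairs_clean.jsonl") -> None:
    pairs = [json.loads(line) for line in Path(src).read_text().splitlines() if line.strip()]
    lengths = sorted(len(p.get("spec") or "") for p in pairs)
    p95 = lengths[int(0.95 * (len(lengths) - 1))]  # crude length-outlier cutoff

    kept = [
        p for p in pairs
        if p.get("spec")              # drop null outputs
        and len(p["spec"]) >= 200     # drop too-short specs (threshold assumed)
        and len(p["spec"]) <= p95     # drop length outliers
    ]
    with open(dst, "w") as f:
        for p in kept:
            f.write(json.dumps(p) + "\n")
    print(f"kept {len(kept)}/{len(pairs)} pairs")
```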
### Step 5: Train Student Model

Fine-tune with Unsloth + LoRA. Supports both 3B and 8B models.

```bash
# Train Llama 3.2 3B (fast, ~3-4 minutes)
uv run --extra ml python3 -m codewraith.student.trainer 3b

# Train Llama 3.1 8B (better quality, ~8-10 minutes)
uv run --extra ml python3 -m codewraith.student.trainer 8b
```

Adapters are saved to `models/codewraith-lora-{3b,8b}/`.
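
At its core, the trainer does roughly the following with Unsloth; the base-model repo id and target modules are assumptions, and the LoRA settings mirror the v7 configuration described under Evaluation Results:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",  # exact repo id is an assumption
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # LoRA rank (v7 config)
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# ...followed by a standard supervised fine-tuning run over the cleaned
# code -> spec pairs, then:
model.save_pretrained("models/codewraith-lora-8b")
```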
### Step 6: Evaluate

Run evaluation comparing structural accuracy across models.

```bash
# Evaluate 3B
uv run --extra ml python3 -m codewraith.student.evaluate 3b

# Evaluate 8B
uv run --extra ml python3 -m codewraith.student.evaluate 8b
```

Generates `data/eval_report.md` with comparison metrics.
### Step 7: Run Gradio App

```bash
uv run --extra ml --extra app python3 -m codewraith.app.main
```

Auto-detects the best available adapter (prefers 8B over 3B).
Opens a web UI with code input, sampling parameter controls, and live spec generation.
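
Conceptually, the adapter auto-detection is just a preference-ordered filesystem check, sketched here with hypothetical names:

```python
from pathlib import Path

def pick_adapter() -> Path | None:
    """Prefer the 8B adapter, fall back to 3B, else run the base model."""
    for name in ("codewraith-lora-8b", "codewraith-lora-3b"):
        path = Path("models") / name
        if (path / "adapter_config.json").exists():
            return path
    return None
```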
### Step 8: Deploy to HF Spaces

```bash
# Push adapter to HuggingFace Hub
uv run --extra ml python3 -c "
from codewraith.student.trainer import load_base_model, push_to_hub
from peft import PeftModel
model, tokenizer = load_base_model('8b')
model = PeftModel.from_pretrained(model, './models/codewraith-lora-8b')
push_to_hub(model, tokenizer, 'slenk/codewraith-lora-8b')
"

# Upload app to HuggingFace Spaces (uses .hfignore to exclude large files)
hf upload slenk/codewraith . . --repo-type space \
  --exclude "models/*" --exclude ".venv/*" --exclude "adapter/*" \
  --exclude ".git/*" --exclude "tests/*" --exclude "scripts/*"
```

The Space downloads the LoRA adapter from HF Hub at startup, so model weights
are not included in the Space repository. A `.hfignore` file is provided to
exclude development artifacts from uploads.
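
Startup on the Space then boils down to roughly the following; the base-model id is an assumption, and on ZeroGPU the generation call is wrapped with the `spaces` package so a GPU is attached only per request:

```python
import torch
import spaces
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # base model id is an assumption

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(         # downloads the adapter from HF Hub
    AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16),
    "slenk/codewraith-lora-8b",
)

@spaces.GPU  # ZeroGPU: a GPU is allocated only for the duration of each call
def generate_spec(code: str, temperature: float = 0.7, top_p: float = 0.9) -> str:
    inputs = tokenizer(code, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=2048, do_sample=True,
                         temperature=temperature, top_p=top_p)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```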
## Model Evolution

The project iterated through multiple teacher models and training configurations to find the best combination:

| Version | Teacher Model | Student | Key Finding |
|---------|--------------|---------|-------------|
| v1 | Llama 3.1 70B (Q4) | 3B, 8B | Baseline. Functional specs but inconsistent formatting. |
| v2 | Llama 3.1 70B (Q4) | 3B, 8B | Improved hyperparameters (r=32, 8192 context, 4 epochs). 8B reached 0.89 structural score. |
| v3 | Qwen3 30B-A3B (MoE) | 3B, 8B | Better structured output -- tables, type annotations, cleaner markdown. 3B chosen as primary (0.92 structural). |
| v4 | Gemma 4 26B | 3B, 8B | Higher structural scores (8B: 0.95, 100% class coverage). Wordier prose but weaker return type coverage (67%). 8B selected as deployed model. |
| v5 | Qwen2.5-Coder 32B (Q6) | 8B | Code-specialized teacher for more precise, structured specifications. |
| v6 | Qwen2.5-Coder 32B (AWQ) via vLLM | 8B | JSON-constrained generation via vLLM ensures consistent spec structure. 171 pairs, 0.97 structural score. |
| v7 | Qwen2.5-Coder 14B (AWQ) via vLLM | 8B | Smaller teacher with 16384 context recovers large files. 231 pairs (+35%), 0.97 structural score maintained. |
Each iteration preserved previous model adapters for comparison. The teacher model
has the largest impact on output quality: the code-specialized teacher (Qwen2.5-Coder)
produced noticeably more precise function signatures and structured formatting than the
general-purpose models.
## Evaluation Results

### v7 -- Current (Qwen2.5-Coder 14B AWQ via vLLM)

Models trained with 4096 context, LoRA r=16, 3 epochs.
Training data generated by Qwen2.5-Coder-14B-Instruct-AWQ via vLLM with JSON-constrained output.
Evaluated on 34 held-out examples (197 train / 34 eval split from 231 total pairs).

#### Llama 3.1 8B (CodeWraith-8b-v7)

| Metric | Score |
|--------|-------|
| Avg Structural Score | 0.97 |
| Function Coverage | 97% |
| Class Coverage | 100% |
| Argument Coverage | 95% |
| Return Type Coverage | 90% |
| Perfect Scores | 25/34 |
| Good Scores (>=80%) | 29/34 |
| Training Loss | 0.12 |

**Key change from v6:** Switched from the 32B teacher (limited to 4096 context on 32GB VRAM)
to the 14B teacher with 16384 context. This recovered 60 additional training pairs from source
files that previously exceeded the context window, increasing the dataset by 35%. Structural
score held steady at 0.97 with the larger eval set. Four low scores (0.50) traced to Python 2
syntax in source files, not model output issues.
### v6 -- Previous (Qwen2.5-Coder 32B AWQ via vLLM)

Models trained with 8192 context, LoRA r=32, 3 epochs, dropout=0.05.
Training data generated by Qwen2.5-Coder-32B-Instruct-AWQ via vLLM with JSON-constrained output.
Evaluated on 26 held-out examples (145 train / 26 eval split from 171 total pairs).

#### Llama 3.1 8B (CodeWraith-8b-v6)

| Metric | Score |
|--------|-------|
| Avg Structural Score | 0.97 |
| Perfect Scores | 19/26 |
| Good Scores (>=80%) | 22/26 |

### v5 -- Previous (Qwen2.5-Coder 32B via Ollama)

Models trained with 8192 context, LoRA r=32, 4 epochs, dropout=0.05.
Training data generated by Qwen2.5-Coder 32B (Q6 quantization) via Ollama.
Evaluated on 37 held-out examples (proper train/eval split, no data leakage).

#### Llama 3.1 8B (CodeWraith-8b-v5)

| Metric | Score |
|--------|-------|
| Avg Structural Score | 0.99 |
| Function Coverage | 97% |
| Class Coverage | 100% |
| Argument Coverage | 99% |
| Return Type Coverage | 100% |
| Perfect Scores | 29/37 |
| Good Scores (>=80%) | 36/37 |
| Training Loss | 0.33 |
### v4 -- Previous (Gemma 4 26B Teacher)

Evaluated on 28 held-out examples.

#### Llama 3.1 8B (CodeWraith-8b-v4)

| Metric | v4 | v5 | Change |
|--------|-----|-----|--------|
| Structural Score | 0.95 | 0.99 | +0.04 |
| Function Coverage | 90% | 97% | +7% |
| Class Coverage | 100% | 100% | -- |
| Argument Coverage | 94% | 99% | +5% |
| Return Type Coverage | 67% | 100% | +33% |
| Perfect Scores | 78% | 78% | -- |
| Good Scores (>=80%) | 89% | 97% | +8% |
| Training Loss | 0.59 | 0.33 | -44% |
### Analysis

The v5 model using a **code-specialized teacher** (Qwen2.5-Coder 32B) dramatically
improved over v4's general-purpose teacher (Gemma 4 26B):

- **Return type coverage recovered from 67% to 100%** -- the v4 regression was caused
  by Gemma producing prose descriptions instead of precise type annotations
- **Training loss dropped 44%** -- the code-specialized teacher produces more consistent,
  structured output that the student model learns more efficiently
- **97% good scores** -- only 1 of 37 examples scored below 80%
- The code-specialized teacher generates more precise function signatures and parameter
  types, which directly translates to higher AST verification scores
### HuggingFace Models

- Deployed (8B LoRA adapter): https://huggingface.co/slenk/codewraith-lora-8b
- Merged (8B standalone): https://huggingface.co/slenk/codewraith-merged-8b
- Alternative (3B LoRA adapter): https://huggingface.co/slenk/codewraith-lora-3b
## Environment

- **Teacher model**: Configurable via Ollama at `127.0.0.1:11434` (tested with Llama 70B, Qwen3 30B, Gemma 4 26B, Qwen2.5-Coder 32B)
- **Student models**: Llama 3.2 3B / Llama 3.1 8B fine-tuned with LoRA via Unsloth
- **Prompt optimization**: DSPy BootstrapFewShot with AST checker as metric
- **RAG retrieval**: ChromaDB + sentence-transformers for few-shot context at inference
- **Deployment**: Gradio on HuggingFace Spaces with ZeroGPU (A10G)
- **Hardware (local)**: NVIDIA RTX 5090 (32GB VRAM)
## Project Structure

```
CodeWraith/
├── pyproject.toml
├── README.md
├── Modelfile.teacher
├── src/codewraith/
│   ├── teacher/
│   │   ├── collect.py         # HF dataset collection
│   │   ├── optimize.py        # DSPy prompt optimization
│   │   ├── generator.py       # Training data generation (Ollama + vLLM backends)
│   │   └── clean_dataset.py   # Dataset filtering
│   ├── spec_schema.py         # Pydantic ModuleSpec schema + markdown renderer
│   ├── verifier/
│   │   ├── ast_checker.py     # AST structural validation
│   │   └── judge.py           # LLM-as-Judge semantic audit
│   ├── student/
│   │   ├── trainer.py         # Unsloth + LoRA fine-tuning
│   │   └── evaluate.py        # Model evaluation pipeline
│   └── app/
│       ├── main.py            # Gradio inference UI
│       └── retriever.py       # RAG retrieval from ChromaDB
├── app.py                     # HF Spaces entry point
├── data/
│   ├── chromadb/              # Vector index for RAG retrieval
│   ├── source_files/          # Collected Python source files
│   ├── training_pairs*.jsonl  # Generated training data (per version)
│   └── eval_report*.md        # Evaluation reports
├── models/                    # Local LoRA adapters (gitignored, hosted on HF Hub)
├── scripts/
│   └── retrain.py             # Full retrain pipeline
└── tests/                     # Test suite
```