slenk committed on
Commit
d09a8cf
·
verified ·
1 Parent(s): cf6c23e

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +282 -9
README.md CHANGED
@@ -11,16 +11,289 @@ pinned: false
11
  license: mit
12
  ---
13
 
14
- # CodeWraith - Module-to-Spec Transformer
15
 
16
- Generate technical specifications from Python source code using a fine-tuned LLM.
17
 
18
- ## How it works
19
 
20
- 1. Paste Python source code in the left panel
21
- 2. Adjust sampling parameters (temperature, top_p, max tokens)
22
- 3. Toggle RAG to include similar examples as context
23
- 4. Click **Generate Specification**
24
 
25
- The model is a LoRA-fine-tuned Llama that was trained on 200+ Python module / specification pairs
26
- generated by a teacher model and verified with AST-based structural validation.
 
11
  license: mit
12
  ---
13
 
14
+ # CodeWraith
15
 
16
+ **Module-to-Spec Transformer** -- Automates the generation of high-fidelity, verifiable technical specifications from Python source code.
17
 
18
+ CodeWraith uses a teacher-student architecture: a large model generates gold-standard training data, a verification pipeline ensures accuracy, and a fine-tuned lightweight model delivers fast, deployable inference.
19
 
20
+ ## Architecture
 
 
 
21
 
22
+ ```
23
+                     ┌─────────────┐
24
+   Python Source ──> │   Teacher   │ ──> Training Pairs (code -> spec)
25
+                     │  Qwen3 30B  │            │
26
+                     │  (Ollama)   │            │
27
+                     └─────────────┘            │
28
+                                                ▼
29
+                     ┌─────────────┐     ┌─────────────┐
30
+                     │  Verifier   │<──  │  Training   │
31
+                     │ AST + Judge │     │   Dataset   │
32
+                     └─────────────┘     └──────┬──────┘
33
+                            │                   │
34
+                            │ validates         │ trains
35
+                            ▼                   ▼
36
+                     ┌─────────────┐     ┌─────────────┐
37
+                     │  Verified   │     │   Student   │
38
+                     │    Specs    │     │ Llama 3B/8B │
39
+                     └─────────────┘     │   + LoRA    │
40
+                                         └──────┬──────┘
41
+                                                │
42
+                                                ▼
43
+                                         ┌─────────────┐
44
+                                         │ Gradio App  │
45
+                                         │  HF Spaces  │
46
+                                         └─────────────┘
47
+ ```
48
+
49
+ ## Components
50
+
51
+ | Component | Directory | Purpose |
52
+ |-----------|-----------|---------|
53
+ | **Teacher** | `src/codewraith/teacher/` | Generates synthetic training pairs with a large teacher model served via Ollama (Qwen3 30B in the pipeline steps below; the reported results used Gemma 4 26B) |
54
+ | **Verifier** | `src/codewraith/verifier/` | AST-based structural validation + LLM-as-Judge semantic audit |
55
+ | **Student** | `src/codewraith/student/` | LoRA fine-tuning via Unsloth, evaluation pipeline |
56
+ | **App** | `src/codewraith/app/` | Gradio web interface deployed on HuggingFace Spaces |
57
+
58
+ ## Verification Pipeline
59
+
60
+ 1. **Structural Validation**: Uses Python's `ast` module to verify function signatures, arguments, and class hierarchies match the source
61
+ 2. **Semantic Audit**: An LLM-as-Judge scores each spec from 0 to 10 on four criteria: completeness, accuracy, hallucination, and detail
62
+ 3. **Round-trip Consistency**: Tests whether an LLM can reconstruct the module's function/class signatures from the spec alone
63
+
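The structural validation step can be illustrated with a minimal sketch. It is not the project's `ast_checker.py`; the function names `extract_signatures` and `structural_score` are hypothetical, and matching names in the spec by whole-word search is an assumption made to keep the example short.

```python
import ast
import re


def extract_signatures(source: str) -> dict[str, list[str]]:
    """Map each function name to its argument names and each class name to its bases."""
    signatures: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signatures[node.name] = [arg.arg for arg in node.args.args]
        elif isinstance(node, ast.ClassDef):
            signatures[node.name] = [ast.unparse(base) for base in node.bases]
    return signatures


def structural_score(source: str, spec: str) -> float:
    """Fraction of names from the source (definitions, arguments, bases) found in the spec."""
    expected: set[str] = set()
    for name, parts in extract_signatures(source).items():
        expected.add(name)
        expected.update(parts)
    if not expected:
        return 1.0  # nothing to verify, so the spec cannot be missing anything
    # Assumption: a name counts as covered if it appears as a whole word in the spec.
    found = sum(
        1 for name in expected if re.search(rf"\b{re.escape(name)}\b", spec)
    )
    return found / len(expected)
```

A score of 1.0 means every function, class, argument, and base class named in the source also appears in the spec. A real checker would likely score functions, classes, arguments, and return types separately, as the evaluation tables do.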
64
+ ## Quick Start
65
+
66
+ ### Prerequisites
67
+
68
+ - Python 3.10+
69
+ - [uv](https://docs.astral.sh/uv/) package manager
70
+ - [Ollama](https://ollama.ai/) (for teacher model / judge)
71
+ - NVIDIA GPU with 32GB+ VRAM (for training)
72
+
73
+ ### Install
74
+
75
+ ```bash
76
+ git clone <repo-url>
77
+ cd CodeWraith
78
+
79
+ # Base install (verifier works with no ML dependencies)
80
+ uv venv
81
+ uv sync
82
+
83
+ # Install ML dependencies (datasets, transformers, dspy)
84
+ uv sync --extra ml
85
+
86
+ # Install training dependencies (unsloth, peft, trl)
87
+ uv sync --extra ml --extra training
88
+
89
+ # Install app dependencies (gradio)
90
+ uv sync --extra app
91
+
92
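+ # Install everything (each `uv sync` resets the environment to the extras named on that command, so run the variant you need)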
93
+ uv sync --extra all
94
+
95
+ # Install dev tools (pytest, ruff)
96
+ uv sync --extra dev
97
+ ```
98
+
99
+ ### Run Tests
100
+
101
+ ```bash
102
+ uv run pytest
103
+ ```
104
+
105
+ ## Full Pipeline
106
+
107
+ ### Step 1: Collect Source Files
108
+
109
+ Pull diverse Python modules from the `bigcode/the-stack-dedup` dataset on HuggingFace.
110
+ Requires accepting the [Terms of Use](https://huggingface.co/datasets/bigcode/the-stack-dedup) on HuggingFace.
111
+
112
+ ```bash
113
+ uv run --extra ml python3 -m codewraith.teacher.collect
114
+ ```
115
+
116
+ This collects 150 clean (well-starred) and 100 messy (zero-star) Python files
117
+ into `data/source_files/`. Resumable if interrupted.
118
+
119
+ ### Step 2: Optimize Prompt with DSPy
120
+
121
+ Uses DSPy's BootstrapFewShot optimizer to find the best prompt for spec generation.
122
+ Requires Ollama running with `qwen3:30b-a3b`.
123
+
124
+ ```bash
125
+ # Pull the teacher model
126
+ ollama pull qwen3:30b-a3b
127
+
128
+ # Run optimization
129
+ uv run --extra ml python3 -m codewraith.teacher.optimize
130
+ ```
131
+
132
+ Saves the optimized generator to `data/optimized_generator.json`.
133
+
134
+ ### Step 3: Generate Training Data
135
+
136
+ Generate specs for all collected source files using the optimized prompt.
137
+
138
+ ```bash
139
+ uv run --extra ml python3 -c "
140
+ from codewraith.teacher.generator import generate_dataset
141
+ generate_dataset('data/source_files', 'data/training_pairs.jsonl')
142
+ "
143
+ ```
144
+
145
+ Writes pairs incrementally to JSONL. Fully resumable if interrupted.
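Resumable, incremental JSONL writing can be sketched as follows. This is an illustration under stated assumptions, not the actual `generate_dataset`: the function name `generate_pairs`, the record keys `file`, `code`, and `spec`, and the `make_spec` callback (a stand-in for the teacher model call) are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable


def generate_pairs(
    source_dir: str, output_path: str, make_spec: Callable[[str], str]
) -> int:
    """Append one JSON record per source file, skipping files already written."""
    out = Path(output_path)
    done: set[str] = set()
    if out.exists():
        # Resume: collect the file names recorded by a previous run.
        with out.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    done.add(json.loads(line)["file"])

    written = 0
    with out.open("a", encoding="utf-8") as fh:
        for path in sorted(Path(source_dir).glob("*.py")):
            if path.name in done:
                continue
            code = path.read_text(encoding="utf-8")
            record = {"file": path.name, "code": code, "spec": make_spec(code)}
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # keep completed records on disk if the run is interrupted
            written += 1
    return written
```

Because each record is flushed as soon as it is produced, an interrupted run loses at most the spec that was in progress, and a second call picks up where the first one stopped.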
146
+
147
+ ### Step 4: Clean Dataset
148
+
149
+ Filter out null outputs, too-short specs, and outliers.
150
+
151
+ ```bash
152
+ uv run python3 -m codewraith.teacher.clean_dataset
153
+ ```
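The three filters named above (null outputs, too-short specs, outliers) might look like the sketch below. The thresholds `min_chars=200` and `z_max=3.0`, the function name `clean_pairs`, and the choice of a length z-score as the outlier test are assumptions for illustration; the actual `clean_dataset` module may use different criteria.

```python
from statistics import mean, pstdev


def clean_pairs(pairs: list[dict], min_chars: int = 200, z_max: float = 3.0) -> list[dict]:
    """Drop pairs with missing or very short specs, then drop length outliers."""
    # Filters 1 and 2: null/empty specs and specs shorter than min_chars.
    kept = [
        p for p in pairs
        if p.get("spec") and len(p["spec"].strip()) >= min_chars
    ]
    if len(kept) < 2:
        return kept

    # Filter 3: specs whose length is more than z_max standard deviations from the mean.
    lengths = [len(p["spec"]) for p in kept]
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:
        return kept  # all specs have the same length, so none is an outlier
    return [p for p, n in zip(kept, lengths) if abs(n - mu) / sigma <= z_max]
```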
154
+
155
+ ### Step 5: Train Student Model
156
+
157
+ Fine-tune with Unsloth + LoRA. Supports both 3B and 8B models.
158
+
159
+ ```bash
160
+ # Train Llama 3.2 3B (fast, ~3-4 minutes)
161
+ uv run --extra ml --extra training python3 -m codewraith.student.trainer 3b
162
+
163
+ # Train Llama 3.1 8B (better quality, ~8-10 minutes)
164
+ uv run --extra ml --extra training python3 -m codewraith.student.trainer 8b
165
+ ```
166
+
167
+ Adapters are saved to `models/codewraith-lora-{3b,8b}/`.
168
+
169
+ ### Step 6: Evaluate
170
+
171
+ Run evaluation comparing structural accuracy across models.
172
+
173
+ ```bash
174
+ # Evaluate 3B
175
+ uv run --extra ml --extra training python3 -m codewraith.student.evaluate 3b
176
+
177
+ # Evaluate 8B
178
+ uv run --extra ml --extra training python3 -m codewraith.student.evaluate 8b
179
+ ```
180
+
181
+ Generates `data/eval_report.md` with comparison metrics.
182
+
183
+ ### Step 7: Run Gradio App
184
+
185
+ ```bash
186
+ uv run --extra ml --extra training --extra app python3 -m codewraith.app.main
187
+ ```
188
+
189
+ Auto-detects the best available adapter (prefers 8B over 3B).
190
+ Opens a web UI with code input, sampling parameter controls, and live spec generation.
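Adapter auto-detection with an 8B-over-3B preference could be implemented along these lines. This is a sketch, not the code in `app/main.py`: the function name `find_best_adapter` is hypothetical, and treating the presence of `adapter_config.json` (the config file PEFT writes when saving an adapter) as proof of a usable adapter is an assumption.

```python
from pathlib import Path
from typing import Optional

# Preference order: try the 8B adapter first, then fall back to 3B.
ADAPTER_PREFERENCE = ("8b", "3b")


def find_best_adapter(models_dir: str = "models") -> Optional[Path]:
    """Return the first adapter directory that exists, in preference order."""
    for size in ADAPTER_PREFERENCE:
        candidate = Path(models_dir) / f"codewraith-lora-{size}"
        # Assumption: a saved adapter is recognized by its adapter_config.json file.
        if (candidate / "adapter_config.json").is_file():
            return candidate
    return None
```

Returning `None` rather than raising lets the caller decide how to handle a missing adapter, for example by showing a message in the UI.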
191
+
192
+ ### Step 8: Deploy to HF Spaces
193
+
194
+ ```bash
195
+ # Push adapter to HuggingFace Hub
196
+ uv run --extra ml --extra training python3 -c "
197
+ from codewraith.student.trainer import load_base_model, push_to_hub
198
+ from peft import PeftModel
199
+ model, tokenizer = load_base_model('3b')
200
+ model = PeftModel.from_pretrained(model, './models/codewraith-lora-3b')
201
+ push_to_hub(model, tokenizer, 'your-username/codewraith-lora-3b')
202
+ "
203
+ ```
204
+
205
+ ## Evaluation Results
206
+
207
+ Models trained with an 8192-token context, LoRA r=32, 4 epochs, dropout=0.05.
208
+ Training data generated by Gemma 4 26B teacher model with DSPy-optimized prompts.
209
+ Evaluated on 28 held-out examples (proper train/eval split, no data leakage).
210
+
211
+ ### Llama 3.1 8B (CodeWraith-8b) -- Deployed Model
212
+
213
+ | Metric | Score |
214
+ |--------|-------|
215
+ | Avg Structural Score | 0.95 |
216
+ | Function Coverage | 90% |
217
+ | Class Coverage | 100% |
218
+ | Argument Coverage | 94% |
219
+ | Return Type Coverage | 67% |
220
+ | Perfect Scores | 22/28 |
221
+ | Good Scores (>=80%) | 25/28 |
222
+ | Avg Inference Time | 28s |
223
+ | Training Loss | 0.59 |
224
+
225
+ ### Llama 3.2 3B (CodeWraith-3b)
226
+
227
+ | Metric | Score |
228
+ |--------|-------|
229
+ | Avg Structural Score | 0.91 |
230
+ | Function Coverage | 86% |
231
+ | Class Coverage | 96% |
232
+ | Argument Coverage | 93% |
233
+ | Return Type Coverage | 67% |
234
+ | Perfect Scores | 19/28 |
235
+ | Good Scores (>=80%) | 24/28 |
236
+ | Avg Inference Time | 26s |
237
+ | Training Loss | 0.76 |
238
+
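The summary rows in the tables above (average score, perfect scores, good scores) can be derived from per-example structural scores roughly as follows. The function name `summarize` and the input format (one score in [0, 1] per held-out example) are assumptions; the 0.8 threshold matches the ">=80%" row.

```python
def summarize(scores: list[float], good_threshold: float = 0.8) -> dict:
    """Aggregate per-example structural scores into table-style summary metrics."""
    if not scores:
        raise ValueError("at least one score is required")
    return {
        "avg": sum(scores) / len(scores),
        # A small tolerance guards against floating-point error in scores of 1.0.
        "perfect": sum(1 for s in scores if s >= 1.0 - 1e-9),
        # "Good" includes perfect scores, since both satisfy the threshold.
        "good": sum(1 for s in scores if s >= good_threshold),
        "total": len(scores),
    }
```

With 28 held-out examples, `perfect` and `good` correspond to the numerators of the "Perfect Scores" and "Good Scores" rows.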
239
+ ### Analysis
240
+
241
+ The 8B model was selected for deployment because:
242
+ - Higher overall structural score (0.95 vs 0.91)
243
+ - Perfect class coverage (100% vs 96%)
244
+ - More perfect scores (22/28 vs 19/28)
245
+ - Lower training loss (0.59 vs 0.76) on the higher-quality training data from the Gemma 4 26B teacher
246
+
247
+ Training data was generated using Gemma 4 26B as the teacher model (replacing Qwen3 30B),
248
+ producing higher quality specs with better structured Markdown and mermaid diagrams.
249
+ DSPy BootstrapFewShot was used to optimize the generation prompt.
250
+
251
+ ### HuggingFace Models
252
+
253
+ - Deployed (8B): https://huggingface.co/slenk/codewraith-lora-8b
254
+ - Alternative (3B): https://huggingface.co/slenk/codewraith-lora-3b
255
+
256
+ ## Environment
257
+
258
+ - **Teacher model**: Gemma 4 26B via Ollama at `127.0.0.1:11434`
259
+ - **Student models**: Llama 3.2 3B / Llama 3.1 8B fine-tuned with LoRA via Unsloth
260
+ - **Prompt optimization**: DSPy BootstrapFewShot with AST checker as metric
261
+ - **Deployment**: Gradio on HuggingFace Spaces
262
+ - **Hardware**: NVIDIA RTX 5090 (32GB VRAM)
263
+
264
+ ## Project Structure
265
+
266
+ ```
267
+ CodeWraith/
268
+ ├── pyproject.toml
269
+ ├── README.md
270
+ ├── Modelfile.teacher
271
+ ├── src/codewraith/
272
+ │   ├── teacher/
273
+ │   │   ├── collect.py          # HF dataset collection
274
+ │   │   ├── optimize.py         # DSPy prompt optimization
275
+ │   │   ├── generator.py        # Training data generation
276
+ │   │   └── clean_dataset.py    # Dataset filtering
277
+ │   ├── verifier/
278
+ │   │   ├── ast_checker.py      # AST structural validation
279
+ │   │   └── judge.py            # LLM-as-Judge semantic audit
280
+ │   ├── student/
281
+ │   │   ├── trainer.py          # Unsloth + LoRA fine-tuning
282
+ │   │   └── evaluate.py         # Model evaluation pipeline
283
+ │   └── app/
284
+ │       └── main.py             # Gradio inference UI
285
+ ├── data/                       # Training data, eval sets, reports
286
+ ├── models/                     # Saved LoRA adapters
287
+ └── tests/                      # Test suite (96% coverage)
288
+ ```
289
+
290
+ ## Rubric Alignment
291
+
292
+ | Rubric Section | Points | Implementation |
293
+ |---------------|--------|----------------|
294
+ | Model Functionality (training + LoRA + eval) | 20 | `student/trainer.py`, `student/evaluate.py`, 3B vs 8B comparison |
295
+ | Innovation & Creativity | 20 | Teacher-student architecture, DSPy prompt optimization, AST verification pipeline |
296
+ | Environment Setup (deployment) | 15 | `app/main.py`, Gradio on HF Spaces |
297
+ | Inference Pipeline (sampling) | 15 | `app/main.py` with temperature/top_p/max_tokens controls |
298
+ | Technical Documentation | 15 | This README, evaluation reports, docstrings |
299
+ | Demo & Presentation | 15 | Live Gradio app as interactive demo |