cmndcntrlcyber committed on
Commit
afb3494
·
verified ·
1 Parent(s): f823218

Sync model card from docs/model_cards/code-trainer-vision-adapter.md

Files changed (1)
  1. README.md +138 -16
README.md CHANGED
@@ -1,34 +1,156 @@
1
  ---
 
 
 
2
  tags:
3
  - code-generation
4
- - vision-language
 
5
  - lora
 
 
6
  - qwen2.5-coder
7
- base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
8
  datasets:
9
  - cmndcntrlcyber/code-trainer-offsec-dataset
 
10
  ---
11
 
12
- # Code-Trainer V6 — Phase 3 Vision Adapter
 
 
 
 
 
13
 
14
- LoRA adapter + MLP projector trained on `cmndcntrlcyber/code-trainer-offsec-dataset@v2-multimodal`.
 
 
 
 
 
 
 
 
 
 
 
 
 
15
 
16
  ## Architecture
17
- - **Vision encoder:** microsoft/swin-base-patch4-window7-224 (frozen)
18
- - **Projector:** 2-layer MLP
19
- - **Decoder:** Qwen/Qwen2.5-Coder-1.5B-Instruct + LoRA r=16
20
 
21
- ## Training
22
- - Hardware: HF Skills A100-large (40GB)
23
- - Batch size: 8 × grad_accum 4
24
- - Epochs: 3
25
- - LR: 0.0002
26
- - Precision: BF16
27
 
28
- ## Use
29
 
30
  ```python
31
- from src.phase3_vision_model.architecture.vision_model import CodeVisionModel
 
 
 
 
 
 
 
 
 
32
 
33
- model = CodeVisionModel.from_pretrained("<this-repo>")
 
34
  ```
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
3
+ library_name: peft
4
+ license: apache-2.0
5
  tags:
6
  - code-generation
7
+ - multimodal
8
+ - vision-encoder-decoder
9
  - lora
10
+ - peft
11
+ - swin
12
  - qwen2.5-coder
13
+ - code-trainer-v6
14
  datasets:
15
  - cmndcntrlcyber/code-trainer-offsec-dataset
16
+ pipeline_tag: image-to-text
17
  ---
18
 
19
+ # code-trainer-vision-adapter
20
+
21
+ A multimodal **screenshot → code** model: a frozen
22
+ [Swin-B](https://huggingface.co/microsoft/swin-base-patch4-window7-224) vision
23
+ encoder, an MLP projector, and a LoRA adapter for
24
+ [`Qwen/Qwen2.5-Coder-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct).
25
 
26
+ This is **Phase 3** of the Code-Trainer V6 / RTPI pipeline
27
+ ([GitHub](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)) —
28
+ the multimodal stage that takes a Monaco-Editor-rendered VS Code screenshot of
29
+ source code and emits the underlying source.
30
+
31
+ ## Intended use
32
+
33
+ * **Direct use:** infer source code from VS Code-style code screenshots in
34
+ Python, JavaScript, TypeScript, Java, Go, Rust, C++, or C#.
35
+ * **Research / pedagogy:** ablation baseline for larger vision-language code
36
+ models; the projector + LoRA architecture is small enough to retrain on a
37
+ single A100.
38
+ * **Out of scope:** general OCR, natural images, hand-written code, or screen
39
+ recordings (all training images came from the Monaco renderer pipeline).
40
 
41
  ## Architecture
 
 
 
42
 
43
+ ```
44
+ image (224×224, 3 channels)
45
+ │
46
+ ▼
47
+ Swin-B encoder (frozen, 87.7 M params)
48
+ │ visual feature sequence (49 × 1024)
49
+ ▼
50
+ MLP projector (trained, 2.1 M params)
51
+ │ decoder-shaped embedding sequence
52
+ ▼
53
+ Qwen2.5-Coder-1.5B (with LoRA r=16, α=32 — trained)
54
+ │
55
+ ▼
56
+ source code tokens
57
+ ```
58
+
59
+ ## Training data
60
+
61
+ * **Dataset:** [`cmndcntrlcyber/code-trainer-offsec-dataset`](https://huggingface.co/datasets/cmndcntrlcyber/code-trainer-offsec-dataset),
62
+ revision **`v2-multimodal`** (rows include base64-encoded WebP screenshots).
63
+ * **Splits:** 26,126 train / 3,265 validation / 3,267 test (≈80/10/10).
64
+ * **Capture pipeline:** Monaco Editor in headless Chromium via Playwright,
65
+ rendered through 8 rotating VS Code-style themes for diversity.
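
Since the `v2-multimodal` rows carry base64-encoded WebP screenshots, a consumer may want to sanity-check a payload before handing it to an image library. The following is a minimal sketch using only the standard library. The column name `image_b64` and the synthetic row are hypothetical and are not taken from the dataset schema.

```python
import base64


def decode_screenshot(b64_payload: str) -> bytes:
    """Decode a base64 string and confirm the bytes look like a WebP file."""
    raw = base64.b64decode(b64_payload, validate=True)
    # WebP is a RIFF container: bytes 0-3 are b"RIFF", bytes 8-11 are b"WEBP".
    if len(raw) < 12 or raw[:4] != b"RIFF" or raw[8:12] != b"WEBP":
        raise ValueError("payload is not a WebP image")
    return raw


# Synthetic 12-byte header standing in for a real row (hypothetical data;
# "image_b64" is an assumed column name, not the dataset's documented one).
fake_row = {"image_b64": base64.b64encode(b"RIFF\x04\x00\x00\x00WEBP").decode("ascii")}
webp_bytes = decode_screenshot(fake_row["image_b64"])
print(len(webp_bytes))
```

The returned bytes can then be wrapped in `io.BytesIO` and opened with an image library of choice.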
66
+
67
+ ## Training procedure
68
+
69
+ | Knob | Value |
70
+ |---|---|
71
+ | Vision encoder | `microsoft/swin-base-patch4-window7-224` (frozen) |
72
+ | Decoder | `Qwen/Qwen2.5-Coder-1.5B-Instruct` (+ LoRA r=16, α=32, dropout 0.05) |
73
+ | Projector | 2-layer MLP, 1024 → 1536 hidden, GELU |
74
+ | Learning rate | 2e-4 (cosine, warmup ratio 0.03) |
75
+ | Batch size × accum | 8 × 4 (effective batch = 32) |
76
+ | Epochs | 3 |
77
+ | Sequence length | 2,048 |
78
+ | Precision | bfloat16 + gradient checkpointing |
79
+ | Hardware | HF Skills `a100-large` |
80
+ | Frameworks | `transformers`, `peft`, custom Trainer + `wandb` |
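
The schedule implied by the table can be worked out with a few lines of arithmetic. This is a sketch under stated assumptions, not the repository's Trainer: it assumes a single device, that the last partial batch of each epoch is kept, and that "cosine, warmup ratio 0.03" means linear warmup followed by cosine decay to zero. The helper `lr_at` is a hypothetical name.

```python
import math

TRAIN_ROWS = 26_126      # train split size stated in the card
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 4
EPOCHS = 3
WARMUP_RATIO = 0.03
PEAK_LR = 2e-4

# Effective batch = per-device batch x gradient accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# Assumption: one device, final partial batch kept (hence the ceiling).
steps_per_epoch = math.ceil(TRAIN_ROWS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under these assumptions the run takes 817 optimizer steps per epoch and 2,451 in total, with the learning rate peaking after 74 warmup steps.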
81
+
82
+ ## Evaluation — base vs fine-tuned (test split, 200 samples)
83
 
84
+ Source: HF Job [`69f7175f9d85bec4d76f125d`](https://huggingface.co/jobs/cmndcntrlcyber/69f7175f9d85bec4d76f125d),
85
+ A100-large, 20 m 38 s.
86
+
87
+ | Metric | Base (Qwen2.5-Coder-1.5B + random projector) | Fine-tuned | Δ (relative) |
88
+ |-----------------------|-----------------------------------------------|------------|---|
89
+ | `exact_match` | 0.0000 | 0.0000 | 0 |
90
+ | `bleu_4` | 0.0000 | 0.0000 | 0 |
91
+ | `mean_edit_similarity`| 0.0382 | 0.0446 | **+16.8 %** |
92
+ | `syntax_valid_rate` † | 0.1950 | 0.6100 | **+213 %** |
93
+
94
+ † Syntax check uses a Python parser. The dataset is multilingual (per-language
95
+ counts: java 5,140; ts 5,095; csharp 5,035; python 3,300; cpp 3,156; go 2,086;
96
+ rust 1,457; js 857; these sum to 26,126, matching the train split rather than the 3,267-row test split), so the absolute number is not directly comparable to a
97
+ Python-only run. The **delta is meaningful** because both rows use the same
98
+ metric on the same samples.
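
To see why a Python-parser check understates validity on non-Python samples, consider the sketch below. It assumes the check behaves like `ast.parse` from the standard library. The card only says "a Python parser", so the exact implementation, the helper name `is_valid_python`, and the sample snippets are all assumptions.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the text parses as Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as embedded null bytes on some versions.
        return False


# Illustrative snippets (hypothetical, not drawn from the dataset).
samples = {
    "python": "def add(a, b):\n    return a + b\n",
    "java": "public class A { int add(int a, int b) { return a + b; } }",
    "go": "func add(a int, b int) int { return a + b }",
}

for lang, code in samples.items():
    print(lang, is_valid_python(code))
```

Well-formed Java or Go is rejected by this check, which is why a per-language parser is needed before the absolute rate can be interpreted.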
99
+
100
+ **Reading the numbers:**
101
+
102
+ * **Strong positive on `syntax_valid_rate`** (0.195 → 0.610): the adapter has
103
+ learned to emit code-shaped output rather than free-form text.
104
+ * **Modest positive on `mean_edit_similarity`** (+16.8 %): predictions are
105
+ closer to references than the baseline.
106
+ * **`exact_match = 0` and `bleu_4 = 0` for both runs**: the model is
107
+ *paraphrasing* the source, not *reconstructing* it verbatim. This is a
108
+ reasonable result for a 1.5 B base model with ~5.5 h of training on 26 K
109
+ multilingual samples — full-fidelity code reconstruction from screenshots
110
+ is hard.
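
The Δ column in the table reports relative change, which can be reproduced from the base and fine-tuned values. A minimal sketch follows; the function name `relative_delta` is illustrative, and the metric values are copied from the table.

```python
from typing import Optional


def relative_delta(base: float, tuned: float) -> Optional[float]:
    """Percent change from base to tuned; None when base is zero."""
    if base == 0:
        return None  # relative change is undefined for a zero baseline
    return (tuned - base) / base * 100.0


# (base, fine-tuned) pairs copied from the evaluation table.
metrics = {
    "exact_match": (0.0000, 0.0000),
    "bleu_4": (0.0000, 0.0000),
    "mean_edit_similarity": (0.0382, 0.0446),
    "syntax_valid_rate": (0.1950, 0.6100),
}

for name, (base, tuned) in metrics.items():
    delta = relative_delta(base, tuned)
    shown = "n/a" if delta is None else f"{delta:+.1f} %"
    print(f"{name}: {shown}")
```

In absolute terms the gains are +0.0064 edit similarity and +0.415 syntax-valid rate, so the large percentages partly reflect small baselines.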
111
+
112
+ See [`docs/eval/phase3-summary.md`](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline/blob/main/docs/eval/phase3-summary.md)
113
+ for the full provenance, including the prior eval-pipeline bug fix.
114
+
115
+ ## Limitations
116
+
117
+ * **Not a full transcription model.** Use the fine-tuned model for code
118
+ *suggestions* from screenshots, not for byte-exact reconstruction.
119
+ * **Domain shift.** The training screenshots all come from the Monaco renderer
120
+ with VS Code-style themes; behaviour on real IDE screenshots, IDEs other
121
+ than VS Code, or non-Monaco editors is untested.
122
+ * **Multilingual evaluation gap.** The `syntax_valid_rate` metric checks
123
+ Python syntax across all languages; per-language metrics are an open
124
+ follow-up (tracked in `docs/eval/phase3-summary.md`).
125
+ * **Small base model.** The 1.5 B decoder limits long-form fidelity; pairing
126
+ with a larger code-trained decoder would likely improve `bleu_4` /
127
+ `exact_match`.
128
+
129
+ ## How to use
130
 
131
  ```python
132
+ # This adapter expects a paired Swin-B vision encoder. Use the loader bundled
133
+ # in the source repository:
134
+ from src.phase3_vision_model.architecture import VisionLanguageModel
135
+ from PIL import Image
136
+
137
+ model = VisionLanguageModel.from_pretrained(
138
+     vision_encoder="microsoft/swin-base-patch4-window7-224",
139
+     decoder="Qwen/Qwen2.5-Coder-1.5B-Instruct",
140
+     adapter_repo="cmndcntrlcyber/code-trainer-vision-adapter",
141
+ ).cuda().eval()
142
 
143
+ image = Image.open("vs_code_screenshot.png").convert("RGB")
144
+ print(model.generate(image, max_new_tokens=512))
145
  ```
146
+
147
+ ## Reproducibility
148
+
149
+ * **Code:** [github.com/cmndcntrlcyber/code-trainer-offsec-pipeline](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)
150
+ * **Training launcher:**
151
+ ```bash
152
+ python -m src.phase3_vision_model.scripts.launch_vision_training \
153
+ --config src/config/v6_config.yaml --wait
154
+ ```
155
+ * **W&B project:** `rtpi-phase3-vision`.
156
+ * **Cost:** approximately $18 on `a100-large` (~5.5 h training + ~20 min eval).