---
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
library_name: peft
license: apache-2.0
tags:
- code-generation
- multimodal
- vision-encoder-decoder
- lora
- peft
- swin
- qwen2.5-coder
- code-trainer-v6
datasets:
- cmndcntrlcyber/code-trainer-offsec-dataset
pipeline_tag: image-to-text
---

# code-trainer-vision-adapter

A multimodal **screenshot → code** model: a frozen
[Swin-B](https://huggingface.co/microsoft/swin-base-patch4-window7-224) vision
encoder, an MLP projector, and a LoRA adapter for
[`Qwen/Qwen2.5-Coder-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct).

This is **Phase 3** of the Code-Trainer V6 / RTPI pipeline
([GitHub](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)):
the multimodal stage that takes a VS Code-style screenshot of source code,
rendered with Monaco Editor, and emits the underlying source.

## Intended use

* **Direct use:** infer source code from VS Code-style code screenshots in
  Python, JavaScript, TypeScript, Java, Go, Rust, C++, or C#.
* **Research / pedagogy:** ablation baseline for larger vision-language code
  models; the projector + LoRA architecture is small enough to retrain on a
  single A100.
* **Out of scope:** general OCR, natural images, hand-written code, or screen
  recordings (all training images came from the Monaco renderer pipeline).

## Architecture

```
  image (224×224, 3 channels)
     │
     ▼
  Swin-B encoder (frozen, 87.7 M params)
     │  visual feature sequence (49 × 1024)
     ▼
  MLP projector (trained, 2.1 M params)
     │  decoder-shaped embedding sequence
     ▼
  Qwen2.5-Coder-1.5B (with LoRA r=16, α=32; trained)
     │
     ▼
  source code tokens
```
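
The 49 × 1024 shape in the diagram follows from the published Swin-B configuration (patch size 4, base embedding width 128, four stages with patch merging between them). A minimal sketch of that arithmetic, assuming the standard `swin-base-patch4-window7-224` settings; the function name is illustrative and not part of the repository:

```python
def swin_output_shape(image_size=224, patch_size=4, embed_dim=128, num_stages=4):
    """Return (num_tokens, channels) of the final Swin stage.

    Assumes the standard Swin layout: a patch embedding followed by
    `num_stages` stages, with a patch-merging step between consecutive
    stages that halves each spatial side and doubles the channel width.
    """
    side = image_size // patch_size          # 224 / 4 = 56 patches per side
    channels = embed_dim
    for _ in range(num_stages - 1):          # 3 merges for 4 stages
        side //= 2
        channels *= 2
    return side * side, channels


print(swin_output_shape())  # (49, 1024)
```

The projector then maps each of the 49 vectors from width 1024 to the decoder's embedding width.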

## Training data

* **Dataset:** [`cmndcntrlcyber/code-trainer-offsec-dataset`](https://huggingface.co/datasets/cmndcntrlcyber/code-trainer-offsec-dataset),
  revision **`v2-multimodal`** (rows include base64-encoded WebP screenshots).
* **Splits:** 26,126 train / 3,265 validation / 3,267 test (≈80/10/10).
* **Capture pipeline:** Monaco Editor in headless Chromium via Playwright,
  rendered through 8 rotating VS Code-style themes for diversity.
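
Because the `v2-multimodal` rows store screenshots as base64-encoded WebP, each image has to be decoded before it reaches the vision encoder. A minimal standard-library sketch, assuming the base64 string has already been read from a row (the column name is not stated here, so none is assumed); it only decodes the payload and checks the RIFF/WebP container signature:

```python
import base64


def decode_webp_b64(payload: str) -> bytes:
    """Decode a base64 string and verify that it looks like a WebP file.

    A WebP file is a RIFF container: bytes 0-3 are b"RIFF", bytes 4-7 hold
    the little-endian chunk size, and bytes 8-11 are b"WEBP".
    """
    raw = base64.b64decode(payload, validate=True)
    if len(raw) < 12 or raw[:4] != b"RIFF" or raw[8:12] != b"WEBP":
        raise ValueError("payload is not a WebP image")
    return raw


# Synthetic 12-byte header used only to exercise the check; not a real image.
fake_header = b"RIFF" + (4).to_bytes(4, "little") + b"WEBP"
encoded = base64.b64encode(fake_header).decode("ascii")
print(len(decode_webp_b64(encoded)))  # 12
```

The returned bytes can then be opened with an image library such as Pillow, provided it was built with WebP support.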

## Training procedure

| Knob | Value |
|---|---|
| Vision encoder | `microsoft/swin-base-patch4-window7-224` (frozen) |
| Decoder | `Qwen/Qwen2.5-Coder-1.5B-Instruct` (+ LoRA r=16, α=32, dropout 0.05) |
| Projector | 2-layer MLP, 1024 → 1536 hidden, GELU |
| Learning rate | 2e-4 (cosine, warmup ratio 0.03) |
| Batch size × accum | 8 × 4 (effective batch = 32) |
| Epochs | 3 |
| Sequence length | 2,048 |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | HF Skills `a100-large` |
| Frameworks | `transformers`, `peft`, custom Trainer + `wandb` |
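
The learning-rate and batch-size rows together determine the step schedule. A rough sketch of that arithmetic, assuming one optimizer step per 32 samples, a final partial batch that still counts as a step, linear warmup, and cosine decay to zero; the custom Trainer may round or schedule differently:

```python
import math

TRAIN_ROWS = 26_126        # train split size
EFFECTIVE_BATCH = 8 * 4    # batch size x gradient accumulation
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03

steps_per_epoch = math.ceil(TRAIN_ROWS / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)  # 817 2451 74
```

Under these assumptions the run takes roughly 2,450 optimizer steps, of which about 74 are warmup.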

## Evaluation — base vs fine-tuned (test split, 200 samples)

Source: HF Job [`69f7175f9d85bec4d76f125d`](https://huggingface.co/jobs/cmndcntrlcyber/69f7175f9d85bec4d76f125d),
A100-large, 20 m 38 s.

| Metric                | Base (Qwen2.5-Coder-1.5B + random projector) | Fine-tuned | Δ (relative) |
|-----------------------|-----------------------------------------------|------------|--------------|
| `exact_match`         | 0.0000                                        | 0.0000     | 0 |
| `bleu_4`              | 0.0000                                        | 0.0000     | 0 |
| `mean_edit_similarity`| 0.0382                                        | 0.0446     | **+16.8 %** |
| `syntax_valid_rate` † | 0.1950                                        | 0.6100     | **+213 %** |

† Syntax check uses a Python parser for every sample, whatever its language.
The data is multilingual (per-language counts, which sum to the 26,126 train
rows: java 5,140; ts 5,095; csharp 5,035; python 3,300; cpp 3,156; go 2,086;
rust 1,457; js 857), so the absolute number is not directly comparable to a
Python-only run. The **delta is meaningful** because both rows use the same
metric on the same samples.
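
The caveat in the footnote is easy to reproduce: a Python parser rejects well-formed code in most other languages. A minimal sketch using the standard-library `ast` module; the evaluation's actual checker is not shown here, so `ast.parse` is an assumption standing in for it:

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


python_snippet = "def add(a, b):\n    return a + b\n"
java_snippet = "int add(int a, int b) { return a + b; }"

print(is_valid_python(python_snippet))  # True
print(is_valid_python(java_snippet))    # False
```

Under a check like this, a correct Java or C# prediction would typically count as invalid, which is why only the base-versus-fine-tuned difference is informative.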

**Reading the numbers:**

* **Strong positive on `syntax_valid_rate`** (0.195 → 0.610): the adapter has
  learned to emit code-shaped output rather than free-form text.
* **Modest positive on `mean_edit_similarity`** (0.0382 → 0.0446, +16.8 %):
  predictions are slightly closer to references than the baseline, although
  both values are low in absolute terms.
* **`exact_match = 0` and `bleu_4 = 0` for both runs**: the model is
  *paraphrasing* the source at best, not *reconstructing* it verbatim. This is
  a plausible result for a 1.5 B base model with ~5.5 h of training on 26 K
  multilingual samples, since full-fidelity code reconstruction from
  screenshots is hard.
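
The percentages quoted above are relative changes over the base row. The sketch below recomputes them and, as an illustration only, shows one common way to score edit similarity with `difflib.SequenceMatcher`; the evaluation script's own definition of `mean_edit_similarity` is not given here and may differ:

```python
from difflib import SequenceMatcher


def relative_change(base: float, tuned: float) -> float:
    """Relative change of `tuned` over `base`, in percent."""
    return (tuned - base) / base * 100.0


def edit_similarity(prediction: str, reference: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, prediction, reference).ratio()


print(round(relative_change(0.0382, 0.0446), 1))  # 16.8
print(round(relative_change(0.1950, 0.6100), 1))  # 212.8
print(edit_similarity("x = 1", "x = 1"))          # 1.0
```

The second value, 212.8, rounds to the +213 % shown in the table.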

See [`docs/eval/phase3-summary.md`](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline/blob/main/docs/eval/phase3-summary.md)
for the full provenance, including the prior eval-pipeline bug fix.

## Limitations

* **Not a full transcription model.** Use the fine-tuned model for code
  *suggestions* from screenshots, not for byte-exact reconstruction.
* **Domain shift.** The training screenshots all come from the Monaco renderer
  with VS Code-style themes; behaviour on real IDE screenshots, IDEs other
  than VS Code, or non-Monaco editors is untested.
* **Multilingual evaluation gap.** The `syntax_valid_rate` metric checks
  Python syntax across all languages; per-language metrics are an open
  follow-up (tracked in `docs/eval/phase3-summary.md`).
* **Small base model.** The 1.5 B decoder limits long-form fidelity; pairing
  with a larger code-trained decoder would likely improve `bleu_4` /
  `exact_match`.

## How to use

```python
# This adapter expects a paired Swin-B vision encoder. Use the loader bundled
# in the source repository; run this from a checkout of that repository so
# that the `src` package is importable.
from PIL import Image

from src.phase3_vision_model.architecture import VisionLanguageModel

# Load the frozen encoder, the base decoder, and the adapter weights.
model = VisionLanguageModel.from_pretrained(
    vision_encoder="microsoft/swin-base-patch4-window7-224",
    decoder="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    adapter_repo="cmndcntrlcyber/code-trainer-vision-adapter",
).cuda().eval()  # .cuda() requires a CUDA-capable GPU

# Training images were RGB screenshots from the Monaco renderer.
image = Image.open("vs_code_screenshot.png").convert("RGB")
print(model.generate(image, max_new_tokens=512))
```

## Reproducibility

* **Code:** [github.com/cmndcntrlcyber/code-trainer-offsec-pipeline](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)
* **Training launcher:**
  ```bash
  python -m src.phase3_vision_model.scripts.launch_vision_training \
      --config src/config/v6_config.yaml --wait
  ```
* **W&B project:** [`rtpi-phase3-vision`](https://wandb.ai/cmndcntrlcyber-c3s-consulting/rtpi-phase3-vision).
* **Cost:** approximately $18 on `a100-large` (~5.5 h training + ~20 min eval).