---
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
library_name: peft
license: apache-2.0
tags:
- code-generation
- multimodal
- vision-encoder-decoder
- lora
- peft
- swin
- qwen2.5-coder
- code-trainer-v6
datasets:
- cmndcntrlcyber/code-trainer-offsec-dataset
pipeline_tag: image-to-text
---
# code-trainer-vision-adapter
A multimodal **screenshot → code** model: a frozen
[Swin-B](https://huggingface.co/microsoft/swin-base-patch4-window7-224) vision
encoder, an MLP projector, and a LoRA adapter for
[`Qwen/Qwen2.5-Coder-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct).
This is **Phase 3** of the Code-Trainer V6 / RTPI pipeline
([GitHub](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)) —
the multimodal stage that takes a Monaco-Editor-rendered VS Code screenshot of
source code and emits the underlying source.
## Intended use
* **Direct use:** infer source code from VS Code-style code screenshots in
Python, JavaScript, TypeScript, Java, Go, Rust, C++, or C#.
* **Research / pedagogy:** ablation baseline for larger vision-language code
models; the projector + LoRA architecture is small enough to retrain on a
single A100.
* **Out of scope:** general OCR, natural images, hand-written code, or screen
recordings (all training images came from the Monaco renderer pipeline).
## Architecture
```
image (224×224, 3 channels)
│
▼
Swin-B encoder (frozen, 87.7 M params)
│ visual feature sequence (49 × 1024)
▼
MLP projector (trained, 2.1 M params)
│ decoder-shaped embedding sequence
▼
Qwen2.5-Coder-1.5B (with LoRA r=16, α=32; trained)
│
▼
source code tokens
```
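
The `49 × 1024` feature shape in the diagram follows from the standard Swin-B configuration (patch size 4, base embedding width 128, three patch-merging steps between four stages). A small sketch of that arithmetic; the function name is illustrative and not part of the repository:

```python
def swin_output_shape(image_size=224, patch_size=4, embed_dim=128, num_stages=4):
    """Return (num_tokens, feature_dim) for a Swin-style hierarchical encoder."""
    side = image_size // patch_size  # 224 / 4 = 56 patches per side
    dim = embed_dim
    # Each of the (num_stages - 1) patch-merging steps halves the grid side
    # and doubles the channel width.
    for _ in range(num_stages - 1):
        side //= 2
        dim *= 2
    return side * side, dim


print(swin_output_shape())  # (49, 1024)
```

The projector then has to map each of the 49 vectors from width 1024 to the decoder's embedding width.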
## Training data
* **Dataset:** [`cmndcntrlcyber/code-trainer-offsec-dataset`](https://huggingface.co/datasets/cmndcntrlcyber/code-trainer-offsec-dataset),
revision **`v2-multimodal`** (rows include base64-encoded WebP screenshots).
* **Splits:** 26,126 train / 3,265 validation / 3,267 test (≈80/10/10).
* **Capture pipeline:** Monaco Editor in headless Chromium via Playwright,
rendered through 8 rotating VS Code-style themes for diversity.
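
Since the `v2-multimodal` rows carry base64-encoded WebP screenshots, a consumer needs to decode them before they reach the vision encoder. The sketch below shows one way to do that with a basic container check. The card does not name the column that holds the image, so the function takes the string directly, and the demo payload is a synthetic header rather than a real screenshot:

```python
import base64


def decode_webp_b64(b64_string):
    """Decode a base64 string and verify it carries a WebP (RIFF) container."""
    raw = base64.b64decode(b64_string)
    # A WebP file starts with b"RIFF", a 4-byte little-endian size, then b"WEBP".
    if len(raw) < 12 or raw[:4] != b"RIFF" or raw[8:12] != b"WEBP":
        raise ValueError("payload is not a WebP image")
    return raw


# Synthetic 12-byte header standing in for a real dataset row (assumption:
# the real column holds a complete WebP file encoded the same way).
fake_payload = b"RIFF" + (4).to_bytes(4, "little") + b"WEBP"
encoded = base64.b64encode(fake_payload).decode("ascii")
raw_bytes = decode_webp_b64(encoded)
print(len(raw_bytes))  # 12
```

With a real row, `PIL.Image.open(io.BytesIO(raw_bytes)).convert("RGB")` should yield an image, provided the local Pillow build has WebP support.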
## Training procedure
| Knob | Value |
|---|---|
| Vision encoder | `microsoft/swin-base-patch4-window7-224` (frozen) |
| Decoder | `Qwen/Qwen2.5-Coder-1.5B-Instruct` (+ LoRA r=16, α=32, dropout 0.05) |
| Projector | 2-layer MLP, 1024 → 1536 hidden, GELU |
| Learning rate | 2e-4 (cosine, warmup ratio 0.03) |
| Batch size × accum | 8 × 4 (effective batch = 32) |
| Epochs | 3 |
| Sequence length | 2,048 |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | HF Skills `a100-large` |
| Frameworks | `transformers`, `peft`, custom Trainer + `wandb` |
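
The batch and schedule rows above imply a rough optimizer-step budget. The following is a back-of-the-envelope sketch, not a readout from the training logs. It assumes a single device, that the batch size of 8 is per device, that the last partial batch of each epoch is kept, and that warmup steps are rounded up:

```python
import math

train_rows = 26_126      # train split size from the card
per_device_batch = 8     # assumption: "batch size" is per device
grad_accum = 4
epochs = 3
warmup_ratio = 0.03

effective_batch = per_device_batch * grad_accum            # 8 * 4
steps_per_epoch = math.ceil(train_rows / effective_batch)  # last partial batch kept
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)       # assumed rounding

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# 32 817 2451 74
```

The custom Trainer may round or shard differently, so treat these as order-of-magnitude figures.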
## Evaluation — base vs fine-tuned (test split, 200 samples)
Source: HF Job [`69f7175f9d85bec4d76f125d`](https://huggingface.co/jobs/cmndcntrlcyber/69f7175f9d85bec4d76f125d),
A100-large, 20 m 38 s.
| Metric | Base (Qwen2.5-Coder-1.5B + random projector) | Fine-tuned | Δ |
|-----------------------|-----------------------------------------------|------------|---|
| `exact_match` | 0.0000 | 0.0000 | 0 |
| `bleu_4` | 0.0000 | 0.0000 | 0 |
| `mean_edit_similarity`| 0.0382 | 0.0446 | **+16.8 %** |
| `syntax_valid_rate` †| 0.1950 | 0.6100 | **+213 %** |
†Syntax check uses a Python parser on every sample. The data is multilingual
(java 5,140; ts 5,095; csharp 5,035; python 3,300; cpp 3,156; go 2,086;
rust 1,457; js 857; these counts sum to 26,126, the size of the train split,
not the 3,267-row test split), so the absolute number is not directly
comparable to a Python-only run. The **delta is meaningful** because both rows
use the same metric on the same samples.
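
To make the footnote concrete, here is a minimal sketch of a Python-parser syntax check built on the standard-library `ast` module. It assumes the evaluation does something similar; the actual implementation lives in the source repository and may differ. The sample snippets are illustrative, not drawn from the dataset:

```python
import ast


def is_valid_python(source):
    """Return True if `source` parses as Python, which is all this check can tell."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


samples = {
    "python": "def add(a, b):\n    return a + b\n",
    "javascript": 'console.log("hi");',
    "java": "public class A { }",
    "rust": "fn main() {}",
}
for lang, code in samples.items():
    print(lang, is_valid_python(code))
```

Some non-Python snippets happen to parse as Python (the JavaScript call above does) while others never will (the Java and Rust lines), which is why the absolute rate on multilingual data is hard to interpret.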
**Reading the numbers:**
* **Strong positive on `syntax_valid_rate`** (0.195 → 0.610): the adapter has
learned to emit code-shaped output rather than free-form text.
* **Modest positive on `mean_edit_similarity`** (0.0382 → 0.0446, +16.8 %):
predictions are slightly closer to the references than the baseline's,
although both values are low in absolute terms.
* **`exact_match = 0` and `bleu_4 = 0` for both runs**: the model does not
reconstruct the source verbatim; at best it is *paraphrasing* it. This is a
plausible result for a 1.5 B base model with ~5.5 h of training on 26 K
multilingual samples, since full-fidelity code reconstruction from
screenshots is hard.
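
The Δ column in the table is a relative change against the base row. A short worked check of those percentages; the helper name is illustrative:

```python
def relative_change(base, tuned):
    """Percentage change from base to tuned, or None when base is zero."""
    if base == 0:
        return None  # relative change is undefined; the table reports 0 here
    return (tuned - base) / base * 100.0


print(f"{relative_change(0.0382, 0.0446):+.1f} %")  # +16.8 %
print(f"{relative_change(0.1950, 0.6100):+.1f} %")  # +212.8 %
print(relative_change(0.0, 0.0))                    # None
```

The second value rounds to the +213 % shown in the table. In absolute terms, the syntax-valid rate rises by 41.5 percentage points.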
See [`docs/eval/phase3-summary.md`](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline/blob/main/docs/eval/phase3-summary.md)
for the full provenance, including the prior eval-pipeline bug fix.
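
The card does not define `mean_edit_similarity`. One common definition is one minus the Levenshtein distance normalised by the longer string, averaged over samples. The sketch below implements that definition as an assumption; the repository's metric may be computed differently:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(pred, ref):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / longest


def mean_edit_similarity(pairs):
    """Average edit similarity over (prediction, reference) pairs."""
    return sum(edit_similarity(p, r) for p, r in pairs) / len(pairs)


print(edit_similarity("x = 1", "x = 2"))  # 0.8
```

Under this definition, a score near 0.04 would mean that almost every character of the prediction has to change to reach the reference.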
## Limitations
* **Not a full transcription model.** Use the fine-tuned model for code
*suggestions* from screenshots, not for byte-exact reconstruction.
* **Domain shift.** The training screenshots all come from the Monaco renderer
with VS Code-style themes; behaviour on real IDE screenshots, IDEs other
than VS Code, or non-Monaco editors is undefined.
* **Multilingual evaluation gap.** The `syntax_valid_rate` metric checks
Python syntax across all languages; per-language metrics are an open
follow-up (tracked in `docs/eval/phase3-summary.md`).
* **Small base model.** The 1.5 B decoder limits long-form fidelity; pairing
with a larger code-trained decoder would likely improve `bleu_4` /
`exact_match`.
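
The per-language follow-up mentioned above amounts to grouping per-sample scores by language before averaging. A minimal sketch, with hypothetical field names and made-up records used only to show the shape of the computation:

```python
from collections import defaultdict


def per_language_mean(records, metric):
    """Average `metric` separately for each value of the 'language' field."""
    buckets = defaultdict(list)
    for row in records:
        buckets[row["language"]].append(row[metric])
    return {lang: sum(vals) / len(vals) for lang, vals in buckets.items()}


# Hypothetical per-sample results; field names and values are illustrative only.
results = [
    {"language": "python", "syntax_valid": 1},
    {"language": "python", "syntax_valid": 0},
    {"language": "rust", "syntax_valid": 0},
]
print(per_language_mean(results, "syntax_valid"))  # {'python': 0.5, 'rust': 0.0}
```

A meaningful per-language syntax metric would also need a parser for each language, not only the grouping step.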
## How to use
```python
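# This adapter expects a paired Swin-B vision encoder. Use the loader bundled
# in the source repository (the `src` package must be importable, e.g. by
# running from a checkout of that repository):
from PIL import Image

from src.phase3_vision_model.architecture import VisionLanguageModel

# Assemble the frozen Swin-B encoder, the projector, and the LoRA-adapted decoder.
model = VisionLanguageModel.from_pretrained(
    vision_encoder="microsoft/swin-base-patch4-window7-224",
    decoder="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    adapter_repo="cmndcntrlcyber/code-trainer-vision-adapter",
).cuda().eval()  # .cuda() requires a CUDA-capable GPU

# Load the screenshot as RGB and generate up to 512 new tokens of source code.
image = Image.open("vs_code_screenshot.png").convert("RGB")
print(model.generate(image, max_new_tokens=512))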
```
## Reproducibility
* **Code:** [github.com/cmndcntrlcyber/code-trainer-offsec-pipeline](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)
* **Training launcher:**
```bash
python -m src.phase3_vision_model.scripts.launch_vision_training \
--config src/config/v6_config.yaml --wait
```
* **W&B project:** [`rtpi-phase3-vision`](https://wandb.ai/cmndcntrlcyber-c3s-consulting/rtpi-phase3-vision).
* **Cost:** approximately $18 on `a100-large` (~5.5 h training + ~20 min eval).