---
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
library_name: peft
license: apache-2.0
tags:
- code-generation
- multimodal
- vision-encoder-decoder
- lora
- peft
- swin
- qwen2.5-coder
- code-trainer-v6
datasets:
- cmndcntrlcyber/code-trainer-offsec-dataset
pipeline_tag: image-to-text
---
# code-trainer-vision-adapter

A multimodal **screenshot → code** model: a frozen
[Swin-B](https://huggingface.co/microsoft/swin-base-patch4-window7-224) vision
encoder, an MLP projector, and a LoRA adapter for
[`Qwen/Qwen2.5-Coder-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct).

This is **Phase 3** of the Code-Trainer V6 / RTPI pipeline
([GitHub](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)) —
the multimodal stage that takes a Monaco-Editor-rendered VS Code screenshot of
source code and emits the underlying source.
## Intended use

* **Direct use:** infer source code from VS Code-style code screenshots in
  Python, JavaScript, TypeScript, Java, Go, Rust, C++, or C#.
* **Research / pedagogy:** ablation baseline for larger vision-language code
  models; the projector + LoRA architecture is small enough to retrain on a
  single A100.
* **Out of scope:** general OCR, natural images, hand-written code, or screen
  recordings (all training images came from the Monaco renderer pipeline).
## Architecture

```
image (224×224, 3 channels)
        │
        ▼
Swin-B encoder (frozen, 87.7 M params)
        │  visual feature sequence (49 × 1024)
        ▼
MLP projector (trained, 2.1 M params)
        │  decoder-shaped embedding sequence
        ▼
Qwen2.5-Coder-1.5B (with LoRA r=16, α=32 — trained)
        │
        ▼
source code tokens
```
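The released checkpoint carries only the projector and LoRA weights; the real wiring lives in `src/phase3_vision_model/architecture.py` in the pipeline repo. The sketch below is an illustrative reconstruction of the diagram above: the prompt text, module names, and the way visual embeddings are prepended to the prompt embeddings are assumptions, not the repo's exact implementation.

```python
# Illustrative sketch of the diagram above; not the repository's actual code.
import torch
import torch.nn as nn
from transformers import (AutoImageProcessor, AutoModelForCausalLM,
                          AutoTokenizer, SwinModel)

vision = SwinModel.from_pretrained("microsoft/swin-base-patch4-window7-224").eval()  # frozen
decoder = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")
processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")

# 2-layer MLP projector: Swin hidden size (1024) -> Qwen2.5-1.5B hidden size (1536)
projector = nn.Sequential(nn.Linear(1024, 1536), nn.GELU(), nn.Linear(1536, 1536))

@torch.no_grad()
def generate_code(pil_image, prompt="Transcribe the code in this screenshot:\n"):
    pixel_values = processor(pil_image, return_tensors="pt").pixel_values
    visual = vision(pixel_values).last_hidden_state      # (1, 49, 1024)
    visual = projector(visual)                           # (1, 49, 1536)
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_embeds = decoder.get_input_embeddings()(prompt_ids)
    inputs_embeds = torch.cat([visual, prompt_embeds], dim=1)
    out = decoder.generate(inputs_embeds=inputs_embeds, max_new_tokens=512)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

During training only the projector and the LoRA matrices inside the decoder receive gradients; the Swin encoder stays frozen.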
## Training data

* **Dataset:** [`cmndcntrlcyber/code-trainer-offsec-dataset`](https://huggingface.co/datasets/cmndcntrlcyber/code-trainer-offsec-dataset),
  revision **`v2-multimodal`** (rows include base64-encoded WebP screenshots).
* **Splits:** 26,126 train / 3,265 validation / 3,267 test (≈80/10/10).
* **Capture pipeline:** Monaco Editor in headless Chromium via Playwright,
  rendered through 8 rotating VS Code-style themes for diversity.
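A minimal loading sketch, assuming the screenshot and source columns are named `image_base64` and `code` (check the dataset viewer for the actual schema):

```python
# Sketch: pull the multimodal revision and decode one screenshot.
# Column names ("image_base64", "code") are assumptions; verify against the dataset.
import base64
import io

from datasets import load_dataset
from PIL import Image

ds = load_dataset(
    "cmndcntrlcyber/code-trainer-offsec-dataset",
    revision="v2-multimodal",
    split="train",
)

row = ds[0]
screenshot = Image.open(io.BytesIO(base64.b64decode(row["image_base64"])))
print(screenshot.size, row["code"][:80])
```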
## Training procedure

| Knob | Value |
|---|---|
| Vision encoder | `microsoft/swin-base-patch4-window7-224` (frozen) |
| Decoder | `Qwen/Qwen2.5-Coder-1.5B-Instruct` (+ LoRA r=16, α=32, dropout 0.05) |
| Projector | 2-layer MLP, 1024 → 1536 hidden, GELU |
| Learning rate | 2e-4 (cosine, warmup ratio 0.03) |
| Batch size × accum | 8 × 4 (effective batch = 32) |
| Epochs | 3 |
| Sequence length | 2,048 |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | HF Skills `a100-large` |
| Frameworks | `transformers`, `peft`, custom Trainer + `wandb` |
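The table translates roughly into the following `peft` / `transformers` configuration. This is a sketch of the decoder-side knobs only; the LoRA target modules and the output directory are assumptions, and the actual run used the custom Trainer from the pipeline repo rather than stock `Trainer`.

```python
# Decoder-side configuration implied by the table above (sketch, not the repo's trainer).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

decoder = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
decoder = get_peft_model(decoder, lora_cfg)

args = TrainingArguments(
    output_dir="phase3-vision-adapter",          # placeholder
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,               # effective batch = 32
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
    report_to="wandb",
)
```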
## Evaluation — base vs fine-tuned (test split, 200 samples)

Source: HF Job [`69f7175f9d85bec4d76f125d`](https://huggingface.co/jobs/cmndcntrlcyber/69f7175f9d85bec4d76f125d),
A100-large, 20 m 38 s.

| Metric | Base (Qwen2.5-Coder-1.5B + random projector) | Fine-tuned | Δ |
|---|---|---|---|
| `exact_match` | 0.0000 | 0.0000 | 0 |
| `bleu_4` | 0.0000 | 0.0000 | 0 |
| `mean_edit_similarity` | 0.0382 | 0.0446 | **+16.8 %** |
| `syntax_valid_rate` † | 0.1950 | 0.6100 | **+213 %** |

† Syntax check uses a Python parser. The test split is multilingual
(java 5,140; ts 5,095; csharp 5,035; python 3,300; cpp 3,156; go 2,086;
rust 1,457; js 857), so the absolute number is not directly comparable to a
Python-only run. The **delta is meaningful** because both rows use the same
metric on the same samples.
**Reading the numbers:**

* **Strong positive on `syntax_valid_rate`** (0.195 → 0.610): the adapter has
  learned to emit code-shaped output rather than free-form text.
* **Modest positive on `mean_edit_similarity`** (+16.8 %): predictions are
  closer to references than the baseline.
* **`exact_match = 0` and `bleu_4 = 0` for both runs**: the model is
  *paraphrasing* the source, not *reconstructing* it verbatim. This is a
  reasonable result for a 1.5 B base model with ~5.5 h of training on 26 K
  multilingual samples — full-fidelity code reconstruction from screenshots
  is hard.

See [`docs/eval/phase3-summary.md`](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline/blob/main/docs/eval/phase3-summary.md)
for the full provenance, including the prior eval-pipeline bug fix.
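For intuition only, a rough approximation of the two non-zero metrics. The repo's eval script is the source of truth; here `mean_edit_similarity` is approximated with `difflib` and `syntax_valid_rate` with `ast.parse`, which, as the footnote notes, only validates Python.

```python
# Rough approximation of the two non-zero metrics; not the pipeline's eval script.
import ast
import difflib

def edit_similarity(pred: str, ref: str) -> float:
    # Character-level similarity in [0, 1].
    return difflib.SequenceMatcher(None, pred, ref).ratio()

def python_syntax_valid(pred: str) -> bool:
    try:
        ast.parse(pred)
        return True
    except (SyntaxError, ValueError):
        return False

preds = ["print('hello')", "def f(:"]
refs = ["print('hello world')", "def f(x):\n    return x"]

mean_edit_similarity = sum(edit_similarity(p, r) for p, r in zip(preds, refs)) / len(preds)
syntax_valid_rate = sum(python_syntax_valid(p) for p in preds) / len(preds)
print(f"{mean_edit_similarity:.3f}  {syntax_valid_rate:.3f}")
```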
## Limitations

* **Not a full transcription model.** Use the fine-tuned model for code
  *suggestions* from screenshots, not for byte-exact reconstruction.
* **Domain shift.** The training screenshots all come from the Monaco renderer
  with VS Code-style themes; behaviour on real IDE screenshots, IDEs other
  than VS Code, or non-Monaco editors is undefined.
* **Multilingual evaluation gap.** The `syntax_valid_rate` metric checks
  Python syntax across all languages; per-language metrics are an open
  follow-up (tracked in `docs/eval/phase3-summary.md`).
* **Small base model.** The 1.5 B decoder limits long-form fidelity; pairing
  with a larger code-trained decoder would likely improve `bleu_4` /
  `exact_match`.
## How to use

```python
# This adapter expects a paired Swin-B vision encoder. Use the loader bundled
# in the source repository:
from src.phase3_vision_model.architecture import VisionLanguageModel
from PIL import Image

model = VisionLanguageModel.from_pretrained(
    vision_encoder="microsoft/swin-base-patch4-window7-224",
    decoder="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    adapter_repo="cmndcntrlcyber/code-trainer-vision-adapter",
).cuda().eval()

image = Image.open("vs_code_screenshot.png").convert("RGB")
print(model.generate(image, max_new_tokens=512))
```
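If you only need the LoRA weights on the text decoder (for example, to inspect or merge them), they can in principle be attached with plain `peft`. This is a sketch, assuming the files on the Hub form a standard adapter; note that without the Swin encoder and projector from the repository above, none of the screenshot → code behaviour is available.

```python
# Attaches only the LoRA adapter to the decoder; the vision encoder and
# projector are still required for image input.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "cmndcntrlcyber/code-trainer-vision-adapter")
```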
## Reproducibility

* **Code:** [github.com/cmndcntrlcyber/code-trainer-offsec-pipeline](https://github.com/cmndcntrlcyber/code-trainer-offsec-pipeline)
* **Training launcher:**

  ```bash
  python -m src.phase3_vision_model.scripts.launch_vision_training \
      --config src/config/v6_config.yaml --wait
  ```

* **W&B project:** [`rtpi-phase3-vision`](https://wandb.ai/cmndcntrlcyber-c3s-consulting/rtpi-phase3-vision).
* **Cost:** approximately $18 on `a100-large` (~5.5 h training + ~20 min eval).