code-trainer-vision-adapter
A multimodal screenshot → code model: a frozen Swin-B vision encoder, an MLP projector, and a LoRA adapter for Qwen/Qwen2.5-Coder-1.5B-Instruct.
This is Phase 3 of the Code-Trainer V6 / RTPI pipeline (GitHub): the multimodal stage that takes a Monaco-Editor-rendered VS Code screenshot of source code and emits the underlying source.
Intended use
- Direct use: infer source code from VS Code-style code screenshots in Python, JavaScript, TypeScript, Java, Go, Rust, C++, or C#.
- Research / pedagogy: ablation baseline for larger vision-language code models; the projector + LoRA architecture is small enough to retrain on a single A100.
- Out of scope: general OCR, natural images, hand-written code, or screen recordings (all training images came from the Monaco renderer pipeline).
Architecture
```
image (224×224, 3 channels)
        │
        ▼
Swin-B encoder (frozen, 87.7 M params)
        │  visual feature sequence (49 × 1024)
        ▼
MLP projector (trained, 2.1 M params)
        │  decoder-shaped embedding sequence
        ▼
Qwen2.5-Coder-1.5B (with LoRA r=16, α=32, trained)
        │
        ▼
source code tokens
```
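The projector is the only newly initialised module between the frozen encoder and the LoRA-adapted decoder. Below is a minimal sketch of that wiring, assuming a plain two-layer `nn.Sequential`; the real module (and the exact hidden width behind the stated 2.1 M parameters) lives in `src/phase3_vision_model/architecture.py` and may differ.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Sketch: map Swin-B features (batch, 49, 1024) into the 1536-d decoder embedding space."""

    def __init__(self, vision_dim: int = 1024, decoder_dim: int = 1536):
        super().__init__()
        # Two linear layers with a GELU in between, per the training table.
        self.net = nn.Sequential(
            nn.Linear(vision_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # (batch, 49, 1024) -> (batch, 49, 1536), ready to be prepended to the text embeddings.
        return self.net(visual_features)
```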
Training data
- Dataset: cmndcntrlcyber/code-trainer-offsec-dataset, revision `v2-multimodal` (rows include base64-encoded WebP screenshots).
- Splits: 26,126 train / 3,265 validation / 3,267 test (≈80/10/10).
- Capture pipeline: Monaco Editor in headless Chromium via Playwright, rendered through 8 rotating VS Code-style themes for diversity.
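For reference, a hedged sketch of pulling one multimodal row and decoding its screenshot. The column names `image_base64` and `source_code` are assumptions; check the dataset card for the actual schema.

```python
import base64
import io

from datasets import load_dataset
from PIL import Image

ds = load_dataset(
    "cmndcntrlcyber/code-trainer-offsec-dataset",
    revision="v2-multimodal",
    split="train",
)

row = ds[0]
# Decode the base64-encoded WebP screenshot back into an RGB image.
screenshot = Image.open(io.BytesIO(base64.b64decode(row["image_base64"]))).convert("RGB")
print(screenshot.size, row["source_code"][:80])
```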
Training procedure
| Knob | Value |
|---|---|
| Vision encoder | microsoft/swin-base-patch4-window7-224 (frozen) |
| Decoder | Qwen/Qwen2.5-Coder-1.5B-Instruct (+ LoRA r=16, α=32, dropout 0.05) |
| Projector | 2-layer MLP, 1024 → 1536 hidden, GELU |
| Learning rate | 2e-4 (cosine, warmup ratio 0.03) |
| Batch size × accum | 8 × 4 (effective batch = 32) |
| Epochs | 3 |
| Sequence length | 2,048 |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | HF Skills a100-large |
| Frameworks | transformers, peft, custom Trainer + wandb |
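The table translates roughly into the following `peft`/`transformers` configuration. This is a sketch of the hyperparameters only, assuming peft's default LoRA target modules; the actual run uses a custom multimodal Trainer, collator, and the launcher listed under Reproducibility.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

decoder = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct", torch_dtype="bfloat16"
)
decoder = get_peft_model(
    decoder,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"),
)

args = TrainingArguments(
    output_dir="phase3-vision-adapter",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # effective batch = 32
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    gradient_checkpointing=True,
    report_to="wandb",
)
```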
Evaluation: base vs. fine-tuned (test split, 200 samples)
Source: HF Job 69f7175f9d85bec4d76f125d, A100-large, 20 m 38 s.
| Metric | Base (Qwen2.5-Coder-1.5B + random projector) | Fine-tuned | Δ |
|---|---|---|---|
| `exact_match` | 0.0000 | 0.0000 | 0 |
| `bleu_4` | 0.0000 | 0.0000 | 0 |
| `mean_edit_similarity` | 0.0382 | 0.0446 | +16.8 % |
| `syntax_valid_rate` † | 0.1950 | 0.6100 | +213 % |

† Syntax check uses a Python parser. The test split is multilingual (java 5,140; ts 5,095; csharp 5,035; python 3,300; cpp 3,156; go 2,086; rust 1,457; js 857), so the absolute number is not directly comparable to a Python-only run. The delta is meaningful because both rows use the same metric on the same samples.
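A minimal sketch of how the two non-zero metrics can be computed, assuming `ast.parse` for the Python-only syntax check (as footnoted above) and a `difflib` ratio for edit similarity; the eval pipeline's exact implementation may differ.

```python
import ast
import difflib


def syntax_valid(prediction: str) -> bool:
    """True if the prediction parses as Python, regardless of its source language."""
    try:
        ast.parse(prediction)
        return True
    except (SyntaxError, ValueError):
        return False


def edit_similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; averaging over samples gives mean_edit_similarity."""
    return difflib.SequenceMatcher(None, prediction, reference).ratio()
```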
Reading the numbers:
- Strong positive on `syntax_valid_rate` (0.195 → 0.610): the adapter has learned to emit code-shaped output rather than free-form text.
- Modest positive on `mean_edit_similarity` (+16.8 %): predictions are closer to references than the baseline.
- `exact_match = 0` and `bleu_4 = 0` for both runs: the model is paraphrasing the source, not reconstructing it verbatim. This is a reasonable result for a 1.5 B base model with ~5.5 h of training on 26 K multilingual samples; full-fidelity code reconstruction from screenshots is hard.
See `docs/eval/phase3-summary.md` for the full provenance, including the prior eval-pipeline bug fix.
Limitations
- Not a full transcription model. Use the fine-tuned model for code suggestions from screenshots, not for byte-exact reconstruction.
- Domain shift. The training screenshots all come from the Monaco renderer with VS Code-style themes; behaviour on real IDE screenshots, IDEs other than VS Code, or non-Monaco editors is undefined.
- Multilingual evaluation gap. The `syntax_valid_rate` metric checks Python syntax across all languages; per-language metrics are an open follow-up (tracked in `docs/eval/phase3-summary.md`).
- Small base model. The 1.5 B decoder limits long-form fidelity; pairing the adapter with a larger code-trained decoder would likely improve `bleu_4`/`exact_match`.
How to use
```python
# This adapter expects a paired Swin-B vision encoder. Use the loader bundled
# in the source repository:
from src.phase3_vision_model.architecture import VisionLanguageModel
from PIL import Image

model = VisionLanguageModel.from_pretrained(
    vision_encoder="microsoft/swin-base-patch4-window7-224",
    decoder="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    adapter_repo="cmndcntrlcyber/code-trainer-vision-adapter",
).cuda().eval()

image = Image.open("vs_code_screenshot.png").convert("RGB")
print(model.generate(image, max_new_tokens=512))
```
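If you need to batch or pre-resize images yourself, the Swin-B checkpoint ships a standard image processor; whether the bundled loader applies exactly this preprocessing internally is an assumption.

```python
from transformers import AutoImageProcessor

# Resizes/normalizes to the 224x224 input the frozen encoder expects.
processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224")
pixel_values = processor(images=image, return_tensors="pt").pixel_values  # (1, 3, 224, 224)
```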
Reproducibility
- Code: github.com/cmndcntrlcyber/code-trainer-offsec-pipeline
- Training launcher: `python -m src.phase3_vision_model.scripts.launch_vision_training --config src/config/v6_config.yaml --wait`
- W&B project: `rtpi-phase3-vision`.
- Cost: approximately $18 on `a100-large` (~5.5 h training + ~20 min eval).