
code-trainer-vision-adapter

A multimodal screenshot → code model: a frozen Swin-B vision encoder, an MLP projector, and a LoRA adapter for Qwen/Qwen2.5-Coder-1.5B-Instruct.

This is Phase 3 of the Code-Trainer V6 / RTPI pipeline (GitHub): the multimodal stage that takes a Monaco-Editor-rendered VS Code screenshot of source code and emits the underlying source.

Intended use

  • Direct use: infer source code from VS Code-style code screenshots in Python, JavaScript, TypeScript, Java, Go, Rust, C++, or C#.
  • Research / pedagogy: ablation baseline for larger vision-language code models; the projector + LoRA architecture is small enough to retrain on a single A100.
  • Out of scope: general OCR, natural images, hand-written code, or screen recordings (all training images came from the Monaco renderer pipeline).

Architecture

   image (224×224, 3 channels)
     │
     ▼
  Swin-B encoder (frozen, 87.7 M params)
     │  visual feature sequence (49 × 1024)
     ▼
  MLP projector (trained, 2.1 M params)
     │  decoder-shaped embedding sequence
     ▼
  Qwen2.5-Coder-1.5B (with LoRA r=16, α=32, trained)
     │
     ▼
   source code tokens
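
The only trained pieces are the projector and the LoRA adapter. Below is a minimal sketch of the projector consistent with its row in the training table (2-layer MLP, 1024 → 1536, GELU); the exact layer shapes are an assumption, so its parameter count need not match the 2.1 M quoted above, and the real module lives in src/phase3_vision_model/.

import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps Swin-B visual features into the decoder embedding space.
    Illustrative sketch only; see src/phase3_vision_model/ for the
    actual module."""
    def __init__(self, vision_dim: int = 1024, decoder_dim: int = 1536):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, decoder_dim),
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # (batch, 49, 1024) -> (batch, 49, 1536): one decoder-shaped
        # embedding per visual token, typically prepended to the
        # decoder's input sequence
        return self.net(visual_feats)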

Training data

  • Dataset: cmndcntrlcyber/code-trainer-offsec-dataset, revision v2-multimodal (rows include base64-encoded WebP screenshots).
  • Splits: 26,126 train / 3,265 validation / 3,267 test (β‰ˆ80/10/10).
  • Capture pipeline: Monaco Editor in headless Chromium via Playwright, rendered through 8 rotating VS Code-style themes for diversity.

Training procedure

| Knob | Value |
| --- | --- |
| Vision encoder | microsoft/swin-base-patch4-window7-224 (frozen) |
| Decoder | Qwen/Qwen2.5-Coder-1.5B-Instruct (+ LoRA r=16, α=32, dropout 0.05) |
| Projector | 2-layer MLP, 1024 → 1536 hidden, GELU |
| Learning rate | 2e-4 (cosine, warmup ratio 0.03) |
| Batch size × accum | 8 × 4 (effective batch = 32) |
| Epochs | 3 |
| Sequence length | 2,048 |
| Precision | bfloat16 + gradient checkpointing |
| Hardware | HF Skills a100-large |
| Frameworks | transformers, peft, custom Trainer + wandb |
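
The LoRA side of this recipe maps directly onto peft. A sketch of a matching setup follows; the target_modules list is a common choice for Qwen2-family models and an assumption, not something this card specifies.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel, AutoModelForCausalLM

decoder = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct", torch_dtype=torch.bfloat16
)
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    # Assumed target set; the repo's training config is authoritative.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
decoder = get_peft_model(decoder, lora_cfg)

# The vision encoder stays frozen; only projector + LoRA weights train.
vision = AutoModel.from_pretrained("microsoft/swin-base-patch4-window7-224")
for p in vision.parameters():
    p.requires_grad = False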

Evaluation: base vs fine-tuned (test split, 200 samples)

Source: HF Job 69f7175f9d85bec4d76f125d, A100-large, 20 m 38 s.

| Metric | Base (Qwen2.5-Coder-1.5B + random projector) | Fine-tuned | Δ |
| --- | --- | --- | --- |
| exact_match | 0.0000 | 0.0000 | 0 |
| bleu_4 | 0.0000 | 0.0000 | 0 |
| mean_edit_similarity | 0.0382 | 0.0446 | +16.8 % |
| syntax_valid_rate † | 0.1950 | 0.6100 | +213 % |

† Syntax check uses a Python parser. The test split is multilingual (java 5,140; ts 5,095; csharp 5,035; python 3,300; cpp 3,156; go 2,086; rust 1,457; js 857), so the absolute number is not directly comparable to a Python-only run. The delta is meaningful because both rows use the same metric on the same samples.
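
Both non-zero metrics are cheap to approximate. Here is a sketch of plausible definitions, consistent with the footnote above; the eval script's exact implementations live in the repo, and the difflib-based edit similarity in particular is an assumption.

import ast
import difflib

def syntax_valid(pred: str) -> bool:
    # Python-parser check, applied to every language (hence the
    # depressed absolute numbers on non-Python rows).
    try:
        ast.parse(pred)
        return True
    except (SyntaxError, ValueError):
        return False

def edit_similarity(pred: str, ref: str) -> float:
    # Character-level similarity in [0, 1]: one plausible reading of
    # mean_edit_similarity, not necessarily the repo's exact formula.
    return difflib.SequenceMatcher(None, pred, ref).ratio()

preds = ["print('hi')", "fn main() {}"]
refs = ["print('hello')", "fn main() { }"]
print(sum(syntax_valid(p) for p in preds) / len(preds))    # syntax_valid_rate
print(sum(edit_similarity(p, r) for p, r in zip(preds, refs)) / len(preds))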

Reading the numbers:

  • Strong positive on syntax_valid_rate (0.195 β†’ 0.610): the adapter has learned to emit code-shaped output rather than free-form text.
  • Modest positive on mean_edit_similarity (+16.8 %): predictions are closer to references than the baseline.
  • exact_match = 0 and bleu_4 = 0 for both runs: the model is paraphrasing the source, not reconstructing it verbatim. This is a reasonable result for a 1.5 B base model with ~5.5 h of training on 26 K multilingual samples β€” full-fidelity code reconstruction from screenshots is hard.

See docs/eval/phase3-summary.md for the full provenance, including the prior eval-pipeline bug fix.

Limitations

  • Not a full transcription model. Use the fine-tuned model for code suggestions from screenshots, not for byte-exact reconstruction.
  • Domain shift. The training screenshots all come from the Monaco renderer with VS Code-style themes; behaviour on real IDE screenshots, IDEs other than VS Code, or non-Monaco editors is undefined.
  • Multilingual evaluation gap. The syntax_valid_rate metric checks Python syntax across all languages; per-language metrics are an open follow-up (tracked in docs/eval/phase3-summary.md).
  • Small base model. The 1.5 B decoder limits long-form fidelity; pairing with a larger code-trained decoder would likely improve bleu_4 / exact_match.

How to use

# This adapter expects a paired Swin-B vision encoder. Use the loader bundled
# in the source repository:
from src.phase3_vision_model.architecture import VisionLanguageModel
from PIL import Image

model = VisionLanguageModel.from_pretrained(
    vision_encoder="microsoft/swin-base-patch4-window7-224",
    decoder="Qwen/Qwen2.5-Coder-1.5B-Instruct",
    adapter_repo="cmndcntrlcyber/code-trainer-vision-adapter",
).cuda().eval()

image = Image.open("vs_code_screenshot.png").convert("RGB")
print(model.generate(image, max_new_tokens=512))
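
The LoRA weights alone also load with plain peft, e.g. to inspect or merge them; note that without the projector and vision encoder this path cannot consume images, so treat it as an inspection-only sketch.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct", torch_dtype=torch.bfloat16
)
# Attaches only the adapter; the projector ships alongside it in the
# repo and is consumed by the VisionLanguageModel loader above.
model = PeftModel.from_pretrained(
    base, "cmndcntrlcyber/code-trainer-vision-adapter"
)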

Reproducibility

Training and evaluation provenance, including the prior eval-pipeline bug fix, is recorded in docs/eval/phase3-summary.md; the comparison above comes from HF Job 69f7175f9d85bec4d76f125d.
