PEFT
modulewarden
software-supply-chain-security
npm
cve
vulnerability-analysis
security-research
qwen
lora
leonardo
cineca
supervised-fine-tuning
Instructions to use apetersson/modulewarden-cve-predict with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use apetersson/modulewarden-cve-predict with PEFT:
Task type is invalid.
- Notebooks
- Google Colab
- Kaggle
| license: apache-2.0 | |
| base_model: huihui-ai/Huihui-Qwen3.6-27B-abliterated | |
| tags: | |
| - modulewarden | |
| - software-supply-chain-security | |
| - npm | |
| - cve | |
| - vulnerability-analysis | |
| - security-research | |
| - qwen | |
| - lora | |
| - peft | |
| - leonardo | |
| - cineca | |
| - supervised-fine-tuning | |
| pretty_name: ModuleWarden.com CVE Diagnosis Adapter | |
| # modulewarden-cve-predict | |
| `modulewarden-cve-predict` is a ModuleWarden.com auditor fine-tune artifact for | |
| structured software supply-chain vulnerability diagnosis. Given a JSON | |
| `AuditDossier` for an npm package release, the model is trained to produce a JSON | |
| `Diagnosis` with a verdict, advisory references, CVE labels, a concise issue | |
| summary, and evidence references selected from the dossier. | |
| This repository contains the exported training outputs and provenance bundle for | |
| the run, including the LoRA adapter archives, SFT records, checkpoints, and | |
| Leonardo job logs. | |
| ## Model Details | |
| - **Base model:** `huihui-ai/Huihui-Qwen3.6-27B-abliterated` | |
| - **Task:** audit dossier to structured vulnerability diagnosis | |
| - **Adapter type:** LoRA / PEFT | |
| - **Training precision:** bf16 | |
| - **Training method:** supervised fine-tuning | |
| - **Primary ecosystem:** npm | |
| - **Output format:** `modulewarden.diagnosis.v1` JSON | |
| ModuleWarden.com uses deterministic policy gates for final allow/quarantine/block | |
| decisions. This model is intended to generate the structured, evidence-cited | |
| diagnosis layer and should not be treated as a standalone security scanner. | |
| The adapter inherits the behavior, risks, and license constraints of the base | |
| model. Users should review the base model card before use; this adapter does not | |
| add safety alignment and is not intended for autonomous security decisions. | |
| ## Training Infrastructure | |
| This run was trained on the CINECA Leonardo supercomputer. | |
| - **System:** CINECA Leonardo | |
| - **Scratch path:** `/leonardo_scratch/large/usertrain/a08trc02` | |
| - **Partition/reservation:** `boost_usr_prod`, `s_tra_ncc` | |
| - **Hardware:** 1 node with 4x NVIDIA A100-SXM-64GB | |
| - **Environment:** Python 3.11.7 virtualenv, torch 2.6-compatible stack, | |
| `transformers==5.9.0` Qwen3.6 text-only loading | |
| Leonardo compute nodes were run offline. The base model, code, corpora, | |
| environment, and dependencies were staged to `$SCRATCH` before training. | |
| ## Training Configuration | |
| - **Epochs:** 3 | |
| - **Sequence length:** 4096 | |
| - **Gradient accumulation:** 16 | |
| - **LoRA rank:** 32 | |
| - **LoRA alpha:** 64 | |
| - **LoRA dropout:** 0.0 | |
| - **Quantization:** none for the final Qwen3.6 path; bf16 LoRA was used instead | |
| of bitsandbytes/QLoRA | |
| ## Data Collection Methodology | |
| The SFT samples were built from the ModuleWarden.com finetune-data corpus. The | |
| source workspace for this run was: | |
| The case-sourcing pipeline starts from public software-security feeds and then | |
| normalizes them into package-version audit cases: | |
| 1. **GitHub Global Security Advisories feed.** Candidate npm advisories were | |
| collected from GitHub's advisory API, including reviewed vulnerability | |
| advisories and npm malware advisories. | |
| 2. **npm registry enrichment.** Candidate packages were enriched with npm | |
| packument metadata such as versions, publish times, dependency changes, | |
| scripts, maintainers, repository URLs, and tarball metadata. | |
| 3. **OSV cross-checking.** OSV records were used to enrich or cross-check | |
| advisory IDs, affected ranges, fixed versions, and package mappings. | |
| 4. **Version-pair construction.** The pipeline inferred vulnerable/fixed pairs, | |
| likely affected versions, and nearby benign neighbor releases where possible. | |
| 5. **Case selection and normalization.** Scraped candidates were normalized into | |
| JSONL case records, then filtered for usable advisory metadata, | |
| package-version context, and dossier evidence before SFT conversion. | |
| 6. **Dossier-to-SFT conversion.** Selected package-version cases were converted | |
| into `AuditDossier` JSON objects with static evidence indexes, then paired | |
| with target `Diagnosis` JSON outputs. | |
| The relevant local corpus artifacts include: | |
| - `scraped-cases.npm-enriched.jsonl`: GitHub advisory candidates enriched with | |
| npm metadata. | |
| - `sft-records-partial.jsonl`: earlier SFT record snapshot. | |
| - `Results/a08trc02/sft-records-diagnosis*.jsonl.part-*`: the diagnosis SFT | |
| snapshots uploaded in this repository. | |
| The corpus emphasizes npm package-version changes, CVE/advisory diffs, malicious | |
| or suspicious releases, and benign neighboring versions. Cases are intended to | |
| teach diagnosis over release diffs with explicit evidence references rather than | |
| free-form malware classification. | |
| ## Evaluation | |
| No benchmark-style CVE prediction or vulnerability detection score is claimed | |
| for this artifact. Validation for this exported run focused on preserving the | |
| adapter artifacts, training logs, SFT records, and provenance needed to inspect | |
| or reproduce the run. | |
| Before operational use, evaluate the adapter on your own held-out package | |
| corpus. In particular, measure JSON schema validity, evidence-reference | |
| validity, false positives, false negatives, and precision/recall for the | |
| specific policy workflow where it will be used. | |
| ## Repository Contents | |
| - `mw-diagnosis-adapter.tgz.part-*`: split archive of the main diagnosis LoRA | |
| adapter and checkpoints. | |
| - `rehearsal-diag-adapter.tgz.part-*`: split archive of a 1-GPU rehearsal | |
| adapter. | |
| - `sft-records-diagnosis.jsonl.part-*`: full SFT corpus, 1,867 chat records. | |
| - `sft-records-diagnosis-filtered.jsonl.part-0000`: filtered SFT corpus, 1,708 | |
| records. | |
| - `sft-records-diagnosis-rich.jsonl.part-0000`: richer SFT corpus variant, 1,708 | |
| records. | |
| - `mw-*.out.part-0000`: Leonardo job logs for setup, training, serving, and | |
| smoke tests. | |
| - `vllm-logs.tgz.part-0000`: vLLM log archive. | |
| - `REASSEMBLE.txt`: original artifact reassembly note. | |
| The large archives are split into 100 MB chunks. | |
| ## Reassembling Artifacts | |
| Restore a split artifact with: | |
| ```bash | |
| cat X.part-* > X | |
| ``` | |
| For the main adapter: | |
| ```bash | |
| cat mw-diagnosis-adapter.tgz.part-* > mw-diagnosis-adapter.tgz | |
| tar xzf mw-diagnosis-adapter.tgz | |
| ``` | |
| For the rehearsal adapter: | |
| ```bash | |
| cat rehearsal-diag-adapter.tgz.part-* > rehearsal-diag-adapter.tgz | |
| tar xzf rehearsal-diag-adapter.tgz | |
| ``` | |
| The main adapter archive includes: | |
| - `mw-diagnosis-adapter/adapter_model.safetensors` | |
| - `mw-diagnosis-adapter/adapter_config.json` | |
| - tokenizer files | |
| - checkpoints `checkpoint-81`, `checkpoint-162`, and `checkpoint-243` | |
| ## Data Format | |
| Each SFT row follows `modulewarden.sft_record.v1`: | |
| - `messages[0]`: system instruction | |
| - `messages[1]`: user message containing a `modulewarden.audit_dossier.v1` | |
| - `messages[2]`: assistant message containing a `modulewarden.diagnosis.v1` | |
| The target diagnosis JSON includes: | |
| - `verdict`: `NO_ISSUES_FOUND` or `ISSUE_FOUND` | |
| - advisory identifiers when known | |
| - CVE identifiers | |
| - a short issue summary | |
| - `evidence_refs` citing only IDs present in the dossier's `evidence_index` | |
| ## Intended Use | |
| This artifact is intended for research, inspection, and reproduction of the | |
| ModuleWarden.com auditor fine-tuning pipeline. It can be used to study schema | |
| fidelity, evidence-cited vulnerability diagnosis, and model behavior on npm | |
| release audit dossiers. Downstream tooling should validate that cited evidence | |
| IDs exist in the source dossier. | |
| It is not intended to be used as a complete vulnerability detection system by | |
| itself. ModuleWarden.com's production design combines deterministic gates, package | |
| diff evidence, and model-generated explanations. | |
| ## Safety | |
| The SFT records are inert JSON dossiers and target JSON diagnoses. They are not | |
| package installations or executable npm corpora. Do not install or execute npm | |
| packages referenced inside the training records solely because they appear in | |
| this repository. Treat package names, tarball URLs, scripts, and metadata as | |
| untrusted security-research data. | |