Update README.md

ebbae5e verified about 1 month ago

7.81 kB

	---
	license: apache-2.0
	base_model: huihui-ai/Huihui-Qwen3.6-27B-abliterated
	tags:
	- modulewarden
	- software-supply-chain-security
	- npm
	- cve
	- vulnerability-analysis
	- security-research
	- qwen
	- lora
	- peft
	- leonardo
	- cineca
	- supervised-fine-tuning
	pretty_name: ModuleWarden.com CVE Diagnosis Adapter
	---

	# modulewarden-cve-predict

	`modulewarden-cve-predict` is a ModuleWarden.com auditor fine-tune artifact for
	structured software supply-chain vulnerability diagnosis. Given a JSON
	`AuditDossier` for an npm package release, the model is trained to produce a JSON
	`Diagnosis` with a verdict, advisory references, CVE labels, a concise issue
	summary, and evidence references selected from the dossier.

	This repository contains the exported training outputs and provenance bundle for
	the run, including the LoRA adapter archives, SFT records, checkpoints, and
	Leonardo job logs.

	## Model Details

	- Base model: `huihui-ai/Huihui-Qwen3.6-27B-abliterated`
	- Task: audit dossier to structured vulnerability diagnosis
	- Adapter type: LoRA / PEFT
	- Training precision: bf16
	- Training method: supervised fine-tuning
	- Primary ecosystem: npm
	- Output format: `modulewarden.diagnosis.v1` JSON

	ModuleWarden.com uses deterministic policy gates for final allow/quarantine/block
	decisions. This model is intended to generate the structured, evidence-cited
	diagnosis layer and should not be treated as a standalone security scanner.

	The adapter inherits the behavior, risks, and license constraints of the base
	model. Users should review the base model card before use; this adapter does not
	add safety alignment and is not intended for autonomous security decisions.

	## Training Infrastructure

	This run was trained on the CINECA Leonardo supercomputer.

	- System: CINECA Leonardo
	- Scratch path: `/leonardo_scratch/large/usertrain/a08trc02`
	- Partition/reservation: `boost_usr_prod`, `s_tra_ncc`
	- Hardware: 1 node with 4x NVIDIA A100-SXM-64GB
	- Environment: Python 3.11.7 virtualenv, torch 2.6-compatible stack,
	`transformers==5.9.0` Qwen3.6 text-only loading

	Leonardo compute nodes were run offline. The base model, code, corpora,
	environment, and dependencies were staged to `$SCRATCH` before training.

	## Training Configuration

	- Epochs: 3
	- Sequence length: 4096
	- Gradient accumulation: 16
	- LoRA rank: 32
	- LoRA alpha: 64
	- LoRA dropout: 0.0
	- Quantization: none for the final Qwen3.6 path; bf16 LoRA was used instead
	of bitsandbytes/QLoRA

	## Data Collection Methodology

	The SFT samples were built from the ModuleWarden.com finetune-data corpus. The
	source workspace for this run was:

	The case-sourcing pipeline starts from public software-security feeds and then
	normalizes them into package-version audit cases:

	1. GitHub Global Security Advisories feed. Candidate npm advisories were
	collected from GitHub's advisory API, including reviewed vulnerability
	advisories and npm malware advisories.
	2. npm registry enrichment. Candidate packages were enriched with npm
	packument metadata such as versions, publish times, dependency changes,
	scripts, maintainers, repository URLs, and tarball metadata.
	3. OSV cross-checking. OSV records were used to enrich or cross-check
	advisory IDs, affected ranges, fixed versions, and package mappings.
	4. Version-pair construction. The pipeline inferred vulnerable/fixed pairs,
	likely affected versions, and nearby benign neighbor releases where possible.
	5. Case selection and normalization. Scraped candidates were normalized into
	JSONL case records, then filtered for usable advisory metadata,
	package-version context, and dossier evidence before SFT conversion.
	6. Dossier-to-SFT conversion. Selected package-version cases were converted
	into `AuditDossier` JSON objects with static evidence indexes, then paired
	with target `Diagnosis` JSON outputs.

	The relevant local corpus artifacts include:

	- `scraped-cases.npm-enriched.jsonl`: GitHub advisory candidates enriched with
	npm metadata.
	- `sft-records-partial.jsonl`: earlier SFT record snapshot.
	- `Results/a08trc02/sft-records-diagnosis.jsonl.part-`: the diagnosis SFT
	snapshots uploaded in this repository.

	The corpus emphasizes npm package-version changes, CVE/advisory diffs, malicious
	or suspicious releases, and benign neighboring versions. Cases are intended to
	teach diagnosis over release diffs with explicit evidence references rather than
	free-form malware classification.

	## Evaluation

	No benchmark-style CVE prediction or vulnerability detection score is claimed
	for this artifact. Validation for this exported run focused on preserving the
	adapter artifacts, training logs, SFT records, and provenance needed to inspect
	or reproduce the run.

	Before operational use, evaluate the adapter on your own held-out package
	corpus. In particular, measure JSON schema validity, evidence-reference
	validity, false positives, false negatives, and precision/recall for the
	specific policy workflow where it will be used.

	## Repository Contents

	- `mw-diagnosis-adapter.tgz.part-*`: split archive of the main diagnosis LoRA
	adapter and checkpoints.
	- `rehearsal-diag-adapter.tgz.part-*`: split archive of a 1-GPU rehearsal
	adapter.
	- `sft-records-diagnosis.jsonl.part-*`: full SFT corpus, 1,867 chat records.
	- `sft-records-diagnosis-filtered.jsonl.part-0000`: filtered SFT corpus, 1,708
	records.
	- `sft-records-diagnosis-rich.jsonl.part-0000`: richer SFT corpus variant, 1,708
	records.
	- `mw-*.out.part-0000`: Leonardo job logs for setup, training, serving, and
	smoke tests.
	- `vllm-logs.tgz.part-0000`: vLLM log archive.
	- `REASSEMBLE.txt`: original artifact reassembly note.

	The large archives are split into 100 MB chunks.

	## Reassembling Artifacts

	Restore a split artifact with:

	```bash
	cat X.part-* > X
	```

	For the main adapter:

	```bash
	cat mw-diagnosis-adapter.tgz.part-* > mw-diagnosis-adapter.tgz
	tar xzf mw-diagnosis-adapter.tgz
	```

	For the rehearsal adapter:

	```bash
	cat rehearsal-diag-adapter.tgz.part-* > rehearsal-diag-adapter.tgz
	tar xzf rehearsal-diag-adapter.tgz
	```

	The main adapter archive includes:

	- `mw-diagnosis-adapter/adapter_model.safetensors`
	- `mw-diagnosis-adapter/adapter_config.json`
	- tokenizer files
	- checkpoints `checkpoint-81`, `checkpoint-162`, and `checkpoint-243`

	## Data Format

	Each SFT row follows `modulewarden.sft_record.v1`:

	- `messages[0]`: system instruction
	- `messages[1]`: user message containing a `modulewarden.audit_dossier.v1`
	- `messages[2]`: assistant message containing a `modulewarden.diagnosis.v1`

	The target diagnosis JSON includes:

	- `verdict`: `NO_ISSUES_FOUND` or `ISSUE_FOUND`
	- advisory identifiers when known
	- CVE identifiers
	- a short issue summary
	- `evidence_refs` citing only IDs present in the dossier's `evidence_index`

	## Intended Use

	This artifact is intended for research, inspection, and reproduction of the
	ModuleWarden.com auditor fine-tuning pipeline. It can be used to study schema
	fidelity, evidence-cited vulnerability diagnosis, and model behavior on npm
	release audit dossiers. Downstream tooling should validate that cited evidence
	IDs exist in the source dossier.

	It is not intended to be used as a complete vulnerability detection system by
	itself. ModuleWarden.com's production design combines deterministic gates, package
	diff evidence, and model-generated explanations.

	## Safety

	The SFT records are inert JSON dossiers and target JSON diagnoses. They are not
	package installations or executable npm corpora. Do not install or execute npm
	packages referenced inside the training records solely because they appear in
	this repository. Treat package names, tarball URLs, scripts, and metadata as
	untrusted security-research data.