VectraYX: Reproducibility Release
Paper: VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
This repository contains the code, datasets, and pre-computed results needed to reproduce the key experiments from the paper.
Repository Structure
release/
├── Makefile ← make repro / make bench-nano / make lora-nano
├── requirements.txt ← exact package versions
├── configs/
│   ├── nano.json ← Nano 42M architecture (GQA 8q/2kv, d_model=512)
│   └── base.json ← Base 260M architecture (GQA 16q/4kv, d_model=1024)
├── training/
│   ├── transformer.py ← VectraYXNano model (GQA + QK-Norm + Z-loss + RoPE)
│   ├── pretrain.py ← 3-phase curriculum pre-training driver
│   ├── finetune_sft.py ← SFT with assistant-only loss masking + mini-curriculum
│   ├── finetune_lora_tools.py ← LoRA adapter injection + merge (key experiment)
│   ├── finetune_tools.py ← Full fine-tune (baseline comparison)
│   ├── sft_dataset.py ← JSONL → tokenized dataset with loss masking
│   ├── utils.py ← AdamW, cosine LR, checkpoint save/load
│   ├── aws_lora_nano_tools_s3.py ← SageMaker launcher: Nano LoRA (S3-only)
│   └── aws_lora_base_tools_s3.py ← SageMaker launcher: Base LoRA (S3-only)
├── eval/
│   ├── benchmark.py ← VectraYX-Bench B1–B5 harness
│   ├── run_inference_lora.py ← Inference with LoRA adapter loaded
│   ├── run_inference_base.py ← Inference with base checkpoint
│   └── red_team_eval.py ← Adversarial probe script
├── eval_data/
│   ├── b1_cveqa.jsonl ← 500 CVE Q&A prompts + expected keywords
│   ├── b2_classification.jsonl ← 200 threat classification examples
│   ├── b3_commands.jsonl ← 35 command-line completion prompts
│   ├── b4_tooluse.jsonl ← 25 tool-selection prompts (v2: 50 prompts)
│   └── b5_conversational.jsonl ← 10 conversational gate prompts
├── corpus/
│   ├── tool_sft_mini_v1.jsonl ← 2,801 tool-use examples (ratio 1:21) ← KEY
│   ├── tool_sft_v3_bash.jsonl ← 296 bash-focused examples
│   ├── tool_sft_v2_simple.jsonl ← 115 simple bash examples
│   ├── b4_tooluse_v2.jsonl ← B4 benchmark v2 (50 questions, 60% bash)
│   ├── build_mini_tool_corpus.py ← Regenerate tool_sft_mini_v1 from scratch
│   ├── build_tool_sft_corpus.py ← Full tool-use corpus generator
│   └── build_v3_and_bench.py ← v3 corpus + benchmark builder
├── results/
│   ├── bench_nano_baseline_multiseed.json ← Nano baseline N=4 seeds (paper Table 2)
│   ├── bench_nano_lora_multiseed.json ← Nano LoRA N=4 seeds (paper Table 3)
│   └── bench_base_lora_s42.json ← Base LoRA seed=42 (paper Table 3)
└── paper/
    └── main.pdf ← Paper PDF
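The GQA settings noted next to configs/nano.json and configs/base.json imply a few derived shapes. The sketch below works them out under the usual convention that head_dim = d_model / n_query_heads and that each KV head is shared by n_query_heads / n_kv_heads query heads. The function name and the returned keys are illustrative assumptions, not identifiers taken from training/transformer.py.

```python
def gqa_shapes(d_model: int, n_q_heads: int, n_kv_heads: int) -> dict:
    """Derive per-head sizes for grouped-query attention (GQA).

    Assumes head_dim = d_model / n_q_heads (the common convention);
    the actual model code may define these differently.
    """
    if d_model % n_q_heads != 0:
        raise ValueError("d_model must be divisible by n_q_heads")
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    head_dim = d_model // n_q_heads
    return {
        "head_dim": head_dim,                      # width of one attention head
        "group_size": n_q_heads // n_kv_heads,     # query heads per shared KV head
        "q_proj_width": n_q_heads * head_dim,      # output width of the Q projection
        "kv_proj_width": n_kv_heads * head_dim,    # output width of each of K and V
    }


# Values from the repository structure listing above.
nano = gqa_shapes(d_model=512, n_q_heads=8, n_kv_heads=2)
base = gqa_shapes(d_model=1024, n_q_heads=16, n_kv_heads=4)
print("nano:", nano)
print("base:", base)
```

Under these assumptions both configurations share a head width of 64 and a group size of 4, so the K and V projections are a quarter of the width of the Q projection.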
Key Finding: Tool-Use Corpus Density
The B4=0.000 floor in mixed SFT is a corpus-density artifact, not a capacity gate.
| Model | Corpus | Ratio | B4 |
|---|---|---|---|
| Nano 42M (mixed SFT, N=4 seeds) | 62K examples | 1:211 | 0.000 |
| Nano 42M + LoRA (N=4 seeds) | 2,801 examples | 1:21 | 0.145 ± 0.046 |
| Base 260M (mixed SFT) | 62K examples | 1:211 | 0.000 |
| Base 260M + LoRA | 2,801 examples | 1:21 | 0.580 |
| Pro 3B + LoRA-64 | 62K examples | ~1:10 | 0.600 |
| Pro 7B + QLoRA-32 | 62K examples | ~1:10 | 0.880 |
Nano LoRA Multi-Seed Results (N=4, Table 3 in paper)
| Seed | B1 KW | B2 F1 | B3 TM | B4 | B5 |
|---|---|---|---|---|---|
| 42 | 0.008 | 0.200 | 0.029 | 0.220 | 0.500 |
| 7 | 0.017 | 0.200 | 0.029 | 0.140 | 0.600 |
| 13 | 0.006 | 0.200 | 0.000 | 0.120 | 0.600 |
| 23 | 0.014 | 0.205 | 0.029 | 0.100 | 0.600 |
| Mean ± std | 0.011 ± 0.004 | 0.201 ± 0.002 | 0.021 ± 0.012 | 0.145 ± 0.046 | 0.575 ± 0.043 |
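The summary row can be recomputed from the per-seed values. The sketch below does this for the B4 column using only the Python standard library. The reported spread appears to match the population standard deviation (dividing by N); the sample standard deviation (dividing by N-1) would come out larger, at roughly 0.053. The variable and function names are illustrative.

```python
from statistics import mean, pstdev

# Per-seed B4 scores copied from the table above.
B4_BY_SEED = {42: 0.220, 7: 0.140, 13: 0.120, 23: 0.100}


def summarize(scores):
    """Return (mean, population standard deviation) of an iterable of scores."""
    values = list(scores)
    return mean(values), pstdev(values)


b4_mean, b4_std = summarize(B4_BY_SEED.values())
print(f"B4 = {b4_mean:.3f} ± {b4_std:.3f}")
```

The same function applies to the other columns if their per-seed values are substituted.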
Quick Start
1. Install dependencies
pip install -r requirements.txt
2. Download checkpoints
mkdir -p checkpoints
# From HuggingFace (links TBD; see paper for GCS paths)
# Nano 42M post-SFT (503 MB)
# wget https://huggingface.co/vectrayx/nano-sft-v5/resolve/main/nano_sft_v5.pt \
# -O checkpoints/nano_sft_v5.pt
# Base 260M post-Phase3 (3.1 GB)
# wget https://huggingface.co/vectrayx/base-phase3/resolve/main/base_phase3_last.pt \
# -O checkpoints/base_phase3_last.pt
# Tokenizer (474 KB)
# wget https://huggingface.co/vectrayx/tokenizer/resolve/main/vectrayx_bpe.model \
# -O checkpoints/vectrayx_bpe.model
3. Run the full reproducibility suite
make repro
This runs:
- make bench-nano: B1–B5 on Nano baseline (expected B4=0.000)
- make bench-base: B1–B5 on Base baseline (expected B4=0.000)
- make lora-nano: LoRA fine-tune Nano + eval (expected B4≈0.220 for seed=42)
- make lora-base: LoRA fine-tune Base + eval (expected B4≈0.580 for seed=42)
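After a run, the measured B4 scores can be compared against the expected values listed above. The following is a minimal sketch of such a check. The tolerance, the function name, and the shape of the `measured` dictionary are assumptions made for illustration; the JSON layout written by eval/benchmark.py is not documented here, so the scores have to be extracted from the result files by hand or with your own loader.

```python
# Expected B4 per make target, as stated in this README (seed=42 for the LoRA runs).
EXPECTED_B4 = {
    "bench-nano": 0.000,
    "bench-base": 0.000,
    "lora-nano": 0.220,
    "lora-base": 0.580,
}


def check_b4(measured: dict, tol: float = 0.02) -> dict:
    """Compare measured B4 scores to the expected ones.

    `tol` is an arbitrary illustrative tolerance, not a value from the paper.
    Returns {target: (expected, measured_or_None, within_tolerance)}.
    """
    report = {}
    for target, expected in EXPECTED_B4.items():
        got = measured.get(target)
        ok = got is not None and abs(got - expected) <= tol
        report[target] = (expected, got, ok)
    return report


# Made-up inputs purely to show the call; "lora-base" is left out on purpose.
example = {"bench-nano": 0.0, "bench-base": 0.0, "lora-nano": 0.22}
report = check_b4(example)
for target, (expected, got, ok) in report.items():
    status = "ok" if ok else "CHECK"
    print(f"{target:<11} expected={expected:.3f} measured={got} [{status}]")
```

A target that has no measured score is flagged rather than silently skipped, which helps when only part of the suite has been run.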
4. Run individual experiments
# Benchmark only (no training)
make bench-nano
make bench-base
# LoRA fine-tune + benchmark
make lora-nano # ~30 min on A10G
make lora-base # ~45 min on A10G
# Regenerate corpus
make corpus
Reproducing the Pre-Training Pipeline
The full from-scratch pre-training pipeline (Phases 1–3 + SFT) is described in training_v2/README.md in the main repository. The key entry points are:
# 1. Train tokenizer (BPE-16384, 50/50 conv/tech balance)
python -m training.tokenizer.train_spm_bpe \
--config configs/nano.json \
--corpus-root /path/to/corpus \
--out-dir checkpoints/tokenizer
# 2. Tokenize corpus → binary shards
python -m training.data.prepare_corpus \
--tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
--corpus-root /path/to/corpus \
--out-root data/bins
# 3. Pre-train (3 phases with replay buffer)
python training/pretrain.py --config configs/nano.json \
--bins data/bins --out checkpoints --phase 1 \
--batch-size 16 --grad-accum 8 --epochs 2
python training/pretrain.py --config configs/nano.json \
--bins data/bins --out checkpoints --phase 2 \
--resume checkpoints/phase1/last.pt
python training/pretrain.py --config configs/nano.json \
--bins data/bins --out checkpoints --phase 3 \
--resume checkpoints/phase2/last.pt
# 4. SFT with mini-curriculum
python training/finetune_sft.py \
--config configs/nano.json \
--tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
--resume checkpoints/phase3/last.pt \
--out checkpoints/sft_v5 \
--batch-size 16 --grad-accum 4 --epochs 3 --lr 2e-5
# 5. Benchmark
python eval/benchmark.py \
--config configs/nano.json \
--tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
--checkpoint checkpoints/sft_v5/final.pt \
--data-dir eval_data \
--out results/bench_nano_baseline.json
Estimated cost: ~$12 USD on GCP L4 for 3 full runs (v2/v4/v6 ablations).
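The --batch-size and --grad-accum flags in the commands above together set the number of sequences that contribute to each optimizer step. The sketch below makes that arithmetic explicit. It assumes the conventional meaning of the flags (a per-device micro-batch and a count of accumulated micro-batches) and a single GPU; neither assumption is confirmed by the scripts themselves, and the function name is illustrative.

```python
def effective_batch(batch_size: int, grad_accum: int, n_devices: int = 1) -> int:
    """Sequences per optimizer step under gradient accumulation.

    Assumes `batch_size` is the per-device micro-batch and gradients are
    accumulated over `grad_accum` micro-batches before each update.
    """
    return batch_size * grad_accum * n_devices


# Flag values taken from the phase 1 pre-training and SFT commands above.
pretrain_phase1 = effective_batch(batch_size=16, grad_accum=8)
sft = effective_batch(batch_size=16, grad_accum=4)
print(f"phase 1 pre-training: {pretrain_phase1} sequences per step")
print(f"SFT: {sft} sequences per step")
```

If memory on a smaller GPU forces a lower --batch-size, raising --grad-accum by the same factor keeps this product unchanged.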
SageMaker Experiments (LoRA)
The LoRA experiments were run on AWS SageMaker ml.g5.xlarge (NVIDIA A10G 24GB).
# Prerequisites: AWS CLI configured, S3 bucket with assets
# See training/aws_lora_nano_tools_s3.py for full setup
# Upload assets to S3
aws s3 cp checkpoints/nano_sft_v5.pt s3://YOUR_BUCKET/checkpoints/
aws s3 cp checkpoints/vectrayx_bpe.model s3://YOUR_BUCKET/tokenizers/
aws s3 cp corpus/tool_sft_mini_v1.jsonl s3://YOUR_BUCKET/training-data/
# Launch Nano LoRA (seed=42)
bash corpus/launch_nano_lora_mini_ondemand.sh
# Launch Base LoRA (seed=42)
bash corpus/launch_base_lora_mini_ondemand.sh
Estimated cost per run: ~$1.50 USD (ml.g5.xlarge on-demand, ~45 min).
Model Checkpoints
| Checkpoint | Size | Description | Link |
|---|---|---|---|
| nano_sft_v5.pt | 503 MB | Nano 42M post-SFT (base for LoRA) | HuggingFace (TBD) |
| nano_lora_mini_s42.pt | ~5 MB | Nano LoRA adapter (seed=42) | HuggingFace (TBD) |
| base_phase3_last.pt | 3.1 GB | Base 260M post-Phase3 (base for LoRA) | HuggingFace (TBD) |
| base_lora_mini_s42.pt | ~20 MB | Base LoRA adapter (seed=42) | HuggingFace (TBD) |
| vectrayx_bpe.model | 474 KB | BPE-16384 tokenizer | HuggingFace (TBD) |
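Because the download links are still TBD, it can help to confirm which files are in place before invoking the Makefile. The sketch below checks for the three files that Quick Start step 2 places under checkpoints/. The helper name is illustrative, and the LoRA adapters are left out on the assumption that they are produced by the LoRA runs rather than downloaded.

```python
from pathlib import Path

# Files fetched in Quick Start step 2, with the sizes listed in the table above.
EXPECTED_FILES = {
    "nano_sft_v5.pt": "503 MB",
    "base_phase3_last.pt": "3.1 GB",
    "vectrayx_bpe.model": "474 KB",
}


def missing_checkpoints(ckpt_dir: str = "checkpoints") -> list:
    """Return the expected file names that are not present in `ckpt_dir`."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_checkpoints()
if missing:
    for name in missing:
        print(f"missing: {name} (expected size about {EXPECTED_FILES[name]})")
else:
    print("all expected checkpoint files are present")
```

The check only tests for presence; it does not verify file size or integrity.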
Environment
Experiments were run with:
| Package | Version |
|---|---|
| Python | 3.10 |
| PyTorch | 2.11.0 |
| sentencepiece | 0.2.1 |
| numpy | 2.4.2 |
| CUDA | 12.1 |
| boto3 | 1.42.93 |
| sagemaker | 3.10.0 |
Hardware:
- Pre-training: GCP g2-standard-4 (NVIDIA L4 24GB), us-west1-a
- LoRA experiments: AWS SageMaker ml.g5.xlarge (NVIDIA A10G 24GB), us-east-1
- Multi-seed runs: AWS EC2 g4dn.xlarge (NVIDIA T4 16GB)
Citation
@inproceedings{santillana2026vectrayx,
title = {VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model
with Curriculum Learning and Native Tool Use},
author = {Santillana, Juan S.},
booktitle = {Preprint},
year = {2026}
}
License
| Component | License |
|---|---|
| Training code | MIT |
| Evaluation datasets (B1–B5) | CC-BY-4.0 |
| Model weights | Apache 2.0 |
| Paper | CC-BY-4.0 |