marvy-14B

The first open, fine-tuned LLM for the full ServiceNow delivery lifecycle — from business analysis to validation.

marvy-14B is an open-source language model fine-tuned for the complete ServiceNow delivery lifecycle: business analysis, requirements, stakeholder mapping, systems inventory, Solution Design Documents, user stories with acceptance criteria, implementation planning, test cases, and validation. Where general-purpose models treat ServiceNow as one topic among many, marvy is built to draft the actual artifacts a delivery team produces — in the structure and sequence real engagements follow. It is a first-draft specialist, not a consultant replacement, and it is not an agentic or tool-use fine-tune.

It was built by MainStack, a consultancy specializing in ServiceNow Agentic Delivery. marvy is a LoRA SFT fine-tune of Qwen2.5-14B-Instruct (Apache-2.0), drawing on a corpus of 1,958 anonymized artifacts from real engagements (~887k tokens; 1,359 records in the training split). The corpus was redacted before training, and an automated leakage scanner reports zero residual emails, hostnames, or mapped real names. The test perplexity of 13.107 was measured on a project- and customer-disjoint held-out split, so it reflects performance on engagements the model never saw rather than recall of the training set.

Released under Apache-2.0. Built with Qwen — see NOTICE.

Why marvy-14B

  • Drafts the full lifecycle, not just snippets. Business analysis through validation — the artifacts and sequence real delivery teams actually work in.
  • OOTB-first and implementation-grade. Tuned to favor out-of-the-box correctness and produce drafts you can review, not rewrite.
  • Runs locally and privately. Merged BF16 weights, a LoRA adapter, and GGUF quants: run it on Apple Silicon via LM Studio or Ollama, with your engagement data never leaving your machine.
  • Trained on real, anonymized delivery work. 1,958 redacted engagement artifacts (~887k tokens), with an automated leakage scanner reporting zero residual emails, hostnames, or mapped real names.
  • Open and Apache-2.0. Built on Qwen2.5-14B-Instruct — inspect it, fine-tune it, and deploy it on your own terms.

📖 Full docs: USAGE.md (every runtime + OpenCode wiring) · VALIDATION.md (prove the fine-tune works) · validate.sh (one-command probe harness)


Quick start

Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MainStack/marvy-14B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

SYSTEM = (
  "You are a senior ServiceNow delivery consultant. You produce precise, "
  "implementation-grade artifacts: business analyses, requirements, solution "
  "design documents, user stories with acceptance criteria, test cases, and "
  "validation reviews. You favor out-of-the-box capabilities, cite concrete "
  "tables/plugins/sys_ids when relevant, and write in clear professional English."
)

messages = [
  {"role": "system", "content": SYSTEM},
  {"role": "user", "content": "Write a ServiceNow user story with acceptance criteria for SLA escalation on P1 incidents."},
]
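# Apply the chat template; return_dict=True also returns the attention mask.
inputs = tok.apply_chat_template(
  messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# do_sample=True ensures temperature is applied even if the generation config defaults to greedy decoding.
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.4, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))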

vLLM

pip install vllm
vllm serve MainStack/marvy-14B
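
Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. The sketch below shows one way to query it using only the Python standard library. It assumes vLLM's default address (`http://localhost:8000`); `build_chat_request` and `ask` are hypothetical helper names for illustration, not part of vLLM, and `SYSTEM` should be replaced with the full recommended system prompt.

```python
import json
import urllib.request

# Assumption: vLLM is listening on its default host and port.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

# Placeholder: paste the full recommended system prompt here.
SYSTEM = "You are a senior ServiceNow delivery consultant..."


def build_chat_request(prompt: str, temperature: float = 0.4, max_tokens: int = 1024) -> dict:
    """Build an OpenAI-style chat completion payload for the served model."""
    return {
        "model": "MainStack/marvy-14B",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }


def ask(prompt: str) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

With the server up, `ask("Draft the Platform Architecture section of an ITSM SDD.")` should return the draft as a string.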

Ollama (via GGUF)

Use the companion repo MainStack/marvy-14B-GGUF:

ollama run hf.co/MainStack/marvy-14B-GGUF:Q4_K_M

MLX (Apple Silicon native)

pip install mlx-lm
python -m mlx_lm generate --model MainStack/marvy-14B \
  --system-prompt "You are a senior ServiceNow delivery consultant..." \
  --prompt "Draft the Platform Architecture section of an ITSM SDD." \
  --max-tokens 1024 --temp 0.4

LoRA-only (apply on top of the base)

If you prefer a tiny adapter (~175 MB) on top of the BF16 base, see MainStack/marvy-14B-lora.


Intended use

marvy-14B is designed to produce implementation-grade first drafts across the ServiceNow delivery lifecycle — accelerating the artifacts a practitioner would otherwise write from scratch, then review and refine. Built for solution architects, business analysts, technical consultants, and project managers. Typical tasks:

| Task family | What it produces |
| --- | --- |
| business_analysis | Structured BA reports from SOWs / discovery notes |
| requirements_extraction | Functional/non-functional requirements with acceptance bullets |
| stakeholder_mapping | RACI / influence-interest grids from raw notes |
| systems_inventory | CMDB-shaped systems inventories from architecture inputs |
| sdd_design | Solution Design Document sections (architecture, integrations, data model) |
| story_authoring | User stories with crisp acceptance criteria |
| implementation_planning | Story-level implementation plans citing tables/plugins |
| test_case_generation | Test cases per story, mapped to acceptance criteria |
| validation_critique | Gap analysis, follow-up questions, assumption checks against source docs |
| delivery_chain | Multi-turn: story → implementation → test, end-to-end |

Recommended system prompt

You are a senior ServiceNow delivery consultant. You produce precise, implementation-grade
artifacts: business analyses, requirements, solution design documents, user stories with
acceptance criteria, test cases, and validation reviews. You favor out-of-the-box
capabilities, cite concrete tables/plugins/sys_ids when relevant, and write in clear
professional English.

Recommended generation settings

| Use case | temperature | top_p | max_new_tokens |
| --- | --- | --- | --- |
| Structured artifacts (SDD, stories) | 0.3 – 0.5 | 0.9 | 1024 – 4096 |
| Exploratory brainstorming | 0.7 – 0.9 | 0.95 | 1024 |
| Validation / critique | 0.2 – 0.4 | 0.9 | 1024 – 2048 |
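
As a convenience, the recommended ranges can be kept in one place and turned into concrete keyword arguments for `model.generate`. This is a minimal sketch: the preset names (`structured`, `brainstorm`, `critique`) and the choice of midpoint temperature and upper token budget are assumptions made for illustration, not settings prescribed by the model card.

```python
# Ranges copied from the recommended generation settings: (low, high) per setting.
GENERATION_SETTINGS = {
    "structured": {"temperature": (0.3, 0.5), "top_p": 0.9, "max_new_tokens": (1024, 4096)},
    "brainstorm": {"temperature": (0.7, 0.9), "top_p": 0.95, "max_new_tokens": (1024, 1024)},
    "critique": {"temperature": (0.2, 0.4), "top_p": 0.9, "max_new_tokens": (1024, 2048)},
}


def settings_for(use_case: str) -> dict:
    """Return generate() kwargs: midpoint temperature and the upper token budget."""
    row = GENERATION_SETTINGS[use_case]
    low, high = row["temperature"]
    return {
        "do_sample": True,  # sampling must be on for temperature/top_p to matter
        "temperature": round((low + high) / 2, 2),
        "top_p": row["top_p"],
        "max_new_tokens": row["max_new_tokens"][1],
    }
```

For example, `model.generate(**inputs, **settings_for("critique"))` would sample at temperature 0.3 with up to 2048 new tokens.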

Training data

| Item | Value |
| --- | --- |
| Source | Anonymized real engagement artifacts (.md, .csv, .json, .mmd, .txt) |
| Total records | 1,958 (after schema + exact-dedupe) |
| Estimated tokens | ~887k |
| Splits (project-disjoint) | train 1,359 · val 347 · test 252 |
| Tasks | 11 task families (the 10 in the table above, plus case_study) |
| Multi-turn share | delivery_chain (158 records): story→implementation→test |

Privacy & redaction

  • All customer/partner names → stable aliases (e.g. Customer-FIN-03, Customer-ENERGY-01).
  • Emails → user@example.com; hostnames → instance.example.service-now.com; IPs → RFC 5737 range; key: value secrets → [REDACTED].
  • Credential/login/VPN files excluded entirely; bulk CMDB dumps >1.5 MB excluded.
  • ServiceNow sys_ids and table/plugin names preserved (instance-local, technically valuable, low risk).
  • A leakage scanner asserts 0 residual emails, hostnames, or mapped real names in message content.
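
The redaction rules above can be approximated with a few regular expressions followed by a leak check. The sketch below is illustrative only: the actual redaction pipeline and scanner are not published here, so the patterns, the list of secret key names, and the example customer name `Acme Bank` are all assumptions.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HOST_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+service-now\.com\b", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed key names; a real pipeline would likely cover more.
SECRET_RE = re.compile(r"(?i)\b(password|token|api[_-]?key|secret)(\s*[:=]\s*)\S+")

EMAIL_PLACEHOLDER = "user@example.com"
HOST_PLACEHOLDER = "instance.example.service-now.com"
IP_PLACEHOLDER = "192.0.2.1"  # inside the RFC 5737 documentation range 192.0.2.0/24


def redact(text: str, aliases: dict) -> str:
    """Replace real names with stable aliases, then mask secrets, emails, hosts, and IPs."""
    for real_name, alias in aliases.items():
        text = text.replace(real_name, alias)
    text = SECRET_RE.sub(r"\1\2[REDACTED]", text)
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    text = HOST_RE.sub(HOST_PLACEHOLDER, text)
    return IPV4_RE.sub(IP_PLACEHOLDER, text)


def find_leaks(text: str, aliases: dict) -> list:
    """Return any residual emails, hostnames, or mapped real names found in the text."""
    leaks = [m for m in EMAIL_RE.findall(text) if m != EMAIL_PLACEHOLDER]
    leaks += [m for m in HOST_RE.findall(text) if m.lower() != HOST_PLACEHOLDER]
    leaks += [name for name in aliases if name in text]
    return leaks
```

A record would pass the check when `find_leaks(redact(raw, aliases), aliases)` returns an empty list. Note that the IPv4 pattern is deliberately loose and will also rewrite four-part version strings.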

Split integrity

Train / val / test are split by project, so no customer appears in more than one split. The largest project is forced into train to keep eval honest:

  • val projects: Customer-ENERGY-01
  • test projects: Customer-CHEM-01, Customer-FININST-01
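
A project-disjoint split of this kind can be sketched as a lookup from project alias to split, followed by a check that no project lands in two splits. The `project` field name on each record is an assumption about the record schema; the val and test assignments are the ones listed above, and every other project defaults to train.

```python
from collections import defaultdict

# Val/test assignments as listed above; any other project goes to train.
SPLIT_BY_PROJECT = {
    "Customer-ENERGY-01": "val",
    "Customer-CHEM-01": "test",
    "Customer-FININST-01": "test",
}


def split_records(records):
    """Assign each record to a split based only on its project alias."""
    splits = defaultdict(list)
    for record in records:
        split_name = SPLIT_BY_PROJECT.get(record["project"], "train")
        splits[split_name].append(record)
    return splits


def assert_disjoint(splits):
    """Raise AssertionError if any project appears in more than one split."""
    owner_of = {}
    for split_name, rows in splits.items():
        for row in rows:
            owner = owner_of.setdefault(row["project"], split_name)
            assert owner == split_name, f"{row['project']} appears in {owner} and {split_name}"
```

Because the assignment depends only on the project alias, two artifacts from the same engagement can never end up on opposite sides of the train/eval boundary.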

Training procedure

| Setting | Value |
| --- | --- |
| Method | LoRA SFT (QLoRA-style: LoRA on 4-bit base) |
| Base model | mlx-community/Qwen2.5-14B-Instruct-4bit (training) → fused onto Qwen/Qwen2.5-14B-Instruct BF16 (release) |
| Framework | MLX-LM 0.31.3 |
| Hardware | Apple Silicon (M-series), Metal |
| Max sequence length | 8,192 |
| Batch size / grad accum | 1 / 16 (effective batch 16) |
| Iterations | 350 (~4 epochs over 1,359 train records) |
| Optimizer | AdamW, cosine decay, warmup 20, lr 1e-4 → 1e-6 |
| LoRA rank / scale / dropout | 32 / 20.0 / 0.0 |
| LoRA target keys | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Adapted layers | top 16 transformer layers |
| Prompt masking | Yes; loss computed only on assistant turns |
| Seed | 42 |
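
The "~4 epochs" figure and the learning-rate range can be checked with a little arithmetic. The sketch below reads each iteration as one optimizer step over the effective batch, which is the interpretation that matches the stated epoch count. The schedule shape (linear warmup from zero, then cosine decay to the final rate) is an assumption about how "cosine decay, warmup 20" was configured, not a copy of the training config.

```python
import math

TRAIN_RECORDS = 1359
ITERATIONS = 350
BATCH_SIZE, GRAD_ACCUM = 1, 16
WARMUP_STEPS = 20
PEAK_LR, FINAL_LR = 1e-4, 1e-6

effective_batch = BATCH_SIZE * GRAD_ACCUM  # 16 sequences per optimizer step
epochs = ITERATIONS * effective_batch / TRAIN_RECORDS  # 5600 / 1359, a little over 4


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (ITERATIONS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 20 and reaches its floor at step 350.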

Evaluation

Test-set evaluation on the project-disjoint test split (252 records from two customers never seen in training/val), 50 batches:

| Metric | Value |
| --- | --- |
| Test cross-entropy loss | 2.573 |
| Test perplexity | 13.107 |

Note: two test sequences exceed 2,048 tokens and are truncated by the MLX eval harness. The reported figure is therefore a slight upper bound on true loss. Full-length scoring is planned for v2.
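
The two reported metrics are related by perplexity = exp(loss), assuming the loss is the mean per-token cross-entropy in nats. A quick check shows that the rounded loss reproduces the reported perplexity up to rounding:

```python
import math

test_loss = 2.573  # reported mean per-token cross-entropy (assumed to be in nats)
reported_perplexity = 13.107

perplexity = math.exp(test_loss)             # roughly 13.1
implied_loss = math.log(reported_perplexity)  # roughly 2.573

print(f"perplexity from loss: {perplexity:.3f}")
print(f"loss implied by perplexity: {implied_loss:.4f}")
```

The small gap between the two directions comes from the loss being rounded to three decimals before publication.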

To reproduce or validate these results yourself — including a base-vs-marvy comparison and qualitative task probes — see VALIDATION.md and run validate.sh.


Limitations & known issues

  • Text-only sources. SOWs/SDDs/workbooks in .docx/.pptx/.pdf/.xlsx are not parsed in this build. Coverage of binary-only engagements is therefore thin.
  • Project concentration. ~95% of records come from ~12 data-rich projects; the long tail contributes a single case study each. Some task families (e.g. case_study, validation_critique) are smaller and may exhibit higher variance.
  • Synthetic instructions. User prompts are templated paraphrases (3–5 variants per task); assistant outputs are the original human-authored artifacts.
  • English-only. The corpus is English.
  • Not a replacement for a consultant. Output is first-draft, implementation-grade content that requires expert review before client delivery or production use.
  • No tool use / function calling fine-tune. marvy-14B is a text-completion specialist; agentic tool use is left to the orchestrator.
  • Hallucination risk on instance-specific facts. The model will confidently invent sys_ids, plugin IDs, and table fields if asked about specifics it has not seen. Always verify against an actual ServiceNow instance.
  • No safety fine-tune beyond the base. Inherits Qwen2.5-14B-Instruct safety behavior; no additional RLHF.
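
Because of the hallucination risk on instance-specific facts, it can help to pull every sys_id-shaped string out of a draft and check each one against a real instance before delivery. The sketch below relies on sys_ids being 32-character hexadecimal strings; `sys_ids_to_verify` is a hypothetical helper, and the lookup against your instance is left out.

```python
import re

# ServiceNow sys_ids are 32-character hexadecimal strings.
SYS_ID_RE = re.compile(r"\b[0-9a-f]{32}\b", re.IGNORECASE)


def sys_ids_to_verify(draft: str) -> list[str]:
    """Return the unique sys_id-shaped strings in a draft, in order of first appearance."""
    matches = (match.lower() for match in SYS_ID_RE.findall(draft))
    return list(dict.fromkeys(matches))  # dict preserves insertion order and drops repeats
```

Each returned value is only a candidate: a well-formed sys_id in the output says nothing about whether that record exists on your instance.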

License

Released under the Apache License 2.0 (see LICENSE).

This model is a derivative of Qwen2.5-14B-Instruct (Apache-2.0). See NOTICE for attribution.

Citation

@software{marvy_14b_2026,
  title  = {marvy-14B: A ServiceNow delivery lifecycle fine-tune of Qwen2.5-14B-Instruct},
  author = {MainStack},
  year   = {2026},
  url    = {https://huggingface.co/MainStack/marvy-14B},
  license= {Apache-2.0}
}

@misc{qwen2.5,
  title  = {Qwen2.5: A Party of Foundation Models},
  author = {Qwen Team},
  year   = {2024},
  url    = {https://qwenlm.github.io/blog/qwen2.5/}
}

Acknowledgements

  • Qwen team at Alibaba Cloud for the Qwen2.5 family.
  • Apple MLX team for mlx and mlx-lm, enabling native Apple Silicon training.
  • Hugging Face for hosting and the surrounding ecosystem.