---
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- servicenow
- itsm
- csdm
- itom
- delivery
- solution-design
- user-stories
- business-analysis
- qwen2.5
- lora
- sft
- mlx
model-index:
- name: marvy-1-14B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: custom
      name: ServiceNow Delivery SFT (project-disjoint test split)
    metrics:
    - type: perplexity
      value: 13.107
      name: Test perplexity
    - type: loss
      value: 2.573
      name: Test cross-entropy loss
---
# marvy-1-14B
**The first open, fine-tuned LLM for the full ServiceNow delivery lifecycle, from business analysis to validation.**

marvy-1-14B is an open-source language model fine-tuned for the complete ServiceNow delivery lifecycle: business analysis, requirements, stakeholder mapping, systems inventory, Solution Design Documents, user stories with acceptance criteria, implementation planning, test cases, and validation. Where general-purpose models treat ServiceNow as one topic among many, marvy is built to draft the actual artifacts a delivery team produces, in the structure and sequence real engagements follow. It is a first-draft specialist, not a consultant replacement, and it is not an agentic or tool-use fine-tune.

It was built by [MainStack](https://huggingface.co/MainStack), a consultancy specializing in ServiceNow Agentic Delivery. marvy is a LoRA SFT fine-tune of [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache-2.0), built from a corpus of 1,958 anonymized artifacts from real engagements (~887k tokens; 1,359 records in the training split), redacted until an automated leakage scanner reported zero residual emails, hostnames, or mapped real names. Its test perplexity of 13.107 was measured on a project- and customer-disjoint held-out split, so it reflects performance on unseen work rather than memorization of the training set.

> Released under **Apache-2.0**. Built with Qwen; see `NOTICE`.
## Why marvy-1-14B
- **Drafts the full lifecycle, not just snippets.** Business analysis through validation: the artifacts and sequence real delivery teams actually work in.
- **OOTB-first and implementation-grade.** Tuned to favor out-of-the-box correctness and produce drafts you can review, not rewrite.
- **Runs locally and privately.** Merged FP16, a LoRA adapter, and GGUF quants: run it on Apple Silicon via LM Studio or Ollama, with your engagement data never leaving your machine.
- **Trained on real, anonymized delivery work.** A corpus of 1,958 redacted engagement artifacts (~887k tokens; 1,359 in the training split), with zero residual emails, hostnames, or mapped real names reported by an automated leakage scanner.
- **Open and Apache-2.0.** Built on Qwen2.5-14B-Instruct: inspect it, fine-tune it, and deploy it on your own terms.
📖 **Full docs:** [`USAGE.md`](./USAGE.md) (every runtime + OpenCode wiring) ·
[`VALIDATION.md`](./VALIDATION.md) (prove the fine-tune works) ·
[`validate.sh`](./validate.sh) (one-command probe harness)

---
## Quick start
### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MainStack/marvy-1-14B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

SYSTEM = (
    "You are a senior ServiceNow delivery consultant. You produce precise, "
    "implementation-grade artifacts: business analyses, requirements, solution "
    "design documents, user stories with acceptance criteria, test cases, and "
    "validation reviews. You favor out-of-the-box capabilities, cite concrete "
    "tables/plugins/sys_ids when relevant, and write in clear professional English."
)
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Write a ServiceNow user story with acceptance criteria for SLA escalation on P1 incidents."},
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# do_sample=True makes the temperature/top_p settings apply explicitly.
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.4, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
### vLLM
```bash
pip install vllm
vllm serve MainStack/marvy-1-14B
```
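`vllm serve` exposes an OpenAI-compatible HTTP API. Below is a minimal client sketch using only the Python standard library, assuming the server listens on vLLM's default `http://localhost:8000`. `build_payload` and `chat` are illustrative helper names, not part of vLLM, and the sampling values are taken from the recommended settings for structured artifacts.

```python
import json
import urllib.request

# Placeholder: substitute the full recommended system prompt from this card.
SYSTEM = "You are a senior ServiceNow delivery consultant..."


def build_payload(user_prompt, temperature=0.4, top_p=0.9, max_tokens=1024):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "MainStack/marvy-1-14B",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def chat(user_prompt, base_url="http://localhost:8000"):
    """POST the request to a running vLLM server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Draft the Platform Architecture section of an ITSM SDD.")` should return the generated draft as a string.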
### Ollama (via GGUF)
Use the companion repo [`MainStack/marvy-1-14B-GGUF`](https://huggingface.co/MainStack/marvy-1-14B-GGUF):
```bash
ollama run hf.co/MainStack/marvy-1-14B-GGUF:Q4_K_M
```
### MLX (Apple Silicon native)
```bash
pip install mlx-lm
python -m mlx_lm generate --model MainStack/marvy-1-14B \
--system-prompt "You are a senior ServiceNow delivery consultant..." \
--prompt "Draft the Platform Architecture section of an ITSM SDD." \
--max-tokens 1024 --temp 0.4
```
### LoRA-only (apply on top of the base)
If you prefer a tiny adapter (~175 MB) on top of the BF16 base, see [`MainStack/marvy-1-14B-lora`](https://huggingface.co/MainStack/marvy-1-14B-lora).
---
## Intended use
marvy-1-14B is designed to produce implementation-grade first drafts across the ServiceNow delivery lifecycle, accelerating the artifacts a practitioner would otherwise write from scratch, then review and refine. It is built for solution architects, business analysts, technical consultants, and project managers. Typical tasks:
| Task family | What it produces |
|------------------------|---------------------------------------------------------------------------------|
| `business_analysis` | Structured BA reports from SOWs / discovery notes |
| `requirements_extraction` | Functional/non-functional requirements with acceptance bullets |
| `stakeholder_mapping` | RACI / influence-interest grids from raw notes |
| `systems_inventory` | CMDB-shaped systems inventories from architecture inputs |
| `sdd_design` | Solution Design Document sections (architecture, integrations, data model) |
| `story_authoring` | User stories with crisp acceptance criteria |
| `implementation_planning` | Story-level implementation plans citing tables/plugins |
| `test_case_generation` | Test cases per story, mapped to acceptance criteria |
| `validation_critique` | Gap analysis, follow-up questions, assumption checks against source docs |
| `delivery_chain` | Multi-turn: story → implementation → test, end-to-end |
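The `delivery_chain` family is multi-turn: each artifact is generated with the earlier ones still in context. A sketch of that loop follows, assuming a hypothetical `generate(messages) -> str` callable that wraps whichever runtime you use. The three prompts are illustrative and are not the templates used in training.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Illustrative prompts (assumed wording), one per stage of the chain.
CHAIN_PROMPTS = [
    ("story", "Write a user story with acceptance criteria for: {topic}"),
    ("implementation", "Write an implementation plan for the story above."),
    ("test", "Write test cases mapped to the acceptance criteria above."),
]


def run_delivery_chain(
    generate: Callable[[List[Message]], str], system: str, topic: str
) -> Dict[str, str]:
    """Run story -> implementation -> test, feeding each reply back as context."""
    messages: List[Message] = [{"role": "system", "content": system}]
    artifacts: Dict[str, str] = {}
    for name, template in CHAIN_PROMPTS:
        messages.append({"role": "user", "content": template.format(topic=topic)})
        reply = generate(messages)
        # Keep the assistant turn so the next stage can refer to it.
        messages.append({"role": "assistant", "content": reply})
        artifacts[name] = reply
    return artifacts
```

Keeping the full history in `messages` is what lets the test-case turn refer back to the acceptance criteria produced in the first turn.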
### Recommended system prompt
```
You are a senior ServiceNow delivery consultant. You produce precise, implementation-grade
artifacts: business analyses, requirements, solution design documents, user stories with
acceptance criteria, test cases, and validation reviews. You favor out-of-the-box
capabilities, cite concrete tables/plugins/sys_ids when relevant, and write in clear
professional English.
```
### Recommended generation settings
| Use case | temperature | top_p | max_new_tokens |
|-----------------------------|-------------|-------|----------------|
| Structured artifacts (SDD, stories) | 0.3 – 0.5 | 0.9 | 1024 – 4096 |
| Exploratory brainstorming | 0.7 – 0.9 | 0.95 | 1024 |
| Validation / critique | 0.2 – 0.4 | 0.9 | 1024 – 2048 |
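One way to carry these settings in code is a small preset table. The specific values below are points picked from inside the ranges above for illustration (an assumption, not tuned recommendations), and the preset names are hypothetical.

```python
# Hypothetical preset names; values are chosen from within the ranges in the table.
GENERATION_PRESETS = {
    "structured": {"temperature": 0.4, "top_p": 0.9, "max_new_tokens": 2048},
    "brainstorm": {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 1024},
    "critique": {"temperature": 0.3, "top_p": 0.9, "max_new_tokens": 1536},
}


def generation_kwargs(use_case: str) -> dict:
    """Return keyword arguments for `model.generate(**inputs, **kwargs)`."""
    preset = GENERATION_PRESETS[use_case]
    # do_sample=True is needed for temperature/top_p to apply in Transformers.
    return {"do_sample": True, **preset}
```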
---
## Training data
> **The training dataset is proprietary to MainStack and is not publicly
> released.** It is derived from confidential, anonymized client engagement
> artifacts. The statistics below describe the corpus for transparency; the data
> itself is not distributed with the model.
| Item | Value |
|---|---|
| Source | Anonymized real engagement artifacts (`.md`, `.csv`, `.json`, `.mmd`, `.txt`) |
| Availability | **Proprietary, not released** |
| Total records | **1,958** (after schema + exact-dedupe) |
| Estimated tokens | **~887k** |
| Splits (project-disjoint) | train 1,359 · val 347 · test 252 |
| Tasks | 11 task families (the 10 listed under Intended use, plus `case_study`) |
| Multi-turn share | `delivery_chain` (158 records) — story→implementation→test |
### Privacy & redaction
- All customer/partner names → stable aliases (e.g. `Customer-FIN-03`, `Customer-ENERGY-01`).
- Emails → `user@example.com`; hostnames → `instance.example.service-now.com`; IPs → RFC 5737 range; `key: value` secrets → `[REDACTED]`.
- Credential/login/VPN files excluded entirely; bulk CMDB dumps >1.5 MB excluded.
- ServiceNow `sys_id`s and table/plugin names preserved (instance-local, technically valuable, low risk).
- A leakage scanner asserts **0** residual emails, hostnames, or mapped real names in message content.
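The redaction and leakage-check steps above can be sketched with regular expressions. This is an illustration under assumptions, not MainStack's actual pipeline: the patterns, the secret key names, the single replacement address `192.0.2.1` (from the RFC 5737 documentation range), and the function names are all hypothetical, and alias mapping for customer names is omitted.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HOST = re.compile(r"\b(?:[a-z0-9-]+\.)+service-now\.com\b", re.IGNORECASE)
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumed list of secret-looking keys; a real pipeline would need a broader set.
SECRET = re.compile(r"(?im)^(\s*(?:password|token|api[_-]?key|secret)\s*:\s*)\S.*$")

SAFE_EMAIL = "user@example.com"
SAFE_HOST = "instance.example.service-now.com"


def redact(text: str) -> str:
    """Replace emails, instance hostnames, IPv4 addresses, and key: value secrets."""
    text = EMAIL.sub(SAFE_EMAIL, text)
    text = HOST.sub(SAFE_HOST, text)
    text = IPV4.sub("192.0.2.1", text)
    return SECRET.sub(r"\1[REDACTED]", text)


def find_leaks(text: str) -> list:
    """Return residual emails or hostnames that are not the safe placeholders."""
    leaks = [m for m in EMAIL.findall(text) if m != SAFE_EMAIL]
    leaks += [m for m in HOST.findall(text) if m.lower() != SAFE_HOST]
    return leaks
```

A scanner in this style would run `find_leaks` over every message in the corpus and fail the build unless the result is empty for all of them.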
### Split integrity
Train / val / test are split **by project**, so no customer appears in more than one split. The largest project is forced into `train` to keep eval honest:
- val projects: `Customer-ENERGY-01`
- test projects: `Customer-CHEM-01`, `Customer-FININST-01`
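A project-disjoint split of this kind can be sketched as follows, assuming each record carries its customer alias in a hypothetical `project` field. Every project not explicitly assigned to val or test falls into train, which is also how the largest project ends up there in this sketch.

```python
from collections import defaultdict


def split_by_project(records, val_projects, test_projects):
    """Assign every record to exactly one split based on its project alias."""
    splits = defaultdict(list)
    for rec in records:
        project = rec["project"]  # hypothetical field holding the customer alias
        if project in test_projects:
            splits["test"].append(rec)
        elif project in val_projects:
            splits["val"].append(rec)
        else:
            splits["train"].append(rec)
    return dict(splits)


def assert_disjoint(splits):
    """Fail if any project alias shows up in more than one split."""
    seen = {}
    for name, recs in splits.items():
        for rec in recs:
            owner = seen.setdefault(rec["project"], name)
            assert owner == name, f"{rec['project']} leaks from {owner} into {name}"
```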
---
## Training procedure
| Setting | Value |
|---|---|
| Method | LoRA SFT (QLoRA-style: LoRA on 4-bit base) |
| Base model | `mlx-community/Qwen2.5-14B-Instruct-4bit` (training) → fused onto `Qwen/Qwen2.5-14B-Instruct` BF16 (release) |
| Framework | [MLX-LM](https://github.com/ml-explore/mlx-lm) 0.31.3 |
| Hardware | Apple Silicon (M-series), Metal |
| Max sequence length | 8,192 |
| Batch size / grad accum | 1 / 16 (effective batch 16) |
| Iterations | 350 (~4 epochs over 1,359 train records) |
| Optimizer | AdamW, cosine decay, warmup 20, lr 1e-4 → 1e-6 |
| LoRA rank / scale / dropout | 32 / 20.0 / 0.0 |
| LoRA target keys | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Adapted layers | top 16 transformer layers |
| Prompt masking | yes, loss computed only on assistant turns |
| Seed | 42 |
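The "~4 epochs" figure and the learning-rate schedule can be checked with a few lines of arithmetic. This sketch assumes each of the 350 iterations consumes one effective batch of 16 records, which is the reading that reproduces the stated epoch count. The exact schedule shape (linear warmup over 20 steps, cosine decay over the remaining 330) is also an assumption, and `lr_at` is a hypothetical helper.

```python
import math

TRAIN_RECORDS = 1_359
ITERATIONS = 350
EFFECTIVE_BATCH = 1 * 16  # batch size x gradient accumulation steps

# Assumption: one iteration processes one effective batch of records.
epochs = ITERATIONS * EFFECTIVE_BATCH / TRAIN_RECORDS
print(f"{epochs:.2f} epochs")  # 4.12 epochs


def lr_at(step, peak=1e-4, floor=1e-6, warmup=20, total=ITERATIONS):
    """Linear warmup to `peak`, then cosine decay to `floor` (assumed shape)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```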
---
## Evaluation
### Fine-tuned vs. base: efficiency on the held-out test set
The cleanest measure of the fine-tune's value is to score the **same base
model twice** (plain vs. with the marvy adapter) on the **project-disjoint**
test split (252 records from two customers never seen in training/val), using
per-token cross-entropy/perplexity on the **assistant tokens only**
(prompt-masked, the same objective used in training). Lower perplexity = the
model assigns higher probability to the real, human-authored delivery artifact.
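Prompt-masked perplexity is the exponential of the mean negative log-likelihood over assistant tokens. A minimal sketch of that calculation, assuming you already have per-token natural-log probabilities and a 0/1 assistant mask (both names are hypothetical; this is not the benchmark script itself):

```python
import math


def masked_perplexity(token_logprobs, assistant_mask):
    """Perplexity over assistant tokens only.

    token_logprobs: natural-log probability the model gave each target token.
    assistant_mask: 1 for assistant tokens, 0 for prompt tokens (ignored).
    """
    kept = [lp for lp, m in zip(token_logprobs, assistant_mask) if m]
    mean_nll = -sum(kept) / len(kept)  # per-token cross-entropy in nats
    return math.exp(mean_nll)


# The same relation links the loss and perplexity in this card's metadata.
print(round(math.exp(2.573), 1))  # 13.1
```

The last line shows why the metadata's test loss of 2.573 and test perplexity of 13.107 describe the same measurement.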
![marvy-1-14B vs base: perplexity by task](./marvy_vs_base_ppl.png)
![How much fine-tuning improved each task](./marvy_improvement.png)
**Overall: perplexity 8.91 → 6.03, a 32.3% reduction** on unseen customers.
| Task | Base ppl | marvy-1-14B ppl | Improvement |
|---|---:|---:|---:|
| Systems inventory | 77.07 | 10.53 | **−86.3%** |
| Requirements extraction | 46.76 | 9.39 | **−79.9%** |
| Stakeholder mapping | 27.81 | 6.91 | **−75.2%** |
| Story authoring | 15.38 | 7.86 | **−48.9%** |
| Validation / critique | 9.72 | 8.23 | −15.3% |
| Business analysis | 7.14 | 6.66 | −6.6% |
| SDD design | 4.48 | 4.40 | −1.7% |
| **Overall** | **8.91** | **6.03** | **−32.3%** |
The gains are largest on **structured, format-heavy artifacts** (inventories,
requirements, stakeholder registers, stories) where the base model wanders from
the expected schema; they are smaller on long-form prose (SDD sections, business
analysis) where the base was already competent. This is the honest, expected
shape of a domain SFT.
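The Improvement column is the relative change in perplexity. A small sketch of that arithmetic, using values from the table (`ppl_reduction` is a hypothetical helper name):

```python
def ppl_reduction(base_ppl, tuned_ppl):
    """Relative perplexity change in percent (negative means improvement)."""
    return (tuned_ppl - base_ppl) / base_ppl * 100


print(f"{ppl_reduction(8.91, 6.03):.1f}%")    # -32.3%
print(f"{ppl_reduction(77.07, 10.53):.1f}%")  # -86.3%
```

Recomputing from the rounded table entries can differ from the printed column in the last digit for some rows (Business analysis, for example), presumably because the column was computed from unrounded perplexities.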
> Notes: the test customers (`Customer-CHEM-01`, `Customer-FININST-01`) appear in
> neither train nor val, so this reflects generalization, not memorization. The
> test split happens to cover 7 of the 11 task families. An earlier MLX
> batch-eval reported aggregate ppl ≈ 13.1 with 2,048-token truncation; the
> figures above recompute per-task with full assistant-token masking, so the
> base-vs-marvy **delta** is the result of interest.
Reproduce it yourself: `bash benchmark/run_benchmark.sh` (see
[`VALIDATION.md`](./VALIDATION.md) for qualitative probes too).
---
## Limitations & known issues
- **Text-only sources.** SOWs/SDDs/workbooks in `.docx/.pptx/.pdf/.xlsx` are not parsed in this build. Coverage of binary-only engagements is therefore thin.
- **Project concentration.** ~95% of records come from ~12 data-rich projects; the long tail contributes a single case study each. Some task families (e.g. `case_study`, `validation_critique`) are smaller and may exhibit higher variance.
- **Synthetic instructions.** User prompts are templated paraphrases (3–5 variants per task); assistant outputs are the original human-authored artifacts.
- **English-only.** The corpus is English.
- **Not a replacement for a consultant.** Output is first-draft, implementation-grade content that requires expert review before client delivery or production use.
- **No tool use / function calling fine-tune.** `marvy-1-14B` is a text-completion specialist; agentic tool use is left to the orchestrator.
- **Hallucination risk on instance-specific facts.** The model will confidently invent `sys_id`s, plugin IDs, and table fields if asked about specifics it has not seen. Always verify against an actual ServiceNow instance.
- **No safety fine-tune beyond the base.** Inherits Qwen2.5-14B-Instruct safety behavior; no additional RLHF.
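Because invented `sys_id`s look exactly like real ones, it can help to pull every candidate out of a draft and check each against the target instance before review. A sketch of the extraction step, relying on the fact that ServiceNow `sys_id`s are 32-character hexadecimal strings. The helper name is hypothetical, and the lookup against a live instance is left out.

```python
import re

# ServiceNow sys_ids are 32-character hexadecimal strings.
SYS_ID = re.compile(r"\b[0-9a-f]{32}\b", re.IGNORECASE)


def sys_ids_to_verify(model_output: str) -> list:
    """Return the unique sys_id-shaped strings in a draft, in order of appearance."""
    seen = []
    for match in SYS_ID.findall(model_output):
        candidate = match.lower()
        if candidate not in seen:
            seen.append(candidate)
    return seen
```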
---
## License
marvy-1-14B is **dual-licensed**; see [`LICENSING.md`](./LICENSING.md) for the full breakdown:
| Component | License |
|---|---|
| **Model weights** (safetensors / GGUF / LoRA) | **Apache-2.0** (`LICENSE`), inherited from the Qwen2.5-14B-Instruct base; free to use, fine-tune, and redistribute, with `NOTICE` retained. |
| **MainStack contributions** (model cards, docs, benchmark, charts, training methodology) | **CC-BY-4.0** (`LICENSE-CC-BY-4.0`); reuse requires attribution to MainStack. |
The model weights are a derivative of **Qwen2.5-14B-Instruct** (Apache-2.0).
Per Apache-2.0, the weights cannot be placed under a more restrictive license;
MainStack's protection is the CC-BY-4.0 license on our own authored materials
plus the mandatory `NOTICE` retention. See `NOTICE` for attribution.
## Attribution
`marvy-1-14B` is free to use, fine-tune, and redistribute under Apache-2.0.
**If you use marvy-1-14B as a baseline, fine-tune it, distill from it, evaluate
against it, or otherwise build on it, please credit MainStack** and link back to
this model:
> Built on / evaluated against **marvy-1-14B** by **MainStack**:
> https://huggingface.co/MainStack/marvy-1-14B
Concretely, we ask that derivatives and comparisons:
- keep the `NOTICE` file intact (this is **required** by Apache-2.0 §4),
- name `MainStack/marvy-1-14B` in the model card, paper, or README, and
- cite the entry below.
Per Apache-2.0, you must also continue to attribute the upstream base model
(Qwen2.5-14B-Instruct); see `NOTICE`.
## Citation
If you use marvy-1-14B (as a baseline, a starting point, or in evaluation),
please cite:
```bibtex
@software{marvy_1_14b_2026,
title = {marvy-1-14B: An open fine-tuned model for the full ServiceNow delivery lifecycle},
author = {MainStack},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/MainStack/marvy-1-14B},
note = {Fine-tune of Qwen2.5-14B-Instruct},
license = {Apache-2.0}
}
@misc{qwen2.5,
title = {Qwen2.5: A Party of Foundation Models},
author = {Qwen Team},
year = {2024},
url = {https://qwenlm.github.io/blog/qwen2.5/}
}
```
## Acknowledgements
- **Qwen team** at Alibaba Cloud for the Qwen2.5 family.
- **Apple MLX team** for `mlx` and `mlx-lm`, enabling native Apple Silicon training.
- **Hugging Face** for hosting and the surrounding ecosystem.