marvy-14B
The first open, fine-tuned LLM for the full ServiceNow delivery lifecycle — from business analysis to validation.
marvy-14B is an open-source language model fine-tuned for the complete ServiceNow delivery lifecycle: business analysis, requirements, stakeholder mapping, systems inventory, Solution Design Documents, user stories with acceptance criteria, implementation planning, test cases, and validation. Where general-purpose models treat ServiceNow as one topic among many, marvy is built to draft the actual artifacts a delivery team produces — in the structure and sequence real engagements follow. It is a first-draft specialist, not a consultant replacement, and it is not an agentic or tool-use fine-tune.
It was built by MainStack, a consultancy specializing in ServiceNow Agentic Delivery. marvy is a LoRA SFT fine-tune of Qwen2.5-14B-Instruct (Apache-2.0), built from a corpus of 1,958 anonymized artifacts from real engagements (~887k tokens, 1,359 of them in the training split), redacted until an automated leakage scanner reported zero residual PII. Its test perplexity of 13.107 was measured on a project- and customer-disjoint held-out split, so the figure reflects performance on unseen engagements rather than memorization of the training set.
Released under Apache-2.0. Built with Qwen (see NOTICE).
Why marvy-14B
- Drafts the full lifecycle, not just snippets. Business analysis through validation — the artifacts and sequence real delivery teams actually work in.
- OOTB-first and implementation-grade. Tuned to favor out-of-the-box correctness and produce drafts you can review, not rewrite.
- Runs locally and privately. Merged FP16, a LoRA adapter, and GGUF quants — run it on Apple Silicon via LM Studio or Ollama, with your engagement data never leaving your machine.
- Trained on real, anonymized delivery work. 1,958 redacted engagement artifacts (~887k tokens), with zero residual PII verified by an automated leakage scanner.
- Open and Apache-2.0. Built on Qwen2.5-14B-Instruct: inspect it, fine-tune it, and deploy it on your own terms.
📖 Full docs: USAGE.md (every runtime + OpenCode wiring) · VALIDATION.md (prove the fine-tune works) · validate.sh (one-command probe harness)
Quick start
Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MainStack/marvy-14B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Recommended system prompt (repeated under "Recommended system prompt").
SYSTEM = (
    "You are a senior ServiceNow delivery consultant. You produce precise, "
    "implementation-grade artifacts: business analyses, requirements, solution "
    "design documents, user stories with acceptance criteria, test cases, and "
    "validation reviews. You favor out-of-the-box capabilities, cite concrete "
    "tables/plugins/sys_ids when relevant, and write in clear professional English."
)
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Write a ServiceNow user story with acceptance criteria for SLA escalation on P1 incidents."},
]
# return_dict=True also returns the attention mask, which generate() expects.
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
# do_sample=True makes the temperature setting take effect explicitly.
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.4)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
vLLM
pip install vllm
vllm serve MainStack/marvy-14B
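Once `vllm serve` is running, the model can be queried over the OpenAI-compatible chat endpoint. The following is a minimal sketch using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; the helper names `build_payload` and `chat` are illustrative and not part of any library.

```python
import json
import urllib.request

MODEL_ID = "MainStack/marvy-14B"
# Assumption: vLLM's default host and port; adjust if you passed --host/--port.
BASE_URL = "http://localhost:8000/v1"


def build_payload(system_prompt, user_prompt, temperature=0.4, max_tokens=1024):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(payload, base_url=BASE_URL, timeout=300):
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat(build_payload("You are a senior ServiceNow delivery consultant.",
#                          "Draft a user story for SLA escalation on P1 incidents.")))
```

Passing the system prompt as an argument keeps the recommended prompt in one place, so the same helper works for any of the task families.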
Ollama (via GGUF)
Use the companion repo MainStack/marvy-14B-GGUF:
ollama run hf.co/MainStack/marvy-14B-GGUF:Q4_K_M
MLX (Apple Silicon native)
pip install mlx-lm
python -m mlx_lm generate --model MainStack/marvy-14B \
--system-prompt "You are a senior ServiceNow delivery consultant..." \
--prompt "Draft the Platform Architecture section of an ITSM SDD." \
--max-tokens 1024 --temp 0.4
LoRA-only (apply on top of the base)
If you prefer a tiny adapter (~175 MB) on top of the BF16 base, see MainStack/marvy-14B-lora.
Intended use
marvy-14B is designed to produce implementation-grade first drafts across the ServiceNow delivery lifecycle — accelerating the artifacts a practitioner would otherwise write from scratch, then review and refine. Built for solution architects, business analysts, technical consultants, and project managers. Typical tasks:
| Task family | What it produces |
|---|---|
| `business_analysis` | Structured BA reports from SOWs / discovery notes |
| `requirements_extraction` | Functional/non-functional requirements with acceptance bullets |
| `stakeholder_mapping` | RACI / influence-interest grids from raw notes |
| `systems_inventory` | CMDB-shaped systems inventories from architecture inputs |
| `sdd_design` | Solution Design Document sections (architecture, integrations, data model) |
| `story_authoring` | User stories with crisp acceptance criteria |
| `implementation_planning` | Story-level implementation plans citing tables/plugins |
| `test_case_generation` | Test cases per story, mapped to acceptance criteria |
| `validation_critique` | Gap analysis, follow-up questions, assumption checks against source docs |
| `delivery_chain` | Multi-turn: story → implementation → test, end-to-end |
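The `delivery_chain` row describes a multi-turn flow in which each artifact builds on the previous one. One way an orchestrator might drive it is sketched below. `generate_fn` is a hypothetical callable that wraps whichever runtime you use and returns the assistant reply for a message list. The prompt wording is illustrative, since the training templates are not published in this card.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

# Illustrative prompts only; the templates used in training are not published here.
CHAIN_PROMPTS = [
    "Write a user story with acceptance criteria for: {request}",
    "Write the implementation plan for that story, citing tables and plugins.",
    "Write test cases for that story, mapped to its acceptance criteria.",
]


def run_delivery_chain(
    request: str,
    system_prompt: str,
    generate_fn: Callable[[List[Message]], str],
) -> List[Message]:
    """Run story -> implementation -> test as one growing conversation."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    for template in CHAIN_PROMPTS:
        messages.append({"role": "user", "content": template.format(request=request)})
        # Each turn sees every earlier artifact, so later drafts can refer back.
        reply = generate_fn(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages
```

The returned list holds the system prompt followed by three user/assistant pairs, so each draft can be pulled out for review before the next stage is accepted.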
Recommended system prompt
You are a senior ServiceNow delivery consultant. You produce precise, implementation-grade
artifacts: business analyses, requirements, solution design documents, user stories with
acceptance criteria, test cases, and validation reviews. You favor out-of-the-box
capabilities, cite concrete tables/plugins/sys_ids when relevant, and write in clear
professional English.
Recommended generation settings
| Use case | temperature | top_p | max_new_tokens |
|---|---|---|---|
| Structured artifacts (SDD, stories) | 0.3 – 0.5 | 0.9 | 1024 – 4096 |
| Exploratory brainstorming | 0.7 – 0.9 | 0.95 | 1024 |
| Validation / critique | 0.2 – 0.4 | 0.9 | 1024 – 2048 |
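The ranges in the table can be kept in code so that callers cannot drift outside them. The sketch below is one possible encoding. The preset keys (`structured`, `brainstorm`, `critique`) and the defaults (midpoint temperature, smallest token budget) are assumptions made for illustration. The returned keys follow the keyword names of `generate` in Transformers; other runtimes use different names (for example `max_tokens`).

```python
# Ranges copied from the table above.
PRESETS = {
    "structured": {"temperature": (0.3, 0.5), "top_p": 0.9, "max_new_tokens": (1024, 4096)},
    "brainstorm": {"temperature": (0.7, 0.9), "top_p": 0.95, "max_new_tokens": (1024, 1024)},
    "critique": {"temperature": (0.2, 0.4), "top_p": 0.9, "max_new_tokens": (1024, 2048)},
}


def generation_kwargs(use_case, temperature=None, max_new_tokens=None):
    """Return sampling kwargs, clamping any override into the recommended range."""
    preset = PRESETS[use_case]
    t_lo, t_hi = preset["temperature"]
    n_lo, n_hi = preset["max_new_tokens"]
    # Assumption: default to the midpoint temperature and the smallest token budget.
    if temperature is None:
        temperature = (t_lo + t_hi) / 2
    if max_new_tokens is None:
        max_new_tokens = n_lo
    return {
        "do_sample": True,
        "temperature": min(max(temperature, t_lo), t_hi),
        "top_p": preset["top_p"],
        "max_new_tokens": min(max(max_new_tokens, n_lo), n_hi),
    }
```

With a loaded model this would be used as `model.generate(**inputs, **generation_kwargs("structured"))`.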
Training data
| Item | Value |
|---|---|
| Source | Anonymized real engagement artifacts (.md, .csv, .json, .mmd, .txt) |
| Total records | 1,958 (after schema + exact-dedupe) |
| Estimated tokens | ~887k |
| Splits (project-disjoint) | train 1,359 · val 347 · test 252 |
| Tasks | 11 task families (the 10 listed under Intended use, plus `case_study`) |
| Multi-turn share | delivery_chain (158 records) — story→implementation→test |
Privacy & redaction
- All customer/partner names → stable aliases (e.g. `Customer-FIN-03`, `Customer-ENERGY-01`).
- Emails → `user@example.com`; hostnames → `instance.example.service-now.com`; IPs → RFC 5737 range; `key: value` secrets → `[REDACTED]`.
- Credential/login/VPN files excluded entirely; bulk CMDB dumps >1.5 MB excluded.
- ServiceNow `sys_id`s and table/plugin names preserved (instance-local, technically valuable, low risk).
- A leakage scanner asserts 0 residual emails, hostnames, or mapped real names in message content.
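The rules above can be illustrated with a small redaction pass followed by a leakage check. This is a sketch under stated assumptions, not the pipeline used to build the corpus: the regular expressions, the list of secret key names, and the function names `redact` and `leakage_findings` are all hypothetical. The replacement IP sits in `192.0.2.0/24`, one of the documentation ranges reserved by RFC 5737.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HOST_RE = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.service-now\.com\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Assumption: an illustrative set of key names treated as secrets.
SECRET_RE = re.compile(r"(?i)\b(password|token|api[_-]?key|secret)(\s*:\s*)\S+")

SAFE_EMAIL = "user@example.com"
SAFE_HOST = "instance.example.service-now.com"
SAFE_IP = "192.0.2.1"  # inside 192.0.2.0/24 (TEST-NET-1, RFC 5737)


def redact(text, alias_map):
    """Apply the replacement rules; alias_map maps real names to stable aliases."""
    text = SECRET_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    text = EMAIL_RE.sub(SAFE_EMAIL, text)
    text = HOST_RE.sub(SAFE_HOST, text)
    text = IPV4_RE.sub(SAFE_IP, text)
    for real_name, alias in alias_map.items():
        text = re.sub(re.escape(real_name), lambda _m: alias, text, flags=re.IGNORECASE)
    return text


def leakage_findings(text, alias_map):
    """Return residual emails, hostnames, and mapped real names (empty means clean)."""
    findings = [e for e in EMAIL_RE.findall(text) if e != SAFE_EMAIL]
    findings += [h for h in HOST_RE.findall(text) if h != SAFE_HOST]
    findings += [n for n in alias_map if re.search(re.escape(n), text, re.IGNORECASE)]
    return findings
```

Running the scanner as a separate pass over the final message content, rather than trusting the redaction step, is what allows an assertion of zero residual findings.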
Split integrity
Train / val / test are split by project, so no customer appears in more than one split. The largest project is forced into train to keep eval honest:
- val projects: `Customer-ENERGY-01`
- test projects: `Customer-CHEM-01`, `Customer-FININST-01`
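A project-disjoint split of this kind can be sketched as follows. The record layout (a dict with a `project` key holding the alias) and both function names are assumptions for illustration; the card does not publish the actual split script.

```python
from collections import Counter


def split_by_project(records, val_projects, test_projects):
    """Assign every record to a split based only on its project alias."""
    val_projects, test_projects = set(val_projects), set(test_projects)
    if val_projects & test_projects:
        raise ValueError("a project cannot be in both val and test")
    # The largest project always goes to train, as described above.
    counts = Counter(r["project"] for r in records)
    largest = counts.most_common(1)[0][0] if counts else None
    if largest in val_projects | test_projects:
        raise ValueError(f"largest project {largest!r} must stay in train")

    splits = {"train": [], "val": [], "test": []}
    for record in records:
        if record["project"] in val_projects:
            splits["val"].append(record)
        elif record["project"] in test_projects:
            splits["test"].append(record)
        else:
            splits["train"].append(record)
    return splits


def assert_disjoint(splits):
    """Fail if any project alias shows up in more than one split."""
    seen = {}
    for name, rows in splits.items():
        for project in {r["project"] for r in rows}:
            if project in seen:
                raise AssertionError(f"{project} is in both {seen[project]} and {name}")
            seen[project] = name
```

Splitting on the project alias rather than on individual records is what keeps near-duplicate artifacts from the same engagement out of the evaluation data.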
Training procedure
| Setting | Value |
|---|---|
| Method | LoRA SFT (QLoRA-style: LoRA on 4-bit base) |
| Base model | mlx-community/Qwen2.5-14B-Instruct-4bit (training) → fused onto Qwen/Qwen2.5-14B-Instruct BF16 (release) |
| Framework | MLX-LM 0.31.3 |
| Hardware | Apple Silicon (M-series), Metal |
| Max sequence length | 8,192 |
| Batch size / grad accum | 1 / 16 (effective batch 16) |
| Iterations | 350 (~4 epochs over 1,359 train records) |
| Optimizer | AdamW, cosine decay, warmup 20, lr 1e-4 → 1e-6 |
| LoRA rank / scale / dropout | 32 / 20.0 / 0.0 |
| LoRA target keys | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Adapted layers | top 16 transformer layers |
| Prompt masking | yes — loss computed only on assistant turns |
| Seed | 42 |
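The batch and epoch figures in the table follow from simple arithmetic, shown below together with a sketch of the learning-rate schedule. The schedule shape (linear warmup over 20 steps, then cosine decay from 1e-4 to 1e-6 over the remaining iterations) is an assumption about how the listed values combine, not a transcription of the MLX-LM implementation.

```python
import math

TRAIN_RECORDS = 1_359
BATCH_SIZE, GRAD_ACCUM, ITERATIONS = 1, 16, 350
WARMUP, LR_PEAK, LR_END = 20, 1e-4, 1e-6

effective_batch = BATCH_SIZE * GRAD_ACCUM        # 16
samples_seen = ITERATIONS * effective_batch      # 5,600
epochs = samples_seen / TRAIN_RECORDS            # about 4.12


def learning_rate(step):
    """Assumed schedule: linear warmup to LR_PEAK, then cosine decay to LR_END."""
    if step < WARMUP:
        return LR_PEAK * (step + 1) / WARMUP
    progress = (step - WARMUP) / (ITERATIONS - WARMUP)
    return LR_END + 0.5 * (LR_PEAK - LR_END) * (1 + math.cos(math.pi * progress))


print(f"effective batch {effective_batch}, ~{epochs:.2f} epochs")
```

The epoch count only works out to roughly four if each of the 350 iterations is an optimizer step over 16 accumulated samples, which is how the table's "effective batch 16" is read here.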
Evaluation
Test-set evaluation on the project-disjoint test split (252 records from two customers never seen in training/val), 50 batches:
| Metric | Value |
|---|---|
| Test cross-entropy loss | 2.573 |
| Test perplexity | 13.107 |
Note: two test sequences exceed 2,048 tokens and are truncated by the MLX eval harness. The reported figure is therefore a slight upper bound on true loss. Full-length scoring is planned for v2.
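The two reported metrics are related by perplexity = exp(cross-entropy loss), with the loss taken as a per-token mean in nats. A quick check, using only the values from the table:

```python
import math

test_loss = 2.573                      # mean cross-entropy per token, in nats
perplexity = math.exp(test_loss)
print(f"perplexity ~ {perplexity:.3f}")

# Inverse direction: the loss implied by the reported perplexity.
implied_loss = math.log(13.107)
print(f"implied loss ~ {implied_loss:.4f}")
```

The result matches the reported 13.107 to within the rounding of the three-decimal loss, which confirms that the two figures describe the same measurement.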
To reproduce or validate these results yourself — including a base-vs-marvy
comparison and qualitative task probes — see VALIDATION.md
and run validate.sh.
Limitations & known issues
- Text-only sources. SOWs/SDDs/workbooks in `.docx`/`.pptx`/`.pdf`/`.xlsx` are not parsed in this build. Coverage of binary-only engagements is therefore thin.
- Project concentration. ~95% of records come from ~12 data-rich projects; the long tail contributes a single case study each. Some task families (e.g. `case_study`, `validation_critique`) are smaller and may exhibit higher variance.
- Synthetic instructions. User prompts are templated paraphrases (3–5 variants per task); assistant outputs are the original human-authored artifacts.
- English-only. The corpus is English.
- Not a replacement for a consultant. Output is first-draft, implementation-grade content that requires expert review before client delivery or production use.
- No tool use / function calling fine-tune. `marvy-14B` is a text-completion specialist; agentic tool use is left to the orchestrator.
- Hallucination risk on instance-specific facts. The model will confidently invent `sys_id`s, plugin IDs, and table fields if asked about specifics it has not seen. Always verify against an actual ServiceNow instance.
- No safety fine-tune beyond the base. Inherits Qwen2.5-14B-Instruct safety behavior; no additional RLHF.
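Because invented `sys_id`s are a known failure mode, it can help to pull every `sys_id`-shaped string out of a draft and check each one against the target instance before the draft is used. The sketch below only builds that checklist; the function name is hypothetical, and the lookup against a real instance is left to the reviewer. It relies on a `sys_id` being a 32-character hexadecimal string.

```python
import re

# A ServiceNow sys_id is a 32-character hexadecimal string.
SYS_ID_RE = re.compile(r"\b[0-9a-fA-F]{32}\b")


def sys_ids_to_verify(draft):
    """Return the unique sys_id-shaped strings in a draft, in first-seen order."""
    seen = []
    for match in SYS_ID_RE.findall(draft):
        value = match.lower()
        if value not in seen:
            seen.append(value)
    return seen
```

Anything the function returns should be treated as unverified until it has been looked up in the instance the draft is written for.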
License
Released under the Apache License 2.0 (see LICENSE).
This model is a derivative of Qwen2.5-14B-Instruct (Apache-2.0). See NOTICE for attribution.
Citation
@software{marvy_14b_2026,
  title   = {marvy-14B: A ServiceNow delivery lifecycle fine-tune of Qwen2.5-14B-Instruct},
  author  = {MainStack},
  year    = {2026},
  url     = {https://huggingface.co/MainStack/marvy-14B},
  license = {Apache-2.0}
}
@misc{qwen2.5,
  title  = {Qwen2.5: A Party of Foundation Models},
  author = {Qwen Team},
  year   = {2024},
  url    = {https://qwenlm.github.io/blog/qwen2.5/}
}
Acknowledgements
- Qwen team at Alibaba Cloud for the Qwen2.5 family.
- Apple MLX team for `mlx` and `mlx-lm`, enabling native Apple Silicon training.
- Hugging Face for hosting and the surrounding ecosystem.