tgetsov committed · Commit be43504 · verified · 1 Parent(s): aa1b9ff

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +263 -0
README.md ADDED
@@ -0,0 +1,263 @@
---
license: apache-2.0
base_model: Qwen/Qwen2.5-14B-Instruct
base_model_relation: finetune
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- servicenow
- itsm
- csdm
- itom
- delivery
- solution-design
- user-stories
- business-analysis
- qwen2.5
- lora
- sft
- mlx
model-index:
- name: marvy-14B
  results:
  - task:
      type: text-generation
      name: ServiceNow Delivery SFT (project-disjoint test split)
    metrics:
    - type: perplexity
      value: 13.107
      name: Test perplexity
    - type: loss
      value: 2.573
      name: Test cross-entropy loss
---

# marvy-14B

**The first open, fine-tuned LLM for the full ServiceNow delivery lifecycle, from business analysis to validation.**

marvy-14B is an open-source language model fine-tuned for the complete ServiceNow delivery lifecycle: business analysis, requirements, stakeholder mapping, systems inventory, Solution Design Documents, user stories with acceptance criteria, implementation planning, test cases, and validation. Where general-purpose models treat ServiceNow as one topic among many, marvy is built to draft the actual artifacts a delivery team produces, in the structure and sequence real engagements follow. It is a first-draft specialist, not a consultant replacement, and it is not an agentic or tool-use fine-tune.

It was built by [MainStack](https://huggingface.co/MainStack), a consultancy specializing in ServiceNow Agentic Delivery. marvy is a LoRA SFT fine-tune of [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache-2.0), built from a corpus of 1,958 anonymized artifacts from real engagements (~887k tokens), of which 1,359 form the training split. An automated leakage scanner reports zero residual emails, hostnames, or mapped real names in the redacted content. The test perplexity of 13.107 was measured on a project- and customer-disjoint held-out split, so the figure reflects engagements the model never saw during training rather than memorized examples.

> Released under **Apache-2.0**. Built with Qwen; see `NOTICE`.

## Why marvy-14B

- **Drafts the full lifecycle, not just snippets.** Business analysis through validation: the artifacts and sequence real delivery teams actually work in.
- **OOTB-first and implementation-grade.** Tuned to favor out-of-the-box correctness and produce drafts you can review, not rewrite.
- **Runs locally and privately.** Merged FP16, a LoRA adapter, and GGUF quants: run it on Apple Silicon via LM Studio or Ollama, with your engagement data never leaving your machine.
- **Trained on real, anonymized delivery work.** 1,958 redacted engagement artifacts (~887k tokens), with zero residual emails, hostnames, or mapped real names per an automated leakage scanner.
- **Open and Apache-2.0.** Built on Qwen2.5-14B-Instruct: inspect it, fine-tune it, and deploy it on your own terms.

📖 **Full docs:** [`USAGE.md`](./USAGE.md) (every runtime + OpenCode wiring) ·
[`VALIDATION.md`](./VALIDATION.md) (prove the fine-tune works) ·
[`validate.sh`](./validate.sh) (one-command probe harness)

---

## Quick start

### Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MainStack/marvy-14B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

SYSTEM = (
    "You are a senior ServiceNow delivery consultant. You produce precise, "
    "implementation-grade artifacts: business analyses, requirements, solution "
    "design documents, user stories with acceptance criteria, test cases, and "
    "validation reviews. You favor out-of-the-box capabilities, cite concrete "
    "tables/plugins/sys_ids when relevant, and write in clear professional English."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Write a ServiceNow user story with acceptance criteria for SLA escalation on P1 incidents."},
]
# return_dict=True returns input_ids together with the attention mask.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# do_sample=True ensures the temperature setting is actually applied.
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.4)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

### vLLM

```bash
pip install vllm
vllm serve MainStack/marvy-14B
```
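
Once `vllm serve` is running it exposes an OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000`; `build_payload` and `chat` are illustrative helper names, not part of vLLM or this repository.

```python
import json
import urllib.request

# Assumption: `vllm serve` is running locally on its default port.
BASE_URL = "http://localhost:8000/v1"

SYSTEM = (
    "You are a senior ServiceNow delivery consultant. You produce precise, "
    "implementation-grade artifacts: business analyses, requirements, solution "
    "design documents, user stories with acceptance criteria, test cases, and "
    "validation reviews. You favor out-of-the-box capabilities, cite concrete "
    "tables/plugins/sys_ids when relevant, and write in clear professional English."
)


def build_payload(user_prompt, temperature=0.4, top_p=0.9, max_tokens=1024):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": "MainStack/marvy-14B",
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }


def chat(user_prompt, **kwargs):
    """POST the request to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Draft the Platform Architecture section of an ITSM SDD."))` would print the generated draft.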

### Ollama (via GGUF)

Use the companion repo [`MainStack/marvy-14B-GGUF`](https://huggingface.co/MainStack/marvy-14B-GGUF):

```bash
ollama run hf.co/MainStack/marvy-14B-GGUF:Q4_K_M
```

### MLX (Apple Silicon native)

```bash
pip install mlx-lm
python -m mlx_lm generate --model MainStack/marvy-14B \
  --system-prompt "You are a senior ServiceNow delivery consultant..." \
  --prompt "Draft the Platform Architecture section of an ITSM SDD." \
  --max-tokens 1024 --temp 0.4
```

### LoRA-only (apply on top of the base)

If you prefer a tiny adapter (~175 MB) on top of the BF16 base, see [`MainStack/marvy-14B-lora`](https://huggingface.co/MainStack/marvy-14B-lora).

---

## Intended use

marvy-14B is designed to produce implementation-grade first drafts across the ServiceNow delivery lifecycle, accelerating the artifacts a practitioner would otherwise write from scratch, then review and refine. Built for solution architects, business analysts, technical consultants, and project managers. Typical tasks:

| Task family | What it produces |
|---|---|
| `business_analysis` | Structured BA reports from SOWs / discovery notes |
| `requirements_extraction` | Functional/non-functional requirements with acceptance bullets |
| `stakeholder_mapping` | RACI / influence-interest grids from raw notes |
| `systems_inventory` | CMDB-shaped systems inventories from architecture inputs |
| `sdd_design` | Solution Design Document sections (architecture, integrations, data model) |
| `story_authoring` | User stories with crisp acceptance criteria |
| `implementation_planning` | Story-level implementation plans citing tables/plugins |
| `test_case_generation` | Test cases per story, mapped to acceptance criteria |
| `validation_critique` | Gap analysis, follow-up questions, assumption checks against source docs |
| `delivery_chain` | Multi-turn: story → implementation → test, end-to-end |

### Recommended system prompt

```
You are a senior ServiceNow delivery consultant. You produce precise, implementation-grade
artifacts: business analyses, requirements, solution design documents, user stories with
acceptance criteria, test cases, and validation reviews. You favor out-of-the-box
capabilities, cite concrete tables/plugins/sys_ids when relevant, and write in clear
professional English.
```

### Recommended generation settings

| Use case | temperature | top_p | max_new_tokens |
|---|---|---|---|
| Structured artifacts (SDD, stories) | 0.3 – 0.5 | 0.9 | 1024 – 4096 |
| Exploratory brainstorming | 0.7 – 0.9 | 0.95 | 1024 |
| Validation / critique | 0.2 – 0.4 | 0.9 | 1024 – 2048 |
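
One way to apply the table above is to keep the settings as named presets and pass them to `generate`. This is a sketch only: the preset names and the helper `generation_kwargs` are hypothetical, and each value is an arbitrary pick from inside the recommended range rather than a tuned setting.

```python
# Assumption: each value is a point chosen from inside the ranges in the table.
GENERATION_PRESETS = {
    "structured": {"temperature": 0.4, "top_p": 0.9, "max_new_tokens": 2048},
    "brainstorm": {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 1024},
    "critique": {"temperature": 0.3, "top_p": 0.9, "max_new_tokens": 1536},
}


def generation_kwargs(use_case):
    """Return keyword arguments for `model.generate` for a named use case."""
    if use_case not in GENERATION_PRESETS:
        known = ", ".join(sorted(GENERATION_PRESETS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {known}")
    # Sampling must be enabled for temperature and top_p to have any effect.
    return {"do_sample": True, **GENERATION_PRESETS[use_case]}
```

With the Transformers quick start this would be called as `model.generate(**inputs, **generation_kwargs("critique"))`.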

---

## Training data

| Item | Value |
|---|---|
| Source | Anonymized real engagement artifacts (`.md`, `.csv`, `.json`, `.mmd`, `.txt`) |
| Total records | **1,958** (after schema + exact-dedupe) |
| Estimated tokens | **~887k** |
| Splits (project-disjoint) | train 1,359 · val 347 · test 252 |
| Tasks | 11 task families (the 10 listed under Intended use, plus `case_study`) |
| Multi-turn share | `delivery_chain` (158 records): story → implementation → test |

### Privacy & redaction

- All customer/partner names → stable aliases (e.g. `Customer-FIN-03`, `Customer-ENERGY-01`).
- Emails → `user@example.com`; hostnames → `instance.example.service-now.com`; IPs → RFC 5737 range; `key: value` secrets → `[REDACTED]`.
- Credential/login/VPN files excluded entirely; bulk CMDB dumps >1.5 MB excluded.
- ServiceNow `sys_id`s and table/plugin names preserved (instance-local, technically valuable, low risk).
- A leakage scanner asserts **0** residual emails, hostnames, or mapped real names in message content.
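
The redaction rules and the leakage check above can be sketched roughly as follows. This is not the scanner used to build the corpus: the regular expressions, the function names `redact` and `find_leaks`, and the single fixed replacement IP are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; a production scanner would need broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HOST_RE = re.compile(r"\b(?:[A-Za-z0-9-]+\.)+service-now\.com\b")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

SAFE_EMAIL = "user@example.com"
SAFE_HOST = "instance.example.service-now.com"
SAFE_IP = "192.0.2.1"  # inside 192.0.2.0/24, one of the RFC 5737 documentation ranges


def redact(text, aliases):
    """Replace real names with stable aliases, then mask emails, hosts and IPs."""
    for real_name, alias in aliases.items():
        text = text.replace(real_name, alias)
    text = EMAIL_RE.sub(SAFE_EMAIL, text)
    text = HOST_RE.sub(SAFE_HOST, text)
    return IPV4_RE.sub(SAFE_IP, text)


def find_leaks(text, real_names=()):
    """Return every residual email, hostname, or mapped real name found in text."""
    leaks = [m for m in EMAIL_RE.findall(text) if m != SAFE_EMAIL]
    leaks += [m for m in HOST_RE.findall(text) if m != SAFE_HOST]
    lowered = text.lower()
    leaks += [name for name in real_names if name.lower() in lowered]
    return leaks
```

A build step could then refuse to emit any record for which `find_leaks` returns a non-empty list.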

### Split integrity

Train / val / test are split **by project**, so no customer appears in more than one split. The largest project is forced into `train` to keep eval honest:

- val projects: `Customer-ENERGY-01`
- test projects: `Customer-CHEM-01`, `Customer-FININST-01`
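
A project-disjoint split of this kind can be sketched as below. The record layout (a dict with a `project` key) and both function names are assumptions for illustration; the actual build script is not part of this README.

```python
from collections import defaultdict


def split_by_project(records, val_projects, test_projects):
    """Assign whole projects to a split so no project straddles two splits."""
    splits = defaultdict(list)
    for record in records:
        project = record["project"]  # assumed field name
        if project in test_projects:
            splits["test"].append(record)
        elif project in val_projects:
            splits["val"].append(record)
        else:
            splits["train"].append(record)
    return dict(splits)


def assert_disjoint(splits):
    """Raise AssertionError if any project appears in more than one split."""
    owner = {}
    for split_name, records in splits.items():
        for record in records:
            first = owner.setdefault(record["project"], split_name)
            if first != split_name:
                raise AssertionError(
                    f"{record['project']} appears in both {first} and {split_name}"
                )
```

Calling `assert_disjoint(split_by_project(records, {"Customer-ENERGY-01"}, {"Customer-CHEM-01", "Customer-FININST-01"}))` mirrors the assignment listed above.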

---

## Training procedure

| Setting | Value |
|---|---|
| Method | LoRA SFT (QLoRA-style: LoRA on 4-bit base) |
| Base model | `mlx-community/Qwen2.5-14B-Instruct-4bit` (training) → fused onto `Qwen/Qwen2.5-14B-Instruct` BF16 (release) |
| Framework | [MLX-LM](https://github.com/ml-explore/mlx-lm) 0.31.3 |
| Hardware | Apple Silicon (M-series), Metal |
| Max sequence length | 8,192 |
| Batch size / grad accum | 1 / 16 (effective batch 16) |
| Iterations | 350 (~4 epochs over 1,359 train records) |
| Optimizer | AdamW, cosine decay, warmup 20, lr 1e-4 → 1e-6 |
| LoRA rank / scale / dropout | 32 / 20.0 / 0.0 |
| LoRA target keys | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Adapted layers | top 16 transformer layers |
| Prompt masking | yes (loss computed only on assistant turns) |
| Seed | 42 |
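
The epoch count and learning-rate schedule in the table can be checked with a few lines of arithmetic. The sketch below assumes each iteration consumes one effective batch of 16 records and that warmup is linear; the table does not state the warmup shape, so that part is a guess, and the function names are illustrative.

```python
import math

TRAIN_RECORDS = 1359
ITERATIONS = 350
BATCH_SIZE, GRAD_ACCUM = 1, 16
WARMUP_STEPS = 20
LR_PEAK, LR_END = 1e-4, 1e-6


def epochs_seen():
    """Passes over the training split, assuming one effective batch per iteration."""
    return ITERATIONS * BATCH_SIZE * GRAD_ACCUM / TRAIN_RECORDS


def learning_rate(step):
    """Linear warmup to LR_PEAK (assumed), then cosine decay down to LR_END."""
    if step < WARMUP_STEPS:
        return LR_PEAK * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, ITERATIONS - WARMUP_STEPS)
    return LR_END + 0.5 * (LR_PEAK - LR_END) * (1 + math.cos(math.pi * progress))
```

Under these assumptions `epochs_seen()` is 5,600 / 1,359, a little over 4, which matches the "~4 epochs" stated in the table.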

---

## Evaluation

Test-set evaluation on the **project-disjoint** test split (252 records from two customers never seen in training/val), 50 batches:

| Metric | Value |
|---|---|
| Test cross-entropy loss | **2.573** |
| Test perplexity | **13.107** |

> Note: two test sequences exceed 2,048 tokens and are truncated by the MLX eval harness. The reported figure is therefore a slight upper bound on true loss. Full-length scoring is planned for v2.
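
The two metrics in the table are related: perplexity is the exponential of the mean per-token cross-entropy, assuming the loss is reported in nats. A short sketch with a hypothetical helper name:

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)


reported_loss = 2.573
print(f"perplexity from rounded loss: {perplexity(reported_loss):.3f}")
```

Because the loss in the table is rounded to three decimals, the value computed this way can differ from the reported 13.107 in the last digits while remaining consistent with it.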

To reproduce or validate these results yourself, including a base-vs-marvy comparison and qualitative task probes, see [`VALIDATION.md`](./VALIDATION.md) and run [`validate.sh`](./validate.sh).

---

## Limitations & known issues

- **Text-only sources.** SOWs/SDDs/workbooks in `.docx/.pptx/.pdf/.xlsx` are not parsed in this build. Coverage of binary-only engagements is therefore thin.
- **Project concentration.** ~95% of records come from ~12 data-rich projects; the long tail contributes a single case study each. Some task families (e.g. `case_study`, `validation_critique`) are smaller and may exhibit higher variance.
- **Synthetic instructions.** User prompts are templated paraphrases (3–5 variants per task); assistant outputs are the original human-authored artifacts.
- **English-only.** The corpus is English.
- **Not a replacement for a consultant.** Output is first-draft, implementation-grade content that requires expert review before client delivery or production use.
- **No tool use / function calling fine-tune.** `marvy-14B` is a text-completion specialist; agentic tool use is left to the orchestrator.
- **Hallucination risk on instance-specific facts.** The model will confidently invent `sys_id`s, plugin IDs, and table fields if asked about specifics it has not seen. Always verify against an actual ServiceNow instance.
- **No safety fine-tune beyond the base.** Inherits Qwen2.5-14B-Instruct safety behavior; no additional RLHF.
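
Since invented `sys_id`s are a known risk, a review step can at least list every `sys_id`-shaped string in a draft that is not known to exist. The sketch below relies on ServiceNow `sys_id`s being 32-character hexadecimal strings; the function name and the idea of a pre-exported set of known IDs are assumptions, not part of this release.

```python
import re

# A ServiceNow sys_id is a 32-character hexadecimal string.
SYS_ID_RE = re.compile(r"\b[0-9a-f]{32}\b")


def unverified_sys_ids(model_output, known_sys_ids):
    """List sys_id-shaped strings in the output that are not in the known set.

    `known_sys_ids` is assumed to be a collection of lowercase IDs exported
    from the target instance by whatever means the team already uses.
    """
    found = set(SYS_ID_RE.findall(model_output.lower()))
    return sorted(found - set(known_sys_ids))
```

Anything this returns should be checked against the real instance before the draft goes to a client.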

---

## License

Released under the **Apache License 2.0** (see `LICENSE`).

This model is a derivative of **Qwen2.5-14B-Instruct** (Apache-2.0). See `NOTICE` for attribution.

## Citation

```bibtex
@software{marvy_14b_2026,
  title   = {marvy-14B: A ServiceNow delivery lifecycle fine-tune of Qwen2.5-14B-Instruct},
  author  = {MainStack},
  year    = {2026},
  url     = {https://huggingface.co/MainStack/marvy-14B},
  license = {Apache-2.0}
}

@misc{qwen2.5,
  title  = {Qwen2.5: A Party of Foundation Models},
  author = {Qwen Team},
  year   = {2024},
  url    = {https://qwenlm.github.io/blog/qwen2.5/}
}
```

## Acknowledgements

- **Qwen team** at Alibaba Cloud for the Qwen2.5 family.
- **Apple MLX team** for `mlx` and `mlx-lm`, enabling native Apple Silicon training.
- **Hugging Face** for hosting and the surrounding ecosystem.