tgetsov commited on
Commit
7e2e677
Β·
verified Β·
1 Parent(s): 6ae68c1

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +52 -23
README.md CHANGED
@@ -20,7 +20,7 @@ tags:
20
  - sft
21
  - mlx
22
  model-index:
23
- - name: marvy-14B
24
  results:
25
  - task:
26
  type: text-generation
@@ -37,17 +37,17 @@ model-index:
37
  name: Test cross-entropy loss
38
  ---
39
 
40
- # marvy-14B
41
 
42
  **The first open, fine-tuned LLM for the full ServiceNow delivery lifecycle β€” from business analysis to validation.**
43
 
44
- marvy-14B is an open-source language model fine-tuned for the complete ServiceNow delivery lifecycle: business analysis, requirements, stakeholder mapping, systems inventory, Solution Design Documents, user stories with acceptance criteria, implementation planning, test cases, and validation. Where general-purpose models treat ServiceNow as one topic among many, marvy is built to draft the actual artifacts a delivery team produces β€” in the structure and sequence real engagements follow. It is a first-draft specialist, not a consultant replacement, and it is not an agentic or tool-use fine-tune.
45
 
46
  It was built by [MainStack](https://huggingface.co/MainStack), a consultancy specializing in ServiceNow Agentic Delivery. marvy is a LoRA SFT fine-tune of [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache-2.0), trained on ~1,958 anonymized artifacts from real engagements (~887k tokens), rigorously redacted to zero residual PII per an automated leakage scanner. Its test perplexity of 13.107 was measured on a project- and customer-disjoint held-out split β€” the model generalizes to unseen work rather than memorizing the training set.
47
 
48
  > Released under **Apache-2.0**. Built with Qwen β€” see `NOTICE`.
49
 
50
- ## Why marvy-14B
51
 
52
  - **Drafts the full lifecycle, not just snippets.** Business analysis through validation β€” the artifacts and sequence real delivery teams actually work in.
53
  - **OOTB-first and implementation-grade.** Tuned to favor out-of-the-box correctness and produce drafts you can review, not rewrite.
@@ -68,7 +68,7 @@ It was built by [MainStack](https://huggingface.co/MainStack), a consultancy spe
68
  ```python
69
  from transformers import AutoTokenizer, AutoModelForCausalLM
70
 
71
- model_id = "MainStack/marvy-14B"
72
  tok = AutoTokenizer.from_pretrained(model_id)
73
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
74
 
@@ -93,22 +93,22 @@ print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
93
 
94
  ```bash
95
  pip install vllm
96
- vllm serve MainStack/marvy-14B
97
  ```
98
 
99
  ### Ollama (via GGUF)
100
 
101
- Use the companion repo [`MainStack/marvy-14B-GGUF`](https://huggingface.co/MainStack/marvy-14B-GGUF):
102
 
103
  ```bash
104
- ollama run hf.co/MainStack/marvy-14B-GGUF:Q4_K_M
105
  ```
106
 
107
  ### MLX (Apple Silicon native)
108
 
109
  ```bash
110
  pip install mlx-lm
111
- python -m mlx_lm generate --model MainStack/marvy-14B \
112
  --system-prompt "You are a senior ServiceNow delivery consultant..." \
113
  --prompt "Draft the Platform Architecture section of an ITSM SDD." \
114
  --max-tokens 1024 --temp 0.4
@@ -116,13 +116,13 @@ python -m mlx_lm generate --model MainStack/marvy-14B \
116
 
117
  ### LoRA-only (apply on top of the base)
118
 
119
- If you prefer a tiny adapter (~175 MB) on top of the BF16 base, see [`MainStack/marvy-14B-lora`](https://huggingface.co/MainStack/marvy-14B-lora).
120
 
121
  ---
122
 
123
  ## Intended use
124
 
125
- marvy-14B is designed to produce implementation-grade first drafts across the ServiceNow delivery lifecycle β€” accelerating the artifacts a practitioner would otherwise write from scratch, then review and refine. Built for solution architects, business analysts, technical consultants, and project managers. Typical tasks:
126
 
127
  | Task family | What it produces |
128
  |------------------------|---------------------------------------------------------------------------------|
@@ -206,18 +206,47 @@ Train / val / test are split **by project**, so no customer appears in more than
206
 
207
  ## Evaluation
208
 
209
- Test-set evaluation on the **project-disjoint** test split (252 records from two customers never seen in training/val), 50 batches:
210
 
211
- | Metric | Value |
212
- |---|---|
213
- | Test cross-entropy loss | **2.573** |
214
- | Test perplexity | **13.107** |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
215
 
216
- > Note: two test sequences exceed 2,048 tokens and are truncated by the MLX eval harness. The reported figure is therefore a slight upper bound on true loss. Full-length scoring is planned for v2.
 
 
 
 
 
217
 
218
- To reproduce or validate these results yourself β€” including a base-vs-marvy
219
- comparison and qualitative task probes β€” see [`VALIDATION.md`](./VALIDATION.md)
220
- and run [`validate.sh`](./validate.sh).
221
 
222
  ---
223
 
@@ -228,7 +257,7 @@ and run [`validate.sh`](./validate.sh).
228
  - **Synthetic instructions.** User prompts are templated paraphrases (3–5 variants per task); assistant outputs are the original human-authored artifacts.
229
  - **English-only.** The corpus is English.
230
  - **Not a replacement for a consultant.** Output is first-draft, implementation-grade content that requires expert review before client delivery or production use.
231
- - **No tool use / function calling fine-tune.** `marvy-14B` is a text-completion specialist; agentic tool use is left to the orchestrator.
232
  - **Hallucination risk on instance-specific facts.** The model will confidently invent `sys_id`s, plugin IDs, and table fields if asked about specifics it has not seen. Always verify against an actual ServiceNow instance.
233
  - **No safety fine-tune beyond the base.** Inherits Qwen2.5-14B-Instruct safety behavior; no additional RLHF.
234
 
@@ -244,10 +273,10 @@ This model is a derivative of **Qwen2.5-14B-Instruct** (Apache-2.0). See `NOTICE
244
 
245
  ```bibtex
246
  @software{marvy_14b_2026,
247
- title = {marvy-14B: A ServiceNow delivery lifecycle fine-tune of Qwen2.5-14B-Instruct},
248
  author = {MainStack},
249
  year = {2026},
250
- url = {https://huggingface.co/MainStack/marvy-14B},
251
  license= {Apache-2.0}
252
  }
253
 
 
20
  - sft
21
  - mlx
22
  model-index:
23
+ - name: marvy-1-14B
24
  results:
25
  - task:
26
  type: text-generation
 
37
  name: Test cross-entropy loss
38
  ---
39
 
40
+ # marvy-1-14B
41
 
42
  **The first open, fine-tuned LLM for the full ServiceNow delivery lifecycle β€” from business analysis to validation.**
43
 
44
+ marvy-1-14B is an open-source language model fine-tuned for the complete ServiceNow delivery lifecycle: business analysis, requirements, stakeholder mapping, systems inventory, Solution Design Documents, user stories with acceptance criteria, implementation planning, test cases, and validation. Where general-purpose models treat ServiceNow as one topic among many, marvy is built to draft the actual artifacts a delivery team produces β€” in the structure and sequence real engagements follow. It is a first-draft specialist, not a consultant replacement, and it is not an agentic or tool-use fine-tune.
45
 
46
  It was built by [MainStack](https://huggingface.co/MainStack), a consultancy specializing in ServiceNow Agentic Delivery. marvy is a LoRA SFT fine-tune of [Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (Apache-2.0), trained on ~1,958 anonymized artifacts from real engagements (~887k tokens), rigorously redacted to zero residual PII per an automated leakage scanner. Its test perplexity of 13.107 was measured on a project- and customer-disjoint held-out split β€” the model generalizes to unseen work rather than memorizing the training set.
47
 
48
  > Released under **Apache-2.0**. Built with Qwen β€” see `NOTICE`.
49
 
50
+ ## Why marvy-1-14B
51
 
52
  - **Drafts the full lifecycle, not just snippets.** Business analysis through validation β€” the artifacts and sequence real delivery teams actually work in.
53
  - **OOTB-first and implementation-grade.** Tuned to favor out-of-the-box correctness and produce drafts you can review, not rewrite.
 
68
  ```python
69
  from transformers import AutoTokenizer, AutoModelForCausalLM
70
 
71
+ model_id = "MainStack/marvy-1-14B"
72
  tok = AutoTokenizer.from_pretrained(model_id)
73
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
74
 
 
93
 
94
  ```bash
95
  pip install vllm
96
+ vllm serve MainStack/marvy-1-14B
97
  ```
98
 
99
  ### Ollama (via GGUF)
100
 
101
+ Use the companion repo [`MainStack/marvy-1-14B-GGUF`](https://huggingface.co/MainStack/marvy-1-14B-GGUF):
102
 
103
  ```bash
104
+ ollama run hf.co/MainStack/marvy-1-14B-GGUF:Q4_K_M
105
  ```
106
 
107
  ### MLX (Apple Silicon native)
108
 
109
  ```bash
110
  pip install mlx-lm
111
+ python -m mlx_lm generate --model MainStack/marvy-1-14B \
112
  --system-prompt "You are a senior ServiceNow delivery consultant..." \
113
  --prompt "Draft the Platform Architecture section of an ITSM SDD." \
114
  --max-tokens 1024 --temp 0.4
 
116
 
117
  ### LoRA-only (apply on top of the base)
118
 
119
+ If you prefer a tiny adapter (~175 MB) on top of the BF16 base, see [`MainStack/marvy-1-14B-lora`](https://huggingface.co/MainStack/marvy-1-14B-lora).
120
 
121
  ---
122
 
123
  ## Intended use
124
 
125
+ marvy-1-14B is designed to produce implementation-grade first drafts across the ServiceNow delivery lifecycle β€” accelerating the artifacts a practitioner would otherwise write from scratch, then review and refine. Built for solution architects, business analysts, technical consultants, and project managers. Typical tasks:
126
 
127
  | Task family | What it produces |
128
  |------------------------|---------------------------------------------------------------------------------|
 
206
 
207
  ## Evaluation
208
 
209
+ ### Fine-tuned vs. base β€” efficiency on the held-out test set
210
 
211
+ The cleanest measure of the fine-tune's value is to score the **same base
212
+ model twice** β€” plain vs. with the marvy adapter β€” on the **project-disjoint**
213
+ test split (252 records from two customers never seen in training/val), using
214
+ per-token cross-entropy/perplexity on the **assistant tokens only**
215
+ (prompt-masked, the same objective used in training). Lower perplexity = the
216
+ model assigns higher probability to the real, human-authored delivery artifact.
217
+
218
+ ![marvy-1-14B vs base β€” perplexity by task](./marvy_vs_base_ppl.png)
219
+
220
+ ![How much fine-tuning improved each task](./marvy_improvement.png)
221
+
222
+ **Overall: perplexity 8.91 β†’ 6.03, a 32.3% reduction** on unseen customers.
223
+
224
+ | Task | Base ppl | marvy-1-14B ppl | Improvement |
225
+ |---|---:|---:|---:|
226
+ | Systems inventory | 77.07 | 10.53 | **βˆ’86.3%** |
227
+ | Requirements extraction | 46.76 | 9.39 | **βˆ’79.9%** |
228
+ | Stakeholder mapping | 27.81 | 6.91 | **βˆ’75.2%** |
229
+ | Story authoring | 15.38 | 7.86 | **βˆ’48.9%** |
230
+ | Validation / critique | 9.72 | 8.23 | βˆ’15.3% |
231
+ | Business analysis | 7.14 | 6.66 | βˆ’6.6% |
232
+ | SDD design | 4.48 | 4.40 | βˆ’1.7% |
233
+ | **Overall** | **8.91** | **6.03** | **βˆ’32.3%** |
234
+
235
+ The gains are largest on **structured, format-heavy artifacts** (inventories,
236
+ requirements, stakeholder registers, stories) where the base model wanders from
237
+ the expected schema; they are smaller on long-form prose (SDD sections, business
238
+ analysis) where the base was already competent. This is the honest, expected
239
+ shape of a domain SFT.
240
 
241
+ > Notes: the test customers (`Customer-CHEM-01`, `Customer-FININST-01`) appear in
242
+ > neither train nor val, so this reflects generalization, not memorization. The
243
+ > test split happens to cover 7 of the 11 task families. An earlier MLX
244
+ > batch-eval reported aggregate ppl β‰ˆ 13.1 with 2,048-token truncation; the
245
+ > figures above recompute per-task with full assistant-token masking, so the
246
+ > base-vs-marvy **delta** is the result of interest.
247
 
248
+ Reproduce it yourself: `bash benchmark/run_benchmark.sh` (see
249
+ [`VALIDATION.md`](./VALIDATION.md) for qualitative probes too).
 
250
 
251
  ---
252
 
 
257
  - **Synthetic instructions.** User prompts are templated paraphrases (3–5 variants per task); assistant outputs are the original human-authored artifacts.
258
  - **English-only.** The corpus is English.
259
  - **Not a replacement for a consultant.** Output is first-draft, implementation-grade content that requires expert review before client delivery or production use.
260
+ - **No tool use / function calling fine-tune.** `marvy-1-14B` is a text-completion specialist; agentic tool use is left to the orchestrator.
261
  - **Hallucination risk on instance-specific facts.** The model will confidently invent `sys_id`s, plugin IDs, and table fields if asked about specifics it has not seen. Always verify against an actual ServiceNow instance.
262
  - **No safety fine-tune beyond the base.** Inherits Qwen2.5-14B-Instruct safety behavior; no additional RLHF.
263
 
 
273
 
274
  ```bibtex
275
  @software{marvy_14b_2026,
276
+ title = {marvy-1-14B: A ServiceNow delivery lifecycle fine-tune of Qwen2.5-14B-Instruct},
277
  author = {MainStack},
278
  year = {2026},
279
+ url = {https://huggingface.co/MainStack/marvy-1-14B},
280
  license= {Apache-2.0}
281
  }
282