# Validating marvy-1-14B

This guide gives you three independent ways to confirm the fine-tune actually
learned the ServiceNow delivery style, from a 60-second smoke test to a
quantitative base-vs-marvy comparison on a held-out, customer-disjoint test set.

> TL;DR: run `bash docs/validate.sh` (from the model repo) for the quick path,
> or follow the manual steps below.

---

## What "working" means here

marvy-1-14B is a **specialist drafting model**. A successful fine-tune should show:

1. **Format fidelity**: it emits the delivery artifact shape on cue (user
   stories with acceptance criteria, SDD sections, test cases with
   pre-conditions/steps/expected results) without being told the structure.
2. **Domain voice**: OOTB-first framing, ServiceNow tables/plugins, ITIL/CSDM
   vocabulary, `sys_id` citations where relevant.
3. **Lower loss than the base** on held-out ServiceNow delivery text.

The base model (Qwen2.5-14B-Instruct) is a strong generalist and will produce
*plausible* answers; the point of validation is to show that marvy is **more
on-format, more domain-specific, and lower-perplexity** on this task.

---

## Test 1: 60-second smoke test (qualitative)

Prompt the model with a bare instruction and check it produces a correctly
structured artifact with no format coaching.

### LM Studio (local)

```bash
lms load MainStack/marvy-1-14B
lms server start          # OpenAI-compatible on http://localhost:1234/v1

curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "marvy-1-14B",
    "temperature": 0.4,
    "messages": [
      {"role": "system", "content": "You are a senior ServiceNow delivery consultant. You produce precise, implementation-grade artifacts and favor out-of-the-box capabilities."},
      {"role": "user", "content": "Write a user story with acceptance criteria for auto-escalating P1 incidents that breach a 15-minute response SLA."}
    ]
  }' | python3 -c "import sys,json;print(json.load(sys.stdin)['choices'][0]['message']['content'])"
```

### MLX (Apple Silicon)

```bash
python -m mlx_lm generate --model MainStack/marvy-1-14B \
  --system-prompt "You are a senior ServiceNow delivery consultant..." \
  --prompt "Write a user story with acceptance criteria for auto-escalating P1 incidents that breach a 15-minute response SLA." \
  --max-tokens 512 --temp 0.4
```

### Pass criteria

- [ ] Output is a **user story** (`As a … I want … so that …`) followed by
      discrete, testable **acceptance criteria**.
- [ ] References ServiceNow concretely (e.g. `incident`, SLA definitions,
      `sla_definition`, escalation/notification, assignment groups).
- [ ] No meta-chatter ("Sure, here is…") dominating the answer; it reads like a
      backlog item, not a chatbot reply.
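
The checklist above can be pre-screened automatically before a human review. The following is a minimal sketch, not an official rubric: the function name `check_smoke_output`, the keyword lists, and the "at least two ServiceNow terms" threshold are all assumptions chosen for illustration, and substring matching is deliberately crude.

```python
import re

# Assumption: illustrative keyword lists, not part of the marvy release.
SERVICENOW_TERMS = ("incident", "sla", "assignment group", "escalation", "notification")
META_OPENERS = ("sure", "certainly", "of course", "here is", "here's")

# Matches "As a/an ... I want ... so that ..." across line breaks.
STORY_RE = re.compile(r"\bas an?\b.+?\bi want\b.+?\bso that\b", re.IGNORECASE | re.DOTALL)


def check_smoke_output(text: str) -> dict:
    """Return one boolean per smoke-test criterion for a model response."""
    lowered = text.lower().strip()
    first_line = lowered.splitlines()[0] if lowered else ""
    return {
        "user_story": bool(STORY_RE.search(text)),
        "acceptance_criteria": "acceptance criteria" in lowered,
        # Hypothetical threshold: at least two distinct ServiceNow terms.
        "servicenow_terms": sum(term in lowered for term in SERVICENOW_TERMS) >= 2,
        # Only the opening line is inspected, after stripping Markdown markers.
        "no_meta_chatter": not first_line.lstrip("*# ").startswith(META_OPENERS),
    }
```

A response passes the pre-screen when every value is `True`. The result is a hint only, since a keyword match cannot judge whether the acceptance criteria are actually testable.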

---

## Test 2: Task-coverage probes (qualitative, one per skill)

Run each prompt with the recommended system prompt. Each should yield the
artifact named, in the right shape.

| # | Prompt | Expect |
|---|--------|--------|
| 1 | "Draft the Incident Management section of an SDD for a greenfield ITSM implementation. Include assignment rules and SLA design." | SDD section: architecture/process, assignment rules (condition/action/order), SLA table |
| 2 | "Extract structured requirements (id, category, priority, target phase, success metric) from: 'We need to replace email-based access requests with a catalog item routed for manager approval.'" | Tabular/structured requirements with priorities & metrics |
| 3 | "Write a test case for the story: 'Restrict the Assignment Group field on incidents to groups with the itil role.'" | Test case: pre-conditions, steps, expected results, pass/fail |
| 4 | "We are migrating CMDB to CSDM. Produce the foundation-data load sequence and the CI classes involved." | CSDM/CMDB sequence, classes (cmdb_ci_*), foundation order |
| 5 | "Validate this requirement against best practice and list follow-up questions: 'All incidents must auto-close after 3 days.'" | Critique + concrete follow-up questions + risks |

### Pass criteria
At least **4 of 5** produce the correct artifact type with ServiceNow-specific,
implementation-grade content (not generic ITSM prose).
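
To keep the five probes consistent, the request body and the 4-of-5 tally can be scripted. This is a sketch under assumptions: `build_payload` and `coverage_verdict` are hypothetical helper names, the system prompt and temperature are copied from the Test 1 curl example on the assumption that it is the recommended prompt, and the pass/fail value for each probe still comes from a human reviewer.

```python
# Assumption: the Test 1 system prompt is the recommended one.
SYSTEM_PROMPT = (
    "You are a senior ServiceNow delivery consultant. You produce precise, "
    "implementation-grade artifacts and favor out-of-the-box capabilities."
)


def build_payload(user_prompt: str, model: str = "marvy-1-14B") -> dict:
    """Build an OpenAI-style chat payload for one probe prompt."""
    return {
        "model": model,
        "temperature": 0.4,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }


def coverage_verdict(reviews: dict[int, bool], total: int = 5, required: int = 4) -> tuple[int, bool]:
    """Tally reviewer verdicts keyed by probe number (1..total)."""
    if sorted(reviews) != list(range(1, total + 1)):
        raise ValueError(f"expected verdicts for probes 1..{total}")
    passed = sum(reviews.values())
    return passed, passed >= required
```

Each payload can be sent to the same `/v1/chat/completions` endpoint used in Test 1; `coverage_verdict` then returns the pass count and whether the threshold is met.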

---

## Test 3: Quantitative comparison of base vs marvy on the held-out test set

This is the strongest signal. The test split is **customer-disjoint**: it
contains two customers that never appear in training or validation, so it
measures generalization, not memorization.

### With the MLX training kit (in the source repo)

```bash
cd training

# marvy (fine-tuned adapter on the base)
python -m mlx_lm lora \
  --model mlx-community/Qwen2.5-14B-Instruct-4bit \
  --adapter-path train/adapters \
  --data train/data --test --test-batches 50
# -> Test loss 2.573, Test ppl 13.107   (lower is better)

# base (no adapter) for comparison
python -m mlx_lm lora \
  --model mlx-community/Qwen2.5-14B-Instruct-4bit \
  --data train/data --test --test-batches 50
# -> expect a HIGHER loss/ppl than marvy
```

### Pass criteria
- [ ] marvy's **test perplexity is meaningfully lower** than the base on the
      same held-out split.
- [ ] No data leakage: the test customers (`Customer-CHEM-01`,
      `Customer-FININST-01`) are absent from `train.jsonl` / `valid.jsonl`.
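
The leakage criterion can be checked with a plain scan of the JSONL files. The sketch below assumes the customer identifiers appear literally somewhere in each record; the source does not say which field carries them, so the whole record is searched. `find_leaks` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Test customers named in the pass criteria above.
TEST_CUSTOMERS = ("Customer-CHEM-01", "Customer-FININST-01")


def find_leaks(paths, customers=TEST_CUSTOMERS):
    """Return (file name, line number, customer) for each record naming a test customer."""
    leaks = []
    for path in map(Path, paths):
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # tolerate blank lines between records
                # Re-serialise the parsed record so escaped text is compared in one form.
                record = json.dumps(json.loads(line), ensure_ascii=False)
                for customer in customers:
                    if customer in record:
                        leaks.append((path.name, line_no, customer))
    return leaks
```

Calling `find_leaks(["train/data/train.jsonl", "train/data/valid.jsonl"])` (paths assumed from the `--data train/data` flag) should return an empty list. A literal match will miss renamed or abbreviated customer references, so an empty result is necessary but not sufficient.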

> Reference result for this release: **test loss 2.573 / ppl 13.107** on 50
> batches of the customer-disjoint test split (two sequences >2048 tokens are
> truncated by the eval harness, so this is a slight upper bound).
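
The two reference numbers are linked: perplexity is the exponential of the mean per-token loss, so either can be derived from the other. The sketch below shows that relationship and one way to express the base-vs-marvy gap as a relative perplexity reduction. The helper names are hypothetical, and no base-model figure is assumed here; supply the loss you measure.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity is exp(mean cross-entropy loss), with the loss in nats."""
    return math.exp(loss)


def relative_ppl_reduction(base_loss: float, tuned_loss: float) -> float:
    """Fraction by which perplexity drops when moving from base to tuned."""
    # exp(tuned) / exp(base) simplifies to exp(tuned - base).
    return 1.0 - math.exp(tuned_loss - base_loss)


# exp(2.573) is roughly 13.105; the reported 13.107 differs only because
# the printed loss is rounded to three decimals.
reference_ppl = perplexity(2.573)
```

A positive `relative_ppl_reduction(base_loss, 2.573)` means marvy is ahead; what counts as "meaningfully lower" remains a judgment call for the reviewer.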

---

## Interpreting results

| Symptom | Likely cause | Action |
|---|---|---|
| Generic ITSM prose, no ServiceNow specifics | wrong/short system prompt | use the full recommended system prompt; temp 0.3–0.5 |
| Rambling, no artifact structure | temperature too high | lower to 0.3–0.4 |
| Invents `sys_id`s / plugin IDs | expected limitation | verify against a real instance; never trust IDs blindly |
| marvy ppl ≈ base ppl | adapter not applied / wrong checkpoint | confirm `--adapter-path` points at the trained adapter (iter-150) |

marvy-1-14B is a first-draft assistant. All output must be reviewed by a qualified
ServiceNow consultant before client delivery or production configuration.