# Validating marvy-1-14B
This guide gives you three independent ways to confirm that the fine-tune actually
learned the ServiceNow delivery style, ranging from a 60-second smoke test to a
quantitative base-vs-marvy comparison on a held-out, customer-disjoint test set.
> TL;DR: run `bash docs/validate.sh` (from the model repo) for the quick path,
> or follow the manual steps below.
---
## What "working" means here
marvy-1-14B is a **specialist drafting model**. A successful fine-tune should show:
1. **Format fidelity**: it emits the delivery artifact shape on cue (user
stories with acceptance criteria, SDD sections, test cases with
pre-conditions/steps/expected results) without being told the structure.
2. **Domain voice**: OOTB-first framing, ServiceNow tables/plugins, ITIL/CSDM
vocabulary, `sys_id` citations where relevant.
3. **Lower loss than the base** on held-out ServiceNow delivery text.
The base model (Qwen2.5-14B-Instruct) is a strong generalist and will produce
*plausible* answers; the point of validation is to show that marvy is **more
on-format, more domain-specific, and lower-perplexity** on this task.
---
## Test 1: 60-second smoke test (qualitative)
Prompt the model with a bare instruction and check it produces a correctly
structured artifact with no format coaching.
### LM Studio (local)
```bash
lms load MainStack/marvy-1-14B
lms server start # OpenAI-compatible on http://localhost:1234/v1
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "marvy-1-14B",
    "temperature": 0.4,
    "messages": [
      {"role": "system", "content": "You are a senior ServiceNow delivery consultant. You produce precise, implementation-grade artifacts and favor out-of-the-box capabilities."},
      {"role": "user", "content": "Write a user story with acceptance criteria for auto-escalating P1 incidents that breach a 15-minute response SLA."}
    ]
  }' | python3 -c "import sys,json;print(json.load(sys.stdin)['choices'][0]['message']['content'])"
```
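
If you would rather script the smoke test than paste a curl command, the same request can be sketched in Python with only the standard library. The helper names (`build_payload`, `extract_content`, `ask`) are hypothetical and not part of the model repo, and the base URL assumes LM Studio's default local server address shown in the curl example.

```python
import json
import urllib.request

# Assumption: LM Studio's local server address, as used in the curl example.
BASE_URL = "http://localhost:1234/v1"

SYSTEM_PROMPT = (
    "You are a senior ServiceNow delivery consultant. You produce precise, "
    "implementation-grade artifacts and favor out-of-the-box capabilities."
)


def build_payload(user_prompt, model="marvy-1-14B", temperature=0.4):
    """Build the chat-completions request body used in the smoke test."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }


def extract_content(response):
    """Pull the assistant text out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]


def ask(user_prompt, base_url=BASE_URL, timeout=300):
    """POST the prompt to the local server and return the model's reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(user_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_content(json.load(reply))
```

With the server running, `print(ask("Write a user story with acceptance criteria for auto-escalating P1 incidents that breach a 15-minute response SLA."))` should print the same text the curl pipeline does.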
### MLX (Apple Silicon)
```bash
python -m mlx_lm generate --model MainStack/marvy-1-14B \
  --system-prompt "You are a senior ServiceNow delivery consultant..." \
  --prompt "Write a user story with acceptance criteria for auto-escalating P1 incidents that breach a 15-minute response SLA." \
  --max-tokens 512 --temp 0.4
```
### Pass criteria
- [ ] Output is a **user story** (`As a … I want … so that …`) followed by
discrete, testable **acceptance criteria**.
- [ ] References ServiceNow concretely (e.g. `incident`, SLA definitions,
`sla_definition`, escalation/notification, assignment groups).
- [ ] No meta-chatter ("Sure, here is…") dominating the answer; it reads like a
backlog item, not a chatbot reply.
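
The checklist above can be roughly pre-screened in code before a human reads the output. The sketch below is a heuristic only: the regular expressions, the term list, and the list of "chatty" openers are illustrative assumptions, not rules shipped with the model, and a passing result does not replace reviewer judgment.

```python
import re

# Assumption: these patterns and term lists are illustrative heuristics,
# not part of the model repo. Tune them to your own review standards.
STORY_RE = re.compile(
    r"\bas an? .+?\bi want .+?\bso that\b", re.IGNORECASE | re.DOTALL
)
CRITERIA_RE = re.compile(r"acceptance criteria", re.IGNORECASE)
SERVICENOW_TERMS = ("incident", "sla", "escalat", "assignment group", "notification")
CHATTER_OPENERS = ("sure", "certainly", "of course", "here is", "here's")


def smoke_check(output):
    """Return one boolean per Test 1 pass criterion for a model output."""
    text = output.strip()
    lowered = text.lower()
    first_line = lowered.splitlines()[0] if lowered else ""
    return {
        # "As a ... I want ... so that ..." appears somewhere in the text.
        "user_story": bool(STORY_RE.search(text)),
        # An acceptance-criteria section is present.
        "acceptance_criteria": bool(CRITERIA_RE.search(text)),
        # At least two ServiceNow-flavoured terms are mentioned.
        "servicenow_terms": sum(t in lowered for t in SERVICENOW_TERMS) >= 2,
        # The reply does not open with chatbot-style filler.
        "no_meta_chatter": not first_line.startswith(CHATTER_OPENERS),
    }
```

Treat any `False` value as a prompt to read the output closely rather than as an automatic failure.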
---
## Test 2: Task-coverage probes (qualitative, one per skill)
Run each prompt with the recommended system prompt. Each should yield the
artifact named, in the right shape.
| # | Prompt | Expect |
|---|--------|--------|
| 1 | "Draft the Incident Management section of an SDD for a greenfield ITSM implementation. Include assignment rules and SLA design." | SDD section: architecture/process, assignment rules (condition/action/order), SLA table |
| 2 | "Extract structured requirements (id, category, priority, target phase, success metric) from: 'We need to replace email-based access requests with a catalog item routed for manager approval.'" | Tabular/structured requirements with priorities & metrics |
| 3 | "Write a test case for the story: 'Restrict the Assignment Group field on incidents to groups with the itil role.'" | Test case: pre-conditions, steps, expected results, pass/fail |
| 4 | "We are migrating CMDB to CSDM. Produce the foundation-data load sequence and the CI classes involved." | CSDM/CMDB sequence, classes (cmdb_ci_*), foundation order |
| 5 | "Validate this requirement against best practice and list follow-up questions: 'All incidents must auto-close after 3 days.'" | Critique + concrete follow-up questions + risks |
### Pass criteria
At least **4 of 5** produce the correct artifact type with ServiceNow-specific,
implementation-grade content (not generic ITSM prose).
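
To keep the 4-of-5 tally consistent across runs, the threshold can be sketched as a small helper. The marker strings are hypothetical shorthand for each row's "Expect" column and only catch obvious misses; deciding whether content is implementation-grade still needs a human reviewer.

```python
REQUIRED_PASSES = 4  # "at least 4 of 5" from the pass criteria

# Assumption: hypothetical keyword markers derived from the "Expect" column.
PROBE_MARKERS = {
    1: ("assignment rule", "sla"),
    2: ("priority", "success metric"),
    3: ("pre-condition", "steps", "expected result"),
    4: ("cmdb_ci", "foundation"),
    5: ("follow-up", "risk"),
}


def probe_passes(probe_id, output):
    """True when every marker for the probe appears in the output."""
    lowered = output.lower()
    return all(marker in lowered for marker in PROBE_MARKERS[probe_id])


def coverage_verdict(outputs):
    """Score a dict of probe number -> model output.

    Returns (per-probe results, overall pass/fail against the threshold).
    """
    results = {pid: probe_passes(pid, text) for pid, text in outputs.items()}
    passed = sum(results.values())
    return results, passed >= REQUIRED_PASSES
```

Pass a dict with keys 1 through 5 holding the five model outputs; the second element of the returned tuple is the overall verdict.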
---
## Test 3: Base vs marvy on the held-out test set (quantitative)
This is the strongest signal. The test split is **customer-disjoint** (two
customers that never appear in training or validation), so it measures
generalization, not memorization.
### With the MLX training kit (in the source repo)
```bash
cd training
# marvy (fine-tuned adapter on the base)
python -m mlx_lm lora \
--model mlx-community/Qwen2.5-14B-Instruct-4bit \
--adapter-path train/adapters \
--data train/data --test --test-batches 50
# -> Test loss 2.573, Test ppl 13.107 (lower is better)
# base (no adapter) for comparison
python -m mlx_lm lora \
--model mlx-community/Qwen2.5-14B-Instruct-4bit \
--data train/data --test --test-batches 50
# -> expect a HIGHER loss/ppl than marvy
```
### Pass criteria
- [ ] marvy's **test perplexity is meaningfully lower** than the base on the
same held-out split.
- [ ] No data leakage: the test customers (`Customer-CHEM-01`,
`Customer-FININST-01`) are absent from `train.jsonl` / `valid.jsonl`.
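
The leakage criterion can be checked mechanically. The sketch below assumes the customer identifiers appear literally in the JSONL text, so a plain substring scan is enough; if your data pseudonymizes or reformats customer names, this check will miss them. `find_leaks` is a hypothetical helper, not part of the training kit.

```python
from pathlib import Path

# The two held-out customers named in the pass criteria.
TEST_CUSTOMERS = ("Customer-CHEM-01", "Customer-FININST-01")


def find_leaks(paths, customers=TEST_CUSTOMERS):
    """Return (file name, line number, customer) for each leaking record."""
    leaks = []
    for path in map(Path, paths):
        with path.open(encoding="utf-8") as handle:
            # One JSON record per line, so line numbers identify records.
            for line_number, line in enumerate(handle, start=1):
                for customer in customers:
                    if customer in line:
                        leaks.append((path.name, line_number, customer))
    return leaks
```

Assuming the splits live under the `train/data` directory passed to `--data`, calling `find_leaks(["train/data/train.jsonl", "train/data/valid.jsonl"])` should return an empty list for a clean split.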
> Reference result for this release: **test loss 2.573 / ppl 13.107** on 50
> batches of the customer-disjoint test split (two sequences >2048 tokens are
> truncated by the eval harness, so this is a slight upper bound).
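
To make "meaningfully lower" concrete, it helps to convert the two reported losses into a relative perplexity change. The sketch assumes perplexity is the exponential of the mean test loss, which is consistent with the reference numbers here (exp(2.573) is about 13.105, matching 13.107 up to rounding of the printed loss). The function names are illustrative, and no threshold for "meaningful" is implied.

```python
import math


def perplexity(loss):
    """Perplexity implied by a mean cross-entropy loss (natural log)."""
    return math.exp(loss)


def relative_ppl_drop(base_loss, tuned_loss):
    """Fractional perplexity reduction of the tuned model versus the base.

    Equals 1 - ppl_tuned / ppl_base; positive means the tuned model is better.
    """
    return 1.0 - math.exp(tuned_loss - base_loss)
```

Feed `relative_ppl_drop` the loss printed by your own base run together with marvy's 2.573; a value near zero points to the "adapter not applied" case.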
---
## Interpreting results
| Symptom | Likely cause | Action |
|---|---|---|
| Generic ITSM prose, no ServiceNow specifics | wrong/short system prompt | use the full recommended system prompt; temp 0.3–0.5 |
| Rambling, no artifact structure | temperature too high | lower to 0.3–0.4 |
| Invents `sys_id`s / plugin IDs | expected limitation | verify against a real instance; never trust IDs blindly |
| marvy ppl ≈ base ppl | adapter not applied / wrong checkpoint | confirm `--adapter-path` points at the trained adapter (iter-150) |
marvy-1-14B is a first-draft assistant. All output must be reviewed by a qualified
ServiceNow consultant before client delivery or production configuration.