# Kaiju Coder 7 by Kiyomi - Data Provenance Draft

This draft records the current data boundary for release review.

## Policy

Kaiju Coder training data must be legally usable for a commercial derivative model.

Allowed:

- RMDW-authored examples.
- RMDW-owned repository diffs and documentation.
- Human-reviewed examples created specifically for Kaiju.
- Public permissive data only when license review confirms compatibility.

Not allowed:

- Closed-model answers from OpenAI, Anthropic, Gemini, or similar services as supervised completions.
- Unreviewed customer data.
- Private customer code without consent.
- Secrets, tokens, credentials, cookies, or private keys.
- Unlicensed scraped code.

## v0.1 Dataset Snapshot

- Total reviewed examples: 575
- Dataset build: `datasets/build/kaiju-sft-v0.1.jsonl`
- Candidate sources:
  - `datasets/candidates/rmdw-git-patches.jsonl`
  - `datasets/candidates/v0.1-safe-git-backlog.jsonl`
  - `datasets/candidates/v0.1-file-level-git.jsonl`
  - `datasets/candidates/v0.1-wiki-strategy-business-identity.jsonl`

## v1.7 Business-Owner Suite Addendum

- Date prepared: 2026-06-03
- Reviewed examples: 8
- Candidate file: `datasets/candidates/v1.7-rmdw-business-owner-suite.jsonl`
- Addendum-only SFT build: `datasets/build/kaiju-sft-v1.7-business-owner-suite.jsonl`
- Training SFT build: `datasets/build/kaiju-sft-v1.7-business-owner-oversampled.jsonl`
- Training config: `training/configs/qwen36-27b-lora-v1.7.example.json`
- v1.8 training config: `training/configs/qwen36-27b-lora-v1.8-business-owner.example.json`
- New task type: `business_suite`
- Source inventory: `release/SOURCE_INVENTORY.md`, refreshed from GitHub source-of-truth repositories and the requested local RMDW wiki snapshot.

This addendum targets Kiyomi 7.7.7-style business-owner work: complete AI-company build packs, premium service websites, intake and CRM flows, sales follow-up, proposals, ROI dashboards, operator handbooks, and Workshop golden-run automations.

Every row includes:

- `source_repos`
- `source_paths`
- `provenance_notes`
- `reviewed: true`
- `license: RMDW-owned`
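
The per-row requirements above could be checked mechanically before a build. The following is a minimal sketch, not the project's validator: the function names `check_row` and `check_jsonl` are hypothetical, and it assumes each JSONL line is a JSON object carrying the five fields by exactly these key names.

```python
import json

# Field names are taken from the list above; value types are not checked
# because the document does not state them.
REQUIRED_FIELDS = ("source_repos", "source_paths", "provenance_notes", "reviewed", "license")


def check_row(row: dict) -> list[str]:
    """Return the problems found in one addendum row (empty list if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    if row.get("reviewed") is not True:
        problems.append("reviewed must be true")
    if row.get("license") != "RMDW-owned":
        problems.append("license must be RMDW-owned")
    return problems


def check_jsonl(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems for every failing JSONL line."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        problems = check_row(json.loads(line))
        if problems:
            report[number] = problems
    return report
```

An empty report from `check_jsonl(open(path, encoding="utf-8"))` would indicate that every row carries the required provenance fields.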

For the v1.7 LoRA run, the 8 reviewed business-owner rows are oversampled 24 times by `scripts/build_v17_business_owner_sft_dataset.py`. Repeated rows receive unique IDs ending in `__v17_business_repeat_NN` and preserve the original source repository, source path, and provenance metadata.
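
The oversampling step might look roughly like the sketch below. It is not the contents of `scripts/build_v17_business_owner_sft_dataset.py`, which is not shown here. It assumes that "24 times" means each reviewed row is emitted 24 times in total, that rows carry an `id` key, and that `NN` is a zero-padded two-digit counter starting at 01; all three are assumptions.

```python
import copy

REPEATS = 24  # assumption: 24 copies of each reviewed row in total


def oversample(rows: list[dict], repeats: int = REPEATS) -> list[dict]:
    """Emit `repeats` copies of every row, each with a unique suffixed ID."""
    out = []
    for row in rows:
        for n in range(1, repeats + 1):
            clone = copy.deepcopy(row)  # provenance metadata is carried over unchanged
            clone["id"] = f"{row['id']}__v17_business_repeat_{n:02d}"
            out.append(clone)
    return out
```

Under these assumptions, 8 source rows yield 8 × 24 = 192 training rows, each traceable to its original through the ID prefix and the unchanged source fields.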

Client-site repositories are used only as eval and generalized pattern sources unless a row is explicitly reviewed for training eligibility. Do not bulk-train on client-specific text, contact details, contracts, or private business data.

The local wiki path `/Users/richardecholsai7/Documents/RMDW-Wiki` is present but is not a git checkout. It is recorded as `RMDW-Wiki-local`, `selective-reference-only`, with `credentials.md`, `customers.md`, `customers/`, and `raw/` excluded. The GitHub `RichardEchols/rmdw-agent-wiki` repo remains the authoritative wiki source for training/eval provenance unless a reviewer documents a local exception.
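
The wiki exclusions can be expressed as a path filter. This is a minimal sketch under stated assumptions: the helper `is_excluded` is hypothetical, paths are given relative to the wiki root with `/` separators, and the excluded names are matched at any depth, which is a deliberately conservative reading of the rule.

```python
from pathlib import PurePosixPath

# Names taken from the exclusion list above.
EXCLUDED_FILES = {"credentials.md", "customers.md"}
EXCLUDED_DIRS = {"customers", "raw"}


def is_excluded(relative_path: str) -> bool:
    """Return True if a wiki-relative path falls under an excluded file or directory."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    # Any ancestor directory named customers/ or raw/ excludes the file.
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # The file name itself may be on the excluded list.
    return parts[-1] in EXCLUDED_FILES
```

Matching at any depth means a nested `notes/credentials.md` is also dropped; a reviewer who wants root-only matching would need to tighten the checks.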

## Category Mix

The v0.1 category gate passed. Each category met its minimum count:

- Website/UI: at least 75 examples
- Coding: at least 75 examples
- Debugging: at least 50 examples
- Automation: at least 50 examples
- Tool-use: at least 50 examples
- Strategy: at least 25 examples
- Business: at least 15 examples
- Identity: at least 10 examples
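
A category gate of this kind can be sketched as a count comparison. The code below is illustrative only: `category_shortfalls` is a hypothetical name, and it assumes each example has already been mapped to one of the category labels listed above.

```python
from collections import Counter

# Minimums copied from the list above.
MINIMUMS = {
    "Website/UI": 75,
    "Coding": 75,
    "Debugging": 50,
    "Automation": 50,
    "Tool-use": 50,
    "Strategy": 25,
    "Business": 15,
    "Identity": 10,
}


def category_shortfalls(categories, minimums=MINIMUMS) -> dict[str, int]:
    """Return how many examples each category is missing; empty means the gate passes."""
    counts = Counter(categories)
    return {
        name: need - counts[name]
        for name, need in minimums.items()
        if counts[name] < need
    }
```

The minimums sum to 350, so a 575-example snapshot leaves room for every category to clear its floor.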

## Release Review Checklist

Before public release:

- Re-run dataset validation.
- Re-run source inventory against the current GitHub source-of-truth SHAs.
- Spot-check examples for secrets and private data.
- Confirm client-site rows are generalized pattern examples or eval-only.
- Confirm closed-model outputs are not used as supervised completions.
- Record exact base model revision.
- Attach upstream license and notices.
- Attach eval summary.
- Document known limitations and unsafe use boundaries.
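
The secrets spot-check in the list above could be supported by a simple pattern scan. This is a rough sketch, not a complete secret scanner: the helper `find_suspects` is hypothetical, and the three patterns (PEM private-key headers, AWS access key IDs, and GitHub personal access tokens with the `ghp_` prefix) are examples chosen for illustration. A match only flags an example for human review.

```python
import re

# Illustrative patterns only; a real review would use a broader rule set.
PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def find_suspects(text: str) -> list[str]:
    """Return the names of all patterns that match somewhere in the text."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]
```

An empty result does not prove an example is clean, so the manual spot-check remains necessary.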