
Kaiju Coder 7 by Kiyomi - Data Provenance Draft

This draft records the current data boundary for release review.

Policy

Kaiju Coder training data must be legally usable for a commercial derivative model.

Allowed:

  • RMDW-authored examples.
  • RMDW-owned repository diffs and documentation.
  • Human-reviewed examples created specifically for Kaiju.
  • Public permissive data only when license review confirms compatibility.

Not allowed:

  • Closed-model answers from OpenAI, Anthropic, Gemini, or similar services as supervised completions.
  • Unreviewed customer data.
  • Private customer code without consent.
  • Secrets, tokens, credentials, cookies, or private keys.
  • Unlicensed scraped code.

v0.1 Dataset Snapshot

  • Total reviewed examples: 575
  • Dataset build: datasets/build/kaiju-sft-v0.1.jsonl
  • Candidate sources:
    • datasets/candidates/rmdw-git-patches.jsonl
    • datasets/candidates/v0.1-safe-git-backlog.jsonl
    • datasets/candidates/v0.1-file-level-git.jsonl
    • datasets/candidates/v0.1-wiki-strategy-business-identity.jsonl

v1.7 Business-Owner Suite Addendum

  • Date prepared: 2026-06-03
  • Reviewed examples: 8
  • Candidate file: datasets/candidates/v1.7-rmdw-business-owner-suite.jsonl
  • Addendum-only SFT build: datasets/build/kaiju-sft-v1.7-business-owner-suite.jsonl
  • Training SFT build: datasets/build/kaiju-sft-v1.7-business-owner-oversampled.jsonl
  • Training config: training/configs/qwen36-27b-lora-v1.7.example.json
  • v1.8 training config: training/configs/qwen36-27b-lora-v1.8-business-owner.example.json
  • New task type: business_suite
  • Source inventory: release/SOURCE_INVENTORY.md, refreshed from GitHub source-of-truth repositories and the requested local RMDW wiki snapshot.

This addendum targets Kiyomi 7.7.7-style business-owner work: complete AI-company build packs, premium service websites, intake and CRM flows, sales follow-up, proposals, ROI dashboards, operator handbooks, and Workshop golden-run automations.

Every row includes:

  • source_repos
  • source_paths
  • provenance_notes
  • reviewed: true
  • license: RMDW-owned
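
The per-row fields above lend themselves to a mechanical check. The following is a minimal sketch, assuming each row is a JSON object on its own line of a JSONL file, that `reviewed` is stored as a boolean, and that `license` is stored as the literal string `RMDW-owned`. The function names `provenance_problems` and `check_jsonl` are hypothetical and are not taken from the project's own validator.

```python
import json

# Field names come from the list above; the value types are assumptions.
REQUIRED_FIELDS = (
    "source_repos",
    "source_paths",
    "provenance_notes",
    "reviewed",
    "license",
)


def provenance_problems(row: dict) -> list[str]:
    """Return human-readable problems found in one dataset row."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    # Only an explicit boolean true counts as reviewed.
    if row.get("reviewed") is not True:
        problems.append("reviewed is not true")
    if row.get("license") != "RMDW-owned":
        problems.append("license is not RMDW-owned")
    return problems


def check_jsonl(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems, skipping blank lines."""
    report: dict[int, list[str]] = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        problems = provenance_problems(json.loads(line))
        if problems:
            report[number] = problems
    return report
```

An empty report from `check_jsonl` would mean every row carries all five fields with the expected `reviewed` and `license` values. It says nothing about whether the provenance notes themselves are accurate, which still needs a human reviewer.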

For the v1.7 LoRA run, scripts/build_v17_business_owner_sft_dataset.py oversamples the 8 reviewed business-owner rows 24 times. Repeated rows receive unique IDs ending in __v17_business_repeat_NN and preserve the original source repository, source path, and provenance metadata.
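
The oversampling step can be illustrated with a short sketch. This is not the contents of scripts/build_v17_business_owner_sft_dataset.py. It assumes each row has an `id` field, that every emitted copy (including the first) receives the suffix, and that `NN` is a 1-based, zero-padded two-digit counter. All three are assumptions made for illustration.

```python
import copy

REPEATS = 24  # oversampling factor stated in the text above


def oversample(rows: list[dict], repeats: int = REPEATS) -> list[dict]:
    """Emit `repeats` copies of each row, each with a unique suffixed ID.

    Every other field (source_repos, source_paths, provenance_notes, ...)
    is copied unchanged so provenance survives the repetition.
    """
    out = []
    for row in rows:
        for n in range(1, repeats + 1):
            clone = copy.deepcopy(row)
            # Assumed ID scheme: <original id>__v17_business_repeat_NN
            clone["id"] = f"{row['id']}__v17_business_repeat_{n:02d}"
            out.append(clone)
    return out
```

Under these assumptions, 8 input rows would produce 192 output rows with no duplicate IDs. If the real script instead keeps the original row and adds 24 repeats, the count would differ.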

Client-site repositories are used only as eval and generalized pattern sources unless a row is explicitly reviewed for training eligibility. Do not bulk-train on client-specific text, contact details, contracts, or private business data.

The local wiki path /Users/richardecholsai7/Documents/RMDW-Wiki is present but is not a git checkout. It is recorded as RMDW-Wiki-local, selective-reference-only, with credentials.md, customers.md, customers/, and raw/ excluded. The GitHub RichardEchols/rmdw-agent-wiki repo remains the authoritative wiki source for training/eval provenance unless a reviewer documents a local exception.
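
The exclusion list for the local wiki snapshot can be expressed as a small path filter. This sketch assumes paths are given relative to the wiki root with forward slashes, and it conservatively treats the excluded names as excluded at any depth. Whether the real inventory matches only at the root is not stated in this draft, and the function name `is_excluded` is hypothetical.

```python
from pathlib import PurePosixPath

# Names taken from the exclusion list above.
EXCLUDED_FILES = {"credentials.md", "customers.md"}
EXCLUDED_DIRS = {"customers", "raw"}


def is_excluded(relative_path: str) -> bool:
    """Return True if a wiki-relative file path falls under an exclusion."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return False
    # Any ancestor directory named customers/ or raw/ excludes the file.
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # The file name itself may be on the excluded list.
    return parts[-1] in EXCLUDED_FILES
```

For example, `is_excluded("customers/acme.md")` and `is_excluded("credentials.md")` both return `True`, while `is_excluded("strategy/pricing.md")` returns `False`.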

Category Mix

The v0.1 category gate passed. The gate requires these per-category minimums:

  • Website/UI: at least 75 examples
  • Coding: at least 75 examples
  • Debugging: at least 50 examples
  • Automation: at least 50 examples
  • Tool-use: at least 50 examples
  • Strategy: at least 25 examples
  • Business: at least 15 examples
  • Identity: at least 10 examples
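
A category gate of this kind reduces to comparing per-category counts against the minimums listed above. The sketch below assumes each example carries a single category label spelled exactly as in the list. The label strings and the function name `category_shortfalls` are assumptions, not the project's schema.

```python
from collections import Counter

# Minimums copied from the list above; the label spellings are assumed.
MINIMUMS = {
    "Website/UI": 75,
    "Coding": 75,
    "Debugging": 50,
    "Automation": 50,
    "Tool-use": 50,
    "Strategy": 25,
    "Business": 15,
    "Identity": 10,
}


def category_shortfalls(categories: list[str]) -> dict[str, int]:
    """Return how many examples each category is short of its minimum.

    An empty dict means the gate passes.
    """
    counts = Counter(categories)
    return {
        name: minimum - counts[name]
        for name, minimum in MINIMUMS.items()
        if counts[name] < minimum
    }
```

The minimums sum to 350, so the 575 reviewed examples in the v0.1 snapshot leave room for categories to exceed their floors.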

Release Review Checklist

Before public release:

  • Re-run dataset validation.
  • Re-run source inventory against the current GitHub source-of-truth SHAs.
  • Spot-check examples for secrets and private data.
  • Confirm client-site rows are generalized pattern examples or eval-only.
  • Confirm closed-model outputs are not used as supervised completions.
  • Record exact base model revision.
  • Attach upstream license and notices.
  • Attach eval summary.
  • Document known limitations and unsafe use boundaries.
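
The secrets spot-check in the checklist above can be supported by a simple pattern scan before human review. This is a heuristic sketch only: the patterns cover a few widely known formats (PEM private key headers, AWS access key IDs, classic GitHub personal access tokens, and generic `key = value` assignments), and the names `PATTERNS` and `find_secret_hints` are hypothetical. It is not a substitute for a dedicated secret scanner or for manual review.

```python
import re

# Heuristic patterns; a clean result does not prove a row is free of secrets.
PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "assigned_secret": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-/+]{16,}"
    ),
}


def find_secret_hints(text: str) -> list[str]:
    """Return the sorted names of all patterns that match the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))
```

Rows that return a non-empty list would be candidates for removal or closer review. Rows that return an empty list still need the manual spot-check, since short or unusually formatted secrets will not match.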