# Dataset Plan

## Status

The project now has a deterministic SFT preview generator for local planning and schema validation.

Current preview artifact:

```bash
.venv/bin/python -B scripts/generate_dataset.py
```

Default output:

```text
data/train/objectverse_sft_preview.jsonl
```

This preview is mock-generated. It is not a final training dataset and should not be described as real model output.

The preview JSONL file is evidence for schema and workflow readiness only.

Curated v1 training-test artifact:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --count 50 \
  --output data/train/objectverse_sft_curated.jsonl
```

This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

Published synthetic curated dataset:

```text
https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
```

Current curated v2 artifact:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl
```

The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file remains available through repository history.
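The balance claims above (40 rows per mode, no repeated object-mode pair) can be checked mechanically. The following is a minimal sketch, not part of the repository scripts; `check_balance` is a hypothetical helper, and it assumes that `object_description` is the field that identifies an object.

```python
import json
from collections import Counter


def check_balance(lines, expected_per_mode=40):
    """Return (unbalanced_modes, duplicate_pairs) for an iterable of JSONL lines."""
    mode_counts = Counter()
    pair_counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        mode_counts[row["mode"]] += 1
        # Assumption: object_description uniquely identifies the object.
        pair_counts[(row["object_description"], row["mode"])] += 1
    # Modes whose row count differs from the expected value.
    unbalanced = {m: n for m, n in mode_counts.items() if n != expected_per_mode}
    # Object-mode pairs that appear more than once.
    duplicates = [pair for pair, n in pair_counts.items() if n > 1]
    return unbalanced, duplicates
```

Passing an open file handle for `data/train/objectverse_sft_curated_v2.jsonl` as `lines` should return an empty dict and an empty list if the file matches the description above.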

## Target Dataset

Target before stronger fine-tuning:

- 200-500 generated or curated object-persona-diary samples
- at least 50 manually curated high-quality samples
- no private user photos
- no emails, tokens, serial numbers, or other sensitive identifiers
- English-first output with optional Chinese helper text

## JSONL Schema

Each line is one training candidate:

```json
{
  "id": "sft-preview-0001",
  "source": "objectverse-diary-mock-mvp",
  "split": "preview",
  "mode": "Cynical",
  "object_description": "old white coffee mug on a developer desk",
  "object_understanding": {},
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
  ]
}
```

The `assistant.content` field is JSON text so it can be used with chat-style SFT tools while preserving structured outputs.
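A row-level check for this schema might look like the sketch below. It is illustrative only: `validate_row` is a hypothetical helper, and the rule that messages appear in exactly the order system, user, assistant is an assumption drawn from the example row, not a documented constraint.

```python
import json

REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}
# Assumption: every row carries exactly these three roles in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_row(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"row is not valid JSON: {exc}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    messages = row.get("messages")
    if not isinstance(messages, list):
        problems.append("messages is not a list")
        return problems
    roles = [m.get("role") for m in messages if isinstance(m, dict)]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
        return problems

    # The assistant content is JSON text, so it needs a second parse.
    try:
        payload = json.loads(messages[-1]["content"])
    except (KeyError, TypeError, json.JSONDecodeError):
        problems.append("assistant content is not JSON text")
        return problems
    if not isinstance(payload, dict):
        problems.append("assistant content is not a JSON object")
        return problems
    for key in ("persona", "diary"):
        if key not in payload:
            problems.append(f"assistant content lacks '{key}'")
    return problems
```

Parsing the assistant content separately is the point of the design: a row can be valid JSONL while the embedded structured output is broken.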

## Generation Workflow

Preview:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 60
```

Full candidate pool later:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
```

Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.
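Holding out a fixed share of curated rows can be done deterministically so that reruns produce the same split. This is a sketch under assumptions: `split_rows` and its `seed` parameter are hypothetical, and the actual split logic in `scripts/finetune_lora.py` may differ.

```python
import random


def split_rows(rows, eval_ratio=0.1, seed=0):
    """Return (train_rows, eval_rows) using a seeded, reproducible shuffle."""
    rows = list(rows)
    # With fewer than two rows there is nothing to hold out.
    if len(rows) < 2 or eval_ratio <= 0:
        return rows, []
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    # Keep at least one eval row and at least one train row.
    n_eval = max(1, round(len(shuffled) * eval_ratio))
    n_eval = min(n_eval, len(shuffled) - 1)
    return shuffled[n_eval:], shuffled[:n_eval]
```

With 200 rows and `eval_ratio=0.1` this yields 180 train rows and 20 eval rows; raising the ratio to 0.15 covers the upper end of the suggested range.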

Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.
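A simple guard can flag such traces before any file is considered for curation. The sketch below assumes the marker appears as plain text somewhere in each trace file; the trace file format is not documented here, and both helper names are hypothetical.

```python
from pathlib import Path

FALLBACK_MARKER = "vision-fallback-to-mock"


def is_fallback_trace(text):
    """Return True if the trace text contains the fallback marker."""
    return FALLBACK_MARKER in text


def find_fallback_traces(trace_dir):
    """Return sorted paths of files under trace_dir that contain the marker."""
    flagged = []
    for path in sorted(Path(trace_dir).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: traces are text files; undecodable bytes are ignored.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if is_fallback_trace(text):
            flagged.append(path)
    return flagged
```

Calling `find_fallback_traces("data/traces/space-vlm")` would list the files to exclude from curated training data.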

## Modal LoRA Training Scaffold

The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

Install the local Modal CLI dependency separately:

```bash
.venv/bin/python -m pip install -r requirements-training.txt
```

Validate the local JSONL shape without Modal auth or GPU usage:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test
```

The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Executed v2 training command:

```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.

Current v2 run summary:

- run name: `objectverse-diary-qwen15b-lora-v2`
- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
- dataset repo path: `objectverse_sft_curated_v2.jsonl`
- records: 200 total, 180 train, 20 eval
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.3240
- eval loss: 0.0162
- train runtime: 140.3364s
- epoch: 5.2222
- local adapter export: `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/` (ignored, not committed)
- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`
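The effective batch size and epoch figures in this summary follow from the batch settings. The sketch below shows one accounting that reproduces the reported values. It assumes a single GPU and that a trailing partial accumulation window at the end of an epoch still counts as one optimizer step; that is an assumption about the trainer's behavior, not something taken from the scaffold.

```python
import math


def effective_batch_size(per_device_batch, grad_accum, num_devices=1):
    """Rows consumed per optimizer step."""
    return per_device_batch * grad_accum * num_devices


def estimated_epochs(train_rows, per_device_batch, grad_accum, max_steps):
    """Estimate the epoch counter after max_steps optimizer steps."""
    batches_per_epoch = math.ceil(train_rows / per_device_batch)
    # Assumption: a partial accumulation window still triggers a step.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    full_epochs, leftover_steps = divmod(max_steps, steps_per_epoch)
    # Leftover steps are converted back to a fraction of an epoch.
    leftover_batches = leftover_steps * grad_accum
    return full_epochs + leftover_batches / batches_per_epoch


print(effective_batch_size(2, 4))                    # 8
print(round(estimated_epochs(180, 2, 4, 120), 4))    # 5.2222
```

The same arithmetic gives 20.0 for a 45-row training split with identical settings.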

Additional v2 scaffold validation run: `objectverse-diary-qwen15b-lora-v2-curated50-retry1` completed on Modal with the existing 50-row curated dataset.

- records: 50 total, 45 train, 5 eval
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha: 32 / 64
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.2551
- eval loss: 0.0093
- train runtime: 146.5398s
- epoch: 20.0
- adapter: downloaded to ignored local `exports/`; not published to Hugging Face Hub

Default training scaffold settings:

- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA adapter target: persona and diary JSON output
- default loss: assistant-output-only labels, with prompt tokens masked
- default eval split: 10% when the dataset has at least two rows
- GPU: Modal `A10G`
- output: Modal Volume artifacts, not committed files
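Assistant-output-only loss is commonly implemented by copying the token IDs into the labels and overwriting the prompt positions with an ignore value. The sketch below illustrates that idea with plain lists. `mask_prompt_labels` is a hypothetical helper, and the scaffold's actual masking code may differ; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_prompt_labels(input_ids, prompt_length):
    """Return labels where the first prompt_length tokens are masked out."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must be within the sequence")
    # Prompt tokens (system + user) are masked; assistant tokens are kept.
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Hypothetical token IDs: three prompt tokens followed by two assistant tokens.
print(mask_prompt_labels([11, 12, 13, 14, 15], prompt_length=3))
# [-100, -100, -100, 14, 15]
```

Only the assistant's persona and diary JSON then drives the gradient, which matches the adapter target listed above.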

The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence. Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.

The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data. It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.

## Curation Checklist

- Persona stays consistent with the object.
- Diary is short, vivid, and English-first.
- Chinese helper text is secondary.
- Output reads like an entry from a strange object archive.
- No real person, email, token, address, credit code, or serial number remains.
- No commercial cloud AI model was used to create the sample.
- JSON parses cleanly.
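Some of these checks can be pre-screened automatically before manual review. The sketch below covers only the mechanical items (email-like strings, token-like strings, JSON parsing). The patterns are assumptions and deliberately rough: the token pattern only targets strings with the `hf_` prefix used by Hugging Face access tokens, and `scan_row` is a hypothetical helper that does not replace a human read.

```python
import json
import re

# Rough patterns; both will miss cases and may raise false positives.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: only Hugging Face style tokens ("hf_" prefix) are screened.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")


def scan_row(line):
    """Return a list of mechanical findings for one JSONL line."""
    findings = []
    if EMAIL_RE.search(line):
        findings.append("email-like string")
    if TOKEN_RE.search(line):
        findings.append("token-like string")
    try:
        json.loads(line)
    except json.JSONDecodeError:
        findings.append("row does not parse as JSON")
    return findings
```

Rows with any finding should be fixed or dropped; rows with none still need the persona, tone, and provenance checks from the list above.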

## Publishing Notes

When publishing to Hugging Face Datasets:

- create a dataset card
- document that mock preview rows are synthetic
- separate curated rows from raw candidates
- include license and privacy notes
- keep private images out of the repo

Curated v2 was published with:

```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl \
  --commit-message "Upload Objectverse Diary curated v2 dataset"
```