# Dataset Plan

## Status

The project now has a deterministic SFT preview generator for local planning and schema validation.

Current preview artifact:

```bash
.venv/bin/python -B scripts/generate_dataset.py
```

Default output:

```text
data/train/objectverse_sft_preview.jsonl
```

This preview is mock-generated. It is not a final training dataset and should not be described as real model output. The preview JSONL file is evidence for schema and workflow readiness only.

Curated v1 training-test artifact:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --count 50 \
  --output data/train/objectverse_sft_curated.jsonl
```

This file is synthetic curated data: hand-shaped, deterministic, privacy-safe, and useful for testing the LoRA pipeline. It is not based on private user photos or commercial AI output.

Published synthetic curated dataset:

```text
https://huggingface.co/datasets/qqyule/objectverse-diary-sft-curated
```

Current curated v2 artifact:

```bash
.venv/bin/python -B scripts/prepare_curated_dataset.py \
  --version v2 \
  --count 200 \
  --output data/train/objectverse_sft_curated_v2.jsonl
```

The published dataset repo now includes `objectverse_sft_curated_v2.jsonl`: 200 synthetic curated rows covering 40 everyday objects and 5 personality modes, with exactly 40 rows per mode and no repeated object-mode pair. The v1 file remains preserved through repository history.

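
The balance claims for the v2 file (200 rows, 40 objects, 5 modes, 40 rows per mode, no repeated object-mode pair) can be checked mechanically. The following is a minimal sketch, assuming each row carries the `mode` and `object_description` fields from the JSONL schema in this document and that `object_description` identifies an object uniquely. The helper names `load_jsonl` and `summarize_balance` are hypothetical and are not part of the repository scripts.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    text = Path(path).read_text(encoding="utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize_balance(rows, object_key="object_description"):
    """Summarize row, object, and mode counts plus repeated object-mode pairs.

    Assumption: `object_key` names a field whose value is identical for all
    rows that describe the same object.
    """
    rows_per_mode = Counter(row["mode"] for row in rows)
    pair_counts = Counter((row[object_key], row["mode"]) for row in rows)
    repeated_pairs = sorted(pair for pair, n in pair_counts.items() if n > 1)
    return {
        "rows": len(rows),
        "objects": len({row[object_key] for row in rows}),
        "rows_per_mode": dict(rows_per_mode),
        "repeated_pairs": repeated_pairs,
    }
```

For the v2 file, the claims above correspond to `rows == 200`, `objects == 40`, every value in `rows_per_mode` equal to 40, and an empty `repeated_pairs` list.
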
## Target Dataset

Target before stronger fine-tuning:

- 200-500 generated or curated object-persona-diary samples
- at least 50 manually curated high-quality samples
- no private user photos
- no emails, tokens, serial numbers, or other sensitive identifiers
- English-first output with optional Chinese helper text

## JSONL Schema

Each line is one training candidate:

```json
{
  "id": "sft-preview-0001",
  "source": "objectverse-diary-mock-mvp",
  "split": "preview",
  "mode": "Cynical",
  "object_description": "old white coffee mug on a developer desk",
  "object_understanding": {},
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "{\"persona\":{},\"diary\":{}}"}
  ]
}
```

The `assistant.content` field is JSON text so it can be used with chat-style SFT tools while preserving structured outputs.

## Generation Workflow

Preview:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 60
```

Full candidate pool later:

```bash
.venv/bin/python -B scripts/generate_dataset.py --count 300 --output data/train/objectverse_sft_candidates.jsonl
```

Manual curation should happen after generation. For a stronger LoRA run, curate 150-300 rows from a broader object/mode/scene pool and leave 10-15% for evaluation. Do not publish the full candidate file until it has been reviewed.

Space VLM validation traces under `data/traces/space-vlm/` are failure evidence because they include `vision-fallback-to-mock`. Do not mix them into curated training data or describe them as successful real VLM outputs.

## Modal LoRA Training Scaffold

The repository includes a Modal training scaffold for the future Well-Tuned path. It is not run by default and does not affect the Gradio Space runtime.

Install the local Modal CLI dependency separately:

```bash
pip install -r requirements-training.txt
```

Validate the local JSONL shape without Modal auth or GPU usage:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated.jsonl \
  --run-name objectverse-diary-qwen15b-curated-test
```

The first badge-evidence run used 20 steps on 50 synthetic curated rows. For a higher-quality v2 run, validate the larger curated file first:

```bash
.venv/bin/python -B scripts/finetune_lora.py \
  --dry-run \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Executed v2 training command:

```bash
modal run --timestamps -n objectverse-diary-qwen15b-lora-v2 scripts/finetune_lora.py \
  --dataset data/train/objectverse_sft_curated_v2.jsonl \
  --run-name objectverse-diary-qwen15b-lora-v2 \
  --max-steps 120 \
  --learning-rate 1e-4 \
  --max-seq-length 1536 \
  --lora-r 32 \
  --lora-alpha 64 \
  --per-device-train-batch-size 2 \
  --gradient-accumulation-steps 4 \
  --eval-ratio 0.1 \
  --eval-steps 20
```

Current Modal status: the v2 job completed successfully and produced the published LoRA adapter at `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`.

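
The flags in the training command fix the batch arithmetic: a per-device batch of 2 with 4 gradient accumulation steps gives an effective batch size of 8, and an eval ratio of 0.1 on 200 rows leaves 180 for training. A small sketch of that arithmetic follows, assuming a single GPU and that the eval split is rounded to the nearest row. The `epochs_after` helper is only one possible accounting (a partial accumulation window at the end of an epoch counts as one optimizer step); it happens to reproduce the epoch figures reported for the runs in this section (5.2222 with 180 train rows, 20.0 with 45), but it is an assumption and not a description of the training script's actual bookkeeping. All function names are hypothetical.

```python
import math


def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Rows consumed per optimizer step (assumes one device by default)."""
    return per_device * accumulation_steps * num_devices


def split_sizes(total_rows, eval_ratio):
    """Return (train_rows, eval_rows); rounding to nearest is an assumption."""
    eval_rows = round(total_rows * eval_ratio)
    return total_rows - eval_rows, eval_rows


def epochs_after(max_steps, train_rows, per_device, accumulation_steps):
    """Estimate epochs completed after `max_steps` optimizer steps.

    Assumption: the last, possibly short, accumulation window of each epoch
    still counts as one optimizer step.
    """
    micro_batches = math.ceil(train_rows / per_device)
    steps_per_epoch = math.ceil(micro_batches / accumulation_steps)
    full_epochs, extra_steps = divmod(max_steps, steps_per_epoch)
    return full_epochs + extra_steps * accumulation_steps / micro_batches
```

Running the same arithmetic before launching a job is a cheap way to see how many passes over a small dataset a given `--max-steps` value implies.
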
Current v2 run summary:

- run name: `objectverse-diary-qwen15b-lora-v2`
- dataset: `data/train/objectverse_sft_curated_v2.jsonl`
- dataset repo path: `objectverse_sft_curated_v2.jsonl`
- records: 200 total, 180 train, 20 eval
- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- max steps: 120
- learning rate: `1e-4`
- max sequence length: 1536
- LoRA rank / alpha / dropout: 32 / 64 / 0.05
- effective batch size: 8
- assistant-output-only loss: enabled
- train loss: 0.3240
- eval loss: 0.0162
- train runtime: 140.3364s
- epoch: 5.2222
- local adapter export: ignored `exports/objectverse-diary-qwen15b-lora-v2-adapter-dir/`
- model repo: `https://huggingface.co/qqyule/objectverse-diary-qwen15b-lora`

Additional v2 scaffold validation run: `objectverse-diary-qwen15b-lora-v2-curated50-retry1` completed on Modal with the existing 50-row curated dataset, using assistant-output-only loss, 45 train rows, 5 eval rows, `max_steps=120`, `learning_rate=1e-4`, `max_seq_length=1536`, LoRA `r=32`, `alpha=64`, and effective batch size 8. Metrics: `train_loss=0.2551`, `eval_loss=0.0093`, `train_runtime=146.5398s`, `epoch=20.0`. The adapter was downloaded to ignored local `exports/`; it has not been published to Hugging Face Hub.

Default training scaffold settings:

- base model: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA adapter target: persona and diary JSON output
- default loss: assistant-output-only labels, with prompt tokens masked
- default eval split: 10% when the dataset has at least two rows
- GPU: Modal `A10G`
- output: Modal Volume artifacts, not committed files

The current `objectverse_sft_preview.jsonl` file is mock-generated and should only be used to validate the training pipeline. It is not final Well-Tuned evidence.

Do not store Modal credit codes, tokens, Hugging Face tokens, or private datasets in the repo.

The published `objectverse_sft_curated_v2.jsonl` dataset is synthetic curated training data.
It is suitable for hackathon training evidence, but it should still be described honestly as deterministic synthetic curation rather than real user trace data.

## Curation Checklist

- Persona stays consistent with the object.
- Diary is short, vivid, and English-first.
- Chinese helper text is secondary.
- Output has a strange object archive feeling.
- No real person, email, token, address, credit code, or serial number remains.
- No commercial cloud AI model was used to create the sample.
- JSON parses cleanly.

## Publishing Notes

When publishing to Hugging Face Datasets:

- create a dataset card
- document that mock preview rows are synthetic
- separate curated rows from raw candidates
- include license and privacy notes
- keep private images out of the repo

Curated v2 was published with:

```bash
.venv/bin/python -B scripts/publish_hf_dataset.py \
  --dataset-file data/train/objectverse_sft_curated_v2.jsonl \
  --repo-id qqyule/objectverse-diary-sft-curated \
  --path-in-repo objectverse_sft_curated_v2.jsonl \
  --commit-message "Upload Objectverse Diary curated v2 dataset"
```
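
Before a publish step like the one above, the mechanical parts of the curation checklist ("JSON parses cleanly", no email addresses) can be screened per row. The sketch below is a minimal illustration, assuming every row follows the example in the JSONL Schema section: the same top-level keys, a `system`/`user`/`assistant` message order, and assistant content that decodes to an object with `persona` and `diary` keys. The `check_row` helper and the email pattern are hypothetical; they are not part of the repository scripts and do not replace manual review for tokens, addresses, or serial numbers.

```python
import json
import re

# Assumption: key set and role order are taken from the schema example.
REQUIRED_KEYS = {
    "id", "source", "split", "mode",
    "object_description", "object_understanding", "messages",
}
EXPECTED_ROLES = ["system", "user", "assistant"]
# Deliberately loose pattern; it flags candidates for a human to inspect.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def check_row(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"row is not valid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))

    messages = row.get("messages")
    if not isinstance(messages, list):
        messages = []
    roles = [m.get("role") if isinstance(m, dict) else None for m in messages]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    else:
        # The assistant turn stores structured output as JSON text.
        try:
            payload = json.loads(messages[-1].get("content"))
        except (TypeError, json.JSONDecodeError):
            problems.append("assistant content is not JSON text")
        else:
            if not isinstance(payload, dict) or not {"persona", "diary"} <= payload.keys():
                problems.append("assistant JSON lacks persona or diary")

    if EMAIL_RE.search(line):
        problems.append("possible email address")
    return problems
```

Applying `check_row` to each line of a dataset file and refusing to publish when any list is non-empty keeps the mechanical checks separate from the subjective ones, such as persona consistency and tone.
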